BMC Bioinformatics. 2021 Oct 30;22:533. doi: 10.1186/s12859-021-04448-2

Filling gaps of genome scaffolds via probabilistic searching optical maps against assembly graph

Bin Huang 1,2, Guozheng Wei 1,2, Bing Wang 1,2, Fusong Ju 1,2, Yi Zhong 3, Zhuozheng Shi 4, Shiwei Sun 1,2,5, Dongbo Bu 1,2,5
PMCID: PMC8557617  PMID: 34717539

Abstract

Background

Optical maps record locations of specific enzyme recognition sites within long genome fragments. This long-distance information enables aligning genome assembly contigs onto optical maps and ordering contigs into scaffolds. The generated scaffolds, however, often contain a large number of gaps. A feasible way to fill these gaps is to search the genome assembly graph for the best-matching contig paths that connect the boundary contigs of gaps. The searching and evaluation procedures can be combined as “searching followed by evaluation”, which is infeasible for long gaps, or as “searching by evaluation”, which relies heavily on heuristics and thus usually yields unreliable contig paths.

Results

We here report an accurate and efficient approach to filling gaps of genome scaffolds with the aid of optical maps. Using simulated data from 12 species and real data from 3 species, we demonstrate the successful application of our approach in gap filling with improved accuracy and completeness of genome scaffolds.

Conclusion

Our approach applies a sequential Bayesian updating technique to measure the similarity between optical maps and candidate contig paths. Using this similarity to guide path searching, our approach achieves higher accuracy than the existing “searching by evaluation” strategy that relies on heuristics. Furthermore, unlike the “searching followed by evaluation” strategy enumerating all possible paths, our approach prunes the unlikely sub-paths and extends the highly-probable ones only, thus significantly increasing searching efficiency.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-021-04448-2.

Keywords: Genome assembly, Gap filling, Scaffolding, Optical maps, Probabilistic search

Background

Genome assembly aims to reconstruct genomes from sequencing reads, and thus plays important roles in various downstream studies, including identification of genes and genome structure variations. Most of the existing assembly methods first organize sequencing reads into a graph, such as a de Bruijn graph or an overlap graph, and then attempt to find a path in the graph to restore the original genome sequence [1]. However, genome repeats longer than sequencing reads create ambiguities in path finding, so that assembly approaches yield only separate paths (called contigs) rather than complete genomes [2]. The longer reads of third generation sequencing [3, 4], and long-distance linking information from pair-end, mate-pair, or mapping technologies, help genome assembly methods resolve the ambiguities incurred by repeats [5]. The study [6] provides a detailed review of the methodological progress and perspectives in the integration of short-range and long-range information for improving assembly contiguity.

Among the technologies that provide long-distance information across repeats, optical mapping has a unique advantage in measuring long genome fragments. For example, the BioNano Saphyr platform can measure genome fragments of up to 2 megabases [7]. Unlike genome sequencing technologies, optical maps record locations of specific enzyme recognition sites, such as GCTCTTC and GAAGAGC for enzyme BspQI, along a genome. By identifying these sites in contigs, we can easily align contigs onto optical maps, and then order them into scaffolds [8]. However, short contigs that contain too few enzyme recognition sites usually cannot be reliably aligned onto optical maps, thus creating a variety of gaps in scaffolds and leaving them far from complete genomes. Filling these gaps with nucleotide sequence will considerably improve the completeness of genome assembly.

A great variety of approaches have been proposed for filling gaps directly using sequencing reads, including SOAPdenovo [9], GapFiller [10], GMCloser [11], PBJelly [12] and LR_Gapcloser [13]. These approaches, however, are infeasible for filling gaps of the scaffolds obtained via optical maps since these gaps are often much longer than sequencing reads. To fill these large gaps, Nagarajan et al. proposed to use contig paths in assembly graph instead of the short sequencing reads [14]. Here, assembly graphs refer to the product of assembling sequencing reads using graph theory, which contains contigs as nodes and connections among them as edges.

Recent progress in improving assembly contiguity also includes the Bionano Solve pipeline, BiSCoT [15], and Novo&Stitch [16]. Briefly, the Bionano Solve pipeline uses a module called “Hybrid Scaffold”, which fills the identified gaps with N-bases rather than with genome sequence. BiSCoT aims to resolve the gaps of N's inserted between contigs by Bionano scaffolding through merging two contigs that share a genomic region. Novo&Stitch proposes a novel method that uses optical maps for accurate assembly reconciliation.

To fill gaps, we can choose a contig path that connects the two boundary contigs of a gap, and then use the corresponding nucleotide sequence. Thus, successful gap filling relies on two steps: (1) searching contig paths in the assembly graph, and (2) evaluating the consistency between contig paths and the optical map of the gaps of interest [17–19]. The two steps, i.e., searching and evaluating contig paths, can be combined in various ways. For example, OMACC [17] employs the “searching followed by evaluation” strategy. Specifically, for the two boundary contigs of a gap, OMACC first searches the assembly graph for all possible contig paths that connect them. Next, OMACC evaluates each possible contig path in terms of the difference between path length and gap size and selects the best path to fill the gap. By rescaling optical maps and estimating the number of repeat copies within gaps, OMACC achieved better accuracy than the previous studies [14, 17].

In contrast to OMACC, AGORA employs the “searching by evaluation” strategy [18]. That is, AGORA uses a modified depth-first search (DFS) to identify the most likely contig path. At each search step, AGORA selects an edge to extend the current sub-path according to several heuristics, such as the decreasing order of edges and the consistency between the edge’s in silico map and the experimental optical maps. AGORA uses the first contig path found to fill a gap. These heuristics can greatly improve searching efficiency; however, they might also lead to errors in genome reconstruction.

In summary, the “searching followed by evaluation” strategy has high accuracy but low efficiency, whereas the “searching by evaluation” strategy has high efficiency but low accuracy. Thus, the tradeoff between accuracy and efficiency remains a challenging task.

In this study, we propose an accurate and efficient approach to gap filling, called nanoGapFiller. Unlike the existing “searching by evaluation” methods, which rely heavily on heuristics, our approach uses a stochastic model to calculate the similarity between optical maps and contig paths. Using the calculated similarity to guide path-finding, our approach achieves higher accuracy than the existing approaches using heuristics. In addition, unlike the “searching followed by evaluation” methods, our approach maintains only a small set of highly probable sub-paths and prunes the unlikely ones, thus significantly improving efficiency.

We evaluated nanoGapFiller on simulated optical maps of 12 species and real optical maps of 3 species. On the simulated data sets, nanoGapFiller fills the gaps with high accuracy in minutes. Moreover, nanoGapFiller always fills more gaps than OMACC. On real data sets, OMACC cannot fill any gap, while nanoGapFiller successfully fills all of the identified gaps. We also showed that our pruning strategy could significantly reduce running time without sacrificing accuracy. Thus, nanoGapFiller should benefit various downstream genomic studies by improving the completeness of genome reconstruction with aid of optical maps.

Results

Experiment setting and evaluation criteria

We evaluated the accuracy and efficiency of nanoGapFiller on simulated optical maps of 12 species and real optical maps of 3 species. The real optical maps were acquired using the BioNano Iris platform: for E. coli, P. putida and S. coelicolor, the numbers of optical maps are 8644, 15000 and 17422, respectively, and the coverages are 336, 435 and 354, respectively. The simulated optical maps were generated using an in-house simulator that extracts enzyme recognition sites from reference genomes. We also applied another simulator, OMsim [20], which adopts a different error model from our in-house simulator.

The gaps of scaffolds were identified as follows: Using the reference genome of a species, we first generated simulated next-generation sequencing (NGS) reads using the read simulator ART [21], and then assembled these reads into an assembly graph using the genome assembler SPAdes [22]. Each simulated dataset has a read length of 150 and a coverage of 50. Next, we aligned contigs onto optical maps, and further ordered the contigs into scaffolds according to the alignments. Finally, the unaligned parts of scaffolds were identified as gaps. To make a thorough evaluation, we adopted two alignment methods: (1) SOMA2, used by OMACC [23], and (2) refAligner, used by the BioNano Solve package. Compared with SOMA2, refAligner generally reports fewer alignments with higher precision, and thus generates longer and more accurate gaps.

We assessed the quality of gap filling through calculating two levels of similarity between gap filling results and the corresponding regions in reference sequences:

  1. Contig path similarity (CPS): the number of contigs shared by the filled gaps and the real contig paths in reference genomes.

  2. Nucleotide sequence similarity (NSS): we further calculated the base-level similarity $\mathrm{NSS} = 2 L_c / (L_r + L_f)$, where $L_f$ and $L_r$ denote the lengths of the gap filling result and the corresponding reference sequence, respectively, and $L_c$ denotes the length of the longest common string between them.
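As a quick arithmetic check of the NSS formula, the following Python sketch evaluates it for three OMACC rows of Table 1. It assumes, as our own reading rather than something stated in the text, that the reference sequence is entirely contained in the sequence filled by OMACC, so that Lc equals Lr. The function name `nss` is illustrative only.

```python
def nss(lc, lr, lf):
    """Base-level similarity NSS = 2*Lc / (Lr + Lf), returned as a fraction."""
    return 2.0 * lc / (lr + lf)

# (Lr, Lf) taken from the OMACC columns of Table 1; Lc = Lr is an assumption.
rows = {
    "252424r-252466": (12, 3316),
    "252208-252300": (199, 367),
    "252216r-252238": (226, 1546),
}
percent = {gap: round(100 * nss(lr, lr, lf), 2) for gap, (lr, lf) in rows.items()}
print(percent)
```

Under this assumption the computed percentages agree with the NSS values listed for these gaps in Table 1 (0.72, 70.32 and 25.51).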

In this study, we compared nanoGapFiller with the state-of-the-art software OMACC. We did not compare with AGORA since it is no longer maintained.

Evaluating accuracy of gap filling

Table 1 shows the accuracy of gap filling results on the E. coli genome. As shown in this table, nanoGapFiller successfully filled all of these 23 gaps with NSS over 99%. In contrast, OMACC could only fill 11 out of the 23 gaps and failed to fill the long gaps with over 15 contigs. Even for these 11 gaps, OMACC’s quality is not always high. For example, for the gap 252216r-252238, its reference sequence consists of 7 contigs of 226 nt; however, OMACC filled this gap with 31 contigs of 1546 nt, which has considerably low similarity with the reference sequence (NSS: 25.51%). On the other 11 species, the gap filling results again suggest the superiority of nanoGapFiller in terms of accuracy and coverage (Additional file 1: Tables 1–11 and Additional file 1: Fig. 1). As shown in Additional file 1: Tables 14, 15, and 16, nanoGapFiller also shows better performance than Novo&Stitch.

Table 1.

Filling the gaps identified using simulated optical maps of E. coli genome

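| Gap | Reference #contigs | Reference #bases | OMACC #contigs | OMACC #bases | OMACC CPS | OMACC NSS (%) | nanoGapFiller #contigs | nanoGapFiller #bases | nanoGapFiller CPS | nanoGapFiller NSS (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 252228–252408r | 4 | 11 | 4 | 11 | 4 | 100 | 4 | 11 | 4 | 100 |
| 252424r–252466 | 4 | 12 | 46 | 3316 | 4 | 0.72 | 4 | 12 | 4 | 100 |
| 252292–252268r | 5 | 645 | 5 | 645 | 5 | 100 | 5 | 645 | 5 | 100 |
| 252410r–252292 | 5 | 10,858 | 5 | 10,858 | 5 | 100 | 5 | 10,858 | 5 | 100 |
| 252208–252300 | 5 | 199 | 7 | 367 | 5 | 70.32 | 5 | 199 | 5 | 100 |
| 252538r–252526 | 6 | 1256 | – | – | – | – | 6 | 1256 | 6 | 100 |
| 252216r–252238 | 7 | 226 | 31 | 1546 | 7 | 25.51 | 7 | 226 | 7 | 100 |
| 252238–252228 | 7 | 1259 | 7 | 1259 | 7 | 100 | 7 | 1259 | 7 | 100 |
| 251622r–252244 | 9 | 30,034 | 17 | 30,482 | 9 | 99.26 | 9 | 30,034 | 9 | 100 |
| 252244–252386r | 10 | 10,424 | 10 | 10,424 | 10 | 100 | 10 | 10,424 | 10 | 100 |
| 251936r–252510 | 14 | 18,913 | – | – | – | – | 14 | 18,913 | 12 | 99.86 |
| 252408r–252538r | 14 | 20,922 | 14 | 20,922 | 14 | 100 | 14 | 20,922 | 14 | 100 |
| 252268r–252316 | 15 | 3339 | 19 | 3701 | 15 | 94.86 | 15 | 3339 | 15 | 100 |
| 252300–252312 | 15 | 5127 | – | – | – | – | 15 | 5127 | 11 | 98.93 |
| 252132–252226 | 18 | 7171 | – | – | – | – | 18 | 7171 | 18 | 100 |
| 252180–252410r | 23 | 22,821 | – | – | – | – | 24 | 22,821 | 20 | 99.40 |
| 252316–252290 | 23 | 79,025 | – | – | – | – | 23 | 79,025 | 22 | 99.99 |
| 252312–252424r | 24 | 5431 | – | – | – | – | 25 | 5431 | 23 | 99.96 |
| 252466–252252 | 25 | 13,442 | – | – | – | – | 26 | 13,442 | 21 | 99.97 |
| 252226–252216r | 42 | 156,131 | – | – | – | – | 43 | 156,131 | 33 | 99.93 |
| 252486–252208 | 47 | 83,834 | – | – | – | – | 47 | 83,834 | 41 | 99.92 |
| 252148–252180 | 56 | 50,774 | – | – | – | – | 59 | 50,774 | 44 | 99.20 |
| 252252–252310r | 59 | 118,754 | – | – | – | – | 60 | 118,754 | 53 | 100 |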

Alignment method: SOMA2. Here, the symbol ‘–’ represents the failure of OMACC

As a concrete example, we showed in Fig. 1 the filling process of the gap 781738-781976r of S. coelicolor. There are two contig paths connecting the beginning site and ending site of the gap: one path contains the contig 777124 while the other path contains its reverse complement 777124r. OMACC explores the distance between the beginning site and ending site only, and thus cannot identify which path matches better with the corresponding optical map. In contrast, nanoGapFiller utilizes the enzyme recognition sites in the intermediate contigs 777124 and 781726r. Specifically, both 777124 and its reverse complement 777124r contain two sites; however, the locations of these sites differ greatly in the two contigs. nanoGapFiller exploited this difference and thus correctly identified the contig path that fills the gap.

Fig. 1.


A case study of gap filling using OMACC and nanoGapFiller. For the gap 781738-781976r of S. coelicolor, there are two contig paths connecting the beginning site and ending site: one path contains the contig 777124 while the other path contains its reverse complement 777124r. OMACC explores the distance between the beginning site and ending site only, and thus cannot identify which path matches better with the corresponding optical map. In contrast, nanoGapFiller utilizes the enzyme recognition sites in the intermediate contigs 777124 and 781726r and thus correctly identifies the contig path that fills the gap (shown in red)

We further investigated the accuracy of nanoGapFiller on real optical maps of three species (Table 2, Additional file 1: Tables 12, 13). As shown in Table 2, only 9 gaps were identified on E. coli when using SOMA2 as the alignment method, fewer than those identified on the simulated optical maps. OMACC successfully filled 5 out of the 9 gaps but failed on the other 4 gaps, which contain 15 or more contigs. In contrast, nanoGapFiller filled all of these 9 gaps with considerably high accuracy (NSS over 96%). Venn diagrams suggest the superiority of nanoGapFiller in terms of coverage on these 3 species (Fig. 2).

Table 2.

Filling the gaps identified using real optical maps of E. coli genome

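| Gap | Reference #contigs | Reference #bases | OMACC #contigs | OMACC #bases | OMACC CPS | OMACC NSS (%) | nanoGapFiller #contigs | nanoGapFiller #bases | nanoGapFiller CPS | nanoGapFiller NSS (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 252486r–252036r | 3 | 954 | 3 | 954 | 3 | 100 | 3 | 954 | 3 | 100 |
| 252408–252228r | 4 | 11 | 4 | 11 | 4 | 100 | 4 | 11 | 4 | 100 |
| 252526r–252538 | 6 | 1256 | 6 | 1256 | 6 | 100 | 6 | 1256 | 6 | 100 |
| 252032–252526r | 11 | 1127 | 11 | 1127 | 11 | 100 | 11 | 1127 | 11 | 100 |
| 252538–252408 | 14 | 20,922 | 14 | 20,922 | 14 | 100 | 14 | 20,922 | 14 | 100 |
| 252312r–252300r | 15 | 5127 | – | – | – | – | 15 | 5127 | 13 | 99.06 |
| 252290r–251900r | 21 | 18,454 | – | – | – | – | 31 | 19,755 | 20 | 96.59 |
| 252252r–252466r | 26 | 13,442 | – | – | – | – | 37 | 13,877 | 21 | 97.69 |
| 252036r–252032 | 26 | 36,757 | – | – | – | – | 38 | 37,430 | 19 | 99.08 |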

Alignment method: SOMA2. Here, the symbol ‘–’ represents the failure of OMACC

Fig. 2.


Venn diagrams of the gaps filled by nanoGapFiller and OMACC for (a) E. coli, (b) P. putida, and (c) S. coelicolor. Here, the gaps are identified using real optical maps with alignment method SOMA2

When using refAligner to align contigs onto optical maps, only 4, 2, and 1 gaps were identified for E. coli, S. coelicolor, and P. putida, respectively (Table 3). The longest gap spans about 875 knt. OMACC failed on all of these 7 gaps. In contrast, with only one exception (252312r-252486r), nanoGapFiller successfully filled all gaps with nucleotide sequences highly similar to the reference genome (NSS over 99%). We also evaluated nanoGapFiller using OMBlast [24] as the alignment tool. As shown in Additional file 1: Table 18, a total of 23 gaps were identified. Although these gaps differ from those identified when using SOMA2 as the alignment tool (Additional file 1: Table 17), nanoGapFiller still fills them with high accuracy (NSS over 97%).

Table 3.

Filling the gaps identified using real optical maps of E. coli, S. coelicolor, and P. putida genomes

| Species | Gap | Reference #contigs | Reference #bases | nanoGapFiller #contigs | nanoGapFiller CPS | nanoGapFiller NSS (%) |
|---|---|---|---|---|---|---|
| E. coli | 252196–252226 | 17 | 51,810 | 20 | 17 | 99.66 |
| E. coli | 252486r–252526r | 35 | 105,652 | 38 | 27 | 99.79 |
| E. coli | 252510r–252292r | 93 | 600,523 | 107 | 86 | 99.84 |
| E. coli | 252312r–252486r | 108 | 184,041 | 65 | 31 | 87.54 |
| S. coelicolor | 781738–781976r | 9 | 88,470 | 9 | 9 | 100 |
| S. coelicolor | 781976r–781848r | 18 | 66,018 | 18 | 18 | 100 |
| P. putida | 443944r–443818 | 95 | 875,288 | 96 | 83 | 99.88 |

Alignment method: refAligner

In addition to evaluating our approach on simulated optical maps generated by in-house simulator, we also repeated the evaluation process on the optical maps generated using OMsim that adopts a different error model. As shown in Additional file 1: Table 19, a total of 8 gaps were identified and for 7 out of the 8 gaps, nanoGapFiller achieves accurate gap filling with NSS exceeding 97%. These results clearly demonstrate that even using simulators with different error models, nanoGapFiller can still reliably accomplish gap filling.

Hi-C scaffolding [25, 26] is a promising approach that bridges and orders contigs by exploiting the contact frequencies between pairs of loci [27]. Here, we compare nanoGapFiller with 3D-DNA [28], a software tool for Hi-C scaffolding, using the Hi-C data downloaded from NCBI GEO (GSM2870416, GSM2870417) [29]. As shown in Additional file 1: Table 20, 3D-DNA achieves a largest contig, total length and N50 of 4375178 bp, 4637496 bp and 4375178 bp, respectively, which are higher than those of nanoGapFiller (894614 bp, 4597570 bp and 785645 bp, respectively). However, 3D-DNA simply fills the gaps with N-bases rather than genome sequence. Thus, we further calculate NA50, where the contigs are replaced with the blocks that can be aligned to the reference. nanoGapFiller achieves an NA50 of 785645 bp, which is much higher than that of 3D-DNA (438708 bp).

Evaluating efficiency of gap filling

In this section, we analyze the running time of nanoGapFiller. Theoretically, the probabilistic search procedure takes O(m|E|) time, where m denotes the number of sites in gaps, and |E| denotes the number of edges in the site graph. As shown in Table 4, for 11 out of 12 species, nanoGapFiller takes only minutes on an ordinary personal computer. For A. vari, the gaps contain 4,917,178 nt in total, and the site graph contains 135,667 edges, thus leading to an expensive time cost (71,067.07 s).

Table 4.

Running time (in seconds) of nanoGapFiller for filling the gaps of 12 species

| Species | #Contigs in assembly graph | #Sites in site graph | #Edges in site graph | #Filled gaps | Total length of gaps (nt) | Running time (s) |
|---|---|---|---|---|---|---|
| S. ynec | 250 | 774 | 1905 | 11 | 181,832 | 5.24 |
| S. coelicolor | 926 | 3,532 | 8160 | 17 | 478,321 | 25.95 |
| S. agal | 204 | 546 | 5554 | 16 | 463,622 | 8.04 |
| P. syringae | 1492 | 2176 | 51,343 | 25 | 732,210 | 371.05 |
| P. putida | 752 | 2190 | 15,759 | 9 | 2,856,292 | 188.39 |
| N. farcinica | 388 | 924 | 2192 | 19 | 639,552 | 10.15 |
| E. coli | 734 | 1348 | 40,858 | 23 | 922,021 | 88.08 |
| E. carotovora | 622 | 1454 | 2636 | 19 | 586,169 | 9.42 |
| C. hutchinsonii | 596 | 990 | 5105 | 25 | 895,179 | 12.24 |
| B. pseudomallei | 390 | 2144 | 2662 | 8 | 175,737 | 3.83 |
| B. japonicum | 992 | 3524 | 26,014 | 29 | 1,021,611 | 1721.27 |
| A. vari | 870 | 474 | 135,667 | 17 | 4,917,178 | 71,067.07 |

Here, the gaps are identified using simulated optical maps with alignment method SOMA2. CPU: AMD Opteron 6344; OS: Ubuntu 16.04; Python version: 3.6.7

One of the key points of our approach is pruning the unlikely sub-paths whose matching probability is below MinimalMatchingProbability. As shown in Table 5, on the simulated optical maps nanoGapFiller uses 2123 s when no pruning is applied; in contrast, it takes only 88 s when MinimalMatchingProbability is set to $10^{-5}$. On the other hand, the gap filling results rarely change across different settings of the pruning threshold MinimalMatchingProbability (Table 6). Together, these observations suggest that our approach balances accuracy and efficiency well in gap filling.
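The pruning rule can be illustrated with a minimal Python sketch. It assumes that the candidate matching sites of one optical-map site carry non-negative belief values that are first normalized into probabilities, as described in Methods; the function name `normalize_and_prune`, the dictionary layout and the site labels are hypothetical and not taken from the nanoGapFiller source code.

```python
def normalize_and_prune(beliefs, min_prob=1e-5):
    """Normalize beliefs into probabilities, then drop unlikely candidates.

    beliefs:  {candidate_site: non-negative belief}  (hypothetical layout)
    min_prob: plays the role of MinimalMatchingProbability (default 1e-5)
    """
    total = sum(beliefs.values())
    if total == 0:
        return {}
    probs = {site: b / total for site, b in beliefs.items()}
    # Keep a candidate only if its probability reaches the threshold.
    return {site: p for site, p in probs.items() if p >= min_prob}

kept = normalize_and_prune({"a": 0.9, "b": 0.1, "c": 1e-9})
print(sorted(kept))
```

Whether the surviving probabilities are renormalized after pruning is not specified here, so the sketch leaves them unchanged.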

Table 5.

Running time (in seconds) of nanoGapFiller at different settings of pruning threshold MinimalMatchingProbability

| Dataset | Alignment method | MinimalMatchingProbability: 0 (no pruning) | $10^{-8}$ | $10^{-5}$ (default) | $10^{-2}$ |
|---|---|---|---|---|---|
| Simulated optical maps | SOMA2 | 2123 | 90 | 88 | 53 |
| Real optical maps | SOMA2 | 46 | 13 | 11 | 13 |
| Real optical maps | refAligner | 5953 | 1620 | 1227 | 941 |

Here, the gaps are identified using both simulated and real optical maps of E. coli species. CPU: AMD Opteron 6344; OS: Ubuntu 16.04; Python version: 3.6.7

Table 6.

The quality of filled gaps reported by nanoGapFiller at different settings of pruning threshold MinimalMatchingProbability

| Species | Gap | MinimalMatchingProbability: 0 (%) | $10^{-8}$ (%) | $10^{-5}$ (default) (%) | $10^{-2}$ (%) |
|---|---|---|---|---|---|
| E. coli | 252312r–252486r | 87.22 | 87.22 | 87.22 | 87.22 |
| E. coli | 252486r–252526r | 99.68 | 99.68 | 99.68 | 99.68 |
| E. coli | 252510r–252292r | 98.87 | 98.87 | 98.87 | 96.55 |
| E. coli | 252196–252226 | 99.66 | 99.66 | 99.66 | 99.66 |
| S. coelicolor | 781738–781976r | 100.00 | 100.00 | 100.00 | 100.00 |
| S. coelicolor | 781976r–781848r | 100.00 | 100.00 | 100.00 | 100.00 |
| P. putida | 443944r–443818 | 99.88 | 99.88 | 99.88 | 99.88 |

Here the gaps are identified using real optical maps and alignment method refAligner. The quality is measured using base-level similarity (NSS) between the filled gaps and the corresponding reference genome sequence

Improvement of completeness of genome scaffolds

Finally, we examined the improvement in completeness of genome scaffolds after gap filling. As shown in Table 7, before filling gaps, the contigs are relatively short for A. vari (N50: 64,556 nt). After filling the gaps using OMACC, the scaffold N50 increased to 78,980 nt. In contrast, after filling gaps using nanoGapFiller, the scaffold N50 increased to 7,589,422 nt, which is remarkably longer than that obtained using OMACC. We observed similar results on the other 11 species and on the real datasets (Tables 8 and 9).
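For readers unfamiliar with the metric, the sketch below computes N50 under its usual definition: the largest length L such that sequences of length at least L together cover at least half of the total assembly length. The function name `n50` and the toy lengths are illustrative; the values reported in Tables 7 to 9 were not produced with this code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Accumulate lengths from longest to shortest until half is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

Merging contigs by filling the gaps between them replaces several short entries with one long entry, which is why successful gap filling raises the scaffold N50.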

Table 7.

Genome completeness improvement after filling gaps using OMACC and nanoGapFiller on 12 species

| Species | Scaffold N50 (nt) before gap filling | Scaffold N50 (nt) after filling using OMACC | Scaffold N50 (nt) after filling using nanoGapFiller |
|---|---|---|---|
| A. vari | 64,556 | 78,980 | 7,589,442 |
| B. japonicum | 143,477 | 290,961 | 1,830,875 |
| B. pseudomallei | 86,778 | 99,967 | 113,112 |
| C. hutchinsonii | 129,478 | 212,390 | 1,935,216 |
| E. carotovora | 71,290 | 100,730 | 680,365 |
| E. coli | 78,648 | 140,985 | 1,222,147 |
| N. farcinica | 176,628 | 846,096 | 5,627,295 |
| P. putida | 127,879 | 127,879 | 4,873,348 |
| P. syringae | 79,967 | 90,066 | 366,420 |
| S. agal | 71,533 | 1,399,536 | 2,406,989 |
| S. coelicolor | 108,454 | 120,270 | 213,619 |
| S. ynec | 175,767 | 300,280 | 1,774,968 |

Here, the gaps are identified using simulated optical maps and alignment method SOMA2

Table 8.

Genome completeness improvement after filling gaps using OMACC and nanoGapFiller on 3 species

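| Species | Scaffold N50 (nt) before gap filling | Scaffold N50 (nt) after filling using OMACC | Scaffold N50 (nt) after filling using nanoGapFiller |
|---|---|---|---|
| E. coli | 78,648 | 107,371 | 124,003 |
| P. putida | 127,879 | 154,105 | 154,105 |
| S. coelicolor | 108,454 | 108,454 | 108,454 |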

Here, the gaps are identified using real optical maps and alignment method SOMA2

Table 9.

Genome completeness improvement after filling gaps using OMACC and nanoGapFiller on 3 species

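| Species | Scaffold N50 (nt) before gap filling | Scaffold N50 (nt) after filling using OMACC | Scaffold N50 (nt) after filling using nanoGapFiller |
|---|---|---|---|
| E. coli | 78,648 | 78,648 | 133,054 |
| P. putida | 127,879 | 127,879 | 154,105 |
| S. coelicolor | 108,454 | 108,454 | 109,573 |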

Here, the gaps are identified using real optical maps and alignment method refAligner

To acquire more detailed evaluations, we have further applied Quast [30] to calculate multiple metrics of the assembly results (Additional file 1: Tables 14, 15, and 16).

Discussion

In this study, we present an efficient and effective approach for filling gaps of scaffolds with the aid of optical maps. Using probabilistic search, our approach balances the accuracy and efficiency of gap filling. The performance of our approach has been demonstrated by the results on a variety of species using both simulated and real optical maps.

For large genomes, the current version of nanoGapFiller has the limitation that it generates a large site graph, which imposes a high memory requirement. Reducing this memory requirement is left for future work.

Conclusion

In conclusion, nanoGapFiller can effectively improve the contiguity of genome assembly. We expect that our approach, with potential extensions, can greatly facilitate improving completeness of genome assembly.

Methods

Notations

In genome sequencing and assembly, a contig refers to a contiguous nucleotide sequence resulting from assembly of sequencing reads, whereas a scaffold refers to a series of contigs separated by gaps of estimated length.

Unlike genome sequence reads, an optical map records locations of specific enzyme recognition sites along a molecule of DNA. Specifically, for a molecule consisting of n recognition sites $s_1, s_2, \ldots, s_n$, optical maps count the number of nucleotide bases between $s_i$ and $s_{i+1}$ for $1 \le i \le n-1$, which is denoted as $d(s_i, s_{i+1})$. For example, the molecule GCTCTTCACGCTCTTCACTGCTCTTC has three appearances of the enzyme recognition site GCTCTTC, and the corresponding optical map records the distances between these sites, i.e., $d(s_1, s_2) = 9$ and $d(s_2, s_3) = 10$. In this study, we write a site sequence as $s_b \ldots s_e$, where the symbol ‘…’ represents the intermediate sites, and $s_b$ and $s_e$ denote the beginning and ending sites of the sequence, respectively.
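The worked example above can be reproduced with a few lines of Python. This is a simplified sketch: it scans one strand for a single motif and measures distances between the start positions of consecutive sites, whereas a real in silico digestion would also consider the reverse-complement motif (GAAGAGC for BspQI). The function names are illustrative.

```python
def site_positions(seq, motif):
    """Return the start positions of all (possibly overlapping) motif hits."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions

def site_distances(seq, motif="GCTCTTC"):
    """Distances between consecutive recognition sites along one strand."""
    pos = site_positions(seq, motif)
    return [b - a for a, b in zip(pos, pos[1:])]

molecule = "GCTCTTCACGCTCTTCACTGCTCTTC"
distances = site_distances(molecule)
print(distances)  # [9, 10]
```

The sites start at positions 0, 9 and 19, so the two recorded distances are 9 and 10, matching the values given in the text.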

Most genome assembly approaches utilize graph theory to guide assembly and finally generate an assembly graph, which contains contigs as nodes and connections among them as edges. To accelerate searching optical maps against the assembly graph, we transform the assembly graph into a site graph as follows: in the component contigs of the assembly graph, we first identify all appearances of the enzyme recognition sites. Next, we use these sites as nodes, and connect neighboring sites with edges. Here, we say two sites are neighbors if one site can be directly reached from the other by following a contig path in the assembly graph. Each edge in a site graph is associated with a distance that represents the number of nucleotide bases between the two corresponding sites (Fig. 3).
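The transformation from assembly graph to site graph can be sketched as follows. This is a simplified illustration rather than the nanoGapFiller implementation: it assumes each contig is given by its length and the sorted start offsets of its recognition sites, and that adjacent contigs are concatenated without overlap (real de Bruijn assembly graphs typically overlap neighboring contigs, which would shift the distances). All names and the input layout are hypothetical.

```python
def build_site_graph(contigs, successors):
    """Return site-graph edges as (from_site, to_site, distance) tuples.

    contigs:    {name: (length, [sorted site start offsets])}
    successors: {name: [names of contigs that may follow]}
    A site is identified by the pair (contig name, offset).
    """
    edges = []
    for name, (length, sites) in contigs.items():
        # Neighboring sites inside the same contig.
        for a, b in zip(sites, sites[1:]):
            edges.append(((name, a), (name, b), b - a))
        if not sites:
            continue
        # From the last site, walk forward until the next site on each path,
        # passing through contigs that carry no site at all.
        last = sites[-1]
        stack = [(nxt, length - last, {name}) for nxt in successors.get(name, [])]
        while stack:
            cur, dist, seen = stack.pop()
            cur_len, cur_sites = contigs[cur]
            if cur_sites:
                edges.append(((name, last), (cur, cur_sites[0]), dist + cur_sites[0]))
            elif cur not in seen:  # avoid looping through site-free cycles
                stack.extend((n, dist + cur_len, seen | {cur})
                             for n in successors.get(cur, []))
    return edges

toy_contigs = {"c1": (20, [3, 12]), "c2": (5, []), "c3": (30, [4])}
toy_successors = {"c1": ["c2"], "c2": ["c3"]}
toy_edges = build_site_graph(toy_contigs, toy_successors)
print(toy_edges)
```

In the toy input, the last site of c1 (offset 12) reaches the first site of c3 through the site-free contig c2, giving a distance of 8 + 5 + 4 = 17 bases.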

Fig. 3.


An example of the alignment between optical map and contig path. (a) An alignment corresponding to the generating of $x_1 \ldots x_5$ from $s_1 \ldots s_5$, where <x1,s1>, <x2,s2>, <x4,s3> and <x5,s5> are matching sites, while s4 is a missing site and x3 is a false-positive site. (b) The formal description of the alignment, where the symbol ‘–’ represents an insertion or deletion

Workflow of nanoGapFiller

nanoGapFiller takes experimental optical maps and genome assembly graph as input and generates scaffolds with gaps filled as output. As shown in Fig. 4, the workflow of nanoGapFiller mainly consists of the following three steps:

  1. Scaffolding and locating gaps: Initially, nanoGapFiller aligns genome assembly contigs onto optical maps. The aligned contigs are further connected into scaffolds according to their order in the alignment. Note that some regions of optical maps often fail to align with any contig, thus forming gaps in scaffolds. These gaps, represented as Ns rather than normal nucleotide bases A/C/T/G, might be thousands of bases long.

    For each gap, we record three features, namely, beginning site, ending site, and the site sequence excerpted from the corresponding unaligned region of an optical map. Taking the gap shown in Fig. 4 as an example, its beginning site and ending site are $x_3$ and $x_6$, respectively, and its site sequence is $x_3 x_4 x_5 x_6$.

  2. Finding the contig path matching best with gaps: To fill a gap of scaffolds, nanoGapFiller searches assembly graph for the contig path that matches best with the site sequence of the gap. For this aim, nanoGapFiller uses a stochastic model to measure the similarity between a site sequence and any possible contig path, and then uses the probabilistic search technique to efficiently identify the contig path with the highest similarity. The details of the stochastic model and the probabilistic search technique will be described in later subsections.

  3. Filling gaps of scaffolds: Finally, nanoGapFiller fills the gaps of scaffolds using the nucleotide sequence of the best-matching contig paths. For example, the gap shown in Fig. 4 is filled using the best-matching contig path $c_1 c_3 c_6 c_{10}$. After filling the gaps of scaffolds, the genome completeness will be greatly improved.

Fig. 4.


Overall pipeline of nanoGapFiller. Step 1. Initially, nanoGapFiller aligns genome assembly contigs onto optical maps. The aligned contigs are further connected into scaffolds according to their order in the alignment. Note that some regions of optical maps often fail to align with any contig, thus forming gaps in scaffolds. Here, we identified a gap with site sequence $x_3 x_4 x_5 x_6$. Step 2. To fill this gap, nanoGapFiller searches the assembly graph for the contig path (shown in red) that matches best with the site sequence $x_3 x_4 x_5 x_6$. Step 3. nanoGapFiller fills the gap with the nucleotide sequence of the best-matching contig path $c_1 c_3 c_6 c_{10}$

Measuring the similarity between an optical map and a contig path

Consider an optical map with site sequence $x_1 \ldots x_m$ and a contig path with site sequence $s_1 \ldots s_n$. nanoGapFiller calculates the probability that the contig path generates the optical map (denoted as $S(x_1 \ldots x_m, s_1 \ldots s_n)$), and then uses this probability as the similarity between them. The generating process of $x_1 \ldots x_m$ from $s_1 \ldots s_n$ is as follows: In an ideal optical mapping experiment, an enzyme recognition site $s_i$ in the contig path will be observed and recorded as a certain site $x_j$ of the optical map, which is called a matching between sites $s_i$ and $x_j$. However, it is often the case that some recognition sites are missing (called deletion) whereas some extra sites are recorded in optical maps purely due to false-positive signals (called insertion).

To formally describe the generating process of an optical map from a contig path, we define the alignment between their site sequences. For each alignment A of the site sequences $x_1 \ldots x_m$ and $s_1 \ldots s_n$, we use $S_A(x_1 \ldots x_m, s_1 \ldots s_n, A)$ to denote the probability that the generating process corresponding to this alignment occurs.

Among all possible alignments between $x_1 \ldots x_m$ and $s_1 \ldots s_n$, we identify the one with the highest score, and then use this score as the similarity between the two site sequences, i.e.,

$$S(x_1 \ldots x_m, s_1 \ldots s_n) = \max_{A \in \mathcal{A}} S_A(x_1 \ldots x_m, s_1 \ldots s_n, A),$$

where $\mathcal{A}$ denotes the set of all possible alignments of the two site sequences.

We calculate $S_A(x_1 \ldots x_m, s_1 \ldots s_n, A)$ as follows: we divide the two sequences at the matching sites of A, and thus acquire several matching fragment pairs. For example, the division at the matching sites <x2,s2> and <x4,s3> yields three matching fragment pairs (see Fig. 3). For each matching fragment pair p, we calculate three scoring items:

  1. Length difference item $LD(p)$: In the ideal case, two matching fragments should have identical length. However, in an optical mapping experiment, the molecules are always stretched or compressed, leading to a length difference between the matched fragments. To measure the length difference, we adopted the Laplace distribution as performed by Rmaps [31–33], i.e.,
    $$LD(p) = \frac{1}{2b} \exp\left(-\frac{|d - \mu|}{b}\right),$$
    where $d$ denotes the length difference of the two matching fragments in $p$, and $\mu$ and $b$ denote the mean and scale parameter of the distribution, respectively.
  2. Missing sites item $M(p)$: We used the geometric distribution [31–33] to model the number of missing sites $m$, i.e.,
    $$M(p) = (1 - q)^{m - 1} q,$$
    where $q$ denotes the probability that an enzyme recognition site is detected by optical mapping.
  3. False-positive sites item $FP(p)$: We used the Poisson distribution to model the number of false-positive sites $f$, i.e.,
    $$FP(p) = \frac{\lambda^{f} e^{-\lambda}}{f!},$$
    where $\lambda$ represents the expected number of false-positive sites.
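The three scoring items can be evaluated directly from their formulas. The following is a minimal Python sketch, not the nanoGapFiller implementation. The function names are hypothetical, the value of `LAMBDA_FP` is a placeholder chosen only for illustration, and combining the three items by multiplication is an assumption (the text here does not state how they are combined). The missing-sites term is evaluated exactly as written, so the indexing convention for $m$ is left to the caller.

```python
import math

# q, mu and b follow the E. coli values quoted in the text.
Q_DETECT = 0.772   # probability that a recognition site is detected
MU = 293.0         # mean of the length difference (nt)
B = 500.0          # Laplace scale parameter (nt)
LAMBDA_FP = 0.5    # ASSUMPTION: placeholder expected number of false positives


def length_difference_item(d, mu=MU, b=B):
    """Laplace density of the length difference d of a fragment pair."""
    return math.exp(-abs(d - mu) / b) / (2.0 * b)


def missing_sites_item(m, q=Q_DETECT):
    """Geometric term (1 - q)^(m - 1) * q, evaluated as written in the text."""
    return (1.0 - q) ** (m - 1) * q


def false_positive_item(f, lam=LAMBDA_FP):
    """Poisson probability of observing f false-positive sites."""
    return lam ** f * math.exp(-lam) / math.factorial(f)


def fragment_pair_score(d, m, f):
    """ASSUMPTION: the score of one fragment pair is the product of the items."""
    return (length_difference_item(d)
            * missing_sites_item(m)
            * false_positive_item(f))
```

Under these assumptions, a fragment pair whose length difference equals $\mu$ receives the largest length-difference term, $1/(2b)$, and the score decays exponentially as the difference moves away from $\mu$.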

In this study, the parameters were set according to the manually-verified alignments of optical maps and contig paths of E. coli as q=0.772,λ=1526000,μ=293nt,b=500nt.

Identifying the best-matching contig path of a gap

Before describing our method to identify the contig path that best matches a given gap, we first present the formulation of this problem: Let $x_1 \dots x_m$ be the site sequence of the gap of interest. Through locating gaps, we have identified from the assembly graph two sites that match the beginning site $x_1$ and the ending site $x_m$, respectively. We denote these two identified sites as $s_b$ and $s_e$. Thus, the objective is to find the contig path with site sequence $s_b \dots s_e$ such that the score $S(x_1 \dots x_m, s_b \dots s_e)$ is maximized.

The basic idea of our method is probabilistic search together with search space pruning, which can be described as follows: Starting from the beginning site $x_1$, we iteratively find the best-matching sites for each site $x_i$ ($1 \le i \le m$) by executing the following three steps:

  1. Calculating the probability of site matching: We use a set $M[x_i]$ to hold all matching sites of $x_i$. From the first $i-1$ sites $x_1 \dots x_{i-1}$, we calculate the belief that $x_i$ matches each site $s \in M[x_i]$, denoted as $\mathrm{Belief}(x_i = s)$. We then normalize the beliefs to transform them into probabilities $\Pr[x_i = s]$.

  2. Pruning the unlikely matching pairs: To reduce the search space, we remove the unlikely matching sites, i.e., we delete the site $s$ from $M[x_i]$ if $\Pr[x_i = s]$ is less than a pre-defined threshold MinimalMatchingProbability. We will show experimentally that, with an appropriate threshold, the search space can be significantly reduced with little influence on finding the correct paths.

  3. Propagating the matching probability to downstream site pairs: For the remaining sites $s \in M[x_i]$, we propagate their matching probability $\Pr[x_i = s]$ to the downstream site pairs $\langle x_j, s' \rangle$, where $j \le i +$ MaxInsertionSize and $s'$ is within at most MaxDeletionSites sites from $s$. For each pair $\langle x_j, s' \rangle$, we calculate its matching belief according to the Bayesian formula, which uses $\Pr[x_i = s]$ as the prior probability and the similarity $S(x_i \dots x_j, s \dots s')$ as the conditional probability.
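Written as formulas, the three steps amount to the update below. This is a sketch inferred from the description and from the Fig. 5 caption rather than a statement taken from the paper: the symbol $s'$ for a downstream site, the maximum over predecessor pairs (used in the caption when two paths reach the same site), and the normalization over all candidate sites of $x_j$ are assumptions.

```latex
\begin{align}
\mathrm{Belief}(x_j = s') &=
  \max_{\substack{i < j \le i + \mathrm{MaxInsertionSize} \\ s \in M[x_i]}}
  \Pr[x_i = s] \cdot S(x_i \dots x_j,\; s \dots s'), \\
\Pr[x_j = s'] &=
  \frac{\mathrm{Belief}(x_j = s')}{\sum_{t \in M[x_j]} \mathrm{Belief}(x_j = t)}, \\
M[x_j] &\leftarrow
  \{\, s' \in M[x_j] : \Pr[x_j = s'] \ge \mathrm{MinimalMatchingProbability} \,\}.
\end{align}
```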

We iterate this matching-site finding procedure until reaching the ending site $x_m$. Finally, we trace back from $x_m$ to identify the path that best matches the site sequence of the gap. Figure 5 shows an example of this probabilistic search procedure. The pseudocode is presented as follows.

[Algorithm figure: pseudocode of the probabilistic search procedure]
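The search loop can be sketched in Python on a toy site graph. This is a simplified illustration under stated assumptions, not the authors' pseudocode. The function names, the dictionary-based graph, and the threshold value are hypothetical. The similarity is reduced to a Laplace term on the length difference with illustrative parameters ($\mu = 0$, $b = 1$). Insertions and deletions are not searched explicitly; a skipped site must already appear as an edge in the graph. The edge lengths are read off the $S(\cdot,\cdot)$ arguments in the Fig. 5 caption.

```python
import math


def laplace_score(diff, mu=0.0, b=1.0):
    """Laplace density of a length difference (illustrative parameters)."""
    return math.exp(-abs(diff - mu) / b) / (2.0 * b)


def best_matching_path(gap_dists, graph, start, end, min_prob=0.01):
    """Return the site path from start to end that best matches gap_dists.

    gap_dists: distances between consecutive optical-map sites.
    graph: dict mapping a site to a list of (next_site, distance) edges.
    """
    prob = {start: 1.0}   # Pr[x_i = s] for the current optical-map site
    back = []             # back[i][s'] = best predecessor of s' at step i
    for observed in gap_dists:
        belief, pred = {}, {}
        # Propagate each surviving probability to downstream sites,
        # keeping the maximum belief when several paths reach one site.
        for site, p in prob.items():
            for nxt, dist in graph.get(site, []):
                value = p * laplace_score(observed - dist)
                if value > belief.get(nxt, 0.0):
                    belief[nxt] = value
                    pred[nxt] = site
        total = sum(belief.values())
        if total == 0.0:
            return None   # no downstream site is reachable
        # Normalize beliefs into probabilities, then prune unlikely sites.
        prob = {s: v / total for s, v in belief.items()
                if v / total >= min_prob}
        back.append(pred)
    if end not in prob:
        return None
    # Trace back from the ending site through the stored predecessors.
    path = [end]
    for pred in reversed(back):
        path.append(pred[path[-1]])
    return path[::-1]


# Toy site graph with edge lengths taken from the Fig. 5 caption.
toy_graph = {
    "s1": [("s2", 8), ("s3", 7), ("s5", 14)],
    "s2": [("s4", 5)],
    "s3": [("s5", 4)],
    "s4": [("s7", 3)],
    "s5": [("s7", 4)],
}
result = best_matching_path([8, 5, 3], toy_graph, "s1", "s7")
```

With these illustrative parameters, `result` is `["s1", "s2", "s4", "s7"]`, the same path reported in Fig. 5, although the intermediate probabilities differ from those in the figure because the similarity function here is a simplification.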

Fig. 5

Searching the site graph for the site sequence that best matches a gap. In this example, the gap has site sequence $x_1 x_2 x_3 x_4$ with inter-site distances 8, 5, and 3, respectively. Through locating gaps in Step 1, we know that the beginning site $x_1$ matches $s_1$ and the ending site $x_4$ matches $s_7$. Thus, our objective is to find the path from $s_1$ to $s_7$ that best matches the gap $x_1 x_2 x_3 x_4$. (a) Initially, we set $\Pr[x_1 = s_1] = 1$, as we know that $x_1$ matches $s_1$. Next, we propagated this probability to downstream site pairs and calculated the following matching beliefs for site $x_2$: $\mathrm{Belief}(x_2 = s_2) = \Pr[x_1 = s_1]\, S(8, 8)$, $\mathrm{Belief}(x_2 = s_3) = \Pr[x_1 = s_1]\, S(8, 7)$, and $\mathrm{Belief}(x_2 = s_5) = \Pr[x_1 = s_1]\, S(8, 14)$. After normalization, we obtained the site matching probabilities $\Pr[x_2 = s_2] = 0.81$, $\Pr[x_2 = s_3] = 0.19$, and $\Pr[x_2 = s_5] = 0$. (b) We propagated these probabilities further and obtained the following beliefs for site $x_3$: $\mathrm{Belief}(x_3 = s_4) = \Pr[x_2 = s_2]\, S(5, 5)$, $\mathrm{Belief}(x_3 = s_5) = \Pr[x_2 = s_3]\, S(5, 4)$, and $\mathrm{Belief}(x_3 = s_7) = \Pr[x_2 = s_5]\, S(5, 4)$. After normalization, we obtained the site matching probabilities $\Pr[x_3 = s_4] = 0.95$, $\Pr[x_3 = s_5] = 0.05$, and $\Pr[x_3 = s_7] = 0$. (c) For site $x_4$, we calculated its matching beliefs similarly. Note that there are two paths reaching site $s_7$, and thus we needed to take the maximum over the two paths: $\mathrm{Belief}(x_4 = s_7) = \max\{\Pr[x_3 = s_4]\, S(3, 3),\ \Pr[x_3 = s_5]\, S(3, 4)\}$. After calculating $\Pr[x_4 = s_7]$, we traced back and reported the best-matching site path as $s_1 s_2 s_4 s_7$.

Supplementary Information

12859_2021_4448_MOESM1_ESM.pdf (664.5KB, pdf)

Additional file 1. The additional results on the performance of nanoGapFiller.

Acknowledgements

We greatly appreciate Xuan Li for providing the experimental optical map data, and Wei Shen for his help with performing Hi-C scaffolding and analysis.

Abbreviations

DFS: Depth-first search

NGS: Next-generation sequencing

CPS: Contig path similarity

NSS: Nucleotide sequence similarity

Authors' contributions

DB conceived the study. BH designed and implemented the nanoGapFiller, and performed the experiment. BH, GW, BW, FJ, YZ, ZS, SS and DB analyzed the experimental results. BH, GW, BW and DB established the mathematical framework. BH and DB wrote and revised the manuscript. All authors read and approved the manuscript.

Funding

We would like to thank the National Key Research and Development Program of China (2020YFA0907000) and the National Natural Science Foundation of China (31770775, 62072435) for providing financial support for this study and its publication charges. The funding bodies had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Availability of data and materials

The reference genomes analysed during the current study are available in the NCBI repository under accession numbers NC_005070.1, NC_004116.1, NC_007005.1, NC_006361.1, NC_004547.2, NC_008255.1, NC_006350.1, NC_004463.1, NC_007413.1, AL645882.2, NC_000913.2, and AP013070.1. The Hi-C data are available in the NCBI GEO repository under accession numbers GSM2870416 and GSM2870417. The optical maps that support the findings of this study are available from Xuan Li, but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. Please contact Xuan Li (lixuan@sippe.ac.cn) if you need access to these data. The source code of nanoGapFiller is freely available at https://github.com/bigict/nanoGapFiller.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Bin Huang, Email: huangbin2015@ict.ac.cn.

Dongbo Bu, Email: dbu@ict.ac.cn.

References

  1. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001;98(17):9748–9753. doi: 10.1073/pnas.171285098.
  2. Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet. 2013;14(3):157. doi: 10.1038/nrg3367.
  3. Schadt EE, Turner S, Kasarskis A. A window into third-generation sequencing. Hum Mol Genet. 2010;19(R2):227–240. doi: 10.1093/hmg/ddq416.
  4. Lee H, Gurtowski J, Yoo S, Nattestad M, Marcus S, Goodwin S, McCombie WR, Schatz M. Third-generation sequencing and the future of genomics. BioRxiv. 2016;048603.
  5. Parkhill J. In defense of complete genomes. Nat Biotechnol. 2000;18(5):493. doi: 10.1038/75346.
  6. Garg S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol. 2021;22(1):1–24. doi: 10.1186/s13059-021-02328-9.
  7. Malmberg M, Spangenberg G, Daetwyler H, Cogan N. Assessment of low-coverage nanopore long read sequencing for SNP genotyping in doubled haploid canola (Brassica napus L.) Sci Rep. 2019;9(1):8688. doi: 10.1038/s41598-019-45131-0.
  8. Schwartz DC, Li X, Hernandez LI, Ramnarain SP, Huff EJ, Wang Y-K. Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science. 1993;262(5130):110–114. doi: 10.1126/science.8211116.
  9. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18. doi: 10.1186/2047-217X-1-18.
  10. Boetzer M, Pirovano W. Toward almost closed genomes with GapFiller. Genome Biol. 2012;13(6):56. doi: 10.1186/gb-2012-13-6-r56.
  11. Kosugi S, Hirakawa H, Tabata S. GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Bioinformatics. 2015;31(23):3733–3741. doi: 10.1093/bioinformatics/btv465.
  12. English AC, Richards S, Han Y, Wang M, Vee V, Qu J, Qin X, Muzny DM, Reid JG, Worley KC, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE. 2012;7(11):47768. doi: 10.1371/journal.pone.0047768.
  13. Xu G-C, Xu T-J, Zhu R, Zhang Y, Li S-Q, Wang H-W, Li J-T. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience. 2018;8(1):157. doi: 10.1093/gigascience/giz157.
  14. Nagarajan N, Read TD, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008;24(10):1229–1235. doi: 10.1093/bioinformatics/btn102.
  15. Istace B, Belser C, Aury J-M. Biscot: improving large eukaryotic genome assemblies with optical maps. PeerJ. 2020;8:10150. doi: 10.7717/peerj.10150.
  16. Pan W, Wanamaker SI, Ah-Fong AM, Judelson HS, Lonardi S. Novo&stitch: accurate reconciliation of genome assemblies via optical maps. Bioinformatics. 2018;34(13):43–51. doi: 10.1093/bioinformatics/bty255.
  17. Chen Y-M, Yu C-H, Hwang C-C, Liu T. OMACC: an optical-map-assisted contig connector for improving de novo genome assembly. BMC Syst Biol. 2013;7(6):7. doi: 10.1186/1752-0509-7-S6-S7.
  18. Lin HC, Goldstein S, Mendelowitz L, Zhou S, Wetzel J, Schwartz DC, Pop M. AGORA: assembly guided by optical restriction alignment. BMC Bioinform. 2012;13(1):189. doi: 10.1186/1471-2105-13-189.
  19. Mukherjee K, Alipanahi B, Kahveci T, Salmela L, Boucher C. Aligning optical maps to de Bruijn graphs. Bioinformatics. 2018. doi: 10.1093/bioinformatics/btz069.
  20. Miclotte G, Plaisance S, Rombauts S, Van de Peer Y, Audenaert P, Fostier J. Omsim: a simulator for optical map data. Bioinformatics. 2017;33(17):2740–2742. doi: 10.1093/bioinformatics/btx293.
  21. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2011;28(4):593–594. doi: 10.1093/bioinformatics/btr708.
  22. Nurk S, Bankevich A, Antipov D, Gurevich A, Korobeynikov A, Lapidus A, Prjibelsky A, Pyshkin A, Sirotkin A, Sirotkin Y, et al. Assembling genomes and mini-metagenomes from highly chimeric reads. In: Annual international conference on research in computational molecular biology, 2013; pp. 158–170. Springer.
  23. Kinnunen T, Nyrönen T, Lehtovuori P. SOMA2-open source framework for molecular modelling workflows. Chem Cent J. 2008;2(1):4. doi: 10.1186/1752-153X-2-S1-P4.
  24. Leung AK-Y, Kwok T-P, Wan R, Xiao M, Kwok P-Y, Yip KY, Chan T-F. Omblast: alignment tool for optical mapping using a seed-and-extend approach. Bioinformatics. 2017;33(3):311–319. doi: 10.1093/bioinformatics/btw620.
  25. Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol. 2013;31(12):1119–1125. doi: 10.1038/nbt.2727.
  26. Marie-Nelly H, Marbouty M, Cournac A, Flot J-F, Liti G, Parodi DP, Syan S, Guillén N, Margeot A, Zimmer C, et al. High-quality genome (re) assembly using chromosomal contact data. Nat Commun. 2014;5(1):1–10. doi: 10.1038/ncomms6695.
  27. Baudry L, Guiglielmoni N, Marie-Nelly H, Cormier A, Marbouty M, Avia K, Mie YL, Godfroy O, Sterck L, Cock JM, et al. instagraal: chromosome-level quality scaffolding of genomes using a proximity ligation-based scaffolder. Genome Biol. 2020;21(1):1–22. doi: 10.1186/s13059-020-02041-z.
  28. Dudchenko O, Batra SS, Omer AD, Nyquist SK, Hoeger M, Durand NC, Shamim MS, Machol I, Lander ES, Aiden AP, et al. De novo assembly of the aedes aegypti genome using hi-c yields chromosome-length scaffolds. Science. 2017;356(6333):92–95. doi: 10.1126/science.aal3327.
  29. Lioy VS, Cournac A, Marbouty M, Duigou S, Mozziconacci J, Espéli O, Boccard F, Koszul R. Multiscale structuring of the E. coli chromosome by nucleoid-associated and condensin proteins. Cell. 2018;172(4):771–783. doi: 10.1016/j.cell.2017.12.027.
  30. Gurevich A, Saveliev V, Vyahhi N, Tesler G. Quast: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–1075. doi: 10.1093/bioinformatics/btt086.
  31. Li M, Mak AC, Lam ET, Kwok P-Y, Xiao M, Yip KY, Chan T-F, Yiu S-M. Towards a more accurate error model for BioNano optical maps. In: International symposium on bioinformatics research and applications, 2016; pp. 67–79. Springer.
  32. Chen P, Jing X, Ren J, Cao H, Hao P, Li X. Modelling BioNano optical data and simulation study of genome map assembly. Bioinformatics. 2018;34(23):3966–3974. doi: 10.1093/bioinformatics/bty456.
  33. Das SK, Austin MD, Akana MC, Deshpande P, Cao H, Xiao M. Single molecule linear analysis of DNA in nano-channel labeled with sequence specific fluorescent probes. Nucleic Acids Res. 2010;38(18):177. doi: 10.1093/nar/gkq673.


Articles from BMC Bioinformatics are provided here courtesy of BMC
