Abstract
Background
As the number of sequenced genomes rapidly increases, chromosome assembly is becoming an even more crucial step of any genome study. Since de novo chromosome assemblies are confounded by repeat-mediated artifacts, reference-assisted assemblies that use comparative inference have become widely used, prompting the development of several reference-assisted assembly programs for prokaryotic and eukaryotic genomes.
Findings
We developed Chromosomer – a reference-based genome arrangement tool, which rapidly builds chromosomes from genome contigs or scaffolds using their alignments to a reference genome of a closely related species. Chromosomer does not require mate-pair libraries and it offers a number of auxiliary tools that implement common operations accompanying the genome assembly process.
Conclusions
Despite implementing a straightforward alignment-based approach, Chromosomer is a useful tool for genomic analysis of species without chromosome maps. Putative chromosome assemblies by Chromosomer can be used in comparative genomic analysis, genomic variation assessment, potential linkage group inference and other kinds of analysis involving contig or scaffold mapping to a high-quality assembly.
Electronic supplementary material
The online version of this article (doi:10.1186/s13742-016-0141-6) contains supplementary material, which is available to authorized users.
Keywords: Reference-assisted assembly, Chromosome assembly, Alignment
Background
Chromosome assembly is an important part of virtually any eukaryotic genome project. The number of assembled genomes increases each year and many of them are anchored to physical chromosome maps [1]. A robust de novo chromosome assembly requires not only mate-pair reads with different insert sizes, but also physical and genetic maps [2–4]. The large number of high quality assembled ‘reference genomes’ leads to an alternative approach – a reference-assisted chromosome assembly. Using this approach, the benefits of assembled chromosomes can be exploited without additional sequencing or map construction. These benefits include a known number of linkage groups and an estimated distance between markers, which is important for inferences of linkage and synteny. An assisted assembly also connects and orders large numbers of small contigs or scaffolds based on comparative analysis. In many cases, the initial number of contigs and scaffolds can exceed several hundred thousand following de novo assembly; working with such a fragmented genome can prove challenging [5]. Arranging contigs and scaffolds into putative chromosomes using information from the reference genome of a closely related species reduces the overall number of fragments from thousands to hundreds or dozens and also simplifies the annotation and analysis of different genomic features such as repeats, genes, single-nucleotide polymorphisms, copy number variations and segmental duplications.
A disadvantage of this approach is the introduction of occasional assembly errors driven by evolutionary chromosomal rearrangements. Even a closely related reference can differ in synteny from the target genome to some degree. The number of introduced assembly artifacts generally correlates with the evolutionary distance between the target and reference genomes [6] although rates of chromosome rearrangements are hardly clock-like, at least for mammals [7, 8]. These assembly artifacts are easily corrected if a physical map for the target genome is developed, using a tool such as the single molecule next-generation mapping system (Irys) developed by BioNano Genomics [9].
Multiple programs have been developed for reference-assisted chromosome assembly: Bambus [10], BACCardI [11], Projector2 [12], OSLay [13], ABACAS [14], MeDuSa [15], AlignGraph [16], Ragout [17], SyMap [18] and RACA [19]. Most of the listed tools were designed for bacterial or small genomes. For example, ABACAS is a convenient bacterial genome contiguation tool that may also be used for small eukaryotic genomes such as Saccharomyces cerevisiae (12.1 mega base pairs). However, ABACAS is not efficiently scaled to use with the large genomes typical of vertebrate species.
SyMap was designed to facilitate reference-assisted chromosome assembly for eukaryotic genomes; however, it has important limitations. SyMap uses MUMmer [20] or NUCmer [21] for the alignment phase, requires a separate structured query language (SQL) database to work efficiently and takes a very long time to align large genomes to each other.
The most promising approach for reference-assisted assembly is based on using several reference genomes instead of a single one. RACA implements such an approach, using alignments of target, reference and outgroup genomes as inputs to generate predicted chromosome fragments (PCFs) [19]. However, RACA also requires additional evidence from mate-pair libraries for joining genome fragments, while most de novo sequenced genomes have no such libraries available. Furthermore, RACA requires extensive computations for assembling chromosomes.
In this paper we introduce Chromosomer – an open-source cross-platform software that automates the reference-assisted building of genomic chromosomes and is especially effective for large genomes (> 1 giga base pairs). Chromosomer constructs draft chromosomes based only on alignments between fragments (contigs or scaffolds) to be arranged and a reference genome, thereby improving analytical and annotation opportunities for the index species assembly. Although Chromosomer does not use any sophisticated models or algorithms for chromosome assembly, we show that its results are comparable with state-of-the-art assemblies and can be used for further genomic analysis.
Findings
Algorithm
To map fragments to a reference genome, Chromosomer uses results of pairwise alignments between the fragments (contigs and scaffolds) and the chromosomes of the reference genome. The alignments are required to have associated score values that reflect the length and identity of the aligned regions (for example, the BLAST bit score [22]). In addition, the start and end positions of aligned regions in both the fragments and the reference chromosomes are required.
Chromosomer analyzes alignment positions and scores to map fragments to a reference. The mapping process takes the following steps (see Fig. 1).
From pairwise alignments, determine fragments that can be anchored to a reference according to the ratio of their first and second greatest alignment scores. If the ratio is greater than the predefined threshold, which is the algorithm parameter, then the fragment is anchored to a position corresponding to its alignment with the greatest score. Otherwise, the fragment is considered unplaced if these two alignments are located on different reference chromosomes or unlocalized if both alignments are located on the same chromosome.
Using fragment anchors, map the fragments to the reference chromosomes (see Fig. 2a and b). Unlocalized and unplaced fragments are excluded from the assembly.
Resolve overlaps between mapped fragments by inserting gaps between them (see Fig. 3a and b).
Produce a map describing fragment positions at a reference genome and output assembled chromosome sequences and lists of unlocalized and unplaced fragments.
Besides reference-assisted chromosome assembly, Chromosomer also offers the following options:
transfer annotations from fragments to assembled chromosomes using a fragment map;
visualize a reference-assisted chromosome assembly as a genome browser track containing fragment positions;
obtain statistics on a reference-assisted chromosome assembly.
We further describe several aspects of the Chromosomer workflow: mapping fragments to reference chromosomes, transferring annotations from fragments to the assembled chromosomes and defining parameters that tune the Chromosomer assembly process. We consider all sequence coordinates to be zero-based and half-opened (that is, the first nucleotide is considered as position 0 and the last nucleotide position is equal to the sequence length).
Mapping fragments to reference genome
Assume we have a fragment of length L base pairs (bp) and an anchor between it and a reference chromosome that is formed by the alignment of the region in the fragment and the [SA,EA) region in the reference. The and terms denote start and end coordinates of the alignment in the fragment and the SA and EA terms denote start and end coordinates of the alignment in the reference genome. We derive fragment coordinates SF and EF in the reference genome for two cases: a direct fragment orientation that is the same as in the reference (Fig. 2a) and an orientation that is reversed relative to the reference (Fig. 2b). Equations for SF and EF in the direct orientation case are:
Equations for SF and EF in the reversed orientation case are:
Transferring annotations to assembled chromosomes
Next, assume we have a fragment of length L bp that is mapped to position [ SF,EF) in an assembled chromosome. We are interested in where the region of the original fragment will be placed. Let SR and ER be the start and end positions of the region in the chromosome; then the following equations hold if the fragment is mapped in direct orientation (Fig. 4a):
If the region is mapped in the reverse orientation, then SR and ER satisfy the following equations (Fig. 4b):
Assembly parameters
Chromosomer introduces two parameters that influence the assembly process. The first parameter is the alignment score ratio threshold, which is used to distinguish anchored and unplaced fragments. If the score ratio of the two fragment alignments with the highest scores exceeds the threshold, then the fragment is considered anchored, otherwise it is considered unplaced and is excluded from further analysis. The alignment score ratio threshold must be a positive number greater than one.
The second parameter is the insertion size – the size of a gap which is inserted between overlapping regions (see Fig. 3b). The insertion size is recommended to be equal to or greater than the sequencing library size.
Chromosomer assembly evaluation
To evaluate the performance of Chromosomer, we assembled the following bacterial, yeast and mammalian genomes.
Escherichia coli Sakai strain (E. coli K-12 strain as a reference);
Saccharomyces cerevisiae CLIB324 strain (S. cerevisiae S288c strain as a reference);
Pantholops hodgsonii (Tibetan antelope; Bos taurus as a reference);
Pan troglodytes (chimpanzee; Homo sapiens as a reference).
We also assembled the bacterial and yeast genomes using ABACAS and compared ABACAS-derived assemblies with Chromosomer-derived ones. Although ABACAS is not designed for assembling multichromosome genomes, we used separate ABACAS runs for each chromosome from the reference genome. The Chromosomer assembly of Tibetan antelope was compared with the RACA assembly presented in [19]. The Chromosomer-derived chimpanzee chromosomes were assessed by comparison with the GenBank assembly and by checking the coding region accuracy. LASTZ [23] was used to perform whole-genome alignments for assessing chromosomes obtained with Chromosomer.
Escherichia coli assembly
The E. coli Sakai strain genome was assembled in two steps. First, we assembled its reads (SRA accession numbers SRR530851 and SRR587217) to scaffolds using the SPAdes assembler [24] (Additional file 1). Next, we applied Chromosomer and ABACAS to assemble the scaffolds using the E. coli K-12 strain genome assembly (RefSeq accession number NC_000913.3) as a reference (Additional files 2 and 3). Finally, we compared the derived assembly with the RefSeq assembly of the E. coli Sakai strain (RefSeq accession number NC_002695.1). The dot plots of LASTZ whole-genome alignments between the derived assemblies and the RefSeq assembly are given in Fig. 5a and b. The comparison of the assemblies is given in Table 1.
Table 1.
ABACAS | Chromosomer | |
---|---|---|
Mean identity (%) | 90.34 | 89.50 |
Mean length (in bp) | 854.62 | 719.76 |
Mean mismatches (in bp) | 21.11 | 22.35 |
Coverage (in bp) | 1,999,204 | 5,053,547 |
The table shows statistics derived from LASTZ alignments of ABACAS and Chromosomer assemblies to the RefSeq E. coli Sakai strain assembly
Saccharomyces cerevisiae assembly
The S. cerevisiae CLIB324 strain genome was assembled from its scaffolds (GenBank accession number GCA_000192495.1) using S. cerevisiae S288c strain genome as a reference (RefSeq accession number GCF_000146045.2). The chromosome sequences assembled by ABACAS and Chromosomer are given in Additional files 4 and 5, respectively. Dot plots comparing the LASTZ alignments of chromosome 1 between the reference genome and those from the ABACAS or Chromosomer assemblies are shown in Fig. 6a and b. The comparison of the assemblies is given in Table 2.
Table 2.
ABACAS | Chromosomer | |
---|---|---|
Mean identity (%) | 93.29 | 93.87 |
Mean length (in bp) | 792.84 | 801.59 |
Mean mismatches (in bp) | 14.99 | 11.85 |
Coverage (in bp) | 7,920,800 | 9,595,323 |
The table shows statistics derived from LASTZ alignments of ABACAS and Chromosomer assemblies compared with the S. cerevisiae S288c strain assembly that was used as a reference
Pantholops hodgsonii assembly
The P. hodgsonii genome was assembled from its scaffolds (GenBank accession number GCA_000400835.1) using the B. taurus UMD3.1 assembly as a reference and the net alignments between the scaffolds and the cow chromosomes from [19]. The fragment map of the Chromosomer-derived Tibetan antelope chromosomes is given in Additional file 6.
We compared the Tibetan antelope chromosomes obtained by Chromosomer and the PCFs produced by RACA in Fig. 7. The comparison shows that both sets of chromosomes are similar to each other; however, RACA-derived PCFs are longer than Chromosomer-derived ones and the reference cow chromosomes. This result may be due to the difference in the Chromosomer and RACA algorithms: while RACA tends to gather as many genome fragments as possible to a larger fragment, Chromosomer determines scaffolds that have sufficient evidence for being placed on a chromosome; otherwise, Chromosomer considers a scaffold unlocalized or unplaced and does not include it in a chromosome. Thus, Chromosomer preserves the structure of the reference chromosomes, see Fig. 8.
Although the assemblies are fairly similar, two main differences can be distinguished:
The PCFs assembled by RACA tend to be longer than the original reference genome (cow) chromosomes.
RACA predicted two chromosomal translocations in the Tibetan antelope genome compared with the cow genome: the first one between chromosomes 7 and 10 and the second one between chromosomes 21 and 27. The predicted translocations led to elongation of chromosome 7 and shortening of chromosome 10; chromosomes 21 and 27 are also related in the same way but to a lesser extent (see Fig. 7). The ability to detect cross-species rearrangements is a feature of RACA that is related to its more complex assembly model and integration of paired-end reads, which Chromosomer does not use.
In addition, Chromosomer demonstrated better time performance and required fewer computational resources than RACA. It took about 1.7 hours and 1.5 GB of random access memory (RAM) for Chromosomer to produce the chromosomes from the net alignments using one CPU (central processing unit). RACA spent 55 hours and required 59 GB of RAM using three CPUs to get the result from the same net alignments. We used the SuperMicro server for the benchmark (12 Intel Xeon E5-2690 CPUs and 396 GB RAM).
Pan troglodytes assembly
The P. troglodytes genome was assembled using Chromosomer from its scaffolds (GenBank assembly accession GCA_000001515.4) using the H. sapiens GRCh38.p2 assembly as a reference and the net alignment of the chimpanzee genome to the human genome from the UCSC Genome Browser [25, 26]. The fragment map constructed by Chromosomer is given in Additional file 7. The dot plot of the alignment of chromosome 1 assembled by Chromosomer from the scaffolds to its GenBank sequence is given in Fig. 9.
We assessed the obtained chimpanzee chromosomes against two criteria: adjacency of contigs and scaffolds and gene integrity. We checked the adjacency for three levels: contigs within scaffolds (Fig. 10), contigs within chromosomes (Fig. 11) and scaffolds within chromosomes (Fig. 12). The figures show that Chromosomer performs best for assembling large genomic fragments such as scaffolds and may misplace short genomic fragments like contigs for the local structure. This conclusion is supported by the gene integrity check (Figs. 13 and 14), which shows that Chromosomer might break genes located on multiple contigs.
From the examples shown above, we conclude that Chromosomer is comparable to existing reference- genome assembly tools and is able to assemble and process large genomes. Chromosomer may increase efficiency of genome annotation studies by replacing numerous genome fragments with draft chromosome assemblies.
Availability and requirements
Chromosomer is publicly available at the Python Package Index (PyPI, https://pypi.python.org) and GitHub (https://github.com/gtamazian/chromosomer).
Project name: Chromosomer
Project home page:https://github.com/gtamazian/chromosomer
Operating systems: Platform independent
Programming languages: Python
Other requirements: Python 2.7
License: BSD 3-Clause License
Any restriction to use by non-academics: none
Acknowledgements
This work was supported by Russian Ministry of Science (11.G34.31.0068 to SJO).
Availability of supporting data
Data furthering supporting this paper is available in the GigaScience repository, GigaDB [27].
Authors’ contributions
PD, GT and KPK conceived the project. GT, PD, KK, AK and SJO supervised the project. GT implemented the tool. GT, PD and KK designed and described the usage examples. GT, PD, KK, AK, KPK and SJO composed and revised the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Additional files
Contributor Information
Gaik Tamazian, Email: mail@gtamazian.com.
Pavel Dobrynin, Email: pdobrynin@gmail.com.
Ksenia Krasheninnikova, Email: krasheninnikova@gmail.com.
Aleksey Komissarov, Email: ad3002@gmail.com.
Klaus-Peter Koepfli, Email: klauspeter.koepfli527@gmail.com.
Stephen J. O’Brien, Email: lgdchief@gmail.com
References
- 1.Haussler D, O’Brien SJ, Ryder OA, Barker FK, Clamp M, Crawford AJ, Hanner R, Hanotte O, Johnson WE, McGuire JA, et al. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J Hered. 2008;100(6):659–74. doi: 10.1093/jhered/esp086. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Mavrich TN, Jiang C, Ioshikhes IP, Li X, Venters BJ, Zanton SJ, Tomsho LP, Qi J, Glaser RL, Schuster SC, et al. Nucleosome organization in the Drosophila genome. Nature. 2008;453(7193):358–62. doi: 10.1038/nature06929. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.McPherson JD, Marra M, Hillier L, Waterston RH, Chinwalla A, Wallis J, Sekhon M, Wylie K, Mardis ER, Wilson RK, et al. A physical map of the human genome. Nature. 2001;409(6822):934–41. doi: 10.1038/35057157. [DOI] [PubMed] [Google Scholar]
- 4.Lewin HA, Larkin DM, Pontius J, O’Brien SJ. Every genome sequence needs a good map. Genome Res. 2009;19(11):1925–8. doi: 10.1101/gr.094557.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Murchison EP, Schulz-Trieglaff OB, Ning Z, Alexandrov LB, Bauer MJ, Fu B, Hims M, Ding Z, Ivakhno S, Stewart C, et al. Genome sequencing and analysis of the Tasmanian devil and its transmissible cancer. Cell. 2012;148(4):780–91. doi: 10.1016/j.cell.2011.11.065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Luo H, Arndt W, Zhang Y, Shi G, Alekseyev MA, Tang J, Hughes AL, Friedman R. Phylogenetic analysis of genome rearrangements among five mammalian orders. Mol Phylogenet Evol. 2012;65(3):871–82. doi: 10.1016/j.ympev.2012.08.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.O’Brien SJ, Menotti-Raymond M, Murphy WJ, Nash WG, Wienberg J, Stanyon R, Copeland NG, Jenkins NA, Womack JE, Graves JAM. The promise of comparative genomics in mammals. Science. 1999;286(5439):458–81. doi: 10.1126/science.286.5439.458. [DOI] [PubMed] [Google Scholar]
- 8.Murphy WJ, Larkin DM, Everts-van der Wind A, Bourque G, Tesler G, Auvil L, Beever JE, Chowdhary BP, Galibert F, Gatzke L, et al. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science. 2005;309(5734):613–7. doi: 10.1126/science.1111387. [DOI] [PubMed] [Google Scholar]
- 9.BioNano Genomics. Whole Genome Mapping with the Irys System. http://bionanogenomics.com. Accessed 13 Aug 2016.
- 10.Pop M, Kosack DS, Salzberg SL. Hierarchical scaffolding with Bambus. Genome Res. 2004;14(1):149–59. doi: 10.1101/gr.1536204. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Bartels D, Kespohl S, Albaum S, Drüke T, Goesmann A, Herold J, Kaiser O, Pühler A, Pfeiffer F, Raddatz G, et al. BACCardI — a tool for the validation of genomic assemblies, assisting genome finishing and intergenome comparison. Bioinformatics. 2005;21(7):853–9. doi: 10.1093/bioinformatics/bti091. [DOI] [PubMed] [Google Scholar]
- 12.van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005;33(suppl 2):560–6. doi: 10.1093/nar/gki356. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Richter DC, Schuster SC, Huson DH. OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007;23(13):1573–9. doi: 10.1093/bioinformatics/btm153. [DOI] [PubMed] [Google Scholar]
- 14.Assefa S, Keane TM, Otto TD, Newbold C, Berriman M. ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics. 2009;25(15):1968–9. doi: 10.1093/bioinformatics/btp347. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Bosi E, Donati B, Galardini M, Brunetti S, Sagot MF, Lió P, Crescenzi P, Fani R, Fondi M. MeDuSa: a multi-draft based scaffolder. Bioinformatics. 2015;31(15):2443–51. doi: 10.1093/bioinformatics/btv171. [DOI] [PubMed] [Google Scholar]
- 16.Bao E, Jiang T, Girke T. AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics. 2014;30(12):319–28. doi: 10.1093/bioinformatics/btu291. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Kolmogorov M, Raney B, Paten B, Pham S. Ragout—a reference-assisted assembly tool for bacterial genomes. Bioinformatics. 2014;30(12):302–9. doi: 10.1093/bioinformatics/btu280. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Soderlund C, Bomhoff M, Nelson WM. SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res. 2011;39(10):68–8. doi: 10.1093/nar/gkr123. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Kim J, Larkin DM, Cai Q, Zhang Y, Ge RL, Auvil L, Capitanu B, Zhang G, Lewin HA, Ma J, et al. Reference-assisted chromosome assembly. Proc Natl Acad Sci. 2013;110(5):1785–90. doi: 10.1073/pnas.1220349110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Delcher AL, Salzberg SL, Phillippy AM. Using MUMmer to identify similar regions in large sequence sets. Curr Protocol Bioinforma. 2003;10. [DOI] [PubMed]
- 21.Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002;30(11):2478–83. doi: 10.1093/nar/30.11.2478. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
- 23.Harris RS. Improved pairwise alignment of genomic DNA. PhD thesis, The Pennsylvania State University. 2007.
- 24.Nurk S, Bankevich A, Antipov D, Gurevich A, Korobeynikov A, Lapidus A, Prjibelsky A, Pyshkin A, Sirotkin A, Sirotkin Y, et al. Research in Computational Molecular Biology. Berlin, Heidelberg: Springer; 2013. Assembling genomes and mini-metagenomes from highly chimeric reads. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci. 2003;100(20):11484–9. doi: 10.1073/pnas.1932072100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Rosenbloom KR, Armstrong J, Barber GP, Casper J, Clawson H, Diekhans M, Dreszer TR, Fujita PA, Guruvadoo L, Haeussler M, et al. The UCSC genome browser database: 2015 update. Nucleic Acids Res. 2015;43(D1):670–81. doi: 10.1093/nar/gku1177. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Tamazian G, Dobrynin P, Krasheninnikova K, Komissarov A, Koepfli K-P, O’Brien SJ. Supporting data for “Chromosomer: a reference-based genome arrangement tool for producing draft chromosome sequences”. GigaScience Database. 2016. http://dx.doi.org/10.5524/100210. [DOI] [PMC free article] [PubMed]