Abstract
Background
Large-scale sequencing projects provide high-quality full-genome data that can be used for reconstruction of chromosomal exchanges and rearrangements that disrupt conserved syntenic blocks. The highest resolution of cross-species homology can be obtained on the basis of whole-genome, reference-free alignments. Very large multiple alignments of full-genome sequence stored in a binary format demand an accurate and efficient computational approach for synteny block production.
Findings
halSynteny performs efficient processing of pairwise alignment blocks for any pair of genomes in the alignment. The tool is part of the HAL comparative genomics suite and is targeted to build synteny blocks for multi-hundred–way, reference-free vertebrate alignments built with the Cactus system.
Conclusions
halSynteny enables an accurate and rapid identification of synteny in multiple full-genome alignments. The method is implemented in C++11 as a component of the halTools software and released under MIT license. The package is available at https://github.com/ComparativeGenomicsToolkit/hal/.
Keywords: Synteny blocks, genome alignments, comparative genomics, HAL format
Introduction
Conserved synteny blocks provide a conceptual framework for the analysis of interspecies homology. Originally, the notion of synteny stems from the area of cell genetics, where it was defined as the co-location of ≥2 homologous genes on the same chromosome [1]. This term has been adopted by the comparative genomics field to refer to contiguously aligned regions that preserve order and orientation of the alignment while allowing for micro-rearrangements within the syntenic region [2]. These genomics approaches introduce some quantitative properties of blocks, such as the size of blocks and resolution of synteny in bases.
There are a number of existing tools designed for finding synteny blocks. The GRIMM-Synteny [2] algorithm reconstructs an anchor graph from the predefined set of homologous hits shared by genomes, which can be local pairwise alignments or orthologous genes. The chains-and-nets algorithm [3] introduces a novel BLASTZ scoring scheme for identification of alignment anchors between 2 species. A chained alignment is built over an ordered sequence of traditional pairwise nucleotide alignments; then the set of chains is processed into nets using the chains with the highest score. DAGchainer [4] implements a directed acyclic graph (DAG)-based approach over predefined pairs of gene anchors. Satsuma [5] describes application of the fast Fourier transform algorithm over the signal represented by the nucleotide multiplication pattern. MCScanX [6] operates over the gene sets and applies tuned scoring schemes in the dynamic programming algorithm over chains of pairwise gene alignments. i-ADHoRe [7] introduces homology matrices to resolve homology among tandem replications of genes; furthermore, the Needleman-Wunsch algorithm is applied for detection of collinearity. SynChro [8] operates over reciprocal best hits (RBH) obtained from BLASTP alignments for reconstruction of the synteny block backbones. In comparison to GRIMM-Synteny, which allows local disruptions of synteny measured in genomic intervals, SynChro allows for an unlimited number of non-RBH genes but preserves the required number of intermediate RBH genes between first and last gene in a synteny block. The DRIMM-Synteny [9] algorithm implements application of A-Bruijn graphs over a set of predefined anchors. Analogously to GRIMM-Synteny it can be, e.g., local alignments or pairs of similar genes. In contrast to GRIMM-Synteny it provides a resistance against unwanted synteny disruption when a search is performed over multiple genomes and homologous anchors may be absent in a small proportion of the genomes analyzed. The SyMAP [10] algorithm computes the raw hits between nucleotide sequences of a pair of genomes, which are then clustered and filtered using the optional gene annotation. CYNTENATOR [11] uses phylogenetic information and performs progressive alignment of the gene order among multiple genomes.
These tools all require various data formats, which must be derived from the alignment, such as a predefined set of homologous genomic markers, or genome alignment blocks, each being a sequence of aligned bases that is contiguous in each of the genomes represented by the block. Many of them also require a rigorous and reliable annotation of orthologous genes. With halSynteny, the alignment is the only required input.
With the increased availability of large-scale computing facilities, multiple-vertebrate whole-genome alignment is now tractable. Multi-species genome alignments are a useful tool for analysis of species homology in large-scale comparative genomic projects [12,13]. One of the state-of-the-art tools [14–16] is Progressive Cactus [17,18] which produces reference-free all-to-all genome alignments.
By producing a single, reference-free multiple alignment, Cactus allows synteny block reconstruction between any 2 genomes without reference bias, directly from the HAL representation. Here we present halSynteny, a tool that implements a DAG-based algorithm for identification of synteny blocks directly from HAL alignment and reporting synteny blocks in PSL format [19].
Methods
Algorithm
We describe a heuristical algorithm that operates on a pair of selected genome assemblies in the HAL multiple alignment. A synteny block is a sequence of local alignments that in each genome maintain the following properties: (i) are on 1 chromosome, (ii) do not overlap, (iii) are on the same strand, and (iv) have chromosome sequence coordinates that are either monotonically increasing (for positive strand) or decreasing for negative strand [2]. The set of synteny blocks over a pair of genomes is parameterized by the lower bound of minimal block length bmin and maximal distance dmax between 2 sequential anchoring alignment blocks. The pair (bmin, dmax) can be regarded as a resolution of the synteny block.
Each gapless alignment block between the pair of genomes is represented with the start and end positions on the chromosomes, along with the strand. Duplications are expressed as overlapping alignment blocks. Presume that there are alignment hits in genome A that can be ordered by genomic coordinates as p1⋅⋅⋅pi, ⋅⋅⋅pn and in genome B as u1⋅⋅⋅ui, ui + 1, ⋅⋅⋅un + 1, and there are alignments present among segments pl and ul for l ∈ 1⋅⋅⋅i pk and uk + 1 for k ∈ i⋅⋅⋅n. Then the synteny blocks between these 2 genomes can overlap and contain the following pairs of segments: (pi, ui) for l ∈ 1⋅⋅⋅i and (pk, uk + 1) for k ∈ i⋅⋅⋅n.
The set of graph vertices V is formed by alignment blocks. Vertex vj is defined syntenic to vi if each genome maintains the same order and orientation, their corresponding genomic coordinates do not overlap, and the genomic distance in either genome between vi and vj, dij, does not exceed the maximal distance dmax defined by the synteny resolution. The set of graph edges E is formed by all pairs (vi, vj) such that vj is syntenic to vi. This results in the set of DAG subgraphs corresponding to synteny regions in the graph of alignment blocks. The desirable set of synteny blocks would contain as many continuous synteny blocks as possible covering as much of both genome sequences as possible. To achieve this goal we build the graph G = <V, E> and apply the following algorithm:
-
Initialize weight labels of vertices and edges:
Initialize the weight of each vertex as the absolute value of the difference between start and end coordinates in the target of query genomes, which is called the size of the corresponding alignment block.
The weight wi of each edge coming into a vertex vi is defined as the initial weight of the vertex vi.
-
Traverse the vertices in topological order.:
For each vertex vi consider all the edges (vi, vj) and for all incident vertices vj calculate the candidate weight update, defined as the weight of an edge coming into vj plus the weight of the preceding vertex vi.
If the candidate weight is greater than the current weight of vj, replace the weight of vj with this value.
If wj was updated, store the parent vertex vi for backtracking.
Find the vertex with maximal weight and trace back using stored previous vertices.
If the path built from vertices obtained at Step 3 is at least as long as the predefined minimal block length (defined by resolution), then remove them from the vertex set and store this path as a synteny block; else stop execution. If V is not empty, then go to Step 3.
As a result, we construct a set of paths that are possibly overlapping in genomic coordinates, so that each path covers as much of each genome as possible.
Evaluation of results
To assess the accuracy of this algorithm, we constructed synteny blocks between the domestic cat (FelCat 8.0) and domestic dog (CanFam 3.1) genomes based on Progressive Cactus alignment of these genomes, together with the human genome (GRCh38) as an outgroup.
The resulting synteny blocks cover 99% of the cat’s genome, while 81.7% of that agrees with the assignment of homologous chromosomes obtained by the chromosomal painting approach [20] (Fig. 1); halSynteny produced results different from the chromosomal painting results in the red regions of the cat chromosomes A1, B1, and C1. These regions were labeled as homologous to dog chromosome 28 with chromosome painting, while halSynteny revealed homology with chromosome 25. These regions comprise 3.5% of the constructed synteny.
Such discrepancies can stem from the different nature of the 2 approaches. While genome alignment depends on the accuracy of inferred genomic sequence, chromosomal painting results depend on the DNA composition and environment of a genomic region. Although chromosomal painting provides an efficient technique to discover large-scale similarity of continuous homology, it tends to misclassify small insertions and small translocations [21]. Also, it was reported that in cases of complex rearrangements chromosomal painting can be laborious and requires confirmation [22].
Comparison of gene-level orthology performed between the cat chromosomes A1, B1, and C1 and dog chromosomes 28 and 25 [23] using OMA Browser [24,25] support the conclusions of inference derived by halSynteny.
We also performed an evaluation of halSynteny performance in comparison with the SatsumaSynteny2 software [5, 26] because it is a modern method for synteny reconstruction based on inference directly from the genomic alignments (in contrast to anchor-based tools). A comparison was performed on the basis of the described protocol [14] for 2 datasets of genomes of nematodes: Caenorhabditis elegans (PRJNA13758) and Caenorhabditisbriggsae (PRJNA10731), Strongyloides ratti(PRJEB125) and Strongyloides stercoralis (PRJEB528). The time required for construction of the whole-genome alignments is not counted as part of halSynteny performance because such an alignment is needed for a realistic comparative genomic project separately. Such an alignment allows for investigation of sequence orthology, mapping of genomic markers among genomes, and other independent tasks. Finally it allows for better understanding of produced synteny blocks by uploading it into the UCSC Genome Browser [27,28].
Results are presented in Table 1. The results of SatsumaSynteny2 in terms of genome coverage are similar to the ones reported by the benchmark study [14] for the older version of the tool SatsumaSynteny [5]. It is possible to account for specific assembly qualities, such as the diverse size of assembly fragments, by adjusting the resolution parameters of halSynteny, which may result in an increase of genome coverage.
Table 1:
Genome | Assembly N50 (Mb) | Genome coverage (%) | Time required, min | ||
---|---|---|---|---|---|
SatsumaSynteny2 | halSynteny | SatsumaSynteny2 | halSynteny | ||
Strongyloides ratti | 11.7 | 55.6 | 72.5 | 1,232 | 18 (+496) |
Strongyloides stercoralis | 0.4 | 56.6 | 55.5 | ||
Caenorhabditis elegans | 17.5 | 20.0 | 92.5 | 547 | 74 (+496) |
Caenorhabditisbriggsae | 108.4 | 18.7 | 88.3 |
Comparison of run time and genome coverage of resulting synteny blocks between SatsumaSynteny2 and halSynteny. The former was run with extra parameter -threads 10. The latter was run with resolution parameters –maxAnchorDistance 1000000 (1 Mb), –minBlockSize 100000 (100 kb) for S. ratti/S. stercoralis, –maxAnchorDistance 1000000 (1Mb), –minBlockSize 1000000 (1 Mb) for C. elegans/C. briggsae. As a preliminary step for application of halSynteny, the whole-genome alignment among all 4 genomes was constructed using Progressive Cactus software, which took 496 minutes. The assemblies of C. elegans and C. briggsaeare of chromosomal level, while there are scaffold-level assemblies for S. rattiand S. stercoralis.
Discussion
Given an alignment of 2 genomes, information about their alignment with a third genome does not affect synteny between the original pair of genomes. Thus our approach can be scaled to the problem of multiple genome comparison without loss of precision. As a use case, given 3 genomes G1, G2, G3, where G1 is a reference genome, G2 is a genome of interest, and G3 is an outgroup genome, we can build synteny blocks between pairs of <G1, G2> and <G1, G3> and assign evolutionary breakages of lineages of genomes G1 and G2 using G3 as an outgroup.
halSynteny implements an algorithm for producing synteny blocks from genome alignment designed to process binary HAL files as input. The DAG-based method DAGchainer [4] was previously implemented for constructing synteny from the BLAST [29] alignments of gene annotations. It operates with homologous gene pairs found within complete genome sequences, combining them into chains of syntenic genes. The alignment-based method SatsumaSynteny2 takes pairs of genome sequences as input and implements a dynamic programming algorithm for chaining the pairwise alignment blocks. Here we first apply the DAG-based approach to whole-genome alignments. We define synteny for a pair of genomes, aiming for more accurate results obtained from multiple genome alignment. In comparison with the other modern alignment-based software, halSynteny allows for obtaining high-coverage results that follow from the definition of synteny. When the performance of halSynteny is compared to that of alignment-based software, halSynteny produces much higher genome coverage, which agrees with the properties of the dataset. These results are closer to the results of anchor-based tools reported in the benchmark study [14], while halSynteny does not require an intermediate genome annotation step. halSynteny can be installed as part of the halTools software essential for HAL file processing and can be a useful tool for analyzing whole-genome alignment data.
Availability of Supporting Source Code and Requirements
An archival copy of the code and other supporting data is available via the GigaScience database, GigaDB [23].
Project name: halSynteny
Project home page: https://github.com/ComparativeGenomicsToolkit/hal
Operating system(s): Linux
Programming language: C++11
Other requirements: HAL API
License: MIT
biotoolsID: biotools:halSynteny https://bio.tools/halSynteny
Additional Files
Figure 1: DAG fragments corresponding to the comparison of S. ratti and S. stercoralis
Abbreviations
API: Application Programming Interface; BLAST: Basic Local Alignment Search Tool; DAG: directed acyclic graph; kb: kilobase pairs; Mb: megabase pairs; NHGRI: National Human Genome Research Institute; RBH: reciprocal best hits; UCSC: University of California Santa Cruz.
Funding
This publication was supported by a Subagreement from European Molecular Biology Laboratory with funds provided by Agreement No. 2U41HG007234-05 from National Institutes of Health, NIHGR. It was also supported by the NIHGR of the National Institutes of Health under Award No. R01HG008742. Its content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, NIHGR, or European Molecular Biology Laboratory. K.K. was supported by RFBR grant, project No. 20-34-70055. K.K. and S.O. were supported, in part, by a Russian Science Foundation grant (project No. 17-14-01138) and by St. Petersburg State University (Genome Russia Grant No. 1.52.1647.2016).
Competing Interests
The authors declare that they have no competing interests.
Authors' Contributions
Method development: K.K., M.D., J.A.; Implementation and testing: K.K., M.D., J.A.; Data preparation: K.K.; Supervision: M.D., B.P., S.O.; Definition of research project: K.K.; Writing – review and editing: K.K., M.D., A.D., B.P., S.O.
Supplementary Material
ACKNOWLEDGEMENTS
We thank Pavel Dobrynin for the useful discussions about the experimental design and the reviewers for the valuable comments and suggestions that helped to improve the manuscript.
References
- 1. Hickey G, Paten B, Zerbino D, et al.. HAL: A hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics. 2013;29(10):1341–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Pevzner P, Tesler G. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res. 2003;13(1):37–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Kent WJ, Baertsch R, Hinrichs A, et al.. Evolution’s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003;100(20):11484–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Haas BJ, Delcher AL, Wortman JR, et al.. DAGchainer: A tool for mining segmental genome duplications and synteny. Bioinformatics. 2004;20(18):3643–6. [DOI] [PubMed] [Google Scholar]
- 5. Grabherr MG, Russell P, Meyer M, et al.. Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics. 2010;26(9):1145–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Wang Y, Tang H, Debarry JD, et al.. MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012;40(7):e49. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Proost S, Fostier J, De Witte D, et al.. i-ADHoRe 3.0-fast and sensitive detection of genomic homology in extremely large data sets. Nucleic Acids Res. 2012;40(2):e11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Drillon G, Carbone A, Fischer G. SynChro: A fast and easy tool to reconstruct and visualize synteny blocks along eukaryotic chromosomes. PLoS One. 2014;9(3):e92621. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Pham SK, Pevzner PA. DRIMM-Synteny: Decomposing genomes into evolutionary conserved segments. Bioinformatics. 2010;26(20):2509–16. [DOI] [PubMed] [Google Scholar]
- 10. Soderlund C, Bomhoff M, Nelson WM. SyMAP v3.4: A turnkey synteny system with application to plant genomes. Nucleic Acids Res. 2011;39(10):e68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Rödelsperger C, Dieterich C. CYNTENATOR: Progressive gene order alignment of 17 vertebrate genomes. PLoS One. 2010;5(1):e8861. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Lilue J, Doran AG, Fiddes IT, et al.. Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci. Nat Genet. 2018;50(11):1574–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Zhang G, Li C, Li Q, et al.. Comparative genomics reveals insights into avian genome evolution and adaptation. Science. 2014;346(6215):1311–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Liu D, Hunt M, Tsai IJ. Inferring synteny between genome assemblies: A systematic evaluation. BMC Bioinformatics. 2018;19(1):26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Dobrynin P, Liu S, Tamazian G, et al.. Genomic legacy of the African cheetah, Acinonyx jubatus. Genome Biol. 2015;16:277. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Choo SW, Rayko M, Tan TK, et al.. Pangolin genomes and the evolution of mammalian scales and immunity. Genome Res. 2016;26(10):1312–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Paten B, Earl D, Nguyen N, et al.. Cactus: Algorithms for genome multiple sequence alignment. Genome Res. 2011;21(9):1512–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Paten B, Diekhans M, Earl D, et al.. Cactus graphs for genome comparisons. J Comput Biol. 2011;18(3):469–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Schlenoff CI, Gruninger M, Tissot F, et al.. The Process Specification Language (PSL) overview and version 1.0 specification. NIST Interagency/Internal Report (NISTIR) - 6459 2004, doi: 10.6028/NIST.IR.6459. [DOI] [Google Scholar]
- 20. Yang F, Graphodatsky A, O’Brien P,et al.. Reciprocal chromosome painting illuminates the history of genome evolution of the domestic cat, dog and human. Chromosome Res. 2000;8(5):393–404. [DOI] [PubMed] [Google Scholar]
- 21. Lee C, Gisselsson S, Jin C, et al.. Limitations of chromosome classification by multicolor karyotyping. Am J Hum Genet. 2001;68(4):1043–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Ried T, Schröck T, Ning Y, et al.. Chromosome painting: a useful art. Hum Mol Genet. 1998;7(10):1619–26. [DOI] [PubMed] [Google Scholar]
- 23. Krasheninnikova K, Diekhans M, Armstrong J, et al.. Supporting data for ”halSynteny: a fast, easy-to-use conserved synteny block construction method for multiple whole-genome alignments.”. GigaScience Database. 2020;. 10.5524/100740. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. OMA Browser. https://omabrowser.org/. Accessed on 21 May 2020. [Google Scholar]
- 25. Altenhoff AM, Glover N, Train CM, et al.. The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces. Nucleic Acids Res. 2018;46(D1):D477–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Satsuma2. https://github.com/bioinfologics/satsuma2. Accessed 21st May 2020 [Google Scholar]
- 27. Kent WJ, Sugnet CW, Furey TS, et al.. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Raney BJ, Dreszer TR, Barber GP, et al.. Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser. Bioinformatics. 2014;30(7):1003–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Altschul SF, Gish W, Miller W, et al.. Basic Local Alignment Search Tool. J Mol Biol. 1990;215(3):403–10. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.