Abstract
Ernst Haeckel based his landmark Tree of Life on the supposed ontogenic recapitulation of phylogeny, i.e. that successive embryonic stages during the development of an organism re-trace the morphological forms of its ancestors over the course of evolution. Much of this idea has since been discredited. Today, phylogenies are often based on molecular sequences. A typical phylogenetic inference aims to capture and represent, in the form of a tree, the evolutionary history of a family of molecular sequences. The standard approach starts with a multiple sequence alignment, in which the sequences are arranged relative to each other in a way that maximises a measure of similarity position-by-position along their entire length. However, this approach ignores important evolutionary processes that are known to shape the genomes of microbes (bacteria, archaea and some morphologically simple eukaryotes). Recombination, genome rearrangement and lateral genetic transfer undermine the assumptions that underlie multiple sequence alignment, and imply that a tree-like structure may be too simplistic. Here, using the sequences of 143 bacterial and archaeal genomes, we construct a network of phylogenetic relatedness based on the number of shared k-mers (subsequences of fixed length k). Our findings suggest that the network captures not only key aspects of microbial genome evolution as inferred from a tree, but also features that are not treelike. The method is highly scalable, allowing for investigation of genome evolution across a large number of genomes. Instead of using specific regions or sequences from genome sequences, or indeed Haeckel’s idea of ontogeny, we argue that genome phylogenies can be inferred using k-mers from whole-genome sequences. Representing these networks dynamically allows biological questions of interest to be formulated and addressed quickly and in a visually intuitive manner.
Keywords: phylogenies, phylogenetic trees, phylogenetic networks, k-mers
Introduction
Ernst Haeckel coined the term Phylogenie to describe the series of morphological stages in the evolutionary history of an organism or group of organisms 1. In his Tree of Life published 150 years ago 2, Haeckel postulated that living organisms trace their evolutionary origin(s) along three distinct lineages (Plantae, Protista and Animalia) to a “common Moneran root of autogonous organisms”. In some (but not all) later works (e.g. in 1868 3) he allowed that different Monera may have arisen independently by spontaneous generation. Either way, these views accord with the Lamarckian notion of a built-in direction of evolution from morphologically simple “lower” organisms to more-complex “higher” forms 4.
Haeckel through his “Biogenetic Law” advocated that “ontogeny recapitulates phylogeny” 2: that the embryonic series of an organism is a record of its evolutionary history. Under this view, morphologies observed at different developmental stages of an organism resemble and represent the successive stages (including adult stages) of its ancestors over the course of evolution. Of course, he worked before the advent of genetics and the modern synthesis, and before it was appreciated that information on heredity is carried by DNA and can be recovered by sequencing and statistical analysis. He could not have foreseen that these DNA sequences code for other biomolecules and control life processes, including his beloved developmental series and organismal phenotype, through vastly complex molecular webs of interactions. Nor could Haeckel have envisaged the scale of phylogenetic analysis that can be carried out today using these DNA sequences across multiple genomes, made possible by the advent of high-throughput sequencing and computing technologies.
Fast-forwarding 150 years, phylogenetic inference based on comparative analysis of biological sequences is now common practice. The similarity among sequences is commonly interpreted as evidence of homology 5, 6, i.e. that they share a common ancestry. From the earliest days of molecular phylogenetics, multiple sequences have been aligned 7, 8 to display this homology position-by-position along the length of the sequences. That is, the residues are arranged relative to each other such that the best available hypothesis of homology is achieved at every position (column) of the alignment. By default, it is assumed that the best alignment can be achieved simply by displaying the sequences in the same direction, and inserting gaps where needed (to represent insertions and deletions). This assumption is largely valid when working with exons or proteins of morphologically complex eukaryotes. However, in microbes this assumption is violated by commonplace evolutionary processes including genome rearrangement, genetic recombination and lateral genetic transfer 9–14. These scenarios cannot be captured simply in a tree or tree-like representation of evolutionary relationships. As Haeckel observed when he drew his Tree 2, biological evolution can be anything but straightforward, and the picture has since grown ever more complicated 15, 16.
Alternative approaches for inferring and representing phylogenies are available. An attractive strategy that addresses the issue of full-length alignability is to compute relatedness among a set of sequences based on the number or extent of k-mers (short sub-sequences of a fixed length k) that they share. Such approaches avoid multiple sequence alignment, and for this reason are termed alignment-free. As opposed to heuristics in multiple sequence alignment, these methods provide exact solutions. Various modifications are available, e.g. the use of degenerate k-mers, scoring match lengths rather than k-mer composition, and grammar-based techniques; see recent reviews 17, 18 for more detail. Importantly, evolutionary relationships can also be depicted as a network, with taxa and relationships represented respectively as nodes and edges 19–21, rather than as a strictly bifurcating tree. Using simulated and empirical sequence data, we recently demonstrated that alignment-free approaches can yield phylogenetic trees that are biologically meaningful 22–24. We found that these approaches are more robust to genome rearrangement and lateral genetic transfer, and are highly scalable 22, 23, a much-desired feature given the current deluge of sequence data facing the research community 25. Here we apply these alignment-free approaches to 143 bacterial and archaeal genomes to generate a network of phylogenetic relatedness, and assess the biological implications of this network relative to the phylogenetic tree.
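The idea of scoring relatedness by shared k-mers can be illustrated with a minimal Python sketch. It is not the statistic used in this study: it simply reports the fraction of distinct k-mers that two sequences have in common (a Jaccard-style measure chosen here for illustration). The function names and the toy sequences are hypothetical. The second sequence is the first with its two halves swapped, mimicking a rearrangement that a colinear alignment would handle poorly.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of distinct k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(a: str, b: str, k: int) -> float:
    """Fraction of distinct k-mers shared by a and b (Jaccard-style).

    This is an illustrative measure, not the normalised statistic
    computed by jD2Stat.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


# Toy sequences (hypothetical): b is a with its two halves swapped.
a = "ACGTACGGTCAATTGC"
b = "TCAATTGCACGTACGG"
shared = shared_kmer_fraction(a, b, k=4)
print(f"fraction of shared 4-mers: {shared}")
```

Most 4-mers survive the swap, because only those spanning the breakpoint are lost; this is the intuition behind the robustness of k-mer methods to rearrangement.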
Methods
Using 143 complete genomes of Bacteria and Archaea 22, we inferred the relatedness of these genome sequences using an alignment-free method based on the D2S statistic 26, 27. We computed a distance d for each possible pair of the 143 genomes based on the presence of shared 25-mers, using jD2Stat version 1.0 (http://bioinformatics.org.au/tools/jD2Stat/) 23 and following Bernard et al. 22. Here the distance d is normalised based on genome sizes and the probabilities that corresponding k-mers occur in the compared sequences 26, 27; d ranges between 0.0 (i.e. the two genomes are identical) and 15.5 (< 0.0001% of 25-mers are shared between the two genomes). For a pair of genomes a and b, we transformed d_ab into a similarity measure S_ab, in which S_ab = 10 − d_ab. We ignored instances of d > 10, as these pairs of sequences share ≤ 0.01% of 25-mers (i.e. there is little evidence of homology). To visualise the phylogenetic relatedness of these genomes, we adopted the D3 JavaScript library for data-driven documents (https://d3js.org/). In this network, each node represents a genome, and an edge connecting two nodes represents the qualitative evidence of shared k-mers between them. We set a threshold t such that only edges with S ≥ t are displayed on the screen. Changing t dynamically changes the network structure. The resulting dynamic network is available at http://bioinformatics.org.au/tools/AFnetwork/.
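The transformation from distance to similarity and the thresholding of edges can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind the published network (which is rendered with D3): the function names, the dictionary layout and the toy distances are hypothetical; only the rules S = 10 − d, the exclusion of pairs with d > 10, and the display criterion S ≥ t come from the text.

```python
D_MAX = 10.0  # pairs with d > 10 are ignored, as described in the text


def similarity(d: float) -> float:
    """Transform a pairwise distance d into the similarity S = 10 - d."""
    return D_MAX - d


def network_edges(dist: dict, t: float) -> list:
    """Return sorted (a, b, S) for each pair with d <= 10 and S >= t.

    dist maps an unordered genome pair (a, b) to its distance d.
    """
    edges = []
    for (a, b), d in dist.items():
        if d > D_MAX:
            continue  # little evidence of homology; no edge at any t
        s = similarity(d)
        if s >= t:
            edges.append((a, b, s))
    return sorted(edges)


# Toy distances (hypothetical values, not taken from the study).
toy_d = {("g1", "g2"): 1.5, ("g1", "g3"): 8.0, ("g2", "g3"): 12.0}

for t in (0.0, 3.0):
    print(t, network_edges(toy_d, t))
```

Raising t from 0 to 3 removes the weak g1-g3 edge (S = 2.0) while the strong g1-g2 edge (S = 8.5) persists; the g2-g3 pair never appears because its distance exceeds 10.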
Results and discussion
Figure 1 shows the phylogenetic tree of the 143 Bacteria and Archaea genomes that we previously inferred using an alignment-free method based on the D2S statistic 26, 27. In an earlier study 9, a supertree was generated for these genomes, summarising 22,432 protein phylogenies. Incongruence between the two trees was observed in 42% of the bipartitions, most of which are at terminal branches 22. The alignment-free tree (Figure 1) recovers 13 of the 15 “backbone” nodes 9, distinct clades of Archaea and Bacteria, a monophyletic clade of Proteobacteria, and the lack of resolution between gamma- and beta-Proteobacteria, in agreement with previously published studies; as such, this tree is consistent with microbial phylogeny as presently understood, i.e. is biologically plausible.
Figure 1. The alignment-free phylogenetic tree topology of the 143 Bacteria and Archaea genomes based on the D2S statistic, modified from the tree in Bernard et al. 22; jackknife support at each internal node is shown.
Each phylum is represented in a distinct colour, and the backbone nodes identified in Beiko et al. 9 are marked on the internal nodes with filled black circles. The association of Coxiella burnetii and Nitrosomonas europaea is marked with an asterisk.
Figure 2 shows the network of phylogenetic relatedness of the same 143 genomes; a dynamic view of this network is available at http://bioinformatics.org.au/tools/AFnetwork/. As in our tree (Figure 1), Archaea and Bacteria form two separate paracliques; even at t = 0, we found only one archaeal isolate (the euryarchaeote Methanocaldococcus jannaschii DSM 2661) linked to the bacterial groups Thermotogales and Aquificales 22. By t = 3, most of the 14 phyla have formed distinct densely connected subgraphs in our network, e.g. Cyanobacteria and Chlamydiales form cliques at t = 1.5 and all subgroups of Proteobacteria form a large paraclique with the Firmicutes at t = 2. Four Escherichia coli and two Shigella isolates, known to be closely related, form a clique up to t = 8.5. Interestingly, this network also showcases the extent to which genomic regions are shared among diverse phyla, e.g. the high extent of genetic similarity among Proteobacteria versus the low extent between Chlamydiales and Cyanobacteria. Our observations largely agree with published studies 9, 22, but also highlight the inadequacy of representing microbial phylogeny as a tree. For instance, in the tree Coxiella burnetii, a member of the gamma-Proteobacteria, is grouped with Nitrosomonas europaea of the beta-Proteobacteria (marked with an asterisk in Figure 1); in the network, the strongest connection of C. burnetii is with Wigglesworthia glossinidia, a member of the gamma-Proteobacteria (marked with an asterisk in Figure 2) at t = 2. Both W. glossinidia and C. burnetii are parasites; the W. glossinidia genome (0.7 Mbp) is highly reduced 28 and the C. burnetii genome (2 Mbp) is proposed to be undergoing reduction 29. As both the tree (Figure 1) and network presented here were generated using the same alignment-free method, the contradictory position of C. burnetii is likely caused by the neighbour-joining algorithm used for tree inference 22. In this scenario, the C. burnetii genome connects with N. europaea because it shares high similarity with N. europaea and Neisseria genomes of the beta-Proteobacteria (S between 1.43 and 1.68), second only to W. glossinidia (S = 2.05), and because it shares little or no similarity with other genomes of gamma-Proteobacteria that are closely related to W. glossinidia, i.e. Buchnera aphidicola isolates (average S = 0.63) and “Candidatus Blochmannia floridanus” (S = 0).
Figure 2. Alignment-free phylogenetic network of the 143 Bacteria and Archaea genomes based on the D2S statistic using 25-mers, at t = 2.
Each phylum is represented in a distinct colour, each node represents a genome and an edge represents qualitative evidence of shared 25-mers between two genomes. The association between Coxiella burnetii and Wigglesworthia glossinidia is marked with an asterisk.
By changing the threshold t, we can dynamically visualise changes in the network structure. These changes are not random, but appear to correlate with the evolutionary history of the species. At t = 0, Archaea and Bacteria form two distinct paracliques, linked only by two edges, and the Planctomycetes isolate forms a singleton. When we increase t from 1 to 2, the Archaea and Bacteria paracliques quickly dissociate from each other; within the Bacteria, cliques of Chlamydiales and Cyanobacteria are formed and the Spirochaetales become isolated. Going from t = 2 to t = 3 we observe a scission between Firmicutes and Proteobacteria, and at t > 3 the classes of Proteobacteria start to form their respective paracliques. The separation (as t is incremented) of a densely connected subgraph involving all representatives of a phylum from the rest of the network mimics the divergence of this phylum from a common ancestor. Because the similarity measures do not have a unit (such as number of substitutions per site), it is not straightforward to interpret S as an evolutionary rate or divergence time. However, our findings suggest that our alignment-free network yields snapshots of biologically meaningful evolutionary relationships among these genomes, and that increasing the threshold based on the proportion of shared k-mers recapitulates the progressive separation of genomic lineages in evolution.
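The way groups separate as t is incremented can be reproduced on a toy network. The Python sketch below is an illustration only: it uses connected components (via a small union-find) as a simple stand-in for the paracliques described above, together with a strict clique check. All node names and similarity values are hypothetical and are not taken from the 143-genome dataset.

```python
def components(nodes, sim, t):
    """Connected components of the graph that keeps only edges with S >= t."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (a, b), s in sim.items():
        if s >= t:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


def is_clique(group, sim, t):
    """True if every pair of members is joined by an edge with S >= t."""
    members = sorted(group)
    for i, a in enumerate(members):
        for b in members[i + 1:]:
            s = sim.get((a, b), sim.get((b, a)))
            if s is None or s < t:
                return False
    return True


# Toy similarities (hypothetical): two "archaeal" and three "bacterial" nodes.
nodes = ["A1", "A2", "B1", "B2", "B3"]
sim = {
    ("A1", "A2"): 4.0,
    ("B1", "B2"): 8.5,
    ("B1", "B3"): 2.5,
    ("B2", "B3"): 2.5,
    ("A1", "B3"): 0.5,  # a single weak link between the two groups
}

for t in (0, 1, 3, 5):
    print(t, components(nodes, sim, t))
```

In this toy example the two groups are joined at t = 0, separate at t = 1, and the bacterial group itself splits at t = 3, leaving only the most similar pair connected.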
The alignment-free network reconstructed using whole-genome sequences thus recovers phylogenetic signals that cannot be captured in a binary tree. Using this approach, we generated the network in < 30 minutes; a whole-genome alignment of 143 sequences would have taken days, and even then, the alignment would be difficult to interpret given the genome dynamics in Bacteria and Archaea 9–14. One can imagine inferring a network of thousands of microbial genomes in a few hours using distributed computing. More importantly, the network can be visualised dynamically, explored interactively and shared.
Other biological questions could be addressed by linking the k-mers to their genomic locations and annotated genome features, e.g. in a relational database 30. For instance, we could use such a database to compare thousands of isolates and identify core gene functions for a specific phylum or genus, or exclusive versus non-exclusive functions in bacterial pathogens, in a matter of seconds. We can also use k-mers to quickly search for biological information e.g. functions relevant to lateral genetic transfer, recombination or duplications.
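A minimal sketch of such a k-mer database, using Python's built-in sqlite3 module, is given here for illustration. The schema, function names and toy sequences are hypothetical and do not describe the system in reference 30; linking positions to annotated features would require an additional table that is omitted.

```python
import sqlite3


def build_index(genomes: dict, k: int) -> sqlite3.Connection:
    """Store every k-mer with its genome name and 0-based start position."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE kmer (seq TEXT, genome TEXT, pos INTEGER)")
    rows = [
        (s[i:i + k], name, i)
        for name, s in genomes.items()
        for i in range(len(s) - k + 1)
    ]
    con.executemany("INSERT INTO kmer VALUES (?, ?, ?)", rows)
    con.execute("CREATE INDEX kmer_seq ON kmer (seq)")
    return con


def core_kmers(con: sqlite3.Connection, n_genomes: int) -> list:
    """Return the k-mers that occur in all n_genomes genomes."""
    cur = con.execute(
        "SELECT seq FROM kmer GROUP BY seq "
        "HAVING COUNT(DISTINCT genome) = ? ORDER BY seq",
        (n_genomes,),
    )
    return [row[0] for row in cur]


def locations(con: sqlite3.Connection, seq: str) -> list:
    """Return (genome, position) for every occurrence of one k-mer."""
    cur = con.execute(
        "SELECT genome, pos FROM kmer WHERE seq = ? ORDER BY genome, pos",
        (seq,),
    )
    return cur.fetchall()


# Toy genomes (hypothetical sequences).
toy = {"g1": "ACGTAC", "g2": "TTACGT", "g3": "GACGTT"}
con = build_index(toy, k=4)
print(core_kmers(con, len(toy)))
print(locations(con, "ACGT"))
```

The same grouping query, restricted to the genomes of one phylum or to pathogens versus non-pathogens, would give the core or exclusive k-mers for that set.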
In contrast to Haeckel’s “Biogenetic Law”, k-mers used in this way recapitulate phylogenetic signal, not ontogeny. Alignment-free approaches generate a biologically meaningful phylogenetic inference, and are highly scalable. More importantly, representing alignment-free phylogenetic relationships using a network captures aspects of evolutionary histories that are not possible in a tree. As more genome data become available, Haeckel’s goal of depicting the History of Life is closer to reality.
Data availability
The data referenced by this article are under copyright with the following copyright statement: Copyright: © 2016 Bernard G et al.
The 143 Bacteria and Archaea genomes used in this work are the same dataset used in an earlier study 22, available at http://dx.doi.org/10.14264/uql.2016.908 31. The dynamic phylogenetic network of these genomes is available at http://bioinformatics.org.au/tools/AFnetwork, with the source code available at http://dx.doi.org/10.14264/uql.2016.952 32.
Funding Statement
We acknowledge funding support from the Australian Research Council (DP150101875) awarded to MAR and CXC, and a James S. McDonnell Foundation grant awarded to MAR.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; referees: 1 approved
References
- 1. Dayrat B: The roots of phylogeny: how did Haeckel build his trees? Syst Biol. 2003;52(4):515–27. doi:10.1080/10635150390218277
- 2. Haeckel E: Generelle Morphologie der Organismen. Allgemeine Grundzüge der organischen Formen-Wissenschaft, mechanisch begründet durch die von Charles Darwin reformirte Descendenztheorie. Bd. 1 und 2. Berlin: Reimer;1866. doi:10.5962/bhl.title.3953
- 3. Haeckel E: Natürliche Schöpfungsgeschichte. Berlin: Reimer;1868.
- 4. Burkhardt RW, Jr: Lamarck, evolution, and the inheritance of acquired characters. Genetics. 2013;194(4):793–805. doi:10.1534/genetics.113.151852
- 5. Fitch WM: Homology: a personal view on some of the problems. Trends Genet. 2000;16(5):227–31. doi:10.1016/S0168-9525(00)02005-9
- 6. Hall BK: Homology: the hierarchical basis of comparative biology. San Diego: Academic Press;1994.
- 7. Notredame C: Recent progress in multiple sequence alignment: a survey. Pharmacogenomics. 2002;3(1):131–44. doi:10.1517/14622416.3.1.131
- 8. Notredame C: Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol. 2007;3(8):e123. doi:10.1371/journal.pcbi.0030123
- 9. Beiko RG, Harlow TJ, Ragan MA: Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A. 2005;102(40):14332–7. doi:10.1073/pnas.0504068102
- 10. Dagan T, Martin W: The tree of one percent. Genome Biol. 2006;7(10):118. doi:10.1186/gb-2006-7-10-118
- 11. Darling AE, Miklós I, Ragan MA: Dynamics of genome rearrangement in bacterial populations. PLoS Genet. 2008;4(7):e1000128. doi:10.1371/journal.pgen.1000128
- 12. Doolittle WF: Phylogenetic classification and the universal tree. Science. 1999;284(5423):2124–9. doi:10.1126/science.284.5423.2124
- 13. Koonin EV: Horizontal gene transfer: essentiality and evolvability in prokaryotes, and roles in evolutionary transitions [version 1; referees: 2 approved]. F1000Res. 2016;5: pii: F1000 Faculty Rev-1805. doi:10.12688/f1000research.8737.1
- 14. Puigbò P, Lobkovsky AE, Kristensen DM, et al.: Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol. 2014;12:66. doi:10.1186/s12915-014-0066-4
- 15. Adl SM, Simpson AG, Lane CE, et al.: The revised classification of eukaryotes. J Eukaryot Microbiol. 2012;59(5):429–93. doi:10.1111/j.1550-7408.2012.00644.x
- 16. Spang A, Saw JH, Jørgensen SL, et al.: Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature. 2015;521(7551):173–9. doi:10.1038/nature14447
- 17. Bonham-Carter O, Steele J, Bastola D: Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform. 2014;15(6):890–905. doi:10.1093/bib/bbt052
- 18. Haubold B: Alignment-free phylogenetics and population genetics. Brief Bioinform. 2014;15(3):407–18. doi:10.1093/bib/bbt083
- 19. Corel E, Lopez P, Méheust R, et al.: Network-thinking: graphs to analyze microbial complexity and evolution. Trends Microbiol. 2016;24(3):224–37. doi:10.1016/j.tim.2015.12.003
- 20. Dagan T: Phylogenomic networks. Trends Microbiol. 2011;19(10):483–91. doi:10.1016/j.tim.2011.07.001
- 21. Huson DH, Scornavacca C: A survey of combinatorial methods for phylogenetic networks. Genome Biol Evol. 2011;3:23–35. doi:10.1093/gbe/evq077
- 22. Bernard G, Chan CX, Ragan MA: Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Sci Rep. 2016;6:28970. doi:10.1038/srep28970
- 23. Chan CX, Bernard G, Poirion O, et al.: Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Rep. 2014;4:6504. doi:10.1038/srep06504
- 24. Ragan MA, Bernard G, Chan CX: Molecular phylogenetics before sequences: oligonucleotide catalogs as k-mer spectra. RNA Biol. 2014;11(3):176–85. doi:10.4161/rna.27505
- 25. Chan CX, Ragan MA: Next-generation phylogenomics. Biol Direct. 2013;8:3. doi:10.1186/1745-6150-8-3
- 26. Reinert G, Chew D, Sun F, et al.: Alignment-free sequence comparison (I): statistics and power. J Comput Biol. 2009;16(12):1615–34. doi:10.1089/cmb.2009.0198
- 27. Wan L, Reinert G, Sun F, et al.: Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol. 2010;17(11):1467–90. doi:10.1089/cmb.2010.0056
- 28. Akman L, Yamashita A, Watanabe H, et al.: Genome sequence of the endocellular obligate symbiont of tsetse flies, Wigglesworthia glossinidia. Nat Genet. 2002;32(3):402–7. doi:10.1038/ng986
- 29. Seshadri R, Paulsen IT, Eisen JA, et al.: Complete genome sequence of the Q-fever pathogen Coxiella burnetii. Proc Natl Acad Sci U S A. 2003;100(9):5455–60. doi:10.1073/pnas.0931379100
- 30. Greenfield P, Roehm U: Answering biological questions by querying k-mer databases. Concurr Comput Pract Exper. 2013;25(4):497–509. doi:10.1002/cpe.2938
- 31. Bernard G, Chan CX, Ragan MA: 143 Prokaryote genomes. Dataset. 2016. doi:10.14264/uql.2016.908
- 32. Bernard G, Chan CX, Ragan MA: Alignment-free network of 143 prokaryote genomes. Dataset. 2016. doi:10.14264/uql.2016.952


