Summary
In the past decade, the number of publicly available bacterial genomes has increased dramatically. These genomes have been generated for impactful initiatives, especially in the field of genomic epidemiology (Brown, Dessai, McGarry, & Gerner-Smidt, 2019; Timme et al., 2017). Genomes are sequenced, shared publicly, and subsequently analyzed for phylogenetic relatedness. If two genomes of epidemiological interest are found to be related, further investigation might be prompted. However, comparing the multitudes of genomes for phylogenetic relatedness is computationally expensive and, with large numbers, laborious. Consequently, there are many strategies to reduce the complexity of the data for downstream analysis, especially using nucleotide stretches of length k (kmers).
One major kmer strategy is to reduce each genome to split kmers. With split kmer analysis, the kmers on both sides of a variable site are recorded, and the variable nucleotide between them is identified. When two or more genomes are compared, only these variable sites are compared. Split kmers have been implemented in software packages such as kSNP and SKA (Gardner, Slezak, & Hall, 2015; Harris, 2018).
Another major kmer strategy is to convert genomic data into manageable datasets, usually called sketches (Baker & Langmead, 2019; Ondov et al., 2016; Zhao, 2018). Most notably, an algorithm called min-hash was implemented in the Mash package (Ondov et al., 2016). In the min-hash algorithm, all kmers are recorded and transformed into integers using hashing and a Bloom filter (Bloom, 1970). These hashed kmers are sorted, and only a fixed number from the top of the sorted list is retained. The retained hashed kmers are collectively called the sketch, and their number is the sketch size. Any two sketches can be compared by counting how many hashed kmers they have in common.
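The sketching and comparison steps described above can be illustrated with a minimal Python sketch. It is not Mash's implementation: BLAKE2b stands in for the hash function that Mash uses, reverse-complement (canonical) kmers and the Bloom filter are ignored, and the function names are hypothetical. The distance formula is the Mash distance given by Ondov et al. (2016).

```python
import hashlib
import math


def kmer_hashes(sequence, k):
    """Return the set of 64-bit integer hashes of every kmer in the sequence."""
    hashes = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: BLAKE2b is only a stand-in for Mash's own hash function.
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(sequence, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest distinct kmer hashes, sorted."""
    return sorted(kmer_hashes(sequence, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from the `size` smallest hashes of both sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k):
    """Mash distance (Ondov et al., 2016): -(1/k) * ln(2j / (1 + j))."""
    if jaccard == 0:
        return 1.0  # no shared hashes: report the maximum distance
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)
```

Two identical sequences share every retained hash and therefore have a distance of zero, whereas two sequences with no kmers in common have a distance of one.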
Because min-hash yields a distance between any two genomes, these distances can be used to rapidly cluster genomes into trees using the neighbor-joining algorithm (Saitou & Nei, 1987). We implemented this idea in software called Mashtree, which quickly and efficiently generates large trees that would be too computationally intensive to build with other methods.
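To make the clustering step concrete, the following is a minimal Python sketch of neighbor-joining (Saitou & Nei, 1987) on a symmetric distance matrix. It is written for illustration only and is not the code used by Mashtree, which delegates this step to QuickTree. The function name is hypothetical, at least two taxa are assumed, and the placement of the final root is arbitrary because neighbor-joining produces an unrooted tree.

```python
def neighbor_joining(names, dist):
    """Return a Newick string built by neighbor-joining from a distance matrix."""
    labels = list(names)            # Newick label of each active node
    d = [row[:] for row in dist]    # working copy of the distance matrix
    while len(labels) > 2:
        n = len(labels)
        totals = [sum(row) for row in d]
        # Choose the pair (i, j) that minimizes the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Branch lengths from the new internal node to i and to j.
        li = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        keep = [m for m in range(n) if m not in (i, j)]
        # Distances from the new internal node to every remaining node.
        new_row = [0.5 * (d[i][m] + d[j][m] - d[i][j]) for m in keep]
        new_label = f"({labels[i]}:{li:.6f},{labels[j]}:{lj:.6f})"
        # Rebuild the matrix without i and j, with the new node appended last.
        d = [[d[a][b] for b in keep] + [new_row[x]] for x, a in enumerate(keep)]
        d.append(new_row + [0.0])
        labels = [labels[m] for m in keep] + [new_label]
    # Two nodes remain; split the last branch in half (arbitrary root placement).
    half = d[0][1] / 2
    return f"({labels[0]}:{half:.6f},{labels[1]}:{half:.6f});"
```

The nested loop over pairs is repeated at every join, so this version scales cubically with the number of genomes; it is meant to show the logic rather than to be fast.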
Implementation
Workflow
Mashtree builds on two major algorithms that are already implemented in other software packages. The first is the min-hash algorithm, which is implemented in the software Mash (Ondov et al., 2016). Mashtree uses Mash to create sketches of the genomes with the function mash sketch. We elected to keep most default Mash parameters but increased the sketch size (number of hashed kmers) from 1,000 to 10,000 to increase discriminatory power. Then, Mashtree uses Mash to calculate the distances between genomes with mash dist and records these distances in a pairwise distance matrix. The second is the neighbor-joining (NJ) algorithm, which is implemented in the software QuickTree (Howe, Bateman, & Durbin, 2002). The Mash distance matrix is passed to QuickTree, run with default options, to generate a dendrogram. The workflow is depicted in Figure 1.
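The steps of this workflow can be outlined in Python as follows. This is a hedged illustration rather than Mashtree's own Perl code: the function names are hypothetical, the sketch-size option is assumed to be the -s flag of mash sketch, the output of mash dist is assumed to be tab-separated with the reference, the query, and the distance in the first three columns, and QuickTree is assumed to accept a PHYLIP-style distance matrix through its -in m and -out t options.

```python
def sketch_command(genome_path, out_prefix, sketch_size=10000):
    """Build a `mash sketch` command line with the enlarged sketch size."""
    return ["mash", "sketch", "-s", str(sketch_size), "-o", out_prefix, genome_path]


def quicktree_command(matrix_path):
    """Build a QuickTree command line: distance matrix in, tree out."""
    return ["quicktree", "-in", "m", "-out", "t", matrix_path]


def parse_mash_dist(text):
    """Parse tab-separated `mash dist` output into {(reference, query): distance}."""
    distances = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        ref, query, dist = line.split("\t")[:3]
        distances[(ref, query)] = float(dist)
    return distances


def phylip_matrix(names, distances):
    """Format pairwise distances as a square PHYLIP-style distance matrix."""
    lines = [f"    {len(names)}"]
    for a in names:
        row = []
        for b in names:
            if a == b:
                row.append(0.0)
            else:
                # Distances are symmetric, so accept either key order.
                row.append(distances.get((a, b), distances.get((b, a))))
        lines.append(f"{a} " + " ".join(f"{x:.6f}" for x in row))
    return "\n".join(lines) + "\n"
```

Each command list could be passed to subprocess.run, with the text returned by phylip_matrix written to the file that is then given to QuickTree.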
Confidence values
Although Mashtree does not infer phylogeny, we have borrowed the ideas behind phylogenetic confidence values to yield confidence values for each parent node in the tree. There are two resampling methods implemented in Mashtree to assign support values to internal nodes: bootstrapping and jackknifing. Initially, both methods create a tree as depicted in Figure 1. Then, confidence values can be calculated for the tree using either the bootstrapping approach or the jackknifing approach (Figures 2 and 3).
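The general idea behind such support values can be sketched in Python: for each internal node of the main tree, count how often the same group of leaves appears in trees built from resampled data. The resampling itself (Figures 2 and 3) is not reproduced here. The function name and the representation of each tree as a collection of leaf sets are assumptions made for illustration, and a careful treatment of unrooted trees would compare bipartitions rather than clades.

```python
def support_values(main_clades, replicate_trees):
    """Percentage of replicate trees that contain each clade of the main tree.

    main_clades: iterable of frozensets of leaf names, one per internal node.
    replicate_trees: non-empty list of sets of frozensets, one set per replicate.
    """
    support = {}
    for clade in main_clades:
        hits = sum(1 for clades in replicate_trees if clade in clades)
        support[clade] = 100.0 * hits / len(replicate_trees)
    return support
```

A clade that is recovered in every replicate receives a value of 100, and a clade that is never recovered receives 0.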
Other features
Mashtree has several other useful features. First, Mashtree can read any common sequence file type and can read gzip-compressed files (e.g., fastq, fastq.gz, fasta). This makes it compatible with a wide variety of databases and with space-saving file compression. Second, Mashtree takes advantage of multithreading. The number of requested threads determines how many genomes are sketched at the same time and how many sketches are compared at the same time. When more threads are requested than there are operations to parallelize, Mashtree relies on the multithreading already built into the Mash sketch and distance steps. Third, Mashtree uses an SQLite database, which can be used to cache results between runs.
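The caching idea can be illustrated with Python's built-in sqlite3 module. The table layout, column names, and function names below are hypothetical and are not taken from Mashtree's actual database schema; the sketch only shows how a stored distance can be reused instead of being recomputed.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS distance ("
        " genome1 TEXT NOT NULL, genome2 TEXT NOT NULL, value REAL NOT NULL,"
        " PRIMARY KEY (genome1, genome2))"
    )
    return conn


def cached_distance(conn, a, b, compute):
    """Return the stored distance for (a, b), computing and storing it if absent."""
    key = tuple(sorted((a, b)))  # distances are symmetric, so store one ordering
    row = conn.execute(
        "SELECT value FROM distance WHERE genome1 = ? AND genome2 = ?", key
    ).fetchone()
    if row is not None:
        return row[0]
    value = compute(a, b)
    conn.execute("INSERT INTO distance VALUES (?, ?, ?)", key + (value,))
    conn.commit()
    return value
```

With a file path in place of the in-memory default, the stored distances persist, so a later run on an overlapping set of genomes would only need to compute the missing pairs.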
Installation
The Mashtree package is written in Perl and is available in the CPAN repository. Documentation can be found at https://github.com/lskatz/mashtree.
Acknowledgements
This work was made possible through support from the Advanced Molecular Detection (AMD) Initiative at the Centers for Disease Control and Prevention. We thank Sam Minot, Andrew Page, Brian Raphael, and Torsten Seemann for helpful discussions. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
References
- Baker DN, & Langmead B. (2019). Dashing: Fast and accurate genomic distances with HyperLogLog. bioRxiv. doi: 10.1101/501726
- Bloom BH (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422–426. doi: 10.1145/362686.362692
- Brown E, Dessai U, McGarry S, & Gerner-Smidt P. (2019). Use of whole-genome sequencing for food safety and public health in the United States. Foodborne Pathogens and Disease, 16(7), 441–450. doi: 10.1089/fpd.2019.2662
- Gardner SN, Slezak T, & Hall BG (2015). kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome. Bioinformatics, 31(17), 2877–2878. doi: 10.1093/bioinformatics/btv271
- Harris SR (2018). SKA: Split kmer analysis toolkit for bacterial genomic epidemiology. bioRxiv, 453142. doi: 10.1101/453142
- Howe K, Bateman A, & Durbin R. (2002). QuickTree: Building huge neighbour-joining trees of protein sequences. Bioinformatics, 18(11), 1546–1547. doi: 10.1093/bioinformatics/18.11.1546
- Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, & Phillippy AM (2016). Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1), 132. doi: 10.1186/s13059-016-0997-x
- Saitou N, & Nei M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406–425. doi: 10.1093/oxfordjournals.molbev.a040454
- Timme RE, Rand H, Shumway M, Trees EK, Simmons M, Agarwala R, Davis S, et al. (2017). Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance. PeerJ, 5, e3893. doi: 10.7717/peerj.3893
- Zhao X. (2018). BinDash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics, 35(4), 671–673. doi: 10.1093/bioinformatics/bty651