To the editor
Computational inference of the taxonomic origin of sequence fragments is an essential step in metagenome analysis [1]. Fragments can be assigned to individual populations or to corresponding higher-level evolutionary clades with homology-, sequence similarity- or sequence composition-based methods [2]. The task is challenging because reference sequences are unavailable for the majority of uncultured microorganisms and large amounts of data have to be processed. With this in mind, we introduce PhyloPythiaS, a fast and accurate sequence composition-based classifier built on the structured output paradigm [3].
We evaluated PhyloPythiaS on simulated and real metagenome data in comparison to four other methods: PhyloPythia [4], MEGAN [5], Phymm and PhymmBL [6]. PhyloPythiaS performed particularly well for taxonomic assignment of populations from novel genera, orders or higher-level clades when limited amounts of reference data were available. Accurate assignments could be made based on 100 kb of training data for a sample population. We observed this for simulated data (Fig. 1a, Supplementary Fig. 1 and Table 1) and for a predominant population of a novel family of the order Aeromonadales from the Australian Tammar wallaby gut (Fig. 1b, Supplementary Tables 5 and 9). In this scenario, alignment-based methods performed poorly. If closely related genomes were available, the performance of all methods became more similar, with a slight advantage for alignment-based approaches. We observed this for simulated data and for the predominant genera of two human gut metagenomes (Supplementary Tables 1.A and 11-13).
PhyloPythiaS also performed well in fragment assignment of ‘known unknowns’, i.e. for organisms of taxonomic clades with no available sequence. Here, we observed less ‘overbinning’, meaning assignments to correct higher-level, but incorrect low-level clades, than for PhymmBL (Supplementary Tables 5 and 8). For short fragments of ‘known unknowns’, all methods showed comparably low assignment accuracy, with MEGAN performing best (Supplementary Fig. 2 and Table 2).
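The notion of overbinning described above can be made concrete with a small sketch. The function below is an illustration only, not the evaluation code used in this study: the name `classify_assignment`, the representation of lineages as root-to-leaf lists, and the four outcome labels are all assumptions introduced for this example.

```python
def classify_assignment(true_lineage, predicted_lineage):
    """Compare two root-to-leaf lineages (lists of clade names).

    Hypothetical helper for illustration. For a 'known unknown', the
    true lineage stops at the lowest clade that is represented in the
    reference data.
    """
    if not predicted_lineage:
        return "unassigned"
    # Count how many leading ranks agree.
    shared = 0
    for true_clade, predicted_clade in zip(true_lineage, predicted_lineage):
        if true_clade != predicted_clade:
            break
        shared += 1
    if shared == 0:
        return "incorrect"
    if shared == len(predicted_lineage):
        # Every predicted clade lies on the true lineage.
        return "correct"
    # Higher-level clades agree, but the prediction continues into a
    # low-level clade that is not on the true lineage.
    return "overbinned"


true_lineage = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Aeromonadales"]
print(classify_assignment(true_lineage, true_lineage + ["Aeromonadaceae"]))
print(classify_assignment(true_lineage, true_lineage[:2]))
```

Under these assumptions, the first call reports an overbinned assignment, because the prediction descends to a family that is not part of the true lineage, while the second reports a correct assignment at a higher rank.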
Empirical analysis of execution times determined that PhyloPythiaS requires 0.08-0.1 seconds for the assignment of 0.1-10 kb fragments (Fig. 1c). This corresponds to a three- to 46-fold and five- to 68-fold improvement in comparison to MEGAN and PhymmBL, respectively (Fig. 1c). For characterization of a 13 MB assembled metagenome sample, PhyloPythiaS showed 22-fold, 85-fold and 106-fold speed increase in comparison to PhyloPythia, MEGAN and PhymmBL, respectively (Supplementary Table 14). As PhyloPythiaS models require only a subsample of the reference data for accurate assignment, in future, training times will not necessarily be impacted by increases of sequence data, contrary to alignment-based approaches.
PhyloPythiaS employs an ensemble of linear models whose parameters are learned with support vector machines (SVMs) with structured output spaces, so that the models represent the composition-based characteristics of the clades in the taxonomic hierarchy. Our previously described method, PhyloPythia, instead uses an ensemble of multi-class SVMs for different taxonomic ranks and fragment lengths (Supplementary notes). Compared with PhyloPythia, PhyloPythiaS exhibits considerable gains in learning and prediction times, while performing similarly by several independent measures on two real-world metagenome data sets (Supplementary Tables 5-7, 9 and 11-14). PhyloPythiaS is freely available for academic use.
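As a rough illustration of composition-based assignment over a taxonomic hierarchy, the sketch below scores a fragment against every clade by summing linear scores along the path from that clade to the root, and returns the best-scoring clade. This is a minimal sketch in the spirit of hierarchical structured output prediction, not the PhyloPythiaS implementation: the toy taxonomy `PARENT`, the hand-set weights, the choice of k-mer length and all function names are assumptions made for this example, and a real model would learn its weights from reference sequences.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=1):
    """Normalized k-mer frequency vector over the ACGT alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


# Hypothetical toy taxonomy, stored as child -> parent.
PARENT = {"Bacteria": None, "Proteobacteria": "Bacteria", "Firmicutes": "Bacteria"}

# Hand-set weights for k=1 features (order A, C, G, T); illustrative only.
WEIGHTS = {
    "Bacteria": [0.0, 0.0, 0.0, 0.0],
    "Proteobacteria": [-1.0, 1.0, 1.0, -1.0],  # favors GC-rich fragments
    "Firmicutes": [1.0, -1.0, -1.0, 1.0],      # favors AT-rich fragments
}


def path_to_root(node):
    """List the clade and all of its ancestors."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def score(x, node, weights):
    """Sum the linear scores of every clade on the path to the root."""
    return sum(
        sum(w * xi for w, xi in zip(weights[clade], x))
        for clade in path_to_root(node)
    )


def predict(x, weights):
    """Return the clade with the highest path score."""
    return max(PARENT, key=lambda node: score(x, node, weights))


print(predict(kmer_profile("GGCCGCGC"), WEIGHTS))
print(predict(kmer_profile("ATATTA"), WEIGHTS))
```

With these toy weights, the GC-rich fragment is assigned to the hypothetical Proteobacteria node and the AT-rich fragment to Firmicutes. Because a clade's score includes the scores of its ancestors, evidence at higher ranks is shared among all descendant clades.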
Acknowledgments
We thank T. Joachims for the SVMstruct implementation, and L. Steinbrück and L. Feuerbach for software testing. K.R.P. and A.C.M. were funded by the Max Planck Society and Heinrich-Heine University Düsseldorf. T.S. gratefully acknowledges support from the German Science Foundation (DFG). P.J.T. is funded by NIH P50 GM068763. The unpublished data for the WG-1 project arise from support provided by the OCE Science team of CSIRO Australia, in the form of an OCE postdoctoral fellowship (P.P.) and the OCE Science Leader program (M.M.).
References
1. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. Microbiol Mol Biol Rev. 2008;72(4):557–578. doi: 10.1128/MMBR.00009-08.
2. McHardy AC, Rigoutsos I. Curr Opin Microbiol. 2007;10(5):499–503. doi: 10.1016/j.mib.2007.08.004.
3. Tsochantaridis I, Joachims T, Hofmann T, Altun Y. J Mach Learn Res. 2005;6:1453–1484.
4. McHardy AC, Garcia-Martin H, Tsirigos A, Hugenholtz P, Rigoutsos I. Nat Methods. 2007;4(1):63–72. doi: 10.1038/nmeth976.
5. Huson DH, Auch AF, Qi J, Schuster SC. Genome Res. 2007;17(3):377–386. doi: 10.1101/gr.5969107.
6. Brady A, Salzberg SL. Nat Methods. 2009;6(9):673–676. doi: 10.1038/nmeth.1358.