Taxonomic metagenome sequence assignment with structured output models

KR Patil; P Haider; PB Pope; PJ Turnbaugh; M Morrison; T Scheffer; AC McHardy

doi:10.1038/nmeth0311-191

Author manuscript; available in PMC: 2011 Sep 1.

Published in final edited form as: Nat Methods. 2011 Mar;8(3):191–192. doi: 10.1038/nmeth0311-191

PMCID: PMC3131843 NIHMSID: NIHMS301833 PMID: 21358620

To the editor

Computational inference of the taxonomic origin of sequence fragments is an essential step in metagenome analysis¹. Assignment of fragments to individual populations or corresponding higher-level evolutionary clades can be performed with either homology-, sequence similarity- or sequence composition-based methods². It is a challenging task, because reference sequences are unavailable for the majority of uncultured microorganisms and large amounts of data have to be processed. With this in mind, we introduce PhyloPythiaS, a fast and accurate sequence composition-based classifier built on the structured output paradigm³.
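As an illustration of what a sequence composition-based representation can look like, the following minimal sketch turns a DNA fragment into a vector of normalized k-mer frequencies. The function name `kmer_profile` and the choice of k = 4 are assumptions made for this example; the letter does not specify the exact feature set used by PhyloPythiaS.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies of a DNA sequence.

    The vector has 4**k entries in lexicographic order (AA.., AC.., ...).
    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        # Sequence shorter than k or without valid k-mers.
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTTTGACA", k=2)
    print(len(profile), round(sum(profile), 6))
```

A classifier of this kind compares such profiles rather than aligning the fragment against reference genomes, which is why it does not depend on the presence of a close homolog in the reference database.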

We evaluated PhyloPythiaS on simulated and real metagenome data in comparison to four other methods: PhyloPythia⁴, MEGAN⁵, Phymm and PhymmBL⁶. PhyloPythiaS performed particularly well for taxonomic assignment of populations from novel genera, orders or higher-level clades when limited amounts of reference data were available. Accurate assignments could be performed based on 100 kb of training data for a sample population. We observed this for simulated data (Fig. 1a, Supplementary Fig. 1 and Table 1) and for a predominant population of a novel family of the order Aeromonadales from the Australian Tammar wallaby gut (Fig. 1b, Supplementary Tables 5 and 9). In this scenario, alignment-based methods performed poorly. If closely related genomes were available, the performance of all methods became more similar, with a slight advantage for alignment-based approaches. This was observed for simulated data and the predominant genera of two human gut metagenomes (Supplementary Tables 1.A and 11-13).

Figure 1: Comparison of different taxonomic classification methods. (a) Average performance per contig for the simMC data set (average length 2332 bp) at genus rank in four different experiments. Each experiment reflects a different scenario in terms of available reference sequences from closely related organisms for the dominant strains. ‘Known species’ had complete genome sequences of the same species available as reference; in the other experiments, sequences of the same clades at the respective rank were excluded, while retaining 100 kb of the dominant strains. (b) Scaffold-contig consistency for the WG-1 population (uncultured Succinivibrionaceae bacterium) of the Tammar wallaby gut metagenome. Contig coloring reflects taxonomic assignment consistency with respect to WG-1. Only scaffolds longer than 20 kb are shown. (c) Empirical execution time evaluated on a Linux machine with a 3 GHz processor and 4 GB main memory. Results for MEGAN and PhymmBL were determined with a reference database of size 2.1 GB.

PhyloPythiaS also performed well in fragment assignment of ‘known unknowns’, i.e. for organisms of taxonomic clades with no available sequence. Here, we observed less ‘overbinning’, meaning assignments to correct higher-level, but incorrect low-level clades, than for PhymmBL (Supplementary Tables 5 and 8). For short fragments of ‘known unknowns’, all methods showed comparably low assignment accuracy, with MEGAN performing best (Supplementary Fig. 2 and Table 2).
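The notion of ‘overbinning’ can be made concrete with a small sketch that compares a predicted lineage against the true lineage, rank by rank from the top. The function name, the category labels and the example lineages are illustrative assumptions, not the evaluation code behind the supplementary tables.

```python
def assess_assignment(true_lineage, predicted_lineage):
    """Classify a prediction relative to the true lineage.

    Both lineages are tuples ordered from high rank to low rank,
    e.g. (phylum, class, order, family). A prediction may stop at a
    higher rank than the truth.
    """
    if not predicted_lineage:
        return "unassigned"
    # Count how many leading ranks agree.
    shared = 0
    for true_clade, predicted_clade in zip(true_lineage, predicted_lineage):
        if true_clade != predicted_clade:
            break
        shared += 1
    if shared == len(predicted_lineage):
        # Every predicted rank is correct; the prediction may be less specific.
        return "consistent"
    if shared > 0:
        # Correct higher-level clade, but an incorrect lower-level clade.
        return "overbinned"
    return "incorrect"


if __name__ == "__main__":
    # Hypothetical population from a family absent from the reference data.
    truth = ("phylum_P", "class_C", "order_O", "novel_family")
    print(assess_assignment(truth, ("phylum_P", "class_C", "order_O")))
    print(assess_assignment(truth, ("phylum_P", "class_C", "order_O", "family_X")))
```

In this sketch, stopping at the order is counted as consistent, whereas pushing the fragment into a known family of that order is counted as overbinned.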

Empirical analysis of execution times determined that PhyloPythiaS requires 0.08-0.1 seconds for the assignment of 0.1-10 kb fragments (Fig. 1c). This corresponds to a three- to 46-fold and five- to 68-fold improvement in comparison to MEGAN and PhymmBL, respectively (Fig. 1c). For characterization of a 13 MB assembled metagenome sample, PhyloPythiaS showed 22-fold, 85-fold and 106-fold speed increase in comparison to PhyloPythia, MEGAN and PhymmBL, respectively (Supplementary Table 14). As PhyloPythiaS models require only a subsample of the reference data for accurate assignment, in future, training times will not necessarily be impacted by increases of sequence data, contrary to alignment-based approaches.

PhyloPythiaS employs an ensemble of linear models to represent composition-based clade specifics of the taxonomic hierarchy. The model parameters are identified using the paradigm of support vector machines (SVM) with structured output spaces. In contrast, our previously described method PhyloPythia uses an ensemble of multi-class SVMs for different taxonomic ranks and fragment lengths (Supplementary notes). Compared with PhyloPythia, PhyloPythiaS exhibits considerable gains in learning and prediction times, while performing similarly by several independent measures on two real-world metagenome data sets (Supplementary Tables 5-7, 9 and 11-14). PhyloPythiaS is freely available for academic use.
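One common way to set up prediction with a structured output model over a hierarchy is to give every clade its own linear weight vector and to score a candidate clade by summing the linear scores of all clades on its path to the root. The sketch below shows only this prediction step, under that assumption; the toy taxonomy, the weights and all function names are hypothetical, training is omitted, and the details of the actual PhyloPythiaS model are given in the Supplementary notes.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def path_to_root(node, parent):
    """Return the clades from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path


def score(x, node, weights, parent):
    """Sum of per-clade linear scores along the path from node to root."""
    return sum(dot(weights[n], x) for n in path_to_root(node, parent))


def predict(x, weights, parent):
    """Return the clade with the highest path score for feature vector x."""
    return max(weights, key=lambda n: score(x, n, weights, parent))


if __name__ == "__main__":
    # Toy taxonomy (child -> parent) and made-up two-dimensional weights.
    parent = {"root": None, "A": "root", "B": "root", "A1": "A", "A2": "A"}
    weights = {
        "root": [0.0, 0.0],
        "A": [1.0, 0.0],
        "B": [0.0, 1.0],
        "A1": [0.5, 0.0],
        "A2": [-0.5, 0.5],
    }
    print(predict([1.0, 0.0], weights, parent))
    print(predict([0.0, 1.0], weights, parent))
```

Because internal clades are candidates too, such a model can assign a fragment to a higher rank when no low-level clade scores better, and a single set of weights covers all ranks of the hierarchy.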

Supplementary Material

Supplement

Supplementary Figure 1: Evaluation on the simMC (simulated acid mine drainage) data set in four different settings.

Supplementary Figure 2: Evaluation of different binning methods on short fragments of varying lengths.

Supplementary Figure 3: Histogram of the TW metagenome sample contig lengths. There are 5995 contigs in total.

Supplementary Figure 4: Overlap between predictions of different methods on the TW sample for the three uncultured populations.

Supplementary Figure 5: Overlap between predictions of different methods on TW sample for dominant phyla.

Supplementary Figure 6: Scaffold-contig visualization of different binning methods for the WG-2 population in the Tammar wallaby metagenome sample.

Supplementary Table 1: Assignment accuracy of different binning methods on the simulated Acid Mine Drainage data set.

Supplementary Table 2: Performance evaluation of the different binning methods on a simulated data set of short fragments of varying lengths.

Supplementary Table 3: Genomes used for simulated short fragment test data set. The “parent” columns show the lowest parent available in the reference taxonomy.

Supplementary Table 4: Modeled clades for the TW sample.

Supplementary Table 5: Performance of different binning methods for the abundant populations in the TW sample.

Supplementary Table 6: Statistical comparison of the assignments of different methods on TW data set.

Supplementary Table 7: Number of contigs classified by different methods at different taxonomic ranks for the TW sample.

Supplementary Table 8: Effect of sample specific data on the assignment of the TW sample for PhyloPythiaS and PhymmBL.

Supplementary Table 9: NUCmer analysis of the WG-1 assignments for TW sample.

Supplementary Table 10: Modeled clades for PhyloPythiaS for the human gut metagenome samples (TS28 and TS29).

Supplementary Table 11: Taxonomic assignments for abundant genera in the human gut metagenome samples.

Supplementary Table 12: Bin validation for the human gut metagenome samples using marker genes.

Supplementary Table 13: Validation for the human gut metagenome samples using CD-HIT (fraction matched).

Supplementary Table 14: Execution time comparison for different methods for characterization of the three real metagenome samples.

Supplementary Notes

NIHMS301833-supplement-Supplement.pdf (1.6 MB, pdf)

Acknowledgments

We thank T. Joachims for the SVMstruct implementation, and L. Steinbrück and L. Feuerbach for software testing. K.R.P. and A.C.M. were funded by the Max Planck Society and Heinrich-Heine University Düsseldorf. T.S. gratefully acknowledges support from the German Science Foundation DFG. P.J.T. is funded by NIH P50 GM068763. The unpublished data for the WG-1 project arises from support provided by the OCE Science team of CSIRO Australia, in the form of an OCE postdoctoral fellowship (P.P.) and OCE Science Leader program (M.M.).

References

1. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. Microbiol Mol Biol Rev. 2008;72(4):557–578. doi: 10.1128/MMBR.00009-08.
2. McHardy AC, Rigoutsos I. Curr Opin Microbiol. 2007;10(5):499–503. doi: 10.1016/j.mib.2007.08.004.
3. Tsochantaridis I, Joachims T, Hofmann T, Altun Y. J Mach Learn Res. 2005;6:1453–1484.
4. McHardy AC, Garcia-Martin H, Tsirigos A, Hugenholtz P, Rigoutsos I. Nat Methods. 2007;4(1):63–72. doi: 10.1038/nmeth976.
5. Huson DH, Auch AF, Qi J, Schuster SC. Genome Res. 2007;17(3):377–386. doi: 10.1101/gr.5969107.
6. Brady A, Salzberg SL. Nat Methods. 2009;6(9):673–676. doi: 10.1038/nmeth.1358.
