FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares

Genivaldo Gueiros Z Silva; Daniel A Cuevas; Bas E Dutilh; Robert A Edwards

PeerJ. 2014 Jun 5;2:e425. doi: 10.7717/peerj.425

Editor: Yong Wang

PMCID: PMC4060023 PMID: 24949242

Abstract

One of the major goals in metagenomics is to identify the organisms present in a microbial community from unannotated shotgun sequencing reads. Taxonomic profiling has valuable applications in biological and medical research, including disease diagnostics. Most currently available approaches do not scale well with increasing data volumes, which is important because both the number and lengths of the reads provided by sequencing platforms keep increasing. Here we introduce FOCUS, an agile composition-based approach using non-negative least squares (NNLS) to report the organisms present in metagenomic samples and profile their abundances. FOCUS was tested with simulated and real metagenomes, and the results show that our approach accurately predicts the organisms present in microbial communities. FOCUS was implemented in Python. The source code and web server are freely available at http://edwards.sdsu.edu/FOCUS.

Keywords: Metagenomes, Modeling, k-mer

Introduction

Microbes are more abundant than any other cellular organism (Whitman, Coleman & Wiebe, 1998), and it is important to understand which organisms are present and what they are doing (Handelsman, 2004). In many environments a majority of the microbial community members cannot be cultured, and metagenomics is a powerful tool to directly probe uncultured genomes and understand the diversity of microbial communities by using only their DNA (Sharon & Banfield, 2013).

Understanding microbial communities is important in many areas of biology. For example, metagenomes can distinguish taxonomic and functional signatures of microbes associated with marine animals (Trindade-Silva et al., 2012) or disease states (Belda-Ferre et al., 2012). Large sequencing volumes, short read lengths, and sequencing errors make the task of identifying the diversity of organisms present in metagenomes challenging (Mande, Mohammed & Ghosh, 2012). Many programs exist for this and they are either homology- or composition-based.

Homology-based programs normally use BLAST (Altschul et al., 1997) to identify the best hit in a large database. In MG-RAST (Meyer et al., 2008), sequences are aligned to a set of databases in order to classify the metagenomic sample. MetaPhlAn (Segata et al., 2012) and GenomePeek (K McNair, R Edwards, unpublished data) use a reduced database containing only marker genes, e.g., genes unique to particular clades and housekeeping genes, allowing the BLAST search to be fast. PhymmBL (Brady & Salzberg, 2011) improves the BLAST results using interpolated Markov models. GASiC (Lindner & Renard, 2013) uses Bowtie (Langmead et al., 2009) and the similarities between reference genomes to correct the observed abundance estimates. Parallel-META (Su, Xu & Ning, 2012), a fast program that requires a GPU, uses megaBLAST (Zhang et al., 2000) and hidden Markov models (HMMs) to improve the homology results. Most of these applications classify sequences individually and generate a taxonomic profile by summing the bins.

In general, composition-based approaches use oligonucleotide (k-mer) frequencies. Taxy (Meinicke, Aßhauer & Lingner, 2011) compares oligonucleotide distributions in metagenomes and in reference genomes and uses mixture modeling to identify the organisms present in the metagenome, while RAIphy (Nalbantoglu et al., 2011) identifies organisms using oligonucleotides and a relative abundance index.

We developed a new approach that reconstructs a taxonomic profile using an ensemble k-mer composition of the entire metagenome. We compute the optimal set of organism abundances using non-negative least squares (NNLS) to match the metagenome k-mer composition to organisms in a reference database. k-mers have previously been used to cluster unknown sequences (Teeling et al., 2004; McHardy et al., 2007) and NNLS has been used to identify the genera present in metagenomic samples based on variations in gene count (Carr, Shen-Orr & Borenstein, 2013). Here we combine these two approaches in FOCUS, an ultra fast, accurate, composition based approach to identify the taxa present in a metagenome. We compare the performance of FOCUS to GASiC, MetaPhlAn, RAIphy, PhymmBL, Taxy, and MG-RAST.

Methods

The FOCUS workflow is shown in Fig. 1. As in most composition-based approaches, a training set is pre-generated from complete genomes; non-negative least squares (NNLS) is then applied to compute the relative abundance of each database organism in the unknown sample.

Reference dataset

FOCUS requires a set of reference genomes to model and identify the organisms present in a metagenome. A total of 2,766 complete genomes were downloaded from the SEED servers (Aziz et al., 2012) on 20 December 2013 (see Table S1). k-mer frequencies (k = 6–8, default: k = 7) were calculated for both strands using Jellyfish 1.1.6 (Marçais & Kingsford, 2011), reducing the number of dimensions (Strous et al., 2012), and k-mer counts were normalized by the sum of frequencies. Users can also create their own training set; because this step also uses Jellyfish for k-mer counting, it scales to the quickly increasing number of available reference genomes.
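The counting and normalization step can be illustrated with a minimal pure-Python sketch. It is not the FOCUS implementation: FOCUS uses Jellyfish, whereas the function names here (`reverse_complement`, `kmer_profile`) are hypothetical, and the dimension reduction mentioned above is not reproduced. The sketch counts every k-mer on both strands and divides by the total count.

```python
from collections import Counter
from itertools import product

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_profile(sequences, k=7):
    """Normalized k-mer frequencies counted on both strands.

    Hypothetical helper for illustration only; k-mers containing
    characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                kmer = strand[i:i + k]
                if set(kmer) <= set("ACGT"):
                    counts[kmer] += 1
    total = sum(counts.values())
    # Fixed k-mer order so that every profile has the same 4**k entries.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


if __name__ == "__main__":
    profile = kmer_profile(["ACGTACGTTAGC", "GGCATTACGA"], k=2)
    print(len(profile), round(sum(profile.values()), 6))
```

Because each profile is normalized to sum to one, genomes and metagenomes of very different sizes become directly comparable.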

Simulated and real metagenomes

To evaluate FOCUS, a simulated dataset of short sequences (SimShort) containing 500,000 single 100 nt reads was created with GemSim (McElroy, Luciani & Thomas, 2012) using its supplied error model for Illumina GA IIx with TrueSeq SBS Kit v5-GA (Table S2). The previously published high-complexity simulated dataset (SimHC) from FAMeS (Mavromatis et al., 2007) was also used in the evaluation. In addition, real metagenomic datasets were selected as test cases: two human oral cavity samples, one under healthy conditions and one under disease conditions (MG-RAST accessions 4447943.3 and 4447192.3) (Belda-Ferre et al., 2012); one fecal sample from a healthy individual (MG-RAST accession 4440945.3) (Kurokawa et al., 2007); and three hundred datasets from the Human Microbiome Project (HMP) (Consortium, 2012) (Table S3).

Non-negative least squares (NNLS)

Estimating the parameters of a model that explains some data is a fundamental problem in data modeling. The estimation is not always easy, however, e.g., in problems such as metagenome profiling, where the fitted parameters cannot take negative values. In such cases, a solution can be estimated using NNLS, which is defined as:

Given a matrix A ∈ ℝ^{m×n} and a vector b ∈ ℝ^m, where m ≥ n, find a non-negative vector x ∈ ℝ^n that minimizes the function (1):

f(x) = \frac{1}{2} \| A x - b \|^{2}, where x \geq 0 and \sum_{i=1}^{n} x_{i} = 1. (1)

In FOCUS, the reference matrix A holds the frequencies of m k-mers (rows) for n genomes (columns), and the vector b holds the k-mer frequencies of the user’s metagenomic dataset, counted on both strands with Jellyfish. FOCUS uses non-negative least squares to compute the vector x of non-negative genome weights whose weighted combination of the columns of A best reproduces the k-mer composition of the user’s metagenome; each x_i is interpreted as the relative abundance of genome i. We minimize the sum of squared differences (1) using the open-source SciPy library (Jones, Oliphant & Peterson, 2001), whose NNLS module implements the algorithm of Lawson & Hanson (1974), which solves the Karush–Kuhn–Tucker (KKT) conditions. We added Tikhonov regularization (Garda & Galias, 2012) to deal with genomes that have similar k-mer compositions.
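A minimal sketch of this fitting step follows, using `scipy.optimize.nnls`. Several details are assumptions rather than the documented FOCUS implementation: the function name `estimate_abundances`, the regularization weight `reg`, the standard trick of expressing Tikhonov regularization by appending sqrt(reg)·I to A and zeros to b, and the final rescaling of x so that it sums to one (SciPy's NNLS enforces only x ≥ 0).

```python
import numpy as np
from scipy.optimize import nnls


def estimate_abundances(A, b, reg=0.0):
    """Fit non-negative genome weights x so that A @ x approximates b.

    A   : (m, n) array, one column of k-mer frequencies per genome.
    b   : (m,) array, k-mer frequencies of the metagenome.
    reg : assumed Tikhonov weight; 0 disables regularization.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = A.shape[1]
    if reg > 0:
        # min ||Ax - b||^2 + reg * ||x||^2 written as a plain NNLS problem.
        A_fit = np.vstack([A, np.sqrt(reg) * np.eye(n)])
        b_fit = np.concatenate([b, np.zeros(n)])
    else:
        A_fit, b_fit = A, b
    x, _residual_norm = nnls(A_fit, b_fit)
    total = x.sum()
    # Rescale to relative abundances (assumption: sum-to-one applied afterwards).
    return x / total if total > 0 else x


# Toy example: 4 k-mers, 3 genomes, metagenome mixed from the first two genomes.
A_demo = np.array([
    [0.4, 0.1, 0.1],
    [0.3, 0.2, 0.4],
    [0.2, 0.3, 0.4],
    [0.1, 0.4, 0.1],
])
true_abundance = np.array([0.7, 0.3, 0.0])
b_demo = A_demo @ true_abundance

if __name__ == "__main__":
    print(estimate_abundances(A_demo, b_demo))
    print(estimate_abundances(A_demo, b_demo, reg=0.01))
```

With noise-free toy data and linearly independent columns, the unregularized fit recovers the mixing proportions; a positive `reg` trades some of that exactness for stability when columns of A are nearly collinear.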

Jackknife resampling of the data

We implemented a jackknife resampling strategy to assess the robustness of the results. A random 50% of the reads was resampled 1,000 times, and the species frequencies were recalculated each time. For each species, these 1,000 frequencies were averaged and their standard deviation was calculated to estimate the spread.
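The resampling loop can be sketched as follows. This is an illustration under stated assumptions, not the FOCUS code: `jackknife` and `toy_profile` are hypothetical names, `profile_fn` stands in for the whole profiling step (any function mapping a list of reads to a dictionary of taxon abundances), and each subset is drawn without replacement.

```python
import random
import statistics


def jackknife(reads, profile_fn, n_rounds=1000, fraction=0.5, seed=0):
    """Mean and standard deviation of each taxon's abundance over resamples.

    reads      : list of reads (any objects accepted by profile_fn).
    profile_fn : hypothetical callable, list of reads -> {taxon: abundance}.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(reads) * fraction))
    runs = [profile_fn(rng.sample(reads, sample_size)) for _ in range(n_rounds)]
    taxa = set().union(*runs)  # union of the taxon names seen in any run
    summary = {}
    for taxon in taxa:
        values = [run.get(taxon, 0.0) for run in runs]
        summary[taxon] = (statistics.mean(values), statistics.pstdev(values))
    return summary


def toy_profile(reads):
    """Toy stand-in for a profiler: taxon is the first letter of each read."""
    counts = {}
    for read in reads:
        counts[read[0]] = counts.get(read[0], 0) + 1
    return {taxon: c / len(reads) for taxon, c in counts.items()}


if __name__ == "__main__":
    toy_reads = ["A-read"] * 50 + ["B-read"] * 50
    for taxon, (mean, sd) in sorted(jackknife(toy_reads, toy_profile, 200).items()):
        print(taxon, round(mean, 3), round(sd, 3))
```

A small standard deviation across resamples indicates that the estimated abundance does not depend strongly on which half of the reads was used.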

Web-based and graphical user interface version

As an alternative to the command line version of the program, we have created a user-friendly web version and a graphical user interface (GUI) for Microsoft Windows users. The web server and the GUI are available at http://edwards.sdsu.edu/FOCUS.

Results and Discussion

Evaluation and comparison with other tools

All tools were run using default parameters and their default reference database, either online (MG-RAST) or using one core on a server with 24 processors × 6 cores Intel(R) Xeon(R) CPU X5650 @ 2.67 GHz and 189 GB RAM. We only compared GASiC to the SimHC dataset which had the results previously published (Lindner & Renard, 2013). We tried to run the tool; however, it requires a large amount of storage during the process to save its output data.

For the real data, three hundred and three metagenomic datasets were selected. First, the metagenomic sample of the human oral cavity from diseased conditions was analyzed. MetaPhlAn apparently over-predicted the genus Veillonella due to its short genome, and Taxy did not predict Prevotella hits (see Fig. 2) as described in Belda-Ferre et al. (2012). FOCUS was able to profile the organisms in only 41 s. Taxy took about 45 s, MetaPhlAn took about 3 min, RAIphy took 52 min, MG-RAST took 3 days, and PhymmBL took 1 week and 6 days. Using random subsets of the oral metagenome, we tested the tools' scalability and showed that FOCUS and Taxy profile metagenomes in constant time (see Fig. 3).

Error bars represent the standard deviation in the tested metagenome.

Next, the oral metagenome from the healthy condition was analyzed. MetaPhlAn possibly over-predicted the genus Neisseria, and Taxy was not able to predict Rothia hits (see Fig. 4). FOCUS profiled the metagenome in only 35 s. Taxy took about 41 s, MetaPhlAn took about 2 min, RAIphy took 48 min, MG-RAST took 3 days, and PhymmBL took 9 days.

Error bars show the standard deviation for the real metagenome.

Finally, a fecal metagenome from a healthy individual was analyzed. All the tools predicted that Bifidobacterium and Enterococcus were the two most abundant genera in the sample. However, RAIphy apparently under-predicted the genus Bifidobacterium (see Fig. 5). For this small dataset, FOCUS profiled the metagenome in 35 s. Taxy took about 40 s, MetaPhlAn took only 30 s, RAIphy took about 4 min, MG-RAST took 3 days, and PhymmBL took 2 days and 14 h.

Three hundred metagenomic samples (254 GB total) from HMP were analyzed at all taxonomic levels using FOCUS (Table S4) in about 1 h and 20 min and compared with the published results from the MetaPhlAn paper (Segata et al., 2012) by calculating the Euclidean distance between the results (see Fig. 6). For most of the samples, FOCUS and MetaPhlAn have similar predictions at the genus level but vary at the species level. However, for some samples from the posterior fornix and most of the samples from the anterior nares there were differences at all levels, which may reflect the additional genome sequencing of isolates from those body sites that has occurred since 2012. Other tools were not included in the analysis due to their CPU processing time.

The distance was computed using the Euclidean distance between the results of both tools.

For the simulated data, we removed species from the reference dataset that are present in this dataset and tried to predict the genera present in the SimShort dataset. A major limitation of many of the approaches discussed here is that the underlying databases cannot be changed. Only FOCUS, RAIphy, GASiC, and PhymmBL allow the end user to change their reference database. FOCUS and PhymmBL best predicted the correct genera while RAIphy could not correctly predict their abundance (Fig. 7). FOCUS had the fastest performance (45 s); RAIphy took about 2 h, while PhymmBL took approximately 5 days. Figs. S1–S5 show the same comparison for other taxonomy resolutions.

For the SimHC simulated metagenomes, the genera present in the dataset were deleted from the training dataset, and we evaluated the class-level prediction. The tested tools correctly predicted the classes, except that RAIphy over predicted the top two classes (see Fig. 8). Again, FOCUS was the fastest tool (30 s) in comparison to RAIphy, which took about 1 h and 50 min, and PhymmBL, which took about 4 days. See Figs. S6–S8 for other taxonomy levels.

Furthermore, for the SimHC dataset, we ran all the previously used tools and the GASiC published results to evaluate the genera-level prediction. GASiC and PhymmBL had the best predictions, and FOCUS failed in the prediction of 4 minor genera probably because many organisms present in the SimHC dataset were not included in the FOCUS database (see Fig. 9). We did not compare the running time because we extracted the GASiC results from its paper; however, in the original paper it took 2 days and needed at least 500 GB of storage to analyze the SimHC simulated metagenome.

The very small standard deviations observed after jackknife re-sampling indicate the robustness of our results. Furthermore, in order to show a quantitative evaluation between the real and predicted abundance for the synthetic metagenomes, we computed the Euclidean distance between the real and predicted abundances for all the simulated data presented above (see Fig. 10). For some of the tools, only genus level predictions are available, but for RAIphy, PhymmBL, and FOCUS we included all taxonomic levels. The data demonstrate that FOCUS had the best prediction in more than half of test cases.
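The distance used for these comparisons can be written in a few lines. The sketch below is an assumption about the bookkeeping, not the authors' script: `euclidean_distance` is a hypothetical helper, profiles are dictionaries mapping taxon names to abundances, and a taxon missing from one profile is treated as having abundance zero there.

```python
import math


def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two taxon -> abundance dictionaries.

    Taxa absent from one profile count as abundance 0 in that profile.
    """
    taxa = set(profile_a) | set(profile_b)
    return math.sqrt(
        sum((profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) ** 2 for t in taxa)
    )


if __name__ == "__main__":
    # Made-up abundances, for illustration only.
    real = {"Streptococcus": 0.6, "Prevotella": 0.4}
    predicted = {"Streptococcus": 0.3, "Prevotella": 0.3, "Veillonella": 0.4}
    print(euclidean_distance(real, predicted))
```

A distance of zero means the predicted and real profiles are identical; larger values indicate a worse prediction at that taxonomic level.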

These tests were performed on a server; however, FOCUS is also ultra fast on a simple computer. For example, we profiled the real dataset in 1 min and 45 s using an Intel(R) Core(TM) i3 @ 2.53 GHz and 1 GB RAM. In addition to the web server, we have developed a stand-alone version that runs on the Windows® platform.

Limitations

As with other methods created to profile metagenome sequences, FOCUS depends on a curated database of microbial reference genomes in order to predict a specific genus. If a reference genome is absent, the tool will predict the closest reference available.

Conclusions

Here we present FOCUS, an agile solution to identify the organisms present in metagenomic samples that does not rely on mapping individual reads, but instead determines the taxonomic composition of the entire metagenome at once by using NNLS. This makes FOCUS an extremely fast and scalable solution to profile the focal taxa in a metagenome. FOCUS reports species compositions very similar to those of currently available, state-of-the-art metagenome profiling tools.

Availability and requirements

Project name: FOCUS

Project and web server home page: http://edwards.sdsu.edu/FOCUS

Operating system: the program has a command line version that works on OS X and Unix, and a GUI for Microsoft Windows users.

Programming language: Python 2.7.

Other requirements: Numpy (http://www.numpy.org), Scipy (http://scipy.org), Jellyfish (http://www.cbcb.umd.edu/software/jellyfish), and Python programming language (http://www.python.org).

License: GNU GPL3.

Any restrictions to use by non-academics: no special restrictions.

Supplemental information

Table S1. Full list of the complete genomes present in the training set (xlsx, 108.4 KB).

DOI: 10.7717/peerj.425/supp-1

Table S2. Complete list of organisms and abundances for the SimShort test set (xlsx, 8.5 KB).

DOI: 10.7717/peerj.425/supp-2

Table S3. Complete list of the three hundred metagenomes from the Human Microbiome Project (HMP) used as test set (xlsx, 30.7 KB).

DOI: 10.7717/peerj.425/supp-3

Table S4. FOCUS predictions at all taxonomic levels for 300 metagenomes from the Human Microbiome Project (xls, 1.5 MB).

DOI: 10.7717/peerj.425/supp-4

Figures S1–S8. Supplementary data with supplementary results (doc, 2.4 MB).

DOI: 10.7717/peerj.425/supp-5

Acknowledgments

We thank Dr. Peter Blomgren for help with the Advanced Numerical Analysis, Raul Maia Falcao for working on an alternative version to count k-mers, and the reviewers for their useful comments.

Funding Statement

GGZS and DAC were supported by NSF Grants (DEB-1046413 and CNS-1305112 to RAE). BED was supported by NWO Veni (016.111.075), CAPES/BRASIL and the Dutch Virgo Consortium. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Genivaldo Gueiros Z. Silva conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

Daniel A. Cuevas contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

Bas E. Dutilh and Robert A. Edwards contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper.

Data Deposition

The following information was supplied regarding the deposition of related data:

https://sourceforge.net/projects/metagenomefocus/.

References

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
Aziz RK, Devoid S, Disz T, Edwards RA, Henry CS, Olsen GJ, Olson R, Overbeek R, Parrello B, Pusch GD, Stevens RL, Vonstein V, Xia F. SEED servers: high-performance access to the SEED genomes, annotations, and metabolic models. PLoS ONE. 2012;7:e48053. doi: 10.1371/journal.pone.0048053.
Belda-Ferre P, Alcaraz LD, Cabrera-Rubio R, Romero H, Simón-Soro A, Pignatelli M, Mira A. The oral metagenome in health and disease. ISME Journal. 2012;6:46–56. doi: 10.1038/ismej.2011.85.
Brady A, Salzberg S. PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nature Methods. 2011;8:367. doi: 10.1038/nmeth0511-367.
Carr R, Shen-Orr SS, Borenstein E. Reconstructing the genomic content of microbiome taxa through shotgun metagenomic deconvolution. PLoS Computational Biology. 2013;9:e1003292. doi: 10.1371/journal.pcbi.1003292.
Garda B, Galias Z. Non-negative least squares and the Tikhonov regularization methods for coil design problems. 2012 International Conference on Signals and Electronic Systems (ICSES), 1–5; 2012.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews. 2004;68:669–685. doi: 10.1128/MMBR.68.4.669-685.2004.
Jones E, Oliphant T, Peterson P. 2001. SciPy: Open source scientific tools for Python. Available at: http://www.scipy.org/ , http://www.scipy.org/Citing_SciPy (accessed 23 October 2013).
Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, Taylor TD, Noguchi H, Mori H, Ogura Y, Ehrlich DS, Itoh K, Takagi T, Sakaki Y, Hayashi T, Hattori M. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Research. 2007;14(4):169–181. doi: 10.1093/dnares/dsm018.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
Lawson CL, Hanson RJ. Solving least squares problems. Society for Industrial and Applied Mathematics; 1974. 356 pp.
Lindner MS, Renard BY. Metagenomic abundance estimation and diagnostic testing on species level. Nucleic Acids Research. 2013;41:e10. doi: 10.1093/nar/gks803.
Mande SS, Mohammed MH, Ghosh TS. Classification of metagenomic sequences: methods and challenges. Briefings in Bioinformatics. 2012;13:669–681. doi: 10.1093/bib/bbs054.
Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011.
Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nature Methods. 2007;4:495–500. doi: 10.1038/nmeth1043.
McElroy KE, Luciani F, Thomas T. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012;13:74. doi: 10.1186/1471-2164-13-74.
McHardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods. 2007;4:63–72. doi: 10.1038/nmeth976.
Meinicke P, Aßhauer KP, Lingner T. Mixture models for analysis of the taxonomic composition of metagenomes. Bioinformatics. 2011;27(12):1618–1624. doi: 10.1093/bioinformatics/btr266.
Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA. The metagenomics RAST server: a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386. doi: 10.1186/1471-2105-9-386.
Nalbantoglu OU, Way SF, Hinrichs SH, Sayood K. RAIphy: phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles. BMC Bioinformatics. 2011;12:41. doi: 10.1186/1471-2105-12-41.
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods. 2012;9:811–814. doi: 10.1038/nmeth.2066.
Sharon I, Banfield JF. Genomes from metagenomics. Science. 2013;342:1057–1058. doi: 10.1126/science.1247023.
Strous M, Kraft B, Bisdorf R, Tegetmeyer HE. The binning of metagenomic contigs for microbial physiology of mixed cultures. Frontiers in Microbiology. 2012;3:410. doi: 10.3389/fmicb.2012.00410.
Su X, Xu J, Ning K. Parallel-META: efficient metagenomic data analysis based on high-performance computation. BMC Systems Biology. 2012;6:S16. doi: 10.1186/1752-0509-6-S1-S16.
Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004;5:163. doi: 10.1186/1471-2105-5-163.
The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207–214. doi: 10.1038/nature11234.
Trindade-Silva AE, Rua C, Silva GGZ, Dutilh BE, Moreira APB, Edwards RA, Hajdu E, Lobo-Hajdu G, Vasconcelos AT, Berlinck RGS, Thompson FL. Taxonomic and functional microbial signatures of the endemic marine sponge Arenosclera brasiliensis. PLoS ONE. 2012;7:e39905. doi: 10.1371/journal.pone.0039905.
Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proceedings of the National Academy of Sciences of the United States of America. 1998;95:6578–6583. doi: 10.1073/pnas.95.12.6578.
Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology. 2000;7:203–214. doi: 10.1089/10665270050081478.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Table S1. Complete list of complete genomes present in the training set.

Click here for additional data file.^{(108.4KB, xlsx)}

DOI: 10.7717/peerj.425/supp-1

Table S2. Complete list of organisms and abundances for the SimShort test set.

Click here for additional data file.^{(8.5KB, xlsx)}

DOI: 10.7717/peerj.425/supp-2

Table S3. Complete list of three hundred metagenomes from the Human Micriobiome Project (HMP) used as test set.

Click here for additional data file.^{(30.7KB, xlsx)}

DOI: 10.7717/peerj.425/supp-3

Table S4. FOCUS prediction for in all the levels for 300 metagenomes from the Human Microbiome Project.

Click here for additional data file.^{(1.5MB, xls)}

DOI: 10.7717/peerj.425/supp-4

Figures S1--S8. Supplementary_Data with supplementary results.

Click here for additional data file.^{(2.4MB, doc)}

DOI: 10.7717/peerj.425/supp-5

[ref-1] Altschul et al. (1997).Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-2] Aziz et al. (2012).Aziz RK, Devoid S, Disz T, Edwards RA, Henry CS, Olsen GJ, Olson R, Overbeek R, Parrello B, Pusch GD, Stevens RL, Vonstein V, Xia F. SEED servers: high-performance access to the seed genomes, annotations, and metabolic models. PLoS ONE. 2012;7:e425. doi: 10.1371/journal.pone.0048053. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-3] Belda-Ferre et al. (2012).Belda-Ferre P, Alcaraz LD, Cabrera-Rubio R, Romero H, Simón-Soro A, Pignatelli M, Mira A. The oral metagenome in health and disease. ISME Journal. 2012;6:46–56. doi: 10.1038/ismej.2011.85. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-4] Brady & Salzberg (2011).Brady A, Salzberg S. PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nature Methods. 2011;8:367–367. doi: 10.1038/nmeth0511-367. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-5] Carr, Shen-Orr & Borenstein (2013).Carr R, Shen-Orr SS, Borenstein E. Reconstructing the genomic content of microbiome taxa through shotgun metagenomic deconvolution. PLoS Computer Biology. 2013;9:e425. doi: 10.1371/journal.pcbi.1003292. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-6] Garda & Galias (2012).Garda B, Galias Z. Non-negative least squares and the Tikhonov regularization methods for coil design problems. 2012 International Conference on Signals and Electronic Systems (ICSES), 1–5; 2012. [Google Scholar]

[ref-7] Handelsman (2004).Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews. 2004;68:669–685. doi: 10.1128/MMBR.68.4.669-685.2004. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-8] Jones, Oliphant & Peterson (2001).Jones E, Oliphant T, Peterson P. 2001. SciPy: Open source scientific tools for Python. Available at: http://www.scipy.org/ , http://www.scipy.org/Citing_SciPy (accessed 23 October 2013)

[ref-9] Kurokawa et al. (2007).Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, Taylor TD, Noguchi H, Mori H, Ogura Y, Ehrlich DS, Itoh K, Takagi T, Sakaki Y, Hayashi T, Hattori M. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Research. 2007;14(4):169–181. doi: 10.1093/dnares/dsm018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-10] Langmead et al. (2009).Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-11] Lawson & Hanson (1974).Lawson CL, Hanson RJ. Solving least squares problems. Society for Industrial and Applied Mathematics; 1974. 356 pp. [Google Scholar]

[ref-12] Lindner & Renard (2013).Lindner MS, Renard BY. Metagenomic abundance estimation and diagnostic testing on species level. Nucleic Acids Research. 2013;41:e10. doi: 10.1093/nar/gks803. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-13] Mande, Mohammed & Ghosh (2012).Mande SS, Mohammed MH, Ghosh TS. Classification of metagenomic sequences: methods and challenges. Briefings in Bioinformatics. 2012;13:669–681. doi: 10.1093/bib/bbs054. [DOI] [PubMed] [Google Scholar]

[ref-14] Marçais & Kingsford (2011).Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-15] Mavromatis et al. (2007).Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nature Methods. 2007;4:495–500. doi: 10.1038/nmeth1043. [DOI] [PubMed] [Google Scholar]

[ref-16] McElroy, Luciani & Thomas (2012).McElroy KE, Luciani F, Thomas T. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012;13:74. doi: 10.1186/1471-2164-13-74. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-17] McHardy et al. (2007).McHardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods. 2007;4:63–72. doi: 10.1038/nmeth976. [DOI] [PubMed] [Google Scholar]

[ref-18] Meinicke, Aßhauer & Lingner (2011).Meinicke P, Aßhauer KP, Lingner T. Mixture models for analysis of the taxonomic composition of metagenomes. Bioinformatics. 2011;27(12):1628–1624. doi: 10.1093/bioinformatics/btr266. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-19] Meyer et al. (2008).Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA. The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386. doi: 10.1186/1471-2105-9-386. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-20] Nalbantoglu et al. (2011). Nalbantoglu OU, Way SF, Hinrichs SH, Sayood K. RAIphy: phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles. BMC Bioinformatics. 2011;12:41. doi: 10.1186/1471-2105-12-41.

[ref-21] Segata et al. (2012). Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods. 2012;9:811–814. doi: 10.1038/nmeth.2066.

[ref-22] Sharon & Banfield (2013). Sharon I, Banfield JF. Genomes from metagenomics. Science. 2013;342:1057–1058. doi: 10.1126/science.1247023.

[ref-23] Strous et al. (2012). Strous M, Kraft B, Bisdorf R, Tegetmeyer HE. The binning of metagenomic contigs for microbial physiology of mixed cultures. Frontiers in Microbiology. 2012;3:410. doi: 10.3389/fmicb.2012.00410.

[ref-24] Su, Xu & Ning (2012). Su X, Xu J, Ning K. Parallel-META: efficient metagenomic data analysis based on high-performance computation. BMC Systems Biology. 2012;6:S16. doi: 10.1186/1752-0509-6-S1-S16.

[ref-25] Teeling et al. (2004). Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004;5:163. doi: 10.1186/1471-2105-5-163.

[ref-26] Consortium (2012). The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207–214. doi: 10.1038/nature11234.

[ref-27] Trindade-Silva et al. (2012). Trindade-Silva AE, Rua C, Silva GGZ, Dutilh BE, Moreira APB, Edwards RA, Hajdu E, Lobo-Hajdu G, Vasconcelos AT, Berlinck RGS, Thompson FL. Taxonomic and functional microbial signatures of the endemic marine sponge Arenosclera brasiliensis. PLoS ONE. 2012;7:e39905. doi: 10.1371/journal.pone.0039905.

[ref-28] Whitman, Coleman & Wiebe (1998). Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proceedings of the National Academy of Sciences of the United States of America. 1998;95:6578–6583. doi: 10.1073/pnas.95.12.6578.

[ref-29] Zhang et al. (2000). Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology. 2000;7:203–214. doi: 10.1089/10665270050081478.
