Composite Metagenome-Assembled Genomes Reduce the Quality of Public Genome Repositories

Alon Shaiber; A Murat Eren

doi:10.1128/mBio.00725-19

letter

mBio. 2019 Jun 4;10(3):e00725-19. doi: 10.1128/mBio.00725-19

Editor: David A Relman

PMCID: PMC6550520 PMID: 31164461

LETTER

In their recent study, Espinoza et al. employ genome-resolved metagenomics to investigate supragingival plaque metagenomes of 88 individuals (1). The 34 metagenome-assembled genomes (MAGs) that the authors report include those that resolve to clades that have largely evaded cultivation efforts, such as Gracilibacteria (formerly GN02) and Saccharibacteria (formerly TM7) of the recently described Candidate Phyla Radiation (2). Generating new genomic insights into the understudied members of the human oral cavity is of critical importance for a comprehensive understanding of the microbial ecology and functioning of this biome, and we acknowledge the contribution of the authors on this front. However, the redundant occurrence of bacterial single-copy core genes suggests that more than half of the MAGs that Espinoza et al. report are composite genomes that do not meet the recent quality guidelines suggested by the community (3). Composite genomes that aggregate sequences originating from multiple distinct populations can yield misleading insights when treated and reported as single genomes (4).
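The reasoning behind this redundancy argument can be illustrated with a minimal sketch. It assumes a toy collection of four single-copy core genes (the gene names and both bins are hypothetical) and one common convention in which completion is the share of collection genes seen at least once and redundancy is the number of extra copies relative to the collection size. It is not the exact procedure of any particular tool.

```python
from collections import Counter


def completion_and_redundancy(scg_hits, scg_collection):
    """Return (completion %, redundancy %) for one genome bin.

    scg_hits: names of single-copy core genes detected in the bin,
              one entry per detected copy.
    scg_collection: set of gene names expected once per genome.
    """
    counts = Counter(hit for hit in scg_hits if hit in scg_collection)
    n = len(scg_collection)
    # Genes found at least once contribute to completion.
    present = sum(1 for gene in scg_collection if counts[gene] >= 1)
    # Every copy beyond the first contributes to redundancy.
    extra = sum(counts[gene] - 1 for gene in scg_collection if counts[gene] > 1)
    return 100.0 * present / n, 100.0 * extra / n


# Hypothetical toy data, for illustration only.
collection = {"rpoB", "recA", "gyrB", "rplB"}
composite_bin = ["rpoB", "rpoB", "recA", "recA", "gyrB", "rplB", "rplB"]
clean_bin = ["rpoB", "recA", "gyrB"]

print(completion_and_redundancy(composite_bin, collection))  # (100.0, 75.0)
print(completion_and_redundancy(clean_bin, collection))      # (75.0, 0.0)
```

A bin that merges two populations tends to look complete while carrying many duplicated core genes, which is the signal that high redundancy estimates are meant to expose.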

To briefly demonstrate their composite nature, we refined some of the key Espinoza et al. MAGs through a previously described approach (5) and the data that the authors kindly provided (1). We found that MAG IV.A, MAG IV.B, and MAG III.A described multiple discrete populations with distinct distribution patterns across individuals (Fig. 1). A phylogenomic analysis of refined MAG IV.A genomes resolved to the candidate phylum Absconditabacteria (formerly SR1) and not to Gracilibacteria as reported by Espinoza et al. (Fig. 1D). A pangenomic analysis of the original and refined MAG III.A genomes with other publicly available Saccharibacteria genomes showed that refinement increased the number of single-copy core genes 7-fold (Fig. 1E). These findings demonstrate the potential implications of composite MAGs in comparative genomics studies where single-copy core genes are commonly used to infer diversity, phylogeny, and taxonomy (6). Composite MAGs can also lead to inaccurate ecological insights through inflated abundance and prevalence estimates. For instance, the original MAG III.A recruited a total of 1,849,593 reads from Espinoza et al. metagenomes; however, the most abundant refined III.A genome (MAG III.A.2, Fig. 1C) recruited only 629,291 reads.
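The scale of this inflation follows directly from the two read counts above. The short calculation below uses only those reported numbers; the variable names are ours.

```python
# Read counts reported in the text for MAG III.A and refined MAG III.A.2.
original_reads = 1_849_593  # recruited by the original composite MAG III.A
refined_reads = 629_291     # recruited by the most abundant refined genome

# Share of the original recruitment that the refined genome accounts for.
fraction = refined_reads / original_reads
# Factor by which the composite bin overstates that genome's recruitment.
inflation = original_reads / refined_reads

print(f"Refined genome share of original reads: {fraction:.1%}")
print(f"Inflation factor of the composite bin: {inflation:.2f}x")
```

In other words, the most abundant refined genome accounts for roughly a third of the reads attributed to the original bin, so an abundance estimate based on the composite MAG would overstate it nearly threefold.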

FIG 1. Refinement of three composite genome bins. (A to C) The top left corners of these panels display the original name of a given Espinoza et al. MAG (see Table 1 in the original study) and its estimated completion and redundancy (C/R) based on a bacterial single-copy core gene collection (10). Each concentric circle represents one of the 88 metagenomes in the original study, dendrograms show hierarchical clustering of contigs based on sequence composition and differential mean coverage across metagenomes (using Euclidean distance and Ward’s method), and each data point represents the read recruitment statistic of a given contig in a given metagenome. Arcs at the outermost layers mark contigs that belong to a refined bin along with their new completion and redundancy estimates (C/R). (D) The phylogenomic tree organizes genomes based on 37 concatenated ribosomal proteins. Coloring of genome names matches their taxonomy in NCBI, and branch colors match the consensus taxonomy of genomes they represent. Espinoza et al. reported MAG IV.A as *Gracilibacteria* (hence the red color); however, this phylogenomic analysis places refined MAGs under *Absconditabacteria*. (E) Pangenomic analysis of Espinoza et al. *Saccharibacteria* MAG III.A before (left) and after (right) refinement together with the *Saccharibacteria* genomes from panel D. Pangenomes describe 575 and 497 gene clusters, respectively, where each concentric circle represents a genome and bars correspond to the number of genes that a given genome is contributing to a given gene cluster (the maximum value is set to 2 for readability). Outermost layers mark single-copy core gene clusters to which every genome contributes precisely a single gene. We used Bowtie2 (11) to recruit reads from metagenomes, and anvi’o (12) to visualize and refine Espinoza et al. MAGs. FAMSA (13) aligned anvi’o-reported ribosomal protein amino acid sequences, trimAl (14) curated them, and IQ-TREE (15) computed the tree for the phylogenomic analysis. Anvi’o used DIAMOND (16) and MCL (17) algorithms to determine pangenomes. A reproducible bioinformatics workflow and FASTA files for refined MAGs are available at http://merenlab.org/data/refining-espinoza-mags.
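The definition of a single-copy core gene cluster used in panel E (every genome contributes precisely one gene) can be sketched as follows. The genome names, cluster identifiers, and counts are hypothetical and serve only to show why a composite bin shrinks the single-copy core; this is not the anvi’o implementation.

```python
def single_copy_core_clusters(gene_clusters, genomes):
    """Return IDs of clusters to which every genome contributes exactly one gene.

    gene_clusters: dict mapping cluster ID -> dict of genome name -> gene count.
    genomes: names of all genomes in the pangenome.
    """
    return [
        cluster_id
        for cluster_id, counts in gene_clusters.items()
        if all(counts.get(genome, 0) == 1 for genome in genomes)
    ]


# Hypothetical pangenome that includes a composite bin: the bin carries two
# copies of the gene in GC_001 because it merges two populations.
genomes_before = ["genome_A", "genome_B", "composite_bin"]
clusters_before = {
    "GC_001": {"genome_A": 1, "genome_B": 1, "composite_bin": 2},
    "GC_002": {"genome_A": 1, "genome_B": 1, "composite_bin": 1},
    "GC_003": {"genome_A": 1, "genome_B": 0, "composite_bin": 1},
}

# The same pangenome after the bin is refined to a single population.
genomes_after = ["genome_A", "genome_B", "refined_bin"]
clusters_after = {
    "GC_001": {"genome_A": 1, "genome_B": 1, "refined_bin": 1},
    "GC_002": {"genome_A": 1, "genome_B": 1, "refined_bin": 1},
    "GC_003": {"genome_A": 1, "genome_B": 0, "refined_bin": 1},
}

print(single_copy_core_clusters(clusters_before, genomes_before))  # ['GC_002']
print(single_copy_core_clusters(clusters_after, genomes_after))    # ['GC_001', 'GC_002']
```

A single duplicated gene in one genome is enough to disqualify a cluster, which is consistent with the single-copy core growing once a composite bin is refined.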

Co-assembly of a large number of metagenomes that contain very closely related populations often hinders confident assignments of shared contigs into individual bins. Nevertheless, even when proper refinement is not possible, reporting composite MAGs as single genomes should be avoided. As of today, highly composite Espinoza et al. MAGs (Fig. 1 in this letter and Table 1 in the work of Espinoza et al.) are available as single genomes in public databases of the National Center for Biotechnology Information (NCBI).

The rapidly increasing number of MAGs in public databases already competes with the total number of microbial isolate genomes (3), and increasingly frequent studies that report large collections of MAGs offer a glimpse of the future (7–9). Despite their growing availability, metagenomes are inherently complex and require researchers to orchestrate an intricate combination of rapidly evolving computational tools and approaches with many alternatives to reconstruct, characterize, and finalize MAGs. We must continue to champion studies such as the one by Espinoza et al. for their contribution to our collective effort to shed light on the darker branches of the ever-growing Tree of Life. At the same time, editors and reviewers of genome-resolved metagenomics studies should properly scrutinize the quality and accuracy of MAGs prior to their publication. A systematic failure to do so will reduce the quality of public genome repositories while yielding adverse effects such as misleading insights into novel microbial groups and reduced trust among scientists in findings that emerge from genome-resolved metagenomics.

Footnotes

Citation Shaiber A, Eren AM. 2019. Composite metagenome-assembled genomes reduce the quality of public genome repositories. mBio 10:e00725-19. https://doi.org/10.1128/mBio.00725-19.

REFERENCES

1. Espinoza JL, Harkins DM, Torralba M, Gomez A, Highlander SK, Jones MB, Leong P, Saffery R, Bockmann M, Kuelbs C, Inman JM, Hughes T, Craig JM, Nelson KE, Dupont CL. 2018. Supragingival plaque microbiome ecology and functional potential in the context of health and disease. mBio 9:e01631-18. doi: 10.1128/mBio.01631-18.
2. Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A, Wilkins MJ, Wrighton KC, Williams KH, Banfield JF. 2015. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523:208–211. doi: 10.1038/nature14486.
3. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TBK, Schulz F, Jarett J, Rivers AR, Eloe-Fadrosh EA, Tringe SG, Ivanova NN, Copeland A, Clum A, Becraft ED, Malmstrom RR, Birren B, Podar M, Bork P, Weinstock GM, Garrity GM, Dodsworth JA, Yooseph S, Sutton G, Glöckner FO, Gilbert JA, Nelson WC, Hallam SJ, Jungbluth SP, Ettema TJG, Tighe S, Konstantinidis KT, Liu W-T, Baker BJ, Rattei T, Eisen JA, Hedlund B, McMahon KD, Fierer N, Knight R, Finn R, Cochrane G, Karsch-Mizrachi I, Tyson GW, Rinke C, Kyrpides NC, Schriml L, Garrity GM, Hugenholtz P, Sutton G, Yilmaz P, Meyer F, Glöckner FO, Gilbert JA, Knight R, Finn R, Cochrane G, Karsch-Mizrachi I, Lapidus A, Meyer F, Yilmaz P, Parks DH, Eren AM, Schriml L, Banfield JF, Hugenholtz P, Woyke T. 2017. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 35:725–731. doi: 10.1038/nbt.3893.
4. Koutsovoulos G, Kumar S, Laetsch DR, Stevens L, Daub J, Conlon C, Maroon H, Thomas F, Aboobaker AA, Blaxter M. 2016. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proc Natl Acad Sci U S A 113:5053–5058. doi: 10.1073/pnas.1600338113.
5. Delmont TO, Eren AM. 2016. Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies. PeerJ 4:e1839. doi: 10.7717/peerj.1839.
6. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, Butterfield CN, Hernsdorf AW, Amano Y, Ise K, Suzuki Y, Dudek N, Relman DA, Finstad KM, Amundson R, Thomas BC, Banfield JF. 2016. A new view of the tree of life. Nat Microbiol 1:16048. doi: 10.1038/nmicrobiol.2016.48.
7. Parks DH, Rinke C, Chuvochina M, Chaumeil PA, Woodcroft BJ, Evans PN, Hugenholtz P, Tyson GW. 2017. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol 2:1533–1542. doi: 10.1038/s41564-017-0012-7.
8. Almeida A, Mitchell AL, Boland M, Forster SC, Gloor GB, Tarkowska A, Lawley TD, Finn RD. 2019. A new genomic blueprint of the human gut microbiota. Nature 568:499–504. doi: 10.1038/s41586-019-0965-1.
9. Pasolli E, Asnicar F, Manara S, Zolfo M, Karcher N, Armanini F, Beghini F, Manghi P, Tett A, Ghensi P, Collado MC, Rice BL, Dulong C, Morgan XC, Golden CD, Quince C, Huttenhower C, Segata N. 2019. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176:649–662.e20. doi: 10.1016/j.cell.2019.01.001.
10. Campbell JH, O’Donoghue P, Campbell AG, Schwientek P, Sczyrba A, Woyke T, Söll D, Podar M. 2013. UGA is an additional glycine codon in uncultured SR1 bacteria from the human microbiota. Proc Natl Acad Sci U S A 110:5540–5545. doi: 10.1073/pnas.1303090110.
11. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359. doi: 10.1038/nmeth.1923.
12. Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, Delmont TO. 2015. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3:e1319. doi: 10.7717/peerj.1319.
13. Deorowicz S, Debudaj-Grabysz A, Gudyś A. 2016. FAMSA: fast and accurate multiple sequence alignment of huge protein families. Sci Rep 6:33964. doi: 10.1038/srep33964.
14. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972–1973. doi: 10.1093/bioinformatics/btp348.
15. Nguyen L, Schmidt HA, Von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32:268–274. doi: 10.1093/molbev/msu300.
16. Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60. doi: 10.1038/nmeth.3176.
17. Van Dongen S, Abreu-Goodger C. 2012. Using MCL to extract clusters from networks. Methods Mol Biol 804:281–295. doi: 10.1007/978-1-61779-361-5_15.