PLOS Biology. 2022 Sep 6;20(9):e3001792. doi: 10.1371/journal.pbio.3001792

A curated data resource of 214K metagenomes for characterization of the global antimicrobial resistome

Hannah-Marie Martiny 1,*, Patrick Munk 1, Christian Brinch 1, Frank M Aarestrup 1, Thomas N Petersen 1
Editor: Tobias Bollenbach 2
PMCID: PMC9447899  PMID: 36067158

Abstract

The growing threat of antimicrobial resistance (AMR) calls for new epidemiological surveillance methods, as well as a deeper understanding of how antimicrobial resistance genes (ARGs) have been transmitted around the world. The large pool of sequencing data available in public repositories provides an excellent resource for monitoring the temporal and spatial dissemination of AMR in different ecological settings. However, only a limited number of research groups globally have the computational resources to analyze such data. We retrieved 442 Tbp of sequencing reads from 214,095 metagenomic samples from the European Nucleotide Archive (ENA) and aligned them using a uniform approach against ARGs and 16S/18S rRNA genes. Here, we present the results of this extensive computational analysis and share the counts of reads aligned. Over 6.76 × 10^8 read fragments were assigned to ARGs and 3.21 × 10^9 to rRNA genes, and we observed distinct differences in both the abundance of ARGs and the link between microbiome and resistome compositions across various sampling types. This collection is another step towards establishing global surveillance of AMR and can serve as a resource for further research into the environmental spread and dynamic changes of ARGs.


The growing threat of antimicrobial resistance (AMR) calls for new epidemiological surveillance methods and a deeper understanding of how resistance genes are transmitted around the world. This study presents a large-scale remapping of sequencing reads of publicly available metagenomic datasets that can be used to monitor the global prevalence of AMR genes.

Introduction

The vast amount of genomic data available in public data repositories is a unique and potentially important resource for doing research and genomic surveillance of antimicrobial resistance (AMR). Using these datasets collected from locations all over the world across different years and from various sampling sources might further aid our understanding of the emergence and distribution of antimicrobial resistance genes (ARGs).

Depositing genomic sequence data in a public repository is today a major and often mandatory step when publishing in peer-reviewed journals. Several such repositories were created by the members of the International Nucleotide Sequence Database Collaboration (INSDC) [1], including the European Nucleotide Archive (ENA) [2]. The amount of sequencing data available at ENA continues to increase with an estimated doubling time of 18 months (https://www.ebi.ac.uk/ena/browser/about/statistics; accessed 2022-03-08).

Several approaches for analyzing genomic data depending on the sample types are already well established.

However, the exploration of these resources is often restricted to a few research groups only since both sufficient skills in bioinformatics and access to high-performing computer resources are needed to handle the large amount of available data.

Existing collections of analyzed datasets tend to focus either on specific sample sources, such as humans [3,4], marine environments [5], or urban sewage [6,7], or on specific genera [8]. The COVID-19 pandemic in particular has highlighted the value of data sharing to trace the spread and evolution of the virus [9]. Despite the attempts to standardize the analysis workflows of these databases, they are limited in their ability to generalize across environments and locations. A recent study [10] has shared a searchable collection of 661K bacterial genomes for exploring the global bacterial diversity across different origins, providing an easy-to-access resource for genomic research. While this is an impressive data-sharing effort, the authors did not include metagenomic samples in their pipeline. Metagenomic techniques aim to sequence all DNA in a sample and can be used to characterize the microbiome in different environments [11,12], discover novel organisms [13], monitor disease [14,15], and track specific genes, such as ARGs [5,6,16].

Here, we present a large-scale analysis of 214,095 metagenomic samples retrieved from ENA. We took an assembly-free approach, aligning sequencing reads against ARGs and 16S/18S ribosomal RNA genes. We have previously published an in-depth analysis of the distribution of mobilized colistin resistance [17] based on these data. Now we share the entire collection of mapping results and showcase how to characterize the global resistome and microbiome with this dataset. The curated metadata and mapping results are available at https://doi.org/10.5281/zenodo.6919377 and documentation at https://hmmartiny.github.io/mARG/Tables.html.

Materials and methods

Retrieval of metagenomes

We retrieved metagenomic datasets from ENA [2] uploaded between 2010-01-01 and 2020-01-01 that had library source as “METAGENOMIC” and library strategy of “WGS.” We collected 214,095 sequencing runs from 146,732 samples from 6,307 projects corresponding to 442 Tbp of raw reads taking up 300 TB of storage. The associated metadata for each sample was also retrieved.

Preprocessing and mapping of sequencing reads

The retrieved raw FASTQ reads were trimmed and aligned against reference sequences, as outlined in Martiny et al. (2022) [17]. In brief, we used FASTQC v.0.11.15 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for read quality checking and BBduk2 v.36.49 [18] for trimming the raw sequencing reads. With the k-mer-based alignment tool KMA 1.2.21 [19], the trimmed reads were mapped against reference sequences from 2 different databases: the AMR gene database ResFinder [20] (downloaded 2020-01-25), which contained 3,085 sequences of acquired ARGs, and the rRNA gene database SILVA [21] (version 138, downloaded 2020-01-16), which had 2,225,272 reference sequences, more than 88% of them being 16S/18S rRNA genes. For KMA, we used the following alignment scores: 1, -2, -3, and -1 for a match, mismatch, gap opening, and gap extension, respectively. For read pairing, we used a value of 7 and a minimum relative alignment score of 0.75. Data retrieval, quality checking, trimming, and read alignments were done using the Danish National Supercomputer for Life Sciences (https://www.computerome.dk/).

Standardization of metadata

The following attributes for each metagenome were standardized: sampling location, sampling host or environment (referred to as a host below), and sampling date.

To standardize the label for sampling locations, we looked at the values entered in the two fields “country” and “location.” First, the latitude and longitude coordinates were mapped to a country using the Python library Shapely 1.7.1 [22] to find the matching area defined in one of the 3 public domain map datasets (countries, marine, and lakes) available in the Natural Earth Data collection. If the lookup failed or the coordinates were not given, the second step was to match the text attribute in the country label to ISO 3166 country codes with a fuzzy search with the Python library PyCountry 20.7.3 (https://github.com/flyingcircusio/pycountry). Finally, if the 2 lookup searches did not yield a match, we did a manual lookup of the country labels to standardize the text.
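The three-step fallback described above (coordinates first, then fuzzy text matching, then a manual table) can be sketched as follows. This is only an illustration of the lookup order, not the authors' code: it swaps Shapely and PyCountry for standard-library stand-ins (a ray-casting point-in-polygon test and `difflib`), and the names `REGIONS`, `COUNTRY_CODES`, `MANUAL_FIXES`, and `standardize_location` as well as the toy polygon are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical toy data: one rectangular "country" polygon given as (lon, lat)
# vertices, a tiny name-to-code table, and a manually curated fallback table.
REGIONS = {"DK": [(8.0, 54.5), (13.0, 54.5), (13.0, 58.0), (8.0, 58.0)]}
COUNTRY_CODES = {"denmark": "DK", "germany": "DE", "united states": "US"}
MANUAL_FIXES = {"usa": "US"}


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how often a ray going east from the point crosses an edge."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def standardize_location(lat=None, lon=None, country_text=None):
    """Return a country code, or None if no step yields a match."""
    # Step 1: map coordinates to a region polygon.
    if lat is not None and lon is not None:
        for code, polygon in REGIONS.items():
            if point_in_polygon(lon, lat, polygon):
                return code
    if country_text:
        key = country_text.strip().lower()
        # Step 2: fuzzy match of the free-text country label.
        hits = get_close_matches(key, list(COUNTRY_CODES), n=1, cutoff=0.8)
        if hits:
            return COUNTRY_CODES[hits[0]]
        # Step 3: manually curated corrections.
        return MANUAL_FIXES.get(key)
    return None
```

In a real setting, the polygon test would be replaced by a geometry library and the code table by a full ISO 3166 listing; the cutoff of 0.8 is an arbitrary choice for this sketch.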

For the standardization of host labels, we mapped the taxonomic id given by the attribute “host_tax_id” to the NCBI Taxonomy database [23], or if the feature was missing, the “tax_id” was used instead.

Since the only way to curate entered collection dates is to look up suspicious dates in published studies manually, and that was deemed too time-intensive, we decided to replace dates entered as later than 2020-01-01 in the sample attribute field “collection_date” with the missing value NULL.
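The date rule above can be sketched in a few lines of Python. The cutoff comes from the text; treating unparsable values as missing and expecting ISO-formatted input are assumptions of this sketch, and the function name `curate_collection_date` is hypothetical.

```python
from datetime import date

CUTOFF = date(2020, 1, 1)  # dates later than this are treated as invalid


def curate_collection_date(raw):
    """Return the ISO date string, or None (NULL) if it is unusable.

    Assumption of this sketch: values that cannot be parsed as YYYY-MM-DD
    are also mapped to None.
    """
    try:
        parsed = date.fromisoformat(raw)
    except (TypeError, ValueError):
        return None
    if parsed > CUTOFF:
        return None
    return parsed.isoformat()
```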

Measuring the abundance of ARGs

Since we report the fragment count aligned to each reference gene, the mapping results are compositional and should be treated as such [24]. In the simplest form, the ARG abundance for a sample or sample group can be calculated as the log-ratio of the count of reads, $n_i$, aligned to each ARG $i$ over the total sum of rRNA read fragments, $n_B$:

$$x = [n_1, n_2, \ldots, n_D, n_B], \quad i = 1..D$$
$$\mathrm{Abundance}(x) = \left[\log\frac{n_1}{n_B}, \log\frac{n_2}{n_B}, \ldots, \log\frac{n_D}{n_B}\right]$$

where $D$ is the number of ARGs and $n_B = \sum_{j}^{D_B} n_j \cdot \frac{1}{10^6}$, with $D_B$ being the number of read fragments aligned to rRNA genes. Each ARG count $n_i$ has been adjusted with the length of the gene in kilobases.
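As a minimal sketch of the log-ratio abundance defined above, the following Python function adjusts each ARG count by gene length in kilobases and divides by the rRNA fragment total per million. The function name, argument layout, use of the natural logarithm, and the choice to skip zero counts (whose logarithm is undefined) are assumptions of this example and are not specified in the text.

```python
import math


def arg_abundance(arg_counts, gene_lengths_bp, rrna_counts):
    """Log-ratio abundance per ARG for one sample.

    arg_counts:      dict mapping ARG name -> aligned fragment count n_i
    gene_lengths_bp: dict mapping ARG name -> reference gene length in bp
    rrna_counts:     iterable of fragment counts aligned to rRNA references
    """
    # n_B: summed rRNA fragments, expressed per million.
    n_b = sum(rrna_counts) / 1e6
    abundance = {}
    for gene, count in arg_counts.items():
        if count == 0:
            continue  # log(0) is undefined; zero counts are left out here
        adjusted = count / (gene_lengths_bp[gene] / 1000)  # per kilobase
        abundance[gene] = math.log(adjusted / n_b)
    return abundance
```

For example, 200 fragments on a 2,000 bp gene in a sample with one million rRNA fragments give log(100), about 4.6.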

The relative abundance of resistance classes was calculated as the proportion of ARG reads assigned to each class, scaled with κ = 100:

$$\mathrm{Relative\ abundance}(x) = \kappa \cdot \frac{n_i}{\sum_i n_i}$$
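The class-level proportions can be computed as in the sketch below, where `class_counts` is a hypothetical dictionary of summed ARG counts per resistance class and returning zeros for an empty sample is a choice made for this example.

```python
def relative_abundance(class_counts, kappa=100):
    """Share of ARG counts per resistance class, scaled by kappa (percent by default)."""
    total = sum(class_counts.values())
    if total == 0:
        # No ARG reads at all: report zero for every class instead of dividing by zero.
        return {name: 0.0 for name in class_counts}
    return {name: kappa * n / total for name, n in class_counts.items()}
```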

Diversity measurements

Besides the read abundance values, we report the species richness, Shannon diversity index [25], and the Gini–Simpson [26] diversity index of read counts of ARGs, genera, and phyla per sample. Species richness is the number of different genes or taxonomic groups present in the sample with at least 1 read fragment aligned.

The Shannon index ($H'$) was calculated using the proportions of reads $p_i = n_i / \sum n$:

$$H' = -\sum_{i=1}^{R} p_i \ln p_i$$

whereas the Gini–Simpson index (GS) was calculated using the read counts $n = [n_1, \ldots, n_D]$, where $N = \sum n$ is the total count of reads for the group:

$$GS = 1 - \frac{\sum_i n_i (n_i - 1)}{N(N - 1)}$$

Together with these 2 indices, we also report the sample-wise unique number of reference sequences or taxonomic groups matched.
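The three measures can be sketched in plain Python as follows. The function names are hypothetical, and returning 0 for empty or single-read inputs is a convention chosen for this example, not something the text specifies.

```python
import math


def richness(counts):
    """Number of genes or taxa with at least 1 aligned read fragment."""
    return sum(1 for n in counts if n > 0)


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over the non-zero proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def gini_simpson(counts):
    """Gini-Simpson index GS = 1 - sum(n_i * (n_i - 1)) / (N * (N - 1))."""
    total = sum(counts)
    if total < 2:
        return 0.0  # the index is undefined for fewer than 2 reads
    return 1 - sum(n * (n - 1) for n in counts) / (total * (total - 1))
```

For counts of `[5, 5, 0]`, the richness is 2, the Shannon index is ln 2, and the Gini–Simpson index is 1 - 40/90 = 5/9.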

Results

Here, we present a large-scale mapping of 442 Tbp of raw reads of 214,095 metagenomic samples suitable for analyzing the distribution of acquired antimicrobial resistance genes and 16S/18S rRNA genes. Furthermore, we have spent considerable effort standardizing 3 main sample attributes: sampling date, location, and source. To facilitate easy access and usage, we have shared the mapping results and corrected metadata in 3 different data formats (TSV, HDF, and MySQL dumps). We also provide tutorials with code examples in R and Python on using the data in different scenarios. Data files are all available at https://doi.org/10.5281/zenodo.6919377.

By collecting the sequencing reads from ENA, we could also verify the inherent bias of specific sample types or sources being overrepresented simply due to their availability in the public repository. While the 214,095 metagenomic datasets were collected from 797 different hosts, most were of either human or marine origin (Fig 1A). The sampling locations were similarly skewed, towards European and North American countries (Fig 1B). The distribution of samples according to the sampling year reveals that a considerable number were collected between 2010 and 2020 (Fig 1C).

Fig 1. Distribution of metagenomes reveals the overrepresentation of samples from specific sources.

(a) Number of samples grouped per sampling host, where only hosts with more than 1,000 samples are plotted. (b) Sample locations for metagenomes with available GPS coordinates; each marker is a sample. A total of 83,903 samples did not have coordinates available. (c) Year in which a sample was collected. A total of 84,238 of the samples did not have a valid sampling date recorded. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377, and the base layer map was created with data from https://www.naturalearthdata.com/.

Of the more than 1.8 × 10^12 raw sequencing reads, corresponding to 442.1 Tbp, 93% were generated using Illumina sequencing technologies (S1 Fig). We mapped over 1.69 × 10^12 trimmed read fragments, with a median of 784,748 fragments per sample (range 1 to 916,901,400) (Fig 2A). Approximately 0.04% of all read fragments could be aligned to ARGs, and 0.19% to rRNA genes. Overall, the count of aligned read fragments increased with the number of sequencing reads and bases available (S3 Fig). The number of ARG fragments aligned increased with the number of aligned rRNA fragments, although for 34% of the samples, we did not find any ARGs despite having read fragments aligning to 16S rRNA genes (Fig 2B). The microbial differences between the sampling origins were reflected in the number of aligned fragments (S4 Fig).

Fig 2. Distribution of available and aligned fragments.

(a) Density distribution of available fragments per sample. (b) The distribution compares the number of fragments mapped to rRNA genes and ARGs. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

The global abundance of antimicrobial resistance

To measure the global distribution of ARGs and the composition of the resistome, we calculated the abundance of ARGs as the log-ratio of ARG fragments over summed rRNA sequence fragments. Almost all of the reference sequences from the ResFinder database had at least 1 fragment aligned, and only 94 ARGs had no hits (S2 Fig). The median observed resistance load per metagenomic sample was 11.74 (log range: −1.45 to 23.52) (Fig 3A), which appeared to be mainly dependent on the geographic origin and environment (Fig 3B–3D) and not on which year the sample was taken. For example, samples originating from locations within Europe showed similar abundance levels for most of the samples but with several outliers, whereas multiple samples from locations in the Oceania region had a much broader load distribution with few outliers (Fig 3C).

Fig 3. Boxplots of ARG abundances in metagenomic samples show that levels vary across different origins.

(a) Distribution of ARG abundance per sample. (b) Distribution of sample-wise ARG abundance grouped by sampling year. (c) Sample-wise ARG abundance per sampling location. (d) Sample-wise ARG abundance grouped by hosts. Only hosts with more than 1,000 metagenomes analyzed are shown. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

While the distribution of sample-wise resistance loads illustrates the high variability in this data collection (Fig 3), we saw that once we stratified the relative ARG read proportions per resistance class and sample type, there were clear separations between different groups (Fig 4). For the sampling years with a considerable number of samples available (2004 to 2019), the relative proportion of classes was relatively consistent, with Tetracycline reads being the most common, except for a spike of Beta-lactam reads in 2017 (Fig 4A). Across the continents and large water bodies, we observed that ARGs conferring resistance to Aminoglycosides or Beta-lactam antimicrobials were more common in water environments, whereas mainland regions had a more diverse distribution (Fig 4B). Once we stratified by sampling host or source, the distribution of resistance classes was very dependent on the group, as seen by the high proportion of read fragments aligned to, for example, Phenicol for marine and soil samples and Tetracycline reads being highly prevalent in mice (Mus musculus) samples (Fig 4C).

Fig 4. Composition of reads assigned to ARGs from different resistance classes grouped by sampling origin.

(a) Grouped by sampling year. (b) Grouped per sampling location. (c) Grouped per sampling host. Only hosts with more than 1,000 metagenomes analyzed are shown. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

Linking the microbiome diversity with resistance diversity

The relationship between the diversity of the microbiome and that of the resistance genes was quantified by calculating the species richness and 2 alpha diversity measurements (Shannon and Gini–Simpson) at the ARG level and at the phylum and genus taxonomic levels. Without considering sample origin, we observed that a majority of the samples had both high microbial diversity and high ARG diversity (Figs 5 and S5). However, the relationship between genus and ARG diversity indexes differed between sampling sources: several groups contained samples in which the 2 diversity measurements did not track each other, suggesting that increased diversity of microbes in, for example, soil samples does not necessarily lead to a higher diversity of resistance genes. Conversely, the chicken (Gallus gallus) samples had elevated ARG diversity despite having lower microbial diversity (Fig 5).

Fig 5. The genus–ARG diversity relationship for all metagenomic samples.

The Gini–Simpson diversity indexes were calculated on genus categories (x-axis) compared to ARG levels (y-axis). Left: scatterplot of all samples. Right: samples colored by selected host or environmental origins. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

Discussion

Global surveillance of AMR based on genomics continues to become more accessible due to the advancement in NGS technologies and the practice of sharing raw sequencing data in public repositories. Standardized pipelines and databases are needed to utilize these large data volumes for tracking the dissemination of AMR. We have uniformly processed the sequencing reads of 214,095 metagenomes for the abundance analysis of ARGs.

Our data sharing efforts enable users to perform abundance analyses of individual ARGs, the resistome, and the microbiome across different environments, geographic locations, and sampling years.

We have given a brief characterization of the distribution of ARGs according to the collection of metagenomes. However, in-depth analyses remain to be performed to investigate the influence of temporal, geographical, and environmental origins on the dissemination and evolution of antimicrobial resistance. For example, analyzing the spread of specific ARGs across locations and different environments could reveal new transmission routes of resistance and guide the design of intervention strategies to stop the spread. We have previously published a study focusing on the distribution of mobilized colistin resistance (mcr) genes using this data resource, showing how widely disseminated the genes were [17]. Another use of the data collection could be to explore how changes in microbial abundances affect and are affected by the resistome. Furthermore, our coverage statistics of reads aligned to ARGs could be used to investigate the rate of new variants occurring in different reservoirs. Even though we have focused on the threat of antimicrobial resistance, this resource could also be used to look at the effects of, for example, climate change on microbial compositions. Our observed read fragment counts could also be linked with other types of genomic data, for example, to evaluate the risk of ARG mobility, accessibility, and pathogenicity in assembled genomes [27,28] and to verify observations from clinical data [29].

We recommend that potential users consider all the confounders present in this data collection in their statistical tests and modeling workflows, emphasizing that the experimental methods and sequencing platforms dictate the obtained sequencing reads and that metadata for a sample might be mislabeled, despite our efforts to minimize those kinds of errors. Furthermore, it is essential to consider the compositional nature of microbiomes [30]. The reads do not depend on the distribution of genetic material in the sample but on the capacity of the sequencing platform [24,31]. Various statistical methods already exist that consider the compositionality [24,32,33]. Finally, it is important to highlight that the results we have presented here include fragment counts of 1 for the sake of transparency, but we also recommend potential users consider appropriate filters in their analysis.
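As one possible illustration of these two recommendations, the sketch below first drops low-count hits and then applies a centered log-ratio (CLR) transform, a common choice in compositional data analysis. It is not a method prescribed by this resource: the threshold `min_count=2`, the pseudocount of 0.5, and both function names are assumptions made for the example.

```python
import math


def filter_low_counts(counts, min_count=2):
    """Keep only genes with at least min_count aligned fragments."""
    return {gene: n for gene, n in counts.items() if n >= min_count}


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each count minus the mean of the log counts.

    The pseudocount avoids log(0) when zero counts are present.
    """
    if not counts:
        return {}
    logs = {gene: math.log(n + pseudocount) for gene, n in counts.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {gene: value - mean_log for gene, value in logs.items()}
```

By construction, the CLR values of a sample sum to zero, which makes them suitable for comparing genes within a sample relative to the sample's geometric mean.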

The volume of sequencing data in public repositories continues to grow, giving us plenty of opportunities to expand our data collection further. To establish a truly global surveillance program of AMR, sequencing data should be analyzed as soon as they are published in these archives. Although this would require access to even more computational resources, we hope to achieve this soon and to compare our approach with other methods, such as AMRFinderPlus [34] and CARD [35]. As new sequencing technologies become more widely used, the settings of our alignment procedure should also be tuned to take better advantage of the strengths, and account for the flaws, of different sequencing platforms.

With this data resource, we have taken a step towards enabling the scientific community to utilize the wealth of information in these metagenomic samples to broaden our understanding of the dissemination of antimicrobial resistance and changes in microbiomes at both local and global scales through time and environments.

Supporting information

S1 Fig. Distribution of samples per sequencing instrument platform.

(a) Sample count per platform. (b) Distribution of raw sequencing read counts per platform. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

(TIFF)

S2 Fig. More than 96% of ARG templates had at least 1 aligned fragment.

The bars illustrate the percentage of ARGs per resistance class without and with at least 1 aligned fragment. The parenthesis after each class label contains the number of genes found out of the total available templates. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

(TIFF)

S3 Fig. The sample-wise distribution of aligned (a) ARG or (b) rRNA fragments compared to raw sequencing base counts.

The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

(TIFF)

S4 Fig. The sample-wise distribution of aligned rRNA fragments and ARG fragments, colored by selected host and environmental sources.

The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

(TIFF)

S5 Fig. Additional distributions showing the relationship between ARGs and genera for all metagenomic samples.

(a) The richness of genus groups (x-axis) vs. ARG richness (y-axis). (b) The relationship between Shannon diversity index calculated on genus level (x-axis) and ARGs (y-axis). Right: samples colored by selected host or environmental origins. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

(TIFF)

Abbreviations

AMR, antimicrobial resistance
ARG, antimicrobial resistance gene
ENA, European Nucleotide Archive
INSDC, International Nucleotide Sequence Database Collaboration
mcr, mobilized colistin resistance

Data Availability

The code to produce the figures is available at https://github.com/hmmartiny/mARG. The data has been deposited at https://doi.org/10.5281/zenodo.6919377, and documentation of the various tables can be accessed at https://hmmartiny.github.io/mARG.

Funding Statement

This work was supported by the European Union’s Horizon H2020 grant VEO (874735) and the Novo Nordisk Foundation (grant NNF16OC0021856: Global Surveillance of Antimicrobial Resistance). HMM, PM, CB, TNP, and FMA were all supported by both grants. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Arita M., Karsch-Mizrachi I., Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res. (2021) 49, D121. doi: 10.1093/nar/gkaa967
2. Leinonen R. et al. The European nucleotide archive. Nucleic Acids Res. (2011) 39, 44–47.
3. Shao L., Liao J., Qian J., Chen W., Fan X. MetaGeneBank: a standardized database to study deep sequenced metagenomic data from human fecal specimen. BMC Microbiol. (2021) 21, 1–12.
4. Almeida A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. (2021) 39, 105–114. doi: 10.1038/s41587-020-0603-3
5. Cuadrat R. R. C., Sorokina M., Andrade B. G., Goris T., Dávila A. M. R. Global ocean resistome revealed: Exploring antibiotic resistance gene abundance and distribution in TARA Oceans samples. Gigascience. (2020) 9, 1–12. doi: 10.1093/gigascience/giaa046
6. Hendriksen R. S. et al. Global monitoring of antimicrobial resistance based on metagenomics analyses of urban sewage. Nat Commun. (2019) 10. doi: 10.1038/s41467-019-08853-3
7. Fresia P. et al. Urban metagenomics uncover antibiotic resistance reservoirs in coastal beach and sewage waters. Microbiome. (2019) 7, 1–9.
8. Zhou Z., Alikhan N. F., Mohamed K., Fan Y., Achtman M. The EnteroBase user’s guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity. Genome Res. (2020) 30, 138–152. doi: 10.1101/gr.251678.119
9. Khare S. et al. GISAID’s Role in Pandemic Response. China CDC Wkly. (2021) 3, 1049–1051. doi: 10.46234/ccdcw2021.255
10. Blackwell G. A. et al. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol. (2021) 19, e3001421. doi: 10.1371/journal.pbio.3001421
11. Fierer N. et al. Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proc Natl Acad Sci U S A. (2012) 109, 21390–21395. doi: 10.1073/pnas.1215210110
12. Gill S. R. et al. Metagenomic analysis of the human distal gut microbiome. Science. (2006) 312, 1355–1359. doi: 10.1126/science.1124234
13. Al-Shayeb B. et al. Clades of huge phages from across Earth’s ecosystems. Nature. (2020) 578, 425–431. doi: 10.1038/s41586-020-2007-4
14. Nieuwenhuijse D. F. et al. Setting a baseline for global urban virome surveillance in sewage. Sci Rep. (2020) 10, 1–13.
15. Liu P., Chen W., Chen J. P. Viral Metagenomics Revealed Sendai Virus and Coronavirus Infection of Malayan Pangolins (Manis javanica). Viruses. (2019) 11, 979. doi: 10.3390/v11110979
16. Forsberg K. J. et al. Bacterial phylogeny structures soil resistomes across habitats. Nature. (2014) 509, 612–616. doi: 10.1038/nature13377
17. Martiny H.-M. et al. Global distribution of mcr gene variants in 214,095 metagenomic samples. mSystems. (2022). doi: 10.1128/msystems.00105-22
18. Bushnell B. BBMap. (2014).
19. Clausen P. T. L. C., Aarestrup F. M., Lund O. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics. (2018) 19, 1–8.
20. Zankari E. et al. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother. (2012) 67, 2640–2644. doi: 10.1093/jac/dks261
21. Quast C. et al. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Res. (2013) 41, 590–596. doi: 10.1093/nar/gks1219
22. Gillies S. et al. Shapely: manipulation and analysis of geometric objects. (2007).
23. Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. (2012) 40, D136–D143. doi: 10.1093/nar/gkr1178
24. Gloor G. B., Macklaim J. M., Pawlowsky-Glahn V., Egozcue J. J. Microbiome datasets are compositional: And this is not optional. Front Microbiol. (2017) 8, 1–6.
25. Shannon C. E. A mathematical theory of communication. Bell Syst Tech J. (1948) 27, 379–423.
26. Jost L. Entropy and diversity. Oikos. (2006) 113, 363–375.
27. Zhang A. N. et al. An omics-based framework for assessing the health risk of antimicrobial resistance genes. Nat Commun. (2021) 12, 1–11.
28. Zhang Z. et al. Assessment of global health risk of antibiotic resistance genes. Nat Commun. (2022) 13. doi: 10.1038/s41467-022-29283-8
29. Karkman A., Berglund F., Flach C. F., Kristiansson E., Larsson D. G. J. Predicting clinical resistance prevalence using sewage metagenomic data. Commun Biol. (2020) 3, 1–10.
30. Aitchison J. The Statistical Analysis of Compositional Data. J R Stat Soc Ser B. (1982) 44, 139–160.
31. Quinn T. P. et al. A field guide for the compositional analysis of any-omics data. Gigascience. (2019) 8, 1–14. doi: 10.1093/gigascience/giz107
32. Fernandes A. D., Macklaim J. M., Linn T. G., Reid G., Gloor G. B. ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq. PLoS ONE. (2013) 8. doi: 10.1371/journal.pone.0067019
33. Friedman J., Alm E. J. Inferring Correlation Networks from Genomic Survey Data. PLoS Comput Biol. (2012) 8, 1–11.
34. Feldgarden M. et al. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. (2021) 11.
35. Alcock B. P. et al. CARD 2020: Antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. (2020) 48, D517–D525. doi: 10.1093/nar/gkz935

Decision Letter 0

Roland G Roberts

19 May 2022

Dear Dr Martiny,

Thank you for submitting your manuscript entitled "A curated data resource of 214K metagenomes for characterization of the global resistome" for consideration as a Methods and Resources by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed the checks it will be sent out for review. To provide the metadata for your submission, please Login to Editorial Manager (https://www.editorialmanager.com/pbiology) within two working days, i.e. by May 23 2022 11:59PM.

If your manuscript has been previously reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process. Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time.

If you would like to send previous reviewer reports to us, please email me at rroberts@plos.org to let me know, including the name of the previous journal and the manuscript ID the study was given, as well as attaching a point-by-point response to reviewers that details how you have or plan to address the reviewers' concerns.

During the process of completing your manuscript submission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland Roberts

Senior Editor

PLOS Biology

rroberts@plos.org

Decision Letter 1

Roland G Roberts

14 Jul 2022

Dear Dr Martiny,

Thank you for your patience while your manuscript "A curated data resource of 214K metagenomes for characterization of the global resistome" was peer-reviewed at PLOS Biology. It has now been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers.

Based on the broadly very favourable reviews, we are likely to accept this manuscript for publication, provided you satisfactorily address the points raised by the reviewers. Please also make sure to address the following data and other policy-related requests.

a) Please address the concerns raised by the three reviewers.

b) Please could you change your Title to something slightly more explicit for our wider readership? We suggest "A curated data resource of 214K metagenomes characterizes the global antimicrobial resistome"

c) Please address my Data Policy requests below; specifically, we need you to supply the numerical values underlying Figs 1ABC, 2AB, 3ABCD, 4ABC, 5, S1AB, S2, S3, S4AB. I note that your Zenodo deposition currently only seems to contain relatively “raw” values, rather than those directly shown in the Figure – please could you include the latter, clearly labelled? If you’ve used any custom code, please also include this.

d) Please also cite the location of the data clearly in each main and supplementary Fig legend, e.g. “The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.6519844”

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

- a track-changes file indicating any changes that you have made to the manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Press*

Should you, your institution's press office or the journal office choose to press release your paper, please ensure you have opted out of Early Article Posting on the submission form. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland Roberts, PhD

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 1ABC, 2AB, 3ABCD, 4ABC, 5, S1AB, S2, S3, S4AB. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

DATA NOT SHOWN?

- Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please either remove mention of these data or provide figures presenting the results and the data underlying the figure(s).

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

Martiny et al. describe a new data resource that is the product of intensive and large-scale bioinformatics analysis of metagenomic data for the presence and abundance of acquired antimicrobial resistance genes (ARGs). This paper can be viewed from two perspectives: (1) a data science contribution to allow the community to better examine AMR transmission patterns and (2) knowledge gained from analysis of the data.

DATA SCIENCE

From the data science perspective, this is a very significant contribution to the field using the latest standards and mining of >200,000 metagenomic datasets, totalling more than 400 TB of sequence data. Considerable effort was put into data harmonization and normalization to provide a high-value data set to AMR researchers. The data, pipelines, software, and results are provided in a well-organized open format, allowing their analysis by the broader community. As the amount of computation needed to produce this data set is well beyond most AMR researchers interested in using genomics to understand ARG transmission patterns, this contribution is novel and of high value.

Software and their versions are properly described in the methods section, but parameters used for KMA are not outlined. Were default parameters used? It is fine that the methods are presented "in brief" given a citation of previous work, but the manuscript would be improved if this included the cut-offs used by KMA to determine aligned reads (MAPQ?). Similarly, a more explicit statement in the methods that use of ResFinder focuses on the analysis of acquired ARGs and does not include resistance via mutation (e.g. PointFinder) would be helpful.

ANALYSIS

Analysis and interpretation of the data are thin, with some issues that need to be addressed, but this does not undermine that the primary purpose of the manuscript is to describe the generation and content of the data produced for the broader community. Full analyses of these data are beyond the scope of this manuscript and the authors perform an adequate overview analysis and summary of the major sub-sets and trends in the data. However, the statement of "a general trend" in lines 227-229 does not appear supported by Figures 5 & S4. This section should be re-written to carefully discuss patterns supported by the data, such as exists for the chicken data, instead of broad statements based on unconvincing patterns in the plots.

The data include ARGs that have as little as a single read fragment aligned and these ARGs were used in the species richness estimates. Can the authors explain why they did not include a minimum coverage cut-off in these analyses?

The results presented are broken down by host (i.e. environment), location, and ResFinder drug class, but not by ARG families. While others are very likely to analyze these data for transmission patterns of ARGs or ARG families, at least an anecdotal investigation of a few ARGs would help illustrate the value of the data. Perhaps something recent like MCR versus the AACs? The "trends" mentioned above may be more obvious at the level of ARG families.

DISCUSSION

Successful annotation of metagenomics data for ARGs requires both good software for sequencing read alignment and good reference data. Both KMA and ResFinder reflect the latest standards but like CARD and other databases, ResFinder's reference data is primarily from clinical isolates. It is possible there are ARGs in the environmental metagenomics data that are sufficiently different from these reference data that KMA fails to align them. CARD has its "Resistomes & Variants" data to provide an alternate in silico diversity of >200,000 ARG alleles for sequence read alignment. I'm not suggesting a re-analysis of these data with a broader in silico reference sequence collection, but I think the discussion should address this possible bias, i.e. false-negative results for divergent ARGs because of the algorithm/reference choice.

As mentioned above, the manuscript contains little assessment of ARG transmission patterns, which is fine as it was not the major point of the paper, but Zhang et al. (PMID 34362925; Nature Communications 12: 4765) & Zhang et al. (PMID 35322038; Nature Communications 13: 1553) have recently published some large scale ARG metagenomic analyses that included assessment of ARG transmission and generation of risk metrics. At a minimum, the discussion should place the authors' work in the context of these recent efforts.

As mentioned above, the data include ARGs that have as little as a single read fragment aligned. The authors should add a statement that they are including these data for complete transparency so others can decide their own cut-offs when analyzing the data.
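The cut-off choice left to users of the data can be illustrated with a short sketch. The Python below is a hypothetical example, not the authors' code: the function name `arg_richness`, the record layout, and the sample values are all assumptions made for illustration. It counts distinct ARG templates per sample after discarding templates with fewer than a chosen number of aligned fragments, assuming each (sample, template) pair appears only once.

```python
from collections import defaultdict


def arg_richness(rows, min_fragments=1):
    """Count distinct ARG templates per sample.

    rows: iterable of (sample_id, arg_template, aligned_fragments) tuples,
          assumed to contain each (sample, template) pair at most once.
    min_fragments: user-chosen cut-off; templates below it are ignored.
    """
    richness = defaultdict(int)
    for sample_id, _template, fragments in rows:
        if fragments >= min_fragments:
            richness[sample_id] += 1
    return dict(richness)


# Hypothetical records, invented purely for illustration.
rows = [
    ("sampleA", "tet(W)", 120),
    ("sampleA", "blaTEM-1", 1),
    ("sampleA", "mcr-1", 3),
    ("sampleB", "tet(W)", 1),
]

print(arg_richness(rows, min_fragments=1))
print(arg_richness(rows, min_fragments=2))
```

Raising `min_fragments` from 1 to 2 drops single-fragment hits, so a sample whose only evidence is one fragment no longer appears in the result.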

No information is given in the discussion on the long-term maintenance of this resource. What is the plan as new metagenomics data become available? CARD has (beta) pathogen-of-origin kmer tools for ARGs, will the authors be exploring similar methods to provide a more pathogen-centric perspective in future analyses?

MINOR POINTS

Figure 1C caption should mention the number of samples for which the collection date was NULL.

Lines 210-214 have very confusing grammar.

The phrase "ARG template" is used without proper definition.

Reviewer #2:

The resource presented by Martiny et al. is timely and has the potential to boost research on antimicrobial resistance by widening access. The methods are presented clearly, and the datasets are made publicly available. There are a few points that I recommend the authors consider to improve the quality by cross-checking some of the analyses.

1. Given the impressive volume and the breadth of the data analysed, it is quite surprising that 96 ARGs did not have any alignments. A cross-checking of these results and any indicators of the underlying reasons (e.g., these ARGs being very specific to the environments not being represented here?) will be important.

2. Figure 4c, what do the rows 'Metagenome' and 'Metagenomes' refer to? Some error in metadata curation?

3. Fosfomycin ARG (green in Figure 4C) seems quite high in Food Metagenomes, while it is barely present in panels A and B. I suppose this indicates uneven distribution of sampling 'host'? Also, is there a known connection between food microbiomes and fosfomycin resistance? BTW, 'environment' will be a much better and more accurate term than 'host'.

Minor comments:

1. Line 81: uploaded between 2010-01-01 and 2020-01-01 that had library source as 'METAGEOMIC'

2. Inconsistent line spacing starting on page 5 line 135 through page 6 line 152

3. Any data on how sequencing depth affects detection of ARGs or 16S/18S genes? For example, Fig 2A could be converted from a density plot to a scatterplot showing sequencing depth vs fragments

4. Figure 3, what do different colour shades mean for boxes?

Reviewer #3:

[Note: because of other commitments, this reviewer was only able to give us the following preliminary comments; we hope that they will nevertheless be useful]

My opinion on that paper is that it's a valuable analysis, and seems to have been done carefully. I had only two technical quibbles:

1. The manuscript does not explain how the analysis workflow handles two key issues. First: AMR databases are full of different versions of the same gene - e.g. there are more than 170 allelic versions of the CTX-M gene. Were all reads mapped to a database containing all of these, or were representatives chosen? If mapping to everything, what was done with reads that mapped to multiple alleles of one gene, and how were counts resolved? Second: I don't understand why, when calculating abundance, using counts of reads mapping to ribosomal RNA as a denominator makes sense, as rRNA arrays are different lengths in different species.

2. The text seems to suggest the same mapping workflow was used for Nanopore, PacBio, and Illumina. Is this really true? The same kmer size also? If yes, a lot of sensitivity will have been lost in the long read data, although since this is <10% of the data, this is not really a big issue.
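To make the denominator question in point 1 concrete, the sketch below shows one common way an rRNA-normalized abundance can be computed: fragments per kilobase of reference gene per million rRNA fragments. This is an assumed, FPKM-style formula chosen for illustration; the function name `arg_fpkm` and the numbers are hypothetical, and the manuscript's exact calculation is not reproduced here.

```python
def arg_fpkm(arg_fragments, gene_length_bp, rrna_fragments):
    """FPKM-style abundance: ARG fragments per kilobase of gene
    per million rRNA fragments in the same sample (assumed formula)."""
    if gene_length_bp <= 0 or rrna_fragments <= 0:
        raise ValueError("gene length and rRNA fragment count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return arg_fragments * 1e9 / (gene_length_bp * rrna_fragments)


# Hypothetical sample: 50 fragments on a 1,000 bp ARG, 1,000,000 rRNA fragments.
print(arg_fpkm(50, 1000, 1_000_000))
```

The gene-length term corrects for longer ARGs attracting more fragments, but nothing in this formula corrects for differences in rRNA gene length or copy number between taxa, which is the concern raised above.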

I also had one red flag: Given the high rate of metadata errors in the ENA, I am suspicious of the samples dated between 1845 and 1905 in Figure 3 - is there a way to check these? If there is no associated publication discussing old metagenomes, I would honestly consider discarding those datapoints as mislabelled.

Figure 4 is great, v interesting!

Decision Letter 2

Roland G Roberts

9 Aug 2022

Dear Dr Martiny,

Thank you for the submission of your revised Methods and Resources "A curated data resource of 214K metagenomes for characterization of the global antimicrobial resistome" for publication in PLOS Biology. On behalf of my colleagues and the Academic Editor, Tobias Bollenbach, I'm pleased to say that we can in principle accept your manuscript for publication, provided you address any remaining formatting and reporting issues. These will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed any requested changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely,

Roli Roberts

Roland G Roberts, PhD

Senior Editor

PLOS Biology

rroberts@plos.org

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Distribution of samples per sequencing instrument platform.

    (a) Sample count per platform. (b) Distribution of raw sequencing read counts per platform. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

    (TIFF)

    S2 Fig. More than 96% of ARG templates had at least 1 aligned fragment.

    The bars illustrate the percentage of ARGs per resistance class without and with at least 1 aligned fragment. The parentheses after each class label contain the number of genes found out of the total available templates. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

    (TIFF)

    S3 Fig. The sample-wise distribution of aligned (a) ARG or (b) rRNA fragments compared to raw sequencing base counts.

    The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

    (TIFF)

    S4 Fig. The sample-wise distribution of aligned rRNA fragments and ARG fragments, colored by selected host and environmental sources.

    The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

    (TIFF)

    S5 Fig. Additional distributions showing the relationship between ARGs and genera for all metagenomic samples.

    (a) The richness of genus groups (x-axis) vs. ARG richness (y-axis). (b) The relationship between Shannon diversity index calculated on genus level (x-axis) and ARGs (y-axis). Right: samples colored by selected host or environmental origins. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

    (TIFF)

    Attachment

    Submitted filename: response_PLOS_Biology.pdf

    Data Availability Statement

    The code to produce the figures is available at https://github.com/hmmartiny/mARG. The data has been deposited at https://doi.org/10.5281/zenodo.6919377, and documentation of the various tables can be accessed at https://hmmartiny.github.io/mARG.


    Articles from PLoS Biology are provided here courtesy of PLOS
