Scientific Data. 2025 Dec 3;12:1896. doi: 10.1038/s41597-025-06191-2

Reconstruction of 1,979 prokaryotic metagenome-assembled genomes from 37 global cave environments

Huihong Li 1,#, Yuping Cao 2,3,#, Xueke Liu 2,3, Zelin Ke 2,3, Liang Chen 4, Bupe A Siame 5, Sima Yaron 2,3, Ka Yin Leung 2,3
PMCID: PMC12675795  PMID: 41339358

Abstract

Cave microorganisms represent unique extremophiles that have evolved in isolated, nutrient-limited environments and harbor exceptional metabolic capabilities. However, knowledge of cave microbial diversity at the genomic level remains limited. Previous studies have focused on individual caves and do not give a global picture. Here, we present the first prokaryotic cave metagenomic catalog from 37 geographically diverse cave environments. We employed an optimized genome reconstruction pipeline to recover 3,837 medium-to-high-quality cave metagenome-assembled genomes (MAGs). These MAGs were dereplicated into 1,979 species-level representative clusters that spanned 67 phyla of the Bacteria (n = 1,858) and Archaea (n = 121) domains. Classification of representative species showed that 98.7% did not match any named species in the genome taxonomy database at ≥ 95% average nucleotide identity (ANI). Most representative genomes harbored putative biosynthetic gene clusters (BGCs) (98.0%) and enzymatic antibiotic resistance genes (ARGs) (95.0%). This comprehensive MAGs catalog provides a foundational resource for exploring cave microbial diversity, secondary metabolism, and the evolutionary origins of antibiotic resistance in subterranean ecosystems.

Subject terms: Microbial ecology, Antimicrobial resistance, Data mining, Metagenomics

Background & Summary

Cave environments represent unique ecological niches characterized by extreme conditions, geographic isolation, and pristine states that have persisted for millennia1. These subterranean ecosystems, particularly the permanent dark zones, are characterized by stable temperatures and humidity, severe nutrient limitations, and the absence of photosynthetic activity2,3. Such extreme and isolated conditions mirror a remote moment in planetary history, leading to the evolution of highly specialized microbial communities with distinctive metabolic capabilities, including the production of novel secondary metabolites and diverse stress resistance mechanisms4–6. As such, cave microorganisms offer valuable insights into microbial adaptation, evolution, and biotechnological potential.

Recent studies have uncovered cave-derived microorganisms capable of producing novel secondary metabolites, including antibiotic, antifungal, anticancer, and antiviral compounds7–9. However, despite growing interest in cave microbiomes, most genomic studies have been restricted to individual caves or specific regions, with limited metagenomic analyses. The limited scale and quality of genome reconstruction efforts have also resulted in fragmented knowledge of global cave microbial diversity10–14. Furthermore, previous cave MAG studies were constrained by limitations in reconstruction workflows, reliance on single binning algorithms, absence of dereplication procedures, and variable quality standards, resulting in low recoveries of high-quality genomes15–23. Thus, the lack of comprehensive genomic catalogs has limited our understanding of the global diversity, metabolic potential, and evolutionary trajectories of cave microorganisms. Moreover, the absence of standardized, large-scale genome reconstruction efforts has hindered comparative analyses across different cave systems and the identification of core versus site-specific microbial functions.

The main objective of this study was to build a comprehensive cave prokaryotic MAGs catalog by leveraging accumulated metagenomic data using advanced genome reconstruction techniques. Our approach had two aims: (1) compiling a geographically and ecologically diverse collection of cave metagenomes, and (2) maximizing genome recovery using state-of-the-art multi-tool binning and refinement approaches24. Here, we provide a cave metagenomic catalog from 36 published metagenomic datasets encompassing 37 distinct cave environments from different geographical locations and cave sample types that included biofilms (n = 75), sediments (n = 87), minerals (n = 37), water (n = 40) and air (n = 2) (Fig. 1 and Table 1).

Fig. 1. Overview of the study. (a) Origins of the metagenome samples; details are given in the Methods. (b) Bioinformatics pipelines used for analyses. (c) Geographic distribution of the representative 1,979 MAGs from 37 caves, with sample sizes and cave microhabitats.

Table 1. List of public data used and cave information.

Name | Bioproject | Related publication (selected) | Type(a) | Samples | QCed reads (Gbp) | Raw MAGs
Acquasanta | PRJNA259227 | Hamilton et al.49 | Karst | 1 | 34.04 | 91
Sarcophagus | PRJNA1052463 | Vojvoda et al.17; Kajan et al.50 | Anchialine | 12 | 71.55 | 561
Bundera Sinkhole | PRJNA911846 | Ghaly et al.23 | Anchialine | 12 | 3.53 | 59
Baram | PRJNA946675 | Vanghi et al.51 | Karst | 3 | 5.93 | 10
Borra | PRJNA898384 | Samanta et al.13 | Karst | 2 | 7.5 | 46
Cango | PRJNA982691 | Babalola et al.52 | Karst | 3 | 6.89 | 21
Carlsbad | PRJNA1159728 | Ulbrich et al.53 | Karst | 4 | 9.13 | 3
Clara | PRJNA487362 | Rodriguez-Ramos et al.54 | Karst | 1 | 0.24 | 5
El Mirón | PRJEB55583 | Klapper et al.55 | Karst | 6 | 6.02 | 12
Frasassi | PRJNA1148974 | | Karst | 2 | 31.22 | 263
 | PRJNA259227 | Hamilton et al.49 | | 7 | 147.2 | 473
Golden Dome | PRJNA1189542 | Maggiori et al.56 | Lava | 1 | 0.25 | 1
Harman 1 | PRJNA1048116 | Bay et al.21 | Lava | 13 | 21.12 | 69
Ice | PRJNA1099162 | | Glacier | 2 | 1.65 | 10
Kashmir | PRJNA741999 | Zada et al.57 | Karst | 1 | 4.54 | 15
Kipuka Kanohina | PRJNA465923 | | Lava | 1 | 11.91 | 49
 | PRJNA465921 | | | 1 | 10.36 | 106
 | PRJNA465922 | | | 1 | 11.73 | 118
Labirinto | PRJEB14680 | | Karst | 3 | 2.17 | 1
Mauna Loa lava tube | PRJNA818798 | Gadson et al.18 | Lava | 3 | 3.06 | 42
Mawsmai | PRJNA551378 | | Karst | 1 | 0.75 | 3
Sulzbrunn | PRJNA1085763 | Karwautz et al.58 | Karst | 3 | 9.24 | 104
Monte Cristo | PRJNA642310 | Bendia et al.15 | Karst | 3 | 13.06 | 128
Movile | PRJEB12283 | Kumaresan et al.10 | Karst | 2 | 0.34 | 0
 | PRJNA673084 | Bizic et al.59 | | 2 | 0.01 | 0
 | PRJNA1028180 | Saati-Santamaría et al.60 | | 3 | 3.22 | 33
 | PRJNA777757 | Chiciudean et al.16 | | 7 | 138.35 | 800
 | PRJEB12283 | Kumaresan et al.10 | | 2 | 0.19 | 2
 | PRJNA1028180 | Saati-Santamaría et al.60 | | 6 | 6.52 | 12
 | PRJNA673084 | Bizic et al.59 | | 5 | 0.32 | 4
Ondal | PRJNA946675 | Vanghi et al.51 | Karst | 11 | 28.66 | 34
Sterkfontein | PRJNA982608 | Babalola et al.61 | Karst | 3 | 7.35 | 34
Sanjidang | PRJNA946675 | Vanghi et al.51 | Karst | 4 | 9.05 | 25
Scrubby Creek | PRJNA1048116 | Bay et al.21 | Karst | 21 | 40.22 | 56
Sea | PRJNA1275038 | | Anchialine | 11 | 52.74 | 65
Seodae | PRJNA946675 | Vanghi et al.51 | Karst | 2 | 4.98 | 4
Shades of Death | PRJNA1048116 | Bay et al.21 | Karst | 26 | 43.93 | 217
Sof Umer | PRJNA1082540 | Meka et al.62 | Karst | 1 | 5.22 | 6
Sulfur | PRJEB45004 | Van Spanning et al.22 | Karst | 2 | 29.98 | 31
Sulzbrunn | PRJNA825327 | Zhu et al.63 | Karst | 1 | 3.79 | 45
Switzerland | PRJEB56878 | Moncadas et al.64 | Karst | 1 | 31.92 | 92
Tiser | PRJNA741999 | Zada et al.57 | Karst | 4 | 17.7 | 71
Triangle | PRJNA465920 | Parker et al.19 | Karst | 1 | 6.26 | 29
Tunnel | PRJNA1048116 | Bay et al.21 | Lava | 36 | 54.18 | 215
Wind | PRJNA441392 | Back et al.65 | Karst | 1 | 5.03 | 44
Wishing Well | PRJNA518405 | | Karst | 2 | 24.14 | 219
Yumugi river | PRJNA616285 | Turrini et al.66 | Karst | 1 | 0.26 | 1

(a) Five cave types are defined: karst or solution, anchialine or sea, lava or lava tube, glacier or ice, and eolian; https://nckri.org/caves/types/.

We used an optimized reconstruction workflow according to Han et al. to maximize the recovery of draft genomes and to avoid potential cross-contamination between datasets24. This workflow was based on individual assemblies (MEGAHIT v1.2.9), followed by multi-tool binning in single-sample mode (MetaBAT2 v2.18, SemiBin2 v2.2.0, MetaDecoder v1.2.1) and refinement (MAGScoT v1.1)25–29. We recovered 4,234 refined draft MAGs from 25,276 candidate bins generated by three complementary binning tools. Among these genomes, 3,837 met the medium-to-high-quality MIMAG threshold (≥50% completeness, ≤10% contamination)30. Dereplication at 95% average nucleotide identity (ANI) (species level) yielded 1,979 representative genomes (Fig. 2a). Of these, 1,142 (57.7%) genomes were of high quality (≥90% completeness, ≤5% contamination).

Fig. 2. Cave prokaryotic genome catalogs and encoded secondary-metabolite / resistance potential. (a) Circular phylogenomic tree of 1,979 non-redundant species-level representative MAGs. The maximum likelihood phylogenetic tree (center) was reconstructed using concatenated alignment of 120 bacterial and 53 archaeal GTDB marker genes. Concentric rings (inner to outer) display: MAG completeness and contamination, genome size, species novelty (≥95% ANI are marked), presence/abundance of putative novel ARGs, domain classification (Bacteria/Archaea), and phylum assignment (colored outer ring; legend at left shows phyla with counts). Three outer radial quantitative tracks show per-genome: BGC counts (red), ARG diversity (number of unique ARGs; blue), and putative novel ARG counts (teal). (b) Read recruitment of quality-controlled metagenomic reads to representative MAGs across all samples and by sample types. Boxplots show median (center line), interquartile range (box), 1.5 × IQR whiskers, and individual data points. Sample sizes (n) indicated below each group. (c) Distribution of BGCs per MAG showing total BGCs, complete (non-edge) BGCs, and major antiSMASH categories. Boxplots as in (b) with overlaid jittered points representing individual genomes. (d) Distribution of ARG counts per MAG (DRAMMA predictions, stringent threshold prob > 0.99) grouped by resistance class/category. Horizontal boxplots with individual MAG values (dots) show per-genome ARG load. “Novel” denotes unassigned (putatively novel) ARG predictions meeting probability criteria.

Taxonomic classification based on the genome taxonomy database (GTDB) (release R226)31 showed that the cave MAGs catalog (n = 1,979) encompassed two domains (Bacteria, n = 1,858; Archaea, n = 121) (Fig. 2a). At the phylum level, MAGs were assigned to 51 established phyla (n = 1,869) and 16 candidate phyla (n = 108), and 2 remained unclassified. The dominant phyla were Pseudomonadota (n = 494; 24.9%), Actinomycetota (n = 239; 12.1%), Acidobacteriota (n = 196; 9.9%), Planctomycetota (n = 135; 6.8%), and Bacteroidota (n = 118; 6.0%) (Fig. 2a). Classification of MAGs at the class level yielded 93 established classes containing most genomes (n = 1,770) and 60 candidate classes (n = 209). Order-level classification placed the MAGs into 175 recognized orders (n = 1,477) and 180 novel orders (n = 496), with 6 unclassifiable at this rank. Family-level classification revealed 216 published families (n = 954) and 391 candidate families (n = 976), and 49 MAGs were not assigned to any family. Genus-level classification revealed 196 named genera (n = 379) and 756 new genera (n = 1,185), with 415 MAGs lacking genus designations. Species-level resolution identified 25 MAGs matching valid species, 296 assigned to candidate species, and 1,658 belonging to unnamed species. Thus, the cave MAGs catalog revealed extensive taxonomic novelty, comprising 16 previously undescribed phyla, 60 new classes, 180 new orders, 391 new families, 756 new genera and 1,954 new species. Furthermore, this MAGs catalog collectively recruited a median of 55.9% of metagenomic reads across the different cave samples (biofilms, sediments, minerals, water, and air), indicating substantial coverage of cave prokaryotic diversity (Fig. 2b) (see Technical Validation).
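The per-rank tallies above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the paragraph, while the dictionary layout and function names are hypothetical and are not part of the published pipeline.

```python
# Per-rank tallies quoted in the text, in the order
# (established/named, candidate/novel, unassigned at that rank).
RANK_TALLIES = {
    "phylum": (1869, 108, 2),
    "class": (1770, 209, 0),
    "order": (1477, 496, 6),
    "family": (954, 976, 49),
    "genus": (379, 1185, 415),
    "species": (25, 296, 1658),
}
TOTAL_REPRESENTATIVES = 1979


def inconsistent_ranks(tallies, total):
    """Return the ranks whose three tallies do not add up to the catalog size."""
    return [rank for rank, counts in tallies.items() if sum(counts) != total]


def novel_species_fraction(tallies, total):
    """Fraction of representatives that did not match a validly named species."""
    _named, candidate, unnamed = tallies["species"]
    return (candidate + unnamed) / total


if __name__ == "__main__":
    print(inconsistent_ranks(RANK_TALLIES, TOTAL_REPRESENTATIVES))
    print(f"{novel_species_fraction(RANK_TALLIES, TOTAL_REPRESENTATIVES):.1%}")
```

With the published numbers, every rank sums to 1,979 and the species-level novelty comes out at 98.7%, the figure reported in the Abstract.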

We applied antiSMASH 8.0.0 (ref. 32) to interrogate the biosynthetic potential of encoded genes in the cave MAGs catalog. AntiSMASH detected 12,684 biosynthetic gene clusters (BGCs) in 1,940 (98.0%) genomes (median 5, IQR 3–8, range 1–90 per genome) with 2,150 complete/internal BGCs (antiSMASH contig_edge = False) (Fig. 2c). Across all clusters, the major BGC categories were for terpene synthesis (n = 4,188; 33.0%), ribosomally synthesized and post-translationally modified peptides (RiPP) (n = 3,479; 27.4%), non-ribosomal peptide synthetases (NRPS) (n = 2,176; 17.2%), polyketide synthases (PKS) (n = 1,630; 12.9%), other (n = 1,205; 9.5%), and saccharide synthesis (n = 6; <0.1%) (Fig. 2c). We then profiled antibiotic resistance genes (ARGs) using DRAMMA33, a probability-calibrated machine learning classifier trained exclusively on direct resistance determinants and intentionally excluding broad efflux pumps, regulatory elements, and point-mutation–mediated resistance from its positive training set. This design focused predictions on loci with intrinsic mechanistic potential rather than indirect tolerance systems for adaptation to extreme environments. Across the 1,979 genomes, DRAMMA predicted 68,505 ARGs at the inclusive probability threshold (prob ≥ 0.75), spanning all genomes (Fig. 2d). Increasing stringency reduced the set to 26,460 predictions at prob ≥ 0.95 (present in 1,963 genomes; 99.2%) and 11,121 at prob ≥ 0.99 (present in 1,881 genomes (95.0%); ~4 ARGs per genome) (Fig. 2d). Putative novel ARG candidates (prob ≥ 0.75 with an unassigned label = 0 under DRAMMA’s suggestion) numbered 1,789 across 1,021 genomes (51.6%) (Fig. 2d). In summary, the cave MAGs catalog encodes a broad reservoir of secondary metabolite and novel resistance determinants. Thus, the MAGs catalog can provide a quantitative baseline for further experimental exploration of new secondary metabolites or antibiotic resistance genes.
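As a worked check of the category percentages above, the short sketch below recomputes each share from the reported counts. The counts come from the text; the variable and function names are hypothetical.

```python
# BGC counts per antiSMASH category, as reported in the text.
BGC_COUNTS = {
    "terpene": 4188,
    "RiPP": 3479,
    "NRPS": 2176,
    "PKS": 1630,
    "other": 1205,
    "saccharide": 6,
}


def category_shares(counts):
    """Return each category's share of all BGCs, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    print("total BGCs:", sum(BGC_COUNTS.values()))
    for name, share in category_shares(BGC_COUNTS).items():
        print(f"{name}: {share:.1f}%")
```

The six categories add up to the 12,684 clusters reported, and the rounded shares agree with the percentages quoted above.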

Overall, this Cave MAGs catalog provides the first comprehensive, non-redundant genomic resource for cave prokaryotes. These genomes can be used for further investigations into novel metabolic pathways, secondary metabolite discovery, antibiotic resistance origins and evolution, as well as in the characterization of microbial ‘dark matter’ lineages that have adapted to one of Earth’s most extreme and isolated environments.

Methods

Collection of metagenomes

We compiled a comprehensive dataset of cave metagenomes derived from diverse cave systems across multiple geographic regions and cave zones (Fig. 1, Table 1). On May 1, 2025, raw cave metagenomic sequences were collected through a systematic literature and database search using Google Scholar, PubMed, Web of Science Core Collection, and Science Direct, with the keywords “cave metagenome” and “cave microbiome”. Additionally, metagenomes were retrieved from the NCBI Sequence Read Archive (SRA) and EMBL-EBI European Nucleotide Archive (ENA) using the search term “cave metagenome”. The collected datasets were filtered to retain only research articles with publicly accessible metagenomes. Pure culture sequences and 16S rDNA amplicon sequencing data were excluded from the analysis. Studies or Bioprojects based on bat guano, ancient DNA from caves, and cave paintings were also excluded as they represent specific niches with distinct microbial communities34–36. These metagenomes were organized into five sample types, 36 datasets and 37 distinct cave systems based on related publications, research groups, and geographic locations. Associated metadata including cave locations, sampling dates, accession numbers, and other environmental parameters are available in the original publications and the BioSample database.

Sequence assemblies and metagenome binning

All the downloaded metagenomes were processed for quality checking and filtering to acquire clean sequences using fastp v1.0.1 (ref. 37). fastp parameters included adapter detection for paired-end reads (--detect_adapter_for_pe), a quality threshold of 20 (-q 20), an unqualified base limit of 25% (-u 25), a maximum of 5 N bases (-n 5), length requirements of 50–300 bp (--length_required 50 --length_limit 300), tail trimming with window size 4 and mean quality 20, poly-G and poly-X trimming, and read deduplication at accuracy level 3. Low-quality regions caused by instability in the sequencing platform were also trimmed using fastp. Reports on sequence quality are available in figshare38. Clean reads were assembled on a sample-by-sample basis using MEGAHIT v1.2.9 with the meta-sensitive preset and a minimum contig length of 1,500 bp (ref. 25). Assembly statistics were calculated for all samples. Coverage profiles were generated by mapping quality-controlled reads back to assemblies using Bowtie2 v2.5.4 with default parameters39. SAM files were converted to sorted BAM format using samtools v1.22 (ref. 40). BAM files were indexed for downstream analysis.
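For readers who want to reproduce the read-cleaning step, the sketch below assembles a fastp command line from the parameters listed above. It is an illustration under stated assumptions, not the authors' script: the text only spells out some flags, so the long option names used here for tail trimming, poly-G/poly-X trimming and deduplication are our mapping of the described settings onto fastp options, and all file names are hypothetical.

```python
import shlex


def fastp_command(r1, r2, out1, out2):
    """Build a fastp argument list for one paired-end sample (illustrative only)."""
    return [
        "fastp",
        "-i", r1, "-I", r2,          # paired-end input (hypothetical paths)
        "-o", out1, "-O", out2,      # cleaned output (hypothetical paths)
        "--detect_adapter_for_pe",   # adapter detection for paired-end reads
        "-q", "20",                  # base quality threshold
        "-u", "25",                  # max percent of unqualified bases per read
        "-n", "5",                   # max number of N bases per read
        "--length_required", "50",   # discard reads shorter than 50 bp
        "--length_limit", "300",     # discard reads longer than 300 bp
        # Assumed flag mapping for "tail trimming with window size 4 and mean quality 20":
        "--cut_tail", "--cut_tail_window_size", "4", "--cut_tail_mean_quality", "20",
        # Assumed flag mapping for poly-G and poly-X trimming:
        "--trim_poly_g", "--trim_poly_x",
        # Assumed flag mapping for deduplication at accuracy level 3:
        "--dedup", "--dup_calc_accuracy", "3",
    ]


if __name__ == "__main__":
    cmd = fastp_command("sample_1.fastq.gz", "sample_2.fastq.gz",
                        "clean_1.fastq.gz", "clean_2.fastq.gz")
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps each argument separate, which avoids quoting problems when sample paths contain spaces.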

Single-sample metagenome binning was performed using a multi-algorithm approach with three complementary binning tools according to Han et al. to maximize the recovery of draft genomes24. MetaBAT2 v2.18 was run with the parameters --minContig 2500 and --minClsSize 200000 (ref. 26). SemiBin2 v2.2.0 was executed in single_easy_bin mode with the global environment setting27. MetaDecoder v1.2.1 was used with a three-step approach: coverage calculation, seed generation, and clustering28. Additionally, bin refinement was performed using MAGScoT v1.1 with a maximum contamination threshold of 1% and a consensus threshold of 0.5 (ref. 29). Genome bins were subjected to quality assessment using CheckM2 v1.1.0 with the predict workflow41. Genomes with ≥50% completeness and ≤10% contamination were retained for further dereplication30. Species-level dereplication was performed using dRep v3.6.2 with the following parameters: primary ANI threshold of 90% (-pa 0.90), secondary ANI threshold of 95% (-sa 0.95), and fastANI as the similarity algorithm42. This two-step clustering approach first identified representatives within each metagenome division, then performed secondary clustering across all representatives to identify the final set of 1,979 species-level representative genomes.
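The quality thresholds used for retention and tiering can be expressed as a small helper. This is a minimal sketch based only on the completeness and contamination cut-offs stated in this paper; the full MIMAG high-quality definition additionally requires rRNA and tRNA genes, which this function does not check. The function name and return labels are hypothetical.

```python
def quality_tier(completeness, contamination):
    """Assign a quality tier from CheckM2-style percentages.

    Uses only the thresholds quoted in the text:
    high   = completeness >= 90 and contamination <= 5
    medium = completeness >= 50 and contamination <= 10
    Anything else is labeled "low" and would not be retained.
    """
    if completeness >= 90 and contamination <= 5:
        return "high"
    if completeness >= 50 and contamination <= 10:
        return "medium"
    return "low"


def retained_for_dereplication(completeness, contamination):
    """True if a bin passes the medium-to-high-quality filter."""
    return quality_tier(completeness, contamination) in ("high", "medium")


if __name__ == "__main__":
    for comp, cont in [(95.2, 1.3), (72.0, 8.5), (48.0, 2.0), (91.0, 7.0)]:
        print(comp, cont, quality_tier(comp, cont))
```

Note that a genome with 91% completeness but 7% contamination falls into the medium tier, because both conditions must hold for the high tier.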

Taxonomic assignment and phylogenetic reconstruction

Taxonomic classification was performed using GTDB-Tk v2.4.1 (GTDB release R226) with the classify workflow31, skipping ANI screening (--skip_ani_screen) to ensure comprehensive classification of potentially novel lineages. Phylogenetic reconstruction of representative MAGs was performed using the aligned concatenated marker genes generated by GTDB-Tk. Trees were constructed using IQ-TREE v3.0.1 with model selection (-m MFP), 1,000 bootstrap replicates (-B 1000), and 1,000 replicates for the SH-like approximate likelihood ratio test (-alrt 1000) and visualized with iTOL v7 (refs. 43,44).

Annotation of ARGs and BGCs

BGCs were identified using antiSMASH v8.0.0 (ref. 32). ARGs were predicted using the DRAMMA pipeline33. Open reading frames were predicted with Prodigal v2.6.3 in metagenomic mode (-p meta), producing protein FASTA (.faa), coding-sequence FASTA (.ffn), and gene coordinates (.gff)45. DRAMMA used the Prodigal output as input and was run with an E-value threshold of 1e-5, a gene window of 20, and a nucleotide window of 20,000 bp for enzymatic and novel ARG detection.

Data Records

The figshare record38 contains three genome archives at the root and two folders (Analysis/ and QCreport/) with tables and quality-control reports. Representative genomes are provided separately from the full refined set for users to either work with the non-redundant catalog or explore the complete, pre-dereplication space.

Representative genome archives

MAGs_HQ.tar.zst and MAGs_MQ.tar.zst together comprise the 1,979 species-level representative MAGs used for the phylogeny and read-recruitment analyses. The HQ archive contains genomes meeting high-quality MIMAG criteria (≥90% completeness, ≤5% contamination), and the MQ archive contains genomes meeting medium-quality criteria (≥50% completeness, ≤10% contamination). Each genome is provided as a FASTA file (.fa) with a stable filename identifier used throughout the tables, following the pattern <RUN_ACCESSION>_cleanbin_<BIN_ID>.fa (e.g., ERR10479404_cleanbin_000031.fa). These identifiers allow direct joins to the analysis workbooks described below and to the sample-level metadata in metadata.xlsx.

Unpacking note

Archives are Zstandard-compressed tarballs and must be decompressed before being untarred (Linux/macOS: tar -I unzstd -xf <file.tar.zst>, or zstd -d <file.tar.zst> && tar -xf <file.tar>; Windows: open with 7-Zip, extract once to obtain the .tar, then extract the .tar).

Full refined set

magscot.tar.zst contains the 4,234 refined MAGs obtained after bin refinement from 25,276 candidate bins across three binners. This complete, pre-dereplication collection includes near-duplicate genomes that were collapsed when selecting the 1,979 species-level representatives. Filenames follow the same <RUN_ACCESSION>_cleanbin_<BIN_ID>.fa convention and use the same MAG_ID keys as the analysis tables.

Sample metadata

metadata.xlsx provides one row per metagenome, including BioProject/BioSample and run accessions, cave name, geographic location and basic read counts after QC.

Analysis folder (annotation workbooks)

The Analysis/ directory contains Excel files that enable immediate reuse of the catalog (all keyed by MAG_ID unless noted):

CheckM2.xlsx — genome-quality metrics for all 4,234 draft MAGs (completeness, contamination, genome size, N50, GC%, MIMAG tier).

Classification.xlsx — GTDB-Tk taxonomy classification for the 1,979 representatives.

BGC.xlsx — antiSMASH v8 biosynthetic gene cluster calls for the representatives.

ARG.xlsx — DRAMMA antimicrobial-resistance predictions for the representatives.

DRAM.xlsx — DRAM functional annotations in wide matrix form.

CoverM.xlsx — read-recruitment statistics mapping each sample to each representative MAG.

QC reports

The QCreport/ folder contains two interactive MultiQC HTML reports that summarize read quality before and after trimming (Report_BeforeQC.html, Report_AfterQC.html), which can be opened directly in a web browser.

Linkage keys and joins

MAG_ID is the primary key across genome FASTA files and all analysis workbooks; it is equal to the FASTA filename stem (e.g., ERR10479404_cleanbin_000031). A MAG present in MAGs_HQ.tar.zst or MAGs_MQ.tar.zst is a species-level representative; MAGs found only in magscot.tar.zst are non-representative refined genomes.
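The key convention above can be sketched in Python as follows. The filename pattern and example identifier come from this section; the assumptions that BIN_ID is purely numeric and that run accessions are alphanumeric are inferred from the single example, and the function names and archive paths are hypothetical.

```python
import re
from pathlib import PurePosixPath

# <RUN_ACCESSION>_cleanbin_<BIN_ID>.fa; a numeric BIN_ID is an assumption
# based on the example ERR10479404_cleanbin_000031.fa.
MAG_FILENAME = re.compile(r"^(?P<run>[A-Za-z0-9]+)_cleanbin_(?P<bin>\d+)\.fa$")


def mag_id_from_path(path):
    """Return the MAG_ID (filename stem) for a genome FASTA path."""
    name = PurePosixPath(path).name
    if MAG_FILENAME.match(name) is None:
        raise ValueError(f"unexpected MAG filename: {name}")
    return name[: -len(".fa")]


def run_accession(mag_id):
    """Return the run accession embedded in a MAG_ID."""
    return mag_id.split("_cleanbin_", 1)[0]


def is_representative(mag_id, hq_ids, mq_ids):
    """A MAG is a species-level representative if it occurs in the HQ or MQ archive."""
    return mag_id in hq_ids or mag_id in mq_ids


if __name__ == "__main__":
    mag = mag_id_from_path("MAGs_HQ/ERR10479404_cleanbin_000031.fa")
    print(mag, run_accession(mag))
    print(is_representative(mag, hq_ids={mag}, mq_ids=set()))
```

In practice the two ID sets would be built by listing the extracted HQ and MQ archives, and the resulting MAG_ID strings can be used to look up rows in the analysis workbooks.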

Technical Validation

All metagenomic sequences underwent quality checking with FastQC v0.12.1 and fastp v1.0.1 to ensure the quality of clean sequences, and MultiQC v1.29 was used to generate quality reports46. CheckM2 v1.1.0 was used to assess the completeness and contamination of constructed MAGs41. To assess the comprehensiveness and representation of cave prokaryotic diversity by our MAG catalog, we performed read recruitment analysis using CoverM v0.7.0 (ref. 47). Quality-controlled paired-end reads were mapped back to the 1,979 species-level representative genomes, and multiple metrics were calculated, including mean coverage, relative abundance, covered fraction, variance, read count, and reads per base. DRAM v1.5.0 was used for comprehensive functional annotations with databases that included KEGG KOfam, Pfam, dbCAN, MEROPS peptidase and AMP. The quality of MAGs was further evaluated by detecting the presence of rRNA and tRNA genes, as well as by verifying the presence of ARGs and biosynthesis genes48.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (no. 32373177) and the Li Ka Shing Foundation (LKSF) STU-GTIIT Joint-Research Grant (no. 2024LKSFG07) to Drs. Ka Yin Leung and Liang Chen. Ka Yin Leung received funding from the Key Discipline Fund of Guangdong Province, and Yuping Cao received a scholarship from the Key Discipline Fund of Guangdong Province. This study was also supported by Natural Science Foundation of Guangdong Province (no. 2025A1515012810). We thank our lab members for engaging discussions and improvements of the manuscript.

Author contributions

H.L., Y.C., S.Y. and K.Y.L. conceived the study and contributed ideas. H.L. and Y.C. designed the pipeline. H.L., Y.C., X.L., Z.K., L.C. and B.A.S. collected data and performed analyses. All authors wrote the first draft and edited subsequent versions. All authors read and approved the final manuscript.

Data availability

All refined genomes and companion files were deposited in figshare38, and the 1,979 representative MAGs were submitted to DDBJ/ENA/GenBank under BioProject accession no. PRJEB95832. The raw metagenomic reads used for assembly are publicly available at NCBI SRA/ENA under their original studies; per-sample accessions and context are listed in Table 1 and in the repository file metadata.xlsx.

Code availability

The options and parameters of all tools as well as software versions used for the analysis are described in the methods. Custom-designed scripts were not used to generate or process this dataset. An operation script is available at https://github.com/HuihongLi/CaveMAGs.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Huihong Li, Yuping Cao.

Contributor Information

Sima Yaron, Email: simay@bfe.technion.ac.il.

Ka Yin Leung, Email: kayin.leung@gtiit.edu.cn.

References

  • 1.Simon, K. S. The Biology of Caves and Other Subterranean Habitats. David C. Culver and Tanja Pipan. Integr. Comp. Biol.49, 473–474 (2009). [Google Scholar]
  • 2.Barton, H. A. & Jurado, V. What’s up down there? Microbial diversity in caves. Microbe2, 132–138 (2007). [Google Scholar]
  • 3.Niemiller, M. L. & Soares, D. Cave environments. in Extremophile fishes: ecology, evolution, and physiology of teleosts in extreme environments 161–191 (Springer, 2014).
  • 4.Barton, H. A. & Northup, D. E. Geomicrobiology in cave environments: past, current and future perspectives. J. Cave Karst Stud.69, 163–178 (2007). [Google Scholar]
  • 5.Bhullar, K. et al. Antibiotic resistance is prevalent in an isolated cave microbiome. PloS One7, e34953 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Tomczyk-Żak, K. & Zielenkiewicz, U. Microbial diversity in caves. Geomicrobiol. J.33, 20–38 (2016). [Google Scholar]
  • 7.Kosznik-Kwaśnicka, K., Golec, P., Jaroszewicz, W., Lubomska, D. & Piechowicz, L. Into the unknown: microbial communities in caves, their role, and potential use. Microorganisms10, 222 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Cheeptham, N. et al. Cure from the cave: volcanic cave actinomycetes and their potential in drug discovery. Int. J. Speleol.42, 35–47 (2013). [Google Scholar]
  • 9.Zada, S. et al. Cave microbes as a potential source of drugs development in the modern era. Microb. Ecol.84, 676–687 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Kumaresan, D. et al. Aerobic proteobacterial methylotrophs in Movile Cave: genomic and metagenomic analyses. Microbiome6, 1 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Ortiz, M. et al. Making a living while starving in the dark: metagenomic insights into the energy dynamics of a carbonate cave. ISME J.8, 478–491 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Wiseschart, A., Mhuantong, W., Tangphatsornruang, S., Chantasingh, D. & Pootanakit, K. Shotgun metagenomic sequencing from Manao-Pee cave, Thailand, reveals insight into the microbial community structure and its metabolic potential. BMC Microbiol.19, 144 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Samanta, B., Sharma, S. & Budhwar, R. Metagenome analysis of speleothem microbiome from subterranean cave reveals insight into community structure, metabolic potential, and BGCs diversity. Curr. Microbiol.80, 317 (2023). [DOI] [PubMed] [Google Scholar]
  • 14.Wiseschart, A. & Pootanakit, K. Chapter 23 - Metagenomic-based approach to a comprehensive understanding of cave microbial diversity. in Recent Advancements in Microbial Diversity (eds Mandal, S. D. & Bhatt, P.) 561–586 (Academic Press, 2020).
  • 15.Bendia, A. G. et al. Metagenome-assembled genomes from Monte Cristo Cave (Diamantina, Brazil) reveal prokaryotic lineages as functional models for life on Mars. Astrobiology22, 293–312 (2022). [DOI] [PubMed] [Google Scholar]
  • 16.Chiciudean, I. et al. Competition-cooperation in the chemoautotrophic ecosystem of Movile Cave: first metagenomic approach on sediments. Environ. Microbiome17, 44 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Vojvoda Zeljko, T. et al. Genome-centric metagenomes unveiling the hidden resistome in an anchialine cave. Environ. Microbiome19, 67 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Gadson, O. et al. Metagenome-assembled genome of a putative chemoheterotroph from volcanic terrain in Hawaii. Microbiol. Resour. Announc.11, e00556–22 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Parker, C. W. et al. Enhanced terrestrial Fe (II) mobilization identified through a novel mechanism of microbially driven cave formation in Fe (III)-rich rocks. Sci. Rep.12, 17062 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Bontemps, Z. et al. Microbial diversity and secondary metabolism potential in relation to dark alterations in Paleolithic Lascaux Cave. Npj Biofilms Microbiomes10, 121 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Bay, S. K. et al. Microbial aerotrophy enables continuous primary production in diverse cave ecosystems. Preprint at 10.1101/2024.05.30.596735 (2024). [DOI] [PMC free article] [PubMed]
  • 22.Van Spanning, R. J. et al. Methanotrophy by a Mycobacterium species that dominates a cave microbial ecosystem. Nat. Microbiol.7, 2089–2100 (2022). [DOI] [PubMed] [Google Scholar]
  • 23.Ghaly, T. M. et al. Stratified microbial communities in Australia’s only anchialine cave are taxonomically novel and drive chemotrophic energy production via coupled nitrogen-sulphur cycling. Microbiome11, 190 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Han, H., Wang, Z. & Zhu, S. Benchmarking metagenomic binning tools on real datasets across sequencing platforms and binning modes. Nat. Commun.16, 2865 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
  • 26. Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
  • 27. Pan, S., Zhao, X.-M. & Coelho, L. P. SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics 39, i21–i29 (2023).
  • 28. Liu, C.-C. et al. MetaDecoder: a novel method for clustering metagenomic contigs. Microbiome 10, 46 (2022).
  • 29. Rühlemann, M. C., Wacker, E. M., Ellinghaus, D. & Franke, A. MAGScoT: a fast, lightweight and accurate bin-refinement tool. Bioinformatics 38, 5430–5433 (2022).
  • 30. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
  • 31. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315–5316 (2022).
  • 32. Blin, K. et al. antiSMASH 8.0: extended gene cluster detection capabilities and analyses of chemistry, enzymology, and regulation. Nucleic Acids Res. 53, W32–W38 (2025).
  • 33. Rannon, E., Shaashua, S. & Burstein, D. DRAMMA: a multifaceted machine learning approach for novel antimicrobial resistance gene detection in metagenomic data. Microbiome 13, 67 (2025).
  • 34. Roldán, C., Murcia-Mascarós, S., López-Montalvo, E., Vilanova, C. & Porcar, M. Proteomic and metagenomic insights into prehistoric Spanish Levantine Rock Art. Sci. Rep. 8, 10011 (2018).
  • 35. Sarhan, M. S. et al. Ancient DNA diffuses from human bones to cave stones. iScience 24 (2021).
  • 36. De Leon, M. P., Montecillo, A. D., Pinili, D. S., Siringan, M. A. T. & Park, D.-S. Bacterial diversity of bat guano from Cabalyorisa Cave, Mabini, Pangasinan, Philippines: A first report on the metagenome of Philippine bat guano. PLoS One 13, e0200095 (2018).
  • 37. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
  • 38. Li, H. Reconstruction of 1,979 prokaryotic metagenome-assembled genomes from 37 global cave environments. figshare. Dataset. 10.6084/m9.figshare.29554673.v4
  • 39. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
  • 40. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
  • 41. Chklovski, A., Parks, D. H., Woodcroft, B. J. & Tyson, G. W. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat. Methods 20, 1203–1212 (2023).
  • 42. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).
  • 43. Minh, B. Q. et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 37, 1530–1534 (2020).
  • 44. Letunic, I. & Bork, P. Interactive Tree of Life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Res. 52, W78–W82 (2024).
  • 45. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
  • 46. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).
  • 47. Aroney, S. T. et al. CoverM: read alignment statistics for metagenomics. Bioinformatics 41, btaf147 (2025).
  • 48. Shaffer, M. et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res. 48, 8883–8900 (2020).
  • 49. Hamilton, T. L., Jones, D. S., Schaperdoth, I. & Macalady, J. L. Metagenomic insights into S(0) precipitation in a terrestrial subsurface lithoautotrophic ecosystem. Front. Microbiol. 5, 756 (2014).
  • 50. Kajan, K., Kirkegaard, R., Pjevac, P., Orlić, S. & Mehrshad, M. Niche and spatial partitioning restrain ecological equivalence among microbes along aquatic redox gradient. Preprint at 10.1101/2024.08.09.607300 (2024).
  • 51. Vanghi, V., Timmermann, A., Jo, K. & Kwon, T. Exploring microbial diversity in South Korean caves through shotgun sequencing: contrasting dry and wet environments, swabbing versus sediment sampling. J. Geol. Soc. Korea 60, 275–294 (2024).
  • 52. Babalola, O. O., Adedayo, A. A. & Akinola, S. A. High-throughput metagenomic assessment of Cango Cave microbiome-A South African limestone cave. Data Brief 54, 110381 (2024).
  • 53. Ulbrich, J., Jobe, N. E., Jones, D. S. & Kieft, T. L. Cave pools in Carlsbad Caverns National Park contain diverse bacteriophage communities and novel viral sequences. Microb. Ecol. 87, 1–18 (2024).
  • 54. Rodriguez-Ramos, L. E. & Rios-Velazquez, C. Microbiome dataset from Clara Cave and Empalme Sinkhole waters in Puerto Rico. Data Brief 21, 1674–1677 (2018).
  • 55. Klapper, M. et al. Natural products from reconstructed bacterial genomes of the Middle and Upper Paleolithic. Science 380, 619–624 (2023).
  • 56. Maggiori, C. et al. Draft genome sequence of a member of a putatively novel Rubrobacteraceae genus from lava tubes in Lava Beds National Monument. Microbiol. Resour. Announc. 14, e01335–24 (2025).
  • 57. Zada, S. et al. Composition and functional profiles of microbial communities in two geochemically and mineralogically different caves. Appl. Microbiol. Biotechnol. 105, 8921–8936 (2021).
  • 58. Karwautz, C., Kus, G., Stöckl, M., Neu, T. R. & Lueders, T. Microbial megacities fueled by methane oxidation in a mineral spring cave. ISME J. 12, 87–100 (2018).
  • 59. Bizic, M. et al. Cave Thiovulum (Candidatus Thiovulum stygium) differs metabolically and genomically from marine species. ISME J. 17, 340–353 (2023).
  • 60. Saati-Santamaría, Z. et al. Long-term evolution of prokaryotic genomes in a chemolithotrophic cave over 5.5 million years of isolation. Preprint at 10.1101/2025.04.10.648229 (2025).
  • 61. Babalola, O. O., Adedayo, A. A. & Akinola, S. A. Microbiome insights from a South African cultural and natural landmark cave using metagenomics next-generation sequencing. Microbiol. Resour. Announc. 14, e01183–24 (2025).
  • 62. Meka, A. F., Bekele, G. K., Abas, M. K. & Gemeda, M. T. Exploring microbial diversity and functional gene dynamics associated with the microbiome of Sof Umer cave, Ethiopia. Discov. Appl. Sci. 6, 400 (2024).
  • 63. Zhu, B. et al. A novel Methylomirabilota methanotroph potentially couples methane oxidation to iodate reduction. mLife 1, 323–328 (2022).
  • 64. Moncadas, L. S. et al. Rickettsiales’ deep evolutionary history sheds light on the emergence of intracellular lifestyles. Preprint at 10.1101/2023.01.31.526412 (2023).
  • 65. Back, J. Geochemical investigation of the Madison aquifer, Wind Cave National Park, South Dakota. Natl. Park Serv. Nat. Resour. Tech. Rep. NPS/NRPC/WRD/NRTR-2011/416, 50 (2011).
  • 66. Turrini, P. et al. The microbial community of a biofilm lining the wall of a pristine cave in Western New Guinea. Microbiol. Res. 241, 126584 (2020).

Data Availability Statement

All refined genomes and companion files were deposited in figshare38, and the 1,979 representative MAGs were submitted to DDBJ/ENA/GenBank under BioProject accession no. PRJEB95832. The raw metagenomic reads used for assembly are publicly available at NCBI SRA/ENA under their original studies; per-sample accessions and context are listed in Table 1 and in the repository file metadata.xlsx.

The options, parameters, and software versions of all tools used in the analysis are described in the Methods. Custom-designed scripts were not used to generate or process this dataset. An operation script is available at https://github.com/HuihongLi/CaveMAGs.


Articles from Scientific Data are provided here courtesy of Nature Publishing Group
