Abstract
Huge phages are widespread in the biosphere, yet their prevalence and ecology in the human gut remain poorly characterized. Here, we report Jug (jumbo gut) phages with genomes of 360 to 402 kilobase pairs that comprise ~1.1% of the reads in human gut metagenomes, and are predicted to infect Bacteroides and/or Phocaeicola. Although three of the four major groups of Jug phages shared >90% genome-wide sequence identity, their large terminase subunits exhibited only 38 to 57% identity, suggesting horizontal acquisition from other phages. Over 1500 genomes of Jug phages were recovered from human and animal gut metagenomes, revealing their broad distribution, with largely shared gene content suggestive of frequent cross-animal-host transmission. Jug phages displayed high gene transcription activities, including the gene for a calcium-translocating P-type ATPase not detected previously in phages. These findings broaden our understanding of huge phages and highlight Jug phages as potential major players in gut microbiome ecology.
A widely distributed but overlooked large-sized phage clade is active in the human gut and thus could potentially affect human health.
INTRODUCTION
Viruses, in particular, those infecting bacteria and known as bacteriophages (phages), are the most abundant biological entities on Earth (1). The human gut contains a large diversity of phages (2, 3), some of which have been reported to be widespread in the human gut (4, 5), for example, crAssphage (6, 7). Phages of the Crassvirales order (including crAssphage) that infect Bacteroides species (8) appear to be the most prevalent in the human gut, while also being present in environmental viromes (9). These phages exhibit high phylogenetic diversity (10, 11), and many members of the order possess unusual features such as interruption of genes by multiple self-splicing introns and inteins, and alternative genetic codes (12). The composition of the human gut phageome has been shown to be unique and stable over the lifetime of individuals (13) but may shift during the early stage of life, although some phages can persist for a long time from early life onward (14–16). Composition shifts in the human gut phageome have been reported to be associated with chronic diseases (17), such as inflammatory bowel disease (18), malnutrition (19), and metabolic syndrome (20), suggesting that gut phages are relevant for human health.
The double-strand (ds) DNA phages of the Caudoviricetes class show a broad distribution of genome size, with the peak at about 50 kilobase pairs (kbp) (21), whereas some, known as jumbo or huge phages, have genome sizes ≥200 kbp (21, 22) and up to 841 kbp (23). Huge phages are diverse and widespread (21), and their large genomes with numerous genes encoding functionally diverse proteins provide a unique opportunity to study complex biological mechanisms in viruses. Examples include genes encoding multiple components of the transcription and translation machineries (21), enzymes augmenting various metabolic pathways of host cells, CRISPR-Cas systems along with other defense and counterdefense systems and more (24, 25). Notably, some huge phages replicate their genomes within nucleus-like structures that protect phage DNA against host defenses (26, 27). The Lak phages, which infect Prevotella species, are among the very few huge phage clades that have been investigated in the human gut (28, 29). They have a genome size of 540 to 660 kbp, which is comparable to that of some prokaryotes, and were detected mostly in gut metagenomes of humans on non-Western diets as well as various animals (28, 29). In general, however, the prevalence, diversity, and ecological roles of huge phages in the human gut remain largely unknown.
In this work, we constructed the Huge Phage Genome Collection (HPGC) within which we identified a dominant huge phage clade, which we denoted Jug (after jumbo gut) phages. The Jug phages were classified into four groups, and comparative genome analysis suggested distinct evolutionary scenarios. More than 1500 Jug phage genomes were recovered from gut metagenomes of humans and various animals, demonstrating their broad diversity and distribution. Metatranscriptomics revealed the high activity of Jug phages in the human gut and provided information on cotranscription and RNA splicing of fragmented genes. We identified phage-encoded calcium–translocating P-type adenosine triphosphatases (ATPases) in Jug phages and other huge phages, an enzyme that has not been previously reported in phages, and showed that this gene was actively transcribed. The identification of Jug phages highlighted the capacity of HPGC for the discovery of ecologically important huge phage clades. Our analyses of the gene repertoires, distribution, transmission, and in situ activities of Jug phages document their potential importance as a major component of the human phageome and also expand our understanding of the ecology of huge phages in the gut microbiomes.
RESULTS
The huge phage genome collection
We collected 10,690 viral genomes with a length of ≥200 kbp from a number of publications reporting viral genomes from various habitats such as human guts, ocean, soil, and freshwater, as well as those deposited at the National Center for Biotechnology Information (NCBI; Fig. 1A and table S1). Within this genome set, 9372 were identified as huge phage genomes using a combination of several viral identification tools (see Materials and Methods), and dereplication yielded a nonredundant dataset of 7230 genomes (table S2). To our knowledge, this curated HPGC is the most comprehensive repository of huge phage genomes to date. The HPGC was dominated by genomes from the animal digestive system (2250 genomes) and freshwater ecosystems (1882 genomes). Additional sources included soil (930 genomes), marine systems (708 genomes), deep subsurface environments (343 genomes), and others (Fig. 1B). About 70% of the genomes were complete or of high quality, with a strong size bias toward 200 to 400 kbp (Fig. 1C).
Fig. 1. The construction of the HPGC and the identification of Jug phages.
(A) The HPGC construction pipeline. (B) The habitat distribution. Only those habitats that contain 1% or more of the nonredundant genomes are shown. Otherwise, they are assigned to “Others.” Those without habitat information available were assigned to “Unknown.” The habitat information of IMG/VR genomes was downloaded from the IMG website. For the genomes from NCBI and other publications, the habitat information was manually retrieved from the sample collection descriptions and followed the IMG/VR assignment categories (table S2). (C) The length distribution and CheckV-based genome quality. (D) The large species-level clusters with >10 genomes. The bacterial hosts and the phage clade of the clusters are shown when available. The red arrows indicate the two species-level clusters of Jug phages. (E) The length of the Jug phage genomes. The original and curated lengths of the 105 HPGC genomes and the curated lengths of 34 newly assembled genomes are shown. (F to H) Comparison of the genomes from group Beta and the other three groups of Jug phages with respect to (F) genome length, (G) GC contents, and (H) coding density. The average values between group Beta and the others were compared using an unpaired t test (***P < 0.001).
Genome clustering at ≥95% nucleotide similarity identified 4565 species-level clusters, most of which were singletons (3513; table S2). Only ~0.3% of the species-level clusters could be assigned to the taxonomic level of order based on geNomad analyses. Of the 42 large species-level clusters (>10 genomes each), 38 were detected in only one habitat each (Fig. 1D), which was also the case before genome dereplication. These large phage clusters included the phiKZ-like phages infecting Pseudomonas, the uncultured Lak phages infecting Prevotella, and phages infecting Synechococcus, Salmonella, Escherichia, and Aeromonas. Notably, 23 of the 42 large clusters included members from the animal digestive system, in particular, the human gut (Fig. 1D). However, with the exception of the Lak phages, the diversity and potential ecological importance of these huge phages remained largely uncharacterized.
A prevalent clade of huge phages from the human gut identified within HPGC
We focused on the human gut genomes within HPGC and clustered them at ≥90% nucleotide sequence identity (≥50% of genome length). The largest group contained 104 genomes (table S3), including all the genomes from the species-level clusters 08 and 31 (indicated by red arrows in Fig. 1D). We compared the amino acid sequences of the large terminase subunits (TerL) of these phages and divided them into two major groups, which we denoted Alpha and Delta. The TerL proteins from the two groups shared only ~38% amino acid sequence identity. This observation motivated us to investigate the TerL identity gap. Searching the rest of HPGC identified one genome (huge_phage_7569) that shared ~82% TerL identity with group Alpha, and we assigned this genome to group Beta (fig. S1). Subsequently, we performed a Pebblescout search (30) using the genomes of groups Alpha, Beta, and Delta, and identified a set of genomes that shared only ~57% TerL identity but >90% nucleotide sequence identity with group Alpha (fig. S2), which we named group Gamma. We refer to the four groups collectively as Jug phages, after jumbo gut phages.
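The grouping step described above (genomes linked when they share ≥90% nucleotide identity over ≥50% of genome length) can be illustrated with a minimal sketch. The input format, the function name `cluster_genomes`, and the use of single-linkage clustering are assumptions made for illustration; the clustering tool and linkage rule actually used are those given in Materials and Methods.

```python
from collections import defaultdict


def cluster_genomes(genomes, hits, min_identity=90.0, min_aligned_fraction=0.5):
    """Single-linkage clustering of genomes from pairwise alignment summaries.

    genomes: iterable of genome IDs (hypothetical identifiers).
    hits: iterable of (genome_a, genome_b, percent_identity, aligned_fraction),
          where aligned_fraction (0 to 1) is the aligned share of genome length.
    Returns a list of sets of genome IDs, largest cluster first.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, identity, fraction in hits:
        # Link two genomes only if both thresholds are met.
        if identity >= min_identity and fraction >= min_aligned_fraction:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for g in parent:
        clusters[find(g)].add(g)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy example with made-up values: g4 fails the aligned-fraction threshold.
example_hits = [
    ("g1", "g2", 96.0, 0.80),
    ("g2", "g3", 91.0, 0.55),
    ("g3", "g4", 95.0, 0.30),
]
example_clusters = cluster_genomes(["g1", "g2", "g3", "g4"], example_hits)
```

With single linkage, g1 and g3 end up in the same cluster through g2 even without a direct qualifying hit; a centroid-based method could split them, which is why the linkage rule is flagged as an assumption here.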
We manually curated 139 Jug phage genomes (table S3), including 96, 27, 4, and 12 genomes in groups Alpha, Beta, Gamma, and Delta, respectively. Manual curation improved the quality of the 105 genomes within HPGC by extending their length from 307 to 375 kbp on average, with all 34 newly assembled genomes being >360 kbp (Fig. 1E). Compared with the other three groups, group Beta Jug phages had smaller genome sizes (Fig. 1F) and lower GC contents (27% versus 30% on average; Fig. 1G), but higher coding density (90.7% versus 88.6% on average using code 11; Fig. 1H). The Jug phage genomes encompassed from 320 to 555 (mean 501) predicted protein-coding genes. We annotated the genes using both Pfam domain search and protein structure prediction and comparison, and identified many of the viral marker genes, which indicated Jug phages are T4-like phages (table S4).
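Coding density, as compared between groups above, is the fraction of genome positions covered by predicted protein-coding genes. The sketch below shows one common way to compute it; the function name and the 1-based inclusive coordinate convention are assumptions for illustration, and overlapping genes are merged so that shared bases are counted once.

```python
def coding_density(genome_length, cds_intervals):
    """Fraction of the genome covered by at least one predicted CDS.

    genome_length: genome size in bp.
    cds_intervals: list of (start, end) tuples, 1-based and inclusive (assumed).
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(cds_intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval before opening a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping gene: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length


# Toy example: two overlapping genes and one separate gene on a 1000-bp genome.
example_density = coding_density(1000, [(1, 300), (250, 500), (601, 900)])
```

In this toy case the merged coding intervals span 800 bp, so the density is 0.8. Because gene prediction depends on the genetic code, the same genome can yield different densities under different codes, which is why the text specifies code 11.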
All the curated genomes of groups Alpha, Gamma, and Delta were from the human gut. In contrast, the Beta group had a broader animal-host range, with members with curated genomes identified in the guts of humans, dogs, cats, chickens, swans, and quails (table S3), and also in ducks with fragmented genomes (fig. S3).
Phylogeny, genome comparison, and the core gene set of Jug phages
Groups Alpha and Beta shared the highest TerL identity, whereas the comparison of genome nucleotide sequences, as well as amino acid sequences of DNA-directed RNA polymerase subunit beta’ (RpoC), major capsid protein (MCP), and DNA polymerase showed higher similarity between groups Alpha, Gamma, and Delta (Fig. 2A). These observations were supported by genome clustering (fig. S4) and the detection of a large number of single-copy genes shared by the three groups (406 in total, of which 371 had average pairwise identity >90%; table S5).
Fig. 2. The phylogeny and genomic features of the Jug phages.
(A) Identity of TerL, MCP, genome, RpoC, and DNA polymerase between groups. The outliers are indicated by gray circles. (B) TerL identity within other HPGC clusters (≥90% nucleotide similarity) with at least five members. The cluster with the lowest TerL identity is indicated with a red circle. (C) The TerL-based phylogeny. The TerL fragments of each curated Jug phage genome were concatenated individually and clustered at ≥99% identity. The TerL sequences of relatives were obtained from HPGC and IMG/VR via BLASTp and clustered at ≥99% identity as well. Only the cluster representatives were included, and the count of proteins in each cluster is indicated in the brackets. The total length of the TerL fragments is shown. All the corresponding genomes were from the gut microbiomes except one from a wastewater sample (indicated by *). (D) The RNA reads mapping evidence of cotranscription and RNA splicing of split genes. Examples of terL and nrdD genes are shown. Note that the nrdD gene was split into four fragments, and no mRNA splicing evidence was observed for the region between the 230– and 91–amino acid (AA) fragments.
In total, we detected 39 single-copy genes with ≥70% protein identity that are conserved in all four Jug phage groups (fig. S5 and table S6) and form several conserved gene arrays (fig. S6). These conserved genes included those for MCP, anaerobic ribonucleoside-triphosphate reductase activating enzyme (NrdG), DNA topoisomerase IV, translation initiation factor 1A, and ribonuclease H; most of the others lacked functional annotation. Notably, other viral marker genes, such as phage portal protein and prohead protease (table S4), were fragmented (see below) and/or highly divergent and thus not counted as single-copy core genes.
Given the low TerL identity contrasting the high (>90%) genome sequence identity among groups Alpha, Gamma, and Delta (fig. S4), we searched the entire HPGC for other clusters of genomes with a similar pattern of sequence conservation. However, no other genome clusters in the HPGC dataset exhibited TerL identity <85% (Fig. 2B and fig. S7), suggesting that, at least within HPGC, the “high genome but low TerL identity” pattern is unique to Jug phages. Phylogenetic analyses of TerL proteins showed that group Gamma was most closely related to a huge phage from the water deer gut, whereas group Delta clustered with huge phages from the guts of ruminants such as sheep and cows (Fig. 2C and figs. S8 to S10). This pattern suggests independent acquisition of terL genes from gut-resident huge phages outside the Jug group, which was supported by the GC content profiles of the terL-encoding regions (fig. S11). Moreover, the genes immediately upstream of terL across the Alpha, Gamma, and Delta groups that typically encode predicted endonucleases were highly divergent, suggesting that these genes could have been transferred together with terL (fig. S12).
The relatively lower genome similarity between groups Alpha and Beta challenged their phylogenetic relatedness despite the high identity of their TerL sequences, which could also be due to horizontal gene transfer (HGT). We explored the rest of HPGC and did not detect any other genomes with >50% genome identity to group Beta. On the other hand, groups Alpha and Beta shared ~95% MCP identity (table S7). The MCP phylogeny supported the close relatedness of groups Alpha and Beta (fig. S13).
Jug phage genes are frequently interrupted by group I introns and putative inteins
During the curation of the Jug phage genomes, we noticed that many genes appeared to be fragmented. A search for self-splicing introns (see Materials and Methods) led to the identification of 2513 group I introns in 136 of the 139 curated Jug phage genomes (18.5 introns, on average, in each genome; table S8), with fewer introns in group Beta compared to the other groups (table S9). These introns were on average ~206 nucleotides (nt) in length, suggesting that they were the small internal group I–like ribozymes (31). In addition, we identified numerous homing endonucleases of different families, in particular, 505 HNH, 80 PDDEXK_5, 55 GIY-YIG, and 17 LAGLIDADG endonucleases, as well as 26 Intein_splicing domain–containing proteins (table S4). Among them, except for 88 HNH and 9 LAGLIDADG domains that were inside other genes, all the remaining were stand-alone genes.
The group I introns and genes encoding various homing endonucleases interrupt many essential genes of the Jug phages. For example, the terL genes in all Alpha and most Delta genomes were split into two to three fragments due to group I intron insertion (fig. S14). The nrdD genes encoding anaerobic ribonucleotide reductase were fragmented in 95 of 104 genomes; all the intact nrdD genes were from group Beta (table S10). Insertions between the fragments commonly encoded endonucleases or proteins with “Intein_splicing” domains. DNA polymerase genes were also frequently split by intervening endonuclease sequences (table S11).
There were no paired metatranscriptomes for the metagenomic samples from which we reconstructed the 139 Jug phage genomes (see Materials and Methods). Therefore, to validate the predicted group I introns in Jug phages, we reanalyzed three gut samples from three male adults containing Jug phages for which metatranscriptomes were available (32). Analysis of these three Jug phage genomes and the corresponding metatranscriptomes showed that for 37 of the 76 predicted group I introns, splicing was supported by metatranscriptome reads mapping (table S12). For example, the split terL and nrdD genes were cotranscribed as an operon (fig. S15). Some of the RNA reads were precisely mapped to the flanking regions of the introns (Fig. 2D), directly confirming the restoration of the respective coding regions by splicing (fig. S16).
Jug phages infect Bacteroides and/or Phocaeicola species
The iPHoP analysis indicated that Jug phages infect Bacteroides and/or Phocaeicola (table S13), which was supported by local CRISPR-Cas spacer analysis, with a total of 34 unique targeting spacers (fig. S17 and table S14). Bacteroides and Phocaeicola, along with Prevotella, belong to the Bacteroidaceae family (GTDB taxonomy) and are typically abundant in the human gut. Previously, the Bacteroides-infecting crAss-like phages (89 to 192 kbp in length) (6, 11), and the Prevotella-infecting Lak phages (>540 kbp in length) (28, 29) have been reported to be widespread in the human gut. However, Jug phages showed no significant similarity to any of these phages apart from sharing several hallmark genes (see Materials and Methods).
To further validate the predicted host-virus relationship, we performed co-occurrence analyses for the 125 metagenomic assemblies that were used for the manual curation of Jug phage genomes. In these metagenomes, we identified 28 Bacteroides and 16 Phocaeicola species, with at least one of these represented in each of the 125 samples. However, only Phocaeicola vulgatus and Bacteroides clarus were detected in more than half of the samples (Fig. 3A). In one sample (SRR7403886), we detected one Bacteroides species without Phocaeicola, suggesting its potential host-phage relationship with the co-occurring group Gamma Jug phage. We attempted to confirm specific host-virus relationships by investigating the samples in which only two or three Bacteroides/Phocaeicola species were detected, but could not establish any specific relationship in these samples (fig. S18).
Fig. 3. The predicted bacterial hosts of Jug phages and host-phage co-occurrence.
(A) The relative abundance of predicted hosts in the samples. The numbers in the brackets following the species names indicate the count of samples with detection. Only those species detected in ≥20 samples are listed, and the rest are assigned to “other species.” (B) Percentage of mapped metagenomic reads. pBI143 is a widespread plasmid in the human gut (33). The samples with two or three predicted host species detected are highlighted with black circles, with further analyses shown in fig. S18. (C) The three samples with >8% of metagenomic reads mapped to the Jug phage genomes. The data for the sample with the highest fraction of mapped reads and the sample with only one predicted bacterial host species are shown. One of the three samples contained only one Bacteroides species without Phocaeicola; thus, the specific host-virus relationship was tentatively inferred.
On average, Jug phages accounted for 1.1% of the metagenomic reads, with >8% in three bulk metagenomic samples (Fig. 3, B and C). One sample, from an infant in the fourth month of life, contained a group Alpha Jug phage, which accounted for ~35.8% of the mapped reads. This particular gut microbiome contained 14 bacterial species, including B. clarus, Bacteroides sp902362375, and P. vulgatus, with a cumulative relative abundance of 29.7% for these three species (Fig. 3C and table S15). We identified the same Jug phage and 7 of the 14 bacterial species (including B. clarus) in the gut microbiomes of the mother at delivery, and detected the Jug phage in the infant at birth and in the 12th month as well (fig. S19), suggesting transmission of the Jug phage from the mother to the infant.
Jug phages are widespread in the guts of humans and animals
To assess the diversity and distribution of Jug phages, we searched the Logan assembly dataset (33) using terL as a marker and compared the retrieved contigs against the curated Jug phage genomes (Fig. 4A). We identified 2733 samples containing Jug phage-related contigs ≥1 kbp and a cumulative phage genome length ≥200 kbp (see Materials and Methods), with nearly all derived from gut microbiomes except for 12 from wastewater (table S16). The wastewater-derived genomes were highly fragmented (fig. S20).
Fig. 4. Cluster analysis of Jug phage proteins.
(A) The Logan-based analysis pipeline to obtain Jug phage genomes. Each of the BLASTn hits was filtered to allow ≥90% nucleotide identity across ≥90% of the contig length when matched to a curated Jug phage genome. (B) The global distribution of Logan-retrieved Jug phage genomes. A total of 1422 of the 1523 genomes (table S17) have location information and are shown on the map. All the genomes have a minimum length of 200 kbp, with all contigs in the genomes ≥5 kbp in length. The locations with more than 30 genomes retrieved were highlighted with the corresponding NCBI project IDs shown (all projects but one were from the human gut). The insertion in the lower-left corner shows the total length and contig count in each retrieved genome. Some regions of the world map are not shown as there was no genome identified or due to low map resolution. The distribution of protein clusters within (C) the four Jug phage groups; (D) human, mouse, and rat of group Alpha; and (E) human and dog of group Beta. In (C), (D), and (E), the shared or specific protein clusters are indicated by bars, with the count of the protein clusters shown. The numbers of Jug phage genomes in each group or animal host are shown in the brackets. The protein clusters are defined using CD-HIT at ≥70% protein identity (see Materials and Methods).
Predicted proteins from the Logan contigs were compared to the proteins of curated Jug phage genomes (TerL, MCP, RpoC, and DNA polymerase), and a single copy of each gene was detected in all samples except one (NCBI SRA = SRR10680429), which contained one Alpha and one Beta member. These findings indicated that gut microbiomes typically harbor only one dominant Jug phage genotype. We thus reconstructed draft genomes from the individual samples by binning the matched contigs ≥5 kbp, yielding 1526 genomes ≥200 kbp in length (Fig. 4A), the majority (1450) belonging to group Alpha. These draft genomes comprised 9 to 29 contigs (Fig. 4B and table S17) and showed 99.5% average nucleotide identity to the curated Jug phage genomes (table S18). Notably, group Alpha Jug phages were also found in nonhuman animal hosts, including mice (84 genomes), rats (51 genomes), and hamsters (3 genomes).
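The filtering and binning logic described for the Logan contigs (hits with ≥90% nucleotide identity across ≥90% of the contig length, contigs ≥5 kbp, and a per-sample total of ≥200 kbp) can be sketched as follows. The record layout and the function name `bin_jug_contigs` are hypothetical and chosen only for illustration; the actual pipeline is the one summarized in Fig. 4A and in Materials and Methods.

```python
from collections import defaultdict


def bin_jug_contigs(hits, min_identity=90.0, min_coverage=0.9,
                    min_contig_len=5_000, min_genome_len=200_000):
    """Group matching contigs by sample and keep sufficiently complete bins.

    hits: iterable of dicts with the (assumed) keys
          sample, contig, contig_len, identity (percent), aligned_len.
    Returns {sample: {contig: contig_len}} for bins passing the length cutoff.
    """
    bins = defaultdict(dict)
    for h in hits:
        # Fraction of the contig covered by the match to a curated genome.
        coverage = h["aligned_len"] / h["contig_len"]
        if (h["identity"] >= min_identity
                and coverage >= min_coverage
                and h["contig_len"] >= min_contig_len):
            bins[h["sample"]][h["contig"]] = h["contig_len"]
    # One draft genome per sample, kept only if the binned length is large enough.
    return {
        sample: contigs
        for sample, contigs in bins.items()
        if sum(contigs.values()) >= min_genome_len
    }


# Toy example with made-up values.
example_hits = [
    {"sample": "S1", "contig": "c1", "contig_len": 150_000, "identity": 99.0, "aligned_len": 149_000},
    {"sample": "S1", "contig": "c2", "contig_len": 80_000, "identity": 98.5, "aligned_len": 79_000},
    {"sample": "S1", "contig": "c3", "contig_len": 3_000, "identity": 99.0, "aligned_len": 3_000},
    {"sample": "S1", "contig": "c4", "contig_len": 50_000, "identity": 85.0, "aligned_len": 50_000},
    {"sample": "S2", "contig": "c1", "contig_len": 120_000, "identity": 99.0, "aligned_len": 119_000},
]
example_bins = bin_jug_contigs(example_hits)
```

Binning all matched contigs of a sample into one draft genome relies on the observation above that a sample typically harbors a single dominant Jug phage genotype; samples with mixed genotypes would need to be separated first.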
We identified 2273 protein clusters in the curated and Logan-retrieved Jug phage genomes. Only ~10.9% of the cluster representatives were assigned known functions, with Pfam domains such as NRDD, AAA, Phage_lysozyme2, Amidase_3, CLP_protease, DNA_topoisoIV, eIF-1a, and HNH_3 among the most frequently detected (table S19). Group Alpha shared numerous protein clusters with groups Gamma and Delta but few with group Beta (Fig. 4C). Over 35% of the protein clusters were unique to group Alpha. Within this group, human-, mouse-, and rat-derived Jug phages shared 433 protein clusters, with only five and four clusters unique to mouse and rat, respectively (Fig. 4D). Similarly, within group Beta, Jug phages from the dog gut metagenomes shared >95% of proteins with those from humans (Fig. 4E). These shared clusters generally exhibited high sequence identity (fig. S21), and phylogenetic analyses showed no animal-host-specific clades (fig. S22). These findings suggest frequent transmission of Jug phages between animal hosts.
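Counts of shared and group-specific protein clusters, of the kind shown in Fig. 4, C to E, can be derived from cluster membership alone. The sketch below assumes a simple mapping from each cluster to the group (or animal host) label of its member proteins; the input format and the function name are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def cluster_sharing(cluster_members):
    """Count protein clusters by the exact set of groups they occur in.

    cluster_members: dict mapping cluster ID to an iterable of group labels,
                     one label per member protein (assumed format).
    Returns a Counter keyed by frozenset of group labels.
    """
    return Counter(frozenset(groups) for groups in cluster_members.values())


# Toy example with made-up clusters.
example_clusters = {
    "pc1": ["Alpha", "Gamma", "Delta"],
    "pc2": ["Alpha"],
    "pc3": ["Alpha", "Alpha"],
    "pc4": ["Alpha", "Beta"],
}
example_counts = cluster_sharing(example_clusters)
alpha_only = example_counts[frozenset({"Alpha"})]
```

Here two clusters are specific to Alpha, one is shared by Alpha, Gamma, and Delta, and one is shared by Alpha and Beta. The same function applies to animal-host labels instead of group labels.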
Jug phages are transmissible among humans and sensitive to dietary intervention
To explore the transmission of Jug phages, we reanalyzed the human gut samples of donors and recipients from a study of fecal microbiota transplantation (FMT) for the treatment of ulcerative colitis (34). We confirmed that the recipients harbored no Jug phages before FMT (fig. S23 and table S20). Following FMT, Jug phages identical to those in the donors were identified in the guts of seven recipients during the treatment and were retained for up to 14 to 25 weeks. These results suggested that Jug phages can be transmitted among humans.
To investigate the potential effects of diet on Jug phages, we reanalyzed the human gut samples of overweight or obese participants in two dietary intervention studies (study 1 and study 2) (35). The participants were given fiber-containing snacks while consuming meals that are high in saturated fats and low in fruits and vegetables (HiSF-LoFV) (fig. S24 and table S21). Group Alpha Jug phages were identified in participant S11 samples of both studies 1 and 2 and in participants S13 and S15 samples of study 1. The Jug phages were present in all the 58 analyzed samples, albeit with very low coverage in eight samples (fig. S25). P. vulgatus and Bacteroides sp902362375 were the only two shared potential host species identified in all three participants. We did not observe any significant correlation between the abundances of these bacteria and the Jug phages. However, the abundance of the Jug phages increased toward the end of the continuous consumption of fiber snacks as supplements in all three participants, suggesting that fiber-enriched diets promoted the reproduction of the Jug phages, conceivably, by enhancing the growth of the host bacteria, Bacteroides and/or Phocaeicola.
Jug phages are transcriptionally active in the human gut
To investigate the in situ activities of Jug phages, we reanalyzed metagenomic and metatranscriptomic datasets from three adult male gut samples that harbored Jug phages (see above) (32). All three Jug phages (one per individual) were from group Alpha (table S22) and accounted for 3 to 59% (26% on average) of all the virus-retrieved DNA reads (Fig. 5A). Notably, Jug phages were highly active in 7 of the 12 analyzed samples, with their transcripts accounting for 56 to 96% (79% on average) of all virus-retrieved RNA reads, which was by far the highest among all viruses co-occurring in these samples. In these seven samples, 87 to 99% of the Jug phage genes were expressed (Fig. 5B). Genes for proteins involved in lytic infection, in particular, virion formation, such as TerL, MCP, portal protein, prohead protease, baseplate proteins, tail components, and tape-measure proteins, ranked relatively low in transcriptional activity (table S23), suggesting that the majority of the Jug phages were not involved in active lysis of the host cells at the time of sampling.
Fig. 5. Transcription activity of Jug phages in the guts of three adult men.
(A) The relative abundance of DNA and RNA reads mapped to the Jug phages and all other viruses in the corresponding samples. For the seven samples with transcriptionally active Jug phages, the percentages of RNA reads mapped to the Jug phages against the total RNA reads mapped to all viruses are shown above the bars. The sampling procedures are shown to the right in brief. (B) The scatter plots compare the transcriptional levels of protein-coding genes in the Jug phages of the three adults. The number of transcribed genes in each time point sample is indicated in the bracket (see table S23 for details). Each blue circle represents a gene; the closer the circle to the red dashed line, the more similar the transcriptional activity is in the two samples. (C) The transcriptional profiles of the genes for TerD domain–containing proteins and others. The Jug phage genome from adult 1 is shown as an example. The RNA reads were mapped to the genome. The operon-encoding regions (suggested by the RNA reads mapping profiles) are indicated by the red lines. Hyp, hypothetical protein.
Each of the three Jug phage genomes contained three terD genes: two flanking a hypothetical gene without predicted function and a third adjacent to a pspA-like gene (Fig. 5C). Transcriptional profiling showed that each region was independently transcribed as an operon. Given that TerD-related proteins typically participate in stress response (36), the high RNA read mapping depth of the TerD-encoding regions suggests that Jug phages might experience strong environmental stress at the time of sampling. Among the Jug phage protein clusters, nine consisted of TerD-related proteins (>2200 proteins, ~0.45% of all predicted proteins; Fig. 4C). We found that TerD-related proteins are widely encoded by gut bacteria, and those from the same genus generally clustered together in the phylogenetic tree (fig. S26). Furthermore, TerD-like proteins encoded by Jug phages grouped closely with those from Bacteroides, Phocaeicola, and Prevotella, suggesting the possibility of HGT of these stress-related genes between Jug phages and their hosts.
A calcium-translocating P-type ATPase identified in Jug and other phages
Notably, among the other highly transcribed genes was a gene encoding a calcium-translocating P-type ATPase (table S23). This gene was identified in the majority of the curated (60.4%; all four groups) and the Logan-retrieved (74.1%) Jug phage genomes. Protein structure prediction and comparison confirmed the annotation (Fig. 6, A to C, and fig. S27), and sequence alignment showed the 10 transmembrane helices typical of P-type ATPases and the key functional residues (fig. S28). Many of the gut microbes encode this ATPase, which is an integral membrane protein that couples ATP hydrolysis to the export of Ca2+ ions, thereby maintaining low cytoplasmic Ca2+ levels essential for calcium homeostasis (37).
Fig. 6. A calcium-translocating P-type ATPase identified in phages.
(A) The protein structure of the human calcium-translocating P-type ATPase (PDB: 7e7s), and the predicted models of the ATPases from Listeria monocytogenes (LMCA1), Bacteroides and Phocaeicola species, and Jug phages. In the structure representations, “A” stands for the actuator domain, “P” stands for the phosphorylation domain, and “N” denotes the nucleotide-binding domain. The dashed lines indicate the membrane. The transmembrane helices are highlighted with different colors. The conserved phosphorylation motif DKTGT is highlighted in cyan and displayed as sticks. (B) Superposition of the calcium coordination sites. The sum distance between the Ca2+ (gray spheres)–binding residues between the ATPases from Jug phage and human (or bacteria) (shown in black), the sum distance of the Ca2+-binding residues to the Ca2+ ion in the ATPases from human (or bacteria) (shown in gray), and the sum distance from Ca2+-binding residues to the Ca2+ ion in the ATPase of Jug phage (shown in pink) are shown at the bottom. Note that the human ATPase can coordinate two Ca2+ ions. (C) Superposition of the DKTGT motifs from the Jug phage and human and bacterial P-type ATPases. The location of the motif in the protein is indicated by red boxes in (A). The distance (Å) between the aspartate (D) residue from the Jug phage and the reference ATPase is shown in red. (D) Superposition of the DKTGT motif between the ATPases of the Jug phages and predicted bacterial hosts. The distance between the D residues is shown in red. (E) The normalized transcriptional activities (RPKM) of the calcium-translocating P-type ATPase genes in Jug phages and the co-occurring bacteria in the gut microbiomes. The RPKM values of all bacteria were aggregated. (F) Cotranscription of the calcium-translocating P-type ATPase gene and the adjacent upstream gene.
Phylogenetic analysis showed that the P-type ATPases of Jug phages were most closely related to those from Bacteroides and Phocaeicola species (fig. S29), suggestive of HGT and supporting the bacterial host prediction for Jug phages. The arrangement of the conserved phosphorylation motif (“DKTGT”), whose aspartate residue is phosphorylated during the catalytic cycle, was highly consistent between the P-type ATPases of Jug phages and their predicted hosts as well (Fig. 6D). Unexpectedly, in all but one of the seven samples, the transcriptional activity of the Jug phage-encoded ATPase gene was higher than the cumulative activity of those encoded by all the co-occurring bacteria in the gut microbiomes (Fig. 6E).
To date, the only viral calcium-translocating P-type ATPase had been documented in Nucleocytoviricota viruses infecting Chlorella, where this gene has been shown to be actively transcribed during infection (38). Our extended analysis identified this ATPase in 81 other HPGC huge phages (200 to 447 kbp in length) and four smaller-sized phages (159 to 197 kbp in length) (table S24). Notably, all these phage genomes were reconstructed from gut microbiomes, and the phage-encoded ATPases were most similar to those from Bacteroidaceae (including Bacteroides and Phocaeicola) or Bacillota (fig. S30).
We found that the gene immediately upstream of the calcium-translocating P-type ATPase was cotranscribed with the ATPase gene (Fig. 6F). This adjacent gene is present in all Jug phages encoding the ATPase, and encodes a small protein [77 to 91 amino acids (aa)] without detectable homologs (fig. S31). The consistent presence of the two genes as an operon in the Jug phages implies that the small protein is functionally linked to the ATPase, perhaps regulating its activity.
DISCUSSION
Given their pivotal roles in shaping microbial community composition, in particular, human-associated microbiomes, element cycling, and the evolution of their hosts, phages have lately attracted renewed, strong interest in microbiology research (39). Huge phages are particularly notable because of their large genomes that encompass genes for complex biological mechanisms that are usually absent in smaller-sized phages (21). Our understanding of huge phages has markedly extended in the past several years, especially with the development of genome-resolved metagenomics and related analysis tools (40–45). However, the absence of a curated reference dataset of huge phages impedes deeper ecological insights. To fill this gap, we constructed HPGC. While conventional phage studies routinely recover tens of thousands of smaller-sized phage genomes in a single study (3, 46), huge phages remain underrepresented. We expect HPGC to become a robust resource for identifying huge phages in metagenomes, especially when combined with manual genome curation and/or single-molecule sequencing, which could partly resolve the fragmentation issues that emerge during short-read–based assembly and hamper the identification and annotation of huge phage genomes.
With the fast growth of sequencing data in public repositories, tools that can deal with the large datasets accurately and efficiently are of paramount importance. Pebblescout (30), Logan (33), Serratus (47), and BIGSI (48) are among such tools that substantially facilitate access, search, and scalable mining of the massive and largely untapped metagenome sequencing data. In this study, we showed how ecological analyses of a viral clade could be extended by using some of these tools, leading to the discovery of abundant but overlooked viral clades (Figs. 1D and 4).
In the case of Jug phages, we found that some essential viral proteins, such as TerL and MCP, are highly similar among group members (Fig. 2), which enables the reliable identification of Jug phages in metagenomic assemblies. The expanded genomic diversity via Logan-assembly revealed the wide distribution of Jug phages and likely frequent cross-animal-host transmission (Fig. 4). The transmission of the Jug phages between infant and mother, and among healthy individuals and patients via FMT (Fig. 5A) suggests that Jug phages can readily spread in the human population, explaining their wide distribution (Fig. 4). In addition, we identified relatives of Jug phages in the gut of ruminants (Fig. 2A). Notably, in each of the analyzed gut metagenomes, with a single exception, we detected only one Jug phage genome type. It seems likely that Jug phages have superinfection exclusion mechanisms that prevent infection of the host bacteria by other genome types. However, the possibility cannot be ruled out that we failed to retrieve Jug phage genomes from samples containing multiple genome types because the assemblies could be highly fragmented due to the high nucleotide identity between genomes from the same group, or alternatively, all other Jug phages in the same samples could have too low abundance to be detected by metagenomic analysis.
The pronounced discrepancy between the high overall genome similarity and the low TerL identity among the Alpha, Gamma, and Delta groups suggested that Jug phages have undergone targeted replacement of the terL gene via HGT (Fig. 2A and fig. S11). To our knowledge, this represents the first case of terL replacement in huge phages, although similar observations have been reported for crAss-like phages (11) and other groups of phages from the human gut (14). This finding challenges the common practice of using TerL as a universal phylogenetic marker for Caudovirales. The terL replacement might reflect adaptive fine-tuning of the DNA packaging mechanisms in response to ecological pressures or host-specific constraints, or could be a route of escape from host defenses.
In contrast to crAss-like phages (11) and Lak phages (28), Jug phages show no evidence of using alternative genetic codes. However, similar to crAss-like phages (11), gene fragmentation through insertion of group I self-splicing introns and inteins is a common feature of Jug phage genomes (Fig. 2, A and D). Given that many viral genes remain uncharacterized, fragmentation obscures gene prediction and functional inference. Integrating metatranscriptomic sequencing and reads mapping with structure-based protein annotation could help resolve gene boundaries and assess whether fragmented open reading frames are spliced or cotranslated (Fig. 2D).
Jug phages were predicted to infect Bacteroides and/or Phocaeicola, which are dominant in human gut microbiotas and play essential roles in degrading dietary and host-derived glycans, thereby contributing to short-chain fatty acid production and gut health (49). The increased occurrence of Jug phages in individuals on high-fiber diets supports this predicted host-phage association (fig. S24).
Last, we report here the identification of calcium-translocating P-type ATPases in Jug phages and other huge phages, as well as a few phages with smaller genomes infecting Bacteroidaceae and Bacillota (fig. S29), and show that this gene is expressed at a high level in at least some Jug phages (Fig. 6, E and F). These membrane proteins likely manipulate host intracellular Ca2+ levels, influencing host metabolism, stress response, and short-chain fatty acid production. We hypothesize that this phage-encoded calcium export system confers a competitive advantage to the phage by remodeling the host cytoplasmic environment to optimize phage replication and packaging, and/or indirectly reshape gut microbial ecology and host physiology by disrupting calcium-regulated commensal bacterial functions. Such mechanisms reflect a broader strategy in which phages deploy auxiliary metabolic functions to manipulate host cell physiology for their benefit.
Given the broad distribution, high transcriptional activity, and diverse auxiliary functions of Jug phages, we propose that they play an active role in modulating gut microbiome composition and host physiology. Their ecological influence could stem from lysis of bacterial hosts that are major microbiome components, and/or through metabolic modulation via the activity of phage genes such as the calcium-translocating P-type ATPases. There are two major limitations in this study. First, although the Jug phages were detected with high transcriptional activities in three gut samples, the number of samples is limited. Second, the cultivation of the identified Jug phages under laboratory conditions has not yet been achieved. Thus, further research is warranted to isolate them, characterize their infection cycles, and experimentally validate their impact on host cells and microbial communities, as suggested by the in-depth bioinformatic analyses presented in this study.
MATERIALS AND METHODS
Viral genome collection
Published or public potential huge phage genomes were collected from NCBI Genbank and publications via the downloading links or accession numbers if provided, as of 30 September 2024.
The NCBI phage genomes with a minimum length of 200 kbp were collected with the keyword “phage,” and the “Sequence length” was set from “200000” to “1000000,” which retrieved a total of 954 genomes (“NCBI_phage_200kbp”). To avoid missing any genomes of huge phages that are named “virus” rather than “phage” in NCBI, we also searched “virus” with the same length range, and retrieved a total of 943 genomes (“NCBI_virus_200kbp”). These two sets of genomes contained a total of 960 unique genomes. Local genome comparison indicated that all the previously reported isolated huge phages (22) were included in the collected NCBI sequences.
The viral datasets published before 31 August 2024 were downloaded according to the descriptions in the corresponding publications (if available). A detailed description of the publications is provided in table S1. The viral genomes were filtered to retain those with a minimum length of 200 kbp. We also retrieved some metagenomic datasets sequenced with third-generation sequencing technology, such as Nanopore. For example, the Nanopore sequences (passed default Guppy quality control) reported recently were provided by the authors (50); those with a minimum length of 200 kbp were screened. The IMG/VR v4 database was filtered to retain the viral genomes (i) with a minimum length of 200 kbp, (ii) not assigned as “Megaviricetes” (via the corresponding IMG/VR mapping file), and (iii) whose corresponding sample is not under the JGI Data Utilization Status of “Restricted,” which yielded a total of 5181 sequences.
Huge phage genome identification
The collected viral sequences were first evaluated by geNomad (parameter: end-to-end) (44). The geNomad results were parsed using the “Conservative standard,” i.e., virus_score ≥0.80, n_hallmarks ≥1, mark_enrichment ≥1.5, and fdr ≤ 0.05. The geNomad hits were then evaluated by VirSorter2 (51) with the following parameters: --keep-original-seq, --include-groups “dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae.” Only the sequences with a VirSorter2 score of ≥0.5 were retained for further analyses.
Then, CheckV (43) (parameter: end_to_end) was used to analyze the virus-containing sequences to remove the host regions. Subsequently, the output files of “proviruses.fna” and “viruses.fna” from CheckV were combined (termed as “checkv.fa”). The CheckV provirus sequences were re-run in a second CheckV analysis to obtain the relevant information on the virus fragments.
The “checkv.fa” sequences (pure viral fragments) were evaluated by another run of VirSorter2 with the following parameters: “--seqname-suffix-off --viral-gene-enrich-off --provirus-off --prep-for-dramv --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae.” Subsequently, the genomes were filtered on the basis of information from the second CheckV and the second VirSorter2 analyses to retain those with (i) viral_gene >0 (i.e., Keep1) or (ii) viral_gene =0 and (host_gene =0 or score ≥0.95 or hallmark >2) (i.e., Keep2), following a widely used online procedure (52).
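The Keep1/Keep2 rule above can be written as a small decision function. The following is a minimal sketch, assuming the four quantities have already been extracted per sequence from the CheckV and VirSorter2 outputs; the function name and argument names are hypothetical and chosen only to mirror the wording of the rule.

```python
def keep_category(viral_gene, host_gene, score, hallmark):
    """Classify one sequence as 'Keep1', 'Keep2', or None (discard).

    Argument names are illustrative; they mirror the rule in the text,
    not the exact column names of any tool's output.
    """
    if viral_gene > 0:
        return "Keep1"
    # Here viral_gene == 0: keep only if there is no host signal,
    # or the viral score is very high, or several hallmark genes exist.
    if viral_gene == 0 and (host_gene == 0 or score >= 0.95 or hallmark > 2):
        return "Keep2"
    return None
```

A sequence with no viral genes, two host genes, a score of 0.90, and two hallmark genes would be discarded under this rule, because none of the three Keep2 conditions is met.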
The retained sequences were subjected to a BLAST search to check for replication issues that may arise in the de novo assembly. If a given sequence’s first half and second half shared at least 50% of their length with ≥99% nucleotide sequence similarity, it was flagged, and a manual check was performed to exclude any assembly issues.
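As an illustration of the flagging criterion, the sketch below assumes that the first half of a sequence has been aligned against its second half and that the resulting high-scoring pairs are available as (start, end, percent identity) tuples in first-half coordinates. It interprets "shared at least 50% of their length" as the fraction of the half covered by alignments with ≥99% identity; this interpretation, and all names used, are assumptions rather than the authors' script.

```python
def covered_length(intervals):
    """Total length covered by 1-based inclusive intervals, merging overlaps."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block before starting a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def flag_self_duplication(half_length, hsps, min_identity=99.0, min_fraction=0.5):
    """Return True if high-identity alignments cover enough of the half."""
    good = [(start, end) for start, end, identity in hsps if identity >= min_identity]
    return covered_length(good) / half_length >= min_fraction
```

Flagged sequences would still need the manual check described above, since a genuine repeat-rich genome and an assembly artifact can look alike at this level.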
Next, if a given sequence was assigned as “NCLDV” by geNomad but as “Caudoviricetes” by VirSorter2, or vice versa, then a manual check was performed for confirmation based on the taxonomic information (most similar UniProt BLASTp hits) of all the protein-coding genes of the sequence. In the final step, we uploaded all the tentative huge phage genomes to ggKbase (https://ggkbase.berkeley.edu/), and excluded those identified as nonphage genomes based on ggKbase references.
Genome clustering
To remove redundant genome sequences that may be reported from different studies or different assemblies, all the identified huge phage genomes were clustered using the rapid genome clustering approach provided by CheckV (https://bitbucket.org/berkeleylab/checkv/src/master/) (43), with the parameters set as “min_ani = 100, min_tcov = 100,” to obtain the nonredundant genome set. These nonredundant genomes were further clustered into species-level clusters using the same approaches, with the parameters of “min_ani = 95, min_tcov = 85.”
Manual curation of the Jug phage genomes
To curate the Jug phage genomes, the corresponding paired-end read files were downloaded from NCBI and assembled using metaSPAdes version 3.15.2 (53) with the kmer set of “21,33,55,77,99.” The assembled scaffolds were then searched against the HPGC Jug phage genomes using BLASTn for Jug phage–related scaffolds. The purpose of manual genome curation was to exclude any potential assembly errors and to extend the genome of interest by paired-end reads mapping. Thus, relatively high sequencing coverage (or depth), at least 30×, is ideal, and we selected Jug phage–related scaffolds with ≥30× coverage as targets for curation. The target scaffolds were manually curated via read mapping, scaffold extension, assembly error fixation, and reassembly, as previously described (54), with step-by-step procedures available at https://ggkbase-help.berkeley.edu/genome_curation/scaffold-extension-and-gap-closing/. The scaffold extension relies heavily on unplaced paired-end reads. For example, for a read pair consisting of read_1 and read_2, if read_1 is well mapped to the reference while read_2 is not, then read_2 is termed an unplaced paired-end read. When performing read mapping using Bowtie2 or alternative tools, we usually output only the mapped reads to the SAM/BAM files to save storage space. This can be done by adding the parameter “--no-unal.” However, this will exclude both read_1 and read_2 in the abovementioned example. The shrinksam tool (available at https://github.com/bcthomas/shrinksam) was designed to output both reads, even if only one is well mapped. Alternatively, an in-house script could be developed to this end (for example, using the pysam tool that is available at https://github.com/pysam-developers/pysam). Please note that manual curation is time-consuming and should be performed only for genomes of high interest.
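The pair-retention idea can be illustrated on plain SAM text using the standard FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The sketch below is not the shrinksam implementation and does not use pysam; it is a dependency-free illustration, with hypothetical function names, of keeping every record whose read or mate is mapped and dropping pairs in which neither mate is mapped.

```python
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def keep_record(flag):
    """Keep a record unless neither the read nor its mate is mapped."""
    if flag & PAIRED:
        return not (flag & READ_UNMAPPED and flag & MATE_UNMAPPED)
    # Single-end read: keep only if the read itself is mapped.
    return not (flag & READ_UNMAPPED)


def filter_sam_lines(lines):
    """Return header lines plus alignment lines that pass keep_record."""
    kept = []
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("@"):
            kept.append(line)  # header lines are always retained
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if keep_record(flag):
            kept.append(line)
    return kept
```

With this rule, the unmapped mate of a mapped read stays in the output and remains available for scaffold extension, while fully unmapped pairs are still removed to save space.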
Protein-coding gene prediction and annotation
The protein-coding genes from each of the 139 curated Jug phages were predicted using Prodigal version 2.6.3 (single mode) with code 11 (55). The functional domains of the Jug phage proteins were predicted using hmmsearch from HMMER version 3.3.2 (56) against the Pfam database (57). The results were parsed using cath-resolve-hits version 0.16.10-0-g99edb28 (58), then the identification of a domain in a protein was filtered by an independent e-value (“indp-evalue”) threshold of 1 × 10−5.
For protein structure prediction, the protein sequences were first clustered using CD-HIT with the parameters of “-c = 1, -aS = 1.” For representatives of each cluster, protein structure was predicted using ColabFold (downloaded on 22 November 2024) (59), with the following parameters: “--num-recycle 3, --use-gpu-relax, --amber, --stop-at-score 70.” For structure annotation, the predicted structures with a pLDDT score of ≥70 were searched against the Protein Data Bank (PDB) (60) and the predicted structures of BFVD (61), using FoldSeek 9.427df8a (62) with the “easy-search” command (-c = 0.5, --cov-mode = 0). Only those hits with a TM score ≥0.5 (“alntmscore”) were retained for further analyses. The structure prediction and annotation of specific proteins, such as large terminase, MCPs, and DNA polymerase, were confirmed via online FoldSeek search (62) at https://search.foldseek.com/search.
Identification of group I introns
To globally identify potential group I introns in the Jug phage genomes, we downloaded the CM format files of all the group I intron subtypes reported recently (63) and searched them against the Jug phage genomes using cmscan from INFERNAL version 1.1.4 (64) with an e-value threshold of 1 × 10−4. If a specific region was predicted with multiple subtype hits, only the one with the lowest e-value was retained and counted.
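The "lowest e-value per region" rule can be sketched as a greedy selection over hits sorted by e-value. The example assumes that the cmscan hits have been parsed into dictionaries with the hypothetical keys `seq`, `start`, `end`, `subtype`, and `evalue`, and that "the same region" means any coordinate overlap on the same genome; both are assumptions made for illustration.

```python
def best_hit_per_region(hits, max_evalue=1e-4):
    """Keep, for each set of overlapping hits, only the lowest-e-value hit."""
    passing = [h for h in hits if h["evalue"] <= max_evalue]
    kept = []
    # Visit hits from best to worst so that a kept hit always beats
    # any later hit overlapping the same region.
    for hit in sorted(passing, key=lambda h: h["evalue"]):
        lo, hi = sorted((hit["start"], hit["end"]))  # tolerate minus-strand coords
        overlaps = any(
            k["seq"] == hit["seq"]
            and lo <= max(k["start"], k["end"])
            and hi >= min(k["start"], k["end"])
            for k in kept
        )
        if not overlaps:
            kept.append(hit)
    return sorted(kept, key=lambda h: (h["seq"], min(h["start"], h["end"])))
```

Counting the `subtype` values of the returned hits then gives one count per intron-containing region.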
Protein similarity comparison of different groups of Jug phages
To identify the single-copy protein clusters shared by all four groups, the predicted protein sequences were clustered using CD-HIT version 4.8.1 (65) with ≥70% identity (-c 0.7 -aS 0.9 -G 0), and the single-copy protein clusters were identified accordingly. We screened all the single-copy protein clusters, allowing a given cluster to be present in at least half of the genomes in each group (table S7).
Because the genomes of groups Alpha, Gamma, and Delta shared high genome-wide similarity while having low TerL identity, we investigated the single-copy protein clusters shared by all three groups for other lower similarity ones. The predicted proteins of groups Alpha, Gamma, and Delta were clustered using CD-HIT version 4.8.1 (65) with ≥70% identity (-c 0.7 -aS 0.9 -G 0). For each of the clusters, including members from Alpha, Gamma, and Delta, the average similarity of members against the representative was calculated (table S5).
As the genomes of groups Alpha and Beta shared high TerL identity but low genome-wide similarity, we investigated the single-copy protein clusters shared by both groups for other highly similar ones. The predicted proteins of group Alpha and Beta were clustered using CD-HIT version 4.8.1 (65) with ≥70% identity (-c 0.7 -aS 0.9 -G 0). For each of the clusters with ≥5 members from group Alpha and ≥5 members from group Beta, if the cluster representative was from group Alpha, the average similarity of group Beta members against it was calculated; otherwise, the average similarity of group Alpha members against it was calculated (table S4).
Large terminase identity within other HPGC genome clusters
The other HPGC huge phage genomes (excluding Jug phage ones) were clustered at 90% nucleotide sequence similarity across at least 50% of the genome length. A total of 212 clusters were identified with five or more members. The large terminase proteins were predicted using geNomad version 1.5.2 (44). Among the 79 clusters with at least two identified large terminase proteins, their within-cluster large terminase protein identity was determined using BLASTp.
Phylogenetic analyses of Jug phages based on TerL
The TerL fragments from curated Jug phage genomes were first concatenated into one piece and aligned with the full-length TerLs of Jug phages, the TerLs of Jug phage relatives, and the TerLs from the gut of sheep and water deer. All the HPGC proteins (excluding those from Jug phages) were searched against the Jug phage TerL sequences, and those hits with an e-value of ≤1 × 10−5, ≥30% identity, and a minimum length of 400 aa were determined as Jug phage relatives. For references, we also included NCBI Genbank TerLs with a minimum identity of 20% to those of Jug phages. The alignment of TerL sequences was performed using MUSCLE v3.8.31 (66) and filtered using trimAl version 1.4.rev22 (67) to remove those columns comprising ≥90% gaps. The phylogenetic trees were constructed using IQ-TREE (68) with the parameter “-bb 1000, -m LG+G4.”
Bacterial host prediction
We used both iPhoP version 1.3.3 (69) and local CRISPR-Cas spacer targeting to predict the host of Jug phages. The iPhoP analysis was performed using default parameters. For local CRISPR-Cas spacer targeting analyses, we first predicted the repeat regions from all the metagenomic assembled scaffolds (from the samples we used for manual curation of Jug phage genomes) using PILER-CR version 1.06 (70) with default parameters. The spacers were extracted from the predicted repeat regions with Cas genes present within 1000 bp. The curated Jug phage genomes were searched against the spacers with an e-value threshold of 1 × 10−4, and only those spacers matched with ≤1 mismatch across the whole spacer length were counted as targeting spacers. The Cas genes were predicted by an HMM search of the predicted protein-coding genes of repeat-containing scaffolds against an in-house Cas reference database that was primarily from the TIGRFAM dataset (71). The taxonomic assignment of the scaffolds with targeting spacers was conducted by comparing their protein-coding genes against the UniprotKB database (72) using MMseqs2 version 15.6f452 (73).
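The "≤1 mismatch across the whole spacer length" criterion can be illustrated with a direct sliding-window comparison on both strands. This is a simplified stand-in for the alignment-based search described above (it ignores e-values and gapped alignments), and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def spacer_targets(genome, spacer, max_mismatches=1):
    """True if the full-length spacer matches either strand within the limit."""
    n = len(spacer)
    for strand in (genome, reverse_complement(genome)):
        for i in range(len(strand) - n + 1):
            if mismatches(strand[i:i + n], spacer) <= max_mismatches:
                return True
    return False
```

Because every window has the same length as the spacer, a match here always spans the whole spacer, which is the property the filtering step requires.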
Comparison against the crAss-like phage and Lak phage genomes
To compare the Jug phage genomes against crAssphage (145 to 192 kbp) and Lak phages (408 to 660 kbp), their genomes were downloaded from the corresponding publications (11, 28, 29) and searched using BLASTn with an e-value threshold of 1 × 10−5 and a minimum alignment length of 1000 bp. The protein-coding genes of crAss-like phages and Lak phages were downloaded from the corresponding publications and compared against the large terminase and MCP sequences of Jug phages using BLASTp with an e-value threshold of 1 × 10−5.
The co-occurrence analyses of Jug phages and predicted bacterial hosts
To evaluate the distribution of Jug phages, crAssphage, Lak phages, and pBI143, the quality paired-end reads from 125 samples were mapped to the genomes using Bowtie2 version 2.4.4 (74) with default parameters. The shrinksam tool (https://github.com/bcthomas/shrinksam) was used to exclude unmapped pairs from the SAM file, which was subsequently sorted and converted to the BAM format using SAMtools (75). The number of mapped reads to each genome across samples was generated by CoverM version 0.7.0 (“contig” mode) (https://github.com/wwood/CoverM) using the “count” method with the parameters set as “--min-read-aligned-percent 95 --min-read-percent-identity 90 --contig-end-exclusion 75 -m count.”
The relative abundance of the predicted bacterial host across the 125 metagenomic samples was determined using ribosomal protein S3 (rpS3). First, the protein-coding genes from all the assembled scaffolds were predicted using Prodigal version 2.6.3 (meta mode) with code 11 (55). Then, the rpS3-coding genes were predicted using hmmsearch from HMMER version 3.3.2 (56) against the kofam (76) by identifying those matching K02982 (score threshold = 108.13) or K02984 (score threshold = 67.70). The taxonomic assignment of the identified rpS3 was conducted by searching against the rpS3 database retrieved from GTDB (gtdb_r226) (77) using BLASTp. For each sample, to calculate the relative abundance of the predicted bacterial hosts, the sequencing coverage of the corresponding rpS3-encoding scaffold was divided by the total sequencing coverage of all rpS3-encoding scaffolds in the sample.
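The rpS3-based relative abundance is a simple ratio of coverages. The sketch below assumes that the coverage of each rpS3-encoding scaffold and its genus-level assignment are already available as dictionaries; summing scaffolds that share a genus is an assumption made here for illustration, and all names are hypothetical.

```python
from collections import defaultdict


def genus_relative_abundance(coverage, genus_of):
    """Relative abundance per genus from rpS3-scaffold coverages.

    coverage: scaffold id -> sequencing coverage of that rpS3 scaffold
    genus_of: scaffold id -> genus assigned to that rpS3 sequence
    """
    total = sum(coverage.values())
    if total == 0:
        return {}
    abundance = defaultdict(float)
    for scaffold, cov in coverage.items():
        # Each scaffold contributes its share of the total rpS3 coverage.
        abundance[genus_of[scaffold]] += cov / total
    return dict(abundance)
```

For a sample with rpS3 scaffolds at 30×, 10×, and 60× coverage assigned to Bacteroides, Phocaeicola, and another genus, the function gives 0.3, 0.1, and 0.6, respectively.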
The co-occurrence of Jug phages and their predicted hosts (Bacteroides and/or Phocaeicola) was tested using the R package “CooccurrenceAffinity” version 1.0.2, based on a total of 610 metagenomic samples, including the 125 samples mentioned above and 485 other randomly selected human gut metagenomic samples that we had already assembled for other research projects. We profiled the presence of Bacteroides and Phocaeicola in all these samples based on rpS3 analyses as described above. The presence of Jug phages in the other 485 samples was determined by identifying TerL and MCP proteins using BLASTp.
The co-occurrence analysis based on the CooccurrenceAffinity model revealed a highly significant positive association between Jug phages and their predicted hosts. The estimated log-odds affinities were α = 2.14 for Bacteroides and α = 1.76 for Phocaeicola (false discovery rate < 10−10), indicating that these hosts co-occur with Jug phages far more frequently than expected under random distribution. This result provides strong statistical support for the host assignments predicted based on CRISPR-Cas spacer targeting.
Pebblescout and Logan search
The curated Jug phage genomes were individually used for an online Pebblescout search at https://pebblescout.ncbi.nlm.nih.gov/ with the target database set to “Metagenomic” (30). The information on all the targeted metagenomic samples was downloaded for further analysis. The metagenomic samples with a Pebblescout “%coverage” of ≥70 were selected for de novo assembly and manual genome curation for Jug phage genomes.
The first 1000 nt (from 5′ end) of the nucleotide sequences of the Jug phage TerL genes were used for online Logan search at https://logan-search.org/dashboard with a threshold of ≥0.7 and with the “Group” parameter as “All” (33). The contigs of the targeted metagenomic samples were downloaded via “aws s3 cp s3://logan-pub/c/{sample}/{sample}.contigs.fa.zst . --no-sign-request,” in which “{sample}” represents the NCBI SRA ID of the corresponding sample, which was obtained from the Logan search. The downloaded contigs with a minimum length of 1000 bp were searched against the curated Jug phage genomes using BLASTp, and those with ≥90% similarity across ≥90% of their length were considered as reliable hits. The protein-coding genes were predicted from all the retrieved contigs ≥1000 bp using Prodigal version 2.6.3 (55) using the parameters of “-m -p meta,” and searched against the TerLs, rpoC, and the 38 single-copy protein clusters from the 139 curated Jug phage genomes using BLASTp.
Protein clustering analyses of curated and Logan-retrieved Jug phage genomes
The protein-coding genes were predicted from all the manually curated genomes and the Logan-retrieved Jug phage genomes using Prodigal version 2.6.3 (55) using the parameters of “-m -p meta.” The protein sequences were clustered using CD-HIT version 4.8.1 (65) using the parameters of “-c 0.7 -aS 0.9 -G 0,” which meant ≥70% identity across ≥90% length of both the query and target sequences. The distribution and sharing of each protein cluster among different groups, or different animal hosts of each group, were evaluated. If at least one genome of a given group or host animal had a protein in a given protein cluster, then the corresponding protein cluster was considered to be present in the group or host animal. The annotation of the proteins was performed as described above.
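The presence rule (a cluster counts as present in a group if at least one genome of that group contributes a protein) can be sketched as follows. The input dictionaries and function names are hypothetical; in practice they would be derived from the CD-HIT cluster file and the group or host-animal metadata.

```python
from collections import defaultdict


def groups_per_cluster(cluster_to_genomes, genome_to_group):
    """Map each protein cluster to the set of groups in which it is present."""
    presence = defaultdict(set)
    for cluster_id, genomes in cluster_to_genomes.items():
        for genome in genomes:
            # One genome with a member protein is enough to mark presence.
            presence[cluster_id].add(genome_to_group[genome])
    return dict(presence)


def shared_by_all(presence, groups):
    """Clusters present in every one of the listed groups."""
    wanted = set(groups)
    return sorted(c for c, found in presence.items() if wanted <= found)
```

The same two functions apply unchanged when `genome_to_group` maps genomes to host animals instead of phage groups.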
Analysis of Jug phages during FMT
The Logan-retrieved genomes included samples from FMT, which were used to investigate the transmission of Jug phages among humans. In the study by Smith et al. (34), the authors treated 22 patients with mild to moderate ulcerative colitis by FMT from two healthy donors. The FMT was conducted by capsules (CAPS) or enema (ENMA) for maintenance dosing. For comparison, 11 of the patients received antibiotic pretreatment before the FMT. The FMT was performed six times, followed by three follow-up evaluations conducted at 2, 6, and 14 weeks (unless otherwise stated) after the last maintenance dose. The fecal samples were collected for metagenomic analysis at the beginning, before antibiotic pretreatment, before each maintenance dosing, and before each follow-up evaluation. Thus, 10 (without antibiotic pretreatment) or 11 (with antibiotic pretreatment) metagenomic samples were generally available for each recipient, unless some samples were not sequenced for various reasons.
The raw paired-end metagenomic reads from all donors and recipients of all time points were filtered using fastp version 0.22.0 (78) with default parameters. The quality reads were assembled using metaSPAdes version 3.15.2 (53) with the kmer set of “21,33,55,77,99.” The assembled scaffolds were searched against the curated Jug phage genomes using BLASTn, and the confirmed Jug phage-related scaffolds were manually curated. The quality reads of each recipient at each time point were mapped to the curated Jug phage genome of the donor with Bowtie2 version 2.4.4 (74) with default parameters. The sequencing coverage of the Jug phage at each time point was calculated by CoverM version 0.7.0 (“contig” mode) (https://github.com/wwood/CoverM) using the “trimmed_mean” method with the parameters set as “--min-read-aligned-percent 99 --min-read-percent-identity 100 --min-covered-fraction 99 --contig-end-exclusion 75 -m trimmed_mean.” Thus, if ≥99% of a reference Jug phage genome is covered by reads from a recipient, we concluded that an identical Jug phage was present in the recipient.
Analysis of Jug phages in the guts of humans with dietary intervention
In the study reported by Delannoy-Bruno et al. (35), 12 and 14 overweight or obese men or women completed dietary intervention studies 1 and 2, respectively. In study 1, the gut microbiome of the participants contained ≥0.1% of P. vulgatus and/or ≥0.1% of Bacteroides thetaiotaomicron, Bacteroides cellulosilyticus, Bacteroides uniformis, or Bacteroides ovatus. Each of the participants was asked to maintain their usual diet for 4 days, and on days 5 to 14, they were asked to consume only HiSF-LoFV meals. Starting from day 15, the participants were supplemented with pea fiber-containing snacks while continuing the HiSF-LoFV meals. The snack was taken once per day (i.e., one bar per day) on days 15 and 16, twice per day on days 17 and 18, then three times per day from day 19 through day 35. Then, the participants returned to HiSF-LoFV meals without a snack bar for another 14 days. In study 2, the participants were not prescreened for the presence of Bacteroides species. The participants were asked to consume HiSF-LoFV meals from days 2 to 11, then with a two-fiber snack prototype once on day 13, twice on day 14, and three times on days 14 to 25. On days 26 to 35, the participants consumed only HiSF-LoFV meals without fiber snacks. Afterward, they were supplemented with a four-fiber snack prototype once on day 36, twice on day 37, and three times per day on days 38 to 49.
We identified Jug phages using the Logan assembly in three of the participants (i.e., S11, S13, and S15), with Jug phages detected in S13 samples of both studies, in S11 samples of study 1 (completed both studies), and in S15 samples of study 1 (completed study 1 only). The quality control of raw metagenomic reads, assembly, and manual genome curation was performed as for the FMT samples. The identification of Bacteroides and Phocaeicola species in the samples was based on rpS3 as described above. For each participant, the unique rpS3-encoding scaffolds were used as read-mapping reference along with the corresponding curated Jug phage genome, then the quality paired-end reads of each time point were mapped using Bowtie2 version 2.4.4 (74) with default parameters. The sequencing coverage of the Jug phage at each time point was calculated by CoverM version 0.7.0 (“contig” mode) (https://github.com/wwood/CoverM) using the “trimmed_mean” method with the parameters set as “--min-read-aligned-percent 99 --min-read-percent-identity 100 --min-covered-fraction 75 --contig-end-exclusion 75 -m trimmed_mean.” The taxonomic assignment of the identified rpS3 was conducted by searching against the rpS3 database retrieved from GTDB (gtdb_r226) (77) using BLASTp. For each sample of each participant, the relative abundance of the Bacteroides and Phocaeicola species was calculated as the sequencing coverage of the corresponding rpS3-encoding scaffold divided by the total sequencing coverage of all rpS3-encoding scaffolds in the sample. To make it comparable, the presence ratio of each Jug phage in each sample was calculated as the sequencing coverage of the Jug phage genome divided by the total sequencing coverage of all rpS3-encoding scaffolds in the corresponding sample.
Analysis of Jug phages for transcriptional activity
A Pebblescout search (30) was used to find available metatranscriptomic datasets containing Jug phages. The Jug phage genomes from three adult men, each with four DNA and four RNA samples (32), were reconstructed via de novo assembly and manual curation as described above. To evaluate in situ transcriptional activities in the human gut, the RNA reads were mapped to the corresponding Jug phage genomes using Bowtie2 version 2.4.4 (74), and the alignments were filtered using CoverM version 0.7.0 (“contig” mode) using the “count” method with the parameters set as “--min-read-aligned-percent 99 --min-read-percent-identity 100 --contig-end-exclusion 75 -m count.” The protein-coding genes were predicted using Prodigal version 2.6.3 (single mode) (55), and the RPKM (reads per kilobase per million mapped reads) of each protein-coding gene was calculated. The annotation of the proteins was performed as described above.
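As a worked illustration of the RPKM definition, the following Python sketch computes the value for a single gene. The function name and example numbers are hypothetical, and the text does not state which read total was used as the denominator, so the `total_mapped_reads` argument is an assumption left to the caller.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    RPKM = reads / (gene length in kb) / (mapped reads in millions)
         = reads * 1e9 / (gene length in bp * mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and mapped-read total must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 1,000-bp gene, 2 million mapped reads.
example_rpkm = rpkm(500, 1000, 2_000_000)
```

Normalizing by both gene length and mapped-read total makes values comparable between genes of different lengths and between samples of different sequencing depth.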
Analyses of TerD domain–containing proteins
To reveal the distribution of TerD domain–containing proteins in the Unified Human Gastrointestinal Genome (UHGG) database, the unique protein sequences (“uhgp-100.tar.gz”) were downloaded from http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0/protein_catalogue. The proteins were first searched against the Pfam TerD domain HMM (i.e., PF02342_23) using hmmsearch from HMMER version 3.3.2 (56), and the results were parsed with cath-resolve-hits version 0.16.10-0-g99edb28 (58). The proteins with an indp-evalue ≤1 × 10−5 were then searched against the whole Pfam HMM database, and the results were parsed using the same version of cath-resolve-hits. For a given protein, if the indp-evalue of a hit to another Pfam domain was lower than that of the TerD domain, then the protein was considered to have no TerD domain. The identified TerD domain–containing proteins were assigned to genomes and taxonomy based on the associated UHGG metadata (“genomes-all_metadata.tsv”). Notably, a BLASTp search (with >99% identity) showed that the UHGG protein database contained many Jug phage proteins. For example, 236 of the 427 proteins from MGYG000275746 (UHGG assigned taxonomy: d__Bacteria;p__Firmicutes_A;c__Clostridia;o__TANB77;f__CAG-508;g__CAG-245;s__CAG-245 sp000435175) were actually Jug phage proteins. The TerD domain–containing proteins matching those from Jug phages were excluded when performing the genome and taxonomy assignment.
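The two-step decision rule for calling a TerD domain can be sketched in Python as follows. This is a literal reading of the rule as described, not the authors' script: the input layout (a list of domain name and indp-evalue pairs per protein) and the domain label `"TerD"` are assumptions made for illustration.

```python
TERD = "TerD"  # assumed label for the TerD Pfam domain in the parsed hits


def has_terd_domain(hits, threshold=1e-5):
    """Decide whether a protein is counted as TerD domain-containing.

    hits: list of (pfam_domain_name, indp_evalue) tuples for one protein
    (hypothetical layout of the parsed search results).
    """
    terd_evalues = [evalue for name, evalue in hits if name == TERD]
    if not terd_evalues:
        return False
    best_terd = min(terd_evalues)
    # Step 1: the TerD hit itself must pass the indp-evalue threshold.
    if best_terd > threshold:
        return False
    # Step 2: reject if any other Pfam domain has a lower indp-evalue.
    return all(evalue >= best_terd for name, evalue in hits if name != TERD)


# Hypothetical proteins (domain names other than TerD are placeholders).
kept = has_terd_domain([("TerD", 1e-20), ("Other_domain", 1e-8)])
rejected = has_terd_domain([("TerD", 1e-8), ("Other_domain", 1e-30)])
```

The second step guards against proteins whose weak TerD match is better explained by a different domain family.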
Analyses of calcium-translocating P-type ATPases
The calcium-translocating P-type ATPase genes (ATPase hereafter for short) identified in Jug phages according to the Pfam domain search usually contained four conserved Pfam domains: “Cation_ATPase_N,” “E1-E2_ATPase,” “Cation_ATPase,” and “Cation_ATPase_C.” The annotations were confirmed by protein structure prediction using ColabFold (downloaded on 22 November 2024) (59), and the predicted structures were compared using Foldseek version 9.427df8a (62) with the “easy-search” command (-c = 0.5, --cov-mode = 0).
Protein-coding genes were predicted on the scaffolds assembled from the seven adult metagenomes with transcriptionally active Jug phages using Prodigal version 2.6.3 (-m, -p = meta) (55), followed by a BLASTp search against the Jug phage ATPases with an e-value threshold of 1 × 10−5. The proteins with a minimum length of 500 aa and ≥400 aa aligned length were considered as hits, and their domains were predicted using hmmsearch from HMMER version 3.3.2 (56) against the Pfam version 37 database (57). Those proteins with two of the four domains were considered ATPases. All the predicted ATPases from the seven metagenomic samples and those from curated Jug phages were clustered at ≥99% identity (-c 0.99 -aS 0.9 -G 0) using CD-HIT version 4.8.1 (65), and the representatives were used for phylogenetic tree analyses. The taxonomic assignment of the ATPases was determined on the basis of the proteins encoded on the corresponding scaffolds and confirmed according to the phylogeny, given that ATPases from phylogenetic relatives usually clustered together on the tree. The identification of the calcium-translocating P-type ATPases from NCBI GenBank and other HPGC phage genomes was conducted using the same pipeline.
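The hit and domain filters described above can be expressed as a short Python sketch. It is illustrative only: the function names and input layout are hypothetical, and "two of the four domains" is interpreted here as at least two, which is an assumption.

```python
# The four Pfam domains named in the text for the Jug phage ATPases.
ATPASE_DOMAINS = frozenset(
    {"Cation_ATPase_N", "E1-E2_ATPase", "Cation_ATPase", "Cation_ATPase_C"}
)


def is_blast_hit(protein_length, aligned_length, evalue,
                 min_length=500, min_aligned=400, max_evalue=1e-5):
    """Apply the length, aligned-length, and e-value cutoffs to one BLASTp match."""
    return (protein_length >= min_length
            and aligned_length >= min_aligned
            and evalue <= max_evalue)


def is_atpase(domains, min_domains=2):
    """Count distinct ATPase domains; 'at least two' is an assumed reading."""
    return len(ATPASE_DOMAINS & set(domains)) >= min_domains


# Hypothetical candidates (illustrative values only).
candidate_ok = (is_blast_hit(880, 760, 1e-50)
                and is_atpase(["E1-E2_ATPase", "Cation_ATPase_C"]))
candidate_short = is_blast_hit(450, 420, 1e-50)
```

Using a set intersection means that repeated matches to the same domain are counted once.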
Structural annotation of calcium-transporting P-type ATPases was performed using a combination of InterProScan and NCBI CDD analyses. Functional domains, namely, the actuator (A), phosphorylation (P), and nucleotide-binding (N) domains, were defined on the basis of conserved features, although the exact residue ranges vary across different proteins. The transmembrane helices were predicted using the DeepTMHMM web server (https://dtu.biolib.com/DeepTMHMM) and grouped into three representative membrane-spanning regions (M1–2, M3–4, and M6–10) for visualization and comparison. All domains were visualized in PyMOL version 3.0.3 (79) using cartoon representation and domain-specific colors.
The ATPase structures from human (PDB: 7e7s) and Listeria monocytogenes (LMCA1) were used as references to compare the spatial positions of the conserved DKTGT motif in phage-derived homologs. The Cα-Cα distance between corresponding “D” residues in reference and phage proteins was computed using PyMOL to assess spatial shifts in the phosphorylation loop. The DKTGT motifs were rendered as sticks for clarity. Calcium ions in reference structures were automatically identified by selecting atoms with the element CA and residue name CA. Known calcium-coordinating residues in the reference proteins were specified manually, and their distances to calcium ions were calculated. To identify corresponding residues in target proteins, all Cα atoms in the target were searched for spatial proximity to each reference residue. The nearest residues were recorded as potential calcium-coordinating candidates. Distances between these target residues and the calcium ions in the reference structure were then measured. This mapping procedure was conducted separately for both human and L. monocytogenes reference structures, enabling comparative evaluation across target proteins, including those derived from phages. Coordinating residues were shown as sticks, colored dirty violet in the reference and hot pink in the target proteins.
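The nearest-residue mapping described above can be sketched in plain Python, independent of PyMOL. This is a simplified illustration and not the authors' script: it assumes that the reference and target structures have already been superimposed into the same coordinate frame, and all residue labels and coordinates below are hypothetical.

```python
import math


def nearest_residue(reference_xyz, target_ca):
    """Return (residue_id, distance) of the target C-alpha closest to a point.

    target_ca: dict mapping residue id -> (x, y, z) of its C-alpha atom.
    """
    if not target_ca:
        raise ValueError("target structure has no C-alpha atoms")
    residue_id = min(target_ca,
                     key=lambda rid: math.dist(reference_xyz, target_ca[rid]))
    return residue_id, math.dist(reference_xyz, target_ca[residue_id])


def map_coordinating_residues(reference_residues, target_ca, calcium_xyz):
    """For each reference residue, find the nearest target residue and its
    distance to the calcium ion of the reference structure."""
    mapping = {}
    for ref_id, ref_xyz in reference_residues.items():
        target_id, _ = nearest_residue(ref_xyz, target_ca)
        mapping[ref_id] = (target_id, math.dist(target_ca[target_id], calcium_xyz))
    return mapping


# Hypothetical, pre-superimposed coordinates in angstroms.
reference_residues = {"ref_res_1": (0.0, 0.0, 0.0), "ref_res_2": (10.0, 0.0, 0.0)}
target_ca = {
    "tgt_res_5": (1.0, 0.0, 0.0),
    "tgt_res_9": (10.0, 3.0, 0.0),
    "tgt_res_20": (50.0, 50.0, 50.0),
}
calcium_xyz = (4.0, 4.0, 0.0)
mapping = map_coordinating_residues(reference_residues, target_ca, calcium_xyz)
```

A target residue identified this way is only a candidate: spatial proximity after superposition does not by itself establish that the side chain can coordinate calcium.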
Acknowledgments
We thank J. F. Banfield for commenting on the manuscript, Z. Hua for providing computational resources, and X. Shen and H. Shi for helpful discussion. We thank A. Babaian and the Logan team for guiding the use of the Logan search. We thank L. Lui for providing the quality-controlled Nanopore sequences. We thank the SuperComputing Center at the University of Science and Technology of China for its support of ColabFold analyses.
Funding:
This work was supported by the Research Program of the University of Science and Technology of China KY2400000036 (L.C.), US Department of Energy Office of Science user facilities (DE-AC02-05CH11231) (A.P.C.), National Institutes of Health (NIH, grant 1U01DE034196-01) (A.P.C.), São Paulo Research Foundation (FAPESP, grant 2021/10577-0) (A.P.C.), and National Institutes of Health (E.V.K.).
Author contributions:
Conceptualization: L.C. and E.V.K. Methodology: L.C., A.P.C., Y.Q., and E.V.K. Formal analyses: L.C. Investigation: L.C., Y.Q., and E.V.K. Data curation: L.C. Visualization: L.C. and Y.Q. Supervision: L.C. Writing–original draft: L.C. and Y.Q. Writing–review and editing: L.C., A.P.C., Y.Q., E.V.K., H.W., Y.Z., Y.D., and H.L. Project administration: L.C. Funding acquisition: L.C., A.P.C., and E.V.K.
Competing interests:
The authors declare that they have no competing interests.
Data and materials availability:
The genome sequences of the HPGC, the CRISPR-Cas spacer sequences used for bacterial host prediction of Jug phages, and the curated genome sequences of Jug phages are available at https://figshare.com/projects/Jug_phages_in_the_gut_of_animals/254252. All data and code needed to evaluate and reproduce the conclusions in the paper are present in the paper and/or the Supplementary Materials.
Supplementary Materials
The PDF file includes:
Figs. S1 to S31
Legends for tables S1 to S24
Other Supplementary Material for this manuscript includes the following:
Tables S1 to S24
REFERENCES
- 1. Camargo A. P., Nayfach S., Chen I.-M. A., Palaniappan K., Ratner A., Chu K., Ritter S. J., Reddy T. B. K., Mukherjee S., Schulz F., Call L., Neches R. Y., Woyke T., Ivanova N. N., Eloe-Fadrosh E. A., Kyrpides N. C., Roux S., IMG/VR v4: An expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 51, D733–D743 (2023).
- 2. Camarillo-Guerrero L. F., Almeida A., Rangel-Pineros G., Finn R. D., Lawley T. D., Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e9 (2021).
- 3. Nayfach S., Páez-Espino D., Call L., Low S. J., Sberro H., Ivanova N. N., Proal A. D., Fischbach M. A., Bhatt A. S., Hugenholtz P., Kyrpides N. C., Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).
- 4. Minot S., Bryson A., Chehoud C., Wu G. D., Lewis J. D., Bushman F. D., Rapid evolution of the human gut virome. Proc. Natl. Acad. Sci. U.S.A. 110, 12450–12455 (2013).
- 5. Manrique P., Bolduc B., Walk S. T., van der Oost J., de Vos W. M., Young M. J., Healthy human gut phageome. Proc. Natl. Acad. Sci. U.S.A. 113, 10400–10405 (2016).
- 6. Dutilh B. E., Cassman N., McNair K., Sanchez S. E., Silva G. G. Z., Boling L., Barr J. J., Speth D. R., Seguritan V., Aziz R. K., Felts B., Dinsdale E. A., Mokili J. L., Edwards R. A., A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat. Commun. 5, 4498 (2014).
- 7. Edwards R. A., Vega A. A., Norman H. M., Ohaeri M., Levi K., Dinsdale E. A., Cinek O., Aziz R. K., McNair K., Barr J. J., Bibby K., Brouns S. J. J., Cazares A., de Jonge P. A., Desnues C., Muñoz S. L. D., Fineran P. C., Kurilshikov A., Lavigne R., Mazankova K., McCarthy D. T., Nobrega F. L., Muñoz A. R., Tapia G., Trefault N., Tyakht A. V., Vinuesa P., Wagemans J., Zhernakova A., Aarestrup F. M., Ahmadov G., Alassaf A., Anton J., Asangba A., Billings E. K., Cantu V. A., Carlton J. M., Cazares D., Cho G.-S., Condeff T., Cortés P., Cranfield M., Cuevas D. A., De la Iglesia R., Decewicz P., Doane M. P., Dominy N. J., Dziewit L., Elwasila B. M., Eren A. M., Franz C., Fu J., Garcia-Aljaro C., Ghedin E., Gulino K. M., Haggerty J. M., Head S. R., Hendriksen R. S., Hill C., Hyöty H., Ilina E. N., Irwin M. T., Jeffries T. C., Jofre J., Junge R. E., Kelley S. T., Mirzaei M. K., Kowalewski M., Kumaresan D., Leigh S. R., Lipson D., Lisitsyna E. S., Llagostera M., Maritz J. M., Marr L. C., McCann A., Molshanski-Mor S., Monteiro S., Moreira-Grez B., Morris M., Mugisha L., Muniesa M., Neve H., Nguyen N.-P., Nigro O. D., Nilsson A. S., O’Connell T., Odeh R., Oliver A., Piuri M., Ii A. J. P., Qimron U., Quan Z.-X., Rainetova P., Ramírez-Rojas A., Raya R., Reasor K., Rice G. A. O., Rossi A., Santos R., Shimashita J., Stachler E. N., Stene L. C., Strain R., Stumpf R., Torres P. J., Twaddle A., Ibekwe M. U., Villagra N., Wandro S., White B., Whiteley A., Whiteson K. L., Wijmenga C., Zambrano M. M., Zschach H., Dutilh B. E., Global phylogeography and ancient evolution of the widespread human gut virus crAssphage. Nat. Microbiol. 4, 1727–1736 (2019).
- 8. Shkoporov A. N., Khokhlova E. V., Fitzgerald C. B., Stockdale S. R., Draper L. A., Ross R. P., Hill C., ΦCrAss001 represents the most abundant bacteriophage family in the human gut and infects Bacteroides intestinalis. Nat. Commun. 9, 4781 (2018).
- 9. Yutin N., Makarova K. S., Gussow A. B., Krupovic M., Segall A., Edwards R. A., Koonin E. V., Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut. Nat. Microbiol. 3, 38–46 (2018).
- 10. Guerin E., Shkoporov A., Stockdale S. R., Clooney A. G., Ryan F. J., Sutton T. D. S., Draper L. A., Gonzalez-Tortuero E., Ross R. P., Hill C., Biology and taxonomy of crAss-like bacteriophages, the most abundant virus in the human gut. Cell Host Microbe 24, 653–664.e6 (2018).
- 11. Yutin N., Benler S., Shmakov S. A., Wolf Y. I., Tolstoy I., Rayko M., Antipov D., Pevzner P. A., Koonin E. V., Analysis of metagenome-assembled viral genomes from the human gut reveals diverse putative CrAss-like phages with unique genomic features. Nat. Commun. 12, 1–11 (2021).
- 12. Peters S. L., Borges A. L., Giannone R. J., Morowitz M. J., Banfield J. F., Hettich R. L., Experimental validation that human microbiome phages use alternative genetic coding. Nat. Commun. 13, 5710 (2022).
- 13. Shkoporov A. N., Clooney A. G., Sutton T. D. S., Ryan F. J., Daly K. M., Nolan J. A., McDonnell S. A., Khokhlova E. V., Draper L. A., Forde A., Guerin E., Velayudhan V., Ross R. P., Hill C., The human gut virome is highly diverse, stable, and individual specific. Cell Host Microbe 26, 527–541.e5 (2019).
- 14. Shah S. A., Deng L., Thorsen J., Pedersen A. G., Dion M. B., Castro-Mejía J. L., Silins R., Romme F. O., Sausset R., Jessen L. E., Ndela E. O., Hjelmsø M., Rasmussen M. A., Redgwell T. A., Leal Rodríguez C., Vestergaard G., Zhang Y., Chawes B., Bønnelykke K., Sørensen S. J., Bisgaard H., Enault F., Stokholm J., Moineau S., Petit M.-A., Nielsen D. S., Expanding known viral diversity in the healthy infant gut. Nat. Microbiol. 8, 986–998 (2023).
- 15. Lou Y. C., Chen L., Borges A. L., West-Roberts J., Firek B. A., Morowitz M. J., Banfield J. F., Infant gut DNA bacteriophage strain persistence during the first 3 years of life. Cell Host Microbe 32, 35–47.e6 (2024).
- 16. Tisza M. J., Lloyd R. E., Hoffman K., Smith D. P., Rewers M., Javornik Cregeen S. J., Petrosino J. F., Longitudinal phage-bacteria dynamics in the early life gut microbiome. Nat. Microbiol. 10, 420–430 (2025).
- 17. Tisza M. J., Buck C. B., A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proc. Natl. Acad. Sci. U.S.A. 118, e2023202118 (2021).
- 18. Gogokhia L., Buhrke K., Bell R., Hoffman B., Brown D. G., Hanke-Gogokhia C., Ajami N. J., Wong M. C., Ghazaryan A., Valentine J. F., Porter N., Martens E., O’Connell R., Jacob V., Scherl E., Crawford C., Stephens W. Z., Casjens S. R., Longman R. S., Round J. L., Expansion of bacteriophages is linked to aggravated intestinal inflammation and colitis. Cell Host Microbe 25, 285–299.e8 (2019).
- 19. Reyes A., Blanton L. V., Cao S., Zhao G., Manary M., Trehan I., Smith M. I., Wang D., Virgin H. W., Rohwer F., Gordon J. I., Gut DNA viromes of Malawian twins discordant for severe acute malnutrition. Proc. Natl. Acad. Sci. U.S.A. 112, 11941–11946 (2015).
- 20. de Jonge P. A., Wortelboer K., Scheithauer T. P. M., van den Born B.-J. H., Zwinderman A. H., Nobrega F. L., Dutilh B. E., Nieuwdorp M., Herrema H., Gut virome profiling identifies a widespread bacteriophage family associated with metabolic syndrome. Nat. Commun. 13, 3594 (2022).
- 21. Al-Shayeb B., Sachdeva R., Chen L.-X., Ward F., Munk P., Devoto A., Castelle C. J., Olm M. R., Bouma-Gregson K., Amano Y., He C., Méheust R., Brooks B., Thomas A., Lavy A., Matheus-Carnevali P., Sun C., Goltsman D. S. A., Borton M. A., Sharrar A., Jaffe A. L., Nelson T. C., Kantor R., Keren R., Lane K. R., Farag I. F., Lei S., Finstad K., Amundson R., Anantharaman K., Zhou J., Probst A. J., Power M. E., Tringe S. G., Li W.-J., Wrighton K., Harrison S., Morowitz M., Relman D. A., Doudna J. A., Lehours A.-C., Warren L., Cate J. H. D., Santini J. M., Banfield J. F., Clades of huge phages from across Earth’s ecosystems. Nature 578, 425–431 (2020).
- 22. Yuan Y., Gao M., Jumbo bacteriophages: An overview. Front. Microbiol. 8, 403 (2017).
- 23. L. Chen, Discovery and analysis of an 841 kbp phage genome: The largest known to date, bioRxiv 633092 [Preprint] (2025). 10.1101/2025.01.14.633092.
- 24. Chen L.-X., Méheust R., Crits-Christoph A., McMahon K. D., Nelson T. C., Slater G. F., Warren L. A., Banfield J. F., Large freshwater phages with the potential to augment aerobic methane oxidation. Nat. Microbiol. 5, 1504–1515 (2020).
- 25. Pausch P., Al-Shayeb B., Bisom-Rapp E., Tsuchida C. A., Li Z., Cress B. F., Knott G. J., Jacobsen S. E., Banfield J. F., Doudna J. A., CRISPR-CasΦ from huge phages is a hypercompact genome editor. Science 369, 333–337 (2020).
- 26. Mendoza S. D., Nieweglowska E. S., Govindarajan S., Leon L. M., Berry J. D., Tiwari A., Chaikeeratisak V., Pogliano J., Agard D. A., Bondy-Denomy J., A bacteriophage nucleus-like compartment shields DNA from CRISPR nucleases. Nature 577, 244–248 (2020).
- 27. Malone L. M., Warring S. L., Jackson S. A., Warnecke C., Gardner P. P., Gumy L. F., Fineran P. C., A jumbo phage that forms a nucleus-like structure evades CRISPR-Cas DNA targeting but is vulnerable to type III RNA-based immunity. Nat. Microbiol. 5, 48–55 (2020).
- 28. Devoto A. E., Santini J. M., Olm M. R., Anantharaman K., Munk P., Tung J., Archie E. A., Turnbaugh P. J., Seed K. D., Blekhman R., Aarestrup F. M., Thomas B. C., Banfield J. F., Megaphages infect Prevotella and variants are widespread in gut microbiomes. Nat. Microbiol. 4, 693–700 (2019).
- 29. Crisci M. A., Chen L.-X., Devoto A. E., Borges A. L., Bordin N., Sachdeva R., Tett A., Sharrar A. M., Segata N., Debenedetti F., Bailey M., Burt R., Wood R. M., Rowden L. J., Corsini P. M., van Winden S., Holmes M. A., Lei S., Banfield J. F., Santini J. M., Closely related Lak megaphages replicate in the microbiomes of diverse animals. iScience 24, 102875 (2021).
- 30. Shiryev S. A., Agarwala R., Indexing and searching petabase-scale nucleotide resources. Nat. Methods 21, 994–1002 (2024).
- 31. Einvik C., Nielsen H., Westhof E., Michel F., Johansen S., Group I-like ribozymes with a novel core organization perform obligate sequential hydrolytic cleavages at two processing sites. RNA 4, 530–541 (1998).
- 32. Abu-Ali G. S., Mehta R. S., Lloyd-Price J., Mallick H., Branck T., Ivey K. L., Drew D. A., DuLong C., Rimm E., Izard J., Chan A. T., Huttenhower C., Metatranscriptome of human faecal microbial communities in a cohort of adult men. Nat. Microbiol. 3, 356–366 (2018).
- 33. R. Chikhi, B. Raffestin, A. Korobeynikov, R. Edgar, A. Babaian, Logan: Planetary-scale genome assembly surveys life’s diversity, bioRxiv 605881 [Preprint] (2024). 10.1101/2024.07.30.605881.
- 34. Smith B. J., Piceno Y., Zydek M., Zhang B., Syriani L. A., Terdiman J. P., Kassam Z., Ma A., Lynch S. V., Pollard K. S., El-Nachef N., Strain-resolved analysis in a randomized trial of antibiotic pretreatment and maintenance dose delivery mode with fecal microbiota transplant for ulcerative colitis. Sci. Rep. 12, 5517 (2022).
- 35. Delannoy-Bruno O., Desai C., Raman A. S., Chen R. Y., Hibberd M. C., Cheng J., Han N., Castillo J. J., Couture G., Lebrilla C. B., Barve R. A., Lombard V., Henrissat B., Leyn S. A., Rodionov D. A., Osterman A. L., Hayashi D. K., Meynier A., Vinoy S., Kirbach K., Wilmot T., Heath A. C., Klein S., Barratt M. J., Gordon J. I., Evaluating microbiome-directed fibre snacks in gnotobiotic mice and humans. Nature 595, 91–95 (2021).
- 36. L. Darwiche, J. Goff, Rethinking the primary functions of the “tellurium resistance genes,” bioRxiv 1612.v1 [Preprint] (2025). 10.20944/preprints202504.1612.v1.
- 37. A. K. Campbell, Fundamentals of Intracellular Calcium (John Wiley & Sons, 2017).
- 38. Bonza M. C., Martin H., Kang M., Lewis G., Greiner T., Giacometti S., Van Etten J. L., De Michelis M. I., Thiel G., Moroni A., A functional calcium-transporting ATPase encoded by chlorella viruses. J. Gen. Virol. 91, 2620–2629 (2010).
- 39. Pires D. P., Costa A. R., Pinto G., Meneses L., Azeredo J., Current challenges and future opportunities of phage therapy. FEMS Microbiol. Rev. 44, 684–700 (2020).
- 40. Ren J., Ahlgren N. A., Lu Y. Y., Fuhrman J. A., Sun F., VirFinder: A novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 69 (2017).
- 41. Roux S., Enault F., Hurwitz B. L., Sullivan M. B., VirSorter: Mining viral signal from microbial genomic data. PeerJ 3, e985 (2015).
- 42. Kieft K., Zhou Z., Anantharaman K., VIBRANT: Automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 90 (2020).
- 43. Nayfach S., Camargo A. P., Schulz F., Eloe-Fadrosh E., Roux S., Kyrpides N. C., CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2020).
- 44. Camargo A. P., Roux S., Schulz F., Babinski M., Xu Y., Hu B., Chain P. S. G., Nayfach S., Kyrpides N. C., Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2023).
- 45. Chen L., Banfield J. F., COBRA improves the completeness and contiguity of viral genomes assembled from metagenomes. Nat. Microbiol. 9, 737–750 (2024).
- 46. Li S., Guo R., Zhang Y., Li P., Chen F., Wang X., Li J., Jie Z., Lv Q., Jin H., Wang G., Yan Q., A catalog of 48,425 nonredundant viruses from oral metagenomes expands the horizon of the human oral virome. iScience 25, 104418 (2022).
- 47. R. C. Edgar, J. Taylor, T. Altman, P. Barbera, D. Meleshko, D. Lohr, G. Novakovsky, B. Buchfink, B. Al-Shayeb, J. F. Banfield, M. de la Peña, A. Korobeynikov, R. Chikhi, A. Babaian, Petabase-scale sequence alignment catalyses viral discovery. bioRxiv 241729 [Preprint] (2020). 10.1101/2020.08.07.241729.
- 48. Bradley P., den Bakker H. C., Rocha E. P. C., McVean G., Iqbal Z., Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol. 37, 152–159 (2019).
- 49. Heidrich V., Valles-Colomer M., Segata N., Human microbiome acquisition and transmission. Nat. Rev. Microbiol. 23, 568–584 (2025).
- 50. Lui L. M., Nielsen T. N., Decomposing a San Francisco estuary microbiome using long-read metagenomics reveals species- and strain-level dominance from picoeukaryotes to viruses. mSystems 9, e0024224 (2024).
- 51. Guo J., Bolduc B., Zayed A. A., Varsani A., Dominguez-Huerta G., Delmont T. O., Pratama A. A., Gazitúa M. C., Vik D., Sullivan M. B., Roux S., VirSorter2: A multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).
- 52. J. Guo, D. Vik, A. Adjie Pratama, S. Roux, M. Sullivan, Viral sequence identification SOP with VirSorter2 v2, ZappyLab, Inc. (2021). 10.17504/protocols.io.btv8nn9w.
- 53. Nurk S., Meleshko D., Korobeynikov A., Pevzner P. A., metaSPAdes: A new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
- 54. Chen L.-X., Anantharaman K., Shaiber A., Eren A. M., Banfield J. F., Accurate and complete genomes from metagenomes. Genome Res. 30, 315–333 (2020).
- 55. Hyatt D., Chen G.-L., Locascio P. F., Land M. L., Larimer F. W., Hauser L. J., Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
- 56. Potter S. C., Luciani A., Eddy S. R., Park Y., Lopez R., Finn R. D., HMMER web server: 2018 update. Nucleic Acids Res. 46, W200–W204 (2018).
- 57. Finn R. D., Coggill P., Eberhardt R. Y., Eddy S. R., Mistry J., Mitchell A. L., Potter S. C., Punta M., Qureshi M., Sangrador-Vegas A., Salazar G. A., Tate J., Bateman A., The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
- 58. Lewis T. E., Sillitoe I., Lees J. G., cath-resolve-hits: A new tool that resolves domain matches suspiciously quickly. Bioinformatics 35, 1766–1767 (2019).
- 59. Kim G., Lee S., Karin E. L., Kim H., Moriwaki Y., Ovchinnikov S., Steinegger M., Mirdita M., Easy and accurate protein structure prediction using ColabFold. Nat. Protoc. 20, 620–642 (2024).
- 60. Berman H. M., Westbrook J., Feng Z., Gilliland G., Bhat T. N., Weissig H., Shindyalov I. N., Bourne P. E., The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
- 61. Kim R. S., Karin E. L., Mirdita M., Chikhi R., Steinegger M., BFVD - A large repository of predicted viral protein structures. Nucleic Acids Res. 53, D340–D347 (2024).
- 62. van Kempen M., Kim S. S., Tumescheit C., Mirdita M., Lee J., Gilchrist C. L. M., Söding J., Steinegger M., Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
- 63. Sellés Vidal L., Noma T., Yokobayashi Y., Accurate, comprehensive database of group I introns and their homing endonucleases. Bioinform. Adv. 5, vbaf020 (2025).
- 64. Nawrocki E. P., Eddy S. R., Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
- 65. Li W., Godzik A., Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
- 66. Edgar R. C., MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
- 67. Capella-Gutiérrez S., Silla-Martínez J. M., Gabaldón T., trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
- 68. Minh B. Q., Schmidt H. A., Chernomor O., Schrempf D., Woodhams M. D., von Haeseler A., Lanfear R., IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
- 69. Roux S., Camargo A. P., Coutinho F. H., Dabdoub S. M., Dutilh B. E., Nayfach S., Tritt A., iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol. 21, e3002083 (2023).
- 70. Edgar R. C., PILER-CR: Fast and accurate identification of CRISPR repeats. BMC Bioinformatics 8, 18 (2007).
- 71. Haft D. H., Selengut J. D., White O., The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373 (2003).
- 72. Boutet E., Lieberherr D., Tognolli M., Schneider M., Bansal P., Bridge A. J., Poux S., Bougueleret L., Xenarios I., UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: How to use the entry view. Methods Mol. Biol. 1374, 23–54 (2016).
- 73. Steinegger M., Söding J., MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
- 74. Langmead B., Salzberg S. L., Fast gapped-read alignment with Bowtie2. Nat. Methods 9, 357–359 (2012).
- 75. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup, The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
- 76. Aramaki T., Blanc-Mathieu R., Endo H., Ohkubo K., Kanehisa M., Goto S., Ogata H., KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2020).
- 77. Parks D. H., Chuvochina M., Rinke C., Mussig A. J., Chaumeil P.-A., Hugenholtz P., GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).
- 78. Chen S., Zhou Y., Chen Y., Gu J., fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
- 79. Shindyalov I. N., Bourne P. E., Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739–747 (1998).
- 80. Fogarty E. C., Schechter M. S., Lolans K., Sheahan M. L., Veseli I., Moore R. M., Kiefl E., Moody T., Rice P. A., Yu M. K., Mimee M., Chang E. B., Ruscheweyh H.-J., Sunagawa S., Mclellan S. L., Willis A. D., Comstock L. E., Eren A. M., A cryptic plasmid is among the most numerous genetic elements in the human gut. Cell 187, 1206–1222.e16 (2024).