npj Viruses. 2025 Sep 30;3:71. doi: 10.1038/s44298-025-00150-9

Unlocking the genomic repertoire of a cultivated megaphage

Andra Buchan 1,2, Stephanie Wiedman 1,2, Kevin Lambirth 1, Madeline Bellanger-Perry 1,2, Jose L Figueroa III 1,2, Elena T Wright 3, Patil Shivprasad Suresh 4,5, Qibin Zhang 4,5, Julie A Thomas 6, Philip Serwer 3, Richard Allen White III 1,2,
PMCID: PMC12484599  PMID: 41028902

Abstract

Megaphages are bacteriophages (i.e., phages) with exceptionally large genomes that have been discovered computationally across the globe. To date, all have evaded cultivation except phage G, which we examined with multiomics and artificial intelligence (AI) to resolve its 50-year cultivation history. Phage G is one of the largest phages, with a size of >0.6 µm, about half the width of the host cell, and a 499 kbp, non-permuted, linear genome that has, uniquely among known phages, two pairs of ends. Phage G has >650 protein-coding open reading frames (ORFs), >65% of which encode hypothetical proteins, along with an expansive repertoire of auxiliary metabolic genes (AMGs) acquired from its bacterial host, genes that manipulate host sporulation (sspD, RsfA, spoK), and antiviral escape genes (e.g., anti-CBASS nuclease and anti-Pycsar protein). Our study represents a doorway into the complexity of the genomic repertoire of the only cultivated megaphage.

Subject terms: Bacteriophages, Phage biology, Virology

Introduction

Phages are ubiquitous, cosmopolitan, and present within all of Earth’s diverse ecosystems (e.g., hot springs, soils, animal guts, wastewater, microbialites, and marine ecosystems), but the majority have genomes that are less than 200 kbp [1–9]. Megaphages (i.e., phages with genomes 500 kbp or larger) have been detected metagenomically across diverse ecosystems, including animal guts, wastewater, and oceans [1,4–6,9]. Only one megaphage, phage G, has been physically isolated and cultivated, in 1968 [10–13] (Fig. 1). A draft genome was released to GenBank by Roger Hendrix et al. at the University of Pittsburgh in 2012 (GenBank ID: NC_023719.1, Fig. 1). Megaphages found in the gastrointestinal tract of animals, including humans, are commonly known as Lak phages [4]. The largest phage genome discovered from metagenomics was 735 kbp, recovered from Lac Pavin, a freshwater meromictic crater lake in France [1]. Another megaphage recently found via metagenomics was Mar_Mega_1 from marine waters within Plymouth Sound, UK [6]. However, in 50 years, no megaphage other than phage G has been cultivated.

Fig. 1. The cultivation history of phage G.

The timeline traces the cultivation history of phage G from its isolation in the Donelli lab (1968), through eight other laboratories, to the present day, a span of five decades.

The history of phage G starts in the late 1960s at Gianfranco Donelli’s laboratory at the Istituto di Fisica in Rome [11–13] (Fig. 1). The host for phage G was recently revised to Lysinibacillus [14]. The original environmental origin of phage G is unknown. Phage G was transferred to the laboratory of W.F. Fangman at the University of Washington, Seattle, WA, in the 1970s, which sent phage G to the laboratory of Philip Serwer (Fig. 1). Unknown until 2020, the Félix d’Hérelle Reference Center for Bacterial Viruses at the Université Laval, where Hans-Wolfgang Ackermann personally imaged over 5500 phages, used phage G as a reference for transmission electron microscopy (TEM). Hans-Wolfgang Ackermann donated it posthumously to the American Type Culture Collection (ATCC) in 2020. Ackermann’s laboratory had also received phage G from the Fangman laboratory, likely in the 1980s or 1990s [15] (Fig. 1). As mentioned above, the Serwer laboratory sent R.W. Hendrix (RWH) the wild-type strain (which we call here UT_WT, University of Texas - Wild Type) for genome sequencing in approximately 2001. We call the strain sequenced by RWH the NCBI wild-type (NCBI_WT), derived from UT_WT (Fig. 1). RWH noticed that, at some point in its propagation, NCBI_WT was able to grow well in liquid media, unlike the original UT_WT strain, suggesting that a spontaneous mutation had occurred. The Serwer laboratory deliberately isolated a clearer plaque mutant (i.e., UT_MUT) (Fig. 1). The Serwer laboratory sent the original UT_WT to Lindsey Black and Julie A. Thomas in 2010 (Fig. 1). Julie Thomas took this strain with her to Rochester Institute of Technology in 2014 (Fig. 1). R.A. White III purchased the ATCC strain (ATCC_WT) in 2020 but, through cultivation, noticed a larger plaque morphology (RAW_WT; Fig. 1). Thus, these changes in phenotype over 50 years, which may be related to single nucleotide polymorphisms (SNPs), insertions/deletions, and/or methylation status, warranted sequencing all the strains present in the various laboratories.

Phage G DNA has been anomalous in that (1) pulsed-field gel electrophoresis (PFGE) indicated a length of 630 to 750 kbp [16–18], whereas (2) the NCBI_WT genome is only 497 kbp by DNA sequencing. Possibly, the difference was caused by a permuted genome with exceptionally long terminal repeated sequences (LTRs). In addition, packaging studies have shown that phage G can package 626 kbp within its capsid head [10–14,19]. The LTRs were not in the NCBI_WT draft genome of phage G. Another possibility is that modifications of G DNA (e.g., methylation, glycosylation, or other unknown modifications) distorted the PFGE results. Phage G DNA may also be lethal in E. coli upon cloning, making it challenging to resolve LTRs. In summary, the genome sizes inferred from DNA packaging and PFGE disagree with the size observed in the draft genome (NCBI_WT), which also lacks the LTRs.

Here, we show that the phage G genome is not permuted and that the above anomaly is the result of genomic DNA derivatization. In addition, the non-permuted character is unique in that two pairs of otherwise unique DNA ends exist and both pairs have the same terminal repeat. We utilized multiomics to resolve the genome, methylome, and proteome via high-throughput Oxford Nanopore long-read sequencing and particle proteomics by liquid chromatography-tandem mass spectrometry (LC-MS/MS). Our study describes in detail the functional repertoire of phage G across five decades of continuous cultivation.

Results

Genome features

Phage G has a Myovirus morphology (i.e., classical T4-like morphology) that is ~630 nm from head-to-tail (i.e., 450 nm tail and 180 nm head with 5-fold vertices), and it has a genome that is 3× larger than that of the representative Myovirus Escherichia phage T4 [14]. The phage G genomes RAW_WT, UT_MUT, UT_WT, RIT_WT, and ATCC_WT are ~499 kbp, have 668-669 protein-coding ORFs, and are AT-rich, with a GC content of ~29.9% (Table 1; Figs. 2, 3). The annotation predicts that 66% of the genome’s ORFs are hypothetical (i.e., without any known homologs or predicted functions), with a 35 kbp stretch of DNA between positions 291377-327020 containing 66 predicted hypothetical ORFs (Supplementary Table 1; Fig. 3). Approximately 40% of the phage G genome is split between nucleic acid replication, structural genes (e.g., capsid, tail, portal), and auxiliary metabolic genes (AMGs) (Figs. 2, 3).
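As an illustration of the two sequence statistics referred to here and in Fig. 3 (GC content and GC skew), the following minimal Python sketch computes both for a whole sequence or for sliding windows. It is not the pipeline used in this study; the function names and the toy sequence are invented for illustration, and GC skew is taken in its usual form, (G − C)/(G + C).

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C); 0.0 when the sequence contains no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed(seq: str, window: int, step: int, fn):
    """Apply fn to successive windows of seq; returns (start, value) pairs."""
    return [(i, fn(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    # Toy sequence invented for illustration; not phage G data.
    toy = "ATATGCATATATGGATATAT"
    print(f"GC content: {gc_content(toy) * 100:.2f}%")
    print(f"GC skew:    {gc_skew(toy):.2f}")
    print(windowed(toy, window=10, step=5, fn=gc_content))
```

Applied to a full genome sequence, `gc_content` would give the genome-wide percentage reported in Table 1, while `windowed` gives the kind of per-window track typically drawn in a genome plot.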

Table 1.

Genome quality statistics

Genome        Length (bp)  N50/N90 (bp)  Provirus  Gene count  Quality       Completeness (%)  Contamination (%)  GC (%)  Coverage (×)
ATCC_WT       499946       499946        No        668         Complete      100               0                  29.93   535
NCBI_WT       497513       497513        No        663         High-quality  100               0                  29.93   NA
RIT_WT        499819       499819        No        669         High-quality  100               0                  29.93   4690
UT_MUT        499819       499819        No        669         High-quality  100               0                  29.93   553
UT_WT         499819       499819        No        668         High-quality  100               0                  29.93   684
Moose W30-1   419969       419969        No        542         Complete      100               0                  29.72   NA

This table reports assembly statistics (N50/N90), CheckV results, which include MIUViG quality metrics (completeness/contamination), gene count, and GC content (%).
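The genome lengths in Table 1 can be compared directly. The short Python sketch below, offered only as a worked example, computes how many more bases each phage G assembly contains than the NCBI_WT draft. The lengths are copied from Table 1; the function name `extra_bases` is hypothetical.

```python
# Genome lengths in bp, copied from Table 1.
lengths = {
    "ATCC_WT": 499946,
    "NCBI_WT": 497513,
    "RIT_WT": 499819,
    "UT_MUT": 499819,
    "UT_WT": 499819,
}


def extra_bases(lengths: dict, reference: str = "NCBI_WT") -> dict:
    """Length difference (bp) of every genome relative to the reference."""
    ref_len = lengths[reference]
    return {name: n - ref_len for name, n in lengths.items() if name != reference}


delta = extra_bases(lengths)

if __name__ == "__main__":
    for name, diff in sorted(delta.items()):
        print(f"{name}: +{diff} bp relative to NCBI_WT")
```

The differences are 2433 bp for ATCC_WT and 2306 bp for RIT_WT, UT_MUT, and UT_WT.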

Fig. 2. Phage G genomic content and taxonomic placement.

a Violin plots of the GC content across the variants of phage G compared against Moose phage W30-1. b PHROG hierarchical categories for gene content, with all phage G gene category hierarchies compared to Moose phage W30-1. c A vConTACT2.0 plot for taxonomic placement of phage G based on protein/gene content clustering. Moose phage W30-1 failed to be placed.

Fig. 3. Genomic annotation plot with phylogenetic trees of phage G.

a Genome plot of phage G. This genomic plot of phage G is based on ATCC_WT/RAW_WT. This includes the PHROG CDS annotations, GC content, GC skew, tRNAs/tmRNAs/CRISPRs (phage G has no CRISPRs), and the gene names. b Large terminase subunit (TerL) (ps583/gp1) maximum likelihood tree with 1000 bootstraps. Escherichia phage T4 provided the outgroup in this unrooted tree. c Major Capsid/Head protein (mcp) (ps609/gp27) maximum likelihood tree with 1000 bootstraps. Escherichia phage HK97 provided the outgroup in this unrooted tree. AlphaFold2 protein predictions are showcased for various representative clades of TerL and mcp.

We applied a network-based taxonomic framework to resolve phage G’s relationship to a myriad of uncultivated phages of bacteria and archaea via vConTACT3.0 using the INPHARED database [20,21]. All our phage G variants (e.g., RIT_WT, NCBI_WT, etc.) cluster together, overlapping one another, within a central cluster of our vConTACT3.0 network (Fig. 2). The phage G cluster also has distant relatives that are Staphylococcus, Bacillus, Lactococcus, and Listeria phages (Fig. 2).

We used PhageAI to predict the taxonomy and lifestyle of phage G [22]. Phages with a lytic lifestyle replicate their DNA through the production of infectious virion particles, while those with a temperate lifestyle can enter either a lytic cycle or a lysogenic cycle, in which they replicate DNA within the host, often using host replication machinery, without the production of infectious virion particles [8]. Phage G has a temperate lifestyle (based on PhageAI at 96.45%), meaning it can enter both lytic and lysogenic lifestyles. The genome has genes whose products are similar to those known to be involved in lysogeny-associated integration and excision, including the ps548/gp662 transposase (Supplementary Table 1). The genome also has genes for transcriptional regulation (Supplementary Table 1; Fig. 3a). TaxMyPHAGE provided the complete International Committee on Taxonomy of Viruses (ICTV) taxonomy; the ICTV recently assigned phage G to Donellivirus gee [23] (Supplementary Table 2). We will use the original name (phage G) for the rest of this paper.

The draft genome was published by RWH in 2012. CheckV rates the NCBI_WT draft genome as high quality [24] (Table 1). However, it is incomplete due to the missing LTRs (Table 1). The NCBI_WT draft was also missing ~2 kbp at the 3’ terminal end of the genome when directly compared to RAW_WT, UT_MUT, UT_WT, RIT_WT, and ATCC_WT (Fig. 4). All phage G genomes were 99.9% similar to each other based on average nucleotide identity (ANI) (Fig. 4a). All our genomes (i.e., RAW_WT, UT_MUT, UT_WT, RIT_WT, and ATCC_WT) are at least 499 kbp, suggesting that the NCBI_WT strain from RWH either lost ~2 kbp or that this region was missed in genome sequencing (Table 1; Fig. 4b).

Fig. 4. Whole-genome comparison of phage G variants vs. Moose phage W30-1.

a Average nucleotide identity (ANI) plot computed using fastANI. b Whole-genome comparison plot of NCBI_WT vs. ATCC_WT.

We performed whole-genome alignment using MUMmer [25] and found conserved SNPs across all phage G genomes. We sequenced ATCC_WT, RIT_WT, UT_MUT, and UT_WT to at least 500× coverage (Table 1). The conserved SNPs include positions 284289 - G/A, 394072 - T/A, and 395353 - G/A, which are within hypothetical ORFs (ps330/gp428 and ps512/gp625) and a non-coding region (position 394072) (Supplementary Table 3). RIT_WT generally had more SNPs than the other compared variants, including within ps162/gp260, ps206/gp304, multiple in ps485/gp598, within the non-coding region at position 394072, and within ps578/gp690 (Supplementary Table 3). Most SNPs within RIT_WT occur in ps485/gp598, which encodes a hypothetical protein with an unknown functional role (Supplementary Tables 1, 3).
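To make the notion of an SNP call concrete, the sketch below lists substitutions between two rows of an existing pairwise alignment, reporting positions on the ungapped reference. This is a deliberately simplified stand-in and not MUMmer's algorithm; the function name and the toy alignment are invented for illustration.

```python
def list_snps(ref_aln: str, qry_aln: str):
    """Return (ref_position, ref_base, query_base) for each substitution.

    Both inputs must be rows of the same alignment (equal length, '-' for
    gaps). Positions are 1-based coordinates on the ungapped reference.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    snps = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r != "-":
            ref_pos += 1  # advance only when the reference has a base
        if r == "-" or q == "-":
            continue  # insertion or deletion, not a substitution
        if r != q:
            snps.append((ref_pos, r, q))
    return snps


if __name__ == "__main__":
    # Toy alignment invented for illustration; not phage G data.
    print(list_snps("ACGT-ACGT", "ACTTAAC-T"))
```

Indels are skipped on purpose so that only single-base substitutions are reported; a real comparison would also need to handle alignment ends and low-coverage regions.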

In addition, the NCBI_WT genome was artificially rearranged to make the large terminase subunit (terL) the starting gene at gp1, which is not the physical starting point of the genome. Independent de novo assemblies of all five of our variants from long-read nanopore data (Fig. 4b) revealed the artificial rearrangement and showed that the genome is not circular (Supplementary Table 1). Thus, we have re-numbered the ORFs as ps1-668 based on the actual physical location of the genes on the genome (Supplementary Table 1). We used SWORD-based local alignments of the NCBI_WT reference-called ORFs to assign the original gp numbers from NCBI_WT to our annotation [26] (Supplementary Table 1). For example, the terL gene is at position 440572-442254 and is now ps583/gp1, which maintains the actual physical location plus the original gp number to avoid confusion with previous works over 50 years (Supplementary Table 1).
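The renumbering scheme described above can be sketched as follows: ORFs are sorted by their physical start coordinate, labeled ps1..psN, and paired with a legacy gp label when one was matched. This is only an assumed illustration of the naming convention; the `Orf` type and `renumber` function are hypothetical, and apart from the terL coordinates and its gp1 label, the example ORFs and the label gp97 are invented.

```python
from typing import NamedTuple, Optional


class Orf(NamedTuple):
    start: int
    end: int
    legacy_id: Optional[str]  # original gp label, if one was matched


def renumber(orfs):
    """Sort ORFs by genome position and label them ps1..psN[/legacy id]."""
    ordered = sorted(orfs, key=lambda o: (o.start, o.end))
    labels = {}
    for i, orf in enumerate(ordered, start=1):
        label = f"ps{i}"
        if orf.legacy_id:
            label += f"/{orf.legacy_id}"
        labels[(orf.start, orf.end)] = label
    return labels


if __name__ == "__main__":
    toy = [
        Orf(440572, 442254, "gp1"),  # terL coordinates from the text
        Orf(100, 400, None),         # invented ORF with no legacy match
        Orf(500, 900, "gp97"),       # invented ORF and legacy label
    ]
    for coords, label in renumber(toy).items():
        print(coords, label)
```

With only three toy ORFs, terL becomes ps3/gp1; in the full annotation, where 582 ORFs precede it, it is ps583/gp1.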

In addition, we resolved the LTR elements within phage G using ultra-long-read sequencing with Oxford Nanopore. The LTRs are 127 bp in length, with a GC content of 43.31%, which is 13.38 percentage points higher than that of the rest of the genome (29.93%) (Figs. 2, 3; Fig. S1). The 5’ and 3’ LTRs are identical, forming terminal genome redundancy. We compared all five genomes (RAW_WT, UT_MUT, UT_WT, RIT_WT, and ATCC_WT), showing the identity of these residues across the LTRs (Fig. S1).
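A direct terminal repeat of this kind, where the same sequence occurs at the 5’ and 3’ ends, can in principle be found by comparing the start and end of an assembled linear genome. The sketch below searches for the longest exact match; it is an assumed illustration and not the method used in this study. The function name, the length limits, and the toy sequence are invented, and real data would require tolerance for sequencing errors.

```python
def terminal_repeat_length(genome: str, min_len: int = 20, max_len: int = 1000) -> int:
    """Length of the longest exact repeat shared by the 5' start and 3' end.

    Returns 0 if no repeat of at least min_len bases is found.
    """
    genome = genome.upper()
    upper = min(max_len, len(genome) // 2)
    # Try the longest candidate first so the first hit is the longest repeat.
    for k in range(upper, min_len - 1, -1):
        if genome[:k] == genome[-k:]:
            return k
    return 0


if __name__ == "__main__":
    # Toy genome invented for illustration: an 8-base repeat at both ends.
    repeat = "ACGTTGCA"
    toy = repeat + "GGGGGGGGGG" + repeat
    print(terminal_repeat_length(toy, min_len=4))
```

A genome with a repeat at only one end, as described below for Moose phage W30-1, would return 0 under this exact-match test.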

Nearest neighbor comparison

We compared our phage G genomes to fifty megaphages present in public databases, including the marine Mar_Mega_1 [6]. Amongst the genomes listed in ggkbase and by Michniewski et al. [6] was a phage genome labeled Moose phage W30-1 (http://ggkbase.berkeley.edu/organisms/405141). Based on ggkbase metadata, Moose phage W30-1 was isolated from moose rumen on a 0.2 μm filter. At only 419 kbp, it falls short of the 500 kbp megaphage threshold, but it is the closest environmental relative of phage G (Fig. 4a). Moose phage W30-1 is 81% similar to phage G based on ANI (Fig. 4a) and clustered closest to phage G of all megaphages using vConTACT3.0 (Fig. 2). Compared with phage G, Moose phage W30-1 has a slightly lower GC percentage (29.72% vs. 29.93%), 80 kbp less DNA, genes with higher GC content (up to 49%), and only one LTR, at the 5’ end (Table 2; Fig. 2). The LTR at the 5’ end of the Moose phage genome is 99 bp in length and has a GC content of 35.35%; it is not repeated at the 3’ end, suggesting that Moose phage W30-1 has a circular genome rather than a linear one like phage G (Fig. S1). The logo plot suggests a small amount of conservation in the LTRs of phage G and Moose phage W30-1 (Fig. S1). PhageAI suggested that Moose phage W30-1 is a temperate phage, just like phage G, but at a lower percentage of 60.28%, likely because it contains a similar transposase. Despite the >80% ANI between phage G and Moose phage W30-1, both had many unique viral genes. Notable genes found in Moose phage W30-1, but not in phage G, are genes for reverse transcriptase, HicB-like antitoxin, tellurite resistance, cytidylyltransferase, dCTP deaminase, NinI-like serine-threonine phosphatase, NrdD-like anaerobic ribonucleotide reductase, PnuC-like nicotinamide transporter, and a RecJ endonuclease (Fig. 5a). Comparison using MetaCerberus indicated that Moose phage W30-1 was relatively enriched for bacterial motility proteins, flagellar assembly, focal adhesion, glycine/serine/threonine metabolism, glycosaminoglycan binding proteins, methane metabolism, and translation factors [27,28] (Fig. 5b). Pathways of note in phage G include antibiotic biosynthesis, glutathione metabolism, selenocompound metabolism, biofilm formation, and antimicrobial resistance genes (e.g., dihydrofolate reductase, dfrA) (Fig. 5; Fig. S2). The biofilm formation pathway enrichment in phage G relates to spore biosynthesis manipulation, including spore protease (Fig. 5).

Table 2.

Gene open-reading frames with ≥40% GC content

ORF       Annotation               GC content (fraction)
Moose W30-1
ORF357    Hypothetical             0.49
ORF366    Hypothetical             0.48
ORF429    Hypothetical             0.47
Phage G
ps22      T4-like spike protein    0.40
ps98      Tail structural protein  0.41
ps608     Capsid decoration        0.40

Both phage G and Moose phage W30-1 have an average GC content of <30%; however, each encodes three genes that have ≥40% GC content.

Fig. 5. Heatmaps of the most differential genes and pathways comparing phage G variants vs. Moose phage W30-1.

a Gene annotations. b KEGG pathways. Only genes that are missing from at least one phage are shown.

Structural genes

Phage G has the hallmark genes of classical tailed phages, including capsid/head, portal, baseplate, terminase (terL), and other structural virion proteins (Figs. 2, 3). Of the many structural genes we detected, 16 were directly detected with proteomics, including hallmark genes mentioned above (Fig. 6). Phylogenetic analysis of the terL gene suggests that Moose phage W30-1 diverged earlier than all phage G variants (Fig. 3b). Based on the predicted protein structure, the terL of Moose phage W30-1 has an extra loop that is not found in phage G (Fig. 3b). Lak phages, phage T4 (i.e., the outgroup), Mar_Mega_1, and ERR599374 phages occur on different branches when compared to phage G and have predicted terL structural differences (Fig. 3b). We could not detect the terL protein (i.e., ps583/gp1) within our purified phage proteomics, but it has been previously detected with proteomics and cryoEM [14] (Fig. 6).

Fig. 6. Particle proteomics using LC-MS/MS.

a Heatmap comparing UT_MUT vs. ATCC_WT using data-dependent acquisition (DDA) and data-independent acquisition (DIA), based on log2 normalized values. b Coverage of the ORFs detected in either UT_MUT or ATCC_WT using DDA and DIA. c PHROG gene pathway hierarchies in either UT_MUT or ATCC_WT using DDA and DIA.

Using phylogenetics, we confirm the results of González et al. [14], which place the major capsid protein (MCP) (i.e., head with 5-fold vertices) within the HK97 capsid family (Fig. 3c). The phylogenetics of MCP suggests that phage G’s HK97-like capsid diverged earlier than E. coli HK97 and is an outgroup compared to it directly (Fig. 3c). Lak phages and other phages found metagenomically (e.g., GD phages) also have later HK97-style capsids, with E. coli HK97 providing an outgroup for them (Fig. 3c). We have been unable to find the MCP in Moose phage W30-1; it appears highly divergent compared to known MCP HMMs (Supplementary Table 4, Fig. 5a). We also detect ps609/gp27, the mcp, within our proteomic data of purified phage particles (Fig. 5). A novel capsid decoration protein ps608/gp27 was not detected in our proteomic data but has been previously detected with proteomics and cryoEM [14]. The head maturation proteases encoded by phage G ORFs ps159-160/gp257-258 and ps601/gp19 were proteomically detected (Supplementary Table 1, Fig. 6).

Portal and tail proteins are critical components of every tailed phage particle. The portal protein is encoded by ps596/gp14, which has been previously detected by proteomics and cryoEM [14]; we have also found the protein within our proteomic data in both the ATCC/RAW_WT and the UT_MUT (Fig. 6). Based on HHpred and MetaCerberus, the portal is related to lambda phage-like portal proteins (pdb:8k39, Supplementary Table 1). Tail and tail assembly proteins include fifteen annotated ORFs, and seven proteins of the complex tail assembly in phage G were found by LC-MS/MS (Fig. 6). These begin with ps80-86/gp178-183, which include a head-tail connector similar to SPP1 gp17, a contractile tail sheath [27], the two tail tube subunits 1 and 2, a tail assembly chaperone protein, a programmed frameshift tail chaperone protein, and a tape measure protein with a tail lysin domain (Supplementary Table 1). Our proteomic data confirm that ps82-83/gp179-180, tail tube subunits 1 and 2, are expressed (Fig. 6). Of the related tail assembly proteins, ps98/gp195, ps167/gp265, ps267/gp365, ps562/gp673, and the tail fiber protein ps430/gp530 were found within the phage particles’ proteomics (Fig. 6). The tail accessory proteins found by LC-MS/MS include a novel CAZy-related GH113 beta-mannanase lysis protein (i.e., ps562/gp673) that is attached to a tail fiber, and ps167/gp265, an RNA ligase with tail fiber protein attachment catalyst (Fig. 6).

The baseplate proteins connecting to the tail include ps91-93/gp188-190, similar to the P2gpJ/I-like phage and T4 gp8-like baseplate proteins (Supplementary Table 1). The proteomics confirm that ps91-92/gp188-189, the P2gpJ/I-like phage baseplate proteins, are expressed; however, ps93/gp190, the T4 gp8-like baseplate protein, was not found within the proteomics (Fig. 6). HHpred suggested models of Mu gp47/gp48, T4 gp6, and P2 gpJ/P2gpL for ps91/gp188 and ps92/gp189, with further matching via HMMs to VOG33143 for ps91/gp188 (Supplementary Table 1).

DNA replication, repair, and transcription

Genes for nucleic acid metabolism (e.g., endonucleases, exonucleases, and helicases) comprise sixty-nine annotated ORFs, of which 55% were detected within the phage particle by proteomic analysis (Supplementary Table 1; Fig. 6). Phage G encodes four DNA pol III-like DNA polymerases, two of which (ps233/gp331 and ps294/gp392) were found in the particle proteomics (Supplementary Table 1; Fig. 6). Phage G encodes sixteen endonucleases, of which the particle proteomics detected ps60, ps251, ps277, ps462, ps578, and ps604 (Supplementary Table 1; Fig. 6). Five exonucleases for DNA repair are also encoded, four of which were detected by proteomics (Supplementary Table 1; Fig. 6). Protein ps578/gp690 is a 5’-3’ exonuclease Xni/ExoIX (flap endonuclease) with putative dual exo- and endonuclease activity, and ps158/gp256 is a RecB-like homolog (Supplementary Table 1; Fig. 6). Helicases and helicase-associated ORFs (8 ps/gps), including a primase (1 ps/gp), and gyrase subunits A and B (2 subunits) are associated with DNA replication, with ps154/ps156 (gyrase A/B) and ps123/ps125/ps128/ps201 (helicases) present in the proteomics data (Supplementary Table 1; Fig. 6). The helicases ps127-129 are DnaB/C-like replicative helicases with loaders, and ps123/gp221 is uvsW-like but not expressed in the proteomics (Supplementary Table 1; Fig. 6). Other helicases are viral in nature but do not fall within known host-related classification families (Supplementary Table 1; Fig. 6).

Phage G encodes gyrase A and B subunits (ps154/ps156), a uvsX-like recombinase (recA-like) protein (i.e., ps309/gp407), and a ribonucleotide reductase (RNR) class 1b with alpha and beta subunits and an NrdI-like flavoprotein-like regulator protein on ps662-664/gp81-83, with ps82-83 detected by viral particle proteomics (Supplementary Table 1; Fig. 6). Our phylogenetic analysis of the gyrase A and B subunits (ps154/ps156) shows a limited relationship to bacterial gyrases, with both Moose phage W30-1 and phage G being outgroups within our maximum likelihood estimations (Fig. S3). Structurally, the gyrA subunits of Moose phage W30-1 and phage G are very similar, whereas gyrB varies more based on ColabFold v1.5.5 [29] (Fig. S3). Based on our maximum likelihood estimations and MAFFT alignments, the uvsX-like recombinase (recA-like) protein appears more bacterial (Fig. S3). The Moose phage W30-1 uvsX-like recombinase (recA-like) is an outgroup to those of the phage G variants, suggesting that its uvsX-like gene diverged earlier than that of phage G (Fig. S3). The uvsX-like gene in phage G appears to have been horizontally transferred from bacteria based on maximum likelihood, whereas the signal from gyrA and gyrB is not as clear (Fig. S3). RNR has essential roles in DNA replication, synthesizing deoxyribonucleotide triphosphates (dNTPs) from ribonucleotide triphosphates [30]. Encoding its own RNR and packaging it within the phage G particle ensures that viral dNTPs are made.

DNA binding proteins for nucleoside metabolism, stabilization, and replication are components of the phage G genome. The genome encodes a thymidine kinase, two thymidylate kinases, and a thymidylate synthase (Supplementary Table 1). The thymidine kinase (ps40/gp135) and the thymidylate synthase (ps221/gp319) were detected within the proteomic data (Supplementary Table 1; Fig. 6). Three unique DNA binding proteins were found, ps141/gp239, ps248/gp346, and ps279/gp377, all of which were detected by viral particle proteomics (Supplementary Table 1; Fig. 6). ORF ps141/gp239 encodes a bacterial nucleoid DNA-binding protein IHF-alpha, which is histone-like [31] (Supplementary Table 1; Fig. 6). ORF ps248/gp346 encodes a ferritin-like DNA-binding protein often involved in DNA replication [32] (Supplementary Table 1; Fig. 6). ORF ps279/gp377 encodes a single-stranded DNA-binding protein usually involved in DNA replication [33]. Protein ps251/gp349 contains a Holliday junction resolvase RuvABC endonuclease, which was also detected in the proteomics data (Supplementary Table 1; Fig. 6).

We predicted the origin of replication (OriC) and various promoter motifs across the phage G genome, finding two putative OriCs and 2852 putative promoters (Fig. S4; Supplementary Table 5). The first predicted OriC spans nucleotide positions 1-406 of the genome, has 52% A + T content, and has the classical 5’/3’ DnaA boxes, spacer, DnaA trio, ATP-DNA box, AT-rich region, CtrA and Fis binding sites/motifs, and IHF binding site (Fig. S4). The second predicted OriC is ~200 bp downstream of the first predicted OriC at position 611–986 bp and has all the same classical site/motif predictions as the first, with a higher A + T content of 58% (Fig. S4). Of the predicted promoters, 97.5% were host promoters, and <3% were predicted to be phage promoters (Supplementary Table 5). All predicted promoters had a score of >0.50, with 975 scoring >0.85, 720 scoring 0.90 or above, and 60 scoring 1.00 (Supplementary Table 5).

While host-encoded promoters dominate the predicted promoter landscape of phage G’s genome, we focused on high-quality phage-related promoters. We predicted 19 at 0.80 or above and nine at 0.90 or above, with more GC content than the host-related putative promoters (Supplementary Table 5). The phage promoters with the highest scores, rounded to 0.90 or above, are promoters for multiple non-coding regions or lie within coding regions of phage G. Of these nine predicted phage promoters, those at 222643-222665 and 139516-139538 have the highest predicted scores, 0.98 and 0.97 (Supplementary Table 5). The promoter at position 222643-222665 is a short motif, ‘TAATACTACTCACTATATCAGAG’, within ps269/gp367, a hypothetical ORF with no functional annotation. The ORF ps269/gp367 is a Bacillus conserved hypothetical ORF related to phrog_11333 (6.1 × 10^−178) that is often co-localized with phrog_16766, a host chromosome condensation regulator (Supplementary Table 1). The phage promoter at 139516–139538 is a short motif, ‘AAATAACCTTCAATAAGAGGATA’, within a hypothetical protein (ps172/gp270). The region at position 136897-140626 has seven predicted ORFs with no known function; only ps167/gp265, an upstream RNA ligase and tail fiber protein attachment catalyst, and ps175/gp273, a downstream MutT/NUDIX hydrolase, have predicted functions (Supplementary Table 1). While this is a highly confident promoter prediction, what it regulates, and its own function, are unknown.

Genes for transcription and host-related transcriptional regulation are well represented within phage G’s genomic repertoire. Phage G encodes many host transcription factors, including four RsfA family-like factors, three RNA polymerase sigma-70-like factors, a Mu-like gp9 factor, and three factors that are listed as prespore or have no listed protein family relationship (ps223/gp321, ps281/gp379, ps643/gp62) (Supplementary Table 1, Fig. 6). Phage G encodes an antiterminator (i.e., ps469/gp582), which regulates host RNA transcription in other phage models [34]. These antiterminators are often co-located with the terminase small subunit (TerS), which we could not find.

Translation

As with nucleic acid metabolism and replication, phage G controls its translation, including its regulation within the host cell. Phage G encodes 18 tRNAs of its own, with 1 tmRNA predicted (Supplementary Table 6). The tmRNA is encoded at position 352812–353120, with a tag peptide of ‘AKLNITNNELQVA*’, within a non-coding region between ps444-ps445/gp550-gp556, and is 308 bp in length (Supplementary Table 6). The ps444-ps445/gp550-gp551 regions are of unknown function, but ps445/gp556 is predicted to belong to a tryptophan repeat gene family of unknown function. The tRNA genes occur near the beginning of the genome at position 48-1431, which encodes 6 tRNAs, and towards the end of the genome at position 344693-499942 (Supplementary Table 1; Fig. 3). The tRNAs encoded by the genome include Arg:TCT, Asn:GTT, Asp:GTC, Cys:GCA, Gln:TTG, Gly:TCC, His:GTG, Met:CAT, Thr:TGT, Trp:CCA, and Tyr:GTA (Supplementary Table 6; Fig. 3). It has tRNAs with multiple copies, including Glu: TCC with two copies, Phe:GAA with two copies, and multiple alternative tRNAs for serine (Ser:TGA and Ser:GCT) (Supplementary Table 6; Fig. 3).
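The anticodons listed above can be related to the codons they read by taking the reverse complement. The sketch below assumes the anticodons are written 5’ to 3’ in the DNA alphabet, as tRNA prediction tools commonly report them, and it ignores wobble pairing; the function name is invented for illustration.

```python
# Complement table for the DNA alphabet.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codon_for_anticodon(anticodon: str) -> str:
    """Codon (5'->3', DNA alphabet) paired by an anticodon written 5'->3'.

    Assumes strict Watson-Crick pairing at all three positions (no wobble).
    """
    return anticodon.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Anticodons taken from the list in the text.
    for amino_acid, anticodon in [("Met", "CAT"), ("Trp", "CCA"), ("Tyr", "GTA")]:
        print(f"{amino_acid}: anticodon {anticodon} pairs with codon "
              f"{codon_for_anticodon(anticodon)}")
```

For example, the Met anticodon CAT pairs with ATG, the standard start codon, and the Trp anticodon CCA pairs with TGG.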

Eight translation/amino acid metabolism-related genes were detected in the viral particle mass spectrometry data, including ClpP (ps159/gp257), ClpX (ps160/gp258), prolyl-4-hydroxylase 6 (ps183/gp281), and phosphatase (ps339/gp437) (Supplementary Table 1; Fig. 6). The genome encodes five individual proteases, including ClpP and ClpX, a head/capsid maturation protease (ps601/gp19), a membrane protease (ps641/gp60, a YdiL-like protease within the CAAX protease family), and a spore protease (ps121/gp219). Only ClpP and ClpX were detected in the proteomic data; none of the other encoded proteases were detected (Supplementary Table 1; Fig. 6).

Lysis

All phages that can enter a lytic phase of their life cycle, whether temperate or strictly lytic, must escape their host cell via lysis of the cell membrane. Phage G has six genes with known homology to endolysins, holins, and spanins. These lysis genes are encoded near the end of the genome from position 373683-463039, encoding the three endolysins, the holin, and an Rz-like spanin protein (Supplementary Table 1). ORFs ps604-605/gp22-23 encode similar N-acetylmuramoyl-L-alanine amidase/murein transglycosylase/flgJ muramidase (EC 3.5.1.28) endolysins, with ps604/gp22 but not ps605/gp23 detected within the proteomic data (Supplementary Table 1; Fig. 6). ORF ps575/gp687, which encodes the Rz-like spanin, was also detected in the proteomic data (Supplementary Table 1; Fig. 6). ORF ps475/gp588 encodes a VOG62162 and phrog_25820 endolysin; it has homology to KO number K23989, a mannosyl-glycoprotein endo-beta-N-acetylglucosaminidase with lytD and lytB domains, but it is not found within the proteomic data (Supplementary Table 1; Fig. 6). ORF ps86/gp183 is a novel tape measure and tail assembly chaperone protein with an attached tail lysin. This protein was detected in the proteomic data and may be a tailocin-like endolysin (Supplementary Table 1; Fig. 6). A holin protein is also encoded by ps607/gp25; it was detected in our proteomic data and in prior proteomics [14].

Host response and auxiliary metabolic genes (AMGs)

Phage G has two genes that appear to directly interfere with host antiviral defense: anti-CBASS nuclease Acb1 (ps20/gp116) and anti-Pycsar protein Apyc1 (ps290/gp388) [34] (Supplementary Table 1). These proteins counteract the cyclic oligonucleotide-based antiphage signaling system (CBASS) and the pyrimidine cyclase system for antiphage resistance (Pycsar), which are host antiviral immune systems [35]. The closest hits for ps20/gp116 are to pVOG7183 (unannotated protein) and phrog_12204 (unknown function), with phrog_12204 often co-localized with phrog_22915, a membrane protein in the moron, AMG and host takeover category. HHpred matches ps20/gp116 to pdb:7T26_A anti-CBASS nuclease Acb1 and Uniprot-sprot-vir model P04533 (T4 gene 57B). ORF ps290/gp388 has homology to VOG01700, which is anti-Pycsar protein Apyc1 (5.4E-67, Supplementary Table 1). Neither ps20/gp116 nor ps290/gp388 was found amongst the proteomic data (Supplementary Table 1; Fig. 6).

Dihydrofolate reductase (dfrA) and phoH-like genes are encoded by phage G. ORF ps30/gp125 of phage G is a homolog of dfrA, as first described in Enterobacteriaceae phage T4. We built a phylogeny of phage dfrA in which the Enterobacteriaceae clade containing T4 and its relatives AR1 and Serratia phage Muldoon forms the outgroup. Phage G forms a unique clade that branches from Citrobacter phage Mijalis and Maleficent and mainly contains gram-positive Bacillus and Salimicrobium (Fig. S4; Supplementary Table 7). Moose phage W30-1 also has dfrA, which shares some structural homology with that of phage G based on AlphaFold; however, as with other related phylogenies where Moose phage W30-1 has a homolog with phage G, it appears to have diverged earlier (Fig. S4; Supplementary Table 7). Phage G’s phoH-like protein is encoded by ps65/gp162, and its counterpart in Moose phage W30-1 by ps189 (Supplementary Table 1). ORF ps65/gp162 was not found proteomically in the virion particle (Fig. 6). We compared phoH genes from gram-positive relatives, using Escherichia phage T5 as an outgroup; again, Moose phage W30-1 diverged earlier and sits in a separate clade from phage G phoH, which forms a unique clade that further outgroups from a Bacillales bacterium (Fig. S4). The phage G and Moose phage W30-1 predicted structures are highly divergent, and the phage G protein is more complex than Escherichia phage T5 phoH but less complex than that of Moose phage W30-1 (Fig. S4; Supplementary Table 7).

Phage G has proteins related to the flagellar operon protein family (TIGR02530) (ps41/gp136), a FtsZ/tubulin-like GTPase (ps43/gp138), and homologs of F-like type IV secretion system (T4SS) proteins (ps58-59/gp155-156) (Supplementary Table 1; Fig. 6). Neither ps41/gp136 nor ps43/gp138 was detected proteomically (Fig. 6). Genes of the TIGR02530 family, to which ORF ps41/gp136 belongs, are located between the genes flgD and flgE36. ORF ps43/gp138 is a FtsZ/tubulin-like GTPase, which also has homology to KEGG KO K22222 (CetZ) (Supplementary Table 1). Pseudomonas phages provide an outgroup and have folds structurally similar to that of phage G, but they are still highly diverged phylogenetically and structurally (Fig. S5; Supplementary Table 7). The phage G version appears more bacterial than phage-like, as it forms a unique clade with Thermosipho sp. as an outgroup (Fig. S5). Phage G has two ORFs predicted to be TraC-like and TraD-like homologs within the T4SS secretory system (ps58-59/gp155-156), which were not detected in the virion particle proteomics (Supplementary Table 1; Fig. 6).

Phage G possesses a multitude of sporulation genes to manipulate sporulation of its Bacillus host. These sporulation-related genes include a spore protease (ps121/gp219), small acid-soluble spore protein D (minor alpha/beta-type SASP) (ps168/gp266), RNA polymerase sporulation-specific sigma factor (ps202/gp300), prespore-specific regulator (ps223/gp321), prespore-specific transcription factor without RsfA-like domain (ps281/gp379), YtxC-like sporulation protein (ps306/gp404), stage V sporulation protein K (AAA-like ATPase, ps483/gp596), and three RsfA-like prespore-specific transcriptional regulators (ps518/gp631, ps665/gp84, ps666/gp85) (Supplementary Table 1). Of the sporulation-related genes in phage G, only ps223/gp321, ps483/gp596, ps518/gp631, ps665/gp84, and ps666/gp85 were detected in the virion particle proteomic data (Supplementary Table 1; Fig. 6). Moose phage W30-1 had only one spore-related gene, ps119, which is similar to phage G’s RNA polymerase sporulation-specific sigma factor (ps202/gp300) (Supplementary Table 1; Fig. 6). Using phylogenetics, we further resolved stage V sporulation protein K (ps483/gp596), a K06413 homolog (gene ID spoVK) that also matches VOG44220 (AAA family ATPase). The closest clade to phage G’s spoVK was related to Clostridium botulinum (Fig. S2). Moose phage W30-1 did not have a homolog of sporulation protein K (ps483/gp596), nor have we found a phage that does, which allowed us to compare it directly only with spore-forming gram-positives such as Bacillus and Clostridium (Fig. S2).

Methylome

We evaluated the whole-genome methylation landscape of phage G using DeepSignal237. We found 31 methylation sites across the genome significantly above the confidence threshold (Table 3; Fig. 7). Of all the methylation sites in the phage G genome, 71% occur at or after position 251905, and ~40% occur between positions 314147 and 363191 (Table 3; Fig. 7). Approximately 32% of the methylation within the genome occurs in the cryptic region, a 35 kbp stretch (i.e., positions 291377-327020) with 66 hypothetical ORFs that have no known homology except to phage G itself (Table 3; Fig. 7). MetaCerberus, SWORD, and HHpred could not find matches to the various ORFs within this highly methylated region, so we used Foldseek (v9-427) to help us resolve these ORFs38. Two ORFs with known annotation upstream of and within the unknown zone are ps326/gp424, a member of the gp424 adhesion protein family that is only found in phage G (methylation position 278258), and ps377/gp476, which is a CyaY chaperone-like protein (methylation positions 321477 and 321589) (Table 3; Fig. 7). The ORFs ps366/ps368/ps374 each carry multiple methylation sites and form a methylation “hotspot” within the genome, but their functions are unknown (Table 3; Fig. 7). Methylation positions 77050, 77779, and 78199 are within the baseplate protein assembly ORFs ps91-92/gp188-189, where long tail fibers attach to phage particles (Table 3; Fig. 7). Three further ORFs with multiple methylation sites are ps177/gp275, ps242/gp340, and ps508/gp621, which encode a Pyrazinamidase/nicotinamidase/isochromatase hydrolase PncA-like protein, a Trimeric auto-transporter adhesins pectin lyase (DUF2807-like), and a YozC-like protein, respectively (Table 3; Fig. 7).
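The percentages quoted above can be rechecked directly from the positions listed in Table 3. The following is a minimal sketch, not the analysis code used in the study; the dictionary layout and the helper name `fraction_in_window` are illustrative assumptions, and only the positions themselves come from Table 3.

```python
# Methylation positions transcribed from Table 3, keyed by ps ORF number.
TABLE3_POSITIONS = {
    "ps91": [77050],
    "ps92": [77779, 78199],
    "ps177": [150529, 150535],
    "ps242": [199184, 199227, 199234],
    "ps252": [199320],
    "ps299": [251905, 252007],
    "ps326": [278258],
    "ps366": [314147, 314150],
    "ps368": [314848, 314854, 314866],
    "ps373": [318043],
    "ps374": [318860, 318862],
    "ps377": [321477, 321589],
    "ps444": [352329],
    "ps448": [363191],
    "ps508": [400592, 400769, 400860, 400912],
    "ps564": [425189],
    "ps573": [441939],
    "ps577": [443970],
}


def fraction_in_window(positions, start, end):
    """Fraction of sites whose position lies in [start, end], inclusive."""
    inside = [p for p in positions if start <= p <= end]
    return len(inside) / len(positions)


# Flatten the table into one sorted list of site positions.
all_sites = sorted(p for sites in TABLE3_POSITIONS.values() for p in sites)

total_sites = len(all_sites)
# Sites at or after position 251905 (upper bound = last site in the table).
late_fraction = fraction_in_window(all_sites, 251905, max(all_sites))
# Sites between positions 314147 and 363191.
hotspot_fraction = fraction_in_window(all_sites, 314147, 363191)
# Sites inside the cryptic region (positions 291377-327020).
cryptic_fraction = fraction_in_window(all_sites, 291377, 327020)

print(total_sites, f"{late_fraction:.1%}", f"{hotspot_fraction:.1%}", f"{cryptic_fraction:.1%}")
```

With inclusive window bounds, the table gives 22 of 31 sites at or after position 251905, 12 of 31 sites between 314147 and 363191, and 10 of 31 sites in the cryptic region.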

Table 3.

De novo methylation call results and positions across the phage G genome

| ORF (ps) | GP | Annotation | Methylation position(s) |
|---|---|---|---|
| 91 | 188 | P2gpJ-like baseplate protein | 77050 |
| 92 | 189 | P2gpI-like baseplate protein with C-terminal appendage | 77779, 78199 |
| 177 | 275 | Pyrazinamidase/nicotinamidase/isochromatase hydrolase PncA-like | 150529, 150535 |
| 242 | 340 | Trimeric auto-transporter adhesins pectin lyase (DUF2807-like) | 199184, 199227, 199234 |
| 252 | 350 | Probable diguanylate cyclase DgcQ | 199320 |
| 299 | 397 | Hypothetical | 251905, 252007 |
| 326 | 424 | gp424 adhesion protein family (Gphage) | 278258 |
| 366 | 465 | Hypothetical | 314147, 314150 |
| 368 | 467 | Hypothetical | 314848, 314854, 314866 |
| 373 | 472 | Hypothetical | 318043 |
| 374 | 473 | Hypothetical | 318860, 318862 |
| 377 | 476 | CyaY chaperone-like protein | 321477, 321589 |
| 444 | 550 | Hypothetical | 352329 |
| 448 | 560 | Tryptophan repeat gene family protein | 363191 |
| 508 | 621 | YozC-like protein | 400592, 400769, 400860, 400912 |
| 564 | 675 | Hypothetical | 425189 |
| 573 | 685 | Holliday junction resolvase Hj-e/c-like | 441939 |
| 577 | 689 | Hypothetical | 443970 |

The full annotation name is given for each ORF (ps/gp identification number); for further details, see the annotations in Supplementary Table 1.

Fig. 7. De novo methylation call results and positions across the phage G genome plot.

To save space, the annotation names were shortened to fit within the plot. The full annotation name for each ORF (ps/gp identification number) is given in Table 3; for further details, see the annotations in Supplementary Table 1.

Particle biophysical characteristics

We performed various tests to measure the biophysical stability of the particles under abiotic environmental stressors (i.e., temperature and pH) for wild type vs. mutant (i.e., RAW_WT vs. UT_MUT) phage G. The aim was to determine whether such stressors affect the infectivity of the wild-type or the UT mutant strain. Each stressor was applied for 30 min, followed by recovery (cooling on ice or neutralization to pH 7; see Methods). Generally, the UT_MUT had more variance in pfu mL−1 than RAW_WT, except at 40 °C and pH 4, where RAW_WT had more (Fig. 8). The UT_MUT was more stable than the RAW_WT strain as the temperature increased from 40 to 60 °C (Fig. 8). At 40 °C, the UT_MUT had nearly half a log more pfu mL−1 than RAW_WT (Fig. 8), which was statistically significant (p < 0.05). RAW_WT appeared more sensitive to heat than UT_MUT, but it had a greater variance in pfu mL−1 (Fig. 8). We attempted a 70 °C treatment, but no plaques were obtained for either the wild type or the mutant. The UT_MUT had greater variance in plaque formation at pH 6, control pH 7, and alkaline pH 10 (Fig. 8). UT_MUT lost nearly one entire log unit of pfu mL−1 at acidic pH (pH 4), which was statistically significant, but RAW_WT had a higher variance in pfu mL−1 at pH 4 than UT_MUT (Fig. 8). We attempted pH 3 and pH 11, but there were too few plaques to count, and plaque formation was highly inconsistent amongst replicates; thus, these treatments were omitted.

Fig. 8. Particle biophysical characteristics of phage G RAW_WT vs. UT_MUT.

a Thermal stability test. b pH stability test. Each stressor was applied for 30 min, followed by recovery to the specific pH or temperature.

Discussion

Here, we have presented the methylome, the particle proteome, the biophysical properties (including the biochemical stability) of the phage particle, and a robust genomic annotation for the only known cultivated megaphage on Earth. The phage G genome is ~499 kbp at 29% GC content and contains 668 predicted ORFs, of which 104 are proteomically detected within the phage particle. The five variants of phage G are 99.9% similar, with a few conserved SNPs, and are similar to Moose phage W30-1, which has never been cultivated.

Based on our data, the sequence discrepancy with the PFGE data is caused by modification of the bases in the phage G genome, not by an unusually long terminal redundancy. Indeed, our data indicate that phage G is highly methylated within a cryptic region (i.e., positions 291377-327020). Methylation can affect the mobility of DNA through gels, as shown by ref. 39. Further experimental validation of the effect of methylation in PFGE studies is needed.

PhageAI predicted that phage G was a temperate phage, which means that it can enter both the lysogenic and lytic phases. Only a few genes, such as a transposase, were found to possibly signify lysogenic lifestyle. However, lysogenic lifestyle has never been observed for phage G in 50 years. CheckV results suggest phage G is not a prophage nor does it exist as a prophage within its host Lysinibacillus sp. PGH14. Lytic/lysogenic classification for phage G is not certain.

Furthermore, we have resolved the genome to completeness, with no gaps and no contamination, settling a decades-long debate by finding that the LTR is 121 bp. We also found a missing 2 kbp at the 3’ end of the draft NCBI genome. The genome is linear with ends that require revision of the starting point of the sequence in NCBI_WT. All five variants of phage G were 99.9% identical based on ANI. The NCBI published variant is no longer available; thus, we could not determine whether the 2 kbp was missing in the NCBI_WT variant. Sanger sequencing of phage G DNA might be challenging because (1) phage G DNA is highly methylated, which would make cloning difficult, and (2) some genes could be lethal in an E. coli cloning vector. Either of these two factors could be the cause of the missing 2 kbp piece in the NCBI reference. In addition, LTRs and repetitive sequences are hard to resolve at the read length of Sanger sequencing (~600–800 bp).

Our annotation corrected the physical orientation of the genome and substantially improved on the draft annotation. Given the above revisions, we have also renamed the ORFs as ‘ps’, the initials of Philip Serwer, in honor of the laboratory that kept phage G active (frozen stocks of phage G are unstable) and was the immediate source of the versions sequenced here. We retained the original gp numbering from the NCBI reference to avoid confusion. However, we think that future work on phage G should use the new gene numbering system, which we will update biyearly with the advent of new technology or functional discoveries of the hypothetical ORFs within phage G.

While our study provides a robust annotation, 66% of the ORFs in the genome remain hypothetical even with modern structural AI-based predictions (e.g., Foldseek), HMMs (e.g., MetaCerberus and HHpred), and standard alignment approaches (e.g., SWORD). Furthermore, the role of the cryptic zone (i.e., a 35 kbp tract of the genome with limited functional annotation where 32% of all genome methylation occurs) is not known. The methylome of phage G provides an excellent opportunity for further discovery. CRISPRi or other genome-level manipulations, as well as protein-protein interaction studies, are recommended as future directions to resolve the functions of phage G40.

Our results confirm previous proteomic and cryo-EM studies of the particle capsid/head assembly and other proteins14. We went further to show that, based on mcp and terL phylogeny, phage G classifies in a unique megaphage group unlike the others found so far. Apart from those of Moose phage W30-1, these genes appear to be very different from those of the other Lak phages and megaphages like Mar_Mega_1. Phage G’s mcp (i.e., HK97 gp5 lineage) is 60% larger than its counterpart in Escherichia virus T4 (gp23 lineage)14. Phage G’s mcp diverged earlier than those of HK97 or the Lak megaphages based on our phylogenetic analysis, which suggests that phage G mcp is potentially older or more divergent than Lak or HK97 capsids. The tail assembly of phage G is highly complex, with various lysis proteins attached to ensure entry and lysis.

Nucleic acid metabolism, repair, and transcription appear highly complex and regulated within phage G, as it is able to replicate its 499 kbp genome in under 2 h. Various proteins related to nucleic acids were discovered within the phage particles of phage G, including DNA polymerases, exonucleases/endonucleases, helicases, primases, gyrases, and a uvsX-like recombinase. Various DNA repair mechanisms exist within phage G, which may account for its genome SNP stability over five decades across multiple labs. We have also identified the potential OriC of phage G, which begins at the start of the genome, and have resolved multiple phage/host promoters in need of functional validation. Phage G may also control host transcription via antitermination, which prevents early termination of RNA transcription and regulates phage lifestyle, including initiating lytic or lysogenic phases34,41. Proteomics detected multiple proteins within the particle that can potentially protect phage G from DNA damage (e.g., recA-like, histone-like, and ferritin-like binding proteins). The resolvase RuvABC endonuclease, thymidine kinase, and thymidylate synthase encoded by phage G have yet to be functionally validated. Thymidine and thymidylate-related genes function to repair thymidine dimers42. RuvABC-like proteins (i.e., ps251/gp349) are, as host proteins, involved in DNA repair and recombination, including resolving cruciform DNA43,44. Phage G encodes a histone-like DNA-binding protein (ps141/gp239), which can be involved in phage recombination31. The ferritin-like DNA-binding proteins (ps248/gp346, ps279/gp377) may bind DNA and/or metal ions during DNA replication32,33.

Phage G encodes its own translation machinery, including many of its own tRNAs and a suppressor tRNA, as well as various proteases, including ClpXP (ps159-160/gp257-258), to monitor translation, and various enzymes to lyse the cell. The suppressor tRNA (tRNA Sup:UUA) is not associated with the translation of any known amino acid45. Its UUA anticodon decodes the TAA stop codon, which is rarely used in bacteria (0.79%) but more commonly found in eukaryotes (~21%)45. ClpP is an ATP-dependent protease that is involved in head maturation and virion structural formation. ClpP and ClpX interact to degrade misfolded proteins when the cell is under stressful conditions, allowing cellular protein stability46. Spanins such as the one encoded by ORF ps575/gp687 (i.e., Rz-like spanin) are more commonly found amongst gram-negatives than gram-positives, as they are required to disrupt the outer membrane in gram-negatives47. We confirmed González et al.’s14 finding of ps607/gp25, a holin, which is a small protein that punches holes in the bacterial cytoplasmic membrane towards the end of the lytic cycle, leading to particle release48.
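The pairing between the suppressor tRNA anticodon and the stop codon can be made explicit. The sketch below assumes strict Watson-Crick pairing only (no wobble rules), with both sequences written 5'→3'; the function names are illustrative and are not taken from any tool used in the study.

```python
# Watson-Crick complements for RNA bases (wobble pairing is deliberately ignored).
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codon_for_anticodon(anticodon):
    """Return the mRNA codon (5'->3') that pairs with an anticodon (5'->3').

    Codon and anticodon pair antiparallel, so the codon is the reverse
    complement of the anticodon.
    """
    return "".join(RNA_COMPLEMENT[base] for base in reversed(anticodon.upper()))


def to_dna(rna):
    """Write an RNA sequence in DNA letters (U -> T)."""
    return rna.replace("U", "T")


stop_codon_rna = codon_for_anticodon("UUA")
print(stop_codon_rna, to_dna(stop_codon_rna))
```

Under these assumptions, the UUA anticodon pairs with the UAA codon, which is written TAA in the DNA sequence.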

The phage G genome encodes homologs of the flagellar operon protein family (TIGR02530) (ps41/gp136), a FtsZ/tubulin-like GTPase (ps43/gp138), and F-like type IV secretion system (T4SS) proteins (ps58-59/gp155-156). ORF ps41/gp136 is a TIGR02530 family gene; members of this family lie between the genes flgD and flgE, which are involved in flagellar hook formation36. FlgE is the main structural subunit of the flagellar hook, and FlgD polymerizes FlgE and provides scaffolding for the flagellar hook in Bacillus subtilis36. In bacteria, FtsZ assembles a Z-ring, which is the site of cell division and is essential for the process; it also has roles in peptidoglycan synthesis and is the bacterial homolog of eukaryotic tubulin49. Homologs of FtsZ or TubZ are commonly found in gram-negative bacteria (e.g., Pseudomonas) and their phages, including phiKZ, which we used as an outgroup in our phylogenetics50,51. FtsZ/TubZ homologs in phages act as DNA partitioning systems and cytomotive GTPases and may act as transporters of phage DNA to cellular poles50. F-like type IV secretion systems (T4SS) transport DNA between bacterial cells52. ORFs ps58-59/gp155-156 are TraC-like and TraD-like homologs within the T4SS secretory system. TraC and TraD are membrane-bound ATPase proteins within the bacterial cell, which provide energy in the form of ATP for pilus extension and transferosome transport, respectively53. Their function within phages is entirely unknown.

ORF ps30/gp125 of phage G is a homolog of dfrA. Trimethoprim-resistant dihydrofolate reductase (dfrA) was first described in Enterobacteriaceae phage T4 in 1970; in T4, it functions in building tetrahydrofolate for thymidylate synthesis, which supplies nucleotides for DNA synthesis, and has a structural role as a protein in the baseplate of the tail54,55. Whether trimethoprim resistance is conferred or transduced from phage to bacterial host via a dfrA has not been functionally studied. dfrA genes within phages warrant further investigation.

Another AMG within phage G and Moose phage W30-1 is a phoH-like homolog; phoH is found amongst many phages, but this represents a first within a megaphage. PhoH is part of the Pho (phosphate) regulon, whose genes are induced under phosphate starvation; it is a cytoplasmic protein that is predicted to be an ATPase with ATP-binding activity in E. coli K-12 (refs. 56–58). The gene is highly prevalent within phage genomes from marine ecosystems (~40%), compared with ~4% in non-marine phages59. No functional analysis of a phage phoH has been completed; thus, its function remains to be discovered. Further functional analysis of phoH is warranted, including the megaphage versions in phage G and Moose phage W30-1.

Phage G has the antiviral escape genes anti-CBass nuclease Acb1 (ps20/gp116) and Anti-Pycsar protein Apyc1 (ps290/gp388), which may allow it to escape the host response. Anti-CBass nuclease Acb1 hydrolyzes a tricyclic nucleotide used by the host to signal cell death and host immunity in response to phage infection35. The first Apyc1 protein was functionally validated in Bacillus subtilis phage SBSphiJ as a metal-dependent cyclic NMP phosphodiesterase that targets pyrimidines (mainly cytidine 3’,5’-cyclic monophosphate (cCMP) and uridine 3’,5’-cyclic monophosphate (cUMP))35. ps20/gp116 and ps290/gp388 are homologs of Acb1 and Apyc1, respectively, which could allow phage G to evade the host immune response by selectively degrading host cyclic nucleotide immune signals35. Further functional investigation and validation of these antiviral escape genes are warranted.

Spore-related host regulatory proteins are expanded within phage G. The host of phage G is a Lysinibacillus, a gram-positive endospore former, with endospore formation being an ancient form of dormancy60. Many phages have been identified to have spore-forming related homologs60, but few, if any, have as many of these genes as phage G. Five of the spore-related proteins were found in the viral particle proteomics. However, other spore-related proteins not found in particle proteomics may be expressed only during lytic infection inside the host. Phage G potentially evolved later than Moose phage W30-1 and appears to have acquired more spore-related genes from its host Lysinibacillus sp. PGH. Encoding various prespore proteins within the virion particle may regulate the ability of the host to enter the spore stage at the point of infection. The multiple RsfA-like proteins within phage G, including one without the full RsfA-like domain (ps281/gp379), may act as repressors of stationary phase and/or spore formation61. Spore protein D, or small acid-soluble spore protein D (minor alpha/beta-type SASP), is an sspD-like ORF encoded by ps168/gp266 that, if it acts like sspD and other SASPs, would bind dsDNA and protect it from UV damage62. The YtxC-like protein and stage V sporulation protein K are transcriptional regulators of sporulation whose roles are unknown; mutations in spoVK/spoVJ impair sporulation63.

Methylation within phages provides higher-order regulation of DNA recognition, gene expression, replication, survival, and evasion of host response64,65. Direct nanopore sequencing of phage G allows for de novo methylation detection without bisulfite chemical conversion, owing to the local change in current across the biological nanopore37. Methylation occurs genome-wide within phage G, but it can be linked to function mainly within the baseplate-tail region. Methylation positions 77050–78199 are within the baseplate proteins where the long tail fibers attach. This methylation of the baseplate, upstream of tail fiber attachment, may regulate the length of the tail fiber or the attachment of tail fibers to the baseplate. Tail fiber regulation is critical to recognizing and interacting with the surface receptor of the host66. The PncA-like protein, potentially a pyrazinamidase/nicotinamidase/isochromatase hydrolase, is also highly methylated. The function of ps177/gp275 is unknown in phage G. However, PncA hydrolases can function in a variety of ways, including drug resistance; if ps177/gp275 is a nicotinamidase, it would function in recycling NAD+67,68. The function of the trimeric auto-transporter adhesin pectin lyase (DUF2807-like) within phage G is unknown, but the protein is highly conserved in multiple phyla of bacteria69. The YozC-like ORF ps508/gp621 is a conserved hypothetical protein in Bacillus spp. whose function is unknown70. A computational and experimental protein-protein interaction assay suggested that YozC interacts with AlaS, an alanine-tRNA ligase gene, to adenylate cyclase CyaA, which suggests that methylation could regulate this process70.

Particle biophysical characterization of phage G included both temperature and pH stress. The UT_MUT had slightly less resilience to temperature and pH, and higher variance amongst replicates, than the ATCC/RAW_WT. A temperature of 70 °C or extreme pH (pH 3 and 11) resulted in no countable plaques regardless of phage G variant. Our data suggest that phage G generally cannot handle significant rapid shifts in pH and temperature. This sensitivity may explain why the culture has been hard to maintain across multiple labs.

Phage G represents a marvel of phage biology that has remained in the shadows for over five decades. These data provide the blueprint for further investigations unlocking the genomic repertoire of phage G. Further study along a variety of avenues is needed to unlock more of the functions within phage G as well as other megaphages. Greater efforts are also needed to cultivate megaphages as model systems to unravel the phage world’s giants.

Methods

Sampling, DNA extraction, Library Prep, and sequencing

Phage G was amplified using dilute agarose gels as a modification of the traditional double agar overlay plaque assay method71,72. Phage particles were purified from the double agar plaque assay with SM buffer and chloroform treatment; the aqueous layer was then recovered by centrifugation for 5 min at 4000 × g. Phage particles were precipitated from the aqueous layer with acetone73. Briefly, 1 part of phage particles (10^9 PFU mL−1) was suspended in 4 parts acidified acetone (pH 5.5), shaken vigorously for 2 min, and centrifuged for 5 min at 4000 × g. We precipitated 10 mL of total phage G stock. The supernatant was decanted, and the phage pellet was left to dry and then resuspended in SM buffer. DNA was extracted from all strains of phage G using the NEB (New England Biolabs, Ipswich, MA) Monarch HMW DNA Extraction Kit for Tissues (T3060), with a minor modification: RNase treatment was not performed. The libraries were prepared using the Oxford Nanopore Native Barcoding Kit 96 V14 (SQK-NBD114.96, Oxford, United Kingdom). DNA was sequenced using an Oxford Nanopore PromethION 2 Solo.

Quality control

Illumina HiSeq 2500 paired-end reads were filtered using Trimmomatic with sliding window parameters of 3 leading and trailing bases at a PHRED score of 20 and a minimum read length of 50 bases74. Adapters were detected and removed with ILLUMINACLIP using the TruSeq3 paired-end fasta, allowing two seed mismatches, a palindrome alignment score of 30, and an adapter clip threshold of 10, in accordance with the documentation. Raw Nanopore fast5 sequencing reads were processed with the default filtering parameters of the provided MinKNOW sequencing software (v21.02.2), comprising a minimum quality score of 7, with real-time fast base-calling and fastq generation enabled. The resulting base-called fastq files were used in subsequent downstream analyses.
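To illustrate what sliding-window quality trimming with a minimum-length filter does to a read, the following is a simplified pure-Python sketch. It is not Trimmomatic's implementation; the window width of 4 bases is an assumption (the text above states only the quality threshold of 20 and the minimum length of 50), and all function names are hypothetical.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(seq, quals, window=4, min_avg=20):
    """Cut the read at the first window whose mean quality is below min_avg.

    Scans from the 5' end; everything from the start of the failing window
    onwards is discarded. `window=4` is an assumed width, not a value from
    the text.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if not quals:
        return seq, quals
    last_start = max(len(quals) - window, 0)
    for start in range(last_start + 1):
        chunk = quals[start:start + window]
        if sum(chunk) / len(chunk) < min_avg:
            return seq[:start], quals[:start]
    return seq, quals


def passes_min_length(seq, min_len=50):
    """Keep only reads that are at least min_len bases long after trimming."""
    return len(seq) >= min_len


# Hypothetical read: 60 high-quality bases followed by 10 low-quality bases.
read_seq = "A" * 70
read_quals = [30] * 60 + [5] * 10
trimmed_seq, trimmed_quals = sliding_window_trim(read_seq, read_quals)
print(len(trimmed_seq), passes_min_length(trimmed_seq))
```

In this toy read, the first window with a mean quality below 20 starts at base 58, so the read is cut there and still passes the 50-base minimum length.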

De novo genome assembly

De novo assembly was conducted on UT_WT and RIT_WT samples using Unicycler (v.0.5.0), as they consisted exclusively of Illumina HiSeq data75. UT_MUT was assembled in the same manner and also with a hybrid Illumina/Nanopore de novo assembly approach using Trycycler (v0.5.4), as both data types were available76. UT_WT was covered at 684x, RIT_WT was covered at 4690x, and UT_MUT was covered at 553x (Table 1). The original raw reads for Moose W30-1 and NCBI_WT were unavailable. Briefly, the assessment of k-mer subassemblies from Unicycler’s implementation of SPAdes (v4.0) resulted in a best-selected k-mer size of 71, generating a single contig with no detectable dead ends in sequence77. Contigs with low coverage (<10 reads) were removed, resulting in one anchor segment of 499,819 bases, which was subsequently used to form the final uniting sequence and did not circularize. CheckV (v1.01) was used to assess the quality and completeness of all phage G strains. CheckV assesses the Minimum Information about an Uncultivated Virus Genome (MIUViG) quality metrics78. Genome statistics were evaluated using our in-house tool, RustyOmeStats (github.com/raw-lab/RustyOmeStats).
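The coverage values and the low-coverage contig filter described above follow simple arithmetic. The sketch below is only an illustration of that arithmetic, not the code used by Unicycler or RustyOmeStats; the `Contig` class, the example contigs, and their read counts are hypothetical, while the 499,819-base length and the 10-read cutoff come from the text.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int
    mapped_reads: int


def mean_depth(total_read_bases, genome_length):
    """Mean sequencing depth: total sequenced bases divided by genome length."""
    return total_read_bases / genome_length


def drop_low_coverage(contigs, min_reads=10):
    """Remove contigs supported by fewer than min_reads reads."""
    return [c for c in contigs if c.mapped_reads >= min_reads]


GENOME_LENGTH = 499_819  # anchor segment length reported in the text

# Hypothetical assembly output: one well-supported contig and two spurious ones.
contigs = [
    Contig("contig_1", GENOME_LENGTH, 2_000_000),
    Contig("contig_2", 812, 4),
    Contig("contig_3", 1_240, 9),
]
kept = drop_low_coverage(contigs)
print([c.name for c in kept])

# Total read bases that would correspond to 684x depth on this genome.
bases_for_684x = 684 * GENOME_LENGTH
print(mean_depth(bases_for_684x, GENOME_LENGTH))
```

Here only the single anchor contig survives the filter, mirroring the outcome described above.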

Annotation, SNP, methylome, and phylogenetic analysis

Annotation was completed using MetaCerberus (v1.4), Foldseek (v9-427), HHpred (v2.08), and SWORD (v1.0.4) with manual curation. Prodigal-gv (v2.11.0) was used within MetaCerberus to call ORFs79 for all downstream annotation processes. To avoid confusion with previous manuscripts, we have named the new ORFs from the corrected annotation ps1-668 based on the ATCC_WT annotation, with the original gp (“gene product”) number from the RWH draft genome retained alongside. For example, while the terL gene was rearranged to the beginning of the genome as gp1 or G_1 in GenBank in NCBI_WT, it is ps583/gp1 here, having both the corrected number based on the proper topology and arrangement of the genome (i.e., ps - ORF number) and the original gp number. We used SWORD to align NCBI_WT against ATCC_WT to obtain the original gp numbers in relation to the new ps ORF names. Origin of replication (OriC) and phage/host promoters were found using PhagePromoter and Ori-Finder 2022 (refs. 80,81), with default parameters. tRNAs and tmRNAs were annotated using tRNAscan-SE 2.0 (v2.0.12) and Aragorn (v1.2.36)82,83.
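The combined ps/gp labels used throughout the text (e.g., ps583/gp1 or ps604-605/gp22-23) can be expanded into individual ORF pairs programmatically. The following is a minimal sketch; the function name `expand_label` and the regular expression are illustrative assumptions, and the sketch assumes that ranged labels pair their ps and gp numbers in order, as in ps604/gp22 and ps605/gp23.

```python
import re

# Matches labels such as "ps583/gp1" or "ps604-605/gp22-23".
LABEL_RE = re.compile(r"^ps(\d+)(?:-(\d+))?/gp(\d+)(?:-(\d+))?$")


def expand_label(label):
    """Expand a combined ps/gp label into a list of (ps, gp) number pairs."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    ps_start, ps_end, gp_start, gp_end = match.groups()
    ps_ids = range(int(ps_start), int(ps_end or ps_start) + 1)
    gp_ids = range(int(gp_start), int(gp_end or gp_start) + 1)
    if len(ps_ids) != len(gp_ids):
        raise ValueError(f"ps and gp ranges differ in length: {label!r}")
    return list(zip(ps_ids, gp_ids))


print(expand_label("ps583/gp1"))
print(expand_label("ps604-605/gp22-23"))
```

Labels that do not follow this pattern, such as slash-separated lists of ps numbers alone, are rejected rather than guessed at.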

Methylome and single nucleotide polymorphisms (SNPs) were analyzed using DeepSignal2 (v0.1.3) and MUMmer4’s NUCmer (v4.0.0rc1), respectively. ATCC_WT provided the reference for SNP analyses of all other variants. Because RIT_WT had significantly more SNPs than the other strains, MUMmer’s delta-filter script with –identity set to 8 and –length set to 1000 was used to filter the RIT_WT delta output from NUCmer. Methylation calling was performed with DeepSignal2 using default settings, and only high-quality methylation calls were used for downstream analysis.

ANI and phylogenetic analyses were completed using fastANI (v1.34), MAFFT (v7.520), and IQTree2 (v2.2.5) (refs. 84–86). FastANI was run using the default parameters outlined within the readme. MAFFT alignments were completed using the local-pair option (linsi), and all trees were built with 1000 bootstraps using parameter -bb 1000. Substitution models for the IQTree2 maximum likelihood trees were selected with ModelFinder87.
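The alignment and tree-building steps can be sketched as command lines assembled in Python. This is an illustration under stated assumptions, not the exact invocation used in the study: the file names are hypothetical, `--localpair --maxiterate 1000` is the long-form spelling of MAFFT's L-INS-i strategy, and `-m MFP` (ModelFinder Plus) is assumed as the way model selection was requested. The commands are only built and printed here, not executed.

```python
import shlex


def mafft_linsi_cmd(fasta_in):
    """Argument list for a MAFFT L-INS-i alignment (output goes to stdout)."""
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_in]


def iqtree2_cmd(alignment, bootstraps=1000):
    """Argument list for an IQ-TREE 2 run with model selection and bootstraps.

    -m MFP is an assumed setting for ModelFinder-based model selection;
    -bb sets the number of ultrafast bootstrap replicates.
    """
    return ["iqtree2", "-s", alignment, "-m", "MFP", "-bb", str(bootstraps)]


# Hypothetical file names, used only for illustration.
align_cmd = mafft_linsi_cmd("terL_proteins.faa")
tree_cmd = iqtree2_cmd("terL_proteins.aln.faa")

print(shlex.join(align_cmd), "> terL_proteins.aln.faa")
print(shlex.join(tree_cmd))
```

Building the argument lists separately from running them keeps the parameters easy to inspect; they could be passed to `subprocess.run` on a system where both tools are installed.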

Particle biophysical characteristics

A working stock of RAW_WT/ATCC_WT and UT_MUT at 10^9 PFU mL−1 was used for all experiments. All temperature treatments were performed for 30 min with shaking at 250 RPM in a thermomixer. After temperature treatment, the samples were cooled on ice for 5 min and then used in a plaque assay. For pH experiments, the samples were adjusted to the desired pH and exposed for 30 min. After exposure, the pH was neutralized to the control pH of 7 using either HCl or NaOH before the plaque assay. The experiment was repeated 4 times to determine the average PFUs for the results.
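Titers and log changes of the kind reported for these experiments follow from the standard plaque assay calculation. The sketch below uses entirely hypothetical plaque counts, dilutions, and plated volumes; only the starting titer of 10^9 PFU mL−1 and the use of four replicates are taken from the text, and the function names are illustrative.

```python
import math
from statistics import mean


def titer_pfu_per_ml(plaques, dilution_factor, volume_plated_ml):
    """Standard plaque assay titer: plaques / (dilution factor x volume plated)."""
    return plaques / (dilution_factor * volume_plated_ml)


def log10_change(treated_titers, control_titers):
    """log10 of the mean treated titer minus log10 of the mean control titer."""
    return math.log10(mean(treated_titers)) - math.log10(mean(control_titers))


# Hypothetical example: 50 plaques from 0.1 mL of a 10^-6 dilution.
example_titer = titer_pfu_per_ml(50, 1e-6, 0.1)

# Hypothetical four-replicate experiment: control at the 10^9 PFU/mL working
# stock, treated samples one order of magnitude lower.
control = [1e9, 1e9, 1e9, 1e9]
treated = [1e8, 1e8, 1e8, 1e8]
change = log10_change(treated, control)

print(f"{example_titer:.2e} PFU/mL, change = {change:.2f} log10")
```

A change of about −1 corresponds to the loss of one log unit of titer, and a change of about +0.5 to a gain of half a log.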

Proteomic sample preparation

Prior to proteomic analysis, G phage particles were purified on a 10–35% sucrose gradient as previously described88. After estimating protein content with BCA assay, 20 μg of proteins from G-ATCC, G-Mutant, and GLK samples, as well as half of the GSL, FGL, and GLS2 samples underwent in-solution digestion. Each sample was mixed with 55 μL of UTT buffer (8 M urea and 10 mM DTT in 50 mM TEABC) in 1.5 mL microcentrifuge tube, incubated at 30 °C with 500 rpm shaking for 60 min, followed by alkylation with 5 μL of 250 mM iodoacetamide at 23 °C for 60 min in the dark. After diluting urea to below 1 M with 50 mM triethylammonium bicarbonate, 100 μL of each sample was transferred to new microcentrifuge tube for proteolytic digestion by adding 0.4 µg of Trypsin/LysC (Thermo Scientific, catalog# A40009) and incubating overnight at 37 °C. Finally, digested peptides were acidified with 1% formic acid to terminate the proteolytic digestion.

LC-MS/MS analysis

Digested peptides (500 ng) were loaded onto EvoTip trap columns (Evosep, EV2013) after column conditioning, equilibration, and washing steps according to the manufacturer’s instructions, with all centrifugations carried out for 60 s at 800 × g. Peptides on the EvoTips were separated on a 150 μm × 15 cm EASY-Spray column (PepMap™ 2 μm C18 beads, Thermo catalog # ES906) using an Evosep One LC system (Evosep, Denmark). Peptides were eluted with a gradient up to 35% solvent B (0.1% formic acid in acetonitrile; solvent A: 0.1% formic acid in water) at a flow rate of 0.5 μL/min, using a 44-min gradient method, and detected in positive ion mode with an Orbitrap Exploris 240 mass spectrometer (ThermoFisher). Data were acquired using either data-dependent acquisition (DDA, top 20) or data-independent acquisition (DIA, mass isolation window 24 m/z). Mass resolution was set at 60,000 (at 200 m/z) for full MS scans and 30,000 for MS/MS scans. HCD collision energy was set at 30.

Database search

The label-free raw data were processed and searched with Proteome Discoverer (PD, version 2.5.0.400, Thermo Fisher Scientific), using the Sequest HT search engine, and matched to the reference phage G genome, its host genome, and common MS contaminants (e.g., keratin, trypsin). Search modifications included carbamidomethyl as a static modification on cysteines (+57.021 Da) and oxidation as a variable modification on methionines (+15.995 Da). Precursor mass tolerance was set to 10 ppm and fragment mass tolerance to 0.02 Da. Both data-dependent acquisition (DDA) and data-independent acquisition (DIA) mass spectrometry data were collected, and the raw data files were analyzed using DIA-NN (version 1.8.1) with the reference phage G genome, its host genome, and common MS contaminants (e.g., keratin, trypsin). Settings included FASTA digest for library-free search, deep learning-based prediction of spectra and retention times, and critical parameters set at 15.0 for mass accuracy, 20.0 for MS1 accuracy, and 4 for scan window. Enzyme specificity was trypsin with allowance for one missed cleavage. Carbamidomethyl modification on cysteine residues was fixed, and match between runs (MBR) was enabled. Protein inference was grouped on genes using a neural network classifier, quantification was optimized for LC accuracy, cross-run normalization was retention time-dependent, and smart profiling was used for library profiling. Default settings were used for all other parameters.
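The search tolerances and modification masses above translate into simple arithmetic. The sketch below is illustrative only and is not part of Proteome Discoverer or DIA-NN; the function names and the example m/z and peptide mass are hypothetical, while the 10 ppm tolerance and the +57.021 Da and +15.995 Da shifts come from the text.

```python
CARBAMIDOMETHYL_DA = 57.021  # static modification on cysteine
OXIDATION_DA = 15.995        # variable modification on methionine


def ppm_window(mz, tol_ppm=10.0):
    """Return the (low, high) m/z bounds for a tolerance given in ppm."""
    delta = mz * tol_ppm / 1e6
    return mz - delta, mz + delta


def modified_mass(base_mass, n_cys, n_oxidized_met=0):
    """Peptide mass after adding the static and variable modification shifts."""
    return base_mass + n_cys * CARBAMIDOMETHYL_DA + n_oxidized_met * OXIDATION_DA


# Hypothetical precursor at m/z 1000: a 10 ppm tolerance is +/- 0.01 m/z.
low, high = ppm_window(1000.0)

# Hypothetical 1000 Da peptide with two cysteines and one oxidized methionine.
shifted = modified_mass(1000.0, n_cys=2, n_oxidized_met=1)

print(f"{low:.3f}-{high:.3f}", f"{shifted:.3f}")
```

Because a ppm tolerance scales with m/z, the absolute window widens for heavier precursors, whereas the 0.02 Da fragment tolerance is fixed.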

Supplementary information

Supplementary Material (4.5MB, docx)
Supplementary Table 1 (45.4KB, xlsx)
Supplementary Table 2 (10.3KB, xlsx)
Supplementary Table 3 (17KB, xlsx)
Supplementary Table 4 (43.3KB, xlsx)
Supplementary Table 5 (1.7MB, xlsx)
Supplementary Table 6 (9KB, xlsx)
Supplementary Table 7 (5.5KB, xlsx)

Acknowledgements

We acknowledge University Research Computing and the College of Computing and Informatics for computational and logistical support. We further acknowledge Steven C. Hardies for his help with the annotation of phage G and for highly useful discussions. Andra Buchan, Stephanie Wiedman, Kevin Lambirth, Madeline Bellanger-Perry, Jose L. Figueroa III, and Richard Allen White III are supported by the UNC Charlotte Department of Bioinformatics and Genomics start-up package from the North Carolina Research Campus in Kannapolis, NC.

Author contributions

A.B., J.L.F. III, and K.L. performed computational analysis. S.W., P.S.S., E.T.W., and M.B.P. performed molecular analysis and characterized the biophysical properties of the particles. All authors contributed to drafting and editing the manuscript, and all have read and approved the final version.

Data availability

All phage genomes used in this study are available on OSF (https://osf.io/gkmf7/). Phage proteomic data are available on PRIDE.

Code availability

The code used in this work is publicly available on GitHub (https://github.com/raw-lab/). All code needed to generate the figures and results is present on the GitHub page.

Competing interests

The authors declare no conflicts of interest. R.A.W. is the CEO of RAW Molecular Systems (RAW), LLC, but no financial, IP, or other resources from RAW LLC were used in or contributed to the study.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s44298-025-00150-9.

References

  • 1. Al-Shayeb, B. et al. Clades of huge phages from across Earth’s ecosystems. Nature 578, 425–431 (2020).
  • 2. Breitbart, M., Wegley, L., Leeds, S., Schoenfeld, T. & Rohwer, F. Phage community dynamics in hot springs. Appl. Environ. Microbiol. 70, 1633–1640 (2004).
  • 3. Carreira, C. et al. Integrating viruses into soil food web biogeochemistry. Nat. Microbiol. 9, 1918–1928 (2024).
  • 4. Cook, R. et al. Decoding huge phage diversity: a taxonomic classification of Lak megaphages. J. Gen. Virol. 105, 001997 (2024).
  • 5. Devoto, A. E. et al. Megaphages infect Prevotella and variants are widespread in gut microbiomes. Nat. Microbiol. 4, 693–700 (2019).
  • 6. Michniewski, S. et al. A new family of “megaphages” abundant in the marine environment. ISME Commun. 1, 58 (2021).
  • 7. White, R. A. III, Visscher, P. T. & Burns, B. P. Between a rock and a soft place: the role of viruses in lithification of modern microbial mats. Trends Microbiol. 29, 204–213 (2021).
  • 8. White, R. A. III The future of virology is synthetic. mSystems 6, e0077021 (2021).
  • 9. Weinheimer, A. R. & Aylward, F. O. Infection strategy and biogeography distinguish cosmopolitan groups of marine jumbo bacteriophages. ISME J. 16, 1657–1667 (2022).
  • 10. Ageno, M., Donelli, G. & Guglielmi, F. Structure and physico-chemical properties of bacteriophage G. II, the shape and symmetry of the capsid. Micron 4, 376–403 (1973).
  • 11. Donelli, G. Isolation of a bacteriophage of exceptional dimensions active in Bacillus megaterium. Nature 44, 95 (1968).
  • 12. Donelli, G., Dore, E., Frontali, C. & Grandolfo, M. E. Structure and physico-chemical properties of bacteriophage G: III. A homogeneous DNA of molecular weight 5 × 10^8. J. Mol. Biol. 94, 555–565 (1975).
  • 13. Donelli, G., Griso, G., Paoletti, L. & Rebessi, S. Capsomeric arrangement in the bacteriophage G head. In Proc. Sixth European Congress on Electron Microscopy (Jerusalem). 2, 502–503 (1976).
  • 14. González, B. et al. Phage G structure at 6.1 Å resolution, condensed DNA, and host identity revision to a Lysinibacillus. J. Mol. Biol. 432, 4139–4153 (2020).
  • 15. Ackermann, H. W. 5500 phages examined in the electron microscope. Arch. Virol. 152, 227–243 (2007).
  • 16. Hutson, M. S., Holzwarth, G., Duke, T. & Viovy, J.-L. Two-dimensional motion of DNA bands during 120° pulsed-field electrophoresis. I. Effect of molecular weight. Biopolymers 35, 297–306 (1995).
  • 17. Serwer, P., Estrada, A. & Harris, R. A. Video light microscopy of 670-kb DNA in a hanging drop: shape of the envelope of DNA. Biophys. J. 69, 2649–2660 (1995).
  • 18. Serwer, P. & Hayes, S. J. Partially condensed DNA conformations observed by single molecule fluorescence microscopy. Biophys. J. 81, 3398–3408 (2001).
  • 19. Hua, J. et al. Capsids and genomes of jumbo-sized bacteriophages reveal the evolutionary reach of the HK97 fold. mBio 8, e01579-17 (2017).
  • 20. Cook, R. et al. INfrastructure for a PHAge REference Database: Identification of large-scale biases in the current collection of cultured phage genomes. Phage 2, 214–223 (2021).
  • 21. Bin Jang, H. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019).
  • 22. Tynecki, P. et al. PhageAI - bacteriophage life cycle recognition with machine learning and natural language processing. bioRxiv. 10.1101/2020.07.11.198606 (2020).
  • 23. Millard, A. et al. taxmyPHAGE: Automated taxonomy of dsDNA phage genomes at the genus and species level. Phage (New Rochelle) 6, 5–11 (2025).
  • 24. Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021).
  • 25. Marçais, G. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).
  • 26. Vaser, R., Pavlović, D. & Šikić, M. SWORD-a highly efficient protein database search. Bioinformatics 32, i680–i684 (2016).
  • 27. Figueroa, J. III MetaCerberus: Distributed highly parallelized HMM-based processing for robust functional annotation across the tree of life. Bioinformatics 40, btae119 (2024).
  • 28. Aksyuk, A. A. et al. The tail sheath structure of bacteriophage T4: a molecular machine for infecting bacteria. EMBO J. 28, 821–829 (2009).
  • 29. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
  • 30. Torrents, E. Ribonucleotide reductases: essential enzymes for bacterial life. Front. Cell. Infect. Microbiol. 4, 52 (2014).
  • 31. Travers, A. DNA-protein interactions: IHF-the master bender. Curr. Biol. 7, R252–R254 (1997).
  • 32. Smith, J. L. The physiological role of ferritin-like compounds in bacteria. Crit. Rev. Microbiol. 30, 173–185 (2004).
  • 33. Maffeo, C. & Aksimentiev, A. Molecular mechanism of DNA association with single-stranded DNA binding protein. Nucleic Acids Res. 45, 12125–12139 (2017).
  • 34. Conant, C. R., Goodarzi, J. P., Weitzel, S. E. & von Hippel, P. H. The antitermination activity of bacteriophage lambda N protein is controlled by the kinetics of an RNA-looping-facilitated interaction with the transcription complex. J. Mol. Biol. 384, 87–108 (2008).
  • 35. Hobbs, S. J. et al. Phage anti-CBASS and anti-Pycsar nucleases subvert bacterial immunity. Nature 605, 522–526 (2022).
  • 36. Mukherjee, S. & Kearns, D. B. The structure and regulation of flagella in Bacillus subtilis. Annu. Rev. Genet. 48, 319–340 (2014).
  • 37. Ni, P. et al. DeepSignal: Detecting DNA methylation state from nanopore sequencing reads using deep-learning. Bioinformatics 35, 4586–4595 (2019).
  • 38. Van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
  • 39. Kinoshita-Kikuta, E., Kinoshita, E. & Koike, T. A mobility shift detection method for DNA methylation analysis using phosphate affinity polyacrylamide gel electrophoresis. Anal. Biochem. 378, 102–104 (2008).
  • 40. Adler, B. A. et al. CRISPRi-ART enables functional genomics of diverse bacteriophages using RNA-binding dCas13d. Nat. Microbiol. 10, 694–709 (2025).
  • 41. Murchland, I. M. et al. Instability of CII is needed for efficient switching between lytic and lysogenic development in bacteriophage 186. Nucleic Acids Res. 48, 12030–12041 (2020).
  • 42. Foekens, J. A., Romain, S., Look, M. P., Martin, P. M. & Klijn, J. G. Thymidine kinase and thymidylate synthase in advanced breast cancer: Response to tamoxifen and chemotherapy. Cancer Res. 61, 1421–1425 (2001).
  • 43. Fogg, J. M., Schofield, M. J., White, M. F. & Lilley, D. M. Sequence and functional-group specificity for cleavage of DNA junctions by RuvC of Escherichia coli. Biochemistry 38, 11349–11358 (1999).
  • 44. Amit, R., Gileadi, O. & Stavans, J. Direct observation of RuvAB-catalyzed branch migration of single Holliday junctions. Proc. Natl. Acad. Sci. USA 101, 11605–11610 (2004).
  • 45. Santos, F. B. & Del-Bem, L. E. The evolution of tRNA copy number and repertoire in cellular life. Genes 14, 27 (2022).
  • 46. Krüger, E., Witt, E., Ohlmeier, S., Hanschke, R. & Hecker, M. The Clp proteases of Bacillus subtilis are directly involved in degradation of misfolded proteins. J. Bacteriol. 182, 3259–3265 (2000).
  • 47. Kongari, R. et al. Phage spanins: Diversity, topological dynamics, and gene convergence. BMC Bioinforma. 19, 326 (2018).
  • 48. Wang, I. N., Smith, D. L. & Young, R. Holins: the protein clocks of bacteriophage infections. Annu. Rev. Microbiol. 54, 799–825 (2000).
  • 49. Margolin, W. FtsZ and the division of prokaryotic cells and organelles. Nat. Rev. Mol. Cell Biol. 6, 862–871 (2005).
  • 50. Oliva, M. A., Martin-Galiano, A. J., Sakaguchi, Y. & Andreu, J. M. Tubulin homolog TubZ in a phage-encoded partition system. Proc. Natl. Acad. Sci. USA 109, 7711–7716 (2012).
  • 51. Aylett, C. H., Izoré, T., Amos, L. A. & Löwe, J. Structure of the tubulin/FtsZ-like protein TubZ from Pseudomonas bacteriophage ΦKZ. J. Mol. Biol. 425, 2164–2173 (2013).
  • 52. Wallden, K., Rivera-Calzada, A. & Waksman, G. Type IV secretion systems: versatility and diversity in function. Cell. Microbiol. 12, 1203–1212 (2010).
  • 53. Bragagnolo, N. et al. Protein dynamics in F-like bacterial conjugation. Biomedicines 8, 362 (2020).
  • 54. Kozloff, L. M., Verses, C., Lute, M. & Crosby, L. K. Bacteriophage tail components. II. Dihydrofolate reductase in T4D bacteriophage. J. Virol. 5, 740–753 (1970).
  • 55. Mosher, R. A. & Mathews, C. K. Bacteriophage T4-coded dihydrofolate reductase: synthesis, turnover, and location of the virion protein. J. Virol. 31, 94–103 (1979).
  • 56. Kim, S. K., Makino, K., Amemura, M., Shinagawa, H. & Nakata, A. Molecular analysis of the phoH gene, belonging to the phosphate regulon in Escherichia coli. J. Bacteriol. 175, 1316–1324 (1993).
  • 57. Koonin, E. V. & Rudd, K. E. Two domains of superfamily I helicases may exist as separate proteins. Protein Sci. 5, 178–180 (1996).
  • 58. Metcalf, W. W., Steed, P. M. & Wanner, B. L. Identification of phosphate starvation-inducible genes in Escherichia coli K-12 by DNA sequence analysis of psi::lacZ(Mu d1) transcriptional fusions. J. Bacteriol. 172, 3191–3200 (1990).
  • 59. Goldsmith, D. B. et al. Development of phoH as a novel signature gene for assessing marine phage diversity. Appl. Environ. Microbiol. 77, 7730–7739 (2011).
  • 60. Schwartz, D. A. et al. Human-gut phages harbor sporulation genes. mBio 14, e0018223 (2023).
  • 61. Wu, L. J. & Errington, J. Identification and characterization of a new prespore-specific regulatory gene, rsfA, of Bacillus subtilis. J. Bacteriol. 182, 418–424 (2000).
  • 62. Setlow, B., McGinnis, K. A., Ragkousi, K. & Setlow, P. Effects of major spore-specific DNA binding proteins on Bacillus subtilis sporulation and spore properties. J. Bacteriol. 182, 6906–6912 (2000).
  • 63. Resnekov, O., Driks, A. & Losick, R. Identification and characterization of sporulation gene spoVS from Bacillus subtilis. J. Bacteriol. 177, 5628–5635 (1995).
  • 64. Sun, C. et al. Long-read sequencing reveals extensive DNA methylations in human gut phagenome contributed by prevalently phage-encoded methyltransferases. Adv. Sci. 10, e2302159 (2023).
  • 65. Ding, Y. et al. PacBio sequencing of human fecal samples uncovers the DNA methylation landscape of 22,673 gut phages. Nucleic Acids Res. 51, 12140–12149 (2023).
  • 66. Taslem Mourosi, J. et al. Understanding bacteriophage tail fiber interaction with host surface receptor: the key “Blueprint” for reprogramming phage host range. Int. J. Mol. Sci. 23, 12146 (2022).
  • 67. Shang, F. et al. Crystal structure of the nicotinamidase/pyrazinamidase PncA from Bacillus subtilis. Biochem. Biophys. Res. Commun. 503, 2906–2911 (2018).
  • 68. Pardee, A. B. et al. Hyperproduction and purification of nicotinamide deamidase, a microconstitutive enzyme of Escherichia coli. J. Biol. Chem. 246, 6792–6796 (1971).
  • 69. Irnov, I., Sharma, C. M., Vogel, J. & Winkler, W. C. Identification of regulatory RNAs in Bacillus subtilis. Nucleic Acids Res. 38, 6637–6651 (2010).
  • 70. O’Reilly, F. J. et al. Protein complexes in cells by AI-assisted structural proteomics. Mol. Syst. Biol. 19, e11544 (2023).
  • 71. Serwer, P., Hayes, S. J., Thomas, J. A. & Hardies, S. C. Propagating the missing bacteriophages: a large bacteriophage in a new class. Virol. J. 4, 21 (2007).
  • 72. Serwer, P., Hayes, S. J., Thomas, J. A., Demeler, B. & Hardies, S. C. Isolation of novel large and aggregating bacteriophages. Methods Mol. Biol. 501, 55–66 (2009).
  • 73. Soleimani-Delfan, A., Bouzari, M. & Wang, R. A rapid competitive method for bacteriophage genomic DNA extraction. J. Virol. Methods 293, 114148 (2021).
  • 74. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
  • 75. Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13, e1005595 (2017).
  • 76. Wick, R. R. et al. Trycycler: consensus long-read assemblies for bacterial genomes. Genome Biol. 22, 266 (2021).
  • 77. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
  • 78. Roux, S. et al. Minimum Information about an Uncultivated Virus Genome (MIUViG). Nat. Biotechnol. 37, 29–37 (2019).
  • 79. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2024).
  • 80. Sampaio, M., Rocha, M., Oliveira, H. & Dias, O. Predicting promoters in phage genomes using PhagePromoter. Bioinformatics 35, 5301–5302 (2019).
  • 81. Dong, M. J., Luo, H. & Gao, F. Ori-Finder 2022: a comprehensive web server for prediction and analysis of bacterial replication origins. Genom. Proteom. Bioinforma. 20, 1207–1213 (2022).
  • 82. Laslett, D. & Canback, B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 32, 11–16 (2004).
  • 83. Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 49, 9077–9096 (2021).
  • 84. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
  • 85. Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
  • 86. Minh, B. Q. et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 37, 1530–1534 (2020).
  • 87. Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
  • 88. Chambers, J. P. et al. Biophysical breakthroughs projected for the phage therapy of bacterial disease. Biophysica 4, 195–206 (2024).

Articles from npj Viruses are provided here courtesy of Nature Publishing Group
