PLOS Biology. 2021 Oct 27;19(10):e3001430. doi: 10.1371/journal.pbio.3001430

A phylogenomic framework for charting the diversity and evolution of giant viruses

Frank O Aylward 1,2,*, Mohammad Moniruzzaman 1, Anh D Ha 1, Eugene V Koonin 3
Editor: Curtis Suttle 4
PMCID: PMC8575486  PMID: 34705818

Abstract

Large DNA viruses of the phylum Nucleocytoviricota have recently emerged as important members of ecosystems around the globe that challenge traditional views of viral complexity. Numerous members of this phylum that cannot be classified within established families have recently been reported, and there is presently a strong need for a robust phylogenomic and taxonomic framework for these viruses. Here, we report a comprehensive phylogenomic analysis of the Nucleocytoviricota, present a set of giant virus orthologous groups (GVOGs) together with a benchmarked reference phylogeny, and delineate a hierarchical taxonomy within this phylum. We show that the majority of Nucleocytoviricota diversity can be partitioned into 6 orders, 32 families, and 344 genera, substantially expanding the number of currently recognized taxonomic ranks for these viruses. We integrate our results within a taxonomy that has been adopted for all viruses to establish a unifying framework for the study of Nucleocytoviricota diversity, evolution, and environmental distribution.


Giant viruses have transformed our understanding of viral complexity, but we lack a framework for examining their diversity in the biosphere. This study presents a phylogenomic resource for charting the diversity, ecology, and evolution of giant viruses.

Main text

Large double-stranded DNA viruses of the phylum Nucleocytoviricota are a diverse group of viruses with virion sizes reaching up to 1.5 μm and genome sizes up to 2.5 Mb, comparable to many bacteria and archaea as well as picoeukaryotes [1–5]. The recognized taxonomic ranks in this phylum currently include 2 classes, 5 orders, 7 families, and 41 genera. The viruses in the families Asfarviridae, Ascoviridae, Iridoviridae, and Poxviridae infect metazoans, whereas those in the families Marseilleviridae, Mimiviridae, and Phycodnaviridae primarily infect algae or heterotrophic unicellular eukaryotes [6–8]. Members of the Nucleocytoviricota span an exceptionally broad range of genome sizes, from below 100 kbp to more than 2.5 Mbp. Several comparative genomic analyses have documented the highly complex, chimeric nature of their genomes, in which numerous genes appear to have been acquired from diverse cellular lineages and other viruses [9–13]. These multiple, dynamic gene exchanges between viruses and their hosts [14–17] as well as the large phylogenetic breadth of this viral group [12,18,19] make the investigation of the evolution and taxonomic classification of the Nucleocytoviricota a challenging task. Despite these difficulties, early comparative genomic studies succeeded in identifying a small set of core genes that could be reliably used to produce phylogenies that encompass the entire diversity of Nucleocytoviricota, leading to the conclusion that all these viruses share common evolutionary origins [18,20].

Recent studies have reported numerous new Nucleocytoviricota genomes, many of which seem to represent novel lineages with only distant phylogenetic affinity for previously identified taxa [10,16,21]. For example, many viruses that infect a variety of protist genera have been discovered that are related to Mimiviridae but do not fall within the same clade as the canonical Acanthamoeba polyphaga mimivirus [9,22,23]. Moreover, numerous metagenome-assembled genomes (MAGs) have been reported that also appear to form novel sister clades to the Mimiviridae, Asfarviridae, and other families [10,16,21]. Uncertainty in the phylogenetic relationships within the Nucleocytoviricota is a major impediment to the ongoing efforts that seek to characterize the diversity of these viruses in the environment, as well as studies aiming to better understand the evolutionary origins of unique traits within this viral phylum. As more studies begin to chart the environmental diversity of Nucleocytoviricota, defining taxonomic groupings that encompass equivalent phylogenetic breadths will be critical for the exploration of the geographic and temporal variability in viral diversity and for comparing results from different studies. Moreover, the evolutionary origins of large genomes, virion sizes, and complex metabolic repertoires in many Nucleocytoviricota are of great interest, and ancestral state reconstructions and the tracking of horizontal gene transfers fully depend on a robust phylogenetic framework.

Here, we present a phylogenomic framework for charting the diversity and evolution of Nucleocytoviricota. We first assess the strength of the phylogenetic signals from different marker genes that are found in a broad array of distantly related viruses and arrive at a set of 7 genes that performs well in our benchmarking of concatenated protein alignments. Using this hallmark gene set, we then perform a large-scale phylogenetic analysis and clade delineation of the Nucleocytoviricota to produce a hierarchical taxonomy. Our taxonomy includes the established families Poxviridae, Asfarviridae, Iridoviridae, Phycodnaviridae, Marseilleviridae, and Mimiviridae as well as 26 proposed new family-level clades and 1 proposed new order. Sixteen of the families are represented only by genomes derived from cultivation-independent approaches, underscoring the enormous diversity of these viruses in the environment that have not yet been isolated. We integrate these family-level classifications into the broader hierarchical taxonomy of all viruses that has recently been adopted (i.e., a “megataxonomy” [3]) to arrive at a unified and hierarchical classification scheme for the entire phylum Nucleocytoviricota.

Results

Phylogenetic benchmarking of marker genes

We first generated a dataset of protein families to identify phylogenetic marker genes that are broadly represented across Nucleocytoviricota. To this end, we selected a set of 1,380 quality-checked Nucleocytoviricota genomes that encompassed all established families (S1 Data; see Methods). By clustering the protein sequences encoded in these genomes, we then generated a set of 8,863 protein families, which we refer to as giant virus orthologous groups (GVOGs). We examined 25 GVOGs that were represented in >70% of all genomes and ultimately arrived at a set of 9 GVOGs that were potentially useful for phylogenetic analysis, which is largely consistent with the previous studies that have identified phylogenetic marker genes in Nucleocytoviricota [19,20,24] (Table 1, Figs A-Y in S1 Text, see Methods for details; descriptions of the 25 GVOGs provided in S2 Data). These GVOGs included 5 genes that we have previously used for phylogenetic analysis of Nucleocytoviricota: the family B DNA Polymerase (PolB), A32-like packaging ATPase (A32), virus late transcription factor 3 (VLTF3), superfamily II helicase (SFII), and major capsid protein (MCP) [10]. In addition, this set included the large and small RNA polymerase subunits (RNAPL and RNAPS, respectively), the TFIIB transcriptional factor (TFIIB), and the Topoisomerase family II (TopoII).

Table 1. Broadly represented GVOGs used for phylogenetic benchmarking.

GVOG ID      Name     Annotation
GVOGm0003    MCP      NCLDV major capsid protein
GVOGm0013    SFII     DEAD/SNF2-like helicase
GVOGm0022    RNAPS    DNA-directed RNA polymerase beta subunit
GVOGm0023    RNAPL    DNA-directed RNA polymerase alpha subunit
GVOGm0054    PolB     DNA polymerase family B
GVOGm0172    TFIIB    Transcription initiation factor IIB
GVOGm0461    TopoII   DNA topoisomerase II
GVOGm0760    A32      Packaging ATPase
GVOGm0890    VLTF3    Poxvirus Late Transcription Factor VLTF3

We evaluated individual marker genes and concatenated marker sets using the Internode Certainty and Tree Certainty metrics (IC and TC, respectively), which provide a measure of the phylogenetic strength of each individual marker gene [25,26]. The TC values were highest for the RNAP subunits, PolB, and TopoII (Fig 1A), consistent with the view that, in most cases, longer genes carry a stronger phylogenetic signal, likely due to the larger number of phylogenetically informative characters. A similar observation has also been made for phylogenetic marker genes of bacteria and archaea [27]. The MCP marker had markedly lower TC values than PolB, TopoII, or either of the RNAP subunits; this is potentially because Nucleocytoviricota genomes often encode multiple copies of MCP, which complicates efforts to distinguish orthologs from paralogs (Fig 1A). This is especially true when using metagenome-derived genomes that are incomplete, because orthologous MCP copies may be missing even while paralogs are present. When this occurs, a paralogous MCP will have the best match to this protein family and will be included even if it has experienced distinct evolutionary pressures compared to the orthologous copy. SFII, TFIIB, A32, and VLTF3 showed lower TC values than the other 5 markers, but these were also the shortest marker genes and would not be expected to yield high quality phylogenies when used individually.
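As a rough illustration of the metrics discussed above, the following Python sketch computes IC for a single internal branch from the frequency of a bipartition and of its most prevalent conflicting bipartition, following the published definition of IC [25,26], and then averages IC over branches. The function names and the example counts are hypothetical, and normalizing TC by the number of internal branches (a relative TC, which would explain values below 1) is an assumption rather than a detail stated in the text.

```python
import math

def internode_certainty(support: int, conflict: int) -> float:
    """IC for one internal branch.

    support  -- number of replicate trees containing the bipartition
    conflict -- number containing its most prevalent conflicting bipartition
    """
    total = support + conflict
    if total == 0:
        raise ValueError("at least one observation is required")
    ic = 1.0  # log2(2): maximum entropy for two competing bipartitions
    for count in (support, conflict):
        p = count / total
        if p > 0:
            ic += p * math.log2(p)
    return ic

def relative_tree_certainty(branches) -> float:
    """Mean IC over internal branches (assumed normalization of TC)."""
    ics = [internode_certainty(s, c) for s, c in branches]
    return sum(ics) / len(ics)

# Hypothetical counts for two internal branches.
example = [(100, 0), (50, 50)]
print(relative_tree_certainty(example))
```

An IC of 1 indicates no conflict for a branch, whereas an IC of 0 indicates that the bipartition and its strongest competitor are equally frequent.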

Fig 1. Benchmarking of phylogenetic marker genes for Nucleocytoviricota.

(A) Dotplot of protein lengths for each of the 9 marker genes examined in detail. Blue dots represent proteins that were the best hit against marker gene HMMs and likely represent true orthologs, while red dots represent multiple copies of marker genes present in a genome. The TC scores of the markers are presented on the barplot on the right. (B) TC values for phylogenies made from concatenated alignments of different marker sets. Black text denotes markers we have used previously, red text denotes markers that we did not include in the final set, and blue text denotes additional markers used here compared to our original 5-gene set. Note that MCP was used in our original marker set but is excluded from the final 7-gene set. Protein lengths and TC values are provided in S2 Data. A32, A32-like packaging ATPase; HMM, Hidden Markov Model; MCP, major capsid protein; PolB, family B DNA Polymerase; RNAPL, large RNA polymerase subunit; RNAPS, small RNA polymerase subunit; SFII, superfamily II helicase; TC, Tree Certainty; TFIIB, TFIIB transcriptional factor; TopoII, Topoisomerase family II; VLTF3, virus late transcription factor 3.

Next, we sought to identify which marker genes provide for the best phylogenetic inference when used together in a concatenated alignment. If markers produce incongruent phylogenetic signals, they will yield trees with low TC values when concatenated, even if the individual phylogenetic strength of the markers is high [26]. We evaluated 8 marker gene sets in total. We began by assessing the TC of the 5-gene set that we have previously used [10]. Surprisingly, the TC of this set was lower than that of some individual markers (TC of 0.865; Fig 1B), suggesting that some of the markers provide incongruent signals. We surmised that this was most likely due to the MCP, given that the presence of multiple copies of this protein in some Nucleocytoviricota may complicate efforts to identify the appropriate ortholog to use for tree construction (Fig 1A). As we suspected, removal of MCP increased the TC of the concatenated tree (from 0.865 to 0.875) (Fig 1B). The addition of the RNAPS, RNAPL, TFIIB, or TopoII markers to the 4-gene set increased the TC (Fig 1B), although a 7-gene marker set that excluded RNAPS performed best overall (TC of 0.898). The existence of RNAPS paralogs has been observed before [23], and it is likely that this is the cause of the lower TC value when using this marker. Overall, the 7-gene marker set represents an improvement over the initial 5-gene set, and we therefore used these genes for subsequent phylogenetic analysis and clade demarcation. Importantly, the benchmarking results we present here are specific to the genome set that we analyzed, and the use of MCP and RNAPS as phylogenetic markers may still be useful in other contexts. For example, when analyzing only complete genomes the presence of multiple paralogous copies of these genes may be less problematic.

A hierarchical taxonomy for Nucleocytoviricota

The best-quality phylogenetic tree produced with the 7-gene marker set could be broadly divided into 2 class-level and 6 order-level clades, 5 of which were consistent with the orders in the recently adopted megataxonomy of viruses (Fig 2) [3]. The Chitovirales and Asfuvirales orders, which respectively contain the Poxviridae and Asfarviridae, formed a distinct group with a long stem branch (class Pokkesviricetes) that we used to root the tree, consistent with previous studies [20,28]. The Pimascovirales, which includes Pithoviruses, Marseilleviruses, and Iridoviridae/Ascoviridae, also formed a highly supported monophyletic group. The current order Algavirales, which includes the Phycodnaviridae, Chloroviruses, Pandoraviruses, Molliviruses, Prasinoviruses, and Coccolithoviruses, was paraphyletic, and we split this order into 2 groups based on their placement in the phylogeny. In the proposed taxonomy, we retain the existing Algavirales name for the clade that contains the Chloroviruses and Prasinoviruses and additionally propose the order pandoravirales for the group that includes the Pandoraviruses and Coccolithoviruses. The Imitervirales, which contain the Mimiviridae, formed a sister group to the Algavirales.

Fig 2. Phylogeny of Nucleocytoviricota based on the 7-gene marker gene set that had the highest TC value of those tested.

The phylogeny was inferred using the LG+I+F+G4 model in IQ-TREE. Solid circles denote IC values >0.5. Families are denoted by collapsed clades, with their nonredundant identifier provided at their right. The number of genomes in each clade is provided in brackets. Established family names are provided in bold italics, and proposed names are provided in lowercase. The presence of notable cultivated viruses is provided in bold next to some clades. AaV, Aureococcus anophagefferens virus; ChoanoV1, Choanoflagellate virus; CpV, Chrysochromulina parva virus; HaV, Heterosigma akashiwo virus; IC, Internode Certainty; PgV, Phaeocystis globosa virus; TC, Tree Certainty; TetV, Tetraselmis virus.

From our reference tree, we delineated taxonomic levels using the relative evolutionary divergence (RED) of each clade as a guide, using an approach similar to the one recently employed for bacteria and archaea [29]. RED values vary between 0 and 1, with lower values denoting phylogenetically broad groups that branch closer to the root and higher values denoting phylogenetically shallow groups that branch closer to the leaves. The RED of the Nucleocytoviricota classes ranges from 0.017 to 0.032, whereas the values for the orders range from 0.158 to 0.240 (Fig 3A and S3 Data). We delineated family- and genus-level clades so that they had nonoverlapping RED values that were higher than their next-highest taxonomic rank (Fig 3A). This approach yielded clades that were consistent with families and genera currently recognized by the International Committee on Taxonomy of Viruses (ICTV; [30]), such as the Chlorovirus, Prasinovirus, and Mimiviridae (see below; full classification information in S1 Data). To ensure that putative families were not defined by spurious placement of individual genomes, we accepted only groups with ≥3 members and left other genomes in the tree as singletons with incertae sedis as the family identifier. This approach yielded a total of 32 families, not including 22 singleton genomes that potentially represent additional families and are listed as incertae sedis here. We provided tentative genus-level identifiers for all genomes, leading to 344 total genera (Figs 2 and 3). Of these, 213 genera contain only a single representative, and additional merging or splitting of these groups may be necessary as more genomes become available and fine-scale phylogenetic patterns are clarified.
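To make the RED-guided delineation above more concrete, the following Python sketch assigns RED values to the nodes of a small rooted tree and applies the minimum-size rule for families. It assumes the recursive definition used for bacteria and archaea [29], in which the root has RED 0, leaves have RED 1, and an internal node takes the value p + (d/u)(1 - p), where p is the RED of its parent, d is its branch length, and u is the mean distance from the parent to the leaves below the node. The `Node` class, the function names, and the toy tree are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    length: float = 0.0                      # branch length to the parent
    children: list = field(default_factory=list)

def leaf_distances(node):
    """Path lengths from `node` down to each of its descendant leaves."""
    if not node.children:
        return [0.0]
    return [c.length + d for c in node.children for d in leaf_distances(c)]

def assign_red(node, parent_red=None, out=None):
    """Return {node name: RED}; the root gets 0 and every leaf gets 1."""
    out = {} if out is None else out
    if parent_red is None:                   # root of the tree
        red = 0.0
    elif not node.children:                  # leaf
        red = 1.0
    else:
        # Mean distance from the parent to the leaves below this node.
        below = [node.length + d for d in leaf_distances(node)]
        u = sum(below) / len(below)
        red = parent_red + (node.length / u) * (1.0 - parent_red) if u > 0 else parent_red
    out[node.name] = red
    for child in node.children:
        assign_red(child, red, out)
    return out

def family_label(clade_id, members, min_members=3):
    """Accept a family only if it has at least `min_members` genomes."""
    return clade_id if len(members) >= min_members else "incertae sedis"

# Toy tree: one internal clade with two leaves plus one long-branch leaf.
tree = Node("root", children=[
    Node("cladeA", 1.0, [Node("a1", 1.0), Node("a2", 1.0)]),
    Node("b", 2.0),
])
red = assign_red(tree)
print(red["cladeA"])
print(family_label("IM_01", ["g1", "g2", "g3"]), family_label("X", ["g4"]))
```

In this toy tree the internal clade sits halfway between root and leaves; clades with fewer than 3 genomes fall back to incertae sedis, mirroring the rule described above.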

Fig 3. Summary of the Nucleocytoviricota taxonomy.

(A) RED values for Nucleocytoviricota classes, orders, families, and genera. (B) Treemap diagram of the Nucleocytoviricota in which orders and families are shown. The area of each rectangle is proportional to the number of genomes in the respective taxon. RED values can be found in S3 Data. RED, relative evolutionary divergence.

Of the 32 families, 6 correspond to the families currently recognized by the ICTV, for which we retained the existing nomenclature (Asfarviridae, Poxviridae, Marseilleviridae, Iridoviridae, Phycodnaviridae, and Mimiviridae). The Ascoviridae are included within the Iridoviridae, and so we use the latter family name here. In addition, we propose 6 family names here: “prasinoviridae,” which include the prasinoviruses, “pandoraviridae,” which include the Pandoraviruses and Mollivirus sibericum, “coccolithoviridae,” which include the coccolithoviruses, “pithoviridae,” which include Pithoviruses, Cedratviruses, and Orpheoviruses, “mesomimiviridae,” which include several haptophyte viruses previously defined as “extended Mimiviridae,” and “mininucleoviridae,” which have previously been described and include several viruses of Crustacea [31]. Some of these family names have been used previously, such as pandoraviridae and mininucleoviridae, but so far have not been formally recognized by the ICTV. For other proposed families, we provide nonredundant identifiers corresponding to their order, and we anticipate that future studies will provide information for selecting appropriate family names once more is learned about the host ranges and molecular traits of these viruses. Two of the families contained only a single cultivated representative (AG_04 and IM_09), whereas 16 families included none.

Notably, the Imitervirales contain 11 families, as well as 4 singleton viruses that potentially represent additional family-level clades. This underscores the vast diversity of the large viruses in this group, which is consistent with the results of several studies reporting an enormous diversity of Mimiviridae-like viruses in the biosphere, in particular in aquatic environments [10,32–34]. Other studies have suggested additional nomenclature to refer to these Mimiviridae-like viruses, such as the “extended Mimiviridae” and the subfamilies Mesomimivirinae or Megamimivirinae, but our results suggest that an extensive array of new families is warranted within Imitervirales, given the broad genomic and phylogenetic diversity within this group. Several of the proposed new families contain representatives that have recently been described; IM_12 contains the Tetraselmis virus (TetV), which encodes several fermentation genes [11], IM_09 contains Aureococcus anophagefferens virus (AaV), which is thought to play an important role in brown tide termination [35], and IM_08 contains a virus of Choanoflagellates [36] (Fig 2). Family IM_01 contains cultivated viruses that infect haptophytes of the genera Chrysochromulina and Phaeocystis, which were previously proposed to be classified in the subfamily Mesomimivirinae [23]. We propose the name mesomimiviridae to denote the family-level status of this lineage, while still retaining reference to this original name. Notably, the mesomimiviridae include by far the largest total number of genomic representatives in our analysis (n = 655, including 652 MAGs; Figs 2 and 3B), the vast majority of which are derived from aquatic environments (Fig Z in S1 Text), suggesting that members of this family are important components of global freshwater and marine ecosystems. Within the Mimiviridae, we recovered 3 clades that correspond to previously proposed subfamilies. One of these clades contains Klosneuviruses and corresponds to the proposed subfamily Klosneuvirinae [37]; this subfamily also includes Bodo saltans virus as well as several genomes recovered from forest soils [38,39]. The second clade corresponds to the subfamily Megamimivirinae and includes A. polyphaga mimivirus, Tupanviruses, and Megavirus chilensis, among others [40–42]. Lastly, we recovered a clade that includes Cafeteria roenbergensis virus [9], several “PacV” viruses obtained from flow sorting and sequencing of marine samples [43], and a variety of MAGs.

All families within the Imitervirales except one included members with genome sizes >500 kbp, highlighting the “giant” genomes that are characteristic of this lineage (Fig 4A). Genes involved in translation, including tRNA synthetases and translation initiation factors, were consistently highly represented in the Imitervirales, showing that the rich complement of these genes that has been described for the Mimiviridae is broadly characteristic of other families in this order (Fig 4B) [40,42]. Throughout the Imitervirales, genes involved in glycolysis and the TCA cycle, cytoskeleton components such as viral-encoded actin, myosin, and kinesin proteins, and nutrient transporters including those that target ammonia and phosphate were also common (Fig 4B) [10,44–46], underscoring the complex functional repertoires of this virus order.

Fig 4. Genomic characteristics of the Nucleocytoviricota.

(A) Violin plot showing the genome size distribution across the Nucleocytoviricota families. The dashed gray line denotes 500 kbp. (B) Bubble plot showing the percent of total proteins in each family that could be assigned to GVOGs that belonged to particular functional categories (details in S2 Data). GVOG, giant virus orthologous group.

The Algavirales is a sister lineage to the Imitervirales that contains 4 families encompassing several well-studied algal viruses. The Prasinoviridae (AG_01) is a family that includes viruses known to infect the prasinophyte genera Bathycoccus, Micromonas, and Ostreococcus [8], and cultivation-independent surveys have provided evidence that the MAGs in this clade are also associated with prasinophytes [46]. Similarly, our approach yielded a well-defined Phycodnaviridae family (AG_02) composed mostly of chloroviruses, consistent with the similar host range of these viruses [47]. All 4 families of the Algavirales have smaller genome sizes compared to the Imitervirales (Fig 4A), but there were still several similarities in their encoded functional repertoires. As noted previously [10,17,36], genes involved in light sensing, including rhodopsins and chlorophyll-binding proteins, were common across the Imitervirales and Algavirales, perhaps because many of the viruses are found in sunlit aquatic environments where manipulation of host light sensing during infection is advantageous. Moreover, genes involved in nutrient transport, translation, and even some components of glycolysis and the TCA cycle were found in the Algavirales, consistent with the complex repertoires of metabolic genes that have been reported for some of these viruses despite their relatively small genome sizes [48,49].

The pandoravirales, a new order we propose here, consists of 4 families, including the pandoraviridae and the coccolithoviridae. The pandoraviridae (PV_04) include Mollivirus sibericum as well as the pandoraviruses, which possess the largest viral genomes known [50]. Grouping of these viruses together in the same family is consistent with previous studies that have shown that M. sibericum and the Pandoraviruses have shared ancestry [51,52], and with comparative genomic analyses showing that they all encode a unique duplication in the glycosyl hydrolase that has been co-opted as a major virion protein in the Pandoraviruses [53]. The coccolithoviridae (PV_05) are composed mostly of viruses that infect the marine coccolithophore Emiliania huxleyi; although much smaller than the genomes of the Pandoraviruses, genomes of cultivated representatives of this family exceed 400 kbp and encode diverse functional repertoires including sphingolipid biosynthesis genes [54].

Although most orders contained primarily genomes that could be readily grouped into families, the pandoravirales also included 15 singleton genomes out of the 37 total. This is potentially due to the lack of adequate genome sampling in this group, which would result in many distinct lineages represented by only individual genomes. If this is the case, more well-defined families will become evident as additional genomes are sequenced. Alternatively, the lack of clearly defined families could result from longer branches in this group that obfuscate the clustering of well-defined groups. The Medusavirus, which is included in this order, encodes a divergent PolB marker gene that is likely the result of gene transfer with a eukaryotic homolog [55]. Frequent gene transfers among phylogenetic marker genes might be another explanation for the presence of many long branches in the pandoravirales clade.

The Pimascovirales encompass 10 families including the Iridoviridae (PM_02), Marseilleviridae (PM_05), and Pithoviridae (PM_07) and notably include both Pithovirus sibericum, which has the largest viral capsid currently known (1.5 μm [56]), and crustacean viruses in the family Mininucleoviridae (PM_10), which possess the smallest genomes recorded for any Nucleocytoviricota (67 to 71 kbp [31]). The Mininucleoviridae have highly degraded genomes that lack several phylogenetic marker genes. Although they can be classified within the Pimascovirales with high confidence, their relationship to other families is uncertain, and we therefore placed them in a polytomous node at the base of this order (Fig 2). The uncharacterized family PM_01 contains the largest number of genomes (n = 64) within this order, all of which are MAGs. The majority of these MAGs were derived from aquatic metagenomes, and some have been recovered in marine metatranscriptomes [46], suggesting that they play an important but currently unknown role in marine systems. Overall, the repertoires of encoded proteins in the Pimascovirales were notably different from the Imitervirales, pandoravirales, and Algavirales; while cytoskeleton components, nutrient transporters, light sensing genes, and central carbon metabolism components were prevalent in the latter 3 orders, they were largely absent in the Pimascovirales (Fig 4B). Conversely, histone components appeared to be more prevalent in the Pimascovirales; indeed, the histones encoded in marseilleviruses have recently become a model for understanding their structure and interactions with viral DNA [57,58]. Genes involved in translation and lipid metabolism were present in the Pimascovirales in addition to most other orders.

In addition to the families that fall within the established orders, we also identified several lineages or individual genomes that may represent novel taxonomic ranks (Fig 2). One of these groups consists of 3 genomes and is basal-branching to the Pokkesviricetes class; we refer to it as Pokkesviricetes incertae sedis (Fig 2). The basal-branching placement of this group suggests that it might comprise a new class that is a sister group to the Pokkesviricetes. The placement of this lineage remains tentative, however, and to clarify evolutionary relationships within the Nucleocytoviricota, further phylogenetic work with additional genomes will be necessary both for this lineage and for other putative novel taxa that are represented by individual genomes.

Discussion

Although only 6 families of Nucleocytoviricota have been established to date, recent cultivation-independent studies have revealed a vast diversity of these viruses in the environment, and their classification, together with cultivated representatives, has remained challenging. Here, we present a unified taxonomic framework based on a benchmarked set of phylogenetic marker genes that establishes a hierarchical taxonomy of Nucleocytoviricota. This taxonomy encompasses 6 orders and 32 families, including 1 order and 26 families we propose here. Remarkably, the Imitervirales contain 11 families, including the Mimiviridae, underscoring the vast diversity of large viruses within this order. This framework substantially increases the total number of Nucleocytoviricota families, and we expect that the number will continue to increase as new genomes are incorporated. In particular, we identified 22 singleton genomes that likely represent additional families, the status of which will be clarified as more genomes become available.

We anticipate that the phylogenetic and taxonomic framework we develop here will be a useful community resource for several future lines of inquiry into the biology of Nucleocytoviricota. Firstly, the GVOGs are a large set of viral protein families constructed using many recently produced Nucleocytoviricota MAGs, and they will likely be useful for the genome annotation and the examination of trends in gene content across viral groups. Secondly, the reference phylogeny we present will facilitate work that delves into ancestral Nucleocytoviricota lineages, examines the timing and nature of gene acquisitions, and classifies newly discovered viruses. For example, giant viral genomes (>500 kbp) evolved independently in multiple orders, and future studies that examine the similarities and differences in these genome expansion events will be important for pinpointing the driving forces of viral gigantism. Lastly, analysis of the environmental distribution of different taxonomic ranks of Nucleocytoviricota across Earth’s biomes will be an important direction for future work that reveals prominent biogeographic patterns and helps to clarify the ecological impact of these viruses.

Methods

Nucleocytoviricota genome set

We compiled a set of Nucleocytoviricota genomes that included MAGs as well as genomes of cultured isolates. For this, we first downloaded all MAGs available from several recent studies [10,16,21]. We also included all Nucleocytoviricota genomes available in NCBI RefSeq as of June 1, 2020. Lastly, we included several Nucleocytoviricota genomes from select publications that were not yet available in NCBI, such as the cPacV, ChoanoV, Pyramimonas orientalis virus O1B (MT663543), and AbALV viruses that have recently been described [15,36,43,59]. After compiling this set, we dereplicated the genomes, since the presence of highly similar or identical genomes is not necessary for broad-scale phylogenetic inference. For dereplication, we compared all genomes against each other using MASH v. 2.0 [60] (“mash dist” parameters -k 16 and -s 300) and clustered genomes together using single-linkage clustering, with all genomes with a MASH distance of ≤0.05 linked together. The MASH distance of 0.05 was chosen since it has been found to correspond roughly to an average nucleotide identity (ANI) of 95% [60]; although gene flow can occur over a broad range of genome identity values [61], this is still a useful threshold for genome dereplication. From each cluster, we chose the genome with the highest N50 contig length as the representative. We then decontaminated the genomes through analysis with ViralRecall v.2.0 [62] (-c parameter), with all contigs with negative scores removed on the grounds that they represent non-Nucleocytoviricota contamination or highly unusual gene composition that cannot be validated by our present knowledge of Nucleocytoviricota genomic content. We only considered contigs >10 kbp, given the inherent difficulty in eliminating contamination derived from short contigs. To ensure that we only used genomes that could be placed in a phylogeny, we then screened the genome set and retained only those with a PolB marker and 3 of the 4 markers A32, SFII, VLTF3, and MCP, consistent with our previous methodology [10]. After this, we arrived at a set of 1,380 genomes, including 1,253 MAGs and 127 complete genomes of cultivated viruses.
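The dereplication and screening logic described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the pairwise distances are taken as already computed (for example, parsed from “mash dist” output), the function names and toy values are hypothetical, and “3 of the 4 markers” is interpreted as at least 3.

```python
from collections import defaultdict

def dereplicate(distances, n50, threshold=0.05):
    """Single-linkage clustering; keep the highest-N50 genome per cluster.

    distances -- dict mapping (genome_a, genome_b) to a pairwise distance
    n50       -- dict mapping each genome to its N50 contig length
    """
    parent = {g: g for g in n50}

    def find(g):
        # Union-find lookup with path compression.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), dist in distances.items():
        if dist <= threshold:          # link any pair at or below the cutoff
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for g in n50:
        clusters[find(g)].append(g)
    return sorted(max(members, key=lambda g: n50[g]) for members in clusters.values())

def passes_marker_screen(markers):
    """Require PolB plus at least 3 of A32, SFII, VLTF3, and MCP (assumed reading)."""
    others = {"A32", "SFII", "VLTF3", "MCP"}
    return "PolB" in markers and len(others & set(markers)) >= 3

# Toy input: g1-g2 and g2-g3 are linked, so g1, g2, g3 form one cluster.
distances = {("g1", "g2"): 0.01, ("g2", "g3"): 0.04,
             ("g1", "g3"): 0.08, ("g3", "g4"): 0.30}
n50 = {"g1": 50_000, "g2": 120_000, "g3": 80_000, "g4": 30_000}
representatives = dereplicate(distances, n50)
print(representatives)
```

Note that with single linkage, g1 and g3 end up in the same cluster even though their direct distance exceeds the threshold, because both are linked through g2.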

GVOG construction

To construct GVOGs, we first predicted proteins from all genomes using Prodigal v. 2.6.2. Proteins that did not have a recognizable start or stop codon at the ends of contigs were removed on the grounds that they may represent fragmented genes and obfuscate orthologous group (OG) predictions. We then calculated OGs using Proteinortho v. 6.06 [63] (parameters -e=1e-5 --identity=25 -p=blastp+ --selfblast --cov=50 -sim=0.80). We constructed Hidden Markov Models (HMMs) from proteins by aligning them with Clustal Omega v1.2.3 [64] (default parameters), trimming the alignment with trimAl v1.4.rev15 [65] (parameters -gt 0.1), and generating the HMM from the trimmed alignment with hmmbuild in HMMER v3.3 [66]. The goal of this analysis was to identify broad-level protein families, and we therefore sought to merge HMMs that bore similarity to each other and therefore derived from related protein families. For this, we then compared the proteins in each OG to the HMM of every other OG (hmmsearch -E 1e-20 --domtblout option, hits retained only if 30% of the query protein aligned to the HMM). In cases where >50% of the proteins in one OG also had hits to the HMM of another OG, and vice versa, we then merged all of the proteins together and constructed a new merged HMM from the full set of proteins. The final set contained 8,863 HMMs, and we refer to these as the GVOGs. To provide annotations for GVOGs, we compared all of the proteins in each GVOG to the EggNOG 5.0 [67], Pfam [68], and NCVOG databases [69] (hmmsearch, -E 1e-3). For NCVOGs, we obtained protein sequences from the original NCVOG study and generated HMMs using the same methods we used for GVOGs. Annotations were assigned to a GVOG if >50% of the proteins used to make a GVOG had hits to the same HMM in one of these databases. Details regarding all GVOGs and their annotations can be found in S2 Data.
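
The reciprocal merging criterion described above can be sketched in Python as follows. This is an illustration rather than the pipeline used in the study: the function names and toy identifiers are hypothetical, the hit sets are assumed to have already been filtered by E-value and alignment coverage, and merging transitively through chains of reciprocal pairs is an assumption that the text does not specify.

```python
def reciprocal_merge_pairs(og_members, hits, min_fraction=0.5):
    """Find OG pairs in which >min_fraction of each OG's proteins hit the other's HMM.

    og_members: dict mapping an OG name to the set of its protein IDs.
    hits: dict mapping (source OG, target OG) to the set of source-OG proteins
          with a retained hit to the target OG's HMM.
    """
    def fraction(src, dst):
        members = og_members[src]
        return len(hits.get((src, dst), set()) & members) / len(members)

    ogs = sorted(og_members)
    pairs = []
    for i, a in enumerate(ogs):
        for b in ogs[i + 1:]:
            if fraction(a, b) > min_fraction and fraction(b, a) > min_fraction:
                pairs.append((a, b))
    return pairs


def merge_ogs(og_members, pairs):
    """Pool the proteins of reciprocally linked OGs (transitive merging is assumed)."""
    parent = {og: og for og in og_members}

    def find(og):
        while parent[og] != og:
            parent[og] = parent[parent[og]]
            og = parent[og]
        return og

    for a, b in pairs:
        parent[find(a)] = find(b)

    pooled = {}
    for og, members in og_members.items():
        pooled.setdefault(find(og), set()).update(members)
    return sorted(sorted(members) for members in pooled.values())


# Hypothetical toy data.
og_members = {
    "OG1": {"p1", "p2", "p3", "p4"},
    "OG2": {"q1", "q2"},
    "OG3": {"r1", "r2", "r3"},
}
hits = {
    ("OG1", "OG2"): {"p1", "p2", "p3"},        # 75% of OG1 hits the OG2 HMM
    ("OG2", "OG1"): {"q1", "q2"},              # 100% of OG2 hits the OG1 HMM
    ("OG1", "OG3"): {"p1", "p2", "p3", "p4"},  # 100% of OG1 hits the OG3 HMM
    ("OG3", "OG1"): {"r1"},                    # only 33% of OG3 hits the OG1 HMM
}

pairs = reciprocal_merge_pairs(og_members, hits)
merged = merge_ogs(og_members, pairs)
print(pairs)
print(merged)
```

OG1 and OG3 are not merged in this example because the criterion is satisfied in only one direction, which is what the "and vice versa" condition is meant to prevent.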

Benchmarking phylogenetic marker genes for Nucleocytoviricota

To identify phylogenetic markers for Nucleocytoviricota, we cataloged GVOGs that were broadly represented in the 1,380 viral genomes that we used for benchmarking. We searched all proteins encoded in the genomes against the GVOG HMMs using hmmsearch (e-value cutoff 1e-10) and identified a set of 25 GVOGs that were found in >70% of the genomes in our set (hmmsearch, -E 1e-5). We constructed individual phylogenetic trees of these protein families to assess their individual evolutionary histories. For individual phylogenetic trees, we calibrated bit score cutoffs so that poorly matching proteins would not be included. These cutoffs were generally equivalent to the fifth percentile score of all of the best protein matches for each genome. We then examined several features of these trees. Firstly, we only considered GVOGs present in all established families that would therefore be useful as universal or nearly universal phylogenetic markers. Secondly, we examined each tree individually to assess the degree to which taxa from different orders clustered together in distinct monophyletic groups, which was taken as a signature of HGT. High levels of gene transfer would produce topologies incongruent with other marker genes and therefore compromise the reliability of a given marker when used on a concatenated alignment. For individual marker gene trees, we aligned proteins from each GVOG using Clustal Omega, trimmed the alignment using trimAl (-gt 0.1 option), and constructed the phylogeny using IQ-TREE with ultrafast bootstraps calculated (-m TEST, -bb 1000, -wbt options).
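
As a rough illustration of the two numerical filters mentioned above, the following Python sketch selects GVOGs found in >70% of genomes and computes a fifth-percentile bit score from the best match per genome. The function names and toy values are hypothetical, and the nearest-rank percentile definition is an assumption; the cutoffs in the study were calibrated per marker and were only generally equivalent to the fifth percentile.

```python
import math


def prevalent_gvogs(presence, n_genomes, min_fraction=0.70):
    """Return GVOGs detected in more than min_fraction of the genomes.

    presence: dict mapping a GVOG name to the set of genome IDs with at least one hit.
    """
    return sorted(
        gvog for gvog, genomes in presence.items()
        if len(genomes) / n_genomes > min_fraction
    )


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical toy data: 10 genomes and two candidate GVOGs.
genome_ids = [f"genome_{i}" for i in range(10)]
presence = {
    "OG_A": set(genome_ids[:9]),  # found in 9 of 10 genomes
    "OG_B": set(genome_ids[:7]),  # found in 7 of 10 genomes, not strictly >70%
}
candidates = prevalent_gvogs(presence, n_genomes=len(genome_ids))

# Hypothetical best bit score per genome for one marker (40 genomes).
best_scores = list(range(100, 140))
cutoff = percentile_nearest_rank(best_scores, 5)
print(candidates, cutoff)
```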

We arrived at a set of 9 GVOGs that met the criteria described above and could potentially serve as robust phylogenetic markers (Table 1). We evaluated the phylogenetic strength of these markers individually using the recently developed TC and IC metrics. These metrics are an alternative to the traditional bootstrap because they take into account the frequency of contrasting bipartitions and can therefore be viewed as a measure of the phylogenetic strength of a gene [25,26]. We generated alignments using Clustal Omega, trimmed with trimAl, and generated trees with IQ-TREE v1.6.9 [70] with ultrafast bootstraps [71] (parameters -wbt -bb 1000 -m LG+I+G4). We calculated TC and IC values in RAxML v8.2.12 (-f i option, ultrafast bootstraps used with the -z flag) [72]. We also evaluated the TC and IC values of trees generated from concatenated alignments. To construct concatenated alignments, we used the python program “ncldv_markersearch.py” that we developed for this purpose: https://github.com/faylward/ncldv_markersearch.
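
To make the IC and TC metrics concrete, the following Python sketch computes IC for a single internode from the frequencies of the most prevalent bipartition and its most prevalent conflicting bipartition among the bootstrap trees, following the two-bipartition definition in [25,26], and sums IC over internodes to obtain TC. It is a simplified illustration and not a re-implementation of RAxML: the function names and the toy frequencies are hypothetical, and refinements that RAxML applies are ignored.

```python
import math


def internode_certainty(f_main, f_conflict):
    """IC = 1 + p1*log2(p1) + p2*log2(p2) for the two most prevalent conflicting bipartitions."""
    total = f_main + f_conflict
    if total <= 0:
        raise ValueError("at least one bipartition must be observed")
    ic = 1.0  # log2(2): the maximum entropy of two competing bipartitions
    for freq in (f_main, f_conflict):
        p = freq / total
        if p > 0:
            ic += p * math.log2(p)
    return ic


def tree_certainty(internodes):
    """TC is the sum of IC values over all internodes of the tree."""
    return sum(internode_certainty(f1, f2) for f1, f2 in internodes)


# Hypothetical bipartition counts from 1,000 bootstrap trees for three internodes.
internodes = [(1000, 0), (750, 250), (500, 500)]
ic_values = [internode_certainty(f1, f2) for f1, f2 in internodes]
tc = tree_certainty(internodes)
print(ic_values, tc)
```

An internode with no conflict has an IC of 1, and an internode whose two competing bipartitions are equally frequent has an IC of 0, so a bipartition supported in 50% of bootstrap trees can be either well or poorly resolved depending on how the remaining support is distributed.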

For the final tree used for clade demarcation, we ran IQ-TREE 5 times using the parameters “-m LG+F+I+G4 -bb 1000 -wbt,” and we chose the resulting tree with the highest TC value for subsequent clade demarcation and RED calculation. Three genomes in the Mininucleoviridae family were included in the final tree but were not used for the benchmarking analysis because they have been shown to have highly degraded genomes that are not necessarily representative of Nucleocytoviricota more broadly [31]. Moreover, the MAG ERX555967.47 was found to have highly variable placement in different orders in different trees we analyzed, and we therefore did not include this genome in the final tree on the grounds that it represented a rogue taxon that may reduce overall tree quality [73]. We rooted the final tree between the Pokkesviricetes and Megaviricetes, consistent with previous studies [6,28]. We placed the 3 genomes of Pokkesviricetes incertae sedis adjacent to the Pokkesviricetes clade due to the clustering of several GVOGs of this group with members of the Pokkesviricetes (SFII: Fig C in S1 Text, PolB: Fig I in S1 Text).

Family delineation and nomenclature

We calculated RED values in R using the get_reds function in the package “castor” [74]. As input, we used a rooted tree derived from the 7-gene marker set described above. For the Poxviridae, Asfarviridae, Iridoviridae, Phycodnaviridae, Marseilleviridae, Mininucleoviridae, and Mimiviridae, we retained existing nomenclature, and clades were assigned these names based on the initially characterized viruses that were assigned to these families. For example, the Phycodnaviridae was assigned to AG_02 because the chloroviruses within this clade were the first-described members of this family, while the prasinoviruses were assigned to a new family, although they are commonly referred to as Phycodnaviridae. Similarly, Mimiviridae was assigned based on the placement of A. polyphaga mimivirus, Iridoviridae was assigned based on the placement of Invertebrate iridescent virus 6, Asfarviridae was assigned to the clade containing African swine fever virus (ASFV), and Marseilleviridae was assigned to the clade containing the marseilleviruses. The treemap visualization was generated using the R package “treemap.”
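
For readers unfamiliar with RED, the following Python sketch shows how such values can be derived from a rooted tree using the recurrence described in [29]: the root is assigned 0, tips are assigned 1, and each internal node receives p + (d/u)(1 − p), where p is the RED of its parent, d is the branch length to the parent, and u is the mean distance from the parent to the tips descended from the node. This is an independent illustration and not the castor implementation; the Node class, the function names, and the toy tree are hypothetical.

```python
class Node:
    """Minimal rooted-tree node; length is the branch length to the parent."""

    def __init__(self, name, length=0.0, children=None):
        self.name = name
        self.length = length
        self.children = children or []


def tip_stats(node, stats):
    """Post-order pass storing (tip count, summed node-to-tip distance) per node."""
    if not node.children:
        stats[node.name] = (1, 0.0)
        return stats[node.name]
    n_tips, dist_sum = 0, 0.0
    for child in node.children:
        c_tips, c_sum = tip_stats(child, stats)
        n_tips += c_tips
        dist_sum += c_sum + c_tips * child.length
    stats[node.name] = (n_tips, dist_sum)
    return stats[node.name]


def red_values(root):
    """Pre-order pass assigning RED: 0 at the root, 1 at tips, interpolated in between."""
    stats = {}
    tip_stats(root, stats)
    red = {root.name: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        p = red[parent.name]
        for child in parent.children:
            if not child.children:
                red[child.name] = 1.0
                continue
            n_tips, dist_sum = stats[child.name]
            # Mean distance from the parent to the tips below this child.
            u = child.length + dist_sum / n_tips
            red[child.name] = p + (child.length / u) * (1.0 - p) if u > 0 else p
            stack.append(child)
    return red


# Hypothetical toy tree: (((A:1,B:1)Y:1,D:2)X:2,C:4)R;
tree = Node("R", children=[
    Node("X", 2.0, [
        Node("Y", 1.0, [Node("A", 1.0), Node("B", 1.0)]),
        Node("D", 2.0),
    ]),
    Node("C", 4.0),
])

red = red_values(tree)
print(red["X"], red["Y"])
```

In this toy tree, node X sits halfway between the root and its tips (RED 0.5) and node Y sits three quarters of the way (RED 0.75), which illustrates how clades with similar RED values can be treated as candidates for the same taxonomic rank.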

Supporting information

S1 Text. Supporting figures.

Fig A. Major Capsid Protein GVOGm0003 phylogeny. Fig B. Disulfide (thiol) oxidoreductase GVOGm0004 phylogeny. Fig C. Superfamily II helicase GVOGm0013 phylogeny. Fig D. Patatin phospholipase GVOGm0018 phylogeny. Fig E. DEAD/SNF2-like helicase GVOGm0020 phylogeny. Fig F. DNA-directed RNA polymerase subunit beta (RNAPS) GVOGm0022 phylogeny. Fig G. DNA-directed RNA polymerase subunit alpha (RNAPL) GVOGm0023 phylogeny. Fig H. mRNA capping enzyme GVOGm0036 phylogeny. Fig I. DNA polymerase family B GVOGm0054 phylogeny. Fig J. TATA box binding protein (TBP) GVOGm0056 phylogeny. Fig K. Ribonucleoside diphosphate reductase, alpha subunit GVOGm0088 phylogeny. Fig L. D5-like helicase-primase GVOGm0095 phylogeny. Fig M. Uncharacterized, C-terminal domain GVOGm0115 phylogeny. Fig N. Uncharacterized protein GVOGm0152 phylogeny. Fig O. Transcription initiation factor IIB GVOGm0172 phylogeny. Fig P. RuvC, Holliday junction resolvases (HJRs) GVOGm0189 phylogeny. Fig Q. Ubiquitin carboxyl-terminal hydrolase GVOGm0214 phylogeny. Fig R. Proliferating cell nuclear antigen GVOGm0239 phylogeny. Fig S. DNA topoisomerase II GVOGm0461 phylogeny. Fig T. Divergent DNA-directed RNA polymerase subunit 5 GVOGm0694 phylogeny. Fig U. Packaging ATPase GVOGm0760 phylogeny. Fig V. Metallopeptidase WLM GVOGm0787 phylogeny. Fig W. Ribonuclease III GVOGm0798 phylogeny. Fig X. Virus Late Transcription Factor 3 VLTF3 GVOGm0890 phylogeny. Fig Y. Ribonucleotide reductase small subunit GVOGm1574 phylogeny. Fig Z. Barchart of source habitats for the Nucleocytoviricota families. Full information is provided in S1 Data.

(PDF)

S1 Data. Taxonomy, genome statistics, and other metadata for the Nucleocytoviricota genomes analyzed in this study.

(XLSX)

S2 Data. Statistics and descriptions of the 25 GVOGs present in 70% of the genomes analyzed.

Full annotations of all GVOGs are also provided, and TC values for the trees of this study.

(XLSX)

S3 Data. RED values for taxonomic ranks presented in this study.

(XLSX)

Acknowledgments

We acknowledge the use of the Virginia Tech Advanced Research Computing Center for bioinformatic analyses performed in this study.

Abbreviations

AaV

Aureococcus anophagefferens virus

ANI

average nucleotide identity

ASFV

African swine fever virus

A32

A32-like packaging ATPase

GVOG

giant virus orthologous group

HMM

Hidden Markov Model

IC

Internode Certainty

ICTV

International Committee on Taxonomy of Viruses

MAG

metagenome-assembled genome

MCP

major capsid protein

OG

orthologous group

PolB

family B DNA Polymerase

RED

relative evolutionary distance

RNAPL

large RNA polymerase subunit

RNAPS

small RNA polymerase subunit

SFII

superfamily II helicase

TC

Tree Certainty

TetV

Tetraselmis virus

TFIIB

TFIIB transcriptional factor

TopoII

Topoisomerase family II

VLTF3

virus late transcription factor 3

Data Availability

All data products described in this study are available on the Giant Virus Database: https://faylward.github.io/GVDB/. Reference trees of concatenated alignments can be found on the interactive Tree of Life: https://itol.embl.de/shared/faylward.

Funding Statement

This work was supported by a Simons Early Career Award in Marine Microbial Ecology and Evolution to F.O.A., an NSF award (IIBR-1918271) to F.O.A., and the Intramural Research Program of the National Institutes of Health (National Library of Medicine) for E.V.K. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Fischer MG. Giant viruses come of age. Curr Opin Microbiol. 2016;31:50–7. doi: 10.1016/j.mib.2016.03.001
  • 2. Koonin EV, Yutin N. Origin and evolution of eukaryotic large nucleo-cytoplasmic DNA viruses. Intervirology. 2010;53:284–92. doi: 10.1159/000312913
  • 3. Koonin EV, Dolja VV, Krupovic M, Varsani A, Wolf YI, Yutin N, et al. Global Organization and Proposed Megataxonomy of the Virus World. Microbiol Mol Biol Rev. 2020;84. doi: 10.1128/MMBR.00061-19
  • 4. Raoult D, Forterre P. Redefining viruses: lessons from Mimivirus. Nat Rev Microbiol. 2008;6:315–9. doi: 10.1038/nrmicro1858
  • 5. Wilhelm S, Bird J, Bonifer K, Calfee B, Chen T, Coy S, et al. A Student’s Guide to Giant Viruses Infecting Small Eukaryotes: From Acanthamoeba to Zooxanthellae. Viruses. 2017:46. doi: 10.3390/v9030046
  • 6. Koonin EV, Yutin N. Evolution of the Large Nucleocytoplasmic DNA Viruses of Eukaryotes and Convergent Origins of Viral Gigantism. Adv Virus Res. 2019;103:167–202. doi: 10.1016/bs.aivir.2018.09.002
  • 7. Karki S, Moniruzzaman M, Aylward FO. Comparative Genomics and Environmental Distribution of Large dsDNA Viruses in the Family Asfarviridae. Front Microbiol. 2021;12. doi: 10.3389/fmicb.2021.657471
  • 8. Weynberg KD, Allen MJ, Wilson WH. Marine Prasinoviruses and Their Tiny Plankton Hosts: A Review. Viruses. 2017;9. doi: 10.3390/v9030043
  • 9. Fischer MG, Allen MJ, Wilson WH, Suttle CA. Giant virus with a remarkable complement of genes infects marine zooplankton. Proc Natl Acad Sci U S A. 2010;107:19508–13. doi: 10.1073/pnas.1007615107
  • 10. Moniruzzaman M, Martinez-Gutierrez CA, Weinheimer AR, Aylward FO. Dynamic genome evolution and complex virocell metabolism of globally-distributed giant viruses. Nat Commun. 2020;11:1710. doi: 10.1038/s41467-020-15507-2
  • 11. Schvarcz CR, Steward GF. A giant virus infecting green algae encodes key fermentation genes. Virology. 2018;518:423–33. doi: 10.1016/j.virol.2018.03.010
  • 12. Boyer M, Yutin N, Pagnier I, Barrassi L, Fournous G, Espinosa L, et al. Giant Marseillevirus highlights the role of amoebae as a melting pot in emergence of chimeric microorganisms. Proc Natl Acad Sci U S A. 2009;106:21848–53. doi: 10.1073/pnas.0911354106
  • 13. Monier A, Pagarete A, de Vargas C, Allen MJ, Read B, Claverie J-M, et al. Horizontal gene transfer of an entire metabolic pathway between a eukaryotic alga and its DNA virus. Genome Res. 2009;19:1441–9. doi: 10.1101/gr.091686.109
  • 14. Moniruzzaman M, Weinheimer AR, Martinez-Gutierrez CA, Aylward FO. Widespread endogenization of giant viruses shapes genomes of green algae. Nature. 2020;588:141–5. doi: 10.1038/s41586-020-2924-2
  • 15. Rozenberg A, Oppermann J, Wietek J, Fernandez Lahore RG, Sandaa R-A, Bratbak G, et al. Lateral Gene Transfer of Anion-Conducting Channelrhodopsins between Green Algae and Giant Viruses. Curr Biol. 2020;30:4910–4920.e5. doi: 10.1016/j.cub.2020.09.056
  • 16. Schulz F, Roux S, Paez-Espino D, Jungbluth S, Walsh DA, Denef VJ, et al. Giant virus diversity and host interactions through global metagenomics. Nature. 2020;578:432–6. doi: 10.1038/s41586-020-1957-x
  • 17. Yutin N, Koonin EV. Proteorhodopsin genes in giant viruses. Biol Direct. 2012;7:34. doi: 10.1186/1745-6150-7-34
  • 18. Iyer LM, Aravind L, Koonin EV. Common origin of four diverse families of large eukaryotic DNA viruses. J Virol. 2001;75:11720–34. doi: 10.1128/JVI.75.23.11720-11734.2001
  • 19. Yutin N, Koonin EV. Hidden evolutionary complexity of Nucleo-Cytoplasmic Large DNA viruses of eukaryotes. Virol J. 2012;9:161. doi: 10.1186/1743-422X-9-161
  • 20. Iyer LM, Balaji S, Koonin EV, Aravind L. Evolutionary genomics of nucleo-cytoplasmic large DNA viruses. Virus Res. 2006;117:156–84. doi: 10.1016/j.virusres.2006.01.009
  • 21. Bäckström D, Yutin N, Jørgensen SL, Dharamshi J, Homa F, Zaremba-Niedwiedzka K, et al. Virus Genomes from Deep Sea Sediments Expand the Ocean Megavirome and Support Independent Origins of Viral Gigantism. mBio. 2019;10. doi: 10.1128/mBio.02497-18
  • 22. Santini S, Jeudy S, Bartoli J, Poirot O, Lescot M, Abergel C, et al. Genome of Phaeocystis globosa virus PgV-16T highlights the common ancestry of the largest known DNA viruses infecting eukaryotes. Proc Natl Acad Sci U S A. 2013;110:10800–5. doi: 10.1073/pnas.1303251110
  • 23. Gallot-Lavallée L, Blanc G, Claverie J-M. Comparative Genomics of Chrysochromulina Ericina Virus and Other Microalga-Infecting Large DNA Viruses Highlights Their Intricate Evolutionary Relationship with the Established Mimiviridae Family. J Virol. 2017;91. doi: 10.1128/JVI.00230-17
  • 24. Guglielmini J, Woo AC, Krupovic M, Forterre P, Gaia M. Diversification of giant and large eukaryotic dsDNA viruses predated the origin of modern eukaryotes. Proc Natl Acad Sci U S A. 2019;116:19585–92. doi: 10.1073/pnas.1912006116
  • 25. Salichos L, Stamatakis A, Rokas A. Novel information theory-based measures for quantifying incongruence among phylogenetic trees. Mol Biol Evol. 2014;31:1261–71. doi: 10.1093/molbev/msu061
  • 26. Salichos L, Rokas A. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature. 2013;497:327–31. doi: 10.1038/nature12130
  • 27. Martinez-Gutierrez CA, Aylward FO. Phylogenetic Signal, Congruence, and Uncertainty across Bacteria and Archaea. Mol Biol Evol. 2021. doi: 10.1093/molbev/msab254
  • 28. Koonin EV, Yutin N. Multiple evolutionary origins of giant viruses. F1000Res. 2018:7. doi: 10.12688/f1000research.13350.2
  • 29. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36:996–1004. doi: 10.1038/nbt.4229
  • 30. Lefkowitz EJ, Dempsey DM, Hendrickson RC, Orton RJ, Siddell SG, Smith DB. Virus taxonomy: the database of the International Committee on Taxonomy of Viruses (ICTV). Nucleic Acids Res. 2018:D708–17. doi: 10.1093/nar/gkx932
  • 31. Subramaniam K, Behringer DC, Bojko J, Yutin N, Clark AS, Bateman KS, et al. A New Family of DNA Viruses Causing Disease in Crustaceans from Diverse Aquatic Biomes. mBio. 2020;11. doi: 10.1128/mBio.02938-19
  • 32. Mihara T, Koyano H, Hingamp P, Grimsley N, Goto S, Ogata H. Taxon Richness of “Megaviridae” Exceeds those of Bacteria and Archaea in the Ocean. Microbes Environ. 2018;33:162–71. doi: 10.1264/jsme2.ME17203
  • 33. Monier A, Larsen JB, Sandaa R-A, Bratbak G, Claverie J-M, Ogata H. Marine mimivirus relatives are probably large algal viruses. Virol J. 2008;5:12. doi: 10.1186/1743-422X-5-12
  • 34. Ghedin E, Claverie J-M. Mimivirus relatives in the Sargasso sea. Virol J. 2005;2:62. doi: 10.1186/1743-422X-2-62
  • 35. Moniruzzaman M, LeCleir GR, Brown CM, Gobler CJ, Bidle KD, Wilson WH, et al. Genome of brown tide virus (AaV), the little giant of the Megaviridae, elucidates NCLDV genome expansion and host-virus coevolution. Virology. 2014;466–467:60–70. doi: 10.1016/j.virol.2014.06.031
  • 36. Needham DM, Yoshizawa S, Hosaka T, Poirier C, Choi CJ, Hehenberger E, et al. A distinct lineage of giant viruses brings a rhodopsin photosystem to unicellular marine predators. Proc Natl Acad Sci. 2019:20574–83. doi: 10.1073/pnas.1907517116
  • 37. Schulz F, Yutin N, Ivanova NN, Ortega DR, Lee TK, Vierheilig J, et al. Giant viruses with an expanded complement of translation system components. Science. 2017;356:82–5. doi: 10.1126/science.aal4657
  • 38. Deeg CM, Chow C-ET, Suttle CA. The kinetoplastid-infecting Bodo saltans virus (BsV), a window into the most abundant giant viruses in the sea. Elife. 2018;7. doi: 10.7554/eLife.33014
  • 39. Schulz F, Alteio L, Goudeau D, Ryan EM, Yu FB, Malmstrom RR, et al. Hidden diversity of soil giant viruses. Nat Commun. 2018;9:4881. doi: 10.1038/s41467-018-07335-2
  • 40. Abrahão J, Silva L, Silva LS, Khalil JYB, Rodrigues R, Arantes T, et al. Tailed giant Tupanvirus possesses the most complete translational apparatus of the known virosphere. Nat Commun. 2018;9:749. doi: 10.1038/s41467-018-03168-1
  • 41. Arslan D, Legendre M, Seltzer V, Abergel C, Claverie J-M. Distant Mimivirus relative with a larger genome highlights the fundamental features of Megaviridae. Proc Natl Acad Sci U S A. 2011;108:17486–91. doi: 10.1073/pnas.1110889108
  • 42. Raoult D, Audic S, Robert C, Abergel C, Renesto P, Ogata H, et al. The 1.2-megabase genome sequence of Mimivirus. Science. 2004;306:1344–50. doi: 10.1126/science.1101485
  • 43. Needham DM, Poirier C, Hehenberger E, Jiménez V, Swalwell JE, Santoro AE, et al. Targeted metagenomic recovery of four divergent viruses reveals shared and distinctive characteristics of giant viruses of marine eukaryotes. Philos Trans R Soc Lond B Biol Sci. 2019;374:20190086. doi: 10.1098/rstb.2019.0086
  • 44. Kijima S, Delmont TO, Miyazaki U, Gaia M, Endo H, Ogata H. Discovery of Viral Myosin Genes With Complex Evolutionary History Within Plankton. Front Microbiol. 2021;12:683294. doi: 10.3389/fmicb.2021.683294
  • 45. Cunha VD, Da Cunha V, Gaia M, Ogata H, Jaillon O, Delmont TO, et al. Giant viruses encode novel types of actins possibly related to the origin of eukaryotic actin: the viractins. bioRxiv. doi: 10.1101/2020.06.16.150565
  • 46. Ha AD, Moniruzzaman M, Aylward FO. High Transcriptional Activity and Diverse Functional Repertoires of Hundreds of Giant Viruses in a Coastal Marine System. mSystems. 2021;6:e0029321.
  • 47. Van Etten JL, Agarkova IV, Dunigan DD. Chloroviruses. Viruses. 2019;12:20.
  • 48. Moreau H, Piganeau G, Desdevises Y, Cooke R, Derelle E, Grimsley N. Marine prasinovirus genomes show low evolutionary divergence and acquisition of protein metabolism genes by horizontal gene transfer. J Virol. 2010;84:12555–63. doi: 10.1128/JVI.01123-10
  • 49. Weynberg KD, Allen MJ, Gilg IC, Scanlan DJ, Wilson WH. Genome sequence of Ostreococcus tauri virus OtV-2 throws light on the role of picoeukaryote niche separation in the ocean. J Virol. 2011;85:4520–9. doi: 10.1128/JVI.02131-10
  • 50. Philippe N, Legendre M, Doutre G, Couté Y, Poirot O, Lescot M, et al. Pandoraviruses: amoeba viruses with genomes up to 2.5 Mb reaching that of parasitic eukaryotes. Science. 2013;341:281–6. doi: 10.1126/science.1239181
  • 51. Yutin N, Koonin EV. Pandoraviruses are highly derived phycodnaviruses. Biol Direct. 2013;8:25. doi: 10.1186/1745-6150-8-25
  • 52. Legendre M, Lartigue A, Bertaux L, Jeudy S, Bartoli J, Lescot M, et al. In-depth study of Mollivirus sibericum, a new 30,000-y-old giant virus infecting Acanthamoeba. Proc Natl Acad Sci U S A. 2015;112:E5327–35. doi: 10.1073/pnas.1510795112
  • 53. Krupovic M, Yutin N, Koonin E. Evolution of a major virion protein of the giant pandoraviruses from an inactivated bacterial glycoside hydrolase. Virus Evol. 2020;6. doi: 10.1093/ve/veaa059
  • 54. Wilson WH, Schroeder DC, Allen MJ, Holden MTG, Parkhill J, Barrell BG, et al. Complete genome sequence and lytic phase transcription profile of a Coccolithovirus. Science. 2005;309:1090–2. doi: 10.1126/science.1113109
  • 55. Yoshikawa G, Blanc-Mathieu R, Song C, Kayama Y, Mochizuki T, Murata K, et al. Medusavirus, a Novel Large DNA Virus Discovered from Hot Spring Water. J Virol. 2019;93. doi: 10.1128/JVI.02130-18
  • 56. Legendre M, Bartoli J, Shmakova L, Jeudy S, Labadie K, Adrait A, et al. Thirty-thousand-year-old distant relative of giant icosahedral DNA viruses with a pandoravirus morphology. Proc Natl Acad Sci U S A. 2014;111:4274–9. doi: 10.1073/pnas.1320670111
  • 57. Liu Y, Bisio H, Toner CM, Jeudy S, Philippe N, Zhou K, et al. Virus-encoded histone doublets are essential and form nucleosome-like structures. Cell. 2021;184:4237–4250.e19. doi: 10.1016/j.cell.2021.06.032
  • 58. Valencia-Sánchez MI, Abini-Agbomson S, Wang M, Lee R, Vasilyev N, Zhang J, et al. The structure of a virus-encoded nucleosome. Nat Struct Mol Biol. 2021;28:413–7. doi: 10.1038/s41594-021-00585-7
  • 59. Matsuyama T, Takano T, Nishiki I, Fujiwara A, Kiryu I, Inada M, et al. A novel Asfarvirus-like virus identified as a potential cause of mass mortality of abalone. Sci Rep. 2020;10:4620. doi: 10.1038/s41598-020-61492-3
  • 60. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:132. doi: 10.1186/s13059-016-0997-x
  • 61. Bobay L-M, Ochman H. Biological species in the viral world. Proc Natl Acad Sci. 2018:6040–5. doi: 10.1073/pnas.1717593115
  • 62. Aylward FO, Moniruzzaman M. ViralRecall-A Flexible Command-Line Tool for the Detection of Giant Virus Signatures in ‘Omic Data. Viruses. 2021;13. doi: 10.3390/v13020150
  • 63. Lechner M, Findeiss S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics. 2011;12:124. doi: 10.1186/1471-2105-12-124
  • 64. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7:539. doi: 10.1038/msb.2011.75
  • 65. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–3. doi: 10.1093/bioinformatics/btp348
  • 66. Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195
  • 67. Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019;47:D309–14. doi: 10.1093/nar/gky1085
  • 68. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021;49:D412–9. doi: 10.1093/nar/gkaa913
  • 69. Yutin N, Wolf YI, Raoult D, Koonin EV. Eukaryotic large nucleo-cytoplasmic DNA viruses: clusters of orthologous genes and reconstruction of viral genome evolution. Virol J. 2009;6:223. doi: 10.1186/1743-422X-6-223
  • 70. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015;32:268–74. doi: 10.1093/molbev/msu300
  • 71. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol. 2018;35:518–22. doi: 10.1093/molbev/msx281
  • 72. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3. doi: 10.1093/bioinformatics/btu033
  • 73. Aberer AJ, Krompass D, Stamatakis A. Pruning Rogue Taxa Improves Phylogenetic Accuracy: An Efficient Algorithm and Webservice. Syst Biol. 2013:162–6. doi: 10.1093/sysbio/sys078
  • 74. Louca S, Doebeli M. Efficient comparative phylogenetics on large trees. Bioinformatics. 2018;34:1053–5. doi: 10.1093/bioinformatics/btx701

Decision Letter 0

Paula Jauregui, PhD

24 Jun 2021

Dear Dr Aylward,

Thank you for submitting your manuscript entitled "A Phylogenomic Framework for Charting the Diversity and Evolution of Giant Viruses" for consideration as a Methods and Resources by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review. Please accept my apologies for the delay incurred while we sought external advice.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Jun 28 2021 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland G Roberts PhD

Senior Editor

PLOS Biology

rroberts@plos.org

on behalf of

Paula Jauregui, PhD

Editor

PLOS Biology

pjaureguionieva@plos.org

Decision Letter 1

Paula Jauregui, PhD

13 Aug 2021

Dear Dr. Aylward,

Thank you very much for submitting your manuscript "A Phylogenomic Framework for Charting the Diversity and Evolution of Giant Viruses" for consideration as a Methods and Resources at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by several independent reviewers.

In light of the reviews (below), we are pleased to offer you the opportunity to address the comments from the reviewers in a revised version that we anticipate should not take you very long. We will then assess your revised manuscript and your response to the reviewers' comments and we may consult the reviewers again. Please also address the formatting and editorial policy requirements.

In particular, reviewer #1 wants you to provide the number of currently recognized viral genera in the phylum Nucleocytoviricota, and to explain “megataxonomy”, disagrees with the attribution the low TC value for the MCP tree to the existence of paralogs, says that the choice of ortholog is needed, has questions about the root of the phylogenetic tree, suggests you to describe your standpoint about Klosneuvirinae, wants you to add a few words about the enrichment of Translation genes in the PM_09, thinks that you should distinguish the transfer from medusavirus and eukaryotes, wants you to explain the classification of Pimascovirales and Pokkesviricetes, and asks how the genes with introns were treated. Reviewer #2 wants you to rewrite some sentences for clarification, says that you should provide information about the source of the genomes and about the viruses contributing to each branch of the tree, asks whether there are 8 or 9 or 7 marker gene sets, thinks that you should differentiate branches containing viruses derived from culturable isolates vs. viruses assembled from metagenomic data, and wants you to add a supplementary table that provides the complete set of RED values along with the identification of the nodes/clades associated with each value. Reviewer #3 thinks that you should add gene names to the supplemental figures S1-25 and wants to see the trees associated with the combined marker genes in Fig.1. This reviewer also thinks that it would be helpful for communication if names were given at the lower levels too when possible.

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figures 1AB, 3A, S26.

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

We expect to receive your revised manuscript within 1 month.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Resubmission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this resubmission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Paula

---

Paula Jauregui, PhD

Associate Editor

PLOS Biology

pjaureguionieva@plos.org

*****************************************************

REVIEWS:

Reviewer #1: Hiroyuki Ogata. Evolution and ecology of Giant viruses.

Reviewer #2: Virus taxonomy.

Reviewer #3: Giant viruses in soil.

Reviewer #1: Viruses of the phylum Nucleocytoviricota are known for their large genomes and particles, intriguing biological traits, and potential ecological roles. The ICTV taxonomy of this viral group had been incorrect in many respects for a long time, until the recent viral megataxonomy was established in 2020 by a group of leading virologists/taxonomists including one of the authors of this manuscript. However, recent metagenomics has started to produce hundreds to thousands of additional environmental genomes of the Nucleocytoviricota. Consequently, the need for updated phylogenomic studies has re-emerged in order to establish a common language for the taxonomy within the phylum, including its uncultivated members. As the authors correctly point out in their manuscript, this updated taxonomic framework is important for ecological and evolutionary analyses of this group of viruses in connection with their relationships with the evolution of cellular organisms. This is very timely and useful work.

The authors compiled 1380 Nucleocytoviricota viral genomes (both isolated and uncultured) and identified 8863 protein families (named GVOGs), of which 25 GVOGs are widely represented in the genomes. I appreciated the next steps that the authors took to identify 7 marker genes that are considered the best choice for phylogenomic analyses, and to objectively delineate clades (at the class, order, family, and genus levels). The choice of marker genes for the phylogenomics is based on the Tree Certainty (TC) scores, while the taxonomic delineation was performed using a procedure based on the Relative Evolutionary Distance (RED). In my opinion, this is the best plausible strategy to delineate the taxonomic clades of this viral group. With these methods, the authors partitioned the phylum into 2 classes, 6 orders, 32 families (this high number is appropriate), and 344 genera. The subsequent descriptions of the composition of each order are easy to read and well balanced with previous taxonomic proposals. I nevertheless have a few comments that I hope will help to further clarify some points in the current version of the manuscript.

L.33: It would be useful to provide the number of currently recognized viral genera in the phylum Nucleocytoviricota.

LL.48-54: The citation of the references 9,15,20 appear twice in this part, which makes the logical flow awkward. Probably, the first citation of "9,15,20" would not be needed.

L.75: The definition of the word "megataxonomy" may be unclear to some readers (although it is a title of well-cited paper). A few words explaining this jargon would be useful.

L.103: "this is likely the case because … multiple copies of MCP, … orthologs from paralogs (Fig. 1a)." To my understanding, for tree reconstructions, the authors chose one sequence for multicopy genes from each genome based on the score against the respective HMMs. If that is the case, at most one MCP was chosen from each genome. If this is how the data were processed, then I do not agree with attributing the low TC value for the MCP tree to the existence of paralogs, because the choice of paralogs or orthologs does not affect the reliability of the tree (if paralogs are mixed, then the tree would not be a species tree but would still be a gene tree). I guess it is more likely due to the variable nature of MCPs. Related to this, the description in LL.365-366 of the choice of sequence dataset for individual families is somewhat unclear. Does this include only the best hits, or multiple hits if they pass the threshold? A few additional words may help readers to completely understand the procedure.

L.119: The fact that the addition of RNAPS led to a decreased TC may be due to the existence of paralogs. In this case, because the goal is to gain consistency among species trees, the choice of ortholog is needed. RNAPS paralogs were described in https://journals.asm.org/doi/10.1128/JVI.00230-17.

L.140: To my understanding, the computation of RED requires the root of the whole tree. The authors mentioned that the Pox-Asfar group was used to root the tree. However, this would not give the root of the whole phylogenetic tree. How was the root defined? Mid-point? This should be described in the Methods section.

L.154: "including 213 genera with a single representative". Just above this statement, the authors state that clades with two or one members were left as singletons, and for the family count (n=32), the authors excluded the singletons. A consistent counting scheme would be easier to follow.

L.165: The citation of ref-22 should be ref-47.

LL.173-192: The proposed classification scheme does not refer to the Klosneuvirinae, a name that is also used in several published papers. I suggest that the authors describe their standpoint on this taxonomic group, so that readers can judge whether they wish to continue using this taxonomic group name.

L.199-202: Citations of the papers that described the discoveries of individual genes would be needed. For actin, myosin, and kinesin, the citations seem to be https://doi.org/10.1101/2020.06.16.150565, https://doi.org/10.3389/fmicb.2021.683294, doi: 10.1128/mSystems.00293-21.

Fig.4: The enrichment of Translation genes in the PM_09 is striking but this is not mentioned in the text. A few words on this may be useful for readers.

L.240-241: In the ref-45, the PolB of medusavirus was proposed to be transferred to eukaryotes, whereas the authors assume a transfer from eukaryotes to medusavirus. These should be distinguished.

L.263-272: The authors cite two genomes in the Pimascovirales and three genomes in the Pokkesviricetes. First of all, the branches that correspond to these five genomes should be clearly indicated in Fig. 2. In addition, if the two genomes in the Pimascovirales correspond to the two branches below the Mininucleoviridae, then they look like a family-level clade (within the Pimascovirales) instead of an order-level clade. If these two genomes form a new order, Mininucleoviridae should also be placed at the order level. Additional explanation is needed.

L.333: The authors used Prodigal for gene calling. How were genes with introns treated? This should be made clear, as some of the genomes contain many introns.

Reviewer #2: The work described in this manuscript to develop a framework for studying the diversity and evolution of large DNA viruses, provides an extremely useful and valuable set of resources (the giant virus orthologous group database along with associated alignments and trees) and data that help to further classify these viruses with members that have either been physically isolated or identified via metagenomic sequencing of environmental samples. As such, it represents an important contribution to the literature. The manuscript will benefit from a careful revision to correct a number of problems, including those detailed below. Major problems include failure to include Table 1, and a misnumbering of tables in the text. In addition, it will be extremely useful to have more of the original data generated in support of this work, available to its readers.

Lines 54-55: The initial wording of the sentence beginning "The uncertainty of the phylogenetic structure and taxonomy Nucleocytoviricota…" needs rewriting.

Line 83: Table S1 is not a list of genomes. This reference should most likely be to Table S2. Table S2 should provide traditional GenBank accession numbers for all genome sequences when available. When GenBank accessions are not available, the source of the indicated genome sequence (and the provided genome_id) should be identified. I have no idea what some of the provided genome_ids refer to. (I would recommend not concatenating different IDs together.)

Line 89: There is no Table 1 provided with the manuscript.

Lines 311,312: Four sequences are mentioned that are not in an NCBI database. If these sequences are not readily available, they should not be included in this analysis.

Line 90: Table S1 does not provide, as indicated in the text, a "descriptions of all GVOGs".

Line 377: There is no Table 1

Line 387: Are there eight or nine marker gene sets?

Line 401: Now we are at (the final) 7 marker sets. The number of marker gene sets varies according to the analysis provided in Figure 3. But in this section of the Methods, it is not clear why these numbers vary from 9 to 8 to 7.

In the supplementary phylogenetic trees for the set of 25 gene sets, it is very difficult to determine from the provided figures the characterizations of the sets provided in Table S1. For example, for GVOGs 115 and 152, the table indicates that these GVOGs are not represented in viruses belonging to the order Asfuvirales. In the figures for these two GVOG trees, there is also no indication of the presence of poxviruses (order Chitovirales) in the tree. GVOG 152 also appears not to have a representative from the pandoraviruses.

In all text and figures, the proposed order pandoravirales should not be capitalized or italicized since it is only proposed and is not an official taxon approved by the ICTV.

Figure 2 and corresponding text: All proposed taxa should, as indicated above, not be capitalized or italicized since they are not official taxa approved by the ICTV.

Figure 2: It would be useful to be able to identify the viruses contributing to each branch of the tree, especially for unlabeled, single genome branches. It is not clear if this information is available or where it might be.

Figure 2: It would be useful to differentiate branches containing viruses derived from culturable isolates vs. viruses assembled from metagenomic data.

Figure 3a: A supplementary table should be available that provides the complete set of RED values along with, importantly, the identification of the nodes/clades associated with each value. Along with Table S2, this will provide the data that supports the creation of the proposed new species and families.

Figure 4. Panels A and B should be labeled on the figure.

Reviewer #3: Recently, several metagenomic studies (including one by these authors) have identified thousands of new genomes in the Nucleocytoviricota from environmental samples. This current work provides a phylogenetic and, more importantly, a taxonomic context to interpret and communicate these data. The research uses appropriate state-of-the-art methods. An important aspect is the use of RED values to normalize taxonomic levels to ranges of genetic distances. The manuscript is very well written and will be an important contribution to the field. I only have minor concerns.

In browsing the 25 GVOG trees in the supplemental data S1-25, it was clear that different trees gave very different results. It would be nice to add gene names to the supplemental figures S1-25.

It would be good to see the trees associated with the combined marker genes in Fig.1 (maybe I missed these?). How did they vary? This would give a sense of how robust the taxonomy is to any individual gene.

Reference in the manuscript the itol tree https://itol.embl.de/tree/1281731864487941620067021 on the github site.

The authors only suggested names at the phylum, class, and order levels. All families and genera were given numbers. It would be helpful for communication if names were given at these lower levels too when possible. Or is the expectation that these will need to be dealt with separately, as many genera will need to be renamed as species?

Decision Letter 2

Paula Jauregui, PhD

15 Sep 2021

Dear Dr. Aylward,

Thank you for submitting your revised Methods and Resources entitled "A Phylogenomic Framework for Charting the Diversity and Evolution of Giant Viruses" for publication in PLOS Biology. I have now discussed your revision with the Academic Editor. 

We will probably accept this manuscript for publication, provided you satisfactorily address the following data and other policy-related requests.

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figures 1AB, 3A, S26.

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

-  a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

-  a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

-  a track-changes file indicating any changes that you have made to the manuscript. 

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information  

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Paula 

---

Paula Jauregui, PhD,

Associate Editor,

pjaureguionieva@plos.org,

PLOS Biology

Decision Letter 3

Paula Jauregui, PhD

29 Sep 2021

Dear Dr Aylward,

I'm handling your paper temporarily while my colleague Dr Paula Jauregui is out of the office. On behalf of my colleagues and the Academic Editor, Curtis Suttle, I'm pleased to say that we can in principle offer to publish your Methods and Resources "A Phylogenomic Framework for Charting the Diversity and Evolution of Giant Viruses" in PLOS Biology, provided you address any remaining formatting and reporting issues. These will be detailed in an email that will follow this letter and that you will usually receive within 2-3 business days, during which time no action is required from you. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have made the required changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely, 

Roli Roberts

Roland G Roberts PhD

Senior Editor

PLOS Biology

rroberts@plos.org

on behalf of

Paula Jauregui, PhD 

Associate Editor 

PLOS Biology

pjaureguionieva@plos.org

Associated Data

    Supplementary Materials

    S1 Text. Supporting figures.

    Fig A. Major Capsid Protein GVOGm0003 phylogeny.
    Fig B. Disulfide (thiol) oxidoreductase GVOGm0004 phylogeny.
    Fig C. Superfamily II helicase GVOGm0013 phylogeny.
    Fig D. Patatin phospholipase GVOGm0018 phylogeny.
    Fig E. DEAD/SNF2-like helicase GVOGm0020 phylogeny.
    Fig F. DNA-directed RNA polymerase subunit beta (RNAPS) GVOGm0022 phylogeny.
    Fig G. DNA-directed RNA polymerase subunit alpha (RNAPL) GVOGm0023 phylogeny.
    Fig H. mRNA capping enzyme GVOGm0036 phylogeny.
    Fig I. DNA polymerase family B GVOGm0054 phylogeny.
    Fig J. TATA box binding protein (TBP) GVOGm0056 phylogeny.
    Fig K. Ribonucleoside diphosphate reductase, alpha subunit GVOGm0088 phylogeny.
    Fig L. D5-like helicase-primase GVOGm0095 phylogeny.
    Fig M. Uncharacterized, C-terminal domain GVOGm0115 phylogeny.
    Fig N. Uncharacterized protein GVOGm0152 phylogeny.
    Fig O. Transcription initiation factor IIB GVOGm0172 phylogeny.
    Fig P. RuvC, Holliday junction resolvases (HJRs) GVOGm0189 phylogeny.
    Fig Q. Ubiquitin carboxyl-terminal hydrolase GVOGm0214 phylogeny.
    Fig R. Proliferating cell nuclear antigen GVOGm0239 phylogeny.
    Fig S. DNA topoisomerase II GVOGm0461 phylogeny.
    Fig T. Divergent DNA-directed RNA polymerase subunit 5 GVOGm0694 phylogeny.
    Fig U. Packaging ATPase GVOGm0760 phylogeny.
    Fig V. Metallopeptidase WLM GVOGm0787 phylogeny.
    Fig W. Ribonuclease III GVOGm0798 phylogeny.
    Fig X. Virus Late Transcription Factor 3 VLTF3 GVOGm0890 phylogeny.
    Fig Y. Ribonucleotide reductase small subunit GVOGm1574 phylogeny.
    Fig Z. Barchart of source habitats for the Nucleocytoviricota families. Full information is provided in S1 Data.

    (PDF)

    S1 Data. Taxonomy, genome statistics, and other metadata for the Nucleocytoviricota genomes analyzed in this study.

    (XLSX)

    S2 Data. Statistics and descriptions of the 25 GVOGs present in 70% of the genomes analyzed.

    Full annotations of all GVOGs and TC values for the trees of this study are also provided.

    (XLSX)

    S3 Data. RED values for taxonomic ranks presented in this study.

    (XLSX)

    Attachment

    Submitted filename: Response_to_Reviewers.docx

    Data Availability Statement

    All data products described in this study are available on the Giant Virus Database: https://faylward.github.io/GVDB/. Reference trees of concatenated alignments can be found on the interactive Tree of Life: https://itol.embl.de/shared/faylward.


    Articles from PLoS Biology are provided here courtesy of PLOS
