Brazilian Journal of Microbiology. 2024 Aug 24;55(4):3373–3387. doi: 10.1007/s42770-024-01496-7

The GC% landscape of the Nucleocytoviricota

Amanda Stéphanie Arantes Witt 1, João Victor Rodrigues Pessoa Carvalho 1, Mateus Sá Magalhães Serafim 1, Nidia Esther Colquehuanca Arias 1, Rodrigo Araújo Lima Rodrigues 1, Jônatas Santos Abrahão 1
PMCID: PMC11711839  PMID: 39180708

Abstract

Genomic studies on sequence composition employ various approaches, such as calculating the proportion of guanine and cytosine within a given sequence (GC% content), which can shed light on various aspects of the organism’s biology. In this context, GC% can provide insights into virus-host relationships and evolution. Here, we present a comprehensive gene-by-gene analysis of 61 representatives belonging to the phylum Nucleocytoviricota, which comprises viruses with the largest genomes known in the virosphere. Parameters were evaluated not only based on the average GC% of a given viral species compared to the entire phylum but also considering gene position and phylogenetic history. Our results reveal that while some families exhibit similar GC% among their representatives (e.g., Marseilleviridae), others such as Poxviridae, Phycodnaviridae, and Mimiviridae have members with discrepant GC% values, likely reflecting adaptation to specific biological cycles and hosts. Interestingly, certain genes located at terminal regions or within specific genomic clusters show GC% values distinct from the average, suggesting recent acquisition or unique evolutionary pressures. Horizontal gene transfer and the presence of potential paralogs were also assessed in genes with the most discrepant GC% values, indicating multiple evolutionary histories. Taken together, to the best of our knowledge, this study represents the first global and gene-by-gene analysis of GC% distribution and profiles within genomes of Nucleocytoviricota members, highlighting their diversity and identifying potential new targets for future studies.

Supplementary Information

The online version contains supplementary material available at 10.1007/s42770-024-01496-7.

Keywords: GC% content, Genomics, Nucleocytoviricota, Poxviruses, Giant viruses

Introduction

The advent of DNA sequencing, pioneered by various researchers but made possible by the Sanger method in 1977, heralded a new era for biological studies [1]. Since then, a wide range of DNA sequencing technologies have been developed and enhanced, enabling exploration of diverse biological aspects in much greater depth [2]. Genomics is considered a Big Data science, demanding various data analysis methods, bioinformatic tools, and approaches [3]. Metrics utilized for genomic analysis are of significant importance, such as sequence composition analysis, which encompasses features like length, k-mer word frequency, codon usage, and nucleotide percentages. Specifically, the measured proportion of guanine (G) and cytosine (C) nucleotides (GC% content) within a target genomic sequence (RNA or DNA) is an easily calculated metric associated with a broad range of biological aspects, providing crucial information about genome structure and function as it affects the stability, flexibility, and interaction of nucleic acid molecules [4–6].

Used as a fundamental parameter of genome sequence variation, GC% content has been extensively explored, for example, in evolutionary and microbiological studies [7–12]. Additionally, GC% content has been identified as one of the genomic features associated with important biological events, such as horizontal gene transfer (HGT), evolution, and acquisition of virulence factors [10, 13–17]. Similarly, various studies have demonstrated specific correlations between the GC% of pathogens and host organisms, suggesting that sequence composition plays a crucial role in the coevolution of different species [18, 19], including many viruses [20].

The known virosphere encompasses the most diverse biological entities described to date, highlighting the diversity of viruses and their genomes [21]. The phylum Nucleocytoviricota comprises a hypothetical monophyletic taxon of large and giant DNA viruses. Currently, there are eleven official families described in this phylum, according to the International Committee on Taxonomy of Viruses (ICTV). Nucleocytoviruses are known pathogens of algae, protists, invertebrates, and vertebrates, including humans [22–26].

These viruses can reach particle sizes of up to 2.3 µm, with genomes as long as 2.8 Mb [27, 28]. Due to their large genomes, intergenic regions might be more abundant, which impacts their genome evolution [27, 28]. Another notable characteristic of nucleocytoviruses is the presence of unusual genes involved in canonical cellular metabolism, such as peptide synthesis factors, tRNAs, aminoacyl tRNA synthetases, and factors associated with tRNA/mRNA maturation and ribosome protein modification, bringing Nucleocytoviricota genomes closer in content to those of their hosts [27].

Therefore, considering the diversity of species, genomes, and host ranges, members of the Nucleocytoviricota are of significant interest for genomic and evolutionary studies. In this study, we describe a comprehensive characterization of Nucleocytoviricota genomes based on GC% content, focusing on gene/open reading frame (ORF) sequences. Our data suggest that gene-by-gene analyses of GC% content can provide insights into the evolution, genome expansion, and host-virus relationships of nucleocytoviruses.

Methods

Nucleocytoviruses selection and data acquisition

Representative viruses from different viral families within the phylum Nucleocytoviricota were selected to build the dataset used in this study. After extensive manual curation, we selected representative genomes from each genus within the families, based on the official taxonomy described by the ICTV [29, 30]. The primary criterion was to select nucleocytoviruses previously considered prototype species of each genus or viral family. Because the term "prototype species" is no longer used and some taxa have no prototype reference, we employed a second criterion for selection: viral representatives chosen on the basis of viral characterization (e.g., Eastern kangaroopox virus, genus Macropopoxvirus). Additionally, we purposely included more giant Nucleocytoviricota viruses, since these viruses carry the largest genomes in the phylum, enriching our sequence composition analysis within the dataset. Lastly, publicly available complete genomes were downloaded from NCBI’s GenBank [31]. The genomic data were obtained in files generated by GenBank, which consisted of the complete FASTA sequences and descriptions of all predicted CDS/ORFs for each viral genome.

Virus-host GC% correlation analysis

To observe possible correlations between the GC% content of viruses and host organisms, we manually curated known hosts for each virus included in this study using the Virus-Host Database website [19]. In cases where neither a specific host was listed nor complete host genomes were available on GenBank (e.g., lymphocystis disease virus 1), representative host taxa were listed. Subsequently, genomic GC% data of hosts were obtained directly from reference genomes deposited on GenBank (Table 1).

Table 1.

Average GC% content of viral genomes, viral ORFs, and associated host organism genomes. Overall GC% content of Nucleocytoviricota viruses (Phycodnaviridae, Mimiviridae, Ascoviridae, Iridoviridae, Marseilleviridae, Asfarviridae, Extended Asfarviridae, Poxviridae, “Pithoviridae”, Yaraviridae, "Pandoraviridae", and unclassified DNA viruses). Viral taxonomy at family level, viral species, GenBank accession for the viral genome, genome size (kb), genomic GC% content, ORF GC% content of each virus, host organisms, and GC% content of host organisms are described. Rows with empty virus cells list additional hosts of the virus above

| Family | Virus | GenBank accession | Genome size (kb) | Viral genomic GC% | Viral ORF GC% | Host organism | Host genomic GC% |
|---|---|---|---|---|---|---|---|
| Phycodnaviridae | Paramecium bursaria Chlorella virus 1 | NC_000852.5 | 330.6 | 40 | 40.6 | Chlorella variabilis | 65.5 |
| Phycodnaviridae | Emiliania huxleyi virus 86 | NC_007346.1 | 407.3 | 40.2 | 40.6 | Emiliania huxleyi | 64.5 |
| Phycodnaviridae | Ectocarpus siliculosus virus 1 | NC_002687.1 | 335.6 | 51.7 | 52.2 | Ectocarpus siliculosus | 53.49 |
| Phycodnaviridae | Heterosigma akashiwo virus 1 | NC_038553.1 | 274.7 | 30.4 | 30.8 | Heterosigma akashiwo | N/A |
| Mimiviridae | Cafeteria roenbergensis virus | NC_014637.1 | 617.4 | 23.3 | 23.4 | Cafeteria roenbergensis | 70.44 |
| Mimiviridae | Acanthamoeba polyphaga mimivirus | NC_014649.1 | 1,181.5 | 28 | 28.8 | Acanthamoeba castellanii | 58.35 |
| | | | | | | Acanthamoeba polyphaga | 58.7 |
| Mimiviridae | Samba virus | KF959826.2 | 1,181.5 | 28 | 28.7 | Acanthamoeba castellanii | 58.35 |
| Mimiviridae | Megavirus chiliensis | NC_016072.1 | 1,246.1 | 25.3 | 26.3 | Acanthamoeba castellanii | 58.35 |
| | | | | | | Acanthamoeba polyphaga | 58.7 |
| | | | | | | Acanthamoeba griffini | N/A |
| Mimiviridae | Moumouvirus australiensis | MG807320.1 | 1,098 | 25.1 | 26 | Acanthamoeba polyphaga | 58.7 |
| Mimiviridae | Tupanvirus deep ocean | MF405918.2 | 1,439.5 | 29.4 | 30.5 | Acanthamoeba castellanii | 58.35 |
| | | | | | | Vermamoeba vermiformis | 42.48 |
| Mimiviridae | Tupanvirus soda lake | KY523104.2 | 1,516.2 | 29.1 | 30.2 | Acanthamoeba castellanii | 58.35 |
| | | | | | | Vermamoeba vermiformis | 42.48 |
| Mimiviridae | Aureococcus anophagefferens virus | NC_024697.1 | 370.9 | 28.7 | 29.3 | Aureococcus anophagefferens | 69.92 |
| Mimiviridae | Bodo Saltans Virus | MF782455.1 | 1,385.8 | 25.3 | 25.7 | Bodo saltans | 51.6 |
| Mimiviridae | Phaeocystis globosa virus group I | NC_021312.1 | 459.9 | 32 | 33.4 | Phaeocystis globosa | N/A |
| Mimiviridae | Chrysochromulina ericina virus | NC_028094.1 | 473.5 | 25.4 | 26 | Haptolina ericina | N/A |
| Mimiviridae | Tetraselmis viridis virus | KY322437.1 | 668 | 41.2 | 40.6 | Tetraselmis viridis | N/A |
| Mimiviridae | Cotonvirus japonicus | AP024483.1 | 1,476.5 | 25.3 | 26.6 | Acanthamoeba castellanii | 58.35 |
| Ascoviridae | Spodoptera frugiperda ascovirus 1a | NC_008361.1 | 156.92 | 49.2 | 50.3 | Spodoptera frugiperda | 36.37 |
| Iridoviridae | Lymphocystis disease virus 1 | NC_001824.1 | 102.65 | 29.1 | 29.2 | Osmeridae (Hypomesus transpacificus **) | 44.5 |
| | | | | | | Percichthyidae (Maccullochella peelii **) | 40.5 |
| | | | | | | Percidae (Etheostoma cragini **) | 40.5 |
| | | | | | | Sparus aurata | 41.94 |
| | | | | | | Centrarchidae | N/A |
| | | | | | | Pleuronectidae | N/A |
| | | | | | | Pleuronectoidei | N/A |
| | | | | | | Soleidae (Brachirus orientalis **) | 39 |
| | | | | | | Clupeidae (Alosa alosa **) | 42.5 |
| Iridoviridae | Frog virus 3 | NC_005946.1 | 105.9 | 55.1 | 57.1 | Notophthalmus viridescens | N/A |
| | | | | | | Lithobates pipiens | N/A |
| | | | | | | Dryophytes versicolor | N/A |
| | | | | | | Lithobates sylvaticus | N/A |
| | | | | | | Oophaga pumilio | 26.9 |
| Iridoviridae | Invertebrate iridescent virus 3 | NC_008187 | 191.1 | 47.9 | 50.4 | Mosquitos (Aedes taeniorhynchus **) | N/A |
| Iridoviridae | Decapod iridescent virus 1 | MF599468.1 | 165.8 | 34.6 | 29.8 | Penaeus vannamei | 36.5 |
| Iridoviridae | Invertebrate iridescent virus 6 | NC_003038.1 | 212.4 | 28.6 | 35 | Acheta domesticus | 38.5 |
| | | | | | | Gryllus bimaculatus | 38.6 |
| | | | | | | Spodoptera frugiperda | 36.37 |
| | | | | | | Choristoneura fumiferana | 38.1 |
| | | | | | | Chilo suppressalis | 35.7 |
| Marseilleviridae | Marseillevirus marseillevirus | NC_013756.1 | 368.4 | 44.7 | 45 | Acanthamoeba castellanii | 58.35 |
| | | | | | | Acanthamoeba polyphaga | 58.7 |
| Marseilleviridae | Brazilian marseillevirus | KT752522 | 362.2 | 43.3 | 43.9 | Acanthamoeba castellanii | 58.35 |
| Marseilleviridae | Golden marseillevirus | NC_031465.1 | 360.6 | 43.1 | 43.7 | Acanthamoeba castellanii | 58.35 |
| | | | | | | Limnoperna fortunei | 33.6 |
| Marseilleviridae | Lausannevirus | NC_015326.1 | 346.7 | 42.9 | 43 | Acanthamoeba castellanii | 58.35 |
| Marseilleviridae | Tunisvirus | NC_038511.1 | 380 | 43 | 43.6 | Acanthamoeba castellanii | 58.35 |
| Asfarviridae | African swine fever virus | NC_001659.2 | 189.3 | 38.6 | 38.7 | Chlorocebus aethiops | 40.9 |
| | | | | | | Sus scrofa | 41.6 |
| | | | | | | Phacochoerus africanus | 40.48 |
| Extended Asfarviridae | Pacmanvirus S19 | MZ440852.1 | 418.5 | 33.2 | 34.3 | Acanthamoeba castellanii | 58.35 |
| Extended Asfarviridae | Faustovirus e12 | KJ614390 | 465.9 | 37.7 | 36.9 | Vermamoeba vermiformis | 42.48 |
| Extended Asfarviridae | Kaumoebavirus | MT334784.1 | 362.5 | 43.1 | 43.4 | Vermamoeba vermiformis | 42.48 |
| Poxviridae | Fowlpox virus | NC_002188.1 | 291 | 31.2 | 31.3 | Gallus gallus | 42 |
| | | | | | | Meleagris gallopavo | 41.1 |
| Poxviridae | Sheeppox virus | NC_004002.1 | 149.9 | 25 | 25.3 | Meleagris gallopavo | 41.1 |
| | | | | | | Capra hircus | 42.1 |
| | | | | | | Ovis aries | 43.2 |
| Poxviridae | Yokapox virus | NC_015960.1 | 175.7 | 25.6 | 26.2 | Mus musculus | 41.95 |
| Poxviridae | Mule deerpox virus | AY689437.1 | 170.5 | 27 | 27.6 | Odocoileus virginianus | 41.5 |
| Poxviridae | Nile crocodilepox virus | NC_008030.1 | 190 | 61.9 | 62.4 | Crocodylus niloticus | N/A |
| | | | | | | Crocodylus porosus | 43.85 |
| | | | | | | Crocodylus johnsoni | N/A |
| Poxviridae | Myxoma virus | NC_001132.2 | 162.4 | 43.5 | 43.5 | Oryctolagus cuniculus | 43.97 |
| Poxviridae | Eastern kangaroopox virus | MF467281.1 | 170.1 | 54 | 54.3 | Macropus giganteus (eastern gray kangaroo *) | 44.3 |
| Poxviridae | Molluscum contagiosum virus | MH646551.1 | 192.1 | 64.3 | 63.9 | Homo sapiens | 40.4 |
| Poxviridae | Sea otterpox virus | NC_037656.1 | 127.8 | 31.3 | 31.6 | Enhydra lutris | N/A |
| Poxviridae | Vaccinia virus | NC_006998.1 | 182.5 | 33.4 | 34.6 | Homo sapiens | 40.4 |
| | | | | | | Bos taurus | 41.92 |
| Poxviridae | Cotia virus | KM595078.1 | 185.1 | 23.6 | 24.4 | Chlorocebus aethiops | 40.9 |
| | | | | | | Mus musculus | 41.95 |
| Poxviridae | Orf virus | NC_005336.1 | 139.9 | 63.8 | 64.3 | Homo sapiens | 40.4 |
| | | | | | | Capra hircus | 42.1 |
| | | | | | | Ovis aries | 43.2 |
| Poxviridae | Pteropox virus | NC_030656.1 | 133.4 | 33.8 | 34 | Pteropus scapulatus | N/A |
| Poxviridae | Salmon gillpox virus | NC_027707.1 | 241.5 | 37.5 | 37.1 | Salmo salar | N/A |
| Poxviridae | Squirrelpox virus | NC_022563.1 | 148.8 | 66.7 | 67 | Sciurus vulgaris | 39.26 |
| Poxviridae | Swinepox virus | NC_003389.1 | 146.4 | 27.4 | 27.8 | Sus scrofa | 41.6 |
| Poxviridae | Eptesipox virus | NC_035460.1 | 176.6 | 23.6 | 23.9 | Eptesicus fuscus | 43.5 |
| Poxviridae | Yaba monkey tumor virus | NC_005179.1 | 134.7 | 29.8 | 30.1 | Chlorocebus aethiops aethiops | N/A |
| | | | | | | Erythrocebus patas | 41.05 |
| | | | | | | Papio hamadryas | 40.9 |
| | | | | | | Homo sapiens | 40.4 |
| Poxviridae | Anomala cuprea entomopoxvirus | NC_023426.1 | 245.7 | 20 | 20.5 | Anomala cuprea | N/A |
| Poxviridae | Amsacta moorei entomopoxvirus | NC_002520.1 | 232.3 | 17.8 | 18.2 | Lymantria dispar | 38.55 |
| Poxviridae | Melanoplus sanguinipes entomopoxvirus | NC_001993.1 | 236.1 | 18.3 | 19 | Locusta migratoria | 41 |
| | | | | | | Schistocerca gregaria | 42.55 |
| Poxviridae | Diachasmimorpha longicaudata entomopoxvirus | KR095315.1 | 252.9 | 30.1 | 31 | Diachasmimorpha longicaudata | N/A |
| “Pithoviridae” | Pithovirus sibericum | NC_023423.1 | 610 | 35.8 | 40.2 | Acanthamoeba castellanii | 58.35 |
| “Pithoviridae” | Cedratvirus A11 | NC_032108.1 | 589 | 42.7 | 43 | Acanthamoeba castellanii | 58.35 |
| Pithoviridae-like | Orpheovirus | NC_036594.1 | 1,473.5 | 25 | 28.1 | Vermamoeba vermiformis | 42.48 |
| Yaraviridae | Yaravirus brasiliensis | MT293574.1 | 44.9 | 58 | 58 | Acanthamoeba castellanii | 58.35 |
| "Pandoraviridae" | Pandoravirus quercus | NC_037667.1 | 2,077.2 | 60.7 | 64.4 | Acanthamoeba castellanii | 58.35 |
| Unclassified DNA virus | Medusavirus | AP018495.1 | 381.2 | 61.7 | 61.5 | Acanthamoeba castellanii | 58.35 |
| Unclassified DNA virus | Mollivirus sibericum | NC_027867.1 | 651.5 | 60.1 | 60.2 | Acanthamoeba castellanii | 58.35 |

GC% content calculation

After establishing the database containing members of the phylum Nucleocytoviricota, we calculated the GC% content of the CDS/ORFs. We employed the publicly available InfoSeq tool of the EMBOSS suite, provided by EMBL’s European Bioinformatics Institute (available at https://www.ebi.ac.uk/Tools/emboss/) [32]. The InfoSeq output listed each coding sequence within the viral genome, along with its length (kb) and calculated GC% content.
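The per-sequence calculation described above can be illustrated with a minimal Python sketch. It is not the InfoSeq implementation: the function names `parse_fasta` and `gc_percent` are hypothetical, the example sequences are invented, and the choice to count only unambiguous A/C/G/T bases in the denominator is an assumption that may differ from how InfoSeq treats ambiguous bases.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.upper()
    return records


def gc_percent(sequence: str) -> float:
    """Return the percentage of G and C among the A/C/G/T bases of a sequence.

    Assumption: ambiguous bases (e.g. N) are ignored entirely.
    """
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / counted


# Invented two-record example, not taken from the study's dataset.
example = """>orf1 hypothetical protein
ATGCGC
GGCC
>orf2 hypothetical protein
ATATATAT
""".splitlines()

for orf_id, seq in parse_fasta(example).items():
    print(orf_id, len(seq), round(gc_percent(seq), 2))
```

Applied to a GenBank CDS FASTA file, the same two functions would yield one length and one GC% value per coding sequence, which is the shape of data used in the analyses that follow.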

Data analysis and plotting

The Kruskal–Wallis test (p < 0.05) was used to test for differences in the GC% content of CDS/ORFs among nucleocytovirus isolate genomes. Dunn’s multiple comparison test was then used for pairwise comparisons between isolates within a family, subfamily, or group. To better understand the dimension and profile of the variation in GC% content, gene by gene, along the studied viral genomes, scatter plots were created using Microsoft's Power BI tool. GC% values are represented on the y-axis, and relative CDS/ORF positions along the viral genome are represented on the x-axis. A trend line (red) and GC% mean values (dashed blue) were included. Similarly, to visualize ORF GC% ranges within taxa/subgroups, considering the overall minimum, maximum, and mean GC% of each, a heatmap-like graphic analysis was performed using Microsoft’s Excel. Conditional formatting was used to color-code the GC% value of each individual CDS/ORF within the genomes. Blue corresponds to lower GC% values, yellow corresponds to the mean GC% value, and red corresponds to higher GC% values observed in each individual taxon/subgroup analyzed. Because the number of CDS/ORFs varied greatly among nucleocytoviruses (not necessarily in proportion to genome size), we normalized the heatmap-like analysis by the number of CDS/ORFs represented on the scale.
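For readers unfamiliar with the Kruskal–Wallis test, the following Python sketch computes its H statistic with the standard tie correction. It is offered only as an illustration of the rank-based idea, not as the software used in this study; the function name `kruskal_wallis_h` and the GC% values are hypothetical, and the p-value (obtained from a chi-square distribution with k − 1 degrees of freedom) is left to a statistics library such as SciPy's `scipy.stats.kruskal`.

```python
from typing import Dict, Sequence


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Return the tie-corrected Kruskal-Wallis H statistic.

    Assumes at least two groups, each non-empty.
    """
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)

    # Assign average (mid) ranks so that tied values share the same rank.
    rank_of: Dict[float, float] = {}
    tie_term = 0.0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        tie_size = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        tie_term += tie_size ** 3 - tie_size
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    h = 12.0 / (n_total * (n_total + 1)) * rank_sum_term - 3.0 * (n_total + 1)

    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Hypothetical ORF GC% values for three isolates (illustration only).
isolate_a = [43.0, 44.5, 45.1]
isolate_b = [28.0, 28.8, 29.5]
isolate_c = [60.1, 61.7, 62.4]
print(kruskal_wallis_h([isolate_a, isolate_b, isolate_c]))
```

Because the statistic depends only on ranks, it makes no normality assumption about ORF GC% values, which is one common reason for preferring it over a one-way ANOVA for this kind of data.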

Analysis of potential gene duplication, phylogenetic, and HGT inferences

To test the hypothesis that CDS/ORFs of discrepant GC% could be associated with possible gene duplication events or HGT between hosts and viruses, we constructed a new dataset of target sequences. Given the variation of outlier GC% values along the viral genomes and the large number of CDS/ORFs to be analyzed, we chose to evaluate the three highest and three lowest GC% values observed in each individual viral genome. Target sequences were therefore selected for further analysis based on the gene sequences with the minimum and maximum GC% of each viral genome.
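The selection rule described above (three highest and three lowest GC% values per genome) can be sketched in a few lines of Python. The function name `select_extreme_orfs` and the ORF identifiers and values are hypothetical; how ties were resolved in the study is not stated, so this sketch simply keeps the order produced by a stable sort.

```python
from typing import Dict, List, Tuple


def select_extreme_orfs(
    gc_by_orf: Dict[str, float], n: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (lowest, highest) ORF identifiers ranked by GC%.

    The lowest list is ordered from lowest GC% upward; the highest list is
    ordered from highest GC% downward. With fewer than 2 * n ORFs the two
    lists overlap.
    """
    ranked = sorted(gc_by_orf, key=gc_by_orf.get)
    return ranked[:n], ranked[-n:][::-1]


# Hypothetical per-ORF GC% values for one genome (illustration only).
genome = {
    "orf_001": 31.2, "orf_002": 18.4, "orf_003": 44.9, "orf_004": 29.7,
    "orf_005": 52.3, "orf_006": 27.1, "orf_007": 12.6, "orf_008": 47.8,
}
lowest, highest = select_extreme_orfs(genome)
print("lowest:", lowest)
print("highest:", highest)
```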

Following this approach, a targeted CDS/ORF dataset was built and used for phylogenetic and HGT analysis, as well as for a search for gene duplication events using the NCBI BLASTp tool [33] (Supplementary Table 1). To establish possible gene duplication events, BLASTp analysis of target sequences against their original viral genomes and manual curation of the data were performed, considering a threshold of 40% coverage and 40% identity. Subsequently, amino acid sequences from NCBI’s database were compared to the proposed dataset employing the BLASTp tool (e-value threshold of 0.05) [33]. From these local alignments, viral targets that presented homologous sequences outside the virosphere were selected. BLASTp was also used against each viral host-specific database. Genes usually associated with cellular metabolism were favored as targeted sequences, as their identification in viral genomes suggests potential content acquisition from another organism. Lastly, phylogenetic analyses were performed for BLASTp hits considering a coverage threshold of 40% and an identity threshold of 30%. The availability of sequence hits from other (non-viral) taxa was also considered when determining the final dataset for phylogenetic analysis. For each dataset, the sequences were aligned using the MAFFT software, and phylogenetic analyses were conducted using the maximum likelihood method with the IQ-TREE 2 software [34, 35]. The best-fit amino acid substitution model was determined automatically for each dataset. Bootstrap analysis with 1000 replicates was performed. Tree topologies were analyzed to identify potential horizontal gene transfer events according to the methodology described by Irwin et al. (2022) [36].
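The coverage and identity thresholds above amount to a simple filter over BLASTp hits. The sketch below is illustrative only: the `Hit` record and `filter_hits` function are hypothetical, the hits are invented, and treating the thresholds as inclusive, dropping self-hits in the duplication search, and combining the 0.05 e-value cutoff with the phylogeny thresholds are all assumptions rather than details stated in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    query: str       # identifier of the target CDS/ORF
    subject: str     # identifier of the matched sequence
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the query covered by the alignment
    evalue: float


def filter_hits(
    hits: Iterable[Hit],
    min_coverage: float,
    min_identity: float,
    max_evalue: float = float("inf"),
    drop_self: bool = False,
) -> List[Hit]:
    """Keep hits meeting the coverage, identity, and e-value thresholds."""
    kept = []
    for hit in hits:
        if drop_self and hit.query == hit.subject:
            continue  # a sequence always matches itself within its own genome
        if (
            hit.coverage >= min_coverage
            and hit.identity >= min_identity
            and hit.evalue <= max_evalue
        ):
            kept.append(hit)
    return kept


# Invented hits for one target ORF (illustration only).
hits = [
    Hit("orf_A", "orf_A", 100.0, 100.0, 0.0),
    Hit("orf_A", "orf_B", 46.0, 82.0, 1e-30),
    Hit("orf_A", "host_X", 33.0, 55.0, 1e-8),
    Hit("orf_A", "host_Y", 28.0, 70.0, 0.01),
]

# Duplication search: 40% coverage and 40% identity, ignoring self-hits.
paralog_candidates = filter_hits(hits, 40.0, 40.0, drop_self=True)
# Phylogeny dataset: 40% coverage and 30% identity, e-value <= 0.05.
phylogeny_candidates = filter_hits(hits, 40.0, 30.0, max_evalue=0.05, drop_self=True)
print([h.subject for h in paralog_candidates])
print([h.subject for h in phylogeny_candidates])
```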

Results and discussion

Nucleocytoviricota dataset preparation: open reading frame (ORF) vs intergenic regions’ GC% content

In this study, we carefully selected viral genomes to construct our main dataset, including only those with complete genomes available on GenBank (NCBI). Representative genomes from the main viral families of Nucleocytoviricota were chosen, totaling sixty-one complete viral genomes: Phycodnaviridae (n = 4), Mimiviridae (n = 13), Ascoviridae (n = 1), Iridoviridae (n = 5), Marseilleviridae (n = 5), Asfarviridae (n = 1), Poxviridae (n = 22), “Pithoviridae” (n = 2; and 1 pithoviridae-like), “Pandoraviridae” (n = 1), as well as extended Asfarviridae (n = 3), and unclassified DNA viruses (n = 2), according to the proposed ICTV taxonomy in 2020 (Walker et al., 2020). This total includes the recently discovered Yaravirus (Yaraviridae), which we added to our analysis although it is not classified within Nucleocytoviricota, because phylogenomic analyses indicate a close relationship between Yaravirus and nucleocytoviruses [37].

We manually curated data from the Virus-Host Database, indicating host organisms for all viruses within the dataset (see Table 1), and extracted the genomic GC% content of host organisms. For lymphocystis disease virus 1, which has multiple probable hosts among fish species, single representatives with available complete genomes were selected based on NCBI’s taxonomy. However, the genomic GC% content of host organisms was not available for certain viruses, including Heterosigma akashiwo virus 1, Phaeocystis globosa virus group I, Chrysochromulina ericina virus, Tetraselmis viridis virus, invertebrate iridescent virus 3, sea otterpox virus, pteropox virus, salmon gillpox virus, Anomala cuprea entomopoxvirus, and Diachasmimorpha longicaudata entomopoxvirus.

As described in various models, sequence composition may vary along the genome, particularly between intergenic and intragenic regions [6, 38–40]. In our study, we found that coding sequences and intergenic regions tended to present similar sequence composition, as the whole-genome GC% values were comparable to the ORF mean GC% (see Table 1). Contrary to expectations, this suggests there may be no differential selective pressure over these distinct parts of the viral genome, or at least none associated with GC%. However, some exceptions were noted, including decapod iridescent virus 1, invertebrate iridescent virus 6, and Pithovirus sibericum, which exhibited differences between the whole-genome GC% and the ORF GC% mean (4.4 to 6.4 percentage points). Additionally, codon-usage analysis was performed and compared to GC% content profiles, revealing a clear relationship between GC% and the use of codons rich in C and G.
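The exceptions named above can be recovered directly from Table 1 by comparing the whole-genome GC% with the ORF mean GC% of each virus. The Python sketch below does this for a subset of Table 1 rows; the function names and the 4.0 percentage-point cutoff are hypothetical choices made for illustration, not a threshold stated by the study.

```python
from typing import Dict, List, Tuple

# (whole-genome GC%, ORF mean GC%) copied from Table 1 for a subset of viruses.
table1_subset: Dict[str, Tuple[float, float]] = {
    "Decapod iridescent virus 1": (34.6, 29.8),
    "Invertebrate iridescent virus 6": (28.6, 35.0),
    "Pithovirus sibericum": (35.8, 40.2),
    "Acanthamoeba polyphaga mimivirus": (28.0, 28.8),
    "Myxoma virus": (43.5, 43.5),
}


def gc_gap(genome_gc: float, orf_gc: float) -> float:
    """Absolute difference, in percentage points, rounded to one decimal."""
    return round(abs(orf_gc - genome_gc), 1)


def flag_discrepant(
    values: Dict[str, Tuple[float, float]], cutoff: float = 4.0
) -> List[str]:
    """List viruses whose genome-versus-ORF GC% gap reaches the cutoff.

    The cutoff is a hypothetical value chosen for this illustration.
    """
    return [name for name, (g, o) in values.items() if gc_gap(g, o) >= cutoff]


for name, (genome_gc, orf_gc) in table1_subset.items():
    print(f"{name}: {gc_gap(genome_gc, orf_gc)}")
print(flag_discrepant(table1_subset))
```

For the three viruses named in the text, the gaps are 4.8, 6.4, and 4.4 percentage points, which spans the 4.4 to 6.4 range reported above.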

Nucleocytoviricota coding sequence (CDS)/ORF GC% variation

The GC% content of the nucleocytovirus coding sequences (CDS)/ORFs assessed spans a notable range, with a minimum value of 8.13% (Amsacta moorei entomopoxvirus; CDS NP_065034.1) and a maximum value of 83.91% (Orf virus; CDS NP_957782.1), both described as hypothetical proteins of representatives of the Poxviridae family (Fig. 1 and supplementary Table 1). Together, these values show that Poxviridae is the family with the greatest GC% range within Nucleocytoviricota, and thus the one with the most influence over the phylum’s GC% profile.

Fig. 1.

Nucleocytoviricota heatmap-like GC% profile. Lowest GC% values categorized in blue, highest GC% values categorized in red. Minimum, mean and maximum GC% of the entire taxon dataset are used for categorizing ORF values and scale. Respective identified viruses: Paramecium bursaria Chlorella virus 1 (PbCv1), Emiliania huxleyi virus 86 (Ehv86), Ectocarpus siliculosus virus 1 (Esv1), Heterosigma akashiwo virus 1 (Hav1) (Phycodnaviridae, 1–4); Cafeteria roenbergensis virus (Crov), Acanthamoeba polyphaga mimivirus (Apmv), Samba virus (Sbv), Megavirus chiliensis (Mvch), Moumouvirus australiensis (Mvau), Tupanvirus deep ocean (Tvdo), Tupanvirus soda lake (Tvsl), Aureococcus anophagefferens virus (Auanv), Bodo Saltans Virus (Bsv), Phaeocystis globosa virus group I (Phgvg1), Chrysochromulina ericina virus (Chrev), Tetraselmis viridis virus (Tvv), Cotonvirus japonicus (Cvj) (Mimiviridae, 5–17); Spodoptera frugiperda ascovirus 1a (Sfa1a) (Ascoviridae, 18); Lymphocystis disease virus 1 (Ldv1), Frog virus 3 (Fv3), Invertebrate iridescent virus 3 (Iniv3), Decapod iridescent virus 1 (Div1), Invertebrate iridescent virus 6 (Iniv6) (Iridoviridae, 19–23); Marseillevirus marseillevirus (Mvmv), Brazilian marseillevirus (Bramv), Golden marseillevirus (Gmv), Lausannevirus (Lauv), Tunisvirus (Tuv) (Marseilleviridae, 24–28); African swine fever virus strain BA71V (Asfv), Pacmanvirus S19 (PvS19), Faustovirus e12 (Fve12), Kaumoebavirus (Kauv) (Asfarviridae and extended Asfarviridae, 29–32); Fowlpox virus (Fpoxv), Sheeppox virus 17077–99 (Shpoxv), Yokapox virus (Ykpoxv), Mule deerpox virus (Mdpoxv), Nile crocodilepox virus (Ncpoxv), Myxoma virus (Myxv), Eastern kangaroopox virus (Ekpoxv), Molluscum contagiosum virus (Mcv), Sea otterpox virus (Sopoxv), Vaccinia virus (Vacv), Cotia virus (Cov), Pteropox virus (Ptpoxv), Salmon gillpox virus (Sgpoxv), Orf virus (Orfv), Squirrelpox virus (Spoxv), Swinepox virus (Swpoxv), Eptesipox virus (Eptpoxv), Yaba monkey tumor virus (Ymtv), Anomala cuprea entomopoxvirus (AcEpoxv), Amsacta moorei entomopoxvirus (AmEpoxv), Melanoplus sanguinipes entomopoxvirus (MsEpoxv), Diachasmimorpha longicaudata entomopoxvirus (DlEpoxv) (Poxviridae, 33–54); Pithovirus sibericum (PvS), Cedratvirus A11 (CdvA11), Orpheovirus (Orphv) (Pithoviridae and pithoviridae-like, 55–57); Yaravirus brasiliensis (Yvbra) (Yaraviridae, 58); Pandoravirus quercus (Pvq) (“Pandoraviridae”, 59); Medusavirus (Medv) and Mollivirus sibericum (MolS) (Unclassified DNA viruses, 60–61)

In terms of CDS/ORF GC% variation and mean within viral families, the following ranges were observed: (i) From 18.61% to 64.63% in Phycodnaviridae, with Ectocarpus siliculosus virus 1 and Heterosigma akashiwo virus 1 presenting the overall maximum and minimum GC% content, respectively; (ii) From 15.15% to 69.09% in Iridoviridae, with both frog virus 3 and invertebrate iridescent virus 3 presenting significantly higher GC% content in comparison to other family members; (iii) From 27.88% to 57.36% in Marseilleviridae, with no representative standing out; (iv) From 19.66% to 62.96% in Asfarviridae (and extended Asfarviridae), with kaumoebavirus and pacmanvirus S19 presenting the overall maximum and minimum GC% content, respectively; (v) From 9.18% to 57.83% in “Pithoviridae” (and pithoviridae-like), with orpheovirus presenting notably lower GC% content; (vi) From 9.68% to 62.55% in Mimiviridae, with Tetraselmis viridis virus and Cafeteria roenbergensis virus presenting the overall maximum and minimum GC% content, respectively; and (vii) From 8.13% to 83.91% in Poxviridae, with the lower GC% values found in Entomopoxvirinae (Figs. 2 and 3).

Fig. 2.

Heatmap-like GC% profile of nucleocytovirus families. Lowest GC% values (blue) and highest GC% values (red) are represented. Minimum, mean, and maximum GC% of the entire taxon dataset are used for categorization of ORF values and scale. Represented families: Phycodnaviridae, Iridoviridae, Marseilleviridae, Asfarviridae* (Asfarviridae and extended Asfarviridae), and Pithoviridae** (Pithoviridae and pithoviridae-like)

Fig. 3.

Heatmap-like GC% profile of nucleocytovirus families. Lowest GC% values (blue) and highest GC% values (red) are represented. Minimum, mean, and maximum GC% of the entire taxon dataset are used for categorization of ORF values and scale. Represented families: Mimiviridae and Poxviridae

When looking specifically into CDS/ORF GC% variation within individual viral genomes, we observed how the GC% varies in coding sequences along the genome, which could allow for the identification of possible hotspots for HGT or duplication events (Fig. 4). Interestingly, we also observed an absence of specific patterns of GC% variation among coding sequences of the nucleocytovirus families assessed (supplementary Figs. 1–7). Thus, representative genomes of each family were selected to illustrate the main aspects of ORF GC% distribution along the genome (Fig. 4).

Fig. 4.

Scatter plot of ORF GC% variation along genome extension. Dots represent GC% of specific ORF in its position within the virus genome, according to annotation. Tendency line (red), ORF GC% mean (dashed blue), and outliers (red triangles) are represented. Selected representative viruses of each family: A Orfv, B Ldv1, C Fv3, D Vacv, E Asfv, F Mvmv, G PvS, H Chrev, I Ehv86, J Tvdo

To assess whether GC% content differed statistically among the analyzed groups, we performed the Kruskal–Wallis test (p < 0.05) followed by Dunn’s multiple comparison test for pairwise comparisons between isolates within a family, subfamily, or group. This evaluation demonstrated significant differences among isolates within viral families, subfamilies, and groups (supplementary Fig. 8). The result held even for the Marseilleviridae, which had the least GC% variation among all nucleocytovirus families assessed: the five isolate genomes analyzed are significantly different (p < 0.0001) in terms of GC% content (supplementary Fig. 8).

Virus-host GC% similarities and HGT analysis

Virus-host coevolution plays a pivotal role in sharing genomic features between viruses and their hosts. Studies have demonstrated how sequence composition and nucleotide frequency correlation between virus-host pairs can be associated with viral adaptability dynamics, or even used as reliable metrics for inferring virus-host linkages [18, 41, 42]. When comparing the viral GC% content profile of nucleocytoviruses to those of host organisms included in this study, contrary to what one might expect, values were not similar in most cases (Table 1). A few exceptions were noted: Ectocarpus siliculosus virus 1 (51.7%) and its host Ectocarpus siliculosus (53.49%); decapod iridescent virus 1 (34.6%) and its host Penaeus vannamei (36.5%); invertebrate iridescent virus 6 (35% ORF GC mean) and its hosts Spodoptera frugiperda (36.37%) and Chilo suppressalis (35.7%); African swine fever virus (38.6%) and its hosts Chlorocebus aethiops (40.9%), Sus scrofa (41.6%), and Phacochoerus africanus (40.48%); kaumoebavirus (43.1%) and its host Vermamoeba vermiformis (42.48%); myxoma virus (43.5%) and its host Oryctolagus cuniculus (43.97%); and Yaravirus (58%) and its host Acanthamoeba castellanii (58.35%). Although a few patterns of GC% similarity between viruses and hosts were observed in this dataset, it is important to consider that many of these viruses can infect more than one host organism, implying different selective pressures on sequence composition. Additionally, many host organisms did not have complete genome sequences available for GC% calculation, so a comparison with viral GC% content is not currently possible for them.
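The virus-host comparison above reduces to an absolute difference between two GC% values per pair. The following Python sketch applies it to a few virus-host pairs taken from Table 1. The function name `similar_pairs` and the tolerance of 3.0 percentage points are hypothetical: the study does not state a numerical criterion for "similar", so the tolerance serves only to make the comparison concrete.

```python
from typing import List, Tuple

# (virus, viral genomic GC%, host, host genomic GC%) copied from Table 1.
PAIRS: List[Tuple[str, float, str, float]] = [
    ("Ectocarpus siliculosus virus 1", 51.7, "Ectocarpus siliculosus", 53.49),
    ("Kaumoebavirus", 43.1, "Vermamoeba vermiformis", 42.48),
    ("Myxoma virus", 43.5, "Oryctolagus cuniculus", 43.97),
    ("Yaravirus brasiliensis", 58.0, "Acanthamoeba castellanii", 58.35),
    ("Acanthamoeba polyphaga mimivirus", 28.0, "Acanthamoeba castellanii", 58.35),
    ("Squirrelpox virus", 66.7, "Sciurus vulgaris", 39.26),
]


def similar_pairs(
    pairs: List[Tuple[str, float, str, float]], tolerance: float = 3.0
) -> List[Tuple[str, str, float]]:
    """Return (virus, host, gap) for pairs whose GC% gap is within tolerance.

    The tolerance is a hypothetical cutoff in percentage points.
    """
    result = []
    for virus, virus_gc, host, host_gc in pairs:
        gap = abs(virus_gc - host_gc)
        if gap <= tolerance:
            result.append((virus, host, round(gap, 2)))
    return result


for virus, host, gap in similar_pairs(PAIRS):
    print(f"{virus} / {host}: {gap} percentage points")
```

Pairs such as Acanthamoeba polyphaga mimivirus and Acanthamoeba castellanii, roughly 30 percentage points apart, illustrate the more common case in which viral and host GC% diverge widely.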

Although limited evidence of direct correlation was observed between the GC% content of the viruses and hosts analyzed in this work, the influence of host genomic characteristics cannot be excluded as a further explanation for the CDS/ORF GC% variation observed in viral genomes. Genome expansion through the acquisition of genes from host organisms by horizontal gene transfer (HGT) has been widely debated regarding nucleocytoviruses, and the literature has shown that different viral families among Nucleocytoviricota present different tendencies toward HGT events [43]. Overall, Nucleocytoviricota viruses are considered to have some propensity to acquire host genes by HGT, and such events may play an important role in shaping the content diversity and size of these viruses’ genomes [43]. It has even been hypothesized that, because nucleocytoplasmic large DNA viruses (NCLDVs) encode homologs of conserved genes commonly found among the domains of Bacteria, Archaea, and Eukarya, the phylum could be considered a fourth domain of life. However, this hypothesis has not been supported by multiple phylogenetic analyses, which instead indicate multiple independent acquisitions from various cellular lineages [44–47].

Considering this, HGT was hypothesized as one of the possible causes of gene GC% variation in different nucleocytovirus genomes [15], especially in cases where groups of neighboring genes presented GC% content distant from the overall viral GC% mean. For instance, this was observed for Emiliania huxleyi virus 86 (Ehv86) (Fig. 4I), where a cluster of CDS/ORFs with discrepant GC% values (positions 290 to 315) was identified. Although HGT events involving entire metabolic pathways have been previously described for Ehv86 and its host Emiliania huxleyi [48], no hits for any of these CDS/ORFs were found when aligned to the host genome, leaving the discussion open regarding why these sequences differ in GC% composition and how they originated. Similarly, a discrepant GC% CDS/ORF cluster was identified within the genome of Chrysochromulina ericina virus (Chrev) (positions 422 to 442) (Fig. 4H), which may be linked to Chrev’s remarkable genomic characteristics, such as abundant mobile genetic elements, complex gene evolution, and host gene acquisition, among other features [49].
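Clusters such as those described above were identified from the scatter plots. As a purely illustrative alternative, a simple heuristic can flag runs of consecutive ORFs whose GC% lies far from the genome-wide ORF mean. The sketch below is not the procedure used in this study: the function name `discrepant_runs`, the z-score cutoff, and the minimum run length are hypothetical parameters, and the input values are synthetic.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def discrepant_runs(
    gc_values: Sequence[float], z_cutoff: float = 1.5, min_run: int = 3
) -> List[Tuple[int, int]]:
    """Find runs of consecutive ORFs with outlying GC%.

    Returns 1-based, inclusive (start, end) ORF positions of every run of at
    least `min_run` consecutive ORFs whose GC% deviates from the mean by at
    least `z_cutoff` population standard deviations.
    """
    mu = mean(gc_values)
    sd = pstdev(gc_values)
    if sd == 0:
        return []

    runs: List[Tuple[int, int]] = []
    start = None
    for idx, gc in enumerate(gc_values):
        outlying = abs(gc - mu) / sd >= z_cutoff
        if outlying and start is None:
            start = idx
        elif not outlying and start is not None:
            if idx - start >= min_run:
                runs.append((start + 1, idx))
            start = None
    # Close a run that extends to the last ORF.
    if start is not None and len(gc_values) - start >= min_run:
        runs.append((start + 1, len(gc_values)))
    return runs


# Synthetic genome: 20 ORFs near 40% GC, 4 ORFs near 60%, then 20 near 40%.
synthetic = [40.0] * 20 + [60.0] * 4 + [40.0] * 20
print(discrepant_runs(synthetic))
```

A heuristic of this kind only points to candidate regions; whether a flagged cluster reflects HGT, duplication, or another process still requires the homology and phylogenetic analyses described in the Methods.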

After extensive data curation, we carefully selected 30 genes from our target CDS/ORF dataset based on the three maximum and three minimum GC% values observed for each viral genome (Supplementary Table 1). We then proceeded to assess phylogeny and evaluate HGT events according to the methodology described by Irwin et al. (2022) [36]. Among all analyzed targets, 14 potential HGT events were identified, such as the “Histone H2B/H2A fusion protein” (AMQ10945.1) of the Brazilian marseillevirus (Supplementary Figs. 9–35; Fig. 5A). Interestingly, a probable HGT from virus to host was observed (Fig. 5B) when evaluating the targeted sequence of “Papain-like cysteine peptidase” (YP_009310305.1) of Golden marseillevirus. Moreover, these findings are in accordance with previously described HGT events in Marseilleviridae members [50, 51].

Fig. 5. Phylogenetic and HGT inference. Representative targets: (A) HGT from host to virus, target “Histone H2B/H2A fusion protein” (AMQ10945.1) of Brazilian marseillevirus (KT752522), and lausannevirus (NC_015326.1); (B) HGT from virus to host, target “Papain-like cysteine peptidase” (YP_009310305.1) of Golden marseillevirus (NC_031465.1); and (C) no evidence of HGT, target “CD47-like protein” (NP_659700.1) of Sheeppox virus (NC_004002.1). Cyan: viral clusters; orange: target viral sequence; black: non-viral cluster. Analysis performed with 1000 replicates.

Finally, other analyzed sequences were considered inconclusive or did not indicate potential HGT events. Examples include the “CD47-like protein” (NP_659700.1) of Sheeppox virus (Fig. 5C), for which no evidence of HGT was found, and the “Putative replication factor and/or DNA binding/packing protein” (NP_078747.1) of Lymphocystis disease virus 1 and the “NAD-dependent DNA ligase” (ATE87064.1) of Decapod iridescent virus 1, both inconclusive regarding HGT inference. This does not exclude the possibility that such events would be observed more frequently if all identified GC% outliers among the CDS/ORFs of viral genomes, as well as other viral isolates, were assessed in future studies. Therefore, HGT remains a possible cause of GC% variation within viral genomes that is yet to be further explored.

Potential paralogs and gene duplication events

Another hypothesis for the gene GC% variation observed in this study was the presence of duplicated ORFs. In this scenario, if a given ORF had one or more copies along the viral genome, one copy would be expected to remain conserved while the other copies could evolve under different selective pressures [52–56], likely resulting in GC% content variation. Indeed, gene duplication is one of the known mechanisms by which genome evolution can be accelerated, allowing for the emergence of new genes with different functions [52, 55]. Moreover, gene duplication has already been described as a major component of genome expansion and genetic diversity among giant virus genomes [57].

Further evidence that gene duplication could be associated with GC% variation comes from the Poxviridae family, specifically from the inverted terminal repeats (ITRs). ITRs are terminal genome regions containing inverted duplicated sequences and are known to harbor most of the variable genetic content of poxvirus genomes, while conserved genes are primarily found in the central genome region [58–60]. This aligns with the distribution of most of the GC% variation in poxvirus genes observed along the length of the genome in the present work (Supplementary Fig. 6).

Among all selected ORFs of maximum and minimum GC% analyzed (n = 336), we identified 60 genes with at least one potential duplication, using a threshold of 40% coverage and 40% identity. Of these ORFs, 46 presented two potential copies, five presented three potential copies, and nine presented four or more potential copies. The largest number of probable copies identified for a single ORF was 17, for the “Putative ankyrin repeat protein” (AMK61738.1) of Samba virus. The majority (70%) of the identified ORFs are annotated as hypothetical proteins, while the remaining 30% are miscellaneous. Underrepresented miscellaneous groups of probable paralogs included “MHC-like TNF binding protein” (3.3%), “EFc gene family protein” (3.3%), “Putative ankyrin repeat protein” (3.3%), “Chemokine-binding protein” (1.7%), and “Collagen and repeat containing protein” (1.7%) (Supplementary Table 1).
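The 40% identity and 40% coverage threshold can be applied to pairwise alignment hits along the lines of the sketch below. This is an assumption-laden illustration rather than the authors' pipeline: the hit layout `(query, subject, percent_identity, alignment_length, query_length)`, the definition of coverage as aligned length over query length, and the function name `putative_paralogs` are all hypothetical.

```python
def putative_paralogs(hits, min_identity=40.0, min_coverage=40.0):
    """Map each query ORF to the set of other ORFs passing both thresholds.

    hits: iterable of (query, subject, percent_identity, alignment_length,
    query_length) tuples. Coverage is computed here as
    100 * alignment_length / query_length, which is an assumption.
    """
    paralogs = {}
    for query, subject, identity, aln_len, query_len in hits:
        if query == subject or query_len <= 0:
            continue  # skip self-hits and malformed records
        coverage = 100.0 * aln_len / query_len
        if identity >= min_identity and coverage >= min_coverage:
            paralogs.setdefault(query, set()).add(subject)
    return paralogs


# Toy hits within one hypothetical genome.
toy_hits = [
    ("orf1", "orf1", 100.0, 300, 300),  # self-hit, ignored
    ("orf1", "orf7", 45.0, 150, 300),   # 45% identity, 50% coverage: kept
    ("orf1", "orf9", 39.9, 300, 300),   # identity below threshold
    ("orf2", "orf3", 80.0, 50, 300),    # coverage below threshold
]
print(putative_paralogs(toy_hits))  # -> {'orf1': {'orf7'}}
```

Whether the query itself is counted among the "potential copies" is a reporting convention that this sketch leaves open; it only returns the other ORFs that pass the filter.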

Of the 60 identified ORFs, 55% belong to the Poxviridae family, followed by Mimiviridae (13.3%), Phycodnaviridae (8.3%), Iridoviridae and “Pithoviridae” (6.7% each), and Marseilleviridae and Asfarviridae (5% each) (Supplementary Table 1). It is worth noting that a higher number of duplicated genes is expected in poxvirus genomes because of their ITRs. Moreover, the identification of probable gene duplication events based on GC% variation among genomes has yielded interesting targets for future orthology studies. Further investigation of gene duplication and orthology in nucleocytovirus genomes should be conducted to better understand these events.

Perspectives in sequence composition studies of viral genomes

When studying genomes and sequence composition, various approaches can be considered, one of which is the evaluation of GC% content. However, the proportion of guanines and cytosines in a DNA/RNA sequence is just one of many nucleotide composition measures. A different metric, although still based on G and C, is the frequency of CpG dinucleotides, which differs from GC% content in that it specifically counts cytosines immediately followed by guanines on the same strand, linked by a phosphate (Cytosine-Phosphate-Guanine). CpG dinucleotides are known to be associated with gene regulation through methylation, cancer-inducing factors, and even virulence, since certain antiviral proteins specifically bind to CpG [61–64].
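The difference between GC% and CpG content can be made concrete with a short sketch. The function names `gc_percent` and `cpg_obs_exp` are hypothetical; the observed/expected CpG ratio used here, (number of CpG × sequence length) / (number of C × number of G), is a commonly used normalization and is offered as an assumption, not as the metric of any particular study cited above.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap with itself, so str.count gives the exact count.
    return seq.count("CG") * len(seq) / (c * g)


# Two toy sequences with identical GC% but very different CpG content.
for s in ("ACGTACGT", "AGCTAGCT"):
    print(s, gc_percent(s), cpg_obs_exp(s))
# ACGTACGT 50.0 4.0
# AGCTAGCT 50.0 0.0
```

Both toy sequences have 50% GC, yet only the first contains CpG dinucleotides, which is why the two metrics can lead to different biological conclusions.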

In addition to CpG, other dinucleotide compositions in genomic sequences can provide important information about viruses and host organisms. In fact, different dinucleotide relative ratios can reflect the chemistry of dinucleotide stacking energies and base-step conformational tendencies of an organism, as well as species-specific properties of DNA modification, replication, and repair mechanisms [65]. Regarding other nucleotide composition evaluations, trinucleotide and tetranucleotide compositions should also be considered as relevant metrics for retrieving biological information from genome sequences of viruses and hosts [66, 67].
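A minimal sketch of these composition metrics is given below, assuming the usual odds-ratio definition of dinucleotide relative abundance, ρ(XY) = f(XY) / (f(X) · f(Y)), computed on a single strand. Analyses of double-stranded DNA often symmetrize this by also including the reverse-complement strand, which is omitted here for brevity. The function names are hypothetical, and the same k-mer counter also covers tri- and tetranucleotides.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequencies of overlapping k-mers made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def dinucleotide_odds_ratios(seq: str) -> dict:
    """rho(XY) = f(XY) / (f(X) * f(Y)) for every dinucleotide XY (single strand)."""
    mono = kmer_frequencies(seq, 1)
    di = kmer_frequencies(seq, 2)
    return {
        x + y: di.get(x + y, 0.0) / (mono[x] * mono[y])
        for x, y in product("ACGT", repeat=2)
        if mono.get(x) and mono.get(y)
    }


toy = "ACGT" * 10
rho = dinucleotide_odds_ratios(toy)
print(round(rho["CG"], 3), rho["AA"])  # CG is over-represented, AA is absent
tetra = kmer_frequencies(toy, 4)       # tetranucleotide profile of the same sequence
print(sorted(tetra))
```

Values of ρ well above or below 1 indicate over- or under-representation of a dinucleotide relative to what the base composition alone would predict.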

Furthermore, given that assessing the GC% content profile of the phylum Nucleocytoviricota has provided an insightful view of large viral genomes, we consider other metrics of sequence composition analysis to be promising, especially since much information remains to be unveiled from the sequence composition features of nucleocytoviruses.

Supplementary Information

Below is the link to the electronic supplementary material.

42770_2024_1496_MOESM2_ESM.docx (83.5KB, docx)

Supplementary file 2. ORFs of minimum and maximum GC% content. Targeted sequences for phylogenetic assessment (in bold) and gene duplication evaluation. Taxon, virus species, target/protein ID, GC% content of the ORF, ORF description, duplication (identified or not), probable copies of the gene (number), and GenBank accession for each hit. Threshold of 40% identity and 40% coverage. (DOCX 83 KB)

Acknowledgements

We thank our colleagues from Laboratório de Vírus—UFMG for their technical support.

Authors' contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Amanda Stéphanie Arantes Witt, João Victor Rodrigues Pessoa Carvalho, and Mateus Sá Magalhães Serafim. Jônatas Abrahão designed and supervised the study. The first draft of the manuscript was written by Amanda Stéphanie Arantes Witt, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript. This text has been revised by artificial intelligence.

Funding

We acknowledge financial support from Rede Vírus—Ministério da Ciência, Tecnologia e Inovações (MCTI), Câmara Pox—405249/2022–5. We thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), grant number 88882.348380/2010–1, Fundação de Amparo à Pesquisa do estado de Minas Gerais (FAPEMIG), Programas Institutos Nacionais de Ciência e Tecnologia (INCT), grant number 406441/2022–7, chamada 58/2022, and Pró-Reitorias de Pesquisa e Pós-Graduação of UFMG. J.S.A. is a CNPq researcher.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author.

Declarations

Ethics approval

This work is registered at SISGEN – Ministério do Meio Ambiente – number A2291C9.

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors (DNA polymerase/nucleotide sequences/bacteriophage ϕX174). Proc Natl Acad Sci U S A 74(12):5463–5467. 10.1073/pnas.74.12.5463
  • 2. Shendure J, Balasubramanian S, Church GM, Gilbert W, Rogers J, Schloss JA, Waterston RH (2017) DNA sequencing at 40: Past, present and future. Nature 550(7676):345–353. 10.1038/nature24286
  • 3. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE (2015) Big data: Astronomical or genomical? PLoS Biol 13(7):e1002195. 10.1371/journal.pbio.1002195
  • 4. Galtier N, Lobry JR (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J Mol Evol 44(6):632–6. 10.1007/pl00006186
  • 5. Mugal CF, Arndt PF, Ellegren H (2013) Twisted signatures of GC-biased gene conversion embedded in an evolutionary stable karyotype. Mol Biol Evol 30:1700–1712. 10.1093/molbev/mst067
  • 6. Vinogradov AE (2003) DNA helix: The importance of being GC-rich. Nucleic Acids Res. 10.1093/nar/gkg296
  • 7. Hayek N (2013) Lateral transfer and GC content of bacterial resistance genes. Front Microbiol Front Res Found. 10.3389/fmicb.2013.00041
  • 8. Hildebrand F, Meyer A, Eyre-Walker A (2010) Evidence of selection upon genomic GC-content in bacteria. PLoS Genet 6(9):e1001107. 10.1371/journal.pgen.1001107
  • 9. Hu EZ, Lan XR, Liu ZL, Gao J, Niu DK (2022) A positive correlation between GC content and growth temperature in prokaryotes. BMC Genomics 23(1):110. 10.1186/s12864-022-08353-7
  • 10. Lassalle F, Périan S, Bataillon T, Nesme X, Duret L, Daubin V (2015) GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands. PLoS Genet 11(2):e1004941. 10.1371/journal.pgen.1004941
  • 11. Romiguier J, Ranwez V, Douzery EJP, Galtier N (2010) Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res 20(8):1001–1009. 10.1101/gr.104372.109
  • 12. Wu H, Zhang Z, Hu S, Yu J (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol Direct 7:2. 10.1186/1745-6150-7-2
  • 13. Aslam S, Lan XR, Zhang BW, Chen ZL, Wang L, Niu DK (2019) Aerobic prokaryotes do not have higher GC contents than anaerobic prokaryotes, but obligate aerobic prokaryotes have. BMC Evol Biol 19(1):35. 10.1186/s12862-019-1365-8
  • 14. Gal-Mor O, Finlay BB (2006) Pathogenicity islands: A molecular toolbox for bacterial virulence. Cell Microbiol. 10.1111/j.1462-5822.2006.00794.x
  • 15. Maumus F, Blanc G (2016) Study of gene trafficking between acanthamoeba and giant viruses suggests an undiscovered family of amoeba-infecting viruses. Genome Biol Evol 8(11):3351–3363. 10.1093/gbe/evw260
  • 16. Pál C, Papp B, Lercher MJ (2005) Horizontal gene transfer depends on gene content of the host. Bioinformatics 21(Suppl 2):ii222-3. 10.1093/bioinformatics/bti1136
  • 17. Weissman JL, Fagan WF, Johnson PLF (2019) Linking high GC content to the repair of double strand breaks in prokaryotic genomes. PLoS Genet 15(11):e1008493. 10.1371/journal.pgen.1008493
  • 18. Lobo FP, Mota BEF, Pena SDJ, Azevedo V, Macedo AM, Tauch A, Machado CR, Franco GR (2009) Virus-host coevolution: Common patterns of nucleotide motif usage in Flaviviridae and their hosts. PLoS One 4(7):e6282. 10.1371/journal.pone.0006282
  • 19. Mihara T, Nishimura Y, Shimizu Y, Nishiyama H, Yoshikawa G, Uehara H, Hingamp P, Goto S, Ogata H (2016) Linking virus genomes with host taxonomy. Viruses 8(3):66. 10.3390/v8030066
  • 20. Simón D, Cristina J, Musto H (2021) Nucleotide composition and codon usage across viruses and their respective hosts. Front Microbiol 12:646300. 10.3389/fmicb.2021.646300
  • 21. Koonin EV, Krupovic M, Dolja V (2022) The global virome: How much diversity and how many independent origins? Environ Microbiol 25(1):40–44. 10.1111/1462-2920.16207
  • 22. Breman JG, Henderson DA (2002) Diagnosis and management of Smallpox. National Institutes of Health. 346(17):1300–8. 10.1056/NEJMra020025
  • 23. Minhaj FS, Ogale YP, Whitehill F, Schultz J, Foote M, Davidson W, Hughes CM, Wilkins K, Bachmann L, Chatelain R, Donnelly MAP et al (2022) Morbidity and mortality weekly report monkeypox outbreak-nine States, 80:104286. 10.1016/j.amsu.2022.104286
  • 24. Scola B La, Audic S, Robert C, Jungang L, De Lamballerie X, Drancourt M, Birtles R, Claverie J-M, Raoult D (2003) A giant virus in amoebae. 299(5615):2033. 10.1126/science.1081867
  • 25. Trindade GS, Emerson GL, Carroll DS, Kroon EG, Damon IK (2007) Brazilian vaccinia viruses and their origins. 13(7):965–72. 10.3201/eid1307.061404
  • 26. Van Etten JL, Graves MV, Müller DG, Boland W, Delaroque N (2002) Phycodnaviridae - Large DNA algal viruses. Arch Virol. 10.1007/s00705-002-0822-6
  • 27. Abrahão J, Silva L, Silva LS, Khalil JYB, Rodrigues R, Arantes T, Assis F, Boratto P, Andrade M, Kroon EG, Ribeiro B, Bergier I, Seligmann H, Ghigo E, Colson P, Levasseur A, Kroemer G, Raoult D, La Scola B (2018) Tailed giant Tupanvirus possesses the most complete translational apparatus of the known virosphere. Nat Commun 9(1):749. 10.1038/s41467-018-03168-1
  • 28. Philippe N, Legendre M, Doutre G, Couté Y, Poirot O, Lescot M, Arslan D, Seltzer V, Bertaux L, Bruley C, Garin J, Claverie JM, Abergel C (2013) Pandoraviruses: Amoeba viruses with genomes up to 2.5 Mb reaching that of parasitic eukaryotes. Science (1979) 341(6143):281–286. 10.1126/science.1239181
  • 29. Lefkowitz EJ, Dempsey DM, Hendrickson RC, Orton RJ, Siddell SG, Smith DB (2018) Virus taxonomy: The database of the International Committee on Taxonomy of Viruses (ICTV). Nucleic Acids Res 46(D1):D708–D717. 10.1093/nar/gkx932
  • 30. Walker PJ, Siddell SG, Lefkowitz EJ, Mushegian AR, Adriaenssens EM, Dempsey DM, Dutilh BE, Harrach B, Harrison RL, Hendrickson RC et al (2020) Changes to virus taxonomy and the Statutes ratified by the International Committee on Taxonomy of Viruses. Arch Virol 168(7):175. 10.1007/s00705-023-05797-4
  • 31. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL (2000) GenBank. Nucleic Acids Res 28(1):15–8. 10.1093/nar/28.1.15
  • 32. Rice P, Longden I, Bleasby A (2000) EMBOSS: the European molecular biology open software suite. Trends Genet 16(6):276–277. 10.1016/s0168-9525(00)02024-2
  • 33. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL (2008) NCBI BLAST: a better web interface. Nucleic Acids Res 36(Web Server issue):W5-9. 10.1093/nar/gkn201
  • 34. Madeira F, Pearce M, Tivey ARN, Basutkar P, Lee J, Edbali O, Madhusoodanan N, Kolesnikov A, Lopez R (2022) Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Res 50(W1):W276–W279. 10.1093/nar/gkac240
  • 35. Nguyen LT, Schmidt HA, Von Haeseler A, Minh BQ (2015) IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32(1):268–274. 10.1093/molbev/msu300
  • 36. Irwin NAT, Pittis AA, Richards TA, Keeling PJ (2022) Systematic evaluation of horizontal gene transfer between eukaryotes and viruses. Nat Microbiol 7(2):327–336. 10.1038/s41564-021-01026-3
  • 37. Miranda Boratto PV, Oliveira GP, Abrahão JS (2022) “Yaraviridae”: a proposed new family of viruses infecting Acanthamoeba castellanii. Arch Virol 167(2):711–715. 10.1007/s00705-021-05326-1
  • 38. Bernaola-Galván P, Oliver JL, Carpena P, Clay O, Bernardi G (2004) Quantifying intrachromosomal GC heterogeneity in prokaryotic genomes. Gene 333:121–133. 10.1016/j.gene.2004.02.042
  • 39. Bohlin J, Eldholm V, Pettersson JHO, Brynildsrud O, Snipen L (2017) The nucleotide composition of microbial genomes indicates differential patterns of selection on core and accessory genomes. BMC Genomics 18(1):151. 10.1186/s12864-017-3543-7
  • 40. Wen-Hua Q, Chao-chao Y, Wu-Jiao L, Xue-Mei J, Guang-Zhou L, Xiu-Yue Z, Ting-Zhang H, Jing L, Bi-Song Y (2016) Distinct patterns of simple sequence repeats and GC distribution in intragenic and intergenic regions of primate genomes. Aging. 8(11):2635–2654. 10.18632/aging.101025
  • 41. Monier A, Claverie J-M, Ogata H (2007) Horizontal gene transfer and nucleotide compositional anomaly in large DNA viruses. BMC Genomics 8:456. 10.1186/1471-2164-8-456
  • 42. Roux S, Hallam SJ, Woyke T, Sullivan MB (2015) Viral dark matter and virus–host interactions resolved from publicly available microbial genomes. Elife 4:e08490. 10.7554/eLife.08490
  • 43. Filée J (2009) Lateral gene transfer, lineage-specific gene expansion and the evolution of Nucleo Cytoplasmic Large DNA viruses. J Invertebr Pathol 101(3):169–71. 10.1016/j.jip.2009.03.010
  • 44. Mönttinen HAM, Bicep C, Williams TA, Hirt RP (2021) The genomes of nucleocytoplasmic large DNA viruses: viral evolution writ large. Microb Genom 7(9):000649. 10.1099/mgen.0.000649
  • 45. Raoult D, Audic S, Robert C, Abergel C, Renesto P, Ogata H, La Scola B, Suzan M, Claverie JM (2004) The 1.2-megabase genome sequence of Mimivirus. Science 306(5700):1344–50. 10.1126/science.1101485
  • 46. Williams TA, Embley TM, Heinz E (2011) Informational gene phylogenies do not support a fourth domain of life for nucleocytoplasmic large DNA viruses. PLoS One 6(6):e21080. 10.1371/journal.pone.0021080
  • 47. Woese C (1998) The universal ancestor. Proc Natl Acad Sci U S A 95(12):6854–9. 10.1073/pnas.95.12.6854
  • 48. Monier A, Pagarete A, De Vargas C, Allen MJ, Read B, Claverie JM, Ogata H (2009) Horizontal gene transfer of an entire metabolic pathway between a eukaryotic alga and its DNA virus. Genome Res 19(8):1441–1449. 10.1101/gr.091686.109
  • 49. Gallot-Lavallée L, Blanc G, Claverie J-M (2017) Comparative genomics of Chrysochromulina Ericina virus and other microalga-infecting large DNA viruses highlights their intricate evolutionary relationship with the established Mimiviridae family. J Virol 91(14):e00230-17. 10.1128/JVI.00230-17
  • 50. Bertelli C, Greub G (2012) Lateral gene exchanges shape the genomes of amoeba-resisting microorganisms. Front Cell Infect Microbiol. 10.3389/fcimb.2012.00110
  • 51. Boyer M, Yutin N, Pagnier I, Barrassi L, Fournous G, Espinosa L, Robert C, Azza S, Sun S, Rossmann MG, Suzan-Monti M, La Scola B, Koonin E V, Raoult D (2009) Giant Marseillevirus highlights the role of amoebae as a melting pot in emergence of chimeric microorganisms. 106(51):21848–53. 10.1073/pnas.0911354106
  • 52. Gabaldón T, Koonin EV (2013) Functional and evolutionary implications of gene orthology. Nat Rev Genet. 10.1038/nrg3456
  • 53. Gao Y, Zhao H, Jin Y, Xu X, Han GZ (2017) Extent and evolution of gene duplication in DNA viruses. Virus Res 240:161–165. 10.1016/j.virusres.2017.08.005
  • 54. He X, Zhang J (2005) Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169(2):1157–1164. 10.1534/genetics.104.037051
  • 55. Magadum S, Banerjee U, Murugan P, Gangapur D, Ravikesavan R (2013) Gene duplication as a major force in evolution. J Genet 92(1):155–61. 10.1007/s12041-013-0212-8
  • 56. Shackelton LA, Holmes EC (2004) The evolution of large DNA viruses: Combining genomic information of viruses and their hosts. Trends Microbiol. 10.1016/j.tim.2004.08.005
  • 57. Machado TB, Picorelli ACR, de Azevedo BL, de Aquino ILM, Queiroz VF, Rodrigues RAL et al (2023) Gene duplication as a major force driving the genome expansion in some giant viruses. J Virol 97(12):e0130923. 10.1128/jvi.01309-23
  • 58. Brennan G, Stoian AMM, Yu H, Rahman MJ, Banerjee S, Stroup JN, Park C, Tazi L, Rothenburg S (2023) Molecular mechanisms of poxvirus evolution. mBio Am Soc Microbiol. 10.1128/mbio.01526-22
  • 59. Wittek R (1982) Organization and expression of the poxvirus genome. Experientia Generalia 38(3):285–97. 10.1007/BF01949349
  • 60. Wittek R, Menna A, Müller HK, Schumperli D, Boseley PG, Wyler R (1978) Inverted terminal repeats in rabbit poxvirus and vaccinia virus DNA. J Virol 28(1):171–81. 10.1128/JVI.28.1.171-181.1978
  • 61. Bergbauer M, Kalla M, Schmeinck A, Göbel C, Rothbauer U, Eck S, Benet-Pagés A, Strom TM, Hammerschmidt W (2010) CpG-methylation regulates a class of Epstein-Barr virus promoters. PLoS Pathog 6(9):e1001114. 10.1371/journal.ppat.1001114
  • 62. Fernandez AF, Rosales C, Lopez-Nieva P, Graña O, Ballestar E, Ropero S, Espada J, Melo SA, Lujambio A, Fraga MF, Pino I, Javierre B et al (2009) The dynamic DNA methylomes of double-stranded DNA viruses associated with human cancer. Genome Res 19:438–451. 10.1101/gr.083550.108
  • 63. Willis DB, Granoff A (1980) Frog virus 3 DNA is heavily methylated at CpG sequences. Virology 107(1):250–7. 10.1016/0042-6822(80)90290-1
  • 64. Xia X (2020) Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense. Mol Biol Evol 37(9):2699–2705. 10.1093/molbev/msaa094
  • 65. Karlin S, Burge C (1991) Dinucleotide relative abundance extremes: a genomic signature. Proc Natl Acad Sci USA 11(7):283–90. 10.1016/s0168-9525(00)89076-9
  • 66. Perry SC, Beiko RG (2010) Distinguishing Microbial Genome Fragments Based on Their Composition: Evolutionary and Comparative Genomic Perspectives. Genome Biol Evol 2:117–131. 10.1093/gbe/evq004
  • 67. Pride DT, Wassenaar TM, Ghose C, Blaser MJ (2006) Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics 7:8. 10.1186/1471-2164-7-8



Articles from Brazilian Journal of Microbiology are provided here courtesy of Brazilian Society of Microbiology
