Significance
Ancient animals left diverse physical fossil records from which we can deduce that species with extraordinary features once populated our planet. By infecting germlines, some ancient viruses deposited genetic fossil records. However, inferring that a sequence is a viral fossil has so far required homology to circulating viruses. We developed a method to recognize viral fossils that do not closely resemble known viruses. Rather than homology, we detected sequence patterns of fossilized and modern RNA viruses that distinguish them from human sequences. Our results indicate that as-yet-undiscovered fossils from unknown viruses remain hidden in animal genomes. These relics of the ancient virosphere, including sequences reported here, will expand our knowledge about the diversity of ancient viruses and also our genomes.
Keywords: endogenous RNA virus, human genome, paleovirology, machine learning
Abstract
Understanding the genetics and taxonomy of ancient viruses will give us great insights into not only the origin and evolution of viruses but also how viral infections played roles in our evolution. Endogenous viruses are remnants of ancient viral infections and are thought to retain the genetic characteristics of viruses from ancient times. In this study, we used machine learning of endogenous RNA virus sequence signatures to identify viruses in the human genome that have not been detected or are already extinct. Here, we show that the k-mer occurrence of ancient RNA viral sequences remains similar to that of extant RNA viral sequences and can be differentiated from that of other human genome sequences. Furthermore, using this characteristic, we screened RNA viral insertions in the human reference genome and found virus-like insertions with phylogenetic and evolutionary features indicative of an exogenous origin but lacking homology to previously identified sequences. Our analysis indicates that animal genomes still contain unknown virus-derived sequences and provides a glimpse into the diversity of the ancient virosphere.
Recent advances in metagenomic analysis have shown that viruses in nature are more diverse than previously thought, and many viruses with no sequence similarity to known viruses exist, yet undiscovered, in the biosphere. Detecting viral diversity and discovering new viruses can lead to a comprehensive understanding of the coexistence between viruses and organisms and provide effective tools with which to predict the emergence of novel viruses with epidemic or pandemic potential.
There is no reason to suspect that ancient viruses were less diverse than current viruses. Understanding the genetics and taxonomy of ancient viruses, including extinct viruses, will provide great insights into not only the origin and evolution of viruses but also how viral infections played roles in our evolution and how we have coexisted with potential pathogens. However, much remains unknown about the diversity of ancient viruses.
The clue to the existence of ancient viruses is found in our genomes. Genome sequences called endogenous viruses are remnants of ancient viral infections in an organism’s genome that are thought to retain the genetic characteristics of the viruses that prevailed in ancient times (1). In addition to retroviruses, which are well-recognized as endogenized relics, sequences from RNA viruses, called nonretroviral endogenous RNA virus elements (nrEVEs), have also been inserted into animal genomes (2–5). For example, endogenous bornavirus- and filovirus-like elements show detectable sequence similarity to their extant relatives, indicating that ancient viruses were directly linked to the evolution of current viral lineages (6–11). On the other hand, some nrEVEs fall into lineages distantly related to current viruses at the genus or family level (2, 5). These findings indicate that the detection of nrEVEs in animal genomes would provide a better understanding of past viral diversity.
Current methods used to identify nrEVEs depend heavily on pairwise sequence similarity to known viral sequences (12, 13). Therefore, our knowledge of ancient viruses is inevitably biased toward those that are relatively similar to known viruses. In particular, RNA viruses may lose similarity to extant viruses due to the rapid evolution of viral genomes, and even the ancestors of existing viruses may not be detected. Furthermore, it is possible that ancestors of yet-to-be-recognized extant viruses, or extinct viruses, have also been endogenized in animal genomes. Thus, a comprehensive analysis of nrEVEs in animal genomes would require a new detection method based on a defining feature of viruses that does not depend on pairwise similarity to known viruses.
Extant viruses have been found to share certain patterns in the occurrence of nucleic acid combinations of length k, called k-mers. The dinucleotide (k = 2) composition is generally uniform in an animal RNA virus family (14). Prokaryotic viral sequences have distinctive k-mer frequencies that distinguish them from the sequences of the host (15). k-mer occurrence in viral genomes is thought to be shaped by several selective constraints, such as codon usage bias, which buffers against error-prone replication, and the low-CG dinucleotide property that allows viruses to evade immune response (16, 17). These observations suggest the possibility that both ancient and modern viruses share defining k-mer signatures.
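To make the notion of a k-mer occurrence profile concrete, the following Python sketch counts overlapping k-mers in a nucleotide sequence and normalizes the counts to frequencies. The function name `kmer_frequencies` and the decision to skip windows containing characters other than A, C, G, and T are assumptions made for this illustration; they are not details taken from the study.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k k-mers in seq (overlapping windows)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # assumption: windows with N or other symbols are skipped
            counts[window] += 1
    total = sum(counts.values())
    return {km: (c / total if total else 0.0) for km, c in counts.items()}


# Toy sequence, invented for illustration.
profile = kmer_frequencies("ACGTACGTNACG", 2)
```

The resulting vector has 4^k entries (16 for k = 2, 64 for k = 3), which is the kind of fixed-length representation a classifier can take as input.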
In this study, we employ machine learning of sequence signatures of ancient RNA viruses to search for nrEVEs without local sequence similarity to known viruses and demonstrate the presence of nrEVEs originating from an as-yet-unrecognized infectious agent in the human genome. Interestingly, we find that the k-mer frequencies of nrEVEs are more similar to those of current RNA viral sequences than to those of human genomic sequences. Furthermore, we discover not only previously unexplored ancient bornavirus-derived insertions but also a viral-like insertion, named predicted viral insertion (PVI), in the human genome, which is not homologous to known viral sequences but has exogenously-derived features. We also show that the PVI-related sequences have independently invaded mammalian lineages, suggesting that an unknown virus-like agent was invading host genomes during the mammalian radiation. Our findings will open a window for exploring viral diversities and evolution and expand our view of the virosphere in ancient times.
Results
Machine Learning Distinguishes nrEVEs from Other Human Sequences.
To uncover nrEVEs hidden in animal genomes that cannot be detected with conventional pairwise similarity searches, we first hypothesized that nrEVEs may have different nucleotide sequence compositions from those of “nonviral” sequences in the human genome. To examine this, we focused on the occurrence of k-mers of nrEVEs and evaluated whether a multiclass classifier constructed by a support vector machine (SVM), a supervised machine learning method, can distinguish nrEVEs from human sequences (Fig. 1A). To train the SVM, we used sequences of endogenous bornavirus- and filovirus-derived elements, which are the only reported mammalian nrEVEs. In addition, six different groups of human genome sequences, namely coding and noncoding exons, processed pseudogenes, introns, promoters, and intergenic regions, were employed for the training datasets. When k = 1 and 2 were used, the recall and precision scores of the SVM were low, while these scores were high and stable when we used k = 3, 4, and 5 (Fig. 1A and SI Appendix, Fig. S1). This demonstrates that the k-mer compositions of nrEVEs are different from those of other human sequences and that k-mers of 3 or longer are sufficient to accurately capture this distinction. To generate a dense k-mer matrix and avoid overfitting, we used k = 3 for further analyses.
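For readers less familiar with the evaluation metrics, the sketch below shows how recall and precision can be computed for a single class (here, nrEVE) within a multiclass prediction. The function name, class labels, and toy predictions are hypothetical and serve only to illustrate the definitions; they do not reproduce the scores in Fig. 1A.

```python
def recall_precision(true_labels, predicted_labels, positive):
    """Recall and precision for one class of a multiclass prediction."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Toy labels, invented for illustration only.
truth = ["nrEVE", "nrEVE", "nrEVE", "intron", "promoter", "intergenic"]
preds = ["nrEVE", "nrEVE", "intron", "intron", "nrEVE", "intergenic"]
recall, precision = recall_precision(truth, preds, "nrEVE")
```

In this toy case, two of three true nrEVEs are recovered and two of three nrEVE calls are correct, so both scores equal 2/3.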
We next examined whether nrEVEs share sequence characteristics defined by k-mer frequencies, regardless of the genes or virus families from which they originated. To this end, we evaluated the recall of nrEVE classification by the SVM as follows. We first divided the nrEVEs into two groups: bornaviral and filoviral nrEVEs. We used either nrEVE group and the six groups of human sequences as training data and constructed an SVM classifier. Then, we evaluated the recall of the SVM using the other group of nrEVEs as test data. The classifier trained on filoviral nrEVEs categorized more than half of the bornaviral nrEVEs correctly (Fig. 1B). Consistently, the SVM trained on bornaviral nrEVEs gave a recall score of more than 0.5. We next divided the nrEVEs into eight groups based on the viral genes from which they were derived. We trained SVMs, retaining one of the eight groups for test data and using the other seven as training data. Notably, sequences within each group lacked pairwise similarity to sequences in other groups (SI Appendix, Fig. S2). When bornavirus nucleoprotein (N)-derived nrEVEs were retained as test data, more than 75% of the test sequences were correctly classified (Fig. 1B). Consistently, we observed 44 to 83% of the test data were correctly classified when using the other nrEVE groups as test sequences, with one exception: training performed without filovirus glycoprotein (GP)-derived nrEVEs. From these observations, we conclude that, regardless of their origin, nrEVEs share distinguishing sequence characteristics in almost all cases.
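The hold-one-group-out design described above can be sketched as follows. To keep the example self-contained, a nearest-centroid rule stands in for the SVM; this substitution, the two-dimensional toy "profiles," and the group names are all assumptions made for illustration and are not the classifier or data used in the study.

```python
def leave_one_group_out(nreve_groups):
    """Yield (held_out_name, training_items, test_items) for each nrEVE group."""
    for held_out in nreve_groups:
        train = [x for name, items in nreve_groups.items() if name != held_out for x in items]
        yield held_out, train, list(nreve_groups[held_out])


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def held_out_recall(nreve_groups, human_vectors):
    """Recall on each held-out group, using a nearest-centroid stand-in classifier."""
    human_c = centroid(human_vectors)
    scores = {}
    for name, train, test in leave_one_group_out(nreve_groups):
        nreve_c = centroid(train)
        hits = sum(1 for v in test if sq_dist(v, nreve_c) < sq_dist(v, human_c))
        scores[name] = hits / len(test)
    return scores


# Toy 2-dimensional profiles and hypothetical group names, invented for illustration.
groups = {
    "bornavirus_N": [[0.9, 0.1], [0.8, 0.2]],
    "filovirus_NP": [[0.85, 0.15], [0.7, 0.3]],
    "filovirus_GP": [[0.2, 0.8]],
}
human = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
scores = held_out_recall(groups, human)
```

The key design point is that the held-out group contributes nothing to training, so a high recall on it reflects shared composition rather than memorized sequence.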
Similarity in the Sequence Characteristics of nrEVEs and RNA Viruses.
The above result demonstrates the commonality in k-mer composition in nrEVE sequences. Because the genetic architecture of RNA viruses seems to be influenced by a number of constraints, such as immune pressure and error-prone replication, and has a pattern distinct from that of host species (16, 17), we next assessed whether the k-mer composition of nrEVEs is more similar to human coding sequences or to the coding genes in the single-strand, negative-sense RNA [(−)ssRNA] virus group, which includes bornaviruses and filoviruses. As shown in Fig. 1C, hierarchical clustering by k-mer frequency formed one cluster composed of a majority of nrEVEs with some (−)ssRNA viral sequences when we used k = 3 (SI Appendix, Fig. S3A), demonstrating that the k-mer composition of nrEVEs is more similar to that of (−)ssRNA viruses than to that of human coding sequences. Manifold learning, a nonlinear dimensionality reduction method, based on k-mer frequencies from the same dataset showed that the majority of nrEVEs exhibited a similar but slightly distinct distribution compared with that of the cluster composed of viral coding sequences (SI Appendix, Fig. S3B), suggesting that the k-mer composition of nrEVEs differs from that of (−)ssRNA viruses but is closer to it than to that of human coding sequences. These results suggest that the k-mer frequency of nrEVEs still retains similarities to that of (−)ssRNA viral coding sequences, despite the long residence of nrEVEs for at most 80 million y as endogenous sequences within host genomes.
Genome-Wide Screen for nrEVEs Hidden in the Human Genome.
To detect hidden nrEVEs originating from unknown RNA viruses, we applied the classifier constructed by the SVM to the reference human genome. The SVM can distinguish nrEVEs with substantial accuracy. However, as our approach is specifically designed to overcome the sparseness of “ground truth,” judging false positives is a challenge. We thus used the following three steps to extract sequences as candidate nrEVEs: 1) search for polyA tract (pA) and target site duplication (TSD) (pA-TSD), 2) detection of preintegration empty sites (PESs), and 3) removal of cellular pseudogenes (Fig. 2A).
Many nrEVEs in mammals share common sequence features, such as pA-TSDs, at the junctions of viral sequences with host chromosomes, probably reflecting the mechanism of integration by autonomous retrotransposons such as long interspersed nuclear elements (LINEs) (2). Therefore, we first searched the human genome for pA-TSDs (SI Appendix, Methods) and detected more than 8 million pA-TSDs (Fig. 2B and SI Appendix, Fig. S4). Next, we assessed whether the sequences detected by the pA-TSD search were acquired by insertion. The existence of an orthologous genomic locus with no insertion (PES) in other species is evidence of evolutionary invasion. Initial manual inspections of some of the sequences revealed that most of the pA-TSDs were probably derived from stochastic occurrences because we could not find a PES. To exclude pA-TSDs not derived from a recent integration via the mechanism described above, we extracted only those for which PESs were detectable in a genome alignment of 14 mammals (SI Appendix, Methods). This led to the extraction of 5,578 pA-TSDs that harbored at least one PES (Fig. 2B and SI Appendix, Fig. S4).
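As a rough illustration of what a pA-TSD search looks for, the sketch below scans a sequence for a poly(A) run followed immediately by a short sequence that also occurs upstream, i.e., the arrangement TSD, insert, poly(A), TSD. The actual search criteria are given in SI Appendix, Methods; the thresholds here (`min_pa`, `tsd_len`, `max_insert`), the exact-match requirement, and the function name are assumptions chosen for a minimal example.

```python
import re


def find_pa_tsd(seq, min_pa=10, tsd_len=8, max_insert=3000):
    """Find candidate insertions shaped as TSD - insert - poly(A) - TSD.

    All thresholds are illustrative assumptions, not the study's criteria.
    Returns a list of (insert_start, pa_end, tsd) tuples, 0-based coordinates.
    """
    seq = seq.upper()
    hits = []
    for m in re.finditer("A{%d,}" % min_pa, seq):
        tsd = seq[m.end():m.end() + tsd_len]  # candidate 3' TSD right after the poly(A)
        if len(tsd) < tsd_len:
            continue
        window_start = max(0, m.start() - max_insert - tsd_len)
        pos = seq.rfind(tsd, window_start, m.start())  # nearest upstream exact copy
        if pos != -1:
            hits.append((pos + tsd_len, m.end(), tsd))
    return hits


# Toy locus, invented for illustration: flank, TSD, insert, poly(A), TSD, flank.
tsd = "GATTCCGT"
toy = "CCCC" + tsd + "GCGCGTGCGCTTGC" + "A" * 12 + tsd + "GGGG"
candidates = find_pa_tsd(toy)
```

Because short exact repeats flanking an A-rich run arise by chance in a large genome, a signature-only search of this kind is expected to return many spurious hits, which is why an independent filter such as PES detection is needed.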
Cellular processed pseudogenes share characteristic features of integration sites, such as pA and TSD, with nrEVEs. Indeed, 43% of the 5,578 pA-TSDs overlapped with known cellular processed pseudogenes, demonstrating the enrichment of insertions generated by retrotransposons (P < 0.001, permutation test). To remove these insertions, we used a cellular pseudogene database and BLASTn-based identification of unannotated cellular pseudogenes (SI Appendix, Methods). This step identified more than 80% of the 5,578 pA-TSDs as likely cellular pseudogenes (Fig. 2B). The remaining noncellular pseudogene sequences (582 sequences) were then categorized by the SVM classifier to predict whether they have nrEVE features, and this yielded 100 elements with k-mer occurrences typical of nrEVEs (Fig. 2B, SI Appendix, Fig. S4, and Dataset S1). Finally, we manually curated these sequences to determine those most likely to be nrEVEs using a mammalian genome comparison, a sequence similarity search, and phylogenetic analysis.
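The counts reported in this filtering step can be checked with simple arithmetic. The sketch below assumes that the 582 remaining sequences are exactly the pA-TSDs not flagged as cellular pseudogenes; under that assumption, the removed fraction is close to 90%, consistent with the statement that more than 80% were identified as likely pseudogenes. Variable names are chosen for this illustration.

```python
pa_tsd_with_pes = 5578      # pA-TSDs with at least one PES
remaining = 582             # sequences left after pseudogene removal
svm_positive = 100          # sequences with k-mer occurrences typical of nrEVEs

removed = pa_tsd_with_pes - remaining
removed_fraction = removed / pa_tsd_with_pes          # share flagged as likely pseudogenes
annotated_overlap = round(0.43 * pa_tsd_with_pes)     # approximate count behind "43%"
svm_positive_fraction = svm_positive / remaining      # share of survivors called nrEVE-like
```

On these numbers, roughly 2,400 candidates overlap annotated processed pseudogenes, about 5,000 are removed in total, and about 17% of the remaining sequences are classified as nrEVE-like.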
Discovery of Additional EBLNs.
Previous studies reported the presence of eight bornaviral nrEVEs (seven endogenous bornavirus-like nucleoprotein elements [EBLNs] and one endogenous bornavirus-like glycoprotein element [EBLG]) in the human genome (4). The SVM for the nrEVE search yielded five of eight previously reported bornaviral nrEVEs (Dataset S1), demonstrating that our approach captured most known nrEVEs. Our approach could not detect three elements. However, this is not surprising because our method was tuned to detect features typical of nrEVE insertions but had less sensitivity for nrEVEs that lack these characteristics. Of the three elements, one, hsEBLN-4, lacks a clear pA and TSDs (18), and the others, hsEBLN-5 and hsEBLG-5, are located in transposon-rich regions, which are difficult to align with other genomes in order to allow confident identification of PESs (SI Appendix, Fig. S5).
We next performed a BLASTx search using the sequences detected by the SVM classifier to search for novel nrEVEs similar to known viruses but below the threshold of detection in canonical BLAST-based surveys. As a result, we found that two sequences showed weak similarity to the nucleoproteins of orthobornaviruses and recently discovered bornaviruses of the genus Carbovirus (SI Appendix, Fig. S6A) (19). Suspecting that these sequences represent EBLNs that have not been previously identified, we investigated their phylogenetic relationships with extant bornaviruses and known human EBLNs (Fig. 2C). The nucleotide sequence we identified as hsEBLN-8 showed a phylogenetically close relationship with hsEBLN-7, while the other sequence (hereafter hsEBLN-9) clustered with carboviruses (Fig. 2C). Although hsEBLN-8 has high similarity to hsEBLN-7 (Fig. 2C and SI Appendix, Fig. S6B), hsEBLN-8 was not detected in previous reports (3, 4). The region of hsEBLN-8 harboring pairwise similarity to bornavirus was shortened by a deletion and a putative insertion (Fig. 2D and SI Appendix, Fig. S6 C and D). On the other hand, hsEBLN-9 did not cluster with extant orthobornaviruses, which were used as the query in tBLASTn-based nrEVE searches in previous reports (Fig. 2C). In addition, hsEBLN-9 has multiple putative frameshifts in the region where it harbors similarity to carboviruses (Fig. 2E). These obscured similarities likely resulted in these EBLNs being missed in previous surveys, while reconstructable pairwise similarity to known bornaviruses or EBLNs revealed that the two human sequences are previously unrecognized nrEVEs. This provides evidence that our method can detect nrEVEs whose pairwise similarity to known viruses is too weak to be detected by other methods.
Detection of nrEVE-Like Insertions in the Human Genome.
Beyond these two nrEVEs, no other sequences with any pairwise similarity to other existing viruses were detected. Manually judging the potential sources of the remaining candidates, we identified several insertions for which the sources were not clear (Dataset S1). Systematic identification of the sources of such orphan insertions is challenging; nevertheless, insights into the formation and distribution patterns in related species sometimes allow us to narrow down and specify possible sources. To highlight this approach, we selected one nrEVE-like insertion that we refer to as predicted viral insertion (Fig. 3A) and assessed whether the source of this sequence is an unknown virus.
PVI is ∼600 nt in length and has a clear pA and TSDs at the integration junctions (Fig. 3A), suggesting that PVI originated from an insertion of polyadenylated RNA by the machinery of a retrotransposon. An orthologous insertion site was found in the chimpanzee and marmoset genomes but was absent in the tarsier genome, suggesting that the insertion occurred at least 43 million y ago (MYA). We could not detect a clear, long open reading frame (ORF) in PVI (Fig. 3A). The search for known viral sequences related to PVI using BLAST failed to detect any similarity (E-value thresholds: 1e−5 for BLASTn and 1e−3 for BLASTx). To understand whether PVI is a cellular pseudogene, we first searched for human homologs. The BLASTn search with PVI as a query yielded 21 similar sequences; however, we could not find high similarity. The highest nucleotide identity score was 77% across 57% of the query. Moreover, the closest sequence appears to have been formed by an insertion in the same ancestral simian lineage as that for PVI (SI Appendix, Fig. S7A). To evaluate whether this nucleotide similarity is comparable to that of cellular pseudogenes formed similarly long ago, nucleotide identities between pseudogenes and their parental genes were calculated (SI Appendix, Methods). The percent identity of PVI with its most similar sequence was lower than most of the identities of cellular pseudogenes with their parental genes (Fig. 3B and SI Appendix, Fig. S7B). Related nrEVEs should show relatively high sequence divergence compared with that of pseudogenized sequences formed at the same time in evolution. This is because even the source sequences of the closest nrEVEs should already have had variations prior to integration due to the presence of quasispecies in exogenously replicating viruses. The observation of only weak identity between the closest PVIs thus suggests that these elements are of extrinsic origin.
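The two quantities quoted for the best BLASTn hit, nucleotide identity and the fraction of the query covered, can be illustrated with a small function operating on a gapped pairwise alignment. The function name and the convention of counting gap columns as aligned but nonidentical are assumptions for this sketch; BLAST output may differ in detail.

```python
def identity_and_coverage(aligned_query, aligned_subject, query_length):
    """Percent identity over aligned columns and percent of the query covered.

    Columns where either sequence has a gap ('-') count as aligned but not identical,
    which is one common convention; BLAST reports may differ in detail.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_query)
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-")
    query_bases = sum(1 for q in aligned_query if q != "-")
    identity = 100.0 * matches / columns if columns else 0.0
    coverage = 100.0 * query_bases / query_length if query_length else 0.0
    return identity, coverage


# Toy alignment, invented for illustration: 10 columns, 18-nt query.
identity, coverage = identity_and_coverage("ACGT-ACGTA", "ACGTTACCTA", query_length=18)
```

Reporting both values matters because a high identity over a short aligned stretch is weaker evidence of common origin than the same identity across the full element.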
Detection of PVI-Related Sequences in the Human Genome.
To gain additional insights into the exogenous origin of PVIs, we next identified more diverse PVI-related sequences in the human genome. Based on iterative BLASTn and LASTz alignments, we found 83 additional sequences, resulting in a total of 105 PVI and PVI-related sequences, hereafter referred to as PVIRs, in the human genome (Fig. 3C). The lengths of these elements ranged from less than 100 nt to several kilobases (SI Appendix, Fig. S8A). Consistent with an extrinsic origin, three PVIRs fall within annotated PIWI-interacting RNA clusters, where the bornaviral nrEVEs are also enriched, more often than expected by chance alone (P < 0.01; SI Appendix, Fig. S8 B and C) (9). A dot plot analysis revealed that 20 PVIRs are tandemly arrayed as head-to-tail multimers, of which one unit is ∼1.5 kb (SI Appendix, Fig. S8D).
To address whether the identified PVIRs show features of insertion, such as PESs and/or pA-TSDs, we manually assessed the presence of PESs and insertion junction sequences. Orthology was clearly defined for 24 PVIRs. Most of these elements were simian- or primate-specific, while one element was conserved across the Euarchontoglires mammals (Fig. 3D). Notably, none of the PVIRs were orthologous to those in laurasiatherians, suggesting that they were acquired after the divergence of the Euarchontoglires and laurasiatherians (<96 MYA). For 33 PVIRs, insertion junctions were defined. Six elements had pA-TSDs, 15 elements lacked clear pA sequences and harbored only TSDs, and one insertion was potentially established due to template switching during mobilization of the LINE1 retrotransposon (Fig. 3 E and F and SI Appendix, Fig. S9). Eleven PVIRs had neither a clear pA nor a TSD. The diversity of the insertion junctions suggests that the putative source agent(s) of PVIRs might not have encoded an autonomous integrase and that several different host integration mechanisms might contribute to the formation of PVIRs. From the absence of orthologs in distant animal lineages and the presence of clear integration features, we concluded that human PVIRs are sequences acquired from an exogenous source.
Similar Sequences of PVI in Mammals.
Independent insertions of similar sequences in species that do not encode human PVIR orthologs are additional lines of evidence suggesting virus-like horizontal transfer and the foreign origin of PVIRs. Therefore, we next explored genomes other than the human genome. By similarity searches using human PVIRs as queries, we found PVIRs in primates, flying lemurs, rabbits, and laurasiatherian mammals (Fig. 3D). Notably, we did not detect any PVIRs in the genomes of other organisms, including other vertebrates, invertebrates, prokaryotes, bacteria, and viruses. To assess whether these nonhuman PVIR insertions were independently generated from human PVIRs, we analyzed their orthology. Genome comparisons clearly defined 21 out of 286 nonhuman PVIR insertions that occurred independently in multiple mammalian lineages (Figs. 3D and 4A and SI Appendix, Fig. S10).
Next, we analyzed phylogenetic relationships based on a multiple-sequence alignment of relatively long PVIRs (>800 nt) (Fig. 4B). These sequences were grouped into three clades designated clade 1 to clade 3, and clade 1 harbored a subclade, 1.1. Next, we classified all PVIRs based on their phylogenetic relationships (Fig. 4C and SI Appendix, Methods). We observed species specificity in PVIRs; clade 1 consisted of primate elements, with the exception of two rabbit elements, while clade 2 and clade 3 contained Euarchontoglires and laurasiatherian PVIRs, respectively. We found several events suggesting horizontal transfer of a putative source virus(es) of PVIRs among species. Subclade 1.1 PVIRs were observed in the human, tarsier, and aye-aye genomes but were absent in the bushbaby and mouse lemur genomes, suggesting that subclade 1.1 PVIRs entered the aye-aye genome independently from integrations into the tarsier and human genomes (Fig. 4C). Note that it is still possible that the source insertion of subclade 1.1 PVIRs occurred in a common primate ancestor, yet was independently deleted in the genomes of bushbaby and mouse lemurs or has otherwise become difficult to recognize in both these genomes. In another case, all of the clade 1 PVIRs were found in primate genomes, except for two PVIRs found in the rabbit genome (Fig. 4 C and D). These observations suggest that the source of PVIRs was either a horizontally transmissible transposon-like element or an infectious agent transmissible between mammalian lineages.
PVIRs Are Derived from an Exogenous Infectious Agent.
We evaluated whether PVIRs are likely to be transposons or exogenous infectious agents based on their sequence divergence. A lineage of transposons is expected to be less divergent than a group of nrEVEs if they were generated at the same time, according to the following reasoning. The average sequence divergence of a transposon family should roughly reflect its age (20) because transposons can accumulate mutations only after mobilization in the germline. In contrast, virus genomes continuously acquire mutations or variations during their replication cycle in somatic cells, before endogenization. This preexisting divergence of nrEVE source sequences, in addition to mutations that accumulate after endogenization, should give rise to higher sequence divergence of nrEVEs than of transposons formed at the same age. To evaluate whether PVIRs have higher divergence than transposons, we calculated sequence diversity using external branch lengths (SI Appendix, Methods and Fig. S11A). We used subclade 1.1 for this analysis because this clade is the youngest according to orthology analysis, and its sequences are abundant in the human genome. None of the subclade 1.1 element integrations between human and tarsier appeared orthologous, suggesting that these elements expanded after the divergence of human and tarsier (<67 MYA). The average external branch length of the elements exceeds that of transposons expanded at a similar age (Fig. 4E and SI Appendix, Fig. S11B), suggesting that PVIRs had already acquired some mutations and existed as polymorphic sequences before their insertion. This observation supports the scenario in which PVIRs originated from an exogenous infectious source, such as a virus.
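The comparison of external branch lengths can be illustrated with a minimal tree traversal. In the sketch below, a tree is represented as nested tuples, and the mean length of branches leading to leaves is computed; the representation, function names, and branch lengths are invented for illustration and do not correspond to the trees analyzed in SI Appendix.

```python
def external_branch_lengths(node):
    """Collect branch lengths leading to leaves in a nested-tuple tree.

    A leaf is (name, length); an internal node is ([child, ...], length).
    """
    label, length = node
    if isinstance(label, str):
        return [length]
    lengths = []
    for child in label:
        lengths.extend(external_branch_lengths(child))
    return lengths


def mean_external_branch_length(tree):
    """Average length of the terminal (leaf) branches of a tree."""
    lengths = external_branch_lengths(tree)
    return sum(lengths) / len(lengths)


# Toy trees with invented branch lengths (substitutions per site).
virus_like = ([("a", 0.20), ([("b", 0.15), ("c", 0.25)], 0.02)], 0.0)
transposon_like = ([("x", 0.05), ([("y", 0.04), ("z", 0.06)], 0.02)], 0.0)
```

Under the reasoning in the text, elements whose source sequences had already diverged before insertion would show longer terminal branches than transposon copies of comparable age, as in the first toy tree relative to the second.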
Discussion
This study uncovers virus-like insertions in the human genome that lack pairwise homology to known viruses. This was made possible by using a machine learning approach to detect k-mer–based signatures in sequences derived from ancient RNA viruses. We demonstrated that nrEVE sequences have specific signatures distinguishable from those of other human sequences and that the sequence features of ancient RNA viruses may retain similarity to those of extant RNA viruses. This approach opens a window for exploring ancient RNA virus sequences hidden in animal genomes. In addition, our findings show that the current knowledge of ancient virus diversity is still rudimentary and that as-yet-undiscovered sequences derived from unknown viruses, such as unidentified or extinct viruses, remain hidden even in very well-studied animal genomes.
In this study, we demonstrated that nrEVEs derived from nonhomologous genes share specific sequence similarities. Although we could not elucidate why nrEVEs have these similarities because of the limited interpretability of the SVM, our results suggest that the k-mer occurrence of nrEVEs reflects that of the ancient RNA viral genomes from which they were derived. It is known that the dinucleotide composition of animal RNA viruses is mostly a characteristic of virus families rather than of host species (14). The VirFinder, a k-mer–based tool used to predict prokaryotic viral contigs from metagenomic data, correctly predicts viral sequences with no pairwise similarity to the training data (15). These results support the possibility that RNA viruses, including ancient viruses, may have k-mer patterns shared across different taxonomic families. Mechanistically, the viral k-mer space is shaped by several known selective constraints, such as codon usage bias, which buffers against error-prone replication (16), and the low-CG dinucleotide property that allows viruses to evade immune response (17, 21, 22). The k-mer frequency of nrEVEs identified by BLAST similarity search showed a spatial distribution different from that of human coding sequences and more similar to that of RNA viral sequences, suggesting that similar evolutionary constraints acting on ancient and extant RNA viruses might have resulted in the unique sequence signature of nrEVEs. Further studies regarding the sequence similarity between nrEVEs and current RNA viruses will provide detailed views of the sequence signature and host interactions of ancient RNA viruses.
Recent metagenomic and metatranscriptomic analyses have uncovered previously unknown viral fragments in human and environmental samples, indicating that many unknown infectious agents could still be present in the biosphere (23). Similarly, animal genomes may contain virus-derived sequences originating from as-yet-unidentified or extinct viruses. Our analysis revealed previously unrecognized endogenous bornavirus-like elements in the human genome. In contrast, the present analysis did not identify sequences similar to known RNA viruses other than bornaviruses, even with weak similarities such as those below the threshold of detection defined in the general BLAST search settings. It may still be premature to conclude from this analysis that bornaviruses are the only RNA viruses that can contribute sequences to the human genome. However, our results indicate that Bornaviridae are rare RNA viruses that have existed for tens of millions of years along with the evolution of primate lineages.
In this study, we successfully identified a virus-like insertion, PVI, which has an unknown origin, in the human genome. This finding strongly suggests that there are still many uncharacterized virus-like insertions in mammalian genomes. Furthermore, the independent integrations and ubiquitous presence of PVIRs in the primate and laurasiatherian lineages strongly suggest that the sequences have expanded in a manner similar to an infection, indicating that PVIRs arise from the integration of exogenous agents, such as viruses. Phylogenetic analysis revealed that some PVIRs were most likely acquired by cross-species transmission of the exogenous source element. Sequence diversity suggested that PVIRs are more variable than transposons, implying an exogenous life cycle for the agent. Despite this body of evidence, however, we cannot conclude definitively that the source of PVIRs was an ancient RNA virus; the possibility that it originated from a previously undescribed transposon-like element capable of cross-species transmission remains. Not all of these sequences have the typical features of RNA virus integration but instead exhibit diverse integration junction sequences. In addition, some PVIRs lack both recognizable pA and TSDs. This feature suggests that during the replication cycle, the source agent may have produced a DNA form that could be integrated into a double-strand DNA break (24–26). We also found tandemly arrayed integrations of PVIRs, which is not a typical feature of canonical nrEVEs; however, endogenous retroviruses and retroviroid-like sequences are known to form tandemly repeated DNA sequences in the host genome (27, 28). Furthermore, adeno-associated virus generates tandem viral DNA in infected cells (29). In addition to animal viruses, plant-specific viroids, which are composed of circular single-stranded RNA molecules, are also known to produce tandem genome units in their rolling circle replication process (30).
Thus, the tandem repeat structure of some PVIR integrations may be a clue to the replication mechanism of the unidentified elements producing PVIRs. Further investigations of the genomic structure, as well as the replication mechanism, of PVIRs may provide more clues to the origin of this putative viral element.
Although most nongenic sequences contributing to the great size of many animal genomes are suggested to be transposons and highly decayed repeat sequences, the form of a substantial fraction remains unclear due to a lack of similarity to characterized sequences. Such unknown sequences are often referred to as “genomic dark matter” (31). Our study suggests that unexplored virus-derived sequences may be a part of the evolutionary origins of such complex genomic sequences. Viral machinery encoded in endogenous retroviruses and nrEVEs is frequently co-opted or repurposed for novel cellular functions. Therefore, unveiling hidden viral insertions in animal genomes will provide insight into the novelty of animal genomes driven by lateral gene transfer from viruses.
In summary, k-mer–based machine learning of ancient virus sequence signatures will open a window for exploring unappreciated ancient gene flow from currently unidentified viruses. Our findings will provide extensive insights into long-term virus evolution, animal genome organization, and virus–host interactions.
Materials and Methods
Sequence Data Preparation for Construction of the SVM.
The genomic positions of protein-coding exons, noncoding exons, pseudogene exons, and introns were retrieved from the GENCODE human genome annotation (release 27). For pseudogenes, gene_type “processed_pseudogenes” was used. Promoter regions were defined as the regions 1 kb upstream (downstream for transcripts in the antisense direction) of the transcription start sites. Intergenic regions were defined as genomic regions other than protein-coding exons, noncoding exons, pseudogene exons, introns, promoters, and known nrEVEs (hsEBLN-1 to hsEBLN-7 and hsEBLG-5). The sequences of these genomic regions were obtained from human genome assembly hg38 with repeat sequences masked by our criteria (SI Appendix, pA-TSD Search).
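The strand-aware promoter definition can be illustrated with a short sketch. This is not the authors' code: the function name `promoter_interval`, the 0-based half-open coordinate convention, and the clipping at the chromosome start are all assumptions made for illustration.

```python
def promoter_interval(tss, strand, length=1000):
    """Return the promoter as a 0-based, half-open (start, end) interval.

    tss is the 0-based position of the transcription start site.
    For '+' transcripts the promoter is the `length` bases before the TSS.
    For '-' transcripts it is the `length` bases after the TSS on the
    reference strand, which is upstream in the direction of transcription.
    """
    if strand == "+":
        # Clip at 0 so the interval never extends past the chromosome start.
        return (max(0, tss - length), tss)
    if strand == "-":
        return (tss + 1, tss + 1 + length)
    raise ValueError(f"unknown strand: {strand!r}")
```

A real pipeline would also clip the minus-strand interval at the chromosome end, which requires chromosome lengths that are not modeled here.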
nrEVEs with pairwise similarity to known viruses were searched for by tBLASTn with an E-value threshold of 1e-10. We used orthobornaviruses and filoviruses as search queries because they are the only RNA viruses related to nrEVEs in vertebrate genomes. Whole-genome shotgun sequences of vertebrates were used as the database. The tBLASTn search was performed on 7 November 2017. Because we searched for nrEVEs among whole-genome shotgun sequences, the hits contained multiple nrEVE copies with the same nucleotide sequence. These redundant hits were removed so that only one copy of each sequence was retained for analysis. The accession numbers of the query protein sequences are listed in SI Appendix.
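The redundancy-removal step can be sketched as follows. The text does not say how identical copies were detected or which copy was kept, so exact case-insensitive sequence matching and keeping the first occurrence are assumptions, and the hit identifiers are hypothetical.

```python
def remove_redundant_hits(hits):
    """Keep the first hit for each distinct nucleotide sequence.

    hits is an iterable of (hit_id, sequence) pairs. Sequences are
    compared case-insensitively, and first-occurrence order is preserved.
    """
    seen = set()
    unique = []
    for hit_id, seq in hits:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((hit_id, seq))
    return unique


# Hypothetical hits: the first two carry the same nucleotide sequence.
hits = [
    ("contig_A:100-110", "ATGGCCAAGT"),
    ("contig_B:520-530", "atggccaagt"),
    ("contig_C:10-20", "ATGGCTAAGT"),
]
unique_hits = remove_redundant_hits(hits)
```

Here `unique_hits` retains the entries from `contig_A` and `contig_C` only.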
Construction of the SVM.
We used a nonlinear SVM (32) with a Gaussian kernel. The tuning parameters for the SVM were selected by twofold cross-validation. These analyses were performed with the functions svm and tune.svm in the package e1071 in R statistical software (version 3.5.3).
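For reference, the standard form of a Gaussian-kernel SVM is written out below. Here x denotes a feature vector (for example, k-mer occurrences), y_i ∈ {−1, +1} are the training labels, and α_i and b are learned. The kernel width γ and the cost C are the parameters usually searched with tune.svm; the text does not state which parameters or ranges were tuned, so treating γ and C as the tuned quantities is an assumption.

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right), \qquad \gamma > 0

f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right), \qquad 0 \le \alpha_i \le C
```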
Manifold Learning.
We used t-distributed stochastic neighbor embedding (t-SNE) for manifold learning (33). This analysis was performed by the function TSNE in the package scikit-learn in Python (version 3.7.2).
Hierarchical Clustering.
To cluster the contribution ratios of k-mer occurrences, we used hierarchical clustering with the complete linkage method. This analysis was performed by the function heatmap in R. To cluster k-mer frequencies of human ORFs, viral ORFs, and nrEVEs, we used hierarchical clustering with Ward’s method (see also the legend of SI Appendix, Fig. S3). To measure similarities between k-mer frequencies, we used Euclidean distance. This analysis was performed by the function clustermap in the package seaborn in Python.
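The feature vectors and the distance used for clustering can be sketched with the standard library alone. This is a minimal illustration, not the authors' implementation: the default k = 2, the normalization to relative frequencies, and the skipping of windows containing non-ACGT characters are assumptions. The hierarchical clustering itself is not reproduced here.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=2):
    """Return relative frequencies of all 4**k DNA k-mers in sorted order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted; windows with N are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Distance between the dinucleotide profiles of two toy sequences.
distance = euclidean(kmer_frequencies("ACGTACGT"), kmer_frequencies("AAAAAAAA"))
```

A matrix of such pairwise distances is the kind of input that a hierarchical clustering routine operates on.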
Phylogenetic Classification of PVIRs.
The lengths of PVIRs range from ∼100 nt to several kilobases. Therefore, a single phylogenetic tree containing all PVIRs could not be generated. To classify all PVIRs into clades, we instead investigated their phylogenetic relationships one sequence at a time. To this end, we made three alignments of relatively long elements to represent the three clades and then added one sequence to the alignment to evaluate the phylogenetic relationship of the added sequence. Each sequence was added using MAFFT L-INS-i with the --add option. The generated alignments were checked manually, and then the trees were inferred by the maximum-likelihood method with the partial deletion option using MEGA X software (34). The Tamura three-parameter model with a discrete gamma distribution (+G) was used. The reliability of each internal branch was assessed by 100 bootstrap resamplings.
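The one-sequence-at-a-time addition can be sketched as a command builder. In MAFFT, `--localpair` with `--maxiterate 1000` selects the L-INS-i strategy and `--add` aligns new sequences to an existing alignment. The file names and the helper `build_add_command` are hypothetical, and the exact command line the authors used is not given in the text.

```python
def build_add_command(reference_alignment, query_fasta):
    """Build a MAFFT command that adds one sequence to an existing alignment.

    The returned list can be passed to subprocess.run; MAFFT writes the
    resulting alignment to standard output.
    """
    return [
        "mafft",
        "--localpair",           # L-INS-i: local pairwise alignment ...
        "--maxiterate", "1000",  # ... with iterative refinement
        "--add", query_fasta,    # the single sequence to be placed
        reference_alignment,     # existing alignment of long elements
    ]


# Hypothetical file names: one command per PVIR to be classified.
queries = ["pvir_0001.fa", "pvir_0002.fa"]
commands = [build_add_command("clade_reference.aln.fa", q) for q in queries]
```

Each command could then be run with `subprocess.run(cmd, stdout=handle, check=True)`, and the resulting alignment inspected manually before tree inference.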
Acknowledgments
This work was supported in part by Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI) JP17H04083, JP19K22530, JP20H00662, and JP20H05682 (all to K.T.); Ministry of Education, Culture, Sports, Science and Technology KAKENHI JP16H06429, JP16K21723, and JP16H06430 (all to K.T.), JP17H05823 (to S.N.), and JP19H04833 (to M.H.); JSPS Core-to-Core Program, Japan Agency for Medical Research and Development Grant JP19fm0208014 (to K.T.); and the Joint Usage/Research Center Program on inFront, Kyoto University.
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2010758118/-/DCSupplemental.
Data Availability.
Code and data used in this article are available at https://github.com/shohei-kojima/Kojima_et_al_2021_PNAS. For the list of parameter settings used for the pA-TSD search of the human genome, genomic positions and manual annotations of sequences categorized in the nrEVE-like group by our nrEVE-search workflow, and genomic positions and manual annotations of the PVIRs found, see SI Appendix and Datasets S1, S2, and S3.
References
- 1. Feschotte C., Gilbert C., Endogenous viruses: Insights into viral evolution and impact on host biology. Nat. Rev. Genet. 13, 283–296 (2012).
- 2. Horie M., et al., Endogenous non-retroviral RNA virus elements in mammalian genomes. Nature 463, 84–87 (2010).
- 3. Belyi V. A., Levine A. J., Skalka A. M., Unexpected inheritance: Multiple integrations of ancient bornavirus and Ebolavirus/Marburgvirus sequences in vertebrate genomes. PLoS Pathog. 6, e1001030 (2010).
- 4. Katzourakis A., Gifford R. J., Endogenous viral elements in animal genomes. PLoS Genet. 6, e1001191 (2010).
- 5. Taylor D. J., Leach R. W., Bruenn J., Filoviruses are ancient and integrated into mammalian genomes. BMC Evol. Biol. 10, 193 (2010).
- 6. Kondoh T., et al., Putative endogenous filovirus VP35-like protein potentially functions as an IFN antagonist but not a polymerase cofactor. PLoS One 12, e0186450 (2017).
- 7. Fujino K., Horie M., Honda T., Merriman D. K., Tomonaga K., Inhibition of Borna disease virus replication by an endogenous bornavirus-like element in the ground squirrel genome. Proc. Natl. Acad. Sci. U.S.A. 111, 13175–13180 (2014).
- 8. Edwards M. R., et al., Conservation of structure and immune antagonist functions of filoviral VP35 homologs present in microbat genomes. Cell Rep. 24, 861–872.e6 (2018).
- 9. Parrish N. F., et al., piRNAs derived from ancient viral processed pseudogenes as transgenerational sequence-specific immune memory in mammals. RNA 21, 1691–1703 (2015).
- 10. Sofuku K., Parrish N. F., Honda T., Tomonaga K., Transcription profiling demonstrates epigenetic control of non-retroviral RNA virus-derived elements in the human genome. Cell Rep. 12, 1548–1554 (2015).
- 11. Kobayashi Y., et al., Exaptation of bornavirus-like nucleoprotein elements in afrotherians. PLoS Pathog. 12, e1005785 (2016).
- 12. Kirsip H., Abroi A., Protein structure-guided hidden Markov models (HMMs) as a powerful method in the detection of ancestral endogenous viral elements. Viruses 11, 320 (2019).
- 13. Kryukov K., Ueda M. T., Imanishi T., Nakagawa S., Systematic survey of non-retroviral virus-like elements in eukaryotic genomes. Virus Res. 262, 30–36 (2019).
- 14. Di Giallonardo F., Schlub T. E., Shi M., Holmes E. C., Dinucleotide composition in animal RNA viruses is shaped more by virus family than by host species. J. Virol. 91, e02381-16 (2017).
- 15. Ren J., Ahlgren N. A., Lu Y. Y., Fuhrman J. A., Sun F., VirFinder: A novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 69 (2017).
- 16. Lauring A. S., Acevedo A., Cooper S. B., Andino R., Codon usage determines the mutational robustness, evolutionary capacity, and virulence of an RNA virus. Cell Host Microbe 12, 623–632 (2012).
- 17. Takata M. A., et al., CG dinucleotide suppression enables antiviral defence targeting non-self RNA. Nature 550, 124–127 (2017).
- 18. Horie M., Kobayashi Y., Suzuki Y., Tomonaga K., Comprehensive analysis of endogenous bornavirus-like elements in eukaryote genomes. Philos. Trans. R. Soc. Lond. B Biol. Sci. 368, 20120499 (2013).
- 19. Hyndman T. H., Shilton C. M., Stenglein M. D., Wellehan J. F. X. Jr., Divergent bornaviruses from Australian carpet pythons with neurological disease date the origin of extant Bornaviridae prior to the end-Cretaceous extinction. PLoS Pathog. 14, e1006881 (2018).
- 20. Kapitonov V., Jurka J., The age of Alu subfamilies. J. Mol. Evol. 42, 59–65 (1996).
- 21. Greenbaum B. D., Cocco S., Levine A. J., Monasson R., Quantitative theory of entropic forces acting on constrained nucleotide sequences applied to viruses. Proc. Natl. Acad. Sci. U.S.A. 111, 5054–5059 (2014).
- 22. Odon V., et al., The role of ZAP and OAS3/RNAseL pathways in the attenuation of an RNA virus with elevated frequencies of CpG and UpA dinucleotides. Nucleic Acids Res. 47, 8061–8083 (2019).
- 23. Zhang Y.-Z., Chen Y.-M., Wang W., Qin X.-C., Holmes E. C., Expanding the RNA virosphere by unbiased metagenomics. Annu. Rev. Virol. 6, 119–139 (2019).
- 24. Moore J. K., Haber J. E., Capture of retrotransposon DNA at the sites of chromosomal double-strand breaks. Nature 383, 644–646 (1996).
- 25. Bill C. A., Summers J., Genomic DNA double-strand breaks are targets for hepadnaviral DNA integration. Proc. Natl. Acad. Sci. U.S.A. 101, 11135–11140 (2004).
- 26. Miller D. G., Petek L. M., Russell D. W., Adeno-associated virus vectors integrate at chromosome breakage sites. Nat. Genet. 36, 767–773 (2004).
- 27. Daròs J. A., Flores R., Identification of a retroviroid-like element from plants. Proc. Natl. Acad. Sci. U.S.A. 92, 6856–6860 (1995).
- 28. Gao D., Li Y., Kim K. D., Abernathy B., Jackson S. A., Landscape and evolutionary dynamics of terminal repeat retrotransposons in miniature in plant genomes. Genome Biol. 17, 7 (2016).
- 29. Schnepp B. C., Jensen R. L., Chen C.-L., Johnson P. R., Clark K. R., Characterization of adeno-associated virus genomes isolated from human tissues. J. Virol. 79, 14793–14803 (2005).
- 30. Flores R., et al., Viroid replication: Rolling-circles, enzymes and ribozymes. Viruses 1, 317–334 (2009).
- 31. de Koning A. P. J., Gu W., Castoe T. A., Batzer M. A., Pollock D. D., Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 7, e1002384 (2011).
- 32. Boser B. E., Guyon I. M., Vapnik V. N., “A training algorithm for optimal margin classifiers” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Haussler D. H., Ed. (Association for Computing Machinery, New York, NY, 1992), pp. 144–152.
- 33. Van Der Maaten L., Hinton G., Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
- 34. Kumar S., Stecher G., Li M., Knyaz C., Tamura K., MEGA X: Molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol. 35, 1547–1549 (2018).