Abstract
Double synonyms in the genetic code can be used as a tool to test competing hypotheses regarding ambigrammatic narnavirus genomes. Applying the analysis to recent observations of Culex narnavirus 1 and Zhejiang mosquito virus 3 ambigrammatic viruses indicates that the open reading frame on the complementary strand of the segment coding for RNA-dependent RNA polymerase does not code for a functional protein. Culex narnavirus 1 has been shown to possess a second segment, also ambigrammatic, termed ‘Robin’. We find a comparable segment for Zhejiang mosquito virus 3, a moderately diverged relative of Culex narnavirus 1. Our analysis of Robin polymorphisms suggests that its reverse open reading frame also does not code for a functional protein. We make a hypothesis about its role.
Keywords: metagenome, narnavirus, ambigram, mosquito
1. Introduction
Of all the various types of viruses catalogued, narnaviruses rank among the simplest and most surprising (Cobián Güemes et al. 2016). Narnaviruses (a contraction of ‘naked RNA virus’) are examples of a minimal blueprint for a virus: no capsid, no envelope and no apparent assembly of any kind. The known narnavirus blueprint appeared for all intents and purposes to be a single gene, that which codes for an RNA-dependent RNA polymerase, abbreviated as RdRp (Hillman and Cai 2013). However, some narnaviruses have been found to have a genome with an open reading frame (ORF) (i.e. a reading frame without stop codons) on the strand complementary to that coding for the RdRp gene, calling into question the general hypothesis of a one-gene blueprint (Cook et al. 2013; DeRisi et al. 2019; Dinan et al. 2020; Cepelewicz 2020). This reverse ORF (rORF) has codon boundaries aligned with the forward reading frame. Because the genome can be translated in either direction, we say that these narnaviruses are ambigrammatic. The significance of an ambigrammatic genome is an open problem. In this article, we discuss how polymorphisms of sampled sequences can distinguish between competing hypotheses on the function and nature of ambigrammatic viral genomes. Our methods are applied to known ambigrammatic narnavirus genes and to the newly discovered ambigrammatic second segment of some narnaviruses, termed Robin (Batson et al. 2021).
Our discussion is based upon two rules about the genetic code and its relation to ambigrammatic sequences. Both of these ambigram rules are concerned with the availability of synonyms within the genetic code, which allow coding of the same amino acid with a different codon. The first rule states that for any sequence of amino acids coded by the forward strand, it is possible to use individual synonymous substitutions to remove all stop codons on the complementary strand (this result was discussed already in DeRisi et al. 2019). The second ambigram rule, described below, states that the genetic code contains double synonyms that allow polymorphisms, accessible by single-base mutations, even when the amino acids coded by both the forward and the complementary strands are fixed.
The first of these rules addresses the ‘how’ of ambigrammatic genomes, by showing that stop codons on the complementary strand can be removed by single-point mutations, without altering the protein (in narnaviruses, the RdRp) coded in the forward direction. Here, we argue that the second rule can help to resolve the ‘why’ of ambigrammatic genomes: the origin of ambigrammaticity itself. There are two distinct reasons why there might be an evolutionary advantage for a virus to evolve an ambigrammatic sequence. The first possibility is that the complementary strand might code for a functionally significant protein, for example, one that might interfere with host defence mechanisms. The second possibility is that the lack of stop codons on the complementary strand is significant, even if the amino acid sequence that is coded is irrelevant. In particular, the lack of stop codons may promote the association between ribosomes and the complementary strand viral RNA (produced as part of its replication cycle). It is possible that a ‘polysome’ formed by a covering of ribosomes helps to shield the virus from degradation or from detection by cellular defence mechanisms (Cepelewicz 2020; Retallack et al. 2021; Wilkinson et al. 2021). The second ambigram rule combined with data on the polymorphism of the virus genome can help distinguish whether the complementary strand codes for a functional protein. We shall argue that in the case of Culex narnavirus 1 and Zhejiang mosquito virus 3, the evidence is in favour of this second hypothesis, namely that the ORF on the complementary strand does not code for a functional protein.
After describing the genetic ambigram rules, we discuss how the existence of double synonyms can be used to assess whether the ORF on the complementary strand codes for functional protein. It is well known that, because RdRp is a highly-conserved gene, CxNV1 non-synonymous mutations are likely to be detrimental, so that most of the observed diversity consists of synonymous changes. Some of these synonymous mutations have the potential to be synonymous in the complementary strand. If the complementary strand also codes for a functional protein, we expect that doubly synonymous mutations will be favoured. In fact, there would be mutational ‘hotspots’ corresponding to the potential doubly synonymous loci. We introduce two tests for whether the complementary strand is coding, based respectively on looking for mutational ‘hotspots’, and upon the mutational frequencies at loci which have double synonyms. We used these tests to analyse sequences for two different ambigrammatic narnaviruses: forty-six RdRp segments of Culex narnavirus 1 (Göertz et al. 2019) and twelve RdRp segments of Zhejiang mosquito virus 3 (Shi et al. 2017), abbreviated to CxNV1 and ZJMV3, respectively. We find that neither of our tests supports the hypothesis that the translated sequence of the complementary strand of RdRp is under purifying selection. We also applied these tests to the second segment, termed Robin, which is found to be closely associated with this ambigrammatic narnavirus infection in mosquitos (Batson et al. 2021; Retallack et al. 2021). We also found that the complementary ORF of Robin does not appear to be under purifying selection. The discovery of Robin suggested that ambigrammatic companions may exist for other ambigrammatic viruses. Accordingly, we searched the assembled contigs of studies reporting the detection of ZJMV3, the only other ambigrammatic narnavirus observed multiple times in numerous locations, and discovered an ambigrammatic segment with similar properties to CxNV1 Robin. Thus, we consider four viral segments, denoted CxNV1-RdRp, CxNV1-Robin, ZJMV3-RdRp and ZJMV3-Robin. We shall report evidence that Robin does code for a protein in its forward direction, but that its complementary strand does not code for a functional protein. We find evidence that Robin segments are under detectable purifying selection. Figure 1 illustrates the phylogenetic relationship of CxNV1 and ZJMV3, and ORF-wide dN/dS values of all their segments and coding directions (discussed in detail below).
Figure 1.
(A) A maximum-likelihood tree illustrating the relationship between CxNV1 (Culex narnavirus 1) (red) and ZJMV3 (Zhejiang mosquito virus 3) (blue) and related narnavirus RdRp amino acid sequences. (B) ORF-wide dN/dS values for forward and reverse directions of RdRp and Robin segments for both viruses.
Some careful consideration is required to reconcile our observations with results recently reported by Retallack et al. (2021), where it was shown that introducing mutations that are non-synonymous on the reverse ORF of Culex narnavirus 1 can reduce the fitness of this virus. In the concluding section, we consider the interpretation of these observations and discuss whether there may be implications for other viral families.
There are many examples of overlapping viral genes with staggered reading frames: this was first clearly described by Barrell et al. (1976) and has been reviewed by Chirico et al. (2010). Recent study by Nelson, Ardern, and Wei (2020) discusses how these can be identified. Our investigations indicate that the ambigrammatic ORFs discussed in this study are a different phenomenon, because they do not code for a functional protein. Our approach to analysing the ambigrammatic sequences is quite distinct from the rather complex machinery proposed by Nelson et al. (2020), because it emphasises the role of double synonyms as an unambiguous discriminant of the role of the ambigrammatic sequences.
For completeness, we also include some data on the relative prevalence of the two strands of the CxNV1 virus (both the RdRp and the Robin segments). The results do not appear to be unusual for a narnavirus, which is consistent with the hypothesis that the strand prevalences are not related to the ambigram property.
1.1. Ambigram rules and their significance
We start by describing the two genetic ambigram rules.
1.1.1. Rule 1. All complementary-strand stops are removable
Consider the reading frame on the complementary strand that has its codons aligned with those on the forward strand. Every codon on the forward strand corresponds to a complementary-strand codon read in the reverse direction. The rule states that any stop codon on the complementary strand can be removed by a single-point mutation that leaves the amino acid specified by the forward-read codon unchanged.
This result is demonstrated by the following argument, as discussed by DeRisi et al. (2019). Reversing the read direction and taking the pairing complement, the stop codons UAA, UAG and UGA in the standard genetic code become, respectively, UUA, CUA and UCA, for which the amino acids are Leu, Leu and Ser. It is only instances of leucine and serine in the forward sequence that can result in stop codons in the reverse read. The synonyms of Leu are CUN, UUA and UUG (where N means any base). The synonyms of Ser are UCN, AGU and AGC. The undesirable Leu codon UUA can be transformed to UUG by a single substitution. Similarly, the Leu codon CUA can be transformed to CUU, CUG or CUC by single substitutions. And the Ser codon UCA is transformed to UCU, UCG or UCC by single substitutions. We conclude that every stop codon on the reverse reading frame can be removed by a synonymous, single site nucleotide mutation.
Furthermore, it is found that complementary-strand stops cannot always be removed by synonymous substitutions in the other two read frames for the complementary strand (each case requires a separate and somewhat involved argument, also given by DeRisi et al. 2019). As a consequence of these two arguments, we need to discuss only the complementary read frame with aligned codons.
1.1.2. Rule 2. There exist double synonyms
Most synonymous mutations of the forward strand produce a non-synonymous change in the complementary strand, but the genetic code does include a number of double synonyms, where the reverse complement of a synonymous mutation is also a synonym. For example, codon AGG (Arg) can become CGG (Arg) via a synonymous mutation, while the reverse complement of AGG, which is CCU (Pro) transforms to CCG (Pro) under the same mutation.
The full set of double synonyms in the standard genetic code is as follows:
Two of the six synonyms of Ser are double synonyms, with reverse complements coding Arg. Conversely, two of the six synonyms of Arg are double synonyms, with reverse complement coding Ser.
Two more of the six synonyms of Arg are double synonyms, with reverse complement Pro. Conversely, two of the four synonyms of Pro are double synonyms coding for Arg.
Two of the six synonyms of Leu are double synonyms, with reverse complement Gln. Conversely, both synonyms of Gln are double synonyms, with reverse complement coding Leu.
Table 1 lists the sets of single and double synonyms for those amino acids that can have double synonyms. (We exclude the two synonyms of Ser and the one synonym of Leu for which the reverse complement is stop, because these do not occur in ambigrammatic genes.)
Table 1.
For each amino acid (AA) that can have double-synonym single-nucleotide mutations, we list all of the possible codons which do not code for stop on the complementary strand, indicating their reverse complement (Comp. AA).
AA | Codon | Comp. AA | ||
---|---|---|---|---|
Leu | UUG* | 1 + 0 | 1 + 0 | Gln |
CUU | 1 + 1 | 0 + 0 | Lys | |
CUC | 1 + 1 | 0 + 0 | Glu | |
CUG* | 1 + 2 | 1 + 0 | Gln | |
Pro | CCU* | 1 + 2 | 0 + 1 | Arg |
CCC | 1 + 2 | 0 + 0 | Gly | |
CCA | 1 + 2 | 0 + 0 | Trp | |
CCG* | 1 + 2 | 0 + 1 | Arg | |
Gln | CAA* | 1 + 0 | 1 + 0 | Leu |
CAG* | 1 + 0 | 1 + 0 | Leu | |
Arg | CGU | 1 + 2 | 0 + 0 | Thr |
CGC | 1 + 2 | 0 + 0 | Ala | |
CGA* | 1 + 3 | 0 + 1 | Ser | |
CGG* | 1 + 3 | 0 + 1 | Pro | |
AGA* | 1 + 1 | 0 + 1 | Ser | |
AGG* | 1 + 1 | 0 + 1 | Pro | |
Ser | UCU* | 1 + 1 | 0 + 1 | Arg |
UCC | 1 + 1 | 0 + 0 | Gly | |
UCG* | 1 + 2 | 0 + 1 | Arg | |
AGU | 1 + 0 | 0 + 0 | Thr | |
AGC | 1 + 0 | 0 + 0 | Ala |
The codons that have a double synonym are marked with an asterisk. For each of these codons, we list the number of mutations which are synonymous, and the number of double synonym mutations. In each case the numbers of single (double) mutations are written (), where the superscript n denotes transitions, and superscript v transversions. Also, double synonyms are counted in the list of single synonyms.
1.2. Implications
Our first rule shows that an ambigrammatic version of any gene can evolve, without making any changes to the amino acid sequence. This establishes how ambigrammatic sequences can arise, but it does not illuminate why they are favoured.
Combined with observed polymorphisms of narnaviruses, the second ambigram rule can give an indication of the utility of ambigrammatic sequences. In studies on the (usual) non-ambigrammatic genomes, the ratio of synonymous to non-synonymous mutations is used as an indicator of whether the nucleotide sequence codes for a protein: non-synonymous mutations are likely to be deleterious if the sequence codes for a functional protein. We shall adapt this approach to our study of ambigrammatic narnavirus genes. We assume that the forward direction is a coding sequence (usually for RdRp), and confine attention to those mutations which are synonymous in the forward direction. If the complementary strand codes for a functional protein, most of these synonymous mutations will inevitably result in changes of the complementary amino acid sequence. However, at many loci, the evolutionarily favoured amino acid will be one that allows double synonyms. In these cases, there can be non-deleterious mutations between a pair of codons that preserve the amino acid sequence of both the forward and the complementary strands.
If the complementary strand codes for a functional protein, we expect studies of the polymorphism of the gene would show that these double-synonym loci will be mutational ‘hotspots’, where mutations occur more frequently. In addition, the double-synonym pairs would be represented far more frequently than other mutations at these loci. These observations lead to two distinct tests for whether there is evolutionary pressure on the translated sequence of the complementary strand.
1.3. Ambigrammatic narnavirus genes
We analysed data from samples of two ambigrammatic narnaviruses, Culex narnavirus 1 (CxNV1, with forty-six genomes) and Zhejiang mosquito virus 3 (ZJMV3, with ten genomes). Both narnaviruses have an ambigrammatic RdRp coding gene, denoted CxNV1-RdRp and ZJMV3-RdRp, respectively. The reverse ORF has its codons aligned with the forward frame. In both forward and reverse reading frames, any stop codons are close to the 3′ end of the respective frame. The ambigrammatic feature is certainly a puzzle. There appear to be two classes of plausible explanations:
The reverse ORF codes a protein. This is logically possible, but if the RdRp gene is strongly conserved, there is very little flexibility in the rORF. However, in the absence of any additional evidence, it is the explanation that requires the fewest additional hypotheses.
The reverse ORF facilitates association of ribosomes with RNA. This could conceivably convey advantages by providing a mechanism to protect viral RNA from degradation, but without further evidence this requires additional hypotheses.
Recently, additional evidence has emerged, which may provide support for the second of these explanations. Specifically, the CxNV1 infection has recently been shown to be associated with another ambigrammatic viral RNA segment, termed Robin (Batson et al. 2021; Retallack et al. 2021). It was reported that this segment, CxNV1-Robin, is ambigrammatic, with forward and reverse codons aligned, over very nearly the entire length (about 850 nt), where direction designation is determined by which amino acid sequence appears more conserved. Again, any stop codons occur close to the 3′ end. Neither forward nor reverse directions of Robin are homologous with known sequences.
Because ambigrammatic genes are rare, finding two of them in the same system is a strong indication that their occurrence has a common explanation. This observation makes it appear unlikely that the reverse ORF is a device to ‘pack in’ an additional protein coding gene, and more likely that the ambigrammatic feature is associated with allowing ribosomes to associate with both strands of the viral RNA.
This reasoning suggests that the Robin gene may play a role in selecting for the ambigrammatic property (e.g. it may facilitate protection by ribosomes of the viral RNA). If this surmise is correct, we should expect to see a version of the Robin gene associated with other ambigrammatic narnaviruses. It is possible that this might be detected by a search of archived sequence data. Only Zhejiang mosquito virus 3 appeared to be observed multiple times to make detection of an additional Robin segment possible, so we concentrated on that system.
We were able to find evidence of an ambigrammatic RNA, of length approximately 900 nt, that co-occurs with ZJMV3 RdRp segment across multiple samples recovered by at least two studies that, like CxNV1 Robin, bears no recognisable homology (via BLAST [Altschul et al. 1990] or HHpred [Finn et al. 2011]) to publicly available sequences or CxNV1 Robin itself. Given the conjunction of these unusual features we strongly believe this ambigrammatic RNA to be the equivalent of a Robin segment in ZJMV3. We do note, however, that due to contig quality in the datasets where ZJMV3 was found, the typical pentamers found at the ends of narna- (Rodríguez-Cousiño et al. 1998), ourmia- (Wang et al. 2020), and, seemingly, mitovirus (Mizutani et al. 2018) RNA (5′-GGGGC and GCCCC-3′) cannot be identified unambiguously.
2. Methods
2.1. Tests for whether the complementary strand is coding
We have argued that doubly synonymous mutations will give a signature of the reverse strand coding for a functional protein. If the reverse-direction code is functional, then the only assuredly non-deleterious mutations would be the double-synonym ones, where one codon is transformed by a single-nucleotide substitution to another codon which preserves the amino acid coded in both the forward and the reverse directions.
Assume that we have M sequences of an ambigrammatic gene, fully sequenced and maximally aligned with each other, and that one strand, referred to as the ‘forward’ strand, codes for a functional protein. We identify a ‘consensus’ codon at each of the N loci, and then enumerate the set of variant codons at each amino acid locus. (The variant set is the set of all codons which were observed at a given locus and which differ from the consensus codon.) If the consensus codon at a locus is one of the twelve double-synonym codons listed in Table 1, we term this a doubly synonymous locus. The number of doubly synonymous loci is .
There are two different approaches to testing whether double synonyms indicate that the complementary strand is coding:
2.1.1. Look for the existence of mutational ‘hotspots’
We can look for evidence that the doubly-synonymous loci experience more substitutions than other loci.
For each codon locus k, we can determine the number of elements of the variant set, n(k) (i.e. the number of different codons observed at that codon locus which differ from the consensus). We also determine the fraction of codons f(k), which differ from the consensus codon, that is, the ratio of the number of polymorphs which do not have the consensus codon at site k to the total number of polymorphs, M. We then determine the averages of these quantities, and , for the doubly synonymous loci and for the other loci. If the ratios
(1) |
are large, this is evidence that the complementary strand is coding.
The null hypothesis, indicating that the reverse ORF is non-coding, is that the ratios Rn and Rf are sufficiently close to unity that the difference may be explained by statistical fluctuations.
2.1.2. Mutation frequencies test
We can also look at codon frequencies for different mutations at doubly synonymous loci. If the complementary strand is coding, we expect to find that the frequency of mutations observed at doubly synonymous loci will heavily favour double synonym codons over single synonyms. We consider the subset of double synonym loci where mutations are observed (i.e. where n(k) > 1). For each of these variable doubly synonymous loci, we can determine two numbers: is the number of distinct singly synonymous variants at locus k, and is the number of these variants which are also doubly synonymous (clearly ). If , that means that the mutations preserve the complementary-strand amino acid, which is an indication that the reverse strand is coding. If is the set of variable doubly synonymous loci, we then calculate
(2) |
If the complementary strand is coding, we expect
(3) |
to be close to unity.
However, there will also be beneficial or neutral mutations that do change the amino acids, so that not all mutations will be between sets of doubly-synonymous codons. We need to be able to quantify the extent to which finding other than double synonym mutations is an indication that the reverse strand is non-coding. We must do this by comparison with a null hypothesis, in which the reverse strand is non-coding.
2.2. Null hypothesis for mutation frequencies
Let R0 be the value of the ratio R that is derived from this null hypothesis that the complementary strand is non-coding. In order to compute the expected ratio, R0, we adopt the following approach. We assume that the M sequences are sufficiently similar that only a small fraction of loci have undergone mutations. We adopt the Kimura model (Kimura 1980), which assumes that the mutation rate for transitions ( or ) is different from the rate for transversions (other single-nucleotide mutations), and negligible for other types of mutation. The ratio of these rates is
(4) |
If the numbers of single (double) synonyms of the consensus nucleotide at locus k leading to transitions or transversions are, respectively, and (), then we estimate
(5) |
The numbers are given in Table 1 for all of the double-synonym codons.
2.3. Strandedness of Culex narnavirus 1 genomes
The Californian mosquito dataset (Batson et al. 2021) from which most Culex narnavirus 1 sequences came from was prepared using reagents that allow the inference of RNA template strandedness. Based on this we confirmed the expected excess of RNA templates with the same direction as the positive strand (i.e. mRNA-sense) of both RdRp and Robin segments of Culex narnavirus 1 (see Fig. 2), since it is a positive-sense single-stranded RNA virus. To ensure this was correct, we also checked that some other viruses present in the same dataset also followed expected strand excesses. Two positive sense single-stranded RNA viruses, Culex flavivirus and Marma virus, and two negative sense single-stranded orthomyxoviruses—Astopletus and Wuhan mosquito virus 6—did indeed have the expected overwhelming excess of positive-sense reads for the former two, and the moderate excess of negative-sense reads for the latter two (Waldron et al. 2018).
Figure 2.
(A) Number of reads (from top to bottom): in the positive sense (i.e. same as genomic) with respect to RdRp segment of Culex narnavirus 1, the ratio of positive- to negative-sense reads, and negative-sense reads. Numbers of positive-sense reads and ratio of positive- to negative-sense reads are log-scaled while number of negative reads is displayed in normal space. Ratios where no negative sense reads could be identified are highlighted with hatches and the ratio reflects positive-sense reads (i.e. by assuming there is one negative-sense read). (B) Same information as A, but showing read numbers for Robin segment of Culex narnavirus 1. The results are displayed for the forty-two different strains, with the samples ordered in descending order of the total number of reads for each segment.
The breakdown of read strandedness was carried out by aligning all non-host reads from each sample in the Californian dataset (Batson et al. 2021) to a consensus sequence of RdRp or Robin segments of Culex narnavirus 1 with minimap2 (Li 2018). Since the library was prepared using reagents that can discriminate the orientation of the original RNA template with respect to a reference, any reverse R1 read (and its forward R2 mate) corresponds to an RNA template in the same direction as the reference, which is positive sense and vice versa forward R1 reads (and reverse R2 mate) correspond to a template in the reverse direction (i.e. negative sense).
2.4. Finding the Robin segment of Zhejiang mosquito virus 3
We looked through assembled contig datasets from two metagenomic mosquito studies (three from China and six from Australia) (Shi et al. 2016, 2017], kindly provided to us by Mang Shi and Edward C Holmes. We clustered contigs from the nine datasets by similarity using CD-HIT (Fu et al. 2012) with a threshold of 90% and looked for clusters that contained contigs from at least six samples, that did not have standard deviation in contig length >1,200, and had fewer than 200 contigs. Of the hundreds of clusters filtered this way only a handful also possessed sequences ambigrammatic across at least ninety per cent of their length and only two clusters were mostly comprised ambigrammatic sequences, while the rest were clearly recognisable as mosquito contigs. Of the two clusters, one was identifiable as the RdRp of Zhejiang mosquito virus 3, while we presume the other to be an unrecognisably distant orthologue of Culex narnavirus 1 Robin, on account of its co-occurrence with ZJMV3 RdRp, ambigrammaticity, and length.
A phylogenetic tree of Zhejiang mosquito virus 3, Culex narnavirus 1 and their closest relatives was recovered by aligning their RdRp sequences at the amino acid level with MAFFT (E-INS-i setting) (Katoh et al. 2005) and inferring the phylogeny with RAxML (Stamatakis 2014] under a BLOSUM62+CAT (Henikoff and Henikoff 1992; Lartillot and Philippe 2004) amino acid substitution matrix and site rate heterogeneity. The tree (displayed in Fig. 1) was rooted with Ochlerotatus-associated narna-like virus 1 sequence, as it appeared to be the longest branch in the tree.
3. Results
Next, we report the results of our studies of polymorphism of the four ambigrammatic narnavirus genes. We discuss what can be learned from applying standard techniques, before discussing the results of our tests for whether the reverse ORF codes for a protein.
3.1. Forward reading frame
Each sequence was trimmed to a length of 3N nucleotides. We identified a consensus nucleotide at each locus and determined the set of variant nucleotides at each locus. We determined the total number of transition and transversion mutations that are observed, and , respectively. (More precisely, these are the total number of codon loci, across all polymorphs, for which the there is a single-nucleotide mutation relative to the consensus codon which is a transition or transversion.) We also determined the total number of mutations at each position in the codon . We estimated the average number of variable sites r as the total number of nucleotide variants, divided by the product of the number of sequences and alignment length. We also estimated the ratio α of the rate of selected transition mutations to the rate of transversions:
(6) |
(recall that there are twice as many transversions as transitions). We also determined a ‘normalised’ triplet of variable sites for each position within the codon: . Our results on the nucleotide-level investigation of polymorphism are summarised in Table 2.
Table 2.
Nucleotide-level statistics of mutations.
Strand | N | M | r | α | ||||
---|---|---|---|---|---|---|---|---|
CxNV1-RdRp | 1033 | 46 | 606 | 362 | 0.0068 | 3.35 | (181, 140, 645) | (0.56:0.44:2.00) |
ZJMV3-RdRp | 1075 | 12 | 210 | 39 | 0.0064 | 10.80 | (47, 29, 173) | (0.57:0.35:2.08) |
CxNV1-Robin | 272 | 46 | 213 | 146 | 0.0096 | 2.92 | (107, 100, 152) | (0.89:0.84:1.27) |
ZJMV3-Robin | 304 | 10 | 84 | 48 | 0.0145 | 3.50 | (35, 31, 66) | (0.80:0.70:1.50) |
The consensus sequence has N codons. Among the mutations observed in M polymorphs, there are transitions, transversions, with overall rate r and transition/transversion rate ratio α. The numbers total mutations at each base position is , and normalising these to ratios via Equation (6) yields .
We then assigned a consensus codon at each codon locus, selecting the frame by the criterion of minimising the number of stop codons. For each of the N codons, we determined the variant set of codons that were observed in each of the M sequences. The total number of synonymous and non-synonymous single-nucleotide changes in the variant sets was and , respectively. The total number of mutations (relative to the consensus sequence) encountered in the variant sets where two or three nucleotides were changed was . For each codon, there are numbers of possible non-synonymous mutations that are transistions and transversions, and , and numbers of synonymous mutations that are transitions and transversions, and (with ). Under the null hypothesis that the sequence is non-coding, the expected value of the ratio
(7) |
is
(8) |
We also determined the fraction of codons where multi-nucleotide mutations are observed, . We present our results for the codon-level mutations in Table 3, which includes information for both the forward and the complementary read directions (with codon boundaries aligned for the complementary direction).
Table 3.
Summary of results for codon-level mutations.
Strand | |||||||
---|---|---|---|---|---|---|---|
CxNV1-RdRp-fwd | 623 | 189 | 123 | 0.303 | 2.37 | 0.128 | 0.12 |
ZJMV3-RdRp-fwd | 170 | 59 | 13 | 0.347 | 2.14 | 0.162 | 0.012 |
CxNV1-Robin-fwd | 112 | 141 | 89 | 1.26 | 2.34 | 0.538 | 0.45 |
ZJMV3-Robin-fwd | 49 | 61 | 14 | 1.24 | 2.35 | 0.528 | 0.046 |
CxNV1-RdRp-comp | 136 | 676 | 123 | 4.97 | 2.43 | 2.04 | 0.12 |
ZJMV3-RdRp-comp | 50 | 179 | 13 | 3.58 | 2.14 | 1.67 | 0.012 |
CxNV1-Robin-comp | 66 | 187 | 89 | 2.83 | 2.39 | 1.23 | 0.45 |
ZJMV3-Robin-comp | 32 | 78 | 14 | 2.43 | 2.28 | 1.07 | 0.046 |
The numbers of single-nucleotide synonymous and non-synonymous mutations are and respectively, is the number of mutations with more than one base changed, is the null value of , and if the fraction of mutations that have multiple-nucleotide changes.
The alignments are ambigrammatic, in the sense that there are no stop codons in the interior of the sequence. None of the individual sequences had stop codons in the body of the sequence in either direction.
We also computed ORF-wide dN/dS values (plotted in Fig. 1B), by assuming that every mutation in the alignment has occurred only once to be conservative. This was motivated by the presence of pairs of sites with four haplotypes between them (4G sites), an indication that recombination may be a potential issue with narnavirus sequences. Normalising the number of observed non-synonymous and synonymous mutations was done by assuming a transition/transversion ratio of 2, consistent with Equation (6). These values dN/dS values are slightly different from the ratios in Table 3 because the latter excludes mutations where more than one base differs from the consensus codon. In all but one of the cases, dN/dS is higher than , because the multiple nucleotide mutations which are included in dN/dS are predominantly non-synoymous.
Based upon these tables, we can make the following observations and deductions:
Diversity. We observe that both RdRp and Robin segments are comparable in their diversity, for both CxNV1 and ZJMV3. As expected, RdRp sequences are highly conserved at the amino acid level. Robin, on the other hand, appears far more relaxed at the amino acid level and, consistent with this, diverged beyond recognition between CxNV1 and ZJMV3.
Relative mutation rate by codon position. For RdRp sequences, more mutations are observed at the third nucleotide in each codon, as expected for a sequence that preserves the amino acid sequence (because most synonymous mutations involve the third nucleotide of a codon). In the case of Robin sequences, the frequencies of mutation are much closer to being equal, to the extent that for CxNV1-Robin the null hypothesis that the rates are equal is not definitively rejected. However, mutations at different codon sites are sufficiently weighted toward the third position that we shall assume that Robin does code for a functional protein. While the values of are very different for RdRp and Robin, their values are comparable for CxNV1 and ZJMV3, which is an indication that the selective pressures on both viruses are the same.
Rate of multiple-nucleotide mutations. The fraction of multiple-nucleotide mutations is higher for Robin sequences than it is for RdRp sequences. This may be an indication that the Robin sequence is under strong selective pressure, because some aminoacid substitutions can only be achieved through multiple nucleotide mutations.
Transition to transversion ratio. Three of the values of α were similar to each other, while the value for ZJMV3-RdRp was higher than the others. Because transitions occur at a higher intrinsic rate, a lower value of α indicates that observed mutations are biased in favour of the rarer transversions, which is an indication of unusual selective pressures. The fact that the values of α for the Robin segments are comparable to, or lower than, the values for RdRp are a further indication that Robin is under similar selective pressure too.
Ratio of non-synonyms to synonyms. For the RdRp segments the values of are much smaller than the values R0 predicted (Equation (8)) by the null hypothesis that mutations are random. This indicates that the selective pressure on RdRp acts to preserve the amino acid sequence. For Robin segments, the values of R are much larger, but still smaller than the prediction from the null hypothesis. This indicates that while points 1–4 above indicate that Robin is under some selective pressure, the amino acid sequence is not strongly conserved. This is consistent with the hypothesis that the selection acting on Robin is relaxed.
Figure 3 illustrates the distribution of mutations across the forward and reverse reading frames of all four ORFs for both CxNV1 and ZJMV3. As expected, there is evidence that some regions accumulate mutations more readily than others. The pattern is consistent with what would be expected from the statistical reductions in the tables.
Figure 3.
Distribution of synonymous (blue), non-synonymous (red) substitutions, and doubly synonymous sites (green) in CxNV1 (upper plots) and ZJMV3 (lower plots) RdRp (left) and Robin (right) segments in both directions (forward toward top, reverse toward bottom). Translated reverse ORFs are shown backwards (segment coordinate space). Double synonyms do not overlap perfectly because forward and reverse ORFs differ in length and begin and end at different positions along the segment.
3.2. Complementary reading frame
We determined the set of doubly synonymous codons in the consensus sequence, and the subset of of these which have variant codons.
Mutational hotspots test. We applied the mutational hotspots test to all four sequences, as described in Equation (1). The results (Table 4) show no evidence that the doubly synonymous sites are undergoing more frequent mutations, or that their mutations are more widely spread across the dataset.
Mutation rate test. We examined the number of mutations in the set of doubly synonymous sites, which were variable. We found (Table 5) that many more of the observed mutations at these sites are only singly synonymous, when a doubly synonymous mutation is possible, which is further evidence that the complementary strand is non-coding. The numbers of doubly-synonymous mutations were quite low, and so it was not possible to make a reliable comparison of the ratio with the null hypothesis.
Ratio of non-synonyms to synonyms. The ratios of non-synonymous to synonymous mutations, presented in Table 3 and Fig. 1B, were lower than the null hypothesis for the forward direction. This is readily explained as an indication that the forward ORF codes for a functional protein. However, the ratios for the reverse direction were all higher than the null hypothesis. This observation is explained, qualitatively, as follows. If the forward direction strictly conserves the amino acid sequence, then all of the mutations that are synomymous on the reverse strand are doubly synonymous. Because only twelve of the soxty-four codons allow for doubly-synonymous mutations, the ratio would be very high for the complementary strand if the forward sequence were to be exactly conserved. We computed this ratio, and found 11.2 for CxNV1-RdRp, and similar values for the other sequences. This theoretical ratio is considerably higher than the measured value of 4.97, because the forward sequence is not exactly conserved. For Robin segments, the value of R for the reverse ORF is only slightly higher than the null hypothesis, because the amino acid sequence is only weakly conserved.
Table 4.
Summary of results of the mutational hotspots test.
Sample |
|
|
Gene | N | Rn | Rf | |
Double syns., CxNV1-RdRp | 0.954 | 0.161 | CxNV1-RdRp | 1033 | 220 | 0.986 | 1.044 |
Other codons, CxNV1-RdRp | 0.968 | 0.155 | ZJMV3-RdRp | 1075 | 219 | 0.975 | 0.840 |
Double syns., ZJMV3-RdRp | 1.20 | 0.042 | CxNV1-Robin | 272 | 54 | 1.19 | 1.16 |
Other codons, ZJMV3-RdRp | 1.23 | 0.050 | ZJMV3-Robin | 304 | 81 | 0.926 | 0.978 |
Double syns, CxNV1-Robin | 1.76 | 0.195 | |||||
Other codons, CxNV1Robin | 1.48 | 0.169 | |||||
Double syns, ZJMV3-Robin | 0.889 | 0.096 | |||||
Other codons, ZJMV3-Robin | 0.960 | 0.097 |
Left panel: values of the average number of elements of the variant set, and of the average fraction of non-consensus codons, , for double-synonym sites, and for the other sites. Right panel: N is the number of loci in the alignment, is the number of double-synonym loci, and Rn, Rf are the ratios of and at double-synonym sites to their values at other sites. The differences of these ratios from unity do not appear significant.
Table 5.
Results for the mutational codon frequency test: N is the number of loci in the alignment, is the number of mutationally active double-synonym loci, and are, respectively, the numbers of single and double synonym mutations.
The actual ratio is compared with the null-hypothesis value R0, Equation (8).
Sample | N | R | R 0 | ||||
---|---|---|---|---|---|---|---|
CxNV1-RdRp | 1033 | 136 | 151 | 60 | 2.51 | 3.02 | 0.83 |
ZJMV3-RdRp | 1075 | 219 | 33 | 20 | 1.65 | 3.21 | 0.51 |
CxNV1-Robin | 272 | 40 | 24 | 3 | 8.00 | 3.21 | 2.49 |
ZJMV3-Robin | 304 | 59 | 20 | 4 | 4.00 | 4.04 | 0.99 |
4. Discussion
We have argued that doubly synonymous codons provide a key to understanding whether ambigrammatic viral RNA segments code for two functional proteins. If there were two coding genes, doubly synonymous mutations would be mutational hotspots, because they are unambiguously non-deleterious. We applied our analysis to recent observations of polymorphisms in two ambigrammatic narnaviruses: Culex narnavirus 1 and Zhejiang mosquito virus 3. There was no evidence that doubly synonymous sites are mutational hotspots, or that there is a prevalence of mutations to other doubly-synonymous codons at these sites. Other, circumstantial, evidence favours the interpretation that the complementary strand is non-coding. Ambigrammatic sequences have been observed in other narnaviruses, but they are undoubtedly a rare phenomenon. If the rORF of both RdRp and Robin segments had evolved to code for a functional protein, each RNA segment would code for two genes. Given that ambigrammatic sequences are rare (DeRisi et al. 2019), finding a system where two had evolved independently would be highly improbable. Moreover, because the ambigrams are full length, each of the ambigrammatically coded sequences would code for two genes that have the same length as each other.
An observation of the simultaneous detection of two or more ambigrammatic genes would strongly favour models where there is an advantage in evolving an ambigrammatic sequence that is independent of whether the reverse ORFs are translated into functional proteins. This argument led us to discover the Robin segment of ZJMV3 and suggests that more ambigrammatic narnaviruses with at least two segments will be discovered by metagenomic surveys, when suitable data sets become available. Similarly, the elusive Robin segment should already be hiding in datasets of narnaviruses descended from the common ancestor of CxNV1 and ZJMV3.
Our studies of polymorphisms in the forward direction indicate that both RdRp and Robin are under purifying selection. In the case of RdRp, the amino acid sequence is strongly conserved, but the Robin sequence is not.
The role of the RdRp coding fragment is already understood. This makes it plausible that the other fragment plays a role that facilitates the evolution of ambigrams. If the lack of stop codons on the complementary strand is not required to allow protein synthesis, we can surmise that its role is to allow ribosomes to associate with the complementary strand. Having RNA segments able to be covered by ribosomes may provide some protection for the viral RNA against degradation.
Recent experiments indicate that ambigrammatic narnavirus genes display unusual ribosome profiles, with a ‘plateau’ structure (Retallack et al. 2021). It has been argued (Wilkinson et al. 2021) that the plateaus indicate that the ribosomes attached to the viral RNA become stalled, creating a cover (see also Cepelewicz 2020). The ambigram property allows binding of ribosomes to both strands, hiding the viral RNA from host defence and degradation mechanisms. We can surmise that there exists a molecule which binds to the 3′ end of the viral RNA, preventing release of ribosomes (Wilkinson et al. 2021). It is possible that Robin plays a role in this process, by creating a protein which blocks ribosome detachment at 3′ end. Alternatively, it might be proposed that the ribosome ‘traffic jam’ is a consequence of the structure of the RdRp itself, due to formation of RNA hairpins. However, these would have to trade off against RdRp function. The proposed mechanism involving Robin making a blocking protein has the advantage that the RdRp works efficiently when the viral RNA concentration is small. Later, after it has duplicated many copies of itself and of Robin, the Robin protein attaches to the viral RNA and creates stalled polysomes, protecting the viral RNA from degradation.
There may, however, be additional viral genes involved in ambigrammatic narnavirus infections, and there are many possible roles for the Robin gene. It could code a protein which inhibits the mechanism of ‘no-go-decay’, which releases stalled ribosomes, play a role in the viral suppression of RNAi (Mierlo et al. 2014) or in formation of syncytia or viral particles. Without a better understanding of the narnavirus lifecycle in arthropods, it is not certain whether Robin does code for a protein which blocks detachment of ribosomes.
We did search the CxNV1 dataset for further fragments of ambigrammatic viral RNA, which might be candidates for coding additional genes. A search for additional ambigrammatic sequences greater than 200 nt in length did not produce any candidates.
A recent preprint (Retallack et al. 2021) presents evidence that peptides translated from the reverse ORF of the RdRp segment can be detected (though not quantified) and that inserting mutations in the RdRp sequence which are synonymous in the forward reading frame but introduce stop codons in the reverse frame reduces the fitness of the virus. The mutations were clustered close to the 3′ end of the RdRp gene. These observations could be interpreted as indicating that the reverse reading frame codes for a functional protein or that all ORFs in the cell may be translated in a ‘leaky’ way. However, changing the RNA sequence may also interfere with the action of molecules which bind to the RdRp strand.
The data on strandedness indicates that there are mechanisms that favour the presence of one strand rather than the other. Our results do not suggest that this is related to the ambigrammatic property, and are consistent with the strandedness bias being determined by mechanisms that are also found in other virus families. Narnaviruses are distant relatives of the RNA bacteriophage family Leviviridae and of eukaryote-infecting mitoviruses that replicate in mitochondria, while ourmiaviruses are most closely related (Shi et al. 2016). It has been suggested that narnaviruses (Rodríguez-Cousiño et al. 1998), as well as their relatives ourmia-, mito- and leviviruses, possess conserved secondary RNA structures at their ends in narnaviruses these structures, particularly at the 3′ end, have been shown to proect genomic RNA from exonuclease degradation (Esteban et al. 2005). We may suppose that these secondary RNA structures could also function in regulating translation, though understandably the corpus of narnavirus molecular biology research is scarce. It may be tempting to rely on similar genomic features discovered in distant relatives of narnaviruses, namely RNA bacteriophages in the family Leviviridae, that have been investigated in far greater detail, with the caveat that these viruses possess a markedly different genomic organisation and infect entirely different hosts, and thus may prove more useful conceptually.
Acknowledgments
We thank Hanna Retallack and Joe DeRisi for discussions of their experimental studies of narnaviruses and Amy Kistler for assistance with narnaviral genomes. We thank Amy Kistler and John Pak for comments on a draft. We would like to thank Mang Shi and Edward C Holmes for sharing assembled contigs from Australian and Chinese mosquito metagenomic datasets. G.H. and D.Y. were supported by the Chan Zuckerberg Biohub; MW thanks the Chan Zuckerberg Biohub for its hospitality.
Data availability
Sequence data and scripts used to analyse and visualise data are available from: https://github.com/evogytis/narna-double-synonym.
Author contributions
G.D. devised and directed the search for an analog of Robin in the ZJMV3 sequence archive. M.W. produced a draft of the manuscript following discussions with the other authors about the recent discovery of a narnavirus system which has two ambigrammatic genes. All authors contributed to writing the manuscript, and reviewed the manuscript before submission.
Conflict of interest: None declared.
References
- Altschul S. F. et al. (1990) ‘Basic Local Alignment Search Tool’, Journal of Molecular Biology, 215: 403–10. [DOI] [PubMed] [Google Scholar]
- Barrell B. G., Air G. M., Hutchison C. A. (1976) ‘Overlapping Genes in Bacteriophage phiX174’, Nature, 264: 34–41. [DOI] [PubMed] [Google Scholar]
- Batson J. et al. (2021) ‘Single Mosquito Metatranscriptomics Recovers Mosquito Species, Blood Meal Sources, and Microbial Cargo, Including Viral Dark Matter. eLife’, accepted, 2021, bioRxiv: doi:10.1101/2020.02.10.942854.
- Cepelewicz J. ‘New Clues about ‘Ambigram’ Viruses with Strange Reversible Genes’. Quanta Magazine, 2020. <https://www.quantamagazine.org/new-clues-about-ambigram-viruses-with-strange-reversible-genes-20200212/>. Accessed April 2021. [Google Scholar]
- Chirico N., Vianelli A., Belshaw R. (2010) ‘Why Genes Overlap in Viruses’, Proc Biol Sci, 277: 1701. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cobián Güemes A. G. et al. (2016) ‘Viruses as Winners in the Game of Life’, Annual Review of Virology, 3: 197–214. [DOI] [PubMed] [Google Scholar]
- Cook S. et al. (2013) ‘Novel Virus Discovery and Genome Reconstruction from Field RNA Samples Reveals Highly Divergent Viruses in Dipteran Hosts. PLoS One, 8(11): e80720. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DeRisi J. et al. (2019) ‘An Exploration of Ambigrammatic Sequences in Narnaviruses’, Scientific Reports, 9: 17982. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dinan A. M. et al. (2020) ‘A Case for a Negative-Strand Coding Sequence in a Group of Positive-Sense RNA Viruses’, Virus Evolution, 6: veaa007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Esteban R., Vega L., Fujimura T. (2005) ‘Launching of the Yeast 20 s Rna Narnavirus by Expressing the Genomic or Antigenomic Viral Rna in Vivo’, The Journal of Biological Chemistry, 280: 33725–34. [DOI] [PubMed] [Google Scholar]
- Finn R. D., Clements J., Eddy S. R. (2011) ‘HMMER Web Server: Interactive Sequence Similarity Searching’, Nucleic Acids Research, 39: W29–37 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fu L. et al. (2012) ‘CD-HIT: Accelerated for Clustering the Next-Generation Sequencing Data’, Bioinformatics (Oxford, England), 28: 3150–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Göertz G. P. et al. (2019) ‘Mosquito Small RNA Responses to West Nile and Insect-Specific Virus Infections in Aedes and Culex Mosquito Cells’, Viruses, 11: 271. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Henikoff S., Henikoff J. G. (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22): 10915–10919. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hillman B. I., Cai G. (2013) ‘The Family Narnaviridae: Simplest of RNA Viruses’, Advances in Virus Research, 86: 149–76. [DOI] [PubMed] [Google Scholar]
- Katoh K. et al. (2005) ‘MAFFT Version 5: Improvement in Accuracy of Multiple Sequence Alignment’, Nucleic Acids Research, 33: 511–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kimura M. (1980) ‘A Simple Method for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences’, Journal of Molecular Evolution, 16: 111–20. [DOI] [PubMed] [Google Scholar]
- Lartillot N., Philippe H. (2004) ‘A Bayesian Mixture Model for across-Site Heterogeneities in the Amino-Acid Replacement Process’, Molecular Biology and Evolution, 21: 1095–109. [DOI] [PubMed] [Google Scholar]
- Li H. (2018) ‘Minimap2: Pairwise Alignment for Nucleotide Sequences’, Bioinformatics (Oxford, England), 34: 3094–100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mizutani Y. et al. (2018) ‘Novel Mitoviruses and a Unique Tymo-Like Virus in Hypovirulent and Virulent Strains of the Fusarium Head Blight Fungus, Fusarium Boothii’, Viruses, 10: 584. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nelson C. W., Ardern Z., Wei X. (2020) ‘Olgenie: Estimating Natural Selection to Predict Functional Overlapping Genes’, Molecular Biology and Evolution, 37: 2440–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Retallack H. et al. (2021) ‘Persistence of Ambigrammatic Narnaviruses Requires Translation of the Reverse Open Reading Frame’, Journal of Virology, JVI.00109-21. doi: 10.1128/JVI.00109-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rodríguez-Cousiño N. et al. (1998) ‘Yeast Positive-Stranded Virus-like RNA Replicons. 20 S and 23 S RNA Terminal Nucleotide Sequences and 3’ End Secondary Structures Resemble Those of RNA Coliphages’, The Journal of Biological Chemistry, 273: 20363–71. [DOI] [PubMed] [Google Scholar]
- Shi M. et al. (2016) ‘Redefining the Invertebrate RNA Virosphere’, Nature, 540: 539–4687. [DOI] [PubMed] [Google Scholar]
- Shi M. et al. (2017) ‘High-Resolution Metatranscriptomics Reveals the Ecological Dynamics of Mosquito-Associated RNA Viruses in Western Australia’, Journal of Virology, 91: 1098–5514. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stamatakis A. (2014) ‘RAxML Version 8: A Tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies’, Bioinformatics (Oxford, England), 30: 1312–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- van Mierlo J. T. et al. (2014) ‘Novel Drosophila Viruses Encode Host-Specific Suppressors of RNAi’. PLoS Pathogens, 10: e1004256. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Waldron F. M., Stone G. N., Obbard D. J. (2018) Metagenomic sequencing suggests a diversity of RNA interference-like responses to viruses across multicellular eukaryotes. PLOS Genetics, 14(7): e1007533. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang Q. et al. (2020) ‘A Single ssRNA Segment Encoding RdRp is Sufficient for Replication, Infection, and Transmission of Ourmia-Like Virus in Fungi’, Frontiers in Microbiology, 11: 379–302X. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wilkinson M., Yllanes D., Huber G. (2021) ‘Polysomally Protected Viruses’, Physical Biology, doi:10.1088/1478-3975/abf5b5. Online ahead of print. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
Sequence data and scripts used to analyse and visualise data are available from: https://github.com/evogytis/narna-double-synonym.