Skip to main content
Life logoLink to Life
. 2017 Apr 18;7(2):20. doi: 10.3390/life7020020

The Maximal C3 Self-Complementary Trinucleotide Circular Code X in Genes of Bacteria, Archaea, Eukaryotes, Plasmids and Viruses

Christian J Michel 1
Editor: David Deamer1
PMCID: PMC5492142  PMID: 28420220

Abstract

In 1996, a set X of 20 trinucleotides was identified in genes of both prokaryotes and eukaryotes which has on average the highest occurrence in reading frame compared to its two shifted frames. Furthermore, this set X has an interesting mathematical property as X is a maximal C3 self-complementary trinucleotide circular code. In 2015, by quantifying the inspection approach used in 1996, the circular code X was confirmed in the genes of bacteria and eukaryotes and was also identified in the genes of plasmids and viruses. The method was based on the preferential occurrence of trinucleotides among the three frames at the gene population level. We extend here this definition at the gene level. This new statistical approach considers all the genes, i.e., of large and small lengths, with the same weight for searching the circular code X. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated to each gene. At the gene level, the circular code X is strengthened in the genes of bacteria, eukaryotes, plasmids, and viruses, and is now also identified in the genes of archaea. The genes of mitochondria and chloroplasts contain a subset of the circular code X. Finally, by studying viral genes, the circular code X was found in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes.

Keywords: circular code in genes, DNA genes, RNA genes, double-stranded genes, single-stranded genes

1. Introduction

Circular code is a mathematical structure of genes and genomes. This concept initially found for genes is extended for genomes (non-coding regions of eukaryotes) according to recent results. A circular code X is a set of words such that any motif from X, called X motif, allows it to retrieve, maintain, and synchronize the original (construction) frame.

The circular code X identified in the genes of bacteria, eukaryotes, plasmids, and viruses [1,2] contains the 20 following trinucleotides

X= {AAC,AAT,ACC,ATC,ATT,CAG,CTC,CTG,GAA,GAC, GAG,GAT,GCC,GGC,GGT,GTA,GTC,GTT,TAC,TTC} (1)

which allows it to both retrieve the reading frame with a window of 13 nucleotides (Figure 3 in [3]) and to code the 12 following amino acids

{Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Phe, Thr, Tyr, Val}. (2)

The current genetic code is not circular. Thus, it cannot retrieve the reading frame. The loss during evolution of this circular code property on the 4-letter alphabet {A,C,G,T} required a complex translation mechanism using 20 amino acids and proteins in current genomes.

X motifs from Equation (1) are identified in (i) genes “universally” [1,4]; (ii) tRNAs of prokaryotes and eukaryotes [3,5]; (iii) rRNAs of prokaryotes (16S) and eukaryotes (18S), in particular in the ribosome decoding center where the universally conserved nucleotides G530, A1492, and A1493 are included in the X motifs [3,6,7]; and (iv) genomes (non-coding regions of eukaryotes) [4,8].

The X motifs of maximal cardinality 20 (composition) in genes with the properties of the circular code, C3 and complementary allow the two reading frames and the four shifted frames to be retrieved by pairing between DNAs-DNAs, DNAs-mRNAs, mRNAs-rRNAs, mRNAs-tRNAs, and rRNAs-tRNAs, as shown with a 3D visualization of the X motifs in the ribosome [3,6,7].

The X motifs in genomes have a different structure compared to the X motifs in genes [8]. Indeed, their cardinality is not maximal (less than 10 for an order of magnitude), their size is longer, and their structure contains repeated trinucleotides. Furthermore, the X motifs of minimal cardinality 1 generated with the 20 repeated trinucleotides tn where tX (Equation (1)) are very common in the genomes of eukaryotes (e.g., [8,9,10]). Their length n can be very large (e.g., n>6000, see Figure 1). The repeated trinucleotides are very unstable with mutation rates up to 100,000 times higher than the genomic average mutation rate. Mutation in repeats increases its evolutionary stability.

Figure 1.

Figure 1

Model of evolution of the X circular code motifs (Equation (1)) by increasing its cardinality (composition) and decreasing its length. Evolution begins with X motifs of minimal cardinality 1 (long repeated trinucleotides) in genomes (the examples given are extracted from Table 2 in [8]). Then, the mutations in repeated trinucleotides lead to X motifs of low cardinality <10 (different and short repeated trinucleotides) in genomes (the examples given are extracted from Table 4 in [8]) up to X motifs of high 10 and maximal cardinality 20 coding the 12 amino acids (Equation (2)).

A model of evolution of the X motifs in genes and genomes can be proposed according to the previous works and the recent results [8]. It proposes that the X motifs of maximal cardinality 20 in genes have evolved from the X motifs of minimal cardinality 1 (repeated trinucleotides) in genomes (Figure 1). An X motif of minimal cardinality 1 which is unstable, mutates into an X motif of low cardinality <10 containing thus different repeated trinucleotides of short lengths. This evolutionary process continues by increasing the cardinality and decreasing the length of the X motifs up to generate the X motifs of high 10 and maximal cardinality 20 coding the 12 amino acids (Equation (2)) in genes. The X motifs of high cardinality have acquired the protein coding function in addition to the reading frame retrieval. This model suggests that the property of reading frame retrieval has preceded the protein coding function.

Since 1996, all the statistical analyses studying the preferential occurrence of trinucleotides among the three frames were done at the gene population level (kingdoms, taxonomic groups, genomes). We extend here the method from [1] at the gene level. This new approach is important as all the genes, i.e., of large and small lengths, are now considered with the same weight in the statistical definition for searching the circular code X. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated to each gene. Thus, at the gene level, the circular code X is searched here in the genes of bacteria, archaea, eukaryotes, plasmids, viruses, and eukaryotic organelles, i.e., mitochondria and chloroplasts. Finally, genes of double-stranded DNA and RNA viruses, and single-stranded DNA and RNA viruses are also analysed with this approach in order to assign a genetic information unit (DNA or RNA, double-stranded or single-stranded) to the circular code X.

2. Method

2.1. Definitions

We recall a few definitions without detailed explanations (i.e., without figures and examples) for understanding the main properties of the trinucleotide circular code X identified in genes [1,2].

Notation 1.

Let us denote the nucleotide 4-letter alphabet B={A,C,G,T} where A stands for Adenine, C stands for Cytosine, G stands for Guanine, and T stands for Thymine. The trinucleotide set over B is denoted by B3={AAA,,TTT} . The set of non-empty words (words, respectively) over B is denoted by B+ (B*, respectively).

Notation 2.

Genes have three frames f. By convention here, the reading frame f=0 is set up by a start trinucleotide {ATG,CTG,GTG,TTG}, and the frames f=1 and f=2 are the reading frame f=0 shifted by one and two nucleotides in the 53 direction (to the right), respectively.

Two biological maps are involved in gene coding.

Definition 1.

According to the complementary property of the DNA double helix, the nucleotide complementarity map C:BB is defined by C(A)=T, C(C)=G, C(G)=C, and C(T)=A. According to the complementary and antiparallel properties of the DNA double helix, the trinucleotide complementarity map C:B3B3 is defined by C(l0l1l2)=C(l2)C(l1)C(l0) for all l0,l1,l2B. By extension to a trinucleotide set S, the set complementarity map C:(B3)(B3), being the set of all subsets of B3, is defined by C(S)={v :u,vB3,uS,v=C(u)}, e.g., C({CGA, GAT})={ATC,TCG}.

Definition 2.

The trinucleotide circular permutation map P:B3B3 is defined by P(l0l1l2)=l1l2l0 for all l0,l1,l2B. P2 denotes the 2nd iterate of P. By extension to a trinucleotide set S, the set circular permutation map P:(B3)(B3) is defined by P(S)={v :u,vB3,uS,v=P(u)}, e.g., P({CGA, GAT})={ATG,GAC} and P({CGA, GAT})={ACG,TGA} .

Definition 3.

A set SB+ is a code if, for each x1,,xn,y1,,ymS, n,m1, the condition x1xn=y1ym implies n=m and xi=yi for i=1,,n.

Definition 4.

Any non-empty subset of the code B3 is a code and called trinucleotide code C.

Definition 5.

A trinucleotide code CB3 is self-complementary if, for each tC, C(t)C, i.e., C=C(C).

Definition 6.

A trinucleotide code XB3 is circular if, for each x1,,xn,y1,,ymX, n,m1, rB*, sB+, the conditions sx2xnr=y1ym and x1=rs imply n=m, r=ε (empty word), and xi=yi for i=1,,n.

The proofs to decide whether a code is circular or not are based on the flower automaton [2], the necklace 5LDCN (Letter Diletter Continued Necklace) [11], the necklace nLDCCN (Letter Diletter Continued Closed Necklace) with n{2,3,4,5} [12], and the graph theory [13].

Definition 7.

A trinucleotide circular code XB3 is C3 self-complementary if X, X1=P(X), and X2=P2(X) are trinucleotide circular codes such that X=C(X) (self-complementary), C(X1)=X2, and C(X2)=X1 (X1 and X2 are complementary).

The trinucleotide set X=X0 (Equation (1)) coding the reading frame (f=0) in genes is a maximal (20 trinucleotides) C3 self-complementary trinucleotide circular code [2] where the circular code X1=P(X) coding the frame f=1 contains the 20 following trinucleotides

X1= {AAG,ACA,ACG,ACT,AGC,AGG,ATA,ATG,CCA,CCG, GCG,GTG,TAG,TCA,TCC,TCG,TCT,TGC,TTA,TTG} (3)

and the circular code X2=P2(X) coding the frame f=2 contains the 20 following trinucleotides

X2= {AGA,AGT,CAA,CAC,CAT,CCT,CGA,CGC,CGG,CGT, CTA,CTT,GCA,GCT,GGA,TAA,TAT,TGA,TGG,TGT}. (4)

The trinucleotide circular codes X1 and X2 are related by the permutation map, i.e., X2=P(X1) and X1=P2(X2), and by the complementary map, i.e., X1=C(X2) and X2=C(X1) [14].

Several classes of methods were developed for identifying the circular code X in genes over the last 20 years: frequency methods [2,15,16], correlation function [17], covering capability function [18], and occurrence probability of a complementary/permutation (CP) trinucleotide set at the gene population level [1].

The class of the 216 C3 self-complementary trinucleotide circular codes (Definition 7; [2]; list given in Tables 4a, 5a, and 6a in [19]; [20]) is included in a larger class of codes C by relaxing the circularity property which was defined in [1]:

Definition 8.

A trinucleotide code CB3 is C3 self-complementary if C=C(C) (self-complementary), C(C1)=C2, and C(C2)=C1 (C1 and C2 are complementary) where C1=P(C) and C2=P2(C) .

The statistical approach developed analyses the C3 self-complementary codes (Definition 8) for searching the particular circular code X.

2.2. Gene Kingdoms

Gene kingdoms K of bacteria B, archaea A, plasmids , eukaryotes E, chromosomes of eukaryotes Echr, mitochondria M, chloroplasts , viruses V, and its five taxonomic double-stranded DNA viruses VdsDNA, double-stranded RNA viruses VdsRNA, single-stranded DNA viruses VssDNA, single-stranded RNA viruses VssRNA, and retro-transcribing viruses Vrt are obtained from the GenBank database (http://www.ncbi.nlm.nih.gov/genome/browse/, May 2016) (Table 1). Computer tests exclude genes when (i) their nucleotides do not belong to the alphabet B; (ii) they do not begin with a start trinucleotide {ATG,CTG,GTG,TTG}; (iii) they do not end with a stop trinucleotide {TAA,TAG,TGA}; and (iv) their lengths are not modulo 3. In order to have an order of magnitude of data acquisition (details in Table 1), the kingdom of bacteria B contains 15,735,053 genes and 5,222,267,667 trinucleotides (7,851,762 genes and 2,481,566,882 trinucleotides in [1]), i.e., a trinucleotide increase of about 110%, and the kingdom of eukaryotes E contains 4,356,391 genes and 2,406,844,838 trinucleotides (1,662,579 genes and 824,825,761 trinucleotides in [1]), i.e., a trinucleotide increase of about 192%. The gene kingdoms M, , VdsRNA, VssDNA, and Vrt have gene and trinucleotide data that are significantly lower (less than 1 million trinucleotides) than the other gene kingdoms (Table 1).

Table 1.

Kingdoms K of genes extracted from the GenBank database (http:// www.ncbi.nlm.nih.gov/genome/browse/, May 2016) with their symbol and their numbers of genomes, genes, and trinucleotides.

Kingdom K (Symbol) Nb of Genomes Nb of Genes Nb of Trinucleotides
Bacteria B 7039 15,735,053 5,222,267,667
Archaea A 182 282,802 81,460,549
Plasmids 2319 575,760 159,169,387
Eukaryotes E 190 4,356,391 2,406,844,838
Chromosomes of eukaryotes Echr 2979 4,356,391 2,406,844,838
Mitochondria M 228 3347 862,327
Chloroplasts 39 3192 925,303
Viruses V 5217 299,401 66,677,580
Double-stranded DNA viruses VdsDNAV 2480 259,696 59,239,700
Double-stranded RNA viruses VdsRNAV 211 1061 783,020
Single-stranded DNA viruses VssDNAV 715 3291 802,405
Single-stranded RNA viruses VssRNAV 1257 5093 4,406,365
Retro-transcribing viruses VrtV 137 560 289,447

2.3. Preferential Frame of a Trinucleotide in a Gene

The method developed in [1] for identifying the circular code X in genes determined the preferential frame of trinucleotides at the gene population level (kingdoms, taxonomic groups, genomes), i.e., after summing the trinucleotide frequencies of all genes in a kingdom. We extend this method at the gene level, i.e., the preferential frame of trinucleotides among the three frames is determined for each gene. There is no sum of trinucleotide frequencies of all genes in a kingdom. Thus, all the genes, i.e., of large and small lengths, have the same weight in respect to the preferential frame.

Consider a gene kingdom K listed in Table 1. Let Prf(t,g) be the occurrence frequency of a trinucleotide tB3 in a frame f{0,1,2} of a gene g belonging to a kingdom K. Thus, there are 3×64=192 trinucleotide occurrence frequencies Prf(t,g) in the three frames f of a gene g. Then, the preferential frame F(t,g){0,1,2} of a trinucleotide t in a gene g is the frame of maximal occurrence frequency Prf(t,g) among the three frames f of g

F(t,g)=argmaxf{0,1,2}Prf(t,g). (5)

The three frequencies of a given trinucleotide are computed in the three frames 0, 1, and 2 of a gene. Then, the preferential frame of the trinucleotide in this gene is the frame associated to its highest trinucleotide frequency.

Remark 1.

In [1], the three occurrence frequencies Prf(t,K) of a trinucleotide t in the three frames f computed in a gene kingdom K, always have different values, thus a unique preferential frame can be assigned to the trinucleotide. At the gene level, particularly for genes g of small lengths, a trinucleotide t may have an identical occurrence frequency Prf(t,g) in two or three frames f. In this case, two or three preferential frames F(t,g) are assigned to the trinucleotide t. If a trinucleotide t is absent in a gene g, mainly for genes g of very small lengths, then no preferential frame is attributed to t.

The indicator function δf(F(t,g)){0,1} is 1 if the preferential frame F(t,g) of a trinucleotide t is equal to the frame f of a gene g, and 0 otherwise

δf(F(t,g))={1 if F(t,g)=f0 otherwise (6)

where F(t,g) is defined in Equation (5).

2.4. Number of Preferential Frames of a Trinucleotide in a Gene Kingdom

The number Nbf(t,K) of preferential frames of a trinucleotide tB3 for each frame f{0,1,2} in a gene kingdom K is simply obtained by summing for all genes in K

Nbf(t,K)=gKδf(F(t,g)) (7)

where δf(F(t,g)) is defined in Equation (6).

2.5. Occurrence Probability of a Complementary/Permutation Trinucleotide Set in a Gene Kingdom

In order to study the C3 self-complementary codes C (Definition 8) including the class of circular codes, and in particular the circular code X, Equation (7) for a trinucleotide t is expanded to a set T of six trinucleotides involving the complementarity map C and the permutation map P simultaneously, precisely T={T0,T1,T2} with T0={t,C(t)} in frame 0, T1=P(T0)={P(t),P(C(t))} in frame 1, T2=P2(T0)={P2(t),P2(C(t))} in frame 2, and tB3\{AAA,CCC,GGG,TTT}. T is called a complementary and permutation (CP) trinucleotide set and is completely defined by the trinucleotide t.

Remark 2.

P(t)=C(P2(C(t))) and P2(t)=C(P(C(t))) (proof obvious).

When the trinucleotide t is given then the trinucleotide C(t) is also known. Thus, there are 60/2=30 CP trinucleotide sets noted T1,,T30 where Ti={Ti0,Ti1,Ti2} with Ti0={t,C(t)}i in frame 0, Ti1=P(T0)i={P(t),PC(t)}i in frame 1, and Ti2=P2(T0)i={P2(t),P2C(t)}i in frame 2. A maximal (20 trinucleotides) C3 self-complementary code C is identified with the first 10 values of the numbers Nb(T1,K),,Nb(T10,K) (defined below). Precisely, the code C has 20 trinucleotides C=C0={T10,,T100} in frame 0, 20 trinucleotides C1=P(C)={T11,,T101} in frame 1, and 20 trinucleotides C2=P2(C)={T12,,T102} in frame 2 with C=C(C) (self-complementary), C(C1)=C2, and C(C2)=C1 (C1 and C2 are complementary). There are (3010)=30,045,015 C3 self-complementary trinucleotide codes, and among them only 216 are circular [2,20].

Notation 3.

A CP trinucleotide set T={T0,T1,T2} belongs to the C3 self-complementary trinucleotide circular code X, i.e., TX, if T0X, i.e., if the trinucleotide t and its complementary trinucleotide C(t) belong to X. Ten CP trinucleotide sets T among 30 belong to the C3 circular code X, i.e., such that 10 sets T0X with T1=P(T0)P(X)=X1 and T2=P2(T0)P2(X)=X2.

Notation 4.

In order to facilitate the reading of Table 2, the 30 CP trinucleotide sets T={T0,T1,T2} are presented in the following way (i) the first 10 sets T1,,T10 belong to the circular code X (with T0={t,C(t)}X, T1X1 and T2X2) and are in lexicographical order with respect to the trinucleotide tX (in bold), and (ii) the 20 remaining sets T11,,T30 are in lexicographical order with respect to the trinucleotide tX1 (in italics).

Table 2.

Identification of the maximal C3 self-complementary trinucleotide circular code X in gene kingdoms K of bacteria B, archaea A, plasmids , eukaryotes E, chromosomes of eukaryotes Echr, mitochondria M, chloroplasts , viruses V, and its five taxonomic groups: double-stranded DNA viruses VdsDNA, double-stranded RNA viruses VdsRNA, single-stranded DNA viruses VssDNA, single-stranded RNA viruses VssRNA, and retro-transcribing viruses Vrt (Table 1). Occurrence probability Pb(T,K) (%) of the 30 complementary and permutation (CP) trinucleotide sets T={T0,T1,T2} with T0={t,C(t)} in frame 0, T1=P(T0)={P(t),P(C(t))} in frame 1, T2=P2(T0)={P2(t),P2(C(t))} in frame 2, in a gene kingdom K computed according to Equation (9) and its rank Rk(T,K), the 1st rank being associated to the highest value of Pb(T,K) and the 30th rank, to the lowest value of Pb(T,K). The 20 trinucleotides of the C3 self-complementary circular code X are in bold, the 20 trinucleotides of the circular code X1=P(X) are in italics, and the 20 trinucleotides of the circular code X2=P2(X) are both in bold and italics. The first 10 CP sets T1,,T10 belong to the circular code X (T0={t,C(t)}X with T1=P(T0)P(X)=X1 and T2=P2(T0)P2(X)=X2) and are in lexicographical order with respect to the trinucleotide tX in bold, and the 20 remaining CP sets T11,,T30 are in lexicographical order with respect to the trinucleotide tX1 in italics. The numbers in italics occurring with the CP sets T1,,T10 are associated with the two trinucleotides T0={t,C(t)} of X which do not occur preferentially in the gene kingdom.

B A E Echr M V VdsDNA VdsRNA VssDNA VssRNA Vrt
T t C(t) Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk Pb Rk
T1 AAC GTT 6.1 6 7.4 4 6.0 4 8.5 3 8.4 3 4.4 9 4.9 9 6.6 3 6.8 4 6.4 3 5.9 1 7.0 3 5.0 5
T2 AAT ATT 7.4 2 5.3 8 7.3 2 8.7 2 8.7 2 3.4 14 5.8 1 6.6 2 7.5 2 5.9 4 5.6 2 6.3 4 5.3 4
T3 ACC GGT 5.1 9 3.9 12 5.1 8 5.1 8 4.7 10 5.7 4 5.2 3 4.9 8 4.9 9 4.6 8 4.6 5 5.3 7 3.7 9
T4 ATC GAT 6.5 4 6.7 5 6.2 3 8.1 4 8.2 4 3.6 13 4.6 14 6.5 4 6.7 5 6.4 2 5.4 4 7.1 2 6.0 2
T5 CAG CTG 4.6 10 3.7 13 3.9 10 4.2 10 4.9 9 0.7 30 0.0 30 2.9 15 3.8 10 2.4 18 2.8 19 1.5 24 2.9 18
T6 CTC GAG 6.2 5 7.5 3 5.9 6 7.0 5 7.5 5 2.6 18 0.4 27 5.6 6 6.3 6 5.3 7 4.1 10 5.4 6 4.9 6
T7 GAA TTC 5.8 7 5.3 7 5.5 7 5.0 9 5.2 7 6.3 2 4.9 7 5.2 7 5.6 7 5.3 6 4.6 6 4.9 8 5.4 3
T8 GAC GTC 8.2 1 9.7 1 7.8 1 9.0 1 9.1 1 5.8 3 0.6 26 7.2 1 8.0 1 7.3 1 5.4 3 7.3 1 6.4 1
T9 GCC GGC 6.7 3 8.2 2 6.0 5 5.7 6 5.2 8 7.1 1 5.3 2 5.9 5 7.0 3 5.5 5 4.3 8 5.7 5 4.7 7
T10 GTA TAC 5.4 8 6.6 6 5.0 9 5.4 7 5.7 6 4.6 8 4.8 11 4.7 9 5.3 8 4.6 9 4.0 11 4.5 10 4.3 8
T11 AAG CTT 3.0 13 4.1 11 2.9 16 3.4 14 3.5 13 1.4 27 3.2 18 3.3 12 3.0 14 3.2 13 3.6 12 3.6 13 2.9 16
T12 ACA TGT 1.1 26 1.4 20 1.1 26 0.3 30 0.2 30 4.3 10 2.9 19 1.2 27 0.9 26 1.3 28 1.4 28 1.5 26 1.5 29
T13 ACG CGT 1.6 22 0.3 28 1.8 21 0.6 26 0.6 25 1.9 24 5.0 4 1.8 22 1.4 23 1.6 25 3.2 16 1.6 22 2.1 24
T14 ACT AGT 2.9 15 1.3 22 3.3 13 3.0 15 2.5 15 2.1 22 4.9 8 3.6 11 3.1 13 3.7 11 4.3 7 4.1 11 3.4 14
T15 AGC GCT 2.9 14 3.4 14 3.1 15 3.4 13 3.1 14 3.9 12 4.9 6 4.0 10 3.6 11 4.2 10 4.3 9 4.6 9 3.5 13
T16 AGG CCT 2.3 19 1.5 19 2.5 19 2.4 16 2.0 17 2.2 21 4.8 13 2.7 17 2.5 17 2.6 17 3.5 13 2.7 17 2.4 22
T17 ATA TAT 1.6 21 4.1 10 1.6 24 0.8 23 0.9 23 4.2 11 2.7 20 2.3 19 1.7 20 2.9 16 2.8 21 2.7 16 2.9 17
T18 ATG CAT 3.1 12 2.5 16 3.2 14 1.3 20 1.4 19 1.8 26 4.8 10 2.5 18 2.6 15 2.3 19 3.1 17 1.8 20 2.7 20
T19 CCA TGG 1.6 23 1.3 21 1.8 20 1.1 22 0.8 24 2.8 16 4.5 15 2.0 21 1.6 22 2.3 20 2.6 22 1.8 19 2.8 19
T20 CCG CGG 0.6 28 0.1 29 0.7 28 0.8 24 1.0 22 0.8 29 0.6 25 1.5 26 0.8 28 1.4 27 2.8 20 1.5 25 2.3 23
T21 GCG CGC 2.7 17 1.7 18 3.4 11 3.5 12 3.9 12 2.0 23 4.2 17 2.9 16 2.4 18 3.2 15 3.4 14 3.0 14 3.6 12
T22 GTG CAC 3.3 11 4.7 9 3.4 12 3.8 11 4.5 11 1.8 25 0.3 28 3.3 13 3.5 12 3.2 14 3.2 15 2.9 15 3.7 10
T23 TAG CTA 1.7 20 2.1 17 1.6 22 1.6 19 1.8 18 3.0 15 0.3 29 1.7 24 1.6 21 2.1 22 1.8 26 1.6 23 2.0 25
T24 TCA TGA 0.4 29 0.8 25 0.6 29 0.6 25 0.4 28 4.8 7 0.7 24 0.7 30 0.6 30 1.0 29 0.9 30 0.8 30 0.9 30
T25 TCC GGA 1.5 24 1.0 24 1.6 23 0.6 27 0.6 26 5.0 6 4.8 12 1.7 23 1.1 25 1.9 23 2.4 23 2.0 18 2.7 21
T26 TCG CGA 0.2 30 0.1 30 0.4 30 0.4 29 0.3 29 2.6 17 4.3 16 1.0 29 0.7 29 0.8 30 1.7 27 1.0 28 1.7 28
T27 TCT AGA 1.2 25 0.6 27 1.5 25 1.6 18 1.3 21 2.4 19 2.1 22 1.6 25 1.4 24 1.8 24 2.0 25 1.7 21 1.7 27
T28 TGC GCA 2.5 18 3.0 15 2.8 18 2.4 17 2.0 16 5.2 5 4.9 5 3.0 14 2.6 16 3.4 12 2.9 18 3.7 12 3.2 15
T29 TTA TAA 0.9 27 0.6 26 1.0 27 0.5 28 0.4 27 2.3 20 1.6 23 1.1 28 0.9 27 1.5 26 1.4 29 1.0 29 1.9 26
T30 TTG CAA 2.8 16 1.2 23 2.9 17 1.2 21 1.4 20 1.2 28 2.3 21 2.1 20 2.2 19 2.2 21 2.3 24 1.4 27 3.6 11

The occurrence number Nb(T,K) of a CP trinucleotide set T={T0,T1,T2} in a gene kingdom K is equal to

Nb(T,K)=Nb0(t,K)+Nb0(C(t),K)+Nb1(P(t),K)+Nb1(P(C(t)),K)+Nb2(P2(t),K)+Nb2(P2(C(t)),K) (8)

where Nbf(t,K) is defined in Equation (7).

In order to normalize the numbers Nb(T,K) which depend on the numbers of genes in a kingdom K, we simply define the occurrence probability Pb(T,K) of a CP trinucleotide set T={T0,T1,T2} in a gene kingdom K as follows

Pb(T,K)=Nb(T,K)i=130Nb(Ti,K) (9)

where Nb(T,K) is defined in Equation (8).

The parameter Rk(T,K){1,,30} gives the rank of the values Pb(T,K) among the 30 CP trinucleotide sets T, the 1st rank being associated to the highest value of Pb(T,K) and the 30th rank, to the lowest value of Pb(T,K).

2.6. A Statistical Test to Evaluate the Significance of the Obtained Ranks

In order to evaluate the statistical significance of the ranks Rk(T,K) of the probabilities Pb(T,K) (Equation (9)) of the 30 CP trinucleotide sets T in a given kingdom K, we derive confidence intervals for Pb(T,K). If the confidence interval for two probabilities Pb(T,K) do not overlap, then their associated ranks Rk(T,K) are assumed to be valid (in the population). The confidence interval for two probabilities Pb(T,K) is evaluated by using the classical 2-sample z-test which is briefly recalled here.

Let P(T) and P(T) be the populations associated to the CP trinucleotide sets T and T of probabilities Pb(T,P) and Pb(T,P), respectively. The probabilities Pb(T,K) and Pb(T,K) of T and T are observed in a given gene kingdom K (sample) of size n=i=130Nb(Ti,K) (defined from Equation (8)). The tests carried out in Section 3 are applied on large samples (the size of the smallest sample analysed being n=10921 with the archaea A). Thus, the assumptions of normality for the variables and of the homogeneity for the variances in the two populations are not needed. The equality H0:Pb(T,P)Pb(T,P) is tested against the alternative H1:Pb(T,P)<Pb(T,P) if they are not equal. Under H0 and with large samples (n>30), min{nPb(T,P),n(1Pb(T,P)),nPb(T,P),n(1Pb(T,P))}>5 (always verified in the tests carried out in Section 3), and T and T are independent events (realistic hypothesis with kingdoms K of large sizes), then

Z=Pb(T,P)Pb(T,P)(nPb(T,P)+nPb(T,P)n+n)(1nPb(T,P)+nPb(T,P)n+n)(1n+1n)=2(Pb(T,P)Pb(T,P))(Pb(T,P)+Pb(T,P)2)(Pb(T,P)+Pb(T,P))n~N(0,1).

The z-value and the p-value are given for each statistical test carried out in Section 3.

2.7. Explained Example of the Statistical Approach Developed

As an example, we explain the definition of the occurrence probability Pb(T,K) (Equation (9)) which takes the value of 6.1% (see Table 2) with the CP trinucleotide set T1={T10,T11,T12} with T10={t,C(t)}={AAC,GTT} in frame 0, T11=P(T0)1={P(t),P(C(t))}={ACA,TTG} in frame 1 and T12=P2(T0)1={P2(t),P2(C(t))}={CAA,TGT} in frame 2 in the gene kingdom of bacteria K=B (Table 1).

The 3×64=192 occurrence frequencies Prf(t,g) of the 64 trinucleotides t are computed in the three frames f of each gene g belonging to B. Then, the preferential frame F(t,g) of each trinucleotide t for each gene g in B is determined according to Equation (5). For example, with the trinucleotide t=AAC in a gene g1 of B, if the frequency Pr0(AAC,g1) of AAC in frame f=0 (reading frame) is greater than the two frequencies Pr1(AAC,g1) and Pr2(AAC,g1) of AAC in frames f=1 and f=2, i.e., Pr0(AAC,g1)>Max{Pr1(AAC,g1),Pr2(AAC,g1)}, then the preferential frame of AAC in g1 is 0, i.e., F(AAC,g1)=0.

The indicator function δf(F(t,g)) of each trinucleotide t for each gene g in B is obtained from Equation (6). With the previous example of AAC in the gene g1 of B, the indicator function is equal to δ0(F(AAC,g1))=1 for the frame f=0 and δ1(F(AAC,g1))=δ2(F(AAC,g1))=0 for the frames f=1 and f=2.

The number Nbf(t,B) of preferential frames of each trinucleotide t for each frame f in B is computed according to Equation (7). With the previous example of AAC in B, the following numbers are obtained: Nb0(AAC,B)=3486 for the frame f=0, Nb1(AAC,B)=1742 for the frame f=1, and Nb2(AAC,B)=1819 for the frame f=2. Thus, the preferential frame of AAC in B is 0.

The occurrence number Nb(T,B) of the 30 CP trinucleotide sets Ti={Ti0,Ti1,Ti2} in B is determined according to Equation (8). With T1 in B, the following numbers are obtained: Nb0(GTT,B)=3765 for the frame f=0, Nb1(ACA,B)=4002 and Nb1(TTG,B)=5650 for the frame f=1, and Nb2(CAA,B)=3999 and Nb2(TGT,B)=4677 for the frame f=2. Then, the occurrence number of T1 in B is equal to Nb(T1,B)=Nb0(AAC,B)+Nb0(GTT,B)+Nb1(ACA,B)+Nb1(TTG,B)+Nb2(CAA,B)+Nb2(TGT,B)=3486+3765+4002+5650+3999+4677=25579.

Finally, the occurrence probability Pb(T,B) of the 30 CP trinucleotide sets Ti={Ti0,Ti1,Ti2} in B is deduced from Equation (9). With T1 in B, the occurrence probability of T1 in B is equal to Pb(T1,B)=Nb(T1,B)i=130Nb(Ti,B)=Nb(T1,B)Nb(T1,B)+...+Nb(T30,B)=2557925579+...+11856=255794225986.1%.

3. Results

3.1. Maximal C3 Self-Complementary Circular Code X in Genes

This new statistical approach will show that the same set X of 20 trinucleotides among (3010)=30,045,015 sets occurs preferentially in genes (reading frame) of bacteria B, archaea A, plasmids , eukaryotes E, and viruses V. This set X is the maximal C3 self-complementary circular code defined in Equation (1).

3.1.1. Circular Code X in Genes of Bacteria

In the genes of bacteria B, the 10 CP trinucleotide sets T1,,T10X have occurrence probabilities Pb(T,B) (Equation (9)) with the 10 highest ranks Rk(T,B) among 30 (Table 2), i.e., {t,C(t)}X, {P(t),P(C(t))}X1 and {P2(t),P2(C(t))}X2 leading to the 20 trinucleotides of X in frame 0, 20 trinucleotides of X1 in frame 1, and 20 trinucleotides of X2 in frame 2. The highest rank with Pb(T8,B)=8.2% is related to the complementary pair {t,C(t)}={GAC,GTC}X. The 10th rank with Pb(T5,B)=4.55% is very significantly greater than the 11th rank with Pb(T22,B)=3.31% (n=i=130Nb(Ti,B)=422598, z-value=29.33, p-value=10189). The 20 trinucleotides of the circular code X are identified in the genes of bacteria:

XB= X. (10)

The same result is obtained at the gene level and the gene population level [1].

3.1.2. Circular Code X in Genes of Archaea

In the genes of archaea A, the eight CP trinucleotide sets T1,T2,T4,T6,,T10X (except T3 and T5) have occurrence probabilities Pb(T,A) with the eight highest ranks Rk(T,A) among 30 (Table 2). The highest rank with Pb(T8,A)=9.7% is also related to the complementary pair {GAC,GTC}X. The CP set T22X with Rk(T22,A)=9 explains that the two complementary trinucleotides {t,C(t)}={ACC,GGT}X (T3) do not occur preferentially in A. As the CP set T5X has a rank Rk(T5,A)=13 with Pb(T5,A)=3.66% greater than Rk(T15,A)=14 with Pb(T15,A)=3.39% and Rk(T28,A)=15 with Pb(T28,A)=2.95%, the two complementary trinucleotides {t,C(t)}={CAG,CTG}X occur preferentially in A compared to {AGC,GCT} (T15) and {GCA,TGC} (T28), however the statistical significance between the ranks Rk(T5,A) and Rk(T15,A) is not confirmed due to the lack of archaeal gene data (see Section 2.2) (n=i=130Nb(Ti,A)=10921, z-value=1.08, p-value=0.14). Thus, a subset of X of 18 trinucleotides (a non-maximal C3 self-complementary circular code) is identified in the genes of archaea:

XA= XYA with YA={ACC,GGT}. (11)

Note that the code XA{CAC,GTG} (T22) is the variant X code observed in Deinococcus [1]. The circular code X retrieved in the genes of archaea is a new result which was not found in a study of variant X codes in archaeal genomes [15].

3.1.3. Circular Code X in Genes of Plasmids

In the genes of plasmids , the 10 CP trinucleotide sets T1,,T10X have occurrence probabilities Pb(T,) with the 10 highest ranks Rk(T,) among 30 (Table 2). The highest rank with Pb(T8,)=7.8% is again related to the complementary pair {GAC,GTC}X. The 10th rank with Pb(T5,)=3.93% is very significantly greater than the 11th rank with Pb(T21,)=3.43% (n=i=130Nb(Ti,)=144366, z-value=7.14, p-value=1013). The 20 trinucleotides of the circular code X are identified in the genes of plasmids:

X= X. (12)

The same result is obtained at the gene level and the gene population level [1].

3.1.4. Circular Code X in Genes of Eukaryotes

In the genes of eukaryotes E, the 10 CP trinucleotide sets T1,,T10X have occurrence probabilities Pb(T,E) with the 10 highest ranks Rk(T,E) among 30 (Table 2). The highest rank with Pb(T8,E)=9.0% is again related to the complementary pair {GAC,GTC}X. The 10th rank with Pb(T5,E)=4.23% is significantly greater than the 11th rank with Pb(T22,E)=3.82% (n=i=130Nb(Ti,E)=11401, z-value=1.57, p-value=0.06). The 20 trinucleotides of the circular code X are identified in the genes of eukaryotes:

XE= X. (13)

The same result is obtained at the gene level and the gene population level [1].

The subset XEHomo sapiens= X\{ACC,GCC,GGC,GGT} of X of 16 trinucleotides in the genes of Homo sapiens identified at the gene level is also identical to the subset found at the gene population level [1].

3.1.5. Circular Code X in Genes of Eukaryotic Chromosomes

The statistical analysis in Section 3.1.4 takes the eukaryotic genome as the genetic information unit. Indeed, Equation (7) with gE is achieved with Card(E)=190 eukaryotic genomes (see Table 1). We complete this classical approach by choosing the eukaryotic chromosome as the genetic information unit. Thus, Equation (7) with gEchr is performed with Card(Echr)=2979 eukaryotic chromosomes of Card(E)=190 genomes (see Table 1).

In the genes of eukaryotic chromosomes Echr, the 10 CP trinucleotide sets T1,,T10X have occurrence probabilities Pb(T,Echr) with the 10 highest ranks Rk(T,Echr) among 30 (Table 2). The highest rank with Pb(T8,Echr)=9.1% is again related to the complementary pair {GAC,GTC}X. The 10th rank with Pb(T3,Echr)=4.74% is very significantly greater than the 11th rank with Pb(T22,Echr)=4.47% (n=i=130Nb(Ti,Echr)=179136, z-value=3.86, p-value=105). The 20 trinucleotides of the circular code X are identified in the genes of eukaryotic chromosomes:

XEchr= X. (14)

It is a new result which completes the statistical analysis of genes in eukaryotic genomes (Section 3.1.4).

3.1.6. Non-Maximal Circular Code X in Genes of Eukaryotic Organelles

The genes of eukaryotic organelles, i.e., mitochondria and chloroplasts, are investigated with this statistical approach. It should also be stressed that the available data have an order of magnitude very significantly lower than the other gene kingdoms studied (less than 1 million trinucleotides for each class of organelles, see Table 1). However, we can already observe some statistical trends with the trinucleotides in the preferential frame.

Non-Maximal Circular Code X in Genes of Mitochondria

Surprisingly, in the genes of mitochondria M, the four CP trinucleotide sets T9,T7,T8,T3X have occurrence probabilities Pb(T,M) with the four highest ranks Rk(T,M) among 30 (Table 2). The CP set T28X with Rk(T28,M)=5 explains that the two complementary trinucleotides {CAG,CTG}X (T5) do not occur preferentially in M. The CP set T25X with Rk(T25,M)=6 determines that the two complementary trinucleotides {CTC,GAG}X (T6) do not occur preferentially in M. The CP set T24X with Rk(T24,M)=7 implies that the two complementary trinucleotides {ATC,GAT}X (T4) do not occur preferentially in M. The CP set T17X with Rk(T17,M)=11 explains that the two complementary trinucleotides {AAT,ATT}X (T2) do not occur preferentially in M. Thus, a subset of X of 12 trinucleotides (a non-maximal C3 self-complementary circular code) is identified in the genes of mitochondria M:

XM= X\YM with YM={AAT,ATC,ATT,CAG,CTC,CTG,GAG,GAT}. (15)

This subset XM= {AAC,ACC,GAA,GAC,GCC,GGC,GGT,GTA,GTC,GTT,TAC,TTC} is very close to the subset XM˜= {ACC,ATC,CTC,GAA,GAC,GAT,GCC,GGC, GGT,GTA,GTC,GTT,TTC} of X of 13 trinucleotides previously identified by inspection in mitochondrial genes [21], as XMXM˜={ACC,GAA,GAC,GCC,GGC,GGT,GTA,GTC,GTT,TTC} has 10 trinucleotides in common.

Non-Maximal Circular Code X in Genes of Chloroplasts

In the genes of chloroplasts , the highest occurrences of CP trinucleotide sets again belong to the circular code X. The three CP trinucleotide sets T2,T9,T3X have occurrence probabilities Pb(T,) with the three highest ranks Rk(T,) among 30 (Table 2). The CP set T13X with Rk(T13,)=4 explains that the two complementary trinucleotides {GAC,GTC}X (T8) do not occur preferentially in . The CP set T28X with Rk(T28,)=5 states that the two complementary trinucleotides {CAG,CTG}X (T5) do not occur preferentially in . The CP set T14X with Rk(T14,)=8 implies that the two complementary trinucleotides {GTA,TAC}X (T10) do not occur preferentially in . The CP set T18X with Rk(T18,)=10 explains that the two complementary trinucleotides {ATC,GAT}X (T4) do not occur preferentially in . The CP set T25X with Rk(T25,)=12 implies that the two complementary trinucleotides {CTC,GAG}X (T6) do not occur preferentially in . Thus, a subset of X of 10 trinucleotides (a non-maximal C3 self-complementary circular code) is identified in the genes of chloroplasts :

X= X\Y with Y={ATC,CAG,CTC,CTG,GAC,GAG,GAT,GTA,GTC,TAC}. (16)

3.1.7. Circular Code X in Genes of Viruses

In the genes of viruses V, the nine CP trinucleotide sets T1,,T4,T6,,T10X (except T5) have occurrence probabilities Pb(T,V) with the nine highest ranks Rk(T,V) among 30 (Table 2). The highest rank with Pb(T8,V)=7.2% is again related to the complementary pair {GAC,GTC}X. The CP set T15X with Rk(T15,V)=10 explains that the two complementary trinucleotides {CAG,CTG}X (T5) do not occur preferentially in V. Thus, a subset of X of 18 trinucleotides (a non-maximal C3 self-complementary circular code) is identified in the genes of viruses:

XV= X\YV with YV={CAG,CTG}. (17)

The statistical method of viral genes at the gene population level [1] could not decide between the two codes X18=X\{CAG,CTG} and X16=X\{CAG,CTG,GTA,TAC}. The statistical analysis at the gene level confirms the code XV= X18 of 18 trinucleotides in the genes of viruses.

3.2. Circular Code X Found in DNA and RNA Genomes and in Double-Stranded and Single-Stranded Genomes

The self-complementary property of the circular code X has been related since 1996 to the complementary property of the DNA double helix. In order to deepen this idea, we searched with this statistical approach the circular code X in five important sub-classes of viral genes using either DNA genome or RNA genome, and either double-stranded genome or single-stranded genome, i.e., in the genes of double-stranded DNA viruses VdsDNA, double-stranded RNA viruses VdsRNA, single-stranded DNA viruses VssDNA, single-stranded RNA viruses VssRNA, and retro-transcribing viruses Vrt.

In the genes of double-stranded DNA viruses VdsDNA, the 10 CP trinucleotide sets T1,,T10X have occurrence probabilities Pb(T,VdsDNA) with the 10 highest ranks Rk(T,VdsDNA) among 30 (Table 2). Thus, the circular code X is found in VdsDNA:

XVdsDNA= X. (18)

In the genes of double-stranded RNA viruses VdsRNA, single-stranded RNA viruses VssRNA, and retro-transcribing viruses Vrt, respectively, the nine CP trinucleotide sets T1,,T4,T6,,T10X (except T5) have occurrence probabilities Pb(T,VdsRNA), Pb(T,VssRNA), and Pb(T,Vrt), respectively, with the nine highest ranks Rk(T,VdsRNA), Rk(T,VssRNA), and Rk(T,Vrt), respectively, among 30 (Table 2). Note that the ranks Rk(T,VdsRNA), Rk(T,VssRNA), and Rk(T,Vrt) for a given CP trinucleotide set are not identical (Table 2). Thus, by using the reasoning mentioned previously (T15X with Rk(T15,V)>Rk(T5,V) for V in VdsRNA, VssRNA, and Vrt), a subset of X of 18 trinucleotides is observed in VdsRNA, VssRNA, and Vrt:

XVdsRNA= X\YVdsRNA with YVdsRNA={CAG,CTG}, (19)
XVssRNA=X\YVssRNA with YVssRNA={CAG,CTG}, (20)
XVrt=X\YVrt with YVrt={CAG,CTG}. (21)

In the genes of single-stranded DNA viruses VssDNA, the eight CP trinucleotide sets T1,,T4,T6,,T9X (except T5 and T10) have occurrence probabilities Pb(T,VssDNA) with the eight highest ranks Rk(T,VssDNA) among 30 (Table 2). Thus, by using the reasoning as previously mentioned (T15X with Rk(T15,VssDNA)>Rk(T5,VssDNA) and T14X with Rk(T14,VssDNA)>Rk(T10,VssDNA)), a subset of X of 16 trinucleotides is observed in VssDNA:

XVssDNA=X\YVssDNA with YVssDNA={CAG,CTG,GTA,TAC}. (22)

All these results show that the circular code X is found almost perfectly in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes. The very few exceptions, either the two trinucleotides {CAG,CTG} or the four trinucleotides {CAG,CTG,GTA,TAC} for one case, are related to the CP set or the two CP sets having the lowest occurrence among the 10 CP sets T1,,T10X.

4. Conclusions

The “universal” occurrence in genes of a same set X of 20 trinucleotides, which has in addition the mathematical property to be a circular code, must be confirmed by several statistical approaches and various gene data analyses at different levels: kingdom, taxonomic group, genome, and gene. All the previous approaches have studied and identified the circular code X at the gene population level (kingdom, taxonomic group, and genome) [1,2,15,16,17,21]. The statistical approach at the gene level developed here, for the first time since 1996, analyses the preferential occurrence of trinucleotides among the three frames of each gene. This new methodology allows all genes, i.e., of large and small lengths, to be considered with the same weight. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated to each gene. Thus, X motifs from the circular code X at different locations in a gene may assist the ribosome to maintain and synchronize the reading frame. The number, the cardinality, and the length of X motifs in genes may be associated to the length, the function, and the ancestry of genes. This research work is currently under investigation.

At the gene level, the circular code X is strengthened in the genes of bacteria, eukaryotes, plasmids, and viruses, and is now also identified in the genes of archaea. In addition to eukaryotic genomes, it is also found in the genes of eukaryotic chromosomes. The genes of mitochondria and chloroplasts contain a subset of the circular code X. It should be stressed that some mitochondrial and chloroplast genes lack the stop codon and are excluded from this data acquisition. Such a statistical bias may prevent a proper detection of preferential frames for some trinucleotides in the genes of eukaryotic organelles. The circular code X is searched in the large class of (3010)=30,045,015 C3 self-complementary trinucleotide codes which contains in particular the 216 maximal C3 self-complementary circular codes. Thus, for a basic order of magnitude, the probability to retrieve the same circular code X in four independent gene kingdoms (bacteria B, plasmids , eukaryotes E, double-stranded DNA viruses VdsDNA) is equal to 1/(3010)41030.

In the genes of the bacterial, eukaryotic, and plasmid kingdoms, 14 among the 47 studied gene taxonomic groups (about 30%) have variant X codes [1], i.e., trinucleotide codes which differ from X. Seven variant X codes are identified. However, all have at least 16 trinucleotides of X. Two variant X codes XA (according to the notation in [1]) in cyanobacteria and plasmids of cyanobacteria, and XD in birds, are self-complementary, without permuted trinucleotides, but are non-circular. Five variant X codes XB in Deinococcus, plasmids of chloroflexi and Deinococcus, mammals, and kinetoplasts, XC in elusimicrobia and apicomplexans, XE in fishes, XF in insects, and XG in basidiomycetes and plasmids of spirochaetes, are C3 self-complementary circular. Thus, two variant X codes XA and XD are not circular and do not belong to the set of the 216 maximal C3 self-complementary circular codes [2] having the strong mathematical structure of the dihedral group [20]. The reason could be related to the gene data or to a biological property which remains to be identified. All these variant X codes in the genes are identified at the taxonomic group level. However, as the circular code X is now also identified at the gene level, variant X codes may also be associated with genes belonging to the same genome but with different protein coding functions. This interesting and open problem should be investigated in the future.

A probability measure of the reading frame retrieval (RFR) of each trinucleotide of X has been introduced in [22] and [23] (Section 2.2 and 1st row of Table 1). The RFR probability PrRFR of the circular code X, i.e., the average RFR probability of the 20 trinucleotides of X, is equal to PrRFR(X)=82.5% (Result 5 in [22]; 1st row of Table 1 in [23]). This RFR measure can be applied to the non-maximal C3 self-complementary circular codes, precisely to the excluded trinucleotides YA={ACC,GGT} of archaea (Equation (11)), YM={AAT,ATC,ATT,CAG,CTC,CTG,GAG,GAT} of mitochondria (Equation (15)), Y={ATC,CAG,CTC,CTG,GAC,GAG,GAT,GTA,GTC,TAC} of chloroplasts (Equation (16)), YV={CAG,CTG} of viruses (Equation (17)), and YVssDNA={CAG,CTG,GTA,TAC} of single-stranded DNA viruses (Equation (22)). The computation leads to PrRFR(YA)=69.0%, PrRFR(YM)=88.5%, PrRFR(Y)=87.1%, PrRFR(YV)=100.0%, and PrRFR(YVssDNA)=85.7%. Archaeal genes miss two trinucleotides of X which have the lowest RFR values. In contrast, mitochondrial, chloroplast, and viral genes miss trinucleotides of X with high RFR values. Thus, the genes in reduced genomes are more flexible in translation, allowing overlap coding by frameshifting in agreement with [24] (and the cited references). However, it should be stressed that this result may vary with the increase of gene data of eukaryotic organelles in the future. The circular code X (20 trinucleotides) with the functions of reading frame retrieval and maintenance in regular RNA transcription, may also have, through its bijective transformation codes, the same functions in nucleotide exchanging RNA transcription in mitochondrial genes [23]. Indeed, as the mitochondrial gamma polymerase has bacterial origins (e.g., [25]), mitochondrial polymerization and its associated bijective transformations might use the circular code X. However at the translational level, the ribosome might follow the non-maximal C3 self-complementary circular code XM observed in mitochondrial genes (Equation (15)). A similar explanation could be applied to the chloroplast genes which have also bacterial origins (cyanobacteria).

By a study of viral genes, the circular code X is found in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes. Thus, the reading frame retrieval property of X could operate for translating DNA and RNA genes, in particular for the “primitive” RNA genes. The C3 property of X could be involved for translating the two shifted frames in DNA and RNA genes, in particular for optimizing the genomes of small sizes. The complementarity property of X is naturally associated to the double-stranded DNA and RNA genomes. It could also be used to pair single-stranded DNA genomes between them and single-stranded RNA genomes between them. Thus, the C3 and complementary properties of X could be involved for translating the three frames (reading frame and its two shifted frames) in one strand and the three frames in the complementary strand of DNA and RNA genes.

In summary, this new statistical approach at the gene level which is applied to massive gene data identifies the maximal C3 self-complementary trinucleotide circular code X in the genes of bacteria, archaea, eukaryotes, plasmids, and viruses, which may be involved in translation coding [3].

Acknowledgments

I thank the three reviewers for their advice, and Denise Besch, Svetlana Gorchkova, Elisabeth Michel, Professor Jacques Streith, and Jean-Marc Vassards for their support.

Conflicts of Interest

The author declares no conflict of interest.

References

  • 1.Michel C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, eukaryotes, plasmids and viruses. J. Theor. Biol. 2015;380:156–177. doi: 10.1016/j.jtbi.2015.04.009. [DOI] [PubMed] [Google Scholar]
  • 2.Arquès D.G., Michel C.J. A complementary circular code in the protein coding genes. J. Theor. Biol. 1996;182:45–58. doi: 10.1006/jtbi.1996.0142. [DOI] [PubMed] [Google Scholar]
  • 3.Michel C.J. Circular code motifs in transfer and 16S ribosomal RNAs: A possible translation code in genes. Comput. Biol. Chem. 2012;37:24–37. doi: 10.1016/j.compbiolchem.2011.10.002. [DOI] [PubMed] [Google Scholar]
  • 4.El Soufi K., Michel C.J. Circular code motifs in genomes of eukaryotes. J. Theor. Biol. 2016;408:198–212. doi: 10.1016/j.jtbi.2016.07.022. [DOI] [PubMed] [Google Scholar]
  • 5.Michel C.J. Circular code motifs in transfer RNAs. Comput. Biol. Chem. 2013;45:17–29. doi: 10.1016/j.compbiolchem.2013.02.004. [DOI] [PubMed] [Google Scholar]
  • 6.El Soufi K., Michel C.J. Circular code motifs in the ribosome decoding center. Comput. Biol. Chem. 2014;52:9–17. doi: 10.1016/j.compbiolchem.2014.08.001. [DOI] [PubMed] [Google Scholar]
  • 7.El Soufi K., Michel C.J. Circular code motifs near the ribosome decoding center. Comput. Biol. Chem. 2015;59:158–176. doi: 10.1016/j.compbiolchem.2015.07.015. [DOI] [PubMed] [Google Scholar]
  • 8.El Soufi K., Michel C.J. Unitary circular code motifs in genomes of eukaryotes. Biosystems. 2017 doi: 10.1016/j.biosystems.2017.02.001. in press. [DOI] [PubMed] [Google Scholar]
  • 9.Canapa A., Cerioni P.N., Barucca M., Olmo E., Caputo V. A centromeric satellite DNA may be involved in heterochromatin compactness in gobiid fishes. Chromosome Res. 2002;10:297–304. doi: 10.1023/A:1016519708187. [DOI] [PubMed] [Google Scholar]
  • 10.Gemayel R., Vinces M.D., Legendre M., Verstrepen K.J. Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu. Rev. Genet. 2010;44:445–477. doi: 10.1146/annurev-genet-072610-155046. [DOI] [PubMed] [Google Scholar]
  • 11.Pirillo G. A characterization for a set of trinucleotides to be a circular code. In: Pellegrini C., Cerrai P., Freguglia P., Benci V., Israel G., editors. Determinism, Holism, and Complexity. Kluwer Academic Publisher; New York, NY, USA: 2003. [Google Scholar]
  • 12.Michel C.J., Pirillo G. Identification of all trinucleotide circular codes. J. Theor. Biol. 2010;34:122–125. doi: 10.1016/j.compbiolchem.2010.03.004. [DOI] [PubMed] [Google Scholar]
  • 13.Fimmel E., Michel C.J., Strüngmann L. n-Nucleotide circular codes in graph theory. Philos. Trans. R. Soc. A. 2016;374:20150058. doi: 10.1098/rsta.2015.0058. [DOI] [PubMed] [Google Scholar]
  • 14.Bussoli L., Michel C.J., Pirillo G. On conjugation partitions of sets of trinucleotides. Appl. Math. 2012;3:107–112. doi: 10.4236/am.2012.31017. [DOI] [Google Scholar]
  • 15.Frey G., Michel C.J. Circular codes in archaeal genomes. J. Theor. Biol. 2003;223:413–431. doi: 10.1016/S0022-5193(03)00119-X. [DOI] [PubMed] [Google Scholar]
  • 16.Frey G., Michel C.J. Identification of circular codes in bacterial genomes and their use in a factorization method for retrieving the reading frames of genes. Comput. Biol. Chem. 2006;30:87–101. doi: 10.1016/j.compbiolchem.2005.11.001. [DOI] [PubMed] [Google Scholar]
  • 17.Arquès D.G., Michel C.J. A code in the protein coding genes. Biosystems. 1997;44:107–134. doi: 10.1016/S0303-2647(97)00049-X. [DOI] [PubMed] [Google Scholar]
  • 18.Gonzalez D.L., Giannerini S., Rosa R. Circular codes revisited: a statistical approach. J. Theor. Biol. 2011;275:21–28. doi: 10.1016/j.jtbi.2011.01.028. [DOI] [PubMed] [Google Scholar]
  • 19.Michel C.J., Pirillo G., Pirillo M.A. A relation between trinucleotide comma-free codes and trinucleotide circular codes. Theor. Comput. Sci. 2008;401:17–26. doi: 10.1016/j.tcs.2008.02.049. [DOI] [Google Scholar]
  • 20.Fimmel E., Giannerini S., Gonzalez D.L., Strüngmann L. Circular codes, symmetries and transformations. J. Math. Biol. 2015;70:1623–1644. doi: 10.1007/s00285-014-0806-7. [DOI] [PubMed] [Google Scholar]
  • 21.Arquès D.G., Michel C.J. A circular code in the protein coding genes of mitochondria. J. Theor. Biol. 1997;189:273–290. doi: 10.1006/jtbi.1997.0513. [DOI] [PubMed] [Google Scholar]
  • 22.Ahmed A., Frey G., Michel C.J. Essential molecular functions associated with circular code evolution. J. Theor. Biol. 2010;264:613–622. doi: 10.1016/j.jtbi.2010.02.006. [DOI] [PubMed] [Google Scholar]
  • 23.Michel C.J., Seligmann H. Bijective transformation circular codes and nucleotide exchanging RNA transcription. Biosystems. 2014;118:39–50. doi: 10.1016/j.biosystems.2014.02.002. [DOI] [PubMed] [Google Scholar]
  • 24.Seligmann H. Chimeric mitochondrial peptides from contiguous regular and swinger RNA. Comput. Struct. Biotechnol. J. 2016;14:283–297. doi: 10.1016/j.csbj.2016.06.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Wolf Y.I., Koonin E.V. Origin of an animal mitochondrial DNA polymerase subunit via lineage-specific acquisition of a glycyl-tRNA synthetase from bacteria of the Thermus-Deinococcus group. Trends Genet. 2001;17:431–433. doi: 10.1016/S0168-9525(01)02370-8. [DOI] [PubMed] [Google Scholar]

Articles from Life are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)

RESOURCES