Skip to main content
Life logoLink to Life
. 2019 Feb 10;9(1):18. doi: 10.3390/life9010018

Single-Frame, Multiple-Frame and Framing Motifs in Genes

Christian J Michel 1
PMCID: PMC6463195  PMID: 30744207

Abstract

We study the distribution of new classes of motifs in genes, a research field that has not been investigated to date. A single-frame motif SF has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2), e.g., the dicodon AAACAA is SF as the trinucleotides AAA and CAA do not occur in a shifted frame. A motif which is not single-frame SF is multiple-frame MF. Several classes of MF motifs are defined and analysed. The distributions of single-frame SF motifs (associated with an unambiguous trinucleotide decoding in the two 53 and 35 directions) and 5′ unambiguous motifs 5U (associated with an unambiguous trinucleotide decoding in the 53 direction only) are analysed without and with constraints. The constraints studied are: initiation and stop codons, periodic codons {AAA,CCC,GGG,TTT}, antiparallel complementarity and parallel complementarity. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with an unambiguous trinucleotide decoding in the two 53 and 35 directions or the 53 direction only. Furthermore, the single-frame motifs SF with a property of trinucleotide decoding and the framing motifs F (also called circular code motifs; first introduced by Michel (2012)) with a property of reading frame decoding may have been involved in the early life genes to build the modern genetic code and the extant genes. They could have been involved in the stage without anticodon-amino acid interactions or in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids. Finally, the SF and MF dipeptides associated with the SF and MF dicodons, respectively, are studied and their importance for biology and the origin of life discussed.

Keywords: single-frame motifs, multiple-frame motifs, framing motifs, gene coding, antiparallel and parallel sequences, early life genes

1. Introduction

The reading frame coding with trinucleotide sets is a fascinating problem, both theoretical and experimental. Before the discovery of the genetic code, a first code was proposed by Gamow [1] by considering the “key-and-lock” relation between various amino acids, and the rhomb shaped “holes” formed by various nucleotides in the DNA. The proposed model will later prove to be false. A few years later, a class of trinucleotide codes, called comma-free codes, was proposed by Crick et al. [2] for explaining how the reading of a sequence of trinucleotides could code amino acids. In particular, how the correct reading frame can be retrieved and maintained. The four nucleotides {A,C,G,T} as well as the 16 dinucleotides {AA,…,TT} are simple codes which are not appropriate for coding 20 amino acids. However, trinucleotides induce a redundancy in their coding. Thus, Crick et al. [2] conjectured that only 20 trinucleotides among the 64 possible trinucleotides {AAA,…,TTT} code for the 20 amino acids. Such a bijective code implies that the coding trinucleotides are found only in one frame—the comma-freeness property. The determination of a set of 20 trinucleotides forming a comma-free code has several necessary conditions:

(i) A periodic trinucleotide from the set {AAA,CCC,GGG,TTT} must be excluded from such a code. Indeed, the concatenation of AAA with itself, for instance, does not allow the (original) reading frame to be retrieved as there are three possible decompositions: …,AAA,AAA,AAA,… (original frame), …A,AAA,AAA,AA… and …AA,AAA,AAA,A…, the commas showing the adopted decomposition.

(ii) Two non-periodic permuted trinucleotides, i.e., two trinucleotides related by a circular permutation, e.g., ACG and CGA, must also be excluded from such a code. Indeed, the concatenation of ACG with itself, for instance, does not allow the reading frame to be retrieved as there are two possible decompositions: …,ACG,ACG,ACG,… (original frame) and …A,CGA,CGA,CG

Therefore, by excluding the four periodic trinucleotides and by gathering the 60 remaining trinucleotides in 20 classes of three trinucleotides such that, in each class, the three trinucleotides are deduced from each other by a circular permutation, e.g., ACG, CGA and GAC, we see that a comma-free code can contain only one trinucleotide from each class and thus has at most 20 trinucleotides. This trinucleotide number is identical to the amino acid number, thus leading to a code assigning one trinucleotide per amino acid without ambiguity.

In the beginning 1960’s, the discovery that the trinucleotide TTT, an excluded trinucleotide in a comma-free code, codes phenylalanine [3], led to the abandonment of the concepts both of a comma-free code [2] and a bijective code as the genetic code is degenerate [4,5,6] with a gene translation in one direction [7].

In 1996, a statistical analysis of occurrence frequencies of the 64 trinucleotides in the three frames of genes of both prokaryotes and eukaryotes showed that the trinucleotides are not uniformly distributed in these three frames [8]. By excluding the four periodic trinucleotides and by assigning each trinucleotide to a preferential frame (frame of its highest occurrence frequency), three subsets X=X0, X1 and X2 of 20 trinucleotides each are found in the frames 0 (reading frame), 1 (frame 0 shifted by one nucleotide in the 53 direction, i.e., to the right) and 2 (frame 0 shifted by two nucleotides in the 53 direction) in genes of both prokaryotes and eukaryotes. The same set X of trinucleotides was identified in average in genes (reading frame) of bacteria, archaea, eukaryotes, plasmids and viruses [9,10]. It contains the 20 following trinucleotides:

X={AAC,AAT,ACC,ATC,ATT,CAG,CTC,CTG,GAA,GAC,GAG,GAT,GCC,GGC,GGT,GTA,GTC,GTT,TAC,TTC} (1)

and codes the 12 following amino acids (three and one letter notation):

X={Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Phe, Thr, Tyr, Val}={A,N,D,Q, E, G, I, L, F, T, Y, V}. (2)

This set X has a strong mathematical property. Indeed, X is a maximal C3 self-complementary trinucleotide circular code [8].

The reading frame coding with trinucleotide codes (sets of words) in general terms, i.e., not particularly the genetic code, is a concept which has been studied in Michel [11,12]. We extend it to the motifs (words of codes), a theoretical domain which has been ignored according to our knowledge. Genes (protein coding regions) can be partitioned into two disjoint classes of motifs: the single-frame motifs SF with an unambiguous trinucleotide decoding in the two 53 and 35 directions, and the multiple-frame motifs MF with an ambiguous trinucleotide decoding in at least one direction. A single-frame motif SF has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2). In contrast, a multiple-frame motif MF has at least one trinucleotide in reading frame that occurs in a shifted frame. Some well-known MF motifs are involved in ribosomal frameshifting. The expression of some viral and cellular genes utilizes a -1 programmed ribosomal frameshifting (-1 PRF) [13,14]. This -1 PRF sequence is based on three elements: (i) a slippery motif composed of seven nucleotides at which the change in reading frame occurs; (ii) a spacer motif, usually less than 12 nucleotides; and (iii) a down-stream (3′) stimulatory motif, usually a pseudoknot or a stem-loop. In eukaryotes, the slippery motif fits a consensus heptanucleotide X,XXY,YYZ, where XXX is any three identical nucleotides, YYY represents AAA or TTT, Z represents A, C or T, the commas separating the codons in reading frame [15,16]. The slippery motifs MF1=A,AAA,AAZ and MF2=T,TTT,TTZ are multiple-frame MF. Indeed, the codon AAA in reading frame also occurs in the shifted frames 1 and 2 in MF1, and similarly with the codon TTT in MF2. Alternative gene decoding is also possible with +1 programmed ribosomal frameshifting (+1 PRF) which has been particularly observed in Euplotes [17]. The identified slippery motif TTT,TAR where R={A,G} is multiple-frame MF. The slippery motifs AAA, CCC, GGG and TTT may cause frameshifting during transcription, producing RNAs missing specific nucleotides when compared to template DNA [18,19]. The slippery motifs are not always multiple-frame while stressing that the spacer and the down-stream stimulatory motifs have been very poorly characterized [20] and could also be involved in such a multiple-frame definition. From a theoretical point of view, it is important to extend this concept by increasing the length of such multiple-frame slippery motifs and also by considering their different classes. If the multiple-frame motifs may be involved in ribosomal frameshifting, the single-frame motifs SF and the framing motifs F (also called circular code motifs; first introduced in Michel [21,22]) from the circular codes [8,9,10] (reviews in Michel [23]; Fimmel and Strüngmann [24]) may have been important in early life genes for constructing the modern genetic code and the extant genes (see Discussion).

Several classes of MF motifs are defined: (i) a unidirectional multiple-frame motif 3UMF has no trinucleotide in reading frame that occurs in a shifted frame after its reading (i.e., its position in the reading frame) but has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading, e.g., the dicodon AACACA is 3UMF as the trinucleotides AAC and (trivially) ACA do not occur in a shifted frame after their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 1) before its reading; (ii) a unidirectional multiple-frame motif 5UMF, the opposite, has no trinucleotide in reading frame that occurs in a shifted frame before its reading but has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading, e.g., the dicodon ACACAA mirror of AACACA is 5UMF as the trinucleotides (trivially) ACA and CAA do not occur in a shifted frame before their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 2) after its reading; and (iii) a bidirectional multiple-frame motif BMF has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading and has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading (both 3UMF and 5UMF), e.g., the dicodons AAAAAA and ACACAC are BMF. A 5′ unambiguous motif 5U, is either a SF motif or a 3UMF motif, e.g., the dicodons AAACAA (SF motif) and AACACA (3UMF motif) belong to the class 5U.

We will only investigate here the distribution of the single-frame motifs SF associated with an unambiguous trinucleotide decoding in the two 53 and 35 directions, and the 5′ unambiguous motifs 5U associated with an unambiguous trinucleotide decoding in the 53 direction only, i.e., a less restrictive class of motifs. The distributions of SF and 5U motifs will be analysed without and with constraints. The constraints studied are: (i) with initiation and stop codons; (ii) without periodic codons {AAA,CCC,GGG,TTT}; (iii) with antiparallel complementarity; and (iv) with parallel complementarity.

We will also investigate the particular case of motifs made up of two codons, i.e., the dicodons. The definitions of SF and MF dicodons will thus identify two new classes of dipeptides, the SF and MF dipeptides. The SF dipeptides are coded by dicodons with an unambiguous trinucleotide decoding, in contrast to the MF dipeptides which are coded by dicodons with an ambiguous trinucleotide decoding. The concept of SF and MF dipeptides might be of predictive value to studies of prebiotic metabolites [25]. Peptide evolution on the primitive earth is an active and exciting field of research with cyclic dipeptides [26] and selective formation of SerHis dipeptide via phosphorus activation [27,28].

2. Method

2.1. Recall of Biological Definitions

Notation 1.

Let us denotes the nucleotide 4-letter alphabet B={A,C,G,T} where A stands for adenine, C stands for cytosine, G stands for guanine and T stands for thymine. The trinucleotide set over B is denoted by B3={AAA,,TTT} . The set of non-empty words (words, respectively) over B is denoted by B+ ( B* , respectively).

Definition 1.

According to the complementary property of the DNA double helix, the nucleotide complementarity mapC:BBis defined byC(A)=T,C(C)=G,C(G)=C,C(T)=A. According to the complementary and antiparallel properties of the DNA double helix, the trinucleotide antiparallel complementarity mapC:B3B3is defined byC(l0l1l2)=C(l2)C(l1)C(l0)for alll0,l1,l2B. The trinucleotide parallel complementarity mapD:B3B3is defined byD(l0l1l2)=C(l0)C(l1)C(l2)for alll0,l1,l2B.

Example 1.

C(ACG)=CGTandD(ACG)=TGC.

2.2. Recall of Circular Code Definitions

Definition 2.

A setS B+is a code if, for eachx1,,xn,y1,,ymS,n,m1, the conditionx1xn=y1ymimpliesn=mandxi=yifori=1,,n.

Definition 3.

Any non-empty subset of the code B3 is a code and called trinucleotide code.

Definition 4.

A trinucleotide codeXB3is circular if, for eachx1,,xn,y1,,ymX,n,m1,rB*,sB+, the conditionssx2xnr=y1ymandx1=rsimplyn=m,r=ε(empty word) andxi=yifori=1,,n.

We briefly recall the proof used here to determine whether a code is circular or not, with the most recent and powerful approach which relates an oriented (directed) graph to a trinucleotide code.

Definition 5.

[29]. Let XB3 be a trinucleotide code. The directed graph G(X)=(V(X),E(X)) associated with X has a finite set of vertices V(X) and a finite set of oriented edges E(X) (ordered pairs [v,w] where v,wX) defined as follows:

{V(X)={N1,N3,N1N2,N2N3: N1N2N3X}E(X)={[N1,N2N3],[N1N2,N3]: N1N2N3X}.

The theorem below gives a relation between a trinucleotide code which is circular and its associated graph.

Theorem 1.

[29]. Let XB3 be a trinucleotide code. The following statements are equivalent:

  • (i) 

    The codeXis circular.

  • (ii) 

    The graphG(X)is acyclic.

Definition 6.

Circular code motifs (first introduced by Michel [21,22]), also called here framing motifs F , are motifs from the circular codes. They have the capacity to retrieve, maintain and synchronize the reading frame in genes.

Example 2.

Let a framing motif F1=  ...AGGTAATTACCAG... be constructed with the circular code X (1) identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses [8,9,10].

(i) Such a framing motif F1 can be obtained as follows. A sequence s of trinucleotides of X is generated and a substring is extracted at any position in this sequence s, i.e., the series of nucleotides on the right and the left of the substring are not considered. Let this substring be F1. (ii) This framing motif F1 allows the reading frame to be retrieved (Figure 1). We try the three possible decompositions w0, w1 (shifted by one letter to the right) and w2 (shifted by two letters to the right) of F1. With w0, AG is not a prefix of any trinucleotide of X, thus the frame associated with w0 is impossible. With w2, AG is a suffix of CAG and GAG belonging to X, then GTA, ATT and ACC belong to X, followed by A which is a prefix of five trinucleotides of X. Thus at this position, the frame associated with w2 is still possible and 2+3×3+1=12 nucleotides are read. The next letter G leads to AG which is not a prefix of any trinucleotide of X. Thus, a window of 12+1=13 nucleotides demonstrates that the frame associated with w2 is impossible. With w1, A is a suffix of GAA and GTA belonging to X, then GGT, AAT, TAC, CAG, etc., belong to X. Thus, the reading frame of F1 is associated with w1, i.e., the first letter A of w is the 3rd letter of a trinucleotide of X: the reading frame of the sequence s is retrieved: ...A,GGT,AAT,TAC,CAG,… (the comma showing the reading frame). (iii) We can prove mathematically that a windows of 13 nucleotides always retrieves the reading frame with the circular code X. Four framing motifs F need a window of 13 nucleotides with the circular code X as they are the four longest ambiguous words of length l=12 nucleotides: F1= AGGTAATTACCA, F2= AGGTAATTACCT (with w2, the first two letters AG are suffix of CAG and GAG belonging to X, and the last letter T is prefix of TAC and TTC belonging to X), F3= TGGTAATTACCA (with w2, the first two letters TG are suffix of CTG belonging to X, and the last letter A is prefix of five trinucleotides of X) and F4= TGGTAATTACCT (with w2, the first two letters TG are suffix of CTG belonging to X, and the last letter T is prefix of TAC and TTC belonging to X). These four framing motifs F contain the two longest ambiguous words of length l=11 nucleotides starting with a trinucleotide of X, i.e., when the suffixes of X are not considered: GGTAATTACCA and GGTAATTACCT (see last row in Table 1 in [21]). (iv) It is very important to stress that for all the other framing motifs F of the circular code X, i.e., different from F1, F2, F3 and F4, the window for retrieving the reading frame is less than 13 nucleotides (see the growth function of the window as a function of the number of nucleotides in Figure 4 in [21]). It is also very important to recall that any motif of the circular code X is framing, i.e., it has the property of reading frame retrieval.

Figure 1.

Figure 1

Retrieval of the reading frame of the word w= ...AGGTAATTACCAG... constructed with the circular code X (1). Among the three possible factorizations w0, w1 and w2, only one factorization w1 into trinucleotides of X is possible leading to ...A,GGT,AAT,TAC,CAG,… (the comma showing the reading frame). Thus, the first letter A of w is the third letter of a trinucleotide of X and the reading frame of the word is retrieved.

2.3. Definitions of Single-Frame and Multiple-Frame Motifs

Definition 7.

An-motif, also calledn-codon, is a series of trinucleotidestiinB3of trinucleotide lengthn,i{1,,n}, which defines the reading framef=0, i.e.,t1t2tn.

Definition 8.

The shifted framef=1andf=2of an-motif is a series of trinucleotidestifinB3of trinucleotide lengthn1,i{1,,n1}, starting at the 2nd and 3rd nucleotide oft1=l0l1l2of then-motif, i.e., atl1(f=1) andl2(f=2).

Notation 2.

Let T be the set of trinucleotides in reading frame f=0 of a n -motif. Let Tf be the set of trinucleotides in a shifted frame f{1,2} of a n -motif.

A single-frame motif SF has no trinucleotide t in reading frame that occurs in a shifted frame, i.e., the trinucleotide decoding is unambiguous in the two 53 and 35 directions. Formally:

Definition 9.

A single-framen-motifSF(unambiguous trinucleotide decoding in the two53and35directions) is an-motif such thatTTf=forf{1,2}, i.e.,titjffori{1,,n}, forj{1,,n1}and forf{1,2}.

Example 3.

Let the dicodon be AAACAA (2-motif). The trinucleotides in reading frame aret1=AAAandt2=CAA, leading to the trinucleotide setT={AAA,CAA}. The single trinucleotide in the shifted frame 1 ist11=AAC, leading to the trinucleotide setT1={AAC}. The single trinucleotide in the shifted frame 2 ist12=ACA, leading to the trinucleotide setT2={ACA}. AsTT1=andTT2=, AAACAA is a single-frame dicodonSF(Figure 2).

Figure 2.

Figure 2

(associated with Example 3). The dicodon AAACAA is single-frame SF.

A multiple-frame motif MF, in contrast to a SF motif, has at least one trinucleotide t in reading frame that occur in a shifted frame f. Formally:

Definition 10.

A multiple-frame n -motif MF (ambiguous trinucleotide decoding in at least one direction) is a n -motif such that TTf for f{1,2}, i.e., i{1,,n}j{1,,n1}f{1,2}:ti=tjf.

The unidirectional multiple-frame motifs UMF belong to a class of MF motifs where all the trinucleotides tf in a shifted frame f occur only before (3UMF: 35 direction) or only after (5UMF: 53 direction) the trinucleotides t in reading frame. Formally:

Definition 11.

A unidirectional multiple-framen-motif3UMF(ambiguous trinucleotide decoding in the35direction only) is aMFn-motif (FFfforf{1,2}) such that the conditionti=tjfimpliesi>jfori{1,,n}, forj{1,,n1}and forf{1,2}.

Example 4.

Let the dicodon be AACACA. The trinucleotides in reading frame are t1=AAC and t2=ACA , leading to T={AAC,ACA} . The single trinucleotide in the shifted frame 1 is t11=ACA , leading to T1={ACA} . The single trinucleotide in the shifted frame 2 is t12=CAC , leading to T2={CAC} . As TT1 , AACACA is a multiple-frame dicodon MF . Furthermore, as t2=t11=ACA yields to the inequality 2>1 , as t1=AACt11=ACA and as t1=AACt12=CAC , AACACA is a unidirectional multiple-frame dicodon 3UMF (Figure 3).

Figure 3.

Figure 3

(associated with Example 4). The dicodon AACACA is unidirectional multiple-frame 3UMF.

Definition 12.

A unidirectional multiple-framen-motif5UMF(ambiguous trinucleotide decoding in the53direction only) is aMFn-motif (FFfforf{1,2}) such that the conditionti=tjfimpliesijfori{1,,n}, forj{1,,n1}and forf{1,2}.

Example 5.

Let the dicodon be AAAAAC. The trinucleotides in reading frame aret1=AAAandt2=AAC, leading toT={AAA,AAC}. The trinucleotides in the shifted frames 1 and 2 aret11=t12=AAA, leading to the trinucleotide setsT1=T2={AAA}. AsTT1andTT2, AAAAAC is a multiple-frame dicodonMF. Furthermore, ast1=t11=t12=AAAyields to the two inequalities11and ast2=AACt11=t12=AAA,AAAAAC is a unidirectional multiple-frame dicodon5UMF(Figure 4).

Figure 4.

Figure 4

(associated with Example 5). The dicodon AAAAAC is unidirectional multiple-frame 5UMF.

Example 6.

Let the dicodon be ACACAA . The trinucleotides in reading frame are t1=ACA and t2=CAA , leading to T={ACA,CAA} . The single trinucleotide in the shifted frame 1 is t11=CAC , leading to T1={CAC} . The single trinucleotide in the shifted frame 2 is t12=ACA , leading to T2={ACA} . As TT2 , ACACAA is a multiple-frame dicodon MF . Furthermore, as t1=t12=ACA yields to the inequality 11 , as t2=CAAt11=CAC and as t2=CAAt12=ACA , ACACAA is a unidirectional multiple-frame dicodon 5UMF (Figure 5). The reasoning could be immediate by noting that the dicodon ACACAA is mirror of AACACA (compare with Example 4).

Figure 5.

Figure 5

(associated with Example 6). The dicodon ACACAA is unidirectional multiple-frame 5UMF.

Definition 13.

A bidirectional multiple-framen-motifBMF(ambiguous trinucleotide decoding in the two53and35directions) is both a5UMFand3UMFn-motif.

Example 7.

Let the trivial dicodon be AAAAAA. The trinucleotides in reading frame aret1=t2=AAA, leading to the trinucleotide setT={AAA}. The trinucleotides in the shifted frames 1 and 2 aret11=t12=AAA, leading to the trinucleotide setsT1=T2={AAA}. AsTT1andTT2, AAAAAA is a multiple-frame dicodonMF. Furthermore, ast1=t11=t12=AAAyields to the two inequalities11and ast2=t11=t12=AAAyields to the two inequalities2>1, AAAAAA is a bidirectional multiple-frame dicodonBMF(Figure 6).

Figure 6.

Figure 6

(associated with Example 7). The dicodon AAAAAA is bidirectional multiple-frame BMF.

Example 8.

Let the dicodon be ACACAC. The trinucleotides in reading frame aret1=ACAandt2=CAC, leading toT={ACA,CAC}. The single trinucleotide in the shifted frame 1 ist11=CAC, leading toT1={CAC}. The single trinucleotide in the shifted frame 2 ist12=ACA, leading toT2={ACA}. AsTT1andTT2, ACACAC is a multiple-frame dicodonMF. Furthermore, ast1=t12=ACAyields to the inequality11and ast2=t11=CACyields to the inequality2>1, ACACAC is a bidirectional multiple-frame dicodonBMF(Figure 7).

Figure 7.

Figure 7

(associated with Example 8). The dicodon ACACAC is bidirectional multiple-frame BMF.

In this paper, by varying n*, we will investigate two distributions: the single-frame n-motifs SF with an unambiguous trinucleotide decoding in the two 53 and 35 directions (see Definition 9), and the 5′ unambiguous n-motifs 5U with an unambiguous trinucleotide decoding in the 53 direction only which are defined formally as follows:

Definition 14.

A 5′ unambiguous n -motif 5U (unambiguous trinucleotide decoding in the 53 direction only) is either a SF n -motif or a 3UMF n -motif , i.e., neither a 5UMF n -motif nor a BMF n -motif.

Example 9.

The dicodons AAACAA (SFmotif) andAACACA (3UMFmotif) belong to the class5U.

2.4. Occurrence Probabilities of Single-Frame n-Motifs SF and 5′ Unambiguous n-Motifs 5U

Definition 15.

Let NbSFM(n) and NbMFM(n) be the numbers of n -motifs ( n* ) single-frame SF and multiple-frame MF , respectively. Let Nb5UMFM(n) , Nb3UMFM(n) and NbBMFM(n) be the numbers of multiple-frame n -motifs ( n* ) which are unidirectional 5UMF , unidirectional 3UMF and bidirectional BMF , respectively.

For n*, we have the obvious relations:

NbSFM(n)+NbMFM(n)=64n,NbMFM(n)=Nb5UMFM(n)+Nb3UMFM(n)+NbBMFM(n).

For n*, the occurrence probability PbSFM(n) of single-frame n-motifs SF will be computed according to

PbSFM(n)=1NbMFM(n)64n. (3)

Similarly, for n*, the occurrence probability Pb5UM(n) of 5′ unambiguous n-motifs 5U will be computed as follows

Pb5UM(n)=PbSFM(n)+Nb3UMFM(n)64n. (4)

Remark 1.

Obviously,Pb5UM(n)>PbSFM(n)whatevern. However, it will be interesting to compare these two probability distributions by varyingn.

2.5. Single-Frame 1-Motifs

It is a trivial case. Each of the 64 codons (1-motifs, n=1) are obviously single-frame motifs SF, by definition (non-existence of a shifted frame). Thus, the probabilities of SF and 5U 1-motifs are equal to PbSFM(1)=Pb5UM(1)=1.

2.6. Single-Frame 2-Motifs

There are 642=4096 dicodons (2-motifs, n=2). The complete study of dicodons which are single-frame SF and multiple-frame MF can be done by hand without difficulty. For the convenience of the reader, we give the complete list of MF dicodons: BMF (Definition 13, Table 1), 3UMF (Definition 11, Table 2) and 5UMF (Definition 12, Table 3).

Table 1.

The 16 bidirectional multiple-frame dicodons BMF (Definition 13).

Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2
AAAAAA AAA AAA CACACA ACA CAC GAGAGA AGA GAG TATATA ATA TAT
ACACAC CAC ACA CCCCCC CCC CCC GCGCGC CGC GCG TCTCTC CTC TCT
AGAGAG GAG AGA CGCGCG GCG CGC GGGGGG GGG GGG TGTGTG GTG TGT
ATATAT TAT ATA CTCTCT TCT CTC GTGTGT TGT GTG TTTTTT TTT TTT

Table 2.

The 96 unidirectional multiple-frame dicodons 3UMF (Definition 11), N being any nucleotide.

Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2
CAAAAA AAA AAA CCACAC CAC CGAGAG GAG CTATAT TAT
GAAAAA AAA AAA GCACAC CAC GGAGAG GAG GTATAT TAT
TAAAAA AAA AAA TCACAC CAC TGAGAG GAG TTATAT TAT
NCAAAA AAA ACCCCC CCC CCC AGCGCG GCG ATCTCT TCT
NGAAAA AAA GCCCCC CCC CCC GGCGCG GCG GTCTCT TCT
NTAAAA AAA TCCCCC CCC CCC TGCGCG GCG TTCTCT TCT
AACACA ACA NACCCC CCC AGGGGG GGG GGG ATGTGT TGT
GACACA ACA NGCCCC CCC CGGGGG GGG GGG CTGTGT TGT
TACACA ACA NTCCCC CCC TGGGGG GGG GGG TTGTGT TGT
AAGAGA AGA ACGCGC CGC NAGGGG GGG ATTTTT TTT TTT
CAGAGA AGA CCGCGC CGC NCGGGG GGG CTTTTT TTT TTT
TAGAGA AGA TCGCGC CGC NTGGGG GGG GTTTTT TTT TTT
AATATA ATA ACTCTC CTC AGTGTG GTG NATTTT TTT
CATATA ATA CCTCTC CTC CGTGTG GTG NCTTTT TTT
GATATA ATA GCTCTC CTC GGTGTG GTG NGTTTT TTT

Table 3.

The 96 unidirectional multiple-frame dicodons 5UMF (Definition 12), N being any nucleotide.

Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2 Dicodon Frame 1 Frame 2
AAAAAC AAA AAA CACACC CAC GAGAGC GAG TATATC TAT
AAAAAG AAA AAA CACACG CAC GAGAGG GAG TATATG TAT
AAAAAT AAA AAA CACACT CAC GAGAGT GAG TATATT TAT
AAAACN AAA CCCCCA CCC CCC GCGCGA GCG TCTCTA TCT
AAAAGN AAA CCCCCG CCC CCC GCGCGG GCG TCTCTG TCT
AAAATN AAA CCCCCT CCC CCC GCGCGT GCG TCTCTT TCT
ACACAA ACA CCCCAN CCC GGGGGA GGG GGG TGTGTA TGT
ACACAG ACA CCCCGN CCC GGGGGC GGG GGG TGTGTC TGT
ACACAT ACA CCCCTN CCC GGGGGT GGG GGG TGTGTT TGT
AGAGAA AGA CGCGCA CGC GGGGAN GGG TTTTTA TTT TTT
AGAGAC AGA CGCGCC CGC GGGGCN GGG TTTTTC TTT TTT
AGAGAT AGA CGCGCT CGC GGGGTN GGG TTTTTG TTT TTT
ATATAA ATA CTCTCA CTC GTGTGA GTG TTTTAN TTT
ATATAC ATA CTCTCC CTC GTGTGC GTG TTTTCN TTT
ATATAG ATA CTCTCG CTC GTGTGG GTG TTTTGN TTT

The probability of SF 2-motifs is equal to PbSFM(2)=1(16+2×96)/642=0.9492. The probability of 5U 2-motifs is equal to Pb5UM(2)=PbSFM(2)+96/642=0.9727.

Remark 2.

Forn3, the3UMFand5UMFn-motifs can have two different shifted trinucleotides in the two frames 1 and 2, in contrast to the2-motifs (see Table 2 and Table 3). For example, with the tricodon AACAAAACC, the trinucleotides in reading frame aret1=AAC,t2=AAAandt3=ACCleading toT={AAA,AAC,ACC}. The trinucleotides in the shifted frame 1 aret11=ACAandt21=AAA, leading toT1={AAA,ACA}. The trinucleotides in the shifted frame 2 aret12=CAAandt22=AAC, leading toT2={AAC,CAA}. AsTT1andTT2, AACAAAACC is a multiple-frame tricodonMF. Furthermore, ast1=t22=AACyields to the inequality12, ast2=t21=AAAyields to the inequality22and ast3=ACCT1T2, AACAAAACC is a unidirectional multiple-frame tricodon5UMFwith two different trinucleotides in the two frames 1 and 2, i.e., AAA in frame 1 and AAC in frame 2.

2.7. Single-Frame n-Motifs

The determination of probability PbSFM(n) of single-frame n-motifs SF for n3 (tricodons, tetracodons, etc.) cannot be done by hand. For n{3,,6} (tricodons up to hexacodons), exact values of probability PbSFM(n) can be obtained by computer calculus (see Table 4). For n=6, the computation of SF motifs among the 646=68,719,476,736 hexacodons with a parallel program with 8 threads takes about 7 days on a standard PC. For n7 (heptacodons, octocodons, etc.), the probability PbSFM(n) is obtained by computer simulation. Simulated values of PbSFM(n) are obtained by generating 1,000,000 random n-motifs for each n. In order to evaluate this approach by computer simulation, simulated values of PbSFM(n) for n{2,,6} are also given in Table 4. Exact and simulated values of PbSFM(n) are identical at 103, demonstrating the reliability of the simulation approach.

Table 4.

Probability PbSFM(n) (%) of single-frame n -motifs SF for n{1,,6}. Exact and simulated values of PbSFM(n) are identical at 103.

Probability PbSFM(n) (%)
n-Motifs Number 64n Exact Values Simulated Values
1 64 100
2 4096 94.92 94.93
3 262,144 85.22 85.20
4 16,777,216 72.35 72.37
5 1,073,741,824 58.07 58.08
6 68,719,476,736 44.07 44.08

The probability Pb5UM(n) of 5′ unambiguous n-motifs 5U for n3 is computed similarly.

3. Results

3.1. Single-Frame Motifs

I first investigated the probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (Definition 9). The probability PbSFM(1) is equal to 1 (1-motifs, Section 2.5). The probability PbSFM(2) is equal to 94.9% (2-motifs, Section 2.6). The probability PbSFM(n) for n{3,,6} is given in Table 4. The probability PbSFM(n) for n7 is obtained by computer simulation (Section 2.7).

While the proportion of multiple-frame 2-motifs MF (Definition 10) is minimal (5.1%=100%94.9% for dicodons, Section 2.6), Figure 8 shows that their propagation will drastically reduce the proportion of SF n-motifs when the trinucleotide length n increases. There are almost no more SF motifs with a length of 14 trinucleotides (PbSFM(14)<1%) and the number of MF motifs becomes already higher than the number of SF motifs with a length of six trinucleotides (Figure 8).

Figure 8.

Figure 8

Decreasing probability PbSFM(n) (Equation (3)) of single-frame n -motifs SF (blue curve) and increasing probability 1PbSFM(n) of multiple-frame n -motifs MF (red curve) by varying the length n between 1 and 20 trinucleotides.

Thus, only short genes, i.e., with up to five trinucleotides, have a higher proportion of single-frame motifs compared to the multiple-frame motifs. Thus, primitive translation, without the extant complex ribosome, could only generate short peptides without frameshift errors.

3.2. 5′ Unambiguous Motifs

I then compared the probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (Definition 9) and the probability Pb5UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5U (Definition 14). Figure 9 shows the decreasing probability Pb5UM(n) of 5U n-motifs when the trinucleotide length n increases. As expected (see Remark 1), its decrease is slower than that of SF n-motifs. There are almost no more 5U motifs with a length of 20 trinucleotides (Pb5UM(20)<1%). Thus with the 5U motifs, there is a length increase of 2014=6 trinucleotides in the trinucleotide decoding. The maximum probability difference Pb5UM(n)PbSFM(n) is 22.0% at length n=8 trinucleotides.

Figure 9.

Figure 9

Decreasing probability PbSFM(n) (Equation (3)) of single-frame n -motifs SF (blue curve from Figure 8) and decreasing probability Pb5UM(n) (Equation (4)) of 5′ unambiguous n -motifs 5U (cyan curve) by varying the length n between 1 and 20 trinucleotides.

The 5′ unambiguous n-motifs, a less restrictive class of motifs with an unambiguous trinucleotide decoding in the 53 direction only, can generate a slightly longer peptides without frameshift error compared to the single-frame motifs.

I now evaluate the single-frame motifs SF and the 5′ unambiguous motifs 5U with constraints.

3.3. Single-Frame and 5′ Unambiguous Motifs with Initiation and Stop Codons

The single-frame n-motifs SF and the 5′ unambiguous motifs 5U are investigated with an initiation codon ATG and a stop codon {TAA,TAG,TGA}. The case n=1 does not exist. For n=2, there are only three dicodons: ATGTAA, ATGTAG and ATGTGA which are all obviously SF. Thus, the probabilities of SF and USF 2-motifs are obviously PbSFM(2)=Pb5UM(2)=1. Figure 10 shows that the proportions of SF and 5U motifs with initiation and stop codons are lower than their respective non-constrained motifs.

Figure 10.

Figure 10

Decreasing probability PbSFM(n) (Equation (3)) of single-frame n -motifs SF (blue curve from Figure 8) and decreasing probability Pb5UM(n) (Equation (4)) of 5′ unambiguous n -motifs 5U (cyan curve from Figure 9) by varying the length n between 1 and 10 trinucleotides. With initiation and stop codons, decreasing probability PbSFM(n) of n -motifs SF (magenta curve) and decreasing probability Pb5UM(n) of n -motifs 5U (orange curve) by varying the length n between 2 and 10 trinucleotides.

Genes with initiation and stop codons do not increase translation fidelity compared to non-constrained genes (according to this approach).

3.4. Single-Frame and 5′ Unambiguous Motifs without Periodic Codons

The single-frame motifs SF and the 5′ unambiguous motifs 5U are now studied without periodic codons {AAA,CCC,GGG,TTT}. As expected, Figure 11 shows that the proportions of SF and 5U motifs without periodic codons are higher than their respective non-constrained motifs.

Figure 11.

Figure 11

Decreasing probability PbSFM(n) (Equation (3)) of single-frame n -motifs SF (blue curve from Figure 8) and decreasing probability Pb5UM(n) (Equation (4)) of 5′ unambiguous n -motifs 5U (cyan curve from Figure 9) by varying the length n between 1 and 10 trinucleotides. Without periodic codons {AAA,CCC,GGG,TTT}, decreasing probability PbSFM(n) of n -motifs SF (magenta curve) and decreasing probability Pb5UM(n) of n -motifs 5U (orange curve) by varying the length n between 1 and 10 trinucleotides.

Genes without periodic codons slightly increase frame translation fidelity compared to non-constrained genes (according to this approach).

3.5. Single-Frame and 5′ Unambiguous Motifs with Antiparallel Complementarity

The single-frame 2n-motifs SF and the 5′ unambiguous 2n-motifs 5U are now investigated with the following antiparallel complementary sequence: t1t2tnC(tn)C(t2)C(t1) where the trinucleotide antiparallel complementarity map C applied to a trinucleotide t is recalled in Definition 1. As an example, if t1t2t3=ACGTGCAAT then the antiparallel complementary sequence studied is ACGTGCAATATTGCACGT. Note that the trinucleotide length of such motifs is even. Classical antiparallel complementary structures are the DNA double helix and the RNA stem. Interesting results are observed. As expected, the two probability curves PbSFM(n) of SF motifs and Pb5UM(n) of 5U motifs with antiparallel complementarity are identical (Figure 12). The proof is based on the following property: if ti=tjf with i>j (3UMF motif) then C(ti)=ti=C(tjf)=tjf with ij (5UMF motif) and ff. Furthermore, antiparallel complementarity increases the proportion of SF motifs but decreases the proportion of 5U motifs, compared to their respective non-constrained motifs.

Figure 12.

Figure 12

Decreasing probability PbSFM(n) (Equation (3)) of single-frame n -motifs SF (blue curve from Figure 8) and decreasing probability Pb5UM(n) (Equation (4)) of 5′ unambiguous n -motifs 5U (cyan curve from Figure 9) by varying the length n between 1 and 14 trinucleotides. With antiparallel complementarity, decreasing probabilities PbSFM(n) and Pb5UM(n) of 2n -motifs SF and 5U (two identical curves in magenta) by varying the length n between 1 and 7 trinucleotides.

The “antiparallel complementary” genes have a higher proportion of single-frame motifs compared to the non-complementary genes. Thus, primitive translation associated with a DNA property could generate a greater number of peptides without frameshift errors.

3.6. Single-Frame Motifs and 5′ Unambiguous with Parallel Complementarity

The single-frame 2n-motifs SF and the 5′ unambiguous 2n-motifs 5U are now analysed with the following parallel complementary sequence: t1t2tnD(t1)D(t2)D(tn) where the trinucleotide parallel complementarity map D applied to a trinucleotide t is recalled in Definition 1. As an example, if t1t2t3=ACGTGCAAT then the parallel complementary sequence studied is ACGTGCAATTGCACGTTA. Note that the trinucleotide length of such motifs is also even. Interesting results are also observed. The two probability curves PbSFM(n) of SF motifs with parallel complementarity and Pb5UM(n) of 5U motifs without constraints are superposable (Figure 13). Parallel complementarity increases the proportions of both SF motifs and 5U motifs compared to their respective non-constrained motifs.

Figure 13.

Figure 13

Decreasing probability PbSFM(n) (Equation (3)) of single-frame n -motifs SF (blue curve from Figure 8) and decreasing probability Pb5UM(n) (Equation (4)) of 5′ unambiguous n -motifs 5U (cyan curve from Figure 9) by varying the length n between 1 and 14 trinucleotides. With parallel complementarity, decreasing probability PbSFM(n) of 2n -motifs SF (magenta curve) and decreasing probability Pb5UM(n) of 2n -motifs 5U (orange curve) by varying the length n between 1 and 7 trinucleotides.

“Parallel complementary” genes have a slightly higher proportion of single-frame motifs compared to the “antiparallel complementary” genes (compare the magenta curves in Figure 12 and Figure 13). The biological meaning is not yet explained.

3.7. Framing Motifs

There are framing motifs F which are single-frame SF or multiple-frame MF.

Proposition 1.

A framing motifFcan be single-frameSF.

Proof. Take the following motif m=GAACTCCCGATATGGCTC. The motif m can be generated by the code X={ATA,CCG,CTC,GAA,TGG}. By Theorem 1, it is easy to verify that the graph G(X) is acyclic, and thus X is circular. Furthermore, the set of trinucleotides in reading frame is T=X, the set of trinucleotides in the shifted frame 1 is T1={AAC,CGA,GGC,TAT,TCC} and the set of trinucleotides in the shifted frame 2 is T2={ACT,ATG,CCC,GAT,GCT}. We have TT1= and TT2=. Thus, the motif m is both framing F and single-frame SF.

Proposition 2.

A framing motifFcan be multiple-frameMF.

Proof. Take the following motif m=ATTGAGCGAGCCTGTCAG. The motif m can be generated by the code X={ATT,CAG,CGA,GAG,GCC,TGT}. By Theorem 1, it is easy to verify that the graph G(X) is acyclic, and thus X is circular. Furthermore, we have the trinucleotide sets T=X, T1={AGC,CCT,GAG,GTC,TTG} and T2={AGC,CTG,GCG,TCA,TGA} leading to TT1={GAG} and TT2=. Thus, the motif m is both framing F and multiple-frame MF, precisely unidirectional multiple-frame 5UMF.

There are single-frame motifs SF or multiple-frame motifs MF which are not framing F.

Proposition 3.

A single-frame motifSFcan be non-framingF.

Proof. Take the following motif m=GACAAATAAGTGGTATGA. The motif m can be generated by the code X={AAA,GAC,GTA,GTG,TAA,TGA}. We have the trinucleotide sets T=X, T1={AAG,AAT,ACA,TAT,TGG} and T2={AGT,ATA,ATG,CAA,GGT} leading to TT1= and TT2=. However, as X contains the periodic trinucleotide AAA, X is not circular. Thus, the motif m is single-frame SF but not framing F.

Proposition 4.

A multiple-frame motifMFcan be non-framingF.

Proof. Take the following motif m=GGACCATACATCCGGACT. The motif m can be generated by the code X={ACT,ATC,CCA,CGG,GGA,TAC}. We have the trinucleotide sets T=X, T1={ACA,CAT,GAC,GGA,TCC} and T2={ACC,ATA,CAT,CCG,GAC} leading to TT1={GGA} and TT2=. However, as X contains the two permuted trinucleotides ACT and TAC, X is not circular. Thus, the motif m is multiple-frame MF, precisely unidirectional multiple-frame 5UMF, but not framing F.

Genes which are both framing F and single-frame SF retrieve the reading frame and code for a unique peptide as the shifted frames would lead to a different peptide product.

3.8. A New Class of Theoretical Parameters Relating the Circular Codes and Their Circular Code Motifs

The idea is to define a new class of parameters in order to measure the intensity I(m) of a motif m of a circular code to retrieve the reading frame. Thus, we have to associate information from the circular code theory with information from words (motifs).

In the circular code theory, the most important and the simplest parameter is the length lmax(X) of a longest path (maximal arrow-length of a path) in the associated graph G(X) of a circular code X (see Definition 5). Note that the longest path lmax(X) has a finite length as the graph G(X) is acyclic (Theorem 1). The longest path lmax(X) can classify the circular codes, from the strong comma-free codes with lmax(X)=1 and the comma-free codes with lmax(X)=2 up to the general circular codes with a maximal longest path lmax(X)=8 when XB3 (i.e., for the trinucleotide circular codes) [29]. It is also related to the reading frame number nX of X, i.e., the number of nucleotides to retrieve the reading frame. This reading frame number nX can also be used to classify the circular codes, from the strong comma-free codes with nX=2 nucleotides and the comma-free codes with nX=3 nucleotides up to the general circular codes with a maximal number nX=13 nucleotides when XB3 [30]. However, this parameter nX needs to know the structure of the longest path lmax(X) which is one of the four cases: b1d1bk, b1d1dk, d1b1bk and d1b1dk where the nucleotide biB and the dinucleotide diB2 for any i (see Definition 5). In summary, for the circular codes XB3, the longest path lmax(X) belongs to the interval 1lmax(X)8 and the reading frame number nX belongs to the interval 2nX13 nucleotides. The definition of the reading frame number nX can still be generalized to arbitrary sequences, i.e., not entirely consisting of trinucleotides from X [30]. For these two reasons, i.e., the knowledge of the structure of lmax(X) and the generalized definition of nX, the parameter nX, mentioned here to take date, will not be considered here.

A motif m of a code, circular or not, can be characterized by its length l(m), given here in trinucleotides for convenience, for measuring its expansion; and its cardinality card(T(m)) of the set T(m) (see Notation 2) of trinucleotides (in reading frame f=0) of m for measuring its variety (complexity). In the case of a motif m of a trinucleotide circular code XB3, 1card(T(m))20.

It is important to stress the following condition: T(m)X with a trinucleotide circular code XB3. The case T(m)=X is associated with a trinucleotide circular code X constructed from the motif m.

A simple parameter measuring the expansion intensity Ie(m) of reading frame retrieval of a circular code motif m can be defined as follows:

Ie(m)=l(m)lmax(X) (5)

where l(m), l(m)1, is the trinucleotide length of the motif m and lmax(X), 1lmax(X)8, is the length of a longest path in the associated graph G(X) of a trinucleotide circular code XB3. Note that 18Ie(m)l(m) and if l(m)lmax(X) then 1Ie(m)l(m).

A second parameter measuring both the expansion and variety intensity Iev(m) of a circular code motif m can also be defined as follows:

Iev(m)=card(T(m))×Ie(m) (6)

where Ie(m) is defined in Equation (5) and card(T(m)), 1card(T(m))20, is the cardinality of the set T(m) (Notation 2) of trinucleotides (in reading frame f=0) of m. Note that 18Iev(m)20l(m) and if l(m)lmax(X) then 1Iev(m)20l(m). Thus, for the circular code motifs m of a given trinucleotide length l(m), the intensity Iev(m) of reading frame retrieval increases according to their cardinality card(T(m)).

For a sequence s containing several circular code motifs m, the formulas (5) and (6) can be expressed as follows:

Ie(s)=msIe(m)=msl(m)lmax(X) (7)

with the hypothesis that lmax(X) is identical for the motifs m, a realistic case when the motifs m are obtained from a same studied trinucleotide circular code X, and thus:

Iev(s)=msIev(m)=mscard(T(m))×l(m)lmax(X). (8)

Note also that the formulas Ie(s) and Iev(s) can also be normalized in order to weight the different lengths of sequences s.

3.9. MF Dipeptides

The series of multi-frame motifs MF starts with the dicodons. We will now focus on the MF dipeptides which are two consecutive amino acids coded by the MF dicodons. The 16 bidirectional multiple-frame dicodons BMF (Table 1) code 16 BMF dipeptides according to the universal genetic code (Table 5). They include the four obvious BMF dipeptides GlyGly (GGGGGG), LysLys (AAAAAA), PhePhe (TTTTTT) and ProPro (CCCCCC). 15 amino acids out of 20 are involved in these 16 BMF dipeptides (Table 6): Ala, Arg, Cys, Glu, Gly, His, Ile, Leu, Lys, Phe, Pro, Ser, Thr, Tyr and Val (except Asn, Asp, Gln, Met and Trp), each amino acid occurring once in a position of a BMF dipeptide, except Arg occurring twice in a position of a BMF dipeptide: ArgAla, ArgGlu, AlaArg and GluArg.

Table 5.

The 16 BMF dipeptides coded by the 16 bidirectional multiple-frame dicodons BMF (Definition 13, Table 1).

AR AlaArg GCGCGC GG GlyGly GGGGGG LS LeuSer CTCTCT SL SerLeu TCTCTC
CV CysVal TGTGTG HT HisThr CACACA PP ProPro CCCCCC TH ThrHis ACACAC
ER GluArg GAGAGA IY IleTyr ATATAT RA ArgAla CGCGCG VC ValCys GTGTGT
FF PhePhe TTTTTT KK LysLys AAAAAA RE ArgGlu AGAGAG YI TyrIle TATATA

Table 6.

Occurrence number of the 15 amino acids in the 1st and 2nd positions of the 16 BMF dipeptides (Table 5).

A C E F G H I K L P R S T V Y
Ala Cys Glu Phe Gly His Ile Lys Leu Pro Arg Ser Thr Val Tyr Sum
1st site 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 16
2nd site 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 16
Sum 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 32

The 96 unidirectional multiple-frame dicodons 3UMF (Table 2) code 83 3UMF dipeptides and four pairs (stop codon, amino acid): TAGArg, TAGGly, TGAGlu and TerLys where Ter can be the two stop codons TAA and TGA (Table 7). All the 20 amino acids are involved in the 83 3UMF dipeptides (Table 8). All the 20 amino acids occur in the first position of 3UMF dipeptides. Five amino acids Asn, Asp, Gln, Met and Trp do not occur in their second position which are the five amino acids not involved in the BMF dipeptides. In the 83 3UMF dipeptides, Pro and Gly are involved 20 and 19 times, respectively, while Met and Trp only twice and once, respectively.

Table 7.

The 83 3UMF dipeptides and the four pairs (stop codon, amino acid) coded by the 96 unidirectional multiple-frame dicodons 3UMF (Definition 11, Table 2).

AF AlaPhe GCTTTT IS IleSer ATCTCT RV ArgVal CGTGTG
AG AlaGly GCGGGG KG LysGly AAGGGG SA SerAla AGCGCG
AH AlaHis GCACAC KR LysArg AAGAGA SF SerPhe AGTTTT, TCTTTT
AK AlaLys GCAAAA LC LeuCys CTGTGT, TTGTGT SG SerGly TCGGGG
AL AlaLeu GCTCTC LF LeuPhe CTTTTT SH SerHis TCACAC
AP AlaPro GCCCCC LG LeuGly CTGGGG, TTGGGG SK SerLys TCAAAA
CA CysAla TGCGCG LK LeuLys CTAAAA, TTAAAA SP SerPro AGCCCC, TCCCCC
CF CysPhe TGTTTT LP LeuPro CTCCCC SR SerArg TCGCGC
CP CysPro TGCCCC LY LeuTyr CTATAT, TTATAT SV SerVal AGTGTG
DF AspPhe GATTTT MC MetCys ATGTGT TerE TerGlu TGAGAG
DI AspIle GATATA MG MetGly ATGGGG TerG TerGly TAGGGG
DP AspPro GACCCC NF AsnPhe AATTTT TerK TerLys TAAAAA, TGAAAA
DT AspThr GACACA NI AsnIle AATATA TerR TerArg TAGAGA
EG GluGly GAGGGG NP AsnPro AACCCC TF ThrPhe ACTTTT
EK GluLys GAAAAA NT AsnThr AACACA TG ThrGly ACGGGG
FP PhePro TTCCCC PF ProPhe CCTTTT TK ThrLys ACAAAA
FS PheSer TTCTCT PG ProGly CCGGGG TL ThrLeu ACTCTC
GA GlyAla GGCGCG PH ProHis CCACAC TP ThrPro ACCCCC
GE GlyGlu GGAGAG PK ProLys CCAAAA TR ThrArg ACGCGC
GF GlyPhe GGTTTT PL ProLeu CCTCTC VF ValPhe GTTTTT
GK GlyLys GGAAAA PR ProArg CCGCGC VG ValGly GTGGGG
GP GlyPro GGCCCC QG GlnGly CAGGGG VK ValLys GTAAAA
GV GlyVal GGTGTG QK GlnLys CAAAAA VP ValPro GTCCCC
HF HisPhe CATTTT QR GlnArg CAGAGA VS ValSer GTCTCT
HI HisIle CATATA RE ArgGlu CGAGAG VY ValTyr GTATAT
HP HisPro CACCCC RF ArgPhe CGTTTT WG TrpGly TGGGGG
IF IlePhe ATTTTT RG ArgGly AGGGGG, CGGGGG YF TyrPhe TATTTT
IK IleLys ATAAAA RK ArgLys AGAAAA, CGAAAA YP TyrPro TACCCC
IP IlePro ATCCCC RP ArgPro CGCCCC YT TyrThr TACACA

Table 8.

Occurrence number of the 20 amino acids in the first and second positions of the 83 3UMF dipeptides and the four pairs (stop codon, amino acid) (Table 7).

A C D E F G H I K L M N P Q R S T V W Y
Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Ter Sum
1st site 6 3 4 2 2 6 3 4 2 6 2 4 6 3 6 8 6 6 1 3 4 87
2nd site 3 2 0 3 14 13 3 3 12 3 0 0 14 0 6 3 3 3 0 2 0 87
Sum 9 5 4 5 16 19 6 7 14 9 2 4 20 3 12 11 9 9 1 5 4 174

The 96 unidirectional multiple-frame dicodons 5UMF (Table 3) code 40 5UMF dipeptides and three pairs (amino acid, stop codon): IleTer where Ter can be the two stop codons TAA and TAG, PheTer where Ter can be the three stop codons TAA, TAG and TGA, and ValTGA (Table 9). All the 20 amino acids are involved in the 40 5UMF dipeptides (Table 10). Five amino acids are Asn, Asp, Gln, Met and Trp do not occur in the first position of 5UMF dipeptides which are the five amino acids not involved in the BMF dipeptides. All the 20 amino acids occur in their second position. In the 40 5UMF dipeptides, two amino acids Lys and Phe are involved eight times while Asn only once.

Table 9.

The 40 5UMF dipeptides and the three pairs (amino acid, stop codon) coded by the 96 unidirectional multiple-frame dicodons 5UMF (Definition 12, Table 3).

AR AlaArg GCGCGA, GCGCGG, GCGCGT KN LysAsn AAAAAC, AAAAAT
CV CysVal TGTGTA, TGTGTC, TGTGTT KR LysArg AAAAGA, AAAAGG
ER GluArg GAGAGG KS LysSer AAAAGC, AAAAGT
ES GluSer GAGAGC, GAGAGT KT LysThr AAAACA, AAAACC, AAAACG, AAAACT
FC PheCys TTTTGC, TTTTGT LS LeuSer CTCTCA, CTCTCC, CTCTCG
FF PhePhe TTTTTC PH ProHis CCCCAC, CCCCAT
FL PheLeu TTTTTA, TTTTTG PL ProLeu CCCCTA, CCCCTC, CCCCTG, CCCCTT
FS PheSer TTTTCA, TTTTCC, TTTTCG, TTTTCT PP ProPro CCCCCA, CCCCCG, CCCCCT
FTer PheTer TTTTAA, TTTTAG, TTTTGA PQ ProGln CCCCAA, CCCCAG
FW PheTrp TTTTGG PR ProArg CCCCGA, CCCCGC, CCCCGG, CCCCGT
FY PheTyr TTTTAC, TTTTAT RA ArgAla CGCGCA, CGCGCC, CGCGCT
GA GlyAla GGGGCA, GGGGCC, GGGGCG, GGGGCT RD ArgAsp AGAGAC, AGAGAT
GD GlyAsp GGGGAC, GGGGAT RE ArgGlu AGAGAA
GE GlyGlu GGGGAA, GGGGAG SL SerLeu TCTCTA, TCTCTG, TCTCTT
GG GlyGly GGGGGA, GGGGGC, GGGGGT TH ThrHis ACACAT
GV GlyVal GGGGTA, GGGGTC, GGGGTG, GGGGTT TQ ThrGln ACACAA, ACACAG
HT HisThr CACACC, CACACG, CACACT VC ValCys GTGTGC
ITer IleTer ATATAA, ATATAG VTer ValTer GTGTGA
IY IleTyr ATATAC VW ValTrp GTGTGG
KI LysIle AAAATA, AAAATC, AAAATT YI TyrIle TATATC, TATATT
KK LysLys AAAAAG YM TyrMet TATATG
KM LysMet AAAATG

Table 10.

Occurrence number of the 20 amino acids in the first and second positions of the 40 5UMF dipeptides and the three pairs (amino acid, stop codon) (Table 9).

A C D E F G H I K L M N P Q R S T V W Y
Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Ter Sum
1st site 1 1 0 2 7 5 1 2 7 1 0 0 5 0 3 1 2 3 0 2 0 43
2nd site 2 2 2 2 1 1 2 2 1 3 2 1 1 2 4 4 2 2 2 2 3 43
Sum 3 3 2 4 8 6 3 4 8 4 2 1 6 2 7 5 4 5 2 4 3 86

The 114=12143 MF dipeptides among 400, i.e., 28.5%, are coded by 208=16+2×96 MF dicodons (BMF, 3UMF, 5UMF) among 4096, i.e., 5.1% (Table 11). As a consequence, 286 SF dipeptides, i.e., 71.5%, are coded by 3888 single-frame dicodons SF, i.e., 94.9%. There is also a strong asymmetry between the number of MF dipeptides coded by one direction or other direction: 83 3UMF dipeptides (Table 7) versus 40 5UMF dipeptides (Table 9). This asymmetry may be related to the gene translation in the 53 direction, the 3UMF dicodons having an unambiguous trinucleotide decoding in the 53 direction.

Table 11.

Multi-frame dipeptide boolean matrix. The 114=12143 MF dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the 208=16+2×96 multiple-frame dicodons BMF (Definition 13, Table 1), 3UMF (Definition 11, Table 2) and 5UMF (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The value of 1 means a MF dipeptide coded by at least a multiple-frame dicodon MF (MF true). The value of 0 stands for a SF dipeptide coded by a single-frame dicodon SF (MF false). For example, the value of AlaCys is 0 (absent in Table 5, Table 7 and Table 9) and the value of CysAla is 1 (7th row in Table 7).

Site 2nd A C D E F G H I K L M N P Q R S T V W Y
1st Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Ter Sum
A Ala 0 0 0 0 1 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0 7
C Cys 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 4
D Asp 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 4
E Glu 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 4
F Phe 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 8
G Gly 1 0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 8
H His 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 4
I Ile 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 6
K Lys 0 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 0 0 0 0 8
L Leu 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 7
M Met 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
N Asn 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 4
P Pro 0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 8
Q Gln 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 3
R Arg 1 0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 8
S Ser 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 0 9
T Thr 0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 8
V Val 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 9
W Trp 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Y Tyr 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 5
Ter 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 4
Sum 4 4 2 3 15 14 4 5 13 5 2 1 15 2 8 6 5 4 2 4 3 121

Five dipeptides GlyAla, GlyVal, PheSer, ProLeu and ProArg are the most strongly coded, each by five MF dicodons (Table 12), e.g., GlyAla is coded by one 3UMF dicodon GGCGCG (Table 7), and four 5UMF dicodons GGGGCA, GGGGCC, GGGGCG and GGGGCT (Table 9). The SF and MF dipeptides could have particular spatial structures and biological functions in extant and primitive proteins which remain to be identified.

Table 12.

Multi-frame dipeptide occurrence matrix. The 114=12143 MF dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the 208=16+2×96 multiple-frame dicodons BMF (Definition 13, Table 1), 3UMF (Definition 11, Table 2) and 5UMF (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The values between 1 and 5 give the number of times a MF dipeptide is coded by multiple-frame dicodons MF. The value of 0 stands for a SF dipeptide coded by a single-frame dicodon SF. For example, the value of AlaCys is 0 (absent in Table 5, Table 7 and Table 9), the value of CysAla is 1 (7th row in Table 7) and the value of AlaArg if 4 (one occurrence: 1st row in Table 5 and three occurrences: 1st row in Table 9).

Site 2nd A C D E F G H I K L M N P Q R S T V W Y
1st Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Ter Sum
A Ala 0 0 0 0 1 1 1 0 1 1 0 0 1 0 4 0 0 0 0 0 0 10
C Cys 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 4 0 0 0 7
D Asp 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 4
E Glu 0 0 0 0 0 1 0 0 1 0 0 0 0 0 2 2 0 0 0 0 0 6
F Phe 0 2 0 0 2 0 0 0 0 2 0 0 1 0 0 5 0 0 1 2 3 18
G Gly 5 0 2 3 1 4 0 0 1 0 0 0 1 0 0 0 0 5 0 0 0 22
H His 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 4 0 0 0 0 7
I Ile 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 2 2 8
K Lys 0 0 0 0 0 1 0 3 2 0 1 2 0 0 3 2 4 0 0 0 0 18
L Leu 0 2 0 0 1 2 0 0 2 0 0 0 1 0 0 4 0 0 0 2 0 14
M Met 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
N Asn 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 4
P Pro 0 0 0 0 1 1 3 0 1 5 0 0 4 2 5 0 0 0 0 0 0 22
Q Gln 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 3
R Arg 4 0 2 3 1 2 0 0 2 0 0 0 1 0 0 0 0 1 0 0 0 16
S Ser 1 0 0 0 2 1 1 0 1 4 0 0 2 0 1 0 0 1 0 0 0 14
T Thr 0 0 0 0 1 1 2 0 1 1 0 0 1 2 1 0 0 0 0 0 0 10
V Val 0 2 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 10
W Trp 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Y Tyr 0 0 0 0 1 0 0 3 0 0 1 0 1 0 0 0 1 0 0 0 0 7
Ter 0 0 0 1 0 1 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 5
Sum 11 7 4 7 17 19 7 9 17 13 2 2 19 4 18 15 11 11 2 7 6 208

4. Discussion

For the first time to our knowledge, new definitions of motifs in genes are presented. The single-frame motifs SF (unambiguous trinucleotide decoding in the two 53 and 35 directions) and the multiple-frame motifs MF (ambiguous trinucleotide decoding in at least one direction) form a partition of genes. Several classes of MF motifs are defined and analysed: (i) unidirectional multiple-frame motifs 3UMF (ambiguous trinucleotide decoding in the 35 direction only); (ii) unidirectional multiple-frame motifs 5UMF (ambiguous trinucleotide decoding in the 53 direction only); and (iii) bidirectional multiple-frame motifs BMF (ambiguous trinucleotide decoding in the two 53 and 35 directions). The distribution of the single-frame motifs SF and the 5′ unambiguous motifs 5U (unambiguous trinucleotide decoding in the 53 direction only) are studied without and with constraints.

The proportion of SF motifs drastically decreases with their trinucleotide length. The SF motifs become absent (<1%) when their length 14 trinucleotides and the number of MF motifs becomes already higher than the number of SF motifs when their length 6 trinucleotides. As expected, the proportion of 5U motifs decreases more slowly than that of SF motifs. The 5U motifs become absent (<1%) when their length 20 trinucleotides. Thus with the 5U motifs, there is a length increase of 2014=6 trinucleotides in the trinucleotide decoding.

The proportions of SF and 5U motifs with initiation and stop codons are lower than their respective non-constrained motifs. In contrasts, their proportions in motifs without periodic codons {AAA,CCC,GGG,TTT} are higher than their respective non-constrained motifs. The proportions of SF and 5U motifs with antiparallel complementarity are identical. Antiparallel complementarity increases the proportion of SF motifs but decreases the proportion of 5U motifs, compared to their respective non-constrained motifs. The proportions of SF motifs with parallel complementarity and 5U motifs without constraints follow a similar distribution. Finally, parallel complementarity increases the proportions of both SF motifs and 5U motifs compared to their respective non-constrained motifs. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with unambiguous trinucleotide decoding, strictly in the two 53 and 35 directions (SF motifs) or conserved in the 53 direction but relaxed-lost in the 35 direction (5U motifs).

The single-frame motifs SF with a property of trinucleotide decoding and the framing motifs F with a property of reading frame decoding could have operated in the primitive soup for constructing the modern genetic code and the extant genes [31]. They could have been involved in the stage without anticodon-amino acid interactions to form peptides from prebiotically amino acids [32]. They could also have been related in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids at the primitive step of life (review in [33]). According to a great number of biological experiments, the ISN structure contains nucleotides in fixed and variable positions, as well as an important trinucleotide for interacting with the amino acid (see e.g., the recent review in [34]). However, the general structure of the aptamers binding amino acids, in particular its nucleotide length, its amino acid binding loop and its nucleotide position, is still an open problem. Similar arguments could hold for the ribonucleopeptides which could be implicated in a primitive T box riboswitch functioning as an aminoacyl-tRNA synthetase and a peptidyl-transferase ribozyme [35]. The single-frame motifs SF and the framing motifs F with their properties to decode the trinucleotides and the reading frame could have been necessary for the evolutionary construction of the modern genetic code.

Acknowledgment

I thank Denise Marie Besch for her support.

Abbreviations

SF single-frame motif (unambiguous trinucleotide decoding in the two 53 and 35 directions)
MF multiple-frame motif
UMF unidirectional multiple-frame motif
3UMF unidirectional multiple-frame motif (ambiguous trinucleotide decoding in the 35 direction only)
5UMF unidirectional multiple-frame motif (ambiguous trinucleotide decoding in the 53 direction only)
BMF bidirectional multiple-frame motif (ambiguous trinucleotide decoding in the two 53 and 35 directions)
5U 5′ unambiguous motif (unambiguous trinucleotide decoding in the 53 direction only)
F framing motif (also called circular code motif)

Funding

The author received no funding for this study.

Conflicts of Interest

The author declares no competing interests.

References

  • 1.Gamow G. Possible relation between deoxyribonucleic acid and protein structures. Nature. 1954;173:318. doi: 10.1038/173318a0. [DOI] [Google Scholar]
  • 2.Crick F.H.C., Griffith J.S., Orgel L.E. Codes without commas. Proc. Natl. Acad. Sci. USA. 1957;43:416–421. doi: 10.1073/pnas.43.5.416. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Nirenberg M.W., Matthaei J.H. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc. Natl. Acad. Sci. USA. 1961;47:1588–1602. doi: 10.1073/pnas.47.10.1588. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Crick F.H.C., Leslie Barnett Brenner S., Watts-Tobin R.J. General nature of the genetic code for proteins. Nature. 1961;192:1227–1232. doi: 10.1038/1921227a0. [DOI] [PubMed] [Google Scholar]
  • 5.Khorana H.G., Büchi H., Ghosh H., Gupta N., Jacob T.M., Kössel H., Morgan R., Narang S.A., Ohtsuka E., Wells R.D. Polynucleotide synthesis and the genetic code. Cold Spring Harb. Symp. Quant. Biol. 1966;31:39–49. doi: 10.1101/SQB.1966.031.01.010. [DOI] [PubMed] [Google Scholar]
  • 6.Nirenberg M., Caskey T., Marshall R., Brimacombe R., Kellogg D., Doctor B., Hatfield D., Levin J., Rottman F., Pestka S., et al. The RNA code and protein synthesis. Cold Spring Harb. Symp. Quant. Biol. 1966;31:11–24. doi: 10.1101/SQB.1966.031.01.008. [DOI] [PubMed] [Google Scholar]
  • 7.Salas M., Smith M.A., Stanley W.M., Wahba A.J., Ochoa S. Direction of reading of the genetic message. J. Biol. Chem. 1965;240:3988–3995. [PubMed] [Google Scholar]
  • 8.Arquès D.G., Michel C.J. A complementary circular code in the protein coding genes. J. Theor. Biol. 1996;182:45–58. doi: 10.1006/jtbi.1996.0142. [DOI] [PubMed] [Google Scholar]
  • 9.Michel C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, eukaryotes, plasmids and viruses. J. Theor. Biol. 2015;380:156–177. doi: 10.1016/j.jtbi.2015.04.009. [DOI] [PubMed] [Google Scholar]
  • 10.Michel C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, archaea, eukaryotes, plasmids and viruses. Life. 2017;7:20. doi: 10.3390/life7020020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Michel C.J. A genetic scale of reading frame coding. J. Theor. Biol. 2014;355:83–94. doi: 10.1016/j.jtbi.2014.03.029. [DOI] [PubMed] [Google Scholar]
  • 12.Michel C.J. An extended genetic scale of reading frame coding. J. Theor. Biol. 2015;365:164–174. doi: 10.1016/j.jtbi.2014.09.040. [DOI] [PubMed] [Google Scholar]
  • 13.Dinman J.D. Programmed ribosomal frameshifting goes beyond viruses. Microbe. 2006;1:521–527. doi: 10.1128/microbe.1.521.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Farabaugh P.J. Programmed translational frameshifting. Annu. Rev. Genet. 1996;30:507–528. doi: 10.1146/annurev.genet.30.1.507. [DOI] [PubMed] [Google Scholar]
  • 15.Caliskan N., Peske F., Rodnina M.V. Changed in translation: MRNA recoding by -1 programmed ribosomal frameshifting. Trends Biochem. Sci. 2015;40:265–274. doi: 10.1016/j.tibs.2015.03.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Napthine S., Ling R., Finch L.K., Jones J.D., Bell S., Brierley I., Firth A.E. Protein-directed ribosomal frameshifting temporally regulates gene expression. Nat. Commun. 2017;8:15582. doi: 10.1038/ncomms15582. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Wang R., Xiong J., Wang W., Miao W., Liang A. High frequency of +1 programmed ribosomal frameshifting in Euplotes octocarinatus. Sci. Rep. 2016;6:21139. doi: 10.1038/srep21139. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.El Houmami N., Seligmann H. Evolution of nucleotide punctuation marks: From structural to linear signals. Front. Genet. 2017;8:36. doi: 10.3389/fgene.2017.00036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Seligmann H. Codon expansion and systematic transcriptional deletions produce tetra-, pentacoded mitochondrial peptides. J. Theor. Biol. 2015;387:154–165. doi: 10.1016/j.jtbi.2015.09.030. [DOI] [PubMed] [Google Scholar]
  • 20.Baranov P.V., Atkins J.F., Yordanova M.M. Augmented genetic decoding: Global, local and temporal alterations of decoding processes and codon meaning. Nat. Rev. Genet. 2015;16:517–529. doi: 10.1038/nrg3963. [DOI] [PubMed] [Google Scholar]
  • 21.Michel C.J. Circular code motifs in transfer and 16S ribosomal RNAs: A possible translation code in genes. Comput. Biol. Chem. 2012;37:24–37. doi: 10.1016/j.compbiolchem.2011.10.002. [DOI] [PubMed] [Google Scholar]
  • 22.Michel C.J. Circular code motifs in transfer RNAs. Comput. Biol. Chem. 2013;45:17–29. doi: 10.1016/j.compbiolchem.2013.02.004. [DOI] [PubMed] [Google Scholar]
  • 23.Michel C.J. A 2006 review of circular codes in genes. Comput. Math. Appl. 2008;55:984–988. doi: 10.1016/j.camwa.2006.12.090. [DOI] [Google Scholar]
  • 24.Fimmel E., Strüngmann L. Mathematical fundamentals for the noise immunity of the genetic code. Biosystems. 2018;164:186–198. doi: 10.1016/j.biosystems.2017.09.007. [DOI] [PubMed] [Google Scholar]
  • 25.Luisi P.L. Prebiotic metabolic networks? Mol. Syst. Biol. 2014;10:729. doi: 10.1002/msb.20145351. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Ying J., Lin R., Xu P., Wu Y., Liu Y., Zhao Y. Prebiotic formation of cyclic dipeptides under potentially early Earth conditions. Sci. Rep. 2018;8:936. doi: 10.1038/s41598-018-19335-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Shu W., Yu Y., Chen S., Yan X., Liu Y., Zhao Y. Selective formation of Ser-His dipeptide via phosphorus activation. Orig. Life Evol. Biospheres. 2018;48:213–222. doi: 10.1007/s11084-018-9556-7. [DOI] [PubMed] [Google Scholar]
  • 28.Wieczorek R., Adamala K., Gasperi T., Polticelli F., Stano P. Small and random peptides: An unexplored reservoir of potentially functional primitive organocatalysts. The case of Seryl-Histidine. Life. 2017;7:19. doi: 10.3390/life7020019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Fimmel E., Michel C.J., Strüngmann L. n-Nucleotide circular codes in graph theory. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016;374:20150058. doi: 10.1098/rsta.2015.0058. [DOI] [PubMed] [Google Scholar]
  • 30.Fimmel E., Michel C.J., Starman M., Strüngmann L. Self-complementary circular codes in coding theory. Theory Biosci. 2018;137:51–65. doi: 10.1007/s12064-018-0259-4. [DOI] [PubMed] [Google Scholar]
  • 31.Kun Á., Radványi Á. The evolution of the genetic code: Impasses and challenges. Biosystems. 2018;164:217–225. doi: 10.1016/j.biosystems.2017.10.006. [DOI] [PubMed] [Google Scholar]
  • 32.Johnson D.B.F., Wang L. Imprints of the genetic code in the ribosome. Proc. Natl. Acad. Sci. USA. 2010;107:8298–8303. doi: 10.1073/pnas.1000704107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Yarus M. The genetic code and RNA-amino acid affinities. Life. 2017;7:13. doi: 10.3390/life7020013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Zagrovic B., Bartonek L., Polyansky A.A. RNA-protein interactions in an unstructured context. FEBS Lett. 2018;592:2901–2916. doi: 10.1002/1873-3468.13116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Saad N.Y. A ribonucleopeptide world at the origin of life. J. Syst. Evol. 2018;56:1–13. doi: 10.1111/jse.12287. [DOI] [Google Scholar]

Articles from Life are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)

RESOURCES