An Approach for Searching Insertions in Bacterial Genes Leading to the Phase Shift of Triplet Periodicity

Maria A Korotkova; Nikolay A Kudryashov; Eugene V Korotkov

doi:10.1016/S1672-0229(11)60019-3

2011 Dec 23;9(4-5):158–170.

PMCID: PMC5054449 PMID: 22196359

Abstract

The concept of the phase shift of triplet periodicity (TP) was used to search for potential DNA insertions in genes from 17 bacterial genomes. A mathematical algorithm for detecting these insertions has been developed. This approach can detect potential insertions and deletions with lengths that are not multiples of three bases, especially insertions of relatively large DNA fragments (>100 bases). A new similarity measure between triplet matrices was employed to improve the sensitivity of TP phase shift detection. Sequences of 17,220 bacterial genes, each consisting of more than 1,200 bases, were analyzed, and a TP phase shift was found in ~16% of the analyzed genes (2,809 genes), which is about 4 times more than was detected in our previous work. We propose that shifts of the TP phase may indicate shifts of the reading frame in genes after insertions of DNA fragments with lengths that are not multiples of three bases. The relationship between TP phase shifts and frame shifts in genes is discussed.

Key words: triplet periodicity, insertion, gene sequence, reading frame, phase shift, change-point problem

Introduction

Small insertions of DNA fragments in genes can take place rather frequently (1, 2). If the length of such an insertion is not a multiple of three bases, it may shift the reading frame after the insertion site. These insertions can significantly change the amino acid sequence coded by the gene, and it is important to understand their contribution to the generation of reading frame changes (3, 4, 5). At present, the mathematical methods used to find changes of the reading frame can be divided into two groups. Both groups share the same feature: additional information is required besides the DNA sequence being considered. The first group of methods needs external data, including a data bank of amino acid sequences, and uses special software for similarity searches (6, 7, 8, 9). When these algorithms are used, the amino acid sequences corresponding to the alternative reading frames are created and then searched against the database. If a similarity is found, a shift of the reading frame is inferred in the analyzed gene. The second group of methods uses the nucleotide sequence of the analyzed gene to find shifts of the reading frame, with a set of gene sequences known to contain reading frame shifts serving as additional information (10, 11, 12, 13, 14). The analyzed gene is then searched for common properties intrinsic to DNA sequences in which such shifts have already been found. Such collective properties can be described in various ways, including weight matrices, k-tuple frequencies, hidden Markov models (HMMs), neural networks and other mathematical approaches (10, 11, 12, 13, 14).

However, the requirement for additional information limits the application of such methods. For the first group, the database must contain an amino acid sequence that could have existed before the reading frame shift arose in the analyzed gene, and this sequence must show significant similarity to the amino acid sequence produced from the alternative reading frame. Very often such an amino acid sequence is absent from the data bank or shows no significant similarity, in which case reading frame shifts cannot be found by this method. The limitations of the second group are of a different nature. These methods rely on general statistical properties of gene regions that are already known to contain reading frame shifts. However, as shown previously (15), the statistical properties of gene sequences may differ, so that genes belong to different classes of triplet periodicity (TP). Pooling different genes with known reading frame shifts can therefore leave most statistical features of the pooled sequences poorly expressed, which can significantly decrease the power to recognize reading frame shifts.

Recently, we reported an approach for revealing potential reading frame shifts in genes by searching for the phase shift of TP (16, 17). Its advantage is that it does not require any additional information to detect the possible positions of reading frame shifts in a gene: only the TP and its phase shift are needed (15, 16, 17). The approach uses a statistical test of the homogeneity of two multinomial samples with unknown distributions. The TP matrices to the left and to the right of position x in the analyzed coding sequence can be considered as two such samples (15, 16, 17). This is the standard “change-point problem” (18, 19, 20), applied here to the TP of a DNA sequence.

TP of coding DNA sequences is a common feature of all currently known living systems (21, 22, 23, 24, 25, 26, 27, 28, 29, 30) and is associated with the reading frame that exists in a gene (15). TP arises from the structure of the genetic code, which is practically the same in prokaryotes and eukaryotes, from the enrichment of proteins in certain amino acids (31, 32, 33) and from the GC content of the third codon position (34). If a shift of the reading frame occurs in a gene with TP, it can be revealed by the resulting offset between the existing reading frame and the TP (Figure 1). Since it is difficult to change the TP of a coding sequence significantly through a relatively small number of base substitutions (35), such an offset can persist in a gene for a rather long time. Its presence may therefore indicate the existence of a reading frame shift in the analyzed gene (17). However, the previously proposed mathematical method (17) can only detect TP phase shifts created by insertions of relatively short DNA sequences, shorter than several tens of bases. A longer inserted fragment can substantially change the TP around the site of the phase shift, which greatly complicates detection of the TP phase shift by that method.

Figure 1. The influence of a single-base insertion on the TP phase shift. The first three sequences show the reading frames T₁, T₂ and T₃, respectively. The coding sequence S with TP is shown next; the nucleotide c was inserted at position 19 of this sequence. An explicit periodicity was chosen for clarity. With “fuzzy” periodicity the situation is the same as in the figure, but the periodicity is difficult to observe visually. The TP matrices M₁(1, 18), M₁(19, 37), M₂(19, 37) and M₃(19, 37) are then constructed. The first matrix, M₁(1, 18), is constructed for the DNA region from the 1st to the 18th base. The elements m₁(i, j), m₂(i, j) and m₃(i, j) of these matrices give the number of the bases a, t, c and g (index i) at each position in the triplet (index j) for the reading frames T₁, T₂ and T₃. If we compare the matrix M₁(1, 18) with the matrices M₁(19, 37), M₂(19, 37) and M₃(19, 37), it can be seen that it is most similar to the matrix M₂(19, 37). The initial phases of the matrices M₁, M₂ and M₃ in sequence S are equal to 1, 2 and 3 because the bases of sequence S with indices k equal to 1, 2 and 3 are the first bases of the triplet in the reading frames T₁, T₂ and T₃. Therefore, there is a TP phase shift by 1 base in sequence S after the position x=18 (the difference between the initial phases of the matrices M₂ and M₁).

There are two goals for the present study. First, we aimed to develop a new mathematical method for revealing TP phase shifts that accounts for possible reading frame shifts caused by insertions of relatively large DNA fragments (>100 bases). Second, we aimed to verify the presence of relatively long insertions (with lengths that are not multiples of three) in bacterial genes by applying the improved algorithm. Our results show that approximately 16% of bacterial genes from the 17 studied genomes have TP phase shifts that may be caused by insertions of relatively long DNA fragments.

Method

Algorithm for searching the TP phase shift

The algorithm was developed on the basis of our previous study (17). We assume that a coding nucleotide sequence S = {s(k), k = 1, 2, …, L} is given, where each base s(k) is chosen from the alphabet A = {a, t, c, g} and L, the length of sequence S, is a multiple of three. Let us introduce three reading frames in sequence S and denote them as T₁, T₂ and T₃ (Figure 1). The base s(1) of sequence S is the first, second and third codon base of the reading frames T₁, T₂ and T₃, respectively. T₁ actually exists in sequence S, while T₂ and T₃ can be considered hypothetical. We also define three TP matrices M₁(i₁, i₂), M₂(i₁, i₂) and M₃(i₁, i₂), which are calculated for T₁, T₂ and T₃ over the part of sequence S from i₁ to i₂, denoted as S(i₁, i₂). Moreover, m₁(i, j), m₂(i, j) and m₃(i, j) are the elements of these matrices, which give the number of bases of type i in sequence S (i=1, 2, 3, 4 for a, t, c, g, respectively) in codon position j (j=1, 2 or 3) for T₁, T₂ and T₃, respectively. Let x₁ and x₂ be two coordinates in sequence S of the form L₁+3n, where n=0, 1, 2, 3, …, (L–L₁)/3 and L₁ is a multiple of three in the range from 60 to 600. Let us consider the fragment S(x₁−L₁+1, x₁), for which we construct the TP matrix M₁(x₁−L₁+1, x₁) for T₁ of sequence S. Let us also consider the fragments S(x₂+1, x₂+L₁), S(x₂+2, x₂+L₁+1) and S(x₂+3, x₂+L₁+2), for which we construct the TP matrices M₁(x₂+1, x₂+L₁), M₂(x₂+2, x₂+L₁+1) and M₃(x₂+3, x₂+L₁+2) for T₁, T₂ and T₃, respectively, of sequence S. If an insertion of a DNA fragment with a length of (x₂−x₁+1) or (x₂−x₁+2) bases occurs right after the position x₁ in sequence S, it shifts the reading frame by one or two bases and shifts the TP phase by the same amount. In this case the matrix M₁(x₁−L₁+1, x₁) is more similar to the matrix M₂(x₂+2, x₂+L₁+1) or M₃(x₂+3, x₂+L₁+2), respectively.
If, however, there are no insertions of nucleotides after the position x₁, then the matrix M₁(x₁−L₁+1, x₁) is most similar to the matrix M₁(x₂+1, x₂+L₁) for x₁=x₂. This is a typical problem of searching for a change point (18, 19, 20) in a symbolic sequence. We then added the matrix M₁(x₁−L₁+1, x₁) to the matrix M_k(x₂+k, x₂+L₁+k−1) to form the combined matrix M (k=1, 2, 3). The matrix M has 4 rows and 3 columns and can be considered as a contingency table. The rows represent the DNA bases and the columns represent the positions of the bases in the corresponding reading frame. An example of filling the matrix M is shown in Figure 2. Then we calculated I_1k (the mutual information multiplied by 2L₁ln2) for the combined matrix M using the following formula (32):

$$I_{1k} = \sum_{i=1}^{4} \sum_{j=1}^{3} m(i,j) \ln m(i,j) - \sum_{i=1}^{4} x(i) \ln x(i) - \sum_{j=1}^{3} y(j) \ln y(j) + 2L_{1} \ln (2L_{1})$$

(1)

where $x(i) = \sum_{j=1}^{3} m(i,j)$ and $y(j) = \sum_{i=1}^{4} m(i,j)$.
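The construction of a TP matrix and the evaluation of Formula 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the first base of each window is assumed to be codon position 1 (so a window starting at x₂+k is read in frame T_k), and the convention 0·ln 0 = 0 is assumed for empty cells.

```python
import math

BASES = "atcg"  # row order i = 1..4 (a, t, c, g), as in the text


def tp_matrix(seq, start, length):
    """Return the 4x3 count matrix for seq[start .. start+length-1] (1-based).

    Assumption: the first base of the window is codon position 1, so a
    window starting at x2+k is read in frame T_k.
    """
    matrix = [[0, 0, 0] for _ in BASES]
    window = seq[start - 1:start - 1 + length].lower()
    for offset, base in enumerate(window):
        # BASES.index raises ValueError for symbols outside {a, t, c, g}
        matrix[BASES.index(base)][offset % 3] += 1
    return matrix


def add_matrices(a, b):
    """Element-wise sum, giving the combined matrix M."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def information(matrix):
    """Formula 1, with the assumed convention 0 * ln 0 = 0."""
    def xlnx(v):
        return v * math.log(v) if v > 0 else 0.0

    rows = [sum(r) for r in matrix]        # x(i)
    cols = [sum(c) for c in zip(*matrix)]  # y(j)
    total = sum(rows)                      # equals 2 * L1 for a combined matrix
    return (sum(xlnx(m) for r in matrix for m in r)
            - sum(xlnx(x) for x in rows)
            - sum(xlnx(y) for y in cols)
            + xlnx(total))
```

For a perfectly periodic toy sequence such as `"atg" * 20`, combining `tp_matrix(seq, 1, 30)` and `tp_matrix(seq, 31, 30)` gives I = 60·ln 3, while a matrix with identical counts in every cell gives I = 0.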

Figure 2. The calculation of the matrix M₁(x₁−L₁+1, x₁) and the matrices M_k(x₂+k, x₂+L₁+k−1), k=1, 2, 3 in sequence S. A. The positions of the regions with length L₁ in sequence S. B. An example of the calculation of the matrix M₁(x₁−L₁+1, x₁) and the matrix M_k(x₂+k, x₂+L₁+k−1) for L₁=21, x₁=21, x₂=30. The inserted fragment begins at the 22nd base and ends at the 31st base of the DNA sequence. The inserted fragment, shown in bold letters, has a different TP type from the rest of the DNA sequence. It can be seen that matrix M₁(1, 21) differs from matrix M₁(31, 51) and matrix M₃(33, 53) but is similar to matrix M₂(32, 52).

Then we calculated the argument of the normal distribution as follows:

$$X_{1k} = \sqrt{4 I_{1k}} - \sqrt{11}$$

(2)
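The text does not spell out where the constant √11 in Formula 2 comes from. One reading, offered here as an assumption, is that for a random 4×3 contingency table 2I_1k is approximately chi-square distributed with (4−1)(3−1)=6 degrees of freedom, and that Fisher's normal approximation √(2χ²) − √(2·df − 1) is then applied, which gives √(4I_1k) − √11. The hypothetical helper below computes X_1k under that reading.

```python
import math


def x_statistic(info, rows=4, cols=3):
    """Formula 2: approximate standard-normal argument for the value I.

    Assumption: 2*I is treated as chi-square with (rows-1)*(cols-1)
    degrees of freedom and Fisher's approximation is applied.
    """
    df = (rows - 1) * (cols - 1)  # 6 for a 4x3 TP matrix
    return math.sqrt(4.0 * info) - math.sqrt(2.0 * df - 1.0)
```

With the default 4×3 shape the second term is √11, so the function reproduces Formula 2 as printed.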

The value X_1k (k=1, 2, 3) is a measure of the TP level in the combined matrix M=M₁(x₁−L₁+1, x₁) + M_k(x₂+k, x₂+L₁+k−1). Sequence S (the nucleotide sequence of a gene) is not a random sequence, since TP is observed in the gene. Therefore, X_1k (k=1, 2, 3) cannot be used directly as a measure of similarity between matrix M₁(x₁−L₁+1, x₁) and matrices M_k(x₂+k, x₂+L₁+k−1), k=1, 2, 3. Instead, the Monte Carlo method is used to calculate the similarity measure between the two matrices M₁(x₁−L₁+1, x₁) and M_k(x₂+k, x₂+L₁+k−1). For this purpose, the sequences S(x₁−L₁+1, x₁) and S(x₂+1, x₂+L₁+2) are combined into one sequence SS(1, 2L₁+2), which is shuffled with its TP retained. To achieve this, we divided the sequence SS(1, 2L₁+2) into three subsequences. The first (denoted C₁) was obtained by taking the bases in positions i=3n+1, where n=0, 1, 2, …, from SS(1, 2L₁+2). The second and third subsequences, C₂ and C₃, were obtained by taking the bases in positions i=3n+2 and i=3n+3, respectively.

Next, the random sequences R₁, R₂ and R₃ were generated using a random number generator. They had the same lengths as the sequences C₁, C₂ and C₃, respectively. We arranged each of the sequences R₁, R₂ and R₃ in ascending order and kept track of the permutation applied to each sequence. Then we rearranged the bases in sequences C₁, C₂ and C₃ in the same way. After such shuffling of C₁, C₂ and C₃, we created a random sequence C in which positions i=3n+1, i=3n+2 and i=3n+3 were occupied by the bases of sequences C₁, C₂ and C₃, respectively. Sequence C had the same length and the same base composition as sequence SS(1, 2L₁+2). We generated such a random sequence C 500 times. Each sequence C was divided back into the sequences S(x₁−L₁+1, x₁) and S(x₂+1, x₂+L₁+2), and for these two sequences we calculated X_1k using Formula 2. For the set of values X_1k, the mean value $\bar{X}_{1k}$ and the variance D(X_1k) were determined for k=2 and k=3. With this method of shuffling sequence SS, the values of X₁₁ for SS were equal to the values of X₁₁ for each random sequence C. As the measure of similarity between the matrices M₁(x₁−L₁+1, x₁) and M_k(x₂+k, x₂+L₁+k−1), we took the value:

$$Z_{1k} = \frac{X_{1k} - \bar{X}_{1k}}{\sqrt{D(X_{1k})}}$$

(3)

where k=2, 3.
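The TP-preserving shuffle and the normalization of Formula 3 can be sketched as follows. The function names are hypothetical; the sort-by-random-key permutation follows the description above, while the use of the population variance (division by the number of shuffles) is an assumption, since the text does not say which estimator was used.

```python
import math
import random


def shuffle_keep_tp(seq, rng):
    """Shuffle bases only within each codon-position class (3n+1, 3n+2, 3n+3)."""
    out = list(seq)
    for phase in range(3):
        positions = list(range(phase, len(seq), 3))
        bases = [seq[p] for p in positions]              # subsequence C_(phase+1)
        keys = [rng.random() for _ in bases]             # random sequence R
        order = sorted(range(len(bases)), key=keys.__getitem__)
        for pos, src in zip(positions, order):           # apply R's sort permutation
            out[pos] = bases[src]
    return "".join(out)


def z_score(observed, null_values):
    """Formula 3: (X - mean) / sqrt(variance) over the shuffled values."""
    mean = sum(null_values) / len(null_values)
    variance = sum((v - mean) ** 2 for v in null_values) / len(null_values)
    if variance == 0:
        raise ValueError("all shuffled values are identical")
    return (observed - mean) / math.sqrt(variance)
```

In use, `null_values` would hold the X_1k values computed from the 500 shuffled sequences C, and `observed` the X_1k of the unshuffled pair of windows. Because bases never leave their codon-position class, each shuffled sequence keeps the per-position base counts of the original.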

Z_1k can be considered as a function of L₁ for constant values of x₁ and x₂. We calculated Z_1k(L₁) for L₁=60, 90, 120, …, 600. If x₁−1 or L−x₂ was less than 600, we calculated Z_1k(L₁) for L₁=60, 90, 120, …, min(x₁−1, L−x₂). We selected the value of L₁ that gave the maximum value of Z_1k, which means that each pair x₁ and x₂ has its own value of L₁. The maximum of Z_1k over L₁ is needed because TP is not uniform along the length of a gene and may change its type (15). Testing of various lengths L₁ showed that the search for phase shifts was most effective when no particular length L₁ was fixed and the L₁ with the maximal Z_1k was searched for instead. Searching for the maximal value of Z_1k over L₁ does not interfere with the choice of the threshold Z₀ described in the next subsection.
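The choice of L₁ described above can be sketched as a small search over candidate window lengths. The helper names are hypothetical, and `z_for_length` stands for any callable that returns Z_1k for a given L₁ at fixed x₁ and x₂; it is a placeholder rather than part of the published method.

```python
def candidate_window_lengths(x1, x2, gene_length, lo=60, hi=600, step=30):
    """L1 = 60, 90, ..., capped by 600 and by min(x1 - 1, L - x2) as in the text."""
    limit = min(hi, x1 - 1, gene_length - x2)
    return list(range(lo, limit + 1, step))


def best_window(z_for_length, lengths):
    """Return (L1, Z) for the window length with the largest Z.

    Raises ValueError if lengths is empty, i.e. the position is too close
    to a gene boundary for even the shortest window.
    """
    best = max(lengths, key=z_for_length)
    return best, z_for_length(best)
```

For example, with x₁=200 and x₂=900 in a gene of 1,500 bases the candidate lengths are 60, 90, 120, 150 and 180, because the left window cannot extend past the start of the gene.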

Maximal similarity between the matrices M₁(x₁−L₁+1, x₁) and M_k(x₂+k, x₂+L₁+k−1) (k=2, 3) corresponds to the maximal value of Z_1k for some k (k=2 or k=3 indicates the insertion of a fragment of length 3n+1 or 3n+2, respectively). The density distribution of Z_1k for different values of X₁₁ is shown in Figure 3. It can be seen that the distributions of Z₁₂ are similar for the different values of X₁₁. The same picture is observed for Z₁₃. These data show that the use of the Monte Carlo method minimizes the influence of the TP of sequence S on the distribution of Z_1k. This allows Z_1k to be used as a quantitative measure of the relationship between the matrix M₁(x₁−L₁+1, x₁) and the matrices M_k(x₂+k, x₂+L₁+k−1), k=2, 3. If there is no similarity between the matrices, the values of Z_1k (k=2, 3) are small (typically less than 3.0), while in the presence of such similarity the values of Z_1k are large. It remains only to determine the threshold level Z₀ for Z_1k, k=2, 3. If Z₀>Z_1k, there is no statistically significant similarity between the matrices and no TP phase shift between coordinates x₁ and x₂. If Z₀<Z_1k, the matrices are similar and a TP phase shift exists between coordinates x₁ and x₂. The selection of Z₀ is discussed in the next subsection.

Figure 3. The density distribution of Z₁₂ for different values of X₁₁. Symbols: • for X₁₁=0; ○ for X₁₁=6.0; ▼ for X₁₁=12.0.

We varied x₁ from L₁ to L−L₁+1 and x₂ from x₁ to L−L₁+1. For each value of x₁ in sequence S we found the value of x₂ that gives the maximum value of Z_1k (k=2, 3). Let this maximum be denoted mZ_1k. Plots of mZ_1k against x₁ and against x₂ were then obtained for k=2, 3; in these plots neighboring points are joined by lines for clarity. Let us assume that sequence S contains an insertion of a DNA fragment, with a length that is not a multiple of three bases, between positions $x_{1}^{0}$ and $x_{2}^{0}$. Then the maximum of mZ_1k as a function of x₁ occurs at position $x_{1}^{0}$, and the maximum of mZ_1k as a function of x₂ is observed near position $x_{2}^{0}$ (for the appropriate k); that is, the points of x₂ tend to group near and after $x_{2}^{0}$. If the insertion has a length equal to x₂−x₁+1, the largest values of mZ_1k are observed for k=2, and if the insertion has a length of x₂−x₁+2, the largest values of mZ_1k are obtained for k=3. The graph of the function mZ_1k(x₁) is a “mountain” with its apex at $x_{1} = x_{1}^{0}$. The graph of the function mZ_1k(x₂) looks like a “wall” whose boundary is position $x_{2}^{0}$ (examples of such plots are shown in Results). Such graphs allow the positions $x_{1}^{0}$ and $x_{2}^{0}$ in the gene to be predicted approximately (to within several tens of bases).
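The scan over x₁ and x₂ can be sketched as follows. This is a simplified illustration: `z_fn(x1, x2)` is a hypothetical callable returning Z_1k for one k, and a single fixed window length is assumed, whereas the method described above selects its own L₁ for every pair of coordinates.

```python
def mz_profile(z_fn, gene_length, window, step=3):
    """For each x1, find the x2 >= x1 that maximizes Z and record (x2, mZ).

    Coordinates run over window, window + 3, ... up to L - window + 1,
    following the ranges given in the text.
    """
    last = gene_length - window + 1
    profile = {}
    for x1 in range(window, last + 1, step):
        candidates = ((x2, z_fn(x1, x2)) for x2 in range(x1, last + 1, step))
        profile[x1] = max(candidates, key=lambda pair: pair[1])
    return profile


# Toy score peaking at x1 = 300, x2 = 600 (purely illustrative).
def toy_z(x1, x2):
    return 10 - abs(x1 - 300) / 30 - abs(x2 - 600) / 30


profile = mz_profile(toy_z, gene_length=1200, window=60)
apex_x1 = max(profile, key=lambda x1: profile[x1][1])
```

Plotting the mZ values against the keys of `profile` gives the “mountain” over x₁, and plotting them against the stored x₂ values gives the “wall”; in the toy case the apex is at x₁=300 with its best partner at x₂=600.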

Application of Monte Carlo method to determine Z₀

To find the threshold value Z₀, we used the gene sequences from 17 bacterial genomes (Table 1) from the KEGG database (36). We created a data bank of random sequences by shuffling the bases of each gene sequence, which keeps the same length distribution for the random sequences as for the studied genes from the 17 bacterial genomes. To keep the TP in the random sequences, the shuffling was performed in the same manner as described above. After such shuffling, only the TP phase shifts caused by random factors remain. Sequences from this data bank therefore had the same lengths and TP as the genes of the 17 bacterial genomes studied. We chose some level of Z₀ (for example, Z₀=4.0) and calculated the number of genes that had at least one value of mZ_1k>Z₀ for k=2 or 3 (as described above). This calculation was performed for the gene sequences from the 17 bacterial genomes and for the random sequences from the created data bank (numbers N₁ and N₂, respectively). We calculated N₂/N₁ and increased Z₀ until N₂/N₁ became equal to 0.22. This occurred at Z₀=8.0, where the number of TP phase shifts found in random sequences was about 22% of the number of shifts revealed in the 17 bacterial genomes. Therefore, the level Z₀=8.0 can be chosen as the threshold, because the admixture of TP shifts due to purely random factors can be considered relatively small.
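The threshold search can be sketched as a simple loop. The function name, the step of 0.5 and the stopping rule N₂/N₁ ≤ target are assumptions made for illustration; the text states only that Z₀ was increased from an initial level until the ratio reached 0.22.

```python
def choose_threshold(real_max, shuffled_max, target=0.22,
                     start=4.0, step=0.5, stop=20.0):
    """Raise Z0 until N2/N1 drops to the target ratio.

    real_max / shuffled_max: per-gene maxima of mZ for real genes and for
    their TP-preserving shuffles. Returns (Z0, N1, N2) or None.
    """
    z0 = start
    while z0 <= stop:
        n1 = sum(v > z0 for v in real_max)      # real genes above threshold
        n2 = sum(v > z0 for v in shuffled_max)  # random sequences above threshold
        if n1 > 0 and n2 / n1 <= target:
            return z0, n1, n2
        z0 += step
    return None


# Toy data, not taken from the paper.
real = [9, 9, 9, 9, 9, 5, 5, 5, 5, 3]
shuffled = [8.2, 5, 5, 5, 3, 3, 3, 3, 3, 3]
result = choose_threshold(real, shuffled)
```

On the toy data the ratio is 4/9 at Z₀=4.0 and 4.5 and falls to 1/5 at Z₀=5.0, which is where the loop stops.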

Table 1.

List of the prokaryotic genomes used for searching the genes with triplet periodicity shifts

No.	Genome	No. of analyzed genes (>1,200 bp)	Q1	Q2	Q3
1	Arcobacter butzleri	611	5	30	3
2	Azotobacter vinelandii	1,306	43	140	29
3	Bordetella avium	885	42	108	34
4	Burkholderia mallei	1,380	116	240	103
5	Bacillus subtilis	937	50	145	35
6	Escherichia coli	1,158	101	237	75
7	Lactobacillus fermentum	444	16	49	14
8	Methylococcus capsulatus	854	41	111	38
9	Pseudomonas aeruginosa	1,566	38	162	29
10	Staphylococcus aureus COL	626	28	91	24
11	Salmonella enterica Choleraesuis	1,187	109	227	86
12	Streptococcus pneumoniae	507	27	82	25
13	Shigella sonnei	1,183	175	286	141
14	Salmonella typhimurium	1,200	94	220	68
15	Vibrio cholerae	1,047	61	176	41
16	Xanthomonas campestris	1,245	85	253	62
17	Yersinia pseudotuberculosis YPIII	1,084	119	252	98

	Total	17,220	1,150	2,809	905


Note: Q1, Q2 and Q3 are the number of the genes with a length greater than 1,200 bp that have a TP phase shift revealed by the method developed previously (17), the method developed in the present work, and both the method developed previously (17) and the method developed in the present work, respectively.

We chose the ratio N₂/N₁=0.22 for two reasons. First, we wanted to find the upper limit of the number of genes with a phase shift of TP. Second, we intended to compare the results obtained in this paper with those obtained previously (17). The algorithm described above allows Z₀ to be calculated for any ratio N₂/N₁ and for any number of tested genes. The level of Z₀ depends on the number of analyzed genes.

Results

Analysis of genes from 17 bacterial genomes

First, we studied the TP phase shift in an artificial periodic sequence. To create it, we took the sequence of the transaldolase B gene from the genome of Escherichia coli (b0008 in KEGG database) and inserted, after the 300th base, a random sequence of 298 bases that had the same base frequencies as the transaldolase B gene and a TP level greater than 5.0 (Formula 2). The insertion of this fragment creates a TP phase shift (and a shift of the reading frame) after the 598th base of the gene. The sequence of b0008 was first analyzed without the insertion: neither mZ₁₂(x₁) (Figure 4A) nor mZ₁₂(x₂) (Figure 4B) contained values of mZ₁₂ higher than 5.0, well below the selected threshold value of 8.0. This result suggests that the analyzed sequence contains a homogeneous TP and that the matrix M₁(x₁−L₁+1, x₁) is always more similar to the matrix M₁(x₂+1, x₂+L₁) than to the matrices M_k(x₂+k, x₂+L₁+k−1), k=2, 3, indicating the absence of TP phase shifts in the gene.

Figure 4. Dependence of mZ₁₂ on x₁ (A) and x₂ (B) for the gene encoding transaldolase B from the genome of *E. coli* (b0008 in KEGG database).

A completely different pattern is observed in this gene after the artificial insertion of a random sequence (298 nucleotides after the 300th base), which shifts the TP phase by one base to the right after the 598th base. The graph of the function mZ₁₂(x₁) (Figure 5A) looks like a “mountain”: as x₁ moves from the 1st to the 300th base, the value of mZ₁₂ increases and reaches its maximum at the 300th base. The value of mZ₁₂ then decreases because different triplet periodicities are combined after the 300th base. The graph of mZ₁₂ against x₂ (Figure 5B) looks like a “wall”, and the values of x₂ group together after the 600th base. This graph shows that the maximal similarities between the matrix M₁(x₁−L₁+1, x₁) and the matrix M₂(x₂+2, x₂+L₁+1) are observed mainly for x₂≥600. Thus, this artificial example shows that the mZ₁₂(x₁) and mZ₁₂(x₂) graphs reveal the TP phase shift and allow the boundaries of the insertion, $x_{1}^{0}$ and $x_{2}^{0}$, to be predicted to within several tens of DNA bases.
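The construction of such a test sequence can be sketched as follows. The function names are hypothetical, and the sketch only matches the base frequencies of the template in expectation; it does not enforce the additional condition from the text that the inserted fragment itself has a TP level greater than 5.0.

```python
import random


def random_fragment(template, length, rng):
    """Draw bases with the template's base frequencies (in expectation)."""
    return "".join(rng.choice(template) for _ in range(length))


def insert_fragment(gene, fragment, after):
    """Insert fragment after the 1-based position `after` of gene."""
    return gene[:after] + fragment + gene[after:]


def phase_shift(fragment):
    """Shift of the downstream reading frame, in bases (0, 1 or 2)."""
    return len(fragment) % 3


rng = random.Random(1)
gene = "atg" * 400                       # stand-in for a real gene sequence
fragment = random_fragment(gene, 298, rng)
mutated = insert_fragment(gene, fragment, 300)
```

A 298-base insertion after position 300 ends at position 598, and since 298 = 3·99 + 1 the downstream sequence is displaced by one base relative to the original frame, which is the shift examined with mZ₁₂.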

Figure 5. Dependence of mZ₁₂ on x₁ (A) and x₂ (B) for the gene encoding transaldolase B from the genome of *E. coli* (b0008 in KEGG database) with insertion of 298 nucleotides after the 300th nucleotide.

Then the genes with a length of more than 1,200 bp from 17 bacterial genomes were analyzed by the method developed. The genomes are listed in Table 1. This length was chosen for the following reason. For a correct statistical estimation, each cell of the matrices M₁(x₁−L₁+1, x₁) and M_k(x₂+k, x₂+L₁+k−1) should contain, on average, no fewer than 10 counts (37). Each of these matrices has 12 cells, which gives a lower limit of L₁=120. Therefore, at least 240 bases are required to detect a TP phase shift. Accordingly, we used genes longer than 1,200 nucleotides in order to reduce the influence of boundary effects on the performance of the algorithm. The total number of genes selected in the studied genomes was 17,220. We identified 1,526 genes with insertions of the type 3n+1 and 1,283 genes with insertions of the type 3n+2 (n=0, 1, 2, …) for which mZ_1k>Z₀, i.e., the total number of genes with insertions was 2,809 (Table S1). This constitutes 16.3% of the studied genes with a length of more than 1,200 bases. In parallel, the random sequences from the created data bank were analyzed and 622 sequences with mZ_1k>Z₀ were found (see Method). Our analysis shows that the proportion of false positives in this case does not exceed 22%.
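The figures quoted in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative; all numbers are taken from the text and Table 1.

```python
analyzed_genes = 17220           # genes longer than 1,200 bp (Table 1)
type_3n1, type_3n2 = 1526, 1283  # genes with 3n+1 and 3n+2 insertions
random_hits = 622                # shuffled sequences with mZ above Z0

shifted_genes = type_3n1 + type_3n2
fraction_shifted = shifted_genes / analyzed_genes
random_ratio = random_hits / shifted_genes  # N2 / N1

cells, min_count_per_cell = 12, 10
min_window = cells * min_count_per_cell     # lower limit of L1
min_bases = 2 * min_window                  # left window plus right window

print(shifted_genes, round(100 * fraction_shifted, 1), round(random_ratio, 2))
print(min_window, min_bases)
```

The two type counts sum to 2,809, which is 16.3% of the analyzed genes, and 622/2,809 is about 0.22, in line with the ratio N₂/N₁ used to set Z₀.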

Examples of x₁ and x₂ determination

Let us consider several examples of genes with revealed insertions for which mZ_1k>Z₀. In all examples only the values mZ_1k>4.0 are shown, to reduce the influence of statistical noise on the plots. The first example is shown in Figure S1 for the gene coding for glycosyl transferase from the genome of E. coli. Figure S1A shows that $x_{1}^{0} \approx 760$ and Figure S1B shows that $x_{2}^{0} \approx 760$. According to our method, these $x_{1}^{0}$ and $x_{2}^{0}$ values indicate that this gene has a short insertion with a length equal to 3n+2 (n=0, 1, …, 10) (function mZ₁₂) or a deletion with a length equal to 3n+1 (n=0, 1, 2, …). This example suggests that the method can detect TP phase shifts caused by relatively short insertions or deletions. The second example, a gene with an insertion whose length is not a multiple of three bases, is shown in Figure S2 for the gene coding for molybdenum cofactor biosynthesis protein A from the genome of Burkholderia mallei. Figure S2A shows that $x_{1}^{0} \approx 950$, and Figure S2B shows that $x_{2}^{0} \approx 1100$, indicating that this method can detect insertions with a length of about 100 nucleotides (3n+1, n≈33). The third example (Figure S3) is the gene encoding ubiquinone oxidoreductase, chain G, from the genome of E. coli (b2283 in KEGG database), where $x_{1}^{0} \approx 850$ and $x_{2}^{0} \approx 1100$, i.e., the insertion size is about 250 bases. This gene has an insertion with a length equal to 3n+2, n≈83.

From these examples, it is possible to see that the accuracy of the boundaries $x_{1}^{0}$ and $x_{2}^{0}$ is no better than ±60 bp, as noted in Method.

Searching for amino acid similarity

We have also studied the similarity of the amino acid sequences translated from position x₂ to the end of the gene using the BLAST program. The amino acid sequences were created for the current reading frame in the gene and for the hypothetical reading frame that could have existed in the gene after position x₂ before a fragment was inserted between positions x₁ and x₂. Thus, after position x₂ we have a pair of amino acid sequences, one of which actually exists while the other is hypothetical. For both the real and the hypothetical sequence, the sequence with the highest similarity was searched for in the Swiss-Prot database. For 803 pairs no significant similarity was found in Swiss-Prot for either sequence. For 1,918 pairs a similarity was found only for the actually existing sequence and not for the hypothetical one. For 84 pairs a similarity was found only for the hypothetical sequence. Only for 4 pairs was a similarity observed for both the actually existing and the hypothetical sequence. These results show that insertions with lengths that are not multiples of three bases, found by revealing the TP phase shift, can be confirmed by sequence similarity only in some cases (88 of the 2,809 pairs had a similarity for the hypothetical sequence). This is not surprising, since the insertion of a DNA fragment into the gene could have taken place a long time ago, so that the similarity is now hard to detect at a statistically significant level. Since TP changes slowly (35), it allows a TP phase shift that took place a long time ago to be revealed.
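The preparation of the two amino acid sequences compared by BLAST can be sketched as follows. This is an illustrative sketch with hypothetical function names: it assumes x₂ is a multiple of three (as in Method), uses the standard codon assignments, and takes `shift` to be the number of bases (1 or 2) by which the hypothetical frame is displaced. The BLAST search itself is not shown.

```python
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon assignments in t, c, a, g order ('*' marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    dna = dna.lower()
    codons = (dna[p:p + 3] for p in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def frames_after(gene, x2, shift):
    """Return (existing, hypothetical) proteins encoded after position x2.

    Assumptions: x2 is a multiple of three, so gene[x2:] starts a codon of
    the existing frame, and the hypothetical frame is displaced by `shift`.
    """
    return translate(gene[x2:]), translate(gene[x2 + shift:])
```

For instance, `frames_after("atggcc" + "gatgaaatag", 6, 1)` returns `("DEI", "MK*")`: the existing frame reads gat, gaa, ata, while the frame displaced by one base reads atg, aaa, tag.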

Here we show an example in which a similarity was found for both the existing and the hypothetical amino acid sequences (existing and hypothetical reading frames in gene BMA1516 from KEGG database, and the amino acid sequence Q62J71_BURMA for the existing reading frame from the Swiss-Prot database). This gene codes for molybdenum cofactor biosynthesis protein A from the genome of B. mallei. As noted above, this gene has an insertion with coordinates $x_{1}^{0} \approx 870$ and $x_{2}^{0} \approx 1100$ (Figure S2). After $x_{2}^{0} \approx 1100$, the existing amino acid sequence has only one similarity, with the amino acid sequence A4LD09_BURPS, which is also a molybdenum cofactor biosynthesis protein A, but from the genome of Burkholderia pseudomallei 305. This similarity is almost 100%. However, if we study the similarity of the whole amino acid sequence Q62J71_BURMA with the sequences from Swiss-Prot, it can be seen that all other similarities of this sequence end near the 329th amino acid. This roughly corresponds to the beginning of the insertion in the gene ( $x_{1}^{0} = 870$ ). An example of such a similarity is shown in Figure S4A for the sequence Q2SA06_HAHCH. We assume that after the insertion of a DNA fragment, a gene similar to the gene BMA1516 was cut into two fragments F₁ and F₂ near base 870. The fragment F₁ (from the 1st to the 870th base) was added to some sequences as the beginning of a gene. Thus, multiple similarities arose between the region from the 1st to the 320th amino acid and the amino acid sequences of different proteins.

For the hypothetical sequence, the similarity search showed that the fragment from the 320th amino acid to the end of the amino acid sequence had many similarities with different proteins, but only with the region from the 1st to the 260th amino acid of these proteins. An example of such a similarity is shown in Figure S4B. It can be assumed that the fragment F₂ was attached to some other DNA fragment as the beginning of a gene, and that the second reading frame became the coding reading frame in this fragment.

It is also possible that the gene BMA1516 was created by a fusion of three fragments. The first fragment (called E₁) is similar to the gene coding for the amino acid sequence Q2SA06_HAHCH. It was joined to a second, relatively short fragment (referred to as E₂) with a length approximately equal to 3n+2, n≈83, after which the reading frame was changed. Then the fragment E₃, which is similar to the gene coding for the sequence Q63SW3_BURPS, was attached to the end of the fragment E₂. Since E₃ has a TP type similar to that of E₁, it is possible to reveal the TP phase shift after the joining of E₃. Whichever of the two hypotheses considered here is valid, the fragment F₂ (E₂+E₃=F₂) or its major part E₃ may code for a functionally important protein in two different reading frames.

Discussion

Deletions of DNA fragments whose length is not a multiple of three bases are detected by this method as insertions of one or two DNA bases. The approach developed here therefore reveals the whole set of TP phase shifts caused by deletions and insertions of DNA fragments. In this study, 2,809 genes were found in which two regions with the same type of TP are separated by an insertion of nucleotides. This number constitutes approximately 16.3% of the total number of analyzed genes, whereas ~4% of the genes have deletions and short insertions (17). Therefore, it can be assumed that insertions of long DNA fragments are several times more frequent than deletions and short insertions.

In the revealed genes, the reading frame and the TP were clearly linked initially, and the shift between them was formed only after the insertion of DNA fragments. The relatively large percentage of genes with a TP phase shift may suggest that a shift of the reading frame in a gene is not a very dramatic event for the encoded protein. It also means that the genetic code must somehow be adapted to these events (33, 34, 35, 38). If a large percentage of genes with shifts of the reading frame is observed, then the new amino acid sequence should often have some biological function that can be picked up by evolution. This may explain the relatively large percentage of genes with a TP phase shift.

It is unlikely that the observed TP phase shifts are related to sequencing errors. The best-studied bacterial genomes, which to date have been sequenced more than once, were chosen for this work. In this case, the probability of sequencing errors in the form of a deletion or insertion of one or two DNA bases is significantly reduced, whereas substitutions of DNA bases do not create triplet phase shifts. However, the disappearance of start or stop codons because of sequencing errors could lead to errors in gene identification. In this case, additional non-coding DNA fragments may be joined to an actually existing gene, which has relatively little effect on the TP phase shifts. This means that regions with different types of TP may be connected, but TP phase shifts are absent. Furthermore, the insertion of long DNA fragments can hardly be caused by sequencing errors, since sequencing errors create deletions or insertions of only small DNA fragments (usually one or two bases).

In the present study we found a lower bound on the number of genes that contain a shift between the reading frame and the TP. In reality the number of these genes may be larger, since the approach works well only when the number of insertions or deletions is small. If the density of insertions and deletions exceeds one insertion or deletion per a few tens of bases (~60), then detection of the deletions and insertions by this algorithm is not always possible. As a result, a statistically significant value of mZ_1k cannot be obtained for such a gene.

The mathematical approach used in this paper is an extension of the method that was used earlier to reveal the TP phase shift (17). The modifications are as follows. Firstly, two triplet matrices (to the left and to the right of the position x, see Method) were compared using the level of similarity rather than the level of difference. This is more accurate, since it allows us to ignore any position x at which the matrix M₁(x₁−L₁+1, x₁) is not similar to any of the matrices M_k(x₂+k, x₂+L₁+k−1), k=1, 2, 3. This situation may arise due to the splicing of gene fragments (39), and in this case the difference between the matrices M₁(x₁−L₁+1, x₁) and M₂(x₂+1, x₂+L₁) may be greater than the difference between the matrices M₁(x₁−L₁+1, x₁) and M_k(x₂+k, x₂+L₁+k−1), k=2, 3, due to the existence of certain classes of TP (15). Accordingly, a difference-based measure could identify the splicing of genes as a TP phase shift. Using similarity for the matrix comparison eliminates the possibility of identifying the splicing of genes with different triplet frequencies as a TP phase shift.

Secondly, in this paper we have developed an approach for revealing insertions of long DNA fragments whose lengths are not multiples of three DNA bases. The method proposed earlier (17) cannot find the TP phase shift upon insertion of long DNA fragments (>100 bp) in genes. The method proposed in the present work reveals TP phase shifts after the insertion of a DNA fragment of any length that is not a multiple of three bases. We compared the results of this study with those obtained previously (17) for genes longer than 1,200 bp from bacterial genomes. The comparison was performed for the same level of false positives (~22%, see Method). From these results, it is clear that the examination of large DNA insertions reveals about 2.4 times more genes with a TP phase shift (columns Q1 and Q2) than were revealed previously (17). In addition, it identifies ~80% of the genes with a phase shift that were identified earlier (columns Q1 and Q3). The remaining 20% of genes have no TP phase shift, but rather a splicing of gene fragments. This error can occur in the algorithm developed earlier (17) because it uses a measure of matrix difference, as noted above.
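The comparison of a left-hand triplet matrix with three phase-shifted right-hand matrices can be sketched as follows. This is a minimal illustration under stated assumptions, not the measure used in this work: the Pearson correlation of the flattened 4×3 count matrices is a stand-in for the similarity measure defined in Method, the function names are hypothetical, and L₁ is assumed to be a multiple of three so that k=1 corresponds to an unchanged phase.

```python
import math

BASES = "ACGT"


def triplet_matrix(seq: str, start: int, end: int) -> list[list[int]]:
    """Count bases by triplet position in seq[start:end] (0-based, half-open).

    Rows are the bases A, C, G, T; columns are the positions 0, 1, 2
    within a triplet, counted from the start of the window.
    """
    matrix = [[0, 0, 0] for _ in BASES]
    for offset, base in enumerate(seq[start:end]):
        if base in BASES:
            matrix[BASES.index(base)][offset % 3] += 1
    return matrix


def similarity(a: list[list[int]], b: list[list[int]]) -> float:
    """Stand-in similarity: Pearson correlation of the flattened matrices."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    norm_x = math.sqrt(sum((p - mean_x) ** 2 for p in x))
    norm_y = math.sqrt(sum((q - mean_y) ** 2 for q in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def phase_scores(seq: str, x1: int, x2: int, L1: int) -> dict[int, float]:
    """Compare M1(x1-L1+1, x1) with Mk(x2+k, x2+L1+k-1) for k = 1, 2, 3.

    x1 and x2 are 1-based coordinates, as in the text.
    """
    if L1 % 3 != 0:
        raise ValueError("this sketch assumes L1 is a multiple of three")
    left = triplet_matrix(seq, x1 - L1, x1)
    scores = {}
    for k in (1, 2, 3):
        right = triplet_matrix(seq, x2 + k - 1, x2 + k - 1 + L1)
        scores[k] = similarity(left, right)
    return scores


# Synthetic sequence with perfect periodicity and one extra base in the middle.
seq = "ACG" * 50 + "T" + "ACG" * 50
scores = phase_scores(seq, x1=150, x2=150, L1=60)
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

In this synthetic case the highest score is obtained for k=2, that is, when the right-hand window starts one base later, which is the signature of a one-base phase shift. If none of the three scores were high, the position would be ignored rather than reported as a phase shift, which is the advantage of a similarity-based comparison described above.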

The computational complexity of the analysis of a sequence S of length L using our method is O(L²). About 20 hours were required for the analysis of 17,220 genes longer than 1,200 bp. We used 5 AMD Phenom II X4 processors for the calculations. This result shows that a computer cluster with 100 processors or more is required for the analysis of all known genes from the KEGG data bank (~18×10⁶ genes). Even in this case, the calculation could take months.
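A rough extrapolation of the run time can be made from the figures above. The sketch below is only a back-of-the-envelope estimate: it assumes that the cost scales linearly with the number of genes, that all processors are equivalent and perfectly parallel, and that the KEGG genes have a length distribution similar to that of the analyzed set. None of these assumptions is stated in the text, and the function name is hypothetical.

```python
def estimate_cluster_days(
    genes_analyzed: int,
    hours_observed: float,
    processors_used: int,
    genes_total: float,
    cluster_processors: int,
) -> float:
    """Linearly extrapolate wall-clock days for a larger gene set.

    Assumes perfect parallel scaling and equal per-gene cost, which is
    an idealization for an O(L^2) algorithm applied to genes of varying length.
    """
    processor_hours_per_gene = hours_observed * processors_used / genes_analyzed
    total_processor_hours = processor_hours_per_gene * genes_total
    return total_processor_hours / cluster_processors / 24.0


cluster_days = estimate_cluster_days(
    genes_analyzed=17_220,
    hours_observed=20.0,
    processors_used=5,
    genes_total=18e6,
    cluster_processors=100,
)
print(f"{cluster_days:.1f} days on 100 processors")
```

Under these idealized assumptions the estimate is roughly 44 days on 100 processors; the real run time would depend on the actual gene lengths and on the parallel efficiency, and could well be longer.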

A search for TP phase shifts using the Fourier transform was also reported in a previous study (40), which showed that the method is able to reveal artificial insertions or deletions of bases in genes. However, no data were presented on the detection of real TP phase shifts in genes from the E. coli genome. In addition, a large window (750 bp or more) was used in that study, which can severely complicate the detection of TP phase shifts in genes.

The spectral rotation measure (SRM) has been used to search for TP in coding sequences ⁴¹^,⁴². It can also be applied to search for TP phase shifts (41). However, the reading frame identification was performed for a fixed window size of 351 bp (41). The use of a fixed window may lead to the omission of existing TP phase shifts that could be found with larger or smaller windows (the value 2L₁ in our work). Our approach uses TP matrices, which very accurately determine the TP type before the position x₁ and after the position x₂ in the sequence S. Using the matrices makes it possible to find TP phase shifts for all values of L₁ (L₁ is not fixed in our method). A more accurate comparison of our method with SRM will be possible once SRM is applied to the search for TP shifts in genes from bacterial genomes.

The results obtained show that an unexpectedly large number of genes (~16%) have a TP phase shift (change points of TP in gene sequences). A shift of the reading frame is likely to be a relatively neutral mutation that does not result in complete inactivation of the gene.

The mathematical approach used in this paper could be improved by using more advanced algorithms that have been applied to the search for change points ¹⁸^,¹⁹^,²⁰. In this case it will be possible to detect shifts of the reading frame caused by multiple insertions and deletions of DNA bases in different regions of a single gene.

Conclusion

A mathematical method has been developed for searching for TP phase shifts in genes. The method is based on a comparison of TP matrices. Using this method, we analyzed genes longer than 1,200 bp from 17 bacterial genomes. It was found that about 16% of the genes have TP phase shifts. We propose that these phase shifts indicate the presence of insertions and deletions of DNA fragments in genes.

Authors’ contributions

MAK prepared the software and conducted data analysis. NAK contributed valuable discussion and co-wrote the manuscript. EVK proposed the approach for searching insertions in genes, conducted data analysis and co-wrote the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors have declared that no competing interests exist.

Supplementary Material

Figures S1-S4; Table S1

mmc1.pdf (1.4 MB, PDF)

DOI: 10.1016/S1672-0229(11)60019-3

References

1. Wei Q. World Scientific Publishing; Singapore: 2007. DNA Repair, Genetic Instability, and Cancer.
2. Watson J.D. Benjamin-Cummings Publishing; San Francisco, USA: 2004. Molecular Biology of the Gene.
3. Okamura K. Frequent appearance of novel protein-coding sequences by frameshift translation. Genomics. 2006;88:690–697. doi: 10.1016/j.ygeno.2006.06.009.
4. Raes J., Van de Peer Y. Functional divergence of proteins through frameshift mutations. Trends Genet. 2005;21:428–431. doi: 10.1016/j.tig.2005.05.013.
5. Kramer E.M. A simplified explanation for the frameshift mutation that created a novel C-terminal motif in the APETALA3 gene lineage. BMC Evol. Biol. 2006;6:30. doi: 10.1186/1471-2148-6-30.
6. States D.J., Botstein D. Molecular sequence accuracy and the analysis of protein coding regions. Proc. Natl. Acad. Sci. USA. 1991;88:5518–5522. doi: 10.1073/pnas.88.13.5518.
7. Pearson W.R. Comparison of DNA sequences with protein sequences. Genomics. 1997;46:24–36. doi: 10.1006/geno.1997.4995.
8. Birney E. PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res. 1996;24:2730–2739. doi: 10.1093/nar/24.14.2730.
9. Guan X., Uberbacher E.C. Alignments of DNA and protein sequences containing frameshift errors. Comput. Appl. Biosci. 1996;12:31–40. doi: 10.1093/bioinformatics/12.1.31.
10. Antonov I., Borodovsky M. Genetack: frameshift identification in protein-coding sequences by the Viterbi algorithm. J. Bioinform. Comput. Biol. 2010;8:535–551. doi: 10.1142/s0219720010004847.
11. Kislyuk A. Frameshift detection in prokaryotic genomic sequences. Int. J. Bioinform. Res. Appl. 2009;5:458–477. doi: 10.1504/IJBRA.2009.027519.
12. Fichant G.A., Quentin Y. A frameshift error detection algorithm for DNA sequencing projects. Nucleic Acids Res. 1995;23:2900–2908. doi: 10.1093/nar/23.15.2900.
13. Médigue C. Detecting and analyzing DNA sequencing errors: toward a higher quality of the Bacillus subtilis genome sequence. Genome Res. 1999;9:1116–1127. doi: 10.1101/gr.9.11.1116.
14. Schiex T. FrameD: a flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res. 2003;31:3738–3741. doi: 10.1093/nar/gkg610.
15. Frenkel F.E., Korotkov E.V. Classification analysis of triplet periodicity in protein-coding regions of genes. Gene. 2008;421:52–60. doi: 10.1016/j.gene.2008.06.012.
16. Frenkel F.E., Korotkov E.V. Using triplet periodicity of nucleotide sequences for finding potential reading frame shifts in genes. DNA Res. 2009;16:105–114. doi: 10.1093/dnares/dsp002.
17. Korotkov E.V., Korotkova M.A. Study of the triplet periodicity phase shifts in genes. J. Integr. Bioinform. 2010;7:131. doi: 10.2390/biecoll-jib-2010-131.
18. Carlstein E., editor. Vol. 23. Institute of Mathematical Statistics; Hayward, USA: 1994. Change-Point Problems. (IMS Lecture Notes–Monograph Series).
19. Litton C.D. Wiley Series in Probability & Mathematical Statistics: Applied Probability & Statistics. John Wiley & Sons; New York, USA: 1998. Statistical Analysis of Change-Point Problems.
20. Sinha B., Rukhin A. Nova Science Publishers; Hauppauge, USA: 1995. Applied Change Point Problems in Statistics.
21. Fickett J.W. Predictive methods using nucleotide sequences. Methods Biochem. Anal. 1998;39:231–245. doi: 10.1002/9780470110607.ch10.
22. Staden R. Staden: statistical and structural analysis of nucleotide sequences. Methods Mol. Biol. 1994;25:69–77. doi: 10.1385/0-89603-276-0:69.
23. Baxevanis A.D. Predictive methods using DNA sequences. Methods Biochem. Anal. 2001;43:233–252. doi: 10.1002/0471223921.ch10.
24. Gutiérrez G. On the origin of the periodicity of three in protein coding DNA sequences. J. Theor. Biol. 1994;167:413–414. doi: 10.1006/jtbi.1994.1080.
25. Gao J. Protein coding sequence identification by simultaneously characterizing the periodic and random features of DNA sequences. J. Biomed. Biotechnol. 2005;2:139–146. doi: 10.1155/JBB.2005.139.
26. Yin C., Yau S.S. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J. Theor. Biol. 2007;247:687–694. doi: 10.1016/j.jtbi.2007.03.038.
27. Eskesen S.T. Periodicity of DNA in exons. BMC Mol. Biol. 2004;5:12. doi: 10.1186/1471-2199-5-12.
28. Bibb M.J. The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences. Gene. 1984;30:157–166. doi: 10.1016/0378-1119(84)90116-1.
29. Konopka A.K. Sequences and codes: fundamentals of biomolecular cryptography. In: Smith D.W., editor. Biocomputing: Informatics and Genome Projects. Academic Press; San Diego, USA: 1994. pp. 119–174.
30. Trifonov E.N. Elucidating sequence codes: three codes for evolution. Ann. N. Y. Acad. Sci. 1999;870:330–338. doi: 10.1111/j.1749-6632.1999.tb08894.x.
31. Eigen M., Winkler-Oswatitsch R. Transfer-RNA: the early adaptor. Naturwissenschaften. 1981;68:217–228. doi: 10.1007/BF01047323.
32. Zoltowski M. Is DNA code periodicity only due to CUF-codons usage frequency? Conf. Proc. IEEE Eng. Med. Biol. Soc. 2007;2007:1383–1386. doi: 10.1109/IEMBS.2007.4352556.
33. Antezana M.A., Kreitman M. The nonrandom location of synonymous codons suggests that reading frame-independent forces have patterned codon preferences. J. Mol. Evol. 1999;49:36–43. doi: 10.1007/pl00006532.
34. Aota S., Ikemura T. Diversity in G+C content at the third position of codons in vertebrate genes and its cause. Nucleic Acids Res. 1986;14:6345–6355. doi: 10.1093/nar/14.16.6345.
35. Korotkov E.V. The informational concept of searching for periodicity in symbol sequences. Mol. Biol. (Mosk) 2003;37:436–451.
36. Ogata H. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999;27:29–34. doi: 10.1093/nar/27.1.29.
37. Gmurman V.E. American Elsevier Publishing; New York, USA: 1968. Fundamentals of Probability Theory and Mathematical Statistics.
38. Kullback S. John Wiley & Sons; New York, USA: 1959. Information Theory and Statistics.
39. Pasek S. Gene fusion/fission is a major contributor to evolution of multi-domain bacterial proteins. Bioinformatics. 2006;22:1418–1423. doi: 10.1093/bioinformatics/btl135.
40. Masoom H., et al. 2006. A fast algorithm for detecting frame shifts in DNA sequences. In Proceedings of IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology, pp. 1–8. Toronto, Canada.
41. Kotlar D., Lavner Y. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. Genome Res. 2003;13:1930–1937. doi: 10.1101/gr.1261703.
42. Chen B., Ji P. Visualization of the protein-coding regions with a self adaptive spectral rotation approach. Nucleic Acids Res. 39: e3.


