Abstract
We present an expression measure of a gene, devised to predict the level of gene expression from relative codon bias (RCB). A number of measures that quantify codon usage in genes are currently in use. Based on the hypothesis that gene expressivity and codon composition are strongly correlated, RCB has been defined to provide an intuitively meaningful measure of the extent of codon preference in a gene. We outline a simple approach to assess the strength of RCB (RCBS) in genes as a guide to their likely expression levels and illustrate this with an analysis of the Escherichia coli (E. coli) genome. Our efforts to quantitatively predict gene expression levels in E. coli met with a high level of success. Surprisingly, we observe a strong correlation between RCBS and protein length, indicating natural selection in favour of shorter genes being expressed at higher levels. The agreement of our results with protein abundance data, microarray data and radioactive data demonstrates that the genomic expression profile obtained by our method can be applied in a meaningful way to the study of cell physiology and to more detailed studies of particular genes of interest.
Keywords: codon usage, gene expression, predicted highly expressed genes, Escherichia coli
1. Introduction
Regulation of gene expression plays a central role in defining cell fate and controlling organ formation. Genomic function can be understood at the nucleotide level, but the complexity and diversity of genomic function, leading to an emergent picture of the genome as an interacting system with many degrees of freedom, bring experimental and theoretical challenges to the quantitative measurement of the biological state, many of which are of a statistical nature. Genes encode proteins, and proteins perform functions in the cell. Hence a gene takes part in biological function only if it is expressed, i.e. the protein produced from it is present in the cell. Gene regulation takes place during transcription, the process by which the cell reads the information contained in a gene and copies it to the messenger RNA, which is subsequently used to make a functional protein. This is the most fundamental level of the biological process and involves the interaction of DNA and proteins. Its regulation takes place through the binding of proteins to DNA at specific loci in the vicinity of the gene to be regulated. The transcription of one gene may be enhanced or reduced by the expression of the gene itself. The process is complex and not yet understood completely. Genes with high expression levels include those required for an organism’s viability, and the ability to identify these genes is crucial for drug development. Certainly, the high cost and technical expertise required are an obstacle to many investigators who are interested in pursuing such studies. Although a variety of software tools and technologies have been developed for gene expression studies, a universal standard making these studies more suitable for comparative analysis and for inter-operability with other information sources is yet to emerge. Large-scale, high-throughput experimental methods require material and information processing systems to match.
The analysis of high-throughput gene expression data is at an early stage of development, and the need for advanced technology for whole-genome expression studies is increasingly recognized. Predicting the expression level of genes through computational methods is appealing because it circumvents expensive and difficult experiments.
In recent years there has been an increasing number of reports1–23,43,44 on predicted highly expressed genes in several micro-organisms, which provide a wealth of information about gene expression. It is suggested that the essential genes primarily include the ensembles of highly expressed genes that encode proteins [transcription/translational factors (TF), ribosomal proteins (RP), proteases and chaperones (CH), degradation, cellular localization, biosynthesis, metabolism, photosynthesis, respiration and glycolysis, etc.] vital for cell physiology. The essential functions of these gene products may correspond to a biased amino acid composition that minimizes the substantial energy costs of biosynthesis, indicating the high biological significance of these genes. Besides other mechanisms, it is also suggested that codon bias can influence gene expression by optimizing the translational rate, so that highly expressed genes can be characterized by their biased codon usage compared with average genes. In several previous studies,3,7–13,17 a number of different patterns of codon usage have been hypothesized and many indices have been proposed to measure the degree of codon bias. Among these, the codon adaptation index (CAI) has been widely applied to the prediction of highly expressed genes in various organisms.3,15,16,24–27 CAI was proposed as a measure of codon usage in a gene relative to that in a reference set of genes.3 Previous studies suggest that CAI correlates better with the expression level of a gene than other codon usage indices, such as the effective number of codons,7 codon bias index,8 the frequency of optimal codons,9 intrinsic codon bias index,10 maximum likelihood codon bias,11 synonymous codon bias orderliness,12 and the measure independent of length and composition (MILC).13
The parameters underlying the CAI model rely on the codon composition of only a limited set of highly expressed genes and are based on the fairly simple assumption that certain functional classes of genes are highly expressed. To define the parameters in the CAI model, Sharp and Li3 considered the codon frequency of only 24 highly expressed genes, of which 50% were RP genes and the rest mostly genes of metabolic enzymes. A related method, the codon usage model, is based on similar principles, but its parameters are derived from a somewhat broader set of highly expressed genes. In applying this model, Karlin and coworkers17–23 have shown that it is reasonable to assume that RP, CH and TF genes are highly expressed. Gene expressivity is strongly correlated with protein abundance. A number of studies have also revealed that codon composition in highly expressed genes is influenced by tRNA abundance.1–6 Generally, highly expressed genes, producing abundant proteins, use a subset of optimal codons that are recognized by the most abundant tRNA species. It is well established that highly expressed genes show strongly biased usage of synonymous codons in favour of preferred codons, which are thought to be translated most efficiently by the most abundant tRNAs, whereas lowly expressed genes have less biased codon usage patterns.1,2 These observations strongly suggest that natural selection has shaped the codon usage pattern to accommodate optimal gene expression levels for most situations of an organism's habitat, energy sources and life cycle. Codon usage varies considerably within and between organisms. The strength of natural selection on codon usage thus reflects the level of gene expression. The resulting bias in codon usage, however, has two main components: one is the correlation with tRNA availability and the other is the non-random choice between pyrimidines at the third base.
A critical analysis of codon usage in a gene shows that mutational bias also plays a role in codon selection. Several studies have analysed the relationship between the GC-content of isochores and the expression patterns of the genes they contain.28 The G + C composition resulting from mutational bias has been hypothesized to determine the major trends in codon usage of high or low G + C organisms. Within a genome, codon bias tends to be much stronger in highly expressed genes than in genes expressed at lower levels, suggesting that there might be some selective advantage in concentrating essential genes in GC-rich domains of the genome. Studies addressing this issue have, however, given conflicting results.29–33 Several papers reported very weak correlations, either negative or positive, between GC-content and gene expression. The discrepancy among the studies might be due to the methods used to measure the expression parameter of the data sets analysed or to differences in the way correlations were computed.
In fact, the characterization of regulatory elements underlying gene expression is largely an unsolved problem. The hypothesis that codon usage modulates gene expression has been generally accepted. Many researchers in this field have formulated their own measures, which has led to a large number of available methods3,7–12,17 for gene expressivity analysis. Unfortunately, these methods are not universally applicable, as they exhibit strong artefacts of their formulation with varying sequence length, overall codon bias or codon bias discrepancy. Our aim is to develop a measure that is free from such artefacts, and we attempt here to verify the usefulness of such a measure by employing it to predict gene expressivity in Escherichia coli (E. coli).
2. Materials and methods
The genome sequence of E. coli K-12 MG1655 is obtained from GenBank (accession no. NC_000913). All open reading frames (ORFs) listed as coding for proteins (confirmed and hypothetical) are considered in this study. Our approach to estimating the gene expression level is based on the codon usage difference of a gene with respect to the biased nucleotide composition at the three codon sites. Let f(x,y,z) be the normalized codon frequency for the codon triplet (x,y,z) of a gene. Then the relative codon bias (RCB) of a codon triplet (x,y,z) in a gene is defined as
$$d_{xyz} = \frac{f(x,y,z) - f_1(x)\,f_2(y)\,f_3(z)}{f_1(x)\,f_2(y)\,f_3(z)} \qquad (1)$$
where $f_1(x)$ is the normalized frequency of x at the first codon position, $f_2(y)$ is the normalized frequency of y at the second codon position, and $f_3(z)$ is the normalized frequency of z at the third codon position of the gene. The frequencies $f_1$, $f_2$ and $f_3$ are derived from the set of codons of the gene, and the frequencies are normalized over the gene length in codons, in an attempt to compensate for the expected increase of RCB with the total number of codons. We quantify the degree of codon bias of a gene in such a way that comparisons can be made both within and between genomes. As defined above, $d_{xyz}$ contains somewhat more quantitative information than other measures, since it considers codon usage as well as base compositional bias. The expression measure of a gene is then
$$\mathrm{RCBS} = \left(\prod_{i=1}^{L}\left(1 + d_{xyz}^{\,i}\right)\right)^{1/L} - 1 \qquad (2)$$
where $d_{xyz}^{\,i}$ is the codon usage difference (RCB) of the ith codon of the gene and L is the number of codons in the gene.
RCB is the difference between the observed frequency of a codon and its expected frequency under the hypothesis of random codon usage, where the base composition is biased at the three codon sites as in the sequence under study, divided by the expected frequency. RCBS is the overall score of a gene, reflecting the influence of the RCB of each codon in the gene. Our analysis is based on the hypothesis that RCB reflects the level of gene expression. The expression measure of a gene in this approach is denoted by RCBS. An RCBS value close to 0 indicates a lack of codon bias, which makes the score useful for comparing different sets of genes.
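The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that RCBS is the geometric mean of (1 + RCB) over all codon positions minus 1, that the sequence is already in frame, and that every complete codon (including any stop codon present) is counted. The function names `split_codons`, `rcb` and `rcbs` are hypothetical.

```python
from collections import Counter
from math import exp, log


def split_codons(seq):
    """Split an in-frame coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rcb(codons):
    """Return {codon: d_xyz}: (observed - expected) / expected."""
    n = len(codons)
    observed = {c: k / n for c, k in Counter(codons).items()}
    # Base counts at codon positions 1, 2 and 3 of this gene.
    pos = [Counter(c[i] for c in codons) for i in range(3)]
    d = {}
    for c, f_xyz in observed.items():
        expected = (pos[0][c[0]] / n) * (pos[1][c[1]] / n) * (pos[2][c[2]] / n)
        d[c] = (f_xyz - expected) / expected
    return d


def rcbs(seq):
    """Assumed form: geometric mean of (1 + d) over codon positions, minus 1."""
    codons = split_codons(seq)
    if not codons:
        raise ValueError("sequence contains no complete codon")
    d = rcb(codons)
    # Work in log space so long genes do not overflow the product.
    mean_log = sum(log(1.0 + d[c]) for c in codons) / len(codons)
    return exp(mean_log) - 1.0
```

For the toy sequence `AAACCC`, each codon has observed frequency 0.5 and expected frequency 0.125, so its RCB is 3 and `rcbs("AAACCC")` is approximately 3. A sequence made of a single repeated codon scores 0, since observed and expected frequencies coincide.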
3. Results
Our data set includes 4174 complete protein coding sequences from E. coli. Expression profiles of the genes are determined by calculating the RCB score (RCBS value) for each gene, and their distribution is shown in Fig. 1. The majority of genes (63%) have RCBS values lying between 0.2 and 0.4, and the mean and median values are 0.3870 and 0.3295, respectively. Only ∼18% of genes have RCBS values >0.5. The analysis of RCBS values among different gene classes shows that the gene classes (RP, CH, TF) that serve as representatives of highly expressed genes have RCBS > 0.5 in most cases. This suggests that the significantly stronger codon bias is also a result of selection for translational efficiency. This finding is consistent with others,3,17,18 as most previous expression measures have used these classes as representative standards for highly expressed genes in their calculation. There is also experimental evidence in support of RP, CH and TF as standards for highly expressed genes, as it is observed that many RPs, augmented by abundant TF and CH proteins, are needed to assure properly translated, modified and folded protein products, which expedite and regulate cellular activities in most prokaryotic genomes. Our data support the proposition that each genome has evolved a codon usage pattern accommodating gene expression level, and that an RCBS value >0.5 reflects favourable codon usage. We therefore chose this index as an expression measure on the basis that it correlates well with expression levels; the predicted expression level based on RCBS (RCBS > 0.5) suggests that almost 18% of genes in the E. coli genome qualify as highly expressed genes.
In our study, the genes are segregated into different functional categories such as metabolism, information transfer, regulation, transport, cell process, cell structure, location of gene products, extra-chromosomal, DNA sites and cryptic genes, in accordance with the Munich Information Center for Protein Sequences (MIPS) classification. Functional analysis shows that highly expressed genes involved in the location of gene products form the largest functional class, followed by genes involved in information transfer, metabolism, cell structure, cell process, extra-chromosomal, regulation and transport functions, respectively. A total of 750 genes are identified as highly expressed genes in E. coli, with 163 genes involved in energy metabolism, 75 genes involved in translation, 34 genes in transcription, and 29 in CH and folding (Supplementary Table SI). In addition, the functional classes of amino acid biosynthesis, nucleotide biosynthesis, fatty acid biosynthesis and other cofactor and small molecule biosynthesis include 67 highly expressed genes. There are also several (∼185) genes encoding predicted proteins and 15 other genes of unknown function that are predicted to be highly expressed by our approach. We observe that 24 genes encoding predicted proteins and 12 genes encoding proteins of unknown function are highly expressed genes with RCBS > 1.0. The highly expressed genes of E. coli with RCBS > 1.0 are reported in Supplementary Table SII (genes for hypothetical or predicted proteins are not listed). Of these, 11 encode proteins that function in energy metabolism, 18 are RP genes, 11 encode TF and the remainder encode proteins that function in different cell processes.
Figure 1.
Distribution of RCBS for all coding genes in the genome of E. coli.
In order to compare our results, we have also calculated CAI values for the same genes. Fig. 2 shows the relationship between RCBS and CAI values. Here, the CAI scores have been calculated according to the original publication of Sharp and Li,3 whose parameters stem from 24 highly expressed genes. For genes with high CAI values (>0.5), the two measures are positively correlated (r = 0.4614), but for genes with CAI values <0.3 the correlation is much weaker (r = −0.0572). The novel method of quantitatively predicting gene expressivity is then compared with the other widely accepted measure of Karlin and Mrázek.17 In Fig. 3, we plot RCBS values against E(g) of Karlin et al.18 The correlation is surprisingly good, with r = 0.6706, P < 0.001. We further analyse the relationship between the length of the coding regions and the expression level of genes. In Fig. 4 we plot RCBS as a function of gene length. We observe that shorter genes tend to have higher RCBS values, while longer genes tend to have lower RCBS. There is a strong correlation between RCBS and gene length (r² = 0.65878 and χ² = 0.0149). This effect is not due to a systematic bias of gene size. To investigate the effect of protein length on gene expression as measured by RCBS, the data are split into three groups: short (L < 150), intermediate (150 < L < 300) and long (L > 300). Several observations can be made. Genes are sorted according to their expression level. It should be noted that genes of the same expression level may vary widely in length, and that genes of the same length may have a wide range of RCBS. We observe that the estimate of expression level, as derived from RCBS, ranges from low to high values in each of the three length groups: RCBS ranges from 0.245 to 3.416 for L < 150, from 0.123 to 0.907 for 150 < L < 300 and from 0.079 to 1.328 for L > 300.
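The grouping by length described above can be sketched as follows. This is an illustrative outline only. The gene records in the usage note are hypothetical, the helper names are not from the paper, and genes with exactly 150 or 300 codons are placed in the intermediate group as an assumption, since the text leaves those boundaries open.

```python
def length_group(n_codons):
    """Assign a gene to the short, intermediate or long group."""
    if n_codons < 150:
        return "short"
    if n_codons <= 300:  # boundary handling is an assumption
        return "intermediate"
    return "long"


def rcbs_range_by_group(genes):
    """genes: iterable of (name, length_in_codons, rcbs_value).

    Returns {group: (lowest RCBS, highest RCBS)}.
    """
    ranges = {}
    for _name, length, score in genes:
        group = length_group(length)
        low, high = ranges.get(group, (score, score))
        ranges[group] = (min(low, score), max(high, score))
    return ranges
```

For example, `rcbs_range_by_group([("g1", 100, 0.9), ("g2", 120, 0.3), ("g3", 400, 0.1)])` reports the range 0.3 to 0.9 for the short group and a single value of 0.1 for the long group.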
It is noted that the selective pressure on codon usage appears to be lower in genes encoding long rather than short proteins. Our studies, although less extensive, suggest that selection on codon usage as well as on sequence composition is primarily responsible for RCBS. As a simple illustration, we select a set of E. coli sequences of equal length and randomize these sequences 500 times, keeping their (i) codon usage and (ii) sequence composition conserved. The RCBS values calculated for those sequences are found to vary over a wide range. We repeat the experiment on different sets of genes of varying length. The results are summarized in Supplementary Tables SIIA and SIIB. Supplementary Table SIIA describes the results for 14 arbitrary nucleotide sequences of different lengths, each randomized 500 times. In Supplementary Table SIIB, we present the results of the same experiment on a few selected genes of different lengths. We observe that shorter sequences have a greater probability of yielding a high RCBS value (>0.5), but there is nothing to prevent longer sequences from having a high RCBS. Although the values for shorter sequences are more variable owing to sampling effects, the intrinsic effect of gene length on RCBS decreases as length increases. A thorough exploration of the theoretical values of RCBS suggests that RCBS can be an effective measure of gene expression, as its value depends on the codon usage pattern along with the DNA compositional bias of a gene.
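The sampling effect of length can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the Supplementary experiment. Only the composition-preserving randomization is shown, implemented as a shuffle of the nucleotides. The test sequences are uniformly random rather than E. coli genes. RCBS is assumed to be the geometric mean of (1 + RCB) minus 1. All function names are hypothetical.

```python
import random
from collections import Counter
from math import exp, log


def rcbs(seq):
    """RCBS under the assumed geometric-mean form, computed in log space."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    n = len(codons)
    pos = [Counter(c[i] for c in codons) for i in range(3)]
    total = 0.0
    for codon, count in Counter(codons).items():
        expected = pos[0][codon[0]] * pos[1][codon[1]] * pos[2][codon[2]] / n ** 3
        # 1 + d equals observed / expected; weight by the codon count.
        total += count * log((count / n) / expected)
    return exp(total / n) - 1.0


def shuffled_rcbs(seq, rounds=500, seed=0):
    """RCBS of `rounds` nucleotide shuffles that keep base composition fixed."""
    rng = random.Random(seed)
    bases = list(seq)
    scores = []
    for _ in range(rounds):
        rng.shuffle(bases)
        scores.append(rcbs("".join(bases)))
    return scores


def random_sequence(n_codons, seed=0):
    """Uniformly random nucleotide sequence (a stand-in for a real gene)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(3 * n_codons))


if __name__ == "__main__":
    for n_codons in (30, 150, 600):
        scores = shuffled_rcbs(random_sequence(n_codons), rounds=100)
        print(n_codons, round(sum(scores) / len(scores), 3))
```

In this setup the mean RCBS of the shuffled sequences falls as the number of codons grows, even though no selection is present, which is the sampling effect the paragraph describes.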
Figure 2.
RCBS plotted against CAI for E. coli genes.
Figure 3.
RCBS plotted against E(g)18 for E. coli genes.
Figure 4.
RCBS plotted against the length of 4174 genes from the E. coli genome.
In order to test RCBS as an expression level predictor, we compare our results with experimental data. We collected data sets (listed in Supplementary Tables SIII and SIV) consisting of mRNA or protein abundance data obtained by different methods, mostly cDNA microarrays27,34,35 or 2D gel electrophoresis,36–39 in which the abundances of many E. coli proteins are available for comparison with the predicted levels of expression. In Fig. 5, we compare the predicted levels of expression in E. coli with 2D gel patterns34 and with the expression measure E(g) of Karlin et al.18 The RCBS values agree with the measured abundances in Fig. 5 better than the E(g) values of Karlin et al.18 The correlation between expression level (as relative molecular abundance) and RCBS is 0.4533, whereas that with E(g) is 0.2618. Among the 20 most abundant proteins, 17 are identified as products of highly expressed genes, the three exceptions being metE, folA and ilvE. These results are in good agreement with those predicted by E(g). Among the 20 least abundant proteins, only three mismatch with our predicted results, whereas there are seven mismatches with the results of Karlin et al.18 Although pck, nusB, valS, argS, rplL, thrS and leuS are less abundant according to 2D gel patterns, the high E(g) values of Karlin et al.18 classify these genes as highly expressed, whereas our data support only nusB, valS and rplL as highly expressed genes. Of the remaining 55 proteins, 22 are identified as products of highly expressed genes. Our predictions thus agree with the molecular abundance data better than the other measures do. As a further step, we compare RCBS with the concentrations of various proteins in E. coli, along with their CAI values24 (Supplementary Table SIV). Concentration is expressed as the number of protein molecules per cell. With concentration used as a measure of gene expression, we find that our result is surprisingly good.
The RCBS values, along with the CAI values, are plotted against the logarithm of concentration in Fig. 6. The gene expression level predicted from RCBS is found to correlate well with the protein concentration data24 (r = 0.708211). The correlation is better than that obtained with CAI (r = 0.615546). This suggests that, for this data set, a quantitative estimate of the expression level by RCBS performs better than CAI. Thus, regardless of the state of cell growth, one can estimate the relative expression level of each gene under various growth conditions, in different genetic states or over a time course during environmental change.
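A correlation of this kind can be computed as sketched below. The sketch assumes that the reported r is a Pearson coefficient taken against the base-10 logarithm of concentration, which the text suggests but does not state explicitly. The function names are hypothetical.

```python
from math import log10, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlation_with_log_abundance(scores, molecules_per_cell):
    """Correlate an expression score with log10 of protein copies per cell."""
    return pearson(scores, [log10(c) for c in molecules_per_cell])
```

Calling `correlation_with_log_abundance` once with RCBS values and once with CAI values for the same genes gives the two coefficients needed for a comparison like the one in Fig. 6.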
Figure 5.
RCBS (+) and E(g) (*) plotted against relative molecular abundance of 96 genes from E. coli genome.18 RMB denotes relative molecular abundance. X-axis is taken in logarithmic scale.
Figure 6.
CAI (+) and RCBS (*) plotted against protein concentration of 45 genes from the E. coli genome.24 X-axis is taken in logarithmic scale.
In Fig. 7 we plot radioactive data and microarray data against RCBS (Supplementary Table SV) for 117 genes identified by heat shock treatment.35 Among these, 26 genes show a high (RCBS > 0.5), 84 genes a moderate (0.2 < RCBS < 0.5) and only seven genes a low (RCBS < 0.2) level of expression. The quality of the experimental data is a very important factor here; we nevertheless observe a positive, though weak, correlation between RCBS and both the microarray and the radioactive data (rmicro = 0.2415, rradio = 0.2098).
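The three expression classes used above amount to a simple thresholding of RCBS, sketched here. The treatment of values exactly equal to 0.2 or 0.5 (assigned to the moderate class) is an assumption, as the text leaves the boundaries open, and the function names are hypothetical.

```python
from collections import Counter


def expression_class(score, low=0.2, high=0.5):
    """Label an RCBS value as 'high', 'moderate' or 'low'."""
    if score > high:
        return "high"
    if score < low:
        return "low"
    return "moderate"  # includes the boundary values (assumption)


def count_classes(scores):
    """Count how many RCBS values fall into each expression class."""
    return Counter(expression_class(s) for s in scores)
```

With this rule, rplN (RCBS 0.50496 in Table 1) is labelled high, while the genome-wide median of 0.3295 falls in the moderate class.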
Figure 7.
Radioactive data (+) and microarray data (*)35 plotted against RCBS for E. coli genes. Y-axis is taken in logarithmic scale.
In another analysis we compare our expression measure (RCBS) with the genomic expression profiles of E. coli growing on rich medium (Luria broth plus glucose) and on minimal medium (glucose) (Supplementary Tables SVA and SVB).34 Of the 76 genes expressed at significantly higher levels on Luria broth plus glucose medium, 54 show a high expression level according to our measure, whereas only 12 of the 107 genes expressed on minimal glucose medium have a high level of expression. We observe that the correlation coefficient of the minimal medium data with RCBS is positive (r = 0.3011), whereas the correlation is much weaker for the Luria broth plus glucose data. The agreement between predicted and actual protein expression levels varies greatly between the examined combinations of prediction method and data set. The discrepancy is thought to lie in the quality of the experimental data. A preliminary analysis of the quality of the experimental data shows that these kinds of experiments are inherently noisy and of low reproducibility. The reproducibility of microarray data can be evaluated by computing correlation coefficients within and among data sets from different microarray experiments. Two data sets from different sources are chosen for this analysis. The first data set was obtained from ExpressDB and compares expression levels in E. coli grown to either mid-log phase (LP) or stationary phase (SP). The second data set was obtained from the ASAP database, where E. coli is cultured in lysogeny broth (LB). The pairwise correlation coefficients among the gene expression levels from the different experiments (rLP-SP = 0.52, rLB-LP = 0.017, rLB-SP = −0.039)34 vary broadly, indicating the very noisy nature of microarray experiments and their lack of accuracy. The quality of experimental data is therefore a very important factor in this kind of analysis.
Large variances may reduce the significance of statistical tests and might hide interesting trends in complex data. Microarray data tend to suffer from noise introduced at each step of the different experimental protocols, while protein abundance data and mRNA expression levels do not agree well in all cases. The other probable reason for incoherent results is that prediction of gene expression from genomic data, based solely on codon usage, is oversimplified. Other factors, such as promoter strength and gene copy number, should also be taken into account.
We now discuss our results in more detail for different functional classes of genes. The highly expressed genes are classified into different functional categories, e.g. RPs, CH and degradation proteins, transcription and TF, energy metabolism, electron transport, recombination and repair, outer membrane proteins, aminoacyl tRNA synthetases, etc. (The distribution of highly expressed genes of different functional classes in the genome of E. coli is displayed in Supplementary Table SI.) All but one RP, the major CH/degradation proteins and the translation/transcription processing factors attain high expression levels. Supplementary Table SII presents the 52 genes with the highest predicted expression levels in E. coli. The gene for the trp operon leader peptide, trpL, involved in amino acid (tryptophan) biosynthesis, attains the highest RCBS value, 3.42, among all E. coli genes.
3.1. RP genes
RPs are very important in cell biology as they provide a range of activities required for all steps of protein biosynthesis. Following the analysis based on the definition of RCBS and Equations (1) and (2), we observe that virtually all RP genes qualify as highly expressed genes. The genes encoding RPs, which are expected to be expressed at high levels during rapid cell growth, are identified with RCBS values >0.5 (Table 1). All but one RP in E. coli are expressed at significantly higher levels; the only exception is rimK, the RP S6 modification protein, which is thought to contribute to ribosome maturation and modification. The RCBS values for highly expressed RP genes range from 0.50 to 1.77. Not all RP genes in E. coli reach the top expression level, however: 17 out of 56 are among the highest 86 highly expressed genes. The highest expression level occurs for L34, with an RCBS value of 1.77. The RPs are the major component, together with the ancillary proteins, involved in protein synthesis. The genes coding for RPs, protein synthesis factors and RNA polymerase subunits are all intermingled and organized into a small number of operons. We observe that the genes for some major translational or transcription processing factors, including tufA, tufB, fusA, fkpA, slyD, rpoB and rpoC, which are within or near the large RP operon, are predicted as highly expressed genes. Although RPs play an exclusive role in determining ribosome structure, several are multifunctional. The 50S ribosomal subunit proteins L1, L4 and L20 (rplA, rplD and rplT, respectively) and the 30S ribosomal subunit protein S8 (rpsH) have a regulatory role. The S1 gene, a giant RP gene (labelled as rpsA), is essential to E. coli and putatively contributes to the initiation of protein synthesis. S9 (rpsI) participates in certain repair activities, and S16 (rpsP) acts as an endonuclease.
Table 1.
RCBS of the highly expressed genes of different functional class in the E. coli genome
Functional class | Gene | RCBS | Gene | RCBS | Gene | RCBS | Gene | RCBS |
---|---|---|---|---|---|---|---|---|
Ribosomal | rplN | 0.50496 | rpsJ | 0.74635 | rplS | 0.87367 | rpmA | 1.08922 |
rpsD | 0.56061 | rplX | 0.75111 | sra | 0.88011 | rpmC | 1.09439 | |
rpsS | 0.60728 | rpsF | 0.75859 | rplI | 0.90076 | rplO | 1.16165 | |
rpsM | 0.61255 | rplD | 0.76302 | rpmB | 0.90877 | rpsI | 1.24694 | |
rpsG | 0.62318 | rplM | 0.79227 | rpsN | 0.91121 | rpmG | 1.2494 | |
rplF | 0.62913 | rplC | 0.79299 | rplP | 0.92341 | rpsT | 1.24983 | |
rplE | 0.67119 | rplQ | 0.80176 | rpsP | 0.92858 | rplL | 1.3063 | |
rpsH | 0.67126 | rpsB | 0.80995 | rplY | 0.9446 | rplT | 1.3222 | |
rpsK | 0.67627 | rpsA | 0.81499 | rpsL | 0.95959 | rpsO | 1.32324 | |
rpsE | 0.7021 | rplJ | 0.82165 | rplW | 1.00068 | rpmJ | 1.49921 | |
rplB | 0.71682 | rpsC | 0.84223 | rpmD | 1.00368 | rpsU | 1.60846 | |
rplV | 0.7302 | rplK | 0.84341 | rpsQ | 1.03424 | rpmI | 1.66876 | |
rplR | 0.7344 | rplA | 0.84538 | rpmF | 1.04844 | rpmH | 1.77046 | |
rplU | 0.73917 | rpmE | 0.85618 | rpsR | 1.05606 | – | – | |
Translational | Efp | 0.70878 | raiA | 0.50131 | rrfE | 1.03184 | ssrS | 0.70761 |
Ffs | 1.31636 | rrfA | 1.11799 | rrfF | 1.02752 | tsf | 0.85208 | |
Frr | 0.77909 | rrfB | 1.03184 | rrfG | 1.11995 | tufA | 0.94012 | |
fusA | 0.72335 | rrfC | 1.11995 | rrfH | 1.11995 | tufB | 0.86312 | |
infA | 0.7532 | rrfD | 1.11995 | rrlA | 1.06128 | yeiP | 0.52763 | |
Transcriptional | alpA | 0.64494 | glnB | 0.81972 | pspA | 0.71495 | rpoZ | 0.874 |
chaB | 0.91144 | greA | 0.61192 | pspB | 0.77923 | sfsB | 0.66054 | |
Crl | 0.68275 | greB | 0.52545 | relB | 0.68232 | slmA | 0.53879 | |
cspA | 1.2802 | Hha | 0.88747 | relE | 0.54866 | soxR | 0.59593 | |
cspC | 1.12974 | Hns | 0.73934 | rof | 0.65143 | soxS | 0.60395 | |
cspE | 0.87402 | metJ | 0.5234 | rpoB | 0.53467 | suhB | 0.53095 | |
deaD | 0.62977 | nusB | 0.66651 | rpoC | 0.66692 | tdcR | 0.60661 | |
flgM | 0.58028 | nusG | 0.62894 | rpoD | 0.53475 | trpR | 0.6079 | |
flhC | 0.504 | osmE | 0.55743 | rpoH | 0.51287 | – | – | |
CH and folding | ccmD | 0.81384 | groL | 0.90549 | hybG | 0.62208 | secB | 0.66081 |
dksA | 0.5747 | groS | 0.82021 | iscA | 0.66931 | skp | 0.85476 | |
dnaK | 0.65259 | hscB | 0.62877 | iscX | 0.73575 | slyD | 0.60592 | |
dsbA | 0.59085 | hslO | 0.51531 | lolA | 0.51362 | stpA | 0.74434 | |
fklB | 0.63123 | hslU | 0.49623 | narJ | 0.50787 | tig | 0.79986 | |
fkpA | 0.55943 | htpG | 0.5791 | ppiB | 0.65291 | – | – | |
fkpB | 0.51531 | hyaE | 0.56129 | ppiC | 0.70111 | – | – | |
fliT | 0.51569 | hybF | 0.51315 | rmf | 0.96923 | – | – | |
Outer membrane | csgA | 0.73214 | ompC | 1.03758 | slyB | 0.59077 | yqiG | 0.69853 |
mipA | 0.52949 | ompF | 0.63223 | tsx | 0.58718 | – | – | |
nmpC | 0.51413 | ompX | 0.90683 | yddL | 0.57797 | – | – | |
ompA | 0.79079 | pagP | 0.50225 | yqhH | 0.53974 | – | – | |
Post-translational | rimI | 0.50362 | Def | 0.50521 | napD | 0.65324 | npr | 0.66442 |
DNA repair/replication/recombination | cspD | 0.49781 | Hole | 0.70777 | ihfB | 0.58392 | rusA | 0.53058 |
dinI | 0.66454 | hupA | 0.97108 | priC | 0.58088 | ssb | 0.71106 | |
dinJ | 0.57421 | hupB | 0.74465 | rdgC | 0.51482 | xseB | 0.865 | |
fis | 0.93575 | ihfA | 0.55962 | recA | 0.60858 | yebG | 0.59001 | |
RNA modification | rluB | 0.55764 | Pnp | 0.59733 | deaD | 0.62977 | rbfA | 0.72106 |
DNA degradation | rusA | 0.53058 | xseB | 0.865 | – | – | – | – |
Degradation of Proteins/peptides/glycopeptides | hflC | 0.4998 | degP | 0.51382 | yhbO | 0.53736 | yajG | 0.55166 |
Degradation of small molecules | Pta | 0.58128 | frwB | 0.57401 | tnaC | 1.33277 | – | – |
Nucleoprotein and basic protein | Hfq | 0.51407 | Hns | 0.73934 | skp | 0.85476 | tpr | 1.29474 |
dps | 0.55438 | stpA | 0.74434 | fis | 0.93575 | – | – | |
ihfB | 0.58392 | hupB | 0.74465 | hupA | 0.97108 | – | – | |
Aminoacyl tRNA synthase | aspS | 0.52912 | lysS | 0.54138 | pheM | 2.38353 | valS | 0.52017 |
ygjH | 0.5786 | – | – | – | – | – | – | |
Energy metabolism | ||||||||
Glycolysis | eno | 0.99727 | gapA | 0.87498 | pfkA | 0.67783 | pykF | 0.62056 |
fbaA | 0.7547 | gpmA | 0.65413 | pgk | 0.76595 | tpiA | 0.80293 | |
TCA cycle | mdh | 0.55763 | sucB | 0.51856 | sucC | 0.50409 | sucD | 0.62233 |
Pentose phosphate pathway | talB | 0.58526 | tktA | 0.63261 | ||||
ATP synthase | atpA | 0.64784 | atpC | 0.51365 | atpD | 0.64873 | atpE | 1.08527 |
atpF | 0.60762 | |||||||
Pyruvate dehydronage | aceE | 0.57263 | aceF | 0.55269 | lpd | 0.56421 | ||
Aerobic respiration | cyoC | 0.53164 | hyaE | 0.56129 | nuoA | 0.54378 | nuoK | 0.61103 |
cyoD | 0.61485 | nirD | 0.70885 | nuoI | 0.59343 | |||
Anaerobic respiration | frdC | 0.73468 | hybG | 0.62208 | menB | 0.60086 | pflB | 0.75126 |
frdD | 0.72395 | hydN | 0.69364 | narH | 0.52986 | ubiC | 0.52458 | |
glpE | 0.54693 | hypA | 0.67865 | narJ | 0.50787 | |||
hybF | 0.51315 | hypC | 0.56922 | yfiD | 0.87609 | |||
Electron transport | ackA | 0.61336 | fdx | 0.61409 | fldA | 0.60624 | cybC | 0.56769 |
Flagellum biogenesis | flgB | 0.54626 | fliJ | 0.67522 | fliS | 0.52105 | fliT | 0.51569 |
fliE | 0.66739 | fliQ | 0.5854 | |||||
Transport of small molecules | nupC | 0.50273 | potC | 0.51092 | tsx | 0.58718 | ||
Salvage of nucleosides and nucleotides | apt | 0.73291 | deoC | 0.63634 | upp | 0.51826 | hpt | 0.69492 |
deoB | 0.55136 | deoD | 0.57449 | gpt | 0.56649 | |||
Central intermediary metabolism | citD | 0.59133 | folX | 0.51347 | gloA | 0.76667 | ulaD | 0.52297 |
citE | 0.51485 | mutT | 0.63455 | aspA | 0.52318 | gcvH | 0.72458 | |
fixX | 0.60213 | |||||||
Carbohydrate metabolism | eda | 0.62187 | gntK | 0.50361 | ulaB | 0.51605 | uxaC | 0.57269 |
gatB | 0.53522 | lpd | 0.56421 | ulaD | 0.52297 | uxuA | 0.59595 | |
paaB | 0.60215 | |||||||
Phosphorus metabolism | pstA | 0.51705 | pstS | 0.5871 | ppa | 0.6365 | psiF | 0.66563 |
phnG | 0.5443 | |||||||
Nitrogen metabolism | cynS | 0.53274 | glnK | 0.65458 | ||||
Sulphur metabolism | cysP | 0.51334 | ||||||
Amines metabolism | eutS | 0.57934 | ||||||
Amino acid biosynthesis | artM | 0.51962 | glnH | 0.54244 | ilvG | 1.32851 | metJ | 0.5234 |
dapD | 0.51627 | glnP | 0.596 | ilvL | 1.51982 | pheL | 2.8411 | |
fliY | 0.51995 | glyA | 0.57258 | ilvM | 0.84298 | sdaC | 0.62785 | |
glnA | 0.5114 | hisL | 1.99822 | ivbL | 1.76046 | thrL | 1.7054 | |
glnB | 0.81972 | ilvC | 0.54397 | leuL | 1.93311 | trpL | 3.41556 | |
trpR | 0.60479 | |||||||
Fatty acid biosynthesis | accA | 0.57451 | dgkA | 0.55757 | fabI | 0.54893 | ymcE | 0.60055 |
acpS | 0.55661 | fabA | 0.67664 | fabZ | 0.58465 | |||
Nucleotide biosynthesis | adk | 0.76156 | ndk | 0.79214 | purC | 0.5899 | pyrL | 1.1651 |
guaB | 0.58481 | purA | 0.53711 | |||||
Cofactor and small molecule biosynthesis | gapA | 0.87498 | mioC | 0.50538 | moaE | 0.58446 | ubiC | 0.52458 |
glyA | 0.57258 | moaC | 0.50171 | ribE | 0.59736 | |||
menB | 0.60086 | moaD | 0.61154 | thiS | 0.78241 | |||
Macromolecule biosynthesis | accB | 0.55326 | dgkA | 0.55757 | grxC | 0.79395 | mipA | 0.52949 |
acpP | 0.82199 | fimA | 0.57714 | hipB | 0.62205 | nrdH | 0.66531 | |
ccmD | 0.81384 | glgS | 0.89234 | iscR | 0.50455 | pagP | 0.50225 | |
cybC | 0.56769 | grxA | 0.55662 | lpp | 1.632 | trxA | 0.75124 | |
yfgJ | 0.72071 | |||||||
Inner membrane | ccmD | 0.81384 | metI | 0.53708 | yccF | 0.58505 | yidH | 0.53297 |
cyoC | 0.53164 | mscL | 0.57954 | ydgC | 0.55456 | yiiR | 0.51556 | |
cyoD | 0.61485 | narH | 0.52986 | yeaL | 0.50064 | yijD | 0.50746 | |
dgkA | 0.55757 | nuoA | 0.54378 | yeaQ | 0.71217 | yjeO | 0.54162 | |
frdC | 0.73468 | nuoK | 0.61103 | ygdD | 0.62392 | yjeT | 0.68009 | |
frdD | 0.72395 | nupC | 0.50273 | yhdT | 0.74646 | yncH | 0.7111 | |
glnP | 0.596 | pal | 0.86696 | yhhL | 0.62656 | ynfA | 0.60738 | |
lpp | 1.632 | yaaH | 0.7921 | yiaB | 0.65847 | |||
mdtJ | 0.61263 | ybaN | 0.55105 | yiaW | 0.64364 | |||
Transport | yjdM | 0.76533 | glnH | 0.54244 | ptsH | 0.93025 | csgF | 0.54377 |
yjgA | 0.5484 | glnP | 0.596 | potC | 0.51092 | secG | 0.75473 | |
fliY | 0.51995 | mscL | 0.57954 | pmrD | 0.5388 | mokC | 0.62148 | |
cyoC | 0.53164 | sugE | 0.51943 | yrbC | 0.54592 | yajC | 0.69682 | |
metI | 0.53708 | mdtI | 0.74374 | frwB | 0.57401 | tatA | 0.72924 | |
metQ | 0.56475 | mdtJ | 0.61263 | fryB | 0.70188 | tatE | 0.71983 | |
feoA | 0.76102 | chbA | 0.55214 | yedE | 0.50339 | cysP | 0.51334 | |
gatB | 0.53522 | chbB | 0.65397 | ygaH | 0.5262 | npr | 0.66442 | |
gspI | 0.54627 | nuoI | 0.59343 | yqaE | 1.13838 | sdaC | 0.62785 | |
crr | 0.6849 | nupC | 0.50273 | marB | 0.61754 | |||
Regulator | chpS | 0.57732 | csrC | 0.51672 | hipB | 0.62205 | yfeC | 0.5528 |
cpxP | 0.50596 | dsrA | 1.78721 | spf | 1.34529 | yiaG | 0.51628 | |
csgA | 0.73214 | dsrB | 0.75282 | sufE | 0.58559 | yifE | 0.54534 | |
csrA | 0.83793 | feoC | 0.86637 | yddM | 0.5642 | yrbA | 0.62229 |
3.2. Genes for transcription/translation processing factors
There are ∼100 genes encoding enzymes, factors and structural components that make up the translational apparatus. Of these 100 genes, 75 are identified as highly expressed, with RCBS values >0.5. Thus the majority of genes involved in translation are predicted to have a high expression level. Of these 75 highly expressed translational genes, 55 encode RPs. Highly expressed genes for transcription/translation processing factors are reported in Table 1 and can be compared with the data available.18
There are ∼260 known genes that encode factors involved in translation and ribosome modification, including the initiation and elongation factors, 34 of which are predicted to be expressed at a high level. As with RPs, the genes coding for elongation factors (efp, yeiP, fusA, tsf, tufA, tufB), ribosome recycling factor (frr) and translation initiation factor (infA), all of which play important roles in translation, register as highly expressed. The expression level of infB, which encodes the fused protein chain initiation factor, is moderately high (RCBS = 0.49017). The regulation of infB, which is downstream of and co-transcribed with the moderately expressed TF gene nusA (RCBS = 0.46579), is complex and is thought to result from autoregulation by nusA of the extent of read-through at upstream terminators. The expression level of infB is higher than that of nusA. The elongation factor efp has been shown to be essential in E. coli for protein synthesis and viability. The expression levels of the other elongation factors (fusA, tsf, tufA, tufB) are progressively higher. Interestingly, the regulation of tufB is partially dependent upon the fis gene, which encodes a global DNA-binding transcriptional regulator and itself has a significantly higher expression level (RCBS = 0.93575). Small RNA molecules are very important in cell biology and can regulate translation. The genes coding for 5S RNAs (rrfA, rrfB, rrfC, rrfD, rrfE, rrfF, rrfG, rrfH) and 23S RNA (rrlA) have distinctive RCBS values >1.0. Gene expression is controlled by a regulator that interacts with a specific sequence of a target RNA. Ffs, coding for the 4.5S sRNA component of the signal recognition particle, works with the ffh protein (RCBS = 0.3524) and is involved in co-translational protein translocation into and possibly through membranes. SsrS, coding for 6S sRNA, inhibits RNA polymerase promoter binding. It acts as a template for RNA-directed pRNA synthesis by RNAP and mimics an open promoter. RaiA codes for a cold shock protein associated with the 30S ribosomal subunit. Ffs, ssrS and raiA, all involved in the translational process, are predicted to be highly expressed genes in our approach.
Moreover, we identify four other genes which are involved in the post-translational process and are expressed at a higher level. These are rimI coding for the acetylase of 30S ribosomal subunit protein S18, def coding for peptide deformylase, hypC coding for a protein required for maturation of hydrogenases 1 and 3, napD coding for the assembly protein for periplasmic nitrate reductase, and npr coding for the phosphohistidinoprotein-hexose phosphotransferase component of the N-regulated phosphotransferase system (PTS).
Transcription is the first stage in gene expression and the principal step at which it is controlled. The gene for the major cold shock protein (cspA) attains a significantly high expression level (RCBS = 1.28). The gene cspA is a regulator needed for adaptation to atypical conditions and responds to temperature stimulus. CspC, which codes for another stress protein and is a member of the cspA family, is also a highly expressed gene. Among the other genes involved in transcription, those for RNA polymerase play a vital role. RNA synthesis is catalysed by the enzyme RNA polymerase, and transcription starts when RNA polymerase binds to the promoter. The DNA-directed RNA polymerase subunit genes rpoB, rpoC, rpoD, rpoH and rpoZ in E. coli qualify as highly expressed. RNA polymerase must be able to handle situations in which transcription is blocked, e.g. when DNA is damaged. In E. coli, the proteins greA and greB, which are predicted to have a high expression level, release polymerase from elongation arrest. Rho, the transcription termination factor, attains a moderate expression level (RCBS = 0.4749). Termination and anti-termination are closely connected and involve proteins that interact with RNA polymerase. Anti-termination is used as a control mechanism and controls the ability of the enzyme to read past a terminator into genes lying beyond. The nus loci code for proteins that form part of the transcription apparatus. The nusA, nusB and nusG functions are concerned solely with the termination of transcription. Transcription anti-termination protein (nusB) and transcription termination factor (nusG) have high expression levels. NusB is required for rho-dependent terminators, whereas nusG may be concerned with the general assembly of all the nus factors into a complex with RNA polymerase. NusA, required for intrinsic terminators, has a moderate expression level (RCBS = 0.4658).
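The cutoff used throughout this section (RCBS > 0.5 marks a predicted highly expressed gene) can be applied mechanically. The following minimal Python sketch uses only RCBS values quoted in the text above; the helper name `is_highly_expressed` and the dictionary layout are illustrative assumptions, not part of the original analysis.

```python
# RCBS values quoted in Section 3.2 (gene -> score).
RCBS_QUOTED = {
    "infB": 0.49017,
    "nusA": 0.46579,
    "fis": 0.93575,
    "ffh": 0.3524,
    "cspA": 1.28,
    "rho": 0.4749,
}

# Cutoff stated in the text: genes with RCBS > 0.5 are called highly expressed.
THRESHOLD = 0.5


def is_highly_expressed(rcbs, threshold=THRESHOLD):
    """Return True when the score exceeds the cutoff (hypothetical helper)."""
    return rcbs > threshold


highly_expressed = sorted(
    gene for gene, score in RCBS_QUOTED.items() if is_highly_expressed(score)
)
print(highly_expressed)  # ['cspA', 'fis']
```

Genes such as infB (0.49017) sit just under the cutoff, which is why the text describes them as moderately rather than highly expressed.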
3.3. CH/degradation protein genes
CH/degradation proteins are vital in cell physiology. CHs are proteins that assist the non-covalent folding/unfolding and assembly/disassembly of other macromolecular structures. One major function of CHs is to prevent both newly synthesized polypeptide chains and assembled subunits from aggregating into non-functional structures. Many CHs are heat shock proteins, that is, proteins expressed in response to elevated temperatures or other cellular stresses. The reason for this behaviour is that protein folding is severely affected by heat and, therefore, some CHs act to repair the potential damage caused by misfolding. Other CHs are involved in folding newly made proteins as they are extruded from the ribosome. Although most newly synthesized proteins can fold in the absence of CHs, a minority strictly requires them. DnaK (Hsp70), perhaps the best characterized CH in E. coli, is identified as a highly expressed gene. The Hsp70 proteins are aided by Hsp40 proteins (DnaJ in E. coli), which increase the ATP (adenosine triphosphate) consumption rate and activity of the Hsp70s. However, dnaJ has a low expression level (RCBS = 0.3988). It has been noted that increased expression of Hsp70 proteins in the cell results in a decreased tendency towards apoptosis. Although a precise mechanistic understanding has yet to be determined, it is known that Hsp70s have a high-affinity state for unfolded proteins when bound to adenosine diphosphate (ADP), and a low-affinity state when bound to ATP. It is thought that many Hsp70s crowd around an unfolded substrate, stabilizing it and preventing aggregation until the unfolded molecule folds properly, at which time the Hsp70s lose affinity for the molecule and diffuse away. Other highly expressed heat shock protein genes are groS, groL, hslO (Hsp33) and htpG (Hsp90). GroS and groL encode the small and large subunits of GroESL, respectively. This is the best characterized heat shock protein complex in E. coli, and both genes are identified as highly expressed. HtpG is the least well-understood CH in E. coli. Hsp90, a molecular CH, might be essential for activating many signalling proteins in the eukaryotic cell and is necessary for viability in eukaryotes. Since htpG is predicted to be a highly expressed gene, Hsp90 is possibly necessary for prokaryotes as well.
Protein degradation plays an important role in the cell cycle, in signal transduction and in maintaining the integrity of the properly folded state of a protein. Out of 100 genes involved in macromolecular degradation, only six qualify as highly expressed genes. The predicted expression levels of the highly expressed degradation genes are reported in Table 1. Among these, xseB (exonuclease VII small subunit) and rusA (DLP12 prophage, endonuclease RUS) encode enzymes which regulate the degradation of DNA. These are also involved in DNA repair activity. Pnp and csrA are the only two highly expressed genes involved in RNA degradation. Pnp, polynucleotide phosphorylase/polyadenylase, is fundamental in RNA processing. Polyadenylation plays an important role in initiating degradation of some RNAs. Triple mutations that remove Pnp have a strong effect on stability. Poly(A) polymerase may create a poly(A) tail that acts as a binding site for the nucleases. DegP, a serine endoprotease (protease Do), is involved in protein and peptide degradation and is predicted to be required for global protein degradation. It responds to temperature stimulus. YajG, a predicted lipoprotein, and YhbO, a predicted intracellular protease, are thought to be involved in degradation of proteins and polysaccharides.
3.4. Aminoacyl tRNA synthetases and modification genes
There are 37 genes encoding the tRNA synthetases and other enzymes involved in tRNA modification. Results are reported in Table 1. Compared with the 19 PHX genes predicted by Karlin et al.,18 only three genes register as highly expressed in our expression measure. These include aspartyl tRNA synthetase (aspS), lysine tRNA synthetase (lysS) and valyl tRNA synthetase (valS). The gene encoding glycine tRNA synthetase (glyS) is also predicted to be marginally highly expressed, with RCBS = 0.4974. Among other tRNA synthetase genes, pheS, glyQ, glnS, leuS, serS, proS, tyrS, gltX and metG have moderate expression levels. PheM, the phenylalanyl tRNA synthetase operon leader peptide, registers a high RCB score with RCBS = 2.1835.
3.5. Outer membrane protein
There are ∼13 highly expressed genes encoding outer membrane proteins, as predicted by our expression measure. The expression levels of these genes are displayed in Table 1. These include outer membrane proteins (ompA, ompC, ompF, ompX), outer membrane lipoprotein (slyB), truncated outer membrane porin (nmpC), palmitoyl transferase for Lipid A (pagP), scaffolding protein for murein synthesizing machinery (mipA) and tsx. Moreover, yqiG, a predicted outer membrane usher protein, yqhH, a predicted outer membrane lipoprotein, and yddL, a predicted outer membrane protein, are predicted to be highly expressed genes in our analysis.
3.6. Inner membrane protein
Among the genes encoding inner membrane proteins, murein lipoprotein (lpp) has the highest expression level (RCBS = 1.6320). Other than conserved inner membrane proteins, 34 inner membrane protein genes are listed in Table 1 as highly expressed genes. There are ∼83 conserved inner membrane proteins in the E. coli genome. Of those, 17 are predicted to be highly expressed genes (Supplementary Table SVII).
3.7. Amino acid biosynthesis
Overall, 20 of the 255 amino acid biosynthesis genes are expressed at a higher level. The genes artM, an arginine transporter subunit, fliY, a cystine transporter subunit, and glnH and glnP, the glutamine transporter subunits, are predicted to be expressed at higher levels. The glnA gene, which encodes glutamine synthetase, and glnB, which encodes the regulatory protein for glutamine synthetase, are expressed at higher levels. Interestingly, hisL, his operon leader peptide; ilvL, ilvG operon leader peptide; ivbL, ilvB operon leader peptide; leuL, leu operon leader peptide; pheL, pheA gene leader peptide; thrL, thr operon leader peptide; and trpL, trp operon leader peptide are expressed at higher levels. The monocistronic gene ilvC, which is depressed exclusively by valine, has a high expression score. The dapD gene, which encodes 2,3,4,5-tetrahydropyridine-2-carboxylate N-succinyltransferase, the enzyme for lysine biosynthesis via diaminopimelate, has a high expression level.
3.8. Nucleotide biosynthesis
According to the MIPS classification, ∼31 genes encode enzymes for nucleotide biosynthesis. In our study, we observe that five genes, namely purA, purC, adk, ndk and guaB, which encode enzymes involved in purine ribonucleotide biosynthesis, and pyrL, the pyrBI operon leader peptide for pyrimidine ribonucleotide biosynthesis, are highly expressed genes. PyrL has a significantly high expression level, with RCBS = 1.16.
3.9. Genes for energy metabolism and metabolism of carbon compounds
Of the 392 genes involved in metabolism of carbon compounds, 39 have a significantly high expression level. Of those, 27 are involved in carbohydrate metabolism, 10 are involved in amino acid metabolism, and two are involved in amines metabolism. Lpd is involved in both carbohydrate and amino acid metabolism. The remaining one is involved in other carbon compound metabolism. No genes involved in fatty acid metabolism attain a high expression level, but seven of the 27 genes involved in fatty acid biosynthesis have a significantly high expression level. The data presented here indicate that accA, which encodes one component of acetyl coenzyme A carboxylase, is a highly expressed gene. In addition, ymcE, which is a cold shock protein, and aspS also attain a high expression level. Although little is known about the fab genes except the FadR activation of fabA, we predict that some of the fab genes (fabA, fabI, fabZ) have a significant expression level. This is consistent with genomic expression profiling obtained from the DNA microarray analysis of Tao et al.34
3.10. Energy metabolism genes
The genes involved in energy metabolism are primarily divided into four groups: glycolysis, pyruvate dehydrogenase, the pentose phosphate pathway and the TCA cycle. Of the 1530 genes that are involved in energy metabolism, 163 are predicted to be highly expressed genes in our approach. The two basic metabolic pathways, glycolysis and the TCA cycle, involve eight and four highly expressed genes, respectively, whereas the genes in glycolysis and pyruvate metabolism are predominantly highly expressed. These include eno, fbaA, gapA, gpmA, pfkA, pykF, tpiA and pgk.
Unlike in Karlin et al., the proteins involved in the initial steps of glycolysis (pgi, coding for glucosephosphate isomerase) and in the initial steps of the TCA cycle (gltA, citrate synthase) are not highly expressed genes in our observation. Besides those of the TCA cycle, pyruvate dehydrogenase and glycolysis, the E. coli genome has several highly expressed genes of anaerobic and aerobic respiration. Among the NADH dehydrogenase nuo complex genes, nuoA, nuoI and nuoK are highly expressed. Genes encoding the α, β and ε subunits of the F1 sector and the b and c subunits of the F0 sector of membrane-bound ATP synthase are predicted to be highly expressed. With respect to electron transport, flavodoxin 1 (fldA) and cytochrome o ubiquinol oxidase subunit III (cyoC) are highly expressed genes, with RCBS values of 0.6062 and 0.5316, respectively. In addition, cytochrome c biogenesis protein (ccmD) and cytochrome o ubiquinol oxidase subunit IV (cyoD) also register high expression levels in our approach.
In marked contrast to Karlin et al., we find six highly expressed flagellar genes in E. coli: flgB, fliE, fliJ, fliQ, fliS and fliT. The flagellum secretion apparatus may be viewed as part of the CH family essential for bacterial viability. Assembly of a flagellum requires the export of protein subunits to the outer surface of the cell. Recent evidence indicates that the flagellum regulon can also influence bacterium–host interactions independent of motility.
3.11. Fatty acid biosynthesis
Fatty acid metabolism is crucial because it not only provides the various fatty acids and phospholipids necessary for cell growth, but also serves as a source of precursors for biosynthesis of secondary metabolites. The highly expressed genes involved in fatty acid biosynthesis include those encoding beta-hydroxydecanoyl thioester dehydrase (fabA), NADH-dependent enoyl-[acyl-carrier-protein] reductase (fabI), (3R)-hydroxymyristoyl acyl carrier protein dehydratase (fabZ) and holo-[acyl-carrier-protein] synthase 1 (acpS), as well as accA and the cold shock gene ymcE. In addition, 3-oxoacyl-[acyl-carrier-protein] synthase I (fabB) has a moderately high RCBS value (RCBS = 0.4954).
3.12. Central intermediary metabolism
Several highly expressed genes in this functional class are also involved in carbohydrate metabolism. Besides other genes in this class which are also involved in nitrogen metabolism, phosphorus metabolism, amino acid metabolism, etc., our analysis identified as highly expressed the key genes involved in central intermediary metabolism, encoding aspartate ammonia-lyase (aspA), citrate lyase (citD, citE), glycine cleavage complex lipoylprotein (gcvH), Ni-dependent glyoxalase I (gloA), 3-keto-l-gulonate 6-phosphate decarboxylase (ulaD), d-erythro-7,8-dihydroneopterin triphosphate 2′-epimerase and dihydroneopterin aldolase (folX) and nucleoside triphosphate pyrophosphohydrolase (mutT). FixX, a 4Fe-4S ferredoxin-type protein, also registers as a highly expressed gene predicted to be involved in central intermediary metabolism.
3.13. Genomic repair proteins
An event that introduces a deviation from the usual double-helical structure of DNA is a threat to the genetic constitution of the cell. The repair system is thus very important for the survival of the cell. It can recognize a range of distortions in DNA as signals for action, and the cell is likely to have several systems able to deal with DNA damage. Table 1 reports the highly expressed repair proteins in the E. coli genome. Other repair proteins have low to moderate expression levels. Of the 51 genes involved in DNA repair, only six reach a high expression level. The principal pathway for recombination repair in E. coli is identified by the rec genes. RecA, predicted to be a highly expressed gene in our approach, is not only involved in recombination–repair activities, but also has another quite distinct function. It can be activated by many treatments that damage DNA or inhibit replication in E. coli. This causes it to trigger a complex series of phenotypic changes called the SOS response, which involves the expression of many genes whose products include repair functions. The other highly expressed repair genes in E. coli are xseB, dinI, yebG, dinJ and rusA. DinI, DNA damage-inducible protein I, and dinJ, predicted antitoxin of the YafQ–DinJ toxin–antitoxin system, act on damaged DNA and are involved in its repair. YebG, a conserved protein regulated by LexA, functions in DNA repair.
3.14. Regulatory protein
About 440 genes in E. coli encode regulatory proteins. Among these regulatory proteins 62 genes are predicted to be highly expressed genes. Several of the genes in this class also function in translation, transcription, DNA repair, replication/recombination, cell process, etc. The predicted expression levels of several other highly expressed genes of specific regulatory proteins are listed in Table 1.
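The per-class counts quoted in Sections 3.2 to 3.14 can be turned into comparable percentages with simple arithmetic. The sketch below does this in Python; the counts are copied from the text (several totals are stated there only approximately, e.g. ∼100 and about 440), while the function name `percent_highly_expressed` and the class labels are illustrative assumptions.

```python
# (highly expressed, total) gene counts as stated in Sections 3.2-3.14.
# Several totals are approximate in the text, so treat the output as indicative.
COUNTS = {
    "translational apparatus": (75, 100),
    "macromolecular degradation": (6, 100),
    "tRNA synthetases and modification": (3, 37),
    "amino acid biosynthesis": (20, 255),
    "metabolism of carbon compounds": (39, 392),
    "energy metabolism": (163, 1530),
    "DNA repair": (6, 51),
    "regulatory proteins": (62, 440),
}


def percent_highly_expressed(counts):
    """Map each class to the percentage of its genes with RCBS > 0.5."""
    return {name: 100.0 * high / total for name, (high, total) in counts.items()}


percentages = percent_highly_expressed(COUNTS)

# Print classes from the highest to the lowest share of highly expressed genes.
for name, pct in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:.1f}%")
```

On these figures the translational apparatus stands apart from every other class, which matches the emphasis the text places on RPs and translation factors.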
3.15. Biosynthesis of vitamins, cofactors and small molecules
Vitamin biosynthesis proteins have largely low expression levels. Only ribE, riboflavin synthetase, is highly expressed. This is in contrast to the result of Karlin et al.18 Pathways for the synthesis of vitamins, of which only small amounts are generally needed to achieve adequate function, record low RCBS values ranging from 0.1801 to 0.5974. Some of the enzymes that utilize the vitamins as cofactors are highly expressed, e.g. accB, the BCCP subunit of acetyl-CoA carboxylase of E. coli, registers as a highly expressed gene in our approach with RCBS = 0.5533. The expression levels of the 10 highly expressed genes involved in the biosynthesis of cofactors and small molecules are listed in Table 1.
3.16. Biosynthesis of other macromolecules
Among the genes encoding proteins for macromolecular biosynthesis, lpp attains a significantly high RCBS value (RCBS = 1.6320). Other highly expressed genes involved in macromolecular biosynthesis are the major type 1 subunit fimbrin (fimA), DNA-binding transcriptional repressor (iscR) and truncated cytochrome b562 (cybC). GlgS, a predicted glycogen synthesis protein, and yfgJ, another predicted protein thought to be involved in macromolecular biosynthesis, also attain a high expression score.
Of the 39 cryptic genes in E. coli analysed in our model, only three register as highly expressed genes. Those are csgA, a cryptic curlin major subunit which is involved in glycoprotein biosynthesis, mokC, a regulatory protein of hokC, and gspI, a putative transport protein. The expression levels of these genes are 0.73, 0.62 and 0.55, respectively.
Among the genes induced under starvation conditions, only dps, the Fe-binding and storage protein (RCBS = 0.5544) which provides DNA protection during starvation, and rpoH, the RNA polymerase sigma 32 (sigma H) factor (RCBS = 0.5129), are predicted as highly expressed genes, in agreement with Karlin et al.18 Other starvation protein genes [otsA (RCBS = 0.2349), otsB (RCBS = 0.2700), rpoE (RCBS = 0.2781), rpoN (RCBS = 0.2486), rpoS (RCBS = 0.4093), katE (RCBS = 0.2359), surA (RCBS = 0.3936), bolA (RCBS = 0.4342)] have low to moderate expression levels. The survival protein surA, which is registered as PHX with E(g) = 1.10, does not qualify as a highly expressed gene in our approach. We also observe that a number of genes encoding prophages are recorded as highly expressed genes in our analysis. A phage DNA molecule is often integrated into the DNA molecule of a bacterium, forming a prophage. A list of highly expressed genes encoding different prophages in E. coli is displayed in Table 2.
Table 2.
Predicted expression levels of highly expressed prophage genes
Gene | Description | RCBS |
---|---|---|
yeeT | CP4-44 prophage; predicted protein | 0.76113 |
alpA | CP4-57 prophage; DNA-binding transcriptional activator | 0.64494 |
ypjK | CP4-57 prophage; predicted inner membrane protein | 0.7551 |
yfjU | CP4-57 prophage; predicted inner membrane protein | 1.07646 |
yfjM | CP4-57 prophage; predicted protein | 0.56069 |
yafW | CP4-6 prophage; antitoxin of the YkfI–YafW toxin–antitoxin system | 0.54248 |
tfaS | CPS-53 (KpLE1) prophage; conserved protein | 0.60714 |
yfdT | CPS-53 (KpLE1) prophage; predicted protein | 0.54524 |
yfdS | CPS-53 (KpLE1) prophage; predicted protein | 0.59437 |
yffM | CPZ-55 prophage; predicted protein | 0.72955 |
ninE | DLP12 prophage; conserved protein | 0.61069 |
rusA | DLP12 prophage; endonuclease RUS | 0.53058 |
emrE | DLP12 prophage; multidrug resistance protein | 0.65874 |
borD | DLP12 prophage; predicted lipoprotein | 0.50128 |
rzoD | DLP12 prophage; predicted lipoprotein | 0.98537 |
essD | DLP12 prophage; predicted phage lysis protein | 0.77232 |
ybcO | DLP12 prophage; predicted protein | 0.56517 |
ybcW | DLP12 prophage; predicted protein | 0.67154 |
ylcG | DLP12 prophage; predicted protein | 1.05554 |
yciH | e14 prophage; 5-methylcytosine-specific restriction endonuclease B | 0.67815 |
yciX | e14 prophage; predicted DNA-binding transcriptional regulator | 0.79718 |
yciO | e14 prophage; predicted inner membrane protein | 0.50282 |
rluB | e14 prophage; predicted integrase | 0.55764 |
ymiA | e14 prophage; predicted protein | 1.3517 |
ylcH | hypothetical protein, DLP12 prophage | 1.56134 |
insM | KpLE2 phage-like element; iron-dicitrate transporter subunit | 0.6455 |
insA | KpLE2 phage-like element; IS1 repressor protein InsA | 0.52239 |
yqiG | KpLE2 phage-like element; IS2 insertion element repressor InsA | 0.69853 |
yjhD | KpLE2 phage-like element; IS30 transposase | 0.6955 |
relB | Qin prophage; bifunctional antitoxin of the RelE–RelB toxin–antitoxin system/transcriptional repressor | 0.68232 |
dicB | Qin prophage; cell division inhibition protein | 0.66801 |
cspB | Qin prophage; cold shock protein | 0.52261 |
cspF | Qin prophage; cold shock protein | 0.5891 |
cspI | Qin prophage; cold shock protein | 0.80085 |
dicC | Qin prophage; DNA-binding transcriptional regulator for DicB | 0.69275 |
ydfK | Qin prophage; predicted DNA-binding transcriptional regulator | 0.50987 |
ynfN | Qin prophage; predicted protein | 0.69704 |
gnsB | Qin prophage; predicted protein | 0.82038 |
ydfD | Qin prophage; predicted protein | 0.83742 |
ydfA | Qin prophage; predicted protein | 0.95351 |
ydfB | Qin prophage; predicted protein | 1.34218 |
essQ | Qin prophage; predicted S lysis protein | 0.62869 |
hokD | Qin prophage; small toxic polypeptide | 0.75743 |
relE | Qin prophage; toxin of the RelE–RelB toxin–antitoxin system | 0.54866 |
Apart from these classified genes, a fraction of poorly characterized genes, which are generally annotated on the basis of strong sequence similarity, is also found among the predicted highly expressed genes. Many of these genes encode predicted proteins and some are poorly characterized hypothetical genes. (A list of highly expressed genes which are thought to encode predicted proteins is given in Supplementary Table SVII.) Our analysis thus provides strong support for significant roles of these genes, which may be highly relevant for E. coli.
The large data set analysed here shows a clear connection between relative codon usage difference and gene expression level. Codon frequencies are found to vary between genes in the same genome and between genomes. Thus the overall nucleotide composition of the genome, which influences the codon usage pattern, introduces selective forces acting on highly expressed genes to improve the efficiency of translation. This is also evident from the observation that shorter coding sequences have greater RCBS values, i.e. shorter genes have higher expression levels,4,5,40,41 which is consistent with the fact that the cost of producing a protein is proportional to its length.
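To make the notion of relative codon usage difference concrete, the sketch below computes an RCBS-style score for a single coding sequence. The definition is not restated in this section, so the formula used here is an assumption: for each codon xyz, d = (f(xyz) − f1(x)f2(y)f3(z)) / (f1(x)f2(y)f3(z)), where f is the codon frequency in the gene and fn is the base frequency at codon position n, and the score is the geometric mean of (1 + d) over all codons of the gene, minus 1. The function name `rcbs` and the toy inputs are hypothetical.

```python
from collections import Counter
from math import exp, log


def rcbs(cds):
    """RCBS-style score of one coding sequence under the assumed definition."""
    cds = cds.upper()
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    n = len(codons)
    if n == 0:
        raise ValueError("sequence must contain at least one complete codon")

    codon_counts = Counter(codons)
    # Base counts at each of the three codon positions.
    position_counts = [Counter(codon[k] for codon in codons) for k in range(3)]

    log_sum = 0.0
    for codon in codons:
        observed = codon_counts[codon] / n
        expected = 1.0
        for k in range(3):
            expected *= position_counts[k][codon[k]] / n
        d = (observed - expected) / expected
        log_sum += log(1.0 + d)  # 1 + d equals observed / expected, always > 0

    # Geometric mean of (1 + d) over all codons, minus 1.
    return exp(log_sum / n) - 1.0


print(rcbs("ATGATGATG"))  # a single repeated codon carries no bias under this formula
print(rcbs("AAACCC"))     # two distinct codons with fully correlated positions
```

Under this assumed definition, very short inputs tend to score high because each codon's observed frequency is large relative to its positional expectation, which may be relevant when reading the scores of short leader peptides in Table 1.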
Interestingly, we observe that besides highly expressed protein coding genes, all tRNA genes (listed in Table 3) also register very high RCBS values. This observation suggests that usage of preferred codons in these and highly expressed genes is positively correlated, and that the highly expressed genes use a preferred set of optimal codons in accordance with their respective tRNA levels. This result might also find an important application to tRNA genes: besides measuring the expression level of a gene, the RCBS score could be used to remove false positives from tRNA-finding algorithms. Moreover, several genes of unknown function with predicted high expression levels may be attractive candidates for experimental characterization, because we assume that they have important functions in the organism. Table 4 lists such gene families of unknown function. This kind of analysis is valuable in helping to identify promising candidate genes for further experimental characterization.
Table 3.
Predicted expression levels of tRNA genes
Gene | RCBS | Gene | RCBS | Gene | RCBS | Gene | RCBS |
---|---|---|---|---|---|---|---|
alaX | 1.35584 | glnW | 1.96033 | leuP | 1.06805 | serT | 1.15723 |
alaW | 1.35584 | glnU | 1.96033 | leuX | 1.18771 | serU | 1.32755 |
alaV | 1.5556 | gltW | 1.85009 | leuU | 1.23093 | serW | 1.45877 |
alaU | 1.5556 | gltU | 1.85009 | leuZ | 1.3515 | serX | 1.45877 |
alaT | 1.5556 | gltT | 1.85009 | lysT | 1.91913 | thrW | 1.175 |
argU | 1.40468 | gltV | 1.85009 | lysW | 1.91913 | thrV | 1.27061 |
argX | 1.67244 | glyW | 1.32551 | lysY | 1.91913 | thrT | 1.27325 |
argQ | 1.76167 | glyV | 1.32551 | lysZ | 1.91913 | Thru | 1.7256 |
argZ | 1.76167 | glyX | 1.32551 | lysQ | 1.91913 | trpT | 1.62046 |
argY | 1.76167 | glyY | 1.32551 | lysV | 1.91913 | tyrU | 1.00445 |
argV | 1.76167 | glyT | 1.33638 | metY | 1.22225 | tyrV | 1.0433 |
argW | 1.99759 | glyU | 1.47125 | metZ | 1.32682 | tyrT | 1.0433 |
asnT | 1.87865 | hisR | 1.21868 | metW | 1.32682 | valW | 1.37166 |
asnW | 1.87865 | ileX | 1.41462 | metV | 1.32682 | valT | 1.37566 |
asnU | 1.87865 | ileV | 1.42883 | metU | 1.36722 | valZ | 1.37566 |
asnV | 1.87865 | ileU | 1.42883 | metT | 1.36722 | valU | 1.37566 |
aspU | 1.38539 | ileT | 1.42883 | pheV | 1.38483 | valX | 1.37566 |
aspV | 1.38539 | ileY | 1.45397 | pheU | 1.38483 | valY | 1.37566 |
aspT | 1.38539 | leuW | 1.02415 | proL | 1.26942 | valV | 1.6125 |
cysT | 1.35851 | leuT | 1.03107 | proM | 1.38923 | selC | 1.28639 |
glnX | 1.65127 | leuV | 1.03107 | proK | 1.44416 | – | – |
glnV | 1.65127 | leuQ | 1.03107 | serV | 1.14888 | – | – |
Table 4.
Predicted expression levels of highly expressed hypothetical protein genes
Gene | RCBS | Gene | RCBS | Gene | RCBS |
---|---|---|---|---|---|
ytcA | 0.51055 | ylcI | 0.77343 | ybhU | 1.09738 |
ybfK | 0.51884 | yojO | 0.84734 | ynhF | 1.15141 |
ymjA | 0.58644 | ygdT | 0.85155 | ydgU | 1.48121 |
yrhD | 0.63276 | ypaB | 0.92206 | ypfM | 1.86114 |
ydbJ | 0.63348 | yccB | 1.07903 | ylcH | 1.56134 |
4. Discussion
Our analysis supports the view that each genome has evolved codon usage patterns that reflect gene expression levels. The three protein families (RPs, major translation/transcription processing factors and CH/degradation proteins), which are fundamental at many stages of the lifestyle in promoting growth and stability, have been identified as highly expressed genes. Although the concept of predicting gene expression from codon usage was proposed a decade ago, only recently have these methods been successfully applied to the identification of highly expressed genes in various bacteria and eukaryotic organisms. However, any such codon usage-based prediction of gene expression relies on a prior definition of a reference set consisting of highly expressed genes. For instance, CAI listed a set of 27 highly expressed genes for E. coli, which includes genes encoding 17 RPs, four elongation factors, four outer membrane proteins, recA and dnaK. For yeast, a set of 24 highly expressed genes has been taken as a reference set. These include 16 genes encoding RPs, one for an elongation factor, two enolase genes, two GA-3-PDH genes, ADH 1, PCK and pyruvate kinase.3 Karlin and coworkers17–23 included transcription/translation-related factors and CHs in the reference set, in addition to the RP genes. The MILC-based expression level predictor MELP13 is based on a reference set consisting of all genes coding for RPs longer than 100 codons. Although the composition of the reference set is based on the functional assignment of the genes, there is no specific algorithm to construct a reference set for individual species. The outcome is highly dependent on the genome examined. In some instances, the use of alternative reference sets gives very poor results. In principle it is not possible to regulate protein expression level by the judicious use of certain codons. 
It is worth emphasizing that individual genes tend to favour characteristic codon distributions and that there is a strong connection between protein expressivity and the degree of codon bias. We therefore argue that codon assignment as well as codon preferences should be taken into account in a single measure that captures the functional feedback between the constraints of gene expression and the microstructure of genomes. To better understand the potential expression levels of genes, we developed a methodology that relates codon usage, as well as large-scale DNA compositional biases among gene classes, to the expression potential of individual genes. The CAI3 and codon usage models13,17 were originally based on somewhat qualitative assumptions about the expression levels of relatively few genes. This is our motivation for using a quantitative measure (RCBS) to recalculate genome-wide expression data. The new approach begins with the assumption, based on the argument just presented, that the general codon usage features observed in highly expressed genes differ greatly from those of randomly generated sequences with the same sequence composition. Our proposition is based on the fact that, for highly expressed genes, the difference between the geometric average of the normalized codon frequencies (fxyz) in a nucleotide sequence and that of f1(x) × f2(y) × f3(z) is >0.5 of the geometric average of f1(x) × f2(y) × f3(z). The proposed threshold value (0.5) of RCBS has been investigated for the E. coli genome, the yeast genome and archaeal genomes. The data (available on request) provide evidence in favour of the potential strength of our expression measure over the others. Most of the housekeeping genes fall in the category of highly expressed genes. The study also identifies a number of functionally unknown genes as highly expressed genes based on their codon profile. Thus, our approach appears to offer a better alternative to the existing expression models. 
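The threshold criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that RCBS is the geometric mean, over all codons of a gene, of the ratio f(xyz) / (f1(x) × f2(y) × f3(z)), minus 1, with all frequencies taken from the gene itself. The function names `rcbs` and `is_predicted_highly_expressed` are hypothetical, and the handling of stop codons and ambiguous bases is left out.

```python
import math
from collections import Counter


def rcbs(seq):
    """Sketch of the relative codon bias strength of one coding sequence.

    Assumption: RCBS = geometric mean over codons of
    f(xyz) / (f1(x) * f2(y) * f3(z)), minus 1.
    """
    seq = seq.upper()
    # Split into complete codons; a trailing partial codon is dropped.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    n = len(codons)
    # Normalized frequency of each codon within this gene.
    codon_freq = {c: k / n for c, k in Counter(codons).items()}
    # Normalized frequency of each base at codon positions 1, 2 and 3.
    pos_freq = [
        {b: k / n for b, k in Counter(c[p] for c in codons).items()}
        for p in range(3)
    ]
    # Average the log ratios to obtain the geometric mean without overflow.
    log_sum = 0.0
    for c in codons:
        expected = pos_freq[0][c[0]] * pos_freq[1][c[1]] * pos_freq[2][c[2]]
        log_sum += math.log(codon_freq[c] / expected)
    return math.exp(log_sum / n) - 1.0


def is_predicted_highly_expressed(seq, threshold=0.5):
    """Apply the RCBS > 0.5 criterion quoted in the text."""
    return rcbs(seq) > threshold
```

A gene that uses a single codon throughout scores 0 under this reading, because its codon frequency equals the product of its positional base frequencies. The score rises as codons become more frequent than the base composition at the three positions alone would predict.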
Surprisingly, we have found a strong negative correlation between relative codon usage bias and protein length, in contradiction with others.24,42 Although our primary motivation in developing this novel method was to compensate for possible artefacts due to sequence length variability, we have observed that highly expressed genes (identified by RCBS) show a negative correlation with gene length, which points to a biological relevance. This is suggested to be due to more effective translational selection acting to reduce the size of abundant proteins and so minimize transcriptional and translational energy costs. Although longer sequences appear to be better optimized, in that they carry codons for more abundant tRNAs, which increases the probability of proper and timely translation, it is easier for a ribosome to translate a short RNA sequence, whereas fidelity decreases over a longer translation. Therefore it is likely that there is natural selection for the shorter genes to be expressed at a higher level.41
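A correlation of this kind can be checked with a short Python sketch once per-gene RCBS values and protein lengths are available. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, and the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)
```

Calling `pearson_r(lengths, scores)` with a list of protein lengths and the matching list of RCBS values returns a number between -1 and 1. A value close to -1 would correspond to the strong negative relationship reported here.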
To summarize, we have introduced a novel method, based on the difference in codon usage with regard to random base composition at the three codon sites, to estimate the level of expression of a gene. In this article, predicted highly expressed genes are characterized for the E. coli genome only, but the method applies equally to other microbes, results for which will be reported in a separate communication. By comparing its performance with other commonly used measures of gene expression, we have established that RCBS is a generally applicable method that is resistant to species-specific effects and introduces little noise into measurements. It is remarkable that the present model usually performs as well as the codon usage model of Karlin et al.18 and sometimes leads to a better correlation with expression data than several other measures based on CAI.3 The predictions of expression level in our approach can be assessed by comparing them with protein abundance data and microarray data. Thus, our method is effectively complementary to the experimental procedures of 2D gel electrophoresis and DNA microarray analysis in assessing gene expression levels. In contrast to other existing measures, our model describes the global enrichment of a codon in highly expressed genes with no restrictions on the composition of the other codons. Of course, codon-based expression indicators yield a static value, whereas gene expression is a dynamic process with very different expression levels under different conditions. In our view, the codon usage pattern of a genome evolves as a result of the interplay between mutational and selective forces, and a proper account of the adaptive response to the codon assignment can lead to a practical solution for predicting gene expression.
Supplementary data
Supplementary data are available online at www.dnaresearch.oxfordjournals.org.
Funding
Financial support by the University Grants Commission, India, sanction No. F.PSW-060/05-06 (ERO), is gratefully acknowledged.
Acknowledgements
The authors would like to acknowledge the reviewers for their valuable suggestions and comments to improve the manuscript.
Footnotes
Edited by Hiroyuki Toh
References
- 1. Gouy M., Gautier C. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 1982;10:7055–7073. doi: 10.1093/nar/10.22.7055.
- 2. Holm L. Codon usage and gene expression. Nucleic Acids Res. 1986;14:3075–3087. doi: 10.1093/nar/14.7.3075.
- 3. Sharp P. M., Li W. H. The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987;15:1281–1295. doi: 10.1093/nar/15.3.1281.
- 4. Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J. Mol. Biol. 1981;146:1–21. doi: 10.1016/0022-2836(81)90363-6.
- 5. Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 1981;151:389–409. doi: 10.1016/0022-2836(81)90003-6.
- 6. Karlin S., Mrazek J., Campbell A. M. Codon usage in different gene classes of Escherichia coli genome. Mol. Microbiol. 1998;29(6):1341–1355. doi: 10.1046/j.1365-2958.1998.01008.x.
- 7. Wright F. The effective number of codons used in a gene. Gene. 1990;87(1):23–29. doi: 10.1016/0378-1119(90)90491-9.
- 8. Morton B. R. Codon use and rate of divergence of land plant chloroplast genes. Mol. Biol. Evol. 1994;11(2):231–238. doi: 10.1093/oxfordjournals.molbev.a040105.
- 9. Shields D. C., Sharp P. M. Synonymous codon usage in Bacillus subtilis reflects both translational and mutational biases. Nucleic Acids Res. 1987;15(19):8023–8040. doi: 10.1093/nar/15.19.8023.
- 10. Freire-Picos M. A., Gonzalez-Siso M. I., Rodriguez-Belmonte E., Rodriguez-Torres A. M., Ramil E., Cerdan M. E. Codon usage in Kluyveromyces lactis and in yeast cytochrome c-encoding genes. Gene. 1994;139(1):43–49. doi: 10.1016/0378-1119(94)90521-5.
- 11. Urrutia A. O., Hurst L. D. Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics. 2001;159(3):1191–1199. doi: 10.1093/genetics/159.3.1191.
- 12. Wan X. F., Xu D., Kleinhofs A., Zhou J. Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes. BMC Evol. Biol. 2004;4(1):19. doi: 10.1186/1471-2148-4-19.
- 13. Supek F., Vlahovicek K. Comparison of codon usage measures and their applicability in prediction of microbial gene expressivity. BMC Bioinformatics. 2005;6:182. doi: 10.1186/1471-2105-6-182.
- 14. Karlin S., Mrazek J. What drives codon choices in human genes? J. Mol. Biol. 1996;262(4):459–472. doi: 10.1006/jmbi.1996.0528.
- 15. Jansen R., Bussemaker H. J., Gerstein M. Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res. 2003;31(8):2242–2251. doi: 10.1093/nar/gkg306.
- 16. Wu G., Culley D. E., Zhang W. Predicted highly expressed genes in the genomes of Streptomyces coelicolor and Streptomyces avermitilis and the implications for their metabolism. Microbiology. 2005;151:2175–2187. doi: 10.1099/mic.0.27833-0.
- 17. Karlin S., Mrazek J. Predicted highly expressed genes of diverse prokaryotic genomes. J. Bacteriol. 2000;182:5238–5250. doi: 10.1128/jb.182.18.5238-5250.2000.
- 18. Karlin S., Mrazek J., Campbell A. M., Kaiser D. Characterizations of highly expressed genes of four fast-growing bacteria. J. Bacteriol. 2001;183:5025–5040. doi: 10.1128/JB.183.17.5025-5040.2001.
- 19. Karlin S., Mrazek J., Ma J., Brocchieri L. Predicted highly expressed genes in archaeal genomes. PNAS. 2005;102:7303–7308. doi: 10.1073/pnas.0502313102.
- 20. Mrazek J., Bhaya D., Grossman A. R., Karlin S. Highly expressed and alien genes of the Synechocystis genome. Nucleic Acids Res. 2001;29(7):1590–1601. doi: 10.1093/nar/29.7.1590.
- 21. Karlin S., Barnett M., Campbell A. M., Fisher R. F., Mrazek J. Predicting gene expression levels from codon biases in α-proteobacterial genomes. PNAS. 2003;100:7313–7318. doi: 10.1073/pnas.1232298100.
- 22. Karlin S., Brocchieri L., Mrazek J., Kaiser D. Distinguishing features of δ-proteobacterial genomes. PNAS. 2006;103:11352–11357. doi: 10.1073/pnas.0604311103.
- 23. Karlin S., Mrazek J. Comparative analysis of gene expression among low G+C gram-positive genomes. PNAS. 2004;101:6182–6187. doi: 10.1073/pnas.0401504101.
- 24. Eyre-Walker A. Synonymous codon bias is related to gene length in Escherichia coli: selection for translational accuracy. Mol. Biol. Evol. 1996;13:864–872. doi: 10.1093/oxfordjournals.molbev.a025646.
- 25. Coughlan A., Wolfe K. H. Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast. 2000;16:1131–1145. doi: 10.1002/1097-0061(20000915)16:12<1131::AID-YEA609>3.0.CO;2-F.
- 26. Martin-Galiano A. J., Wells J. M., de la Campa A. G. Relationship between codon biased genes, microarray expression values and physiological characteristics of Streptococcus pneumoniae. Microbiology. 2004;150:2313–2325. doi: 10.1099/mic.0.27097-0.
- 27. dos Reis M., Wernisch L., Savva R. Unexpected correlations between gene expression and codon usage bias from microarray data for the whole Escherichia coli K-12 genome. Nucleic Acids Res. 2003;31(23):6976–6985. doi: 10.1093/nar/gkg897.
- 28. Semon M., Mouchiroud D., Duret L. Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance. Hum. Mol. Genet. 2005;14:421–427. doi: 10.1093/hmg/ddi038.
- 29. Goncalves I., Duret L., Mouchiroud D. Nature and structure of human genes that generate retropseudogenes. Genome Res. 2000;10:672–678. doi: 10.1101/gr.10.5.672.
- 30. Duret L. Evolution of synonymous codon usage in metazoans. Curr. Opin. Genet. Dev. 2002;12:640–649. doi: 10.1016/s0959-437x(02)00353-2.
- 31. Ponger L., Duret L., Mouchiroud D. Determination of CpG islands: expression in early embryo and isochore structure. Genome Res. 2001;11:1854–1860. doi: 10.1101/gr.174501.
- 32. Vinogradov A. E. Isochores and tissue specificity. Nucleic Acids Res. 2003;31:5212–5220. doi: 10.1093/nar/gkg699.
- 33. Urrutia A. O., Hurst L. D. The signature of selection mediated by expression on human genes. Genome Res. 2003;13:2260–2264. doi: 10.1101/gr.641103.
- 34. Tao H., Bausch C., Richmond C., Blattner F. R., Conway T. Functional genomics: expression analysis of Escherichia coli growing on minimal and rich media. J. Bacteriol. 1999;181:6425–6440. doi: 10.1128/jb.181.20.6425-6440.1999.
- 35. Richmond C. S., Glasner J. D., Mau R., Jin H., Blattner F. R. Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res. 1999;27(19):3821–3835. doi: 10.1093/nar/27.19.3821.
- 36. VanBogelen R. A., Abshire K. Z., Pertsemlidis A., Clark R. L., Neidhardt F. C. Gene-protein database of Escherichia coli K-12. In: Neidhardt F. C., Curtiss R. III, Ingraham J. L., Lin E. C. C., Umbarger H. E., editors. Escherichia coli and Salmonella: Cellular and Molecular Biology. 6th edn. Washington, D.C: ASM Press; 1996. pp. 2067–2117.
- 37. Pedersen S., Bloch P. L., Reeh S., Neidhardt F. C. Patterns of protein synthesis in E. coli: a catalog of the amount of 140 individual proteins at different growth rates. Cell. 1978;14:179–190. doi: 10.1016/0092-8674(78)90312-4.
- 38. Bloch P. L., Philips T. A., Neidhardt F. C. Protein identification on O’Farrell two dimensional gel: location of 81 Escherichia coli proteins. J. Bacteriol. 1980;141:1409–1420. doi: 10.1128/jb.141.3.1409-1420.1980.
- 39. Philips T. A., Bloch P. L., Neidhardt F. C. Protein identification on O’Farrell two dimensional gel: location of 55 Escherichia coli proteins. J. Bacteriol. 1980;144:1024–1033. doi: 10.1128/jb.144.3.1024-1033.1980.
- 40. Comeron J. M., Kreitman M., Aguade M. Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila. Genetics. 1999;151:239–249. doi: 10.1093/genetics/151.1.239.
- 41. Duret L., Mouchiroud D. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. PNAS. 1999;96:4482–4487. doi: 10.1073/pnas.96.8.4482.
- 42. Moriyama E. N., Powell J. R. Nucleic Acids Res. 1998;26:3188–3193. doi: 10.1093/nar/26.13.3188.
- 43. Merkl R. A survey of codon and amino acid frequency bias in microbial genomes focusing on translational efficiency. J. Mol. Evol. 2003;57(4):453–466. doi: 10.1007/s00239-003-2499-1.
- 44. Wagner A. Inferring lifestyle from gene expression patterns. Mol. Biol. Evol. 2000;17(12):1985–1987. doi: 10.1093/oxfordjournals.molbev.a026299.