Abstract
The genome supplies information on both the quality and quantity of the transcriptome. However, as it remains unknown how a cell determines transcript levels from the genome sequences, despite comprehensive knowledge of the cellular components involved, the quantity information held by the genome cannot as yet be derived from nucleotide sequences. The model presented here explains on a thermodynamic basis how the components decode the genome to form and maintain the transcriptome. The model describes the level of a transcript as a pseudo-equilibrium between velocities of synthesis and degradation, both of which are controlled by sequence-specific interactions between protein factors and nucleic acids. Each of the transcript levels can be described by a single equation expressing a function of the activity concentrations of the protein factors. Quantitative information in the genome can thus be transformed into constants determined from the nucleotide sequences. Using this model, the transcriptome can be traced back to the protein factors and the state of chromosome packaging. The total description of transcript levels allows the model to be verified through comparison of derived hypotheses with comprehensive measurements of the transcriptome. The hypotheses thus derived in the present study are well supported by experimental microarray data, confirming the appropriateness of the model.
INTRODUCTION
Organization of the large volumes of experimental data acquired to date requires an appropriate model to serve as a framework for analysis. The ability of such a model to integrate the data is critical and will affect the accuracy and potential utility of data comparisons. Objectivity is required in the model, particularly when the results are to be shared among researchers, such as for transcriptome analyses. Many transcriptome studies have employed versatile theoretical models for each step of data analysis, including normalization, noise treatment and data interpretation that includes validations for expressional changes (1–5). Unfortunately, in exchange for such versatility, the objectivity of the models is reduced. Some models even rely on arbitrary calculations to eliminate the linear responses of data (5). In general, poor objectivity or linearity prevents meaningful comparisons between multiple experiments, giving rise to inconsistencies among sets of analytical results.
This report presents an objective model that explains the determination of transcript levels based on the relationship between the genome and the cellular components. Objectivity is achieved by representing the biochemical processes in a cell in terms of thermodynamics, adopting a similar approach to other bottom-up research on a part of transcriptional control (6–11). The model sees the level of a transcript as a balance between velocities of synthesis and degradation, both of which are controlled by interactions between nucleotide sequences and protein factors, which are known to affect the levels of transcripts (12–14). The objectivity of the model allows it to be verified as an appropriate framework for analysis through comparison of derivable hypotheses with experimental results.
MODEL OVERVIEW
The basis of the proposed model is the existence of a pseudo-equilibrium between the synthesis and degradation of each transcript. A cell forms a closed system for mRNA; transcripts neither leave nor enter the cell. This means that the concentration of a transcript accumulated in a cell is determined by the velocity of synthesis and the velocity of degradation within the cell. It has been shown that each transcript has a unique half-life, i.e. the velocity of degradation is linear with respect to the concentration of the transcript (12). Consequently, the pseudo-equilibrium can be applied (6), and the state of the equilibrium determines the level of the transcript. Although the half-life and velocity of synthesis can change frequently within each cell, giving rise to transient departures from this pseudo-equilibrium estimation, these deviations are not expected to be large and will rapidly attenuate in accordance with the half-life of the mRNA. This pseudo-equilibrium approximately describes the balance between the velocity of transcription (vs) and the velocity of degradation (vd) for a particular gene g, as follows.
$$v_s = v_d \tag{1}$$
Although synthesis and degradation both involve multiple biochemical reactions, the velocities are determined by the slowest reaction as the rate-limiting step, as discussed below.
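The approach to the pseudo-equilibrium can be illustrated numerically. The following Python sketch assumes a constant synthesis velocity and first-order degradation, d[mRNA]/dt = vs − kd[mRNA], with kd obtained from the half-life. The function name, parameter values and units are hypothetical and serve only as an illustration.

```python
import math

def transcript_level(t, m0, v_s, half_life):
    """Closed-form solution of d[mRNA]/dt = v_s - k_d * [mRNA].

    Assumes a constant synthesis velocity v_s and first-order decay
    with k_d = ln(2) / half_life (an illustrative simplification).
    """
    k_d = math.log(2) / half_life
    steady = v_s / k_d  # level at which synthesis balances degradation
    return steady + (m0 - steady) * math.exp(-k_d * t)

# Hypothetical values: arbitrary concentration units, time in minutes.
v_s, half_life = 2.0, 20.0
steady_level = v_s / (math.log(2) / half_life)

for t in (0, 20, 40, 100, 200):
    level = transcript_level(t, m0=0.0, v_s=v_s, half_life=half_life)
    print(f"t={t:4d} min  level={level:8.3f}  fraction={level / steady_level:.3f}")
```

Under these assumptions, a departure from the balanced level is halved with every half-life of the mRNA, which is why transient deviations are expected to attenuate rapidly.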
Regulation of mRNA synthesis
The rate-limiting step for mRNA synthesis is likely to occur at the onset of RNA elongation. For elongation to begin, the RNA polymerase II bound at the promoter must be hyperphosphorylated (15). Each phosphorylation step is energy-dependent, and should thus be restricted. This step cannot occur in parallel in the cell, and is unique to each gene. The velocity at this step can be described by considering a rapid pre-equilibrium between the binding and dissociation of RNA polymerase II with the promoter:
$$\mathrm{promoter} + \mathrm{polymerase} \rightleftharpoons \mathrm{complex} \tag{2}$$
At a certain frequency, the bound polymerase obtains the potential energy required to overcome the energy barrier, and then initiates transcription. Consequently, vs can be described as a mathematical expectation determined by the concentration of the promoter–polymerase complex ([complex]) and the frequency (height of the barrier in Figure 1), as given by
$$v_s = k_s\,[\mathrm{complex}] \tag{3}$$
where ks is a coefficient describing the frequency at which the polymerase enters the elongation state.
The concentration of the promoter–polymerase complex can be expressed as an equilibrium constant (Kp), as follows.
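$$K_p = \frac{[\mathrm{complex}]}{[\mathrm{promoter}]\,[\mathrm{polymerase}]} = \frac{[\mathrm{complex}]}{\left(a_g P_0 - [\mathrm{complex}]\right)[\mathrm{polymerase}]} \tag{4}$$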
Here, [promoter] is the concentration of the polymerase-free promoter, and ag is the local activity of the genome, which represents the state of chromatin packing by histones at the promoter and takes values between zero and unity. P0 is the total concentration of the promoter. Solving Equation 4 leads to the concentration of the polymerase bound at the promoter ([complex]):
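$$[\mathrm{complex}] = \frac{a_g P_0 K_p\,[\mathrm{polymerase}]}{1 + K_p\,[\mathrm{polymerase}]} \approx a_g P_0 K_p\,[\mathrm{polymerase}] \tag{5}$$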
This approximation is based on the assumption
$$K_p\,[\mathrm{polymerase}] \ll 1 \tag{6}$$
This pre-equilibrium condition thus tends to favor the dissociated state. This assumption is introduced to allow the state of pre-equilibrium to determine the velocity in Equation 3; only under this condition can the pre-equilibrium define the frequency or length of time of complex formation, during which the polymerase is susceptible to the initiation of synthesis (Figure 1).
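The quality of this approximation can be checked with a short calculation. The sketch below assumes the mass balance [promoter] = ag·P0 − [complex] and that free polymerase is in excess, so that the exact occupancy is ag·P0·x/(1 + x) with x = Kp[polymerase]. The function names and numerical values are hypothetical.

```python
def complex_exact(a_g, p0, k_p, polymerase):
    """Exact occupancy from the binding equilibrium (free polymerase assumed in excess)."""
    x = k_p * polymerase
    return a_g * p0 * x / (1.0 + x)

def complex_approx(a_g, p0, k_p, polymerase):
    """Linearized occupancy, valid when k_p * polymerase is much smaller than 1."""
    return a_g * p0 * k_p * polymerase

def relative_error(k_p, polymerase, a_g=0.5, p0=1.0):
    """Fractional overestimate made by the linearized form."""
    exact = complex_exact(a_g, p0, k_p, polymerase)
    approx = complex_approx(a_g, p0, k_p, polymerase)
    return (approx - exact) / exact

# Hypothetical, dimensionless values of k_p * [polymerase].
for x in (1e-3, 1e-2, 1e-1, 1.0):
    print(f"Kp[polymerase]={x:g}  relative error={relative_error(x, 1.0):.4f}")
```

In this simplified setting the relative error of the linear form equals Kp[polymerase] itself, so the approximation holds to within a few percent only when the equilibrium strongly favors the dissociated state.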
The equilibrium constant Kp is determined by regulators, which are protein factors that bind around the promoter in a sequence-specific manner. Each of the regulators contacts the polymerase, affecting the equilibrium constant with a certain Gibbs free energy (ΔG). The equilibrium constant is determined by the composite of the free energy as follows:
$$K_p = \exp\!\left(-\frac{\sum \Delta G}{RT}\right) \tag{7}$$
Here, R is the gas constant and T is the absolute temperature. The Gibbs free energy can be further described in terms of regulators, which are in equilibrium between binding and dissociation with particular cis elements. Using the activity concentration of free regulators ([regulator]), the Gibbs free energy can thus be rewritten as
$$\sum \Delta G = \sum_{\mathrm{regulators}} k_r K_c\,[\mathrm{regulator}] \tag{8}$$
where Kc is an equilibrium constant and kr is an activity constant. The constant Kc indicates the affinity of each regulator for the cis element, and is determined by the nucleotide sequence. The constant kr, in contrast, indicates the function of each regulator with respect to the gene, and is determined by the structure of the regulator and its spatial relationship with the polymerase. The nucleotide sequence around the promoter directs both of these determinants of kr by selecting the regulator and fixing its position on binding. For genes hosting multiple binding sites for a regulator, the sum of Kckr values for all the binding sites represents a functional value for the combination of the regulator and the gene.
The frequency with which a bound polymerase enters the elongation state is affected by the stimulation of the mediator complex by enhancer-binding regulators (Figure 1). The complex phosphorylates the polymerase, changing the charge of the enzyme (15), which has been prevented from elongation by Coulombic force (13,14). The reduction of Coulombic force lowers the Arrhenius activation energy (Ep) for the initiation of elongation. According to the Arrhenius equation, the parameter ks in Equation 3 can be represented by the composition of activation energies related to enhancer binding protein factors, as given by
$$k_s = A_p \exp\!\left(-\frac{E_p}{RT}\right) \tag{9}$$
where Ap is a constant specific to the polymerase. Using a similar estimation for the function of promoter-binding regulator proteins, the activation energy can be described as
$$E_p = \sum \Delta E_p = \sum_{\mathrm{regulators}} k_m K_e\,[\mathrm{Eregulator}] \tag{10}$$
where Ke is the equilibrium constant, km is an activity constant and [Eregulator] is the concentration of each free enhancer-binding regulator. The functions of the activity and equilibrium constants and the treatment for multiple binding sites are the same as in Equation 8. Equations 3, 5, 7 and 9 can finally be solved for the transcription velocity of the gene (vs) as follows:
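$$v_s = a_g P_0\,[\mathrm{polymerase}]\,A_p \exp\!\left(-\frac{\sum \Delta G + E_p}{RT}\right) \tag{11}$$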
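As an illustration of how the synthesis velocity responds to regulators, the following Python sketch evaluates vs = ag·P0·[polymerase]·Ap·exp(−(ΣΔG + Ep)/RT). It assumes that each composite energy is a sum of terms of the form k·K·[factor] over binding sites; this linear form, the function names, the units and all numerical values are assumptions made for the example.

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1

def composite_energy(sites):
    """Sum k * K * [factor] over binding sites.

    Each site is a tuple (k, K, conc): activity constant (J/mol, signed),
    equilibrium constant (1/M) and free activity concentration (M).
    The linear form is an assumption of this sketch.
    """
    return sum(k * K * conc for k, K, conc in sites)

def synthesis_velocity(a_g, p0, polymerase, a_p,
                       promoter_sites, enhancer_sites, temperature=298.15):
    """v_s = a_g * P0 * [polymerase] * A_p * exp(-(sum dG + E_p) / RT)."""
    delta_g = composite_energy(promoter_sites)  # promoter-binding regulators
    e_p = composite_energy(enhancer_sites)      # enhancer-binding regulators
    return a_g * p0 * polymerase * a_p * math.exp(-(delta_g + e_p) / (R * temperature))

# Hypothetical gene: no regulators bound versus one activating regulator.
v_basal = synthesis_velocity(0.5, 1.0, 2.0, 3.0, [], [])
activator = [(-5000.0, 1e6, 1e-7)]  # a negative k lowers the free energy
v_activated = synthesis_velocity(0.5, 1.0, 2.0, 3.0, activator, [])
print(v_basal, v_activated, v_activated / v_basal)
```

In this sketch a regulator with a negative activity constant raises the velocity and one with a positive constant lowers it, in both cases by a multiplicative factor.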
Regulation of mRNA degradation
The rate-limiting step for degradation is likely to occur in the shortening or removal of the poly(A) tail (Figure 1). This slow step is immediately followed by decapping and then rapid removal of nucleotides from both termini. This sequential process is considered to be a common mechanism of degradation for properly synthesized mRNAs. The slow step is catalyzed by motif-specific nucleases, i.e. poly(A)-specific exonucleases and site-specific endonucleases. Inhibitor proteins that bind the same motifs, competing with the RNases, are also known. Each transcript may have a motif that determines the transcript's stability (12,16).
The velocity of degradation is thus considered to be proportional to the concentration of the transcript of the gene ([mRNAg]), as given by
$$v_d = k_d\,[\mathrm{mRNA}_g] \tag{12}$$
where kd is a velocity constant determined by protein factors bound to the transcript in a motif-specific manner. As an enzyme or an inhibitor of an enzyme, an RNA-binding factor affects the activation energy of RNA hydrolysis (Ed) in the Arrhenius equation. The constant kd in Equation 12 can be described in terms of this energy as
$$k_d = A_d \exp\!\left(-\frac{E_d}{RT}\right) \tag{13}$$
where Ad is a constant specific to RNA hydrolysis. Considering a rapid equilibrium between the binding and dissociation of RNA-binding proteins to the determinant motif of the mRNA, the composite of the activation energy is determined as
$$E_d = \sum \Delta E_d = \sum_{\mathrm{factors}} k_h K_b\,[\text{RNA-binding protein}] \tag{14}$$
where kh is an activity constant, Kb is an equilibrium constant and [RNA-binding protein] represents the activity concentration of each factor.
Accumulated amount of each transcript
Equations 1 and 11–13 lead to the following relationship between the energy parameters and the concentration of the transcript:
$$[\mathrm{mRNA}_g] = k_c\,a_g \exp\!\left(-\frac{\sum \Delta G + E_p - E_d}{RT}\right) \tag{15}$$
Here, kc is a constant specific to a cell and is given by kc = ApP0[polymerase]/Ad. The parameters ΣΔG, Ep and Ed are determined by the activity concentrations of the protein factors bound to specific nucleotide sequences of DNA or RNA as described in Equations 8, 10 and 14, respectively.
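For readers who wish to follow the algebra, one route to this relationship is sketched below. The sketch assumes that the synthesis velocity carries a Boltzmann factor for the composite free energy ΣΔG and an Arrhenius factor for Ep, and that the degradation constant carries an Arrhenius factor for Ed; the logarithmic form in the last line is added here for convenience.

```latex
\begin{align*}
v_s &= v_d \\
a_g P_0\,[\mathrm{polymerase}]\,A_p
  \exp\!\left(-\frac{\sum \Delta G + E_p}{RT}\right)
  &= A_d \exp\!\left(-\frac{E_d}{RT}\right)[\mathrm{mRNA}_g] \\
[\mathrm{mRNA}_g]
  &= \frac{A_p P_0\,[\mathrm{polymerase}]}{A_d}\;a_g
  \exp\!\left(-\frac{\sum \Delta G + E_p - E_d}{RT}\right) \\
\ln[\mathrm{mRNA}_g]
  &= \ln k_c + \ln a_g - \frac{\sum \Delta G + E_p - E_d}{RT},
  \qquad k_c = \frac{A_p P_0\,[\mathrm{polymerase}]}{A_d}
\end{align*}
```

Written in this way, the logarithm of the transcript level is linear in the energy composites, which is the property used in the hypotheses that follow.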
HYPOTHESES AND VERIFICATION
Lognormality in data distribution
For verification of the proposed model, a number of hypotheses derived from the model were tested against the relevant characteristics of measured transcriptome data. The first feature of transcriptome data considered is the statistical distribution of concentrations of transcripts in a cell. The model predicts that the protein factors (Equations 8, 10 and 14) will be basically independent of other factors, while the distributions of these factors with respect to ΔG, ΔEp and ΔEd will have common characteristics owing to the commonality of the physical bases. The composites of the energies can thus be considered to be the sums of independent, identically distributed variables. According to the central limit theorem, the distribution of (ΣΔG + Ep − Ed) in Equation 15 will be approximately normal, leading to a lognormal distribution of [mRNAg].
This hypothesis can be verified by comparison with comprehensive transcriptome data. Expressional microarray data for any organism on any analytical platform follows a three-parameter lognormal distribution (17), with an additional third parameter for compensation of signal background. This three-parameter distribution has been observed repeatedly in hundreds of experimental datasets on different platforms, providing strong support for this model-derived hypothesis.
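The central limit argument can be illustrated with a small simulation. In the Python sketch below, the composite energy of each hypothetical gene is a sum of independent, identically distributed terms; the uniform distribution, the number of terms and the use of RT as the energy unit are arbitrary assumptions, and the signal background that motivates the third parameter is not modeled.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness; close to zero for a symmetric, normal-like distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

random.seed(0)
N_GENES = 5000  # hypothetical number of genes
N_TERMS = 12    # hypothetical number of energy terms per gene

# Composite energy of each gene in units of RT: a sum of i.i.d. terms
# (uniformly distributed here, purely by assumption).
log_levels = [-sum(random.uniform(-1.0, 1.0) for _ in range(N_TERMS))
              for _ in range(N_GENES)]
levels = [math.exp(x) for x in log_levels]  # transcript levels up to a constant

print("skewness of log levels:", round(skewness(log_levels), 3))
print("skewness of raw levels:", round(skewness(levels), 3))
```

The logarithms of the simulated levels are nearly symmetric, whereas the levels themselves are strongly right-skewed, as expected for a lognormal distribution.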
Stability in scale parameter
The model also predicts a stable characteristic for the scale parameter of the log[mRNAg] distribution. According to Equation 15, the scale parameter is defined by the distributions of energetic parameters. These distributions would be stable and common because all can be considered to represent the sums of variables. Changes in the parameters therefore indicate a total shift in protein–nucleic acid interactions, and such shifts will require changes in the conditions of the cell, such as salt content or pH. However, the conditions of the cell will remain stable owing to the homeostatic character of cells in general. The stability of the scale parameter has been observed experimentally (17), supporting this prediction. The stability and mode of distribution are checked routinely in experiments as part of the microarray normalization process.
Multiplicative effects on [mRNA]
Another hypothesis derivable from the proposed model is the multiplicative change in transcript concentration by each protein factor. According to Equations 8, 10, 14 and 15, additive changes in the activity concentrations of the factors will cause additive changes in energy, which will in turn cause multiplicative changes in [mRNA]. Generally, a stimulus applied to a cell induces a change in the activity concentrations of certain protein factors, which in turn affects the concentrations of certain transcripts. These effects can be measured by conducting paired microarray experiments in which the changes are measured as ratios (17). If the hypothesis is correct, the ratios resulting from the application of a pair of simultaneous stimuli will coincide with the products of the ratios provoked by each individual stimulus. Such combinations of measurements have been reported in a series of experiments on the effects of environmental changes in yeast (18). In these experiments, a linear relationship was obtained for numerous combinations of stimuli following the products of the individual ratios (Figure 2A). It should be noted that the slight overestimation of the product can be attributed to the effect of medium replacement (18), which is counted twice in the product calculations. This correlation could not have been obtained by chance or systematic error, since the replacement of any of the ratios in a combination with another in a different time phase (i.e. a different stage of response) almost eliminates the relationship (Figure 2B). Thus, the multiplicative effect of factors predicted by the present model is supported by experimental findings.
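The arithmetic behind this hypothesis can be made explicit with a toy calculation. The Python sketch below assumes that the transcript level is proportional to exp(−E/RT), where E is the sum of signed energy contributions from the factors, and that the two stimuli shift the contributions of different factors. The factor names and values are hypothetical.

```python
import math

RT = 1.0  # energies are expressed in units of RT (assumption)

def level(k_c, a_g, energy_terms):
    """[mRNA] = k_c * a_g * exp(-(sum of signed energy terms) / RT)."""
    return k_c * a_g * math.exp(-sum(energy_terms.values()) / RT)

def with_shift(terms, shifts):
    """Return a copy of the energy terms with additive shifts applied."""
    return {name: value + shifts.get(name, 0.0) for name, value in terms.items()}

# Hypothetical energy contributions of three factors to one gene.
baseline = {"factor_A": 0.4, "factor_B": -0.2, "factor_C": 0.1}

# Each stimulus shifts the contribution of a different factor.
stimulus_1 = {"factor_A": -0.7}
stimulus_2 = {"factor_B": 0.5}
both = {**stimulus_1, **stimulus_2}

base = level(1.0, 1.0, baseline)
ratio_1 = level(1.0, 1.0, with_shift(baseline, stimulus_1)) / base
ratio_2 = level(1.0, 1.0, with_shift(baseline, stimulus_2)) / base
ratio_both = level(1.0, 1.0, with_shift(baseline, both)) / base

print(ratio_1 * ratio_2, ratio_both)
```

Because the shifts add in the exponent, the ratio for the combined stimuli equals the product of the individual ratios, and on a logarithmic scale the individual log ratios simply add.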
DISCUSSION
The proposed model describes transcriptome formation in a cell on the basis of thermodynamic expressions. The model is expected to provide both the fidelity of a bottom-up approach and the objectivity required for evaluating its appropriateness. The observed characteristics of transcriptome data support the model. Although the verifications presented above are somewhat indirect, the objectivity of the proposed model should allow many other types of verification using transcriptome analyses.
The model can be used to identify the quantitative aspects of genome information giving the transcriptome. The genome is presented by the model in terms of two series of constants (equilibrium and activity constants) as a decoded form of the quantitative information. The constants in a genome locus can be determined through in vitro kinetic experiments. With accumulation of such measured data, it will become possible to predict the values by simulations in silico.
The nucleotide sequences do not encode all of the information necessary to reproduce the chromatin structure or factor activity concentrations, giving rise to the observed variety among transcriptomes. The structure is controlled by chemical modifications of histones (19), which are directed by covalent modification of the DNA. Tight restrictions on the changes in the nucleosome effectively preserve the modifications of histones at the locus when the genome is replicated (20). Alterations of DNA may instead be introduced during the development of multicellular organisms (21). Accordingly, such modifications seem to play an important role in cell differentiation, causing static changes in the transcriptome. Thus, the value of ag is expected to be relatively stable.
This model provides objectivity for transcriptome data analyses. For example, the synergy of additive effects, as commonly observed in combinations of stimuli or the artificial induction of factors (9,10), can be assessed in a more objective manner by the proposed model as the product of effects (Figure 2A). This shows that the protein factors that changed the expression levels within the short time-course experiment acted practically independently of each other. The independence supports the assumption that the pre-equilibrium condition of many protein factors tends to favor the dissociated state; any two factors that have neighboring or overlapping binding sites rarely interfere with each other, since the chance that the factors encounter each other at the site is rather small. Of course, this observation, obtained by a transcriptome analysis of yeast, does not rule out the existence of dependences between factors, including interference or stabilization. For example, some constitutive factors may tend to favor the bound state; such constitutive factors can bind to certain sites and affect the binding of other factors for a long period, producing a tendency in the expression levels of the corresponding genes. However, a constitutive factor would influence only a limited number of genes and factors, since it can affect only molecules that are reachable in its bound state. Additionally, if dependences were common among the effects of factors, they would violate the conditions required by the central limit theorem, conflicting with the observed distribution of transcriptome data.
The transcriptome is represented in the model as a series of functions of protein factor activity concentrations (Equations 8, 10, 14 and 15). Protein factors can be expected to be dispersed uniformly in the corresponding cellular compartments owing to the rapid diffusion typical of large soluble molecules (22). Consequently, the activity concentration of a protein factor is likely to be common to all genes. Equation 15 can thus be generalized to any gene, providing a series of simultaneous equations covering all genes in a genome (each equation forms a row in the spreadsheet shown in Figure 3). The low concentrations and substantial variations of specific activities owing to chemical modifications render the activity concentrations difficult to measure at present. However, if the activity and equilibrium constants k and K and the local activity ag are available, the activity concentrations can be derived from transcriptome data by solving the set of simultaneous equations. The set of activity concentrations can then be used to trace the changes in the transcriptome back to changes in each factor. The effect of changes in the factors can also be estimated by substituting activity concentration values, allowing the proper set of stimuli required to obtain the desired transcriptome to be predicted by simulation.
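A minimal sketch of such a calculation is given below for a toy system of three genes and three factors. It assumes that the logarithm of each transcript level is linear in the activity concentrations, ln[mRNAg] = ln kc + ln ag − (1/RT) Σf cgf·xf, where cgf stands for the signed product of the activity and equilibrium constants of factor f at gene g. The coefficient matrix, the concentrations and the helper names are hypothetical, and a real dataset would generally call for a least-squares treatment instead of an exact solution.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

RT, K_C = 1.0, 10.0                 # energy unit and cell constant (assumed)
coeffs = [[-2.0, 0.5, 0.0],         # rows: genes, columns: factors
          [0.3, -1.0, 0.8],
          [0.0, 0.6, -1.5]]
a_g = [0.9, 0.5, 0.7]               # local genome activities
true_conc = [0.2, 0.1, 0.4]         # activity concentrations to be recovered

# Forward step: transcript levels implied by the assumed concentrations.
levels = [K_C * a * math.exp(-sum(c * x for c, x in zip(row, true_conc)) / RT)
          for row, a in zip(coeffs, a_g)]

# Inverse step: recover the concentrations from the levels.
rhs = [RT * (math.log(K_C) + math.log(a) - math.log(m))
       for a, m in zip(a_g, levels)]
recovered = solve_linear(coeffs, rhs)
print(recovered)
```

In this toy case the concentrations used to generate the levels are recovered exactly, because the number of genes matches the number of factors and the coefficient matrix is non-singular.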
METHODS
Estimation of simultaneous stimuli effect
Data for heat shock, hypo-osmotic shock and the simultaneous application of both stimuli (18) were calculated by a parametric method (17). The combined effect of the two stimuli was estimated from the individual effects (measured in experiments 7547 and 2555) by multiplying the obtained ratios. The logarithms of the estimated ratios were plotted against the log ratios for experimental data (experiment 4787; Figure 2A). For comparison, the data from experiment 4787 were exchanged with those from experiment 4786, representing data for a different time point in an identical experiment (Figure 2B). (Calculated data as well as parameters and raw data are provided in the Supplementary Data sheet.)
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
Acknowledgments
Funding to pay the Open Access publication charges for this article was provided by subsidy for scientific research of Akita Prefectural University.
Conflict of interest statement. None declared.
REFERENCES
1. Eisen M., Spellman P., Brown P., Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
2. Friedman N., Linial M., Nachman I., Pe'er D. Using Bayesian Network to analyze expression data. J. Comput. Biol. 2000;7:601–620. doi: 10.1089/106652700750050961.
3. Strogatz S. Exploring complex networks. Nature. 2001;410:268–276. doi: 10.1038/35065725.
4. Lee T., Rinaldi N., Robert F., Odom D., Bar-Joseph Z., Gerber G., Hannett N., Harbison C., Thompson C., Simon I., et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002;298:799–804. doi: 10.1126/science.1075090.
5. Quackenbush J. Microarray data normalization and transformation. Nature Genet. 2003;32:496–501. doi: 10.1038/ng1032.
6. Hargrove J.L., Hulsey M.G., Beale E.G. The kinetics of mammalian gene expression. Bioessays. 1991;13:667–674. doi: 10.1002/bies.950131209.
7. Herschlag D., Johnson F. Synergism in transcriptional activation: a kinetic view. Genes Dev. 1993;7:173–179. doi: 10.1101/gad.7.2.173.
8. Chi T., Lieberman P., Ellwood K., Carey M. A general mechanism for transcriptional synergy by eukaryotic activators. Nature. 1995;377:254–257. doi: 10.1038/377254a0.
9. Carey M. The enhanceosome and transcriptional synergy. Cell. 1998;92:5–8. doi: 10.1016/s0092-8674(00)80893-4.
10. Wang J., Ellwood K., Lehman A., Carey M., She Z. A mathematical model for synergistic eukaryotic gene activation. J. Mol. Biol. 1999;286:315–325. doi: 10.1006/jmbi.1998.2489.
11. Tsujikawa L., Tsodikov O., deHaseth P. Interaction of RNA polymerase with forked DNA: evidence for two kinetically significant intermediates on the pathway to the final complex. Proc. Natl Acad. Sci. USA. 2002;99:3493–3498. doi: 10.1073/pnas.062487299.
12. Ross J. Control of messenger RNA stability in higher eukaryotes. Trends Genet. 1996;12:171–175. doi: 10.1016/0168-9525(96)10016-0.
13. Myers L.C., Kornberg R.D. Mediator of transcriptional regulation. Ann. Rev. Biochem. 2000;69:729–749. doi: 10.1146/annurev.biochem.69.1.729.
14. Naar A.M., Lemon B.D., Tjian R. Transcriptional coactivator complexes. Ann. Rev. Biochem. 2001;70:475–501. doi: 10.1146/annurev.biochem.70.1.475.
15. Palancade B., Bensaude O. Investigating RNA polymerase II carboxyl-terminal domain (CTD) phosphorylation. Eur. J. Biochem. 2003;270:3859–3870. doi: 10.1046/j.1432-1033.2003.03794.x.
16. Doyle G., Betz N., Leeds P., Fleisig A., Prokipcak R., Ross J. The c-myc coding region determinant-binding protein: a member of a family of KH domain RNA-binding proteins. Nucleic Acids Res. 1998;26:5036–5044. doi: 10.1093/nar/26.22.5036.
17. Konishi T. Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment. BMC Bioinformatics. 2004;5:5. doi: 10.1186/1471-2105-5-5.
18. Gasch A., Spellman P., Kao C., Carmel-Harel O., Eisen M., Storz G., Botstein D., Brown P. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell. 2000;11:4241–4257. doi: 10.1091/mbc.11.12.4241.
19. Strahl B., Allis C. The language of covalent histone modifications. Nature. 2000;403:41–45. doi: 10.1038/47412.
20. Becker P., Horz W. ATP-dependent nucleosome remodeling. Annu. Rev. Biochem. 2002;71:247–273. doi: 10.1146/annurev.biochem.71.110601.135400.
21. Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. 2002;16:6–21. doi: 10.1101/gad.947102.
22. Verkman A. Solute and macromolecule diffusion in cellular aqueous compartments. Trends Biochem. Sci. 2002;27:27–33. doi: 10.1016/s0968-0004(01)02003-5.