Abstract
Background
Lowered sensitivity to the effects of ethanol increases the risk of developing alcoholism. Inbred mouse strains have been useful for the study of the genetic basis of various drug addiction-related phenotypes. Inbred Long-Sleep (ILS) and Inbred Short-Sleep (ISS) mice differentially express a number of genes thought to be implicated in sensitivity to the effects of ethanol. Concomitantly, there is evidence for a mediating role of cAMP/PKA/CREB signalling in aspects of alcoholism modelled in animals. In this report, the extent to which CREB signalling impacts the differential expression of genes in ILS and ISS mouse cerebella is examined.
Results
A training dataset for Machine Learning (ML) and Exploratory Data Analyses (EDA) was generated from promoter region sequences of a set of genes known to be targets of CREB transcription regulation and a set of genes whose transcription regulations are potentially CREB-independent. For each promoter sequence, a vector of size 132, with elements characterizing nucleotide composition features was generated. Genes whose expressions have been previously determined to be increased in ILS or ISS cerebella were identified, and their CREB regulation status predicted using the ML scheme C4.5. The C4.5 learning scheme was used because, of four ML schemes evaluated, it had the lowest predicted error rate. On an independent evaluation set of 21 genes of known CREB regulation status, C4.5 correctly classified 81% of instances with F-measures of 0.87 and 0.67 respectively for the CREB-regulated and CREB-independent classes. Additionally, six out of eight genes previously determined by two independent microarray platforms to be up-regulated in the ILS or ISS cerebellum were predicted by C4.5 to be transcriptionally regulated by CREB. Furthermore, 64% and 52% of a cross-section of other up-regulated cerebellar genes in ILS and ISS mice, respectively, were deemed to be CREB-regulated.
Conclusion
These observations collectively suggest that ethanol sensitivity, as it relates to the cerebellum, may be associated with CREB transcription activity.
Background
Animal models have facilitated the investigation of the mechanisms of several diseases. For drug addiction in particular, inbred mouse strains have proved to be invaluable [1,2], and have facilitated the mapping of aspects of addiction-related behaviour to specific genetic loci. Inbred Long Sleep (ILS) and Inbred Short Sleep (ISS) mice, for instance, present many contrasts with respect to a number of alcoholism-related phenotypes [3-6]. They have been widely used to model ethanol sensitivity [7,8]. Ethanol sensitivity has a genetic basis [9], the comprehensive workings of which remain elusive. Consequently, a comparison of relevant brain region transcriptomes of ILS and ISS mice has the potential to reveal unique patterns of gene expression [10] that could be relevant to the mechanisms of alcoholism.
The cerebellum has long been almost exclusively associated with balance and motor co-ordination. It has relatively recently been found to be more involved with cognition than previously thought [11]. During neurodevelopment, the cerebellum is especially susceptible to ethanol toxicity [12]. Studies indicate a role for activation of the cerebellum in alcoholism. A Functional Magnetic Resonance Imaging study has indicated that ethanol odour-induced craving in untreated recently abstinent male alcoholics involves activation of the cerebellum along with the subcortical-limbic region of the right amygdala/hippocampal area [13]. Positron Emission Tomography studies in drug addiction similarly indicate a role for cerebellar activation [14,15]. The identification of specific pathways contributing to alcoholism-related events in the cerebellum would, therefore, be important.
The phosphoinositide (PI) and cyclic adenosine 3',5'-monophosphate (cAMP) signalling pathways have long been thought to be important in the development of ethanol dependence and tolerance [16]. Several lines of evidence suggest a role for the cAMP/protein kinase A (PKA)/cAMP-response-element-binding protein (CREB) signalling pathway in addiction, even though they do not necessarily involve the cerebellum. Alcohol-preferring (P) rats have lower levels of CREB and the transcriptionally-active phospho-CREB in the medial amygdala and central amygdala (CeA) than non-preferring (NP) rats [17]. Ethanol administration (or PKA activator [Sp-cAMP] administration into the CeA) increases CREB function in the CeA of P (but not NP) rats. Also, 24 hours following a single intra-peritoneal 2 mg/kg ethanol dose to C57BL/6J mice, there is long-term potentiation of GABA synaptic transmission at Ventral Tegmental Area dopaminergic neurons, via a cAMP-PKA-dependent mechanism [18]. One mechanism by which ethanol increases CREB levels involves inhibition of adenosine reuptake, which results in increases in extracellular adenosine and activation of the adenosine A2 receptor, leading to increases in cAMP levels [19]. The ethanol-induced increase in CRE-mediated gene transcription requires PKA and involves an adenosine receptor-dependent phase and a later adenosine receptor-independent phase [20].
The emergence of high-throughput data has facilitated the study of patterns of transcription. Machine Learning (ML) is one avenue for mining such data [21]. It concentrates on methods by which computer programs improve their performance (i.e. modify their behaviour) by learning from previous data examples. ML is useful for the purpose of class prediction. During the learning process, structural patterns in the "training set" are established; these then constitute the basis upon which predictions are made when the scheme is presented with data of unknown classification (the "test set").
In the current studies, genes found to be differentially expressed in the cerebella of ILS and ISS mice [22] were examined to identify the extent to which CREB transcription regulates addiction mechanisms in the cerebellum. Nucleotide sequences of the promoter regions of various genes were analyzed to generate the data used for ML. The Composition, Transitions, and Distributions [23] of individual nucleotide bases as well as groups of nucleotide bases (Table 1), along with the presence and relative positions of specific cis elements were the basis on which genes were classified as being either transcriptionally CREB regulated or otherwise. The results reveal a strong pattern, in the cerebellum, of CREB regulation among genes differentially expressed between ILS and ISS mice.
Table 1.
GROUP | MEMBERS |
Purine | A, G |
Pyrimidine | C, T |
Strong Hydrogen Bonding | C, G |
Weak Hydrogen Bonding | A, T |
Keto | T, G |
Amino | A, C |
Results
Four ML schemes were evaluated: a Decision Tree (J48, an implementation of the C4.5 algorithm), a Support Vector Machine (SVM), a Naïve Bayes classifier (NN) and a Multi-layer Perceptron (MLP). Two alternate models for ML were tested in this study, using a dataset of 46 instances and two classes. These were:
• a two-class model with the classifications "CREB-regulated" and "NOT CREB-regulated", and
• a three-class model with a third classification, "Nrf2-regulated" [24].
Nrf2 (NF-E2-related factor 2), the primary transcription factor that binds the Antioxidant Response Element (ARE), was selected because, like CREB, Nrf2 is a ubiquitous transcription factor. Secondly, it has a requirement for CREB Binding Protein for enhanced transcription activity [25]. Using the leave-one-out cross-validation technique, the two-class model had lower Mean Absolute Error rates for all learning schemes explored than the three-class model (Figure 1A). Also, of the four schemes and two models evaluated, the area under the Receiver Operating Characteristic (ROC) curve, a measure of test accuracy, was highest for the C4.5 scheme under the two-class model (Figure 1A).
Of the four ML schemes, using the leave-one-out cross-validation technique and the two-class model, the C4.5 Decision Tree algorithm had the lowest overall predicted error rate (Figure 1B; Table 2). Its ROC curve was closest to the left-hand border and the top border of the ROC space (Figure 2 and Additional File 1), indicating that it had the best trade-off between sensitivity and specificity among the four schemes evaluated. It also had the highest area under the ROC curve under both cross-validation techniques, although SVM scored higher on the small independent evaluation set (Table 3). The C4.5 Decision Tree algorithm [26] works top-down, seeking at each stage an attribute that best separates the classes. The attribute with the greatest information gain is chosen. It then recursively processes the sub-problems resulting from the split until the information either reaches a maximum or is zero. The information measure (entropy) is calculated thus:
Table 2.
MEASURE | C4.5 | SVM | NN | MLP |
PERCENT CORRECT | 69.57 (46.06) | 50.00 (50.05) * | 58.70 (49.29) | 55.00 (49.80) |
MEAN ABSOLUTE ERROR | 0.18 (0.28) | 0.33 (0.11) * | 0.28 (0.33) | 0.30 (0.30) * |
RELATIVE ABSOLUTE ERROR | 52.51 (80.99) | 96.00 (32.03) * | 79.30 (94.63) | 85.53 (86.77) * |
ROOT MEAN SQUARED ERROR | 0.22 (0.34) | 0.41 (0.14) * | 0.34 (0.40) | 0.36 (0.37) * |
ROOT RELATIVE SQUARED ERROR | 53.54 (82.59) | 97.90 (32.67) * | 80.86 (96.50) | 87.11 (88.44) * |
The standard deviation of each value is given in parentheses
**Leave-one-out technique, i.e. 46-fold cross-validation, performed with ten iterations each
*Corrected Resampled t-test [46]; difference from the corresponding C4.5 value is statistically significant (p = 0.05, two-tailed)
Table 3.
SCHEME | INDEPENDENT EVALUATION SET | TEN-FOLD CROSS VALIDATION | LEAVE-ONE-OUT |
C4.5 | 0.8563 | 0.7722 | 0.7883 |
NN | 0.7875 | 0.5936 | 0.6352 |
SVM | 0.9063 | 0.5217 | 0.5 |
MLP | 0.85 | 0.6711 | 0.5577 |
$$\mathrm{Entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$$
where $p_1, p_2, \ldots, p_n$ are fractions representing the data distribution at a node (attribute) and sum to 1.
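The entropy measure and the information gain used to choose a splitting attribute can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, the example split is invented for demonstration and is not taken from the decision tree of this study, and the published C4.5 algorithm additionally normalises the gain (gain ratio), which is omitted here.

```python
import math

def entropy(fractions):
    """Entropy in bits of a class distribution given as fractions summing to 1."""
    # Terms with p == 0 are skipped because p * log2(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in fractions if p > 0)

def class_fractions(labels):
    """Fraction of instances belonging to each class present in labels."""
    total = len(labels)
    return [labels.count(c) / total for c in set(labels)]

def information_gain(parent_labels, subsets):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent_labels)
    remainder = sum(len(s) / n * entropy(class_fractions(s)) for s in subsets)
    return entropy(class_fractions(parent_labels)) - remainder

# A balanced two-class node, like the 23/23 training set described in the text.
parent = ["CREB"] * 23 + ["NOT"] * 23
# Hypothetical split on some attribute (illustration only).
left = ["CREB"] * 20 + ["NOT"] * 3
right = ["CREB"] * 3 + ["NOT"] * 20

balanced_entropy = entropy(class_fractions(parent))
gain = information_gain(parent, [left, right])
print(f"entropy of balanced node: {balanced_entropy:.3f} bits")
print(f"information gain of hypothetical split: {gain:.3f} bits")
```

A balanced two-class node carries the maximum entropy of one bit, so any attribute that separates the classes at all yields a positive gain.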
The two-class model was also used to test an independent dataset generated from 21 genes of known CREB regulation status. C4.5 correctly classified 81% of instances (Table 4), with F-measures of 0.87 and 0.67 for the classes "CREB-regulated" and "NOT CREB-regulated", respectively. The F-measure is the harmonic mean of Precision and Sensitivity and can be used as a single measure of a test's performance:
Table 4.
GENE SYMBOL | C4.5 PREDICTION | CONFIDENCE LEVEL | ACTUAL STATUS |
Pcna | CREB-REGULATED | 1 | CREB-REGULATED |
Pdyn | CREB-REGULATED | 1 | CREB-REGULATED |
Penk1 | CREB-REGULATED | 1 | CREB-REGULATED |
Ptgs2 | CREB-REGULATED | 1 | CREB-REGULATED |
Pck1 | NOT-CREB-REGULATED* | 0.8 | CREB-REGULATED |
Ppargc1a | CREB-REGULATED | 1 | CREB-REGULATED |
Muc5b | CREB-REGULATED | 1 | CREB-REGULATED |
Rb1 | CREB-REGULATED | 1 | CREB-REGULATED |
Sst | NOT-CREB-REGULATED* | 0.8 | CREB-REGULATED |
Aanat | CREB-REGULATED | 1 | CREB-REGULATED |
Sod2 | CREB-REGULATED | 1 | CREB-REGULATED |
Sms | CREB-REGULATED | 1 | CREB-REGULATED |
Tnp1 | CREB-REGULATED | 1 | CREB-REGULATED |
Th | NOT-CREB-REGULATED* | 1 | CREB-REGULATED |
Vip | CREB-REGULATED | 1 | CREB-REGULATED |
Slc18a2 | CREB-REGULATED | 1 | CREB-REGULATED |
Kif1b | CREB-REGULATED | 1 | NOT-CREB-REGULATED* |
Tcf21 | NOT-CREB-REGULATED* | 1 | NOT-CREB-REGULATED* |
Wisp2 | NOT-CREB-REGULATED* | 1 | NOT-CREB-REGULATED* |
Ms4a4c | NOT-CREB-REGULATED* | 1 | NOT-CREB-REGULATED* |
Lrat | NOT-CREB-REGULATED* | 1 | NOT-CREB-REGULATED* |
*"Potentially CREB-independent" genes as defined under the Methods section.
**This follows training with a set of 46 genes of known status: twenty-three "CREB regulated" and twenty-three "Not CREB regulated" instances
F-measure = (2 * Precision * Sensitivity)/(Precision + Sensitivity)
where Precision = True Positives/(True Positives + False Positives)
Sensitivity (or Recall) is a measure of the probability that the test would reject a false null hypothesis:
Sensitivity = True Positives/(True Positives + False Negatives)
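As a worked check, the reported figures can be recomputed from the counts in Table 4. The sketch below assumes only the formulas above; the function name is hypothetical, and the counts are read directly off the table (13 of the 16 CREB-regulated genes and 4 of the 5 CREB-independent genes were classified correctly).

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of Precision and Sensitivity for one class."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return 2 * precision * sensitivity / (precision + sensitivity)

# "CREB-regulated" as the positive class:
# 13 correct, 1 false positive (Kif1b), 3 false negatives (Pck1, Sst, Th).
f_creb = f_measure(tp=13, fp=1, fn=3)

# "NOT CREB-regulated" as the positive class:
# 4 correct, 3 false positives (Pck1, Sst, Th), 1 false negative (Kif1b).
f_not = f_measure(tp=4, fp=3, fn=1)

# Overall fraction of the 21 evaluation genes classified correctly.
accuracy = (13 + 4) / 21

print(round(f_creb, 2), round(f_not, 2), round(accuracy, 2))
```

The three values round to 0.87, 0.67 and 0.81, in agreement with the F-measures and the 81% figure reported in the text.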
Additionally, using the two-class model, three out of four genes determined by two independent microarray platforms to be up-regulated in the ILS cerebellum [22] were determined by C4.5 to be transcriptionally CREB-regulated (Table 5). The platforms were the Affymetrix (Santa Clara, CA) platform Mouse Expression Set 430 (MOE430) and the cDNA arrays NIA15K manufactured at the University of Colorado's School of Medicine. Similarly, three out of four genes up-regulated by both platforms in the ISS cerebellum were deemed CREB-regulated (Table 6). Furthermore, 64% and 52% of a cross-section of other up-regulated cerebellar genes in ILS and ISS mice, respectively (as per the MOE430 platform), were deemed CREB-regulated.
Table 5.
GENE SYMBOL | C4.5 PREDICTION | CONFIDENCE LEVEL |
Chchd4 | CREB-REGULATED | 1 |
Sca1 | CREB-REGULATED | 1 |
Myo1d | NOT-CREB-REGULATED* | 1 |
6430706D22Rik | CREB-REGULATED | 1 |
*"Potentially CREB-independent" genes as defined under the Methods section.
Table 6.
GENE SYMBOL | C4.5 PREDICTION | CONFIDENCE LEVEL |
Cap1 | CREB-REGULATED | 1 |
D7Rp2e | NOT-CREB-REGULATED* | 1 |
Ftl1 | CREB-REGULATED | 1 |
Gnb1 | CREB-REGULATED | 1 |
*"Potentially CREB-independent" genes as defined under the Methods section.
Discussion
Lowered sensitivity to the effects of ethanol increases the risk of developing alcoholism. Differing sensitivity to ethanol is, at least in part, attributable to heredity [9], and inbred mouse strains have facilitated the investigation of this complex behavioral phenomenon. In studying CREB's gene-regulating activity in ethanol sensitivity, a set of genes differentially expressed in the ILS/ISS mouse model of ethanol sensitivity was examined. The two-class model had lower error rates than the three-class model (Figure 1A). This is probably due to the inherent difficulty of distinguishing between the classifications "CREB-regulated" and "Nrf2-regulated". Indeed, the case can be made that Nrf2 genes are dependent on CREB for enhanced transcription activity [25]. The complexity of the machinery for transcription makes the two-class model the preferred model for this study.
Properties of stretches of nucleotides can impact their affinity for specific transcription factors; this principle can be exploited for its therapeutic promise [27]. A central premise of this observation is the fact that the characteristics of individual nucleotide bases in any such oligonucleotide contribute to its structure and function [28]. As an example, hydrogen-bonded base pairs help determine the structure and function of nucleic acids. Strength of hydrogen bonding and other nucleotide base classifications used in generating the characteristics of each gene's promoter sequence for this ML study have been outlined in Table 1.
Of the four learning schemes evaluated using the two-class model, C4.5 was the most consistent performer, having the lowest overall error rates (Figure 1B) and the highest accuracy (Figure 2; Table 3), with the area under the ROC curve serving as the measure of test accuracy. Because of variability between independent evaluation sets, performance evaluations based on evaluation sets are only instructive when such evaluation sets are large. Since the evaluation set used consisted of only 21 instances, the cross-validation techniques are better indicators of each learning scheme's performance. The Ten-fold Cross Validation technique is a standard way of predicting the error rate of a learning scheme [29,30]. When applying this technique, the data are divided into ten parts; each part in turn is held out for testing (10% of the data) while the remaining 90% is used for training, and the ten results are averaged. The leave-one-out technique is, in essence, an n-fold cross-validation technique (n being the number of instances in the dataset) and, for a small dataset, a good predictor of a scheme's performance on an independent dataset. In this study, 81% of genes of known classification used as an evaluation set were correctly classified by C4.5 (Table 4), with F-measures of 0.87 and 0.67 for the classes "CREB-regulated" and "NOT CREB-regulated", respectively.
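The relationship between ten-fold and leave-one-out cross-validation can be illustrated with a short sketch that only generates the train/test partitions for a 46-instance dataset. It is not the Weka procedure used in the study: the function name is hypothetical, and no shuffling or class stratification is performed.

```python
def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    for fold in range(k):
        # Every k-th instance, offset by the fold number, is held out.
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# Ten-fold: roughly 90% of the data for training, 10% for testing.
ten_fold = list(k_fold_indices(46, 10))
# Leave-one-out: n-fold cross-validation with n = 46.
leave_one_out = list(k_fold_indices(46, 46))

for train, test in ten_fold:
    print(len(train), len(test))
```

With 46 instances, ten folds cannot be equal in size, so the held-out sets contain four or five instances each; in the leave-one-out case every instance is tested exactly once against a model trained on the other 45.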
The stretch of nucleotides between the cAMP Response Element (CRE) and the Transcription Start Site (TSS) and the stretch between the CRE and the Transcription Factor II D (TFIID) bind site were identified as important determinants of a gene's CREB regulation status (Figure 3). Two types of CRE with different affinities for the transcription factor CREB have been reported. One class containing the symmetrical TGACGTCA site shows a high binding affinity for CREB; the other type has asymmetric and weak binding sites ("CGTCA") [31]. The TATA-binding protein (TBP) and TBP-associated factors (TAFs) constitute the TFIID complex. The TFIID complex is a major component of the general RNA polymerase II (RNAP II) transcription machinery with intrinsic sequence-specific DNA-binding activity [32]. The binding of TFIID to a gene's core promoter region is an important rate-limiting step in the assembly of the transcription initiation complex. With the notable exception of the stretch between the CRE and the TFIID bind site, CREB target promoter regions have relatively high levels of nucleotide bases with strong Hydrogen Bonding (data not shown).
The transcription factor, CREB, is ubiquitously expressed in brain cells and is involved, among others, in learning and memory, anxiety, depression, and addiction [33]. A number of different signalling pathways culminate in the activation of CREB. These include pathways involving PKA, MAPK-activated ribosomal S6 kinases (RSKs), and calcium/calmodulin-dependent kinase IV (CaMKIV) [34]. Others such as CaMKII reduce CREB transcriptional activity [35]. Four genes have previously been found, by two independent microarray platforms, to be up-regulated in the ILS mouse cerebellum relative to the ISS cerebellum [22]. Of these, three were predicted by C4.5 as being CREB-dependent (Table 5). Similarly, three out of four genes up-regulated in the ISS cerebellum relative to the ILS cerebellum were predicted by the C4.5 scheme to be transcriptionally CREB-dependent (Table 6). Of a cross-section of genes up-regulated in the ILS cerebellum relative to ISS per the Affymetrix MOE430 platform [22], 64% were predicted by the C4.5 scheme to be transcriptionally CREB-dependent. Out of a similar cross-section up-regulated in the ISS cerebellum relative to the ILS cerebellum, 52% were predicted to be CREB-dependent. These indicate that CREB may be playing a central transcription-regulating role in the cerebellum in this ethanol sensitivity model.
Conclusion
Taken together, the observations made suggest that, in the cerebellum, CREB plays a key role in ethanol sensitivity and presents the field with a central hypothesis that needs to be further tested. CREB's role in mediating a number of complex behaviours has been documented [33]. Events in the extended amygdala have long been associated with the reinforcing effects of addicting drugs [36]. It is evident that the cerebellum, though less well studied in this regard, is involved in addiction [13-15]. Since CREB's transcription regulating activity differs from cell type to cell type [37], pursuit of the implications of a key role for CREB in this addiction model's cerebellar molecular milieu would be both promising and instructive.
Methods
A training dataset for ML was created out of twenty-three known targets of CREB transcriptional regulation [38,39], and twenty-three genes out of a set of twenty-eight (Table 7) whose transcription regulations are potentially CREB-independent. An independent set of twenty-one genes served as an evaluation set.
Table 7.
Gene Title | Gene Symbol |
FK 506 binding protein 5 | Fkbp5 |
cyclin-dependent kinase inhibitor 1A (P21) | Cdkn1a |
growth arrest and DNA-damage-inducible 45 gamma | Gadd45g |
angiopoietin-like 4 | Angptl4 |
adrenomedullin | Adm |
DNA-damage-inducible transcript 4 | Ddit4 |
chromodomain helicase DNA binding protein 1 | Chd1 |
sema domain, immunoglobulin domain (Ig), and GPI membrane anchor, | Sema7a |
chloride channel calcium activated 1 | Clca1 |
quiescin Q6 | Qscn6 |
sestrin 1 | Sesn1 |
UDP-N-acetyl-alpha-D-galactosamine:polypeptide N- | Galntl2 |
breast carcinoma amplified sequence 3 | Bcas3 |
membrane-associated protein 17 | Map17 |
discs, large homolog-associated protein 4 (Drosophila) | Dlgap4 |
glucosidase 1 | Gcs1 |
protein related to DAN and cerberus | Prdc |
thymidylate kinase family LPS-inducible member | Tyki |
histidine decarboxylase | Hdc |
sorting nexin 16 | Snx16 |
androgen-induced proliferation inhibitor | Aprin |
acylphosphatase 1, erythrocyte (common) type | Acyp1 |
intersectin (SH3 domain protein 1A) | Itsn |
kinesin family member 1B | Kif1b |
transcription factor 21 | Tcf21 |
WNT1 inducible signaling pathway protein 2 | Wisp2 |
membrane-spanning 4-domains, subfamily A, member 4C | Ms4a4c |
lecithin-retinol acyltransferase (phosphatidylcholine-retinol-O-acyltransferase) | Lrat |
"Potentially CREB-independent genes"
Nrf2 binds to CREB Binding Protein for enhanced transcription activating activity [25]. Cigarette Smoke (CS)-induced oxidative stress has been associated with the expression of Nrf2 transcription-dependent antioxidant and cytoprotective genes [40]. In experiments conducted by authors V.M and S.B., Nrf2 knockout and Wild-type mice were exposed to CS and Air. The genes listed in Table 7 were up-regulated in both groups, suggesting that their transcriptional regulation is Nrf2-independent (see "Oligonucleotide Microarray" below for further details on what constitutes "Nrf2-independent" genes). Furthermore, none of these genes is known specifically to be a target of CREB transcription regulation. Additionally, as depicted in Figure 3, these genes are distinguishable from those that are known targets for CREB transcription regulation.
CS Exposure
Mice of both genotypes were subjected to cigarette smoke exposure using a machine similar to the one described previously [41]. The control groups were kept in a filtered air environment, and the experimental groups were subjected to CS for 5 hours by burning 2R4F reference cigarettes (2.45 mg nicotine per cigarette; Tobacco Research Institute, University of Kentucky), using a smoking machine (Model TE-10, Teague Enterprises). Details of the smoking protocol have been described previously [40]. Mice were fed AIN-76A diet (Harlan Teklad) and had access to water ad libitum; they were housed under controlled conditions (23 ± 2°C; 12-hour light/dark cycles). All experimental protocols conducted on the mice were performed in accordance with the standards established by the US Animal Welfare Acts, as set forth in NIH guidelines and in the Policy and Procedures Manual of the Johns Hopkins University Animal Care and Use Committee.
Oligonucleotide Microarray
Lungs were isolated after 5 hours of CS exposure. Total RNA from the lungs was extracted, using TRIZOL reagent (Invitrogen Corp.). The isolated RNA was hybridized to Murine Genome MOE 430 2.0 GeneChip arrays (Affymetrix, Santa Clara, CA) according to procedures described previously [40]. This array contains probes for detecting approximately 14,500 well-characterized genes and 4371 expressed sequence tags. Scanned output files were analyzed using Affymetrix GeneChip Operating Software version 1.3, and were independently normalized to an average intensity of 500. The data was further analyzed as described previously [42], by performing 9 pairwise comparisons for each group (Nrf2+/+, CS, n = 3, versus Nrf2+/+, air, n = 3, and Nrf2-/-, CS, n = 3, versus Nrf2-/-, air, n = 3). To limit the number of false positives, only those altered genes that showed more than a 1.5-fold change (FC) in magnitude and appeared in, at least, 6 of the 9 comparisons were selected. In addition, the Mann-Whitney pairwise comparison test was performed to rank the results by the significance (P ≤ 0.05) of each identified change in gene expression. In identifying transcriptionally Nrf2-independent genes, only those genes which passed all of these criteria were selected. Further, only those genes that were differentially induced (or repressed) by CS to a similar extent in both genotypes, and having a FC ≥ 2.0 magnitude were considered to be independent of Nrf2's transcription regulating activity. This last dataset was combined with data from previously published work [40] (Genechip used was Murine U74A version 2) to arrive at a comprehensive "Nrf2-independent" gene set.
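The "6 of 9 comparisons" criterion can be sketched as a simple counting rule over the 3 x 3 pairings of exposed and control animals. This is an illustration on raw intensity ratios only; the actual comparisons were made in the Affymetrix software, and the function name, the intensity values and the restriction to positive intensities are assumptions made for this example.

```python
from itertools import product

def passes_fold_change_filter(treated, control, min_fc=1.5, min_hits=6):
    """True if enough pairwise comparisons exceed the fold-change cut-off.

    treated, control: positive signal intensities, one per animal.
    """
    comparisons = list(product(treated, control))  # 3 x 3 = 9 pairs
    up = sum(1 for t, c in comparisons if t / c > min_fc)
    down = sum(1 for t, c in comparisons if c / t > min_fc)
    # A gene qualifies if it is consistently induced or consistently repressed.
    return max(up, down) >= min_hits

# Hypothetical intensities for one probe set (CS-exposed versus air, n = 3 each).
induced = passes_fold_change_filter([1200, 1100, 900], [500, 520, 700])
unchanged = passes_fold_change_filter([600, 550, 500], [500, 520, 700])
print(induced, unchanged)
```

In the first hypothetical case eight of the nine ratios exceed 1.5, so the gene is retained; in the second none do, so it is discarded.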
Promoter Sequence Characteristics
Promoter sequences (1000 nucleotides upstream to 100 nucleotides downstream) corresponding to each gene were obtained from the cited database source [43]. For each promoter sequence, a vector of size 132, with elements characterizing features of the sequence (Figure 4), was generated using a Common Lisp [44] algorithm. The elements of the vector included a Boolean indicating whether or not the cAMP Response Element (CRE) was present, the number of nucleotide base pairs ("distance") between the CRE ("TGACGTCA", "CGTCA" or "TGCGTCA") and the Transcription Start Site (TSS), and the "distance" between the CRE and the TFIID bind site ("TATAGAA", "TATAAAA", "TATAG", or "TATA").
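A hedged sketch of how the CRE flag and the CRE-to-TSS "distance" might be derived is given below in Python (the original implementation was in Common Lisp and is not reproduced here). Several details are assumptions made for illustration: the TSS is taken to sit at 0-based index 1000 of the 1100-nucleotide sequence, only the given strand is searched, the leftmost CRE is used when several occur, and the distance is counted as the number of bases lying strictly between the motif and the TSS.

```python
# Longest motif first, so a full site is reported rather than the half-site inside it.
CRE_MOTIFS = ("TGACGTCA", "TGCGTCA", "CGTCA")

def find_cre(promoter):
    """Return (start, motif) for the leftmost CRE, or None if no CRE is present."""
    seq = promoter.upper()
    best = None
    for motif in CRE_MOTIFS:
        pos = seq.find(motif)
        if pos != -1 and (best is None or pos < best[0]):
            best = (pos, motif)
    return best

def cre_to_tss_distance(promoter, tss_index=1000):
    """Number of bases between the CRE and the TSS, or None without a CRE."""
    hit = find_cre(promoter)
    if hit is None:
        return None
    start, motif = hit
    end = start + len(motif)
    if end <= tss_index:          # CRE lies upstream of the TSS
        return tss_index - end
    if start > tss_index:         # CRE lies downstream of the TSS
        return start - tss_index - 1
    return 0                      # CRE overlaps the TSS

# Synthetic 1100-nucleotide promoter with a full CRE ending 92 bases before the TSS.
promoter = "A" * 900 + "TGACGTCA" + "A" * 92 + "C" * 100
print(find_cre(promoter), cre_to_tss_distance(promoter))
```

The same search, applied to the TFIID bind-site motifs, would give the second "distance" described above.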
In addition to these, the three kinds of features of nucleotide sequences used were Composition, Transition and Distribution [23]. Composition is a reference to the proportions of nucleotide base types contributing to the promoter sequence make-up. Transitions represent the frequency with which specific nucleotide base types are followed or preceded, within the sequence, by other nucleotide base types. Distribution is a statement concerning the dissemination of specific nucleotide base types within portions of the sequence (or the entire sequence).
Nucleotide Base Types
For the purpose of the sequence characterizations just described, nucleotide bases were grouped based on whether they were purine or pyrimidine, the strength with which they form hydrogen bonds, and whether they were "keto" or "amino" (Table 1).
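The Composition, Transition and Distribution features for these groupings can be sketched as follows. This Python illustration is not the authors' Common Lisp code. In particular, the precise definitions are assumptions based on the usual form of such descriptors: Transition is taken as the percentage of adjacent base pairs whose classes differ, and Distribution as the relative positions (in percent of sequence length) of the first, 25%, 50%, 75% and last member of each class, which gives ten values per grouping.

```python
import math

# First class of each grouping (Table 1); the remaining bases form the second class.
GROUPS = {"purine_pyrimidine": "AG", "strong_weak": "CG", "keto_amino": "GT"}

def ctd(seq, first):
    """Return 2 Composition, 1 Transition and 10 Distribution values (percent)."""
    n = len(seq)
    codes = [0 if base in first else 1 for base in seq]
    composition = [100.0 * codes.count(c) / n for c in (0, 1)]
    changes = sum(1 for a, b in zip(codes, codes[1:]) if a != b)
    transition = [100.0 * changes / (n - 1) if n > 1 else 0.0]
    distribution = []
    for c in (0, 1):
        positions = [i + 1 for i, code in enumerate(codes) if code == c]
        for quantile in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)  # class absent from the sequence
                continue
            k = max(0, math.ceil(quantile * len(positions)) - 1)
            distribution.append(100.0 * positions[k] / n)
    return composition + transition + distribution

def describe(sequence):
    """43 features: 4 base compositions plus 13 values for each of 3 groupings."""
    seq = sequence.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    features = [100.0 * seq.count(base) / len(seq) for base in "ACGT"]
    for first in GROUPS.values():
        features += ctd(seq, first)
    return features

features = describe("TGACGTCA")   # the symmetrical CRE as a toy input
print(len(features), features[:7])
```

Each call yields one 43-element block, the unit that is repeated for the whole promoter and for each of the two sub-sequences.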
The breakdown of the elements of each vector (Figure 4) is as follows: percent Compositions for the individual nucleotide bases (positions 1 to 4); percent Compositions, Transitions, and Distributions for the Purine versus Pyrimidine base types (positions 5 – 17, consisting of two positions for Compositions, one for Transitions, and ten for Distributions); percent Compositions, Transitions, and Distributions for Strong versus Weak Hydrogen Bonding base types (positions 18 – 30, consisting of two positions for Compositions, one for Transitions, and ten for Distributions); and percent Compositions, Transitions, and Distributions for "Keto" versus "Amino" base types (positions 31 – 43, consisting of two positions for Compositions, one for Transitions, and ten for Distributions). The presence or absence of a CRE was indicated by a "1" or a "0" respectively at position 44. The sub-sequence made up of the stretch of bases between the CRE and the TSS was characterized at positions 45 through 88. At position 45, the "distance" was stated. In the absence of a CRE, the entire promoter sequence was characterized in lieu of the sought sub-sequence. In other words, in the absence of a CRE as defined above, the "distance" was longer. Details for positions 46 through 88 were as follows: individual nucleotide base percent Compositions were indicated at positions 46 – 49; Purine versus Pyrimidine base type data were at positions 50 – 62; Strong versus Weak Hydrogen Bonding base type data were at positions 63 – 75; "Keto" versus "Amino" base type data were at positions 76 – 88. Correspondingly, the sub-sequence made up of the stretch of bases between the CRE and the TFIID bind site was similarly characterized at positions 89 through 132.
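The position ranges can be checked with a little index arithmetic. The sketch below uses hypothetical names, and it assumes by analogy with positions 45 through 88 that position 89 holds the CRE-to-TFIID "distance" and positions 90 through 132 hold the corresponding 43 feature values; the text itself states only the overall range.

```python
def block_layout(start):
    """1-indexed (first, last) positions of one 43-element feature block."""
    layout = {"base_composition": (start, start + 3)}
    cursor = start + 4
    for name in ("purine_pyrimidine", "strong_weak", "keto_amino"):
        # 2 Composition + 1 Transition + 10 Distribution values = 13 positions
        layout[name] = (cursor, cursor + 12)
        cursor += 13
    return layout

whole_promoter = block_layout(1)    # positions 1 to 43
cre_present_flag = 44
cre_tss_distance = 45
cre_to_tss = block_layout(46)       # positions 46 to 88
cre_tfiid_distance = 89             # assumed by analogy with position 45
cre_to_tfiid = block_layout(90)     # positions 90 to 132

vector_size = cre_to_tfiid["keto_amino"][1]
print(whole_promoter, cre_to_tss, cre_to_tfiid, vector_size, sep="\n")
```

Three 43-element blocks, one Boolean flag and two distances account for all 132 elements of the vector.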
Four ML schemes were evaluated for their learning performance on the models created: a Decision Tree (J48, an implementation of the C4.5 algorithm), a Support Vector Machine (SVM), a Naïve Bayes classifier (NN) and a Multi-layered Perceptron (MLP), all available through the Weka ML workbench [45]. The C4.5 algorithm emerged as having the lowest predicted error rate (Figure 1). The decision tree (Additional File 2) used in evaluating the independent dataset is based on all the training data. After applying the Corrected Resampled t-test [46] to data generated following use of the leave-one-out technique with ten iterations for each fold, error rates for C4.5 were significantly (p = 0.05) lower than those of SVM and MLP (Table 2). The rates were lower relative to NN though not statistically significant (Table 2). The ROC curves (Figure 2) used as indicators of performance were also generated using the "CREB-regulated" class and the default Weka ML workbench. The threshold modifications that constituted the basis of the ROC curves have been detailed in Additional File 1.
Subsequently a set of genes whose expressions have been previously determined [22] to be increased in ILS or ISS cerebella was identified and the CREB regulation status of each member predicted using the ML scheme C4.5.
Exploratory Data Analysis (EDA) techniques [47] were also used to characterize the vector set. Specifically, boxplots [48] were used to capture the distribution's central tendency (median), spread (fourth-spread), skewness (based on the relative positions of the median, lower fourth and upper fourth), tail length as well as outliers (Figure 3). The statistical environment used to implement the EDA aspects of the study was R [49].
Authors' contributions
GKA conceived the study, conducted the computational experiments, and participated in writing the manuscript. VM participated in the conduct of experiments and data analyses resulting in the identification of Nrf2-independent genes. He also wrote portions of the manuscript. SB participated in the experimental design, participated in the conduct of experiments and data analyses resulting in the identification of Nrf2-independent genes. He also participated in drafting portions of the manuscript. All authors read and approved the final manuscript.
Supplementary Material
Acknowledgements
This work has been supported by resources of the Massachusetts College of Pharmacy and Health Sciences, NIEHS center grant P30 ES 038819, NIH grants HL081205 (SB), P50 CA058184 (SB), and the Young Clinical Scientist award from the Flight Attendant Medical Research Institute (SB). We would also like to thank Masayuki Yamamoto, Tsukuba university, Japan and Thomas W. Kensler, Department of Environmental Health Sciences, Bloomberg School of Public Health, Johns Hopkins University for providing Nrf2 WT and knockout mice for the previously published work [40].
Contributor Information
George K Acquaah-Mensah, Email: george.acquaah-mensah@mcphs.edu.
Vikas Misra, Email: vmisra@jhsph.edu.
Shyam Biswal, Email: sbiswal@jhsph.edu.
References
- Downing C, Carosone-Link P, Bennett B, Johnson T. QTL mapping for low-dose ethanol activation in the LXS recombinant inbred strains. Alcohol Clin Exp Res. 2006;30:1111–1120. doi: 10.1111/j.1530-0277.2006.00137.x.
- Bennett B, Carosone-Link P, Zahniser NR, Johnson TE. Confirmation and Fine Mapping of Ethanol Sensitivity QTLs, and Candidate Gene Testing in the LXS Recombinant Inbred Mice. J Pharmacol Exp Ther. 2006.
- Radcliffe RA, Floyd KL, Lee MJ. Rapid ethanol tolerance mediated by adaptations in acute tolerance in inbred mouse strains. Pharmacol Biochem Behav. 2006. doi: 10.1016/j.pbb.2006.06.018.
- Markel PD, Defries JC, Johnson TE. Use of repeated measures in an analysis of ethanol-induced loss of righting reflex in inbred long-sleep and short-sleep mice. Alcohol Clin Exp Res. 1995;19:299–304. doi: 10.1111/j.1530-0277.1995.tb01506.x.
- Hanania T, McCreary AC, Haughey HM, Salaz DO, Zahniser NR. MK-801- and ethanol-induced activity in inbred long-sleep and short-sleep mice: dopamine and serotonin systems. Eur J Pharmacol. 2002;457:125–135. doi: 10.1016/S0014-2999(02)02685-7.
- Hanania T, Zahniser NR. Locomotor activity induced by noncompetitive NMDA receptor antagonists versus dopamine transporter inhibitors: opposite strain differences in inbred long-sleep and short-sleep mice. Alcohol Clin Exp Res. 2002;26:431–440. doi: 10.1111/j.1530-0277.2002.tb02558.x.
- Owens JC, Stallings MC, Johnson TE. Genetic analysis of low-dose ethanol-induced activation (LDA) in inbred long sleep (ILS) and inbred short sleep (ISS) mice. Behav Genet. 2002;32:163–171. doi: 10.1023/A:1016018411202.
- Ehringer MA, Thompson J, Conroy O, Goldman D, Smith TL, Schuckit MA, Sikela JM. Human alcoholism studies of genes identified through mouse quantitative trait locus analysis. Addict Biol. 2002;7:365–371. doi: 10.1080/1355621021000005496.
- Crabbe JC, Metten P, Yu CH, Schlumbohm JP, Cameron AJ, Wahlsten D. Genotypic differences in ethanol sensitivity in two tests of motor incoordination. J Appl Physiol. 2003;95:1338–1351. doi: 10.1152/japplphysiol.00132.2003.
- Xu Y, Ehringer M, Yang F, Sikela JM. Comparison of Global Brain Gene Expression Profiles Between Inbred Long-Sleep and Inbred Short-Sleep Mice by High-Density Gene Array Hybridization. Alcohol Clin Exp Res. 2001;25:810–818. doi: 10.1111/j.1530-0277.2001.tb02284.x.
- Schmahmann JD. Disorders of the cerebellum: ataxia, dysmetria of thought, and the cerebellar cognitive affective syndrome. J Neuropsychiatry Clin Neurosci. 2004;16:367–378. doi: 10.1176/jnp.16.3.367.
- Maier SE, Miller JA, West JR. Prenatal binge-like alcohol exposure in the rat results in region-specific deficits in brain growth. Neurotoxicol Teratol. 1999;21:285–291. doi: 10.1016/S0892-0362(98)00056-7.
- Schneider F, Habel U, Wagner M, Franke P, Salloum JB, Shah NJ, Toni I, Sulzbach C, Honig K, Maier W, Gaebel W, Zilles K. Subcortical correlates of craving in recently abstinent alcoholic patients. Am J Psychiatry. 2001;158:1075–1083. doi: 10.1176/appi.ajp.158.7.1075.
- Volkow ND, Wang GJ, Fowler JS, Hitzemann R, Angrist B, Gatley SJ, Logan J, Ding YS, Pappas N. Association of methylphenidate-induced craving with changes in right striato-orbitofrontal metabolism in cocaine abusers: implications in addiction. Am J Psychiatry. 1999;156:19–26. doi: 10.1176/ajp.156.1.19.
- Olbrich HM, Valerius G, Paris C, Hagenbuch F, Ebert D, Juengling FD. Brain activation during craving for alcohol measured by positron emission tomography. Aust N Z J Psychiatry. 2006;40:171–178. doi: 10.1111/j.1440-1614.2006.01765.x.
- Pandey SC. Neuronal signaling systems and ethanol dependence. Mol Neurobiol. 1998;17:1–15. doi: 10.1007/BF02802021.
- Pandey SC, Zhang H, Roy A, Xu T. Deficits in amygdaloid cAMP-responsive element-binding protein signaling play a role in genetic predisposition to anxiety and alcoholism. J Clin Invest. 2005;115:2762–2773. doi: 10.1172/JCI24381.
- Melis M, Camarini R, Ungless MA, Bonci A. Long-lasting potentiation of GABAergic synapses in dopamine neurons after a single in vivo ethanol exposure. J Neurosci. 2002;22:2074–2082. doi: 10.1523/JNEUROSCI.22-06-02074.2002.
- Mailliard WS, Diamond I. Recent advances in the neurobiology of alcoholism: the role of adenosine. Pharmacol Ther. 2004;101:39–46. doi: 10.1016/j.pharmthera.2003.10.002.
- Asher O, Cunningham TD, Yao L, Gordon AS, Diamond I. Ethanol stimulates cAMP-responsive element (CRE)-mediated transcription via CRE-binding protein and cAMP-dependent protein kinase. J Pharmacol Exp Ther. 2002;301:66–70. doi: 10.1124/jpet.301.1.66.
- Kuo WP, Kim EY, Trimarchi J, Jenssen TK, Vinterbo SA, Ohno-Machado L. A primer on gene expression and microarrays for machine learning researchers. J Biomed Inform. 2004;37:293–303. doi: 10.1016/j.jbi.2004.07.002.
- MacLaren EJ, Sikela JM. Cerebellar gene expression profiling and eQTL analysis in inbred mouse strains selected for ethanol sensitivity. Alcohol Clin Exp Res. 2005;29:1568–1579. doi: 10.1097/01.alc.0000179376.27331.ac.
- Ding CH, Dubchak I. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics. 2001;17:349–358. doi: 10.1093/bioinformatics/17.4.349.
- Chen XL, Kunsch C. Induction of cytoprotective genes through Nrf2/antioxidant response element pathway: a new therapeutic approach for the treatment of inflammatory diseases. Curr Pharm Des. 2004;10:879–891. doi: 10.2174/1381612043452901.
- Katoh Y, Itoh K, Yoshida E, Miyagishi M, Fukamizu A, Yamamoto M. Two domains of Nrf2 cooperatively bind CBP, a CREB binding protein, and synergistically activate transcription. Genes Cells. 2001;6:857–868. doi: 10.1046/j.1365-2443.2001.00469.x.
- Quinlan R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA; 1993.
- Cho-Chung YS, Park YG, Nesterova M, Lee YN, Cho YS. CRE-decoy oligonucleotide-inhibition of gene expression and tumor growth. Mol Cell Biochem. 2000;212:29–34. doi: 10.1023/A:1007144618589.
- Karlin S, Ost F, Blaisdell BE. Mathematical Methods for DNA Sequences. Waterman MS, CRC Press; 1989. Patterns in DNA and amino acid sequences and their statistical significance; pp. 133–157.
- Kohavi R. Proc 14th International Joint Conference on Artificial Intelligence, Montreal, Canada. San Francisco: Morgan Kaufmann; 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection; pp. 1137–1143.
- Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA; 2000.
- Eggers A, Caudevilla C, Asins G, Hegardt FG, Serra D. Mitochondrial 3-hydroxy-3-methylglutaryl-CoA synthase promoter contains a CREB binding site that regulates cAMP action in Caco-2 cells. Biochem J. 2000;345:201–206. doi: 10.1042/0264-6021:3450201.
- Muller F, Tora L. The multicoloured world of promoter recognition complexes. EMBO J. 2004;23:2–8. doi: 10.1038/sj.emboj.7600027.
- Carlezon WA, Jr, Duman RS, Nestler EJ. The many faces of CREB. Trends Neurosci. 2005;28:436–445. doi: 10.1016/j.tins.2005.06.005.
- Gonzalez GA, Montminy MR. Cyclic AMP stimulates somatostatin gene transcription by phosphorylation of CREB at serine 133. Cell. 1989;59:675–680. doi: 10.1016/0092-8674(89)90013-5.
- Matthews RP, Guthrie CR, Wailes LM, Zhao X, Means AR, McKnight GS. Calcium/calmodulin-dependent protein kinase types II and IV differentially regulate CREB-dependent gene expression. Mol Cell Biol. 1994;14:6107–6116. doi: 10.1128/mcb.14.9.6107.
- Weiss F, Koob GF. Drug addiction: functional neurotoxicity of the brain reward systems. Neurotox Res. 2001;3:145–156. doi: 10.1007/BF03033235.
- Cha-Molstad H, Keller DM, Yochum GS, Impey S, Goodman RH. Cell-type-specific binding of the transcription factor CREB to the cAMP-response element. Proc Natl Acad Sci USA. 2004;101:13572–13577. doi: 10.1073/pnas.0405587101.
- McClung CA, Nestler EJ. Regulation of gene expression and cocaine reward by CREB and DeltaFosB. Nat Neurosci. 2003;6:1208–1215. doi: 10.1038/nn1143.
- Mayr B, Montminy M. Transcriptional regulation by the phosphorylation-dependent factor CREB. Nat Rev Mol Cell Biol. 2001;2:599–609. doi: 10.1038/35085068.
- Rangasamy T, Cho CY, Thimmulappa RK, Zhen L, Srisuma SS, Kensler TW, Yamamoto M, Petrache I, Tuder RM, Biswal S. Genetic ablation of Nrf2 enhances susceptibility to cigarette smoke-induced emphysema in mice. J Clin Invest. 2004;114:1248–1259. doi: 10.1172/JCI200421146.
- Witschi H, Espiritu I, Maronpot RR, Pinkerton KE, Jones AD. The carcinogenic potential of the gas phase of environmental tobacco smoke. Carcinogenesis. 1997;18:2035–2042. doi: 10.1093/carcin/18.11.2035.
- Thimmulappa RK, Mai KH, Srisuma S, Kensler TW, Yamamoto M, Biswal S. Identification of Nrf2-regulated genes induced by the chemopreventive agent sulforaphane by oligonucleotide microarray. Cancer Res. 2002;62:5196–5203.
- http://www.genome.ucsc.edu
- Keene SE. Elements of CLOS programs. Object-Oriented Programming in Common Lisp Symbolics Incorporated (and Addison-Wesley) 1989. pp. 5–14.
- http://www.cs.waikato.ac.nz/ml/weka/
- Bengio Y, Nadeau C. Inference for the generalization error. Machine Learning. 2003;52:239–281. doi: 10.1023/A:1024068626366.
- Tukey JW. Exploratory Data Analysis (Limited Preliminary Edition) II. Reading MA: Addison-Wesley; 1970.
- Chambers JM, Cleveland WS, Kleiner B, Tukey PA. Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole; 1983.
- http://www.r-project.org/