Abstract
Regulatory SNPs (rSNPs) reside primarily within the nonprotein coding genome and are thought to disturb normal patterns of gene expression by altering DNA binding of transcription factors. Nevertheless, despite the explosive rise in SNP association studies, there is little information as to the function of rSNPs in human disease. Serum response factor (SRF) is a widely expressed DNA-binding transcription factor that has variable affinity for at least 1,216 permutations of a 10 bp transcription factor binding site (TFBS) known as the CArG box. We developed a robust in silico bioinformatics screening method to evaluate sequences around RefSeq genes for conserved CArG boxes. Utilizing a predetermined phastCons threshold score, we identified 8,252 strand-specific CArGs within an 8 kb window around the transcription start site of 5,213 genes, including all previously defined SRF target genes. We then interrogated this CArG dataset for the presence of previously annotated common polymorphisms. We found a total of 118 unique CArG boxes harboring a SNP within the 10 bp CArG sequence and 1,130 CArG boxes with SNPs located just outside the CArG element. Gel shift and luciferase reporter assays validated SRF binding and functional activity of several new CArG boxes. Importantly, SNPs within or just outside the CArG box often resulted in altered SRF binding and activity. Collectively, these findings demonstrate a powerful approach to computationally define rSNPs in the human CArGome and provide a foundation for similar analyses of other TFBS. Such information may find utility in genetic association studies of human disease where little is known regarding the functionality of rSNPs.
Keywords: transcription factor binding site, bioinformatics, serum response factor, CArG box
Single nucleotide polymorphisms (SNPs) are the most common variant in the human genome. While the function of protein-coding SNPs can be easily deciphered and tested, it is far more difficult to ascertain SNP function in the vast sequence landscape comprising the nonprotein coding genome. Even as the number of SNP association studies continues to soar, there has been less effort to elucidate the function of nonprotein coding SNPs associated with human disease, though it is recognized that such variants likely play a significant role in the development of complex disease traits (19, 28). Recent estimates suggest that ∼88% of the SNPs identified as associated with a trait or disease are located in nonprotein coding regions (27). Regulatory SNPs (rSNPs) are defined as variant sequences located in or near transcription factor binding sites (TFBS) that affect gene expression by altering DNA binding properties of transcription factors (39). rSNPs may also have other effects such as altering the specificity of microRNAs and their binding sites or disrupting recombination, replication, or structural organization of the genome (47, 76). Despite the rise in rSNPs associated with human disease derived from genome-wide association studies (GWAS), we know very little about how each of these variants functions at a molecular or cellular level (31). However, variations in the binding sites of RNA polymerase and nuclear factor-κB, frequently due to SNPs, were found to differ between humans and correlated with altered gene expression (36). Additionally, the recent example of a noncoding SNP that created a C/EBP TFBS correlating with increases in SORT1 gene expression and changes in serum lipoprotein levels strongly supports the hypothesis that rSNPs contribute to disease (55). Thus, elucidating the functional significance of such disease-associated rSNPs has emerged as a critical research aim (5, 67).
Given the extensive time and cost of wet-lab rSNP evaluation, numerous in silico methods of whole genome screening have been developed to predict TFBSs that incorporate multispecies alignments and phylogenetic foot-printing (24) and prioritize high-probability candidate rSNPs (63). Examples include the RAVEN system to explore in silico TFBS variation based upon position weight matrices (PWM) (2) and a novel Bayesian-based method to construct phylogenetic trees from 18 mammalian genomes, from which >28,000 TFBS that contain potential rSNPs were discovered (86). Other more recent PWM-based methods attempted to predict alterations in protein-DNA binding affinities induced by sequence variations and potential rSNPs (44, 45). While promising from a computational perspective, these approaches are limited primarily by a high false discovery rate of TFBS, thereby decreasing the likelihood of elucidating functionally relevant rSNPs. To address these challenges, translational methods that combine biological data with in silico methods are increasingly being used. Notably, Torkamani and Schork (78) utilized the functional genomics dataset from the ENCODE Pilot Project to develop a predictive model of rSNPs that achieved 80% sensitivity and 99% specificity. Unfortunately, this approach was limited because it used only 1% of the human genome for functional significance (17). Other translational bioinformatics approaches using chromatin immunoprecipitation (ChIP)-on-chip data have been successful at identifying rSNPs but require a priori knowledge of DNA-protein binding patterns (1). Computational models based on high-throughput sequencing methods of ChIP-seq have advanced the discovery of biologically active TFBS but are contextually biased by the cell type/condition and are limited to only those transcription factors with specific antibodies (3, 40, 84).
Thus, while these biological assays have advanced our understanding of the genome, there remains a need for less biased genome-wide discovery of potentially functional rSNPs identified by in silico techniques.
Development of a robust genome-wide in silico approach to the identification of TFBS and potential rSNPs based on biological data requires a well-characterized TF. For example, the TFBS for CREB is well defined, and a recent genome-wide survey identified >750,000 CREB sites in the genome (89). Another TF with a well-defined TFBS is serum response factor (SRF), which controls disparate programs of gene expression related to growth and muscle differentiation (32, 51). SRF binds a TFBS known as the CArG box, a 10 bp sequence generally found within 4,000 bp of a targeted gene's transcription start site (TSS) (75). Previous studies have used various screening methods to identify SRF-target genes with CArG boxes that are conserved in sequence and space between human and mouse genomes (60, 69, 75, 88). The aggregate total number of CArG elements within the genome, known as the CArGome (75), control hundreds of target genes (53). Thus, SRF's extensive library of validated and hypothetical CArG targets and critical biological importance (50) render it an ideal TF to study in silico rSNP identification methods. Additionally, identifying potential rSNPs in or around CArG boxes may expand our understanding of SRF target gene function, specifically in the context of disease.
To date, there has been no systematic attempt to identify rSNPs across the human CArGome. Binding interactions between SRF and CArG elements have been well characterized (42). Previous studies have shown that mutations in the 10 bp CArG sequence are known to alter the binding properties of SRF (72). Additionally, Han et al. (23) discovered a 12 bp insertion mutation located ∼20 bp from a nearby CArG box that resulted in elevated SRF binding and increased gene expression of smooth muscle myosin light chain kinase in hypertensive rats. These results support the notion that genetic mutations in or near the CArG box alter gene expression and contribute to disease. Given our understanding of the CArG sequence and the increasing number of SRF targets in the genome, the CArG box is an ideal TFBS to use as a model for developing screening tools for functional rSNP identification. Herein, we describe an in silico screening approach to rapidly pinpoint functional rSNPs in the human CArGome. Validation studies suggest that most rSNPs in the CArGome have functional consequences for SRF binding and activity, thus providing a rich resource for future genetic association studies as well as novel insights into SRF function.
MATERIALS AND METHODS
Computer interrogation of CArGome for rSNP discovery.
Perl scripts (available on request) were created to implement the process of identifying SNPs in the CArGome as depicted in Fig. 1. The Ensembl database (release 59) (18) was queried for all protein-coding Human Genome Organization (HUGO) genes in the GRCh37/hg19 genome assembly from chromosomes 1 to 22, X, and Y. For each HUGO gene, the longest RefSeq transcript was used as the canonical transcript (Supplemental Table S1). For each canonical transcript, the promoter sequence comprising 4,000 bp upstream and 4,000 bp downstream from the TSS was extracted. Each promoter sequence was scanned for CArG elements using a Perl script that matched the pattern CCW6GG (consensus CArG box), allowing for 1 bp deviation (CArG-like box) as reviewed previously (53).
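The pattern-matching step can be illustrated with a minimal Python sketch. This is not the authors' Perl script; the function names `mismatches` and `scan_carg` are hypothetical, and the sketch scans only the given strand. Because the CCW6GG pattern is its own reverse complement, a forward-strand scan finds a site regardless of the strand it is later assigned to; how strand was assigned in the original scan is not stated, so it is omitted here.

```python
CONSENSUS = "CCWWWWWWGG"  # W = A or T (CCW6GG)

def mismatches(site):
    """Count positions of a 10-mer that deviate from CCW6GG."""
    count = 0
    for base, code in zip(site, CONSENSUS):
        if code == "W":
            count += base not in "AT"
        else:
            count += base != code
    return count

def scan_carg(seq, max_mismatch=1):
    """Return (offset, site, kind) for each 10-mer within max_mismatch of CCW6GG.

    offset is 0-based within seq; kind is 'consensus' (no deviation)
    or 'CArG-like' (exactly one deviation when max_mismatch is 1).
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 9):
        site = seq[i:i + 10]
        if set(site) - set("ACGT"):
            continue  # skip windows containing N or other ambiguity codes
        m = mismatches(site)
        if m <= max_mismatch:
            hits.append((i, site, "consensus" if m == 0 else "CArG-like"))
    return hits

print(scan_carg("CCATTTATGG"))  # [(0, 'CCATTTATGG', 'consensus')]
```

Counting deviations directly, rather than expanding the pattern into a regular expression, keeps the consensus and CArG-like classes in a single pass.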
Using a set of 34 previously validated CArG boxes from mouse, we generated a conservation threshold score by first extracting the CArG sequences from the National Center for Biotechnology Information 37/mm9 mouse assembly from the Ensembl database (Supplemental Table S2). We initially evaluated CArG sequences for conservation with the GERP score (33). However, we found that a majority of the CArG sequences had no GERP score available. In contrast, all 34 CArG sequences used to develop the conservation threshold exhibited a phastCons score (11). Using the 46-way vertebrate phastCons dataset from the UCSC Genome Browser (71), we calculated the phastCons score for each CArG box by averaging the 10 individual nucleotide phastCons scores in the given CArG box. The cumulative average phastCons score for these 34 individual mouse CArG boxes was calculated to be 0.842 and subsequently used as the conservation cutoff score for the human RefSeq CArG scan (RS-CArGome).
For each CArG sequence found in the human scan of HUGO genes, a 10 bp average phastCons score was calculated and compared with the conservation cutoff described above. Those human CArG boxes meeting the cutoff score were classified as conserved and tabulated with respect to their nearest RefSeq neighbor. A set of 207 previously validated mammalian SRF-target genes was also cross-referenced with our tabular CArG list (Supplemental Table S3).
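The conservation filter described above amounts to averaging ten per-base scores and comparing the result with the cutoff. The following Python sketch assumes the per-base phastCons scores have already been retrieved as a list of ten floats, and it treats "meeting the cutoff" as greater than or equal to 0.842; both the input format and the inclusive comparison are assumptions, and the function names are hypothetical.

```python
from statistics import mean

PHASTCONS_CUTOFF = 0.842  # average of the 34 validated mouse CArG boxes (see text)

def mean_phastcons(scores):
    """Average the per-base phastCons scores of one 10 bp CArG box."""
    if len(scores) != 10:
        raise ValueError("expected one score per base of the 10 bp CArG box")
    return mean(scores)

def is_conserved(scores, cutoff=PHASTCONS_CUTOFF):
    """True when the 10 bp average meets the cutoff (inclusive: an assumption)."""
    return mean_phastcons(scores) >= cutoff

print(is_conserved([0.9] * 5 + [0.8] * 5))  # True (average 0.85)
print(is_conserved([0.8] * 10))             # False (average 0.8)
```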
For each conserved CArG element identified, its absolute chromosomal coordinates in the human genome were used to query dbSNP 131 stored within the Ensembl Variation database to identify genotyped SNPs residing within the 10 bp conserved CArG or within an arbitrarily set 35 bp flanking sequence of the conserved CArG box (70). This distance surrounding the CArG box was selected as many CArG elements function in conjunction with neighboring cis-acting elements, such as ETS binding sites that may be located several helical turns of DNA away from the CArG box (4). Finally, the promoter sequences for each human gene that had conserved CArG boxes and locally identified SNPs were mapped on the human genome, along with the conserved CArG and SNPs, and visualized using the UCSC Genome Browser (38). A Perl script was written to ascertain SNP distances relative to CArG boxes along with their distribution among gene-related regions. CArG nucleotide sequence analysis was conducted with WebLogo (13). Evaluation of possible CArG-related gene clusters was performed based on CArG box genomic coordinates to identify CArG boxes located in the vicinity of multiple genes.
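The coordinate test behind this SNP query can be sketched as follows. The sketch assumes 1-based inclusive chromosomal coordinates and a single-base SNP position; it does not reproduce the Ensembl Variation query itself, and `classify_snp` is a hypothetical helper. The two example positions are derived from the Table 1 coordinates and "Position in CArG" values rather than taken from dbSNP.

```python
FLANK_BP = 35  # flanking distance used in the text

def classify_snp(carg_start, carg_end, snp_pos, flank=FLANK_BP):
    """Classify a SNP relative to a CArG box on the same chromosome.

    Returns 'within' for a SNP inside the 10 bp box, 'flank' for a SNP
    within `flank` bp on either side, and None otherwise.
    """
    if carg_start <= snp_pos <= carg_end:
        return "within"
    if carg_start - flank <= snp_pos <= carg_end + flank:
        return "flank"
    return None

# ADRB2 box (chr5:148203698-148203707), SNP at position 4 of the box
print(classify_snp(148203698, 148203707, 148203701))  # within
# ABHD5 box (chr3:43729957-43729966), SNP 2 bp past the end of the box
print(classify_snp(43729957, 43729966, 43729968))     # flank
```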
Functional analysis of CArG boxes and SRF target genes.
Genomic Regions Enrichment of Annotations Tool (GREAT) analysis was conducted to evaluate CArG box enrichment based on the biological process Gene Ontology (GO) term (49). All conserved CArG sequences identified in the hg19 assembly were utilized for GREAT analysis using version 1.6. GREAT analysis was performed against a whole genome background and using the “basal plus extension” association rule, in which a proximal gene association region is within 5 kb upstream and 1 kb downstream from the TSS and the distal gene association region is up to 1,000 kb.
Analysis for overrepresented TFBS in our set of 5,213 suspected SRF-target genes was conducted using the oPOSSUM's Human Single Site analysis, version 2.0 (29). HUGO gene names were submitted to the oPOSSUM web server (http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum) for analysis with the default analysis settings, including top 10% of conserved regions, TFBS matrix match score of 80%, and a sequence length of 5 kb up and downstream. A total of 1,370 genes were excluded from analysis by oPOSSUM due to lack of strict one-to-one ortholog gene assignment between mouse and human.
Electrophoretic mobility shift assay.
Eight of the conserved CArG sequences with SNPs were selected for validation with electrophoretic mobility shift assay (EMSA) and named according to the nearest RefSeq gene (Table 1). We chose these eight CArG boxes to study based on their novelty as potential SRF targets and/or their consensus CArG sequence, rendering them most susceptible to SRF binding. Seven of these potential SRF target genes had an SNP located within the CArG box (ADRB2, KCNA3, CDH3, RALYL, PLAGL2, KLF6, GRK6), and for one of the target genes the SNP was located 2 bp downstream from the CArG box (ABHD5). Double-stranded oligonucleotides (Integrated DNA Technologies, Coralville, IA) corresponding to each CArG element were heated to 65°C for 10 min and then slowly annealed. EMSA was performed with in vitro translated SRF as described previously using a well-defined SRF-binding CArG box from the CNN1 gene as a positive control (52). Wild-type (WT) oligonucleotide, WT sequence with antibody to SRF, and SNP-modified oligonucleotide samples for each CArG box were studied. Samples were run on a 5% nondenaturing polyacrylamide gel, vacuum dried, and exposed for varying lengths of time to x-ray film.
Table 1.
HUGO Gene | CArG Sequence | Position to TSS | Coordinates | Strand | phastCons | dbSNP 131 | Position in CArG |
---|---|---|---|---|---|---|---|
**ADRB2** | CCA(A/G)TTTTGG | −2458 | chr5:148203698-148203707 | 1 | 0.941732 | rs35118767 | 4 |
**KCNA3** | CCATTTTTG(G/A) | −2423 | chr1:111220069-111220078 | −1 | 0.99685 | rs61805969 | 10 |
**RALYL** | CCAATTAAG(G/T) | 2056 | chr8:85620218-85620227 | 1 | 0.929134 | rs62530172 | 10 |
**KLF6** | (C/A)CTTATTTGG | 2124 | chr10:3825334-3825343 | −1 | 0.994488 | rs10795076 | 1 |
ABHD5* | CCAGTTTTGGG(A/G) | −2405 | chr3:43729957-43729966 | 1 | 0.948032 | rs34104729 | 12 |
GRK6 | CCCTAA(T/C)TGG | −282 | chr5:176853405-176853414 | 1 | 0.998425 | rs3763079 | 7 |
CDH3 | CCTTTAG(A/G)GG | 2637 | chr16:68681376-68681385 | 1 | 0.992913 | rs16958232 | 8 |
PLAGL2 | CCAG(A/G)TAAGG | 464 | chr20:30795121-30795130 | −1 | 0.999213 | rs13037347 | 5 |
Each row represents a HUGO gene with a single nucleotide polymorphism (SNP) (dbSNP 131) mapped within or near the gene's conserved CArG box. HUGO Gene names in boldface contain true CArG sequences, while lightfaced gene names harbor CArG-like sequences with 1 bp substitution (see discussion for differences between consensus CArG and CArG-like sequences). Wild-type and SNP alleles are shown in parentheses for SNPs within or near the CArG box; the first nucleotide in each parenthesis represents the wild-type allele and the second nucleotide designates the SNP allele. Position to transcription start site (TSS) is the number of bp from the start of the CArG box to the TSS of the possible serum response factor-target gene (negative is upstream; positive is downstream). Conserved CArG Coordinates and Strand are based on Ensembl (Release 59), using the H.Sa Feb 2009 (HG 19) Assembly. phastCons score is an average of the 10 individual phastCons scores for each nucleotide in the CArG box (see methods). The reference SNP (rs) number for each CArG-SNP was obtained from dbSNP 131. Position in CArG is the location of the SNP in relation to position 1 of the CArG box.
*SNP that lies outside the CArG box.
Luciferase assays.
We designed primers (Supplemental Table S4) flanking each of the eight CArG elements shown in Table 1 and PCR cloned intervening regions from total genomic DNA obtained from primary-derived human coronary artery smooth muscle cells. The PCR products were sequence verified and then cloned into the pGL3 basic vector for luciferase assay. Site-directed mutagenesis (QuikChange, Stratagene) was done with primers that contained the SNP defined from our genomic scan. Cells (COS7 and rat pulmonary artery smooth muscle cells) were transfected with either the WT or SNP sequence in the absence or presence of SRF-VP16 (43), and luciferase activity was measured 24 h later. Data were normalized to a control Renilla reporter that was cotransfected in all samples. Results are reported as the mean (± SD) of four replicates. One-way ANOVA with Tukey's post hoc test was done to compute any statistical differences between WT, WT+SRF, SNP, and SNP+SRF for each gene, but we limited reporting to comparisons between WT and SNP reporters with SRF. A probability value of <0.05 was considered statistically significant.
RESULTS
In silico CArG sequence conservation screening.
We previously reported on the initial definition of the mammalian CArGome through a computational biology approach that analyzed only a small fraction of RefSeq genes (75). Here, we used the latest sequence assemblies to interrogate an 8 kb window of genomic sequence centered at the TSS in each of 18,925 protein-coding human genes for the presence of CArG boxes. We refer to this collection of CArG boxes as the RefSeq CArGome (RS-CArGome). This analysis revealed a total of 142,597 CArG boxes over a total genomic interval of 151.4 Mb. This number of CArG boxes agrees well with the theoretical number of 1 CArG per ∼1 kb of genomic sequence as described previously (53). Of the 142,597 CArG boxes identified, 8,252 (∼ 5% of CArG boxes) met the 10 bp average phastCons threshold of 0.842 (see materials and methods) and were therefore classified as conserved CArG boxes (Supplemental Table S5). A total of 657/8,252 (8%) conserved CArG boxes conformed to the consensus sequence, CCW6GG (where W is either adenine or thymine). The majority of conserved CArG boxes (7,595/8,252 or 92%) were CArG-like sequences that followed the consensus pattern with 1 bp deviation as described previously (53). Sequence analysis of the RS-CArGome showed an overall consensus of CCATTTATGG with each position of the CArG box displaying variable substitution patterns (Fig. 2). Of the 1,216 theoretical CArG sequences, 1,147 were present in our conserved CArG set, including all but two of the 64 types of consensus CArG boxes (CCTATATTGG and CCATTATAGG). The most common conserved CArG sequence was the CArG-like sequence CCCTTTAAGG, which was identified 62 times, and the most common consensus CArG sequence was CCTTATTTGG, identified 26 times (Supplemental Table S6). Interestingly, only 146 of the possible 1,216 CArG-box permutations comprised 50% of the RS-CArGome, suggesting a strong selection bias for only a subset of possible CArG sequences in the human genome (Supplemental Table S6).
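The figure of 1,216 theoretical CArG sequences follows from the motif definition: 64 consensus boxes (2^6 choices for W6) plus every sequence that differs from a consensus box at exactly one position and is not itself consensus. A short Python enumeration, offered as an independent check rather than the authors' code (the function name is hypothetical), reproduces the count.

```python
from itertools import product

def carg_permutations():
    """Enumerate consensus (CCW6GG) and CArG-like (1 bp deviation) 10-mers."""
    consensus = {"CC" + "".join(core) + "GG" for core in product("AT", repeat=6)}
    carg_like = set()
    for site in consensus:
        for i, ref in enumerate(site):
            for alt in "ACGT":
                if alt == ref:
                    continue
                variant = site[:i] + alt + site[i + 1:]
                # An A<->T swap inside W6 is still consensus, so exclude it.
                if variant not in consensus:
                    carg_like.add(variant)
    return consensus, carg_like

consensus, carg_like = carg_permutations()
print(len(consensus), len(carg_like), len(consensus) + len(carg_like))
# 64 1152 1216
```

The 1,152 CArG-like sequences split into 768 with a deviation in one of the four flanking C/G positions (4 positions × 3 alternatives × 64 cores) and 384 with a C or G in the W6 core (6 positions × 2 alternatives × 32 remaining cores).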
The RS-CArGome encompasses 5,213 unique protein-coding genes (27.5% of the initial 18,925 protein-coding genes searched). Of the 209 previously validated mammalian SRF-target genes, all were identified with the initial CArG matching algorithm, with 72% meeting the phastCons conservation threshold value of 0.842. As an independent test for CArG element enrichment, we also evaluated our set of 5,213 suspected SRF-target genes using the oPOSSUM algorithm (29). This analysis confirmed that the majority of sequences common to our set of 5,213 genes were indeed CArG-like sequences (Supplemental Table S7). Other binding sites found (e.g., TATA and homeodomain) reflect the AT-rich central domain of CArG boxes.
We further defined the genomic region where the 8,252 conserved CArG boxes were located and calculated the distance of each CArG box from the TSS (Fig. 3). Given that some conserved CArG elements were discovered in the promoter region of multiple genes (see below), we assumed the gene closest to the CArG box to be the priority SRF target gene and thus determined the genomic region and calculated the distance based on this CArG-SRF target gene pairing (Supplemental Table S5). Over half of the conserved CArG boxes were located downstream from the TSS as indicated by the skewed histogram (Fig. 3A). The two regions with the highest number of conserved CArG elements were 500 bp immediately up- and downstream from the TSS, which is consistent with data showing a direct interaction between SRF and a subunit of the RNA polymerase II holoenzyme (34). While conserved CArG boxes were most commonly found in the immediate 5′ promoter region, a surprisingly high percentage (30%) was found within exons (Fig. 3B). We speculate that regulatory element conservation will coincide with protein coding information in many of these cases.
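The distance calculation can be sketched in Python as a strand-aware signed offset, negative for upstream and positive for downstream as in Table 1. The exact convention is an assumption here: the sketch measures from the CArG coordinate that is reached first when reading along the gene's strand, and it bins distances in 500 bp intervals to mirror the 500 bp regions discussed above (the bin width used for Fig. 3 is not stated). Both function names are hypothetical.

```python
def position_to_tss(carg_start, carg_end, tss, gene_strand):
    """Signed distance (bp) from the TSS to the start of the CArG box.

    Coordinates are chromosomal with carg_start <= carg_end; gene_strand is
    1 or -1. Negative values are upstream of the TSS, positive downstream.
    """
    if gene_strand == 1:
        return carg_start - tss
    if gene_strand == -1:
        return tss - carg_end
    raise ValueError("gene_strand must be 1 or -1")

def distance_bin(distance, width=500):
    """Lower edge of the histogram bin containing `distance`."""
    return (distance // width) * width

print(position_to_tss(900, 909, 1000, 1))     # -100
print(position_to_tss(1091, 1100, 1000, -1))  # -100
print(distance_bin(-282), distance_bin(2056)) # -500 2000
```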
We next used the GREAT algorithm (49) to evaluate the functional significance of conserved CArG regions adjacent to putative SRF-target genes. GREAT analysis revealed enrichment for GO terms of biological processes related to cellular differentiation, the cytoskeleton, nervous system development, muscle cell development, and contraction, as well as tissue morphogenesis (Fig. 4). These classifications are consistent with known roles for SRF in normal and pathological processes (50).
Identification of CArG boxes near histone gene clusters.
Bidirectional promoters (79) and co-regulated gene expression patterns (85) are prevalent in the human genome; however, no SRF-binding CArG elements that control divergently transcribed genes or gene clusters have been identified to date. To explore the possibility that a given CArG box may be active for multiple and/or divergently transcribed SRF-target genes, we searched for conserved CArG boxes that are located in genomic regions corresponding to divergent promoters. We identified a total of 646 CArG boxes that were located in divergently arrayed promoter regions of two or more genes (Supplemental Table S8). Interestingly, we found several conserved CArG boxes adjacent to 26 histone genes on chromosomes 1 and 6. For example, we identified a cluster of three conserved CArG boxes between the divergently transcribed histone genes HIST2H2BE and HIST2H2AC separated by ∼300 bp. Despite the close proximity of these three CArG elements over this short genomic interval, we were unable to validate them as SRF dependent using luciferase reporter assay or expression analysis of the histone genes following SRF knockdown (data not shown). This suggests either that we have not found a permissive context for these CArG boxes to exhibit function or that they represent nonfunctional CArG boxes.
Defining functional CArG-SNPs.
A total of 115 unique, genotyped SNPs were identified across the entire RS-CArGome, representing 1.4% of all conserved CArG boxes (Supplemental Table S9). In four cases (TSPAN7, MDK, CHRM4, and TNN), the SNP resulted in a tolerable central A-T substitution, suggesting functional SRF-CArG binding would be preserved. Approximately 14% of the conserved CArG elements (1,130) were found to have 1,232 unique SNPs located within 35 bp up- or downstream from the CArG box (Supplemental Table S10). For the cumulative 1,347 unique SNPs identified, there are 475 transversions, 784 transitions, 35 deletions, 40 insertions, and 3 mixed polymorphisms. The transversion-to-transition ratio of 1:1.65 is a slightly higher ratio of transversions than typically seen in the genome at large, but consistent with other analyses of rSNPs (21).
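The substitution classes and the reported ratio can be checked with a brief Python sketch. It uses the standard definitions (a transition exchanges purine for purine or pyrimidine for pyrimidine; anything else is a transversion) and the counts given above; the function name is hypothetical, and the example alleles come from Table 1.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_class(ref, alt):
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(substitution_class("A", "G"))  # transition   (ADRB2, rs35118767)
print(substitution_class("G", "T"))  # transversion (RALYL, rs62530172)

transitions, transversions = 784, 475  # counts reported in the text
print(round(transitions / transversions, 2))  # 1.65
```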
We selected eight novel CArG sequences from diverse GO Term Functional classifications exhibiting SNPs for further experimentation (Table 1). We have begun to catalogue these CArG-SNPs in a database using the UCSC genome browser (Fig. 5). Gel shift assays showed that each WT CArG sequence displayed variable binding to in vitro translated human SRF as evidenced by a supershift of the CArG-SRF nucleoprotein complex upon addition of an antibody to SRF (Fig. 6). We next tested the same sequences with the SNP introduced and found in seven of eight cases, the CArG-SNP mutation resulted in reduced SRF binding (Fig. 6). These results suggest that the majority of SNPs within CArG boxes will exhibit attenuated SRF binding. The CArG box associated with ABHD5, which had an SNP 2 bp outside the CArG box, also showed decreased binding. Interestingly, one of the CArG-SNPs (associated with CDH3) exhibited greater SRF binding than that observed with the WT CArG sequence (Fig. 6). This result was unexpected since the CArG-SNP in CDH3 caused a second nucleotide deviation from the consensus CArG and would therefore be less likely to support SRF binding based on the known binding behavior of SRF to CArG (53). This finding underscores unanticipated complexity of SRF binding to CArG sequences (see discussion).
To further evaluate the functional activity of the eight WT CArG sequences and the consequence of each CArG-SNP therein, we cloned a portion of regulatory sequences encompassing each CArG box (Supplemental Table S4) from human genomic DNA and cloned the sequences into a luciferase reporter plasmid. We then performed site-directed mutagenesis to create each CArG-SNP and transfected two cell types with either the WT CArG or each respective CArG-SNP for SRF-dependent luciferase activity. In two independent studies, we noted that WT CArG sequences that bound SRF tended to exhibit enhanced luciferase activity upon cotransfection with SRF-VP16 (see materials and methods). Introduction of the SNP impaired basal luciferase activity in four of the eight CArG boxes studied and reduced SRF-dependent transactivation, consistent with reduced SRF binding (Fig. 7). This overall trend of decreased promoter activity mirrors the EMSA binding data. However, given that SRF interacts with >60 cofactors (53), it is unclear if the specific luciferase assay utilized in this study is the appropriate context to evaluate all CArG-SNPs found by our methods. Also, consistent with the EMSA results, the promoter activity for CDH3 was significantly increased with the introduction of the SNP. Taken together, these results have greatly expanded the human CArGome and demonstrate the functional activity of rSNPs that fall within or adjacent to CArG boxes.
DISCUSSION
Efforts to elucidate the impact of rSNPs on human disease have garnered significant attention since completion of the sequencing phase of the human genome project. Much of the work in this area has concentrated on GWAS. With the advent of SNP chips, the number of GWAS has drastically accelerated in recent years (25). While estimates vary widely, some indicate the number of SNPs associated with human disease to be as high as 52,000 (33). Unfortunately, we are left with little understanding of biological mechanisms or function of these 52,000 SNPs, which remains one of the major criticisms of GWAS (15, 48). While identification of genetic risk has notable merit (46), understanding the functional significance of SNPs will lead to a deeper understanding of the cellular and molecular mechanisms of disease and facilitate the development of more clinically-relevant preventive and therapeutic interventions (27).
In silico models of TFBS prediction are lauded for cellular and environmental independence and perform adequately in lower organisms, but their success as predictors of functional rSNPs in more complex organisms has been limited (14). This weakness may stem in part from incomplete PWMs. For instance, the PWM stored in the Jaspar database (64) for CArG boxes (MA0083.1 accessed at http://jaspar.genereg.net on 13 April 2011) is based on one study (62) and does not represent what is currently known for this well-characterized TFBS (53). Furthermore, while identifying conserved nonprotein coding regions via multispecies alignments and phylogenetic foot-printing has further improved in silico rSNP identification models (30, 68, 77, 83), many of these comparative approaches require strict multispecies alignment of entire promoter regions and, as a result, have relatively poor sensitivity and specificity of TFBS prediction in vivo.
We have created the largest dataset of putative SRF-CArG sequences known to date. Our sequence-based approach was intentionally broad, so as to capture all CArG elements that adhere to our extensive knowledge of its well-validated sequence motif. Discovery of transcription factor target genes by array-based methods would likely capture only a subset of all target genes. Here, we utilize computational independence to avoid the limitations of experimental context introduced by wet-lab methods and next generation technologies. Given SRF's broad biological importance and the implications of such an extensive dataset discovered here, there is a definite need for validating these putative SRF targets using ChIP followed by deep sequencing in multiple cell types. It will also be of some interest to delineate subsets of the CArGome based on the SRF cofactor utilized.
The advantage of our method for identifying rSNPs in the RS-CArGome is based upon the integration of in silico and biological techniques. First, we employed a pure sequence-based algorithm based on the known CArG sequences that should bind SRF to minimize TFBS prediction errors. Second, we used a conserved, multispecies approach by utilizing a phastCons score, which is based on whole genome alignments from 46 vertebrate species. While whole genome alignment algorithms have acknowledged weaknesses (9), the advantage of using phastCons is that a conservation probability score is determined for each individual base in the aligned genomes and can be scaled to include only a specific sequence of interest. In our case, only the 10 bp CArG box adjacent to RefSeq genes required conservation, as opposed to previous approaches that require entire promoter sequence alignment based on fewer species (82). Third, we screened the human RS-CArGome to identify rSNPs within or adjacent to the CArG box to test whether such rSNPs induced alterations in SRF-CArG binding and function. Importantly, we demonstrate decreased SRF-CArG binding in seven of eight selected CArG boxes harboring an internal or external rSNP. We further show decreased promoter activity in four of these CArG-SNPs via luciferase assay. By definition, an SNP substitution in a consensus CArG will result in a CArG-like box, while an SNP in a CArG-like sequence should result in a nonconforming SRF binding site. This variation in the CArG box is significant because SRF binds to consensus CArG elements more tightly than to CArG-like boxes (26, 51). Consistent with this notion, validation EMSA gels for ADRB2, KCNA3, RALYL, and KLF6, carrying an SNP located within a consensus CArG box, show diminished SRF binding. 
Additionally, the decreased binding pattern displayed in ABHD5 suggests that SNPs in close proximity to the CArG box may also impact SRF-dependent transcription by potentially disrupting SRF's interaction with neighboring transcription factors. Our in silico sequence and conservation-based approach significantly expands the human CArGome, and the resulting dataset represents one of the largest collections of computationally predicted TFBS reported to date with insights into the functional consequences of CArG-SNPs. Interestingly, we found 14 CArG-SNPs that fall within haplotype blocks linked to various human phenotypes, including Type 2 diabetes, coronary artery disease, and ischemic stroke (Supplemental Table S11). Additionally, we conducted linkage disequilibrium (LD) analysis with our CArG-SNPs and discovered a number of known GWAS SNPs in LD with CArG-SNPs (Supplemental Table S12). None of these CArG-SNPs reside within the 10 bp CArG element itself. Whether these CArG-SNPs alter SRF binding and/or contribute to phenotypic etiology is unknown.
Our present study likely underestimates the number of biologically active human CArG boxes. While a conservation-based approach has merits, it may have increased the number of false negatives by eliminating nonconserved CArG boxes. Additionally, the discovery of a functionally active CArG box longer than 10 bp (37, 41, 54) and the use of immunoprecipitation-based assays to identify several putative SRF target genes containing a CArG box that deviates from the 1 bp substitution rule (88) suggest that there are more permutations of the CArG box than the 1,216 we searched for here. In this context, we show that an atypical CArG box with two substitutions associated with the CDH3 gene supports greater SRF binding than the same CArG element with just a 1 bp substitution. Finally, although SRF has known microRNA targets (56, 90) and SNPs have been shown to alter microRNA function (74), we did not include microRNA targets in this analysis. However, a recent microRNA-seq study revealed a large number of CArG-containing microRNAs in gastrointestinal smooth muscle, with many shown to be regulated by SRF (58).
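The figure of 1,216 permutations can be recovered by enumeration, assuming the search set consists of the CC[A/T]6GG consensus plus every sequence that differs from it by exactly one substitution. The sketch below is only a check on that count. The two-substitution sequence at the end is a made-up example, not the CDH3 element.

```python
from itertools import product

BASES = "ACGT"
# Bases allowed at each position of the CC[A/T]6GG consensus.
ALLOWED = ["C", "C"] + ["AT"] * 6 + ["G", "G"]


def consensus_boxes():
    """All 10-mers matching CC[A/T]6GG exactly."""
    return {"".join(p) for p in product(*ALLOWED)}


def one_substitution_boxes(consensus):
    """All 10-mers that differ from some consensus box at exactly one position."""
    variants = set()
    for box in consensus:
        for i, ok in enumerate(ALLOWED):
            for base in BASES:
                if base not in ok:  # only bases that break the consensus
                    variants.add(box[:i] + base + box[i + 1:])
    return variants


consensus = consensus_boxes()
carg_like = one_substitution_boxes(consensus)
search_set = consensus | carg_like

print(len(consensus), len(carg_like), len(search_set))  # 64 1152 1216

# Hypothetical box with two substitutions in the A/T-rich core:
# it falls outside the set that a 1 bp substitution rule would search.
hypothetical_two_sub = "CCTGACATGG"
print(hypothetical_two_sub in search_set)  # False
```

Under this assumption the count splits into 64 consensus boxes and 1,152 CArG-like boxes, so any active element with two substitutions, or one longer than 10 bp, would be missed by construction.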
Our approach was limited to an 8 kb window of genomic sequence around HUGO protein-coding genes, representing <5% of the human genome. If we consider the remaining nonredundant genomic sequence outside the window we interrogated and apply a phastCons threshold of 0.97099 (the average of all 8,252 CArG boxes defined here, see Supplemental Table S5), we may infer a conservative estimate of some 50,000 CArG boxes outside the RS-CArGome. Assuming that a similar percentage of these CArG boxes harbor rSNPs as those reported here (1.4%), we estimate there will be ∼700 additional CArG-SNPs in the human genome. Moreover, the predominance of only a subset of CArG sequences (146/1,216) in >50% of the RS-CArGome indicates a selection bias for specific CArG sequences in this region of the genome. It will be interesting to determine whether this CArG sequence bias extends outside the RS-CArGome. It may also prove true that many nonprotein coding RNAs depend on SRF for expression. In this context, it will be important to apply our computational method to the remainder of the genome and assess functionality using high-throughput technologies such as ChIP-seq. It must be emphasized, however, that ChIP-seq alone, while powerful, is limited to the cell type and cell context under analysis and will always underestimate the totality of TFBS (59). For instance, ChIP-seq identified 2,429 combined density profile peaks for SRF in human Jurkat cells (80) and 1,262 peaks in a macrophage cell line (73). Furthermore, Cooper et al. (12) used ChIP-chip to study SRF binding in neuronal, smooth muscle, and Jurkat cells and identified 216 SRF binding sites. These numbers are lower than those generated through computational means (89).
Nevertheless, in silico screening methods employed in isolation have inherent limitations as well, and they yield more robust results when combined with wet-lab assays such as ChIP-seq. In particular, incorporating the whole genome TFBS and chromatin state data from the ENCODE project will undoubtedly improve computational screening for functional elements in the genome and assist in identifying rSNPs linked to human disease (66).
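The extrapolation to ∼700 additional CArG-SNPs given above is simple arithmetic and can be retraced as follows. The sketch assumes that the 1.4% figure corresponds to the 118 CArG boxes with an internal SNP out of the 8,252 conserved CArG boxes; the variable names are illustrative.

```python
# Counts reported for the RS-CArGome.
cargs_with_internal_snp = 118
conserved_cargs = 8252

# Assumed origin of the 1.4% figure: internal CArG-SNPs per conserved CArG box.
snp_fraction = cargs_with_internal_snp / conserved_cargs

# Inferred number of CArG boxes outside the window of analysis.
cargs_outside = 50_000

# Estimate with the unrounded fraction, and with the fraction rounded to 1.4%.
estimate_exact = cargs_outside * snp_fraction
estimate_rounded = cargs_outside * round(snp_fraction, 3)

print(f"fraction of CArG boxes with an internal SNP: {snp_fraction:.2%}")
print(f"estimate with unrounded fraction: {estimate_exact:.0f}")
print(f"estimate with 1.4%: {estimate_rounded:.0f}")
```

The two estimates differ by only about 15 CArG-SNPs, so the rounding of the percentage has little bearing on the order of magnitude; the 50,000 figure is the dominant source of uncertainty.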
Integrating remote CArG elements identified by in silico whole genome curation with ChIP-seq may reveal novel SRF functions. For example, studies in yeast suggest a role for nonpromoter CArG elements in DNA replication (6). Moreover, it is possible that CArG boxes are associated with such genomic functions as recombination and structural integration of DNA with surrounding histones or nucleoskeletal proteins. Interestingly, we identified multiple CArG boxes near histone gene clusters, but the set of three CArG elements within a ∼300 bp region positioned between two divergently transcribed histone genes was not biologically active in our limited wet-lab analysis (data not shown). Moreover, expression of several zinc-finger transcription factors with multiple conserved CArG elements was found to be unaltered upon knockdown of SRF (data not shown). There are two possible explanations for these surprising results. First, we may not have found the correct cell culture context for these CArG elements to display activity. Alternatively, the results may simply reflect the fact that these CArG boxes are inactive, which would imply the existence of false positives in our dataset of 8,252 conserved CArG boxes. While in silico approaches will likely have a higher false discovery rate, high-throughput assays such as ChIP-seq may have a higher false negative rate stemming from the biological context in which these assays are performed.
Several individual rSNPs have been functionally validated using traditional wet-lab techniques, including luciferase reporter assays, EMSA, microarray, and ChIP studies (10, 16, 57, 65, 81). Furthermore, many screening, annotation, and prioritization resources exist to flag potentially functional SNPs as high-probability candidates for inclusion in GWAS, such as CASCAD (22), SNPnexus (7), SNPLogic (61), SNPinfo (87), and FitSNPs (8). The weakness of these tools is that they depend heavily on in silico algorithms, typically require an a priori set of genes, and lack wet-lab functional validation, especially with regard to nonprotein-coding SNPs (35). The ORegAnno database stores experimentally identified DNA regulatory regions and regulatory variants (20). However, ORegAnno contains only 175 human rSNPs and appears to be outdated and not maintained for use as a practical resource for functional SNP evaluation. Thus, there remains a need for active central repositories of functionally validated rSNPs merged with disease-associated SNPs reported in GWAS. The results of this study represent a foundation for the development of such a resource.
GRANTS
This work was supported by National Heart, Lung, and Blood Institute Grants HL-62572 and HL-091168 to J. M. Miano.
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the author(s).
ACKNOWLEDGMENTS
The authors thank Drs. Rob Fortuna and Brett Robbins from the University of Rochester Combined Internal Medicine-Pediatrics Residency Program for project support.
Footnotes
The online version of this article contains supplemental material.
REFERENCES
- 1. Ameur A, Rada-Iglesias A, Komorowski J, Wadelius C. Identification of candidate regulatory SNPs by combination of transcription-factor-binding site prediction, SNP genotyping and haploChIP. Nucleic Acids Res 37: e85, 2009.
- 2. Andersen MC, Engstrom PG, Lithwick S, Arenillas D, Eriksson P, Lenhard B, Wasserman WW, Odeberg J. In silico detection of sequence variations modifying transcriptional regulation. PLoS Comput Biol 4: e5, 2008.
- 3. Boeva V, Surdez D, Guillon N, Tirode F, Fejes AP, Delattre O, Barillot E. De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis. Nucleic Acids Res 38: e126, 2010.
- 4. Buchwalter G, Gross C, Wasylyk B. Ets ternary complex transcription factors. Gene 324: 1–14, 2004.
- 5. Buckland PR. The importance and identification of regulatory polymorphisms and their mechanisms of action. Biochim Biophys Acta 1762: 17–28, 2006.
- 6. Chang VK, Donato JJ, Chan CS, Tye BK. Mcm1 promotes replication initiation by binding specific elements at replication origins. Mol Cell Biol 24: 6514–6524, 2004.
- 7. Chelala C, Khan A, Lemoine NR. SNPnexus: a web database for functional annotation of newly discovered and public domain single nucleotide polymorphisms. Bioinformatics 25: 655–661, 2009.
- 8. Chen R, Morgan AA, Dudley J, Deshpande T, Li L, Kodama K, Chiang AP, Butte AJ. FitSNPs: highly differentially expressed genes are more likely to have variants associated with disease. Genome Biol 9: R170, 2008.
- 9. Chen X, Tompa M. Comparative assessment of methods for aligning multiple genome sequences. Nat Biotechnol 28: 567–572, 2010.
- 10. Chorley BN, Wang X, Campbell MR, Pittman GS, Noureddine MA, Bell DA. Discovery and verification of functional single nucleotide polymorphisms in regulatory genomic regions: current and developing technologies. Mutat Res 659: 147–157, 2008.
- 11. Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 15: 901–913, 2005.
- 12. Cooper SJ, Trinklein ND, Nguyen L, Myers RM. Serum response factor binding sites differ in three human cell types. Genome Res 17: 136–144, 2007.
- 13. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res 14: 1188–1190, 2004.
- 14. Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinformatics 8, Suppl 7: S21, 2007.
- 15. Donnelly P. Progress and challenges in genome-wide association studies in humans. Nature 456: 728–731, 2008.
- 16. Elnitski L, Jin VX, Farnham PJ, Jones SJ. Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Res 16: 1455–1464, 2006.
- 17. ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SC, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermuller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaoz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Loytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, NISC Comparative Sequencing Program, Baylor College of Medicine Human Genome Sequencing Center, Washington University Genome Sequencing Center, Broad Institute, Children's Hospital Oakland Research Institute, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CW, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JN, Yu Y, Ruan Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PI, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrimsdottir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VV, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B, de Jong PJ. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799–816, 2007.
- 18. Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Gordon L, Hendrix M, Hourlier T, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Larsson P, Longden I, McLaren W, Overduin B, Pritchard B, Riat HS, Rios D, Ritchie GR, Ruffier M, Schuster M, Sobral D, Spudich G, Tang YA, Trevanion S, Vandrovcova J, Vilella AJ, White S, Wilder SP, Zadissa A, Zamora J, Aken BL, Birney E, Cunningham F, Dunham I, Durbin R, Fernandez-Suarez XM, Herrero J, Hubbard TJ, Parker A, Proctor G, Vogel J, Searle SM. Ensembl 2011. Nucleic Acids Res 39: D800–D806, 2011.
- 19. Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nat Rev Genet 10: 241–251, 2009.
- 20. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, Hooghe B, Van Loo P, Blanco E, Ticoll A, Lithwick S, Portales-Casamar E, Donaldson IJ, Robertson G, Wadelius C, De Bleser P, Vlieghe D, Halfon MS, Wasserman W, Hardison R, Bergman CM, Jones SJ, Open Regulatory Annotation Consortium. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res 36: D107–D113, 2008.
- 21. Guo Y, Jamison DC. The distribution of SNPs in human gene regulatory regions. BMC Genomics 6: 140, 2005.
- 22. Guryev V, Berezikov E, Cuppen E. CASCAD: a database of annotated candidate single nucleotide polymorphisms associated with expressed sequences. BMC Genomics 6: 10, 2005.
- 23. Han YJ, Hu WY, Chernaya O, Antic N, Gu L, Gupta M, Piano M, de Lanerolle P. Increased myosin light chain kinase expression in hypertension: Regulation by serum response factor via an insertion mutation in the promoter. Mol Biol Cell 17: 4039–4050, 2006.
- 24. Hannenhalli S. Eukaryotic transcription factor binding sites–modeling and integrative search methods. Bioinformatics 24: 1325–1331, 2008.
- 25. Hardy J, Singleton A. Genomewide association studies and human disease. N Engl J Med 360: 1759–1768, 2009.
- 26. Hautmann MB, Madsen CS, Mack CP, Owens GK. Substitution of the degenerate smooth muscle (SM) alpha-actin CC(A/T-rich)6GG elements with c-fos serum response elements results in increased basal expression but relaxed SM cell specificity and reduced angiotensin II inducibility. J Biol Chem 273: 8398–8406, 1998.
- 27. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106: 9362–9367, 2009.
- 28. Hirschhorn JN, Daly MJ. Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95–108, 2005.
- 29. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW. oPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Res 35: W245–W252, 2007.
- 30. Hooghe B, Hulpiau P, van Roy F, De Bleser P. ConTra: a promoter alignment analysis tool for identification of transcription factor binding sites across species. Nucleic Acids Res 36: W128–W132, 2008.
- 31. Hudson TJ. Wanted: regulatory SNPs. Nat Genet 33: 439–440, 2003.
- 32. Johansen FE, Prywes R. Serum response factor: transcriptional regulation of genes induced by growth factors and differentiation. Biochim Biophys Acta 1242: 1–10, 1995.
- 33. Johnson AD, O'Donnell CJ. An open access database of genome-wide association results. BMC Med Genet 10: 6, 2009.
- 34. Joliot V, Demma M, Prywes R. Interaction with RAP74 subunit of TFIIF is required for transcriptional activation by serum response factor. Nature 373: 632–635, 1995.
- 35. Karchin R. Next generation tools for the annotation of human SNPs. Brief Bioinform 10: 35–52, 2009.
- 36. Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, Hong MY, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel JO, Snyder M. Variation in transcription factor binding among humans. Science 328: 232–235, 2010.
- 37. Kasza A, Wyrzykowska P, Horwacik I, Tymoszuk P, Mizgalska D, Palmer K, Rokita H, Sharrocks AD, Jura J. Transcription factors Elk-1 and SRF are engaged in IL1-dependent regulation of ZC3H12A expression. BMC Mol Biol 11: 14, 2010.
- 38. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res 12: 996–1006, 2002.
- 39. Knight JC. Regulatory polymorphisms underlying complex disease traits. J Mol Med 83: 97–109, 2005.
- 40. Kuttippurathu L, Hsing M, Liu Y, Schmidt B, Maskell DL, Lee K, He A, Pu WT, Kong SW. CompleteMOTIFs: DNA motif discovery platform for transcription factor binding experiments. Bioinformatics 27: 715–717, 2010.
- 41. Kuwahara K, Kinoshita H, Kuwabara Y, Nakagawa Y, Usami S, Minami T, Yamada Y, Fujiwara M, Nakao K. Myocardin-related transcription factor A is a common mediator of mechanical stress- and neurohumoral stimulation-induced cardiac hypertrophic signaling leading to activation of brain natriuretic peptide gene expression. Mol Cell Biol 30: 4134–4148, 2010.
- 42. Leung S, Miyamoto NG. Point mutational analysis of the human c-fos serum response factor binding site. Nucleic Acids Res 17: 1177–1195, 1989.
- 43. Long X, Tharp DL, Georger MA, Slivano OJ, Lee MY, Wamhoff BR, Bowles DK, Miano JM. The smooth muscle cell-restricted KCNMB1 ion channel subunit is a direct transcriptional target of serum response factor and myocardin. J Biol Chem 284: 33671–33682, 2009.
- 44. Macintyre G, Bailey J, Haviv I, Kowalczyk A. is-rSNP: a novel technique for in silico regulatory SNP detection. Bioinformatics 26: i524–30, 2010.
- 45. Manke T, Heinig M, Vingron M. Quantifying the effect of sequence variation on regulatory interactions. Hum Mutat 31: 477–483, 2010.
- 46. Manolio TA. Genomewide association studies and assessment of the risk of disease. N Engl J Med 363: 166–176, 2010.
- 47. Martin MM, Buckenberger JA, Jiang J, Malana GE, Nuovo GJ, Chotani M, Feldman DS, Schmittgen TD, Elton TS. The human angiotensin II type 1 receptor +1166 A/C polymorphism attenuates microRNA-155 binding. J Biol Chem 282: 24262–24269, 2007. [Retracted]
- 48. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9: 356–369, 2008.
- 49. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 28: 495–501, 2010.
- 50. Miano JM. Role of serum response factor in the pathogenesis of disease. Lab Invest 90: 1274–1284, 2010.
- 51. Miano JM. Serum response factor: toggling between disparate programs of gene expression. J Mol Cell Cardiol 35: 577–593, 2003.
- 52. Miano JM, Carlson MJ, Spencer JA, Misra RP. Serum response factor-dependent regulation of the smooth muscle calponin gene. J Biol Chem 275: 9814–9822, 2000.
- 53. Miano JM, Long X, Fujiwara K. Serum response factor: master regulator of the actin cytoskeleton and contractile apparatus. Am J Physiol Cell Physiol 292: C70–C81, 2007.
- 54. Mokalled MH, Johnson A, Kim Y, Oh J, Olson EN. Myocardin-related transcription factors regulate the Cdk5/Pctaire1 kinase cascade to control neurite outgrowth, neuronal migration and brain development. Development 137: 2365–2374, 2010.
- 55. Musunuru K, Strong A, Frank-Kamenetsky M, Lee NE, Ahfeldt T, Sachs KV, Li X, Li H, Kuperwasser N, Ruda VM, Pirruccello JP, Muchmore B, Prokunina-Olsson L, Hall JL, Schadt EE, Morales CR, Lund-Katz S, Phillips MC, Wong J, Cantley W, Racie T, Ejebe KG, Orho-Melander M, Melander O, Koteliansky V, Fitzgerald K, Krauss RM, Cowan CA, Kathiresan S, Rader DJ. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466: 714–719, 2010.
- 56. Niu Z, Li A, Zhang SX, Schwartz RJ. Serum response factor micromanaging cardiogenesis. Curr Opin Cell Biol 19: 618–627, 2007.
- 57. Pampin S, Rodriguez-Rey JC. Functional analysis of regulatory single-nucleotide polymorphisms. Curr Opin Lipidol 18: 194–198, 2007.
- 58. Park C, Hennig GW, Sanders KM, Cho JH, Hatton WJ, Redelman D, Park JK, Ward SM, Miano JM, Yan W, Ro S. SRF-dependent microRNAs regulate gastrointestinal smooth muscle cell phenotypes. Gastroenterology 141: 164–175, 2011.
- 59. Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10: 669–680, 2009.
- 60. Philippar U, Schratt G, Dieterich C, Muller JM, Galgoczy P, Engel FB, Keating MT, Gertler F, Schule R, Vingron M, Nordheim A. The SRF target gene Fhl2 antagonizes RhoA/MAL-dependent activation of SRF. Mol Cell 16: 867–880, 2004.
- 61. Pico AR, Smirnov IV, Chang JS, Yeh RF, Wiemels JL, Wiencke JK, Tihan T, Conklin BR, Wrensch M. SNPLogic: an interactive single nucleotide polymorphism selection, annotation, and prioritization system. Nucleic Acids Res 37: D803–D809, 2009.
- 62. Pollock R, Treisman R. A sensitive method for the determination of protein-DNA binding specificities. Nucleic Acids Res 18: 6197–6204, 1990.
- 63. Ponomarenko JV, Merkulova TI, Orlova GV, Fokin ON, Gorshkova EV, Frolov AS, Valuev VP, Ponomarenko MP. rSNP Guide, a database system for analysis of transcription factor binding to DNA with variations: application to genome annotation. Nucleic Acids Res 31: 118–121, 2003.
- 64. Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 38: D105–D110, 2010.
- 65. Prokunina L, Alarcon-Riquelme ME. Regulatory SNPs in complex diseases: their identification and functional validation. Expert Rev Mol Med 6: 1–15, 2004.
- 66. Raney BJ, Cline MS, Rosenbloom KR, Dreszer TR, Learned K, Barber GP, Meyer LR, Sloan CA, Malladi VS, Roskin KM, Suh BB, Hinrichs AS, Clawson H, Zweig AS, Kirkup V, Fujita PA, Rhead B, Smith KE, Pohl A, Kuhn RM, Karolchik D, Haussler D, Kent WJ. ENCODE whole-genome data in the UCSC genome browser (2011 update). Nucleic Acids Res 39: D871–D875, 2011.
- 67. Rebbeck TR, Spitz M, Wu X. Assessing the function of genetic variants in candidate gene association studies. Nat Rev Genet 5: 589–597, 2004.
- 68. Sandelin A. Prediction of regulatory elements. Methods Mol Biol 453: 233–244, 2008.
- 69. Selvaraj A, Prywes R. Expression profiling of serum inducible genes identifies a subset of SRF target genes that are MKL dependent. BMC Mol Biol 5: 13, 2004.
- 70. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29: 308–311, 2001.
- 71. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15: 1034–1050, 2005.
- 72. Stepanek J, Vincent M, Turpin PY, Paulin D, Fermandjian S, Alpert B, Zentz C. C–>G base mutations in the CArG box of c-fos serum response element alter its bending flexibility: consequences for core-SRF recognition. FEBS J 274: 2333–2348, 2007.
- 73. Sullivan AL, Benner C, Heinz S, Huang W, Xie L, Miano JM, Glass CK. Serum response factor utilizes distinct promoter- and enhancer-based mechanisms to regulate cytoskeletal gene expression in macrophages. Mol Cell Biol 31: 861–875, 2011.
- 74. Sun G, Yan J, Noltner K, Feng J, Li H, Sarkis DA, Sommer SS, Rossi JJ. SNPs in human miRNA genes affect biogenesis and function. RNA 15: 1640–1651, 2009.
- 75. Sun Q, Chen G, Streb JW, Long X, Yang Y, Stoeckert CJ, Jr, Miano JM. Defining the mammalian CArGome. Genome Res 16: 197–207, 2006.
- 76. Tan Z, Randall G, Fan J, Camoretti-Mercado B, Brockman-Schneider R, Pan L, Solway J, Gern JE, Lemanske RF, Nicolae D, Ober C. Allele-specific targeting of microRNAs to HLA-G and risk of asthma. Am J Hum Genet 81: 829–834, 2007.
- 77. Tokovenko B, Golda R, Protas O, Obolenskaya M, El'skaya A. COTRASIF: conservation-aided transcription-factor-binding site finder. Nucleic Acids Res 37: e49, 2009.
- 78. Torkamani A, Schork NJ. Predicting functional regulatory polymorphisms. Bioinformatics 24: 1787–1792, 2008.
- 79. Trinklein ND, Aldred SF, Hartman SJ, Schroeder DI, Otillar RP, Myers RM. An abundance of bidirectional promoters in the human genome. Genome Res 14: 62–66, 2004.
- 80. Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Meth 5: 829–834, 2008.
- 81. Wang X, Tomso DJ, Chorley BN, Cho HY, Cheung VG, Kleeberger SR, Bell DA. Identification of polymorphic antioxidant response elements in the human genome. Hum Mol Genet 16: 1188–1200, 2007.
- 82. Ward LD, Bussemaker HJ. Predicting functional transcription factor binding through alignment-free and affinity-based analysis of orthologous promoter sequences. Bioinformatics 24: i165–71, 2008.
- 83. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5: 276–287, 2004.
- 84. Won KJ, Ren B, Wang W. Genome-wide prediction of transcription factor binding sites using an integrated model. Genome Biol 11: R7, 2010.
- 85. Woo YH, Walker M, Churchill GA. Coordinated expression domains in mammalian genomes. PLoS One 5: e12158, 2010.
- 86. Xie X, Rigor P, Baldi P. MotifMap: a human genome-wide map of candidate regulatory motif sites. Bioinformatics 25: 167–174, 2009.
- 87. Xu Z, Taylor JA. SNPinfo: integrating GWAS and candidate gene information into functional SNP selection for genetic association studies. Nucleic Acids Res 37: W600–W605, 2009.
- 88. Zhang SX, Garcia-Gras E, Wycuff DR, Marriot SJ, Kadeer N, Yu W, Olson EN, Garry DJ, Parmacek MS, Schwartz RJ. Identification of direct serum-response factor gene targets during Me2SO-induced P19 cardiac cell differentiation. J Biol Chem 280: 19115–19126, 2005.
- 89. Zhang X, Odom DT, Koo SH, Conkright MD, Canettieri G, Best J, Chen H, Jenner R, Herbolsheimer E, Jacobsen E, Kadam S, Ecker JR, Emerson B, Hogenesch JB, Unterman T, Young RA, Montminy M. Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc Natl Acad Sci USA 102: 4459–4464, 2005.
- 90. Zhao Y, Samal E, Srivastava D. Serum response factor regulates a muscle-specific microRNA that targets Hand2 during cardiogenesis. Nature 436: 214–220, 2005.