Nucleic Acids Research. 2007 Nov 2;36(Database issue):D83–D87. doi: 10.1093/nar/gkm875

CTCFBSDB: a CTCF-binding site database for characterization of vertebrate genomic insulators

Lei Bao 1, Mi Zhou 1, Yan Cui 1,*
PMCID: PMC2238977  PMID: 17981843

Abstract

Recent studies on transcriptional control of gene expression have pinpointed the importance of long-range interactions and the three-dimensional organization of chromatin within the nucleus. Distal regulatory elements such as enhancers may activate transcription over long distances; hence, their action must be restricted within appropriate boundaries to prevent illegitimate activation of non-target genes. Insulators are DNA elements with enhancer-blocking and/or chromatin-bordering functions. In vertebrates, the versatile transcription regulator CCCTC-binding factor (CTCF) is the only identified trans-acting factor that confers enhancer-blocking insulator activity. CTCF-binding sites were found to be widely distributed along vertebrate genomes. We have constructed a CTCF-binding site database (CTCFBSDB) to characterize experimentally identified and computationally predicted CTCF-binding sites. Biological knowledge and data from multiple resources have been integrated into the database, including sequence data, genetic polymorphisms, function annotations, histone methylation profiles, gene expression profiles and comparative genomic information. A web-based user interface was implemented for data retrieval, analysis and visualization. In silico prediction of CTCF-binding motifs is provided to facilitate the identification of candidate insulators in the query sequences submitted by users. The database can be accessed at http://insulatordb.utmem.edu/

INTRODUCTION

CCCTC-binding factor (CTCF) is a versatile transcription regulator that is evolutionarily conserved from fruit fly to human (1). CTCF binds to different DNA sequences by combinatorial use of its 11 zinc fingers and plays a key role in many chromatin insulation events [reviewed in (2)]. In eukaryotic genomes, chromatin is organized into distinct domains. The chromatin domain architecture is critical for transcription control. Insulators are the key DNA sequence elements that establish and maintain such domain boundaries (2–12). They represent a class of diverged DNA sequences capable of shielding genes from inappropriate cis-regulatory signals from the genomic neighborhood. There are two types of insulators: enhancer-blocking insulators, which block enhancer–promoter communication, and barrier insulators, which protect against heterochromatin-mediated silencing (13). Many recent studies have been devoted to the identification and characterization of insulators. CTCF-binding sites are of particular interest because CTCF is the only protein identified so far in vertebrates that binds to enhancer-blocking insulators and shows enhancer-blocking activity. Recent studies also linked CTCF-binding sites to epigenetic processes, such as imprinting (14–16), X-chromosome inactivation (17,18) and interchromosomal colocalization (19). Despite their obvious importance, to our knowledge, there is no public database cataloging this type of regulatory element. In addition to dozens of well-characterized CTCF-binding sites with validated insulation functions that are scattered in the biomedical literature, several recent high-throughput ChIP-chip analyses and comparative genomic studies (20–23) have identified tens of thousands of potential CTCF-binding sites in the human and mouse genomes. Here we report our effort in creating a CTCF-binding site database, a collection of experimentally identified and computationally predicted CTCF-binding sites. Biological knowledge and data from multiple resources were integrated to annotate the CTCF-binding sites. The database is designed to facilitate studies on insulators and their roles in regulating gene expression and demarcating functional genomic domains.

DATA SOURCES AND PROCESSING

Data sources

Experimentally identified and computationally predicted CTCF-binding sites were processed separately. First, 34 417 experimentally identified CTCF-binding sites were collected from four sources: (i) 110 manually curated CTCF-binding sites from biomedical literature, denoted by identifiers starting with ‘INSUL_MAN’, (ii) 244 mouse CTCF-binding sites identified by Ohlsson and coworkers using ChIP-chip assay (21), denoted by identifiers starting with ‘INSUL_OHL’, (iii) 13 801 human CTCF-binding sites identified by Ren and coworkers using ChIP-chip assay (20), denoted by identifiers starting with ‘INSUL_REN’ and (iv) 20 262 human CTCF-binding sites identified by Zhao and coworkers using massive direct sequencing of ChIP DNA (23), denoted by identifiers starting with ‘INSUL_ZHAO’. Second, we collected the conserved CTCF-specific sequence motifs (∼20 bp) in the human and mouse genomes that were predicted using motif scan (20,22). We excluded those (∼40%) that overlapped with any of the experimentally determined CTCF-binding sites. The resulting 18 905 entries include 7736 human and 5504 mouse CTCF-binding sites predicted in (20) and 5665 human CTCF-binding sites predicted in (22). The computationally predicted CTCF-binding sites have identifiers beginning with ‘INSUL_PRE’.
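
The exclusion step described above can be illustrated with a minimal sketch. The paper does not specify how overlap was computed, so the half-open coordinate convention, the `Site` type and the function names below are assumptions made for illustration only, and the coordinates are invented.

```python
from typing import List, NamedTuple, Sequence


class Site(NamedTuple):
    """A genomic interval (hypothetical type; half-open coordinates assumed)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a: Site, b: Site) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def exclude_overlapping(predicted: Sequence[Site],
                        experimental: Sequence[Site]) -> List[Site]:
    """Keep only predicted motifs that overlap no experimentally identified site."""
    return [p for p in predicted
            if not any(overlaps(p, e) for e in experimental)]


# Invented coordinates, for illustration only.
experimental = [Site("chr1", 1000, 1500), Site("chr2", 200, 260)]
predicted = [
    Site("chr1", 1480, 1500),  # inside an experimental site: dropped
    Site("chr1", 5000, 5020),  # no experimental site nearby: kept
    Site("chr2", 260, 280),    # adjacent but not overlapping: kept
]
kept = exclude_overlapping(predicted, experimental)
print(kept)
```

The pairwise comparison is quadratic and is meant only to make the rule explicit; with tens of thousands of sites, a sorted or indexed structure would be the more practical choice.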

Annotation of the CTCF-binding sites

Table 1 shows the major data fields of the database. For the 110 manually curated CTCF-binding sites, we used a set of controlled vocabularies to describe their properties. The ‘Validation Method’ field specifies whether CTCF binding was validated by in vitro and/or in vivo assays and whether the CTCF-binding sequence showed enhancer-blocking function in a transgenic experiment (24). The ‘In situ Function’ field annotates the biological roles of a CTCF-binding site (CTCFBS) in its natural genomic context (enhancer-blocking, chromatin boundary, etc.). The ‘Description’ field contains other features of the CTCF-binding site (e.g. methylation-sensitivity of the CTCF binding). Genomic coordinates of the CTCF-binding sequences were determined using the BLAT alignment program (25). The genome assemblies used are hg18 for human, mm8 for mouse, rn3 for rat and galGal2 for chicken. CTCF-binding sequences without chromosome location information were probably mapped to unsequenced portions (e.g. heterochromatic regions) of the genome (21).

Table 1.

Description of the fields

Field name             Description
ID [a]                 Unique identifier of an entry
Species [a]            Species name
Name                   Name used by the authors in the original paper
Chromosome location    CTCF-binding site position
Orientation            Forward (+) or reverse (−) strand
5′-Flanking gene       5′-Flanking gene of the CTCF-binding site along the genome
3′-Flanking gene       3′-Flanking gene of the CTCF-binding site along the genome
Validation method [a]  The validation methods including in vitro binding, in vivo binding, enhancer-blocking assay and sequence analysis
In situ function       In situ function of the CTCF-binding site
Description            Other important features of the CTCF-binding site
Reference [a]          PubMed reference
Sequence [a]           DNA sequence of the CTCF-binding site

[a] A mandatory field.
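
The fields in Table 1 map naturally onto a simple record type. The sketch below is not the database's actual schema, which the paper does not describe; the class name, attribute names and the `missing_mandatory` helper are hypothetical, and the example values are placeholders rather than real entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CtcfSiteRecord:
    """One entry with the fields of Table 1 (hypothetical layout)."""
    # Mandatory fields according to Table 1.
    id: str
    species: str
    validation_method: str
    reference: str
    sequence: str
    # Remaining fields may be absent for some entries.
    name: Optional[str] = None
    chromosome_location: Optional[str] = None
    orientation: Optional[str] = None  # "+" or "-"
    five_prime_flanking_gene: Optional[str] = None
    three_prime_flanking_gene: Optional[str] = None
    in_situ_function: Optional[str] = None
    description: Optional[str] = None


MANDATORY = ("id", "species", "validation_method", "reference", "sequence")


def missing_mandatory(record: CtcfSiteRecord) -> List[str]:
    """Return the names of mandatory fields that are empty."""
    return [name for name in MANDATORY if not getattr(record, name)]


# Placeholder values only; this is not a real database entry.
draft = CtcfSiteRecord(
    id="INSUL_MAN_example",
    species="Homo sapiens",
    validation_method="in vitro binding",
    reference="",          # PubMed reference not yet filled in
    sequence="ACGTACGT",
)
print(missing_mandatory(draft))  # ['reference']
```

A check of this kind could be one part of screening user submissions, although the paper states only that submissions are checked manually.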

Sequence features of the CTCF-binding sites

The sequences of CTCF-binding sites in the database vary from 20 bp to several hundred bp. There are two reasons for this length heterogeneity. First, different experimental methods may have different base-pair resolutions for locating CTCF-binding sites. Second, different laboratories often have different research goals when they publish the original sequences. Some researchers may stop at a 500-bp region encapsulating the CTCF-binding sites while others may further narrow down to the sequences physically covered by the CTCF protein. Most CTCF-binding sites were found to share a 20-bp motif (20), which is highlighted using consecutive arrows. The direction of the arrows shows the genomic orientation of the motif. We also highlighted, using vertical indicators, all the single nucleotide polymorphisms (SNPs) in the dbSNP database (26) that are located in a CTCF-binding site. Mutations that disrupt CTCF-binding sites may lead to abnormal gene expression and cause diseases. Indeed, recent studies showed that inherited mutations that abolish CTCF-binding sites in the human H19 differentially methylated region (DMR) can cause Beckwith–Wiedemann syndrome (27,28). Thus, the naturally occurring mutations in CTCF-binding sites may represent new types of genetic variations that underlie phenotypes including disease status. To get more information about any of the SNPs, the user can click the SNP indicator to browse the corresponding dbSNP webpage (26).

Genomic context track

The genomic context of a CTCF-binding site provides clues to its in situ functions. The CTCF-binding site (red) and flanking genes within a distance of 100 kb are displayed using the UCSC genome browser (29) (Figure 1). Other CTCF-binding sites located in this genomic region are also displayed, and different colors are used to distinguish the sources of the CTCF-binding sites: yellow for INSUL_MAN, blue for INSUL_OHL, green for INSUL_REN, cyan for INSUL_ZHAO and black for INSUL_PRE. An important function of CTCF-bound insulators is to demarcate transcriptionally active and silent chromatin domains, which are marked by distinct histone methylation patterns. A recent study provided high-resolution maps of histone methylations (chromatin domains) in the human genome (23). H3K4 trimethylation (H3K4me3) and H3K27 trimethylation (H3K27me3) are a pair of ‘Yin-Yang’ modifications: high levels of H3K4me3 and H3K27me3 represent gene activation and silencing, respectively (23). We integrated the H3K4me3 and H3K27me3 maps with our genomic context track of CTCF-binding sites using the genome browser to facilitate the utilization of this valuable information.

Figure 1.

The genomic context of a few CTCF-binding sites. The CTCF-binding sites reside at the boundary between the two histone methylation domains (H3K4me3 and H3K27me3).

Flanking gene expression track

Another in situ function of insulators is to maintain independent expression patterns of neighboring genes. Suppose there is a tissue-specific enhancer that should control the transcription of one gene but not that of the other in a pair of neighboring genes. A CTCF-binding site located between the enhancer and the promoter of the second gene may function as an enhancer-blocking insulator to protect against illegitimate transcriptional activation. In this scenario, the neighboring genes may have very different expression status in that tissue. We created a flanking gene expression track to compare the expression patterns of the genes flanking the CTCF-binding site. The data were obtained from The Genomics Institute of the Novartis Research Foundation (GNF) Gene Expression Atlas 2 (30), which contains genome-wide gene expression profiles of 61 mouse tissues and 79 human tissues. The raw data were log-transformed (base 2) and normalized to have a mean of 0 and an SD of 1. The expression images were created using the Slcview software (http://slcview.stanford.edu), in which red indicates overexpression and green indicates underexpression. An example of a gene expression track is shown in Figure 2.
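
The transformation described above (log base 2, then scaling to mean 0 and SD 1) can be sketched as follows. The paper does not state whether normalization was applied per gene or per array, nor which SD estimator was used; this sketch normalizes a single profile with the population SD, and the function name and input values are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def normalize_profile(raw: Sequence[float]) -> List[float]:
    """Log2-transform a profile, then scale it to mean 0 and SD 1.

    Assumes all raw values are positive (math.log2 raises ValueError otherwise)
    and uses the population standard deviation, which is an assumption.
    """
    logged = [math.log2(v) for v in raw]
    mu = mean(logged)
    sd = pstdev(logged)
    if sd == 0:
        raise ValueError("profile is constant; cannot scale to SD 1")
    return [(v - mu) / sd for v in logged]


# Invented raw intensities for one gene across four tissues.
z = normalize_profile([1.0, 2.0, 4.0, 8.0])
print([round(v, 3) for v in z])
```

After this step, positive values correspond to above-average expression (shown in red in the track) and negative values to below-average expression (shown in green).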

Figure 2.

A CTCF-binding site (INSUL_MAN0004) webpage displays the flanking gene expression profiles and links to tracks of SNPs, genomic context and orthologous regions.

Mammalian orthologous region track

Comparative genomic studies on human, mouse and rat may provide insights into the evolution of CTCF-binding sites. To this end, we created a track of mammalian orthologous regions. For any of the three genomes, the regions containing CTCF-binding sites and flanking genes were used to query orthologous regions in the other two genomes from the UCSC precomputed block chains (31,32). Only the DNA blocks with the maximal alignment score against the query region were retained as orthologous regions. The aligned orthologous sequences in up to 16 vertebrate genomes can be displayed by clicking the ‘view alignment’ button (Figure 2).
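
The selection rule above, which retains only the block with the maximal alignment score for each target genome, can be sketched as follows. The `ChainBlock` type, its fields and the example scores are hypothetical; they do not reproduce the format of the UCSC chain files.

```python
from typing import NamedTuple, Optional, Sequence


class ChainBlock(NamedTuple):
    """A candidate aligned block in another genome (hypothetical layout)."""
    target_genome: str
    chrom: str
    start: int
    end: int
    score: int


def best_block(blocks: Sequence[ChainBlock],
               target_genome: str) -> Optional[ChainBlock]:
    """Return the highest-scoring block for one target genome, or None."""
    candidates = [b for b in blocks if b.target_genome == target_genome]
    return max(candidates, key=lambda b: b.score, default=None)


# Invented candidate blocks for one human query region.
blocks = [
    ChainBlock("mm8", "chr7", 1000, 9000, 52000),
    ChainBlock("mm8", "chr2", 400, 900, 3100),
    ChainBlock("rn3", "chr1", 2000, 9500, 47000),
]
print(best_block(blocks, "mm8"))
print(best_block(blocks, "galGal2"))  # None: no candidate in that genome
```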

CTCF-binding site prediction

CTCF uses different combinations of its zinc fingers to recognize divergent DNA sequences. Recent studies have identified core motifs for CTCFBS sequences (20,22). The motifs are represented by position weight matrices (PWMs). Altogether, four closely related PWMs have been derived to accommodate the sequence divergence in CTCF-binding sites (20,22). The database provides a simple web tool to search for the core CTCF-binding motifs in a query sequence. It uses the STORM program (33) to scan the query sequences for each of the four PWMs and reports the best hits.
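
To make the idea of PWM scanning concrete, the following is a minimal sketch of a log-odds scan over both strands that reports the single best hit. It is not the STORM program and does not use the published CTCF matrices: the 4-column `TOY_PWM`, the uniform background, the pseudocount and all function names are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Toy matrix (NOT a real CTCF PWM): one dict of base probabilities per column.
TOY_PWM: List[Dict[str, float]] = [
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def log_odds(pwm, background=0.25, pseudocount=0.01):
    """Convert column probabilities to log2 odds against a uniform background."""
    norm = 1.0 + 4 * pseudocount
    return [{b: math.log2((col[b] + pseudocount) / norm / background)
             for b in BASES} for col in pwm]


def best_hit(seq: str, pwm) -> Optional[Tuple[float, int, str, str]]:
    """Return (score, forward-strand start, strand, matched window) or None."""
    scores = log_odds(pwm)
    width = len(scores)
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            if any(ch not in BASES for ch in window):
                continue  # skip windows containing N or other symbols
            score = sum(scores[j][window[j]] for j in range(width))
            # Report the start position in forward-strand coordinates.
            start = i if strand == "+" else len(seq) - i - width
            if best is None or score > best[0]:
                best = (score, start, strand, window)
    return best


hit = best_hit("AAGGCCCTAA", TOY_PWM)
print(hit[1:])  # (4, '+', 'CCCT')
```

A real scan would also need a score threshold or significance estimate to decide which hits to report, which is the problem STORM addresses (33).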

UTILITY AND DISCUSSION

First, a web interface was developed for browsing the experimentally identified and computationally predicted CTCF-binding sites. Users can focus on entries of interest using four selection controls: Species, Validation Method, In Situ Function and Description. The in situ function of most known CTCF-binding sites is to act as boundary elements. However, in some biological contexts, CTCF-binding sites may also function as elements for transcription activation/repression [reviewed in (34)]. Second, a text search interface was developed for querying the database. Users can search for CTCF-binding sites by element name or by the PubMed identifier of the original literature. A useful approach is to retrieve the CTCF-binding sites adjacent to a gene of interest by entering an official gene symbol or words used in the gene description. Third, the database provides a sequence similarity search (35) for comparing query sequences with CTCF-binding sequences. Finally, a genomic range search option is provided. Users can specify a genomic interval and retrieve all the CTCF-binding sites in the interval.
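
The genomic range search mentioned last can be sketched with a per-chromosome sorted index and a binary search. The paper does not describe how the database implements this query, so the `RangeIndex` class, the half-open coordinates, the choice to return only sites fully contained in the interval, and the site identifiers below are all assumptions.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Iterable, List, Tuple


class RangeIndex:
    """Per-chromosome index of sites sorted by start (hypothetical sketch)."""

    def __init__(self, sites: Iterable[Tuple[str, str, int, int]]):
        # Each site is (site_id, chrom, start, end), half-open coordinates.
        by_chrom = defaultdict(list)
        for site_id, chrom, start, end in sites:
            by_chrom[chrom].append((start, end, site_id))
        self._entries = {c: sorted(v) for c, v in by_chrom.items()}
        self._starts = {c: [e[0] for e in v] for c, v in self._entries.items()}

    def query(self, chrom: str, start: int, end: int) -> List[str]:
        """Return IDs of sites lying entirely within [start, end) on chrom."""
        entries = self._entries.get(chrom, [])
        starts = self._starts.get(chrom, [])
        i = bisect_left(starts, start)  # first site starting at or after start
        hits = []
        while i < len(entries) and entries[i][0] < end:
            if entries[i][1] <= end:
                hits.append(entries[i][2])
            i += 1
        return hits


# Invented sites with placeholder identifiers.
index = RangeIndex([
    ("site_a", "chr11", 100, 120),
    ("site_b", "chr11", 150, 400),
    ("site_c", "chr11", 500, 520),
    ("site_d", "chr7", 100, 120),
])
print(index.query("chr11", 90, 450))  # ['site_a', 'site_b']
print(index.query("chr11", 90, 300))  # ['site_a']
```

Returning sites that merely overlap the interval, rather than those contained in it, would require also checking sites that start before the query start.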

To maintain an up-to-date resource, we encourage researchers to submit newly identified CTCFBS sequences to the database. Data can be submitted directly through a web interface. The submissions will be manually checked before being added to the database.

The database is an integrative platform for storing, retrieving and characterizing vertebrate genomic insulators. We envision that with more and more experimentally validated CTCFBS sequences available in the database, a comprehensive analysis of these sequences may facilitate the extraction of meaningful sequence signals, uncover the functional basis of insulators, and ultimately enable the mapping of every distinct transcription domain along the genomes.

ACKNOWLEDGEMENTS

We thank Dr Bing Ren for providing the 20-bp motif information for 13 801 experimentally identified CTCF-binding sites, Drs Bing Ren, Xiaohui Xie and Eric S. Lander for providing the predicted CTCF motifs and Dr Keji Zhao for providing the genomic coordinates of 20 262 CTCF-binding sites. Funding to pay the Open Access publication charges for this article was provided by The University of Tennessee Health Science Center.

Conflict of interest statement. None declared.

REFERENCES

1. Moon H, Filippova G, Loukinov D, Pugacheva E, Chen Q, Smith ST, Munhall A, Grewe B, Bartkuhn M, et al. CTCF is conserved from Drosophila to humans and confers enhancer blocking of the Fab-8 insulator. EMBO Rep. 2005;6:165–170. doi: 10.1038/sj.embor.7400334.
2. West AG, Gaszner M, Felsenfeld G. Insulators: many functions, many mechanisms. Genes Dev. 2002;16:271–288. doi: 10.1101/gad.954702.
3. Bell AC, West AG, Felsenfeld G. Insulators and boundaries: versatile regulatory elements in the eukaryotic. Science. 2001;291:447–450. doi: 10.1126/science.291.5503.447.
4. Gaszner M, Felsenfeld G. Insulators: exploiting transcriptional and epigenetic mechanisms. Nat. Rev. Genet. 2006;7:703–713. doi: 10.1038/nrg1925.
5. Kuhn EJ, Geyer PK. Genomic insulators: connecting properties to mechanism. Curr. Opin. Cell Biol. 2003;15:259–265. doi: 10.1016/s0955-0674(03)00039-5.
6. Brasset E, Vaury C. Insulators are fundamental components of the eukaryotic genomes. Heredity. 2005;94:571–576. doi: 10.1038/sj.hdy.6800669.
7. Capelson M, Corces VG. Boundary elements and nuclear organization. Biol. Cell. 2004;96:617–629. doi: 10.1016/j.biolcel.2004.06.004.
8. Engel N, Bartolomei MS. Mechanisms of insulator function in gene regulation and genomic imprinting. Int. Rev. Cytol. 2003;232:89–127. doi: 10.1016/s0074-7696(03)32003-0.
9. Fourel G, Magdinier F, Gilson E. Insulator dynamics and the setting of chromatin domains. Bioessays. 2004;26:523–532. doi: 10.1002/bies.20028.
10. Geyer PK, Clark I. Protecting against promiscuity: the regulatory role of insulators. Cell Mol. Life Sci. 2002;59:2112–2127. doi: 10.1007/s000180200011.
11. Labrador M, Corces VG. Setting the boundaries of chromatin domains and nuclear organization. Cell. 2002;111:151–154. doi: 10.1016/s0092-8674(02)01004-8.
12. West AG, Fraser P. Remote control of gene transcription. Hum. Mol. Genet. 2005;14:R101–R111. doi: 10.1093/hmg/ddi104.
13. Scott KC, Merrett SL, Willard HF. A heterochromatin barrier partitions the fission yeast centromere into discrete chromatin domains. Curr. Biol. 2006;16:119–129. doi: 10.1016/j.cub.2005.11.065.
14. Bell AC, Felsenfeld G. Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature. 2000;405:482–485. doi: 10.1038/35013100.
15. Hark AT, Schoenherr CJ, Katz DJ, Ingram RS, Levorse JM, Tilghman SM. CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus. Nature. 2000;405:486–489. doi: 10.1038/35013106.
16. Yu W, Ginjala V, Pant V, Chernukhin I, Whitehead J, Docquier F, Farrar D, Tavoosidana G, Mukhopadhyay R, et al. Poly(ADP-ribosyl)ation regulates CTCF-dependent chromatin insulation. Nat. Genet. 2004;36:1105–1110. doi: 10.1038/ng1426.
17. Chao W, Huynh KD, Spencer RJ, Davidow LS, Lee JT. CTCF, a candidate trans-acting factor for X-inactivation choice. Science. 2002;295:345–347. doi: 10.1126/science.1065982.
18. Valley CM, Willard HF. Genomic and epigenomic approaches to the study of X chromosome inactivation. Curr. Opin. Genet. Dev. 2006;16:240–245. doi: 10.1016/j.gde.2006.04.008.
19. Ling JQ, Li T, Hu JF, Vu TH, Chen HL, Qiu XW, Cherry AM, Hoffman AR. CTCF mediates interchromosomal colocalization between Igf2/H19 and Wsb1/Nf1. Science. 2006;312:269–272. doi: 10.1126/science.1123191.
20. Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, Green RD, Zhang MQ, Lobanenkov VV, Ren B. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell. 2007;128:1231–1245. doi: 10.1016/j.cell.2006.12.048.
21. Mukhopadhyay R, Yu W, Whitehead J, Xu J, Lezcano M, Pack S, Kanduri C, Kanduri M, Ginjala V, et al. The binding sites for the chromatin insulator protein CTCF map to DNA methylation-free domains genome-wide. Genome Res. 2004;14:1594–1602. doi: 10.1101/gr.2408304.
22. Xie X, Mikkelsen TS, Gnirke A, Lindblad-Toh K, Kellis M, Lander ES. Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of CTCF insulator sites. Proc. Natl Acad. Sci. USA. 2007;104:7145–7150. doi: 10.1073/pnas.0701811104.
23. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837. doi: 10.1016/j.cell.2007.05.009.
24. Chung JH, Whiteley M, Felsenfeld G. A 5′ element of the chicken beta-globin domain serves as an insulator in human erythroid cells and protects against position effect in Drosophila. Cell. 1993;74:505–514. doi: 10.1016/0092-8674(93)80052-g.
25. Kent WJ. BLAT – The BLAST-Like Alignment Tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202.
26. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29:308–311. doi: 10.1093/nar/29.1.308.
27. Sparago A, Cerrato F, Vernucci M, Ferrero GB, Silengo MC, Riccio A. Microdeletions in the human H19 DMR result in loss of IGF2 imprinting and Beckwith-Wiedemann syndrome. Nat. Genet. 2004;36:958–960. doi: 10.1038/ng1410.
28. Prawitt D, Enklaar T, Gartner-Rupprecht B, Spangenberg C, Oswald M, Lausch E, Schmidtke P, Reutzel D, Fees S, et al. Microdeletion of target sites for insulator protein CTCF in a chromosome 11p15 imprinting center in Beckwith-Wiedemann syndrome and Wilms’ tumor. Proc. Natl Acad. Sci. USA. 2005;102:4085–4090. doi: 10.1073/pnas.0500037102.
29. Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, et al. The UCSC Genome Browser Database. Nucleic Acids Res. 2003;31:51–54. doi: 10.1093/nar/gkg129.
30. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl Acad. Sci. USA. 2004;101:6062–6067. doi: 10.1073/pnas.0400782101.
31. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003;13:103–107. doi: 10.1101/gr.809403.
32. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl Acad. Sci. USA. 2003;100:11484–11489. doi: 10.1073/pnas.1932072100.
33. Schones DE, Smith AD, Zhang MQ. Statistical significance of cis-regulatory modules. BMC Bioinformatics. 2007;8:19. doi: 10.1186/1471-2105-8-19.
34. Ohlsson R, Renkawitz R, Lobanenkov V. CTCF is a uniquely versatile transcription regulator linked to epigenetics and disease. Trends Genet. 2001;17:520–527. doi: 10.1016/s0168-9525(01)02366-6.
35. Altschul SF, Gish W, Miller W, Meyers EW, Lipman DJ. Basic Local Alignment Search Tool. J. Mol. Biol. 1990;215:403. doi: 10.1016/S0022-2836(05)80360-2.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
