Summary
This article illustrates the use of the Encyclopedia of DNA Elements (ENCODE) resource to generate or refine hypotheses from genomic data on disease and other phenotypic traits. First, the goals and history of ENCODE and related epigenomics projects are reviewed. Second, the rationale for ENCODE and the major data types used by ENCODE are briefly described, as are some standard heuristics for their interpretation. Third, the use of the ENCODE resource is examined: standard use cases for ENCODE, access to the ENCODE resource, and access to data from related epigenomics projects are discussed. While this article focuses on interpreting genetic variation data, essentially the same approaches can be used with the ENCODE resource, or with data from other projects, to interpret epigenomic and gene regulation data, with appropriate modification (Rakyan et al. 2011; Ng et al. 2012). Such approaches could allow investigators to use genomic methods to study environmental and stochastic processes, in addition to genetic processes.
Goals and history of ENCODE
The primary goals of ENCODE, the Encyclopedia of DNA Elements, are 1) to create a comprehensive catalog of candidate functional elements in the genome, and 2) to make that catalog freely available as a community resource for all biologists. ENCODE resources can be accessed from the ENCODE portal (https://www.encodeproject.org) and at other URLs (Box 1). ENCODE data (transcription, transcription factor binding, histone modifications, DNase hypersensitivity, DNA methylation, DNA-DNA interactions, and RNA-protein interactions) are rapidly released to the public before publication, following the precedent of the human genome project. External users may freely download, analyze, and publish results based on any ENCODE data (without any embargo or restrictions) as soon as they are released. ENCODE is focused on the human genome, though about 20% of the data collected annotate the mouse genome. The fly and worm genomes were the focus of the model organism ENCODE (modENCODE) project. The catalogs, or maps, of candidate elements are intended to complement ongoing efforts to understand the functions resident in the genome, rather than to replace those individual efforts. At this time, ENCODE has released about 3000 human experiments, each containing at least 2 replicates, examining about 200 cell types (cell lines, primary cells, cells differentiated in culture, and explants), and about 900 mouse experiments, each containing at least 2 replicates, in over 100 cell lines, primary cells, and explants. To date, ENCODE human and mouse data have appeared in about 650 papers published by researchers outside of ENCODE, and modENCODE data have appeared in about 150 papers by researchers outside of modENCODE (https://www.encodeproject.org/search/?type=publication&published_by=community).
Box 1. Internet resources for ENCODE.
ENCODE Portal:
The ENCODE Portal has resources for searching, downloading, and visualizing ENCODE mouse and human data. The portal also has data summaries, an experiment list, consortium publications, community publications using ENCODE data, software tools, quality metrics, and data standards.
ENCODE project pages, NHGRI:
Tutorials on using the ENCODE resource:
http://www.genome.gov/27553900
https://www.encodeproject.org/tutorials
Automated mining of ENCODE data:
http://www.broadinstitute.org/mammals/haploreg/haploreg.php
http://regulome.stanford.edu/GWAS
Linkage between genes and regulatory elements:
If you are looking for information on visualizing ENCODE data at a region of interest, this might be a good place to start:
https://www.encodeproject.org/data/annotations/
If you are interested in seeing how labs within and outside of ENCODE are using the data in publications, start here:
https://www.encodeproject.org/publications
https://www.encodeproject.org/search/?type=publication&published_by=community
https://www.encodeproject.org/search/?type=publication&published_by=ENCODE&published_by=mouseENCODE
ENCODE and Roadmap Epigenomics data:
http://epigenomegateway.wustl.edu
ENCODE mailing list:
https://mailman.stanford.edu/mailman/listinfo/encode-announce
ENCODE was launched in 2003 with a pilot project to survey 1% of the human genome (Consortium 2004). Major findings from this microarray-based pilot phase were published in several papers in 2007 (eg. Birney et al. 2007). Based on the success of the pilot project, a genome-wide production phase using massively parallel sequencing focused on the human genome was launched in 2007, and production efforts focused on mouse were begun in 2009 (Stamatoyannopoulos et al. 2012). In addition, projects to study the fly and worm genomes were launched, in order to improve and to supplement annotations of the genomes of these important model organisms, as well as to assist in interpretation of the human genome (Celniker et al. 2009; Gerstein et al. 2010; Roy et al. 2010; Graveley et al. 2011; Kharchenko et al. 2011; Negre et al. 2011). A user’s guide to ENCODE (Consortium 2011) and recent findings on the human genome (eg. Consortium 2012) have been published. The ENCODE and modENCODE projects are funded by the National Human Genome Research Institute (NHGRI), one of the components of the National Institutes of Health (NIH, US).
Related epigenomics projects
REMC, the Roadmap Epigenomics Mapping Centers project, is funded by the Common Fund of the NIH (US). The REMC goal is to generate reference epigenomic maps for “normal” human cells/tissues. Most samples have RNA-seq, DNase, DNAme, and histone modification data.
IHEC, the International Human Epigenome Consortium, is an international project. The IHEC goal is to coordinate production of 1000 human epigenome maps for cellular states relevant to health and disease. Complete epigenomes include at least mRNA-seq, DNAme/WGBS, and 6 histone modifications. IHEC projects generally include samples from healthy people and targeted diseases. IHEC projects include BLUEPRINT (European Union), CEEHRC Epigenomic Platform Program (Canada), CREST (Japan), and DEEP (Germany). In addition, REMC (Roadmap Epigenomics) is a member of IHEC, and ENCODE is an associate member of IHEC. ENCODE, REMC, and IHEC are coordinating metadata and ontologies to improve the interoperability of the data from the projects.
GTEx, the Genotype-Tissue Expression project, is funded by the Common Fund of the NIH (US). The goal of GTEx is to study human gene expression across individuals and across multiple tissues. Most samples have RNA-seq and genotype data. Having completed a feasibility pilot phase, GTEx has recently entered a production phase.
URLs for descriptions of these projects and data distribution are included (Box 2).
Box 2. Internet resources for other epigenomics projects.
Roadmap Epigenomics Mapping Centers (REMC; US, NIH Common Fund)
Data visualization:
http://www.roadmapepigenomics.org/
http://www.genboree.org/epigenomeatlas/
http://epigenomegateway.wustl.edu
(Also Track hub on UCSC, Ensembl, and ENCODE portal)
Project description:
International Human Epigenome Consortium (IHEC; Multinational)
Project descriptions:
http://www.ihec-epigenomes.org/
http://ihec-epigenomes.org/research/projects/
Data portal:
http://epigenomesportal.ca/ihec/
Data sites for individual projects:
BLUEPRINT (European Union, IHEC member)
Project description:
http://www.blueprint-epigenome.eu
Data Visualization:
http://www.blueprint-epigenome.eu/index.cfm?p=7DEC5013-D416-AFDB-BFEABB82782623BB
(Also Track hub on UCSC, Ensembl, and ENCODE portal)
CEEHRC Epigenomic Platform Program (Canada, IHEC member)
Project description:
http://ihec-epigenomes.org/research/projects/epigenomic-platform-program/
Data distribution:
http://epigenomesportal.ca/ihec/
(Also Track hub on UCSC, Ensembl, and ENCODE portal)
CREST (Japan, IHEC member)
Project description:
http://crest-ihec.jp/english/index.html
Data distribution:
DEEP (Germany, IHEC member)
Project description:
GTEx (Genotype-Tissue Expression, US, NIH Common Fund)
GTEx portal: http://www.gtexportal.org/
GTEx project site: http://commonfund.nih.gov/GTEx
Rationale for ENCODE
ENCODE directly addresses a need identified during the human genome project. The goal of the human genome project was to obtain a complete human genome sequence, and to make that sequence available as a public resource. One of the first efforts to make practical use of this sequence was the GENCODE project (an ENCODE component, the Encyclopedia of genes and gene variants). GENCODE has produced a widely used annotation for protein-coding genes (Harrow et al. 2012). To date, 21 versions of the human genome annotation, covering both protein-coding and noncoding genes, have been developed and made available to the scientific community (Harrow et al. 2012). The GENCODE v21 human annotations indicate there are 19,881 protein-coding genes and 25,411 noncoding genes.
The genetic code, while very powerful, is only useful for interpreting protein-coding regions (approximately 1.2% of the human genome). However, the genome also contains regulatory regions, and as we do not have a similarly powerful regulatory code to call upon, it is beyond our grasp today to even identify all of the functional (or non-functional) regions of the genome by sequence alone, let alone to determine when, where, and how they act. Regulatory regions often appear to control distal genes, which are not necessarily the nearest gene, and often function in particular cell fates or cell states. Moreover, a particular regulatory region can apparently control more than one gene. ENCODE is one effort to fill this gap in our knowledge using unbiased experimentation.
ENCODE is also assisting the scientific community in investigating the genetic components of human disease and traits. Genetic variants that are statistically associated with traits of interest are identified using GWAS (genome-wide association studies), exome sequencing, and whole genome sequencing. These powerful studies generate testable hypotheses that can be explored in subsequent studies. The ultimate goal of these lines of experimentation is often to develop better predictions of disease risk, better diagnostics, and better therapeutics.
Unfortunately, statistical association studies have fundamental and well-known limitations. First, a variant associated with a trait or disease is not necessarily the causal variant, in part because of linkage disequilibrium and in part because most studies do not interrogate the whole genome. Second, the gene that is affected by a variant can be difficult to identify, as regulation can occur at large distances and can target multiple genes. Third, the cell fate(s) and cell state(s) affected by germline variants can be difficult to surmise. This information is important given that both complex and Mendelian disorders can involve multiple cell types. Finally, it is often not straightforward to determine how causal variants alter cells and organisms, as one must consider changes in transcription, splicing, RNA stability, translation, and protein function, among others. It is important to note that more than 90% of GWAS findings lie outside of protein-coding regions (Thurman et al. 2012), and more than 80% of the findings for signatures of recent adaptive evolution lie outside of protein-coding regions (Jones et al. 2012; Fraser 2013; Grossman et al. 2013). Individual examples, such as Fragile X, ALS, lactose tolerance, and polydactyly, also indicate that variation in noncoding DNA makes important contributions to human traits and disease, including Mendelian and severe traits (Lettice et al. 2002; Penagarikano et al. 2007; Tishkoff et al. 2007; Kleinjan and Lettice 2008; Herdewyn et al. 2012). When we fail to consider the role of noncoding DNA, we are likely ignoring the vast majority of findings, which will likely diminish our power to understand human diseases and traits. The contributions of ENCODE in providing both gene and regulatory annotations for these noncoding regions have recently proven useful (https://www.encodeproject.org/search/?type=publication&published_by=community&categories=human+disease).
Epigenomic and transcriptomic experimentation in general, and the ENCODE resource in particular, can be used to generate and refine hypotheses to allow researchers to move from statistical associations to functional connections (Hardison 2012; Hardison and Taylor 2012; Ward and Kellis 2012c; Edwards et al. 2013). This information is especially powerful with noncoding variants (Maurano et al. 2012), which is fortunate because most GWAS findings are noncoding, and we lack a regulatory code with predictive power similar to the genetic code.
ENCODE data types
The major ENCODE approach to identifying candidate functional elements is to use biochemical assays that are capable of surveying the entire genome. Because functional elements may be cell type-specific, these biochemical assays are performed in many cell types, to identify the largest and most diverse fraction of the functional elements that is practically achievable. In addition, ENCODE is to a more limited extent using the approach of comparative sequence analysis to identify and characterize candidate functional elements. These two fundamentally different approaches have different strengths and weaknesses; the same can be said of genetic approaches. As was the case before ENCODE, the findings from these different approaches do not always agree, so resolving these differences continues to be an active area of research (Nardone et al. 2004; Borneman et al. 2007; Odom et al. 2007; Noonan and McCallion 2010; Schmidt et al. 2010; Biggin 2011; Fisher et al. 2012; MacArthur et al. 2012).
Conceptually, ENCODE is attempting to identify genes, transcribed regions, and transcripts, as well as regulatory regions in DNA and RNA (Figure 1). The biochemical assays used by ENCODE have been widely used by other scientists. Indeed many of the underlying assays, such as DNase I hypersensitivity site (DHS) mapping (Wu 1980; Gross and Garrard 1988), and chromatin immunoprecipitation (ChIP) (Gilmour and Lis 1984; Solomon et al. 1988; Ren et al. 2000; Iyer et al. 2001; Litt et al. 2001; Johnson and Bresnick 2002) were initially developed by individual labs and later were adapted to genome-wide assays. Many of these assays have long been used in the study of gene regulation (Gross and Garrard 1988; Johnson and Bresnick 2002; Weinmann and Farnham 2002; Nardone et al. 2004; Li et al. 2006; Ozsolak et al. 2008; Visel et al. 2009; Creyghton et al. 2010; Rada-Iglesias et al. 2011). As these methods are used and refined, the new knowledge results in new interpretations of findings from these assays; however, the conceptual interpretation remains quite stable, as findings accumulate from many independent labs over many years (reviewed in Rivera and Ren 2013).
Figure 1. Scheme of ENCODE data types.
ENCODE identifies candidate genes (transcribed regions, and transcripts) as well as candidate regulatory elements in DNA and RNA (bottom). ENCODE researchers employ a variety of biochemical assays (rounded rectangles in middle) to identify features associated with DNA and RNA (drawing at top).
RNA data are used to identify both protein-coding genes and noncoding genes such as miRNAs and lncRNAs (reviewed in Djebali et al. 2012). Nonpolyadenylated (poly(A)−) RNAs are increasingly recognized as potential functional elements based on recent studies, including those from ENCODE (Kapranov et al. 2007; Tilgner et al. 2012; Livyatan et al. 2013). Transcriptional start sites can be identified using assays such as CAGE and RAMPAGE, which were specifically designed for this purpose (Shiraki et al. 2003; Batut et al. 2013). Active enhancers can sometimes be identified by the production of poly(A)− RNAs known as eRNAs, which are on average shorter than most protein-coding and noncoding transcripts. In addition, transcribed regions can be assembled into previously annotated and unannotated candidate transcripts (Harrow et al. 2012).
DNase hypersensitivity has long been used to identify candidate regulatory elements (Weintraub and Groudine 1976; Gross and Garrard 1988; Fraser et al. 1993; Bell et al. 2011b; Thurman et al. 2012). Binding of regulatory proteins often perturbs chromatin structure, leading to DNase hypersensitivity (Pazin et al. 1994; Pazin et al. 1996; Henikoff and Shilatifard 2011). A recent genome-wide survey revealed that eQTLs (expression quantitative trait loci) are frequently associated with genotype-dependent changes in DNase (dsQTLs, DNase I sensitivity quantitative trait loci), suggesting DHSs are frequently regulatory regions (Degner et al. 2012). Regulatory regions identified by DHSs include enhancers and promoters, as well as locus control regions and silencers (Fraser et al. 1993; Diaz et al. 1994; Sawada et al. 1994; Siu et al. 1994; Festenstein et al. 1996). Recent studies have found DHS to be useful for annotating GWAS findings (Consortium 2012; Maurano et al. 2012; Pickrell 2013). Cell types clustered by DHS fall into biologically meaningful groups by developmental lineage and age, reflecting the information contained in DHS signatures (Stergachis et al. 2013).
Histone modifications (acetylation and methylation) have long been considered as surrogate marks for gene activity (Allfrey et al. 1964; Allis et al. 2007), and have been studied by chromatin fractionation, staining of chromosomes, and ChIP (Levinger and Varshavsky 1980; Hebbes et al. 1988; Lin et al. 1989; Turner 1989; Hebbes et al. 1994). Many histone modifications and histone variants are known, and a number have relatively well-accepted correlations with gene expression (eg. Campos and Reinberg 2009; Ernst and Kellis 2010; Huff et al. 2010; Bell et al. 2011b; Ernst et al. 2011; Suganuma and Workman 2011; Consortium 2012). For example, promoters are enriched for H3K4me3, while enhancers are enriched for H3K4me1 (Heintzman et al. 2007). Active enhancers and promoters are often enriched for H3K27ac (Creyghton et al. 2010; Rada-Iglesias et al. 2011). Repressed regions of the genome are enriched for H3K27me3 and H3K9me3; the latter is found in constitutive heterochromatin, while the former is found in regions repressed during development by the polycomb system. Transcribed regions are enriched for H3K36me3 and H3K79me3; near the first intron, there appears to be a transition from H3K79 to H3K36 (Huff et al. 2010; Consortium 2012). Combinations of histone marks have been used to identify enhancers, promoters, and transcribed regions (Heintzman et al. 2007; Heintzman et al. 2009; Ernst and Kellis 2010; Ernst et al. 2011; Consortium 2012). Global analysis of histone modifications over many cell types confirms correlation with gene expression (Consortium 2012).
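The heuristics above can be summarized as a simple lookup. The following Python sketch is illustrative only: the table `HEURISTICS`, the function name `candidate_annotations`, and the input format (a set of mark names observed at a region) are assumptions introduced here, not an ENCODE tool. Real chromatin-state calls are generally based on statistical models of signal levels rather than on presence or absence of individual marks.

```python
# Illustrative lookup of the histone-mark heuristics described in the text.
# The rule table and function name are hypothetical, not an ENCODE tool.
HEURISTICS = {
    "H3K4me3": "candidate promoter",
    "H3K4me1": "candidate enhancer",
    "H3K27ac": "active promoter or enhancer",
    "H3K27me3": "polycomb-repressed region",
    "H3K9me3": "constitutive heterochromatin",
    "H3K36me3": "transcribed region",
    "H3K79me3": "transcribed region",
}


def candidate_annotations(marks):
    """Return the sorted, de-duplicated heuristic labels for a set of marks.

    Marks that are not in the table are ignored.
    """
    labels = {HEURISTICS[m] for m in marks if m in HEURISTICS}
    return sorted(labels)


if __name__ == "__main__":
    # A region carrying both H3K4me1 and H3K27ac suggests an active enhancer.
    print(candidate_annotations({"H3K4me1", "H3K27ac"}))
```

Returning all matching labels, rather than a single call, reflects the point made above that combinations of marks are more informative than any one mark alone.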
Transcription factor binding has been used for some time to identify candidate regulatory elements (Johnson and Bresnick 2002; Weinmann and Farnham 2002; Gerstein et al. 2012; Maston et al. 2012; Wang et al. 2012; Yip et al. 2012). ENCODE determines transcription factor binding regions using ChIP, DNase footprinting, and DNA sequence analysis, and has also analyzed the networks formed by these transcription factor binding sites (Gerstein et al. 2012; Neph et al. 2012a; Neph et al. 2012b; Wang et al. 2012; Yip et al. 2012; Kheradpour and Kellis 2013). ENCODE has grouped measurements of over 100 factors, including sequence-specific transcription factors, RNA polymerase subunits, general transcription factors, and chromatin remodeling enzymes, within this data type. Global analysis shows that transcription factor binding is highly correlated with gene expression, and that GWAS findings are enriched in binding sites, confirming the importance of identifying transcription factor binding sites (Consortium 2012).
DNA methylation at promoters and enhancers is inversely correlated with gene expression, while DNA methylation at gene bodies is correlated with gene expression (Varley et al. 2013). DNA methylation is inversely correlated with open chromatin, transcription factor binding, and histone modifications found in active chromatin (Bell et al. 2011a; Stadler et al. 2011). Overlap in variants that alter DNA methylation and variants that alter gene expression has been detected, and variation in DNA methylation has been found to possess a strong genetic component (Bell et al. 2011a). Cell types clustered by DNA methylation fall into biologically meaningful groups, revealing lineage and cancer signatures, reflecting the information contained in DNA methylation (Varley et al. 2013).
To date, ENCODE has not focused on characterizing long-range interactions of regulatory regions. However, ENCODE has collected limited amounts of both 5C and ChIA-PET data, and plans to continue to do so (Li et al. 2012; Sanyal et al. 2012). Both data types can identify interactions between regulatory elements, which are candidate functional interactions. 5C identifies DNA-DNA interactions, while ChIA-PET identifies DNA-DNA associations bound by a protein of interest. To date, the primary utility of this data has been to validate predictions of enhancer-promoter linkage, to predict the genes that are targets of regulatory elements. This work appears to validate the approach, in that there is good correlation between gene expression, enhancer-promoter interaction, and production of eRNA. This work also finds that the majority of looping interactions are not to the most proximal gene (Sanyal et al. 2012).
A variety of candidate functional elements can be identified by sequence conservation, including protein-coding and noncoding genes, as well as regulatory regions (eg. Nardone et al. 2004). Sequence comparison has revealed evidence of selection for 3–15% of the human genome; most of this DNA does not appear to code for protein (reviewed in Ponting and Hardison 2011). The ENCODE estimate of constraint (3–10%) is consistent with estimates by others (3–15%) (Ponting and Hardison 2011; Ward and Kellis 2012a). ENCODE estimates of the amount of human regulatory DNA (5–20%, from DNase footprints to DHS + TF-ChIP) (Consortium 2012) agree well with estimates of constraint. ENCODE has also compared biochemical and sequence-comparison based approaches to identification of functional elements (Birney et al. 2007; Consortium 2012).
Use cases for the ENCODE resource
ENCODE data have been used to suggest or refine hypotheses about the role of genetic variation in phenotypic variation (including disease). Publications by consortium members that used ENCODE data (https://www.encodeproject.org/search/?type=publication&published_by=ENCODE), as well as publications by researchers outside of ENCODE (https://www.encodeproject.org/search/?type=publication&published_by=community), illustrate a number of approaches for using the ENCODE resource in the study of human disease, basic biology (including gene regulation), and methods development. This section will present some common use cases for ENCODE data. Although the focus here is on interpreting genetic variation data, essentially the same approaches can be used with the ENCODE resource or with data from other projects, with appropriate modification, to interpret epigenomic and gene regulation data (Rakyan et al. 2011; Ng et al. 2012). Such approaches could allow investigators to use genomic methods to study environmental and stochastic processes, in addition to genetic processes.
Use case- Prediction of causal variants
Genetic association studies identify a statistical association between a trait of interest and a genetic variant (referred to here as the “tag variant”). While it is possible that the tag variant causes the phenotypic change of interest, for at least two fundamentally different reasons it is also possible that another variant (referred to here as the “causal variant”) is responsible. First, this can happen when multiple variants are correlated through linkage disequilibrium; the significance of the association can be similar enough for these variants to make it difficult to distinguish them. Second, the functional variant may not even have been tested in the study; GWAS arrays interrogate a subset of the known common variants, and exome sequencing covers a limited fraction of the genome. It is also possible for a tag variant in a protein-coding region to mark a causal variant in a regulatory region (Herdewyn et al. 2012). While there is no one best approach to identify the causal variant, epigenomic and transcriptomic data can be used to refine hypotheses (Schaub et al. 2012).
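To make the notion of correlated variants concrete, the following Python sketch computes r², a commonly used measure of linkage disequilibrium between two biallelic variants, from phased haplotypes. It is a minimal illustration under stated assumptions: the function name `ld_r_squared` and the 0/1 allele encoding are hypothetical choices made here, and the article itself does not prescribe a particular LD statistic or threshold.

```python
def ld_r_squared(hap_a, hap_b):
    """Compute r^2 between two biallelic variants from phased haplotypes.

    hap_a and hap_b are equal-length sequences of 0/1 alleles, one entry
    per haplotype (a hypothetical input format chosen for this sketch).
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype lists must be non-empty and equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at the first variant
    p_b = sum(hap_b) / n  # frequency of allele 1 at the second variant
    # Frequency of haplotypes carrying allele 1 at both variants.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic variant")
    return d * d / denom


if __name__ == "__main__":
    tag = [0, 0, 1, 1]
    candidate = [0, 0, 1, 1]
    # Identical haplotype patterns: the two variants cannot be distinguished
    # by association signal alone.
    print(ld_r_squared(tag, candidate))
```

Variants with r² close to 1 relative to the tag variant are natural candidates to carry forward into the annotation steps described in this section.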
One approach is to use semi-automated tools to analyze ENCODE data. RegulomeDB (Boyle et al. 2012) and HaploReg (Ward and Kellis 2012b) mine the data in ENCODE, as well as other resources, and return information on candidate functional variants. RegulomeDB accepts SNP IDs, single nucleotides, and chromosomal regions as inputs (Figure 2). RegulomeDB provides a score for each variant queried, with lower scores indicating more supporting evidence; for example, eQTLs score in the range 1a–1f, strong evidence of binding without an eQTL scores 2a–2c, continuing out to 6, followed by a score of no evidence. Users can click on the provided hyperlinks to examine the underlying evidence (which includes TF-ChIP signals, DNase peaks, DNase footprints, and predicted DNA sequence motifs for TFs) or to a genome browser view of the region. RegulomeDB also functions as a database, presenting a list of traits and conditions as hyperlinks, which can be selected to display associated variants, the publications that identified them, scores for variants within LD of the query, and links to RegulomeDB annotations. HaploReg accepts SNP IDs, chromosomal regions, or published GWAS studies as inputs, and also a user-configurable definition of linkage disequilibrium (Figure 3). HaploReg returns all variants within the query (as well as variants within LD), and indicates what evidence (if any) supports a functional role for each variant. While HaploReg was originally designed to mine ENCODE data, it now also mines data from the NIH Common Fund project Roadmap Epigenomic Mapping Centers (REMC). As RegulomeDB and HaploReg use somewhat different approaches, it is probably beneficial to use both for hypothesis generation.
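Because lower RegulomeDB scores indicate more supporting evidence, a list of scored variants can be prioritized with a simple sort. The Python sketch below assumes that scores have already been retrieved and are available as strings such as "1a", "2b", or "6"; the function names, the variant identifiers, and the handling of a "no evidence" label are assumptions made for illustration and are not part of RegulomeDB itself.

```python
def score_sort_key(score):
    """Map a score string such as '1a', '2b', or '6' to a sortable tuple.

    Strings that do not start with a digit (for example a 'no evidence'
    label, whose exact text is an assumption here) sort last.
    """
    text = score.strip().lower()
    if not text or not text[0].isdigit():
        return (float("inf"), "")
    return (int(text[0]), text[1:])


def rank_variants(scored_variants):
    """Return (variant_id, score) pairs ordered from most to least evidence."""
    return sorted(scored_variants, key=lambda pair: score_sort_key(pair[1]))


if __name__ == "__main__":
    # Hypothetical variant identifiers and scores, for illustration only.
    variants = [
        ("variant_1", "4"),
        ("variant_2", "1f"),
        ("variant_3", "No Data"),
        ("variant_4", "1a"),
        ("variant_5", "2b"),
    ]
    for variant_id, score in rank_variants(variants):
        print(variant_id, score)
```

A ranking of this kind is only a starting point for hypothesis generation; the underlying evidence for each variant should still be inspected.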
Figure 2. Annotating potential regulatory variants using RegulomeDB.
RegulomeDB accepts a variety of inputs, including SNP IDs and genomic coordinates (red arrow 1). RegulomeDB returns a score for each feature in the input region (red arrow 2); scores range from 1 (strongest evidence for regulatory potential) to 6 and no data (weakest evidence). Clicking on the score reveals the details of the evidence; the data types, cell fate/cell state, and the source of the data. For example, the indicated row (red arrow 3) displays an ENCODE DHS in lymphoid cells. Clicking on the browser link next to the score provides a graphical visualization of the evidence; for example, ENCODE transcription factor binding is present (red arrow 4).
Figure 3. Annotating potential regulatory variants using HaploReg.
HaploReg accepts SNP IDs and genomic coordinates as inputs (red arrow 1). HaploReg returns evidence for regulatory protein binding (mouse over to see the protein names), chromatin structure (mouse over to see the cell types with DNase hypersensitivity), the chromatin state of the region, and putative transcription factor binding motifs that are altered by the variant (row marked by red arrow 2). SNPs in LD with the query are also reported in the rows above and below.
Another approach is to manually inspect the region of interest to determine what transcriptomic and epigenomic annotations are present. Starting from the ENCODE portal annotations at https://www.encodeproject.org/data/annotations, one can view ENCODE annotations (see https://www.encodeproject.org/encyclopedia/visualize) of distal DHS and H3K27ac sites (which can be considered as candidate enhancers), and proximal sites (which are candidate promoters or enhancers). The ENCODE portal offers browsing of ENCODE data based on assay types and biological samples, via faceted browsing, to find data to visualize and download (Figure 4). Starting from the UCSC browser (https://genome.ucsc.edu/encode/), the region of interest can be viewed along with ENCODE tracks. For example, one can start with a display of transcribed regions, H3K4me1 (an enhancer mark), H3K4me3 (a promoter mark), H3K27ac (a mark for active promoters and enhancers), DNase (an active chromatin mark frequently found at regulatory regions) and transcription factor ChIP data (frequently found at regulatory regions), as recently described (Mortlock and Pregizer 2012). Using the sessions feature of the UCSC browser (Meyer et al. 2013), users can save their favorite configuration(s) to load them for future use, or to share with colleagues (instructions at http://www.genome.gov/Pages/Research/ENCODE/2012_09_18_ENCODE_Sessions.pdf). One can also use this approach to examine particular tracks of interest, such as comparing B cells, T helper subsets, and macrophages. One strength of manual visualization is the user can simultaneously visualize annotations from multiple projects, such as ENCODE, REMC, and IHEC, by loading the desired track hubs (instructions at http://www.genome.gov/Pages/Research/ENCODE/2012_09_18_ENCODE_Sessions.pdf).
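The same question asked during manual inspection (which annotations are present at the position of interest?) can also be answered programmatically once peak files have been downloaded. The Python sketch below is a minimal illustration: the function name, the tuple layout, and the peak coordinates are hypothetical, and BED-style 0-based, half-open intervals are assumed.

```python
def overlapping_annotations(chrom, position, annotations):
    """Return labels of annotations whose interval contains the position.

    annotations: iterable of (chrom, start, end, label) tuples with
    0-based, half-open coordinates (BED-style). position is 0-based.
    """
    return [
        label
        for a_chrom, start, end, label in annotations
        if a_chrom == chrom and start <= position < end
    ]


if __name__ == "__main__":
    # Hypothetical peaks; coordinates are made up for illustration.
    peaks = [
        ("chr1", 1000, 1500, "DNase peak (cell type A)"),
        ("chr1", 1200, 1800, "H3K27ac peak (cell type A)"),
        ("chr1", 5000, 5400, "H3K4me3 peak (cell type B)"),
        ("chr2", 1000, 1500, "DNase peak (cell type B)"),
    ]
    # A variant at chr1:1300 falls within both a DNase peak and an
    # H3K27ac peak, making the region a candidate active regulatory element.
    print(overlapping_annotations("chr1", 1300, peaks))
```

For large peak sets, an interval index or a dedicated genomics tool would be more efficient than this linear scan.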
Figure 4. Browsing of ENCODE data using the ENCODE Portal.
The ENCODE Portal (https://www.encodeproject.org) allows users to select subsets of ENCODE data, for example by data type (red arrow 1), organism (red arrow 2), or organ (red arrow 3). The identified data can be visualized or downloaded.
Use case- Prediction of target genes
Even if one could be certain of having identified a functional variant, the gene that is affected by the variant may not be immediately obvious, for at least two fundamentally different reasons. First, it has been known for some time that regulatory regions can regulate distal genes and multiple genes, and that the target gene(s) is not necessarily the gene with the nearest TSS (Forrester et al. 1987; Fraser et al. 1993; Bedell et al. 1995; Mohrs et al. 2001; Lee et al. 2003). Moreover, distal regulatory regions are known to contribute to human phenotypes and disease, including traits segregating in Mendelian fashion (Lettice et al. 2002; Loots et al. 2005; Kleinjan and Lettice 2008; Noonan and McCallion 2010). Data from ENCODE add support to the ideas that distal regulatory regions, and regulatory regions that target multiple genes, may not be rare (Thurman et al. 2012). If this is not taken into account, one might identify the incorrect target for developing therapeutics, or for pathway analysis (Lettice et al. 2002; Kleinjan and Lettice 2008; Noonan and McCallion 2010; Davison et al. 2012; Maurano et al. 2012). Second, the target gene could be a non-protein coding RNA, and the annotation of these regions is still being developed.
Currently, there are no high-throughput methods to determine the functional connectivity between enhancers and promoters; this is typically done through intensive study of single loci. However, ENCODE has produced DNase Hypersensitivity Site (DHS) maps from over 100 samples, and many DHSs are cell type specific. The Regulatory Elements Database (Sheffield et al. 2013) is an internet-based tool (http://dnase.genome.duke.edu) that can be used to predict the target gene for DHSs of interest, or to predict the DHSs that regulate a gene of interest (Figure 5). The predictions are based on correlations between the cell-specificity of gene expression and the cell-specificity of DHSs, which are candidate regulatory regions. As both positive and negative correlations are reported, this tool may also predict the linkage between genes and silencers, the repressive version of enhancers. Human and mouse linkage predictions can be queried from the annotations page (https://www.encodeproject.org/data/annotations/). In an earlier effort, the correlation between the cell-specificity of distal DHS and the cell specificity of promoter DHS (a surrogate of gene expression) was tabulated (Thurman et al. 2012). Supplementary table 7 can be used to predict the linkage between genes and regulatory regions. (Using one-line Unix commands, this table can be sorted by DHS start coordinate (sort -k 5,5 -k 6,6n path/filename > path/newfilename) or by the name of linked genes (sort -k 4,4 path/filename > path/newfilename).) As these two resources can identify different linkages, likely because of the difference in approach, it is useful to use both for hypothesis generation. An important feature of both of these tools is that they integrate data from large numbers of cell types to provide specificity.
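The idea behind correlation-based linkage prediction can be sketched in a few lines of Python. The example below computes the Pearson correlation between a DHS signal profile and gene expression profiles measured across the same cell types, and ranks genes by the strength of the correlation. It is a simplified illustration under stated assumptions: the function names, gene names, and signal values are hypothetical, and the published resources use their own statistical procedures, which are not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def rank_candidate_genes(dhs_signal, expression_by_gene):
    """Rank genes by |correlation| between DHS signal and expression.

    dhs_signal: DHS signal per cell type (same cell-type order throughout).
    expression_by_gene: dict of gene name -> expression per cell type.
    Positive r suggests an enhancer-like link; negative r a silencer-like link.
    """
    scored = [(gene, pearson(dhs_signal, expr))
              for gene, expr in expression_by_gene.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical profiles across four cell types; the DHS is open in one.
    dhs = [0.0, 0.0, 10.0, 0.0]
    expression = {
        "geneA": [1.0, 1.0, 9.0, 1.0],   # expressed where the DHS is open
        "geneB": [5.0, 5.0, 1.0, 5.0],   # repressed where the DHS is open
        "geneC": [1.0, 2.0, 2.0, 3.0],   # unrelated pattern
    }
    for gene, r in rank_candidate_genes(dhs, expression):
        print(gene, round(r, 3))
```

Ranking by the absolute value of the correlation keeps both positively and negatively correlated genes near the top, mirroring the point above that negative correlations may indicate silencers.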
Figure 5. Predicting genes associated with a regulatory element using the Regulatory Elements Database.
An intuitive interface facilitates identification of known DNaseI-hypersensitive sites (DHSs) within a genomic region of interest, prediction of the target gene for a DHS of interest, and prediction of the DHSs that regulate a gene of interest, as well as other tasks (red arrow 1). After finding a DHS within a coordinate range (red arrow 2), one can see that this DHS is positively correlated with the IL13 gene, predicting that this region activates IL13 (red arrow 3). The DHS profile across cell types reveals that this element is active in Th2 cells, and inactive in most of the other cell types (red arrow 4), suggesting that variants in this region could be important in a particular type of T cell.
ENCODE RNA data can also be used in predicting target genes. One can identify the transcribed regions (protein coding and noncoding) near a variant of interest, and determine the cell types in which they are expressed (Hu et al. 2011). If the relevant cell type(s) are known, it may be possible to refine the list of candidate genes.
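The refinement step described above can be sketched as a two-stage filter: first collect the transcribed regions within a window around the variant, then keep those expressed in the cell type(s) thought to be relevant. This is only an illustrative sketch; the 500 kb window, the expression cutoff, the gene names, coordinates, and expression values are all hypothetical assumptions, not values taken from ENCODE or from Hu et al. (2011).

```python
def genes_near_variant(variant, genes, window=500_000):
    """Return (gene name, distance in bp) for genes within `window` of the variant.

    `variant` is (chrom, pos); each gene is a dict with chrom, start, end, name.
    The distance is 0 when the variant lies inside the gene body.
    """
    chrom, pos = variant
    hits = []
    for gene in genes:
        if gene["chrom"] != chrom:
            continue
        dist = max(gene["start"] - pos, pos - gene["end"], 0)
        if dist <= window:
            hits.append((gene["name"], dist))
    return sorted(hits, key=lambda hit: hit[1])

def expressed_in(gene_name, expression, cell_types, min_level=1.0):
    """Return the subset of `cell_types` where the gene meets the (assumed) cutoff."""
    levels = expression.get(gene_name, {})
    return [c for c in cell_types if levels.get(c, 0.0) >= min_level]

# Hypothetical annotation and expression tables.
genes = [
    {"name": "GENE_A", "chrom": "chr5", "start": 100_000, "end": 120_000},
    {"name": "GENE_B", "chrom": "chr5", "start": 400_000, "end": 450_000},
    {"name": "GENE_C", "chrom": "chr5", "start": 2_000_000, "end": 2_050_000},
    {"name": "GENE_D", "chrom": "chr7", "start": 100_000, "end": 110_000},
]
expression = {
    "GENE_A": {"Th2": 12.0, "hepatocyte": 0.1},
    "GENE_B": {"Th2": 0.2, "hepatocyte": 9.0},
}

variant = ("chr5", 150_000)
relevant_cell_types = ["Th2"]

nearby = genes_near_variant(variant, genes)
refined = [name for name, _ in nearby
           if expressed_in(name, expression, relevant_cell_types)]
print("nearby:", nearby)
print("expressed in relevant cell types:", refined)
```

Because regulatory elements can act over long distances, the window size is a judgment call, and a gene excluded by this filter is not thereby ruled out as a target.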
Use case- Prediction of the cell fate/cell state where elements (and variants) alter function
Identification of a tag variant, causal variant, or target gene does not always reveal the pathological cell type, for at least three fundamentally different reasons. First, some diseases, such as cardiovascular disease, are known to involve and affect multiple cell types. Genetic association findings in such diseases cannot always be immediately assigned to the relevant cell type(s). Second, the defect may not be intrinsic to the cell type that displays obvious pathology, and the cell type carrying the defect may not initially be recognized as relevant to the trait or disease. For example, a rare human immunodeficiency appears to be caused by the same genetic defect found in nude mice (Frank et al. 1999). In the mouse model, it is well known that the immunodeficiency is intrinsic to the epithelial cells that form the niche for developing T cells, rather than to the affected lymphocytes (Nehls et al. 1996). A third potential confounder is that disease etiology is frequently incompletely known; for example, it was not always appreciated that type 1 diabetes and Crohn's disease were autoimmune disorders.
The previously described tools for identifying causal variants (HaploReg and RegulomeDB) and linkage between regulatory elements and genes (Regulatory Elements Database) report the cell type where the evidence for active regulatory elements was found. This information can be used to generate hypotheses about the cell type affected by the genetic variants. This information is even more powerful when one starts with a trait or disease with substantial information on the relevant cell types. In this case, one may be attempting to distinguish between very different cell types (e.g., neurons, muscle, or lymphocytes), in which case ENCODE annotations in a cell type related to one of the disease-relevant cell types may be sufficient to predict the cell type where the causal variant functions.
Manual inspection of ENCODE tracks at a locus of interest also provides information about the cell type. ENCODE composite tracks, which combine data from multiple cell types, can be expanded to reveal the contributions from individual cell types. One can select individual tracks from particular cell types, either to broadly examine the signal over widely different cell types, or to compare closely related cell types.
RNA data from ENCODE can be used to predict affected cell types. One can examine the cell-specificity of expression for genes (protein coding and noncoding) in the vicinity of a variant of interest (Hu et al. 2011). One can also determine the expression patterns of genes that are predicted to be linked to a variant of interest. Finally, if the variant is a candidate regulatory element, one can determine whether cell-type specific eRNAs are produced by the locus.
A more rigorous way to predict cell type is to calculate enrichment of variants (associated with a trait or disease of interest) in regulatory elements, by cell type. Enrichment of variants within DHSs and histone modification sites from the ENCODE and Roadmap Epigenomics projects has been used to predict known and plausible cell types in test cases (Maurano et al. 2012; Pickrell 2013; Trynka et al. 2013). This approach can also be used to increase the number of testable variants associated with a trait or disorder, moving beyond statistical thresholds to biological thresholds. One can initially apply a statistical threshold to identify relevant variants, then determine enrichment by cell type, next relax the threshold as far as possible without losing the predicted cell specificity, and then test this larger list of variants (Maurano et al. 2012). This approach could be especially useful in settings where the capacity to test variants substantially exceeds the number of predicted variants.
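The enrichment calculation can be sketched as follows. This is a simplified illustration, not the procedure used in the cited studies: it counts how many trait-associated variants fall inside the regulatory elements of each cell type, compares that with the count expected from a background set of variants, and reports a one-sided hypergeometric p-value. It assumes that the trait variants are a subset of the background variants, that elements within a cell type do not overlap, and that variants are independent; a real analysis would also need to account for linkage disequilibrium among variants, which this sketch ignores. All coordinates, cell types, and function names are hypothetical.

```python
from bisect import bisect_right
from math import comb

def make_lookup(intervals):
    """Index non-overlapping half-open (chrom, start, end) intervals by chromosome."""
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))
    for spans in by_chrom.values():
        spans.sort()
    return by_chrom

def in_intervals(lookup, chrom, pos):
    """True if start <= pos < end for some indexed interval on `chrom`."""
    spans = lookup.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect_right(spans, (pos, float("inf"))) - 1
    return i >= 0 and spans[i][0] <= pos < spans[i][1]

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n variants are drawn from N, of which K lie in elements."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)

def enrichment_by_cell_type(trait_variants, background_variants, elements_by_cell_type):
    """Per cell type: overlap count, fold enrichment, and hypergeometric p-value."""
    N, n = len(background_variants), len(trait_variants)
    results = {}
    for cell_type, intervals in elements_by_cell_type.items():
        lookup = make_lookup(intervals)
        K = sum(in_intervals(lookup, c, p) for c, p in background_variants)
        k = sum(in_intervals(lookup, c, p) for c, p in trait_variants)
        expected = n * K / N if N else 0.0
        fold = k / expected if expected else float("nan")
        results[cell_type] = {"k": k, "fold": fold,
                              "p": hypergeom_upper_tail(k, K, n, N)}
    return results

# Hypothetical toy data: 20 background variants, 4 of them trait-associated.
background = [("chr1", pos) for pos in range(100, 2001, 100)]
trait = [("chr1", 100), ("chr1", 200), ("chr1", 300), ("chr1", 1500)]
elements = {
    "Th2": [("chr1", 50, 450)],
    "hepatocyte": [("chr1", 1450, 1850)],
}

results = enrichment_by_cell_type(trait, background, elements)
for cell_type, stats in results.items():
    print(cell_type, stats["k"], round(stats["fold"], 2), round(stats["p"], 4))
```

The threshold-relaxation strategy described above could be explored by rerunning `enrichment_by_cell_type` on progressively larger lists of trait variants and stopping when the cell type with the strongest enrichment changes or the enrichment is lost.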
Summary
Development of the ENCODE resource is an ongoing effort, informed by community feedback and the ever-changing landscape of genomics technology. Clearly, a broader and deeper sampling of cell fate/cell state space is an important need. During the current phase of the project, ENCODE plans to examine more cell types, with an emphasis on explants and primary cells, and to increase the number of samples examined with many ENCODE assays. Users have found value in both explants and purified cells (though not always the same users), so ENCODE will likely produce data on both sides of this compromise between better-defined samples and more-physiologic samples. However, the cell fate/cell state space is enormous.
The difficulty of identifying functional connections between regulatory regions and genes, which was recognized before ENCODE, has become more apparent with the wealth of new data from a variety of sources. ENCODE is continuing data collection on physical interactions between candidate regulatory elements, as a high-throughput surrogate measurement of functional interactions. ENCODE plans to characterize at least 500 transcription factors, a significant increase over the current total of about 100; again, considering that there are on the order of 2000 transcription factors, and that they appear to bind different motif instances in different cell states and fates, the space is enormous. Given the growing recognition of the importance of regulation at the RNA level, ENCODE is expanding measurements of RNA binding proteins from pilot scale to production scale, with a goal of characterizing about 200 RNA binding proteins. These efforts will consider proteins that bind mRNA, and regulatory proteins that likely use noncoding RNA as an adapter to bind to targets.
ENCODE is working with our partners in other epigenomics projects to increase the utility of the data from all the projects. We are working together to coordinate metadata and ontologies, for example, in an attempt to make it easier to use the data from different projects in the same way, or to bring together data from different projects. Finally, ENCODE has relaxed the data use policy for external users. As soon as ENCODE data are publicly released, they may be used in publications and presentations. Previously there had been a moratorium on the global use of newly released data sets (for either 9 months or until the first consortium publication). However, this created some confusion over which data were subject to which restrictions; we have eliminated this moratorium to simplify this aspect of using ENCODE data. We continue to ask that users cite ENCODE, as would be the case for any other scientific source used.
Acknowledgments
Thanks to these individuals for their helpful comments on draft versions of this manuscript: Brad Bernstein, Guillaume Bourque, Elise Feingold, Tom Gingeras, Eurie Hong, Robert Klein, Jonathan Pritchard, Bing Ren, Brid Ryan, John Satterlee, Andrea Wurster, and an anonymous reviewer. Thanks to all the members of the ENCODE Consortium for their dedication and enthusiasm. Finally, thanks to my NHGRI ENCODE colleagues Elise Feingold, who launched ENCODE and leads the project, Peter Good, who has been with ENCODE from the start, and Dan Gilchrist.
References
- Allfrey VG, Faulkner R, Mirsky AE. Acetylation and Methylation of Histones and Their Possible Role in the Regulation of RNA Synthesis. Proc Natl Acad Sci U S A. 1964;51:786–794. doi: 10.1073/pnas.51.5.786.
- Allis CD, Jenuwein T, Reinberg D, Caparros ML. Epigenetics. Cold Spring Harbor Laboratory Press; 2007.
- Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013;23(1):169–180. doi: 10.1101/gr.139618.112.
- Bedell MA, Brannan CI, Evans EP, Copeland NG, Jenkins NA, Donovan PJ. DNA rearrangements located over 100 kb 5′ of the Steel (Sl)-coding region in Steel-panda and Steel-contrasted mice deregulate Sl expression and cause female sterility by disrupting ovarian follicle development. Genes Dev. 1995;9(4):455–470. doi: 10.1101/gad.9.4.455.
- Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, Gilad Y, Pritchard JK. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome biology. 2011a;12(1):R10. doi: 10.1186/gb-2011-12-1-r10.
- Bell O, Tiwari VK, Thoma NH, Schubeler D. Determinants and dynamics of genome accessibility. Nature reviews Genetics. 2011b;12(8):554–564. doi: 10.1038/nrg3017.
- Biggin MD. Animal transcription networks as highly connected, quantitative continua. Developmental cell. 2011;21(4):611–626. doi: 10.1016/j.devcel.2011.09.008.
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447(7146):799–816. doi: 10.1038/nature05874.
- Borneman AR, Gianoulis TA, Zhang ZD, Yu H, Rozowsky J, Seringhaus MR, Wang LY, Gerstein M, Snyder M. Divergence of transcription factor binding sites across related yeast species. Science. 2007;317(5839):815–819. doi: 10.1126/science.1140748.
- Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, Karczewski KJ, Park J, Hitz BC, Weng S, et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome research. 2012;22(9):1790–1797. doi: 10.1101/gr.137323.112.
- Campos EI, Reinberg D. Histones: annotating chromatin. Annual review of genetics. 2009;43:559–599. doi: 10.1146/annurev.genet.032608.103928.
- Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, et al. Unlocking the secrets of the genome. Nature. 2009;459(7249):927–930. doi: 10.1038/459927a.
- Consortium TEP. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306(5696):636–640. doi: 10.1126/science.1105136.
- Consortium TEP. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011;9(4):e1001046. doi: 10.1371/journal.pbio.1001046.
- Consortium TEP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi: 10.1038/nature11247.
- Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA, Frampton GM, Sharp PA, et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci U S A. 2010;107(50):21931–21936. doi: 10.1073/pnas.1016071107.
- Davison LJ, Wallace C, Cooper JD, Cope NF, Wilson NK, Smyth DJ, Howson JM, Saleh N, Al-Jeffery A, Angus KL, et al. Long-range DNA looping and gene expression analyses identify DEXI as an autoimmune disease candidate gene. Human molecular genetics. 2012;21(2):322–333. doi: 10.1093/hmg/ddr468.
- Degner JF, Pai AA, Pique-Regi R, Veyrieras JB, Gaffney DJ, Pickrell JK, De Leon S, Michelini K, Lewellen N, Crawford GE, et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature. 2012;482(7385):390–394. doi: 10.1038/nature10808.
- Diaz P, Cado D, Winoto A. A locus control region in the T cell receptor alpha/delta locus. Immunity. 1994;1(3):207–217. doi: 10.1016/1074-7613(94)90099-x.
- Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–108. doi: 10.1038/nature11233.
- Edwards SL, Beesley J, French JD, Dunning AM. Beyond GWASs: illuminating the dark road from association to function. American journal of human genetics. 2013;93(5):779–797. doi: 10.1016/j.ajhg.2013.10.012.
- Ernst J, Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol. 2010;28(8):817–825. doi: 10.1038/nbt.1662.
- Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473(7345):43–49. doi: 10.1038/nature09906.
- Festenstein R, Tolaini M, Corbella P, Mamalaki C, Parrington J, Fox M, Miliou A, Jones M, Kioussis D. Locus control region function and heterochromatin-induced position effect variegation. Science. 1996;271(5252):1123–1125. doi: 10.1126/science.271.5252.1123.
- Fisher WW, Li JJ, Hammonds AS, Brown JB, Pfeiffer BD, Weiszmann R, MacArthur S, Thomas S, Stamatoyannopoulos JA, Eisen MB, et al. DNA regions bound at low occupancy by transcription factors do not drive patterned reporter gene expression in Drosophila. Proceedings of the National Academy of Sciences of the United States of America. 2012;109(52):21330–21335. doi: 10.1073/pnas.1209589110.
- Forrester WC, Takegawa S, Papayannopoulou T, Stamatoyannopoulos G, Groudine M. Evidence for a locus activation region: the formation of developmentally stable hypersensitive sites in globin-expressing hybrids. Nucleic acids research. 1987;15(24):10159–10177. doi: 10.1093/nar/15.24.10159.
- Frank J, Pignata C, Panteleyev AA, Prowse DM, Baden H, Weiner L, Gaetaniello L, Ahmad W, Pozzi N, Cserhalmi-Friedman PB, et al. Exposing the human nude phenotype. Nature. 1999;398(6727):473–474. doi: 10.1038/18997.
- Fraser HB. Gene expression drives local adaptation in humans. Genome research. 2013;23(7):1089–1096. doi: 10.1101/gr.152710.112.
- Fraser P, Pruzina S, Antoniou M, Grosveld F. Each hypersensitive site of the human beta-globin locus control region confers a different developmental pattern of expression on the globin genes. Genes & development. 1993;7(1):106–113. doi: 10.1101/gad.7.1.106.
- Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489(7414):91–100. doi: 10.1038/nature11245.
- Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY, Robilotto R, Rechtsteiner A, Ikegami K, et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science. 2010;330(6012):1775–1787. doi: 10.1126/science.1196914.
- Gilmour DS, Lis JT. Detecting protein-DNA interactions in vivo: distribution of RNA polymerase on specific bacterial genes. Proc Natl Acad Sci U S A. 1984;81(14):4275–4279. doi: 10.1073/pnas.81.14.4275.
- Graveley BR, Brooks AN, Carlson JW, Duff MO, Landolin JM, Yang L, Artieri CG, van Baren MJ, Boley N, Booth BW, et al. The developmental transcriptome of Drosophila melanogaster. Nature. 2011;471(7339):473–479. doi: 10.1038/nature09715.
- Gross DS, Garrard WT. Nuclease hypersensitive sites in chromatin. Annu Rev Biochem. 1988;57:159–197. doi: 10.1146/annurev.bi.57.070188.001111.
- Grossman SR, Andersen KG, Shlyakhter I, Tabrizi S, Winnicki S, Yen A, Park DJ, Griesemer D, Karlsson EK, Wong SH, et al. Identifying recent adaptations in large-scale genomic data. Cell. 2013;152(4):703–713. doi: 10.1016/j.cell.2013.01.035.
- Hardison RC. Genome-wide epigenetic data facilitate understanding of disease susceptibility association studies. The Journal of biological chemistry. 2012;287(37):30932–30940. doi: 10.1074/jbc.R112.352427.
- Hardison RC, Taylor J. Genomic approaches towards finding cis-regulatory modules in animals. Nat Rev Genet. 2012;13(7):469–483. doi: 10.1038/nrg3242.
- Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760–1774. doi: 10.1101/gr.135350.111.
- Hebbes TR, Clayton AL, Thorne AW, Crane-Robinson C. Core histone hyperacetylation co-maps with generalized DNase I sensitivity in the chicken beta-globin chromosomal domain. The EMBO journal. 1994;13(8):1823–1830. doi: 10.1002/j.1460-2075.1994.tb06451.x.
- Hebbes TR, Thorne AW, Crane-Robinson C. A direct link between core histone acetylation and transcriptionally active chromatin. The EMBO journal. 1988;7(5):1395–1402. doi: 10.1002/j.1460-2075.1988.tb02956.x.
- Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, Ye Z, Lee LK, Stuart RK, Ching CW, et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature. 2009;459(7243):108–112. doi: 10.1038/nature07829.
- Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet. 2007;39(3):311–318. doi: 10.1038/ng1966.
- Henikoff S, Shilatifard A. Histone modification: cause or cog? Trends in genetics: TIG. 2011;27(10):389–396. doi: 10.1016/j.tig.2011.06.006.
- Herdewyn S, Zhao H, Moisse M, Race V, Matthijs G, Reumers J, Kusters B, Schelhaas HJ, van den Berg LH, Goris A, et al. Whole-genome sequencing reveals a coding non-pathogenic variant tagging a non-coding pathogenic hexanucleotide repeat expansion in C9orf72 as cause of amyotrophic lateral sclerosis. Human molecular genetics. 2012;21(11):2412–2419. doi: 10.1093/hmg/dds055.
- Hu X, Kim H, Stahl E, Plenge R, Daly M, Raychaudhuri S. Integrating autoimmune risk loci with gene-expression data identifies specific pathogenic immune cell subsets. American journal of human genetics. 2011;89(4):496–506. doi: 10.1016/j.ajhg.2011.09.002.
- Huff JT, Plocik AM, Guthrie C, Yamamoto KR. Reciprocal intronic and exonic histone modification regions in humans. Nat Struct Mol Biol. 2010;17(12):1495–1499. doi: 10.1038/nsmb.1924.
- Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature. 2001;409(6819):533–538. doi: 10.1038/35054095.
- Johnson KD, Bresnick EH. Dissecting long-range transcriptional mechanisms by chromatin immunoprecipitation. Methods. 2002;26(1):27–36. doi: 10.1016/S1046-2023(02)00005-1.
- Jones FC, Grabherr MG, Chan YF, Russell P, Mauceli E, Johnson J, Swofford R, Pirun M, Zody MC, White S, et al. The genomic basis of adaptive evolution in threespine sticklebacks. Nature. 2012;484(7392):55–61. doi: 10.1038/nature10944.
- Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermuller J, Hofacker IL, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007;316(5830):1484–1488. doi: 10.1126/science.1138341.
- Kharchenko PV, Alekseyenko AA, Schwartz YB, Minoda A, Riddle NC, Ernst J, Sabo PJ, Larschan E, Gorchakov AA, Gu T, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011;471(7339):480–485. doi: 10.1038/nature09725.
- Kheradpour P, Kellis M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic acids research. 2013. doi: 10.1093/nar/gkt1249.
- Kleinjan DA, Lettice LA. Long-range gene control and genetic disease. Adv Genet. 2008;61:339–388. doi: 10.1016/S0065-2660(07)00013-2.
- Lee GR, Fields PE, Griffin TJ, Flavell RA. Regulation of the Th2 cytokine locus by a locus control region. Immunity. 2003;19(1):145–153. doi: 10.1016/s1074-7613(03)00179-1.
- Lettice LA, Horikoshi T, Heaney SJ, van Baren MJ, van der Linde HC, Breedveld GJ, Joosse M, Akarsu N, Oostra BA, Endo N, et al. Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly. Proceedings of the National Academy of Sciences of the United States of America. 2002;99(11):7548–7553. doi: 10.1073/pnas.112212199.
- Levinger L, Varshavsky A. High-resolution fractionation of nucleosomes: minor particles, “whiskers,” and separation of mononucleosomes containing and lacking A24 semihistone. Proc Natl Acad Sci U S A. 1980;77(6):3244–3248. doi: 10.1073/pnas.77.6.3244.
- Li CC, Ramirez-Carrozzi VR, Smale ST. Pursuing gene regulation ‘logic’ via RNA interference and chromatin immunoprecipitation. Nat Immunol. 2006;7(7):692–697. doi: 10.1038/ni0706-692.
- Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y, Lim J, Zhang J, et al. Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell. 2012;148(1–2):84–98. doi: 10.1016/j.cell.2011.12.014.
- Lin R, Leone JW, Cook RG, Allis CD. Antibodies specific to acetylated histones document the existence of deposition- and transcription-related histone acetylation in Tetrahymena. The Journal of cell biology. 1989;108(5):1577–1588. doi: 10.1083/jcb.108.5.1577.
- Litt MD, Simpson M, Gaszner M, Allis CD, Felsenfeld G. Correlation between histone lysine methylation and developmental changes at the chicken beta-globin locus. Science. 2001;293(5539):2453–2455. doi: 10.1126/science.1064413.
- Livyatan I, Harikumar A, Nissim-Rafinia M, Duttagupta R, Gingeras TR, Meshorer E. Non-polyadenylated transcription in embryonic stem cells reveals novel non-coding RNA related to pluripotency and differentiation. Nucleic acids research. 2013;41(12):6300–6315. doi: 10.1093/nar/gkt316.
- Loots GG, Kneissel M, Keller H, Baptist M, Chang J, Collette NM, Ovcharenko D, Plajzer-Frick I, Rubin EM. Genomic deletion of a long-range bone enhancer misregulates sclerostin in Van Buchem disease. Genome research. 2005;15(7):928–935. doi: 10.1101/gr.3437105.
- MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335(6070):823–828. doi: 10.1126/science.1215040.
- Maston GA, Landt SG, Snyder M, Green MR. Characterization of enhancer function from genome-wide analyses. Annual review of genomics and human genetics. 2012;13:29–57. doi: 10.1146/annurev-genom-090711-163723.
- Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337(6099):1190–1195. doi: 10.1126/science.1222794.
- Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic acids research. 2013;41(Database issue):D64–69. doi: 10.1093/nar/gks1048.
- Mohrs M, Blankespoor CM, Wang ZE, Loots GG, Afzal V, Hadeiba H, Shinkai K, Rubin EM, Locksley RM. Deletion of a coordinate regulator of type 2 cytokine expression in mice. Nat Immunol. 2001;2(9):842–847. doi: 10.1038/ni0901-842.
- Mortlock DP, Pregizer S. Identifying Functional Annotation for Noncoding Genomic Sequences. Current Protocols in Human Genetics. 2012. doi: 10.1002/0471142905.hg0110s72.
- Nardone J, Lee DU, Ansel KM, Rao A. Bioinformatics for the ‘bench biologist’: how to find regulatory regions in genomic DNA. Nat Immunol. 2004;5(8):768–774. doi: 10.1038/ni0804-768.
- Negre N, Brown CD, Ma L, Bristow CA, Miller SW, Wagner U, Kheradpour P, Eaton ML, Loriaux P, Sealfon R, et al. A cis-regulatory map of the Drosophila genome. Nature. 2011;471(7339):527–531. doi: 10.1038/nature09990.
- Nehls M, Kyewski B, Messerle M, Waldschutz R, Schuddekopf K, Smith AJ, Boehm T. Two genetically separable steps in the differentiation of thymic epithelium. Science. 1996;272(5263):886–889. doi: 10.1126/science.272.5263.886.
- Neph S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E, Stamatoyannopoulos JA. Circuitry and dynamics of human transcription factor regulatory networks. Cell. 2012a;150(6):1274–1286. doi: 10.1016/j.cell.2012.04.040.
- Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, Thurman RE, John S, Sandstrom R, Johnson AK, et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012b;489(7414):83–90. doi: 10.1038/nature11212.
- Ng JW, Barrett LM, Wong A, Kuh D, Smith GD, Relton CL. The role of longitudinal cohort studies in epigenetic epidemiology: challenges and opportunities. Genome biology. 2012;13(6):246. doi: 10.1186/gb-2012-13-6-246.
- Noonan JP, McCallion AS. Genomics of long-range regulatory elements. Annual review of genomics and human genetics. 2010;11:1–23. doi: 10.1146/annurev-genom-082509-141651.
- Odom DT, Dowell RD, Jacobsen ES, Gordon W, Danford TW, MacIsaac KD, Rolfe PA, Conboy CM, Gifford DK, Fraenkel E. Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nature genetics. 2007;39(6):730–732. doi: 10.1038/ng2047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ozsolak F, Poling LL, Wang Z, Liu H, Liu XS, Roeder RG, Zhang X, Song JS, Fisher DE. Chromatin structure analyses identify miRNA promoters. Genes Dev. 2008;22(22):3172–3183. doi: 10.1101/gad.1706508. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pazin MJ, Kamakaka RT, Kadonaga JT. ATP-dependent nucleosome reconfiguration and transcriptional activation from preassembled chromatin templates. Science. 1994;266(5193):2007–2011. doi: 10.1126/science.7801129. [DOI] [PubMed] [Google Scholar]
- Pazin MJ, Sheridan PL, Cannon K, Cao Z, Keck JG, Kadonaga JT, Jones KA. NF-kappa B-mediated chromatin reconfiguration and transcriptional activation of the HIV-1 enhancer in vitro. Genes Dev. 1996;10(1):37–49. doi: 10.1101/gad.10.1.37. [DOI] [PubMed] [Google Scholar]
- Penagarikano O, Mulle JG, Warren ST. The pathophysiology of fragile x syndrome. Annual review of genomics and human genetics. 2007;8:109–129. doi: 10.1146/annurev.genom.8.080706.092249. [DOI] [PubMed] [Google Scholar]
- Pickrell JK. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. 2013 doi: 10.1016/j.ajhg.2014.03.004. bioRxiv. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ponting CP, Hardison RC. What fraction of the human genome is functional? Genome Res. 2011;21(11):1769–1776. doi: 10.1101/gr.116814.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rada-Iglesias A, Bajpai R, Swigut T, Brugmann SA, Flynn RA, Wysocka J. A unique chromatin signature uncovers early developmental enhancers in humans. Nature. 2011;470(7333):279–283. doi: 10.1038/nature09692. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide association studies for common human diseases. Nature reviews Genetics. 2011;12(8):529–541. doi: 10.1038/nrg3000. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. Genome-wide location and function of DNA binding proteins. Science. 2000;290(5500):2306–2309. doi: 10.1126/science.290.5500.2306. [DOI] [PubMed] [Google Scholar]
- Rivera CM, Ren B. Mapping human epigenomes. Cell. 2013;155(1):39–55. doi: 10.1016/j.cell.2013.09.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, Lin MF, et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330(6012):1787–1797. doi: 10.1126/science.1198374. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sanyal A, Lajoie BR, Jain G, Dekker J. The long-range interaction landscape of gene promoters. Nature. 2012;489(7414):109–113. doi: 10.1038/nature11279. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sawada S, Scarborough JD, Killeen N, Littman DR. A lineage-specific transcriptional silencer regulates CD4 gene expression during T lymphocyte development. Cell. 1994;77(6):917–929. doi: 10.1016/0092-8674(94)90140-6. [DOI] [PubMed] [Google Scholar]
- Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking disease associations with regulatory information in the human genome. Genome research. 2012;22(9):1748–1759. doi: 10.1101/gr.136127.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, et al. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010;328(5981):1036–1040. doi: 10.1126/science.1186176.
- Sheffield NC, Thurman RE, Song L, Safi A, Stamatoyannopoulos JA, Lenhard B, Crawford GE, Furey TS. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. Genome Res. 2013;23(5):777–788. doi: 10.1101/gr.152140.112.
- Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A. 2003;100(26):15776–15781. doi: 10.1073/pnas.2136655100.
- Siu G, Wurster AL, Duncan DD, Soliman TM, Hedrick SM. A transcriptional silencer controls the developmental expression of the CD4 gene. EMBO J. 1994;13(15):3570–3579. doi: 10.1002/j.1460-2075.1994.tb06664.x.
- Solomon MJ, Larsen PL, Varshavsky A. Mapping protein-DNA interactions in vivo with formaldehyde: evidence that histone H4 is retained on a highly transcribed gene. Cell. 1988;53(6):937–947. doi: 10.1016/s0092-8674(88)90469-2.
- Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Scholer A, van Nimwegen E, Wirbelauer C, Oakeley EJ, Gaidatzis D, et al. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature. 2011;480(7378):490–495. doi: 10.1038/nature10716.
- Stamatoyannopoulos JA, Snyder M, Hardison R, Ren B, Gingeras T, Gilbert DM, Groudine M, Bender M, Kaul R, Canfield T, et al. An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome Biol. 2012;13(8):418. doi: 10.1186/gb-2012-13-8-418.
- Stergachis AB, Neph S, Reynolds A, Humbert R, Miller B, Paige SL, Vernot B, Cheng JB, Thurman RE, Sandstrom R, et al. Developmental fate and cellular maturity encoded in human regulatory DNA landscapes. Cell. 2013;154(4):888–903. doi: 10.1016/j.cell.2013.07.020.
- Suganuma T, Workman JL. Signals and combinatorial functions of histone modifications. Annual review of biochemistry. 2011;80:473–499. doi: 10.1146/annurev-biochem-061809-175347.
- Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489(7414):75–82. doi: 10.1038/nature11232.
- Tilgner H, Knowles DG, Johnson R, Davis CA, Chakrabortty S, Djebali S, Curado J, Snyder M, Gingeras TR, Guigo R. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Res. 2012;22(9):1616–1625. doi: 10.1101/gr.134445.111.
- Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. Convergent adaptation of human lactase persistence in Africa and Europe. Nature genetics. 2007;39(1):31–40. doi: 10.1038/ng1946.
- Trynka G, Sandor C, Han B, Xu H, Stranger BE, Liu XS, Raychaudhuri S. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nature genetics. 2013;45(2):124–130. doi: 10.1038/ng.2504.
- Turner BM. Acetylation and deacetylation of histone H4 continue through metaphase with depletion of more-acetylated isoforms and altered site usage. Experimental cell research. 1989;182(1):206–214. doi: 10.1016/0014-4827(89)90292-9.
- Varley KE, Gertz J, Bowling KM, Parker SL, Reddy TE, Pauli-Behn F, Cross MK, Williams BA, Stamatoyannopoulos JA, Crawford GE, et al. Dynamic DNA methylation across diverse human cell lines and tissues. Genome Res. 2013;23(3):555–567. doi: 10.1101/gr.147942.112.
- Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009;457(7231):854–858. doi: 10.1038/nature07730.
- Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012;22(9):1798–1812. doi: 10.1101/gr.139105.112.
- Ward LD, Kellis M. Evidence of abundant purifying selection in humans for recently acquired regulatory functions. Science. 2012a;337(6102):1675–1678. doi: 10.1126/science.1225057.
- Ward LD, Kellis M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res. 2012b;40(Database issue):D930–934. doi: 10.1093/nar/gkr917.
- Ward LD, Kellis M. Interpreting noncoding genetic variation in complex traits and human disease. Nature biotechnology. 2012c;30(11):1095–1106. doi: 10.1038/nbt.2422.
- Weinmann AS, Farnham PJ. Identification of unknown target genes of human transcription factors using chromatin immunoprecipitation. Methods. 2002;26(1):37–47. doi: 10.1016/S1046-2023(02)00006-3.
- Weintraub H, Groudine M. Chromosomal subunits in active genes have an altered conformation. Science. 1976;193(4256):848–856. doi: 10.1126/science.948749.
- Wu C. The 5′ ends of Drosophila heat shock genes in chromatin are hypersensitive to DNase I. Nature. 1980;286(5776):854–860. doi: 10.1038/286854a0.
- Yip KY, Cheng C, Bhardwaj N, Brown JB, Leng J, Kundaje A, Rozowsky J, Birney E, Bickel P, Snyder M, et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 2012;13(9):R48. doi: 10.1186/gb-2012-13-9-r48.





