Abstract
Members of the GATA family of transcription factors play key roles in the differentiation of specific cell lineages by regulating the expression of target genes. Three GATA factors play distinct roles in hematopoietic differentiation. In order to better understand how these GATA factors function to regulate genes throughout the genome, we are studying the epigenomic and transcriptional landscapes of hematopoietic cells in a model‐driven, integrative fashion. We have formed the collaborative multi‐lab VISION project to conduct ValIdated Systematic IntegratiON of epigenomic data in mouse and human hematopoiesis. The epigenomic data included nuclease accessibility in chromatin, CTCF occupancy, and histone H3 modifications for 20 cell types covering hematopoietic stem cells, multilineage progenitor cells, and mature cells across the blood cell lineages of mouse. The analysis used the Integrative and Discriminative Epigenome Annotation System (IDEAS), which learns all common combinations of features (epigenetic states) simultaneously in two dimensions—along chromosomes and across cell types. The result is a segmentation that effectively paints the regulatory landscape in readily interpretable views, revealing constitutively active or silent loci as well as the loci specifically induced or repressed in each stage and lineage. Nuclease accessible DNA segments in active chromatin states were designated candidate cis‐regulatory elements in each cell type, providing one of the most comprehensive registries of candidate hematopoietic regulatory elements to date. Applications of VISION resources are illustrated for the regulation of genes encoding GATA1, GATA2, GATA3, and Ikaros. VISION resources are freely available from our website http://usevision.org.
Keywords: epigenomes, erythropoiesis, gene regulation, genome segmentation, hematopoiesis, integrative analysis, regulatory elements
Abbreviations
- ATAC‐seq
Assay for Transposase‐Accessible Chromatin using sequencing
- B
lymphoid B cells
- cCRE
candidate cis‐regulatory element
- CFUE
colony‐forming units erythroid
- CFUMk
colony‐forming units megakaryocytic
- ChIP‐seq
Chromatin ImmunoPrecipitation assayed using sequencing
- CLP
common lymphoid progenitor cell population
- CMP
common myeloid progenitor cell population
- DNase‐seq
DNase accessible chromatin assayed using sequencing
- ENCODE
Encyclopedia of DNA Elements
- ERY
erythroblasts
- G1E and ER4
cell lines that serve as a model for GATA1‐dependent erythroid maturation
- GMP
granulocyte monocyte progenitor cell population
- GTEx
Genotype and Tissue Expression project
- GWAS
Genome Wide Association Study
- H3K4me1
Histone H3 monomethylated on lysine 4, associated with enhancers
- H3K4me3
Histone H3 trimethylated on lysine 4, associated with promoters
- H3K9me3
Histone H3 trimethylated on lysine 9, associated with heterochromatin
- H3K27ac
Histone H3 acetylated on lysine 27, associated with active regulatory elements
- H3K27me3
Histone H3 trimethylated on lysine 27, associated with transcriptional repression
- H3K36me3
Histone H3 trimethylated on lysine 36, associated with transcriptional elongation
- Hi‐C
genome‐wide assay for chromosome conformation capture
- HPC7
cell line model of a multipotent myeloid progenitor cell
- IDEAS
Integrative and Discriminative Epigenome Annotation System
- IHEC
International Human Epigenome Consortium
- iMk
immature megakaryocytic cell population
- LSK
lineage minus, Sca1+, Kit+ cell population (includes hematopoietic stem cells and early multilineage progenitor cells)
- MEP
megakaryocytic erythroid progenitor cell population
- Mk
megakaryocytes
- MON
monocytes
- NEU
neutrophils
- NK
natural killer cells
- PLT
platelets
- RBC
red blood cells
- T CD4
CD4+ T‐cells
- T CD8
CD8+ T‐cells
- VISION
ValIdated Systematic IntegratiON of epigenomic data
1. INTRODUCTION
A person's genetic profile can have a significant impact on complex traits such as disease susceptibility and response to specific treatments. Genome‐wide association studies (GWASs) have mapped loci at which a common genetic variation is associated with complex traits, but the mechanistic connection between genotype and phenotype is rarely understood. This is because most trait‐associated genetic variants are not in the 1–2% of the genome that encodes mRNA, but rather in a much larger noncoding genome.1 Although no DNA‐based grammar has been developed yet to interpret these noncoding variants,2 the fact that they are highly enriched in chromatin with epigenetic features associated with gene regulatory elements offers new avenues to understanding their impact on phenotypes.3, 4, 5 Efforts to harvest GWAS results for potential medical application have led to the concept of precision medicine, in which a person's genotype is used to improve lifestyle choices and develop therapeutic interventions specifically for that person.6 However, precision medicine requires more than genotypes and associations. Precision medicine needs a thorough understanding of the epigenome to interpret the large majority of trait‐associated genetic variants that lie outside coding regions.
The problem we address is how to utilize the enormous amounts of emerging epigenetic data effectively both for basic research and precision medicine. Powered by advances in sequencing technologies, biochemical reagents, and bioinformatic analyses, many laboratories and large consortia (such as ENCODE,4 Roadmap Epigenome Project,7 GTEx,8 BluePrint,9, 10 and IHEC11) are determining transcriptome profiles and producing genome‐wide views of the regulatory landscape.12 At this point, data acquisition may no longer be the major barrier to understanding the mechanisms of gene regulation during normal and pathological development. In fact, the volume of data produced is already overwhelming for most investigators. We seek to understand how epigenetic features regulate differentiation and how that regulation is altered in disease. Major challenges include integrating epigenetic data in terms that are accessible and understandable to a broad community of researchers, building validated quantitative models that explain how changes in epigenetic features affect the dynamics of gene expression across differentiation, and translating the information effectively from mouse models to potential applications in human health.
Consider any genetic locus implicated in development, differentiation, behavior, or disease. Investigators may want to study the regulation of expression of gene(s) in that locus, for example, to understand how genetic variants could affect its expression. This investigation could be greatly facilitated by abundant genome‐wide data sets on multiple epigenetic features. Currently, to utilize such information, an investigator would examine epigenetic data around this locus in web‐based genome browsers and databases. These resources are useful, but they do not cover all the relevant aspects of chromatin structure, dynamics, and expression. After finding the available data, the investigator will need to analyze the results to predict candidate cis‐regulatory elements (cCREs), including enhancers, silencers, or insulators. While progress continues to be made in predicting cCREs, issues of completeness (how sensitive are the cCREs for discovering true regulatory elements?) and specificity (how likely is it that the cCREs are true regulatory elements?) are actively debated. Developing more useful collections or registries of high‐quality cCREs is a major current need in functional genomics.
We have formed an interdisciplinary collaborative team to address these needs via ValIdated Systematic IntegratiON of epigenomic data (VISION) to analyze and interpret molecular mechanisms regulating hematopoiesis in mouse and human. We are consolidating hundreds of epigenomic data sets and applying integrative approaches to generate robust candidate functional assignments to DNA segments. These assignments, coupled with gene target predictions and results of genome editing experiments, are the input to machine‐learning approaches that generate quantitative models for how each candidate CRE contributes to the regulation of its target gene. Importantly, these models will be rigorously tested and validated by targeted genome editing in reference loci and then applied genome wide. Furthermore, we are developing resources to enable more accurate translation of regulatory insights between mouse and human. The results from our project will inform investigators about candidate CREs and their predicted roles in regulating their loci of interest, thus enabling them to design model‐driven experiments to deepen their understanding of the investigated process.
In this concise review, we focus on our efforts to integrate the large amount of genome‐wide information on epigenetic features and transcriptomes in a systematic manner to assign chromatin states across hematopoietic cells and predict cCREs. These resources are illustrated with respect to the GATA factors, both the genes encoding them and the binding patterns of the proteins in erythroid and lymphoid T cells. A further examination of the Ikzf1 gene encoding the Ikaros transcription factor illustrates the power of our integrative approaches to deduce data‐driven hypotheses about differential regulation of gene expression in hematopoiesis.
2. COMPILE AND DETERMINE EPIGENETIC FEATURES AND TRANSCRIPT LEVELS ACROSS HEMATOPOIETIC DIFFERENTIATION
Over the past decade, the amount of information about gene expression levels and epigenetic regulatory landscapes in mammalian hematopoietic cells has increased exponentially, both through the work of individual laboratories13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27 as well as the work of major consortia such as ENCODE and Blueprint. These data currently are provided in differing formats from diverse resources, with no common data processing or analysis, for example, to find significant peaks of signals. Our first step in the VISION project was to compile the data sets, process the data in a consistent manner, and provide the data in a manner enabling investigators to find all relevant information.
Building on resources independently developed in laboratories within the VISION project, we have established a distributed data network to enhance accessibility and develop a unified interface for users. The CODEX resource, developed by the Gottgens group, maintains a compendium of next‐generation sequencing data sets pertaining to transcriptional programs of mouse and human blood development.28 The compendium currently contains over 1,700 publicly available data sets, all uniformly processed to facilitate comparisons across data sets. CODEX contains ChIP‐seq, DNase‐seq, and RNA‐seq data sets, which are available as signal tracks, mapped sequence files, peak calls, and transcript levels for the RNA‐seq. The CODEX website also provides a number of analysis tools including correlation analysis, sequence motif discovery, analysis of overrepresented gene sets, and comparisons between mouse and human. The SBR‐Blood resource, developed by the Bodine lab, has compiled expression data, ChIP‐seq, and Methyl‐seq data for mouse and human hematopoietic cells (990 data sets), including normalizations across disparate data sets.29 Both of these resources feed into the VISION project, which provides raw and normalized data sets selected to cover specific groups of features in mouse and human hematopoiesis, segmentations by integrative modeling (see below), and catalogs of cCREs, among other resources, on the website http://usevision.org. This website includes a link to a genome browser with epigenetic and expression data sets during hematopoiesis as well as the 3D Genome Browser developed by the Yue lab.30 In addition to the effort to compile and analyze existing data, new data are being generated both within the VISION project and in other laboratories that expand the coverage of epigenetic features across cell types and bring in data sets on new transcription factors or co‐factors.
Our initial efforts were in mouse hematopoiesis because of the large number of epigenomic and transcriptomic data sets that were available in both primary maturing cells (exemplary references at the beginning of this section) and in the multilineage progenitors to blood cells.31 In addition, epigenomic data were included from selected cell lines that have been used extensively as models for multilineage myeloid cells (HPC7 cells32) and for GATA1‐dependent erythroid maturation (G1E and G1E‐ER4 cells33). The cell populations investigated have traditionally been viewed in a simple hierarchy (Figure 1a). Recent studies, especially of single cell transcriptomes, have revealed much greater complexity along with additional intermediate cells.34 However, the simple hierarchy used here serves as a useful organizing structure for considering relationships among the interrogated cell types. For assignments to chromatin states, we focused on nuclease accessibility of chromatin, as determined by DNase‐seq35 or ATAC‐seq,36 binding by the structural protein CTCF, and posttranslational modifications of histone H3 N‐terminal tails37 associated with enhancers (H3K4me1), promoters (H3K4me3), active enhancers and promoters (H3K27ac), transcriptional elongation (H3K36me3), polycomb repression (H3K27me3), or heterochromatic repression (H3K9me3). For some cell types, all these features were determined (Figure 1b). Notably, the remaining cell types were missing data on multiple features. We stress that this problem with missing data is not unique to our work, but rather it is commonly seen in all large‐scale analyses including Roadmap, ENCODE, and Blueprint. We suggest that the approach developed in VISION (see later) will be broadly useful in any setting with missing data. Estimates of transcription levels were available from RNA‐seq in all the investigated cell types.
3. SYSTEMATICALLY LEARN AND ASSIGN EPIGENETIC STATES ACROSS CELL TYPES
The large numbers of interrelated epigenetic data sets described above present immense opportunities for understanding differential gene regulation if these data can be integrated into robust annotation of likely functional DNA. A key challenge is to build quantitative models explaining how the dynamics of epigenomes across many cell types lead to gene expression changes and phenotypic diversity.4, 38 A current approach for describing epigenetic landscapes is genome segmentation,39, 40 which assigns states to genomic segments exhibiting unique patterns of chromatin marks. Existing genome segmentation tools39, 40, 41, 42 were developed primarily for segmenting the epigenomes of single cell types. Although genomes from different cell types may be concatenated together, such an approach ignores the position‐specific epigenetic events conserved across related cell types. We have used the IDEAS method (Integrative and Discriminative Epigenome Annotation System) for two‐dimensional segmentation along chromosomes and across cell types because of its improved accuracy and consistency in assigning epigenetic states.43, 44 This method uses a Bayesian model to approximate quantitative data distributions without signal binarization. It also utilizes Bayesian techniques to automatically determine the best model sizes, including the number of states.
Importantly, the statistical framework for IDEAS allows it to assign likely epigenetic states to cell types based on the data distributions for signals across cell types. Thus, even if particular features have not been determined in a cell type, the system can still assign a likely state based on the known signals in other locally related cell types. This model‐based inference of states has better performance than current data imputation procedures.45 As noted above (Figure 1b), many of the cell types of interest did not have data on all features, but we were able to utilize this ability of IDEAS to produce segmentations despite missing data to generate informative segmentations across all the cell types examined.
The IDEAS segmentation method is analogous to integration by mixing. One can consider the signal track for each epigenetic feature as a signal with a distinctive color, as illustrated in Figure 2 with deep red for DNase‐seq, purple for CTCF, and so on. An intuitive way to integrate the eight tracks of information is simply by mixing, for example, by merging all the colored tracks into one. That approach does bring out some aspects of the combined features, such as CTCF and nuclease accessibility 5′ to Zfpm1 and a mix of K4 monomethylation and K36 trimethylation of H3 through the body of the gene. However, the mixing can also blend too many colors together to distinguish clear states, such as around the transcription start site (TSS) and the region around the 3′ end of Zfpm1. The systematic integration by segmentation can be thought of as a principled, objective way to find well‐defined, discrete combinations of features that occur frequently throughout the epigenomes examined (the epigenetic states). Each genomic segment is then assigned to the one state that best matches the known (or inferred) epigenetic signals in each cell type. Thus, the IDEAS track below “merge tracks” gives a principled resolution of the mixtures of the epigenetic features.
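The idea of assigning each genomic segment to the one state that best matches its signals can be illustrated with a deliberately simplified Python sketch. It is not the IDEAS algorithm, which uses a Bayesian model and learns its states from the data; here the state profiles, the four features, and the nearest‐profile rule are all assumptions chosen only to make the assignment step concrete.

```python
from math import dist

# Hypothetical state profiles (assumed, not taken from the IDEAS model): mean
# signal for (accessibility, H3K4me1, H3K4me3, H3K27me3).
STATE_PROFILES = {
    "quiescent": (0.0, 0.0, 0.0, 0.0),
    "enhancer_like": (3.0, 4.0, 0.5, 0.0),
    "promoter_like": (4.0, 1.0, 5.0, 0.0),
    "polycomb": (0.2, 0.0, 0.0, 4.0),
}


def assign_state(signal):
    """Return the state whose profile is closest (Euclidean distance) to signal."""
    return min(STATE_PROFILES, key=lambda name: dist(STATE_PROFILES[name], signal))


def segment(bins):
    """Label each bin, then merge runs of equal labels into (start, end, state)."""
    segments = []
    for index, signal in enumerate(bins):
        state = assign_state(signal)
        if segments and segments[-1][2] == state:
            # Extend the previous segment instead of opening a new one.
            segments[-1] = (segments[-1][0], index + 1, state)
        else:
            segments.append((index, index + 1, state))
    return segments


# Toy signals for four consecutive bins in one cell type.
toy_bins = [
    (0.1, 0.0, 0.0, 0.0),
    (3.2, 3.8, 0.4, 0.1),
    (2.9, 4.1, 0.6, 0.0),
    (4.1, 1.2, 4.8, 0.0),
]
toy_segments = segment(toy_bins)
```

In this toy case the two middle bins are merged into a single enhancer‐like segment, flanked by a quiescent bin and a promoter‐like bin. Unlike this sketch, the two‐dimensional IDEAS model also uses information from related cell types at the same position.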
For the eight epigenetic features across 20 mouse hematopoietic cell types, IDEAS generated a 27‐state model (Figure 3). Each state was defined by a quantitative profile of signal strengths from the features, illustrated by the heat map. Each of the eight features was assigned a color, and these were used in turn to establish automatically a color for each state based on the contribution of each feature to that state. Thus, the several promoter‐like states were colored in various shades of red, the enhancer‐like states were given yellow to orange colors, CTCF‐containing states had purple colors, transcribed states were colored in shades of green, states associated with polycomb repression were blue, and the heterochromatic state was gray. The most frequently occurring state (state 0) was a quiescent state, with very low signal for each of the eight features. Many of these combinations of features have been described previously, and the segmentation provided a systematic means to identify those combinations as states that are assigned consistently across cell types. The IDEAS states also gave a discrete set of several promoter‐associated or enhancer‐associated states, which can be further examined experimentally for functional roles.
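One plausible way to derive a state color from the contribution of each feature is a signal‐weighted average of the feature colors. The Python sketch below shows that idea only; the RGB values, the four‐feature subset, and the weighting rule are assumptions for illustration and may differ from the scheme actually used to color the IDEAS states.

```python
# Hypothetical RGB colors for a subset of features (assumed values).
FEATURE_COLORS = {
    "H3K4me3": (255, 0, 0),    # promoter mark, red
    "H3K4me1": (255, 255, 0),  # enhancer mark, yellow
    "H3K36me3": (0, 150, 0),   # elongation mark, green
    "H3K27me3": (0, 0, 255),   # polycomb mark, blue
}
WHITE = (255, 255, 255)


def state_color(profile):
    """Mix feature colors, weighting each by its mean signal in the state.

    profile maps feature name -> non-negative mean signal. A state with no
    signal at all (quiescent) is drawn in white.
    """
    total = sum(profile.values())
    if total == 0:
        return WHITE
    return tuple(
        round(sum(profile[f] * FEATURE_COLORS[f][channel] for f in profile) / total)
        for channel in range(3)
    )


promoter_like = state_color({"H3K4me3": 3.0, "H3K4me1": 1.0})
quiescent = state_color({"H3K4me3": 0.0, "H3K4me1": 0.0})
```

With these assumed inputs, a state dominated by H3K4me3 with some H3K4me1 comes out as a red shifted toward orange, and a state with no signal stays white.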
The epigenomic data were determined in many different laboratories at different times, with systematic differences in protocols, sequencing depths, and other factors that could impact the integrative analysis. Thus, consistency in data processing and appropriate normalization of the data were also key components of the IDEAS segmentation pipeline. More complete descriptions of the approaches developed and utilized are given elsewhere.46, 47 The impact of normalization is illustrated in Figure 2. The upper tracks of individual features show the raw numbers of reads mapped to DNA intervals for one of two replicates, all set to the same scale. However, after normalizing to adjust for differences in sequencing depth and in the signal‐to‐noise ratio, the H3K27me3 signal was boosted such that it drove the assignment of some DNA segments at the left end of the diagram to the polycomb repressed (blue) state.
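The sequencing‐depth part of this adjustment can be shown with a minimal Python sketch that rescales raw read counts to counts per million mapped reads. This is a generic scaling, not the normalization procedure used in VISION (described in the cited references), and it does not address the signal‐to‐noise adjustment; the function name and the toy numbers are assumptions.

```python
def depth_normalize(bin_counts, total_mapped_reads):
    """Rescale raw per-bin read counts to counts per million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return [count * 1_000_000 / total_mapped_reads for count in bin_counts]


# Two hypothetical libraries of the same feature sequenced to different depths.
shallow = depth_normalize([10, 20], total_mapped_reads=10_000_000)
deep = depth_normalize([40, 80], total_mapped_reads=40_000_000)
```

After scaling, the two toy libraries give the same values per bin, although their raw counts differ fourfold.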
The accuracy and effectiveness of the IDEAS segmentation can be evaluated by comparison to orthogonal data that provides an alternative view of the functions implied by the segmentation. Specifically, we examined the binding patterns of GATA transcription factors known to regulate gene expression in different hematopoietic lineages. The transcription factors GATA1, GATA2, and GATA3 have been strongly associated with enhancers and transcriptional switches in erythroid, myeloid progenitor cells, and lymphoid cells, respectively.48 Thus, one may expect the enhancer‐like states in the Zfpm1 gene in erythroblasts to be bound by GATA1. Indeed, the ChIP‐seq pattern for GATA1 coincided well with those states (orange in the IDEAS track, Figure 2). Moreover, these enhancer‐like states were co‐bound by the transcription factor TAL1 in erythroblasts (Figure 2). This co‐binding by GATA1 and TAL1 has been strongly associated with gene activation,17, 18, 49 and the ChIP‐seq patterns for these transcription factors lend strong support to the epigenetic state assignments. Many of these predicted enhancers of Zfpm1 were shown to increase expression from a reporter gene in transfected cells, giving further credence to the segmentation results.25, 50
By conducting the segmentation jointly across cell types as well as along chromosomes (the two‐dimensional segmentation), the IDEAS method also brings out differences between cell types. The segmentation of Zfpm1 in CD4+ T‐cells differed greatly from that in erythroblasts (Figure 2). Many of the DNA segments in the gene body that were enhancer‐like in erythroblasts were either in the quiescent (white) or transcribed (green) states in CD4+ T‐cells. The region around the TSS was assigned to promoter‐like (red) states. Notably, this same TSS region is bound by GATA3 in CD4+ T‐cells, as indicated by the ChIP‐seq signal (obtained from CODEX for data from Reference 51). Thus, expression of Zfpm1 appears to be regulated at the TSS in CD4+ T‐cells, whereas multiple internal enhancers are utilized in erythroid cells.
4. DEFINE A LARGE SET OF cCREs IN MOUSE HEMATOPOIETIC CELLS
The integrative segmentation from IDEAS allowed us to take a straightforward approach to predicting candidate cis‐regulatory elements or cCREs.47 The nuclease‐accessible DNA intervals (hypersensitive sites or HSs) in each cell type were determined by applying a peak‐calling method to the DNase‐seq and ATAC‐seq data. We then gathered all HSs from all cell types (requiring replication within a cell type if available) and merged the overlapping ones. This set of HSs was then filtered to remove any that were only in the quiescent state (0) in all cell types. The remaining set contained all HSs that were in an IDEAS state indicative of dynamic histone modifications or CTCF binding in at least one of the cell types examined. This simple two‐step method for predicting cCREs relies on the sophistication of IDEAS for assigning DNA intervals to one of the commonly occurring combinations of epigenetic features. It does not rely on any particular combination of histone modifications to predict cCREs, and it should be robust to changes in epigenetic landscape that result from switches in regulatory and expression patterns between cell types.
The initial registry of cCREs consisted of 205,019 DNA intervals in 18 hematopoietic cell types in mouse (no nuclease sensitivity data were available for 2 of the 20 cell types, Mk from fetal liver, and CLP). The absence of full knowledge of functional elements and neutral elements across the genome precludes a rigorous determination of the sensitivity and specificity of this initial cCRE registry. However, this collection does look promising in several respects. The registry captured virtually all the known erythroid cCREs, and it included a large majority (two‐thirds) of DNA segments bound by the transcription co‐activator EP300 in murine erythroleukemia cells, CH12 cells (a model for B cells), and fetal liver.47 Thus, the recall appears to be reasonable. Further experimental tests should provide insight into the precision or specificity of the cCRE predictions.
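The two steps (pool and merge the HSs, then drop those that are quiescent everywhere) can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions rather than the project's pipeline: intervals are half‐open `(chrom, start, end)` tuples, state segments are `(chrom, start, end, state)` tuples, the replication requirement is omitted, and the function names and toy data are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals; coordinates are half-open."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            previous = merged[-1]
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def states_in(interval, segments):
    """Return the set of states of all segments that overlap the interval."""
    chrom, start, end = interval
    return {state for c, s, e, state in segments if c == chrom and s < end and start < e}


def call_ccres(peaks_by_cell, segments_by_cell, quiescent=0):
    """Keep merged HSs that are in a non-quiescent state in at least one cell type."""
    pooled = [peak for peaks in peaks_by_cell.values() for peak in peaks]
    ccres = []
    for hs in merge_intervals(pooled):
        if any(states_in(hs, segs) - {quiescent} for segs in segments_by_cell.values()):
            ccres.append(hs)
    return ccres


# Toy input: state 9 stands in for any non-quiescent state (hypothetical number).
toy_peaks = {
    "ERY": [("chr1", 100, 200), ("chr1", 500, 600)],
    "T_CD4": [("chr1", 150, 250)],
}
toy_segments = {
    "ERY": [("chr1", 0, 300, 9), ("chr1", 300, 1000, 0)],
    "T_CD4": [("chr1", 0, 1000, 0)],
}
toy_ccres = call_ccres(toy_peaks, toy_segments)
```

In the toy data, the two overlapping HSs merge into one interval that is retained because it is in a non‐quiescent state in one cell type, whereas the HS at 500 to 600 is removed because it is quiescent in both.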
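The kind of recall estimate described here, the fraction of a reference set (for example, EP300‐bound segments) that overlaps at least one cCRE, can be computed with a short Python sketch. The interval representation (half‐open `(chrom, start, end)` tuples), the function names, and the toy coordinates are assumptions for illustration; they are not the data or code behind the reported two‐thirds.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_recovered(reference, ccres):
    """Fraction of reference intervals that overlap at least one cCRE."""
    if not reference:
        raise ValueError("reference set is empty")
    hits = sum(1 for ref in reference if any(overlaps(ref, c) for c in ccres))
    return hits / len(reference)


# Toy example with made-up coordinates: two of three reference peaks are recovered.
toy_reference = [("chr1", 120, 180), ("chr1", 900, 950), ("chr2", 10, 60)]
toy_ccres = [("chr1", 100, 250), ("chr2", 50, 120)]
toy_recall = fraction_recovered(toy_reference, toy_ccres)
```

The quadratic scan is adequate for a sketch; for registries of hundreds of thousands of intervals, a sorted sweep or an interval tree would be the usual choice.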
The cCREs in and around the Zfpm1 gene are shown on the bottom line of Figure 2. They include the candidate enhancer‐like regions discussed above, as expected. Many other cCREs are also present, which raises questions such as the following: (1) In what cell types does a particular cCRE appear to be active? (2) What transcription factors may be bound to a cCRE? (3) How likely is it that a cCRE is regulating a gene of interest? Continuing work in the VISION project strives to address such questions. Question 1 is addressed by the state assignments for each cCRE in each cell type, which can be downloaded or browsed at the project website. Question 2 is being addressed by utilizing the resources in CODEX to annotate cCREs with binding data from ChIP‐seq. Question 3 is being addressed by developing quantitative models to “explain” gene expression data in terms of IDEAS state assignments across cell types. Future work should bring in chromatin interaction data and additional machine learning approaches.
5. ILLUSTRATE THE VISION RESOURCES AT LOCI ENCODING GATA FACTORS AND IKAROS
As discussed above, the GATA family of transcription factors is well known for regulation of gene expression in specific cell types and lineages. Examination of the genes encoding these factors and another key regulator of gene expression in hematopoietic cells, Ikaros (IKZF1), illustrates the types of insights that investigators can glean from the integrative analyses in the VISION project. Levels of expression of the genes were estimated from the RNA‐seq data from recent publications27, 31 that were compiled at the VISION website (Figure 4a). Consistent with the lineage specificity previously reported, expression of Gata1 was most prevalent in erythroid cells and the multilineage progenitor cell populations CMP and MEP, with more modest expression in megakaryocytic cells. The Gata2 gene was expressed more highly in the multilineage progenitor cell populations with some persistence into megakaryocytic cells. High levels of Gata3 expression were found primarily in a subset of the lymphoid cells, namely, NK, CD4+, and CD8+ T‐cells. In contrast, Ikzf1 was expressed at higher levels and in a broader pattern, with expression in most hematopoietic cell types albeit lower in maturing erythroid cells.
The epigenetic landscapes summarized as states from the IDEAS model showed patterns that fit with the cell type specificity of expression, and they revealed potential regulatory elements (cCREs) involved in cell type‐specific control of expression. The IDEAS tracks around Gata1 showed active epigenetic states in MEP, erythroid, and megakaryocytic cells (Figure 4b), which also express this gene. Furthermore, this locus has six cCREs, four of which have been shown to be enhancers or promoters regulating Gata1 expression in erythroid cells.52, 53 Both known CREs and novel cCREs were bound by GATA1 in erythroid cells, albeit at varying levels, but none were bound by GATA2 in the myeloid progenitor cell model HPC7 cells or by GATA3 in CD4+ T‐cells. In cell types not expressing Gata1, the locus was largely in the quiescent state, indicating that histone H3 in the chromatin of these cell types was undergoing little to no dynamic modifications.
Several regulatory elements have been mapped in the Gata2 locus, both proximal and internal to the gene as well as distal, close to the Rpn1 gene.50, 54, 55 These regulatory elements were in active epigenetic states in the expressing cell types, and the distal CRE was in an active state in a broader range of cell types (Figure 4c). Several of the CREs were bound by GATA2 in HPC7 cells (as well as the previously reported binding in G1E cells, not shown), but little to no binding was observed for GATA1 or GATA3. In contrast to the quiescent state observed for nonexpressing cells for the Gata1 locus, the Gata2 locus was in a polycomb‐repressed state (H3K27me3) in many of the nonexpressing cells. These distinct mechanisms inferred for repression (quiescent vs. polycomb) were deduced simply by examining the IDEAS tracks, and they illustrate insights that follow easily from integrative analysis and modeling.
The DNA interval around the TSS of the Gata3 gene was in an active promoter‐like epigenetic state and was bound by GATA3 in lymphoid cells, consistent with the expression pattern (Figure 4d). However, several additional DNA segments internal to the gene and upstream (between Gata3 and Taf3) were in active states and were inferred to be cCREs. Thus, the regulation of Gata3 may involve multiple CREs. As with the Gata2 locus, the Gata3 locus tended to be in a polycomb‐repressed state in many nonexpressing cell types. Surprisingly, the cCREs around Gata3 were in active epigenetic states in multilineage progenitor cells such as LSK, despite the very low levels of expression. This apparently precocious activation of the epigenetic landscape may serve as a type of lineage priming, or it could reflect some lineage commitment in this cell population.
The epigenetic states and transcription factor binding around the more widely expressed Ikzf1 revealed patterns indicative of lineage‐specific regulatory mechanisms (Figure 5). In addition to the transcribed states internal to the gene in almost all cell types, multiple cCREs in active states were observed around the TSS, upstream to the gene, in the third and seventh introns, and downstream. Strikingly, the pattern of binding of GATA factors was lineage‐specific, with GATA2 binding at an upstream cCRE and in intron 3 in multilineage progenitors, GATA1 binding at a different set of cCREs upstream and in intron 3 in erythroid cells, and GATA3 binding in still a different pattern in CD4+ T‐cells. The cCREs tended to be in active enhancer‐like or promoter‐like states in the cell types for which binding by GATA factors was also observed. These distinct GATA binding patterns, coupled with active epigenetic states from IDEAS, indicate substantial lineage specificity in the cCREs and transcription factors utilized to achieve an appropriate level of expression of Ikzf1 in the various hematopoietic cell types.
6. FUTURE PERSPECTIVES
A major goal of the VISION project is to provide integrated views of epigenomic landscapes and transcriptomes from mammalian hematopoietic cells that will inform gene regulatory models to advance our understanding of global gene regulation. Importantly, these integrative views should enable other investigators to formulate data‐based, testable hypotheses to advance their specific research interests. This review has focused on our recent work with mouse hematopoietic cells, organizing and analyzing about 150 tracks of epigenomic data from 20 cell types to produce a segmentation into well‐defined epigenetic states using the IDEAS method. The consistent and distinctive colors associated with each state present the segmentation results as a type of painting, with one multicolored panel for each cell type. Thus, enhancer‐like and promoter‐like elements can be easily seen in a genome browser, as well as changes in the states among cell types. The epigenetic state assignments were used to annotate nuclease HSs and produce an initial registry of cCREs in mouse hematopoietic cells. This set of slightly over 200,000 cCREs serves as a large set of candidate regulatory elements that can be used in many ways for further research. Likely regulatory elements are now readily available for any gene, along with information about the epigenetic state of the chromatin covering that gene in the 20 hematopoietic cell types. The registry of cCREs can be examined for overlaps with lists of peaks of transcription factor binding (from ChIP‐seq) for further inferences about potential functions of the cCREs.
Building from these initial resources, we have now compiled a large number of epigenomic data sets on human blood cells from the IHEC Blueprint Consortium10 and many individual laboratories, including recent data on multilineage progenitor cells.56 These data are being integrated via IDEAS segmentation, and an initial registry of cCREs is being built using the approaches described here for mouse hematopoietic cells. In addition to the purposes already discussed, these resources will be particularly valuable for improving the interpretation of human genetic variants associated with various blood cell traits and diseases. Large‐scale GWASs have revealed many variants associated with traits of interest in hematology, and we now expect that many of the causative variants are acting through impacts on gene regulation.57 Having a set of high‐quality cCRE predictions decreases the search space for likely functional variants. Thus, the cCRE predictions may enable more precise, higher resolution studies of the potential impacts of the trait‐associated, noncoding genetic variants.
A continuing challenge for utilizing cCRE predictions is the ambiguity in inferring a target gene. Regulatory elements can be far away from their target gene, and it is not uncommon for a CRE to be separated from its target gene by multiple nontarget genes. Substantial efforts within the VISION project and elsewhere are tackling this enduring challenge. Measurements of chromatin interaction frequencies in an all‐against‐all mode such as Hi‐C58 or using capture strategies to focus on particular regions or interactions23, 59 should provide important information to leverage with respect to target gene assignments. We are currently utilizing high‐resolution Hi‐C data26 and capture‐C data23, 60 from erythroid cells for multiple studies including improvement of target gene assignments.
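The ambiguity described above can be made concrete with a small sketch. It compares a nearest‐gene assignment with an assignment based on chromatin contact frequency. Everything here is hypothetical: the gene names, positions, and contact counts are invented, and picking the single gene with the most contacts is a deliberate simplification rather than the approach used in the VISION project.

```python
def nearest_gene(ccre_pos, gene_tss):
    """Return the gene whose transcription start site is closest to the cCRE."""
    return min(gene_tss, key=lambda gene: abs(gene_tss[gene] - ccre_pos))


def best_contact_gene(contacts):
    """Return the gene whose promoter has the highest contact count with the cCRE."""
    return max(contacts, key=contacts.get)


# Hypothetical locus: one cCRE and three candidate genes.
ccre_pos = 1_000_000
gene_tss = {"geneA": 1_020_000, "geneB": 1_150_000, "geneC": 880_000}

# Hypothetical normalized contact counts between the cCRE and each promoter,
# of the kind that Hi-C or capture-based assays could supply.
contacts = {"geneA": 3, "geneB": 41, "geneC": 5}

by_distance = nearest_gene(ccre_pos, gene_tss)
by_contact = best_contact_gene(contacts)
print(by_distance, by_contact)
```

In this invented example the nearest gene is `geneA`, but the strongest contact is with `geneB`, which lies beyond `geneA`. This is the situation in which interaction data can change a target gene assignment.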
The integrative maps of the regulatory landscape and the cCRE predictions were designed to provide accessible views and resources to enable a wide spectrum of users to benefit from the numerous and deep epigenomic data sets available. For the most part, the epigenetic states learned by IDEAS match those expected from decades of work on the impact of chromatin structure on gene regulation. However, there is still the potential for discoveries of novel relationships. We are currently using the epigenetic states and cCREs as input into additional analytical approaches to try to uncover novel insights and global models. For example, we are using multivariate regressions47 and machine learning approaches61 to estimate the impact of individual cCREs on potential target genes, which can then be tested using directed mutagenesis. As these quantitative models for explaining levels of gene expression across cell types improve, they may reveal unexpected, previously unknown relationships. Indeed, one of the several important outcomes from our VISION project is the development of new methods to provide robust results to drive further research.
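To make the regression idea concrete, the following is a minimal sketch of ordinary least squares in plain Python. It is an assumption‐laden toy and not the model used in the cited work. The per‐cell‐type cCRE activity scores and expression values are invented, the function names are hypothetical, and the expression of a gene is modeled simply as an intercept plus a weighted sum of the activities of its candidate cCREs.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_least_squares(activity, expression):
    """Fit expression ~ intercept + sum_i beta_i * activity_i.

    activity: one row per cell type, one column per cCRE.
    Returns [intercept, beta_1, ..., beta_k] from the normal equations.
    """
    design = [[1.0] + list(row) for row in activity]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, expression))
           for i in range(p)]
    return solve_linear_system(xtx, xty)


# Invented data: 5 cell types, 2 candidate cCREs for one gene.
activity = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
# Expression generated as 1.0 + 2.0 * cCRE_1 + 0.5 * cCRE_2 (no noise).
expression = [1.0, 3.0, 1.5, 3.5, 5.5]

coefficients = fit_least_squares(activity, expression)
print(coefficients)
```

Because the toy expression values were generated without noise, the fit recovers the intercept and the two coefficients used to build them, up to floating‐point error. In real data, a larger coefficient would only suggest a stronger association between a cCRE and expression, which is why the text pairs such estimates with testing by directed mutagenesis.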
All resources from the VISION project are publicly available via our website http://usevision.org. We hope that this review will encourage use of these resources.
Acknowledgments
The VISION project is supported by a grant from NIH/NIDDK R24DK106766 (multi‐PI). Additional support for this work comes from NIH/GM R01GM121613 (to Y.Z. and S.M.), NIH/NIDDK R01DK054937 (to G.B.), and NIH/NCI R01CA178393 (to R.H.).
Hardison RC, Zhang Y, Keller CA, et al. Systematic integration of GATA transcription factors and epigenomes via IDEAS paints the regulatory landscape of hematopoietic cells. IUBMB Life. 2020;72:27–38. 10.1002/iub.2195
Funding information: National Cancer Institute, Grant/Award Number: R01CA178393; National Institute of Diabetes and Digestive and Kidney Diseases, Grant/Award Numbers: R01DK054937, R24DK106766; National Institute of General Medical Sciences, Grant/Award Number: R01GM121613
REFERENCES
- 1. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome‐wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106:9362–9367.
- 2. Hardison RC, Taylor J. Genomic approaches towards finding cis‐regulatory modules in animals. Nat Rev Genet. 2012;13:469–483.
- 3. Maurano MT, Humbert R, Rynes E, et al. Systematic localization of common disease‐associated variation in regulatory DNA. Science. 2012;337:1190–1195.
- 4. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
- 5. Hardison RC, Blobel GA. Genetics. GWAS to therapy by genome edits? Science. 2013;342:206–207.
- 6. Collins FS, Varmus H. Perspective: A new initiative on precision medicine. N Engl J Med. 2015;372:793–795.
- 7. Bernstein BE, Stamatoyannopoulos JA, Costello JF, et al. The NIH roadmap epigenomics mapping consortium. Nat Biotechnol. 2010;28:1045–1048.
- 8. GTEx Consortium. Human genomics. The genotype‐tissue expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science. 2015;348:648–660.
- 9. Chen L, Kostadima M, Martens JH, Canu G, Garcia SP, et al. Transcriptional diversity during lineage commitment of human blood progenitors. Science. 2014;345:1251033.
- 10. Stunnenberg HG, International Human Epigenome Consortium, Hirst M. The international human epigenome consortium: A blueprint for scientific collaboration and discovery. Cell. 2016;167:1145–1149.
- 11. Ramirez F, Dundar F, Diehl S, Gruning BA, Manke T. deepTools: A flexible platform for exploring deep‐sequencing data. Nucleic Acids Res. 2014;42:W187–W191.
- 12. Furey TS. ChIP‐seq and beyond: New and improved methodologies to detect and characterize protein‐DNA interactions. Nat Rev Genet. 2012;13:840–852.
- 13. Welch JJ, Watts JA, Vakoc CR, et al. Global regulation of erythroid gene expression by transcription factor GATA‐1. Blood. 2004;104:3136–3147.
- 14. Hughes JR, Cheng JF, Ventress N, et al. Annotation of cis‐regulatory elements by identification, subclassification, and functional assessment of multispecies conserved sequences. Proc Natl Acad Sci U S A. 2005;102:9830–9835.
- 15. Cheng Y, King DC, Dore LC, et al. Transcriptional enhancement by GATA1‐occupied DNA segments is strongly associated with evolutionary constraint on the binding site motif. Genome Res. 2008;18:1896–1905.
- 16. Dore LC, Amigo JD, Dos Santos CO, Zhang Z, Gai X, et al. A GATA‐1‐regulated microRNA locus essential for erythropoiesis. Proc Natl Acad Sci U S A. 2008;105:3333–3338.
- 17. Tripic T, Deng W, Cheng Y, et al. SCL and associated proteins distinguish active from repressive GATA transcription factor complexes. Blood. 2009;113:2191–2201.
- 18. Cheng Y, Wu W, Kumar SA, Yu D, Deng W, et al. Erythroid GATA1 function revealed by genome‐wide analysis of transcription factor occupancy, histone modifications, and mRNA expression. Genome Res. 2009;19:2172–2184.
- 19. Wilson NK, Foster SD, Wang X, Knezevic K, Schutte J, et al. Combinatorial transcriptional control in blood stem/progenitor cells: Genome‐wide analysis of ten major transcriptional regulators. Cell Stem Cell. 2010;7:532–544.
- 20. Wu W, Cheng Y, Keller CA, et al. Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration. Genome Res. 2011;21:1659–1671.
- 21. Kadauke S, Udugama MI, Pawlicki JM, et al. Tissue‐specific mitotic bookmarking by hematopoietic transcription factor GATA1. Cell. 2012;150:725–737.
- 22. Pimkin M, Kossenkov AV, Mishra T, et al. Divergent functions of hematopoietic transcription factors in lineage priming and differentiation during erythro‐megakaryopoiesis. Genome Res. 2014;24:1932–1944.
- 23. Hughes JR, Roberts N, McGowan S, et al. Analysis of hundreds of cis‐regulatory landscapes at high resolution in a single, high‐throughput experiment. Nat Genet. 2014;46:205–212.
- 24. Stonestrom AJ, Hsu SC, Jahn KS, et al. Functions of BET proteins in erythroid gene expression. Blood. 2015;125:2825–2834.
- 25. Dogan N, Wu W, Morrissey CS, et al. Occupancy by key transcription factors is a more accurate predictor of enhancer activity than histone modifications or chromatin accessibility. Epigenet Chromatin. 2015;8:16.
- 26. Hsu SC, Gilgenast TG, Bartman CR, Edwards CR, Stonestrom AJ, et al. The BET protein BRD2 cooperates with CTCF to enforce transcriptional and architectural boundaries. Mol Cell. 2017;66:102–116.e7.
- 27. Heuston EF, Keller CA, Lichtenberg J, Giardine B, Anderson SM, et al. Establishment of regulatory elements during erythro‐megakaryopoiesis identifies hematopoietic lineage‐commitment points. Epigenet Chromatin. 2018;11:22.
- 28. Sanchez‐Castillo M, Ruau D, Wilkinson AC, Ng FS, Hannah R, et al. CODEX: A next‐generation sequencing experiment database for the haematopoietic and embryonic stem cell communities. Nucleic Acids Res. 2015;43:D1117–D1123.
- 29. Lichtenberg J, Heuston EF, Mishra T, Keller CA, Hardison RC, et al. SBR‐blood: A systems biology repository for hematopoietic stem cell differentiation. Nucleic Acids Res. 2016;44:D925–D931.
- 30. Wang Y, Song F, Zhang B, et al. The 3D genome browser: A web‐based browser for visualizing 3D genome organization and long‐range chromatin interactions. Genome Biol. 2018;19:151.
- 31. Lara‐Astiaso D, Weiner A, Lorenzo‐Vivas E, et al. Immunogenetics. Chromatin state dynamics during blood formation. Science. 2014;345:943–949.
- 32. Pinto do OP, Kolterud A, Carlsson L. Expression of the LIM‐homeobox gene LH2 generates immortalized steel factor‐dependent multipotent hematopoietic precursors. EMBO J. 1998;17:5744–5756.
- 33. Weiss MJ, Yu C, Orkin SH. Erythroid‐cell‐specific properties of transcription factor GATA‐1 revealed by phenotypic rescue of a gene‐targeted cell line. Mol Cell Biol. 1997;17:1642–1651.
- 34. Laurenti E, Gottgens B. From haematopoietic stem cells to complex differentiation landscapes. Nature. 2018;553:418–426.
- 35. Sabo PJ, Kuehn MS, Thurman R, et al. Genome‐scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods. 2006;3:511–518.
- 36. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA‐binding proteins and nucleosome position. Nat Methods. 2013;10:1213–1218.
- 37. Barski A, Cuddapah S, Cui K, et al. High‐resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837.
- 38. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330.
- 39. Ernst J, Kellis M. ChromHMM: Automating chromatin‐state discovery and characterization. Nat Methods. 2012;9:215–216.
- 40. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods. 2012;9:473–476.
- 41. Zeng X, Sanalkumar R, Bresnick EH, Li H, Chang Q, Keleş S. jMOSAiCS: Joint analysis of multiple ChIP‐seq datasets. Genome Biol. 2013;14:R38.
- 42. Hamada M, Ono Y, Fujimaki R, Asai K. Learning chromatin states with factorized information criteria. Bioinformatics. 2015;31:2426–2433.
- 43. Zhang Y, An L, Yue F, Hardison RC. Jointly characterizing epigenetic dynamics across multiple human cell types. Nucleic Acids Res. 2016;44:6721–6731.
- 44. Zhang Y, Hardison RC. Accurate and reproducible functional maps in 127 human cell types via 2D genome segmentation. Nucleic Acids Res. 2017;45:9823–9836.
- 45. Zhang Y, Mahony S. Direct prediction of regulatory elements from partial data without imputation. bioRxiv. 2019. 10.1101/643486.
- 46. Xiang G, Keller CA, Giardine B, An L, Hardison RC, et al. S3norm: Simultaneous normalization of sequencing depth and signal‐to‐noise ratio in epigenomic data. bioRxiv. 2019. 10.1101/506634.
- 47. Xiang G, Keller CA, Heuston E, Giardine BM, An L, et al. An integrative view of the regulatory and transcriptional landscapes in mouse hematopoiesis. bioRxiv. 2019. 10.1101/731729.
- 48. Bresnick EH, Katsumura KR, Lee HY, Johnson KD, Perkins AS. Master regulatory GATA transcription factors: Mechanistic principles and emerging links to hematologic malignancies. Nucleic Acids Res. 2012;40:5819–5831.
- 49. Wu W, Morrissey CS, Keller CA, et al. Dynamic shifts in occupancy by TAL1 are guided by GATA factors and drive large‐scale reprogramming of gene expression during hematopoiesis. Genome Res. 2014;24:1945–1962.
- 50. Wang H, Zhang Y, Cheng Y, et al. Experimental validation of predicted mammalian erythroid cis‐regulatory modules. Genome Res. 2006;16:1480–1492.
- 51. Wei G, Abraham BJ, Yagi R, et al. Genome‐wide analyses of transcription factor GATA3‐mediated gene regulation in distinct T cell types. Immunity. 2011;35:299–311.
- 52. Onodera K, Takahashi S, Nishimura S, et al. GATA‐1 transcription is controlled by distinct regulatory mechanisms during primitive and definitive erythropoiesis. Proc Natl Acad Sci U S A. 1997;94:4487–4492.
- 53. Valverde‐Garduno V, Guyot B, Anguita E, Hamlett I, Porcher C, Vyas P. Differences in the chromatin structure and cis‐element organization of the human and mouse GATA1 loci: Implications for cis‐element identification. Blood. 2004;104:3106–3116.
- 54. Bresnick EH, Lee HY, Fujiwara T, Johnson KD, Keles S. GATA switches as developmental drivers. J Biol Chem. 2010;285:31087–31093.
- 55. Grass JA, Boyer ME, Pal S, Wu J, Weiss MJ, Bresnick EH. GATA‐1‐dependent transcriptional repression of GATA‐2 via disruption of positive autoregulation and domain‐wide chromatin remodeling. Proc Natl Acad Sci U S A. 2003;100:8811–8816.
- 56. Corces MR, Buenrostro JD, Wu B, et al. Lineage‐specific and single‐cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat Genet. 2016;48:1193–1203.
- 57. Ulirsch JC, Nandakumar SK, Wang L, et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell. 2016;165:1530–1545.
- 58. Lieberman‐Aiden E, van Berkum NL, Williams L, et al. Comprehensive mapping of long‐range interactions reveals folding principles of the human genome. Science. 2009;326:289–293.
- 59. Platt JL, Salama R, Smythies J, et al. Capture‐C reveals preformed chromatin interactions between HIF‐binding sites and distant promoters. EMBO Rep. 2016;17:1410–1421.
- 60. Huang P, Keller CA, Giardine B, et al. Comparative analysis of three‐dimensional chromosomal architecture identifies a novel fetal hemoglobin regulatory element. Genes Dev. 2017;31:1704–1713.
- 61. Denas O, Taylor J. Deep modeling of gene expression regulation in an erythropoiesis model. In: 30th International Conference on Machine Learning workshop on Representation Learning, Atlanta, GA; 2013. http://deeplearning.net/wp-content/uploads/2013/2003/icml_paper.pdf.