Significance
Methods for identifying causal variants underlying human diseases have been greatly enhanced by whole-exome sequencing; however, this approach overlooks mutations that occur within noncoding regulatory regions. Moreover, the mechanisms by which such mutations result in disease are poorly understood. In this study, we interrogated binding sites of the blood cell transcription factor GATA1 in regulatory elements that are mutated in cases of human red blood cell disorders by creating small targeted deletions in model cell lines. These deletions cause a major reduction in target gene expression. We used this initial insight to show that such elements are highly conserved, and that through predictive modeling, we can gain global insight into key determinants of GATA1 transcriptional activity.
Keywords: GATA1, cis-regulatory elements, noncoding mutations, Mendelian erythroid disorders
Abstract
Whole-exome sequencing has been incredibly successful in identifying causal genetic variants and has revealed a number of novel genes associated with blood and other diseases. One limitation of this approach is that it overlooks mutations in noncoding regulatory elements. Furthermore, the mechanisms by which mutations in transcriptional cis-regulatory elements result in disease remain poorly understood. Here we used CRISPR/Cas9 genome editing to interrogate three such elements harboring mutations in human erythroid disorders, which in all cases are predicted to disrupt a canonical binding motif for the hematopoietic transcription factor GATA1. Deletions of as few as two to four nucleotides resulted in a substantial decrease (>80%) in target gene expression. Isolated deletions of the canonical GATA1 binding motif completely abrogated binding of the cofactor TAL1, which binds to a separate motif. Having verified the functionality of these three GATA1 motifs, we demonstrate strong evolutionary conservation of GATA1 motifs in regulatory elements proximal to other genes implicated in erythroid disorders, and show that targeted disruption of such elements results in altered gene expression. By modeling transcription factor binding patterns, we show that multiple transcription factors are associated with erythroid gene expression, and have created predictive maps modeling putative disruptions of their binding sites at key regulatory elements. Our study provides insight into GATA1 transcriptional activity and may prove a useful resource for investigating the pathogenicity of noncoding variants in human erythroid disorders.
Whole-exome sequencing (WES) and targeted sequencing approaches have greatly accelerated our ability to identify causal genetic lesions in both previously implicated and novel genes underlying monogenic disorders (1, 2). In hematology, WES has been extremely useful for identifying unknown genetic etiologies for various disorders, such as those affecting red blood cell (RBC) production, including Diamond–Blackfan anemia and congenital dyserythropoietic anemia (3–5), disorders of RBC structure and function (6, 7), and disorders affecting other aspects of hematologic function (2, 8). Despite this considerable success, however, more than 50% of cases of presumed monogenic diseases are refractory to current WES approaches (9). Although resolving these remaining cases will benefit from improvements in exome capture (10), read alignment (11), and variant annotation methodologies (11), the importance of genetic variation occurring within regulatory elements (REs) outside of the traditionally investigated coding sequences in hematologic and other diseases is being increasingly appreciated (12).
Whole-genome sequencing (WGS) approaches are becoming progressively more available and affordable, but separating pathogenic genetic variation from benign or unrelated mutations remains especially difficult outside of protein-coding genetic regions (12, 13). This difficulty is most clearly reflected in the distribution of mutations listed in databases of Mendelian disorders, such as the Human Gene Mutation Database, where most mutations are found within coding regions (86%) or at intronic splice sites (11%), with only a small fraction (3%) identified in regulatory regions (14). Newer methods for annotating and predicting the impact of noncoding (NC) variants have provided substantial improvements (15, 16), but experimental validation of the presumed effects remains critical for the determination of pathogenicity and elucidation of the mechanism of action (13, 17).
Multiple lines of evidence have shown that NC mutations can modulate the transcriptional activity and binding of key cell type-specific master transcription factors (TFs) (18, 19). In several Mendelian erythroid disorders (MEDs), NC genetic variants have been identified in the DNA-binding motif (WGATAR) of the hematopoietic master regulator, GATA1 (20–25). GATA1 is both necessary for proper erythropoiesis (3, 26) and sufficient to reprogram alternative lineages toward an erythroid fate (27, 28). Moreover, numerous studies have shown that GATA1 acts in a number of multimeric complexes involving other TFs and can activate or repress target gene expression (29–37).
Although much has been learned about transcriptional regulation during erythropoiesis, the determinants of GATA1 activity at individual REs are incompletely understood (29). For example, after Gata1 induction in the Gata1-null G1E-ER4 erythroblast (EB) cell line, GATA1 binds to and up-regulates key erythroid genes to induce terminal erythroid differentiation, while simultaneously repressing genes involved in alternative lineages. The global chromatin architecture is indistinguishable between induced and repressed genes, however, suggesting that additional features, such as DNA sequence requirements and combinatorial TF patterns, determine the resulting transcriptional alterations in this model system and also more generally during lineage commitment and differentiation (29–31, 34, 35, 37–42). Furthermore, the effects of genetic variation on erythroid cis-regulatory elements (CREs), particularly with regard to those predicted to disrupt GATA1 or cofactor (CF) motifs on GATA1 and CF binding and target gene expression, are not well defined (40).
Given these deficiencies, we interrogated three regions harboring rare variants implicated in distinct MEDs that are predicted to disrupt GATA1 motifs, using a combination of experimental and computational approaches to gain further insight into GATA1 transcriptional activity (20–23, 43).
Results
Using Genome Editing to Verify NC Mutations Implicated in MEDs.
We selected single nucleotide mutations implicated in X-linked sideroblastic anemia (XLSA) (21, 22), congenital erythropoietic porphyria (CEP) (20), and pyruvate kinase deficiency (PKD) (23, 43) that were predicted to disrupt a GATA1 binding site (BS) (Fig. S1). In each case, the mutation fell within an erythroid CRE occupied by GATA1 and multiple CFs proximal to the known causal gene (ALAS2 in XLSA, UROS in CEP, and PKLR in PKD) (Fig. 1 A–C). To verify the effects of each NC mutation, we used CRISPR/Cas9 genome editing to create precise double-stranded DNA breaks across the affected core “GATA” motif in the K562 erythroid cell line, which shares a similar regulatory architecture at these CREs with primary human EBs (Fig. S2) (32). We used a PCR-based screen to identify three to four clones with small deletions ranging in size from 2 to 15 nt across each motif resulting from unfaithful repair of the targeted double-stranded DNA breaks (Fig. S3).
Fig. S1.
Selected NC mutations from three distinct MEDs disrupt the GATA1 motif. The canonical GATA1 motif is shown as a position weight matrix. For each mutated CRE, one reported mutation in the GATA1 motif is shown. P values for the presence of the GATA1 motif in each reference (Ref) sequence are shown, along with the log10 difference in this P value between the reference sequence and the sequence with the reported mutation. In each case, the mutation is predicted to completely disrupt the GATA1 motif.
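To make the notion of motif disruption concrete at the sequence level, the following Python sketch scans both strands for the WGATAR consensus (W = A/T, R = A/G) and asks whether a single nucleotide substitution removes a match. It is a simplification: it uses an exact consensus match rather than the position weight matrix and P values shown in the figure, and the function names and the toy sequence in the usage note are hypothetical.

```python
import re

# IUPAC consensus WGATAR on the forward strand: W = A or T, R = A or G.
# Lookaheads are used so that overlapping matches are all reported.
FORWARD = re.compile(r"(?=([AT]GATA[AG]))")
# Reverse complement of WGATAR is YTATCW: Y = C or T, W = A or T.
REVERSE = re.compile(r"(?=([CT]TATC[AT]))")


def find_gata_sites(seq):
    """Return sorted (start, strand) tuples for every WGATAR match."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-") for m in REVERSE.finditer(seq)]
    return sorted(hits)


def substitute(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def disrupts_motif(seq, pos, alt):
    """True if the substitution removes a motif match present in the reference."""
    before = set(find_gata_sites(seq))
    after = set(find_gata_sites(substitute(seq, pos, alt)))
    return bool(before - after)
```

For the toy sequence CCTGATAAGG, `disrupts_motif(seq, 5, "C")` changes the T of the core GATA and removes the only match, whereas a substitution at position 0 lies outside the motif and leaves it intact.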
Fig. 1.
Using genome editing to verify NC mutations associated with MED. (A–C) GATA1, TAL1, LDB1, KLF1, and NFE2 occupancy and nucleosome-depleted regions (NDRs) across ALAS2, UROS, and PKLR in primary proEs. The orange highlighted region indicates the GATA1 BS targeted by CRISPR/Cas9. The sgRNA used for genome editing is shown, and its PAM sequence is shown in green. The red arrows indicate mutations associated with MEDs. (D) Relative mRNA expression levels of ALAS2, UROS, and PKLR across clonal deletions of the associated GATA1 BS. ****P < 0.0001. (E) Relative mRNA expression levels of ALAS2 and UROS on day 10 of differentiation with a vector containing Cas9 and a nontargeting sgRNA (black; sgXPR5) or sgRNAs targeting the GATA1 motif in the CRE near ALAS2 (red; sgALAS2) or near UROS (orange; sgUROS). *P < 0.05; **P < 0.01; n.s., P nonsignificant. (F) CD235a expression at day 12 of differentiation in infected HSPCs. Unstained control cells are shown in gray.
Fig. S2.
Comparison of erythroid TF chromatin occupancy between primary human proEs and the erythroid K562 cell line. (A–C) ChIP-seq profiles across ALAS2, UROS, and PKLR in proEs, similar to Fig. 1 A–C. (D–F) ChIP-seq profiles across ALAS2, UROS, and PKLR in K562 cells, similar to Fig. 1 A–C.
Fig. S3.
PCR-based genomic DNA screening method and deletions generated across interrogated GATA1 motifs. (A) For each gene, one forward primer and two reverse primers were designed for the screen (Table S1). FW1 and RV1 acted as control primers and amplified the whole region of interest. FW1 and RV2 failed to amplify a product when the GATA1 binding site (in red) was sufficiently deleted to interfere with RV2 binding. Assuming a diploid genotype, we determined whether the deletions introduced were present in the first (1) or second (2) chromosomal copy. (B) PCR products of FW1 and RV1 of PKLR clones and two GFP controls. PCR products for all samples are observed. Band sizes are unequal due to base pair deletions as a result of CRISPR/Cas9 editing. (C) PCR products of FW1 and RV2 of PKLR clones and two GFP controls. PKLR clones 1, 3, 4, and 5 show no product. (D–F) Clonal deletions across GATA1 motifs obtained for CREs proximal to ALAS2 (D), UROS (E), and PKLR (F).
For clonal GATA1 motif deletions in the intronic CREs of ALAS2 and UROS, we observed an ∼80% reduction of ALAS2 and UROS mRNA, but for clonal deletions in the PKLR promoter, we observed a >99% reduction of PKLR mRNA (Fig. 1D). Deletions of as few as 2 nt were sufficient to reduce target mRNA transcription by >80%, similar to findings in reporter assays (20, 44). Although these results verify the functionality of the selected mutations, they also suggest that not all GATA1 BS variants will affect gene expression equally. In our experiments, GATA1 motif deletion in a promoter resulted in complete abrogation of target gene expression, whereas residual target mRNA expression was still detectable in deletions at intronic CREs. In addition, secondary intronic CREs occupied by GATA1 and TAL1 within ALAS2 and UROS may additively regulate gene expression; no secondary CREs were detected near PKLR (Fig. 1 A–C).
As a proof of principle that this approach can validate variants in a disease-relevant context, we used a lentiviral CRISPR/Cas9 approach to target these GATA1 motifs in adult CD34+ hematopoietic stem and progenitor cells (HSPCs). By targeting the CRE proximal to ALAS2, we observed a specific reduction in ALAS2 mRNA expression; we obtained similar results by targeting the CRE proximal to UROS (Fig. 1E). Importantly, the Cas9-infected HSPCs showed no impairment in their ability to undergo robust erythroid differentiation (Fig. 1F).
Isogenic Cellular Models Phenocopy Cell-Intrinsic Effects of Associated Disorders.
We next set out to determine the extent to which our cellular models of NC GATA1 BS mutations could phenocopy the cell-intrinsic effects of each associated MED. In XLSA, mutations in ALAS2, the gene encoding δ-aminolevulinic acid synthase 2, the rate-limiting enzyme in heme biosynthesis, result in decreased heme production, promoting formation of the characteristic ringed sideroblasts (Fig. 2A). Thus, we performed a heme quantification assay and verified a decrease in heme levels of ∼40% in our XLSA model compared with isogenic controls (Fig. 2A, Right). Residual heme production is likely attributable to the small amount (∼20%) of ALAS2 mRNA expression and the moderate levels of the ubiquitously expressed δ-aminolevulinic acid synthase 1 (encoded by ALAS1) in the cells (Fig. S4).
Fig. 2.
Functional characterization of CRISPR-edited K562 clones modeling XLSA, PKD, and CEP. (A) (Left) Flowchart of the heme biosynthesis pathway. ALAS2 is the rate-limiting enzyme in this pathway, and its disruption leads to decreased heme production. (Right) Relative amounts (in µM) of heme normalized to cell number. ****P < 0.0001. (B) (Left) Schematic of the cytosolic portion of the heme biosynthesis pathway. δ-aminolevulinic acid is stepwise converted to hydroxymethylbilane, which is then synthesized into uroporphyrin III by UROS, eventually resulting in heme production. Without UROS, hydroxymethylbilane is processed into nonphysiological isomer I porphyrins, resulting in hemolysis and UV phototoxicity. (Right) Ratios of coproporphyrin I to coproporphyrin III values. ****P < 0.0001. (C) (Top Left) Diagram showing the final step of glycolysis. PKLR catalyzes this conversion of phosphoenolpyruvate (PEP) to pyruvate. (Bottom Left) Intracellular flow cytometry analysis for PKLR. (Top Right) PK enzymatic activity assay. (Bottom Right) Expression of genes encoding PK, PKLR and PKM, during erythroid differentiation and in K562 cells.
Fig. S4.
Gene expression of ALAS2 and ALAS1. (Left) Expression of ALAS2 and ALAS1 in the erythroid lineage. (Right) Expression of ALAS2 and ALAS1 in the K562 cell line.
Similar to XLSA, CEP is also a disorder of dysregulated heme biosynthesis. Although mutations in UROS, the gene encoding uroporphyrinogen III synthase, also result in mildly reduced heme production, the phototoxicity and hemolysis observed in CEP are attributable primarily to accumulation of the toxic nonphysiological coproporphyrin I isomers (Fig. 2B) (20). To measure this accumulation in our CEP model, we performed a porphyrin assay, and found that our CRISPR-edited clones had an approximately fourfold higher ratio of isomer I to isomer III porphyrins compared with controls (Fig. 2B, Right).
Finally, we investigated our cellular models of PKD, one of the most common causes of hereditary nonspherocytic hemolytic anemia (45). Pyruvate kinase (PK), encoded by PKLR, catalyzes the final reaction of glycolysis by converting phosphoenolpyruvate to pyruvate (Fig. 2C, Top Left). We first measured intracellular levels of mature PKLR protein and confirmed a reduction in our CRISPR-edited clones (Fig. 2C, Bottom Left). Nevertheless, a PK enzymatic activity assay revealed only weakly to moderately reduced activity in the CRISPR-edited clones compared with controls (Fig. 2C, Top Right). Interestingly, although PKLR is the only PK expressed in the committed erythroid lineage in humans, K562 cells strongly express PKM, an alternative PK, suggesting that even though PKLR is nearly absent in our PKD model, PKM alone is able to sustain glycolysis (Fig. 2C, Bottom Right). Overall, by disrupting specific GATA1 BSs, our cellular models are able to mimic the cell-intrinsic effects of XLSA, CEP, and PKD to varying degrees.
Targeted Disruption of the Core GATA Motif Destabilizes GATA1 TF Complexes.
As a master regulator of erythropoiesis, GATA1 often acts in a multimeric complex comprising TAL1 as well as the non–DNA-binding TFs LMO2 and LDB1, which can loop from CRE to promoter to regulate gene expression (33, 35–37). Mutations in the TAL1 motif (occurring 8–9 nt from a GATA1 motif) have been shown to modulate gene expression by significantly displacing both TAL1 and LDB1 while exhibiting only marginal effects on GATA1 binding (40). Although we have shown that mutations in the GATA1 motif itself can affect target gene expression, resulting in disease, little is known about how these mutations affect binding of the entire activation complex. Thus, we selected clones with the smallest GATA1 motif deletions (maximum 4 nt) and performed ChIP followed by PCR (ChIP-PCR) for TAL1, the other DNA-binding component of the multimeric complex (whose canonical motif was not disrupted). Interestingly, all three GATA1 mutations resulted in a complete loss of TAL1 binding compared with both the GFP control and cross-isogenic control cells (Fig. 3 A–C). Thus, we provide direct evidence for a previously proposed model in which both TAL1 and GATA1 binding are important for proper erythroid gene expression, TAL1 binding is dependent on GATA1 (37, 46), but GATA1 binding appears to be only partially dependent on TAL1 (Fig. 3D) (40).
Fig. 3.
Mutations in GATA1 BSs affect CF binding. (A–C) ChIP-PCR results against rabbit IgG negative control and TAL1 antibody for the ALAS2 intron (A), the UROS intron (B), and the PKLR promoter (C). *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001; n.s., P nonsignificant. (D) Model of mutations in a standard activation complex involving GATA1.
Evolutionary Conservation and Functional Relevance of GATA Motifs in Erythroid REs.
Although WES has uncovered many unexpected and novel genes underlying MEDs, including GATA1 in Diamond–Blackfan anemia and KCNN4 in hereditary xerocytosis (3, 7), causal genetic lesions have not been definitively identified in coding regions for a large percentage of Mendelian cases with clinical phenotypes (9, 44, 47, 48). This suggests the possibility that an important minority of causal genetic lesions may instead be located within NC regulatory regions. Given that we have verified that mutations identified in GATA1 BSs are pathogenic for three distinct MEDs, we wanted to investigate these BSs for other MEDs in which the known genetic defects act intrinsically within the erythroid lineage (www.bloodgenes.org/MEDs.html).
We first enumerated all CREs occupied by GATA1 in primary human EBs that are proximal to the ∼20 known pathogenic MED genes. Within these CREs, we identified 176 core GATA sites, and investigated the evolutionary conservation of these four nucleotides as a proxy for functionality (32, 44, 47, 48). Surprisingly, although we observed only slight conservation for the GATA site within all ∼20,000 GATA1 chromatin occupancy sites across the genome, GATA sites within MED CREs were substantially conserved (Fig. 4A) (32). Moreover, the region containing the TAL1 motif 8–9 nt upstream of the core GATA motif was specifically conserved in a subset of MED CREs, suggesting the importance and functionality of both factors at these elements (Fig. 4B) (37). Therefore, we suggest that GATA1, and in certain cases TAL1, DNA BSs may harbor mutations in MED cases without identifiable and pathogenic genetic lesions within the coding sequence of the known MED genes.
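The conservation comparison described above can be sketched in a few lines of Python, assuming that per-base conservation scores (for example, PhastCons values between 0 and 1) have already been extracted for each CRE. The helper names, the data layout, and the threshold are hypothetical; the threshold in particular is an illustrative choice, not a value taken from this study.

```python
from statistics import mean


def core_conservation(scores, core_start, core_len=4):
    """Mean per-base conservation score across the core motif window.

    scores: per-base scores for a scored region (list of floats).
    core_start: 0-based index of the first base of the core GATA.
    """
    if core_start < 0 or core_start + core_len > len(scores):
        raise ValueError("core window falls outside the scored region")
    return mean(scores[core_start:core_start + core_len])


def fraction_conserved(site_means, threshold):
    """Fraction of sites whose mean core score meets a chosen threshold."""
    if not site_means:
        raise ValueError("no sites supplied")
    return sum(s >= threshold for s in site_means) / len(site_means)
```

Applying `core_conservation` to each of the 176 core GATA sites and to a genome-wide background set would give two score distributions that can be compared directly.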
Fig. 4.
Identifying functional GATA1 BSs near genes implicated in MEDs. (A) Conservation of GATA1 binding motifs using PhastCons scores. Low-level conservation is observed across ∼40,000 GATA1 motifs within all ∼20,000 GATA1 occupancy sites in EBs (blue). Strongly conserved GATA1 motifs with biochemical activity for GATA1, TAL1, and KLF1 (CGTK) in human and mouse EBs are positive controls (red). The 176 GATA1 binding motifs in CREs occupied by GATA1 proximal to MED genes (green) are conserved, suggesting that they are functional. (B) Conservation across the 176 GATA1 motifs shown in A. Approximately 40% of these GATA1 binding motifs are strongly conserved, and ∼60% of these are co-conserved with TAL1 motifs. Black lines separate clusters identified using k-means. (C) ChIP-seq profiles across EPB41, similar to those shown in Fig. 1 A–C. E1 and E2 are GATA1 motifs in CREs occupied by GATA1 that we targeted using CRISPR/Cas9. (D) Relative mRNA expression levels of EPB41. ****P < 0.0001. (E) ChIP-seq profiles across KCNN4, similar to those shown in C. K1 is a GATA1 motif similar to E1. (F) Relative mRNA expression levels of KCNN4. ****P < 0.0001.
To experimentally verify that disruption of these GATA1 motifs could affect expression of the target MED genes, we performed CRISPR/Cas9 genome editing at GATA1 motifs within seven CREs proximal to four MED genes. For five of the seven targeted CREs, we observed a significant decrease in target gene expression (ranging from ∼30% to 80%) (Fig. 4 C–F and Fig. S5). (Interestingly, we observed that targeting the other two CREs resulted in increased mRNA expression of the target gene, SLC4A1.) Because we measured mRNA expression only in bulk cells, this finding is likely an underestimation of the true effect of the GATA1 motif disruption. Interestingly, GATA1 appears to regulate genes implicated in disorders of the RBC membrane (EPB41 and KCNN4), various enzyme-related diseases (PKLR and HK1), and disorders occurring within the heme biosynthetic pathway (ALAS2 and UROS), suggesting that mutations in GATA1 BSs may occur in a number of distinct MEDs.
Fig. S5.
Targeting additional GATA1 binding sites near genes implicated in MEDs. (A) ChIP-seq profiles across HK1, similar to Fig. 1 A–C. H1 and H2 are GATA1 motifs in CREs occupied by GATA1 that we targeted using CRISPR/Cas9. (B) Relative mRNA expression levels of HK1. **P < 0.01; ***P < 0.001. (C) ChIP-seq profiles across SLC4A1, similar to Fig. 1 A–C. S1 and S2 are GATA1 motifs in CREs occupied by GATA1 that we targeted using CRISPR/Cas9. (D) Relative mRNA expression levels of SLC4A1. **P < 0.01.
Understanding and Predicting the Effects of NC Mutations on GATA1 and CFs.
Finally, we aimed to gain a better global understanding of the combinatorics of erythroid transcriptional regulation and to identify the important DNA elements underlying these TFs (31–33, 38, 39, 41). To do so, we first used ChIP-seq to define the relative binding of GATA1 and four CFs generally implicated in an activation complex with GATA1: TAL1, LDB1, KLF1, and NFE2 (31–35, 37, 39). Modeling the relative binding intensities of these CFs across the proximal promoters of ∼18,000 genes revealed that these five factors alone explained 46% of the variation in proerythroblast (proE) gene expression (Fig. 5A).
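One common way to express "variation explained" is the coefficient of determination between observed and model-predicted expression. The sketch below assumes that definition; the exact metric and validation scheme used for the model are given in SI Materials and Methods, and the function name is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_total == 0:
        raise ValueError("observed values have zero variance")
    return 1 - ss_resid / ss_total
```

Under this definition, a value of 0.46 would mean that the residual sum of squares is 54% of the total sum of squares of observed expression.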
Fig. 5.
Identifying key TF occupancy patterns and predicting important DNA REs near human MED genes. (A) Plot comparing observed and predicted proE expression from the predictive model of TF binding intensities. Hexagonal binning is used to represent point density. (B) Identification of six clusters representing the z-scores of the promoter binding intensities of erythroid TFs. (C) Comparison of gene expression during lineage commitment and terminal erythroid differentiation across the clusters derived in B. The pie chart shows that MED genes reside within clusters with greater GATA1 and CF binding intensities. (D) Mutation map of the PKLR promoter for GATA1 and TAL1 binding. Predicted loss-of-function mutations are in blue, gain-of-function mutations are in red, and benign mutations are in white. Stars indicate previously reported mutations in cases of PKD. The GATA1 and TAL1 motifs are highlighted in orange and green, respectively. (E) A mutation map of a CRE in the eighth intron of ALAS2, similar to D.
We next investigated combinatorial binding patterns using partitioning around medoids (PAM), and identified six clusters of variable combinations of binding by GATA1 and CFs (Fig. 5B and Fig. S6). The first five clusters exhibited concomitant GATA1, TAL1, KLF1, and NFE2 binding intensity, associated with increasing expression during terminal erythroid differentiation (Fig. 5C). Interestingly, cluster 6 was differentiated from the other clusters by substantially greater LDB1 and TAL1 promoter binding, greater LDB1 binding at distal CREs (Fig. S7), and stronger transcriptional induction during terminal erythropoiesis (Fig. 5 B and C) (35). In addition, many MED genes, as well as the erythroid TFs themselves, were found within clusters with strong erythroid TF intensity (Fig. 5C). Building on previous work (31–33, 38, 39, 41), our results provide additional evidence that the cooperation of these five TFs is important for the proper expression of key erythroid genes, such as those mutated in MEDs.
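For readers unfamiliar with PAM, the following is a minimal pure-Python sketch of its two phases (a greedy BUILD step followed by best-improvement SWAP steps). It is an illustration on generic points, not the implementation used here: Euclidean distance, the function names, and the brute-force cost evaluation are all assumptions, and the approach is far too slow for ∼18,000 genes without an optimized library.

```python
from math import dist


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Partitioning around medoids; returns (medoid indices, cluster labels)."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda i: total_cost(points, medoids + [i]))
        medoids.append(best)
    # SWAP: apply the best medoid/non-medoid exchange until none improves.
    cost = total_cost(points, medoids)
    while True:
        best_swap = None
        for slot in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:slot] + [cand] + medoids[slot + 1:]
                trial_cost = total_cost(points, trial)
                if trial_cost < cost - 1e-12:
                    cost, best_swap = trial_cost, trial
        if best_swap is None:
            break
        medoids = best_swap
    # Label each point with the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels
```

In the setting described above, each point would be a gene represented by the z-scored promoter binding intensities of the five TFs, with k = 6.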
Fig. S6.
Clusters of TF intensity at promoter regions. Multiple plots showing the differences in TF activity across PAM-derived clusters of ∼18,000 genes. Units are all in log2 input-normalized reads per million. (Lower Diagonal) Pairwise scatterplots for the binding intensities of TFs. Points (genes) are color-coded by cluster, as in Bottom Right. (Diagonal) Probability densities of TF binding intensities for each cluster. (Upper Diagonal) Spearman correlations for pairwise TFs. (Far Right) Boxplots of the TF binding intensities for each cluster. (Bottom Right) Relative numbers of genes in each cluster.
Fig. S7.
Erythroid TFs binding at enhancers. Shown are intensities of LDB1 (A), GATA1 (B), and TAL1 (C) across enhancers assigned to genes in each cluster, as defined in Fig. 5 B and C. ***P < 10−9.
Given the importance of multiple TFs in our model, we dissected the DNA elements of CREs at the known MED genes using two state-of-the-art approaches to predict the cell type-specific effects of NC mutations. First, we trained a gapped k-mer support vector machine (gkmer-SVM) on EB and K562 open chromatin data and used delta-SVM to predict single nucleotide effects (16). We then used already-trained models for the TFs GATA1, TAL1, KLF1, and NFE2 from DeepBind, a convolutional neural network approach (15). For each CRE proximal to the 20 MED genes, we created a “mutation map” of the predicted effects of all possible single nucleotide changes (15).
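The idea of a mutation map can be illustrated with a short Python sketch that enumerates every single nucleotide substitution in a sequence and records the change in a model score relative to the reference. The scoring function here is a toy stand-in: a hypothetical table of k-mer weights summed across the sequence, loosely in the spirit of delta-SVM. It does not call gkmer-SVM or DeepBind, and the weights and names are invented for illustration.

```python
def kmer_score(seq, weights, k):
    """Sum the weights of all k-mers in seq (unlisted k-mers score 0)."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


def mutation_map(seq, score):
    """Map (position, alt base) -> score(mutant) - score(reference)."""
    reference = score(seq)
    deltas = {}
    for pos, ref_base in enumerate(seq):
        for alt in "ACGT":
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            deltas[(pos, alt)] = score(mutant) - reference
    return deltas


# Hypothetical weights: reward the core GATA on either strand.
TOY_WEIGHTS = {"GATA": 2.0, "TATC": 2.0}


def toy_score(seq):
    return kmer_score(seq, TOY_WEIGHTS, 4)
```

With `mutation_map("CTGATAAG", toy_score)`, every substitution inside the core GATA yields a negative delta (predicted loss of function) and every substitution outside it yields zero; a trained model would instead produce graded values, including positive (gain-of-function) deltas.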
For each GATA1 motif that we functionally verified with CRISPR/Cas9 genome editing, our models show that a single nucleotide change in the core GATA is predicted to completely abolish binding, consistent with our in vitro results and the in vivo phenotypes in patients with cis-regulatory mutations (Fig. 5D and Fig. S8). Moreover, the mutation maps suggest that TAL1 binding is more strongly affected by alterations to the GATA1 motif than by alterations to its own motif, consistent with our TAL1 ChIP-PCR results (Fig. 5D and Fig. S8). To further demonstrate the utility of our mutation maps, we determined that mutations reported in two independent cases of PKD (in what was previously known as the PKLR-RE1 element) actually disrupted the TAL1 motif proximal to the interrogated GATA1 motif (Fig. 5D) (49, 50), highlighting the important role of GATA1 CFs in regulating proper gene transcription during erythropoiesis (33, 40).
Fig. S8.
Mutation maps based on a gkmer-SVM trained on erythroid open chromatin and DeepBind trained on GATA1 and TAL1 ChIP-seq. All interrogated GATA1 motifs that exhibited a decrease in target gene expression are shown: (A) ALAS2 intron 1, (B) UROS intron, (C) EPB41 E1, (D) EPB41 E2, (E) KCNN4 K1, (F) HK1 H1, and (G) HK1 H2. In most cases, disruption of a GATA1 motif was more important than disruption of a TAL1 motif for predicted TAL1 binding.
We also observed multiple “complex” REs containing multiple TF binding motifs. For example, at a CRE occupied by GATA1, TAL1, KLF1, LDB1, and NFE2 in the eighth intron of ALAS2, our mutation maps identified multiple GATA1 and TAL1 motifs, as well as a KLF1 motif and an NFE2 motif (Fig. 5E). Exogenous reporter assays have shown that, depending on the exact “chunks” of DNA tested, this region can vary in activation strength and even repress (51). Intriguingly, when we separately targeted two GATA1 motifs in ALAS2 intron 8 with CRISPR/Cas9, we observed an increase in ALAS2 mRNA expression, similar to the results obtained by targeting two CREs near SLC4A1, suggesting that these regions may have a particularly complex regulatory architecture (Fig. S9) (51). Taken together, our analyses provide additional evidence that erythroid CREs are dependent on the binding of multiple TFs for proper transcriptional regulation and RBC development. We provide visualizations of the regulatory landscape of MED genes, gene expression graphs, and mutation maps across CREs for each MED on our website (www.bloodgenes.org/MEDs.html).
Fig. S9.
Interrogation of two GATA1 binding motifs in the “complex” enhancer in the eighth intron of ALAS2. (A) ChIP-seq profiles across ALAS2, similar to Fig. 1A. A2 is a GATA1 motif in a CRE occupied by GATA1 that we targeted using CRISPR/Cas9. (B) Relative mRNA expression levels of ALAS2. **P < 0.01; ****P < 0.0001.
Discussion
In this study, we interrogated natural genetic variations occurring in MEDs to provide a window into the functions of erythroid gene regulatory networks (2). In primary HSPCs and an erythroid cell line, we used CRISPR/Cas9 genome editing to verify the regulatory effects of NC genetic variants associated with XLSA, CEP, and PKD, and were able to recapitulate the cell-intrinsic phenotypes of the associated disorder (20–23, 43). These immortalized cellular models are readily available for the study of disease pathology and may prove useful for screening potential treatments, such as gene therapy (52). More generally, our approach to experimentally validate the pathogenicity of rare NC variants can be applied to novel NC variants associated with erythroid or nonerythroid Mendelian diseases.
During the process of functionally verifying the selected mutations, we determined that small alterations (2–4 nt) of a GATA1 motif were sufficient to reduce expression of the known causal gene by >80% in all cases, although the exact effects of each single nucleotide variant observed in vivo remain to be shown. Mechanistically, we showed that by solely disrupting the GATA1 BS at a CRE, the binding of the DNA-binding CF TAL1 is severely impaired, likely resulting in loss of the entire CF-containing multimeric complex (35, 37). Our analysis of combinatorial occupancy patterns of erythroid TFs highlights the importance of this multimeric complex formed by GATA1, TAL1, and LDB1 for the induction of key erythroid genes during terminal erythropoiesis (33, 41). In contrast to the GATA1 BSs interrogated here, our predictive maps and other studies have shown that mutations in TAL1 BSs have only marginal effects on GATA1 binding, while still resulting in altered gene expression (40).
Another important finding from the present study is that similar GATA1 motif deletions did not always result in a complete abrogation of target gene expression. Although more examples are needed to confirm the exact principles and extent of context dependence (i.e., promoter vs. intronic/distal CRE), our results primarily suggest the possibility of enhancer additivity. Our investigation of multiple CREs proximal to EPB41, HK1, ALAS2, and SLC4A1 showed that each CRE independently regulated the expression of its target gene.
Finally, determining the pathogenicity and/or causality of NC genetic variants that either result in disease or modify disease severity remains difficult (13, 53). To address this, we coupled experimental and bioinformatics approaches to show that mutations in GATA1 and CF BSs can substantially affect genes implicated in MEDs, and created mutation maps of predicted mutations across CREs proximal to MED genes (15, 16). These maps may prove useful for prioritizing variants from WGS or targeted sequencing of MED cases, as we have initially shown for PKLR-RE1 mutations that disrupt a TAL1 binding motif, and can identify disruptive as well as gain-of-function mutations. Nevertheless, our maps might not represent all regulatory modalities, such as the effects of currently unassayed TFs or complex REs (e.g., barrier insulators) (54). Furthermore, experimental validation of an associated variant (using, e.g., CRISPR/Cas9 genome editing) remains the cornerstone for determining causality (13). Altogether, our combined approach for interrogating NC genetic variation reveals important aspects of both GATA1 transcriptional activity and erythroid CREs, and likely will prove useful for identifying and dissecting NC mutations in MEDs.
Materials and Methods
Experimental Outline.
K562 cells were cotransfected with Cas9 nuclease and sgRNA plasmids and subjected to puromycin selection and limiting dilutions. Clones were screened for small deletions of the WGATAR motif via PCR, and successful clones were selected for downstream assays. The experimental procedures are described in detail in SI Materials and Methods.
Bioinformatic and Statistical Analyses.
Unless specified otherwise, the two-tailed Student t test was used for comparisons between groups. ChIP-seq and RNA-seq were processed as described previously (32). Random forests were used to model expression, k-means and PAM were used for clustering, and gkmer-SVM and DeepBind were used to create mutation maps (15, 16). Complete details are available in SI Materials and Methods.
SI Materials and Methods
Plasmid Preparation.
Short guide RNA (sgRNA) sequences (Table S1) were cloned into the pSg1 vector (Addgene) and the XPR5 lentiviral vector (Broad Institute), respectively. [The XPR5 vector contains the Cas9 nuclease and a red fluorescent protein (RFP) cassette.] The Cas9 nuclease expression vector used was pxPR_BRD001, which contains a puromycin resistance cassette as a selection marker. Off-target scores for each guide were calculated using the CRISPR design tool (CRISPR Design; crispr.mit.edu); only guides with a score >50 (except for a score of 49 in one case) were used.
Table S1.
sgRNA oligos and primers for PCR screening, qRT-PCR, and sequencing
| sgRNA oligo/primer | 5′ → 3′ |
| sgRNA oligos for pSg1 cloning | |
| ALAS2 sgRNA FW | AACTCTGGCAACTTTATCTGGTTTT |
| ALAS2 sgRNA RV | CAGATAAAGTTGCCAGAGTTCGGTG |
| PKLR sgRNA FW | AAACTGCTGGTCTTATCTAAGTTTT |
| PKLR sgRNA RV | TTAGATAAGACCAGCAGTTTCGGTG |
| UROS sgRNA FW | GAAGACCCCTGTCACTGATAGTTTT |
| UROS sgRNA RV | TATCAGTGACAGGGGTCTTCCGGTG |
| EPB41-1a sgRNA FW | GCCCAGGCTCTGACAGGATAGTTTT |
| EPB41-1a sgRNA RV | TATCCTGTCAGAGCCTGGGCCGGTG |
| EPB41-1b sgRNA FW | CTTGAGGTGGGTGATAAAGAGTTTT |
| EPB41-1b sgRNA RV | TCTTTATCACCCACCTCAAGCGGTG |
| EPB41-1c sgRNA FW | CTGTGGGGCGCTGATAAGCTGTTTT |
| EPB41-1c sgRNA RV | AGCTTATCAGCGCCCCACAGCGGTG |
| EPB41-2 sgRNA FW | CACACACACTCTTATCAGGCGTTTT |
| EPB41-2 sgRNA RV | GCCTGATAAGAGTGTGTGTGCGGTG |
| KCNN4 sgRNA FW | CGGCACACCCCACTTATCTCGTTTT |
| KCNN4 sgRNA RV | GAGATAAGTGGGGTGTGCCGCGGTG |
| HK1-1 sgRNA FW | GACTCAGTGTTACTTATCTGGTTTT |
| HK1-1 sgRNA RV | CAGATAAGTAACACTGAGTCCGGTG |
| HK1-2 sgRNA FW | AGGGTTTGCTGGCTCAGATAGTTTT |
| HK1-2 sgRNA RV | TATCTGAGCCAGCAAACCCTCGGTG |
| SLC4A1-1 sgRNA FW | TGTGCTGCCTAGCACTGATAGTTTT |
| SLC4A1-1 sgRNA RV | TATCAGTGCTAGGCAGCACACGGTG |
| SLC4A1-2 sgRNA FW | GTGGAGGGAGAAGATAGCTCGTTTT |
| SLC4A1-2 sgRNA RV | GAGCTATCTTCTCCCTCCACCGGTG |
| sgRNA oligos for XPR5 cloning | |
| ALAS2 shRNA FW | CACCGAACTCTGGCAACTTTATCTG |
| ALAS2 shRNA RV | AAACCAGATAAAGTTGCCAGAGTTC |
| PKLR shRNA FW | CACCGAAACTGCTGGTCTTATCTAA |
| PKLR shRNA RV | AAACTTAGATAAGACCAGCAGTTTC |
| UROS shRNA FW | CACCGAAGACCCCTGTCACTGATA |
| UROS shRNA RV | AAACTATCAGTGACAGGGGTCTTC |
| Primers for PCR screening | |
| ALAS2 FW | TGCCTGCTTGTGAAAGCTAA |
| ALAS2 RV1 | GGAGTGGTCAGACCCCAAT |
| ALAS2 RV2 | GGCGATAAACTCTGGCAACTTTA |
| PKLR FW | CGGGACCATGGAATGAGAG |
| PKLR RV1 | TGTGCCCCTTTTCTCTTCTC |
| PKLR RV2 | CTTTTCTCTTCTCTGTCTCCCTTAGAT |
| UROS FW | GCACTAATGGGCTTGTTCTTTC |
| UROS RV1 | TGGTTTCATCTGTCTTTCCAAG |
| UROS RV2 | CATGCTCTTTCTTGGCCTTA |
| Primers for qRT-PCR | |
| B-actin FW | AGAAAATCTGGCACCACACC |
| B-actin RV | GGGGTGTTGAAGGTCTCAAA |
| ALAS2 FW | ACCTACCGTGTGTTCAAGACT |
| ALAS2 RV | AGATGCCTCAGAGAAATGTTGG |
| PKLR FW | TCAAGGCCGGGATGAACATTG |
| PKLR RV | CTGAGTGGGGAACCTGCAAAG |
| UROS FW | GCCAAGTCAGTGTATGTGGTT |
| UROS RV | GCAATCCCTTTGTCCTTGAGC |
| EPB41 FW | TGAACTGGGAGACTACGACCC |
| EPB41 RV | AGCTGGAGTCATGGACCTGT |
| KCNN4 FW | CTGCTGCGTCTCTACCTGG |
| KCNN4 RV | AGGGTGCGTGTTCATGTAAAG |
| HK1 FW | GCTCTCCGATGAAACTCTCATAG |
| HK1 RV | GGACCTTACGAATGTTGGCAA |
| SLC4A1 FW | GGTGATGGACGAAAAGAACCA |
| SLC4A1 RV | AAGACTCTACGCAGCTCTAGG |
| Primers for sequencing | |
| ALAS2 PCR FW | CAGCCTGGGTTGGTATGTG |
| ALAS2 PCR RV | TAGCCAGATGCTCAGACGTG |
| ALAS2 SEQ FW | TCAGCTGTCAAACGTGAGGT |
| PKLR PCR FW | CCTCTCTGGGTCTCCCTCTC |
| PKLR PCR RV | GAGGAAATGCCAGGAGATGA |
| PKLR SEQ FW | GGCTTCTGTCTCCCCTTCTT |
| UROS PCR FW | AGGGATCAAAGTGGCTTCAA |
| UROS PCR RV | TCTTTCCGGAACCATAAACG |
| UROS SEQ FW | TCCTAAGCAATTTCCGATGG |
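The oligo pairs in Table S1 appear to follow a fixed pattern around each 20-nt guide. For pSg1, the forward oligo is the guide followed by GTTTT and the reverse oligo is the reverse complement of the guide followed by CGGTG. For XPR5, the forward oligo is CACC plus the guide (with a G prepended when the guide does not already start with G), and the reverse oligo is AAAC plus the reverse complement of that G-initiated guide. This pattern is inferred from the table rows and is not stated in the cloning protocol; the function names in the following Python sketch are hypothetical.

```python
# Illustrative sketch only: overhang patterns are inferred from Table S1,
# not taken from a stated cloning protocol. Function names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def psg1_oligos(guide):
    """Forward/reverse oligos matching the pSg1 rows of Table S1."""
    return guide + "GTTTT", reverse_complement(guide) + "CGGTG"


def xpr5_oligos(guide):
    """Forward/reverse oligos matching the XPR5 rows of Table S1."""
    # A leading G is added only when the guide does not already start with G.
    g_guide = guide if guide.startswith("G") else "G" + guide
    return "CACC" + g_guide, "AAAC" + reverse_complement(g_guide)


alas2_guide = "AACTCTGGCAACTTTATCTG"  # from the ALAS2 rows of Table S1
uros_guide = "GAAGACCCCTGTCACTGATA"   # from the UROS rows of Table S1

print(psg1_oligos(alas2_guide))
print(xpr5_oligos(alas2_guide))
print(xpr5_oligos(uros_guide))
```

Under these assumptions, the sketch reproduces the ALAS2 and UROS oligo rows listed above, which provides a quick consistency check when new guides are added to the table.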
Cell Culture and Lentivirus Production.
The K562 cells (American Type Culture Collection) were maintained in RPMI medium 1640 plus l-glutamine (Life Technologies) supplemented with 10% (vol/vol) FBS (Atlanta Biologicals) and 1% penicillin-streptomycin (PS; Life Technologies). Cells were incubated at 37 °C with 5% CO2 in air atmosphere. The 293T cells were maintained in DMEM, high glucose (Life Technologies) supplemented with 10% FBS and 1% PS. For lentivirus production, 293T cells were transfected with sgRNA constructs (Table S1) along with the VSV-G envelope and the pDelta8.9 packaging vector using Fugene6 (Roche) transfection reagent according to the manufacturer’s instructions. The next day, the medium was changed to Iscove’s modified Dulbecco's medium (IMDM; Life Technologies) supplemented with 2% (vol/vol) human AB plasma, 3% (vol/vol) human AB serum (Atlanta Biologicals), and 1% PS. The next day (48 h posttransfection), viral supernatant was collected and filtered through a 0.45-µm filter, and then concentrated by centrifuging at 68,320 × g for 2 h at 4 °C with an Optima L-100 XP ultracentrifuge (Beckman Coulter). Concentrated virus was dissolved in 2 mL of supernatant overnight at 4 °C, and infection was performed the next day.
Primary Cell Culture and Lentiviral Infection.
Adult CD34+ HSPCs from mobilized peripheral blood mononuclear cells (Harvard Stem Cell Institute Flow Cytometry Facility) were maintained in base medium (IMDM plus 2% human AB plasma, 3% human AB serum, 1% PS, 3 U/mL heparin, 10 µg/mL insulin, and 200 µg/mL holo-transferrin) supplemented with 3 U/mL erythropoietin, 10 ng/mL human recombinant stem cell factor, and 1 ng/mL human recombinant IL-3 for days 0–6 of culture. IL-3 was omitted on days 7–12 of culture. Cells were incubated at 37 °C with 5% CO2. For lentiviral infection, cells (at day 2 of culture) were spun at 931 × g for 90 min at room temperature with 8 µg/mL polybrene (Millipore), 1 mL of concentrated virus, and 1 mL of IMDM supplemented with 2% human AB plasma, 3% human AB serum, and 1% PS. The next day, the medium was changed to remove the polybrene and concentrated virus. At 3 d postinfection, RFP was measured by flow cytometry analysis with a BD Accuri C6 flow cytometer (BD Biosciences).
Fluorescence-Activated Cell Sorting and Analysis.
For fluorescence-activated cell sorting (FACS), cells were harvested at day 10 of culture, washed with FACS buffer [3% (vol/vol) FBS in PBS], and then resuspended in 600 µL of FACS buffer. RFP-positive cells were then sorted using a FACSAria II cell sorter (BD Biosciences). For flow cytometry analysis of differentiation, cells were harvested at day 12 of culture, washed with FACS buffer, stained with allophycocyanin-conjugated anti-human CD235a antibody (17-9987; eBioscience), and analyzed with a BD Accuri C6 flow cytometer.
DNA Transfection and Puromycin Selection.
K562 cells were cotransfected with 1 µg total of Cas9 nuclease and sgRNA plasmids using Lipofectamine LTX Plus Reagent (Thermo Fisher Scientific) at a 1:2 ratio of Cas9 to sgRNA. For a control, K562 cells were cotransfected with 1 µg total of Cas9 nuclease and pLKO.1-GFP plasmid at a 1:2 ratio of Cas9 to pLKO.1-GFP. At 24 h after cotransfection, puromycin was added at a concentration of 2 µg/mL, followed 24 h later by a reduction to 1 µg/mL for an additional 24 h. Selection efficiency was assessed by flow cytometry with propidium iodide staining (to assess viability) on a FACSCanto II flow cytometer (BD Biosciences). Limiting dilutions were performed to obtain single cell-derived clonal populations for both cells targeted with sgRNAs as well as for GFP controls. Unless specified otherwise, three matching clonal GFP controls were analyzed for each experiment.
Identification of Deletions.
Genomic DNA was isolated from single clones using the DNeasy Blood & Tissue Kit (Qiagen). PCR analyses were performed using Platinum PCR Supermix (Invitrogen) and a T100 Thermal Cycler (Bio-Rad) with primer pairs flanking the sgRNA target sequences, as well as with primer pairs that select for clones containing a deletion within the targeted GATA1 binding site (Fig. S3). Genomic DNA isolated for screening was PCR-amplified using Platinum PCR Supermix (Invitrogen) on a T100 Thermal Cycler (Bio-Rad) (Table S1), and purified with the QIAquick PCR Purification Kit (Qiagen). The product was Sanger-sequenced, and trace files were analyzed with FinchTV (Geospiza). Chromatograms were compared with DNA sequences obtained from the University of California Santa Cruz (UCSC) Genome Browser (reference genome GRCh37/hg19).
When identifying deletions, it is important to keep in mind that K562 cells are aneuploid in certain regions. Here we aimed to completely disrupt the targeted RE regardless of the copy number at a particular locus. For all clonal deletions (either homozygous or compound heterozygous deletions), we observed Sanger-sequencing traces only with a deletion and did not observe any traces of the reference K562 DNA, confirming that the targeted CRE was modified in all copies. In cases with compound heterozygosity, we typically observed similar intensities for both alleles, suggesting that the loci were diploid.
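As an illustration of comparing an edited clone against the reference for loss of a WGATAR site, the following Python sketch scans both strands for the motif (W = A/T, R = A/G; reverse complement YTATCW, Y = C/T). The reference fragment is the ALAS2 guide sequence from Table S1; the clone sequence is a hypothetical 4-nt deletion invented for the example, and the function names are assumptions.

```python
import re

# WGATAR on the forward strand, or YTATCW (its reverse complement).
# The lookahead lets overlapping sites be reported.
GATA_MOTIF = re.compile(r"(?=([AT]GATA[AG]|[CT]TATC[AT]))")


def find_gata_motifs(seq):
    """Return (0-based start, matched 6-mer) for each WGATAR site on either strand."""
    return [(m.start(), m.group(1)) for m in GATA_MOTIF.finditer(seq.upper())]


def motif_disrupted(reference, clone):
    """True if the clone carries fewer WGATAR sites than the reference."""
    return len(find_gata_motifs(clone)) < len(find_gata_motifs(reference))


reference = "AACTCTGGCAACTTTATCTG"            # ALAS2 guide region (Table S1)
clone = reference[:14] + reference[18:]       # hypothetical deletion of "TATC"

print(find_gata_motifs(reference))
print(find_gata_motifs(clone))
print(motif_disrupted(reference, clone))
```

A site count is a coarse check; it does not replace inspection of the Sanger traces, particularly for compound heterozygous clones in which each allele carries a different deletion.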
Quantitative RT-PCR.
RNA was extracted from selected clones using the RNeasy Plus Mini Kit (Qiagen) and the Ambion RNAqueous–Micro Total RNA Isolation Kit (Life Technologies). cDNA was synthesized using the iScript cDNA Synthesis Kit (Bio-Rad). Quantitative RT-PCR (qRT-PCR) was performed with iQ SYBR Green Supermix (Bio-Rad) on a CFX96 Real-Time PCR System (Bio-Rad). Primer sequences are listed in Table S1. mRNA expression levels were calculated via the ΔΔCt method and normalized to β-actin levels (55).
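For reference, the ΔΔCt calculation can be sketched as follows. The sketch assumes approximately 100% amplification efficiency (a doubling per cycle), which is the standard assumption of the method; the Ct values and the function name are hypothetical and are not data from this study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the delta-delta-Ct method.

    Each Ct is normalized to the reference gene (here beta-actin) within the
    same sample, and the edited clone is then compared with the control.
    Assumes the amplicon doubles every cycle.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target crosses threshold 2.5 cycles later in
# the edited clone than in the control, with identical beta-actin Ct values.
fold = relative_expression(27.5, 17.0, 25.0, 17.0)
print(round(fold, 3))
```

With these example values the ΔΔCt is 2.5, corresponding to roughly 18% of control expression, that is, a reduction of more than 80%.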
Heme Quantification.
A total of 250,000 cells from the controls and ALAS2 clones were counted with a hemacytometer and harvested for heme quantification using the QuantiChrom Heme Assay Kit (BioAssay Systems; DIHM-250). The assay was performed in triplicate, following the manufacturer’s protocol.
Porphyrin Quantification.
A total of 100,000 cells from the controls and UROS clones were seeded in RPMI medium 1640 including l-glutamine supplemented with 10% FBS and 1% PS, along with 1 mM δ-aminolevulinic acid (Sigma-Aldrich), at a density of 50,000 cells/mL. Cells were incubated at 37 °C with 5% CO2, protected from light, for 72 h. The cells were then centrifuged at 456 × g for 5 min at 4 °C, and 1 mL of supernatant was collected for analysis. Samples were deproteinized with an equal volume of 20% TCA/DMSO (1:1, vol/vol), incubated on ice for 15 min, and then centrifuged at 10,000 × g for 10 min. Porphyrins in the samples were separated by ultra-performance liquid chromatography (UPLC) in an ACQUITY UPLC system (Waters) on a BEH C18 2.1 × 100 mm column (Waters) as described previously (56).
Eluent A was 1 mol/L ammonium acetate–acetic acid buffer, pH 5.16, with 0.02% sodium azide, and eluent B was neat acetonitrile. The elution program was as follows: 2 min concave (Waters #8) gradient from 8% (vol/vol) to 70% (vol/vol) B, followed by 0.5 min at 70% B, and a 1.5-min reequilibration at 8% B at a flow rate of 0.6 mL/min. Column temperature was maintained at 65 °C. Porphyrins were detected using a Waters ACQUITY UPLC fluorescence detector with 405 nm excitation, 619 nm emission, and a gain of 100. For porphyrin quantitation, the UPLC chromatograph was standardized using the URO I and COPRO III fluorescence standards UFS-1 and CFS-3, respectively (Frontier Scientific). All experiments were performed in triplicate.
Intracellular Flow Cytometry.
Cells from the controls and PKLR clones were harvested for intracellular staining with the anti-PKLR antibody (1H9, ab123908; Abcam), using a modified protocol and reagents from the Click-iT Flow Cytometry Assay Kit (Life Technologies; C-10418). Cells were analyzed on a FACSCanto II flow cytometer (BD Biosciences).
Pyruvate Kinase Enzymatic Activity Quantification.
A total of 50,000 cells of the controls and PKLR clones were harvested. The assay was performed with the Sigma-Aldrich Pyruvate Kinase Activity Assay Kit (MAK072) following the manufacturer’s protocol.
ChIP-PCR.
A total of 60,000,000 cells of the controls and clones ALAS2 A-4, UROS U-1, and PKLR P-1 were harvested and fixed with 1% formaldehyde in preparation for ChIP with 10 µg of normal rabbit IgG (sc-2027x; Santa Cruz Biotechnology) or TAL1 clone C-21 (sc-12984; Santa Cruz Biotechnology). ChIP DNA was then quantified against the ALAS2, PKLR, and UROS loci via qRT-PCR using iQ SYBR Green Supermix (Bio-Rad) on the CFX96 Real-Time PCR System (Bio-Rad).
Identifying Putative CREs in MEDs.
MEDs whose genes act intrinsically within the erythroid lineage (ANK1, SPTB, SPTA1, SLC4A1, EPB42, EPB41, PIEZO1, KCNN4, GLUT1, G6PD, PKLR, NT5C3A, HK1, GPI, PGK1, ALDOA, TPI1, PFKM, ALAS2, UROS, and FECH) were selected from the literature. CREs were defined as NDR peaks in proEs that were proximal to the target gene (manually defined within 500 kb of each MED gene so as not to include nearby promoters or regulatory regions that were more likely to specifically regulate other genes).
Bioinformatics and Statistical Analyses.
The Student t test was used to compare mRNA expression, functional assays, and ChIP-PCR results between controls and CRISPR-edited clones. Unless specified otherwise, three replicates were performed for each experiment, and error bars represent the SEM. Disruptions of GATA1 binding motifs for selected mutations were initially identified using TRAP (57). ChIP-seq data for GATA1, TAL1, NFE2, and KLF1 were analyzed as described previously (32) and displayed as input-normalized reads per million. Raw ChIP-seq data were obtained and combined from multiple studies (31, 32, 58–62). LDB1 ChIP-seq and FAIRE-seq in EBs were obtained from www.ncbi.nlm.nih.gov/geo/ (accession nos. GSE52637 and GSE36985) and processed similarly (58, 63). GATA1 and NDR peaks were defined using MACS2 (64). DHS-seq for K562 cells was obtained from the Integrative Genomics Viewer (IGV) server, and IGV was used for visualization of genome-wide assays. RNA-seq data across the erythroid lineage were obtained from www.ncbi.nlm.nih.gov/geo/ (accession nos. GSE61566 and GSE53983) and processed as described previously (32, 65). PhastCons scores derived from 100 vertebrates were downloaded from hgdownload.cse.ucsc.edu/goldenpath/hg19/phastCons100way/ and compared using the seqplots package in R/Bioconductor. K-means clustering was used to derive clusters of conservation across GATA1 motifs. TF binding intensities at promoters (defined as ±1 kb from the TSS) were determined using the UCSC Genome Browser tool bigWigAverageOverBed; the maximum value in an interval was taken. When multiple promoters were present for a single annotated gene, the promoter with the maximum TF intensity was chosen. Intensities were then log2-transformed, and negative values (i.e., when input is greater than ChIP) were set to 0.
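The promoter-level summary described above (±1 kb window, maximum across alternative promoters, log2 transform, negative values set to 0) can be sketched in Python as follows. The sketch assumes that each input value is a ChIP/input ratio already extracted per promoter; that input format, the gene names, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def promoter_window(tss, flank=1000):
    """Promoter interval defined as +/- flank bp around the TSS."""
    return max(0, tss - flank), tss + flank


def gene_intensities(promoter_signal):
    """Collapse per-promoter signal to one value per gene.

    promoter_signal: iterable of (gene, max ChIP/input ratio in the window).
    The ratio is log2-transformed, negative values (input > ChIP) are set
    to 0, and the strongest promoter is kept for each gene.
    """
    best = defaultdict(float)
    for gene, ratio in promoter_signal:
        value = max(0.0, math.log2(ratio)) if ratio > 0 else 0.0
        best[gene] = max(best[gene], value)
    return dict(best)


# Hypothetical example: GENE_A has two annotated promoters.
signal = [("GENE_A", 8.0), ("GENE_A", 2.0), ("GENE_B", 0.5)]
print(promoter_window(5000))
print(gene_intensities(signal))
```

Because both the log2 transform and the clamp at 0 are monotonic, taking the maximum before or after the transform gives the same per-gene value.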
Random forest models of proE gene expression were then learned on a training set of TF binding intensities for 15,000 genes with the randomForest package in R with the parameters mtry = 2, mtree = 300, and ntree = 501. A total of 3,000 genes were held out as a test set, and fit was reported for this set.
TF binding intensities were then transformed to z-scores, and PAM was used to identify TF binding intensity clusters with the pamk function from the fpc package in R. Enhancers near genes in each cluster were defined by EB DHS peaks within ±100 kb of the TSS (excluding DHS peaks overlapping ±2 kb of each TSS). Similar to promoter analyses, LDB1, GATA1, and TAL1 binding intensities were mapped to each enhancer and summed for each gene. The Mann–Whitney U test was used to compare TF intensities across two clusters.
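The enhancer definition used here (DHS peaks within ±100 kb of the TSS, excluding peaks overlapping ±2 kb of the TSS, with intensities summed per gene) can be illustrated with a short Python sketch. It assumes that a peak counts as "within" the window if it overlaps it, and that all coordinates lie on the same chromosome; the coordinates, intensities, and function name are hypothetical.

```python
def assign_enhancers(tss, peaks, window=100_000, exclude=2_000):
    """Select enhancer peaks for one gene and sum their TF intensities.

    peaks: list of (start, end, intensity) on the same chromosome as tss.
    A peak is kept if it overlaps tss +/- window and does not overlap
    tss +/- exclude (the promoter-proximal region).
    """
    kept = []
    for start, end, intensity in peaks:
        in_window = end >= tss - window and start <= tss + window
        near_tss = end >= tss - exclude and start <= tss + exclude
        if in_window and not near_tss:
            kept.append((start, end, intensity))
    return kept, sum(p[2] for p in kept)


# Hypothetical peaks around a TSS at position 1,000,000.
peaks = [
    (950_000, 950_400, 2.5),      # distal, kept
    (999_000, 999_500, 4.0),      # overlaps TSS +/- 2 kb, excluded
    (1_050_000, 1_050_200, 1.5),  # distal, kept
    (1_150_000, 1_150_300, 1.0),  # beyond 100 kb, ignored
]
kept, total = assign_enhancers(1_000_000, peaks)
print(kept, total)
```

Per-gene sums of this kind, computed separately for LDB1, GATA1, and TAL1, are the quantities that would then be compared between clusters.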
A gkmer-SVM and a convolutional neural network (DeepBind) were used to predict the effects of NC mutations (15, 16). K562 weights for gkmer-SVM were obtained from www.beerlab.org/deltasvm/, and weights for the EB NDRs were derived according to the protocol of Lee et al. (16).
Trained models of GATA1, TAL1, NFE2, and KLF1 (SP1 SELEX-seq) binding were obtained from tools.genes.toronto.edu/deepbind/ (D00765.001, D00815.001, D00535.004, and D00650.005, respectively). Mutation maps for both the gkmer-SVM and DeepBind predictions were created using a custom R script following the outline in the supplemental notes of Alipanahi et al. (15). Predictions were averaged across windows of 10 nt for gkmer-SVM and 24 nt for DeepBind unless specified otherwise (e.g., in the PKLR promoter where multiple GATA1 motifs were observed within a single 24-nt window).
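The general shape of a mutation map can be illustrated with in silico saturation mutagenesis: every position is substituted with each alternative base, and the change in a model score is averaged over all windows that cover the position. The Python sketch below is not the custom R script used in this study. Its scoring function is a hypothetical stand-in for a trained gkmer-SVM or DeepBind model (it merely counts WGATAR sites on either strand), and the window-averaging scheme is one plausible reading of the procedure.

```python
import re

BASES = "ACGT"
_GATA = re.compile(r"(?=[AT]GATA[AG]|[CT]TATC[AT])")


def toy_score(seq):
    """Hypothetical stand-in for a trained model: WGATAR sites on either strand."""
    return float(len(_GATA.findall(seq)))


def mutation_map(seq, score=toy_score, window=10):
    """Return {(position, alt_base): mean score change over covering windows}."""
    seq = seq.upper()
    n = len(seq)
    if n < window:
        raise ValueError("sequence is shorter than the scoring window")
    result = {}
    for pos in range(n):
        # Start coordinates of every full-length window that contains pos.
        starts = range(max(0, pos - window + 1), min(pos, n - window) + 1)
        for alt in BASES:
            if alt == seq[pos]:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            deltas = [score(mutant[s:s + window]) - score(seq[s:s + window])
                      for s in starts]
            result[(pos, alt)] = sum(deltas) / len(deltas)
    return result


# ALAS2 guide region from Table S1; the GATA site (TTATCT) starts at index 13.
effects = mutation_map("AACTCTGGCAACTTTATCTG")
print(effects[(15, "C")])  # substitution inside the motif
print(effects[(0, "C")])   # substitution outside the motif
```

Negative values mark substitutions predicted to disrupt binding, and positive values mark candidate gain-of-function changes; with a real model, the same loop would call the trained predictor in place of `toy_score`.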
Acknowledgments
We thank B. Cleary, C. Fulco, R. Tewhey, D. Bishop, and members of the V.G.S. laboratory for valuable comments and discussions. This work was supported by National Institutes of Health Grants R01 DK103794 and R21 HL120791 (to V.G.S.).
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1521754113/-/DCSupplemental.
References
- 1. Bamshad MJ, et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011;12(11):745–755. doi: 10.1038/nrg3031.
- 2. Sankaran VG, Gallagher PG. Applications of high-throughput DNA sequencing to benign hematology. Blood. 2013;122(22):3575–3582. doi: 10.1182/blood-2013-07-460337.
- 3. Sankaran VG, et al. Exome sequencing identifies GATA1 mutations resulting in Diamond-Blackfan anemia. J Clin Invest. 2012;122(7):2439–2443. doi: 10.1172/JCI63597.
- 4. Babbs C, et al.; WGS500 Consortium. Homozygous mutations in a predicted endonuclease are a novel cause of congenital dyserythropoietic anemia type I. Haematologica. 2013;98(9):1383–1387. doi: 10.3324/haematol.2013.089490.
- 5. Sankaran VG, et al. X-linked macrocytic dyserythropoietic anemia in females with an ALAS2 mutation. J Clin Invest. 2015;125(4):1665–1669. doi: 10.1172/JCI78619.
- 6. Zarychanski R, et al. Mutations in the mechanotransduction protein PIEZO1 are associated with hereditary xerocytosis. Blood. 2012;120(9):1908–1915. doi: 10.1182/blood-2012-04-422253.
- 7. Glogowska E, Lezon-Geyda K, Maksimova Y, Schulz VP, Gallagher PG. Mutations in the Gardos channel (KCNN4) are associated with hereditary xerocytosis. Blood. 2015;126(11):1281–1284. doi: 10.1182/blood-2015-07-657957.
- 8. Finberg KE, et al. Mutations in TMPRSS6 cause iron-refractory iron deficiency anemia (IRIDA). Nat Genet. 2008;40(5):569–571. doi: 10.1038/ng.130.
- 9. Yang Y, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med. 2013;369(16):1502–1511. doi: 10.1056/NEJMoa1306555.
- 10. Chilamakuri CS, et al. Performance comparison of four exome capture systems for deep sequencing. BMC Genomics. 2014;15:449. doi: 10.1186/1471-2164-15-449.
- 11. Pabinger S, et al. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform. 2014;15(2):256–278. doi: 10.1093/bib/bbs086.
- 12. Ward LD, Kellis M. Interpreting noncoding genetic variation in complex traits and human disease. Nat Biotechnol. 2012;30(11):1095–1106. doi: 10.1038/nbt.2422.
- 13. MacArthur DG, et al. Guidelines for investigating causality of sequence variants in human disease. Nature. 2014;508(7497):469–476. doi: 10.1038/nature13127.
- 14. Cooper DN, et al. Genes, mutations, and human inherited disease at the dawn of the age of personalized genomics. Hum Mutat. 2010;31(6):631–655. doi: 10.1002/humu.21260.
- 15. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–838. doi: 10.1038/nbt.3300.
- 16. Lee D, et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet. 2015;47(8):955–961. doi: 10.1038/ng.3331.
- 17. Claussnitzer M, et al. FTO obesity variant circuitry and adipocyte browning in humans. N Engl J Med. 2015;373(10):895–907. doi: 10.1056/NEJMoa1502214.
- 18. Farh KK, et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 2015;518(7539):337–343. doi: 10.1038/nature13835.
- 19. Soccio RE, et al. Genetic variation determines PPARγ function and anti-diabetic drug response in vivo. Cell. 2015;162(1):33–44. doi: 10.1016/j.cell.2015.06.025.
- 20. Solis C, Aizencang GI, Astrin KH, Bishop DF, Desnick RJ. Uroporphyrinogen III synthase erythroid promoter mutations in adjacent GATA1 and CP2 elements cause congenital erythropoietic porphyria. J Clin Invest. 2001;107(6):753–762. doi: 10.1172/JCI10642.
- 21. Campagna DR, et al. X-linked sideroblastic anemia due to ALAS2 intron 1 enhancer element GATA binding site mutations. Am J Hematol. 2014;89(3):315–319. doi: 10.1002/ajh.23616.
- 22. Kaneko K, et al. Identification of a novel erythroid-specific enhancer for the ALAS2 gene and its loss-of-function mutation which is associated with congenital sideroblastic anemia. Haematologica. 2014;99(2):252–261. doi: 10.3324/haematol.2013.085449.
- 23. Manco L, et al. A new PKLR gene mutation in the R-type promoter region affects the gene transcription causing pyruvate kinase deficiency. Br J Haematol. 2000;110(4):993–997. doi: 10.1046/j.1365-2141.2000.02283.x.
- 24. Nakajima T, et al. Mutation of the GATA site in the erythroid cell-specific regulatory element of the ABO gene in a Bm subgroup individual. Transfusion. 2013;53(11) Suppl 2:2917–2927. doi: 10.1111/trf.12181.
- 25. Matsuda M, Sakamoto N, Fukumaki Y. Delta-thalassemia caused by disruption of the site for an erythroid-specific transcription factor, GATA-1, in the delta-globin gene promoter. Blood. 1992;80(5):1347–1351.
- 26. Campbell AE, Wilkinson-White L, Mackay JP, Matthews JM, Blobel GA. Analysis of disease-causing GATA1 mutations in murine gene complementation systems. Blood. 2013;121(26):5218–5227. doi: 10.1182/blood-2013-03-488080.
- 27. Iwasaki H, et al. GATA-1 converts lymphoid and myelomonocytic progenitors into the megakaryocyte/erythrocyte lineages. Immunity. 2003;19(3):451–462. doi: 10.1016/s1074-7613(03)00242-5.
- 28. Kulessa H, Frampton J, Graf T. GATA-1 reprograms avian myelomonocytic cell lines into eosinophils, thromboblasts, and erythroblasts. Genes Dev. 1995;9(10):1250–1262. doi: 10.1101/gad.9.10.1250.
- 29. Cheng Y, et al. Erythroid GATA1 function revealed by genome-wide analysis of transcription factor occupancy, histone modifications, and mRNA expression. Genome Res. 2009;19(12):2172–2184. doi: 10.1101/gr.098921.109.
- 30. Wu W, et al. Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration. Genome Res. 2011;21(10):1659–1671. doi: 10.1101/gr.125088.111.
- 31. Su MY, et al. Identification of biologically relevant enhancers in human erythroid cells. J Biol Chem. 2013;288(12):8433–8444. doi: 10.1074/jbc.M112.413260.
- 32. Ulirsch JC, et al. Altered chromatin occupancy of master regulators underlies evolutionary divergence in the transcriptional landscape of erythroid differentiation. PLoS Genet. 2014;10(12):e1004890. doi: 10.1371/journal.pgen.1004890.
- 33. Love PE, Warzecha C, Li L. Ldb1 complexes: The new master regulators of erythroid gene transcription. Trends Genet. 2014;30(1):1–9. doi: 10.1016/j.tig.2013.10.001.
- 34. Pilon AM, et al.; NISC Comparative Sequencing Center. Genome-wide ChIP-Seq reveals a dramatic shift in the binding of the transcription factor erythroid Kruppel-like factor during erythrocyte differentiation. Blood. 2011;118(17):e139–e148. doi: 10.1182/blood-2011-05-355107.
- 35. Li L, et al. Ldb1-nucleated transcription complexes function as primary mediators of global erythroid gene activation. Blood. 2013;121(22):4575–4585. doi: 10.1182/blood-2013-01-479451.
- 36. Krivega I, Dale RK, Dean A. Role of LDB1 in the transition from chromatin looping to transcription activation. Genes Dev. 2014;28(12):1278–1290. doi: 10.1101/gad.239749.114.
- 37. Kassouf MT, et al. Genome-wide identification of TAL1’s functional targets: Insights into its mechanisms of action in primary erythroid cells. Genome Res. 2010;20(8):1064–1083. doi: 10.1101/gr.104935.110.
- 38. Dogan N, et al. Occupancy by key transcription factors is a more accurate predictor of enhancer activity than histone modifications or chromatin accessibility. Epigenetics Chromatin. 2015;8:16–36. doi: 10.1186/s13072-015-0009-5.
- 39. Kang Y, Kim YW, Yun J, Shin J, Kim A. KLF1 stabilizes GATA-1 and TAL1 occupancy in the human β-globin locus. Biochim Biophys Acta. 2015;1849(3):282–289. doi: 10.1016/j.bbagrm.2014.12.010.
- 40. Wienert B, et al. Editing the genome to introduce a beneficial naturally occurring mutation associated with increased fetal globin. Nat Commun. 2015;6:7085–7092. doi: 10.1038/ncomms8085.
- 41. Wadman IA, et al. The LIM-only protein Lmo2 is a bridging molecule assembling an erythroid, DNA-binding complex which includes the TAL1, E47, GATA-1, and Ldb1/NLI proteins. EMBO J. 1997;16(11):3145–3157. doi: 10.1093/emboj/16.11.3145.
- 42. Wilson NK, et al. Combinatorial transcriptional control in blood stem/progenitor cells: Genome-wide analysis of ten major transcriptional regulators. Cell Stem Cell. 2010;7(4):532–544. doi: 10.1016/j.stem.2010.07.016.
- 43. Marcello AP, et al. A case of congenital red cell pyruvate kinase deficiency associated with hereditary stomatocytosis. Blood Cells Mol Dis. 2008;41(3):261–262. doi: 10.1016/j.bcmd.2008.07.001.
- 44. Wang H, et al. Experimental validation of predicted mammalian erythroid cis-regulatory modules. Genome Res. 2006;16(12):1480–1492. doi: 10.1101/gr.5353806.
- 45. Zanella A, Fermo E, Bianchi P, Valentini G. Red cell pyruvate kinase deficiency: Molecular and clinical aspects. Br J Haematol. 2005;130(1):11–25. doi: 10.1111/j.1365-2141.2005.05527.x.
- 46. Wu W, et al. Dynamic shifts in occupancy by TAL1 are guided by GATA factors and drive large-scale reprogramming of gene expression during hematopoiesis. Genome Res. 2014;24(12):1945–1962. doi: 10.1101/gr.164830.113.
- 47. Ballester B, et al. Multi-species, multi-transcription factor binding highlights conserved control of tissue-specific biological pathways. eLife. 2014;3:e02626. doi: 10.7554/eLife.02626.
- 48. Cheng Y, et al.; Mouse ENCODE Consortium. Principles of regulatory information conservation between mouse and human. Nature. 2014;515(7527):371–375. doi: 10.1038/nature13985.
- 49. Kager L, et al. Two novel missense mutations and a 5-bp deletion in the erythroid-specific promoter of the PKLR gene in two unrelated patients with pyruvate kinase-deficient transfusion-dependent chronic nonspherocytic hemolytic anemia. Pediatr Blood Cancer. 2016; in press. doi: 10.1002/pbc.25878.
- 50. van Wijk R, et al. Disruption of a novel regulatory element in the erythroid-specific promoter of the human PKLR gene causes severe pyruvate kinase deficiency. Blood. 2003;101(4):1596–1602. doi: 10.1182/blood-2002-07-2321.
- 51. Surinya KH, Cox TC, May BK. Identification and characterization of a conserved erythroid-specific enhancer located in intron 8 of the human 5-aminolevulinate synthase 2 gene. J Biol Chem. 1998;273(27):16798–16809. doi: 10.1074/jbc.273.27.16798.
- 52. Sankaran VG, Weiss MJ. Anemia: progress in molecular mechanisms and therapies. Nat Med. 2015;21(3):221–230. doi: 10.1038/nm.3814.
- 53. Galanello R, et al. Amelioration of Sardinian beta0 thalassemia by genetic modifiers. Blood. 2009;114(18):3935–3937. doi: 10.1182/blood-2009-04-217901.
- 54. Gallagher PG, et al. Mutation of a barrier insulator in the human ankyrin-1 gene is associated with hereditary spherocytosis. J Clin Invest. 2010;120(12):4453–4465. doi: 10.1172/JCI42240.
- 55. Sankaran VG, et al. Cyclin D3 coordinates the cell cycle during differentiation to regulate erythrocyte size and number. Genes Dev. 2012;26(18):2075–2087. doi: 10.1101/gad.197020.112.
- 56. Lim CK, Li FM, Peters TJ. High-performance liquid chromatography of porphyrins. J Chromatogr A. 1988;429:123–153. doi: 10.1016/s0378-4347(00)83869-4.
- 57. Thomas-Chollier M, et al. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nat Protoc. 2011;6(12):1860–1869. doi: 10.1038/nprot.2011.409.
- 58. Xu J, et al. Combinatorial assembly of developmental stage-specific enhancers controls gene expression programs during human erythropoiesis. Dev Cell. 2012;23(4):796–811. doi: 10.1016/j.devcel.2012.09.003.
- 59. Pinello L, Xu J, Orkin SH, Yuan GC. Analysis of chromatin-state plasticity identifies cell type-specific regulators of H3K27me3 patterns. Proc Natl Acad Sci USA. 2014;111(3):E344–E353. doi: 10.1073/pnas.1322570111.
- 60. Hu G, et al. Regulation of nucleosome landscape and transcription factor targeting at tissue-specific enhancers by BRG1. Genome Res. 2011;21(10):1650–1658. doi: 10.1101/gr.121145.111.
- 61. Fujiwara T, et al. Discovering hematopoietic mechanisms through genome-wide analysis of GATA factor chromatin occupancy. Mol Cell. 2009;36(4):667–681. doi: 10.1016/j.molcel.2009.11.001.
- 62. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi: 10.1038/nature11247.
- 63. Stadhouders R, et al. HBS1L-MYB intergenic variants modulate fetal hemoglobin via long-range MYB enhancers. J Clin Invest. 2014;124(4):1699–1710. doi: 10.1172/JCI71520.
- 64. Liu T. Use model-based analysis of ChIP-seq (MACS) to analyze short reads generated by sequencing protein-DNA interactions in embryonic stem cells. Methods Mol Biol. 2014;1150:81–95. doi: 10.1007/978-1-4939-0512-6_4.
- 65. An X, et al. Global transcriptome analyses of human and murine terminal erythroid differentiation. Blood. 2014;123(22):3466–3477. doi: 10.1182/blood-2014-01-548305.














