Author manuscript; available in PMC: 2015 Aug 1.
Published in final edited form as: Genomics. 2014 Jul 2;104(2):105–112. doi: 10.1016/j.ygeno.2014.04.006

Strategies to fine-map genetic associations with lipid levels by combining epigenomic annotations and liver-specific transcription profiles

Ken Sin Lo 1, Swarooparani Vadlamudi 2, Marie P Fogarty 2, Karen L Mohlke 2, Guillaume Lettre 1,3
PMCID: PMC4373602  NIHMSID: NIHMS672087  PMID: 24997396

Abstract

Characterization of the epigenome promises to yield the functional elements buried in the human genome sequence, thus helping to annotate non-coding DNA polymorphisms with regulatory functions. Here, we develop two novel strategies to combine epigenomic data with transcriptomic profiles in humans or mice to prioritize potential candidate SNPs associated with lipid levels by genome-wide association study (GWAS). First, after confirming that lipid-associated loci that are also expression quantitative trait loci (eQTL) in human livers are enriched for ENCODE regulatory marks in the human hepatocellular HepG2 cell line, we prioritize candidate SNPs based on the number of these marks that overlap the variant position. This method recognized the known SORT1 rs12740374 regulatory SNP associated with LDL-cholesterol, and highlighted candidate functional SNPs at 15 additional lipid loci. In the second strategy, we combine ENCODE chromatin immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) data and liver expression datasets from knockout mice lacking specific transcription factors. This approach identified SNPs in specific transcription factor binding sites that are located near target genes of these transcription factors. We show that FOXA2 transcription factor binding sites are enriched at lipid-associated loci and experimentally validate that alleles of one such proxy SNP located near the FOXA2 target gene BIRC5 show allelic differences in FOXA2-DNA binding and enhancer activity. These methods can be used to generate testable hypotheses for many non-coding SNPs associated with complex diseases or traits.

Keywords: FOXA1, FOXA2, HNF4A, BIRC5, ENCODE, GWAS

Introduction

Genome-wide association studies (GWAS) have identified thousands of robust associations between single nucleotide polymorphisms (SNPs) and complex human diseases and traits (Hindorff et al. 2009). These SNPs are often in linkage disequilibrium (LD) with many other known and unknown DNA sequence variants and are located within non-coding regions of the human genome. For these two reasons, at most GWAS loci it has been difficult to identify the genes and variants that are responsible for phenotypic variation. The 1000 Genomes Project has generated an extensive catalogue of genetic variation across several human populations, partly addressing the first challenge in GWAS fine-mapping projects (1000 Genomes Project Consortium 2010; 1000 Genomes Project Consortium 2012). As for the second challenge, investigators from the Encyclopedia of DNA Elements (ENCODE) Project recently summarized results from comprehensive whole-genome analyses of transcription, transcription factor association, chromatin structure, and histone modification, allowing for a functional annotation of non-coding DNA variants (Dunham et al. 2012). Furthermore, the ENCODE data might be useful to pinpoint functional regulatory variants from strongly correlated, but not functional, LD proxies. Many groups have already utilized their own epigenomic datasets or ENCODE data to show enrichment of chromatin marks at GWAS loci, to identify relevant tissues for experimental design or to prioritize candidate functional genes and DNA sequence variants (Jia et al. 2009; Pomerantz et al. 2009; Ernst et al. 2011; Cowper-Sal lari et al. 2012; Dunham et al. 2012; Maurano et al. 2012; Schaub et al. 2012; Zhang et al. 2012; Karczewski et al. 2013; Trynka et al. 2013).

Additional work is needed to refine these existing methods. We also need to develop new tools when there is no evidence in human tissues that the associated non-coding SNPs control gene expression, that is when the SNPs are not expression quantitative trait loci (eQTLs). In an effort to broaden the application of this approach by the community, we further extended the use of epigenomic data to prioritize functional candidate SNPs by developing two novel approaches, and we applied these approaches to 95 loci associated with lipid levels in humans (Teslovich et al. 2010). We were particularly interested in testing if gene expression datasets from relevant knockout mouse models could help prioritize candidate functional genes and variants at GWAS loci. Such a strategy could have broad implications as it may offer an alternative when there is no eQTL evidence or the human tissues are not readily accessible for transcriptomic studies. Our results demonstrate that combining human genetic, epigenomic and mouse expression data can provide additional fine-mapping resolution at GWAS loci. As a proof-of-principle, we functionally tested and validated a variant in LD with a lipid sentinel SNP that interferes with the binding of the FOXA transcription factors and is located near a FOXA2 transcriptional target gene as determined by the transcriptomic characterization of Foxa2−/− mouse livers. Our two methods, applied individually or together, should be broadly applicable to other human complex traits and diseases.

Results

Enrichment analysis

For this study, we obtained from the ENCODE Project all DNase I hypersensitive sites (DHS) and ChIP-seq peaks from HepG2, which are hepatoblastoma cells that have been extensively used to study lipid metabolism. For comparison, we also analyzed the same data in the three tier 1 ENCODE cell lines: B-lymphoblastoid cells GM12878, erythroleukemia cells K562 and human embryonic stem cells H1-hESC. In this article, we use the term “epigenomic annotation” to refer to any DHS or ChIP-seq peak reported by the ENCODE Project in these four cell lines. To quantify the overlap between ENCODE epigenomic annotations that mark regulatory DNA sequences and individual SNPs at GWAS loci, we counted epigenomic annotations in each cell line that overlap the SNP and assessed significance using a simple enrichment analysis framework. We considered variants in LD (r2≥0.8, European-ancestry individuals from the 1000 Genomes Project) with the GWAS sentinel SNPs and then used 5,000 matched sets of markers to assess the statistical significance of the enrichment (Materials and Methods and Supplementary Figure 1).

Applying this approach to 95 lipid loci, we found enrichment of DHS and most histone marks associated with transcription regulation. The enrichment was stronger in HepG2 cells than in the three other cell lines analyzed: 70% of marks (7 of 10) had enrichment P<0.0002 for HepG2, whereas the corresponding proportions for GM12878, K562 and H1-hESC were 20%, 50% and 20%, respectively (Supplementary Table 1). This result is consistent with previous reports that used similar or complementary strategies, and emphasizes that most functional lipid variants identified by GWAS may exert their effect on phenotypic variation through the regulation of gene expression (Jia et al. 2009; Pomerantz et al. 2009; Ernst et al. 2011; Cowper-Sal lari et al. 2012; Dunham et al. 2012; Maurano et al. 2012; Schaub et al. 2012; Zhang et al. 2012; Karczewski et al. 2013; Trynka et al. 2013).

Integrating human eQTL data

A large meta-analysis of genome-wide association results for lipid levels highlighted variants at 24 of 95 lipid loci that are eQTL in human liver at P<5×10−8 (Schadt et al. 2008; Teslovich et al. 2010). Given our enrichment results, we reasoned that the specific causal variant(s) at each of these eQTL should be either the sentinel SNP itself or a marker in strong LD with it, and marked by epigenomic annotations in HepG2 cells. Because the presence or absence of epigenomic annotations at markers within the same locus should be independent of LD between them, ENCODE data could help prioritize functional variants even if they are perfectly correlated (a limitation of the genetic approach in fine-mapping GWAS loci).

The simplest strategy to combine epigenomic annotations and DNA polymorphisms is to count the number of DHS and ChIP-seq peaks that physically map in the human genome at the same position as DNA polymorphisms. Our hypothesis is that the best functional candidate variant at an eQTL lipid locus should have the highest number of overlaps with epigenomic annotations in HepG2, thus allowing discrimination between variants in strong LD. Obviously, this one causal variant-one locus hypothesis would not be valid if there is evidence of independent association signals or in the presence of several causal variants in strong LD, as recently proposed in the genomic context of super-enhancers (Hnisz et al. 2013; Parker et al. 2013; Corradin et al. 2014). However, under the several causal variants-one locus model, our framework might still identify at least one of the potential functional variants. For this analysis, we used all DHS and histone mark peaks; we also included ChIP-seq data for all available transcription factors since most of them were examined specifically in hepatocytes or are general activators or repressors of transcription without a clear cell- or biological pathway-specificity. Importantly, epigenomic annotations are biologically correlated as many mark the same chromatin state (e.g. promoters, enhancers)(Gerstein et al. 2012). However, they also each provide experimental evidence that a genomic region is transcriptionally important. In addition, the accumulation of DHS and ChIP-seq peaks from different experiments (and for ENCODE, different laboratories) at a given position in the genome decreases the likelihood of false positives. For these reasons, we treated all DHS, histone mark and transcription factor ChIP-seq data from ENCODE HepG2 independently (including technical replicates when available) and used them to annotate SNPs. Merging technical replicates to only analyze intersecting peaks had no significant impact on the results.
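The counting step described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function names, the BED-style interval convention (0-based start, exclusive end) and the four toy peaks are assumptions made for illustration. Only the two SNP positions are taken from Table 1.

```python
from collections import Counter

def count_overlaps(snp_positions, peaks):
    """Count, for each SNP, how many annotation peaks contain its position.

    snp_positions: dict mapping SNP id -> (chromosome, 1-based position)
    peaks: iterable of (annotation_name, chromosome, start, end), assumed to be
           BED-style intervals (0-based start, exclusive end).
    """
    counts = Counter({snp: 0 for snp in snp_positions})
    for _name, chrom, start, end in peaks:
        for snp, (snp_chrom, pos) in snp_positions.items():
            # A 1-based position p lies in a BED interval when start < p <= end.
            if snp_chrom == chrom and start < pos <= end:
                counts[snp] += 1
    return counts

def top_candidate(counts):
    """Return the SNP with the most overlaps (ties broken by SNP id)."""
    return max(sorted(counts), key=lambda snp: counts[snp])

# SNP positions from Table 1 (hg19); the peaks below are invented toy intervals.
snps = {
    "rs629301": ("chr1", 109818306),
    "rs12740374": ("chr1", 109817590),
}
toy_peaks = [
    ("DHS", "chr1", 109817400, 109817800),
    ("H3K27ac", "chr1", 109817000, 109818000),
    ("CEBPB", "chr1", 109817500, 109817700),
    ("H3K4me1", "chr1", 109818200, 109818400),
]

counts = count_overlaps(snps, toy_peaks)
best = top_candidate(counts)
print(dict(counts), best)
```

Because every peak file is treated as an independent annotation, replicate experiments would simply appear as additional entries in the peak list.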

Results from this analysis are summarized in Table 1. At 19 of the 24 eQTL, the variant with the highest number of overlaps with ENCODE epigenomic annotations in HepG2 was different than the reported sentinel lipid SNP. The candidate SNPs prioritized by the ENCODE data were also on average closer, although not significantly, to the transcription start site(s) of the eQTL gene(s) than the sentinel lipid SNPs (78±82 vs. 88±93 kilobases (kb)), but still sufficiently far to suggest an influence on enhancer as opposed to promoter activities. We performed a receiver operating characteristic (ROC) curve analysis to determine the number of overlapping epigenomic annotations that maximizes both sensitivity and specificity of finding candidate SNPs at eQTL. We compared the number of epigenomic annotations for each SNP within the 24 eQTL with the number for each SNP in the 71 non-eQTL, focusing on the SNP with the highest number of epigenomic annotations in each locus. At a threshold of 16 overlapping epigenomic annotations, the area under the curve (AUC) is 0.618, the sensitivity 67% and the specificity 61%. If a SNP has ≥16 epigenomic annotations in HepG2, it is more likely to be located at an eQTL in liver (Fisher’s exact P=0.03, odds ratio and 95% confidence interval=3.1 [1.1–9.6]). Using a threshold of 16 epigenomic annotations, we found a functional candidate SNP for 16 of the 24 loci associated with both lipid levels and gene expression levels (bold in Table 1). For each of the 16 loci, we list all SNPs in strong LD (r2≥0.8) that overlap with ≥16 epigenomic annotations in Supplementary Table 2.
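As a worked check of the threshold statistics, the sketch below rebuilds a 2×2 table and computes sensitivity, specificity, the sample odds ratio and a two-sided Fisher's exact P-value. The 16/8 split of the 24 eQTL loci follows from the text. The 28/43 split of the 71 non-eQTL loci is an assumption, inferred here because 43 is the only count that rounds to the reported 61% specificity. The function name is illustrative, and the sample odds ratio may differ slightly from a conditional estimate such as the one R's fisher.test reports.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Rows: eQTL loci, non-eQTL loci. Columns: top SNP has >=16 annotations, <16.
eqtl_hi, eqtl_lo = 16, 8   # 16 of the 24 eQTL loci, as stated in the text
non_hi, non_lo = 28, 43    # ASSUMED split of the 71 non-eQTL loci (see lead-in)

sensitivity = eqtl_hi / (eqtl_hi + eqtl_lo)
specificity = non_lo / (non_hi + non_lo)
odds_ratio = (eqtl_hi * non_lo) / (eqtl_lo * non_hi)
p_value = fisher_exact_two_sided(eqtl_hi, eqtl_lo, non_hi, non_lo)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
print(f"odds ratio={odds_ratio:.2f} two-sided P={p_value:.3f}")
```

Under these assumed counts, the sensitivity, specificity and odds ratio round to the reported 67%, 61% and 3.1.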

Table 1. Overlaps of epigenomic annotations from ENCODE HepG2 and sentinel lipid SNPs associated with gene expression levels in human livers.

For each sentinel lipid SNP, we identified SNPs in linkage disequilibrium (r2≥0.8, European populations from the 1000 Genomes Project) and counted the number of overlaps with ENCODE peaks for all DNase I hypersensitive sites and ChIP-seq data available. ENCODE top candidate SNPs with ≥16 epigenomic annotation overlaps are in bold (see text for details). TSS, transcription start site; bp, base pairs. Human liver eQTL data from (Schadt et al. 2008; Teslovich et al. 2010).

| Sentinel lipid SNP | Chr:Position (hg19) | Transcript(s) associated with genotypes at the sentinel lipid SNP in human livers | ENCODE top candidate SNP (in LD with sentinel lipid SNP; highest number of epigenomic annotation overlaps) | Chr:Position (hg19) | Number of overlapping ENCODE epigenomic annotations | Distance between ENCODE top candidate SNP and gene TSS (bp) | Distance between sentinel lipid SNP and gene TSS (bp) | Distance between ENCODE top candidate SNP and sentinel lipid SNP (bp) |
|---|---|---|---|---|---|---|---|---|
| rs12027135 | Chr1:25,775,733 | RHCE | rs9438904 | Chr1:25,756,860 | 46 | 9,497 | 28,370 | −18,873 |
| | | RHD | | | | −157,880 | −176,753 | |
| | | TMEM50A | | | | −92,072 | −110,945 | |
| | | TMEM57 | | | | 527 | −18,346 | |
| rs2131925 | Chr1:63,025,942 | ANGPTL3 | rs631106 | Chr1:62,901,807 | 47 | 161,379 | 37,244 | −124,135 |
| | | DOCK7 | | | | −252,232 | −128,097 | |
| rs629301 | Chr1:109,818,306 | CELSR2 | rs12740374 | Chr1:109,817,590 | 44 | −24,950 | −25,666 | −716 |
| | | PSMA5 | | | | −151,480 | −150,764 | |
| | | PSRC1 | | | | −8,200 | −7,484 | |
| | | SORT1 | | | | −122,973 | −122,257 | |
| | | SYPL2 | | | | 191,509 | 190,793 | |
| rs1260326 | Chr2:27,730,940 | IFT172 | rs780094 | Chr2:27,741,237 | 23 | 28,666 | 18,369 | 10,297 |
| rs13107325 | Chr4:103,188,709 | SLC39A8 | rs13107325 | Chr4:103,188,709 | 0 | −77,946 | −77,946 | 0 |
| rs9488822 | Chr6:116,312,893 | FRK | rs9488822 | Chr6:116,312,893 | 4 | −69,028 | −69,028 | 0 |
| rs10128711 | Chr11:18,632,984 | SPTY2D1 | rs7943121 | Chr11:18,656,062 | 49 | 42 | −23,036 | 23,078 |
| rs174546 | Chr11:61,569,830 | FADS1 | rs174538 | Chr11:61,560,081 | 49 | −24,448 | −14,699 | −9,749 |
| rs11220462 | Chr11:126,243,952 | ST3GAL4 | rs2066985 | Chr11:126,251,286 | 5 | 22,024 | 29,358 | 7,334 |
| rs7134594 | Chr12:110,000,193 | MMAB | rs10161126 | Chr12:110,042,348 | 17 | 30,990 | −11,165 | 42,155 |
| rs8017377 | Chr14:24,883,887 | NYNRIN | rs72694393 | Chr14:24,874,193 | 9 | −6,202 | −15,896 | −9,694 |
| rs2929282 | Chr15:44,245,931 | CKMT1A | rs4270152 | Chr15:44,224,668 | 5 | −239,585 | −260,848 | −21,263 |
| rs1532085 | Chr15:58,683,366 | ALDH1A2 | rs2043085 | Chr15:58,680,954 | 4 | 322,833 | 325,245 | −2,412 |
| | | LIPC | | | | 43,220 | 40,808 | |
| rs11649653 | Chr16:30,918,487 | VKORC1 | rs11640961 | Chr16:30,979,818 | 4 | −126,458 | −187,789 | 61,331 |
| rs16942887 | Chr16:67,928,042 | NFATC3 | rs7188085 | Chr16:68,113,873 | 81 | 5,395 | 191,226 | 185,831 |
| rs11869286 | Chr17:37,813,856 | PERLD1 | rs881844 | Chr17:37,810,218 | 33 | −34,092 | −30,454 | −3,638 |
| rs7206971 | Chr17:45,425,115 | TBKBP1 | rs4793978 | Chr17:45,698,175 | 18 | 74,454 | 347,514 | 273,060 |
| rs7241918 | Chr18:47,160,953 | LIPG | rs7239867 | Chr18:47,164,717 | 32 | −76,291 | −72,527 | 3,764 |
| rs7255436 | Chr19:8,433,196 | ANGPTL4 | rs10413136 | Chr19:8,452,879 | 16 | −23,869 | −4,186 | 19,683 |
| rs439401 | Chr19:45,414,451 | APOC4 | rs584007 | Chr19:45,416,478 | 31 | 29,016 | 31,043 | 2,027 |
| rs386000 | Chr19:54,792,761 | LILRA3 | rs386000 | Chr19:54,792,761 | 2 | −11,460 | −11,460 | 0 |
| rs2277862 | Chr20:34,152,782 | CEP250 | rs2104417 | Chr20:34,127,871 | 25 | −84,649 | −109,560 | −24,911 |
| | | CPNE1 | | | | −124,988 | −100,077 | |
| rs6065906 | Chr20:44,554,015 | PLTP | rs1057208 | Chr20:44,563,007 | 49 | 22,004 | 13,012 | 8,992 |
| rs181362 | Chr22:21,932,068 | UBE2L3 | rs2266959 | Chr22:21,922,904 | 18 | −886 | −10,050 | −9,164 |

As a positive control, we evaluated the priority of rs12740374, a SNP near SORT1 previously proposed to be a causal lipid variant at this locus by interfering with binding of C/EBP transcription factors (Musunuru et al. 2010). At the SORT1 locus, we identified rs12740374 as the most likely functional regulatory variant based on 44 epigenomic annotation overlaps in comparison with 23 overlaps for the second most likely SNP (empirical P=0.048, calculated using the two variants with the highest number of annotations in each of the 5,000 matched sets of 95 SNPs) and 13 overlaps for rs629301, the sentinel lipid SNP (Figure 1A). Another promising example is at the NFATC3 locus. The sentinel lipid SNP rs16942887 that is associated with NFATC3 expression levels in human livers is located 191 kb upstream of its transcription start site. The highest priority candidate SNP at the locus in our analysis, rs7188085, has 81 epigenomic annotation overlaps in HepG2 (vs. 20 for rs16942887) and is located only 5.3 kb upstream of NFATC3 (Figure 1B). This variant and many others presented in Table 1 are strong functional candidates.

Figure 1. ENCODE HepG2 epigenomic annotations at the (A) SORT1 and (B) NFATC3 lipid loci.

For the sentinel lipid and eQTL SNPs (SORT1: rs629301; NFATC3: rs16942887) and their linkage disequilibrium proxies (r2≥0.8, European populations from 1000 Genomes Project), we counted the number of overlaps with peaks from HepG2 DNase I hypersensitive sites, histone marks or transcription factor binding ChIP-seq data. For both loci, the SNP with the highest number of epigenomic annotations is different than the published sentinel SNP.

Combining ENCODE and mouse transcriptomic data

Despite a very strong enrichment of epigenomic annotations correlated with transcriptional regulation (Supplementary Table 2), only 36% of the 95 loci associated with lipid levels in humans were reported to harbor eQTL variants (Teslovich et al. 2010). Many factors could explain this observation: transcriptomic profiling was performed in the wrong tissues, the genotypic effect on gene expression was too weak to be detected, the transcripts of interest were not measured or were undetectable, etc.

One alternative to gene profiling in human samples is to use the mouse, where the relevant tissues are readily accessible, and assume that transcription factor homologs will target a large set of overlapping genes in both species. In particular, we tested the hypothesis that the disruption of specific transcription factors in mouse livers could help identify functional lipid genes and variants. First, we performed an enrichment analysis of all the ENCODE HepG2 ChIP-seq transcription factor data over the sentinel and correlated SNPs at the 95 lipid loci and identified ten transcription factors that preferentially bind to these regions: CEBPB, ELF1, FOXA1, FOXA2, HEY1, HNF4A, HNF4G, MBD4, MYBL2, NFIC (Supplementary Table 3). This enrichment was reproducible across technical replicates. These transcription factors may define regulatory networks that are important to control lipid metabolism in humans. Of particular interest, we saw an enrichment for three families of transcription factors expressed in the liver and previously implicated in lipid metabolism: CEBPB, FOXA1 and FOXA2, and HNF4A. Second, we identified publicly available transcriptomic profiles in livers of control mice and liver-specific knockout animals for Foxa1 and Foxa2 (Bochkis et al. 2012), and Hnf4a (Bonzo et al. 2012); unfortunately, such data was not available for Cebpb. For each of these conditional gene knockout strains, we retrieved the list of mouse genes whose expression in liver was significantly changed when compared to control animals: 385, 1009 and 1179 genes for Foxa1, Foxa2 and Hnf4a, respectively (Materials and Methods). Third, we searched if any of the human homologs of these target genes were located within an arbitrary window defined as 250 kb on each side of the 95 sentinel lipid SNPs. For FOXA2, we found ten target genes located within nine of the 95 lipid loci; all but one of these loci contain at least one FOXA2 ChIP-seq peak in HepG2 (Table 2 and Supplementary Table 3). 
Results were similarly encouraging for HNF4A: there are 20 transcriptional target genes located at 17 of the 95 lipid loci, and for 14 of these 17 loci, there is at least one annotated HNF4A ChIP-seq peak in HepG2 (Table 2 and Supplementary Table 3). Because we demonstrated a strong statistical enrichment of FOXA2 and HNF4A ChIP-seq peaks at the human lipid loci, and because we focus our query on genes modulated by the disruption of these transcription factors in mouse livers, we argue that the genes listed in Table 2 are strong biological candidates for influencing lipid levels in humans. Our screen re-identified genes previously implicated in lipid metabolism, such as SORT1 and GALNT2, but also other genes with unanticipated functions in regulating lipid levels (Supplementary Table 3)(Musunuru et al. 2010; Holleboom et al. 2011). There were no FOXA1 target genes among these genomic regions, perhaps consistent with the previous finding that FOXA1 is preeminently involved in cell cycle regulation (Bochkis et al. 2012).
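The third step, locating human homologs of mouse target genes within 250 kb on each side of a sentinel SNP, can be sketched as follows. This is an illustration only: the function name is hypothetical, gene location is assumed to be represented by a single TSS coordinate, and GENE_A and GENE_B with their coordinates are invented placeholders. Only the sentinel SNP position is taken from Table 2.

```python
def targets_near_snps(sentinels, gene_tss, target_genes, window=250_000):
    """Map each sentinel SNP to the target genes lying within +/- window bp.

    sentinels: dict SNP id -> (chromosome, position)
    gene_tss: dict gene symbol -> (chromosome, TSS position); representing a
              gene by its TSS is an assumption of this sketch
    target_genes: human homologs of genes changed in the mouse knockout
    """
    hits = {}
    for snp, (chrom, pos) in sentinels.items():
        nearby = sorted(
            gene
            for gene in target_genes
            if gene in gene_tss
            and gene_tss[gene][0] == chrom
            and abs(gene_tss[gene][1] - pos) <= window
        )
        if nearby:
            hits[snp] = nearby
    return hits

# Sentinel position from Table 2 (hg19); genes and coordinates are placeholders.
sentinels = {"rs4129767": ("chr17", 76403984)}
gene_tss = {
    "GENE_A": ("chr17", 76250000),  # about 154 kb away: inside the window
    "GENE_B": ("chr17", 77000000),  # about 596 kb away: outside the window
}
hits = targets_near_snps(sentinels, gene_tss, target_genes={"GENE_A", "GENE_B"})
print(hits)
```

Loci returned by such a lookup would then be checked for ChIP-seq peaks of the corresponding transcription factor, as done for Table 2.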

Table 2. Identification of HNF4A and FOXA2 potential regulatory variants at lipid loci.

For each of these two transcription factors, we identified target genes in mouse livers, and then searched for human homologs of these target genes located within a 500 kilobase window (250 kb on each side) around the 95 sentinel lipid SNPs. For the lipid loci that contain at least one target gene, we then queried whether the SNPs (or linkage disequilibrium proxies) overlapped with corresponding ENCODE ChIP-seq peaks and disrupted predicted binding sites. Two SNPs each for HNF4A and FOXA2 (in bold) met all these criteria.

A-HNF4A

| Sentinel lipid SNP | Chr:Position (hg19) | HNF4A target gene(s) | Is there a HNF4A ChIP-seq peak at the locus? | SNPs in LD with sentinel lipid SNP that overlap with ENCODE HNF4A peaks (Do they disrupt predicted HNF4A binding sites?) |
|---|---|---|---|---|
| rs4846914 | Chr1:230,295,691 | GALNT2 | YES | rs4846913 (NO); rs2144300 (NO) |
| rs1260326 | Chr2:27,730,940 | FNDC4, SLC30A3 | NO | |
| rs1800562 | Chr6:26,093,141 | SLC17A3, HIST1H4D, HIST1H4F | NO | |
| rs17145738 | Chr7:72,982,874 | MLXIPL | YES | rs34060476 (NO) |
| rs2081687 | Chr8:59,388,565 | UBXN2B | NO | |
| rs11136341 | Chr8:145,043,543 | NRBP2 | NO | |
| rs7134594 | Chr12:110,000,193 | MMAB, MYO1H | YES | rs10744826 (NO); rs10161126 (NO) |
| rs1169288 | Chr12:121,416,650 | HNF1A | NO | |
| rs4759375 | Chr12:123,796,238 | SETD8 | YES | rs10846506 (NO) |
| rs838880 | Chr12:125,261,593 | BRI3BP | YES | rs838881 (NO); rs838882 (YES); rs838884 (NO) |
| rs16942887 | Chr16:67,928,042 | DUS2L, PSMB10 | YES | rs7188085 (NO); rs2107369 (NO); rs8044328 (NO) |
| rs4420638 | Chr19:45,422,946 | BCAM, PVRL2 | NO | |
| rs6029526 | Chr20:39,672,618 | PLCG1 | NO | |
| rs6065906 | Chr20:44,554,015 | WFDC3 | YES | rs6065905 (NO); rs12185764 (YES) |

B-FOXA2

| Sentinel lipid SNP | Chr:Position (hg19) | FOXA2 target gene(s) | Is there a FOXA2 ENCODE peak at the locus? | SNPs in LD with sentinel lipid SNP that overlap with ENCODE FOXA2 peaks (Do they disrupt predicted FOXA2 binding sites?) |
|---|---|---|---|---|
| rs629301 | Chr1:109,818,306 | SORT1 | YES | rs7528419 (NO); rs12740374 (NO); rs660240 (NO) |
| rs12328675 | Chr2:165,540,800 | COBLL1 | YES | rs7607980 (NO) |
| rs2290159 | Chr3:12,628,920 | PPARG | YES | rs60448371 (NO); rs55762590 (NO) |
| rs6450176 | Chr5:53,298,025 | ARL15 | YES | rs6889847 (NO); rs6876198 (NO); rs3776712 (NO); rs1541681 (NO); rs1541680 (NO); rs3776707 (NO); rs3776706 (NO); rs3776705 (NO); rs3776703 (NO); rs3776702 (YES) |
| rs1800562 | Chr6:26,093,141 | SLC17A3 | YES | rs115740542 (NO) |
| rs1169288 | Chr12:121,416,650 | P2RX7 | YES | rs6489786 (NO); rs1169288 (NO) |
| rs7206971 | Chr17:45,425,115 | ITGB3 | YES | rs4793978 (NO); rs11079784 (NO) |
| rs4129767 | Chr17:76,403,984 | BIRC5 | YES | rs4969182 (YES) |
| rs181362 | Chr22:21,932,068 | HIC2, SDF2L1 | NO | |

Finding and characterizing potential functional variants

Our analysis presented in Table 2 also allowed us to try to predict functional variants. Indeed, if a sentinel lipid SNP (or an LD proxy) overlaps a FOXA2 or HNF4A ChIP-seq peak in HepG2 and disrupts a predicted binding site for these transcription factors, it is likely to be biologically relevant. We queried the HaploReg database and found that four SNPs disrupted binding motifs for FOXA2 and HNF4A (Table 2: rs3776702 and rs4969182 for FOXA2; rs838882 and rs12185764 for HNF4A)(Ward and Kellis 2012). Many of the loci listed in Table 2 do not contain SNPs that disrupt predicted FOXA2 or HNF4A binding sites. This is consistent with results from the ENCODE Project that showed that ChIP-seq can identify numerous and robust transcription factor peaks with no consensus binding motif in the underlying DNA sequence (Neph et al. 2012; Whitfield et al. 2012). In the absence of canonical binding sites, it is impossible to predict the effect of SNPs on transcription factor binding; this requires functional validation. Therefore, many of the variants listed in Table 2 might be functional even if they reside in FOXA2 or HNF4A ChIP-seq peaks that do not contain canonical binding motifs.

Finally, we sought to functionally validate one of our predictions. We selected rs4969182, which is in LD with the sentinel lipid SNP rs4129767 (r2=0.96), overlaps with a FOXA2 peak in HepG2 and is located 171 kb away from the apoptosis-related gene BIRC5, a transcriptional target of Foxa2 in mouse livers (Table 2). rs4969182 is a C/T bi-allelic variant, and the C-allele disrupts the motif recognized by FOXA transcription factors. Using reporter assays in HepG2 cells, we showed that the DNA sequence surrounding rs4969182 has enhancer activity, and that the T-allele recognized by FOXA2 shows significantly increased transcriptional activity compared to the C-allele (Figure 2A, P=2.6×10−5 and P=5.0×10−6 in the forward and reverse orientation, respectively). Next, using electrophoretic mobility shift assays (EMSA), we tested if alleles of rs4969182 differentially affected DNA binding to nuclear proteins. Our results showed that proteins from HepG2 nuclear extracts bind probes containing either the C- or the T-allele, but that binding is stronger for the T-allele-containing probe (Figure 2B). Competition of T-allele-containing labeled probe with excess unlabeled probe with the T-allele more efficiently competed away allele-specific bands than excess unlabeled probe with the C-allele, providing support for allelic differences in protein-DNA binding (Figure 2B). Antibodies against FOXA1 and FOXA2 appear to weaken the probe-FOXA interaction but did not supershift the protein-probe complexes (Figure 2B). Other examples exist of EMSA experiments in which antibodies appear to impair binding without causing a clear supershift of the complex (Musunuru et al. 2010).

Figure 2. Allelic differences in regulatory activity at rs4969182.

(A) Differential transcriptional enhancer reporter activity in HepG2 cells. The T-allele, found in FOXA consensus binding motifs, showed significantly increased luciferase activity compared to C-allele in both orientations and with respect to a minimal promoter vector. Error bars represent standard error of five independent clones for each allele. Results are expressed as fold change compared to empty vector control. P-values were calculated by a two-sided t-test. (B) Electrophoretic mobility shift assay (EMSA) using HepG2 nuclear extract shows differential protein-DNA binding of rs4969182 alleles. The probe containing the T-allele shows increased protein binding (arrow A) compared to the probe containing the C-allele. Excess unlabeled specific probe containing the T-allele (T-comp) more efficiently competed away allele-specific binding than the unlabeled C-allele (C-comp). Incubation with FOXA1 and FOXA2 antibody reduced the DNA-protein complex (arrow A). To enhance visualization of protein complexes, free biotin-labeled probe is not shown.

Discussion

Characterization of the epigenome by the ENCODE Project provides a framework to functionally annotate non-coding SNPs identified by GWAS. Based on the observation that GWAS SNPs are enriched for chromatin marks linked to transcriptional regulation, we designed two novel strategies that integrate gene expression profiling with epigenomic characterization. We used SNPs associated with lipid levels as a test set because the large number allows us to derive meaningful statistics and also because relevant cells and tissues are characterized. First, we showed that at eQTL loci, simply counting the number of epigenomic annotations that overlap with associated SNPs can improve fine-mapping resolution. This is particularly useful to distinguish markers in strong LD, such as SNPs at the SORT1 locus (Figure 1A). Second, in the absence of human eQTL information, or to complement such datasets, we used gene expression profiling in the mouse to prioritize candidate functional genes, and subsequently candidate functional variants. We reasoned that if a transcription factor binds preferentially at lipid loci (ENCODE ChIP-seq data), disruption of the mouse homolog could identify target genes that may be important for lipid levels variation in humans. Indeed, although it is known that transcription factors from different species bind different DNA motifs (Schmidt et al. 2010), the transcriptional target genes are often conserved across species (Boj et al. 2009; Chan et al. 2009). This strategy allowed us to highlight the role of the FOXA2 and HNF4A transcriptional networks in lipid metabolism. Importantly, we validated one of our predictions experimentally: a lipid sentinel SNP located 171 kb from BIRC5, a FOXA2 target gene in the mouse liver, is in LD with a marker that interferes with FOXA2 binding and modulates the enhancer activity of the DNA sequence (Figure 2). 
We did not validate whether BIRC5 plays a role in lipid metabolism; there are other potential candidate genes at the locus, although none are FOXA2 target genes based on the mouse data. Other candidates include PGS1, a gene involved in the biosynthesis of the anionic phospholipids phosphatidylglycerol and cardiolipin.

Fine-mapping may sometimes point to a candidate functional gene that will be different than what would be expected based on the known biology of the genes located within the locus. We have such an example in our analysis of eQTL data from human livers. rs16942887 is associated with HDL-cholesterol levels in humans (Teslovich et al. 2010), and is located 46 kb from LCAT, which encodes an important enzyme involved in cholesterol transport. Whereas common knowledge would suggest LCAT as the likeliest causal gene at the locus, genotypes at rs16942887 are associated with expression levels of NFATC3 in human livers, a gene located 191 kb downstream. Furthermore, epigenomic characterization of this locus in HepG2 highlights rs7188085, a SNP in strong LD with rs16942887 (r2=0.85) and located only 5.3 kb from the NFATC3 transcription start site (Figure 1B). NFATC3 encodes a protein involved in immune responses. LCAT is critical for lipid metabolism in humans, but there is currently no functional evidence that suggests that the SNPs at this locus mediate their effect on HDL-C levels through LCAT itself, NFATC3, or both.

Several studies have proposed to use epigenomic annotations to prioritize DNA sequence variants at GWAS loci for functional testing (Jia et al. 2009; Pomerantz et al. 2009; Ernst et al. 2011; Cowper-Sal lari et al. 2012; Dunham et al. 2012; Maurano et al. 2012; Schaub et al. 2012; Zhang et al. 2012; Karczewski et al. 2013; Trynka et al. 2013). We extended these methods and also developed a novel paradigm by proposing to integrate mouse transcriptomic data as an additional filter to prioritize candidate functional variants and genes. As with the other bioinformatic methods, ours also have limitations that are inherent to the type of data available. For instance, genomic regions that are difficult to sequence using next-generation DNA sequencers are less likely to be annotated by the ENCODE Project and might thus escape detection using such methods. Equally important is the fact that most epigenomic marks catalogued by the ENCODE Project are associated with transcriptional activation. Thus, functional genetic variants that relieve transcriptional repression are less likely to be found using these strategies. Finally, if the transcription factors tested by ChIP-seq do not have a mouse ortholog (unlikely since 99% of human genes have a mouse equivalent (Mouse Genome Sequencing et al. 2002)) or if the mouse knockout models do not exist, our second strategy is not applicable.

In conclusion, we presented two simple strategies that combine epigenomic and transcriptomic profiling to prioritize functional genes and variants at GWAS loci. These methods should be applicable to prioritize rare genetic variants as well because they rely on the annotation of physical positions and are independent of allele frequency. The predictions from our approaches, which are statistically supported through enrichment analysis, are readily testable in the laboratory. These methods should be applicable to characterize genetic markers associated with many complex diseases and traits, and in particular those related to immune or hematological phenotypes as relevant tissues are easier to access. Combining human genetic findings with epigenomic characterization and gene expression data from mouse knockouts offers an alternative solution, in particular when human tissues are not accessible. Finally, as the repertoire of epigenomic annotations in various human tissues continues to expand, we anticipate that our strategies will become amenable to most human complex phenotypes.

Methods

ENCODE enrichment analysis

The enrichment pipeline strategy is summarized graphically in Supplementary Figure 1. For each epigenomic annotation, peak coordinates were identified using software developed for the ENCODE Project (http://encodeproject.org/ENCODE/encodeTools.html). We obtained epigenomic annotations in the form of peak calls mapped onto the human genome (build hg19) directly from the ENCODE Project website (accessed June 2012). In total, our analysis considered 116, 147, 111 and 177 different epigenomic annotation files for HepG2, GM12878, H1-hESC and K562, respectively. To quantify the enrichment of SNPs associated with a specific complex disease or trait, we developed a four-step strategy. First, we generated sets of variants (with replacement) that are matched with the sentinel variants based on allele frequency (±4%), gene proximity (±100 kb) and linkage disequilibrium (LD; all SNPs within the same set have r²≤0.5). For our analysis of the lipid loci, we generated 5,000 sets of 95 SNPs using information from European individuals from the 1000 Genomes Project. Second, for each variant in the seed and matched sets, we retrieved all other variants in LD (r²≥0.8) using the 1000 Genomes Project European population genotypes and the PLINK software (Purcell et al. 2007). Third, we annotated all variants and their LD proxies for overlap with specified epigenomic annotations. Finally, we assessed statistical enrichment by computing an empirical P-value for each epigenomic annotation, counting the number of matched sets with more SNP-epigenomic annotation overlaps than found in the set of sentinel variants. We provide a step-by-step description of our methods in the Supplementary Information.
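The third and fourth steps can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the toy peak coordinates and the toy variant sets are hypothetical, peaks are treated as closed intervals on a single coordinate system, each variant (sentinel or LD proxy) counts as one overlap, and the empirical P-value is taken as the fraction of matched sets with strictly more overlaps than the sentinel set. The exact counting unit and any small-sample correction are assumptions; the Supplementary Information describes the actual procedure.

```python
def overlaps_peak(chrom, pos, peaks):
    """Return True if the position falls inside any peak on that chromosome.

    peaks: dict mapping chromosome -> list of (start, end) closed intervals
    (hypothetical toy format, not the ENCODE file format).
    """
    return any(start <= pos <= end for start, end in peaks.get(chrom, ()))


def count_overlaps(variants, peaks):
    """Count how many (chrom, pos) variants overlap at least one peak."""
    return sum(overlaps_peak(chrom, pos, peaks) for chrom, pos in variants)


def empirical_p(observed, matched_counts):
    """Fraction of matched sets with strictly more overlaps than observed."""
    exceed = sum(1 for count in matched_counts if count > observed)
    return exceed / len(matched_counts)


# Toy annotation and variant sets (illustrative values only).
peaks = {"chr1": [(100, 200), (500, 600)], "chr17": [(1000, 1100)]}
sentinel_set = [("chr1", 150), ("chr1", 550), ("chr17", 1050), ("chr2", 10)]
matched_sets = [
    [("chr1", 150), ("chr1", 300), ("chr17", 5), ("chr2", 10)],
    [("chr1", 120), ("chr1", 510), ("chr17", 1001), ("chr17", 1100)],
    [("chr1", 1), ("chr1", 2), ("chr17", 3), ("chr2", 4)],
    [("chr1", 100), ("chr1", 600), ("chr17", 2000), ("chr2", 10)],
]

observed = count_overlaps(sentinel_set, peaks)
matched_counts = [count_overlaps(s, peaks) for s in matched_sets]
p_value = empirical_p(observed, matched_counts)
print(observed, matched_counts, p_value)
```

Under this definition, the smallest non-zero empirical P-value attainable with 5,000 matched sets is 1/5,000 = 2×10⁻⁴.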

Gene expression datasets

Human liver eQTL results (P≤5×10⁻⁸) were available from previous reports (Schadt et al. 2008; Teslovich et al. 2010). The lists of genes differentially expressed in liver-specific knockout Foxa1−/− and Foxa2−/− mice were obtained from a previous report (fold-change ≥±1.5, false discovery rate=15%) (Bochkis et al. 2012). To identify the list of genes differentially expressed in the liver of Hnf4a−/− mice compared to wild-type animals, we recovered the corresponding dataset from NCBI Gene Expression Omnibus (GSE34581) (Bonzo et al. 2012) and analyzed the data with the GEO2R module, correcting for multiple testing using the Benjamini & Hochberg procedure (adjusted P≤0.05). We converted mouse gene symbols to human gene symbols assuming a one-to-one homolog (Supplementary Information).
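For readers unfamiliar with the Benjamini & Hochberg adjustment, the following Python sketch shows how adjusted P-values can be computed and thresholded at 0.05. It only illustrates the correction step and does not reproduce the GEO2R analysis; the gene labels and raw P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical genes and raw P-values (illustrative values only).
genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]

adjusted_p = benjamini_hochberg(raw_p)
significant = [g for g, q in zip(genes, adjusted_p) if q <= 0.05]
print(adjusted_p)
print(significant)
```

In this toy example, GeneC and GeneD have raw P-values below 0.05 but adjusted values just above it, so only GeneA and GeneB pass the threshold.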

Luciferase transcriptional reporter assays

HepG2 hepatocellular carcinoma cells were cultured in MEM-alpha (Invitrogen) supplemented with 10% FBS, 1 mM sodium pyruvate and 2 mM L-glutamine. A 181 bp fragment (hg19 chr17: 76,392,913–76,393,093) surrounding the SNP rs4969182 was PCR-amplified using primers 5’-TGTGAGAGCTGTCTAAAACGAA-3’ and 5’-TTCATCAGGGTGTTTATTTCCTC-3’ from DNA of individuals homozygous for either allele and cloned in both orientations into the multiple cloning site of the minimal promoter-containing firefly luciferase reporter vector pGL4.23 (Promega, Madison, WI). Fragments are designated as ‘forward’ or ‘reverse’ based on their orientation in the genome with respect to the BIRC5 coding sequence. Five independent clones for each allele in each orientation were isolated, verified by sequencing and transfected in duplicate into HepG2 cells. Luciferase assays were performed as previously described (Fogarty et al. 2013).

Electrophoretic mobility shift assay (EMSA)

Nuclear cell extract was prepared from HepG2 cells using the NE-PER nuclear and cytoplasmic extraction kit (Thermo Scientific) as described (Fogarty et al. 2013). Biotin-labeled 17 base-pair oligonucleotides were designed to the sequence surrounding the rs4969182 alleles: sense 5’-biotin-ATATTTAC[T/C]CTCTGGCC-3’, antisense 5’-biotin-GGCCAGAG[G/A]GTAAATAT-3’ (SNP alleles in brackets). For supershift assays, before adding labeled probe, 2 µg of polyclonal antibody against FOXA1 (ab23738; Abcam) or 4 µg of antibody against FOXA2 (ENCODE ChIP-seq antibody, SC-6554X; Santa Cruz Biotechnology) was added to the binding reaction and incubated for 25 minutes. EMSAs were carried out on a second independent day and yielded comparable results.
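As a simple consistency check on the probe design, the short Python sketch below derives the antisense strand for each allele by reverse-complementing the sense sequence given above. The helper name is hypothetical and the check is purely illustrative; it shows that the sense T allele pairs with antisense A and the sense C allele with antisense G.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Sense probes for the two rs4969182 alleles, taken from the text above.
sense = {"T": "ATATTTACTCTCTGGCC", "C": "ATATTTACCCTCTGGCC"}
antisense = {allele: reverse_complement(seq) for allele, seq in sense.items()}

for allele, seq in antisense.items():
    print(allele, seq, len(seq))
```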

Supplementary Material

Highlights.

  1. We combined eQTL and ENCODE data to prioritize functional variants at lipid loci.

  2. We integrated mouse transcriptomic and ENCODE data to fine-map GWAS loci.

  3. We validated in silico predictions using functional experiments for a lipid locus.

Acknowledgments

We thank investigators from the ENCODE Project for making the data publicly available. We also thank Cameron Palmer for sharing code to allow SNP matching. This work was funded by grants from the Centre of Excellence in Personalized Medicine (CEPMed), the “Fondation de l’Institut de Cardiologie de Montréal”, the Canada Research Chair Program, the “Fonds de la Recherche en Santé du Québec” (to GL), and NIH grants DA027040 and DK072193.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Author contributions

Conceived and designed experiments: KSL, SV, MPF, KLM, GL

Performed experiments: KSL, SV

Directed the study: KLM, GL

All authors analyzed results and wrote the manuscript.

Disclosure declaration

The authors declare no conflicts of interest.

References

  1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534.
  2. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. doi: 10.1038/nature11632.
  3. Bochkis IM, Schug J, Ye DZ, Kurinna S, Stratton SA, Barton MC, Kaestner KH. Genome-wide location analysis reveals distinct transcriptional circuitry by paralogous regulators Foxa1 and Foxa2. PLoS genetics. 2012;8(6):e1002770. doi: 10.1371/journal.pgen.1002770.
  4. Boj SF, Servitja JM, Martin D, Rios M, Talianidis I, Guigo R, Ferrer J. Functional targets of the monogenic diabetes transcription factors HNF-1alpha and HNF-4alpha are highly conserved between mice and humans. Diabetes. 2009;58(5):1245–1253. doi: 10.2337/db08-0812.
  5. Bonzo JA, Ferry CH, Matsubara T, Kim JH, Gonzalez FJ. Suppression of hepatocyte proliferation by hepatocyte nuclear factor 4alpha in adult mice. The Journal of biological chemistry. 2012;287(10):7345–7356. doi: 10.1074/jbc.M111.334599.
  6. Chan ET, Quon GT, Chua G, Babak T, Trochesset M, Zirngibl RA, Aubin J, Ratcliffe MJ, Wilde A, Brudno M, et al. Conservation of core gene expression in vertebrate tissues. Journal of biology. 2009;8(3):33. doi: 10.1186/jbiol130.
  7. Corradin O, Saiakhova A, Akhtar-Zaidi B, Myeroff L, Willis J, Cowper-Sal lari R, Lupien M, Markowitz S, Scacheri PC. Combinatorial effects of multiple enhancer variants in linkage disequilibrium dictate levels of gene expression to confer susceptibility to common traits. Genome research. 2014;24(1):1–13. doi: 10.1101/gr.164079.113.
  8. Cowper-Sal lari R, Zhang X, Wright JB, Bailey SD, Cole MD, Eeckhoute J, Moore JH, Lupien M. Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression. Nature genetics. 2012;44(11):1191–1198. doi: 10.1038/ng.2416.
  9. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB, Frietze S, Harrow J, Kaul R, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi: 10.1038/nature11247.
  10. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473(7345):43–49. doi: 10.1038/nature09906.
  11. Fogarty MP, Panhuis TM, Vadlamudi S, Buchkovich ML, Mohlke KL. Allele-specific transcriptional activity at type 2 diabetes-associated single nucleotide polymorphisms in regions of pancreatic islet open chromatin at the JAZF1 locus. Diabetes. 2013;62(5):1756–1762. doi: 10.2337/db12-0972.
  12. Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489(7414):91–100. doi: 10.1038/nature11245.
  13. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106(23):9362–9367. doi: 10.1073/pnas.0903103106.
  14. Hnisz D, Abraham BJ, Lee TI, Lau A, Saint-Andre V, Sigova AA, Hoke HA, Young RA. Super-enhancers in the control of cell identity and disease. Cell. 2013;155(4):934–947. doi: 10.1016/j.cell.2013.09.053.
  15. Holleboom AG, Karlsson H, Lin RS, Beres TM, Sierts JA, Herman DS, Stroes ES, Aerts JM, Kastelein JJ, Motazacker MM, et al. Heterozygosity for a loss-of-function mutation in GALNT2 improves plasma triglyceride clearance in man. Cell metabolism. 2011;14(6):811–818. doi: 10.1016/j.cmet.2011.11.005.
  16. Jia L, Landan G, Pomerantz M, Jaschek R, Herman P, Reich D, Yan C, Khalid O, Kantoff P, Oh W, et al. Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS genetics. 2009;5(8):e1000597. doi: 10.1371/journal.pgen.1000597.
  17. Karczewski KJ, Dudley JT, Kukurba KR, Chen R, Butte AJ, Montgomery SB, Snyder M. Systematic functional regulatory assessment of disease-associated variants. Proc Natl Acad Sci U S A. 2013;110(23):9607–9612. doi: 10.1073/pnas.1219099110.
  18. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337(6099):1190–1195. doi: 10.1126/science.1222794.
  19. Mouse Genome Sequencing Consortium, Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420(6915):520–562. doi: 10.1038/nature01262.
  20. Musunuru K, Strong A, Frank-Kamenetsky M, Lee NE, Ahfeldt T, Sachs KV, Li X, Li H, Kuperwasser N, Ruda VM, et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature. 2010;466(7307):714–719. doi: 10.1038/nature09266.
  21. Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, Thurman RE, John S, Sandstrom R, Johnson AK, et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012;489(7414):83–90. doi: 10.1038/nature11212.
  22. Parker SC, Stitzel ML, Taylor DL, Orozco JM, Erdos MR, Akiyama JA, van Bueren KL, Chines PS, Narisu N, NISC Comparative Sequencing Program, et al. Chromatin stretch enhancer states drive cell-specific gene regulation and harbor human disease risk variants. Proc Natl Acad Sci U S A. 2013;110(44):17921–17926. doi: 10.1073/pnas.1317023110.
  23. Pomerantz MM, Ahmadiyeh N, Jia L, Herman P, Verzi MP, Doddapaneni H, Beckwith CA, Chan JA, Hills A, Davis M, et al. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nature genetics. 2009;41(8):882–884. doi: 10.1038/ng.403.
  24. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–575. doi: 10.1086/519795.
  25. Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, Kasarskis A, Zhang B, Wang S, Suver C, et al. Mapping the genetic architecture of gene expression in human liver. PLoS biology. 2008;6(5):e107. doi: 10.1371/journal.pbio.0060107.
  26. Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking disease associations with regulatory information in the human genome. Genome research. 2012;22(9):1748–1759. doi: 10.1101/gr.136127.111.
  27. Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, et al. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010;328(5981):1036–1040. doi: 10.1126/science.1186176.
  28. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, Koseki M, Pirruccello JP, Ripatti S, Chasman DI, Willer CJ, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466(7307):707–713. doi: 10.1038/nature09270.
  29. Trynka G, Sandor C, Han B, Xu H, Stranger BE, Liu XS, Raychaudhuri S. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nature genetics. 2013;45(2):124–130. doi: 10.1038/ng.2504.
  30. Ward LD, Kellis M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic acids research. 2012;40(Database issue):D930–D934. doi: 10.1093/nar/gkr917.
  31. Whitfield TW, Wang J, Collins PJ, Partridge EC, Aldred SF, Trinklein ND, Myers RM, Weng Z. Functional analysis of transcription factor binding sites in human promoters. Genome biology. 2012;13(9):R50. doi: 10.1186/gb-2012-13-9-r50.
  32. Zhang X, Cowper-Sal lari R, Bailey SD, Moore JH, Lupien M. Integrative functional genomics identifies an enhancer looping to the SOX9 gene disrupted by the 17q24.3 prostate cancer risk locus. Genome research. 2012;22(8):1437–1446. doi: 10.1101/gr.135665.111.
