Skip to main content
eLife logoLink to eLife
. 2013 Jun 4;2:e00523. doi: 10.7554/eLife.00523

Passive and active DNA methylation and the interplay with genetic variation in gene regulation

Maria Gutierrez-Arcelus 1,2,3, Tuuli Lappalainen 1,2,3, Stephen B Montgomery 1,4, Alfonso Buil 1,2,3, Halit Ongen 1,2,3, Alisa Yurovsky 1,2,3, Julien Bryois 1,2,3, Thomas Giger 1,2, Luciana Romano 1,2,3, Alexandra Planchon 1,2,3, Emilie Falconnet 1,2, Deborah Bielser 1,2, Maryline Gagnebin 1,2, Ismael Padioleau 1,2,3, Christelle Borel 1,2, Audrey Letourneau 1,2, Periklis Makrythanasis 1,2, Michel Guipponi 1,2, Corinne Gehrig 1,2, Stylianos E Antonarakis 1,2,*, Emmanouil T Dermitzakis 1,2,3,*
Editor: Chris P Ponting5
PMCID: PMC3673336  PMID: 23755361

Abstract

DNA methylation is an essential epigenetic mark whose role in gene regulation and its dependency on genomic sequence and environment are not fully understood. In this study we provide novel insights into the mechanistic relationships between genetic variation, DNA methylation and transcriptome sequencing data in three different cell-types of the GenCord human population cohort. We find that the association between DNA methylation and gene expression variation among individuals are likely due to different mechanisms from those establishing methylation-expression patterns during differentiation. Furthermore, cell-type differential DNA methylation may delineate a platform in which local inter-individual changes may respond to or act in gene regulation. We show that unlike genetic regulatory variation, DNA methylation alone does not significantly drive allele specific expression. Finally, inferred mechanistic relationships using genetic variation as well as correlations with TF abundance reveal both a passive and active role of DNA methylation to regulatory interactions influencing gene expression.

DOI: http://dx.doi.org/10.7554/eLife.00523.001

Research organism: Human

eLife digest

Variations occur throughout our genome. These variations can cause genes to be expressed (switched on) in slightly different ways among individuals. Moreover, the same gene can also be expressed in different ways in different cells within an individual. A third level of variation is supplied by epigenetic markers: these are molecules that bind to the DNA at specific points and can have profound effects on the expression of nearby genes. One such epigenetic marker is the addition of a methyl group to a cytosine base, a process that is known as DNA methylation.

DNA methylation usually happens when a cytosine base is next to a guanine base, forming a CpG site. In mammals, most CpG sites have methyl groups attached, although regions with a lot of CpG sites (called CpG islands) are mostly unmethylated. Initial studies suggested that methylation prevented particular genes from being expressed, but more recent work has indicated that methylation can be associated with both reduced and increased expression of genes. Moreover, it is not clear if this association is active (i.e., changes in methylation drive changes in gene expression) or passive (DNA methylation is the result of gene regulation).

Now, Gutierrez-Arcelus et al. have carried out a large-scale study to clarify the relationships between three different types of gene-related variations among individuals. They extracted fibroblasts, T-cells and lymphoblastoid cells from the umbilical cords of 204 babies, and analysed them for variations in DNA sequence, gene expression and DNA methylation. Their results show that the associations between the three are more complex than was previously thought.

Gutierrez-Arcelus et al. show that the mechanisms that control the association between the variations in DNA methylation and gene expression in individuals are likely to be different to those that are responsible for the establishment of methylation patterns during the process of cell differentiation. They also find that the association between DNA methylation and gene expression can be either active or passive, and can depend on the context in which they occur in our genome. Finally, where the two copies or alleles of a gene are not equally expressed in a given cell, the difference in expression is primarily regulated by DNA sequence variation, with DNA methylation having little or no role on its own. Equally complex interactions and effects are expected in further studies of genetic and epigenetic variation.

DOI: http://dx.doi.org/10.7554/eLife.00523.002

Introduction

DNA methylation is an essential (Li et al., 1992) epigenetic mark whose role in gene regulation and its dependency on genomic sequence and environment is not yet fully understood (Jones, 2012; Schubeler, 2012). DNA methylation in vertebrates occurs most commonly in cytosines that are adjacent to guanines (CpG sites). Mammalian DNA methylation levels are generally high, although CpG rich regions, called CpG islands (CGI), appear mostly unmethylated (Bird, 2002; Weber et al., 2007; Lister et al., 2009). The mechanisms of de novo methylation and maintenance of methylation patterns are well known (Shoemaker et al., 2011), and alterations of these can cause severe diseases (Robertson, 2005). Even though it was originally reported to be involved in gene silencing (Holliday and Pugh, 1975; Riggs, 1975), more recent studies have found that DNA methylation can also be positively correlated to gene transcription when found in gene bodies (Hellman and Chess, 2007; Lister et al., 2009). Additionally, its participation in gene expression is proving to be highly variable, ranging from marking alternative intra-genic promoters (Maunakea et al., 2010), to being affected by transcription factors (TFs) at enhancers (Stadler et al., 2011) or it-self affecting the binding of TFs such as MYC (Prendergast and Ziff, 1991). Hence, whether DNA methylation is a consequence of gene regulation, or whether it controls gene expression changes—that is, whether it plays a passive or an active role in gene regulation—still remains a topic of debate (Schubeler, 2012).

Additionally, DNA methylation can be affected by environment (Kaminsky et al., 2009) but it has also been proven that regions in the genome can autonomously determine DNA methylation states (Lienert et al., 2010). Furthermore, studies looking at natural DNA methylation variation in human populations have shown that genetic variation influences DNA methylation levels in different cell-types (Gibbs et al., 2010; Zhang et al., 2010; Bell et al., 2011), but the mechanisms by which this occurs are far from clear. In these and other studies (Kulis et al., 2012), the association between DNA methylation and gene expression in a population context, where the same gene and same methylation site can be compared across multiple individuals, has been reported to be both positive and negative. Overall, the nature of the relationships among genetic variants, DNA methylation and gene expression are still unclear despite some initial efforts (van Eijk et al., 2012).

In this study we dissect the mechanistic relationships between inter-individual DNA methylation and gene expression variation using DNA sequence variability and TF abundance measured by RNA-Seq. By assaying in a high resolution and genome wide level these three layers of information in three different cell-types originating from the same set of individuals, we are able to study the role of DNA methylation variation in different dimensions. Our results reveal a picture where DNA methylation variable sites are mechanistically associated to gene expression in complex and context dependent ways that can be of passive or active nature. We further highlight some of the mechanisms by which passive DNA methylation may occur and how this role can interplay with genetic variation.

Results

Associations of genetic variation, DNA methylation and gene expression

We use the GenCord collection (Dimas et al., 2009) of umbilical cord samples from 204 newborn babies of central European descent, from which we derived three cell-types: fibroblasts (primary cells), T-cells (primary cells) and lymphoblastoid cells (immortalized cell lines, LCLs) (Figure 1). We genotyped each individual for 2.5 million SNPs, and sequenced the poly-A transcriptome of all three cell-types from the 204 individuals yielding a median of 16 million exonic reads per sample. We subsequently removed samples from 18 to 21 genetic or expression outliers, yielding a final set of 183–186 individuals per cell type (Figure 1—figure supplements 1 and 2). The assayed SNPs were imputed to the Phase 1 release of the 1000 genomes project (Abecasis et al., 2012) yielding a set of 6.9 million SNPs. We obtained normalized expression levels for 70,800–76,870 exons belonging to 12,265–12,863 genes (Figure 1—figure supplements 3 and 4). DNA methylation levels were measured using bisulfite-conversion and hybridization to a bead chip, assaying 416,118 CpG sites in 66–118 samples. Normalized methylation levels of CpG sites range from 0 to 1, reflecting the percentage of methylation per site (β-value; Figure 1—figure supplements 5–7). In total, we analyzed 66–186 samples per cell-type and assay, belonging to 195 individuals (Table 1; see ‘Materials and methods’).

Figure 1. GenCord project scheme.

We collected umbilical cord and cord blood samples from 204 newborn babies, from which we derived three cell-types: fibroblasts, lymphoblastoid cells and T-cells. Genotyping, RNA-sequencing and DNA methylation levels were assayed. The number of samples without genetic and technical outliers is indicated for each assay and each cell-type. We then correlated and utilized different properties of all datasets in order to assess: expression Quantitative Trait Loci (eQTLs), methylation QTLs (mQTLs), positive (pos) and negative (neg) expression Quantitative Trait Methylation (eQTMs). Green ticks represent Single Nucleotide Polymorphisms (SNPs), purple lollipops represent methylation sites, black boxes represent exons and orange arrows depict associations between two data-types. Shown are the maximum distances between each pair of variables tested. See Figure 1—figure supplement 1–7 for data processing and quality checks.

DOI: http://dx.doi.org/10.7554/eLife.00523.003

Figure 1.

Figure 1—figure supplement 1. Genetic outliers removed from analyses involving genetic variation.

Figure 1—figure supplement 1.

Multidimensional scaling (MDS) plot showing the genetic clustering of our 204 GenCord individuals (in black) and 30 individuals from each of the following HapMap populations: Western and Northern European in Utah (CEU, in blue), Japanese in Tokyo (JPT, in green), Yoruba in Ibadan, Nigeria (YRI, in orange), Gujarati Indians in Houston (GIH, in purple), Mexican ancestry in Los Angeles (MXL, in pink). The 16 GenCord individuals inside gray circles were considered genotypic outliers and were not included in analyses involving genetic variation.
Figure 1—figure supplement 2. Number of reads per sample, before removing technical outliers.

Figure 1—figure supplement 2.

(A) Number of total reads per sample. (B) Number of total reads mapping uniquely, properly paired and with MAPQ ≥ 10 to exons. (C) Proportion of reads mapping to exons. Vertical red lines indicate median. Samples with less than 5M exonic reads were considered technical outliers.
Figure 1—figure supplement 3. Covariates for which expression data was corrected.

Figure 1—figure supplement 3.

(A) Histogram of mean GC content per library before removing outliers. (B) Histogram of insert size mode per library before removing outliers. The two samples in the extreme left were considered technical outliers and were removed. (C)–(F) p value distributions, with π1 indicated in each plot, of the linear regression effects of the four covariates we later corrected for on scaled exon levels.
Figure 1—figure supplement 4. Pair wise correlations among individuals before and after covariate correction.

Figure 1—figure supplement 4.

(A) Histograms of scaled exon counts pair wise spearman correlation coefficients between samples (all libraries scaled to 10M reads) in fibroblasts (F), LCLs (L) and T-cells (T). (B) Histogram of pair wise spearman correlation coefficients of expression levels after covariate correction between individuals in each cell-type.
Figure 1—figure supplement 5. Normalized β-value and variance across individuals.

Figure 1—figure supplement 5.

(A) Normalized beta-value distribution in fibroblasts (F), LCLs (L) and T-cells (T). (B) Distribution of normalized β-value variance per site across individuals.
Figure 1—figure supplement 6. Normalized β-value pair wise correlations between individuals.

Figure 1—figure supplement 6.

Distributions of pair wise spearman correlation coefficients between samples in fibroblasts (F), LCLs (L) and T-cells (T).
Figure 1—figure supplement 7. β-value distributions in distinct genomic features for expressed and non-expressed genes.

Figure 1—figure supplement 7.

(A) Median β-value across individuals is plotted for CpG sites by their position relative to the nearest transcription start site (TSS) in each cell-type, for expressed (blue) and non-expressed (red) genes. Based on the region where expressed genes have lower methylation levels than non-expressed genes, the promoter proximal region of our analyses was defined from -1kb to +2kb relative to the TSS. (B)–(D) β-value distributions in genes (B), 1kb window upstream of TSS (C) and open-chromatin (D) in one LCL sample for expressed (blue) and non-expressed (red) genes. Other cell-types look very similar.

Table 1. Summary of main association analyses in GenCord.

See Table 1–source data 1

DOI: http://dx.doi.org/10.7554/eLife.00523.011

Table 1—source data 1.
Significant eQTL, mQTL and eQTM associations found in each cell-type.
elife00523s001.zip (3.3MB, zip)
DOI: 10.7554/eLife.00523.012
Test Samples Window size Unit FDR (%) Nominal p value Fibroblasts LCLs T-cells
eQTLs Genotypes and expression 183 (F); 185 (L); 186 (T) 1 Mb Genes 10 2.2 × 10−5 (F); 3.2 × 10−5 (L); 1.8 × 10−5 (T) 2433 3372 2115
mQTLs Genotypes and methylation 107 (F); 111 (L); 66 (T) 5 kb Methylation sites 10 4.4 × 10−4 (F); 7.9 × 10−4 (L); 1.3 × 10−3 (T) 14,189 22,411 32,318
eQTMs Methylation and expression 110 (F); 118 (L); 66 (T) 50 kb Genes 10 7.6 × 10−5 (F); 7 × 10−4 (L); 6.9 × 10−4 (T) 596 3680 3838

At 10% FDR, we identify 2115–3372 expression quantitative trait loci (eQTLs) using a 1 Mb window to either side of the TSS. We discover 14,189–32,318 methylation QTLs (mQTLs) using a 5-kb window to either side of the CpG site. We find 1541–17,267 significant expression to methylation associations (eQTMs) using a 50-kb window around the TSS; these pertain to 596–3838 genes and 970–6846 CpG sites (Table 1, Figure 1).

Orthogonal roles of developmental and inter-individual DNA methylation variation

DNA methylation at promoter regions is widely known to correlate negatively with gene expression levels when looking at comparisons across genes (Jones, 2012). We observe the same pattern in our study looking at the promoter regions of all genes of each individual separately and for each cell-type (Figure 2—figure supplement 1). The across individual methylation-gene expression associations (eQTMs) however appear to be either positive or negative, even for DNA methylation sites in promoter regions. Hence, we hypothesized that methylation sites in promoter regions from positive (pos) and negative (neg) eQTMs contribute differently (positively and negatively, respectively) to gene expression levels across genes. Contrary to our hypothesis, we find that methylation sites correlate negatively with gene expression across genes independently of whether they correlate positively or negatively with gene expression across individuals (Figure 2A; Spearman correlation coefficient, rho = −0.11, p=1.1 × 10−4, and rho = −0.10, p=1.7 × 10−13, respectively; see also Figure 2—figure supplement 2). The strength of these negative correlations, despite involving a subset of genes, is comparable to the one found at a genome-wide level, correlating all expressed genes with their promoter DNA methylation status per individual (Figure 2—figure supplement 1C). These results suggest that the mechanisms and processes underlying inter-individual DNA methylation variation associated to gene expression are at least partly independent of the mechanisms involved in the establishment of the repressive mark of promoter DNA methylation across genes during development and differentiation.

Figure 2. Inter-individual DNA methylation variation in cell-type differentiation and in different genomic contexts.

(A) The median methylation level of promoter eQTM sites (x-axis) correlates negatively with across gene median number of reads per kilobase per million reads (RPKM) irrespective of whether they are pos-eQTMs (yellow, N = 1149) or neg-eQTMs (blue; N = 5112). Spearman correlation coefficient rho is indicated in the plot with p=1.1 × 10−4 and p=1.7 × 10−13 for pos and neg-eQTMs, respectively. See Figure 2—figure supplements 1 and 2. (B) As the level of cell-type methylation differentiation increases (x-axis), a larger proportion of sites are associated to gene expression (eQTMs, left y-axis) and affected by genetic variation (mQTLs, right y-axis). Proportions are plotted by 10 bins each containing 10% of the data (0.1 quantiles). Level of methylation differentiation is measured for each site as the coefficient of variation of the median methylation level per cell-type. See Figure 2—figure supplement 3. (C) Proportion of eQTMs that are positive (pos-eQTMs, yellow) or negative (neg-eQTMs, blue) overlapping vs non-overlapping (expected) distinct genomic features (promoters, CTCF binding sites, enhancers), or overlapping CpG island promoters (CGI prom) vs overlapping non-CpG island promoters (non-CGI prom). For T-cells there are no CTCF or chromatin ChIP-seq data available so the data of an LCL were used instead (see Materials and methods). (D) For each non-CGI and CGI promoters (x-axis), the proportion (y-axis) of overlapping mQTLs was calculated (red bars) and was compared to the proportion of overlapping null SNPs (black bars). One star indicates p<0.05, two stars indicate p<1 × 10−6, Fisher’s exact test.

DOI: http://dx.doi.org/10.7554/eLife.00523.013

Figure 2.

Figure 2—figure supplement 1. Correlation between promoter DNA methylation and across gene expression.

Figure 2—figure supplement 1.

(A) Shown is the median methylation level across all sites falling in the promoter region of a gene (x-axis) by the gene expression level of the gene (y-axis) as the log2 of the median number of reads per kilobase per million reads (RPKM). The background scatter plot represents the data of a single individual, while the fitted lines depict the smoothened mean distribution of values for each individual. Non-expressed genes were artificially plotted at the bottom of the plot. (B) Histogram of Spearman rank correlation coefficients for each sample of correlations explained in (A). All p values<2.2×10−16. (C) Distribution of Spearman rank correlation coefficients for each sample of correlations explained in (A) taking only expressed genes, which were used in our eQTM analyses (at least 1 exonic read in >90% of individuals). In fibroblasts (F) all p values<6.2×10−8, in LCLs (L) all p values<8.2×10−18, in T-cells (T) all p-values<2.6×10−44.
Figure 2—figure supplement 2. pos and neg-eQTMs correlation with across gene expression.

Figure 2—figure supplement 2.

Shown are the correlations between the median methylation levels of eQTM sites (x-axis) and the median gene expression levels (y-axis) for positive-eQTMs (left panels) and negative-eQTMs (right panels) in fibroblasts (F), LCLs (L) and T-cells (T). The Spearman correlation coefficient rho and p value are indicated in each plot. Fitted lines depict the smoothened mean distribution of values.
Figure 2—figure supplement 3. Tissue-specific methylation is enriched with eQTMs and mQTLs.

Figure 2—figure supplement 3.

(A) Distribution of the level of differentiation metric which we defined as the coefficient of variation of the median methylation level per cell-type for each site (i.e., standard deviation of the medians divided by the mean of medians). (B) eQTM sites have higher levels of differentiation than non-eQTM sites (C) mQTL sites have higher levels of differentiation than non-mQTL sites. Stars between box plots indicate that the difference is significant with p<2.2×10−16, Wilcoxon test. eQTM and mQTL median methylation levels cover a wide range of β-values, so removing the 113 sites that were below the minimum or above the maximum median (representing effectively unmethylated or fully methylated sites) in all three cell-types did not alter the results at all.

Differentially methylated regions among cell-types have been shown to play important roles in cell differentiation and tissue-specific regulation (Meissner et al., 2008; Schmidl et al., 2009; Hodges et al., 2011; Li et al., 2012). Hence, we asked whether differentially methylated sites across the three cell-types are more likely to be functionally relevant when variable within a population, hence more often associated to gene expression (eQTMs) or genetic variation (mQTLs) than non-differentiated sites across cell-types. We measured the level of differentiation for each site by calculating its median methylation level in each cell-type and then calculating the coefficient of variation between those three medians, which is a measure of variability that controls for the mean level of methylation (Figure 2—figure supplement 3A). We find that as differentiation per methylation site increases, a higher proportion of eQTMs and mQTLs are observed (Figure 2B). This is statistically supported by the fact that the tissue differentiation level of methylation sites involved in eQTMs and mQTLs is significantly higher than for non-eQTM and non-mQTL sites, respectively (all p values<2.2 × 10−16, Wilcoxon test, Figure 2—figure supplement 3B,C). These results show that the same methylation sites marking or contributing to tissue differentiation are participating in inter-individual variability that is highly determined by genetic variation and associated to gene expression. This could indicate that the establishment of differentially methylated regions during development could be delineating a backbone where local inter-individual changes may occur.

Context-specific DNA methylation

In order to further understand the nature of these inter-individual changes we analyzed their participation in different genomic contexts. We find a significant increased presence of negative compared to positive eQTMs in CTCF binding sites, enhancers and promoters, with this increase being higher in non-CpG island (CGI) promoters compared to CGI promoters (Figure 2C). Interestingly, we discover a significant depletion of mQTLs in CGI promoters, and a significant enrichment of mQTLs in non-CGI promoters (Figure 2D). This shows that methylation sites in non-CGI promoters are under a stronger genetic control than at CGI promoters, where methylation is in general robustly maintained at low levels. Nevertheless, genetic variation in CpG islands commonly affects gene expression, since we find that eQTLs are significantly enriched at these genomic regions (all p values<0.03). Overall, these results suggest that the role of DNA methylation can be highly dependent on the genomic and functional context.

DNA methylation and allele specific expression

Allele specific expression (ASE), seen as a signal of regulatory difference between two haplotypes of an individual, can in theory be driven either by genetic regulatory variation or epigenetic (in)activation of one of the two alleles, for example, by DNA methylation. In order to test whether ASE is caused by genetic variation or differential DNA methylation, we compared the magnitude of allelic imbalance in eQTL genes between individuals who are heterozygous for the eQTL and their respective homozygotes. Similar to other studies (Dimas et al., 2009; Pickrell et al., 2010; Montgomery et al., 2010), at a genome-wide level a significantly greater allelic imbalance is associated with heterozygote eQTLs in all three cell-types (p values<3.0 × 10−8, Figure 3A). In order to test whether ASE is driven by haplotype differences in DNA methylation alone, potentially represented by semimethylated sites (partially methylated, β-value >0.3 and <0.7), we analyzed the genes having eQTM methylation sites that are not associated with SNPs (filtered out SNP-methylation correlations with nominal p<0.01). A comparison of allelic imbalance between individuals with semimethylated eQTMs, and homomethylated (i.e., fully methylated, β-value > 0.7, or unmethylated, β-value < 0.3) eQTMs, revealed no significant differences (p values>0.42, Figure 3A). Additionally, ASE driven by heterozygote eQTLs is significantly higher than that of semimethylated eQTMs (p values<7.5 × 10−3, Figure 3A). Furthermore, as a positive control, we observe that allelic imbalance in imprinted genes, known to be driven by allelic DNA methylation, is significantly higher than in homomethylated eQTMs (p values<9.7 × 10−4, Figure 3A). Thus, while ASE is shown to be significantly driven by genetic variation (or by DNA methylation in imprinted genes), we find no evidence in our data of methylation alone contributing to ASE. This argues that DNA methylation is rarely allele-specific in the absence of DNA sequence variation effects, and while we cannot exclude other possible epigenetic sources of ASE, the results suggest that the widespread ASE across the genome may be primarily driven by common and rare genetic regulatory variants.

Figure 3. DNA methylation associated to gene expression is not significantly allelic and can interact with genetic variation.

Figure 3.

(A) In green are depicted the distributions of allelic imbalance (i.e., absolute distance from the expected 0.5 ratio) of assayable heterozygote sites in eQTL genes of individuals that are homozygous (HOM) or heterozygous (HET) for the eQTL. The difference between distributions is significant in all cell-types with p<2.2 × 10−16, p=2.7 × 10−15 and p=3.0 × 10−8 in fibroblasts (F), LCLs (L) and T-cells (T), respectively (Wilcoxon test). This strongly indicates that allele specific expression is significantly driven by genetic variation. In purple are shown the distributions of allelic imbalance of assayable heterozygote sites in eQTM genes (excluding methylation sites affected by genetic variation) of individuals that are homomethylated (HOMOMETH, i.e., fully methylated, β-value > 0.7, or unmethylated β-value < 0.3) or semimethylated (SEMIMETH, i.e., β-value >0.3 and <0.7) for the eQTM site. The difference between distributions is not significant in any cell-type, with p=0.79, p=0.42 and p=0.49 in F, L, T, respectively. The difference between distributions of HET eQTLs and SEMIMETH eQTMs is significant in all cell-types with p=7.5 × 10−3, p=3.4 × 10−7, p=1.1 × 10−13, in F, L, T, respectively. The difference between distributions of IMPRINTED genes and HOMOMETH eQTMs is significant in all cell-types with p=6.9 × 10−34, p=9.7 × 10−4, p=8.5 × 10−5 in F, L, and T, respectively. This shows that allele specific expression is not significantly driven by DNA methylation that is not affected by genetic variation. (B) Using linear regression we tested the interaction term as shown in the illustrated formula for the effects of SNPs (green tick) and methylation sites (purple lollipop) on gene expression (black squared arrow and boxes). Qqplots illustrate the enrichment of low synergistic interaction observed p values (black), together with the 5th and 95th confidence limits based on expression permutations (gray) with respect to the expected uniform distribution.

DOI: http://dx.doi.org/10.7554/eLife.00523.017

Synergistic interactions between DNA methylation and genetic variation on gene expression

We then sought to explore the mechanistic relationships among DNA methylation, gene expression and genetic variation. We first tested whether there was a significant enrichment of synergistic interactions between genetic variants and DNA methylation on gene expression using linear regression. We selected for each exon all eQTLs with p<1 × 10−4 that fell in independent recombination intervals and all eQTMs with p<0.001. To avoid artificial inflation of significant interactions, we filtered out any exon-SNP-meth triplets where the SNP and methylation site correlated with p<0.05. To avoid spurious interactions caused by outliers, we further filtered out cases where there were less than four individuals homozygous for the minor allele of a SNP. Finally, to further account for remaining correlation between the SNP and the methylation we permuted the expression values 1000 times to infer the 95% confidence intervals and assess the significance of the enrichment of low p values. Synergistic interactions are enriched in LCLs and T-cells (Figure 3B) with π1 being 4.3% and 9.3% (all p<0.001), which reflects the percent of estimated true positives from the p value distributions (Storey and Tibshirani, 2003). In fibroblasts, although the p value distribution is not above the 95th confidence limit (Figure 3B), the observed π1 is 12.4% (p<0.001), 7.2 times larger than the top permuted π1, which suggests a significant enrichment of interactions. Finally, we find 3, 91 and 14 individually significant interactions in fibroblasts, LCLs and T-cells, respectively, at 10% false discovery rate (FDR). Overall, these results reflect the interdependency of genetic and epigenetic variation to determine gene expression levels.

Passive and active roles of DNA methylation in gene regulation

We further dissected the causative relationships between DNA methylation and gene expression by considering that the SNP triggers the causal network since its state is not modifiable in time. We used Bayesian Network (BN) construction and relative likelihood (see ‘Materials and methods’) to test which of the three possible causative models depicted in Figure 4A and described below is the most likely given the data for each set of variables. Under the ‘INDEP’ (independent) model, a SNP affects independently gene expression and DNA methylation (passive role of methylation). Under the ‘SME’ (SNP-methylation-expression) model, the SNP affects methylation, which then affects gene expression (active role). The ‘SEM’ (SNP-expression-methylation) model requires that a SNP affects gene expression, and expression then affects methylation (passive role). We tested the relative likelihood of these models choosing a non-biased approach where at least two of the three pairwise correlations between genetic variation, gene expression and DNA methylation are significant, yielding 831–2928 SNP-methyl-exon triplets tested per cell-type. All three models occur, depending on cell-type and sign of correlation between DNA methylation and gene expression (Figure 4B, Figure 4—figure supplement 1A). The INDEP model, that reflects a passive role for DNA methylation, is the most likely model in fibroblasts and LCLs. However, in T-cells the SME model, where methylation takes an active causative role, is the major contributor. Note that in the SEM model, where methylation is passive being influenced by gene expression levels, a higher likelihood of pos-eQTMs is found compared to the SME model. Overall, these results suggest that DNA methylation can be both active, by being a likely cause of gene expression variation levels, or passive, by being a consequence or an independent mark of gene expression levels.

Figure 4. Passive and active roles of DNA methylation in gene regulation.

(A) Illustration of the three possible causative models tested of mechanistic relationships between genetic variation (SNP), DNA methylation (methyl) and gene expression (expr). Arrows indicate the causal direction of effects. The name of each model is underlined. (B) Mosaicplots illustrate the relative likelihoods of each model (x-axis), partitioned by the relative likelihoods of those involving pos-eQTMs (yellow) and neg-eQTMs (blue; y-axis), in fibroblasts (F), LCLs (L) and T-cells (T). The three types of models (INDEP, SME and SEM) are present in the three cell-types, suggesting that DNA methylation can have both active and passive roles in gene regulation. See Figure 4 –figure supplement 1 and Figure 4–Source data 1. (C) Heatmap of p value relative frequency distributions of spearman correlations between transcription factors (TF) and DNA methylation levels of eQTMs at their binding sites, sorted by π1. The enrichment of significant associations can be appreciated by the accumulation of reddish colors, reflecting higher relative frequencies, at low p values, and yellowish colors, reflecting lower relative frequencies, at higher p values. These results highlight one of the possible mechanisms of a passive role of DNA methylation regarding gene expression. See Figure 4—figure supplements 2 and 3.

DOI: http://dx.doi.org/10.7554/eLife.00523.018

Figure 4—source data 1.
High confidence calls for INDEP, SME and SEM models in each cell-type.
elife00523s002.zip (49.6KB, zip)
DOI: 10.7554/eLife.00523.019

Figure 4.

Figure 4—figure supplement 1. Passive and active roles of DNA methylation on gene regulation.

Figure 4—figure supplement 1.

(A) For all exon-SNP-methyl triplets tested the best model was inferred using Bayesian networks (BN) and relative likelihood. Mosaicplots illustrate the relative frequencies of each model (x-axis), partitioned by the relative frequencies of pos-eQTMs (yellow) and neg-eQTMs (blue; y-axis), in fibroblasts (F), LCLs (L) and T-cells (T). (B) Same as in (A) only including high confidence calls (see ‘Materials and methods’ and Figure 4—source data 1).
Figure 4—figure supplement 2. Associations between TF abundance and DNA methylation at their target binding sites.

Figure 4—figure supplement 2.

Observed (left panels) and expected (right panels) p value distributions of spearman correlations between transcription factor expression levels and DNA methylation levels of eQTMs at their binding sites (see ‘Materials and methods’) in fibroblasts (F), LCLs (L) and T-cells (T). π1 statistic, reflecting the fraction of true positives, is indicated on top of each p value distribution.
Figure 4—figure supplement 3. Interactions between SNPs and transcription factor levels on DNA methylation.

Figure 4—figure supplement 3.

(A) We tested whether genetic variants that could potentially alter TF binding would interact with TF abundance (transcription level) on their effect on DNA methylation levels. To do so we took the top mQTL SNPs that fell in TF peaks and whose associated methylation site correlated significantly with the TF expression level of that peak (10% FDR for both types of associations), and required that the SNP and the TF abundance were not correlated and that the SNP had at least four minor allele homozygotes (N = 114). (B) We find an enrichment of low p values for interactions between genetic variants and TF abundance on DNA methylation levels, with π1 estimating 15% of true positives. (C) Plotted is the top interaction with p=1×10−4 involving the TF c-Jun in T-cells. The SNP falls in the binding peak of c-Jun. The methylation site and the SNP are 172 bp away from each other and both fall in an intron of gene SLC9A9.

In order to obtain a set of high confidence calls for each of the models, we have required that the BN calls are confirmed by an independent method called the Causal Inference Test (Millstein et al., 2009) (see details in ‘Materials and methods’). From the total number of tests, 61%, 36% and 27% were called as high confidence (HC) in fibroblasts, LCLs and T-cells, respectively (Figure 4—source data 1). The relative frequencies of these HC calls look similar to the general relative likelihood space of the models (Figure 4—figure supplement 1B). As an example of a HC INDEP model, we identified in fibroblasts a case occurring at the promoter of gene DPYSL4 and involving a methylation site associated to age and age rate in blood samples (Hannum et al., 2013). In this example, SNP rs12772795 affects independently the DNA methylation status of site cg05652533 and the expression level of gene DPYSL4, possibly via an effect on the binding of CTCF nearby (given binding peak reported), which is a factor known to alter DNA methylation levels locally (Stadler et al., 2011). As an example of an SME model, we have identified in T-cells SNP rs1362125 that could be affecting the binding of SP1 (overlapping peak reported), which is a factor shown to confer methylation protection (Boumber et al., 2008). This could then alter the methylation state of site cg24703717 (78bp away) that falls in a YY1 binding peak, a factor whose binding is known to be sensitive to DNA methylation levels (Kim et al., 2003). Hence methylation could actively alter binding of YY1, negatively affecting the expression of the gene HLA-F. Finally, we would like to highlight in LCLs an SEM scenario in which SNP rs3733346, located in a DNAse hypersensitive site, affects the expression of gene DGKQ, whose transcriptional activity could then be influencing positively DNA methylation levels of site cg00846425 located in its gene body, as has been suggested to be a possible phenomenon for gene body DNA methylation (Hahn et al., 2011).

DNA methylation associated to transcription factor abundance

To understand potential molecular causes for the passive role of methylation in the INDEP model we postulated that SNPs could be influencing binding levels of DNA-binding factors (hereon called transcription factors or TFs), which in turn affect the methylation status of a site in parallel to their effect on gene expression. It has been shown in mice (Stadler et al., 2011) that TFs can influence DNA methylation levels near their binding sites (TFBSs). In addition, by correlating expression levels of TFs with DNA methylation at their TFBSs in different cell-types, this pattern has also been observed for some TFs in human tissues (Thurman et al., 2012). In order to test whether human inter-individual differences of TF expression levels would affect DNA methylation that is itself associated to gene expression we used the ENCODE dataset of TFBSs based on ChIP-seq assays for 111 TFs (Bernstein et al., 2012; Gerstein et al., 2012), and correlated their expression levels with eQTM methylation levels found at their reported binding sites. We found a strong enrichment of significant TF-methylation associations, with π1 being 18%, 9% and 25%, in fibroblasts, LCL, and T-cells, respectively (Figure 4—figure supplement 2). At 10% FDR, we find significant associations for 27, 47 and 99 different TFs in fibroblasts, LCL and T-cells, respectively. A strong enrichment of significant associations can also be appreciated for each TF individually as depicted in Figure 4C. We further tested whether genetic variants that could potentially alter TF binding would interact with TF abundance on their effect on DNA methylation levels (see ‘Materials and methods’). Despite the limited number of cases that could be tested (N = 114), we find an enrichment of low p values for such interactions, with π1 estimated at 15% of true positives, and the top interaction involving the TF c-Jun in T-cells (p=1 × 10−4, Figure 4—figure supplement 3). These results suggest that TF levels could be influencing DNA methylation and gene expression levels simultaneously. It is possible that in some cases this could occur via the interaction with DNA sequence variation. This could illustrate a realistic model for passive but at the same time correlated levels of DNA methylation with gene expression.

Discussion

In this study, we have shown that inter-individual DNA methylation changes are mechanistically associated to genetic variation and gene expression in complex and context dependent ways that can be of passive or active nature. DNA methylation levels can depend on TF binding (Stadler et al., 2011). TF binding levels in turn can be determined by both inter-individual differences of TF abundance and genetic variants at their binding sites. This picture shows how DNA methylation can be linked to genetic variation, and can participate in gene regulation in a passive manner. This scenario is quite different from the often-assumed simple model of genetic variation affecting the epigenome, which in turn affects the transcriptome, and furthermore these relationships can be of different nature depending on tissue and genomic region. This is likely to be true for other epigenetic mechanisms as well, and properly characterizing these relationships will enhance our understanding of genome function and complex disease.

Additionally, our results suggest that non-genetically determined DNA methylation is rarely allelic but rather defined at the whole cell level. That is, when we observe a population of cells to be semimethylated the most likely scenario is that only a fraction of the cells are fully methylated rather than one of the alleles in each of the cells is methylated. Nevertheless, given the limitations of our data, further studies addressing this question are needed to confirm our results.

Further, we show that methylation sites that represent a repressive state with respect to tissue differentiation can also participate in positive and negative associations with gene expression when looking at inter-individual variability. DNA methylation as a whole emerges both as marker and determinant of cellular identity and may provide a platform by which genetic variation and other variable factors among individuals (e.g., TFs) exert their effect to differentially influence cellular processes.

Materials and methods

Sample collection and cell growth

Fifty-six of the 204 newborn umbilical cord and cord blood samples are part of the 85 samples collected previously in Dimas et al. (2009). The remaining 148 samples have been newly collected, and primary fibroblasts, EBV-immortalized lymphoblastoid cell lines and primary T-cells have been prepared and grown as stated previously (Dimas et al., 2009), with a few exceptions: the preparation of LCLs was done with 1 ml of 20 million re-suspended cells and 1 ml of EBV that were transferred to a 24-well plate. In addition, to confirm that the cord and cord blood acquired from the hospital belonged to the same individual, DNA was extracted from cord tissue and LCLs separately with the Puregene cell kit (Gentra-Qiagen, Venlo, The Netherlands), microsatellite markers were assayed with QF-PCR and compared between the two samples coming from the same individual.

Genotyping and imputation

We extracted DNA from LCLs with the QIAamp DNA Blood Mini Kit (Qiagen, Venlo, The Netherlands) and genotyped the 204 individuals with the Illumina 2.5M Omni chip. We applied several filters before imputation with PLINK v1.07 (Purcell et al., 2007). We filtered out any SNPs with >5% missing genotypes, minor allele frequency (MAF) <1%, gene call score <0.25 for more than 1% of samples, or Hardy-Weinberg equilibrium p<1 × 10−6. This left 1,535,724 SNPs before imputation. We performed the imputation with Beagle v3.3.2 (Browning and Browning, 2009), into 21 million European panel SNPs of the 1000 genomes March 2012 release (Abecasis et al., 2012). From the final output, we required an R2 ≥ 0.9. This yielded a total of 6.9 million SNPs. For association analyses (eQTLs and mQTLs) we filtered out genetic outliers based on multidimensional scaling using HapMap populations (Altshuler et al., 2010; Figure 1—figure supplement 1), and we further filtered out any SNPs with MAF <5%, yielding sets of 5,209,348–5,278,330 SNPs.

RNA sequencing and quantification

RNA was extracted from LCLs, fibroblasts and T-cells with RNeasy columns (Qiagen, Venlo, The Netherlands) or Trizol (Invitrogen, Carlsbad, CA). RNA samples were quantified with NanoDrop (Thermo Scientific, Waltham, MA) and Qubit (Invitrogen, Carlsbad, CA), and analyzed with a 2100 Bioanalyzer (Agilent, Santa Clara, CA). Samples were prepared for sequencing with the Illumina mRNA-seq and TruSeq sample preparation kits (Illumina, San Diego, CA) as indicated by manufacturer’s instructions. In these library preparation procedures, poly-A RNA is selected using poly-T oligo-attached magnetic beads; then the RNA is cleaved, converted to first strand cDNA and after RNA digestion and second DNA strand synthesis, the fragments are end repaired and ligated to the adapters containing specific primer indexes. The cDNA libraries are then PCR amplified. Libraries were sequenced in either sets of 6 samples per lane in the Genome Analyzer II machine or 12 samples per lane in the HiSeq2000 machine, randomly pooling in the same lane samples from different cell-types and individuals. Afterwards, the 49-bp sequenced paired-end reads were mapped to the reference genome (Lander et al., 2001) GRCh37 with BWA v0.5.9 (Li and Durbin, 2009). Using SAMtools (Li et al., 2009), we kept reads mapping uniquely to the genome, with MAPQ ≥ 10 and properly paired. In order to quantify exons in a non-redundant way, we created a set of merged exons from the GENCODE v10 annotation (Harrow et al., 2006). In more detail, we took all protein coding and lincRNA transcripts and merged any overlapping exons into new exon units. We then counted the number of reads mapping to each exon unit (Figure 1—figure supplement 2). Technical outliers having less than 5M exonic reads or extremely low insert size mode were removed from the study. In order to keep only the set of exons that we considered expressed in each cell-type, we filtered out exons that had no reads mapped in 10% or more of the samples per cell-type. This yielded sets of 70,800–76,870 exons belonging to 12,265–12,863 expressed genes depending on the cell-type and analysis. Afterwards, we normalized the raw exon counts by scaling all libraries to 10 million reads, based on the total number of exonic reads per sample. In order to remove technical variance from our samples, we tested the effect of many different covariates on our scaled exon counts in each cell-type by linear regression. We decided to correct for mean GC content per library, run date, primer index and insert size mode, using a random effects model (Figure 1—figure supplement 3). Thus, our final exon expression levels are the residuals after correcting for those covariates plus the corrected mean (Figure 1—figure supplement 4).

DNA methylation assay

DNA was extracted with the QIAmp DNA blood mini kit (Qiagen, Venlo, The Netherlands) and bisulfite converted (BC) using the Zymo Research EZ DNA MethylationTM Kit (using the 50-column or 96-well formats; Zymo Research, Irvine, CA). The starting quantity of DNA was 1000 ng and after BC we normalized all our samples to have a final concentration of 40 ng/μl or the closest possible amount. BC-DNA was then processed through the 450K Illumina Infinium HD Methylation Assay (Illumina, San Diego, CA) according to manufacturer’s instructions. In brief, this is an assay that involves whole genome amplification, specific hybridization capture, and allele-specific single-base primer extension. Samples from the three cell-types were processed in a randomized manner. The controls available in the methylation assay for verifying the efficiency of our bisulfite conversion indicated that none of our samples had bad efficiency of bisulfite conversion. We removed three technical outliers that didn’t cluster well with any cell-type in a principal component analysis. From the 482,421 CpG sites assayed, we filtered out probes with 1000 genomes SNPs or indels at European minor allele frequency >5% and then any other probes that had any of the final set of SNPs used in our study. This yielded a total of 416,118 CpG sites to work with. We quantile normalized the data in the following way. The Illumina 450k array is composed of two probe types. Probe type 1 reads the methylated signal and unmethylated signal in the same channel (red or green) and probe type 2 reads the methylated signal in the green channel and the unmethylated signal in the red channel. The distribution of β-values before any normalization is quite different for each probe type, thus the two probe types were normalized separately. For probe type 1, the intensity for the methylated and unmethylated signals were quantile normalized across individuals for each color channel separately. As the density of signal intensities showed that there was a color bias (at least for the unmethylated probe), the color channels were quantile normalized within individuals for the methylated signal and unmethylated signal separately. For probe type 2, the methylated signals and unmethylated signals were quantile normalized across individual. As for this probe the methylated signal is always in green and the unmethylated signal is always in red. The color bias needs to be corrected before computing β-values. In order to correct the color bias, the methylated (unmethylated) signal probe type 2 normalized across individuals is quantile normalized with the methylated (unmethylated) signal of probe type 1 that was previously quantile normalized across individuals and within for color correction. In a final step, the β-values of the two probe types were quantile normalized together leading to the final β-values (Figure 1—figure supplements 5–7). The β-value is defined as the proportion of fluorescent signal from the methylated allele over the total fluorescent signal (methylated and unmethylated alleles); this represents the percentage of methylation per site (Bibikova et al., 2006).

Association analyses and multiple testing correction

We performed Spearman rank correlations (SRC) between (x and y variables): SNP genotypes and exon expression levels (eQTLs) in 183–185 samples using a 1-MB window to either side of the TSS, SNP genotypes and CpG site methylation levels (mQTLs) in 66–111 samples using a 5-kb window to either side of the CpG site, CpG site methylation levels and exon expression levels (eQTMs) in 66–118 samples using a 50-kb window to either side of the TSS. Different window sizes were tested for mQTL and eQTM analyses using a subset of 56 individuals. The final window size chosen is the one that yielded the highest enrichment of significant associations. As previously mentioned, association analyses involving genetic variation were ran with SNPs with MAF ≥ 5% and excluding genetic outlier individuals (Figure 1—figure supplement 1). Similar to previous analyses (Stranger et al., 2007; Montgomery et al., 2010), for each type of association analysis and each cell-type the phenotype belonging to the y variable was permuted 1000 times and the median p value distribution was summarized among the genes or CpG sites. For expression levels this was done on all exons belonging to 1000 randomly selected genes and for each gene the minimum p value distribution out of all its exons was chosen. For methylation levels, 50,000 random methylation probes were selected. Based on the null p value distributions yielded by the permutations, we selected for each analysis the associations that passed the permutation threshold that yielded 10% or less False Discovery Rate (FDR) in a per gene/CpG basis.

DNA methylation differentiation metric

We have used the coefficient of variation (CV) as a measure of DNA methylation differentiation level, as it is a variability measure that controls for the mean methylation. In detail, for each site we have calculated the standard deviation of the median methylation in each cell-type, divided by the mean of the three medians. By using the median methylation level per site for each cell-type, our differentiation measure is estimating the between cell-type variation only, not taking into account the within cell-type variation.

Context specific analyses

Enhancers and CTCF peaks, were downloaded from the UCSC genome browser tables (Browning and Browning, 2009) and come from ChIP-seq experiments of the ENCODE project (Birney et al., 2007; Rosenbloom et al., 2010; Bernstein et al., 2012) and specific groups (Boyle et al., 2008; Ernst and Kellis, 2010; Ernst et al., 2011; Thurman et al., 2012). For LCLs we used data belonging to the cell line GM12878. For fibroblasts we used data belonging to the cell line NHLF. Unfortunately, there was limited data available for T-cells so we used the GM12878 data, given its close lineage relatedness. CpG islands (CGIs) were downloaded from the UCSC genome browser (Karolchik et al., 2004). The promoter region is defined as going from -1kb of the TSS to +2kb of the TSS, based on methylation pattern observed in this region (Figure 1—figure supplement 7A). CGI promoters are defined as promoter regions overlapping any CGI.

The genomic feature enrichment analyses of eQTLs and mQTLs were done by comparing the number of observed overlaps with those found in a null set. The null set selects SNP-exon or SNP-CpG pairs with MAF, distance and expression or methylation values matching the distribution of eQTLs and mQTLs, respectively, from genes or CpG sites lacking any significant associations. The differential genomic feature overlap analyses of eQTMs were done by comparing the number of observed overlaps for positive and negative eQTMs, with the number of expected overlaps in non-eQTM sites. Fisher’s exact tests were used for assessing the significance of each analysis.

Intersections of genomic features were done with BEDTools v2.7.1 (Quinlan and Hall, 2010).

Allele specific expression analysis

The allelic imbalance (i.e., absolute distance from the expected 0.5 ratio) was measured for assayable heterozygote sites. First, we excluded sites that are susceptible to allelic mapping bias: 1) sites with 50-bp mapability (Karolchik et al., 2004) <1 implying that the 50-bp flanking region of the site is non-unique in the genome, and 2) simulated RNA-Seq reads overlapping the site show >5% difference in the mapping of reads that carry the reference or non-reference allele. In all the analyses, we used mapping and base quality threshold ≥10. Next, we calculated the expected reference allele ratio for each individual by summing up reads across all sites separately for each SNP allele combination after down-sampling reads of sites in the top 25th coverage percentile in order to avoid the highest covered sites having a disproportionally large effect on the ratios. These expected REF/TOTAL ratios (typically 0.494–0.517 as measured by 10th and 90th quantiles across samples) correct for remaining slight genome-wide mapping bias as well as GC bias in each individual. Finally, for all the sites covered by ≥16 reads in each individual, we calculated the absolute distance from the expected 0.5 ratio described above. Because our genotypes included imputed data, we verified the genotype from the RNA sequencing data by requiring the observation of both of the SNP alleles in at least one cell type per individual.

The number of genes in the homozygote and heterozygote eQTL analysis in Figure 3A are 1473 and 1453 in fibroblasts, 2197 and 2177 in LCLs, and 1288 and 1284 in T-cells, respectively. The number of genes in the homo-methylated and semi-methylated eQTM analysis in Figure 3A are 279 and 106 in fibroblasts, 1738 and 794 in LCLs, 1544 and 969 in T-cells, respectively. The number of genes in the imprinted genes analysis in Figure 3A are 29, 17, 15 in fibroblasts, LCLs and T-cells, respectively. For measuring allelic imbalance in imprinted genes, all the filtering and requirements mentioned above were applied except we did not require that both SNP alleles are seen. Imprinted genes were taken from http://www.geneimprint.com.

Interaction permutation analysis

The expression values of the exon-SNP-methyl triplets tested were permuted 1000 times to assess the expected p value distributions for the interaction between genetic variation and DNA methylation on gene expression. For each round of permutations a π1 was calculated, creating a null distribution of π1 from which an empirical p value can be inferred (as reported in the main text). Additionally, the 95% confidence intervals were calculated from the 1000 permutations of all tests in each cell-type. False discovery rates of the reported significant results were calculated using the QVALUE R package (Storey and Tibshirani, 2003).

Causative model analysis

Bayesian networks (BN) are directed acyclic graphs where nodes represent random variables and edges represent the conditional dependences among nodes. The direction of the edges between two nodes can be interpreted as a causal relationship. BN have been used before to infer causality in genetic problems (Schadt et al., 2005; Zhu et al., 2008). Likelihood methods are commonly used to estimate the most likely network—that is, the set of causal relationships among the different variables that better agrees with the data.

In a BN every node has an associated probability function and, together with the conditional dependencies represented by the edges, they conform the joint probability density of the network. BN satisfy the local Markov property—that is, each variable is conditionally independent of its non-descendants given its parent variables. The Markov property defines the decomposition of the joint probability density of the network into a set of local distributions. Thanks to the Markov property it is easy to calculate the likelihood of a given BN. Since different networks have different complexities, it is common to use a score that takes into account the network complexity instead of the raw likelihood to compare the networks. We used the R package bnlearn (Scutari, 2010) to calculate the maximum likelihood of the networks and the Akaike Information Criterion (AIC) score. AIC = 2k − 2ln(L), where k is the number of parameters and L is the maximum likelihood. To calculate how much better one network is compared to another, we used the relative likelihood of one network with respect to the other. If we have two networks, N1 and N2, with fixed parameters and AIC(N1) ≤ AIC(N2), then the relative likelihood of N2 with respect to N1 is defined as: exp((AIC(N1)−AIC(N2))/2). The networks we have compared (mentioned in main text) have five parameters each.

To test the networks, we selected three datasets that were then merged and normalized for size in order to calculate the relative frequencies of the inferred best network. In the first dataset we selected the best eQTM and for that methylation site the best mQTL, not requiring statistical significance for the eQTL (n = 219, 1074 and 1128 in fibroblasts [F], LCLs [L], and T-cells [T], respectively), which would enrich for the SME model since the eQTL is the most distant association. In the second dataset, for each exon we selected its best eQTL and its best eQTM; not requiring significance for the mQTL (n = 287, 1316 and 871 in F, L, T), which would enrich for the SEM model since the mQTL is the most distant association. And in the third, for each exon we selected its best eQTL and kept the cases where the SNP was the best mQTL for a methylation site; not requiring significance for the eQTM, which would enrich for the INDEP model since the eQTM is the most distant association (n = 413, 726, 775, in F, L, T).

In order to create a list of high confidence calls, we have made use of a semi-parametric method, the Causal Inference Test (Millstein et al., 2009), that requires a level of significance for calling two out of the three models. The CIT makes SME and SEM calls if either has a p<0.05, makes no call if p<0.05 for both models (highly infrequent) and makes a call for the INDEP if neither SME or SEM has a significant p-value. We have decided to call a high confidence set of SME and SEM models when they are called as such by both the BN and the CIT methods. Since the INDEP calls in the CIT do not require significance, we called a high confidence INDEP if 1) it is called INDEP by both methods, and 2) its BN relative likelihood (RL) to the second best model is equal or higher to the median RL of the SME and SEM high confidence calls to their second best prediction.

Transcription factor expression to methylation level analyses

As explained in the text, the transcription factor (TF) expression levels were Spearman rank correlated to methylation levels of eQTM sites falling in their reported TF binding sites (Bernstein et al., 2012; Gerstein et al., 2012), excluding any sites within 1 Mb of the TF gene transcription start site. The number of expressed TFs in each cell-type is 94, 106 and 105, comprising 666, 851 and 776 exons in fibroblasts, LCLs and T-cells, respectively. The number of targeted eQTM methylation sites in each cell-type is 459, 3832 and 3497 in fibroblasts, LCLs and T-cells, respectively. The number of tests performed is 17,747, 209136 and 152263 in fibroblasts, LCLs and T-cells, respectively.

To show the expected uniform distribution of p values for these correlations and compare it with the (highly enriched) observed, we have permuted once the TF expression values for each TF-meth correlation in each cell-type and plotted the p-value distributions (Figure 4—figure supplement 2). To call significant cases for the target sites of each TF, the false discovery rate was calculated for each using the QVALUE R package (Storey and Tibshirani, 2003).

We tested the interactions between SNPs and TF abundance on their effect on DNA methylation by taking the top mQTL SNPs that fell in TF peaks and whose associated methylation site correlated significantly with the TF expression level of that peak (10% FDR for both types of associations). We excluded cases in which the SNP and the TF abundance were correlated (p<0.05) and in which the number minor allele homozygotes was less than four. This yielded a total set of 114 SNP-TF-meth triplets for which the interaction between SNP and TF abundance was tested by linear regression (Figure 4—figure supplement 1).

Data availability

Genotyping, RNA-seq and DNA methylation data have been submitted to the EMBL-EBI European Genome-Phenome Archive (https://www.ebi.ac.uk/ega/) under accession number EGAS00001000446. Zipped text file tables with the identified eQTLs, mQTLs and eQTMs for each cell-type are available as Table 1—source data 1. Zipped text file tables with the high confidence SME, SEM and INDEP calls for each cell-type are available as Figure 4—source data 1.

Acknowledgements

We thank the genomics platform of the University of Geneva, VITAL-IT and the data producers from the ENCODE Consortium (Bernstein et al., 2012). We would also like to thank Helena Kilpinen, Alexandra C Nica, Daniel Robyr, Reza Sailani and Federico Santoni for comments and useful discussions.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Funding Information

This paper was supported by the following grants:

  • European Research Council (ERC) to Stylianos E Antonarakis.

  • European Research Council (ERC) to Emmanouil T Dermitzakis.

  • Swiss National Science Foundation to Stylianos E Antonarakis.

  • Swiss National Science Foundation to Emmanouil T Dermitzakis.

  • Louis-Jeantet Foundation to Emmanouil T Dermitzakis.

  • Blueprint to Stylianos E Antonarakis, Emmanouil T Dermitzakis.

Additional information

Competing interests

SEA: Senior editor, eLife.

ETD: Reviewing editor, eLife.

The other authors declare that no competing interests exist.

Author contributions

MG-A, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article.

TL, Analysis and interpretation of data, Drafting or revising the article.

AB, Analysis and interpretation of data, Drafting or revising the article.

HO, Analysis and interpretation of data, Drafting or revising the article.

AY, Analysis and interpretation of data, Drafting or revising the article.

JB, Analysis and interpretation of data, Drafting or revising the article.

SBM, Conception and design, Analysis and interpretation of data, Drafting or revising the article.

SEA, Conception and design, Analysis and interpretation of data, Drafting or revising the article.

ETD, Conception and design, Analysis and interpretation of data, Drafting or revising the article.

TG, Analysis and interpretation of data.

AL, Analysis and interpretation of data.

PM, Analysis and interpretation of data.

LR, Acquisition of data.

AP, Acquisition of data.

EF, Acquisition of data.

DB, Acquisition of data.

MG, Acquisition of data.

IP, Acquisition of data.

CB, Acquisition of data.

MG, Acquisition of data.

CG, Acquisition of data.

Ethics

Human subjects: Informed consent was obtained from all subjects. The local ethics committee at University Hospitals of Geneva has approved this project (CER 10-046).

Additional files

Major datasets

The following dataset was generated:

Emmanouil T Dermitzakis, , 2013, Gencord, EGAS00001000446; https://www.ebi.ac.uk/ega/studies/EGAS00001000446, Publicly available at European Genotype Phenotype Archive.

References

  1. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, et al. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65. 10.1038/nature11632 [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, et al. 2010. Integrating common and rare genetic variation in diverse human populations. Nature 467:52–8. 10.1038/nature09298 [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, et al. 2011. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol 12:R10. 10.1186/gb-2011-12-1-r10 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74. 10.1038/nature11247 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bibikova M, Lin Z, Zhou L, Chudin E, Garcia EW, Wu B, et al. 2006. High-throughput DNA methylation profiling using universal bead arrays. Genome Res 16:383–93. 10.1101/gr.4410706 [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Bird A. 2002. DNA methylation patterns and epigenetic memory. Genes Dev 16:6–21. 10.1101/gad.947102 [DOI] [PubMed] [Google Scholar]
  7. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, et al. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799–816. 10.1038/nature05874 [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Boumber YA, Kondo Y, Chen X, Shen L, Guo Y, Tellez C, et al. 2008. An Sp1/Sp3 binding polymorphism confers methylation protection. PLoS Genet 4:e1000162. 10.1371/journal.pgen.1000162 [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, et al. 2008. High-resolution mapping and characterization of open chromatin across the genome. Cell 132:311–22. 10.1016/j.cell.2007.12.014 [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Browning BL, Browning SR. 2009. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84:210–23. 10.1016/j.ajhg.2009.01.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Dimas AS, Deutsch S, Stranger BE, Montgomery SB, Borel C, Attar-Cohen H, et al. 2009. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science 325:1246–50. 10.1126/science.1174148 [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Ernst J, Kellis M. 2010. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol 28:817–25. 10.1038/nbt.1662 [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al. 2011. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473:43–9. 10.1038/nature09906 [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, et al. 2012. Architecture of the human regulatory network derived from ENCODE data. Nature 489:91–100. 10.1038/nature11245 [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Gibbs JR, van der Brug MP, Hernandez DG, Traynor BJ, Nalls MA, Lai SL, et al. 2010. Abundant quantitative trait loci exist for DNA methylation and gene expression in human brain. PLoS Genet 6:e1000952. 10.1371/journal.pgen.1000952 [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Hahn MA, Wu X, Li AX, Hahn T, Pfeifer GP. 2011. Relationship between gene body DNA methylation and intragenic H3K9me3 and H3K36me3 chromatin marks. PLoS One 6:e18844. 10.1371/journal.pone.0018844 [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Hannum G, Guinney J, Zhao L, Zhang L, Hughes G, Sadda S, et al. 2013. Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol Cell 49:359–67. 10.1016/j.molcel.2012.10.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, et al. 2006. GENCODE: producing a reference annotation for ENCODE. Genome Biol 7(Suppl 1):S4.1–9. 10.1186/gb-2006-7-s1-s4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Hellman A, Chess A. 2007. Gene body-specific methylation on the active X chromosome. Science 315:1141–3. 10.1126/science.1136352 [DOI] [PubMed] [Google Scholar]
  20. Hodges E, Molaro A, Dos Santos CO, Thekkat P, Song Q, Uren PJ, et al. 2011. Directional DNA methylation changes and complex intermediate states accompany lineage specificity in the adult hematopoietic compartment. Mol Cell 44:17–28. 10.1016/j.molcel.2011.08.026 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Holliday R, Pugh JE. 1975. DNA modification mechanisms and gene activity during development. Science 187:226–32. 10.1126/science.1111098 [DOI] [PubMed] [Google Scholar]
  22. Jones PA. 2012. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 13:484–92. 10.1038/nrg3230 [DOI] [PubMed] [Google Scholar]
  23. Kaminsky ZA, Tang T, Wang SC, Ptak C, Oh GH, Wong AH, et al. 2009. DNA methylation profiles in monozygotic and dizygotic twins. Nat Genet 41:240–5. 10.1038/ng.286 [DOI] [PubMed] [Google Scholar]
  24. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, et al. 2004. The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32:D493–6. 10.1093/nar/gkh103 [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Kim J, Kollhoff A, Bergmann A, Stubbs L. 2003. Methylation-sensitive binding of transcription factor YY1 to an insulator sequence within the paternally expressed imprinted gene, Peg3. Hum Mol Genet 12:233–45. 10.1093/hmg/ddg028 [DOI] [PubMed] [Google Scholar]
  26. Kulis M, Heath S, Bibikova M, Queiros AC, Navarro A, Clot G, et al. 2012. Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Nat Genet 44:1236–42. 10.1038/ng.2443 [DOI] [PubMed] [Google Scholar]
  27. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921. 10.1038/35057062 [DOI] [PubMed] [Google Scholar]
  28. Li H, Durbin R, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–60. 10.1093/bioinformatics/btp324 [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Li E, Bestor TH, Jaenisch R. 1992. Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell 69:915–26. 10.1016/0092-8674(92)90611-F [DOI] [PubMed] [Google Scholar]
  30. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–9. 10.1093/bioinformatics/btp352 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Li Y, Chen G, Ma L, Ohms SJ, Sun C, Shannon MF, et al. 2012. Plasticity of DNA methylation in mouse T cell activation and differentiation. BMC Mol Biol 13:16. 10.1186/1471-2199-13-16 [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Lienert F, Wirbelauer C, Som I, Dean A, Mohn F, Schubeler D. 2010. Identification of genetic elements that autonomously determine DNA methylation states. Nat Genet 43:1091–7. 10.1038/ng.946 [DOI] [PubMed] [Google Scholar]
  33. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, et al. 2009. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462:315–22. 10.1038/nature08514 [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Maunakea AK, Nagarajan RP, Bilenky M, Ballinger TJ, D’Souza C, Fouse SD, et al. 2010. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature 466:253–7. 10.1038/nature09165 [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, et al. 2008. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454:766–70. 10.1038/nature07107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Millstein J, Zhang B, Zhu J, Schadt EE. 2009. Disentangling molecular relationships with a causal inference test. BMC Genet 10:23. 10.1186/1471-2156-10-23 [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, et al. 2010. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464:773–7. 10.1038/nature08903 [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, et al. 2010. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464:768–72. 10.1038/nature08872 [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Prendergast GC, Ziff EB. 1991. Methylation-sensitive sequence-specific DNA binding by the c-Myc basic region. Science 251:186–9. 10.1126/science.1987636 [DOI] [PubMed] [Google Scholar]
  40. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–75. 10.1086/519795 [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–2. 10.1093/bioinformatics/btq033 [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Riggs AD. 1975. X inactivation, differentiation, and DNA methylation. Cytogenet Cell Genet 14:9–25. 10.1159/000130315 [DOI] [PubMed] [Google Scholar]
  43. Robertson KD. 2005. DNA methylation and human disease. Nat Rev Genet 6:597–610. 10.1038/nrg1655 [DOI] [PubMed] [Google Scholar]
  44. Rosenbloom KR, Dreszer TR, Pheasant M, Barber GP, Meyer LR, Pohl A, et al. 2010. ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Res 38:D620–5. 10.1093/nar/gkp961 [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, Guhathakurta D, et al. 2005. An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 37:710–7. 10.1038/ng1589 [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Schmidl C, Klug M, Boeld TJ, Andreesen R, Hoffmann P, Edinger M, et al. 2009. Lineage-specific DNA methylation in T cells correlates with histone methylation and enhancer activity. Genome Res 19:1165–74. 10.1101/gr.091470.109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Schubeler D. 2012. Molecular biology. Epigenetic islands in a genetic ocean. Science 338:756–7. 10.1126/science.1227243 [DOI] [PubMed] [Google Scholar]
  48. Scutari M. 2010. Learning Bayesian Networks with the bnlearn R Package. J Statist Soft 35:1–22 [Google Scholar]
  49. Shoemaker R, Wang W, Zhang K. 2011. Mediators and dynamics of DNA methylation. Wiley Interdiscip Rev Syst Biol Med 3:281–98. 10.1002/wsbm.124 [DOI] [PubMed] [Google Scholar]
  50. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Scholer A, et al. 2011. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature 480:490–5. 10.1038/nature10716 [DOI] [PubMed] [Google Scholar]
  51. Storey JD, Tibshirani R. 2003. Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100:9440–5. 10.1073/pnas.1530509100 [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, et al. 2007. Population genomics of human gene expression. Nat Genet 39:1217–24. 10.1038/ng2142 [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. 2012. The accessible chromatin landscape of the human genome. Nature 489:75–82. 10.1038/nature11232 [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. van Eijk KR, de Jong S, Boks MP, Langeveld T, Colas F, Veldink JH, et al. 2012. Genetic analysis of DNA methylation and gene expression levels in whole blood of healthy human subjects. BMC Genomics 13:636. 10.1186/1471-2164-13-636 [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Weber M, Hellmann I, Stadler MB, Ramos L, Paabo S, Rebhan M, et al. 2007. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet 39:457–66. 10.1038/ng1990 [DOI] [PubMed] [Google Scholar]
  56. Zhang D, Cheng L, Badner JA, Chen C, Chen Q, Luo W, et al. 2010. Genetic control of individual differences in gene-specific methylation in human brain. Am J Hum Genet 86:411–9. 10.1016/j.ajhg.2010.02.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Zhu J, Zhang B, Smith EN, Drees B, Brem RB, Kruglyak L, et al. 2008. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet 40:854–61. 10.1038/ng.167 [DOI] [PMC free article] [PubMed] [Google Scholar]
eLife. 2013 Jun 4;2:e00523. doi: 10.7554/eLife.00523.023

Decision letter

Editor: Chris P Ponting1

eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.

Thank you for choosing to send your work entitled “Passive and active DNA methylation and the interplay with genetic variation in gene regulation” for consideration at eLife. Your article has been evaluated by a Senior editor and 2 reviewers, one of whom is a member of our Board of Reviewing Editors.

The Reviewing editor and the other reviewer discussed their comments before we reached this decision, and the Reviewing editor has assembled the following comments to help you prepare a revised submission.

The study tackles an important and timely set of questions in the nexus of DNA methylation, gene regulation, and genetic variation. The genomics dataset collected in this study is exemplary and, indeed, is one of the first of the next generation of “genetics–genomics” studies, with three cell types (two of them primary) isolated from over 200 newborns, all genotyped, and profiled for expression (RNA-Seq) and DNA methylation.

In this sense, all of us would be delighted to see the data published, and with the appropriate analytics and conclusions this could be an important paper, well within the scope of eLife. Indeed, it is precisely because the dataset is likely to be broadly used, because the design could be deployed in many other contexts by other groups, and because the conclusions address fundamental questions in gene regulation, that we feel that it is critical that the analytics underlying the conclusion be very carefully and very strongly supported.

In this respect, unfortunately, we found the current manuscript lacking in two important ways:

1) Very strong and broad conclusions were drawn as we detail below, but they were not equally strongly supported in the actual analyses/data. These should be toned down to the appropriate level, with some probably removed or replaced with hypothesis level assertions; and

2) In multiple places throughout the work, the statistics need to be performed more carefully, with appropriate controls, which we detail specifically below.

3) Next, once the statistical analyses are done appropriately, the authors are likely to remain with a far fewer number of significant loci/genes, but those would be individually significant, and the authors should highlight some of those specific examples, instead of focusing solely on aggregate numbers as currently written.

4) Finally, since some of the study’s inherent value is as a resource, it is critical that the manuscript provides processed results in summary tables (as source data files), listing all genes, associated SNPs, and other traits. It is also critical to state clearly within the manuscript where and how the data can be obtained by others (with the expected constraints associated with human genotype data).

We further detail below how to address points 1 and 2, in the context of each of the seven major claims of the paper. We emphasize that without a serious analytic revision, the manuscript will not be appropriate for publication in eLife.

We identify seven major claims in the paper. We list each here and its extent of support, and make specific suggestions for how to address each analytic shortcoming.

* Claim 1 addresses eQTMs, the relation between variation in methylation and expression. The authors draw the conclusion that eQTMs are mostly independent of the mechanism involved in the repressive effect of DNA methylation on expression across genes. This is a strong conclusion, but is currently poorly supported. Specifically, the eQTMs were partitioned into only two bins (positive and negative effect). It might be, however, that most of the pos- and neg-eQTM effects are insignificant (around 0) and therefore some important inter-relations are masked within a sea of noise. Instead, the same plots should be regenerated for eQTMs split into a larger number of bins (e.g., 5 bins: very-neg, neg, no-sign, pos, very-pos-eQTMs), to show whether the negative correlation holds in all of these 5 bins, and more generally is not an artifact of the binning.

* Claim 2 addresses differentially methylated regions (DMRs) across cell types and asked whether DMRs are more often associated to eQTMs and mQTLs than non-DMRs.

First, the relation between DMRs and eQTMs might be a byproduct of their relation with the average methylation. This is because the higher the average, the higher the variance, and thus we may get a higher differentiation per methylation. To control for this possibility the authors should use a 'normalized differentiation in methylation' (i.e., use the residuals of differentiation per methylation after removing the effect of the methylation average), and then use these normalized (residual) values to repeat the same test relating (residual) differentiation with mQTLs and eQTMs.

Second, regardless of the validity of the test, the additional biological conclusions are over-stretched and should be toned down.

* Claim 3 associates mQTLs and eQTMs with different features. The differential association of mQTLs to CGI and non-CGI promoters is intriguing and well supported. The analysis of pos- and neg-eQTMs should be repeated with finer binning (as claim 1, above), but is otherwise satisfactory.

* Claim 4 shows lack of support for a contribution of methylation to allele specific expression, independent of sequence variation. This is a strong assertion, but is also a negative result. Our main concern is that it is hard to devise an appropriate control here. We urge the authors to remove this result, unless they can provide additional support.

* Claim 5 tests for synergistic interactions between genetic variants and DNA methylation on gene expression. Although the authors note that they “controlled” for artificial inflation, Figure 3C shows that there is a clear inflation in the resulting P-values. What is the reason for this inflation? Without correcting for this, it is impossible to judge the number of significant relations. We advise that the authors use at least bootstrapping for some empirical correction, and preferably correct for population structure and other confounders.

* Claim 6 concerns different causality models and their proportions in the different cell types. Each of the three models has a different number of parameters, and the correction with AIC is based on theoretical considerations and hence might not compensate for the different degrees of freedom. In addition, there is no evaluation of the significance of the resulting predicted models (which then precludes the authors – and future readers – from following up on individual findings). To correct for such potential artifacts, we suggest applying a direct evaluation of the distribution of relative likelihood for each pair of models based on data reshuffling (permutation test). Based on such tests, it is possible to (A) get a threshold for each of the likelihood comparisons using a common P-value threshold. This would allow comparing results from different cell types and models using a common systematic threshold; and (B) discriminate significant vs. non-significant relative likelihood scores. Once this is done the authors should report how many of the results are significant and what the statistics are when using significant predictions only, and, finally, they should highlight specific examples, which would enhance the paper's impact and biological relevance.

* Claim 7 tries to couch the findings in a mechanistic setting, claiming that TF levels modulate the effect of DNA sequence variation. This is an intriguing and important direction, but it has to be done carefully.

First, as above, it is not clear whether the reported P-values are corrected for the special characteristics of the data. Applying a permutation test and deriving a corrected P-value based on such an analysis would allow evaluating the statistical significance of the reported results. Furthermore, while we do not require the authors to conduct follow up biological experiments, in the absence of such follow up, the conclusions should be toned down significantly.

eLife. 2013 Jun 4;2:e00523. doi: 10.7554/eLife.00523.024

Author response


1) Very strong and broad conclusions were drawn as we detail below, but they were not equally strongly supported in the actual analyses/data. These should be toned down to the appropriate level, with some probably removed or replaced with hypothesis level assertions; and

We have toned down the conclusions and performed additional analyses to further support some of our results, as detailed below.

2) In multiple places throughout the work, the statistics need to be performed more carefully, with appropriate controls, which we detail specifically below.

We have dealt with the concerns below either by improving the statistics or by clarifying why the statistical analyses used in the original submission were appropriate and further clarified in the text when necessary.

3) Next, once the statistical analyses are done appropriately, the authors are likely to remain with a far fewer number of significant loci/genes, but those would be individually significant, and the authors should highlight some of those specific examples, instead of focusing solely on aggregate numbers as currently written.

We have incorporated in the main text specific examples, as detailed in the response to claim 6.

4) Finally, since some of the study’s inherent value is as a resource, it is critical that the manuscript provides processed results in summary tables (as source data files), listing all genes, associated SNPs, and other traits. It is also critical to state clearly within the manuscript where and how the data can be obtained by others (with the expected constraints associated with human genotype data).

We have created tables as text files for the discovered eQTLs, mQTLs and eQTMs in each cell-type as a single zipped file as Table 1–source data 1. We have also generated a zipped file as Figure 4–source data 1 in which we report in text tables the high confidence calls of the inferred mechanistic relationships as detailed in claim 6.

The transcriptome, genotyping, and DNA methylation array data will be available in the EMBL-EBI European Genome-Phenome Archive under accession number EGAS00001000446.

We have described the location details of the primary data and supplementary files in Materials and methods under “Data availability”.

We identify seven major claims in the paper. We list each here and its extent of support, and make specific suggestions for how to address each analytic shortcoming.

* Claim 1 addresses eQTMs, the relation between variation in methylation and expression. The authors draw the conclusion that eQTMs are mostly independent of the mechanism involved in the repressive effect of DNA methylation on expression across genes. This is a strong conclusion, but is currently poorly supported. Specifically, the eQTMs were partitioned into only two bins (positive and negative effect). It might be, however, that most of the pos- and neg-eQTM effects are insignificant (around 0) and therefore some important inter-relations are masked within a sea of noise. Instead, the same plots should be regenerated for eQTMs split into a larger number of bins (e.g., 5 bins: very-neg, neg, no-sign, pos, very-pos-eQTMs), to show whether the negative correlation holds in all of these 5 bins, and more generally is not an artifact of the binning.

We believe that there has been some misunderstanding since the assessment of positive and negative eQTMs is only done for statistically significant eQTMs. Specifically, due to our multiple testing corrections, in order to call an eQTM, we require a high level of nominal significance (as indicated in Table 1) controlling for multiple testing with an FDR of 10%. As a consequence, the strength of the correlations of eQTMs as measured by the Spearman correlation coefficient are far from zero (see histograms below: Author response image 1), the minimum of the absolute value being 0.368, 0.308, and 0.407 in fibroblasts, LCLs, and T-cells, respectively. Furthermore, for the effect size of eQTMs as the slope of the linear regression of expression explained by DNA methylation, we also obtain values far from zero. The minimum of the absolute effect size is 3.16 and 7 in fibroblasts and T-cells, and LCLs have only 6 eQTMs with effect size below 1. Hence, our eQTMs do not contain insignificant associations, and thus the further binning is not necessary.

graphic file with name elife00523f005.jpg

Moreover, although we are dealing with 1140 pos-eQTMs and 5112 neg-eQTMs at promoter regions, these involve only a limited number of genes (703 and 1758 for pos and neg-eQTMs, respectively). Hence, binning our eQTMs would lead to a smaller number of genes per bin and consequently a lower dynamic range of gene expression levels, which would not allow a proper assessment of across gene expression.

Finally, we would like to highlight that the reported strength of the negative correlation of pos and neg-eQTM median methylation with across gene expression (rho = -0.11 and -0.10 for pos and neg-eQTMs, respectively) is comparable to the strength of the correlations between promoter DNA methylation and across gene expression measured per individual for all expressed genes, as illustrated in Figure 2–figure supplement 1C. Hence, the correlation we report in Figure 2A is as robust as that found at a genome-wide level.

Specific points in the paper have been clarified to reflect these issues.

* Claim 2 addresses differentially methylated regions (DMRs) across cell types and asked whether DMRs are more often associated to eQTMs and mQTLs than non-DMRs.

First, the relation between DMRs and eQTMs might be a byproduct of their relation with the average methylation. This is because the higher the average, the higher the variance, and thus we may get a higher differentiation per methylation. To control for this possibility the authors should use a ‘normalized differentiation in methylation’ (i.e., use the residuals of differentiation per methylation after removing the effect of the methylation average), and then use these normalized (residual) values to repeat the same test relating (residual) differentiation with mQTLs and eQTMs.

Second, regardless of the validity of the test, the additional biological conclusions are over-stretched and should be toned down.

The measure of DNA methylation differentiation that we are using (coefficient of variation, CV) already controls for the bias that the reviewers point out, that of a higher mean having a higher variance. We calculate the CV by measuring the standard deviation among the three medians of the methylation level per site in each cell-type divided by the mean of these three medians. In this manner, our differentiation measure is not positively correlated with average methylation as shown in the plot below (in fact, it is negatively correlated: Author response image 2).

graphic file with name elife00523f006.jpg

We would also like to emphasize that by using the median methylation level per site for each cell-type, our differentiation measure is estimating the between cell-type variation only, not taking into account the within cell-type variation. We have clarified these points in the main text, and in Materials and methods by including a section called “DNA methylation differentiation metric”. Additionally, we have toned down the conclusion of this analysis in the main text.

* Claim 3 associates mQTLs and eQTMs with different features. The differential association of mQTLs to CGI and non-CGI promoters is intriguing and well supported. The analysis of pos- and neg-eQTMs should be repeated with finer binning (as claim 1, above), but is otherwise satisfactory.

As explained above that the eQTMs are statistically significant and of substantial effect size, binning is not meaningful in this analysis.

* Claim 4 shows lack of support for a contribution of methylation to allele specific expression, independent of sequence variation. This is a strong assertion, but is also a negative result. Our main concern is that it is hard to devise an appropriate control here. We urge the authors to remove this result, unless they can provide additional support.

We understand the reviewers’ concern and have re-arranged the figure and added an additional positive control in order to better support our results (Figure 3A). We are of the opinion that this result is important and while not definitive it is still worth presenting with the necessary caveats.

As pointed out by the reviewers, our results show a lack of support for a contribution of methylation to allele specific expression, independent of sequence variation when comparing homomethylated eQTMs versus semimethylated eQTMs. This is contrary to the expected genetic effect, which leads to a significantly higher allelic imbalance in heterozygous compared to homozygous eQTLs. While the results as presented previously appeared as a “negative” result, we would like to point out that the allelic imbalance of heterozygous eQTLs is significantly greater than the allelic imbalance of semimethylated eQTMs (P = 7.50E-03, P = 3.40E-07, P = 1.1E-13 in fibroblasts, LCLs and T-cells, respectively), hence it is a positive result. Furthermore, we have incorporated a positive control of methylation effects on allelic imbalance, by measuring the allelic imbalance in genes reported to be imprinted, which represent a direct measure of ASE caused by an allelic signal of DNA methylation. As seen in the white boxplots of Figure 3A, allelic imbalance in imprinted genes is significantly higher than in homomethylated eQTMs (with P = 6.93E-34, P = 9.66E-04, P = 8.48E-05, in fibroblasts, LCLs and T-cells, respectively).

We have integrated these results into the main text, Figure 3A, and the Materials and methods. Moreover, we have highlighted in the Discussion the need for further studies to confirm these findings. We hope the reviewers find this additional support satisfactory and we remain open to any further suggestions that may arise.

* Claim 5 tests for synergistic interactions between genetic variants and DNA methylation on gene expression. Although the authors note that they “controlled” for artificial inflation, Figure 3C shows that there is a clear inflation in the resulting P-values. What is the reason for this inflation? Without correcting for this, it is impossible to judge the number of significant relations. We advise that the authors use at least bootstrapping for some empirical correction, and preferably correct for population structure and other confounders.

We understand the reviewers’ concern on the observed inflation in the resulting P-values and thank them for pointing that out. We can discard population structure confounders since we have already corrected for it (in all our analyses involving genetic variation) by excluding the genetic outliers depicted in Figure 1–figure supplement 1, as determined by multidimensional scaling genetic clustering using several HapMap populations. However, although we eliminated from our analyses any SNP-methyl correlation with P < 0.05, there could still be low levels of correlation that could potentially still inflate our P-values.

In order to better control for this, we have permuted the expression values of all the triplets tested, 1000 times, keeping SNP and methyl pairs of values in their original form. This allows us to maintain the correlation structure between the SNP and the methylation site and any inflation due to that, while randomizing the expression. From these permutations, we have derived the 95% confidence intervals and incorporated them into the qqplots of Figure 3B. The interaction P-values for fibroblasts fail to cross the upper 95th confidence limit, while for T-cells and LCLs the observed P-values remain mostly or entirely above the upper 95th confidence limit, respectively. Additionally, for each round of permutation of the tested triplets, we have calculated the π1 statistic. Hence, by creating a null distribution of π1 we can assess how different it is to the observed π1. The observed π1 of each cell-type was highly significant (all P-values < 0.001) with the top permuted π1 being 7.2, 8.5 and 2 times smaller than the observed π1 in fibroblasts, LCLs and T-cells, respectively.

We have included these results in the main text and Figure 3B, and clarified the details in Materials and methods under “Interaction permutation analysis”.

* Claim 6 concerns different causality models and their proportions in the different cell types. Each of the three models has a different number of parameters, and the correction with AIC is based on theoretical considerations and hence might not compensate for the different degrees of freedom. In addition, there is no evaluation of the significance of the resulting predicted models (which then precludes the authors – and future readers – from following up on individual findings). To correct for such potential artifacts, we suggest applying a direct evaluation of the distribution of relative likelihood for each pair of models based on data reshuffling (permutation test). Based on such tests, it is possible to (A) get a threshold for each of the likelihood comparisons using a common P-value threshold. This would allow comparing results from different cell types and models using a common systematic threshold; and (B) discriminate significant vs. non-significant relative likelihood scores. Once this is done the authors should report how many of the results are significant and what the statistics are when using significant predictions only, and, finally, they should highlight specific examples, which would enhance the paper’s impact and biological relevance.

We are not sure why the reviewers suggest that there is a different number of parameters for each model. The number of parameters of a discrete Bayesian network is defined as the sum of the number of logically independent parameters of each node given its parents (Chickering, 1995). We have 5 parameters in each of our three models, which we have now specified in the Materials and methods. Furthermore, in order to test an exon-SNP-methyl triplet we require that at least two of the three pair-wise correlations are significant; hence, permuting the data would break these prior requirements and would not truly assess the confidence of the called model. However, we understand the reviewers’ concern on better grasping the relative likelihood space between the models and reporting the number of significant predictions. In order to do this we have: 1) presented the general landscape of relative likelihoods of the models, 2) created a list of high confidence calls based on the overlap of two distinct methods.

For point (1), instead of calling the best model and then presenting the relative frequencies of each (previously Figure 4B, now Figure 4–figure supplement 1A), we present directly the relative likelihood space for the three models across all exon-SNP-methyl triplets tested. As seen in the new plot in Figure 4B, the relative contribution of the three models remains very similar to the one obtained before, with a slight general increase of pos-eQTM likelihood.

For point (2), creating a list of high confidence calls, we have made use of a semi-parametric method, the Causal Inference Test (CIT; Millstein et al. 2009), that requires a level of significance for calling two out of the three models. The CIT makes SME and SEM calls if either has a P < 0.05, it makes no call if P < 0.05 for both models (highly infrequent), and it makes a call for the INDEP if neither SME or SEM has a significant P-value. We have decided to call a high confidence set of SME and SEM models when they are called as such by both the Bayesian Networks (BN) and the CIT methods. Since the INDEP calls in the CIT do not require significance, we called a high confidence INDEP if 1) it is called INDEP by both methods, and 2) its BN relative likelihood (RL) to the second best model is equal or higher to the median RL of the SME and SEM high confidence calls to their second best prediction.

From the total number of tests, 61%, 36%, and 27% were called as high confidence (HC) in fibroblasts, LCLs, and T-cells, respectively (see details in table below). The relative frequencies of these HC calls have been added as Figure 4– figure supplement 1B (they look similar to the general landscapes). We have reported these results in the main text and we have described the details in the Materials and methods. Additionally, we provide text file tables of the high confidence calls in a zipped file as Figure 4–source data 1, in which the exon-SNP-methyl triplets are reported, as well as the Spearman rho value for inferring the sign of the correlation between expression and DNA methylation, and a column indicating the RL of the inferred model over the second best BN prediction, so that users can further filter by confidence if needed.

Cell-type Num Tests HC SME Median RL to 2nd HC SEM Median RL to 2nd HC INDEP INDEP RL threshold
L 2928 193 4.5 139 6.1 717 5.1
F 831 44 2.8 36 4.6 428 3.5
T 2584 292 5.1 67 3.1 345 5.0

Finally, we have described in the main text specific examples of a high confidence call for each of the three models.

* Claim 7 tries to couch the findings in a mechanistic setting, claiming that TF levels modulate the effect of DNA sequence variation. This is an intriguing and important direction, but it has to be done carefully.

First, as above, it is not clear whether the reported P-values are corrected for the special characteristics of the data. Applying a permutation test and deriving a corrected P-value based on such an analysis would allow evaluating the statistical significance of the reported results. Furthermore, while we do not require the authors to conduct follow up biological experiments, in the absence of such follow up, the conclusions should be toned down significantly.

The correlations between TF abundance and DNA methylation at their binding sites are calculated with Spearman rank correlation and therefore the P-values are expected to behave appropriately. However, for our and the reviewers’ peace of mind, we have permuted the TF expression values for each TF-methyl correlation in each cell-type. We now present the expected null P-value distribution (under permutation), which as expected was uniform, and the one we observe (with their associated π1 statistics) in the histograms added to Figure 4–figure supplement 2. We have also specified in the Materials and methods the usage of Spearman rank for these correlations and included a description of the permutation procedure.

Finally, we have also toned down the conclusions in the text.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Table 1—source data 1.

    Significant eQTL, mQTL and eQTM associations found in each cell-type.

    DOI: http://dx.doi.org/10.7554/eLife.00523.012

    elife00523s001.zip (3.3MB, zip)
    DOI: 10.7554/eLife.00523.012
    Figure 4—source data 1.

    High confidence calls for INDEP, SME and SEM models in each cell-type.

    DOI: http://dx.doi.org/10.7554/eLife.00523.019

    elife00523s002.zip (49.6KB, zip)
    DOI: 10.7554/eLife.00523.019

    Data Availability Statement

    Genotyping, RNA-seq and DNA methylation data have been submitted to the EMBL-EBI European Genome-Phenome Archive (https://www.ebi.ac.uk/ega/) under accession number EGAS00001000446. Zipped text file tables with the identified eQTLs, mQTLs and eQTMs for each cell-type are available as Table 1—source data 1. Zipped text file tables with the high confidence SME, SEM and INDEP calls for each cell-type are available as Figure 4—source data 1.


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd

    RESOURCES