WAVELET-BASED GENETIC ASSOCIATION ANALYSIS OF FUNCTIONAL PHENOTYPES ARISING FROM HIGH-THROUGHPUT SEQUENCING ASSAYS

Heejung Shim; Matthew Stephens

doi:10.1214/14-AOAS776

Author manuscript; available in PMC: 2018 Feb 2.

Published in final edited form as: Ann Appl Stat. 2015;9(2):655–686. doi: 10.1214/14-AOAS776

PMCID: PMC5795621 NIHMSID: NIHMS910681 PMID: 29399242

Abstract

Understanding how genetic variants influence cellular-level processes is an important step toward understanding how they influence important organismal-level traits, or “phenotypes,” including human disease susceptibility. To this end, scientists are undertaking large-scale genetic association studies that aim to identify genetic variants associated with molecular and cellular phenotypes, such as gene expression, transcription factor binding, or chromatin accessibility. These studies use high-throughput sequencing assays (e.g., RNA-seq, ChIP-seq, DNase-seq) to obtain high-resolution data on how the traits vary along the genome in each sample. However, typical association analyses fail to exploit these high-resolution measurements, instead aggregating the data at coarser resolutions, such as genes, or windows of fixed length. Here we develop and apply statistical methods that better exploit the high-resolution data. The key idea is to treat the sequence data as measuring an underlying “function” that varies along the genome, and then, building on wavelet-based methods for functional data analysis, test for association between genetic variants and the underlying function. Applying these methods to identify genetic variants associated with chromatin accessibility (dsQTLs), we find that they identify substantially more associations than a simpler window-based analysis, and in total we identify 772 novel dsQTLs not identified by the original analysis.

Key words and phrases: Wavelets, high-throughput sequencing assays, RNA-seq, DNase-seq, chromatin accessibility, ChIP-seq, genetic association analysis, hierarchical model, Bayesian inference, functional data

1. Introduction

Genetic association studies aim to understand the function of genetic variants by associating them with observable traits, or “phenotypes.” Although many association studies have focused on organismal-level phenotypes, such as human disease [e.g., WTCCC (2007)], association studies also provide a powerful tool for studying molecular-level phenotypes, such as gene expression [Cheung et al. (2010), Montgomery et al. (2010), Pickrell et al. (2010)], transcription factor binding [Karczewski et al. (2013), Kasowski et al. (2010)] and chromatin accessibility [Degner et al. (2012)]. Measurement of many molecular phenotypes has been recently transformed by the advent of cheap high-throughput sequencing technology, and corresponding experimental protocols (RNA-seq: [Marioni et al. (2008), Mortazavi et al. (2008), Wang et al. (2008)], ChIP-seq: [Barski et al. (2007), Johnson et al. (2007), Mikkelsen et al. (2007)], DNase-seq: [Boyle et al. (2008), Hesselberth et al. (2009)]), which provide high-resolution measurements across the whole genome. However, typical analyses fail to exploit these high-resolution measurements, instead aggregating the data at coarser resolutions, such as genes, or windows of fixed length.

In this paper we develop and apply association analysis methods that better exploit high-resolution measurements from high-throughput sequencing assays. We specifically focus on identifying genetic variants that are associated with an epigenetic phenomenon known as chromatin accessibility, measured using DNase-seq [Boyle et al. (2008), Degner et al. (2012)], both described in more detail below. However, the same or similar ideas could also be applied to association analyses of other high-throughput sequencing measurements.

Conceptually, the key idea is to treat the data from high-throughput sequencing assays as noisy measurements of an underlying “function” (in this case, chromatin accessibility) that varies along the genome. We then adapt methods from functional data analysis, based on wavelets, to develop a test for association between a covariate of interest (in this case, a genotype) and the shape of the underlying function. We also provide methods to estimate the shape of the genotype effect, which can help in understanding the potential mechanisms underlying the identified associations.

In outline, our methods first transform the data using a wavelet transform, and then model associations in the transformed space rather than the original data space. This approach makes modeling easier because we expect the effect of genotype on phenotype to exhibit a spatial structure in the original space, which corresponds to a sparse structure in the transformed space, and sparsity is relatively easy to model. Here we are borrowing ideas that have been developed, more generally, in the “functional mixed models” work of Morris and Carroll (2006), Morris et al. (2008), and Zhu, Brown and Morris (2011). In particular, Morris et al. (2008) presented a framework for identifying locations within a region that show significant effects of covariates. Other relevant work on wavelet methods for regression analysis of functional data includes Abramovich and Angelini (2006), Antoniadis and Sapatinas (2007), Fan and Lin (1998), Yang and Nie (2008), and Zhao and Wu (2008). Previous applications of wavelet-based methods in genomics include Clement et al. (2012), Day et al. (2007), Mitra and Song (2012), Spencer et al. (2006), Wu et al. (2010), and Zhang et al. (2008). Our main contributions are to embed the wavelet-based methods into a framework for association testing that is computationally tractable for large-scale genetic association analyses that involve hundreds of thousands of tests, and to demonstrate the practical potential of these methods for associating genetic variants with sequence-based molecular phenotypes.

2. Background

2.1. DNase-seq and chromatin accessibility

In brief, DNase-seq is an experimental protocol that measures the accessibility, or openness, of chromatin along the genome. Chromatin consists of both the DNA that makes up the genome and the proteins that package it within the cell nucleus. Accessibility is important because it is associated with biological function, and DNase-seq has been a useful tool for detecting functional elements of the genome [Boyle et al. (2008)]. Chromatin accessibility at any given location will vary from cell to cell, and although single-cell experiments are on the horizon, almost all current experiments provide average measurements over a population of cells, usually from the same individual.

The key step in the DNase-seq protocol is the use of an enzyme called DNase I to selectively cut the DNA at locations where the chromatin is accessible. There is a quantitative aspect to this selection: other things being equal, locations where the chromatin is more accessible will tend to be cut more often. The locations of these cut points are revealed by sequencing the ends of the resulting fragments of DNA, and mapping the sequences (the “reads”) back to the genome. The resulting data are then conveniently summarized by the counts, c_b, of the number of cut points at each base in the genome (for humans, b ≈ 1, …, 3 × 10⁹). (Note that c_b denotes the number of reads that start at base b, rather than the number of reads that cover base b, so each read is counted only once.) In analyses these counts are usually standardized to account for the total number of sequence reads generated for each sample, so we here use d_b = c_b/S where S is the total number of mapped reads in the experiment. Although the process is subject to considerable technical variation and other confounding factors, higher values of d_b generally correspond to higher accessibility of base b. (Technically, the DNase-seq protocol actually measures “DNase I sensitivity,” or sensitivity to cutting by the DNase I enzyme, which is a proxy for chromatin accessibility. For simplicity, we ignore this distinction here.)

A typical experiment will produce millions of sequence reads per sample, and these will be concentrated in the relatively small proportion of the genome that is most “accessible.” Thus, d_b = 0 for most bases b, but some regions will show substantial counts at each base. Further, where it exists, accessibility tends to extend over hundreds of bases and, more generally, d tends to exhibit local spatial autocorrelation (“spatial structure”). One important goal of our methods is to account for this structure in the analysis.

Here we consider data from Degner et al. (2012), who collected DNase-seq data on samples from 70 different human individuals, for whom extensive genome-wide genetic data are also available. By correlating the DNase-seq data with the genetic data, we aim to identify genetic variants associated with chromatin accessibility. Such genetic variants are referred to as dsQTLs (DNase I sensitivity Quantitative Trait Loci) by Degner et al. (2012). Identifying genetic variants that are associated with chromatin accessibility and other molecular phenotypes such as transcription factor binding and gene expression, can help provide insights into the mechanisms by which genetic variation influences gene regulation. Indeed, Degner et al. (2012) found that many of the dsQTLs they identified were also associated with gene expression (which is associated with protein production), suggesting that genetic variation affecting transcription factor binding and chromatin accessibility may explain a substantial proportion of genetic variation in protein production. Ultimately, by combining these types of data on molecular-level phenotypes, and integrating them with similar data on organismal level phenotypes, we hope to understand which genetic variants affect human disease susceptibility, and the biological mechanisms by which they operate [Nicolae et al. (2010)]. Identifying dsQTLs, as we do here, is one helpful step toward this larger goal.

2.2. Wavelets

Wavelets are a tool from signal processing that are commonly used to deal with spatially-structured (or temporally-structured) signals. In this paper we use the Haar Discrete Wavelet Transform (DWT), and this section provides a brief intuitive description of the DWT. Further, more formal, background on wavelets can be found in Mallat (1989).

Let $d = (d_{b})_{b=1}^{B}$ be the standardized counts from a DNase-seq experiment in a region of length B, assumed to be a power of 2 ($B = 2^{J}$). The DWT decomposes d into a series of “wavelet coefficients” (WCs), y = (y_sl), each of which summarizes information in d at a different scale (or resolution) s and location l. At the “zeroth scale” there is a single WC (y₀₁), which is simply the sum of the elements of d, $y_{01} = \sum_{b} d_{b}$. (This “zeroth scale” WC is not truly a WC, but we use this shorthand here for convenience.) This coefficient summarizes d at the coarsest possible level, by its sum. At the first scale there is also a single WC (y₁₁), which contrasts the counts in the first half vs the second half of the region. That is, $y_{11} := \sum_{b \leq B/2} d_{b} - \sum_{b > B/2} d_{b}$ (omitting a scaling constant that is usually used to normalize the WCs, but does not concern us here). This WC can be thought of as roughly capturing any trend in d across the region. At the second scale there are two WCs (y₂₁, y₂₂): the first contrasting the first quarter vs the second quarter of the region; and the second contrasting the third quarter vs the fourth quarter of the region. This process continues through the scales: at scale s there are $2^{s-1}$ WCs that contrast regions of length $2^{J-s}$, and hence capture higher-resolution features of d.

Since y is a linear transform of d, the DWT can be written as a matrix multiplication: y = W d where W is known as the DWT matrix. Further, the transform is one–one, so W is invertible, and d can be obtained from y by the “inverse discrete wavelet transform” (IDWT), d = W⁻¹y. We exploit this linearity of the IDWT later to obtain closed-form expressions for posterior mean and variances of effect sizes in the original scale (see Methods).
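As an illustration of the transform described above, the following Python sketch builds an unnormalized Haar DWT matrix W using the same convention as the text (a sum at the zeroth scale, then first-half minus second-half contrasts at each finer scale) and checks that the IDWT recovers d. The function name `haar_matrix`, the row ordering, and the toy vector `d` are assumptions made for illustration; they are not taken from the authors' software.

```python
import numpy as np

def haar_matrix(J):
    """Unnormalized Haar DWT matrix for a region of length B = 2**J.

    Row order (an illustrative choice): zeroth scale, then scales 1..J,
    with locations ordered left to right within each scale.
    """
    B = 2 ** J
    rows = [np.ones(B)]                      # zeroth scale: sum of d
    for s in range(1, J + 1):
        half = 2 ** (J - s)                  # length of each contrasted region
        for l in range(2 ** (s - 1)):        # 2**(s-1) WCs at scale s
            row = np.zeros(B)
            start = 2 * half * l
            row[start:start + half] = 1.0            # first region
            row[start + half:start + 2 * half] = -1.0  # second region
            rows.append(row)
    return np.vstack(rows)

W = haar_matrix(3)
d = np.array([4.0, 2.0, 0.0, 0.0, 1.0, 3.0, 5.0, 1.0])  # toy standardized counts
y = W @ d                      # DWT: wavelet coefficients
d_back = np.linalg.solve(W, y) # IDWT: recovers d because W is invertible
```

Here `y[0]` is the sum of `d` and `y[1]` contrasts the first four entries with the last four. The scaling constants that would make W orthonormal are omitted, as in the description above.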

Because the WCs are simply a one–one transform of d, y contains exactly the same information as d. However, WCs have two crucial properties that make them useful for settings where, as here, d is expected to have a spatial correlation structure: (i) where values of d may be strongly spatially correlated, the WCs tend to be less dependent, referred to as the “whitening” property of the wavelet transform; (ii) typically, many WCs will be small, with the signal concentrated in a few “big” WCs. As a result, one can obtain denoised (smoothed) estimates of a signal by ignoring or shrinking the smaller WCs (i.e., reducing them toward 0). This is called “wavelet denoising” [Donoho and Johnstone (1995)]. Here we effectively apply wavelet denoising to estimate the effect of a genetic variant on a signal, rather than to the signal itself [see also Morris and Carroll (2006) and Zhu, Brown and Morris (2011) for example].

3. Methods

Our data consist of DNase-seq data and genotype data at genetic variants (mostly Single Nucleotide Polymorphisms, or SNPs) across the whole genome on N individuals, and our goal is to assess whether the DNase-seq data is associated with the genotype data. In practice, we expect that SNPs affecting chromatin accessibility will tend to have a relatively local effect, an expectation supported by results in Degner et al. (2012). Thus, similar to Degner et al. (2012), we first divide the DNase-seq data into regions (of length B = 1024 in this case; see Results), and then test each region for association with all nearby SNPs. We will first describe the test for a single SNP, and then describe how we apply it to test all nearby SNPs.

Let dⁱ denote the vector of DNase-seq count data for individual i (i = 1, …, N). Thus, dⁱ is a vector of counts of length B = 2^J. Let gⁱ denote the genotype data for individual i at a single SNP of interest, coded as 0, 1, or 2 copies of the minor allele (so gⁱ ∈ {0, 1, 2}). Our aim is to assess whether the DNase-seq data is associated with genotype at this SNP. That is, can we reject the null hypothesis H₀ that d is independent of g?

In outline, our approach is as follows. First, we transform each phenotype vector dⁱ, using the DWT outlined above, to produce a new phenotype vector yⁱ of wavelet coefficients (WCs). Then, based on simplifying modeling assumptions detailed below, which combine information across WCs into a hierarchical model, we compute a likelihood-ratio test statistic Λ̂ testing H₀. Finally, since the modeling assumptions are unlikely to hold exactly in practice, we use permutation to assess significance of the observed value of Λ̂.

In more detail, let y_sl denote the vector of WCs at scale s and location l, and let γ_sl denote a binary indicator for whether y_sl is associated with g. The null hypothesis, H₀, is that there is no association between any WC and g, that is, γ_sl = 0 for all s and l.

To measure the support for γ_sl = 1 for a specific s, l, we use a Bayes Factor,

$$\mathrm{BF}_{sl}(y, g) := \frac{p(y_{sl} \mid g, \gamma_{sl} = 1)}{p(y_{sl} \mid g, \gamma_{sl} = 0)}. \tag{3.1}$$

To compute this Bayes Factor, we use the models and priors from Servin and Stephens (2007), which are based on assuming a standard normal linear regression for p(y_sl |g, γ_sl):

$$y_{sl}^{i} = \mu_{sl} + \beta_{sl} g^{i} + \varepsilon_{sl}^{i} \quad \text{with } \varepsilon_{sl}^{i} \sim N(0, \sigma_{sl}^{2}), \tag{3.2}$$

where μ_sl denotes the mean WC of individuals with gⁱ = 0; β_sl denotes the effect size of g on the WC; and $\varepsilon_{sl}^{i}$ is the residual error for sample i. With appropriate priors on σ_sl, μ_sl, β_sl given γ_sl (see Supplementary Material [Shim and Stephens (2015)]) the Bayes Factor BF_sl has a simple analytic form.

To combine information across scales s and locations l, we build a hierarchical model for the γ_sl, assuming

$$p(\gamma_{sl} = 1 \mid \pi) = \pi_{s}, \tag{3.3}$$

where π = (π₀, …, π_J) is a vector of hyperparameters, with π_s representing the proportion of WCs at scale s that are associated with g. Then, assuming independence across scales and locations, the likelihood ratio for π, relative to π ≡ 0 (i.e., π_s = 0 ∀s), is given by

$$\Lambda(\pi; y, g) := \frac{p(y \mid g, \pi)}{p(y \mid g, \pi \equiv 0)} = \prod_{s,l} \frac{p(y_{sl} \mid g, \pi_{s})}{p(y_{sl} \mid g, \pi_{s} = 0)} \tag{3.4}$$

$$= \prod_{s,l} \frac{\pi_{s}\, p(y_{sl} \mid g, \gamma_{sl} = 1) + (1 - \pi_{s})\, p(y_{sl} \mid g, \gamma_{sl} = 0)}{p(y_{sl} \mid g, \gamma_{sl} = 0)} \tag{3.5}$$

$$= \prod_{s,l} \left[\pi_{s}\, \mathrm{BF}_{sl} + (1 - \pi_{s})\right]. \tag{3.6}$$

Within this hierarchical model, the null H₀ holds if π ≡ 0. Thus, to test H₀, we use the likelihood ratio test statistic

$$\hat{\Lambda}(y, g) := \Lambda(\hat{\pi}; y, g), \tag{3.7}$$

where π̂ denotes the maximum likelihood estimate π̂ := arg max Λ(π; y, g). This is easily computed using an EM algorithm.
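The maximization can be sketched as follows. Because the likelihood ratio (3.6) factorizes over scales, each π_s can be updated separately with the usual EM step for a two-component mixture proportion: average the posterior probabilities π_s BF_sl / (π_s BF_sl + 1 − π_s) over locations l. This is a minimal sketch under that assumption, not the authors' implementation; the function names, the starting value 0.5, the stopping rule, and the Bayes Factor values are all hypothetical.

```python
import math

def em_pi(bf_by_scale, n_iter=500, tol=1e-10):
    """EM estimate of pi_s for each scale, given Bayes Factors grouped by scale."""
    pi_hat = []
    for bfs in bf_by_scale:
        pi_s = 0.5                                   # illustrative starting value
        for _ in range(n_iter):
            # E-step: posterior probability that each WC is associated with g
            post = [pi_s * bf / (pi_s * bf + 1.0 - pi_s) for bf in bfs]
            # M-step: new proportion is the mean posterior probability
            new_pi = sum(post) / len(post)
            converged = abs(new_pi - pi_s) < tol
            pi_s = new_pi
            if converged:
                break
        pi_hat.append(pi_s)
    return pi_hat

def log_lambda(pi, bf_by_scale):
    """Log of the likelihood ratio in (3.6)."""
    return sum(math.log(p * bf + 1.0 - p)
               for p, bfs in zip(pi, bf_by_scale) for bf in bfs)

# Made-up Bayes Factors for J = 3: scales 0..3 hold 1, 1, 2 and 4 WCs.
bf_by_scale = [[3.0], [0.5], [0.2, 0.4], [20.0, 20.0, 20.0, 0.05]]
pi_hat = em_pi(bf_by_scale)
log_lambda_hat = log_lambda(pi_hat, bf_by_scale)     # log of the statistic in (3.7)
```

In this toy input, scales whose Bayes Factors are all below 1 are driven toward π_s = 0, so they contribute essentially nothing to the test statistic, while the finest scale settles at an interior value.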

Our hierarchical model assumes conditional independence of y_sl (and β_sl) given π across scales and locations. This assumption is partly justified by the whitening property of the DWT mentioned above; and certainly a corresponding conditional independence assumption would be entirely inappropriate for the original data d_b due to spatial correlations. Nonetheless, the conditional independence assumption will not hold exactly in practice. Anticipating this concern, we note that a primary goal of the hierarchical model is to obtain a test statistic for H₀, whose significance is assessed by permutation (see below), and that the resulting p-values are valid, in the sense of being uniform under the null hypothesis, regardless of the correctness of the modeling assumptions.

3.1. Multiple SNPs and permutation procedure

The statistic Λ̂(y, g) tests for association between y (or, equivalently, d) and a single SNP with genotype vector g. Often one would like to ask, for a given region, whether y (d) is associated with any of many nearby SNPs. To assess this for a set of P nearby SNPs, with genotype vectors given by g₁, …, g_P, we use the test statistic

$$\hat{\Lambda}_{\max} := \max_{p} \hat{\Lambda}(y, g_{p}). \tag{3.8}$$

To assess significance of Λ̂_max, we use permutation. That is, we generate independent random permutations ν₁, …, ν_M of (1, …, N) and compute

$$\hat{\Lambda}_{\max}^{j} := \max_{p} \hat{\Lambda}(y, \nu_{j}(g_{p})). \tag{3.9}$$

Then the p-value associated with Λ̂_max is

$$p = \frac{\#\{j : \hat{\Lambda}_{\max}^{j} \geq \hat{\Lambda}_{\max}\} + 1}{M + 1}. \tag{3.10}$$

To reduce computation time, we adapted the sequential procedure from Besag and Clifford (1991), which avoids large numbers of permutations for non-significant results (see Supplementary Material [Shim and Stephens (2015)]).
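The basic (non-sequential) permutation procedure in (3.8) to (3.10) can be sketched as below. The statistic is passed in as a function, so any per-SNP statistic can be plugged in; here a simple absolute correlation with a scalar phenotype stands in for Λ̂ purely to keep the example short. The names `permutation_pvalue` and `abs_corr`, the toy data, and the choice M = 199 are assumptions for illustration, and the Besag and Clifford sequential stopping rule is not implemented.

```python
import math
import random

def abs_corr(y, g):
    """Stand-in statistic: absolute Pearson correlation between y and g."""
    n = len(y)
    my, mg = sum(y) / n, sum(g) / n
    sxy = sum((a - my) * (b - mg) for a, b in zip(y, g))
    sxx = sum((a - my) ** 2 for a in y)
    sgg = sum((b - mg) ** 2 for b in g)
    return abs(sxy) / math.sqrt(sxx * sgg)

def permutation_pvalue(stat_fn, y, genotypes, n_perm=199, seed=0):
    """p-value for the maximum statistic over SNPs, as in (3.8)-(3.10)."""
    rng = random.Random(seed)
    n = len(genotypes[0])
    observed = max(stat_fn(y, g) for g in genotypes)
    exceed = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        # the same permutation nu_j is applied to every SNP
        permuted_max = max(stat_fn(y, [g[i] for i in perm]) for g in genotypes)
        if permuted_max >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Toy data: 30 individuals, phenotype determined exactly by the first SNP.
g1 = [0] * 10 + [1] * 10 + [2] * 10
g2 = [0, 1] * 15
y = [2.0 * g + 1.0 for g in g1]
p_value = permutation_pvalue(abs_corr, y, [g1, g2])
```

With M permutations the smallest attainable p-value is 1/(M + 1), which is why a sequential scheme that stops early for clearly non-significant sites saves most of the computation.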

3.2. Filtering of low count WCs

Some WCs, particularly those corresponding to high resolutions, are computed based on very low counts. Indeed, for some WCs, the majority of individuals have zero counts in the regions being contrasted, and so have a WC of zero. These WCs effectively have high sampling error and provide little information on association; however, our model (3.2) does not incorporate the sampling error, and so these WCs tend to contribute more than they should to Λ, effectively adding noise to the test, and reducing power. To address this, we filter out these “low count” WCs, by setting their BF_sl = 1 in equation (3.6) (a BF of 1 corresponds to no information about association). In results presented here, a set of WCs $\{y_{sl}^{i}\}_{i=1}^{N}$ was considered “low count” if the average number of reads per individual used in their computation was less than L = 2 (i.e., <140 total reads in our data with 70 individuals). Since this threshold is ad hoc, we empirically assessed sensitivity to choice of threshold L. We found that performance was almost identical for L ∈ {2, 3, 5}, and performance dropped slightly for L = 1 and L = 10 (see Supplementary Material [Shim and Stephens (2015)]).
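One way to read this filter is sketched below: for each WC, add up the raw read counts over the bases that enter its computation, average over individuals, and flag the WC when that average falls below L. This interpretation of "reads used in their computation", the function name `low_count_mask`, and the toy counts and Bayes Factors are assumptions for illustration only.

```python
import numpy as np

def low_count_mask(counts, min_reads=2.0):
    """Flag low-count WCs for an (N, B) matrix of raw counts, B = 2**J.

    Output order: zeroth scale, then scales 1..J, locations left to right.
    """
    n, B = counts.shape
    J = B.bit_length() - 1
    per_base = counts.sum(axis=0) / n          # average reads per individual, per base
    avg = [per_base.sum()]                     # zeroth scale uses the whole region
    for s in range(1, J + 1):
        width = 2 ** (J - s + 1)               # bases entering each WC at scale s
        avg.extend(per_base.reshape(-1, width).sum(axis=1))
    return np.array(avg) < min_reads

# Toy example: N = 2 individuals, B = 4 bases.
counts = np.array([[10, 0, 0, 0],
                   [6, 0, 1, 0]])
mask = low_count_mask(counts, min_reads=2.0)
bf = np.array([5.0, 3.0, 2.0, 40.0])           # made-up Bayes Factors, same order
bf_filtered = np.where(mask, 1.0, bf)          # low-count WCs carry no information
```

In the toy data only the last WC, which contrasts two bases with almost no reads, is flagged, so its large but unreliable Bayes Factor is replaced by 1.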

3.3. Quantile transformation to guard against nonnormality

Our model assumes that the residuals in (3.2) are normally distributed. Although such normal assumptions are often quite robust, large deviations from normality can adversely affect performance of association tests. Furthermore, in large-scale association studies involving thousands of phenotypes, occasional large deviations from normality can arise, and it is impractical to manually check each of the thousands of phenotypes. To address this, it is common to quantile-transform phenotypes to the quantiles of a standard normal distribution before testing for associations, an idea with a long history [van der Waerden (1953)]. Following this idea, in our association tests we transform the vector of WCs, $(y_{sl}^{1}, \dots, y_{sl}^{N})$, to the quantiles of a standard normal distribution (with ties broken at random; see Supplementary Material [Shim and Stephens (2015)]) before computing the Bayes Factor BF_sl using the transformed WCs. This transformation guarantees that, under the null hypothesis (γ_sl = 0), the normal assumption on the residuals in (3.2) holds. Consequently, this transformation ensures that the BFs are well behaved under the null, which is particularly important in association testing applications where, as here, most effects are null or nearly-null.
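A rank-based transformation of this kind can be sketched in a few lines. The sketch below ranks the N values of one WC, breaks ties at random, and maps rank r to the standard normal quantile at r/(N + 1). The plotting position r/(N + 1) and the function name `quantile_normalize` are assumptions for illustration; the exact convention used in the paper is given in its Supplementary Material.

```python
import random
from statistics import NormalDist

def quantile_normalize(values, seed=0):
    """Map values to standard normal quantiles, breaking ties at random."""
    rng = random.Random(seed)
    n = len(values)
    # sort indices by value; the random second key orders tied values at random
    order = sorted(range(n), key=lambda i: (values[i], rng.random()))
    inv_cdf = NormalDist().inv_cdf
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = inv_cdf(rank / (n + 1))
    return out

# Toy WC vector with three tied zeros, as is common at fine scales.
wc = [0.0, 0.0, 3.0, -1.0, 0.0]
z = quantile_normalize(wc)
```

After the transformation the values are distinct and symmetric around zero whatever the shape of the input, which is the property that keeps the Bayes Factors well behaved under the null.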

Although this quantile transformation is helpful for making tests robust to deviations from normality, it can make estimated effects more difficult to interpret. Therefore, it is usual to report effect size estimates obtained without quantile transformations [e.g., Teslovich et al. (2010)], and we follow this practice here by not performing the quantile transformation when estimating effect sizes (see below).

3.4. Controlling for confounding factors

In genetic association analyses of molecular-level phenotypes, power can be substantially increased by controlling for unmeasured confounding factors [Leek and Storey (2007), Stegle et al. (2010)]. In this setting, this can be achieved by estimating the unmeasured factors by Principal Components Analysis, and then regressing out the first few Principal Components (PCs) from the phenotypes before testing them for association with genotype [Degner et al. (2012), Pickrell et al. (2010)]. In our data analysis here we use the four PCs used by Degner et al. (2012), who chose 4 PCs after comparing results with 2, 4, and 6 PCs (their Supplementary Figure S11). Specifically, our procedure is as follows. After quantile transforming each WC to a standard normal distribution, we correct these transformed WCs by taking the residuals of a standard multiple linear regression of the WCs on the PCs. Finally, we quantile transform these residuals to the quantiles of a standard normal distribution and use these quantile-transformed residuals in the Bayes Factor calculations. Further data normalization could also be helpful (e.g., GC content correction [Benjamini and Speed (2012), Pickrell et al. (2010)]), but we do not pursue this here.
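The three-step correction described above (quantile transform, regress out PCs, quantile transform again) can be sketched for a single WC as follows. The helper names, the inclusion of an intercept in the regression, the plotting position r/(N + 1), and the simulated inputs are assumptions for illustration rather than details confirmed by the text.

```python
import numpy as np
from statistics import NormalDist

def to_normal_quantiles(x, rng):
    """Rank-based normal transform with ties broken at random."""
    n = len(x)
    shuffle = rng.permutation(n)                       # random order for tie-breaking
    order = shuffle[np.argsort(x[shuffle], kind="stable")]
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(1, n + 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(r / (n + 1)) for r in ranks])

def correct_for_pcs(wc, pcs, seed=0):
    """wc: length-N vector for one WC; pcs: (N, K) matrix of principal components."""
    rng = np.random.default_rng(seed)
    z = to_normal_quantiles(wc, rng)                   # step 1: quantile transform
    X = np.column_stack([np.ones(len(z)), pcs])        # intercept plus PCs
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ coef                               # step 2: regress out the PCs
    return to_normal_quantiles(resid, rng)             # step 3: transform residuals

# Simulated inputs: 70 individuals and 4 PCs, matching the sizes in the text.
sim = np.random.default_rng(1)
wc = sim.normal(size=70)
pcs = sim.normal(size=(70, 4))
corrected = correct_for_pcs(wc, pcs)
```

The second transform restores exact normal quantiles, at the cost that the final values are no longer exactly orthogonal to the PCs.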

3.5. Effect size estimates

Under the above hierarchical model, given π̂, the posterior distributions on the effect sizes in the wavelet space, p(β_sl |y, g, π̂), are available in closed form. Specifically, the β_sl are a posteriori independent, each having a distribution that is a mixture of a point mass at zero and a three parameter version of a t distribution [Jackman (2009)], with density given in Supplementary Material [Shim and Stephens (2015)].

However, the effects β_sl in the wavelet space are not easy to interpret. To obtain interpretable estimates of the effect of a SNP g, we transform these effects from the wavelet space back to the data space using the IDWT. To explain, we combine the B equations of the form (3.2) (corresponding to the B values of s, l) into a single matrix equation:

$$Y = M + \beta g + E, \tag{3.11}$$

where Y, M and E are B × N matrices (the WCs, means, and residuals, resp.), β is a B × 1 matrix of effects, and g is a 1 × N matrix of genotypes. Now recall that D = W⁻¹Y where W is the DWT matrix, so premultiplying (3.11) by W⁻¹ yields

$$D = \tilde{M} + \alpha g + \tilde{E}, \tag{3.12}$$

where M̃ = W⁻¹ M, Ẽ = W⁻¹ E and α := W⁻¹ β is a vector of length B containing the effect sizes in the original data space.

Thus, the effects in the original space, α, are given by the IDWT of β, which is a linear function of β. Although the full posterior on α does not have a simple analytic form, the linear relationship with β yields closed forms for the pointwise posterior mean and variance of α_b for b = 1, …, B (see Supplementary Material [Shim and Stephens (2015)]). Here we use these posterior summaries to summarize the posterior distribution on the effects. Other types of posterior inference could be performed by simulating from the posterior for α (which is easily achieved by simulating from the posterior of β and applying the IDWT to the simulated samples).
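To make the linear step concrete: writing V = W⁻¹, the posterior mean of α is V applied to the posterior mean of β, and, because the β_sl are a posteriori independent, the pointwise posterior variance of α_b is the sum over k of V_bk² Var(β_k). The sketch below applies these two identities with an unnormalized Haar matrix. The function names and the posterior means and variances of β are made up for illustration; in practice they would come from the mixture posterior described above.

```python
import numpy as np

def haar_matrix(J):
    """Unnormalized Haar DWT matrix (zeroth scale first, then scales 1..J)."""
    B = 2 ** J
    rows = [np.ones(B)]
    for s in range(1, J + 1):
        half = 2 ** (J - s)
        for l in range(2 ** (s - 1)):
            row = np.zeros(B)
            start = 2 * half * l
            row[start:start + half] = 1.0
            row[start + half:start + 2 * half] = -1.0
            rows.append(row)
    return np.vstack(rows)

def effect_in_data_space(W, beta_mean, beta_var):
    """Pointwise posterior mean and variance of alpha = W^{-1} beta."""
    W_inv = np.linalg.inv(W)
    alpha_mean = W_inv @ beta_mean
    # independence of the beta_sl means variances add with squared weights
    alpha_var = (W_inv ** 2) @ beta_var
    return alpha_mean, alpha_var

W = haar_matrix(2)
beta_mean = np.array([0.0, 4.0, 0.0, 0.0])   # hypothetical: effect only at scale 1
beta_var = np.array([0.0, 1.6, 0.0, 0.0])
alpha_mean, alpha_var = effect_in_data_space(W, beta_mean, beta_var)
```

A single nonzero coefficient at the first scale translates into an effect that is positive over the first half of the region and negative over the second half, which is the kind of spatially structured effect the wavelet representation is meant to capture sparsely.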

4. Results

4.1. The data and previous analysis

We apply our approach to DNase-seq data from Degner et al. (2012), who also used these data to identify dsQTLs. We begin with a brief summary of the analysis in Degner et al. (2012). The authors collected DNase-seq data for 70 HapMap Yoruba LCLs, and correlated these DNase-seq data with a total of about 18.8 million genetic variants (either directly genotyped or imputed). To do this, they first identified regions of the genome that had many DNase-seq reads mapping to them, since these are most likely to contain functional regulatory elements and are most amenable to association analysis. (Regions with no reads are clearly not amenable to association analysis.) Specifically, they divided the whole genome into non-overlapping 100 bp windows, and took the top 5% of these windows ranked according to DNase I sensitivity [see Supplementary Material of Degner et al. (2012) for definition]. For each sample, they then counted the number of DNase-seq reads mapping to each window, standardized these counts by the total number of reads generated for each sample (to account for different read depths across individuals) and used the resulting standardized counts as a molecular phenotype for association analyses. For each window in turn, they tested each nearby SNP for association with the DNase-seq data using a standard linear regression (after appropriate normalization and controlling for confounding factors using 4 Principal Components). One analysis tested every SNP within 40,000 bases (40 kb) of each window; another tested every SNP within 2 kb. The first analysis identified 74,656 dsQTLs (FDR = 10%) associated with 9595 different windows. The second analysis identified 18,899 dsQTLs (FDR = 10%) associated with 7088 different windows.

4.2. Our analysis

Degner et al. (2012) observed that typical dsQTLs affect chromatin accessibility over roughly 200–300 bp. Based on this, we decided to focus on slightly larger regions of size 1024 bp (i.e., B = 1024) for our wavelet-based association analyses. From now on we refer to each 1024 bp region as a 1024 bp “site.” (The wavelet-based approach should be relatively robust to choice of site size, provided a site is large enough to cover potential signals, since its multi-scale nature makes it well adapted to detecting signals that affect only part of the site. In Supplementary Material [Shim and Stephens (2015)] we assess this robustness and find that, indeed, using larger 2048 bp sites identifies more associations in these data. We also discuss how choice of B involves trade-offs between power, computation, and localization.) We focus our association analysis on the top 1% of 1024 bp sites with the highest DNase I sensitivity (in total 146,435 sites) selected as described in Supplementary Material [Shim and Stephens (2015)]. We focus on the top 1% rather than the top 5% as in Degner et al. (2012) because Degner et al. (2012) found that the majority of dsQTLs are in the top 1% of 100 bp windows with the highest DNase I sensitivity. For each site, we use our wavelet-based hierarchical model, plus permutation, described above, to obtain a p-value to test the null hypothesis, H₀: DNase-seq data at the site are unassociated with all nearby SNPs. Here, we took “nearby” to mean “within 2 kb of the site.”

For comparison, we also implemented a testing approach analogous to the 100 bp window-based approach from Degner et al. (2012). In brief, we divided each 1024 bp site into ten ~100 bp windows (nine of 100 bp and one of 124 bp). For each window we computed a p-value for association of the DNase-seq data with each nearby SNP using standard linear regression as in Degner et al. (2012). For this standard linear regression we quantile-normalized the phenotypes and corrected them for confounding factors using PCA, in the same way as for the wavelet-based approach (Section 3.4). Then, we take the minimum of all these p-values (across all nearby SNPs and all 10 windows), P_min, as a test statistic of H₀. We then assess the significance of P_min by permutation, in the same way as we assess significance of our Λ̂_max by permutation (Section 3.1).

4.3. A wavelet-based approach increases power compared to a 100 bp window approach

To compare our wavelet-based approach with the window-based analyses, we applied both methods to a subset of the data (50,000 randomly selected 1024 bp sites from the 146,435 sites). Each method yields a p-value testing H₀ for each site. Using these p-values, we use the qvalue package [Dabney, Storey and Warnes (2015)] to estimate the False Discovery Rate (FDR) for each method at a given p-value threshold. We then compare the methods by the number of significant sites at a given FDR (more significant sites at a given FDR being better).

Figure 1(a) compares the number of significant sites for each method as the FDR varies from 0.001 to 0.1. At all levels of the FDR the wavelet-based approach identifies considerably more significant sites than the 100 bp window approach. For example, at FDR = 0.05 the wavelet-based approach identifies 870 significant dsQTLs, compared with 572 dsQTLs for the 100 bp window-based approach, an increase of 52%. Moreover, most dsQTLs detected by the 100 bp window-based analysis are also identified by the wavelet-based approach [Figure 1(b), 84%, 84%, and 83% for FDR of 0.005, 0.01, and 0.02, resp.].

Fig. 1 — The wavelet-based approach considerably increases power to identify dsQTLs compared to the 100 bp window-based approach. (a) shows the number of dsQTLs identified by each method at a given FDR. The black line indicates FDR of 0.05. (b) shows the number of dsQTLs identified by the wavelet-based approach (Wavelet) and the 100 bp window-based approach (100 bp window) at FDR of 0.005, 0.01, and 0.02. The number of dsQTLs identified by both approaches is highlighted by dark green.

To gain insights into commonalities and differences between the methods, we manually examined effect size estimates for several examples.

Figure 2 (see also Supplementary Material Figure 1 [Shim and Stephens (2015)]) shows a typical example of a dsQTL identified by both methods. These examples show a consistent strong effect across 200–300 bp; consequently, at least one 100 bp window fully overlaps the affected region, and the window analysis will successfully identify such examples, provided the effect is sufficiently strong.

Fig. 2 — Example of typical dsQTL found by both methods. The *top panel* shows average DNase I cut rates along the site for each genotype class at the most strongly associated SNP (red = reference homozygotes; blue = heterozygotes; green = non-reference homozygotes). The dark green line indicates the position of the most strongly associated SNP. Purple blocks indicate putative transcription factor binding sites, identified using the software `CENTIPEDE` [Pique-Regi et al. (2011)] (with a name on the top for known motifs). Black vertical lines below the x-axis indicate mappable bases [see Supplementary Material of Degner et al. (2012) for definition]. The *middle panel* shows posterior mean for effect (α) of this SNP (blue), ±3 posterior standard deviations (sky blue). Pink highlights regions showing strongest signal (zero is outside of mean ± 3 posterior standard deviations). The *bottom panel* shows absolute value of t-statistic for each 100 bp window. The most strongly associated SNP: chr17.10161485 with minor allele frequency (MAF) of 0.39. For wavelet-based approach log Λ̂_max = 73.09, p < 0.00001. For window-based approach p < 0.00001.

In contrast, Figure 3 shows two examples of dsQTLs identified by the wavelet analysis, but not the window-based analysis. The dsQTL in Figure 3(a) has a strong effect in a relatively narrow region (the strongest effect estimate in the second pink region spans < 10 bp). The multi-scale nature of the wavelet approach makes it well adapted to detect this kind of narrow local feature, whereas the 100 bp window analysis fails to capture it (t-statistic of the 100 bp window containing the signal ≈ 2). This illustrates that the window-based approach has limited power to identify signals that are very strong, but affect a region much smaller than the window size. The dsQTL in Figure 3(b) has a consistent effect spread over 200–300 bp, qualitatively similar to typical dsQTLs identified by both methods. However, the effect of this dsQTL is modest, and it fails to be significant in the window-based approach. Our explanation for this is that, being based on 100 bp windows, the window-based approach effectively uses only part (100 bp) of the signal, whereas the multi-scale nature of the wavelet-based approach allows it to adapt to the scale of the signal, and make better use of the whole signal. In summary, these examples illustrate how the window-based approach is inherently adapted to identifying effects that have a particular scale (100 bp in this case) and is suboptimal for effects that occur on either smaller scales [Figure 3(a)] or larger scales [Figure 3(b)].

Fig. 3 — Examples of dsQTLs found by wavelet-based approach, but not by window-based approach. Labels and colors are as in Figure 2. (a) illustrates a dsQTL with a strong effect on a narrow region. The most strongly associated SNP: chr12.6264939 with MAF of 0.32. For wavelet-based approach log Λ̂_max = 25.97, p < 0.00001. For window-based approach p = 0.05. The two vertical orange lines indicate positions of two genetic variants that are in high linkage disequilibrium (i.e., highly correlated) with chr12.6264939. (b) illustrates a dsQTL with modest effect over a larger region. The most strongly associated SNP: chr10.59495589 with MAF of 0.43. For wavelet-based approach log Λ̂_max = 14.11, p = 0.0003. For window-based approach p = 0.01. The orange line indicates the position of a genetic variant that is in high linkage disequilibrium with chr10.59495589.
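The scale argument illustrated by Figure 3 can be made concrete with a toy signal-to-noise calculation. The sketch below is not the model used in this paper: it assumes independent Gaussian noise with standard deviation `sigma` at each base, a constant per-base effect `amplitude` over `width` bases, and a window placed as favorably as possible. The function names are illustrative only.

```python
import math

def window_z(amplitude, width, window=100, sigma=1.0):
    """Expected z-score of the summed signal in one fixed-size window.

    Assumes the window is placed as favorably as possible, so it overlaps
    min(width, window) affected bases; noise is i.i.d. N(0, sigma^2) per base.
    """
    overlap = min(width, window)
    return amplitude * overlap / (sigma * math.sqrt(window))

def matched_z(amplitude, width, sigma=1.0):
    """Expected z-score when the summed region matches the effect exactly."""
    return amplitude * math.sqrt(width) / sigma

# Narrow effect (8 bp): the 100 bp window dilutes it with 92 bp of pure noise.
narrow = (window_z(1.0, 8), matched_z(1.0, 8))
# Broad effect (256 bp): the 100 bp window uses only part of the signal.
broad = (window_z(1.0, 256), matched_z(1.0, 256))
```

Under these simplifying assumptions the fixed window attains a z-score of 0.8 against roughly 2.83 at the matched scale for the narrow effect, and 10 against 16 for the broad effect, mirroring the two cases in Figure 3(a) and 3(b).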

Finally, Figure 4 shows a slightly more complex example. This dsQTL shows different effects in two regions: consistent in direction over about 100 bp and in opposite directions over about 200 bp. The 100 bp window analysis misses the first signal because no windows capture the whole signal. The third 100 bp window fully overlaps with the second signal, but left and right sides of the window have effects in opposite directions and partially cancel each other out, resulting in a weak overall association.

Fig. 4 — Example of dsQTL showing complex pattern of association with DNase I cut rates. Labels and colors are as in Figure 2. The most strongly associated SNP: chr2.110329846 with MAF of 0.43. For wavelet-based approach log Λ̂_max = 22.01, p < 0.00001. For window-based approach p = 0.23. In this example the most strongly associated SNP is outside of the 1024 bp site.
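The cancellation described for Figure 4 can be reproduced on a toy effect profile. The following sketch implements an orthonormal Haar transform in plain Python; the sign convention (left minus right) and the ordering of scales are assumptions made for illustration and may differ from the convention used elsewhere in this paper.

```python
import math

def haar_transform(signal):
    """Orthonormal Haar transform of a power-of-two-length sequence.

    Returns (scaling_coefficient, details), where details[0] holds the
    finest-scale coefficients and details[-1] the single coarsest one.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx[0], details

# Toy effect: positive on the left half of a region, negative on the right.
effect = [1.0] * 4 + [-1.0] * 4
scaling, details = haar_transform(effect)
window_total = sum(effect)       # what a single window covering the region sees
coarsest_detail = details[-1][0] # left-versus-right contrast
```

Here the window total is zero, so a window-level statistic sees no effect, while the coarsest Haar detail coefficient, which contrasts the left and right halves, is clearly nonzero.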

In addition to these results based on estimating FDR for real DNase data, we conducted additional comparisons on several simulated data sets, where the “true” (null vs alternative) status of each simulated data set is known. In these comparisons the wavelet-based approach consistently outperformed the window-based approach (see the simulation study section of the Supplementary Material [Shim and Stephens (2015)]).

4.3.1. Potential mechanism underlying dsQTLs

It is possible that the different qualitative patterns of effect evident in the examples in Figures 2–4 correspond to different functional mechanisms. With current data any discussion of mechanism is necessarily somewhat speculative. However, in some cases a putative mechanism is clearer than in others. In Figure 2, the most strongly associated SNP (green vertical line on figure) is inside a binding site for CTCF (CCCTC binding factor), and the effect spans two regions on either side of the binding site (each about 100 bp, highlighted in pink), with the effect dropping to zero at the binding site itself. This effect exemplifies typical TF binding patterns, which often exhibit a distinct drop in DNase cut rates within TF binding sites [Pique-Regi et al. (2011)] (referred to as the DNase I “footprint”) because the binding of the TF “protects” the DNA against the cutting action of the DNase I enzyme. The effect estimate in Figure 3(b) shows a similar footprint pattern around another CTCF binding site, and although the most strongly associated SNP is not in the CTCF binding site, another highly associated SNP is in that binding site (orange line; r² between these two SNPs is 0.9), and this SNP seems more likely to be the actual functional variant. Thus, these two examples appear to share a common mechanism by which chromatin accessibility is related to changes in CTCF binding.

In contrast to these typical footprint patterns, the effect in Figure 3(a) is quite different, with one narrow region (<10 bp) showing the biggest effect (the second pink region). The most strongly associated SNP (green line) lies a few hundred base pairs from this strong effect, but two other SNPs (orange vertical lines) that show almost identical association strength (r² > 0.99 with the strongest SNP) lie closer. One of these SNPs lies in a putative TF binding site that coincides with the narrow region of strongest effect. It seems plausible that this SNP is the functional variant influencing chromatin accessibility, and that the changes in chromatin accessibility in this case are, as for the other examples, related to transcription factor binding. However, if so, the reason for the effect being concentrated within the narrow area, rather than distributed around the TF binding site, is unclear.

Finally, the most strongly associated SNP in Figure 4 lies outside of the 1024 bp site. The effect pattern here includes almost-compensatory increases and decreases in chromatin accessibility, suggesting that the dsQTL is associated with accessibility “shifting” from some locations to others, possibly associated with rearrangements in nucleosome positioning.

4.4. Shifting windows provide modest gain in power

In some of the examples we examined (e.g., Figure 4), the 100 bp window approach appeared to miss a signal because no single window fully overlapped the region affected by the dsQTL. This suggested that power might be increased by using overlapping, rather than non-overlapping, windows. To assess this, we modified the 100 bp window approach to use 19 overlapping windows (the additional nine windows being obtained by shifting each of the first nine windows 50 bp to the right). The test statistic for this modified approach is the minimum p-value across the 19 windows, and we assessed significance by permutation as before. We compared this modified 100 bp window approach to the other two approaches by applying it to the 50,000 sites and computing the number of significant dsQTLs at a given FDR. As shown in Figure 1(a), it increases power compared with the non-overlapping windows, but remains well short of the wavelet-based approach. Looking at individual examples, we find the use of overlapping windows helps to identify the dsQTL in Figure 4 (p-value < 0.00001), as the third 50 bp-shifted window completely captures the signal that is consistent in direction over about 100 bp (see Supplementary Material Figure 2 [Shim and Stephens (2015)]). However, it still misses both of the dsQTLs in Figure 3.
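The overlapping-window test can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code: the window `offset` within the site is hypothetical (the text does not specify it here), the maximum absolute Pearson correlation across windows stands in for the minimum p-value (for simple linear regression with a fixed sample size the two rank windows identically), and a plain fixed-number permutation scheme is used rather than whatever stopping rule the actual analysis applied.

```python
import math
import random

def window_starts(n_windows=10, width=100, shift=50, offset=0):
    """(start, end) pairs: n_windows adjacent windows plus the first
    n_windows - 1 of them shifted right by `shift` bp (19 by default)."""
    base = [(offset + i * width, offset + (i + 1) * width) for i in range(n_windows)]
    shifted = [(s + shift, e + shift) for s, e in base[:-1]]
    return base + shifted

def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def max_stat(genotypes, window_phenotypes):
    """Strongest association across windows (one phenotype vector per window)."""
    return max(abs_corr(genotypes, y) for y in window_phenotypes)

def permutation_pvalue(genotypes, window_phenotypes, n_perm=200, seed=0):
    """Permute genotype labels and compare the max statistic to the observed one."""
    rng = random.Random(seed)
    observed = max_stat(genotypes, window_phenotypes)
    permuted = list(genotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if max_stat(permuted, window_phenotypes) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Taking the maximum over all 19 windows inside each permutation is what accounts for the multiple windows tested at a site; the price is that the null distribution of the maximum grows with the number of windows.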

4.5. A wavelet-based association analysis of the entire data set

We next applied the wavelet-based approach to the full data set of 146,435 sites. At an FDR of 10% this yielded 3176 sites with a dsQTL within 2 kb. Among these, 772 sites (24%) are newly identified by the wavelet-based approach [i.e., not overlapping with the 7088 100 bp windows reported as having dsQTLs in 2 kb cis-candidate region from Degner et al. (2012)].

4.5.1. Many dsQTLs affect expression levels of nearby genes

A key finding of Degner et al. (2012) was that the dsQTLs identified in their analysis were strongly enriched for being eQTLs, that is, being associated with changes in expression of at least one nearby gene. Specifically, using expression data on the same cell lines from Pickrell et al. (2010), they tested their dsQTLs for association with expression. They found that 16% of their dsQTLs are also significant eQTLs (FDR = 10%). This represents a very significant (450-fold) enrichment compared with random expectation. This is important because it suggests that altering chromatin accessibility and/or transcription factor binding may be a common mechanism by which genetic variants influence gene expression.

We therefore conducted a similar analysis for our dsQTLs, also using the data from Pickrell et al. (2010) and applying the methods from Degner et al. (2012) (see their Supplementary Material for details) to the most strongly associated SNP at each of the 3176 significant sites identified in our analysis. We found that 19% of the dsQTLs identified by the wavelet-based approach are also significant eQTLs (FDR = 10%). Among the 772 novel sites identified by the wavelet method, 15% were also significant eQTLs. The fact that these enrichments are similar to those reported in Degner et al. (2012) suggests that the additional dsQTL sites we identified are likely to be reliable, rather than false positives.

4.5.2. Computation

The computational time to test each site varies considerably among sites: computation scales roughly linearly with the number of “nearby” SNPs to be tested, the number of unfiltered WCs, and the number of permutations performed, all of which vary among sites. Analysis of the entire data set (with maximum number of permutations set to 100,000) took about 4702 CPU hours (user + system), or on average 1.9 min of CPU time per site. Because the analysis of each site is independent, the entire analysis is trivially parallelizable across sites.
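The per-site figure follows directly from the reported totals, as the short check below shows. The wall-clock helper is an idealization added for illustration: it assumes sites are spread evenly over a hypothetical number of cores and ignores both scheduling overhead and the site-to-site variation in run time noted above.

```python
TOTAL_CPU_HOURS = 4702   # reported total CPU time (user + system)
N_SITES = 146_435        # sites in the full data set

# Average CPU minutes per site.
minutes_per_site = TOTAL_CPU_HOURS * 60 / N_SITES

def ideal_wall_clock_hours(n_cores):
    """Idealized wall-clock hours if the work divides evenly over n_cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return TOTAL_CPU_HOURS / n_cores
```

For example, `ideal_wall_clock_hours(100)` gives about 47 hours under these assumptions; in practice the slowest sites (those needing the most permutations) would dominate the tail of the run.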

Software and scripts implementing our methods and analyses, and information on the 3176 dsQTLs reported here, are available at http://stephenslab.uchicago.edu/software.html.

5. Discussion

We have developed an effective and efficient wavelet-based method for association analysis of functional data arising from high-throughput sequencing assays. This method, including permutation-based assessments of significance, is computationally tractable for genetic studies involving hundreds of thousands of tests. We applied our method to identify SNPs associated with chromatin accessibility, and illustrated its advantages over a simple window-based approach. In brief, the main limitation of window-based methods is that they have a single inherent scale, determined by the length of the window, and while they are naturally well powered to detect effects that occur on this scale, they are less well powered for effects that occur on either longer or shorter scales. In contrast, wavelet-based approaches are naturally “multi-scale,” and hence better suited to settings where effects vary in their scale (e.g., where some effects are strong, affecting a narrow region, and other effects are modest, affecting a broad region). Our examples in Figure 3 illustrate the benefits of a multi-scale approach. In addition, the wavelet-based approach is better adapted to detecting effects that vary in direction along a region—a situation which may cause effects to “cancel out” in a window-based analysis (e.g., Figure 4). Overall, our analysis of data from Degner et al. (2012) identified 772 novel putative dsQTLs not identified by the original analysis.

In this paper we reported two types of performance comparisons—one based on simulations (results in Supplementary Material [Shim and Stephens (2015)]) and another based on performance on real data, specifically on the number of findings obtained at a given FDR. Although both comparisons are helpful, we view the latter as more practically relevant, because it directly reflects the way these types of methods are applied in practice, and it avoids the impossible task of creating simulations that reflect all the complexities of experimental data. This empirical comparison technique is particularly attractive for the kinds of genetic association analyses performed here, where there are large numbers of approximately independent tests on which to assess performance (in our setting, tests of different sites are typically independent because they typically involve independent genetic variants as well as different phenotypes). In addition to comparing competing methods, empirical comparisons like these can also be helpful for comparing analysis approaches more generally. For example, Degner et al. (2012), Pickrell et al. (2010), Stegle et al. (2010) and Mangravite et al. (2013) all used empirical comparisons to decide how many PCs to control for, and here we used them to assess the effects of altering the “low count threshold” and the size of the site tested. A similar idea could be used to assess other aspects of the analysis—such as choice of wavelet basis.

Although our methods were motivated primarily by genetic association studies for sequence-based molecular phenotypes, our approach is more general and could also test for association between functional data and other covariates, either continuous or discrete. For example, in a genomics context, it could be used to detect differences in gene expression (from RNA-seq data) or TF binding (from ChIP-seq data) measured on two groups (e.g., treatment conditions or cell types). Or it could be used to associate a functional phenotype, such as chromatin accessibility, with a continuous covariate, such as “overall” expression of a gene. It could also be used for genome-wide association studies of functional phenotypes unrelated to sequencing. The main current limitation is that sample sizes should not be too small, since our Bayes Factor calculations, based on normal quantile-transformed data, will not work well for small samples. We have not experimented to determine adequate sample sizes, but in other settings we have found the quantile-transformed approach can work for sample sizes as small as 10 (M. Barber and M. Stephens, unpublished data). We discuss modifying our approach to allow for smaller sample sizes below.
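For readers unfamiliar with the normal quantile transformation mentioned above, the following is a minimal sketch of one common rank-based version. The choice of `rank / (n + 1)` as the plotting position and the use of average ranks for ties are assumptions made for illustration; the exact convention used in the analysis may differ.

```python
from statistics import NormalDist

def normal_quantile_transform(values):
    """Map values to standard normal quantiles of their ranks.

    Ties receive the average of the ranks they span (an assumption);
    ranks are 1-based and mapped through the inverse CDF at rank / (n + 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]

z = normal_quantile_transform([5.0, 1.0, 3.0])
```

Because only ranks enter the transformation, the output depends on the ordering of the observations and not on their magnitudes, which is why very small samples (few distinct ranks) carry little information after transformation.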

One of the most common assays now performed by sequencing is RNA-seq, and particular features of this assay merit special attention. Specifically, because construction of mRNA effectively involves splicing together small parts of the gene (the “exons”), a proportion of the reads generated in an RNA-seq experiment will span splice junctions. These reads naturally contain considerable information about splicing, but this information is not captured in the information we use here (the first base to which each read maps). Integrating the information in splice junction reads with our wavelet-based methods could be useful, but perhaps challenging. On the other hand, our method is not alone in failing to fully exploit splice reads, and it also has some strengths that complement existing approaches to this problem. For example, it is common to use the number of reads mapping to “known” exons as a phenotype to identify SNPs that affect splicing [Pickrell et al. (2010)]. This may work well to identify certain types of effect (e.g., SNPs that affect whether or not an exon is spliced in), but less well for other effects (e.g., extension of an exon beyond its usual boundaries). Because our method considers the shape of the read profile across the whole gene, without reference to the “known exons,” it may be more effective at detecting this latter type of effect.

To our knowledge, this is the first genetic association analysis that attempts to fully exploit high-resolution information from high-throughput sequencing assays. [While this work was in review, another method aimed at exploiting the high-resolution information appeared in Frazee et al. (2014).] As such, there are many opportunities for potential improvements. First, our methods use a normal model for the (normal quantile-transformed) WCs, and this transformation loses information. Particularly, it loses the information that some WCs are based on small counts, and thus have higher sampling variability than WCs based on larger counts. Here we partly addressed this issue by filtering out WCs based on low counts, but a more principled approach may be expected to improve power. Further, as noted above, the normal quantile transformation requires moderate sample sizes. Both these issues could potentially be addressed by modeling the count nature of the sequence data directly, and we are currently experimenting with this approach, based on multi-scale models for inhomogeneous Poisson processes [Kolaczyk (1999), Timmermann and Nowak (1999)]. Another possibility would be to consider transforms designed to allow wavelets to be applied to Poisson data [Fryzlewicz and Nason (2004)]. Second, we have here made use of Haar wavelets, and it may be that other wavelets will perform better. Indeed, the optimal choice of wavelets may be context-dependent. For example, when applying wavelet denoising to ChIP-seq data on histone modifications, Zhang et al. (2008) selected a wavelet known as Coiflet4, arguing that its morphological characteristics are similar to the nucleosome peak shape. Our methods here could be directly applied with any choice of wavelet basis.

Finally, our hierarchical model assumes conditional independence of WCs (and effect sizes β_sl) given π across scales and locations, and this conditional independence will not hold exactly in practice. Our approach partly addresses this issue by assessing significance of a test statistic by permutation, which gives valid p-values irrespective of whether modeling assumptions are correct. However, our procedure for estimating the shape of genotype effect still relies on the conditional independence assumption and, ultimately, methods that exploit dependencies between the WCs should perform better. One way to model dependencies is to exploit the tree structure of WCs (and effect sizes β_sl) as described in Crouse, Nowak and Baraniuk (1998), and we are currently experimenting with this approach.
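The tree structure referred to above arises because each Haar wavelet coefficient covers a region that is split exactly in half by two coefficients at the next finer scale. The sketch below shows this parent and child indexing for a 1024 bp site. The indexing convention (scale 0 is the coarsest, locations are 0-based) is an assumption chosen for illustration and need not match the subscripts of β_sl used in this paper.

```python
def children(scale, loc):
    """The two finer-scale coefficients nested under coefficient (scale, loc)."""
    return [(scale + 1, 2 * loc), (scale + 1, 2 * loc + 1)]

def parent(scale, loc):
    """The coarser-scale coefficient whose region contains (scale, loc)."""
    if scale == 0:
        raise ValueError("the coarsest coefficient has no parent")
    return (scale - 1, loc // 2)

def support(scale, loc, site_length=1024):
    """Half-open base range [start, end) covered by coefficient (scale, loc)."""
    width = site_length >> scale
    return (loc * width, (loc + 1) * width)
```

A model that exploits dependencies could, for instance, let the prior probability that a child coefficient is affected depend on whether its parent is affected; this sketch only supplies the bookkeeping such a model would need.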

Supplementary Material

Supplementary Information

NIHMS910681-supplement-Supplementary_Information.pdf (375.8 KB, pdf)

Acknowledgments

We thank Jack Degner, Roger Pique-Regi and Jonathan Pritchard for invaluable discussions and help with analyses of dsQTLs, Anil Raj, Ellen Leffler, and Sarah Urbut for helpful comments on an earlier version of the manuscript, and Ester Pantaleo and Zhengrong Xing for helpful comments on the simulation study in Supplementary Material. We thank the members of the J. Pritchard, M. Przeworski, M. Stephens and Y. Gilad labs for helpful discussions.

Footnotes

Supported by NIH Grant HG02585.

SUPPLEMENTARY MATERIAL

Supplement to “Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays” (DOI: 10.1214/14-AOAS776SUPP; .pdf). The Supplementary Material referenced in Sections 3, 4 and 5 is provided in the supplement file.

References

Abramovich F, Angelini C. Testing in mixed-effects FANOVA models. J Statist Plann Inference. 2006;136:4326–4348.
Antoniadis A, Sapatinas T. Estimation and inference in functional mixed-effects models. Comput Statist Data Anal. 2007;51:4793–4813.
Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837. doi: 10.1016/j.cell.2007.05.009.
Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012;40:e72. doi: 10.1093/nar/gks001.
Besag J, Clifford P. Sequential Monte Carlo p-values. Biometrika. 1991;78:301–304.
Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008;132:311–322. doi: 10.1016/j.cell.2007.12.014.
Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M, Spielman RS. Polymorphic cis- and trans-regulation of human gene expression. PLoS Biol. 2010;8:e1000480. doi: 10.1371/journal.pbio.1000480.
Clement L, De Beuf K, Thas O, Vuylsteke M, Irizarry RA, Crainiceanu CM. Fast wavelet based functional models for transcriptome analysis with tiling arrays. Stat Appl Genet Mol Biol. 2012;11(Art 4):38. doi: 10.2202/1544-6115.1726.
Crouse MS, Nowak RD, Baraniuk RG. Wavelet-based statistical signal processing using hidden Markov models. IEEE Trans Signal Process. 1998;46:886–902.
Dabney A, Storey JD, Warnes GR. qvalue: Q-value estimation for false discovery rate control. R package version 1.30.0. 2015.
Day N, Hemmaplardh A, Thurman RE, Stamatoyannopoulos JA, Noble WS. Unsupervised segmentation of continuous genomic data. Bioinformatics. 2007;23:1424–1426. doi: 10.1093/bioinformatics/btm096.
Degner JF, Pai AA, Pique-Regi R, Veyrieras J-B, Gaffney DJ, Pickrell JK, De Leon S, Michelini K, Lewellen N, Crawford GE, Stephens M, Gilad Y, Pritchard JK. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature. 2012;482:390–394. doi: 10.1038/nature10808.
Donoho DL, Johnstone IM. Adapting to unknown smoothness via wavelet shrinkage. J Amer Statist Assoc. 1995;90:1200–1224.
Fan J, Lin S-K. Test of significance when data are curves. J Amer Statist Assoc. 1998;93:1007–1021.
Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT. Differential expression analysis of RNA-seq data at single-base resolution. Biostatistics. 2014;15:413–426. doi: 10.1093/biostatistics/kxt053.
Fryzlewicz P, Nason GP. A Haar–Fisz algorithm for Poisson intensity estimation. J Comput Graph Statist. 2004;13:621–638.
Hesselberth JR, Chen X, Zhang Z, Sabo PJ, Sandstrom R, Reynolds AP, Thurman RE, Neph S, Kuehn MS, Noble WS, Fields S, Stamatoyannopoulos JA. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nature Methods. 2009;6:283–289. doi: 10.1038/nmeth.1313.
Jackman S. Bayesian Analysis for the Social Sciences. Wiley; Chichester: 2009.
Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein–DNA interactions. Science. 2007;316:1497–1502. doi: 10.1126/science.1141319.
Karczewski KJ, Dudley JT, Kukurba KR, Chen R, Butte AJ, Montgomery SB, Snyder M. Systematic functional regulatory assessment of disease-associated variants. Proc Natl Acad Sci USA. 2013;110:9607–9612. doi: 10.1073/pnas.1219099110.
Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, Hong M-Y, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel JO, Snyder M. Variation in transcription factor binding among humans. Science. 2010;328:232–235. doi: 10.1126/science.1183621.
Kolaczyk ED. Bayesian multiscale models for Poisson processes. J Amer Statist Assoc. 1999;94:920–933.
Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3:1724–1735. doi: 10.1371/journal.pgen.0030161.
Mallat SG. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans Pattern Anal Mach Intell. 1989;11:674–693.
Mangravite LM, Engelhardt BE, Medina MW, Smith JD, Brown CD, Chasman DI, Mecham BH, Howie B, Shim H, Naidoo D, Feng Q, Rieder MJ, Chen Y-DI, Rotter JI, Ridker PM, Hopewell JC, Parish S, Armitage J, Collins R, Wilke RA, Nickerson DA, Stephens M, Krauss RM. A statin-dependent QTL for GATM expression is associated with statin-induced myopathy. Nature. 2013;502:377–380. doi: 10.1038/nature12508.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18:1509–1517. doi: 10.1101/gr.079558.108.
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim T-K, Koche RP, Lee W, Mendenhall E, O’Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007;448:553–560. doi: 10.1038/nature06008.
Mitra A, Song J. WaveSeq: A novel data-driven method of detecting histone modification enrichments using wavelets. PLoS ONE. 2012;7:e45486. doi: 10.1371/journal.pone.0045486.
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010;464:773–777. doi: 10.1038/nature08903.
Morris JS, Carroll RJ. Wavelet-based functional mixed models. J R Stat Soc Ser B Stat Methodol. 2006;68:179–199. doi: 10.1111/j.1467-9868.2006.00539.x.
Morris JS, Brown PJ, Herrick RC, Baggerly KA, Coombes KR. Bayesian analysis of mass spectrometry proteomic data using wavelet-based functional mixed models. Biometrics. 2008;64:479–489. doi: 10.1111/j.1541-0420.2007.00895.x.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226.
Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, Cox NJ. Trait-associated SNPs are more likely to be eQTLs: Annotation to enhance discovery from GWAS. PLoS Genet. 2010;6:e1000888. doi: 10.1371/journal.pgen.1000888.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J-B, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–772. doi: 10.1038/nature08872.
Pique-Regi R, Degner JF, Pai AA, Boyle AP, Song L, Lee B-K, Gaffney DJ, Gilad Y, Pritchard JK. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 2011;21:447–455. doi: 10.1101/gr.112623.110.
Servin B, Stephens M. Imputation-based analysis of association studies: Candidate regions and quantitative traits. PLoS Genet. 2007;3:e114. doi: 10.1371/journal.pgen.0030114.
Shim H, Stephens M. Supplement to “Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays”. 2015. doi: 10.1214/14-AOAS776SUPP.
Spencer CCA, Deloukas P, Hunt S, Mullikin J, Myers S, Silverman B, Donnelly P, Bentley D, McVean G. The influence of recombination on human genetic diversity. PLoS Genet. 2006;2:e148. doi: 10.1371/journal.pgen.0020148.
Stegle O, Parts L, Durbin R, Winn J. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput Biol. 2010;6:e1000770. doi: 10.1371/journal.pcbi.1000770.
Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, Koseki M, Pirruccello JP, Ripatti S, Chasman DI, Willer CJ, Johansen CT, Fouchier SW, Isaacs A, Peloso GM, Barbalic M, Ricketts SL, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466:707–713. doi: 10.1038/nature09270.
Timmermann KE, Nowak RD. Multiscale modeling and estimation of Poisson processes with application to photon-limited imaging. IEEE Trans Inform Theory. 1999;45:846–862.
van der Waerden BL. Order tests for the two-sample problem. II, III. Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen, Serie A. 1953;564:303–310. 311–316.
Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456:470–476. doi: 10.1038/nature07509.
WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
Wu S, Wang J, Zhao W, Pounds S, Cheng C. ChIP-PaM: An algorithm to identify protein-DNA interaction using ChIP-seq data. Theor Biol Med Model. 2010;7:18. doi: 10.1186/1742-4682-7-18.
Yang X, Nie K. Hypothesis testing in functional linear regression models with Neyman’s truncation and wavelet thresholding for longitudinal data. Stat Med. 2008;27:845–863. doi: 10.1002/sim.2952.
Zhang Y, Shin H, Song JS, Lei Y, Liu XS. Identifying positioned nucleosomes with epigenetic marks in human from ChIP-seq. BMC Genomics. 2008;9:537. doi: 10.1186/1471-2164-9-537.
Zhao W, Wu R. Wavelet-based nonparametric functional mapping of longitudinal curves. J Amer Statist Assoc. 2008;103:714–725.
Zhu H, Brown PJ, Morris JS. Robust, adaptive functional regression in functional mixed model framework. J Amer Statist Assoc. 2011;106:1167–1179. doi: 10.1198/jasa.2011.tm10370.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information

NIHMS910681-supplement-Supplementary_Information.pdf^{(375.8KB, pdf)}
