EPIQ—efficient detection of SNP–SNP epistatic interactions for quantitative traits

Ya’ara Arkin; Elior Rahmani; Marcus E Kleber; Reijo Laaksonen; Winfried März; Eran Halperin

doi:10.1093/bioinformatics/btu261

Bioinformatics. 2014 Jun 11;30(12):i19–i25. doi: 10.1093/bioinformatics/btu261


PMCID: PMC4229902 PMID: 24931983

Abstract

Motivation: Gene–gene interactions are of potential biological and medical interest, as they can shed light on both the inheritance mechanism of a trait and on the underlying biological mechanisms. Evidence of epistatic interactions has been reported in both humans and other organisms. Unlike single-locus genome-wide association studies (GWAS), which proved efficient in detecting numerous genetic loci related with various traits, interaction-based GWAS have so far produced very few reproducible discoveries. Such studies introduce a great computational and statistical burden by necessitating a large number of hypotheses to be tested including all pairs of single nucleotide polymorphisms (SNPs). Thus, many software tools have been developed for interaction-based case–control studies, some leading to reliable discoveries. For quantitative data, on the other hand, only a handful of tools exist, and the computational burden is still substantial.

Results: We present an efficient algorithm for detecting epistasis in quantitative GWAS, achieving a substantial runtime speedup by avoiding the need to exhaustively test all SNP pairs using metric embedding and random projections. Unlike previous metric embedding methods for case–control studies, we introduce a new embedding, where each SNP is mapped to two Euclidean spaces. We implemented our method in a tool named EPIQ (EPIstasis detection for Quantitative GWAS), and we show by simulations that EPIQ requires hours of processing time where other methods require days and sometimes weeks. Applying our method to a dataset from the Ludwigshafen risk and cardiovascular health study, we discovered a pair of SNPs with a near-significant interaction (P = 2.2 × 10⁻¹³), in only 1.5 h on 10 processors.

Availability: https://github.com/yaarasegre/EPIQ

Contact: heran@post.tau.ac.il

1 INTRODUCTION

Genome-wide association studies (GWAS) have so far detected thousands of single nucleotide polymorphism (SNP) loci that are associated with various traits (Hindorff et al., 2009). Unfortunately, for most complex traits the discovered SNPs explain only a small fraction of the estimated heritability, a phenomenon often referred to as the ‘missing heritability’ (Maher, 2008). One plausible explanation suggested for this problem is the existence of an epistatic effect, where two or more loci have a synergistic influence on the phenotype, also referred to as gene–gene interactions (Maher, 2008). The discovery of interacting SNP loci has an additional benefit, as it may shed light on the underlying biological mechanism or involved pathways.

Despite evidence of gene–gene interactions reported in both humans and other organisms (Evans et al., 2006), very few reproducible discoveries have been reported by GWAS (for example, Liu et al., 2011; Prabhu and Pe’er, 2012). The amount of data produced in a single study is a possible cause: when searching for groups of k SNPs with an epistatic effect, the number of possible k-sized groups is Θ(m^k), where m is the number of SNP loci. With current GWAS typically including hundreds of thousands of SNPs, this implies both a computational and a statistical burden even for group sizes as small as k = 2: the numerous tests take days or even weeks to compute and require a substantial correction for multiple hypotheses, leading in some cases to a loss of power (Evans et al., 2006). One common approach is the reduction of the search space, usually by filtering candidate loci pairs: Marchini et al. (2005) suggested selecting a subset of SNPs with a moderate marginal effect and testing for interaction in pairs where at least one locus is included in the subset. Reduction of the search space can also be done by manipulating contingency tables (Wan et al., 2010; Zhang et al., 2010) or searching for a linkage disequilibrium (LD) contrast between cases and controls (Brinza et al., 2010; Prabhu and Pe’er, 2012). A more straightforward approach is increasing the computational power, either by multi-threaded implementations or by utilizing special hardware (Hu et al., 2010; Yung et al., 2011). Binary operations are used in some cases to speed up performance (Prabhu and Pe’er, 2012; Wan et al., 2010).

All of these tools, and many others, are designed for case–control studies; for the quantitative case, where the tested phenotypes are physiological measurements of some sort, the selection of available software is limited. Since the phenotype tested is not dichotomous, testing for quantitative associations can be more challenging than in case–control studies, as methods utilizing contingency tables, LD-contrast or binary operations are usually inapplicable. Methods tailored for case–control studies can be applied to quantitative traits after dichotomizing the phenotype (as in Bhattacharya et al., 2011); however, the resulting statistical test differs from the original, so a loss of power is inevitable and would be difficult to quantify.

In this study, we present EPIQ (EPIstasis detection for quantitative GWAS), an efficient algorithm for detecting pairs of SNP loci that have an epistatic effect on quantitative phenotypes. EPIQ achieves a substantial runtime speedup by avoiding the need to exhaustively test all SNP pairs: it applies a carefully chosen transformation that maps each genotyped SNP to a vector in a Euclidean space. This transformation has the property that SNP pairs with an epistatic effect are converted to vector pairs with a large inner product. A random projections method is subsequently applied to efficiently recover these SNPs. A novelty of our method is that each SNP is projected to two different points, for a more efficient detection of interacting SNPs. We show on simulated data that in just over 3 h our algorithm was able to process a dataset that would take days or weeks using state-of-the-art software, and we present the results of running EPIQ on data from the Ludwigshafen risk and cardiovascular health (LURIC, Winkelmann et al. 2001) study.

2 METHODS

2.1 Outline

EPIQ is designed to efficiently discover SNPs that have a significant epistatic effect on a quantitative phenotype, without exhaustively testing all pairs of SNPs in a dataset. This goal is achieved in two steps: a filtering stage, which generates a list of candidate SNP pairs, and a validation stage, which fits a linear regression model to these pairs. By shortening the list of pairs to be tested during the filtering stage, running time for the linear regression step is reduced substantially. Filtering is performed by assigning a score to each SNP; this score is stochastically generated so that for each pair of SNPs, the expected value for the product of their scores is proportional to the generalized likelihood ratio (GLR) test statistic of their interaction. This means epistatic pairs are expected to have a high score product. By performing multiple iterations and collecting pairs that pass a given threshold, we ensure with high probability that if an interacting pair exists, it is included in the candidates list and will be reported during the validation stage. To do so we present a new test statistic τ², which is roughly proportional to the GLR test score, and apply a random projection algorithm that discovers pairs with exceptionally high τ² scores.

2.2 Model description

2.2.1 Model input

EPIQ receives as input a vector $y \in ℝ^{n}$ , representing the phenotypic values of all n individuals in the cohort, and a matrix $X_{n \times m} \in \{0, 1\}^{n \times m}$ representing the cohort at m polymorphic loci. The algorithm is adjusted for binary SNPs; therefore genotypes should be converted to a binary representation according to the expected type of interaction. For example, converting AA to 0 and aA, aa to 1 states a dominant model of interactions. The phenotype vector y is standardized so that it has zero mean and SD of 1. $x \in \{0, 1\}^{n}$ denotes the column vector of allelic values measured for all n samples at a certain locus. x_i is the allele value of this locus for person number i and y_i is the phenotype value of person number i. We denote $p = \Pr [x_{i} = 1]$ , and estimate it with the maximum likelihood estimator $\hat{p} = mean (x)$ . Denote $x^{2} = {(x_{1}^{2}, x_{2}^{2}, \dots, x_{n}^{2})}^{T}, x x' = {(x_{1} x'_{1}, x_{2} x'_{2}, \dots, x_{n} x'_{n})}^{T}$ and $y x x' = {(y_{1} x_{1} x'_{1}, y_{2} x_{2} x'_{2}, \dots, y_{n} x_{n} x'_{n})}^{T}$ .
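The input preparation described above can be illustrated with a minimal sketch. The function names and the toy genotype and phenotype values are assumptions made for illustration only and are not part of EPIQ; the SD is taken as the population SD (dividing by n), which is the convention under which the sum of squared phenotype values equals n.

```python
import math

def encode_dominant(genotypes):
    """Dominant coding as in the text: AA -> 0, any genotype carrying 'a' -> 1."""
    return [0 if g == "AA" else 1 for g in genotypes]

def standardize(y):
    """Shift and scale y to mean 0 and population SD 1 (divide by n, not n - 1)."""
    n = len(y)
    mean = sum(y) / n
    sd = math.sqrt(sum((value - mean) ** 2 for value in y) / n)
    return [(value - mean) / sd for value in y]

# Toy data (hypothetical): six individuals at one locus.
genotypes = ["AA", "aA", "aa", "AA", "aA", "AA"]
x = encode_dominant(genotypes)
p_hat = sum(x) / len(x)  # maximum likelihood estimate of Pr[x_i = 1]
y = standardize([1.2, 0.7, 2.5, 1.9, 0.3, 1.0])

print(x, p_hat)
```

A recessive coding would instead map only aa to 1; the rest of the pipeline is unchanged.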

2.2.2 Linear model

When testing for an epistatic interaction between a pair of SNPs, the linear model can be defined as follows (Cordell, 2009):

y_{i} = α_{0} + α_{1} x_{i} + α_{2} x'_{i} + α_{3} x_{i} x'_{i} + ϵ_{i}

(1a)

ϵ_{i} \sim N (0, σ^{2})

(1b)

H_{0} : α_{3} = 0, H_{1} : α_{3} \neq 0

(1c)

Since tests for interaction are usually performed after testing for a main effect for each of the SNPs, it is reasonable to zero the main effects from the model. By altering the model so that α₁ = α₂ = 0 and α₀, α₃ are replaced with β₀, β₁ respectively, a new, simpler model is obtained:

y_{i} = β_{0} + β_{1} x_{i} x'_{i} + ϵ_{i}

(2a)

ϵ_{i} \sim N (0, σ^{2})

(2b)

H_{0} : β_{1} = 0, H_{1} : β_{1} \neq 0

(2c)

In this case, using ordinary least squares (OLS), the GLR test statistic is:

2 \ln GLR = - n \ln (\frac{\sum_{i} {(y_{i} - {\hat{β}}_{0} - {\hat{β}}_{1} x_{i} x'_{i})}^{2}}{\sum_{i} {(y_{i} - \bar{y})}^{2}})

(3)

Where $\bar{y} =$ mean(y). This simplification allows us to define an alternative test statistic, τ², which is an approximation of the GLR test statistic and can be very useful for our filtering stage. Disregarding the main effect can in fact lead to false positive results, but these will only be a fraction of the total number of SNP pairs and will all be discarded during the validation stage, where pairs are tested against the full linear model (Equation 1).

To achieve simplicity and efficiency, the model does not include covariates—the residuals from the phenotype adjusted for other parameters should be used as the response variable. Population stratification can be addressed by applying an adjustment method such as EIGENSTRAT (Price et al., 2006) and using the first axes of variation as covariates while adjusting the phenotype. The model assumes linkage equilibrium between SNPs, an assumption that does not hold for GWAS, where proximal SNPs are in LD. As a result, the distribution of the τ² score for proximal SNPs deviates from what is expected under the null assumption, which results in an excess of pairs passing the filtering stage. This problem can be addressed by dismissing proximal pairs during the filtering stage, and exhaustively testing them later during post-processing time. As the number of proximal pairs in LD is O(m), the cost of this correction is minor.

2.3 GLR test and the new test-statistic τ²

In the following section we introduce our new test statistic τ² and show that for large sample sizes, $2 \ln GLR \approx τ^{2}$ . The new test statistic is presented not as a means to achieve more power, but rather as a means for reducing runtime by serving as a proxy for the GLR test statistic: we show in the next section how random projections methods can efficiently detect pairs with a high τ² score, as a filtering stage for detecting statistically significant interactions.

Since y is standardized, the denominator of Equation (3) equals n. Replacing ${\hat{β}}_{0}, {\hat{β}}_{1}$ in Equation (3) with their OLS estimators $\bar{y} - {\hat{β}}_{1} \bar{x x'}, \frac{\bar{y x x'}}{\bar{x^{2} x'^{2}} - {\bar{x x'}}^{2}}$ respectively, it is easy to verify that:

\sum_{i} {(y_{i} - {\hat{β}}_{0} - {\hat{β}}_{1} x_{i} x'_{i})}^{2} = \sum_{i} {(y_{i} - \frac{\bar{y x x'}}{\bar{x x'} - {\bar{x x'}}^{2}} (x_{i} x'_{i} - \bar{x x'}))}^{2}

(4a)

= n (1 - \frac{{\bar{y x x'}}^{2}}{\bar{x x'} - {\bar{x x'}}^{2}})

(4b)

2 \ln GLR = - n \ln (1 - \frac{{\bar{y x x'}}^{2}}{\bar{x x'} - {\bar{x x'}}^{2}})

(5)

Under the linkage equilibrium assumption, $\bar{x x'} \overset{p}{\to} p p'$ . Using first order Taylor expansion, after neglecting ${\bar{x x'}}^{2}$ , we conclude that for a large sample size:

τ^{2} \equiv n \frac{{\bar{y x x'}}^{2}}{\hat{p} \hat{p}'} \approx 2 \ln GLR

(6)
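The approximation step from Equation (5) to Equation (6) can be spelled out as follows. This is a sketch of the reasoning under the stated assumptions (linkage equilibrium, large n, and ${\bar{x x'}}^{2}$ negligible relative to $\bar{x x'}$); the shorthand $w$ is introduced here for convenience and does not appear in the original derivation.

```latex
w \equiv \frac{{\bar{y x x'}}^{2}}{\bar{x x'} - {\bar{x x'}}^{2}}, \qquad
2 \ln GLR = - n \ln (1 - w) = n \left( w + \frac{w^{2}}{2} + \dots \right) \approx n w

\bar{x x'} - {\bar{x x'}}^{2} \approx \bar{x x'} \approx \hat{p} \hat{p}'
\quad \Rightarrow \quad
n w \approx n \frac{{\bar{y x x'}}^{2}}{\hat{p} \hat{p}'} = τ^{2}
```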

Figure 1a displays $2 \ln$ GLR versus τ². With r² of 0.99, τ² is a good approximation of the GLR score. As seen in Figure 1c, under the null assumption of no interaction, the distribution of τ² is very close to a chi-square distribution with 1 degree of freedom, similar to the GLR test statistic (chi-square goodness-of-fit test P = 0.396). As a result, the task of finding an interacting SNP pair can now be replaced with the task of finding a pair with significantly high τ² score. In the next section, we show that this task can be done efficiently without testing all pairs.
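The relation between the two statistics can be checked numerically with a short sketch that evaluates Equations (5) and (6) on one simulated pair. The allele frequencies (0.3 and 0.4), the effect size (0.3), the seed and all function names are arbitrary assumptions chosen for illustration; they are not the settings used for Figure 1.

```python
import math
import random

def standardize(y):
    """Shift and scale y to mean 0 and population SD 1."""
    n = len(y)
    mean = sum(y) / n
    sd = math.sqrt(sum((value - mean) ** 2 for value in y) / n)
    return [(value - mean) / sd for value in y]

def two_ln_glr(y, x, x_prime):
    """Equation (5); assumes standardized y and binary x, x_prime."""
    n = len(y)
    z = [a * b for a, b in zip(x, x_prime)]
    z_bar = sum(z) / n
    yz_bar = sum(yi * zi for yi, zi in zip(y, z)) / n
    return -n * math.log(1.0 - yz_bar ** 2 / (z_bar - z_bar ** 2))

def tau_squared(y, x, x_prime):
    """Equation (6)."""
    n = len(y)
    yz_bar = sum(yi * a * b for yi, a, b in zip(y, x, x_prime)) / n
    return n * yz_bar ** 2 / ((sum(x) / n) * (sum(x_prime) / n))

rng = random.Random(0)
n = 5000
x = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
x_prime = [1 if rng.random() < 0.4 else 0 for _ in range(n)]
y = standardize([0.3 * a * b + rng.gauss(0.0, 1.0) for a, b in zip(x, x_prime)])

glr = two_ln_glr(y, x, x_prime)
tau2 = tau_squared(y, x, x_prime)
print(glr, tau2)
```

Because the sketch keeps the ${\bar{x x'}}^{2}$ term in the GLR but not in τ², the two values are expected to be close rather than identical.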

Fig. 1. — The new test-statistic: (a) $2 \ln$ GLR versus τ². Data were generated with n = 5000, MAF $\in [0.01, 0.5]$ ; marginal and epistatic effects were sampled uniformly from the range (0, 1); r²= 0.99. (b) r² of the linear correlation between $2 \ln (GLR)$ and τ², as a function of n: τ² is highly correlated with the original test statistic for all tested sample sizes. (c) τ² distribution is proportional to the chi-square distribution with 1 degree of freedom. Passed a chi-square goodness-of-fit test with P-value of 0.396

2.4 Efficient discovery of interacting SNPs

We describe an algorithm for finding pairs of SNPs where τ² is larger than a given threshold. For each binary SNP x we define a vector $v = (v_{1}, \dots, v_{n})$ where $v_{i} = \sqrt{\frac{| y_{i} |}{\hat{p} \sqrt{n}}} x_{i}$ and a vector $u = (u_{1}, \dots, u_{n})$ where $u_{i} = sign (y_{i}) v_{i}$ . For example, if $y = (- 0.1, 0.2, - 0.3, - 0.4, - 0.5, 0.6)$ and $x = (1, 1, 0, 1, 0, 0)$ then $v = \frac{1}{\sqrt{0.5 \sqrt{6}}} (\sqrt{0.1}, \sqrt{0.2}, 0, \sqrt{0.4}, 0, 0)$ and $u = \frac{1}{\sqrt{0.5 \sqrt{6}}} (- \sqrt{0.1}, \sqrt{0.2}, 0, - \sqrt{0.4}, 0, 0)$ . It is easy to see that

\forall x, x' : v \cdot u' = \frac{1}{\sqrt{n \hat{p} \hat{p}'}} \sum_{i = 1}^{n} y_{i} x_{i} x'_{i} = τ

(7)
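Equation (7) can be verified directly on the small example from the text. In the sketch below, the second SNP `x_prime` is a made-up vector added for illustration, and the helper names `embed` and `dot` are assumptions; as in the text's example, y is used as given and is not standardized.

```python
import math

def dot(p, q):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(p, q))

def embed(y, x):
    """Map a binary SNP x to the two vectors v and u defined in the text."""
    n = len(y)
    p_hat = sum(x) / n
    v = [math.sqrt(abs(yi) / (p_hat * math.sqrt(n))) * xi for yi, xi in zip(y, x)]
    u = [math.copysign(1.0, yi) * vi for yi, vi in zip(y, v)]
    return v, u

y = [-0.1, 0.2, -0.3, -0.4, -0.5, 0.6]
x = [1, 1, 0, 1, 0, 0]
x_prime = [1, 0, 0, 1, 1, 0]  # hypothetical second SNP

v, u = embed(y, x)
v_prime, u_prime = embed(y, x_prime)

# Right-hand side of Equation (7), computed without the embedding.
n = len(y)
p_hat, p_hat_prime = sum(x) / n, sum(x_prime) / n
tau_direct = sum(yi * a * b for yi, a, b in zip(y, x, x_prime)) / math.sqrt(n * p_hat * p_hat_prime)

print(dot(v, u_prime), tau_direct)
```

The same value is obtained from `dot(u, v_prime)`, since the sign of y_i enters the product exactly once in either order.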

So instead of searching for pairs with an exceptional τ score, we are now looking for an exceptional inner product size. To do so we apply a random projections method: we perform multiple iterations; in each iteration we sample a random vector $r = (r_{1}, r_{2}, ..., r_{n})$ , where r_i ∼ N(0,1). For each SNP x, we calculate two scores: $a = v \cdot r$ and $b = u \cdot r$ . Since r_i are sampled i.i.d. with mean 0 and variance 1, the expected value of the two scores’ product is τ:

\forall x, x' : E_{r} [a b'] = E_{r} [\sum_{i} r_{i} v_{i} \sum_{j} r_{j} u'_{j}] = v \cdot u' = τ

(8)
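Equation (8) lends itself to a quick Monte Carlo check: averaging the score product over many random vectors r should approach the inner product. The vectors, the number of iterations and the seed below are arbitrary illustrative choices, not values used by EPIQ.

```python
import random

def dot(p, q):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(p, q))

def mean_score_product(v, u_prime, iterations, rng):
    """Average of a * b' over random projections r with i.i.d. N(0, 1) entries."""
    total = 0.0
    for _ in range(iterations):
        r = [rng.gauss(0.0, 1.0) for _ in v]
        a = dot(v, r)            # score a of the first SNP
        b_prime = dot(u_prime, r)  # score b of the second SNP
        total += a * b_prime
    return total / iterations

v = [0.5, 0.0, 1.0]        # hypothetical embedded vector of SNP x
u_prime = [0.5, 2.0, 0.25]  # hypothetical embedded vector of SNP x'

estimate = mean_score_product(v, u_prime, 40000, random.Random(0))
print(estimate, dot(v, u_prime))
```

A single iteration gives a noisy value of the product, which is why the filtering stage relies on repeated iterations and a threshold rather than on one projection.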

Note that while non-interacting SNPs have a zero expected value for $ab'$, pairs that remain significant after a Bonferroni correction over all SNP pairs are expected to have τ² of over 55. It can also be shown that the variance of $ab'$ grows with τ². As a result, the distribution of $ab'$ has a longer tail under the alternative assumption, so for any positive threshold t, the probability of $|ab'| \geq t$ is always greater for interacting pairs (Fig. 2a). We utilize this fact to distinguish between interacting and non-interacting pairs: we perform several iterations where a vector r is sampled, and the scores a and b are calculated for all SNPs. In each iteration we collect the pairs of SNPs whose score product passes a given threshold t. The last part can easily be done without testing all pairs: we define a vector $\vec{a}$ holding the a scores of all SNPs and a vector $\vec{b}$ holding their b scores. Both vectors are first sorted in descending order and then scanned in linear time to find pairs $x, x'$ such that $a^{2} b'^{2} \geq t^{2}$ .
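One way to collect the passing pairs without visiting all pairs is sketched below. It is an illustrative reconstruction rather than the EPIQ implementation: it assumes that scores are ranked by absolute value and that the inner scan stops at the first failing product, so the work after sorting is proportional to the number of reported pairs plus m. The function name and the toy scores are hypothetical.

```python
def candidate_pairs(a, b, t):
    """Return ordered index pairs (x, x') with x != x' and |a[x] * b[x']| >= t."""
    order_a = sorted(range(len(a)), key=lambda i: abs(a[i]), reverse=True)
    order_b = sorted(range(len(b)), key=lambda j: abs(b[j]), reverse=True)
    pairs = []
    for i in order_a:
        # If even the largest |b| fails, no remaining SNP in order_a can pass.
        if abs(a[i]) * abs(b[order_b[0]]) < t:
            break
        for j in order_b:
            if abs(a[i]) * abs(b[j]) < t:
                break  # later entries of order_b are smaller still
            if i != j:
                pairs.append((i, j))
    return pairs

a = [3.0, -0.5, 1.0, 0.2]   # hypothetical a scores of four SNPs
b = [0.1, 2.0, -4.0, 0.3]   # hypothetical b scores of the same SNPs
found = candidate_pairs(a, b, 3.5)
print(found)
```

With MAF bins, the same scan would be run once per pair of bins with its own threshold.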

Fig. 2. — (a) An illustration of the *ab′* score distribution for interacting pairs (blue) and non-interacting pairs (gray): interacting pairs have a higher probability of passing a threshold t during the filtering stage. (b) Variance of *ab′* as a function of MAFs, for an interacting pair with corrected P-value of 0.05

Since, as seen in Figure 2b, the variance of $ab'$ is affected by the combination of minor allele frequencies (MAFs) of both SNPs, different t thresholds are used for different MAF combinations. This is done by assigning SNPs to bins of similar MAF: each bin B has two score vectors, ${\vec{a}}^{B}, {\vec{b}}^{B}$ , sorted by their score value. For each pair of bins, B and B′, we report all SNP pairs $x \in B, x' \in B'$ where $a^{2} b'^{2} \geq t_{B B'}^{2}$ , with $t_{B B'}^{2}$ being the appropriate threshold. Reported SNP pairs are validated against the linear model. Optimal t thresholds for each pair of bins were empirically calculated, as described in the following section. See algorithm pseudo-code 1.

Algorithm 1. Pseudo-code of the EPIQ algorithm (image btu261ilf1.jpg).

2.4.1 Runtime analysis

The improvement in runtime achieved by EPIQ is due to the fact that only a fraction of the SNP pairs is tested. The algorithm performs L iterations; each iteration has O(nm) operations for calculating a, b scores and $O (m \log m)$ operations for sorting score vectors. If we denote ψ as the average fraction of SNP pairs that pass the threshold t at each iteration, then scanning the vectors for interaction candidates would take $O (\binom{m}{2} ψ)$ and the total runtime including validations is $O (L (n m + m \log m + \binom{m}{2} ψ n))$ . As exhaustive testing of all pairs takes $O (\binom{m}{2} n)$ , speedup is achieved when $L ψ \ll 1$ and also $L \ll m$ . To speed up performance, EPIQ keeps all SNP data in memory; therefore space complexity is O(nm).
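To give a feel for these expressions, the sketch below plugs numbers into the two cost formulas. The values of L and ψ are purely hypothetical (the text does not report them), constant factors are ignored, and the function names are assumptions, so the resulting ratio is only an order-of-magnitude illustration and not a runtime prediction.

```python
import math

def exhaustive_cost(n, m):
    """Operation count for testing all pairs: C(m, 2) * n."""
    return math.comb(m, 2) * n

def epiq_cost(n, m, iterations, psi):
    """Operation count L * (n*m + m*log m + C(m, 2) * psi * n), constants ignored."""
    per_iteration = n * m + m * math.log2(m) + math.comb(m, 2) * psi * n
    return iterations * per_iteration

n, m = 1000, 10**6
iterations, psi = 100, 1e-5  # hypothetical values, chosen only for illustration

n_pairs = math.comb(m, 2)
ratio = exhaustive_cost(n, m) / epiq_cost(n, m, iterations, psi)
print(n_pairs, ratio)
```

With these made-up inputs the product of L and ψ is 10⁻³, and the validation term dominates the per-iteration cost.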

2.4.2 Choosing the parameters L, t to assure the requested power with minimal runtime

Given the stochastic nature of the algorithm, there is always a possibility that interacting pairs will be missed. This creates a trade-off between power and runtime, controlled by a success rate parameter. Setting this parameter to 90%, for example, would mean that the probability of missing a SNP pair with a significant GLR score is at most 10%, and the overall power achieved is at least 90% compared with an all-pairs scan. Strongly interacting pairs have an even larger chance of being detected, as the probability of passing the filtering stage is a function of the GLR score.

According to the success rate requested by the user, optimal values for L and t can be set. The two parameters are strongly linked with runtime: higher t values reduce probability of success, which means more iterations are required in order to provide the requested success rate. This elongates the filtering stage, and also might shorten the validation stage by reducing false positive rate. To calculate optimal parameters one must first calculate Inline graphic for both interacting and non-interacting pairs. One can show that , so the probability of the event a²b^′²≥ t² can be easily calculated. The value of f, on the other hand, is not as simple to calculate analytically. As a result, the choice of the parameters was done empirically: A sample dataset was randomly generated, using MAFs taken from the 1000 genomes project (Abecasis et al., 2012), as explained in the results section. SNPs were distributed among bins of similar MAFs, and each pair of bins was assigned with the maximal threshold value that enabled the required success rate, given the current number of iterations. As a final step, the number of iterations that led to the shortest runtime was chosen.
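The link between the threshold and the number of iterations can be illustrated with a simple calculation. Suppose, as an assumption for this sketch, that an interacting pair passes the threshold in any single iteration with probability q, independently across iterations (each iteration draws a fresh r). The pair is then caught at least once in L iterations with probability 1 − (1 − q)^L. The values of q below are hypothetical; in EPIQ the parameters were chosen empirically, as described above.

```python
import math

def iterations_needed(q, success_rate):
    """Smallest L with 1 - (1 - q)**L >= success_rate, for 0 < q < 1."""
    return math.ceil(math.log(1.0 - success_rate) / math.log(1.0 - q))

for q in (0.05, 0.2, 0.5):
    print(q, iterations_needed(q, 0.9))
```

A higher threshold lowers q and therefore raises the required L, which is the trade-off between the filtering and validation stages described in the text.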

2.5 Simulated datasets

In order to test our algorithm we generated several datasets of diploid genotypes, with cohort sizes varying between 1000 and 5000, and the number of SNP loci between 10 000 and 1 million. While generating the SNPs we used the MAF distribution found in the 1000 Genomes Project (Abecasis et al., 2012) and assumed Hardy–Weinberg equilibrium. We later converted the datasets to a binary representation using a dominant coding, where AA was translated to 0 and aA, aa to 1. We used these datasets to demonstrate the runtime and power of EPIQ under different conditions.
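A minimal sketch of such a simulation is given below. Since the 1000 Genomes MAF distribution is not reproduced here, MAFs are drawn uniformly from [0.01, 0.5] as a stand-in; this, the tiny dimensions, the seed and the function name are assumptions for illustration. Under Hardy–Weinberg equilibrium the two alleles of an individual are drawn independently, so the dominant coding equals 1 with probability 1 − (1 − MAF)².

```python
import random

def simulate_binary_snp(n, maf, rng):
    """Draw n diploid genotypes under HWE and return their dominant 0/1 coding."""
    snp = []
    for _ in range(n):
        # Each of the two alleles is the minor allele with probability maf.
        minor_alleles = (rng.random() < maf) + (rng.random() < maf)
        snp.append(1 if minor_alleles > 0 else 0)
    return snp

rng = random.Random(1)
n, m = 1000, 5  # far smaller than the datasets in the text
mafs = [rng.uniform(0.01, 0.5) for _ in range(m)]  # stand-in MAF distribution
X = [simulate_binary_snp(n, maf, rng) for maf in mafs]  # X[j][i]: SNP j, individual i

print([sum(snp) / n for snp in X])
```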

2.6 The LURIC study

We applied our method to measurements of lipid concentration in cells (Cer(d18:0/24:1)), taken from the LURIC study. The LURIC study consists of 3316 white patients hospitalized for coronary angiography between 1997 and 2000 at a tertiary care center in Southwestern Germany (Winkelmann et al., 2001). To limit clinical heterogeneity, individuals suffering from acute illnesses other than acute coronary syndrome (ACS), chronic non-cardiac diseases and a history of malignancy within the past 5 years were excluded.

2.6.1 Laboratory procedures

Fasting blood samples were obtained by venipuncture in the early morning. Genomic DNA was prepared from EDTA anticoagulated peripheral blood by using a common salting-out procedure. Genotyping was done using the Affymetrix Human SNP Array 6.0 at the Synlab Center of Laboratory Diagnostics Heidelberg and the Mannheim Institute of Public Health of Heidelberg University.

2.6.2 Quality control

We used PLINK (Purcell et al., 2007) for quality control, excluding SNPs with call rate <95%. We excluded individuals with call rate <97%, ambiguous on genetic sex test or showing high estimated identity by descent (IBD) scores (PI_HAT ≥ 0.1875), controlling for cryptic relatedness. For the population stratification part, we used the Population Reference Sample (POPRES) dataset (Nelson et al., 2008) as a reference population. We considered the first four components of a multidimensional scaling (MDS) on both LURIC and POPRES individuals for determining and removing outliers. Finally, we had 687 253 SNPs and 859 individuals remaining for the analysis, of which 826 had lipid cell concentration measurements.

3 RESULTS

In this section, we show that in just a few hours EPIQ can process amounts of data that would take weeks and even years on common existing software. We demonstrate how the power of EPIQ is affected by the underlying model of interaction and present the results of applying EPIQ to a dataset from the LURIC study.

3.1 Runtime improvement

While epistasis detection tools for case–control studies are relatively common, not many quantitative pairwise epistasis tools were found. We chose to compare EPIQ against PLINK (Purcell et al., 2007), FastEpistasis (Schüpbach et al., 2010), EpiGPU (Hemani et al., 2011) and EpiGPUHSIC (Kam-Thong et al., 2011). All four tools perform an exhaustive search, using different hardware and various statistical tests. PLINK is a commonly used whole-genome association analysis toolset. Its epistasis option performs linear regression tests on all SNP pairs. FastEpistasis is an efficient parallel extension of the PLINK epistasis module. While the first two tools run on regular processors, EpiGPU and EpiGPUHSIC run on graphics processing units (GPUs), which are specialized electronic circuits that provide up to 100× speedup in performance. The latter two tools utilize different statistical tests as well: EpiGPU performs an F-test, while EpiGPUHSIC is a quantitative extension of the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005), which uses the correlation coefficient difference between cases and controls as an approximation to the significance of the interaction term. Since EPIQ was run with the success rate parameter set to 80%, we compared its runtime against testing 80% of the pairs in the exhaustive search algorithms. As seen in Table 1, EPIQ shows a great improvement in runtime compared with the exhaustive tools.

Table 1.

Runtime of the C++ implementation of EPIQ, compared with other programs available

Tool	Computational method	Statistical test	Cores	Runtime
PLINK (Purcell et al., 2007)^a	Exhaustive search	OLS	1	∼10 years
FastEpistasis (Schüpbach et al., 2010)^b	Exhaustive search	OLS	8	381 h
EpiGPU (Hemani et al., 2011)^b	Exhaustive search	F-test	–	9.3–90 h^c
EpiGPUHSIC (Kam-Thong et al., 2011)^b	Exhaustive search	HSIC	–	194 h
EPIQ	Random projections	OLS on binary SNPs	8	3.2 h


EPIQ was run with the parameter success rate set to 80%, therefore runtime is compared against testing 80% of the pairs in the exhaustive search algorithms (n = 1000, m = 10⁶).

^aTimes were extrapolated according to a test of 1000 SNPs performed on the same 2.5 GHz processor, scaling linearly with the number of SNP pairs.

^bTimes were extrapolated according to self-reported performance.

^cRuntime varies with the chosen GPU.

We ran EPIQ using different inputs in order to test how the program scales with changes in the number of SNPs, cohort size or requested power. As seen in Figure 3a, EPIQ scales linearly with the number of SNP pairs in the dataset. Figure 3b shows that gaining more power becomes increasingly time consuming when approaching 100% power, as can be expected in stochastic algorithms of this sort. However, one can achieve 95% power in a matter of hours. Scaling in the number of samples is also above linear: while testing 5 × 10¹¹ pairs takes 3.2 h for 1000 individuals, it takes 7 times longer for 3000 samples, and 30 times longer for 5000 samples. Nevertheless, for moderate cohort sizes EPIQ remains an efficient choice. (All benchmark tests were performed on an Ubuntu Linux server with a 2.5 GHz processor.)

Fig. 3. — Runtime of EPIQ for different settings: (a) runtime for various numbers of SNP pairs, n = 1000; EPIQ scales linearly with the number of pairs. (b) Runtime of EPIQ for different power thresholds; nearly 100% power can be achieved in a matter of hours (n = 1000, m = 10⁶)

3.2 Power analysis

To evaluate the power of our algorithm, we compared it against two commonly used baseline methods suggested by Marchini et al. (2005). The first is a simple exhaustive all-pairs test, where all SNP pairs are tested for interaction. Although this method is not feasible for large datasets, the power achieved by an all-pairs test is of relevance, as this is the upper bound for the power of our algorithm. The second baseline is a method in which the top K marginal predictors are identified, and all pairwise interactions between them are then tested. When choosing $K = \sqrt{2 m}$ , for example, the number of tests performed is $\binom{\sqrt{2 m}}{2} \approx m$ . We refer to this method as the ‘two-step’ algorithm. In all our tests we apply the conservative Bonferroni correction, in order to address the issue of multiple hypotheses. Since EPIQ implicitly evaluates all SNP pairs, the number of tests for a multiple testing correction is $\binom{m}{2}$ , as in the all-pairs algorithm. For each test the program generated a quantitative phenotype according to the linear model described earlier, $y_{i} = β_{0} + β_{1} x_{i} x'_{i} + ϵ_{i}$ , where x, x′ are two SNPs that were randomly chosen as the interacting pair. The phenotype was later standardized so that $\bar{y} = 0, stdev (y) = 1$ .
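The test counts behind the two baselines can be made concrete with a few lines of arithmetic. The sketch assumes a family-wise level of α = 0.05 and rounds K down to an integer; the function names are hypothetical.

```python
import math

def all_pairs_tests(m):
    """Number of tests in an exhaustive scan: C(m, 2)."""
    return math.comb(m, 2)

def two_step_tests(m):
    """Number of tests among the top K = floor(sqrt(2m)) marginal predictors."""
    k = math.isqrt(2 * m)
    return math.comb(k, 2)

m = 10**6
alpha = 0.05  # assumed family-wise significance level

for label, count in (("all-pairs", all_pairs_tests(m)), ("two-step", two_step_tests(m))):
    print(label, count, alpha / count)
```

For m = 10⁶ the Bonferroni threshold of the all-pairs scan is about 10⁻¹³, roughly half a million times stricter than that of the two-step scan, which is the price paid for not conditioning on marginal effects.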

We compared EPIQ against the two methods, using different MAFs for the interacting SNPs and success rate equal to 80% (Fig. 4a–c). Note that although the requested success rate was 80%, the actual power of EPIQ (shown in dark blue) is consistently >80% of the power achieved by the all-pairs algorithm (light blue), as this parameter states the minimal relative power. Another conclusion drawn from these figures is that in some cases there is a substantial difference in power between the all-pairs test and the two-step test (green), in favor of the all-pairs test. The opposite is true for large MAFs, as in this case the marginal effect is easy to detect, and the multiple testing correction is less stringent for a two-step approach (data not shown). Similar results were described by Evans et al. (2006), who showed that for various models of interaction, an exhaustive all-pairs search is more powerful than the two-step strategy, despite the harsher multiple testing correction [O(m²) compared to O(m)]. In these cases, using EPIQ can yield a substantial improvement in power.

3.2.1 Comparison with PLINK

In order to further investigate the power achieved by EPIQ, we carried out 50 experiments comparing our method with the linear regression performed by PLINK and FastEpistasis, using the 50 distinct models of interaction from Li and Reich (2000), which were adapted for quantitative traits. These models assume that there are two phenotypic means in the population, 0 and 1, and each model of interaction determines a different partitioning of the population to either mean. For example, model M1 states that only the individuals that are homozygous for the minor allele at both SNPs have the higher phenotypic mean [see Li and Reich (2000) for more details]. We generated datasets with 2000 individuals, where the two SNPs account for 10% of the trait variance. Before running EPIQ, we converted the genotypes to two binary representations, a dominant one and a recessive one, and applied EPIQ to both encodings [as in Brinza et al. (2010) and Prabhu and Pe’er (2012)]. We compared EPIQ’s results with the power achieved by applying the full linear model of PLINK on the original genotypes. Figure 4d shows the results for all models, where the x axis is the model number and the y axis is the power of EPIQ minus the power of PLINK, averaged over all MAF combinations. Out of the 50 models, 23 showed greater power when using EPIQ, 15 showed greater power with PLINK and the remaining 12 resulted in similar power with either method. Several of the 15 models where PLINK shows higher power either describe a complex and biologically unintuitive pattern of interaction (such as M101) or have a large marginal effect, which makes them easy to discover using Marchini’s two-step algorithm (Marchini et al. 2005).

3.3 Results from the LURIC study

We applied EPIQ to measurements of lipid concentration in cells (Cer(d18:0/24:1)), taken from the LURIC study, setting the success rate parameter to 90%. Lipid concentration in cells was converted to the log scale, standardized and corrected for BMI, sex, age and statins usage, using the residuals as the input for EPIQ. Processing of 826 individuals and 687 253 SNPs took 1.5 h on 10 processors, identifying a single pair of SNPs (rs436969 (chr5, HWE p = 0.005), rs9385393 (chr6, HWE p = 1)) with a P-value of 2.2 × 10⁻¹³, which is near significant after applying a Bonferroni correction. No genes exist within 100 kb up- and down-stream of the SNPs. Figure 5 shows a Manhattan plot of the SNPs surrounding the pair of SNPs.
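The statement that this P-value is near significant can be checked with a short calculation, assuming a family-wise level of α = 0.05 (an assumption of this sketch; the level is not stated explicitly for this analysis) and a correction over all pairs of the 687 253 SNPs.

```python
import math

m = 687253                 # SNPs remaining after quality control
alpha = 0.05               # assumed family-wise significance level
p_value = 2.2e-13          # reported P-value of the top pair

n_pairs = math.comb(m, 2)
threshold = alpha / n_pairs
print(n_pairs, threshold, p_value <= threshold)
```

Under these assumptions the threshold is roughly 2.1 × 10⁻¹³, so the reported value falls just short of it.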

Fig. 5. — Results on the LURIC dataset. (a and b) Manhattan plots of 100 SNPs up- and down-stream of rs436969, rs9385393. The epistasis option of PLINK was used to test for interactions in all 40 401 pairs and the smallest P-value for each SNP was recorded. Note that the P-value for the top scoring pair is slightly higher than the one calculated by EPIQ, as EPIQ was run on the binary representation of the SNPs. (c) A QQ-plot of the P-values distribution shows a negligible inflation. P-values were calculated for a sample of 10 000 SNP pairs

4 DISCUSSION

In this article we demonstrated how random projections methods can be applied to quantitative GWAS, achieving in most cases at least an order of magnitude speedup compared with other existing tools, scaling linearly with the number of SNP pairs. We showed that EPIQ required only 1.5 h on 10 processors for a real dataset of 687 253 SNPs and 826 individuals, identifying a pair of SNPs with a possible epistatic interaction, demonstrating that the model’s assumptions do not hinder an efficient discovery of interacting pairs. This speedup is gained in exchange for a minor loss of power. As mentioned before, a search for interacting SNP pairs in current GWAS suffers from an inherent multiple testing problem, where P-values must be as small as 10⁻¹³ or less in order to be considered significant. As RAPID (Brinza et al., 2010) and SIXPAC (Prabhu and Pe’er, 2012) have done, EPIQ turns this limitation into an advantage: for a given sample size, smaller P-values are a result of larger effect sizes. This in turn makes interacting pairs more distinct from the rest of the SNPs and consequently easier to detect by EPIQ. The drawback is that as sample size increases, a smaller effect size is sufficient for achieving the same significance level. In this case EPIQ is required to perform more iterations in order to distinguish between interacting and non-interacting pairs, and therefore does not scale linearly with sample size. Thus, as GWAS expand to include increasingly larger cohorts, further adjustments in the algorithm would be required.

We note that the settings used in this article represent only a portion of a wide range of options. The approach we described can be extended to fit other statistical tests, and the binary coding of the genotypes can be performed differently than described, to match other underlying models of interaction. EPIQ can also be used in conjunction with methods such as genome-wide rapid association testing (GRAT) (Kostem and Eskin, 2013), which utilizes LD between SNPs to choose a subset of proxy SNPs, thus reducing the number of tests and further improving runtime. Moreover, the reported runtime relates to the current implementation of the algorithm. Different implementations, such as GPU-based code, are likely to achieve even better results. With the decrease in runtime, permutation tests for significance become a feasible option, resulting in increased power compared with stringent methods for multiple hypothesis correction.
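To illustrate what a permutation test for significance involves, the following is a minimal sketch for a single SNP pair. It is not EPIQ's test: the statistic (absolute correlation between the phenotype and the product of two binary-coded SNPs) is a hypothetical stand-in that also responds to marginal effects, and all function names and the toy data are assumptions made for illustration. A genome-wide version would need a null distribution that accounts for all tested pairs, for example the maximum statistic per permutation, which is why runtime matters.

```python
import random
from statistics import fmean


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    mu, mv = fmean(u), fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0 or svv == 0:
        return 0.0
    return suv / (suu * svv) ** 0.5


def interaction_stat(x1, x2, y):
    """HYPOTHETICAL statistic: |corr(phenotype, x1 * x2)| for binary SNPs."""
    product = [a * b for a, b in zip(x1, x2)]
    return abs(pearson(product, y))


def permutation_pvalue(x1, x2, y, n_perm=999, seed=0):
    """Empirical P-value obtained by shuffling the phenotype n_perm times."""
    rng = random.Random(seed)
    observed = interaction_stat(x1, x2, y)
    shuffled = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        if interaction_stat(x1, x2, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed data as one of the permutations.
    return (exceed + 1) / (n_perm + 1)


# Toy data: 826 individuals, two random binary SNPs (assumed, not real data).
rng = random.Random(1)
n = 826
x1 = [rng.randint(0, 1) for _ in range(n)]
x2 = [rng.randint(0, 1) for _ in range(n)]
y_interacting = [a * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
y_null = [rng.gauss(0, 1) for _ in range(n)]

p_interacting = permutation_pvalue(x1, x2, y_interacting)
p_null = permutation_pvalue(x1, x2, y_null)
print(f"interacting pair: P = {p_interacting:.3f}")
print(f"null pair:        P = {p_null:.3f}")
```

The smallest P-value this scheme can report is 1 / (n_perm + 1), so reaching genome-wide thresholds by permutation requires either many permutations or a family-wise null distribution, both of which become practical only when each scan is fast.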

ACKNOWLEDGEMENTS

The collections and methods for the POPRES are described by Nelson et al. (2008). The genotype dataset was obtained from the LURIC study. The datasets used for the analyses described in this article were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000145.v4.p2 through dbGaP accession number phs000145.v4.p2.

Funding: This study was supported in part by a Fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. E.H. and Y.A. were supported by the Israel Science Foundation grant no. 1425/13. E.H. was also partially supported by National Science Foundation grant III-1217615. The LURIC study was supported by the 6th Framework Program (integrated project Bloodomics, grant LSHM-CT-2004-503485), by the 7th Framework Program (integrated project AtheroRemo, grant agreement number 201668 and RiskyCAD, grant agreement number 305739) of the European Union and by the INTERREG IV Oberrhein Program (Project A28, Genetic mechanisms of cardiovascular diseases) with support from the European Regional Development Fund (ERDF) and the Wissenschaftsoffensive TMO. E.H. is a Faculty Fellow of the Edmond J. Safra Center for Bioinformatics at Tel Aviv University.

Conflict of Interest: none declared.

REFERENCES

Abecasis GR, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
Bhattacharya K, et al. Rapid testing of gene-gene interactions in genome-wide association studies of binary and quantitative phenotypes. Genet. Epidemiol. 2011;35:800–808. doi: 10.1002/gepi.20629.
Brinza D, et al. RAPID detection of gene-gene interactions in genome-wide association studies. Bioinformatics. 2010;26:2856–2862. doi: 10.1093/bioinformatics/btq529.
Cordell HJ. Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet. 2009;10:392–404. doi: 10.1038/nrg2579.
Evans DM, et al. Two-stage two-locus models in genome-wide association. PLoS Genet. 2006;2:e157. doi: 10.1371/journal.pgen.0020157.
Gretton A, et al. Measuring statistical dependence with Hilbert-Schmidt norms. In: Algorithmic Learning Theory. Springer, Singapore; 2005. pp. 63–77.
Hemani G, et al. EpiGPU: exhaustive pairwise epistasis scans parallelized on consumer level graphics cards. Bioinformatics. 2011;27:1462–1465. doi: 10.1093/bioinformatics/btr172.
Hindorff LA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA. 2009;106:9362–9367. doi: 10.1073/pnas.0903103106.
Hu X, et al. SHEsisEpi, a GPU-enhanced genome-wide SNP-SNP interaction scanning algorithm, efficiently reveals the risk genetic epistasis in bipolar disorder. Cell Res. 2010;20:854–857. doi: 10.1038/cr.2010.68.
Kam-Thong T, et al. Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs. Bioinformatics. 2011;27:i214–i221. doi: 10.1093/bioinformatics/btr218.
Kostem E, Eskin E. Efficiently identifying significant associations in genome-wide association studies. J. Comput. Biol. 2013;20:817–830. doi: 10.1089/cmb.2013.0087.
Li W, Reich J. A complete enumeration and classification of two-locus disease models. Hum. Hered. 2000;50:334–349. doi: 10.1159/000022939.
Liu Y, et al. Genome-wide interaction-based association analysis identified multiple new susceptibility loci for common diseases. PLoS Genet. 2011;7:e1001338. doi: 10.1371/journal.pgen.1001338.
Maher B. Personal genomes: the case of the missing heritability. Nature. 2008;456:18–21. doi: 10.1038/456018a.
Marchini J, et al. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 2005;37:413–417. doi: 10.1038/ng1537.
Nelson MR, et al. The Population Reference Sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am. J. Hum. Genet. 2008;83:347–358. doi: 10.1016/j.ajhg.2008.08.005.
Prabhu S, Pe’er I. Ultrafast genome-wide scan for SNP-SNP interactions in common complex disease. Genome Res. 2012;22:2230–2240. doi: 10.1101/gr.137885.112.
Price AL, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847.
Purcell S, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
Schüpbach T, et al. FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics. 2010;26:1468–1469. doi: 10.1093/bioinformatics/btq147.
Wan X, et al. BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am. J. Hum. Genet. 2010;87:325–340. doi: 10.1016/j.ajhg.2010.07.021.
Winkelmann BR, et al. Rationale and design of the LURIC study–a resource for functional genomics, pharmacogenomics and long-term prognosis of cardiovascular disease. Pharmacogenomics. 2001;2(Suppl. 1):S1–S73. doi: 10.1517/14622416.2.1.S1.
Yung LS, et al. GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics. 2011;27:1309–1310. doi: 10.1093/bioinformatics/btr114.
Zhang X, et al. TEAM: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics. 2010;26:i217–i227. doi: 10.1093/bioinformatics/btq186.
