BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-wide Case-Control Studies

Xiang Wan; Can Yang; Qiang Yang; Hong Xue; Xiaodan Fan; Nelson LS Tang; Weichuan Yu

doi:10.1016/j.ajhg.2010.07.021

Am J Hum Genet. 2010 Sep 10;87(3):325–340.

PMCID: PMC2933337 PMID: 20817139

Abstract

Gene-gene interactions have long been recognized to be fundamentally important for understanding genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named “BOolean Operation-based Screening and Testing” (BOOST). For the discovery of unknown gene-gene interactions that underlie complex diseases, BOOST allows examination of all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hr to completely evaluate all pairs of roughly 360,000 SNPs on a standard 3.0 GHz desktop with 4G memory running the Windows XP system. The interaction patterns identified from the type 1 diabetes data set display significant difference from those identified from the rheumatoid arthritis data set, although both data sets share a very similar hit region in the WTCCC report. BOOST has also identified some disease-associated interactions between genes in the major histocompatibility complex region in the type 1 diabetes data set. We believe that our method can serve as a computationally and statistically useful tool in the coming era of large-scale interaction mapping in genome-wide case-control studies.

Introduction

Genome-wide case-control studies use high-throughput genotyping technologies to assay hundreds of thousands of SNPs and relate them to clinical conditions or measurable traits. To understand underlying causes of complex disease traits, it is often necessary to consider joint genetic effects (epistasis) across the whole genome. The concept of epistasis¹ was introduced around 100 years ago. It is generally defined as interactions among different genes. Recently, Phillips² highlighted the essential role of gene-gene interactions in the structure and evolution of genetic systems. Three terms are used to describe gene-gene interactions:

• Functional epistasis is a functional description that addresses the molecular interactions.
• Compositional epistasis, originally defined by Bateson,¹ refers to the blocking of one allelic effect by another allele at a different locus.
• Statistical epistasis, attributed to Fisher,³ is defined as the statistical deviation from the additive effects of two loci on the phenotype.

The existence of epistasis has been widely accepted as an important contributor to genetic variation in complex diseases such as asthma, cancer, diabetes, hypertension, and obesity.⁴ As a matter of fact, many researchers believe that it is critical to model complex interactions in order to elucidate the joint genetic effects that may cause complex diseases. They have demonstrated the presence of gene-gene interactions in complex diseases such as breast cancer⁵ and coronary heart disease.⁶ The problem of detecting gene-gene interactions in genome-wide case-control studies has attracted extensive research interest. The difficulty in this problem is the heavy computational burden. For example, in order to detect pairwise interactions from 500,000 SNPs genotyped in thousands of samples, we need 1.25 × 10¹¹ statistical tests in total. A recent review⁴ presented a detailed analysis on many popular methods that detect epistasis on the basis of the statistical definition, including MDR,⁵ PLINK,⁷ Tuning ReliefF,⁸ Random Jungle,⁹ BEAM,¹⁰ and three proposed search strategies.¹¹
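The count of pairwise tests quoted above is simply the number of unordered SNP pairs, L(L − 1)/2. A minimal check of that arithmetic (the function name is ours, chosen for illustration):

```python
import math


def n_pairwise_tests(n_snps: int) -> int:
    """Number of unordered SNP pairs, L * (L - 1) / 2."""
    return math.comb(n_snps, 2)


n_tests = n_pairwise_tests(500_000)
print(n_tests)  # 124999750000, i.e. about 1.25e11
```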

Among them, BEAM and MDR were reported to have difficulties in handling 500,000 SNPs genotyped in thousands of samples.⁴ Both methods need a prescreening process to reduce the number of SNPs in order to analyze the large data sets. Marchini et al.¹¹ demonstrated that it is feasible to test association allowing for interactions on a genome-wide scale. Random Jungle can handle genome-wide data efficiently. However, both Marchini's method and Random Jungle aim at testing associations allowing for interactions, which is easier than testing interactions (the Discussion explains in detail the difference between a test of association allowing for interactions and a test of interactions). PLINK was recommended as the most computationally feasible method that is able to detect gene-gene interactions in genome-wide data.⁴ PLINK finished a pairwise interaction examination of 89,294 SNPs selected from the WTCCC Crohn disease data set in 14 days. To accelerate the analysis process in genome-wide association studies (GWAS), parallel computation was recommended.^4,12

Here, we propose a fast method, named “BOolean Operation-based Screening and Testing” (BOOST), for the analysis of all pairwise interactions in genome-wide SNP data. In our method, we design a Boolean representation of genotype data, which promotes not only space efficiency but also CPU efficiency because it involves only Boolean values and allows for the use of fast logic (bitwise) operations to obtain contingency tables. On the basis of this data representation, we propose a two-stage (screening and testing) search method. In the screening stage, we use a noniterative method to approximate the likelihood ratio statistic in evaluating all pairs of SNPs and select those passing a specified threshold. Most nonsignificant interactions will be filtered out, and the survival of significant interactions is guaranteed. In the testing stage, we employ the classical likelihood ratio test to measure the interaction effects of selected SNP pairs. Experiments on WTCCC data sets show that our method is faster than current methods. This efficiency helps to identify interesting interaction patterns from the type 1 diabetes data set and the rheumatoid arthritis data set.

Material and Methods

Notation

Suppose we have L SNPs and n samples. We use X_l to denote the l-th SNP, $l = 1, \dots, L$ , and Y to denote the class label (1 for case and 2 for control). SNPs are biallelic genetic markers in genome-wide case-control studies. In general, we use capital letters (e.g., A, B, …) to denote major alleles and use lowercase letters (e.g., a, b, …) to denote minor alleles. For each SNP, there are three genotypes: the homozygous reference genotype (AA), the heterozygous genotype (Aa), and the homozygous variant genotype (aa). The popular way of coding the genotype data is to use {1, 2, 3} to represent {AA, Aa, aa}, respectively.

Definition of Interaction via Logistic Regression Models

Interactions are often defined via logistic regression models.¹³ The logistic regression model with only main effects, i.e., the main effect model, has the following form:

log \frac{P (Y = 1 | X_{p} = i, X_{q} = j)}{P (Y = 2 | X_{p} = i, X_{q} = j)} = β_{0} + β_{i}^{X_{p}} + β_{j}^{X_{q}}

(Equation 1)

The logistic regression model with both main effect terms and interaction terms, i.e., the full model, has the following form:

log \frac{P (Y = 1 | X_{p} = i, X_{q} = j)}{P (Y = 2 | X_{p} = i, X_{q} = j)} = β_{0} + β_{i}^{X_{p}} + β_{j}^{X_{q}} + β_{i j}^{X_{p} X_{q}}

(Equation 2)

Please note that the superscript X_p of $β_{i}^{X_{p}}$ in both equations is merely a label and does not represent the exponent. The term $β_{i}^{X_{p}}$ represents the coefficient of X_p at category i. This representation extends to $β_{j}^{X_{q}}$ and $β_{i j}^{X_{p} X_{q}}$ as well. There are five coefficients in Equation 1 and nine coefficients in Equation 2. This is because one category of both X_p and X_q is used as the reference. This notation is adopted by Agresti¹⁴ to make the representations of logistic regression models and log-linear models (introduced later) more compact.

Let L_M and L_F be the log-likelihoods of the main effect model and the full model, respectively. According to the likelihood ratio test, interaction effects are defined as the difference of the log-likelihoods of these two models evaluated at their maximum likelihood estimations (MLEs), i.e., ${\hat{L}}_{F} - {\hat{L}}_{M}$ . Hence, interaction effects can be interpreted as the departure from linear models naturally.⁴

However, it is computationally unaffordable to directly use this measure to evaluate all pairs of SNPs in a genome-wide case-control study because there are hundreds of billions of pairs to be tested. Therefore, faster test procedures without loss of statistical power are needed in GWAS. Noticing the equivalence between a logistic regression model and its corresponding log-linear model,¹⁴ we propose to test two-locus interactions on the basis of log-linear models. The advantage of doing so is that the test statistic can be quickly approximated without iteration.

Log-Linear Models for Contingency Tables

To test the interaction effect between two SNPs (X_p, X_q) and disease status Y by using log-linear models, a contingency table of these three variables is used (see Table 1). The size of the contingency table is I × J × K, where I = 3, J = 3, and K = 2. In Table 1, n_ijk denotes the observed count in the cell (i, j, k). It is considered a realization of a random variable N_ijk that is assumed to be Poisson distributed. We use π_ijk to denote the probability that an observation falls in the cell (i, j, k). A natural constraint on π_ijk is

\sum_{i, j, k} π_{i j k} = 1

(Equation 3)

Table 1.

The Genotype Counts in Cases and Controls

Y = 1	X_q = 1	X_q = 2	X_q = 3	Y = 2	X_q = 1	X_q = 2	X_q = 3
X_p = 1	n₁₁₁	n₁₂₁	n₁₃₁	X_p = 1	n₁₁₂	n₁₂₂	n₁₃₂
X_p = 2	n₂₁₁	n₂₂₁	n₂₃₁	X_p = 2	n₂₁₂	n₂₂₂	n₂₃₂
X_p = 3	n₃₁₁	n₃₂₁	n₃₃₁	X_p = 3	n₃₁₂	n₃₂₂	n₃₃₂

Cases are denoted with Y = 1 and controls with Y = 2.

We use the dot convention to indicate summation over a subscript; e.g., $π_{i ..} = \sum_{j, k} π_{i j k}$ is the marginal probability of X_p = i, and $n_{i ..} = \sum_{j, k} n_{i j k}$ is the number of observations with X_p = i. The notation extends to two dimensions as well. For example, $π_{i j .} = \sum_{k} π_{i j k}$ is the marginal probability of X_p = i and X_q = j, and $n_{i j .} = \sum_{k} n_{i j k}$ is the corresponding count. Clearly, we have $n = \sum_{i, j, k} n_{i j k}$ .

Log-linear models treat N_ijk as independent Poisson random variables with their means as follows:

μ_{i j k} = n π_{i j k}

(Equation 4)

The likelihood function is

f (μ) = \prod_{i, j, k} \frac{e^{- μ_{i j k}} μ_{i j k}^{n_{i j k}}}{n_{i j k}!}

(Equation 5)

Correspondingly, the log-likelihood function is

L (μ) = \sum_{i, j, k} [n_{i j k} log (μ_{i j k}) - μ_{i j k} - log (n_{i j k}!)]

(Equation 6)

In the space of log-linear models, the homogeneous association model is the equivalent form of the logistic regression model with only main effects (defined in Equation 1), and the saturated model matches the full logistic regression model (defined in Equation 2). Table 2 summarizes the equivalence between log-linear models and logistic models for a three-way contingency table. The details are provided in the Appendix. In the following text, we explain how these two models are used to test interactions.

Table 2.

Equivalence between Log-Linear Models and Logistic Models for a Three-Way Table with Binary Response Variable Y

Log-Linear Model	Logistic Model	MLE of μ_ijk
Block independence model (M_B): $\log μ_{i j k} = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i j}^{X_{p} X_{q}}$	$β_{0}$	$\frac{n_{i j .} n_{.. k}}{n}$
Partial independence model (M_P): $\log μ_{i j k} = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i j}^{X_{p} X_{q}} + λ_{i k}^{X_{p} Y}$	$β_{0} + β_{i}^{X_{p}}$	$\frac{n_{i . k} n_{. j k}}{n_{.. k}}$
Homogeneous association model (M_H): $\log μ_{i j k} = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i j}^{X_{p} X_{q}} + λ_{i k}^{X_{p} Y} + λ_{j k}^{X_{q} Y}$	$β_{0} + β_{i}^{X_{p}} + β_{j}^{X_{q}}$	iterative estimation
Saturated model (M_S): $\log μ_{i j k} = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i j}^{X_{p} X_{q}} + λ_{i k}^{X_{p} Y} + λ_{j k}^{X_{q} Y} + λ_{i j k}^{X_{p} X_{q} Y}$	$β_{0} + β_{i}^{X_{p}} + β_{j}^{X_{q}} + β_{i j}^{X_{p} X_{q}}$	$n_{i j k}$

The models M_B and M_P are used in the discussion of the difference between the test of interactions and the test of associations. The details of these two models are provided in the Appendix.

Measuring Interaction via Log-Linear Models

On the basis of the equivalence between the log-linear model and its corresponding logistic regression model, we construct our test statistic using the homogeneous association model M_H and the saturated model M_S. Let L_H and L_S be the log-likelihood of M_H and M_S, respectively. According to Equation 6 and the MLE of μ_ijk in M_S (see Table 2 and the Appendix), the maximum log-likelihood of M_S is

{\hat{L}}_{S} = \sum_{i, j, k} [n_{i j k} log n_{i j k} - n_{i j k} - log (n_{i j k}!)]

(Equation 7)

The log-likelihood of M_H is maximized at its MLE ${\hat{μ}}_{i j k}^{H}$ :

{\hat{μ}}_{i j k}^{H} = \arg \max_{μ_{i j k}} L_{H} = \arg \max_{μ_{i j k}} \sum_{i, j, k} [n_{i j k} log μ_{i j k} - μ_{i j k} - log (n_{i j k}!)]

(Equation 8)

In other words,

{\hat{L}}_{H} = L_{H} ({\hat{μ}}_{i j k}^{H}) = \max_{μ_{i j k}} L_{H} (μ_{i j k})

(Equation 9)

Notice that ${\hat{μ}}_{i j k}^{H}$ always exists and is unique because of the concavity of L_H. To measure interaction effects based on the likelihood ratio test, we have

{\hat{L}}_{S} - {\hat{L}}_{H} = \sum_{i, j, k} [n_{i j k} log \frac{n_{i j k}}{{\hat{μ}}_{i j k}^{H}} - n_{i j k} + {\hat{μ}}_{i j k}^{H}]

(Equation 10)

Because Equation 4 implies that

\sum_{i, j, k} {\hat{μ}}_{i j k}^{H} = n

(Equation 11)

Equation 10 can be further reduced as

\begin{array}{r} {\hat{L}}_{S} - {\hat{L}}_{H} = \sum_{i, j, k} [n_{i j k} log \frac{n_{i j k}}{{\hat{μ}}_{i j k}^{H}}] \\ = n \sum_{i, j, k} [\frac{n_{i j k}}{n} log \frac{n_{i j k} / n}{{\hat{μ}}_{i j k}^{H} / n}] \\ = n \sum_{i, j, k} [{\hat{π}}_{i j k} log \frac{{\hat{π}}_{i j k}}{{\hat{p}}_{i j k}}] \\ = n \cdot D_{K L} ({\hat{π}}_{i j k} ‖ {\hat{p}}_{i j k}) \end{array}

(Equation 12)

where $D_{K L} ({\hat{π}}_{i j k} ‖ {\hat{p}}_{i j k})$ is the Kullback-Leibler divergence of ${\hat{π}}_{i j k}$ and ${\hat{p}}_{i j k}$ .

The new measure $D_{K L} ({\hat{π}}_{i j k} ‖ {\hat{p}}_{i j k})$ provides us another interpretation of interactions. Equation 12 shows that the difference of the two log-likelihoods is proportional to the Kullback-Leibler divergence of the joint distribution ${\hat{π}}_{i j k}$ obtained under the saturated model M_S, and the distribution ${\hat{p}}_{i j k}$ obtained under the homogeneous association model M_H. The distribution ${\hat{p}}_{i j k}$ is constructed via lower-order distributions (see the Appendix). From the perspective of log-linear models, interaction effects can be understood as the information contained in the joint distribution but not in its lower-order factorization, which is known as “synergy” in physics.¹⁵ If no interaction effects exist, the joint distribution can be well characterized by its lower-order factorization.
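The identity in Equation 12 can be checked numerically. The sketch below evaluates the Poisson log-likelihood of Equation 6 at the saturated fit and at a second set of fitted means, and compares the difference with n times the Kullback-Leibler divergence. The four-cell table and the fitted means are made-up illustrative numbers (not an actual M_H fit); the only property used is that the fitted means sum to n, as in Equation 11.

```python
import math


def poisson_loglik(counts, means):
    """Equation 6: sum of n*log(mu) - mu - log(n!) over the cells."""
    total = 0.0
    for n, m in zip(counts, means):
        if n > 0:  # convention: 0 * log(0) = 0
            total += n * math.log(m)
        total += -m - math.lgamma(n + 1)  # lgamma(n + 1) = log(n!)
    return total


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Illustrative flattened table and fitted means; both sum to n = 100.
counts = [30, 10, 20, 40]
means = [25.0, 15.0, 25.0, 35.0]
n = sum(counts)

lhs = poisson_loglik(counts, counts) - poisson_loglik(counts, means)
rhs = n * kl_divergence([c / n for c in counts], [m / n for m in means])
print(abs(lhs - rhs) < 1e-9)  # True
```

The terms log(n_ijk!) cancel in the difference, and the terms −n_ijk + μ_ijk cancel because both sets of means have the same total, which is what leaves the pure divergence form.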

Boolean Operation-Based Screening and Testing

Boolean Representation of Genotype Data

The data set containing L SNPs and n samples is usually stored in an $L \times n$ matrix. Each cell in this matrix takes a value from {1, 2, 3}, the elements of which represent the homozygous reference genotype, the heterozygous genotype, and the homozygous variant genotype, respectively. In our method, we introduce a Boolean representation of genotype data (the details are provided in the Appendix). This Boolean representation enables us to collect contingency tables in a fast manner.
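Because the Appendix is not reproduced here, the following is only a sketch of how such a Boolean representation might work, under our own assumptions about the layout: each SNP is stored as six bit strings, one per (class, genotype) combination, and each cell of the contingency table is the number of set bits in the bitwise AND of two bit strings. Python integers stand in for the packed machine words a production implementation would use, and all function names are hypothetical.

```python
def encode_snp(genotypes, labels):
    """Build six bit strings for one SNP, one per (class, genotype) pair.

    bits[k][g] is a Python int whose bit s is 1 when sample s has
    class label k + 1 and genotype code g + 1.
    """
    bits = [[0, 0, 0], [0, 0, 0]]
    for s, (g, y) in enumerate(zip(genotypes, labels)):
        bits[y - 1][g - 1] |= 1 << s
    return bits


def popcount(x):
    """Count the set bits of a non-negative int."""
    return bin(x).count("1")


def contingency_table(bits_p, bits_q):
    """Return n[i][j][k], the 3 x 3 x 2 table of Table 1 (0-based indices)."""
    return [[[popcount(bits_p[k][i] & bits_q[k][j]) for k in range(2)]
             for j in range(3)]
            for i in range(3)]


# Toy data: six samples, genotypes coded {1, 2, 3}, labels 1 = case, 2 = control.
x_p = [1, 2, 3, 1, 2, 1]
x_q = [1, 1, 2, 3, 2, 1]
y = [1, 1, 1, 2, 2, 2]
table = contingency_table(encode_snp(x_p, y), encode_snp(x_q, y))
```

Each table cell costs one AND and one bit count over the whole sample, instead of a pass over individual genotype codes, which is where the speed of this representation would come from.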

Screening and Testing

Directly using ${\hat{L}}_{S} - {\hat{L}}_{H}$ to test interactions in GWAS still has some difficulties, because no closed-form solution exists for the homogeneous association model M_H. Iterative methods are needed in model fitting to compute ${\hat{L}}_{H}$. This will be computationally intensive when we face hundreds of billions of SNP pairs.

To solve this issue, we propose to approximate the homogeneous association model M_H with the Kirkwood superposition approximation (KSA):¹⁵
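To make the cost of the iterative fit concrete, here is a minimal sketch of iterative proportional fitting (IPF), one standard way to fit a homogeneous association model. The text above does not say which iterative method is used, so IPF is our assumption, and the function name and stopping rule are illustrative only.

```python
def fit_homogeneous_association(n, max_iter=1000, tol=1e-12):
    """Fit M_H to a table n[i][j][k] by iterative proportional fitting.

    Each cycle rescales the fitted means so that they reproduce, in turn,
    the observed (Xp, Xq), (Xp, Y) and (Xq, Y) two-way margins.
    """
    I, J, K = len(n), len(n[0]), len(n[0][0])
    mu = [[[1.0] * K for _ in range(J)] for _ in range(I)]

    def rescale(cells, target):
        # Scale the listed cells so that their fitted sum equals the target.
        current = sum(mu[i][j][k] for i, j, k in cells)
        factor = target / current if current > 0 else 0.0
        for i, j, k in cells:
            mu[i][j][k] *= factor

    for _ in range(max_iter):
        previous = [[cell[:] for cell in row] for row in mu]
        for i in range(I):
            for j in range(J):
                rescale([(i, j, k) for k in range(K)], sum(n[i][j]))
        for i in range(I):
            for k in range(K):
                rescale([(i, j, k) for j in range(J)],
                        sum(n[i][j][k] for j in range(J)))
        for j in range(J):
            for k in range(K):
                rescale([(i, j, k) for i in range(I)],
                        sum(n[i][j][k] for i in range(I)))
        change = max(abs(mu[i][j][k] - previous[i][j][k])
                     for i in range(I) for j in range(J) for k in range(K))
        if change < tol:
            break
    return mu
```

Every SNP pair would need its own run of this loop, which is why a noniterative approximation is attractive when the number of pairs is in the hundreds of billions.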

{\hat{p}}_{i j k}^{K} = \frac{1}{η} \frac{π_{i j .} π_{i . k} π_{. j k}}{π_{i ..} π_{. j .} π_{.. k}}

(Equation 13)

where $η = \sum_{i, j, k} \frac{π_{i j .} π_{i . k} π_{. j k}}{π_{i ..} π_{. j .} π_{.. k}}$ is a normalization term. The benefit of using KSA is two-fold:

First, ${\hat{L}}_{S} - {\hat{L}}_{K S A}$ is an upper bound of ${\hat{L}}_{S} - {\hat{L}}_{H}$ ; i.e.,

{\hat{L}}_{S} - {\hat{L}}_{H} \leq {\hat{L}}_{S} - {\hat{L}}_{K S A}

(Equation 14)

where ${\hat{L}}_{K S A}$ is the log-likelihood evaluated at the MLE ${\hat{μ}}_{i j k}^{K}$ of the KSA model (see the proof in the Appendix).

Noticing that the calculation of ${\hat{p}}_{i j k}^{K}$ is straightforward and no iteration is involved, the approximated measure $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A}) = 2 n \cdot D_{K L} ({\hat{π}}_{i j k} ‖ {\hat{p}}_{i j k}^{K})$ can be obtained easily on the basis of the contingency table collected by the Boolean operation. Therefore, the KSA model can be applied to evaluate hundreds of billions of SNP pairs. Because we are interested only in interactions with large $2 ({\hat{L}}_{S} - {\hat{L}}_{H})$ values, we can first filter out those SNP pairs with $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A}) \leq τ$ by using a threshold τ, and we can then conduct statistical tests on the remaining SNP pairs.

Second, the bound in Equation 14 is tight. When the joint distribution is $p_{i j k}^{K}$ (Equation 13), the equality holds; i.e., ${\hat{L}}_{S} - {\hat{L}}_{K S A} = {\hat{L}}_{S} - {\hat{L}}_{H}$. This bound is very close to the statistic ${\hat{L}}_{S} - {\hat{L}}_{H}$ of the likelihood ratio test. To illustrate the tightness of the bound, we use the simulation method proposed by Li et al.¹⁶ to generate a data set containing 2000 SNPs and 1000 samples based on HapMap data. Figure 1A shows the linkage disequilibrium (LD) pattern of the simulated data, which is very similar to that of real data. Using these data, we calculate $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A}) = 2 n \cdot D_{K L} ({\hat{π}}_{i j k} ‖ {\hat{p}}_{i j k}^{K})$ based on the KSA and $2 ({\hat{L}}_{S} - {\hat{L}}_{H}) = 2 n \cdot D_{K L} ({\hat{π}}_{i j k} ‖ {\hat{p}}_{i j k})$ based on log-linear models for all pairs of the 2000 SNPs. Figure 1B shows the comparison of these two models. It can be seen that $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A})$ consistently overestimates $2 ({\hat{L}}_{S} - {\hat{L}}_{H})$. For the region [25, +∞), $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A})$ is almost identical to $2 ({\hat{L}}_{S} - {\hat{L}}_{H})$.

Figure 1.

KSA Performance in Simulation

(A) The LD (measured by r²) pattern of simulated data from the Hapmap data. To show the block structure clearly, we show only the LD of the first 500 SNPs here. The LD block structure of all 2000 SNPs is very similar.

(B) Comparison of the values $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A})$ and $2 ({\hat{L}}_{S} - {\hat{L}}_{H})$ based on KSA and log-linear models. KSA overestimation $2 ({\hat{L}}_{S} - {\hat{L}}_{H}) \leq 2 ({\hat{L}}_{S} - {\hat{L}}_{K S A})$ is illustrated here. For the region [25, + ∞), $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A})$ is almost identical to $2 ({\hat{L}}_{S} - {\hat{L}}_{H})$ .
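The screening statistic $2 n \cdot D_{K L} ({\hat{π}}_{i j k} ‖ {\hat{p}}_{i j k}^{K})$ needs only the cell proportions and their one- and two-way margins. The sketch below computes it directly from a 3 × 3 × 2 count table following Equation 13; the function name is ours, and plugging in the observed proportions for the margins is our reading of the text.

```python
import math


def ksa_statistic(n):
    """Return 2 * n * D_KL(pi_hat || p_hat^K) for a count table n[i][j][k]."""
    I, J, K = len(n), len(n[0]), len(n[0][0])
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    total = sum(n[i][j][k] for i, j, k in cells)
    pi = [[[n[i][j][k] / total for k in range(K)] for j in range(J)]
          for i in range(I)]

    # Two-way and one-way margins of the observed proportions.
    p_ij = [[sum(pi[i][j]) for j in range(J)] for i in range(I)]
    p_ik = [[sum(pi[i][j][k] for j in range(J)) for k in range(K)]
            for i in range(I)]
    p_jk = [[sum(pi[i][j][k] for i in range(I)) for k in range(K)]
            for j in range(J)]
    p_i = [sum(p_ij[i]) for i in range(I)]
    p_j = [sum(p_ij[i][j] for i in range(I)) for j in range(J)]
    p_k = [sum(p_ik[i][k] for i in range(I)) for k in range(K)]

    def raw(i, j, k):
        # Unnormalized Kirkwood superposition term of Equation 13.
        denom = p_i[i] * p_j[j] * p_k[k]
        if denom == 0:
            return 0.0
        return p_ij[i][j] * p_ik[i][k] * p_jk[j][k] / denom

    eta = sum(raw(i, j, k) for i, j, k in cells)  # normalization term
    kl = 0.0
    for i, j, k in cells:
        if pi[i][j][k] > 0:  # a positive cell implies positive margins
            kl += pi[i][j][k] * math.log(pi[i][j][k] * eta / raw(i, j, k))
    return 2.0 * total * kl
```

No loop to convergence is involved: the cost per SNP pair is a fixed number of sums and logarithms over 18 cells.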

In summary, most nonsignificant interactions can be filtered out because of the tightness of the bound (Equation 14) and the survival of significant interactions is guaranteed. On the basis of this upper bound, we propose our method, BOOST:

Stage 1: Screening

We evaluate all pairwise interactions by using the KSA in the screening stage. For each pair, the calculation of $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A})$ is based on the contingency table collected by using Boolean operations. Because $2 ({\hat{L}}_{S} - {\hat{L}}_{H}) \leq 2 ({\hat{L}}_{S} - {\hat{L}}_{K S A})$, an interaction obtained by the KSA without passing a specified threshold τ, i.e., $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A}) \leq τ$, would not be considered in stage 2. The threshold τ corresponds to the significance threshold (with the Bonferroni correction) specified by users. Because the Bonferroni correction tends to be conservative, a smaller threshold can be used to put more SNP pairs into the testing stage. We set τ = 30 in our experiments to test the computational capacity of our method. The threshold τ = 30 corresponds to the unadjusted p = 4.89 × 10⁻⁶, which is a very weak significance level for a genome-wide study.
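The correspondence between τ = 30 and p = 4.89 × 10⁻⁶ can be reproduced from the χ² distribution with four degrees of freedom (the df used in the testing stage), whose upper tail has the closed form $e^{- x / 2} (1 + x / 2)$. A short check, with a function name of our own choosing:

```python
import math


def chi2_sf_df4(x):
    """Upper-tail probability P(X > x) for a chi-square variable with df = 4."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


p_tau = chi2_sf_df4(30.0)
print(f"{p_tau:.3g}")  # 4.89e-06
```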

Stage 2: Testing

For each pair with $2 ({\hat{L}}_{S} - {\hat{L}}_{K S A}) > τ$ , we test the interaction effect using the likelihood ratio statistic $2 ({\hat{L}}_{S} - {\hat{L}}_{H})$ . We fit the log-linear models M_H and M_S and calculate this test statistic using Equation 12. After that, we conduct the χ² test with four degrees of freedom (df = 4) to determine whether the interaction effect is significant. The p value is adjusted by the Bonferroni correction, with the number of tests $L (L - 1) / 2$ , where L is the total number of SNPs before screening.
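As a rough illustration of the testing stage, the sketch below turns a likelihood ratio statistic into a Bonferroni-adjusted p value using df = 4 and L(L − 1)/2 tests. The statistic values and the figure of 360,000 SNPs (the approximate size quoted in the Abstract) are used only as examples, and the helper names are hypothetical.

```python
import math


def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variable with df = 4."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def bonferroni_adjusted_p(lr_statistic, n_snps):
    """Adjust the df = 4 p value for all L * (L - 1) / 2 SNP pairs."""
    n_tests = n_snps * (n_snps - 1) // 2
    return min(1.0, chi2_sf_df4(lr_statistic) * n_tests)


# A pair that barely passes the screening threshold is far from significant
# after correction, whereas a much larger statistic survives it.
weak = bonferroni_adjusted_p(30.0, 360_000)
strong = bonferroni_adjusted_p(70.0, 360_000)
print(weak, strong < 0.05)
```

The number of tests is taken from the SNP count before screening, so passing stage 1 does not relax the correction applied in stage 2.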

To approximate M_H, we may also choose some other log-linear models, such as the block independence model M_B or the partial independence model M_P (see Table 2). However, such approximations will lead to very loose bounds, leaving millions of SNP pairs to be examined in the testing stage. Using the KSA, we have empirically observed that 300,000∼600,000 SNP pairs are examined in the testing stage when the WTCCC data are analyzed. When the partial independence model is used, the number of SNP pairs is up to 10⁸∼10⁹.

Results

Experiments on Simulation Data

The performance of our approach is evaluated through comparative studies with existing methods. Our goal is to discover epistatic interactions from genome-wide data. Among many methods recently proposed, we mainly compare BOOST with PLINK⁷ with respect to the power of gene-gene interaction identification. The reasons for choosing PLINK for comparison are as follows:

• A recent review⁴ tested many available methods and recommended PLINK as a powerful tool for testing interactions on a genome-wide scale.
• Both PLINK and BOOST use an exhaustive search strategy. The comparison of their performance is fair.

We conduct the following simulation studies to compare BOOST with PLINK (tested with the “--fast-epistasis” option and without the “--case-only” option):

• Case 1: Disease loci with main effects.
• Case 2: Disease loci without main effects.
• Case 3: Genetic heterogeneity.
• Case 4: Null simulation for testing type I errors.

Case 1: Disease Loci with Main Effects

We consider four epistasis models whose odds tables are given in Table S7, available online. Model 1 is a multiplicative model.¹¹ Model 2 is an epistasis model¹⁷ that has been used to describe handedness¹⁸ and the color of swine.¹⁹ Model 3 is a classical epistasis model.^20,21 Model 4 is the well known XOR (exclusive OR) model.

Let p(D|G_i) denote the probability of an individual being affected given its genotype combination G_i (i.e., the penetrance of G_i), and let $p (\bar{D} | G_{i})$ denote the probability of an individual not being affected given its genotype G_i. On the basis of the definition of the odds of a disease,

O D D_{G_{i}} = \frac{p (D | G_{i})}{p (\bar{D} | G_{i})} = \frac{p (D | G_{i})}{1 - p (D | G_{i})}

(Equation 15)

the penetrance p(D|G_i) of the genotype G_i can be calculated by using

p (D | G_{i}) = \frac{O D D_{G_{i}}}{1 + O D D_{G_{i}}}

(Equation 16)

The disease prevalence p(D) and genetic heritability h² are given as

p (D) = \sum_{i} p (D | G_{i}) p (G_{i})

(Equation 17)

h^{2} = \frac{\sum_{i} {(p (D | G_{i}) - p (D))}^{2} p (G_{i})}{p (D) (1 - p (D))}

(Equation 18)

In our simulation, the prevalence p(D) and the heritability h² are controlled by the parameters α and θ (see Table S6). We first specify the disease prevalence p(D) and the genetic heritability h², and we then numerically solve the parameters (α and θ) on the basis of the above equations. For example, we set p(D) = 0.1 and h² = 0.03 in model 1. Then we obtain α = 0.09989 and θ = 3.4481 for minor allele frequency (MAF) = 0.1.
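Equations 16–18 can be evaluated directly once an odds table and genotype frequencies are given. The sketch below does the forward computation only; solving for α and θ from target values of p(D) and h² would wrap a numerical root finder around it. The odds table shown is a hypothetical multiplicative-style example, not the content of Table S7, and the two loci are assumed to be independent and in Hardy-Weinberg equilibrium.

```python
def hwe_frequencies(maf):
    """Genotype frequencies (AA, Aa, aa) under Hardy-Weinberg equilibrium."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def penetrance(odds):
    """Equation 16: p(D | G) = ODD / (1 + ODD)."""
    return odds / (1.0 + odds)


def prevalence_and_heritability(odds_table, maf_p, maf_q):
    """Equations 17 and 18 for a 3 x 3 two-locus odds table."""
    freq_p, freq_q = hwe_frequencies(maf_p), hwe_frequencies(maf_q)
    cells = [(freq_p[i] * freq_q[j], penetrance(odds_table[i][j]))
             for i in range(3) for j in range(3)]
    p_d = sum(f * pen for f, pen in cells)
    variance = sum(f * (pen - p_d) ** 2 for f, pen in cells)
    return p_d, variance / (p_d * (1.0 - p_d))


# Hypothetical odds table: baseline odds alpha, multiplied by (1 + theta)
# raised to the product of the minor-allele counts at the two loci.
alpha, theta = 0.1, 1.0
odds_table = [[alpha * (1 + theta) ** (i * j) for j in range(3)]
              for i in range(3)]
p_d, h2 = prevalence_and_heritability(odds_table, maf_p=0.2, maf_q=0.2)
```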

In the simulation, we set h² = 0.03 for model 1 and h² = 0.02 for models 2, 3, and 4. We generate genotype data on the basis of the Hardy-Weinberg principle. We set the MAFs of disease-associated SNPs to be 0.1, 0.2, and 0.4. We generate the MAFs of unassociated SNPs uniformly from [0.05, 0.5]. We simulate 100 data sets under each setting for each disease model. Each data set contains 1000 SNPs. To take sample size into consideration, we simulate both 800 samples and 1600 samples with the balanced design.
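Generating unassociated SNPs as described above takes only a few lines. The following sketch draws each SNP's MAF uniformly from [0.05, 0.5] and samples genotype codes {1, 2, 3} under the Hardy-Weinberg principle; it covers only the null SNPs, not the disease-associated pair or the case-control sampling, and the function name and seeding scheme are our own.

```python
import random


def simulate_null_genotypes(n_snps, n_samples, seed=0):
    """Return an L x n matrix of genotype codes {1, 2, 3} for null SNPs.

    Each SNP gets a MAF drawn uniformly from [0.05, 0.5]; genotypes
    (AA, Aa, aa) are then drawn with Hardy-Weinberg proportions.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_snps):
        maf = rng.uniform(0.05, 0.5)
        weights = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
        data.append(rng.choices([1, 2, 3], weights=weights, k=n_samples))
    return data


genotypes = simulate_null_genotypes(n_snps=1000, n_samples=800, seed=42)
```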

Figure 2 presents the comparison results with the significance thresholds selected as 0.1, 0.2, and 0.3 after the Bonferroni correction. For model 1 with MAF = 0.2, 0.4 and model 2 with MAF = 0.1, the statistical power of PLINK is higher. This is because these model settings are well captured by the allele interaction test. For all other settings, BOOST outperforms PLINK.

Figure 2.

The Performance Comparison between BOOST and PLINK on Four Disease Models

Under each parameter setting, 100 data sets are generated. Both 800 samples and 1600 samples with balanced design are simulated. The power is calculated as the proportion of the 100 data sets in which the interactions of the disease-associated SNPs are detected. The absence of bars indicates no power.

Case 2: Disease Loci without Main Effects

Disease models displaying no main effects²² have been carefully discussed, and a wide spectrum of these models²³ has been provided. In this experiment, we use all 70 of these pure epistatic models without main effects to compare performance. For convenience, these models are listed in Tables S8–S14. The heritability h² controls the phenotypic variation of these 70 models and ranges from 0.01 to 0.4. The MAF ranges from 0.2 to 0.4. For each model, the statistical power is evaluated under different sample sizes, including n = 400, n = 800, and n = 1600 (half controls and half cases). For each setting, 100 data sets are generated. Each data set contains 1000 SNPs.

Figures S4–S7 show the comparison results for the 70 models. For some models, such as model epi1–5, BOOST and PLINK perform equally well. For most of these models, BOOST is superior to PLINK because the interaction patterns cannot be well characterized by allele interactions.

Case 3: Genetic Heterogeneity

Genetic heterogeneity refers to the phenomenon that a disease is affected by different subsets of genes. It plays a substantial role in complex human diseases.²⁴ Here, we set up a simulation study to show the performance of BOOST and PLINK when genetic heterogeneity is present. We choose some epistatic models used in case 2 to generate the data. The heritability h² of these models ranges from 0.01 to 0.4. Different sample sizes, including n = 400, n = 800 and n = 1600, are simulated for each model. The details of simulation are provided in the Appendix.

The performance of both BOOST and PLINK is given in Figure S8. Genetic heterogeneity affects the performance of both BOOST and PLINK. In general, their performance degrades as heritability h² decreases. The sample size plays an important role when genetic heterogeneity is present. When the sample size increases from 400 to 1600, the power of both BOOST and PLINK increases substantially.

Case 4: Null Simulation for Testing Type I Errors

To compare BOOST and PLINK in terms of type I errors, we conduct null simulation in two scenarios:

• Scenario 1: Without LD. We generate 1000 null data sets. Each data set contains 1000 SNPs and 1000 samples. All of the SNPs are generated independently, with MAFs uniformly distributed in [0.05, 0.5]. The result is shown in Figure 3A. It can be seen that the type I error of BOOST agrees with the nominal error rate and the type I error of PLINK is slightly lower than the nominal error rate.
• Scenario 2: With LD. The simulation program “genomeSIMLA”²⁵ is used to simulate the SNP data on the basis of the marker information on the Affymetrix 500K chip from human chromosome 1. LD exists among SNPs. We generate 100 null data sets, each of which contains 38,836 SNPs and 1000 samples. The result is shown in Figure 3B. Because of the LD pattern, the error rates of both methods are lower than the nominal error rate, confirming that the Bonferroni correction is conservative. Surprisingly, unlike the situation in scenario 1, the error rate of BOOST is less than that of PLINK. The reason is that some cells of a contingency table may be empty when LD exists. This leads to the true degree of freedom df_true ≤ 4. Because we calculate p values by using the χ² distribution with df = 4, BOOST has a lower type I error rate than PLINK. This simulation study also implies that it is possible to increase the power of BOOST by using a more accurate degree of freedom in statistical tests.

Figure 3.

Comparison of the Type I Error Rates in Null Simulation

(A) Null simulation with no LD.

(B) Null simulation with LD.

Experiments on WTCCC Data

We have applied BOOST to analyze data (14,000 cases in total and 3000 shared controls) from the WTCCC on seven common human diseases: bipolar disorder (BD), coronary artery disease (CAD), Crohn disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D). The procedure of quality control is presented in the Appendix. The results under different constraints are reported in Table 3. For T1D, we discovered many gene-gene interactions in the MHC region (see detailed descriptions in the following section). For the other six diseases, however, we did not find nontrivial interactions (except one SNP pair in CD).

Table 3.

The Number of Interactions Identified from the WTCCC Data Sets of Seven Diseases under Different Constraints

	BD	CAD	CD	HT	RA	T1D	T2D
C¹	10	16	8	7	350	4499	18
C¹ & C²	0	0	1	0	0	789	0
C¹ & C² & C³	0	0	1	0	0	91	0

Abbreviations are as follows: BD, bipolar disorder; CAD, coronary artery disease; CD, Crohn disease; HT, hypertension; RA, rheumatoid arthritis; T1D, type 1 diabetes; T2D, type 2 diabetes. C¹ is the significance threshold constraint: the significance threshold is 0.05 for the Bonferroni-corrected interaction p value. C² is the distance constraint: the physical distance between two interacting SNPs is at least 1 Mb. This constraint is used to avoid interactions that might be attributed to the LD effects.⁴ C³ is the main effect constraint: the single-locus p value should not be less than 10⁻⁶. This constraint is used to see whether there exist strong interactions without significant main effects, because those SNPs with p ≥ 10⁻⁶ are usually filtered out in the typical single-locus scan.
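The three constraints can be read as successive filters on the list of reported SNP pairs. The sketch below is one possible way to express them; the record layout (field names such as `adjusted_p` and `pos_a`) is hypothetical, and two choices are our own assumptions: pairs on different chromosomes are treated as satisfying the distance constraint, and C³ is required of both SNPs in a pair.

```python
def apply_constraints(pairs, alpha=0.05, min_distance_bp=1_000_000,
                      main_effect_p=1e-6):
    """Keep the SNP pairs that satisfy C1, C2 and C3 of Table 3."""
    kept = []
    for pair in pairs:
        # C1: Bonferroni-corrected interaction p value below the threshold.
        c1 = pair["adjusted_p"] < alpha
        # C2: SNPs at least 1 Mb apart (assumed satisfied across chromosomes).
        same_chromosome = pair["chr_a"] == pair["chr_b"]
        distance = abs(pair["pos_a"] - pair["pos_b"])
        c2 = (not same_chromosome) or distance >= min_distance_bp
        # C3: neither SNP has a single-locus p value below 1e-6.
        c3 = (pair["single_p_a"] >= main_effect_p
              and pair["single_p_b"] >= main_effect_p)
        if c1 and c2 and c3:
            kept.append(pair)
    return kept


# Made-up records for illustration only.
example_pairs = [
    {"adjusted_p": 0.01, "chr_a": 6, "pos_a": 31_360_000, "chr_b": 6,
     "pos_b": 32_820_000, "single_p_a": 1e-3, "single_p_b": 1e-4},
    {"adjusted_p": 0.01, "chr_a": 6, "pos_a": 31_360_000, "chr_b": 6,
     "pos_b": 31_400_000, "single_p_a": 1e-3, "single_p_b": 1e-4},
    {"adjusted_p": 0.01, "chr_a": 6, "pos_a": 31_360_000, "chr_b": 6,
     "pos_b": 32_820_000, "single_p_a": 1e-9, "single_p_b": 1e-4},
]
selected = apply_constraints(example_pairs)
```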

T1D and RA

The MHC region in chromosome 6 has long been investigated as the most variable region in the human genome with respect to infection, inflammation, autoimmunity, and transplant medicine.²⁶ The recent study conducted by the WTCCC²⁷ has shown that both T1D and RA are strongly associated with the MHC region via single-locus association mapping. The top-left panel of Figure 4 shows that the single-locus association map does not reveal much difference between T1D and RA. In our study, BOOST reports 4499 interactions in the T1D data set (see Table 3), of which 4489 interactions (99.8%) are in the MHC region. Clayton's analysis²⁸ on the T1D data set found that with the exception of strong interactions within the MHC region, interactions are small and have a modest effect on prediction. Our results have verified Clayton's finding from another perspective. As a comparison, BOOST reports 350 interactions in the RA data set, of which 280 interactions (80.0%) are in the MHC region. Our genome-wide interaction map provides evidence that the MHC region is associated with these two diseases in different ways. The bottom panel of Figure 4 gives detailed interaction maps in the MHC region for T1D and RA data. We further calculate composite LD using the method by Zaykin et al.²⁹ The LD map of the MHC region is provided in the top-right panel of Figure 4. These interaction maps, different from the LD map, reveal a distinct pattern difference between T1D and RA. Specifically, there are three subregions in the MHC region: namely, the MHC class I region (29.8 Mb–31.6 Mb), the MHC class III region (31.6 Mb–32.3 Mb), and the MHC class II region (32.3 Mb–33.4 Mb). A closer inspection of the T1D interaction map indicates that strong interaction effects widely exist between genes within and across the three classes, whereas most significant interactions in RA involve only loci closely placed in the MHC class II region. The contrast of the interaction patterns between T1D and RA may explain their different etiologies, which are not revealed by single-locus association mapping.

Interactions without Significant Main Effects Detected in T1D

The mathematical properties of interactions without significant main effects have been discussed in detail.²² The existence of these interactions has been shown in experimental results based on relatively small numbers of SNPs.^5,6 Here, we provide results identified on a genome-wide scale. The MHC region is a highly polymorphic region with a high gene density. Although previous reports^27,30 using the single-locus scan have identified strong associations between MHC genes (such as HLA-DQB1 and HLA-DRB1) and T1D, it is still unclear which and how many loci within the MHC region determine T1D susceptibility. Interactions without significant main effects can provide additional information to help pinpoint disease-associated loci, because SNPs involved in those interactions are usually filtered out in the single-locus scan.

Among the selected 789 interacting pairs in T1D, 91 pairs have nonsignificant loci under the single-locus scan (all of them are listed in Table S6). A careful inspection of these 91 interactions has identified two interesting interaction patterns between the MHC class I and class II regions. One interaction pattern involves the 31350k–31390k region (see Figure 5) and the 32810k–32860k region (see Figure 6) in chromosome 6 (see the Appendix for more results). The interactions between the two regions in these two figures are listed in Table 4. All SNPs in these interactions display weak main effects, whereas their joint effects are statistically significant. The potential pathways involving HLA_B, HLA_DQA2, and PSMB8 are shown in Figure 7. HLA_B, HLA_DQA2, and PSMB8 potentially interact in the antigen-processing and -presentation pathway.^31–34 HLA_B and HLA_DQA2 potentially interact in the type 1 diabetes mellitus pathway.^30,35,36 Nejentsev et al.³⁰ argued that both the MHC class I and II genes should be considered to better understand type 1 diabetes susceptibility; our results provide further evidence that the interaction effects between these two classes may contribute to the etiology of type 1 diabetes.

The 31350k–31390k Region of Chromosome 6

*HLA-B* in the MHC class I is located in this region. The recombination rate and LD plot from HapMap show that a block structure exists from 31360k to 31380k. This region is mapped through the SNPs rs2524057, rs2853934, rs2524115, rs396038, rs3873385, rs2524095, and rs2524089. The SNPs rs2524095 and rs2524089 are involved in the interactions with the 32930k–32960k region shown in Figure S2.

The 32810k–32860k Region of Chromosome 6

*HLA-DQA2* and *HLA-DQB2* in the MHC class II reside in this region. The recombination rate and LD plot from HapMap show that a block structure exists from 32820k to 32847k. This region is mapped through the genotyped SNPs rs9276448, rs5014418, and rs6919798. The ungenotyped SNPs rs9276438 and rs7774954 reside in *HLA-DQA2* and *HLA-DQB2*, respectively. They are in strong LD with those genotyped SNPs.

Table 4.

The Interaction SNP Pairs in the Two Regions Shown in Figure 5 and Figure 6

SNP 1	Single-Locus p Value (SNP 1)	SNP 2	Single-Locus p Value (SNP 2)	BOOST Interaction p Value
rs2524057	4.807 × 10⁻¹	rs9276448	8.878 × 10⁻³	5.362 × 10⁻¹⁴
rs2524057	4.807 × 10⁻¹	rs5014418	1.116 × 10⁻²	2.738 × 10⁻¹³
rs2853934	8.336 × 10⁻²	rs9276448	8.878 × 10⁻³	2.507 × 10⁻¹³
rs2524115	1.215 × 10⁻¹	rs9276448	8.878 × 10⁻³	6.456 × 10⁻¹³
rs3873385	3.368 × 10⁻¹	rs9276448	8.878 × 10⁻³	3.186 × 10⁻¹⁴
rs3873385	3.368 × 10⁻¹	rs5014418	1.116 × 10⁻²	3.841 × 10⁻¹⁴
rs3873385	3.368 × 10⁻¹	rs6919798	6.077 × 10⁻²	4.257 × 10⁻¹³
rs396038	9.939 × 10⁻²	rs9276448	8.878 × 10⁻³	5.894 × 10⁻¹³

The SNPs in the SNP 1 column reside in HLA-B, and the SNPs in the SNP 2 column are located at the block across HLA-DQA2 and HLA-DQB2. They show strong interactions without displaying significant main effects.

Potential Pathways Involving *HLA_B*, *HLA_DQA2*, and *PSMB8*

T1DM represents the type 1 diabetes mellitus pathway. Antigen represents the antigen processing and presentation pathway.

Discussion

Relationship between Our Method and Other Two-Stage Methods

The analysis of GWAS data is a challenging computational problem. To speed up this process, many methods^4,5,11 have been coupled with some prescreening algorithms to reduce the number of SNPs. Most of the currently available screening algorithms are based on single-locus tests and can be finished very quickly. However, for some SNPs with weak main effects but significant interactions, these screening algorithms will filter them out. Our screening method does not have this issue. It uses a fast approximation to evaluate all SNP pairs with the guarantee that significant interactions will not be filtered out no matter whether individual SNPs display main effects or not.

Relationship between Our Method and PLINK

Both BOOST and PLINK use the exhaustive search to find epistatic interactions in GWAS. The key difference between BOOST and PLINK is the way that they test interaction effects:

• PLINK tests interactions based on alleles.⁷ Three genotype categories are collapsed into two allele categories. Correspondingly, 3 × 3 contingency tables are collapsed into 2 × 2 tables. The difference of the odds ratios from the two 2 × 2 tables (one for cases and the other for controls) is used to construct a χ² test with df = 1.
• BOOST tests interactions based on genotypes, using the χ² test with df = 4.

In general, if the underlying interaction could be well characterized by an allele interaction, then the statistical power of PLINK would be higher than that of BOOST. However, the type of underlying interaction is generally unknown and may vary widely.²² BOOST is more flexible because it covers a larger model space than PLINK. BOOST can be modified to test the allelic model by collapsing 3 × 3 contingency tables to 2 × 2 contingency tables (in the same way that PLINK does). The two-stage strategy in BOOST can then be applied to these 2 × 2 contingency tables. The statistical power of the modified BOOST will be roughly the same as that of PLINK because both are based on the same allelic model. The negligible remaining difference is due to the difference between the Wald test and the likelihood ratio test. In the released software of BOOST, the allelic test has also been implemented. The running time of the BOOST allelic test is similar to that of the BOOST genotype test.
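The collapse from genotype tables to allele tables can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of either package: genotypes are assumed to be coded 0, 1, 2 by minor-allele count, each sample is assumed to contribute four allele pairs (the weighting that, to our understanding, the PLINK documentation describes for its fast epistasis test), and the function names `collapse_to_allelic`, `log_odds_ratio`, and `allelic_interaction_z` are hypothetical.

```python
import math


def collapse_to_allelic(geno_table):
    """Collapse a 3 x 3 genotype count table into a 2 x 2 allele count table.

    geno_table[g1][g2] is the number of samples carrying g1 copies of the
    minor allele at SNP 1 and g2 copies at SNP 2 (this coding is an assumption).
    A sample carries (2 - g) major and g minor alleles at each locus, so it
    contributes four allele pairs in total.
    """
    allelic = [[0, 0], [0, 0]]
    for g1 in range(3):
        for g2 in range(3):
            n = geno_table[g1][g2]
            allelic[0][0] += (2 - g1) * (2 - g2) * n  # major / major
            allelic[0][1] += (2 - g1) * g2 * n        # major / minor
            allelic[1][0] += g1 * (2 - g2) * n        # minor / major
            allelic[1][1] += g1 * g2 * n              # minor / minor
    return allelic


def log_odds_ratio(table):
    """Log odds ratio of a 2 x 2 table and its approximate variance.

    All four cells must be positive.
    """
    a, b, c, d = table[0][0], table[0][1], table[1][0], table[1][1]
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def allelic_interaction_z(case_table, control_table):
    """Z statistic comparing the allelic odds ratios of cases and controls.

    Its square is referred to a chi-square distribution with df = 1.
    """
    lor_case, var_case = log_odds_ratio(collapse_to_allelic(case_table))
    lor_ctrl, var_ctrl = log_odds_ratio(collapse_to_allelic(control_table))
    return (lor_case - lor_ctrl) / math.sqrt(var_case + var_ctrl)
```

A double heterozygote contributes one count to each allelic cell, whereas a double major homozygote contributes four counts to a single cell; this is how nine genotype categories reduce to four allele categories.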

Relationship between Our Method and INTERSNP

Recently, INTERSNP³⁷ has implemented the interaction test in GWAS using log-linear models. Regarding the interaction test, both INTERSNP and our work are developed on the basis of the standardized definition using logistic regression models.¹³ INTERSNP directly uses an iterative method to fit the log-linear model M_H, so testing interactions in GWAS with it is still very time consuming. Therefore, INTERSNP suggests the use of some prior knowledge to reduce the number of SNPs, including the single-locus test, genetics criteria, and pathway information. Genetics criteria and pathway information provide biological constraints that are very useful. However, using the single-locus test in the filtering, as discussed in the earlier section, will filter out those SNPs with weak main effects but significant interactions. Moreover, the choice of the filtering threshold is also critical. In contrast, we propose to use a noniterative approximation to directly examine all SNP pairs. We show the computational performance of BOOST and INTERSNP in the following section.

Computation Time

From a practical point of view, a key issue of detecting gene-gene interactions in genome-wide case-control studies is the computational efficiency. Cordell⁴ reported that PLINK took about 14 days to test pairwise interactions of the selected 89,294 SNPs on a single node of a computer cluster. Random Jungle can analyze the large data sets quickly. However, Random Jungle aims at detecting association allowing for interactions rather than detecting interactions (see detailed explanations in the next subsection). Besides, Random Jungle has difficulty in finding interacting SNP pairs displaying weak main effects because trees built in Random Jungle rely on the main effects of SNPs. BEAM took about 8 days to handle 47,727 SNPs using 5 × 10⁷ Markov chain Monte Carlo iterations. Currently, BEAM has difficulties in handling 500,000 to 1,000,000 SNPs genotyped in 5000 or more samples. Cordell⁴ recommended PLINK as a powerful method of testing interactions in GWAS.

We tested the running time of PLINK on our desktop computer. In addition, we also tested INTERSNP on the same data sets because INTERSNP also uses log-linear models to test interactions. The results are shown in Table 5. BOOST is roughly 63 times faster than PLINK and 95 times faster than INTERSNP. It can finish the analysis of all pairs of roughly 360,000 SNPs within 60 hr (around 2.5 days) on a standard desktop (3.0 GHz CPU with 4G memory running the Windows XP Professional x64 edition system). Parallel computing¹² can be used to further improve the computation time for BOOST, PLINK, and INTERSNP. The WTCCC phase 2 study will analyze over 60,000 samples of various diseases using either the Affymetrix v6.0 chip or the Illumina 660K chip. The shared control samples will increase from 3000 to 6000. Such an increase in the number of SNPs and the sample size is more demanding on the computation efficiency. We anticipate that BOOST is still applicable for analyzing the new data sets.

Table 5.

Time Comparison of BOOST, PLINK, and INTERSNP

Data Size	BOOST	PLINK	INTERSNP
n = 5000, L = 1000	< 2s	106s	160s
n = 5000, L = 5000	42s	2703s	4277s
n = 5000, L = 10,000	170s	10,915s	15,805s

PLINK is tested with the “--fast-epistasis” option and without the “--case-only” option. All timings are carried out on a 3.0 GHz CPU with 4G memory running the Windows XP Professional system.
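To put the timings in Table 5 in terms of the number of SNP pairs examined, the following short sketch computes the pair counts, the BOOST throughput, and the timing ratios. The numbers are copied from Table 5; entering the “< 2s” cell as 2 seconds and the helper name `n_pairs` are assumptions of this illustration.

```python
def n_pairs(n_snps):
    """Number of unordered SNP pairs among n_snps SNPs."""
    return n_snps * (n_snps - 1) // 2


# Running times in seconds from Table 5 (n = 5000 samples), keyed by L.
boost_seconds = {1000: 2, 5000: 42, 10000: 170}  # "< 2s" entered as 2 (assumption)
plink_seconds = {1000: 106, 5000: 2703, 10000: 10915}
intersnp_seconds = {1000: 160, 5000: 4277, 10000: 15805}

for n_snps in (5000, 10000):
    pairs = n_pairs(n_snps)
    print(
        f"L = {n_snps}: {pairs} pairs, "
        f"BOOST {pairs / boost_seconds[n_snps]:.0f} pairs/s, "
        f"PLINK/BOOST = {plink_seconds[n_snps] / boost_seconds[n_snps]:.1f}, "
        f"INTERSNP/BOOST = {intersnp_seconds[n_snps] / boost_seconds[n_snps]:.1f}"
    )

# Number of pairs in an exhaustive scan of roughly 360,000 SNPs.
print(n_pairs(360_000))
```

Because the number of pairs grows quadratically with L, a doubling of the number of SNPs roughly quadruples the running time of any exhaustive pairwise method.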

Test of Interactions versus Test of Associations

To test association between a specific SNP X_p and the phenotype Y, a typical method is to test the difference between the deviance of the null model (Equation 19) and the deviance of the alternative model (Equation 20) with df = 2:

log \frac{P (Y = 1)}{P (Y = 2)} = β_{0}

(Equation 19)

log \frac{P (Y = 1 | X_{p} = i)}{P (Y = 2 | X_{p} = i)} = β_{0} + β_{i}^{X_{p}}

(Equation 20)

This is known as a “test of single-SNP association.”

In the above test, SNP X_p is allowed to interact with other SNPs. As a matter of fact, if the disease is influenced by SNP X_p itself and its interaction effect with another SNP X_q, the statistical power of detecting SNP X_p will be increased when allowing for interactions. This is known as a “test of two-locus associations allowing for interactions”⁴. Typically, this is accomplished by testing the difference between the log-likelihood of the null model (Equation 19) and that of the alternative model (Equation 21) with df = 8:

log \frac{P (Y = 1 | X_{p} = i, X_{q} = j)}{P (Y = 2 | X_{p} = i, X_{q} = j)} = β_{0} + β_{i}^{X_{p}} + β_{j}^{X_{q}} + β_{i, j}^{X_{p} X_{q}}

(Equation 21)

Marchini et al.¹¹ highlighted the importance of testing associations allowing for interactions in a genome-wide scale and successfully demonstrated its feasibility. They reported that performing all pairwise tests of associations allowing for interactions with df = 8 at 300,000 loci with 1000 cases and 1000 controls can be finished in 33 hr on a 10-node cluster. According to the equivalence between log-linear models and logistic models, it is clear that the feasibility of this exhaustive search method relies on the closed-form solution of the block independence model M_B and the closed-form solution of the saturated model M_S (see the Appendix for the details of M_B and M_S).

The differences of these tests are:

• The test of single-SNP association is to compare M_P with M_B (see Table 2 for descriptions of M_P and M_B).
• The test of associations allowing for interactions is to compare M_S with M_B.
• The test of interaction is to compare M_S with M_H.

As we mentioned above, no closed-form solution exists for the test of interactions. In this sense, the test of interactions is more difficult than the test of associations allowing for interactions.
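Because both M_S and M_B have closed-form MLEs (Equation 25 and Equation 28), the test of associations allowing for interactions needs no iteration. The following is a minimal sketch of that likelihood ratio test, assuming a 3 × 3 × 2 table indexed as `counts[i][j][k]` with all nine genotype combinations observed; the function names are hypothetical, and the chi-square tail probability uses the standard closed form for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def association_allowing_interaction(counts):
    """Likelihood ratio statistic comparing M_S with M_B (df = 8).

    counts[i][j][k]: number of samples with X_p = i, X_q = j, Y = k
    (the index order is an assumption of this sketch).
    """
    cells = [(i, j, k) for i in range(3) for j in range(3) for k in range(2)]
    n = sum(counts[i][j][k] for i, j, k in cells)
    n_ij = [[sum(counts[i][j]) for j in range(3)] for i in range(3)]
    n_k = [sum(counts[i][j][k] for i in range(3) for j in range(3)) for k in range(2)]
    g2 = 0.0
    for i, j, k in cells:
        observed = counts[i][j][k]          # MLE under M_S (Equation 25)
        expected = n_ij[i][j] * n_k[k] / n  # MLE under M_B (Equation 28)
        if observed > 0:
            g2 += 2.0 * observed * math.log(observed / expected)
    return g2, chi2_sf_even_df(g2, 8)
```

Replacing the M_B fitted values by those of M_H would give the test of interaction, but those fitted values have no closed form, which is the difficulty noted above.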

On Statistical Epistasis

It is extensively debated to what extent statistical epistasis implies biological or functional epistasis.⁴ Statistical epistasis is nevertheless widely used in the literature, perhaps for the following reasons:

• The definition of statistical epistasis yields an appropriate measure for describing the biological phenomenon in which one locus's effect on the phenotype depends on another locus.² This facilitates mathematical analysis of epistasis.
• On the basis of the statistical definition, gene-gene interactions can be connected to the Kullback-Leibler divergence used in information theory (see Equation 12) and to high-order mutual information in physics.¹⁵ This definition may bridge the gap between the biological understanding and the physical interpretation.
• Compositional epistasis, conceived by Bateson, is closer to the biological understanding of gene-gene interactions than statistical epistasis.² Compositional epistasis has recently been shown to be empirically testable via a statistical approach.³⁸ In some cases, compositional and statistical epistasis are equivalent to each other.³⁸ Therefore, statistical epistasis can still provide useful information for biological understanding.

Currently, PLINK, INTERSNP, and BOOST are designed to test statistical epistasis. We realize that detecting statistical epistasis on a genome-wide scale is easier than finding compositional epistasis because the test of compositional epistasis for each SNP pair requires enumerating all possible genetic interaction models.² The detection of compositional epistasis will be investigated in our future work.

Conclusion

The large number of SNPs genotyped in genome-wide case-control studies poses a great computational challenge in the identification of gene-gene interactions. During the last few years, there has been rapidly growing interest in developing and applying computational and statistical approaches to finding gene-gene interactions. In this paper, we present a method named “BOOST” to address this problem. Not only is BOOST computationally efficient, it has also shown good statistical power for a wide spectrum of epistasis models. We have successfully applied our method to analyze seven data sets from the WTCCC. Our experimental results demonstrate that interaction mapping is both computationally and statistically feasible for hundreds of thousands of SNPs genotyped in thousands of samples.

In this work, we focus mainly on genome-wide case-control studies; i.e., the disease phenotype can be represented as a binary variable. At the current stage, our method cannot be applied to GWAS involving continuous phenotypes unless those continuous phenotypes can be discretized. There are two ways to handle covariates in our models. If the covariate is discrete or can be discretized, our method can be directly extended to handle it. If not, logistic regression can be used in the postprocessing step to adjust for the covariate. In the postprocessing step, the computational burden of logistic regression is affordable because the number of selected interactions is limited.

There are some limitations of BOOST with respect to statistical power. BOOST uses a fixed number of degrees of freedom (df = 4) to conduct the genotype test. When the contingency table is too sparse because of a low minor allele frequency, the degrees of freedom of the statistical test should be reduced. To improve the performance of BOOST, we can first use BOOST to report interactions with a loose threshold and then use penalized logistic regression³⁹ with adaptive degrees of freedom to adjust these interactions. There are several other issues that we have not addressed, such as population substructure and imputation of missing genotypes. We will investigate them in our future work.

Acknowledgments

We thank the editor and the anonymous reviewers for their constructive suggestions and comments. This work was partially supported with grant GRF621707 from the Hong Kong Research Grant Council, grants RPC06/07.EG09, RPC07/08.EG25, and RPC10EG04 from the Hong Kong University of Science and Technology, and a grant from Sir Michael and Lady Kadoorie Funded Research Into Cancer Genetics.

Appendix

Log-Linear Models

Here, we briefly describe four log-linear models, including the homogeneous association model M_H, the saturated model M_S, the block independence model M_B, and the partial independence model M_P. These four models are used in the main text. Please see details in Agresti.¹⁴

Homogeneous Association Model M_H

The homogeneous association model M_H factorizes the joint distribution π_ijk using the joint distributions of all pairs. The hypothesis is

H_{0}^{H} : π_{i j k} = ψ_{i j} ϕ_{i k} ω_{j k}

(Equation 22)

where ψ_ij, ϕ_ik and ω_jk are some lower-order distributions. The name “homogeneous association” comes from the fact that the association between any two of three variables is the same at all levels of the third variable.¹⁴

The homogeneous association model M_H is defined as

log μ_{i j k} = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i j}^{X_{p} X_{q}} + λ_{i k}^{X_{p} Y} + λ_{j k}^{X_{q} Y}

(Equation 23)

Unfortunately, no closed-form expression exists for the MLE of μ_ijk (denoted as ${\hat{μ}}_{i j k}^{H}$) in Equation 23. Iterative approaches, such as the Newton-Raphson method, are needed in order to estimate the parameters.
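To make the need for iteration concrete, the sketch below fits M_H numerically. Note that it does not use the Newton-Raphson method named above; it uses iterative proportional fitting (IPF), a different standard iterative scheme for hierarchical log-linear models that repeatedly rescales the fitted values until the three two-way margins match the data. The function name, the `counts[i][j][k]` layout, and the stopping rule are assumptions of this illustration.

```python
def fit_homogeneous_association(counts, max_iter=1000, tol=1e-10):
    """Fit M_H by iterative proportional fitting (IPF).

    counts[i][j][k] is the observed count for X_p = i, X_q = j, Y = k.
    Returns fitted values with the same nesting as counts.
    """
    I, J, K = len(counts), len(counts[0]), len(counts[0][0])
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    observed = {c: float(counts[c[0]][c[1]][c[2]]) for c in cells}

    # For each two-way margin, group the cells that are summed together.
    margin_keys = (
        lambda c: (c[0], c[1]),  # X_p-X_q margin n_{ij.}
        lambda c: (c[0], c[2]),  # X_p-Y margin n_{i.k}
        lambda c: (c[1], c[2]),  # X_q-Y margin n_{.jk}
    )
    margins = []
    for key in margin_keys:
        groups = {}
        for c in cells:
            groups.setdefault(key(c), []).append(c)
        margins.append(list(groups.values()))

    mu = {c: 1.0 for c in cells}  # starting values with no association
    for _ in range(max_iter):
        largest_change = 0.0
        for groups in margins:
            for group in groups:
                target = sum(observed[c] for c in group)
                current = sum(mu[c] for c in group)
                if current == 0.0:
                    continue  # nothing to rescale in an empty margin cell
                for c in group:
                    updated = mu[c] * target / current
                    largest_change = max(largest_change, abs(updated - mu[c]))
                    mu[c] = updated
        if largest_change < tol:
            break
    return [[[mu[(i, j, k)] for k in range(K)] for j in range(J)] for i in range(I)]
```

Each pass touches every cell several times, and several passes are usually required, which illustrates why fitting M_H for every SNP pair is far more expensive than evaluating a closed-form expression.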

Saturated Model M_S

The saturated model M_S defines the joint distribution with all factors. The saturated log-linear model is

log μ_{i j k} = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i j}^{X_{p} X_{q}} + λ_{i k}^{X_{p} Y} + λ_{j k}^{X_{q} Y} + λ_{i j k}^{X_{p} X_{q} Y}

(Equation 24)

The MLE of μ_ijk in Equation 24 is

{\hat{μ}}_{i j k}^{S} = n_{i j k}

(Equation 25)

Block Independence Model M_B

When the joint distribution cannot be completely factorized, it may be factorized into blocks. The hypothesis is

H_{0}^{B} : π_{i j k} = π_{i j .} π_{.. k}

(Equation 26)

The corresponding log-linear model is

log μ_{i j k} = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i j}^{X_{p} X_{q}}

(Equation 27)

Under this structure, the MLE of μ_ijk is

{\hat{μ}}_{i j k}^{B} = \frac{n_{i j .} n_{.. k}}{n}

(Equation 28)

Partial Independence Model M_P

The joint distribution may be factorized when some variables are given. For example, given Y, the hypothesis is

H_{0}^{P} : π_{i j k} = \frac{π_{i . k} π_{. j k}}{π_{.. k}}

(Equation 29)

The corresponding log-linear model is

log μ_{i j k} = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i k}^{X_{p} Y} + λ_{j k}^{X_{q} Y}

(Equation 30)

Then the MLE of μ_ijk is

{\hat{μ}}_{i j k}^{P} = \frac{n_{i . k} n_{. j k}}{n_{.. k}}

(Equation 31)

Connection between Log-Linear Models and Logistic Models

For convenience, we use the homogeneous association model M_H as an example to describe the equivalence between a log-linear model and its corresponding logistic model. Its logit is

\begin{array}{l} log \frac{P (Y = 1 | X_{p} = i, X_{q} = j)}{P (Y = 2 | X_{p} = i, X_{q} = j)} \\ = log \frac{μ_{i j 1}}{μ_{i j 2}} \\ = log (μ_{i j 1}) - log (μ_{i j 2}) \\ = (λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{1}^{Y} + λ_{i j}^{X_{p} X_{q}} + λ_{i 1}^{X_{p} Y} + λ_{j 1}^{X_{q} Y}) \\ - (λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{2}^{Y} + λ_{i j}^{X_{p} X_{q}} + λ_{i 2}^{X_{p} Y} + λ_{j 2}^{X_{q} Y}) \\ = (λ_{1}^{Y} - λ_{2}^{Y}) + (λ_{i 1}^{X_{p} Y} - λ_{i 2}^{X_{p} Y}) + (λ_{j 1}^{X_{q} Y} - λ_{j 2}^{X_{q} Y}) \end{array}

(Equation 32)

The first term is a constant that does not depend on i or j. The second term depends only on the category i of X_p. The third term depends only on the category j of X_q. Therefore, this logit has the following form:

\begin{array}{l} log \frac{P (Y = 1 | X_{p} = i, X_{q} = j)}{P (Y = 2 | X_{p} = i, X_{q} = j)} \\ = (λ_{1}^{Y} - λ_{2}^{Y}) + (λ_{i 1}^{X_{p} Y} - λ_{i 2}^{X_{p} Y}) + (λ_{j 1}^{X_{q} Y} - λ_{j 2}^{X_{q} Y}) \\ = β_{0} + β_{i}^{X_{p}} + β_{j}^{X_{q}} \end{array}

(Equation 33)

Clearly, this is equivalent to the logistic model with only main effect terms defined in Equation 1. Using the similar inference mentioned above, it is straightforward to find the connection between the saturated model M_S and the full logistic regression model defined in Equation 2.

Proof of ${\hat{L}}_{S} - {\hat{L}}_{H} \leq {\hat{L}}_{S} - {\hat{L}}_{K S A}$

To show this, we need only to show ${\hat{L}}_{H} \geq {\hat{L}}_{K S A}$. By Equation 4 and Equation 13, we have

{\hat{μ}}_{i j k}^{K} = n \cdot {\hat{p}}_{i j k}^{K} = \frac{n}{η} \frac{π_{i j .} π_{i . k} π_{. j k}}{π_{i ..} π_{. j .} π_{.. k}}

(Equation 34)

Taking the logarithm on both sides of Equation 34 yields

\begin{array}{l} log {\hat{μ}}_{i j k}^{K} = (log n - log η) - log π_{i ..} - log π_{. j .} - log π_{.. k} \\ + log π_{i j .} + log π_{i . k} + log π_{. j k} \\ = λ + λ_{i}^{X_{p}} + λ_{j}^{X_{q}} + λ_{k}^{Y} + λ_{i j}^{X_{p} X_{q}} + λ_{i k}^{X_{p} Y} \\ + λ_{j k}^{X_{q} Y} \end{array}

(Equation 35)

where

\begin{array}{l} λ = log n - log η, \\ λ_{i}^{X_{p}} = - log π_{i ..}, λ_{j}^{X_{q}} = - log π_{. j .}, λ_{k}^{Y} = - log π_{.. k}, \\ λ_{i j}^{X_{p} X_{q}} = log π_{i j .}, λ_{i k}^{X_{p} Y} = log π_{i . k}, λ_{j k}^{X_{q} Y} = log π_{. j k} \end{array}

(Equation 36)

This shows that the KSA model can be written in the form of Equation 23. For any model with this structure, we have shown that the log-likelihood L_H evaluated at its MLE ${\hat{μ}}_{i j k}^{H}$ achieves its maximum ${\hat{L}}_{H}$ in Equation 9. Therefore, we have

{\hat{L}}_{H} = L_{H} ({\hat{μ}}_{i j k}^{H}) = \max_{μ_{i j k}} L_{H} (μ_{i j k}) \geq L_{H} ({\hat{μ}}_{i j k}^{K}) = {\hat{L}}_{K S A}

(Equation 37)
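The KSA-based quantity on the right-hand side of the inequality can be evaluated directly from the margins of the observed table, as the following minimal sketch shows. It assumes a `counts[i][j][k]` layout, positive one-way and two-way margins, and that the log-likelihood difference reduces to Σ n_ijk log(π̂_ijk / p̂_ijk^K) with π̂_ijk = n_ijk / n; the function name `ksa_statistic` is hypothetical.

```python
import math


def ksa_statistic(counts):
    """Return 2 * (L_S - L_KSA) computed without any iteration.

    counts[i][j][k]: observed count for X_p = i, X_q = j, Y = k.
    Assumes every one-way and two-way margin is positive.
    """
    I, J, K = len(counts), len(counts[0]), len(counts[0][0])
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    n = float(sum(counts[i][j][k] for i, j, k in cells))
    pi = [[[counts[i][j][k] / n for k in range(K)] for j in range(J)] for i in range(I)]

    # Two-way and one-way margins of the empirical distribution.
    pi_ij = [[sum(pi[i][j]) for j in range(J)] for i in range(I)]
    pi_ik = [[sum(pi[i][j][k] for j in range(J)) for k in range(K)] for i in range(I)]
    pi_jk = [[sum(pi[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]
    pi_i = [sum(row) for row in pi_ij]
    pi_j = [sum(pi_ij[i][j] for i in range(I)) for j in range(J)]
    pi_k = [sum(pi_ik[i][k] for i in range(I)) for k in range(K)]

    # Equation 34 before normalisation; eta is the normalising constant.
    raw = {
        (i, j, k): pi_ij[i][j] * pi_ik[i][k] * pi_jk[j][k] / (pi_i[i] * pi_j[j] * pi_k[k])
        for i, j, k in cells
    }
    eta = sum(raw.values())

    stat = 0.0
    for i, j, k in cells:
        if counts[i][j][k] > 0:
            stat += 2.0 * counts[i][j][k] * math.log(pi[i][j][k] * eta / raw[(i, j, k)])
    return stat
```

By the inequality proved above, this value is never smaller than the exact likelihood ratio statistic 2(L̂_S − L̂_H), so a pair whose exact statistic exceeds a threshold also exceeds that threshold here.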

Boolean Representation and Operation of Genotype Data

For a data set containing L SNPs genotyped from n samples, an $L \times n$ matrix W is usually used to store the data, where each row represents genotype data for one specific SNP and each column represents one sample. A toy example including three SNPs genotyped from 16 samples is illustrated below, where the first eight columns in W (denoted as U_i) represent control samples and the others represent case samples (denoted as D_i).

W = \begin{array}{c|cccccccc|cccccccc}
 & U_{1} & U_{2} & U_{3} & U_{4} & U_{5} & U_{6} & U_{7} & U_{8} & D_{1} & D_{2} & D_{3} & D_{4} & D_{5} & D_{6} & D_{7} & D_{8} \\
\hline
X_{1} & 1 & 3 & \underline{2} & \underline{2} & 3 & 3 & 1 & 1 & \underline{2} & \underline{2} & 3 & 3 & 1 & \underline{2} & 1 & 1 \\
X_{2} & 3 & 2 & 2 & 2 & 1 & 1 & 1 & 2 & 3 & 3 & 1 & 1 & 1 & 2 & 1 & 2 \\
X_{3} & 1 & 1 & 3 & 3 & 2 & 2 & 2 & 1 & 1 & 2 & 2 & 2 & 1 & 1 & 1 & 3
\end{array}

To evaluate the interaction effect between SNP p and SNP q, we need two rows (X_p, X_q) in W to collect the contingency table. It is very time consuming to collect contingency tables for all SNP pairs in a genome-wide case-control study, because hundreds of billions of SNP pairs exist for typical genotyping chips.

In our method, we introduce a Boolean representation of genotype data. Instead of using one row for each SNP, the new representation uses three rows, with each row for one specific genotype. Each row consists of two bit strings, one for control samples and the other for case samples. Each bit in the string represents one sample, and its value (0 or 1) indicates whether the sample has the corresponding genotype. For the above toy example, the corresponding Boolean representation is as follows:

W_{bit} = \begin{array}{l|c|c}
 & \text{Control} & \text{Case} \\
\hline
X_{1} = 1 & 10000011 & 00001011 \\
X_{1} = 2 & 00\underline{11}0000 & \underline{11}000\underline{1}00 \\
X_{1} = 3 & 01001100 & 00110000 \\
X_{2} = 1 & 00001110 & 00111010 \\
X_{2} = 2 & 01110001 & 00000101 \\
X_{2} = 3 & 10000000 & 11000000 \\
X_{3} = 1 & 11000001 & 10001110 \\
X_{3} = 2 & 00001110 & 01110000 \\
X_{3} = 3 & 00110000 & 00000001
\end{array}

Both W and W_bit contain the same amount of information. To demonstrate this equivalence, we underline some matched items between W and W_bit. For example, the five 2's in the first row of W are represented as five 1's in the second row of W_bit. Although the dimension of W_bit is three times as large as that of W, its space usage in the computer is smaller because each byte can store 8 bits. For a data set with 4000 samples and 500,000 SNPs (about the same size as the WTCCC data set), the new data representation needs around 700M bytes, whereas the general data representation requires 1900M bytes. More importantly, using W_bit is more CPU efficient than using W in collecting the contingency table (Table 1). This is because we can directly carry out the fast logic (bitwise) operation with W_bit. For example, to collect n₁₂₁ in Table 1 (n₁₂₁ represents the number of cases with X_p = 1 and X_q = 2), we just need to conduct the logical AND operation on the case bit strings of rows X_p = 1 and X_q = 2 and then count the number of 1's in the result. A 64-bit register can perform a 64-bit AND operation in one instruction, and the counting of “1” bits in a bit string (also called the Hamming weight) can be accomplished with an efficient algorithm (see http://en.wikipedia.org/wiki/Hamming_weight).
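The AND-and-count step can be illustrated on the toy example with a few lines of code. This is only a sketch of the idea and not the released BOOST implementation: it stores each bit string in an arbitrary-precision Python integer rather than in 64-bit words, and the helper names `to_bits`, `popcount`, `boolean_rows`, and `pair_table` are hypothetical.

```python
def to_bits(genotypes, value):
    """Pack an indicator row into an int (the first sample is the highest bit)."""
    bits = 0
    for g in genotypes:
        bits = (bits << 1) | (1 if g == value else 0)
    return bits


def popcount(x):
    """Hamming weight of a non-negative integer."""
    return bin(x).count("1")


def boolean_rows(snp, n_controls):
    """Return {genotype: (control_bits, case_bits)} for genotypes 1, 2, 3."""
    return {
        g: (to_bits(snp[:n_controls], g), to_bits(snp[n_controls:], g))
        for g in (1, 2, 3)
    }


def pair_table(rows_p, rows_q):
    """Genotype-pair counts for controls and cases via AND plus popcount."""
    table = {}
    for gp in (1, 2, 3):
        for gq in (1, 2, 3):
            table[(gp, gq)] = {
                "control": popcount(rows_p[gp][0] & rows_q[gq][0]),
                "case": popcount(rows_p[gp][1] & rows_q[gq][1]),
            }
    return table


# Rows X_1 and X_2 of the toy matrix W: eight controls followed by eight cases.
X1 = [1, 3, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 2, 1, 1]
X2 = [3, 2, 2, 2, 1, 1, 1, 2, 3, 3, 1, 1, 1, 2, 1, 2]
N_CONTROLS = 8

rows1 = boolean_rows(X1, N_CONTROLS)
rows2 = boolean_rows(X2, N_CONTROLS)
table = pair_table(rows1, rows2)

print(format(rows1[2][1], "08b"))  # case bit string of row X_1 = 2: 11000100
print(table[(1, 2)]["case"])       # cases with X_1 = 1 and X_2 = 2: 1
```

Here the case strings 00001011 (X_1 = 1) and 00000101 (X_2 = 2) give 00000001 under AND, so exactly one case sample (D_8) carries that genotype combination. In a compiled implementation, each AND would operate on one 64-bit word, that is, on 64 samples at a time.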

Genetic Heterogeneity Simulation

The simulation models are chosen on the basis of the performance of BOOST and PLINK in case 2. For each setting of h² and MAF, there are five models. We choose the one under which BOOST and PLINK perform best (i.e., have the highest statistical power). For example, both BOOST and PLINK have the best performance on model epi33 among models epi31–epi35 (with the same setting of h² = 0.05 and MAF = 0.2). Therefore, for this setting of h² and MAF, we select model epi33. The reason for doing so is to make sure that both BOOST and PLINK have reasonably good performance when genetic heterogeneity is absent. Then we can observe how genetic heterogeneity degrades their performance. All selected models are given in Table S5. In the simulation, 100 data sets are generated under each model setting. In each data set, 1000 SNPs are simulated. Different sample sizes (n = 400, 800, and 1600) are simulated. To simulate genetic heterogeneity, 50% of the case samples are generated at loci X₁ and X₂ and the other 50% of the case samples are generated at loci X₃ and X₄. The distribution of case samples is based on a specific disease model given in Table S6. Each data set has two pairs of associated SNPs. Therefore, there are 200 pairs of SNPs for each parameter setting. We set the counter T to be zero initially. If one pair of these 200 pairs is detected (on the basis of the Bonferroni correction), then T = T + 1. After testing 100 data sets, the power is calculated as T/200.
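The power calculation described above amounts to a simple count, sketched below. The sketch assumes that the Bonferroni correction is taken over all pairs of the 1000 simulated SNPs at a family-wise level of 0.05 and that a pair counts as detected when its interaction p value falls below the corrected threshold; the function names are hypothetical.

```python
def bonferroni_threshold(n_snps, alpha=0.05):
    """Per-pair significance threshold when all SNP pairs are tested."""
    return alpha / (n_snps * (n_snps - 1) // 2)


def estimate_power(p_values_per_dataset, n_snps=1000, alpha=0.05):
    """Fraction of truly associated SNP pairs that are detected.

    p_values_per_dataset holds one entry per simulated data set; each entry
    contains the interaction p values of the two associated SNP pairs.
    """
    threshold = bonferroni_threshold(n_snps, alpha)
    detected = 0  # the counter T in the text
    total = 0     # 200 when 100 data sets with two pairs each are supplied
    for pair_p_values in p_values_per_dataset:
        for p in pair_p_values:
            total += 1
            if p < threshold:
                detected += 1
    return detected / total
```

For example, `estimate_power([(1e-9, 0.5), (1e-10, 1e-12)])` evaluates two data sets in which three of the four associated pairs pass the threshold of 0.05/499,500.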

Quality Control

We first check the quality of control samples:

• Those genotype data with a Chiamo score²⁷ < 0.95 are considered as missing data. SNPs with more than 10% missing data are removed.
• Those SNPs with a minor allele frequency < 0.05 are removed.
• We also perform the Hardy-Weinberg Equilibrium (HWE) test for each SNP. Those SNPs with a p value ≤ 0.001 are removed.

Next, we check the quality of case samples. The strategy is similar to that for control samples except that the HWE test is not performed. The number of remaining SNPs is given in Table S1.
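The three filters can be expressed as a single predicate per SNP, as in the sketch below. Several details are assumptions of this illustration rather than statements about our pipeline: genotype calls with a Chiamo score below 0.95 are taken to have been set to missing beforehand, the HWE test is implemented as a one-degree-of-freedom Pearson chi-square test (an exact test could be used instead), and the function names are hypothetical.

```python
import math


def hwe_p_value(n_aa, n_ab, n_bb):
    """Pearson chi-square (df = 1) test of Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For df = 1, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(n_aa, n_ab, n_bb, n_missing, check_hwe=True):
    """Apply the missing-rate, MAF, and (optionally) HWE filters to one SNP."""
    n_called = n_aa + n_ab + n_bb
    n_total = n_called + n_missing
    if n_called == 0:
        return False
    if n_missing / n_total > 0.10:  # more than 10% missing data
        return False
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    if min(freq_a, 1 - freq_a) < 0.05:  # minor allele frequency below 0.05
        return False
    if check_hwe and hwe_p_value(n_aa, n_ab, n_bb) <= 0.001:
        return False
    return True
```

Control samples would be screened with `check_hwe=True` and case samples with `check_hwe=False`, in line with the strategy described above.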

More Results of T1D Data Analysis

We have identified 91 interactions in which all loci are nonsignificant in the single-locus scan. These 91 interactions show two interesting interaction patterns between MHC class I and class II. We have shown one pattern in the main article. We have also identified another interaction pattern in chromosome 6 in the 31350k–31390k region (shown in Figure S1) and the 32930k–32960k region (shown in Figure S2). The six interactions between these two regions are listed in Table S2. It can be observed again that all SNPs in this table display weak main effects whereas their joint effects are statistically significant. We further report the odds ratios for those interactions in Table S3 and Table S4. For the first interaction group given in Table S3, the genotype combinations Aa/Bb, Aa/bb, aa/Bb, and aa/bb, where the uppercase and lowercase letters represent the major alleles and minor alleles, respectively, have significantly higher disease risks than others. The interaction effect of these genotypes can generally approximate the multiplicative model (see the left panel of Figure S3). For the second interaction group given in Table S4, the genotype combination aa/bb has a significantly higher disease risk than others. The interaction effect of this genotype is considered as a joint recessive effect (see the right panel of Figure S3).

Supplemental Data

Document S1. Eight Figures and 14 Tables

Web Resources

The URL for data presented herein is as follows:

BOOST software, http://bioinformatics.ust.hk/BOOST.html

References

1.Bateson W., Mendel G. Cambridge University Press; Cambridge: 1909. Mendel's Principles of Heredity. [Google Scholar]
2.Phillips P.C. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 2008;9:855–867. doi: 10.1038/nrg2452. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Fisher R.A. The correlations between relatives on the supposition of mendelian inheritance. Philosophical Transactions of the Royal Society of Edinburgh. 1918;52:399–433. [Google Scholar]
4.Cordell H.J. Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet. 2009;10:392–404. doi: 10.1038/nrg2579. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Ritchie M.D., Hahn L.W., Roodi N., Bailey L.R., Dupont W.D., Parl F.F., Moore J.H. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. 2001;69:138–147. doi: 10.1086/321276. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Nelson M.R., Kardia S.L., Ferrell R.E., Sing C.F. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001;11:458–470. doi: 10.1101/gr.172901. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Moore J., White B. Tuning reliefF for genomewide genetic analysis. Lect. Notes Comput. Sci. 2007;4447:166–175. [Google Scholar]
9.Schwarz D., Kónig I., Ziegler A. On safari to random jungle: A fast implementation of random forests for high dimensional data. Bioinformatics. 2010;26:1752–1758. doi: 10.1093/bioinformatics/btq257. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Zhang Y., Liu J.S. Bayesian inference of epistatic interactions in case-control studies. Nat. Genet. 2007;39:1167–1173. doi: 10.1038/ng2110. [DOI] [PubMed] [Google Scholar]
11.Marchini J., Donnelly P., Cardon L.R. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 2005;37:413–417. doi: 10.1038/ng1537. [DOI] [PubMed] [Google Scholar]
12.Ma L., Runesha H., Dvorkin D., Garbe J., Da Y. Parallel and serial computing tools for testing single-locus and epistatic SNP effects of quantitative traits in genome-wide association studies. BMC Bioinformatics. 2009;9:315. doi: 10.1186/1471-2105-9-315. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Cordell H.J. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum. Mol. Genet. 2002;11:2463–2468. doi: 10.1093/hmg/11.20.2463.
14. Agresti A. Categorical Data Analysis, Second Edition. Wiley Series in Probability and Statistics. Wiley and Sons; 2002.
15. Matsuda H. Physical nature of higher-order mutual information: Intrinsic correlations and frustration. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Topics. 2000;62:3096–3102. doi: 10.1103/physreve.62.3096.
16. Li J., Chen Y. Generating samples for association studies based on HapMap data. BMC Bioinformatics. 2008;9:44. doi: 10.1186/1471-2105-9-44.
17. Neuman R.J., Rice J.P. Two-locus models of disease. Genet. Epidemiol. 1992;9:347–365. doi: 10.1002/gepi.1370090506.
18. Levy J., Nagylaki T. A model for the genetics of handedness. Genetics. 1972;72:117–128. doi: 10.1093/genetics/72.1.117.
19. Lerner I. Heredity, Evolution, and Society. W.H. Freeman; San Francisco: 1968.
20. Li W., Reich J. A complete enumeration and classification of two-locus disease models. Hum. Hered. 2000;50:334–349. doi: 10.1159/000022939.
21. Frankel W.N., Schork N.J. Who's afraid of epistasis? Nat. Genet. 1996;14:371–373. doi: 10.1038/ng1296-371.
22. Culverhouse R., Suarez B.K., Lin J., Reich T. A perspective on epistasis: limits of models displaying no main effect. Am. J. Hum. Genet. 2002;70:461–471. doi: 10.1086/338759.
23. Velez D.R., White B.C., Motsinger A.A., Bush W.S., Ritchie M.D., Williams S.M., Moore J.H. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet. Epidemiol. 2007;31:306–315. doi: 10.1002/gepi.20211.
24. McClellan J., King M.C. Genetic heterogeneity in human disease. Cell. 2010;141:210–217. doi: 10.1016/j.cell.2010.03.032.
25. Dudek S., Motsinger A., Velez D., Williams S., Ritchie M. Data simulation software for whole-genome association and other studies in human genetics. Pacific Symposium on Biocomputing. 2006:499–510.
26. Lechler R., Warrens A. HLA in health and disease. Academic Press; 2000.
27. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
28. Clayton D.G. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet. 2009;5:e1000540. doi: 10.1371/journal.pgen.1000540.
29. Zaykin D.V., Meng Z., Ehm M.G. Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am. J. Hum. Genet. 2006;78:737–746. doi: 10.1086/503710.
30. Nejentsev S., Howson J.M., Walker N.M., Szeszko J., Field S.F., Stevens H.E., Reynolds P., Hardy M., King E., Masters J., Wellcome Trust Case Control Consortium. Localization of type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A. Nature. 2007;450:887–892. doi: 10.1038/nature06406.
31. Brown M.G., Driscoll J., Monaco J.J. Structural and serological similarity of MHC-linked LMP and proteasome (multicatalytic proteinase) complexes. Nature. 1991;353:355–357. doi: 10.1038/353355a0.
32. Ortiz-Navarrete V., Seelig A., Gernold M., Frentzel S., Kloetzel P.M., Hämmerling G.J. Subunit of the ‘20S’ proteasome (multicatalytic proteinase) encoded by the major histocompatibility complex. Nature. 1991;353:662–664. doi: 10.1038/353662a0.
33. Villadangos J.A. Presentation of antigens by MHC class II molecules: getting the most out of them. Mol. Immunol. 2001;38:329–346. doi: 10.1016/s0161-5890(01)00069-4.
34. Rocha N., Neefjes J. MHC class II molecules on the move for successful antigen presentation. EMBO J. 2008;27:1–5. doi: 10.1038/sj.emboj.7601945.
35. Howson J.M., Walker N.M., Clayton D., Todd J.A., Type 1 Diabetes Genetics Consortium. Confirmation of HLA class II independent type 1 diabetes associations in the major histocompatibility complex including HLA-B and HLA-A. Diabetes Obes. Metab. 2009;Suppl 1:31–45. doi: 10.1111/j.1463-1326.2008.01001.x.
36. Husain Z., Kelly M.A., Eisenbarth G.S., Pugliese A., Awdeh Z.L., Larsen C.E., Alper C.A. The MHC type 1 diabetes susceptibility gene is centromeric to HLA-DQB1. J. Autoimmun. 2008;30:266–272. doi: 10.1016/j.jaut.2007.10.006.
37. Herold C., Steffens M., Brockschmidt F.F., Baur M.P., Becker T. INTERSNP: genome-wide interaction analysis guided by a priori information. Bioinformatics. 2009;25:3275–3281. doi: 10.1093/bioinformatics/btp596.
38. VanderWeele T.J. Epistatic interactions. Statistical Applications in Genetics and Molecular Biology. 2010;9. doi: 10.2202/1544-6115.1517.
39. Park M.Y., Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9:30–50. doi: 10.1093/biostatistics/kxm010.

Associated Data

Supplementary Materials

Document S1. Eight Figures and 14 Tables

mmc1.pdf (627.2 KB, PDF)
