Author manuscript; available in PMC: 2011 Oct 28.
Published in final edited form as: Genet Res (Camb). 2010 Dec;92(5-6):443–459. doi: 10.1017/S0016672310000595

Statistical Analysis of Genetic Interactions

Nengjun Yi
PMCID: PMC3203544  NIHMSID: NIHMS331116  PMID: 21429274

Abstract

Many common human diseases and complex traits are highly heritable and influenced by multiple genetic and environmental factors. Although genome-wide association studies (GWAS) have successfully identified many disease-associated variants, these genetic variants explain only a small proportion of the heritability of most complex diseases. Genetic interactions (gene-gene and gene-environment) substantially contribute to complex traits and diseases and could be one of the main sources of the missing heritability. This paper provides an overview of the available statistical methods and related computer software for identifying genetic interactions in animal and plant experimental crosses and human genetic association studies. The main discussion falls under the three broad issues in statistical analysis of genetic interactions: the definition, detection and interpretation of genetic interactions. Recently developed methods based on modern techniques for high-dimensional data are reviewed, including penalized likelihood approaches and hierarchical models; the relationships among these methods are also discussed. I conclude this review by highlighting some areas of future research.

Keywords: Bayesian methods, Complex traits, Epistasis, Gene-environment interactions, Genetic association, High-dimensionality, Hierarchical models, Penalized likelihood, Quantitative trait loci

Introduction

Many common human diseases and complex traits are highly heritable and are believed to be influenced by multiple genetic and environmental factors. A central goal of genetics, evolutionary biology and epidemiology is to identify genetic and environmental factors that influence complex traits and diseases, and to characterize the effects of these factors and their interactions (Lynch & Walsh, 1998; Thomas, 2004). Genetic interactions (gene-gene and gene-environment interactions) have long been recognized as an important component of the genetic architecture of complex traits and diseases and are fundamentally important to understanding the genetics of complex traits and diseases (Flint & Mackay, 2009; Mackay, 2001; Mackay et al., 2009; Moore, 2003).

There is a long history of the investigation of genetic interactions in inbred plant and animal experimental crosses (Carlborg & Haley, 2004) and human populations (Cordell, 2009; Thomas, 2010). Recent advances in genome-wide association studies (GWAS) have provided unparalleled opportunities for studying the genetic architecture of complex diseases (Hardy & Singleton, 2009). In the past few years, these studies have identified many genetic variants associated with complex diseases (Hindorff et al., 2009; WTCCC, 2007). However, the main effects of the identified variants explain only a small proportion of the heritability of most complex diseases, motivating research interest in finding the remaining, ‘missing’ heritability (Manolio et al., 2009). Since GWAS have not fully investigated interactions, it has been speculated that gene-gene and gene-environment interactions could be one of the potential sources of the missing heritability; this further boosts the investigation of genetic interactions (Cantor et al., 2010; Cordell, 2009; Eichler et al., 2010; Manolio et al., 2009; Thomas, 2010).

Here, I review the statistical methods and related computer software that are currently being used to identify genetic interactions for complex traits in animal and plant experimental crosses and human population-based association studies. The discussion covers the three broad issues in statistical analysis of genetic interactions, namely, the definition, the detection and the interpretation of genetic interactions. Great advances in all three of these related topics have been made in the past decades (Cordell, 2009; Thomas, 2010), and many of these are reviewed. All the methods discussed can be used in targeted genetic studies with moderate numbers of variants (for example, from hypothesis-driven candidate-gene or pathway-based studies), and some can be applied to large-scale genetic studies with large numbers of variants (for example, from GWAS).

One of the challenges in statistical analysis of genetic interactions is that genetic interaction is not uniquely defined. I first describe the general definition and meaning of interaction and then introduce the commonly used models that define an interaction term as a product of main-effect variables. I also discuss the issue that any statement about interaction is necessarily scale and model dependent, and outline the general principles for analyzing interactions. The detection of genetic interactions involves two issues, modeling and computational methods, and can be viewed as a problem of high-dimensional data analysis. The development of statistical methods for high-dimensional data analysis has recently become one of the most important and active areas in statistics (Hesterberg et al., 2008). Recently developed methods based on modern statistical techniques are mainly explored, including penalized likelihood approaches and hierarchical models, and the relationships among these methods are also discussed. The interpretation of genetic interactions has not been extensively discussed in the literature. The discussion is confined to key interpretations. Finally, I highlight some emerging directions and needs for making further progress.

Notation and Challenges in Analyzing Genetic Interactions

We consider QTL (quantitative trait locus) mapping in experimental crosses from inbred animals or plants and population-based genetic association studies in humans. These two types of studies have the same observed data structure, and thus the statistical methods can be fairly similar, although each has special problems. For each individual in the sample, observed data consist of a complex trait Y, a number of genetic markers G = (g1, g2, ···, gm), and some environmental factors E = (z1, z2, ···, zk), where m and k represent the numbers of markers and environmental factors, respectively. The trait phenotype Y can be continuous (e.g., body weight) or discrete (e.g., a binary disease indicator, counts). We consider experimental crosses (e.g., F2 intercross) or markers (e.g., single-nucleotide polymorphisms (SNPs)) that segregate three distinct genotypes. Therefore, each genotype variable gs is a three-level factor, indicating homozygous for the more common allele, heterozygous and homozygous for the minor allele, respectively. The genotyped markers can be densely distributed either across the entire genome or within some candidate genes, and in either case the number of markers can be large.

Our goal is to identify genomic loci that are associated with the complex trait, and to characterize their genetic effects. Since most complex traits and diseases are caused by interacting networks of multiple genetic and environmental factors, it is desirable to simultaneously consider multiple loci and environmental factors, and include gene-gene (epistatic) and gene-environment interactions in the model. Such joint analyses would improve the power for detection of causal effects and hence lead to increased understanding of the genetic architecture of diseases. There are considerable challenges, however, in performing statistical analysis of genetic interactions:

  • One has limited understanding of what the word “interaction” means because it has no unique and explicit definition. Different definitions have different properties and lead to different statistical models and interpretations.

  • With multiple genetic and environmental factors, there are many possible main effects and interactions, most of which are likely to be zero or at least negligible, leading to high-dimensional models and overfitting problems.

  • There are many more potential interactions than main effects, which would require different modeling for main effects and interactions.

  • Due to linkage disequilibrium, many genetic factors are highly correlated and nearly collinear, creating the difficulty of distinguishing disease-associated variants from others.

  • Frequencies of multi-locus genotypes that define interactions can be very low, which creates variables with near-zero variance and thus requires special parameterization.

  • The discreteness of genotype data can cause a separate identifiability problem, called separation, for discrete traits. Separation arises when a predictor or a linear combination of predictors is completely aligned with the outcome and can yield nonidentified models (that is, have parameters that cannot be estimated).

These problems necessitate sophisticated techniques in all the steps of modeling, computation and interpretation for analyzing genetic interactions. Some methods have been developed recently to overcome these problems and will be discussed in the following sections.

Definition of Genetic Interaction

The term “interaction” generally refers to a phenomenon whereby two or more variables jointly affect the outcome response. In order to analyze and interpret interactions, it is important to understand how interactions are defined. In this section, I first discuss the general definition and meaning of statistical interactions, and then show how they can be made more concrete in the case of genetic analysis. We return to the issue of biological interpretation of statistical interaction later in the article.

General definition of statistical interaction

As introduced earlier, the goal of QTL and association analysis is to investigate the relationship between the complex trait Y and the genetic and environmental factors, G = (g1, g2, ···, gm) and E = (z1, z2, ···, zk). For a normally distributed trait, this can be expressed as a statistical model

$$E(Y) = \eta(G; E) = \eta(g_1, g_2, \cdots, g_m; z_1, z_2, \cdots, z_k), \tag{1}$$

with the normal distribution assumption about the response variable Y, where E(·) is the expectation, and η(·) represents a generally unknown function that relates the genetic and environmental factors to the expectation of Y.

With multiple genetic and environmental factors, even if we restrict our attention for simplicity to two-factor interactions, three different kinds have to be considered: (a) gene × gene (G×G), (b) gene × environment (G×E), (c) environment × environment (E×E). We do not discuss E×E interactions because they can be included in the model as covariates. While the formal definitions of G×G and G×E interactions are similar, their interpretations are rather different. I will briefly discuss their differences below.

With just two genetic factors g1 and g2, if the function of two factors η(g1, g2) can be replaced by the simpler form of two functions of one variable, i.e.,

$$\eta(g_1, g_2) = \eta_1(g_1) + \eta_2(g_2), \tag{2}$$

then there is no interaction between g1 and g2 (Cox, 1984). This implies that the genotypic effect of locus g1 (g2) does not depend on the genotypes of g2 (g1). Therefore, these two genetic factors act in a way that appears causally independent. For G×E interactions, the condition of independence, η(g1, z1) = η1(g1) + f1(z1), appears identical to the above one, but the interpretation is quite different. Here the concern is with stability of the genotypic effect of g1 as the environmental condition z1 varies. In genetic mapping, the environmental effect f1(z1) itself is of no direct interest, but can be an important component to control the potential confounding effect.

Conversely, failure of condition (2) indicates an interaction between the two factors. In that case a change in the response due to a change in g1 (g2) does depend on the level of g2 (g1). However, any deviation from the independence condition (2) could be specified in various ways, leading to different types of interaction that may require different methods to identify. I here discuss the most commonly used method, which treats the interaction as a product term of the main-effect variables. Because the genetic factors g1 and g2 are three-level factors, we naturally start with a two-way factorial model:

$$\eta(g_{1i}, g_{2j}) = \mu + g_{1i} + g_{2j} + \delta_{ij} \tag{3}$$

where i = 1, 2, 3; j = 1, 2, 3; g1i represents the main effect of factor g1 at level i; g2j represents the main effect of factor g2 at level j; and δij represents the interaction effect for factors g1 and g2 at levels i and j, respectively. With this model, the overall effect of factor g1 at level i (i.e., the genotypic effect) equals μ + g1i + δij, which depends on the level j of g2.
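The additivity condition (2) can be checked numerically on a 3 × 3 table of expected trait values. The sketch below is only an illustration: the function names and cell means are hypothetical, and taking the first level of each factor as the reference is one possible convention for making the contrasts unique.

```python
def interaction_contrasts(eta):
    """Return d[i][j] = eta[i][j] - eta[i][0] - eta[0][j] + eta[0][0].

    eta is a table of expected trait values, rows indexed by the genotype
    of g1 and columns by the genotype of g2. All contrasts are zero exactly
    when eta(g1, g2) = eta1(g1) + eta2(g2), i.e. when condition (2) holds.
    """
    return [[eta[i][j] - eta[i][0] - eta[0][j] + eta[0][0]
             for j in range(len(eta[0]))]
            for i in range(len(eta))]


def has_interaction(eta, tol=1e-9):
    """True if any interaction contrast differs from zero by more than tol."""
    return any(abs(d) > tol for row in interaction_contrasts(eta) for d in row)


# Hypothetical cell means: row effects (0, 1, 2) plus column effects (0, 0.5, 3).
additive = [[10 + r + c for c in (0.0, 0.5, 3.0)] for r in (0.0, 1.0, 2.0)]

# Hypothetical cell means in which the effect of g1 doubles at level 3 of g2.
epistatic = [[10.0, 10.5, 13.0],
             [11.0, 11.5, 15.0],
             [12.0, 12.5, 17.0]]

print(has_interaction(additive))    # no deviation from additivity
print(has_interaction(epistatic))   # deviation in the third column
```

In the second table only the contrasts in the third column are nonzero, which is the sense in which the effect of g1 depends on the level of g2.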

Cockerham model and alternatives

With no constraints on the parameters, Model (3) is nonidentifiable. In Model (3), genotype factors g1 and g2 take on three values and therefore each has three main effects. However, the classical regression framework can estimate only two parameters – if all three were included, they would be collinear with the constant term and thus cannot be estimated uniquely (i.e., the model is nonidentifiable).

One of the commonly used constraints on the parameters is to exclude the first level of each genotype factor from the model. The level that is excluded from the model is known as the reference or baseline condition. With this constraint, the numbers of main effects of each genotype factor and interactions between two factors reduce to two and four, respectively. Model (3) can be re-parameterized as

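$$\eta(g_1, g_2) = \mu + (x_{a1} a_1 + x_{d1} d_1) + (x_{a2} a_2 + x_{d2} d_2) + (x_{a1} x_{a2}\, aa_{12} + x_{a1} x_{d2}\, ad_{12} + x_{d1} x_{a2}\, da_{12} + x_{d1} x_{d2}\, dd_{12}) \tag{4}$$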

with xak = 1 if gk = 2, xak = 0 otherwise, and xdk = 1 if gk = 3, xdk = 0 otherwise, where ak and dk represent two main effects, and aa12, ad12, da12 and dd12 represent four interaction effects. In human genetic association studies, this model is called a codominant model (Thomas, 2004).

There are other options to construct constraints. The most widely used method is the Cockerham model (Cordell, 2002; Cordell, 2009; Kao & Zeng, 2002; Wang & Zeng, 2006; Wang & Zeng, 2009; Zeng et al., 2005), which defines the main-effect variables as

$$x_{ak} = g_k - 2, \quad \text{and} \quad x_{dk} = (g_k - 1)(3 - g_k) - 0.5 \tag{5}$$

For the Cockerham model, ak and dk correspond to the additive and dominance effects, respectively, and aa12, ad12, da12 and dd12 are interaction effects, called the additive × additive, additive × dominance, dominance × additive and dominance × dominance interactions, respectively. The Cockerham model can be easily understood by introducing the paternal and maternal indicators of the minor allele, ξp and ξm, each centered by subtracting the conventional point 0.5. The indicator ξp (ξm) equals 1 if the paternal (maternal) allele is the minor allele and 0 otherwise. The additive-effect variable can then be expressed as xak = (ξp – 0.5) + (ξm – 0.5), because a genotype consists of two alleles inherited from the father and mother, respectively, and the paternal and maternal allelic effects are assumed identical. The dominance-effect variable can be expressed as xdk = 2(ξp – 0.5)(ξm – 0.5), representing the interaction between paternal and maternal alleles. The Cockerham model can be modified by centering the indicators ξp and ξm by subtracting their mean p (i.e., the allelic frequency) (Wang & Zeng, 2006; Wang & Zeng, 2009), which gives xak = (ξp – p) + (ξm – p) and xdk = 2(ξp – p)(ξm – p).
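As a minimal sketch of the coding in Equation (5), the following Python functions map a genotype coded 1, 2 or 3 to the Cockerham main-effect variables and form the four product terms used as epistatic variables in Model (4). The function names are illustrative only and do not come from any of the software discussed in this review.

```python
def cockerham(g):
    """Cockerham main-effect variables for a genotype g coded 1, 2 or 3."""
    if g not in (1, 2, 3):
        raise ValueError("genotype must be coded 1, 2 or 3")
    x_a = g - 2                      # additive variable: -1, 0, 1
    x_d = (g - 1) * (3 - g) - 0.5    # dominance variable: -0.5, 0.5, -0.5
    return x_a, x_d


def epistatic_variables(g1, g2):
    """Products of main-effect variables for a pair of loci (Model 4)."""
    xa1, xd1 = cockerham(g1)
    xa2, xd2 = cockerham(g2)
    return {
        "aa": xa1 * xa2,  # additive x additive
        "ad": xa1 * xd2,  # additive x dominance
        "da": xd1 * xa2,  # dominance x additive
        "dd": xd1 * xd2,  # dominance x dominance
    }


for g in (1, 2, 3):
    print(g, cockerham(g))
```

Replacing the body of `cockerham` with 0/1 indicators for g = 2 and g = 3 would give the codominant coding instead; the product construction is unchanged.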

The codominant and Cockerham models can be extended to include multiple genetic loci, environmental factors and their interactions:

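$$\eta(g_1, g_2, \cdots, g_m; z_1, z_2, \cdots, z_k) = \mu + \sum_{i=1}^{k} z_i \beta_i + \sum_{j=1}^{m} (x_{aj} a_j + x_{dj} d_j) + \sum_{j<j'}^{m} (x_{aj} x_{aj'}\, aa_{jj'} + x_{aj} x_{dj'}\, ad_{jj'} + x_{dj} x_{aj'}\, da_{jj'} + x_{dj} x_{dj'}\, dd_{jj'}) + \sum_{i=1}^{k} \sum_{j=1}^{m} (x_{aj} z_i\, ae_{ji} + x_{dj} z_i\, de_{ji}) + \cdots \tag{6}$$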

which consists of 2m main effects and 2m(m-1) two-way epistatic interactions. This model can be further extended to include higher-order interactions. We can see that even with a moderate number of factors m, the interaction model can include a huge number of parameters.
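The growth in the number of parameters is easy to tabulate. This small sketch (the function name is hypothetical) counts the genetic main effects and two-way epistatic effects of Model (6) for m loci; gene-environment and higher-order terms are not counted.

```python
from math import comb


def n_effects(m):
    """Numbers of main effects and two-way epistatic effects for m loci."""
    main = 2 * m                # one additive and one dominance effect per locus
    epistatic = 4 * comb(m, 2)  # aa, ad, da and dd for each pair: 2m(m - 1)
    return main, epistatic


for m in (2, 10, 100, 1000):
    print(m, n_effects(m))
```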

Generalized linear models

Generalized linear models have been widely used to analyze various types of non-normal complex traits (Li et al., 2010; Yi & Banerjee, 2009). A generalized linear model consists of three components: the linear predictor, the link function, and the distribution of the outcome variable (Gelman et al., 2003; McCullagh & Nelder, 1989). The linear predictor is the same as that in the normal linear models described above. The link function h() is invertible and relates the mean of the outcome variable Y to the linear predictor:

$$h[E(Y)] = \eta(g_1, g_2, \cdots, g_m; z_1, z_2, \cdots, z_k) \tag{7}$$

or equivalently,

$$E(Y) = h^{-1}[\eta(g_1, g_2, \cdots, g_m; z_1, z_2, \cdots, z_k)] \tag{8}$$

which obviously reduces to the normal linear model if h() is the identity function. The distribution of Y can take various forms, including normal, Gamma, binomial, and Poisson distributions. Common forms of the link function for different assumed distributions of the outcome variable are h(η) = log(η) for Poisson treatment of counts, and logit = log(η / (1 − η)), probit = Φ⁻¹(η), or cloglog = log(−log(1 − η)) for binary and binomial data. Therefore, generalized linear modeling provides a unified framework for statistical analysis; by choosing appropriate link functions and data distributions, some commonly used models, e.g., normal linear, logistic, probit and Poisson regressions, become special cases.
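The link functions listed above, together with their inverses as used in Equation (8), can be written down directly. This is a sketch using only the Python standard library; the dictionary `LINKS` and the helper name are illustrative choices, and here the argument of each link is written `mu` to stress that it is the mean of the outcome.

```python
from math import exp, log
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the probit link

# Each entry is (link h, inverse link h^{-1}).
LINKS = {
    "log": (lambda mu: log(mu), lambda eta: exp(eta)),
    "logit": (lambda mu: log(mu / (1 - mu)), lambda eta: 1 / (1 + exp(-eta))),
    "probit": (lambda mu: _STD_NORMAL.inv_cdf(mu), lambda eta: _STD_NORMAL.cdf(eta)),
    "cloglog": (lambda mu: log(-log(1 - mu)), lambda eta: 1 - exp(-exp(eta))),
}


def mean_from_linear_predictor(eta, link):
    """E(Y) = h^{-1}(eta) for the named link (Equation 8)."""
    return LINKS[link][1](eta)


# The same linear predictor implies different means under different links.
for name in ("logit", "probit", "cloglog"):
    print(name, mean_from_linear_predictor(0.5, name))
```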

Interaction effects are more complicated in generalized linear models due to the link function between the linear predictor and the outcome variable:

  • It is clear from Model (7) that the genetic effects correspond to a transformation of the mean of the outcome variable, h[E(Y)], rather than directly to the mean of the outcome variable E(Y) as in normal linear models. In a logistic regression, for example, genetic effects are defined on the scale of the log odds of a success outcome (i.e., Y = 1), i.e., $\mathrm{logit}[\Pr(Y=1)] = \log\left[\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\right]$.

  • Some generalized linear models (for example, logistic and probit regressions) can be expressed as a normal linear model with an unobserved or latent outcome variable. For example, the logistic regression logit[Pr(Y = 1)] = η(g1, g2, ···, gm; z1, z2, ···, zk) is approximately equivalent to the latent normal linear model, u ~ N(η(g1, g2, ···, gm; z1, z2, ···, zk), 1.6²), Y = 1 if u > 0, and Y = 0 if u < 0. Therefore, genetic effects in a logistic model actually correspond to the scale of a latent normally distributed outcome. The formulation of latent variables provides not only a computational trick but also a way to interpret the generalized linear models.

  • Because genetic effects depend on the link function, it is possible that interaction effects on a link function may be removed by changing the link function. This is similar to the phenomenon for continuous responses that interaction on one scale may possibly be removed by a nonlinear transformation of the scale (e.g., logarithmic and simple powers) (Berrington & Cox, 2007; Cox, 1984). We may call an interaction removable if a transformation of the outcome scale can be found that induces additivity. I return to this issue later.

  • Even with no multiplicative interaction terms in a generalized linear model, it is possible that the effects of a factor on the mean of the observed outcome E(Y) may depend on the levels of other factors in the model, because of the nonlinear transformation h⁻¹() (Gill, 2001). Therefore, interaction effects are automatically introduced into generalized linear models with a nonlinear link function. However, these interactions do not affect the transformation of the observed data h[E(Y)]. Multiplicative interaction terms such as xa1xa2aa12 are called the “variable-specific” interaction terms, which are different from the “automatic” interaction. If specifying these variable-specific terms in the model leads to improved fit, then we have successfully captured through parameterization at least some of the necessarily existent interaction between variables by the model specification.
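The "automatic" interaction described in the last point can be seen in a small numerical sketch. The coefficients below are hypothetical and the model contains main effects only: on the logit scale the effect of x1 is the same at both levels of x2, but on the probability scale it is not.

```python
from math import exp, log


def prob(x1, x2, mu=-2.0, b1=1.0, b2=1.5):
    """Pr(Y = 1) under a logistic model with no product term."""
    eta = mu + b1 * x1 + b2 * x2   # linear predictor, main effects only
    return 1 / (1 + exp(-eta))     # inverse logit link


def logit(p):
    return log(p / (1 - p))


# Effect of x1 on the link (log-odds) scale at each level of x2.
link_effect_x2_0 = logit(prob(1, 0)) - logit(prob(0, 0))
link_effect_x2_1 = logit(prob(1, 1)) - logit(prob(0, 1))

# Effect of x1 on the scale of E(Y) at each level of x2.
mean_effect_x2_0 = prob(1, 0) - prob(0, 0)
mean_effect_x2_1 = prob(1, 1) - prob(0, 1)

print(link_effect_x2_0, link_effect_x2_1)   # equal up to rounding
print(mean_effect_x2_0, mean_effect_x2_1)   # differ
```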

Principles for analyzing interactions

The widely used genetic interaction models define an interaction term as a product of main-effect variables, following the general definition of interaction (Cox, 1984). For conventional models, guiding principles have been established for efficiently studying interactions. These principles could be more crucial for our problems because of the high-dimensional and correlated structure of genetic data. If appropriately applied, these principles can improve the analysis of genetic interactions (Kooperberg et al., 2009).

  1. The basic strategy for identifying interactions is to start from a simpler model involving only main effects, and then to introduce interaction effects when they improve the model fit to the data. The final interpretation of conclusions will be based on some simpler specification, for example one involving some strong interaction terms (Cox, 1984).

  2. We prefer simultaneously fitting as many predictors as possible and introducing some hierarchical structure into the model (Gelman et al., 2003). This would allow us to take into account the correlation among the predictors. Applied to interaction analysis, therefore, it would be desirable to simultaneously include many correlated main effects and interactions.

  3. Inputs with large main effects are more likely to have appreciable interactions with other inputs, although small main effects do not preclude the possibility of large interactions (Cox, 1984; Gelman & Hill, 2007). Also, the interactions corresponding to larger main effects may be in some sense of more practical importance. This principle, sometimes referred to as ‘effect heredity’, has been used to build on interaction models (Chipman, 1996; Hamada & Wu, 1992; Nelder, 1994).

  4. When an interaction of multiple factors is in the model, the lower-order variables comprising the interaction should also be present (Nelder, 1994). This is called the ‘effect hierarchy principle’. The reason for this is that if some contrast interacts with, say, z, and is therefore nonzero at some levels of z, it would normally be very artificial to suppose that the value averaged out exactly to zero over the levels of z involved in defining the ‘main effect’ for the contrast (Cox, 1984). Applied to genetic interactions, genetic variants that have an interaction effect typically will also show some modest main effects (Kooperberg et al., 2009). This could be used to more efficiently explore interactions.

Detection of Genetic Interaction

The detection of genetic interactions involves issues of statistical modeling and computing. A variety of methods for detecting gene-gene and gene-environment interactions have been proposed in the past decades (Cordell, 2009; Kooperberg et al., 2009; Musani et al., 2007; Thomas, 2010), and it is impossible to discuss all the available methods in this review. I focus on the most commonly used approaches: penalized likelihood regressions and hierarchical models. These two approaches are based on modern statistical techniques for high-dimensional data analysis and are well suited to the challenges in statistical analysis of genetic interactions, although alternative methods, including simple exhaustive searches (Marchini et al., 2005), Bayesian partitioning algorithms (Zhang & Liu, 2007), nonparametric Bayesian methods (Zou et al., 2010), and various machine learning techniques (Chen et al., 2007; Lou et al., 2007; Ritchie et al., 2001), have their own advantages.

Penalized likelihood approach

In the classical framework, parameter estimates are obtained by maximizing the likelihood function. A linear model with either many coefficients or highly correlated variables can be nonidentifiable. A standard approach to overcome the problem of nonidentifiability is to add a penalty to the likelihood function, yielding the penalized likelihood function

$$PL(\beta, \phi) = \log f(y \mid \beta, \phi) - p(\beta) \tag{9}$$

where β represents all effects, and ϕ represents other parameters (e.g., residual variance). The logarithm of the likelihood function log f(y|β, ϕ) is a standard statistical summary of model fit; larger likelihood means better fit to data. For classical models, adding a parameter to a model is expected to improve the fit, even if the new parameter represents pure noise (Gelman & Hill, 2007). Therefore, the penalty term p(β) serves to control the complexity of the model and place some constraints or prior information on the parameters. Maximization of the penalized likelihood results in a penalized likelihood estimator.

The penalized likelihood function not only stabilizes parameter estimation but also provides criteria for model selection and comparison. The form of the penalty p(β) determines the general behavior of the penalized likelihood approach. Small penalties lead to large models with limited bias but potentially high variance; larger penalties lead to the selection of models with fewer predictors, and thus more bias but less variance. A traditional approach is to specify a penalty on the number of coefficients in the model, p(β) = λ |M|, where λ is a penalty parameter, and |M| is the size of a model M. Many classical criteria have this form, including the Akaike information criterion (AIC) (λ = 1) (Akaike, 1969) and the Bayesian information criterion (BIC) (λ = log(sample size)/2) (Schwarz, 1978). These criteria have been widely used in earlier methods of multiple QTL mapping (Kao et al., 1999; Zeng et al., 1999). However, Broman and Speed (2002) showed that the original AIC and BIC tend to include many spurious QTL and thus are not appropriate for model selection in QTL mapping, due to the large numbers of potential variables. Therefore, a variety of modifications to these classical criteria have been proposed, all seeking to control the false positive rate by using a stronger penalty (Baierl et al., 2006; Bogdan et al., 2004; Broman & Speed, 2002).
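A minimal sketch of model comparison under a penalty of the form p(β) = λ|M| follows. The log-likelihoods, model sizes and sample size are made-up numbers chosen only to show that the two classical values of λ can prefer different models; the function names are hypothetical.

```python
from math import log


def penalized_loglik(loglik, model_size, lam):
    """log f(y | beta, phi) - lam * |M|."""
    return loglik - lam * model_size


AIC_LAMBDA = 1.0


def bic_lambda(n):
    """BIC penalty per coefficient on the log-likelihood scale."""
    return log(n) / 2


# Hypothetical fitted models from a sample of n = 200 individuals.
n = 200
small = {"loglik": -120.0, "size": 3}
large = {"loglik": -116.0, "size": 6}


def preferred(lam):
    """Name of the model with the larger penalized log-likelihood."""
    s = penalized_loglik(small["loglik"], small["size"], lam)
    b = penalized_loglik(large["loglik"], large["size"], lam)
    return "small" if s >= b else "large"


print("AIC prefers:", preferred(AIC_LAMBDA))
print("BIC prefers:", preferred(bic_lambda(n)))
```

With these numbers the weaker AIC penalty selects the larger model and the stronger BIC penalty selects the smaller one, which is the direction of the trade-off described above.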

For epistatic models, using a single penalty to control the overall complexity of the model would not be appropriate, because there are many more potential interactions than main effects. Therefore, two separate penalties should be used for main effects and pair-wise epistatic interactions (Baierl et al., 2006; Bogdan et al., 2004; Manichaikul et al., 2009):

$$PL(\beta, \phi) = \log f(y \mid \beta, \phi) - \lambda_m |M|_m - \lambda_i |M|_i \tag{10}$$

where λm and λi are the penalties on main effects and pairwise epistatic interactions, respectively, and |M|m and |M|i are the numbers of main effects and pairwise epistatic interactions. Bogdan et al. (2004) and Baierl et al. (2006) suggested incorporating prior numbers of main effects and interactions to specify the penalty parameters λm and λi. Manichaikul et al. (2009) used the null distribution of the genomewide maximum LOD score to derive the penalty on main effects and the results of a two-dimensional, two-QTL scan to derive the penalty for the interaction terms. These methods employed forward and stepwise procedures to select main effects and interactions based on the corresponding penalized likelihoods. Manichaikul et al. (2009) further imposed an effect hierarchy principle, with the inclusion of a pairwise interaction requiring the inclusion of both corresponding main effects, and always included both additive and dominance terms for a QTL and all four epistatic effects for a pair of interacting QTL. The method of Manichaikul et al. (2009) has been implemented in the freely available software R/qtl. R/qtl is an extensible, interactive environment for mapping quantitative trait loci (QTLs) in experimental populations derived from inbred lines (Broman et al., 2003).
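The two-penalty criterion (10) and the effect hierarchy constraint can be sketched as follows. This is not the R/qtl implementation: the function names, the locus labels, and the choice to count one main effect per locus and one interaction per pair are assumptions made for illustration, and the penalty values are arbitrary.

```python
def two_penalty_score(loglik, main_effects, interactions, lam_main, lam_int):
    """Equation (10): log-likelihood minus separate penalties.

    main_effects is a set of locus labels; interactions is a set of
    frozensets, each holding the two loci that interact.
    """
    return loglik - lam_main * len(main_effects) - lam_int * len(interactions)


def obeys_hierarchy(main_effects, interactions):
    """True if every interacting pair has both of its main effects in the model."""
    return all(pair <= main_effects for pair in interactions)


# Hypothetical candidate model with two loci and their pairwise interaction.
mains = {"q1", "q2"}
pairs = {frozenset(("q1", "q2"))}

print(obeys_hierarchy(mains, pairs))
print(two_penalty_score(-100.0, mains, pairs, lam_main=2.0, lam_int=3.5))
```

A stepwise search would evaluate this score for each candidate model and, under the hierarchy principle, skip candidates for which `obeys_hierarchy` is false.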

The above penalty is called the L0-penalty, which only involves the number of parameters and ignores the sizes of individual coefficients. Other penalty functions depend on the sizes of individual coefficients and can be more flexible. A popular method of this form uses an L2-penalty (quadratic penalty) on all coefficients (excluding the intercept), corresponding to ridge regression (Hoerl & Kennard, 1970):

$$PL(\beta, \phi) = \log f(y \mid \beta, \phi) - \lambda \sum_{j=1}^{J} \beta_j^2, \tag{11}$$

which is equivalent to maximizing the likelihood function subject to a size constraint on the sum of the squared coefficients, $\sum_{j=1}^{J} \beta_j^2 < t$. The penalty parameter λ is usually predetermined by cross-validation.

Ridge regression can handle the problem of collinearity and thus can simultaneously fit highly correlated variables. Malo et al. (2008) applied ridge regression to fit all SNPs in a genomic region in genetic association studies and showed that such multiple-SNP analyses accommodate linkage disequilibrium among SNPs and have the potential to distinguish causative from noncausative variants. Park and Hastie (2008) proposed a logistic regression with L2-penalty to fit genetic interactions in population-based case-control studies. They showed that the penalized logistic regression has a number of attractive properties for detecting genetic interactions. First, the L2-penalty can deal with perfectly collinear variables (they sum to 1), and thus makes it possible to code each level of a factor by a dummy variable, yielding coefficients with direct interpretations (see Equation 3). As described earlier, this coding method cannot be applied to classical regression. Secondly, the L2-penalty automatically assigns zero to the coefficients of zero columns and hence gracefully handles interaction models which consist of variables with near-zero variance. Thirdly, the quadratic penalty enables us to simultaneously fit a large number of factors and interactions in a stable fashion. Although the L2-penalty has the above advantages, it cannot shrink any coefficients directly to zero and thus does not automatically remove variables from the model. Park and Hastie (2008) proposed a forward stepwise method based on the penalized likelihood to perform variable selection. Their algorithm obeys the effect hierarchy principle, but also provides the option to accept an interaction even with no corresponding main effects in the model.
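The way a quadratic penalty copes with perfect collinearity can be illustrated for a normally distributed trait and two predictors. The sketch below minimizes the residual sum of squares plus `lam` times the sum of squared coefficients, which corresponds to (11) with the penalty rescaled by the residual variance. The data are invented, with two markers in complete linkage disequilibrium so that their additive codes are identical; this is not the algorithm of Park and Hastie (2008).

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Minimize sum((y - b1*x1 - b2*x2)^2) + lam*(b1^2 + b2^2); no intercept."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ZeroDivisionError("normal equations are singular; use lam > 0")
    # Solve the 2 x 2 system (X'X + lam*I) b = X'y by Cramer's rule.
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


# Hypothetical centered additive codes and trait values for six individuals.
x = [-1, 0, 1, 1, -1, 0]
y = [-2.1, 0.2, 1.9, 2.2, -1.8, -0.4]

# With lam = 0 the duplicated predictor makes the system singular.
try:
    ridge_two_predictors(x, x, y, 0.0)
except ZeroDivisionError as err:
    print("unpenalized fit fails:", err)

# With lam > 0 the fit is unique and the effect is shared equally.
print(ridge_two_predictors(x, x, y, 1.0))
```

The penalized solution splits the signal evenly between the two identical columns and neither coefficient is exactly zero, consistent with the remark that the L2-penalty stabilizes estimation but does not select variables.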

Another widely used penalized likelihood approach uses an L1-penalty, leading to the lasso (least absolute shrinkage and selection operator) introduced by Tibshirani (1996). The lasso estimator is obtained by maximizing the likelihood function subject to a constraint on the sum of absolute values of the regression coefficients, $\sum_{j=1}^{J} |\beta_j| < t$. This is equivalent to maximizing the following penalized likelihood function:

$$PL(\beta, \phi) = \log f(y \mid \beta, \phi) - \lambda \sum_{j=1}^{J} |\beta_j| \tag{12}$$

Compared to the ridge regression, a remarkable property of the lasso is that the L1-penalty can shrink some coefficients exactly to zero and therefore automatically achieve variable selection. This can be intuitively explained by the fact that |βj| is much larger than |βj|² for small βj and thus the constraint $\sum_{j=1}^{J} |\beta_j| < t$ forces some βj's exactly to zero. Various optimization algorithms have been proposed to obtain the lasso estimator (Hesterberg et al., 2008); notably, the least angle regression (Efron et al., 2004) and the coordinate descent algorithm (Friedman et al., 2010; Wu & Lange, 2008) are the most computationally efficient.
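The exact zeros produced by the L1-penalty come from a soft-thresholding step, which is the core of coordinate descent. The following is a bare-bones sketch for a normal linear model with unit residual variance, so that maximizing (12) amounts to minimizing half the residual sum of squares plus λ times the sum of absolute coefficients. The data and function names are hypothetical, and the code makes no attempt to match the cited implementations.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize 0.5 * sum((y - X b)^2) + lam * sum(|b_j|) by cyclic updates.

    X is a list of rows. No intercept is fitted, so the columns of X and
    the response y are assumed to be centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals when all coefficients are zero
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0:
                continue  # a zero column carries no information
            # Inner product of column j with the residual that excludes its own fit.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n))
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Hypothetical data: the trait depends mainly on the first of two markers.
X = [[-1, 1], [0, -1], [1, 0], [1, -1], [-1, 0], [0, 1]]
y = [-2.1, 0.2, 1.9, 2.2, -1.8, -0.4]

print(lasso_coordinate_descent(X, y, lam=0.0))  # ordinary least squares
print(lasso_coordinate_descent(X, y, lam=2.0))  # second coefficient removed
```

With no penalty both coefficients are nonzero; with the larger penalty the second coefficient is set exactly to zero while the first is shrunk, which is the selection behavior that distinguishes the lasso from ridge regression.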

The feature of continuous shrinkage and variable selection along with the fast algorithms makes the lasso an effective method for genome-wide analysis of interacting genes. Tanck et al. (2006) applied lasso penalized regression to detect epistatic interactions in association studies, with L2-penalty on main effects and L1-penalty on epistatic effects. Therefore, all main effects are always included in the model, while irrelevant interactions can be removed. Wu et al. (2009) developed a lasso penalized logistic regression for genome-wide association analysis in case-control studies. Their approach always selects a fixed number of predictors from all potential predictors. This yields a more efficient way to determine the penalty parameter. This novel strategy is similar to the composite model space approach which places an upper bound on the number of effects included in the model (Yi, 2004; Yi et al., 2005). For a given value of the penalty parameter, Wu et al. (2009) applied the coordinate descent algorithm to fit the lasso penalized logistic regression. Wu et al. (2009) handled interactions in two stages. In the first stage, the most important main effects of the predetermined number are identified; in the second stage, the two-way or higher order interactions among the selected SNPs are examined. The method of Wu et al. (2009) has been implemented in the freely available software Mendel 9.0 at the UCLA Human Genetics web site.

Hierarchical models

Hierarchical modeling is an important tool in the analysis of complex and high-dimensional data and has been increasingly applied to QTL and association studies. Hierarchical models use a population distribution to structure some dependence into the parameters, thereby making it possible to fit a large number of predictor variables. In contrast, nonhierarchical models generally cannot handle many variables simultaneously, because they are numerically unstable or tend to overfit data (i.e., fit the existing data well but lead to inferior prediction for new data). Hierarchical models are more easily interpreted and handled in the Bayesian framework. In Bayesian models, the population distribution of the parameters is often referred to as the prior distribution, and statistical inference is based on the posterior distribution that is proportional to the product of the likelihood function f(y|β, ϕ) and the prior distribution π(β, ϕ):

$$p(\beta, \phi \mid y) \propto f(y \mid \beta, \phi)\, \pi(\beta, \phi) \qquad (13)$$

The posterior distribution contains all the current information about the parameters. Ideally one might fully explore the entire posterior distribution by sampling from the distribution p(β, ϕ | y) using Markov chain Monte Carlo (MCMC) algorithms (Gelman et al., 2003). For practical and computational purposes, however, it is desirable to have a fast algorithm that returns a point estimate of the parameters and standard errors. A commonly used point estimate is the posterior mode, that is, the single most likely value, which can be obtained by maximizing the posterior density p(β, ϕ | y), or equivalently its logarithm:

$$\log p(\beta, \phi \mid y) = \log f(y \mid \beta, \phi) + \log \pi(\beta, \phi) + \text{constant} \qquad (14)$$

Compared with the penalized likelihood function (9), we can see that the posterior mode estimator is equivalent to the penalized estimator, with the logarithm of the prior density log π(β, ϕ) as the penalty. Therefore, with particular priors, hierarchical models can lead to the penalized likelihood approaches discussed above.

Shrinkage priors

The prior distribution π(β, ϕ) plays an important role in the hierarchical modeling approach. A variety of priors have been proposed (Griffin & Brown, 2007), some of which have been adopted in QTL mapping and association analysis (Mutshinda & Sillanpää, 2010; Sun et al., 2010; Yi & Banerjee, 2009; Yi & Xu, 2008). For models with a large number of potential variables, it is reasonable to assume that most of the variables have no or weak effects on the phenotype, whereas only a few have noticeable effects. Therefore, we can set up a prior distribution that gives each effect βj a high probability of being near zero. Such priors are often referred to as ‘shrinkage’ priors. In the following discussion, the prior distribution of ϕ is assumed to be noninformative and independent of β.

A class of shrinkage priors uses continuous distributions. A commonly used continuous shrinkage prior is the double exponential (also called Laplace) distribution (Park & Casella, 2008; Tibshirani, 1996; Yi & Xu, 2008), $\pi(\beta_j) = \frac{\lambda}{2} e^{-\lambda |\beta_j|}$, where $\lambda$ is a shrinkage parameter and controls the amount of shrinkage; larger $\lambda$ forces more coefficients near zero. With this prior, the log posterior density can be expressed as $\log p(\beta, \phi \mid y) = \log f(y \mid \beta, \phi) - \lambda \sum_{j=1}^{J} |\beta_j| + \text{constant}$. Therefore, the posterior mode estimate of the coefficients $\beta$ is the lasso penalized estimate (Park & Casella, 2008).
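The equivalence between the posterior mode under a double exponential prior and the lasso estimate can be checked numerically in a toy case. The sketch below assumes a single observation z from N(β, 1) and a Laplace prior with rate λ, and maximizes the log posterior over a fine grid; the setup and names are illustrative assumptions, not part of any cited method.

```python
def log_posterior(beta, z, lam):
    """Log posterior up to a constant: normal log-likelihood (unit variance)
    plus the log of the double exponential prior density."""
    return -0.5 * (z - beta) ** 2 - lam * abs(beta)


def grid_posterior_mode(z, lam, half_width=5000, step=0.001):
    """Maximize the log posterior over the grid -5.000, -4.999, ..., 5.000."""
    grid = [i * step for i in range(-half_width, half_width + 1)]
    return max(grid, key=lambda b: log_posterior(b, z, lam))


mode_large = grid_posterior_mode(z=3.0, lam=1.0)  # expected near 3 - 1 = 2
mode_small = grid_posterior_mode(z=0.5, lam=1.0)  # expected exactly 0
```

The grid maximizer agrees with the closed-form lasso solution for this one-parameter problem: the estimate is moved toward zero by λ, and set to zero when |z| does not exceed λ.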

Another widely used continuous shrinkage distribution is the well-known Student-t distribution, $\pi(\beta_j) = t_{\nu_j}(\mu_j, s_j^2)$, where the hyperparameters $\mu_j$, $\nu_j > 0$ and $s_j > 0$ are the location, the degrees of freedom and the scale parameters, respectively (Gelman et al., 2003). The location $\mu_j$ is usually set to zero. The hyperparameters $\nu_j$ and $s_j$ control the global amount of shrinkage in the effect estimates; larger $\nu_j$ and smaller $s_j^2$ induce stronger shrinkage and force more effects to be near zero. The family of Student-t distributions includes various distributions as special cases. At $s_j = \infty$, the t prior approaches a flat distribution, i.e., $\pi(\beta_j) \propto 1$. Placing flat priors on all $\beta_j$ corresponds to a classical model, which usually fails in our problem as illustrated earlier. At $\nu_j = \infty$ and $s_j = s$, the t prior is equivalent to a normal distribution $\beta_j \sim N(0, s^2)$, and thus the log posterior density can be expressed as $\log p(\beta, \phi \mid y) = \log f(y \mid \beta, \phi) - \frac{1}{2s^2} \sum_{j=1}^{J} \beta_j^2 + \text{constant}$. Therefore, the posterior mode estimate of the coefficients $\beta$ is the ridge penalized estimate.
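As a worked special case, assume a normal linear model $y \sim N(X\beta, \sigma^2 I)$ with known residual variance $\sigma^2$ (an assumption made here only for illustration; $X$ denotes the design matrix). With independent $N(0, s^2)$ priors on the coefficients, the posterior mode has the familiar ridge closed form, and the ridge penalty parameter is the variance ratio $\sigma^2 / s^2$:

```latex
\hat{\beta}
  = \left( X^{\top} X + \frac{\sigma^{2}}{s^{2}}\, I \right)^{-1} X^{\top} y
```

A small prior scale $s$ therefore corresponds to a large ridge penalty, and letting $s \to \infty$ recovers the ordinary least squares estimate whenever $X^{\top} X$ is invertible.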

Both the double exponential distribution and the Student-t distribution can be presented as a two-level hierarchical model (Griffin & Brown, 2007; Yi & Xu, 2008). The first level assumes that the coefficients $\beta_j$'s follow independent normal distributions with mean zero and unknown variances $\tau_j^2$, and the second level assumes that the variances $\tau_j^2$ follow some specified independent prior distributions:

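$$\beta_j \mid \tau_j^2 \sim N(\beta_j \mid \mu_j, \tau_j^2), \qquad \tau_j^2 \mid \theta_j \sim \pi(\tau_j^2 \mid \theta_j) \qquad (15)$$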

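where $\theta_j$ represent hyperparameters. The above two-level priors result in a scale mixture of normal distributions for the coefficients $\beta_j$: $\beta_j \sim \pi(\beta_j \mid \theta_j) = \int_0^{\infty} N(\beta_j \mid 0, \tau_j^2)\, \pi(\tau_j^2 \mid \theta_j)\, d\tau_j^2$. For the double exponential prior, $\pi(\tau_j^2 \mid \theta_j)$ is an exponential distribution $\text{Expon}(\lambda_j^2/2)$ or equivalently a gamma distribution $\text{Gamma}(1, \lambda_j^2/2)$. For the Student-t prior, $\pi(\tau_j^2 \mid \theta_j)$ is a scaled inverse-$\chi^2$ distribution $\text{Inv-}\chi^2(\nu_j, s_j^2)$ or equivalently an inverse gamma distribution $\text{Inv-gamma}(\nu_j/2, \nu_j s_j^2/2)$.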
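The scale-mixture representation can be checked by simulation. The sketch below draws a variance from an exponential distribution with rate λ²/2, then draws the coefficient from a normal distribution with that variance, and compares two moments with those of a double exponential distribution with rate λ (mean absolute value 1/λ and variance 2/λ²). The seed, sample size and function name are arbitrary choices made for illustration.

```python
import random


def draw_beta_double_exponential(lam, rng):
    """Two-level draw: tau2 ~ Expon(rate = lam**2 / 2), beta | tau2 ~ N(0, tau2)."""
    tau2 = rng.expovariate(lam ** 2 / 2.0)  # expovariate is parameterized by the rate
    return rng.gauss(0.0, tau2 ** 0.5)      # gauss takes the standard deviation


rng = random.Random(1)
lam = 2.0
draws = [draw_beta_double_exponential(lam, rng) for _ in range(100000)]

mean_abs = sum(abs(b) for b in draws) / len(draws)      # theory: 1 / lam = 0.5
second_moment = sum(b * b for b in draws) / len(draws)  # theory: 2 / lam**2 = 0.5
```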

The two-level hierarchical formulation has several advantages. First, it allows easy and efficient computation; conditional on the variances $\tau_j^2$ the coefficients $\beta_j$ can be easily estimated, and for some distributions $\pi(\tau_j^2 \mid \theta_j)$ (for example, the exponential and the inverse-$\chi^2$ distributions) the variances $\tau_j^2$ can also be easily estimated. Secondly, it offers easy interpretation of the model; the coefficient-specific variances $\tau_j^2$ result in different amounts of shrinkage for different coefficients. Thirdly, it is flexible enough to encompass most versions of the penalized regression procedures and also leads to new hierarchical models by using new priors for the variances $\tau_j^2$ or further modeling the hyperparameters $\theta_j$ (Griffin & Brown, 2007; Hoggart et al., 2008; Kyung et al., 2010; Sun et al., 2010).

The second class of shrinkage priors assumes a discrete, two-component mixture distribution for each genetic effect, consisting of a normal distribution and a point mass at zero (Yi & Shriner, 2008; Yi et al., 2007b; Yi et al., 2005):

$$\beta_j \mid \gamma_j \sim (1 - \gamma_j)\, I_0 + \gamma_j\, N(0, \tau_j^2) \qquad (16)$$

where $I_0$ is a point mass at 0, and $\gamma_j$ is a binary variable indicating the absence ($\gamma_j = 0$) or presence ($\gamma_j = 1$) of the effect $\beta_j$. The variance $\tau_j^2$ can be predetermined or treated as a random variable with an inverse-$\chi^2$ hyper-prior distribution: $\tau_j^2 \sim \text{Inv-}\chi^2(\nu_j, s_j^2)$. The sparseness in the fitted model is controlled by the values of $(\nu_j, s_j^2)$ and the prior inclusion probability $p(\gamma_j = 1)$ for each effect. The values of $(\nu_j, s_j^2)$ can be chosen to control the prior expected mean and the prior confidence region of the proportion of the phenotypic variance explained by $\beta_j$. Yi et al. (2005) proposed a method to choose the prior inclusion probabilities $p(\gamma_j = 1)$ for main effects and the G×G and G×E interactions (Yi & Shriner, 2008; Yi et al., 2007b). These discrete ‘spike and slab’ priors lead to various Bayesian variable selection methods (Yi & Shriner, 2008).
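To make the discrete prior concrete, the sketch below draws effects from the two-component mixture with a fixed slab variance. The prior inclusion probability of 0.1, the unit slab variance and the function name are hypothetical values chosen only for illustration.

```python
import random


def draw_spike_and_slab(p_include, tau2, rng):
    """Draw (gamma, beta): gamma ~ Bernoulli(p_include);
    beta is exactly 0 when gamma = 0 and N(0, tau2) when gamma = 1."""
    gamma = 1 if rng.random() < p_include else 0
    beta = rng.gauss(0.0, tau2 ** 0.5) if gamma == 1 else 0.0
    return gamma, beta


rng = random.Random(7)
samples = [draw_spike_and_slab(0.1, 1.0, rng) for _ in range(50000)]

frac_included = sum(g for g, _ in samples) / len(samples)  # close to 0.1
excluded_are_zero = all(b == 0.0 for g, b in samples if g == 0)
```

Under this prior most effects are exactly zero a priori, which is what makes the prior inclusion probability a direct control on model sparseness.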

Estimating posterior modes

The continuous shrinkage priors result in continuous posterior distributions, allowing us to develop deterministic algorithms to quickly estimate the posterior mode. A variety of methods for computing the posterior mode have been developed for hierarchical models with continuous shrinkage priors, using EM (expectation-maximization) algorithms that take advantage of the two-level hierarchical formulation (Armagan & Zaretzki, 2010; Figueiredo, 2003; Gelman et al., 2008) or other optimization algorithms (Genkin et al., 2007). These algorithms have been adapted to multiple QTL mapping and genetic association analysis (Hoggart et al., 2008; Sun et al., 2010; Xu, 2007; Xu, 2010; Yi & Banerjee, 2009; Yi et al., 2010; Zhang & Xu, 2005). Among these developments, the methods of Yi and Banerjee (2009) and Yi et al. (2010) have attractive features and will be discussed below. The method of Yi and Banerjee (2009) has been implemented in the freely available software R/qtlbim (Yandell et al., 2007). R/qtlbim is an extensible, interactive environment for the Bayesian Interval Mapping of QTL, built on top of R/qtl (Broman et al., 2003), providing Bayesian analysis of multiple interacting quantitative trait loci (QTL) models for continuous, binary and ordinal traits in experimental crosses.

Yi and Banerjee (2009) and Yi et al. (2010) developed hierarchical generalized linear models with Student-t prior distributions on the coefficients for multiple interacting QTL mapping and genetic association studies. Yi and Banerjee (2009) discussed the choice of the shrinkage parameters $\nu_j$ and $s_j$ to favor sparseness in the fitted model. Yi et al. (2010) further proposed different scales $s_j$ for different types of effects (i.e., main effects, G×G and G×E interactions); this specification applies stronger shrinkage to interactions and thus allows more reliable joint estimation of main effects and interactions. They used the EM algorithm to fit the model by estimating the marginal posterior modes of the coefficients $\beta_j$'s. The algorithm uses the two-level expression of the t prior distribution, treats the unknown variances $\tau_j^2$ as missing data and replaces them by their conditional expectations at each E-step. The conditional expectations of $\tau_j^2$ are independent of the response data, and thus the E-step is the same for different types of phenotypes. Given the variances $\tau_j^2$, the prior distributions $\beta_j \mid \tau_j^2 \sim N(0, \tau_j^2)$ can be included as additional ‘data points’ in the normal approximation of the generalized likelihood. Therefore, the coefficients $\beta_j$ can be estimated using the standard iterative weighted least squares (IWLS) for fitting classical generalized linear models. Yi and Banerjee (2009) incorporated the above EM algorithm into the standard function glm in R for fitting classical generalized linear models. This computational strategy takes advantage of the standard algorithm and software, and thus leads to a stable, flexible and easily used computational tool.
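The structure of such an EM algorithm can be sketched for the simplest case, a normal linear model with a known residual variance. This is a simplified illustration and not the published implementation, which works with generalized linear models through IWLS; the data are simulated, and the default hyperparameter values and all function names are assumptions. In the E-step each 1/τj² is replaced by its conditional expectation (ν + 1)/(νs² + βj²); in the M-step the coefficients solve a least squares problem in which those expectations act as coefficient-specific ridge penalties.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def em_student_t(x, y, nu=1.0, s=0.5, sigma2=1.0, n_iter=50):
    """Posterior mode of beta under independent t_nu(0, s^2) priors,
    for a normal linear model with known residual variance sigma2."""
    n, p = len(x), len(x[0])
    xtx = [[sum(x[i][j] * x[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(x[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        # E-step: conditional expectation of 1/tau_j^2 given the current beta_j.
        inv_tau2 = [(nu + 1.0) / (nu * s * s + b * b) for b in beta]
        # M-step: least squares with coefficient-specific ridge penalties.
        a = [[xtx[j][k] + (sigma2 * inv_tau2[j] if j == k else 0.0)
              for k in range(p)] for j in range(p)]
        beta = solve(a, xty)
    return beta


# Simulated example: five predictors, two of which have real effects.
rng = random.Random(0)
true_beta = [2.0, 0.0, 0.0, -1.5, 0.0]
x = [[rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(100)]
y = [sum(b * v for b, v in zip(true_beta, row)) + rng.gauss(0.0, 1.0) for row in x]
beta_hat = em_student_t(x, y)
```

Note that the E-step uses only the current coefficients and the hyperparameters, never the response, which is why the same update can be reused for other phenotype types.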

The above approach is built upon the generalized linear model framework, and therefore can deal with various types of continuous and discrete phenotypes and any models implemented in the R function glm (e.g., normal linear, gamma, logistic and Poisson models). This flexibility allows us to conveniently analyze data in different ways. As described earlier, interactions are defined relative to particular models and thus can be affected by a change of the model (Berrington & Cox, 2007; Cordell, 2002). The above approach would allow us to investigate whether an interaction can be removed by a transformation of the scale and to detect interactions that are only present in a particular model. The hierarchical generalized linear model with Student-t priors on the coefficients includes, as special cases, various methods that have been designed to handle problems encountered in interacting QTL and association studies (Yi et al., 2010). In addition, the above EM algorithm takes advantage of the two-level formulation of the t distribution and hence can be easily applied to other shrinkage priors (e.g., the double exponential distribution) by modifying only the conditional expectations of $\tau_j^2$.

The hierarchical models can simultaneously analyze many covariates, main effects of numerous loci, and epistatic and G×E interactions. For large-scale genetic data, however, we recommend performing a preliminary analysis to weed out unnecessary variables, or using a variable selection procedure to build a parsimonious model that only includes the most important predictors. The above algorithm can be incorporated into various variable selection procedures. Following the general principle for analyzing interactions discussed earlier, Yi and Banerjee (2009) proposed a useful model search strategy, beginning with a model with no genetic effect but relevant covariates if any, and then gradually adding main effects and interactions into the model. This procedure differs from most variable selection methods by simultaneously adding or deleting many correlated variables.

Sampling from the continuous posterior distribution

In Bayesian inference, it is more comprehensive to fully explore the posterior distribution than merely calculate the posterior mode. For the hierarchical models described above, this requires MCMC algorithms to generate samples from the posterior density. Various MCMC algorithms have been developed for hierarchical models with the continuous shrinkage priors discussed above (Bae & Mallick, 2004; Hans, 2009; Kyung et al., 2010; Park & Hastie, 2008), most taking advantage of the hierarchical formulation of the priors. These algorithms have been recently adapted to multiple QTL mapping and genetic association analysis (Sun et al., 2010; Xu, 2003; Yi & Xu, 2008), although they consider only main effects.

For hierarchical models with shrinkage priors that can be expressed as a mixture of normal distributions, it is easy to construct MCMC algorithms. Yi and Xu (2008) and Sun et al. (2010) developed MCMC algorithms for mapping multiple QTL using the hierarchical formulation of the double exponential and the Student-t priors. Since all priors for regression coefficients are conditionally Gaussian, a simple and unified scheme can be developed to update the coefficients $\beta_j$ regardless of the specific prior distributions on the variances $\tau_j^2$. For the Student-t and double exponential priors, the conditional posterior distributions of the variances $\tau_j^2$ have standard forms and thus can be easily sampled. Since the variances are separated from the data by the regression coefficients, the conditional distributions of the variances are independent of the response data. Therefore, the same updating scheme can be used to update the variances regardless of the response distribution. The advantage of MCMC samplers for hierarchical priors becomes more obvious when dealing with the hyperparameters $\lambda$ in the double exponential prior and $(\nu, s)$ in the Student-t prior. The penalized likelihood approaches predetermine the penalty parameter using cross-validation, and the mode-finding algorithms usually preset the hyperparameters. In the fully Bayesian framework, however, the hyperparameters can be assigned appropriate hyperpriors and updated along with other parameters (Kyung et al., 2010; Park & Casella, 2008; Sun et al., 2010; Yi & Xu, 2008), or can be estimated by empirical Bayes using marginal maximum likelihood (Kyung et al., 2010; Park & Casella, 2008; Sun et al., 2010; Yi & Xu, 2008); this procedure obviates the choice of the hyperparameters and automatically accounts for the uncertainty in their selection, which affects the estimates of the regression coefficients.
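As an illustration of the variance update, the sketch below samples one τj² from its conditional posterior under the Student-t prior. With βj | τj² ~ N(0, τj²) and τj² ~ Inv-χ²(ν, s²), standard conjugacy gives τj² | βj ~ Inv-χ²(ν + 1, (νs² + βj²)/(ν + 1)), which can be drawn as (νs² + βj²) divided by a chi-square variate with ν + 1 degrees of freedom. The function name and the numerical values are hypothetical.

```python
import random


def sample_tau2_student_t(beta_j, nu, s2, rng):
    """One draw of tau_j^2 given beta_j under the t prior.
    The response data do not appear anywhere in this update."""
    # A chi-square variate with nu + 1 df is Gamma(shape=(nu + 1)/2, scale=2).
    chi2 = rng.gammavariate((nu + 1.0) / 2.0, 2.0)
    return (nu * s2 + beta_j ** 2) / chi2


rng = random.Random(11)
beta_j, nu, s2 = 1.0, 4.0, 1.0
draws = [sample_tau2_student_t(beta_j, nu, s2, rng) for _ in range(50000)]

# Theory: E(1 / tau2 | beta) = (nu + 1) / (nu * s2 + beta^2) = 5 / 5 = 1.
mean_precision = sum(1.0 / t for t in draws) / len(draws)
```

The Monte Carlo average of 1/τj² matches the conditional expectation that a mode-finding EM algorithm would plug in at its E-step, which links the two computational strategies.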

The disadvantage of the above fully Bayesian approach is the intensive computation. This may restrict its application in genetic interaction analysis of large-scale data. However, these methods can provide richer information on the posterior of a regression coefficient and adequately reflect the uncertainty in estimating a parameter to be close to zero (Kyung et al., 2010; Park & Casella, 2008). The fully Bayesian analysis can return not only point estimates but also interval estimates of all parameters and offers a natural means of assessing model uncertainty. Like the mode-finding algorithms, the fully Bayesian methods can simultaneously fit many correlated variables and can distinguish important effects from a large number of correlated variables (Sun et al., 2010; Yi & Xu, 2008).

Bayesian variable selection using discrete priors

The hierarchical models with a discrete prior (16) are usually fitted using MCMC algorithms. A variety of algorithms have been proposed, some of which have been adapted to multiple interacting QTL mapping and genetic association analysis. Yi and Shriner (2008) provide a comprehensive review on these methods. In this section, I describe the Bayesian multiple interacting QTL mapping methods that have been implemented in the freely available software R/qtlbim (Yandell et al., 2007; Yi et al., 2007a; Yi et al., 2007b; Yi et al., 2005).

Yi et al. (2005) developed a Bayesian model selection method for mapping epistatic QTL in experimental crosses for complex traits, based on the discrete priors described above and the composite model space approach of Yi (2004). The key idea of this approach is to place an upper bound on the number of QTL included in the model. Yi et al. (2005) set up the upper bound based on the Poisson prior on the number of QTL with the prior mean determined by any initial analyses. Given the upper bound, Yi et al. (2005) used a vector γ of binary (0 or 1) variables indicating the absence or presence of the corresponding effects, equivalent to assuming the discrete prior (16). The vector γ determines the number of included QTL and the activity of the associated genetic effects. The use of the upper bound and the indicator variables avoids the need to explicitly model the number of QTL as in the previous Bayesian methods, allowing us to fit models of different dimensions, e.g., one vs. two QTL, without resorting to complicated reversible jump MCMC (Yi, 2004). It also largely reduces the model space and provides an efficient way to walk through the space of models, spending more time at “good” models.

Yi et al. (2005) developed an MCMC algorithm to generate samples from the posterior distribution. The method was later extended to include arbitrary environmental effects and G×E interactions (Yi et al., 2007b), and to map interacting QTL for binary and ordinal traits based on generalized probit models (Yi et al., 2007a). The posterior samples can be used to summarize the genetic architecture and search for models with high posterior probabilities. Larger effects should appear more often, making them easier to identify. We use all the saved iterations of the Markov chain, corresponding to model averaging, which assesses characteristics of the genetic architecture by averaging over possible models weighted by their posterior probability. Various methods have been developed to graphically and numerically summarize and interpret the posterior samples (Yandell et al., 2007; Yi et al., 2005).
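Two simple summaries of saved indicator vectors illustrate how such posterior samples are used. The sketch below is a generic illustration rather than the R/qtlbim implementation; the five saved indicator vectors are made up, and real chains contain many thousands of iterations.

```python
def inclusion_probabilities(gamma_samples):
    """Posterior inclusion probability of each effect: the fraction of
    saved iterations in which its indicator equals 1."""
    n = len(gamma_samples)
    return [sum(column) / n for column in zip(*gamma_samples)]


def model_frequencies(gamma_samples):
    """Estimated posterior probability of each visited model."""
    counts = {}
    for gamma in gamma_samples:
        key = tuple(gamma)
        counts[key] = counts.get(key, 0) + 1
    return {model: c / len(gamma_samples) for model, c in counts.items()}


# Hypothetical saved indicator vectors for three candidate effects.
saved = [(1, 0, 0), (1, 0, 1), (1, 0, 0), (0, 0, 1), (1, 0, 0)]
pip = inclusion_probabilities(saved)   # [0.8, 0.0, 0.4]
freq = model_frequencies(saved)        # model (1, 0, 0) has frequency 0.6
```

Averaging indicators over all saved iterations is what makes this a model-averaged summary: an effect is credited in proportion to the posterior weight of every model that contains it.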

Interpretation of Genetic Interaction

In QTL and genetic association analysis, there are many options available when modeling the data and fitting the model. Once multiple QTL are detected and a model with main effects and interactions is established, it is therefore important to assess the fit of the model to the data and to our substantive (biological) knowledge, and to interpret the fitted models. Assessment and interpretation of interaction models have not been extensively discussed in the literature, possibly because identifying genetic interactions is a challenge and researchers are often so relieved to have detected interactions that there is a temptation to stop and rest rather than interpret the fitted model. Here we discuss some methods for interpreting genetic interactions, including the issues of model checking, removable or nonremovable interactions, average predictive genotypic effects, and biological interactions.

Model checking and assessment

A flexible method for model checking and assessment is posterior predictive checking, which can be applied to complex genetic models and can assess the fit of the model to various aspects of the data. Posterior predictive checking proceeds by generating replicated datasets from the fitted model and then comparing these replicated datasets to the observed dataset with respect to any features of interest. Assume that our data analysis has generated a set of simulations of the parameters, $\theta^{(s)} = (\beta^{(s)}, \phi^{(s)})$, $s = 1, \ldots, n_{sim}$. For each of these draws, we simulate a replicated dataset $y^{rep(s)}$ from the predictive distribution of the data, $p(y^{rep} \mid \beta^{(s)}, \phi^{(s)})$. We check the model by means of discrepancy measures (test quantities) $T(y, \theta)$; several discrepancy measures can be chosen to reveal interesting features of the data or discrepancies between the model and the data. For each discrepancy variable, each simulated realized value $T(y, \theta^{(s)})$ is compared to the corresponding simulated replicated value $T(y^{rep(s)}, \theta^{(s)})$. Large and systematic differences between realized and replicated values indicate a misfit of the model to the data. In some cases, differences are apparent visually; otherwise, it can be useful to compute the p-value, $p = \Pr(T(y^{rep}, \theta) > T(y, \theta) \mid y)$, to see whether the difference could plausibly have arisen by chance under the model. Although the posterior predictive model checking method is very flexible and quite simple, an important issue is how to choose the discrepancy quantities; this deserves future research.
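The procedure can be sketched for a toy normal model in which each posterior draw is a pair (mean, standard deviation) and the discrepancy measure depends on the data only. Everything below is a hypothetical illustration: the observed values are made up, and the "posterior draws" are placeholders standing in for the output of a real model fit.

```python
import random


def posterior_predictive_pvalue(y, draws, statistic, rng):
    """Estimate Pr(T(y_rep) > T(y) | y): simulate one replicated dataset
    per posterior draw and count how often its statistic exceeds the
    observed one."""
    t_obs = statistic(y)
    exceed = 0
    for mu, sigma in draws:
        y_rep = [rng.gauss(mu, sigma) for _ in y]  # same size as observed data
        if statistic(y_rep) > t_obs:
            exceed += 1
    return exceed / len(draws)


rng = random.Random(3)
# Made-up observations containing one extreme value.
y_obs = [0.3, -1.1, 0.8, 0.1, -0.5, 1.2, -0.2, 0.6, -0.9, 6.0]
# Placeholder draws of (mean, sd) standing in for real posterior simulations.
draws = [(rng.gauss(0.0, 0.1), 1.0) for _ in range(2000)]

p_max = posterior_predictive_pvalue(y_obs, draws, max, rng)
```

A p-value near 0 or 1 for the maximum indicates that the assumed model almost never reproduces an observation as extreme as the largest one, which is the kind of misfit the check is meant to expose.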

A related approach to model checking is cross-validation, in which observed data are partitioned, with each part of the data compared to its predictions conditional on the model and the rest of the data. Cross-validation has been considered a standard method for estimating the expected predictive fit to new data. However, it is computationally intensive and cannot be widely applied to Bayesian model assessment. For hierarchical models, the posterior predictive checking can produce results close to cross-validation if higher-level parameters are also simulated from the posterior (Green et al., 2009). Another approach is the deviance information criterion (DIC), which is a mixed analytical/computational approximation to an estimated predictive error (Spiegelhalter et al. 2002).

Removable or nonremovable interactions

Statistical interactions are defined relative to particular models and thus can be affected by a change of modeling or outcome scale (Berrington & Cox, 2007; Cordell, 2002; Cordell, 2009; Thomas, 2010). We call an interaction ‘removable’ if a transformation of the outcome scale can be found to induce additivity (Berrington & Cox, 2007). Removable interactions are sometimes referred to as quantitative interactions, whereas nonremovable interactions are referred to as qualitative interactions. It may be important to investigate whether the detected interactions are removable or nonremovable. If the interactions can be removed, the resulting simpler model may be easier to interpret and understand.

For a continuous positive outcome, the Box-Cox technique (Box & Cox, 1964) can be used to find a nonlinear transformation of the outcome that optimally fits the data (Berrington & Cox, 2007; Cox, 1984). The Box-Cox transformations include commonly used logarithmic and simple powers as special cases. For binary data, the logistic or probit or complementary log scale may be effective (Berrington & Cox, 2007). The hierarchical generalized linear model approach of Yi and Banerjee (2009) and Yi et al. (2010) can deal with various types of continuous and discrete phenotypes and any generalized linear models, and allows us to conveniently analyze data using different models, providing a flexible way to investigate the nature of interactions.
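A small numerical example shows what a removable interaction looks like. The sketch below uses the standard Box-Cox family, (y^λ − 1)/λ for λ ≠ 0 and log y for λ = 0, and a hypothetical 2×2 table of cell means in which the two factors act multiplicatively; the table and function names are illustrative assumptions.

```python
import math


def box_cox(y, lam):
    """Box-Cox transformation of a positive outcome."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive outcome")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def interaction_contrast(cell):
    """Interaction contrast of a 2x2 table of cell means:
    zero exactly when the two factors act additively."""
    return cell[1][1] - cell[1][0] - cell[0][1] + cell[0][0]


# Hypothetical genotype-combination means: one factor doubles the outcome,
# the other triples it, so the effects are multiplicative.
means = [[2.0, 6.0], [4.0, 12.0]]

raw = interaction_contrast(means)  # 12 - 4 - 6 + 2 = 4
logged = interaction_contrast([[box_cox(m, 0) for m in row] for row in means])
```

On the original scale the contrast is nonzero, so a linear model would report an interaction; on the log scale (λ = 0) the contrast vanishes up to rounding error, so the interaction is removable in the sense described above.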

Average predictive genotypic effects

Once we detect multiple QTL with main effects and interactions, one of our interests is to infer which genotypes of these QTL are associated with increased phenotypic value or disease risk, and to describe how a gene is associated with a trait or disease in combination with another gene or an environmental factor. This can be derived from the fitted models. However, challenges remain. First, single coefficients in an interaction model are less informative. In the presence of appreciable interaction, for example, main effects are rarely of direct concern because they represent effects among individuals with other variables equaling zero. Therefore, the genetic effects should always be interpreted jointly. Second, the predictors in genetic models are usually coded as functions of the genotypes, rather than the genotypes themselves, leading to further difficulty in interpreting the coefficients. Third, for generalized linear models of interacting genes, the genetic effects are related to a nonlinear transformation (i.e., the link function) of the observed data, and thus cannot be directly interpreted on the scale of the data.

One way to understand models with multiple interactions is to calculate the average predictive comparison of each of the inputs. The average predictive comparison is defined as the expected change in the outcome variable corresponding to a specified change in the input of interest, averaging over some specified distribution of all other inputs and parameters (Gelman & Hill, 2007; Gill, 2001). Yi et al. (2010) extended the average predictive comparison method to interpret genetic interaction models in case-control studies by presenting the average predictive probability of being a case for each of the SNPs and each pair of SNPs (or a SNP and a covariate) that significantly interact.

The method of Yi et al. (2010) can be extended to any genetic interaction model. Suppose that an interaction model has already been established. Generally, we define the marginal expectation $E(y \mid g_s = k)$ as the average predictive effect of the genotype $g_s = k$ of QTL $s$, and $E(y \mid g_s = k, g_{s'} = k')$ as the average predictive effect of the two-locus genotype $(g_s, g_{s'}) = (k, k')$ of QTL $s$ and $s'$. For a binary trait, these expectations equal the average predictive probabilities as defined by Yi et al. (2010). These average predictive effects can be compared with each other, e.g., $E(y \mid g_s = k) - E(y \mid g_s = k')$, or with the overall mean $E(y)$. Thus, the average predictive effects clearly show which genotypes of the detected QTL and their combinations are associated with increased or decreased phenotypic value or disease risk. Yi et al. (2010) developed a simple method to calculate the average predictive probability and graphically display the results. Their method can be extended to calculate the average predictive effects based on any generalized linear model.
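The idea can be sketched for a logistic model with two SNPs and their interaction. This is a simplified illustration, not the procedure of Yi et al. (2010): it plugs in fixed coefficient values rather than averaging over their posterior uncertainty, it assumes additive 0/1/2 genotype coding, and the coefficients and genotypes are hypothetical.

```python
import math


def predict_prob(g1, g2, coef):
    """Predicted probability of being a case under a logistic model with
    two main effects and their interaction."""
    b0, b1, b2, b12 = coef
    eta = b0 + b1 * g1 + b2 * g2 + b12 * g1 * g2
    return 1.0 / (1.0 + math.exp(-eta))


def average_predictive_prob(k, other_genotypes, coef):
    """Estimate E(y | g1 = k): set the first SNP to genotype k for every
    individual and average the predictions over the observed genotypes
    of the second SNP."""
    probs = [predict_prob(k, g2, coef) for g2 in other_genotypes]
    return sum(probs) / len(probs)


# Hypothetical fitted coefficients and observed genotypes at the second SNP.
coef = (-1.0, 0.3, 0.2, 0.4)
g2_observed = [0, 1, 2, 1, 0, 2, 1, 1]
app = [average_predictive_prob(k, g2_observed, coef) for k in (0, 1, 2)]
```

The three averages are on the probability scale, so differences between them can be read directly as changes in average disease risk, which addresses the difficulty of interpreting coefficients on the scale of the link function.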

Biological relevance of statistical interactions

The term ‘epistasis’ or ‘gene × gene interaction’ was originally used to describe instances in which the effect of a particular genetic variant was masked by a variant at another locus so that variation of phenotype with genotype at one locus was only apparent amongst those with certain genotypes at the second locus (Cordell, 2009; VanderWeele, 2010). This original concept of epistasis is different from the definitions of statistical interactions that are usually used in statistical analysis of complex traits. Phillips (2008) recently discussed the ambiguity in the term “epistasis” and defined three distinct forms of epistasis: statistical epistasis, compositional epistasis, and functional epistasis. Phillips (2008) defined “statistical epistasis” as a departure from marginal effects in a statistical model, much closer to the statistical interaction described earlier. The term “compositional epistasis” refers to epistasis in Bateson's original sense of the term, while the term “functional epistasis” describes the physical molecular interactions between various proteins (and other genetic elements) (Phillips, 2008). Compositional epistasis is a more biological form of interaction than the commonly used statistical epistasis, but does not necessarily imply functional epistasis. These distinct concepts of epistasis can also be applied to gene-environment interactions (Thomas, 2010; VanderWeele, 2010).

Most statistical methods for analyzing genetic interactions actually test statistical interactions. However, the extent to which statistical interaction implies biological or functional interaction has been extensively debated in both the genetics and epidemiological literature. A prevailing opinion is that statistical tests for interactions are of limited use for elucidating epistasis in the biological sense of the term (Cordell, 2009). However, VanderWeele recently showed some relationship between statistical interaction and compositional epistasis, and derived conditions under which statistical interactions correspond to compositional epistasis (VanderWeele, 2010; VanderWeele & Laird, 2010). These empirical conditions are quite strong, but the procedures proposed may provide a useful strategy to study biological interactions.

Needs for further progress

Gene or pathway level information

Candidate gene studies usually consist of data at different levels, i.e., genetic variants (e.g., haplotype tagging SNPs) within multiple candidate genes which may be functionally related or from different pathways. Most of the statistical methods currently in use consider only individual-level predictors (i.e., SNPs and covariates) and ignore the hierarchical structure of the data and gene or pathway-level information. It is biologically expected that genetic variants within a gene would influence the phenotype more similarly than those in different genes (Hung et al., 2004). Often, rich gene or pathway-level information is available (Rebbeck et al., 2004), including simple pathway indicator variables, genomic annotation or pathway ontologies, functional assays, in silico predictions of function or evolutionary conservation, or simulation of pathway kinetics (Thomas et al., 2009). Therefore, there is a growing need to develop sophisticated approaches that model the multilevel variation simultaneously and incorporate gene or pathway-level data into the model (Dunson et al., 2008; Thomas, 2010).

Hierarchical models provide a natural and efficient way to incorporate the external information about candidate genes into the analysis. One way to include the gene-level information in the hierarchical models is to model the prior means in the prior distributions of coefficients βj using gene-level predictors. This approach allows us to pool the information in the same genes and thus would provide more effective inference about the genetic effects. Recent developments of penalized regressions for high-dimensional data may provide alternative improved ways to deal with specific structures in candidate genes. It is well known that the original lasso regression does not effectively account for the relationship among a group of correlated predictors and tends to select individual variables from the grouped variables. The elastic net (Zou & Hastie, 2005) is a generalization of the lasso regression, which introduces an additional penalty or prior to incorporate the correlation of predictors into the model (Kyung et al., 2010). The elastic net can be implemented in a hierarchical fashion combining variable selection at lower levels (e.g., among SNPs within a pathway) and shrinkage at higher levels (e.g., between genes within a pathway or between pathways).

Modeling genetic interactions hierarchically

Effect heredity and hierarchy are two important principles for the statistical analysis of interactions (Chipman, 1996; Hamada & Wu, 1992; Nelder, 1994). These principles impose a certain dependence of interactions on their main effects. Since with many predictors there are a huge number of potential interactions, a simple inclusion of interactions can degrade the model fit and thus preclude effective estimation of main effects and interactions. Although these two principles have been considered in some of the previous methods for genetic interactions, there is a clear need for further studies. Recently, the lasso penalized regression has been extended to incorporate the effect heredity and hierarchy principles (Choi et al., 2010; Yuan et al., 2007; Zhao et al., 2009). Theoretical and empirical results have shown that these extensions outperform the previous methods for detecting interactions. These new developments should be adapted to the statistical analysis of genetic interactions. Another promising approach could be modeling interactions in a structured way, for example with larger variances for interactions whose main effects are large. This type of prior can incorporate the effect heredity principle in a more continuous form.

Next-generation sequencing and rare variants in genetic interactions

The genetic etiology of common (or complex) human diseases is determined by both common and rare genetic variants (Bodmer & Bonilla, 2008; Schork et al., 2009). Since genome-wide association studies (GWAS) have thus far focused on common variants (with minor allele frequency (MAF) > ~5%) in the human genome, it has been speculated that rare variants might account for at least some of the heritability that GWAS have missed (Cirulli & Goldstein, 2010; Eichler et al., 2010; Manolio et al., 2009). Several studies have already shown that rare variants play an important role in genetic determination for some diseases (Ahituv et al., 2007; Azzopardi et al., 2008; Cohen et al., 2006; Cohen et al., 2004; Ji et al., 2008; Nejentsev et al., 2009; Romeo et al., 2007; Romeo et al., 2009). Recent advances in next-generation sequencing technologies facilitate the detection of rare variants, making it possible to uncover the roles of rare variants in complex diseases.

As a single rare variant contains little variation owing to its low MAF (< 0.5% or 1%), statistical methods that test variants individually provide insufficient power to detect causal rare variants. Therefore, association analysis of rare variants requires sophisticated methods that can effectively combine the information across variants and test for their overall effect (Manolio et al., 2009). Several approaches have been developed to analyze rare variants, including the Collapsing, Simple-Sum, and Weighted-Sum methods (Li & Leal, 2008; Madsen & Browning, 2009; Morris & Zeggini, 2010; Price et al., 2010). These methods summarize multiple rare variants by weighting them equally (Li & Leal, 2008; Morris & Zeggini, 2010) or on the basis of estimated standard deviation (Madsen & Browning, 2009) or functional prediction (Price et al., 2010). Recently, penalized likelihood approaches and hierarchical models have been applied to rare-variant analysis (Zhou et al., 2010). These methods have focused on rare variants in a single gene or region and exclude genetic interactions from the analysis. Since complex diseases are usually influenced by multiple genes and environmental factors and their interactions, it would be important to develop sophisticated methods for jointly analyzing all rare variants in multiple genes together with gene-environment and gene-gene interactions.
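
The three summary scores named above can be sketched as follows. This is a simplified illustration and not the published procedures: genotypes are assumed to be coded as rare-allele counts (0, 1, 2), the weights are computed from all individuals with a smoothed allele frequency, and the inverse-standard-deviation weighting is only in the spirit of a weighted-sum approach. All function names and the toy data are hypothetical.

```python
import math


def collapse(genotypes):
    """Collapsing: 1 if an individual carries any rare allele in the region, else 0."""
    return [int(any(g > 0 for g in row)) for row in genotypes]


def simple_sum(genotypes):
    """Simple-Sum: total rare-allele count, with every variant weighted equally."""
    return [sum(row) for row in genotypes]


def inverse_sd_weights(genotypes):
    """Assumed weights 1/sqrt(q(1-q)), where q is a smoothed allele frequency.

    Rarer variants receive larger weights. Using all individuals to estimate q
    is a simplification made for this sketch.
    """
    n = len(genotypes)
    weights = []
    for j in range(len(genotypes[0])):
        count = sum(row[j] for row in genotypes)
        q = (count + 1) / (2 * n + 2)  # smoothing keeps q strictly inside (0, 1)
        weights.append(1.0 / math.sqrt(q * (1.0 - q)))
    return weights


def weighted_sum(genotypes, weights):
    """Weighted-Sum: rare-allele counts combined with variant-specific weights."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


# Toy data: 4 individuals (rows) by 3 rare variants (columns).
G = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 2],
]

collapsed = collapse(G)          # [0, 1, 1, 1]
counts = simple_sum(G)           # [0, 1, 2, 2]
weights = inverse_sd_weights(G)
scores = weighted_sum(G, weights)
```

Each score reduces a region to a single predictor per individual, which can then be entered into a regression model in place of the individual variants.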

Using interaction models for risk prediction

Genome-wide association studies have raised expectations for predicting individual susceptibility to common diseases using genetic variants (Kraft et al., 2009; Wray et al., 2008). Previous methods using only a limited number of significant variants have typically failed to achieve satisfactory prediction performance (Jakobsdottir et al., 2009; Kraft & Hunter, 2009). Recent studies show that joint analysis of a large number of genetic variants can improve risk prediction performance (de los Campos et al., 2009; Hayashi & Iwata, 2010; Lee et al., 2008; Meuwissen et al., 2001; Wei et al., 2009; Yang et al., 2010). However, these studies have not included interactions in the predictive models. If G×G and G×E interactions are present, adding these interactions to a predictive model should increase the accuracy of prediction. Therefore, jointly modeling genetic and environmental factors and their interactions has important implications for disease risk prediction and personalized medicine (Clark, 2000; Moore & Williams, 2009). Because the frequencies of the multi-locus genotypes that define interactions are usually low, inclusion of interactions may not greatly improve overall prediction in the entire population as measured by the commonly used receiver operating characteristic (ROC) curve (Bjørnvold et al., 2008; Clayton, 2009). However, interaction models can identify combinations of multiple susceptibility loci that confer very high or low risk, and hence can be highly predictive for subsets that carry certain combinations of interacting variants (Yi et al., 2010). Unfortunately, most genetic association studies have so far not addressed G×G and G×E interactions, and thus the translation of scientific understanding about G×G and G×E interactions into risk assessment and genomic profiling has been limited.
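
The point about rare but highly informative genotype combinations can be made with a small numerical sketch. The logistic model, all coefficient values and the carrier frequency below are hypothetical numbers chosen only for illustration, and the two loci are assumed to be independent.

```python
import math


def risk(x1, x2, b0=-2.0, b1=0.2, b2=0.2, b12=2.0):
    """Disease probability under a logistic model with one G×G interaction.

    x1, x2 are 0/1 carrier indicators at two loci; the coefficients are
    hypothetical values, not estimates from any study.
    """
    eta = b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2
    return 1.0 / (1.0 + math.exp(-eta))


# Predicted risk for each of the four two-locus carrier combinations.
risks = {(x1, x2): risk(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}

# Assumed carrier frequency at each locus, with independence between loci.
carrier_freq = 0.1
joint_freq = carrier_freq ** 2  # share of the population carrying both variants

for combo, p in sorted(risks.items()):
    print(combo, round(p, 3))
print("joint carrier frequency:", round(joint_freq, 4))
```

Under these assumed values, single carriers have a risk close to that of non-carriers, whereas double carriers have a much higher risk but make up only about 1% of the population. A population-wide summary such as the ROC curve would change little, yet the prediction for that small subset would change substantially.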

Conclusions

Genetic interactions are worth studying for many reasons (Cordell, 2009; Thomas, 2010). First, modeling G×G and G×E interactions can increase the power to detect additional variants or genes and more accurately characterize the genetic effects. Second, detection and characterization of genetic interactions will help elucidate the biological and biochemical pathways that underpin disease. Finally, including significant interactions in risk prediction models can have important implications for disease risk prediction and personalized medicine. Recent advances in genome-wide association studies have provided unparalleled opportunities for investigating the genetic architecture of complex diseases. However, most of these studies have used a single-locus analysis strategy and thus ignored interactions. Therefore, follow-up studies should focus on investigating genetic interactions and other complexities (Cantor et al., 2010; Manolio et al., 2009). This requires sophisticated statistical methods. As discussed in this article, a variety of approaches can be used to analyze genetic interactions. The integration of modern high-dimensional statistical methods with the specific form of genetic data and external biological knowledge will further improve the power to detect complex interactions.

Acknowledgements

This work was supported in part by the following research grants: NIH 2R01GM069430-06, NIH R01 GM077490, and NIH R01 CA112520-06A2.

References

  1. Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, et al. Medical sequencing at the extremes of human body mass. Am J Hum Genet. 2007;80:779–91. doi: 10.1086/513471.
  2. Akaike H. Fitting autoregressive models for prediction. Ann. Inst. Stat. Math. 1969;21:243–247.
  3. Armagan A, Zaretzki RL. Model selection via adaptive shrinkage with t priors. Comput Stat. 2010;25:441–461.
  4. Azzopardi D, Dallosso AR, Eliason K, Hendrickson BC, Jones N, Rawstorne E, et al. Multiple rare nonsynonymous variants in the adenomatous polyposis coli gene predispose to colorectal adenomas. Cancer Res. 2008;68:358–63. doi: 10.1158/0008-5472.CAN-07-5733.
  5. Bae K, Mallick B. Gene selection using a two-level hierarchical Bayesian model. Bioinformatics. 2004;20:3423–30. doi: 10.1093/bioinformatics/bth419.
  6. Baierl A, Bogdan M, Frommlet F, Futschik A. On locating multiple interacting quantitative trait loci in intercross designs. Genetics. 2006;173:1693–703. doi: 10.1534/genetics.105.048108.
  7. Berrington A, Cox DR. Interpretation of interaction: A review. Ann Appl Stat. 2007;1:371–385.
  8. Bjørnvold M, Undlien D, Joner G, Dahl-Jørgensen K, Njølstad P, Akselsen H, et al. Joint effects of HLA, INS, PTPN22 and CTLA4 genes on the risk of type 1 diabetes. Diabetologia. 2008;51:589–96. doi: 10.1007/s00125-008-0932-0.
  9. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008;40:695–701. doi: 10.1038/ng.f.136.
  10. Bogdan M, Ghosh J, Doerge R. Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics. 2004;167:989–99. doi: 10.1534/genetics.103.021683.
  11. Box GEP, Cox DR. An analysis of transformations (with discussion). J. R. Statist. Soc. B. 1964;26:211–252.
  12. Broman K, Wu H, Sen S, Churchill G. R/qtl: QTL mapping in experimental crosses. Bioinformatics. 2003;19:889–90. doi: 10.1093/bioinformatics/btg112.
  13. Broman KW, Speed TP. A model selection approach for the identification of quantitative trait loci in experimental crosses. J. R. Stat. Soc. B. 2002;64:641–656.
  14. Cantor R, Lange K, Sinsheimer J. Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am J Hum Genet. 2010;86:6–22. doi: 10.1016/j.ajhg.2009.11.017.
  15. Carlborg O, Haley C. Epistasis: too often neglected in complex trait studies? Nat Rev Genet. 2004;5:618–25. doi: 10.1038/nrg1407.
  16. Chen X, Liu C, Zhang M, Zhang H. A forest-based approach to identifying gene and gene-gene interactions. Proc Natl Acad Sci U S A. 2007;104:19199–203. doi: 10.1073/pnas.0709868104.
  17. Chipman H. Bayesian variable selection with related predictors. Canadian Journal of Statistics. 1996;24:17–36.
  18. Choi NH, Li W, Zhu J. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association. 2010;105:354–364.
  19. Cirulli ET, Goldstein DB. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet. 2010;11:415–25. doi: 10.1038/nrg2779.
  20. Clark AG. Limits to prediction of phenotype from knowledge of genotypes. In: Clegg M, et al., editors. Limits to knowledge in evolutionary genetics. Kluwer Academic/Plenum Publishers; New York: 2000. pp. 205–224.
  21. Clayton D. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet. 2009;5:e1000540. doi: 10.1371/journal.pgen.1000540.
  22. Cohen JC, Boerwinkle E, Mosley TH, Jr., Hobbs HH. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N Engl J Med. 2006;354:1264–72. doi: 10.1056/NEJMoa054013.
  23. Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004;305:869–72. doi: 10.1126/science.1099870.
  24. Cordell H. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet. 2002;11:2463–8. doi: 10.1093/hmg/11.20.2463.
  25. Cordell H. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009;10:392–404. doi: 10.1038/nrg2579.
  26. Cox DR. Interaction. International Statistical Review. 1984;52:1–31.
  27. de los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, et al. Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics. 2009;182:375–85. doi: 10.1534/genetics.109.101501.
  28. Dunson DB, Herring AH, Engle SM. Bayesian selection and clustering of polymorphisms in functionally related genes. Journal of the American Statistical Association. 2008;103:534–546.
  29. Efron B, Hastie T, Johnstone I, Tibshirani R. Least Angle Regression. The Annals of Statistics. 2004;32:407–451.
  30. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11:446–50. doi: 10.1038/nrg2809.
  31. Figueiredo MAT. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003;25:1150–1159.
  32. Flint J, Mackay T. Genetic architecture of quantitative traits in mice, flies, and humans. Genome Res. 2009;19:723–33. doi: 10.1101/gr.086660.108.
  33. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
  34. Gelman A, Carlin J, Stern H, Rubin D. Bayesian data analysis. Chapman and Hall; London: 2003.
  35. Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press; New York: 2007.
  36. Gelman A, Jakulin A, Pittau MG, Su YS. A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics. 2008;2:1360–1383.
  37. Genkin A, Lewis DD, Madigan D. Large-scale Bayesian logistic regression for text categorization. Technometrics. 2007;49:291–304.
  38. Gill J. Interpreting Interactions and Interaction Hierarchies in Generalized Linear Models: Issues and Applications. Presented at the Annual Meeting of the American Political Science Association; San Francisco. 2001.
  39. Green MJ, Medley GF, Browne WJ. Use of posterior predictive assessments to evaluate model fit in multilevel logistic regression. Vet. Res. 2009;40:30–40. doi: 10.1051/vetres/2009013.
  40. Griffin JE, Brown PJ. Bayesian adaptive lassos with non-convex penalization. IMSAS, University of Kent; 2007. Technical report.
  41. Hamada M, Wu C. Analysis of designed experiments with complex aliasing. Journal of Quality Technology. 1992;24:130–137.
  42. Hans C. Bayesian lasso regression. Biometrika. 2009;96:835–845.
  43. Hardy J, Singleton A. Genomewide association studies and human disease. N Engl J Med. 2009;360:1759–68. doi: 10.1056/NEJMra0808700.
  44. Hayashi T, Iwata H. EM algorithm for Bayesian estimation of genomic breeding values. BMC Genet. 2010;11:3. doi: 10.1186/1471-2156-11-3.
  45. Hesterberg T, Choi NH, Meier L, Fraley C. Least angle and L1 penalized regression: A review. Statistics Surveys. 2008;2:61–93.
  46. Hindorff L, Sethupathy P, Junkins H, Ramos E, Mehta J, Collins F, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106:9362–7. doi: 10.1073/pnas.0903103106.
  47. Hoerl AE, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
  48. Hoggart C, Whittaker J, De Iorio M, Balding D. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 2008;4:e1000130. doi: 10.1371/journal.pgen.1000130.
  49. Hung R, Brennan P, Malaveille C, Porru S, Donato F, Boffetta P, et al. Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiol Biomarkers Prev. 2004;13:1013–21.
  50. Jakobsdottir J, Gorin M, Conley Y, Ferrell R, Weeks D. Interpretation of genetic association studies: markers with replicated highly significant odds ratios may be poor classifiers. PLoS Genet. 2009;5:e1000337. doi: 10.1371/journal.pgen.1000337.
  51. Ji W, Foo JN, O'Roak BJ, Zhao H, Larson MG, Simon DB, et al. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat Genet. 2008;40:592–9. doi: 10.1038/ng.118.
  52. Kao C, Zeng Z. Modeling epistasis of quantitative trait loci using Cockerham's model. Genetics. 2002;160:1243–61. doi: 10.1093/genetics/160.3.1243.
  53. Kao C, Zeng Z, Teasdale R. Multiple interval mapping for quantitative trait loci. Genetics. 1999;152:1203–16. doi: 10.1093/genetics/152.3.1203.
  54. Kooperberg C, Leblanc M, Dai J, Rajapakse I. Structures and assumptions: strategies to harness gene x gene and gene x environment interactions in GWAS. Stat Sci. 2009;24:472–488. doi: 10.1214/09-sts287.
  55. Kraft P, Hunter D. Genetic risk prediction - are we there yet? N Engl J Med. 2009;360:1701–3. doi: 10.1056/NEJMp0810107.
  56. Kraft P, Wacholder S, Cornelis M, Hu F, Hayes R, Thomas G, et al. Beyond odds ratios - communicating disease risk based on genetic profiles. Nat Rev Genet. 2009;10:264–9. doi: 10.1038/nrg2516.
  57. Kyung M, Gill J, Ghosh M, Casella G. Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis. 2010;5:369–412.
  58. Lee S, van der Werf J, Hayes B, Goddard M, Visscher P. Predicting unobserved phenotypes for complex traits from whole-genome SNP data. PLoS Genet. 2008;4:e1000231. doi: 10.1371/journal.pgen.1000231.
  59. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–21. doi: 10.1016/j.ajhg.2008.06.024.
  60. Li J, Reynolds R, Pomp D, Allison D, Yi N. Mapping interacting QTL for count phenotypes using hierarchical Poisson and binomial models: an application to reproductive traits in mice. Genet Res. 2010;92:13–23. doi: 10.1017/S0016672310000029.
  61. Lou X, Chen G, Yan L, Ma J, Zhu J, Elston R, et al. A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence. Am J Hum Genet. 2007;80:1125–37. doi: 10.1086/518312.
  62. Lynch M, Walsh B. Genetics and Analysis of Quantitative Traits. Sinauer Associates, Inc.; Sunderland, MA: 1998.
  63. Mackay T. The genetic architecture of quantitative traits. Annu Rev Genet. 2001;35:303–39. doi: 10.1146/annurev.genet.35.102401.090633.
  64. Mackay T, Stone E, Ayroles J. The genetics of quantitative traits: challenges and prospects. Nat Rev Genet. 2009;10:565–77. doi: 10.1038/nrg2612.
  65. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384.
  66. Malo N, Libiger O, Schork N. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet. 2008;82:375–85. doi: 10.1016/j.ajhg.2007.10.012.
  67. Manichaikul A, Moon J, Sen S, Yandell B, Broman K. A model selection approach for the identification of quantitative trait loci in experimental crosses, allowing epistasis. Genetics. 2009;181:1077–86. doi: 10.1534/genetics.108.094565.
  68. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–53. doi: 10.1038/nature08494.
  69. Marchini J, Donnelly P, Cardon L. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005;37:413–7. doi: 10.1038/ng1537.
  70. McCullagh P, Nelder JA. Generalized linear models. Chapman and Hall; London: 1989.
  71. Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–1829. doi: 10.1093/genetics/157.4.1819.
  72. Moore J. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56:73–82. doi: 10.1159/000073735.
  73. Moore J, Williams S. Epistasis and its implications for personal genetics. Am J Hum Genet. 2009;85:309–20. doi: 10.1016/j.ajhg.2009.08.006.
  74. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34:188–93. doi: 10.1002/gepi.20450.
  75. Musani S, Shriner D, Liu N, Feng R, Coffey C, Yi N, et al. Detection of gene x gene interactions in genome-wide association studies of human population data. Hum Hered. 2007;63:67–84. doi: 10.1159/000099179.
  76. Mutshinda C, Sillanpää M. Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics. 2010 doi: 10.1534/genetics.110.119586. in press.
  77. Nejentsev S, Walker N, Riches D, Egholm M, Todd JA. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009;324:387–9. doi: 10.1126/science.1167728.
  78. Nelder J. The statistics of linear models: back to basics. Statistics and Computing. 1994;4:221–234.
  79. Park M, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9:30–50. doi: 10.1093/biostatistics/kxm010.
  80. Park T, Casella G. The Bayesian Lasso. Journal of the American Statistical Association. 2008;103:681–686.
  81. Phillips P. Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008;9:855–67. doi: 10.1038/nrg2452.
  82. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, et al. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86:832–8. doi: 10.1016/j.ajhg.2010.04.005.
  83. Rebbeck T, Spitz M, Wu X. Assessing the function of genetic variants in candidate gene association studies. Nat Rev Genet. 2004;5:589–97. doi: 10.1038/nrg1403.
  84. Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69:138–47. doi: 10.1086/321276.
  85. Romeo S, Pennacchio LA, Fu Y, Boerwinkle E, Tybjaerg-Hansen A, Hobbs HH, et al. Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat Genet. 2007;39:513–6. doi: 10.1038/ng1984.
  86. Romeo S, Yin W, Kozlitina J, Pennacchio LA, Boerwinkle E, Hobbs HH, et al. Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. J Clin Invest. 2009;119:70–9. doi: 10.1172/JCI37118.
  87. Schork NJ, Murray SS, Frazer KA, Topol EJ. Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev. 2009;19:212–9. doi: 10.1016/j.gde.2009.04.010.
  88. Schwarz G. Estimating the dimension of a model. Ann. Stat. 1978;6:461–464.
  89. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. J. R. Statist. Soc. B. 2002;64:583–616.
  90. Sun W, Ibrahim J, Zou F. Genomewide multiple-loci mapping in experimental crosses by iterative adaptive penalized regression. Genetics. 2010;185:349–59. doi: 10.1534/genetics.110.114280.
  91. Tanck M, Jukema J, Zwinderman A. Simultaneous estimation of gene-gene and gene-environment interactions for numerous loci using double penalized log-likelihood. Genet Epidemiol. 2006;30:645–51. doi: 10.1002/gepi.20176.
  92. Thomas D. Statistical Methods in Genetic Epidemiology. Oxford Univ. Press; 2004.
  93. Thomas D. Gene-environment-wide association studies: emerging approaches. Nat Rev Genet. 2010;11:259–272. doi: 10.1038/nrg2764.
  94. Thomas DC, Conti DV, Baurley J, Nijhout F, Reed M, Ulrich CM. Use of pathway information in molecular epidemiology. Hum. Genomics. 2009;4:21–42. doi: 10.1186/1479-7364-4-1-21.
  95. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B. 1996;58:267–288.
  96. VanderWeele T. Epistatic interactions. Stat Appl Genet Mol Biol. 2010;9:Article 1. doi: 10.2202/1544-6115.1517.
  97. VanderWeele T, Laird N. Tests for compositional epistasis under single interaction-parameter models. Ann Hum Genet. 2010. doi: 10.1111/j.1469-1809.2010.00600.x.
  98. Wang T, Zeng Z. Models and partition of variance for quantitative trait loci with epistasis and linkage disequilibrium. BMC Genet. 2006;7:9. doi: 10.1186/1471-2156-7-9.
  99. Wang T, Zeng Z. Contribution of genetic effects to genetic variance components with epistasis and linkage disequilibrium. BMC Genet. 2009;10:52. doi: 10.1186/1471-2156-10-52.
  100. Wei Z, Wang K, Qu H, Zhang H, Bradfield J, Kim C, et al. From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet. 2009;5:e1000678. doi: 10.1371/journal.pgen.1000678.
  101. Wray N, Goddard M, Visscher P. Prediction of individual genetic risk of complex disease. Curr Opin Genet Dev. 2008;18:257–63. doi: 10.1016/j.gde.2008.07.006.
  102. WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
  103. Wu T, Chen Y, Hastie T, Sobel E, Lange K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25:714–21. doi: 10.1093/bioinformatics/btp041.
  104. Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2008;2:224–244.
  105. Xu S. Estimating polygenic effects using markers of the entire genome. Genetics. 2003;163:789–801. doi: 10.1093/genetics/163.2.789.
  106. Xu S. An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics. 2007;63:513–21. doi: 10.1111/j.1541-0420.2006.00711.x.
  107. Xu S. An expectation-maximization algorithm for the Lasso estimation of quantitative trait locus effects. Heredity. 2010. doi: 10.1038/hdy.2009.180.
  108. Yang J, Benyamin B, McEvoy B, Gordon S, Henders A, Nyholt D, Madden P, Heath A, Martin N, Montgomery G, Goddard M, Visscher P. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics. 2010;42:565–569. doi: 10.1038/ng.608.
  109. Yandell B, Mehta T, Banerjee S, Shriner D, Venkataraman R, Moon J, et al. R/qtlbim: QTL with Bayesian Interval Mapping in experimental crosses. Bioinformatics. 2007;23:641–3. doi: 10.1093/bioinformatics/btm011.
  110. Yi N. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics. 2004;167:967–75. doi: 10.1534/genetics.104.026286.
  111. Yi N, Banerjee S. Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics. 2009;181:1101–13. doi: 10.1534/genetics.108.099556.
  112. Yi N, Banerjee S, Pomp D, Yandell B. Bayesian mapping of genomewide interacting quantitative trait loci for ordinal traits. Genetics. 2007a;176:1855–64. doi: 10.1534/genetics.107.071142.
  113. Yi N, Kaklamani VG, Pasche B. Bayesian analysis of genetic interactions in case-control studies, with application to adiponectin genes and colorectal cancer risk. Annals of Human Genetics. 2010. doi: 10.1111/j.1469-1809.2010.00605.x.
  114. Yi N, Shriner D. Advances in Bayesian multiple quantitative trait loci mapping in experimental crosses. Heredity. 2008;100:240–52. doi: 10.1038/sj.hdy.6801074.
  115. Yi N, Shriner D, Banerjee S, Mehta T, Pomp D, Yandell B. An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects. Genetics. 2007b;176:1865–77. doi: 10.1534/genetics.107.071365.
  116. Yi N, Xu S. Bayesian LASSO for quantitative trait loci mapping. Genetics. 2008;179:1045–55. doi: 10.1534/genetics.107.085589.
  117. Yi N, Yandell B, Churchill G, Allison D, Eisen E, Pomp D. Bayesian model selection for genome-wide epistatic quantitative trait loci analysis. Genetics. 2005;170:1333–44. doi: 10.1534/genetics.104.040386.
  118. Yuan M, Joseph V, Lin Y. An efficient variable selection approach for analyzing designed experiments. Technometrics. 2007;49:430–439.
  119. Zeng Z, Kao C, Basten C. Estimating the genetic architecture of quantitative traits. Genet Res. 1999;74:279–89. doi: 10.1017/s0016672399004255.
  120. Zeng Z, Wang T, Zou W. Modeling quantitative trait loci and interpretation of models. Genetics. 2005;169:1711–25. doi: 10.1534/genetics.104.035857.
  121. Zhang Y, Liu J. Bayesian inference of epistatic interactions in case-control studies. Nat Genet. 2007;39:1167–73. doi: 10.1038/ng2110.
  122. Zhang Y, Xu S. A penalized maximum likelihood method for estimating epistatic effects of QTL. Heredity. 2005;95:96–104. doi: 10.1038/sj.hdy.6800702.
  123. Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics. 2009;37:3468–3497.
  124. Zhou H, Sehl M, Sinsheimer J, Lange K. Association screening of common and rare genetic variants by penalized regression. Bioinformatics. 2010;26:2375–82. doi: 10.1093/bioinformatics/btq448.
  125. Zou F, Huang H, Lee S, Hoeschele I. Nonparametric Bayesian variable selection with applications to multiple quantitative trait loci mapping with epistasis and gene-environment interaction. Genetics. 2010;186:385–394. doi: 10.1534/genetics.109.113688.
  126. Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. 2005;67:301–320.
