Genetics. 2019 Oct 3;213(4):1209–1224. doi: 10.1534/genetics.119.302658

Scalable Nonparametric Prescreening Method for Searching Higher-Order Genetic Interactions Underlying Quantitative Traits

Juho A J Kontio *, Mikko J Sillanpää *,†,1
PMCID: PMC6893368  PMID: 31585953

The Gaussian process (GP) regression is theoretically capable of capturing higher-order gene-by-gene interactions important to trait variation non-exhaustively with high accuracy. Unfortunately, GP approach is scalable only for 100-200 genes and thus, not applicable for high...

Keywords: higher-order gene-by-gene interactions, Gaussian process regression, nonlinear dimension reduction, Haseman-Elston regression, Gaussian kernel models, acute myeloid leukemia

Abstract

Gaussian process (GP)-based automatic relevance determination (ARD) is known to be an efficient technique for identifying determinants of gene-by-gene interactions important to trait variation. However, the estimation of GP models is feasible only for low-dimensional datasets (∼200 variables), which severely limits application of the GP-based ARD method for high-throughput sequencing data. In this paper, we provide a nonparametric prescreening method that preserves virtually all the major benefits of the GP-based ARD method and extends its scalability to the typical high-dimensional datasets used in practice. In several simulated test scenarios, the proposed method compared favorably with existing nonparametric dimension reduction/prescreening methods suitable for higher-order interaction searches. As a real-data example, the proposed method was applied to a high-throughput dataset downloaded from The Cancer Genome Atlas (TCGA) with measured expression levels of 16,976 genes (after preprocessing) from patients diagnosed with acute myeloid leukemia.


Recent biological literature indicates that robust identification of higher-order gene-by-gene interactions would help to explain a broad spectrum of open biological/medical questions (e.g., Ernst et al. 2007; Phillips 2008; Wei et al. 2014; Taylor and Ehrenreich 2015; Ehrenreich 2017). In genomic association studies, gene-by-gene interactions are typically identified via exhaustive search by enumerating all possible pseudovariables (product variables) in the field of interest (see Wei et al. 2014). However, the number of pseudovariables grows so rapidly with the number of genes and the interaction order that exhaustive enumeration quickly becomes infeasible, imposing a serious computational challenge for typical genetic applications with high-throughput sequencing datasets (Cordell 2009; Wei et al. 2014).

Ideally, the dimension of a feature space could be reduced without enumerating all possible pseudovariables, producing fewer terms to be considered. However, a phenotype may exhibit strong gene-by-gene interaction effects in the presence of negligible main effects (Wei et al. 2014). Therefore, individual genes contributing multiplicatively to phenotypic variation may not be identifiable by linear prescreening/dimension reduction methods. Gaussian process (GP)-model-based automatic relevance determination (ARD) (Neal 1996; MacKay 1998; Rasmussen and Williams 2006) is known to be an efficient nonparametric technique for identifying determinants of gene-by-gene interactions, even when they do not exhibit marginal association with the phenotype (Zou et al. 2010). Indeed, it has been shown that powered exponential kernels implicitly enumerate all possible interaction effects of individual covariates [a special case is shown in Jiang and Reif (2015)], so they seem to provide a reasonable solution to the given dimension reduction problem. Nonetheless, a major drawback of GP regression is the computational cost of learning an unknown covariance structure, which is about $O(n^3)$. Markov chain Monte Carlo (MCMC)-based implementations of GP regression can handle ∼200 variables (Zou et al. 2010), and, despite some computational advancements, including a penalized maximum likelihood approach (Shi and Choi 2011; Savitsky et al. 2011; Yi et al. 2011; Moore et al. 2016), the applicable number of variables still remains in the hundreds.

In this paper, we propose a novel prescreening/dimension reduction method based on the powered exponential kernel to ensure the theoretical capability of finding determinants of gene-by-gene interactions up to any order. This forms a generalized Haseman–Elston (H–E) regression model (Haseman and Elston 1972) that preserves virtually all the major benefits of GP regression while extending scalability to datasets with tens of thousands of variables. Several related methods have been proposed in the kernel regression literature (e.g., Cover and Thomas 2006; Yamada et al. 2014). The so-called kernelized feature-wise Lasso proposed by Yamada et al. (2014) appears to be especially efficient, with the same scalability as our method. The major difference between our method and those approaches is that the method proposed in this paper is closer to the ARD approach, in which the kernel parameters are estimated from data.

Moreover, the impact of linear main effects on nonlinear effect estimation is often overlooked in practice. Ignoring this model hierarchy severely limits the potential of dimension reduction techniques by markedly increasing the number of false positives. Thus, we provide an integrative multi-stage procedure in which the linear main effects are estimated first [an approach used for exhaustive search in Kärkkäinen et al. (2015) and Mathew et al. (2018)]. The nonparametric part characterized by the powered exponential kernel is then used to explain the between-individual similarities of the estimated residual vector, which is nearly orthogonal to the hyperplane spanned by the linear effects.

This multi-stage framework is potentially relevant to a broad spectrum of biological/medical problems, since the genetic basis of phenotypic regulation often contains interaction effects but also a considerable number of non-negligible main effects (e.g., Yeang and Jaakkola 2006; Maienschein-Cline et al. 2012; Ehrenreich 2017). As an illustration, we provide several simulation scenarios (low- and high-dimensional cases), including high-order interaction terms up to the fourth order, with performance comparisons to other existing methods. As a real data example, we identify two- and three-way interactive regulatory genes for the well-known prognostic factor in acute myeloid leukemia (AML) using The Cancer Genome Atlas (TCGA) dataset with measured expression levels of 16,975 genes/proteins after preprocessing. Without dimension reduction, this would yield 144,066,825 possible interaction pairs and $>10^{11}$ possible three-way interaction terms. Because few methods scale to this size, the literature on known three-way interactions is fairly sparse at present; thus, all findings are likely to be novel (cf. Awad and Chen 2014).
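The counts quoted above follow directly from binomial coefficients. The short check below is only an illustrative sketch using Python's standard library; the variable names are not taken from the paper.

```python
from math import comb

p = 16975  # number of genes/proteins after preprocessing, as stated in the text

n_pairs = comb(p, 2)    # all unordered two-way interaction terms
n_triples = comb(p, 3)  # all unordered three-way interaction terms

print(n_pairs)             # 144066825
print(n_triples > 10**11)  # True
```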

However, verification of gene-by-gene interactions is extremely challenging and is often carried out only half-heartedly in practice (Wei et al. 2014). We provide a multi-stage verification procedure based on the guidance of Cordell (2009) and Wei et al. (2014) to substantiate identified interactions. First, true interaction terms are separated from those that appear to be significant due to strong underlying main effects and unfavorable coexpression between genes. Second, we require that the remaining gene-by-gene interactions affect the phenotype in the same direction in another independent sample (see also Milne et al. 2008). Finally, the most promising findings are further inspected to see if they overlap with experimentally verified gene/protein information from curated databases such as STRING and BioGRID.

Materials and Methods

In a regular additive multilocus model, individual genes are assumed to be associated additively with the phenotypic variation modeled by linear regression models of the form

$$Y_i = \mu + \sum_{j=1}^{p} X_{ij}\beta_j + \varepsilon_i. \tag{1}$$

In this context, $Y_i$ denotes the phenotype value of the $i$th observation, and $X_{ij}$ corresponds to the gene expression value of observation $i$ at gene $j$, each of which is centered and standardized to have a zero mean and unit variance. Moreover, $\mu$ is the population intercept, whereas $\beta_j$ is the main effect of gene $j$, and $\varepsilon_i$ is the random error following a normal distribution $N(0,\sigma^2)$.

Interaction regression model

The linear combinations of the main effects are often too limited to explain the phenotypic variation accurately. Hence, the effects of gene-by-gene interactions up to the order $k$ ($k \geq 2$) should also be included, which leads to an extension of the main effect model (1):

$$Y_i = \mu + \sum_{j=1}^{p} X_{ij}\beta_j + \dots + \sum_{1 \leq t_1,\dots,t_k \leq p} X_{it_1}\cdots X_{it_k}\,\beta_{t_1,\dots,t_k} + \varepsilon_i. \tag{2}$$

Here, $\beta_j$ represents the main effect for gene $j$, similar to model (1), and $\beta_{t_1,\dots,t_k}$ is the gene-by-gene interaction effect between genes $X_{it_1},\dots,X_{it_k}$. The random error $\varepsilon_i$ is assumed to follow a normal distribution with a mean of zero and variance equal to $\sigma^2$. Note that this formulation includes all higher-order multiplicative terms of individual genes ($X_j^2, X_j^3, \dots$).

A short preface to the proposed strategy:

It is widely acknowledged that covariates which contribute multiplicatively to the phenotypic variation may not be identifiable marginally by linear association studies (see e.g., Cordell 2009; Mackay 2014). However, individual genes related to important gene-by-gene interactions are associated with the phenotype nonlinearly through an unknown link-function (see e.g., Sailer and Harms 2017). That is, each interaction relationship, for instance $Y = X_k X_l + \varepsilon$, can be represented with the marginal phenotype-relationships for both genes $X_k$ and $X_l$ as

$$Y = f_k(X_k) + \varepsilon_k \quad \text{and} \quad Y = f_l(X_l) + \varepsilon_l$$

with unknown nonlinear functions $f_k(\cdot)$, $f_l(\cdot)$ and the corresponding marginal error terms $\varepsilon_k$ and $\varepsilon_l$. Nonlinear associations between phenotypic variation and interactive covariates could therefore be transformed to be linearly identifiable once the form of the transformation function is properly chosen. The proposed transformation is motivated by the Gaussian-kernel-based ARD (Neal 1996; MacKay 1998; Rasmussen and Williams 2006) due to its known capacity to identify determinants of important gene-by-gene interactions from their nonlinear marginal effects (Zou et al. 2010). However, instead of estimating the unknown covariance structure for individuals, we propose a semiparametric model that explains the phenotypic similarity of individuals with the matched kernel entries. This can be seen as an H–E type regression model (Haseman and Elston 1972; Sham and Purcell 2001), upon which our framework is based.
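The point that a purely multiplicative effect is invisible to a linear marginal analysis but visible after a suitable nonlinear transformation can be illustrated with a small simulation. This is only a sketch under assumptions that are not from the paper: two standard normal genes, a noise standard deviation of 0.5, and squaring as the (hypothetical) transformation, which is not the kernel-based transformation proposed here.

```python
import random

random.seed(1)
n = 5000

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

xk = [random.gauss(0, 1) for _ in range(n)]
xl = [random.gauss(0, 1) for _ in range(n)]
# Pure two-way interaction without main effects (noise sd 0.5 is an arbitrary choice)
y = [a * b + random.gauss(0, 0.5) for a, b in zip(xk, xl)]

linear = corr(y, xk)                                        # close to zero
nonlinear = corr([v * v for v in y], [a * a for a in xk])   # clearly positive
print(round(linear, 3), round(nonlinear, 3))
```

In this toy setting the linear marginal correlation is near zero, whereas the correlation between the squared phenotype and the squared gene is clearly positive, which is the kind of nonlinear marginal signal the proposed strategy aims to exploit.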

Haseman–Elston regression model

The H–E regression method was introduced for linkage testing using sib-pair data. In the original H–E approach, the squared phenotypic difference of individuals was regressed on the proportion of alleles shared identical-by-descent at an observed marker locus. Generally, if the phenotype for the ith individual is Yi, the H–E regression model can be represented as

$$f(Y_i,Y_j) = \mu + \beta\,\Omega_{i,j} + \varepsilon_{i,j} \tag{3}$$

in which $f(Y_i,Y_j)$ represents an arbitrary measure of the phenotypic difference and $\Omega_{i,j}$ is an arbitrary measure of the genetic difference. In practice, the genetic and phenotypic differences of a pair of individuals are often modeled in the regression framework as

$$(Y_i - Y_j)^2 = \mu + \beta\,(X_i - X_j) + \varepsilon_{i,j}. \tag{4}$$

Alternative metrics between phenotype observations can be found, for instance, in the work of Sham and Purcell (2001). Some of these metrics, including the one used in this paper, are clearly not difference measures, but the models based on those measures are still referred to as H–E regression models.

Note that when the H–E regression estimators are applied over the entire population for each possible pair of individuals, the datapoints can be dependent, conditional on predictor variables, such that $Y_i - Y_h$ is correlated with $Y_i - Y_k$. However, these dependencies between the datapoints could be accounted for by additional random effect terms $u_{i,j}$ with an appropriate covariance structure, such that $u_{i,j} \perp \varepsilon_{i,j}$ and

$$(Y_i - Y_j)^2 = \mu + \beta\,(X_i - X_j) + u_{i,j} + \varepsilon_{i,j}. \tag{5}$$

That is, the random effects $u_{i,j}$ are assumed to be independent of the residual terms $\varepsilon_{i,j}$ (their distributional assumptions are discussed later).

Gaussian process model—powered exponential kernel

In the original GP regression, the estimation of an unknown function involves estimating the covariance structure $\Sigma_{ij}$ between individuals, which can be modeled, for instance, through a kernel function of the form

$$K(X_i,X_j;\rho) = \varrho \exp\left(-\sum_{k=1}^{p} \rho_k\,|X_{ik} - X_{jk}|^{\gamma}\right), \tag{6}$$

where $X \in \mathbb{R}^p$ is the random vector $[X_1,\dots,X_p]^T$ and $\varrho, \rho_k$ are free parameters. This covariance function form is known as the powered exponential kernel (Neal 1996). Parameter $\varrho$ regulates the magnitude of the exponential part, reflecting the variation in vertical scale, and $\gamma \in\, ]0,2]$ regulates the covariance smoothness (see Shi and Choi 2011). In the special case $\gamma = 2$, the covariance function is the popular Gaussian kernel [e.g., used in Zou et al. (2010) and Bobb et al. (2015)]. However, the Gaussian kernel belongs to the class of infinitely differentiable functions and has been criticized for being overly smooth for many real-life applications (Rasmussen and Williams 2006). By using the powered exponential kernel instead, the covariance smoothness can be regulated by the parameter $\gamma \in\, ]0,2]$.

The inverse of $\rho_k$ is often called a bandwidth parameter. The bandwidth parameter $1/\rho_k$ represents the length scale characterizing how fast the covariance function is expected to vary along the axis of $X_k$. The shorter the characteristic length of $X_k$, the faster the covariance function will vary as a function of $X_k$. The importance of each variable $X_k$ can therefore be determined by a variable-specific bandwidth parameter $1/\rho_k$ or, equivalently, an inverse bandwidth parameter $\rho_k$. For example, $\rho_k \approx 0$ indicates that the covariance function is expected to be practically a constant function of variable $X_k$, which is therefore deemed irrelevant (Neal 1996; MacKay 1998; Rasmussen and Williams 2006). Conversely, a large $\rho_k$ represents a short length scale, implying that variable $X_k$ is of high importance.
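The role of the inverse bandwidth parameters can be made concrete with a direct implementation of the powered exponential kernel. The function name, argument names, and toy values below are illustrative assumptions rather than part of the authors' code.

```python
import math

def powered_exponential_kernel(xi, xj, rho, gamma=2.0, scale=1.0):
    """K(xi, xj; rho) = scale * exp(-sum_k rho_k * |xi_k - xj_k|**gamma)."""
    if not 0.0 < gamma <= 2.0:
        raise ValueError("gamma must lie in ]0, 2]")
    s = sum(r * abs(a - b) ** gamma for r, a, b in zip(rho, xi, xj))
    return scale * math.exp(-s)

rho = [1.5, 0.0]  # the second variable has rho = 0, i.e. it is deemed irrelevant
k_a = powered_exponential_kernel([0.0, 0.0], [1.0, 0.0], rho)
k_b = powered_exponential_kernel([0.0, 0.0], [1.0, 9.0], rho)  # only X_2 changed
k_same = powered_exponential_kernel([0.3, -1.0], [0.3, -1.0], rho, scale=2.0)
```

Here `k_a` and `k_b` coincide because the kernel is constant along the axis whose inverse bandwidth is zero, and `k_same` equals the magnitude parameter because identical inputs give a zero exponent.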

Representing higher-order interactions implicitly with kernels

In conjunction with the Gaussian process ARD (Neal 1996; MacKay 1998), we adopt a kernel-based nonlinear regression technique where a kernel function is used for implicitly transforming the original model matrix into a higher-dimensional Hilbert space. Let $\mathcal{X} \subseteq \mathbb{R}^p$ be the sample space of the set of individual genes $X = [X_1,\dots,X_p]^T$. The set of all interaction terms in the model (2) can be considered as the projections of individual genes $X \in \mathcal{X}$ in a higher-dimensional feature space $\mathcal{F} \subseteq \mathbb{R}^d$ defined by an explicit feature map

$$\varphi_X : \mathcal{X} \subseteq \mathbb{R}^p \to \mathcal{F} \subseteq \mathbb{R}^d \tag{7}$$

that enumerates all $d$ possible interactions up to a specific order $l$, where $1 \leq l \leq p$. The number of coordinates $d = \sum_{k=1}^{l}\binom{p}{k} + (l-1)p$ in the mapping $\varphi_X : \mathcal{X} \to \mathcal{F}$ therefore quickly becomes infeasible as a function of $l$, imposing a serious computational challenge if the estimation proceeds explicitly in the interaction feature space $\mathcal{F}$.
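To see how quickly the explicit feature space grows, the coordinate count $d$ can be evaluated for a moderate number of genes. The helper name and the choice of 200 genes are assumptions made for illustration.

```python
from math import comb

def n_interaction_features(p, l):
    """d = sum_{k=1}^{l} C(p, k) + (l - 1) * p, following the formula in the text."""
    if not 1 <= l <= p:
        raise ValueError("need 1 <= l <= p")
    return sum(comb(p, k) for k in range(1, l + 1)) + (l - 1) * p

# Growth of the explicit feature space with the interaction order for p = 200
for order in (1, 2, 3, 4):
    print(order, n_interaction_features(200, order))
```

For instance, with three genes and interactions up to order two, the count is nine: three main effects, three pairwise products, and three squared terms.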

However, it is known (see e.g., Rasmussen and Williams 2006) that for every positive semidefinite kernel function $K(\cdot,\cdot;\rho) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, a Hilbert space $\mathcal{H}$ and a mapping $\Phi(\cdot;\rho) : \mathcal{X} \to \mathcal{H}$ exist, such that, for all $X_i, X_j \in \mathcal{X}$, we have that

$$K(X_i,X_j;\rho) = \langle \Phi(X_i;\rho), \Phi(X_j;\rho) \rangle_{\mathcal{H}}, \tag{8}$$

where $\langle \cdot,\cdot \rangle_{\mathcal{H}}$ is the inner product in the Hilbert space $\mathcal{H}$. This implies that the mapping $\varphi_X(\cdot)$ in (Equation 7) can be characterized by an appropriate kernel function $K(\cdot,\cdot;\rho)$, which implicitly computes the genetic similarity measure between $\Phi(X_i;\rho)$ and $\Phi(X_j;\rho)$ in the higher-dimensional feature space $\mathcal{F} \subseteq \mathbb{R}^d$. The idea is to find a kernel function $K(\cdot,\cdot;\rho)$ such that the mapping $\Phi(\cdot;\rho)$ would be as similar as possible to the feature map $\varphi_X(\cdot)$ in (Equation 7).

Finding a suitable kernel greatly reduces the computational burden, since $K(X_i,X_j;\rho)$ is much easier to compute than the associated features $\Phi(X_i;\rho)$ and $\Phi(X_j;\rho)$. In the following sections, we will apply identity (Equation 8) to reduce the number of individual genes in $X \in \mathcal{X}$ before mapping them to the interaction feature space $\mathcal{F}$, producing fewer interaction terms to be considered.
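The identity between a kernel evaluation and an inner product of explicit features can be checked numerically on a textbook case. The sketch below deliberately uses the homogeneous second-order polynomial kernel on two variables, not the powered exponential kernel of this paper, because its explicit feature map is finite and easy to write down; all names are illustrative.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous second-order polynomial kernel (x . y)**2 for 2-vectors."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def poly2_features(x):
    """Explicit feature map: both squares and the scaled two-way product term."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

xi, xj = [1.0, 2.0], [3.0, -1.0]
implicit = poly2_kernel(xi, xj)  # one dot product, then a square
explicit = sum(a * b for a, b in zip(poly2_features(xi), poly2_features(xj)))
print(implicit, explicit)
```

Both routes give the same similarity value up to floating-point rounding, but the kernel route never forms the product term explicitly, which is the computational advantage exploited in the text.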

Powered exponential H–E regression model

Let us consider a mapping $\Phi_X : \mathcal{X} \to \mathcal{F}$ of individual genes $X$ characterized by some positive semidefinite kernel $K_X(\cdot,\cdot;\rho)$, and a phenotypic similarity function $T_Y(\cdot,\cdot) : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ with a non-negative range, where $\mathcal{Y}$ is the sample space of the phenotype $Y$. We propose a generalized H–E regression model in which the conditional expectation of the phenotypic similarity given the pair $(X_i,X_j) \in \mathcal{X} \times \mathcal{X}$ is modeled by the inner product of $\Phi(X_i;\rho)$ and $\Phi(X_j;\rho)$ in $\mathcal{F}$ implicitly using the identity (Equation 8), such that

$$E\big(T_Y(Y_i,Y_j) \mid (X_i,X_j)\big) = K_X(X_i,X_j;\rho). \tag{9}$$

In particular, the importance of each gene is characterized by the gene-wise kernel parameter vector $\rho \in \mathbb{R}^p$ estimated from data as in the original ARD approach (Neal 1996; MacKay 1998).

As we have defined $T_Y(\cdot,\cdot) : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{+}$ to be a non-negative similarity measure, it is natural to adopt a multiplicative error model (see e.g., Eagleson and Muller 1997) suitable for positive responses around the mean $K_X(X_i,X_j;\rho)$, i.e.,

$$T_Y(Y_i,Y_j) = K_X(X_i,X_j;\rho)\,\varepsilon_{i,j}, \tag{10}$$

where $E(\varepsilon_{i,j}) = 1$ and $\varepsilon_{i,j} > 0$, with a probability of one for all pairs $(i,j)$ of individuals.

The phenotypic similarity function $T_{Y,\omega}(Y_i,Y_j)$ is defined in this paper as the inverse squared difference $(Y_i - Y_j)^{-2}$ if $|Y_i - Y_j| > \omega > 0$ and $\omega^{-2}$ if $|Y_i - Y_j| \leq \omega$. For notational convenience, we assume that $P(|Y_i - Y_j| > 0) = 1$ when $i \neq j$ for a given dataset and set $\omega > 0$ to be smaller than $\min_{i \neq j}|Y_i - Y_j|$, implying that $T_{Y,\omega}(Y_i,Y_j) = \omega^{-2}$ only when $i = j$. Further, $K_X(\cdot,\cdot;\rho)$ was chosen to be the powered exponential kernel, which will be shown later to be well suited for finding higher-order interactions between gene-wise similarity measures in $\mathcal{F}$. This yields the proposed powered exponential Haseman–Elston (PH–E) model, written as

$$T_{Y,\omega}(Y_i,Y_j) = \varrho \exp\left(-\sum_{k=1}^{p} \rho_k\,|X_{ik} - X_{jk}|^{\gamma}\right)\varepsilon_{i,j}, \tag{11}$$

where $\varepsilon_{i,j}$ are assumed to be identically distributed positive error terms with $E(\varepsilon_{i,j}) = 1$ for all pairs $(i,j)$ of individuals. In this paper, the parameter $\gamma \in\, ]0,2]$ regulating kernel smoothness is treated as fixed. However, the role of this parameter in model estimates is considered in the simulation section.

Dimension reduction—estimation of inverse bandwidth parameters

To estimate the inverse bandwidth parameters $\rho \in \mathbb{R}^p$ for dimension reduction, and to obtain an additive error term (see e.g., Eagleson and Muller 1997), we transform both sides of the model (10) with a strictly monotone increasing transformation function $\Psi : \mathbb{R}_{+} \to \mathbb{R}$, for which

$$\Psi\big(K_X(X_i,X_j;\rho)\,\varepsilon_{i,j}\big) = \Psi\big(K_X(X_i,X_j;\rho)\big) + \Psi(\varepsilon_{i,j}). \tag{12}$$

In this paper, we use a logarithmic transformation that turns the PH–E model (11) into a regular multiple regression model over individual pairs $(i,j)$, where $i \neq j$ so that $T_{Y,\omega}(Y_i,Y_j)$ is always equal to $(Y_i - Y_j)^{-2}$. The logarithmic transformation removes the exponential form of the right-hand side and isolates the inverse bandwidth parameters, i.e.,

$$\log\big((Y_i - Y_j)^{-2}\big) = \log(\varrho) + \sum_{k=1}^{p} (-\rho_k)\,|X_{ik} - X_{jk}|^{\gamma} + \log(\varepsilon_{i,j}). \tag{13}$$

In the light of the original PH–E model (11), the intercept $\log(\varrho)$ can be interpreted as a logarithmic magnitude parameter and $\log(\varepsilon_{i,j})$ as the corresponding logarithmic error term. Here, the signs of the inverse bandwidth parameters $\rho_k$ ($k = 1,\dots,p$) are written as negative for notational consistency with the standard linear models. Note that the pseudo-observations $\log((Y_i - Y_j)^{-2})$ and $|X_{ik} - X_{jk}|^{\gamma}$ for all possible pairs of individuals $i$ and $j$ ($i<j$ by the symmetry property of kernels) and some fixed $\gamma \in\, ]0,2]$ are created beforehand, yielding a dataset with a sample size equal to $n(n-1)/2$. The bandwidth parameters can now be estimated with linear estimation techniques by using the pseudovariables $\log((Y_i - Y_j)^{-2})$ and $|X_{ik} - X_{jk}|^{\gamma}$ in place of dependent and independent variables.
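The construction of the pseudo-observations can be sketched in a few lines. The function below is an illustrative assumption, not the authors' R code; it takes a phenotype list and a row-wise expression matrix and assumes all phenotype values are distinct, as in the text.

```python
import math
from itertools import combinations

def pseudo_observations(y, X, gamma=1.0):
    """Return (response, Z) over all pairs i < j.

    response[r] = log((y_i - y_j)**-2)
    Z[r][k]     = |X[i][k] - X[j][k]|**gamma
    Assumes all phenotype values are distinct, as in the text.
    """
    response, Z = [], []
    for i, j in combinations(range(len(y)), 2):
        response.append(-2.0 * math.log(abs(y[i] - y[j])))
        Z.append([abs(a - b) ** gamma for a, b in zip(X[i], X[j])])
    return response, Z

y = [0.5, 1.5, 3.5]                       # toy phenotypes for n = 3 individuals
X = [[0.0, 1.0], [2.0, 1.0], [0.0, 4.0]]  # toy expression values for p = 2 genes
response, Z = pseudo_observations(y, X, gamma=1.0)
print(len(response))  # n(n-1)/2 = 3 pairs
```

Because the number of rows grows quadratically in the number of individuals, in practice the pairs would likely be generated in a vectorized or streamed fashion rather than with nested Python lists.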

However, the problem with model (13) is that the pseudo-observations are not mutually independent, as stated in the context of the model (5). To account for the dependencies between the pseudo-observations, one could include additional i.i.d. random effect terms $u_{i,j}$ in the model, such that $u_{i,j} \perp \log(\varepsilon_{i,j})$ and

$$\log\big((Y_i - Y_j)^{-2}\big) = \log(\varrho) + \sum_{k=1}^{p} (-\rho_k)\,|X_{ik} - X_{jk}|^{\gamma} + u_{i,j} + \log(\varepsilon_{i,j}). \tag{14}$$

The random effect vector $u \in \mathbb{R}^{n(n-1)/2}$ is assumed to follow a multivariate normal distribution $N(0,\sigma^2 G)$, where $G \in \mathbb{R}^{n(n-1)/2 \times n(n-1)/2}$ is a known expression covariance matrix between pseudo-observations. With reference to the methods where genomic relationship matrices are estimated from molecular markers located across the genome (see e.g., VanRaden 2008), matrix $G$ could be estimated from the model matrix $Z \in \mathbb{R}^{n(n-1)/2 \times p}$ consisting of all pseudovariables $|X_{i1} - X_{j1}|^{\gamma},\dots,|X_{ip} - X_{jp}|^{\gamma}$, such that

$$\hat{G} = \big(Z - \hat{z}_{\mathrm{row}} 1_p^T\big)\big(Z - \hat{z}_{\mathrm{row}} 1_p^T\big)^T / (p-1),$$

where $\hat{z}_{\mathrm{row}} \in \mathbb{R}^{n(n-1)/2}$ denotes the vector of row means of the model matrix $Z$, and $1_p \in \mathbb{R}^p$ is a constant vector of ones. In practice, the matrix $G$ can be calculated simply, for instance in R as cov(t(Z)). The bandwidth parameters in model (14) can then be estimated with the typical linear mixed Lasso/elastic net type of estimators (Schelldorfer et al. 2011). However, it can be shown that results do not differ substantially between models (13) and (14). Especially in quantitative trait locus analyses, multilocus association models have been shown repeatedly to perform well without including any mixed model correction term to account for residual dependencies in the model (for example, Setakis et al. 2006; Pikkuhookana and Sillanpää 2009; Kärkkäinen and Sillanpää 2012; Würschum and Kraft 2015; Toosi et al. 2018). In particular, the proposed dimension reduction method produced almost identical results with and without the random effect terms $u_{i,j}$ in our simulations (see the Supplemental Materials A). Additionally, the mixed model version appears to be computationally very demanding (also shown in the Supplemental Materials A). Based on these facts, and for simplicity, we decided to ignore the dependencies among the pseudo-observations in our example analyses.
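The estimator of the expression covariance matrix is the sample covariance between the rows of the model matrix, which is what cov(t(Z)) returns in R. A plain-Python sketch with a hypothetical toy matrix is given below; the function name is an assumption.

```python
def expression_covariance(Z):
    """G_hat = (Z - row means)(Z - row means)^T / (p - 1), row-wise covariance."""
    p = len(Z[0])
    centered = []
    for row in Z:
        m = sum(row) / p  # row mean over the p pseudovariables
        centered.append([v - m for v in row])
    return [[sum(a * b for a, b in zip(ri, rj)) / (p - 1) for rj in centered]
            for ri in centered]

# Toy model matrix: 3 pseudo-observations (rows) by 3 pseudovariables (columns)
Z = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 1.0, 2.0]]
G_hat = expression_covariance(Z)
```

The result has one row and one column per pseudo-observation, so its size grows with the fourth power of the number of individuals, which is consistent with the remark that the mixed model version is computationally demanding.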

Relaxed distributional assumptions for the random terms:

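The distributional assumption $Y \sim N(0,\sigma^2 I)$ implies that, for $i \neq j$, the difference $Y_i - Y_j$ follows $N(0, 2\sigma^2)$, so that $(Y_i - Y_j)^2$ follows a scaled $\chi^2$-distribution, $2\sigma^2\chi^2(1)$. This yields relatively complex distributions for the random effect vector $u \in \mathbb{R}^{n(n-1)/2}$ and the logarithmic residual vector $\log(\varepsilon) \in \mathbb{R}^{n(n-1)/2}$. However, it has been shown that log-normal distributions offer appropriate approximations for $\chi^2$-distributions (Jouini et al. 2011). The dependent pseudovariables $\log((Y_i - Y_j)^{-2}) = -\log((Y_i - Y_j)^2)$ can therefore be treated approximately as normally distributed random variables, and, thus, it is adequate to assume that, for a known expression covariance matrix $G$,

$$u \sim N(0,\sigma_u^2 G) \quad \text{and} \quad \log(\varepsilon) \sim N(0,\sigma_\varepsilon^2 I) \quad \text{s.t.} \quad u \perp \log(\varepsilon).$$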

Estimation with elastic net estimator:

In order to reduce the dimension of the interaction feature space $\mathcal{F}$ by removing multiplicatively nonimportant genes before enumeration, we assume that only a small number of the entries of $(\rho_1,\dots,\rho_p)$ are nonzero. Under this sparsity assumption, the relative importance of each covariate can be evaluated by penalized estimators. In this paper, we use the elastic net estimator proposed by Zou and Hastie (2005), which solves the problem

$$\underset{\varrho,\rho}{\arg\min} \sum_{1 \leq i<j \leq n} \left(\log\big((Y_i - Y_j)^{-2}\big) - \log(\varrho) - \sum_{k=1}^{p} (-\rho_k)\,|X_{ik} - X_{jk}|^{\gamma}\right)^2 \tag{15}$$
$$\text{subject to} \quad (1-\alpha)\|\rho\|_2^2 + \alpha\|\rho\|_1 \leq \lambda, \tag{16}$$

where $\lambda \geq 0$ is the overall penalty parameter. The relative proportion between the $\ell_1$-penalty and the $\ell_2$-penalty can be regulated by the parameter $\alpha \in [0,1]$.
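For readers who want to see the mechanics of the estimator, the following is a minimal coordinate descent sketch for the elastic net. It makes several assumptions that are not from the paper: it uses the penalized (Lagrangian) form common in software rather than the constrained form (15)-(16), so its `lam` is not on the same scale as $\lambda$ above; it weights the ridge part by $(1-\alpha)/2$; and it omits the intercept by assuming centered inputs. In the PH–E setting, the columns of `X` would be the pseudovariables $|X_{ik} - X_{jk}|^{\gamma}$, `y` the log inverse squared differences, and $\hat{\rho}_k$ the negated coefficient.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_cd(X, y, lam, alpha, n_sweeps=100):
    """Coordinate descent for
    (1 / (2n)) * ||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).
    No intercept: y and the columns of X are assumed to be centered."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - X b for the current b
    for _ in range(n_sweeps):
        for k in range(p):
            col = [row[k] for row in X]
            # inner product of column k with the partial residual (b_k removed)
            z = sum(c * (r + c * b[k]) for c, r in zip(col, resid)) / n
            ss = sum(c * c for c in col) / n
            new_bk = soft_threshold(z, lam * alpha) / (ss + lam * (1.0 - alpha))
            resid = [r - c * (new_bk - b[k]) for c, r in zip(col, resid)]
            b[k] = new_bk
    return b

# Orthogonal toy design: the response depends on the first column only
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.0, 3.0, -3.0, -3.0]
b = elastic_net_cd(X, y, lam=0.5, alpha=0.5)
print(b)
```

On this toy design the first coefficient is shrunk from 3 to about 2.2 and the second is set exactly to zero, which is the selection behavior used for prescreening. For real data, an established implementation would be preferable to this sketch.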

An explanation of why the proposed PH–E model is suitable for dimension reduction in higher-order interactions search

Let us define an H–E type interaction term over individuals $i$ and $j$ as a product of the $i$th and $j$th observations of a particular interaction term (e.g., $(X_{ik}X_{il})(X_{jk}X_{jl})$ for term $X_kX_l$). One beneficial feature of using the powered exponential kernel in Equation 11 is that it implicitly enumerates all higher-order H–E type interaction terms according to the following proposition:

Proposition 1:

The powered exponential kernel function $K_{\gamma}(X_i,X_j;\rho)$ with $0 < \gamma \leq 2$ has an infinite series representation that contains all possible product terms $\prod_{r \in s}\rho_r\,\varphi_s(X_i)\varphi_s(X_j)$ with respect to all possible subsets $s$ of indices $\{k_1,\dots,k_s\} \subseteq \{1,\dots,p\}$ of size $s$ for all $1 \leq s \leq p$, where $\varphi_s : \mathbb{R}^p \to \mathbb{R}$ is a mapping $\varphi_s(X) = \prod_{k \in s} X_k$.

Proof:

See Supplemental Materials A.

Moreover, phenotype $Y$ and the interaction terms $\prod_k X_k$ in model (2) are independent if, and only if, the proposed phenotypic similarity $T_Y(Y_i,Y_j)$ and the corresponding H–E type interaction terms $\prod_k X_{ik}\prod_k X_{jk}$ are independent:

Proposition 2:

Let us consider a random vector $X_i \in \mathbb{R}^p$, the corresponding phenotype $Y_i$, and independent and identically distributed copies $X_j \in \mathbb{R}^p$ and $Y_j$. Then for any Borel-measurable function $T$ we have that

$$Y_i \perp \prod_{k \in s} X_{ik} \quad \text{if and only if} \quad T(Y_i - Y_j) \perp \varphi(X_i)\varphi(X_j), \tag{17}$$

where $\varphi : \mathbb{R}^p \to \mathbb{R}$ is a mapping $\varphi(X) = \prod_{k \in s} X_k$ for some fixed subset $s$ of indices $\{k_1,\dots,k_s\} \subseteq \{1,\dots,p\}$ of size $1 \leq s \leq p$.

Proof:

See Supplemental Materials A.

Specifically, it can be shown that the individual genes involved in phenotypically important interaction terms in the original model (2) are identifiable by the proposed PH–E method:

Proposition 3:

Let us consider an interaction model (2) with the random vector $X \in \mathbb{R}^p$ and the phenotype $Y$. If we have that

$$\mathrm{Cov}\Big(Y, \prod_{k \in s} X_k\Big) \neq 0 \quad \text{for some } s = \{l_1,\dots,l_s\} \subseteq \{1,\dots,p\},$$

then the corresponding inverse bandwidth parameters $\rho_k$ ($k \in s$) estimated by the PH–E method tend to be nonzero.

Proof:

See Supplemental Materials A.

A simple example:

Let us consider a simple three-way interaction model:

$$Y_i = \sum_{1 \leq k<l<m \leq p} \beta_{klm}\,X_{ik}X_{il}X_{im} + \varepsilon_i, \quad \text{where i.i.d. } \varepsilon_i \sim N(0,\sigma^2). \tag{18}$$

Proposition 1 states that, by using the PH–E model (11) with the inverse squared phenotypic differences, we are implicitly assuming that

$$E\big((Y_i - Y_j)^{-2}\big) = \sum_{1 \leq k<l<m \leq p} \rho_k\rho_l\rho_m\,(X_{ik}X_{il}X_{im})(X_{jk}X_{jl}X_{jm}). \tag{19}$$

According to Proposition 2, a particular H–E type interaction term $(X_{ik}X_{il}X_{im})(X_{jk}X_{jl}X_{jm})$ in model (19) is independent of the phenotypic similarity $(Y_i - Y_j)^{-2}$ if, and only if, $X_{ik}X_{il}X_{im}$ and $Y_i$ are independent in the original model (18). Proposition 3 further implies that the inverse bandwidth parameters $\rho_k$, $\rho_l$, and $\rho_m$ estimated by the PH–E method tend to be nonzero if $X_{ik}X_{il}X_{im}$ and $Y_i$ are linearly dependent in the original model (18).

Multi-stage strategy for estimating bandwidth parameters

It is important to take into account that the main effects $(\beta_1,\dots,\beta_p)$ are identifiable primarily from marginal linear association studies. The interaction model (2) can therefore be portrayed implicitly through the following semiparametric regression model, which consists of the explicitly parametrized main effects and a nonparametric part, i.e.,

$$Y = X\beta + g(X;\rho) + \varepsilon, \quad \text{where } \varepsilon \sim N(0,\sigma_0^2 I). \tag{20}$$

Here, $\beta$ is the main effect vector for a model matrix $X \in \mathbb{R}^{n \times p}$, $\varepsilon$ is an error term, and $g : \mathbb{R}^{n \times p} \to \mathbb{R}^{n}$ is an arbitrary transformation function of the model matrix $X$. The parameter vector $\rho$ expresses the nonlinear importance of individual covariates through an unspecified nonparametric regression function.

In the presence of strong main effects, the proportion of phenotypic variance explained by the nonparametric part may be too small for the multiplicatively important covariates to be identifiable. Therefore, we recommend using a multi-stage procedure in which the main effects are estimated beforehand, known as residual outcome analysis (see Kärkkäinen et al. 2015; Mathew et al. 2018). Then the nonparametric part $g(X;\rho)$ is used to explain the variation of the estimated residual vector $\hat{\varepsilon} = Y - X\hat{\beta}$, yielding a nonparametric model

$$\hat{\varepsilon} = g(X;\rho) + \varepsilon^{*}, \quad \text{where } \varepsilon^{*} \sim N(0,\sigma_{*}^2 I). \tag{21}$$

In this model, the parametric linear part affects the estimation of $\rho$ less, since the estimated residual vector $\hat{\varepsilon} = Y - X\hat{\beta}$ is nearly orthogonal to the hyperplane spanned by the variables with marginal effects.

However, a little caution is necessary regarding the residual outcome analysis. We refer to the paper of Demissie and Adrienne (2011), which discusses the possible downward bias in regression coefficients when residual outcome analysis is applied. The bandwidth parameters estimated from the model (21) are no longer interpretable on the scale of the original model (20). Yet, this approach increases the chances that interaction components/genes are found correctly in the presence of strong main effects. That is, the proportion of the variance in the residual vector $\hat{\varepsilon}$ predictable from the nonlinear part $g(X;\rho)$ is higher than the proportion of the variance in the phenotype $Y$ explained by $g(X;\rho)$. In summary, residual outcome analysis trades the interpretability of the bandwidth estimates on the original scale for a better chance of identifying the genes involved in interactions.

However, one should not consider this residual outcome analysis as a necessity but rather as an additional step to increase the signal-to-noise ratio of the nonlinear part in Equation 20. An illustrative example/comparison is given in the simulation section, and the inclusion of this step is made optional in the provided R-code.

Identifying interaction terms by residual outcome analysis

In this section, we provide a step-by-step guide to identifying interaction effects in the presence of strong main effects via the PH–E method with residual outcome analysis. Even though the same procedure can be applied to find three-way or higher-order interactions as well, for simplicity, we illustrate its usage via the following model, with main and two-way interaction effects:

$$Y_i = \beta_0 + \sum_{m=1}^{p} X_{im}\beta_m + \sum_{k \neq l} X_{ik}X_{il}\,\beta_{kl} + \varepsilon_i, \tag{22}$$

where $\beta_m$ is the main effect of the gene $X_m$ and $\beta_{kl}$ is the gene-by-gene interaction effect between genes $X_k$ and $X_l$. The random error $\varepsilon_i$ is assumed to follow a normal distribution with a mean of zero and variance equal to $\sigma^2$.

Dimension reduction steps

Step 1. Computation of residuals:

The main effects $\beta_1,\dots,\beta_p$ of the parametric part in Equation 20 are estimated by the OLS estimator if $n \geq p$, or by typical shrinkage methods (LASSO, elastic net, etc.) if $n<p$. The estimated residual vector $[\hat{\varepsilon}_1,\dots,\hat{\varepsilon}_n]^T$, where

$$\hat{\varepsilon}_i = Y_i - \hat{\beta}_0 - \sum_{m=1}^{p} X_{im}\hat{\beta}_m, \tag{23}$$

is then nearly orthogonal to the hyperplane spanned by the variables with the strongest marginal effects, and is thus considered as a new phenotype vector for the next step.
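The orthogonality property used in Step 1 can be verified on a toy example. The sketch below is deliberately simplified to a single main effect fitted by OLS with an intercept (an assumption made to keep the example dependency-free); with several genes or with shrinkage estimators the residuals are only approximately orthogonal, as stated in the text.

```python
def ols_residuals(x, y):
    """Residuals of the simple regression y = b0 + b1 * x fitted by OLS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return [b - b0 - b1 * a for a, b in zip(x, y)]

x = [-1.5, -0.5, 0.5, 1.5]  # toy expression values of one gene
y = [-2.0, 0.0, 0.5, 3.5]   # toy phenotype values
resid = ols_residuals(x, y)

# Orthogonality check: the residuals carry no linear signal of the fitted gene
inner = sum(a * r for a, r in zip(x, resid))
print(abs(inner) < 1e-9)  # True
```

The residual vector `resid` would then play the role of the new phenotype in the bandwidth estimation step.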

Step 2. Bandwidth estimation and dimension reduction:

Nonlinear effects are estimated by regressing the logarithms of the inverse squared differences of the estimated residuals between individuals $i$ and $j$ against the matched kernel entries by the PH–E model

$$\log\big((\hat{\varepsilon}_i - \hat{\varepsilon}_j)^{-2}\big) = \log(\varrho) + \sum_{k=1}^{p} (-\rho_k)\,|X_{ik} - X_{jk}|^{\gamma} + \log(\varepsilon^{*}_{i,j}), \tag{24}$$

where the logarithmic random error $\log(\varepsilon^{*}_{i,j})$ of order two is assumed to follow a normal distribution with a mean of zero and variance equal to $\sigma_{*}^2$.

The relative importance of individual variables is evaluated by estimating variable-specific inverse bandwidth parameters via the elastic net estimator (Equations 15 and 16). As in the original ARD (Neal 1996; MacKay 1998), we are interested in nonzero bandwidth parameters, which are used to select the set of candidates $M = \{X_k \mid \hat{\rho}_k \neq 0\}$ for further analysis. We suggest that the penalty parameter $\lambda$ in Equation 16 is chosen such that the requested number of candidates is obtained. The appropriate value of the parameter $\alpha$ in Equation 16 depends on the collinearity of the data, but the use of relatively small values of $\alpha$ is recommended to avoid the common problems of LASSO (Zou and Hastie 2005).

Postdimension reduction steps

The set of chosen candidates $M = \{X_k \mid \hat{\rho}_k \neq 0\}$ is then reanalyzed through a specific parametric model of interest, which, in the case of model (22), consists of all multiplicative terms of the second order. However, verification of gene-by-gene interactions is extremely challenging and is often carried out only half-heartedly in practice (Wei et al. 2014). Thus, we propose the following multi-stage verification procedure, following Wei et al. (2014):

Step 3. Removing pseudo-interaction terms:

Spurious interactions might be identified erroneously due to strong underlying main effects and unfavorable coexpression between genes (Wood et al. 2014; Wei et al. 2014). For instance, let us assume that gene Xk and phenotype Y are strongly associated. Let us further assume that the expression-values of an another gene Xl are scattered around one with small variance. This implies that the interaction term XkXl (which is nearly equal to Xk) is also strongly associated with the phenotype Y, only because the gene Xk has a strong marginal association with phenotype Y and is “unfavorable coexpressed” with gene Xl. To remove such “pseudo-interactions”, all findings should be re-evaluated whether or not they exhibit non-negligible differences between the interaction and corresponding main effects. This can be done via a simple parametric interaction test in accordance with Aiken and West (1991): we test every possible pair (Xk,Xl) from the set of chosen candidates M by regressing the original phenotype Y on both marginal variables and their interaction term, i.e.,

$Y=X_k\beta_k+X_l\beta_l+X_kX_l\beta_{kl}$. (25)

Note that this step differs from the typical postselection significance tests. We are not yet making conclusions about the significance of the interaction effects in the light of the original model (22), and are therefore not accounting for possible selection bias (see e.g., Bühlmann and Mandozzi 2014). In this step we separate only the “right” interaction terms from those erroneously identified due to strong underlying main effects and unfavorable coexpression between genes (see Wood et al. 2014; Wei et al. 2014) as follows:

  • Aiken–West decision rule for gene-by-gene interactions: If the interaction effect $\beta_{kl}$ in model (25) appears to be statistically more “significant” than the main effects $\beta_k$ and $\beta_l$, the test indicates that the interaction term can explain the phenotypic variation better than the individual covariates. When testing the null hypothesis of zero-valued effects, the P-value corresponding to the interaction effect $\beta_{kl}$ should be relatively small (e.g., $10^{-5}$) but also smaller than the P-values associated with the main effects. Only such interaction terms are kept for further consideration.

Similarly, to test a specific higher-order interaction term with the Aiken–West interaction test, we regress the phenotype on that interaction term and the corresponding lower-order interaction and marginal terms. The interaction effect of interest in this case should be statistically more “significant” than the corresponding main and lower-order interaction effects.
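The decision rule above can be sketched as a small filter on the P-values obtained from model (25) or its higher-order analog. This is only an illustrative sketch: the function name `keep_interaction`, the default cutoff, and the example P-values are hypothetical, and the P-values are assumed to come from an ordinary least-squares fit computed elsewhere.

```python
def keep_interaction(p_interaction, p_lower_order, p_max=1e-5):
    """Return True if an interaction term passes the Aiken-West-style rule.

    p_interaction : P-value of the highest-order interaction effect
    p_lower_order : P-values of the corresponding main and lower-order effects
    p_max         : illustrative cutoff (the text mentions 1e-5 as an example)
    """
    small_enough = p_interaction < p_max
    smaller_than_rest = all(p_interaction < p for p in p_lower_order)
    return small_enough and smaller_than_rest


# Interaction clearly more significant than both main effects: kept.
kept = keep_interaction(4.0e-6, [0.006, 0.282])
# A strong main effect dominates the product term (pseudo-interaction): dropped.
dropped = keep_interaction(1.0e-6, [1.0e-9, 0.4])
```

For a three-way term, `p_lower_order` would simply hold the six P-values of the three main effects and the three two-way terms.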

Step 4. Replication:

Wei et al. (2014) highlighted the importance of showing that the identified interaction effects act in the same direction on the phenotype in an independent dataset. We therefore recommend retesting (with the same Aiken–West test) the remaining interaction terms from Step 3, or repeating the whole analysis, in an independent dataset. To some extent, this step also removes the false positives due to the selection bias caused by the prescreening step (Bühlmann and Mandozzi 2014). This is because the effects of the true underlying interaction terms are less sensitive to changes in data than those associated with false positive findings.

Exploring results from a biological point of view:

The remaining gene-by-gene interactions from steps 1–4 are inspected to see whether or not they have a biologically reasonable interpretation based on experimentally verified physical gene/protein interaction information. However, this should be considered only as an additional support for findings since it is totally dependent on existing biological knowledge about gene interactions, which is evolving constantly.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. The proposed method is first evaluated with a normalized gene expression dataset used in the DREAM-challenge 9 (Noren et al. 2016), which is available upon registration (http://dreamchallenges.org/). These data were provided by Dr. Steven Kornblau from The University of Texas MD Anderson Cancer Center, and were obtained through Synapse syn2455683 as part of the AML DREAM-challenge. A high-throughput sequencing AML dataset downloaded from the cancer genome atlas (TCGA) is also analyzed, which is publicly accessible through the TCGA-website (https://cancergenome.nih.gov/). A list of genes used in our analyses is provided in Supplemental Materials B (genes that remained after preprocessing). However, we noted that the naming policies of several genes in this dataset have been recently changed, but the obsolete names can be linked to the updated ones, for example with Entrez gene IDs. For validation of the results, another AML cohort was also analyzed. This dataset was collected at Erasmus University Medical Center (Rotterdam) and is available for open research from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo, GDS4278, Taskesen et al. 2011). Supplemental material is available at figshare: https://doi.org/10.25386/genetics.9933239.

Results

We begin by illustrating the capability of the proposed method with a normalized gene expression dataset provided by DREAM-challenge (June 2014). This dataset (Noren et al. 2016) consists of measurements of 191 AML patients with measured expression levels of 231 genes and phosphoproteins probed by reverse phase protein array analysis, each of which follows a standard normal distribution. In the forthcoming section, new phenotypes are simulated conditionally on the real expression levels of these 231 genes.

PH–E method: one simulated three-way interaction effect without the main and lower-order interaction effects

In this section, we evaluate the efficiency of the proposed PH–E dimension reduction method, and illustrate the impact of changing the values of the kernel parameter γ on the model estimates. Performance was tested over a grid of heritability levels with a simulated three-way interaction effect (placed at genes 21, 105, and 207 in the DREAM-challenge dataset, where indexes start from the ACTB gene) without the main and lower-order interaction effects. Three different heritability levels were simulated, each involving 100 replicates of the phenotype vector. The population intercept was set to 0 for each replication, and, by varying the normal residual variance, we obtained the averaged heritabilities 0.50, 0.30, and 0.20 over replications.
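As a rough illustration of how a residual variance can be matched to a target heritability, the sketch below assumes that heritability is defined as $h^2=\mathrm{Var}(g)/(\mathrm{Var}(g)+\sigma^2)$, where $g$ is the simulated interaction signal, so that $\sigma^2=\mathrm{Var}(g)(1-h^2)/h^2$. The function names, the sample size, and the use of independent standard normal "genes" are assumptions made for the example, not the authors' simulation code.

```python
import random
import statistics


def residual_variance_for(signal, h2):
    """Residual variance that gives heritability h2 = Var(signal) / Var(Y)."""
    return statistics.pvariance(signal) * (1.0 - h2) / h2


def simulate_phenotype(signal, h2, rng):
    """Add zero-mean normal noise so that the expected heritability is h2."""
    sd = residual_variance_for(signal, h2) ** 0.5
    return [g + rng.gauss(0.0, sd) for g in signal]


rng = random.Random(1)
n = 5000  # hypothetical sample size, larger than the 191 patients in the text
genes = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(3)]
signal = [a * b * c for a, b, c in zip(*genes)]  # three-way interaction only
y = simulate_phenotype(signal, 0.30, rng)

# Realized heritability of this single replicate; it fluctuates around 0.30.
realized_h2 = statistics.pvariance(signal) / statistics.pvariance(y)
```

Averaging `realized_h2` over many replicates would correspond to the "averaged heritability" reported in the text.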

The R code used (estimation was performed with the glmnet R-package, see Friedman et al. 2010) and the simulated phenotype replicates are available in the Supplemental Materials, so that the results can be easily replicated.

In this example, to produce comparable, but relatively sparse, results, we used the cross-validation (CV)-based value $3\lambda_{1se}$ for each test set. Here, $\lambda_{1se}$ denotes the largest value of λ such that the error is within 1 SE of the minimum error (see the glmnet R-package). Note that the CV-based choice of an optimal penalty parameter λ is not recommended in real-data analysis since it tends to produce results that are too dense for our purposes. Instead, λ should be selected such that the number of nonzero coefficients is less than a specific prefixed number (say 100). Moreover, the fixed proportion α=1/3 between the $\ell_1$- and $\ell_2$-penalty terms in the elastic net estimator (Equation 15 and Equation 16) was used to estimate the bandwidth parameters.
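The count-based choice of λ can be sketched as a lookup on a fitted penalty path. The sketch interprets the rule as keeping as many candidates as the cap allows, which is an assumption; the function name `pick_lambda` and the toy path are hypothetical. In practice the (λ, number of nonzero coefficients) pairs could be read, for example, from the `lambda` and `df` components of a glmnet fit.

```python
def pick_lambda(path, max_nonzero):
    """Choose a penalty from a path of (lambda, n_nonzero) pairs.

    Keeps only penalties whose fit has fewer than max_nonzero nonzero
    coefficients and returns the least sparse of them, i.e., the one
    with the smallest lambda.
    """
    admissible = [(lam, k) for lam, k in path if k < max_nonzero]
    if not admissible:
        raise ValueError("no penalty on the path satisfies the cap")
    return min(admissible, key=lambda pair: pair[0])


# Toy path: stronger penalties give sparser fits.
path = [(1.0, 0), (0.5, 12), (0.2, 57), (0.1, 98), (0.05, 143)]
lam, n_selected = pick_lambda(path, max_nonzero=100)
```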

Figure 1 displays the standardized estimates of the inverse bandwidth parameters $\rho_k$ for each variable $X_k$, $k=1,\ldots,p$, averaged over 100 replications. The panel rows separate the estimates produced by different kernel parameters: γ=2 (the Gaussian kernel, panels C1–3), γ=1 (the exponential kernel, panels B1–3), and γ=0.2 (panels A1–3). For simulations corresponding to the heritability level of 0.50 (panels A1, B1, and C1), the averaged estimates of the bandwidth parameters associated with the correct genes (21, 105, 207) are much higher than those of the irrelevant ones, regardless of the value of γ.

Figure 1.


Simulation analysis. Averaged estimates (over 100 replicates) of the inverse bandwidth parameters obtained by the PH–E method using distinct scenarios corresponding to the averaged heritability levels of 0.50 (A1, B1, and C1), 0.30 (A2, B2, and C2), and 0.20 (A3, B3, and C3), respectively. The panel rows separate the estimates produced by different kernel parameters: γ=2 (the Gaussian kernel, panels C1–3), γ=1 (the exponential kernel, panels B1–3) and γ=0.2 (panels A1–3). Black solid lines denote bandwidth parameter estimates and vertical red lines mark the exact positions of the simulated phenotype-associated genes. The CV-based penalty parameter value $3\lambda_{1se}$ was used for each scenario.

However, as the averaged heritability decreases from 0.50 (panels A1, B1, and C1) to 0.30 (panels A2, B2, and C2) and further to 0.20 (panels A3, B3, and C3), the level of noise and the appearance of false positives start to depend on the kernel parameter γ. It appears that smaller values of γ systematically yield better results. It is especially interesting that the Gaussian kernel (γ=2), probably the most widely used kernel, shows relatively poor performance compared to the exponential kernel, and even more so when compared to smaller values of γ.

Overall, all kernel parameter values have reasonable power to identify the correct covariates, with relatively robust performance with respect to changes in heritability levels. However, the simulations suggest that the performance of the PH–E method tends to improve as the parameter γ decreases, and the value γ=0.2 is therefore chosen in the real-data analysis section.

PH–E method via residual outcome analysis

The importance of the additional residual step was then evaluated by simulating the same three-way interaction term (with an effect size equal to 1) together with three additional main effects (placed at genes 40, 150, and 180, with effect sizes equal to 0.45, 0.45, and 0.30). In this case, only one level of heritability involving 100 replicates of the phenotype vector was simulated, with an averaged heritability of 0.40.

The inverse bandwidth parameters were estimated by the PH–E method, with and without the residual step presented in the Methods section. In both cases, the CV-based penalty parameter values $3\lambda_{1se}$ were used for each test set and replicate. Moreover, since the PH–E method with the residual step requires the main effects to be estimated beforehand, the elastic net estimator (with α=1/3) was also used to estimate the main effects. The penalty parameter value corresponding to the minimum prediction error selected by the CV-procedure was used. Then, the PH–E method was applied to the estimated residual vector in accordance with the Methods section.

Since the estimated residual vector is nearly orthogonal with respect to the hyperplane spanned by the main effects, the estimation of ρs should be now “enhanced” significantly. Without the residual step, the PH–E method was capable of finding the correct covariates (21, 105, 207), even though the main effects were inclined to be more exposed along with several false positives (Figure 2, left panel). As expected, by adjusting the PH–E method with the residual step, the estimates of bandwidth parameters associated with the simulated three-way interaction term are clearly separated from the main and other irrelevant effects (Figure 2, right panel).
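The orthogonality argument can be illustrated with a toy example. The sketch below deliberately swaps the elastic net for ordinary least squares on a single main effect, for which the residual is exactly (not just nearly) orthogonal to the centered covariate; all variable names, effect sizes, and the simulated data are assumptions made for the illustration.

```python
import random


def pearson(u, v):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def ols_residuals(x, y):
    """Residuals of the simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(7)
n = 2000
x_main = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_b = [rng.gauss(0.0, 1.0) for _ in range(n)]
interaction = [a * b for a, b in zip(x_a, x_b)]
y = [0.45 * m + i + rng.gauss(0.0, 0.5) for m, i in zip(x_main, interaction)]

resid = ols_residuals(x_main, y)
mean_main = sum(x_main) / n
# Inner product of the residual with the centered main-effect covariate.
dot = sum((m - mean_main) * r for m, r in zip(x_main, resid))
# The interaction signal is still present in the residual.
corr_with_interaction = pearson(resid, interaction)
```

With a penalized estimator the main-effect coefficients are shrunk, so the orthogonality holds only approximately, which matches the "nearly orthogonal" wording above.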

Figure 2.


Simulation analysis. Panels represent averaged estimates of the bandwidth parameters obtained by the PH–E method without (left panel) and with (right panel) the additional residual step over 100 simulated replicates of phenotype with the averaged heritability level of 0.40. The CV-based penalty parameter value $3\lambda_{1se}$ was used in each scenario.

The residual step yields significantly smaller search spaces after enumeration, even at this scale, regardless of the threshold by which candidates are chosen. For instance, in the right panel, the first false positive appears at the threshold value 0.22, whereas the number of nonzero coefficients in the left panel beyond that same hypothetical threshold point (black dotted vertical lines in Figure 2) is 16 (yielding 560 possible three-way combinations). Thus, based on this empirical evidence, using the PH–E method without the residual step severely limits its potential.

Comparison to other nonparametric methods without the main effects

In this section, we compare the overall performance of the PH–E method to other general nonparametric dimension reduction/prescreening methods suitable for finding higher-order interactions. Common to all these methods is that they can completely avoid exhaustive search through the space of enumerated pseudovariables (gene-by-gene interactions) because the functional relationship between the corresponding explanatory variables and the phenotype is considered to be nonlinear. That is, each interaction relationship, for instance $Y=X_kX_l+\varepsilon$, can be represented with the marginal phenotype-relationships for both genes $X_k$ and $X_l$ as

$Y=f_k(X_k)+\varepsilon_k$ and $Y=f_l(X_l)+\varepsilon_l$ (26)

with unknown nonlinear functions $f_k(\cdot)$, $f_l(\cdot)$ and the corresponding marginal error terms $\varepsilon_k$ and $\varepsilon_l$.
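A small simulation may help to see why a pure interaction leaves a nonlinear, but not a linear, marginal footprint. The sketch assumes two independent standard normal expression variables and uses the absolute value as an arbitrary stand-in for a nonlinear dependence measure; neither choice comes from the paper.

```python
import random


def pearson(u, v):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


rng = random.Random(3)
n = 20000
x_k = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_l = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [a * b for a, b in zip(x_k, x_l)]  # pure two-way interaction, no noise

# Linear marginal association between X_k and Y is essentially absent ...
linear = pearson(x_k, y)
# ... but a nonlinear marginal relationship is clearly visible.
nonlinear = pearson([abs(a) for a in x_k], [abs(b) for b in y])
```

This is the reason marginal nonparametric screens can flag the components of an interaction without enumerating the product terms themselves.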

The general nonparametric prescreening methods to be compared are (i) the mutual information (MI) approach (Frénay et al. 2013; Zhongxin et al. 2016), (ii) the distance correlation (DC)-based method (Li et al. 2012), and (iii) the HSIC-Lasso (Yamada et al. 2014). In the MI- and DC-approaches, one goes through all the gene expression values one at a time and measures the marginal phenotype relationships nonparametrically. A final outcome is a small subset of selected genes. In the HSIC-Lasso and the PH–E methods, a small subset of genes is selected from the transformed input variables (gene expressions) by the Lasso- or elastic net algorithm. Note that the MI method requires discretization of gene expression and trait values (which can be done in the corresponding packages). We used the infotheo R-package for the MI-method, energy R-package for the DC-method, and the publicly available Matlab-code for HSIC-Lasso (http://www.makotoyamada-ml.com/hsiclasso.html).

In the PH–E method, we use the powered exponential kernel (Equation 6) with the given kernel parameter γ=0.2. The performance is tested by simulating two three-way interactions and one two-way interaction (placed at genes 21, 105, 207, 41, 125, 225, 80, and 140 in the DREAM-challenge dataset, with indexes starting from the ACTB gene) without the main effects, i.e.,

$Y=X_{21}X_{105}X_{207}+X_{41}X_{125}X_{225}+X_{80}X_{140}+\varepsilon$, (27)

where $\varepsilon\sim N(0,\sigma^2)$. The effect sizes were set to one for all interaction terms. We simulated 100 replicates of the phenotype vector such that the same phenotype replicates were analyzed by methods (i)–(iii). The population intercept was set to 0 for each replication, and the normal residual variance $\sigma^2$ was chosen such that we obtained an averaged heritability of 0.60 over replications. The R code used for the simulations, as well as the simulated phenotype replicates, are available in the Supplemental Material.

Knowing the true simulated effects allows us to compare the receiver operating characteristics (ROC) between the MI-, DC-, HSIC-Lasso, and PH–E methods using the true positive rate (TPR), the false positive rate (FPR), and the area under the ROC-curve. Because performance at a small, restricted FPR is often of particular interest, we also compare the partial areas under the ROC-curves, truncated at FPR = 0.2.

Since both the MI- and DC-based methods are unpenalized, we used relatively small penalty-parameter values ($\lambda=10^{-3}$) for the PH–E method and HSIC-Lasso. Then, for the estimates $\hat{\rho}_k$ ($k=1,\ldots,p$) obtained by each method, we shift the “decision” threshold value $a$ through a grid over $[\min(\hat{\rho}),\max(\hat{\rho})]$ to produce the TPR and FPR for each $a$, such that

$\mathrm{TPR}=\dfrac{\#(k:\hat{\rho}_k>a \text{ and } \rho_k>a)}{\#(k:\rho_k>a)}$, (28)
$\mathrm{FPR}=\dfrac{\#(k:\hat{\rho}_k>a \text{ and } \rho_k<a)}{\#(k:\rho_k<a)}$. (29)
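The threshold sweep behind Equations 28 and 29 can be sketched as follows. The sketch makes two simplifying assumptions that are not taken from the paper: the truth is encoded as a boolean flag per gene (True for genes that belong to a simulated interaction), and the area under the curve is computed with the trapezoidal rule. The function names and the toy scores are hypothetical.

```python
def roc_points(scores, is_true):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step.

    Using `score >= a` with a taken from the observed scores is equivalent
    to `score > a` with a placed just below each observed score.
    """
    n_pos = sum(1 for t in is_true if t)
    n_neg = len(is_true) - n_pos
    points = [(0.0, 0.0)]
    for a in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, is_true) if s >= a and t)
        fp = sum(1 for s, t in zip(scores, is_true) if s >= a and not t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: three true genes all ranked above three irrelevant ones.
scores = [0.9, 0.8, 0.1, 0.7, 0.05, 0.0]
is_true = [True, True, False, True, False, False]
perfect_auc = auc(roc_points(scores, is_true))

# Swapping one true and one irrelevant gene lowers the area.
worse_auc = auc(roc_points([0.9, 0.8, 0.75, 0.7, 0.05, 0.0], is_true))
```

A truncated area can be obtained from the same points by integrating only up to the chosen FPR limit.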

Moreover, the fixed proportion α=1/3 between the $\ell_1$- and $\ell_2$-penalty terms in the elastic net estimator (Equation 15 and Equation 16) was used to estimate the bandwidth parameters in the PH–E method. Figure 3 displays the ROC-curves for the MI-, DC-, HSIC-Lasso and PH–E methods constructed from the estimates averaged over 100 simulated phenotype replicates.

Figure 3.


Performance comparison: the ROC curves for the proposed PH–E method (black solid curve), HSIC-Lasso (red dashed curve), the DC-based method (blue dotted curve) and MI approach (gray dash-dotted curve) characterized with respect to the estimates averaged over 100 simulated replicates. In the left panel (low-dimensional case), the dataset used consists of the measured expression levels of 231 genes (for 191 individuals), and, in the right panel (high-dimensional case), the measured expression levels of 16,976 genes (for 173 individuals), respectively.

Figure 3 (left panel) shows that both the PH–E method and HSIC-Lasso start to perform extremely well relatively quickly as the FPR increases, whereas the MI- and DC-based methods appear to be somewhat ineffective. The nontruncated areas under ROC-curves for the MI-, DC-, HSIC-Lasso, and PH–E methods were 0.614, 0.717, 0.831, and 0.934, respectively. While the proposed PH–E method consistently shows better accuracy than the other methods, its superiority is particularly evident toward lower FPRs. The truncated areas under ROC-curves (truncated at FPR ≤ 20%) for the MI-, DC-, HSIC-Lasso, and PH–E methods were 0.241, 0.330, 0.511, and 0.859, respectively. These results indicate that the PH–E method is a more suitable approach for finding the components of higher-order interaction terms than its competitors.

Comparison to other nonparametric methods in high-dimensional settings with the main effects

In this section, the suitability of the PH–E method is illustrated for gene-expression datasets with a large number of features, and its accuracy is once again compared to (i) the MI approach, (ii) the DC-based method, and (iii) the HSIC-Lasso. We use a high-throughput sequencing dataset downloaded from the cancer genome atlas (TCGA) including RNA-sequencing for 173 acute myeloid leukemia patients with measured expression levels of ∼20,000 genes publicly accessible from the TCGA website (https://cancergenome.nih.gov/).

This dataset was chosen for its high dimensionality and the multicollinearity among coregulated and coexpressed genes, which make it suitable for evaluating the performance of methods (i)–(iii) in typical gene-expression datasets. Approximately 10% of all pairwise correlations are >0.8. Before the analysis, we removed genes with missing values, as well as those with zero variance or very few unique expression values relative to the number of individuals. The latter was done with the nearZeroVar R-function (in the caret R-package). This yielded a dataset with 16,976 genes (for replication purposes, we provide a list of these 16,976 genes in Supplemental Materials B). Based on these remaining genes, we simulated new phenotypes such that
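The preprocessing filter can be approximated by a few simple checks per gene. The sketch below is a simplified stand-in and not a reimplementation of nearZeroVar: the function name `keep_gene`, the encoding of missing values as `None`, and the 10% cutoff on the share of distinct values are all assumptions made for illustration.

```python
def keep_gene(values, min_unique_fraction=0.10):
    """Decide whether a gene's expression vector survives preprocessing.

    A gene is dropped if it has a missing value (None), if all values are
    identical (zero variance), or if the share of distinct values among
    the individuals falls below min_unique_fraction.
    """
    if any(v is None for v in values):
        return False
    distinct = set(values)
    if len(distinct) < 2:
        return False
    return len(distinct) / len(values) >= min_unique_fraction


# Hypothetical expression vectors for four genes measured on ten individuals.
expression = {
    "gene_a": [0.1 * i for i in range(10)],   # varied values: kept
    "gene_b": [1.0] * 10,                     # zero variance: dropped
    "gene_c": [0.3, None] + [0.5] * 8,        # missing value: dropped
    "gene_d": [0.0] * 9 + [2.0],              # two distinct values: kept at 10%
}
kept_genes = [name for name, vals in expression.items() if keep_gene(vals)]
```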

$Y=X_{6666}+0.2X_{9120}X_{2532}X_{8735}X_{1735}+0.3X_{4812}X_{5010}X_{10257}+0.2X_{754}X_{8132}+\varepsilon$,

where $\varepsilon\sim N(0,\sigma_\varepsilon^2)$. We simulated 100 replicates of the phenotype vector and analyzed the same phenotype replicates by the methods (i)–(iii). The normal residual variance was set to be low ($\sigma_\varepsilon^2=0.2$) because the simulated interaction terms explain only a relatively small proportion of the overall phenotypic variance compared with the large main effects.

The inverse bandwidth parameters in the model (11) were estimated by the PH–E method with the residual step presented in the Methods section. Since the PH–E method with the residual step requires the main effects to be estimated beforehand, the elastic net estimator (with α=1) was used to estimate the main effects. The penalty parameter value corresponding to the minimum prediction error selected by the CV-procedure was used. Then, the PH–E method was applied to the estimated residual vector in accordance with the Methods section, with α=1 to provide sparsity and the given kernel parameter γ=0.2.

In terms of computational time, all these methods were very similar, each taking only a few minutes to analyze one replicate under the Windows operating system with a quad-core 3.40 GHz i7 processor (Intel 3770) and 16 GB of RAM. Figure 3 (right panel) shows that the PH–E method with the residual step performs clearly better than the MI- and DC-based methods over all FPRs. HSIC-Lasso, however, could not be compared on this dataset, since it produced results that were too sparse for constructing ROC-curves regardless of the penalty parameter value used, and it was therefore omitted from Figure 3. Nevertheless, the nontruncated areas under ROC-curves for the MI-, DC-, (HSIC-Lasso), and PH–E methods were 0.547, 0.456, (0.221), and 0.808, respectively. These simulated examples indicate that the proposed PH–E method is highly capable of finding components of higher-order interaction terms, even with challenging real-life high-throughput sequencing datasets.

Real data analysis

In this section, we identify interactive regulatory genes for known prognostic factors in AML. The main dataset was downloaded from the cancer genome atlas (TCGA), including RNA-sequencing for 173 AML patients with measured expression levels of ∼20,000 genes (median age is 47 years with range 18–72 years). Detailed descriptions of clinical and molecular characteristics with other information are publicly accessible from the TCGA website (https://cancergenome.nih.gov/). However, as in the simulation section, we removed genes with missing observations and those with zero variance or very few unique expression values relative to the number of individuals, yielding a dataset with 16,976 genes (a list of these genes is available in Supplemental Materials B).

Interaction regulation model for ARC/NOL3

The study by Mak et al. (2014) demonstrated that apoptosis repressor with caspase recruitment domain (ARC), encoded by the NOL3 gene [MIM: 605235], is repeatedly overexpressed in diagnosed AML samples, and is involved in the regulation of apoptosis of cardiac cells. However, the precise molecular mechanism regulating this antiapoptotic protein still remains unknown due to the overwhelming number of possible combinations in high-throughput datasets (Awad and Chen 2014). Here, the PH–E dimension reduction method was applied to identify two- and three-way regulatory interactions controlling the variation of ARC/NOL3 (1 of the remaining 16,976 genes).

Dimension reduction steps for finding interaction components

In accordance with the Methods section, the main effects were estimated with the elastic net estimator with α=0.2 to obtain the residual vector of the first order. The penalty parameter λ was chosen by the CV criterion (λ ≈ 0.403), producing a set of nonzero main effects of size equal to 0.5% of all genes (84 out of 16,975). Further consideration of the main effects will be omitted, whereas the major focus is on identifying and verifying two- and three-way interactions.

Whereas the variance of the NOL3 expression levels is approximately equal to one, the variance of the estimated residual vector $\hat{\varepsilon}$ of the first order is ∼0.16. The inverse bandwidth parameters are now estimated by the elastic net estimator (with α=1/3) (Equation 15 and Equation 16) with respect to the estimated residual vector $\hat{\varepsilon}$, in accordance with Step 2 presented in the Methods section. Based on the simulation studies, we used the powered exponential kernel with the kernel parameter value γ=0.2. The penalty parameter λ was chosen such that the number of nonzero inverse bandwidth parameters was as large as possible while remaining <2000. This produced 1993 candidates from which the two-way interaction terms can be identified exhaustively, resulting in a ∼99% smaller interaction feature space than without the dimension reduction step.

Note that larger penalty-parameter values are needed when higher-order interactions are considered. Once the estimated residual vector from the two-way interaction model was obtained, the PH–E method was further applied to find components involved in three-way gene-by-gene interactions important to NOL3 variation. This was done by refitting the interaction model (2), with only the main and two-way interaction effects enumerated, over the previously found 1993 components. The residual vector $\hat{\varepsilon}^*$ of the second order can then be computed. Next, we repeated the dimension reduction Step 2 with respect to the new residual vector $\hat{\varepsilon}^*$. In this case, the penalty parameter λ was chosen such that the number of nonzero bandwidth parameters was as large as possible while remaining <200. The resulting 197 candidates yield 1,254,890 possible three-way interaction combinations to be screened exhaustively. At this point, the benefit of the proposed dimension reduction step is evident, since after enumeration we have an interaction feature space that is ∼99.99985% smaller than the set we would get without the dimension reduction step ($>10^{11}$ enumerated candidates).
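The three-way counts quoted above can be checked with binomial coefficients. The sketch assumes that the unreduced search runs over the 16,975 genes that remain once NOL3 itself is set aside as the response; the variable names are illustrative.

```python
from math import comb

p_all = 16975        # candidate genes without the response gene (assumption)
p_reduced = 197      # candidates kept after the second dimension reduction

three_way_reduced = comb(p_reduced, 3)   # triples to screen exhaustively
three_way_full = comb(p_all, 3)          # triples without prescreening

# Relative shrinkage of the three-way search space, in percent.
reduction_percent = 100.0 * (1.0 - three_way_reduced / three_way_full)
```

Under this assumption, `three_way_reduced` equals 1,254,890, `three_way_full` exceeds $10^{11}$, and `reduction_percent` rounds to 99.99985.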

Postdimension reduction validation

In this section, we provide step-by-step guidance on how to analyze the set of remaining candidates from the previous steps through the postdimension reduction validation steps presented in the Methods section. We also provide biologically reasonable interpretation for two identified interaction terms as an example based on experimentally verified physical gene/protein interaction information. To that end, let us define the sets

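$\mathcal{I}_2=\{X_kX_l \mid \hat{\rho}_k\neq 0 \text{ and } \hat{\rho}_l\neq 0,\ 1\le k<l\le p\}$ (30)
$\mathcal{I}_3=\{X_kX_lX_m \mid \hat{\rho}_k^*\neq 0,\ \hat{\rho}_l^*\neq 0 \text{ and } \hat{\rho}_m^*\neq 0,\ 1\le k<l<m\le p\}$, (31)

where $\hat{\rho}_k$ and $\hat{\rho}_l$ are the estimated inverse bandwidth parameters with respect to the residual vector $\hat{\varepsilon}$ of the first order, and $\hat{\rho}_k^*$, $\hat{\rho}_l^*$, and $\hat{\rho}_m^*$ are the estimated inverse bandwidth parameters with respect to the residual vector $\hat{\varepsilon}^*$ of the second order.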
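Enumerating the candidate sets of Equations 30 and 31 amounts to listing all combinations of the genes with nonzero estimated inverse bandwidth parameters. The sketch assumes that each unordered combination is listed once (strictly increasing indexes) and uses 1-based gene indexes; the function name and the toy estimates are hypothetical.

```python
from itertools import combinations


def candidate_terms(rho_hat, order):
    """List index tuples of all interaction terms of the given order.

    rho_hat : estimated inverse bandwidth parameters, one per gene
    order   : 2 for two-way terms, 3 for three-way terms
    """
    selected = [k for k, r in enumerate(rho_hat, start=1) if r != 0]
    return list(combinations(selected, order))


# Toy estimates for six genes; genes 2, 4, and 5 are selected.
rho_hat = [0.0, 0.31, 0.0, 0.12, 0.08, 0.0]
pairs = candidate_terms(rho_hat, 2)
triples = candidate_terms(rho_hat, 3)
```

Each tuple would then be passed to the parametric interaction test as one candidate term.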

Parametric Aiken–West interaction test

We re-evaluated the two-way and three-way candidate sets (Equations 30 and 31) to investigate whether or not the individual terms exhibit non-negligible differences between the interaction and the corresponding lower-order interactions and marginal effects. To do so, we applied the parametric Aiken–West interaction test (Equation 25).

For the two-way interaction terms (Equation 30) enumerated over the remaining candidates, we required that the P-value for the interaction effect $\beta_{kl}$ in the Aiken–West test be $<10^{-5}$, and that the P-values for the corresponding main effects $\beta_k$ and $\beta_l$ be $>10^{-3}$. For the three-way interaction terms (Equation 31), we required that the P-value for the interaction effect $\beta_{klm}$ be $<10^{-5}$, and that the P-values for the corresponding lower-order interaction and main effects be $>10^{-3}$. These requirements gave 1820 pairs out of 1,985,010 two-way interaction terms, i.e., removed ∼99.9% of all two-way candidates. Accordingly, the Aiken–West test produced only 185 three-way interactions out of 1,254,890 candidates (removing over 99.9% of all three-way candidates).

Replication:

We further tested whether or not the effect found in the Aiken–West interaction test follows the same direction in terms of phenotype in an independent dataset. We therefore analyzed data from another AML cohort, which includes 156 de novo CN-AML patients (median age, 50 years, range: 16–77 years), collected at Erasmus University Medical Center (Rotterdam) between 1990 and 2008. All patients were treated uniformly under the study protocols of Dutch-Belgian Cooperative Trial Group for Hematology Oncology (HOVON, details of therapeutic protocol available at http://www.hovon.nl). All clinical, cytogenetic, and molecular information, as well as microarray data of these patients are publicly available at the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo, GDS4278, Taskesen et al. 2011).

Interaction between HtrA serine peptidase 3 and protein phosphatase 1 regulatory subunit 15B

Based on our results, the interaction between HtrA serine peptidase 3 (HTRA3 [MIM: 608785]) and protein phosphatase 1 regulatory subunit 15B (PPP1R15B [MIM: 613257]) exhibits a stronger association with the variation in NOL3 expression levels [−0.377, 95% CI: (−0.534, −0.221)] than the individual genes (Table 1, top panel). The effect of the same gene-by-gene interaction on NOL3 is in the same direction in an independent dataset [−0.679, 95% CI: (−1.048, −0.310)] (Table 1, bottom panel).

Table 1. Listed two-way gene-by-gene interaction that is replicable in an independent dataset and biologically reasonable based on the current literature.

| Gene name (TCGA) | Estimate | Std. Error | 95% CI | P-value |
|---|---|---|---|---|
| HTRA3 | 0.201 | 0.072 | [0.059, 0.344] | 0.006 |
| PPP1R15B | 0.077 | 0.071 | [−0.064, 0.218] | 0.282 |
| HTRA3: PPP1R15B | −0.377 | 0.079 | [−0.534, −0.221] | $4.01\times10^{-6}$ |

| Gene name (Omnibus) | Estimate | Std. Error | 95% CI | P-value |
|---|---|---|---|---|
| HTRA3 | 0.024 | 0.085 | [−0.144, 0.192] | 0.781 |
| PPP1R15B | 0.047 | 0.194 | [−0.336, 0.431] | 0.807 |
| HTRA3: PPP1R15B | −0.679 | 0.187 | [−1.048, −0.310] | $3.8\times10^{-4}$ |

Estimated effect sizes, standard errors, 95% confidence intervals and P-values for the interaction and main effects on NOL3/ARC variation in the Aiken–West interaction test. These values are estimated separately from the main TCGA-dataset (top panel) and the independent Omnibus dataset (bottom panel).
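The replication criterion can be written as a simple sign check on the interaction estimates from the two cohorts. In the sketch below, the additional requirement that both 95% confidence intervals exclude zero is an illustrative assumption and not a rule stated in the text; the function name is hypothetical, and the numbers are the interaction rows of Table 1.

```python
def replicates(discovery, replication):
    """Check that an effect has the same direction in both datasets.

    Each argument is a tuple (estimate, ci_low, ci_high). Besides equal
    signs of the estimates, both confidence intervals must exclude zero
    (an extra, illustrative requirement).
    """
    same_sign = discovery[0] * replication[0] > 0
    exclude_zero = all(low * high > 0
                       for _, low, high in (discovery, replication))
    return same_sign and exclude_zero


# HTRA3:PPP1R15B interaction in the TCGA and Omnibus cohorts (Table 1).
tcga = (-0.377, -0.534, -0.221)
omnibus = (-0.679, -1.048, -0.310)
is_replicated = replicates(tcga, omnibus)

# A hypothetical effect that flips sign in the second cohort does not replicate.
not_replicated = replicates((0.20, 0.05, 0.35), (-0.10, -0.40, 0.20))
```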

Interpretation of findings based on the current literature and databases:

PPP1R15B belongs to the protein phosphatase (PP1) subfamily—a ubiquitous serine/threonine phosphatase family known to regulate cell division and apoptosis (Garcia et al. 2003). In particular, PP1 has been shown to repress tumor protein p53 activity through dephosphorylation (Lu et al. 2013), which can transcriptionally regulate the expression levels of our target ARC/NOL3 (Li et al. 2008).

Given recent evidence, HTRA3 may be essential for cancer development (see e.g., Zurawa-Janicka et al. 2012). HTRA3 is known to function as a tumor suppressor and inhibitor of the TGF-β signaling pathway. TGF-β in turn activates signaling pathways in a Smad-independent manner, such as mitogen-activated protein kinases (MAPKs) and phosphatidylinositol 3-kinase (PI3K)/Akt (Korrodi-Gregório et al. 2014). This provides a reasonable interpretation for our results, since constitutive activation of both the MAPK and PI3K/Akt pathways is common in AML, and both pathways have been shown to be key regulators of NOL3/ARC activity (Mak et al. 2014).

Three-way interactions

SPEN, PPP1R15B, and ZSWIM7:

The identified three-way interaction term between Spen family transcriptional repressor (SPEN [MIM: 613484]), PPP1R15B, and zinc finger SWIM-type containing 7 (ZSWIM7 [MIM: 614535]) also has a stronger association with the variation of NOL3 expression levels [0.294, 95% CI: (0.189, 0.400)] than the individual genes or their lower-order interaction terms (Table 2, top panel). Accordingly, the effect of the same gene-by-gene interaction on NOL3 expression has the same direction in another independent dataset [2.340, 95% CI: (0.127, 4.553)], with similar Aiken–West test results (Table 2, bottom panel).

Table 2. Listed three-way gene-by-gene interaction that is replicable in an independent dataset and biologically reasonable based on the current literature.

| Gene name (TCGA) | Estimate | Std. Error | 95% CI | P-value |
|---|---|---|---|---|
| SPEN | 0.204 | 0.073 | [0.070, 0.340] | 0.003 |
| PPP1R15B | 0.194 | 0.068 | [0.050, 0.339] | 0.009 |
| ZSWIM7 | 0.172 | 0.072 | [0.049, 0.296] | 0.007 |
| SPEN: PPP1R15B | 0.062 | 0.078 | [−0.092, 0.216] | 0.426 |
| SPEN: ZSWIM7 | −0.123 | 0.053 | [−0.227, −0.019] | 0.021 |
| PPP1R15B: ZSWIM7 | 0.084 | 0.066 | [−0.047, 0.215] | 0.207 |
| SPEN: PPP1R15B: ZSWIM7 | 0.294 | 0.053 | [0.189, 0.400] | $1.3\times10^{-7}$ |

| Gene name (Omnibus) | Estimate | Std. Error | 95% CI | P-value |
|---|---|---|---|---|
| SPEN | −0.172 | 0.267 | [−0.700, 0.355] | 0.520 |
| PPP1R15B | 0.132 | 0.213 | [−0.289, 0.554] | 0.537 |
| ZSWIM7 | 0.328 | 0.193 | [−0.053, 0.711] | 0.091 |
| SPEN: PPP1R15B | −0.894 | 0.520 | [−1.923, 0.135] | 0.088 |
| SPEN: ZSWIM7 | −0.007 | 0.542 | [−1.078, 1.065] | 0.990 |
| PPP1R15B: ZSWIM7 | 0.679 | 0.391 | [−0.095, 1.453] | 0.085 |
| SPEN: PPP1R15B: ZSWIM7 | 2.340 | 1.120 | [0.127, 4.553] | 0.038 |

Estimated effect sizes, standard errors, 95% confidence intervals and P-values for the interaction and main effects on NOL3/ARC variation in the Aiken–West interaction test. These values are estimated separately from the main TCGA-dataset (top panel) and the independent Omnibus dataset (bottom panel).

Experimental studies by Oswald et al. (2002) have shown that SPEN is recruited by recombination signal binding protein for immunoglobulin kappa J (RBP-Jk) to a histone deacetylase (HDAC) complex that is known to regulate the NOTCH signaling pathway. Further, a NOTCH-family member, NOTCH1, in turn regulates S-phase kinase associated protein 2 (SKP2) (Sarmento et al. 2005), which is associated directly with S-phase kinase associated protein 1 (SKP1) in the SCF (Skp1/Cul1/F-box) protein complex. Finally, an experimental study by Sun et al. (2009) associated the Skp1-Cul1-F-box complex with p53 regulation.

In the previous section, we explained that PPP1R15B is a member of the protein phosphatase (PP1) subfamily, which has been shown to repress p53 activity through dephosphorylation.

ZSWIM7 and SWSAP1 are known to form a complex (see Martino and Bernstein 2016), and the interplay of this complex with the RAD51 paralogs has been discussed (Liu et al. 2011). Further, the RAD51C-interacting protein XRCC3 has been shown to interact with the FANCD2 protein in a newly identified complex (Somyajit et al. 2010), and FANCD2 in turn regulates the exonuclease CtIP (Yeo et al. 2014). Finally, CtIP binds directly to the SPEN repression domain (Oswald et al. 2005), building a bridge between SPEN and ZSWIM7.

Discussion

In this paper, we provide a fast multi-stage procedure for higher-order interaction search that makes two main contributions: (1) it extends the scalability of GP-based ARD (see Zou et al. 2010) while preserving its accuracy; and (2) it removes the influence of the main effects on the associated estimation process. Combined with the subsequent validation step, the proposed framework is therefore especially suitable for typical medical/biological problems with high-throughput gene/protein-expression sequencing datasets, e.g., identifying novel therapeutic targets and prognostic pathways for multi-factorial diseases. Moreover, the flexibility of the proposed dimension reduction method (PH–E) allows it to be extended to a broad spectrum of important problems beyond higher-order interaction searches, such as:

  • Prediction of complex quantitative traits with the original GP-model: PH–E could be used to reduce the dimension of a feature space to achieve a smaller input space for the original GP-model, which is directly applicable for outcome prediction with small datasets.

  • Identifying any kind of nonlinear effects: This framework could be used to find other nonlinear effects as well. This is relatively straightforward, since the parametric form can be specified by the user (see Step 3 in the Methods section).

Theoretical extension

Since we have considered only continuous explanatory variables (gene expression), applying the current framework to discrete explanatory variables (single nucleotide polymorphisms) requires further study. In this work, we have also adopted a purely frequentist point of view, whereas a natural advantage of the Bayesian approach would be the ability to incorporate prior knowledge about each individual gene. However, since the model (13) is a regular linear model, applying Bayesian variable selection methods (O’Hara and Sillanpää 2009) instead of the elastic net estimator would be relatively straightforward.

Another issue not addressed here is the computational burden caused by the growing number of “observations” that results from creating pseudo-observations. If the resulting sample size becomes computationally overwhelming, one possible solution would be to consider only the tails of the phenotype distribution. Alternatively, one could perform random subsampling among observations and average the corresponding results. On the other hand, the creation of pseudo-observations might be an advantage if the sample size is extremely small, as stated implicitly by Propositions 1 and 2.
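
The two workarounds mentioned above can be sketched as follows. The sketch assumes that pseudo-observations are formed from all pairs of individuals, so that n individuals give n(n−1)/2 pairs, and it uses the squared pairwise phenotype difference of classical Haseman-Elston regression purely as a stand-in statistic. The helper names `pair_indices`, `tail_subset`, and `subsample_average` are hypothetical, and the construction actually used by PH–E is the one defined in the Methods.

```python
import numpy as np

def pair_indices(n):
    """All index pairs (i, j) with i < j; there are n*(n-1)/2 of them."""
    return np.triu_indices(n, k=1)

def tail_subset(y, frac=0.1):
    """Indices of individuals in the lower and upper `frac` tails of y."""
    y = np.asarray(y, dtype=float)
    lo, hi = np.quantile(y, [frac, 1.0 - frac])
    return np.flatnonzero((y <= lo) | (y >= hi))

def subsample_average(n, estimator, size, repeats, seed=0):
    """Average `estimator(idx)` over random subsamples of `size` individuals."""
    rng = np.random.default_rng(seed)
    results = [estimator(rng.choice(n, size=size, replace=False))
               for _ in range(repeats)]
    return np.mean(results, axis=0)

# Simulated phenotype for 200 individuals (not real data).
rng = np.random.default_rng(1)
y = rng.normal(size=200)

i, j = pair_indices(len(y))
n_pairs = len(i)                      # 200 * 199 / 2 pairs

def mean_sq_diff(idx):
    """Stand-in statistic: mean squared phenotype difference over all pairs."""
    ys = y[idx]
    a, b = pair_indices(len(ys))
    return np.mean((ys[a] - ys[b]) ** 2)

full = mean_sq_diff(np.arange(len(y)))                 # all pairs
tails = tail_subset(y, frac=0.1)                       # tails only
avg = subsample_average(len(y), mean_sq_diff, size=50, repeats=20)
print(n_pairs, len(tails), full, avg)
```

With 50 individuals per subsample, each repeat handles 1,225 pairs instead of 19,900, at the cost of extra variance that the averaging step is meant to reduce.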

The importance of validation

Use of the validation step was illustrated with the AML TCGA-dataset. During validation of our findings, we found that it was especially important to ensure that the identified interaction effects on the phenotype act in the same direction in another independent dataset. For instance, a pairwise interaction between MAML1 and PPP1R12B was found to be strongly associated with NOL3 variation in the TCGA-dataset (P-value = 1.3×10⁻⁵). According to curated databases, they also seem to be linked through an experimentally verified pathway structure. Together with the fact that the interaction between MAML1 and PPP1R15B also appears to be statistically highly “significant” in the independent Omnibus dataset (P-value = 6.2×10⁻⁶), one could easily conclude that this interaction truly is an important contributor to NOL3/ARC variation. However, the interaction effect in the Omnibus dataset acts in the opposite direction, which leads to contradictory interpretations.
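
The direction requirement can be written as a simple check, sketched here under the assumption that an effect counts as replicated when both 95% confidence intervals exclude zero and the two estimates share a sign. The function name `replicates` is hypothetical. The first call uses the three-way term from Table 2; the numbers in the second call are invented solely to show a sign flip and are not the MAML1 estimates.

```python
def replicates(est_a, ci_a, est_b, ci_b):
    """True if both intervals exclude zero and the estimates share a sign."""
    def excludes_zero(ci):
        lo, hi = ci
        return lo > 0 or hi < 0
    same_sign = est_a * est_b > 0
    return excludes_zero(ci_a) and excludes_zero(ci_b) and same_sign

# SPEN:PPP1R15B:ZSWIM7 from Table 2 (TCGA, then Omnibus).
three_way = replicates(0.294, (0.189, 0.400), 2.340, (0.127, 4.553))

# Hypothetical pairwise term: clearly nonzero in both datasets, opposite signs.
flipped = replicates(0.25, (0.14, 0.36), -0.40, (-0.57, -0.23))

print(three_way, flipped)
```

A check based on P-values alone would accept both cases, which is the pitfall described above.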

In this study, replicable findings were also required to overlap with experimentally verified gene/protein interaction information downloaded from curated databases. However, this requirement, by definition, precludes completely novel findings. In general, this step should be considered only as additional support for findings rather than as a required criterion, since curated databases reflect only the knowledge available at the time and are constantly evolving.

Acknowledgments

We are grateful to the editor and two anonymous reviewers for their valuable comments, which helped us to improve the quality of the manuscript. This work was supported by research funding from the Finnish Cancer Foundation and Biocenter Oulu. The authors declare no competing financial interests.

Footnotes

Supplemental material available at figshare: https://doi.org/10.25386/genetics.9933239.

Communicating editor: D. Nielsen

Literature Cited

  1. Aiken L. S., and West S. G., 1991. Multiple Regression: Testing and Interpreting Interactions, Sage, London.
  2. Awad S., and Chen J., 2014. Inferring transcription factor collaborations in gene regulatory networks. BMC Syst. Biol. 8: S1. doi: 10.1186/1752-0509-8-S1-S1
  3. Bobb J., Valeri L., Claus Henn B., Christiani D. C., Wright R. O. et al., 2015. Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. Biostatistics 16: 493–508. doi: 10.1093/biostatistics/kxu058
  4. Bühlmann P., and Mandozzi J., 2014. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput. Stat. 29: 407–430. doi: 10.1007/s00180-013-0436-3
  5. Che R., Motsinger-Reif A. A., and Brown C. C., 2012. Loss of power in two-stage residual-outcome regression analysis in genetic association studies. Genet. Epidemiol. 36: 890–894. doi: 10.1002/gepi.21671
  6. Cordell H. J., 2009. Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet. 10: 392–404. doi: 10.1038/nrg2579
  7. Cover T. M., and Thomas J. A., 2006. Elements of Information Theory, Ed. 2. John Wiley & Sons, Hoboken, NJ.
  8. Demissie S., and Adrienne L., 2011. Bias due to two-stage residual-outcome regression analysis in genetic association studies. Genet. Epidemiol. 35: 592–596. doi: 10.1002/gepi.20607
  9. Eagleson G., and Muller H., 1997. Transformations for smooth regression models with multiplicative errors. J. R. Stat. Soc. B 59: 173–189. doi: 10.1111/1467-9868.00062
  10. Ehrenreich I. M., 2017. Epistasis: searching for interacting genetic variants using crosses. G3 (Bethesda) 7: 1619–1622. doi: 10.1534/g3.117.042770
  11. Ernst J., Vainas O., Harbison C., Simon I., and Bar-Joseph Z., 2007. Reconstructing dynamic regulatory maps. Mol. Syst. Biol. 3: 74. doi: 10.1038/msb4100115
  12. Frénay B., Doquire G., and Verleysen M., 2013. Is mutual information adequate for feature selection in regression? Neural Netw. 48: 1–7. doi: 10.1016/j.neunet.2013.07.003
  13. Friedman J., Hastie T., and Tibshirani R., 2010. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33: 1–22. doi: 10.18637/jss.v033.i01
  14. Garcia A., Cayla X., Guergnon J., Dessauge F., Hospital V. et al., 2003. Serine/threonine protein phosphatases PP1 and PP2A are key players in apoptosis. Biochimie 85: 721–726. doi: 10.1016/j.biochi.2003.09.004
  15. Haseman J. K., and Elston R. C., 1972. The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2: 3–19. doi: 10.1007/BF01066731
  16. Jiang Y., and Reif J. C., 2015. Modeling epistasis in genomic selection. Genetics 201: 759–768. doi: 10.1534/genetics.115.177907
  17. Jouini W., Le Guennec D., Moy C., and Palicot J., 2011. Log-normal approximation of chi-square distributions for signal processing. XXXth URSI General Assembly Scientific Symposium, Piscataway, New Jersey: IEEE, 1–4. doi: 10.1109/URSIGASS.2011.6050531
  18. Kärkkäinen H., and Sillanpää M. J., 2012. Robustness of Bayesian multilocus association models to cryptic relatedness. Ann. Hum. Genet. 76: 510–523. doi: 10.1111/j.1469-1809.2012.00729.x
  19. Kärkkäinen H., Li Z., and Sillanpää M. J., 2015. An efficient genome-wide multilocus epistasis search. Genetics 201: 865–870. doi: 10.1534/genetics.115.182444
  20. Korrodi-Gregório L., Silva J. V., Santos-Sousa L., Freitas M. J., Felgueiras J. et al., 2014. TGF-beta cascade regulation by PPP1 and its interactors - impact on prostate cancer development and therapy. J. Cell. Mol. Med. 18: 555–567. doi: 10.1111/jcmm.12266
  21. Li Y. Z., Lu D. Y., Tan W. Q., Wang J. X., and Li P. F., 2008. P53 initiates apoptosis by transcriptionally targeting the antiapoptotic protein ARC. Mol. Cell. Biol. 28: 564–574. doi: 10.1128/MCB.00738-07
  22. Li R., Zhong W., and Zhu L., 2012. Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107: 1129–1139. doi: 10.1080/01621459.2012.695654
  23. Liu T., Wan L., Wu Y., Chen J., and Huang J., 2011. HSWS1-SWSAP1 is an evolutionarily conserved complex required for efficient homologous recombination repair. J. Biol. Chem. 286: 41758–41766. doi: 10.1074/jbc.M111.271080
  24. Lu Z., Wan G., Guo H., Zhang X., and Lu X., 2013. Protein phosphatase 1 inhibits p53 signaling by dephosphorylating and stabilizing Mdmx. Cell. Signal. 25: 796–804. doi: 10.1016/j.cellsig.2012.12.014
  25. MacKay D. J. C., 1998. Introduction to Gaussian Processes, Neural Networks and Machine Learning, ed. Bishop C. M., Springer-Verlag, Berlin.
  26. Mackay T. F., 2014. Epistasis and quantitative traits: using model organisms to study gene-gene interactions. Nat. Rev. Genet. 15: 22–33. doi: 10.1038/nrg3627
  27. Maienschein-Cline M., Zhou J., White K., Sciammas R., and Dinner A., 2012. Discovering transcription factor regulatory targets using gene expression and binding data. Bioinformatics 28: 206–213. doi: 10.1093/bioinformatics/btr628
  28. Mak P. Y., Mak D. H., Mu H., Shi Y., Ruvolo P. et al., 2014. Apoptosis repressor with caspase recruitment domain is regulated by MAPK/PI3K and confers drug resistance and survival advantage to AML. Apoptosis 19: 698–707. doi: 10.1007/s10495-013-0954-z
  29. Martino J., and Bernstein K. A., 2016. The Shu complex is a conserved regulator of homologous recombination. FEMS Yeast Res. 16: fow073. doi: 10.1093/femsyr/fow073
  30. Mathew B., Leon J., Sannemann W., and Sillanpää M. J., 2018. Detection of epistasis for flowering time using Bayesian multilocus estimation in a barley MAGIC population. Genetics 208: 525–536. doi: 10.1534/genetics.117.300546
  31. Milne R. L., Fagerholm R., Nevanlinna H., and Benítez J., 2008. The importance of replication in gene-gene interaction studies: multifactor dimensionality reduction applied to a two-stage breast cancer case-control study. Carcinogenesis 29: 1215–1218. doi: 10.1093/carcin/bgn120
  32. Moore C. J., Chua A. J. K., Berry C. P. L., and Gair J. R., 2016. Fast methods for training Gaussian processes on large datasets. R. Soc. Open Sci. 3: 160125. doi: 10.1098/rsos.160125
  33. Neal R. M., 1996. Bayesian Learning for Neural Networks, Springer-Verlag, New York. doi: 10.1007/978-1-4612-0745-0
  34. Noren D. P., Long B. L., Norel R., Rrhissorrakrai K., Hess K. et al., 2016. A crowdsourcing approach to developing and assessing prediction algorithms for AML prognosis. PLOS Comput. Biol. 12: e1004890. doi: 10.1371/journal.pcbi.1004890
  35. O’Hara R. B., and Sillanpää M. J., 2009. Review of Bayesian variable selection methods: what, how and which. Bayesian Anal. 4: 85–117. doi: 10.1214/09-BA403
  36. Oswald F., Kostezka U., Astrahantseff K., Bourteele S., Dillinger K. et al., 2002. SHARP is a novel component of the Notch/RBP-Jkappa signalling pathway. EMBO J. 21: 5417–5426. doi: 10.1093/emboj/cdf549
  37. Oswald F., Winkler M., Cao Y., Astrahantseff K., Bourteele S. et al., 2005. RBP-J/SHARP recruits CtIP/CtBP corepressors to silence Notch target genes. Mol. Cell. Biol. 25: 10379–10390. doi: 10.1128/MCB.25.23.10379-10390.2005
  38. Phillips P. C., 2008. Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 9: 855–867. doi: 10.1038/nrg2452
  39. Pikkuhookana P., and Sillanpää M. J., 2009. Correcting for relatedness in Bayesian models for genomic data association analysis. Heredity 103: 223–237. doi: 10.1038/hdy.2009.56
  40. Rasmussen C. E., and Williams C. K. I., 2006. Gaussian Processes for Machine Learning, MIT Press, London.
  41. Sailer Z. R., and Harms M. J., 2017. Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics 205: 1079–1088. doi: 10.1534/genetics.116.195214
  42. Sarmento L. M., Huang H., Limon A., Gordon W., Fernandes J. et al., 2005. Notch1 modulates timing of G1-S progression by inducing SKP2 transcription and p27Kip1 degradation. J. Exp. Med. 202: 157–168. doi: 10.1084/jem.20050559
  43. Savitsky T., Vannucci M., and Sha N., 2011. Variable selection for nonparametric Gaussian process priors: models and computational strategies. Stat. Sci. 26: 130–149. doi: 10.1214/11-STS354
  44. Schelldorfer J., Bühlmann P., and van de Geer S., 2011. Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat. 38: 197–214. doi: 10.1111/j.1467-9469.2011.00740.x
  45. Setakis E., Stirnadel H., and Balding D. J., 2006. Logistic regression protects against population structure in genetic association studies. Genome Res. 16: 290–296. doi: 10.1101/gr.4346306
  46. Sham P. C., and Purcell S., 2001. Equivalence between Haseman-Elston and variance-components linkage analyses for sib pairs. Am. J. Hum. Genet. 68: 1527–1532. doi: 10.1086/320593
  47. Shi J. Q., and Choi T., 2011. Gaussian Process Regression Analysis for Functional Data, Chapman Hall/CRC, London. doi: 10.1201/b11038
  48. Somyajit K., Subramanya S., and Nagaraju G., 2010. RAD51C: a novel cancer susceptibility gene is linked to Fanconi anemia and breast cancer. Carcinogenesis 31: 2031–2038. doi: 10.1093/carcin/bgq210
  49. Sun L., Shi L., Li W., Yu W., Liang J. et al., 2009. JFK, a Kelch domain-containing F-box protein, links the SCF complex to p53 regulation. Proc. Natl. Acad. Sci. USA 106: 10195–10200. doi: 10.1073/pnas.0901864106
  50. Taskesen E., Bullinger L., Corbacioglu A., Sanders M. A., Erpelinck C. A. et al., 2011. Prognostic impact, concurrent genetic mutations, and gene expression features of AML with CEBPA mutations in a cohort of 1182 cytogenetically normal AML patients: further evidence for CEBPA double mutant AML as a distinctive disease entity. Blood 117: 2469–2475. doi: 10.1182/blood-2010-09-307280
  51. Taylor B. M., and Ehrenreich I. M., 2015. Higher-order genetic interactions and their contribution to complex traits. Trends Genet. 31: 34–40. doi: 10.1016/j.tig.2014.09.001
  52. Toosi A., Fernando R. L., and Dekkers J. C. M., 2018. Genome-wide mapping of quantitative trait loci in admixed populations using mixed linear model and Bayesian multiple regression analysis. Genet. Sel. Evol. 50: 32. doi: 10.1186/s12711-018-0402-1
  53. VanRaden P. M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423. doi: 10.3168/jds.2007-0980
  54. Wei W. H., Hemani G., and Haley C. S., 2014. Detecting epistasis in human complex traits. Nat. Rev. Genet. 15: 722–733. doi: 10.1038/nrg3747
  55. Wood A. R., Tuke M. A., Nalls M. A., Hernandez D. G., Bandinelli S. et al., 2014. Another explanation for apparent epistasis. Nature 514: E3–E5. doi: 10.1038/nature13691
  56. Würschum T., and Kraft T., 2015. Evaluation of multi-locus models for genome-wide association studies: a case study in sugar beet. Heredity 114: 281–290. doi: 10.1038/hdy.2014.98
  57. Yamada M., Jitkrittum W., Sigal L., Xing E. P., and Sugiyama M., 2014. High-dimensional feature selection by feature-wise kernelized lasso. Neural Comput. 26: 185–207. doi: 10.1162/NECO_a_00537
  58. Yeang H., and Jaakkola T., 2006. Modeling the combinatorial functions of multiple transcription factors. J. Comput. Biol. 13: 463–480. doi: 10.1089/cmb.2006.13.463
  59. Yeo J. E., Lee E. H., Hendrickson E., and Sobeck A., 2014. CtIP mediates replication fork recovery in a FANCD2-regulated manner. Hum. Mol. Genet. 23: 3695–3705. doi: 10.1093/hmg/ddu078
  60. Yi G., Shi J. Q., and Choi T., 2011. Penalized Gaussian process regression and classification for high-dimensional nonlinear data. Biometrics 67: 1285–1294. doi: 10.1111/j.1541-0420.2011.01576.x
  61. Zou H., and Hastie T., 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67: 301–320. doi: 10.1111/j.1467-9868.2005.00503.x
  62. Zou F., Huang H., Lee S., and Hoeschele I., 2010. Nonparametric Bayesian variable selection with applications to multiple quantitative trait loci mapping with epistasis and gene-environment interaction. Genetics 186: 385–394. doi: 10.1534/genetics.109.113688
  63. Zhongxin W., Gang S., Jing Z., and Jia Z., 2016. Feature selection algorithm based on mutual information and lasso for microarray data. Open Biotechnol. J. 10: 278–286. doi: 10.2174/1874070701610010278
  64. Zurawa-Janicka D., Kobiela J., Galczynska N., Stefaniak T., Lipinska B. et al., 2012. Changes in expression of human serine protease HtrA1, HtrA2 and HtrA3 genes in benign and malignant thyroid tumors. Oncol. Rep. 28: 1838–1844. doi: 10.3892/or.2012.1988


Data Availability Statement

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. The proposed method is first evaluated with a normalized gene expression dataset used in the DREAM-challenge 9 (Noren et al. 2016), which is available upon registration (http://dreamchallenges.org/). These data were provided by Dr. Steven Kornblau from The University of Texas MD Anderson Cancer Center, and were obtained through Synapse syn2455683 as part of the AML DREAM-challenge. A high-throughput sequencing AML dataset downloaded from The Cancer Genome Atlas (TCGA) is also analyzed; it is publicly accessible through the TCGA-website (https://cancergenome.nih.gov/). A list of the genes used in our analyses (those that remained after preprocessing) is provided in Supplemental Materials B. Note that the names of several genes in this dataset have recently been changed, but the obsolete names can be linked to the updated ones, for example with Entrez gene IDs. For validation of the results, another AML cohort was also analyzed. This dataset was collected at Erasmus University Medical Center (Rotterdam) and is available for open research from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo, GDS4278, Taskesen et al. 2011). Supplemental material is available at figshare: https://doi.org/10.25386/genetics.9933239.


Articles from Genetics are provided here courtesy of Oxford University Press
