
A scalable hierarchical lasso for gene-environment interactions

Natalia Zemlianskaia 1, W James Gauderman 1, Juan Pablo Lewinger 1

Abstract

We describe a regularized regression model for the selection of gene-environment (G×E) interactions. The model focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure. We propose an efficient fitting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy. We present simulation results showing that the model outperforms existing joint selection methods for (G×E) interactions in terms of selection performance, scalability and speed, and provide a real data application. Our implementation is available in the gesso R package.

Keywords: hierarchical variable selection, joint analysis, screening rules

1. Introduction

The problem of testing for the interaction between a single established predictor and a large number of candidate predictors arises in several contexts. A common setting is scanning for interactions in genome-wide association studies (GWAS), where the goal is to identify interactions between an established environmental risk factor (e.g. processed meat intake) and a large number of single nucleotide polymorphisms (SNPs), in relation to an outcome of interest (e.g. colorectal cancer) [Figueiredo et al. [2014]]. Since many complex diseases have been linked to both genetic and environmental risk factors, identifying gene-environment (G×E) interactions, i.e. joint genetic and environmental effects beyond their component main effects, is of great interest. But the established predictor need not be an environmental risk factor, and the candidate predictors can be other omic features such as methylation and gene expression levels. For example, in a clinical trial setting, testing for the interaction between treatment and gene expression biomarkers can lead to the identification of subgroups with differential responses to a drug [Ternes et al. [2017]].

In this paper we focus on the problem of scanning a large number of possible interactions with a fixed predictor. This is in contrast to the related but more challenging problem of exhaustively scanning all possible pairwise interactions within a set of predictors. Throughout the paper we will refer to the former as the gene-environment interaction selection problem, and we will use language specific to this context even though the proposed method is completely general and applies to any setting where the analytical goal is to identify interactions with a designated variable of special interest. Similarly, we will refer to the exhaustive pairwise testing problem as the gene-gene (G×G) interaction selection problem.

The dominant paradigm for genome-wide interaction scans (GWIS) is to test each genetic marker one at a time for interaction with the risk factor of interest, applying a stringent multiple testing correction to account for the number of tests performed. However, a joint analysis that simultaneously takes into account the effects of all markers is preferable to a one-at-a-time analysis where each marker is considered separately. Variables with weak effects might be more readily identifiable when the model has been adjusted for other causal predictors, and false positives may be reduced by the inclusion of stronger true causal predictors in the model [Ayers and Cordell [2010]]. Since GWAS data is high-dimensional (the number of SNPs is typically in the order of millions while the sample size is in the tens of thousands), a joint analysis of all markers using standard (unpenalized) multiple regression methods is not feasible. General regularized regression methods such as the Lasso [Tibshirani [1996]] and Elastic Net [Zou and Hastie [2005]] that are suitable for high-dimensional data and also perform variable selection can be used for G×E identification but, because they do not exploit the hierarchical structure of the G×E problem, can perform suboptimally. A main-effect-before-interaction hierarchical structure ensures that the final selected model only includes interactions whose corresponding main effects have also been selected. This enhances the interpretability of the final model [Nelder [1977], Cox [1984]] and increases the ability to detect interactions by reducing the search space [Chipman [1996]].

Methods for the G×G selection problem that exploit an interaction hierarchical structure, like FAMILY [Haris, Witten and Simon [2016]], glinternet [Lim and Hastie [2015]], and hierNet [Bien et al. [2013]], and their corresponding R implementations can in principle be applied to the G×E case, but they are optimized for the symmetric G×G case, which results in vastly suboptimal performance in terms of run-time and scalability (they can handle at most a few hundred predictors) for the G×E case. Indeed, the structure of the G×E selection task is simpler because 1) the dimensionality of the problem grows linearly with additional environmental variables as opposed to quadratically for the G×G selection task, and 2) interactions with a single variable lead to a block-separable optimization problem, which, unlike the G×G case, can be efficiently solved using a block coordinate descent algorithm. For these reasons, the G×E selection problem is amenable to efficient implementations for large-scale analysis (e.g. genome-wide). However, efficient joint selection methods specific to G×E in large-scale applications have not been developed.

Liu et al. [2013] and Wu et al. [2017] adopted sparse group penalization approaches to accelerated failure time models for hierarchical selection of G×E interactions, but their approaches do not scale to datasets with a very large number of predictors without additional pre-screening procedures.

In this paper we present the gesso (from G(by)E(la)sso) model for the hierarchical modeling of interaction terms. We present an efficient fitting algorithm for the gesso model and powerful new screening rules that eliminate a large number of variables beforehand, making joint G×E analyses feasible at genome-wide scale.

The paper is organized as follows. We first review the idea behind a hierarchical structure for interactions and present the gesso model. In section 3 we introduce screening rules and an adaptive convergence procedure we developed and incorporated into a block coordinate descent algorithm. We describe simulations in section 4 and a real data application in section 5 to demonstrate the applicability of the gesso model to large high-dimensional datasets and the scalability of our algorithm.

2. Methods

2.1. Hierarchical structure

The standard linear model for G×E interactions with a single environmental exposure includes all interaction product terms between genetic variables and the environmental factor in addition to their marginal effects:

$$E[Y] = \beta_0 + \beta_E E + \sum_{i=1}^{p} \beta_{G_i} G_i + \sum_{i=1}^{p} \beta_{G_i \times E}\, G_i \times E, \tag{1}$$

Here $Y \in \mathbb{R}^n$ is a quantitative outcome of interest, $G$ is the $n \times p$ matrix of genotypes, $G_i$ is the column of $G$ corresponding to the $i$-th genotype, $E$ is the vector of environmental measurements of size $n$, $\beta_G \in \mathbb{R}^p$ and $\beta_E \in \mathbb{R}$ are the main effects, and $\beta_{G\times E} \in \mathbb{R}^p$ are the interaction effects.

A strong hierarchical structure implies that if either the genetic or the environmental main effect is equal to zero, the corresponding interaction term has to be zero as well. In the G×E context, the environmental predictor E is usually chosen because it has been previously identified as a risk factor. Thus, there is no need to maintain a hierarchical constraint with respect to the environmental effect, as it is known to have a main effect on the outcome. The strong hierarchical structure then reduces to

$$\beta_{G_i \times E} \neq 0 \;\Rightarrow\; \beta_{G_i} \neq 0, \quad \text{equivalently,} \quad \beta_{G_i} = 0 \;\Rightarrow\; \beta_{G_i \times E} = 0.$$

There are several ways to impose a hierarchical structure on the regression coefficients of model (1). One approach is forward selection [Efroymson [1960]], a procedure that iteratively considers adding the “best” variable to the model, ensuring that an interaction term can be added only if its main effects have already been added to the model on a previous iteration. Since step-wise selection is a greedy algorithm, it tends to underperform compared to global optimization approaches [Ayers and Cordell [2010]]. Another way is to reparameterize the interaction coefficients as $\beta_{G_i \times E} = \gamma_i \beta_{G_i} \beta_E$, where $\gamma_i$ is an introduced model parameter [Bhatnagar et al. [2018], Wu et al. [2020]], but this results in a non-convex objective function. Regularization can also be used to impose a desired hierarchical structure [Zhao et al. [2009]], and this is the approach we follow.

2.2. The gesso model

Denote the mean square error loss function for a G×E problem by

$$q(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) = \frac{1}{2n}\left\| Y - \Big(\beta_0 + \sum_i \beta_{G_i} G_i + \beta_E E + \sum_i \beta_{G_i \times E}\, G_i \times E\Big)\right\|_2^2.$$

We are interested in hierarchical selection of the Gi × E interaction terms that are associated with the outcome Y. We propose the following model that we call gesso (G(by)E(la)sso):

$$\underset{\beta_0,\, \beta_G,\, \beta_E,\, \beta_{G\times E}}{\text{minimize}} \;\; q(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) + \sum_{i=1}^{p}\Big(\lambda_1 \left\|(\beta_{G_i}, \beta_{G_i \times E})\right\|_\infty + \lambda_2 |\beta_{G_i \times E}|\Big). \tag{2}$$

The model has several important properties. First, it guarantees the desired hierarchical relationship between genetic main effects and interaction effects, since the penalty satisfies the overlapping group hierarchical principle [Zhao et al. [2009]]. Having $\beta_{G_j \times E}$ in every group where $\beta_{G_j}$ is present ensures that once $\beta_{G_j \times E}$ deviates from zero, the penalty for $\beta_{G_j}$ becomes close to zero by the properties of group lasso models. In addition, having $\beta_{G_j \times E}$ in a group of its own makes it possible for $\beta_{G_j}$ to deviate from zero while $\beta_{G_j \times E}$ is zero. Second, the regularized objective function is convex, so we can take advantage of convex optimization theory and algorithms that ensure convergence to a global optimal solution. Moreover, the block-separable structure of the problem, with small block sizes, allows for highly efficient solvers. Third, we include two tuning parameters in the model, $\lambda_1$ and $\lambda_2$, enabling flexible and decoupled data-dependent control over the group and interaction penalties. Lastly, the group $L_\infty$ penalty has a connection to a Lasso model with hierarchical constraints that we discuss next.
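
To make the objective in (2) concrete, the following base-R sketch (our own illustration, not the gesso package implementation; all function and argument names are ours) evaluates the loss $q(\cdot)$ and the group $L_\infty$ plus interaction lasso penalty for a given set of coefficients.

```r
# Sketch only: evaluate the gesso objective (2) for given coefficients.
gesso_penalty <- function(beta_G, beta_GxE, lambda1, lambda2) {
  # group L-infinity norm of each pair (beta_Gi, beta_GixE), plus an extra
  # lasso penalty on the interaction coefficients
  sum(lambda1 * pmax(abs(beta_G), abs(beta_GxE)) + lambda2 * abs(beta_GxE))
}

gesso_objective <- function(Y, G, E, beta0, beta_G, beta_E, beta_GxE,
                            lambda1, lambda2) {
  n <- length(Y)
  GxE <- G * E                                  # column-wise products G_i * E
  eta <- beta0 + G %*% beta_G + E * beta_E + GxE %*% beta_GxE  # linear predictor
  q <- sum((Y - eta)^2) / (2 * n)               # squared-error loss q(.)
  q + gesso_penalty(beta_G, beta_GxE, lambda1, lambda2)
}
```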

2.3. Connection to the hierarchical Lasso

Bien et al. [2013] proposed a Lasso model with hierarchical constraints for all pairwise interactions (the G×G selection problem) and demonstrated its advantages over the standard Lasso. The authors showed that the hierarchical Lasso model is equivalent to an unconstrained overlapping group lasso model with an $L_\infty$ group norm. Following Bien et al., it can be shown that our proposed model (2) is equivalent to the constrained model (4). We describe the intuition behind the constrained version of the model (4) below and provide a proof of the equivalence of the two models in Appendix A.

Lasso with G×E Hierarchical Constraints

To impose the desired hierarchical structure, the constraints $|\beta_{G_i \times E}| \le |\beta_{G_i}|$ can be added to the standard Lasso model. These ensure the main-effect-before-interaction property $\beta_{G_i} = 0 \Rightarrow \beta_{G_i \times E} = 0$. Furthermore, the model makes the implicit (and arguably reasonable) assumption that important interactions have large main effects. When this assumption is met, the model will be more powerful in detecting interactions. Unfortunately, the constraint set above is non-convex and yields the non-convex problem:

$$\underset{\beta_0,\, \beta_G,\, \beta_E,\, \beta_{G\times E}}{\text{minimize}} \;\; q(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) + \lambda_1 \|\beta_G\|_1 + \lambda_2 \|\beta_{G\times E}\|_1 \quad \text{subject to} \;\; |\beta_{G_i \times E}| \le |\beta_{G_i}|, \;\; \text{for } i = 1, \dots, p. \tag{3}$$

In order to transform the non-convex optimization problem (3) into a convex one, we decompose $\beta_G$ as $\beta_G = \beta_G^+ - \beta_G^-$ and write $|\beta_G|$ as $\beta_G^+ + \beta_G^-$, where $\beta_G^+ \succeq 0$ and $\beta_G^- \succeq 0$. The non-convex constraints $|\beta_{G_i \times E}| \le |\beta_{G_i}|$ are then replaced by the convex constraints $|\beta_{G_i \times E}| \le \beta_{G_i}^+ + \beta_{G_i}^-$. Note that $|\beta_G| = \beta_G^+ + \beta_G^-$ only if $\beta_G^+ \beta_G^- = 0$, so removing the latter condition results in a relaxed formulation that is not equivalent to the original non-convex problem. Substituting $\beta_G = \beta_G^+ - \beta_G^-$ and replacing $|\beta_G|$ with $\beta_G^+ + \beta_G^-$ in (3), we obtain a convex relaxation of the model:

$$\begin{aligned} \underset{\beta_0,\, \beta_G^+,\, \beta_G^-,\, \beta_E,\, \beta_{G\times E}}{\text{minimize}} \quad & q(\beta_0, \beta_G^+ - \beta_G^-, \beta_E, \beta_{G\times E}) + \lambda_1 \mathbf{1}^T(\beta_G^+ + \beta_G^-) + \lambda_2 \|\beta_{G\times E}\|_1 \\ \text{subject to} \quad & |\beta_{G_i \times E}| \le \beta_{G_i}^+ + \beta_{G_i}^-, \;\; \beta_{G_i}^+ \ge 0, \;\; \beta_{G_i}^- \ge 0 \quad \text{for } i = 1, \dots, p. \end{aligned} \tag{4}$$

Now the constraint set is convex and the optimization problem is convex as well. The relaxed constraints $|\beta_{G_i \times E}| \le \beta_{G_i}^+ + \beta_{G_i}^-$ are less restrictive than the original $|\beta_{G_i \times E}| \le |\beta_{G_i}|$. In particular, the model can yield a large $|\beta_{G_i \times E}|$ estimate and a moderately sized $|\beta_{G_i}|$ by making both $\beta_{G_i}^+$ and $\beta_{G_i}^-$ large.

Examining the equivalent formulations for gesso (the unconstrained group $L_\infty$ model (2) and the constrained model (4)) we can see that the group $L_\infty$ norm corresponds to constraints on the effect sizes of the interactions and the main effects. Intuitively, because of the connection between the relaxed model (4) and model (3), (4) will be more powerful for detecting interactions when important interactions have large main effects. This is confirmed by our simulations further below.

Additionally, the constrained formulation of gesso (4) allows for a simpler, interpretable form for the coordinate-wise solutions, the dual problem, and the development of the screening rules.

3. Block coordinate descent algorithm for gesso

Friedman et al. [2007] have proposed using cyclic coordinate descent for solving convex regularized regression problems involving L1 and L2 penalties and their combinations. The coordinate descent algorithm is particularly advantageous when each iteration involves only fast analytic updates. In addition, screening rules that exploit the sparse structure induced by the penalties can be readily incorporated into the algorithm to eliminate a large number of variables beforehand, making it much faster than alternative convex optimization algorithms.
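
As a concrete illustration of the cyclic coordinate descent principle, the base-R sketch below (our own code, not from any package) solves a plain Lasso, $\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$, with the standard soft-thresholding update. The block-wise updates for the gesso pairs $(\beta_{G_j}, \beta_{G_j \times E})$ are derived in the supplementary materials and are not reproduced here.

```r
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for a plain Lasso (illustration only).
lasso_cd <- function(X, Y, lambda, n_cycles = 100) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  resid <- Y                          # residuals for the all-zero starting point
  col_ss <- colSums(X^2) / n          # (1/n) * ||X_j||^2, pre-computed once
  for (cycle in seq_len(n_cycles)) {
    for (j in seq_len(p)) {
      # partial-residual correlation for coordinate j (other coordinates fixed)
      rho <- sum(X[, j] * resid) / n + beta[j] * col_ss[j]
      beta_new <- soft_threshold(rho, lambda) / col_ss[j]
      resid <- resid - X[, j] * (beta_new - beta[j])   # keep residuals in sync
      beta[j] <- beta_new
    }
  }
  beta
}
```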

For convex block-separable functions, convergence of the coordinate descent algorithm to a global minimum is guaranteed [Tseng [2001]]. The gesso model has a convex objective function with a smooth loss component and a non-smooth separable penalty component, where each block consists of βGj and βGj×E. Thus, the model can be fitted using a block coordinate descent (BCD) algorithm and convergence to a global optimal solution is guaranteed. Briefly, BCD optimizes the objective by cycling through the coordinate blocks 1,…, p and minimizing the objective along each coordinate block direction while keeping all other blocks fixed at their most current values. The coordinate-wise updates for model (4) can be obtained by working with the Lagrangian version of the model. The derivations are provided in the supplementary materials (section 1). The rest of the section focuses on new efficient screening rules we developed specifically for gesso.

3.1. Dual formulation of gesso

Consider the following primal formulation of the gesso problem, obtained by substituting a new variable $z$ for the residuals:

$$\begin{aligned} \underset{\beta^+,\, \beta^-,\, \beta_{G\times E},\, z}{\text{minimize}} \quad & \frac{1}{2n} z^T z + \lambda_1 \mathbf{1}^T(\beta^+ + \beta^-) + \lambda_2 \|\beta_{G\times E}\|_1 \\ \text{subject to} \quad & |\beta_{G\times E}| \preceq (\beta^+ + \beta^-), \;\; \beta^+ \succeq 0, \;\; \beta^- \succeq 0, \\ & z = Y - \big(\mathbf{1}\beta_0 + G(\beta^+ - \beta^-) + E\beta_E + (G\times E)\beta_{G\times E}\big), \end{aligned} \tag{5}$$

The $\succeq$ symbol denotes an element-wise comparison ($x \succeq 0 \Leftrightarrow x_j \ge 0$, for $j = 1, \dots, p$), and $G \times E$ is the column-wise matrix of interaction vectors $G_i \times E$. In order to formulate the dual problem we introduce the dual variables $\delta \in \mathbb{R}^p$, $\delta \succeq 0$, associated with the constraint $|\beta_{G\times E}| \preceq (\beta^+ + \beta^-)$, and $\nu \in \mathbb{R}^n$, associated with the constraint $z = Y - (\mathbf{1}\beta_0 + G(\beta^+ - \beta^-) + E\beta_E + (G\times E)\beta_{G\times E})$. Substituting a new variable for the residuals in the primal formulation is a common approach for deriving a dual formulation. For simplicity, we denote the linear predictor $\mathbf{1}\beta_0 + G(\beta^+ - \beta^-) + E\beta_E + (G\times E)\beta_{G\times E}$ by $X\beta$. Based on the alternative primal formulation the dual takes the simple form below (section 2 of the supplementary materials):

$$\underset{\delta,\, \nu}{\text{maximize}} \;\; \frac{n}{2}\left(\left\|\frac{Y}{n}\right\|_2^2 - \left\|\frac{Y}{n} - \nu\right\|_2^2\right) \quad \text{subject to} \quad (\delta, \nu) \in DF,$$

where we denote the dual feasible region as $DF := \left\{(\delta, \nu) : |\nu^T(G_i \times E)| \le \lambda_2 + \delta_i,\; |\nu^T G_i| \le \lambda_1 - \delta_i,\; \delta_i \in [0, \lambda_1]\right\}$, and the dual objective as $D(\nu) = \frac{n}{2}\left(\left\|\frac{Y}{n}\right\|_2^2 - \left\|\frac{Y}{n} - \nu\right\|_2^2\right)$.

It follows that the optimal solution of the dual problem $\hat\nu$ can be viewed as a projection of $\frac{Y}{n}$ onto the dual feasible set $DF$:

$$\hat\nu = \underset{\nu \in DF}{\arg\max}\; \frac{n}{2}\left(\left\|\frac{Y}{n}\right\|_2^2 - \left\|\frac{Y}{n} - \nu\right\|_2^2\right) = \underset{\nu \in DF}{\arg\min}\left\|\frac{Y}{n} - \nu\right\|_2^2 \overset{\text{def}}{=} \text{Proj}_{DF}\!\left(\frac{Y}{n}\right).$$

And from the stationarity conditions we establish the following important relationship:

$$\hat\nu = \frac{\hat z}{n} = \frac{Y - X\hat\beta}{n}. \tag{6}$$

The optimal dual variable $\hat\nu$ equals the residuals scaled by the number of observations. Equation (6) defines the link between the primal ($\hat\beta$) and the dual ($\hat\nu$) optimal solutions. The stationarity and complementary slackness conditions for $\beta_{G_j}$ and $\beta_{G_j \times E}$ (section 2 of the supplementary materials) lead to the following important consequences for the primal and dual optimal variables:

$$|\hat\nu^T(G_i \times E)| < \lambda_2 + \hat\delta_i \;\Rightarrow\; \hat\beta_{G_i \times E} = 0, \quad \text{for all } i, \tag{7}$$
$$|\hat\nu^T G_i| < \lambda_1 - \hat\delta_i \;\Rightarrow\; \hat\beta_{G_i} = 0, \quad \text{for all } i. \tag{8}$$

Conditions (7), (8) form the basis for the screening rules we develop later in this section. Specifically, we exploit the above conditions to identify null predictors and avoid spending time cycling through them in the BCD algorithm. This leads to a substantial computational speed-up, especially for high-dimensional sparse problems.

3.2. Screening rules

Screening rules are used to identify predictors that are strongly associated with the outcome and screen out those that are likely to be null. Incorporation of the screening rules to the coordinate descent algorithm can greatly improve computational speed, making large but sparse high-dimensional problems computationally tractable. In this section we first describe the SAFE screening rules for gesso following the principles of the Lasso SAFE rules [El Ghaoui et al. [2012]], which form the stepping stone for the more efficient screening rules we developed and focus on later.

3.2.1. SAFE rules for gesso

El Ghaoui et al. [2012] proposed the SAFE screening rules for the Lasso model that guarantee a coefficient will be zero in the solution vector. In this section we derive SAFE rules to screen predictors for the gesso model.

By the KKT conditions (7) and (8)

$$\left.\begin{array}{l} |\hat\nu^T G_i| < \lambda_1 - \delta_i \\ |\hat\nu^T(G_i \times E)| < \lambda_2 + \delta_i \\ \delta_i \in [0, \lambda_1] \end{array}\right\} \;\Rightarrow\; \hat\beta_{G_i} = \hat\beta_{G_i \times E} = 0 \tag{9}$$

for any primal optimal variables $\hat\beta_{G_i}$ and $\hat\beta_{G_i \times E}$, dual optimal variable $\hat\nu$, and dual feasible variable $\delta_i$ (for a fixed optimal $\hat\nu$, any feasible $\delta_i$ would also be optimal, since the dual objective function does not depend on $\delta$). The problem is that we do not know the dual optimal variable $\hat\nu$.

The idea is to create upper bounds for $|\hat\nu^T G_i|$ and $|\hat\nu^T(G_i \times E)|$ that are easy to compute. In particular, consider a dual feasible variable $\nu_0$ and denote $D(\nu_0)$ by $\gamma$. Let $\Theta = \{\nu : D(\nu) \ge \gamma\}$. Since $\hat\nu = \arg\max_{\nu \in DF} D(\nu)$, we have $D(\hat\nu) \ge D(\nu_0) = \gamma$ and, hence, $\hat\nu \in \Theta$. Then $\max_{\nu \in \Theta}|\nu^T G_i| < \lambda_1 - \delta_i \Rightarrow |\hat\nu^T G_i| < \lambda_1 - \delta_i$. The same is true for the interaction terms: $\max_{\nu \in \Theta}|\nu^T(G_i \times E)| < \lambda_2 + \delta_i \Rightarrow |\hat\nu^T(G_i \times E)| < \lambda_2 + \delta_i$. Here $\max_{\nu \in \Theta}|\nu^T G_i|$ and $\max_{\nu \in \Theta}|\nu^T(G_i \times E)|$ are the desired upper bounds for $|\hat\nu^T G_i|$ and $|\hat\nu^T(G_i \times E)|$, respectively. The above arguments lead to the following rules:

$$\max_{\nu \in \Theta}|\nu^T G_i| < \lambda_1 - \delta_i \;\Rightarrow\; \hat\beta_{G_i} = 0, \tag{10}$$
$$\max_{\nu \in \Theta}|\nu^T(G_i \times E)| < \lambda_2 + \delta_i \;\Rightarrow\; \hat\beta_{G_i \times E} = 0. \tag{11}$$

Note that the region $\Theta = \{\nu : D(\nu) \ge \gamma\}$ is equivalent to $\frac{n}{2}\left(\left\|\frac{Y}{n}\right\|_2^2 - \left\|\frac{Y}{n} - \nu\right\|_2^2\right) \ge \gamma \Leftrightarrow \left\|\frac{Y}{n} - \nu\right\|_2 \le \sqrt{\left\|\frac{Y}{n}\right\|_2^2 - \frac{2}{n}\gamma}$, which is the equation of a Euclidean ball with squared radius $r^2 = \left\|\frac{Y}{n}\right\|_2^2 - \frac{2}{n}\gamma$ centered at $\frac{Y}{n}$. Note that $r = \sqrt{\left\|\frac{Y}{n}\right\|_2^2 - \frac{2}{n}\gamma} = \sqrt{\left\|\frac{Y}{n}\right\|_2^2 - \left\|\frac{Y}{n}\right\|_2^2 + \left\|\frac{Y}{n} - \nu_0\right\|_2^2} = \left\|\frac{Y}{n} - \nu_0\right\|_2$. We denote this ball by $B(c, r)$, where $c = \frac{Y}{n}$ and $r = \left\|\frac{Y}{n} - \nu_0\right\|_2$. As a consequence, the upper bounds we constructed, $\max_{\nu \in \Theta}|\nu^T G_i|$ and $\max_{\nu \in \Theta}|\nu^T(G_i \times E)|$, are given by the optimization problems $\max_{\nu \in B(c, r)}|\nu^T G_i|$ and $\max_{\nu \in B(c, r)}|\nu^T(G_i \times E)|$, and the region $\Theta$ is the ball $B(c, r)$, which has two important properties:

  1. it contains the optimal dual solution: $\hat\nu \in \Theta$,

  2. it results in closed-form solutions for the desired upper bounds (10) and (11).

Specifically, for a general $X \in \mathbb{R}^n$ the solution to $\max_{\nu \in B(c, r)}|\nu^T X|$ is $\max_{\nu \in B(c, r)}|\nu^T X| = r\|X\|_2 + |X^T c|$. Leaving the derivations to Appendix C, the SAFE rules to discard $(\beta_{G_i}, \beta_{G_i \times E})$ are given by:

$$\max\left\{0,\; r\|G_i \times E\|_2 + |(G_i \times E)^T c| - \lambda_2\right\} < \lambda_1 - r\|G_i\|_2 - |G_i^T c| \;\Rightarrow\; \hat\beta_{G_i} = \hat\beta_{G_i \times E} = 0. \tag{12}$$

To complete the construction of the SAFE rules, we need to find a dual feasible point $\nu_0 \in DF$ to calculate $r = r(\nu_0)$. We can naturally obtain a dual point given the current estimate $\beta$ via the primal-dual link from the stationarity conditions (6). Denote $\nu_{res}(\beta) = \frac{Y - X\beta}{n}$, where res stands for residuals. In general, $\nu_{res}(\beta)$ will not necessarily be feasible; thus, we re-scale $\nu_{res}(\beta)$ to ensure it is in the feasible region $DF$. For example, for $\delta = 0$ we can consider the re-scaling factor $x = \min_i\left(\frac{\lambda_1}{|\nu_{res}(\beta)^T G_i|}, \frac{\lambda_2}{|\nu_{res}(\beta)^T (G_i \times E)|}\right)$ so that $\nu_0 = x\,\nu_{res}(\beta)$ is feasible. We call the value $x\,\nu_{res}(\beta)$ a naive projection of $\nu_{res}$. In the next section we show an alternative way to re-scale $\nu_{res}(\beta)$ to obtain a better feasible point.
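
A minimal base-R sketch (ours; names are our own) of the naive projection for $\delta = 0$ described above: the residual-based dual point is rescaled so that every dual constraint is satisfied.

```r
# Naive projection of nu_res(beta) onto DF for delta = 0 (sketch).
naive_projection_delta0 <- function(Y, G, E, linear_pred, lambda1, lambda2) {
  n <- length(Y)
  nu_res <- (Y - linear_pred) / n                  # nu_res(beta) = (Y - X beta) / n
  GxE <- G * E
  # smallest ratio over all constraints, so that x * nu_res is feasible
  x <- min(lambda1 / max(abs(crossprod(G,   nu_res))),
           lambda2 / max(abs(crossprod(GxE, nu_res))))
  x * nu_res                                       # the feasible dual point nu_0
}
```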

3.2.2. Optimal naive projection

The better we choose the feasible $\nu_0$ and $\delta$ (i.e. the closer they are to the optimal solution), the tighter our upper bounds for $|\hat\nu^T G_i|$ and $|\hat\nu^T(G_i \times E)|$ on the set $\Theta = \{\nu : D(\nu) \ge D(\nu_0)\}$ are. In the previous re-scaling of $\nu_{res}(\beta)$ we set the dual variable $\delta$ equal to 0 for simplicity when we constructed a dual feasible point $\nu_0$. However, we can find a point $\nu_0$ closer to the optimum by choosing $\delta$ to our advantage.

Formally, we want to find a scalar $x$ such that $\nu_0 = x\,\nu_{res}(\beta)$ is feasible and maximizes $D(\nu_0)$. This leads to the following optimization problem with respect to $x$ and $\delta$:

$$\begin{aligned} \underset{x,\, \delta}{\text{maximize}} \quad & D(\nu_0) \\ \text{subject to} \quad & |\nu_0^T (G\times E)| \preceq \lambda_2 + \delta, \;\; |\nu_0^T G| \preceq \lambda_1 - \delta, \;\; \delta \succeq 0, \\ \text{where} \quad & \nu_0 = x\, \nu_{res}(\beta). \end{aligned} \tag{13}$$

We present the closed-form solution to the optimization problem (13) in Appendix B, which we use to obtain the dual feasible variable $\nu_0 = x\,\nu_{res}(\beta)$ required for the SAFE rules (12).

3.2.3. Warm starts and dynamic screening

The penalty parameters $\lambda_1$ and $\lambda_2$ will typically be tuned by cross-validation. In practice, this requires fitting the model over a grid of $\lambda_1$ and $\lambda_2$ values. A common choice is a logarithmic grid of 30 to 100 consecutive points. The maximum grid value can be determined from the stationarity KKT conditions. When solutions are computed along such a sequence or path of tuning parameters, warm starting is a standard approach to reduce the number of coordinate descent iterations required to achieve convergence. Warm starting refers to using the previously computed solution $\hat\beta(\lambda^{(k-1)})$ on the grid of tuning parameters to initialize the parameter values for the coordinate descent algorithm at the next grid point $\lambda^{(k)}$.
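
The warm-start pattern can be sketched as follows (base R, our own illustration). Here `solver` stands for any single-$(\lambda_1, \lambda_2)$ fitting routine such as Algorithm 3 below; it is a placeholder passed in by the user, not a gesso package function.

```r
# Warm starts along a (lambda1, lambda2) grid (sketch).
warm_start_path <- function(solver, lambda_grid, beta_zero) {
  # lambda_grid: data.frame with columns lambda1 and lambda2, ordered from the
  # most penalized to the least penalized values
  fits <- vector("list", nrow(lambda_grid))
  beta_prev <- beta_zero                        # at the largest lambdas the solution is all zeros
  for (k in seq_len(nrow(lambda_grid))) {
    fits[[k]] <- solver(lambda1 = lambda_grid$lambda1[k],
                        lambda2 = lambda_grid$lambda2[k],
                        beta_init = beta_prev)  # initialize at the previous solution
    beta_prev <- fits[[k]]
  }
  fits
}
```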

The idea behind dynamic screening [Bonnefoy et al. [2014]] is to iteratively improve the dual feasible point $\nu_0$ during the coordinate descent updates. The residuals $Y - X\beta^{(i)}$ are updated at each iteration $i$ of the coordinate descent algorithm, and since the coordinate descent algorithm guarantees convergence, we have that

$$\beta^{(i)} \underset{i\to\infty}{\longrightarrow} \hat\beta \quad \text{and} \quad \nu_{res}^{(i)} = \frac{Y - X\beta^{(i)}}{n} \underset{i\to\infty}{\longrightarrow} \frac{Y - X\hat\beta}{n} = \hat\nu \;\;\Rightarrow\;\; \nu_{res}^{(i)} \underset{i\to\infty}{\longrightarrow} \hat\nu, \;\; \nu_0^{(i)} \underset{i\to\infty}{\longrightarrow} \hat\nu. \tag{14}$$

Recall that our SAFE rules depend on the feasible point $\nu_0$ through the radius of the spherical region on which we base the upper bounds: $B(c, r(\nu_0^{(i)})) = B\!\left(\frac{Y}{n}, \left\|\frac{Y}{n} - \nu_0^{(i)}\right\|\right)$. Also, recall that $\hat\nu = \text{Proj}_{DF}\!\left(\frac{Y}{n}\right) = \arg\min_{\nu \in DF}\left\|\frac{Y}{n} - \nu\right\|_2$, i.e. $\hat\nu$ is the closest feasible point to the center $\frac{Y}{n}$. Hence, having $\nu_0$ closer to the optimal $\hat\nu$ reduces the radius of the sphere and ensures tighter upper bounds $\max_{\nu \in B(c, r)}|\nu^T G_i|$ and $\max_{\nu \in B(c, r)}|\nu^T(G_i \times E)|$ for $|\hat\nu^T G_i|$ and $|\hat\nu^T(G_i \times E)|$. This, in turn, yields better SAFE rules that are able to discard more variables. By performing the SAFE-rule screening not only at the beginning of each new grid point $\lambda_k = (\lambda_1, \lambda_2)_k$, but along the iterations $i$ of the coordinate descent, we iteratively improve the estimate of $\hat\nu$ with $\nu_0^{(i)}$ and consequently keep improving our SAFE rules.

The proposed procedure is the following: at each iteration $i$ of the algorithm use the current residuals $Y - X\beta^{(i)}$ to obtain a current estimate of the dual variable $\nu_{res}^{(i)} = \frac{Y - X\beta^{(i)}}{n}$, naively project it onto the feasible set $DF$, and apply the SAFE rules (12). Figure 1(a) illustrates the iterative process of constructing the SAFE spherical regions. One concern is how expensive it is to compute the naive projection and the SAFE rules at each iteration, which we address in detail in section 3.2.6.

Fig. 1. (a) Dynamic SAFE regions, (b) dynamic GAP SAFE regions.

However, there are clear disadvantages to our current choice of center $c = \frac{Y}{n}$ and radius $r = \left\|\frac{Y}{n} - \nu_0^{(i)}\right\|$. Even when the dual feasible variable $\nu_0^{(i)}$ converges to the optimal value $\hat\nu$, the ball $B\!\left(\frac{Y}{n}, \left\|\frac{Y}{n} - \hat\nu\right\|\right)$ does not shrink around the dual optimal variable, since the radius does not converge to zero and the center is static, indicating that our upper bounds $\max_{\nu \in B}|\nu^T G_i|$ and $\max_{\nu \in B}|\nu^T(G_i \times E)|$ remain loose (Figure 1(a)). The Gap SAFE rules proposed by Fercoq et al. [2015] elegantly address these disadvantages. We present the ideas behind the Gap SAFE rules and their application to the gesso model in the next section.

3.2.4. Gap SAFE rules for gesso

Denote the primal objective function for the gesso model as P(β):

$$P(\beta) = \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_1 \sum_{i=1}^{p}\max\left\{|\beta_{G_i}|, |\beta_{G_i \times E}|\right\} + \lambda_2\|\beta_{G \times E}\|_1 \tag{15}$$

and the dual objective as D(ν):

$$D(\nu) = \frac{n}{2}\left(\left\|\frac{Y}{n}\right\|_2^2 - \left\|\frac{Y}{n} - \nu\right\|_2^2\right). \tag{16}$$

The duality gap, denoted $\text{Gap}(\beta, \nu)$, is the difference between the primal and the dual objectives: $\text{Gap}(\beta, \nu) = P(\beta) - D(\nu)$. For the optimal solutions we have $P(\beta) \ge P(\hat\beta)$ and $D(\nu) \le D(\hat\nu)$. By weak duality, $D(\nu) \le D(\hat\nu) \le P(\hat\beta) \le P(\beta)$, which implies $\text{Gap}(\beta, \nu) \ge 0$. The duality gap provides an upper bound for the suboptimality gap $P(\beta) - P(\hat\beta)$, since $P(\beta) - P(\hat\beta) \le P(\beta) - D(\nu) = \text{Gap}(\beta, \nu)$. Therefore, given a tolerance $\epsilon > 0$, if at iteration $t$ of the BCD algorithm we can construct $\beta^{(t)}$ and $\nu^{(t)} \in DF$ such that $\text{Gap}(\beta^{(t)}, \nu^{(t)}) \le \epsilon$, then $\beta^{(t)}$ is guaranteed to be an $\epsilon$-optimal solution of the primal problem. Note that in order to use the duality gap at iteration $t$ as a stopping criterion we need a dual feasible point $\nu^{(t)} \in DF$. We again utilize the naive projection method to obtain $\nu_0^{(t)}$.
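
A base-R sketch (ours; all names are our own) of the quantities involved in this stopping criterion: the primal objective (15), the dual objective (16), and the duality gap. Here X is assumed to be the full design matrix, beta the corresponding full coefficient vector, beta_G and beta_GxE its main-effect and interaction sub-vectors, and nu0 a dual feasible point (for example, the naive projection).

```r
primal_objective <- function(Y, X, beta, beta_G, beta_GxE, lambda1, lambda2) {
  n <- length(Y)
  sum((Y - X %*% beta)^2) / (2 * n) +
    lambda1 * sum(pmax(abs(beta_G), abs(beta_GxE))) +   # group L-infinity part of (15)
    lambda2 * sum(abs(beta_GxE))
}

dual_objective <- function(Y, nu) {
  n <- length(Y)
  (n / 2) * (sum((Y / n)^2) - sum((Y / n - nu)^2))      # dual objective (16)
}

duality_gap <- function(Y, X, beta, beta_G, beta_GxE, nu0, lambda1, lambda2) {
  primal_objective(Y, X, beta, beta_G, beta_GxE, lambda1, lambda2) -
    dual_objective(Y, nu0)                              # upper bounds the suboptimality gap
}
# stopping rule: accept beta once duality_gap(...) <= eps (eps-optimal solution)
```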

Because D(ν) is a quadratic and strongly concave function with concavity modulus n [Boyd and Vandenberghe [2004]] we have:

$$D(\nu) \le D(\hat\nu) + \langle \nabla D(\hat\nu), \nu - \hat\nu \rangle - \frac{n}{2}\|\nu - \hat\nu\|^2,$$
$$\frac{n}{2}\|\nu - \hat\nu\|^2 \le D(\hat\nu) - D(\nu) + \langle \nabla D(\hat\nu), \nu - \hat\nu \rangle \le P(\beta) - D(\nu) = \text{Gap}(\beta, \nu),$$

where the second inequality follows from weak duality and the optimality conditions for ν^. Thus,

$$\|\nu - \hat\nu\| \le \sqrt{\frac{2}{n}\,\text{Gap}(\beta, \nu)}. \tag{17}$$

The Gap SAFE rules work with the Euclidean ball with center $c = \nu_0^{(i)}$ and radius $r = \sqrt{\frac{2}{n}\,\text{Gap}(\beta^{(i)}, \nu_0^{(i)})}$, which we denote by $B_{\text{Gap}}(c, r)$. This is a valid region since $\hat\nu \in B_{\text{Gap}}(c, r)$ by (17). An important consequence of this construction is that when $\beta^{(i)} \to \hat\beta$, then $\nu_0^{(i)} \to \hat\nu$ via the primal-dual link (6, 14) and $\text{Gap}(\beta^{(i)}, \nu_0^{(i)}) \to 0$. Figure 1(b) illustrates that the dynamic approach discussed in the previous section, in combination with the Gap SAFE rules, naturally results in improving the upper bounds $\max_{\nu \in B_{\text{Gap}}(c, r)}|\nu^T G_i|$ and $\max_{\nu \in B_{\text{Gap}}(c, r)}|\nu^T(G_i \times E)|$, since the radius of the new Gap ball converges to zero and the center converges to the dual optimal point. The Gap SAFE rules follow from substituting $r = \sqrt{\frac{2}{n}\,\text{Gap}(\beta^{(i)}, \nu_0^{(i)})}$ and $c = \nu_0^{(i)}$ in (12).
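
A sketch (base R, our own code) of the resulting Gap SAFE check: the radius is obtained from the duality gap via (17), the center is the current dual feasible point, and rule (12) is evaluated for every pair. Pairs flagged TRUE can be set to zero at the current $(\lambda_1, \lambda_2)$.

```r
gap_safe_discard <- function(Y, G, E, gap, nu0, lambda1, lambda2) {
  n <- length(Y)
  GxE <- G * E
  r <- sqrt(2 * gap / n)                       # radius of the gap ball, from (17)
  center <- nu0                                # center of the gap ball
  norm_G   <- sqrt(colSums(G^2))
  norm_GxE <- sqrt(colSums(GxE^2))
  lhs <- pmax(0, r * norm_GxE + abs(as.numeric(crossprod(GxE, center))) - lambda2)
  rhs <- lambda1 - r * norm_G - abs(as.numeric(crossprod(G, center)))
  lhs < rhs                                    # SAFE rule (12): TRUE => discard pair i
}
```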

3.2.5. Working set strategy

Massias et al. [2017] proposed to use a working set approach with the Gap SAFE rules to achieve substantial speedups over state-of-the-art Lasso solvers, including glmnet. The working set strategy involves two nested iteration loops. In the outer loop, a set of predictors $W_t \subset \{1, \dots, p\}$, called a working set (WS), is defined. In the inner loop, the coordinate descent algorithm is launched to solve the problem restricted to $X_{W_t}$ (i.e. considering only the predictors in $W_t$).

We adopt the Gap SAFE rules we developed for the gesso model to incorporate the proposed working set strategy. Recall the SAFE rules we constructed in (12):

$$\max\left\{0,\; r\|G_i \times E\|_2 + |(G_i \times E)^T c| - \lambda_2\right\} < \lambda_1 - r\|G_i\|_2 - |G_i^T c| \;\Rightarrow\; \hat\beta_{G_i} = \hat\beta_{G_i \times E} = 0.$$

By simply rearranging the terms in the inequality we get:

$$\frac{\lambda_1 - |G_i^T c| + \min\left\{r\|G_i \times E\|_2,\; \lambda_2 - |(G_i \times E)^T c|\right\}}{\|G_i\|_2 + \|G_i \times E\|_2} > r.$$

Denote the left-hand side of the above inequality by $d_i$. Note that our SAFE rules (12) are equivalent to $d_i > r$.

The idea is that $d_i$ now represents a score for how likely the predictor is to be zero or non-zero based on the Gap SAFE rules. Predictors for which $d_i$ is small are more likely to be non-zero and, conversely, predictors with larger values of $d_i$ are more likely to be zero, up to the point where $d_i > r$ and the corresponding predictor is exactly zero by the SAFE rules. Here by predictor we mean the pair $(\beta_{G_i}, \beta_{G_i \times E})$.

The proposed procedure is as follows: compute the initial number of variables to be assigned to the working set (working_set_size). Calculate $d_i$ and define the working set as the indices of the smallest $d_i$ values, up to working_set_size of them. Fit the coordinate descent algorithm on the variables from the working set only and check whether an optimal solution has been reached via the duality gap computed over all of the variables. If not, increase the size of the working set, recalculate $d_i$ according to the new estimates obtained by fitting on the previous working set, and repeat the procedure. We double the size of the working set each time.
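
A sketch (ours; names are our own) of the working-set selection: compute the scores $d_i$ from the rearranged rule above and keep the indices with the smallest scores.

```r
compute_d <- function(Y, G, E, gap, nu0, lambda1, lambda2) {
  n <- length(Y)
  GxE <- G * E
  r <- sqrt(2 * gap / n)
  norm_G   <- sqrt(colSums(G^2))
  norm_GxE <- sqrt(colSums(GxE^2))
  abs_Gc   <- abs(as.numeric(crossprod(G,   nu0)))
  abs_GxEc <- abs(as.numeric(crossprod(GxE, nu0)))
  # rearranged SAFE rule: the pair i can be discarded exactly when d[i] > r
  (lambda1 - abs_Gc + pmin(r * norm_GxE, lambda2 - abs_GxEc)) / (norm_G + norm_GxE)
}

select_working_set <- function(d, working_set_size) {
  order(d)[seq_len(min(working_set_size, length(d)))]  # indices of the smallest scores
}
```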

Algorithm 1 presents the block coordinate descent algorithm and the stopping criterion we propose to use. Algorithm 2 describes the main steps of the working set strategy. To recapitulate, the main advantage of the algorithm is that we select variables that are likely to be non-zero and leave likely zeroes out so we do not have to spend unnecessary time fitting them. This approach can be thought of as an acceleration of the Gap SAFE rules.

Algorithm 1:

cyclic_coordinate_updates(set of indices I)

for i = 1, …, max_iter do
    check_duality_gap(β)  // convergence criterion, computationally expensive
    for j ∈ I do
        (β_Gj, β_Gj×E) = coordinate_updates()  // coordinate-wise solutions
Algorithm 2:

coordinate_descent_with_working_sets()

for i_outer = 1, …, max_iter do
    check_duality_gap(β)
    d = compute_d_i()
    if i_outer == 1 then
        working_set = {j : (β_Gj ≠ 0) or (β_Gj×E ≠ 0)}  // initialization
        if length(working_set) == 0 then
            working_set = {j : smallest d[j], up to working_set_init_size}
        working_set_size = length(working_set)
    else
        d[working_set] = -Inf  // to make sure the WS increases monotonically
        working_set_size = min(2 · working_set_size, p)  // double the size of the WS at each iteration
        working_set = {j : smallest d[j], up to working_set_size}  // score coefficients by the likelihood of being non-zero
    cyclic_coordinate_updates(working_set)  // Algorithm 1 or optimized Algorithm 3

3.2.6. Active set and adaptive max difference strategies

In Algorithm 1 the dual feasible point required to determine the duality gap is calculated according to the naive projection method proposed in section 3.2.2. However, computing the naive projection is expensive, since it requires a matrix-vector product between a vector of length n (the sample size) and a matrix of size n × length(working_set). The idea of the active set and adaptive max difference strategies is to reduce the number of times we have to evaluate the check_duality_gap() function to ensure convergence.

Adaptive max difference strategy:

In any iterative optimization algorithm, including coordinate descent, the convergence (stopping) criterion plays an important role. One of the most commonly used stopping criteria is based on the change in the estimates between two consecutive iterations t − 1 and t (18). It is implemented in the glmnet package [Friedman et al. [2009]], for example. The idea is that if the estimated coefficients do not change much from iteration to iteration, it is likely that the optimal solution has been reached:

$$\max_{i \in \{1, \dots, p\}} \left|\beta_i^{(t)} - \beta_i^{(t-1)}\right|^2 \|X_i\|_2^2 < \epsilon. \tag{18}$$

However, in contrast to the duality gap stopping criterion, such heuristic rules do not offer control over suboptimality and it is generally not clear what value of ϵ is sufficiently small.

Although the heuristic convergence criterion based on the maximum absolute difference between consecutive estimates (18) does not control the suboptimality gap $P(\beta^{(t)}) - P(\hat\beta)$, it is very fast to compute, since $\|X_i\|_2^2$ is pre-computed or normalized to be one. To reduce the number of times we check convergence based on the duality gap criterion in Algorithm 1, we propose to use criterion (18) as a proxy convergence criterion (Algorithm 3).

The proposed procedure is as follows: we initialize the tolerance for the max difference criterion as the tolerance we set for the duality gap convergence. We proceed by fitting the coordinate descent algorithm until we meet the max difference convergence criterion and then check the duality gap for the final convergence. If the duality gap criterion is not met, we decrease the max difference tolerance by a factor of 10 and proceed again. As a result, instead of checking the duality gap after each cycle of the coordinate descent algorithm, we wait until the proxy convergence criterion (which is very cheap to check) is met, and then check the duality gap criterion. By adaptively reducing the tolerance value of the proxy criterion we allow more coordinate descent cycles if needed, but carefully control the adaptive convergence to make as few checks as possible.
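
The adaptive loop can be sketched as follows (base R, our own illustration). Here `one_cd_cycle` stands for one full pass of coordinate updates that returns the updated coefficients together with the max-difference value (18) for that pass, and `gap_fn` stands for the duality-gap check; both are placeholders supplied by the caller, not gesso package functions.

```r
adaptive_cd <- function(beta, one_cd_cycle, gap_fn, tol, max_checks = 100) {
  max_diff_tol <- tol                       # start the proxy tolerance at the duality-gap tolerance
  for (check in seq_len(max_checks)) {
    if (gap_fn(beta) <= tol) return(beta)   # expensive duality-gap check, done rarely
    repeat {                                # cheap proxy criterion: max-difference rule (18)
      step <- one_cd_cycle(beta)
      beta <- step$beta
      if (step$max_diff < max_diff_tol) break
    }
    max_diff_tol <- max_diff_tol / 10       # refine the proxy tolerance, then re-check the gap
  }
  beta
}
```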

Active set strategy:

The active set (AS) is a heuristic proposed by the authors of the glmnet package. The active set strategy tracks the predictors updated during the first coordinate descent cycle and proceeds by fitting the coordinate descent algorithm only on those variables. The active set strategy combines naturally with our proposed adaptive max difference procedure, since we already compute the differences of the estimates on consecutive iterations for the proxy convergence criterion. We believe that the max difference strategy we propose, in combination with active sets, can accelerate state-of-the-art methods for fitting the standard Lasso as well. We summarize the main steps of both strategies in Algorithm 3, which is an optimized version of Algorithm 1.

To recapitulate, Algorithm 1 is the vanilla block coordinate descent algorithm, where each block in {1,…, p}, comprised of a main effect and its corresponding interaction, is updated until convergence without the use of any screening rule or heuristic. Algorithm 2 incorporates the Gap SAFE screening rules to the block coordinate descent algorithm in combination with the working set strategy. In the outer loop the variables in the working set are determined and in the inner loop the basic block coordinate descent Algorithm 1 is applied to the reduced set of variables. This achieves a lower run time. Algorithm 3 leverages additional heuristics to further accelerate the convergence on working sets by using active sets and the fast maximum absolute difference convergence criterion. Note that all the proposed improvements guarantee convergence to a global optimal solution since the final convergence criterion ensures that the duality gap for the full problem (including all the estimated variables) is within a given tolerance.

Algorithm 3:

cyclic_coordinate_updates_optimized(set of indices I)

for i_inner = 1, …, max_iter do
    check_duality_gap(β)
    if i_inner == 1 then
        max_diff_tol = tol  // initialize the heuristic convergence criterion
    else
        max_diff_tol = max_diff_tol / 10  // adaptively refine the criterion
    while t < max_iter do
        max_diff = 0
        for j ∈ I do
            (β_Gj, β_Gj×E) = coordinate_descent_update()
            current_diff = max(|β_Gj(t) − β_Gj(t−1)|² ‖G_j‖², |β_Gj×E(t) − β_Gj×E(t−1)|² ‖G_j×E‖²)  // max difference heuristic convergence criterion
            max_diff = max(current_diff, max_diff)
            if current_diff > 0 then
                active_set[j] = TRUE  // active set = coefficients that got updated
        if max_diff < max_diff_tol then break
    while t < max_iter do
        max_diff = 0
        for j ∈ active_set do  // active set iteration
            (β_Gj, β_Gj×E) = coordinate_descent_update()
            current_diff = max(|β_Gj(t) − β_Gj(t−1)|² ‖G_j‖², |β_Gj×E(t) − β_Gj×E(t−1)|² ‖G_j×E‖²)
            max_diff = max(current_diff, max_diff)
        if max_diff < max_diff_tol then break

3.3. Experiments

We conducted a series of experiments to evaluate the efficiency of our algorithm overall and with respect to the screening rules.

Runtime:

In the first experiment we compared the runtime of the basic coordinate descent algorithm (Algorithm 1) and the various speedup strategies described in the sections above: the working set strategy adapted to the gesso problem (Algorithm 2), the proposed max difference strategy, and the active set strategy in combination with the max difference approach (Algorithm 3). The size of the dataset used for the experiment was n = 200 and p = 10,000 (20,001 predictors in total: p main effects, p interaction terms, and one environmental variable). We simulated 15 non-zero main effects and 10 non-zero interaction terms. We ran each algorithm 100 times and report the mean execution time in Figure 2. Figure 2 demonstrates that the proposed speedup strategies result in a major run-time acceleration. Therefore, only the fastest algorithm containing all of the proposed improvements (coordinate descent on the WS with the AS and adaptive pseudo convergence) is implemented in the package and used for the downstream experiments.

Fig. 2. Time comparison of the proposed algorithms: mean runtime over 100 replicates on the y-axis, dual gap tolerance on the x-axis.

Working sets:

In the next experiment we evaluated the screening ability of our rules in combination with the working set (WS) speedup. We simulated a dataset with n = 1000 and p = 100,000, with non-zero main effects and 10 non-zero interaction terms. Figure 3 shows the log10 size of the working set for all pairs (λ1, λ2) of the tuning parameters. The maximum size of the working set is 120 (out of 100,000 predictors). For most pairs only a small number of variables (from 0 to 15) is needed to find the solution.

Fig. 3. Log working set size (log10(WS)) for all lambda pairs.

4. Simulations

We performed a series of simulations to evaluate the selection performance of gesso and compare it to that of alternative models. As a baseline model for comparison, we used the standard Lasso model as implemented in the glmnet package [Friedman et al. [2009]]. Among the models that impose hierarchical interactions, we chose the glinternet [Lim and Hastie [2015]] model for comparison (Table 1), as it is implemented in an R package and can handle the G×E case with a single environmental variable (by specifying the parameter interactionCandidates), which is the focus of this paper. We also considered the FAMILY [Haris, Witten and Simon [2016]] and sail [Bhatnagar et al. [2018]] models and their respective packages for comparison. FAMILY implements a series of overlapped group lasso models (Table 1), but the package is designed for fitting strictly more than one environmental variable E. We nevertheless modified the source code to handle the single E case but were unable to achieve a satisfactory performance compared to other methods. We omit FAMILY from the results.

Table 1.

Hierarchical G×E Models

Name: gesso. Model penalty: $\lambda_1 \sum_{i=1}^{p}\left\|(\beta_{G_i}, \beta_{G_i\times E})\right\|_\infty + \lambda_2\|\beta_{G\times E}\|_1$. Comments: convex, two tuning parameters, $\beta_E$ is not penalized.

Name: glinternet. Model penalty: $\lambda\big(|\beta_E^{(0)}| + \sum_{i=1}^{p}|\beta_{G_i}^{(0)}| + \sum_{j=1}^{p}\|(\beta_E^{(j)}, \beta_{G_j}^{(E)}, \beta_{G_j\times E})\|_2\big)$, where $\beta_{G_j} = \beta_{G_j}^{(0)} + \beta_{G_j}^{(E)}$ and $\beta_E = \beta_E^{(0)} + \beta_E^{(1)} + \dots + \beta_E^{(p)}$. Comments: convex, single tuning parameter, $\beta_E$ is penalized.

Name: FAMILY. Model penalty: $(1-\alpha)\lambda \sum_{i=1}^{p}\|(\beta_{G_i}, \beta_{G_i\times E_1}, \dots, \beta_{G_i\times E_q})\|_\infty + \alpha\lambda\|\beta_{G\times E}\|_1$ or $(1-\alpha)\lambda \sum_{i=1}^{p}\|(\beta_{G_i}, \beta_{G_i\times E_1}, \dots, \beta_{G_i\times E_q})\|_2 + \alpha\lambda\|\beta_{G\times E}\|_1$. Comments: $q > 1$, convex, two tuning parameters, $\beta_E$ is not penalized.

Name: sail. Model penalty: $(1-\alpha)\lambda\big(|\beta_E| + \sum_{j=1}^{p}|\beta_{G_j}|\big) + \alpha\lambda\sum_{j=1}^{p}|\gamma_j|$, where $\beta_{G_j\times E} = \gamma_j\beta_E\beta_{G_j}$. Comments: non-convex, one tuning parameter $\lambda$ ($\alpha$ has to be set), $\beta_E$ is penalized.

The sail method ensures hierarchy via a reparametrization of the interaction coefficient that results in a non-convex objective (Table 1). We observed that it performs similarly to the other examined methods in lower-dimensional settings, but for the high-dimensional setting we considered, the sensitivity for selecting interaction terms was poor and the execution time was considerably longer. In addition, sail only tunes the main penalty parameter $\lambda$, not the relative weight of the penalties (the parameter $\alpha$). In practice, the optimal relative weight is highly dependent on the data, and the default value of 0.5 yields poor selection in most cases. Because the weight parameter controls the relative importance of the main effects and the interactions, it plays a critical role in interaction selection performance. This is unlike the weight parameter in the elastic net, where the relative weight varies between two penalties applied to the same set of coefficients.

The hierarchical Lasso for pairwise interactions is implemented in the hierNet [Bien et al. [2013]] R package. However, it can only handle a limited number of predictors and cannot be applied to the G×E case.

Simulation settings

We simulated data with n = 100 subjects, p = 2500 SNPs, and a single binary environmental variable E, for a total of 5001 predictors. We set the number of non-zero main effects, $p_G$, out of the p SNPs to 10, and the number of non-zero interactions, $p_{G\times E}$, out of the p interaction terms to 5. We simulated all non-zero main effects $\beta_G$ to have the same absolute value with randomly chosen signs, and similarly for the interaction terms $\beta_{G\times E}$.

We explored three simulation modes for the true coefficients that we call strong_hierarchical, hierarchical, and anti_hierarchical. In the strong hierarchical mode the hierarchical structure is maintained ($\beta_{G_i} = 0 \Rightarrow \beta_{G_i\times E} = 0$) and also $|\beta_{G_i}| \ge |\beta_{G_i\times E}|$. In the hierarchical mode the hierarchical structure is maintained, but we set $|\beta_{G_i}| \le |\beta_{G_i\times E}|$, in an attempt to violate the gesso assumption on the effect sizes. In the anti-hierarchical mode the hierarchical structure is violated ($\beta_{G_i\times E} \neq 0 \Rightarrow \beta_{G_i} = 0$). We simulated a single binary environmental factor with prevalence equal to 0.3. We set $\beta_G = 3$ and $\beta_{G\times E} = 1.5$ for all of the modes except the hierarchical mode, where we set $\beta_G = 0.75$ and $\beta_{G\times E} = 1.5$. We set the noise variance such that the interaction SNR (signal-to-noise ratio, defined as the ratio of the interaction signal to the noise) is around 2.
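
For concreteness, the base-R sketch below (ours) generates one replicate in the strong hierarchical mode under these settings. The genotype coding (binomial with allele frequency 0.3), the omission of an E main effect, and the exact noise calibration for an interaction SNR of about 2 are our assumptions, since the text does not fully specify them.

```r
set.seed(1)
n <- 100; p <- 2500; pG <- 10; pGxE <- 5
G <- matrix(rbinom(n * p, size = 2, prob = 0.3), n, p)  # SNP genotypes (assumed coding)
E <- rbinom(n, size = 1, prob = 0.3)                    # binary exposure, prevalence 0.3
beta_G   <- numeric(p)
beta_GxE <- numeric(p)
beta_G[1:pG]     <- 3   * sample(c(-1, 1), pG,   replace = TRUE)  # non-zero main effects
beta_GxE[1:pGxE] <- 1.5 * sample(c(-1, 1), pGxE, replace = TRUE)  # interactions nested within main effects
interaction_signal <- as.numeric((G * E) %*% beta_GxE)
noise_sd <- sd(interaction_signal) / sqrt(2)            # our reading of "interaction SNR around 2"
Y <- as.numeric(G %*% beta_G + interaction_signal + rnorm(n, sd = noise_sd))
```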

We generated independent training, validation, and test sets under the same settings and report the model performance metrics on the test set. We ran 200 replicates of the simulation for each parameter setting.

Results

We obtained solutions paths across a two-dimensional grid of tuning parameter values and computed the precision and area under the curve (AUC) for the detection of interactions as a function of the number of interactions discovered (Figure 4).

Fig. 4. Model performance (top row: AUC for G×E selection, bottom row: precision for G×E selection) as a function of the number of interactions discovered. p = 2500, n = 100, p_G = 10, p_{G×E} = 5.

In the strong hierarchical mode (Figure 4(a)) the models imposing a hierarchical structure (gesso and glinternet) outperform the Lasso model, and gesso performs better selection than glinternet in terms of both AUC and precision.

We considered the hierarchical mode to evaluate the selection performance of gesso when $|\beta_{G_i \times E}| \le |\beta_{G_i}|$ does not hold. Because gesso does not make this stringent assumption (only the laxer $|\beta_{G_i \times E}| \le \beta_{G_i}^+ + \beta_{G_i}^-$), we expect it to be robust to its violation. In the hierarchical mode (Figure 4(b)) glinternet and gesso still outperform the Lasso, and the gesso model performs on par with glinternet.

In the anti-hierarchical mode (Figure 4(c)) all three models perform similarly. Importantly, even though the hierarchical assumptions are violated, the hierarchical models still do not lose to the Lasso model in terms of selection performance.

5. Real data example: Epigenetic clock

We analyzed the GSE40279 methylation dataset [Hannum et al. [2013]], which contains 450,000 CpG markers on autosomal chromosomes from whole blood for 656 subjects (Illumina Infinium 450k Human Methylation BeadChip), together with age and gender. Hannum et al. [2013] reported that the methylome of men appeared to age approximately 4 percent faster than that of women. Epidemiological data also indicate that females live longer than males, but the reasons for these gender discrepancies are still unknown. Our goal in this analysis was to identify methylation-by-sex interactions that may be associated with age.

We preselected 100,000 of the most variable probes for our analysis, leaving us with 100,000 CpG probe main effects and 100,000 methylation probe by sex interaction terms, for a total of 200,001 predictors. Age is our dependent/outcome variable. We ran our implementation of the gesso model over one hundred 5-fold cross-validation assignments and calculated the selection rate for each predictor as the number of times the predictor was selected by the model divided by the number of runs (one hundred in our case). We chose the hyper-parameters based on the minimum cross-validation loss. We performed the same procedure for the standard Lasso model using the glmnet package.

Results

The average run-time for our implementation of gesso across the 100 cross-validation runs was only 4 minutes for a grid of 30 values for each of our two tuning parameters. The average run-time for the glmnet package on the same grid for its single tuning parameter was 1.5 minutes.

We attempted to use the glinternet package for the analysis but it exited with an error due to the large data size. We then downloaded the source code and changed the memory handling, since it was allocating memory for the full G×G problem before subselecting to the G×E case. However, even with the updated function, glinternet would not complete the analysis within 24 hours. We hypothesize that, due to specifics of the current implementation, the glinternet package is efficient for the symmetric G×G format but not for the reduced G×E format with large G. We also tried to run the FAMILY package on our methylation dataset after adapting the source code to generalize the function to the G×E case with a single E variable. However, the program exited with an out-of-memory error. The sail package also did not finish within 24 hours. To conclude, the existing packages are not able to handle large datasets (here 200,000 variables) for the G×E analysis.

Table 2 presents the four top-ranking CpG probes for gesso and the standard Lasso that interact with sex in their effects on age. Ranks were calculated by ordering the interaction selection rates and assigning rank one to the highest rate (the most frequently selected variable), and so on.

Table 2.

Top CpG probes interacting with sex, by method.

CpG probe      Associated gene    gesso rank    glmnet rank
cg00091483 *   PEBP1              1             12
cg12015310     MEIS2              2             6
cg08327269     NEUROD1            3             56
cg103755456    ICAM1              4             163
cg14345497     HOXB4              8             1
cg09365557     TBX2               >50,000       2
cg27652200     PSMB9/TAP1         109           3
cg19009405     HNRNPUL2           14            4

* Probes identified by both methods (ranked among the top 20 probes by the other method) are highlighted in bold.

The genes associated with the probes selected by gesso have been linked to aging processes and cell senescence in multiple publications and databases, and are suggested biomarkers for age-related diseases [Jeck et al. [2012], Chang et al. [2000], Gorgoulis et al. [2005]]. For example, the PEBP1 gene (linked to the top selected probe) is involved in the aging process and in negative regulation of the MAPK pathway [Schoentgen et al. [2020]]. The MAPK and SAPK/JNK signaling networks promote senescence (in vitro) and aging (in vivo, in animal models and human cohorts) in response to oxidative stress and inflammation [Papaconstantinou et al. [2019]]. The Rat Genome Database (RGD) indicates that the PEBP1 gene is implicated in prostate and ovarian cancers [Scholler et al. [2008]], suggesting some sex-specificity. The RGD database also reports the PEBP1 gene as a biomarker of Alzheimer's disease.

Among the top four probes based on the standard Lasso analysis, three were linked to regulatory genes. For example, TBX2 (linked to cg09365557) encodes a transcription factor that, when up-regulated, inhibits CDKN1A (p21), the gene regulated by the NEUROD1 gene (cg08327269 probe) discovered by gesso [Gene Cards Database]. CDKN1A is necessary for tissue senescence and, when compromised, leaves the tissue vulnerable to tumor-promoting signals. Probes cg09365557 (identified by the standard Lasso) and cg08327269 (identified by gesso) could both be uncovering the same biological process involving CDKN1A.

6. Discussion

We introduced a selection method for G×E interactions with the hierarchical main-effect-before-interaction property. We showed that existing packages for the hierarchical selection of interactions cannot handle the large number of predictors typical in studies with high-dimensional omic data. When considered separately, and not as a sub-case of a G×G analysis, the G×E case can be solved much more efficiently. Our proposed block coordinate descent algorithm is scalable to large numbers of predictors because of the custom screening rules we developed. We also showed in simulations that our model outperforms other hierarchical models. The implementation of our method is available in our R package gesso that can be downloaded from CRAN https://CRAN.R-project.org/package=gesso. Our algorithm can be extended to generalized linear models via iteratively reweighted least-squares.

The model can be generalized to include more than one environmental variable E. However, for the BCD algorithm to be efficient, the blocks have to be relatively small. When including multiple E variables, the size of the coordinate descent blocks grows linearly, with each block containing a main effect and all of its corresponding interactions with the environmental variables (βGj, βGj×E1, …, βGj×Eq). The bottleneck is to efficiently solve the system of equations resulting from the stationarity conditions, whose size grows exponentially with the number of E variables. For more than a handful of E variables, other algorithms, such as ADMM, could become more efficient.

Apart from the computational efficiency of the algorithm, memory considerations are critical for large-scale analyses. The gesso package allows users to analyze large genome-wide datasets that do not fit in RAM using the file-backed bigmemory [Kane, Emerson, and Weston [2013]] format. However, this involves more time-consuming data transfers between RAM and the external memory source. To handle datasets that exceed the available RAM more efficiently, the screening rules could be exploited for efficient batch processing [Qian et al. [2020]].


ACKNOWLEDGEMENTS

We sincerely thank Jacob Bien, Paul Marjoram, and David Conti for their helpful comments on this work.

FUNDING

Research reported in this paper was supported by NCI of the National Institutes of Health under award number P01CA196569, FIGI RO1 supported by NCI (R01CA201407), and T32 supported by NIEHS (T32ES013678).

APPENDIX

Appendix A: Proof of equivalence of relaxed and unconstrained models

We want to prove that models (4) and (2) are equivalent:

$$\begin{aligned} \underset{\beta_0,\, \beta_G^{\pm},\, \beta_E,\, \beta_{G\times E}}{\min} \quad & q(\beta_0, \beta_G^+ - \beta_G^-, \beta_E, \beta_{G\times E}) + \lambda_1 \sum_{i=1}^{p}(\beta_{G_i}^+ + \beta_{G_i}^-) + \lambda_2\|\beta_{G\times E}\|_1, \\ \text{subject to:} \quad & |\beta_{G_i\times E}| \le \beta_{G_i}^+ + \beta_{G_i}^-, \;\; \beta_{G_i}^{\pm} \ge 0 \quad \text{for } i = 1, \dots, p,\end{aligned}$$

and

$$\underset{\beta_0,\, \beta_G,\, \beta_E,\, \beta_{G\times E}}{\min} \;\; q(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) + \lambda_1\sum_{i=1}^{p}\max\left\{|\beta_{G_i}|, |\beta_{G_i\times E}|\right\} + \lambda_2\|\beta_{G\times E}\|_1.$$

Proof: Recall that $\beta_{G_i} = \beta_{G_i}^+ - \beta_{G_i}^-$ and $\beta_{G_i}^\pm \ge 0$; then $\beta_{G_i}^- = \beta_{G_i}^+ - \beta_{G_i}$ and $\beta_{G_i}^+ + \beta_{G_i}^- = 2\beta_{G_i}^+ - \beta_{G_i}$. Since $|\beta_{G_i\times E}| \le \beta_{G_i}^+ + \beta_{G_i}^- = 2\beta_{G_i}^+ - \beta_{G_i}$, we have $\beta_{G_i}^+ \ge \frac{|\beta_{G_i\times E}| + \beta_{G_i}}{2}$. Since $\beta_{G_i}^+ \ge 0$, $\beta_{G_i}^+ \ge \beta_{G_i}$, and $\beta_{G_i}^+ \ge \frac{|\beta_{G_i\times E}| + \beta_{G_i}}{2}$, we can write $\beta_{G_i}^+ \ge \max\left\{[\beta_{G_i}]^+, \frac{|\beta_{G_i\times E}| + \beta_{G_i}}{2}\right\}$, where $[\beta_{G_i}]^+ = \max\{\beta_{G_i}, 0\}$. Then we have that $\beta_{G_i}^+ + \beta_{G_i}^- = 2\beta_{G_i}^+ - \beta_{G_i} \ge \max\left\{2[\beta_{G_i}]^+ - \beta_{G_i}, |\beta_{G_i\times E}|\right\}$. Finally, we note that $2[\beta_{G_i}]^+ - \beta_{G_i} = |\beta_{G_i}|$. Hence $\beta_{G_i}^+ + \beta_{G_i}^- \ge \max\{|\beta_{G_i}|, |\beta_{G_i\times E}|\}$, and since we are solving a minimization problem we can replace $(\beta_{G_i}^+ + \beta_{G_i}^-)$ in model (4) by its minimum value $\max\{|\beta_{G_i}|, |\beta_{G_i\times E}|\}$, which completes the transformation from model (4) to (2) and concludes the proof of equivalence.

Appendix B: Solution to the optimization problem (13)

Consider an optimization problem:

$$\begin{aligned} \underset{x,\, \delta}{\text{maximize}} \quad & D(\nu_0) \\ \text{subject to} \quad & |\nu_0^T(G\times E)| \preceq \lambda_2 + \delta, \;\; |\nu_0^T G| \preceq \lambda_1 - \delta, \;\; \delta \succeq 0, \\ \text{where} \quad & \nu_0 = x\,\nu_{res}(\beta), \quad D(\nu) = \frac{n}{2}\left(\left\|\frac{Y}{n}\right\|^2 - \left\|\frac{Y}{n} - \nu\right\|^2\right). \end{aligned}$$

Let us denote $Y/n$ by $y$ and $\nu_{res}(\beta)$ by $\nu$. We have

$$\begin{aligned} \underset{x,\, \delta}{\text{minimize}} \quad & \|y - x\nu\|^2 \\ \text{subject to} \quad & |x\,\nu^T(G_j\times E)| \le \lambda_2 + \delta_j, \;\; |x\,\nu^T G_j| \le \lambda_1 - \delta_j, \;\; \delta_j \ge 0, \quad \text{for } j = 1, \dots, p. \end{aligned} \tag{19}$$

Solution

$$\frac{\partial D(\nu_0)}{\partial x} \propto (y - x\nu)^T\nu = 0 \;\Rightarrow\; y^T\nu - x\|\nu\|^2 = 0 \;\Rightarrow\; \hat x = \frac{y^T\nu}{\|\nu\|_2^2}.$$

Let us denote $|\nu^T(G\times E)_j|$ by $B_j$ and $|\nu^T G_j|$ by $A_j$. The feasible set for our optimization problem (19) is

$$\begin{cases} |x|\, B_j \le \lambda_2 + \delta_j \\ |x|\, A_j \le \lambda_1 - \delta_j \end{cases} \;\Leftrightarrow\; \begin{cases} |x| \le \dfrac{\lambda_2 + \delta_j}{B_j} \\ |x| \le \dfrac{\lambda_1 - \delta_j}{A_j}. \end{cases}$$

Then the optimal $\delta_j$ is the solution to $\frac{\lambda_2 + \delta_j}{B_j} = \frac{\lambda_1 - \delta_j}{A_j}$, because the solution to this equation maximizes the feasible set and hence provides the broadest set over which to find an optimal $x$ (Figure 5).

Fig. 5. Geometric solution to problem (19). The optimal $\delta_j$ maximizes the range of possible $x$ values.

$$\frac{\lambda_2 + \delta_j}{B_j} = \frac{\lambda_1 - \delta_j}{A_j} \;\Rightarrow\; A_j(\lambda_2 + \delta_j) = B_j(\lambda_1 - \delta_j) \;\Rightarrow\; \delta_j = \frac{B_j\lambda_1 - A_j\lambda_2}{B_j + A_j}.$$

We know that $\delta_j \in [0, \lambda_1]$ for all $j$ (section 1 of the supplementary materials). If $\delta_j > \lambda_1$, then $\hat\delta_j = \lambda_1$, and if $\delta_j < 0$, then $\hat\delta_j = 0$. Therefore,

$$\hat\delta_j = \max\!\big(0, \min(\delta_j, \lambda_1)\big).$$

We found $\hat x = \frac{y^T\nu}{\|\nu\|_2^2}$ that minimizes our objective function by the stationarity conditions. To be optimal it has to satisfy our feasibility conditions for every $j$:

$$\begin{cases} |x| \le \min_j \dfrac{\lambda_1 - \hat\delta_j}{A_j}, \\ |x| \le \min_j \dfrac{\lambda_2 + \hat\delta_j}{B_j}. \end{cases}$$

Then a feasible $x$ satisfies the inequality

$$|x| \le \min\left(\min_j \frac{\lambda_1 - \hat\delta_j}{A_j},\; \min_j \frac{\lambda_2 + \hat\delta_j}{B_j}\right).$$

Let us denote the right-hand side of the inequality by $M$. If $|\hat x| \le M$, then the optimal $x = \hat x$, and if $|\hat x| > M$, then the optimal $x = \text{sign}(\hat x)\, M$.
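
A base-R sketch (ours; names are our own) of the closed-form solution above: it computes $A_j$, $B_j$, the clipped $\hat\delta_j$, and the re-scaling factor $x$ for the optimal naive projection.

```r
optimal_naive_projection <- function(Y, G, E, nu_res, lambda1, lambda2) {
  n <- length(Y)
  y <- Y / n
  GxE <- G * E
  A <- abs(as.numeric(crossprod(G,   nu_res)))      # A_j = |nu^T G_j|
  B <- abs(as.numeric(crossprod(GxE, nu_res)))      # B_j = |nu^T (G_j x E)|
  delta <- (B * lambda1 - A * lambda2) / (B + A)    # equalizes the two bounds on |x|
  delta <- pmax(0, pmin(delta, lambda1))            # clip to the feasible interval [0, lambda1]
  x_hat <- sum(y * nu_res) / sum(nu_res^2)          # unconstrained optimum of x
  M <- min((lambda1 - delta) / A, (lambda2 + delta) / B)
  x <- if (abs(x_hat) <= M) x_hat else sign(x_hat) * M
  list(nu0 = x * nu_res, delta = delta, x = x)
}
```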

Appendix C: Safe rules for gesso

By the KKT conditions (8) and (7):

$$\begin{cases} |\hat\nu^T G_i| < \lambda_1 - \delta_i \\ |\hat\nu^T(G_i\times E)| < \lambda_2 + \delta_i \\ \delta_i \in [0, \lambda_1] \end{cases} \;\Rightarrow\; \hat\beta_{G_i} = \hat\beta_{G_i\times E} = 0.$$

Then,

$$\begin{cases} \max_{\nu\in B(c,r)}|\nu^T G_i| < \lambda_1 - \delta_i \\ \max_{\nu\in B(c,r)}|\nu^T(G_i\times E)| < \lambda_2 + \delta_i \\ \delta_i \in [0,\lambda_1] \end{cases} \Rightarrow\; \hat\beta_{G_i} = \hat\beta_{G_i\times E} = 0 \;\;\Leftrightarrow\;\; \begin{cases} r\|G_i\| + |G_i^T c| < \lambda_1 - \delta_i \\ r\|G_i\times E\| + |(G_i\times E)^T c| < \lambda_2 + \delta_i \\ \delta_i \in [0,\lambda_1] \end{cases} \Rightarrow\; \hat\beta_{G_i} = \hat\beta_{G_i\times E} = 0 \tag{20}$$

Let’s consider the last system of inequalities more closely:

$$\begin{cases} r\|G_i\| + |G_i^T c| < \lambda_1 - \delta_i \\ r\|G_i\times E\| + |(G_i\times E)^T c| < \lambda_2 + \delta_i \\ \delta_i \in [0, \lambda_1] \end{cases} \Leftrightarrow \begin{cases} \delta_i < \lambda_1 - r\|G_i\| - |G_i^T c| \\ \delta_i > r\|G_i\times E\| + |(G_i\times E)^T c| - \lambda_2 \\ \delta_i \in [0, \lambda_1] \end{cases} \Leftrightarrow \begin{cases} \delta_i < \lambda_1 - r\|G_i\| - |G_i^T c| \\ \delta_i > \max\left\{0,\; r\|G_i\times E\| + |(G_i\times E)^T c| - \lambda_2\right\} \end{cases}$$

Then,

A feasible $\delta_i$ satisfying (20) exists if and only if $\max\left\{0,\; r\|G_i\times E\| + |(G_i\times E)^T c| - \lambda_2\right\} < \lambda_1 - r\|G_i\| - |G_i^T c|$,

and, hence, the SAFE rules to discard (βGi,βGi×E) are:

$$\max\left\{0,\; r\|G_i\times E\| + |(G_i\times E)^T c| - \lambda_2\right\} < \lambda_1 - r\|G_i\| - |G_i^T c| \;\Rightarrow\; \hat\beta_{G_i} = \hat\beta_{G_i\times E} = 0. \tag{21}$$

Footnotes

SUPPLEMENTARY MATERIALS

Supplement to “A scalable hierarchical lasso for gene-environment interactions”: We include detailed derivations for the coordinate-wise solutions (section 1) and the dual formulation (section 2).

References

  1. Ayers KL, Cordell HJ (2010). "SNP selection in genome-wide and candidate gene studies via penalized logistic regression", Genet Epidemiol, 34(8):879–891.
  2. Bhatnagar S, Lovato A, Yang Y, Greenwood C (2018). "Sparse Additive Interaction Learning", bioRxiv 445304.
  3. Bien J, Taylor J, Tibshirani R (2013). "A lasso for hierarchical interactions", Ann. Statist, 41(3), 1111–1141.
  4. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011). "Distributed optimization and statistical learning via the alternating direction method of multipliers", Foundations and Trends in Machine Learning, 3(1):1–122.
  5. Bonnefoy A, Emiya V, Ralaivola L, Gribonval R (2014). "A dynamic screening principle for the lasso", In EUSIPCO.
  6. Boyd S, Vandenberghe L (2004). "Convex Optimization", Cambridge University Press.
  7. Chang BD, Watanabe K, Broude EV, Fang J, Poole JC, Kalinichenko TV, Roninson IB (2000). "Effects of p21Waf1/Cip1/Sdi1 on cellular gene expression: Implications for carcinogenesis, senescence, and age-related diseases", Proc. Natl. Acad. Sci, 97(8), 4291–4296.
  8. Chipman H (1996). "Bayesian Variable Selection with Related Predictors", The Canadian Journal of Statistics, 24(1), 17–36.
  9. Cox DR (1984). "Interaction", International Statistical Review, 52(1), 1–24.
  10. Efroymson MA (1960). "Multiple regression analysis", Mathematical Methods for Digital Computers, Wiley, New York.
  11. El Ghaoui L, Viallon V, Rabbani T (2012). "Safe feature elimination in sparse supervised learning", Pac. J. Optim, 8(4):667–698.
  12. Fercoq O, Gramfort A, Salmon J (2015). "Mind the duality gap: safer rules for the lasso", In ICML, pages 333–342.
  13. Figueiredo JC, Hsu L, Hutter CM, Lin Y, Campbell PT, Baron JA, Berndt SI, Jiao S, Casey G, Fortini B, Chan AT, Cotterchio M, Lemire M, Gallinger S, Harrison TA, Le Marchand L, Newcomb PA, Slattery ML, Caan BJ, Carlson CS, Zanke BW, Rosse SA, Brenner H, Giovannucci EL, Wu K, Chang-Claude J, Chanock SJ, Curtis KR, Duggan D, Gong J, Haile RW, Hayes RB, Hoffmeister M, Hopper JL, Jenkins MA, Kolonel LN, Qu C, Rudolph A, Schoen RE, Schumacher FR, Seminara D, Stelling DL, Thibodeau SN, Thornquist M, Warnick GS, Henderson BE, Ulrich CM, Gauderman WJ, Potter JD, White E, Peters U (2014). "Genome-wide diet-gene interaction analyses for risk of colorectal cancer", PLoS Genet, 10(4).
  14. Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization", Ann. Appl. Stat, 1(2), 302–332.
  15. Friedman J, Hastie T, Tibshirani R (2010). "Regularization Paths for Generalized Linear Models via Coordinate Descent", Journal of Statistical Software, 33(1).
  16. Friedman J, Hastie T, Tibshirani R (2009). "glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models".
  17. Gorgoulis VG, Pratsinis H, Zacharatos P, Demoliou C, Sigala F, Asimacopoulos PJ, Papavassiliou AG, Kletsas D (2005). "p53-Dependent ICAM-1 overexpression in senescent human cells identified in atherosclerotic lesions", Laboratory Investigation, 85(4), 502–511.
  18. Hannum G, Guinney J, Zhao L, Zhang L, Hughes G, Sadda S, Klotzle B, Bibikova M, Fan J-B, Gao Y, Deconde R, Chen M, Rajapakse I, Friend S, Ideker T, Zhang K (2013). "Genome-wide methylation profiles reveal quantitative views of human aging rates", Mol. Cell, 49, 359–367.
  19. Haris A, Witten D, Simon N (2016). "Convex Modeling of Interactions With Strong Heredity", J. Comput. Graph. Statist, 25(4), 981–1004.
  20. Jeck WR, Siebold AP, Sharpless NE (2012). "Review: a meta-analysis of GWAS and age-associated diseases", Aging Cell, 11(5):727–731.
  21. Kane M, Emerson J, Weston S (2013). "Scalable Strategies for Computing with Massive Data", Journal of Statistical Software, 55(14), 1–19.
  22. Lim M, Hastie T (2015). "Learning Interactions Through Hierarchical Group-Lasso Regularization", J. Comput. Graph. Statist, 24, 627–654.
  23. Liu J, Huang J, Zhang Y, Lan Q, Rothman N, Zheng T, Ma S (2013). "Identification of gene–environment interactions in cancer studies using penalization", Genomics, 102(4), 189–194.
  24. Massias M, Gramfort A, Salmon J (2017). "From safe screening rules to working sets for faster lasso-type solvers", 10th NIPS Workshop on Optimization for Machine Learning.
  25. Nelder JA (1977). "A Reformulation of Linear Models", Journal of the Royal Statistical Society, Series A, 140(1), 48–77.
  26. Papaconstantinou J (2019). "The Role of Signaling Pathways of Inflammation and Oxidative Stress in Development of Senescence and Aging Phenotypes in Cardiovascular Disease", Cells, 8, 1383.
  27. Qian J, Tanigawa Y, Du W, Aguirre W, Chang C, Tibshirani R, Rivas MA, Hastie T (2020). "A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank", bioRxiv 630079.
  28. Schoentgen F, Jonic S (2020). "PEBP1/RKIP behavior: a mirror of actin-membrane organization", Cell. Mol. Life Sci, 77, 859–874.
  29. Scholler N, Gross JA, Garvik B, et al. (2008). "Use of cancer-specific yeast-secreted in vivo biotinylated recombinant antibodies for serum biomarker discovery", J. Transl. Med, 6, 41.
  30. Ternes N, Rotolo F, Heinze G, Michiels S (2017). "Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces", Biometrical Journal, 59(4):685–701.
  31. Tibshirani R (1996). "Regression shrinkage and selection via the lasso", J. R. Stat. Soc. Ser. B. Stat. Methodol, 58:267–288.
  32. Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization", J. Optim. Theory Appl, 109(3), 475–494.
  33. Wu C, Jiang Y, Ren J, Cui Y, Ma S (2017). "Dissecting gene-environment interactions: A penalized robust approach accounting for hierarchical structures", Stat. Med, 37(3), 437–456.
  34. Wu M, Zhang Q, Ma S (2020). "Structured gene-environment interaction analysis", Biometrics, 76(1), 23–35.
  35. Yuan M, Lin Y (2007). "Model selection and estimation in regression with grouped variables", J. R. Stat. Soc. Ser. B. Stat. Methodol, 68(1), 49–67.
  36. Zou H, Hastie T (2005). "Regularization and variable selection via the elastic net", J. R. Stat. Soc. Ser. B. Stat. Methodol, 67:301–320.
  37. Zhao P, Rocha G, Yu B (2009). "The composite absolute penalties family for grouped and hierarchical variable selection", The Ann. Statist, 37(6A), 3468–3497.
