Abstract
We describe a regularized regression model for the selection of gene-environment (G×E) interactions. The model focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure. We propose an efficient fitting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy. We present simulation results showing that the model outperforms existing joint selection methods for (G×E) interactions in terms of selection performance, scalability and speed, and provide a real data application. Our implementation is available in the gesso R package.
Keywords: hierarchical variable selection, joint analysis, screening rules
1. Introduction
The problem of testing for the interaction between a single established predictor and a large number of candidate predictors arises in several contexts. A common setting is scanning for interactions in genome-wide association studies (GWAS), where the goal is to identify interactions between an established environmental risk factor (e.g. processed meat intake) and a large number of single nucleotide polymorphisms (SNPs), in relation to an outcome of interest (e.g. colorectal cancer) [Figueiredo et al. [2014]]. Since many complex diseases have been linked to both genetic and environmental risk factors, identifying gene-environment (G×E) interactions, i.e. joint genetic and environmental effects beyond their component main effects, is of great interest. But the established predictor need not be an environmental risk factor and the candidate predictors can be other omic features like methylation and gene expression levels. For example, in a clinical trial setting, testing for the interaction between treatment and gene expression biomarkers can lead to the identification of subgroups with differential responses to a drug [Ternes et al. [2017]].
In this paper we focus on the problem of scanning a large number of possible interactions with a fixed predictor. This is in contrast to the related but more challenging problem of exhaustively scanning all possible pairwise interactions within a set of predictors. Throughout the paper we will refer to the former as the gene-environment interaction selection problem, and we will use language specific to this context even though the proposed method is completely general and applies to any setting where the analytical goal is to identify interactions with a designated variable of special interest. Similarly, we will refer to the exhaustive pairwise testing problem as the gene-gene (G×G) interaction selection problem.
The dominant paradigm for genome-wide interaction scans (GWIS) is to test each genetic marker one at a time for interaction with the risk factor of interest, applying a stringent multiple testing correction to account for the number of tests performed. However, a joint analysis that simultaneously takes into account the effects of all markers is preferable to a one-at-a-time analysis where each marker is considered separately. Variables with weak effects might be more readily identifiable when the model has been adjusted for other causal predictors, and false positives may be reduced by the inclusion of stronger true causal predictors in the model [Ayers and Cordell [2010]]. Since GWAS data is high-dimensional (the number of SNPs is typically on the order of millions while the sample size is in the tens of thousands), a joint analysis of all markers using standard (unpenalized) multiple regression methods is not feasible. General regularized regression methods such as the Lasso [Tibshirani [1996]] and Elastic Net [Zou and Hastie [2005]] that are suitable for high-dimensional data and also perform variable selection can be used for G×E identification but, because they do not exploit the hierarchical structure of the G×E problem, can perform suboptimally. A main-effect-before-interaction hierarchical structure ensures that the final selected model only includes interactions when the corresponding main effects have been selected. This enhances the interpretability of the final model [Nelder [1977], Cox [1984]] and increases the ability to detect interactions by reducing the search space [Chipman [1996]].
Methods for the G×G selection problem that exploit an interaction hierarchical structure, like FAMILY [Haris, Witten and Simon [2016]], glinternet [Lim and Hastie [2015]], and hierNet [Bien et al. [2013]], and their corresponding R implementations can in principle be applied to the G×E case, but they are optimized for the symmetric G×G case, which results in vastly suboptimal performance in terms of run-time and scalability (they can handle at most a few hundred predictors) for the G×E case. Indeed, the structure of the G×E selection task is simpler because 1) the dimensionality of the problem grows linearly with additional environmental variables as opposed to quadratically for the G×G selection task, and 2) interactions with a single variable lead to a block-separable optimization problem, which, unlike the G×G case, can be efficiently solved using a block coordinate descent algorithm. For these reasons, the G×E selection problem is amenable to efficient implementations for large-scale analysis (e.g. genome-wide). However, efficient joint selection methods specific to G×E in large-scale applications have not been developed.
Liu et al. [2013] and Wu et al. [2017] adopted sparse group penalization approaches to accelerated failure time models for hierarchical selection of G×E interactions, but their approaches do not scale to datasets with a very large number of predictors without additional pre-screening procedures.
In this paper we present the gesso (from G(by)E(la)sso) model for the hierarchical modeling of interaction terms. We present an efficient fitting algorithm for the gesso model and powerful new screening rules that eliminate a large number of variables beforehand, making joint G×E analyses feasible at genome-wide scale.
The paper is organized as follows. We first review the idea behind a hierarchical structure for interactions and present the gesso model. In section 3 we introduce screening rules and an adaptive convergence procedure we developed and incorporated into a block coordinate descent algorithm. We describe simulations in section 4 and a real data application in section 5 to demonstrate the applicability of the gesso model to large high-dimensional datasets and the scalability of our algorithm.
2. Methods
2.1. Hierarchical structure
The standard linear model for G×E interactions with a single environmental exposure includes all interaction product terms between genetic variables and the environmental factor in addition to their marginal effects:
Y = β0 + βE E + Σi=1…p βGi Gi + Σi=1…p βGi×E (Gi × E) + ε    (1)
Here Y is a quantitative outcome of interest, G is an n × p matrix of genotypes, Gi is the column of G corresponding to the i-th genotype, E is the vector of environmental measurements of size n, Gi × E denotes the element-wise product of Gi and E, βG = (βG1, …, βGp) and βE are the main effects, and βGi×E are the interaction effects.
A strong hierarchical structure implies that if either the genetic or the environmental main effect is equal to zero, the corresponding interaction term has to be zero as well. In the G×E context, the environmental predictor E is usually chosen because it has been previously identified as a risk factor. Thus, there is no need to maintain a hierarchical constraint with respect to the environmental effect, as it is known to have a main effect on the outcome. The strong hierarchical structure then reduces to the requirement that βGi×E ≠ 0 only if βGi ≠ 0.
There are several ways to impose a hierarchical structure on the regression coefficients of model (1). One approach is forward selection [Efroymson [1960]], a procedure that iteratively considers adding the “best” variable to the model, ensuring that an interaction term can be added only if its main effects have already been added to the model on a previous iteration. Since step-wise selection is a greedy algorithm, it tends to underperform compared to global optimization approaches [Ayers and Cordell [2010]]. Another way is to reparameterize each interaction coefficient as the product of the corresponding main effect coefficient and an introduced model parameter γi [Bhatnagar et al. [2018], Wu et al. [2020]], but this results in a non-convex objective function. Regularization can also be used to impose a desired hierarchical structure [Zhao et al. [2009]] and this is the approach we follow.
2.2. The gesso model
Denote the mean square error loss function for a G×E problem by
L(β0, βE, βG, βG×E) = 1/(2n) ‖Y − β0 − βE E − Σi βGi Gi − Σi βGi×E (Gi × E)‖²₂.
We are interested in hierarchical selection of the Gi × E interaction terms that are associated with the outcome Y. We propose the following model that we call gesso (G(by)E(la)sso):
minimize over β0, βE, βG, βG×E:   L(β0, βE, βG, βG×E) + λ1 Σi=1…p max(|βGi|, |βGi×E|) + λ2 Σi=1…p |βGi×E|    (2)
The model has several important properties. First, it guarantees the desired hierarchical relationships between genetic main effects and interaction effects, since the penalty satisfies the overlapping group hierarchical principle [Zhao et al. [2009]]. Having βGi×E in every group where βGi is present ensures that once βGi deviates from zero, the additional group penalty for βGi×E becomes close to zero by the properties of group lasso models. In addition, having βGi×E in a group of its own makes it possible for βGi to deviate from zero while βGi×E remains zero. Second, the regularized objective function is convex, so we can take advantage of convex optimization theory and algorithms that ensure convergence to a global optimal solution. Moreover, the block-separable structure of the problem, with small block sizes, allows for highly efficient solvers. Third, we include two tuning parameters in the model, λ1 and λ2, enabling flexible and decoupled data-dependent control over the group and interaction penalties. Lastly, the group L∞ penalty has a connection to a Lasso model with hierarchical constraints that we discuss next.
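To make the penalty concrete, it can be evaluated directly from a pair of coefficient vectors. The base-R sketch below assumes the group L∞ plus interaction-L1 form written above; beta_g and beta_gxe are illustrative coefficient vectors, not output from the package.

```r
# Illustrative sketch: evaluate the gesso penalty described above, assuming the
# group L-infinity plus interaction-L1 form; beta_g and beta_gxe are
# hypothetical main-effect and interaction coefficient vectors of length p.
gesso_penalty <- function(beta_g, beta_gxe, lambda1, lambda2) {
  group_linf <- sum(pmax(abs(beta_g), abs(beta_gxe)))   # per-SNP group L-infinity norm
  interaction_l1 <- sum(abs(beta_gxe))                  # extra sparsity pressure on interactions
  lambda1 * group_linf + lambda2 * interaction_l1
}

# Once a main effect is non-zero, an interaction of smaller magnitude adds only
# the lambda2 term, which is the hierarchy-inducing behavior described above.
gesso_penalty(beta_g = c(1, 0), beta_gxe = c(0.5, 0), lambda1 = 0.1, lambda2 = 0.05)
```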
2.3. Connection to the hierarchical Lasso
Bien et al. [2013] proposed a Lasso model with hierarchical constraints for all pairwise interactions (the G×G selection problem) and demonstrated its advantages over the standard Lasso. The authors showed that the hierarchical Lasso model is equivalent to an unconstrained overlapping group lasso model with an L∞ group norm. Following Bien et al., it can be shown that our proposed model (2) is equivalent to a constrained model (4). We describe the intuition behind the constrained version of the model (4) below and provide a proof of equivalence of the two models in Appendix A.
Lasso with G×E Hierarchical Constraints
To impose the desired hierarchical structure, the constraints |βGi×E| ≤ |βGi|, i = 1, …, p, can be added to the standard Lasso model. These ensure the main-effect-before-interaction property: βGi×E ≠ 0 only if βGi ≠ 0. Furthermore, the model makes the implicit (and arguably reasonable) assumption that important interactions have large main effects. When this assumption is met, the model will be more powerful in detecting interactions. Unfortunately, the constraint set above is non-convex and yields the non-convex problem:
minimize over β0, βE, βG, βG×E:   L(β0, βE, βG, βG×E) + λ1 Σi |βGi| + λ2 Σi |βGi×E|   subject to |βGi×E| ≤ |βGi|, i = 1, …, p    (3)
In order to transform the non-convex optimization problem (3) into a convex one, we decompose βG as βG = βG⁺ − βG⁻ and |βG| as βG⁺ + βG⁻, where βG⁺ ⪰ 0 and βG⁻ ⪰ 0. The non-convex constraints are then replaced by the convex constraints |βGi×E| ≤ βGi⁺ + βGi⁻. Note that |βGi| = βGi⁺ + βGi⁻ only if βGi⁺ βGi⁻ = 0, so removing the latter condition results in a relaxed formulation that is not equivalent to the original non-convex problem. Substituting βG⁺ − βG⁻ for βG and βG⁺ + βG⁻ for |βG| in (3) we obtain a convex relaxation of the model:
minimize over β0, βE, βG⁺, βG⁻, βG×E:   L(β0, βE, βG⁺ − βG⁻, βG×E) + λ1 Σi (βGi⁺ + βGi⁻) + λ2 Σi |βGi×E|   subject to |βGi×E| ≤ βGi⁺ + βGi⁻, βGi⁺ ≥ 0, βGi⁻ ≥ 0, i = 1, …, p    (4)
Now the constraint set is convex and the optimization problem is convex as well. The relaxed constraints |βGi×E| ≤ βGi⁺ + βGi⁻ are less restrictive than the original |βGi×E| ≤ |βGi|. In particular, the model can yield a large estimate of βGi×E and a moderately sized estimate of βGi by making both βGi⁺ and βGi⁻ large.
Examining the equivalent formulations for gesso (unconstrained group L∞ model (2) and constrained model (4)) we can see that the group L∞ norm corresponds to the constraints on the effect sizes of the interactions and the main effects. Intuitively, because of the connection between the relaxed model (4) and model (3), (4) will be more powerful for detecting interactions in the case when important interactions have large main effects. This is confirmed by our simulation further below.
Additionally, the constrained formulation of gesso (4) allows for a simpler, interpretable form for the coordinate-wise solutions, the dual problem, and the development of the screening rules.
3. Block coordinate descent algorithm for gesso
Friedman et al. [2007] have proposed using cyclic coordinate descent for solving convex regularized regression problems involving L1 and L2 penalties and their combinations. The coordinate descent algorithm is particularly advantageous when each iteration involves only fast analytic updates. In addition, screening rules that exploit the sparse structure induced by the penalties can be readily incorporated into the algorithm to eliminate a large number of variables beforehand, making it much faster than alternative convex optimization algorithms.
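For intuition, a minimal cyclic coordinate descent for the plain Lasso, the setting of Friedman et al., is sketched below in base R; in gesso the scalar soft-thresholding update is replaced by a closed-form update over each (βGj, βGj×E) block, so this is an analogy rather than the gesso solver itself.

```r
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Minimal cyclic coordinate descent for the plain Lasso:
#   min_beta 1/(2n) ||y - X beta||^2 + lambda ||beta||_1
# Column scales are handled explicitly via col_ss. gesso replaces the scalar
# update below with a closed-form update over each (beta_Gj, beta_GjxE) block.
lasso_cd <- function(X, y, lambda, max_iter = 200, tol = 1e-6) {
  n <- nrow(X); p <- ncol(X)
  beta <- numeric(p)
  r <- y                                   # residuals for beta = 0
  col_ss <- colSums(X^2) / n               # pre-computed column scales
  for (iter in seq_len(max_iter)) {
    max_diff <- 0
    for (j in seq_len(p)) {
      zj <- sum(X[, j] * r) / n + col_ss[j] * beta[j]    # partial-residual correlation
      new_bj <- soft_threshold(zj, lambda) / col_ss[j]
      if (new_bj != beta[j]) {
        r <- r - X[, j] * (new_bj - beta[j])             # keep residuals up to date
        max_diff <- max(max_diff, abs(new_bj - beta[j]))
        beta[j] <- new_bj
      }
    }
    if (max_diff < tol) break
  }
  beta
}
```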
For convex block-separable functions, convergence of the coordinate descent algorithm to a global minimum is guaranteed [Tseng [2001]]. The gesso model has a convex objective function with a smooth loss component and a non-smooth separable penalty component, where each block consists of βGj and βGj×E. Thus, the model can be fitted using a block coordinate descent (BCD) algorithm and convergence to a global optimal solution is guaranteed. Briefly, BCD optimizes the objective by cycling through the coordinate blocks 1,…, p and minimizing the objective along each coordinate block direction while keeping all other blocks fixed at their most current values. The coordinate-wise updates for model (4) can be obtained by working with the Lagrangian version of the model. The derivations are provided in the supplementary materials (section 1). The rest of the section focuses on new efficient screening rules we developed specifically for gesso.
3.1. Dual formulation of gesso
Consider the following primal formulation of the gesso problem obtained by substituting the residuals for a new variable z:
minimize over z, β0, βE, βG⁺, βG⁻, βG×E:   1/(2n) ‖z‖²₂ + λ1 Σi (βGi⁺ + βGi⁻) + λ2 Σi |βGi×E|   subject to z = Y − β0 − βE E − G(βG⁺ − βG⁻) − (G × E)βG×E, |βGi×E| ≤ βGi⁺ + βGi⁻, βG⁺ ⪰ 0, βG⁻ ⪰ 0    (5)
The ⪰ symbol denotes an element-wise comparison (βGj⁺ ≥ 0, βGj⁻ ≥ 0 for j = 1,…, p), and G × E is the matrix whose columns are the interaction vectors Gi × E. In order to formulate the dual problem we introduce the dual variables ν, associated with the constraint on z, and δi ≥ 0, associated with the constraints |βGi×E| ≤ βGi⁺ + βGi⁻. Substituting the residuals for a new variable in the primal formulation is a common approach for deriving a dual formulation. For simplicity, we denote the linear predictor by Xβ. Based on the alternative primal formulation the dual takes the simple form below (section 2 of the supplementary materials):
maximize D(ν) over (ν, δ) ∈ ΔF,
where we denote the dual feasible region as ΔF and the dual objective as D(ν) = νᵀY − (n/2) ‖ν‖²₂.
It follows that the optimal solution ν̂ of the dual problem can be viewed as a projection of Y/n onto the dual feasible set ΔF:
ν̂ = argmin over feasible ν of ‖ν − Y/n‖₂.
And from the stationarity conditions we establish the following important relationship:
ν̂ = (Y − Xβ̂)/n    (6)
The optimal dual variable equals the residuals scaled by the number of observations. Equation (6) defines the link between the primal and the dual optimal solutions. The stationarity and complementary slackness conditions for βG⁺, βG⁻, and βG×E (section 2 of the supplementary materials) lead to the following important consequences for the primal and dual optimal variables:
|ν̂ᵀGi| < λ1 − δi  ⟹  β̂Gi = 0 and β̂Gi×E = 0    (7)
|ν̂ᵀ(Gi × E)| < λ2 + δi  ⟹  β̂Gi×E = 0    (8)
Conditions (7), (8) form the basis for the screening rules we develop later in this section. Specifically, we exploit the above conditions to identify null predictors and avoid spending time cycling through them in the BCD algorithm. This leads to a substantial computational speed-up, especially for high-dimensional sparse problems.
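As a small illustration, the candidate dual point used throughout this section is simply the scaled residual vector from the primal-dual link (6), and screening revolves around the correlations appearing in conditions (7)-(8). A base-R sketch, where X is the full design matrix and the inputs are illustrative:

```r
# Candidate dual point from the primal-dual link (6): residuals scaled by n.
# X is the full design matrix (intercept, E, main-effect and interaction
# columns) and beta the current primal estimate; both are illustrative inputs.
dual_point_from_residuals <- function(Y, X, beta) {
  (Y - drop(X %*% beta)) / length(Y)
}

# Correlations entering the screening conditions, computed for all blocks at once.
screening_correlations <- function(nu, G, E) {
  main  <- abs(drop(crossprod(G, nu)))        # |nu^T G_i|
  inter <- abs(drop(crossprod(G * E, nu)))    # |nu^T (G_i x E)|
  cbind(main = main, interaction = inter)
}
```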
3.2. Screening rules
Screening rules are used to identify predictors that are strongly associated with the outcome and screen out those that are likely to be null. Incorporation of the screening rules to the coordinate descent algorithm can greatly improve computational speed, making large but sparse high-dimensional problems computationally tractable. In this section we first describe the SAFE screening rules for gesso following the principles of the Lasso SAFE rules [El Ghaoui et al. [2012]], which form the stepping stone for the more efficient screening rules we developed and focus on later.
3.2.1. SAFE rules for gesso
El Ghaoui et al. [2012] proposed the SAFE screening rules for the Lasso model that guarantee a coefficient will be zero in the solution vector. In this section we derive SAFE rules to screen predictors for the gesso model.
By the KKT conditions (7) and (8)
|ν̂ᵀGi| < λ1 − δi  ⟹  β̂Gi = β̂Gi×E = 0,   |ν̂ᵀ(Gi × E)| < λ2 + δi  ⟹  β̂Gi×E = 0    (9)
for any primal optimal variables β̂Gi and β̂Gi×E, dual optimal variable ν̂, and dual feasible variable δi (for a fixed optimal ν̂ any feasible δi would also be optimal, since the dual objective function does not depend on δ). The problem is that we do not know the dual optimal variable ν̂.
The idea is to create upper bounds for |ν̂ᵀGi| and |ν̂ᵀ(Gi × E)| that are easy to compute. In particular, consider a dual feasible variable ν0 and denote D(ν0) as γ. Let Θ = {ν: D(ν) ≥ γ}. As D(ν̂) ≥ D(ν0) = γ we have ν̂ ∈ Θ and, hence, |ν̂ᵀGi| ≤ maxν∈Θ |νᵀGi|. The same is true for the interaction terms: |ν̂ᵀ(Gi × E)| ≤ maxν∈Θ |νᵀ(Gi × E)|, so maxν∈Θ |νᵀGi| and maxν∈Θ |νᵀ(Gi × E)| are the desired upper bounds for |ν̂ᵀGi| and |ν̂ᵀ(Gi × E)| respectively. The above arguments lead to the following rules:
maxν∈Θ |νᵀGi| < λ1 − δi  ⟹  β̂Gi = β̂Gi×E = 0    (10)
maxν∈Θ |νᵀ(Gi × E)| < λ2 + δi  ⟹  β̂Gi×E = 0    (11)
Note that the region Θ = {ν: D(ν) ≥ γ} is equivalent to {ν: ‖ν − Y/n‖²₂ ≤ ‖Y/n‖²₂ − 2γ/n}, which is the equation of a Euclidean ball centered at Y/n. Note that ν̂ ∈ Θ. We denote this ball as B(c, r), where c = Y/n and r = sqrt(‖Y/n‖²₂ − 2γ/n). As a consequence, the upper bounds we constructed are equivalent to the following optimization problems: maxν∈B(c,r) |νᵀGi| and maxν∈B(c,r) |νᵀ(Gi × E)|, and the region Θ is the ball B(c, r), which has two important properties:
it contains the optimal dual solution ν̂,
it results in closed form solutions for the desired upper bounds (10) and (11).
Specifically, for a general vector a the solution to maxν∈B(c,r) |νᵀa| is |cᵀa| + r‖a‖₂. Leaving the derivations to Appendix C, the SAFE rules to discard (β̂Gi, β̂Gi×E) are given by:
|cᵀGi| + r‖Gi‖₂ < λ1 − δi  ⟹  β̂Gi = β̂Gi×E = 0,   |cᵀ(Gi × E)| + r‖Gi × E‖₂ < λ2 + δi  ⟹  β̂Gi×E = 0    (12)
To complete the construction of the SAFE rules, we need to find a dual feasible point ν0 to calculate r = r(ν0). We can naturally obtain a dual point given the current estimate β via the primal-dual link from the stationarity conditions (6). Denote νres(β) = (Y − Xβ)/n, where res stands for residuals. In general, νres(β) will not necessarily be feasible, thus, we re-scale νres(β) to ensure it is in the feasible region. For example, for δ = 0 we can consider a re-scaling factor x so that ν0 = xνres(β) is feasible. We call the value xνres(β) a naive projection of νres. In the next section we show an alternative way to re-scale νres(β) to obtain a better feasible point.
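The ball-based upper bounds are cheap to evaluate once c and r are available. The sketch below assumes the ball B(c, r) with c = Y/n and the radius defined by D(ν0) as above; the thresholds that actually discard a block are the gesso-specific ones in (12) and are not reproduced here.

```r
# Upper bounds max_{nu in B(c, r)} |nu^T a| = |c^T a| + r * ||a||_2, evaluated
# for every main-effect and interaction column at once. c_center and r come
# from the SAFE ball described above; the discard thresholds themselves follow
# the gesso-specific rules (12) and are not shown.
safe_upper_bounds <- function(c_center, r, G, E) {
  GxE <- G * E
  ub_main  <- abs(drop(crossprod(G,   c_center))) + r * sqrt(colSums(G^2))
  ub_inter <- abs(drop(crossprod(GxE, c_center))) + r * sqrt(colSums(GxE^2))
  cbind(main = ub_main, interaction = ub_inter)
}
```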
3.2.2. Optimal naive projection
The better we choose our feasible ν0 and δ (i.e. the closer they are to the optimal solution) the tighter our upper bounds for |ν̂ᵀGi| and |ν̂ᵀ(Gi × E)| on the set Θ = {ν : D(ν) ≥ D(ν0)} are. In the previous re-scaling of νres(β) we set the dual variable δ equal to 0 for simplicity when we constructed a dual feasible point ν0. However, we can find a point ν0 closer to the optimal by changing δ to our advantage.
Formally, we want to find a scalar x such that ν0 = xνres(β) is feasible and which maximizes D(ν0), or equivalently minimizes the distance ‖ν0 − Y/n‖₂. This leads to the following optimization problem with respect to x and δ:
minimize over x ∈ ℝ, δ ⪰ 0:   ‖xνres(β) − Y/n‖²₂   subject to xνres(β) being dual feasible given δ    (13)
We present the closed-form solution to the optimization problem (13) in Appendix B, which we use to obtain a dual feasible variable ν0 (ν0 = xνres(β)) that we require for the SAFE rules (12).
3.2.3. Warm starts and dynamic screening
The penalty parameters λ1 and λ2 will typically be tuned by cross-validation. In practice, this requires fitting the model for a grid of λ1 and λ2 values. A common choice is a logarithmic grid of 30 to 100 consecutive points. The maximum grid value can be determined from the stationarity KKT conditions. When solutions are computed along such a sequence or path of tuning parameters, warm starting is a standard approach to reduce the number of coordinate descent iterations required to achieve convergence. Warm starting refers to using the solution computed at the previous point of the tuning parameter grid to initialize the coefficients β(k) for the coordinate descent algorithm at the next point.
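A warm-start loop over a two-dimensional grid can be sketched as follows. The single-point solver is passed in as a function argument and is a placeholder for the gesso block coordinate descent; the particular nesting of the two grids is an illustrative choice, not necessarily the one used in the package.

```r
# Warm starts along a two-dimensional (lambda1, lambda2) path: each fit is
# initialized at the previously computed solution. `solver` is any function
# solver(lambda1, lambda2, beta_init) returning a coefficient vector; it stands
# in for the gesso block coordinate descent solver.
fit_path_warm <- function(lambda1_grid, lambda2_grid, solver, p_total) {
  fits <- list()
  beta_warm <- numeric(p_total)                  # zero start at the largest penalties
  for (l1 in sort(lambda1_grid, decreasing = TRUE)) {
    for (l2 in sort(lambda2_grid, decreasing = TRUE)) {
      beta_warm <- solver(l1, l2, beta_warm)     # warm start from the previous fit
      fits[[paste(l1, l2, sep = "_")]] <- beta_warm
    }
  }
  fits
}
```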
The idea behind dynamic screening [Bonnefoy et al. [2014]] is to iteratively improve the dual feasible point ν0 during the coordinate descent updates. The residuals Y − Xβ are updated at each iteration i of the coordinate descent algorithm, and since the coordinate descent algorithm guarantees convergence, we have that
νres(β(i)) = (Y − Xβ(i))/n  →  ν̂  as  i → ∞    (14)
Recall that our SAFE rules depend on the feasible point ν0 through the radius of the spherical region on which we base the upper bounds: r(ν0) = sqrt(‖Y/n‖²₂ − 2D(ν0)/n). Also, recall that ν̂ is the closest feasible point to the center Y/n. Hence, having ν0 closer to the optimal ν̂ reduces the radius of the sphere and ensures tighter upper bounds for |ν̂ᵀGi| and |ν̂ᵀ(Gi × E)|. This, in turn, ensures better SAFE rules that are able to discard more variables. By performing the SAFE rules screening not only at the beginning of each new iteration λk = (λ1, λ2)k, but all along the iterations i of the coordinate descent, we iteratively improve the estimate of ν̂ with ν0(i) and consequently keep improving our SAFE rules.
The proposed procedure is the following: at each iteration i of the algorithm use the current residuals Y − Xβ(i) to obtain a current estimate of the dual variable, naively project it onto the feasible set, and apply the SAFE rules (12). Figure 1(a) illustrates the iterative process of constructing the SAFE spherical regions. One concern is how expensive it is to compute the naive projection and the SAFE rules at each iteration, which we address in detail in section 3.2.6.
Fig. 1.
(a) Dynamic SAFE regions, (b) dynamic GAP SAFE regions.
However, there are clear disadvantages to our current choice of center c = Y/n and radius r(ν0). Even when the dual feasible variable ν0 converges to the optimal value ν̂, the ball does not shrink around the dual optimal variable, since the radius does not converge to zero and the center is static, indicating that our upper bounds for |ν̂ᵀGi| and |ν̂ᵀ(Gi × E)| remain loose (Figure 1(a)). The Gap SAFE rules proposed by Fercoq et al. [2015] elegantly address the above disadvantages. We present the ideas behind the Gap SAFE rules and their application to the gesso model in the next section.
3.2.4. Gap SAFE rules for gesso
Denote the primal objective function for the gesso model as P(β):
P(β) = 1/(2n) ‖Y − Xβ‖²₂ + λ1 Σi max(|βGi|, |βGi×E|) + λ2 Σi |βGi×E|    (15)
and the dual objective as D(ν):
D(ν) = νᵀY − (n/2) ‖ν‖²₂    (16)
The duality gap, denoted as Gap(β, ν), is the difference between the primal and the dual objectives, Gap(β, ν) = P(β) − D(ν). For the optimal solutions we have P(β̂) = D(ν̂) and Gap(β̂, ν̂) = 0. By weak duality D(ν) ≤ P(β̂) for any dual feasible ν. The duality gap therefore provides an upper bound for the suboptimality gap: P(β) − P(β̂) ≤ Gap(β, ν). Therefore, given a tolerance ϵ > 0, if at iteration t of the BCD algorithm we can construct β(t) and a dual feasible ν(t) such that Gap(β(t), ν(t)) ≤ ϵ, then β(t) is guaranteed to be an ϵ-optimal solution of the primal problem. Note that in order to use the duality gap at iteration t as a stopping criterion we need a dual feasible point ν(t). We, again, utilize the naive projection method and obtain ν(t) = xνres(β(t)).
Because D(ν) is a quadratic and strongly concave function with concavity modulus n [Boyd and Vandenberghe [2004]] we have
(n/2) ‖ν(t) − ν̂‖²₂ ≤ D(ν̂) − D(ν(t)) ≤ P(β(t)) − D(ν(t)) = Gap(β(t), ν(t)),
where the second inequality follows from weak duality and the optimality of ν̂. Thus,
‖ν(t) − ν̂‖₂ ≤ sqrt(2 Gap(β(t), ν(t)) / n)    (17)
Gap SAFE rules work with the Euclidean ball with center c = ν(t) and radius r = sqrt(2 Gap(β(t), ν(t))/n), which we denote as BGap(c, r). This is a valid region since ν̂ ∈ BGap(c, r) by (17). An important consequence of the above construction is that when β(t) → β̂, then ν(t) → ν̂ via the primal-dual link (6, 14) and Gap(β(t), ν(t)) → 0. Figure 1(b) illustrates that the dynamic approach discussed in the previous section in combination with the Gap SAFE rules naturally results in improving the upper bounds maxBGap(c,r) |νᵀGi| and maxBGap(c,r) |νᵀ(Gi × E)|, since the radius of the Gap ball converges to zero and its center converges to the dual optimal point. The Gap SAFE rules follow from substituting c = ν(t) and r = sqrt(2 Gap(β(t), ν(t))/n) in (12).
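Putting the pieces together, the duality gap and the Gap-ball radius can be computed as below, assuming the primal and dual objectives (15)-(16) as written above; nu must already be dual feasible (e.g. the naive projection of the scaled residuals), and the column ordering of X is an illustrative assumption.

```r
# Duality gap and Gap-ball radius r = sqrt(2 * Gap / n), assuming objectives
# (15)-(16) as above. `nu` must be a dual-feasible point. X is assumed to have
# columns ordered as [unpenalized terms, G, G x E] (illustrative assumption).
primal_objective <- function(Y, X, beta_other, beta_g, beta_gxe, lambda1, lambda2) {
  fitted <- drop(X %*% c(beta_other, beta_g, beta_gxe))
  loss <- sum((Y - fitted)^2) / (2 * length(Y))
  loss + lambda1 * sum(pmax(abs(beta_g), abs(beta_gxe))) + lambda2 * sum(abs(beta_gxe))
}

dual_objective <- function(Y, nu) sum(nu * Y) - length(Y) / 2 * sum(nu^2)

gap_ball <- function(Y, X, beta_other, beta_g, beta_gxe, nu, lambda1, lambda2) {
  gap <- primal_objective(Y, X, beta_other, beta_g, beta_gxe, lambda1, lambda2) -
         dual_objective(Y, nu)
  list(gap = gap, radius = sqrt(2 * max(gap, 0) / length(Y)))
}
```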
3.2.5. Working set strategy
Massias et al. [2017] proposed to use a working set approach with the Gap SAFE rules to achieve substantial speedups over state-of-the-art Lasso solvers, including glmnet. The working set strategy involves two nested iteration loops. In the outer loop, a set of predictors Wt ⊂ {1,…, p} is defined, called a working set (WS). In the inner loop, the coordinate descent algorithm is launched to solve the problem restricted to XWt (i.e. considering only the predictors in Wt).
We adopt the Gap SAFE rules we developed for the gesso model to incorporate the proposed working set strategy. Recall the SAFE rules we constructed in (12):
By simply rearranging the terms in the inequality we get:
Define the left-hand side of the above inequality as di. Note that our SAFE rules (12) are equivalent to di > r.
The idea is that di now represents a score for how likely the predictor is to be zero or non-zero based on the Gap SAFE rules. Predictors for which di is small are more likely to be non-zero and, conversely, predictors with larger values of di are more likely to be zero, up to the point when di > r and the corresponding predictor is exactly zero by the SAFE rules. Here by predictor we mean the pair (βGi, βGi×E).
The proposed procedure is as follows: compute the initial number of variables to be assigned to the working set (working_set_size). Calculate di and define the working set as the indices of the smallest di values, up to working_set_size of them. Fit the coordinate descent algorithm on the variables from the working set only and check if we achieved an optimal solution via the duality gap for all of the variables. If not, increase the size of the working set, recalculate di according to the new estimates obtained by fitting on the previous working set, and repeat the procedure. We increase the size of the working set by a factor of two each time.
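The bookkeeping for growing the working set is simple once the scores di are available; a base-R sketch (the scores themselves come from the rearranged rule (12) and are taken as given here):

```r
# Grow the working set given screening scores d (smaller = more likely non-zero).
# Previously included indices are kept by forcing their scores to -Inf, and the
# target size is doubled (capped at p), as described above.
grow_working_set <- function(d, current_ws, p, init_size = 100) {
  d[current_ws] <- -Inf
  new_size <- if (length(current_ws) == 0) min(init_size, p)
              else min(2 * length(current_ws), p)
  order(d)[seq_len(new_size)]
}
```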
Algorithm 1 presents the block coordinate descent algorithm and the stopping criterion we propose to use. Algorithm 2 describes the main steps of the working set strategy. To recapitulate, the main advantage of the algorithm is that we select variables that are likely to be non-zero and leave likely zeroes out so we do not have to spend unnecessary time fitting them. This approach can be thought of as an acceleration of the Gap SAFE rules.
Algorithm 1:
cyclic_coordinate_updates(set of indices I)
for i = 1, …, max_iter do
    check_duality_gap(β)                              // convergence criterion, computationally expensive
    for j ∈ I do
        (βGj, βGj×E) = coordinate_updates()           // coordinate-wise solutions
Algorithm 2:
coordinate_descent_with_working_sets()
for i_outer = 1, …, max_iter do
    check_duality_gap(β)
    d = compute_d_i()
    if i_outer == 1 then
        working_set = {j : (βGj ≠ 0) or (βGj×E ≠ 0)}                  // initialization
        if length(working_set) == 0 then
            working_set = {j : smallest d[j] up to working_set_init_size}
        working_set_size = length(working_set)
    else
        d[working_set] = -Inf                                          // to make sure the WS increases monotonically
        working_set_size = min(2·working_set_size, p)                  // doubling the size of the WS at each iteration
        working_set = {j : smallest d[j] up to working_set_size}       // score coefficients by how likely they are to be non-zero
    cyclic_coordinate_updates(working_set)                             // Algorithm 1 or optimized Algorithm 3
3.2.6. Active set and adaptive max difference strategies
In Algorithm 1 the dual feasible point required to determine the duality gap is calculated according to the naive projection method proposed in section 3.2.2. However, computing the naive projection is expensive, since it requires performing a matrix-by-vector product with a vector of length n (sample size) and a matrix of size n × length(working_set). The idea of the active set and adaptive max difference strategies is to reduce the number of times we have to evaluate the check_duality_gap() function to ensure convergence.
Adaptive max difference strategy:
In any iterative optimization algorithm, including coordinate descent, the convergence (stopping) criterion plays an important role. One of the most commonly used stopping criteria is based on the change in estimates between two consecutive iterations t − 1 and t (18). It is implemented in the glmnet package [Friedman et al. [2009]], for example. The idea is that if the estimated coefficients do not change much from iteration to iteration it is likely the optimal solution has been reached:
maxj (‖Xj‖²₂ / n) (βj(t) − βj(t−1))² < ϵ    (18)
However, in contrast to the duality gap stopping criterion, such heuristic rules do not offer control over suboptimality and it is generally not clear what value of ϵ is sufficiently small.
Although the heuristic convergence criterion based on the maximum absolute difference between consecutive estimates (18) does not control the suboptimality gap P(β) − P(β̂), it is very fast to compute, since the column norms ‖Xj‖²₂ it involves are pre-computed or normalized to one. To reduce the number of times we check the convergence based on the duality gap criterion in Algorithm 1, we propose to use the criterion (18) as a proxy convergence criterion (Algorithm 3).
The proposed procedure is as follows: we initialize the tolerance for the max difference criterion as the tolerance we set for the duality gap convergence. We proceed by fitting the coordinate descent algorithm until we meet the max difference convergence criterion and then check the duality gap for the final convergence. If the duality gap criterion is not met, we decrease the max difference tolerance by a factor of 10 and proceed again. As a result, instead of checking the duality gap after each cycle of the coordinate descent algorithm, we wait until the proxy convergence criterion (which is very cheap to check) is met, and then check the duality gap criterion. By adaptively reducing the tolerance value of the proxy criterion we allow more coordinate descent cycles if needed, but carefully control the adaptive convergence to make as few checks as possible.
Active set strategy:
The active set (AS) is a heuristic proposed by the authors of the glmnet package. The active set strategy tracks the predictors updated during the first coordinate descent cycle and proceeds by fitting the coordinate descent algorithm only on those variables. The active set strategy combines naturally with our proposed adaptive max difference procedure, since we also calculate the differences of the estimates on consecutive iterations for our proxy convergence criterion. We believe that the max difference strategy we proposed, in combination with the active sets, can accelerate state-of-the-art methods for fitting the standard Lasso as well. We summarize the main steps for both strategies in Algorithm 3, which is an optimized version of Algorithm 1.
To recapitulate, Algorithm 1 is the vanilla block coordinate descent algorithm, where each block in {1,…, p}, comprised of a main effect and its corresponding interaction, is updated until convergence without the use of any screening rule or heuristic. Algorithm 2 incorporates the Gap SAFE screening rules into the block coordinate descent algorithm in combination with the working set strategy. In the outer loop the variables in the working set are determined and in the inner loop the basic block coordinate descent Algorithm 1 is applied to the reduced set of variables. This achieves a lower run time. Algorithm 3 leverages additional heuristics to further accelerate the convergence on working sets by using active sets and the fast maximum absolute difference convergence criterion. Note that all the proposed improvements guarantee convergence to a global optimal solution, since the final convergence criterion ensures that the duality gap for the full problem (including all the estimated variables) is within a given tolerance.
Algorithm 3:
cyclic_coordinate_updates_optimized(set of indices I)
for i_inner = 1, …, max_iter do
    check_duality_gap(β)
    if i_inner == 1 then
        max_diff_tol = tol                                             // initialize the heuristic convergence criterion
    else
        max_diff_tol = max_diff_tol / 10                               // adaptively refine the criterion
    while t < max_iter do                                              // full cycles over the set I
        max_diff = 0
        for j ∈ I do
            (βGj, βGj×E) = coordinate_updates()
            current_diff = max absolute change in (βGj, βGj×E)         // max difference heuristic convergence criterion
            max_diff = max(current_diff, max_diff)
            if current_diff > 0 then
                active_set[j] = TRUE                                   // active set = coefficients that got updated
        if max_diff < max_diff_tol then break
    while t < max_iter do                                              // cycles over the active set only
        max_diff = 0
        for j ∈ active_set do
            (βGj, βGj×E) = coordinate_updates()                        // active set iteration
            current_diff = max absolute change in (βGj, βGj×E)
            max_diff = max(current_diff, max_diff)
        if max_diff < max_diff_tol then break
3.3. Experiments
We conducted a series of experiments to evaluate the efficiency of our algorithm overall and with respect to the screening rules.
Runtime:
In the first experiment we compared the runtime of the basic coordinate descent algorithm (Algorithm 1) and the various speedup strategies described in the sections above: the working set strategy adopted for the gesso problem (Algorithm 2), the proposed max difference strategy, and the active set strategy in combination with the max difference approach (Algorithm 3). The size of the dataset we used for the experiment was n = 200, p = 10,000 (20,001 predictors in total, p main effects and p interaction terms, one environmental variable). We simulated 15 non-zero main effects and 10 non-zero interaction terms. We ran each algorithm 100 times and report the mean execution time in Figure 2. Figure 2 demonstrates that the proposed speedup strategies result in a major run-time acceleration. Therefore, only the fastest algorithm containing all the proposed improvements (coordinate descent on WS with AS and adaptive pseudo convergence) is implemented in the package and is used for the downstream experiments.
Fig. 2.
Time comparison of proposed algorithms: mean runtime over 100 replicates on the y-axis, dual gap tolerance on the x-axis.
Working sets:
In the next experiment we evaluated the screening ability of our rules in combination with the working sets (WS) speedup. We simulated a dataset with n = 1000 and p = 100,000, with non-zero main effects and 10 non-zero interaction terms. Figure 3 shows the log10 size of the working set for all pairs (λ1, λ2) of the tuning parameters. The maximum size of the working set was 120 (out of 100,000 predictors). For most pairs only a small number of variables (from 0 to 15) was needed to find the solution.
Fig. 3.
Log working set size (log10 (WS)) for all lambda pairs.
4. Simulations
We performed a series of simulations to evaluate the selection performance of gesso and compare it to that of alternative models. As a baseline model for comparison, we used the standard Lasso model as implemented in the glmnet package [Friedman et al. [2009]]. Among the models that impose hierarchical interactions, we chose the glinternet [Lim and Hastie [2015]] model for comparison (Table 1), as it is implemented in an R package and can handle the G×E case with a single environmental variable (by specifying the parameter interactionCandidates), which is the focus of this paper. We also considered the FAMILY [Haris, Witten and Simon [2016]] and sail [Bhatnagar et al. [2018]] models and their respective packages for comparison. FAMILY implements a series of overlapped group lasso models (Table 1), but the package is designed for fitting strictly more than one environmental variable E. We nevertheless modified the source code to handle the single E case but were unable to achieve satisfactory performance compared to the other methods, so we omit FAMILY from the results.
Table 1.
Hierarchical G×E Models
| Name | Model penalty | Comments (convexity, assumptions, hyper-parameters) |
|---|---|---|
| gesso | λ1 Σi max(\|βGi\|, \|βGi×E\|) + λ2 Σi \|βGi×E\| | convex, two tuning parameters, βE is not penalized |
| glinternet | overlapped group lasso penalty | convex, single tuning parameter, βE is penalized |
| FAMILY | overlapped group lasso penalty with an Lq group norm | q > 1, convex, two tuning parameters, βE is not penalized |
| sail | penalty based on a reparameterization of the interaction coefficients | non-convex, one tuning parameter λ, α has to be set, βE is penalized |
The sail method ensures hierarchy via a reparametrization of the interaction coefficient that results in a non-convex objective (Table 1). We observed that it performs similarly to the other examined methods in the cases where p ≈ n, but for the high-dimensional setting we considered, the sensitivity for selecting interaction terms was poor and the execution time was considerably longer. In addition, sail only tunes the main penalty parameter, not the relative weight of the penalties (parameter α). In practice, the optimal relative weight is highly dependent on the data, and the default value of 0.5 yields poor selection in most cases. Because the weight parameter controls the relative importance of main effects and interactions, it plays a critical role in interaction selection performance. This is unlike the elastic net mixing parameter, which balances two penalties applied to the same set of coefficients.
The hierarchical Lasso for pairwise interactions is implemented in the hierNet [Bien et al. [2013]] R package. However, it can only handle a limited number of predictors and cannot be applied to the G×E case.
Simulation settings
We simulated data with n = 100 subjects, p = 2500 SNPs, and a single binary environmental variable E, for a total of 5001 predictors. We set the number of non-zero main effects, pG, to 10 out of the p SNPs, and the number of non-zero interactions, pG×E, to 5 out of the p interaction terms. We simulated all non-zero main effects βG to have the same absolute value with randomly chosen signs, and similarly for the interaction terms βG×E.
We explored three simulation modes for the true coefficients that we call strong_hierarchical, hierarchical, and anti_hierarchical. In the strong hierarchical mode the hierarchical structure is maintained and also |βGi×E| ≤ |βGi| for the causal effects. In the hierarchical mode the hierarchical structure is maintained, but we set |βGi×E| > |βGi|, in an attempt to violate the gesso assumption on the effect sizes. In the anti-hierarchical mode the hierarchical structure is violated: the non-zero interaction terms correspond to SNPs with zero main effects. We simulated a single binary environmental factor with prevalence equal to 0.3. We set βG = 3 and βG×E = 1.5 for all of the modes except for the hierarchical mode, where we set βG = 0.75 and βG×E = 1.5. We set the noise variance such that the interaction SNR (signal-to-noise ratio, defined as the ratio of the interaction signal to the noise) is around 2.
We generated independent training, validation, and test sets under the same settings and report the model performance metrics on the test set. We ran 200 replicates of the simulation for each parameter setting.
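For concreteness, a data-generating sketch matching the strong hierarchical settings above is given below. The SNP coding, the value of βE and the exact noise calibration are illustrative assumptions and need not match the generating code used for the reported simulations.

```r
# Simulation sketch matching the strong hierarchical settings described above.
# SNP coding (binomial(2, 0.3)), beta_E = 1 and the noise calibration are
# illustrative assumptions, not necessarily the paper's exact generating model.
set.seed(1)
n <- 100; p <- 2500; p_g <- 10; p_gxe <- 5
G <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n)   # assumed 0/1/2 genotype coding
E <- rbinom(n, size = 1, prob = 0.3)                         # binary exposure, prevalence 0.3

beta_g <- numeric(p); beta_gxe <- numeric(p)
causal_g <- sample(p, p_g)
causal_gxe <- sample(causal_g, p_gxe)          # hierarchy: interactions only where main effects exist
beta_g[causal_g] <- 3 * sample(c(-1, 1), p_g, replace = TRUE)
beta_gxe[causal_gxe] <- 1.5 * sample(c(-1, 1), p_gxe, replace = TRUE)

interaction_signal <- drop(G %*% beta_gxe) * E
sigma <- sqrt(var(interaction_signal) / 2)     # interaction SNR around 2
Y <- drop(G %*% beta_g) + E + interaction_signal + rnorm(n, sd = sigma)
```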
Results
We obtained solutions paths across a two-dimensional grid of tuning parameter values and computed the precision and area under the curve (AUC) for the detection of interactions as a function of the number of interactions discovered (Figure 4).
Fig. 4.
Model performance (top row: AUC for G×E selection, bottom row: precision for G×E selection) as a function of the number of interactions discovered. p=2500, n=100, pG=10, pG×E =5.
In the strong hierarchical mode (Figure 4(a)) the models imposing a hierarchical structure (gesso and glinternet) outperform the Lasso model, and gesso performs better selection than glinternet in terms of both AUC and precision.
We considered the hierarchical mode to evaluate the selection performance of gesso when the assumption |βGi×E| ≤ |βGi| does not hold. Because gesso does not make this stringent assumption (only the laxer |βGi×E| ≤ βGi⁺ + βGi⁻), we expect it to be robust to its violation. In the hierarchical mode (Figure 4(b)) glinternet and gesso still outperform the Lasso, and gesso performs on par with glinternet.
In the anti-hierarchical mode (Figure 4(c)) all three models perform similarly. Importantly, even though the hierarchical assumptions are violated, the hierarchical models still do not lose to the Lasso model in terms of selection performance.
5. Real data example: Epigenetic clock
We analyzed the GSE40279 methylation dataset [Hannum et al. [2013]], which contains 450,000 CpG markers on autosomal chromosomes measured in whole blood for 656 subjects (Illumina Infinium 450k Human Methylation BeadChip), along with age and sex. Hannum et al. [2013] reported that the methylome of men appeared to age approximately 4 percent faster than that of women. Epidemiological data also indicate that females live longer than males, but the reasons for these sex discrepancies are still unknown. Our goal in this analysis was to identify methylation-by-sex interactions that may be associated with age.
We preselected the 100,000 most variable probes for our analysis, leaving us with 100,000 CpG probe main effects and 100,000 methylation-probe-by-sex interaction terms, for a total of 200,001 predictors. Age is our dependent/outcome variable. We ran our implementation of the gesso model over one hundred 5-fold cross-validation assignments and calculated the selection rate for each predictor as the number of times the predictor was selected by the model divided by the number of runs (one hundred in our case). We chose the hyper-parameters based on the minimum cross-validation loss. We performed the same procedure for the standard Lasso model using the glmnet package.
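The selection-rate computation itself is straightforward. In the sketch below, selected is a logical matrix with one row per predictor and one column per cross-validation run, recording which predictors entered the model chosen by minimum cross-validation loss; how that matrix is filled is package-specific and omitted.

```r
# Selection rates across repeated cross-validation runs. `selected` is a logical
# matrix (predictors x runs); filling it requires a fitting call that is
# package-specific and omitted here.
selection_rates <- function(selected) {
  rates <- rowMeans(selected)                  # fraction of runs selecting each predictor
  ranked <- data.frame(predictor = seq_len(nrow(selected)), rate = rates)
  ranked[order(-ranked$rate), ]                # rank 1 = most frequently selected
}
```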
Results
The average run-time for our implementation of gesso across the 100 runs of cross-validation was only 4 minutes for a grid of 30 values for each of our two tuning parameters. The average run-time for the glmnet package for the same grid over its single tuning parameter was 1.5 minutes.
We attempted to use the glinternet package for the analysis, but it exited with an error due to the large data size. We then downloaded the source code and changed the memory handling, since the package was allocating memory for the full G×G problem before subsetting it to the G×E case. However, even with the updated function, glinternet would not complete the analysis within 24 hours. We hypothesize that, due to specifics of the current implementation, the glinternet package is efficient for the symmetric G×G format, but not for the reduced G×E format with large G. We also tried to run the FAMILY package on our methylation dataset after adapting the source code to generalize the function to the G×E case with a single E variable. However, the program exited with an out-of-memory error. The sail package also did not finish within 24 hours. To conclude, the existing packages are not able to handle large datasets (here 200,000 variables) for the G×E analysis.
Table 2 presents the four top-ranking CpG probes for gesso and standard Lasso that interact with sex in their effects on age. Ranks were calculated by ordering interaction selection rates and assigning rank one to the highest rate value (most frequently selected variable), and so on.
Table 2.
Top-ranked CpG probes interacting with sex, by method.
| CpG probe | Associated Gene | gesso rank | glmnet rank |
|---|---|---|---|
| **cg00091483** * | PEBP1 | 1 | 12 |
| **cg12015310** | MEIS2 | 2 | 6 |
| cg08327269 | NEUROD1 | 3 | 56 |
| cg103755456 | ICAM1 | 4 | 163 |
| **cg14345497** | HOXB4 | 8 | 1 |
| cg09365557 | TBX2 | >50,000 | 2 |
| cg27652200 | PSMB9/TAP1 | 109 | 3 |
| **cg19009405** | HNRNPUL2 | 14 | 4 |
Probes identified by both methods (ranked among the top 20 probes by the other method) are highlighted in bold.
Genes associated with the probes selected by gesso have been linked to aging processes and cell senescence in multiple publications and databases, and are suggested biomarkers for age-related diseases [Jeck et al. [2012], Chang et al. [2000], Gorgoulis et al. [2005]]. For example, the PEBP1 gene (linked to the top selected probe) is involved in the aging process and negative regulation of the MAPK pathway [Schoentgen et al. [2020]]. The MAPK and SAPK/JNK signaling networks promote senescence (in vitro) and aging (in vivo, in animal models and human cohorts) in response to oxidative stress and inflammation [Papaconstantinou et al. [2019]]. The Rat Genome Database (RGD) indicates that the PEBP1 gene is implicated in prostate and ovarian cancers [Scholler et al. [2008]], indicating some sex-specificity. The RGD database also reports the PEBP1 gene as a biomarker of Alzheimer’s disease.
Among the top four probes based on the standard Lasso analysis, three probes were linked to regulatory genes. For example, TBX2 (linked to cg09365557) encodes a transcription factor that, when up-regulated, inhibits CDKN1A (p21), the gene regulated by the NEUROD1 gene (probe cg08327269) discovered by gesso [Gene Cards Database]. CDKN1A is necessary for tissue senescence, and when compromised, leaves the tissue vulnerable to tumor-promoting signals. Probes cg09365557 (identified by the standard Lasso) and cg08327269 (identified by gesso) could both be uncovering the same biological process involving CDKN1A.
6. Discussion
We introduced a selection method for G×E interactions with the hierarchical main-effect-before-interaction property. We showed that existing packages for the hierarchical selection of interactions cannot handle the large number of predictors typical in studies with high-dimensional omic data. When considered separately, and not as a sub-case of a G×G analysis, the G×E case can be solved much more efficiently. Our proposed block coordinate descent algorithm is scalable to large numbers of predictors because of the custom screening rules we developed. We also showed in simulations that our model outperforms other hierarchical models. The implementation of our method is available in our R package gesso that can be downloaded from CRAN https://CRAN.R-project.org/package=gesso. Our algorithm can be extended to generalized linear models via iteratively reweighted least-squares.
The model can be generalized to include more than one environmental variable E. However, for the BCD algorithm to be efficient, blocks have to be relatively small. When including multiple E variables, the size of the coordinate descent blocks grows linearly, since each block contains a main effect and all of its corresponding interactions with the environmental variables (βGj, βGj×E1, …, βGj×Eq). The bottleneck is then to efficiently solve the system of equations resulting from the stationarity conditions, the size of which grows exponentially with the number of E variables. For more than a handful of E variables other algorithms, like ADMM, could become more efficient.
Apart from the computational efficiency of the algorithm, memory considerations are critical for large-scale analyses. The gesso package allows users to analyze large genome-wide datasets that do not fit in RAM using the file-backed bigmemory [Kane, Emerson, and Weston [2013]] format. However, this involves more time-consuming data transfers between RAM and the external memory source. To more efficiently handle datasets that exceed available RAM, the screening rules could be exploited for efficient batch processing [Qian et al. [2020]].
Supplementary Material
ACKNOWLEDGEMENTS
We sincerely thank Jacob Bien, Paul Marjoram, and David Conti for their helpful comments on this work.
FUNDING
Research reported in this paper was supported by NCI of the National Institutes of Health under award number P01CA196569, the FIGI R01 supported by NCI (R01CA201407), and a T32 supported by NIEHS (T32ES013678).
APPENDIX
Appendix A: Proof of equivalence of relaxed and unconstrained models
We want to prove that the constrained model (4) and the unconstrained model (2) are equivalent.
Proof: Recall that βG = βG⁺ − βG⁻ with βG⁺ ⪰ 0 and βG⁻ ⪰ 0, so that |βGi| ≤ βGi⁺ + βGi⁻. Fix βGi and βGi×E and minimize βGi⁺ + βGi⁻ over the feasible decompositions, i.e. subject to βGi⁺ − βGi⁻ = βGi, βGi⁺ ≥ 0, βGi⁻ ≥ 0 and |βGi×E| ≤ βGi⁺ + βGi⁻. If |βGi×E| ≤ |βGi|, the minimum is attained at the standard decomposition βGi⁺ = max(βGi, 0), βGi⁻ = max(−βGi, 0), giving βGi⁺ + βGi⁻ = |βGi|; otherwise the constraint |βGi×E| ≤ βGi⁺ + βGi⁻ is active and the minimum equals |βGi×E|. Hence the minimum value of βGi⁺ + βGi⁻ is max(|βGi|, |βGi×E|). Since we are solving a minimization problem, we can substitute βGi⁺ + βGi⁻ in model (4) by its minimum value, which transforms model (4) into model (2) and concludes the proof of equivalence.
Appendix B: Solution to the optimization problem (13)
Consider an optimization problem:
Let's denote Y/n by y and νres(β) by ν. We have
(19) |
Solution
Let's denote these quantities as Bj and Aj. The feasible set for our optimization problem (19) is
The optimal δj is then chosen as the solution of the corresponding equation, because such a solution maximizes the feasible set and hence provides a broader range over which to find an optimal x (Figure 5).
Fig. 5.
Geometric solution to the problem (19). Optimal δj maximizes the range of possible x values.
We know that (section 1 of the supplementary materials). If δj > λ1, then , and if , then . Therefore,
We found the value of x that minimizes our objective function from the stationarity conditions. To be optimal it has to satisfy our feasibility conditions for any j:
A feasible x must then satisfy the inequality
Let’s denote the RHS of the inequality as M. If , then optimal and if , then optimal .
Appendix C: Safe rules for gesso
By the KKT conditions (8) and (7):
Then,
(20) |
Let’s consider the last system of inequalities more closely:
Then,
and, hence, the SAFE rules to discard are:
(21) |
References
- Ayers KL, Cordell HJ (2010). "SNP selection in genome-wide and candidate gene studies via penalized logistic regression", Genet. Epidemiol., 34(8):879–891.
- Bhatnagar S, Lovato A, Yang Y, Greenwood C (2018). "Sparse Additive Interaction Learning", bioRxiv 445304.
- Bien J, Taylor J, Tibshirani R (2013). "A lasso for hierarchical interactions", Ann. Statist., 41(3):1111–1141.
- Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011). "Distributed optimization and statistical learning via the alternating direction method of multipliers", Foundations and Trends in Machine Learning, 3(1):1–122.
- Bonnefoy A, Emiya V, Ralaivola L, Gribonval R (2014). "A dynamic screening principle for the lasso", In EUSIPCO.
- Boyd S, Vandenberghe L (2004). "Convex Optimization", Cambridge University Press.
- Chang BD, Watanabe K, Broude EV, Fang J, Poole JC, Kalinichenko TV, Roninson IB (2000). "Effects of p21Waf1/Cip1/Sdi1 on cellular gene expression: Implications for carcinogenesis, senescence, and age-related diseases", Proc. Natl. Acad. Sci., 97(8):4291–4296.
- Chipman H (1996). "Bayesian Variable Selection with Related Predictors", The Canadian Journal of Statistics, 24(1):17–36.
- Cox DR (1984). "Interaction", International Statistical Review, 52(1):1–24.
- Efroymson MA (1960). "Multiple regression analysis", in Mathematical Methods for Digital Computers, Wiley, New York.
- El Ghaoui L, Viallon V, Rabbani T (2012). "Safe feature elimination in sparse supervised learning", Pac. J. Optim., 8(4):667–698.
- Fercoq O, Gramfort A, Salmon J (2015). "Mind the duality gap: safer rules for the lasso", In ICML, pages 333–342.
- Figueiredo JC, Hsu L, Hutter CM, Lin Y, Campbell PT, Baron JA, Berndt SI, Jiao S, Casey G, Fortini B, Chan AT, Cotterchio M, Lemire M, Gallinger S, Harrison TA, Le Marchand L, Newcomb PA, Slattery ML, Caan BJ, Carlson CS, Zanke BW, Rosse SA, Brenner H, Giovannucci EL, Wu K, Chang-Claude J, Chanock SJ, Curtis KR, Duggan D, Gong J, Haile RW, Hayes RB, Hoffmeister M, Hopper JL, Jenkins MA, Kolonel LN, Qu C, Rudolph A, Schoen RE, Schumacher FR, Seminara D, Stelling DL, Thibodeau SN, Thornquist M, Warnick GS, Henderson BE, Ulrich CM, Gauderman WJ, Potter JD, White E, Peters U (2014). "Genome-wide diet-gene interaction analyses for risk of colorectal cancer", PLoS Genet., 10(4).
- Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization", Ann. Appl. Stat., 1(2):302–332.
- Friedman J, Hastie T, Tibshirani R (2010). "Regularization Paths for Generalized Linear Models via Coordinate Descent", Journal of Statistical Software, 33(1).
- Friedman J, Hastie T, Tibshirani R (2009). "glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models", R package.
- Gorgoulis VG, Pratsinis H, Zacharatos P, Demoliou C, Sigala F, Asimacopoulos PJ, Papavassiliou AG, Kletsas D (2005). "p53-Dependent ICAM-1 overexpression in senescent human cells identified in atherosclerotic lesions", Laboratory Investigation, 85(4):502–511.
- Hannum G, Guinney J, Zhao L, Zhang L, Hughes G, Sadda S, Klotzle B, Bibikova M, Fan J-B, Gao Y, Deconde R, Chen M, Rajapakse I, Friend S, Ideker T, Zhang K (2013). "Genome-wide methylation profiles reveal quantitative views of human aging rates", Mol. Cell, 49:359–367.
- Haris A, Witten D, Simon N (2016). "Convex Modeling of Interactions With Strong Heredity", J. Comput. Graph. Statist., 25(4):981–1004.
- Jeck WR, Siebold AP, Sharpless NE (2012). "Review: a meta-analysis of GWAS and age-associated diseases", Aging Cell, 11(5):727–731.
- Kane M, Emerson J, Weston S (2013). "Scalable Strategies for Computing with Massive Data", Journal of Statistical Software, 55(14):1–19.
- Lim M, Hastie T (2015). "Learning Interactions Through Hierarchical Group-Lasso Regularization", J. Comput. Graph. Statist., 24:627–654.
- Liu J, Huang J, Zhang Y, Lan Q, Rothman N, Zheng T, Ma S (2013). "Identification of gene–environment interactions in cancer studies using penalization", Genomics, 102(4):189–194.
- Massias M, Gramfort A, Salmon J (2017). "From safe screening rules to working sets for faster lasso-type solvers", 10th NIPS Workshop on Optimization for Machine Learning.
- Nelder JA (1977). "A Reformulation of Linear Models", Journal of the Royal Statistical Society, Series A, 140(1):48–77.
- Papaconstantinou J (2019). "The Role of Signaling Pathways of Inflammation and Oxidative Stress in Development of Senescence and Aging Phenotypes in Cardiovascular Disease", Cells, 8:1383.
- Qian J, Tanigawa Y, Du W, Aguirre W, Chang C, Tibshirani R, Rivas MA, Hastie T (2020). "A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank", bioRxiv 630079.
- Schoentgen F, Jonic S (2020). "PEBP1/RKIP behavior: a mirror of actin-membrane organization", Cell. Mol. Life Sci., 77:859–874.
- Scholler N, Gross JA, Garvik B, et al. (2008). "Use of cancer-specific yeast-secreted in vivo biotinylated recombinant antibodies for serum biomarker discovery", J. Transl. Med., 6:41.
- Ternes N, Rotolo F, Heinze G, Michiels S (2017). "Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces", Biometrical Journal, 59(4):685–701.
- Tibshirani R (1996). "Regression shrinkage and selection via the lasso", J. R. Stat. Soc. Ser. B. Stat. Methodol., 58:267–288.
- Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization", J. Optim. Theory Appl., 109(3):475–494.
- Wu C, Jiang Y, Ren J, Cui Y, Ma S (2017). "Dissecting gene-environment interactions: A penalized robust approach accounting for hierarchical structures", Stat. Med., 37(3):437–456.
- Wu M, Zhang Q, Ma S (2020). "Structured gene-environment interaction analysis", Biometrics, 76(1):23–35.
- Yuan M, Lin Y (2006). "Model selection and estimation in regression with grouped variables", J. R. Stat. Soc. Ser. B. Stat. Methodol., 68(1):49–67.
- Zou H, Hastie T (2005). "Regularization and variable selection via the elastic net", J. R. Stat. Soc. Ser. B. Stat. Methodol., 67:301–320.
- Zhao P, Rocha G, Yu B (2009). "The composite absolute penalties family for grouped and hierarchical variable selection", Ann. Statist., 37(6A):3468–3497.