Published in final edited form as: J Nonparametr Stat. 2011;23(1):185–199. doi: 10.1080/10485252.2010.490584

An ordinary differential equation based solution path algorithm

Yichao Wu 1,*

Abstract

Efron, Hastie, Johnstone and Tibshirani (2004) proposed Least Angle Regression (LAR), a solution path algorithm for least squares regression. They pointed out that a slight modification of the LAR gives the LASSO (Tibshirani, 1996) solution path. However, it is largely unknown how to extend this solution path algorithm to models beyond least squares regression. In this work, we propose an extension of the LAR to generalized linear models and the quasi-likelihood model by showing that the corresponding solution path is piecewise given by solutions of ordinary differential equation systems. Our contribution is twofold. First, we provide a theoretical understanding of how the corresponding solution path propagates. Second, we propose an ordinary differential equation based algorithm to obtain the whole solution path.

Keywords: generalized linear model, LARS, LASSO, ordinary differential equation, solution path algorithm, QuasiLARS, quasi-likelihood model

1 Introduction

Recently we have seen explosive growth of research on variable selection, popularized by Tibshirani (1996), who uses the L1 penalty to regularize least squares regression. Following this line of research, many other techniques have been proposed. They include the SCAD (Fan and Li, 2001), the LARS (Efron et al., 2004), the elastic net (Zou and Hastie, 2005), the Dantzig selector (Candes and Tao, 2007), the adaptive LASSO (Zou, 2006; Zhang and Lu, 2007), and their related methods.

Computationally, the LASSO, the elastic net, and the adaptive LASSO can all be solved by any quadratic programming (QP) solver. The Dantzig selector involves a linear programming problem. The SCAD penalty leads to a non-convex optimization problem, for which Fan and Li (2001) proposed a local quadratic approximation (LQA) algorithm and Zou and Li (2008) proposed a local linear approximation (LLA) algorithm. They are two instances of the MM algorithm (Hunter and Li, 2005), and each step of the LQA or LLA involves a QP problem. All these algorithms share one characteristic: they solve the corresponding optimization problem for one regularization parameter at a time.

Efron et al. (2004) proposed the Least Angle Regression (LAR) algorithm and illustrated its close connection to the LASSO and to Forward Stagewise linear regression. Together these algorithms are called LARS. With a slight modification, their algorithm provides the whole exact solution path for the LASSO. The LARS solution paths are piecewise linear. Another algorithm for the LASSO, the homotopy algorithm, is due to Osborne, Presnell and Turlach (2000). Rosset and Zhu (2007) derived a general characterization of the loss-penalty pairs that lead to piecewise linear solution paths.

Note that the piecewise quadratic condition of Rosset and Zhu (2007) is not satisfied by generalized linear models (GLMs). The corresponding L1 regularized solution path is not piecewise linear, as demonstrated by Figure 2. To our limited knowledge, it is largely unknown how to extend the LARS to GLMs, and more generally to the quasi-likelihood model (QLM), to get an exact solution path. Yet some approximate solution path algorithms are available. Madigan and Ridgeway (2004) discussed one possible extension to GLMs. Rosset (2004) suggested a general second-order path-following algorithm to track the curved regularized optimization solution path. Park and Hastie (2007)’s algorithm is based on the predictor-corrector method of convex optimization. To control the overall accuracy, Park and Hastie (2007) pointed out that it is critical to select the step length of the regularization parameter, for which strategies are proposed. These two papers try to approximate the whole regularization solution path by providing a series of solution sets at different regularization parameters, and different strategies are proposed to select the set of regularization parameters so as to control the approximation error. Yuan and Zou (2009) proposed an efficient global approach to approximate nonlinear L1 regularization solution paths. Their method is based on the approximation of a general loss function by quadratic splines. In this way, the global loss approximation error can be controlled and a generalized LARS-type algorithm is devised to compute the corresponding exact solution path for the approximate quadratic spline loss. This path approximates the original nonlinear regularization solution path, and theory is provided to show that the path approximation error is controlled by the global loss approximation error. On one hand, increasing the number of knots in the quadratic spline approximation makes the approximate solution path more accurate. On the other hand, it increases the number of pieces in the corresponding piecewise linear solution path and therefore the computational cost as well (Section 4 of Yuan and Zou, 2009). They further commented that “If the user wants to get the exact solution path from the EGA solution, then it seems not worthy to use a large number of knots.”

Figure 2. Poisson QuasiLARS solution path of the toy example: the top panel gives the solution path of the Poisson QuasiLARS with respect to the one norm of β(t); the bottom panel is plotted with respect to t.

This urgent need for an exact solution path calls for another algorithm, which is exactly the goal of the current paper. We extend the LAR to the QLM and name our extension QuasiLAR. Piecewise, our QuasiLAR solution path is given by solutions of ordinary differential equation (ODE) systems. We also discuss how the extension QuasiLAR is modified to get the whole solution path of the LASSO regularized quasi-likelihood; this modified algorithm is called QuasiLASSO. Putting them together, we name our new algorithm QuasiLARS. The QuasiLARS is different from the existing algorithms mentioned in the previous paragraph in that they all provide approximate solution paths instead. Our contribution is two-fold. On one hand, the current paper helps us to understand the corresponding optimization problem better by providing an answer to the question of how the general LASSO regularized solution path changes as the regularization parameter varies. On the other hand, we present an ODE based solution path algorithm, which provides a potential way to evaluate how well the existing solution path algorithms approximate the true solution path. Other papers on solution path algorithms include Zhu, Rosset, Hastie and Tibshirani (2004), Hastie, Rosset, Tibshirani and Zhu (2004), Wang and Shen (2006), Yuan and Lin (2007), Li, Liu and Zhu (2007), Wang and Zhu (2007), Wang, Shen and Liu (2008), Li and Zhu (2008), Zou (2008), Rocha, Zhao and Yu (2008), Wu, Shen and Geyer (2009), and references therein. In particular, Friedman, Hastie and Tibshirani (2010) focused on GLMs as well. They proposed a coordinate descent algorithm, which works for a fixed regularization parameter. They obtain a solution path by computing and connecting solutions on a pre-specified (potentially dense) grid of the regularization parameter.

The rest of the article is organized as follows. Section 2 details the LARS and motivates the QuasiLARS. In Section 3, we present the QuasiLARS. Details of a key step are discussed in Section 4. Section 5 gives some properties of the QuasiLARS path. Numerical examples in Section 6 are used to illustrate how our QuasiLARS works. We conclude with Section 7. All technical proofs are collected in the supplementary online material.

2 LARS

Before delving into details, let us see how the LAR works. To facilitate our later discussion, let us consider a general regression with a univariate response $Y \in \mathbb{R}$ and predictor vector $X = (X_1, \cdots, X_p)^T \in \mathbb{R}^p$, where p denotes the number of predictor variables. The QLM assumes that $\mu(x) \triangleq E(Y|X=x) = g^{-1}(\eta(x))$ with $\eta(x) = \beta_0 + x^T\beta$, and $\mathrm{Var}(Y|X=x) = V(\mu(x))$ for some known monotonic link function g(·) and positive variance function V(·). Define $Q(\mu, y) = \int_y^{\mu} \frac{y-w}{V(w)}\,dw$ and denote our observed data set by {(xi, yi) : i = 1, ⋯ , n} with $x_i = (x_{i1}, \cdots, x_{ip})^T \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and n being the sample size. Predictors have been standardized so that $\sum_{i=1}^n x_{ij} = 0$ and $\sum_{i=1}^n x_{ij}^2 = 1$ for j = 1, ⋯, p. The QLM estimates β0 and β by solving

$$\max_{\beta_0,\beta}\ \sum_{i=1}^n Q\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big). \tag{1}$$

The QLM includes GLMs as special cases by choosing g(·) and V(·) appropriately.
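As a concrete illustration of such choices (this snippet is ours, not from the paper), the canonical link g, its inverse, and the variance function V of three common GLM families could be written as:

```python
import numpy as np

# Canonical link g, its inverse g^{-1}, and variance function V for common GLMs.
GLM_FAMILIES = {
    "gaussian": {"g": lambda mu: mu,
                 "ginv": lambda eta: eta,
                 "V": lambda mu: np.ones_like(mu)},      # V(mu) = sigma^2, taken as 1
    "binomial": {"g": lambda mu: np.log(mu / (1 - mu)),  # logit link
                 "ginv": lambda eta: 1 / (1 + np.exp(-eta)),
                 "V": lambda mu: mu * (1 - mu)},
    "poisson":  {"g": np.log,                            # log link
                 "ginv": np.exp,
                 "V": lambda mu: mu},
}
```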

The ordinary least squares (OLS) regression is a special case with g(μ) = μ and V(μ) = σ². In this case, by demeaning if necessary to ensure $\sum_{i=1}^n y_i = 0$, (1) reduces to

$$\max_{\beta_0,\beta}\ -\sum_{i=1}^n (y_i - x_i^T\beta)^2. \tag{2}$$

For OLS (2), the LAR provides a solution path β(t) indexed by t ∈ [0, ∞), with β(0) = 0. For large enough t, β(t) is the same as the full OLS solution to (2). The solution path in between is piecewise linear. Over each piece, it moves along the direction that keeps the correlation between the current residuals and each active predictor equal in absolute value. Define current residuals $e_i(\beta(t)) = y_i - x_i^T\beta(t)$ for i = 1, ⋯ , n. In terms of the current residual vector $e(\beta(t)) = (e_1(\beta(t)), \cdots, e_n(\beta(t)))^T$ and the jth predictor vector $x^{(j)} = (x_{1j}, \cdots, x_{nj})^T$, the current correlation $e(\beta(t))^T x^{(j)}$ has the same absolute value for each active predictor j. Note that $e(\beta(t))^T x^{(j)} = \frac{1}{2}\frac{\partial}{\partial\beta_j}\big[-\sum_{i=1}^n (y_i - x_i^T\beta)^2\big]\big|_{\beta(t)}$. This implies that the absolute values of the objective function's first-order partial derivatives are equal for each active predictor variable along the LAR solution path, namely $\big|\frac{\partial}{\partial\beta_j}\sum_{i=1}^n (y_i - x_i^T\beta)^2\big|_{\beta(t)}\big| = \big|\frac{\partial}{\partial\beta_{j'}}\sum_{i=1}^n (y_i - x_i^T\beta)^2\big|_{\beta(t)}\big|$ for any j and j′ in the active predictor set at t. In this paper we will take advantage of this observation and extend the LARS to the more general QLM. For the diabetes data in the R package LARS, we plot the LAR solution path in the top left panel of Figure 1. The derivatives $\frac{\partial}{\partial\beta_j}\sum_{i=1}^n (y_i - x_i^T\beta)^2$ along the LAR solution path are shown in the top right panel of Figure 1. The derivatives in absolute value, namely $\big|\frac{\partial}{\partial\beta_j}\sum_{i=1}^n (y_i - x_i^T\beta)^2\big|$, are given in the bottom panel of Figure 1. It clearly shows that, at the end of each LAR step, a new predictor variable joins the group of active predictor variables that share the honor of having the same largest absolute value of the first-order derivatives. The LAR algorithm terminates at the full OLS estimate of (2) when all the first-order partial derivatives are exactly zero.
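This equal-correlation property is easy to verify numerically. The following sketch uses scikit-learn's diabetes data and lars_path as a stand-in for the R package LARS used in the paper; it is an illustration under those assumptions, not the paper's own code.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)
# Standardize predictors as in Section 2: zero mean and unit squared norm.
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).sum(axis=0))
y = y - y.mean()

alphas, active, coefs = lars_path(X, y, method="lar")
# At the end of each LAR step, the active predictors share the largest |correlation|.
for k in range(1, coefs.shape[1]):
    corr = X.T @ (y - X @ coefs[:, k])                # current correlations e(beta)^T x^(j)
    print(k, np.round(np.abs(corr[active[:k]]), 6))   # equal within each printed row
```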

Figure 1. LAR solution path of the diabetes data: the top left panel gives the solution path of the LAR; the top right panel and the bottom panel plot the derivatives of (2) and their absolute values, respectively, along the LAR solution path.

3 QuasiLARS: extension of LARS

Note first that in general β0 of the QLM cannot be removed from equation (1) by location and scale transformations as in the least squares regression. However for any β, the quasi-likelihood function (1) is concave in β0. Thus we can define the marginal maximizer of β0 as a function of β. Namely for any β, define

$$\beta_0(\beta) = \arg\max_{\beta_0} \sum_{i=1}^n Q\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big). \tag{3}$$

We denote $R(\beta, \beta_0) = \sum_{i=1}^n Q\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big)$ and R(β) = R(β, β0(β)).

Based on the above observation that the LAR produces a solution path along which the objective function's first-order partial derivatives have the same absolute value for each active predictor variable, our extension QuasiLAR seeks a solution path β(t) such that $\big|\frac{\partial}{\partial\beta_j}R(\beta, \beta_0(\beta))\big|_{\beta(t)}\big| = \big|\frac{\partial}{\partial\beta_{j'}}R(\beta, \beta_0(\beta))\big|_{\beta(t)}\big|$ for j and j′ that are active at t. More explicitly, at any t, the solution should move in a special direction $a(\beta(t)) = \frac{d}{dt}\beta(t)$, which is chosen in a way to ensure that the first-order partial derivatives $\frac{\partial}{\partial\beta_j}R(\beta)$ have the same absolute value for each active predictor variable j.

For R(β), denote its vector of first-order partial derivatives by $b(\beta) = (b_1(\beta), \cdots, b_p(\beta))^T$ and its matrix of second-order partial derivatives by $M(\beta) = (m_{jk}(\beta))_{1\le j,k\le p}$, where $b_j(\beta) = \frac{\partial}{\partial\beta_j}R(\beta)$ and $m_{jk}(\beta) = \frac{\partial^2}{\partial\beta_j\partial\beta_k}R(\beta)$ for 1 ≤ j, k ≤ p.

As in the LAR, we use t to index our QuasiLAR solution path β(t). Denote the active index set at t by $\mathcal{A}(\beta(t))$, or equivalently $\mathcal{A}_t$; we will use the two notations interchangeably.

Note that b(β(t + dt)) ≈ b(β(t)) + M(β(t)){β(t + dt) − β(t)} for small dt > 0 by Taylor expansion. Thus, in order to keep the absolute values of the first-order partial derivatives with respect to all active predictor variables decreasing and equal to one another, our solution path updating direction β(t + dt) − β(t) should be such that bj(β(t + dt)) − bj(β(t)) has the opposite sign of bj(β(t)) and has the same absolute value for each $j \in \mathcal{A}_t$. Here the first requirement guarantees that the first-order partial derivatives of active predictor variables are decreasing in absolute value, and the second requirement ensures that they decrease at the same speed. This gives our motivation on how to define an appropriate solution path updating direction. The above discussion can be made rigorous by using differential operators when the above dt > 0 is infinitesimal, as presented next.

At any t with solution β(t), denote the solution path updating direction by $a(\beta(t)) = (a_1(\beta(t)), \cdots, a_p(\beta(t)))^T = \frac{d}{dt}\beta(t)$. For any inactive variable $j \notin \mathcal{A}_t$, we keep βj(t) = 0 and do not change it; thus aj(β(t)) = 0 for $j \notin \mathcal{A}_t$. Consequently we only care about aj(β(t)) for active predictors $j \in \mathcal{A}_t$. For any two index sets $\mathcal{A}$ and $\mathcal{B}$, vector a, and matrix M, denote by $a_{\mathcal{A}}$ the sub-vector of a consisting of those elements with index in $\mathcal{A}$ and by $M_{\mathcal{A},\mathcal{B}}$ the sub-matrix of M consisting of those elements with row index in $\mathcal{A}$ and column index in $\mathcal{B}$. When $\mathcal{A} = \{j\}$ and $\mathcal{B} = \{k\}$ are singletons, we also write $M_{j,\mathcal{B}}$ and $M_{\mathcal{A},k}$, which are essentially a row vector and a column vector, respectively. Denote the complement of $\mathcal{A}$ by $\mathcal{A}^c = \{1, \cdots, p\} \setminus \mathcal{A}$. With these notations, our solution path updating direction for the active predictor variables should satisfy

$$M_{\mathcal{A}_t,\mathcal{A}_t}(\beta(t))\, a_{\mathcal{A}_t}(\beta(t)) = -\,\mathrm{sign}\big(b_{\mathcal{A}_t}(\beta(t))\big). \tag{4}$$

The argument is based on the previous paragraph with infinitesimal dt. Thus our solution path should be updated using $\frac{d}{dt}\beta_{\mathcal{A}_t}(t) = a_{\mathcal{A}_t}(\beta(t))$ and $\frac{d}{dt}\beta_{\mathcal{A}_t^c}(t) = \mathbf{0}$ with

$$a_{\mathcal{A}_t}(\beta(t)) = -\big(M_{\mathcal{A}_t,\mathcal{A}_t}(\beta(t))\big)^{-1}\,\mathrm{sign}\big(b_{\mathcal{A}_t}(\beta(t))\big) \tag{5}$$

being the solution of (4), where the invertibility of $M_{\mathcal{A}_t,\mathcal{A}_t}(\beta(t))$ is not an issue as long as the quasi-likelihood is well defined. Here we use $\mathbf{0}$ to denote a column vector of zeros with length depending on the context. Note further that this updating scheme implies that $\frac{d}{dt} b_{\mathcal{A}_t}(\beta(t)) = -\mathrm{sign}\big(b_{\mathcal{A}_t}(\beta(t))\big)$ because $\frac{d}{dt} b_{\mathcal{A}_t}(\beta(t)) = M_{\mathcal{A}_t,\mathcal{A}_t}(\beta(t))\, a_{\mathcal{A}_t}(\beta(t))$.
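A minimal numerical sketch of the updating direction in (5), assuming user-supplied callables grad_R(beta) and hess_R(beta) returning b(β) and M(β) for the profiled objective R(β) = R(β, β0(β)) (obtaining these derivatives is the subject of Section 4; the function name is ours):

```python
import numpy as np

def update_direction(beta, active, grad_R, hess_R):
    """Path updating direction a(beta): equation (5) on the active set, zero elsewhere."""
    b = grad_R(beta)                        # first-order partial derivatives b(beta)
    M = hess_R(beta)                        # second-order partial derivatives M(beta)
    A = np.asarray(active)
    a = np.zeros_like(beta)
    # Solve M_{A,A} a_A = -sign(b_A) rather than forming an explicit inverse.
    a[A] = np.linalg.solve(M[np.ix_(A, A)], -np.sign(b[A]))
    return a
```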

In integral form, these become

$$\beta_{\mathcal{A}_t}(t+dt) = \beta_{\mathcal{A}_t}(t) + \int_t^{t+dt} a_{\mathcal{A}_t}(\beta(\tau))\,d\tau, \quad\text{and}\quad \beta_{\mathcal{A}_t^c}(t+dt) = \mathbf{0}, \tag{6}$$
$$b_{\mathcal{A}_t}(\beta(t+dt)) = b_{\mathcal{A}_t}(\beta(t)) - \int_t^{t+dt} \mathrm{sign}\big(b_{\mathcal{A}_t}(\beta(\tau))\big)\,d\tau, \tag{7}$$

where $a_{\mathcal{A}_t}(\beta(t))$ is given by (5). Note that we consider small dt > 0 in all of the above discussion and assume that between t and t + dt the active index set has not changed. Consequently, beginning at t we may keep updating the solution path using (6) until the active set changes at some t′ > t. This happens when another predictor variable $j' \notin \mathcal{A}_t$ joins the active set $\mathcal{A}_t$ to share the honor of having the largest absolute value of the first-order partial derivatives, that is, $|b_{j'}(\beta(t'))| = |b_j(\beta(t'))|$ for any active predictor $j \in \mathcal{A}_t$. At this point, we update the active set by setting $\mathcal{A}_{t'} = \mathcal{A}_t \cup \{j'\}$.

Now we present our extension QuasiLAR for the QLM. We initialize our solution path by identifying the predictor variable j for which the objective function R(β) changes fastest with respect to βj starting from β = 0. We first set $t_0 = -\max_{j=1,\cdots,p}\big|\frac{\partial}{\partial\beta_j}R(\beta)\big|_{\beta=0}\big| = -\max_{j=1,\cdots,p}|b_j(0)|$. It will be clear later why we choose t0 in this way. Our solution path begins with β(t0) = 0 and β0(t0) = β0(β(t0)) defined in terms of (3). The initial active predictor set is given by $\mathcal{A}_{t_0} = \{\arg\max_{1\le j\le p}|\tfrac{\partial}{\partial\beta_j}R(\beta)|_{\beta=0}|\} = \{\arg\max_{1\le j\le p}|b_j(0)|\}$.

With t0, β(t0), and $\mathcal{A}_{t_0}$, we update our solution path using (6) until a new variable joins the active set at some t1 (> t0) to be determined. That means the solution for any t > t0 may be temporarily updated by $\tilde\beta_{\mathcal{A}_{t_0}}(t) = \beta_{\mathcal{A}_{t_0}}(t_0) + \int_{t_0}^t a_{\mathcal{A}_{t_0}}(\tilde\beta(\tau))\,d\tau$ and $\tilde\beta_{\mathcal{A}_{t_0}^c}(t) = \mathbf{0}$. Here $\tilde\beta(t)$ is a temporary solution path defined for any t > t0. For any $j \notin \mathcal{A}_{t_0}$, define $T_j = \min\{t > t_0 : |b_j(\tilde\beta(t))| \ge |b_m(\tilde\beta(t))|\}$, where $m \in \mathcal{A}_{t_0}$. Then t1 is given by $t_1 = \min_{j\notin\mathcal{A}_{t_0}} T_j$, and we call t1 a transition point in that the set of active predictors changes at t = t1.

Then our QuasiLAR algorithm updates by setting $\beta_{\mathcal{A}_{t_0}}(t) = \beta_{\mathcal{A}_{t_0}}(t_0) + \int_{t_0}^t a_{\mathcal{A}_{t_0}}(\beta(\tau))\,d\tau$, $\beta_{\mathcal{A}_{t_0}^c}(t) = \mathbf{0}$, and β0(t) = β0(β(t)) for all t ∈ [t0, t1]. The active predictor set stays the same for t ∈ [t0, t1), namely $\mathcal{A}_t = \mathcal{A}_{t_0}$. At t1, we update the active predictor set by setting $\mathcal{A}_{t_1} = \mathcal{A}_{t_0} \cup \{j \notin \mathcal{A}_{t_0} : T_j = t_1\}$. At t = t1, the number of active predictors is two. Due to (5) and the definitions of $\tilde\beta_{\mathcal{A}_{t_0}}(t)$, Tj, and t1, β(t1) satisfies $|b_j(\beta(t_1))| = |b_{j'}(\beta(t_1))| > |b_k(\beta(t_1))|$, $k \notin \mathcal{A}_{t_1}$, where $j, j' \in \mathcal{A}_{t_1}$. Note further that (7) and the definition of t0 ensure that $-t_1 = \big|\frac{\partial}{\partial\beta_j}R(\beta)\big|_{\beta(t_1)}\big| = |b_j(\beta(t_1))|$ for any $j \in \mathcal{A}_{t_1}$.

Our QuasiLAR algorithm continues with t1, β(t1), and $\mathcal{A}_{t_1}$. The full algorithm is given by Algorithm 1. Note that at the end of the mth QuasiLAR step, tm, β(tm), and $\mathcal{A}_{t_m}$ satisfy $-t_m = \big|\frac{\partial}{\partial\beta_j}R(\beta)\big|_{\beta(t_m)}\big| = |b_j(\beta(t_m))|$ for any $j \in \mathcal{A}_{t_m}$ and $|b_j(\beta(t_m))| = |b_{j'}(\beta(t_m))| > |b_k(\beta(t_m))|$, $k \notin \mathcal{A}_{t_m}$, for any $j, j' \in \mathcal{A}_{t_m}$.

Algorithm 1. QuasiLAR for the QLM.

Step 1: Initialize by setting $t_0 = -\max_{j=1,\cdots,p}|b_j(0)|$, β(t0) = 0, β0(t0) = β0(β(t0)) as defined in (3), and $\mathcal{A}_{t_0} = \{\arg\max_{1\le j\le p}|b_j(0)|\}$.

Step 2: For m = 0, ⋯ , p − 2, define a tentative solution path using

$$\tilde\beta_{\mathcal{A}_{t_m}}(t) = \beta_{\mathcal{A}_{t_m}}(t_m) + \int_{t_m}^t a_{\mathcal{A}_{t_m}}(\tilde\beta(\tau))\,d\tau \quad\text{and}\quad \tilde\beta_{\mathcal{A}_{t_m}^c}(t) = \mathbf{0}$$

for t ≥ tm. Define a new transition point $t_{m+1} = \min_{j\notin\mathcal{A}_{t_m}} T_j$, where $T_j = \min\{t > t_m : |b_j(\tilde\beta(t))| \ge |b_k(\tilde\beta(t))| \text{ for some } k \in \mathcal{A}_{t_m}\}$ for $j \notin \mathcal{A}_{t_m}$. Update the solution path by setting $\beta_{\mathcal{A}_{t_m}}(t) = \beta_{\mathcal{A}_{t_m}}(t_m) + \int_{t_m}^t a_{\mathcal{A}_{t_m}}(\beta(\tau))\,d\tau$, $\beta_{\mathcal{A}_{t_m}^c}(t) = \mathbf{0}$, and β0(t) = β0(β(t)) for t ∈ [tm, tm+1]. Set $\mathcal{A}_t = \mathcal{A}_{t_m}$ for t ∈ [tm, tm+1) and $\mathcal{A}_{t_{m+1}} = \mathcal{A}_{t_m} \cup \{j \notin \mathcal{A}_{t_m} : T_j = t_{m+1}\}$.

Step 3: At the end of Step 2, $\mathcal{A}_{t_{p-1}}$ should be exactly {1, ⋯ , p}. Next we update the solution path using $\beta(t) = \beta(t_{p-1}) + \int_{t_{p-1}}^t a_{\mathcal{A}_{t_{p-1}}}(\beta(\tau))\,d\tau$, $\beta_0(t) = \beta_0(\beta(t))$, and $\mathcal{A}_t = \{1, \cdots, p\}$ for t between tp−1 and tp = 0.

Note that at the end of the (p − 1)th QuasiLAR step in Step 2 of Algorithm 1, all predictors are active. Then, in Step 3, the QuasiLAR path moves along a direction such that the absolute values of the first-order partial derivatives decrease at the same speed until all the first-order partial derivatives are exactly zero, which happens at t = 0. The solution at t = 0 exactly corresponds to the full solution of the QLM obtained by solving (1), just as the LAR ends at the full OLS estimate. This completes our QuasiLAR algorithm.
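To make the flow of Algorithm 1 concrete, here is a minimal skeleton of the outer loop. It assumes the hypothetical helpers grad_R and hess_R introduced above, plus a helper integrate_piece that advances one piece of the path and reports the index of the variable joining at the next transition point; one possible ODE-based sketch of such a helper is given in Section 3.2. None of these names come from the paper.

```python
import numpy as np

def quasi_lar(p, grad_R, hess_R, integrate_piece):
    """Skeleton of Algorithm 1; the work of each piece happens inside integrate_piece."""
    beta = np.zeros(p)
    b0 = grad_R(beta)
    t = -np.max(np.abs(b0))                       # Step 1: t_0 = -max_j |b_j(0)|
    active = [int(np.argmax(np.abs(b0)))]         # initial active predictor set
    path = [(t, beta.copy())]
    while len(active) < p:                        # Step 2: one new variable per piece
        t, beta, new_j = integrate_piece(t, beta, active, grad_R, hess_R)
        path.append((t, beta.copy()))
        active.append(new_j)
    t, beta, _ = integrate_piece(t, beta, active, grad_R, hess_R)   # Step 3: run to t = 0
    path.append((t, beta.copy()))
    return path
```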

Remark 1

Note that the QuasiLAR instantaneous path updating direction is given by $-\big(M_{\mathcal{A}_t,\mathcal{A}_t}(\beta(t))\big)^{-1}\mathrm{sign}\big(b_{\mathcal{A}_t}(\beta(t))\big)$. For least squares regression, the objective function is exactly quadratic and thus $M_{\mathcal{A}_t,\mathcal{A}_t}$ depends only on the active predictor set $\mathcal{A}_t$, not on the current solution values $\beta_{\mathcal{A}_t}(t)$. Note that $\mathrm{sign}\big(b_{\mathcal{A}_t}(\beta(t))\big)$ does not change in a small neighborhood of t. This implies that, within a small neighborhood of t, the instantaneous path updating direction stays the same for least squares regression. This leads to the piecewise linear solution path of the LAR and, more generally, of Rosset and Zhu (2007).

3.1 Quasi-LASSO modification

Efron et al. (2004) discovered that the LASSO solution path can be obtained by a slight modification of the LAR. Next we make a parallel extension by showing that the QuasiLAR can be modified to get the whole LASSO regularized quasi-likelihood solution path.

Now consider the LASSO regularized quasi-likelihood in two different formats

$$\min_{\beta_0,\beta}\ -R(\beta, \beta_0(\beta)) + \lambda\sum_{j=1}^p |\beta_j|, \tag{8}$$
$$\min_{\beta_0,\beta}\ -R(\beta, \beta_0(\beta)) \quad\text{subject to}\quad \sum_{j=1}^p |\beta_j| \le s, \tag{9}$$

which are equivalent with one-to-one correspondence between λ ≥ 0 and s ≥ 0.

Let $\hat\beta$ be a LASSO solution to (8). Then we can show that the sign of any nonzero component $\hat\beta_j$ must agree with the sign of the current first-order partial derivative $b_j(\hat\beta)$; this is given by Lemma 2 in Section 5.

Suppose t = t* is the end of a QuasiLAR step with a new active set $\mathcal{A}$. At the next QuasiLAR step, with t ∈ [t*, T] for some T to be determined, the QuasiLAR solution path moves along the following tentative solution path

$$\tilde\beta_{\mathcal{A}}(t) = \beta_{\mathcal{A}}(t^*) + \int_{t^*}^t a_{\mathcal{A}}(\tilde\beta(\tau))\,d\tau \quad\text{and}\quad \tilde\beta_{\mathcal{A}^c}(t) = \mathbf{0} \tag{10}$$

for t ≥ t*. Denote $T_j = \min\{t > t^* : |b_j(\tilde\beta(t))| \ge |b_k(\tilde\beta(t))| \text{ for some } k \in \mathcal{A}\}$ for $j \notin \mathcal{A}$. Then the end point T is given by $T = \min_{j\notin\mathcal{A}} T_j$.

However, $\tilde\beta_j(t)$ may have changed sign at some point between t* and T for some $j \in \mathcal{A}$, in which case the sign restriction in Lemma 2 must have been violated. We define $S_j = \min\{t \in (t^*, \infty) : \tilde\beta_j(t) = 0\}$ for $j \in \mathcal{A}$, where $\tilde\beta_j(t)$ is the jth component of $\tilde\beta(t)$ defined by (10). If $S = \min_{j\in\mathcal{A}} S_j < T$, then $\tilde\beta(T)$ defined by (10) cannot be a LASSO quasi-likelihood solution, since the sign restriction in Lemma 2 has already been violated. The following Quasi-LASSO modification can be applied to ensure that we obtain the LASSO regularized quasi-likelihood solution path.

Quasi-LASSO modification

If S < T, stop the ongoing QuasiLAR step at S and remove $\tilde j$ from the active set $\mathcal{A}$ by setting $\mathcal{A}_S = \mathcal{A}\setminus\{\tilde j\}$, where $\tilde j$ is chosen such that $S_{\tilde j} = S$. At the new transition point S, the new path updating direction is calculated based on the new active predictor set $\mathcal{A}\setminus\{\tilde j\}$.

We have the following theorem to guarantee that the Quasi-LASSO modification leads to the LASSO regularized quasi-likelihood solution path. We name the modified algorithm QuasiLASSO and use QuasiLARS to refer to both QuasiLAR and QuasiLASSO.

Note that at each transition point of our QuasiLARS solution path, two kinds of events can happen: either an inactive predictor joins the active predictor set or an active predictor is removed from it. As in Efron et al. (2004), we assume that a “one at a time” condition holds: at each transition point t*, only a single event can happen, namely either one inactive predictor variable becomes active or one currently active predictor variable becomes inactive.

Theorem 1

With the Quasi-LASSO modification, and assuming the “one at a time” condition, the QuasiLARS algorithm yields the LASSO quasi-likelihood solution path.

Remark 2

Here we make the “one at a time” assumption. However, even when the “one at a time” condition does not hold, a QuasiLASSO solution path is still available; the same discussion as in Efron et al. (2004) applies. In practical applications, some slight jittering may simply be applied, if necessary, to ensure the “one at a time” condition.

3.2 Updating via ODE

Our solution path algorithm QuasiLARS involves an essential piecewise updating step $\tilde\beta_{\mathcal{A}_{t^*}}(t) = \beta_{\mathcal{A}_{t^*}}(t^*) + \int_{t^*}^t a_{\mathcal{A}_{t^*}}(\tilde\beta(\tau))\,d\tau$ and $\tilde\beta_{\mathcal{A}_{t^*}^c}(t) = \mathbf{0}$, beginning at a transition point t* with solution β(t*) and active predictor set $\mathcal{A}_{t^*}$. Note that the piecewise updating can be easily achieved by setting $\tilde\beta_j(t) = 0$ for $j \notin \mathcal{A}_{t^*}$ and t > t*, and solving the ODE system $\frac{d}{dt}\tilde\beta_{\mathcal{A}_{t^*}}(t) = a_{\mathcal{A}_{t^*}}(\tilde\beta(t))$ with initial value condition $\tilde\beta_{\mathcal{A}_{t^*}}(t)\big|_{t=t^*} = \beta_{\mathcal{A}_{t^*}}(t^*)$. This is a standard initial-value ODE system, for which there are many efficient solvers available. We have implemented our QuasiLARS using the Matlab ODE solver “ODE45.”
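As one possible illustration of this step (the paper's implementation uses Matlab's ODE45; the sketch below uses SciPy's solve_ivp instead, together with the hypothetical helpers update_direction, grad_R, and hess_R from the earlier sketches), each piece can be integrated until an event function signals the next QuasiLAR transition:

```python
import numpy as np
from scipy.integrate import solve_ivp

def integrate_piece(t_star, beta_star, active, grad_R, hess_R, t_end=0.0):
    """Advance one QuasiLAR piece from t_star until the next transition (or until t_end)."""
    p = len(beta_star)
    inactive = [j for j in range(p) if j not in active]

    def rhs(t, beta):
        # d/dt beta_A = a_A(beta) from (5); inactive coordinates stay at zero.
        return update_direction(beta, active, grad_R, hess_R)

    def catch_up(t, beta):
        # Reaches zero when some inactive |b_j| catches up with the common active |b_k|.
        if not inactive:
            return 1.0
        b = grad_R(beta)
        return np.min(np.abs(b[active])) - np.max(np.abs(b[inactive]))
    catch_up.terminal = True                       # stop integration at the transition

    sol = solve_ivp(rhs, (t_star, t_end), beta_star, events=catch_up,
                    rtol=1e-8, atol=1e-10, dense_output=True)
    beta_end = sol.y[:, -1]
    b_end = grad_R(beta_end)
    new_j = max(inactive, key=lambda j: abs(b_end[j])) if inactive else None
    return sol.t[-1], beta_end, new_j
```

The Quasi-LASSO modification of Section 3.1 would add a second terminal event that fires when an active coefficient crosses zero.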

4 Details for deriving the path updating direction

Note that the path updating direction defined by (5) calls for $\frac{\partial}{\partial\beta_j}\sum_{i=1}^n Q\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big)$ and $\frac{\partial^2}{\partial\beta_j\partial\beta_{j'}}\sum_{i=1}^n Q\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big)$. By the chain rule, we are required to have the implicit partial derivatives $\frac{\partial}{\partial\beta_j}\beta_0(\beta)$ and $\frac{\partial^2}{\partial\beta_j\partial\beta_{j'}}\beta_0(\beta)$. Next we show how to obtain them.

According to its definition (3), β0(β) satisfies $\frac{\partial}{\partial\beta_0}\sum_{i=1}^n Q\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big) = \sum_{i=1}^n Q_1\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big)\,(g^{-1})'(\beta_0 + x_i^T\beta) = 0$, where $Q_1(\mu, y) = \frac{\partial}{\partial\mu}Q(\mu, y)$. Now treat β0 as a function of β and take the derivative of each term with respect to βj. We get $\sum_{i=1}^n Q_{11}\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big)\,\big[(g^{-1})'(\beta_0 + x_i^T\beta)\big]^2\,\big(x_{ij} + \tfrac{\partial\beta_0}{\partial\beta_j}\big) + \sum_{i=1}^n Q_1\big(g^{-1}(\beta_0 + x_i^T\beta),\, y_i\big)\,(g^{-1})''(\beta_0 + x_i^T\beta)\,\big(x_{ij} + \tfrac{\partial\beta_0}{\partial\beta_j}\big) = 0$, where $Q_{11}(\mu, y) = \frac{\partial^2}{\partial\mu^2}Q(\mu, y)$. Thus, solving for $\frac{\partial\beta_0}{\partial\beta_j}$, we get $\frac{\partial}{\partial\beta_j}\beta_0(\beta) = -\,\frac{\sum_{i=1}^n x_{ij}\, c_i(\beta)}{\sum_{i=1}^n c_i(\beta)}$, where $c_i(\beta) = Q_{11}\big(g^{-1}(\beta_0(\beta) + x_i^T\beta),\, y_i\big)\,\big[(g^{-1})'(\beta_0(\beta) + x_i^T\beta)\big]^2 + Q_1\big(g^{-1}(\beta_0(\beta) + x_i^T\beta),\, y_i\big)\,(g^{-1})''(\beta_0(\beta) + x_i^T\beta)$ for i = 1, ⋯, n. To get the second-order partial derivatives $\frac{\partial^2}{\partial\beta_j\partial\beta_{j'}}\beta_0(\beta)$, we may apply another layer of the differential operator $\frac{\partial}{\partial\beta_{j'}}$.
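A small numerical sketch of this implicit first-order derivative, assuming user-supplied callables for $g^{-1}$, its first two derivatives, and $Q_1$, $Q_{11}$ (all names below are placeholders of ours):

```python
import numpy as np

def dbeta0_dbeta(X, y, beta, beta0, ginv, ginv_prime, ginv_second, Q1, Q11):
    """Vector of implicit derivatives d beta0(beta) / d beta_j, j = 1, ..., p."""
    eta = beta0 + X @ beta                                   # linear predictors
    mu = ginv(eta)                                           # fitted means
    c = Q11(mu, y) * ginv_prime(eta) ** 2 + Q1(mu, y) * ginv_second(eta)   # c_i(beta)
    return -(X.T @ c) / np.sum(c)                            # -sum_i x_ij c_i / sum_i c_i
```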

For some particular generalized linear models, it may be much simpler to get those partial derivatives as shown in the following subsections.

4.1 Binomial

For the Binomial distribution, the data set is given by {(xi, yi) : i = 1, ⋯, n} with yi ∈ {0, 1}. With the canonical logit link η(x) = log(μ(x)/(1 − μ(x))), the corresponding log-likelihood function is given by $L(\beta, \beta_0) = \sum_{i=1}^n\big(y_i(x_i^T\beta + \beta_0) - \log(1 + e^{x_i^T\beta + \beta_0})\big)$. Then for any β, the corresponding optimal β0(β) is given by the solution of $\sum_{i=1}^n y_i - \sum_{i=1}^n \frac{e^{x_i^T\beta + \beta_0}}{1 + e^{x_i^T\beta + \beta_0}} = 0$, which is equivalent to $\sum_{i=1}^n (1 - y_i) - \sum_{i=1}^n \big(1 + e^{x_i^T\beta + \beta_0}\big)^{-1} = 0$. We next differentiate both sides with respect to βj and solve for $\frac{\partial\beta_0}{\partial\beta_j}$ to get $\frac{\partial\beta_0}{\partial\beta_j} = -\Big(\sum_{i=1}^n x_{ij}\,\frac{e^{x_i^T\beta + \beta_0(\beta)}}{(1 + e^{x_i^T\beta + \beta_0(\beta)})^2}\Big)\Big/\Big(\sum_{i=1}^n \frac{e^{x_i^T\beta + \beta_0(\beta)}}{(1 + e^{x_i^T\beta + \beta_0(\beta)})^2}\Big)$. We may take another layer of differentiation to get the second-order partial derivatives.
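The same quantity for the logit link, as a short numpy sketch (illustrative only; the helper name is ours):

```python
import numpy as np

def logistic_dbeta0_dbeta(X, beta, beta0):
    """d beta0(beta) / d beta_j for the Binomial model with the logit link."""
    prob = 1 / (1 + np.exp(-(beta0 + X @ beta)))   # fitted probabilities
    w = prob * (1 - prob)                          # e^eta / (1 + e^eta)^2
    return -(X.T @ w) / w.sum()                    # one entry per predictor j
```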

4.2 Poisson

In the case of the Poisson distribution with the canonical log link η(x) = log μ(x), the log-likelihood function is given, up to a constant, by $L(\beta, \beta_0) = \sum_{i=1}^n\big[-e^{x_i^T\beta + \beta_0} + y_i(x_i^T\beta + \beta_0)\big]$. For any β, the maximizer β0(β) of L(β, β0) is given by $\beta_0(\beta) = \log\big(\sum_{i=1}^n y_i\big) - \log\big(\sum_{i=1}^n e^{x_i^T\beta}\big)$. We may differentiate this expression to get the partial derivatives $\frac{\partial}{\partial\beta_j}\beta_0(\beta)$ and $\frac{\partial^2}{\partial\beta_j\partial\beta_{j'}}\beta_0(\beta)$.

With the closed-form formula for β0(β), we may simply plug it into the likelihood and get $L(\beta) \triangleq L(\beta, \beta_0(\beta)) = -\sum_{i=1}^n y_i + \sum_{i=1}^n y_i x_i^T\beta - \big(\sum_{i=1}^n y_i\big)\log\big(\sum_{i=1}^n e^{x_i^T\beta}\big) + \big(\sum_{i=1}^n y_i\big)\log\big(\sum_{i=1}^n y_i\big)$, which corresponds to our notation R(β).
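These closed-form Poisson quantities are easy to code directly; a small sketch (ours, for illustration):

```python
import numpy as np

def poisson_beta0(X, y, beta):
    """Closed-form profile intercept beta0(beta) for the Poisson model with log link."""
    return np.log(y.sum()) - np.log(np.exp(X @ beta).sum())

def poisson_profile_loglik(X, y, beta):
    """R(beta) = L(beta, beta0(beta)) for the Poisson model, up to a constant."""
    s = y.sum()
    return -s + y @ (X @ beta) - s * np.log(np.exp(X @ beta).sum()) + s * np.log(s)
```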

5 Properties of QuasiLARS

We next establish some properties of the QuasiLARS solution path and prove Theorem 1.

With the “one at a time” condition, at each transition point t*, only a single event can happen, namely either one inactive predictor variable becomes active or one currently active predictor variable becomes inactive. For the first type of event, the active set changes from $\mathcal{A}$ to $\mathcal{A}^* = \mathcal{A}\cup\{j^*\}$ for some $j^* \notin \mathcal{A}$. We next show in Lemma 1 that this new active variable j* joins in a “correct” manner. This is the key result for proving Theorem 1. Lemma 1 applies to the QuasiLARS (both QuasiLAR and QuasiLASSO).

Lemma 1

For any transition point t* along the QuasiLARS solution path, if predictor variable j* is the only addition to the active set at t*, with solution β(t*) and active set changing from $\mathcal{A}$ to $\mathcal{A}^* = \mathcal{A}\cup\{j^*\}$, then the path updating direction a(β(t*)) at t* has its j*th component agreeing in sign with the current first-order partial derivative $b_{j^*}(\beta(t^*))$.

Our next four lemmas concern properties of the LASSO regularized quasi-likelihood solution. These lemmas will lead to the proof of Theorem 1. For any s ≥ 0, we denote the solution of (9) by $\hat\beta = \hat\beta(s)$, which is unique for each s and continuous in s. The uniqueness is due to the convexity of $\sum_{j=1}^p|\beta_j|$ and the strict convexity of −R(β, β0(β)). Throughout the paper, the hat notation always refers to the LASSO regularized quasi-likelihood solution. For any s ≥ 0, let $\mathcal{N}_s \triangleq \mathcal{N}(\hat\beta(s)) \triangleq \{j : \hat\beta_j(s) \ne 0\}$ denote the index set of nonzero components of $\hat\beta(s)$. We will show that the nonzero set $\mathcal{N}_s$ is also the active predictor set that determines the QuasiLARS path updating direction.

Let $\hat\beta$ be a solution of (8). Next we can show that the sign of any nonzero component $\hat\beta_j$ must agree with the sign of the current first-order partial derivative, namely $\mathrm{sign}(\hat\beta_j) = \mathrm{sign}\big(b_j(\hat\beta)\big)$ for $j \in \mathcal{N}(\hat\beta)$.

Lemma 2

A LASSO regularized quasi-likelihood solution $\hat\beta$ to (8) satisfies $\mathrm{sign}(\hat\beta_j) = \mathrm{sign}\big(b_j(\hat\beta)\big)$ for any $j \in \mathcal{N}(\hat\beta)$.

Let $\mathcal{S}$ be an open interval of the s axis, with infimum $s^*$, within which the nonzero set $\mathcal{N}_s$ of the corresponding LASSO regularized quasi-likelihood solution $\hat\beta(s)$ remains constant, namely, $\mathcal{N}_s = \mathcal{N}$ for $s \in \mathcal{S}$ and some $\mathcal{N}$.

Lemma 3

For $s \in \{s^*\}\cup\mathcal{S}$, the LASSO regularized quasi-likelihood estimate $\hat\beta(s)$ updates along the QuasiLARS path updating direction.

Lemma 4

For an open interval $\mathcal{S}$ with a constant nonzero set $\mathcal{N}$ over the LASSO regularized quasi-likelihood solution path $\hat\beta(s)$, let $s^* = \inf(\mathcal{S})$. Then for $s \in \mathcal{S}\cup\{s^*\}$, the first-order partial derivatives of R(β, β0(β)) at $\hat\beta(s)$ must satisfy $|b_j(\hat\beta(s))| = \max_{l=1,\cdots,p}|b_l(\hat\beta(s))|$ for $j \in \mathcal{N}$ and $|b_j(\hat\beta(s))| \le \max_{l=1,\cdots,p}|b_l(\hat\beta(s))|$ for $j \notin \mathcal{N}$.

Let $s^*$ denote such a point, $s^* = \inf(\mathcal{S})$ as in Lemma 4, with the LASSO regularized quasi-likelihood solution $\hat\beta$, current derivatives $b_j(\hat\beta)$, and maximum absolute derivative $\hat D(\hat\beta) = \max_j|b_j(\hat\beta)|$. Define $\mathcal{A}_1 = \{j : \hat\beta_j \ne 0\}$, $\mathcal{A}_0 = \{j : \hat\beta_j = 0 \text{ and } |b_j(\hat\beta)| = \hat D(\hat\beta)\}$, and $\mathcal{A}_{10} = \mathcal{A}_1\cup\mathcal{A}_0$. Define $\beta(\gamma) = \hat\beta + \gamma d$ for some $d \in \mathbb{R}^p$, $T(\gamma) = R(\beta(\gamma), \beta_0(\beta(\gamma)))$ and $S(\gamma) = \sum_{j=1}^p |\beta_j(\gamma)|$. Denote $\dot S(\gamma) = \frac{d}{d\gamma}S(\gamma)$, $\dot T(\gamma) = \frac{d}{d\gamma}T(\gamma)$, and $\ddot T(\gamma) = \frac{d^2}{d\gamma^2}T(\gamma)$.

Lemma 5

At $s^*$, we have

$$Z(d) = \frac{\dot T(0)}{\dot S(0)} \le \hat D(\hat\beta), \tag{11}$$

with equality only if $d_j = 0$ for $j \in \mathcal{A}_{10}^c$ and $\mathrm{sign}(d_j) = \mathrm{sign}\big(b_j(\hat\beta)\big)$ for $j \in \mathcal{A}_0$. If so,

$$\ddot T(0) = d_{\mathcal{A}_{10}}^T\, M_{\mathcal{A}_{10},\mathcal{A}_{10}}(\hat\beta)\, d_{\mathcal{A}_{10}}. \tag{12}$$

One implication of Lemma 5 is that, at any transition point, the active predictor set of the LASSO regularized quasi-likelihood solution is a subset of $\mathcal{A}_{10}$. Note that the LASSO regularized quasi-likelihood minimizes −R(β, β0(β)) subject to a constraint on the one norm of β. Locally around $\hat\beta$, we are maximizing T(γ) subject to an upper bound on S(γ). The first part of Lemma 5 implies that the instantaneous changing rate of T(γ) relative to S(γ) is at most $\hat D(\hat\beta)$. For β(γ), its one norm S(γ) is increasing in γ as long as $\sum_{j\in\mathcal{A}_1}\mathrm{sign}(b_j(\hat\beta))d_j + \sum_{j\in\mathcal{A}_0}|d_j| + \sum_{j\in\mathcal{A}_{10}^c}|d_j| > 0$, and the best instantaneous relative changing rate is achieved whenever $d_j = 0$ for $j \in \mathcal{A}_{10}^c$ and $\mathrm{sign}(d_j) = \mathrm{sign}(b_j(\hat\beta))$ for $j \in \mathcal{A}_0$. Note that $j \in \mathcal{A}_0$ means that the jth predictor variable is changing from inactive to active. Then, with the “one at a time” condition, the set $\mathcal{A}_0$ is a singleton and the requirement $\mathrm{sign}(d_j) = \mathrm{sign}(b_j(\hat\beta))$ for $j \in \mathcal{A}_0$ is thus guaranteed for our QuasiLARS path updating direction due to Lemma 1.

The second part of Lemma 5 provides a closer look at the relative changing rate by examining the second-order derivative $\ddot T(0)$. As we only care about the direction, we assume that $\dot S(0) = \sum_{j\in\mathcal{A}_1}\mathrm{sign}(b_j(\hat\beta))d_j + \sum_{j\in\mathcal{A}_0}|d_j| = \Delta$ for some fixed Δ > 0. Note that $T(\gamma) = T(0) + \dot T(0)\gamma + \frac{1}{2}\ddot T(0)\gamma^2 + o(\gamma^2)$. Then we need to find the best direction d to maximize T(γ) among all directions d satisfying $\sum_{j\in\mathcal{A}_1}\mathrm{sign}(b_j(\hat\beta))d_j + \sum_{j\in\mathcal{A}_0}|d_j| = \Delta$ and $\mathrm{sign}(d_j) = \mathrm{sign}(b_j(\hat\beta))$ for $j \in \mathcal{A}_0$. Taking the second-order information into account, we need to solve

$$\max_{d}\ d_{\mathcal{A}_{10}}^T\, M_{\mathcal{A}_{10},\mathcal{A}_{10}}(\hat\beta)\, d_{\mathcal{A}_{10}} \tag{13}$$

for some Δ > 0 to select the optimal solution updating direction d. As we only care about the direction, Δ > 0 can be any number. Our next lemma shows that the optimal direction corresponding to (13) is exactly given by our QuasiLARS path updating direction.

Lemma 6

Our QuasiLARS path updating direction matches the direction corresponding to the solution to (13).

6 QuasiLARS in Action

In this section, we apply the QuasiLARS to different data sets with different models. In our implementation, we first calculate t0 and then set δt = −t0/K for a large positive K; for our numerical examples, we set K = 2000. In addition to the transition points tk, we evaluate the solution along the solution path on a grid of size δt. More specifically, for each piece of our solution path over [tk, tk+1], we calculate our solution β(t) at t = tk + mδt for $m = 1, \cdots, \lfloor (t_{k+1} - t_k)/\delta t \rfloor$, where ⌊a⌋ denotes the integer part of a.

The first toy example, with a Poisson distribution, is used to demonstrate that the LASSO regularized quasi-likelihood does have a nonlinear solution path. In Example 2, we consider the diabetes data with a Gaussian distribution and compare the QuasiLARS and the LARS. The response of the diabetes data is actually positive integer valued, and thus can be thought of as coming from some Poisson model. In Example 3, we apply the QuasiLARS with a Poisson distribution to the diabetes data. The Binomial QuasiLARS is considered in Example 4 with the Wisconsin Diagnostic Breast Cancer (WDBC) data (available online at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).

Example 1 (A Poisson toy example) We set p = 3 and n = 40. The predictor covariates are generated from X ~ N(0, Σ), where Σ is the variance-covariance matrix with (i, j) element equal to 1 if i = j and 0.9 otherwise. Conditional on X = (x1, x2, x3)^T, the response is generated from a Poisson distribution with mean exp(4 + 3x1 − 5x2 + x3). We apply our QuasiLARS with the canonical link function η(x) = log μ(x) and the identity variance function V(μ(x)) = μ(x) of the Poisson distribution.
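For concreteness, data with these settings could be generated as follows (a sketch of ours; the random seed is arbitrary and not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
Sigma = np.full((p, p), 0.9) + 0.1 * np.eye(p)       # 1 on the diagonal, 0.9 elsewhere
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = rng.poisson(np.exp(4 + 3 * X[:, 0] - 5 * X[:, 1] + X[:, 2]))
```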

For this toy example, the QuasiLAR and QuasiLASSO lead to the same solution path. In the top panel of Figure 2, we plot our solution path with solid lines; the horizontal axis corresponds to the one norm of β(t). Connecting the solutions at different transition points by straight lines gives the dashed lines. The figure clearly demonstrates that the true solution path for the LASSO regularized quasi-likelihood is not piecewise linear. In the bottom panel, the solution β(t) is plotted with respect to t.

Example 2 (Gaussian with Diabetes data) In this example, we use the diabetes data (Efron et al., 2004) to compare the solution path of our extension QuasiLARS with that of the original LARS algorithm. In this data set, ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline. We run our extension QuasiLAR algorithm on this data set. Our QuasiLAR solution path matches the LAR solution path obtained by the R package LARS, which is shown in the top left panel of Figure 1. The maximum solution difference over all transition points is very small and in fact bounded from above by 5.0×10⁻⁷, namely, $\max_{j=1,\cdots,p}\max_{m=1,\cdots,10}\big|\beta_j^{LAR}(t_m) - \beta_j^{QuasiLAR}(t_m)\big| < 5.0\times 10^{-7}$, where $\beta^{LAR}$ and $\beta^{QuasiLAR}$ denote the LAR and QuasiLAR solutions, respectively. Compared to the QuasiLAR, the QuasiLASSO solution path has two more transition points. This is consistent with the result of the R package LARS. With the LASSO, the maximum solution difference over all transition points is also bounded from above: $\max_{j=1,\cdots,p}\max_{m=1,\cdots,12}\big|\beta_j^{LASSO}(t_m) - \beta_j^{QuasiLASSO}(t_m)\big| < 9.0\times 10^{-7}$. This example confirms that the QuasiLARS matches the LARS in the Gaussian case and works correctly. However, to save space, we do not plot our QuasiLAR and QuasiLASSO paths.

Example 3 (Poisson with Diabetes data) The response in the diabetes data is in fact positive integer valued. We apply our QuasiLARS algorithm by choosing Poisson distribution with the canonical log link function and identity variance function, namely, η(x) = log μ(x) and V (μ(x)) = μ(x). Results are shown in Figure 3. As in the Gaussian example, some discrepancy between the QuasiLAR and QuasiLASSO solution paths is observed. The QuasiLASSO has four more transition points than the QuasiLAR does.

Figure 3. Poisson QuasiLAR (top) and QuasiLASSO (bottom) solution paths for the Diabetes data.

Example 4 (Binomial with WDBC Data) The WDBC data set is based on n = 569 patients, and the number of predictors is p = 30. The response is binary in that each patient is diagnosed either as malignant (Y = 1) or benign (Y = 0). We first standardize each predictor variable to have mean zero and variance one. Our QuasiLARS with the Binomial distribution is applied to this data set with the logit link $\log\frac{\mu(x)}{1-\mu(x)} = \eta(x)$ and variance function V(μ(x)) = μ(x)(1 − μ(x)). There are many predictor variables available. To locate an “optimal” solution along the QuasiLARS solution path, we use the Bayesian Information Criterion (BIC) defined by $\mathrm{BIC}(\beta(t)) = -2\sum_{i=1}^n \log L\big(x_i, y_i; \beta(t), \beta_0(\beta(t))\big) + (\log n)\, k(\beta(t))$, where L(xi, yi; β(t), β0(β(t))) denotes the Binomial likelihood and k(β(t)) = #{1 ≤ j ≤ p : ∣βj(t)∣ > 0} denotes the number of nonzero coefficients of β(t). For the LASSO regularized least squares regression, Zou, Hastie and Tibshirani (2007) proved that the number of nonzero coefficients is an unbiased and asymptotically consistent estimator of the degrees of freedom. Park and Hastie (2007) provided a heuristic proof for the case of generalized linear models. The optimal solution is given by β(t*) with $t^* = \arg\min_{t\in[t_0,0]} \mathrm{BIC}(\beta(t))$.
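A sketch of this BIC computation for the Binomial case, assuming the path has been evaluated on a grid as described above and that a helper beta0_fn(beta) returns the profile intercept (names are illustrative only):

```python
import numpy as np

def binomial_bic(X, y, beta, beta0):
    """BIC(beta) = -2 * Binomial log-likelihood + log(n) * number of nonzero coefficients."""
    eta = beta0 + X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))   # sum_i [y_i eta_i - log(1 + e^eta_i)]
    return -2.0 * loglik + np.log(len(y)) * np.count_nonzero(beta)

# The "optimal" solution minimizes BIC over the evaluated grid of the path:
# t_opt, beta_opt = min(path, key=lambda tb: binomial_bic(X, y, tb[1], beta0_fn(tb[1])))
```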

For the QuasiLAR, the nonzero elements of the optimal solution are given in the second column of Table 1. The BIC score is plotted against the solution's one norm $\sum_{j=1}^{30}|\beta_j(t)|$ in the top left panel of Figure 4 for t ∈ [t0, T], where T is a little beyond the t* corresponding to the “optimal” solution. The top right panel of Figure 4 gives the solution path for t ∈ [t0, T]. Here we truncate the figures at T to make them easier to read.

Table 1.

Nonzero elements of the optimal solution selected by BIC for Example 4

          QuasiLAR    QuasiLASSO
β2         0.2077      0.1624
β8         0.6170      0.5767
β11        1.5370      1.4667
β20       −0.3169     −0.2833
β21        4.3576      3.4047
β22        1.0325      1.0343
β23       −0.9287         —
β25        0.5470      0.5339
β27        0.5176      0.4395
β28        1.1496      1.0998
β29        0.3378      0.3257

Figure 4. Binomial QuasiLARS paths for the WDBC data: the top left panel plots the BIC score along the Binomial QuasiLAR solution path; the top right panel gives part of the Binomial QuasiLAR solution path; the bottom left panel plots the BIC score along the Binomial QuasiLASSO solution path; the bottom right panel gives part of the Binomial QuasiLASSO solution path.

For the QuasiLASSO, the optimal solution’s nonzero elements are shown in the third column of Table 1. The corresponding plots of the BIC and solution path are given in the two bottom panels of Figure 4.

From the QuasiLAR solution path given in the top right panel of Figure 4, we can see that one solution component changes sign between the second and third transition points. This change violates the sign constraint of the LASSO regularized quasi-likelihood solution path. Thus, in the QuasiLASSO solution path, another transition point is added at this point to avoid the sign constraint violation.

This example demonstrates that our extension QuasiLARS may be applied to high dimensional data sets. However, there is no need to complete the whole solution path. We may adopt an optimality criterion, say the BIC, and use it to identify the optimal solution as the QuasiLARS solution path progresses. An early termination is then possible, which saves computational effort, since it is computationally expensive to solve the ODE system when the active predictor set is large.

7 Conclusion

In this work, we extend the LARS algorithm to the QLM. Over each piece, the solution path is obtained by solving an initial-value ordinary differential equation system. Several examples are used to demonstrate how it works with real data. In particular, Example 4 uses the BIC to select the “optimal” solution along the solution path and shows that the QuasiLARS algorithm may be applied to high dimensional data, with early termination possible. One interesting future research topic is to study how to define degrees of freedom for the QuasiLARS, as studied for the LASSO in Zou et al. (2007). This would provide an elegant criterion for selecting the “optimal” solution.

The LARS is attractive because of its very fast speed, which is made possible by the piecewise linearity of the corresponding path. However, the QuasiLARS solution path is not piecewise linear due to the nature of the QLM, so we cannot expect the QuasiLARS to be as fast as the LARS. We have implemented a primitive version of our algorithm using the Matlab ODE solver “ODE45,” and it works fairly fast.

Supplementary Material

01

Acknowledgements

The author thanks Jianqing Fan and Chuanshu Ji for mentoring and longtime encouragement. The author also thanks Dennis Boos, Jingfang Huang, Yufeng Liu, John Monahan, and Leonard Stefanski for helpful comments and discussions. This work is supported in part by NSF grant DMS-0905561, NIH/NCI grant R01-CA149569, and NCSU Faculty Research and Professional Development Award.

References

1. Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics. 2007:2313–2351.
2. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression (with discussion). Annals of Statistics. 2004;32:409–499.
3. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
4. Friedman J, Hastie T, Tibshirani R. Regularized paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33.
5. Hastie T, Rosset S, Tibshirani R, Zhu J. The entire regularization path for the support vector machine. Journal of Machine Learning Research. 2004;5:1391–1415.
6. Hunter DR, Li R. Variable selection using MM algorithms. The Annals of Statistics. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
7. Li Y, Liu Y, Zhu J. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association. 2007;102:255–268.
8. Li Y, Zhu J. L1-norm quantile regression. Journal of Computational and Graphical Statistics. 2008;17:163–185.
9. Madigan D, Ridgeway G. Discussion of “Least angle regression”. Annals of Statistics. 2004;32:465–469.
10. Osborne M, Presnell B, Turlach B. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis. 2000;20:389–403.
11. Park MY, Hastie T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B. 2007;69:659–677.
12. Rocha G, Zhao P, Yu B. A path following algorithm for sparse pseudo-likelihood inverse covariance estimation (SPLICE). Technical Report. 2008.
13. Rosset S. Tracking curved regularized optimization solution paths. Advances in Neural Information Processing Systems. 2004;13.
14. Rosset S, Zhu J. Piecewise linear regularized solution paths. Annals of Statistics. 2007;35:1012–1030.
15. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
16. Wang J, Shen X, Liu Y. Probability estimation for large margin classifiers. Biometrika. 2008;95:149–167.
17. Wang L, Shen X. Multicategory support vector machines, feature selection and solution path. Statistica Sinica. 2006;16:617–634.
18. Wang L, Zhu J. Image denoising via solution paths. Annals of Operations Research (special issue on data mining). 2007.
19. Wu S, Shen X, Geyer C. Adaptive regularization through entire solution surface. Biometrika. 2009:513–527. doi: 10.1093/biomet/asp038.
20. Yuan M, Lin Y. On the nonnegative garrote estimator. Journal of the Royal Statistical Society, Series B. 2007;69:143–161.
21. Yuan M, Zou H. Efficient global approximation of generalized nonlinear L1 regularized solution paths and its applications. Journal of the American Statistical Association. 2009:1562–1574.
22. Zhang HH, Lu W. Adaptive LASSO for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
23. Zhu J, Rosset S, Hastie T, Tibshirani R. 1-norm support vector machines. Neural Information Processing Systems. 2004;16.
24. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
25. Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95:241–247.
26. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.
27. Zou H, Hastie T, Tibshirani R. On the degrees of freedom of the lasso. The Annals of Statistics. 2007:2173–2192.
28. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Annals of Statistics. 2008;36:1509–1566. doi: 10.1214/009053607000000802.
