Author manuscript; available in PMC: 2013 Sep 12.
Published in final edited form as: J Comput Graph Stat. 2013 May 30;22(2):261–283. doi: 10.1080/10618600.2012.681248

A Path Algorithm for Constrained Estimation

Hua Zhou 1, Kenneth Lange 2
PMCID: PMC3772096  NIHMSID: NIHMS497698  PMID: 24039382

Abstract

Many least-square problems involve affine equality and inequality constraints. Although there are a variety of methods for solving such problems, most statisticians find constrained estimation challenging. The current article proposes a new path-following algorithm for quadratic programming that replaces hard constraints by what are called exact penalties. Similar penalties arise in l1 regularization in model selection. In the regularization setting, penalties encapsulate prior knowledge, and penalized parameter estimates represent a trade-off between the observed data and the prior knowledge. Classical penalty methods of optimization, such as the quadratic penalty method, solve a sequence of unconstrained problems that put greater and greater stress on meeting the constraints. In the limit as the penalty constant tends to ∞, one recovers the constrained solution. In the exact penalty method, squared penalties are replaced by absolute value penalties, and the solution is recovered for a finite value of the penalty constant. The exact path-following method starts at the unconstrained solution and follows the solution path as the penalty constant increases. In the process, the solution path hits, slides along, and exits from the various constraints. Path following in Lasso penalized regression, in contrast, starts with a large value of the penalty constant and works its way downward. In both settings, inspection of the entire solution path is revealing. Just as with the Lasso and generalized Lasso, it is possible to plot the effective degrees of freedom along the solution path. For a strictly convex quadratic program, the exact penalty algorithm can be framed entirely in terms of the sweep operator of regression analysis. A few well-chosen examples illustrate the mechanics and potential of path following. This article has supplementary materials available online.

Keywords: Exact penalty, l1 regularization, Shape-restricted regression

1. INTRODUCTION

When constraints appear in maximum likelihood or least-square estimation, statisticians typically resort to sophisticated commercial software or craft specific optimization algorithms for specific problems. The current article presents a new technique for solving such problems that is motivated by path following in l1 regularized regression. In penalized regression, absolute value penalties guide the trade-off in parameter estimation between the observed data and prior knowledge. Running an estimation algorithm on a grid of tuning constants tends to miss important events along a path. In l1 penalized linear regression, the solution path is piecewise linear and can be anticipated. It turns out that similar considerations apply to quadratic programming with affine equality and inequality constraints. The exact penalty method of optimization replaces hard constraints by absolute value and hinge penalties and tracks the solution vector as the penalty tuning constant increases. For some finite value of the tuning constant, the penalized and constrained solutions coincide. In this article, we show how to track the solution path in quadratic programming. Besides providing the final constrained estimates, our new algorithm also delivers the whole solution path between the unconstrained and the constrained estimates. This is particularly helpful when the goal is to locate a solution between these two extremes based on criteria, such as prediction error in cross-validation.

In recent years, several path algorithms have been devised for specific l1 regularized problems. In particular, a modification of the least angle regression (LARS) procedure can handle Lasso penalized regression (Efron et al. 2004). Rosset and Zhu (2007) gave sufficient conditions for a solution path to be piecewise linear and expanded its applications to a wider range of loss and penalty functions. Friedman (2008) derived a path algorithm for any objective function defined by the sum of a convex loss and a separable penalty (not necessarily convex). The separability restriction on the penalty term excludes many of the problems studied here. Tibshirani and Taylor (2011) devised a path algorithm for generalized Lasso problems. Their formulation is similar to ours with two differences. First, they excluded inequality constraints. Our new path algorithm handles both equality and inequality constraints gracefully. Second, they passed to the dual problem and then translated the solution path of the dual problem back to the solution path of the primal problem. We attack the primal problem directly via a simple algorithm entirely driven by the classical sweep operator of regression analysis. In our opinion, primal path following is conceptually simpler and easier to program than dual path following. Readers adept in duality theory may disagree. On the other hand, the dual approach makes fewer restrictions on constraint gradients and can, in principle, deal with a wider variety of equality-constrained problems. The degrees of freedom formula derived for the Lasso (Efron et al. 2004; Zou, Hastie, and Tibshirani 2007) and generalized Lasso (Tibshirani and Taylor 2011) applies equally well in the presence of inequality constraints.

Our object of study will be minimization of the quadratic function

f(x) = \frac{1}{2} x^t A x + b^t x + c,    (1)

subject to the affine equality constraints Vx = d and the affine inequality constraints Wx ≤ e. Throughout our discussion, we assume that the feasible region is nontrivial and that the minimum is attained. If the symmetric matrix A has a negative eigenvalue λ and corresponding unit eigenvector u, then lim_{r→∞} f(ru) = –∞ because the quadratic term \frac{1}{2}(ru)^t A (ru) = \frac{\lambda}{2} r^2 dominates the linear term r b^t u. To avoid such behavior, we initially assume that all eigenvalues of A are positive. This makes f(x) strictly convex and coercive and guarantees a unique minimum point subject to the constraints. In linear regression, A = X^t X for some design matrix X. In this setting, A is positive definite, provided X has full column rank. The latter condition is only possible when the number of cases equals or exceeds the number of predictors. If A is positive semidefinite and singular, then adding a small amount of ridge regularization εI to it can be helpful (Tibshirani and Taylor 2011). Later we indicate how path following extends to positive semidefinite or even indefinite matrices A. Our assumption that the rows of V and W are linearly independent excludes problems such as the sparse fused Lasso and two- and three-dimensional fused Lasso considered by Tibshirani and Taylor (2011). We discuss the difficulties in relaxing this assumption in Section 5 and suggest a numerical remedy.
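To make the positive definiteness requirement and the ridge remedy concrete, here is a minimal MATLAB sketch of our own (not part of the article's software); it assumes A is symmetric and uses an arbitrary ridge size. A Cholesky factorization serves as the test for positive definiteness.

    % Illustrative check: is A positive definite?  If not, add a small ridge.
    X = [ones(5, 1), (1:5)', (1:5)'];      % a rank-deficient design, so X'*X is singular
    A = X' * X;
    [~, p] = chol(A);                      % p = 0 exactly when A is positive definite
    if p > 0
        epsRidge = 1e-6 * trace(A) / size(A, 1);   % arbitrary small ridge constant
        A = A + epsRidge * eye(size(A, 1));        % the remedy epsilon*I from the text
    end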

In multitask learning, the response is a d-dimensional vector Y ∈ R^d, and one minimizes the squared Frobenius deviation

\frac{1}{2} \| Y - XB \|_F^2    (2)

with respect to the p × d regression coefficient matrix B. When the constraints take the form VB = D and WB ≤ E, the problem reduces to quadratic programming as just posed. Indeed, if we stack the columns of Y with the vec operator, then the problem involves minimizing \frac{1}{2} \| \mathrm{vec}(Y) - (I \otimes X)\,\mathrm{vec}(B) \|_2^2. Here, the identity vec(XB) = (I ⊗ X) vec(B) comes into play invoking the Kronecker product and the identity matrix I. Similarly, we can rewrite the constraints as (I ⊗ V) vec(B) = vec(D) and (I ⊗ W) vec(B) ≤ vec(E).
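The vec-Kronecker identity is easy to confirm numerically; the following MATLAB lines (our illustration, with arbitrary small dimensions) check that vec(XB) = (I ⊗ X) vec(B).

    % Numerical check of vec(X*B) = (I kron X) * vec(B).
    n = 5; p = 3; d = 2;
    X = randn(n, p);  B = randn(p, d);
    lhs = reshape(X * B, [], 1);                 % vec(X*B)
    rhs = kron(eye(d), X) * reshape(B, [], 1);   % (I kron X) vec(B)
    disp(max(abs(lhs - rhs)))                    % zero up to roundoff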

As an illustration, consider the classical concave regression problem (Hildreth 1954). The data consist of a scatterplot (xi, yi) of n points with associated weights wi and predictors xi arranged in increasing order. The concave regression problem seeks the estimates θi that minimize the weighted sum of squares

\sum_{i=1}^n w_i (y_i - \theta_i)^2    (3)

subject to the concavity constraints

\frac{\theta_i - \theta_{i-1}}{x_i - x_{i-1}} \ge \frac{\theta_{i+1} - \theta_i}{x_{i+1} - x_i}, \qquad i = 2, \ldots, n-1.    (4)

The consistency of concave regression is proved by Hanson and Pledger (1976); the asymptotic distribution of the estimates and their rate of convergence are studied in subsequent articles (Mammen 1991; Groeneboom, Jongbloed, and Wellner 2001). Figure 1 shows a scatterplot of 100 data points. Here, the xi are uniformly sampled from the interval [0,1], the weights are constant, and yi = 4xi(1 – xi) + εi, where the εi are iid normal with mean 0 and standard deviation σ = 0.3. The left panel of Figure 1 gives four snapshots of the solution path. The original data points θ̂i = yi provide the unconstrained estimates. The solid line shows the concavity-constrained solution. The dotted and dashed lines represent intermediate solutions between the unconstrained and the constrained solutions. The degrees of freedom formula derived in Section 6 is a vehicle for model selection based on criteria such as Cp, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC). For example, the Cp statistic

C_p(\hat{\theta}) = \frac{1}{n} \| y - \hat{\theta} \|_2^2 + \frac{2}{n} \sigma^2 \mathrm{df}(\hat{\theta})

is an unbiased estimator of the true prediction error (Efron 2004) under the estimator θ^ whenever an unbiased estimate of the degrees of freedom is used. The right panel shows the Cp statistic along the solution path. In this example, the design matrix is a diagonal matrix. After submitting this article, we learned that Tibshirani, Hoefling, and Tibshirani (2011) solved a similar convex regression problem by a path algorithm. As we will see in Section 7, postulating a more general design matrix or other kinds of constraints broadens the scope of applications of the path algorithm and the estimated degrees of freedom.
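For readers who want to reproduce a fit like the one in Figure 1, the following MATLAB sketch (ours, not the article's supplementary code) simulates the data, assembles the concavity constraints (4) in the form Wθ ≤ 0, and computes the fully constrained endpoint with quadprog from the Optimization Toolbox; the path algorithm of Section 3 supplies the intermediate solutions as well.

    % Simulated concave regression data and the constrained least-squares fit.
    n = 100;
    x = sort(rand(n, 1));                          % predictors on [0, 1]
    y = 4 * x .* (1 - x) + 0.3 * randn(n, 1);      % y_i = 4 x_i (1 - x_i) + noise
    W = zeros(n - 2, n);                           % concavity constraints (4): W*theta <= 0
    for i = 2:(n - 1)
        h1 = x(i) - x(i - 1);  h2 = x(i + 1) - x(i);
        W(i - 1, i - 1) =  1 / h1;
        W(i - 1, i)     = -1 / h1 - 1 / h2;
        W(i - 1, i + 1) =  1 / h2;
    end
    theta = quadprog(eye(n), -y, W, zeros(n - 2, 1));   % constrained endpoint of the path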

Figure 1. Path solutions to the concave regression problem. Left: the unconstrained solution (original data points), two intermediate solutions (dotted and dashed lines), and the concavity-constrained solution (solid line). Right: the Cp statistic as a function of the penalty constant ρ along the solution path. The online version of this figure is in color.

Here is a road map to the remainder of the current article. Section 2 reviews the exact penalty method for optimization and clarifies the connections between constrained optimization and regularization in statistics. Section 3 derives in detail our path algorithm. Its implementation via the sweep operator and the QR decomposition is described in Sections 4 and 5. Section 6 derives the degrees of freedom formula. Section 7 presents various numerical examples. Finally, Section 8 discusses the limitations of the path algorithm and hints at future generalizations.

2. THE EXACT PENALTY METHOD

Exact penalty methods minimize the function

E_\rho(x) = f(x) + \rho \sum_{i=1}^r |g_i(x)| + \rho \sum_{j=1}^s \max\{0, h_j(x)\},

where f(x) is the objective function, gi(x) = 0 is one of r equality constraints, and hj(x) ≤ 0 is one of s inequality constraints. It is interesting to compare this function with the Lagrangian function

L(x) = f(x) + \sum_{i=1}^r \lambda_i g_i(x) + \sum_{j=1}^s \mu_j h_j(x)

that captures the behavior of f(x) at a constrained local minimum y. By definition, the Lagrange multipliers satisfy the conditions ∇L(y) = 0, μj ≥ 0, and μjhj(y) = 0 for all j. In the exact penalty method, one takes

\rho > \max\{ |\lambda_1|, \ldots, |\lambda_r|, \mu_1, \ldots, \mu_s \}.    (5)

This choice creates the majorization f(x) ≤ Eρ(x), with f(z) = Eρ(z) at any feasible point z. Thus, minimizing Eρ(x) forces f(x) downhill. Much more than this is going on, however. As the next proposition proves, minimizing Eρ(x) effectively minimizes f(x) subject to the constraints.

Proposition 1. Suppose the objective function f(x) and the constraint functions are twice differentiable and satisfy the Lagrange multiplier rule at the local minimum y. If inequality (5) holds and v^t d^2 L(y) v > 0 for every vector v ≠ 0 satisfying dgi(y)v = 0 and dhj(y)v ≤ 0 for all active inequality constraints, then y furnishes an unconstrained local minimum of Eρ(x). If f(x) is convex, the gi(x) are affine, the hj(x) are convex, and Slater's constraint qualification holds, then y is a minimum of Eρ(x) if and only if y is a minimum of f(x) subject to the constraints. In this convex programming context, no differentiability assumptions are needed.

Proof. The conditions imposed on the quadratic form v^t d^2 L(y) v > 0 are well-known sufficient conditions for a local minimum. Theorems 6.9 and 7.21 of Ruszczyński (2006) prove all of the foregoing assertions.
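A one-dimensional example of our own devising illustrates the proposition. Minimize f(x) = ½(x − 2)² subject to h(x) = x − 1 ≤ 0. The constrained minimum sits at y = 1 with multiplier μ = 1, since (y − 2) + μ = 0. The exact penalty function is Eρ(x) = ½(x − 2)² + ρ max{0, x − 1}. For ρ < 1, its minimizer is x(ρ) = 2 − ρ, which still violates the constraint; for every ρ > μ = 1, the minimizer is exactly the constrained solution x(ρ) = 1. The solution path therefore slides linearly from the unconstrained minimum 2 onto the constraint and sticks there at the finite value ρ = 1, whereas the quadratic penalty method reaches the constraint only as ρ tends to ∞.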

3. THE PATH-FOLLOWING ALGORITHM

In the quadratic programming context with objective function (1), affine equality constraints Vx = d, and affine inequality constraints Wx ≤ e, the penalized objective function takes the form

E_\rho(x) = \frac{1}{2} x^t A x + b^t x + c + \rho \sum_{i=1}^r | v_i^t x - d_i | + \rho \sum_{j=1}^s ( w_j^t x - e_j )_+ .    (6)

Our assumptions on A render Eρ(x) strictly convex and coercive and guarantee a unique minimum point x(ρ). The generalized Lasso problem studied by Tibshirani and Taylor (2011) drops the last term and consequently excludes inequality-constrained applications.

According to the rules of the convex calculus (Ruszczyński 2006), the unique optimal point x(ρ) of the function Eρ(x) is characterized by the stationarity condition

0 = A x(\rho) + b + \rho \sum_{i=1}^r s_i(\rho) v_i + \rho \sum_{j=1}^s t_j(\rho) w_j,    (7)

with coefficients

s_i(\rho) \in \begin{cases} \{-1\} & v_i^t x(\rho) - d_i < 0, \\ [-1, 1] & v_i^t x(\rho) - d_i = 0, \\ \{1\} & v_i^t x(\rho) - d_i > 0, \end{cases} \qquad t_j(\rho) \in \begin{cases} \{0\} & w_j^t x(\rho) - e_j < 0, \\ [0, 1] & w_j^t x(\rho) - e_j = 0, \\ \{1\} & w_j^t x(\rho) - e_j > 0. \end{cases}    (8)

Assuming that the vectors {v_i} ∪ {w_j} are linearly independent, the coefficients si(ρ) and tj(ρ) are uniquely determined. The sets defining the possible values of si(ρ) and tj(ρ) are the subdifferentials of the absolute value function and the hinge function r ↦ max{0, r}, evaluated at the corresponding constraint residuals. The coefficients si and tj appear as the dual variables in the dual path algorithm of Tibshirani and Taylor (2011). We now prove that the solution and coefficient paths are continuous.

Proposition 2. If A is positive definite and the vectors {v_i} ∪ {w_j} are linearly independent, then the solution path x(ρ) and the coefficient paths s(ρ) and t(ρ) are unique and continuous.

Proof. The representation

x(\rho) = -A^{-1} \Big( b + \rho \sum_{i=1}^r s_i(\rho) v_i + \rho \sum_{j=1}^s t_j(\rho) w_j \Big)

entails the norm inequality

\| x(\rho) \| \le \| A^{-1} \| \Big( \| b \| + \rho \sum_{i=1}^r \| v_i \| + \rho \sum_{j=1}^s \| w_j \| \Big).

Thus, the solution vector x(ρ) is bounded whenever ρ ≥ 0 is bounded above. To prove continuity, suppose that it fails for a given ρ. Then, there exists an ε > 0 and a sequence ρn tending to ρ such that ‖x(ρn) – x(ρ)‖ ≥ ε for all n. Since x(ρn) is bounded, we can pass to a subsequence if necessary and assume that x(ρn) converges to some point y. Taking limits in the inequality Eρn[x(ρn)] ≤ Eρn(x) demonstrates that Eρ(y) ≤ Eρ(x) for all x. Because x(ρ) is unique, we reach the contradictory conclusions ‖y – x(ρ)‖ ≥ ε and y = x(ρ). Continuity is inherited by the coefficients si(ρ) and tj(ρ). Indeed, let V and W be the matrices with rows v_i^t and w_j^t, and let U be the stacked matrix obtained by placing V above W. The stationarity condition can be restated as

0 = A x(\rho) + b + \rho\, U^t \begin{pmatrix} s(\rho) \\ t(\rho) \end{pmatrix}.

Multiplying this equation by U and solving give

\rho \begin{pmatrix} s(\rho) \\ t(\rho) \end{pmatrix} = -(U U^t)^{-1} U [ A x(\rho) + b ],    (9)

and the continuity of the left-hand side follows from the continuity of x(ρ). Finally, dividing by ρ yields the continuity of the coefficients si(ρ) and tj(ρ) for ρ > 0.

Positive definiteness of A is not required for the uniqueness of x(ρ). The penalized objective function (6) may have a unique minimum for large ρ even when A is not positive definite. In our subsequent derivation of the path algorithm, we will also observe that the uniqueness of the coefficient paths s(ρ) and t(ρ) only requires linear independence of the active constraints along the solution path. In this and the next section, we assume that A is positive definite and that all constraint vectors vi and wj are linearly independent. In Section 5, we discuss extensions of the path algorithm where the first restriction is relaxed.

We next show that the solution path is piecewise linear. Along the path, we keep track of the following index sets determined by the constraint residuals:

N_E = \{ i : v_i^t x - d_i < 0 \}, \quad Z_E = \{ i : v_i^t x - d_i = 0 \}, \quad P_E = \{ i : v_i^t x - d_i > 0 \},
N_I = \{ j : w_j^t x - e_j < 0 \}, \quad Z_I = \{ j : w_j^t x - e_j = 0 \}, \quad P_I = \{ j : w_j^t x - e_j > 0 \}.

We drop the argument ρ from x(ρ) whenever notationally convenient. The reader should keep in mind that these index sets are functions of ρ as well. For the sake of simplicity, assume that at the beginning of the current segment, si does not equal –1 or 1 when i ∈ Z_E and tj does not equal 0 or 1 when j ∈ Z_I. In other words, the coefficients of the active constraints occur in the interior of their subdifferentials. Let us show in this circumstance that the solution path can be extended in a linear fashion. The general idea is to impose the equality constraints V_{Z_E} x = d_{Z_E} and W_{Z_I} x = e_{Z_I} and write the objective function Eρ(x) as

\frac{1}{2} x^t A x + b^t x + c - \rho \sum_{i \in N_E} (v_i^t x - d_i) + \rho \sum_{i \in P_E} (v_i^t x - d_i) + \rho \sum_{j \in P_I} (w_j^t x - e_j).

For notational convenience, define

U_Z = \begin{pmatrix} V_{Z_E} \\ W_{Z_I} \end{pmatrix}, \qquad c_Z = \begin{pmatrix} d_{Z_E} \\ e_{Z_I} \end{pmatrix}, \qquad u_Z = -\sum_{i \in N_E} v_i + \sum_{i \in P_E} v_i + \sum_{j \in P_I} w_j.

Minimizing Eρ(x) subject to the constraints generates the Lagrange multiplier problem

\begin{pmatrix} A & U_Z^t \\ U_Z & 0 \end{pmatrix} \begin{pmatrix} x \\ \lambda_Z \end{pmatrix} = \begin{pmatrix} -b - \rho u_Z \\ c_Z \end{pmatrix},    (10)

with the explicit path solution and Lagrange multipliers

x(\rho) = -P(b + \rho u_Z) + Q c_Z = -\rho P u_Z - P b + Q c_Z,    (11)
\lambda_Z = -Q^t b + R c_Z - \rho Q^t u_Z.    (12)

Here,

\begin{pmatrix} P & Q \\ Q^t & R \end{pmatrix} = \begin{pmatrix} A & U_Z^t \\ U_Z & 0 \end{pmatrix}^{-1},

with

P = A^{-1} - A^{-1} U_Z^t (U_Z A^{-1} U_Z^t)^{-1} U_Z A^{-1}, \qquad Q = A^{-1} U_Z^t (U_Z A^{-1} U_Z^t)^{-1}, \qquad R = -(U_Z A^{-1} U_Z^t)^{-1}.

As we will see in the next section, these seemingly complicated objects arise naturally if path following is organized around the sweep operator.
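The block identities are easy to verify numerically. The short MATLAB check below (our illustration, with arbitrary random data) inverts the bordered matrix and compares its blocks with the explicit formulas for P, Q, and R.

    % Verify the explicit formulas for P, Q, and R on random data.
    m = 5; q = 2;
    A  = randn(m); A = A' * A + eye(m);            % a positive definite A
    UZ = randn(q, m);                              % active constraint matrix U_Z
    M  = inv([A, UZ'; UZ, zeros(q)]);              % the bordered inverse
    P  = M(1:m, 1:m);  Q = M(1:m, m+1:end);  R = M(m+1:end, m+1:end);
    S  = UZ * (A \ UZ');                           % U_Z A^{-1} U_Z^t
    disp(norm(P - (inv(A) - (A \ UZ') * (S \ (UZ / A))), 'fro'))   % ~0
    disp(norm(Q - (A \ UZ') / S, 'fro'))                           % ~0
    disp(norm(R + inv(S), 'fro'))                                  % ~0, so R = -S^{-1}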

It is clear that as we increase ρ, the solution path (11) and the multiplier path (12) change in a linear fashion until either an inactive constraint becomes active or the coefficient of an active constraint hits the boundary of its subdifferential. We investigate the first case first. Imagining ρ to be a time parameter, an inactive constraint i ∈ N_E ∪ P_E becomes active when

v_i^t x(\rho) = -v_i^t P (b + \rho u_Z) + v_i^t Q c_Z = d_i.

If this event occurs, it occurs at the hitting time

\rho(i) = \frac{-v_i^t P b + v_i^t Q c_Z - d_i}{v_i^t P u_Z}.    (13)

Similarly, an inactive constraint j ∈ N_I ∪ P_I becomes active at the hitting time

\rho(j) = \frac{-w_j^t P b + w_j^t Q c_Z - e_j}{w_j^t P u_Z}.    (14)

To determine the escape time for an active constraint, consider once again the stationarity condition (7). The Lagrange multiplier corresponding to an active constraint coincides with a product ρsi(ρ) or ρtj(ρ). Therefore, if we collect the coefficients for the active constraints into the vector rZ(ρ), then Equation (12) implies

r_Z(\rho) = \frac{1}{\rho} \lambda_Z(\rho) = \frac{1}{\rho} (-Q^t b + R c_Z) - Q^t u_Z.    (15)

Formula (15) for rZ(ρ) can be rewritten in terms of the value rZ(ρ0) at the start ρ0 of the current segment as

r_Z(\rho) = \frac{\rho_0}{\rho} r_Z(\rho_0) - \Big( 1 - \frac{\rho_0}{\rho} \Big) Q^t u_Z.    (16)

It is clear that [r_Z(ρ)]_i is increasing in ρ when [r_Z(ρ_0) + Q^t u_Z]_i < 0 and decreasing in ρ when the reverse is true. The coefficient of an active constraint i ∈ Z_E escapes at either of the times

\rho(i) = \frac{[-Q^t b + R c_Z]_i}{[Q^t u_Z]_i - 1} \qquad \text{or} \qquad \frac{[-Q^t b + R c_Z]_i}{[Q^t u_Z]_i + 1},

whichever is pertinent. Similarly, the coefficient of an active constraint j ∈ Z_I escapes at either of the times

\rho(j) = \frac{[-Q^t b + R c_Z]_j}{[Q^t u_Z]_j} \qquad \text{or} \qquad \frac{[-Q^t b + R c_Z]_j}{[Q^t u_Z]_j + 1},

whichever is pertinent. The earliest hitting time or escape time over all constraints determines the duration of the current linear segment.

At the end of the current segment, our assumption that all active coefficients occur in the interior of their subdifferentials is actually violated. When the hitting time for an inactive constraint occurs first, we move the constraint to the appropriate active set ZE or ZI and keep the other constraints in place. Similarly, when the escape time for an active constraint occurs first, we move the constraint to the appropriate inactive set and keep the other constraints in place. In the second scenario, if si hits the value –1, then we move i to NE. If si hits the value 1, then we move i to PE. Similar comments apply when a coefficient tj hits 0 or 1. Once this move is executed, we commence a new linear segment as just described. The path-following algorithm continues segment by segment until for sufficiently large ρ, the sets NE, PE, and PI are exhausted, uZ=0, and the solution vector (11) stabilizes.

This description omits two details. First, to get the process started, we set ρ = 0 and x(0) = –A–1b. In other words, we start at the unconstrained minimum. For inactive constraints, the coefficients si(0) and tj(0) are fixed. However, for active constraints, it is unclear how to assign the coefficients and whether to release the constraints from active status as ρ increases. Second, very rarely, some of the hitting times and escape times will coincide. We are then faced again with the problem of which of the active constraints, with coefficients on their subdifferential boundaries, to keep active and which to encourage to go inactive in the next segment. In practice, the first problem can easily occur. Roundoff error typically keeps the second problem at bay.

In both anomalous cases, the status of each active constraint can be resolved by trying all possibilities. Consider the second case first. If there are a currently active constraints parked at their subdifferential boundaries, then there are 2^a possible configurations for their active–inactive states in the next segment. For a given configuration, we can exploit formula (15) to check whether the coefficient for an active constraint occurs in its subdifferential. If the coefficient occurs on the boundary of its subdifferential, then we can use representation (16) to check whether it is headed into the interior of the subdifferential as ρ increases. Since the path and its coefficients are unique, one and only one configuration should determine the next linear segment. At the start of the path algorithm, the correct configuration also determines the initial values of the active coefficients. If we take limits in Equation (15) as ρ tends to 0, then the coefficients will escape their subdifferentials unless −Q^t b + R c_Z = 0 and all components of −Q^t u_Z lie in their appropriate subdifferentials. Hence, again it is easy to decide on the active set Z going forward from ρ = 0. One could object that the number of configurations 2^a is potentially very large, but, in practice, this combinatorial bottleneck never occurs. Visiting the various configurations can be viewed as a systematic walk through the subsets of {1, . . . , a} and organized using a classical Gray code (Savage 1997) that deletes at most one element and adjoins at most one element as one passes from one active subset to the next. As we will see in the next section, adjoining an element corresponds to sweeping a diagonal entry of a tableau and deleting an element corresponds to inverse sweeping a diagonal entry of the same tableau.

When a is large, a more economical solution is to minimize the penalized objective function (6) at ρ + ε for ε small using any unconstrained optimizer for nonsmooth problems. Reasonable choices include the proximal gradient method (Chen et al. 2010), Nesterov's method (Liu, Yuan, and Ye 2010), and coordinate descent after reparameterization (Friedman et al. 2007; Wu and Lange 2008). The solution initializes the set configuration at time ρ + ε in anticipation of the resumption of path following.

4. THE PATH ALGORITHM AND SWEEPING

Implementation of the path algorithm can be conveniently organized around the sweep and inverse sweep operators of regression analysis (Dempster 1969; Jennrich 1977; Goodnight 1979; Little and Rubin 2002; Lange 2010). We first recall the definition and basic properties of the sweep operator. Suppose A is an m × m symmetric matrix. Sweeping on the kth diagonal entry akk ≠ 0 of A yields a new symmetric matrix A^ with entries

\hat{a}_{kk} = -\frac{1}{a_{kk}}, \qquad \hat{a}_{ik} = \frac{a_{ik}}{a_{kk}}, \; i \ne k, \qquad \hat{a}_{kj} = \frac{a_{kj}}{a_{kk}}, \; j \ne k, \qquad \hat{a}_{ij} = a_{ij} - \frac{a_{ik} a_{kj}}{a_{kk}}, \; i, j \ne k.

These arithmetic operations can be undone by inverse sweeping on the same diagonal entry. Inverse sweeping sends the symmetric matrix A into the symmetric matrix Ă with entries

\check{a}_{kk} = -\frac{1}{a_{kk}}, \qquad \check{a}_{ik} = -\frac{a_{ik}}{a_{kk}}, \; i \ne k, \qquad \check{a}_{kj} = -\frac{a_{kj}}{a_{kk}}, \; j \ne k, \qquad \check{a}_{ij} = a_{ij} - \frac{a_{ik} a_{kj}}{a_{kk}}, \; i, j \ne k.

Both sweeping and inverse sweeping preserve symmetry. Thus, all operations can be carried out on either the lower or the upper triangle of A alone, saving both computational time and storage. When several sweeps or inverse sweeps are performed, their order is irrelevant. Finally, a symmetric matrix A is positive definite if and only if A can be completely swept, and all of its diagonal entries remain positive until swept. Complete sweeping produces −A^{−1}. Each sweep of a positive definite matrix reduces the magnitude of the unswept diagonal entries. Positive definite matrices with poor condition numbers can be detected by monitoring the relative magnitude of each diagonal entry just prior to sweeping.
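A direct MATLAB translation of these definitions is short. The function below (our own sketch, saved as sweepk.m, not part of the article's software) performs a sweep, or an inverse sweep when the third argument is true, on the kth diagonal entry. Sweeping every diagonal entry of a positive definite matrix in any order returns −A^{−1}, which is how the tableau introduced next produces the unconstrained solution.

    function A = sweepk(A, k, inverse)
    % Sweep (or inverse sweep) the symmetric matrix A on its k-th diagonal entry.
    if nargin < 3, inverse = false; end
    p = A(k, k);                                 % pivot, assumed nonzero
    idx = [1:k-1, k+1:size(A, 1)];               % all indices except k
    A(idx, idx) = A(idx, idx) - A(idx, k) * A(k, idx) / p;
    s = 1; if inverse, s = -1; end               % inverse sweeping flips these two signs
    A(idx, k) = s * A(idx, k) / p;
    A(k, idx) = s * A(k, idx) / p;
    A(k, k) = -1 / p;
    end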

At the start of path following, we initialize a path tableau with block entries

[Path tableau, displayed as an image in the original.]    (17)

The starred blocks here are determined by symmetry. Sweeping the diagonal entries of the upper-left block –A of the tableau yields

[Tableau after sweeping the diagonal entries of −A, displayed as an image in the original.]

The new tableau contains the unconstrained solution x(0) = −A^{−1}b and the corresponding constraint residuals −UA^{−1}b − c. In path following, we adopt our previous notation and divide the original tableau into subblocks. The result

[Path tableau partitioned into active and inactive blocks, displayed as an image in the original.]    (18)

highlights the active and inactive constraints. If we continue sweeping until all diagonal entries of the upper-left quadrant of this version of the tableau are swept, then the tableau becomes

[Fully swept tableau, displayed as an image in the original.]

All of the required elements for the path algorithm now magically appear.

Given the next ρ, the solution vector x(ρ) appearing in Equation (11) requires the sum −Pb + Qc_Z, which occurs in the revised tableau, and the vector Pu_Z. If r_Z̄ denotes the coefficient vector for the inactive constraints, with entries of −1 for constraints in N_E, 0 for constraints in N_I, and 1 for constraints in P_E ∪ P_I, then Pu_Z = PU_Z̄^t r_Z̄. Fortunately, PU_Z̄^t appears in the revised tableau. The update of ρ depends on the hitting times (13) and (14). These in turn depend on the numerators −v_i^t Pb + v_i^t Qc_Z − d_i and −w_j^t Pb + w_j^t Qc_Z − e_j, which occur as components of the vector U_Z̄(−Pb + Qc_Z) − c_Z̄, and the denominators v_i^t Pu_Z and w_j^t Pu_Z, which occur as components of the vector U_Z̄ P U_Z̄^t r_Z̄ computable from the block U_Z̄ P U_Z̄^t of the tableau. The escape times for the active constraints also determine the update of ρ. According to Equation (16), the escape times depend on the current coefficient vector, the current value ρ0 of ρ, and the vector Q^t u_Z = Q^t U_Z̄^t r_Z̄, which can be computed from the block Q^t U_Z̄^t of the tableau. Thus, the revised tableau supplies all of the ingredients for path following. Algorithm 1 outlines the steps for path following, ignoring the anomalous situations.

Algorithm 1.

Solution path of the primal problem (6) when A is positive definite.

Initialize k = 0, ρ0 = 0, and the path tableau (17). Sweep the diagonal entries of –A.
Enter the main loop.
repeat
    Increment k by 1.
    Compute the hitting time or exit time ρ(i) for each constraint i.
    Set ρk = min{ρ(i) : ρ(i) > ρk–1}.
    Update the coefficient vector by Equation (16).
    Sweep the diagonal entry of the inactive constraint that becomes active or inverse sweep the diagonal entry of the active constraint that becomes inactive.
    Update the solution vector xk = x(ρk) by Equation (11).
until N_E = P_E = P_I = ∅.

The ingredients for handling the anomalous situations can also be read from the path tableau. The initial coefficients r_Z(0) = −Q^t u_Z = −Q^t U_Z̄^t r_Z̄ are available once we sweep the tableau (17) on the diagonal entries corresponding to the constraints in Z at the starting point x(0) = −A^{−1}b. As noted earlier, if the coefficients of several active constraints are simultaneously poised to exit their subdifferentials, then one must consider all possible swept and unswept combinations of these constraints. The operative criteria for choosing the right combination involve the available quantities Q^t u_Z and −Q^t b + R c_Z. One of the sweeping combinations is bound to give a correct direction for the next extension of the path.

The computational complexity of path following depends on the number of parameters m and the number of constraints n = r + s. Computation of the initial solution −A^{−1}b takes about 3m^3 floating point operations (flops). There is no need to store or update the P block during path following. The remaining sweeps and inverse sweeps take on the order of n(m + n) flops each. This count must be multiplied by the number of segments along the path, which empirically is on the order of O(n) for the small examples tried in this article. The sweep tableau requires storing (m + n)^2 real numbers. We recommend all computations be done in double precision. Both flop counts and storage can be halved by exploiting symmetry. Finally, it is worth mentioning some computational shortcuts for the multitask learning model. Among these are the formulas

(I \otimes X)^t (I \otimes X) = I \otimes X^t X, \qquad (I \otimes X^t X)^{-1} = I \otimes (X^t X)^{-1}, \qquad (I \otimes X^t X)^{-1} (I \otimes V)^t = I \otimes (X^t X)^{-1} V^t, \qquad (I \otimes X^t X)^{-1} (I \otimes W)^t = I \otimes (X^t X)^{-1} W^t.
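These identities follow from the mixed-product rule for Kronecker products. The following MATLAB lines (our check, with arbitrary small dimensions) confirm the first two.

    % Numerical check of two of the Kronecker shortcuts.
    n = 6; p = 3; d = 2;
    X = randn(n, p);
    disp(norm(kron(eye(d), X)' * kron(eye(d), X) - kron(eye(d), X' * X), 'fro'))   % ~0
    disp(norm(inv(kron(eye(d), X' * X)) - kron(eye(d), inv(X' * X)), 'fro'))       % ~0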

5. EXTENSIONS OF THE PATH ALGORITHM

As just presented, the path algorithm starts from the unconstrained solution and moves forward along the path to the constrained solution. With minor modifications, the same algorithm can start in the middle of the path or move in the reverse direction along it. The latter tactic proves useful in Lasso and fused-Lasso problems, where the fully constrained solution is trivial. In general, consider starting from x(ρ0) at a point ρ0 on the path. Let Z = Z_E ∪ Z_I continue to denote the zero set for the segment containing ρ0. Path following begins by sweeping the upper-left block of the tableau (18) and then proceeds as indicated in Algorithm 1. Traveling in the reverse direction entails calculation of hitting and exit times for decreasing ρ rather than increasing ρ.

Two assumptions limit the applications of Algorithm 1. The assumption that A is positive definite automatically excludes underdetermined statistical problems with more parameters than cases. The linear independence assumption on constraint vectors vi and wj precludes certain regularization problems, such as the sparse fused Lasso and the two- or higher-dimensional fused Lasso. In this section, we indicate how to carry out the exact penalty method when positive definiteness of A fails and the sweep operator cannot be brought into play. Relaxation of the second restriction is more subtle and we briefly discuss the difficulties.

In the absence of constraints, f(x) lacks a minimum if and only if either A has a negative eigenvalue or the equation Ax = −b has no solution. In either circumstance, a unique global minimum may exist if enough constraints are enforced. Suppose x(ρ0) supplies the minimum of the exact penalty function Eρ(x) at ρ = ρ0 > 0. Let the matrix U_Z summarize the active constraint vectors. As we slide along the active constraints, the minimum point can be represented as x(ρ) = x(ρ0) + Y y(ρ), where the columns of Y are orthogonal to the rows of U_Z. One can construct Y by the Gram–Schmidt process; Y is then the orthogonal complement of U_Z furnished by the QR decomposition. The active constraints hold in view of the identity U_Z x(ρ) = U_Z x(ρ0) = c_Z.

The analog of the stationarity condition (7) under reparameterization is

0 = Y^t A Y\, y(\rho) + Y^t b + \rho\, Y^t u_Z.    (19)

The active constraints do not appear in this equation because v_i^t Y = 0 and w_j^t Y = 0 for i or j active. Solving for y(ρ) and x(ρ) gives

y(\rho) = -(Y^t A Y)^{-1} (Y^t b + \rho Y^t u_Z), \qquad x(\rho) = x(\rho_0) - Y (Y^t A Y)^{-1} (Y^t b + \rho Y^t u_Z),    (20)

and does not require inverting A. Because the solution x(ρ) is affine in ρ, it is straightforward to calculate the hitting times for the inactive constraints.
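A minimal MATLAB sketch of the reparameterization (ours, with arbitrary stand-in data) builds Y from the null space of U_Z, applies Equation (20), and confirms that the active constraints remain satisfied without ever inverting A.

    % Reduced parameterization: Y spans the null space of U_Z.
    m = 6; q = 2; rho = 1.5;
    A  = randn(m); A = A' * A;                  % positive semidefinite suffices here
    b  = randn(m, 1);  uZ = randn(m, 1);        % stand-ins for b and the direction u_Z
    UZ = randn(q, m);  cZ = randn(q, 1);        % active constraints U_Z x = c_Z
    x0 = UZ \ cZ;                               % one point satisfying the active constraints
    Y  = null(UZ);                              % columns orthogonal to the rows of U_Z
    yv = -(Y' * A * Y) \ (Y' * b + rho * (Y' * uZ));   % equation (20)
    x  = x0 + Y * yv;
    disp(norm(UZ * x - cZ))                     % ~0: the active constraints still hold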

Under the original parameterization, the Lagrange multipliers and corresponding active coefficients appearing in the stationarity condition (7) can still be recovered by invoking Equation (9). Again it is a simple matter to calculate exit times. The formulas are not quite as elegant as those based on the sweep operator, but all essential elements for traversing the path are available. Adding or deleting a row of the matrix UZ can be accomplished by updating the QR decomposition. The fast algorithms for this purpose simultaneously update Y (Lawson and Hanson 1987; Nocedal and Wright 2006). More generally, for equality-constrained problems generated by the Lasso and generalized Lasso, the constraint matrix UZ, as one approaches the penalized solution, is often very sparse. Computation of the QR decomposition from scratch is then numerically cheap.

When the active constraint vectors are linearly dependent, U_Z does not have full row rank. This causes problems if one determines path coefficients via Equation (9). Replacing the inverse (U_Z U_Z^t)^{−1} by the Moore–Penrose pseudoinverse (U_Z U_Z^t)^+ yields the coefficient vector r_Z(ρ) = (s_Z(ρ)^t, t_Z(ρ)^t)^t with minimal l2 norm (Magnus and Neudecker 1999). However, exit times predicated on this version of the coefficient vector are inappropriate because, at the predicted exit time, there could exist another version of the coefficient vector r_Z lying in the interior of the permissible range (8) with a larger l2 norm. The set defined by the subdifferential constraints on the active coefficients is a convex polytope (a compact and polyhedral set). Its image under matrix multiplication by ρU_Z^t is also a convex polytope. Thus, the exit time for the active constraints is the maximum ρ going forward for which –Ax(ρ) – b remains in the image polytope, which unfortunately is hard to determine. The dual approach taken by Tibshirani and Taylor (2011) seems somehow to circumvent the difficulty posed by naive application of the pseudoinverse solution. In practice, the whole issue can be simply resolved by computing the solution at a nearby future time ρ + ε using any unconstrained nonsmooth optimizer. Path following should then recommence along the direction β(ρ + ε) – β(ρ).

6. DEGREES OF FREEDOM UNDER AFFINE CONSTRAINTS

We now specialize to the least-square problem with the choices A = Xt X, b = –Xt y, and x(ρ)=β^(ρ), and consider how to define degrees of freedom in the presence of both equality and inequality constraints. As previous authors (Efron et al. 2004; Zou, Hastie, and Tibshirani 2007; Tibshirani and Taylor 2011) showed, the most productive approach relies on Stein's characterization (Stein 1981; Efron 2004)

\mathrm{df}(\hat{y}) = E\Big( \sum_{i=1}^n \frac{\partial \hat{y}_i}{\partial y_i} \Big) = E[\mathrm{tr}(d_y \hat{y})]

of the degrees of freedom. Here, y^=Xβ^ is the fitted value of y, and dyŷ denotes its differential with respect to the entries of y. Equation (11) implies that

\hat{y} = X \hat{\beta} = X P X^t y + X Q c_Z - \rho X P u_Z.

Because ρ is fixed, it follows that dyŷ = XPXt. The representation

X P X^t = X (X^t X)^{-1} X^t - X (X^t X)^{-1} U_Z^t [ U_Z (X^t X)^{-1} U_Z^t ]^{-1} U_Z (X^t X)^{-1} X^t = P_1 - P_2

and the cyclic permutation property of the trace function applied to the projection matrices P1 and P2 yield the formula

E[\mathrm{tr}(d_y \hat{y})] = m - E(|Z|),

where m equals the number of parameters. In other words, m − |Z| is an unbiased estimator of the degrees of freedom. This result obviously depends on our assumptions that X has full column rank m and the constraints vi and wj are linearly independent. The latter condition is true for Lasso and one-dimensional fused-Lasso problems. The validity of Stein's formula requires the fitted value ŷ to be a continuous and almost differentiable function of y for almost every y (Stein 1981). Fortunately, this is the case for Lasso (Zou, Hastie, and Tibshirani 2007) and generalized Lasso problems (Tibshirani and Taylor 2011), and for at least one case of shape-restricted regression (Meyer and Woodroofe 2000). The derivation does not depend directly on whether the constraints are equality or inequality constraints. Hence, the degrees of freedom estimator can be applied in shape-restricted regression using model selection criteria, such as Cp, AIC, and BIC, along the whole path. The concave regression example in Section 1 illustrates the general idea.
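As a small illustration (our sketch, reusing the variables n, y, W, and theta from the concave-regression sketch in Section 1 and the known σ = 0.3 of that simulation), the effective degrees of freedom are the number of parameters minus the number of active constraints, and the Cp statistic of Section 1 follows directly.

    % Unbiased degrees-of-freedom estimate and the Cp statistic for a concave fit.
    sigma2 = 0.3^2;                            % known noise variance in the simulation
    active = abs(W * theta) < 1e-6;            % concavity constraints met with equality
    df = n - sum(active);                      % m - |Z|, with m = n for this diagonal design
    Cp = norm(y - theta)^2 / n + 2 * sigma2 * df / n;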

7. EXAMPLES

Our examples illustrate both the mechanics and the potential of path following. The path algorithm's ability to handle inequality constraints allows us to obtain path solutions to a variety of shape-restricted regressions. Problems of this sort may well dominate the future agenda of nonparametric estimation.

7.1 Two Toy Examples

Our first example (Lawson and Hanson 1987) fits a straight line y = β0 + β1x to the data points (0.25,0.5), (0.5,0.6), (0.5,0.7), and (0.8,1.2) by minimizing the least-square criterion ‖y − Xβ‖_2^2 subject to the constraints

β0 ≥ 0, β1 ≥ 0, β0 + β1 ≤ 1.

In our notation,

A = X^t X = \begin{pmatrix} 4.0000 & 2.0500 \\ 2.0500 & 1.2025 \end{pmatrix}, \qquad b = -X^t y = \begin{pmatrix} -3.0000 \\ -1.7350 \end{pmatrix}, \qquad W = \begin{pmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 1 \end{pmatrix}, \qquad e = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.

The initial tableau is

[Initial path tableau for this example, displayed as an image in the original.]

Sweeping the first two diagonal entries produces

[Tableau after sweeping the first two diagonal entries, displayed as an image in the original.]

from which we read off the unconstrained solution β(0) = (0.0835, 1.3004)^t and the constraint residuals (–0.0835, –1.3004, 0.3840)^t. The latter indicates that N_I = {1, 2}, Z_I = ∅, and P_I = {3}. Multiplying the middle block matrix by the coefficient vector r = (0, 0, 1)^t and dividing the residual vector entrywise give the hitting times ρ = (–0.0599, 0.4051, 0.2116). Thus, ρ1 = 0.2116 and

\beta(0.2116) = \begin{pmatrix} 0.0835 \\ 1.3004 \end{pmatrix} - 0.2116 \times \begin{pmatrix} -1.3951 \\ 3.2099 \end{pmatrix} = \begin{pmatrix} 0.3787 \\ 0.6213 \end{pmatrix}.

Now N_I = {1, 2}, Z_I = {3}, P_I = ∅, and we have found the solution. Figure 2 displays the data points and the unconstrained and constrained fitted lines.
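The arithmetic of this example is easy to replicate. The following MATLAB lines (our check, base MATLAB only) reproduce the unconstrained fit, the constraint residuals, the hitting times, and the constrained fit reported above.

    % Numerical check of the toy example.
    A = [4 2.05; 2.05 1.2025];  b = [-3; -1.735];
    W = [-1 0; 0 -1; 1 1];      e = [0; 0; 1];
    beta0 = -(A \ b)                            % unconstrained fit (0.0835, 1.3004)
    res   = W * beta0 - e                       % residuals (-0.0835, -1.3004, 0.3840)
    uZ    = W(3, :)';                           % only constraint 3 is violated
    tHit  = res ./ (W * (A \ uZ))               % hitting times (-0.0599, 0.4051, 0.2116)
    beta1 = beta0 - 0.2116 * (A \ uZ)           % constrained fit (0.3787, 0.6213)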

Figure 2. The data points and the fitted lines for the first toy example of constrained curve fitting (Lawson and Hanson 1987). The online version of this figure is in color.

Our second toy example concerns the toxin response problem (Schoenfeld 1986), with m toxin levels x1 ≤ x2 ≤ · · · ≤ xm and a mortality rate yi = f(xi) at each level. It is reasonable to assume that the mortality function f(x) is nonnegative and increasing. Suppose ȳi are the observed death frequencies averaged across ni trials at level xi. In a finite sample, the ȳi may fail to be nondecreasing. For example, in an Environmental Protection Agency (EPA) study of the effects of chromium on fish (Schoenfeld 1986), the observed binomial frequencies and chromium levels are

\bar{y} = (0.3752, 0.3202, 0.2775, 0.3043, 0.5327)^t, \qquad x = (51, 105, 194, 384, 822)^t \text{ in } \mu\mathrm{g}/\mathrm{l}.

Isotonic regression minimizes \sum_{k=1}^m (\bar{y}_k - \theta_k)^2 subject to the constraints 0 ≤ θ1 ≤ · · · ≤ θm on the binomial parameters θk = f(xk). The solution path depicted in Figure 3 is continuous and piecewise linear as advertised, but the coefficient paths are nonlinear. The first four binomial parameters coalesce into the constrained estimate.
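A hedged MATLAB sketch of the corresponding quadratic program appears below (our illustration; quadprog from the Optimization Toolbox returns the constrained endpoint, whereas the path algorithm also supplies the intermediate solutions plotted in Figure 3).

    % Isotonic regression for the toxin data as a quadratic program.
    ybar = [0.3752; 0.3202; 0.2775; 0.3043; 0.5327];
    m = numel(ybar);
    D = zeros(m - 1, m);
    for k = 1:(m - 1)
        D(k, k) = 1;  D(k, k + 1) = -1;        % theta_k - theta_{k+1} <= 0
    end
    W = [-1, zeros(1, m - 1); D];              % prepend -theta_1 <= 0, i.e., theta_1 >= 0
    theta = quadprog(eye(m), -ybar, W, zeros(m, 1))   % the first four entries coalesce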

Figure 3. Toxin response example. Left: solution path. Right: coefficient paths for the constraints.

7.2 Generalized Lasso Problems

Many of the generalized Lasso problems studied by Tibshirani and Taylor (2011) reduce to minimization of some form of the objective function (6). To avoid repetition, we omit a detailed discussion of this class of problems and simply refer readers interested in applications to Lasso or fused-Lasso penalized regression, outlier detections, trend filtering, and image restoration to the original article (Tibshirani and Taylor 2011). Here, we would like to point out the relevance of the generalized Lasso problems to graph-guided penalized regression (Chen et al. 2010). Suppose each node i of a graph is assigned a regression coefficient βi and a weight wi. In graph penalized regression, the objective function takes the form

\frac{1}{2} \| W (y - X\beta) \|_2^2 + \lambda_G \sum_{i \sim j} \Big| \frac{\beta_i}{\sqrt{d_i}} - \mathrm{sgn}(r_{ij}) \frac{\beta_j}{\sqrt{d_j}} \Big| + \lambda_L \sum_j |\beta_j|,    (21)

where the set of neighboring pairs i ~ j defines the graph, di is the degree of node i, and rij is the correlation coefficient between i and j. Under a line graph, the objective function (21) reduces to the fused Lasso. In two-dimensional imaging applications, the graph consists of neighboring pixels in the plane, and minimization of the function (21) is accomplished by total variation algorithms. In MRI images, the graph is defined by neighboring pixels in three dimensions. Penalties are introduced in image reconstruction and restoration to enforce smoothness. In microarray analysis, the graph reflects one or more gene networks. Smoothing the βi over the networks is motivated by the assumption that the expression levels of related genes should rise and fall in a coordinated fashion. Ridge regularization in graph penalized regression (Li and Li 2008) is achieved by changing the objective function to

\frac{1}{2} \| W (y - X\beta) \|_2^2 + \lambda_G \sum_{i \sim j} \Big( \frac{\beta_i}{\sqrt{d_i}} - \mathrm{sgn}(r_{ij}) \frac{\beta_j}{\sqrt{d_j}} \Big)^2 + \lambda_L \sum_j |\beta_j|.

If one fixes either of the tuning constants in these models, our path algorithm delivers the solution path as a function of the other tuning constant. Alternatively, one can fix the ratio of the two tuning constants. Finally, the extension

\frac{1}{2} \| Y - X B \|_F^2 + \lambda_G \sum_{i \sim j} \sum_{k=1}^K \Big| \frac{\beta_{ki}}{\sqrt{d_i}} - \mathrm{sgn}(r_{ij}) \frac{\beta_{kj}}{\sqrt{d_j}} \Big| + \lambda_L \sum_{k,i} |\beta_{ki}|

of the objective function to multivariate response models is obvious.

In principle, the path algorithm based on the sweep operator applies to these problems, provided the design matrix X has full column rank and the active constraints along the solution path are linearly independent. If X has reduced rank, then it is advisable to add a small ridge penalty ε Σi βi² to the objective function (Tibshirani and Taylor 2011). Even so, computation of the unpenalized solution may be problematic in high dimensions. Alternatively, path following can be conducted starting from the fully constrained problem as suggested in Section 5. If the linear independence of the active constraints is violated, for example, when the graph has loops, then we recommend resorting to the numerical remedy mentioned at the end of Section 5.

7.3 Shape-Restricted Regressions

Order-constrained regression is now widely accepted as an important modeling tool (Robertson, Wright, and Dykstra 1988; Silvapulle and Sen 2005). If β is the parameter vector, monotone regression includes isotone constraints β1 ≤ β2 ≤ · · · ≤ βm or antitone constraints β1 ≥ β2 ≥ · · · ≥ βm. In partially ordered regression, subsets of the parameters are subject to isotone or antitone constraints. In other problems, it is sensible to impose convex or concave constraints. If observations are collected at irregularly spaced time points t1 ≤ t2 ≤ · · · ≤ tm, then convexity translates into the constraints

\frac{\beta_{i+2} - \beta_{i+1}}{t_{i+2} - t_{i+1}} \ge \frac{\beta_{i+1} - \beta_i}{t_{i+1} - t_i},

for 1 ≤ i ≤ m – 2. When the time intervals are uniform, these convex constraints become βi+2 − βi+1 ≥ βi+1 − βi. Concavity translates into the opposite set of inequalities. All of these shape-restricted regression problems can be solved by path following.
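For completeness, here is a small MATLAB sketch of ours that assembles the convexity constraints as Wβ ≤ 0 for irregularly spaced times; negating W gives the concavity constraints instead.

    % Convexity constraints for irregularly spaced time points t.
    t = [0; 0.4; 1.1; 1.5; 2.3];                % example times, increasing
    m = numel(t);
    W = zeros(m - 2, m);
    for i = 1:(m - 2)
        h1 = t(i + 1) - t(i);  h2 = t(i + 2) - t(i + 1);
        W(i, i)     = -1 / h1;                  % row i encodes slope(i) <= slope(i+1)
        W(i, i + 1) =  1 / h1 + 1 / h2;
        W(i, i + 2) = -1 / h2;
    end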

As an example of partial isotone regression, we fit the data from table 1.3.1 of Robertson, Wright, and Dykstra (1988) on the first-year grade point averages (GPA) of 2397 University of Iowa freshmen. These data can be downloaded as part of the R package “ic.infer.” The ordinal predictors, high school rank (as a percentile) and American College Testing (ACT, a standard aptitude test) score, are discretized into nine ordered categories each. A rational admission policy based on these two predictor sets should be isotone separately within each set. Figure 4 shows the unconstrained and constrained solutions for the intercept and the two predictor sets and the solution path of the regression coefficients for the high school rank predictor.

Figure 4. Left: unconstrained and constrained estimates for the Iowa GPA data. Right: solution paths of the regression coefficients corresponding to high school rank. The online version of this figure is in color.

The same authors (Robertson, Wright, and Dykstra 1988) predicted the probability of obtaining a B or better college GPA based on high school GPA and ACT score. In their data, covering 1490 college students, ȳij is the proportion of students who obtain a B or better college GPA among the nij students who are within the ith ACT category and the jth high school GPA category. Prediction is achieved by minimizing the criterion \sum_{ij} n_{ij} (\bar{y}_{ij} - \theta_{ij})^2 subject to the matrix partial-order constraints θ11 ≥ 0, θij ≤ θi+1,j, and θij ≤ θi,j+1. Figure 5 shows the solution path and the residual sum of squares and effective degrees of freedom along the path. The latter vividly illustrates the trade-off between goodness of fit and degrees of freedom. Readers can consult page 33 of Robertson, Wright, and Dykstra (1988) for the original data and the constrained parameter estimates.

Figure 5. GPA prediction example. Left: solution path for the predicted probabilities. Right: residual sum of squares and the estimated degrees of freedom along the path. The online version of this figure is in color.

7.4 Nonparametric Shape-Restricted Regression

In this section, we visit a few problems amenable to the path algorithm arising in nonparametric statistics. Given data (xi, yi), i = 1, . . . , n, and a weight function w(x), nonparametric least squares seeks a regression function θ(x) minimizing the criterion

\sum_{i=1}^n w(x_i) [ y_i - \theta(x_i) ]^2    (22)

over a space C of functions with shape restrictions. In concave regression, for instance, C is the space of concave functions. This seemingly intractable infinite-dimensional problem can be simplified by minimizing the least-square criterion (3) subject to inequality constraints. For a univariate predictor and concave regression, the constraints (4) are pertinent. The piecewise linear function extrapolated from the estimated θi is clearly concave. The consistency of concavity-constrained least squares is proved by Hanson and Pledger (1976); the asymptotic distribution of the corresponding estimator and its rate of convergence are investigated in later articles (Mammen 1991; Groeneboom, Jongbloed, and Wellner 2001). Other relevant shape restrictions for univariate predictors include monotonicity (Brunk 1955; Grenander 1956), convexity (Groeneboom, Jongbloed, and Wellner 2001), supermodularity (Beresteanu 2004), and combinations of these.

Multidimensional nonparametric estimation is much harder because there is no natural order on R^d when d > 1. One fruitful approach to shape-restricted regression relies on sieve estimators (Shen and Wong 1994; Beresteanu 2004). The general idea is to introduce a basis of local functions (e.g., normalized B-splines) centered on the points of a grid G spanning the support of the covariate vectors xi. Admissible estimators are then limited to linear combinations of the basis functions subject to restrictions on the estimates at the grid points. Estimation can be formalized as minimization of the criterion ‖y − Φ(X)θ‖_2^2 subject to the constraints CΦ(G)θ ≥ 0, where Φ(X) is the matrix of basis functions evaluated at the covariate vectors xi, Φ(G) is the matrix of basis functions evaluated at the grid points, and θ is a vector of regression coefficients. The linear inequality constraints incorporated in the matrix C reflect the required shape restrictions. Estimation is performed on a sequence of grids (a sieve). Controlling the rate at which the sieve sequence converges yields a consistent estimator (Shen and Wong 1994; Beresteanu 2004). Prediction reduces to interpolation, and the path algorithm provides a computational engine for sieve estimation.

A related but different approach for multivariate convex regression minimizes the least-square criterion (3) subject to the constraints ξ_i^t (x_j − x_i) ≤ θ_j − θ_i for every ordered pair (i, j). In effect, θi is viewed as the value of the regression function θ(x) at the point xi. The unknown vector ξi serves as a subgradient of θ(x) at xi. Because convexity is preserved by maxima, the formula

\theta(x) = \max_j [ \theta_j + \xi_j^t (x - x_j) ]

defines a convex function with value θi at x = xi. In concave regression, the opposite constraint inequalities are imposed. Interpolation of predicted values in this model is accomplished by simply taking minima or maxima. Estimation reduces to a positive semidefinite quadratic program involving n(d + 1) variables and n(n – 1) inequality constraints. Note that the feasible region is nontrivial because setting all θi = 0 and all ξi = 0 works. In implementing the extension of the path algorithm mentioned in Section 5, the large number of constraints may prove to be a hindrance and lead to very short path segments. To improve estimation of the subgradients, it might be worth adding a small multiple of the ridge penalty Σi ‖ξi‖_2^2 to the objective function (3). This would have the beneficial effect of turning a semidefinite quadratic program into a positive definite quadratic program.

8. CONCLUSIONS

Our new path algorithm for convex quadratic programming under affine constraints generalizes previous path algorithms for Lasso penalized regression and its extensions. Our path algorithm directly attacks the primal problem; the complementary method of Tibshirani and Taylor (2011) solves the dual problem. Our various examples confirm the primal algorithm's versatility. Its potential disadvantages involve computing the initial point −A^{−1}b and storing the sweeping tableau. In problems with large numbers of parameters, neither of these steps is trivial. However, if A has enough structure, then an explicit inverse may exist. As we have already noted, once A^{−1} is computed, there is no need to store the entire tableau. The multitask regression problem with a large number of responses per case is a typical example where computation of A^{−1} simplifies. In settings where the matrix A is singular, parameter constraints may compensate. We have briefly indicated how to conduct path following in this circumstance. Although our more stringent assumption of linear independence of the constraint gradients excludes some interesting examples treated by Tibshirani and Taylor (2011), many practical problems can be finessed by the remedy discussed in Section 5.

Our path algorithm qualifies as a general convex quadratic program solver. Custom algorithms have been developed for many special cases of quadratic programming. For example, the pool-adjacent-violators algorithm (PAVA) is now the standard approach to isotone regression (de Leeuw, Hornik, and Mair 2009). The other generic methods of quadratic programming include active set and interior point methods. For applications where only the constrained estimate is of interest, it would be hard to beat these well-honed algorithms. In regularized statistical estimation and inverse problems, the primary goal is to select relevant predictors rather than to find a constrained solution. Thus, the entire solution path commands more interest than any single point along it, and the path algorithm's ability to deliver the whole regularized path with little additional computation cost beyond constrained estimation is bound to be appealing to statisticians. Numerical comparisons with competing methods would be illuminating but would also depend heavily on programming details and problem choices. In the interests of brevity, we refrain from making numerical comparisons here.

The path algorithm bears a stronger resemblance to the active set method (Nocedal and Wright 2006). Indeed, both operate by deleting and adding constraints to a working active set. However, they differ in at least two respects. First, the initial active set is constructed arbitrarily in the active set method. Distinct initial active sets produce different iteration sequences. In contrast, the path algorithm always starts from the unconstrained solution. The initial active set is determined as a by-product. Second, the mechanics of adding or deleting constraints differ in the two methods. The active set method chooses the direction of movement that tends to decrease the quadratic objective function most, while the path algorithm tracks the tuning constant ρ. In fact, path following steadily increases the objective function until it reaches its constrained solution. In this sense, the active set method is greedier than the path algorithm, which expends its effort in traversing the solution path.


ACKNOWLEDGMENTS

We thank the editor, associate editor, and two referees, whose comments greatly improved the article. We also acknowledge support from grants GM53275, MH59490, CA87949, CA16042, R01HG006139, and NCSU FRPD.

Footnotes

SUPPLEMENTARY MATERIALS

MATLAB code: Data and MATLAB code for all examples in this article are available in the supplementary materials (path quadratic.zip). The readme.txt file describes the contents of each file in the package. They are also part of the SparseReg toolbox maintained and distributed on the first author's website.

Contributor Information

Hua Zhou, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203 (hua_zhou@ncsu.edu)..

Kenneth Lange, Departments of Biomathematics, Human Genetics, and Statistics, University of California, Los Angeles, CA 90095-8076 (klange@ucla.edu)..

REFERENCES

1. Beresteanu A (2004), "Nonparametric Estimation of Regression Functions Under Restrictions on Partial Derivatives," Working Papers 04-06, Duke University, Department of Economics.
2. Brunk HD (1955), "Maximum Likelihood Estimates of Monotone Parameters," Annals of Mathematical Statistics, 26, 607–616.
3. Chen X, Lin Q, Kim S, Carbonell J, Xing E (2012), "Smoothing Proximal Gradient Method for General Structured Sparse Regression," Annals of Applied Statistics, 6, 719–752.
4. de Leeuw J, Hornik K, Mair P (2009), "Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods," Journal of Statistical Software, 32(5), 1–24.
5. Dempster AP (1969), Elements of Continuous Multivariate Analysis, Addison-Wesley Series in Behavioral Sciences, Reading, MA: Addison-Wesley.
6. Efron B (2004), "The Estimation of Prediction Error: Covariance Penalties and Cross-Validation" (with discussion), Journal of the American Statistical Association, 99, 619–642.
7. Efron B, Hastie T, Johnstone I, Tibshirani R (2004), "Least Angle Regression" (with discussion), The Annals of Statistics, 32, 407–499.
8. Friedman J (2008), "Fast Sparse Regression and Classification," Proceedings of the 23rd International Workshop on Statistical Modelling, 27–57. Available online at http://www-stat.stanford.edu/0jhf/ftp/GPSpaper.pdf
9. Friedman J, Hastie T, Höfling H, Tibshirani R (2007), "Pathwise Coordinate Optimization," Annals of Applied Statistics, 1, 302–332.
10. Goodnight JH (1979), "A Tutorial on the Sweep Operator," The American Statistician, 33, 149–158.
11. Grenander U (1956), "On the Theory of Mortality Measurement. Part II," Skand Aktuarietidskr, 39, 125–153.
12. Groeneboom P, Jongbloed G, Wellner JA (2001), "Estimation of a Convex Function: Characterizations and Asymptotic Theory," The Annals of Statistics, 29, 1653–1698.
13. Hanson DL, Pledger G (1976), "Consistency in Concave Regression," The Annals of Statistics, 4, 1038–1050.
14. Hildreth C (1954), "Point Estimates of Ordinates of Concave Functions," Journal of the American Statistical Association, 49, 598–619.
15. Jennrich R (1977), "Stepwise Regression," in Statistical Methods for Digital Computers, eds. Ralston A, Enslein K, Wilf HS, New York: Wiley-Interscience, pp. 58–75.
16. Lange K (2010), Numerical Analysis for Statisticians (2nd ed.), Statistics and Computing, New York: Springer.
17. Lawson CL, Hanson RJ (1987), Solving Least Squares Problems (new ed.), Classics in Applied Mathematics, Philadelphia, PA: Society for Industrial and Applied Mathematics.
18. Li C, Li H (2008), "Network-Constrained Regularization and Variable Selection for Analysis of Genomic Data," Bioinformatics, 24, 1175–1182. doi:10.1093/bioinformatics/btn081.
19. Little RJA, Rubin DB (2002), Statistical Analysis With Missing Data (2nd ed.), Wiley Series in Probability and Statistics, Hoboken, NJ: Wiley-Interscience.
20. Liu J, Yuan L, Ye J (2010), "An Efficient Algorithm for a Class of Fused Lasso Problems," Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 323–332.
21. Magnus JR, Neudecker H (1999), Matrix Differential Calculus With Applications in Statistics and Econometrics, Wiley Series in Probability and Statistics, Chichester: Wiley.
22. Mammen E (1991), "Nonparametric Regression Under Qualitative Smoothness Assumptions," The Annals of Statistics, 19, 741–759.
23. Meyer M, Woodroofe M (2000), "On the Degrees of Freedom in Shape-Restricted Regression," The Annals of Statistics, 28, 1083–1104.
24. Nocedal J, Wright SJ (2006), Numerical Optimization (2nd ed.), Springer Series in Operations Research and Financial Engineering, New York: Springer.
25. Robertson T, Wright FT, Dykstra RL (1988), Order Restricted Statistical Inference, Wiley Series in Probability and Mathematical Statistics, Chichester: Wiley.
26. Rosset S, Zhu J (2007), "Piecewise Linear Regularized Solution Paths," The Annals of Statistics, 35, 1012–1030.
27. Ruszczyński A (2006), Nonlinear Optimization, Princeton, NJ: Princeton University Press.
28. Savage C (1997), "A Survey of Combinatorial Gray Codes," SIAM Review, 39, 605–629.
29. Schoenfeld DA (1986), "Confidence Bounds for Normal Means Under Order Restrictions, With Application to Dose-Response Curves, Toxicology Experiments, and Low-Dose Extrapolation," Journal of the American Statistical Association, 81, 186–195.
30. Shen X, Wong WH (1994), "Convergence Rate of Sieve Estimates," The Annals of Statistics, 22, 580–615.
31. Silvapulle MJ, Sen PK (2005), Constrained Statistical Inference: Inequality, Order, and Shape Restrictions, Wiley Series in Probability and Statistics, Hoboken, NJ: Wiley-Interscience.
32. Stein CM (1981), "Estimation of the Mean of a Multivariate Normal Distribution," The Annals of Statistics, 9, 1135–1151.
33. Tibshirani R, Taylor J (2011), "The Solution Path of the Generalized Lasso," The Annals of Statistics, 39, 1335–1371.
34. Tibshirani RJ, Hoefling H, Tibshirani R (2011), "Nearly-Isotonic Regression," Technometrics, 53, 54–61.
35. Wu TT, Lange K (2008), "Coordinate Descent Algorithms for Lasso Penalized Regression," Annals of Applied Statistics, 2, 224–244.
36. Zou H, Hastie T, Tibshirani R (2007), "On the 'Degrees of Freedom' of the Lasso," The Annals of Statistics, 35, 2173–2192.
