Abstract
Classical penalty methods solve a sequence of unconstrained problems that put greater and greater stress on meeting the constraints. In the limit as the penalty constant tends to ∞, one recovers the constrained solution. In the exact penalty method, squared penalties are replaced by absolute value penalties, and the solution is recovered for a finite value of the penalty constant. In practice, the kinks in the penalty and the unknown magnitude of the penalty constant prevent wide application of the exact penalty method in nonlinear programming. In this article, we examine a strategy of path following consistent with the exact penalty method. Instead of performing optimization at a single penalty constant, we trace the solution as a continuous function of the penalty constant. Thus, path following starts at the unconstrained solution and follows the solution path as the penalty constant increases. In the process, the solution path hits, slides along, and exits from the various constraints. For quadratic programming, the solution path is piecewise linear and takes large jumps from constraint to constraint. For a general convex program, the solution path is piecewise smooth, and path following operates by numerically solving an ordinary differential equation segment by segment. Our diverse applications to a) projection onto a convex set, b) nonnegative least squares, c) quadratically constrained quadratic programming, d) geometric programming, and e) semidefinite programming illustrate the mechanics and potential of path following. The final detour to image denoising demonstrates the relevance of path following to regularized estimation in inverse problems. In regularized estimation, one follows the solution path as the penalty constant decreases from a large value.
Keywords: constrained convex optimization, exact penalty, geometric programming, ordinary differential equation, quadratically constrained quadratic programming, regularization, semidefinite programming
1 Introduction
Penalties and barriers are both potent devices for solving constrained optimization problems [1, 2, 3, 4, 5, 6]. The general idea is to replace hard constraints by penalties or barriers and then exploit the well-oiled machinery for solving unconstrained problems. Penalty methods operate on the exterior of the feasible region and barrier methods on the interior. The strength of a penalty or barrier is determined by a tuning constant. In classical penalty methods, a single global tuning constant is gradually sent to ∞; in barrier methods, it is gradually sent to 0. Either strategy generates a sequence of solutions that converges in practice to the solution of the original constrained optimization problem.
Barrier methods are now generally conceded to offer a better approach to solving convex programs than penalty methods. Application of log barriers and carefully controlled versions of Newton’s method make it possible to follow the central path reliably and quickly to the constrained minimum [1]. Nonetheless, penalty methods should not be ruled out. Augmented Lagrangian methods [7] and exact penalty methods [4] are potentially competitive with interior point methods for smooth convex programming problems. Both methods have the advantage that the solution of the constrained problem kicks in for a finite value of the penalty constant. This avoids problems of ill conditioning as the penalty constant tends to ∞.
The disadvantage of exact penalties over traditional quadratic penalties is lack of differentiability of the penalized objective function. In the current paper, we argue that this impediment can be finessed by path following. Our path following method starts at the unconstrained solution and follows the solution path as the penalty constant increases. In the process, the solution path hits, exits, and slides along the various constraint boundaries. The path itself is piecewise smooth with kinks at the boundary hitting and escape times. One advances along the path by numerically solving a differential equation for the Lagrange multipliers of the penalized problem. In the special case of quadratic programming with affine constraints, the solution path is piecewise linear, and one can easily anticipate entire path segments [8]. This special case is intimately related to the linear complementarity problem [9] in optimization theory.
Homotopy (continuation) methods for the solution of nonlinear equations and optimization problems have been pursued for many years and enjoyed a variety of successes [4, 10, 11, 12]. To our knowledge, however, there has been no exploration of path following as an implementation of the exact penalty method. Our modest goal here is to assess the feasibility and versatility of exact path following for constrained optimization. Comparing its performance to existing methods, particularly the interior point method, is probably best left for later, more practically oriented papers. In our experience, coding the algorithm is straightforward in Matlab. The rich numerical resources of Matlab include differential equation solvers that alert the user when certain events such as constraint hitting and escape occur.
The rest of the paper is organized as follows. Section 2 briefly reviews the exact penalty method for optimization and investigates sufficient conditions for uniqueness and continuity of the solution path. Section 3 derives the path following strategy for general convex programs, with particular attention to the special cases of quadratic programming and convex optimization with affine constraints. Section 4 presents various applications of the path algorithm. Our most elaborate example demonstrates the relevance of path following to regularized estimation. The particular problem treated, image denoising, is typical of many inverse problems in applied mathematics and statistics [13]. In such problems one follows the solution path as the penalty constant decreases. Finally, Section 5 discusses the limitations of the path algorithm and hints at future generalizations.
2 Exact Penalty Methods
In this paper we consider the convex programming problem of minimizing the convex objective function f (x) subject to r affine equality constraints gi (x) = 0 and s convex inequality constraints hj (x) ≤ 0. We will further assume that f (x) and the hj (x) are twice differentiable. The differential df (x) is the row vector of partial derivatives of f (x); the gradient ∇ f (x) is the transpose of df (x). The second differential d2 f (x) is the Hessian matrix of second partial derivatives of f (x). Similar conventions hold for the differentials of the constraint functions.
Exact penalty methods [4, 5] minimize the surrogate function
$$\mathcal{E}_\rho(x) = f(x) + \rho \sum_{i=1}^r |g_i(x)| + \rho \sum_{j=1}^s \max\{0, h_j(x)\}. \tag{1}$$
This definition of ℰρ (x) is meaningful regardless of whether the contributing functions are convex. If the program is convex, then ℰρ (x) is itself convex. It is interesting to compare ℰρ (x) to the Lagrangian function
$$\mathcal{L}(x) = f(x) + \sum_{i=1}^r \lambda_i g_i(x) + \sum_{j=1}^s \mu_j h_j(x),$$
which captures the behavior of f (x) near the optimum. At a constrained local minimum y, the Lagrangian satisfies the stationarity condition ∇ℒ (y) = 0; its inequality multipliers μj are nonnegative and satisfy the complementary slackness conditions μj hj (y) = 0. In an exact penalty method one takes
$$\rho > \max\{|\lambda_1|, \ldots, |\lambda_r|, \mu_1, \ldots, \mu_s\}. \tag{2}$$
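This choice creates the favorable circumstances
$$\mathcal{L}(x) \le \mathcal{E}_\rho(x) \ \text{ for all } x, \qquad \mathcal{L}(y) = f(y) = \mathcal{E}_\rho(y),$$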
with profound consequences. As the next proposition proves, minimizing ℰρ (x) is effective in minimizing f (x) subject to the constraints.
Proposition 1
Suppose the objective function f (x) and the constraint functions are twice differentiable and satisfy the Lagrange multiplier rule at the local minimum y. If inequality (2) holds and v*d2 ℒ (y) v > 0 for every vector v ≠ 0 satisfying dgi (y) v = 0 and dhj (y) v ≤ 0 for all active inequality constraints, then y furnishes an unconstrained local minimum of ℰρ(x). For a convex program satisfying Slater’s constraint qualification and inequality (2), y is a minimum of ℰρ(x) if and only if y is a minimum of f (x) subject to the constraints. No differentiability assumptions are required for convex programs.
Proof The conditions imposed on the quadratic form v*d2 ℒ (y) v are well-known sufficient conditions for a local minimum. Theorems 6.9 and 7.21 of the reference [5] prove all of the foregoing assertions.
As previously stressed, the exact penalty method turns a constrained optimization problem into an unconstrained minimization problem. Furthermore, in contrast to the quadratic penalty method [4, Section 17.1], the constrained solution in the exact method is achieved for a finite value of ρ. Despite these advantages, minimizing the surrogate function ℰρ (x) is complicated. For one thing, it is no longer globally differentiable. For another, one must minimize ℰρ (x) along an increasing sequence ρn because the Lagrange multipliers (2) are usually unknown in advance. These hurdles have prevented wide application of exact penalty methods in convex programming.
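The contrast between exact and quadratic penalties can be checked on a one-dimensional toy problem. The sketch below is only an illustration under assumed data that do not come from the text: minimize f(x) = (x − 2)² subject to x ≤ 1, whose constrained minimum is y = 1 with multiplier μ = 2. The function names are hypothetical.

```python
def exact_penalty_minimizer(rho):
    """Minimizer of (x - 2)**2 + rho * max(0, x - 1) (assumed toy problem)."""
    # For x > 1 the derivative 2(x - 2) + rho vanishes at x = 2 - rho/2.
    # Once rho >= 2 that point drops to x <= 1 and the kink at x = 1 is optimal.
    return max(1.0, 2.0 - rho / 2.0)


def quadratic_penalty_minimizer(rho):
    """Minimizer of (x - 2)**2 + rho * max(0, x - 1)**2."""
    # Stationarity for x > 1: 2(x - 2) + 2*rho*(x - 1) = 0.
    return (2.0 + rho) / (1.0 + rho)


def argmin_convex(fun, lo=-10.0, hi=10.0, iters=200):
    """Ternary search for the minimizer of a strictly convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fun(m1) <= fun(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for rho in (0.5, 1.0, 2.0, 10.0, 1000.0):
        print(rho, exact_penalty_minimizer(rho), quadratic_penalty_minimizer(rho))
```

In this toy problem the exact penalty reaches the constrained solution x = 1 as soon as ρ reaches the multiplier value 2, whereas the quadratic penalty minimizer (2 + ρ)/(1 + ρ) stays strictly above 1 for every finite ρ.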
As a prelude to our derivation of the path following algorithm for convex programs, we record several properties of ℰρ (x) that mitigate the failure of differentiability.
Proposition 2
The surrogate function ℰρ (x) is increasing in ρ. Furthermore, ℰρ (x) is strictly convex for one ρ > 0 if and only if it is strictly convex for all ρ > 0. Likewise, it is coercive for one ρ > 0 if and only if it is coercive for all ρ > 0. Finally, if f (x) is strictly convex (or coercive), then all ℰρ (x) are strictly convex (or coercive).
Proof The first assertion is obvious. For the second assertion, consider more generally a finite family u1 (x), …, uq (x) of convex functions, and suppose a linear combination $\sum_{k=1}^q c_k u_k(x)$ with positive coefficients is strictly convex. It suffices to prove that any other linear combination $\sum_{k=1}^q b_k u_k(x)$ with positive coefficients is strictly convex. For any two points x ≠ y and any scalar α ∈ (0, 1), we have
$$u_k[\alpha x + (1-\alpha) y] \le \alpha u_k(x) + (1-\alpha) u_k(y). \tag{3}$$
Since $\sum_{k=1}^q c_k u_k(x)$ is strictly convex, strict inequality must hold for at least one k. Hence, multiplying inequality (3) by bk and adding gives
$$\sum_{k=1}^q b_k u_k[\alpha x + (1-\alpha) y] < \alpha \sum_{k=1}^q b_k u_k(x) + (1-\alpha) \sum_{k=1}^q b_k u_k(y).$$
The third assertion follows from the fact that a convex function is coercive if and only if its restriction to each half-line is coercive [14, Proposition 3.2.2]. Given this result, suppose ℰρ (x) is coercive, but ℰρ* (x) is not coercive for some other constant ρ* > 0. Then there exist a point x, a direction v, and a sequence of scalars tn tending to ∞ such that ℰρ* (x + tn v) is bounded above. This requires the sequence f (x + tn v) and each of the sequences |gi (x + tn v)| and max {0, hj (x + tn v)} to remain bounded above. But in this circumstance the sequence ℰρ (x + tn v) also remains bounded above, contradicting the coercivity of ℰρ (x). The final two assertions are obvious.
3 The Path Following Algorithm
In this section, we take a different point of view. Instead of minimizing ℰρ (x) for an increasing sequence ρn, we study how the solution x (ρ) changes continuously with ρ and devise a path following strategy starting from ρ = 0. For some finite value of ρ, the path locks in on the solution of the original convex program. In regularized statistical estimation and inverse problems, the primary goal is to select relevant predictors rather than to find a constrained solution. Thus, the entire solution path commands more interest than any single point along it [15, 16, 17, 8, 13]. Although our theory will focus on constrained estimation, readers should bear in mind this second application area of path following.
The path algorithm relies critically on the first order optimality condition that characterizes the optimum point of the convex function ℰρ (y).
Proposition 3
For a convex program, a point x = x (ρ) minimizes the function ℰρ (y) if and only if x satisfies the stationarity condition
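$$0 = \nabla f(x) + \rho \sum_{i=1}^r s_i \nabla g_i(x) + \rho \sum_{j=1}^s t_j \nabla h_j(x) \tag{4}$$
for coefficient sets $\{s_i\}_{i=1}^r$ and $\{t_j\}_{j=1}^s$. These sets can be characterized as
$$s_i \in \begin{cases} \{-1\} & g_i(x) < 0 \\ [-1, 1] & g_i(x) = 0 \\ \{1\} & g_i(x) > 0 \end{cases} \qquad \text{and} \qquad t_j \in \begin{cases} \{0\} & h_j(x) < 0 \\ [0, 1] & h_j(x) = 0 \\ \{1\} & h_j(x) > 0. \end{cases} \tag{5}$$
At most one point achieves the minimum of ℰρ (y) for a given ρ when ℰρ (y) is strictly convex.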
Proof According to Fermat’s rule, x minimizes ℰρ (y) if and only if 0 belongs to the subdifferential ∂ℰρ (x) of ℰρ (y). To derive the subdifferential displayed in equations (4) and (5), one applies the addition and chain rules of the convex calculus. The sets defining the possible values of si and tj are the subdifferentials of the functions |s| and t+ = max {t, 0}, respectively. For more details see Theorem 3.5 and ancillary material in the book [5]. Finally, it is well known that strict convexity guarantees a unique minimum.
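The optimality test of Proposition 3 can be made concrete in one dimension. The following sketch assumes the toy data f(x) = (x − 2)² and h(x) = x − 1, which are not taken from the text, and checks whether 0 lies in the interval f′(x) + ρ t h′(x) as t ranges over the subdifferential of max {t, 0}. The function names are hypothetical, and an exact zero test replaces the tolerance that floating point data would need in practice.

```python
def s_interval(g):
    """Set of admissible coefficients s_i: the subdifferential of |s| at s = g."""
    if g < 0:
        return (-1.0, -1.0)
    if g > 0:
        return (1.0, 1.0)
    return (-1.0, 1.0)


def t_interval(h):
    """Set of admissible coefficients t_j: the subdifferential of max(t, 0) at t = h."""
    if h < 0:
        return (0.0, 0.0)
    if h > 0:
        return (1.0, 1.0)
    return (0.0, 1.0)


def is_minimizer_1d(x, rho):
    """Test 0 in f'(x) + rho * t * h'(x) for f(x) = (x - 2)**2 and h(x) = x - 1."""
    grad_f = 2.0 * (x - 2.0)
    grad_h = 1.0
    t_lo, t_hi = t_interval(x - 1.0)
    left = grad_f + rho * t_lo * grad_h
    right = grad_f + rho * t_hi * grad_h
    return min(left, right) <= 0.0 <= max(left, right)
```

For instance, the boundary point x = 1 passes the test for ρ = 3 but fails it for ρ = 1, where the minimizer is the infeasible point x = 1.5.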
To speak coherently of solution paths, one must validate the existence, uniqueness, and continuity of the solution x (ρ) to the system of equations (1). Uniqueness follows from strict convexity as already noted. Existence and continuity are more subtle.
Proposition 4
If ℰρ (y) is strictly convex and coercive, then the solution path x (ρ) of equation (1) exists and is continuous in ρ. If the gradient vectors {∇ gi (x): gi (x) = 0} ∪ {∇ hj (x): hj (x) = 0} of the active constraints are linearly independent at x (ρ) for ρ > 0, then the coefficients si (ρ) and tj (ρ) are unique and continuous near ρ as well.
Proof The special case of a least squares objective with affine constraints appears in our previous paper [8]. The following proof applies to general convex programming. In accord with Proposition 2, we assume that either f (x) is strictly convex and coercive or restrict our attention to the open interval (0, ∞). Consider a subinterval [a, b] and fix a point x in the common domain of the functions ℰρ (y). The coercivity of ℰa (y) and the inequalities
$$\mathcal{E}_a[x(\rho)] \le \mathcal{E}_\rho[x(\rho)] \le \mathcal{E}_\rho(x) \le \mathcal{E}_b(x)$$
demonstrate that the solution vector x (ρ) is bounded over [a, b]. To prove continuity, suppose that it fails for a given ρ ∈ [a, b]. Then there exist an ε > 0 and a sequence ρn tending to ρ such that ‖x (ρn) − x (ρ)‖2 ≥ ε for all n. Since x (ρn) is bounded, we can pass to a subsequence if necessary and assume that x (ρn) converges to some point y. Taking limits in the inequality ℰρn [x (ρn)] ≤ ℰρn (x) demonstrates that ℰρ (y) ≤ ℰρ (x) for all x. Because x (ρ) is unique, we reach the contradictory conclusions ‖y − x (ρ)‖2 ≥ ε and y = x (ρ).
Verification of the second claim is deferred to permit further discussion of path following. The claim says that an active constraint (gi (x) = 0 or hj (x) = 0) remains active until its coefficient hits an endpoint of its subdifferential. Because the solution path is, in fact, piecewise smooth, one can follow the coefficient path by numerically solving an ordinary differential equation (ODE).
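Our path following algorithm works segment-by-segment. Along the path we keep track of the following index sets
$$\begin{aligned} \mathcal{N}_E &= \{i : g_i(x) < 0\}, & \mathcal{Z}_E &= \{i : g_i(x) = 0\}, & \mathcal{P}_E &= \{i : g_i(x) > 0\}, \\ \mathcal{N}_I &= \{j : h_j(x) < 0\}, & \mathcal{Z}_I &= \{j : h_j(x) = 0\}, & \mathcal{P}_I &= \{j : h_j(x) > 0\}, \end{aligned} \tag{6}$$
determined by the signs of the constraint functions. For the sake of simplicity, assume that at the beginning of the current segment si does not equal −1 or 1 when $i \in \mathcal{Z}_E$ and tj does not equal 0 or 1 when $j \in \mathcal{Z}_I$. In other words, the coefficients of the active constraints occur on the interior of their subdifferentials. Let us show in this circumstance that the solution path can be extended in a smooth fashion. Our plan of attack is to reparameterize by the Lagrange multipliers for the active constraints. Thus, set λi = ρsi for $i \in \mathcal{Z}_E$ and ωj = ρtj for $j \in \mathcal{Z}_I$. The multipliers satisfy −ρ < λi < ρ and 0 < ωj < ρ.
The stationarity condition now reads
$$0 = \nabla f(x) - \rho \sum_{i \in \mathcal{N}_E} \nabla g_i(x) + \rho \sum_{i \in \mathcal{P}_E} \nabla g_i(x) + \rho \sum_{j \in \mathcal{P}_I} \nabla h_j(x) + \sum_{i \in \mathcal{Z}_E} \lambda_i \nabla g_i(x) + \sum_{j \in \mathcal{Z}_I} \omega_j \nabla h_j(x).$$
To this we concatenate the constraint equations 0 = gi (x) for $i \in \mathcal{Z}_E$ and 0 = hj (x) for $j \in \mathcal{Z}_I$.
For convenience now define
$$U_{\mathcal{Z}}(x) = \begin{pmatrix} dg_{\mathcal{Z}_E}(x) \\ dh_{\mathcal{Z}_I}(x) \end{pmatrix}, \qquad u_{\bar{\mathcal{Z}}}(x) = -\sum_{i \in \mathcal{N}_E} \nabla g_i(x) + \sum_{i \in \mathcal{P}_E} \nabla g_i(x) + \sum_{j \in \mathcal{P}_I} \nabla h_j(x).$$
In this notation the stationarity equation can be recast as
$$0 = \nabla f(x) + \rho\, u_{\bar{\mathcal{Z}}}(x) + U_{\mathcal{Z}}^t(x) \begin{pmatrix} \lambda_{\mathcal{Z}_E} \\ \omega_{\mathcal{Z}_I} \end{pmatrix}.$$
Under the assumption that the matrix $U_{\mathcal{Z}}(x)$ has full row rank, one can solve for the Lagrange multipliers in the form
$$\begin{pmatrix} \lambda_{\mathcal{Z}_E} \\ \omega_{\mathcal{Z}_I} \end{pmatrix} = -\left[U_{\mathcal{Z}}(x) U_{\mathcal{Z}}^t(x)\right]^{-1} U_{\mathcal{Z}}(x) \left[\nabla f(x) + \rho\, u_{\bar{\mathcal{Z}}}(x)\right]. \tag{7}$$
Hence, the multipliers are unique. Continuity of the multipliers is a consequence of the continuity of the solution vector x (ρ) and all functions in sight on the right-hand side of equation (7). This observation completes the proof of Proposition 4.
Collectively the stationarity and active constraint equations can be written as the vector equation 0 = k (x, λ, ω, ρ). To solve for x, λ and ω in terms of ρ, we apply the implicit function theorem [18, 19]. This requires calculating the differential of k (x, λ, ω, ρ) with respect to the underlying dependent variables x, λ, and ω and the independent variable ρ. Because the equality constraints are affine, a brief calculation gives
$$\partial_{x,\lambda,\omega} k(x, \lambda, \omega, \rho) = \begin{pmatrix} d^2 f(x) + \rho \sum_{j \in \mathcal{P}_I} d^2 h_j(x) + \sum_{j \in \mathcal{Z}_I} \omega_j\, d^2 h_j(x) & U_{\mathcal{Z}}^t(x) \\ U_{\mathcal{Z}}(x) & 0 \end{pmatrix}, \qquad \partial_\rho k(x, \lambda, \omega, \rho) = \begin{pmatrix} u_{\bar{\mathcal{Z}}}(x) \\ 0 \end{pmatrix}.$$
The matrix ∂x,λ,ω k (x, λ, ω, ρ) is nonsingular when its upper-left block is positive definite and its lower-left block has full row rank [20, Proposition 11.3.2]. Given that it is nonsingular, the implicit function theorem applies, and we can in principle solve for x, λ and ω in terms of ρ. More importantly, the implicit function theorem supplies the derivative
$$\frac{d}{d\rho} \begin{pmatrix} x \\ \lambda_{\mathcal{Z}_E} \\ \omega_{\mathcal{Z}_I} \end{pmatrix} = -\begin{pmatrix} d^2 f(x) + \rho \sum_{j \in \mathcal{P}_I} d^2 h_j(x) + \sum_{j \in \mathcal{Z}_I} \omega_j\, d^2 h_j(x) & U_{\mathcal{Z}}^t(x) \\ U_{\mathcal{Z}}(x) & 0 \end{pmatrix}^{-1} \begin{pmatrix} u_{\bar{\mathcal{Z}}}(x) \\ 0 \end{pmatrix}, \tag{8}$$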
which is the key to path following. We summarize our findings in the next proposition. A similar proof appears in our application paper [21], where we implement path following in a convex clustering context.
Proposition 5
Suppose the surrogate function ℰρ (y) is strictly convex and coercive. If at the point x (ρ0) the matrix ∂x,λ,ω k (x, λ, ω, ρ) is nonsingular and the coefficient of each active constraint lies in the interior of its subdifferential, then the solution path x (ρ) and Lagrange multipliers λ (ρ) and ω (ρ) satisfy the differential equation (8) in the vicinity of x (ρ0).
In practice one traces the solution path along the current time segment until either an inactive constraint becomes active or the coefficient of an active constraint hits the boundary of its subdifferential. The earliest hitting time or escape time over all constraints determines the duration of the current segment. When the hitting time for an inactive constraint occurs first, we move the constraint to the appropriate active set, $\mathcal{Z}_E$ for an equality constraint or $\mathcal{Z}_I$ for an inequality constraint, and keep the other constraints in place. Similarly, when the escape time for an active constraint occurs first, we move the constraint to the appropriate inactive set and keep the other constraints in place. In the second scenario, if si hits the value −1, then we move i to the set $\mathcal{N}_E$ of equality constraints with gi (x) < 0; if si hits the value 1, then we move i to the set $\mathcal{P}_E$ of equality constraints with gi (x) > 0. Similar comments apply when a coefficient tj hits 0 or 1. Once this move is executed, we commence path following along the new segment. Path following continues until, for sufficiently large ρ, the sets $\mathcal{N}_E$, $\mathcal{P}_E$, and $\mathcal{P}_I$ of violated constraints are exhausted, the vector $u_{\bar{\mathcal{Z}}}$ of violated-constraint gradients vanishes, and the solution vector x (ρ) stabilizes.
Algorithm 1 summarizes the above path following strategy. It makes the convenient assumptions that there are no active constraints at the unconstrained solution x (0) and that the hitting and/or escape times do not occur simultaneously for any ρ > 0. Our previous paper [8] suggests remedies in the very rare situations where these assumptions fail.
[Algorithm 1: Exact path algorithm for convex programming.]
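The hit-and-slide mechanics can be sketched in a deliberately simple special case. The code below is not the authors' Algorithm 1; it assumes the toy problem of minimizing ½‖x − b‖² subject to x ≥ 0, written as hj (x) = −xj ≤ 0. With an identity Hessian, each violated coordinate moves at unit speed in ρ, active coordinates stay at 0, and the multiplier of a constraint equals ρ at its hitting time and is constant afterwards, so no escapes occur. The function name and return values are hypothetical.

```python
def nonneg_projection_path(b):
    """Trace x(rho) for min 0.5*||x - b||^2 subject to x >= 0 (assumed toy problem).

    Returns the list of kinks (rho, x) and the multipliers omega_j of the
    constraints that became active.
    """
    x = list(b)                     # unconstrained solution x(0) = b
    rho = 0.0
    violated = {j for j, v in enumerate(x) if v < 0.0}   # h_j(x) = -x_j > 0
    omega = {}                      # multipliers of active constraints
    kinks = [(rho, tuple(x))]
    while violated:
        # On this segment dx_j/drho = 1 for violated j and 0 otherwise,
        # so the earliest hitting time is the smallest remaining violation.
        step = min(-x[j] for j in violated)
        for j in violated:
            x[j] += step
        rho += step
        hit = {j for j in violated if x[j] >= 0.0}
        for j in hit:
            x[j] = 0.0
            omega[j] = rho          # t_j = omega_j / rho equals 1 at the hit
        violated -= hit
        kinks.append((rho, tuple(x)))
    return kinks, omega


if __name__ == "__main__":
    kinks, omega = nonneg_projection_path([1.0, -0.5, -2.0])
    for rho, x in kinks:
        print(rho, x)
```

The path is piecewise linear, and the number of segments equals the number of distinct hitting times, here one per negative component of b. The final point is the componentwise positive part of b, which is the projection of b onto the nonnegative orthant.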
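Path following simplifies considerably in two special cases. Consider convex quadratic programming with objective function $f(x) = \frac{1}{2} x^t A x + b^t x$ and equality constraints V x = d and inequality constraints W x ≤ e, where A is positive semi-definite. The exact penalized objective function becomes
$$\mathcal{E}_\rho(x) = \frac{1}{2} x^t A x + b^t x + \rho \sum_{i=1}^r |v_i^t x - d_i| + \rho \sum_{j=1}^s (w_j^t x - e_j)_+,$$
where $v_i^t$ and $w_j^t$ denote the rows of V and W. Since both the equality and inequality constraints are affine, their second derivatives vanish. Both the matrix $U_{\mathcal{Z}}$, whose rows are the differentials of the active constraints, and the vector $u_{\bar{\mathcal{Z}}}$, the signed sum of the gradients of the violated constraints, are constant on the current path segment, and the path x (ρ) satisfies
$$\frac{d}{d\rho} \begin{pmatrix} x \\ \lambda_{\mathcal{Z}_E} \\ \omega_{\mathcal{Z}_I} \end{pmatrix} = -\begin{pmatrix} A & U_{\mathcal{Z}}^t \\ U_{\mathcal{Z}} & 0 \end{pmatrix}^{-1} \begin{pmatrix} u_{\bar{\mathcal{Z}}} \\ 0 \end{pmatrix}. \tag{9}$$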
This implies that the solution path x (ρ) is piecewise linear [8]. When V is the identity matrix, d = 0, and there are no inequality constraints, Algorithm 1 reduces to the well-known path algorithm for lasso regression [15, 16]. Tibshirani and Taylor [17] construct a piecewise-linear path algorithm for the generalized lasso problem with a general V and d = 0. Their approach seems to circumvent the need for linear independence of the active constraints.
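On the next rung of the ladder of generality are convex programs with affine constraints. For the exact surrogate
$$\mathcal{E}_\rho(x) = f(x) + \rho \sum_{i=1}^r |v_i^t x - d_i| + \rho \sum_{j=1}^s (w_j^t x - e_j)_+,$$
with $v_i^t$ and $w_j^t$ the rows of V and W, the active constraint matrix $U_{\mathcal{Z}}$ and the vector $u_{\bar{\mathcal{Z}}}$ of signed violated-constraint gradients are still constant along a path segment. The relevant differential equation becomes
$$\frac{d}{d\rho} \begin{pmatrix} x \\ \lambda_{\mathcal{Z}_E} \\ \omega_{\mathcal{Z}_I} \end{pmatrix} = -\begin{pmatrix} d^2 f(x) & U_{\mathcal{Z}}^t \\ U_{\mathcal{Z}} & 0 \end{pmatrix}^{-1} \begin{pmatrix} u_{\bar{\mathcal{Z}}} \\ 0 \end{pmatrix}. \tag{10}$$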
There are two approaches for computing the right-hand side of equation (10). When A = d2 f (x) is positive definite and has full row rank, the relevant inverse amounts to
The numerical cost of computing the inverse scales as . When d2 f (x) is a constant, the inverse is computed once. Sequentially updating it for different active sets is then conveniently organized around the sweep operator of computational statistics [8]. For a general convex function f (x), every time x changes, the inverse must be recomputed. This burden plus the cost of computing the entries of d2 f (x) slow the path algorithm for general convex problems. At the same time, Section 4 demonstrates how a greater level of generality opens up many new applications. See our recent paper [21] on convex clustering for yet another application of path following.
In many applications f (x) is convex but not necessarily strictly convex. By Proposition 2, ℰρ is also convex for all ρ > 0. Furthermore, ℰρ may admit a unique global minimum for large ρ. In this case the proof of Proposition 4 still applies, and the solution path is continuous. One can circumvent problems in inverting d2 f (x) by reparameterizing [4]. For the sake of simplicity, suppose that all of the constraints are affine and that has full row rank. The set of points x satisfying the active constraints can be written as x = w+Y y, where w is a particular solution, y is free to vary, and the columns of span the null space of and hence are orthogonal to the rows of . Under the null space reparameterization, . Furthermore,
It follows that equation (10) becomes
| (11) |
Differentiating equation (7) gives the multiplier derivatives
| (12) |
The obvious advantage of using equation (11) is that the matrix Yt d2 f (x) Y can be nonsingular when d2 f (x) is singular. The computational cost of evaluating the right-hand sides of equations (11) and (12) is When and are small to compared n, this is an improvement over the cost of computing the right-hand side of equation (8). Balanced against this gain is the requirement of finding a basis of the null space of . Fortunately, the matrix Y is constant over each path segment and in practice can be computed by taking the QR decomposition of the active constraint matrix . At each kink of the solution path, either one constraint enters or one leaves. Therefore, Y can be sequentially computed by standard updating and downdating formulas [22, 4]. Which ODE (8) or (11) is preferable depends on the specific application. When the loss function f (x) is not strictly convex, for example when the number of parameters exceeds the number of cases in regression, path following requires the ODE (11). Interested readers are referred to the book [4] for a more extended discussion of range-space versus null-space optimization methods.
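For a general convex program, one can employ Euler’s update
$$\begin{pmatrix} x \\ \lambda \\ \omega \end{pmatrix}(\rho + \Delta\rho) = \begin{pmatrix} x \\ \lambda \\ \omega \end{pmatrix}(\rho) + \Delta\rho\, \frac{d}{d\rho} \begin{pmatrix} x \\ \lambda \\ \omega \end{pmatrix}(\rho)$$
to advance the solution of the ODE (8). Euler’s formula may be inaccurate for ∆ρ large. One can correct it by fixing ρ and performing one step of Newton’s method to re-connect with the solution path. This amounts to replacing the position-multiplier vector by
$$\begin{pmatrix} x \\ \lambda \\ \omega \end{pmatrix} - \left[\partial_{x,\lambda,\omega} k(x, \lambda, \omega, \rho)\right]^{-1} k(x, \lambda, \omega, \rho).$$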
In practice, it is certainly easier and probably safer to rely on ODE packages such as the ODE45 function in Matlab to advance the solution of the ODE.
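The interplay of an Euler predictor and a Newton corrector can be seen in a scalar sketch. The example assumes a toy problem that does not appear in the text: f(x) = eˣ + e⁻ˣ with the single constraint h(x) = 1 − x ≤ 0. While the constraint is violated, stationarity reads f′(x) − ρ = 0, so dx/dρ = 1/f″(x) and the exact path is x(ρ) = asinh(ρ/2). All function names are hypothetical.

```python
import math


def f1(x):
    """First derivative of f(x) = exp(x) + exp(-x)."""
    return math.exp(x) - math.exp(-x)


def f2(x):
    """Second derivative of f(x) = exp(x) + exp(-x)."""
    return math.exp(x) + math.exp(-x)


def euler_step(x, drho):
    # Differentiating f'(x) - rho = 0 in rho gives dx/drho = 1 / f''(x).
    return x + drho / f2(x)


def newton_correct(x, rho):
    # One Newton step on the stationarity equation f'(x) - rho = 0 at fixed rho.
    return x - (f1(x) - rho) / f2(x)


def follow(rho_end, n_steps, correct):
    """Advance from the unconstrained minimum x(0) = 0 to rho_end."""
    x, rho = 0.0, 0.0
    drho = rho_end / n_steps
    for _ in range(n_steps):
        x = euler_step(x, drho)
        rho += drho
        if correct:
            x = newton_correct(x, rho)
    return x


if __name__ == "__main__":
    exact = math.asinh(1.0)          # x(2) on the violated segment
    print(follow(2.0, 4, False) - exact, follow(2.0, 4, True) - exact)
```

With four steps of size ∆ρ = 0.5, the corrected iterate lands much closer to the exact path than plain Euler does. The segment ends at ρ = 2 sinh 1, where x reaches the boundary x = 1; event detection is left out of this sketch.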
4 Examples of Path Following
Our examples are intended to illuminate the mechanics of path following and showcase its versatility. As we emphasized in the introduction, we forgo comparisons with other methods. Comparisons depend heavily on programming details and problem choices, so a premature study might well be misleading.
Example 1 Projection onto the Feasible Region
Finding a feasible point is the initial stage in many convex programs. Dykstra’s algorithm [23, 24] was designed precisely to solve the problem of projecting an exterior point onto the intersection of a finite number of closed convex sets. The projection problem also yields to our generic path following algorithm. Consider the toy example of projecting a point b ∈ ℝ2 onto the intersection of the closed unit ball and the closed half space x1 ≥ 0 [18]. This is equivalent to solving
The relevant gradients and second differentials are
Path following starts from the unconstrained solution x (0) = b; the direction of movement is determined by formula (8). For x ∈ {x :‖x‖2 > 1, x1 > 0}, the path
heads toward the origin. For x ∈ {x : |x2| > 1, x1 = 0}, the path
also heads toward the origin. For x ∈ {x : ‖x‖2 > 1, x1 < 0}, the path
heads toward the point (1, 0)t. For x ∈ {x : ‖x‖2 = 1, x1 < 0}, the path
is tangent to the circle. Finally, for x ∈ {x : ‖x‖2 < 1, x1 < 0}, the path
heads toward the x2-axis. The left panel of Figure 1 plots the vector field at the time ρ = 0. The right panel shows the solution path for projection from the points (−2, 0.5)t, (−2, 1.5)t, (−1, 2)t, (2, 1.5)t, (2, 0)t, (1, 2)t, and (−0.5, −2)t onto the feasible region. In projecting the point b = (−1, 2)t onto (0, 1)t, the ODE45 solver of Matlab evaluates derivatives at 19 different time points. Dykstra’s algorithm by comparison takes about 30 iterations to converge [18].
Fig. 1.

Projection to the positive half disk. Left: Derivatives at ρ = 0 for projection onto the half disc. Right: Projection trajectories from various initial points.
Example 2 Nonnegative Least Squares (NNLS) and Nonnegative Matrix Factorization (NNMF)
Nonnegative matrix factorization (NNMF) is an alternative to principal component analysis and is useful in modeling, compressing, and interpreting nonnegative data such as observational counts and images. The articles [25, 26, 27] discuss in detail estimation algorithms and statistical applications of NNMF. The basic idea is to approximate an m × n data matrix X = (xij) with nonnegative entries by a product V W of two low rank matrices V = (vik) and W = (wkj) with nonnegative entries. Here V and W are m × r and r × n respectively, with r ≪ min {m, n}. One version of NNMF minimizes the criterion
$$\|X - VW\|_F^2 = \sum_{i=1}^m \sum_{j=1}^n \Big(x_{ij} - \sum_{k=1}^r v_{ik} w_{kj}\Big)^2, \tag{13}$$
where ‖⋅‖F denotes the Frobenius norm. In a typical imaging problem, m (number of images) might range from 10³ to 10⁴, n (number of pixels per image) might surpass 10⁴, and a rank r = 50 approximation might adequately capture X.
Minimization of the objective function (13) is nontrivial because it is not jointly convex in V and W. Multiple local minima are possible. The well-known multiplicative algorithm [26, 27] enjoys the descent property, but it is not guaranteed to converge to even a local minimum [25]. An alternative algorithm that exhibits better convergence is alternating least squares (ALS). In updating W with V fixed, ALS solves the n separate nonnegative least squares (NNLS) problems
$$\min_{w_j \ge 0} \|x_j - V w_j\|_2^2, \qquad j = 1, \ldots, n, \tag{14}$$
where xj and wj denote the j-th columns of the corresponding matrices. Similarly, in updating V with W fixed, ALS solves m separate NNLS problems. The unconstrained solution W (0) = (Vt V)−1 Vt X of W for fixed V requires just one QR decomposition of V or one Cholesky decomposition of Vt V. The exact path algorithm for solving the subproblem (14) commences with W (0). If W (ρ) stabilizes with just a few zeros, then the path algorithm ends quickly and is extremely efficient. For a NNLS problem, the path is piecewise linear, and one can straightforwardly project the path to the next hitting or escape time using the sweep operator [8]. Figure 2 shows a typical piecewise linear path for a problem with r = 50 predictors. Each projection to the next event requires 2r² flops. The number of path segments (events) roughly scales as the number of negative components in the unconstrained solution.
Fig. 2.

Piecewise linear paths of the regression coefficients for a NNLS problem with 50 predictors.
Example 3 Quadratically Constrained Quadratic Programming (QCQP)
Example 1 is a special case of quadratically constrained quadratic programming (QCQP). In convex QCQP [1, Section 4.4], one minimizes a convex quadratic function over an intersection of ellipsoids and affine subspaces.
Mathematically, this amounts to the problem
where P0 is a positive definite matrix and the Pj are positive semidefinite matrices. Our algorithm starts with the unconstrained minimum and proceeds along the path determined by the derivative
where (x) has rows for i ∈ and (Pj x + bj)t for j ∈ , and
Affine inequality constraints can be accommodated by setting one or more of the Pj equal to 0.
As a numerical illustration, consider the bivariate problem
| (15) |
Here the feasible region is given by the intersection of three disks with centers (0.5, 0)t, (−0.5, 0)t, and (0, 0.5)t, respectively, and a common radius of 1. Figure 3 displays the solution trajectory. Starting from the unconstrained minimum x (0) = (1, 1.5)t, it hits, slides along, and exits two circles before its journey ends at the constrained minimum (0.059, 0.829)t. The ODE45 solver of Matlab evaluates derivatives at 72 time points along the path.
Fig. 3.

Trajectory of the exact penalty path algorithm for a QCQP problem (15). The solid lines are the contours of the objective function f (x). The dashed lines are the contours of the constraint functions hj (x).
Example 4 Geometric Programming
As a branch of convex optimization theory, geometric programming stands just behind linear and quadratic programming in importance [28, 29, 30, 31]. It has applications in chemical equilibrium problems [32], structural mechanics [29], digital circuit design [33], maximum likelihood estimation [34], stochastic processes [35], and a host of other subjects [28, 29]. Geometric programming deals with posynomials, which are functions of the form
$$f(x) = \sum_{\alpha \in S} c_\alpha \prod_{i=1}^n x_i^{\alpha_i} = \sum_{\alpha \in S} c_\alpha e^{\alpha^t y}. \tag{16}$$
In the left-hand definition of this equivalent pair of definitions, the index set S ⊂ ℝn is finite, and all coefficients cα and all components x1, …, xn of the argument x of f (x) are positive. The possibly fractional powers αi corresponding to a particular α may be positive, negative, or zero. For instance, is a posynomial on ℝ2. In geometric programming, one minimizes a posynomial f (x) subject to posynomial inequality constraints of the form hj (x) ≤ 1 for 1 ≤ j ≤ s. In some versions of geometric programming, equality constraints of monomial type are permitted [28]. The right-hand definition in equation (16) invokes the exponential reparameterization $x_i = e^{y_i}$. This simple transformation has the advantage of rendering a geometric program convex. In fact, any posynomial f (y) in the exponential parameterization is log-convex and therefore convex. The concise representations
$$\nabla f(y) = \sum_{\alpha \in S} c_\alpha e^{\alpha^t y} \alpha, \qquad d^2 f(y) = \sum_{\alpha \in S} c_\alpha e^{\alpha^t y} \alpha \alpha^t$$
of the gradient and the second differential are helpful in both theory and computation.
Without loss of generality, one can repose geometric programming as
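$$\begin{aligned} &\text{minimize} && f(y) \\ &\text{subject to} && \ln g_i(y) = 0, \quad 1 \le i \le r, \\ & && \ln h_j(y) \le 0, \quad 1 \le j \le s, \end{aligned} \tag{17}$$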
where f (y) and the hj (y) are posynomials and the equality constraints ln gi (y) are affine. In this exponential parameterization setting, it is easy to state necessary and sufficient conditions for strict convexity and coerciveness.
Proposition 6
The objective function f (y) in the geometric program (17) is strictly convex if and only if the subspace spanned by the vectors {α}α∈S is all of ℝn; f (y) is coercive if and only if the polar cone {z : zt α ≤ 0 for all α ∈ S} reduces to the origin 0. Equivalently, f (y) is coercive if the origin 0 belongs to the interior of the convex hull of the set S.
Proof These claims are proved in detail in our paper [36].
According to Propositions 1 and 4, the strict convexity and coerciveness of f (y) guarantee the uniqueness and continuity of the solution path in y. This in turn implies the uniqueness and continuity of the solution path in the original parameter vector x. The path directions are related by the chain rule
$$\frac{dx_i}{d\rho} = \frac{dx_i}{dy_i} \frac{dy_i}{d\rho} = x_i \frac{dy_i}{d\rho}.$$
As a concrete example, consider the problem
| (18) |
It is easy to check that the vectors {(−3, 0)t, (−1, −2)t, (1, 1)t} span ℝ2 and generate a convex hull strictly containing the origin 0. Therefore, f (y) is strictly convex and coercive. It achieves its unconstrained minimum at the point x (0) = (6^{1/5}, 6^{1/5})t, or equivalently y (0) = ((ln 6)/5, (ln 6)/5)t. To solve the constrained minimization problem, we follow the path dictated by the revised geometric program (17). Figure 4 plots the trajectory from the unconstrained solution to the constrained solution in the original x variables. The solid lines in the figure represent the contours of the objective function f (x), and the dashed lines represent the contours of the constraint function h (x). The ODE45 solver of Matlab evaluates derivatives at seven time points along the path.
Fig. 4.

Trajectory of the exact penalty path algorithm for the geometric programming problem (18). The solid lines are the contours of the objective function f (x). The dashed lines are the contours of the constraint function h (x) at levels 1, 1.25, and 1.5.
Example 5 Semidefinite Programming (SDP)
The linear semidefinite programming problem [37] consists in minimizing the trace function X → tr (C X) over the cone of positive semidefinite matrices subject to the linear constraints tr (Ai X) = bi for 1 ≤ i ≤ p. Here C and the Ai are assumed symmetric. According to Sylvester’s criterion, the positive semidefiniteness constraint involves a complicated system of inequalities involving nonconvex functions. One way of cutting through this morass is to focus on the minimum eigenvalue v1 (X) of X. Because the function −v1 (X) is convex, one can enforce positive semidefiniteness by requiring −v1 (X) ≤ 0. Thus, the linear semidefinite programming problem is a convex program in the standard functional form.
It simplifies matters enormously to assume that v1 (X) has multiplicity 1. Let u be the unique, up to sign, unit eigenvector corresponding to v1 (X). The matrix X is parameterized by the entries of its lower triangle. With these conventions, the following formulas
| (19) |
| (20) |
for the first and second partial derivatives of −v1(X) are well known [19]. Here the matrix (v1I − X)− is the Moore-Penrose inverse of v1 I − X. The partial derivative of X with respect to its lower triangular entry xij equals Eij + 1{i≠j} Eji, where Eij is the matrix consisting of all 0’s except for a 1 in position (i, j). Note that $u^t E_{ij} = u_i e_j^t$ and Ekl u = ul ek for the standard unit vectors ej and ek. The second partial derivatives of X vanish. The Moore-Penrose inverse is most easily expressed in terms of the spectral decomposition of X. If we denote the ith eigenvalue of X by vi and the corresponding ith unit eigenvector by ui, then we have
$$(v_1 I - X)^- = \sum_{i > 1} \frac{1}{v_1 - v_i}\, u_i u_i^t.$$
Finally, the formulas
express the linear constraints and their partial derivatives in terms of the lower triangular entries of X.
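To make the first-derivative rule concrete, the following sketch treats a 2 × 2 symmetric matrix with lower-triangle entries (x11, x21, x22) and a simple minimum eigenvalue. It assumes the standard perturbation result that the derivative of v1 in a direction dX is ut dX u, combined with the partial derivatives Eij + 1{i≠j} Eji of X. The function names are hypothetical, and the closed-form eigenpair is specific to the 2 × 2 case.

```python
import math

def min_eigenpair(a, b, c):
    """Smallest eigenvalue and unit eigenvector of [[a, b], [b, c]].

    Assumes the smallest eigenvalue has multiplicity 1.
    """
    v1 = 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)
    # Two algebraically equivalent eigenvector candidates; keep the longer one.
    cand1 = (b, v1 - a)
    cand2 = (v1 - c, b)
    u = max(cand1, cand2, key=lambda t: math.hypot(t[0], t[1]))
    norm = math.hypot(u[0], u[1])
    return v1, (u[0] / norm, u[1] / norm)

def neg_min_eig_gradient(a, b, c):
    """Gradient of -v1(X) with respect to the entries (x11, x21, x22)."""
    _, (u1, u2) = min_eigenpair(a, b, c)
    # dX/dx11 = E11, dX/dx21 = E21 + E12, dX/dx22 = E22
    return (-u1 * u1, -2.0 * u1 * u2, -u2 * u2)

grad = neg_min_eig_gradient(2.0, 1.0, 3.0)
```

The factor of 2 in the middle component reflects the fact that the off-diagonal entry x21 appears twice in X. The two diagonal components always sum to −1 because u is a unit vector.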
Initiating path following is problematic because tr (CX) has minimum −∞. A good strategy is to amend the surrogate function ℰρ (x) by adding the term $\frac{\epsilon(\rho)}{2}\|X\|_F^2$, where ε (ρ) is a smooth positive function that decreases to 0. Taking $\epsilon(\rho) = e^{-c\rho}$ for c positive works well in practice. The new surrogate function is strictly convex and possesses a unique minimum for all ρ ≥ 0. In view of the identities $\|X\|_F^2 = \sum_i \sum_j x_{ij}^2$ and tr (CX) = Σi Σj cij xij for X = (xij) and C = (cij), the initial condition X (0) = − ε (0)−1 C is straightforward to deduce.
Path following must be modified to accommodate the new surrogate function. In the notation of [19], let x = v (X) be the vector obtained from vec (X) by eliminating all supradiagonal entries, and let D be the duplication matrix satisfying vec (X) = Dx. Applying the chain rule to the obvious identities $\|X\|_F^2 = x^t D^t D x$ and tr (CX) = vec (C)t Dx, one can extend the derivation of Proposition 5 and prove that
Path following proceeds until all constraints are satisfied and ε (ρ) is negligible.
For didactic purposes, consider the problem of minimizing tr (CX) subject to
where
Figure 5 displays the solution paths of the entries xij of X and the minimum eigenvalue v1. Here we use $\epsilon(\rho) = e^{-\rho}$. The path starts with X (0) = −C, hits, slides along, and exits various constraints, and ends at the constrained solution.
Fig. 5.

Solution path of a semidefinite programming example.
Example 6 Image Denoising
Image analysis is another fertile field for path following. Here we explore how to restore or enhance images by removing noise. This example differs from previous examples in that the fully constrained solution is trivial. The solution path itself is the object of interest. Suppose that w = (wij) ∈ ℝm×n represents the recorded gray levels across a 2D array of pixels from a noisy image with true gray levels u = (uij). The well-known denoising model of Rudin-Osher-Fatemi (ROF) [38] minimizes the total variation regularized least squares criterion
| (21) |
The total variation penalty serves to smooth the reconstructed image and preserve its edges. A similar effect can be achieved by replacing the isotropic penalty TV (u) by the anisotropic penalty
| (22) |
In this example we focus on path following for the anisotropic penalty and a more general convex loss function f(u). The objective function is now
$$f(u) + \rho \|Du\|_1. \tag{23}$$
For instance, the amended loss function $f(u) = \frac{1}{2}\|w - Ku\|_2^2$ with a Gaussian or motion blurring matrix K is appropriate in many imaging problems. Poisson count data are relevant to image reconstruction in X-ray and positron tomography [20] and to image denoising in certain circumstances [39]. With Poisson noise, the least squares criterion is replaced by a negative loglikelihood. The difference matrix D captures the ℓ1 penalty (22). Note that the matrices w and u are now viewed as vectors. For an m × n 2D image, the difference matrix D has 2mn − m − n rows (penalties) and mn columns (pixels). This matrix is very sparse, with just 2(2mn − m − n) nonzero entries equal to ±1. When m and n are both at least 2, D has more rows than columns and a reduced column rank of mn −1.
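The structure of D can be checked directly on a small image. The sketch below builds D in a sparse row format under the assumption that pixels are stacked column by column; the function names and the 3 × 4 test size are hypothetical choices made for illustration.

```python
def difference_matrix(m, n):
    """Rows of the first-difference matrix D for an m-by-n image.

    Pixel (i, j) has index j * m + i (column stacking). Each row is stored
    sparsely as {column: value} with one -1 and one +1.
    """
    rows = []
    for j in range(n):
        for i in range(m):
            k = j * m + i
            if i + 1 < m:                      # difference with the pixel below
                rows.append({k: -1.0, k + 1: 1.0})
            if j + 1 < n:                      # difference with the pixel to the right
                rows.append({k: -1.0, k + m: 1.0})
    return rows

def matvec(rows, u):
    """Matrix-vector product for the sparse row representation."""
    return [sum(v * u[k] for k, v in row.items()) for row in rows]

m, n = 3, 4
D = difference_matrix(m, n)
num_rows = len(D)                              # 2mn - m - n penalties
num_nonzero = sum(len(row) for row in D)       # two entries per row
flat = matvec(D, [7.0] * (m * n))              # a constant image has zero differences
```

The vanishing product for a constant image exhibits the one-dimensional null space responsible for the column rank mn − 1.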
For sufficiently large ρ, the minimizer of the objective function (21) reduces to a constant vector (blank image) equal to the average value of the wij. The goal of image denoising is to find a ρ such that the recovered image is judged satisfactory by visual inspection or other more quantitative criteria. Notable computational advances in solving this problem include Chambolle’s algorithm [40] and split Bregman iteration [41]. These methods minimize the objective functions (21) and (23) for a fixed value of ρ. The web site of UCLA’s Computational and Applied Math Group summarizes the most recent progress in this area. In reality, outer iterations are almost always required to tune the parameter ρ. Path following is an attractive option because it provides the whole solution path at about the same computational cost as recovering the solution for an individual ρ.
Although it is tempting to minimize the criterion (23) by path following, the regularization matrix D has linearly dependent rows and deficient rank. Because the assumptions of Proposition 4 are violated, the multipliers λE of the active constraints in equations (7) and (9) are not uniquely determined. One can intuitively understand the difficulty by considering a square with four pixels. Whenever any three constraints are active, the fourth is automatically active as well. This constraint redundancy can be remedied by reparameterizing the model in terms of neighboring pixel differences x = Du. Unfortunately, the rank deficiency of D is also an issue. Adding the same constant to all of the components of u yields exactly the same x. To circumvent this problem, we simply append a bottom row to D with all entries 0 except for a 1 in the last position. If V is the amended version of D, then V has full column rank, and the vector x = Vu uniquely determines the image. Indeed, one can solve for u in the form u = (Vt V)−1 Vt x. The bottom entry of x is obviously the gray level of the last pixel of the image.
Despite the presence of the inverse of the huge mn × mn matrix Vt V, the transformation u = (Vt V)−1 Vt x is not as daunting as it appears. First of all, multiplication by the sparse matrix Vt is trivial. More importantly, the matrix Vt V is symmetric, banded, and extremely sparse. To count its nonzero entries, note that except for diagonal entries, these entries occur in the same positions as the nonzero entries of the adjacency matrix of a corresponding graph with 2mn − m − n edges and mn nodes. Because an adjacency matrix has twice as many nonzero entries as edges, the matrix Vt V has at most 2(2mn − m − n) + mn = 5mn −2m −2n nonzero entries. These occur within a band of width min{m, n} along the main diagonal, depending on whether we stack columns or concatenate rows. The most convenient way to solve equations of the kind Vt V a = b is to extract the Cholesky decomposition L of Vt V and execute forward and backward substitution. Although extraction of L is cheap for banded matrices, it is even cheaper for banded matrices with just a handful of nonzero entries per row. In our experience, the computational complexity of extracting L scales linearly in the product mn. Since L itself is sparse, forward and backward substitution are also very cheap. For instance with a 256 × 256 image, Matlab computes L (a 65536 × 65536 matrix) in 0.26 seconds on a laptop; L contains just 1,971,395 nonzero entries. The sparsity of L suggests that it be computed once and stored in compressed format for all images of a given size. Many of its nonzero entries are close to zero. Thus, a fairly light truncation of the non-diagonal entries of L gives an even sparser matrix realizing nearly the same transformation. Figure 6 displays the sparsity pattern of the matrix Vt V and its permuted Cholesky factor L for 64 × 64 images. Images of other sizes show similar sparsity patterns.
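The recovery u = (Vt V)−1 Vt x and the nonzero count of Vt V can be illustrated on a small image. The sketch below is an assumption-laden toy: it stores Vt V densely and uses an unpermuted textbook Cholesky factorization purely for brevity, whereas a practical implementation would rely on sparse banded storage. All function names and the 3 × 4 test image are hypothetical.

```python
import math

def amended_matrix(m, n):
    """Sparse rows of V: the difference matrix D plus a last row e_p^t."""
    rows = []
    for j in range(n):
        for i in range(m):
            k = j * m + i                       # column stacking of pixels
            if i + 1 < m:
                rows.append({k: -1.0, k + 1: 1.0})
            if j + 1 < n:
                rows.append({k: -1.0, k + m: 1.0})
    rows.append({m * n - 1: 1.0})               # gray level of the last pixel
    return rows

def gram(rows, p):
    """Dense V^t V, assembled row by row."""
    G = [[0.0] * p for _ in range(p)]
    for row in rows:
        for a, va in row.items():
            for b, vb in row.items():
                G[a][b] += va * vb
    return G

def cholesky(G):
    """Lower-triangular L with L L^t = G."""
    p = len(G)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve(L, b):
    """Solve L L^t a = b by forward and backward substitution."""
    p = len(b)
    z = [0.0] * p
    for i in range(p):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    a = [0.0] * p
    for i in reversed(range(p)):
        a[i] = (z[i] - sum(L[k][i] * a[k] for k in range(i + 1, p))) / L[i][i]
    return a

m, n = 3, 4
p = m * n
V = amended_matrix(m, n)
u_true = [float((3 * k) % 7) for k in range(p)]                  # arbitrary test image
x = [sum(v * u_true[k] for k, v in row.items()) for row in V]   # x = V u
Vtx = [0.0] * p
for row, xi in zip(V, x):
    for k, v in row.items():
        Vtx[k] += v * xi                                         # V^t x
G = gram(V, p)
u = solve(cholesky(G), Vtx)                                      # u = (V^t V)^{-1} V^t x
nonzeros = sum(1 for r in G for g in r if g != 0.0)
```

Because the factor L depends only on the image dimensions, the same `cholesky` output can be reused for every image of a given size, in line with the suggestion to compute L once and store it.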
Fig. 6.

Sparsity patterns of Vt V and its Cholesky decomposition L for 64-by-64 images.
The problem of minimizing the objective function in the transformed variable x turns out to coincide with lasso penalized regression, for which an efficient path algorithm is known [15, 16]. Let us sketch how path following works in the more general case. The objective function is f (Bx) + ρ‖x−‖1, where B = (Vt V)−1Vt and x− denotes the vector x with its last entry deleted. The penalty contributions correspond to affine equality constraints in constrained minimization. In path following, the penalty constant ρ starts large and moves downward. The initial image is flat with gray level determined by taking x− = 0 and adjusting the last entry of x to minimize f (Bx). Call this point x∞. The first escape time occurs at ρmax = maxj |(Bt ∇ f (Bx∞))j|. At this juncture path following begins in earnest. Under the x parameterization, the loss function has gradient Bt ∇ f (Bx) and second differential Bt d2 f (Bx) B. Because f (Bx) is not strictly convex, our previous reparameterization from x to y variables is needed. Based on equation (11), the path ODEs reduce to
| (24) |
Observe that the updates of equation (11) drastically simplify because the rows of the active constraint matrix and the columns of its null space matrix Y are populated by standard Euclidean unit vectors. Furthermore, for the ROF model of image denoising, d2 f (Bx) is a diagonal matrix. Alternatively, one can derive the ODEs (24) from first principles by implicitly differentiating the stationary conditions. Path following solves the coupled ODEs (24) segment by segment.
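The first escape time ρmax = maxj |(Bt ∇ f (Bx∞))j| is simple to compute in a one-dimensional special case. The sketch below assumes a signal of length n (an n × 1 image) and the least squares loss f (u) = ½‖u − w‖². Under these assumptions V is square, B = V−1, and Bt g is obtained by solving Vt a = g with a single sweep. The function name `rho_max_1d` and the sample signal are hypothetical.

```python
def rho_max_1d(w):
    """First escape time for a 1-D signal under f(u) = 0.5 * ||u - w||^2.

    V u = x with x_i = u_{i+1} - u_i for i < n and x_n = u_n, so B = V^{-1}
    and B^t g solves V^t a = g. Requires len(w) >= 2.
    """
    n = len(w)
    mean = sum(w) / n                  # gray level of the flat starting image
    g = [mean - wi for wi in w]        # gradient of f at the flat image
    a = [0.0] * n
    a[0] = -g[0]
    for i in range(1, n - 1):
        a[i] = a[i - 1] - g[i]
    a[n - 1] = g[n - 1] - a[n - 2]     # component for the unpenalized last entry
    # The maximum runs over the penalized differences only.
    return max(abs(v) for v in a[:-1]), a[n - 1]

rho_max, last_component = rho_max_1d([1.0, 2.0, 6.0])
```

The component attached to the unpenalized last entry of x vanishes because the flat gray level already minimizes f over that coordinate. Only the penalized differences enter the maximum.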
For a quadratic loss function, the second differential is constant, and the solution path is piecewise linear. Thus no ODE solving is involved. With a blurring matrix K, the second differential is Bt d2 f (Bx) B = Bt Kt K B. After each path extension, the path directions (24) yield the next event time ρj at which a nonzero component xj hits zero, or a multiplier λj of a zero component xj hits ρ or −ρ. The path is then extended to the closest of these event times. In deblurring or denoising, the inverse of is best computed via a QR decomposition of . At each kink in the path, changes by adding or deleting a column of BK. As we mentioned earlier, it is straightforward to update or downdate the QR decomposition [22]. In the original ROF model, traversing one time segment requires about O (p) operations for p = mn total pixels. The whole process ends when T differences xj become nonzero. In practice, a large value of T recovers too grainy an image, so T is typically much smaller than p. The total cost of computing the solution path is approximately O (K p), where K is the number of traversed kinks. Empirically K is on the order of O (T) for our examples.
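On a linear segment the search for the next event reduces to solving a handful of scalar linear equations. The following sketch assumes that the current values and the constant derivatives with respect to ρ of the nonzero components and of the multipliers of the zero components are supplied by the caller. The function name `next_event`, the argument layout, and the event labels are hypothetical and do not reproduce any particular sign convention.

```python
def next_event(rho, x, dx, lam, dlam, tol=1e-12):
    """Closest event time below rho on a linear path segment.

    x, dx     : nonzero components and their derivatives with respect to rho
    lam, dlam : multipliers of the zero components and their derivatives
    Returns (rho_next, kind, index); rho_next is 0.0 if no event occurs.
    """
    best = (0.0, "end", -1)
    for j, (xj, dxj) in enumerate(zip(x, dx)):
        if abs(dxj) > tol:
            t = rho - xj / dxj                 # solves x_j + (t - rho) dx_j = 0
            if best[0] < t < rho - tol:
                best = (t, "hit_zero", j)
    for j, (lj, dlj) in enumerate(zip(lam, dlam)):
        for sign in (1.0, -1.0):
            denom = dlj - sign
            if abs(denom) > tol:
                # solves lam_j + (t - rho) dlam_j = sign * t
                t = (rho * dlj - lj) / denom
                if best[0] < t < rho - tol:
                    best = (t, "escape", j)
    return best

event = next_event(2.0, [1.0], [1.0], [0.5], [0.0])
```

Since ρ decreases along the regularization path, the closest event is the largest admissible time strictly below the current ρ.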
Figure 7 illustrates denoising of a 112 × 91 image of a lighthouse. The corrupted image appears in the top-left corner of the figure. The p = 10,192 pixels generate 20,182 transformed variables. It takes our Matlab script about one minute of desktop computing time to traverse K = 2,500 segments along the regularization path from ρ = 87.9881 (blank image) to ρ = 0.5206 (a nearly optimal image), where T = 2,148 differences become nonzero. In the process, the lighthouse clearly emerges from the fog of oversmoothing. Figure 7 displays selected snapshots along the regularization path. We emphasize that path following based on equation (23) reveals the entire path for the interval [0.5206, 87.9881] of ρ values. In practice, one can accelerate path following by starting from a ρ nearer to the ultimate destination.
Fig. 7.

A noisy image and snapshots along the regularization path.
5 Discussion
Our path following algorithm for constrained convex optimization builds on but differs from the tradition of path following in homotopy methods [12] and interior point programming [1]. The paths encountered in the exact penalty method introduce the novelty of piecewise differentiability, which can be effectively handled by tracking the Lagrange multipliers. Computational statisticians deserve credit for exploring this difficult terrain [15, 16, 8, 13, 42, 43]. To our knowledge we are the first to make the connection to exact penalty methods.
Our algorithm enjoys the dual advantages of simplicity and generality. Given the rich numerical resources of Matlab, it is straightforward to solve the required ODEs segment by segment. Regardless of whether path following is faster or slower than existing optimization methods, it supplies the whole solution path. In regularized estimation, this level of detail offers unprecedented insight into how penalties and predictors interact. Our example on image denoising is a case in point.
In quadratic programming with affine equality and inequality constraints, the solution path is piecewise linear [8]. This permits path following to take large steps. Furthermore, each step can be implemented very efficiently by the sweep operator of computational statistics. Despite the loss of these advantages in more complicated examples, the real culprit in path-following deceleration in many applications is an excessive number of constraints to be navigated. Our image denoising example suffers from this defect. On the positive side of the ledger, in nonconvex problems path following may well prove to be more reliable than competing methods in separating global from local minima [44].
Various extensions of path following are in order. First, the current algorithm commences from the unconstrained solution. Our development relies on the strict convexity and coerciveness of the objective function to ensure a unique starting point. In principle, path initiation should work for any problem with a unique unconstrained minimum. Similarly, path continuation should be possible whenever the interior solution is well defined and piecewise smooth. As the image denoising example suggests, reparameterization can play an important role in correcting defects in strict convexity. Another possibility is to amend the surrogate function ℰρ (x). In our semidefinite programming example, we add the term $\frac{\epsilon(\rho)}{2}\|X\|_F^2$ to enforce strict convexity and coerciveness. A similar tactic obviously works in other examples.
A second generalization is to expand the list of penalty functions. For instance, Euclidean penalties of the form ‖M x + a‖2 are useful in grouping parameters in statistical problems. It should be straightforward to extend path following to include such penalties. A third generalization is to remove convexity restrictions altogether. As we have noted, the exact penalty method applies equally to nonconvex programming. Path following in this setting is nontrivial since the solution path is no longer necessarily continuous. This poses a real challenge, and it is unclear to us whether one can construct a theory as satisfying as that standing behind modern interior point methods. We invite the optimization community to tackle this broader issue. In the meantime, we are happy to share our Matlab code with interested researchers.
Acknowledgments
Research supported in part by National Science Foundation grant DMS-1310319 and National Institutes of Health grants GM53275, MH59490, HG006139 and GM105785.
Contributor Information
Hua Zhou, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, Tel.: +1-919-5152570, E-mail: hua zhou@ncsu.edu.
Kenneth Lange, Departments of Biomathematics, Human Genetics and Statistics, University of California, Los Angeles, CA 90095-1766.
References
- 1. Boyd S, Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press; 2004.
- 2. Forsgren A, Gill PE, Wright MH. Interior methods for nonlinear optimization. SIAM Review. 2002;44:525–597.
- 3. Luenberger DG, Ye Y. Linear and Nonlinear Programming. 3rd ed. New York: Springer; 2008. (International Series in Operations Research & Management Science, 116).
- 4. Nocedal J, Wright SJ. Numerical Optimization. 2nd ed. New York: Springer; 2006. (Springer Series in Operations Research and Financial Engineering).
- 5. Ruszczyński A. Nonlinear Optimization. Princeton, NJ: Princeton University Press; 2006.
- 6. Zangwill WI. Non-linear programming via penalty functions. Management Science. 1967;13(5):344–358.
- 7. Hestenes MR. Optimization Theory: The Finite Dimensional Case. New York: Wiley-Interscience [John Wiley & Sons]; 1975. (Pure and Applied Mathematics).
- 8. Zhou H, Lange K. A path algorithm for constrained estimation. Journal of Computational and Graphical Statistics. 2013;22:261–283. doi: 10.1080/10618600.2012.681248.
- 9. Cottle RW, Pang J-S, Stone RE. The Linear Complementarity Problem. Boston, MA: Academic Press Inc; 1992. (Computer Science and Scientific Computing).
- 10. Watson LT. Numerical linear algebra aspects of globally convergent homotopy methods. SIAM Rev. 1986;28(4):529–545.
- 11. Watson LT. Theory of globally convergent probability-one homotopies for nonlinear programming. SIAM J Optim. 2000;11(3):761–780.
- 12. Zangwill WI, Garcia CB. Pathways to Solutions, Fixed Points, and Equilibria. Prentice-Hall; 1981. (Prentice-Hall series in computational mathematics).
- 13. Zhou H, Wu Y. A generic path algorithm for regularized statistical estimation. J Amer Statist Assoc. 2014;109(506):686–699. doi: 10.1080/01621459.2013.864166.
- 14. Bertsekas DP. Convex Analysis and Optimization. Belmont, MA: Athena Scientific; 2003. With Angelia Nedić and Asuman E. Ozdaglar.
- 15. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32(2):407–499. With discussion, and a rejoinder by the authors.
- 16. Osborne MR, Presnell B, Turlach BA. A new approach to variable selection in least squares problems. IMA J Numer Anal. 2000;20(3):389–403.
- 17. Tibshirani RJ, Taylor J. The solution path of the generalized lasso. Ann Statist. 2011;39(3):1335–1371.
- 18. Lange K. Optimization. New York: Springer-Verlag; 2004. (Springer Texts in Statistics).
- 19. Magnus JR, Neudecker H. Matrix Differential Calculus with Applications in Statistics and Econometrics. Chichester: John Wiley & Sons Ltd; 1999. (Wiley Series in Probability and Statistics).
- 20. Lange K. Numerical Analysis for Statisticians. 2nd ed. New York: Springer; 2010. (Statistics and Computing).
- 21. Chi E, Lange K. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics. 2014. In press. doi: 10.1080/10618600.2014.948181.
- 22. Lawson CL, Hanson RJ. Solving Least Squares Problems. New ed. Society for Industrial Mathematics; 1987. (Classics in Applied Mathematics).
- 23. Dykstra RL. An algorithm for restricted least squares regression. J Amer Statist Assoc. 1983;78(384):837–842.
- 24. Deutsch F. Best Approximation in Inner Product Spaces. New York: Springer-Verlag; 2001. (CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 7).
- 25. Berry MW, Browne M, Langville AN, Pauca VP, Plemmons RJ. Algorithms and applications for approximate nonnegative matrix factorization. Comput Statist Data Anal. 2007;52(1):155–173.
- 26. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788–791. doi: 10.1038/44565.
- 27. Lee DD, Seung HS. Algorithms for non-negative matrix factorization. In: NIPS. MIT Press; 2001. pp. 556–562.
- 28. Boyd S, Kim S-J, Vandenberghe L, Hassibi A. A tutorial on geometric programming. Optim Eng. 2007;8(1):67–127.
- 29. Ecker JG. Geometric programming: methods, computations and applications. SIAM Rev. 1980;22(3):338–362.
- 30. Peressini AL, Sullivan FE, Uhl JJ Jr. The Mathematics of Nonlinear Programming. New York: Springer-Verlag; 1988. (Undergraduate Texts in Mathematics).
- 31. Peterson EL. Geometric programming. SIAM Rev. 1976;18(1):1–51.
- 32. Passy U, Wilde DJ. A geometric programming algorithm for solving chemical equilibrium problems. SIAM Journal on Applied Mathematics. 1968;16:363–373.
- 33. Boyd SP, Kim S-J, Patil DD, Horowitz MA. Digital circuit optimization via geometric programming. Operations Research. 2005;53:899–932.
- 34. Mazumdar M, Jefferson TR. Maximum likelihood estimates for multinomial probabilities via geometric programming. Biometrika. 1983;70(1):257–261.
- 35. Feigin PD, Passy U. The geometric programming dual to the extinction probability problem in simple branching processes. Ann Probab. 1981;9(3):498–503.
- 36. Lange K, Zhou H. MM algorithms for geometric and signomial programming. Mathematical Programming Series A. 2014;143:339–356. doi: 10.1007/s10107-012-0612-1.
- 37. Vandenberghe L, Boyd S. Semidefinite programming. SIAM Rev. 1996;38(1):49–95.
- 38. Rudin LI, Osher S, Fatemi E. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena. 1992;60(1–4):259–268.
- 39. Le T, Chartrand R, Asaki TJ. A variational approach to reconstructing images corrupted by Poisson noise. Journal of Mathematical Imaging and Vision. 2007;27:257–263.
- 40. Chambolle A. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision. 2004;20:89–97.
- 41. Goldstein T, Osher S. The split Bregman method for l1-regularized problems. SIAM J Img Sci. 2009;2:323–343.
- 42. Zhou H, Armagan A, Dunson D. Path following and empirical Bayes model selection for sparse regressions. 2012. arXiv:1201.3528.
- 43. Xiao W, Wu Y, Zhou H. ConvexLAR: an extension of least angle regression. Journal of Computational and Graphical Statistics. 2015. In press. doi: 10.1080/10618600.2014.962700.
- 44. Zhou H, Lange K. On the bumpy road to the dominant mode. Scandinavian Journal of Statistics. 2010;37(4):612–631. doi: 10.1111/j.1467-9469.2009.00681.x.