Abstract
Modern computational statistics is turning more and more to high-dimensional optimization to handle the deluge of big data. Once a model is formulated, its parameters can be estimated by optimization. Because model parsimony is important, models routinely include nondifferentiable penalty terms such as the lasso. This sober reality complicates minimization and maximization. Our broad survey stresses a few important principles in algorithm design. Rather than view these principles in isolation, it is more productive to mix and match them. A few well chosen examples illustrate this point. Algorithm derivation is also emphasized, and theory is downplayed, particularly the abstractions of the convex calculus. Thus, our survey should be useful and accessible to a broad audience.
Keywords: Block relaxation, Newton’s Method, MM algorithm, penalization, augmented Lagrangian, acceleration
Introduction
Modern statistics represents a confluence of data, algorithms, practical inference, and subject area knowledge. As data mining expands, computational statistics is assuming greater prominence. Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization. Penalties serve as priors and steer parameter estimates in realistic directions. In classical statistics estimation usually meant least squares and maximum likelihood with smooth objective functions. In a search for sparse representations, mathematical scientists have introduced nondifferentiable penalties such as the lasso and the nuclear norm. To survive in this alien terrain, statisticians are being forced to master exotic branches of mathematics such as convex calculus [39, 40]. Thus, the uneasy but productive relationship between statistics and mathematics continues, but in a different guise and mediated by new concerns.
The purpose of this survey article is to provide a few glimpses of the new optimization algorithms being crafted by computational statisticians and applied mathematicians. Although a survey of convex calculus for statisticians would certainly be helpful, our emphasis is more concrete. The truth of the matter is that a few broad categories of algorithms dominate. Furthermore, difficult problems require that several algorithmic pieces be assembled into a well coordinated whole. Put another way, from a handful of basic ideas, computational statisticians often weave a complex tapestry of algorithms that meets the needs of a specific problem. No algorithm category should be dismissed a priori in tackling a new problem. There is plenty of room for creativity and experimentation. Algorithms are made for tinkering. When one part fails or falters, it can be replaced by a faster or more robust part.
This survey will treat the following methods: (a) block descent, (b) steepest descent, (c) Newton’s method, quasi-Newton methods, and scoring, (d) the MM and EM algorithms, (e) penalized estimation, (f) the augmented Lagrangian method for constrained optimization, and (g) acceleration of fixed point algorithms. As we have mentioned, often the best algorithms combine several themes. We will illustrate the various themes by a sequence of examples. Although we avoid difficult theory and convergence proofs, we will try to point out along the way a few motivating ideas that stand behind most algorithms. For example, as its name indicates, steepest descent algorithms search along the direction of fastest decrease of the objective function. Newton’s method and its variants all rely on the notion of local quadratic approximation, thus correcting the often poor linear approximation of steepest descent. In high dimensions, Newton’s method stalls because it involves calculating and inverting large matrices of second derivatives.
The MM and EM algorithms replace the objective function by a simpler surrogate function. By design, optimizing the surrogate function sends the objective function downhill in minimization and uphill in maximization. In constructing the surrogate function for an EM algorithm, statisticians rely on notions of missing data. The more general MM algorithm calls on skills in inequalities and convex analysis. More often than not, concrete problems also involve parameter constraints. Modern penalty methods incorporate the constraints by imposing penalties on the objective function. A tuning parameter scales the strength of the penalties. In the classical penalty method, the constrained solution is recovered as the tuning parameter tends to infinity. In the augmented Lagrangian method, the constrained solution emerges for a finite value of the tuning parameter.
In the remaining sections, we adopt several notational conventions. Vectors and matrices appear in boldface type; for the most part parameters appear as Greek letters. The differential df(θ) of a scalar-valued function f(θ) equals its row vector of partial derivatives; the transpose ▿f(θ) of the differential is the gradient. The second differential d2f(θ) is the Hessian matrix of second partial derivatives. The Euclidean norm of a vector b and the spectral norm of a matrix A are denoted by ∥b∥ and ∥A∥, respectively. All other norms will be appropriately subscripted. The nth entry bn of a vector b must be distinguished from the nth vector bn in a sequence of vectors. To maintain consistency, bni denotes the ith entry of bn. A similar convention holds for sequences of matrices.
Block Descent
Block relaxation (either block descent or block ascent) divides the parameters into disjoint blocks and cycles through the blocks, updating only those parameters within the pertinent block at each stage of a cycle [21]. For the sake of brevity, we consider only block descent. In updating a block, we minimize the objective function over the block. Hence, block descent possesses the desirable descent property of always forcing the objective function downhill. When each block consists of a single parameter, block descent is called cyclic coordinate descent. The coordinate updates need not be explicit. In high-dimensional problems, implementation of one-dimensional Newton searches is often compatible with fast overall convergence. Block descent is best suited to unconstrained problems where the domain of the objective function reduces to a Cartesian product of the subdomains associated with the different blocks. Obviously, exact block updates are a huge advantage. Constraints can present insuperable barriers to coordinate descent because parameters get locked into place. In some problems it is advantageous to consider overlapping blocks.
Example 0.1. Nonnegative Least Squares
For a positive definite matrix A = (aij) and vector b = (bi), consider minimizing the quadratic function

f(θ) = (1/2)θtAθ + btθ
subject to the constraints θi ≥ 0 for all i. In the case of least squares, A = XtX and b = −Xty for some design matrix X and response vector y. Equating the partial derivative of f(θ) with respect to θi to 0 gives

aiiθi + ∑j≠i aijθj + bi = 0.
Rearrangement now yields the unrestricted minimum

θi = −(bi + ∑j≠i aijθj)/aii.
Taking into account the nonnegativity constraint, this must be amended to

θn+1,i = max{0, −(bi + ∑j<i aijθn+1,j + ∑j>i aijθnj)/aii}
at stage n + 1 to construct the coordinate descent update of θi.
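To make the update concrete, here is a minimal Python sketch of cyclic coordinate descent for this problem. The function name, iteration cap, and random test problem are our own illustrative choices.

```python
import numpy as np

def nnls_coordinate_descent(A, b, iters=500):
    """Cyclic coordinate descent for min (1/2) theta'A theta + b'theta
    subject to theta >= 0, assuming A is positive definite."""
    p = len(b)
    theta = np.zeros(p)
    for _ in range(iters):
        for i in range(p):
            # contribution of all terms except a_ii * theta_i
            rest = b[i] + A[i] @ theta - A[i, i] * theta[i]
            # unrestricted minimum, clipped at the nonnegativity boundary
            theta[i] = max(0.0, -rest / A[i, i])
    return theta

# least squares instance: A = X'X and b = -X'y
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
y = X @ rng.uniform(size=50) + rng.standard_normal(100)
theta = nnls_coordinate_descent(X.T @ X, -X.T @ y)
```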
Example 0.2. Matrix Factorization by Alternating Least Squares
In the 1960s Kruskal [47] applied the method of alternating least squares to factorial ANOVA. Later the subject was taken up by de Leeuw and colleagues [32]. Suppose U is an m × q matrix whose columns u1, … , uq represent data vectors. In many applications it is reasonable to postulate a reduced number of prototypes v1, … , vp and write

uj ≈ ∑k wkjvk
for certain nonnegative weights wkj. The matrix W = (wkj) is p × q. If p is small compared to q, then the representation U ≈ VW compresses the data for easier storage and retrieval. Depending on the circumstances, one may want to add further constraints [24]. For instance, if the entries of U are nonnegative, then it is often reasonable to demand that the entries of V be nonnegative as well [55, 68]. If we want each uj to equal a convex combination of the prototypes, then constraining the column sums of W to equal 1 is indicated.
One way of estimating V and W is to minimize the squared Frobenius norm

f(V, W) = ∥U − VW∥F2.
No explicit solution is known, but alternating least squares offers an iterative attack. If W is fixed, then we can update the ith row of V by minimizing the sum of squares

∑j (uij − ∑k vikwkj)2.
Similarly, if V is fixed, then we can update the jth column of W by minimizing the sum of squares

∑i (uij − ∑k vikwkj)2 = ∥uj − Vwj∥2.
Thus, block descent solves a sequence of least squares problems, some of which are constrained.
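The sketch below implements the unconstrained version of these alternating updates in Python; each lstsq call performs one block update, and a constrained variant would substitute a solver such as the coordinate descent of Example 0.1. The interface is our own.

```python
import numpy as np

def als_factorize(U, p, iters=200, seed=0):
    """Alternating least squares for U ~ V W with V m x p and W p x q."""
    rng = np.random.default_rng(seed)
    V = rng.uniform(size=(U.shape[0], p))
    W = None
    for _ in range(iters):
        # update W with V fixed: column j solves min ||u_j - V w_j||
        W = np.linalg.lstsq(V, U, rcond=None)[0]
        # update V with W fixed: row i of V solves the transposed problem
        V = np.linalg.lstsq(W.T, U.T, rcond=None)[0].T
    return V, W
```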
Steepest Descent
The first-order Taylor expansion

f(θ + sγ) = f(θ) + s df(θ)γ + o(s)
of a differentiable function f(θ) around θ motivates the method of steepest descent. In view of the Cauchy-Schwarz inequality, the choice

γ = −▿f(θ)/∥▿f(θ)∥
minimizes the linear term df(θ)γ of the expansion over the sphere of unit vectors. Of course, if ▿f(θ) = 0, then θ is a stationary point. The steepest descent algorithm iterates according to

θn+1 = θn − s▿f(θn)  (1)
for some scalar s > 0. If s is sufficiently small, then the descent property f(θn+1) < f(θn) holds. The most sophisticated version of the algorithm determines s by searching for the minimum of the objective function along the direction of steepest descent. Among the many methods of line search, the methods of false position, cubic interpolation, and golden section stand out [53]. These are all local search methods, and unless some guarantee of convexity exists, confusion of local and global minima can occur.
The method of steepest descent often exhibits zigzagging and a painfully slow rate of convergence. For these reasons it was largely replaced in practice by Newton’s method and its variants. However, the sheer scale of modern optimization problems has led to a re-evaluation. The avoidance of second derivatives and Hessian approximations is now viewed as a virtue. Furthermore, the method has been generalized to nondifferentiable problems by substituting the forward directional derivative

dνf(θ) = lim s↓0 [f(θ + sν) − f(θ)]/s
for the gradient [84]. Here the idea is to choose a unit search vector ν to minimize dνf(θ). In some instances this secondary problem can be attacked by linear programming. For a convex problem, the condition dνf(θ) ≥ 0 for all ν is both necessary and sufficient for θ to be a minimum point. If the domain of f(θ) equals a convex set C, then only tangent directions ν = μ−θ with μ ∈ C come into play.
Steepest descent also has a role to play in constrained optimization. Suppose we want to minimize f(θ) subject to the constraint θ ∈ C for some closed convex set. The projected gradient method capitalizes on the steepest descent update (1) by projecting it onto the set C [35, 56, 79]. It is well known that for a point x external to C, there is a closest point PC(x) to x in C. Explicit formulas for the projection operator PC(x) exist when C is a box, Euclidean ball, hyperplane, or halfspace. Fast algorithms for computing PC(x) exist for the unit simplex, the l1 ball, and the cone of positive semidefinite matrices [27, 62].
Choice of the scalar s in the update (1) is crucial. Current theory suggests taking s to equal r/L, where L is a Lipschitz constant for the gradient ▿f(θ) and r belongs to the interval (0, 2). In particular, the Lipschitz inequality

∥▿f(θ) − ▿f(γ)∥ ≤ L∥θ − γ∥
is valid for L = supθ ∥d2f(θ)∥, whenever this quantity is finite. In practice, the Lipschitz constant L must be estimated. Any induced matrix norm ∥ · ∥† can be substituted for the spectral norm ∥ · ∥ in the defining supremum and will give an upper bound on L.
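As a concrete illustration, the following Python sketch applies the projected gradient method to the nonnegative least squares problem of Example 0.1, with step size s = r/L and projection onto the nonnegative orthant. The defaults are our own choices.

```python
import numpy as np

def projected_gradient_nnls(X, y, r=1.75, iters=500):
    """Projected gradient for min (1/2)||y - X theta||^2 with theta >= 0."""
    A = X.T @ X
    L = np.linalg.norm(A, 2)      # Lipschitz constant: spectral norm of X'X
    s = r / L
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = A @ theta - X.T @ y
        # steepest descent step followed by projection onto the orthant
        theta = np.maximum(0.0, theta - s * grad)
    return theta
```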
Example 0.3. Coordinate Descent versus the Projected Gradient Method
As a test problem, we generated a random 100 × 50 design matrix X with i.i.d. standard normal entries, a random 50 × 1 parameter vector θ with i.i.d. uniform [0,1] entries, and a random 100 × 1 error vector e with i.i.d. standard normal entries. In this setting the response y = Xθ + e. We then compared coordinate descent, the projected gradient method (for L equal to the spectral radius of XtX and r equal to 1.0, 1.75, and 2.0), and the MM algorithm explained later in Example 0.6. All computer runs start from the common point θ0 whose entries are filled with i.i.d. uniform [0,1] random deviates. Figure 1 plots the progress of each algorithm as measured by the relative difference
[f(θn) − f(θ∞)]/[f(θ∞) + 1]  (2)
between the loss at the current iteration and the ultimate loss at convergence. It is interesting how well coordinate descent performs compared to projected gradient descent. The slower convergence of the MM algorithm is probably a consequence of the fact that its multiplicative updates slow down as they approach the 0 boundary. Note also the importance of choosing a good step size in the projected gradient algorithm. Inflated steps accelerate convergence, but excessively inflated steps hamper it.
Figure 1.
Comparing the rate of convergence of three algorithms on a nonnegative least squares problem. CD = coordinate descent, PG = projected gradient, and MM = majorize-minimize.
Variations on Newton’s Method
The primary advantage of Newton’s method is its speed of convergence in low-dimensional problems. Its many variants seek to retain its fast convergence while taming its defects. The variants all revolve around the core idea of locally approximating the objective function by a strictly convex quadratic. At each iteration the quadratic approximation is optimized subject to safeguards that keep the iterates from overshooting and veering toward irrelevant stationary points.
Consider minimizing the real-valued function f(θ) defined on an open set S ⊂ Rp. Assuming that f(θ) is twice differentiable, we have the second order Taylor expansion

f(γ) = f(θ) + df(θ)(γ − θ) + (1/2)(γ − θ)td2f(α)(γ − θ)
for some α on the line segment [θ, γ]. This expansion suggests that we substitute d2f(θ) for d2f(α) and approximate f(γ) by the resulting quadratic. If we take this approximation seriously, then we can solve for its minimum point γ as

γ = θ − d2f(θ)−1▿f(θ).
In Newton’s method we iterate according to

θn+1 = θn − s d2f(θn)−1▿f(θn)  (3)
for step length constant s with default value 1. Any stationary point of f(θ) is a fixed point of Newton’s method.
There is nothing to prevent Newton’s method from heading uphill rather than downhill. The first order expansion

f(θn+1) = f(θn) − s df(θn)d2f(θn)−1▿f(θn) + o(s)
makes it clear that the descent property holds provided s > 0 is small enough and the Hessian matrix d2f(θn) is positive definite. When d2f(θn) is not positive definite, it is usually replaced by a positive definite approximation Hn in the update (3).
Backtracking is crucial to avoid overshooting. In the step-halving version of backtracking, one starts with s = 1. If the descent property holds, then one takes the Newton step. Otherwise, s/2 is substituted for s, θn+1 is recalculated, and the descent property is rechecked. Eventually, a small enough s is generated to guarantee f(θn+1) < f(θn).
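The following sketch renders Newton’s method with step halving in Python. The test function, tolerance, and names are our own, and the supplied Hessian is assumed positive definite.

```python
import numpy as np

def newton_step_halving(f, grad, hess, theta, iters=50):
    """Newton's method safeguarded by step halving (a sketch)."""
    for _ in range(iters):
        direction = np.linalg.solve(hess(theta), grad(theta))
        s = 1.0
        candidate = theta - s * direction
        # halve s until the descent property f(candidate) < f(theta) holds
        while f(candidate) >= f(theta) and s > 1e-10:
            s /= 2.0
            candidate = theta - s * direction
        theta = candidate
    return theta

# illustration on a smooth strictly convex function
f = lambda t: np.log(np.cosh(t[0])) + t[1] ** 2
grad = lambda t: np.array([np.tanh(t[0]), 2.0 * t[1]])
hess = lambda t: np.diag([1.0 / np.cosh(t[0]) ** 2, 2.0])
print(newton_step_halving(f, grad, hess, np.array([2.0, 1.0])))
```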
In the next two examples we adopt standard statistical language. The outcome of a statistical experiment is summarized by a loglikelihood L(θ). Its gradient ▿L(θ) is called the score, and its second differential d2L(θ), after a change in sign, is called the observed information. In maximum likelihood estimation, one maximizes L(θ) with respect to the parameter vector θ.
Example 0.4. Newton’s Method for Binomial Regression
Consider binomial regression with m independent responses y1, … , ym. Each yi represents a count between 0 and ki with success probability πi(θ) per trial. The loglikelihood, score, and observed information amount to

L(θ) = ∑i {yi ln πi(θ) + (ki − yi) ln[1 − πi(θ)]}
▿L(θ) = ∑i {yi/πi(θ) − (ki − yi)/[1 − πi(θ)]}▿πi(θ)
−d2L(θ) = ∑i {yi/πi(θ)2 + (ki − yi)/[1 − πi(θ)]2}▿πi(θ)▿πi(θ)t − ∑i {yi/πi(θ) − (ki − yi)/[1 − πi(θ)]}d2πi(θ)

up to an additive constant in L(θ).
Because E(yi) = kiπi(θ), the observed information can be approximated by

−d2L(θ) ≈ ∑i {yi/πi(θ)2 + (ki − yi)/[1 − πi(θ)]2}▿πi(θ)▿πi(θ)t ≈ ∑i ki/{πi(θ)[1 − πi(θ)]}▿πi(θ)▿πi(θ)t.
Because we seek to maximize rather than minimize L(θ), we want −d2L(θ) to be positive definite. Fortunately, both approximations fulfill this requirement. The second approximation leads to the scoring algorithm discussed later.
Example 0.5. Poisson Multigraph Model
In a graph the number of edges between any two nodes is 0 or 1. A multigraph allows an arbitrary number of edges between any two nodes. Multigraphs are natural structures for modeling the internet and gene and protein networks. Here we consider a multigraph with a random number of edges Xij connecting every pair of nodes {i, j}. In particular, we assume that the Xij are independent Poisson random variables with means μij. As a plausible model for ranking nodes, we take μij = θiθj, where θi and θj are nonnegative propensities [72]. The loglikelihood of the observed edge counts xij = xji amounts to

L(θ) = ∑{i,j} [xij ln(θiθj) − θiθj]

up to an additive constant.
The score vector has entries

∂L(θ)/∂θi = ∑j≠i (xij/θi − θj),
and the observed information matrix has entries

−∂2L(θ)/∂θi2 = ∑j≠i xij/θi2 and −∂2L(θ)/∂θi∂θj = 1 for j ≠ i.
For p nodes the matrix −d2L(θ) is p × p, and inverting it seems out of the question when p is large. Fortunately, the Sherman-Morrison formula comes to the rescue. If we write −d2L(θ) as D + 11t with D diagonal, then the explicit inverse

(D + 11t)−1 = D−1 − (1 + 1tD−11)−1D−111tD−1
is available. This makes Newton’s method trivial to implement as long as one respects the bounds θi ≥ 0. More generally, it is always cheap to invert a low-rank perturbation of an explicitly invertible matrix.
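As a small illustration of the last point, this sketch solves a system (D + 11t)x = g in O(p) operations via the Sherman-Morrison formula; with g equal to the score vector, the returned vector is the Newton increment for the multigraph model. The function name is ours.

```python
import numpy as np

def sherman_morrison_solve(d, g):
    """Solve (D + 11')x = g with D = diag(d), in O(p) time."""
    Dinv_g = g / d
    Dinv_1 = 1.0 / d
    # (D + 11')^{-1} = D^{-1} - D^{-1}11'D^{-1} / (1 + 1'D^{-1}1)
    return Dinv_g - Dinv_1 * (Dinv_g.sum() / (1.0 + Dinv_1.sum()))
```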
In maximum likelihood estimation, the method of steepest ascent replaces the observed information matrix −d2L(θ) by the identity matrix I. Fisher’s scoring algorithm makes the far more effective choice [67] of replacing the observed information matrix by the expected information matrix J(θ) = E[−d2L(θ)]. The alternative representation J(θ) = Var[▿L(θ)] of J(θ) as a variance matrix demonstrates that it is positive semidefinite. Usually it is positive definite as well and serves as an excellent substitute for −d2L(θ) in Newton’s method. The inverse matrices J(θ)−1 and [−d2L(θ)]−1, evaluated at the maximum likelihood estimate, immediately supply the asymptotic variances and covariances of the maximum likelihood estimate [73].
The score and expected information simplify considerably for exponential families of densities [8, 11, 36, 44, 63]. Recall that the density of a vector random variable Y from an exponential family can be written as

f(y ∣ θ) = g(y)exp[β(θ) + γ(θ)th(y)]  (4)
relative to some measure ν [25, 73]. The function h(y) in equation (4) is the sufficient statistic. The maximum likelihood estimate of the parameter vector θ depends on an observation y only through h(y). Predictors of y are incorporated into the functions β(θ) and γ(θ). If γ(θ) is linear in θ, then J(θ) = −d2L(θ) = −d2β(θ), and scoring coincides with Newton’s method. If in addition J(θ) is positive definite, then L(θ) is strictly concave and possesses at most a single local maximum, which is necessarily the global maximum.
Both the score vector and expected information matrix can be expressed succinctly in terms of the mean vector μ(θ) = E[h(y)] and the variance matrix Σ(θ) = Var[h(y)] of the sufficient statistic. Standard arguments show that

▿L(θ) = dγ(θ)t[h(y) − μ(θ)] and J(θ) = dγ(θ)tΣ(θ)dγ(θ).
These formulas have had an enormous impact on nonlinear regression and fitting generalized linear models. Applied statistics as we know it would be nearly impossible without them. Implementation of scoring is almost always safeguarded by step halving and upgraded to handle linear constraints and parameter bounds. The notion of quadratic approximation is still the key, but each step of constrained scoring must solve a quadratic program.
In parallel with developments in statistics, numerical analysts sought substitutes for Newton’s method. Their efforts a generation ago focused on quasi-Newton methods for generic smooth functions [23, 65]. Once again the core idea was successive quadratic approximation. A good quasi-Newton method: (a) minimizes a quadratic function f(θ) from Rp to R in p steps, (b) avoids evaluation of d2f(θ), (c) adapts readily to simple parameter constraints, and (d) exploits inexact line searches.
Quasi-Newton methods update the current approximation Hn to the second differential d2f(θ) of an objective function f(θ) by a rank-one or rank-two perturbation satisfying a secant condition. The secant condition captures the first-order Taylor approximation

▿f(θn+1) − ▿f(θn) ≈ d2f(θn+1)(θn+1 − θn).
If we define the gradient and argument differences

gn = ▿f(θn+1) − ▿f(θn) and dn = θn+1 − θn,
then the secant condition reads Hn+1dn = gn. Davidon [19] discovered that the unique symmetric rank-one update to Hn satisfying the secant condition is

Hn+1 = Hn + cnvnvnt,
where the constant cn and the vector vn are determined by

cn = −[(Hndn − gn)tdn]−1 and vn = Hndn − gn.
When the inner product (Hndn − gn)tdn is too close to 0, there are two possibilities. Either the secant adjustment is ignored, and the value Hn is retained for Hn+1, or one resorts to a trust region strategy [65].
In the trust region method, one minimizes the quadratic approximation to f(θ) subject to the spherical constraint ∥θ − θn∥2 ≤ r2 for a fixed radius r. This constrained optimization problem has a solution regardless of whether Hn is positive definite. Working within a trust region prevents absurdly large steps in the early stages of minimization. With appropriate safeguards, some numerical analysts [18, 45] consider Davidon’s rank-one update superior to the widely used BFGS update, named after Broyden, Fletcher, Goldfarb, and Shanno. This rank-two perturbation is guaranteed to maintain positive definiteness and is better understood theoretically than the symmetric rank-one update. Also of interest is the DFP (Davidon, Fletcher, and Powell) rank-two update, which applies to the inverse of Hn. Although the DFP update ostensibly avoids matrix inversion, the consensus is that the BFGS update is superior to it in numerical practice [23].
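In code, Davidon’s rank-one update is a one-line formula plus the safeguard just described. The tolerance convention below is our own.

```python
import numpy as np

def sr1_update(H, d, g, tol=1e-8):
    """Davidon's symmetric rank-one update, a minimal sketch.

    H approximates the Hessian, d = theta_{n+1} - theta_n, and
    g = grad f(theta_{n+1}) - grad f(theta_n)."""
    v = H @ d - g
    denom = v @ d
    # skip the secant adjustment when the denominator is dangerously small
    if abs(denom) <= tol * np.linalg.norm(v) * np.linalg.norm(d):
        return H
    return H - np.outer(v, v) / denom
```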
The MM and EM Algorithms
The numerical analysts Ortega and Rheinboldt [66] first articulated the MM principle; de Leeuw [20] saw its potential and created the first MM algorithm. The MM algorithm currently enjoys its greatest vogue in computational statistics [41, 54, 90]. The basic idea is to convert a hard optimization problem into a sequence of simpler ones. In minimization the MM principle majorizes the objective function f(θ) by a surrogate function g(θ ∣ θn) anchored at the current point θn. Majorization combines the tangency condition g(θn ∣ θn) = f(θn) and the domination condition g(θ ∣ θn) ≥ f(θ) for all θ. The next iterate of the MM algorithm is defined to minimize g(θ ∣ θn). Because

f(θn+1) ≤ g(θn+1 ∣ θn) ≤ g(θn ∣ θn) = f(θn),
the MM iterates generate a descent algorithm driving the objective function downhill. Strictly speaking, the descent property depends only on decreasing g(θ ∣ θn), not on minimizing it. Constraint satisfaction is automatically enforced in finding θn+1. Under appropriate regularity conditions, an MM algorithm is guaranteed to converge to a local minimum of the objective function [52]. In maximization, we first minorize and then maximize. Thus, the acronym MM does double duty in the forms majorize-minimize and minorize-maximize.
When it is successful, the MM algorithm simplifies optimization by: (a) separating the variables of a problem, (b) avoiding large matrix inversions, (c) linearizing a problem, (d) restoring symmetry, (e) dealing with equality and inequality constraints gracefully, and (f) turning a nondifferentiable problem into a smooth problem. The art in devising an MM algorithm lies in choosing a tractable surrogate function g(θ ∣ θn) that hugs the objective function f(θ) as tightly as possible.
The majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function. These rules allow one to work piecemeal in simplifying complicated objective functions. Skill in dealing with inequalities is crucial in constructing majorizations. Classical inequalities such as Jensen’s inequality, the information inequality, the arithmetic-geometric mean inequality, and the Cauchy-Schwarz inequality prove useful in many problems. The supporting hyperplane property of a convex function and the quadratic upper bound principle of Böhning and Lindsay [5] also find wide application.
Example 0.6. An MM Algorithm for Nonnegative Least Squares
Sha et al [81] devised an MM algorithm for Example 0.1. They retain the diagonal terms as presented. The off-diagonal terms aijθiθj they majorize according to the sign of the coefficient aij. When the sign of aij is positive, they apply the majorization

θiθj ≤ (1/2)[(θnj/θni)θi2 + (θni/θnj)θj2],
which is just a rearrangement of the inequality

xy ≤ (1/2)[(yn/xn)x2 + (xn/yn)y2]
with equality when x = xn and y = yn. When the sign of aij is negative, they apply the majorization

θiθj ≥ θniθnj[1 + ln(θiθj) − ln(θniθnj)],
which is just a rearrangement of the simple inequality z ≥ 1 + ln z with z = xy/(xnyn). The value z = 1 gives equality in the inequality. Both majorizations separate parameters and allow one to minimize the surrogate function parameter by parameter. Indeed, if we define matrices A+ and A− with entries max{aij, 0} and −min{aij, 0}, respectively, then the resulting MM algorithm iterates according to

θn+1,i = θni[−bi + √(bi2 + 4(A+θn)i(A−θn)i)]/[2(A+θn)i].
All entries of the initial point θ0 should be positive; otherwise, the MM algorithm stalls. The updates occur in parallel. In contrast, the cyclic coordinate descent updates are sequential. Figure 1 depicts the progress of the MM algorithm on our nonnegative least squares problem.
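A compact implementation of the parallel multiplicative updates follows; the function name and iteration cap are our own choices, and a strictly positive starting vector is assumed.

```python
import numpy as np

def mm_nnls(A, b, iters=1000):
    """Multiplicative MM updates for min (1/2) theta'A theta + b'theta
    subject to theta >= 0 (a sketch following the derivation above)."""
    Aplus = np.maximum(A, 0.0)     # entries max{a_ij, 0}
    Aminus = np.maximum(-A, 0.0)   # entries -min{a_ij, 0}
    theta = np.ones(len(b))        # positive start keeps iterates positive
    for _ in range(iters):
        up = Aplus @ theta
        down = Aminus @ theta
        # all coordinates update in parallel
        theta = theta * (-b + np.sqrt(b ** 2 + 4.0 * up * down)) / (2.0 * up)
    return theta
```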
Example 0.7. Locating a Gunshot
Locating the time and place of a gunshot is a typical global positioning problem [82]. In a certain city m sensors located at the points x1, … , xm are installed. A signal, say a gunshot sound, is sent from an unknown location θ at unknown time α and known speed s and arrives at location j at time yj observed with random measurement error. The problem is to estimate the vector θ and the scalar α from the observed data y1, … , ym. Other problems of this nature include pinpointing the epicenter of an earthquake and the detonation point of a nuclear explosion. This estimation problem can be attacked by a combination of block descent and the MM principle.
If we assume Gaussian random errors, then maximum likelihood estimation reduces to minimizing the criterion

f(θ, α) = ∑j [yj − α − s−1∥θ − xj∥]2 = s−2 ∑j [syj − sα − ∥θ − xj∥]2.
The equivalence of the two representations of f(θ, α) shows that it suffices to solve the problem with speed s = 1. In the remaining discussion we make this assumption. For fixed θ estimation of α reduces to a least squares problem with the obvious solution

α̂ = (1/m) ∑j [yj − ∥θ − xj∥].
To update θ with α fixed, we rewrite f(θ, α) as

f(θ, α) = ∑j [(yj − α)2 − 2(yj − α)∥θ − xj∥ + ∥θ − xj∥2].
The middle terms −2(yj − α)∥θ − xj∥ are awkward to deal with in minimization. Depending on the sign of the coefficient −2(yj − α), we majorize them in two different ways. If the sign is negative, then we employ the Cauchy-Schwarz majorization

−∥θ − xj∥ ≤ −(θ − xj)t(θn − xj)/∥θn − xj∥.
If the sign is positive, then we employ the more subtle majorization

∥θ − xj∥ ≤ [∥θ − xj∥2 + ∥θn − xj∥2]/(2∥θn − xj∥).
To derive this second majorization, note that √u is a concave function on (0, ∞). It therefore satisfies the dominating hyperplane inequality

√u ≤ √un + (u − un)/(2√un).
Now substitute ∥θ − xj∥2 for u. These maneuvers separate parameters and reduce the surrogate to a sum of linear terms and squared Euclidean norms. The minimization of the surrogate yields the MM update

θn+1 = [∑j (1 + cnj)]−1 ∑j [(1 + cnj)xj + max{yj − α, 0}(θn − xj)/∥θn − xj∥],  cnj = max{α − yj, 0}/∥θn − xj∥,
of θ for α fixed. The condition α > yj in this update is usually vacuous. By design f(θ, α) decreases after each cycle of updating α and θ.
The celebrated expectation-maximization (EM) algorithm is one of the most potent optimization tools in the statistician’s toolkit [22, 59]. The E step in the EM algorithm creates a surrogate function, the Q function in the literature, that minorizes the loglikelihood. Thus, every EM algorithm is an MM algorithm. If y is the observed data and x is the complete data, then the Q function is defined as the conditional expectation

Q(θ ∣ θn) = E[ln f(X ∣ θ) ∣ Y = y, θn],
where f(x ∣ θ) denotes the complete data likelihood, upper case letters indicate random vectors, and lower case letters indicate corresponding realizations of these random vectors. In the M step of the EM algorithm, one calculates the next iterate θn+1 by maximizing Q(θ ∣ θn) with respect to θ.
Example 0.8. MM versus EM for the Dirichlet-Multinomial Distribution
When multivariate count data exhibit over-dispersion, the Dirichlet-multinomial distribution is preferred to the multinomial distribution. In the Dirichlet-multinomial model, the multinomial probabilities p = (p1, … , pd) follow a Dirichlet distribution with parameter vector α = (α1, … , αd) having positive components. For a multivariate count vector x = (x1, … , xd) with batch size ∣x∣ = x1 + ⋯ + xd, the probability mass function is accordingly

f(x ∣ α) = (∣x∣ choose x) ∫Δd ∏j pj^xj · Γ(∣α∣)/[Γ(α1)⋯Γ(αd)] ∏j pj^(αj−1) dp
        = (∣x∣ choose x) Γ(∣α∣)/Γ(∣α∣ + ∣x∣) ∏j Γ(αj + xj)/Γ(αj)
        = (∣x∣ choose x) ∏j (αj)xj/(∣α∣)∣x∣,  (5)
where Δd is the unit simplex in d dimensions, ∣α∣ equals ∑j αj, and (a)k = a(a + 1)⋯(a + k − 1) denotes a rising factorial. The last equality in (5) follows from the factorial property Γ(a + 1)/Γ(a) = a of the gamma function. Given independent data points x1, … , xm, the loglikelihood is

L(α) = ∑i [∑j ∑0≤k<xij ln(αj + k) − ∑0≤k<∣xi∣ ln(∣α∣ + k)]

up to an additive constant.
The lack of concavity of L(α) may cause instability in Newton’s method when it is started far from the optimal point. Fisher’s scoring algorithm is computationally prohibitive because calculation of the expected information matrix involves numerous evaluations of beta-binomial tail probabilities. The ascent property makes EM and MM algorithms attractive.
In deriving an EM algorithm, we treat the unobserved multinomial probabilities pj in each case as missing data. The complete data likelihood is then the integrand in the integral (5). A straightforward calculation shows that p possesses a posterior Dirichlet distribution with parameters α1 + xi1 through αd + xid for case i. If we now differentiate the identity

∫Δd Γ(∣α∣)/[Γ(α1)⋯Γ(αd)] ∏j pj^(αj−1) dp = 1
with respect to αj, then the identity

E(ln pj ∣ α) = Ψ(αj) − Ψ(∣α∣)
emerges, where Ψ(z) = Γ’(z)/Γ(z) is the digamma function. It follows that up to an irrelevant additive constant the surrogate function is

Q(α ∣ αn) = ∑i {∑j (αj − 1)[Ψ(αnj + xij) − Ψ(∣αn∣ + ∣xi∣)] + ln Γ(∣α∣) − ∑j ln Γ(αj)}.
Maximizing Q(α ∣ αn) is non-trivial because it involves special functions and intertwining of the αj parameters.
Directly invoking the MM principle produces a more malleable surrogate function. Consider the logarithm of the third form of the likelihood function (5). Applying Jensen’s inequality to ln(αj + k) gives

ln(αj + k) ≥ [αnj/(αnj + k)] ln αj + cnjk

for a constant cnjk that does not depend on α.
Likewise, applying the supporting hyperplane inequality to −ln(∣α∣ + k) gives

−ln(∣α∣ + k) ≥ −ln(∣αn∣ + k) − (∣α∣ − ∣αn∣)/(∣αn∣ + k).
Overall, these minorizations yield the surrogate function

g(α ∣ αn) = ∑i [∑j ∑0≤k<xij [αnj/(αnj + k)] ln αj − ∑0≤k<∣xi∣ ∣α∣/(∣αn∣ + k)] + constant,
which completely separates the parameters αj. This suggests the simple MM updates

αn+1,j = [∑i ∑0≤k<xij αnj/(αnj + k)] / [∑i ∑0≤k<∣xi∣ (∣αn∣ + k)−1].
The positivity constraints are always satisfied when all initial values α0j > 0. Parameter separation can be achieved in the EM algorithm by a further minorization of the lnΓ(∣α∣) term in Q(α ∣ αn). This action yields a viable EM-MM hybrid algorithm. The reference [92] contains more details and a comparison of the convergence rates of the three algorithms.
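For concreteness, here is a direct Python rendering of the MM updates derived above; the loops mirror the double sums, and the starting values, iteration cap, and names are our own choices.

```python
import numpy as np

def dirichlet_multinomial_mm(X, iters=500):
    """MM updates for Dirichlet-multinomial maximum likelihood
    (a sketch of the separated surrogate derived above)."""
    m, d = X.shape
    batch = X.sum(axis=1)
    alpha = np.ones(d)                  # positive starting values
    for _ in range(iters):
        asum = alpha.sum()
        num = np.zeros(d)
        for i in range(m):
            for j in range(d):
                # sum over 0 <= k < x_ij of alpha_j / (alpha_j + k)
                k = np.arange(X[i, j])
                num[j] += np.sum(alpha[j] / (alpha[j] + k))
        # sum over cases of sum over 0 <= k < |x_i| of 1 / (|alpha| + k)
        den = sum(np.sum(1.0 / (asum + np.arange(b))) for b in batch)
        alpha = num / den
    return alpha
```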
Finally, let us mention various strategies for handling exceptional cases. In the MM algorithm it may be impossible to optimize the surrogate function g(θ ∣ θn) explicitly. There are two obvious remedies. One is to institute some form of block relaxation in updating g(θ ∣ θn) [61]. There is no need to iterate to convergence since the purpose is merely to improve g(θ ∣ θn) and hence the objective function f(θ). Another obvious remedy is to optimize the surrogate function by Newton’s method. It turns out that a single step of Newton’s method suffices to preserve the local rate of convergence of the MM algorithm [50]. The ascent property is sacrificed initially, but it kicks in as one approaches the optimal point. In an unconstrained problem this variant MM algorithm can be phrased as

θn+1 = θn − d2g(θn ∣ θn)−1▿f(θn),
where the substitution of ▿f(θn) for ▿g(θn ∣ θn) is justified by the tangency and domination conditions satisfied by g(θ ∣ θn) and f(θ).
A more pressing concern in the EM algorithm is intractability of the E step. If f(X ∣ θ) denotes the complete data likelihood, then in the stochastic EM algorithm [43, 75, 87] one estimates the surrogate function by a Monte Carlo average

Q̂(θ ∣ θn) = (1/m) ∑1≤i≤m ln f(xi ∣ θ)  (6)
over realizations xi of the complete data X conditional on the observed data Y = y and the current parameter iterate θn. Sampling can be done by rejection sampling, importance sampling, Markov chain Monte Carlo, or quasi-Monte Carlo. The next iterate θn+1 should maximize the average (6). The sample size m should increase as the iteration count n increases. Determining the rate of increase of m and setting a reasonable convergence criterion are both subtle issues. The ascent property of the EM algorithm fails because of the inherent sampling noise. The combination of slow convergence and Monte Carlo sampling makes the stochastic EM algorithm unattractive in large-scale problems. In smaller problems it fills a useful niche.
The stochastic EM algorithm generalizes the Robbins-Monro algorithm [76] for root finding and the Kiefer-Wolfowitz algorithm [46] for function maximization. In unconstrained maximum likelihood estimation, one seeks a root of the likelihood equation, so both methods are relevant. Under suitable assumptions, the Kiefer-Wolfowitz algorithm converges to a local maximum almost surely. Since this cluster of topics is tangential to our overall emphasis on deterministic methods of optimization, we refer readers to the books [13, 49, 75] for a fuller discussion.
Penalization
Penalization is a device for imposing parsimony. For purposes of illustration, we discuss two penalized estimation problems of considerable utility in applied statistics. Both of these examples generate convex programs with nondifferentiable objective functions. In the interests of accessibility, we will derive estimation algorithms for both problems without invoking the machinery of convex analysis.
Example 0.9. Lasso Penalized Regression
Lasso penalized regression has been pursued for a long time in many application areas [14, 16, 26, 80, 83, 85]. Modern versions consider a generalized linear model where yi is the response for case i, xij is the value of predictor j for case i, and θj is the regression coefficient corresponding to predictor j. When the number of predictors p exceeds the number of cases m, θ cannot be uniquely estimated. In an era of big data, this quandary is fairly common. One remedy is to perform model selection by imposing a lasso penalty on the loss function l(θ). In least squares estimation

l(θ) = (1/2) ∑i (yi − ∑j xijθj)2.
For a generalized linear model [69], l(θ) is the negative loglikelihood of the data. Lasso penalized estimation minimizes the criterion

f(θ) = l(θ) + ρ ∑j wj∣θj∣,
where the nonnegative weights wj and the tuning constant ρ > 0 are given. If θj is the intercept for the model, then its weight wj is usually set to 0. For the remaining predictors the choice wj = 1 is reasonable provided the predictors are standardized to have mean 0 and variance 1. To improve the asymptotic properties of the lasso estimates, the adaptive lasso [95] defines the weights wj = 1/∣θ̂j∣ for any consistent estimate θ̂j of θj. In a Bayesian context, imposing a lasso penalty is equivalent to placing a Laplace prior with mean 0 on each θj. The elastic net [96] adds a ridge penalty to the lasso penalty.
The primary difference between lasso and ridge regression is that the lasso penalty forces most parameters to 0 while the ridge penalty merely reduces them. Thus, the ridge penalty relaxes its grip too quickly for model selection. Unfortunately, the lasso penalty tends to select one predictor from a group of correlated predictors and ignore the others. The elastic net ameliorates this defect. To overcome severe shrinkage, many statisticians discard penalties after the conclusion of model selection and re-estimate the selected parameters. Cross-validation [37] and stability selection [60] are effective in choosing the penalty tuning constant and the selected predictors, respectively.
Coordinate descent works particularly well when only a few predictors enter a model [29, 89]. Consider what happens when we visit parameter θj and the loss function is the least squares criterion. If we define the amended responses rij = yi − ∑k≠j xikθk, then the problem reduces to minimizing

(1/2) ∑i (rij − xijθj)2 + ρwj∣θj∣.
Now divide the domain of θj into the two intervals (−∞, 0] and [0, ∞). On the right interval, elementary calculus suggests the update

θj = (∑i xijrij − ρwj)/∑i xij2.
This is invalid when it is negative and must be replaced by 0. Likewise, on the left interval, we have the update

θj = (∑i xijrij + ρwj)/∑i xij2
unless it is positive. On both intervals, shrinkage pulls the usual least squares estimate toward 0. In underdetermined problems with just a few relevant predictors, most parameters never budge from their starting values of 0. This circumstance plus the complete absence of matrix operations explains the speed of coordinate descent. It inherits its numerical stability from the descent property enjoyed by any coordinate descent algorithm.
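The following sketch assembles these coordinate updates into a lasso solver, maintaining the residual vector so that each coordinate visit costs O(m) operations. The two thresholding branches correspond to the two half intervals just described; the interface is our own.

```python
import numpy as np

def lasso_coordinate_descent(X, y, rho, w=None, iters=200):
    """Cyclic coordinate descent for
    (1/2)||y - X theta||^2 + rho * sum_j w_j |theta_j| (a sketch)."""
    m, p = X.shape
    w = np.ones(p) if w is None else w
    theta = np.zeros(p)
    r = y.copy()                              # residual y - X theta
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            rj = r + X[:, j] * theta[j]       # amended responses r_ij
            a = X[:, j] @ rj
            # updates from the two half intervals: soft thresholding
            if a > rho * w[j]:
                new = (a - rho * w[j]) / col_ss[j]
            elif a < -rho * w[j]:
                new = (a + rho * w[j]) / col_ss[j]
            else:
                new = 0.0                     # parameter parked at 0
            r += X[:, j] * (theta[j] - new)   # cheap residual update
            theta[j] = new
    return theta
```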
With a generalized linear model, say logistic regression, the same story plays out. Now, however, we must institute a line search for the minimum on each of the two half intervals. Newton’s method, scoring, and even golden section search work well. When f(θ) is convex, and θj = 0, it is prudent to check the forward directional derivatives dejf(θ) and d−ejf(θ) along the current coordinate direction ej and its negative. If both forward directional derivatives are nonnegative, then no progress can be made by moving off 0. Thus, a parameter parked at 0 is left there. Other computational savings are possible that make coordinate descent even faster. For example, computations can be organized around the linear predictor ∑jxijθj for each case i. When θj changes, it is trivial to update this inner product. The references [88, 89] illustrate the potential of coordinate descent on some concrete genetic examples.
Example 0.10. Matrix Completion
The matrix completion problem became famous when the movie distribution company Netflix offered a million dollar prize for improvements to its movie rating system [1]. The idea was that customers would submit ratings on a small subset of movie titles, and from these ratings Netflix would infer their preferences and recommend additional movies for their consideration. Imagine therefore a very sparse matrix Y = (yij) whose rows are individuals and whose columns are movies. Completed cells contain a rating from 1 to 5. Most cells are empty and need to be filled in. If the matrix is sufficiently structured and possesses low rank, then it is possible to complete the matrix in a parsimonious way. Although this problem sounds specialized, it has applications far beyond this narrow setting. For example, filling in missing genotypes in genome scans for disease genes benefits from matrix completion [15].
Following the references [9, 10, 58, 12], let Δ denote the set of index pairs (i, j) such that yij is observed. The Lagrangian formulation of matrix completion minimizes the criterion

f(X) = (1/2) ∑(i,j)∈Δ (yij − xij)2 + ρ ∑k σk  (7)
with respect to a compatible matrix X = (xij) with singular values σk. Recall that the singular value decomposition

X = ∑i σiuivit
represents X as a sum of outer products involving a collection of orthogonal left singular vectors ui, a corresponding collection of orthogonal right singular vectors vi, and a descending sequence of nonnegative singular values σi. Alternatively, we can factor X in the form UΣVt for orthogonal matrices U and V and a rectangular diagonal matrix Σ.
The nuclear norm ∥X∥nuc = ∑kσk plays the same role in low-rank matrix approximation that the l1 norm ∥b∥1 = ∑k ∣bk∣ plays in sparse regression. For a more succinct representation of the criterion (7), we introduce the Frobenius norm

∥X∥F = √tr(XXt) = √(∑i ∑j xij2)
induced by the trace inner product tr(UVt) and the projection operator PΔ(Y) with entries

PΔ(Y)ij = yij when (i, j) ∈ Δ and PΔ(Y)ij = 0 otherwise.
In this notation, the criterion (7) becomes

f(X) = (1/2)∥PΔ(Y) − PΔ(X)∥F2 + ρ∥X∥nuc.
To derive an algorithm for estimating X, we again exploit the MM principle. The general idea is to restore the symmetry of the problem by imputing the missing data [58]. Suppose Xn is our current approximation to X. We simply replace a missing entry yij of Y by the corresponding entry xnij of Xn and add the term (1/2)(xij − xnij)2 to the criterion (7). Since the added terms majorize 0, they create a legitimate surrogate function and lead to an MM algorithm. One can rephrase the problem in matrix terms by defining the orthogonal complement of PΔ(Y) according to the rule PΔ⊥(Y) = Y − PΔ(Y). The matrix Zn = PΔ(Y) + PΔ⊥(Xn) temporarily completes Y and yields the surrogate function

g(X ∣ Xn) = (1/2)∥Zn − X∥F2 + ρ∥X∥nuc.
At this juncture it is helpful to recall some mathematical facts. First, the Frobenius norm is invariant under left and right multiplication of its argument by an orthogonal matrix. Thus, ∥X∥F2 depends only on the singular values of X. The inner product −tr(ZnXt) presents a greater barrier to progress, but it ultimately succumbs to a matrix analogue of the Cauchy-Schwarz inequality. Fan’s inequality [6] says that

tr(ZnXt) ≤ ∑k ωkσk
for the ordered singular values ωk of Zn. Equality is attained in Fan’s inequality if and only if the right and left singular vectors for the two matrices coincide. Thus, in minimizing g(X ∣ Xn) we can assume that the singular vectors of X coincide with those of Zn and rewrite the surrogate function as

g(X ∣ Xn) = (1/2) ∑k (ωk − σk)2 + ρ ∑k σk + constant.
Application of the forward directional derivative test

∑k (σk − ωk + ρ)νk ≥ 0
for all tangent directions ν identifies the shrunken singular values

σk = max{ωk − ρ, 0}
as optimal. In practice, one does not have to extract the full singular value decomposition of Zn. Only the singular values ωk > ρ are actually relevant in constructing Xn+1.
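The entire MM iteration takes only a few lines of linear algebra, as the sketch below shows; a production version would extract only the singular values exceeding ρ with a truncated SVD. Names and defaults are ours.

```python
import numpy as np

def matrix_completion_mm(Y, mask, rho, iters=100):
    """Nuclear norm penalized matrix completion by MM (a sketch).

    Y holds the observed entries, mask is a boolean array marking Delta."""
    X = np.zeros_like(Y)
    for _ in range(iters):
        Z = np.where(mask, Y, X)                  # temporarily complete Y
        U, omega, Vt = np.linalg.svd(Z, full_matrices=False)
        sigma = np.maximum(omega - rho, 0.0)      # shrunken singular values
        X = (U * sigma) @ Vt                      # reassemble X_{n+1}
    return X
```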
In many applications the underlying structure of the observation matrix Y is corrupted by a few noisy entries. This tempts one to approximate Y by the sum of a low rank matrix X plus a sparse matrix W. To estimate X and W, we introduce a positive tuning constant λ and minimize the criterion

f(X, W) = (1/2) ∑(i,j)∈Δ (yij − xij − wij)2 + ρ∥X∥nuc + λ ∑(i,j)∈Δ ∣wij∣
by block descent. We have already indicated how to update X for W fixed. To minimize f(X, W) for X fixed, we set wij = 0 for any pair (i, j) ∉ Δ. Because the remaining W parameters separate in f(X, W), the shrinkage updates

wij = sign(yij − xij) max{∣yij − xij∣ − λ, 0}
are trivial to derive.
Augmented Lagrangians
The augmented Lagrangian method is one of the best ways of handling parameter constraints [38, 65, 70, 77]. For the sake of simplicity, we focus on the problem of minimizing f(θ) subject to the equality constraints gi(θ) = 0 for i = 1, … , q. We will ignore inequality constraints and assume that f(θ) and the gi(θ) are smooth. At a constrained minimum the classical Lagrange multiplier rule

▿f(θ) + ∑i λi▿gi(θ) = 0  (8)
holds provided the gradients ▿gi(θ) are linearly independent. The augmented Lagrangian method optimizes the perturbed function

Lρ(θ, λ) = f(θ) + ∑i λigi(θ) + (ρ/2) ∑i gi(θ)2
with respect to θ. It then adjusts the current multiplier vector λ in the hope of matching the true Lagrange multiplier vector. The penalty term punishes violations of the equality constraint gi(θ) = 0. At convergence the gradient ρgi(θ)▿gi(θ) of the penalty (ρ/2)gi(θ)2 vanishes, and we recover the standard multiplier rule (8). This process can only succeed if the degree of penalization ρ is sufficiently large.
Thus, we must either take ρ initially large or gradually increase it until it hits the finite transition point where the constrained and unconstrained solutions merge. Updating λ is more subtle. If θn furnishes the unconstrained minimum of Lρ(θ, λn), then the stationarity condition reads

0 = ▿f(θn) + ∑i [λni + ρgi(θn)]▿gi(θn).
The last equation motivates the standard update

λn+1,i = λni + ρgi(θn).
The alternating direction method of multipliers (ADMM) [30, 33] minimizes the sum f(θ) + h(γ) subject to the affine constraints Aθ + Bγ = c. Although the objective function is separable in the block variables θ and γ, the affine constraints frustrate a direct attack. However, the problem is ripe for a combination of the augmented Lagrangian method and a single round of block descent per iteration. The augmented Lagrangian is

Lρ(θ, γ, λ) = f(θ) + h(γ) + λt(Aθ + Bγ − c) + (ρ/2)∥Aθ + Bγ − c∥2.
Minimization is performed over θ and γ by block descent before updating the multiplier vector λ via

λn+1 = λn + ρ(Aθn+1 + Bγn+1 − c).
Introduction of block descent simplifies the usual augmented Lagrangian method, which minimizes jointly over θ and γ. This modest change keeps the convergence theory intact [7, 28] and has led to a resurgence in the popularity of ADMM in machine learning [4, 7, 12, 71, 74, 91].
Example 0.11. Fused Lasso
ADMM is helpful in reducing difficult optimization problems to simpler ones. The easiest fused lasso problem [86] minimizes the criterion

f(θ) = (1/2)∥y − θ∥2 + ρ ∑1≤i<p ∣θi+1 − θi∣.
The l1 penalty on the increments θi+1 − θi favors piecewise constant solutions. Unfortunately, this twist on the standard lasso penalty renders coordinate descent inefficient. We can reformulate the problem as minimizing the criterion (1/2)∥y − θ∥2 + ρ∥γ∥1 subject to the constraint γ = Dθ, where D is the (p − 1) × p differencing matrix with entries dii = −1, di,i+1 = 1, and 0 otherwise.
In the augmented Lagrangian framework with tuning constant μ, updating θ amounts to minimizing (1/2)∥y − θ∥2 + λt(Dθ − γ) + (μ/2)∥Dθ − γ∥2. It is straightforward to solve this least squares problem. Updating γ involves minimizing ρ∥γ∥1 + λt(Dθ − γ) + (μ/2)∥Dθ − γ∥2, which is a standard lasso problem. Thus, ADMM decouples the problematic linear transformation Dθ from the lasso penalty.
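The sketch below spells out one possible ADMM loop for the fused lasso, with μ the augmented Lagrangian constant as above; the dense solve is for clarity, since the tridiagonal system I + μDtD could be solved much faster.

```python
import numpy as np

def fused_lasso_admm(y, rho, mu=1.0, iters=500):
    """ADMM sketch for (1/2)||y - theta||^2 + rho * sum |theta_{i+1} - theta_i|."""
    p = len(y)
    D = np.diff(np.eye(p), axis=0)          # (p-1) x p differencing matrix
    theta = y.copy()
    gamma = D @ theta
    lam = np.zeros(p - 1)
    Q = np.eye(p) + mu * D.T @ D            # fixed quadratic in the theta update
    for _ in range(iters):
        # theta update: an ordinary least squares solve
        theta = np.linalg.solve(Q, y + D.T @ (mu * gamma - lam))
        # gamma update: soft thresholding solves the lasso problem in gamma
        z = D @ theta + lam / mu
        gamma = np.sign(z) * np.maximum(np.abs(z) - rho / mu, 0.0)
        # multiplier update
        lam += mu * (D @ theta - gamma)
    return theta
```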
Algorithm Acceleration
Many MM and block descent algorithms converge very slowly. In partial compensation, the computational work per iteration may be light. Even so, diminishing the number of iterations until convergence by one or two orders of magnitude is an attractive proposition [3, 42, 48, 51, 78, 93]. In this section we discuss a generic method for accelerating a wide variety of algorithms [93]. Consider a differentiable algorithm map θn+1 = A(θn) for optimizing an objective function f(θ), and suppose stationary points of f(θ) correspond to fixed points of A(θ). Equivalently, stationary points correspond to roots of the equation B(θ) = θ − A(θ) = 0. Within this framework it is natural to apply Newton’s method

θn+1 = θn − dB(θn)−1B(θn) = θn − [I − dA(θn)]−1[θn − A(θn)]  (9)
to find the root and accelerate the overall process. This is a realistic expectation because Newton’s method converges at a quadratic rate in contrast to the linear rates of MM and block descent algorithms.
There are two principal impediments to implementing algorithm (9) in high dimensions. First, it appears to require evaluation and storage of the Jacobi matrix dA(θ), whose rows are the differentials of the components of A(θ). Second, it also appears to require inversion of the matrix I − dA(θ). Both problems can be attacked by secant approximations. Close to the optimal point θ∞, the linear approximation

A ○ A(θn) − A(θn) ≈ dA(θ∞)[A(θn) − θn]
is valid. This suggests that we take two ordinary steps and gather information in the process on the matrix M = dA(θ∞). If we let v be the vector A ○ A(θn) − A(θn) and u be the vector A(θn) − θn, then the secant condition reads Mu = v. In practice it is advisable to exploit multiple secant conditions Mui = vi as long as their number does not exceed the number of parameters p. The secant conditions can be generated one per iteration over the current and previous q − 1 iterations. Let us represent the conditions collectively in the matrix form MU = V for U = (u1, … , uq) and V = (v1, … , vq).
The principle of parsimony suggests that we replace M by the smallest matrix satisfying the secant conditions. If we pose this problem concretely as minimizing the criterion ∥M∥F2 subject to the constraints MU = V, then a straightforward exercise in Lagrange multipliers [52] gives the solution M = V(UtU)−1Ut. The matrix M has rank at most q, and the Sherman-Morrison formula yields the explicit inverse

[I − V(UtU)−1Ut]−1 = I + V(UtU − UtV)−1Ut.
Fortunately, it involves inverting just the q × q matrix UtU − UtV. Furthermore, the Newton update (9) boils down to

θn+1 = A(θn) − V(UtU − UtV)−1Ut[θn − A(θn)].
The advantages of this procedure include: (a) it avoids large matrix inverses, (b) it relies on matrix times vector multiplication rather than matrix times matrix multiplication, (c) it requires only storage of the small matrices U and V, and (d) it respects linear parameter constraints. Nonnegativity constraints may be violated. The number of secants q should be fixed in advance, say between 1 and 15, and the matrices U and V should be updated by substituting the latest secant pair generated for the earliest secant pair retained. If an accelerated step fails the descent test, then one can revert to the ordinary MM or block descent step.
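A bare-bones version of this accelerated scheme appears below; the descent test guards each accelerated step, and the fallback rules, the choice q = 3, and the names are our own.

```python
import numpy as np

def quasi_newton_accelerate(alg, f, theta0, q=3, iters=100):
    """Accelerate the fixed point map alg with the latest q secant pairs."""
    theta = theta0
    Us, Vs = [], []
    for _ in range(iters):
        a1 = alg(theta)
        a2 = alg(a1)
        Us.append(a1 - theta)                   # u = A(theta) - theta
        Vs.append(a2 - a1)                      # v = A(A(theta)) - A(theta)
        Us, Vs = Us[-q:], Vs[-q:]               # retain the latest q secants
        U = np.column_stack(Us)
        V = np.column_stack(Vs)
        try:
            # theta = A(theta) - V (U'U - U'V)^{-1} U'(theta - A(theta))
            step = V @ np.linalg.solve(U.T @ U - U.T @ V, U.T @ (theta - a1))
            candidate = a1 - step
        except np.linalg.LinAlgError:
            candidate = a2
        if f(candidate) > f(a2):                # descent test fails: revert
            candidate = a2
        theta = candidate
    return theta
```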
Acceleration of non-smooth algorithms is more problematic [40].
For gradient descent and its generalizations [17] to non-smooth problems, Nesterov [64] has suggested a potent acceleration. As noted by Beck and Teboulle [2], the accelerated iterates in ordinary gradient descent depend on an intermediate scalar tn and an intermediate vector φn according to the formulas

θn = φn − s▿f(φn)
tn+1 = [1 + √(1 + 4tn2)]/2
φn+1 = θn + [(tn − 1)/tn+1](θn − θn−1)
with initial values t1 = 1 and φ1 = θ0. In other words, instead of taking a steepest descent step from the current iterate, one takes a steepest descent step from the extrapolated point φn, which depends on both the current iterate θn and the previous iterate θn−1. This mysterious extrapolation algorithm can yield impressive speedups for essentially the same computational cost as gradient descent.
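In code, the extrapolation adds only a few lines per iteration, as this sketch shows; the step size s is assumed to satisfy the Lipschitz bound discussed earlier, and the function name is ours.

```python
import numpy as np

def nesterov_gradient(grad, theta0, s, iters=500):
    """Nesterov accelerated gradient descent (Beck-Teboulle form, a sketch)."""
    theta_prev = theta0
    phi = theta0.copy()
    t = 1.0
    for _ in range(iters):
        theta = phi - s * grad(phi)            # descent step from phi
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        # extrapolate using the current and previous iterates
        phi = theta + ((t - 1.0) / t_next) * (theta - theta_prev)
        theta_prev, t = theta, t_next
    return theta_prev
```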
Discussion
The fault lines in optimization separate smooth from non-smooth problems, unconstrained from constrained problems, and small-scale problems from large-scale problems. Smooth, unconstrained, small-scale problems are easy to solve. Mathematical scientists are beginning to tackle non-smooth, constrained, large-scale problems at the opposite end of the difficulty spectrum. The most spectacular successes usually rely on convexity. We can expect further progress because some of the best minds in applied mathematics, computer science, and statistics have taken up the challenge. What is unlikely to occur is the discovery of a universally valid algorithm. Optimization is apt to remain as much art as science for a long time to come.
We have emphasized a few key ideas in this survey. Our examples demonstrate some of the possibilities for mixing and matching the different algorithm themes. Although we cannot predict the future of computational statistics with any certainty, the key ideas mentioned here will not disappear. For instance, penalization is here to stay, the descent property of an algorithm is always desirable, and quadratic approximation will always be superior to linear approximation for smooth functions. As computing devices hit physical constraints, the importance of parallel algorithms will also likely increase. This argues that block descent and parameter-separated MM algorithms will play a larger role in the future [94]. Although we have de-emphasized convex calculus, readers who want to devise their own algorithms are well advised to learn this inherently subtle subject. There is a difference, after all, between principled algorithms and ad hoc procedures.
Acknowledgments
Research supported in part by USPHS grants HG006139 and GM53275.
Contributor Information
Kenneth Lange, Departments of Biomathematics, Human Genetics, and Statistics University of California Los Angeles, CA 90095-1766 Phone: 310-206-8076 klange@ucla.edu.
Eric C. Chi, Department of Human Genetics University of California Los Angeles, CA 90095 ecchi@ucla.edu
Hua Zhou, Department of Statistics North Carolina State University Raleigh, NC 27695-8203 hua_zhou@ncsu.edu.
References
- [1].ACM SIGKDD and Netflix Proceedings of KDD Cup and Workshop. 2007 Available online http://www.cs.uic.edu/liub/Netflix-KDD-Cup-2007.html.
 - [2].Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci. 2009;2:183–202. [Google Scholar]
 - [3].Berlinet A, Roland C. Acceleration schemes with application to the EM algorithm. Comp Stat Data Anal. 2007;51:3689–3702. [Google Scholar]
 - [4].Bien J, Tibshirani RJ. Sparse estimation of a covariance matrix. Biometrika. 2011;98(4):807–820. doi: 10.1093/biomet/asr054. [DOI] [PMC free article] [PubMed] [Google Scholar]
 - [5].Böhning D, Lindsay BG. Monotonicity of quadratic approximation algorithms. Ann Instit Stat Math. 1988;40:641–663. [Google Scholar]
 - [6].Borwein JM, Lewis AS. Convex Analysis and Nonlinear Optimization: Theory and Examples. New York; Springer: 2000. [Google Scholar]
 - [7].Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011;3(1):1–122. [Google Scholar]
 - [8].Bradley EL. The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. J Amer Stat Assoc. 1973;68:199–200. [Google Scholar]
 - [9].Cai J-F, Candés EJ, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J Optimization. 2008;20:1956–1982. [Google Scholar]
 - [10].Candés EJ, Tao T. The power of convex relaxation: near-optimal matrix completion. IEEE Trans Inform Theory. 2009;56:2053–2080. [Google Scholar]
 - [11].Charnes A, Frome EL, Yu PL. The equivalence of generalized least squares and maximum likelihood in the exponential family. J Amer Stat Assoc. 1976;71:169–171. [Google Scholar]
 - [12].Chen C, He B, Yuan X. Matrix completion via an alternating direction method. IMA J Numerical Anal. 2012;32:227–245. [Google Scholar]
 - [13].Chen HF. Stochastic Approximation and its Applications. Kluwer; Dordrecht: 2002. [Google Scholar]
 - [14].Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM J Sci Comput. 1998;20:33–61. [Google Scholar]
 - [15].Chi EC, Zhou H, Ortega Del Vecchyo D, Lange K. Genotype imputation via matrix completion. 2012. (submitted) [DOI] [PMC free article] [PubMed]
 - [16].Claerbout J, Muir F. Robust modeling with erratic data. Geophysics. 1973;38:826–844. [Google Scholar]
 - [17].Combettes P, Wajs V. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation. 2005;4:1168–1200. [Google Scholar]
 - [18].Conn AR, Gould NIM, Toint PL. Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math Prog. 1991;50:177–195. [Google Scholar]
 - [19].Davidon WC. AEC Research and Development Report ANL–5990. Argonne National Laboratory; USA: 1959. Variable metric methods for minimization. [Google Scholar]
 - [20].de Leeuw J. Applications of convex analysis to multidimensional scaling. In: Barra JR, Brodeau F, Romie G, Van Cutsem B, editors. Recent Developments in Statistics. North-Holland, Amsterdam: 1976. [Google Scholar]
 - [21].de Leeuw J. Block relaxation algorithms in statistics. In: Bock HH, Lenski W, Richter MM, editors. Information Systems and Data Analysis. Springer; New York: 1994. pp. 308–325. [Google Scholar]
 - [22].Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion) J Roy Stat Soc B. 1977;39:1–38. [Google Scholar]
 - [23].Dennis JE, Jr, Schnabel RB. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM; Philadelphia: 1996. [Google Scholar]
 - [24].Ding C, Li T, Jordan MI. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2010;32:45–55. doi: 10.1109/TPAMI.2008.277. [DOI] [PubMed] [Google Scholar]
 - [25].Dobson AJ. An Introduction to Generalized Linear Models. Chapman & Hall; London: 1990. [Google Scholar]
 - [26].Donoho D, Johnstone I. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–455. [Google Scholar]
 - [27].Duchi J, Shalev-Shwartz S, Singer Y, Chandra T. Efficient projections onto the l1-ball for learning in high dimensions. Proceedings of the 25th international conference on Machine learning, (ICML 2008); ACM, New York. 2008. pp. 272–279. [Google Scholar]
 - [28].Fortin M, Glowinski R. Augmented Lagrangian methods: Applications to the numerical solution of boundary-value problems. ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik. 1983;65:622–622. [Google Scholar]
 - [29].Friedman J, Hastie T, Tibshirani R. Pathwise coordinate optimization. Ann Appl Stat. 2007;1:302–332. [Google Scholar]
 - [30].Gabay D, Mercier B. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Comp Math Appl. 1976;2:17–40. [Google Scholar]
 - [31].Gabay D. Ph.D. thesis. Universite Pierre et Marie Curie; 1979. Methodes numeriques pour loptimisation non-lineaire. [Google Scholar]
 - [32].Gifi A. Nonlinear Multivariate Analysis. Wiley; Hoboken, NJ: 1990. [Google Scholar]
 - [33].Glowinski R, Marrocco A. Sur lapproximation par elements finis dordre un, et la resolution par penalisation-dualite dune classe de problemes de dirichlet nonlineaires. Rev. Francaise dAut. Inf. Rech. Oper. 1975;2:41–76. [Google Scholar]
 - [34].Glowinski R, Le Tallec P. Augmented Lagrangian and Operator-splitting Methods in Nonlinear Mechanics. SIAM; 1989. [Google Scholar]
 - [35].Goldstein AA. Convex programming in Hilbert space. Bulletin Amer Math Soc. 1964;70:709–710. [Google Scholar]
 - [36].Green PJ. Iteratively reweighted least squares for maximum likelihood estimation and some robust and resistant alternatives (with discussion) J Roy Stat Soc B. 1984;46:149–192. [Google Scholar]
 - [37].Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed Springer; New York: 2009. [Google Scholar]
 - [38].Hestenes MR. Multiplier and gradient methods. Journal of Optimization Theory and Applications. 1969;4:303–320. [Google Scholar]
 - [39].Hiriart-Urruty JB, Lemarechal C. Convex Analysis and Minimization Algorithms: Part 1: Fundamentals. Springer; New York: 1996. [Google Scholar]
 - [40].Hiriart-Urruty JB, Lemarechal C. Convex Analysis and Minimization Algorithms: Part 2: Advanced Theory and Bundle Methods. Springer; New York: 2001. [Google Scholar]
 - [41].Hunter DR, Lange K. A tutorial on MM algorithms. Amer Statistician. 2004;58:30–37. [Google Scholar]
 - [42].Jamshidian M, Jennrich RI. Quasi-Newton acceleration of the EM algorithm. J Roy Stat Soc B. 1997;59:569–587. [Google Scholar]
 - [43].Jank W. Implementing and diagnosing the stochastic approximation EM algorithm. J Computational Graphical Stat. 2006;15:803–829. [Google Scholar]
 - [44].Jennrich RI, Moore RH. Maximum likelihood estimation by means of nonlinear least squares. Proceedings of the Statistical Computing Section: Amer Stat Assoc. 1975;57:65. [Google Scholar]
 - [45] Khalfan HF, Byrd RH, Schnabel RB. A theoretical and experimental study of the symmetric rank-one update. SIAM J Optim. 1993;3:1–24.
 - [46] Kiefer J, Wolfowitz J. Stochastic estimation of the maximum of a regression function. Ann Math Stat. 1952;23:462–466.
 - [47] Kruskal JB. Analysis of factorial experiments by estimating monotone transformations of the data. J Roy Stat Soc B. 1965;27:251–263.
 - [48] Kuroda M, Sakakihara M. Accelerating the convergence of the EM algorithm using the vector epsilon algorithm. Comp Stat Data Anal. 2006;51:1549–1561.
 - [49] Kushner HJ, Yin GG. Stochastic Approximation and Recursive Algorithms and Applications. Springer; New York: 2003.
 - [50] Lange K. A gradient algorithm locally equivalent to the EM algorithm. J Roy Stat Soc B. 1995;57:425–437.
 - [51] Lange K. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica. 1995;5:1–18.
 - [52] Lange K. Numerical Analysis for Statisticians. 2nd ed. Springer; New York: 2010.
 - [53] Lange K. Optimization. 2nd ed. Springer; New York: 2012.
 - [54] Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions (with discussion). J Comput Graphical Stat. 2000;9:1–59.
 - [55] Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788–791. doi: 10.1038/44565.
 - [56] Levitin ES, Polyak BT. Constrained minimization problems. USSR Computational Math and Math Physics. 1966;6:1–50.
 - [57] Mateos G, Bazerque J-A, Giannakis GB. Distributed sparse linear regression. IEEE Transactions on Signal Processing. 2010;58:5262–5276.
 - [58] Mazumder R, Hastie T, Tibshirani R. Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res. 2010;11:2287–2322.
 - [59] McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. 2nd ed. Wiley; Hoboken, NJ: 2008.
 - [60] Meinshausen N, Bühlmann P. Stability selection. J Roy Stat Soc B. 2010;72:417–473.
 - [61] Meng X-L, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika. 1993;80:267–278.
 - [62] Michelot C. A finite algorithm for finding the projection of a point onto the canonical simplex in Rn. J Optimization Theory Applications. 1986;50:195–200.
 - [63] Nelder JA, Wedderburn RWM. Generalized linear models. J Roy Stat Soc A. 1972;135:370–384.
 - [64].Nesterov Y. Gradient methods for minimizing composite objective function. CORE Discussion Papers. 2007.
 - [65] Nocedal J, Wright S. Numerical Optimization. 2nd ed. Springer; New York: 2006.
 - [66] Ortega JM, Rheinboldt WC. Iterative Solution of Nonlinear Equations in Several Variables. Academic; New York: 1970.
 - [67] Osborne MR. Fisher's method of scoring. International Statistical Review. 1992;60:99–117.
 - [68] Paatero P, Tapper U. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics. 1994;5:111–126.
 - [69] Park MY, Hastie T. l1-regularization path algorithm for generalized linear models. J Roy Stat Soc B. 2007;69:659–677.
 - [70] Powell MJD. A method for nonlinear constraints in minimization problems. In: Fletcher R, editor. Optimization. Academic Press; 1969.
 - [71] Qin Z, Goldfarb D. Structured sparsity via alternating direction methods. J Mach Learn Res. 2012;13:1435–1468.
 - [72] Ranola JM, Ahn S, Sehl ME, Smith DJ, Lange K. A Poisson model for random multigraphs. Bioinformatics. 2010;26:2004–2011. doi: 10.1093/bioinformatics/btq309.
 - [73] Rao CR. Linear Statistical Inference and Its Applications. 2nd ed. Wiley; Hoboken, NJ: 1973.
 - [74] Richard E, Savalle P-A, Vayatis N. Estimation of simultaneously sparse and low rank matrices. Proceedings of the 29th International Conference on Machine Learning (ICML 2012); 2012.
 - [75] Robert C, Casella G. Monte Carlo Statistical Methods. Springer; New York: 2004.
 - [76] Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951;22:400–407.
 - [77] Rockafellar RT. The multiplier method of Hestenes and Powell applied to convex programming. J Optimiz Theory App. 1973;12:555–562.
 - [78] Roland C, Varadhan R. New iterative schemes for nonlinear fixed point problems, with applications to problems with bifurcations and incomplete-data problems. Applied Numerical Math. 2005;55:215–226.
 - [79] Ruszczyński A. Nonlinear Optimization. Princeton University Press; Princeton, NJ: 2006.
 - [80] Santosa F, Symes WW. Linear inversion of band-limited reflection seismograms. SIAM J Sci Stat Comput. 1986;7:1307–1330.
 - [81] Sha F, Saul LK, Lee DD. Multiplicative updates for nonnegative quadratic programming in support vector machines. In: Becker S, Thrun S, Obermayer K, editors. Advances in Neural Information Processing Systems 15. MIT Press; Cambridge, MA: 2003. pp. 1065–1073.
 - [82] Strang G, Borre K. Algorithms for Global Positioning. Wellesley-Cambridge Press; Wellesley, MA: 2012.
 - [83] Taylor H, Banks SC, McCoy JF. Deconvolution with the l1 norm. Geophysics. 1979;44:39–52.
 - [84] Teo CH, Vishwanathan S, Smola AJ, Le QV. Bundle methods for regularized risk minimization. J Mach Learn Res. 2010;11:311–365.
 - [85] Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B. 1996;58:267–288.
 - [86] Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J Roy Stat Soc B. 2005;67:91–108.
 - [87] Wei GCG, Tanner MA. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. JASA. 1990;85:699–704.
 - [88] Wu TT, Chen YF, Hastie T, Sobel EM, Lange K. Genomewide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25:714–721. doi: 10.1093/bioinformatics/btp041.
 - [89] Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat. 2008;2:224–244.
 - [90] Wu TT, Lange K. The MM alternative to EM. Stat Sci. 2010;25:492–505.
 - [91] Xue L, Ma S, Zou H. Positive definite l1 penalized estimation of large covariance matrices. JASA. (in press)
 - [92] Zhou H, Zhang Y. EM vs MM: a case study. Comp Stat Data Anal. 2012;56:3909–3920. doi: 10.1016/j.csda.2012.05.018.
 - [93] Zhou H, Alexander DH, Lange K. A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing. 2011;21:261–273. doi: 10.1007/s11222-009-9166-3.
 - [94] Zhou H, Lange K, Suchard MA. Graphics processing units and high-dimensional optimization. Stat Science. 2010;25:311–324. doi: 10.1214/10-STS336.
 - [95] Zou H. The adaptive lasso and its oracle properties. JASA. 2006;101:1418–1429.
 - [96] Zou H, Hastie T. Regularization and variable selection via the elastic net. J Roy Stat Soc B. 2005;67:301–320.
 

