Variable Selection using MM Algorithms

David R Hunter; Runze Li

doi:10.1214/009053605000000200

Author manuscript; available in PMC: 2009 Apr 29.

Published in final edited form as: Ann Stat. 2005;33(4):1617–1642. doi: 10.1214/009053605000000200

PMCID: PMC2674769 NIHMSID: NIHMS103893 PMID: 19458786

Abstract

Variable selection is fundamental to high-dimensional statistical modeling. Many variable selection techniques may be implemented by maximum penalized likelihood using various penalty functions. Optimizing the penalized likelihood function is often challenging because it may be nondifferentiable and/or nonconcave. This article proposes a new class of algorithms for finding a maximizer of the penalized likelihood for a broad class of penalty functions. These algorithms operate by perturbing the penalty function slightly to render it differentiable, then optimizing this differentiable function using a minorize-maximize (MM) algorithm. MM algorithms are useful extensions of the well-known class of EM algorithms, a fact that allows us to analyze the local and global convergence of the proposed algorithm using some of the techniques employed for EM algorithms. In particular, we prove that when our MM algorithms converge, they must converge to a desirable point; we also discuss conditions under which this convergence may be guaranteed. We exploit the Newton-Raphson-like aspect of these algorithms to propose a sandwich estimator for the standard errors of the estimators. Our method performs well in numerical tests.

Keywords: AIC, BIC, EM algorithm, LASSO, MM algorithm, penalized likelihood, oracle estimator, SCAD

1. Introduction

Fan and Li (2001) discuss a family of variable selection methods that adopt a penalized likelihood approach. This family includes well-established methods such as AIC and BIC as well as more recent methods such as bridge regression (Frank and Friedman, 1993), LASSO (Tibshirani, 1996), and SCAD (Antoniadis and Fan, 2001). What all of these methods share is the fact that they require the maximization of a penalized likelihood function. Even when the log-likelihood itself is relatively easy to maximize, the penalized version may present numerical challenges. For example, in the case of SCAD or LASSO or bridge regression, the penalized log-likelihood function is nondifferentiable; with SCAD or bridge regression, the function is also nonconcave. To perform the maximization, Fan and Li (2001) propose a new and generic algorithm based on local quadratic approximation (LQA). In this article, we demonstrate and explore a connection between the LQA algorithm and minorization-maximization (MM) algorithms (Hunter and Lange, 2000), which shows that many different variable-selection techniques may be accomplished using the same algorithmic techniques.

MM algorithms exploit an optimization technique that extends the central idea of EM algorithms (Dempster et al., 1977) to situations not necessarily involving missing data nor even maximum likelihood estimation. The connection between LQA and MM enables us to analyze the convergence of the local quadratic approximation algorithm using techniques related to EM algorithms (Wu, 1983; Meng, 1994; Lange, 1995; Meng and Van Dyk, 1997). Furthermore, we extend the local quadratic approximation idea here by forming a slightly perturbed objective function to maximize. This perturbation solves two problems at once. First, it renders the objective function differentiable, which allows us to prove results regarding the convergence of the MM algorithms discussed here. Second, it repairs one of the drawbacks that the LQA algorithm shares with forward variable selection: Namely, if a covariate is deleted at any step in the LQA algorithm, it will necessarily be excluded from the final selected model. We discuss how to decide a priori how large a perturbation to choose when implementing this method and make specific comments about the price paid for using this perturbation.

The new algorithm we propose retains virtues of the Newton-Raphson algorithm, which among other things allows us to compute a standard error for the resulting estimator via a sandwich formula. It is also numerically stable and is never forced to delete a covariate permanently in the process of iteration. The general convergence results known for MM algorithms imply among other things that the newly proposed algorithm converges correctly to the maximizer of the perturbed penalized likelihood whenever this maximizer is the unique local maximum. The linear rate of convergence of the algorithm is governed by the largest eigenvalue of the derivative of the algorithm map.

The rest of the article is organized as follows. Section 2 briefly introduces the variable selection problem and the penalized likelihood approach. After providing some background on MM algorithms, Section 3 explains their connection to the LQA idea, then provides a modification to LQA that may be shown to be an MM algorithm for maximizing a perturbed version of the penalized likelihood. Various convergence properties of this new MM algorithm are also covered in Section 3. Section 4 describes a method of estimating covariances and presents numerical tests of the algorithm on a set of four diverse problems. Finally, Section 5 discusses the numerical results and offers some broad comparisons among the competing methods studied in Section 4. Some proofs appear in the Appendix.

2. Variable selection via maximum penalized likelihood

Suppose that {(x_i, y_i): i = 1, …, n} is a random sample with conditional log-likelihood $ℓ_{i} (β, φ) \equiv ℓ_{i} (x_{i}^{T} β, y_{i}, φ)$ given x_i. Typically, the y_i are response variables that depend on the predictors x_i through a linear combination $x_{i}^{T} β$, and φ is a dispersion parameter. Some of the components of β may be zero, which means that the corresponding predictors do not influence the response. The goal of variable selection is to identify those components of β that are zero. A secondary goal in this article will be to estimate the nonzero components of β.

In some variable selection applications, such as standard Poisson or logistic regression, no dispersion parameter φ exists. In other applications, such as linear regression, φ is to be estimated separately after β is estimated. Therefore, the penalized likelihood approach does not penalize φ, so we simplify notation in the remainder of this article by eliminating explicit reference to φ. In particular, ℓ_i(β, φ) will be written ℓ_i(β). This is standard practice in the variable selection literature; see, for example, Frank and Friedman (1993), Tibshirani (1996), Fan and Li (2001) or Miller (2002).

Many variable selection criteria arise as special cases of the general formulation discussed in Fan and Li (2001), where the penalized likelihood function takes the form

Q (β) = \sum_{i = 1}^{n} ℓ_{i} (β) - n \sum_{j = 1}^{d} λ_{j} p_{j} (∣ β_{j} ∣) \equiv ℓ (β) - n \sum_{j = 1}^{d} λ_{j} p_{j} (∣ β_{j} ∣) .

(2.1)

In equation (2.1), the p_j(·) are given nonnegative penalty functions, d is the dimension of the covariate vector x_i, and the λ_j are tuning parameters controlling model complexity. The selected model based on the maximized penalized likelihood (2.1) satisfies p_j(|β_j|) = 0 for certain β_j’s, which accordingly are not included in this final model, and so model estimation is performed at the same time as model selection. Often, the λ_j may be chosen by a data-driven approach such as cross-validation or generalized cross-validation (Craven and Wahba, 1979).

The penalty functions p_j(·) and the tuning parameters λ_j are not necessarily the same for all j. This allows one to incorporate hierarchical prior information for the unknown coefficients by using different penalty functions and taking different values of λ_j for the different regression coefficients. For instance, one may not be willing to penalize important factors in practice. For ease of presentation, we assume throughout this article that the same penalization is applied to every component of β and write λ_j p_j(|β_j|) as p_λ(|β_j|), which implies that the penalty function is allowed to depend on λ. Extensions to situations with different penalty functions for each component of β do not involve any extra difficulties except more tedious notation.

Many well-known variable selection criteria are special cases of the penalized likelihood of equation (2.1). For instance, consider the L₀ penalty p_λ(|β|) = 0.5λ² I(|β| ≠ 0), also called the entropy penalty in the literature, where I(·) is an indicator function. Note that the dimension or the size of a model equals Σ_j I(|β_j| ≠ 0), the number of nonzero regression coefficients in the model. In other words, the penalized likelihood (2.1) with the entropy penalty can be rewritten as

ℓ (β) - 0.5 n λ^{2} ∣ M ∣,

where |M| = Σ_j I(|β_j| ≠ 0), the size of the underlying candidate model. Hence, several popular variable selection criteria can be derived from the penalized likelihood (2.1) by choosing different values of λ. For instance, the AIC (or C_p) and BIC criteria correspond to $λ = \sqrt{2 / n}$ and $λ = \sqrt{(\log n) / n}$, respectively, although these criteria were motivated from different principles. Similar in its effect to the entropy penalty function is the hard thresholding penalty function (see Antoniadis, 1997) given by

p_{λ} (∣ β ∣) = λ^{2} - {(∣ β ∣ - λ)}^{2} I (∣ β ∣ < λ) .

Recently, many authors have been working on penalized least squares with the L_q penalty p_λ (|β|) = λ|β|^q. Indeed, bridge regression is the solution of penalized least squares with the L_q penalty (Frank and Friedman, 1993). It is well known that ridge regression is the solution of penalized likelihood with the L₂ penalty. The L₁ penalty results in LASSO, proposed by Tibshirani (1996). Finally, there is the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001). For fixed a > 2, the SCAD penalty is the continuous function p_λ (·) defined by p_λ (0) = 0 and, for β ≠ 0,

p_{λ}^{'} (∣ β ∣) = λ I (∣ β ∣ \leq λ) + \frac{{(a λ - ∣ β ∣)}_{+}}{a - 1} I (∣ β ∣ > λ),

(2.2)

where throughout this article $p_{λ}^{'} (∣ β ∣)$ denotes the derivative of p_λ (·) evaluated at |β|.
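The derivative in equation (2.2), together with the derivatives of the L_q and hard thresholding penalties defined above, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, each function assumes a strictly positive argument |β|, and the SCAD constant a must be supplied by the caller.

```python
def scad_derivative(beta_abs, lam, a):
    """p'_lambda(|beta|) from equation (2.2); assumes beta_abs > 0, lam > 0, a > 2."""
    if beta_abs <= lam:
        return lam
    # Linearly decreasing part, clipped at zero once |beta| exceeds a * lam.
    return max(a * lam - beta_abs, 0.0) / (a - 1.0)


def lq_derivative(beta_abs, lam, q=1.0):
    """Derivative of lam * |beta|**q for beta_abs > 0; q = 1 gives the LASSO penalty."""
    return lam * q * beta_abs ** (q - 1.0)


def hard_threshold_derivative(beta_abs, lam):
    """Derivative of lam**2 - (|beta| - lam)**2 * I(|beta| < lam)."""
    return 2.0 * max(lam - beta_abs, 0.0)
```

The SCAD derivative is continuous at |β| = λ and vanishes for |β| ≥ aλ, so large coefficients are not shrunk.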

Letting $p_{λ}^{'} (∣ β ∣ +)$ denote the limit of $p_{λ}^{'} (x)$ as x → |β| from above, the MM algorithms introduced in the next section are shown to apply to any continuous penalty function p_λ (β) that is nondecreasing and concave on (0, ∞) such that $p_{λ}^{'} (0 +) < \infty$ . The previously mentioned penalty functions that satisfy these criteria are hard thresholding, SCAD, LASSO, and L_q with 0 < q ≤ 1. Therefore, the methods presented in this article enable a wide range of variable selection algorithms.

Nonetheless, there are some common penalty functions that do not meet our criteria. The entropy penalty is excluded because it is discontinuous, and in fact maximizing the AIC- or BIC-penalized likelihood in cases other than linear regression often requires exhaustive fitting of all possible models. For q > 1, the L_q penalty is excluded because it is not concave on (0, ∞); however, the fact that p_λ(|β|) = λ|β|^q is everywhere differentiable suggests that the penalized likelihood function may be susceptible to gradient-based methods and hence alternatives such as MM may be of limited value. In particular, the special case of ridge regression (q = 2) admits a closed-form maximizer, a fact we allude to following equation (3.17). But there is a more subtle reason for excluding L_q penalties with q > 1. Note that $p_{λ}^{'} (0 +) > 0$ for any of our nonexcluded penalty functions. As Fan and Li (2001) point out, this fact (which they call singularity at the origin because it implies a discontinuous derivative at zero) ensures that the penalized likelihood has the sparsity property: The resulting estimator is automatically a thresholding rule that sets small estimated coefficients to zero, thus reducing model complexity. Sparsity is an important property for any penalized likelihood technique that is to be useful in a variable selection setting.

3. Maximized penalized likelihood via MM

It is sometimes a challenging task to find the maximum penalized likelihood estimate. Fan and Li (2001) propose a local quadratic approximation for the penalty function: Suppose that we are given an initial value β⁽⁰⁾. If $β_{j}^{(0)}$ is very close to 0, then set β̂_j = 0; otherwise, the penalty function is locally approximated by a quadratic function using

\frac{\partial}{\partial β_{j}} {p_{λ} (∣ β_{j} ∣)} = p_{λ}^{'} (∣ β_{j} ∣ +) sgn (β_{j}) \approx \frac{p_{λ}^{'} (∣ β_{j}^{(0)} ∣ +)}{∣ β_{j}^{(0)} ∣} β_{j}

when $β_{j}^{(0)} \neq 0$ . In other words,

p_{λ} (∣ β_{j} ∣) \approx p_{λ} (∣ β_{j}^{(0)} ∣) + \frac{\{β_{j}^{2} - {(β_{j}^{(0)})}^{2}\} p_{λ}^{'} (∣ β_{j}^{(0)} ∣ +)}{2 ∣ β_{j}^{(0)} ∣}

(3.1)

for $β_{j} \approx β_{j}^{(0)}$ . With the aid of this local quadratic approximation, a Newton-Raphson algorithm (for example) can be used to maximize the penalized likelihood function, where each iteration updates the local quadratic approximation.

In this section, we show that this local quadratic approximation idea is an instance of an MM algorithm. This fact enables us to study the convergence properties of the algorithm using techniques applicable to MM algorithms in general. Throughout this section, we refrain from specifying the form of p_λ(·), since the derivations apply equally to any one of hard thresholding, LASSO, bridge regression using L_q with 0 < q ≤ 1, SCAD, or any other method with penalty function p_λ(·) satisfying the conditions of Proposition 3.1.

3.1. Local Quadratic Approximation as an MM algorithm

MM stands for Majorize-Minimize or Minorize-Maximize, depending on the context (Hunter and Lange, 2000). EM algorithms (Dempster et al., 1977), in which the E-step may be shown to be equivalent to a minorization step, are the most famous examples of MM algorithms, though there are many examples of MM algorithms that involve neither maximum likelihood nor missing data. Heiser (1995) and Lange et al. (2000) give partial surveys of work in this area. The apparent ambiguity in allowing MM to have two different meanings is harmless, since any maximization problem may be viewed as a minimization problem by changing the sign of the objective function.

Consider the penalty term −n Σ_j p_λ(|β_j|) of equation (2.1), ignoring its minus sign for the moment. Mimicking the idea of equation (3.1), we define the function

Φ_{θ_{0}} (θ) = p_{λ} (∣ θ_{0} ∣) + \frac{(θ^{2} - θ_{0}^{2}) p_{λ}^{'} (∣ θ_{0} ∣ +)}{2 ∣ θ_{0} ∣} .

(3.2)

We assume that p_λ(·) is piecewise differentiable so that $p_{λ}^{'} (∣ θ ∣ +)$ exists for all θ. Thus, Φ_{θ₀}(θ) is a well-defined quadratic function of θ for all real θ₀ except for θ₀ = 0. Section 3.2 remedies the problem that Φ_{θ₀}(θ) is undefined when θ₀ = 0.

We are interested in penalty functions p_λ(θ) for which

Φ_{θ_{0}} (θ) \geq p_{λ} (∣ θ ∣) for all θ, with equality when θ = θ_{0} .

(3.3)

A function Φ_{θ₀}(θ) satisfying condition (3.3) is said to majorize p_λ(|θ|) at θ₀. If the direction of the inequality in condition (3.3) were reversed, then Φ_{θ₀}(θ) would be said to minorize p_λ(|θ|) at θ₀.

The driving force behind an MM algorithm is the fact that condition (3.3) implies

Φ_{θ_{0}} (θ) - Φ_{θ_{0}} (θ_{0}) \geq p_{λ} (∣ θ ∣) - p_{λ} (∣ θ_{0} ∣),

which in turn gives the descent property

Φ_{θ_{0}} (θ) < Φ_{θ_{0}} (θ_{0}) implies p_{λ} (∣ θ ∣) < p_{λ} (∣ θ_{0} ∣) .

(3.4)

In other words, if θ₀ denotes the current iterate, any decrease in the value of Φ_{θ₀}(θ) guarantees a decrease in the value of p_λ(|θ|). If θ_k denotes the estimate of the parameter at the kth iteration, then an iterative minimization algorithm would exploit the descent property by constructing the majorizing function Φ_{θ_k}(θ), then minimizing it to give θ_{k+1}; hence the name “majorize-minimize algorithm”. Proposition 3.1 gives sufficient conditions on the penalty function p_λ(·) in order that Φ_{θ₀}(θ) majorizes p_λ(|θ|). Several different penalty functions that satisfy these conditions are depicted in Figure 1 along with their majorizing quadratic functions.

Fig. 1. Majorizing functions Φ_{θ₀}(θ) for various penalty functions are shown as dotted curves; the penalty functions are shown as solid curves. The four penalties are (a) hard thresholding with λ = 2; (b) L₁ with λ = 1; (c) L_{0.5} with λ = 1; and (d) SCAD with a = 2.1 and λ = 1. In each case, θ₀ = 1.

Proposition 3.1

Suppose that on (0, ∞), p_λ(·) is piecewise differentiable, nondecreasing, and concave. Suppose furthermore that p_λ(·) is continuous at 0 and $p_{λ}^{'} (0 +) < \infty$. Then for all θ₀ ≠ 0, Φ_{θ₀}(θ) as defined in equation (3.2) majorizes p_λ(|θ|) at the points ±|θ₀|. In particular, conditions (3.3) and (3.4) hold.
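The majorization claimed in Proposition 3.1 can be checked numerically for two of the penalties of Figure 1 (L₁ with λ = 1 and hard thresholding with λ = 2, both with θ₀ = 1). The sketch below is only an illustration on a finite grid, not a proof; all function names are hypothetical.

```python
def phi(theta, theta0, p, dp):
    """Quadratic function (3.2); theta0 must be nonzero."""
    t0 = abs(theta0)
    return p(t0) + (theta ** 2 - theta0 ** 2) * dp(t0) / (2.0 * t0)


def l1(t, lam=1.0):
    return lam * abs(t)


def l1_deriv(t, lam=1.0):
    return lam


def hard(t, lam=2.0):
    # lam^2 - (|t| - lam)^2 * I(|t| < lam)
    return lam ** 2 - (abs(t) - lam) ** 2 * (abs(t) < lam)


def hard_deriv(t, lam=2.0):
    return 2.0 * max(lam - abs(t), 0.0)


# Smallest value of Phi - p over a grid on [-4, 4]; it should never be negative.
grid = [k / 100.0 for k in range(-400, 401)]
min_gap_l1 = min(phi(t, 1.0, l1, l1_deriv) - l1(t) for t in grid)
min_gap_hard = min(phi(t, 1.0, hard, hard_deriv) - hard(t) for t in grid)
```

For these two penalties the gap can also be computed by hand: it equals (|θ| − 1)²/2 for L₁ and, for |θ| < 2, 2(|θ| − 1)² for hard thresholding, so the quadratic touches the penalty exactly at θ = ±1.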

Next, suppose that we wish to employ the local quadratic approximation idea in an iterative algorithm, where $β^{(k)} = (β_{1}^{(k)}, \dots, β_{d}^{(k)})$ denotes the value of β at the kth iteration. Appending negative signs to p_λ(|β_j|) and $Φ_{β_{j}^{(k)}} (β_{j})$ to turn majorization into minorization, we obtain the following corollary from Proposition 3.1 and equation (2.1):

Corollary 3.1

Suppose that $β_{j}^{(k)} \neq 0$ for all j and that p_λ(θ) satisfies the conditions given in Proposition 3.1. Then

S_{k} (β) \equiv ℓ (β) - n \sum_{j = 1}^{d} Φ_{β_{j}^{(k)}} (β_{j})

(3.5)

minorizes Q(β) at β^(k).

By the ascent property (the analogue, for minorizing functions, of the descent property (3.4)), Corollary 3.1 suggests that given β^(k), we should define β^(k+1) to be the maximizer of S_k(β), thereby ensuring that Q(β^(k+1)) > Q(β^(k)). The benefit of replacing one maximization problem by another in this way is that S_k(β) is susceptible to a gradient-based scheme such as Newton-Raphson, unlike the non-differentiable function Q(β). Since the sum in equation (3.5) is a quadratic function of β (in fact, the Hessian matrix of this sum is a diagonal matrix), this sum presents no difficulties for maximization. Therefore, the difficulty of maximizing S_k(β) is determined solely by the form of ℓ(β). For example, in the special case of a linear regression model with normally distributed errors, the log-likelihood function ℓ(β) is itself quadratic, which implies that S_k(β) may be maximized analytically.

If some of the components of β^(k) equal zero (or in practice, if some of them are close to zero), the algorithm proceeds by simply setting the final estimates of those components to be zero, deleting them from consideration, then defining the function S_k(β̃) as in equation (3.5), where β̃ is the vector composed of the nonzero components of β. The weakness of this scheme is that once a component is set to zero, it may never reenter the model at a later stage of the algorithm. The modification proposed in Section 3.2 eliminates this weakness.

3.2. An improved version of local quadratic approximation

The drawback of Φ_{θ₀}(θ) in equation (3.2) is that when θ₀ = 0, the denominator 2|θ₀| makes Φ_{θ₀}(θ) undefined. We therefore replace 2|θ₀| by 2(ε + |θ₀|) for some ε > 0. The resulting perturbed version of Φ_{θ₀}(θ), which is defined for all real θ₀, is no longer a majorizer of p_λ(θ) as required by the MM theory. Nonetheless, we may show that it majorizes a perturbed version of p_λ(θ), which may therefore be used to define a new objective function Q_ε(β) that is similar to Q(β). To this end, we define

p_{λ, ε} (∣ θ ∣) = p_{λ} (∣ θ ∣) - ε \int_{0}^{∣ θ ∣} \frac{p_{λ}^{'} (t)}{ε + t} d t

(3.6)

and

Q_{ε} (β) = ℓ (β) - n \sum_{j = 1}^{d} p_{λ, ε} (∣ β_{j} ∣) .

(3.7)

The next proposition shows that an MM algorithm may be applied to the maximization of Q_ε(β) and suggests that a maximizer of Q_ε(β) should be close to a maximizer of Q(β) as long as ε is small and Q(β) is not too flat in the neighborhood of the maximizer.

Proposition 3.2

Suppose that p_λ (·) satisfies the conditions of Proposition 3.1. For ε > 0, define

Φ_{θ_{0}, ε} (θ) = p_{λ, ε} (∣ θ_{0} ∣) + \frac{(θ^{2} - θ_{0}^{2}) p_{λ}^{'} (∣ θ_{0} ∣ +)}{2 (ε + ∣ θ_{0} ∣)} .

(3.8)

Then (a) for any fixed ε > 0,

S_{k, ε} (β) \equiv ℓ (β) - n \sum_{j = 1}^{d} Φ_{β_{j}^{(k)}, ε} (β_{j})

(3.9)

minorizes Q_ε(β) at β^(k).

(b) As ε↓ 0, |Q_ε (β) − Q(β)| → 0 uniformly on compact subsets of the parameter space.
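For the L₁ penalty, the integral in equation (3.6) has a closed form, since p_λ'(t) = λ is constant: p_{λ,ε}(|θ|) = λ|θ| − ελ log(1 + |θ|/ε). The sketch below, which assumes this L₁ special case and uses hypothetical function names, evaluates the gap between the penalty and its perturbed version on a compact interval for decreasing ε, in the spirit of part (b).

```python
import math


def l1_perturbed(theta, lam, eps):
    """p_{lam,eps}(|theta|) from (3.6) when p_lam(|theta|) = lam * |theta|."""
    t = abs(theta)
    return lam * t - eps * lam * math.log1p(t / eps)


def max_gap(lam, eps, radius=5.0, steps=1000):
    """Largest |p_lam - p_{lam,eps}| on an evenly spaced grid over [-radius, radius]."""
    grid = [-radius + 2.0 * radius * k / steps for k in range(steps + 1)]
    return max(abs(lam * abs(t) - l1_perturbed(t, lam, eps)) for t in grid)


# The gap eps * lam * log(1 + |theta| / eps) is largest at the ends of the interval.
gaps = [max_gap(1.0, eps) for eps in (1e-2, 1e-4, 1e-8)]
```

In this special case the derivative of the perturbed penalty is λ|θ|/(ε + |θ|), which vanishes at θ = 0, so the perturbed penalty is differentiable at the origin.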

Note that when the MM algorithm converges and S_{k,ε}(β) is maximized by β^(k), it follows by straightforward differentiation that

\frac{\partial S_{k, ε} (β^{(k)})}{\partial β_{j}} = \frac{\partial ℓ (β^{(k)})}{\partial β_{j}} - n p_{λ}^{'} (∣ β_{j}^{(k)} ∣) sgn (β_{j}^{(k)}) \frac{∣ β_{j}^{(k)} ∣}{ε + ∣ β_{j}^{(k)} ∣} = 0.

Thus, when ε is small, the resulting estimator β̂ approximately satisfies the penalized likelihood equation

\frac{\partial Q (\hat{β})}{\partial β_{j}} = \frac{\partial ℓ (\hat{β})}{\partial β_{j}} - n sgn ({\hat{β}}_{j}) p_{λ}^{'} (∣ {\hat{β}}_{j} ∣) = 0.

(3.10)

Suppose that β̂_ε denotes a maximizer of Q_ε(β) and β̂₀ denotes a maximizer of Q(β). In general, it is impossible to bound ||β̂_ε − β̂₀|| as a function of ε because Q(β) may be quite flat near its maximum. However, suppose that Q_ε(β) is upper compact, which means that {β: Q_ε(β) ≥ c} is a compact subset of R^d for any real constant c. In this case, we may obtain the following corollary of Proposition 3.2(b).

Corollary 3.2

Suppose that β̂_ε denotes a maximizer of Q_ε(β). If Q_ε(β) is upper compact for all ε ≥ 0, then under the conditions of Proposition 3.1, any limit point of the sequence {β̂_ε}_{ε↓0} is a maximizer of Q(β).

Both Proposition 3.2(b) and Corollary 3.2 give results as ε ↓ 0, which suggests the use of an algorithm in which ε is allowed to go to zero as the iterations progress. Certainly it would be possible to implement such an algorithm. However, in this article we interpret these results merely as theoretical justification for using the ε perturbation in the first place, and instead we hold ε fixed throughout the algorithms we discuss. The choice of this fixed ε is the subject of the next subsection.

3.3. Choice of ε

Essentially, we want to solve the penalized likelihood equation (3.10) for β̂_j ≠ 0 (recall that p_λ(β_j) is not differentiable at β_j = 0). Suppose, therefore, that convergence is declared in a numerical algorithm whenever |∂Q(β̂)/∂β_j| < τ for a predetermined small tolerance τ. Our algorithm accomplishes this by declaring convergence whenever |∂Q_ε(β̂)/∂β_j| < τ/2, where ε satisfies

∣ \frac{\partial Q_{ε} (\hat{β})}{\partial β_{j}} - \frac{\partial Q (\hat{β})}{\partial β_{j}} ∣ < τ / 2.

(3.11)

Since $p_{λ}^{'} (θ)$ is nonincreasing on (0, ∞), equation (3.6) implies

∣ \frac{\partial Q_{ε} (\hat{β})}{\partial β_{j}} - \frac{\partial Q (\hat{β})}{\partial β_{j}} ∣ = \frac{ε n p_{λ}^{'} (∣ {\hat{β}}_{j} ∣)}{∣ {\hat{β}}_{j} ∣ + ε} \leq \frac{ε n p_{λ}^{'} (0 +)}{∣ {\hat{β}}_{j} ∣}

for β̂_j ≠ 0. Thus, to ensure that inequality (3.11) is satisfied, simply take

ε = \frac{τ}{2 n p_{λ}^{'} (0 +)} \min \{∣ β_{j} ∣ : β_{j} \neq 0\} .

This may of course lead to a different value of ε each time β changes; therefore, in our implementations we fix

ε = \frac{τ}{2 n p_{λ}^{'} (0 +)} \min \{∣ β_{j}^{(0)} ∣ : β_{j}^{(0)} \neq 0\} .

(3.12)

When the algorithm converges, if |∂Q(β̂)/∂β_j| > τ, then β_j is presumed to be zero. In the numerical examples of Section 4, we take τ = 10⁻⁸.
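The rule (3.12) is simple to compute. The following sketch uses hypothetical names (`choose_epsilon`, `dp_zero_plus` for p_λ'(0+)) and also evaluates the upper bound εnp_λ'(0+)/|β̂_j| from the display preceding (3.12), so that one can confirm it does not exceed τ/2 for the components used to set ε.

```python
def choose_epsilon(beta0, n, dp_zero_plus, tau=1e-8):
    """Fixed perturbation from equation (3.12), based on the initial value beta0."""
    nonzero = [abs(b) for b in beta0 if b != 0]
    if not nonzero:
        raise ValueError("beta0 has no nonzero component")
    return tau * min(nonzero) / (2.0 * n * dp_zero_plus)


def gradient_gap_bound(beta_j, n, dp_zero_plus, eps):
    """Upper bound eps * n * p'(0+) / |beta_j| on the gradient discrepancy."""
    return eps * n * dp_zero_plus / abs(beta_j)


# Example with n = 100 and an L1 penalty with lambda = 1, so that p'(0+) = 1.
eps = choose_epsilon([0.5, -0.2, 0.0], n=100, dp_zero_plus=1.0)
bounds = [gradient_gap_bound(b, 100, 1.0, eps) for b in (0.5, -0.2)]
```

Because ε is fixed from β^(0), the bound is only guaranteed for components whose magnitude stays at or above the smallest nonzero |β_j^(0)|.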

3.4. The algorithm

By the ascent property of an MM algorithm, β^(k+1) should be defined at the kth iteration so that

S_{k, ε} (β^{(k + 1)}) > S_{k, ε} (β^{(k)}) .

(3.13)

Note that if β^(k+1) satisfies inequality (3.13) without actually maximizing S_{k,ε}(β), we still refer to the algorithm as an MM algorithm, even though the second M (for “maximize”) isn’t quite accurate. Alternatively, we could adopt the convention used for EM algorithms by Dempster et al. (1977) and refer to such algorithms as generalized MM, or GMM, algorithms; however, in this article we prefer to require extra duty of the label MM and avoid further crowding the field of acronym-named algorithms.

From equation (3.9) we see that S_{k,ε}(β) consists of two parts, ℓ(β) and the sum of quadratic functions $- Φ_{β_{j}^{(k)}, ε} (β_{j})$ of the components of β. The latter part is easy to maximize directly; thus, the difficulty of maximizing S_{k,ε}(β), or at least attaining inequality (3.13), is solely determined by the form of ℓ(β). In general, when ℓ(β) is easy to maximize then so is S_{k,ε}(β), which distinguishes S_{k,ε}(β) from the (ε-perturbed) penalized likelihood Q_ε(β). Even if ℓ(β) is not easily maximized, at least if it is differentiable then so is S_{k,ε}(β), which means inequality (3.13) may be attained using standard gradient-based maximization methods such as Newton-Raphson. The function Q_ε(β), though differentiable, is not easily optimized using gradient-based methods because it is very close to the nondifferentiable function Q(β).

Although it is impossible to detail all possible forms of likelihood functions ℓ(β) to which these methods apply, we begin with the completely general Newton-Raphson-based algorithm

β^{(k + 1)} = β^{(k)} - α_{k} {[\nabla^{2} S_{k, ε} (β^{(k)})]}^{- 1} \nabla S_{k, ε} (β^{(k)}),

(3.14)

where ∇²S_{k,ε}(·) and ∇S_{k,ε}(·) denote the Hessian matrix and gradient vector, respectively, and α_k is some positive scalar. Using the definition (3.9) of S_{k,ε}(β), algorithm (3.14) becomes

β^{(k + 1)} = β^{(k)} - α_{k} {[\nabla^{2} ℓ (β^{(k)}) - n E_{k}]}^{- 1} [\nabla ℓ (β^{(k)}) - n E_{k} β^{(k)}],

(3.15)

where E_k is the diagonal matrix with (j, j)th entry $p_{λ}^{'} (∣ β_{j}^{(k)} ∣ +) / (ε + ∣ β_{j}^{(k)} ∣)$. We take the ordinary maximum likelihood estimate to be the initial value β^(0) in the numerical examples of Section 4.

There are some important special cases. First, we consider the simplest case, which is also by far the most important because of its ubiquity: the linear regression model with normal homoscedastic errors, for which

\nabla ℓ (β) = X^{T} y - X^{T} X β,

(3.16)

where X = (x₁, …, x_n)^T is the design matrix of the regression model and y is the response vector consisting of the y_i. As pointed out at the beginning of Section 2, we omit mention of the error variance parameter σ² here because this parameter is to be estimated using standard methods once β has been estimated. In this case, equation (3.15) with α_k = 1 gives a closed-form maximizer of S_{k,ε}(β) because S_{k,ε}(β) is exactly a quadratic function. The resulting algorithm

β^{(k + 1)} = {[X^{T} X + n E_{k}]}^{- 1} X^{T} y

(3.17)

may be viewed as iterative ridge regression. In the case of LASSO, which uses the L₁ penalty, this algorithm is guaranteed to converge to the unique maximizer of Q_ε(β) (see Corollary 3.3).
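A minimal NumPy sketch of the iterative ridge update (3.17) for the L₁ penalty follows. It is an illustration under stated assumptions rather than the implementation used for the numerical results: the function name `mm_lasso` and the iteration cap are hypothetical, the start is the least squares estimate, ε follows (3.12) with p_λ'(0+) = λ, and iteration stops once every component of the gradient of Q_ε is below τ/2 in absolute value.

```python
import numpy as np


def mm_lasso(X, y, lam, tau=1e-8, max_iter=500):
    """Iterative ridge regression (3.17) for the epsilon-perturbed L1 penalty."""
    n, d = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX, Xty)              # least squares starting value
    nonzero = np.abs(beta[beta != 0])             # assumed nonempty
    eps = tau / (2.0 * n * lam) * nonzero.min()   # equation (3.12), p'(0+) = lam
    for _ in range(max_iter):
        E = np.diag(lam / (eps + np.abs(beta)))   # E_k for the L1 penalty
        beta = np.linalg.solve(XtX + n * E, Xty)  # update (3.17)
        # Gradient of Q_eps at the new iterate, used as the stopping rule.
        grad = Xty - XtX @ beta - n * lam * beta / (eps + np.abs(beta))
        if np.max(np.abs(grad)) < tau / 2.0:
            break
    return beta, eps
```

When the columns of X are orthonormal, the maximizer of the unperturbed Q(β) is the soft-thresholded vector sgn(z_j)(|z_j| − nλ)₊ with z = Xᵀy, which gives a convenient check: the returned coefficients should agree with it up to a discrepancy of the order of ε.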

The slightly more general case of generalized linear models with canonical link includes common procedures such as logistic regression and Poisson regression. In these cases, the Hessian matrix is

\nabla^{2} ℓ (β) = - X^{T} V X,

where V is a diagonal matrix whose (i, i)th entry is given by $v (x_{i}^{T} β)$ and v(·) is the variance function. Therefore, the Hessian matrix is negative definite provided that X is of full rank. This means that the vector −[∇²S_{k,ε}(β^(k))]⁻¹ ∇S_{k,ε}(β^(k)) in equation (3.14) is a direction of ascent (unless of course ∇S_{k,ε}(β^(k)) = 0), which guarantees the existence of a positive α_k such that β^(k+1) satisfies inequality (3.13). A simple way of determining α_k is the practice of step-halving: Try α_k = 2^(−ν) for ν = 0, 1, 2, … until the resulting β^(k+1) satisfies inequality (3.13). For large samples in practice, ℓ(β) tends to be nearly quadratic, particularly in the vicinity of the MLE (which is close to the penalized MLE for large samples), so step-halving does not need to be employed very often. Nonetheless, in our experience it is always wise to check that inequality (3.13) is satisfied; even when the truth of this inequality is guaranteed as in the linear regression model, checking the inequality is often a useful debugging tool. Indeed, whenever possible it is a good idea to check that the objective function itself satisfies Q_ε(β^(k+1)) > Q_ε(β^(k)); for even though this inequality is guaranteed theoretically by inequality (3.13), in practice many a programming error is caught using this simple check.
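The step-halving scheme can be sketched for logistic regression with the L₁ penalty as follows. The function names are hypothetical, the surrogate is computed only up to an additive constant that does not depend on β (which is all that inequality (3.13) requires), and the cap on the number of halvings is an assumption of this sketch.

```python
import numpy as np


def loglik(beta, X, y):
    """Logistic log-likelihood with canonical link; y has entries 0 or 1."""
    eta = X @ beta
    return float(y @ eta - np.sum(np.logaddexp(0.0, eta)))


def surrogate(beta, beta_k, X, y, lam, eps):
    """S_{k,eps}(beta) for the L1 penalty, up to a constant free of beta."""
    weights = lam / (2.0 * (eps + np.abs(beta_k)))
    return loglik(beta, X, y) - X.shape[0] * float(np.sum(weights * beta ** 2))


def mm_step(beta_k, X, y, lam, eps, max_halvings=30):
    """One update (3.15), halving alpha_k until inequality (3.13) holds."""
    n = X.shape[0]
    mu = 1.0 / (1.0 + np.exp(-(X @ beta_k)))       # fitted probabilities
    e = lam / (eps + np.abs(beta_k))               # diagonal of E_k
    grad = X.T @ (y - mu) - n * e * beta_k         # gradient of S at beta_k
    hess = -(X.T * (mu * (1.0 - mu))) @ X - n * np.diag(e)
    direction = -np.linalg.solve(hess, grad)       # direction of ascent
    s_old = surrogate(beta_k, beta_k, X, y, lam, eps)
    for nu in range(max_halvings):
        candidate = beta_k + 2.0 ** (-nu) * direction
        if surrogate(candidate, beta_k, X, y, lam, eps) > s_old:
            return candidate
    return beta_k  # no increase found; beta_k is treated as numerically stationary
```

Calling `mm_step` repeatedly, each time with the previous output as `beta_k`, gives the full iteration; comparing the surrogate values before and after each call is the debugging check recommended above.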

In still more general cases, the Hessian matrix ∇²ℓ(β) may not be negative definite or it may be difficult to compute. If the Fisher information matrix I(β) is known, then −nI(β) may be used in place of ∇²ℓ(β). This leads to

β^{(k + 1)} = β^{(k)} + α_{k} {[I (β^{(k)}) + E_{k}]}^{- 1} [\frac{1}{n} \nabla ℓ (β^{(k)}) - E_{k} β^{(k)}],

and the positive definiteness of the Fisher information will ensure that step-halving will always lead to an increase in S_{k,ε}(β).

Finally, we mention the possibility of applying an MM algorithm to a penalized partial likelihood function. Consider the example of the Cox proportional hazards model (Cox, 1975), which is the most popular model in survival data analysis. The variable selection methods of Section 2 are extended to the Cox model by Fan and Li (2002). Let T_i, C_i and x_i be respectively the survival time, the censoring time and the vector of covariates for the ith individual. Correspondingly, let Z_i = min{T_i, C_i} be the observed time and δ_i = I(T_i ≤ C_i) be the censoring indicator. It is assumed that T_i and C_i are conditionally independent given x_i and that the censoring mechanism is noninformative. Under the proportional hazards model, the conditional hazard function h(t_i|x_i) of T_i given x_i is given by

h (t_{i} ∣ x_{i}) = h_{0} (t_{i}) exp (x_{i}^{T} β),

where h₀(t) is the baseline hazard function. This is a semiparametric model with parameters h₀(t) and β. Denote the ordered uncensored failure times by $t_{1}^{0} \leq \dots \leq t_{N}^{0}$, and let (j) provide the label for the item falling at $t_{j}^{0}$ so that the covariates associated with the N failures are x_(1), …, x_(N). Let $R_{j} = \{i : Z_{i} \geq t_{j}^{0}\}$ denote the risk set right before the time $t_{j}^{0}$. A partial likelihood is given by

ℓ_{P} (β) = \sum_{j = 1}^{N} [x_{(j)}^{T} β - log {\sum_{i \in R_{j}} exp (x_{i}^{T} β)}] .

Fan and Li (2002) consider variable selection via maximization of the penalized partial likelihood

ℓ_{P} (β) - n \sum_{j = 1}^{d} p_{λ} (∣ β_{j} ∣) .

It can be shown that the Hessian matrix of ℓ_P(β) is negative definite provided that X is of full rank. Under certain regularity conditions, it can further be shown that in the neighborhood of a maximizer, the partial likelihood is nearly quadratic for large n.
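The function ℓ_P(β) displayed above can be evaluated directly from the observed times, the censoring indicators and the covariates. The sketch below is a plain-Python illustration with hypothetical names; it applies the displayed formula as written, with no special handling of tied failure times, and it costs O(n²) operations.

```python
import math


def cox_partial_loglik(beta, times, events, covariates):
    """Sum over failures of x_(j)' beta - log(sum over the risk set of exp(x_i' beta))."""
    def linpred(x):
        return sum(b * v for b, v in zip(beta, x))

    total = 0.0
    for t_j, delta_j, x_j in zip(times, events, covariates):
        if not delta_j:
            continue  # censored observations enter only through the risk sets
        risk_sum = sum(math.exp(linpred(x_i))
                       for z_i, x_i in zip(times, covariates) if z_i >= t_j)
        total += linpred(x_j) - math.log(risk_sum)
    return total
```

At β = 0 the value reduces to minus the sum of the logarithms of the risk set sizes, which is a quick sanity check for any implementation.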

3.5. Convergence

It is not possible to prove that a generic MM algorithm converges at all, and when an MM algorithm does converge, there is no guarantee that it converges to a global maximum. For example, there are well-known pathological examples in which EM algorithms — or generalized EM algorithms, as discussed following inequality (3.13) — converge to saddle points or fail to converge (McLachlan and Krishnan, 1997). Nonetheless, it is often possible to obtain convergence results in specific cases.

We define a stationary point of the function Q_ε(β) to be any point β at which the gradient vector is zero. Because the differentiable function S_{k,ε}(β) is tangent to Q_ε(β) at the point β^(k) by the minorization property, the gradient vectors of S_{k,ε}(β) and Q_ε(β) are equal when evaluated at β^(k). Thus, when using the method of Section 3.4 to maximize S_{k,ε}(β), we see that fixed points of the algorithm (i.e., points with gradient zero) coincide with stationary points of Q_ε(β). Letting M(β) denote the map implicitly defined by the algorithm that takes β^(k) to β^(k+1) for any point β^(k), inequality (3.13) states that S_{k,ε}{M(β)} > S_{k,ε}(β). The limit points of the set {β^(k): k = 0, 1, 2, …} are characterized by the following slightly modified version of Lyapunov’s theorem (Lange, 1995).

Proposition 3.3

Given an initial value β^{(0)}, let β^{(k)} = M^k(β^{(0)}). If Q_ε(β) = Q_ε{M(β)} only for stationary points β of Q_ε and if β^* is a limit point of the sequence {β^{(k)}} such that M(β) is continuous at β^*, then β^* is a stationary point of Q_ε(β).

Equation (3.14) with α_k = 1 gives

M (β^{(k)}) = β^{(k)} - {\nabla^{2} S_{k, ε} (β^{(k)})}^{- 1} \nabla Q_{ε} (β^{(k)}),

(3.18)

where equation (3.18) uses the fact that ∇S_{k,ε}(β^{(k)}) = ∇Q_ε(β^{(k)}). As discussed in Lange (1995) and Lange et al. (2000), the derivative of M(β) gives insight into the local convergence properties of the algorithm. Suppose that ∇Q_ε(β^{(k)}) = 0, so β^{(k)} is a stationary point. In this case, differentiating equation (3.18) gives

\nabla M (β^{(k)}) = {\nabla^{2} S_{k, ε} (β^{(k)})}^{- 1} {\nabla^{2} S_{k, ε} (β^{(k)}) - \nabla^{2} Q_{ε} (β^{(k)})} .

It is possible to write ∇²S_{k,ε}(β^{(k)}) − ∇²Q_ε(β^{(k)}) in closed form as nA_k, where $A_{k} = diag {a (β_{1}^{(k)}), \dots, a (β_{d}^{(k)})}$ and

a (t) = \frac{∣ t ∣}{ε + ∣ t ∣} {p_{λ}^{″} (∣ t ∣ +) - \frac{p_{λ}^{'} (∣ t ∣ +)}{ε + ∣ t ∣}} .

Under the conditions of Proposition 3.2, $p_{λ}^{″} (∣ t ∣ +) \leq 0$ and, because p_λ(·) is nondecreasing, $p_{λ}^{'} (∣ t ∣ +) \geq 0$; thus nA_k is negative semidefinite, a fact that may also be interpreted as a consequence of the minorization of Q_ε(β) by S_{k,ε}(β). Furthermore, ∇²ℓ(β^{(k)}) is often negative definite, as pointed out in Section 3.4, which implies that ∇²S_{k,ε}(β^{(k)}) is negative definite. This fact, together with the fact that ∇²S_{k,ε}(β^{(k)}) − ∇²Q_ε(β^{(k)}) is negative semidefinite, implies that the eigenvalues of ∇M(β^{(k)}) are all contained in the interval [0, 1) (Hestenes, 1981). Ostrowski’s theorem (Ortega, 1990) thus implies that the MM algorithm defined by equation (3.18) is locally attracted to β^{(k)} and that the rate of convergence to β^{(k)} in a neighborhood of β^{(k)} is linear with rate equal to the largest eigenvalue of ∇M(β^{(k)}). In other words, if β^* is a stationary point and ρ < 1 is the largest eigenvalue of ∇M(β^*), then for any δ > 0 such that ρ + δ < 1, there exists a neighborhood N_δ containing β^* such that for all β ∈ N_δ,

∥ M (β) - β^{*} ∥ \leq (ρ + δ) ∥ β - β^{*} ∥ .

Further details about the rate of convergence for similar algorithms may be found in Lange (1995) and Lange et al. (2000).
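The eigenvalue claim above can be checked numerically in a simple case. The sketch below is illustrative only and rests on several assumptions not stated in this section: a least-squares log-likelihood ℓ(β) = −½‖y − Xβ‖² (so that ∇²ℓ = −XᵀX), the L₁ penalty p_λ(θ) = λθ (so that p′ = λ and p″ = 0), and ∇²S_{k,ε} = ∇²ℓ − nE_k with E_k = diag{p′_λ(|β_j|+)/(ε + |β_j|)}. The function name and the numerical values are hypothetical. The matrix computed equals ∇M only at a stationary point, but its eigenvalues can be inspected at any β.

```python
import numpy as np

def mm_map_jacobian_l1(X, beta, lam, eps):
    """Return {grad^2 S}^{-1} (n A) for least squares with an L1 penalty."""
    n = X.shape[0]
    abs_b = np.abs(beta)
    hess_loglik = -X.T @ X                           # Hessian of -0.5*||y - X beta||^2
    E = np.diag(lam / (eps + abs_b))                 # assumed form of E_k
    A = np.diag(-lam * abs_b / (eps + abs_b) ** 2)   # a(t) with p' = lam, p'' = 0
    hess_S = hess_loglik - n * E                     # Hessian of the surrogate
    return np.linalg.solve(hess_S, n * A)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
beta = np.array([1.5, 0.0, -0.7, 0.2])
J = mm_map_jacobian_l1(X, beta, lam=0.1, eps=1e-4)
eigs = np.sort(np.linalg.eigvals(J).real)
print(eigs)
```

A coordinate with β_j = 0 contributes a zero diagonal entry to A_k and hence a zero eigenvalue, which is consistent with the interval [0, 1) being closed on the left.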

Lyapunov’s theorem (Proposition 3.3) gives a necessary condition for a point to be a limit point of a particular MM algorithm. To conclude this section, we consider a sufficient condition for the existence of a limit point. Suppose that the function Q_ε(β) is upper compact, as defined in Section 3.2. Then given the initial parameter vector β^{(0)}, the set B = {β ∈ R^d: Q_ε(β) ≥ Q(β^{(0)})} is compact; furthermore, by equations (3.6) and (3.7), Q_ε(β) ≥ Q(β) so that B contains the entire sequence ${β^{(k)}}_{k = 0}^{\infty}$. This guarantees that the sequence has at least one limit point, which must therefore be a stationary point of Q_ε(β) by Proposition 3.3. If in addition there is no more than one stationary point (for example, if Q_ε(β) is strictly concave), then we may conclude that the algorithm must converge to the unique stationary point.

Upper compactness of Q_ε(β) follows as long as Q_ε(β) → −∞ whenever ||β|| → ∞; this is often not difficult to prove for specific examples. In the particular case of the L₁ penalty (LASSO), strict concavity also holds as long as ℓ (β) is strictly concave, which implies the following corollary.

Corollary 3.3

If p_λ(|θ|) = λ|θ| and ℓ(β) is strictly concave and upper compact, then the MM algorithm of equation (3.15) gives a sequence {β^{(k)}} converging to the unique maximizer of Q_ε(β) for any ε > 0.

In particular, Corollary 3.3 implies that using our algorithm with the ε-perturbed LASSO penalty guarantees convergence to the maximum penalized likelihood estimator for any full-rank generalized linear model or (say) Cox proportional hazards model. However, strict concavity of Q_ε(β) is not typical for other penalty functions presented in this article in light of the requirement in Proposition 3.2 that p_λ(·) be concave — and hence that −p_λ(·) be convex — on (0, ∞). This fact means that when an MM algorithm using some penalty function other than L₁ converges, then it may converge to a local, rather than a global, maximizer of Q_ε(β). This can actually be an advantage, since one might like to know if the penalized likelihood has multiple local maxima.
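The behavior described by Corollary 3.3 can be illustrated with a small sketch of the MM iteration for the ε-perturbed LASSO. Several ingredients here are assumptions made for the illustration rather than statements from this section: the log-likelihood is taken to be ℓ(β) = −½‖y − Xβ‖², the perturbed penalty is taken to be p_{λ,ε}(θ) = λ{θ − ε log(1 + θ/ε)}, and the surrogate curvature is E_k = diag{λ/(ε + |β_j^{(k)}|)}. Under these assumptions the surrogate is quadratic, so each MM step has the closed form (XᵀX + nE_k)⁻¹Xᵀy. Function names and simulated data are hypothetical.

```python
import numpy as np

def q_eps(beta, X, y, lam, eps):
    """Perturbed penalized objective Q_eps for least squares with an L1 penalty."""
    n = X.shape[0]
    abs_b = np.abs(beta)
    penalty = lam * (abs_b - eps * np.log1p(abs_b / eps))
    return -0.5 * np.sum((y - X @ beta) ** 2) - n * np.sum(penalty)

def mm_lasso(X, y, lam, eps=1e-6, max_iter=500, tol=1e-10):
    """Iterate the MM map; each step maximizes the quadratic surrogate exactly."""
    n = X.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start at least squares
    history = [q_eps(beta, X, y, lam, eps)]
    for _ in range(max_iter):
        E = np.diag(lam / (eps + np.abs(beta)))
        beta_new = np.linalg.solve(X.T @ X + n * E, X.T @ y)
        history.append(q_eps(beta_new, X, y, lam, eps))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, np.array(history)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, 0.0, 1.5, 0.0, 0.0]) + rng.standard_normal(100)
beta_hat, history = mm_lasso(X, y, lam=0.2)
print(np.round(beta_hat, 4))
```

The recorded values of Q_ε should be nondecreasing along the iterations, which is the ascent property that underlies the convergence argument.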

4. Numerical Examples

Since Fan and Li (2001) have already compared the performance of LASSO with SCAD using a local quadratic approximation and other existing methods, in the following four numerical examples we focus on assessing the performance of the proposed algorithms using the SCAD penalty (2.2). Namely, we compare the unmodified LQA to our modified version (both using SCAD), where ε is chosen according to equation (3.12) with τ = 10⁻⁸. For SCAD, $p_{λ}^{'} (0 +) = λ$ , and this tuning parameter λ is chosen using generalized cross-validation, or GCV (Craven and Wahba, 1979). As suggested by Fan and Li (2001), we take a = 3.7 in the definition of SCAD.

The Newton-Raphson algorithm (3.15) enables a standard error estimate via a sandwich formula:

\hat{cov} (\hat{β}) = {\nabla^{2} ℓ (\hat{β}) - n E_{k}}^{- 1} \hat{cov} {\nabla S_{k, ε} (\hat{β})} {\nabla^{2} ℓ (\hat{β}) - n E_{k}}^{- 1},

(4.1)

where

\begin{array}{l} \hat{cov} {\nabla S_{k, ε} (β)} = \frac{1}{n} \sum_{i = 1}^{n} [\nabla ℓ_{i} (β) - n E_{k} β] {[\nabla ℓ_{i} (β) - n E_{k} β]}^{T} \\ - [\frac{1}{n} \sum_{i = 1}^{n} \nabla ℓ_{i} (β) - n E_{k} β] {[\frac{1}{n} \sum_{i = 1}^{n} \nabla ℓ_{i} (β) - n E_{k} β]}^{T} \\ = \frac{1}{n} \sum_{i = 1}^{n} [\nabla ℓ_{i} (β)] {[\nabla ℓ_{i} (β)]}^{T} - [\frac{1}{n} \sum_{i = 1}^{n} \nabla ℓ_{i} (β)] {[\frac{1}{n} \sum_{i = 1}^{n} \nabla ℓ_{i} (β)]}^{T} . \end{array}

Naturally, another estimate may be formed if −nI(β̂) is substituted for ∇²ℓ(β̂) in equation (4.1). Fan and Peng (2004) establish the consistency of this sandwich formula for related problems, and their method of proof may be adapted to this situation, though we do not do so in this article.

For the simulated examples, Examples 1 through 3, we compare the performance of the proposed procedures along with AIC and BIC in terms of model error and model complexity. With μ(x) = E(Y|x), model error (ME) is defined as E{μ̂(x) − μ(x)}², where the expectation is taken with respect to a new observation x. The MEs of the underlying procedures are divided by that of the ordinary maximum likelihood estimate, so we report relative model error (RME).

Example 1 (Linear regression)

In this example, we generated 500 data sets, each of which consists of 100 observations from the model

y = x^{T} β + ε,

where β is a 12-dimensional vector whose first, fifth and ninth components are 3, 1.5 and 2 respectively, and whose other components equal 0. The components of x and ε are standard normal and the correlation between x_i and x_j is taken to be ρ. In our simulation, we consider three cases: ρ = 0.1, ρ = 0.5, and ρ = 0.9. In this case, there is a closed form for the model error, namely ME(β̂) = (β̂−β)^T cov(x)(β̂−β). The median of the relative model error (RME) over 500 simulated data sets is summarized in Table 1. The average number of 0 coefficients is also reported in Table 1, in which the column labelled “C” gives the average number of coefficients, of the nine true zeros, correctly set to zero and the column labelled “I” gives the average number of the three true nonzeros incorrectly set to zero.
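As a concrete illustration of this setup, the sketch below simulates one data set and evaluates the closed-form model error for the ordinary least squares fit. It assumes that every pair of covariates has the same correlation ρ (an equicorrelated covariance matrix), which is one reading of the description above; the function names and the seed are illustrative.

```python
import numpy as np

def equicorrelated_cov(d, rho):
    """Covariance with unit variances and common pairwise correlation rho."""
    return (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))

def model_error(beta_hat, beta, cov_x):
    """Closed-form ME = (beta_hat - beta)' cov(x) (beta_hat - beta)."""
    diff = beta_hat - beta
    return float(diff @ cov_x @ diff)

rng = np.random.default_rng(0)
d, n, rho = 12, 100, 0.5
beta = np.zeros(d)
beta[[0, 4, 8]] = [3.0, 1.5, 2.0]        # first, fifth and ninth components
cov_x = equicorrelated_cov(d, rho)
X = rng.multivariate_normal(np.zeros(d), cov_x, size=n)
y = X @ beta + rng.standard_normal(n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
me_ols = model_error(beta_ols, beta, cov_x)
print(me_ols)
```

Dividing the model error of any other estimator by `me_ols` gives the relative model error for that data set; the tables report the median of this ratio over the simulated data sets.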

Table 1.

Relative Model Errors for Linear Regression

	ρ = 0.9			ρ = 0.5			ρ = 0.1
Method	RME (median)	Zeros (C)	Zeros (I)	RME (median)	Zeros (C)	Zeros (I)	RME (median)	Zeros (C)	Zeros (I)
New	.437	8.346	0	.215	8.708	0	.238	8.292	0
LQA	.590	7.772	0	.237	8.680	0	.269	8.272	0
BIC	.337	8.644	0	.335	8.652	0	.328	8.656	0
AIC	.672	7.358	0	.673	7.324	0	.668	7.374	0
Oracle	.201	9.000	0	.202	9	0	.211	9	0

In Table 1, New and LQA refer to the newly proposed algorithm and the local quadratic approximation algorithm of Fan and Li (2001). AIC and BIC stand for the best subset variable selection procedures that minimize AIC scores and BIC scores. Finally, “Oracle” stands for the oracle estimate computed by using the true model y = β₁x₁ + β₅x₅ + β₉x₉ + ε. When the correlation among the covariates is small or moderate, we see that the new algorithm performs the best in terms of model error and LQA also performs very well; their RMEs are both very close to those of the oracle estimator. When the covariates are highly correlated, the new algorithm outperforms LQA in terms of both model error and model complexity. The performance of BIC and AIC remains almost the same for the three cases in this example; Table 1 indicates that BIC performs better than AIC.

We now test the accuracy of the proposed standard error formula. The standard deviation of the estimated coefficients for the 500 simulated data sets, denoted by SD, can be regarded as the true standard deviation except for Monte Carlo error. The average of the estimated standard errors for the 500 simulated data sets, denoted by SE, and their standard deviation, denoted by std(SE), gauge the overall performance of the standard error formula. Table 2 only presents the SD, SE, std(SE) of β₁. The results for other coefficients are similar. In Table 2, LSE stands for the ordinary least squares estimate; other notation is the same as that in Table 1. The differences between SD and SE are less than twice std(SE), which suggests that the proposed standard error formula works fairly well. However, the SE appears to consistently underestimate the SD, a common phenomenon (see Kauermann and Carroll, 2001), so it may benefit from some slight modification.

Table 2.

Standard deviations and standard errors of β̂₁ in the linear regression model

	ρ = 0.9		ρ = 0.5		ρ = 0.1
Method	SD	SE (std(SE))	SD	SE (std(SE))	SD	SE (std(SE))
LSE	.339	.322(.035)	.152	.144(.016)	.114	.109(.012)
New	.303	.260(.036)	.129	.120(.017)	.104	.098(.014)
LQA	.315	.265(.037)	.128	.120(.017)	.105	.098(.014)
BIC	.295	.269(.028)	.133	.124(.013)	.105	.101(.010)
AIC	.322	.278(.029)	.145	.128(.013)	.109	.101(.010)
Oracle	.270	.264(.027)	.126	.124(.013)	.103	.102(.010)

Example 2 (Logistic regression)

In this example, we assess the performance of the proposed algorithm for a logistic regression model. We generated 500 data sets, each of which consists of 200 observations, from the logistic regression model

μ (x) \equiv {P (Y = 1 ∣ x)} = \frac{exp (x^{T} β)}{1 + exp (x^{T} β)},

(4.2)

where β is a 9-dimensional vector whose first, fourth and seventh components are 3, 1.5 and 2 respectively, and whose other components equal 0. The components of x are standard normal, where the correlation between x_i and x_j is ρ. In our simulation, we consider two cases, ρ = 0.25 and ρ = 0.75. Unlike the model error for linear regression models, there is no closed form of ME for the logistic regression model in this example. The ME, summarized in Table 3, is estimated by 50,000 Monte Carlo simulations. Notation in Table 3 is the same as that in Table 1. It can be seen from Table 3 that the newly proposed algorithm performs better than LQA in terms of model error and model complexity. We further test the accuracy of the standard error formula derived by using the sandwich formula (4.1). The results are similar to those in Table 2 — the proposed standard error formula works fairly well — so they are omitted here.
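Because no closed form is available here, the model error must be approximated by simulation. The following sketch shows one way such a Monte Carlo estimate could be computed for model (4.2); it assumes equicorrelated standard normal covariates, and the function names, the seed and the perturbed estimate used in the demonstration are hypothetical.

```python
import numpy as np

def expit(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def mc_model_error(beta_hat, beta, cov_x, n_draws=50_000, seed=0):
    """Monte Carlo estimate of E{mu_hat(x) - mu(x)}^2 over new covariates x."""
    rng = np.random.default_rng(seed)
    x_new = rng.multivariate_normal(np.zeros(len(beta)), cov_x, size=n_draws)
    return float(np.mean((expit(x_new @ beta_hat) - expit(x_new @ beta)) ** 2))

d, rho = 9, 0.25
beta = np.zeros(d)
beta[[0, 3, 6]] = [3.0, 1.5, 2.0]        # first, fourth and seventh components
cov_x = (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))

beta_perturbed = beta + 0.1              # stand-in for a fitted estimate
me_perturbed = mc_model_error(beta_perturbed, beta, cov_x)
print(me_perturbed)
```

Using the same seed for every estimator evaluates all of them on a common set of new covariates, which removes one source of Monte Carlo noise when forming ratios.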

Table 3.

Relative Model Errors for Logistic Regression

	ρ = 0.25			ρ = 0.75
Method	RME (median)	Zeros (C)	Zeros (I)	RME (median)	Zeros (C)	Zeros (I)
New	.277	5.922	0	.528	5.534	0.222
LQA	.368	5.728	0	.644	4.970	0.090
BIC	.304	5.860	0	.399	5.796	0.304
AIC	.673	4.930	0	.683	4.860	0.092
Oracle	.241	6	0	.216	6	0

Best variable subset selection using the BIC criterion is seen in Table 3 to perform quite well relative to other methods. However, best subset selection in this example requires an exhaustive search over all possible subsets, and therefore it is computationally expensive. The methods we propose can dramatically reduce computational cost. To demonstrate this point, random samples of size 200 were generated from model (4.2) with β being a d-dimensional vector whose first, fourth and seventh components are 3, 1.5 and 2 respectively, and whose other components equal 0. Table 4 depicts the average computing time for each simulation with d = 8, …, 11 and indicates that computing times for the BIC and AIC criteria increase exponentially with the dimension d, making these methods impractical for parameter sets much larger than those tested here. Given the increasing importance of variable selection problems in fields like genetics and data mining where the number of variables is measured in the hundreds or even thousands, efficiency of algorithms is an important consideration.

Table 4.

Computing Time for the Logistic Model (Seconds per Simulation)

ρ	d	New	LQA	BIC	AIC
0.25	8	0.287	0.142	2.701	2.699
	9	0.316	0.151	5.702	5.694
	10	0.348	0.180	11.761	11.754
	11	0.424	0.199	26.702	26.576

0.75	8	0.395	0.199	2.171	2.166
	9	0.438	0.205	4.554	4.553
	10	0.452	0.225	9.499	9.518
	11	0.532	0.244	19.915	19.959
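The exponential growth visible in Table 4 follows from the number of candidate submodels, which doubles with each additional variable. The short sketch below counts the subsets by explicit enumeration and compares the doubling with the BIC timings for ρ = 0.25 copied from Table 4; the function name is illustrative, and the count assumes the search includes the empty model.

```python
from itertools import combinations

def count_submodels(d):
    """Count all subsets of d candidate variables by explicit enumeration."""
    return sum(1 for k in range(d + 1) for _ in combinations(range(d), k))

# BIC timings (seconds per simulation) for rho = 0.25, copied from Table 4.
bic_seconds = {8: 2.701, 9: 5.702, 10: 11.761, 11: 26.702}

growth = {d: bic_seconds[d + 1] / bic_seconds[d] for d in (8, 9, 10)}
for d in (8, 9, 10):
    print(d, count_submodels(d), round(growth[d], 2))
```

Each growth ratio is close to 2, as expected when the work is proportional to the 2^d subsets searched.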

Example 3 (Cox model)

We investigate the performance of the proposed algorithm for the Cox proportional hazards model in this example. We simulated 500 data sets each for sample sizes n = 40, 50 and 60 from the exponential hazard model

h (t ∣ x) = exp (x^{T} β),

(4.3)

where β = (0.8, 0, 0, 1, 0, 0, 0.6, 0)^T. This model is used in the simulation study of Fan and Li (2002). The x_u’s were marginally standard normal and the correlation between x_u and x_v was ρ^{|u−v|} with ρ = 0.5. The distribution of the censoring time is an exponential distribution with mean U exp(x^Tβ₀), where U is randomly generated from the uniform distribution over [1,3] for each simulated data set so that 30% of the data are censored. Here β₀ = β is regarded as a known constant so that the censoring scheme is noninformative. The model error E{μ̂(x) − μ(x)}² is estimated by 50,000 Monte Carlo simulations and is summarized in Table 5. The performance of the newly proposed algorithm is similar to that of LQA. Both the new algorithm and LQA perform better than best subset variable selection with the AIC or BIC criteria. Note that the BIC criterion is a consistent variable selection criterion. Therefore, as the sample size increases, its performance becomes closer to that of the nonconcave penalized partial likelihood procedures. We also test the accuracy of the standard error formula derived by using the sandwich formula (4.1). The results are similar to those in Table 2; the proposed standard error formula works fairly well.
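The data-generating scheme just described can be sketched as follows. This is a literal transcription of the description under stated assumptions: failure times are exponential with rate exp(xᵀβ), censoring times are exponential with mean U exp(xᵀβ₀) with β₀ = β, and U is drawn once per data set. The function name, argument names and seed are hypothetical, and the resulting censoring fraction is not checked here.

```python
import numpy as np

def simulate_cox_data(n, beta, rho=0.5, seed=0):
    """Simulate covariates, observed times and failure indicators."""
    rng = np.random.default_rng(seed)
    d = len(beta)
    idx = np.arange(d)
    cov_x = rho ** np.abs(idx[:, None] - idx[None, :])   # corr(x_u, x_v) = rho^|u-v|
    X = rng.multivariate_normal(np.zeros(d), cov_x, size=n)
    eta = X @ beta
    T = rng.exponential(scale=np.exp(-eta))       # hazard exp(eta) means mean exp(-eta)
    U = rng.uniform(1.0, 3.0)                     # one draw per simulated data set
    C = rng.exponential(scale=U * np.exp(eta))    # censoring mean as stated in the text
    Z = np.minimum(T, C)                          # observed time
    delta = (T <= C).astype(int)                  # 1 = failure observed, 0 = censored
    return X, Z, delta

beta_true = np.array([0.8, 0.0, 0.0, 1.0, 0.0, 0.0, 0.6, 0.0])
X, Z, delta = simulate_cox_data(50, beta_true)
print(X.shape, Z[:3], delta[:10])
```

The triple (X, Z, delta) is the input needed to evaluate a partial likelihood of the form given in Section 3.4.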

Table 5.

Relative Model Errors for the Cox Model

	n = 40			n = 50			n = 60
Method	RME (median)	Zeros (C)	Zeros (I)	RME (median)	Zeros (C)	Zeros (I)	RME (median)	Zeros (C)	Zeros (I)
New	.173	4.790	1.396	.288	4.818	1.144	.324	4.882	.904
LQA	.174	4.260	.626	.296	4.288	.440	.303	4.332	.260
BIC	.247	4.492	.606	.337	4.564	.442	.344	4.624	.272
AIC	.470	3.880	.358	.551	3.948	.240	.577	3.986	.160
Oracle	.103	5	0	.152	5	0	.215	5	0

As in Example 2, we take β to be a d-dimensional vector whose first, fourth and seventh components are nonzero (0.8, 1.0, and 0.6, respectively) and whose other components equal 0. Table 6 shows that the proposed algorithm and LQA can dramatically save computing time compared with AIC and BIC.

Table 6.

Computing Time for the Cox Model (Seconds per Simulation)

n	d	New	LQA	BIC	AIC
40	8	0.248	0.147	0.415	0.418
	9	0.258	0.140	0.843	0.843
	10	0.299	0.149	1.711	1.712
	11	0.327	0.162	3.680	3.675

50	8	0.320	0.197	0.588	0.591
	9	0.364	0.200	1.218	1.225
	10	0.406	0.217	2.532	2.531
	11	0.466	0.211	5.189	5.171

60	8	0.417	0.263	0.820	0.827
	9	0.474	0.268	1.795	1.768
	10	0.513	0.279	3.454	3.454
	11	0.574	0.288	7.219	7.162

As a referee pointed out, it is of interest to investigate the performance of variable selection algorithms when the “full” model is misspecified. Model misspecification is a concern for all variable selection procedures, not merely those discussed in this article. To address this issue, we generated a random sample from model (4.3) with β = (0.8, 0, 0, 1, 0, 0, 0.6, 0, β₉, β₁₀)^T, where β₉ = β₁₀ = 0.2 or 0.4. The first 8 components of x are the same as those in the above simulation. We take $x_{9} = (x_{1}^{2} - 1) / \sqrt{2}$ , and $x_{10} = (x_{2}^{2} - 1) / \sqrt{2}$ . In our model fitting, our “full” model only uses the first 8 components of x. Thus, we misspecified the full model by ignoring the last two components. Based on the “full” model, variable selection procedures are carried out. The oracle procedure uses (x₁, x₄, x₇, x₉, x₁₀)^T to fit the data. The model error E{μ̂(x)−μ(x)}², where μ(x) is the mean function of the true model including all 10 components of x, is estimated by 50,000 Monte Carlo simulations and is summarized in Table 7, from which we can see that all variable selection procedures outperform the full model. This implies that selecting significant variables can dramatically reduce both model error and model complexity. From Table 7, we can see that both the newly proposed algorithm and LQA significantly reduce the model error of best subset variable selection using AIC or BIC.

Table 7.

Relative Model Errors for Misspecified Cox Models

		n = 40		n = 50		n = 60
(β₉, β₁₀)	Method	RME	# of Zeros	RME	# of Zeros	RME	# of Zeros
(0.2, 0.2)	New	0.155	6.276	0.259	6.142	0.298	5.906
	LQA	0.146	4.992	0.251	4.822	0.294	4.662
	BIC	0.244	5.186	0.356	5.020	0.487	4.932
	AIC	0.441	4.336	0.564	4.162	0.657	4.104
	Oracle	0.254	5	0.194	5	0.239	5

(0.4, 0.4)	New	0.205	6.544	0.323	6.384	0.479	6.132
	LQA	0.197	5.166	0.324	4.924	0.469	4.806
	BIC	0.345	5.352	0.477	5.156	0.593	5.110
	AIC	0.560	4.444	0.670	4.236	0.755	4.240
	Oracle	0.268	5	0.228	5	0.273	5

Example 4 (Environmental data)

In this example, we illustrate the proposed algorithm in the context of analysis of an environmental data set. This data set consists of the number of daily hospital admissions for circulation and respiration problems and daily measurements of air pollutants. It was collected in Hong Kong from January 1, 1994 to December 31, 1995 (courtesy of T. S. Lau). Of interest is the association between levels of pollutants and the total number of daily hospital admissions for circulatory and respiratory problems. The response is the number of admissions and the covariates X₁ to X₃ are the levels (in μg/m³) of the pollutants sulfur dioxide, nitrogen dioxide, and dust. Because the response is count data, it is reasonable to use a Poisson regression model with mean μ(x) to analyze this data set. To reduce modeling bias, we include all linear, quadratic and interaction terms among the three air pollutants in our full model. Since empirical studies show that there may be a trend over time, we allow an intercept depending on time, the date on which observations were collected. In other words, we consider the following model:

\begin{matrix} log {μ (x)} = β_{0} (t) + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3} + β_{4} X_{1}^{2} + β_{5} X_{2}^{2} \\ + β_{6} X_{3}^{2} + β_{7} X_{1} X_{2} + β_{8} X_{1} X_{3} + β_{9} X_{2} X_{3} . \end{matrix}

We further parameterize the intercept function by a cubic spline

β_{0} (t) = β_{00} + β_{01} t + β_{02} t^{2} + β_{03} t^{3} + \sum_{j = 1}^{5} β_{0 (j + 4)} {(t - k_{j})}_{+}^{3},

where the knots k_j are chosen to be the 10th, 25th, 50th, 75th and 90th percentiles of t. Thus, we are dealing with a Poisson regression model with 18 variables. To avoid numerical instability, the time variable and the air pollutant variables are standardized.
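The 18-column design described above can be assembled as in the following sketch. It is illustrative only: the data are synthetic stand-ins for the Hong Kong series, the function names are hypothetical, and standardizing the raw time and pollutant variables before forming powers and products is an assumption about the order of operations.

```python
import numpy as np

def standardize(a):
    """Center each column and scale it to unit standard deviation."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0)

def intercept_spline_basis(t, percentiles=(10, 25, 50, 75, 90)):
    """Columns 1, t, t^2, t^3 and (t - k_j)_+^3 for knots at the given percentiles."""
    t = np.asarray(t, dtype=float)
    knots = np.percentile(t, percentiles)
    poly = np.column_stack([t ** p for p in range(4)])
    trunc = np.column_stack([np.clip(t - k, 0.0, None) ** 3 for k in knots])
    return np.hstack([poly, trunc])

def pollutant_terms(P):
    """Linear, quadratic and interaction terms in the order of beta_1, ..., beta_9."""
    x1, x2, x3 = P[:, 0], P[:, 1], P[:, 2]
    return np.column_stack([x1, x2, x3, x1**2, x2**2, x3**2, x1*x2, x1*x3, x2*x3])

rng = np.random.default_rng(0)
n_days = 730                                     # two years of daily records
t = standardize(np.arange(n_days))
P = standardize(rng.gamma(shape=2.0, scale=20.0, size=(n_days, 3)))  # synthetic levels

design = np.hstack([intercept_spline_basis(t), pollutant_terms(P)])
print(design.shape)
```

Nine columns come from the intercept spline (four polynomial terms plus five truncated cubics) and nine from the pollutant terms, matching the count of 18 variables.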

Generalized cross-validation is used to select the tuning parameter λ for the new algorithm using SCAD. The plot of the GCV scores against λ is depicted in Figure 2(a), and the selected λ equals 0.1933. In Table 8, we see that all linear terms are very significant, whereas the quadratic terms of SO₂ and dust and the SO₂ × dust interaction are not significant. The plot of the estimated intercept function β₀(t) is depicted in Figure 2(b) along with the estimated intercept function under the full model. The two estimated intercept functions are almost identical and capture the time trend very well, but the new algorithm saves two degrees of freedom by deleting the t³ and ${(t - k_{3})}_{+}^{3}$ terms from the intercept function.

Fig. 2 — In (a), the solid line indicates the GCV scores for SCAD using the new algorithm, and the dashed line indicates the same thing for the LQA algorithm. In (b), the solid and thick dashed lines that nearly coincide indicate the estimated intercept functions for the new algorithm and the full model, respectively; the dash-dotted line is for LQA. The dots are the full-model residuals log(y) − x^Tβ̂_MLE

Table 8.

Estimated Coefficients and their Standard Errors

Covariate	MLE	New	LQA
SO₂	0.0082 (0.0041)	0.0043 (0.0024)	0.0090 (0.0029)
NO₂	0.0238 (0.0051)	0.0260 (0.0037)	0.0311 (0.0033)
Dust	0.0195 (0.0054)	0.0173 (0.0037)	0.0043 (0.0026)
SO₂²	−0.0029 (0.0013)	0 (0.0009)	−0.0025 (0.0010)
NO₂²	0.0204 (0.0043)	0.0118 (0.0029)	0.0157 (0.0031)
Dust²	0.0042 (0.0025)	0 (0.0015)	0.0060 (0.0018)
SO₂ × NO₂	−0.0120 (0.0039)	−0.0050 (0.0021)	−0.0074 (0.0022)
SO₂ × Dust	0.0086 (0.0047)	0 (< 0.00005)	0 (NA)
NO₂ × Dust	−0.0305 (0.0061)	−0.0176 (0.0037)	−0.0262 (0.0041)

Note: NA stands for “Not available”.

For the purpose of comparison, SCAD using the LQA is also applied to this data set. The plot of the GCV scores against λ is also depicted in Figure 2(a), and the selected λ again equals 0.1933. In this case, not only the t³ and ${(t - k_{3})}_{+}^{3}$ terms but also the t and t² terms are deleted from the intercept function. The estimated intercept function is the dash-dotted curve in Figure 2(b); it looks dramatically different from the one estimated under the full model, and it appears to do a poor job of capturing the overall trend. The LQA estimates shown in Table 8 are quite different from those obtained using the new algorithm, even though they both use the same tuning parameter. Recall that LQA suffers the drawback that once a parameter is deleted, it cannot reenter the model, which appears to have had a major impact on the LQA model estimates in this case. Standard errors in Table 8 are available for the deleted coefficients in the new model but not LQA because the two algorithms use different deletion criteria.

5. Discussion

We have shown how a particular class of MM algorithms may be applied to variable selection. In modifying previous work on variable selection using penalized least squares and penalized likelihood by Fan and Li (2001, 2002, 2004) and Cai et al. (2004), we have shown how a slight perturbation of the penalty function can eliminate the possibility of mistakenly excluding variables too soon while simultaneously enabling certain convergence results. While the numerical tests given here deal with four very diverse models, the range of possible applications of this method is even broader. Generally speaking, the MM algorithms of this article may be applied to any situation where an objective function, whether a likelihood or not, is penalized using a penalty function such as the one in equation (2.1). If the goal is to maximize the penalized objective function and p_λ(·) satisfies the conditions of Proposition 3.1, then an MM algorithm may be applicable. If the original (unpenalized) objective function is concave, then the modified Newton-Raphson approach of Section 3.4 holds promise. In Section 2, we list several distinct classes of penalty functions in the literature that satisfy the conditions of Proposition 3.1, but there may also be other useful penalties to which our method applies.

The numerical tests of Section 4 indicate that the modified SCAD penalty we propose performs well on simulated data sets. This algorithm has comparable relative model error (sometimes quite a bit better, sometimes slightly worse) to BIC and the unmodified SCAD penalty implemented using LQA, and it consistently outperforms AIC. Our proposed algorithm tends to result in more parsimonious models than the LQA algorithm, typically identifying more actual zeros correctly but also eliminating too many nonzero coefficients. This fact is surprising in light of the drawback that LQA can exclude variables too soon during the iterative process, a drawback that our algorithm corrects. The particular choice of ε, addressed in Section 3.3, may warrant further study because of its influence on the complexity of the final model chosen.

An important difference between our algorithm and both AIC and BIC is the fact that the latter two methods are often not computationally efficient, with computing time scaling exponentially in the number of candidate variables whenever it becomes necessary to search exhaustively over the whole model space. This means that in problems with hundreds or thousands of candidate variables, AIC and BIC can be difficult if not impossible to implement. Such problems are becoming more and more prevalent in the statistical literature as topics such as microarray data and data mining increase in popularity.

Finally, we have seen in one example involving the Cox proportional hazards model that both our method and LQA perform well when the model is misspecified, even outperforming the oracle method for samples of size 40. Although questions of model misspecification are largely outside the scope of this article, it is useful to remember that although simulation studies can aid our understanding, the model assumed for any real dataset is only an approximation of reality.

Acknowledgments

The authors thank referees for constructive suggestions. Li’s research was supported by NSF grants DMS-0348869 and CCF-0430349, and an NIH grant NIDA 1-P50-DA10075.

Appendix: Proofs of some results in Section 3

Proof of Proposition 3.1

The proof uses the following Lemma:

Lemma A.1

Under the assumptions of Proposition 3.1, $p_{λ}^{'} (θ +) / (ε + θ)$ is a nonincreasing function of θ for any nonnegative ε.

The proof of the lemma is immediate: Both $p_{λ}^{'} (θ)$ and (ε + θ)⁻¹ are positive and nonincreasing on (0, ∞), so their product is nonincreasing. For any θ > 0, we see that

lim_{x \to θ +} \frac{d}{d x} [Φ_{θ_{0}} (x) - p_{λ} (∣ x ∣)] = θ [\frac{p_{λ}^{'} (∣ θ_{0} ∣ +)}{∣ θ_{0} ∣} - \frac{p_{λ}^{'} (θ +)}{θ}] .

(A.1)

Furthermore, since p_λ(·) is nondecreasing and concave on (0, ∞) (and continuous at 0), it is also continuous on [0, ∞). Thus, Φ_{θ₀}(θ) − p_λ(|θ|) is an even function, piecewise differentiable and continuous everywhere. Taking ε = 0 in Lemma A.1, equation (A.1) implies that Φ_{θ₀}(θ) − p_λ(|θ|) is nonincreasing for θ ∈ (0, |θ₀|) and nondecreasing for θ ∈ (|θ₀|, ∞); this function is therefore minimized on (0, ∞) at |θ₀|. Since it is clear that Φ_{θ₀}(|θ₀|) = p_λ(|θ₀|) and Φ_{θ₀}(−|θ₀|) = p_λ(−|θ₀|), condition (3.3) is satisfied for θ₀ = ±|θ₀|.

Proof of Proposition 3.2

For part (a), it suffices to show that Φ_{θ₀,ε}(θ) majorizes p_{λ,ε}(|θ|) at θ₀. It follows by definition that Φ_{θ₀,ε}(θ₀) = p_{λ,ε}(|θ₀|). Furthermore, Lemma A.1 shows as in Proposition 3.1 that the even function Φ_{θ₀,ε}(θ) − p_{λ,ε}(|θ|) is decreasing on (0, |θ₀|) and increasing on (|θ₀|, ∞), giving the desired result.

To prove part (b), it is sufficient to show that |p_λ_, _ε(|θ|) − p_λ(|θ|)| → 0 uniformly on compact subsets of the parameter space as ε ↓ 0. Since $p_{λ}^{'} (θ +)$ is nonincreasing on (0, ∞),

∣ p_{λ, ε} (∣ θ ∣) - p_{λ} (∣ θ ∣) ∣ \leq ε log [1 + \frac{∣ θ ∣}{ε}] p_{λ}^{'} (0 +),

and because $p_{λ}^{'} (0 +) < \infty$ , the right side of the above inequality tends to 0 uniformly on compact subsets of the parameter space as ε ↓ 0.
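The bound used in part (b) can be spelled out in one line. The derivation below assumes the perturbed penalty has the form p_{λ,ε}(θ) = p_λ(θ) − ε∫₀^θ p′_λ(t)/(ε + t) dt for θ ≥ 0, which is the definition we take equation (3.6) to give; that equation lies outside this appendix, so the form is stated here as an assumption.

```latex
\begin{aligned}
\bigl| p_{\lambda,\varepsilon}(|\theta|) - p_{\lambda}(|\theta|) \bigr|
  &= \varepsilon \int_{0}^{|\theta|} \frac{p_{\lambda}'(t+)}{\varepsilon + t}\, dt \\
  &\le \varepsilon\, p_{\lambda}'(0+) \int_{0}^{|\theta|} \frac{dt}{\varepsilon + t}
   = \varepsilon \log\!\left[ 1 + \frac{|\theta|}{\varepsilon} \right] p_{\lambda}'(0+).
\end{aligned}
```

The inequality uses only that p′_λ(t+) ≤ p′_λ(0+) on (0, ∞). On a compact set with |θ| ≤ K, the right side is at most ε log(1 + K/ε) p′_λ(0+), which tends to 0 as ε ↓ 0, giving the uniform convergence.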

Proof of Corollary 3.2

Let β̂ denote a maximizer of Q(β) and put B = {β ∈ R^d: Q_{ε₀}(β) ≥ Q(β̂)} for some fixed ε₀ > 0. Then B is compact and contains all β̂_ε for 0 ≤ ε < ε₀. Thus, Proposition 3.2 shows that for ε < ε₀,

\begin{array}{l} Q (\hat{β}) - Q ({\hat{β}}_{ε}) \leq Q (\hat{β}) - Q_{ε} (\hat{β}) + Q_{ε} ({\hat{β}}_{ε}) - Q ({\hat{β}}_{ε}) \\ \leq ∣ Q (\hat{β}) - Q_{ε} (\hat{β}) ∣ + ∣ Q_{ε} ({\hat{β}}_{ε}) - Q ({\hat{β}}_{ε}) ∣ \\ \to 0. \end{array}

If β^* is a limit point of {β̂_ε}_{ε ↓ 0}, then by the continuity of Q(β), |Q(β^*) − Q(β̂)| = 0 and so β^* is a maximizer of Q(β).

Proof of Proposition 3.3

Given an initial value β^{(0)}, let β^{(k)} = M^k(β^{(0)}) for k ≥ 1; i.e., {β^{(k)}} is the sequence of points that the MM algorithm generates starting from β^{(0)}. Let Λ denote the set of limit points of this sequence. For any β^* ∈ Λ, passing to a subsequence we have β^{(k_n)} → β^*. The quantity Q_ε(β^{(k_n)}), since it is increasing in n and bounded above, converges to a limit as n → ∞. Thus, taking limits in the inequalities

Q_{ε} (β^{(k_{n})}) \leq Q_{ε} {M (β^{(k_{n})})} \leq Q_{ε} (β^{(k_{n + 1})})

gives Q_ε(β^*) = Q_ε{lim_{n→∞} M(β^{(k_n)})}, assuming this limit exists. Of course, if M(β) is continuous at β^*, then we have Q_ε(β^*) = Q_ε{M(β^*)}, which implies that β^* is a stationary point of Q_ε(β).

Note that Λ is not necessarily nonempty in the above proof. However, we know that each β^{(k)} lies in the set {β: Q_ε(β) ≥ Q_ε(β^{(1)})}, so if this set is compact, as is often the case, we may conclude that Λ is indeed nonempty.

Contributor Information

David R. Hunter, Department of Statistics, The Pennsylvania State University, University Park, Pennsylvania 16802–2111. E-mail: dhunter@stat.psu.edu

Runze Li, Department of Statistics, The Pennsylvania State University, University Park, Pennsylvania 16802–2111. E-mail: rli@stat.psu.edu

References

1.Antoniadis A. Wavelets in statistics (with discussion) Journal of the Italian Statistical Association. 1997;6:97–144. [Google Scholar]
2.Antoniadis A, Fan J. Regularization of wavelets approximations (with discussion) Journal of American Statistical Association. 2001;96:939–967. [Google Scholar]
3.Cai J, Fan J, Li R, Zhou H. Variable selection for multivariate failure time data. Biometrika. 2004 doi: 10.1093/biomet/92.2.303. In press. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Cox DR. Partial likelihood. Biometrika. 1975;62:269–276. [Google Scholar]
5.Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer Math. 1979;31:377–403. [Google Scholar]
6.Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38. [Google Scholar]
7.Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics. 2004;32:928–961. [Google Scholar]
8.Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association. 2001;96:1348–1360. [Google Scholar]
9.Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. The Annals of Statistics. 2002;30:74–99. [Google Scholar]
10.Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of American Statistical Association. 2004;99:710–723. [Google Scholar]
11.Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–148. [Google Scholar]
12.Heiser WJ. Convergent computing by iterative majorization: theory and applications in multidimensional data analysis. In: Krzanowski WJ, editor. Recent Advances in Descriptive Multivariate Analysis. Clarendon Press; Oxford: 1995. pp. 157–189. [Google Scholar]
13.Hestenes MR. Optimization Theory: The Finite Dimensional Case. Wiley; New York: 1975. [Google Scholar]
14.Hunter DR, Lange K. Rejoinder to discussion of “Optimization transfer using surrogate objective functions”. Journal of Computational and Graphical Statistics. 2000;9:52–59. [Google Scholar]
15.Kauermann G, Carroll RJ. A note on the efficiency of sandwich covariance matrix estimation. Journal of the American Statistical Association. 2001;96:1387–1396.
16.Lange K. A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society, Series B. 1995;57:425–437.
17.Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics. 2000;9:1–59.
18.Lehmann EL. Theory of Point Estimation. Wadsworth & Brooks/Cole; Pacific Grove, California: 1983.
19.Meng XL. On the rate of convergence of the ECM algorithm. The Annals of Statistics. 1994;22:326–339.
20.Meng XL, Van Dyk DA. The EM algorithm — an old folk song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society, Series B. 1997;59:511–567.
21.McLachlan G, Krishnan T. The EM Algorithm and Extensions. Wiley; New York: 1997.
22.Miller AJ. Subset Selection in Regression. 2nd ed. Chapman and Hall; London: 2002.
23.Ortega JM. Numerical Analysis: A Second Course. Society for Industrial and Applied Mathematics; Philadelphia: 1990.
24.Tibshirani R. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
25.Wu CFJ. On the convergence properties of the EM algorithm. The Annals of Statistics. 1983;11:95–103.

