Author manuscript; available in PMC 2016 Mar 14.
Published in final edited form as: Stat Sin. 2015 Jul 1;25(3):1185–1206. doi: 10.5705/ss.2013.284

A Small-Sample Choice of the Tuning Parameter in Ridge Regression

Philip S Boonstra 1, Bhramar Mukherjee 1, Jeremy M G Taylor 1
PMCID: PMC4790465  NIHMSID: NIHMS731299  PMID: 26985140

Abstract

We propose new approaches for choosing the shrinkage parameter in ridge regression, a penalized likelihood method for regularizing linear regression coefficients, when the number of observations is small relative to the number of parameters. Existing methods may lead to extreme choices of this parameter, which will either not shrink the coefficients enough or shrink them by too much. Within this “small-n, large-p” context, we suggest a correction to the common generalized cross-validation (GCV) method that preserves the asymptotic optimality of the original GCV. We also introduce the notion of a “hyperpenalty”, which shrinks the shrinkage parameter itself, and make a specific recommendation regarding the choice of hyperpenalty that empirically works well in a broad range of scenarios. A simple algorithm jointly estimates the shrinkage parameter and regression coefficients in the hyperpenalized likelihood. In a comprehensive simulation study of small-sample scenarios, our proposed approaches offer superior prediction over nine other existing methods.

Keywords: Akaike’s information criterion, Cross-validation, Generalized cross-validation, Hyperpenalty, Marginal likelihood, Penalized likelihood

1 Introduction

Suppose we have data, {y, x}, which are n observations of a continuous outcome Y and p covariates X, with the covariate matrix x regarded as fixed; the sample size n is small relative to p. We relate Y and X by a linear model, Y = β0 + Xβ + σε, with ε ∼ N{0, 1}. Up to an additive constant, the log-likelihood is

$$\ell(\beta,\beta_0,\sigma^2) = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\,(y - \beta_0 1_n - x\beta)^\top (y - \beta_0 1_n - x\beta). \tag{1}$$

We center y and standardize x to have unit variance. As a consequence of this, although β0 is estimated in the fitted model, our notation will implicitly reflect the assumption β0 = 0.

We consider penalized estimation of β, with our primary interest being prediction of future observations rather than variable selection. Thus, we focus on L2-penalization, i.e. ridge regression (Hoerl and Kennard, 1970), which, from a prediction perspective, has favorable properties even compared to more recently developed penalization methods (e.g. Frank and Friedman, 1993; Tibshirani, 1996; Fu, 1998; Zou and Hastie, 2005). Ridge regression may be viewed as a hierarchical linear model, similar to mixed effects modeling, in which the “random effects” are the elements of β. An L2-penalty on β implicitly assumes these are jointly and independently Normal with mean zero and variance σ2/λ, because the penalty term matches the negative Normal log-density, up to a normalizing constant not depending on β:

$$p_\lambda(\beta,\sigma^2) = \frac{\lambda}{2\sigma^2}\,\beta^\top\beta - \frac{p}{2}\ln(\lambda) + \frac{p}{2}\ln(\sigma^2). \tag{2}$$

The scalar λ is the ridge parameter, controlling the shrinkage of β toward zero; larger values yield greater shrinkage. Given λ, the maximum penalized likelihood estimate of β is

$$\beta_\lambda = \arg\max_{\beta\mid\lambda}\,\{\ell(\beta,\sigma^2) - p_\lambda(\beta,\sigma^2)\} = (x^\top x + \lambda I_p)^{-1} x^\top y. \tag{3}$$

When n − 1 ≥ p, a key result from Hoerl and Kennard (Theorem 4.3, 1970) is that λ* = arg min_{λ≥0} E[(β − βλ)ᵀ(β − βλ)] > 0, i.e. there exists λ* > 0 for which the mean squared error (MSE) of βλ decreases relative to λ = 0. If xᵀx/n = Ip, then λ* = pσ2/βᵀβ; however, there is no closed-form solution for λ* in the general xᵀx case. A strictly positive λ introduces bias in βλ but decreases variance, creating a bias-variance tradeoff. A choice of λ which is too small leads to overfitting the data, and one which is too large shrinks β by too much; to contrast these extremes, we will hereafter refer to this latter scenario as “underfitting.” The existence of λ* is relevant because prediction error, E[(β − βλ)ᵀxᵀx(β − βλ)], is closely related to MSE and may correspondingly benefit from such a bias-variance tradeoff.
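For concreteness, the estimator in (3) amounts to a single linear solve. The short numpy sketch below is our own illustration (not the authors' code), assuming x has already been standardized and y centered; the simulated inputs and the function name are purely illustrative.

```python
import numpy as np

def ridge_coefficients(x, y, lam):
    """Compute the ridge estimate in (3): (x'x + lam*I_p)^{-1} x'y.

    x is n-by-p and y has length n.  Assumes the columns of x are
    standardized and y is centered, so the intercept beta_0 is handled
    implicitly, as in the paper.
    """
    n, p = x.shape
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

# Illustrative use on simulated data (not from the paper):
rng = np.random.default_rng(0)
n, p = 25, 100
x = rng.standard_normal((n, p))
x = (x - x.mean(axis=0)) / x.std(axis=0)   # standardize columns of x
beta_true = rng.standard_normal(p) * 0.1
y = x @ beta_true + rng.standard_normal(n)
y = y - y.mean()                           # center y
beta_hat = ridge_coefficients(x, y, lam=10.0)
```

Because xᵀx + λIp is positive definite for λ > 0, the solve succeeds even when n < p.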

To approximate λ*, one cannot simply maximize ℓ(β, σ2) − pλ(β, σ2) jointly with respect to β, σ2 and λ, because the expression can be made arbitrarily large by plugging in β = 0 and letting λ → ∞. Typically, λ is selected by optimizing some other objective function. Our motivation for this paper is to investigate selection strategies for λ when n is “small”, by which we informally mean n < p or n ≈ p, the complement being a more standard n ≫ p situation. This small-n situation increasingly occurs in modern genomic studies, whereas common approaches for selecting λ are often justified asymptotically in n.

Our contribution is two-fold. First, we present new ideas for choosing λ, including both a small-sample modification to a common existing approach and novel proposals. Our framework categorizes existing strategies into two classes, based on whether a goodness-of-fit criterion or a likelihood is optimized. Methods in either class may be susceptible to over- or underfitting; a third, new class extends the hierarchical perspective of ridge regression, the first level being (β, σ2) and the second pλ(β, σ2). Following ideas by Takada (1979), who showed that Stein’s Positive Part Estimator corresponds to a posterior mode given a certain prior, and, more recently, Strawderman and Wells (2012), who place a hyperprior on the Lasso penalty parameter, we add a third level, defining a “hyperpenalty” on λ. This hyper-penalty induces shrinkage on λ itself, thereby protecting against extreme choices of λ. The second contribution follows naturally, namely, a comprehensive evaluation of all methods, both existing and newly proposed, in this small-n situation via simulation studies.

The remainder of this paper is organized as follows. We review current approaches for choosing λ (the first and second classes discussed above) in Sections 2 and 3 and propose a small-sample modification to one of these methods, generalized cross-validation (GCV, Craven and Wahba, 1979). In Section 4, we define a generic hyperpenalty function and explore a specific choice for the form of the hyperpenalty in Section 4.1. Section 5 presents a comprehensive simulation study, and Section 6 applies the methods to a gene-expression dataset. Our results suggest that the existing approaches for choosing λ can be improved upon in many small-n cases. Section 7 concludes with a discussion of useful extensions of the hyperpenalty framework.

2 Goodness-of-fit-based methods for selection of λ

These methods define an objective function in terms of λ which is to be minimized. Commonly used is K-fold cross-validation, which partitions the observations into K groups, κ(1), …, κ(K), and calculates βλ K times using equation (3), each time leaving out group κ(i), to get βλ^{−κ(1)}, βλ^{−κ(2)}, etc. For βλ^{−κ(i)}, cross-validated residuals are calculated on the observations in κ(i), which did not contribute to estimating β. The objective function estimates prediction error and is the sum of the squared cross-validated residuals:

$$\lambda_{K\text{-CV}} = \arg\min_\lambda\, \ln \sum_{i=1}^{K} \big(y_{\kappa(i)} - x_{\kappa(i)}\beta_\lambda^{-\kappa(i)}\big)^\top \big(y_{\kappa(i)} - x_{\kappa(i)}\beta_\lambda^{-\kappa(i)}\big). \tag{4}$$

A suggested choice for K is 5 (Hastie et al., 2009). When K = n, some simplification (Golub et al., 1979) gives

$$\lambda_{n\text{-CV}} = \arg\min_\lambda\, \ln \sum_{i=1}^{n} \big(Y_i - X_i\beta_\lambda\big)^2 \big/ \big(1 - P_{\lambda[ii]} - 1/n\big)^2, \tag{5}$$
$$\text{with } P_\lambda = x\,(x^\top x + \lambda I_p)^{-1} x^\top. \tag{6}$$

Pλ[ii] is the ith diagonal element of Pλ and measures the ith observation’s influence in estimating β. Further discussion of its interpretation is given in Section 2.1. From (5), observations for which Pλ[ii] is large, i.e. influential observations, have greater weight. Re-centering y at each fold implies β0 is re-estimated; this is reflected by the “−1/n” term in (5). This term does not appear in the derivations by Golub et al. (1979), which assume β0 is known, but this difference in assumptions is important with regard to GCV, which is discussed next, and our proposed extension of GCV.

GCV multiplies each squared residual in (5) by (1 − Pλ[ii] − 1/n)2/(1 − Trace(Pλ)/n − 1/n)2, thereby giving equal weight to all observations. Using the equality y − xβλ = (In − Pλ)y, further simplification yields

$$\lambda_{\mathrm{GCV}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y - 2\ln\big(1 - \mathrm{Trace}(P_\lambda)/n - 1/n\big) \big\}. \tag{7}$$

Although derived using different principles, other methods reduce to a “model fit + penalty” or “model fit + model complexity” form similar to (7): Akaike’s Information Criterion (AIC, Akaike, 1973) and the Bayesian Information Criterion (BIC, Schwarz, 1978). Respectively, each chooses λ as follows:

$$\lambda_{\mathrm{AIC}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y + 2\,(\mathrm{Trace}(P_\lambda) + 2)/n \big\}, \tag{8}$$
$$\lambda_{\mathrm{BIC}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y + \ln(n)\,(\mathrm{Trace}(P_\lambda) + 2)/n \big\}. \tag{9}$$

Asymptotically in n, GCV will choose the value of λ which minimizes the prediction criterion E[(β − βλ)ᵀxᵀx(β − βλ)] (Golub et al., 1979; Li, 1986). Further, Golub et al. observe that GCV and AIC asymptotically coincide. BIC asymptotically selects the true underlying model from a set of nested candidate models (Sin and White, 1996; Hastie et al., 2009), so its justification for use in selecting λ, which is a shrinkage parameter, is weak. For all of these methods, optimality is based upon the assumption that n ≫ p. When n is small, extreme overfitting is possible (Wahba and Wang, 1995; Efron, 2001), giving small-bias/large-variance estimates. A small-sample correction of AIC (AICC, Hurvich and Tsai, 1989; Hurvich et al., 1998) and a robust version of GCV (RGCVγ, Lukas, 2006) exist:

$$\lambda_{\mathrm{AICC}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y + 2\,(\mathrm{Trace}(P_\lambda) + 2)/(n - \mathrm{Trace}(P_\lambda) - 3) \big\}, \tag{10}$$
$$\lambda_{\mathrm{RGCV}_\gamma} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y - 2\ln\big(1 - \mathrm{Trace}(P_\lambda)/n - 1/n\big) + \ln\big(\gamma + (1-\gamma)\,\mathrm{Trace}(P_\lambda^2)/n\big) \big\}. \tag{11}$$

For AICC, the modified penalty is the product of the original penalty, 2(Trace(Pλ) + 2)/n, and n/(n − Trace(Pλ) − 3). The authors do not consider the possibility of n − Trace(Pλ) − 3 < 0, which would inappropriately change the sign of the penalty, and we have found no discussion of this in the literature. In our implementation of AICC, we replace n − Trace(Pλ) − 3 with its positive part, (n − Trace(Pλ) − 3)+, effectively making the criterion infinitely large in this case. As a rule of thumb, Burnham and Anderson (2002) suggest using AICC over AIC when n < 40p (their threshold for small n) and thus also when n ≈ p. RGCVγ adds an additional term to the GCV criterion based on a tuning parameter γ ∈ (0, 1], as in (11); we use γ = 0.3 based on Lukas’ recommendation. Small choices of λ are more severely penalized, thereby offering protection against overfitting. To the best of our knowledge, the performance of AICC or RGCVγ in the context of selecting λ in ridge regression has not been extensively studied.
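For reference, all of the criteria (7)-(11) depend on λ only through the residual vector (In − Pλ)y and the traces of Pλ and Pλ². The following sketch, our own illustration rather than the authors' implementation, evaluates them on a grid of λ values and applies the positive-part convention for AICC described above; the function and argument names are ours.

```python
import numpy as np

def gof_criteria(x, y, lam_grid, gamma=0.3):
    """Evaluate the GCV (7), AIC (8), BIC (9), AICC (10), and RGCV_gamma (11)
    objectives over a grid of ridge parameters.  Assumes x is standardized
    and y centered.  Returns a dict mapping criterion name -> array."""
    n, p = x.shape
    names = ("GCV", "AIC", "BIC", "AICC", "RGCV")
    out = {name: np.full(len(lam_grid), np.inf) for name in names}
    for j, lam in enumerate(lam_grid):
        p_lam = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)   # (6)
        resid = y - p_lam @ y
        fit = np.log(resid @ resid)            # ln y'(I_n - P_lam)^2 y
        tr = np.trace(p_lam)
        tr2 = np.trace(p_lam @ p_lam)
        out["AIC"][j] = fit + 2.0 * (tr + 2.0) / n
        out["BIC"][j] = fit + np.log(n) * (tr + 2.0) / n
        if n - tr - 3.0 > 0.0:                 # positive-part convention for AICC
            out["AICC"][j] = fit + 2.0 * (tr + 2.0) / (n - tr - 3.0)
        if 1.0 - tr / n - 1.0 / n > 0.0:
            gcv_pen = -2.0 * np.log(1.0 - tr / n - 1.0 / n)
            out["GCV"][j] = fit + gcv_pen
            out["RGCV"][j] = fit + gcv_pen + np.log(gamma + (1.0 - gamma) * tr2 / n)
    return out

# Example: choose lambda for each criterion over a log-spaced grid.
# lam_grid = np.logspace(-3, 5, 200)
# crit = gof_criteria(x, y, lam_grid)
# lam_gcv = lam_grid[np.argmin(crit["GCV"])]
```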

2.1 Small-sample GCV

Trace(Pλ), with Pλ defined in (6), is the effective number of model parameters, excluding β0 and σ2. It decreases monotonically in λ > 0 and lies in the interval (0, min{n − 1, p}). The upper bound on Trace(Pλ) is not min{n, p} because the standardization of x reduces its rank by one when n ≤ p. Although the parameters β0 and σ2 are counted (in the literal sense) in the model complexity terms of AIC and BIC, they have only an additive effect, being represented by the “+2” expressions in (8) and (9). For this reason, β0 and σ2 may be ignored in considering model complexity. However, from (7), GCV counts β0, which is given by the “−1/n” term, but not σ2; counting both will change the penalty, since the model complexity term is on the log-scale. This motivates our proposed small-sample correction to GCV, called GCVC, which does count σ2 as a parameter:

$$\lambda_{\mathrm{GCVC}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y - 2\ln\big( (1 - \mathrm{Trace}(P_\lambda)/n - 2/n)_+ \big) \big\}. \tag{12}$$

As with AICC, 1 − Trace(Pλ)/n − 2/n may be negative. In this case, subtracting the log of the positive part of 1 − Trace(Pλ)/n − 2/n makes the objective function infinite. This is only a small-sample correction because the objective functions in (7) and (12) coincide as n → ∞, and the asymptotic optimality of GCV transfers to GCVC.

An explanation of why GCVC corrects the small-sample deficiency of GCV is as follows. If n − 1 = p, the model-fit term in the objective function of (7), ln yᵀ(In − Pλ)²y, tends to −∞ as λ decreases. When λ = 0, the fitted values, Pλy, will perfectly match the observations, y, and the data are overfit. The penalty term, −2 ln(1 − Trace(Pλ)/n − 1/n), tends to ∞ as λ decreases, because Trace(Pλ) approaches n − 1. The rates of convergence for the model-fit and penalty terms determine whether GCV chooses a too-small λ. If the model-fit term approaches −∞ faster than the penalty approaches ∞, the objective function is minimized by setting λ as small as possible, which is λ = 0 when n − 1 = p. Although this phenomenon is most striking in cases for which n − 1 = p, as we will see in Section 5, this finding appears to hold when n − 1 < p. In this case, predictions will nearly match observations as λ decreases but remains numerically positive to allow for the matrix inversion in Pλ, and the penalty term still approaches ∞ as λ decreases. Like GCV, the penalty function associated with GCVC also approaches ∞ as λ decreases. In contrast to GCV, however, the GCVC penalty already equals ∞ at λ = λ̃ > 0, where λ̃ is the solution to 1 − Trace(Pλ̃)/n − 2/n = 0, or, equivalently, Trace(Pλ̃) = n − 2. In other words, when fitting GCVC, the effective number of remaining parameters, beyond σ2 and β0, will be less than n − 2, and perfect fit of the observations to the predictions, i.e. λ = 0, cannot occur.

Remark 1

A reviewer observed that the GCVC penalty can be generalized according to −2 ln((1 − Trace(Pλ)/n − c/n)+) for c ≥ 1; special cases of this include GCV (c = 1) and GCVC (c = 2). Extending the interpretation given above, this ensures that the effective number of remaining parameters, beyond σ2 and β0, will be less than n − c, rather than n − 2. Preliminary results from allowing c to vary did not point to a uniformly better choice of c ≠ 2. Also, using c = 2 is consistent with our original motivation for proposing GCVC, namely properly counting the model parameters.
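A sketch of the corrected criterion (12), written with the general constant c of Remark 1 as an argument (c = 2 recovers GCVC and c = 1 the usual GCV); this is again our own hedged illustration, not the authors' code.

```python
import numpy as np

def gcv_c(x, y, lam_grid, c=2.0):
    """GCV_C objective (12): ln y'(I - P)^2 y - 2 ln((1 - Tr(P)/n - c/n)_+).
    Returns the selected lambda from lam_grid.  The objective is treated
    as +inf wherever the positive part is zero, i.e. whenever
    Trace(P_lam) >= n - c."""
    n, p = x.shape
    best_lam, best_val = None, np.inf
    for lam in lam_grid:
        p_lam = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)
        resid = y - p_lam @ y
        slack = 1.0 - np.trace(p_lam) / n - c / n
        if slack <= 0.0:
            continue                           # objective is infinite here
        val = np.log(resid @ resid) - 2.0 * np.log(slack)
        if val < best_val:
            best_lam, best_val = lam, val
    return best_lam
```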

3 Likelihood-based methods for selection of λ

A second approach treats the ridge penalty in (2) as a negative log-density. One can consider a marginal likelihood, where λ is interpreted as the variance component of a mixed-effects model:

$$\ell_m(\lambda,\sigma^2) = \ln \int_\beta \exp\{\ell(\beta,\sigma^2) - p_\lambda(\beta,\sigma^2)\}\, d\beta = -\frac{1}{2\sigma^2}\, y^\top(I_n - P_\lambda)y - \frac{n}{2}\ln(\sigma^2) + \frac{1}{2}\ln|I_n - P_\lambda|. \tag{13}$$

From this, y | λ, σ2 is multivariate Normal with mean 0n (y is centered) and covariance σ2(In − Pλ)⁻¹. The maximum profile marginal likelihood (MPML) estimate, originally proposed for smoothing splines (Wecker and Ansley, 1983), profiles ℓm(λ, σ2) over σ2, replacing each instance with σ̂λ2 = yᵀ(In − Pλ)y/n, and maximizes the “concentrated” log-likelihood, ℓm(λ, σ̂λ2):

$$\lambda_{\mathrm{MPML}} = \arg\min_\lambda\, \Big\{ \ln y^\top(I_n - P_\lambda)y - \frac{1}{n}\ln|I_n - P_\lambda| \Big\}. \tag{14}$$

Closely related is the generalized/restricted MPML (GMPML, Harville, 1977; Wahba, 1985), which adjusts the penalty to account for estimation of regression parameters that are not marginalized. Here, only β0 is not marginalized, so the adjustment is by one degree of freedom (see Supplement S1):

$$\lambda_{\mathrm{GMPML}} = \arg\min_\lambda\, \Big\{ \ln y^\top(I_n - P_\lambda)y - \frac{1}{n-1}\ln|I_n - P_\lambda| \Big\}. \tag{15}$$

In a smoothing-spline comparison of GMPML to GCV, Wahba (1985) found mixed results, with neither method offering uniformly better predictions. For scatterplot smoothers, Efron (2001) notes that GMPML may oversmooth, yielding large bias/small variance estimates.
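Both (14) and (15) require only yᵀ(In − Pλ)y and ln|In − Pλ|, so they can be evaluated on the same λ grid as the goodness-of-fit criteria. A minimal sketch (ours, under the same standardization assumptions as before):

```python
import numpy as np

def mpml_gmpml(x, y, lam_grid):
    """Evaluate the MPML (14) and GMPML (15) objectives over a grid.
    Both use y'(I - P_lam)y and ln|I - P_lam|; they differ only in
    dividing the log-determinant by n versus n - 1."""
    n, p = x.shape
    mpml = np.empty(len(lam_grid))
    gmpml = np.empty(len(lam_grid))
    for j, lam in enumerate(lam_grid):
        p_lam = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)
        m = np.eye(n) - p_lam
        quad = y @ m @ y                      # y'(I - P_lam)y
        _, logdet = np.linalg.slogdet(m)      # ln|I - P_lam|; m is PD for lam > 0
        mpml[j] = np.log(quad) - logdet / n
        gmpml[j] = np.log(quad) - logdet / (n - 1.0)
    return mpml, gmpml
```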

Remark 2

Rather than profiling over σ2, one could jointly maximize m(λ, σ2) over λ and σ2. We have not found this approach previously used as a selection criterion in ridge regression. Our initial investigation of this and its restricted likelihood counterpart gave results similar to MPML and GMPML, and so we do not consider it further.

An alternative to the marginal likelihood methods described above is to treat the objective function in (3) as an h-log-likelihood, or “h-loglihood”, of the type proposed by Lee and Nelder (1996) for hierarchical generalized linear models. The link between penalized likelihoods, like ridge regression, and the h-loglihood was noted in the paper’s ensuing discussion. To estimate σ2 (the dispersion) and λ (the variance component), Lee and Nelder suggested an iterative profiling approach, yielding the maximum adjusted profile h-loglihood (MAPHL) estimate. In Supplement S2, we show one iteration proceeds as follows:

$$\sigma^{2(i)} \leftarrow \frac{(y - x\beta^{(i-1)})^\top (y - x\beta^{(i-1)}) + \lambda^{(i-1)}\,\beta^{(i-1)\top}\beta^{(i-1)}}{n - 1}, \tag{16}$$
$$\lambda^{(i)} \leftarrow \arg\min_\lambda\, \big\{ \lambda\,\beta^{(i-1)\top}\beta^{(i-1)}/\sigma^{2(i)} - \ln|I_n - P_\lambda| \big\}, \tag{17}$$
$$\beta^{(i)} \leftarrow \beta_{\lambda^{(i)}}, \tag{18}$$

and λ_MAPHL = λ^{(∞)}.
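One way to implement the iteration (16)-(18) is to alternate the closed-form σ2 update with a grid search for the λ step in (17), stopping when λ stabilizes. The sketch below is our own illustration of this scheme, with the starting value and convergence rule chosen arbitrarily.

```python
import numpy as np

def maphl(x, y, lam_grid, max_iter=100, tol=1e-6):
    """Iterate (16)-(18) to approximate lambda_MAPHL.
    The lambda step (17) is solved by grid search over lam_grid."""
    n, p = x.shape
    xtx, xty = x.T @ x, x.T @ y
    lam = lam_grid[len(lam_grid) // 2]                 # arbitrary starting value
    beta = np.linalg.solve(xtx + lam * np.eye(p), xty)
    for _ in range(max_iter):
        resid = y - x @ beta
        sigma2 = (resid @ resid + lam * (beta @ beta)) / (n - 1.0)   # (16)
        best_lam, best_val = lam, np.inf
        for cand in lam_grid:                                        # (17)
            p_cand = x @ np.linalg.solve(xtx + cand * np.eye(p), x.T)
            _, logdet = np.linalg.slogdet(np.eye(n) - p_cand)
            val = cand * (beta @ beta) / sigma2 - logdet
            if val < best_val:
                best_lam, best_val = cand, val
        if abs(best_lam - lam) < tol * (1.0 + lam):
            lam = best_lam
            break
        lam = best_lam
        beta = np.linalg.solve(xtx + lam * np.eye(p), xty)           # (18)
    return lam
```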

Finally, Tran (2009) proposed the “Loss-rank” (LR) method for selecting λ. Its derivation, which we do not give, is likelihood-based, but the criterion resembles that of AIC in (8):

$$\lambda_{\mathrm{LR}} = \arg\min_\lambda\, \Big\{ \ln y^\top(I_n - P_\lambda)^2 y - \frac{2}{n}\ln|I_n - P_\lambda| \Big\}. \tag{19}$$

Tran also suggested a modified penalty term, which is dependent on y, but this did not give appreciably different results from λLR in their study.

4 Maximization with Hyperpenalties

As noted previously, some existing methods may choose extreme values of λ, particularly when n is small, suggesting a need for a second level of shrinkage, that is, shrinkage of λ itself. We extend the hierarchical framework of (1) and (2) with a “hyperpenalty” on λ, h(λ), which gives non-negligible support for λ over a finite range of values. The “hyperpenalized log-likelihood” is

$$\mathrm{hp}\ell(\beta,\lambda,\sigma^2) = \ell(\beta,\sigma^2) - p_\lambda(\beta,\sigma^2) - h(\lambda) - \ln(\sigma^2). \tag{20}$$

From the Bayesian perspective, when h(λ) is in the form of a log-density, the hyperpenalty corresponds to a hyperprior on λ, and the hyperpenalized likelihood is the posterior (the expression − ln(σ2) is the log-density of an improper prior on σ2). In contrast to fully Bayesian methods, which characterize the entire posterior, we desire a single point estimate of β, σ2 and λ and focus on mode finding. Importantly, joint maximization with respect to β, σ2 and λ is now possible.

For a general h(λ), we find the joint mode of (20): {β̂, λ̂, σ̂2} = arg max_{β,λ,σ2} {hpℓ(β, λ, σ2)}. Alternatively, {β̂, λ̂, σ̂2} may be calculated using conditional maximization steps:

$$\sigma^{2(i)} \leftarrow \arg\max_{\sigma^2}\,\{\mathrm{hp}\ell(\beta^{(i-1)},\sigma^2,\lambda^{(i-1)})\} = \frac{(y - x\beta^{(i-1)})^\top(y - x\beta^{(i-1)}) + \lambda^{(i-1)}\,\beta^{(i-1)\top}\beta^{(i-1)}}{n + p + 2}, \tag{21}$$
$$\lambda^{(i)} \leftarrow \arg\max_\lambda\,\{\mathrm{hp}\ell(\beta^{(i-1)},\sigma^{2(i)},\lambda)\}, \tag{22}$$
$$\beta^{(i)} \leftarrow \arg\max_\beta\,\{\mathrm{hp}\ell(\beta,\sigma^{2(i)},\lambda^{(i)})\} = \beta_{\lambda^{(i)}}, \tag{23}$$

with {β̂, λ̂, σ̂2} = {β^{(∞)}, λ^{(∞)}, σ^{2(∞)}}. The only step that depends on the choice of h(λ) is (22); the other steps are available in closed form regardless of h(λ).

Based on the expression for pλ(β, σ2) given in (2), if exp{−h(λ)} = o(λ^{−p/2}), then, upon applying the maximization step in (22), λ^{(i)} is guaranteed to be finite, regardless of the values of β^{(i−1)} and σ^{2(i)}. This relates to an earlier comment in the Introduction that one cannot simply maximize ℓ(β, σ2) − pλ(β, σ2) alone. For example, using h(λ) = C, C constant, would yield the same result as maximizing ℓ(β, σ2) − pλ(β, σ2), namely an infinite hyperpenalized log-likelihood. We will propose one such choice of h(λ) that satisfies exp{−h(λ)} = o(λ^{−p/2}) and is empirically observed to work well for the ridge penalty; different choices of h(λ) may be better suited for other penalty functions.

4.1 Choice of hyperpenalty

Crucial to this approach is the determination of an appropriate hyperpenalty and accompanying hyperparameters. Our recommended hyperpenalty is based on the gamma distribution, namely h(λ) = −(a − 1) ln(λ) + bλ, corresponding to a gamma density with shape a and rate b. From the Bayesian perspective, this is natural because it is conjugate to the precision of the Normal distribution, which is one possible interpretation of λ (e.g. Tipping, 2001; Armagan and Zaretzki, 2010). From Supplement S3, the update for λ given in (22) becomes

$$\lambda^{(i)} = \frac{p + 2a - 2}{\beta^{(i-1)\top}\beta^{(i-1)}/\sigma^{2(i)} + 2b}. \tag{24}$$

This additionally requires choosing values for a and b. We will do so by first choosing a desired prior mean for λ, given by a/b, and a desired value for the update λ^{(i)} in (24), and then solving the two expressions for a and b. Necessary to this strategy is that the chosen value of λ^{(i)} must result in a and b that are free of σ^{2(i)} and β^{(i−1)}. To choose a/b, recall the key result from Hoerl and Kennard (1970): when xᵀx/n = Ip, λ* = arg min_{λ≥0} E[(β − βλ)ᵀ(β − βλ)] = pσ2/βᵀβ. While not of immediate practical use, since σ2 and β are the parameters to be estimated, we note that σ2/βᵀβ ≈ (1/R2 − 1), where R2 = βᵀΣXβ/(βᵀΣXβ + σ2) is the coefficient of determination, and the approximation comes from substituting xᵀx/n = Ip for ΣX. In contrast to σ2 or the individual elements of β, there may be prior knowledge about R2. Alternatively, we will propose a strategy (Section 4.2) that provides an estimate of R2. Given an estimate or prior guess of R2 ∈ (0, 1), say R̂2, we set a/b = p(1/R̂2 − 1).

In addition to a sensible mean, it is important that a and b be such that λ^{(i)}, the resulting update for λ, is not extreme. Let the update for λ given in (24) be λ^{(i)} = (p − 1)H^{(i)}, where H^{(i)} is the harmonic mean of σ^{2(i)}/β^{(i−1)ᵀ}β^{(i−1)} and (1/R̂2 − 1). Being a harmonic mean, the (1/R̂2 − 1) term, which will typically be less than 10 for most analyses, moderates potentially large values of σ^{2(i)}/β^{(i−1)ᵀ}β^{(i−1)}, thereby preventing underfitting. Simultaneously, λ^{(i)} increases linearly with p, which prevents overfitting in n < p scenarios.

Solving these two expressions, a/b = p(1/R̂2 − 1) and λ^{(i)} = (p − 1)H^{(i)}, yields a = p/2 and b = (1/R̂2 − 1)^{−1}/2. When the covariates are approximately uncorrelated and R̂2 is close to R2, a/b will be near λ*. However, the variability that the hyperpenalty permits around this mean makes the approach useful in the general xᵀx case, for which no closed-form solution for λ* exists. As we will see in the simulation study, this holds true even when R̂2 ≠ R2.
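Combining (21), (24), and (23) with a = p/2 and b = (1/R̂2 − 1)^{−1}/2 gives a simple fixed-point algorithm in which every step is closed-form. The following sketch is our own illustration of these updates (it is not the authors' released code), again assuming x standardized and y centered.

```python
import numpy as np

def hyp_ridge(x, y, r2_hat, max_iter=200, tol=1e-8):
    """Joint mode of the hyperpenalized likelihood (20) with the gamma
    hyperpenalty of Section 4.1: a = p/2, b = (1/r2_hat - 1)^{-1} / 2.
    Cycles through the closed-form updates (21), (24), and (23)."""
    n, p = x.shape
    xtx, xty = x.T @ x, x.T @ y
    a = p / 2.0
    b = 0.5 / (1.0 / r2_hat - 1.0)
    lam = a / b                                  # start at the prior mean p(1/R^2 - 1)
    beta = np.linalg.solve(xtx + lam * np.eye(p), xty)
    for _ in range(max_iter):
        resid = y - x @ beta
        sigma2 = (resid @ resid + lam * (beta @ beta)) / (n + p + 2.0)       # (21)
        lam_new = (p + 2.0 * a - 2.0) / ((beta @ beta) / sigma2 + 2.0 * b)   # (24)
        beta = np.linalg.solve(xtx + lam_new * np.eye(p), xty)              # (23)
        if abs(lam_new - lam) < tol * (1.0 + lam):
            lam = lam_new
            break
        lam = lam_new
    return beta, lam, sigma2
```

With a = p/2, the λ update reduces to the harmonic-mean form (p − 1)H^{(i)} described above, so large values of σ2/βᵀβ cannot drive λ to extremes.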

Finally, it is important that the hyperpenalty strategy not be inferior in a standard regression, when n ≫ p. To establish this, we derive in Supplement S4 a large-n approximation for λ* in the general xᵀx case: λ* ≈ σ2 Trace[(xᵀx)^{−1}]/βᵀ(xᵀx)^{−1}β. As n → ∞, this converges in probability to σ2 Trace[ΣX^{−1}]/βᵀΣX^{−1}β < ∞, from which we can see that λ* asymptotes in n. From (3), because the crossproduct term, xᵀx, increases linearly in n, βλ*, the ridge estimate of β at the optimal value of λ, approaches the ordinary least squares (OLS) estimate, βλ|λ=0. That is, xᵀx grows linearly in n while λ* asymptotes, so the effect of λ* diminishes, and the implication is that any bounded choice of λ induces the same effective shrinkage for large n. The parameters a and b do not depend on n, and so the effect of the hyperpenalty will decrease with n, as desired.

Remark 3

Choosing a “flat” hyperpenalty, namely h(λ) = ln(λ), is untenable when p > 2, because it does not satisfy exp{−h(λ)} = o(λ^{−p/2}). Specifically, plugging this h(λ) into (20), the expression hpℓ(β, λ, σ2) can be made arbitrarily large in λ by setting β = 0p.

4.2 Estimating R2

For analyses in which one is unable or unwilling to make a prior guess of R2 to use as R̂2, here we describe a strategy to estimate R2. We note first that R2 = Cor(Y, Xβ)2, where ‘Cor’ denotes the Pearson correlation. It is known that the empirical prediction error from using a vector of fitted values, xβ̂, corresponding to the observed outcomes y will be optimistic when β̂ depends on y. This means that the empirical R2, R̄2 = Ĉor(y, xβ̂)2, will also be optimistic, i.e. upwardly biased, for R2. In contrast, Efron (1983) showed how an estimate of prediction error using the bootstrap will be pessimistic. Applied to our context, given B bootstrapped datasets, the bootstrap estimate of R2, R̂2_boot = (1/B) Σ_{b=1}^{B} Ĉor(y^{(b)}, x^{(b)}β̂^{(−b)})2, where the *(b) and *(−b) notation indicates that the training and test datasets do not overlap, will be biased downward from R2. Efron (1983) suggested that a particular linear combination of the optimistic and pessimistic prediction error estimates would provide an approximately unbiased estimate of prediction error. Analogizing this to estimating R2, the linear combination is given by 0.632 R̂2_boot + 0.368 R̄2. The weight is based on a bootstrapped dataset containing, on average, about (1 − e^{−1})n ≈ 0.632n unique observations from the original dataset. When n > p, this “632-estimate” of R2, using, say, OLS to estimate β in each bootstrap, would provide a reasonable estimate of R2 for our purposes. However, in the n < p scenarios we are specifically interested in, OLS is not an option, and other methods, e.g. ridge regression, must be used to estimate β in each dataset. In addition, when p is large, the bootstrap may add a non-trivial computational component, if determining β̂^{(−b)} is computationally expensive. So as to minimize any added burden due to estimating R2, which is only a preprocessing step before applying the hyperpenalty approach, we propose to modify the 632-estimate by replacing the bootstrap with 5-fold CV:

$$\hat R^2_{0.632} = 0.632 \times \frac{1}{5}\sum_{i=1}^{5} \widehat{\mathrm{Cor}}\big(y_{\kappa(i)},\, x_{\kappa(i)}\beta_{\lambda_{5\text{-CV}}}^{-\kappa(i)}\big)^2 + 0.368 \times \widehat{\mathrm{Cor}}\big(y,\, x\beta_{\lambda_{5\text{-CV}}}\big)^2, \tag{25}$$

where we use the same κ(i) notation defined in Section 2. In words: first calculate λ5-CV; then take a weighted average of the cross-validated squared correlations and the in-sample squared correlation computed on all the data. Let HYP632 denote the hyperpenalty approach using R̂2_0.632 as the estimate of R2.
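A sketch of the estimate in (25): given λ5-CV, blend the cross-validated squared correlations from the five held-out folds with the in-sample squared correlation. The fold construction and helper names below are our own illustrative choices.

```python
import numpy as np

def r2_632(x, y, lam_5cv, n_folds=5, seed=0):
    """Compute R^2_0.632 as in (25), given lambda chosen by 5-fold CV.
    Returns 0.632 * (mean cross-validated squared correlation)
          + 0.368 * (in-sample squared correlation)."""
    n, p = x.shape
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds          # random fold labels kappa(i)
    cv_r2 = []
    for k in range(n_folds):
        test = folds == k
        train = ~test
        beta_k = np.linalg.solve(x[train].T @ x[train] + lam_5cv * np.eye(p),
                                 x[train].T @ y[train])
        cv_r2.append(np.corrcoef(y[test], x[test] @ beta_k)[0, 1] ** 2)
    beta_full = np.linalg.solve(x.T @ x + lam_5cv * np.eye(p), x.T @ y)
    in_sample_r2 = np.corrcoef(y, x @ beta_full)[0, 1] ** 2
    return 0.632 * np.mean(cv_r2) + 0.368 * in_sample_r2
```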

5 Simulation Study

Our simulation study is designed to mimic the current reality of the “-omics” era in which many covariates are analyzed but few contribute substantial effect sizes. The relevant quantities are described as follows:

  • Covariates (x). One simulated dataset consists of training and validation data generated from the same model. The dimension of the training data is n × p, with n ∈ {25, 100, 250, 4000} and p ∈ {100, 4000}. The rows of the n × p matrix x are drawn from Np{0p, ΣX}. For the validation data, a 2000 × p matrix xnew is sampled from this same distribution. We construct ΣX according to “approximately uncorrelated” and “correlated” scenarios, as described in Supplement S5. Briefly: we begin with a block-wise compound symmetric matrix with 10 blocks. Within blocks, the correlation is ρ = 0 (approximately uncorrelated) or ρ = 0.4 (correlated), and between blocks, there is zero correlation. We then stochastically perturb this matrix, using algorithms by Hardin et al. (2013), in such a way as to maintain positive-definiteness while masking the underlying structure; the result is ΣX. This perturbation occurs at each simulation iterate and, as we outline in the Supplement, is less extreme for the approximately uncorrelated case, so that the resulting ΣX is close to Ip.

  • Parameters (β,σ2). To better account for many plausible configurations of the coefficients, which would be difficult using a single, fixed choice of β, we specify a generating distribution, drawing β once per simulation iterate and making it common to all observations. We draw β from a mixture density. Let Zi ∈ {1, 2, 3}, i = 1, …, p, be a random variable with Pr(Zi = 1) = Pr(Zi = 2) = 0.005. Then, construct β as follows:
    $$\alpha_i \mid Z_i \overset{\text{ind}}{\sim} \begin{cases} t_3\{\sigma^2 = 1/3\}, & Z_i = 1\\ \mathrm{Exp}\{1\}, & Z_i = 2\\ N\{0, \sigma^2 = 10^{-6}\}, & Z_i = 3 \end{cases} \tag{26}$$
    $$\beta = \mathrm{AR1}(\pi)\,\alpha, \tag{27}$$
    where AR1(π) is a p × p, first-order auto-regressive correlation matrix with correlation coefficient π ∈ {0, 0.3}. The sampling density for α is a mixture of scaled t3, Exponential, and Normal distributions, and 99% of the coefficients have small effect sizes, coming from the Normal component. Multiplying α by AR1(π) to obtain β encourages neighboring coefficients to have similar effects, depending on π. So that some meaningful signal is present in every dataset, we ensure #{i : Zi ≠ 3} ≥ 3, regardless of p. Given β, ΣX, and R2 ∈ {0.05, 0.2, 0.4, 0.6, 0.8, 0.95}, we calculate σ2 = βᵀΣXβ(1/R2 − 1).
  • Outcomes (y|x). The outcomes from the training and validation data are, respectively, y|x, β, σ2 and ynew|xnew, β, σ2. For each of the 192 combinations of π, p, n, uncorrelated/correlated x, and R2, 10,000 (when p = 100) or 500 (when p = 4000) training and validation datasets are sampled. We present results only for π = 0.3 and R2 ∈ {0.2, 0.4, 0.8} here, with the remainder given in Supplement S6.

We compare the 11 methods listed in Table 1. n-CV is excluded because it is computationally expensive and well approximated by GCV, and AIC is replaced with its small-sample correction, AICC. The new methods considered are GCVC and the hyperpenalty approach, HYP632. We also present HYPTRUE, the hyperpenalty method using the true, unknown value of R2. The difference in prediction error between HYP632 and HYPTRUE quantifies the possible gain, if any, from improving upon the “632” estimation scheme. All methods differ only in λ, which determines the estimate of β via (3). The criterion by which we evaluate methods on the validation data is relative MSPE, rMSPE(λ):

$$\mathrm{rMSPE}(\lambda) = 1000 \times \big( \mathrm{MSPE}(\lambda)/\mathrm{MSPE}(\lambda_{\mathrm{opt}}) - 1 \big), \tag{28}$$

where MSPE(λ) = (ynew − xnewβλ)ᵀ(ynew − xnewβλ) and λopt = arg minλ MSPE(λ). Thus, rMSPE measures the proportionate increase above the smallest possible MSPE, scaled by 1000, and rMSPE = 0 is ideal; equivalently, rMSPE measures the inefficiency of each method. We used an iterative grid search to calculate λopt as well as λ for all methods except MAPHL and HYP632, for which explicit maximization steps are available. Table 2 gives the average rMSPE; values in boldface are the column-wise minima (excluding HYPTRUE), and values with an asterisk are less than twice the column-wise minimum. Figure 1 compares the rMSPE of HYP632 to the median rMSPE of the remaining methods, excluding GCVC, as an overall performance comparison. Figure 2 compares the rMSPE of GCV to that of GCVC. Finally, Figure 3 gives histograms of ln(λ/λopt) for each of the methods from one scenario.
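Given a validation set, (28) is straightforward to compute once λ has been chosen; the short sketch below (with illustrative argument names, our own code) approximates λopt by a grid search, as described above.

```python
import numpy as np

def rmspe(x_new, y_new, x, y, lam_grid, lam_chosen):
    """rMSPE as in (28): 1000 * (MSPE(lam_chosen)/MSPE(lam_opt) - 1),
    where lam_opt minimizes validation MSPE over lam_grid."""
    p = x.shape[1]

    def mspe(lam):
        beta = np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)
        err = y_new - x_new @ beta
        return err @ err

    mspe_opt = min(mspe(lam) for lam in lam_grid)
    return 1000.0 * (mspe(lam_chosen) / mspe_opt - 1.0)
```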

Table 1.

Annotation of all methods from the simulation study and the corresponding Equations and References.

Abbr. Name Eqn. Reference
5-CV Five-fold cross-validation (4) Hastie et al. (Section 7.10, 2009)
BIC Bayesian Information Criterion (9) Schwarz (1978)
AICC Corrected Akaike’s Information Criterion (10) Hurvich et al. (1998)
GCV Generalized cross-validation (7) Craven and Wahba (1979)
GCVC Corrected generalized cross-validation (12) Section 2.1
RGCVγ Robust generalized cross-validation (11) Lukas (2006)
MPML Maximum profile marginal likelihood (14) Wecker and Ansley (1983)
GMPML Generalized maximum profile marginal likelihood (15) Harville (1977); Wahba (1985)
MAPHL Maximum adjusted profile h-likelihood (16)–(18) Lee and Nelder (1996)
LR Loss-rank (19) Tran (2009)
HYP632 Hyperpenalty, R^2 based on “632” estimator (21)–(25) Section 4, Efron (1983)

Table 2.

Average rMSPE as defined in (28) for the 11 methods summarized in Table 1, and also for HYPTRUE, which uses the true, unknown value of R2, with π = 0.3 (Equation (27)).

p = 100, Approximately Uncorrelated
Method/{n/R2} 25/0.2 100/0.2 250/0.2 4000/0.2 25/0.4 100/0.4 250/0.4 4000/0.4 25/0.8 100/0.8 250/0.8 4000/0.8
5-CV 66.3 24.1 *8.7 *0.8 70.3 *26.2 *9.0 *0.8 *74.9 *35.2 *10.6 *0.6
BIC 370.9 > 10^4 103.5 31.6 246.0 > 10^4 250.3 16.4 *63.6 > 10^4 258.7 3.4
AICC *22.2 17.9 *8.5 *0.8 70.5 40.9 14.5 *0.8 283.7 212.4 31.6 *0.6
GCV 79.2 271.8 *7.7 *0.8 78.9 257.0 *8.1 *0.8 *63.7 392.8 *9.1 *0.6
GCVC *23.9 29.7 *7.6 *0.8 37.5 *31.5 *8.1 *0.8 *86.1 47.6 *9.1 *0.6
RGCV0.3 *19.3 53.9 46.2 13.7 64.2 141.3 114.1 *1.0 278.3 560.2 105.0 *0.6
MPML 311.0 18.4 *6.8 *0.8 228.1 *19.4 *6.8 *0.8 *64.1 *25.3 *7.7 *0.6
GMPML 57.1 17.9 *6.8 *0.8 65.8 *18.8 *6.7 *0.8 *66.8 *23.6 *7.7 *0.6
MAPHL 298.0 25.9 *7.0 *0.8 191.5 *27.6 *7.0 *0.8 *44.4 *42.7 *7.8 *0.6
LR 183.6 16.9 19.1 11.3 160.9 40.8 46.3 15.9 *74.9 174.4 178.8 21.6
HYP632 *23.9 *8.0 *8.0 *0.8 *16.1 *19.2 *12.1 *0.8 *59.8 *45.9 *8.5 *0.6
HYPTRUE 8.4 7.1 5.2 0.7 11.2 7.6 4.2 0.7 12.6 15.1 7.8 0.6

p = 100, Positively Correlated
Method/{n/R2} 25/0.2 100/0.2 250/0.2 4000/0.2 25/0.4 100/0.4 250/0.4 4000/0.4 25/0.8 100/0.8 250/0.8 4000/0.8

5-CV 89.4 *23.8 *6.9 *0.8 105.9 *21.0 *6.9 *0.9 *92.0 *22.1 *7.9 *0.8
BIC 593.7 > 10^4 61.6 14.9 446.6 > 10^4 73.2 24.3 169.0 > 10^4 135.8 17.7
AICC *45.3 *15.0 *5.1 *0.8 99.7 *18.1 *5.6 *0.8 362.1 46.6 *12.9 *0.8
GCV 113.1 113.1 *6.5 *0.8 121.9 107.5 *6.5 *0.8 *87.0 168.0 *7.4 *0.8
GCVC *55.4 *22.0 *6.4 *0.8 *66.2 *23.2 *6.4 *0.8 *71.9 *25.5 *7.3 *0.8
RGCV0.3 *49.4 45.2 28.7 5.0 126.4 76.6 29.2 11.4 377.5 95.5 64.6 1.8
MPML 232.0 *15.5 *4.9 *1.0 273.2 *13.6 *6.3 *1.0 165.2 *26.6 *11.8 *0.8
GMPML *64.7 *15.4 *4.8 *1.0 *81.6 *13.2 *6.2 *1.0 *79.5 *24.5 *11.5 *0.8
MAPHL 244.6 *17.5 *5.7 *1.1 200.0 *18.3 *7.7 *1.0 *77.9 39.6 *13.3 *0.8
LR *49.4 *24.8 16.4 4.3 86.5 41.0 24.4 7.5 172.1 80.9 57.0 15.5
HYP632 *34.4 *13.9 *6.6 *1.4 *42.0 *16.7 *5.3 1.8 *88.1 *19.6 18.2 *1.2
HYPTRUE 17.9 10.5 4.9 1.4 27.1 10.4 2.9 1.8 32.7 15.8 20.5 1.2

p = 4000, Approximately Uncorrelated
Method/{n/R2} 25/0.2 100/0.2 250/0.2 4000/0.2 25/0.4 100/0.4 250/0.4 4000/0.4 25/0.8 100/0.8 250/0.8 4000/0.8

5-CV 39.7 17.7 *7.7 *1.0 45.8 *19.2 *7.4 *1.1 40.2 *14.3 *6.6 *1.2
BIC 100.4 114.0 91.2 40.3 53.0 68.0 63.8 73.5 *5.6 *12.7 16.1 376.7
AICC 34.3 22.6 *11.0 1.9 100.9 49.9 22.9 8.5 329.2 160.4 78.4 89.1
GCV 40.6 16.3 *8.2 *0.9 43.9 *18.5 *8.6 *1.0 33.3 *13.7 *6.4 *1.1
GCVC *19.1 13.0 *6.5 *0.9 38.8 *13.7 *5.5 *1.0 110.6 *17.9 *5.4 *1.1
RGCV0.3 29.5 37.8 35.0 5.2 90.0 94.5 73.7 20.7 312.5 273.2 137.1 216.8
MPML 100.3 54.1 *7.1 *1.0 53.0 64.1 11.2 *1.2 *5.6 *12.7 16.1 *1.1
GMPML 38.7 15.3 *6.4 *1.0 43.1 *17.9 *6.1 *1.2 29.6 *12.9 *7.2 *1.1
MAPHL 99.9 113.6 90.7 13.0 52.6 67.6 63.3 12.2 *5.6 *12.5 15.9 11.7
LR 100.3 14.4 12.8 6.0 53.1 *26.9 22.0 14.7 *5.6 55.8 48.3 75.6
HYP632 *10.2 *6.0 *5.6 *1.1 *15.0 *14.3 *7.3 2.8 77.4 35.0 10.9 9.0
HYPTRUE 17.4 13.2 6.9 0.9 29.0 12.8 4.7 1.8 21.9 6.6 2.3 11.9

p = 4000, Positively Correlated
Method/{n/R2} 25/0.2 100/0.2 250/0.2 4000/0.2 25/0.4 100/0.4 250/0.4 4000/0.4 25/0.8 100/0.8 250/0.8 4000/0.8

5-CV 83.5 *15.8 *8.3 *0.7 82.8 *16.5 9.7 *0.8 *72.0 *15.9 *7.2 *1.0
BIC 314.2 95.5 85.3 11.0 261.6 73.9 74.4 14.9 *95.2 33.4 52.1 42.2
AICC *48.9 *11.8 *5.3 *0.6 84.3 *15.5 *6.2 *0.7 239.1 45.8 17.0 5.2
GCV 85.8 *16.7 *8.7 *0.7 84.9 *17.0 9.8 *0.7 *70.8 *16.2 *6.6 *0.9
GCVC *49.7 *13.3 *6.4 *0.7 *50.3 *12.9 *6.6 *0.7 *61.5 *12.2 *4.0 *0.9
RGCV0.3 *56.7 30.6 19.7 *0.9 111.7 47.4 25.6 *1.3 254.7 71.7 24.2 12.4
MPML 242.0 *14.1 *5.3 *0.8 249.2 *16.3 *5.4 *1.4 *95.2 32.6 21.5 2.3
GMPML *60.1 *13.5 *5.3 *0.7 *67.9 *13.5 *4.8 *1.4 *70.3 *13.8 *6.2 2.3
MAPHL 312.4 94.9 83.9 13.3 259.8 73.3 73.0 14.1 *94.2 32.9 50.9 14.5
LR 80.9 *16.9 11.7 2.4 108.3 26.5 17.4 2.9 122.7 59.2 31.5 8.4
HYP632 *31.0 *10.9 *5.6 *0.7 *35.5 *12.9 *4.7 2.2 *55.5 *12.0 *4.6 32.4
HYPTRUE 18.9 5.0 3.3 0.6 16.2 7.7 3.4 2.2 33.8 9.3 3.7 35.3

Values in bold are the column-wise minima, excluding HYPTRUE, and those with an ‘*’ are less than twice the column-wise minimum.

Figure 1.


Ratio of the rMSPE of HYP632 to the median rMSPE of the remaining methods, excluding GCVC, plotted against log10(n). Values less than one indicate that HYP632 has smaller prediction error.

Figure 2.


Locally-smoothed values of rMSPE of GCV and GCVC over n in scenarios for which p = 100. The curve corresponding to GCV changes in behavior near the point n − 1 = p, given by the vertical dashed line.

Figure 3.


Histograms of ln(λ/λopt) for p = 4000, n = 100, π = 0.3, R2 = 0.2, and correlated covariates. ln(λ/λopt) = 0 means that λ was chosen to yield optimal shrinkage. All methods are described in Table 1.

From Table 2, HYP632 and GCVC achieve the stated goal of being useful in n ≈ p or n < p situations, as shown by how frequently they are in boldface or annotated with an asterisk. This is most evident in the smaller-R2 scenarios: either HYP632 or GCVC is frequently the best-performing method when R2 = 0.2 or R2 = 0.4. However, these are not uniformly best across all scenarios considered, i.e. the “4000/0.8” column in the bottom sub-table or the “25/0.8” column in the second-from-bottom sub-table. Even HYPTRUE, which uses the true value of R2, has large rMSPE in these settings, which suggests that the optimal choice of λ is not well approximated by the p(1/R2 − 1) expression discussed in Section 4.1. Also competitive are GMPML and 5-CV; GCV performs well except in the n = 100 scenarios, in which n ≈ p. The remaining methods, BIC, AICC, RGCV0.3, MPML, MAPHL, and LR, have large rMSPE in some scenarios.

To further explore this, Figure 1 plots the ratio of the rMSPE of HYP632 to the median rMSPE of the existing methods, the latter representing the performance of a typical method. When this ratio is less than one, HYP632 has a smaller rMSPE. When p = 100, the y = 1 line is sometimes crossed near log10(n) ≈ 2.4, i.e. n = 250. When p = 4000, the y = 1 line is exceeded when n = 4000 or, regardless of n, when R2 = 0.8 and the covariates are approximately uncorrelated. Based on the discussion in Section 4.1, HYP632 will become equivalent to the asymptotically optimal methods as n → ∞.

As evidenced in the table, GCVC has markedly smaller rMSPE than GCV when n is small. The two values of rMSPE coincide as n increases. We argued in Section 2.1 that the GCV penalty has the potential to elicit undesirable behavior when p = n − 1, namely choosing λ = 0, or as close as possible thereto. Figure 2 gives the ramifications of this in terms of prediction error, plotting locally-smoothed values of rMSPE of GCV and GCVC over many values of n for the p = 100 scenarios; n − 1 is less than, equal to, and greater than p. For all six panels, which correspond to different R2 or correlation combinations, there is a peak in the GCV curve beginning near n − 1 = p. GCVC effectively eliminates this behavior and has almost uniformly smaller rMSPE when n − 1 ≤ p and nearly equal rMSPE when n − 1 > p.

Finally, we compare the actual values of λ for each method. Figure 3 plots histograms of ln(λ/λopt) for the eleven methods from one simulation setting in Table 2: p = 4000, n = 100, π = 0.3, R2 = 0.2, and correlated covariates. For reference, when ln(λ/λopt) = 0, the method has selected the optimal λ. In this small-n scenario, all of the existing methods, at times, choose a very small or large λ. In contrast, the shrinkage from hyperpenalization is evident: the histogram for HYP632 has a considerably smaller range. It has the overall smallest rMSPE in this scenario. Finally, GCV has ln(λ/λopt) as small as −12, and GCVC has ln(λ/λopt) slightly less than −2, providing additional evidence that GCVC effectively prevents overfitting.

6 Bardet-Biedl Data Analysis

To evaluate these methods on a real dataset, we consider the rat gene-expression data first reported in Scheetz et al. (2006). Tissue from 120 twelve-week-old rats was analyzed using microarrays (Affymetrix GeneChip Rat Genome 230 2.0 Array), normalized, and log-transformed. The goal is to find genes associated with expression of the BBS11/TRIM32 gene, which is causative for Bardet-Biedl syndrome (Chiang et al., 2006). Following the strategy of Huang et al. (2008), we considered 18,975 probesets that were sufficiently expressed and, of these, further reduced the number to the p = 3,000 probesets displaying the largest variation. We randomly selected n = 80 arrays as our training data, fit all methods to these data, and measured rMSPE based on the remaining 40 arrays, repeating this 1,000 times. Figure 4 gives boxplots of these 1,000 rMSPE values, ordered from left to right by average rMSPE. The best-performing method is GCVC, with average rMSPE of 50.2, followed by GMPML (53.0), GCV (65.6), HYP632 (66.3), and 5-CV (74.2). These rankings correspond closely to those from the simulation study.

Figure 4.

Boxplots of rMSPE over the 1,000 random training/validation splits of the Bardet-Biedl data, ordered from left to right by average rMSPE.

7 Discussion

We have examined strategies for choosing the ridge parameter λ when the sample size n is small relative to p. Our small-sample modification to GCV, called GCVC, is conceptually trivial but uniformly dominates GCV in our simulation study. This corrected GCV may be applied in other shrinkage or smoothing situations that would otherwise use the standard GCV, such as smoothing splines (Wahba, 1985) or adaptively-weighted linear combinations of linear regression estimates (Boonstra et al., 2013).

We also proposed a novel approach using what we call hyperpenalties, which add another level of shrinkage, that of λ itself, by extending the hierarchical model. A hyperpenalty based on the Gamma density with mean p(1/R̂2 − 1) was shown to work well in the context of ridge regression. The approach is based on the observation that the optimal tuning parameter λ* is approximately p(1/R2 − 1), where p is known. Furthermore, we proposed a simple strategy for estimating R2. Relative to existing methods, our implementation can offer superior prediction and protection against choosing extreme values of λ. One area for improvement of this approach lies in the higher-R2 scenarios, for which it is clear that p(1/R2 − 1) does not approximate the optimal tuning parameter well. However, it is unusual in a high-dimensional regression to expect an R2 larger than 0.6 or 0.7.

Another advantage of the hyperpenalty approach is its applicability in missing data problems: when implementing the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), it is not clear how one might do ridge regression concurrently using goodness-of-fit or marginal likelihood approaches to select λ. On the other hand, by taking advantage of the conditional independence, specified by the hierarchical framework, between λ and any missing data given the remaining parameters, it is conceptually straightforward to embed a maximization step for λ, like expression (22), within a larger EM algorithm. This remains the focus of our current research.

Supplementary Material

Supplemental

Acknowledgments

This work was supported by the National Science Foundation [DMS1007494] and the National Institutes of Health [CA129102]. The authors thank Jian Huang for generously sharing the Bardet-Biedl data. Code for the simulation study was written in R (R Core Team, 2012), using the Matrix package (Bates and Maechler, 2013) to construct block-diagonal matrices, and is available at http://www-personal.umich.edu/~philb.

References

  1. Akaike H. Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory. 1973:267–281.
  2. Armagan A, Zaretzki RL. Model selection via adaptive shrinkage with t priors. Computational Statistics. 2010;25:441–461.
  3. Bates D, Maechler M. Matrix: Sparse and Dense Matrix Classes and Methods. 2013. R package version 1.0-12.
  4. Boonstra PS, Taylor JMG, Mukherjee B. Incorporating auxiliary information for improved prediction in high-dimensional datasets: an ensemble of shrinkage approaches. Biostatistics. 2013;14:259–272. doi: 10.1093/biostatistics/kxs036.
  5. Burnham KP, Anderson DR. Model selection and multimodel inference: A practical information-theoretic approach. 2nd ed. Springer; New York: 2002.
  6. Chiang AP, Beck JS, Yen HJ, Tayeh MK, Scheetz TE, Swiderski RE, Nishimura DY, Braun TA, Kim KYA, Huang J, et al. Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences. 2006;103:6287–6292. doi: 10.1073/pnas.0600158103.
  7. Craven P, Wahba G. Smoothing noisy data with spline functions. Numerische Mathematik. 1979;31:377–403.
  8. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B. 1977;39:1–38.
  9. Efron B. Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association. 1983;78:316–331.
  10. Efron B. Selection criteria for scatterplot smoothers. Annals of Statistics. 2001;29:470–504.
  11. Efron B, Tibshirani R. Improvements on cross-validation: the 632+ bootstrap method. Journal of the American Statistical Association. 1997;92:548–560.
  12. Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–135.
  13. Fu WJ. Penalized regressions: The bridge versus the Lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
  14. Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics. 1979;21:215–223.
  15. Hardin J, Garcia SR, Golan D, et al. A method for generating realistic correlation matrices. The Annals of Applied Statistics. 2013;7:1733–1762.
  16. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association. 1977;72:320–338.
  17. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. Springer; New York: 2009.
  18. Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
  19. Huang J, Ma S, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
  20. Hurvich CM, Simonoff JS, Tsai CL. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society: Series B. 1998;60:271–293.
  21. Hurvich CM, Tsai CL. Regression and time series model selection in small samples. Biometrika. 1989;76:297–307.
  22. Lee Y, Nelder JA. Hierarchical generalized linear models. Journal of the Royal Statistical Society: Series B. 1996;58:619–678.
  23. Li KC. Asymptotic optimality of C_L and generalized cross-validation in ridge regression with application to spline smoothing. The Annals of Statistics. 1986;14:1101–1112.
  24. Lukas MA. Robust generalized cross-validation for choosing the regularization parameter. Inverse Problems. 2006;22:1883–1902.
  25. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: 2012.
  26. Scheetz TE, Kim KYA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
  27. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
  28. Sin CY, White H. Information criteria for selecting possibly misspecified parametric models. Journal of Econometrics. 1996;71:207–225.
  29. Strawderman RL, Wells MT. On hierarchical prior specifications and penalized likelihood. In: Fourdrinier D, Marchand É, Rukhin AL, editors. Contemporary Developments in Bayesian Analysis and Statistical Decision Theory: A Festschrift for William E. Strawderman. Vol. 8. Institute of Mathematical Statistics; 2012. pp. 154–180.
  30. Takada Y. Stein’s positive part estimator and Bayes estimator. Annals of the Institute of Statistical Mathematics. 1979;31:177–183.
  31. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B. 1996;58:267–288.
  32. Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
  33. Tran MN. Penalized maximum likelihood for choosing ridge parameter. Communications in Statistics. 2009;38:1610–1624.
  34. Wahba G. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics. 1985;13:1378–1402.
  35. Wahba G, Wang Y. Behavior near zero of the distribution of GCV smoothing parameter estimates. Statistics & Probability Letters. 1995;25:105–111.
  36. Wecker WE, Ansley CF. The signal extraction approach to nonlinear regression and spline smoothing. Journal of the American Statistical Association. 1983;78:81–89.
  37. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B. 2005;67:301–320.
