Author manuscript; available in PMC 2016 Mar 14.
Published in final edited form as: Stat Sin. 2015 Jul 1;25(3):1185–1206. doi: 10.5705/ss.2013.284

A Small-Sample Choice of the Tuning Parameter in Ridge Regression

Philip S Boonstra 1, Bhramar Mukherjee 1, Jeremy M G Taylor 1
PMCID: PMC4790465  NIHMSID: NIHMS731299  PMID: 26985140

Abstract

We propose new approaches for choosing the shrinkage parameter in ridge regression, a penalized likelihood method for regularizing linear regression coefficients, when the number of observations is small relative to the number of parameters. Existing methods may lead to extreme choices of this parameter, which will either not shrink the coefficients enough or shrink them by too much. Within this “small-n, large-p” context, we suggest a correction to the common generalized cross-validation (GCV) method that preserves the asymptotic optimality of the original GCV. We also introduce the notion of a “hyperpenalty”, which shrinks the shrinkage parameter itself, and make a specific recommendation regarding the choice of hyperpenalty that empirically works well in a broad range of scenarios. A simple algorithm jointly estimates the shrinkage parameter and regression coefficients in the hyperpenalized likelihood. In a comprehensive simulation study of small-sample scenarios, our proposed approaches offer superior prediction over nine other existing methods.

Keywords: Akaike’s information criterion, Cross-validation, Generalized cross-validation, Hyperpenalty, Marginal likelihood, Penalized likelihood

1 Introduction

Suppose we have data, {y, x}, which are n observations of a continuous outcome Y and p covariates X, with the covariate matrix x regarded as fixed; the sample size n is small relative to p. We relate Y and X by a linear model, Y = β0 + Xβ + σε, with ε ∼ N{0, 1}. Up to an additive constant, the log-likelihood is

$$\ell(\beta,\beta_0,\sigma^2) = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\,(y - \beta_0 1_n - x\beta)^\top (y - \beta_0 1_n - x\beta). \tag{1}$$

We center y and standardize x to have unit variance. As a consequence of this, although β0 is estimated in the fitted model, our notation will implicitly reflect the assumption β0 = 0.

We consider penalized estimation of β, with our primary interest being prediction of future observations rather than variable selection. Thus, we focus on L2-penalization, i.e. ridge regression (Hoerl and Kennard, 1970), which, from a prediction perspective, has favorable properties even compared to more recently developed penalization methods (e.g. Frank and Friedman, 1993; Tibshirani, 1996; Fu, 1998; Zou and Hastie, 2005). Ridge regression may be viewed as a hierarchical linear model, similar to mixed effects modeling, in which the “random effects” are the elements of β. An L2-penalty on β implicitly assumes these are jointly and independently Normal with mean zero and variance σ2/λ, because the penalty term matches the negative Normal log-density, up to a normalizing constant not depending on β:

$$p_\lambda(\beta,\sigma^2) = \frac{\lambda}{2\sigma^2}\,\beta^\top\beta - \frac{p}{2}\ln(\lambda) + \frac{p}{2}\ln(\sigma^2). \tag{2}$$

The scalar λ is the ridge parameter, controlling the shrinkage of β toward zero; larger values yield greater shrinkage. Given λ, the maximum penalized likelihood estimate of β is

$$\beta_\lambda = \arg\max_{\beta\mid\lambda}\,\{\ell(\beta,\sigma^2) - p_\lambda(\beta,\sigma^2)\} = (x^\top x + \lambda I_p)^{-1} x^\top y. \tag{3}$$

When n − 1 ≥ p, a key result from Hoerl and Kennard (Theorem 4.3, 1970) is that λ* = arg min_{λ≥0} E[(β − βλ)ᵀ(β − βλ)] > 0, i.e. there exists λ* > 0 for which the mean squared error (MSE) of βλ decreases relative to λ = 0. If xᵀx/n = Ip, then λ* = pσ2/βᵀβ; however, there is no closed-form solution for λ* in the general xᵀx case. A strictly positive λ introduces bias in βλ but decreases variance, creating a bias-variance tradeoff. A choice of λ which is too small leads to overfitting the data, and one which is too large shrinks β by too much; to contrast these extremes, we will hereafter refer to this latter scenario as “underfitting.” The existence of λ* is relevant because prediction error, E[(β − βλ)ᵀxᵀx(β − βλ)], is closely related to MSE and may correspondingly benefit from such a bias-variance tradeoff.
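For concreteness, the estimator in (3) amounts to a single linear solve. The short numpy sketch below is our own illustration (not the authors' code), assuming x has already been standardized and y centered; the simulated inputs and the function name are purely illustrative.

```python
import numpy as np

def ridge_coefficients(x, y, lam):
    """Compute the ridge estimate in (3): (x'x + lam*I_p)^{-1} x'y.

    x is n-by-p and y has length n.  Assumes the columns of x are
    standardized and y is centered, so the intercept beta_0 is handled
    implicitly, as in the paper.
    """
    n, p = x.shape
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

# Illustrative use on simulated data (not from the paper):
rng = np.random.default_rng(0)
n, p = 25, 100
x = rng.standard_normal((n, p))
x = (x - x.mean(axis=0)) / x.std(axis=0)   # standardize columns of x
beta_true = rng.standard_normal(p) * 0.1
y = x @ beta_true + rng.standard_normal(n)
y = y - y.mean()                           # center y
beta_hat = ridge_coefficients(x, y, lam=10.0)
```

Because xᵀx + λIp is positive definite for λ > 0, the solve succeeds even when n < p.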

To approximate λ*, one cannot simply maximize ℓ(β, σ2) − pλ(β, σ2) jointly with respect to β, σ2 and λ, because the expression can be made arbitrarily large by plugging in β = 0 and letting λ → ∞. Typically, λ is selected by optimizing some other objective function. Our motivation for this paper is to investigate selection strategies for λ when n is “small”, by which we informally mean n < p or n ≈ p, the complement being a more standard n ≫ p situation. This small-n situation increasingly occurs in modern genomic studies, whereas common approaches for selecting λ are often justified asymptotically in n.

Our contribution is two-fold. First, we present new ideas for choosing λ, including both a small-sample modification to a common existing approach and novel proposals. Our framework categorizes existing strategies into two classes, based on whether a goodness-of-fit criterion or a likelihood is optimized. Methods in either class may be susceptible to over- or underfitting; a third, new class extends the hierarchical perspective of ridge regression, the first level being (β, σ2) and the second pλ(β, σ2). Following ideas by Takada (1979), who showed that Stein’s Positive Part Estimator corresponds to a posterior mode given a certain prior, and, more recently, Strawderman and Wells (2012), who place a hyperprior on the Lasso penalty parameter, we add a third level, defining a “hyperpenalty” on λ. This hyper-penalty induces shrinkage on λ itself, thereby protecting against extreme choices of λ. The second contribution follows naturally, namely, a comprehensive evaluation of all methods, both existing and newly proposed, in this small-n situation via simulation studies.

The remainder of this paper is organized as follows. We review current approaches for choosing λ (the first and second classes discussed above) in Sections 2 and 3 and propose a small-sample modification to one of these methods, generalized cross-validation (GCV, Craven and Wahba, 1979). In Section 4, we define a generic hyperpenalty function and explore a specific choice for the form of the hyperpenalty in Section 4.1. Section 5 presents a comprehensive simulation study, and Section 6 applies the methods to a gene-expression dataset. Our results suggest that the existing approaches for choosing λ can be improved upon in many small-n cases. Section 7 concludes with a discussion of useful extensions of the hyperpenalty framework.

2 Goodness-of-fit-based methods for selection of λ

These methods define an objective function in terms of λ which is to be minimized. Commonly used is K-fold cross-validation, which partitions the observations into K groups, κ(1), …, κ(K), and calculates βλ K times using equation (3), each time leaving out group κ(i), to get βλ^{−κ(1)}, βλ^{−κ(2)}, etc. For βλ^{−κ(i)}, cross-validated residuals are calculated on the observations in κ(i), which did not contribute to estimating β. The objective function estimates prediction error and is the sum of the squared cross-validated residuals:

$$\lambda_{K\text{-CV}} = \arg\min_\lambda\, \ln \sum_{i=1}^{K} \big(y_{\kappa(i)} - x_{\kappa(i)}\beta_\lambda^{-\kappa(i)}\big)^\top \big(y_{\kappa(i)} - x_{\kappa(i)}\beta_\lambda^{-\kappa(i)}\big). \tag{4}$$

A suggested choice for K is 5 (Hastie et al., 2009). When K = n, some simplification (Golub et al., 1979) gives

$$\lambda_{n\text{-CV}} = \arg\min_\lambda\, \ln \sum_{i=1}^{n} \big(Y_i - X_i\beta_\lambda\big)^2 \big/ \big(1 - P_{\lambda[ii]} - 1/n\big)^2, \tag{5}$$
$$\text{with } P_\lambda = x\,(x^\top x + \lambda I_p)^{-1} x^\top. \tag{6}$$

Pλ[ii] is the ith diagonal element of Pλ and measures the ith observation’s influence in estimating β. Further discussion of its interpretation is given in Section 2.1. From (5), observations for which Pλ[ii] is large, i.e. influential observations, have greater weight. Re-centering y at each fold implies β0 is re-estimated; this is reflected by the “−1/n” term in (5). This term does not appear in the derivations by Golub et al. (1979), which assume β0 is known, but this difference in assumptions is important with regard to GCV, which is discussed next, and our proposed extension of GCV.

GCV multiplies each squared residual in (5) by (1 − Pλ[ii] − 1/n)2/(1 − Trace(Pλ)/n − 1/n)2, thereby giving equal weight to all observations. Using the equality y − xβλ = (In − Pλ)y, further simplification yields

$$\lambda_{\mathrm{GCV}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y - 2\ln\big(1 - \mathrm{Trace}(P_\lambda)/n - 1/n\big) \big\}. \tag{7}$$

Although derived using different principles, other methods reduce to a “model fit + penalty” or “model fit + model complexity” form similar to (7): Akaike’s Information Criterion (AIC, Akaike, 1973) and the Bayesian Information Criterion (BIC, Schwarz, 1978). Respectively, each chooses λ as follows:

$$\lambda_{\mathrm{AIC}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y + 2\,(\mathrm{Trace}(P_\lambda) + 2)/n \big\}, \tag{8}$$
$$\lambda_{\mathrm{BIC}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y + \ln(n)\,(\mathrm{Trace}(P_\lambda) + 2)/n \big\}. \tag{9}$$

Asymptotically in n, GCV will choose the value of λ which minimizes the prediction criterion E[(β − βλ)ᵀxᵀx(β − βλ)] (Golub et al., 1979; Li, 1986). Further, Golub et al. observe that GCV and AIC asymptotically coincide. BIC asymptotically selects the true underlying model from a set of nested candidate models (Sin and White, 1996; Hastie et al., 2009), so its justification for use in selecting λ, which is a shrinkage parameter, is weak. For all of these methods, optimality is based upon the assumption that n ≫ p. When n is small, extreme overfitting is possible (Wahba and Wang, 1995; Efron, 2001), giving small-bias/large-variance estimates. A small-sample correction of AIC (AICC, Hurvich and Tsai, 1989; Hurvich et al., 1998) and a robust version of GCV (RGCVγ, Lukas, 2006) exist:

$$\lambda_{\mathrm{AICC}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y + 2\,(\mathrm{Trace}(P_\lambda) + 2)/(n - \mathrm{Trace}(P_\lambda) - 3) \big\}, \tag{10}$$
$$\lambda_{\mathrm{RGCV}_\gamma} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y - 2\ln\big(1 - \mathrm{Trace}(P_\lambda)/n - 1/n\big) + \ln\big(\gamma + (1-\gamma)\,\mathrm{Trace}(P_\lambda^2)/n\big) \big\}. \tag{11}$$

For AICC, the modified penalty is the product of the original penalty, 2(Trace(Pλ) + 2)/n, and n/(n − Trace(Pλ) − 3). The authors do not consider the possibility of n − Trace(Pλ) − 3 < 0, which would inappropriately change the sign of the penalty, and we have found no discussion of this in the literature. In our implementation of AICC, we replace n − Trace(Pλ) − 3 with its positive part, (n − Trace(Pλ) − 3)+, effectively making the criterion infinitely large in this case. As a rule of thumb, Burnham and Anderson (2002) suggest using AICC over AIC when n < 40p (their threshold for small n) and thus also when n ≈ p. RGCVγ adds an additional term to the GCV criterion based on a tuning parameter γ ∈ (0, 1], as in (11); we use γ = 0.3 based on Lukas’ recommendation. Small choices of λ are more severely penalized, thereby offering protection against overfitting. To the best of our knowledge, the performance of AICC or RGCVγ in the context of selecting λ in ridge regression has not been extensively studied.
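For reference, all of the criteria (7)-(11) depend on λ only through the residual vector (In − Pλ)y and the traces of Pλ and Pλ². The following sketch, our own illustration rather than the authors' implementation, evaluates them on a grid of λ values and applies the positive-part convention for AICC described above; the function and argument names are ours.

```python
import numpy as np

def gof_criteria(x, y, lam_grid, gamma=0.3):
    """Evaluate the GCV (7), AIC (8), BIC (9), AICC (10), and RGCV_gamma (11)
    objectives over a grid of ridge parameters.  Assumes x is standardized
    and y centered.  Returns a dict mapping criterion name -> array."""
    n, p = x.shape
    names = ("GCV", "AIC", "BIC", "AICC", "RGCV")
    out = {name: np.full(len(lam_grid), np.inf) for name in names}
    for j, lam in enumerate(lam_grid):
        p_lam = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)   # (6)
        resid = y - p_lam @ y
        fit = np.log(resid @ resid)            # ln y'(I_n - P_lam)^2 y
        tr = np.trace(p_lam)
        tr2 = np.trace(p_lam @ p_lam)
        out["AIC"][j] = fit + 2.0 * (tr + 2.0) / n
        out["BIC"][j] = fit + np.log(n) * (tr + 2.0) / n
        if n - tr - 3.0 > 0.0:                 # positive-part convention for AICC
            out["AICC"][j] = fit + 2.0 * (tr + 2.0) / (n - tr - 3.0)
        if 1.0 - tr / n - 1.0 / n > 0.0:
            gcv_pen = -2.0 * np.log(1.0 - tr / n - 1.0 / n)
            out["GCV"][j] = fit + gcv_pen
            out["RGCV"][j] = fit + gcv_pen + np.log(gamma + (1.0 - gamma) * tr2 / n)
    return out

# Example: choose lambda for each criterion over a log-spaced grid.
# lam_grid = np.logspace(-3, 5, 200)
# crit = gof_criteria(x, y, lam_grid)
# lam_gcv = lam_grid[np.argmin(crit["GCV"])]
```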

2.1 Small-sample GCV

Trace(Pλ), with Pλ defined in (6), is the effective number of model parameters, excluding β0 and σ2. It decreases monotonically in λ > 0 and lies in the interval (0, min{n − 1, p}). The upper bound on Trace(Pλ) is not min{n, p} because the standardization of x reduces its rank by one when n ≤ p. Although the parameters β0 and σ2 are counted (in the literal sense) in the model complexity terms of AIC and BIC, they have only an additive effect, being represented by the “+2” expressions in (8) and (9). For this reason, β0 and σ2 may be ignored in considering model complexity. However, from (7), GCV counts β0, which is given by the “−1/n” term, but not σ2; counting both will change the penalty, since the model complexity term is on the log-scale. This motivates our proposed small-sample correction to GCV, called GCVC, which does count σ2 as a parameter:

$$\lambda_{\mathrm{GCVC}} = \arg\min_\lambda\, \big\{ \ln y^\top(I_n - P_\lambda)^2 y - 2\ln\big( (1 - \mathrm{Trace}(P_\lambda)/n - 2/n)_+ \big) \big\}. \tag{12}$$

As with AICC, 1 − Trace(Pλ)/n − 2/n may be negative. In this case, subtracting the log of the positive part of 1 − Trace(Pλ)/n − 2/n makes the objective function infinite. This is only a small-sample correction because the objective functions in (7) and (12) coincide as n → ∞, and the asymptotic optimality of GCV transfers to GCVC.

An explanation of why GCVC corrects the small-sample deficiency of GCV is as follows. If n − 1 = p, the model-fit term in the objective function of (7), ln yᵀ(In − Pλ)²y, tends to −∞ as λ decreases. When λ = 0, the fitted values, Pλy, will perfectly match the observations, y, and the data are overfit. The penalty term, −2 ln(1 − Trace(Pλ)/n − 1/n), tends to ∞ as λ decreases, because Trace(Pλ) approaches n − 1. The rates of convergence for the model-fit and penalty terms determine whether GCV chooses a too-small λ. If the model-fit term approaches −∞ faster than the penalty approaches ∞, the objective function is minimized by setting λ as small as possible, which is λ = 0 when n − 1 = p. Although this phenomenon is most striking in cases for which n − 1 = p, as we will see in Section 5, this finding appears to hold when n − 1 < p. In this case, predictions will nearly match observations as λ decreases but remains numerically positive to allow for the matrix inversion in Pλ, and the penalty term still approaches ∞ as λ decreases. Like GCV, the penalty function associated with GCVC also approaches ∞ as λ decreases. In contrast to GCV, however, the GCVC penalty already equals ∞ at λ = λ̃ > 0, where λ̃ is the solution to 1 − Trace(Pλ̃)/n − 2/n = 0, or, equivalently, Trace(Pλ̃) = n − 2. In other words, when fitting GCVC, the effective number of remaining parameters, beyond σ2 and β0, will be less than n − 2, and perfect fit of the observations to the predictions, i.e. λ = 0, cannot occur.

Remark 1

A reviewer observed that the GCVC penalty can be generalized according to −2 ln((1 − Trace(Pλ)/n − c/n)+) for c ≥ 1; special cases of this include GCV (c = 1) and GCVC (c = 2). Extending the interpretation given above, this ensures that the effective number of remaining parameters, beyond σ2 and β0, will be less than n − c, rather than n − 2. Preliminary results from allowing c to vary did not point to a uniformly better choice of c ≠ 2. Also, using c = 2 is consistent with our original motivation for proposing GCVC, namely properly counting the model parameters.
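A sketch of the corrected criterion (12), written with the general constant c of Remark 1 as an argument (c = 2 recovers GCVC and c = 1 the usual GCV); this is again our own hedged illustration, not the authors' code.

```python
import numpy as np

def gcv_c(x, y, lam_grid, c=2.0):
    """GCV_C objective (12): ln y'(I - P)^2 y - 2 ln((1 - Tr(P)/n - c/n)_+).
    Returns the selected lambda from lam_grid.  The objective is treated
    as +inf wherever the positive part is zero, i.e. whenever
    Trace(P_lam) >= n - c."""
    n, p = x.shape
    best_lam, best_val = None, np.inf
    for lam in lam_grid:
        p_lam = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)
        resid = y - p_lam @ y
        slack = 1.0 - np.trace(p_lam) / n - c / n
        if slack <= 0.0:
            continue                           # objective is infinite here
        val = np.log(resid @ resid) - 2.0 * np.log(slack)
        if val < best_val:
            best_lam, best_val = lam, val
    return best_lam
```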

3 Likelihood-based methods for selection of λ

A second approach treats the ridge penalty in (2) as a negative log-density. One can consider a marginal likelihood, where λ is interpreted as the variance component of a mixed-effects model:

$$\ell_m(\lambda,\sigma^2) = \ln \int_\beta \exp\{\ell(\beta,\sigma^2) - p_\lambda(\beta,\sigma^2)\}\, d\beta = -\frac{1}{2\sigma^2}\, y^\top(I_n - P_\lambda)y - \frac{n}{2}\ln(\sigma^2) + \frac{1}{2}\ln|I_n - P_\lambda|. \tag{13}$$

From this, y | λ, σ2 is multivariate Normal with mean 0n (y is centered) and covariance σ2(In − Pλ)⁻¹. The maximum profile marginal likelihood (MPML) estimate, originally proposed for smoothing splines (Wecker and Ansley, 1983), profiles ℓm(λ, σ2) over σ2, replacing each instance with σ̂λ2 = yᵀ(In − Pλ)y/n, and maximizes the “concentrated” log-likelihood, ℓm(λ, σ̂λ2):

$$\lambda_{\mathrm{MPML}} = \arg\min_\lambda\, \Big\{ \ln y^\top(I_n - P_\lambda)y - \frac{1}{n}\ln|I_n - P_\lambda| \Big\}. \tag{14}$$

Closely related is the generalized/restricted MPML (GMPML, Harville, 1977; Wahba, 1985), which adjusts the penalty to account for estimation of regression parameters that are not marginalized. Here, only β0 is not marginalized, so the adjustment is by one degree of freedom (see Supplement S1):

$$\lambda_{\mathrm{GMPML}} = \arg\min_\lambda\, \Big\{ \ln y^\top(I_n - P_\lambda)y - \frac{1}{n-1}\ln|I_n - P_\lambda| \Big\}. \tag{15}$$

In a smoothing-spline comparison of GMPML to GCV, Wahba (1985) found mixed results, with neither method offering uniformly better predictions. For scatterplot smoothers, Efron (2001) notes that GMPML may oversmooth, yielding large bias/small variance estimates.
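Both (14) and (15) require only yᵀ(In − Pλ)y and ln|In − Pλ|, so they can be evaluated on the same λ grid as the goodness-of-fit criteria. A minimal sketch (ours, under the same standardization assumptions as before):

```python
import numpy as np

def mpml_gmpml(x, y, lam_grid):
    """Evaluate the MPML (14) and GMPML (15) objectives over a grid.
    Both use y'(I - P_lam)y and ln|I - P_lam|; they differ only in
    dividing the log-determinant by n versus n - 1."""
    n, p = x.shape
    mpml = np.empty(len(lam_grid))
    gmpml = np.empty(len(lam_grid))
    for j, lam in enumerate(lam_grid):
        p_lam = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)
        m = np.eye(n) - p_lam
        quad = y @ m @ y                      # y'(I - P_lam)y
        _, logdet = np.linalg.slogdet(m)      # ln|I - P_lam|; m is PD for lam > 0
        mpml[j] = np.log(quad) - logdet / n
        gmpml[j] = np.log(quad) - logdet / (n - 1.0)
    return mpml, gmpml
```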

Remark 2

Rather than profiling over σ2, one could jointly maximize m(λ, σ2) over λ and σ2. We have not found this approach previously used as a selection criterion in ridge regression. Our initial investigation of this and its restricted likelihood counterpart gave results similar to MPML and GMPML, and so we do not consider it further.

An alternative to the marginal likelihood methods described above is to treat the objective function in (3) as an h-log-likelihood, or “h-loglihood”, of the type proposed by Lee and Nelder (1996) for hierarchical generalized linear models. The link between penalized likelihoods, like ridge regression, and the h-loglihood was noted in the paper’s ensuing discussion. To estimate σ2 (the dispersion) and λ (the variance component), Lee and Nelder suggested an iterative profiling approach, yielding the maximum adjusted profile h-loglihood (MAPHL) estimate. In Supplement S2, we show one iteration proceeds as follows:

$$\sigma^{2(i)} \leftarrow \frac{(y - x\beta^{(i-1)})^\top (y - x\beta^{(i-1)}) + \lambda^{(i-1)}\,\beta^{(i-1)\top}\beta^{(i-1)}}{n - 1}, \tag{16}$$
$$\lambda^{(i)} \leftarrow \arg\min_\lambda\, \big\{ \lambda\,\beta^{(i-1)\top}\beta^{(i-1)}/\sigma^{2(i)} - \ln|I_n - P_\lambda| \big\}, \tag{17}$$
$$\beta^{(i)} \leftarrow \beta_{\lambda^{(i)}}, \tag{18}$$

and λ_MAPHL = λ^{(∞)}.
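One way to implement the iteration (16)-(18) is to alternate the closed-form σ2 update with a grid search for the λ step in (17), stopping when λ stabilizes. The sketch below is our own illustration of this scheme, with the starting value and convergence rule chosen arbitrarily.

```python
import numpy as np

def maphl(x, y, lam_grid, max_iter=100, tol=1e-6):
    """Iterate (16)-(18) to approximate lambda_MAPHL.
    The lambda step (17) is solved by grid search over lam_grid."""
    n, p = x.shape
    xtx, xty = x.T @ x, x.T @ y
    lam = lam_grid[len(lam_grid) // 2]                 # arbitrary starting value
    beta = np.linalg.solve(xtx + lam * np.eye(p), xty)
    for _ in range(max_iter):
        resid = y - x @ beta
        sigma2 = (resid @ resid + lam * (beta @ beta)) / (n - 1.0)   # (16)
        best_lam, best_val = lam, np.inf
        for cand in lam_grid:                                        # (17)
            p_cand = x @ np.linalg.solve(xtx + cand * np.eye(p), x.T)
            _, logdet = np.linalg.slogdet(np.eye(n) - p_cand)
            val = cand * (beta @ beta) / sigma2 - logdet
            if val < best_val:
                best_lam, best_val = cand, val
        if abs(best_lam - lam) < tol * (1.0 + lam):
            lam = best_lam
            break
        lam = best_lam
        beta = np.linalg.solve(xtx + lam * np.eye(p), xty)           # (18)
    return lam
```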

Finally, Tran (2009) proposed the “Loss-rank” (LR) method for selecting λ. Its derivation, which we do not give, is likelihood-based, but the criterion resembles that of AIC in (8):

$$\lambda_{\mathrm{LR}} = \arg\min_\lambda\, \Big\{ \ln y^\top(I_n - P_\lambda)^2 y - \frac{2}{n}\ln|I_n - P_\lambda| \Big\}. \tag{19}$$

Tran also suggested a modified penalty term, which is dependent on y, but this did not give appreciably different results from λLR in their study.

4 Maximization with Hyperpenalties

As noted previously, some existing methods may choose extreme values of λ, particularly when n is small, suggesting a need for a second level of shrinkage, that is, shrinkage of λ itself. We extend the hierarchical framework of (1) and (2) with a “hyperpenalty” on λ, h(λ), which gives non-negligible support for λ over a finite range of values. The “hyperpenalized log-likelihood” is

$$\mathrm{hp}\ell(\beta,\lambda,\sigma^2) = \ell(\beta,\sigma^2) - p_\lambda(\beta,\sigma^2) - h(\lambda) - \ln(\sigma^2). \tag{20}$$

From the Bayesian perspective, when h(λ) is in the form of a log-density, the hyperpenalty corresponds to a hyperprior on λ, and the hyperpenalized likelihood is the posterior (the expression − ln(σ2) is the log-density of an improper prior on σ2). In contrast to fully Bayesian methods, which characterize the entire posterior, we desire a single point estimate of β, σ2 and λ and focus on mode finding. Importantly, joint maximization with respect to β, σ2 and λ is now possible.

For a general h(λ), we find the joint mode of (20): {β̂, λ̂, σ̂2} = arg max_{β,λ,σ2} {hpℓ(β, λ, σ2)}. Alternatively, {β̂, λ̂, σ̂2} may be calculated using conditional maximization steps:

$$\sigma^{2(i)} \leftarrow \arg\max_{\sigma^2}\,\{\mathrm{hp}\ell(\beta^{(i-1)},\sigma^2,\lambda^{(i-1)})\} = \frac{(y - x\beta^{(i-1)})^\top(y - x\beta^{(i-1)}) + \lambda^{(i-1)}\,\beta^{(i-1)\top}\beta^{(i-1)}}{n + p + 2}, \tag{21}$$
$$\lambda^{(i)} \leftarrow \arg\max_\lambda\,\{\mathrm{hp}\ell(\beta^{(i-1)},\sigma^{2(i)},\lambda)\}, \tag{22}$$
$$\beta^{(i)} \leftarrow \arg\max_\beta\,\{\mathrm{hp}\ell(\beta,\sigma^{2(i)},\lambda^{(i)})\} = \beta_{\lambda^{(i)}}, \tag{23}$$

with {β̂, λ̂, σ̂2} = {β^{(∞)}, λ^{(∞)}, σ^{2(∞)}}. The only step that depends on the choice of h(λ) is (22); the other steps are available in closed form regardless of h(λ).

Based on the expression for pλ(β, σ2) given in (2), if exp{−h(λ)} = o(λ^{−p/2}), then, upon applying the maximization step in (22), λ^{(i)} is guaranteed to be finite, regardless of the values of β^{(i−1)} and σ^{2(i)}. This relates to an earlier comment in the Introduction that one cannot simply maximize ℓ(β, σ2) − pλ(β, σ2) alone. For example, using h(λ) = C, C constant, would yield the same result as maximizing ℓ(β, σ2) − pλ(β, σ2), namely an infinite hyperpenalized log-likelihood. We will propose one such choice of h(λ) that satisfies exp{−h(λ)} = o(λ^{−p/2}) and is empirically observed to work well for the ridge penalty; different choices of h(λ) may be better suited for other penalty functions.

4.1 Choice of hyperpenalty

Crucial to this approach is the determination of an appropriate hyperpenalty and accompanying hyperparameters. Our recommended hyperpenalty is based on the gamma distribution, namely h(λ) = −(a − 1) ln(λ) + bλ, corresponding to a gamma density with shape a and rate b. From the Bayesian perspective, this is natural because it is conjugate to the precision of the Normal distribution, which is one possible interpretation of λ (e.g. Tipping, 2001; Armagan and Zaretzki, 2010). From Supplement S3, the update for λ given in (22) becomes

$$\lambda^{(i)} = \frac{p + 2a - 2}{\beta^{(i-1)\top}\beta^{(i-1)}/\sigma^{2(i)} + 2b}. \tag{24}$$

This additionally requires choosing values for a and b. We will do so by first choosing a desired prior mean for λ, given by a/b, and a desired value for the update λ^{(i)} in (24), and then solving the two expressions for a and b. Necessary to this strategy is that the chosen value of λ^{(i)} must result in a and b that are free of σ^{2(i)} and β^{(i−1)}. To choose a/b, recall the key result from Hoerl and Kennard (1970): when xᵀx/n = Ip, λ* = arg min_{λ≥0} E[(β − βλ)ᵀ(β − βλ)] = pσ2/βᵀβ. While not of immediate practical use, since σ2 and β are the parameters to be estimated, we note that σ2/βᵀβ ≈ (1/R2 − 1), where R2 = βᵀΣXβ/(βᵀΣXβ + σ2) is the coefficient of determination, and the approximation comes from substituting xᵀx/n = Ip for ΣX. In contrast to σ2 or the individual elements of β, there may be prior knowledge about R2. Alternatively, we will propose a strategy (Section 4.2) that provides an estimate of R2. Given an estimate or prior guess of R2 ∈ (0, 1), say R̂2, we set a/b = p(1/R̂2 − 1).

In addition to a sensible mean, it is important that a and b be such that λ^{(i)}, the resulting update for λ, is not extreme. Let the update for λ given in (24) be λ^{(i)} = (p − 1)H^{(i)}, where H^{(i)} is the harmonic mean of σ^{2(i)}/β^{(i−1)ᵀ}β^{(i−1)} and (1/R̂2 − 1). Being a harmonic mean, the (1/R̂2 − 1) term, which will typically be less than 10 for most analyses, moderates potentially large values of σ^{2(i)}/β^{(i−1)ᵀ}β^{(i−1)}, thereby preventing underfitting. Simultaneously, λ^{(i)} increases linearly with p, which prevents overfitting in n < p scenarios.

Solving these two expressions, a/b = p(1/R̂2 − 1) and λ^{(i)} = (p − 1)H^{(i)}, yields a = p/2 and b = (1/R̂2 − 1)^{−1}/2. When the covariates are approximately uncorrelated and R̂2 is close to R2, a/b will be near λ*. However, the variability that the hyperpenalty permits around this mean makes the approach useful in the general xᵀx case, for which no closed-form solution for λ* exists. As we will see in the simulation study, this holds true even when R̂2 ≠ R2.
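Combining (21), (24), and (23) with a = p/2 and b = (1/R̂2 − 1)^{−1}/2 gives a simple fixed-point algorithm in which every step is closed-form. The following sketch is our own illustration of these updates (it is not the authors' released code), again assuming x standardized and y centered.

```python
import numpy as np

def hyp_ridge(x, y, r2_hat, max_iter=200, tol=1e-8):
    """Joint mode of the hyperpenalized likelihood (20) with the gamma
    hyperpenalty of Section 4.1: a = p/2, b = (1/r2_hat - 1)^{-1} / 2.
    Cycles through the closed-form updates (21), (24), and (23)."""
    n, p = x.shape
    xtx, xty = x.T @ x, x.T @ y
    a = p / 2.0
    b = 0.5 / (1.0 / r2_hat - 1.0)
    lam = a / b                                  # start at the prior mean p(1/R^2 - 1)
    beta = np.linalg.solve(xtx + lam * np.eye(p), xty)
    for _ in range(max_iter):
        resid = y - x @ beta
        sigma2 = (resid @ resid + lam * (beta @ beta)) / (n + p + 2.0)       # (21)
        lam_new = (p + 2.0 * a - 2.0) / ((beta @ beta) / sigma2 + 2.0 * b)   # (24)
        beta = np.linalg.solve(xtx + lam_new * np.eye(p), xty)              # (23)
        if abs(lam_new - lam) < tol * (1.0 + lam):
            lam = lam_new
            break
        lam = lam_new
    return beta, lam, sigma2
```

With a = p/2, the λ update reduces to the harmonic-mean form (p − 1)H^{(i)} described above, so large values of σ2/βᵀβ cannot drive λ to extremes.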

Finally, it is important that the hyperpenalty strategy not be inferior in a standard regression, when n ≫ p. To establish this, we derive in Supplement S4 a large-n approximation for λ* in the general xᵀx case: λ* ≈ σ2 Trace[(xᵀx)^{−1}]/βᵀ(xᵀx)^{−1}β. As n → ∞, this converges in probability to σ2 Trace[ΣX^{−1}]/βᵀΣX^{−1}β < ∞, from which we can see that λ* asymptotes in n. From (3), because the crossproduct term, xᵀx, increases linearly in n, βλ*, the ridge estimate of β at the optimal value of λ, approaches the ordinary least squares (OLS) estimate, βλ|λ=0. That is, xᵀx grows linearly in n while λ* asymptotes, so the effect of λ* diminishes, and the implication is that any bounded choice of λ induces the same effective shrinkage for large n. The parameters a and b do not depend on n, and so the effect of the hyperpenalty will decrease with n, as desired.

Remark 3

Choosing a “flat” hyperpenalty, namely h(λ) = ln(λ), is untenable when p > 2, because it does not satisfy exp{−h(λ)} = o(λ^{−p/2}). Specifically, plugging this h(λ) into (20), the expression hpℓ(β, λ, σ2) can be made arbitrarily large in λ by setting β = 0p.

4.2 Estimating R2

For analyses in which one is unable or unwilling to make a prior guess of R2 to use as R̂2, here we describe a strategy to estimate R2. We note first that R2 = Cor(Y, Xβ)2, where ‘Cor’ denotes the Pearson correlation. It is known that the empirical prediction error from using a vector of fitted values, xβ̂, corresponding to the observed outcomes y will be optimistic when β̂ depends on y. This means that the empirical R2, R̄2 = Ĉor(y, xβ̂)2, will also be optimistic, i.e. upwardly biased, for R2. In contrast, Efron (1983) showed how an estimate of prediction error using the bootstrap will be pessimistic. Applied to our context, given B bootstrapped datasets, the bootstrap estimate of R2, R̂2_boot = (1/B) Σ_{b=1}^{B} Ĉor(y^{(b)}, x^{(b)}β̂^{(−b)})2, where the *(b) and *(−b) notation indicates that the training and test datasets do not overlap, will be biased downward from R2. Efron (1983) suggested that a particular linear combination of the optimistic and pessimistic prediction error estimates would provide an approximately unbiased estimate of prediction error. Analogizing this to estimating R2, the linear combination is given by 0.632 R̂2_boot + 0.368 R̄2. The weight is based on a bootstrapped dataset containing, on average, about (1 − e^{−1})n ≈ 0.632n unique observations from the original dataset. When n > p, this “632-estimate” of R2, using, say, OLS to estimate β in each bootstrap, would provide a reasonable estimate of R2 for our purposes. However, in the n < p scenarios we are specifically interested in, OLS is not an option, and other methods, e.g. ridge regression, must be used to estimate β in each dataset. In addition, when p is large, the bootstrap may add a non-trivial computational component, if determining β̂^{(−b)} is computationally expensive. So as to minimize any added burden due to estimating R2, which is only a preprocessing step before applying the hyperpenalty approach, we propose to modify the 632-estimate by replacing the bootstrap with 5-fold CV:

$$\hat R^2_{0.632} = 0.632 \times \frac{1}{5}\sum_{i=1}^{5} \widehat{\mathrm{Cor}}\big(y_{\kappa(i)},\, x_{\kappa(i)}\beta_{\lambda_{5\text{-CV}}}^{-\kappa(i)}\big)^2 + 0.368 \times \widehat{\mathrm{Cor}}\big(y,\, x\beta_{\lambda_{5\text{-CV}}}\big)^2, \tag{25}$$

where we use the same κ(i) notation defined in Section 2. In words: first calculate λ5-CV; then take a weighted average of the cross-validated squared correlations and the in-sample squared correlation computed on all the data. Let HYP632 denote the hyperpenalty approach using R̂2_0.632 as the estimate of R2.
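A sketch of the estimate in (25): given λ5-CV, blend the cross-validated squared correlations from the five held-out folds with the in-sample squared correlation. The fold construction and helper names below are our own illustrative choices.

```python
import numpy as np

def r2_632(x, y, lam_5cv, n_folds=5, seed=0):
    """Compute R^2_0.632 as in (25), given lambda chosen by 5-fold CV.
    Returns 0.632 * (mean cross-validated squared correlation)
          + 0.368 * (in-sample squared correlation)."""
    n, p = x.shape
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds          # random fold labels kappa(i)
    cv_r2 = []
    for k in range(n_folds):
        test = folds == k
        train = ~test
        beta_k = np.linalg.solve(x[train].T @ x[train] + lam_5cv * np.eye(p),
                                 x[train].T @ y[train])
        cv_r2.append(np.corrcoef(y[test], x[test] @ beta_k)[0, 1] ** 2)
    beta_full = np.linalg.solve(x.T @ x + lam_5cv * np.eye(p), x.T @ y)
    in_sample_r2 = np.corrcoef(y, x @ beta_full)[0, 1] ** 2
    return 0.632 * np.mean(cv_r2) + 0.368 * in_sample_r2
```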

5 Simulation Study

Our simulation study is designed to mimic the current reality of the “-omics” era in which many covariates are analyzed but few contribute substantial effect sizes. The relevant quantities are described as follows:

  • Covariates (x). One simulated dataset consists of training and validation data generated from the same model. The dimension of the training data is n × p, with n ∈ {25, 100, 250, 4000} and p ∈ {100, 4000}. The rows of the n × p matrix x are drawn from Np{0p, ΣX}. For the validation data, a 2000 × p matrix xnew is sampled from this same distribution. We construct ΣX according to “approximately uncorrelated” and “correlated” scenarios, as described in Supplement S5. Briefly: we begin with a block-wise compound symmetric matrix with 10 blocks. Within blocks, the correlation is ρ = 0 (approximately uncorrelated) or ρ = 0.4 (correlated), and between blocks, there is zero correlation. We then stochastically perturb this matrix, using algorithms by Hardin et al. (2013), in such a way as to maintain positive-definiteness while masking the underlying structure; the result is ΣX. This perturbation occurs at each simulation iterate and, as we outline in the Supplement, is less extreme for the approximately uncorrelated case, so that the resulting ΣX is close to Ip.

  • Parameters (β,σ2). To better account for many plausible configurations of the coefficients, which would be difficult using a single, fixed choice of β, we specify a generating distribution, drawing β once per simulation iterate and making it common to all observations. We draw β from a mixture density. Let Zi ∈ {1, 2, 3}, i = 1, …, p, be a random variable with Pr(Zi = 1) = Pr(Zi = 2) = 0.005. Then, construct β as follows:
    $$\alpha_i \mid Z_i \overset{\text{ind}}{\sim} \begin{cases} t_3\{\sigma^2 = 1/3\}, & Z_i = 1\\ \mathrm{Exp}\{1\}, & Z_i = 2\\ N\{0, \sigma^2 = 10^{-6}\}, & Z_i = 3 \end{cases} \tag{26}$$
    $$\beta = \mathrm{AR1}(\pi)\,\alpha, \tag{27}$$
    where AR1(π) is a p × p, first-order auto-regressive correlation matrix with correlation coefficient π ∈ {0, 0.3}. The sampling density for α is a mixture of scaled t3, Exponential, and Normal distributions, and 99% of the coefficients have small effect sizes, coming from the Normal component. Multiplying α by AR1(π) to obtain β encourages neighboring coefficients to have similar effects, depending on π. So that some meaningful signal is present in every dataset, we ensure #{i : Zi ≠ 3} ≥ 3, regardless of p. Given β, ΣX, and R2 ∈ {0.05, 0.2, 0.4, 0.6, 0.8, 0.95}, we calculate σ2 = βᵀΣXβ(1/R2 − 1).
  • Outcomes (y|x). The outcomes from the training and validation data are, respectively, y|x, β, σ2 and ynew|xnew, β, σ2. For each of the 192 combinations of π, p, n, uncorrelated/correlated x, and R2, 10,000 (when p = 100) or 500 (when p = 4000) training and validation datasets are sampled. We present results only for π = 0.3 and R2 ∈ {0.2, 0.4, 0.8} here, with the remainder given in Supplement S6.

We compare the 11 methods listed in Table 1. n-CV is excluded because it is computationally expensive and well approximated by GCV, and AIC is replaced with its small-sample correction, AICC. The new methods considered are GCVC and the hyperpenalty approach, HYP632. We also present HYPTRUE, the hyperpenalty method using the true, unknown value of R2. The difference in prediction error between HYP632 and HYPTRUE quantifies the possible gain, if any, from improving upon the “632” estimation scheme. All methods differ only in λ, which determines the estimate of β via (3). The criterion by which we evaluate methods on the validation data is relative MSPE, rMSPE(λ):

$$\mathrm{rMSPE}(\lambda) = 1000 \times \big( \mathrm{MSPE}(\lambda)/\mathrm{MSPE}(\lambda_{\mathrm{opt}}) - 1 \big), \tag{28}$$

where MSPE(λ) = (ynew − xnewβλ)ᵀ(ynew − xnewβλ) and λopt = arg minλ MSPE(λ). Thus, rMSPE measures the proportionate increase above the smallest possible MSPE, scaled by 1000, and rMSPE = 0 is ideal; equivalently, rMSPE measures the inefficiency of each method. We used an iterative grid search to calculate λopt as well as λ for all methods except MAPHL and HYP632, for which explicit maximization steps are available. Table 2 gives the average rMSPE; values in boldface are the column-wise minima (excluding HYPTRUE), and values with an asterisk are less than twice the column-wise minimum. Figure 1 compares the rMSPE of HYP632 to the median rMSPE of the remaining methods, excluding GCVC, as an overall performance comparison. Figure 2 compares the rMSPE of GCV to that of GCVC. Finally, Figure 3 gives histograms of ln(λ/λopt) for each of the methods from one scenario.
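Given a validation set, (28) is straightforward to compute once λ has been chosen; the short sketch below (with illustrative argument names, our own code) approximates λopt by a grid search, as described above.

```python
import numpy as np

def rmspe(x_new, y_new, x, y, lam_grid, lam_chosen):
    """rMSPE as in (28): 1000 * (MSPE(lam_chosen)/MSPE(lam_opt) - 1),
    where lam_opt minimizes validation MSPE over lam_grid."""
    p = x.shape[1]

    def mspe(lam):
        beta = np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)
        err = y_new - x_new @ beta
        return err @ err

    mspe_opt = min(mspe(lam) for lam in lam_grid)
    return 1000.0 * (mspe(lam_chosen) / mspe_opt - 1.0)
```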

Table 1.

Annotation of all methods from the simulation study and the corresponding Equations and References.

Abbr. Name Eqn. Reference
5-CV Five-fold cross-validation (4) Hastie et al. (Section 7.10, 2009)
BIC Bayesian Information Criterion (9) Schwarz (1978)
AICC Corrected Akaike’s Information Criterion (10) Hurvich et al. (1998)
GCV Generalized cross-validation (7) Craven and Wahba (1979)
GCVC Corrected generalized cross-validation (12) Section 2.1
RGCVγ Robust generalized cross-validation (11) Lukas (2006)
MPML Maximum profile marginal likelihood (14) Wecker and Ansley (1983)
GMPML Generalized maximum profile marginal likelihood (15) Harville (1977); Wahba (1985)
MAPHL Maximum adjusted profile h-likelihood (16)–(18) Lee and Nelder (1996)
LR Loss-rank (19) Tran (2009)
HYP632 Hyperpenalty, R^2 based on “632” estimator (21)–(25) Section 4, Efron (1983)

Table 2.

Average rMSPE as defined in (28) for the 11 methods summarized in Table 1, and also for HYPTRUE, which uses the true, unknown value of R2, with π = 0.3 (Equation (27)).

p = 100, Approximately Uncorrelated
Method/{n/R2} 25/0.2 100/0.2 250/0.2 4000/0.2 25/0.4 100/0.4 250/0.4 4000/0.4 25/0.8 100/0.8 250/0.8 4000/0.8
5-CV 66.3 24.1 *8.7 *0.8 70.3 *26.2 *9.0 *0.8 *74.9 *35.2 *10.6 *0.6
BIC 370.9 > 10^4 103.5 31.6 246.0 > 10^4 250.3 16.4 *63.6 > 10^4 258.7 3.4
AICC *22.2 17.9 *8.5 *0.8 70.5 40.9 14.5 *0.8 283.7 212.4 31.6 *0.6
GCV 79.2 271.8 *7.7 *0.8 78.9 257.0 *8.1 *0.8 *63.7 392.8 *9.1 *0.6
GCVC *23.9 29.7 *7.6 *0.8 37.5 *31.5 *8.1 *0.8 *86.1 47.6 *9.1 *0.6
RGCV0.3 *19.3 53.9 46.2 13.7 64.2 141.3 114.1 *1.0 278.3 560.2 105.0 *0.6
MPML 311.0 18.4 *6.8 *0.8 228.1 *19.4 *6.8 *0.8 *64.1 *25.3 *7.7 *0.6
GMPML 57.1 17.9 *6.8 *0.8 65.8 *18.8 *6.7 *0.8 *66.8 *23.6 *7.7 *0.6
MAPHL 298.0 25.9 *7.0 *0.8 191.5 *27.6 *7.0 *0.8 *44.4 *42.7 *7.8 *0.6
LR 183.6 16.9 19.1 11.3 160.9 40.8 46.3 15.9 *74.9 174.4 178.8 21.6
HYP632 *23.9 *8.0 *8.0 *0.8 *16.1 *19.2 *12.1 *0.8 *59.8 *45.9 *8.5 *0.6
HYPTRUE 8.4 7.1 5.2 0.7 11.2 7.6 4.2 0.7 12.6 15.1 7.8 0.6

p = 100, Positively Correlated
Method/{n/R2} 25/0.2 100/0.2 250/0.2 4000/0.2 25/0.4 100/0.4 250/0.4 4000/0.4 25/0.8 100/0.8 250/0.8 4000/0.8

5-CV 89.4 *23.8 *6.9 *0.8 105.9 *21.0 *6.9 *0.9 *92.0 *22.1 *7.9 *0.8
BIC 593.7 > 10^4 61.6 14.9 446.6 > 10^4 73.2 24.3 169.0 > 10^4 135.8 17.7
AICC *45.3 *15.0 *5.1 *0.8 99.7 *18.1 *5.6 *0.8 362.1 46.6 *12.9 *0.8
GCV 113.1 113.1 *6.5 *0.8 121.9 107.5 *6.5 *0.8 *87.0 168.0 *7.4 *0.8
GCVC *55.4 *22.0 *6.4 *0.8 *66.2 *23.2 *6.4 *0.8 *71.9 *25.5 *7.3 *0.8
RGCV0.3 *49.4 45.2 28.7 5.0 126.4 76.6 29.2 11.4 377.5 95.5 64.6 1.8
MPML 232.0 *15.5 *4.9 *1.0 273.2 *13.6 *6.3 *1.0 165.2 *26.6 *11.8 *0.8
GMPML *64.7 *15.4 *4.8 *1.0 *81.6 *13.2 *6.2 *1.0 *79.5 *24.5 *11.5 *0.8
MAPHL 244.6 *17.5 *5.7 *1.1 200.0 *18.3 *7.7 *1.0 *77.9 39.6 *13.3 *0.8
LR *49.4 *24.8 16.4 4.3 86.5 41.0 24.4 7.5 172.1 80.9 57.0 15.5
HYP632 *34.4 *13.9 *6.6 *1.4 *42.0 *16.7 *5.3 1.8 *88.1 *19.6 18.2 *1.2
HYPTRUE 17.9 10.5 4.9 1.4 27.1 10.4 2.9 1.8 32.7 15.8 20.5 1.2

p = 4000, Approximately Uncorrelated
Method/{n/R2} 25/0.2 100/0.2 250/0.2 4000/0.2 25/0.4 100/0.4 250/0.4 4000/0.4 25/0.8 100/0.8 250/0.8 4000/0.8

5-CV 39.7 17.7 *7.7 *1.0 45.8 *19.2 *7.4 *1.1 40.2 *14.3 *6.6 *1.2
BIC 100.4 114.0 91.2 40.3 53.0 68.0 63.8 73.5 *5.6 *12.7 16.1 376.7
AICC 34.3 22.6 *11.0 1.9 100.9 49.9 22.9 8.5 329.2 160.4 78.4 89.1
GCV 40.6 16.3 *8.2 *0.9 43.9 *18.5 *8.6 *1.0 33.3 *13.7 *6.4 *1.1
GCVC *19.1 13.0 *6.5 *0.9 38.8 *13.7 *5.5 *1.0 110.6 *17.9 *5.4 *1.1
RGCV0.3 29.5 37.8 35.0 5.2 90.0 94.5 73.7 20.7 312.5 273.2 137.1 216.8
MPML 100.3 54.1 *7.1 *1.0 53.0 64.1 11.2 *1.2 *5.6 *12.7 16.1 *1.1
GMPML 38.7 15.3 *6.4 *1.0 43.1 *17.9 *6.1 *1.2 29.6 *12.9 *7.2 *1.1
MAPHL 99.9 113.6 90.7 13.0 52.6 67.6 63.3 12.2 *5.6 *12.5 15.9 11.7
LR 100.3 14.4 12.8 6.0 53.1 *26.9 22.0 14.7 *5.6 55.8 48.3 75.6
HYP632 *10.2 *6.0 *5.6 *1.1 *15.0 *14.3 *7.3 2.8 77.4 35.0 10.9 9.0
HYPTRUE 17.4 13.2 6.9 0.9 29.0 12.8 4.7 1.8 21.9 6.6 2.3 11.9

p = 4000, Positively Correlated
Method/{n/R2} 25/0.2 100/0.2 250/0.2 4000/0.2 25/0.4 100/0.4 250/0.4 4000/0.4 25/0.8 100/0.8 250/0.8 4000/0.8

5-CV 83.5 *15.8 *8.3 *0.7 82.8 *16.5 9.7 *0.8 *72.0 *15.9 *7.2 *1.0
BIC 314.2 95.5 85.3 11.0 261.6 73.9 74.4 14.9 *95.2 33.4 52.1 42.2
AICC *48.9 *11.8 *5.3 *0.6 84.3 *15.5 *6.2 *0.7 239.1 45.8 17.0 5.2
GCV 85.8 *16.7 *8.7 *0.7 84.9 *17.0 9.8 *0.7 *70.8 *16.2 *6.6 *0.9
GCVC *49.7 *13.3 *6.4 *0.7 *50.3 *12.9 *6.6 *0.7 *61.5 *12.2 *4.0 *0.9
RGCV0.3 *56.7 30.6 19.7 *0.9 111.7 47.4 25.6 *1.3 254.7 71.7 24.2 12.4
MPML 242.0 *14.1 *5.3 *0.8 249.2 *16.3 *5.4 *1.4 *95.2 32.6 21.5 2.3
GMPML *60.1 *13.5 *5.3 *0.7 *67.9 *13.5 *4.8 *1.4 *70.3 *13.8 *6.2 2.3
MAPHL 312.4 94.9 83.9 13.3 259.8 73.3 73.0 14.1 *94.2 32.9 50.9 14.5
LR 80.9 *16.9 11.7 2.4 108.3 26.5 17.4 2.9 122.7 59.2 31.5 8.4
HYP632 *31.0 *10.9 *5.6 *0.7 *35.5 *12.9 *4.7 2.2 *55.5 *12.0 *4.6 32.4
HYPTRUE 18.9 5.0 3.3 0.6 16.2 7.7 3.4 2.2 33.8 9.3 3.7 35.3

Values in bold are the column-wise minima, excluding HYPTRUE, and those with an ‘*’ are less than twice the column-wise minimum.

Figure 1.


Ratio of the rMSPE of HYP632 to the median rMSPE of the remaining methods, excluding GCVC, plotted against log10(n). Values less than one indicate that HYP632 has smaller prediction error.

Figure 2.


Locally-smoothed values of rMSPE of GCV and GCVC over n in scenarios for which p = 100. The curve corresponding to GCV changes in behavior near the point n − 1 = p, given by the vertical dashed line.

Figure 3.


Histograms of ln(λ/λopt) for p = 4000, n = 100, π = 0.3, R2 = 0.2, and correlated covariates. ln(λ/λopt) = 0 means that λ was chosen to yield optimal shrinkage. All methods are described in Table 1.

From Table 2, HYP632 and GCVC achieve the stated goal of being useful in n ≈ p or n < p situations, as shown by how frequently they are in boldface or annotated with an asterisk. This is most evident in the smaller-R2 scenarios: either HYP632 or GCVC is frequently the best-performing method when R2 = 0.2 or R2 = 0.4. However, these are not uniformly best across all scenarios considered, i.e. the “4000/0.8” column in the bottom sub-table or the “25/0.8” column in the second-from-bottom sub-table. Even HYPTRUE, which uses the true value of R2, has large rMSPE in these settings, which suggests that the optimal choice of λ is not well approximated by the p(1/R2 − 1) expression discussed in Section 4.1. Also competitive are GMPML and 5-CV; GCV performs well except in the n = 100 scenarios, in which n ≈ p. The remaining methods, BIC, AICC, RGCV0.3, MPML, MAPHL, and LR, have large rMSPE in some scenarios.

To further explore this, Figure 1 plots the ratio of the rMSPE of HYP632 to the median rMSPE of the existing methods, the latter representing the performance of a typical method. When this ratio is less than one, HYP632 has a smaller rMSPE. When p = 100, the y = 1 line is sometimes crossed near log10(n) ≈ 2.4, i.e. n = 250. When p = 4000, the y = 1 line is exceeded when n = 4000 or, regardless of n, when R2 = 0.8 and the covariates are approximately uncorrelated. Based on the discussion in Section 4.1, HYP632 will become equivalent to the asymptotically optimal methods as n → ∞.

As evidenced in the table, GCVC has markedly smaller rMSPE than GCV when n is small. The two values of rMSPE coincide as n increases. We argued in Section 2.1 that the GCV penalty has the potential to elicit undesirable behavior when p = n − 1, namely choosing λ = 0, or as close as possible thereto. Figure 2 gives the ramifications of this in terms of prediction error, plotting locally-smoothed values of rMSPE of GCV and GCVC over many values of n for the p = 100 scenarios; n − 1 is less than, equal to, and greater than p. For all six panels, which correspond to different R2 or correlation combinations, there is a peak in the GCV curve beginning near n − 1 = p. GCVC effectively eliminates this behavior and has almost uniformly smaller rMSPE when n − 1 ≤ p and nearly equal rMSPE when n − 1 > p.

Finally, we compare the actual values of λ for each method. Figure 3 plots histograms of ln(λ/λopt) for the eleven methods from one simulation setting in Table 2: p = 4000, n = 100, π = 0.3, R2 = 0.2, and correlated covariates. For reference, when ln(λ/λopt) = 0, the method has selected the optimal λ. In this small-n scenario, all of the existing methods, at times, choose a very small or large λ. In contrast, the shrinkage from hyperpenalization is evident: the histogram for HYP632 has a considerably smaller range. It has the overall smallest rMSPE in this scenario. Finally, GCV has ln(λ/λopt) as small as −12, and GCVC has ln(λ/λopt) slightly less than −2, providing additional evidence that GCVC effectively prevents overfitting.

6 Bardet-Biedl Data Analysis

To evaluate these methods on a real dataset, we consider the rat gene-expression data first reported in Scheetz et al. (2006). Tissue from 120 twelve-week-old rats was analyzed using microarrays (Affymetrix GeneChip Rat Genome 230 2.0 Array), normalized, and log-transformed. The goal is to find genes associated with expression of the BBS11/TRIM32 gene, which is causative for Bardet-Biedl syndrome (Chiang et al., 2006). Following the strategy of Huang et al. (2008), we considered 18,975 probesets that were sufficiently expressed and, of these, further reduced the number to the p = 3,000 probesets displaying the largest variation. We randomly selected n = 80 arrays as our training data, fit all methods to these data, and measured rMSPE based on the remaining 40 arrays, repeating this 1,000 times. Figure 4 gives boxplots of these 1,000 rMSPE values, ordered from left to right by average rMSPE. The best-performing method is GCVC, with average rMSPE of 50.2, followed by GMPML (53.0), GCV (65.6), HYP632 (66.3), and 5-CV (74.2). These rankings correspond closely to those from the simulation study.

Figure 4.

Boxplots of rMSPE over the 1,000 random training/validation splits of the Bardet-Biedl data, ordered from left to right by average rMSPE.

7 Discussion

We have examined strategies for choosing the ridge parameter λ when the sample size n is small relative to p. Our small-sample modification to GCV, called GCVC, is conceptually trivial but uniformly dominates GCV in our simulation study. This corrected GCV may be applied in other shrinkage or smoothing situations that would otherwise use the standard GCV, such as smoothing splines (Wahba, 1985) or adaptively-weighted linear combinations of linear regression estimates (Boonstra et al., 2013).

We also proposed a novel approach using what we call hyperpenalties, which add another level of shrinkage, that of λ itself, by extending the hierarchical model. A hyperpenalty based on the Gamma density with mean p(1/R̂2 − 1) was shown to work well in the context of ridge regression. The approach is based on the observation that the optimal tuning parameter λ* is approximately p(1/R2 − 1), where p is known. Furthermore, we proposed a simple strategy for estimating R2. Relative to existing methods, our implementation can offer superior prediction and protection against choosing extreme values of λ. One area for improvement of this approach lies in the higher-R2 scenarios, for which it is clear that p(1/R2 − 1) does not approximate the optimal tuning parameter well. However, it is unusual in a high-dimensional regression to expect an R2 larger than 0.6 or 0.7.

Another advantage of the hyperpenalty approach is its applicability in missing data problems: when implementing the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), it is not clear how one might do ridge regression concurrently using goodness-of-fit or marginal likelihood approaches to select λ. On the other hand, by taking advantage of the conditional independence, specified by the hierarchical framework, between λ and any missing data given the remaining parameters, it is conceptually straightforward to embed a maximization step for λ, like expression (22), within a larger EM algorithm. This remains the focus of our current research.

Supplementary Material

Supplemental

Acknowledgments

This work was supported by the National Science Foundation [DMS1007494] and the National Institutes of Health [CA129102]. The authors thank Jian Huang for generously sharing the Bardet-Biedl data. Code for the simulation study was written in R (R Core Team, 2012), using the Matrix package (Bates and Maechler, 2013) to construct block-diagonal matrices, and is available at http://www-personal.umich.edu/~philb.

References

  1. Akaike H. Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory. 1973:267–281.
  2. Armagan A, Zaretzki RL. Model selection via adaptive shrinkage with t priors. Computational Statistics. 2010;25:441–461.
  3. Bates D, Maechler M. Matrix: Sparse and Dense Matrix Classes and Methods. 2013. R package version 1.0-12.
  4. Boonstra PS, Taylor JMG, Mukherjee B. Incorporating auxiliary information for improved prediction in high-dimensional datasets: an ensemble of shrinkage approaches. Biostatistics. 2013;14:259–272. doi: 10.1093/biostatistics/kxs036.
  5. Burnham KP, Anderson DR. Model selection and multimodel inference: A practical information-theoretic approach. 2nd ed. Springer; New York: 2002.
  6. Chiang AP, Beck JS, Yen HJ, Tayeh MK, Scheetz TE, Swiderski RE, Nishimura DY, Braun TA, Kim KYA, Huang J, et al. Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences. 2006;103:6287–6292. doi: 10.1073/pnas.0600158103.
  7. Craven P, Wahba G. Smoothing noisy data with spline functions. Numerische Mathematik. 1979;31:377–403.
  8. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B. 1977;39:1–38.
  9. Efron B. Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association. 1983;78:316–331.
  10. Efron B. Selection criteria for scatterplot smoothers. Annals of Statistics. 2001;29:470–504.
  11. Efron B, Tibshirani R. Improvements on cross-validation: the 632+ bootstrap method. Journal of the American Statistical Association. 1997;92:548–560.
  12. Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–135.
  13. Fu WJ. Penalized regressions: The bridge versus the Lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
  14. Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics. 1979;21:215–223.
  15. Hardin J, Garcia SR, Golan D, et al. A method for generating realistic correlation matrices. The Annals of Applied Statistics. 2013;7:1733–1762.
  16. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association. 1977;72:320–338.
  17. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. Springer; New York: 2009.
  18. Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
  19. Huang J, Ma S, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
  20. Hurvich CM, Simonoff JS, Tsai CL. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society: Series B. 1998;60:271–293.
  21. Hurvich CM, Tsai CL. Regression and time series model selection in small samples. Biometrika. 1989;76:297–307.
  22. Lee Y, Nelder JA. Hierarchical generalized linear models. Journal of the Royal Statistical Society: Series B. 1996;58:619–678.
  23. Li KC. Asymptotic optimality of C_L and generalized cross-validation in ridge regression with application to spline smoothing. The Annals of Statistics. 1986;14:1101–1112.
  24. Lukas MA. Robust generalized cross-validation for choosing the regularization parameter. Inverse Problems. 2006;22:1883–1902.
  25. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: 2012.
  26. Scheetz TE, Kim KYA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
  27. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
  28. Sin CY, White H. Information criteria for selecting possibly misspecified parametric models. Journal of Econometrics. 1996;71:207–225.
  29. Strawderman RL, Wells MT. On hierarchical prior specifications and penalized likelihood. In: Fourdrinier D, Marchand É, Rukhin AL, editors. Contemporary Developments in Bayesian Analysis and Statistical Decision Theory: A Festschrift for William E. Strawderman. Vol. 8. Institute of Mathematical Statistics; 2012. pp. 154–180.
  30. Takada Y. Stein’s positive part estimator and Bayes estimator. Annals of the Institute of Statistical Mathematics. 1979;31:177–183.
  31. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B. 1996;58:267–288.
  32. Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
  33. Tran MN. Penalized maximum likelihood for choosing ridge parameter. Communications in Statistics. 2009;38:1610–1624.
  34. Wahba G. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics. 1985;13:1378–1402.
  35. Wahba G, Wang Y. Behavior near zero of the distribution of GCV smoothing parameter estimates. Statistics & Probability Letters. 1995;25:105–111.
  36. Wecker WE, Ansley CF. The signal extraction approach to nonlinear regression and spline smoothing. Journal of the American Statistical Association. 1983;78:81–89.
  37. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B. 2005;67:301–320.
