Biostatistics (Oxford, England). 2013 Apr 9;14(4):653–666. doi: 10.1093/biostatistics/kxt008

Controlling the local false discovery rate in the adaptive Lasso

Joshua N Sampson 1,*, Nilanjan Chatterjee 1, Raymond J Carroll 2, Samuel Müller 3
PMCID: PMC3769997  PMID: 23575212

Abstract

The Lasso shrinkage procedure achieved its popularity, in part, through its tendency to shrink estimated coefficients to zero and its ability to serve as a variable selection procedure. Using data-adaptive weights, the adaptive Lasso modified the original procedure to increase the penalty terms for those variables estimated to be less important by ordinary least squares. Although this modified procedure attains the oracle properties, the resulting models tend to include a large number of "false positives" in practice. Here, we adapt the concept of local false discovery rates (lFDRs) so that it applies to the sequence, λn, of smoothing parameters for the adaptive Lasso. We define the lFDR for a given λn to be the probability that the variable added to the model by decreasing λn to λn − δ is not associated with the outcome, where δ is a small value. We derive the relationship between the lFDR and λn, show that lFDR → 1 for the traditional smoothing parameters, and show how to select λn so as to achieve a desired lFDR. We compare the smoothing parameters chosen to achieve a specified lFDR with those chosen to achieve the oracle properties, as well as their resulting estimates for model coefficients, using both simulation and an example from a genetic study of prostate-specific antigen.

Keywords: Adaptive Lasso, Local false discovery rate, Smoothing parameter, Variable selection

1. Introduction

The Lasso procedure offers a means to fit a linear regression model when the number of parameters p is comparatively large (Tibshirani, 1996, 2011). The Lasso estimates coefficients by minimizing the residual sum of squares plus a penalty term. Let there be n subjects, let Y = (Y_1,…,Y_n)^T be their outcomes, let X_j = (X_{j1},…,X_{jn})^T be their measurements for variable j = 1,…,p, and let X = (X_1,…,X_p). Then the estimated coefficients are

  β̂(λ_n) = argmin_β ‖Y − Xβ‖² + λ_n Σ_{j=1}^p |β_j|.   (1)

A major benefit of the L1 penalty is that the Lasso also serves as a variable selection method, as a large proportion of the estimated coefficients β̂_j are set exactly to 0 when λ_n is large.
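For intuition, when the design is orthogonal in the sense X^T X = nI, this minimization has a closed-form solution: each OLS coefficient is soft-thresholded, and the threshold grows with λ_n. A minimal numerical sketch (illustrative code, not from the paper; it assumes the penalized residual-sum-of-squares objective above):

```python
import numpy as np

def lasso_orthogonal(X, y, lam):
    """Lasso fit when X^T X = n * I (orthogonal design).

    Minimizes ||y - X b||^2 + lam * sum_j |b_j|.  With X^T X = n*I the
    problem separates across coordinates, and each coordinate solution is
    the soft-thresholded OLS estimate:
        sign(b_ols) * max(|b_ols| - lam/(2n), 0).
    """
    n = X.shape[0]
    b_ols = X.T @ y / n
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / (2 * n), 0.0)

# Small demonstration on a design made orthogonal by construction.
rng = np.random.default_rng(0)
n, p = 100, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q                            # columns satisfy X^T X = n * I
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y = X @ beta + 0.1 * rng.normal(size=n)
b_small = lasso_orthogonal(X, y, lam=1.0)     # light penalty: signals retained
b_large = lasso_orthogonal(X, y, lam=500.0)   # heavy penalty: everything zeroed
```

Larger λ_n drives more coefficients exactly to zero, which is the variable-selection behavior described above.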

The adaptive Lasso modifies the original version by adding a data-defined weight, ŵ_j, to the penalty term (Zou, 2006). For our purposes, we consider only ŵ_j = 1/|β̂_j^{ols}|, where β̂_j^{ols} is the ordinary least squares estimate. The adaptive Lasso minimizes

  ‖Y − Xβ‖² + λ_n Σ_{j=1}^p ŵ_j|β_j|.   (1.1)

When λ_n → ∞ and λ_n/√n → 0, the adaptive Lasso is an oracle procedure (Cai and Sun, 2007; Fan and Li, 2001). Let the true relationship be described by the linear equation E(Y|X) = β_1X_1 + ⋯ + β_pX_p, where only a strict subset of the β-coefficients are non-zero, this subset being A = {j : β_j ≠ 0}. An oracle procedure is defined by having the following two properties:

  • Consistent variable selection: lim_{n→∞} P(Â_n = A) = 1, where Â_n is the estimated set of influential variables.

  • Asymptotic efficiency: √n(β̂_A − β_A) →_d N(0, Σ), where Σ is the inverse of the information matrix when A is known.

In practice, with finite sample sizes, a sequence λ_n that satisfies the oracle requirements results in a model that includes a large number of false positives (i.e. the set Â_n∖A is large) (Martinez and others, 2010). In this manuscript, our three objectives are the following: (1) to demonstrate, mathematically, that choosing λ_n to meet the oracle properties will result in a high false positive rate for finite samples; (2) to quantify the probability that a variable selected into the model is a false positive, a probability that can provide confidence that an included variable is independently associated with the outcome; and (3) to show how to identify a sequence of smoothing parameters that controls the number of false positives, instead of achieving the oracle properties.

In order to measure and control the number of false positives, we introduce the concept of the local false discovery rate (lFDR) into the selection of λ_n (Efron and others, 2001; Efron and Tibshirani, 2002; Benjamini and Hochberg, 1995). Specifically, we define lFDR(λ_n) to be the probability that a variable added to the model is a false positive when the penalty term is incrementally lowered below λ_n. Our first goal is to derive the relationship between the lFDR and λ_n. We then show that lFDR(λ_n) → 1, an unusual target for most problems, if λ_n satisfies the oracle requirements, thus explaining the observation that the adaptive Lasso results in a large number of false positives when the effect sizes are not too large. In more traditional problems, a value of 0.05 is often the targeted FDR or lFDR. Finally, we offer a parametric bootstrap method, similar to a step described by Hall and others (2009), for selecting λ_n to achieve a desired lFDR. Others have also noted this high false positive rate and proposed bootstrapped and Bayesian versions of the Lasso to handle the problem (Bach, 2008; Hans, 2010; Park and Casella, 2005).

Our motivating example comes from a Genome-Wide Association Study (GWAS). Both the Lasso (Wu and others, 2009) and the adaptive Lasso (Kooperberg and others, 2010; Sun and others, 2010) have become popular tools for GWAS because variable selection is an important step when hundreds of thousands of single nucleotide polymorphisms (SNPs) are available for testing. In our specific study, we focus on modeling the prostate-specific antigen (PSA) level, a biomarker indicative of prostate cancer (Parikh and others, 2010).

The remainder of this paper is organized as follows. In Section 2, we introduce notation and review the adaptive Lasso. We then formalize our definition of the lFDR, derive the relationship between the lFDR and λ_n, and provide asymptotic theory. Finally, we describe our bootstrap approach for choosing λ_n. In Section 3 and in supplementary material available at Biostatistics online, we evaluate, through simulation and our motivating example, the behavior of λ_n when selected by the lFDR. We conclude with a short discussion in Section 4.

2. Methods

2.1. Notation

We assume that there is a continuous outcome Y_i whose true value is defined by

  Y_i = β_1X_{1i} + ⋯ + β_pX_{pi} + ϵ_i,   (2.1)

where ϵ_i ∼ Normal(0, σ²). Further, we assume X^TX/n → D, where D is a positive-definite matrix. Recall that A is the set of covariates that are associated with a non-zero β, A ≡ {j : β_j ≠ 0}, and β_A ≡ {β_j : j ∈ A}. We say that covariate j is influential if j ∈ A and superfluous if j ∉ A. Without loss of generality, assume that A = {1,…,p_0}, let z = 1 − p_0/p be the proportion of superfluous covariates, and let D_{00} be the corresponding p_0×p_0 submatrix of D.

Let β̂(λ_n) be the parameter estimates produced by the adaptive Lasso,

  β̂(λ_n) = argmin_β ‖Y − Xβ‖² + λ_n Σ_{j=1}^p ŵ_j|β_j|,

where, for our purposes, ŵ_j = 1/|β̂_j^{ols}|. The sequence λ_n is the set of smoothing parameters. We let Â_n be the set of covariates predicted to have a non-zero β, so Â_n = {j : β̂_j(λ_n) ≠ 0}.
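Under an orthogonal design (X^TX = nI), the adaptive Lasso also separates across coordinates, but with the coordinate-specific threshold λ_n ŵ_j/(2n), so that a coefficient is zeroed exactly when n(β̂_j^{ols})² ≤ λ_n/2 under this normalization. A sketch of this closed form (illustrative code, with the penalized objective above assumed):

```python
import numpy as np

def adaptive_lasso_orthogonal(X, y, lam, eps=1e-12):
    """Adaptive Lasso fit when X^T X = n * I, with weights w_j = 1/|b_ols_j|.

    Up to a constant, each coordinate minimizes
        n * b**2 - 2 * n * b_ols * b + lam * w_j * |b|,
    giving sign(b_ols) * max(|b_ols| - lam * w_j / (2n), 0).  Consequently
    b_j = 0 exactly when n * b_ols_j**2 <= lam / 2.
    """
    n = X.shape[0]
    b_ols = X.T @ y / n
    w = 1.0 / np.maximum(np.abs(b_ols), eps)
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam * w / (2 * n), 0.0)

# Demonstration: the selected set matches the thresholding rule.
rng = np.random.default_rng(1)
n, p = 200, 8
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q                     # orthogonal by construction
beta = np.array([0.8, 0.4, 0, 0, 0, 0, 0, 0.0])
y = X @ beta + rng.normal(size=n)
lam = 40.0
b_hat = adaptive_lasso_orthogonal(X, y, lam)
b_ols = X.T @ y / n
```

Note how influential variables (large |β̂_j^{ols}|) receive small weights and are barely shrunk, while near-zero OLS estimates receive very large weights and are removed.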

Finally, we include the notation and definitions for the local FDR and related terms. We denote the probabilities, P_fp(λ_n) and P_fn(λ_n), that a variable is a false positive or a false negative by

  P_fp(λ_n) = P{j ∈ Â_n | j ∉ A},  P_fn(λ_n) = P{j ∉ Â_n | j ∈ A}.

We define the lFDR by

  lFDR(λ_n) = zΔ_fp(λ_n) / {zΔ_fp(λ_n) + (1 − z)Δ_fn(λ_n)},   (2.2)

where

  Δ_fp(λ) = lim_{δ↓0} {P_fp(λ − δ) − P_fp(λ)}/δ,  Δ_fn(λ) = lim_{δ↓0} {P_fn(λ) − P_fn(λ − δ)}/δ.

By a Taylor series expansion, the expected difference in the number of false positives at λ and λ − δ, pz{P_fp(λ − δ) − P_fp(λ)}, is approximately pzδΔ_fp(λ). Similarly, the expected differences in the number of false negatives and in the total number of variables included in the model are approximately p(1 − z)δΔ_fn(λ) and pzδΔ_fp(λ) + p(1 − z)δΔ_fn(λ). Therefore, (2.2) defines the lFDR as the probability that a variable added to our model is superfluous, if added when the smoothing parameter is lowered just below λ_n. Our definition of the lFDR differs from that traditionally given for two reasons: (i) we interpret the lFDR from a frequentist point of view and (ii) we index it by the smoothing parameter λ_n instead of by a test statistic. The traditional definitions of lFDR and FDR have also been used for purposes of variable selection, usually by including only those variables with a q-value below a given threshold (Storey, 2002). However, such an approach would be less appropriate for Lasso procedures, which are designed to avoid this post hoc selection. Note that an equivalent definition of the FDR is obtained by replacing Δ_fp and Δ_fn with P_fp and P_fn in (2.2).
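To make the definition concrete, the quantities in (2.2) can be evaluated numerically in the orthogonal case, where inclusion of variable j is governed by the non-central χ² behavior of n(β̂_j^{ols})². This sketch assumes the orthogonal-design selection rule n(β̂_j^{ols})² > λ/2, σ² = 1, and equal influential effects; it is an illustration, not the authors' code:

```python
import numpy as np
from scipy.stats import chi2, ncx2

def lfdr_at(lam, n, beta, z, delta=1e-3):
    """Finite-difference illustration of the lFDR definition (2.2).

    Orthogonal design, sigma^2 = 1, beta_j = beta for all influential j.
    A variable enters the model once n * b_ols_j**2 exceeds lam / 2
    (assumed selection rule), so n * b_ols_j**2 is central chi-square(1)
    for null variables and non-central (nc = n * beta**2) otherwise.
    """
    mu = n * beta ** 2
    p_fp = lambda l: chi2.sf(l / 2.0, df=1)          # null variable included
    p_fn = lambda l: ncx2.cdf(l / 2.0, 1, mu)        # influential var excluded
    d_fp = (p_fp(lam - delta) - p_fp(lam)) / delta   # false positives added
    d_fn = (p_fn(lam) - p_fn(lam - delta)) / delta   # true positives added
    return z * d_fp / (z * d_fp + (1 - z) * d_fn)
```

For example, with n = 1000, β = 0.15, and z = 0.7, `lfdr_at` is near 1 for small λ and near 0 for large λ, matching the intuition that variables entering at small λ are mostly false positives.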

2.2. Prior results

The adaptive Lasso has many theoretical properties. Here, we build on two previous results. Zou (2006) states the requirements needed for the adaptive Lasso to have the oracle properties.

Theorem 1 —

Suppose that

  λ_n → ∞  and  λ_n/√n → 0.   (2.3)

Then the adaptive Lasso estimates must satisfy the following:

  P(Â_n = A) → 1,   (2.4)

  √n(β̂_A(λ_n) − β_A) →_d N(0, σ²D_{00}^{−1}).   (2.5)

If our focus is on variable selection, then a theorem of Pötscher and Schneider (2009) proves equally useful.

Theorem 2 —

Let X^TX = nI, where I is the identity matrix. Then

  β̂_j(λ_n) = 0  if and only if  n(β̂_j^{ols})² ≤ λ_n/2.

Because β̂_j^{ols} is asymptotically normal with mean β_j, we immediately see

  P{β̂_j(λ_n) = 0} = P(χ²_{μ_j} ≤ λ_n/2),   (2.6)

where χ²_{μ_j} follows a non-central χ² distribution with one degree of freedom and non-centrality parameter μ_j = nβ_j² (taking σ² = 1).

2.3. Local false discovery rates

When X is orthogonal, the total number of variables included in the model is monotonically non-decreasing as λn decreases. The lFDR is the proportion of added variables that are expected to be false positives. When X is orthogonal and σ2=1, then

2.3. (2.7)

where C(λ), which can be interpreted as the cost of removing a false positive, is

2.3. (2.8)

where Inline graphic is the density for a χ2 variable with non-centrality parameter Inline graphic.

Equations (2.7) and (2.8) allow us to choose λ_n to achieve a specific lFDR. For example, if, in addition to σ² = 1 and X being orthogonal, all β_j = β, so that μ = nβ² and C(λ) = e^{−μ/2}cosh{√(μλ/2)}, then the lFDR will never exceed q if

  λ_n ≥ (2/μ) [cosh^{−1}{e^{μ/2} z(1 − q)/((1 − z)q)}]².   (2.9)

The sequence λ_n, when defined by (2.9), is independent of the number of variables p. Moreover, all properties discussed hold regardless of the size of β (e.g. whether β is constant or decreasing at a rate of 1/√n). Therefore, although there is no λ_n that can attain the oracle property when β is decreasing at a rate of 1/√n (Pötscher and Schneider, 2009), the sequence defined by (2.9) would still attain the stated lFDR. As expected, we note that the lFDR decreases with increasing λ_n, confirming that those variables added when λ_n is small are more likely to be false positives. We define λ_n^q to satisfy lFDR(λ_n^q) = q.

2.4. Constant β

The term exp{−√(μλ_n/2)} in the expansion of C(λ_n) can be ignored when √(μλ_n/2) is large. Specifically, when √(μ_jλ_n/2) is large for all j, the lFDR at a given value of λ_n can be approximated within 1% of its true value by

  lFDR(λ_n) ≈ [1 + {(1 − z)/(2z)} exp{√(μλ_n/2) − μ/2}]^{−1}.   (2.10)

Equation (2.10) shows more clearly that if we choose λ_n to achieve the oracle property (i.e. λ_n/√n → 0), then we are choosing a λ_n that results in lFDR(λ_n) → 1. As an lFDR of 1 implies that all variables being added to the model are false positives, purposely choosing such a λ_n would seem counterintuitive. Therefore, even when λ_n can be chosen to achieve the oracle properties, it is unclear whether such a choice is desirable. An alternative approach would be to choose λ_n to ensure that lFDR < q. In the previous example, where σ² = 1, X is orthogonal, and β_j = β, we now see lFDR < q if

  λ_n > (2/μ) [μ/2 + log{2z(1 − q)/((1 − z)q)}]².   (2.11)

Purposely choosing a λ_n such that the lFDR → 0 seems equally counterintuitive, limiting the reasonable choices for λ_n. If σ² = 1, X is orthogonal, and β_j = β, where β is a constant, we see that for the lFDR not to diverge to 0 or 1, we need λ_n/n → t ≡ 0.5β².

Lemma 1 —

When β_j = β for all j ∈ A, β is constant, σ² = 1, X is orthogonal, and t = 0.5β², then

  λ_n/n → a < t  ⟹  lFDR(λ_n) → 1,   (2.12)

  λ_n/n → a > t  ⟹  lFDR(λ_n) → 0,   (2.13)

  λ_n − nt → c  ⟹  lFDR(λ_n) → [1 + {(1 − z)/(2z)} e^{c/2}]^{−1},   (2.14)

where c is a finite constant.

If λ_n were chosen to achieve an lFDR strictly between 0 and 1, then only the first of the two oracle properties holds, lim P(Â_n = A) = 1 from (2.4). However, we claim that forgoing the second oracle property, in exchange for an lFDR between 0 and 1, is no loss. Although performing variable selection and fitting in a single step is convenient, it is unnecessary. Clearly, there is a two-step method that recovers the second oracle property. After using the adaptive Lasso with an lFDR-selected λ_n for variable selection, we can refit the model using OLS with only that subset of variables. This two-step procedure not only satisfies both oracle properties, but offers improved efficiency over the single-step procedure, reminding us that an oracle procedure is not an optimal procedure. Although an oracle procedure promises that P{β̂_j(λ_n) = 0} → 1 for all superfluous variables, it makes no claim as to the rate at which this occurs. Asymptotically, we can increase the rate at which superfluous variables are removed without decreasing the rate at which influential variables are retained. Returning to (2.6), this potential improvement is clear because, asymptotically, the probability of retaining an influential variable is unchanged by increasing λ_n so long as λ_n/n stays below t.

2.5. Empirical choice of λn

In the idealized scenario, where X is orthogonal, β_j = β for all j ∈ A, and both z and β are known, (2.9) can be used to choose a sequence λ_n to achieve a specified lFDR. If the values {β_j : j ∈ A} are not identical, then the solution to (2.8) would need to be obtained numerically. Although β and z are unknown in practice, we could use an estimate of z and either an estimate of β or a lower bound for a biologically meaningful β. However, when (2.9) is evaluated with these estimates, the chosen λ_n tends to produce an lFDR above the desired value when X is not orthogonal. Therefore, we prefer a bootstrap approach similar to one of the steps discussed by Hall and others (2009). The algorithm is as follows. First, fit a simple model of Y on X to obtain estimates of β. In practice, as done in our simulations, we suggest identifying the non-zero β by the adaptive Lasso with λ_n^d and then defining β̂ by the OLS estimates. Denote the variance of the residuals from this model by σ̂². Next, set all components of β̂ below some threshold, on the order of the sampling error of the OLS estimates when n > p, equal to zero. Then generate B sets of data, assuming the true model is Y* = Xβ̂ + ϵ*, where ϵ*_i ∼ Normal(0, σ̂²). For each value of λ_n in a given set, we calculate the number of true, TP^b(λ_n), and false, FP^b(λ_n), positives added to the model between λ_n − δ and λ_n + δ, where δ is an appropriately small number and the superscript b denotes the dataset. We can then estimate the lFDR for each λ_n by

  lFDR_est(λ_n) = Σ_b FP^b(λ_n) / Σ_b {FP^b(λ_n) + TP^b(λ_n)},   (2.15)

and select the smoothing parameter that achieves a specified lFDR, q:

  λ̂_n^q = min{λ_n : lFDR_est(λ_n) ≤ q}.

For completeness, we define lFDR_est(λ_n) = 0 when Σ_b {FP^b(λ_n) + TP^b(λ_n)} = 0. In practice, B = 10, but we base our selection on a monotonically smoothed version of lFDR_est(λ_n).
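The steps above can be sketched in code. For transparency the sketch uses the orthogonal-design closed form for the adaptive Lasso fit, a pilot threshold of 2σ̂/√n, and a raw grid in place of the monotone smoothing; all three are illustrative choices, not the authors' implementation:

```python
import numpy as np

def choose_lambda_lfdr(X, y, q, lams, B=10, seed=0):
    """Parametric-bootstrap estimate of lFDR(lam) on a grid (cf. (2.15));
    return the smallest lam whose estimated lFDR is at most q.

    Assumes X^T X = n*I, so a variable is selected at lam exactly when
    n * b_ols_j**2 > lam / 2 (orthogonal-design rule).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b_ols = X.T @ y / n                              # pilot fit (OLS)
    sigma = np.std(y - X @ b_ols)                    # residual SD
    # zero out small pilot coefficients (illustrative threshold choice)
    b0 = np.where(np.abs(b_ols) > 2 * sigma / np.sqrt(n), b_ols, 0.0)
    truth = b0 != 0                                  # "influential" in bootstrap world
    fp = np.zeros(len(lams) - 1)
    tp = np.zeros(len(lams) - 1)
    for _ in range(B):
        yb = X @ b0 + rng.normal(0.0, sigma, size=n)
        stat = n * (X.T @ yb / n) ** 2
        for k in range(len(lams) - 1):
            # variables added when lowering lam from lams[k+1] to lams[k]
            added = (stat > lams[k] / 2) & (stat <= lams[k + 1] / 2)
            fp[k] += np.sum(added & ~truth)
            tp[k] += np.sum(added & truth)
    est = np.where(fp + tp > 0, fp / np.maximum(fp + tp, 1), 0.0)
    ok = [lams[k] for k in range(len(lams) - 1) if est[k] <= q]
    return min(ok) if ok else lams[-1]
```

For the same data and seed, the selected λ is non-increasing in q, since raising q only enlarges the set of acceptable grid points.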

For purposes of comparison, we consider the standard method for selecting λ_n to be cross-validation aimed at minimizing the prediction error of future estimates. Recall that standard 10-fold cross-validation starts by dividing the set S_n of n subjects into 10 mutually exclusive sets, s_1 ∪ s_2 ∪ ⋯ ∪ s_10 = S_n, of roughly equal size. Let β̂_j^{(−k)}(λ), 1 ≤ k ≤ 10, be the adaptive Lasso estimate for β_j based on those subjects not in s_k. Then

  λ̂_n^{cv} = argmin_λ Σ_{k=1}^{10} Σ_{i∈s_k} {Y_i − Σ_j β̂_j^{(−k)}(λ)X_{ji}}².

Also, λ̂_n^{cv} is an estimate of the deviance-optimized smoothing parameter:

  λ_n^d = argmin_λ E[{Y_0 − Σ_j β̂_j(λ; T)X_{j0}}²],

where T = (Y, X) are the data input into the adaptive Lasso to obtain the estimates β̂(λ; T), and T_0 = (Y_0, X_0) are the data from a new individual. When β is fixed and X is orthogonal, the smoothing parameters minimizing the deviance must satisfy the oracle properties.
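The cross-validation criterion can be sketched as follows (illustrative; the inner fit again uses the orthogonal-design closed form, which is only approximate once a fold is removed):

```python
import numpy as np

def cv_lambda(X, y, lams, n_folds=10, seed=0):
    """10-fold CV for the adaptive Lasso smoothing parameter.

    Returns the lam on the grid minimizing held-out squared error; each
    training-fold fit uses the (approximate) orthogonal-design closed
    form with weights 1/|b_ols|.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    fold = rng.permutation(n) % n_folds
    err = np.zeros(len(lams))
    for k in range(n_folds):
        tr, te = fold != k, fold == k
        nk = int(tr.sum())
        b_ols = X[tr].T @ y[tr] / nk
        w = 1.0 / np.maximum(np.abs(b_ols), 1e-12)
        for i, lam in enumerate(lams):
            b = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam * w / (2 * nk), 0.0)
            err[i] += np.sum((y[te] - X[te] @ b) ** 2)
    return float(lams[int(np.argmin(err))])
```

Because the criterion targets prediction error rather than selection error, the minimizing λ tends to be small, consistent with the paper's point that deviance-optimized parameters admit many false positives.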

2.6. High-dimensional adaptive Lasso: p > n

As defined in (1.1), the weights in the adaptive Lasso are 1/|β̂_j^{ols}|. However, when p > n, the weights must substitute a different estimate of β in place of β̂_j^{ols}. Two possible substitutes that have been studied are β̂_j^{m}, the estimates obtained by fitting a separate marginal model for each variable (Huang and others, 2008), and β̂_j^{lasso}, the estimates from a regular Lasso procedure (Zhou and others, 2009). The properties of the latter weights, 1/|β̂_j^{lasso}|, have been studied and demonstrated to have useful qualities (Zhou and others, 2009). In practice, however, we found that 1/|β̂_j^{m}| performed better, and chose to use those weights in our simulations. For defining β̂ in the bootstrap when p > n, we cannot use the n > p threshold as our cutoff. Instead, we first perform the adaptive Lasso on our data and count the number of coefficients estimated to be non-zero. We then find the threshold such that, after setting all β̂_j below that threshold to 0 and simulating data, the adaptive Lasso on the simulated data estimates a similar number of non-zero coefficients.
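The marginal weights used when p > n can be computed in one pass; a sketch (illustrative code; assumes the columns of X are on comparable scales):

```python
import numpy as np

def marginal_weights(X, y, eps=1e-12):
    """Adaptive-Lasso weights 1/|b_m_j| from p single-variable regressions.

    For each j, the one-variable OLS slope is b_m_j = X_j^T y / X_j^T X_j,
    so strongly associated variables receive small weights (light penalties)
    and null variables receive large weights (heavy penalties).
    """
    b_m = (X.T @ y) / np.sum(X * X, axis=0)
    return 1.0 / np.maximum(np.abs(b_m), eps)
```

Because each fit uses one variable at a time, the weights remain well defined even when p greatly exceeds n, unlike the full OLS weights of (1.1).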

3. Results

3.1. Simulation design: comparing λqn and λdn

Our first goal is to offer an example comparing the magnitude and performance of λ_n^d and λ_n^q. As with all simulations here, our objective is not to describe the performance of the estimates β̂(λ_n^d) and β̂(λ_n^q), but to calculate, describe, and compare the true values of λ_n^d and λ_n^q. We assume that the covariate matrix X is orthogonal and that the outcome Y can be described by linear regression, (2.1), with β_j = 0.15 if j ∈ A and σ² = 1. For these examples, we fixed the number of covariates at p = 50, but let the size of A vary, z ∈ {0.5, 0.7, 0.9}. As described below, we used simulation to calculate λ_n^d and λ_n^q, their corresponding lFDR, and the proportion of variables that were misclassified, err_MC, for a sequence of sample sizes between n = 200 and n = 2000.

Our second goal is to show that results are essentially unchanged when we vary p. For efficiency, we calculated λdn, λqn, lFDR, and errMC at only n=1000 for p∈{100,200,500}, maintaining all of the other assumptions.

Our third goal is to examine whether λqn, calculated assuming that X is orthogonal, was appropriate when there was dependence. Specifically, we repeated the abbreviated analyses assuming that the covariance structure of (Xi1,Xi2,…,Xip) is block diagonal. Correlation ρ within a block was constant, ρ∈{0.3,0.6}, each block contained the same number of influential variables (or possibly no influential variables if there were more blocks than influential variables), and each block contained the same number of total variables. Variables were divided into 2, 5, or 10 groups.

For any combination of n, p, z, and covariance structure, we estimate the values of λ_n^d, λ_n^q, lFDR, and err_MC by simulating 200 000 realizations of X and Y. For each simulation, at a specified set of λ ranging from 0.01 to 100, we calculate the residual deviance and err_MC. Furthermore, for the same set of λ, we count the number of true and false positives added to the model when the smoothing parameter was between λ − δ and λ + δ. Then, for each λ, we average the number of true and false positives added, the deviance, and err_MC over all 200 000 datasets to obtain estimates of each desired value. The lFDR was estimated by the ratio of the average number of false positives added to the total number of variables added to the model. For each combination of n, p, z, and covariance structure discussed, we generated a new set of 200 000 datasets. To generate a dataset X, we assumed that {X_i1,…,X_ip} followed a normal distribution with mean 0 and the specified covariance matrix. When X was assumed orthogonal, we used the resulting principal components. All datasets were standardized, so each variable had mean 0 and variance 1.

3.2. Simulation results: comparing λqn and λdn

First, consider the example where X is orthogonal, p = 50, and z = 0.7. Figure 1(a) shows that λ_n^q increases linearly with the number of subjects and that the slope is approximately 0.5β², as (2.11) suggests, except when n is small. The same equation promises that the lFDR-selected λ_n's, for values q_1 and q_2, differ by approximately 2 log{q_2(1 − q_1)/(q_1(1 − q_2))}. As the deviance-optimized λ_n's achieve the oracle properties when X is orthogonal, they must increase at a rate slower than √n, and therefore the representative black line in Figure 1(a) lies significantly below those illustrating the smoothing parameters chosen to achieve the specified values of the lFDR.

Fig. 1. (a) The sequence of λ_n chosen to minimize the deviance (solid black line) or chosen to achieve a specified lFDR (broken lines) increases with the number of subjects in the study. For these simulations, z = 0.7, p = 50, and β = 0.15 for associated variables. (b) The average proportion of variables that are misclassified (error, y-axis), i.e. the number of false positives and false negatives, quickly drops to 0 when λ_n is chosen to achieve a specified lFDR, but remains above 0 for deviance-optimized smoothing parameters.

The advantage of choosing a sequence λ_n that increases linearly with the number of subjects is that the proportion of misclassified variables converges to 0 much more quickly. Figure 1(b) shows that when there are 1000 individuals in the study and 35 out of the 50 SNPs are superfluous, on average, 12% of the variables are misclassified with the deviance-optimized parameters, whereas less than 2.1% are misclassified when using lFDR-selected parameters. The relationship between the lFDR and the percentage misclassified is not monotone, as it depends on z. Here, setting q = 0.5 minimized the proportion misclassified. Figure 2 shows that when λ_n minimizes deviance, the cost of reducing false positives is very low, or equivalently, the lFDR is high, so there is great benefit in increasing λ_n. In terms of identifying A exactly, with 1000 individuals, the probability that there is at least one misclassified variable, P(Â_n ≠ A), exceeds 0.999 when using deviance-optimized smoothing parameters, whereas that probability is less than 0.64 when using lFDR-selected parameters.

Fig. 2. (a) The solid black line shows the probability, P{β̂_j(λ) = 0 | j ∉ A}, that a null variable is excluded from Â_n when X is orthogonal. The top curved dashed line shows the probability, P{β̂_j(λ) ≠ 0 | j ∈ A}, that a non-null variable is included in Â_n when X is orthogonal, z = 0.7, and n = 1000. The vertical dashed line farthest to the right indicates λ_n^d. The other pairs of broken lines show the equivalent values when n = 500 and n = 800. (b) The local FDR, lFDR, is illustrated as a function of λ for the three scenarios above.

Table 1 shows that the large difference between λdn and λqn remains for p>50, and, in fact, both λdn and λqn appear to be essentially independent of p when X is orthogonal. When the covariates are correlated, compared with when they are independent, λqn tends to be larger, as more stringent penalty terms are needed to exclude null variables that are correlated with influential variables. Increasing ρ or block size magnifies this effect. Therefore, in practice, we suggest choosing λqn by the bootstrapping method described in Section 2.5. Table 1 also demonstrates the obvious result that as the proportion z of null variables increases, λqn must also increase.

Table 1.

The smoothing parameters designed to achieve lFDR = 0.5 (λ_n^q) are larger than those designed to minimize deviance (λ_n^d). For correlated designs, the columns give λ_n^q when the variables are divided into 10, 5, or 2 groups.

                              λ_n^q, Low correlation    λ_n^q, High correlation
  p    z    λ_n^d   λ_n^q,     10      5      2           10      5      2
                    Indep.
  100  0.9  6.31    18.52      18.72   20.32  27.33       22.32   27.63  30.23
  100  0.7  3.7     15.02      16.22   18.32  18.32       14.92   17.12  15.22
  200  0.9  6.21    17.92      20.72   23.63  28.33       26.23   28.13  39.44
  200  0.7  3.7     14.82      17.02   18.72  19.82       15.02   14.92  15.42
  500  0.9  6.21    18.12      28.73   34.84  45.15       38.44   50.56  51.66
  500  0.7  3.7     14.22      18.62   18.92  19.52       15.42   15.42  16.02

3.3. Simulation design: evaluating the performance of adaptive Lasso with λqn

Our next goal is to evaluate the performance of the adaptive Lasso when using λ_n^q, estimated by the bootstrap approach described in Sections 2.5 and 2.6. This method selects a set of variables that should satisfy the specified lFDR criterion. For comparison, we consider a more traditional method for selecting variables targeting the same criterion. This method, implemented in the R package fdrtool (Strimmer, 2008), takes as input the p-values calculated from models including each variable individually. In brief, the method decomposes the overall distribution of p-values into two distributions, representing the p-values from the null and influential variables. Given these two distributions, the traditional method first estimates the p-value threshold that would result in the specified lFDR and then selects all variables meeting that threshold.

We consider the lFDR, lFDR_AL, resulting from using the bootstrap version of the adaptive Lasso and the rate lFDR_TR resulting from the traditional method. We compare these observed rates with the targeted values: q ∈ {0.1, 0.5, 0.9}. These comparisons are performed in two types of datasets. When n > p, settings are similar to those in Section 3.1: n = 1000, p = 500, z = 0.9, β_j = 0.1 if j ∈ A, and σ² = 1. In order for the traditional method to produce rates below 1, we reduce the number of correlated variables per block to 5. Again, ρ ∈ {0.0, 0.3, 0.6}. When p > n, specifically n = 1000 and p = 5000, we increase z to 0.96 and include 10 variables per correlated block. To achieve q = 0.1 when p > n, we further increase z to 0.99 and β_j to 0.35. We provide an extended set of simulations, exploring other correlation structures and effect distributions, in supplementary material available at Biostatistics online.

For each combination of parameters, we generated 1000 datasets and then averaged the resulting lFDRAL and lFDRTR across all 1000 datasets. For each dataset, we defined the lFDR to be 0 if the last variable selected was influential, 1 otherwise.

3.4. Simulation results: evaluating the performance of adaptive Lasso with λqn

The bootstrap approach proposed in Sections 2.5 and 2.6 selected values of λ_n^q that, when applied to the full dataset, resulted in lFDR values similar to the targeted value. In the example where n > p and ρ = 0, the observed lFDR was 0.06, 0.48, and 0.89 when λ_n^q was chosen to achieve lFDR = 0.1, 0.5, and 0.9. When targeting lFDR = 0.1, our method achieved a lower lFDR, and therefore our chosen λ^q was larger than desired. This inflated λ^q arises, in part, from a tendency to select too few non-zero components of β̂ in our bootstrap models. Table 2 and results in supplementary material available at Biostatistics online show that the lFDR estimates were only minimally altered by changing the correlation structure or by considering the p > n setting.

Table 2.

A comparison between the newly proposed bootstrap (B) method for obtaining a specified lFDR and the traditional (T) approach

            ρ = 0.0          ρ = 0.3          ρ = 0.6
  Target    B       T        B       T        B       T
  n > p
  0.1       0.059   0.088    0.083   0.232    0.101   0.688
  0.5       0.477   0.435    0.505   0.737    0.52    0.919
  0.9       0.893   0.804    0.905   0.94     0.909   0.967
  p > n
  0.1       0.008   0.075    0.019   0.298    0.105   0.904
  0.5       0.466   0.464    0.592   0.679    0.639   0.905
  0.9       0.751   0.864    0.861   0.953    0.894   0.978

The traditional approach, based on estimating the p-value distribution of the null and influential variables, performed poorly when there was high correlation between variables (Table 2). When there was high correlation, models with only a single variable assigned low p-values to those null variables associated with influential variables. This resulted in more variables achieving the lFDR threshold, but a higher proportion were false positives. With n>p, ρ=0.6, and a targeted lFDR=0.5, the observed lFDR=0.9.

3.5. Application

In the United States, prostate cancer is the most commonly diagnosed non-cutaneous cancer in men, with approximately 200 000 new diagnoses each year. Because levels of PSA are elevated in the presence of prostate cancer, it is commonly used as a biomarker for early detection. Unfortunately, the specificity of tests based on PSA is often very low, as many healthy individuals also have high levels. Specificity could be greatly improved by a method that can identify individuals with naturally high levels. To this end, there have been large GWASs searching for genetic markers associated with the PSA level. The Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial, or PLCO, which recorded PSA levels, genotyped 2200 healthy men using an Illumina genotyping platform containing more than 500 000 SNPs (Andriole and others, 2009). We focus on a subset of 530 SNPs in and around the KLK3 gene (Parikh and others, 2010).

Let X be the 2200×530 matrix containing the genotypes for the study population at these 530 SNPs. Genotypes are coded as 0, 1, or 2, indicating the number of minor alleles at that SNP. Let Y be the log-transformed PSA levels. We regressed Y on X using a linear model fit by the adaptive Lasso procedure. We repeated this analysis with two values of λ: λ̂_n^d, chosen by cross-validation, and λ̂_n^q, estimated by the previously defined bootstrap procedure with q = 0.5. Unfortunately, the truth is unknown, and therefore we can only use this example to illustrate their relative performance. As expected, λ̂_n^d was significantly smaller than the value estimated to achieve lFDR = 0.5. As a consequence, λ̂_n^d allowed 17 SNPs to have non-zero coefficients, whereas λ̂_n^q allowed only 1 SNP (Table S2 of supplementary material available at Biostatistics online). Although we cannot be certain that either model is correct, it seems doubtful that 17 SNPs in that region are directly associated with PSA levels. To estimate the β corresponding to rs2735839 using the two-stage approach, first selecting variables with λ̂_n^q and then estimating β using OLS, we calculate β̂ from a model containing only rs2735839.

4. Discussion

The adaptive Lasso has become a popular model-building procedure because it shrinks a subset of coefficients to zero, thereby simultaneously performing variable selection and simplifying model interpretation. Although, asymptotically, using the traditional smoothing parameters promises that the adaptive Lasso will achieve consistent variable selection, their use often leaves a large number of false positives in the model when sample size is finite.

The lFDR is usually a form of post-processing, in that we would first perform a statistical procedure to attach a p-value to an estimate of each parameter and then determine the probability that the true value of a parameter with that p-value is the null value. We have adapted the lFDR framework to select smoothing parameters in the adaptive Lasso. Instead of defining an lFDR for a specific p-value, we define it for a specific value of the tuning constant λ. The framework offers an alternative means for selecting the smoothing parameters. When chosen to achieve a specified value of the lFDR, the adaptive Lasso procedure promises both asymptotically consistent variable selection and better control of the false positive rate for finite samples.

By itself, a single-step adaptive Lasso procedure using λ^q, the lFDR-selected smoothing parameter, does not achieve the oracle properties. If one believed that the optimal, or best, estimator had to have these properties, then a combined variable selection and model fitting procedure with λ^q would not be a viable option. However, we do not consider the absence of the second oracle property to be a deterrent to using λ^q. First, the oracle properties can be regained by a two-step procedure that adds a separate model fitting step, where OLS is applied only to those variables retained by the initial adaptive Lasso. Although the convenience of a one-step procedure is sacrificed, the final estimate would still have the stated properties. Second, the first oracle property, consistent variable selection, is not a statement of optimality. That property makes no claims on the rate at which P(Â_n = A) → 1. In some sense, the rate of our two-step procedure is faster than the rate of the single-step procedure. Therefore, there is a benefit to our method, even if it cannot be measured by a characteristic as coarse as the oracle properties.

We chose to select the smoothing parameters to achieve a desired lFDR, instead of an FDR, because we wanted to judge each variable on its own merits, not on the merits of all selected variables. As discussed previously (Efron and others, 2001; Efron and Tibshirani, 2002), if one fit a model with 1000 variables and aimed for an FDR of 0.1, and the first 90 variables selected were guaranteed to be non-null, then the next 10 would be included regardless of the evidence. Note also that, in addition to providing examples with lFDR=0.1, we offered examples with an lFDR as high as 0.9, a larger value than is generally used. For the adaptive Lasso, where standard practice has effectively been to choose lFDR=1 and there is often a desire not to omit any non-null variables, aiming for these larger lFDR values may be preferred.

5. Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

6. Funding

Sampson's and Chatterjee's research was supported by the Intramural Research Program of the NCI. Chatterjee's research was supported by a gene-environment initiative grant from the NHLBI (RO1-HL091172-01). Müller's research was supported by a grant from the Australian Research Council (DP110101998). Carroll's research was supported by a grant from the National Cancer Institute (R37-CA057030). Carroll was also supported by Award Number KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST).

Acknowledgements

This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the NIH, Bethesda, Md. (http://biowulf.nih.gov). Conflict of Interest: None declared.

References

  1. Andriole G. L., Grubb R. L., Buys S. S., Chia D., Church T. R., Fouad M. N., Gelmann E. P., Kvale P. A., Reding D. J., Weissfeld J. L. and others. Mortality results from a randomized prostate-cancer screening trial. New England Journal of Medicine. 2009;360:1310–1319. doi: 10.1056/NEJMoa0810696.
  2. Bach F. R. Bolasso: model consistent lasso estimation through the bootstrap. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML). 2008. Helsinki, Finland.
  3. Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B. 1995;57:289–300.
  4. Cai T., Sun W. Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association. 2007;102:901–912.
  5. Efron B., Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology. 2002;23:70–86. doi: 10.1002/gepi.1124.
  6. Efron B., Tibshirani R., Storey J. D., Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96:1151–1160.
  7. Fan J., Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  8. Hall P., Lee E. R., Park B. Bootstrap-based penalty choice for the lasso achieving oracle performance. Statistica Sinica. 2009;19:449–471.
  9. Hans C. Model uncertainty and variable selection in Bayesian Lasso regression. Statistics and Computing. 2010;20:221–229.
  10. Huang J., Ma S., Zhang C.-H. Adaptive lasso for sparse high dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
  11. Kooperberg C., LeBlanc M., Obenchain V. Risk prediction using genome-wide association studies. Genetic Epidemiology. 2010;34:643–652. doi: 10.1002/gepi.20509.
  12. Martinez J. G., Carroll R. J., Muller S., Sampson J. N., Chatterjee N. A note on the effect on power of score tests via dimension reduction by penalized regression under the null. The International Journal of Biostatistics. 2010;6: Article 12. doi: 10.2202/1557-4679.1231.
  13. Parikh H., Deng Z., Yeager M., Boland J., Matthews C., Jia J., Collins I., White A., Burdett L., Hutchinson A. and others. A comprehensive resequence analysis of the KLK15-KLK3-KLK2 locus on chromosome 19q13.33. Human Genetics. 2010;127:91–99. doi: 10.1007/s00439-009-0751-5.
  14. Park T., Casella G. The Bayesian Lasso. Technical Report. 2005.
  15. Pötscher B. M., Schneider U. On the distribution of the adaptive lasso estimator. Journal of Statistical Planning and Inference. 2009;139:2775–2790.
  16. Storey J. D. A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2002;64:479–498.
  17. Strimmer K. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics. 2008;24:1461–1462. doi: 10.1093/bioinformatics/btn209.
  18. Sun W., Ibrahim J. G., Zou F. Genomewide multiple-loci mapping in experimental crosses by iterative adaptive penalized regression. Genetics. 2010;185:349–359. doi: 10.1534/genetics.110.114280.
  19. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288.
  20. Tibshirani R. Regression shrinkage and selection via the Lasso: a retrospective. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2011;73:273–282.
  21. Wu T. T., Chen Y. F., Hastie T., Sobel E., Lange K. Genome-wide association analysis by Lasso penalized logistic regression. Bioinformatics. 2009;25:714–721. doi: 10.1093/bioinformatics/btp041.
  22. Zhou S., van de Geer S., Buhlmann P. Adaptive lasso for high dimensional regression and gaussian graphical modeling. 2009. arXiv:0903.2515.
  23. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
