Abstract
The Lasso shrinkage procedure achieved its popularity, in part, by its tendency to shrink estimated coefficients to zero, and its ability to serve as a variable selection procedure. Using data-adaptive weights, the adaptive Lasso modified the original procedure to increase the penalty terms for those variables estimated to be less important by ordinary least squares. Although this modified procedure attained the oracle properties, the resulting models tend to include a large number of “false positives” in practice. Here, we adapt the concept of local false discovery rates (lFDRs) so that it applies to the sequence, λn, of smoothing parameters for the adaptive Lasso. We define the lFDR for a given λn to be the probability that the variable added to the model by decreasing λn to λn−δ is not associated with the outcome, where δ is a small value. We derive the relationship between the lFDR and λn, show lFDR=1 for traditional smoothing parameters, and show how to select λn so as to achieve a desired lFDR. We compare the smoothing parameters chosen to achieve a specified lFDR and those chosen to achieve the oracle properties, as well as their resulting estimates for model coefficients, with both simulation and an example from a genetic study of prostate specific antigen.
Keywords: Adaptive Lasso, Local false discovery rate, Smoothing parameter, Variable selection
1. Introduction
The Lasso procedure offers a means to fit a linear regression model when the number of parameters p is comparatively large (Tibshirani, 1996, 2011). The Lasso estimates coefficients by minimizing the residual sum of squares plus a penalty term. Let there be n subjects, let $Y=(Y_1,\ldots,Y_n)^T$ be their outcomes, let $X_j=(X_{j1},\ldots,X_{jn})^T$ be their measurements for variable $j=1,\ldots,p$, and let $X=(X_1,\ldots,X_p)$. Then the estimated coefficients are

$$\hat{\beta}(\lambda_n)=\underset{\beta}{\arg\min}\,\Big\|Y-\sum_{j=1}^{p}X_j\beta_j\Big\|^2+\lambda_n\sum_{j=1}^{p}|\beta_j|.$$
A major benefit of the L1 penalty is that the Lasso also serves as a variable selection method, as a large proportion of the estimated coefficients $\hat{\beta}_j$ are reduced to 0 when λn is large.
The adaptive Lasso modifies the original version by adding a data-defined weight, $\hat{w}_j$, to the penalty term (Zou, 2006). For our purposes, we consider only $\hat{w}_j=1/|\hat{\beta}_{j,\text{OLS}}|$, where $\hat{\beta}_{j,\text{OLS}}$ is the ordinary least squares estimate. The adaptive Lasso minimizes

$$\hat{\beta}(\lambda_n)=\underset{\beta}{\arg\min}\,\Big\|Y-\sum_{j=1}^{p}X_j\beta_j\Big\|^2+\lambda_n\sum_{j=1}^{p}\hat{w}_j|\beta_j|. \tag{1.1}$$

When λn/√n→0 and λn→∞, the adaptive Lasso is an oracle procedure (Cai and Sun, 2007; Fan and Li, 2001). Let the true relationship be described by the linear equation E(Y|X)=β1X1+⋯+βpXp, where only a strict subset of the β-coefficients is non-zero, this subset being A={j:βj≠0}. An oracle procedure is defined by having the following two properties:
Consistent variable selection: $\lim_{n\rightarrow\infty}P(\hat{A}_n=A)=1$, where $\hat{A}_n=\{j:\hat{\beta}_j(\lambda_n)\neq 0\}$ is the estimated set of influential variables.

Asymptotic efficiency: $\sqrt{n}(\hat{\beta}_A-\beta_A)\rightarrow_d N(0,\Sigma)$, where Σ is the inverse of the information matrix when A is known.
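Although this paper describes no software of its own at this point, the weighted criterion (1.1) is straightforward to fit with standard tools. The sketch below is a minimal illustration using the glmnet R package, which is our choice and is not named in this paper; glmnet accepts per-coefficient penalty weights through its penalty.factor argument.

```r
## Minimal sketch of the adaptive Lasso criterion (1.1), assuming the
## glmnet package; weights w_j = 1/|OLS estimate| require n > p.
library(glmnet)

set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(0.5, -0.5, rep(0, p - 2))     # A = {1, 2}; rest superfluous
Y <- drop(X %*% beta + rnorm(n))

b_ols <- coef(lm(Y ~ X - 1))            # ordinary least squares
w <- 1 / abs(b_ols)                     # data-defined weights

## Note: glmnet applies lambda * sum_j w_j |beta_j| to a 1/(2n)-scaled
## residual sum of squares and rescales penalty.factor internally, so
## its lambda is not on the same scale as the lambda_n in (1.1).
fit <- glmnet(X, Y, alpha = 1, penalty.factor = w)
coef(fit, s = 0.05)                     # estimates at one lambda value
```

Several later sketches reuse X, Y, w, and fit from this block.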
In practice, with finite sample sizes, a sequence, λn, that satisfies the oracle requirements results in a model that includes a large number of false positives (i.e. the set $\hat{A}_n\setminus A$ is large) (Martinez and others, 2010). In this manuscript, our three objectives are the following: (1) to demonstrate, mathematically, that choosing λn to meet the oracle properties will result in a high false positive rate for finite samples; (2) to quantify the probability that a variable selected into the model is a false positive, a probability that can provide confidence that an included variable is independently associated with the outcome; and (3) to show how to identify a sequence of smoothing parameters that controls the number of false positives, instead of achieving the oracle properties.
In order to measure and control the number of false positives, we introduce the concept of the local false discovery rate (lFDR) into the selection of λn (Efron and others, 2001; Efron and Tibshirani, 2002; Benjamini and Hochberg, 1995). Specifically, we define lFDR(λn) to be the probability that a variable added to the model is a false positive when the penalty term is incrementally lowered below λn. Our first goal is to derive the relationship between the lFDR and λn. We then show that lFDR(λn)→1, a value that would be an unusual target in most problems, if λn satisfies the oracle requirements, thus explaining the observation that the adaptive Lasso results in a large number of false positives when the effect sizes are not too large. In more traditional problems, a value of 0.05 is often the targeted FDR or lFDR. Finally, we offer a parametric bootstrap method for selecting λn to achieve a desired lFDR, which is similar to a step described by Hall and others (2009). Others have also noted this high false positive rate and proposed bootstrapped and Bayesian versions of the Lasso for handling this problem (Bach, 2008; Hans, 2010; Park and Casella, 2005).
Our motivating example comes from a Genome-Wide Association Study (GWAS). Both the Lasso (Wu and others, 2009) and the adaptive Lasso (Kooperberg and others, 2010; Sun and others, 2010) have become popular tools for GWAS because variable selection is an important step when hundreds of thousands of single nucleotide polymorphisms (SNPs) are available for testing. In our specific study, we focus on modeling the prostate specific antigen (PSA) level, a biomarker indicative of prostate cancer (Parikh and others, 2010).
The remainder of this paper is organized as follows. In Section 2, we introduce notation and review the adaptive Lasso. We then formalize our definition of the lFDR, derive the relationship between the lFDR and λn, and provide asymptotic theory. Finally, we describe our bootstrap approach for choosing λn. In Section 3 and in supplementary material available at Biostatistics online, we evaluate the behavior of λn when selected by the lFDR through simulation and our motivating example. We conclude with a short discussion in Section 4.
2. Methods
2.1. Notation
We assume that there is a continuous outcome Yi whose true value is defined by

$$Y_i=\sum_{j=1}^{p}\beta_jX_{ji}+\epsilon_i, \tag{2.1}$$

where ϵi∼Normal(0,σ2). Further, we assume $n^{-1}X^TX\rightarrow D$, where D is a positive-definite matrix. Recall that A is the set of covariates that are associated with a non-zero β, A≡{j:βj≠0}, and βA≡{βj:j∈A}. We say that covariate j is influential if j∈A or that it is superfluous if j∉A. Without loss of generality, assume that A={1,…,p0}, let z=1−p0/p, and let D00 be the corresponding p0×p0 submatrix of D.
Let $\hat{\beta}(\lambda_n)$ be the parameter estimates produced by the adaptive Lasso,

$$\hat{\beta}(\lambda_n)=\underset{\beta}{\arg\min}\,\Big\|Y-\sum_{j=1}^{p}X_j\beta_j\Big\|^2+\lambda_n\sum_{j=1}^{p}\hat{w}_j|\beta_j|,$$

where, for our purposes, $\hat{w}_j=1/|\hat{\beta}_{j,\text{OLS}}|$. The sequence λn is the set of smoothing parameters. We let $\hat{A}_n$ be the set of covariates predicted to have a non-zero β, so $\hat{A}_n=\{j:\hat{\beta}_j(\lambda_n)\neq 0\}$.
Finally, we include the notation and definitions for the local FDR and related terms. We denote by Pfp(λn) and Pfn(λn) the probabilities that a variable will be a false positive or a false negative,

$$P_{\text{fp}}(\lambda_n)=P\{j\in\hat{A}_n\mid j\notin A\},\qquad P_{\text{fn}}(\lambda_n)=P\{j\notin\hat{A}_n\mid j\in A\}.$$

We define the lFDR by

$$\text{lFDR}(\lambda_n)=\frac{z\Delta_{\text{fp}}(\lambda_n)}{z\Delta_{\text{fp}}(\lambda_n)+(1-z)\Delta_{\text{fn}}(\lambda_n)}, \tag{2.2}$$

where

$$\Delta_{\text{fp}}(\lambda)=-\frac{\partial P_{\text{fp}}(\lambda)}{\partial\lambda},\qquad \Delta_{\text{fn}}(\lambda)=\frac{\partial P_{\text{fn}}(\lambda)}{\partial\lambda}.$$

By a Taylor series expansion, the expected difference in the number of false positives at λ and λ−δ, pz{Pfp(λ−δ)−Pfp(λ)}, is approximately pzδΔfp(λ). Similarly, the expected differences in the number of false negatives and in the total number of variables included in the model are approximately p(1−z)δΔfn(λ) and pzδΔfp(λ)+p(1−z)δΔfn(λ). Therefore, the lFDR defined by (2.2) is the probability that a variable added to our model will be superfluous, if added when the smoothing parameter is lowered below λn. Our definition of the lFDR differs from that traditionally given for two reasons: (i) we interpret the lFDR from a frequentist point of view and (ii) we focus on the smoothing parameter λn instead of on a test statistic. The traditional definitions of the lFDR and FDR have also been used for purposes of variable selection, usually by including only those variables with a q-value below a given threshold (Storey, 2002). However, such an approach would not be as appropriate for Lasso procedures, which try to avoid this post hoc selection. Note that an equivalent definition for the FDR is available by replacing the Δ(λn) terms with the corresponding P(λn) terms in (2.2).
2.2. Prior results
The adaptive Lasso has many theoretical properties. Here, we build on two previous results. Zou (2006) states the requirements needed for the adaptive Lasso to have the oracle properties.
Theorem 1 —
Suppose that

$$\frac{\lambda_n}{\sqrt{n}}\rightarrow 0\quad\text{and}\quad\lambda_n\rightarrow\infty. \tag{2.3}$$

Then the adaptive Lasso estimates must satisfy the following:

$$\lim_{n\rightarrow\infty}P(\hat{A}_n=A)=1, \tag{2.4}$$

$$\sqrt{n}(\hat{\beta}_A-\beta_A)\rightarrow_d N(0,\sigma^2D_{00}^{-1}). \tag{2.5}$$
If our focus is on variable selection, then a theorem identified by Pötscher and Schneider (2009) proves equally useful.
Theorem 2 —
Let $X^TX=nI$, where I is the identity matrix. Then $\hat{\beta}_j(\lambda_n)=0$ if and only if $2n\hat{\beta}_{j,\text{OLS}}^2\le\lambda_n$.

Because $\hat{\beta}_{j,\text{OLS}}$ is asymptotically normal with mean βj and variance σ²/n, we immediately see

$$P\{\hat{\beta}_j(\lambda_n)=0\}=P\!\left\{\frac{n\hat{\beta}_{j,\text{OLS}}^2}{\sigma^2}\le\frac{\lambda_n}{2\sigma^2}\right\}, \tag{2.6}$$

where $n\hat{\beta}_{j,\text{OLS}}^2/\sigma^2$ follows a non-central χ2 distribution with one degree of freedom and non-centrality parameter $n\beta_j^2/\sigma^2$.
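The exclusion probability in (2.6) is easy to verify numerically. The sketch below checks our reading of Theorem 2 (a zero estimate exactly when $2n\hat{\beta}_{j,\text{OLS}}^2\le\lambda_n$) against the non-central χ2 probability, taking σ2=1 so that the OLS estimate has variance 1/n.

```r
## Monte Carlo check of (2.6): with sigma^2 = 1, n * betahat_OLS^2
## follows a non-central chi^2(1) with non-centrality n * beta_j^2.
set.seed(2)
n <- 500; beta_j <- 0.1; lambda <- 15

b_ols <- rnorm(1e6, mean = beta_j, sd = sqrt(1 / n)) # sampling distribution
mc    <- mean(2 * n * b_ols^2 <= lambda)             # P{betahat_j(lambda) = 0}
thy   <- pchisq(lambda / 2, df = 1, ncp = n * beta_j^2)
c(monte_carlo = mc, theory = thy)                    # the two should agree
```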
2.3. Local false discovery rates
When X is orthogonal, the total number of variables included in the model is monotonically non-decreasing as λn decreases. The lFDR is the proportion of added variables that are expected to be false positives. When X is orthogonal and σ2=1, then

$$\text{lFDR}(\lambda_n)=\frac{z}{z+(1-z)C(\lambda_n)}, \tag{2.7}$$

where C(λ), which can be interpreted as the cost of removing a false positive, is

$$C(\lambda)=\frac{f_{n\beta_j^2}(\lambda/2)}{f_0(\lambda/2)}=\tfrac{1}{2}e^{-n\beta_j^2/2}\left\{e^{\sqrt{n\beta_j^2\lambda/2}}+e^{-\sqrt{n\beta_j^2\lambda/2}}\right\}, \tag{2.8}$$

where $f_t(\cdot)$ is the density for a χ2 variable with one degree of freedom and non-centrality parameter t.
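Because (2.7) and (2.8) involve only central and non-central χ2 densities, the lFDR curve can be traced directly in a few lines. The sketch below encodes our reconstruction of (2.7)-(2.8); dchisq supplies both densities, and all parameter values are illustrative.

```r
## lFDR(lambda) for the orthogonal, sigma^2 = 1 case, per (2.7)-(2.8).
lfdr <- function(lambda, n, beta, z) {
  f0 <- dchisq(lambda / 2, df = 1)                   # null density
  ft <- dchisq(lambda / 2, df = 1, ncp = n * beta^2) # influential density
  z / (z + (1 - z) * ft / f0)                        # C(lambda) = ft / f0
}
curve(lfdr(x, n = 1000, beta = 0.15, z = 0.7),
      from = 1, to = 40, xlab = "lambda", ylab = "lFDR")
```

As the surrounding text notes, the curve is decreasing in λ: variables entering at small λ are more likely to be false positives.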
Equations (2.7) and (2.8) allow us to choose λn to achieve a specific lFDR. For example, if, in addition to σ2=1 and X being orthogonal, all βj=β, then the lFDR will never exceed q if

$$\tfrac{1}{2}e^{-n\beta^2/2}\left\{e^{\sqrt{n\beta^2\lambda_n/2}}+e^{-\sqrt{n\beta^2\lambda_n/2}}\right\}\ge\frac{z(1-q)}{(1-z)q}. \tag{2.9}$$

The sequence λn, when defined by (2.9), is independent of the number of variables p. Moreover, all properties discussed hold regardless of the size of β (e.g. β is constant or decreasing at a rate of $1/\sqrt{n}$). Therefore, although there is no λn that can attain the oracle property when β is decreasing at a rate of $1/\sqrt{n}$ (Pötscher and Schneider, 2009), the sequence defined by (2.9) would still attain the stated lFDR. As expected, we note that the lFDR decreases with increasing λn, confirming that those variables added when λn is small are more likely to be false positives. We define λqn to satisfy lFDR(λqn)=q.
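Since the lFDR above is strictly decreasing in λ, λqn can be found by one-dimensional root finding rather than by inverting (2.9) analytically; a short sketch, reusing the lfdr() function from the previous block:

```r
## Solve lFDR(lambda) = q for lambda, i.e. compute lambda_n^q.
## The upper bracket is kept moderate so the chi^2 densities do not
## underflow to zero.
lambda_q <- function(q, n, beta, z) {
  uniroot(function(l) lfdr(l, n, beta, z) - q,
          lower = 1e-6, upper = 1e3)$root
}
lambda_q(0.5, n = 1000, beta = 0.15, z = 0.7)   # lambda with lFDR = 0.5
```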
2.4. Constant β
The term $e^{-\sqrt{n\beta_j^2\lambda/2}}$ in (2.8) can be ignored when $\sqrt{n\beta_j^2\lambda/2}$ is large. Specifically, when $\sqrt{n\beta_j^2\lambda_n/2}>2.3$ ∀j, the lFDR at a given value of λn can be approximated within 1% of its true value by

$$\text{lFDR}(\lambda_n)\approx\frac{z}{z+(1-z)\tfrac{1}{2}\exp\!\left(\sqrt{n\beta^2\lambda_n/2}-n\beta^2/2\right)}. \tag{2.10}$$

Equation (2.10) shows more clearly that if we choose λn to achieve the oracle property (i.e. λn/√n→0), then we are choosing a λn that results in an lFDR(λn)→1. As an lFDR=1 implies that all variables being added to the model are false positives, purposely choosing such a λn would seem counterintuitive. Therefore, even when λn can be chosen to achieve the oracle properties, it is unclear whether such a choice is desirable. An alternative approach would be to choose λn to ensure that lFDR<q. In the previous example, where σ2=1, X is orthogonal, and βj=β, we now see lFDR<q if

$$\lambda_n>\frac{n\beta^2}{2}+2\log\left\{\frac{2z(1-q)}{(1-z)q}\right\}. \tag{2.11}$$

Purposely choosing a λn such that the lFDR→0 seems equally counterintuitive, limiting the reasonable choices for λn. If σ2=1, X is orthogonal, and βj=β, where β is a constant, we see that for the lFDR not to diverge to 0 or 1, λn must equal 0.5nβ2+O(1).
Lemma 1 —
When βj=β ∀j∈A, β is constant, σ2=1, X is orthogonal, and t=0.5β2, then

$$\text{lFDR}(\lambda_n)\rightarrow 1\quad\text{if }\lambda_n-tn\rightarrow-\infty, \tag{2.12}$$

$$\text{lFDR}(\lambda_n)\rightarrow 0\quad\text{if }\lambda_n-tn\rightarrow\infty, \tag{2.13}$$

$$\text{lFDR}(\lambda_n)\rightarrow q_c\quad\text{if }\lambda_n-tn\rightarrow c, \tag{2.14}$$

where $q_c=z/\{z+(1-z)\tfrac{1}{2}e^{c/2}\}$.

If λn were chosen to achieve an lFDR strictly between 0 and 1, then only the first of the two oracle properties, (2.4), holds. However, we claim that forgoing the second oracle property, in exchange for an lFDR between 0 and 1, is no loss. Although performing variable selection and fitting in a single step is convenient, it is unnecessary. Clearly, there is a two-step method that recovers the second oracle property. After using the adaptive Lasso with λqn for variable selection, we can refit the model using OLS with only that subset of variables. This two-step procedure not only satisfies both oracle properties, but offers improved efficiency over the single-step procedure, reminding us that the oracle procedure is not an optimal procedure. Although an oracle procedure promises that $P\{\hat{\beta}_j(\lambda_n)=0\}\rightarrow 1$ for all superfluous variables, it makes no claim as to the rate at which this occurs. Asymptotically, we can increase the rate at which $P\{\hat{\beta}_j(\lambda_n)=0\}\rightarrow 1$ for j∉A without decreasing the rate at which $P\{\hat{\beta}_j(\lambda_n)\neq 0\}\rightarrow 1$ for j∈A. Returning to (2.6), this potential improvement is clear because, asymptotically, $P\{\hat{\beta}_j(\lambda_n)\neq 0\}$ for j∈A is unchanged by λn so long as λn=o(n).
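The two-step procedure just described is short in code as well; a sketch continuing the earlier glmnet-based block (lambda_sel is a hypothetical selection-stage smoothing parameter, not a value from the paper):

```r
## Two-step procedure: adaptive-Lasso selection, then OLS refitting.
lambda_sel <- 0.05
b_al  <- coef(fit, s = lambda_sel)[-1]   # drop the intercept
A_hat <- which(b_al != 0)                # selected variables

## Refitting by OLS on the selected subset restores the second oracle
## property whenever the selection step is consistent.
refit <- lm(Y ~ X[, A_hat, drop = FALSE] - 1)
coef(refit)
```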
2.5. Empirical choice of λn
In the idealized scenario, where X is orthogonal, βj=β ∀j∈A, and both z and β are known, (2.9) can be used to choose a sequence λn to achieve a specified lFDR. If all values of {βj:j∈A} are not identical, then the solution to (2.8) would need to be obtained numerically. Although β and z are unknown in practice, we could use an estimate of z and either an estimate of β or a lower bound for a biologically meaningful β. However, when (2.9) is evaluated with these estimates, the chosen λn tends to produce an lFDR above the desired value when X is not orthogonal. Therefore, we prefer a bootstrap approach similar to one of the steps discussed by Hall and others (2009). The algorithm is as follows. First, fit a simple model of Y on X to obtain estimates of β: in practice, as done in our simulations, we suggest identifying the non-zero β by the adaptive Lasso with λdn and then defining $\tilde{\beta}$ by the OLS estimates for the selected variables. Denote the variance of the residuals from this model by $\tilde{\sigma}^2$. Next, set all components of $\tilde{\beta}$ that fall below some small threshold equal to zero (the choice of threshold when p>n is discussed in Section 2.6). Then generate B sets of data, assuming the true model is $Y=\sum_j X_j\tilde{\beta}_j+\epsilon$, where $\epsilon\sim\text{Normal}(0,\tilde{\sigma}^2)$. For each value of λn in a given set, we calculate the number of true, $T^b(\lambda_n)$, and false, $F^b(\lambda_n)$, positives added to the model between λn−δ and λn+δ, where δ is an appropriately small number and the superscript b denotes the dataset. We can then estimate the lFDR for each λn by

$$\text{lFDR}_{\text{est}}(\lambda_n)=\frac{\sum_{b=1}^{B}F^b(\lambda_n)}{\sum_{b=1}^{B}\{T^b(\lambda_n)+F^b(\lambda_n)\}} \tag{2.15}$$

and select the smoothing parameter that achieves a specified lFDR, q:

$$\hat{\lambda}_n^q=\min\{\lambda_n:\text{lFDR}_{\text{est}}(\lambda_n)\le q\}.$$

For completeness, we define $\text{lFDR}_{\text{est}}(\lambda_n)=0$ when no variables are added between λn−δ and λn+δ, that is, when the denominator of (2.15) is 0. In practice, we set B=10, but we base our estimates on a monotonically smoothed version of lFDRest(λn).
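A condensed sketch of the bootstrap estimator (2.15) appears below. It reuses the glmnet-based adaptive Lasso from the earlier sketches, takes the thresholded $\tilde{\beta}$ and $\tilde{\sigma}$ as given, and is meant to convey the counting logic rather than reproduce the authors' exact implementation.

```r
## Bootstrap estimate of lFDR(lambda), as in (2.15), over a lambda grid.
## beta_tilde: initial estimates with small components set to zero;
## sigma_tilde: residual standard deviation from the initial fit.
lfdr_boot <- function(X, beta_tilde, sigma_tilde, lambdas, B = 10) {
  n <- nrow(X)
  grid <- sort(lambdas, decreasing = TRUE)      # glmnet wants decreasing
  infl <- beta_tilde != 0                       # "truth" within the bootstrap
  FP <- TP <- matrix(0, B, length(grid))
  for (b in 1:B) {
    Yb <- drop(X %*% beta_tilde + rnorm(n, sd = sigma_tilde))
    wb <- 1 / abs(coef(lm(Yb ~ X - 1)))         # adaptive weights (n > p)
    fb <- glmnet(X, Yb, penalty.factor = wb, lambda = grid)
    nz <- as.matrix(coef(fb)[-1, ] != 0)        # p x length(grid) inclusions
    ## variables newly added as lambda steps down to each grid point
    added <- nz & !cbind(FALSE, nz[, -ncol(nz), drop = FALSE])
    FP[b, ] <- colSums(added[!infl, , drop = FALSE])
    TP[b, ] <- colSums(added[ infl, , drop = FALSE])
  }
  ## (2.15), with 0/0 defined as 0; entries follow the decreasing grid
  colSums(FP) / pmax(colSums(FP + TP), 1)
}
```

The smallest grid value whose (monotonically smoothed) estimate falls at or below q is then taken as the estimate of λqn.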
For purposes of comparison, we consider the standard method for selecting λn to be cross-validation aimed at minimizing the prediction error of future estimates. Recall that standard 10-fold cross-validation starts by dividing the set Sn of n subjects into 10 mutually exclusive sets, s1∪s2∪⋯∪s10=Sn, of roughly equal size. Let $\hat{\beta}_j^{(-k)}(\lambda_n)$, 1≤k≤10, be the adaptive Lasso estimate for βj based on those subjects not in sk. Then

$$\hat{\lambda}_n^{\text{cv}}=\underset{\lambda}{\arg\min}\sum_{k=1}^{10}\sum_{i\in s_k}\Big(Y_i-\sum_{j=1}^{p}X_{ji}\hat{\beta}_j^{(-k)}(\lambda)\Big)^2.$$

Also, $\hat{\lambda}_n^{\text{cv}}$ is an estimate of the deviance-optimized smoothing parameter:

$$\lambda_n^{d}=\underset{\lambda}{\arg\min}\,E\Big(Y_0-\sum_{j=1}^{p}X_{j0}\hat{\beta}_j(\lambda;T)\Big)^2,$$

where T are the data input into the adaptive Lasso to obtain the estimates $\hat{\beta}_j(\lambda;T)$, (Y0,X0) are the data from a new individual, and the expectation is taken over both T and (Y0,X0). When β is fixed and X is orthogonal, the smoothing parameters minimizing the deviance must satisfy the oracle properties.
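In code, this comparison method is a single call; a sketch assuming glmnet's built-in 10-fold cross-validation, with the same adaptive weights as before:

```r
## Deviance-oriented choice of lambda via 10-fold cross-validation;
## lambda.min plays the role of the deviance-optimized parameter.
cvfit <- cv.glmnet(X, Y, penalty.factor = w, nfolds = 10)
cvfit$lambda.min
```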
2.6. High-dimensional adaptive Lasso: p > n
As defined in (1.1), the weights in the adaptive Lasso are $1/|\hat{\beta}_{j,\text{OLS}}|$. However, when p>n, the weights must substitute a different estimate of β in place of $\hat{\beta}_{j,\text{OLS}}$. Two possible substitutes that have been studied are $\hat{\beta}_j^{\text{marg}}$, the estimates obtained by fitting separate models for each variable (Huang and others, 2008), and $\hat{\beta}_j^{\text{lasso}}$, the estimates from a regular Lasso procedure (Zhou and others, 2009). The properties of the weights $1/|\hat{\beta}_j^{\text{lasso}}|$ have been studied and demonstrated to have useful qualities (Zhou and others, 2009). In practice, however, we found that the weights $1/|\hat{\beta}_j^{\text{marg}}|$ performed better, and we chose to use those weights in our simulations. For defining $\tilde{\beta}$, we cannot use the same cutoff threshold as in the n>p setting. Instead, we first perform the adaptive Lasso on our data and count the number of coefficients estimated to be non-zero. We then find the threshold such that, after setting all components of $\tilde{\beta}$ below that threshold to 0 and simulating data, the adaptive Lasso on the simulated data estimates a similar number of non-zero coefficients.
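A sketch of the p>n weighting just described, with marginal single-variable fits in place of OLS (the object names are ours):

```r
## Marginal estimates for the adaptive-Lasso weights when p > n:
## regress Y on each column of X separately.
b_marg <- apply(X, 2, function(x) coef(lm(Y ~ x))[2])
w_marg <- 1 / abs(b_marg)
fit_hd <- glmnet(X, Y, penalty.factor = w_marg)
```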
3. Results
3.1. Simulation design: comparing λqn and λdn
Our first goal is to offer an example comparing the magnitude and performance of λdn and λqn. As with all simulations here, our objective is not to describe the performance of the estimates $\hat{\beta}(\lambda_n^d)$ and $\hat{\beta}(\lambda_n^q)$, but to calculate, describe, and compare the true values of λdn and λqn. We assume that the covariate matrix X is orthogonal and that the outcome Y can be described by linear regression, (2.1), with βj=0.15 if j∈A and σ2=1. For these examples, we fixed the number of covariates p=50, but let the size of A vary, z∈{0.5,0.7,0.9}. As described below, we used simulation to calculate λdn and λqn, their corresponding lFDR, and the proportion of variables that were misclassified, errMC, for a sequence of sample sizes between n=200 and n=2000.
Our second goal is to show that results are essentially unchanged when we vary p. For efficiency, we calculated λdn, λqn, lFDR, and errMC at only n=1000 for p∈{100,200,500}, maintaining all of the other assumptions.
Our third goal is to examine whether λqn, calculated assuming that X is orthogonal, was appropriate when there was dependence. Specifically, we repeated the abbreviated analyses assuming that the covariance structure of (Xi1,Xi2,…,Xip) is block diagonal. Correlation ρ within a block was constant, ρ∈{0.3,0.6}, each block contained the same number of influential variables (or possibly no influential variables if there were more blocks than influential variables), and each block contained the same number of total variables. Variables were divided into 2, 5, or 10 groups.
For any combination of n, p, z, and covariance structure, we estimate the values of λdn, λqn, lFDR, and errMC by simulating 200 000 values of X and Y. For each simulation, at a specified set of λ ranging from 0.01 to 100, we calculate the residual deviance and errMC. Furthermore, for the same set of λ, we count the number of true and false positives added to the model when the smoothing parameter was between λ−δ and λ+δ. Then, for each λ, we average the number of true and false positives added, the deviance, and errMC over all 200 000 datasets to obtain estimates of each desired value. The lFDR was estimated by the ratio of the average number of false positives added to the total number of variables added to the model. For each combination of n, p, z, and covariance structure discussed, we generated a new set of 200 000 simulations. To generate a dataset X, we assumed that {Xi1,…,Xip} followed a normal distribution with mean 0 and the specified covariance matrix. When X was assumed orthogonal, we used the resulting principal components. All datasets were standardized, so each variable had mean 0 and variance 1.
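For reference, the block-diagonal designs above can be generated with a standard multivariate normal draw; a sketch using MASS::mvrnorm (the function and argument names here are ours):

```r
## X with block-diagonal correlation: nblock equal blocks, constant
## correlation rho within a block, independence across blocks.
library(MASS)

make_X <- function(n, p, nblock, rho) {
  bs    <- p / nblock                               # block size
  Sb    <- matrix(rho, bs, bs); diag(Sb) <- 1       # within-block correlation
  Sigma <- kronecker(diag(nblock), Sb)              # p x p block-diagonal
  scale(mvrnorm(n, mu = rep(0, p), Sigma = Sigma))  # standardized columns
}
X_corr <- make_X(n = 1000, p = 50, nblock = 5, rho = 0.3)
```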
3.2. Simulation results: comparing λqn and λdn
First, consider the example when X is orthogonal, p=50, and z=0.7. Figure 1(a) shows that λqn increases linearly with the number of subjects and that the slope is approximately 0.5β2, as (2.11) suggests, except when nβ2 is small. The same equation promises that the lFDR-selected λn's, for values q1 and q2, differ by approximately $2\log[q_2(1-q_1)/\{q_1(1-q_2)\}]$. As the deviance-optimized λn's achieve the oracle properties when X is orthogonal, they must increase at a rate slower than $\sqrt{n}$, and therefore the representative black line in Figure 1(a) is significantly below those illustrating the smoothing parameters chosen to achieve the specified values of the lFDR.
Fig. 1.
(a) The sequence of λn chosen to minimize the deviance (solid black line) or chosen to achieve a specified lFDR (broken lines) increases with the number of subjects in the study. For these simulations, z=0.7, p=50, and β=0.15 for associated variables. (b) The average proportion of misclassified variables (error, y-axis), that is, the number of false positives plus false negatives relative to p, quickly drops to 0 when λn is chosen to achieve a specified lFDR, but remains above 0 for deviance-optimized smoothing parameters.
The advantage of choosing a sequence λn that increases linearly with the number of subjects is that the proportion of misclassified variables converges to 0 much more quickly. Figure 1(b) shows that when there are 1000 individuals in the study and 35 out of 50 of the SNPs are superfluous, on average, 12% of the variables are misclassified with the deviance-optimized parameters, whereas less than 2.1% are misclassified when using lFDR-selected parameters. The relationship between the lFDR and the percentage misclassified is not monotone, as it depends on z. Here, setting q=0.5 minimized the proportion misclassified. Figure 2 shows that when λn minimizes deviance, the cost of reducing false positives is very low, or equivalently, the lFDR is high, so there is great benefit in increasing λn. In terms of identifying A exactly, with 1000 individuals, the probability that there is at least one misclassified variable, $P(\hat{A}_n\neq A)$, exceeds 0.999 when using deviance-optimized smoothing parameters, whereas that probability is less than 0.64 when using lFDR-selected parameters.
Fig. 2.
(a) The solid black line shows the probability, $P\{\hat{\beta}_j(\lambda)=0\mid j\notin A\}$, that a null variable is excluded from $\hat{A}_n$ when X is orthogonal. The top curved dashed line shows the probability, $P\{\hat{\beta}_j(\lambda)\neq 0\mid j\in A\}$, that a non-null variable is included in $\hat{A}_n$ when X is orthogonal, z=0.7, and n=1000. The vertical dashed line farthest to the right indicates λdn. The other pairs of broken lines show the equivalent values when n=500 and n=800. (b) The local FDR, lFDR, is illustrated as a function of λ for the three scenarios above.
Table 1 shows that the large difference between λdn and λqn remains for p>50, and, in fact, both λdn and λqn appear to be essentially independent of p when X is orthogonal. When the covariates are correlated, compared with when they are independent, λqn tends to be larger, as more stringent penalty terms are needed to exclude null variables that are correlated with influential variables. Increasing ρ or block size magnifies this effect. Therefore, in practice, we suggest choosing λqn by the bootstrapping method described in Section 2.5. Table 1 also demonstrates the obvious result that as the proportion z of null variables increases, λqn must also increase.
Table 1.
The smoothing parameters designed to achieve lFDR=0.5 are larger than those designed to minimize deviance
| p | z | λdn | λqn (indep.) | λqn (ρ=0.3, 10 blocks) | λqn (ρ=0.3, 5 blocks) | λqn (ρ=0.3, 2 blocks) | λqn (ρ=0.6, 10 blocks) | λqn (ρ=0.6, 5 blocks) | λqn (ρ=0.6, 2 blocks) |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.9 | 6.31 | 18.52 | 18.72 | 20.32 | 27.33 | 22.32 | 27.63 | 30.23 |
| 100 | 0.7 | 3.7 | 15.02 | 16.22 | 18.32 | 18.32 | 14.92 | 17.12 | 15.22 |
| 200 | 0.9 | 6.21 | 17.92 | 20.72 | 23.63 | 28.33 | 26.23 | 28.13 | 39.44 |
| 200 | 0.7 | 3.7 | 14.82 | 17.02 | 18.72 | 19.82 | 15.02 | 14.92 | 15.42 |
| 500 | 0.9 | 6.21 | 18.12 | 28.73 | 34.84 | 45.15 | 38.44 | 50.56 | 51.66 |
| 500 | 0.7 | 3.7 | 14.22 | 18.62 | 18.92 | 19.52 | 15.42 | 15.42 | 16.02 |
3.3. Simulation design: evaluating the performance of adaptive Lasso with λqn
Our next goal is to evaluate the performance of the adaptive Lasso when using λqn, estimated by the bootstrap approach described in Sections 2.5 and 2.6. This method selects a set of variables that should satisfy the specified lFDR criterion. For comparison, we consider a more traditional method for selecting variables targeting the same criterion. This method, implemented in the R package fdrtool (Strimmer, 2008), inputs the p-values calculated from models including each variable individually. In brief, the method decomposes the overall distribution of p-values into two distributions, representing the p-values from the null and influential variables. Given these two distributions, the traditional method first estimates the p-value threshold that would result in the specified lFDR and then selects all variables meeting that threshold.
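In code, this traditional comparator is a per-variable p-value scan followed by a call to fdrtool; the sketch below is our paraphrase of that pipeline, with the 0.5 threshold standing in for whichever target q is being evaluated.

```r
## Traditional approach: single-variable p-values, then local FDR
## estimates from the fdrtool package (Strimmer, 2008).
library(fdrtool)

pvals <- apply(X, 2, function(x) summary(lm(Y ~ x))$coefficients[2, 4])
out   <- fdrtool(pvals, statistic = "pvalue", plot = FALSE)
which(out$lfdr <= 0.5)   # variables meeting the targeted lFDR
```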
We consider the lFDR, lFDRAL, resulting from using the bootstrap version of the adaptive Lasso and the rates lFDRTR resulting from the traditional method. We compare these observed rates to the targeted values: q∈{0.1,0.5,0.9}. These comparisons are performed in two types of datasets. When n>p, settings are similar to those in Section 3.1: n=1000, p=500, z=0.9, βj=0.1 if j∈A, and σ2=1. In order for the traditional methods to produce rates below 1, we reduce the number of correlated variables per block to 5. Again, ρ∈{0.0,0.3,0.6}. When p>n, specifically n=1000 and p=5000, we increase z to 0.96 and include 10 variables per correlated block. To achieve q=0.1 when p>n, we further increase z to 0.99 and βj to 0.35. We provide an extended set of simulations, exploring other correlation structures and effect distributions, in supplementary material available at Biostatistics online.
For each combination of parameters, we generated 1000 datasets and then averaged the resulting lFDRAL and lFDRTR across all 1000 datasets. For each dataset, we defined the lFDR to be 0 if the last variable selected was influential, 1 otherwise.
3.4. Simulation results: evaluating the performance of adaptive Lasso with λqn
The bootstrap approach proposed in Sections 2.5 and 2.6 selected values of λqn that, when applied to the full dataset, resulted in lFDR values similar to the targeted value. In the example where n>p and ρ=0, the observed lFDR was 0.06, 0.48, and 0.89 when λqn was chosen to achieve lFDR=0.1, 0.5, and 0.9. When targeting lFDR=0.1, our method achieved a lower lFDR than targeted, and therefore our chosen λq was larger than desired. This inflated λq arises, in part, from a tendency to select too few non-zero coefficients for $\tilde{\beta}$ in our bootstrap models. Table 2 and results in supplementary material available at Biostatistics online show that the lFDR estimates were only minimally altered by changing the correlation structure or when considering the p>n scenario.
Table 2.
A comparison between the newly proposed bootstrap (B) method for obtaining a specified lFDR with the traditional (T) approach
| Target | B (ρ=0.0) | T (ρ=0.0) | B (ρ=0.3) | T (ρ=0.3) | B (ρ=0.6) | T (ρ=0.6) |
|---|---|---|---|---|---|---|
| n>p | | | | | | |
| 0.1 | 0.059 | 0.088 | 0.083 | 0.232 | 0.101 | 0.688 |
| 0.5 | 0.477 | 0.435 | 0.505 | 0.737 | 0.52 | 0.919 |
| 0.9 | 0.893 | 0.804 | 0.905 | 0.94 | 0.909 | 0.967 |
| p>n | | | | | | |
| 0.1 | 0.008 | 0.075 | 0.019 | 0.298 | 0.105 | 0.904 |
| 0.5 | 0.466 | 0.464 | 0.592 | 0.679 | 0.639 | 0.905 |
| 0.9 | 0.751 | 0.864 | 0.861 | 0.953 | 0.894 | 0.978 |
The traditional approach, based on estimating the p-value distributions of the null and influential variables, performed poorly when there was high correlation between variables (Table 2). When correlation was high, models including only a single variable assigned low p-values to those null variables that were correlated with influential variables. This resulted in more variables meeting the lFDR threshold, but a higher proportion of them were false positives. With n>p, ρ=0.6, and a targeted lFDR of 0.5, the observed lFDR was 0.9.
3.5. Application
In the United States, prostate cancer is the most commonly diagnosed non-cutaneous cancer in men, with approximately 200 000 new diagnoses each year. Because levels of PSA are elevated in the presence of prostate cancer, it is commonly used as a biomarker for early detection. Unfortunately, the specificity of tests based on PSA is often very low, as many healthy individuals also have high levels. Specificity could be greatly improved by a method that can identify individuals with naturally high levels. To this end, there have been large GWASs searching for genetic markers associated with the PSA level. The Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial, or PLCO, which recorded PSA levels, genotyped 2200 healthy men using an Illumina genotyping platform containing more than 500 000 SNPs (Andriole and others, 2009). We focus on a subset of 530 SNPs in and around the KLK3 gene (Parikh and others, 2010).
Let X be the 2200×530 matrix containing the genotypes for the study population at these 530 SNPs. Genotypes are coded as 0, 1, or 2, indicating the number of minor alleles at that SNP. Let Y be the log-transformed PSA levels. We then regressed Y on X using a linear model with the adaptive Lasso procedure. We repeated this analysis with two values of λ: $\hat{\lambda}^d$ and $\hat{\lambda}^q$, where $\hat{\lambda}^q$ was estimated by the previously defined bootstrap procedure with q=0.5. Unfortunately, the truth is unknown, and therefore we can only use this example to illustrate their relative performance. As expected, the estimated $\hat{\lambda}^d$ was significantly smaller than the value estimated to achieve lFDR=0.5. As a consequence, $\hat{\lambda}^d$ allowed 17 SNPs to have non-zero coefficients, whereas $\hat{\lambda}^q$ allowed only 1 SNP (Table S2 of supplementary material available at Biostatistics online). Although we cannot be certain that either model is correct, it seems doubtful that 17 SNPs in that region are directly associated with PSA levels. To estimate the β corresponding to rs2735839 using the two-stage approach, first selecting variables with $\hat{\lambda}^q$ and then estimating β using OLS, we calculate the OLS estimate from a model containing only rs2735839.
4. Discussion
The adaptive Lasso has become a popular model-building procedure because it shrinks a subset of coefficients to zero, thereby simultaneously performing variable selection and simplifying model interpretation. Although, asymptotically, using the traditional smoothing parameters promises that the adaptive Lasso will achieve consistent variable selection, their use often leaves a large number of false positives in the model when sample size is finite.
The lFDR is usually a form of post-processing, in that we would first perform a statistical procedure to attach a p-value to an estimate of each parameter and then determine the probability that the true value of a parameter with that p-value is the null value. We have adapted the lFDR framework to select smoothing parameters in the adaptive Lasso. Instead of defining an lFDR for a specific p-value, we define it for a specific value of the tuning constant λ. The framework offers an alternative means for selecting the smoothing parameters. When chosen to achieve a specified value of the lFDR, the adaptive Lasso procedure promises both asymptotically consistent variable selection and better control of the false positive rate for finite samples.
By itself, a single-step, adaptive Lasso procedure using λq, the lFDR-selected smoothing parameter, does not achieve the oracle properties. If one believed that the optimal, or best, estimator had to have these properties, then a combined variable selection and model fitting procedure with λq would not be a viable option. However, we do not consider the absence of the second oracle property to be a deterrent to using λq. First, the oracle properties can be regained by a two-step procedure that adds a separate model fitting step, where OLS is applied only to those variables retained by the initial adaptive Lasso. Although the convenience of a one-step procedure is sacrificed, the final estimate would still have the stated properties. Second, the first oracle property, consistent variable selection, is not a statement of optimality. That property makes no claims on the rate at which $P(\hat{A}_n=A)\rightarrow 1$. In this sense, the rate of our two-step procedure is faster than the rate of the single-step procedure. Therefore, there is a benefit to our method, even if it cannot be measured by a characteristic as coarse as the oracle properties.
We chose to select the smoothing parameters to achieve a desired lFDR, instead of FDR, because we wanted to judge each variable on its own merits, and not the merits of all selected variables. As discussed previously (Efron and others, 2001; Efron and Tibshirani, 2002), if one fit a model with 1000 variables and aimed to achieve an FDR of 0.1, then if the first 90 variables selected were guaranteed to be non-null, the next 10 would be included regardless of the evidence. Note also that in addition to providing examples where lFDR=0.1, we offered examples with an lFDR as high as 0.9, a larger value than is generally used. For the adaptive Lasso, where standard practice has effectively been to choose lFDR=1 and there is often a desire not to omit any non-null variables, aiming for these larger lFDR values may be preferred.
5. Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
6. Funding
Sampson's and Chatterjee's research was supported by the Intramural Research Program of the NCI. Chatterjee's research was supported by a gene-environment initiative grant from the NHLBI (RO1-HL091172-01). Müller's research was supported by a grant from the Australian Research Council (DP110101998). Carroll's research was supported by a grant from the National Cancer Institute (R37-CA057030). Carroll was also supported by Award Number KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST).
Acknowledgements
This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the NIH, Bethesda, Md. (http://biowulf.nih.gov). Conflict of Interest: None declared.
References
- Andriole G. L., Grubb R. L., Buys S. S., Chia D., Church T. R., Fouad M. N., Gelmann E. P., Kvale P. A., Reding D. J., Weissfeld J. L. and others. Mortality results from a randomized prostate-cancer screening trial. New England Journal of Medicine. 2009;360:1310–1319.
- Bach F. R. Bolasso: model consistent lasso estimation through the bootstrap. Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML). Helsinki, Finland; 2008.
- Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300.
- Cai T., Sun W. Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association. 2007;102:901–912.
- Efron B., Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology. 2002;23:70–86.
- Efron B., Tibshirani R., Storey J. D., Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96:1151–1160.
- Fan J., Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Hall P., Lee E. R., Park B. Bootstrap-based penalty choice for the lasso achieving oracle performance. Statistica Sinica. 2009;19:449–471.
- Hans C. Model uncertainty and variable selection in Bayesian Lasso regression. Statistics and Computing. 2010;20:221–229.
- Huang J., Ma S., Zhang C.-H. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
- Kooperberg C., LeBlanc M., Obenchain V. Risk prediction using genome-wide association studies. Genetic Epidemiology. 2010;34:643–652.
- Martinez J. G., Carroll R. J., Muller S., Sampson J. N., Chatterjee N. A note on the effect on power of score tests via dimension reduction by penalized regression under the null. The International Journal of Biostatistics. 2010;6:Article 12.
- Parikh H., Deng Z., Yeager M., Boland J., Matthews C., Jia J., Collins I., White A., Burdett L., Hutchinson A. and others. A comprehensive resequence analysis of the KLK15-KLK3-KLK2 locus on chromosome 19q13.33. Human Genetics. 2010;127:91–99.
- Park T., Casella G. The Bayesian Lasso. Technical Report. 2005.
- Pötscher B. M., Schneider U. On the distribution of the adaptive lasso estimator. Journal of Statistical Planning and Inference. 2009;139:2775–2790.
- Storey J. D. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B. 2002;64:479–498.
- Strimmer K. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics. 2008;24:1461–1462.
- Sun W., Ibrahim J. G., Zou F. Genomewide multiple-loci mapping in experimental crosses by iterative adaptive penalized regression. Genetics. 2010;185:349–359.
- Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tibshirani R. Regression shrinkage and selection via the Lasso: a retrospective. Journal of the Royal Statistical Society, Series B. 2011;73:273–282.
- Wu T. T., Chen Y. F., Hastie T., Sobel E., Lange K. Genome-wide association analysis by Lasso penalized logistic regression. Bioinformatics. 2009;25:714–721.
- Zhou S., van de Geer S., Buhlmann P. Adaptive lasso for high dimensional regression and Gaussian graphical modeling. 2009. arXiv:0903.2515.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.