Published in final edited form as: J Am Stat Assoc. 2013 Jul 25;109(507):991–1007. doi: 10.1080/01621459.2013.823775

Estimation and Accuracy after Model Selection

Bradley Efron 1,*

Abstract

Classical statistical theory ignores model selection in assessing estimation accuracy. Here we consider bootstrap methods for computing standard errors and confidence intervals that take model selection into account. The methodology involves bagging, also known as bootstrap smoothing, to tame the erratic discontinuities of selection-based estimators. A useful new formula for the accuracy of bagging then provides standard errors for the smoothed estimators. Two examples, nonparametric and parametric, are carried through in detail: a regression model where the choice of degree (linear, quadratic, cubic, …) is determined by the Cp criterion, and a Lasso-based estimation problem.

Keywords: model averaging, Cp, Lasso, bagging, bootstrap smoothing, ABC intervals, importance sampling

1 Introduction

Accuracy assessments of statistical estimators customarily are made ignoring model selection. A preliminary look at the data might, for example, suggest a cubic regression model, after which the fitted curve’s accuracy is computed as if “cubic” were pre-chosen. Here we will discuss bootstrap standard errors and approximate confidence intervals that take into account the model-selection procedure.

Figure 1 concerns the Cholesterol data, an example investigated in more detail in Section 2: n = 164 men took cholestyramine, a proposed cholesterol-lowering drug, for an average of seven years each; the response variable was the decrease in blood-level cholesterol measured from the beginning to the end of the trial,

Figure 1. Cholesterol data.

Cholesterol decrease plotted versus adjusted compliance for 164 men in the Treatment arm of the cholestyramine study (Efron and Feldman, 1991). Solid curve is the OLS cubic regression, as selected by the Cp criterion. How accurate is the curve, taking account of model selection as well as least squares fitting? (Solid arrowed point is Subject 1, featured in subsequent calculations. Bottom numbers indicate compliance for the 11 subjects in the simulation trial of Figure 5.)

d=cholesterol decrease; (1.1)

also measured (by pill counts) was compliance, the proportion of the intended dose taken,

c=compliance, (1.2)

ranging from zero to full compliance for the 164 men. A transformation of the observed proportions has been made here so that the 164 c values approximate a standard normal distribution,

$c \,\dot\sim\, N(0, 1)$. (1.3)

The solid curve is a regression estimate of decrease d as a cubic function of compliance c, fit by ordinary least squares (OLS) to the 164 points. “Cubic” was selected by the Cp criterion, Mallows (1973), as described in Section 2. The question of interest for us is how accurate is the fitted curve, taking account of the Cp model-selection procedure as well as OLS estimation?

More specifically, let μj be the expectation of cholesterol decrease for subject j given his compliance cj,

$\mu_j = E\{d_j \mid c_j\}$. (1.4)

We wish to assign standard errors to estimates of $\mu_j$ read from the regression curve in Figure 1. A nonparametric bootstrap estimate $\widetilde{sd}_j$ of standard deviation, taking account of model selection, is developed in Sections 2 and 3. Figure 2 shows that this is usually, but not always, greater than the naive estimate $\overline{sd}_j$ obtained from standard OLS calculations, assuming that the cubic model was pre-selected. The ratio $\widetilde{sd}_j/\overline{sd}_j$ has median value 1.52; so at least in this case, ignoring model selection can be deceptively optimistic.

Figure 2.

Solid points: ratio of standard deviations, taking account of model selection or not, for the 164 values $\hat\mu_j$ from the regression curve in Figure 1. Median ratio equals 1.52. Standard deviations including model selection are the smoothed bootstrap estimates $\widetilde{sd}_B$ of Section 3. Dashed line: ratio of $\widetilde{sd}_B$ to $\widehat{sd}_B$, the unsmoothed bootstrap sd estimates as in (2.4), median 0.91.

Data-based model selection can produce “jumpy” estimates that change values discontinuously at the boundaries between model regimes. Bagging (Breiman, 1996), or bootstrap smoothing, is a model-averaging device that both reduces variability and eliminates discontinuities. This is described in Section 2, and illustrated on the Cholesterol data.

Our key result is a new formula for the delta-method standard deviation of a bagged estimator. The result, which applies to general bagging situations and not just regression problems, is described in Section 3. Stated in projection terms (see Figure 4), it provides the statistician a direct assessment of the cost in reduced accuracy due to model selection.

Figure 4.

Illustration of Corollary 1. The ratio $\widetilde{sd}_B/\widehat{sd}_B$ is the cosine of the angle between $t^* - s_0 1$ (3.9) and the linear space $\mathcal{L}(Y^*)$ spanned by the centered bootstrap counts (3.2). Model-selection estimators tend to be more nonlinear, yielding smaller ratios, i.e., greater gains from smoothing.

A parametric bootstrap version of the smoothing theory is described in Sections 4 and 5. Parametric modeling allows more refined results, permitting second-order accurate confidence calculations of the BCa or ABC type, as in DiCiccio and Efron (1992); see Section 6. Section 7 concludes with notes, details, and deferred proofs.

Bagging (Breiman, 1996) has become a major technology in the prediction literature, an excellent recent reference being Buja and Stuetzle (2006). The point of view here agrees with that in Bühlmann and Yu (2002), though their emphasis is more theoretical and less data-analytic. They employ bagging to “change hard thresholding estimators to soft thresholding,” in the same spirit as our Section 2.

Berk, Brown, Buja, Zhang and Zhao (2012) develop conservative normal-theory confidence intervals that are guaranteed to cover the true parameter value regardless of the preceding model-selection procedure. Very often it may be difficult to say just what selection procedure was used, in which case the conservative intervals are appropriate. The methods of this paper assume that the model-selection procedure is known, yielding smaller standard error estimates and shorter confidence intervals.

Hjort and Claeskens (2003) construct an ambitious large-sample theory of frequentist model-selection estimation and model averaging, while making comparisons with Bayesian methods. In theory, the Bayesian approach offers an ideal solution to model-selection problems, but, as Hjort and Claeskens point out, it requires an intimidating amount of prior knowledge from the statistician. The present article is frequentist in its methodology.

Hurvich and Tsai (1990) provide a nice discussion of what “frequentist” might mean in a model-selection framework. (Here I am following their “overall” interpretation.) The nonparametric bootstrap approach in Buckland, Burnham and Augustin (1997) has a similar flavor to the computations in Section 2.

Classical estimation theory ignored model selection out of necessity. Armed with modern computational equipment, statisticians can now deal with model-selection problems more realistically. The limited, but useful, goal of this paper is to provide a general tool for the assessment of standard errors in such situations. Simple parameters like (1.4) are featured in our examples, but the methods apply just as well to more complicated functionals, for instance the maximum value of a regression surface, or a tree-based estimate.

2 Nonparametric bootstrap smoothing

For the sake of simple notation, let y represent all the observed data, and $\hat\mu = t(y)$ an estimate of a parameter of interest μ. The Cholesterol data has

$y = \{(c_j, d_j),\ j = 1, 2, \ldots, n = 164\}.$ (2.1)

If $\mu = \mu_j$ (1.4) we might take $\hat\mu_j$ to be the height of the Cp-OLS regression curve measured at compliance $c = c_j$.

In a nonparametric setting we have data

$y = (y_1, y_2, \ldots, y_n)$ (2.2)

where the $y_j$ are independent and identically distributed (iid) observations from an unknown distribution F, a two-dimensional distribution in situation (2.1). The parameter is some functional $\mu = T(F)$, but the plug-in estimator $\hat\mu = T(\hat F)$, where $\hat F$ is the empirical distribution of the $y_j$ values, is usually what we hope to improve upon in model-selection situations.

A nonparametric bootstrap sample

$y^* = (y^*_1, y^*_2, \ldots, y^*_n)$ (2.3)

consists of n draws with replacement from the set $\{y_1, y_2, \ldots, y_n\}$, yielding bootstrap replication $\hat\mu^* = t(y^*)$. The empirical standard deviation of B such draws,

$\widehat{sd}_B = \Big[\sum_{i=1}^{B}\big(\hat\mu^*_i - \hat\mu^*_\cdot\big)^2\big/(B-1)\Big]^{1/2}, \qquad \Big(\hat\mu^*_\cdot = \sum_{i=1}^{B}\hat\mu^*_i\big/B\Big),$ (2.4)

is the familiar nonparametric bootstrap estimate of standard error for $\hat\mu$ (Efron, 1979); $\widehat{sd}_B$ is a dependable accuracy estimator in most standard situations but, as we will see, it is less dependable for setting approximate confidence limits in model-selection contexts.

The cubic regression curve in Figure 1 was selected using the Cp criterion. Suppose that under “Model m” we have

$y = X_m \beta_m + \varepsilon \qquad [\varepsilon \sim (0, \sigma^2 I)]$ (2.5)

where $X_m$ is a given $n \times m$ structure matrix of rank m, and ε has mean 0 and covariance σ² times the identity (σ assumed known in what follows). The Cp measure of fit for Model m is

$C_p(m) = \|y - X_m \hat\beta_m\|^2 + 2\sigma^2 m$ (2.6)

with $\hat\beta_m$ the OLS estimate of $\beta_m$; given a collection of possible choices for the structure matrix, the Cp criterion selects the one minimizing $C_p$.

Table 1 shows Cp results for the Cholesterol data. Six polynomial regression models were compared, ranging from linear (m = 2) to sixth degree (m = 7); the value σ = 22.0 was used, corresponding to the standard estimate $\hat\sigma$ obtained from the sixth degree model. The cubic model (m = 4) minimized Cp(m), leading to its selection in Figure 1.

Table 1.

Cp model selection for the Cholesterol data; measure of fit Cp(m) (2.6) for polynomial regression models of increasing degree. The cubic model minimizes Cp(m). (Value σ = 22.0 was used here and in all bootstrap replications.) Last column shows percentage each model was selected as the Cp minimizer, among B = 4000 bootstrap replications.

Regression model   m   Cp(m) − 80,000   (Bootstrap %)
Linear 2 1132 (19%)
Quadratic 3 1412 (12%)
Cubic 4 667 (34%)
Quartic 5 1591 (8%)
Quintic 6 1811 (21%)
Sextic 7 2758 (6%)
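For readers who want to reproduce the selection step, a minimal computational sketch of (2.5)–(2.6) follows. It is not from the paper: the helper names (`poly_design`, `cp_select`), the NumPy implementation, and the fixed σ = 22.0 default are illustrative assumptions. Applied to the Cholesterol data, a selector of this form should single out the cubic model, matching Table 1.

```python
import numpy as np

def poly_design(c, degree):
    """Structure matrix X_m with columns 1, c, c^2, ..., c^degree (so m = degree + 1)."""
    return np.vander(c, degree + 1, increasing=True)

def cp_select(c, d, sigma=22.0, degrees=range(1, 7)):
    """Return the polynomial degree minimizing Cp(m) = ||d - X_m b_m||^2 + 2 sigma^2 m, as in (2.6)."""
    best_degree, best_cp = None, np.inf
    for degree in degrees:
        X = poly_design(c, degree)
        beta, *_ = np.linalg.lstsq(X, d, rcond=None)    # OLS fit of beta_m
        resid = d - X @ beta
        cp = resid @ resid + 2 * sigma**2 * X.shape[1]  # m = number of columns
        if cp < best_cp:
            best_degree, best_cp = degree, cp
    return best_degree
```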

B = 4000 nonparametric bootstrap replications of the Cp-OLS regression curve — several times more than necessary, see Section 3 — were generated: starting with a bootstrap sample y* (2.3), the equivalent of Table 1 was calculated (still using σ = 22.0) and the Cp-minimizing degree m* selected, yielding the bootstrap regression curve

$\hat\mu^* = X_{m^*}\,\hat\beta^*_{m^*},$ (2.7)

where $\hat\beta^*_{m^*}$ was the OLS coefficient vector for the selected model. The last column of Table 1 shows the various bootstrap model-selection percentages: cubic was selected most often, but still only about one-third of the time.
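The bootstrap loop just described can be sketched as below, reusing the hypothetical `poly_design` and `cp_select` helpers from the previous sketch; again this is an illustrative reconstruction, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def cp_ols_bootstrap(c, d, B=4000, sigma=22.0):
    """B nonparametric bootstrap replications of the Cp-OLS curve, as in (2.3)-(2.7)."""
    n = len(d)
    mu_star = np.empty((B, n))
    counts = np.empty((B, n))                      # bootstrap counts Y*_ij of (3.2), kept for Section 3
    for i in range(B):
        idx = rng.integers(0, n, size=n)           # n draws with replacement: y* (2.3)
        deg = cp_select(c[idx], d[idx], sigma)     # Cp-minimizing degree m*
        beta, *_ = np.linalg.lstsq(poly_design(c[idx], deg), d[idx], rcond=None)
        mu_star[i] = poly_design(c, deg) @ beta    # selected curve evaluated at the original c_j (2.7)
        counts[i] = np.bincount(idx, minlength=n)
    return mu_star, counts

# The bagged curve (2.8) and the unsmoothed bootstrap sd (2.4) are then simply
# mu_star.mean(axis=0) and mu_star.std(axis=0, ddof=1).
```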

Suppose we focus attention on Subject 1, the arrowed point in Figure 1, so that the parameter of interest $\mu_1$ can be estimated by the Cp-OLS value $t(y) = \hat\mu_1$, evaluated to be 2.71. Figure 3 shows the histogram of the 4000 bootstrap replications $t(y^*) = \hat\mu^*_1$. The point estimate $\hat\mu_1 = 2.71$ is located to the right, exceeding a surprising 76% of the $\hat\mu^*_1$ values.

Figure 3.

B = 4000 bootstrap replications $\hat\mu^*_1$ of the Cp-OLS regression estimate for Subject 1. The original estimate $t(y) = \hat\mu_1$ is 2.71, exceeding 76% of the replications. Bootstrap standard deviation (2.4) equals 8.02. Triangles indicate 2.5th and 97.5th percentiles of the histogram.

Table 2 shows why. The cases where "Cubic" was selected yielded the largest bootstrap estimates $\hat\mu^*_1$. The actual dataset y fell into the cubic region, giving a correspondingly large estimate $\hat\mu_1$. Things might very well have turned out otherwise, as the bootstrap replications suggest: model selection can make an estimate "jumpy" and erratic.

Table 2.

Mean and standard deviation of $\hat\mu^*_1$ as a function of the selected model, 4000 nonparametric bootstrap replications; Cubic, Model 3, gave the largest estimates.

Model 1 2 3 4 5 6
Mean −13.69 −3.69 4.71 −1.25 −3.80 −3.56
Stdev 3.64 3.48 5.43 5.28 4.46 4.95

We can smooth $\hat\mu = t(y)$ by averaging over the bootstrap replications, defining

$\tilde\mu = s(y) = \frac{1}{B}\sum_{i=1}^{B} t(y^*_i).$ (2.8)

Bootstrap smoothing (Efron and Tibshirani, 1996), a form of model averaging, is better known as “bagging” in the prediction literature; see Breiman (1996) and Buja and Stuetzle (2006). There its variance reduction properties are emphasized. Our example will also show variance reductions, but the main interest here lies in smoothing; s(y), unlike t(y), does not jump as y crosses region boundaries, making it a more dependable vehicle for setting standard errors and confidence intervals. Suppose, for definiteness, that we are interested in setting approximate 95% bootstrap confidence limits for parameter μ. The usual “standard interval”

$\hat\mu \pm 1.96\,\widehat{sd}_B$ (2.9)

(= 2.71 ± 1.96 · 8.02 in Figure 3) inherits the dangerous jumpiness of $\hat\mu = t(y)$. The percentile interval, Section 13.3 of Efron and Tibshirani (1993),

$\big[\hat\mu^{*(.025)},\ \hat\mu^{*(.975)}\big],$ (2.10)

the 2.5th and 97.5th percentiles of the B bootstrap replications, yields more stable results. (Notice that it does not require a central point estimate such as $\hat\mu$ in (2.9).)

A third choice, of particular interest here, is the smoothed interval

$\tilde\mu \pm 1.96\,\widetilde{sd}_B$ (2.11)

where $\tilde\mu = s(y)$ is the bootstrap smoothed estimate (2.8), while $\widetilde{sd}_B$ is given by the projection formula discussed in Section 3. Interval (2.11) combines stability with reduced length.

Table 3 compares the three approximate 95% intervals for μ1. The reduction in length is dramatic here, though less so for the other 163 subjects; see Section 3.

Table 3.

Three approximate 95% bootstrap confidence intervals for μ1, the response value for Subject 1, Cholesterol data.

Interval   Limits   Length   Center point

Standard interval (2.9) (−13.0, 18.4) 31.4 2.71
Percentile interval (2.10) (−17.8, 13.5) 31.3 −2.15
Smoothed standard (2.11) (−13.3, 8.0) 21.3 −2.65
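A small sketch of how the three intervals of Table 3 might be computed for a single subject; `sd_smooth` stands for the projection-formula standard deviation of Section 3, and everything else follows (2.9)–(2.11). The function and its names are illustrative, not the paper's code.

```python
import numpy as np

def intervals_95(mu_hat, mu_star, sd_smooth):
    """Approximate 95% intervals for one parameter:
    mu_hat    -- unsmoothed Cp-OLS estimate t(y)
    mu_star   -- length-B array of bootstrap replications t(y*_i)
    sd_smooth -- smoothed-estimate sd from Section 3 (sd~_B)"""
    sd_hat = mu_star.std(ddof=1)                                           # (2.4)
    standard = (mu_hat - 1.96 * sd_hat, mu_hat + 1.96 * sd_hat)            # (2.9)
    percentile = tuple(np.percentile(mu_star, [2.5, 97.5]))                # (2.10)
    mu_tilde = mu_star.mean()                                              # (2.8)
    smoothed = (mu_tilde - 1.96 * sd_smooth, mu_tilde + 1.96 * sd_smooth)  # (2.11)
    return standard, percentile, smoothed
```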

The BCa-ABC system goes beyond (2.9)–(2.11) to produce bootstrap confidence intervals having second-order accuracy, as in DiCiccio and Efron (1992). Section 6 carries out the ABC calculations in a parametric bootstrap context.

3 Accuracy of the smoothed bootstrap estimates

The smoothed standard interval $\tilde\mu \pm 1.96\,\widetilde{sd}_B$ requires a standard deviation assessment $\widetilde{sd}_B$ for the smoothed bootstrap estimate (2.8). A brute force approach employs a second level of bootstrapping: resampling from $y^*_i$ (2.3) yields a collection of B second-level replications $y^{**}_{ij}$, from which we calculate $s^*_i = \sum_j t(y^{**}_{ij})/B$; repeating this whole process for many replications $y^*_i$ provides bootstrap values $s^*_i$, from which we calculate their empirical standard deviation.

The trouble with brute force is that it requires an enormous number of recomputations of the original statistic t(⋅). This section describes an estimate $\widetilde{sd}_B$ that uses only the original B bootstrap replications $\{t(y^*_i),\ i = 1, 2, \ldots, B\}$.

The theorem that follows will be stated in terms of the "ideal bootstrap," where B equals all $n^n$ possible choices of $y^* = (y^*_1, y^*_2, \ldots, y^*_n)$ from $\{y_1, y_2, \ldots, y_n\}$, each having probability 1/B. It will be straightforward then to adapt our results to the non-ideal bootstrap, with B = 4000 for instance.

Define

$t^*_i = t(y^*_i) \qquad [y^*_i = (y^*_{i1}, y^*_{i2}, \ldots, y^*_{ik}, \ldots, y^*_{in})],$ (3.1)

the ith bootstrap replication of the statistic of interest, and let

$Y^*_{ij} = \#\{y^*_{ik} = y_j\},$ (3.2)

the number of elements of $y^*_i$ equaling the original data point $y_j$. The vector $Y^*_i = (Y^*_{i1}, Y^*_{i2}, \ldots, Y^*_{in})$ follows a multinomial distribution with n draws on n categories each of probability 1/n, and has mean vector and covariance matrix

$Y^*_i \sim \big(1_n,\ I - 1_n 1_n'/n\big),$ (3.3)

$1_n$ the vector of n 1's and I the n × n identity matrix.

Theorem 1. The nonparametric delta-method estimate of standard deviation for the ideal smoothed bootstrap statistic $s(y) = \sum_{i=1}^{B} t(y^*_i)/B$ is

$\widetilde{sd} = \Big[\sum_{j=1}^{n} \mathrm{cov}_j^2\Big]^{1/2}$ (3.4)

where

$\mathrm{cov}_j = \mathrm{cov}_*\big(Y^*_{ij},\, t^*_i\big),$ (3.5)

the bootstrap covariance between $Y^*_{ij}$ and $t^*_i$.

(The proof appears later in this section.)

The estimate of standard deviation for s(y) in the non-ideal case is the analogue of (3.4),

$\widetilde{sd}_B = \Big[\sum_{j=1}^{n} \widehat{\mathrm{cov}}_j^2\Big]^{1/2}$ (3.6)

where

$\widehat{\mathrm{cov}}_j = \sum_{i=1}^{B}\big(Y^*_{ij} - Y^*_{\cdot j}\big)\big(t^*_i - t^*_\cdot\big)\big/B$ (3.7)

with $Y^*_{\cdot j} = \sum_{i=1}^{B} Y^*_{ij}/B$ and $t^*_\cdot = \sum_{i=1}^{B} t^*_i/B = s(y)$. Remark J concerns a bias correction for (3.6) that can be important in the non-ideal case (it wasn't in the Cholesterol example). All of these results apply generally to bagging estimators, and are not restricted to regression situations.
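In code, (3.6)–(3.7) amount to one covariance per original data point between the bootstrap counts and the replications. The sketch below assumes the count matrix and replications produced by a loop like the one in Section 2; it is an illustration, not the paper's implementation.

```python
import numpy as np

def smoothed_sd(counts, t):
    """Smoothed-estimate standard deviation sd~_B of (3.6)-(3.7).
    counts -- (B, n) matrix of bootstrap counts Y*_ij (3.2)
    t      -- length-B vector of replications t*_i"""
    B = len(t)
    cov_hat = (counts - counts.mean(axis=0)).T @ (t - t.mean()) / B   # cov^_j of (3.7)
    return np.sqrt(np.sum(cov_hat ** 2))                              # (3.6)
```

The same `cov_hat` vector also drives the acceleration estimate (3.23) below.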

Figure 2 shows that $\widetilde{sd}_B$ is less than $\widehat{sd}_B$, the bootstrap estimate of standard deviation for the unsmoothed statistic,

$\widehat{sd}_B = \Big[\sum_{i=1}^{B}\big(t^*_i - t^*_\cdot\big)^2\big/B\Big]^{1/2},$ (3.8)

for all 164 estimators $t(y) = \hat\mu_j$. This is no accident. Returning to the ideal bootstrap situation, let $\mathcal{L}(Y^*)$ be the (n − 1)-dimensional subspace of $\mathcal{R}^B$ spanned by the columns of the B × n matrix having elements $Y^*_{ij} - 1$. (Notice that $\sum_{i=1}^{B} Y^*_{ij}/B = 1$ according to (3.3).) Also define $s_0 = \sum_{i=1}^{B} t^*_i/B$, the ideal bootstrap smoothed estimate, so

$U^* \equiv t^* - s_0 1$ (3.9)

is the B-vector of mean-centered replications $t^*_i - s_0$. Note: Formula (3.6) is a close cousin of the "jackknife-after-bootstrap" method of Efron (1992), the difference being the use of the jackknife rather than our infinitesimal jackknife calculations.

Corollary 1. The ratio $\widetilde{sd}_B/\widehat{sd}_B$ is given by

$\frac{\widetilde{sd}_B}{\widehat{sd}_B} = \frac{\|\widehat U^*\|}{\|U^*\|}$ (3.10)

where $\widehat U^*$ is the projection of $U^*$ into $\mathcal{L}(Y^*)$.

(See Remark A in Section 7 for the proof. Remark B concerns the relation of Theorem 1 to the Hájek projection.)

The illustration in Figure 4 shows $\widetilde{sd}_B/\widehat{sd}_B$ as the cosine of the angle between $t^* - s_0 1$ and $\mathcal{L}(Y^*)$. The ratio is a measure of the nonlinearity of $t^*_i$ as a function of the bootstrap counts $Y^*_{ij}$. Model selection induces discontinuities in t(⋅), increasing the nonlinearity and decreasing $\widetilde{sd}_B/\widehat{sd}_B$. The 164 ratios shown as the dashed line in Figure 2 had median 0.91, mean 0.89.

How many bootstrap replications B are necessary to ensure the accuracy of $\widetilde{sd}_B$? The jackknife provides a quick answer: divide the B replications into J groups of size B/J each, and let $\widetilde{sd}_{Bj}$ be the estimate (3.6) computed with the jth group removed. Then

$cv_B = \Big[\frac{J}{J-1}\sum_{j=1}^{J}\big(\widetilde{sd}_{Bj} - \widetilde{sd}_{B\cdot}\big)^2\Big]^{1/2}\Big/\ \widetilde{sd}_B,$ (3.11)

with $\widetilde{sd}_{B\cdot} = \sum_{j=1}^{J}\widetilde{sd}_{Bj}/J$, is the jackknife-estimated coefficient of variation for $\widetilde{sd}_B$. Applying (3.11) with J = 20 to the first B = 1000 replications (of the 4000 used in Figure 2) yielded $cv_B$ values of about 0.05 for each of the 164 subjects. Going on to B = 4000 reduced the $cv_B$'s to about 0.02. Stopping at B = 1000 would have been quite sufficient. Note: $cv_B$ applies to the bootstrap accuracy of $\widetilde{sd}_B$ as an estimate of the ideal value $\widetilde{sd}$ (3.4), not to sampling variability due to randomness in the original data y, while $\widetilde{sd}_B$ itself does refer to sampling variability.
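A grouped-jackknife check like (3.11) only needs to recompute $\widetilde{sd}_B$ with one block of replications left out at a time. The sketch assumes the `smoothed_sd` helper above and B divisible by J, and follows the J/(J − 1) factor as reconstructed in (3.11); it is illustrative only.

```python
import numpy as np

def jackknife_cv(counts, t, J=20):
    """Jackknife coefficient of variation for sd~_B, as in (3.11)."""
    B = len(t)
    group = np.arange(B) % J                               # J groups of size B/J
    sd_full = smoothed_sd(counts, t)
    sd_j = np.array([smoothed_sd(counts[group != j], t[group != j]) for j in range(J)])
    var_jack = (J / (J - 1)) * np.sum((sd_j - sd_j.mean()) ** 2)
    return np.sqrt(var_jack) / sd_full
```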

Proof of Theorem 1. The "nonparametric delta method" is the same as the influence function and infinitesimal jackknife methods described in Chapter 6 of Efron (1982). It is appropriate here because s(y), unlike t(y), is a smooth function of y. With the original data vector y (2.2) fixed, we can write bootstrap replication $t^*_i = t(y^*_i)$ as a function $T(Y^*_i)$ of the count vector (3.2). The ideal smoothed bootstrap estimate $s_0$ is the multinomial expectation of $T(Y^*)$,

$s_0 = E_*\{T(Y^*)\}, \qquad Y^* \sim \mathrm{Mult}_n(n, p_0),$ (3.12)

p0 = (1/n, 1/n, … , 1/n), the notation indicating a multinomial distribution with n draws on n equally likely categories.

Now let S(p) denote the multinomial expectation of T(Y*) if the probability vector is changed from p0 to p = (p1, p2, … , pn),

$S(p) = E_*\{T(Y^*)\}, \qquad Y^* \sim \mathrm{Mult}_n(n, p),$ (3.13)

so S(p0) = s0. Define the directional derivative

$\dot S_j = \lim_{\varepsilon \to 0} \frac{S\big(p_0 + \varepsilon(\delta_j - p_0)\big) - S(p_0)}{\varepsilon},$ (3.14)

δj the jth coordinate vector (0, 0, … , 0, 1, 0, … , 0), with 1 in the jth place. Formula (6.18) of Efron (1982) gives

$\Big(\sum_{j=1}^{n} \dot S_j^2\Big)^{1/2}\Big/\ n$ (3.15)

as the delta method estimate of standard deviation for s0. It remains to show that (3.15) equals (3.4).

Define $w_i(p)$ to be the ratio of the probabilities of $Y^*_i$ under (3.13) compared to (3.12),

$w_i(p) = \prod_{k=1}^{n}\big(np_k\big)^{Y^*_{ik}},$ (3.16)

so that

$S(p) = \sum_{i=1}^{B} w_i(p)\, t^*_i\big/B$ (3.17)

(the factor 1/B reflecting that under $p_0$, all the $Y^*_i$'s have probability $1/B = 1/n^n$).

For $p(\varepsilon) = p_0 + \varepsilon(\delta_j - p_0)$ as in (3.14), we calculate

$w_i(p) = \big(1 + (n-1)\varepsilon\big)^{Y^*_{ij}}\,\big(1 - \varepsilon\big)^{\sum_{k \ne j} Y^*_{ik}}.$ (3.18)

Letting ε → 0 yields

$w_i(p) \doteq 1 + n\varepsilon\big(Y^*_{ij} - 1\big)$ (3.19)

where we have used $\sum_k Y^*_{ik}/n = 1$. Substitution into (3.17) gives

$S\big(p(\varepsilon)\big) \doteq \sum_{i=1}^{B}\big[1 + n\varepsilon(Y^*_{ij} - 1)\big]\, t^*_i\big/B = s_0 + n\varepsilon\,\mathrm{cov}_j$ (3.20)

as in (3.5). Finally, definition (3.14) yields

$\dot S_j = n\,\mathrm{cov}_j$ (3.21)

and (3.15) verifies Theorem 1 (3.4).

The validity of an approximate 95% interval $\hat\theta \pm 1.96\,\hat\sigma$ is compromised if the standard error σ is itself changing rapidly as a function of θ. Acceleration $\hat a$ (Efron, 1987) is a measure of such change. Roughly speaking,

$\hat a = \frac{d\sigma}{d\theta}\Big|_{\hat\theta}.$ (3.22)

If $\hat a = 0.10$ for instance, then at the upper endpoint $\hat\theta_{\mathrm{up}} = \hat\theta + 1.96\,\hat\sigma$ the standard error will have increased to about $1.196\,\hat\sigma$, leaving $\hat\theta_{\mathrm{up}}$ only 1.64, not 1.96, σ-units above $\hat\theta$. (The 1987 paper divides definition (3.22) by 3, as being appropriate after a normalizing transformation.)

Acceleration has a simple expression in terms of the covariances $\widehat{\mathrm{cov}}_j$ used to calculate $\widetilde{sd}_B$ in (3.6),

$\hat a = \frac{1}{6}\Big[\sum_{j=1}^{n}\widehat{\mathrm{cov}}_j^3\Big/\Big(\sum_{j=1}^{n}\widehat{\mathrm{cov}}_j^2\Big)^{3/2}\Big],$ (3.23)

equation (7.3) of Efron (1987). The $\hat a$'s were small for the 164 $\widetilde{sd}_B$ estimates for the Cholesterol data, most of them falling between −0.02 and 0.02, strengthening belief in the smoothed standard intervals $\tilde\mu_i \pm 1.96\,\widetilde{sd}_{Bi}$ (2.11).
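Since (3.23) uses the same covariances as (3.6), the acceleration costs essentially nothing extra; a one-function illustration, again assuming the `smoothed_sd` setup above:

```python
import numpy as np

def acceleration(counts, t):
    """Acceleration estimate a^ of (3.23), built from the covariances of (3.7)."""
    B = len(t)
    cov_hat = (counts - counts.mean(axis=0)).T @ (t - t.mean()) / B
    return np.sum(cov_hat ** 3) / (6.0 * np.sum(cov_hat ** 2) ** 1.5)
```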

Bias is more difficult to estimate than variance, particularly in a nonparametric context. Remark C of Section 7 verifies the following promising-looking result: the nonparametric estimate of bias for the smoothed estimate $\tilde\mu = s(y)$ (2.8) is

$\widetilde{\mathrm{bias}} = \frac{1}{2}\,\mathrm{cov}_*\big(Q^*_i,\, t^*_i\big) \quad\text{where}\quad Q^*_i = \sum_{k=1}^{n}\big(Y^*_{ik} - 1\big)^2,$ (3.24)

with cov* indicating bootstrap covariance as in (3.5). Unfortunately, bias proved to be too noisy to use in the Cholesterol example. Section 6 describes a more practical approach to bias estimation in a parametric bootstrap context.

4 Parametric bootstrap smoothing

We switch now from nonparametric to parametric estimation problems, but ones still involving data-based model selection. More specifically, we assume that a p-parameter exponential family of densities applies,

$f_\alpha(\hat\beta) = e^{\alpha'\hat\beta - \psi(\alpha)} f_0(\hat\beta),$ (4.1)

where α is the p-dimensional natural or canonical parameter vector, $\hat\beta$ the p-dimensional sufficient statistic vector (playing the role of y in (2.2)), ψ(α) the cumulant generating function, and $f_0(\hat\beta)$ the "carrying density" defined with respect to some carrying measure (which may include discrete atoms as with the Poisson family). Form (4.1) covers a wide variety of familiar applications, including generalized linear models; $\hat\beta$ is usually obtained by sufficiency from the original data, as seen in the next section.

The expectation parameter vector $\beta = E_\alpha\{\hat\beta\}$ is a one-to-one function of α, say β = λ(α), having p × p derivative matrix

$\frac{d\beta}{d\alpha} = V(\alpha)$ (4.2)

where V = V(α) is the covariance matrix $\mathrm{cov}_\alpha(\hat\beta)$. The value of α corresponding to the sufficient statistic $\hat\beta$, $\hat\alpha = \lambda^{-1}(\hat\beta)$, is the maximum likelihood estimate (MLE) of α.

A parametric bootstrap sample is obtained by drawing i.i.d. realizations $\hat\beta^*$ from the MLE density $f_{\hat\alpha}(\cdot)$,

$f_{\hat\alpha}(\cdot) \xrightarrow{\ \mathrm{iid}\ } \hat\beta^*_1, \hat\beta^*_2, \ldots, \hat\beta^*_B.$ (4.3)

If $\hat\mu = t(\hat\beta)$ is an estimate of a parameter of interest μ, the bootstrap samples (4.3) provide B parametric bootstrap replications of $\hat\mu$,

$\hat\mu^*_i = t(\hat\beta^*_i), \qquad i = 1, 2, \ldots, B.$ (4.4)

As in the nonparametric situation, these can be averaged to provide a smoothed estimate,

$\tilde\mu = s(\hat\beta) = \sum_{i=1}^{B} t(\hat\beta^*_i)\big/B.$ (4.5)

When t(⋅) involves model selection, μ^ is liable to an erratic jumpiness, smoothed out by the averaging process.

The bootstrap replications $\hat\beta^* \sim f_{\hat\alpha}(\cdot)$ have mean vector and covariance matrix

$\hat\beta^* \sim \big(\hat\beta,\ \hat V\big) \qquad [\hat V = V(\hat\alpha)].$ (4.6)

Let $\boldsymbol{B}$ be the B × p matrix with ith row $\hat\beta^*_i - \hat\beta$. As before, we will assume an ideal bootstrap resampling situation where B → ∞, making the empirical mean and variance of the $\hat\beta^*$ values exactly match (4.6):

$\boldsymbol{B}'1_B/B = 0 \quad\text{and}\quad \boldsymbol{B}'\boldsymbol{B}/B = \hat V,$ (4.7)

$1_B$ the vector of B 1's.

Parametric versions of Theorem 1 and Corollary 1 depend on the p-dimensional bootstrap covariance vector between $\hat\beta^*$ and $t^* = t(\hat\beta^*)$,

$\mathrm{cov} = \boldsymbol{B}'\big(t^* - s_0 1_B\big)\big/B$ (4.8)

where $t^*$ is the B-vector of bootstrap replications $t^*_i = t(\hat\beta^*_i)$, and $s_0$ the ideal smoothed estimate (4.5).

Theorem 2. The parametric delta-method estimate of standard deviation for the ideal smoothed estimate (4.5) is

$\widetilde{sd} = \big[\mathrm{cov}'\,\hat V^{-1}\,\mathrm{cov}\big]^{1/2}.$ (4.9)

(Proof given at the end of this section.)

Corollary 2. $\widetilde{sd}$ is always less than or equal to $\widehat{sd}$, the bootstrap estimate of standard deviation for the unsmoothed estimate,

$\widehat{sd} = \big[\|t^* - s_0 1_B\|^2\big/B\big]^{1/2},$ (4.10)

the ratio being

$\widetilde{sd}\big/\widehat{sd} = \Big[\big(t^* - s_0 1_B\big)'\boldsymbol{B}\big(\boldsymbol{B}'\boldsymbol{B}\big)^{-1}\boldsymbol{B}'\big(t^* - s_0 1_B\big)\Big]^{1/2}\Big/\Big(B^{1/2}\,\widehat{sd}\Big).$ (4.11)

In the ideal bootstrap case, (4.7) and (4.9) show that $\widetilde{sd}$ equals $B^{-1/2}$ times the numerator on the right-hand side of (4.11). This is recognizable as the length of the projection of $t^* - s_0 1_B$ into the p-dimensional linear subspace of $\mathcal{R}^B$ spanned by the columns of $\boldsymbol{B}$. Figure 4 still applies, with $\mathcal{L}(\boldsymbol{B})$ replacing $\mathcal{L}(Y^*)$.

If $t(\hat\beta) = \hat\mu$ is multivariate, say of dimension K, then cov as defined in (4.8) is a p × K matrix. In this case

$\mathrm{cov}'\,\hat V^{-1}\,\mathrm{cov}$ (4.12)

(or $\widehat{\mathrm{cov}}'\,\bar V^{-1}\,\widehat{\mathrm{cov}}$ in what follows) is the delta-method assessment of covariance for the smoothed vector estimate $s(\hat\beta) = \sum t(\hat\beta^*_i)/B$, also denoted $t^*_\cdot$ below.

Only minor changes are necessary for realistic bootstrap computations, i.e., for B < ∞. Now we define $\boldsymbol{B}$ as the B × p matrix having ith row $\hat\beta^*_i - \hat\beta^*_\cdot$, with $\hat\beta^*_\cdot = \sum_i \hat\beta^*_i/B$, and compute the empirical covariance vector

$\widehat{\mathrm{cov}} = \boldsymbol{B}'\big(t^* - t^*_\cdot 1_B\big)\big/B$ (4.13)

and the empirical bootstrap variance matrix

$\bar V = \boldsymbol{B}'\boldsymbol{B}\big/B.$ (4.14)

Then the estimate of standard deviation for the smoothed estimate $\tilde\mu = s(\hat\beta)$ (4.5) is

$\widetilde{sd}_B = \big[\widehat{\mathrm{cov}}'\,\bar V^{-1}\,\widehat{\mathrm{cov}}\big]^{1/2}.$ (4.15)

As $B \to \infty$, $\widehat{\mathrm{cov}} \to \mathrm{cov}$ and $\bar V \to \hat V$, so $\widetilde{sd}_B \to \widetilde{sd}$ (4.9). Corollary 2, with $s_0$ replaced by $\tilde\mu$ (4.5), remains valid.
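A compact sketch of (4.13)–(4.15): the only inputs are the B × p matrix of bootstrap sufficient statistics and the B replications $t^*_i$. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def parametric_smoothed_sd(beta_star, t):
    """Parametric delta-method sd for the smoothed estimate, (4.13)-(4.15).
    beta_star -- (B, p) array of bootstrap sufficient statistics beta*_i
    t         -- length-B vector of replications t*_i = t(beta*_i)"""
    B = len(t)
    Bmat = beta_star - beta_star.mean(axis=0)             # rows beta*_i - beta*_.
    cov_hat = Bmat.T @ (t - t.mean()) / B                 # (4.13)
    V_bar = Bmat.T @ Bmat / B                             # (4.14)
    return float(np.sqrt(cov_hat @ np.linalg.solve(V_bar, cov_hat)))   # (4.15)
```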

Figure 5 reports on a simulation test of Theorem 2. This was based on a parametric model for the Cholesterol data of Figure 1,

Figure 5.

Simulation test of Theorem 2, parametric model (4.16)–(4.18), Cholesterol data; 100 simulations, 1000 parametric bootstraps each, for the 11 subjects indicated at the bottom of Figure 1. Heavy line connects observed empirical standard deviations (4.22); dashes show the 100 estimates $\widetilde{sd}$ from Theorem 2 (4.15). Light dashed line connects averages of the $\widetilde{sd}$ values, as discussed in Remark K.

$y \sim N_{164}(\mu, \Sigma),$ (4.16)

where the covariance matrix Σ was diagonal, with diagonal elements $\sigma_i^2$ given by a cubic function of compliance $c_i$ (obtained from a regression percentile fit),

$\sigma_i = 23.7 + 5.49\,c_i - 2.25\,c_i^2 - 1.03\,c_i^3,$ (4.17)

making $\sigma_i$ about twice as large to the right as to the left. The expectation vector μ was taken to be

$\mu = X\hat\beta^{(6)} = \hat\mu^{(6)},$ (4.18)

the sixth-degree OLS fit for cholesterol decrease as a function of compliance, used as the mean in (4.16), with X the corresponding 164 × 7 structure matrix.

Model (4.16)–(4.18) is a 7-parameter exponential family (4.1), with sufficient statistic

$\hat\beta = G^{-1} X'\Sigma^{-1} y \qquad [G = X'\Sigma^{-1}X]$ (4.19)

and covariance matrix (4.2)

$V = G^{-1},$ (4.20)

which is all that is necessary to apply Theorem 2.

The simulation began with 100 draws $y_i,\ i = 1, 2, \ldots, 100$, from (4.16), each of which gave OLS estimate $\hat\mu_i = X\hat\beta_i^{(6)}$. Then B = 1000 parametric bootstrap draws were generated from $\hat\beta_i$,

$y^{**}_{ij} \sim N\big(\hat\mu_i, \Sigma\big), \qquad j = 1, 2, \ldots, 1000,$ (4.21)

from which smoothed estimate $\tilde\mu_i$ (4.5) and estimated standard deviation $\widetilde{sd}_i$ were calculated according to (4.15). All of this was done for 11 of the 164 subjects, as indicated in Figure 1.

The dashes in Figure 5 indicate the 100 $\widetilde{sd}_i$ values for each of the 11 subjects. This is compared with the observed empirical standard deviations of the smoothed estimates,

$\widetilde{Sd} = \Big[\sum_{i=1}^{100}\big(\tilde\mu_i - \tilde\mu_\cdot\big)^2\big/99\Big]^{1/2} \qquad \Big[\tilde\mu_\cdot = \sum_{i=1}^{100}\tilde\mu_i\big/100\Big],$ (4.22)

connected by the heavy solid curve. The $\widetilde{sd}$ values from Theorem 2 are seen to provide reasonable estimates of $\widetilde{Sd}$, though with some bias and variability.

There is more to the story. The empirical standard deviations $\widetilde{Sd}$ are themselves affected by model selection problems. Averaging the 100 $\widetilde{sd}_i$ values (connected by the light dashed line in Figure 5) gives more dependable results, as discussed in Remark K.

Proof of Theorem 2. Suppose that instead of $f_{\hat\alpha}(\cdot)$ in (4.3) we wished to consider parametric bootstrap samples drawn from some other member of family (4.1), $f_\alpha(\cdot)$ (α not necessarily the "true value"). The ratio $w_i = f_\alpha(\hat\beta^*_i)\big/f_{\hat\alpha}(\hat\beta^*_i)$ equals

$w_i = c_{\alpha,\hat\alpha}\, e^{Q_i} \quad\text{where}\quad Q_i = (\alpha - \hat\alpha)'\big(\hat\beta^*_i - \hat\beta\big),$ (4.23)

with the factor $c_{\alpha,\hat\alpha}$ not depending on $\hat\beta^*_i$. Importance sampling can now be employed to estimate $E_\alpha\{t(\hat\beta)\}$, the expectation under $f_\alpha$ of statistic $t(\hat\beta)$, using only the original bootstrap replications $(\hat\beta^*_i, t^*_i)$ from (4.3),

$\hat E_\alpha = \sum_{i=1}^{B} w_i t^*_i\Big/\sum_{i=1}^{B} w_i = \sum_{i=1}^{B} e^{Q_i} t^*_i\Big/\sum_{i=1}^{B} e^{Q_i}.$ (4.24)

Notice that $\hat E_\alpha$ is the value of the smoothed estimate (4.5) at parameter α, say $s_\alpha$. The delta-method standard deviation for our estimate $s_{\hat\alpha}$ depends on the derivative vector $ds_\alpha/d\alpha$ evaluated at $\alpha = \hat\alpha$. Letting $\alpha \to \hat\alpha$ in (4.23)–(4.24) gives

$s_\alpha \doteq \frac{\sum\big(1 + Q_i\big)t^*_i\big/B}{\sum\big(1 + Q_i\big)\big/B} \doteq s_{\hat\alpha} + (\alpha - \hat\alpha)'\,\mathrm{cov}$ (4.25)

where the denominator term $\sum Q_i/B$ equals 0 for the ideal bootstrap according to (4.7). (For the non-ideal bootstrap, $\sum Q_i/B$ approaches 0 at rate $O_p(1/\sqrt{B})$.)

We see that

$\frac{ds_\alpha}{d\alpha}\Big|_{\hat\alpha} = \mathrm{cov},$ (4.26)

so from (4.2),

$\frac{ds_\alpha}{d\beta}\Big|_{\hat\alpha} = \hat V^{-1}\,\mathrm{cov}.$ (4.27)

Since $\hat V$ is the covariance matrix of $\hat\beta^*$, that is, of $\hat\beta$ under distribution $f_{\alpha = \hat\alpha}$, (4.6) and (4.27) verify $\widetilde{sd}$ in (4.9) as the usual delta-method estimate of standard deviation for $s(\hat\beta)$.

Theorem 1 and Corollary 1 can be thought of as special cases of the exponential family theory in this section. The multinomial distribution of $Y^*$ (3.12) plays the role of $f_{\hat\alpha}(\hat\beta^*)$; $\hat V$ in (4.9) becomes $I - 1_n 1_n'/n$ (3.3), so that (4.9) becomes (3.4). A technical difference is that the $\mathrm{Mult}_n(n, p)$ family (3.13) is singular (that is, concentrated on an (n − 1)-dimensional subspace of $\mathcal{R}^n$), making the influence-function argument a little more involved than the parametric delta-method calculations. More seriously, the dimension of the nonparametric multinomial distribution increases with n, while for example, the parametric "Supernova" example of the next section has dimension 10 no matter how many supernovas might be observed. The more elaborate parametric confidence interval calculations of Section 6 failed when adapted for the nonparametric Cholesterol analysis, perhaps because of the comparatively high dimension, 164 versus 10.

5 The Supernova data

Figure 6 concerns a second example we will use to illustrate the parametric bootstrap theory of the previous section, the Supernova data: the absolute magnitude yi has been determined for n = 39 Type Ia supernovas, yielding the data

Figure 6. The Supernova data.

Absolute magnitudes of n = 39 Type Ia supernovas plotted versus their OLS estimates from the full linear model (5.3); adjusted R² (5.5) equals 0.69.

$y = (y_1, y_2, \ldots, y_n).$ (5.1)

For each supernova a vector of spectral energies $x_i$ has also been observed, measured at p = 10 frequencies,

$x_i = (x_{i1}, x_{i2}, \ldots, x_{i10})$ (5.2)

for supernova i. The 39 × 10 covariate matrix X, having $x_i$ as its ith row, will be regarded as fixed.

We assume a standard normal linear regression model

$y = X\alpha + \varepsilon, \qquad \varepsilon \sim N_{39}(0, I),$ (5.3)

referred to as the full model in what follows. (For convenient discussion, the $y_i$ have been rescaled to make (5.3) appropriate.) It has exponential family form (4.1), p = 10, with natural parameter α, sufficient statistic $\hat\beta = X'y$, and $\psi = \alpha'X'X\alpha/2$.

Then $(X'X)^{-1}\hat\beta = \hat\alpha$, the MLE of α, which also equals $\hat\alpha_{\mathrm{OLS}}$, the ordinary least squares estimate of α in (5.3), yielding the full-model vector of supernova brightness estimates

$\hat\mu_{\mathrm{OLS}} = X\hat\alpha_{\mathrm{OLS}}.$ (5.4)

Figure 6 plots $y_i$ versus its estimate $\hat\mu_{\mathrm{OLS},i}$. The fit looks good, having an unadjusted R² of 0.82. Adjusting for the fact that we have used m = 10 parameters to fit n = 39 data points yields the more realistic value

$R^2_{\mathrm{adj}} = R^2 - 2\big(1 - R^2\big)\frac{m}{n - m} = 0.69;$ (5.5)

see Remark D.

Type Ia supernovas were used as “standard candles” in the discovery of dark energy and the cosmological expansion of the universe (Perlmutter et al., 1999; Riess et al., 1998). Their standardness assumes a constant absolute magnitude. This is not exactly true, and in practice regression adjustments are made. Our 39 supernovas were close enough to Earth to have their absolute magnitudes ascertained independently. The spectral measurements x, however, can be made for distant Type Ia supernovas, where independent methods fail, the scientific goal being a more accurate estimation function μ^(x) for their absolute magnitudes, and improved calibration of cosmic expansion.

We will use the Lasso (Tibshirani, 1996) to select $\hat\mu(x)$. For a given choice of the non-negative "tuning parameter" λ we estimate α by the Lasso criterion

$\hat\alpha_\lambda = \arg\min_\alpha\Big\{\|y - X\alpha\|^2 + \lambda\sum_{k=1}^{p}|\alpha_k|\Big\};$ (5.6)

$\hat\alpha_\lambda$ shrinks the components of $\hat\alpha_{\mathrm{OLS}}$ toward zero, some of them all the way. As λ decreases from infinity to 0, the number m of non-zero components of $\hat\alpha_\lambda$ increases from 0 to p. Conveniently enough, it turns out that m also nearly equals the effective degrees of freedom for the selection of $\hat\alpha_\lambda$ (Efron, Hastie, Johnstone and Tibshirani, 2004). In what follows we will write $\hat\alpha_m$ rather than $\hat\alpha_\lambda$.

Table 4 shows a portion of the Lasso calculations for the Supernova data. Its last column gives $R^2_{\mathrm{adj}}$ (5.5), with $R^2$ having the usual form

$R^2 = 1 - \|y - \hat\mu_m\|^2\big/\|y - \bar y 1\|^2 \qquad \big(\hat\mu_m = X\hat\alpha_m,\ \bar y = \textstyle\sum y_i/n\big).$ (5.7)

Table 4.

Lasso model selection for the Supernova data. As the regularization parameter λ in (5.6) decreases from infinity to zero, the number m of non-zero coordinates of $\hat\alpha_m$ increases from 0 to 10. The choice m = 7 maximizes the adjusted R² value (5.5), making it the selected model.

λ       m    R²    R²_adj
∞       0    0     0
63      1    .17   .12
19.3    3    .74   .70
8.2     5    .79   .73
.496    7    .82   .735 (selected)
.039    9    .82   .71
0       10   .82   .69 (OLS)

The choice $\hat m = 7$ maximizes $R^2_{\mathrm{adj}}$,

$\hat m = \arg\max_m\big\{R^2_{\mathrm{adj}}(m)\big\},$ (5.8)

yielding our selected coefficient vector $\hat\alpha_{\hat m}$ and the corresponding vector of supernova estimates

$\hat\mu = X\hat\alpha_{\hat m};$ (5.9)

note that $\hat\alpha_{\hat m}$ is not an OLS estimate.
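A rough sketch of the selection rule (5.6)–(5.8) using scikit-learn's Lasso. Note that sklearn minimizes ‖y − Xw‖²/(2n) + α‖w‖₁, so its penalty grid is a rescaling of λ in (5.6); the function name and the grid of penalties are illustrative assumptions, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_by_adjusted_r2(X, y, alphas):
    """Fit a Lasso path over `alphas` and return the fit maximizing R^2_adj, as in (5.5)/(5.8)."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)
    best = None
    for alpha in alphas:
        coef = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
        m = np.count_nonzero(coef)
        if m >= n:                                   # guard the n - m denominator in (5.5)
            continue
        resid = y - X @ coef
        r2 = 1 - resid @ resid / tss                 # (5.7)
        r2_adj = r2 - 2 * (1 - r2) * m / (n - m)     # (5.5)
        if best is None or r2_adj > best[0]:
            best = (r2_adj, m, coef)
    return best    # (best R^2_adj, m_hat, selected coefficient vector)
```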

B = 4000 bootstrap replications $\hat\mu^*$ were computed (again many more than were actually needed): bootstrap samples y* were drawn using the full OLS model,

$y^* \sim N_{39}\big(\hat\mu_{\mathrm{OLS}}, I\big);$ (5.10)

see Remark E. The equivalent of Table 4, now based on data y*, was calculated, the $R^2_{\mathrm{adj}}$ maximizer $\hat m^*$ and $\hat\alpha^*_{\hat m^*}$ selected, giving

$\hat\mu^* = X\hat\alpha^*_{\hat m^*}.$ (5.11)

Averaging the 4000 $\hat\mu^*$ vectors yielded the smoothed vector estimate

$\tilde\mu = \sum_{i=1}^{B}\hat\mu^*_i\big/B.$ (5.12)

Standard deviations $\widetilde{sd}_{Bj}$ for supernova j's smoothed estimate $\tilde\mu_j$ were then calculated according to (4.15), j = 1, 2, …, 39. The ratio of standard deviations $\widetilde{sd}_B/\widehat{sd}_B$ for each of the 39 supernovas ranged from 0.87 to 0.98, with an average of 0.93. Jackknife calculations (3.11) showed that B = 800 would have been enough for good accuracy.

At this point it pays to remember that $\widetilde{sd}_B$ is a delta-method shortcut version of a full bootstrap standard deviation for the smoothed estimator s(y). We would prefer the latter if not for the computational burden of a second level of bootstrapping. As a check, a full second-level simulation was run, beginning with simulated data vectors $y^* \sim N_{39}(\hat\mu_{\mathrm{OLS}}, I)$ (5.10), and for each y* carrying through calculations of s* and $\widetilde{sd}_B$ based on B = 1000 second-level bootstraps. This was done 500 times, yielding 500 values $s^*_k$ for each of the 39 supernovas, which provided direct bootstrap estimates, say $sd^*_k$, for $s_k$. The $sd^*_k$ values averaged about 7.5% larger than the delta-method approximations $\widetilde{sd}_{Bk}$. Taking this into account, the reductions in standard deviation due to smoothing were actually quite small, the ratios averaging about 98%; see the end of Remark H.

Returning to the original calculations, model selection was highly variable among the 4000 bootstrap replications. Table 5 shows the percentage of the 4000 replications that selected m non-zero coefficients for α^ in (5.11), m = 1, 2,…, 10, with the original choice m = 7 not quite being modal. Several of the supernovas showed effects like that in Figure 3.

Table 5.

Percentage of the 4000 bootstrap replications selecting m non-zero coefficients for α^ in (5.11), m = 1, 2,…,10. The original choice m = 7 is not quite modal.

m 1 2 3 4 5 6 7 8 9 10
% 0 1 8 13 16 18 18 14 9 2

Model averaging, that is bootstrap smoothing, still has important confidence interval effects even though here it does not substantially reduce standard deviations. This is shown in Figure 7 of the next section, which displays approximate 95% confidence intervals for the 39 supernova magnitudes.

Figure 7.

Approximate 95% confidence limits for the 39 supernova magnitudes $\mu_k$ (after subtraction of smoothed estimates $\tilde\mu_k$ (5.12)); ABC intervals (solid) compared with smoothed standard intervals $\tilde\mu_k \pm 1.96\,\widetilde{sd}_k$ (dashed). Crosses indicate differences between unsmoothed and smoothed estimates, (5.9) minus (5.12).

Other approaches to bootstrapping Lasso estimates are possible. Chatterjee and Lahiri (2011), referring back to work by Knight and Fu (2000), resample regression residuals rather than using the full parametric bootstrap (5.10). The "m out of n" bootstrap is featured in Hall, Lee and Park (2009). Asymptotic performance, mostly absent here, is a central concern of these papers; also, they focus on estimation of the regression coefficients α in (5.3), a more difficult task than estimating μ = Xα.

6 Better bootstrap confidence intervals

The central tactic of this paper is the use of bootstrap smoothing to convert an erratically behaved model selection-based estimator t(⋅) into a smoothly varying version s(⋅). Smoothing makes the good asymptotic properties of the bootstrap, as extensively developed in Hall (1992), more credible for actual applications. This section carries the smoothing theme further, showing how s(⋅) can be used to form second-order accurate intervals.

The improved confidence intervals depend on the properties of bootstrap samples from exponential families (4.1). We define an "empirical exponential family" $\hat f_\alpha(\cdot)$ that puts probability

$\hat f_\alpha\big(\hat\beta^*_i\big) = e^{(\alpha - \hat\alpha)'\hat\beta^*_i - \hat\psi(\alpha)}\,\frac{1}{B}$ (6.1)

on bootstrap replication $\hat\beta^*_i$ (4.3) for i = 1, 2, …, B, where

$\hat\psi(\alpha) = \log\Big(\sum_{i=1}^{B} e^{(\alpha - \hat\alpha)'\hat\beta^*_i}\Big/B\Big).$ (6.2)

Here $\hat\alpha$ is the MLE of α in the original family (4.1), $\hat\alpha = \lambda^{-1}(\hat\beta)$ in the notation following (4.2).

The choice $\alpha = \hat\alpha$ makes $\hat f_{\hat\alpha}(\hat\beta^*_i) = 1/B$ for i = 1, 2, …, B; in other words, it yields the empirical probability distribution of the bootstrap sample (4.3) in $\mathcal{R}^p$. Other choices of α "tilt" the empirical distribution in the direction $\alpha - \hat\alpha$; (6.1) is a direct analogue of the original exponential family (4.1), which can be re-expressed as

$f_\alpha(\hat\beta) = e^{(\alpha - \hat\alpha)'\hat\beta - (\psi(\alpha) - \psi(\hat\alpha))}\, f_{\hat\alpha}(\hat\beta),$ (6.3)

now with $\hat\alpha$ fixed and $\hat\beta$ the random variable. Notice that $\hat\psi(\hat\alpha) = 0$ in (6.2). Taking this into account, the only difference between the original family (6.3) and the empirical family (6.1) is the change in support, from $f_{\hat\alpha}(\cdot)$ to the empirical probability distribution of the bootstrap sample. Under mild regularity conditions, the family $\hat f_\alpha(\cdot)$ approaches $f_\alpha(\cdot)$ as the bootstrap sample size B goes to infinity.

As in (4.23)–(4.24), let $s_\alpha$ be the value of the smoothed statistic we would get if bootstrap samples were obtained from $f_\alpha$ rather than $f_{\hat\alpha}$. We can estimate $s_\alpha$ from the original bootstrap samples (4.3) by importance sampling in family (4.1),

$\hat s_\alpha = \sum_{i=1}^{B} e^{(\alpha - \hat\alpha)'\hat\beta^*_i}\, t^*_i\Big/\sum_{i=1}^{B} e^{(\alpha - \hat\alpha)'\hat\beta^*_i} = \sum_{i=1}^{B}\hat f_\alpha\big(\hat\beta^*_i\big)\, t^*_i,$ (6.4)

without requiring any further evaluations of t(⋅). (Note that $\hat f_\alpha(\hat\beta^*_i)$ is proportional to $w_i$ in (4.24).) The main point here is that the smoothed estimate $\hat s_\alpha$ is the expectation of the values $t^*_i$, $i = 1, 2, \ldots, B$, taken with respect to the empirical exponential family (6.1).
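Computationally, (6.4) is just a re-weighting of the stored replications. A hedged sketch follows; the log-weight stabilization is an implementation detail and the names are illustrative, not the paper's code.

```python
import numpy as np

def tilted_smooth(alpha, alpha_hat, beta_star, t):
    """Importance-sampling estimate of s_alpha, formula (6.4).
    alpha, alpha_hat -- length-p parameter vectors
    beta_star        -- (B, p) bootstrap sufficient statistics beta*_i
    t                -- length-B replications t*_i"""
    logw = beta_star @ (alpha - alpha_hat)     # (alpha - alpha_hat)' beta*_i
    logw -= logw.max()                         # stabilize; the shift cancels in the ratio
    w = np.exp(logw)
    return np.sum(w * t) / np.sum(w)
```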

A system of approximate confidence intervals enjoys second-order accuracy if its coverage probabilities approach the target value with errors of order 1/n in the sample size n, rather than at the slower rate $1/\sqrt{n}$ of the standard intervals. The ABC system ("approximate bootstrap confidence" intervals, DiCiccio and Efron, 1992, not to be confused with "approximate Bayesian computation" as in Fearnhead and Prangle, 2012) employs numerical derivatives to produce second-order accurate intervals in exponential families. Its original purpose was to eliminate the need for bootstrap resampling. Here, though, we will apply it to the smoothed statistic $s(\hat\beta) = \sum t(\hat\beta^*_i)/B$ (4.5) in order to avoid a second level of bootstrapping. This is a legitimate use of ABC because we are working in an exponential family, albeit the empirical family (6.1).

Three corrections are needed to improve the smoothed standard interval (2.11) from first- to second-order accuracy: a non-normality correction obtained from the bootstrap distribution, an acceleration correction of the type mentioned at (3.22), and a bias correction. ABC carries these out via p + 2 numerical second derivatives of $\hat s_\alpha$ in (6.4), taken at $\alpha = \hat\alpha$, as detailed in Section 2 of DiCiccio and Efron (1992). The computational burden is effectively nil compared with the original bootstrap calculations (4.3).

Figure 7 compares the ABC 95% limits for the supernova brightnesses $\mu_k$, k = 1, 2, …, 39, solid lines, with parametric smoothed standard intervals (2.11), dashed lines. (The smoothed estimates $\tilde\mu_k$ (5.12) have been subtracted from the endpoints in order to put all the intervals on the same display.) There are a few noticeable discrepancies, for supernovas 2, 6, 25, and 27 in particular, but overall the smoothed standard intervals hold up reasonably well.

Smoothing has a moderate effect on the Supernova estimates, as indicated by the values of $\hat\mu_k - \tilde\mu_k$, (5.9) minus (5.12), the crosses in Figure 7. A few of the intervals would be much different if based on the unsmoothed estimates $\hat\mu_k$, e.g., supernovas 1, 12, 17, and 28. Remark I says more about the ABC calculations.

As a check on the ABC intervals, the "full simulation" near the end of Section 5, with B = 1000 bootstrap replications for each of 500 trials, was repeated. For each trial, the 1000 bootstraps provided new ABC calculations, from which the "achieved significance level" $\mathrm{asl}_k$ of the original smoothed estimate $\tilde\mu_k$ (5.12) was computed: that is,

$\mathrm{asl}_k = \text{bootstrap ABC confidence level for } \big(-\infty, \tilde\mu_k\big].$ (6.5)

If the ABC construction were working perfectly, $\mathrm{asl}_k$ would have a uniform distribution,

$\mathrm{asl}_k \sim U(0, 1)$ (6.6)

for k = 1, 2, …, 39.

Table 6 displays quantiles of $\mathrm{asl}_k$ in the 500 trials, for seven of the 39 supernovas, k = 5, 10, 15, 20, 25, 30, and 35. The results are not perfectly uniform, showing for instance a moderate deficiency of small $\mathrm{asl}_k$ values for k = 5, but overall the results are encouraging. A U(0, 1) random variable has mean 0.500 and standard deviation 0.289, while all 3500 $\mathrm{asl}_k$ values in Table 6 had mean 0.504 and standard deviation 0.284.

Table 6.

Simulation check for ABC intervals; 500 trials, each with B = 1000 bootstrap replications. Columns show quantiles of achieved significance levels aslk (6.5) for supernovas k = 5, 10,…,35; last column for all seven supernovas combined. It is a reasonable match to the ideal uniform distribution (6.6).

quantile SN5 SN10 SN15 SN20 SN25 SN30 SN35 ALL
0.025 0.04 0.02 0.04 0.00 0.04 0.03 0.02 0.025
0.05 0.08 0.04 0.06 0.04 0.08 0.06 0.06 0.055
0.1 0.13 0.08 0.11 0.10 0.12 0.10 0.12 0.105
0.16 0.20 0.17 0.18 0.16 0.18 0.18 0.18 0.175
0.5 0.55 0.50 0.54 0.48 0.50 0.48 0.50 0.505
0.84 0.84 0.82 0.82 0.84 0.84 0.84 0.84 0.835
0.9 0.90 0.88 0.90 0.88 0.90 0.90 0.90 0.895
0.95 0.96 0.94 0.96 0.94 0.94 0.94 0.94 0.945
0.975 0.98 0.97 0.98 0.98 0.96 0.98 0.97 0.975

The ABC computations are local, in the sense that the importance sampling estimates $\hat s_\alpha$ in (6.4) need only be evaluated for α very near $\hat\alpha$. This avoids the familiar peril of importance sampling, that the sampling weights in (6.4) or (4.24) may vary uncontrollably in size.

If one is willing to ignore the peril, full bootstrap standard errors for the smoothed estimates $\tilde\mu$ (4.5), rather than the delta-method estimates of Theorem 2, become feasible: in addition to the original parametric bootstrap samples (4.3), we draw J more times, say

$f_{\hat\alpha}(\cdot) \to \beta_1, \beta_2, \ldots, \beta_J,$ (6.7)

and compute the corresponding natural parameter estimates $\alpha_j = \lambda^{-1}(\beta_j)$, as following (4.2). Each $\alpha_j$ gives a bootstrap version of the smoothed statistic $\hat s_{\alpha_j}$, using (6.4), from which we calculate the usual bootstrap standard error estimate,

$sd_{\mathrm{boot}} = \Big[\sum_{j=1}^{J}\big(\hat s_{\alpha_j} - \hat s_\cdot\big)^2\big/(J - 1)\Big]^{1/2},$ (6.8)

where $\hat s_\cdot = \sum_j \hat s_{\alpha_j}/J$. Once again, no further evaluations of t(·) beyond the original ones in (4.5) are required.

Carrying this out for the Supernova data gave standard errors sdboot a little smaller than those from Theorem 2, as opposed to the somewhat larger ones found by the full simulation near the end of Section 5. Occasional very large importance sampling weights in (6.4) did seem to be a problem here.

Compromises between the delta method and full bootstrapping are possible. For the normal model (5.3) we have $\beta_j \sim N(\hat\beta, X'X)$ in (6.7). Instead we might take

$\beta_j \sim N\big(\hat\beta,\ c\,X'X\big)$ (6.9)

with c less than 1, placing $\alpha_j$ nearer $\hat\alpha$. Then (6.8) must be multiplied by $1/\sqrt{c}$. Doing this with c = 1/9 gave standard error estimates almost the same as those from Theorem 2.

7 Remarks, details, and proofs

This section expands on points raised in the previous discussion.

A. Proof of Corollary 1

With $Y^* = (Y^*_{ij})$ as in (3.2), let $X = Y^* - 1_B 1_n' = (Y^*_{ij} - 1)$. For the ideal bootstrap, $B = n^n$,

$X'X/B = I - 1_n 1_n'/n,$ (7.1)

the multinomial covariance matrix in (3.3). This has (n − 1) non-zero eigenvalues all equaling 1, implying that the singular value decomposition of X is

$X = \sqrt{B}\, L R',$ (7.2)

L and R orthonormal matrices of dimensions B × (n − 1) and n × (n − 1). Then the B-vector $U^* = (t^*_i - s_0)$ has squared projected length into $\mathcal{L}(X)$

$U^{*\prime} L L' U^* = U^{*\prime}\,\frac{X R R' X'}{B}\,U^* = B\big(U^{*\prime}X/B\big)\big(X'U^*/B\big) = B\,\widetilde{sd}^2,$ (7.3)

verifying (3.10).

B. Hájek projection and ANOVA decomposition

For the ideal nonparametric bootstrap of Section 3, define the conditional bootstrap expectations

$e_j = E_*\big\{t(y^*_i) \mid y^*_{ik} = y_j\big\},$ (7.4)

j = 1, 2, … , n (not depending on k). The bootstrap ANOVA decomposition of Efron (1983, Sect. 7) can be used to derive an orthogonal decomposition of t(y*),

$t(y^*_i) = s_0 + L_i + R_i$ (7.5)

where $s_0 = E_*\{t(y^*)\}$ is the ideal smoothed bootstrap estimate, and

$L_i = \sum_{j=1}^{n} Y^*_{ij}\big(e_j - s_0\big),$ (7.6)

while $R_i$ involves higher-order ANOVA terms such as $e_{jl} - e_j - e_l + s_0$ with

$e_{jl} = E_*\big\{t(y^*_i) \mid y^*_{ik} = y_j \text{ and } y^*_{im} = y_l\big\}.$ (7.7)

The terms in (7.5) satisfy $E_*\{L_i\} = E_*\{R_i\} = 0$ and are orthogonal, $E_*\{L_i R_i\} = 0$. The bootstrap Hájek projection of t(y*) (Hájek, 1968) is then the first two terms of (7.5), say

$H_i = s_0 + L_i.$ (7.8)

Moreover,

$L_i = \sum_{j=1}^{n} Y^*_{ij}\,\mathrm{cov}_j$ (7.9)

from (3.5), and the ratio of smoothed-to-unsmoothed standard deviation (3.10) equals

$\widetilde{sd}_B\big/\widehat{sd}_B = \Big[\mathrm{var}_*\{L_i\}\big/\big(\mathrm{var}_*\{L_i\} + \mathrm{var}_*\{R_i\}\big)\Big]^{1/2}.$ (7.10)

C. Nonparametric bias estimate

There is a nonparametric bias estimate $\widetilde{\mathrm{bias}}_B$ for the smoothed statistic s(y) (2.8) corresponding to the variability estimate $\widetilde{sd}_B$. In terms of T(Y*) and S(p) (3.13)–(3.14), the nonparametric delta method gives

$\widetilde{\mathrm{bias}}_B = \frac{1}{2}\sum_{j=1}^{n}\frac{\ddot S_j}{n^2}$ (7.11)

where $\ddot S_j$ is the second-order influence value

$\ddot S_j = \lim_{\varepsilon \to 0}\frac{S\big(p_0 + \varepsilon(\delta_j - p_0)\big) - 2S(p_0) + S\big(p_0 - \varepsilon(\delta_j - p_0)\big)}{\varepsilon^2}.$ (7.12)

See Section 6.6 of Efron (1982).

Without going into details, the Taylor series calculation (3.18)–(3.19) can be carried out one step further, leading to the following result:

$\widetilde{\mathrm{bias}}_B = \frac{1}{2}\,\mathrm{cov}_*\big(D_i,\, t^*_i\big)$ (7.13)

where $D_i = \sum_{j=1}^{n}\big(Y^*_{ij} - 1\big)^2$, as in (3.24).

This looks like a promising extension of Theorem 1 (3.4)–(3.5). Unfortunately, (7.13) proved unstable when applied to the Cholesterol data, as revealed by jackknife calculations like (3.11). Things are better in parametric settings; see Remark I. There is also some question of what "bias" means with model selection-based estimators; see Remark G.

D. Adjusted R2

Formula (5.5) for $R^2_{\mathrm{adj}}$, not the usual definition, is motivated by OLS estimation and prediction in a homoskedastic model. We observe

$y \sim \big(\mu, \sigma^2 I\big)$ (7.14)

and estimate μ by $\hat\mu = My$, where the n × n symmetric matrix M is idempotent, M² = M. Then $\hat\sigma^2 = \|y - \hat\mu\|^2/(n - m)$, m the rank of M, is the usual unbiased estimate of σ². Letting y° indicate an independent new copy of y, the expected prediction error of $\hat\mu$ is

$E\big\{\|y^\circ - \hat\mu\|^2\big\} = E\big\{\|y - \hat\mu\|^2 + 2m\hat\sigma^2\big\}$ (7.15)

as in (2.6). Finally, the usual definition of R²,

$R^2 = 1 - \|y - \hat\mu\|^2\big/\|y - \bar y 1\|^2,$ (7.16)

is adjusted by adding the penalty amount $2m\hat\sigma^2$ suggested in (7.15),

$R^2_{\mathrm{adj}} = 1 - \big\{\|y - \hat\mu\|^2 + 2m\hat\sigma^2\big\}\big/\|y - \bar y 1\|^2,$ (7.17)

and this reduces to (5.5).
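Spelling out the last step: substituting $\hat\sigma^2 = \|y - \hat\mu\|^2/(n - m)$ into (7.17) and using (7.16),

$R^2_{\mathrm{adj}} = R^2 - \frac{2m\,\hat\sigma^2}{\|y - \bar y 1\|^2} = R^2 - \frac{2m}{n - m}\cdot\frac{\|y - \hat\mu\|^2}{\|y - \bar y 1\|^2} = R^2 - 2\big(1 - R^2\big)\frac{m}{n - m},$

which is (5.5).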

E. Full-model bootstrapping

The bootstrap replications (5.10) are drawn from the full model, $y^* \sim N_{39}(\hat\mu_{\mathrm{OLS}}, I)$, rather than, say, the smoothed Lasso choice (5.12), $y^* \sim N_{39}(\tilde\mu, I)$. This follows the general development in Section 4, (4.3), and, less obviously, the theory of Sections 2 and 3, where the "full model" is the usual nonparametric one (2.3).

An elementary example, based on Section 10.6 of Hjort and Claeskens (2003), illustrates the dangers of bootstrapping from other than the full model. We observe $y \sim N(\mu, 1)$, with MLE $\hat\mu = t(y) = y$, and consider estimating μ with the shrunken estimator $\tilde\mu = s(y) = cy$, where c is a fixed constant 0 < c < 1, so

$\tilde\mu \sim N\big(c\mu, c^2\big).$ (7.18)

Full-model bootstrapping corresponds to $y^* \sim N(\hat\mu, 1)$, and yields $\tilde\mu^* = cy^* \sim N(c\hat\mu, c^2)$ as the bootstrap distribution. However the "model-selected bootstrap" $y^* \sim N(\tilde\mu, 1)$ yields

$\tilde\mu^* \sim N\big(c^2\hat\mu, c^2\big),$ (7.19)

squaring the amount of shrinkage in (7.18).

Returning to the Supernova example, the Lasso is itself a shrinkage technique. Bootstrapping from the smoothed Lasso choice $\tilde\mu$ would shrink twice, perhaps setting many more of the coordinate estimates to zero.

F. Bias of the smoothed estimate

There is a simple asymptotic expression for the bias of the bootstrap smoothed estimator in exponential families, following DiCiccio and Efron (1992). The schematic diagram of Figure 8 shows the main elements: the observed vector y, expectation μ, generates the bootstrap distribution of y*, indicated by the dashed ellipses. A parameter of interest θ = t(μ) has MLE θ^ = t(y). Isoplaths of constant value for t(⋅) are indicated by the solid curves in Figure 8.

Figure 8.

Schematic diagram of large-sample bootstrap estimation. Observed vector y has expectation μ. Ellipses indicate bootstrap distribution of y* given $\hat\mu = y$. Parameter of interest θ = t(μ) is estimated by $\hat\theta$ = t(y). Solid curves indicate surfaces of constant value of t(⋅).

The asymptotic mean and variance of the MLE θ^ = t(y) as sample size n grows large is of the form

$\hat\theta \sim \Big(\theta + \frac{b(\mu)}{n},\ \frac{c^2(\mu)}{n}\Big) + O_p\big(n^{-3/2}\big).$ (7.20)

Here the bias b(μ)/n is determined by the curvature of the level surfaces near μ. Then it is not difficult to show that the ideal smoothed bootstrap estimate $\tilde\theta = \sum t(y^*_i)/B$, B → ∞, has mean and variance

$\tilde\theta \sim \Big(\theta + \frac{2b(\mu)}{n},\ \frac{c^2(\mu)}{n}\Big) + O_p\big(n^{-3/2}\big).$ (7.21)

So smoothing doubles the bias without changing variance. This just says that smoothing cannot improve on the MLE θ^ in the already smooth asymptotic estimation context of Figure 8.

G. Two types of bias

The term b(μ)/n in (7.20) represents "statistical bias," the difference between the expected value of $t(\hat\mu)$ and t(μ). Model-selection estimators also involve "definitional bias": we wish to estimate θ = T(μ), but for reasons of robustness or efficiency we employ a different functional $\hat\theta$ = t(y), a homely example being the use of a trimmed mean to estimate an expectation. The ABC bias correction mentioned in Section 6 is correcting the smoothed standard interval $\tilde\mu \pm 1.96\,\widetilde{sd}_B$ for statistical bias. Definitional bias can be estimated by t(y) − T(y), but this is usually too noisy to be of help. Section 2 of Berk et al. (2012) makes this point nicely (see their discussion of "target estimation") and I have followed their lead in not trying to account for definitional bias. See also Bühlmann and Yu (2002), Definition 1.2, for an asymptotic statement of what is being estimated by a model-selection procedure.

H. Selection-based estimation

The introduction of model selection into the estimation process disrupts the smooth properties seen in Figure 8. The wedge-shaped regions of Figure 9 indicate different model choices, e.g., linear, quadratic, cubic, etc. regressions for the Cholesterol data. Now the surfaces of constant estimation jump discontinuously as y crosses regional boundaries. Asymptotic properties such as (7.20)–(7.21) are less convincing when the local geometry near the observed y can change abruptly a short distance away.

Figure 9.

Estimation after model selection. The regions indicate different model choices. Now the curves of constant estimation jump discontinuously as y crosses regional boundaries.

The bootstrap ellipses in Figure 9 are at least qualitatively correct for the Cholesterol and Supernova examples, since in both cases a wide bootstrap variety of regions were selected. In this paper, the main purpose of bootstrap smoothing is to put us back into Figure 8, where for example the standard intervals (2.11) are more believable. (Note: Lasso estimates are continuous, though non-differentiable, across region boundaries, giving a picture somewhere between Figure 8 and Figure 9. This might help explain the smooth estimators’ relatively modest reductions in standard error for the Supernova analysis.)

Bagging amounts to replacing the discontinuous isoplaths of θ = t(μ) with smooth ones, say for θbag = s(μ). The standard deviations and approximate confidence intervals of this paper apply to θbag, ignoring the possible definitional bias.

I. The ABC intervals

The approximate bootstrap confidence limits in Figure 7 were obtained using the ABCq algorithm, as explained in detail in Section 2 of DiCiccio and Efron (1992). In addition to the acceleration a and bias-correction constant $z_0$, ABCq also calculates $c_q$: in a one-parameter exponential family (4.1), $c_q$ measures the nonlinearity of the parameter of interest θ = t(β) as a function of β, with a similar definition applying in p dimensions. The algorithm involves the calculation of p + 2 numerical second derivatives of $\hat s_\alpha$ (6.4) carried out at $\alpha = \hat\alpha$. Besides a, $z_0$, and $c_q$, ABCq provides an estimate of statistical bias for $\hat s_\alpha$.

If (a, $z_0$, $c_q$) = (0, 0, 0) then the ABCq intervals match the smoothed standard intervals (2.11). Otherwise, corrections are made in order to achieve second-order accuracy. For instance (a, $z_0$, $c_q$) = (0, −0.1, 0) shifts the standard intervals leftwards by $0.1\,\hat\sigma$. For all three constants, values outside of ±0.1 can produce noticeable changes to the intervals.

Table 7 presents summary statistics of a, $z_0$, $c_q$, and bias for the 39 smoothed Supernova estimates $\tilde\mu_k$. The differences between the ABCq and smoothed standard intervals seen in Figure 7 were primarily due to $z_0$.

Table 7.

Summary statistics of the ABCq constants for the 39 smoothed Supernova estimates μk (5.12).

a z0 cq bias

mean      .00    .00    .00    .00
stdev     .01    .13    .04    .06
lowest   −.01   −.21   −.07   −.14
highest   .01    .27    .09    .12

J. Bias correction for sdB

The nonparametric standard deviation estimate $\widetilde{sd}_B$ (3.6) is biased upward for the ideal value $\widetilde{sd}$ (3.4), but it is easy to make a correction. Using notation (3.3)–(3.9), define

$Z_{ij} = \big(Y^*_{ij} - 1\big)\big(t^*_i - s_0\big).$ (7.22)

Then $Z_{ij}$ has bootstrap mean $\mathrm{cov}_j$ (3.5) and bootstrap variance, say $\Delta_j^2$. A sample of B bootstrap replications yields bootstrap moments

$\widehat{\mathrm{cov}}_j = \frac{1}{B}\sum_{i=1}^{B} Z_{ij} \sim \Big(\mathrm{cov}_j,\ \Delta_j^2/B\Big),$ (7.23)

so

$E_*\big\{\widetilde{sd}_B^2\big\} = \widetilde{sd}^2 + \frac{1}{B}\sum_{j=1}^{n}\Delta_j^2.$ (7.24)

Therefore the bias-corrected version of $\widetilde{sd}_B^2$ is

$\widetilde{sd}_B^2 - \frac{1}{B^2}\sum_{j=1}^{n}\sum_{i=1}^{B}\big(Z_{ij} - \widehat{\mathrm{cov}}_j\big)^2.$ (7.25)
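A sketch of the correction (7.22)–(7.25) in code, using $t^*_\cdot$ in place of the ideal $s_0$; names are illustrative, not the paper's implementation.

```python
import numpy as np

def smoothed_sd_corrected(counts, t):
    """Bias-corrected smoothed-estimate sd, following (7.22)-(7.25).
    counts -- (B, n) bootstrap counts Y*_ij; t -- length-B replications t*_i"""
    B = len(t)
    Z = (counts - 1.0) * (t - t.mean())[:, None]     # Z_ij (7.22), with t*_. standing in for s_0
    cov_hat = Z.mean(axis=0)                         # (7.23)
    sd2 = np.sum(cov_hat ** 2)                       # sd~_B^2 as in (3.6)
    correction = np.sum((Z - cov_hat) ** 2) / B**2   # (1/B^2) sum_ij (Z_ij - cov^_j)^2
    return np.sqrt(max(sd2 - correction, 0.0))       # square root of (7.25), floored at zero
```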

K. Improved estimates of the bagged standard errors

The simulation experiment of Figure 5 can also be regarded as a two-level parametric bootstrap procedure, with the goal of better estimating $sd(\tilde\mu_k)$, the bagged estimates' standard deviations for subjects k = 1, 2, …, 11 in the Cholesterol study. Two possible estimates are shown: (1) the empirical standard deviation $\widetilde{Sd}$ (4.22), solid curve, and (2) the average $\widetilde{sd}_\cdot$ of the 100 second-level $\widetilde{sd}_i$ values (4.15), dashed curve. There are two reasons to prefer the latter.

The first has to do with the sampling error of the standard deviation estimates themselves. This was about 10 times larger for $\widetilde{Sd}$ than $\widetilde{sd}_\cdot$, e.g., 5.45 ± 0.35 compared to 5.84 ± 0.03 for subject 1. (Note: The two curves in Figure 5 do not differ significantly at any point.)

The second and more important reason has to do with the volatility of model-selection estimates and their standard errors. Let σ(β) denote the standard deviation of a bagged estimator $\tilde\mu$ in a parametric model such as (4.16)–(4.17). The unknown true parameter $\beta_0$ has yielded the observed value $\hat\beta$, and then bootstrap values $\hat\beta^*_i$, i = 1, 2, …, 100, and second-level bootstraps $\hat\beta^{**}_{ij}$, j = 1, 2, …, 1000. The estimate $\widetilde{sd}_{100}$ obtained from the $\hat\beta^*_i$'s via (4.15) is a good approximation to $\sigma(\hat\beta)$. The trouble is that the functional σ(β) is itself volatile, so that $\sigma(\hat\beta)$ may differ considerably from the "truth" $\sigma(\beta_0)$.

This can be seen at the second level in Figure 5, where the dashes indicating $\widetilde{sd}_i$ values, i = 1, 2, …, 100, vary considerably. (This is not due to the limitations of using B = 1000 replications; the bootstrap "internal variance" component accounts for only about 30% of the spread.) Broadly speaking, $\hat\beta^*_i$ values that fall close to a regime boundary, say separating the choice of "Cubic" from "Quartic," had larger values of $\sigma(\hat\beta^*_i) \doteq \widetilde{sd}_i$.

The preferred estimate $\widetilde{sd}_\cdot$ effectively averages $\sigma(\hat\beta^*_i)$ over the parametric bootstrap choice of $\hat\beta^*_i$ given $\hat\beta$. Another way to say this is that $\widetilde{sd}_\cdot$ is a flat-prior Bayesian estimate of $\sigma(\beta_0)$, given the data $\hat\beta$. See Efron (2012).

Of course $\widetilde{sd}_\cdot$ requires much more computation than $\widetilde{sd}_B$ (4.15). Our 100 × 1000 analysis could be reduced to 50 × 500 without bad effect, but that is still 25,000 resamples. In fact, $\widetilde{sd}_\cdot$ was not much different from $\widetilde{sd}_B$ in this example. The difference was larger in the nonparametric version of Figure 5, which showed substantially greater bias and variability, making the second level of bootstrapping more worthwhile.

References

1. Berk R, Brown L, Buja A, Zhang K, Zhao L. Valid post-selection inference. Ann Statist. 2012; submitted. http://stat.wharton.upenn.edu/~buja/PoSI.pdf
2. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–140.
3. Buckland ST, Burnham KP, Augustin NH. Model selection: An integral part of inference. Biometrics. 1997;53:603–618.
4. Bühlmann P, Yu B. Analyzing bagging. Ann Statist. 2002;30:927–961.
5. Buja A, Stuetzle W. Observations on bagging. Statist Sinica. 2006;16:323–351.
6. Chatterjee A, Lahiri SN. Bootstrapping lasso estimators. J Amer Statist Assoc. 2011;106:608–625.
7. DiCiccio T, Efron B. More accurate confidence intervals in exponential families. Biometrika. 1992;79:231–245.
8. Efron B. Bootstrap methods: Another look at the jackknife. Ann Statist. 1979;7:1–26.
9. Efron B. The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 38. Philadelphia: Society for Industrial and Applied Mathematics (SIAM); 1982.
10. Efron B. Estimating the error rate of a prediction rule: Improvement on cross-validation. J Amer Statist Assoc. 1983;78:316–331.
11. Efron B. Better bootstrap confidence intervals. J Amer Statist Assoc. 1987;82:171–200 (with comments and a rejoinder by the author).
12. Efron B. Jackknife-after-bootstrap standard errors and influence functions. J Roy Statist Soc Ser B. 1992;54:83–127.
13. Efron B. Bayesian inference and the parametric bootstrap. Ann Appl Statist. 2012;6:1971–1997. doi:10.1214/12-AOAS571
14. Efron B, Feldman D. Compliance as an explanatory variable in clinical trials. J Amer Statist Assoc. 1991;86:9–17.
15. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32:407–499 (with discussion and a rejoinder by the authors).
16. Efron B, Tibshirani R. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability, Vol. 57. New York: Chapman and Hall; 1993.
17. Efron B, Tibshirani R. Using specially designed exponential families for density estimation. Ann Statist. 1996;24:2431–2461.
18. Fearnhead P, Prangle D. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J Roy Statist Soc Ser B. 2012;74:419–474.
19. Hájek J. Asymptotic normality of simple linear rank statistics under alternatives. Ann Math Statist. 1968;39:325–346.
20. Hall P. The Bootstrap and Edgeworth Expansion. Springer Series in Statistics. New York: Springer-Verlag; 1992.
21. Hall P, Lee ER, Park BU. Bootstrap-based penalty choice for the lasso, achieving oracle performance. Statist Sinica. 2009;19:449–471.
22. Hjort NL, Claeskens G. Frequentist model average estimators. J Amer Statist Assoc. 2003;98:879–899.
23. Hurvich CM, Tsai C-L. Model selection for least absolute deviations regression in small samples. Statist Probab Lett. 1990;9:259–265.
24. Knight K, Fu W. Asymptotics for lasso-type estimators. Ann Statist. 2000;28:1356–1378.
25. Mallows CL. Some comments on Cp. Technometrics. 1973;15:661–675.
26. Perlmutter S, Aldering G, Goldhaber G, Knop R, Nugent P, Castro P, Deustua S, Fabbro S, Goobar A, Groom D, Hook I, Kim A, Kim M, Lee J, Nunes N, Pain R, Pennypacker C, Quimby R, Lidman C, Ellis R, Irwin M, McMahon R, Ruiz-Lapuente P, Walton N, Schaefer B, Boyle B, Filippenko A, Matheson T, Fruchter A, Panagia N, Newberg H, Couch W. Measurements of omega and lambda from 42 high-redshift supernovae. Astrophys J. 1999;517:565–586.
27. Riess A, Filippenko A, Challis P, Clocchiatti A, Diercks A, Garnavich P, Gilliland R, Hogan C, Jha S, Kirshner R, Leibundgut B, Phillips M, Reiss D, Schmidt B, Schommer R, Smith R, Spyromilio J, Stubbs C, Suntzeff N, Tonry J. Observational evidence from supernovae for an accelerating universe and a cosmological constant. Astron J. 1998;116:1009–1038.
28. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Statist Soc Ser B. 1996;58:267–288.
