Published in final edited form as: J Am Stat Assoc. 2013 Jul 25;109(507):991–1007. doi: 10.1080/01621459.2013.823775

Estimation and Accuracy after Model Selection

Bradley Efron 1,*

Abstract

Classical statistical theory ignores model selection in assessing estimation accuracy. Here we consider bootstrap methods for computing standard errors and confidence intervals that take model selection into account. The methodology involves bagging, also known as bootstrap smoothing, to tame the erratic discontinuities of selection-based estimators. A useful new formula for the accuracy of bagging then provides standard errors for the smoothed estimators. Two examples, nonparametric and parametric, are carried through in detail: a regression model where the choice of degree (linear, quadratic, cubic, …) is determined by the Cp criterion, and a Lasso-based estimation problem.

Keywords: model averaging, Cp, Lasso, bagging, bootstrap smoothing, ABC intervals, importance sampling

1 Introduction

Accuracy assessments of statistical estimators customarily are made ignoring model selection. A preliminary look at the data might, for example, suggest a cubic regression model, after which the fitted curve’s accuracy is computed as if “cubic” were pre-chosen. Here we will discuss bootstrap standard errors and approximate confidence intervals that take into account the model-selection procedure.

Figure 1 concerns the Cholesterol data, an example investigated in more detail in Section 2: n = 164 men took cholestyramine, a proposed cholesterol-lowering drug, for an average of seven years each; the response variable was the decrease in blood-level cholesterol measured from the beginning to the end of the trial,

Figure 1. Cholesterol data.

Cholesterol decrease plotted versus adjusted compliance for 164 men in the Treatment arm of the cholestyramine study (Efron and Feldman, 1991). Solid curve is the OLS cubic regression, as selected by the Cp criterion. How accurate is the curve, taking account of model selection as well as least squares fitting? (Solid arrowed point is Subject 1, featured in subsequent calculations. Bottom numbers indicate compliance for the 11 subjects in the simulation trial of Figure 5.)

d=cholesterol decrease; (1.1)

also measured (by pill counts) was compliance, the proportion of the intended dose taken,

c=compliance, (1.2)

ranging from zero to full compliance for the 164 men. A transformation of the observed proportions has been made here so that the 164 c values approximate a standard normal distribution,

$c \,\dot\sim\, N(0, 1)$. (1.3)

The solid curve is a regression estimate of decrease d as a cubic function of compliance c, fit by ordinary least squares (OLS) to the 164 points. “Cubic” was selected by the Cp criterion, Mallows (1973), as described in Section 2. The question of interest for us is how accurate is the fitted curve, taking account of the Cp model-selection procedure as well as OLS estimation?

More specifically, let μj be the expectation of cholesterol decrease for subject j given his compliance cj,

$\mu_j = E\{d_j \mid c_j\}$. (1.4)

We wish to assign standard errors to estimates of $\mu_j$ read from the regression curve in Figure 1. A nonparametric bootstrap estimate $\widetilde{sd}_j$ of standard deviation, taking account of model selection, is developed in Sections 2 and 3. Figure 2 shows that this is usually, but not always, greater than the naive estimate $\overline{sd}_j$ obtained from standard OLS calculations, assuming that the cubic model was pre-selected. The ratio $\widetilde{sd}_j/\overline{sd}_j$ has median value 1.52; so at least in this case, ignoring model selection can be deceptively optimistic.

Figure 2.

Solid points: ratio of standard deviations, taking account of model selection or not, for the 164 values $\hat\mu_j$ from the regression curve in Figure 1. Median ratio equals 1.52. Standard deviations including model selection are the smoothed bootstrap estimates $\widetilde{sd}_B$ of Section 3. Dashed line: ratio of $\widetilde{sd}_B$ to $\widehat{sd}_B$, the unsmoothed bootstrap sd estimates as in (2.4), median 0.91.

Data-based model selection can produce “jumpy” estimates that change values discontinuously at the boundaries between model regimes. Bagging (Breiman, 1996), or bootstrap smoothing, is a model-averaging device that both reduces variability and eliminates discontinuities. This is described in Section 2, and illustrated on the Cholesterol data.

Our key result is a new formula for the delta-method standard deviation of a bagged estimator. The result, which applies to general bagging situations and not just regression problems, is described in Section 3. Stated in projection terms (see Figure 4), it provides the statistician a direct assessment of the cost in reduced accuracy due to model selection.

Figure 4.

Illustration of Corollary 1. The ratio $\widetilde{sd}_B/\widehat{sd}_B$ is the cosine of the angle between $t^* - s_0 1$ (3.9) and the linear space $\mathcal{L}(Y^*)$ spanned by the centered bootstrap counts (3.2). Model-selection estimators tend to be more nonlinear, yielding smaller ratios, i.e., greater gains from smoothing.

A parametric bootstrap version of the smoothing theory is described in Sections 4 and 5. Parametric modeling allows more refined results, permitting second-order accurate confidence calculations of the BCa or ABC type, as in DiCiccio and Efron (1992); see Section 6. Section 7 concludes with notes, details, and deferred proofs.

Bagging (Breiman, 1996) has become a major technology in the prediction literature, an excellent recent reference being Buja and Stuetzle (2006). The point of view here agrees with that in Bühlmann and Yu (2002), though their emphasis is more theoretical and less data-analytic. They employ bagging to “change hard thresholding estimators to soft thresholding,” in the same spirit as our Section 2.

Berk, Brown, Buja, Zhang and Zhao (2012) develop conservative normal-theory confidence intervals that are guaranteed to cover the true parameter value regardless of the preceding model-selection procedure. Very often it may be difficult to say just what selection procedure was used, in which case the conservative intervals are appropriate. The methods of this paper assume that the model-selection procedure is known, yielding smaller standard error estimates and shorter confidence intervals.

Hjort and Claeskens (2003) construct an ambitious large-sample theory of frequentist model-selection estimation and model averaging, while making comparisons with Bayesian methods. In theory, the Bayesian approach offers an ideal solution to model-selection problems, but, as Hjort and Claeskens point out, it requires an intimidating amount of prior knowledge from the statistician. The present article is frequentist in its methodology.

Hurvich and Tsai (1990) provide a nice discussion of what “frequentist” might mean in a model-selection framework. (Here I am following their “overall” interpretation.) The nonparametric bootstrap approach in Buckland, Burnham and Augustin (1997) has a similar flavor to the computations in Section 2.

Classical estimation theory ignored model selection out of necessity. Armed with modern computational equipment, statisticians can now deal with model-selection problems more realistically. The limited, but useful, goal of this paper is to provide a general tool for the assessment of standard errors in such situations. Simple parameters like (1.4) are featured in our examples, but the methods apply just as well to more complicated functionals, for instance the maximum value of a regression surface, or a tree-based estimate.

2 Nonparametric bootstrap smoothing

For the sake of simple notation, let y represent all the observed data, and $\hat\mu = t(y)$ an estimate of a parameter of interest μ. The Cholesterol data has

$y = \{(c_j, d_j),\ j = 1, 2, \ldots, n = 164\}.$ (2.1)

If $\mu = \mu_j$ (1.4) we might take $\hat\mu_j$ to be the height of the Cp-OLS regression curve measured at compliance $c = c_j$.

In a nonparametric setting we have data

$y = (y_1, y_2, \ldots, y_n)$ (2.2)

where the $y_j$ are independent and identically distributed (iid) observations from an unknown distribution F, a two-dimensional distribution in situation (2.1). The parameter is some functional $\mu = T(F)$, but the plug-in estimator $\hat\mu = T(\hat F)$, where $\hat F$ is the empirical distribution of the $y_j$ values, is usually what we hope to improve upon in model-selection situations.

A nonparametric bootstrap sample

$y^* = (y^*_1, y^*_2, \ldots, y^*_n)$ (2.3)

consists of n draws with replacement from the set $\{y_1, y_2, \ldots, y_n\}$, yielding bootstrap replication $\hat\mu^* = t(y^*)$. The empirical standard deviation of B such draws,

$\widehat{sd}_B = \Big[\sum_{i=1}^{B}\big(\hat\mu^*_i - \hat\mu^*_\cdot\big)^2\big/(B-1)\Big]^{1/2}, \qquad \Big(\hat\mu^*_\cdot = \sum_{i=1}^{B}\hat\mu^*_i\big/B\Big),$ (2.4)

is the familiar nonparametric bootstrap estimate of standard error for $\hat\mu$ (Efron, 1979); $\widehat{sd}_B$ is a dependable accuracy estimator in most standard situations but, as we will see, it is less dependable for setting approximate confidence limits in model-selection contexts.

The cubic regression curve in Figure 1 was selected using the Cp criterion. Suppose that under “Model m” we have

$y = X_m \beta_m + \varepsilon \qquad [\varepsilon \sim (0, \sigma^2 I)]$ (2.5)

where $X_m$ is a given $n \times m$ structure matrix of rank m, and ε has mean 0 and covariance σ² times the identity (σ assumed known in what follows). The Cp measure of fit for Model m is

$C_p(m) = \|y - X_m \hat\beta_m\|^2 + 2\sigma^2 m$ (2.6)

with $\hat\beta_m$ the OLS estimate of $\beta_m$; given a collection of possible choices for the structure matrix, the Cp criterion selects the one minimizing $C_p$.

Table 1 shows Cp results for the Cholesterol data. Six polynomial regression models were compared, ranging from linear (m = 2) to sixth degree (m = 7); the value σ = 22.0 was used, corresponding to the standard estimate $\hat\sigma$ obtained from the sixth degree model. The cubic model (m = 4) minimized Cp(m), leading to its selection in Figure 1.

Table 1.

Cp model selection for the Cholesterol data; measure of fit Cp(m) (2.6) for polynomial regression models of increasing degree. The cubic model minimizes Cp(m). (Value σ = 22.0 was used here and in all bootstrap replications.) Last column shows percentage each model was selected as the Cp minimizer, among B = 4000 bootstrap replications.

Regression model   m   Cp(m) − 80,000   (Bootstrap %)
Linear 2 1132 (19%)
Quadratic 3 1412 (12%)
Cubic 4 667 (34%)
Quartic 5 1591 (8%)
Quintic 6 1811 (21%)
Sextic 7 2758 (6%)
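For readers who want to reproduce the selection step, a minimal computational sketch of (2.5)–(2.6) follows. It is not from the paper: the helper names (`poly_design`, `cp_select`), the NumPy implementation, and the fixed σ = 22.0 default are illustrative assumptions. Applied to the Cholesterol data, a selector of this form should single out the cubic model, matching Table 1.

```python
import numpy as np

def poly_design(c, degree):
    """Structure matrix X_m with columns 1, c, c^2, ..., c^degree (so m = degree + 1)."""
    return np.vander(c, degree + 1, increasing=True)

def cp_select(c, d, sigma=22.0, degrees=range(1, 7)):
    """Return the polynomial degree minimizing Cp(m) = ||d - X_m b_m||^2 + 2 sigma^2 m, as in (2.6)."""
    best_degree, best_cp = None, np.inf
    for degree in degrees:
        X = poly_design(c, degree)
        beta, *_ = np.linalg.lstsq(X, d, rcond=None)    # OLS fit of beta_m
        resid = d - X @ beta
        cp = resid @ resid + 2 * sigma**2 * X.shape[1]  # m = number of columns
        if cp < best_cp:
            best_degree, best_cp = degree, cp
    return best_degree
```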

B = 4000 nonparametric bootstrap replications of the Cp-OLS regression curve — several times more than necessary, see Section 3 — were generated: starting with a bootstrap sample y* (2.3), the equivalent of Table 1 was calculated (still using σ = 22.0) and the Cp-minimizing degree m* selected, yielding the bootstrap regression curve

$\hat\mu^* = X_{m^*}\,\hat\beta^*_{m^*},$ (2.7)

where $\hat\beta^*_{m^*}$ was the OLS coefficient vector for the selected model. The last column of Table 1 shows the various bootstrap model-selection percentages: cubic was selected most often, but still only about one-third of the time.
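The bootstrap loop just described can be sketched as below, reusing the hypothetical `poly_design` and `cp_select` helpers from the previous sketch; again this is an illustrative reconstruction, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def cp_ols_bootstrap(c, d, B=4000, sigma=22.0):
    """B nonparametric bootstrap replications of the Cp-OLS curve, as in (2.3)-(2.7)."""
    n = len(d)
    mu_star = np.empty((B, n))
    counts = np.empty((B, n))                      # bootstrap counts Y*_ij of (3.2), kept for Section 3
    for i in range(B):
        idx = rng.integers(0, n, size=n)           # n draws with replacement: y* (2.3)
        deg = cp_select(c[idx], d[idx], sigma)     # Cp-minimizing degree m*
        beta, *_ = np.linalg.lstsq(poly_design(c[idx], deg), d[idx], rcond=None)
        mu_star[i] = poly_design(c, deg) @ beta    # selected curve evaluated at the original c_j (2.7)
        counts[i] = np.bincount(idx, minlength=n)
    return mu_star, counts

# The bagged curve (2.8) and the unsmoothed bootstrap sd (2.4) are then simply
# mu_star.mean(axis=0) and mu_star.std(axis=0, ddof=1).
```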

Suppose we focus attention on Subject 1, the arrowed point in Figure 1, so that the parameter of interest $\mu_1$ can be estimated by the Cp-OLS value $t(y) = \hat\mu_1$, evaluated to be 2.71. Figure 3 shows the histogram of the 4000 bootstrap replications $t(y^*) = \hat\mu^*_1$. The point estimate $\hat\mu_1 = 2.71$ is located to the right, exceeding a surprising 76% of the $\hat\mu^*_1$ values.

Figure 3.

B = 4000 bootstrap replications $\hat\mu^*_1$ of the Cp-OLS regression estimate for Subject 1. The original estimate $t(y) = \hat\mu_1$ is 2.71, exceeding 76% of the replications. Bootstrap standard deviation (2.4) equals 8.02. Triangles indicate 2.5th and 97.5th percentiles of the histogram.

Table 2 shows why. The cases where "Cubic" was selected yielded the largest bootstrap estimates $\hat\mu^*_1$. The actual dataset y fell into the cubic region, giving a correspondingly large estimate $\hat\mu_1$. Things might very well have turned out otherwise, as the bootstrap replications suggest: model selection can make an estimate "jumpy" and erratic.

Table 2.

Mean and standard deviation of $\hat\mu^*_1$ as a function of the selected model, 4000 nonparametric bootstrap replications; Cubic, Model 3, gave the largest estimates.

Model 1 2 3 4 5 6
Mean −13.69 −3.69 4.71 −1.25 −3.80 −3.56
Stdev 3.64 3.48 5.43 5.28 4.46 4.95

We can smooth $\hat\mu = t(y)$ by averaging over the bootstrap replications, defining

$\tilde\mu = s(y) = \frac{1}{B}\sum_{i=1}^{B} t(y^*_i).$ (2.8)

Bootstrap smoothing (Efron and Tibshirani, 1996), a form of model averaging, is better known as “bagging” in the prediction literature; see Breiman (1996) and Buja and Stuetzle (2006). There its variance reduction properties are emphasized. Our example will also show variance reductions, but the main interest here lies in smoothing; s(y), unlike t(y), does not jump as y crosses region boundaries, making it a more dependable vehicle for setting standard errors and confidence intervals. Suppose, for definiteness, that we are interested in setting approximate 95% bootstrap confidence limits for parameter μ. The usual “standard interval”

$\hat\mu \pm 1.96\,\widehat{sd}_B$ (2.9)

(= 2.71 ± 1.96 · 8.02 in Figure 3) inherits the dangerous jumpiness of $\hat\mu = t(y)$. The percentile interval, Section 13.3 of Efron and Tibshirani (1993),

$\big[\hat\mu^{*(.025)},\ \hat\mu^{*(.975)}\big],$ (2.10)

the 2.5th and 97.5th percentiles of the B bootstrap replications, yields more stable results. (Notice that it does not require a central point estimate such as $\hat\mu$ in (2.9).)

A third choice, of particular interest here, is the smoothed interval

$\tilde\mu \pm 1.96\,\widetilde{sd}_B$ (2.11)

where $\tilde\mu = s(y)$ is the bootstrap smoothed estimate (2.8), while $\widetilde{sd}_B$ is given by the projection formula discussed in Section 3. Interval (2.11) combines stability with reduced length.

Table 3 compares the three approximate 95% intervals for μ1. The reduction in length is dramatic here, though less so for the other 163 subjects; see Section 3.

Table 3.

Three approximate 95% bootstrap confidence intervals for μ1, the response value for Subject 1, Cholesterol data.

Interval   Limits   Length   Center point

Standard interval (2.9) (−13.0, 18.4) 31.4 2.71
Percentile interval (2.10) (−17.8, 13.5) 31.3 −2.15
Smoothed standard (2.11) (−13.3, 8.0) 21.3 −2.65
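A small sketch of how the three intervals of Table 3 might be computed for a single subject; `sd_smooth` stands for the projection-formula standard deviation of Section 3, and everything else follows (2.9)–(2.11). The function and its names are illustrative, not the paper's code.

```python
import numpy as np

def intervals_95(mu_hat, mu_star, sd_smooth):
    """Approximate 95% intervals for one parameter:
    mu_hat    -- unsmoothed Cp-OLS estimate t(y)
    mu_star   -- length-B array of bootstrap replications t(y*_i)
    sd_smooth -- smoothed-estimate sd from Section 3 (sd~_B)"""
    sd_hat = mu_star.std(ddof=1)                                           # (2.4)
    standard = (mu_hat - 1.96 * sd_hat, mu_hat + 1.96 * sd_hat)            # (2.9)
    percentile = tuple(np.percentile(mu_star, [2.5, 97.5]))                # (2.10)
    mu_tilde = mu_star.mean()                                              # (2.8)
    smoothed = (mu_tilde - 1.96 * sd_smooth, mu_tilde + 1.96 * sd_smooth)  # (2.11)
    return standard, percentile, smoothed
```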

The BCa-ABC system goes beyond (2.9)–(2.11) to produce bootstrap confidence intervals having second-order accuracy, as in DiCiccio and Efron (1992). Section 6 carries out the ABC calculations in a parametric bootstrap context.

3 Accuracy of the smoothed bootstrap estimates

The smoothed standard interval $\tilde\mu \pm 1.96\,\widetilde{sd}_B$ requires a standard deviation assessment $\widetilde{sd}_B$ for the smoothed bootstrap estimate (2.8). A brute force approach employs a second level of bootstrapping: resampling from $y^*_i$ (2.3) yields a collection of B second-level replications $y^{**}_{ij}$, from which we calculate $s^*_i = \sum_j t(y^{**}_{ij})/B$; repeating this whole process for many replications $y^*_i$ provides bootstrap values $s^*_i$, from which we calculate their empirical standard deviation.

The trouble with brute force is that it requires an enormous number of recomputations of the original statistic t(⋅). This section describes an estimate $\widetilde{sd}_B$ that uses only the original B bootstrap replications $\{t(y^*_i),\ i = 1, 2, \ldots, B\}$.

The theorem that follows will be stated in terms of the "ideal bootstrap," where B equals all $n^n$ possible choices of $y^* = (y^*_1, y^*_2, \ldots, y^*_n)$ from $\{y_1, y_2, \ldots, y_n\}$, each having probability 1/B. It will be straightforward then to adapt our results to the non-ideal bootstrap, with B = 4000 for instance.

Define

$t^*_i = t(y^*_i) \qquad [y^*_i = (y^*_{i1}, y^*_{i2}, \ldots, y^*_{ik}, \ldots, y^*_{in})],$ (3.1)

the ith bootstrap replication of the statistic of interest, and let

$Y^*_{ij} = \#\{y^*_{ik} = y_j\},$ (3.2)

the number of elements of $y^*_i$ equaling the original data point $y_j$. The vector $Y^*_i = (Y^*_{i1}, Y^*_{i2}, \ldots, Y^*_{in})$ follows a multinomial distribution with n draws on n categories each of probability 1/n, and has mean vector and covariance matrix

$Y^*_i \sim \big(1_n,\ I - 1_n 1_n'/n\big),$ (3.3)

$1_n$ the vector of n 1's and I the n × n identity matrix.

Theorem 1. The nonparametric delta-method estimate of standard deviation for the ideal smoothed bootstrap statistic $s(y) = \sum_{i=1}^{B} t(y^*_i)/B$ is

$\widetilde{sd} = \Big[\sum_{j=1}^{n} \mathrm{cov}_j^2\Big]^{1/2}$ (3.4)

where

$\mathrm{cov}_j = \mathrm{cov}_*\big(Y^*_{ij},\, t^*_i\big),$ (3.5)

the bootstrap covariance between $Y^*_{ij}$ and $t^*_i$.

(The proof appears later in this section.)

The estimate of standard deviation for s(y) in the non-ideal case is the analogue of (3.4),

$\widetilde{sd}_B = \Big[\sum_{j=1}^{n} \widehat{\mathrm{cov}}_j^2\Big]^{1/2}$ (3.6)

where

$\widehat{\mathrm{cov}}_j = \sum_{i=1}^{B}\big(Y^*_{ij} - Y^*_{\cdot j}\big)\big(t^*_i - t^*_\cdot\big)\big/B$ (3.7)

with $Y^*_{\cdot j} = \sum_{i=1}^{B} Y^*_{ij}/B$ and $t^*_\cdot = \sum_{i=1}^{B} t^*_i/B = s(y)$. Remark J concerns a bias correction for (3.6) that can be important in the non-ideal case (it wasn't in the Cholesterol example). All of these results apply generally to bagging estimators, and are not restricted to regression situations.
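In code, (3.6)–(3.7) amount to one covariance per original data point between the bootstrap counts and the replications. The sketch below assumes the count matrix and replications produced by a loop like the one in Section 2; it is an illustration, not the paper's implementation.

```python
import numpy as np

def smoothed_sd(counts, t):
    """Smoothed-estimate standard deviation sd~_B of (3.6)-(3.7).
    counts -- (B, n) matrix of bootstrap counts Y*_ij (3.2)
    t      -- length-B vector of replications t*_i"""
    B = len(t)
    cov_hat = (counts - counts.mean(axis=0)).T @ (t - t.mean()) / B   # cov^_j of (3.7)
    return np.sqrt(np.sum(cov_hat ** 2))                              # (3.6)
```

The same `cov_hat` vector also drives the acceleration estimate (3.23) below.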

Figure 2 shows that $\widetilde{sd}_B$ is less than $\widehat{sd}_B$, the bootstrap estimate of standard deviation for the unsmoothed statistic,

$\widehat{sd}_B = \Big[\sum_{i=1}^{B}\big(t^*_i - t^*_\cdot\big)^2\big/B\Big]^{1/2},$ (3.8)

for all 164 estimators $t(y) = \hat\mu_j$. This is no accident. Returning to the ideal bootstrap situation, let $\mathcal{L}(Y^*)$ be the (n − 1)-dimensional subspace of $\mathcal{R}^B$ spanned by the columns of the B × n matrix having elements $Y^*_{ij} - 1$. (Notice that $\sum_{i=1}^{B} Y^*_{ij}/B = 1$ according to (3.3).) Also define $s_0 = \sum_{i=1}^{B} t^*_i/B$, the ideal bootstrap smoothed estimate, so

$U^* \equiv t^* - s_0 1$ (3.9)

is the B-vector of mean-centered replications $t^*_i - s_0$. Note: Formula (3.6) is a close cousin of the "jackknife-after-bootstrap" method of Efron (1992), the difference being the use of the jackknife rather than our infinitesimal jackknife calculations.

Corollary 1. The ratio $\widetilde{sd}_B/\widehat{sd}_B$ is given by

$\frac{\widetilde{sd}_B}{\widehat{sd}_B} = \frac{\|\widehat U^*\|}{\|U^*\|}$ (3.10)

where $\widehat U^*$ is the projection of $U^*$ into $\mathcal{L}(Y^*)$.

(See Remark A in Section 7 for the proof. Remark B concerns the relation of Theorem 1 to the Hájek projection.)

The illustration in Figure 4 shows $\widetilde{sd}_B/\widehat{sd}_B$ as the cosine of the angle between $t^* - s_0 1$ and $\mathcal{L}(Y^*)$. The ratio is a measure of the nonlinearity of $t^*_i$ as a function of the bootstrap counts $Y^*_{ij}$. Model selection induces discontinuities in t(⋅), increasing the nonlinearity and decreasing $\widetilde{sd}_B/\widehat{sd}_B$. The 164 ratios shown as the dashed line in Figure 2 had median 0.91, mean 0.89.

How many bootstrap replications B are necessary to ensure the accuracy of $\widetilde{sd}_B$? The jackknife provides a quick answer: divide the B replications into J groups of size B/J each, and let $\widetilde{sd}_{Bj}$ be the estimate (3.6) computed with the jth group removed. Then

$cv_B = \Big[\frac{J}{J-1}\sum_{j=1}^{J}\big(\widetilde{sd}_{Bj} - \widetilde{sd}_{B\cdot}\big)^2\Big]^{1/2}\Big/\ \widetilde{sd}_B,$ (3.11)

with $\widetilde{sd}_{B\cdot} = \sum_{j=1}^{J}\widetilde{sd}_{Bj}/J$, is the jackknife-estimated coefficient of variation for $\widetilde{sd}_B$. Applying (3.11) with J = 20 to the first B = 1000 replications (of the 4000 used in Figure 2) yielded $cv_B$ values of about 0.05 for each of the 164 subjects. Going on to B = 4000 reduced the $cv_B$'s to about 0.02. Stopping at B = 1000 would have been quite sufficient. Note: $cv_B$ applies to the bootstrap accuracy of $\widetilde{sd}_B$ as an estimate of the ideal value $\widetilde{sd}$ (3.4), not to sampling variability due to randomness in the original data y, while $\widetilde{sd}_B$ itself does refer to sampling variability.
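A grouped-jackknife check like (3.11) only needs to recompute $\widetilde{sd}_B$ with one block of replications left out at a time. The sketch assumes the `smoothed_sd` helper above and B divisible by J, and follows the J/(J − 1) factor as reconstructed in (3.11); it is illustrative only.

```python
import numpy as np

def jackknife_cv(counts, t, J=20):
    """Jackknife coefficient of variation for sd~_B, as in (3.11)."""
    B = len(t)
    group = np.arange(B) % J                               # J groups of size B/J
    sd_full = smoothed_sd(counts, t)
    sd_j = np.array([smoothed_sd(counts[group != j], t[group != j]) for j in range(J)])
    var_jack = (J / (J - 1)) * np.sum((sd_j - sd_j.mean()) ** 2)
    return np.sqrt(var_jack) / sd_full
```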

Proof of Theorem 1. The "nonparametric delta method" is the same as the influence function and infinitesimal jackknife methods described in Chapter 6 of Efron (1982). It is appropriate here because s(y), unlike t(y), is a smooth function of y. With the original data vector y (2.2) fixed, we can write bootstrap replication $t^*_i = t(y^*_i)$ as a function $T(Y^*_i)$ of the count vector (3.2). The ideal smoothed bootstrap estimate $s_0$ is the multinomial expectation of $T(Y^*)$,

$s_0 = E_*\{T(Y^*)\}, \qquad Y^* \sim \mathrm{Mult}_n(n, p_0),$ (3.12)

p0 = (1/n, 1/n, … , 1/n), the notation indicating a multinomial distribution with n draws on n equally likely categories.

Now let S(p) denote the multinomial expectation of T(Y*) if the probability vector is changed from p0 to p = (p1, p2, … , pn),

$S(p) = E_*\{T(Y^*)\}, \qquad Y^* \sim \mathrm{Mult}_n(n, p),$ (3.13)

so S(p0) = s0. Define the directional derivative

$\dot S_j = \lim_{\varepsilon \to 0} \frac{S\big(p_0 + \varepsilon(\delta_j - p_0)\big) - S(p_0)}{\varepsilon},$ (3.14)

δj the jth coordinate vector (0, 0, … , 0, 1, 0, … , 0), with 1 in the jth place. Formula (6.18) of Efron (1982) gives

$\Big(\sum_{j=1}^{n} \dot S_j^2\Big)^{1/2}\Big/\ n$ (3.15)

as the delta method estimate of standard deviation for s0. It remains to show that (3.15) equals (3.4).

Define $w_i(p)$ to be the ratio of the probabilities of $Y^*_i$ under (3.13) compared to (3.12),

$w_i(p) = \prod_{k=1}^{n}\big(np_k\big)^{Y^*_{ik}},$ (3.16)

so that

$S(p) = \sum_{i=1}^{B} w_i(p)\, t^*_i\big/B$ (3.17)

(the factor 1/B reflecting that under $p_0$, all the $Y^*_i$'s have probability $1/B = 1/n^n$).

For $p(\varepsilon) = p_0 + \varepsilon(\delta_j - p_0)$ as in (3.14), we calculate

$w_i(p) = \big(1 + (n-1)\varepsilon\big)^{Y^*_{ij}}\,\big(1 - \varepsilon\big)^{\sum_{k \ne j} Y^*_{ik}}.$ (3.18)

Letting ε → 0 yields

$w_i(p) \doteq 1 + n\varepsilon\big(Y^*_{ij} - 1\big)$ (3.19)

where we have used $\sum_k Y^*_{ik}/n = 1$. Substitution into (3.17) gives

$S\big(p(\varepsilon)\big) \doteq \sum_{i=1}^{B}\big[1 + n\varepsilon(Y^*_{ij} - 1)\big]\, t^*_i\big/B = s_0 + n\varepsilon\,\mathrm{cov}_j$ (3.20)

as in (3.5). Finally, definition (3.14) yields

$\dot S_j = n\,\mathrm{cov}_j$ (3.21)

and (3.15) verifies Theorem 1 (3.4).

The validity of an approximate 95% interval $\hat\theta \pm 1.96\,\hat\sigma$ is compromised if the standard error σ is itself changing rapidly as a function of θ. Acceleration $\hat a$ (Efron, 1987) is a measure of such change. Roughly speaking,

$\hat a = \frac{d\sigma}{d\theta}\Big|_{\hat\theta}.$ (3.22)

If $\hat a = 0.10$ for instance, then at the upper endpoint $\hat\theta_{\mathrm{up}} = \hat\theta + 1.96\,\hat\sigma$ the standard error will have increased to about $1.196\,\hat\sigma$, leaving $\hat\theta_{\mathrm{up}}$ only 1.64, not 1.96, σ-units above $\hat\theta$. (The 1987 paper divides definition (3.22) by 3, as being appropriate after a normalizing transformation.)

Acceleration has a simple expression in terms of the covariances $\widehat{\mathrm{cov}}_j$ used to calculate $\widetilde{sd}_B$ in (3.6),

$\hat a = \frac{1}{6}\Big[\sum_{j=1}^{n}\widehat{\mathrm{cov}}_j^3\Big/\Big(\sum_{j=1}^{n}\widehat{\mathrm{cov}}_j^2\Big)^{3/2}\Big],$ (3.23)

equation (7.3) of Efron (1987). The $\hat a$'s were small for the 164 $\widetilde{sd}_B$ estimates for the Cholesterol data, most of them falling between −0.02 and 0.02, strengthening belief in the smoothed standard intervals $\tilde\mu_i \pm 1.96\,\widetilde{sd}_{Bi}$ (2.11).
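Since (3.23) uses the same covariances as (3.6), the acceleration costs essentially nothing extra; a one-function illustration, again assuming the `smoothed_sd` setup above:

```python
import numpy as np

def acceleration(counts, t):
    """Acceleration estimate a^ of (3.23), built from the covariances of (3.7)."""
    B = len(t)
    cov_hat = (counts - counts.mean(axis=0)).T @ (t - t.mean()) / B
    return np.sum(cov_hat ** 3) / (6.0 * np.sum(cov_hat ** 2) ** 1.5)
```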

Bias is more difficult to estimate than variance, particularly in a nonparametric context. Remark C of Section 7 verifies the following promising-looking result: the nonparametric estimate of bias for the smoothed estimate $\tilde\mu = s(y)$ (2.8) is

$\widetilde{\mathrm{bias}} = \frac{1}{2}\,\mathrm{cov}_*\big(Q^*_i,\, t^*_i\big) \quad\text{where}\quad Q^*_i = \sum_{k=1}^{n}\big(Y^*_{ik} - 1\big)^2,$ (3.24)

with cov* indicating bootstrap covariance as in (3.5). Unfortunately, bias proved to be too noisy to use in the Cholesterol example. Section 6 describes a more practical approach to bias estimation in a parametric bootstrap context.

4 Parametric bootstrap smoothing

We switch now from nonparametric to parametric estimation problems, but ones still involving data-based model selection. More specifically, we assume that a p-parameter exponential family of densities applies,

$f_\alpha(\hat\beta) = e^{\alpha'\hat\beta - \psi(\alpha)} f_0(\hat\beta),$ (4.1)

where α is the p-dimensional natural or canonical parameter vector, $\hat\beta$ the p-dimensional sufficient statistic vector (playing the role of y in (2.2)), ψ(α) the cumulant generating function, and $f_0(\hat\beta)$ the "carrying density" defined with respect to some carrying measure (which may include discrete atoms as with the Poisson family). Form (4.1) covers a wide variety of familiar applications, including generalized linear models; $\hat\beta$ is usually obtained by sufficiency from the original data, as seen in the next section.

The expectation parameter vector $\beta = E_\alpha\{\hat\beta\}$ is a one-to-one function of α, say β = λ(α), having p × p derivative matrix

$\frac{d\beta}{d\alpha} = V(\alpha)$ (4.2)

where V = V(α) is the covariance matrix $\mathrm{cov}_\alpha(\hat\beta)$. The value of α corresponding to the sufficient statistic $\hat\beta$, $\hat\alpha = \lambda^{-1}(\hat\beta)$, is the maximum likelihood estimate (MLE) of α.

A parametric bootstrap sample is obtained by drawing i.i.d. realizations $\hat\beta^*$ from the MLE density $f_{\hat\alpha}(\cdot)$,

$f_{\hat\alpha}(\cdot) \xrightarrow{\ \mathrm{iid}\ } \hat\beta^*_1, \hat\beta^*_2, \ldots, \hat\beta^*_B.$ (4.3)

If $\hat\mu = t(\hat\beta)$ is an estimate of a parameter of interest μ, the bootstrap samples (4.3) provide B parametric bootstrap replications of $\hat\mu$,

$\hat\mu^*_i = t(\hat\beta^*_i), \qquad i = 1, 2, \ldots, B.$ (4.4)

As in the nonparametric situation, these can be averaged to provide a smoothed estimate,

$\tilde\mu = s(\hat\beta) = \sum_{i=1}^{B} t(\hat\beta^*_i)\big/B.$ (4.5)

When t(⋅) involves model selection, μ^ is liable to an erratic jumpiness, smoothed out by the averaging process.

The bootstrap replications $\hat\beta^* \sim f_{\hat\alpha}(\cdot)$ have mean vector and covariance matrix

$\hat\beta^* \sim \big(\hat\beta,\ \hat V\big) \qquad [\hat V = V(\hat\alpha)].$ (4.6)

Let $\boldsymbol{B}$ be the B × p matrix with ith row $\hat\beta^*_i - \hat\beta$. As before, we will assume an ideal bootstrap resampling situation where B → ∞, making the empirical mean and variance of the $\hat\beta^*$ values exactly match (4.6):

$\boldsymbol{B}'1_B/B = 0 \quad\text{and}\quad \boldsymbol{B}'\boldsymbol{B}/B = \hat V,$ (4.7)

$1_B$ the vector of B 1's.

Parametric versions of Theorem 1 and Corollary 1 depend on the p-dimensional bootstrap covariance vector between $\hat\beta^*$ and $t^* = t(\hat\beta^*)$,

$\mathrm{cov} = \boldsymbol{B}'\big(t^* - s_0 1_B\big)\big/B$ (4.8)

where $t^*$ is the B-vector of bootstrap replications $t^*_i = t(\hat\beta^*_i)$, and $s_0$ the ideal smoothed estimate (4.5).

Theorem 2. The parametric delta-method estimate of standard deviation for the ideal smoothed estimate (4.5) is

$\widetilde{sd} = \big[\mathrm{cov}'\,\hat V^{-1}\,\mathrm{cov}\big]^{1/2}.$ (4.9)

(Proof given at the end of this section.)

Corollary 2. $\widetilde{sd}$ is always less than or equal to $\widehat{sd}$, the bootstrap estimate of standard deviation for the unsmoothed estimate,

$\widehat{sd} = \big[\|t^* - s_0 1_B\|^2\big/B\big]^{1/2},$ (4.10)

the ratio being

$\widetilde{sd}\big/\widehat{sd} = \Big[\big(t^* - s_0 1_B\big)'\boldsymbol{B}\big(\boldsymbol{B}'\boldsymbol{B}\big)^{-1}\boldsymbol{B}'\big(t^* - s_0 1_B\big)\Big]^{1/2}\Big/\Big(B^{1/2}\,\widehat{sd}\Big).$ (4.11)

In the ideal bootstrap case, (4.7) and (4.9) show that $\widetilde{sd}$ equals $B^{-1/2}$ times the numerator on the right-hand side of (4.11). This is recognizable as the length of the projection of $t^* - s_0 1_B$ into the p-dimensional linear subspace of $\mathcal{R}^B$ spanned by the columns of $\boldsymbol{B}$. Figure 4 still applies, with $\mathcal{L}(\boldsymbol{B})$ replacing $\mathcal{L}(Y^*)$.

If $t(\hat\beta) = \hat\mu$ is multivariate, say of dimension K, then cov as defined in (4.8) is a p × K matrix. In this case

$\mathrm{cov}'\,\hat V^{-1}\,\mathrm{cov}$ (4.12)

(or $\widehat{\mathrm{cov}}'\,\bar V^{-1}\,\widehat{\mathrm{cov}}$ in what follows) is the delta-method assessment of covariance for the smoothed vector estimate $s(\hat\beta) = \sum t(\hat\beta^*_i)/B$, also denoted $t^*_\cdot$ below.

Only minor changes are necessary for realistic bootstrap computations, i.e., for B < ∞. Now we define $\boldsymbol{B}$ as the B × p matrix having ith row $\hat\beta^*_i - \hat\beta^*_\cdot$, with $\hat\beta^*_\cdot = \sum_i \hat\beta^*_i/B$, and compute the empirical covariance vector

$\widehat{\mathrm{cov}} = \boldsymbol{B}'\big(t^* - t^*_\cdot 1_B\big)\big/B$ (4.13)

and the empirical bootstrap variance matrix

$\bar V = \boldsymbol{B}'\boldsymbol{B}\big/B.$ (4.14)

Then the estimate of standard deviation for the smoothed estimate $\tilde\mu = s(\hat\beta)$ (4.5) is

$\widetilde{sd}_B = \big[\widehat{\mathrm{cov}}'\,\bar V^{-1}\,\widehat{\mathrm{cov}}\big]^{1/2}.$ (4.15)

As $B \to \infty$, $\widehat{\mathrm{cov}} \to \mathrm{cov}$ and $\bar V \to \hat V$, so $\widetilde{sd}_B \to \widetilde{sd}$ (4.9). Corollary 2, with $s_0$ replaced by $\tilde\mu$ (4.5), remains valid.
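A compact sketch of (4.13)–(4.15): the only inputs are the B × p matrix of bootstrap sufficient statistics and the B replications $t^*_i$. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def parametric_smoothed_sd(beta_star, t):
    """Parametric delta-method sd for the smoothed estimate, (4.13)-(4.15).
    beta_star -- (B, p) array of bootstrap sufficient statistics beta*_i
    t         -- length-B vector of replications t*_i = t(beta*_i)"""
    B = len(t)
    Bmat = beta_star - beta_star.mean(axis=0)             # rows beta*_i - beta*_.
    cov_hat = Bmat.T @ (t - t.mean()) / B                 # (4.13)
    V_bar = Bmat.T @ Bmat / B                             # (4.14)
    return float(np.sqrt(cov_hat @ np.linalg.solve(V_bar, cov_hat)))   # (4.15)
```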

Figure 5 reports on a simulation test of Theorem 2. This was based on a parametric model for the Cholesterol data of Figure 1,

Figure 5.

Simulation test of Theorem 2, parametric model (4.16)–(4.18), Cholesterol data; 100 simulations, 1000 parametric bootstraps each, for the 11 subjects indicated at the bottom of Figure 1. Heavy line connects observed empirical standard deviations (4.22); dashes show the 100 estimates $\widetilde{sd}$ from Theorem 2 (4.15). Light dashed line connects averages of the $\widetilde{sd}$ values, as discussed in Remark K.

$y \sim N_{164}(\mu, \Sigma),$ (4.16)

where the covariance matrix Σ was diagonal, with diagonal elements $\sigma_i^2$ given by a cubic function of compliance $c_i$ (obtained from a regression percentile fit),

$\sigma_i = 23.7 + 5.49\,c_i - 2.25\,c_i^2 - 1.03\,c_i^3,$ (4.17)

making $\sigma_i$ about twice as large to the right as to the left. The expectation vector μ was taken to be

$\mu = X\hat\beta^{(6)} = \hat\mu^{(6)},$ (4.18)

the sixth-degree OLS fit for cholesterol decrease as a function of compliance, used as the mean in (4.16), with X the corresponding 164 × 7 structure matrix.

Model (4.16)–(4.18) is a 7-parameter exponential family (4.1), with sufficient statistic

$\hat\beta = G^{-1} X'\Sigma^{-1} y \qquad [G = X'\Sigma^{-1}X]$ (4.19)

and covariance matrix (4.2)

$V = G^{-1},$ (4.20)

which is all that is necessary to apply Theorem 2.

The simulation began with 100 draws $y_i,\ i = 1, 2, \ldots, 100$, from (4.16), each of which gave OLS estimate $\hat\mu_i = X\hat\beta_i^{(6)}$. Then B = 1000 parametric bootstrap draws were generated from $\hat\beta_i$,

$y^{**}_{ij} \sim N\big(\hat\mu_i, \Sigma\big), \qquad j = 1, 2, \ldots, 1000,$ (4.21)

from which smoothed estimate $\tilde\mu_i$ (4.5) and estimated standard deviation $\widetilde{sd}_i$ were calculated according to (4.15). All of this was done for 11 of the 164 subjects, as indicated in Figure 1.

The dashes in Figure 5 indicate the 100 $\widetilde{sd}_i$ values for each of the 11 subjects. This is compared with the observed empirical standard deviations of the smoothed estimates,

$\widetilde{Sd} = \Big[\sum_{i=1}^{100}\big(\tilde\mu_i - \tilde\mu_\cdot\big)^2\big/99\Big]^{1/2} \qquad \Big[\tilde\mu_\cdot = \sum_{i=1}^{100}\tilde\mu_i\big/100\Big],$ (4.22)

connected by the heavy solid curve. The $\widetilde{sd}$ values from Theorem 2 are seen to provide reasonable estimates of $\widetilde{Sd}$, though with some bias and variability.

There is more to the story. The empirical standard deviations $\widetilde{Sd}$ are themselves affected by model selection problems. Averaging the 100 $\widetilde{sd}_i$ values (connected by the light dashed line in Figure 5) gives more dependable results, as discussed in Remark K.

Proof of Theorem 2. Suppose that instead of $f_{\hat\alpha}(\cdot)$ in (4.3) we wished to consider parametric bootstrap samples drawn from some other member of family (4.1), $f_\alpha(\cdot)$ (α not necessarily the "true value"). The ratio $w_i = f_\alpha(\hat\beta^*_i)\big/f_{\hat\alpha}(\hat\beta^*_i)$ equals

$w_i = c_{\alpha,\hat\alpha}\, e^{Q_i} \quad\text{where}\quad Q_i = (\alpha - \hat\alpha)'\big(\hat\beta^*_i - \hat\beta\big),$ (4.23)

with the factor $c_{\alpha,\hat\alpha}$ not depending on $\hat\beta^*_i$. Importance sampling can now be employed to estimate $E_\alpha\{t(\hat\beta)\}$, the expectation under $f_\alpha$ of statistic $t(\hat\beta)$, using only the original bootstrap replications $(\hat\beta^*_i, t^*_i)$ from (4.3),

$\hat E_\alpha = \sum_{i=1}^{B} w_i t^*_i\Big/\sum_{i=1}^{B} w_i = \sum_{i=1}^{B} e^{Q_i} t^*_i\Big/\sum_{i=1}^{B} e^{Q_i}.$ (4.24)

Notice that $\hat E_\alpha$ is the value of the smoothed estimate (4.5) at parameter α, say $s_\alpha$. The delta-method standard deviation for our estimate $s_{\hat\alpha}$ depends on the derivative vector $ds_\alpha/d\alpha$ evaluated at $\alpha = \hat\alpha$. Letting $\alpha \to \hat\alpha$ in (4.23)–(4.24) gives

$s_\alpha \doteq \frac{\sum\big(1 + Q_i\big)t^*_i\big/B}{\sum\big(1 + Q_i\big)\big/B} \doteq s_{\hat\alpha} + (\alpha - \hat\alpha)'\,\mathrm{cov}$ (4.25)

where the denominator term $\sum Q_i/B$ equals 0 for the ideal bootstrap according to (4.7). (For the non-ideal bootstrap, $\sum Q_i/B$ approaches 0 at rate $O_p(1/\sqrt{B})$.)

We see that

$\frac{ds_\alpha}{d\alpha}\Big|_{\hat\alpha} = \mathrm{cov},$ (4.26)

so from (4.2),

$\frac{ds_\alpha}{d\beta}\Big|_{\hat\alpha} = \hat V^{-1}\,\mathrm{cov}.$ (4.27)

Since $\hat V$ is the covariance matrix of $\hat\beta^*$, that is, of $\hat\beta$ under distribution $f_{\alpha = \hat\alpha}$, (4.6) and (4.27) verify $\widetilde{sd}$ in (4.9) as the usual delta-method estimate of standard deviation for $s(\hat\beta)$.

Theorem 1 and Corollary 1 can be thought of as special cases of the exponential family theory in this section. The multinomial distribution of $Y^*$ (3.12) plays the role of $f_{\hat\alpha}(\hat\beta^*)$; $\hat V$ in (4.9) becomes $I - 1_n 1_n'/n$ (3.3), so that (4.9) becomes (3.4). A technical difference is that the $\mathrm{Mult}_n(n, p)$ family (3.13) is singular (that is, concentrated on an (n − 1)-dimensional subspace of $\mathcal{R}^n$), making the influence-function argument a little more involved than the parametric delta-method calculations. More seriously, the dimension of the nonparametric multinomial distribution increases with n, while for example, the parametric "Supernova" example of the next section has dimension 10 no matter how many supernovas might be observed. The more elaborate parametric confidence interval calculations of Section 6 failed when adapted for the nonparametric Cholesterol analysis, perhaps because of the comparatively high dimension, 164 versus 10.

5 The Supernova data

Figure 6 concerns a second example we will use to illustrate the parametric bootstrap theory of the previous section, the Supernova data: the absolute magnitude yi has been determined for n = 39 Type Ia supernovas, yielding the data

Figure 6. The Supernova data.

Absolute magnitudes of n = 39 Type Ia supernovas plotted versus their OLS estimates from the full linear model (5.3); adjusted R² (5.5) equals 0.69.

$y = (y_1, y_2, \ldots, y_n).$ (5.1)

For each supernova a vector of spectral energies $x_i$ has also been observed, measured at p = 10 frequencies,

$x_i = (x_{i1}, x_{i2}, \ldots, x_{i10})$ (5.2)

for supernova i. The 39 × 10 covariate matrix X, having $x_i$ as its ith row, will be regarded as fixed.

We assume a standard normal linear regression model

$y = X\alpha + \varepsilon, \qquad \varepsilon \sim N_{39}(0, I),$ (5.3)

referred to as the full model in what follows. (For convenient discussion, the $y_i$ have been rescaled to make (5.3) appropriate.) It has exponential family form (4.1), p = 10, with natural parameter α, sufficient statistic $\hat\beta = X'y$, and $\psi = \alpha'X'X\alpha/2$.

Then $(X'X)^{-1}\hat\beta = \hat\alpha$, the MLE of α, which also equals $\hat\alpha_{\mathrm{OLS}}$, the ordinary least squares estimate of α in (5.3), yielding the full-model vector of supernova brightness estimates

$\hat\mu_{\mathrm{OLS}} = X\hat\alpha_{\mathrm{OLS}}.$ (5.4)

Figure 6 plots $y_i$ versus its estimate $\hat\mu_{\mathrm{OLS},i}$. The fit looks good, having an unadjusted R² of 0.82. Adjusting for the fact that we have used m = 10 parameters to fit n = 39 data points yields the more realistic value

$R^2_{\mathrm{adj}} = R^2 - 2\big(1 - R^2\big)\frac{m}{n - m} = 0.69;$ (5.5)

see Remark D.

Type Ia supernovas were used as “standard candles” in the discovery of dark energy and the cosmological expansion of the universe (Perlmutter et al., 1999; Riess et al., 1998). Their standardness assumes a constant absolute magnitude. This is not exactly true, and in practice regression adjustments are made. Our 39 supernovas were close enough to Earth to have their absolute magnitudes ascertained independently. The spectral measurements x, however, can be made for distant Type Ia supernovas, where independent methods fail, the scientific goal being a more accurate estimation function μ^(x) for their absolute magnitudes, and improved calibration of cosmic expansion.

We will use the Lasso (Tibshirani, 1996) to select $\hat\mu(x)$. For a given choice of the non-negative "tuning parameter" λ we estimate α by the Lasso criterion

$\hat\alpha_\lambda = \arg\min_\alpha\Big\{\|y - X\alpha\|^2 + \lambda\sum_{k=1}^{p}|\alpha_k|\Big\};$ (5.6)

$\hat\alpha_\lambda$ shrinks the components of $\hat\alpha_{\mathrm{OLS}}$ toward zero, some of them all the way. As λ decreases from infinity to 0, the number m of non-zero components of $\hat\alpha_\lambda$ increases from 0 to p. Conveniently enough, it turns out that m also nearly equals the effective degrees of freedom for the selection of $\hat\alpha_\lambda$ (Efron, Hastie, Johnstone and Tibshirani, 2004). In what follows we will write $\hat\alpha_m$ rather than $\hat\alpha_\lambda$.

Table 4 shows a portion of the Lasso calculations for the Supernova data. Its last column gives $R^2_{\mathrm{adj}}$ (5.5), with $R^2$ having the usual form

$R^2 = 1 - \|y - \hat\mu_m\|^2\big/\|y - \bar y 1\|^2 \qquad \big(\hat\mu_m = X\hat\alpha_m,\ \bar y = \textstyle\sum y_i/n\big).$ (5.7)

Table 4.

Lasso model selection for the Supernova data. As the regularization parameter λ in (5.6) decreases from infinity to zero, the number m of non-zero coordinates of $\hat\alpha_m$ increases from 0 to 10. The choice m = 7 maximizes the adjusted R² value (5.5), making it the selected model.

λ       m    R²    R²_adj
∞       0    0     0
63      1    .17   .12
19.3    3    .74   .70
8.2     5    .79   .73
.496    7    .82   .735 (selected)
.039    9    .82   .71
0       10   .82   .69 (OLS)

The choice $\hat m = 7$ maximizes $R^2_{\mathrm{adj}}$,

$\hat m = \arg\max_m\big\{R^2_{\mathrm{adj}}(m)\big\},$ (5.8)

yielding our selected coefficient vector $\hat\alpha_{\hat m}$ and the corresponding vector of supernova estimates

$\hat\mu = X\hat\alpha_{\hat m};$ (5.9)

note that $\hat\alpha_{\hat m}$ is not an OLS estimate.
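A rough sketch of the selection rule (5.6)–(5.8) using scikit-learn's Lasso. Note that sklearn minimizes ‖y − Xw‖²/(2n) + α‖w‖₁, so its penalty grid is a rescaling of λ in (5.6); the function name and the grid of penalties are illustrative assumptions, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_by_adjusted_r2(X, y, alphas):
    """Fit a Lasso path over `alphas` and return the fit maximizing R^2_adj, as in (5.5)/(5.8)."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)
    best = None
    for alpha in alphas:
        coef = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
        m = np.count_nonzero(coef)
        if m >= n:                                   # guard the n - m denominator in (5.5)
            continue
        resid = y - X @ coef
        r2 = 1 - resid @ resid / tss                 # (5.7)
        r2_adj = r2 - 2 * (1 - r2) * m / (n - m)     # (5.5)
        if best is None or r2_adj > best[0]:
            best = (r2_adj, m, coef)
    return best    # (best R^2_adj, m_hat, selected coefficient vector)
```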

B = 4000 bootstrap replications $\hat\mu^*$ were computed (again many more than were actually needed): bootstrap samples y* were drawn using the full OLS model,

$y^* \sim N_{39}\big(\hat\mu_{\mathrm{OLS}}, I\big);$ (5.10)

see Remark E. The equivalent of Table 4, now based on data y*, was calculated, the $R^2_{\mathrm{adj}}$ maximizer $\hat m^*$ and $\hat\alpha^*_{\hat m^*}$ selected, giving

$\hat\mu^* = X\hat\alpha^*_{\hat m^*}.$ (5.11)

Averaging the 4000 $\hat\mu^*$ vectors yielded the smoothed vector estimate

$\tilde\mu = \sum_{i=1}^{B}\hat\mu^*_i\big/B.$ (5.12)

Standard deviations $\widetilde{sd}_{Bj}$ for supernova j's smoothed estimate $\tilde\mu_j$ were then calculated according to (4.15), j = 1, 2, …, 39. The ratio of standard deviations $\widetilde{sd}_B/\widehat{sd}_B$ for each of the 39 supernovas ranged from 0.87 to 0.98, with an average of 0.93. Jackknife calculations (3.11) showed that B = 800 would have been enough for good accuracy.

At this point it pays to remember that $\widetilde{sd}_B$ is a delta-method shortcut version of a full bootstrap standard deviation for the smoothed estimator s(y). We would prefer the latter if not for the computational burden of a second level of bootstrapping. As a check, a full second-level simulation was run, beginning with simulated data vectors $y^* \sim N_{39}(\hat\mu_{\mathrm{OLS}}, I)$ (5.10), and for each y* carrying through calculations of s* and $\widetilde{sd}_B$ based on B = 1000 second-level bootstraps. This was done 500 times, yielding 500 values $s^*_k$ for each of the 39 supernovas, which provided direct bootstrap estimates, say $sd^*_k$, for $s_k$. The $sd^*_k$ values averaged about 7.5% larger than the delta-method approximations $\widetilde{sd}_{Bk}$. Taking this into account, the reductions in standard deviation due to smoothing were actually quite small, the ratios averaging about 98%; see the end of Remark H.

Returning to the original calculations, model selection was highly variable among the 4000 bootstrap replications. Table 5 shows the percentage of the 4000 replications that selected m non-zero coefficients for α^ in (5.11), m = 1, 2,…, 10, with the original choice m = 7 not quite being modal. Several of the supernovas showed effects like that in Figure 3.

Table 5.

Percentage of the 4000 bootstrap replications selecting m non-zero coefficients for α^ in (5.11), m = 1, 2,…,10. The original choice m = 7 is not quite modal.

m 1 2 3 4 5 6 7 8 9 10
% 0 1 8 13 16 18 18 14 9 2

Model averaging, that is bootstrap smoothing, still has important confidence interval effects even though here it does not substantially reduce standard deviations. This is shown in Figure 7 of the next section, which displays approximate 95% confidence intervals for the 39 supernova magnitudes.

Figure 7.

Approximate 95% confidence limits for the 39 supernova magnitudes $\mu_k$ (after subtraction of smoothed estimates $\tilde\mu_k$ (5.12)); ABC intervals (solid) compared with smoothed standard intervals $\tilde\mu_k \pm 1.96\,\widetilde{sd}_k$ (dashed). Crosses indicate differences between unsmoothed and smoothed estimates, (5.9) minus (5.12).

Other approaches to bootstrapping Lasso estimates are possible. Chatterjee and Lahiri (2011), referring back to work by Knight and Fu (2000), resample regression residuals rather than using the full parametric bootstrap (5.10). The "m out of n" bootstrap is featured in Hall, Lee and Park (2009). Asymptotic performance, mostly absent here, is a central concern of these papers; also, they focus on estimation of the regression coefficients α in (5.3), a more difficult task than estimating μ = Xα.

6 Better bootstrap confidence intervals

The central tactic of this paper is the use of bootstrap smoothing to convert an erratically behaved model selection-based estimator t(⋅) into a smoothly varying version s(⋅). Smoothing makes the good asymptotic properties of the bootstrap, as extensively developed in Hall (1992), more credible for actual applications. This section carries the smoothing theme further, showing how s(⋅) can be used to form second-order accurate intervals.

The improved confidence intervals depend on the properties of bootstrap samples from exponential families (4.1). We define an "empirical exponential family" $\hat f_\alpha(\cdot)$ that puts probability

$\hat f_\alpha\big(\hat\beta^*_i\big) = e^{(\alpha - \hat\alpha)'\hat\beta^*_i - \hat\psi(\alpha)}\,\frac{1}{B}$ (6.1)

on bootstrap replication $\hat\beta^*_i$ (4.3) for i = 1, 2, …, B, where

$\hat\psi(\alpha) = \log\Big(\sum_{i=1}^{B} e^{(\alpha - \hat\alpha)'\hat\beta^*_i}\Big/B\Big).$ (6.2)

Here $\hat\alpha$ is the MLE of α in the original family (4.1), $\hat\alpha = \lambda^{-1}(\hat\beta)$ in the notation following (4.2).

The choice $\alpha = \hat\alpha$ makes $\hat f_{\hat\alpha}(\hat\beta^*_i) = 1/B$ for i = 1, 2, …, B; in other words, it yields the empirical probability distribution of the bootstrap sample (4.3) in $\mathcal{R}^p$. Other choices of α "tilt" the empirical distribution in the direction $\alpha - \hat\alpha$; (6.1) is a direct analogue of the original exponential family (4.1), which can be re-expressed as

$f_\alpha(\hat\beta) = e^{(\alpha - \hat\alpha)'\hat\beta - (\psi(\alpha) - \psi(\hat\alpha))}\, f_{\hat\alpha}(\hat\beta),$ (6.3)

now with $\hat\alpha$ fixed and $\hat\beta$ the random variable. Notice that $\hat\psi(\hat\alpha) = 0$ in (6.2). Taking this into account, the only difference between the original family (6.3) and the empirical family (6.1) is the change in support, from $f_{\hat\alpha}(\cdot)$ to the empirical probability distribution of the bootstrap sample. Under mild regularity conditions, the family $\hat f_\alpha(\cdot)$ approaches $f_\alpha(\cdot)$ as the bootstrap sample size B goes to infinity.

As in (4.23)–(4.24), let $s_\alpha$ be the value of the smoothed statistic we would get if bootstrap samples were obtained from $f_\alpha$ rather than $f_{\hat\alpha}$. We can estimate $s_\alpha$ from the original bootstrap samples (4.3) by importance sampling in family (4.1),

$\hat s_\alpha = \sum_{i=1}^{B} e^{(\alpha - \hat\alpha)'\hat\beta^*_i}\, t^*_i\Big/\sum_{i=1}^{B} e^{(\alpha - \hat\alpha)'\hat\beta^*_i} = \sum_{i=1}^{B}\hat f_\alpha\big(\hat\beta^*_i\big)\, t^*_i,$ (6.4)

without requiring any further evaluations of t(⋅). (Note that $\hat f_\alpha(\hat\beta^*_i)$ is proportional to $w_i$ in (4.24).) The main point here is that the smoothed estimate $\hat s_\alpha$ is the expectation of the values $t^*_i$, $i = 1, 2, \ldots, B$, taken with respect to the empirical exponential family (6.1).
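Computationally, (6.4) is just a re-weighting of the stored replications. A hedged sketch follows; the log-weight stabilization is an implementation detail and the names are illustrative, not the paper's code.

```python
import numpy as np

def tilted_smooth(alpha, alpha_hat, beta_star, t):
    """Importance-sampling estimate of s_alpha, formula (6.4).
    alpha, alpha_hat -- length-p parameter vectors
    beta_star        -- (B, p) bootstrap sufficient statistics beta*_i
    t                -- length-B replications t*_i"""
    logw = beta_star @ (alpha - alpha_hat)     # (alpha - alpha_hat)' beta*_i
    logw -= logw.max()                         # stabilize; the shift cancels in the ratio
    w = np.exp(logw)
    return np.sum(w * t) / np.sum(w)
```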

A system of approximate confidence intervals enjoys second-order accuracy if its coverage probabilities approach the target value with errors of order 1/n in the sample size n, rather than at the slower rate $1/\sqrt{n}$ of the standard intervals. The ABC system ("approximate bootstrap confidence" intervals, DiCiccio and Efron, 1992, not to be confused with "approximate Bayesian computation" as in Fearnhead and Prangle, 2012) employs numerical derivatives to produce second-order accurate intervals in exponential families. Its original purpose was to eliminate the need for bootstrap resampling. Here, though, we will apply it to the smoothed statistic $s(\hat\beta) = \sum t(\hat\beta^*_i)/B$ (4.5) in order to avoid a second level of bootstrapping. This is a legitimate use of ABC because we are working in an exponential family, albeit the empirical family (6.1).

Three corrections are needed to improve the smoothed standard interval (2.11) from first- to second-order accuracy: a non-normality correction obtained from the bootstrap distribution, an acceleration correction of the type mentioned at (3.22), and a bias correction. ABC carries these out via p + 2 numerical second derivatives of $\hat s_\alpha$ in (6.4), taken at $\alpha = \hat\alpha$, as detailed in Section 2 of DiCiccio and Efron (1992). The computational burden is effectively nil compared with the original bootstrap calculations (4.3).

Figure 7 compares the ABC 95% limits for the supernova brightnesses $\mu_k$, k = 1, 2, …, 39, solid lines, with parametric smoothed standard intervals (2.11), dashed lines. (The smoothed estimates $\tilde\mu_k$ (5.12) have been subtracted from the endpoints in order to put all the intervals on the same display.) There are a few noticeable discrepancies, for supernovas 2, 6, 25, and 27 in particular, but overall the smoothed standard intervals hold up reasonably well.

Smoothing has a moderate effect on the Supernova estimates, as indicated by the values of $\hat\mu_k - \tilde\mu_k$, (5.9) minus (5.12), the crosses in Figure 7. A few of the intervals would be much different if based on the unsmoothed estimates $\hat\mu_k$, e.g., supernovas 1, 12, 17, and 28. Remark I says more about the ABC calculations.

As a check on the ABC intervals, the "full simulation" near the end of Section 5, with B = 1000 bootstrap replications for each of 500 trials, was repeated. For each trial, the 1000 bootstraps provided new ABC calculations, from which the "achieved significance level" $\mathrm{asl}_k$ of the original smoothed estimate $\tilde\mu_k$ (5.12) was computed: that is,

$\mathrm{asl}_k = \text{bootstrap ABC confidence level for } \big(-\infty, \tilde\mu_k\big].$ (6.5)

If the ABC construction were working perfectly, $\mathrm{asl}_k$ would have a uniform distribution,

$\mathrm{asl}_k \sim U(0, 1)$ (6.6)

for k = 1, 2, …, 39.

Table 6 displays quantiles of $\mathrm{asl}_k$ in the 500 trials, for seven of the 39 supernovas, k = 5, 10, 15, 20, 25, 30, and 35. The results are not perfectly uniform, showing for instance a moderate deficiency of small $\mathrm{asl}_k$ values for k = 5, but overall the results are encouraging. A U(0, 1) random variable has mean 0.500 and standard deviation 0.289, while all 3500 $\mathrm{asl}_k$ values in Table 6 had mean 0.504 and standard deviation 0.284.

Table 6.

Simulation check for ABC intervals; 500 trials, each with B = 1000 bootstrap replications. Columns show quantiles of achieved significance levels aslk (6.5) for supernovas k = 5, 10,…,35; last column for all seven supernovas combined. It is a reasonable match to the ideal uniform distribution (6.6).

quantile SN5 SN10 SN15 SN20 SN25 SN30 SN35 ALL
0.025 0.04 0.02 0.04 0.00 0.04 0.03 0.02 0.025
0.05 0.08 0.04 0.06 0.04 0.08 0.06 0.06 0.055
0.1 0.13 0.08 0.11 0.10 0.12 0.10 0.12 0.105
0.16 0.20 0.17 0.18 0.16 0.18 0.18 0.18 0.175
0.5 0.55 0.50 0.54 0.48 0.50 0.48 0.50 0.505
0.84 0.84 0.82 0.82 0.84 0.84 0.84 0.84 0.835
0.9 0.90 0.88 0.90 0.88 0.90 0.90 0.90 0.895
0.95 0.96 0.94 0.96 0.94 0.94 0.94 0.94 0.945
0.975 0.98 0.97 0.98 0.98 0.96 0.98 0.97 0.975

The ABC computations are local, in the sense that the importance sampling estimates $\hat s_\alpha$ in (6.4) need only be evaluated for α very near $\hat\alpha$. This avoids the familiar peril of importance sampling, that the sampling weights in (6.4) or (4.24) may vary uncontrollably in size.

If one is willing to ignore the peril, full bootstrap standard errors for the smoothed estimates $\tilde\mu$ (4.5), rather than the delta-method estimates of Theorem 2, become feasible: in addition to the original parametric bootstrap samples (4.3), we draw J more times, say

$f_{\hat\alpha}(\cdot) \to \beta_1, \beta_2, \ldots, \beta_J,$ (6.7)

and compute the corresponding natural parameter estimates $\alpha_j = \lambda^{-1}(\beta_j)$, as following (4.2). Each $\alpha_j$ gives a bootstrap version of the smoothed statistic $\hat s_{\alpha_j}$, using (6.4), from which we calculate the usual bootstrap standard error estimate,

$sd_{\mathrm{boot}} = \Big[\sum_{j=1}^{J}\big(\hat s_{\alpha_j} - \hat s_\cdot\big)^2\big/(J - 1)\Big]^{1/2},$ (6.8)

where $\hat s_\cdot = \sum_j \hat s_{\alpha_j}/J$. Once again, no further evaluations of t(·) beyond the original ones in (4.5) are required.

Carrying this out for the Supernova data gave standard errors sdboot a little smaller than those from Theorem 2, as opposed to the somewhat larger ones found by the full simulation near the end of Section 5. Occasional very large importance sampling weights in (6.4) did seem to be a problem here.

Compromises between the delta method and full bootstrapping are possible. For the normal model (5.3) we have $\beta_j \sim N(\hat\beta, X'X)$ in (6.7). Instead we might take

$\beta_j \sim N\big(\hat\beta,\ c\,X'X\big)$ (6.9)

with c less than 1, placing $\alpha_j$ nearer $\hat\alpha$. Then (6.8) must be multiplied by $1/\sqrt{c}$. Doing this with c = 1/9 gave standard error estimates almost the same as those from Theorem 2.

7 Remarks, details, and proofs

This section expands on points raised in the previous discussion.

A. Proof of Corollary 1

With $Y^* = (Y^*_{ij})$ as in (3.2), let $X = Y^* - 1_B 1_n' = (Y^*_{ij} - 1)$. For the ideal bootstrap, $B = n^n$,

$X'X/B = I - 1_n 1_n'/n,$ (7.1)

the multinomial covariance matrix in (3.3). This has (n − 1) non-zero eigenvalues all equaling 1, implying that the singular value decomposition of X is

$X = \sqrt{B}\, L R',$ (7.2)

L and R orthonormal matrices of dimensions B × (n − 1) and n × (n − 1). Then the B-vector $U^* = (t^*_i - s_0)$ has squared projected length into $\mathcal{L}(X)$

$U^{*\prime} L L' U^* = U^{*\prime}\,\frac{X R R' X'}{B}\,U^* = B\big(U^{*\prime}X/B\big)\big(X'U^*/B\big) = B\,\widetilde{sd}^2,$ (7.3)

verifying (3.10).

B. Hájek projection and ANOVA decomposition

For the ideal nonparametric bootstrap of Section 3, define the conditional bootstrap expectations

$e_j = E_*\big\{t(y^*_i) \mid y^*_{ik} = y_j\big\},$ (7.4)

j = 1, 2, … , n (not depending on k). The bootstrap ANOVA decomposition of Efron (1983, Sect. 7) can be used to derive an orthogonal decomposition of t(y*),

$t(y^*_i) = s_0 + L_i + R_i$ (7.5)

where $s_0 = E_*\{t(y^*)\}$ is the ideal smoothed bootstrap estimate, and

$L_i = \sum_{j=1}^{n} Y^*_{ij}\big(e_j - s_0\big),$ (7.6)

while $R_i$ involves higher-order ANOVA terms such as $e_{jl} - e_j - e_l + s_0$ with

$e_{jl} = E_*\big\{t(y^*_i) \mid y^*_{ik} = y_j \text{ and } y^*_{im} = y_l\big\}.$ (7.7)

The terms in (7.5) satisfy $E_*\{L_i\} = E_*\{R_i\} = 0$ and are orthogonal, $E_*\{L_i R_i\} = 0$. The bootstrap Hájek projection of t(y*) (Hájek, 1968) is then the first two terms of (7.5), say

$H_i = s_0 + L_i.$ (7.8)

Moreover,

$L_i = \sum_{j=1}^{n} Y^*_{ij}\,\mathrm{cov}_j$ (7.9)

from (3.5), and the ratio of smoothed-to-unsmoothed standard deviation (3.10) equals

$\widetilde{sd}_B\big/\widehat{sd}_B = \Big[\mathrm{var}_*\{L_i\}\big/\big(\mathrm{var}_*\{L_i\} + \mathrm{var}_*\{R_i\}\big)\Big]^{1/2}.$ (7.10)

C. Nonparametric bias estimate

There is a nonparametric bias estimate $\widetilde{\mathrm{bias}}_B$ for the smoothed statistic s(y) (2.8) corresponding to the variability estimate $\widetilde{sd}_B$. In terms of T(Y*) and S(p) (3.13)–(3.14), the nonparametric delta method gives

$\widetilde{\mathrm{bias}}_B = \frac{1}{2}\sum_{j=1}^{n}\frac{\ddot S_j}{n^2}$ (7.11)

where $\ddot S_j$ is the second-order influence value

$\ddot S_j = \lim_{\varepsilon \to 0}\frac{S\big(p_0 + \varepsilon(\delta_j - p_0)\big) - 2S(p_0) + S\big(p_0 - \varepsilon(\delta_j - p_0)\big)}{\varepsilon^2}.$ (7.12)

See Section 6.6 of Efron (1982).

Without going into details, the Taylor series calculation (3.18)–(3.19) can be carried out one step further, leading to the following result:

$\widetilde{\mathrm{bias}}_B = \frac{1}{2}\,\mathrm{cov}_*\big(D_i,\, t^*_i\big)$ (7.13)

where $D_i = \sum_{j=1}^{n}\big(Y^*_{ij} - 1\big)^2$, as in (3.24).

This looks like a promising extension of Theorem 1 (3.4)–(3.5). Unfortunately, (7.13) proved unstable when applied to the Cholesterol data, as revealed by jackknife calculations like (3.11). Things are better in parametric settings; see Remark I. There is also some question of what "bias" means with model selection-based estimators; see Remark G.

D. Adjusted R2

Formula (5.5) for $R^2_{\mathrm{adj}}$, not the usual definition, is motivated by OLS estimation and prediction in a homoskedastic model. We observe

$y \sim \big(\mu, \sigma^2 I\big)$ (7.14)

and estimate μ by $\hat\mu = My$, where the n × n symmetric matrix M is idempotent, M² = M. Then $\hat\sigma^2 = \|y - \hat\mu\|^2/(n - m)$, m the rank of M, is the usual unbiased estimate of σ². Letting y° indicate an independent new copy of y, the expected prediction error of $\hat\mu$ is

$E\big\{\|y^\circ - \hat\mu\|^2\big\} = E\big\{\|y - \hat\mu\|^2 + 2m\hat\sigma^2\big\}$ (7.15)

as in (2.6). Finally, the usual definition of R²,

$R^2 = 1 - \|y - \hat\mu\|^2\big/\|y - \bar y 1\|^2,$ (7.16)

is adjusted by adding the penalty amount $2m\hat\sigma^2$ suggested in (7.15),

$R^2_{\mathrm{adj}} = 1 - \big\{\|y - \hat\mu\|^2 + 2m\hat\sigma^2\big\}\big/\|y - \bar y 1\|^2,$ (7.17)

and this reduces to (5.5).
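Spelling out the last step: substituting $\hat\sigma^2 = \|y - \hat\mu\|^2/(n - m)$ into (7.17) and using (7.16),

$R^2_{\mathrm{adj}} = R^2 - \frac{2m\,\hat\sigma^2}{\|y - \bar y 1\|^2} = R^2 - \frac{2m}{n - m}\cdot\frac{\|y - \hat\mu\|^2}{\|y - \bar y 1\|^2} = R^2 - 2\big(1 - R^2\big)\frac{m}{n - m},$

which is (5.5).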

E. Full-model bootstrapping

The bootstrap replications (5.10) are drawn from the full model, $y^* \sim N_{39}(\hat\mu_{\mathrm{OLS}}, I)$, rather than, say, the smoothed Lasso choice (5.12), $y^* \sim N_{39}(\tilde\mu, I)$. This follows the general development in Section 4, (4.3), and, less obviously, the theory of Sections 2 and 3, where the "full model" is the usual nonparametric one (2.3).

An elementary example, based on Section 10.6 of Hjort and Claeskens (2003), illustrates the dangers of bootstrapping from other than the full model. We observe $y \sim N(\mu, 1)$, with MLE $\hat\mu = t(y) = y$, and consider estimating μ with the shrunken estimator $\tilde\mu = s(y) = cy$, where c is a fixed constant 0 < c < 1, so

$\tilde\mu \sim N\big(c\mu, c^2\big).$ (7.18)

Full-model bootstrapping corresponds to $y^* \sim N(\hat\mu, 1)$, and yields $\tilde\mu^* = cy^* \sim N(c\hat\mu, c^2)$ as the bootstrap distribution. However the "model-selected bootstrap" $y^* \sim N(\tilde\mu, 1)$ yields

$\tilde\mu^* \sim N\big(c^2\hat\mu, c^2\big),$ (7.19)

squaring the amount of shrinkage in (7.18).

Returning to the Supernova example, the Lasso is itself a shrinkage technique. Bootstrapping from the smoothed Lasso choice $\tilde\mu$ would shrink twice, perhaps setting many more of the coordinate estimates to zero.

F. Bias of the smoothed estimate

There is a simple asymptotic expression for the bias of the bootstrap smoothed estimator in exponential families, following DiCiccio and Efron (1992). The schematic diagram of Figure 8 shows the main elements: the observed vector y, expectation μ, generates the bootstrap distribution of y*, indicated by the dashed ellipses. A parameter of interest θ = t(μ) has MLE θ^ = t(y). Isoplaths of constant value for t(⋅) are indicated by the solid curves in Figure 8.

Figure 8.

Schematic diagram of large-sample bootstrap estimation. Observed vector y has expectation μ. Ellipses indicate bootstrap distribution of y* given $\hat\mu = y$. Parameter of interest θ = t(μ) is estimated by $\hat\theta$ = t(y). Solid curves indicate surfaces of constant value of t(⋅).

The asymptotic mean and variance of the MLE θ^ = t(y) as sample size n grows large is of the form

$\hat\theta \sim \Big(\theta + \frac{b(\mu)}{n},\ \frac{c^2(\mu)}{n}\Big) + O_p\big(n^{-3/2}\big).$ (7.20)

Here the bias b(μ)/n is determined by the curvature of the level surfaces near μ. Then it is not difficult to show that the ideal smoothed bootstrap estimate $\tilde\theta = \sum t(y^*_i)/B$, B → ∞, has mean and variance

$\tilde\theta \sim \Big(\theta + \frac{2b(\mu)}{n},\ \frac{c^2(\mu)}{n}\Big) + O_p\big(n^{-3/2}\big).$ (7.21)

So smoothing doubles the bias without changing variance. This just says that smoothing cannot improve on the MLE θ^ in the already smooth asymptotic estimation context of Figure 8.

G. Two types of bias

The term b(μ)/n in (7.20) represents "statistical bias," the difference between the expected value of $t(\hat\mu)$ and t(μ). Model-selection estimators also involve "definitional bias": we wish to estimate θ = T(μ), but for reasons of robustness or efficiency we employ a different functional $\hat\theta$ = t(y), a homely example being the use of a trimmed mean to estimate an expectation. The ABC bias correction mentioned in Section 6 is correcting the smoothed standard interval $\tilde\mu \pm 1.96\,\widetilde{sd}_B$ for statistical bias. Definitional bias can be estimated by t(y) − T(y), but this is usually too noisy to be of help. Section 2 of Berk et al. (2012) makes this point nicely (see their discussion of "target estimation") and I have followed their lead in not trying to account for definitional bias. See also Bühlmann and Yu (2002), Definition 1.2, for an asymptotic statement of what is being estimated by a model-selection procedure.

H. Selection-based estimation

The introduction of model selection into the estimation process disrupts the smooth properties seen in Figure 8. The wedge-shaped regions of Figure 9 indicate different model choices, e.g., linear, quadratic, cubic, etc. regressions for the Cholesterol data. Now the surfaces of constant estimation jump discontinuously as y crosses regional boundaries. Asymptotic properties such as (7.20)–(7.21) are less convincing when the local geometry near the observed y can change abruptly a short distance away.

Figure 9.

Estimation after model selection. The regions indicate different model choices. Now the curves of constant estimation jump discontinuously as y crosses regional boundaries.

The bootstrap ellipses in Figure 9 are at least qualitatively correct for the Cholesterol and Supernova examples, since in both cases a wide bootstrap variety of regions were selected. In this paper, the main purpose of bootstrap smoothing is to put us back into Figure 8, where for example the standard intervals (2.11) are more believable. (Note: Lasso estimates are continuous, though non-differentiable, across region boundaries, giving a picture somewhere between Figure 8 and Figure 9. This might help explain the smooth estimators’ relatively modest reductions in standard error for the Supernova analysis.)

Bagging amounts to replacing the discontinuous isoplaths of θ = t(μ) with smooth ones, say for θbag = s(μ). The standard deviations and approximate confidence intervals of this paper apply to θbag, ignoring the possible definitional bias.

I. The ABC intervals

The approximate bootstrap confidence limits in Figure 7 were obtained using the ABCq algorithm, as explained in detail in Section 2 of DiCiccio and Efron (1992). In addition to the acceleration a and bias-correction constant $z_0$, ABCq also calculates $c_q$: in a one-parameter exponential family (4.1), $c_q$ measures the nonlinearity of the parameter of interest θ = t(β) as a function of β, with a similar definition applying in p dimensions. The algorithm involves the calculation of p + 2 numerical second derivatives of $\hat s_\alpha$ (6.4) carried out at $\alpha = \hat\alpha$. Besides a, $z_0$, and $c_q$, ABCq provides an estimate of statistical bias for $\hat s_\alpha$.

If (a, $z_0$, $c_q$) = (0, 0, 0) then the ABCq intervals match the smoothed standard intervals (2.11). Otherwise, corrections are made in order to achieve second-order accuracy. For instance (a, $z_0$, $c_q$) = (0, −0.1, 0) shifts the standard intervals leftwards by $0.1\,\hat\sigma$. For all three constants, values outside of ±0.1 can produce noticeable changes to the intervals.

Table 7 presents summary statistics of a, $z_0$, $c_q$, and bias for the 39 smoothed Supernova estimates $\tilde\mu_k$. The differences between the ABCq and smoothed standard intervals seen in Figure 7 were primarily due to $z_0$.

Table 7.

Summary statistics of the ABCq constants for the 39 smoothed Supernova estimates μk (5.12).

a z0 cq bias

mean      .00    .00    .00    .00
stdev     .01    .13    .04    .06
lowest   −.01   −.21   −.07   −.14
highest   .01    .27    .09    .12

J. Bias correction for sdB

The nonparametric standard deviation estimate $\widetilde{sd}_B$ (3.6) is biased upward for the ideal value $\widetilde{sd}$ (3.4), but it is easy to make a correction. Using notation (3.3)–(3.9), define

$Z_{ij} = \big(Y^*_{ij} - 1\big)\big(t^*_i - s_0\big).$ (7.22)

Then $Z_{ij}$ has bootstrap mean $\mathrm{cov}_j$ (3.5) and bootstrap variance, say $\Delta_j^2$. A sample of B bootstrap replications yields bootstrap moments

$\widehat{\mathrm{cov}}_j = \frac{1}{B}\sum_{i=1}^{B} Z_{ij} \sim \Big(\mathrm{cov}_j,\ \Delta_j^2/B\Big),$ (7.23)

so

$E_*\big\{\widetilde{sd}_B^2\big\} = \widetilde{sd}^2 + \frac{1}{B}\sum_{j=1}^{n}\Delta_j^2.$ (7.24)

Therefore the bias-corrected version of $\widetilde{sd}_B^2$ is

$\widetilde{sd}_B^2 - \frac{1}{B^2}\sum_{j=1}^{n}\sum_{i=1}^{B}\big(Z_{ij} - \widehat{\mathrm{cov}}_j\big)^2.$ (7.25)
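A sketch of the correction (7.22)–(7.25) in code, using $t^*_\cdot$ in place of the ideal $s_0$; names are illustrative, not the paper's implementation.

```python
import numpy as np

def smoothed_sd_corrected(counts, t):
    """Bias-corrected smoothed-estimate sd, following (7.22)-(7.25).
    counts -- (B, n) bootstrap counts Y*_ij; t -- length-B replications t*_i"""
    B = len(t)
    Z = (counts - 1.0) * (t - t.mean())[:, None]     # Z_ij (7.22), with t*_. standing in for s_0
    cov_hat = Z.mean(axis=0)                         # (7.23)
    sd2 = np.sum(cov_hat ** 2)                       # sd~_B^2 as in (3.6)
    correction = np.sum((Z - cov_hat) ** 2) / B**2   # (1/B^2) sum_ij (Z_ij - cov^_j)^2
    return np.sqrt(max(sd2 - correction, 0.0))       # square root of (7.25), floored at zero
```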

K. Improved estimates of the bagged standard errors

The simulation experiment of Figure 5 can also be regarded as a two-level parametric bootstrap procedure, with the goal of better estimating $sd(\tilde\mu_k)$, the bagged estimates' standard deviations for subjects k = 1, 2, …, 11 in the Cholesterol study. Two possible estimates are shown: (1) the empirical standard deviation $\widetilde{Sd}$ (4.22), solid curve, and (2) the average $\widetilde{sd}_\cdot$ of the 100 second-level $\widetilde{sd}_i$ values (4.15), dashed curve. There are two reasons to prefer the latter.

The first has to do with the sampling error of the standard deviation estimates themselves. This was about 10 times larger for $\widetilde{Sd}$ than $\widetilde{sd}_\cdot$, e.g., 5.45 ± 0.35 compared to 5.84 ± 0.03 for subject 1. (Note: The two curves in Figure 5 do not differ significantly at any point.)

The second and more important reason has to do with the volatility of model-selection estimates and their standard errors. Let σ(β) denote the standard deviation of a bagged estimator $\tilde\mu$ in a parametric model such as (4.16)–(4.17). The unknown true parameter $\beta_0$ has yielded the observed value $\hat\beta$, and then bootstrap values $\hat\beta^*_i$, i = 1, 2, …, 100, and second-level bootstraps $\hat\beta^{**}_{ij}$, j = 1, 2, …, 1000. The estimate $\widetilde{sd}_{100}$ obtained from the $\hat\beta^*_i$'s via (4.15) is a good approximation to $\sigma(\hat\beta)$. The trouble is that the functional σ(β) is itself volatile, so that $\sigma(\hat\beta)$ may differ considerably from the "truth" $\sigma(\beta_0)$.

This can be seen at the second level in Figure 5, where the dashes indicating $\widetilde{sd}_i$ values, i = 1, 2, …, 100, vary considerably. (This is not due to the limitations of using B = 1000 replications; the bootstrap "internal variance" component accounts for only about 30% of the spread.) Broadly speaking, $\hat\beta^*_i$ values that fall close to a regime boundary, say separating the choice of "Cubic" from "Quartic," had larger values of $\sigma(\hat\beta^*_i) \doteq \widetilde{sd}_i$.

The preferred estimate $\widetilde{sd}_\cdot$ effectively averages $\sigma(\hat\beta^*_i)$ over the parametric bootstrap choice of $\hat\beta^*_i$ given $\hat\beta$. Another way to say this is that $\widetilde{sd}_\cdot$ is a flat-prior Bayesian estimate of $\sigma(\beta_0)$, given the data $\hat\beta$. See Efron (2012).

Of course $\widetilde{sd}_\cdot$ requires much more computation than $\widetilde{sd}_B$ (4.15). Our 100 × 1000 analysis could be reduced to 50 × 500 without bad effect, but that is still 25,000 resamples. In fact, $\widetilde{sd}_\cdot$ was not much different from $\widetilde{sd}_B$ in this example. The difference was larger in the nonparametric version of Figure 5, which showed substantially greater bias and variability, making the second level of bootstrapping more worthwhile.

References

1. Berk R, Brown L, Buja A, Zhang K, Zhao L. Valid post-selection inference. Ann Statist. 2012; submitted. http://stat.wharton.upenn.edu/~buja/PoSI.pdf
2. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–140.
3. Buckland ST, Burnham KP, Augustin NH. Model selection: An integral part of inference. Biometrics. 1997;53:603–618.
4. Bühlmann P, Yu B. Analyzing bagging. Ann Statist. 2002;30:927–961.
5. Buja A, Stuetzle W. Observations on bagging. Statist Sinica. 2006;16:323–351.
6. Chatterjee A, Lahiri SN. Bootstrapping lasso estimators. J Amer Statist Assoc. 2011;106:608–625.
7. DiCiccio T, Efron B. More accurate confidence intervals in exponential families. Biometrika. 1992;79:231–245.
8. Efron B. Bootstrap methods: Another look at the jackknife. Ann Statist. 1979;7:1–26.
9. Efron B. The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 38. Philadelphia: Society for Industrial and Applied Mathematics (SIAM); 1982.
10. Efron B. Estimating the error rate of a prediction rule: Improvement on cross-validation. J Amer Statist Assoc. 1983;78:316–331.
11. Efron B. Better bootstrap confidence intervals. J Amer Statist Assoc. 1987;82:171–200 (with comments and a rejoinder by the author).
12. Efron B. Jackknife-after-bootstrap standard errors and influence functions. J Roy Statist Soc Ser B. 1992;54:83–127.
13. Efron B. Bayesian inference and the parametric bootstrap. Ann Appl Statist. 2012;6:1971–1997. doi:10.1214/12-AOAS571
14. Efron B, Feldman D. Compliance as an explanatory variable in clinical trials. J Amer Statist Assoc. 1991;86:9–17.
15. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32:407–499 (with discussion and a rejoinder by the authors).
16. Efron B, Tibshirani R. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability, Vol. 57. New York: Chapman and Hall; 1993.
17. Efron B, Tibshirani R. Using specially designed exponential families for density estimation. Ann Statist. 1996;24:2431–2461.
18. Fearnhead P, Prangle D. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J Roy Statist Soc Ser B. 2012;74:419–474.
19. Hájek J. Asymptotic normality of simple linear rank statistics under alternatives. Ann Math Statist. 1968;39:325–346.
20. Hall P. The Bootstrap and Edgeworth Expansion. Springer Series in Statistics. New York: Springer-Verlag; 1992.
21. Hall P, Lee ER, Park BU. Bootstrap-based penalty choice for the lasso, achieving oracle performance. Statist Sinica. 2009;19:449–471.
22. Hjort NL, Claeskens G. Frequentist model average estimators. J Amer Statist Assoc. 2003;98:879–899.
23. Hurvich CM, Tsai C-L. Model selection for least absolute deviations regression in small samples. Statist Probab Lett. 1990;9:259–265.
24. Knight K, Fu W. Asymptotics for lasso-type estimators. Ann Statist. 2000;28:1356–1378.
25. Mallows CL. Some comments on Cp. Technometrics. 1973;15:661–675.
26. Perlmutter S, Aldering G, Goldhaber G, Knop R, Nugent P, Castro P, Deustua S, Fabbro S, Goobar A, Groom D, Hook I, Kim A, Kim M, Lee J, Nunes N, Pain R, Pennypacker C, Quimby R, Lidman C, Ellis R, Irwin M, McMahon R, Ruiz-Lapuente P, Walton N, Schaefer B, Boyle B, Filippenko A, Matheson T, Fruchter A, Panagia N, Newberg H, Couch W. Measurements of omega and lambda from 42 high-redshift supernovae. Astrophys J. 1999;517:565–586.
27. Riess A, Filippenko A, Challis P, Clocchiatti A, Diercks A, Garnavich P, Gilliland R, Hogan C, Jha S, Kirshner R, Leibundgut B, Phillips M, Reiss D, Schmidt B, Schommer R, Smith R, Spyromilio J, Stubbs C, Suntzeff N, Tonry J. Observational evidence from supernovae for an accelerating universe and a cosmological constant. Astron J. 1998;116:1009–1038.
28. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Statist Soc Ser B. 1996;58:267–288.
