Author manuscript; available in PMC: 2010 Mar 1.
Published in final edited form as: J Am Stat Assoc. 2005 Mar 1;100(469):296–309. doi: 10.1198/016214504000000818.

Variance Estimation in a Model with Gaussian Sub-Models

Vanja M Dukić *, Edsel A Peña
PMCID: PMC2829998  NIHMSID: NIHMS2370  PMID: 20198124

Abstract

This paper considers the problem of estimating the dispersion parameter in a Gaussian model which is intermediate between a model where the mean parameter is fully known (fixed) and a model where the mean parameter is completely unknown. One of the goals is to understand the implications of the two-step process of first selecting a model among a finite number of sub-models, and then estimating a parameter of interest after the model selection, but using the same sample data. The estimators are classified into global, two-step, and weighted-type estimators. While the global-type estimators ignore the model space structure, the two-step estimators explore the structure adaptively and can be related to pre-test estimators, and the weighted estimators are motivated by the Bayesian approach. Their performances are compared theoretically and through simulations using their risk functions based on a scale invariant quadratic loss function. It is shown that in the variance estimation problem efficiency gains arise by exploiting the sub-model structure through the use of two-step and weighted estimators, especially when the number of competing sub-models is few; but that this advantage may deteriorate or be lost altogether for some two-step estimators as the number of sub-models increases or as the distance between them decreases. Furthermore, it is demonstrated that weighted estimators, arising from properly chosen priors, outperform two-step estimators when there are many competing sub-models or when the sub-models are close to each other, whereas two-step estimators are preferred when the sub-models are highly distinguishable. The results have implications regarding model averaging and model selection issues.

Keywords: Model selection; inference after model selection; model averaging; Bayes estimators; pre-test estimators; two-step estimators; adaptive estimators

1 Model Selection and Inference

In a variety of settings in statistical practice, it is common to encounter the following situation: we observe data χ from a distribution F which is only known to belong to one of p (possibly nested) sub-models ℳ1, ℳ2,…,ℳp; and given χ, we want to estimate a common parameter, or a functional, of F, denoted by τ(F). For example, we might observe χ ∼ F where F belongs to either the gamma or Weibull family of distributions, and wish to estimate the mean of F. Or, in a multiple regression setting with p possible predictors, we might want to choose one of the 2^p competing sub-models (Breiman (1992); Zhang (1992a,b)), and then estimate a common parameter such as the dispersion or the conditional distribution function of the response variable.

The most frequent strategies for estimating τ(F) are: (i) utilizing an estimator developed under a larger model ℳ, which contains all sub-models; (ii) using data χ to first choose a sub-model, and then applying the estimator developed for the chosen sub-model to the same data χ; and (iii) assigning to each sub-model a plausibility measure, possibly using χ, and then forming a weighted combination of the estimators developed under each of the sub-models. In this paper we are interested in determining whether there is a preferred strategy, and whether that preferred strategy depends on the interplay among the competing sub-models, and possibly the parameter we are estimating.

Issues pertaining to the two-step process of inference after model selection and the consequences of “data double-dipping” in strategy (ii) have been discussed in the econometric literature (Judge, Bock, and Yancey (1974), Leamer (1978), Yancey, Judge, and Mandy (1983), and Wallace (1977)). Further investigations of these issues in other settings are in Potscher (1991), Buhlmann (1999), and Burnham and Anderson (1998). The third strategy has been discussed mostly in the context of model averaging, a notion that naturally arises in the Bayesian paradigm (Madigan and Raftery (1994); Raftery, Madigan, and Hoeting (1997); Hoeting, Madigan, Raftery, and Volinsky (1999); Burnham and Anderson (1998), among many others). The first strategy on the other hand may be viewed as having a nonparametric flavor. Though it is clearly intuitive that the first strategy will entail some loss in efficiency, it is not apparent whether (and when) the second strategy is preferred over the third strategy. Clearly, an examination of this problem in the general framework is important to provide guidance to practitioners regarding which strategy is better in general situations. However, a general treatment of the problem may not yield exact results, and one may need to rely on asymptotics, or local asymptotics such as in the work by Claeskens and Hjort (2003) and Hjort and Claeskens (2003).

In this paper, we focus our attention on a prototype Gaussian model which admits exact results and thereby enables concrete comparison of the three strategies. Though the specific model examined in this paper may be perceived as restrictive, it highlights the difficulties inherent in this problem. In addition, the specific estimation problem examined – the estimation of the dispersion parameter – is still the subject of active research (Arnold and Villasenor (1997), Brewster and Zidek (1974), Gelfand and Dey (1977), Maatta and Casella (1990), Ohtani (2001), Pal, Ling, and Lin (1998), Rukhin (1987), Vidaković and DasGupta (1995), and Wallace (1977)).

The paper is outlined as follows. Section 2 will describe the formal setting of the specific problem considered, introduce notation, and present the global-type estimators. Section 3 will present the classical two-step estimators, whereas the Bayes and weighted estimators will be developed in Section 4. Distributional properties and risk comparison will be obtained in Section 5. Concluding remarks are given in Section 6, while Appendix A gathers the technical proofs.

2 Global-Type Estimators

We first describe the specific model examined in this paper. Let χ = (χ1, χ2,…,χn)′ be a vector of IID random variables from an unknown distribution function F(x) = Pr{χ1 ≤ x} which belongs to the two-parameter normal family of distributions ℳ = {N(μ, σ²) : (μ, σ²) ∈ Θ = ℜ × ℜ+}. If interest is on estimating the variance σ², then the uniformly minimum variance unbiased estimator (UMVUE) of σ² is

\[ \hat{\sigma}^2_{\mathrm{UMVU}} = S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\chi_i - \bar{\chi})^2 \qquad (1) \]

where \bar{\chi} = \frac{1}{n}\sum_{i=1}^{n}\chi_i (Lehmann and Casella (1998)). We adopt a decision-theoretic approach for evaluating estimators of σ² via the risk function based on the scale invariant quadratic loss function L : ℜ × Θ → ℜ

\[ L(a,(\mu,\sigma^2)) = \left(\frac{a-\sigma^2}{\sigma^2}\right)^2. \qquad (2) \]

It should be pointed out that the appropriateness of this loss function has been questioned, partly because of Stein (1964)’s demonstration that under this loss the UMVUE of σ² is inadmissible and dominated by the minimum risk equivariant estimator (MRE)

\[ \hat{\sigma}^2_{\mathrm{MRE}} = \frac{1}{n+1}\sum_{i=1}^{n}(\chi_i - \bar{\chi})^2 \qquad (3) \]

(which also turns out to be inadmissible). However, quadratic loss functions are still popular when dealing with the estimation of variance (Arnold and Villasenor (1997), Maatta and Casella (1990), Ohtani (2001), Pal et al. (1998), Rukhin (1987), Vidaković and DasGupta (1995), and Wallace (1977)).

If the model is restricted so that μ = μ0 where μ0 ∈ ℜ is known, so ℳ0 = {N(μ, σ²) : (μ, σ²) ∈ Θ0 = {μ0} × ℜ+}, the UMVUE and MRE of σ² are given, respectively, by

\[ \hat{\sigma}^2_{\mathrm{UMVU}}(\mu_0) = \frac{1}{n}\sum_{i=1}^{n}(\chi_i-\mu_0)^2 \quad\text{and}\quad \hat{\sigma}^2_{\mathrm{MRE}}(\mu_0) = \frac{1}{n+2}\sum_{i=1}^{n}(\chi_i-\mu_0)^2. \qquad (4) \]

Incidentally, σ̂²_UMVU(μ0) is also the minimax estimator of σ² since it is a limit of proper Bayes estimators and it has constant risk. Clearly, we are able to improve on the estimators derived under ℳ by exploiting the knowledge that μ = μ0 under ℳ0: when model ℳ0 holds, the relative efficiency of the estimator σ̂²_UMVU(μ0) in (4) with respect to σ̂²_UMVU in (1) is n/(n − 1). But suppose now that we have a model between ℳ and ℳ0. Specifically, let p be a known positive integer, let μ = {μ1, μ2,…,μp} be a set of known real numbers, and consider the estimation of σ² under the model

\[ \mathcal{M}_p = \mathcal{M}_p(\mu) = \{N(\mu,\sigma^2) : (\mu,\sigma^2) \in \Theta_p \equiv \{\mu_1,\mu_2,\ldots,\mu_p\}\times\Re^{+}\}. \]

In ℳp, in contrast to ℳ0, there is some information about the possible value of μ, but we are not certain about this value. Model ℳp can be viewed as having p sub-models, with the ith sub-model, ℳp,i, being the normal class with unknown variance σ² and known mean μi, that is, ℳp,i = {N(μi, σ²) : σ² > 0}. This particular model arises in a variety of settings. For example, it includes decision problems with a two-element action space such as in the Neyman-Pearson hypothesis testing setting. If we further allow the possibility that μ ∈ ℜ \ {μ1, μ2,…,μp}, we obtain a generalization of the setting utilized by Stein (1964) to derive an estimator dominating σ̂²_MRE (see Brewster and Zidek (1974), Wallace (1977), and Maatta and Casella (1990)). The Stein estimator is given by

\[ \hat{\sigma}^2_{\mathrm{ST}} = \min\left\{\hat{\sigma}^2_{\mathrm{MRE}},\ \frac{1}{n+2}\sum_{i=1}^{n}\chi_i^2\right\}. \qquad (5) \]

This estimator can be viewed as a preliminary test estimator (Sen and Saleh (1987), Lehmann and Casella (1998), and Sclove, Morris, and Radhakrishnan (1972)). The pre-test estimation approach proceeds by testing a null hypothesis that the parameter equals a certain value, and if it accepts the hypothesis then the estimator based on this parameter value is used; otherwise, an estimator under the general model is used. Since a test is to be performed, a level of significance needs to be specified, and so generally pre-test estimators will depend on such a specified level. Interestingly, the Stein estimator in (5), which could be derived as a pre-test estimator with hypothesis specifying that μ = 0, eliminates this dependence by utilizing an ‘optimal’ significance level. The approach implemented in our paper differs from that of pre-test estimation, since we altogether avoid the need for testing a hypothesis to derive our estimators.

Note that in problems dealing with the estimation of the normal variance, it is typically assumed that either model ℳ or model ℳ0 holds. However, in many settings the mean can take only a finite number of possible values, such as, for example, in the Neyman-Pearson lemma, where the variance estimator is typically the sample variance, which does not exploit the fact that there are only two possible means.
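To make the global-type estimators of this section concrete, the following Python sketch (an illustration, not part of the original paper) computes σ̂²_UMVU, σ̂²_MRE, and the Stein estimator (5); the simulated data and the hypothesized mean 0 used in the Stein estimator are assumptions made only for the example.

```python
import numpy as np

def global_variance_estimators(x):
    """Global-type estimators of sigma^2 under the full normal model M,
    plus Stein's estimator (5), which pre-tests the hypothesis mu = 0."""
    x = np.asarray(x, dtype=float)
    n = x.size
    ss_centered = np.sum((x - x.mean()) ** 2)
    umvu = ss_centered / (n - 1)                       # equation (1)
    mre = ss_centered / (n + 1)                        # equation (3)
    stein = min(mre, np.sum(x ** 2) / (n + 2))         # equation (5)
    return umvu, mre, stein

# toy usage with simulated data (true sigma^2 = 4); values are illustrative only
rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=2.0, size=10)
print(global_variance_estimators(x))
```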

3 Classical Two-Step Estimators

Under ℳp the likelihood function for the sample realization χ = x = (x1, x2,…, xn)′ is

\[ L(\mu,\sigma^2) = L(\mu,\sigma^2\,|\,x) = \prod_{i=1}^{p} L_i(\mu_i,\sigma^2)^{M_i} \qquad (6) \]

where, for i = 1, 2,…, p, with I{·} denoting the indicator function and Mi = I{μ = μi},

\[ L_i(\mu_i,\sigma^2) = \left(\frac{1}{\sqrt{2\pi}}\right)^{n}\left(\frac{1}{\sigma^2}\right)^{n/2}\exp\left\{-\frac{n\hat{\sigma}_i^2}{2\sigma^2}\right\} \quad\text{and}\quad \hat{\sigma}_i^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j-\mu_i)^2. \qquad (7) \]

L_i(μi, σ²) is maximized with respect to σ² at σ̂_i², so L_i(μi, σ̂_i²) = sup_{σ²∈ℜ+} L_i(μi, σ²). Define the likelihood-based model selector M̂ = M̂(χ) via

\[ \hat{M} = \arg\max_{1\le i\le p} L_i(\mu_i,\hat{\sigma}_i^2) = \arg\min_{1\le i\le p}\hat{\sigma}_i^2 = \arg\min_{1\le i\le p}|\bar{\chi}-\mu_i|. \]

One could employ model selectors different from M̂, such as the highest posterior probability (à la Schwarz’ Bayesian criterion (SBC) (Schwarz (1978)) or the Akaike information criterion (AIC) (Akaike (1973), Burnham and Anderson (1998))). In this paper we restrict our attention to the intuitive selector M̂, which could actually be viewed also as a highest posterior probability model selector associated with a flat prior distribution. The maximum likelihood estimator (MLE) of σ² under ℳp is

\[ \hat{\sigma}^2_{p,\mathrm{MLE}} = \hat{\sigma}^2_{\hat{M}} = \sum_{i=1}^{p} I\{\hat{M}=i\}\,\hat{\sigma}_i^2, \qquad (8) \]

a two-step estimator, with the first stage selecting the sub-model and the second stage using the MLE of σ² in the chosen sub-model. An alternative to the estimator (8) is to use the sub-model’s MRE instead of the MLE of σ²:

\[ \hat{\sigma}^2_{p,\mathrm{MRE}} = \hat{\sigma}^2_{\mathrm{MRE},\hat{M}} = \sum_{i=1}^{p} I\{\hat{M}=i\}\,\hat{\sigma}^2_{\mathrm{MRE},i} = \sum_{i=1}^{p} I\{\hat{M}=i\}\,\frac{n\hat{\sigma}_i^2}{n+2}. \qquad (9) \]

Note that the label ‘p,MRE’ (and similar labels in the sequel) is a misnomer since this estimator need not be minimum risk equivariant under model ℳp. However, we keep the name for clarity.
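As an illustration of the two-step construction, the hedged Python sketch below implements the selector M̂ and the resulting estimators (8) and (9), together with the two-step estimator based on the sub-models' limiting Bayes estimators introduced later as (17); the candidate means and the simulated sample are assumptions for the example only.

```python
import numpy as np

def two_step_variance_estimators(x, mus):
    """Two-step estimators under M_p: select the sub-model whose known mean is
    closest to the sample mean (the selector M-hat), then rescale its MLE.
    `mus` holds the candidate means mu_1, ..., mu_p (assumed known)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mus = np.asarray(mus, dtype=float)
    sigma2_i = np.array([np.mean((x - mu) ** 2) for mu in mus])  # sub-model MLEs, eq. (7)
    i_hat = np.argmin(np.abs(x.mean() - mus))                    # selector M-hat
    mle = sigma2_i[i_hat]                                        # eq. (8)
    mre = n * sigma2_i[i_hat] / (n + 2)                          # eq. (9)
    alb = n * sigma2_i[i_hat] / (n - 2)                          # two-step 'ALB' estimator, eq. (17) later
    return mle, mre, alb

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=20)
print(two_step_variance_estimators(x, mus=[-1.0, 0.0, 1.0]))
```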

4 Bayes and Weighted Estimators

We focus on the class of prior densities of (μ, σ²) which consists of the product of a multinomial probability function and an inverse gamma density:

\[ \pi(\mu,\sigma^2\,|\,\tilde{\theta},\kappa,\beta) = \left(\prod_{i=1}^{p}\tilde{\theta}_i^{m_i}\right)\frac{\beta^{\kappa-1}}{\Gamma(\kappa-1)}\left(\frac{1}{\sigma^2}\right)^{\kappa}\exp\left(-\frac{\beta}{\sigma^2}\right), \qquad (10) \]

where σ² > 0, m_i = I{μ = μi} so that \sum_{i=1}^{p} m_i = 1, 0 ≤ θ̃_i ≤ 1 with \sum_{i=1}^{p} θ̃_i = 1, β > 0, and κ > 1. From (10) and (6), we obtain the posterior density of (μ, σ²) given χ = x:

\[ \pi(\mu,\sigma^2\,|\,x) = C\prod_{i=1}^{p}\left\{\tilde{\theta}_i\left(\frac{1}{\sigma^2}\right)^{n/2+\kappa}\exp\left(-\frac{1}{\sigma^2}\left[\frac{n\hat{\sigma}_i^2}{2}+\beta\right]\right)\right\}^{m_i}. \qquad (11) \]

Note that in the “vectorized form,” and with a slight abuse of notation, π(μ, σ²|x) = π(m, σ²|x) because {μ = μi} = {m = 1i}, where 1i is a p × 1 vector with ith component equal to 1 and all others equal to 0. It follows that

\[ C = \frac{1}{\Gamma(n/2+\kappa-1)}\left\{\sum_{i=1}^{p}\tilde{\theta}_i\left(n\hat{\sigma}_i^2/2+\beta\right)^{-(n/2+\kappa-1)}\right\}^{-1}. \qquad (12) \]

4.1 Posterior Probabilities

From the posterior distribution in (11), the marginal posterior density (with respect to counting measure) of μ, or equivalently of m, is

\[ \pi(m\,|\,x) = C\prod_{i=1}^{p}\left\{\tilde{\theta}_i\,\Gamma(n/2+\kappa-1)\left(n\hat{\sigma}_i^2/2+\beta\right)^{-(n/2+\kappa-1)}\right\}^{m_i} = \prod_{i=1}^{p}\{\theta_i(\kappa,\beta,n,x)\}^{m_i}, \]

where, for i = 1, 2,…, p, the posterior probability that the sub-model ℳp,i is true, is

\[ \theta_i(\kappa,\beta,n,x) = \frac{\tilde{\theta}_i\left(n\hat{\sigma}_i^2/2+\beta\right)^{-(n/2+\kappa-1)}}{\sum_{j=1}^{p}\tilde{\theta}_j\left(n\hat{\sigma}_j^2/2+\beta\right)^{-(n/2+\kappa-1)}}. \qquad (13) \]

Note, as expected, that if θ̃i > 0 and ℳp,i is the true sub-model, θi(κ, β, n, χ), when viewed as a function of χ with (μ, σ²) fixed, converges to 1 with probability one (wp1) as n → ∞. This is because if ℳp,i is the correct model, σ̂_i² converges wp1 to σ² by the strong law of large numbers (SLLN); whereas, for i′ ≠ i, σ̂_{i′}² converges wp1 to σ² + (μi − μi′)².
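A small Python sketch of the posterior probability formula (13) may help; it is an illustrative implementation only, with the prior settings (uniform θ̃_i and the particular κ and β values) chosen purely as assumptions for the example, and the computation done on the log scale to avoid underflow for large n.

```python
import numpy as np

def submodel_posterior_probs(x, mus, kappa=2.0, beta=0.5, prior=None):
    """Posterior probabilities theta_i(kappa, beta, n, x) of the p sub-models,
    equation (13), computed on the log scale for numerical stability.
    `prior` is (theta-tilde_1, ..., theta-tilde_p); uniform if None."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mus = np.asarray(mus, dtype=float)
    prior = np.full(mus.size, 1.0 / mus.size) if prior is None else np.asarray(prior, float)
    sigma2_i = np.array([np.mean((x - mu) ** 2) for mu in mus])      # sub-model MLEs
    log_w = np.log(prior) - (n / 2 + kappa - 1) * np.log(n * sigma2_i / 2 + beta)
    log_w -= log_w.max()                                             # guard against underflow
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=30)
print(submodel_posterior_probs(x, mus=[-1.0, 0.0, 1.0]))
```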

4.2 Estimators

The marginal posterior density function of σ² is directly obtained from (11) to be

\[ \pi(\sigma^2\,|\,x) = C\sum_{i=1}^{p}\tilde{\theta}_i\left(\frac{1}{\sigma^2}\right)^{\kappa+n/2}\exp\left[-\frac{1}{\sigma^2}\left(\frac{n\hat{\sigma}_i^2}{2}+\beta\right)\right] I\{\sigma^2>0\}. \qquad (14) \]

The posterior mean, which is the Bayes estimator of σ² under the loss function L in (2), is then

\[ \hat{\sigma}^2_{p,\mathrm{Bayes}}(\kappa,\beta,\theta) = \sum_{i=1}^{p}\theta_i(\kappa,\beta,n,x)\left\{\frac{n}{n+2(\kappa-2)}\,\hat{\sigma}_i^2 + \frac{2(\kappa-2)}{n+2(\kappa-2)}\left(\frac{\beta}{\kappa-2}\right)\right\}. \qquad (15) \]

Note that β/(κ − 2) is the prior mean of σ², provided κ > 2 (the condition also needed for the prior variance of σ² to exist), whereas σ̂_i² is the MLE of σ² under the ℳp,i model. This estimator mixes the Bayes estimators of σ² from each sub-model in a data-dependent manner, using the posterior probabilities of the p sub-models. Furthermore, the Bayes estimator of σ² for the ℳp,i sub-model is a convex combination of the ℳp,i-model MLE and the prior mean of σ², a well-known result.

To obtain limiting Bayes estimators for σ², we consider improper priors arising by setting θ̃i = 1/p, i = 1, 2, …, p, and β → 0. We examine four κ values: κ → 1, κ → 3/2, κ = 2, and κ = 3. The rationale for these choices is as follows: κ → 1 amounts to placing Jeffreys’ non-informative prior on σ² in each of the p sub-models, since Jeffreys’ prior for σ² (with mean known) is proportional to 1/σ² (Robert (2001)); κ → 3/2 corresponds to the Jeffreys’ prior for σ² when the mean is unknown in the normal model, since in this case Jeffreys’ prior is proportional to (1/σ²)^{3/2}; κ = 2 and κ = 3 produce (limiting) Bayes estimators that are convex combinations of the sub-models’ MLEs and MREs, respectively. Table 1 lists the sub-models’ posterior probabilities and the resulting limiting Bayes estimators of σ². Each of the sub-models’ posterior probabilities associated with κ ∈ {1, 3/2, 2, 3} given in Table 1 could also be utilized to form estimators which are convex combinations of the sub-models’ MREs. These new estimators need not however be limiting Bayes with respect to our class of priors. These ‘weighted’ estimators are defined as:

\[ \hat{\sigma}^2_{p,\mathrm{PLB1}} = \left(\frac{n-2}{n+2}\right)\hat{\sigma}^2_{p,\mathrm{LB1}};\quad \hat{\sigma}^2_{p,\mathrm{PLB2}} = \left(\frac{n-1}{n+2}\right)\hat{\sigma}^2_{p,\mathrm{LB2}};\quad \hat{\sigma}^2_{p,\mathrm{PLB3}} = \left(\frac{n}{n+2}\right)\hat{\sigma}^2_{p,\mathrm{LB3}}. \qquad (16) \]

Table 1.

Sub-models’ posterior probabilities and limiting Bayes estimators of σ² for different values of κ when θ̃i = 1/p and β → 0.

κ | Sub-model posterior probabilities θi(κ, 0, n, x), i = 1, 2, …, p | Limiting Bayes estimator σ̂²_{p,LBk}, k = 1, 2, 3, 4
1 | \theta_i^1 = (\hat{\sigma}_i^2)^{-n/2}\big/\sum_{j=1}^{p}(\hat{\sigma}_j^2)^{-n/2} | \hat{\sigma}^2_{p,\mathrm{LB1}} = \left(\frac{n}{n-2}\right)\sum_{i=1}^{p}\theta_i^1\hat{\sigma}_i^2
3/2 | \theta_i^2 = (\hat{\sigma}_i^2)^{-(n+1)/2}\big/\sum_{j=1}^{p}(\hat{\sigma}_j^2)^{-(n+1)/2} | \hat{\sigma}^2_{p,\mathrm{LB2}} = \left(\frac{n}{n-1}\right)\sum_{i=1}^{p}\theta_i^2\hat{\sigma}_i^2
2 | \theta_i^3 = (\hat{\sigma}_i^2)^{-(n+2)/2}\big/\sum_{j=1}^{p}(\hat{\sigma}_j^2)^{-(n+2)/2} | \hat{\sigma}^2_{p,\mathrm{LB3}} = \sum_{i=1}^{p}\theta_i^3\hat{\sigma}_i^2
3 | \theta_i^4 = (\hat{\sigma}_i^2)^{-(n+4)/2}\big/\sum_{j=1}^{p}(\hat{\sigma}_j^2)^{-(n+4)/2} | \hat{\sigma}^2_{p,\mathrm{LB4}} = \left(\frac{n}{n+2}\right)\sum_{i=1}^{p}\theta_i^4\hat{\sigma}_i^2

Note also from (15) that the estimators σ̃²_{LB,i} = (n/(n − 2))σ̂_i², the ones whose convex combination is being formed in σ̂²_{p,LB1}, are the limiting Bayes estimators of σ² for each of the p sub-models under Jeffreys’ non-informative prior when the sub-model’s mean is known (arising from κ → 1). The estimators in Table 1 and in (16) have different flavors than the MLE of σ² given in (8): in the latter, we choose one among the p estimators of σ², while the Bayes and weighted estimators are mixing sub-model estimators according to the sub-models’ posterior probabilities.

Finally, we define the two-step estimator based on the sub-models’ limiting Bayes estimators:

\[ \hat{\sigma}^2_{p,\mathrm{ALB}} = \tilde{\sigma}^2_{\mathrm{LB},\hat{M}} = \left(\frac{n}{n-2}\right)\sum_{i=1}^{p} I\{\hat{M}=i\}\,\hat{\sigma}_i^2. \qquad (17) \]

This belongs to the same class of estimators as σ̂²_{p,MLE} and σ̂²_{p,MRE}, differing just in the multipliers which are functions of n only. Note that for the purposes of obtaining risk functions, it therefore suffices to derive formulas for the mean and variance functions of σ̂²_{p,MLE}.
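The following Python sketch (an illustration under assumed data and sub-model means) assembles the limiting Bayes estimators of Table 1 and the weighted estimators (16) from the sub-model MLEs; the loop pairs each κ with its exponent n/2 + κ − 1 and multiplier, and the PLB versions simply reuse the same weights with the n/(n + 2) factor.

```python
import numpy as np

def limiting_bayes_estimators(x, mus):
    """Limiting Bayes estimators of sigma^2 from Table 1 (kappa in {1, 3/2, 2, 3},
    theta-tilde_i = 1/p, beta -> 0) and the weighted 'PLB' estimators of (16)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mus = np.asarray(mus, dtype=float)
    sigma2_i = np.array([np.mean((x - mu) ** 2) for mu in mus])     # sub-model MLEs
    out = {}
    for k, (kappa, mult) in enumerate(
            [(1.0, n / (n - 2)), (1.5, n / (n - 1)), (2.0, 1.0), (3.0, n / (n + 2))], start=1):
        expo = n / 2 + kappa - 1                                    # exponent from Table 1
        log_w = -expo * np.log(sigma2_i)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                                                # posterior weights theta_i^k
        out[f"LB{k}"] = mult * np.sum(w * sigma2_i)
        if k <= 3:
            out[f"PLB{k}"] = (n / (n + 2)) * np.sum(w * sigma2_i)   # equation (16)
    return out

rng = np.random.default_rng(3)
x = rng.normal(loc=0.5, scale=1.5, size=25)
print(limiting_bayes_estimators(x, mus=[0.0, 0.5, 1.0]))
```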

5 Comparison of Estimators

The goal of this section is to compare the performances of the estimators given in Table 1 and (16), with the estimators developed under ℳ (σ̂²_UMVU and σ̂²_MRE), with the two-step estimators (σ̂²_{p,MLE}, σ̂²_{p,MRE}, and σ̂²_{p,ALB}), and, for completeness, with the Stein estimator in (5). Performance will be measured by their risk functions arising from the loss function L in (2). In particular, we address the following questions: (i) How much efficiency is lost by using the estimators developed under the wider model ℳ when model ℳp holds? (ii) How do the limiting Bayes and weighted estimators σ̂²_{p,LBk} and σ̂²_{p,PLBk} compare with the ℳp MLE-based and MRE-based estimators? (iii) Do the advantages of the ℳp-based estimators over ℳ-based estimators decrease as the dimension p increases and/or the spacings among the μ1,…,μp decrease?

5.1 Distributional Representations

It is well-known that, provided n > 1, (n − 1)σ̂²_UMVU/σ² ∼ χ²_{n−1}, so E{σ̂²_UMVU/σ²} = 1 and Var{σ̂²_UMVU/σ²} = 2/(n − 1). Therefore, the risk function of σ̂²_UMVU with respect to the loss function L in (2) is R(σ̂²_UMVU, (μ, σ²)) = 2/(n − 1). By exploiting the relationship between σ̂²_UMVU and σ̂²_MRE, the risk function of the latter is easily found to be R(σ̂²_MRE, (μ, σ²)) = 2/(n + 1). This demonstrates the known fact that σ̂²_UMVU is inadmissible. To compare estimator performances, we will use σ̂²_UMVU as the baseline, so the efficiency of an estimator σ̂² will be given by

\[ \mathrm{Eff}(\hat{\sigma}^2 : \hat{\sigma}^2_{\mathrm{UMVU}}) = \frac{R(\hat{\sigma}^2_{\mathrm{UMVU}},(\mu,\sigma^2))}{R(\hat{\sigma}^2,(\mu,\sigma^2))}. \qquad (18) \]

Thus, in particular, Eff(σ̂²_MRE : σ̂²_UMVU) = (n + 1)/(n − 1) = 1 + 2/(n − 1).

We present some distributional properties of the estimators which will be used to derive the exact expressions of the risk functions of σ̂²_{p,MLE}, and second-order approximations to the risk functions of σ̂²_{p,LBk} and σ̂²_{p,PLBk}. Let Z ∼ N(0, 1) and Z = (Z1, Z2,…,Zn)′ ∼ Nn(0, I). For the vector of means μ = (μ1, μ2,…,μp)′ with μ_{i0} being the true mean (i0 ∈ {1,2,…,p}), we let

\[ \Delta \equiv \Delta(\mu,\sigma) = \frac{\mu - \mu_{i_0}\mathbf{1}}{\sigma} \qquad (19) \]

where 1 = (1,1,…,1)′. Note that Δ will always have a zero component under ℳp. In the sequel, the ‘equal-in-distribution’ relation is denoted by \overset{d}{=}. To achieve a more fluid presentation, formal proofs of lemmas, propositions, theorems, and corollaries are relegated to Appendix A.

Proposition 5.1

Under ℳp with μ_{i0} the true mean, nσ̂_i²/σ² \overset{d}{=} W + V_i², i = 1, 2, …, p, where W ∼ χ²_{n−1}, V = (V_1, V_2, …, V_p)′ ∼ N_p(−\sqrt{n}\,Δ, J ≡ 11′), and W and V are independent.

From Proposition 5.1, by exploiting the independence between W and V and using the iterated expectation and covariance rules, and by noting that

\[ E\{W^{-k/2}\} = (1/2)^{k/2}\left[\Gamma((n-k-1)/2)\big/\Gamma((n-1)/2)\right] \]

holds for any k < n − 1, the following corollary immediately follows.

Corollary 5.1

Under the conditions of Proposition 5.1, nσ̂_i²/σ² \overset{d}{=} W(1 + T_i²), i = 1, 2, …, p, with T = (T_1,…,T_p)′ = V/\sqrt{W}. The distribution of T depends on (μ, σ²) only through Δ and, provided that n > 3, the mean vector and covariance matrix of T are given, respectively, by

\[ E(T) = \nu \equiv -C_n\Delta \quad\text{and}\quad \mathrm{Cov}(T,T') = \frac{1}{n-3}J + \left(\frac{n}{(n-3)C_n^2}-1\right)\nu\nu' \]

with C_n = \sqrt{n/2}\,\Gamma((n-2)/2)\big/\Gamma((n-1)/2).

5.2 Representation and Risk Function of σ̂²_{p,MLE}

We now give a representation of σ̂²_{p,MLE} and obtain the exact expressions for its mean, variance, and risk. For a given Δ, let Δ_(1) < Δ_(2) < … < Δ_(p) denote the associated ordered values.

Theorem 5.1

Let μi0 be the true mean. Then under ℳp,

\[ n\hat{\sigma}^2_{p,\mathrm{MLE}}/\sigma^2 \overset{d}{=} W + \sum_{i=1}^{p} I\{L(\Delta_{(i)},\Delta) < Z < U(\Delta_{(i)},\Delta)\}\,(Z - \sqrt{n}\,\Delta_{(i)})^2 \]

where, under the convention that Δ(0) = −∞ and Δ(p+1) = +∞,

\[ L(\Delta_{(i)},\Delta) = \frac{\sqrt{n}}{2}\left[\Delta_{(i)} + \Delta_{(i-1)}\right] \quad\text{and}\quad U(\Delta_{(i)},\Delta) = \frac{\sqrt{n}}{2}\left[\Delta_{(i)} + \Delta_{(i+1)}\right], \]

W ∼ χ²_{n−1}, Z ∼ N(0, 1), and W and Z are independent.

Define the events Ω_(i) = {L(Δ_(i), Δ) < Z < U(Δ_(i), Δ)}, i = 1, 2, …, p. The collection of sets {{M̂ = i}, i = 1, 2, …, p} is in one-to-one correspondence with the collection {Ω_(1), Ω_(2), …, Ω_(p)}, as can be seen from Theorem 5.1, and thus Ω_(i), i = 1, 2, …, p, are disjoint. Using Theorem 5.1, we can now obtain expressions for the mean and variance of σ̂²_{p,MLE}. For i = 1, 2, …, p, we let

\[ P_{(i)}(\Delta) \equiv \Pr\{\Omega_{(i)}\} = \Phi\!\left(\frac{\sqrt{n}}{2}\left(\Delta_{(i)}+\Delta_{(i+1)}\right)\right) - \Phi\!\left(\frac{\sqrt{n}}{2}\left(\Delta_{(i)}+\Delta_{(i-1)}\right)\right), \qquad (20) \]

where Φ(·) is the standard normal distribution function. In the sequel, we let ϕ(·) denote the density function of a standard normal random variable.

Theorem 5.2

Under the conditions of Theorem 5.1,

\[ E_{p\mathrm{MLE}}(\Delta) \equiv E\left\{\frac{\hat{\sigma}^2_{p,\mathrm{MLE}}}{\sigma^2}\right\} = 1 - \frac{2}{\sqrt{n}}\sum_{i=1}^{p}\Delta_{(i)}\left[\phi(L(\Delta_{(i)},\Delta)) - \phi(U(\Delta_{(i)},\Delta))\right] + \sum_{i=1}^{p}\Delta_{(i)}^2\left[\Phi(U(\Delta_{(i)},\Delta)) - \Phi(L(\Delta_{(i)},\Delta))\right]. \]

Next, we present an expression for the variance function of the estimator σ̂²_{p,MLE}. Towards this end, we introduce some notation to simplify the presentation. For k ∈ ℤ+ = {0, 1, 2, …}, define

\[ \xi(k;\Omega_{(i)}) \equiv E\{Z^k I(\Omega_{(i)})\} = \int_{L(\Delta_{(i)},\Delta)}^{U(\Delta_{(i)},\Delta)} z^k \phi(z)\,dz. \]

Using this, observe that by the binomial expansion, for m ∈ ℤ+,

\[ \zeta_{(i)}(m) \equiv E\{I(\Omega_{(i)})(Z-\sqrt{n}\,\Delta_{(i)})^m\} = \sum_{k=0}^{m}(-1)^{m-k}\binom{m}{k}\left(\sqrt{n}\,\Delta_{(i)}\right)^{m-k}\,\xi(k;\Omega_{(i)}). \qquad (21) \]

To compute the quantity ξ(k; Ω_(i)), observe that for k ∈ ℤ+ and t ∈ ℜ,

\[ \int_{-\infty}^{t} z^k\phi(z)\,dz = \frac{(-1)^k 2^{(k-1)/2}}{\sqrt{2\pi}}\,\Gamma\!\left(\frac{k+1}{2}\right)\times\begin{cases}\Pr\{\chi^2_{k+1} > t^2\} & \text{if } t < 0,\\ 1 + (-1)^k\Pr\{\chi^2_{k+1} < t^2\} & \text{if } t \ge 0.\end{cases} \]

Using the above formula, we obtain ξ(k; Ω_(i)) according to

\[ \xi(k;\Omega_{(i)}) = \int_{-\infty}^{U(\Delta_{(i)},\Delta)} z^k\phi(z)\,dz - \int_{-\infty}^{L(\Delta_{(i)},\Delta)} z^k\phi(z)\,dz. \qquad (22) \]

Theorem 5.3

Under the conditions of Theorem 5.1,

\[ V_{p\mathrm{MLE}}(\Delta) \equiv \mathrm{Var}\left\{\frac{\hat{\sigma}^2_{p,\mathrm{MLE}}}{\sigma^2}\right\} = \frac{1}{n}\left\{2\left(1-\frac{1}{n}\right) + \frac{1}{n}\left[\sum_{i=1}^{p}\zeta_{(i)}(4) - \left(\sum_{i=1}^{p}\zeta_{(i)}(2)\right)^2\right]\right\}. \]
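Since the ζ_(i)(m) in Theorem 5.3 are truncated normal moments, E_{pMLE}(Δ) and V_{pMLE}(Δ) are easy to evaluate numerically. The Python sketch below does so by direct quadrature (an illustrative check only, with the example Δ and n chosen arbitrarily); it uses SciPy rather than the closed-form chi-square expressions in (22).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def pmle_mean_var(delta, n):
    """Exact mean and variance of sigma2_{p,MLE}/sigma^2 (Theorems 5.2 and 5.3)
    for a given Delta vector and sample size n, with the truncated moments
    zeta_(i)(m) evaluated by numerical quadrature over each interval Omega_(i)."""
    d = np.sort(np.asarray(delta, dtype=float))
    p = d.size
    lo = np.concatenate(([-np.inf], (np.sqrt(n) / 2) * (d[1:] + d[:-1])))   # L(Delta_(i), Delta)
    hi = np.concatenate(((np.sqrt(n) / 2) * (d[1:] + d[:-1]), [np.inf]))    # U(Delta_(i), Delta)

    def zeta(i, m):
        # E[ I(Omega_(i)) (Z - sqrt(n) Delta_(i))^m ], cf. equation (21)
        f = lambda z: (z - np.sqrt(n) * d[i]) ** m * norm.pdf(z)
        return quad(f, lo[i], hi[i])[0]

    mean = 1 - (2 / np.sqrt(n)) * sum(d[i] * (norm.pdf(lo[i]) - norm.pdf(hi[i])) for i in range(p)) \
           + sum(d[i] ** 2 * (norm.cdf(hi[i]) - norm.cdf(lo[i])) for i in range(p))
    z2 = sum(zeta(i, 2) for i in range(p))
    z4 = sum(zeta(i, 4) for i in range(p))
    var = (1 / n) * (2 * (1 - 1 / n) + (1 / n) * (z4 - z2 ** 2))
    return mean, var

print(pmle_mean_var([-0.25, 0.0, 0.25], n=10))
```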

In the situation where there are only two sub-models so p = 2, the expressions for the mean and variance functions of σ̂²_{p,MLE}/σ² simplify. These simplified forms are provided in the following corollary. The proofs of these results are straightforward, hence to conserve space, we omit them but instead refer the reader to the more detailed technical report by Dukić and Peña (2003).

Corollary 5.2

If p = 2 so Δ = (0, Δ), then under the conditions of Theorem 5.1,

\[ E_{p\mathrm{MLE}}(\Delta) = 1 - \frac{2}{\sqrt{n}}|\Delta|\left\{\phi\!\left(\frac{\sqrt{n}}{2}|\Delta|\right) - \frac{\sqrt{n}}{2}|\Delta|\left[1-\Phi\!\left(\frac{\sqrt{n}}{2}|\Delta|\right)\right]\right\}; \]
\[ V_{p\mathrm{MLE}}(\Delta) = \frac{2}{n} + |\Delta|^4\,\Phi\!\left(\frac{\sqrt{n}}{2}|\Delta|\right)\left[1-\Phi\!\left(\frac{\sqrt{n}}{2}|\Delta|\right)\right] - \frac{4}{\sqrt{n}}|\Delta|^3\,\Phi\!\left(\frac{\sqrt{n}}{2}|\Delta|\right)\int_{\frac{\sqrt{n}}{2}|\Delta|}^{\infty} z\,\phi(z)\,dz - \frac{4}{n^{3/2}}|\Delta|\left\{\int_{\frac{\sqrt{n}}{2}|\Delta|}^{\infty} z^3\phi(z)\,dz - \int_{\frac{\sqrt{n}}{2}|\Delta|}^{\infty} z\,\phi(z)\,dz\right\} + \frac{1}{n}|\Delta|^2\left\{6\int_{\frac{\sqrt{n}}{2}|\Delta|}^{\infty} z^2\phi(z)\,dz - 4\left(\int_{\frac{\sqrt{n}}{2}|\Delta|}^{\infty} z\,\phi(z)\,dz\right)^2 - 2\left[1-\Phi\!\left(\frac{\sqrt{n}}{2}|\Delta|\right)\right]\right\}. \]

We note from the expressions in Corollary 5.2 that, for a fixed n, lim_{|Δ|→0} E_{pMLE}(Δ) = 1 and lim_{|Δ|→0} V_{pMLE}(Δ) = 2/n = Var{σ̂²_MLE/σ²}, where σ̂²_MLE = \frac{1}{n}\sum_{i=1}^{n}(χ_i − μ_{i0})² is the ML (also UMVU, minimax) estimator of σ² under the true model. Also, for a fixed Δ, we see that lim_{n→∞} E_{pMLE}(Δ) = 1 and lim_{n→∞} {n V_{pMLE}(Δ)} = 2.

The next result in Corollary 5.3 shows that even though the sub-models’ MLEs are each unbiased for σ², the two-step estimator σ̂²_{p,MLE}, which employs the MLE of the sub-model selected by the model selector M̂, is a negatively biased estimator of σ². The result is an immediate consequence of Corollary 5.2 by noting that the continuous function g(u) = ϕ(u) − u[1 − Φ(u)] is positive by virtue of the facts that lim_{u↓0} g(u) > 0, lim_{u→∞} g(u) = 0, and g′(u) = ϕ′(u) + uϕ(u) − [1 − Φ(u)] = −[1 − Φ(u)] < 0 since ϕ′(u) = −uϕ(u).

Corollary 5.3

Under the conditions of Corollary 5.2 with Δ ≠ 0, E{σ̂²_{p,MLE}} < σ²; that is, σ̂²_{p,MLE} is negatively biased for σ².

Now that we have the exact expressions for the mean and variance of σ̂²_{p,MLE}/σ², we could obtain the risk function of σ̂²_{p,MLE} under ℳp and loss L in (2) as

\[ R(\hat{\sigma}^2_{p,\mathrm{MLE}},(\mu_{i_0},\sigma^2)) = V_{p\mathrm{MLE}}(\Delta) + \left[E_{p\mathrm{MLE}}(\Delta) - 1\right]^2. \qquad (23) \]

Finally, for σ̂²_{p,MLE}, we address the question of what happens when p increases and the spacings in Δ decrease. This will indicate whether we will lose the advantage of ℳp-based estimators over ℳ-based estimators. The proof of Theorem 5.4 is rather lengthy and hence omitted; instead we refer the reader to the technical report by Dukić and Peña (2003).

Theorem 5.4

Given n fixed, if p → ∞, max_{2≤i≤p}|Δ_(i) − Δ_(i−1)| → 0, with Δ_(1) → Δmin ∈ (−∞, 0], and Δ_(p) → Δmax ∈ [0, ∞), then

\[ E_{p\mathrm{MLE}}(\Delta) \to 1 - \frac{1}{n}\int_{\sqrt{n}\Delta_{\min}}^{\sqrt{n}\Delta_{\max}} w^2\phi(w)\,dw + \frac{2}{\sqrt{n}}\left\{\Delta_{\min}\phi(\sqrt{n}\Delta_{\min}) - \Delta_{\max}\phi(\sqrt{n}\Delta_{\max})\right\} + \left\{(\Delta_{\min})^2\Phi(\sqrt{n}\Delta_{\min}) + (\Delta_{\max})^2\left[1-\Phi(\sqrt{n}\Delta_{\max})\right]\right\}; \]
\[ V_{p\mathrm{MLE}}(\Delta) \to \frac{2}{n}\left(1-\frac{1}{n}\right) + \frac{1}{n^2}\Big[\left\{E[(Z-\sqrt{n}\Delta_{\min})^4 I(Z<\sqrt{n}\Delta_{\min})] + E[(Z-\sqrt{n}\Delta_{\max})^4 I(Z>\sqrt{n}\Delta_{\max})]\right\} - \left\{E[(Z-\sqrt{n}\Delta_{\min})^2 I(Z<\sqrt{n}\Delta_{\min})] + E[(Z-\sqrt{n}\Delta_{\max})^2 I(Z>\sqrt{n}\Delta_{\max})]\right\}^2\Big]. \]

Letting Δmin → − ∞ and Δmax → ∞, EpMLE(Δ) → 1 − 1/n and VpMLE(Δ) → (2/n) (1 − 1/n).

Using Theorem 5.4 we can now address whether, as p increases, the two-step estimator developed under model ℳp loses its advantage over the estimator developed under the more general model ℳ. For this purpose we have the following corollary.

Corollary 5.4

With n > 1 fixed, if as p → ∞, max_{2≤i≤p}|Δ_(i) − Δ_(i−1)| → 0, Δ_(1) → −∞, and Δ_(p) → ∞, then (i) Eff(σ̂²_{p,MLE} : σ̂²_UMVU) → 2n²/[(n − 1)(2n − 1)] > 1; (ii) Eff(σ̂²_{p,MRE} : σ̂²_UMVU) → 2(n + 2)²/[(n − 1)(2n + 7)] > 1; (iii) Eff(σ̂²_{p,MRE} : σ̂²_{p,MLE}) → (2n − 1)(n + 2)²/[(2n + 7)n²] > 1; and (iv) Eff(σ̂²_{p,MRE} : σ̂²_MRE) → 2(n + 2)²/[(n + 1)(2n + 7)] < 1. In addition, σ̂²_{p,ALB} is dominated by σ̂²_UMVU.

The fourth result in Corollary 5.4 indicates that when the number of sub-models increases indefinitely the estimator σ̂²_MRE (which is the minimum risk estimator under the general model ℳ) dominates the two-step estimator σ̂²_{p,MRE} (which was developed by exploiting the sub-model structure of ℳp). Using the limiting results for p = 2 and as |Δ| → 0, stated after Corollary 5.2, we find that the limiting risk function of σ̂²_{p,MRE} is 2/(n + 2), which is smaller than 2/(n + 1), the risk function of σ̂²_MRE. This shows that when the number of sub-models is small, we can gain efficiency by using the two-step estimator developed under model ℳp. These results agree with our intuition: when the number of sub-models increases it is better to utilize the best estimator developed under the more general model. However, as will be seen in the simulation studies reported later in the paper, the performance of the weighted and Bayes-type estimators seems not to be degraded by an increase in the number of sub-models.

5.3 Representation of Limiting Bayes and Weighted Estimators

We now provide distributional representations useful for the limiting Bayes estimators σ̂²_{p,LBk} and the weighted estimators σ̂²_{p,PLBk} under ℳp, in order to find an approximation to the risk functions of these estimators. For α > 0, define the “umbrella” estimator as

\[ \hat{\sigma}^2_{p,\mathrm{LB}} \equiv \hat{\sigma}^2_{p,\mathrm{LB}}(\alpha) = \sum_{i=1}^{p}\left\{\frac{(\hat{\sigma}_i^2)^{-\alpha}}{\sum_{j=1}^{p}(\hat{\sigma}_j^2)^{-\alpha}}\right\}\hat{\sigma}_i^2. \qquad (24) \]

Individual estimators are easily derived from this umbrella estimator by choosing an appropriate α. For example:

\[ \hat{\sigma}^2_{p,\mathrm{LB1}} = \left(\frac{n}{n-2}\right)\hat{\sigma}^2_{p,\mathrm{LB}}(n/2) \quad\text{and}\quad \hat{\sigma}^2_{p,\mathrm{PLB1}} = \left(\frac{n}{n+2}\right)\hat{\sigma}^2_{p,\mathrm{LB}}(n/2). \]

Theorem 5.5

Under ℳp where μ_{i0} is the true mean, for a fixed α > 0, nσ̂²_{p,LB}/σ² \overset{d}{=} W(1 + H(T)), where

\[ H(T) \equiv H(T;\alpha) = \sum_{i=1}^{p}\theta_i(T)\,T_i^2 \quad\text{with}\quad \theta_i(T) \equiv \theta_i(T;\alpha) = \frac{(1+T_i^2)^{-\alpha}}{\sum_{j=1}^{p}(1+T_j^2)^{-\alpha}},\ i = 1, 2, \ldots, p. \]

Consequently, the distribution of σ̂²_{p,LB}/σ² depends on (μ, σ²) only through Δ.
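The representation in Theorem 5.5 can be checked by simulation. The Python sketch below (illustrative only; the values of n, σ², the sub-model means, and α are assumptions for the example) generates the left side directly from normal samples via (24) and the right side from independent W and Z as in the theorem, and compares their Monte Carlo means.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, mus, true_mu, alpha, B = 10, 2.0, np.array([-0.5, 0.0, 0.5]), 0.0, 10 / 2, 100_000
delta = (mus - true_mu) / np.sqrt(sigma2)          # the vector Delta in (19)

# Left side: n * umbrella estimator (24) / sigma^2, computed from simulated samples
x = rng.normal(true_mu, np.sqrt(sigma2), size=(B, n))
s2 = np.mean((x[:, None, :] - mus[:, None]) ** 2, axis=2)      # B x p sub-model MLEs
w = s2 ** (-alpha)
lhs = n * np.sum(w / w.sum(axis=1, keepdims=True) * s2, axis=1) / sigma2

# Right side: W * (1 + H(T)) with W ~ chi^2_{n-1}, T = V / sqrt(W), V = Z*1 - sqrt(n)*Delta
W = rng.chisquare(n - 1, size=B)
Z = rng.standard_normal(B)
T = (Z[:, None] - np.sqrt(n) * delta[None, :]) / np.sqrt(W)[:, None]
theta = (1 + T ** 2) ** (-alpha)
H = np.sum(theta / theta.sum(axis=1, keepdims=True) * T ** 2, axis=1)
rhs = W * (1 + H)

print(lhs.mean(), rhs.mean())        # the two Monte Carlo means should agree closely
```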

From the distributional representation in Theorem 5.5, a closed-form expression for the risk function of σ̂²_{p,LB} will be difficult to obtain because of the adaptive, i.e., data-dependent, nature of the mixing probabilities θi(T) and the fact that these are rational functions of T. To obtain an approximation to the risk function of σ̂²_{p,LB} we used a second-order Taylor expansion of the function H(T) about T = ν, the mean vector of T. For notation, let

\[ H \equiv H(\nu);\quad H^{(1)} \equiv \frac{\partial}{\partial T}H(T)\Big|_{T=\nu};\quad\text{and}\quad H^{(2)} \equiv \frac{\partial^2}{\partial T\,\partial T'}H(T)\Big|_{T=\nu}. \]

A second-order Taylor approximation for σ̂²_{p,LB}/σ² is provided by

\[ \frac{\hat{\sigma}^2_{p,\mathrm{LB}}}{\sigma^2} \overset{d}{\approx} \frac{W}{n}\left\{1 + H + H^{(1)\prime}(T-\nu) + \frac{1}{2}(T-\nu)'H^{(2)}(T-\nu)\right\}. \qquad (25) \]

From this approximate representation, we are able to obtain approximate expressions for the mean and variance of the Bayes estimator. These mean and variance expressions, which involve the constant Cn defined in Corollary 5.1, are given in the next two theorems. The proofs require several intermediate results (contained in lemmas), and these are presented in Appendix A.

Theorem 5.6

Under ℳp, a second-order approximation to the mean of σ̂²_{p,LB}/σ² is

\[ E_2(\Delta) \equiv \left(1-\frac{1}{n}\right)(1+H) + \frac{1}{2}\left(\frac{1}{C_n^2}-1+\frac{3}{n}\right)(\nu'H^{(2)}\nu) - \frac{1}{n}\left\{(H^{(1)\prime}\nu) - \frac{1}{2}(\mathbf{1}'H^{(2)}\mathbf{1})\right\}. \]

Theorem 5.7

Under ℳp, a second-order approximation to the variance of σ̂²_{p,LB}/σ² is V_2(Δ) ≡ (1/n){VE(Δ) + EV(Δ)}, where

\[ VE(\Delta) = 2\left(1-\frac{1}{n}\right)\left(1+H-H^{(1)\prime}\nu+\tfrac{1}{2}\nu'H^{(2)}\nu\right)^2 + \left(\frac{n-1}{C_n^2}-\frac{(n-2)^2}{n}\right)\left(H^{(1)\prime}\nu-\nu'H^{(2)}\nu\right)^2 + 2\left(1-\frac{2}{n}\right)\left(1+H-H^{(1)\prime}\nu+\tfrac{1}{2}\nu'H^{(2)}\nu\right)\left(H^{(1)\prime}\nu-\nu'H^{(2)}\nu\right); \]
\[ EV(\Delta) = \frac{1}{2n}(\mathbf{1}'H^{(2)}\mathbf{1})^2 + \left(1-\frac{1}{n}\right)\left(H^{(1)\prime}\mathbf{1}-\mathbf{1}'H^{(2)}\nu\right)^2 + 2\left(1-\frac{2}{n}\right)\left(H^{(1)\prime}\mathbf{1}-\mathbf{1}'H^{(2)}\nu\right)(\mathbf{1}'H^{(2)}\nu) + \frac{1}{C_n^2}(\mathbf{1}'H^{(2)}\nu)^2. \]

From these expressions, we can compute a second-order approximation to the risk function of σ̂²_{p,LB1} according to the formula

\[ R(\hat{\sigma}^2_{p,\mathrm{LB1}},(\mu_{i_0},\sigma^2)) \approx \left(\frac{n}{n-2}\right)^2 V_2(\Delta;\alpha=n/2) + \left[\left(\frac{n}{n-2}\right)E_2(\Delta;\alpha=n/2) - 1\right]^2, \qquad (26) \]

where E_2(Δ;α) ≡ E_2(Δ) and V_2(Δ;α) ≡ V_2(Δ) are given in Theorem 5.6 and Theorem 5.7, respectively. For the other limiting Bayes and weighted estimators in Table 1 and (16), analogous approximate risk expressions can be obtained similarly as for σ̂²_{p,LB1}.

Lastly, still for a given α > 0, we present a few expressions for the components H_k^{(1)}(T), k ∈ {1, 2, …, p}, of the p × 1 vector H^{(1)}(T) and the components H_{kl}^{(2)}(T), k, l ∈ {1, 2, …, p}, of the p × p matrix H^{(2)}(T), which when evaluated at T = ν yield H^{(1)} and H^{(2)}, respectively. From the expressions for H(T) and θi(T) in Theorem 5.5, we find that for k, l ∈ {1, 2, …, p},

\[ H_k^{(1)}(T) \equiv \frac{\partial}{\partial T_k}H(T) = 2\theta_k(T)T_k + \sum_{i=1}^{p}\theta_{ik}^{(1)}(T)\,T_i^2; \]
\[ H_{kl}^{(2)}(T) \equiv \frac{\partial^2}{\partial T_k\,\partial T_l}H(T) = 2\theta_k(T)I\{k=l\} + 2\left[\theta_{kl}^{(1)}(T)T_k + \theta_{lk}^{(1)}(T)T_l\right] + \sum_{i=1}^{p}\theta_{ikl}^{(2)}(T)\,T_i^2; \]

where, for i, k ∈ {1, 2, …, p}, \theta_{ik}^{(1)}(T) = (2\alpha)\left(\frac{T_k}{1+T_k^2}\right)\theta_k(T)\left[\theta_i(T) - I\{k=i\}\right]; and, for i, k, l ∈ {1, 2, …, p},

\[ \theta_{ikl}^{(2)}(T) = (2\alpha)\left\{I\{k=l\}\left(\frac{1-T_k^2}{(1+T_k^2)^2}\right)\theta_k(T)\left[\theta_i(T)-I\{k=i\}\right] + \left(\frac{T_k}{1+T_k^2}\right)\left[\theta_{kl}^{(1)}(T)\left[\theta_i(T)-I\{k=i\}\right] + \theta_k(T)\theta_{il}^{(1)}(T)\right]\right\}. \]

5.4 Assessing the Second-Order Approximations via Simulation

To assess the goodness of the second-order approximations, we compared the values of the means, variances, and risks of σ̂²_{p,LB}(α = n/2)/σ² based on 10,000 simulated datasets to their second-order approximations. The results revealed the same pattern across all choices of Δ. For n = 3 the approximation performs rather poorly, gradually improving with increasing n, to finally become almost identical to the simulation-based values when n is 30. In one of the worst-case scenarios, when n = 3 and Δ is symmetric with a medium-size spread (such as Δ = (−0.25, 0, 0.25)), the approximate mean values lie generally within 20% of the simulated ones. Similar behavior is shown by the variances and risks. Furthermore, as the model dimension p increases or as the separations among the sub-models’ means become smaller, the differences between simulated values and approximations also seem to diminish. With increasing n the accuracy of the approximations improves. Therefore, the second-order approximation appears to work well overall, but when n is small (less than 15) it seems better to use simulations. In the remainder of this paper, all analyses involving risks of the limiting Bayes (σ̂²_{p,LBk}) and weighted (σ̂²_{p,PLBk}) estimators are based on simulations.
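A minimal Monte Carlo harness of the kind used for these comparisons is sketched below in Python; it is an illustration rather than the authors' code, with the estimators, sub-model means, and replication count chosen as assumptions for the example, and efficiency measured against the exact UMVU risk 2/(n − 1).

```python
import numpy as np

def simulated_efficiency(estimator, true_mu, sigma2=1.0, n=10, B=20_000, seed=0):
    """Monte Carlo risk of a variance estimator under the scale-invariant loss (2),
    reported as efficiency relative to the UMVU estimator, in the spirit of
    Tables 2-3. `estimator` maps a sample (1-D array) to an estimate of sigma^2."""
    rng = np.random.default_rng(seed)
    x = rng.normal(true_mu, np.sqrt(sigma2), size=(B, n))
    losses = np.array([((estimator(row) - sigma2) / sigma2) ** 2 for row in x])
    risk_umvu = 2.0 / (n - 1)                       # exact risk of the UMVU estimator
    return risk_umvu / losses.mean()

mus = [-0.25, 0.0, 0.25]                            # assumed candidate sub-model means
mre = lambda row: np.sum((row - row.mean()) ** 2) / (len(row) + 1)
p_mre = lambda row: len(row) * min(np.mean((row - m) ** 2) for m in mus) / (len(row) + 2)
print(simulated_efficiency(mre, true_mu=0.0), simulated_efficiency(p_mre, true_mu=0.0))
```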

5.5 Comparison of Relative Efficiencies

We now carry out the comparison of relative efficiencies of the variance estimators with respect to σ̂²_UMVU, using simulated datasets with a variety of Δ values for n ∈ {3, 10, 30}. The results are summarized in Table 2 and Table 3 and Figure 1 and Figure 2.

Table 2.

Efficiencies (relative to the UMVU estimator σ̂²_UMVU) of the different variance estimators for different combinations of p, Δ, and n. For the limiting Bayes and weighted estimators, the values are based on simulation studies with 10000 replications for each combination.

Combinations of p and Δ   Efficiency %
n MRE pMLE pMRE ALB LB1 LB2 LB3 LB4 PLB1 PLB2 PLB3

  3 200 174 222 14 10 60 152 230 240 236 233
Δ=(−0.25,0,0.25) 10 122 117 124 71 62 89 113 128 129 129 128
p=3 30 107 106 107 91 89 99 107 105 112 112 112

  3 200 183 209 17 11 66 160 217 235 226 221
Δ=(−0.5,0,0.5) 10 122 119 124 73 59 86 110 127 129 128 127
p=3 30 107 106 110 89 84 95 104 109 111 111 111

  3 200 164 226 13 9 51 135 227 238 234 232
Δ=(0,0.25, 0.50) 10 122 114 127 66 53 78 103 131 129 128 128
p=3 30 107 105 109 88 81 92 100 107 108 108 108

  3 200 166 222 13 8 47 128 222 233 228 224
Δ=(0,0.5,1) 10 122 115 128 65 51 76 102 126 126 126 126
p=3 30 107 105 110 87 83 94 103 109 110 110 110

  3 200 174 222 14 10 58 149 234 241 238 235
Δ=(−0.25:2^−4:0.25) 10 122 117 123 71 61 88 112 127 130 130 129
p=9 30 107 105 106 91 88 98 106 109 111 111 111

  3 200 174 222 14 10 57 145 234 239 237 234
Δ=(−0.25:2^−5:0.25) 10 122 117 123 71 60 86 109 130 127 127 126
p=17 30 107 105 106 91 87 97 105 108 110 110 110

  3 200 164 225 13 9 52 137 230 243 240 238
Δ=(0:2^−4:0.5) 10 122 114 126 66 53 77 102 129 128 127 127
p=9 30 107 104 108 88 76 87 97 109 107 107 107

  3 200 164 225 13 10 55 145 235 249 246 243
Δ=(0:2^−5:0.5) 10 122 114 126 66 54 79 106 130 134 133 133
p=17 30 107 104 108 88 76 87 97 111 107 107 108

Table 3.

Efficiencies (relative to the UMVU estimator σ^UMVU2) of the different variance estimators for different combinations of p, Δ, and n. For the limiting Bayes and weighted estimators, 10000 simulation replications were performed for each combination.

Combinations of p and Δ   Efficiency %
n MRE pMLE pMRE ALB LB1 LB2 LB3 LB4 PLB1 PLB2 PLB3

  3 200 170 232 13 8 52 143 229 243 237 234
Δ=(0, 1) 10 122 115 134 63 55 81 107 130 129 130 130
p=2 30 107 104 111 85 85 96 105 109 112 112 112

  3 200 195 216 18 10 63 162 221 255 237 228
Δ=(−1, 0, 1) 10 122 120 134 67 51 77 103 129 126 127 128
p=3 30 107 104 111 85 84 95 103 110 111 111 111

  3 200 185 199 19 11 66 162 208 246 230 221
Δ=(−1:2^−1:1) 10 122 119 124 73 52 78 103 125 126 126 125
p=5 30 107 106 110 89 81 92 101 107 108 108 108

  3 200 182 195 20 11 64 153 210 232 219 211
Δ=(−1:2^−2:1) 10 122 118 120 74 56 82 106 126 126 125 125
p=9 30 107 106 107 91 82 92 100 105 106 106 106

  3 200 181 194 20 11 65 155 207 234 220 213
Δ=(−1:2^−3:1) 10 122 117 119 75 55 81 104 126 125 124 123
p=17 30 107 105 106 92 80 90 99 107 107 107 107

  3 200 181 194 20 12 72 168 206 240 226 217
Δ=(−1:2^−4:1) 10 122 117 119 75 57 82 106 125 126 125 124
p=33 30 107 105 105 92 82 93 101 105 108 108 108

  3 200 181 194 20 11 65 156 208 234 221 214
Δ=(−1:2^−5:1) 10 122 117 119 75 58 84 109 123 128 127 127
p=65 30 107 105 105 92 82 93 102 107 109 109 109

  3 200 181 194 20 11 69 163 207 237 224 216
Δ=(−1:2^−6:1) 10 122 117 119 75 56 82 106 126 126 125 125
p=129 30 107 105 105 92 81 91 99 108 106 106 106

  3 200 181 194 20 11 67 159 205 236 223 215
Δ=(−1:2^−7:1) 10 122 117 119 75 57 83 107 124 128 127 126
p=257 30 107 105 105 92 82 92 101 108 108 108 108

  3 200 181 194 20 11 66 156 206 233 221 213
Δ=(−1:2^−8:1) 10 122 117 119 75 58 84 108 127 128 127 126
p=513 30 107 105 105 92 80 91 99 106 107 107 107

Figure 1.


Relative efficiencies of pMRE with respect to MRE in symmetric and asymmetric Δ cases, as a function of Δmax and the number of sub-models p, for a sample size of n = 10. The symmetric case is of the form Δ = [−Δmax : Δmax/(p − 1) : Δmax], while the asymmetric case is of the form Δ = [0 : Δmax/(2(p − 1)) : Δmax].

Figure 2.


Efficiencies of the leading estimators and Stein’s estimator of σ² relative to the UMVUE σ̂²_UMVU for p = 2 and Δ = (0, Δ), with Δ varying, for n = 3, 10. The connected scatterplots represent the simulated relative efficiencies based on 20000 replications for each Δ for the limiting Bayes (pLB4, lighter of the top two), weighted (pPLB1, darker of the top two), and Stein (bottom) estimators. The smooth curves correspond to theoretical efficiencies of σ̂²_{p,MRE} (top), σ̂²_MRE (middle), and σ̂²_{p,MLE} (bottom).

Table 2 focuses on the differences in relative efficiencies between symmetric and asymmetric Δ cases. As can be seen, there does not seem to be a strong effect of the asymmetry of Δ on the estimators. In all Δ cases that we have chosen, σ̂²_{p,PLB1}, σ̂²_{p,PLB2}, σ̂²_{p,PLB3}, and σ̂²_{p,LB4} perform best, with the two-step estimator σ̂²_{p,MRE} following. Clearly, the best among the ℳp-based estimators dominate the ℳ-based estimators σ̂²_UMVU and σ̂²_MRE, with the gain in efficiency being quite impressive for small sample sizes.

Table 3 is designed to examine the impact of an increasing number of sub-models. We see that when p is large the two-step estimator σ̂²_{p,MRE} becomes less efficient than the global-type estimator σ̂²_MRE. This result is consistent with the theoretical result of Corollary 5.4. Note also that even for p as low as 33, the ratios of the relative efficiency values from Table 3 start to agree (to the third decimal) with the limiting relative efficiencies predicted by Corollary 5.4. The weighted estimators do not seem to be affected much by the increasing p, faring much better than the two-step estimators.

Figure 1 presents two contour plots of the relative efficiencies of σ̂²_{p,MRE} with respect to σ̂²_MRE as a function of p (where p > 3) and the range of the values in the Δ vector. The top and bottom contour plots consider the symmetric and asymmetric Δ cases separately. The top contour plot reveals a structure that is consistent with Corollary 5.4, especially in the regions of the contour plot where p → ∞ and Δmax → ∞ (hence Δmin → −∞), which fall in the top and bottom right corners, where σ̂²_MRE starts to dominate σ̂²_{p,MRE}. Note that the 98% relative efficiency in this region is very close to the limiting 97% from Corollary 5.4. The bottom contour plot is constructed using Δ that are quite asymmetric (all Δmin = 0), and therefore cannot be compared directly to the predictions of Corollary 5.4. From this plot we can see, however, that σ̂²_{p,MRE} seems to dominate σ̂²_MRE everywhere.

Figure 2 explores the case when p = 2 only, so Δ = (0, Δ), and for sample sizes n = 3 and n = 10. The plots depict the efficiency of the leading estimators, together with that of the Stein estimator in (5), as a function of the magnitude of the Δ parameter. As can be seen, σ̂²_{p,MRE} performs best when |Δ| is large, with σ̂²_{p,LB4} giving a very comparable performance. The estimator σ̂²_{p,PLB1} performs better than σ̂²_{p,MRE} when |Δ| is closer to zero, but degrades in performance when |Δ| becomes large. Thus, we see that the estimators’ performances and the regions where they perform well will depend to a large extent on the magnitude of Δ. In particular, it appears that the best among the weighted estimators perform very well when |Δ| is neither too large nor too small, while the two-step estimator performs very well when |Δ| is large. This points to the following intuitive explanation: when |Δ| is large, the two models are well-separated and the model selection is easier; however, when |Δ| is neither too small nor too large, it is not so clear which model to choose, and it seems better to average over the sub-models’ estimators. When |Δ| is quite close to zero, i.e., when there is not much difference among the sub-models, either approach to estimation works well. With regard to the Stein estimator, observe that though it dominates σ̂²_MRE, as is expected from theory, it is at the same time dominated by the estimators σ̂²_{p,MRE}, σ̂²_{p,PLB1}, and σ̂²_{p,LB4}.

Overall, based on the results of the risk comparison, the estimators performing best are the Bayes-type or weighted estimators σ̂²_{p,PLB1} and σ̂²_{p,LB4}, and the two-step estimator σ̂²_{p,MRE}. We give a slight preference to the weighted estimators because their performance does not degrade much even when the number of sub-models increases, in contrast to the two-step estimator which becomes dominated by the ℳ-estimator σ̂²_MRE when p, the number of sub-models, increases.

Finally, a cautionary note arising from these efficiency studies is that one ought to be very careful in the choice of prior parameters. At least in the situation when one is concerned with variance estimation, the limiting Bayes estimators σ̂²_{p,LB1} and σ̂²_{p,LB2}, corresponding to the limiting cases of κ → 1 and κ → 3/2 respectively, perform quite poorly, especially for small sample sizes. These two estimators are dominated by the estimator σ̂²_UMVU in terms of risk function. However, these improper priors associated with the limiting values of κ are most likely the worst-case scenarios, and other, more carefully chosen and meaningful priors should result in improved performance.

6 Concluding Remarks

We have examined some of the issues arising when considering a model with a finite number of sub-models, where the goal is to make inference about a common parameter among these sub-models, based on a single realization of a sample. It is of interest to determine which of the three possible strategies is preferable: (i) to utilize a wider model that encompasses all competing sub-models; (ii) to adopt a two-step approach: select the sub-model, and then do inference within this chosen sub-model, but with both steps utilizing the same sample data; (iii) to do a sub-model averaging scheme where the inference procedure is formed by weighting the sub-models’ procedures, with the weights being also data-dependent. The second strategy may be labeled the classical approach, while the third strategy coincides or is motivated by the Bayesian approach.

Through a simple model prototype with a finite number of Gaussian sub-models with common variance but different means, we have studied each of the strategies, with regards to the estimation of variance. Based on the theoretical and simulated comparison of the different types of estimators, and with the estimator performance evaluated through risk functions based on quadratic loss, we have reached the following conclusions: (i) There could be considerable improvement in using estimators developed by exploiting the structure of the sub-models, over the strategy of simply using estimators from a wider model. (ii) However, the properties of these resulting estimators may be quite difficult to obtain. Furthermore, some desirable properties of the sub-model estimators, such as unbiasedness and minimum variance, may not carry over when they are combined to form the estimator for the full model of interest. (iii) Based on the theoretical and simulated results for the variance parameter σ² considered in this paper, the weighted estimators, which were motivated and/or derived via the Bayesian approach, seem preferable over the two-step estimators even though these estimators were derived using improper priors. (iv) When the number of sub-models increases and two-step estimators are employed, it appears that their performance could degrade relative to estimators developed under a wider model, but that the weighted estimators’ performances are not necessarily affected. (v) And, finally, when developing weighted estimators through the Bayesian framework, caution must be observed in assigning prior parameters as a particular prior specification may also lead to poor estimators.

Approaches similar to the one presented here could be a useful first step in many contexts where model selection is often done and where there exists a natural notion of “distance” among models: regression, survival analysis, or goodness-of-fit testing. For example, techniques and results presented here could be extended to settings of regression with p possible predictor variables, where the goal is estimation of dispersion parameters associated with error components for the 2^p competing sub-models. There is clearly a need for studies of more complicated situations in varied settings, where multiple parameters are being estimated simultaneously and/or where sub-models are of different dimensions (Claeskens and Hjort (2003)). As was pointed out earlier, for these more general settings, exact risk expressions may not be possible and asymptotic analysis may be needed, in contrast to the situation considered in this paper where the Gaussian distributional assumption enabled us to obtain concrete results. Also, many other interesting alternative options need to be examined: for example, in the two-step approach, would it have been better to subdivide the sample data into two parts and use the first part for model selection and the second part for making inference in the chosen sub-model, an issue alluded to, for instance, in Hastie, Tibshirani, and Friedman (2001) and investigated by Yang (2003)? Finally, we have observed in Dukić and Peña (2003) that in the case of estimation of the distribution function in this same specific model, a different conclusion holds with respect to which of the three strategies is preferable. This is consistent with recent work by Claeskens and Hjort (2003), where they advocate the use of a focussed information criterion in model selection problems that is tailored to the specific parameter of interest.

Acknowledgments

The authors thank the Editor, Associate Editor, and Referees for their comments and criticisms which led to a considerably improved paper.

A Appendix: Proofs of Selected Results

In this appendix we gather the technical proofs of the results presented in earlier sections.

Proof of Proposition 5.1

With \bar{Z} = (Z'\mathbf{1})/n, we have n\hat{\sigma}_i^2/\sigma^2 = \|\chi - \mu_i\mathbf{1}\|^2/\sigma^2 = \|(\chi - \mu_{i_0}\mathbf{1}) - (\mu_i - \mu_{i_0})\mathbf{1}\|^2/\sigma^2 \overset{d}{=} \|Z - \Delta_i\mathbf{1}\|^2 = \|Z - \bar{Z}\mathbf{1}\|^2 + n(\bar{Z} - \Delta_i)^2. Letting W = \|Z - \bar{Z}\mathbf{1}\|^2 and V_i = \sqrt{n}(\bar{Z} - \Delta_i), i = 1, 2, …, p, it follows that W ∼ χ²_{n−1} with W and V = (V_1, V_2, …, V_p)′ independent. Furthermore, since \bar{Z} ∼ N(0, n^{-1}), V has the representation

\[ V = Z\mathbf{1} - \sqrt{n}\,\Delta, \qquad (27) \]

where Z = \sqrt{n}\,\bar{Z} ∼ N(0, 1), so V ∼ N_p(−\sqrt{n}\,Δ, J), since Cov(V, V′) = Cov{Z\mathbf{1} − \sqrt{n}Δ, Z\mathbf{1} − \sqrt{n}Δ} = \mathbf{1}\,Var(Z)\,\mathbf{1}′ = J.

Proof of Theorem 5.1

By Proposition 5.1, σ̂²_{p,MLE}/σ² \overset{d}{=} [W + \sum_{i=1}^{p} I\{M̂ = i\}V_i²]/n. Now,

\[ \{\hat{M}=i\} = \left\{\frac{\hat{\sigma}_i^2}{\sigma^2} < \frac{\hat{\sigma}_j^2}{\sigma^2},\ j=1,2,\ldots,p;\ j\ne i\right\} \overset{d}{=} \left\{V_i^2 < V_j^2,\ j=1,2,\ldots,p;\ j\ne i\right\} = \left\{(Z-\sqrt{n}\Delta_i)^2 < (Z-\sqrt{n}\Delta_j)^2,\ j=1,2,\ldots,p;\ j\ne i\right\}\ \text{by (27)} \]
\[ = \left\{(\Delta_j-\Delta_i)Z < \tfrac{\sqrt{n}}{2}(\Delta_j+\Delta_i)(\Delta_j-\Delta_i),\ j=1,2,\ldots,p;\ j\ne i\right\} = \left\{Z < \tfrac{\sqrt{n}}{2}(\Delta_j+\Delta_i),\ (j\ne i,\ \Delta_j>\Delta_i)\right\}\cap\left\{Z > \tfrac{\sqrt{n}}{2}(\Delta_j+\Delta_i),\ (j\ne i,\ \Delta_j<\Delta_i)\right\} \]
\[ = \left\{\tfrac{\sqrt{n}}{2}\Big(\Delta_i+\max_{\{j:\Delta_j<\Delta_i\}}\Delta_j\Big) < Z < \tfrac{\sqrt{n}}{2}\Big(\Delta_i+\min_{\{j:\Delta_j>\Delta_i\}}\Delta_j\Big)\right\}. \]

But,

\[ \sum_{i=1}^{p} I\left\{\tfrac{\sqrt{n}}{2}\Big(\Delta_i+\max_{\{j:\Delta_j<\Delta_i\}}\Delta_j\Big) < Z < \tfrac{\sqrt{n}}{2}\Big(\Delta_i+\min_{\{j:\Delta_j>\Delta_i\}}\Delta_j\Big)\right\}V_i^2 = \sum_{i=1}^{p} I\left\{\tfrac{\sqrt{n}}{2}\Big(\Delta_{(i)}+\max_{\{j:\Delta_j<\Delta_{(i)}\}}\Delta_j\Big) < Z < \tfrac{\sqrt{n}}{2}\Big(\Delta_{(i)}+\min_{\{j:\Delta_j>\Delta_{(i)}\}}\Delta_j\Big)\right\}(Z-\sqrt{n}\Delta_{(i)})^2 \]
\[ = \sum_{i=1}^{p} I\left\{\tfrac{\sqrt{n}}{2}\big(\Delta_{(i)}+\Delta_{(i-1)}\big) < Z < \tfrac{\sqrt{n}}{2}\big(\Delta_{(i)}+\Delta_{(i+1)}\big)\right\}(Z-\sqrt{n}\Delta_{(i)})^2 = \sum_{i=1}^{p} I\{L(\Delta_{(i)},\Delta) < Z < U(\Delta_{(i)},\Delta)\}(Z-\sqrt{n}\Delta_{(i)})^2. \]

Proof of Theorem 5.2

From Theorem 5.1, using \int_a^b z\phi(z)\,dz = \phi(a) - \phi(b) for a < b,

\[ E_{p\mathrm{MLE}}(\Delta) = \frac{1}{n}\left\{E(W) + E\Big\{\sum_{i=1}^{p} I(\Omega_{(i)})(Z-\sqrt{n}\Delta_{(i)})^2\Big\}\right\} = \frac{1}{n}\left\{(n-1) + E(Z^2) - 2\sqrt{n}\sum_{i=1}^{p}\Delta_{(i)}E\{Z\,I(\Omega_{(i)})\} + n\sum_{i=1}^{p}\Delta_{(i)}^2 P_{(i)}(\Delta)\right\} \]
\[ = \frac{1}{n}\left\{n - 2\sqrt{n}\sum_{i=1}^{p}\Delta_{(i)}\int_{L(\Delta_{(i)},\Delta)}^{U(\Delta_{(i)},\Delta)} z\phi(z)\,dz + n\sum_{i=1}^{p}\Delta_{(i)}^2 P_{(i)}(\Delta)\right\} = 1 - \frac{2}{\sqrt{n}}\sum_{i=1}^{p}\Delta_{(i)}\left[\phi(L(\Delta_{(i)},\Delta)) - \phi(U(\Delta_{(i)},\Delta))\right] + \sum_{i=1}^{p}\Delta_{(i)}^2\left[\Phi(U(\Delta_{(i)},\Delta)) - \Phi(L(\Delta_{(i)},\Delta))\right]. \]

Proof of Theorem 5.3

From Theorem 5.1, by the independence of W and Z, and because the Ω_(i)'s depend only on Z, we have V_{pMLE}(Δ) = n^{−2}\{Var(W) + Var[\sum_{i=1}^{p} I(Ω_(i))(Z − \sqrt{n}Δ_(i))²]\}. We already know that Var(W) = 2(n − 1). On the other hand, Var\{\sum_{i=1}^{p} I(Ω_(i))(Z − \sqrt{n}Δ_(i))²\} = \sum_{i=1}^{p} Var\{I(Ω_(i))(Z − \sqrt{n}Δ_(i))²\} + \sum_{i≠j} Cov\{I(Ω_(i))(Z − \sqrt{n}Δ_(i))², I(Ω_(j))(Z − \sqrt{n}Δ_(j))²\}. By the variance formula and the definition of ζ_(i)(k), Var\{I(Ω_(i))(Z − \sqrt{n}Δ_(i))²\} = ζ_(i)(4) − [ζ_(i)(2)]²; and Cov\{I(Ω_(i))(Z − \sqrt{n}Δ_(i))², I(Ω_(j))(Z − \sqrt{n}Δ_(j))²\} = −ζ_(i)(2)ζ_(j)(2), as I(Ω_(i))I(Ω_(j)) = 0 whenever i ≠ j. Consequently, Var\{\sum_{i=1}^{p} I(Ω_(i))(Z − \sqrt{n}Δ_(i))²\} = \sum_{i=1}^{p}[ζ_(i)(4) − [ζ_(i)(2)]²] − \sum_{i≠j}ζ_(i)(2)ζ_(j)(2) = \sum_{i=1}^{p}ζ_(i)(4) − (\sum_{i=1}^{p}ζ_(i)(2))².

Proof of Corollary 5.4

From Theorem 5.4, R(σ̂²_{p,MLE}, (μ_{i0}, σ²)) = V_{pMLE}(Δ) + [E_{pMLE}(Δ) − 1]² → (2/n)(1 − 1/n) + [(1 − 1/n) − 1]² = (2n − 1)/n². Also, since σ̂²_{p,MRE} = (n/(n + 2))σ̂²_{p,MLE}, we have R(σ̂²_{p,MRE}, (μ_{i0}, σ²)) = (n/(n + 2))²V_{pMLE}(Δ) + [(n/(n + 2))E_{pMLE}(Δ) − 1]² → (n/(n + 2))²(2/n)(1 − 1/n) + [(n/(n + 2))(1 − 1/n) − 1]² = (2n + 7)/[(n + 2)²]. The efficiency expressions relative to σ̂²_UMVU of σ̂²_{p,MLE} and σ̂²_{p,MRE} are obtained by dividing R(σ̂²_UMVU, (μ_{i0}, σ²)) = 2/(n − 1) by the preceding risk expressions. Both of the resulting efficiency expressions are easily shown to exceed 1. Taking the ratio of R(σ̂²_{p,MLE}, (μ_{i0}, σ²)) and R(σ̂²_{p,MRE}, (μ_{i0}, σ²)) yields the third expression, an expression easily shown to exceed 1. In comparing σ̂²_MRE and σ̂²_{p,MRE}, we observe that Eff(σ̂²_{p,MRE} : σ̂²_MRE) → 2(n + 2)²/[(n + 1)(2n + 7)]. Since 2(n + 2)²/[(n + 1)(2n + 7)] − 1 = −(n − 1)/[(n + 1)(2n + 7)] < 0 for n > 1, we have established that, as p → ∞, σ̂²_MRE is more efficient than σ̂²_{p,MRE}. To show that σ̂²_{p,ALB} is dominated by σ̂²_UMVU, note that R(σ̂²_{p,ALB}, (μ_{i0}, σ²)) → (2n − 1)/[(n − 2)²], which is easily shown to exceed 2/(n − 1) whenever n > 2.

Proof of Theorem 5.5

Starting from (24) and using Corollary 5.1, we obtain

\[ \frac{\hat{\sigma}^2_{p,\mathrm{LB}}}{\sigma^2} \overset{d}{=} \sum_{i=1}^{p}\left\{\frac{\left[\frac{W}{n}(1+T_i^2)\right]^{-\alpha}}{\sum_{j=1}^{p}\left[\frac{W}{n}(1+T_j^2)\right]^{-\alpha}}\right\}\left[\frac{W}{n}(1+T_i^2)\right] = \frac{W}{n}\left\{1 + \sum_{i=1}^{p}\theta_i(T)\,T_i^2\right\}. \]

The following lemma will be needed for proving Theorems 5.6 and 5.7.

Lemma A.1

Under the conditions of Proposition 5.1, (i) E\{T − ν | W\} = \left(\frac{\sqrt{n}}{\sqrt{W}\,C_n} - 1\right)ν; (ii) E\{(T − ν)(T − ν)′ | W\} = \frac{J}{W} + \left(\frac{\sqrt{n}}{\sqrt{W}\,C_n} - 1\right)^2 νν′; (iii) E\{W(T − ν)\} = −ν; and (iv) E\{W(T − ν)(T − ν)′\} = J + n\left(\frac{1}{C_n^2} - 1 + \frac{3}{n}\right)νν′, with the constant C_n given in Corollary 5.1.

Proof of Lemma A.1

Since E(T | W) = E(V/\sqrt{W} | W) = E(V)/\sqrt{W} = \sqrt{n}\,ν/(\sqrt{W}\,C_n), the first result immediately follows. Using E(W) = n − 1 and E(\sqrt{W}) = (n − 2)C_n/\sqrt{n}, the third result follows directly from the first identity. To prove the second result, observe that

\[ E(TT' | W) = \frac{1}{W}E(VV') = \frac{1}{W}E\left\{(Z\mathbf{1} - \sqrt{n}\Delta)(Z\mathbf{1} - \sqrt{n}\Delta)'\right\} = \frac{1}{W}\left(J + \frac{n}{C_n^2}\nu\nu'\right); \]

and E\{Tν′ | W\} = −\sqrt{n}\,Δν′/\sqrt{W} = \sqrt{n}\,νν′/(\sqrt{W}\,C_n). Consequently,

\[ E\{(T-\nu)(T-\nu)' | W\} = E(TT' | W) - 2E(T\nu' | W) + \nu\nu' = \frac{J}{W} + \left(\frac{\sqrt{n}}{\sqrt{W}\,C_n} - 1\right)^2\nu\nu'. \]

Finally, by the iterated expectation rule, and using the expressions for E(W) and E(\sqrt{W}),

\[ E\{W(T-\nu)(T-\nu)'\} = E\{W\,E[(T-\nu)(T-\nu)' | W]\} = E\left\{J + \left(\frac{n}{C_n^2} - \frac{2\sqrt{n}\sqrt{W}}{C_n} + W\right)\nu\nu'\right\} = J + \left(\frac{n}{C_n^2} - 2(n-2) + (n-1)\right)\nu\nu'. \]

Simplifying leads to the expression given in the statement of the lemma.

Proof of Theorem 5.6

First, note that by using the fourth result in Lemma A.1, we have

\[ E\{W(T-\nu)'H^{(2)}(T-\nu)\} = E\{\mathrm{tr}[H^{(2)}W(T-\nu)(T-\nu)']\} = \mathrm{tr}\left(H^{(2)}\left[J + n\left(\tfrac{1}{C_n^2}-1+\tfrac{3}{n}\right)\nu\nu'\right]\right) = (\mathbf{1}'H^{(2)}\mathbf{1}) + n\left(\tfrac{1}{C_n^2}-1+\tfrac{3}{n}\right)(\nu'H^{(2)}\nu). \]

From (25), and using Lemma A.1 and the preceding result, the approximation to the mean of σ̂²_{p,LB}/σ² is obtained to be

\[ E_2(\Delta) = \frac{1}{n}\left\{(1+H)(n-1) - H^{(1)\prime}\nu + \frac{1}{2}(\mathbf{1}'H^{(2)}\mathbf{1}) + \frac{n}{2}\left(\frac{1}{C_n^2}-1+\frac{3}{n}\right)(\nu'H^{(2)}\nu)\right\}, \]

which simplifies to the expression in the statement of the theorem.

To obtain a second-order approximation to the variance of σ̂²_{p,LB}/σ², we first establish two intermediate lemmas concerning the conditional mean and variance, given W, of the variable

\[ Q \equiv 1 + H + H^{(1)\prime}(T-\nu) + \frac{1}{2}(T-\nu)'H^{(2)}(T-\nu), \qquad (28) \]

which is the second-order Taylor approximation of the function 1 + H(T) (see equation (25)).

Lemma A.2

Under conditions of Proposition 5.1,

\[ E(Q|W) = \left\{1+H-(H^{(1)\prime}\nu)+\tfrac{1}{2}(\nu'H^{(2)}\nu)\right\} + \frac{\sqrt{n}}{C_n}\left\{(H^{(1)\prime}\nu)-(\nu'H^{(2)}\nu)\right\}\frac{1}{\sqrt{W}} + \frac{1}{2}\left\{(\mathbf{1}'H^{(2)}\mathbf{1}) + \frac{n}{C_n^2}(\nu'H^{(2)}\nu)\right\}\frac{1}{W}. \]
Proof of Lemma A.2

From the proof of and results in Lemma A.1, it follows that

\[ E\{(T-\nu)'H^{(2)}(T-\nu)|W\} = \frac{1}{W}(\mathbf{1}'H^{(2)}\mathbf{1}) + \left(\frac{n}{WC_n^2} - \frac{2\sqrt{n}}{\sqrt{W}C_n} + 1\right)(\nu'H^{(2)}\nu). \]

From the first result in Lemma A.1 and the preceding result, E(Q|W) = 1 + H + \left(\frac{\sqrt{n}}{\sqrt{W}C_n}-1\right)H^{(1)\prime}\nu + \frac{1}{2}\left\{\frac{1}{W}(\mathbf{1}'H^{(2)}\mathbf{1}) + \left(\frac{n}{WC_n^2}-\frac{2\sqrt{n}}{\sqrt{W}C_n}+1\right)(\nu'H^{(2)}\nu)\right\}, and the result follows.

Lemma A.3

Under conditions of Proposition 5.1,

\[ \mathrm{Var}(Q|W) = \left\{(H^{(1)\prime}\mathbf{1}) - (\mathbf{1}'H^{(2)}\nu)\right\}^2\frac{1}{W} + \frac{2\sqrt{n}}{C_n}\left\{(H^{(1)\prime}\mathbf{1}) - (\mathbf{1}'H^{(2)}\nu)\right\}(\mathbf{1}'H^{(2)}\nu)\frac{1}{W^{3/2}} + \left\{\frac{1}{2}(\mathbf{1}'H^{(2)}\mathbf{1})^2 + \frac{n}{C_n^2}(\mathbf{1}'H^{(2)}\nu)^2\right\}\frac{1}{W^2}. \]
Proof of Lemma A.3

First, observe that we have

\[ \mathrm{Var}\{H^{(1)\prime}(T-\nu)|W\} = \frac{1}{W}H^{(1)\prime}\mathrm{Cov}(V)H^{(1)} = \frac{1}{W}H^{(1)\prime}JH^{(1)} = \frac{1}{W}(H^{(1)\prime}\mathbf{1})^2; \]
\[ \mathrm{Var}\{(T-\nu)'H^{(2)}(T-\nu)|W\} = \mathrm{Var}\{T'H^{(2)}T|W\} + 4\,\mathrm{Var}\{\nu'H^{(2)}T|W\} - 4\,\mathrm{Cov}\{T'H^{(2)}T,\ \nu'H^{(2)}T|W\}; \]
\[ \mathrm{Var}\{T'H^{(2)}T|W\} = \mathrm{Var}\left\{\frac{V'}{\sqrt{W}}H^{(2)}\frac{V}{\sqrt{W}}\,\Big|\,W\right\} = \frac{1}{W^2}\mathrm{Var}\{V'H^{(2)}V\}. \]

From the representation of V in (27), we obtain

\[ \mathrm{Var}\{V'H^{(2)}V\} = \mathrm{Var}\left\{(Z\mathbf{1}-\sqrt{n}\Delta)'H^{(2)}(Z\mathbf{1}-\sqrt{n}\Delta)\right\} = 2(\mathbf{1}'H^{(2)}\mathbf{1})^2 + \frac{4n}{C_n^2}(\mathbf{1}'H^{(2)}\nu)^2 \]

since Var(Z²) = 2, Var(Z) = 1, and Cov(Z²,Z) = 0. Thus,

\[ \mathrm{Var}\{T'H^{(2)}T|W\} = \frac{4}{W^2}\left\{\frac{1}{2}(\mathbf{1}'H^{(2)}\mathbf{1})^2 + \frac{n}{C_n^2}(\mathbf{1}'H^{(2)}\nu)^2\right\}. \]

We also have

\[ \mathrm{Var}\{\nu'H^{(2)}T|W\} = \nu'H^{(2)}\mathrm{Cov}(T|W)H^{(2)}\nu = \frac{1}{W}\nu'H^{(2)}JH^{(2)}\nu = \frac{1}{W}(\mathbf{1}'H^{(2)}\nu)^2; \quad \mathrm{Cov}\{T'H^{(2)}T,\ \nu'H^{(2)}T|W\} = \frac{1}{W^{3/2}}\mathrm{Cov}\{V'H^{(2)}V,\ \nu'H^{(2)}V\}. \]

Again, by utilizing the representation for V in (27), we obtain

\[ \mathrm{Cov}\{V'H^{(2)}V,\ \nu'H^{(2)}V\} = \frac{2\sqrt{n}}{C_n}(\mathbf{1}'H^{(2)}\nu)^2, \]

so Cov\{T′H^{(2)}T, ν′H^{(2)}T | W\} = \frac{2\sqrt{n}}{W^{3/2}C_n}(\mathbf{1}'H^{(2)}\nu)^2. Combining these results, we obtain

\[ \mathrm{Var}\{(T-\nu)'H^{(2)}(T-\nu)|W\} = \frac{4}{W^2}\left\{\frac{1}{2}(\mathbf{1}'H^{(2)}\mathbf{1})^2 + \frac{n}{C_n^2}(\mathbf{1}'H^{(2)}\nu)^2\right\} + 4\left\{\frac{1}{W}(\mathbf{1}'H^{(2)}\nu)^2\right\} - 4\left\{\frac{2\sqrt{n}}{W^{3/2}C_n}(\mathbf{1}'H^{(2)}\nu)^2\right\}. \]

Also,

\[ \mathrm{Cov}\{H^{(1)\prime}(T-\nu),\ (T-\nu)'H^{(2)}(T-\nu)|W\} = H^{(1)\prime}\left[\mathrm{Cov}(T,\ T'H^{(2)}T|W) - 2\,\mathrm{Cov}(T|W)H^{(2)}\nu\right]. \]

But, once again,

\[ \mathrm{Cov}(T,\ T'H^{(2)}T|W) = \frac{1}{W^{3/2}}\mathrm{Cov}(V,\ V'H^{(2)}V) = -\frac{2\sqrt{n}}{W^{3/2}}(\mathbf{1}'H^{(2)}\Delta)\mathbf{1} = \frac{2\sqrt{n}}{W^{3/2}C_n}(\mathbf{1}'H^{(2)}\nu)\mathbf{1}, \]

while H^{(1)\prime}\mathrm{Cov}(T|W)H^{(2)}\nu = \frac{1}{W}(H^{(1)\prime}\mathbf{1})(\mathbf{1}'H^{(2)}\nu). Thus, we have

\[ \mathrm{Cov}\{H^{(1)\prime}(T-\nu),\ (T-\nu)'H^{(2)}(T-\nu)|W\} = \frac{2\sqrt{n}}{W^{3/2}C_n}(\mathbf{1}'H^{(2)}\nu)(H^{(1)\prime}\mathbf{1}) - \frac{2}{W}(H^{(1)\prime}\mathbf{1})(\mathbf{1}'H^{(2)}\nu). \]

Now, by using these intermediate results, we find that

\[ \mathrm{Var}(Q|W) = \mathrm{Var}\{H^{(1)\prime}(T-\nu)|W\} + \frac{1}{4}\mathrm{Var}\{(T-\nu)'H^{(2)}(T-\nu)|W\} + (2)\left(\frac{1}{2}\right)\mathrm{Cov}\{H^{(1)\prime}(T-\nu),\ (T-\nu)'H^{(2)}(T-\nu)|W\} \]
\[ = \frac{1}{W}(H^{(1)\prime}\mathbf{1})^2 + \frac{1}{4}\left\{\frac{4}{W^2}\left[\frac{1}{2}(\mathbf{1}'H^{(2)}\mathbf{1})^2 + \frac{n}{C_n^2}(\mathbf{1}'H^{(2)}\nu)^2\right] + \frac{4}{W}(\mathbf{1}'H^{(2)}\nu)^2 - \frac{8\sqrt{n}}{W^{3/2}C_n}(\mathbf{1}'H^{(2)}\nu)^2\right\} + \frac{2\sqrt{n}}{W^{3/2}C_n}(H^{(1)\prime}\mathbf{1})(\mathbf{1}'H^{(2)}\nu) - \frac{2}{W}(H^{(1)\prime}\mathbf{1})(\mathbf{1}'H^{(2)}\nu), \]

and the result follows after simplification.

Proof of Theorem 5.7

From (25) and (28), and by the iterated variance rule, an approximate variance of σ̂²_{p,LB}/σ² is V_2(Δ) ≈ Var\{(1/n)WQ\} = (1/n)[Var\{(W/\sqrt{n})E(Q|W)\} + E\{(W²/n)Var(Q|W)\}] ≡ (1/n)\{VE(Δ) + EV(Δ)\}. The final expression for VE(Δ) follows from Lemma A.2, the identities Var(W) = 2(n − 1), Var(\sqrt{W}) = (n − 1) − (n − 2)²C_n²/n, and Cov(W, \sqrt{W}) = (n − 2)C_n/\sqrt{n}, and

\[ VE(\Delta) = \frac{1}{n}\left[\mathrm{Var}\left\{W\left[1+H-H^{(1)\prime}\nu+\tfrac{1}{2}\nu'H^{(2)}\nu\right] + \sqrt{W}\,\frac{\sqrt{n}}{C_n}\left(H^{(1)\prime}\nu-\nu'H^{(2)}\nu\right)\right\}\right] \]
\[ = \frac{1}{n}\left\{\left[1+H-H^{(1)\prime}\nu+\tfrac{1}{2}\nu'H^{(2)}\nu\right]^2\mathrm{Var}(W) + \frac{n}{C_n^2}\left(H^{(1)\prime}\nu-\nu'H^{(2)}\nu\right)^2\mathrm{Var}(\sqrt{W}) + \frac{2\sqrt{n}}{C_n}\left[1+H-H^{(1)\prime}\nu+\tfrac{1}{2}\nu'H^{(2)}\nu\right]\left(H^{(1)\prime}\nu-\nu'H^{(2)}\nu\right)\mathrm{Cov}(W,\sqrt{W})\right\}. \]

Also, from Lemma A.3,

\[ EV(\Delta) = \frac{1}{n}\left\{E(W)\left(H^{(1)\prime}\mathbf{1}-\mathbf{1}'H^{(2)}\nu\right)^2 + \frac{2\sqrt{n}}{C_n}\left(H^{(1)\prime}\mathbf{1}-\mathbf{1}'H^{(2)}\nu\right)(\mathbf{1}'H^{(2)}\nu)E(\sqrt{W}) + \frac{1}{2}(\mathbf{1}'H^{(2)}\mathbf{1})^2 + \frac{n}{C_n^2}(\mathbf{1}'H^{(2)}\nu)^2\right\}, \]

which simplifies to the expression for EV(Δ) in the statement of the theorem upon substituting the expressions E(W) = n − 1 and E(\sqrt{W}) = (n − 2)C_n/\sqrt{n}. This completes the proof.

References

  1. Akaike H. Information theory and the maximum likelihood principle. Budapest: Akademiai Kiado; 1973. pp. 610–624.
  2. Arnold B, Villasenor J. Variance estimation with diffuse prior information. Stat. Prob. Letters. 1997;33:35–39.
  3. Breiman L. The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error. J. Amer. Statist. Assoc. 1992;87:738–754.
  4. Brewster J, Zidek J. Improving on equivariant estimators. Ann. Statist. 1974;2:21–38.
  5. Buhlmann P. Efficient and adaptive post-model-selection estimators. Journal of Statistical Planning and Inference. 1999;79:1–9.
  6. Burnham K, Anderson D. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer; 1998.
  7. Claeskens G, Hjort N. The Focussed Information Criterion (with discussion). Journal of the American Statistical Association. 2003;98:900–945.
  8. Dukić V, Peña E. Estimation after Model Selection in a Gaussian Model. Technical report, Department of Health Studies, University of Chicago; 2003.
  9. Gelfand A, Dey D. Improved estimation of the disturbance variance in a linear regression model. J. Econometrics. 1977;39:387–395.
  10. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
  11. Hjort N, Claeskens G. Frequentist Model Average Estimators. Journal of the American Statistical Association. 2003;98:879–899.
  12. Hoeting J, Madigan D, Raftery A, Volinsky C. Bayesian Model Averaging. Statistical Science. 1999;14:382–401.
  13. Judge G, Bock M, Yancey T. Post Data Model Evaluation. The Review of Economics and Statistics. 1974;56:245–253.
  14. Leamer E. Specification Searches. New York: Wiley; 1978.
  15. Lehmann E, Casella G. Theory of Point Estimation. New York: Springer; 1998.
  16. Maatta J, Casella G. Developments in decision-theoretic variance estimation (with discussions). Statist. Sci. 1990;5:90–120.
  17. Madigan D, Raftery A. Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam’s Window. J. Amer. Statist. Assoc. 1994;89:1535–1546.
  18. Ohtani K. MSE dominance of the pre-test iterative variance estimator over the iterative variance estimator in regression. Stat. Prob. Letters. 2001;54:331–340.
  19. Pal N, Ling C, Lin J. Estimation of a normal variance – a critical review. Statist. Papers. 1998;39:389–404.
  20. Potscher B. Effects of Model Selection on Inference. Econometric Theory. 1991;7:163–185.
  21. Raftery A, Madigan D, Hoeting J. Bayesian Model Averaging for Linear Regression Models. J. Amer. Statist. Assoc. 1997;92:179–191.
  22. Robert C. The Bayesian Choice. 2nd ed. New York: Springer; 2001.
  23. Rukhin A. How much better are better estimators of a normal variance. J. Amer. Statist. Assoc. 1987;82:925–928.
  24. Schwarz G. Estimating the dimension of a model. Ann. Statist. 1978;6:461–464.
  25. Sclove S, Morris C, Radhakrishnan R. Nonoptimality of preliminary-test estimators for the mean of a multivariate normal distribution. Ann. Math. Statist. 1972;43:1481–1490.
  26. Sen P, Saleh A. On preliminary test and shrinkage M-estimation in linear models. Ann. Statist. 1987;15:1580–1592.
  27. Stein C. Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean. Ann. Inst. Statist. Math. 1964;16:155–160.
  28. Vidaković B, DasGupta A. Lower bounds on Bayes risks for estimating a normal variance: with applications. Canad. J. Statist. 1995;23:269–282.
  29. Wallace T. Pretest estimation in regression: A survey. Amer. J. Agr. Econ. 1977;59:431–443.
  30. Yancey T, Judge G, Mandy D. The sampling performance of pre-test estimators of the scale parameter under squared error loss. Economics Letters. 1983;12:181–186.
  31. Yang Y. Regression with multiple candidate models: selecting or mixing? Statistica Sinica. 2003;13:783–809.
  32. Zhang P. Inference after variable selection in linear regression models. Biometrika. 1992a;79:741–746.
  33. Zhang P. On the Distributional Properties of Model Selection Criteria. J. Amer. Statist. Assoc. 1992b;87:732–737.
