Abstract
This paper considers the problem of estimating the dispersion parameter in a Gaussian model that is intermediate between a model where the mean parameter is fully known (fixed) and a model where the mean parameter is completely unknown. One of the goals is to understand the implications of the two-step process of first selecting a model among a finite number of sub-models, and then estimating a parameter of interest after the model selection, but using the same sample data. The estimators are classified into global, two-step, and weighted-type estimators. While the global-type estimators ignore the model space structure, the two-step estimators exploit the structure adaptively and can be related to pre-test estimators, and the weighted estimators are motivated by the Bayesian approach. Their performances are compared theoretically and through simulations using their risk functions based on a scale invariant quadratic loss function. It is shown that, in the variance estimation problem, efficiency gains arise from exploiting the sub-model structure through the use of two-step and weighted estimators, especially when the number of competing sub-models is small, but this advantage may deteriorate or be lost altogether for some two-step estimators as the number of sub-models increases or as the distance between them decreases. Furthermore, it is demonstrated that weighted estimators, arising from properly chosen priors, outperform two-step estimators when there are many competing sub-models or when the sub-models are close to each other, whereas two-step estimators are preferred when the sub-models are highly distinguishable. The results have implications regarding model averaging and model selection issues.
Keywords: Model selection; inference after model selection; model averaging; Bayes estimators; pre-test estimators; two-step estimators; adaptive estimators
1 Model Selection and Inference
In a variety of settings in statistical practice, it is common to encounter the following situation: we observe data χ from a distribution F which is only known to belong to one of p (possibly nested) sub-models ℳ1, ℳ2,…,ℳp; and given χ, we want to estimate a common parameter, or a functional, of F, denoted by τ(F). For example, we might observe χ ∼ F where F belongs to either the gamma or Weibull family of distributions, and wish to estimate the mean of F. Or, in a multiple regression setting with p possible predictors, we might want to choose one of the 2ᵖ competing sub-models (Breiman (1992); Zhang (1992a,b)), and then estimate a common parameter such as the dispersion or the conditional distribution function of the response variable.
The most frequent strategies for estimating τ(F) are: (i) utilizing an estimator developed under a larger model ℳ, which contains all sub-models; (ii) using data χ to first choose a sub-model, and then applying the estimator developed for the chosen sub-model to the same data χ; and (iii) assigning to each sub-model a plausibility measure, possibly using χ, and then forming a weighted combination of the estimators developed under each of the sub-models. In this paper we are interested in determining whether there is a preferred strategy, and whether that preferred strategy depends on the interplay among the competing sub-models, and possibly the parameter we are estimating.
Issues pertaining to the two-step process of inference after model selection and the consequences of “data double-dipping” in strategy (ii) have been discussed in the econometric literature (Judge, Bock, and Yancey (1974), Leamer (1978), Yancey, Judge, and Mandy (1983), and Wallace (1977)). Further investigations of these issues in other settings are in Potscher (1991), Buhlmann (1999), and Burnham and Anderson (1998). The third strategy has been discussed mostly in the context of model averaging, a notion that naturally arises in the Bayesian paradigm (Madigan and Raftery (1994); Raftery, Madigan, and Hoeting (1997); Hoeting, Madigan, Raftery, and Volinsky (1999); Burnham and Anderson (1998), among many others). The first strategy on the other hand may be viewed as having a nonparametric flavor. Though it is clearly intuitive that the first strategy will entail some loss in efficiency, it is not apparent whether (and when) the second strategy is preferred over the third strategy. Clearly, an examination of this problem in the general framework is important to provide guidance to practitioners regarding which strategy is better in general situations. However, a general treatment of the problem may not yield exact results, and one may need to rely on asymptotics, or local asymptotics such as in the work by Claeskens and Hjort (2003) and Hjort and Claeskens (2003).
In this paper, we focus our attention on a prototype Gaussian model which admits exact results and thereby enables concrete comparison of the three strategies. Though the specific model examined in this paper may be perceived as restrictive, it highlights the difficulties inherent in this problem. In addition, the specific estimation problem examined – the estimation of the dispersion parameter – is still the subject of active research (Arnold and Villasenor (1997), Brewster and Zidek (1974), Gelfand and Dey (1977), Maatta and Casella (1990), Ohtani (2001), Pal, Ling, and Lin (1998), Rukhin (1987), Vidaković and DasGupta (1995), and Wallace (1977)).
The paper is organized as follows. Section 2 will describe the formal setting of the specific problem considered, introduce notation, and present the global-type estimators. Section 3 will present the classical two-step estimators, whereas the Bayes and weighted estimators will be developed in Section 4. Distributional properties and risk comparisons will be obtained in Section 5. Concluding remarks are given in Section 6, while Appendix A gathers the technical proofs.
2 Global-Type Estimators
We first describe the specific model examined in this paper. Let χ = (χ1, χ2,…,χn)′ be a vector of IID random variables from an unknown distribution function F(x) = Pr{χ1 ≤ x} which belongs to the two-parameter normal family of distributions ℳ = {N(μ, σ²) : (μ, σ²) ∈ Θ = ℜ × ℜ+}. If interest is in estimating the variance σ², then the uniformly minimum variance unbiased estimator (UMVUE) of σ² is
\[ \hat{\sigma}^2_{UMVUE} = \frac{1}{n-1} \sum_{i=1}^{n} (\chi_i - \bar{\chi})^2 , \tag{1} \]
where χ̄ = (1/n) ∑i χi (Lehmann and Casella (1998)). We adopt a decision-theoretic approach for evaluating estimators of σ² via the risk function based on the scale invariant quadratic loss function L : ℜ × Θ → ℜ
\[ L(a; (\mu, \sigma^2)) = \left( \frac{a}{\sigma^2} - 1 \right)^2 . \tag{2} \]
It should be pointed out that the appropriateness of this loss function has been questioned, partly because of Stein's (1964) demonstration that under this loss the UMVUE of σ² is inadmissible and dominated by the minimum risk equivariant estimator (MRE)
\[ \hat{\sigma}^2_{MRE} = \frac{1}{n+1} \sum_{i=1}^{n} (\chi_i - \bar{\chi})^2 \tag{3} \]
(which also turns out to be inadmissible). However, quadratic loss functions are still popular when dealing with the estimation of variance (Arnold and Villasenor (1997), Maatta and Casella (1990), Ohtani (2001), Pal et al. (1998), Rukhin (1987), Vidaković and DasGupta (1995), and Wallace (1977)).
If the model is restricted so that μ = μ0 where μ0 ∈ ℜ is known, so ℳ0 = {N(μ, σ²) : (μ, σ²) ∈ Θ0 = {μ0} × ℜ+}, the UMVUE and MRE of σ² are given, respectively, by
\[ \hat{\sigma}^2_{0,UMVUE} = \frac{1}{n} \sum_{i=1}^{n} (\chi_i - \mu_0)^2 \qquad\text{and}\qquad \hat{\sigma}^2_{0,MRE} = \frac{1}{n+2} \sum_{i=1}^{n} (\chi_i - \mu_0)^2 . \tag{4} \]
Incidentally, σ̂²0,MRE is also the minimax estimator of σ² since it is a limit of proper Bayes estimators and it has constant risk. Clearly, we are able to improve on the estimators derived under ℳ by exploiting the knowledge that μ = μ0 under ℳ0: when model ℳ0 holds, the relative efficiency of the UMVUE in (4) with respect to the UMVUE in (1) is n/(n − 1). But suppose now that we have a model between ℳ and ℳ0. Specifically, let p be a known positive integer, let {μ1, μ2,…,μp} be a set of known real numbers, and consider the estimation of σ² under the model
\[ \mathcal{M}_p = \left\{ N(\mu, \sigma^2) : (\mu, \sigma^2) \in \{\mu_1, \mu_2, \ldots, \mu_p\} \times \Re_+ \right\} . \]
In ℳp, in contrast to ℳ0, there is some information about the possible value of μ, but we are not certain about this value. Model ℳp can be viewed as having p sub-models, with the ith sub-model, ℳp,i, being the normal class with unknown variance σ² and known mean μi, that is, ℳp,i = {N(μi, σ²) : σ² > 0}. This particular model arises in a variety of settings. For example, it includes decision problems with a two-element action space such as in the Neyman-Pearson hypothesis testing setting. If we further allow the possibility that μ ∈ ℜ \ {μ1, μ2,…,μp}, we obtain a generalization of the setting utilized by Stein (1964) to derive an estimator dominating the MRE in (3) (see Brewster and Zidek (1974), Wallace (1977), and Maatta and Casella (1990)). The Stein estimator is given by
\[ \hat{\sigma}^2_{Stein} = \min\left\{ \frac{1}{n+1} \sum_{i=1}^{n} (\chi_i - \bar{\chi})^2 , \; \frac{1}{n+2} \sum_{i=1}^{n} \chi_i^2 \right\} . \tag{5} \]
This estimator can be viewed as a preliminary test estimator (Sen and Saleh (1987), Lehmann and Casella (1998), and Sclove, Morris, and Radhakrishnan (1972)). The pre-test estimation approach proceeds by testing a null hypothesis that the parameter equals a certain value; if the test accepts the hypothesis, then the estimator based on this parameter value is used; otherwise, an estimator under the general model is used. Since a test is to be performed, a level of significance needs to be specified, so pre-test estimators generally depend on the specified level. Interestingly, the Stein estimator in (5), which could be derived as a pre-test estimator with the hypothesis specifying that μ = 0, eliminates this dependence by utilizing an ‘optimal’ significance level. The approach implemented in our paper differs from that of pre-test estimation, since we altogether avoid the need for testing a hypothesis to derive our estimators.
Note that in problems dealing with the estimation of the normal variance, it is typically assumed that either model ℳ or model ℳ0 holds. However, in many settings the mean can take only a finite number of possible values, such as, for example, in the Neyman-Pearson lemma, where the variance estimator is typically the sample variance, which does not exploit the fact that there are only two possible means.
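To fix ideas, the following sketch computes the global-type estimators of this section. It is a minimal illustration assuming the standard forms reconstructed in (1)–(5), with μ0 = 0 as the pre-test point in the Stein estimator; all function and variable names here are ours:

```python
import numpy as np

def global_variance_estimators(x, mu0=0.0):
    """Global-type estimators of sigma^2 under the full model M and under M0
    (mean known to equal mu0), plus Stein's (1964) estimator; a sketch based
    on the forms reconstructed in (1)-(5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    ss_mean = np.sum((x - x.mean()) ** 2)  # sum of squares about the sample mean
    ss_mu0 = np.sum((x - mu0) ** 2)        # sum of squares about the known mean
    return {
        "UMVUE": ss_mean / (n - 1),        # eq. (1)
        "MRE": ss_mean / (n + 1),          # eq. (3)
        "UMVUE_M0": ss_mu0 / n,            # eq. (4), unbiased when mu = mu0
        "MRE_M0": ss_mu0 / (n + 2),        # eq. (4), minimax under M0
        "Stein": min(ss_mean / (n + 1), ss_mu0 / (n + 2)),  # eq. (5), mu = 0
    }

rng = np.random.default_rng(1)
print(global_variance_estimators(rng.normal(0.5, 2.0, size=10)))
```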
3 Classical Two-Step Estimators
Under ℳp the likelihood function for the sample realization χ = x = (x1, x2,…, xn)′ is
\[ L(\mu, \sigma^2 \mid x) = \prod_{i=1}^{p} \left[ L_i(\mu_i, \sigma^2) \right]^{M_i} , \tag{6} \]
where, for i = 1, 2,…, p, with I{·} denoting the indicator function and Mi = I{μ = μi},
\[ L_i(\mu_i, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{n \hat{\sigma}_i^2}{2\sigma^2} \right\} , \qquad \hat{\sigma}_i^2 = \frac{1}{n} \sum_{j=1}^{n} (x_j - \mu_i)^2 . \tag{7} \]
Li(μi, σ²) is maximized with respect to σ² at σ² = σ̂i², so Li(μi, σ̂i²) = (2πσ̂i²)^(−n/2) e^(−n/2). Define the likelihood-based model selector Î via
\[ \hat{I} = \arg\max_{1 \le i \le p} L_i(\mu_i, \hat{\sigma}_i^2) = \arg\min_{1 \le i \le p} \hat{\sigma}_i^2 . \]
One could employ model selectors different from Î, such as the highest posterior probability (à la Schwarz’ Bayesian criterion (SBC) (Schwarz (1978)) or the Akaike information criterion (AIC) (Akaike (1973), Burnham and Anderson (1998))). In this paper we restrict our attention to the intuitive selector Î, which could actually be viewed also as a highest posterior probability model selector associated with a flat prior distribution. The maximum likelihood estimator (MLE) of σ² under ℳp is
\[ \hat{\sigma}^2_{p,MLE} = \hat{\sigma}^2_{\hat{I}} = \min_{1 \le i \le p} \hat{\sigma}_i^2 , \tag{8} \]
a two-step estimator, with the first stage selecting the sub-model and the second stage using the MLE of σ² in the chosen sub-model. An alternative to the estimator (8) is to use the sub-model’s MRE instead of the MLE of σ²:
\[ \hat{\sigma}^2_{p,MRE} = \frac{n}{n+2} \, \hat{\sigma}^2_{\hat{I}} . \tag{9} \]
Note that the label ‘p,MRE’ (and similar labels in the sequel) is a misnomer since this estimator need not be minimum risk equivariant under model ℳp. However, we keep the name for clarity.
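The two-step recipe (select via Î, then rescale the selected MLE) is easy to express in code. A small sketch follows, using the rescalings in (8) and (9) together with the estimator (17) introduced later, all as reconstructed in this version; the names are ours, and the pLB multiplier requires n > 2:

```python
import numpy as np

def two_step_estimators(x, means):
    """Two-step estimators under M_p: select the sub-model whose known mean
    minimizes sigma_hat_i^2 (the likelihood-based selector I_hat), then
    rescale the selected MLE; a sketch of (8), (9), and (17)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sig2_i = np.array([np.mean((x - m) ** 2) for m in means])  # eq. (7)
    i_hat = int(np.argmin(sig2_i))     # likelihood-based model selector I_hat
    s2 = sig2_i[i_hat]
    return {"pMLE": s2,                # eq. (8)
            "pMRE": n * s2 / (n + 2),  # eq. (9)
            "pLB": n * s2 / (n - 2)}   # eq. (17), needs n > 2

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.5, size=20)
print(two_step_estimators(x, means=[-0.5, 0.0, 0.5]))
```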
4 Bayes and Weighted Estimators
We focus on the class of prior densities of (μ, σ²) which consists of the product of a multinomial probability function and an inverse gamma density:
\[ \pi(\mu, \sigma^2) = \left[ \prod_{i=1}^{p} \tilde{\theta}_i^{\,m_i} \right] \frac{\beta^{\kappa-1}}{\Gamma(\kappa-1)} \left( \frac{1}{\sigma^2} \right)^{\kappa} \exp\left\{ -\frac{\beta}{\sigma^2} \right\} , \tag{10} \]
where σ² > 0, β > 0, mi = I{μ = μi} so that ∑i mi = 1, θ̃i ≥ 0 with ∑i θ̃i = 1, and κ > 1. From (10) and (6), we obtain the posterior density of (μ, σ²) given χ = x:
\[ \pi(\mu, \sigma^2 \mid x) \propto \prod_{i=1}^{p} \left[ \tilde{\theta}_i \, L_i(\mu_i, \sigma^2) \right]^{M_i} \left( \frac{1}{\sigma^2} \right)^{\kappa} \exp\left\{ -\frac{\beta}{\sigma^2} \right\} . \tag{11} \]
Note that in the “vectorized form,” and with a slight abuse of notation, π(μ, σ²|x) = π(m, σ²|x) because {μ = μi} = {m = 1i}, where 1i is a p × 1 vector with ith component equal to 1 and all others equal to 0. It follows that
\[ \pi(m, \sigma^2 \mid x) \propto \prod_{i=1}^{p} \left[ \tilde{\theta}_i \exp\left\{ -\frac{n \hat{\sigma}_i^2}{2\sigma^2} \right\} \right]^{m_i} \left( \frac{1}{\sigma^2} \right)^{\kappa + n/2} \exp\left\{ -\frac{\beta}{\sigma^2} \right\} . \tag{12} \]
4.1 Posterior Probabilities
From the posterior distribution in (11), the marginal posterior density (with respect to counting measure) of μ, or equivalently of m, is
\[ \pi(\mu_i \mid x) = \theta_i(\kappa, \beta, n, x) , \qquad i = 1, 2, \ldots, p , \]
where, for i = 1, 2,…, p, the posterior probability that the sub-model ℳp,i is true, is
\[ \theta_i(\kappa, \beta, n, x) = \frac{ \tilde{\theta}_i \left( \beta + n\hat{\sigma}_i^2/2 \right)^{-(\kappa + n/2 - 1)} }{ \sum_{i'=1}^{p} \tilde{\theta}_{i'} \left( \beta + n\hat{\sigma}_{i'}^2/2 \right)^{-(\kappa + n/2 - 1)} } . \tag{13} \]
Note, as expected, that if θ̃i > 0 and ℳp,i is the true sub-model, θi(κ, β, n, χ), when viewed as a function of χ with (μ, σ²) fixed, converges to 1 with probability one (wp1) as n → ∞. This is because if ℳp,i is the correct model, σ̂i² converges wp1 to σ² by the strong law of large numbers (SLLN); whereas, for i′ ≠ i, σ̂i′² converges wp1 to σ² + (μi − μi′)².
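In code, the posterior probabilities are conveniently evaluated on the log scale to avoid underflow for large n. A minimal sketch, assuming the closed form (13) as reconstructed above (function name and defaults are ours):

```python
import numpy as np
from scipy.special import logsumexp

def submodel_posteriors(x, means, kappa, beta, prior=None):
    """Posterior probabilities theta_i(kappa, beta, n, x) of the sub-models
    M_{p,i} under the multinomial-times-inverse-gamma prior (10), using the
    reconstruction (13); computed in log space for numerical stability."""
    x = np.asarray(x, dtype=float)
    n = x.size
    p = len(means)
    prior = np.full(p, 1.0 / p) if prior is None else np.asarray(prior)
    sig2_i = np.array([np.mean((x - m) ** 2) for m in means])
    log_w = np.log(prior) - (kappa - 1 + n / 2) * np.log(beta + n * sig2_i / 2)
    return np.exp(log_w - logsumexp(log_w))

rng = np.random.default_rng(3)
x = rng.normal(0.5, 1.0, size=25)
print(submodel_posteriors(x, means=[-0.5, 0.0, 0.5], kappa=2.0, beta=1e-8))
```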
4.2 Estimators
The marginal posterior density function of σ² is directly obtained from (11) to be
\[ \pi(\sigma^2 \mid x) = \sum_{i=1}^{p} \theta_i(\kappa, \beta, n, x) \, \frac{ (\beta + n\hat{\sigma}_i^2/2)^{\kappa + n/2 - 1} }{ \Gamma(\kappa + n/2 - 1) } \left( \frac{1}{\sigma^2} \right)^{\kappa + n/2} \exp\left\{ -\frac{\beta + n\hat{\sigma}_i^2/2}{\sigma^2} \right\} . \tag{14} \]
The posterior mean, which is the Bayes estimator of σ² under the loss function L in (2), is then
\[ E(\sigma^2 \mid x) = \sum_{i=1}^{p} \theta_i(\kappa, \beta, n, x) \, \frac{ \beta + n \hat{\sigma}_i^2 / 2 }{ \kappa + n/2 - 2 } . \tag{15} \]
Note that β/(κ − 2) is the prior mean of σ², provided κ > 2 (the condition also needed for the prior variance of σ² to exist), whereas σ̂i² is the MLE of σ² under the ℳp,i model. This estimator mixes in a data-dependent manner, using the posterior probabilities of the p sub-models, the Bayes estimators of σ² from each sub-model. Furthermore, the Bayes estimator of σ² for the ℳp,i sub-model is a convex combination of the ℳp,i-model MLE and the prior mean of σ², a well-known result.
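For completeness, the convex-combination claim can be checked by a one-line rearrangement of the ith term of (15) as reconstructed here; the weight w is notation introduced only for this display:

```latex
\frac{\beta + n\hat{\sigma}_i^2/2}{\kappa + n/2 - 2}
  \;=\; w\,\hat{\sigma}_i^2 \;+\; (1 - w)\,\frac{\beta}{\kappa - 2},
\qquad
w \;=\; \frac{n}{\,n + 2(\kappa - 2)\,} \in (0, 1) \ \text{ for } \kappa > 2 .
```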
To obtain limiting Bayes estimators for σ², we consider improper priors arising by setting θ̃i = 1/p, i = 1, 2, …, p, and β → 0. We examine four κ values: κ → 1, κ → 3/2, κ = 2, and κ = 3. The rationale for these choices is as follows: κ → 1 amounts to placing Jeffreys’ non-informative prior on σ² in each of the p sub-models, since Jeffreys’ prior for σ² (with mean known) is proportional to 1/σ² (Robert (2001)); κ → 3/2 corresponds to the Jeffreys’ prior for σ² when the mean is unknown in the normal model, since in this case Jeffreys’ prior is proportional to (1/σ²)^(3/2); κ = 2 and κ = 3 produce (limiting) Bayes estimators that are convex combinations of the sub-models’ MLEs and MREs, respectively. Table 1 lists the sub-models’ posterior probabilities and the resulting limiting Bayes estimators of σ². Each of the sub-models’ posterior probabilities associated with κ ∈ {1, 3/2, 2, 3} given in Table 1 could also be utilized to form estimators which are convex combinations of the sub-models’ MREs. These new estimators need not, however, be limiting Bayes with respect to our class of priors. These ‘weighted’ estimators are defined as:
\[ \hat{\sigma}^2_{PLB} = \sum_{i=1}^{p} \theta_i(\kappa, 0, n, x) \, \frac{n \hat{\sigma}_i^2}{n+2} , \qquad \kappa \in \{1, 3/2, 2\} . \tag{16} \]
Table 1.

Sub-models’ posterior probabilities and limiting Bayes estimators of σ² for different values of κ when θ̃i = 1/p and β → 0.

| κ | Sub-model posterior probabilities, θi(κ, 0, n, x), i = 1, 2, …, p | Limiting Bayes estimator |
|---|---|---|
| 1 | (σ̂i²)^(−n/2) / ∑j (σ̂j²)^(−n/2) | ∑i θi(1, 0, n, x) n σ̂i²/(n − 2) |
| 3/2 | (σ̂i²)^(−(n+1)/2) / ∑j (σ̂j²)^(−(n+1)/2) | ∑i θi(3/2, 0, n, x) n σ̂i²/(n − 1) |
| 2 | (σ̂i²)^(−(n+2)/2) / ∑j (σ̂j²)^(−(n+2)/2) | ∑i θi(2, 0, n, x) σ̂i² |
| 3 | (σ̂i²)^(−(n+4)/2) / ∑j (σ̂j²)^(−(n+4)/2) | ∑i θi(3, 0, n, x) n σ̂i²/(n + 2) |
Note also from (15) that the estimators n σ̂i²/(n − 2), the ones whose convex combination is being formed in the κ → 1 estimator of Table 1, are the limiting Bayes estimators of σ² for each of the p sub-models under Jeffreys’ non-informative prior when the sub-model’s mean is known (arising from κ → 1). The estimators in Table 1 and in (16) have different flavors than the MLE of σ² given in (8): in the latter, we choose one among the p estimators of σ², while the Bayes and weighted estimators mix sub-model estimators according to the sub-models’ posterior probabilities.
Finally, we define the two-step estimator based on the sub-models’ limiting Bayes estimators:
\[ \hat{\sigma}^2_{p,LB} = \frac{n}{n-2} \, \hat{\sigma}^2_{\hat{I}} . \tag{17} \]
This belongs to the same class of estimators as σ̂²p,MLE and σ̂²p,MRE, differing just in the multipliers, which are functions of n only. Note that for the purposes of obtaining risk functions, it suffices to derive formulas for the mean and variance functions of σ̂²p,MLE = σ̂²Î.
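Gathering the Bayes-flavored estimators in code: the sketch below computes LB1–LB4 and PLB1–PLB3 under the Table 1 reconstructions (the labels follow Tables 2–3 later in the paper; the weight computation is done in log space, and the κ → 1 multiplier requires n > 2):

```python
import numpy as np
from scipy.special import logsumexp

def limiting_bayes_and_weighted(x, means):
    """Limiting Bayes estimators LB1-LB4 (kappa -> 1, 3/2, 2, 3, with
    theta_tilde_i = 1/p and beta -> 0) and weighted estimators PLB1-PLB3
    of (16), following the Table 1 reconstructions."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sig2_i = np.array([np.mean((x - m) ** 2) for m in means])

    def weights(kappa):
        # theta_i(kappa, 0, n, x) proportional to (sig2_i)^{-(kappa - 1 + n/2)}
        log_w = -(kappa - 1 + n / 2) * np.log(sig2_i)
        return np.exp(log_w - logsumexp(log_w))

    mult = {1.0: n / (n - 2), 1.5: n / (n - 1), 2.0: 1.0, 3.0: n / (n + 2)}
    out = {f"LB{j}": float(weights(k) @ (mult[k] * sig2_i))
           for j, k in enumerate([1.0, 1.5, 2.0, 3.0], start=1)}
    out.update({f"PLB{j}": float(weights(k) @ (n * sig2_i / (n + 2)))
                for j, k in enumerate([1.0, 1.5, 2.0], start=1)})
    return out

rng = np.random.default_rng(4)
print(limiting_bayes_and_weighted(rng.normal(0.5, 1.0, size=10), [-0.5, 0.0, 0.5]))
```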
5 Comparison of Estimators
The goal of this section is to compare the performances of the estimators given in Table 1 and (16) with the estimators σ̂²UMVUE and σ̂²MRE developed under ℳ, with the two-step estimators σ̂²p,MLE, σ̂²p,MRE, and σ̂²p,LB, and, for completeness, with the Stein estimator in (5). Performance will be measured by their risk functions arising from the loss function L in (2). In particular, we address the following questions: (i) How much efficiency is lost by using the estimators developed under the wider model ℳ when model ℳp holds? (ii) How do the limiting Bayes and weighted estimators compare with the ℳp MLE-based and MRE-based estimators? (iii) Do the advantages of the ℳp-based estimators over the ℳ-based estimators decrease as the dimension p increases and/or the spacings among the μ1,…,μp decrease?
5.1 Distributional Representations
It is well-known that, provided n > 1, (n − 1)σ̂²UMVUE/σ² has a chi-square distribution with n − 1 degrees of freedom. Therefore, the risk function of σ̂²UMVUE with respect to the loss function L in (2) is R(σ̂²UMVUE; (μ, σ²)) = 2/(n − 1). By exploiting the relationship between σ̂²UMVUE and σ̂²MRE, the risk function of the latter is easily found to be R(σ̂²MRE; (μ, σ²)) = 2/(n + 1). This demonstrates the known fact that σ̂²UMVUE is inadmissible. To compare estimator performances, we will use σ̂²UMVUE as the baseline, so the efficiency of an estimator σ̂² will be given by
\[ \mathrm{Eff}(\hat{\sigma}^2) = \frac{ R(\hat{\sigma}^2_{UMVUE}; (\mu, \sigma^2)) }{ R(\hat{\sigma}^2; (\mu, \sigma^2)) } = \frac{ 2/(n-1) }{ R(\hat{\sigma}^2; (\mu, \sigma^2)) } . \tag{18} \]
Thus, in particular, Eff(σ̂²MRE) = (n + 1)/(n − 1).
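As a quick numerical check (ours) that this convention matches the tables reported later, where efficiencies are given in percent:

```latex
\mathrm{Eff}(\hat{\sigma}^2_{MRE}) = \frac{2/(n-1)}{2/(n+1)} = \frac{n+1}{n-1}:
\qquad
n=3:\ \tfrac{4}{2} = 200\%, \qquad
n=10:\ \tfrac{11}{9} \approx 122\%, \qquad
n=30:\ \tfrac{31}{29} \approx 107\%,
```

which are exactly the constant values appearing in the MRE column of Tables 2 and 3.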
We present some distributional properties of the estimators which will be used to derive the exact expressions of the risk functions of the two-step estimators, and second-order approximations to the risk functions of the limiting Bayes and weighted estimators. Let Z ∼ N(0,1) and Z = (Z1, Z2,…,Zn)′ ∼ Nn(0, I). For the vector of means μ = (μ1,μ2,…,μp)′ with μi0 being the true mean (i0 ∈ {1,2,…,p}), we let
\[ \Delta = (\Delta_1, \Delta_2, \ldots, \Delta_p)' = \frac{\sqrt{n}}{\sigma} \left( \mu - \mu_{i_0} \mathbf{1} \right) , \tag{19} \]
where 1 = (1,1,…,1)′. Note that Δ will always have a zero component under ℳp. In the sequel, the ‘equal-in-distribution’ relation is denoted by =d. To achieve a more fluid presentation, formal proofs of lemmas, propositions, theorems, and corollaries are relegated to Appendix A.
Proposition 5.1
Under ℳp with μi0 the true mean, nσ̂i²/σ² =d W + (Vi − Δi)², i = 1, 2, …, p (jointly in i), where W ∼ χ²n−1, V = (V1, V2, …, Vp)′ ∼ Np(0, J) with J = 11′, and W and V are independent.
From Proposition 5.1, by exploiting the independence between W and V and using the iterated expectation and covariance rules, and by noting that
holds for any k < n − 1, the following corollary immediately follows.
Corollary 5.1
Under the conditions of Proposition 5.1, with . The distribution of T depends on (μ, σ²) only through Δ and, provided that n > 3, the mean vector and covariance matrix of T are given, respectively, by
with .
5.2 Representation and Risk Function of σ̂²p,MLE
We now give a representation of σ̂²p,MLE and obtain the exact expressions for its mean, variance, and risk. For a given Δ, let Δ(1) < Δ(2) < … < Δ(p) denote the associated ordered values.
Theorem 5.1
Let μi0 be the true mean. Then under ℳp,
\[ \frac{n \hat{\sigma}^2_{p,MLE}}{\sigma^2} \;=_d\; W + \sum_{i=1}^{p} \left( Z - \Delta_{(i)} \right)^2 I\left\{ L(\Delta_{(i)}, \Delta) < Z < U(\Delta_{(i)}, \Delta) \right\} , \]
where, under the convention that Δ(0) = −∞ and Δ(p+1) = +∞,
\[ L(\Delta_{(i)}, \Delta) = \frac{\Delta_{(i-1)} + \Delta_{(i)}}{2} \qquad\text{and}\qquad U(\Delta_{(i)}, \Delta) = \frac{\Delta_{(i)} + \Delta_{(i+1)}}{2} , \]
and W ∼ χ²n−1 and Z ∼ N(0, 1) are independent.
Define the events Ω(i) = {L(Δ(i), Δ) < Z < U(Δ(i),Δ)}, i = 1, 2, …, p. The collection of selection events {Î = i}, i = 1, 2, …, p, is in one-to-one correspondence with the collection {Ω(1), Ω(2), …, Ω(p)}, as can be seen from Theorem 5.1, and thus the Ω(i), i = 1, 2, …, p, are disjoint. Using Theorem 5.1, we can now obtain expressions for the mean and variance of σ̂²p,MLE. For i = 1, 2, …, p, we let
\[ \gamma_{(i)} \equiv \Pr\{ \Omega_{(i)} \} = \Phi\left( U(\Delta_{(i)}, \Delta) \right) - \Phi\left( L(\Delta_{(i)}, \Delta) \right) , \tag{20} \]
where Φ(·) is the standard normal distribution function. In the sequel, we let ϕ(·) denote the density function of a standard normal random variable.
Theorem 5.2
Under the conditions of Theorem 5.1,
\[ E_{pMLE}(\Delta) \equiv E\!\left( \frac{\hat{\sigma}^2_{p,MLE}}{\sigma^2} \right) = \frac{1}{n} \left\{ (n-1) + \sum_{i=1}^{p} \left[ \left( 1 + \Delta_{(i)}^2 \right) \gamma_{(i)} + \left( L(\Delta_{(i)}, \Delta) - 2\Delta_{(i)} \right) \phi\!\left( L(\Delta_{(i)}, \Delta) \right) - \left( U(\Delta_{(i)}, \Delta) - 2\Delta_{(i)} \right) \phi\!\left( U(\Delta_{(i)}, \Delta) \right) \right] \right\} , \]
with the convention that t ϕ(t) = 0 for t = ±∞.
Next, we present an expression for the variance function of the estimator σ̂²p,MLE. Towards this end, we introduce some notation to simplify the presentation. For k ∈ Ƶ+ = {0, 1, 2, …}, define
\[ \xi(k; \Omega_{(i)}) = E\left\{ Z^k I(\Omega_{(i)}) \right\} , \qquad i = 1, 2, \ldots, p . \]
Using this, observe that by the binomial expansion, for m ∈ Ƶ+,
\[ \zeta_{(i)}(m) \equiv E\left\{ (Z - \Delta_{(i)})^m I(\Omega_{(i)}) \right\} = \sum_{k=0}^{m} \binom{m}{k} (-\Delta_{(i)})^{m-k} \, \xi(k; \Omega_{(i)}) . \tag{21} \]
To compute the quantity ξ(k;Ω(i)), observe that for k ∈ Ƶ+ and t ∈ ℜ,
\[ \xi(k; (-\infty, t)) \equiv E\left\{ Z^k I(Z < t) \right\} = -t^{k-1}\phi(t) + (k-1)\, \xi(k-2; (-\infty, t)) , \]
with ξ(0; (−∞, t)) = Φ(t) and ξ(1; (−∞, t)) = −ϕ(t). Using the above formulas, we obtain ξ(k;Ω(i)) according to
\[ \xi(k; \Omega_{(i)}) = \xi\left(k; (-\infty, U(\Delta_{(i)}, \Delta))\right) - \xi\left(k; (-\infty, L(\Delta_{(i)}, \Delta))\right) . \tag{22} \]
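Because the displays (21)–(22) above are reconstructions, a quick numerical check is worthwhile. The sketch below (all names ours) compares the recursive ξ values against direct quadrature; the two printed columns should agree:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def xi(k, a, b):
    """Truncated standard-normal moments xi(k; (a, b)) = E{Z^k I(a < Z < b)},
    computed with the integration-by-parts recursion sketched in (21)-(22)."""
    if k == 0:
        return norm.cdf(b) - norm.cdf(a)
    if k == 1:
        return norm.pdf(a) - norm.pdf(b)
    return (a ** (k - 1)) * norm.pdf(a) - (b ** (k - 1)) * norm.pdf(b) \
        + (k - 1) * xi(k - 2, a, b)

a, b = -0.4, 1.3
for k in range(5):
    direct, _ = quad(lambda z: z ** k * norm.pdf(z), a, b)
    print(k, round(xi(k, a, b), 10), round(direct, 10))  # columns should agree
```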
Theorem 5.3
Under the conditions of Theorem 5.1,
\[ V_{pMLE}(\Delta) \equiv \mathrm{Var}\!\left( \frac{\hat{\sigma}^2_{p,MLE}}{\sigma^2} \right) = \frac{1}{n^2} \left\{ 2(n-1) + \sum_{i=1}^{p} \zeta_{(i)}(4) - \left[ \sum_{i=1}^{p} \zeta_{(i)}(2) \right]^2 \right\} . \]
In the situation where there are only two sub-models so p = 2, the expressions for the mean and variance functions of simplify. These simplified forms are provided in the following corollary. The proofs of these results are straightforward, hence to conserve space, we omit them but instead refer the reader to the more detailed technical report by Dukić and Peña (2003).
Corollary 5.2
If p = 2 so Δ = (0, Δ)′, then under the conditions of Theorem 5.1,
\[ E_{pMLE}(\Delta) = 1 + \frac{1}{n} \left[ \Delta^2 \left( 1 - \Phi\!\left( \frac{|\Delta|}{2} \right) \right) - 2 |\Delta| \, \phi\!\left( \frac{|\Delta|}{2} \right) \right] , \]
with V_pMLE(Δ) obtained analogously from Theorem 5.3.
We note from the expression in Corollary 5.2 that, for a fixed n, lim|Δ|→0 EpMLE(Δ) = 1 and σ̂²p,MLE converges to n⁻¹ ∑j (χj − μi0)², the ML (also UMVU, minimax) estimator of σ² under the true model. Also, for a fixed Δ, we see that limn→∞ EpMLE(Δ) = 1 and limn→∞ {n(VpMLE(Δ))} = 2.
The next result in Corollary 5.3 shows that even though the sub-models’ MLEs are each unbiased for σ², the two-step estimator σ̂²p,MLE, which employs the MLE of the sub-model selected by the model selector Î, is a negatively biased estimator of σ². The result is an immediate consequence of Corollary 5.2 by noting that the continuous function g(u) = ϕ(u) − u[1 − Φ(u)] is positive by virtue of the facts that limu↓0 g(u) > 0, limu→∞ g(u) = 0, and g′(u) = ϕ′(u) + uϕ(u) − [1 − Φ(u)] = −[1 − Φ(u)] < 0 since ϕ′(u) = −uϕ(u).
Corollary 5.3
Under the conditions of Corollary 5.2 with Δ ≠ 0, EpMLE(Δ) < 1; that is, σ̂²p,MLE is negatively biased for σ².
Now that we have the exact expressions for the mean and variance of σ̂²p,MLE, we could obtain the risk function of σ̂²p,MLE under ℳp and loss L in (2) as
\[ R\!\left( \hat{\sigma}^2_{p,MLE}; (\mu, \sigma^2) \right) = V_{pMLE}(\Delta) + \left[ E_{pMLE}(\Delta) - 1 \right]^2 . \tag{23} \]
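Putting the pieces of this subsection together in code: a self-contained sketch (ours) that evaluates EpMLE(Δ), VpMLE(Δ), and the risk (23) exactly, assuming the Theorem 5.1 representation and the ξ/ζ machinery as reconstructed above:

```python
import numpy as np
from math import comb, isinf
from scipy.stats import norm

def pmle_moments(n, deltas):
    """Exact mean, variance, and risk (23) of sig2_pMLE/sig2, computed from
    the Theorem 5.1 representation with the zeta/xi recursions of (20)-(22);
    deltas must contain 0 (the true sub-model)."""
    d = np.sort(np.asarray(deltas, dtype=float))
    mid = (d[:-1] + d[1:]) / 2
    lo = np.concatenate(([-np.inf], mid))   # L(Delta_(i), Delta)
    hi = np.concatenate((mid, [np.inf]))    # U(Delta_(i), Delta)

    def xi(k, a, b):                        # E{Z^k I(a < Z < b)} by recursion
        if k == 0:
            return norm.cdf(b) - norm.cdf(a)
        if k == 1:
            return norm.pdf(a) - norm.pdf(b)
        pa = 0.0 if isinf(a) else a ** (k - 1) * norm.pdf(a)
        pb = 0.0 if isinf(b) else b ** (k - 1) * norm.pdf(b)
        return pa - pb + (k - 1) * xi(k - 2, a, b)

    def zeta(m, i):                         # E{(Z - Delta_(i))^m I(Omega_(i))}
        return sum(comb(m, k) * (-d[i]) ** (m - k) * xi(k, lo[i], hi[i])
                   for k in range(m + 1))

    z2 = sum(zeta(2, i) for i in range(len(d)))
    z4 = sum(zeta(4, i) for i in range(len(d)))
    E = (n - 1 + z2) / n
    V = (2 * (n - 1) + z4 - z2 ** 2) / n ** 2
    return E, V, V + (E - 1) ** 2           # risk as in (23)

print(pmle_moments(n=10, deltas=[-0.5, 0.0, 0.5]))  # E_pMLE, V_pMLE, risk
```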
Finally, for σ̂²p,MLE, we address the question of what happens when p increases and the spacings in Δ decrease. This will indicate whether we will lose the advantage of the ℳp-based estimators over the ℳ-based estimators. The proof of Theorem 5.4 is rather lengthy and hence omitted; instead we refer the reader to the technical report by Dukić and Peña (2003).
Theorem 5.4
Given n fixed, if p → ∞, max2≤i≤p |Δ(i) − Δ(i−1)| → 0, with Δ(1) → Δmin ∈ (−∞, 0], and Δ(p) → Δmax ∈ [0, ∞), then
Letting Δmin → − ∞ and Δmax → ∞, EpMLE(Δ) → 1 − 1/n and VpMLE(Δ) → (2/n) (1 − 1/n).
Using Theorem 5.4 we can now address the issue of whether the advantage of the two-step estimator developed under model ℳp over the estimator developed under the more general model ℳ is lost as p increases. For this purpose we have the following corollary.
Corollary 5.4
With n > 1 fixed, if as p → ∞, max2≤i≤p |Δ(i) − Δ(i−1)| → 0, Δ(1) → −∞, and Δ(p) → ∞, then (i) Eff(σ̂²p,MLE) → 2n²/[(n − 1)(2n − 1)]; (ii) Eff(σ̂²p,MRE) → 2(n + 2)²/[(n − 1)(2n + 7)]; (iii) Eff(σ̂²p,MRE)/Eff(σ̂²p,MLE) → (n + 2)²(2n − 1)/[n²(2n + 7)] > 1; and (iv) Eff(σ̂²p,MRE)/Eff(σ̂²MRE) → 2(n + 2)²/[(n + 1)(2n + 7)] < 1. In addition, σ̂²p,LB is dominated by σ̂²UMVUE.
The fourth result in Corollary 5.4 indicates that when the number of sub-models increases indefinitely, the estimator σ̂²MRE (which is the minimum risk estimator under the general model ℳ) dominates the two-step estimator σ̂²p,MRE (which was developed by exploiting the sub-model structure of ℳp). Using the limiting results for p = 2 and as |Δ| → 0, stated after Corollary 5.2, we find that the limiting risk function of σ̂²p,MRE is 2/(n+2), which is smaller than 2/(n+1), the risk function of σ̂²MRE. This shows that when the number of sub-models is small, we can gain efficiency by using the two-step estimator developed under model ℳp. These results agree with our intuition: when the number of sub-models increases it is better to utilize the best estimator developed under the more general model. However, as will be seen in the simulation studies reported later in the paper, the weighted and Bayes-type estimators’ performance seems not to be degraded by an increase in the number of sub-models.
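The limits in Corollary 5.4 can be illustrated numerically. Below is a minimal Monte Carlo sketch (ours), assuming the Theorem 5.1 representation as reconstructed above:

```python
import numpy as np

def mc_efficiencies(n, p, delta_max, reps=100_000, seed=0):
    """Monte Carlo illustration of Corollary 5.4: with a fine, widening grid
    of Delta values, the efficiencies of pMLE and pMRE approach the stated
    limits.  Uses the Theorem 5.1 representation on the selected cell."""
    rng = np.random.default_rng(seed)
    deltas = np.linspace(-delta_max, delta_max, p)
    step = deltas[1] - deltas[0]
    W = rng.chisquare(n - 1, size=reps)
    Z = rng.standard_normal(reps)
    idx = np.clip(np.rint((Z - deltas[0]) / step).astype(int), 0, p - 1)
    t = (W + (Z - deltas[idx]) ** 2) / n          # sig2_pMLE / sig2
    base = 2.0 / (n - 1)                          # exact risk of the UMVUE
    eff_pmle = base / np.mean((t - 1.0) ** 2)
    eff_pmre = base / np.mean((n * t / (n + 2) - 1.0) ** 2)
    return eff_pmle, eff_pmre

n = 10
print(mc_efficiencies(n, p=513, delta_max=6.0))
print(2 * n**2 / ((n - 1) * (2 * n - 1)),         # Corollary 5.4(i) limit
      2 * (n + 2)**2 / ((n - 1) * (2 * n + 7)))   # Corollary 5.4(ii) limit
```

For n = 10 the two printed lines should roughly agree, illustrating limits (i) and (ii).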
5.3 Representation of Limiting Bayes and Weighted Estimators
We now provide distributional representations useful for the limiting Bayes estimators and the weighted estimators under ℳp, in order to find an approximation to the risk functions of these estimators. For α > 0, define the “umbrella” estimator σ̂²α as
| (24) |
Individual estimators are easily derived from this umbrella estimator by choosing an appropriate α. For example:
Theorem 5.5
Under ℳp where μi0 is the true mean, for a fixed α > 0, , where
Consequently, the distribution of σ̂²α/σ² depends on (μ, σ²) only through Δ.
From the distributional representation in Theorem 5.5, a closed-form expression for the risk function of σ̂²α will be difficult to obtain because of the adaptive, i.e., data-dependent, nature of the mixing probabilities θi(T) and the fact that these are rational functions of T. To obtain an approximation to the risk function of σ̂²α we used a second-order Taylor expansion of the function H(T) about T = ν, the mean vector of T. For notation, let
\[ \nu = E(T) , \qquad H^{(1)} = \frac{\partial}{\partial T} H(T) \Big|_{T = \nu} , \qquad H^{(2)} = \frac{\partial^2}{\partial T \, \partial T'} H(T) \Big|_{T = \nu} . \]
A second-order Taylor approximation for σ̂²α is provided by
| (25) |
From this approximate representation, we are able to obtain approximate expressions for the mean and variance of the Bayes estimator. These mean and variance expressions, which involve the constant Cn defined in Corollary 5.1, are given in the next two theorems. The proofs require several intermediate results (contained in lemmas), and these are presented in Appendix A.
Theorem 5.6
Under ℳp, a second-order approximation to the mean of σ̂²α/σ² is the quantity E2(Δ).
Theorem 5.7
Under ℳp, a second-order approximation to the variance of σ̂²α/σ² is V2(Δ) = VE(Δ) + EV(Δ), where VE(Δ) and EV(Δ) are, respectively, the approximate variance of the conditional mean given W and the approximate mean of the conditional variance given W.
From these expressions, we can compute a second-order approximation to the risk function of σ̂²α according to the formula
\[ R_2(\Delta; \alpha) = V_2(\Delta; \alpha) + \left[ E_2(\Delta; \alpha) - 1 \right]^2 , \tag{26} \]
where E2(Δ;α) ≡ E2(Δ) and V2(Δ;α) ≡ V2(Δ) are given in Theorem 5.6 and Theorem 5.7, respectively. For the other limiting Bayes and weighted estimators in Table 1 and (16), analogous approximate risk expressions can be obtained similarly.
Lastly, still for a given α > 0, we present a few expressions for the components ∂H(T)/∂Tk, k ∈ {1, 2, …, p}, of the p × 1 vector H(1)(T) and the components ∂²H(T)/∂Tk∂Tl, k, l ∈ {1, 2, …, p}, of the p × p matrix H(2)(T), which when evaluated at T = ν yield H(1) and H(2), respectively. From the expressions for H(T) and θi(T) in Corollary 5.1, we find that for j,k ∈ {1, 2, …, p},
where, for i, k ∈ {1, 2, …, p}, ; and, for i,k,l ∈ {1, 2, …, p},
5.4 Assessing the Second-Order Approximations via Simulation
To assess the goodness of the second-order approximations, we compared the values of the means, variances, and risks of the limiting Bayes and weighted estimators based on 10,000 simulated datasets to their second-order approximations. The results revealed the same pattern across all choices of Δ. For n = 3 the approximation performs rather poorly, gradually improving with increasing n, to finally become almost identical to the simulation-based values when n is 30. In one of the worst-case scenarios, when n = 3 and Δ is symmetric with a medium-size spread (such as Δ = (−0.25, 0, 0.25)), the approximate mean values lie generally within 20% of the simulated ones. Similar behavior is exhibited by the variances and risks. Furthermore, as the model dimension p increases or as the separations among the sub-models’ means become smaller, the differences between simulated values and approximations also seem to diminish. With increasing n the accuracy of the approximations improves. Therefore, the second-order approximation appears to work well overall, but when n is small (less than 15) it seems better to use simulations. In the remainder of this paper, all analyses involving risks of the limiting Bayes and weighted estimators are based on simulations.
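The simulation protocol just described is easy to replicate. Below is a self-contained Monte Carlo sketch (ours) in the spirit of Sections 5.4–5.5, comparing simulated efficiencies of σ̂²MRE, σ̂²p,MRE, and the weighted estimator PLB1 under the loss (2); if the standardization (19) is as reconstructed here, the output should be comparable to the corresponding rows of Table 2:

```python
import numpy as np
from scipy.special import logsumexp

def simulate_efficiencies(n, Delta, true_idx=0, reps=10_000, seed=0):
    """Monte Carlo efficiencies (in %, relative to the exact UMVUE risk
    2/(n-1)) of MRE, pMRE, and PLB1, with sub-model means chosen so the
    standardized distances equal Delta as in (19); sigma^2 = 1 throughout."""
    rng = np.random.default_rng(seed)
    Delta = np.asarray(Delta, dtype=float)
    means = Delta / np.sqrt(n)          # mu_i - mu_{i0} = sigma * Delta_i / sqrt(n)
    mu_true = means[true_idx]           # Delta[true_idx] must be 0
    losses = {"MRE": 0.0, "pMRE": 0.0, "PLB1": 0.0}
    for _ in range(reps):
        x = rng.normal(mu_true, 1.0, size=n)
        s2_i = np.array([np.mean((x - m) ** 2) for m in means])
        ests = {"MRE": np.sum((x - x.mean()) ** 2) / (n + 1),
                "pMRE": n * s2_i.min() / (n + 2)}
        log_w = -(n / 2) * np.log(s2_i)               # kappa -> 1 weights
        w = np.exp(log_w - logsumexp(log_w))
        ests["PLB1"] = w @ (n * s2_i / (n + 2))
        for name, est in ests.items():
            losses[name] += (est - 1.0) ** 2 / reps   # loss (2) with sigma^2 = 1
    return {name: round(100 * (2 / (n - 1)) / risk) for name, risk in losses.items()}

print(simulate_efficiencies(n=10, Delta=[-0.5, 0.0, 0.5], true_idx=1))
```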
5.5 Comparison of Relative Efficiencies
We now carry out the comparison of the relative efficiencies of the variance estimators with respect to σ̂²UMVUE, using simulated datasets with a variety of Δ values for n ∈ {3, 10, 30}. The results are summarized in Tables 2 and 3 and Figures 1 and 2.
Table 2.
Efficiencies (in %, relative to the UMVU estimator σ̂²UMVUE) of the different variance estimators for different combinations of p, Δ, and n. For the limiting Bayes and weighted estimators, the values are based on simulation studies with 10,000 replications for each combination.
| Combination of p and Δ | n | MRE | pMLE | pMRE | ALB | LB1 | LB2 | LB3 | LB4 | PLB1 | PLB2 | PLB3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Δ = (−0.25, 0, 0.25), p = 3 | 3 | 200 | 174 | 222 | 14 | 10 | 60 | 152 | 230 | 240 | 236 | 233 |
| | 10 | 122 | 117 | 124 | 71 | 62 | 89 | 113 | 128 | 129 | 129 | 128 |
| | 30 | 107 | 106 | 107 | 91 | 89 | 99 | 107 | 105 | 112 | 112 | 112 |
| Δ = (−0.5, 0, 0.5), p = 3 | 3 | 200 | 183 | 209 | 17 | 11 | 66 | 160 | 217 | 235 | 226 | 221 |
| | 10 | 122 | 119 | 124 | 73 | 59 | 86 | 110 | 127 | 129 | 128 | 127 |
| | 30 | 107 | 106 | 110 | 89 | 84 | 95 | 104 | 109 | 111 | 111 | 111 |
| Δ = (0, 0.25, 0.50), p = 3 | 3 | 200 | 164 | 226 | 13 | 9 | 51 | 135 | 227 | 238 | 234 | 232 |
| | 10 | 122 | 114 | 127 | 66 | 53 | 78 | 103 | 131 | 129 | 128 | 128 |
| | 30 | 107 | 105 | 109 | 88 | 81 | 92 | 100 | 107 | 108 | 108 | 108 |
| Δ = (0, 0.5, 1), p = 3 | 3 | 200 | 166 | 222 | 13 | 8 | 47 | 128 | 222 | 233 | 228 | 224 |
| | 10 | 122 | 115 | 128 | 65 | 51 | 76 | 102 | 126 | 126 | 126 | 126 |
| | 30 | 107 | 105 | 110 | 87 | 83 | 94 | 103 | 109 | 110 | 110 | 110 |
| Δ = (−0.25 : 2⁻⁴ : 0.25), p = 9 | 3 | 200 | 174 | 222 | 14 | 10 | 58 | 149 | 234 | 241 | 238 | 235 |
| | 10 | 122 | 117 | 123 | 71 | 61 | 88 | 112 | 127 | 130 | 130 | 129 |
| | 30 | 107 | 105 | 106 | 91 | 88 | 98 | 106 | 109 | 111 | 111 | 111 |
| Δ = (−0.25 : 2⁻⁵ : 0.25), p = 17 | 3 | 200 | 174 | 222 | 14 | 10 | 57 | 145 | 234 | 239 | 237 | 234 |
| | 10 | 122 | 117 | 123 | 71 | 60 | 86 | 109 | 130 | 127 | 127 | 126 |
| | 30 | 107 | 105 | 106 | 91 | 87 | 97 | 105 | 108 | 110 | 110 | 110 |
| Δ = (0 : 2⁻⁴ : 0.5), p = 9 | 3 | 200 | 164 | 225 | 13 | 9 | 52 | 137 | 230 | 243 | 240 | 238 |
| | 10 | 122 | 114 | 126 | 66 | 53 | 77 | 102 | 129 | 128 | 127 | 127 |
| | 30 | 107 | 104 | 108 | 88 | 76 | 87 | 97 | 109 | 107 | 107 | 107 |
| Δ = (0 : 2⁻⁵ : 0.5), p = 17 | 3 | 200 | 164 | 225 | 13 | 10 | 55 | 145 | 235 | 249 | 246 | 243 |
| | 10 | 122 | 114 | 126 | 66 | 54 | 79 | 106 | 130 | 134 | 133 | 133 |
| | 30 | 107 | 104 | 108 | 88 | 76 | 87 | 97 | 111 | 107 | 107 | 108 |
Table 3.
Efficiencies (in %, relative to the UMVU estimator σ̂²UMVUE) of the different variance estimators for different combinations of p, Δ, and n. For the limiting Bayes and weighted estimators, 10,000 simulation replications were performed for each combination.
| Combination of p and Δ | n | MRE | pMLE | pMRE | ALB | LB1 | LB2 | LB3 | LB4 | PLB1 | PLB2 | PLB3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Δ = (0, 1), p = 2 | 3 | 200 | 170 | 232 | 13 | 8 | 52 | 143 | 229 | 243 | 237 | 234 |
| | 10 | 122 | 115 | 134 | 63 | 55 | 81 | 107 | 130 | 129 | 130 | 130 |
| | 30 | 107 | 104 | 111 | 85 | 85 | 96 | 105 | 109 | 112 | 112 | 112 |
| Δ = (−1, 0, 1), p = 3 | 3 | 200 | 195 | 216 | 18 | 10 | 63 | 162 | 221 | 255 | 237 | 228 |
| | 10 | 122 | 120 | 134 | 67 | 51 | 77 | 103 | 129 | 126 | 127 | 128 |
| | 30 | 107 | 104 | 111 | 85 | 84 | 95 | 103 | 110 | 111 | 111 | 111 |
| Δ = (−1 : 2⁻¹ : 1), p = 5 | 3 | 200 | 185 | 199 | 19 | 11 | 66 | 162 | 208 | 246 | 230 | 221 |
| | 10 | 122 | 119 | 124 | 73 | 52 | 78 | 103 | 125 | 126 | 126 | 125 |
| | 30 | 107 | 106 | 110 | 89 | 81 | 92 | 101 | 107 | 108 | 108 | 108 |
| Δ = (−1 : 2⁻² : 1), p = 9 | 3 | 200 | 182 | 195 | 20 | 11 | 64 | 153 | 210 | 232 | 219 | 211 |
| | 10 | 122 | 118 | 120 | 74 | 56 | 82 | 106 | 126 | 126 | 125 | 125 |
| | 30 | 107 | 106 | 107 | 91 | 82 | 92 | 100 | 105 | 106 | 106 | 106 |
| Δ = (−1 : 2⁻³ : 1), p = 17 | 3 | 200 | 181 | 194 | 20 | 11 | 65 | 155 | 207 | 234 | 220 | 213 |
| | 10 | 122 | 117 | 119 | 75 | 55 | 81 | 104 | 126 | 125 | 124 | 123 |
| | 30 | 107 | 105 | 106 | 92 | 80 | 90 | 99 | 107 | 107 | 107 | 107 |
| Δ = (−1 : 2⁻⁴ : 1), p = 33 | 3 | 200 | 181 | 194 | 20 | 12 | 72 | 168 | 206 | 240 | 226 | 217 |
| | 10 | 122 | 117 | 119 | 75 | 57 | 82 | 106 | 125 | 126 | 125 | 124 |
| | 30 | 107 | 105 | 105 | 92 | 82 | 93 | 101 | 105 | 108 | 108 | 108 |
| Δ = (−1 : 2⁻⁵ : 1), p = 65 | 3 | 200 | 181 | 194 | 20 | 11 | 65 | 156 | 208 | 234 | 221 | 214 |
| | 10 | 122 | 117 | 119 | 75 | 58 | 84 | 109 | 123 | 128 | 127 | 127 |
| | 30 | 107 | 105 | 105 | 92 | 82 | 93 | 102 | 107 | 109 | 109 | 109 |
| Δ = (−1 : 2⁻⁶ : 1), p = 129 | 3 | 200 | 181 | 194 | 20 | 11 | 69 | 163 | 207 | 237 | 224 | 216 |
| | 10 | 122 | 117 | 119 | 75 | 56 | 82 | 106 | 126 | 126 | 125 | 125 |
| | 30 | 107 | 105 | 105 | 92 | 81 | 91 | 99 | 108 | 106 | 106 | 106 |
| Δ = (−1 : 2⁻⁷ : 1), p = 257 | 3 | 200 | 181 | 194 | 20 | 11 | 67 | 159 | 205 | 236 | 223 | 215 |
| | 10 | 122 | 117 | 119 | 75 | 57 | 83 | 107 | 124 | 128 | 127 | 126 |
| | 30 | 107 | 105 | 105 | 92 | 82 | 92 | 101 | 108 | 108 | 108 | 108 |
| Δ = (−1 : 2⁻⁸ : 1), p = 513 | 3 | 200 | 181 | 194 | 20 | 11 | 66 | 156 | 206 | 233 | 221 | 213 |
| | 10 | 122 | 117 | 119 | 75 | 58 | 84 | 108 | 127 | 128 | 127 | 126 |
| | 30 | 107 | 105 | 105 | 92 | 80 | 91 | 99 | 106 | 107 | 107 | 107 |
Figure 1.
Relative efficiencies of pMRE with respect to MRE in the symmetric and asymmetric Δ cases, as a function of Δmax and the number of sub-models p, for sample size n = 10. The symmetric case is of the form Δ = [−Δmax : Δmax/(p − 1) : Δmax], while the asymmetric case is of the form Δ = [0 : Δmax/(2(p − 1)) : Δmax].
Figure 2.
Efficiencies of the leading estimators and Stein’s estimator of σ², relative to the UMVUE, for p = 2 and Δ = (0, Δ) with Δ varying, for n = 3, 10. The connected scatterplots represent the simulated relative efficiencies based on 20,000 replications for each Δ for the limiting Bayes (LB4, lighter of the top two), weighted (PLB1, darker of the top two), and Stein (bottom) estimators. The smooth curves correspond to theoretical efficiencies of (top), (middle), and (bottom).
Table 2 focuses on the differences in relative efficiencies between the symmetric and asymmetric Δ cases. As can be seen, there does not seem to be a strong effect of the asymmetry of Δ on the estimators. In all Δ cases that we have chosen, the weighted estimators PLB1–PLB3 and the limiting Bayes estimator LB4 perform best, with the two-step estimator σ̂²p,MRE following. Clearly, the best among the ℳp-based estimators dominate the ℳ-based estimators σ̂²UMVUE and σ̂²MRE, with the gain in efficiency being quite impressive for small sample sizes.
Table 3 is designed to examine the impact of an increasing number of sub-models. We see that when p is large the two-step estimators σ̂²p,MLE and σ̂²p,MRE become less efficient than the global-type estimator σ̂²MRE. This result is consistent with the theoretical result of Corollary 5.4. Note also that even for p as low as 33, the ratios of the relative efficiency values from Table 3 start to agree (to the third decimal) with the limiting relative efficiencies predicted by Corollary 5.4. The weighted estimators do not seem to be affected much by the increasing p, faring much better than the two-step estimators.
Figure 1 presents two contour plots of the relative efficiencies of σ̂²p,MRE with respect to σ̂²MRE as a function of p (where p > 3) and the range of the values in the Δ vector. The top and bottom contour plots consider the symmetric and asymmetric Δ cases separately. The top contour plot reveals a structure that is consistent with Corollary 5.4, especially in the regions of the contour plot where p → ∞ and Δmax → ∞ (hence Δmin → − ∞), which fall in the top and bottom right corners, where σ̂²MRE starts to dominate σ̂²p,MRE. Note that the 98% relative efficiency in this region is very close to the limiting 97% from Corollary 5.4. The bottom contour plot is constructed using Δ that are quite asymmetric (all Δmin = 0), and therefore could not be compared to the predictions of Corollary 5.4. From this plot we can see, however, that σ̂²p,MRE seems to dominate σ̂²MRE everywhere.
Figure 2 explores the case when p = 2 only, so Δ = (0, Δ), for sample sizes n = 3 and n = 10. The plots depict the efficiency of the leading estimators, together with that of the Stein estimator in (5), as a function of the magnitude of the Δ parameter. As can be seen, the two-step estimator performs best when |Δ| is large, with the best weighted estimator giving a very comparable performance. The limiting Bayes estimator performs better than the two-step estimator when |Δ| is closer to zero, but degrades in performance when |Δ| becomes large. Thus, we see that the estimators’ performances and the regions where they perform well will depend to a large extent on the magnitude of Δ. In particular, it appears that the best among the weighted estimators perform very well when |Δ| is neither too large nor too small, while the two-step estimator performs very well when |Δ| is large. This points to the following intuitive explanation: when |Δ| is large, the two models are well-separated and the model selection is easier; however, when |Δ| is neither too small nor too large, it is not so clear which model to choose, and it seems better to average over the sub-models’ estimators. When |Δ| is quite close to zero, i.e., when there is not much difference among the sub-models, either approach to estimation works well. With regard to the Stein estimator, observe that though it dominates σ̂²MRE, as is expected from theory, it is at the same time dominated by the estimators that exploit the sub-model structure.
Overall, based on the results of the risk comparison, the estimators performing best are the Bayes-type or weighted estimators LB4 and PLB1–PLB3, and the two-step estimator σ̂²p,MRE. We give a slight preference to the weighted estimators because their performance does not degrade much even when the number of sub-models increases, in contrast to the two-step estimator, which becomes dominated by the ℳ-based estimator σ̂²MRE when p, the number of sub-models, increases.
Finally, a cautionary note arising from these efficiency studies is that one ought to be very careful in the choice of prior parameters. At least in the situation when one is concerned with variance estimation, the limiting Bayes estimators LB1 and LB2, corresponding to the limiting cases of κ → 1 and κ → 3/2 respectively, perform quite poorly, especially for small sample sizes. These two estimators are dominated by the UMVU estimator in terms of risk function. However, these improper priors associated with the limiting values of κ are most likely the worst-case scenarios, and other, more carefully chosen and meaningful priors should result in improved performance.
6 Concluding Remarks
We have examined some of the issues arising when considering a model with a finite number of sub-models, where the goal is to make inference about a common parameter among these sub-models, based on a single realization of a sample. It is of interest to determine which of the three possible strategies is preferable: (i) to utilize a wider model that encompasses all competing sub-models; (ii) to adopt a two-step approach: select the sub-model, and then do inference within this chosen sub-model, but with both steps utilizing the same sample data; (iii) to do a sub-model averaging scheme where the inference procedure is formed by weighting the sub-models’ procedures, with the weights being also data-dependent. The second strategy may be labeled the classical approach, while the third strategy coincides with, or is motivated by, the Bayesian approach.
Through a simple model prototype with a finite number of Gaussian sub-models with common variance but different means, we have studied each of the strategies with regard to the estimation of variance. Based on the theoretical and simulated comparison of the different types of estimators, and with the estimator performance evaluated through risk functions based on quadratic loss, we have reached the following conclusions: (i) There could be considerable improvement in using estimators developed by exploiting the structure of the sub-models, over the strategy of simply using estimators from a wider model. (ii) However, the properties of these resulting estimators may be quite difficult to obtain. Furthermore, some desirable properties of the sub-model estimators, such as unbiasedness and minimum variance, may not carry over when they are combined to form the estimator for the full model of interest. (iii) Based on the theoretical and simulated results for the variance parameter σ² considered in this paper, the weighted estimators, which were motivated and/or derived via the Bayesian approach, seem preferable over the two-step estimators even though these estimators were derived using improper priors. (iv) When the number of sub-models increases and two-step estimators are employed, it appears that their performance could degrade relative to estimators developed under a wider model, but the weighted estimators’ performances are not necessarily affected. (v) And, finally, when developing weighted estimators through the Bayesian framework, caution must be observed in assigning prior parameters, as a particular prior specification may also lead to poor estimators.
Approaches similar to the one presented here could be a useful first step in many contexts where model selection is often done and where there exists a natural notion of “distance” among models: regression, survival analysis, or goodness-of-fit testing. For example, techniques and results presented here could be extended to settings of regression with p possible predictor variables, where the goal is estimation of the dispersion parameters associated with the error components for the 2ᵖ competing sub-models. There is clearly a need for studies of more complicated situations in varied settings, where multiple parameters are being estimated simultaneously and/or where sub-models are of different dimensions (Claeskens and Hjort (2003)). As was pointed out earlier, for these more general settings, exact risk expressions may not be possible and asymptotic analysis may be needed, in contrast to the situation considered in this paper where the Gaussian distributional assumption enabled us to obtain concrete results. Also, many other interesting alternative options need to be examined: for example, in the two-step approach, would it have been better to subdivide the sample data into two parts and use the first part for model selection and the second part for making inference in the chosen sub-model, an issue alluded to, for instance, in Hastie, Tibshirani, and Friedman (2001) and investigated by Yang (2003)? Finally, we have observed in Dukić and Peña (2003) that in the case of estimation of the distribution function in this same specific model, a different conclusion holds with respect to which of the three strategies is preferable. This is consistent with recent work by Claeskens and Hjort (2003), where they advocate the use of a focussed information criterion in model selection problems that is tailored to the specific parameter of interest.
Acknowledgments
The authors thank the Editor, Associate Editor, and Referees for their comments and criticisms which led to a considerably improved paper.
A Appendix: Proofs of Selected Results
In this appendix we gather the technical proofs of the results presented in earlier sections.
Proof of Proposition 5.1
With Zj = (χj − μi0)/σ, j = 1, 2, …, n, we have nσ̂i²/σ² = ∑j (Zj − Z̄)² + (√n Z̄ − Δi)², i = 1, 2, …, p. Letting W = ∑j (Zj − Z̄)² and Vi = √n Z̄, i = 1, 2, …, p, it follows that nσ̂i²/σ² = W + (Vi − Δi)², with W and V = (V1, V2, …, Vp)′ independent. Furthermore, since √n Z̄ ∼ N(0, 1), V has representation
\[ V = Z \mathbf{1} , \qquad Z \sim N(0, 1) , \tag{27} \]
so V ∼ Np(0,J) since J = 11′.
Proof of Theorem 5.1
By Prop. 5.1, nσ̂²p,MLE/σ² = n min1≤i≤p σ̂i²/σ² =d W + min1≤i≤p (Z − Δi)². Now, min1≤i≤p (Z − Δi)² = ∑i (Z − Δ(i))² I{Δ(i) is the Δ-value nearest Z}. But, for each i, the event {Δ(i) is the Δ-value nearest Z} is precisely {L(Δ(i), Δ) < Z < U(Δ(i), Δ)}.
Proof of Theorem 5.2
From Theorem 5.1, using E{Z² I(a < Z < b)} = [Φ(b) − Φ(a)] + a ϕ(a) − b ϕ(b) for a < b,
Proof of Theorem 5.3
From Theorem 5.1, by independence of W and Z, and because Ω(i)s depend only on Z, we have . We already know that Var(W) = 2(n − 1). On the other hand, . By the variance formula and definition of ζ(i) (k), ; and , as I(Ω(i))I(Ω(j)) = 0 whenever i ≠ j. Consequently, .
Proof of Corollary 5.4
From Theorem 5.4, R(σ̂²p,MLE; (μ, σ²)) → (2n − 1)/n². Also, since σ̂²p,MRE = n σ̂²p,MLE/(n + 2), we have R(σ̂²p,MRE; (μ, σ²)) → (2n + 7)/(n + 2)². The efficiency expressions relative to σ̂²UMVUE are obtained by dividing 2/(n − 1) by the preceding risk expressions. Both of the resulting efficiency expressions are easily shown to exceed 1. Taking the ratio of Eff(σ̂²p,MRE) and Eff(σ̂²p,MLE) yields the third expression, an expression easily shown to exceed 1. In comparing σ̂²p,MRE and σ̂²MRE, we observe that Eff(σ̂²p,MRE)/Eff(σ̂²MRE) → 2(n + 2)²/[(n + 1)(2n + 7)]. Since 2(n + 2)²/[(n + 1)(2n + 7)] − 1 = −(n−1)/[(n + 1)(2n + 7)] < 0 for n > 1, we have established that, as p → ∞, σ̂²MRE is more efficient than σ̂²p,MRE. To show that σ̂²p,LB is dominated by σ̂²UMVUE, note that R(σ̂²p,LB; (μ, σ²)) → (2n − 1)/(n − 2)², which is easily shown to exceed 2/(n − 1) whenever n > 2.
Proof of Theorem 5.5
Starting from (24) and using Corollary 5.1, we obtain
The following lemma will be needed for proving Theorems 5.6 and 5.7.
Lemma A.1
Under the conditions of Proposition 5.1, (i) ; (ii); (iii) E{W(T − ν)} = −ν; and (iv) , with constant Cn given in Corollary 5.1.
Proof of Lemma A.1
Since , then the first result immediately follows. Using that E(W) = n - 1 and , the third result follows trivially from the first identity. To prove the second result, observe that
and . Consequently,
Finally, by the iterated expectation rule, and using the expressions for E(W) and ,
Simplifying leads to the expression given in the statement of the lemma.
Proof of Theorem 5.6
First, note that by using the fourth result in Lemma A.1, we have
From (25), and using Lemma A.1 and the preceding result, the approximation to the mean of is obtained to be
which simplifies to the expression in the statement of the theorem.
To obtain the second-order approximation to the variance of σ̂²α/σ², we first establish two intermediate lemmas concerning the conditional mean and variance, given W, of the variable
| (28) |
which is the second-order Taylor approximation of the function 1 + H(T) (see equation (25)).
Lemma A.2
Under conditions of Proposition 5.1,
Proof of Lemma A.2
From the proof of and results in Lemma A.1, it follows that
From the first result in Lemma A.1 and the preceding result, the result follows.
Lemma A.3
Under conditions of Proposition 5.1,
Proof of Lemma A.3
First, observe that we have
From the representation of V in (27), we obtain
since Var(Z²) = 2, Var(Z) = 1, and Cov(Z²,Z) = 0. Thus,
We also have
Again, by utilizing the representation for V in (27), we obtain
so . Combining these results, we obtain
Also,
But, once again,
while . Thus, we have
Now, by using these intermediate results, we find that
and the result follows after simplification.
Proof of Theorem 5.7
From (25) and (28), and by the iterated variance rule, an approximate variance of σ̂²α/σ² is VE(Δ) + EV(Δ), where VE(Δ) = Var{E(σ̂²α/σ² | W)} and EV(Δ) = E{Var(σ̂²α/σ² | W)}. The final expression of VE(Δ) follows from Lemma A.2, the identities Var(W) = 2(n − 1), E(W) = n − 1, and E(W²) = (n − 1)(n + 1), and
Also, from Lemma A.3,
which simplifies to the expression of EV(Δ) in the statement of the theorem upon substituting the expressions E(W) = n − 1 and E(W²) = (n − 1)(n + 1). This completes the proof.
References
- Akaike H. Information theory and an extension of the maximum likelihood principle. Budapest: Akademiai Kiado; 1973. pp. 610–624.
- Arnold B, Villasenor J. Variance estimation with diffuse prior information. Stat. Prob. Letters. 1997;33:35–39.
- Breiman L. The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. J. Amer. Statist. Assoc. 1992;87:738–754.
- Brewster J, Zidek J. Improving on equivariant estimators. Ann. Statist. 1974;2:21–38.
- Buhlmann P. Efficient and adaptive post-model-selection estimators. J. Statist. Plann. Inference. 1999;79:1–9.
- Burnham K, Anderson D. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer; 1998.
- Claeskens G, Hjort N. The focussed information criterion (with discussion). J. Amer. Statist. Assoc. 2003;98:900–945.
- Dukić V, Peña E. Estimation after model selection in a Gaussian model. Technical report, Department of Health Studies, University of Chicago; 2003.
- Gelfand A, Dey D. Improved estimation of the disturbance variance in a linear regression model. J. Econometrics. 1977;39:387–395.
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
- Hjort N, Claeskens G. Frequentist model average estimators. J. Amer. Statist. Assoc. 2003;98:879–899.
- Hoeting J, Madigan D, Raftery A, Volinsky C. Bayesian model averaging. Statist. Sci. 1999;14:382–401.
- Judge G, Bock M, Yancey T. Post data model evaluation. The Review of Economics and Statistics. 1974;56:245–253.
- Leamer E. Specification Searches. New York: Wiley; 1978.
- Lehmann E, Casella G. Theory of Point Estimation. New York: Springer; 1998.
- Maatta J, Casella G. Developments in decision-theoretic variance estimation (with discussion). Statist. Sci. 1990;5:90–120.
- Madigan D, Raftery A. Model selection and accounting for model uncertainty in graphical models using Occam’s window. J. Amer. Statist. Assoc. 1994;89:1535–1546.
- Ohtani K. MSE dominance of the pre-test iterative variance estimator over the iterative variance estimator in regression. Stat. Prob. Letters. 2001;54:331–340.
- Pal N, Ling C, Lin J. Estimation of a normal variance – a critical review. Statist. Papers. 1998;39:389–404.
- Potscher B. Effects of model selection on inference. Econometric Theory. 1991;7:163–185.
- Raftery A, Madigan D, Hoeting J. Bayesian model averaging for linear regression models. J. Amer. Statist. Assoc. 1997;92:179–191.
- Robert C. The Bayesian Choice. 2nd ed. New York: Springer; 2001.
- Rukhin A. How much better are better estimators of a normal variance. J. Amer. Statist. Assoc. 1987;82:925–928.
- Schwarz G. Estimating the dimension of a model. Ann. Statist. 1978;6:461–464.
- Sclove S, Morris C, Radhakrishnan R. Nonoptimality of preliminary-test estimators for the mean of a multivariate normal distribution. Ann. Math. Statist. 1972;43:1481–1490.
- Sen P, Saleh A. On preliminary test and shrinkage M-estimation in linear models. Ann. Statist. 1987;15:1580–1592.
- Stein C. Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean. Ann. Inst. Statist. Math. 1964;16:155–160.
- Vidaković B, DasGupta A. Lower bounds on Bayes risks for estimating a normal variance: with applications. Canad. J. Statist. 1995;23:269–282.
- Wallace T. Pretest estimation in regression: a survey. Amer. J. Agr. Econ. 1977;59:431–443.
- Yancey T, Judge G, Mandy D. The sampling performance of pre-test estimators of the scale parameter under squared error loss. Economics Letters. 1983;12:181–186.
- Yang Y. Regression with multiple candidate models: selecting or mixing? Statistica Sinica. 2003;13:783–809.
- Zhang P. Inference after variable selection in linear regression models. Biometrika. 1992a;79:741–746.
- Zhang P. On the distributional properties of model selection criteria. J. Amer. Statist. Assoc. 1992b;87:732–737.