EXACT MINIMAX ESTIMATION OF THE PREDICTIVE DENSITY IN SPARSE GAUSSIAN MODELS

Gourab Mukherjee; Iain M Johnstone

doi:10.1214/14-AOS1251

Author manuscript; available in PMC: 2015 Oct 5.

Published in final edited form as: Ann Stat. 2015;43(3):937–961. doi: 10.1214/14-AOS1251

PMCID: PMC4593074 NIHMSID: NIHMS723163 PMID: 26448678

Abstract

We consider estimating the predictive density under Kullback–Leibler loss in an ℓ₀ sparse Gaussian sequence model. Explicit expressions of the first order minimax risk along with its exact constant, asymptotically least favorable priors and optimal predictive density estimates are derived. Compared to the sparse recovery results involving point estimation of the normal mean, new decision theoretic phenomena are seen. Suboptimal performance of the class of plug-in density estimates reflects the predictive nature of the problem and optimal strategies need diversification of the future risk. We find that minimax optimal strategies lie outside the Gaussian family but can be constructed with threshold predictive density estimates. Novel minimax techniques involving simultaneous calibration of the sparsity adjustment and the risk diversification mechanisms are used to design optimal predictive density estimates.

Key words and phrases: Predictive density, risk diversification, minimax, sparsity, high-dimensional, mutual information, plug-in risk, thresholding

1. Introduction

Statistical prediction analysis aims to use past data to choose a probability distribution that will be good in predicting the behavior of future samples. This well-established subject [Aitchison and Dunsmore (1975), Geisser (1993)] finds application in game theory, econometrics, information theory, machine learning, mathematical finance, etc.

In this paper we study predictive density estimation in a high-dimensional setting and, in particular, explore the consequences of sparsity assumptions on the unknown parameters.

1.1. Main results

We begin by describing some of our main results: fuller references, background and interpretation follow in Section 1.2.

We work in the simplest Gaussian model for high-dimensional prediction:

X \sim N_{n} (θ, v_{x} I), Y \sim N_{n} (θ, v_{y} I), X ⊥ ⊥ Y | θ .

(1)

On the basis of the “past” observation vector X, we seek to predict the distribution of a future observation Y. The past and future observations are independent, but are linked by the common mean parameter θ, assumed to be unknown. Note, however, that the variances, assumed here to be known, may differ. We write p(x|θ, v_x) and p(y|θ, v_y) for the probability densities of X and Y, respectively.

We seek estimators $\hat{p} (y | x)$ of the future observation density p(y|θ, v_y), and to compare their performance under sparsity assumptions on θ. We recall two natural ways of generating large classes of estimators. Perhaps simplest are the “plug-in” or estimative densities: given a point estimate $\hat{θ} (X)$ , simply set $\hat{p} (y | x) = p (y | \hat{θ})$ . We often use the abbreviation $p [\hat{θ}]$ . Second, given any prior measure π(dθ), proper or improper, such that the posterior π(dθ|x) is well defined, the Bayes predictive density is

{\hat{p}}_{π} (y | x) = \int p (y | θ, v_{y}) π (d θ | x) .

(2)

The important case of a uniform prior measure π(dθ) = dθ leads to predictive density ${\hat{p}}_{U} (y | x)$ , easily seen to correspond to N_n(x, (v_x + v_y)I).

We will examine similarities and differences between high-dimensional prediction and high-dimensional estimation. In particular, ${\hat{p}}_{U} (y | x)$ plays in prediction the role of the maximum likelihood estimator ${\hat{θ}}_{MLE} (x) = x$ in the multinormal mean estimation setting. In contrast to the corresponding plug-in estimate $p [{\hat{θ}}_{MLE}]$ , the density ${\hat{p}}_{U}$ incorporates the variability of the location estimate which leads to a flattening of the estimator: v_x + v_y>v_y.

To evaluate the performance of a predictive density estimator $\hat{p} (y | x)$ , we use the familiar Kullback–Leibler “distance” as loss function:

L (θ, \hat{p} (\cdot | x)) = \int p (y | θ, v_{y}) \log \frac{p (y | θ, v_{y})}{\hat{p} (y | x)} d y .

The corresponding K–L risk function follows by averaging over the distribution of the past observation:

ρ (θ, \hat{p}) = \int L (θ, \hat{p} (\cdot | x)) p (x | θ, v_{x}) d x .
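
As a hedged illustration of the loss and risk definitions, the sketch below uses the standard closed form for the Kullback–Leibler divergence between two univariate Gaussians, applied coordinate by coordinate. It compares the uniform-prior predictive density N_n(x, (v_x + v_y)I) with the plug-in density N_n(x, v_y I). The function names are illustrative and do not come from the paper; the closed-form risks (n/2) log(1 + v_x/v_y) and n v_x/(2 v_y) are obtained by averaging the Gaussian divergence over X, and the Monte Carlo routine is only a sanity check.

```python
import math
import random


def kl_gauss(theta, v_y, mean, var):
    """KL( N(theta, v_y) || N(mean, var) ) for a single coordinate."""
    return 0.5 * (math.log(var / v_y) + (v_y + (theta - mean) ** 2) / var - 1.0)


def risk_uniform_prior(n, v_x, v_y):
    """Closed-form K-L risk of N_n(x, (v_x + v_y) I); it does not depend on theta."""
    return 0.5 * n * math.log(1.0 + v_x / v_y)


def risk_plugin_mle(n, v_x, v_y):
    """Closed-form K-L risk of the plug-in density N_n(x, v_y I)."""
    return 0.5 * n * v_x / v_y


def mc_risk(theta, v_x, v_y, pred_var, reps=20000, seed=0):
    """Monte Carlo K-L risk of the estimator N_n(x, pred_var I) at the vector theta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        for t in theta:
            x = rng.gauss(t, math.sqrt(v_x))  # past observation for this coordinate
            total += kl_gauss(t, v_y, x, pred_var)
    return total / reps
```

For any v_x, v_y > 0 the first closed form is smaller than the second, since log(1 + u) < u for u > 0, which is the flattening advantage of the uniform-prior predictive density over the plug-in rule.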

Given a prior measure π(dθ), the average or integrated risk is

B (π, \hat{p}) = \int ρ (θ, \hat{p}) π (d θ) .

(3)

The Bayes predictive density (2) can be shown to minimize both the posterior expected loss $\int L (θ, \hat{p} (\cdot | x)) π (d θ | x)$ and the integrated risk $B (π, \hat{p})$ in the class of all density estimates. This is a general fact in statistical decision theory [Brown (1974)]. The resulting minimum is the Bayes K–L risk:

B (π) = inf_{\hat{p}} B (π, \hat{p}) .

(4)

Our main focus is on how to optimize the predictive risk $ρ (θ, \hat{p})$ in a high-dimensional setting under an ℓ₀ sparsity condition on the parameter space. Thus, let ‖θ‖₀ = #{i:θ_i ≠0} and

Θ_{n} [s] = {θ \in ℝ^{n} : ‖ θ ‖_{0} \leq s} .

(5)

This “exact” sparsity condition has been widely used in estimation; in this paper we initiate study of its implications for predictive density estimation.

The minimax K–L risk for estimation over Θ is given by

R_{N} (Θ) = inf_{\hat{p}} \sup_{θ \in Θ} ρ (θ, \hat{p}),

(6)

where the infimum is taken over all measurable predictive density estimators $\hat{p} (y | x)$ . For comparison, we write $R_{ℰ} (Θ) = \inf_{\hat{θ}} \sup_{Θ} ρ (θ, p [\hat{θ}])$ for the minimax risk restricted to the sub-class ℰ of plug-in or “estimative” densities.

To state our main results, henceforth we will assume v_x = 1 and introduce the key parameters

r = v_{y} / v_{x} = v_{y}, v_{w} = {(1 + r^{- 1})}^{- 1} .

(7)

Here v_w is the “oracle variance” which would be the variance of the UMVUE for θ, were both X and Y observed.

In our asymptotic model, the dimensionality n → ∞ and the sparsity s = s_n may depend on n, but the variance ratio r remains fixed. The notation a_n ~ b_n denotes a_n/b_n → 1 as n → ∞.

Theorem 1A

Fix r ∈ (0, ∞). If η_n = s_n/n → 0, then

R_{N} (Θ_{n} [s_{n}]) ~ \frac{1}{1 + r} s_{n} \log (n / s_{n}) = \frac{1}{1 + r} n η_{n} \log η_{n}^{- 1} .

(8)

The minimax risk is proportional to the sparsity s_n, with a logarithmic penalty factor. The case where s_n ≡ s remains constant is included. The expression is quite analogous to that obtained for point estimation with quadratic loss, namely, 2s_n log(n/s_n) [Donoho and Johnstone (1994), Donoho et al. (1992) and Johnstone (2013), Chapter 8.8, hereafter cited as Johnstone (2013)]. However, we shall see that quite different phenomena emerge in the predictive density setting.

Indeed, the future-to-past variance ratio r is an important parameter of the predictive estimation problem. The minimax risk increases as r decreases: we need to estimate the future observation density based on increasingly noisy past observations (in relative terms, r = v_y/v_x), and so the difficulty of the density estimation problem increases. However, the rate of convergence with n in (8) does not depend on r, and so exact determination of the constants is needed to show the role of r in this prediction problem.

The inefficiency of plug-in estimators is an immediate consequence of Theorem 1A. Let $q (θ, \hat{θ}) = E ‖ \hat{θ} (X) - θ ‖^{2}$ denote the risk of point estimator $\hat{θ}$ under squared-error loss. It is straightforward to show for a plug-in density estimate $p [\hat{θ}]$ that $ρ (θ, p [\hat{θ}]) = q (θ, \hat{θ}) / (2 r)$ . Hence, from the point estimation minimax risk just cited,

R_{ℰ} (Θ_{n} [s_{n}]) \sim \frac{1}{r} s_{n} \log (n / s_{n}) \sim (1 + \frac{1}{r}) R_{N} (Θ_{n} [s_{n}]) .

The inefficiency of plug-in estimators thus equals the oracle precision,

1 / v_{w} = 1 + 1 / r,

and becomes arbitrarily large as the variance ratio r → 0.
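
The first-order expressions above are simple enough to tabulate. The following minimal sketch evaluates the leading terms of the minimax risk in (8) and of the plug-in minimax risk, together with their ratio 1 + 1/r. Function names are illustrative, and the values are asymptotic leading terms only, not exact finite-n risks.

```python
import math


def minimax_risk_all(n, s, r):
    """Leading term of (8): s log(n/s) / (1 + r)."""
    return s * math.log(n / s) / (1.0 + r)


def minimax_risk_plugin(n, s, r):
    """Leading term of the plug-in minimax risk: s log(n/s) / r."""
    return s * math.log(n / s) / r


def plugin_inefficiency(r):
    """Asymptotic ratio of the two leading terms, 1 / v_w = 1 + 1/r."""
    return 1.0 + 1.0 / r


if __name__ == "__main__":
    n, s = 10**6, 100
    for r in (0.1, 1.0, 10.0):
        print(r, minimax_risk_all(n, s, r), minimax_risk_plugin(n, s, r),
              plugin_inefficiency(r))
```

Running the loop shows the ratio growing as r shrinks, in line with the remark that the inefficiency is unbounded as r → 0.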

We turn now to the asymptotically least favorable priors and optimal estimators in Theorem 1A. Let δ_λ denote unit point mass at λ and

π [η, λ] = (1 - η) δ_{0} + η δ_{λ}

(9)

be a univariate two-point prior: this is a sparse prior when η is small and λ large. Let

λ_{e} = \sqrt{2 \log [η_{n}^{- 1} (1 - η_{n})]}, λ_{f} = \sqrt{v_{w}} λ_{e} .

(10)

In point estimation based on X, we recall that λ_e is essentially the threshold of detectability corresponding to sparsity η_n = s_n/n. Although Y is not yet observed, we will see that in the prediction setting the UMVUE scaled threshold λ_f < λ_e plays a partly analogous role.
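
A small numerical sketch of definitions (7) and (10) may help fix the scales involved. It reads the logarithm in (10) as log((1 − η)/η), which is the reading consistent with the identity (1 − η)/η = exp(λ_e²/2) used in (37); the function names are illustrative.

```python
import math


def oracle_variance(r):
    """v_w = (1 + 1/r)^{-1} from (7), with v_x = 1."""
    return 1.0 / (1.0 + 1.0 / r)


def lambda_e(eta):
    """Estimation threshold from (10): sqrt(2 log((1 - eta) / eta))."""
    return math.sqrt(2.0 * math.log((1.0 - eta) / eta))


def lambda_f(eta, r):
    """Prediction threshold from (10): sqrt(v_w) * lambda_e."""
    return math.sqrt(oracle_variance(r)) * lambda_e(eta)
```

Because v_w < 1 for every finite r, lambda_f(eta, r) is always strictly below lambda_e(eta), and the gap between them widens as r decreases.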

Build a sparse high-dimensional prior from i.i.d. draws:

π_{n}^{IID} (d θ) = \prod_{i = 1}^{n} π [η_{n}, λ_{f}] (d θ_{i}) .

(11)

If the sparsity s_n increases without bound with n, then this i.i.d. prior with scale λ_f is asymptotically least favorable:

Theorem 1B

If s_n → ∞ and s_n/n → 0, then

B (π_{n}^{IID}) = R_{N} (Θ_{n} [s_{n}]) \cdot (1 + o (1)) .

The assumption that s_n → ∞ ensures that $π_{n}^{IID}$ concentrates on Θ_n[s_n], namely, that $π_{n}^{IID} (Θ_{n} [s_{n}]) \to 1$ as n → ∞. This hypothesis is not needed for Theorem 1A; indeed, a sparse prior built from “independent blocks” is asymptotically least favorable assuming only s_n/n → 0. This more elaborate prior is described in Section 5.

Some of the novel aspects of the predictive density estimation problem appear in the description of optimal estimators, that is, ones that asymptotically attain the minimax bound in Theorem 1A. In point estimation, the simplest asymptotically minimax rule for sparsity s_n is given by co-ordinatewise hard thresholding ${\hat{θ}}_{i} (x) = x_{i} I {| x_{i} | \geq λ_{e}}$ . For prediction, we consider the following class of univariate density estimators as analogs of hard thresholding:

{\hat{p}}_{T} (y_{1} | x_{1}) = \begin{cases} {\hat{p}}_{π} (y_{1} | x_{1}), & \text{if } | x_{1} | \leq λ_{e}, \\ {\hat{p}}_{U} (y_{1} | x_{1}), & \text{if } | x_{1} | > λ_{e} . \end{cases}

(12)

The univariate density estimates are combined to form a multivariate predictive density estimate via a product rule

{\hat{p}}_{T} (y | x) = \prod_{i = 1}^{n} {\hat{p}}_{T} (y_{i} | x_{i}) .

(13)

The threshold λ_e in (12) is that corresponding to estimation based on X at sparsity η_n = s_n/n. Above the threshold, the uniform prior predictive density ${\hat{p}}_{U}$ corresponds to the (unbiased) MLE. Below threshold, we shall need the flexibility of the Bayes predictive density (2). Indeed, as explained in Section 4, it does not suffice to use π = δ₀, point mass at 0, which would be the predictive analog of thresholding to zero in point estimation.

Instead, we use a sparse univariate cluster prior π = π_CL[η, r] given by

π = (1 - η) δ_{0} + \frac{η}{2 K} \sum_{k = 1}^{K} (δ_{μ_{k}} + δ_{- μ_{k}}) .

(14)

The points μ_k = μ_k(r) for k = 1,…, K are geometrically spaced to cover an interval [ν_η, λ_e + a] containing [λ_f, λ_e], as described in more detail below. The key point is that it is necessary to “diversify” the predictive risk by introducing prior support points to cover [–λ_e, –λ_f] ∪ [λ_f, λ_e].

More specifically, for a parameter a = a_η given below, let μ_η be the positive root of the overshoot equation

μ^{2} + 2 a μ = λ_{e}^{2},

(15)

that occurs in sparse minimax point estimation [e.g., Johnstone (2013), equation (8.48)], and then set $ν_{η} = \sqrt{v_{w}} μ_{η}$ : since μ_η < λ_e, we have ν_η < λ_f. The support points are

μ_{1} = ν_{η}, μ_{k + 1} = {(1 + 2 r)}^{k} ν_{η}, k \geq 1,

(16)

with K = max{k : μ_k ≤ λ_e + a}. We choose $a_{η} = \sqrt{2 \log λ_{f}}$ .
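
The construction in (14)–(16) can be sketched in a few lines. The code below solves the overshoot equation (15) for its positive root, rescales by √v_w to get ν_η, and then generates geometrically spaced points with ratio (1 + 2r) until they exceed λ_e + a. It assumes λ_f ≥ 1 so that a_η is real, and it reads λ_e as sqrt(2 log((1 − η)/η)); the function names are illustrative, not the paper's.

```python
import math


def cluster_support(eta, r):
    """Positive support points mu_1 < ... < mu_K of the cluster prior (16)."""
    v_w = r / (1.0 + r)
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    lam_f = math.sqrt(v_w) * lam_e
    if lam_f < 1.0:
        raise ValueError("a_eta = sqrt(2 log lambda_f) needs lambda_f >= 1")
    a = math.sqrt(2.0 * math.log(lam_f))
    # Positive root of mu^2 + 2 a mu = lambda_e^2, equation (15).
    mu_eta = -a + math.sqrt(a * a + lam_e * lam_e)
    nu_eta = math.sqrt(v_w) * mu_eta
    points = []
    mu = nu_eta
    while mu <= lam_e + a:          # K = max{k : mu_k <= lambda_e + a}
        points.append(mu)
        mu *= 1.0 + 2.0 * r         # geometric spacing with ratio (1 + 2r)
    return points, lam_f, lam_e, a


def cluster_prior(eta, r):
    """Atoms and weights of the symmetric cluster prior (14)."""
    points, _, _, _ = cluster_support(eta, r)
    k = len(points)
    atoms = [0.0] + points + [-p for p in points]
    weights = [1.0 - eta] + [eta / (2 * k)] * (2 * k)
    return atoms, weights
```

At finite η the number of points returned can differ from its limiting value, because ν_η and λ_e + a only approach λ_f and λ_e slowly.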

Theorem 1C

Assume η_n = s_n/n → 0. Let ${\hat{p}}_{T, CL} (y | x)$ be the product predictive threshold estimator defined by (12) and (13) using the cluster prior π_CL [η_n, r ]. Then ${\hat{p}}_{T, CL}$ is asymptotically minimax:

max_{Θ_{n} [s_{n}]} ρ (θ, {\hat{p}}_{T, CL}) = R_{N} (Θ_{n} [s_{n}]) (1 + o (1)) .

Note that the number of positive support points in the cluster prior K = K_η increases as r decreases. For any fixed η, the cluster prior contains in total (2K_η + 1) support points. Also, for any fixed r ∈ (0, ∞) as η → 0, we have

K (r) = \lim_{η \to 0} K_{η} = \lceil \frac{\log (1 + r^{- 1})}{2 \log (1 + 2 r)} \rceil .

Thus, K(r) is a piecewise constant, right continuous function with jumps as shown in Table 1.

Table 1.

Number K(r) of positive support points in the cluster prior π_CL[η, r] as r varies

r	0.1073	0.1235	0.1465	0.1826	0.2485	0.4196	>0.4196
K(r)	7	6	5	4	3	2	1
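
The limiting count K(r) is easy to evaluate. The sketch below reads the bracket in the displayed formula as a ceiling, an assumption made here because it is the reading that gives K(r) = 1 for large r and a right continuous function, as the text and Table 1 require. The listed r values in Table 1 then appear to be, approximately, the points at which the ratio inside the bracket crosses an integer. The function name is illustrative.

```python
import math


def limiting_support_count(r):
    """K(r) = ceil( log(1 + 1/r) / (2 log(1 + 2r)) ), under the ceiling reading."""
    ratio = math.log(1.0 + 1.0 / r) / (2.0 * math.log(1.0 + 2.0 * r))
    return math.ceil(ratio)


if __name__ == "__main__":
    # Values of r chosen strictly between the breakpoints listed in Table 1.
    for r in (0.105, 0.115, 0.135, 0.16, 0.2, 0.3, 0.5):
        print(r, limiting_support_count(r))
```

Evaluating strictly between the tabulated breakpoints avoids the rounding ambiguity at the breakpoints themselves, which are reported to only four decimals.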


The results presented above assume v_x = 1. They extend easily to the general case, since the minimax risk remains invariant when the scale of the past observations and of the parameter is divided by $\sqrt{v_{x}}$ .

1.2. Background and previous work

The relative entropy predictive risk $ρ (θ, \hat{p})$ measures the exponential rate of divergence of the joint likelihood ratio over a large number of independent trials [Larimore (1983)]. The minimal predictive risk estimate maximizes the expected growth rate in repeated investment scenarios [Cover and Thomas (1991), Chapters 6, 15]. In data compression, $L (θ, \hat{p} (\cdot | x))$ reflects the excess average code length that we need if we use the conditional density estimate $\hat{p}$ instead of the true density to construct a uniquely decodable code for the data Y given the past x [McMillan (1956)]. Following Bell and Cover (1980), ℓ₀-constrained minimax optimal predictive density estimates in our model can be used for construction of optimal predictive schemes for gambling, sports betting, portfolio selection and sparse coding [Mukherjee (2013), Chapter 1.3].

Aitchison (1975), Murray (1977) and Ng (1980) showed that in most parametric models there exist Bayes predictive density estimates which are decision theoretically better than the maximum likelihood plug-in estimate. An important issue in predictive inference has always been to compare the performance of the class ℰ of point estimation (PE) based plug-in density estimates [Barndorff-Nielsen and Cox (1996)] with that of the optimal predictive density estimate. In parameter spaces of fixed dimension, large sample attributes of the predictive risk of efficient plug-in and Bayes density estimates have been studied by Komaki (1996), Hartigan (1998) and Aslan (2006).

The high-dimensional predictive density estimation problem studied in this paper is relevant to a number of contemporary applications, including data compression, sequential investment with side information and sports betting (SM).

Analogy with point estimation

Decision theoretic parallels between predictive density estimation under Kullback–Leibler loss and point estimation under quadratic loss have been explored in our Gaussian model by George, Liang and Xu (2006), Ghosh, Mergel and Datta (2008), Komaki (2004), Xu and Zhou (2011) and George, Liang and Xu (2012). For unconstrained parameter spaces $Θ = ℝ^{n}$ , fundamental ideas in Gaussian point estimation theory can be extended to yield optimal predictive density estimates [Brown, George and Xu (2008), Fourdrinier et al. (2011), Komaki (2001)]. For ellipsoids, Xu and Liang (2010) established an analog of the theorem of Pinsker (1980) by proving that the class of all linear predictive density estimates [see (17)] is minimax optimal.

For sparse estimation, instead of parallels, we found contrasts. Minimax risks in the predictive density problem depend on r, but this dependence is not emphasized in the admissibility results in unrestricted spaces. As we have seen, under sparsity, construction of optimal minimax estimators requires the notion of diversification of the future risk over the interval [λ_f, λ_e] in a way strongly dependent on r. Thus, efficiency of the prediction schemes depends on careful calibration of the sparsity adjustment and the risk diversification mechanisms.

1.3. Further results

Other classes of estimators

The class of linear estimates ℒ are Bayes rules based on conjugate product normal priors. The resulting estimators

{\hat{p}}_{L, α} = \prod_{i = 1}^{n} N (α_{i} X_{i}, α_{i} + r), α_{i} \in [0, 1],

(17)

are still Gaussian but have larger variance than the future density $p (y | θ, r) = ϕ (y | θ, r)$ . We choose the name “linear” because the conjugate prior implies linearity of the posterior mean in X.

The class $G$ contains all product Gaussian density estimates $p [\hat{θ}, \hat{d}] = \prod_{i = 1}^{n} N ({\hat{θ}}_{i}, {\hat{d}}_{i})$ . Clearly, $G$ contains both ℒ and ℰ, the latter introduced after (6). The minimax risks R_ℒ(Θ) and $R_{G} (Θ)$ are defined by restricting the infimum in (6) to ℒ and $G$ , respectively.

We have seen after Theorem 1A that $R_{ℰ} (Θ_{n} [s_{n}]) \sim (1 + r^{- 1}) R_{N} (Θ_{n} [s_{n}])$ . It turns out that extending ℰ to $G$ does not help, while, as is typical for sparse estimation, the class of linear estimators ℒ performs very poorly.

Proposition 1

Fix r ∈ (0, ∞). If s_n/n → 0, then

\begin{array}{l} R_{ℒ} (Θ_{n} [s_{n}]) = (n / 2) \log (1 + r^{- 1}), \\ R_{ℒ} (Θ_{n} [s_{n}]) / R_{N} (Θ_{n} [s_{n}]) \to \infty \quad \text{and} \\ R_{G} (Θ_{n} [s_{n}]) \sim R_{ℰ} (Θ_{n} [s_{n}]) . \end{array}

Univariate prediction problem

The product structure of our high-dimensional model (1), estimators (13) and priors (11), along with concentration of measure, implies that many aspects of our multivariate results can be understood and proved through an associated univariate prediction problem.

In the univariate setting, assume that the past observation X|θ ~ N(θ, 1) and the future observation Y|θ ~ N(θ, r). Assume that X and Y are independent given θ. In addition, suppose that θ is random with distribution π(dθ), assumed to belong to

m (η) = {π \in P (ℝ) : π (θ \neq 0) \leq η},

(18)

where $P (ℝ)$ is the collection of all probability measures in ℝ.

A predictive density estimator $\hat{p} (y | x)$ is evaluated through its integrated risk $B (π, \hat{p})$ defined at (3). The minimax risk for this univariate prediction problem is given by

β (η, r) : = inf_{\hat{p}} \sup_{π \in m (η)} B (π, \hat{p}),

(19)

and we study sparsity through the asymptotic regime η → 0. Recall definition (10) of the scaled threshold λ_f = λ_{f,η}.

Theorem 2

Fix r ∈ (0, ∞). As η → 0,
(a) $β (η, r) = \frac{1}{2 r} η λ_{f}^{2} (1 + o (1)) .$ (20)
(b) An asymptotically least favorable prior is given by the two-point distribution π[η, λ_f(η)] of (9).
(c) An asymptotically minimax estimator is given by the thresholding construction (12) combined with sparse univariate cluster prior π = π_CL [η, r] defined at (14).
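
As a consistency check between the univariate and multivariate statements, the sketch below evaluates the leading term of (20) and compares n·β(η_n, r) with the leading term of (8). It reads λ_e² as 2 log((1 − η)/η), per (10) and (37); the function names are illustrative, and only leading terms are compared.

```python
import math


def beta_first_order(eta, r):
    """Leading term of (20): eta * lambda_f^2 / (2r)."""
    v_w = r / (1.0 + r)
    lam_e_sq = 2.0 * math.log((1.0 - eta) / eta)
    lam_f_sq = v_w * lam_e_sq
    return eta * lam_f_sq / (2.0 * r)


def multivariate_first_order(n, s, r):
    """Leading term of (8): s log(n/s) / (1 + r)."""
    return s * math.log(n / s) / (1.0 + r)


def ratio_univariate_to_multivariate(n, s, r):
    """n * beta(s/n, r) divided by the leading term of (8); tends to 1 as s/n -> 0."""
    return n * beta_first_order(s / n, r) / multivariate_first_order(n, s, r)
```

The ratio equals log((1 − η)/η) / log(1/η) whatever the value of r, so the dependence on r enters the two theorems in exactly the same way.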

1.4. Organization of the paper

The main results of the paper are multivariate, Theorems 1A, 1B and 1C. However, the main technical issues in the proofs are best handled in the univariate setting of Theorem 2, whose parts a, b and c correspond to Theorems 1A, 1B and 1C, respectively. Section 2 has an overview: it first reviews some connections between the multivariate and univariate settings, then gives heuristic derivations for the lower and upper bounds of univariate Theorem 2. Section 3 and Section 4, respectively, contain the technical proofs for the lower and upper bound on the univariate minimax risk, Theorems 2B and 2C, respectively. Together, they complete the proof of Theorem 2. Proofs of the multivariate results in Theorems 1A, 1B and 1C are completed in Section 5. This section also contains a heuristic proof of Proposition 1 whose rigorous proof is presented in the supplementary material [Mukherjee and Johnstone (2015)].

Glossary

[The notation (6)+2 refers to text 2 lines after equation (6)].

Estimators: Bayes ${\hat{p}}_{π}$ (2), Uniform prior ${\hat{p}}_{U}$ (2)+2, Threshold ${\hat{p}}_{T}$ (12), Multivariate product $\hat{p} (y | x)$ (13); Univariate $\hat{p} (y_{1} | x_{1})$ .

Classes of estimators and multivariate minimax risks: all nonlinear N, R_N (6), estimative ℰ, R_ℰ (6)+2, “linear” ℒ, R_ℒ (17), Gaussian $G$ , $R_{G}$ (17)+4.

Univariate minimax risk: β (19).

Parameter spaces: multivariate Θ_n[s] (5); univariate m(η) (18).

Priors: Univariate: two point π[η,λ] (9), cluster π_CL[η,r] (14), Multivariate: $π_{n}^{IID}$ (11).

Parameters: variance ratio r = v_y/v_x, oracle variance v_w (7), sparsity η (9), thresholds λ_e, λ_f (10), cluster prior: overshoot a (15), ν_η (15)+2.

2. Proof overview and interpretation

2.1. Connections between multivariate and univariate settings

Many aspects of the multivariate theorem may be understood, and in part proved, through a discussion of the univariate prediction problem of Theorem 2. An obvious connection between the univariate and multivariate approaches runs as follows: suppose that a multivariate predictive estimator is built as a product of univariate components

\hat{p} (y | x) = \prod_{i = 1}^{n} {\hat{p}}_{1} (y_{i} | x_{i}) .

(21)

Suppose also that to a vector θ = (θ_i) we associate a univariate (discrete) distribution $π_{n}^{e} = n^{- 1} \sum_{i = 1}^{n} δ_{θ_{i}}$ . Since the true multivariate future density p(Y|θ, r) is also a product of univariate components, it is then readily seen that the multivariate and univariate Bayes K–L risks are related by

ρ (θ, \hat{p}) = \sum_{i = 1}^{n} ρ (θ_{i}, {\hat{p}}_{1}) = n B (π_{n}^{e}, {\hat{p}}_{1}) .

(22)

The sparsity condition Θ_n[s_n] in the multivariate problem corresponds to requiring that the prior $π = π_{n}^{e}$ in the univariate problem satisfies

π {θ_{1} \neq 0} \leq s_{n} / n = η_{n},

and thus belongs to the class m(η) defined in (18). Next, we outline the minimax risk calculations for the sparse predictive density estimation problem.

As a first illustration, to which we return later, consider the maximum risk of a product rule over Θ_n[s_n]: using (22) and (3), we have

sup_{Θ_{n} [s_{n}]} ρ (θ, \hat{p}) = n [(1 - η_{n}) ρ (0, {\hat{p}}_{1}) + η_{n} sup_{θ \in ℝ} ρ (θ, {\hat{p}}_{1})] .

(23)

In the univariate problem, using ${\hat{p}}_{1}$ , we have the somewhat parallel bound

\sup_{m (η)} B (π, {\hat{p}}_{1}) = (1 - η) ρ (0, {\hat{p}}_{1}) + η \sup_{θ \in ℝ} ρ (θ, {\hat{p}}_{1}) .

(24)

Consequently, a careful study of the two univariate quantities

\begin{array}{l} risk at zero : & ρ (0, {\hat{p}}_{1}), \\ maximum risk : & \sup_{θ} ρ (θ, {\hat{p}}_{1}) \end{array}

(25)

is basic for upper bounds for both univariate and multivariate cases.

2.2. Theorem 2B: Univariate lower bound heuristics

To understand the appearance of $λ_{f}^{2}$ in the minimax risks, we turn to a heuristic discussion of the lower bound, first in the univariate case.

We use the two point priors (9) and the definition (19):

β (η, r) \geq B (π [η, λ]) = (1 - η) ρ (0, {\hat{p}}_{π}) + η ρ (λ, {\hat{p}}_{π}) \geq η ρ (λ, {\hat{p}}_{π}),

(26)

and look for a good bound for $ρ (λ, {\hat{p}}_{π})$ for a suitable choice of λ.

The key is a mixture representation for predictive risk of a Bayes estimator in terms of quadratic risk, where the weighted mixture is over noise levels v ∈ [v_w,1], with v_w being the oracle variance, (7). Brown, George and Xu (2008), Theorem 1, show that the predictive risk of the Bayes predictive density estimate ${\hat{p}}_{π}$ is

ρ (θ, {\hat{p}}_{π}) = \frac{1}{2} \int_{v_{w}}^{1} q (θ, {\hat{θ}}_{π, v}; v) \frac{d v}{v^{2}},

(27)

where $q (θ, {\hat{θ}}_{π, v}; v) = E_{θ} {[{\hat{θ}}_{π, v} (W) - θ]}^{2}$ is the quadratic risk of the Bayes location estimate ${\hat{θ}}_{π, v}$ for prior π when W ~ N(θ, v). In point estimation with quadratic loss, it is known [Johnstone (2013), Chapter 8], that as η → 0 an approximately least favorable prior in the class m(η) is given, for noise level v = 1, by the sparse two-point prior π[η, λ_e(η)] defined in (9) and $λ_{e} (η) = \sqrt{2 \log η^{- 1} (1 - η)}$ . This prior has the remarkable property that points θ ≤ λ_e are “invisible” in the sense that even when θ is true, the Bayes estimator ${\hat{θ}}_{π} = {\hat{θ}}_{π, 1}$ effectively estimates 0 rather than θ and so makes a mean squared error

q (θ, {\hat{θ}}_{π}; 1) ~ θ^{2} for 0 \leq θ \leq λ_{e} .

(28)

Two issues arise as the noise level v varies. First, the region of invisibility will scale, becoming $0 \leq θ \leq \sqrt{v} λ_{e}$ at scale v. As v varies in [v_w, 1], the intersection of all regions of invisibility will be $0 \leq θ \leq \sqrt{v_{w}} λ_{e} = λ_{f}$ as defined at (10). The second issue is that for a given prior π and predictive Bayes rule ${\hat{p}}_{π}$ in (27), the Bayes rules ${\hat{θ}}_{π, v}$ vary with v. We return to this second point in the next section; for now we can hope that for all v∈[v_w,1],

q (λ_{f}, {\hat{θ}}_{π, v}; v) \underset{\sim}{>} λ_{f}^{2},

(29)

and so, from mixture representation (27),

ρ (λ_{f}, {\hat{p}}_{π}) \underset{\sim}{>} \frac{λ_{f}^{2}}{2} \int_{v_{w}}^{1} \frac{d v}{v^{2}} = \frac{λ_{f}^{2}}{2 r},

since the integral evaluates to $v_{w}^{- 1} - 1 = r^{- 1}$ . From this we can conjecture that for π = π[η, λ_f] ∈ m(η),

B (π) > η ρ (λ_{f}, {\hat{p}}_{π}) \underset{\sim}{>} η \frac{λ_{f}^{2}}{2 r} .

(30)

A full proof, with slightly modified definitions, is given in Section 3.
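
The step from (29) to (30) rests on the value of the mixing integral ∫ dv/v² over [v_w, 1], which equals v_w⁻¹ − 1 = r⁻¹. A minimal numerical check, using a plain midpoint rule and illustrative function names, is sketched below; it also assembles the heuristic lower bound η λ_f²/(2r), reading λ_e² as 2 log((1 − η)/η).

```python
import math


def mixture_weight(r, steps=100000):
    """Midpoint-rule value of the integral of v^{-2} over [v_w, 1], v_w = r/(1+r)."""
    v_w = r / (1.0 + r)
    h = (1.0 - v_w) / steps
    return sum(h / (v_w + (i + 0.5) * h) ** 2 for i in range(steps))


def heuristic_lower_bound(eta, r):
    """eta * (lambda_f^2 / 2) * integral, the right-hand side suggested by (30)."""
    lam_f_sq = (r / (1.0 + r)) * 2.0 * math.log((1.0 - eta) / eta)
    return eta * 0.5 * lam_f_sq * mixture_weight(r)
```

Since the integral is 1/r, the heuristic bound coincides with η λ_f²/(2r) up to quadrature error, which is the leading term appearing in Theorem 2.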

2.3. Theorem 2C: Univariate upper bound heuristics

We now turn to a heuristic discussion of constructing a density estimate to show that the lower bound (30) is asymptotically correct. Pursuing the analogy with point estimation, we know that in that setting optimal estimators can be found within the family of hard thresholding rules $\hat{θ} (x) = x I {| x | > λ}$ . The natural analog for predictive density estimation would have the form

{\hat{p}}_{T, π_{0}} [λ] (y | x) = \begin{cases} {\hat{p}}_{U} (y | x), & | x | > λ, \\ {\hat{p}}_{π_{0}} (y | x), & | x | \leq λ . \end{cases}

(31)

To see this, note that ${\hat{p}}_{U}$ is the predictive Bayes rule corresponding to the uniform prior π(dθ) = dθ, which leads to the MLE $\hat{θ} (x) = x$ in point estimation, while ${\hat{p}}_{π_{0}} (y | x)$ denotes the predictive Bayes rule corresponding to a prior concentrated entirely at 0, so that

{\hat{p}}_{π_{0}} (y | x) = ϕ (y | 0, r)

(32)

is a normal density with mean zero and variance r.

For the upper bound, according to definition (19), we seek an estimator ${\hat{p}}_{1}$ for which ${sup}_{m (η)} B (π, {\hat{p}}_{1}) \sim η λ_{f}^{2} / (2 r)$ as η → 0. In bound (24), the first component is the risk at zero, $ρ (0, {\hat{p}}_{1})$ , and it turns out that this determines the possible values of the threshold λ in (31). Thus, in order that

ρ (0, {\hat{p}}_{T, π_{0}} [λ]) = o (η λ_{f}^{2}),

it follows [see (51)] that the threshold λ should be chosen as $λ = λ_{e} \sim {(2 \log η^{- 1})}^{1 / 2}$ and not smaller.

Turning to the second part of (25), we seek an estimator ${\hat{p}}_{1}$ with

\sup_{θ} ρ (θ, {\hat{p}}_{1}) = \frac{λ_{f}^{2}}{2 r} \cdot (1 + o (1)) .

(33)

We first argue that the hard thresholding analog (31) cannot work. Decompose the predictive risk of a univariate threshold estimator ${\hat{p}}_{T}$ with threshold λ_e into contributions due to X above and below the threshold

\begin{array}{l} ρ (θ, {\hat{p}}_{T}) = E_{θ} L (θ, \hat{p} (\cdot | X)) \\ = E_{θ} [L (θ, {\hat{p}}_{U} (\cdot | X)), | X | > λ_{e}] + E_{θ} [L (θ, {\hat{p}}_{π} (\cdot | X)), | X | \leq λ_{e}] \\ = ρ_{A} (θ) + ρ_{B} (θ), \end{array}

(34)

say. With the “zero prior,” the K–L loss is just quadratic in θ,

L (θ, {\hat{p}}_{π_{0}} (Y | X)) = E_{θ} \log \frac{ϕ (Y | θ, r)}{ϕ (Y | 0, r)} = \frac{θ^{2}}{2 r},

and so, in particular, for θ ≤ λ_e we see that

ρ (θ, {\hat{p}}_{T, π_{0}}) \geq ρ_{B} (θ) \underset{\sim}{>} \frac{θ^{2}}{2 r} P_{θ} [| X | \leq λ_{e}]

(35)

could be as large as $λ_{e}^{2} / (2 r)$ , and hence larger than our target risk $λ_{f}^{2} / (2 r)$ .
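
The failure of the zero-prior threshold rule can be seen numerically. With π₀ the loss θ²/(2r) does not depend on x, so the below-threshold contribution ρ_B(θ) is θ²/(2r) times P_θ(|X| ≤ λ_e) with X ~ N(θ, 1). The sketch below evaluates this quantity and the target λ_f²/(2r); function names are illustrative, and λ_e is read as sqrt(2 log((1 − η)/η)).

```python
import math


def norm_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rho_b_zero_prior(theta, eta, r):
    """theta^2 / (2r) * P_theta(|X| <= lambda_e) for X ~ N(theta, 1)."""
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    prob_below = norm_cdf(lam_e - theta) - norm_cdf(-lam_e - theta)
    return theta ** 2 / (2.0 * r) * prob_below


def target_risk(eta, r):
    """Target level lambda_f^2 / (2r) with lambda_f^2 = v_w * lambda_e^2."""
    lam_e_sq = 2.0 * math.log((1.0 - eta) / eta)
    return (r / (1.0 + r)) * lam_e_sq / (2.0 * r)


def worst_case_over_grid(eta, r, grid_size=2000):
    """Maximum of rho_B over an equally spaced grid of theta in [0, lambda_e]."""
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    return max(rho_b_zero_prior(lam_e * i / grid_size, eta, r)
               for i in range(grid_size + 1))
```

For example, comparing worst_case_over_grid(1e-6, 1.0) with target_risk(1e-6, 1.0) shows the zero-prior rule overshooting the target for some θ between λ_f and λ_e.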

Bearing in mind the role that two-point priors play in the lower bound, it is perhaps natural to ask next if the threshold rule ${\hat{p}}_{T, LF}$ with π₀ in (31) replaced by the (symmetrized) two-point prior π[η, λ_f] could cut off the growth of the quadratic θ²/(2r) for |θ| ≥ λ_f. The 3-point prior π₃[η, λ_f] ∈ m(η) places probability η/2 at the two nonzero atoms at ±λ_f. Remarks in Section 3 show that π₃[η, λ_f] is also asymptotically least favorable for the univariate prediction problem as η → 0. Indeed, it can be shown (see Section 4) that for this prior and for λ_f ≤|θ|≤ λ_e,

\begin{array}{l} ρ (θ, {\hat{p}}_{T, LF}) \sim ρ_{B} (θ) \\ \leq \frac{1}{2 r} \{ λ_{f}^{2} - (| θ | - λ_{f}) [(1 + 2 r) λ_{f} - | θ |] \} + o (λ_{f}^{2}) . \end{array}

(36)

Consequently, the risk bound dips below $λ_{f}^{2} / (2 r)$ for λ_f ≤ |θ| ≤ (1 + 2r)λ_f but increases thereafter. So, ${\hat{p}}_{T, LF}$ is minimax optimal if λ_e < (1 + 2r)λ_f, which occurs if r is sufficiently large, r > 0.4196 in Table 1. However, the upper bound exceeds our target risk $λ_{f}^{2} / (2 r)$ if r ≤ 0.4196. Section S.2 of the supplementary material [Mukherjee and Johnstone (2015)] shows rigorously that ${\hat{p}}_{T, LF}$ is indeed minimax suboptimal for low values of r.
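
The critical value of r can be recovered from the condition λ_e < (1 + 2r)λ_f. Since λ_f = √v_w λ_e with v_w = r/(1 + r), the condition is equivalent to (1 + 2r)·sqrt(r/(1 + r)) > 1. The sketch below finds the crossing point by bisection and also evaluates the leading term of the bound (36), omitting its o(λ_f²) remainder. Function names and bisection limits are illustrative choices.

```python
import math


def diversification_gap(r):
    """(1 + 2r) sqrt(v_w) - 1; positive exactly when lambda_e < (1 + 2r) lambda_f."""
    return (1.0 + 2.0 * r) * math.sqrt(r / (1.0 + r)) - 1.0


def critical_ratio(lo=1e-3, hi=10.0, tol=1e-10):
    """Bisection for the root of diversification_gap, which is increasing in r."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diversification_gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def bound_36_leading_term(theta, lam_f, r):
    """Leading term of (36), without the o(lambda_f^2) remainder."""
    t = abs(theta)
    return (lam_f ** 2 - (t - lam_f) * ((1.0 + 2.0 * r) * lam_f - t)) / (2.0 * r)
```

The root returned by critical_ratio() agrees with the value 0.4196 quoted from Table 1 to the reported precision, and the leading term of (36) returns to λ_f²/(2r) exactly at |θ| = (1 + 2r)λ_f.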

As π₃[η,λ_f] fails to produce minimax optimal density estimates, the strategy then is to introduce extra support points |μ_k| ≤ λ_e into the prior chosen to “pull down” the risk $ρ_{B} (θ) = E_{θ} [L (θ, {\hat{p}}_{π} (\cdot | X)), | X | \leq λ_{e}]$ below $λ_{f}^{2} / (2 r)$ whenever it would otherwise exceed this level. The schematic diagram in Figure 1 illustrates this bounding of the maximum risk. The extra support points added in [λ_f, λ_e] and [−λ_e, −λ_f] distribute the predictive risk across that range—“risk diversification”—and keep the maximum risk below $λ_{f}^{2} / (2 r) (1 + o (1))$ .

Fig. 1 — Schematic diagram of the risk of univariate threshold density estimates for θ ≥ 0. The dotted line is the risk of density estimator ${\hat{p}}_{T, LF}$ based on the 3-point prior π₃[η, λ_f]. The addition of appropriately spaced prior mass points (shown in red) up to λ_e pulls down the risk function of the cluster prior-based density estimate ${\hat{p}}_{T, CL}$ below $λ_{f}^{2} / (2 r)$ until the effect of thresholding at λ_e takes over.

To prove that this works, we obtain upper bounds on ρ_B (θ) for ${\hat{p}}_{T, CL}$ by focusing, when $θ \in [μ_{k}, μ_{k + 1}]$ , only on the prior support point μ_k. The main inequality is obtained in (50), namely,

ρ_{B} (θ) \leq \frac{1}{2 r} [λ_{f}^{2} + \min_{k} q_{k} (θ)] + o (λ_{f}^{2}),

where q_k(θ) is a quadratic polynomial that is O(λ_f) on [μ_k, μ_{k+1}]. Putting together this and other bounds, we can then finally establish the uniform bound (33). The details are in Section 4.

3. Theorem 2B: Univariate lower bound proof

This section is devoted to a proof of the lower bound part of Theorem 2. The heuristic discussion of the last section indicated the importance of two-point sparse priors and the invisibility property (28). To formulate a precise statement about the upper limit of invisibility, we start with noise level 1 and bring in the positive solution μ_η of the overshoot equation (15), namely, $μ^{2} + 2 a μ = λ_{e}^{2}$ . Here the “overshoot” parameter a = a_η should satisfy both $a_{η} \to \infty$ and a_η = o(μ_η); we make the specific choice $a_{η} = \sqrt{2 \log λ_{f, η}}$ .

In preparation for the range of variance scales in mixture representation (27), we consider the collection of two-point priors π[η, μ] for 0≤μ≤μ_η. Using a temporary notation for this section, let ${\hat{θ}}_{μ} (x) = E [θ | x]$ be the Bayes rule for squared error loss for the prior π[η, μ]. The next result shows that when the true parameter is actually μ, and this nonzero support point $μ \leq μ_{η}$ , then the Bayes rule for π[η,μ] “gets it wrong” by effectively estimating 0 and making an error of size μ², uniformly in $μ \leq μ_{η}$ .

Lemma 3

There exists ε_η ↘ 0 as η → 0 such that for all μ in [0,μ_η],

q (μ, {\hat{θ}}_{μ}; 1) \geq μ^{2} [1 - ε_{η}] .

Proof

Using standard calculations for the two-point prior, the Bayes rule ${\hat{θ}}_{μ} = μ p (μ | x) = μ / [1 + m (x)]$ , with

m (x) = \frac{p (0 | x)}{p (μ | x)} = \frac{1 - η}{η} \frac{ϕ (x)}{ϕ (x - μ)} = \exp \{ \frac{1}{2} λ_{e}^{2} - x μ + \frac{1}{2} μ^{2} \} .

(37)

Consequently,

\begin{array}{l} q (μ, {\hat{θ}}_{μ}; 1) = E_{μ} {[{\hat{θ}}_{μ} - μ]}^{2} = μ^{2} E_{μ} {[{(1 + m (X))}^{- 1} - 1]}^{2} \\ = μ^{2} E_{0} {[1 + m^{- 1} (μ + Z)]}^{- 2}, \end{array}

where Z ~ N(0, 1), and from (37), $m^{- 1} (μ + z) = \exp \{\frac{1}{2} (μ^{2} + 2 μ z - λ_{e}^{2})\}$ .

Now, using definition (15) of μ_η, for 0 ≤ μ ≤ μ_η, we have

μ^{2} + 2 μ z - λ_{e}^{2} \leq μ_{η}^{2} + 2 μ_{η} z_{+} - λ_{e}^{2} = - 2 μ_{η} (a - z_{+}),

so that for 0 ≤ μ ≤μ_η,

μ^{- 2} q (μ, {\hat{θ}}_{μ}; 1) \geq E_{0} \{{[1 + \exp (- μ_{η} (a - Z_{+}))]}^{- 2}, Z < a\} = 1 - ε_{η},

say. For each fixed z, we have $μ_{η} (a - z_{+}) \to \infty$ since $a \to \infty$ and $μ_{η} \to \infty$ , and so from the dominated convergence theorem we conclude that $ε_{η} \to 0$ . □
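The inequality in Lemma 3 can be checked numerically. The sketch below is an illustration only, not the authors' code: it evaluates the exact ratio q(μ, θ̂_μ; 1)/μ² and the lower bound 1 − ε_η by a trapezoid rule. It assumes v_w = r/(1 + r), which is consistent with the values λ_f = 2.83 and λ_e = 6.32 quoted for r = 0.25 in the caption of Figure 3; the function names and the integration grid are hypothetical choices.

```python
import math

def trapezoid(f, lo, hi, n=4000):
    """Composite trapezoid rule for f on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def overshoot_solution(eta, r):
    """Return (lam_e, a, mu_eta) with mu_eta^2 + 2*a*mu_eta = lam_e^2."""
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    lam_f = math.sqrt(r / (1.0 + r)) * lam_e   # assumption: v_w = r / (1 + r)
    a = math.sqrt(2.0 * math.log(lam_f))       # needs lam_f > 1
    mu_eta = -a + math.sqrt(a * a + lam_e * lam_e)
    return lam_e, a, mu_eta

def risk_ratio(mu, lam_e):
    """q(mu, theta_hat_mu; 1) / mu^2 = E[1 + 1/m(mu + Z)]^(-2), Z ~ N(0, 1)."""
    def integrand(z):
        t = 0.5 * (mu * mu + 2.0 * mu * z - lam_e * lam_e)  # log of 1/m(mu + z)
        return phi(z) / (1.0 + math.exp(t)) ** 2
    return trapezoid(integrand, -10.0, 10.0)

def lower_bound(a, mu_eta):
    """E{[1 + exp(-mu_eta (a - Z_+))]^(-2), Z < a}, i.e. 1 - eps_eta."""
    def integrand(z):
        if z >= a:
            return 0.0
        return phi(z) / (1.0 + math.exp(-mu_eta * (a - max(z, 0.0)))) ** 2
    return trapezoid(integrand, -10.0, 10.0)

if __name__ == "__main__":
    lam_e, a, mu_eta = overshoot_solution(math.exp(-20.0), 0.25)
    print("1 - eps_eta:", lower_bound(a, mu_eta))
    for frac in (0.25, 0.5, 0.75, 1.0):
        print(frac, risk_ratio(frac * mu_eta, lam_e))
```

At moderate sparsity the bound 1 − ε_η is visibly below 1, which reflects how slowly the overshoot a_η grows.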

With these preparations, we return to the lower bound in the prediction problem. As η → 0, an asymptotically least favorable distribution is given by a sparse two-point prior with the nonzero support point scaled using the oracle standard deviation $v_{w}^{1 / 2}$ . We shall prove the following:

Lemma 4

Let μ_η be the positive solution to overshoot equation (15) with $a_{η} = \sqrt{2 \log λ_{f, η}}$ . Set $ν_{η} = v_{w}^{1 / 2} μ_{η}$ and consider the two-point prior π[η, ν_η]. Then as η → 0,

β (η, r) \geq B (π [η, ν_{η}]) \geq \frac{η λ_{f}^{2}}{2 r} (1 + o (1)) .

We note here that since a_η = o(μ_η), the overshoot equation implies that

μ_{η} ~ λ_{e, η} and ν_{η} ~ λ_{f, η} .

(38)

A stronger conclusion, used in the next section, also follows from the overshoot equation, namely,

λ_{f, η}^{2} - ν_{η}^{2} = v_{w} (λ_{e, η}^{2} - μ_{η}^{2}) = v_{w} \cdot 2 a μ_{η} \leq 2 a v_{w} λ_{e, η} = 2 a \sqrt{v_{w}} λ_{f, η} .

(39)
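Relations (38) and (39) are easy to inspect numerically. The following sketch solves the overshoot equation in closed form and evaluates both sides of (39). It assumes v_w = r/(1 + r) and λ_e² = 2 log((1 − η)/η), in line with the definitions used elsewhere in the paper; the helper names are hypothetical.

```python
import math

def overshoot_quantities(eta, r):
    """Solve mu^2 + 2 a mu = lam_e^2 and return the derived scales."""
    v_w = r / (1.0 + r)                        # assumption: v_w = (1 + 1/r)^(-1)
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    lam_f = math.sqrt(v_w) * lam_e
    a = math.sqrt(2.0 * math.log(lam_f))       # overshoot; needs lam_f > 1
    mu_eta = -a + math.sqrt(a * a + lam_e * lam_e)   # positive root
    nu_eta = math.sqrt(v_w) * mu_eta
    return {"v_w": v_w, "lam_e": lam_e, "lam_f": lam_f,
            "a": a, "mu_eta": mu_eta, "nu_eta": nu_eta}

def gap_terms(q):
    """Return the three quantities compared in (39)."""
    gap = q["lam_f"] ** 2 - q["nu_eta"] ** 2
    exact = q["v_w"] * 2.0 * q["a"] * q["mu_eta"]
    bound = 2.0 * q["a"] * math.sqrt(q["v_w"]) * q["lam_f"]
    return gap, exact, bound

if __name__ == "__main__":
    for eta in (1e-5, 1e-20, 1e-80):
        q = overshoot_quantities(eta, 0.25)
        gap, exact, bound = gap_terms(q)
        print(eta, q["nu_eta"] / q["lam_f"], gap, bound)
```

The printed ratio ν_η/λ_f approaches 1 only slowly, so (38) should be read as a first-order statement.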

Proof of Lemma 4

Recall (26) and (27) in the heuristic discussion. We now clarify the dependence on scale v of the Bayes rule ${\hat{θ}}_{π, v}$ in the mixture representation (27). Passing from noise level v to noise level 1 by dividing parameters and estimates by $v^{1 / 2}$ , we obtain the invariance relation

q (θ, {\hat{θ}}_{π [η, λ], v}; v) = v q (v^{- 1 / 2} θ, {\hat{θ}}_{π [η, v^{- 1 / 2} λ]}; 1) .

Now set θ = ν_η and substitute into (27) to obtain, for π = π[η, ν_η],

ρ (ν_{η}, {\hat{p}}_{π}) = \frac{1}{2} \int_{v_{w}}^{1} q (v^{- 1 / 2} ν_{η}, {\hat{θ}}_{π [η, v^{- 1 / 2} ν_{η}]}; 1) \frac{d v}{v} .

(40)

Now apply Lemma 3 with $μ = v^{- 1 / 2} ν_{η}$ , which is bounded above by $v_{w}^{- 1 / 2} ν_{η} = μ_{η}$ . For all $v \in [v_{w}, 1]$ we obtain

q (v^{- 1 / 2} ν_{η}, {\hat{θ}}_{v^{- 1 / 2} ν_{η}}; 1) \geq v^{- 1} ν_{η}^{2} [1 - ε_{η}] .

Putting this into the mixture representation, we get

ρ (ν_{η}, {\hat{p}}_{π}) \geq \frac{1}{2} ν_{η}^{2} [1 - ε_{η}] \int_{v_{w}}^{1} \frac{d v}{v^{2}} = \frac{ν_{η}^{2}}{2 r} [1 - ε_{η}] .

Taking into account both (26) and (38), we have established the lemma.

Based on the discussion in Section 2, the above lemma establishes a lower bound on the asymptotic minimax risk β (η, r) in Theorem 2. Similarly, the symmetric 3-point prior

π_{3} [η, ν_{η}] = (1 - η) δ_{0} + (η / 2) \{δ_{ν_{η}} + δ_{- ν_{η}}\}

will also be asymptotically least favorable over m (η) as η → 0.

4. Theorem 2C: Univariate upper bound proof

The upper bound on the predictive minimax risk β (η, r) is derived from the upper bound on the maximum Bayes risk of ${\hat{p}}_{T, CL}$ over m (η). In this section we will prove the following lemma which along with Lemma 4 completes the proof of Theorem 2.

Lemma 5

For any r ∈ (0, ∞) we have, as η → 0,

\sup_{π \in m (η)} B (π, {\hat{p}}_{T, CL}) \leq \frac{η λ_{f}^{2}}{2 r} (1 + o (1)) .

We consider a threshold predictive density estimate ${\hat{p}}_{T}$ which uses the Bayes predictive density estimate from prior π below the threshold λ_e and ${\hat{p}}_{U}$ above the threshold λ_e. We bound the maximum predictive risk over m(η):

\sup_{π \in m (η)} B (π, {\hat{p}}_{T}) \leq (1 - η) ρ (0, {\hat{p}}_{T}) + η \sup_{θ} ρ (θ, {\hat{p}}_{T}) .

(41)

Next, as in (34), we decompose the predictive risk of ${\hat{p}}_{T}$ into contributions due to X above and below the threshold. We calculate explicit expressions for ρ_A and ρ_B. The predictive loss of ${\hat{p}}_{U}$ (see Appendix A.2) is given by

L (θ, {\hat{p}}_{U} (\cdot | x)) = a_{1 r} + a_{2 r} {(θ - x)}^{2}

(42)

with $a_{1 r} = \frac{1}{2} [\log (1 + r^{- 1}) - {(1 + r)}^{- 1}]$ and $a_{2 r} = \frac{1}{2} {(1 + r)}^{- 1}$ . Hence, the above-threshold term is

ρ_{A} (θ) = a_{1 r} P_{θ} (| X | > λ_{e}) + a_{2 r} E_{θ} [{(X - θ)}^{2}, | X | > λ_{e}] .

(43)

As ρ_B (θ) depends on the prior π used below the threshold, we restrict our attention to the specific choice of the cluster prior. The risk functions of the hard threshold density estimate ${\hat{p}}_{T, π_{0}}$ and that of ${\hat{p}}_{T, LF}$ can be easily derived from the calculations with the cluster prior.

According to (58) in the Appendix, the Bayes predictive density for a discrete prior $π = \sum_{k = - K}^{K} π_{k} δ_{μ_{k}}$ is given by

{\hat{p}}_{π} (y | x) = \sum_{- K}^{K} ϕ (y | μ_{k}, r) π_{k} ϕ (x - μ_{k}) / m (x),

(44)

where $m (x) = \sum_{k} π_{k} ϕ (x - μ_{k})$ denotes the marginal density of π. The K–L loss of ${\hat{p}}_{π} (\cdot | x)$ is given by

L (θ, {\hat{p}}_{π} (\cdot | x)) = E_{θ} \log \frac{ϕ (Y | θ, r)}{{\hat{p}}_{π} (Y | x)} .

A simple but informative upper bound for the K–L loss is obtained by retaining only the kth term in (44):

\begin{array}{l} L (θ, {\hat{p}}_{π} (\cdot | x)) \leq E_{θ} \log \frac{ϕ (Y | θ, r)}{ϕ (Y | μ_{k}, r)} - \log \frac{π_{k} ϕ (x - μ_{k})}{π_{0} ϕ (x)} + \log \frac{m (x)}{π_{0} ϕ (x)} \\ = \frac{1}{2 r} {(θ - μ_{k})}^{2} + \frac{1}{2} (μ_{k}^{2} - 2 x μ_{k}) - \log \frac{π_{k}}{π_{0}} + d (x) \end{array}

(45)

where we have set d(x) = log[m(x)/(π₀ϕ(x))].
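As an illustration of (44) and (45), the sketch below computes the Bayes predictive density for a small discrete prior, evaluates its K–L loss by numerical integration over y, and compares the loss with the single-term upper bound. The three-point prior, the value r = 0.5, the integration grid and all function names are hypothetical choices made only for this example; the index k refers to a position in the Python list rather than to the paper's range −K, …, K.

```python
import math

def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_weights(x, support, weights):
    """Posterior weights pi(mu_k | x) and marginal m(x) for X ~ N(theta, 1)."""
    joint = [w * normal_pdf(x, mu, 1.0) for mu, w in zip(support, weights)]
    m_x = sum(joint)
    return [j / m_x for j in joint], m_x

def predictive_density(y, x, support, weights, r):
    """Bayes predictive density (44): posterior mixture of N(mu_k, r) densities."""
    post, _ = posterior_weights(x, support, weights)
    return sum(p * normal_pdf(y, mu, r) for mu, p in zip(support, post))

def kl_loss(theta, x, support, weights, r, n=4000):
    """E_theta log[phi(Y | theta, r) / p_hat(Y | x)] via the trapezoid rule."""
    sd = math.sqrt(r)
    lo, hi = theta - 10.0 * sd, theta + 10.0 * sd
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        f = normal_pdf(y, theta, r)
        p_hat = predictive_density(y, x, support, weights, r)
        w = 0.5 if i in (0, n) else 1.0
        total += w * f * math.log(f / p_hat)
    return total * h

def single_term_bound(theta, x, k, support, weights, r):
    """Right side of (45), written as
    (theta - mu_k)^2 / (2r) - log[pi_k phi(x - mu_k) / m(x)]."""
    _, m_x = posterior_weights(x, support, weights)
    kept = weights[k] * normal_pdf(x, support[k], 1.0)
    return (theta - support[k]) ** 2 / (2.0 * r) - math.log(kept / m_x)

if __name__ == "__main__":
    support, weights, r = [-2.0, 0.0, 2.0], [0.025, 0.95, 0.025], 0.5
    loss = kl_loss(1.5, 0.5, support, weights, r)
    bounds = [single_term_bound(1.5, 0.5, k, support, weights, r) for k in range(3)]
    print(loss, bounds)
```

Because the bound holds for every k, the proof is free to pick the support point that is most convenient for the θ under consideration.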

We are now ready to analyze the bound (41). We follow the steps recalled in the quadratic loss case [see Section S.4 of Mukherjee and Johnstone (2015)] and evaluate the predictive risk at the origin and the maximum risk of the threshold density estimate ${\hat{p}}_{T}$ . This organization helps to make clear the new features of the predictive loss setting.

Risk at zero

It is easy to show that $ρ (0, {\hat{p}}_{T}) = O (η λ_{f})$ . First, from (43), we have

ρ_{A} (0) = 2 a_{1 r} \tilde{Φ} (λ_{e}) + a_{2 r} q_{A} (0) = O (η λ_{f}),

where q_A(0) is defined in (S.4.2) and the above calculation follows by using $\tilde{Φ} (λ_{e}) \leq λ_{e}^{- 1} ϕ (λ_{e}) = O (λ_{e}^{- 1} η)$ and the quadratic risk-at-zero bound (S.4.4).

For the below-threshold term, we set k = 0 in (45), note that μ₀ = 0 and apply Jensen’s inequality to obtain

ρ_{B} (0) = E_{0} [L (0, {\hat{p}}_{π} (\cdot | X)), | X | \leq λ_{e}] \leq E_{0} [d (X)] \leq \log E_{0} [m (X) / (π_{0} ϕ (X))] .

Since $E_{0} [m (X) / ϕ (X)] = \int m (x) d x = 1$ and π₀ = 1 − η, we obtain that

ρ_{B} (0) \leq - \log (1 - η) \leq η / (1 - η) .

Consequently, ρ_B(0) = O(η) and so $ρ (0, {\hat{p}}_{T, CL}) = O (η λ_{f})$ . Note that the above calculations hold for any ${\hat{p}}_{T, π}$ with π being a discrete prior in m(η).

Maximum risk

From decomposition (41), our goal is to show that

\sup_{θ} ρ (θ, {\hat{p}}_{T, CL}) = {(2 r)}^{- 1} λ_{f}^{2} (1 + o (1)) .

(46)

We first isolate the main term in the contributions from ρ_A(θ) and ρ_B(θ). From (43), clearly $ρ_{A} (θ) \leq a_{1 r} + a_{2 r} = O (1)$ , which does not contribute. We turn to

ρ_{B} (θ) = E_{θ} [L (θ, {\hat{p}}_{π} (\cdot | X)), | X | \leq λ_{e}]

and returning to (45), we begin by claiming that for $| x | \leq λ_{e}$ the final term $d (x) \leq \log 2$ . Indeed

\frac{m (x)}{π_{0} ϕ (x)} = 1 + \sum_{| k | = 1}^{K} \frac{π_{k}}{π_{0}} \frac{ϕ (x - μ_{k})}{ϕ (x)} = 1 + \sum_{| k | = 1}^{K} \frac{π_{k}}{π_{0}} \exp \{x μ_{k} - \frac{μ_{k}^{2}}{2}\} .

(47)

For $| x | \leq λ_{e}$ , we have

x μ_{k} - μ_{k}^{2} / 2 \leq λ_{e} | μ_{k} | - μ_{k}^{2} / 2 \leq λ_{e}^{2} / 2 = \log η^{- 1} (1 - η) .

Since $π_{0} = 1 - η$ , we arrive at

E_{θ} [d (X), | X | \leq λ_{e}] \leq \log 2 .

(48)
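The argument leading to (48) uses only π₀ = 1 − η and the fact that the nonzero weights sum to η, so d(x) ≤ log 2 on |x| ≤ λ_e for any placement of the nonzero support points. The sketch below checks this on a grid. The support points and the value of η are arbitrary hypothetical choices rather than the cluster prior of (16).

```python
import math

def threshold_lam_e(eta):
    """lam_e defined through lam_e^2 = 2 log((1 - eta) / eta)."""
    return math.sqrt(2.0 * math.log((1.0 - eta) / eta))

def log_marginal_ratio(x, eta, nonzero_support):
    """d(x) = log[m(x) / (pi_0 phi(x))] as in (47), with pi_0 = 1 - eta and
    equal weights eta / len(nonzero_support) on the nonzero support points."""
    pi_0 = 1.0 - eta
    pi_k = eta / len(nonzero_support)
    total = 1.0
    for mu in nonzero_support:
        total += (pi_k / pi_0) * math.exp(x * mu - 0.5 * mu * mu)
    return math.log(total)

def max_d_below_threshold(eta, nonzero_support, grid=400):
    """Largest d(x) over an equally spaced grid on [-lam_e, lam_e]."""
    lam_e = threshold_lam_e(eta)
    xs = [-lam_e + 2.0 * lam_e * i / grid for i in range(grid + 1)]
    return max(log_marginal_ratio(x, eta, nonzero_support) for x in xs)

if __name__ == "__main__":
    # Hypothetical symmetric support, chosen only for illustration.
    support = [s * m for m in (1.5, 2.5, 3.7, 5.0) for s in (-1.0, 1.0)]
    print(max_d_below_threshold(1e-3, support), math.log(2.0))
```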

The dependence of (45) on θ may then be seen by writing x = θ + z. The first two terms in (45) then take the form

\frac{1}{2 r} \{{[θ - (1 + r) μ_{k}]}^{2} - (r^{2} + r) μ_{k}^{2}\} - μ_{k} z,

while, after recalling that $π_{k} = η / (2 K)$ and that $λ_{e}^{2} = 2 \log (1 - η) η^{- 1}$ , the third term becomes

\frac{1}{2} λ_{e}^{2} + \log (2 K) = \frac{1}{2 r} (1 + r) λ_{f}^{2} + \log (2 K) .

We may therefore rewrite (45) as

L (θ, {\hat{p}}_{π} (\cdot | x)) \leq \frac{1}{2 r} [λ_{f}^{2} + q_{k} (θ)] - μ_{k} (x - θ) + \log (2 K) + d (x),

(49)

where the kth quadratic polynomial

q_{k} (θ) = {[θ - (1 + r) μ_{k}]}^{2} - r^{2} μ_{k}^{2} + r (λ_{f}^{2} - μ_{k}^{2}) .

Denote the last three terms of (49) by J_k(x, θ). From (16) and (48) we see that

E_{θ} [J_{k}, | X | \leq λ_{e}] \leq μ_{k} + \log (2 K) + \log 2 \leq λ_{e} + a + \log (4 K) = o (λ_{f}^{2}) .

Consequently, we obtain the key bound

ρ_{B} (θ) \leq \frac{1}{2 r} [λ_{f}^{2} + \min_{k} q_{k} (θ)] + o (λ_{f}^{2}) .

(50)

Now we use the geometric structure of the support points μ_k, defined at (16). We bound min_k q_k(θ) above by considering the quadratic polynomial q_k(θ) on I_k = [μ_k, μ_{k+1}] and observe that these 2K intervals cover the range (−λ_e − a, −λ_f) ∪ (λ_f, λ_e + a) of interest. See Figure 2. Note that q_k(θ) achieves its maximum on I_k at both endpoints and that

q_{k} (μ_{k + 1}) = q_{k} ((1 + 2 r) μ_{k}) = q_{k} (μ_{k}) = r (λ_{f}^{2} - μ_{k}^{2}) .

Fig. 2 — Schematic diagram demonstrating the behavior of the quadratic polynomials q_k(θ) in the interval [μ_1, μ_{K+1}]. Here K = 4. The maximum of min_k q_k(θ) for θ ∈ [μ_1, μ_{K+1}] is bounded by q_1(μ_1).

These maxima decrease with k and so are bounded by $q_{1} (ν_{η}) = r (λ_{f}^{2} - ν_{η}^{2})$ . Appealing now to bound (39), we have for λ_f < |θ| < λ_e + a,

\min_{k} q_{k} (θ) \leq r (λ_{f}^{2} - ν_{η}^{2}) \leq 2 r \sqrt{v_{w}} a λ_{f} .

Returning to (50), we now see that the last two terms are each $o (λ_{f}^{2})$ and so the final bound (46) is proven. This completes the proof of Lemma 5.
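The geometric step in this proof can be reproduced numerically. The sketch below builds support points with μ_1 = ν_η and μ_{k+1} = (1 + 2r) μ_k, as read off from the displays above (the formal definition (16) lies outside this section), and checks that the maximum over θ of min_k q_k(θ) does not exceed r(λ_f² − ν_η²). It assumes v_w = r/(1 + r); the function names and grid size are hypothetical.

```python
import math

def scales(eta, r):
    """Return (lam_e, lam_f, a, nu_eta, v_w) from the overshoot equation."""
    v_w = r / (1.0 + r)                       # assumption: v_w = (1 + 1/r)^(-1)
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    lam_f = math.sqrt(v_w) * lam_e
    a = math.sqrt(2.0 * math.log(lam_f))
    mu_eta = -a + math.sqrt(a * a + lam_e * lam_e)
    return lam_e, lam_f, a, math.sqrt(v_w) * mu_eta, v_w

def q_k(theta, mu_k, lam_f, r):
    """Quadratic polynomial q_k(theta) defined after (49)."""
    return ((theta - (1.0 + r) * mu_k) ** 2 - r * r * mu_k ** 2
            + r * (lam_f ** 2 - mu_k ** 2))

def geometric_support(nu_eta, r, upper):
    """Points mu_1 = nu_eta, mu_{k+1} = (1 + 2r) mu_k, extended until the
    intervals [mu_k, mu_{k+1}] cover [nu_eta, upper]."""
    pts = [nu_eta]
    while pts[-1] < upper:
        pts.append((1.0 + 2.0 * r) * pts[-1])
    return pts

def max_of_min_q(support, lam_f, r, grid=2000):
    """Grid maximum over [mu_1, mu_{K+1}] of min over k <= K of q_k(theta)."""
    left = support[:-1]                       # mu_1, ..., mu_K carry a quadratic
    lo, hi = support[0], support[-1]
    best = -math.inf
    for i in range(grid + 1):
        theta = lo + (hi - lo) * i / grid
        best = max(best, min(q_k(theta, mu, lam_f, r) for mu in left))
    return best

if __name__ == "__main__":
    r = 0.25
    lam_e, lam_f, a, nu_eta, v_w = scales(math.exp(-20.0), r)
    support = geometric_support(nu_eta, r, lam_e + a)
    print(len(support) - 1, max_of_min_q(support, lam_f, r),
          r * (lam_f ** 2 - nu_eta ** 2))
```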

These calculations apply to threshold density estimates based on Bayes estimates of discrete priors. In particular, for ${\hat{p}}_{T, LF}$ which is based on the 3-point prior π₃[η, ν_η], we have K = 1 and the bound (36). Thus, the difference $ρ_{B} (θ) - λ_{f}^{2} / 2 r$ in this case is negligible when |θ| ≤ μ₂.

Similarly, the asymptotic risk function of the hard threshold plug-in density estimate ${\hat{p}}_{T, π_{0}}$ (for which K = 0 in our calculations above) exceeds the minimax risk β(η, r) for |θ| ∈ [λ_f, λ_e] and so is minimax suboptimal for any fixed r. Figure 3 shows the numerical evaluation of the risk functions for the different univariate threshold density estimates.

Fig. 3 — Numerical evaluation of the asymptotic risk ρ_B(θ) for r = 0.25 of univariate threshold density estimates: hard threshold plug-in estimate ${\hat{p}}_{T, π_{0}}$ (red), ${\hat{p}}_{T, LF}$ (green) and the cluster prior-based minimax optimal estimate ${\hat{p}}_{T, CL}$ (blue). The brown boxes show the nonzero support point of the cluster prior and the univariate asymptotic minimax risk $β (η, r) = {(2 r)}^{- 1} λ_{f}^{2}$ and the threshold λ_e are respectively denoted by dotted horizontal and vertical lines. The plot on left has η = e⁻²⁰ (very high sparsity), λ_f = 2.83, λ_e = 6.32 and the right one has η = 0.05 (moderate sparsity), λ_f = 1.09, λ_e = 2.45.

Also, note that any threshold estimate ${\hat{p}}_{T} [λ]$ with threshold size λ less than λ_e will be minimax suboptimal, as its risk at the origin will not be negligible as compared to β(η, r). By (34) and (43) we have

\begin{array}{l} ρ (0, {\hat{p}}_{T} [λ]) \geq 2 a_{2 r} E [Z^{2} I \{Z > λ\}] = 2 a_{2 r} \{λ ϕ (λ) + \tilde{Φ} (λ)\} \\ \geq λ ϕ (λ) / (1 + r), \end{array}

(51)

and so for any fixed ε > 0,

\underset{1 \leq λ < λ_{e} (η) - ε}{\lim \inf} \frac{ρ (0, {\hat{p}}_{T} [λ])}{β (η, r)} \to \infty .

Thus, ${\hat{p}}_{T} [λ]$ is suboptimal unless λ ≥ λ_e.
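To see how quickly a smaller threshold becomes costly, the sketch below evaluates the final lower bound λϕ(λ)/(1 + r) from (51) at λ = λ_e − ε and divides it by the first-order expression ηλ_f²/(2r) for β(η, r) from Lemmas 4 and 5. It is a rough illustration that ignores the (1 + o(1)) factors and assumes v_w = r/(1 + r); the function name is hypothetical.

```python
import math

def origin_risk_ratio(eta, r, eps):
    """Lower bound on rho(0, p_T[lam]) at lam = lam_e - eps, divided by the
    first-order minimax risk eta * lam_f^2 / (2 r)."""
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    lam_f_sq = r / (1.0 + r) * lam_e ** 2      # assumption: v_w = r / (1 + r)
    lam = lam_e - eps
    phi_lam = math.exp(-0.5 * lam * lam) / math.sqrt(2.0 * math.pi)
    lower = lam * phi_lam / (1.0 + r)          # last line of (51)
    beta = eta * lam_f_sq / (2.0 * r)          # first-order beta(eta, r)
    return lower / beta

if __name__ == "__main__":
    for eta in (1e-5, 1e-20, 1e-80):
        print(eta, origin_risk_ratio(eta, 0.25, 0.5))
```

The ratio grows roughly like exp(ελ_e)/λ_e, so even a fixed shortfall ε in the threshold eventually dominates the minimax risk.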

5. Theorem 1: Multivariate minimax risk

Here we will use the univariate minimax results developed in the previous sections to evaluate the asymptotic multivariate minimax risk R_n = R_N(Θ_n[s_n]) over the sparse parameter space Θ_n[s_n].

5.1. Lower bound proof: Theorem 1B and an extension

We first prove a lower bound for the multivariate minimax risk under only the assumption that s_n/n → 0, without requiring, as in Theorem 1B, that also s_n → ∞. This is done using an “independent blocks” sparse prior, along the lines of Johnstone (2013), Chapter 8.6, that we will show to be asymptotically least favorable. This result establishes the lower bound half of Theorem 1A. At the end of the subsection, we prove Theorem 1B using the simpler i.i.d. prior.

Let π_S(τ; m) denote a single spike prior of scale τ on ℝ^m: choose an index I ∈ {1, …, m} at random and set θ = τe_I, where e_I is a unit length vector in the Ith coordinate direction. We will use a scale τ_m = λ_m − log λ_m which is somewhat smaller than $λ_{m} = \sqrt{2 \log m}$ .

The independent blocks prior π^IB on Θ[s_n] is built by dividing {1, …, n} into s_n contiguous blocks B_j, j = 1, …, s_n, each of length m = m_n = [n/s_n]. Draw components θ_i in each block B_j according to an independent copy of π_S(ν_m; m) where the scale $ν_{m} = \sqrt{v_{w}} τ_{m}$ is matched to the prediction setting. Finally, set θ_i = 0 for the remaining n − m_n s_n components. Thus, π^IB is supported on Θ[s_n] since any draw θ from π^IB has exactly s_n nonzero components.
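The construction just described can be sketched as a sampler. The code below draws one θ from an independent blocks prior, reading [n/s_n] as the integer part and assuming v_w = r/(1 + r); the function name and the use of Python's `random.Random` are illustrative choices, not part of the paper.

```python
import math
import random

def draw_independent_blocks(n, s_n, r, rng):
    """Draw theta from the independent blocks prior: one spike of height nu_m
    at a uniformly chosen position in each of s_n blocks of length m."""
    m = n // s_n                               # block length [n / s_n]
    lam_m = math.sqrt(2.0 * math.log(m))
    tau_m = lam_m - math.log(lam_m)            # spike scale at noise level 1
    v_w = r / (1.0 + r)                        # assumption: v_w = (1 + 1/r)^(-1)
    nu_m = math.sqrt(v_w) * tau_m              # scale matched to prediction
    theta = [0.0] * n                          # trailing n - m * s_n entries stay 0
    for j in range(s_n):
        spike = j * m + rng.randrange(m)       # uniform index inside block j
        theta[spike] = nu_m
    return theta, m, nu_m

if __name__ == "__main__":
    theta, m, nu_m = draw_independent_blocks(1000, 7, 0.25, random.Random(0))
    print(m, nu_m, [i for i, t in enumerate(theta) if t != 0.0])
```

Every draw has exactly s_n nonzero coordinates, which is why the prior is supported on Θ[s_n] without any conditioning step.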

The lower bound half of Theorem 1A follows from the following result, the analog of Theorem 1B for the independent blocks prior.

Theorem 6

Fix r ∈ (0, ∞). If s_n/n → 0, then

R_{N} (Θ_{n} [s_{n}]) \geq B (π_{n}^{IB}) \geq {(1 + r)}^{- 1} s_{n} \log (n / s_{n}) .

Proof

Bounding maximum risk by Bayes risk and using the product structure shows that

R_{n} = R_{N} (Θ_{n} [s_{n}]) \geq B (π_{n}^{IB}) = s_{n} B (π_{S} (ν_{m}; m)) .

(52)

Next, using $B_{Q}^{v}$ to denote the Bayes risk for noise level v, the multivariate form of the connecting equation and scale invariance enable us to write

B (π_{S} (ν_{m}; m)) = \frac{1}{2} \int_{v_{w}}^{1} B_{Q}^{v} (π_{S} (ν_{m}; m)) \frac{d v}{v^{2}} = \frac{1}{2} \int_{v_{w}}^{1} B_{Q} (π_{S} (\frac{ν_{m}}{\sqrt{v}}; m)) \frac{d v}{v} .

The next proposition, proved in Section S.5 of Mukherjee and Johnstone (2015), provides a uniform lower bound for the quadratic loss Bayes risk of a single spike prior. It is a multivariate analog of Lemma 3.

Proposition 7

Suppose that y ~ N_n(0, I). Set $λ_{n} = \sqrt{2 \log n}$ and τ_n = λ_n − log λ_n. Then there exists ε_n → 0 such that uniformly in τ ∈ [0, τ_n],

B_{Q} (π_{S} (τ; n)) \geq τ^{2} (1 - ε_{n}) .

Noting that v ∈ [v_w, 1] implies that $ν_{m} / \sqrt{v} \leq ν_{m} / \sqrt{v_{w}} = τ_{m}$ , and then applying the proposition,

B (π_{S} (ν_{m}; m)) \geq \frac{(1 - ε_{m})}{2} \int_{v_{w}}^{1} \frac{ν_{m}^{2}}{v^{2}} d v = (1 - ε_{m}) \frac{ν_{m}^{2}}{2 r} .

Combining this with (52) and the definition of ν_m, we obtain

R_{n} \geq (1 - ε_{m}) s_{n} v_{w} τ_{m}^{2} / (2 r) \sim {(1 + r)}^{- 1} s_{n} \log (n / s_{n}) .

(53)

Proof of Theorem 1B

Note that because of the product structure of the problem and the prior $π_{n}^{IID}$ we have

B (π_{n}^{IID}) = \sum_{i = 1}^{n} β (η_{n}, r) = n β (η_{n}, r),

which is asymptotically equal to R_N(Θ[s_n]), using the univariate Theorem 2 [cf. (20)] and

{(2 r)}^{- 1} λ_{f}^{2} = {(2 r)}^{- 1} v_{w} λ_{e}^{2} \sim {(1 + r)}^{- 1} \log η_{n}^{- 1} \quad \text{as } n \to \infty .

(54)

Also, as s_n → ∞, $π_{n}^{IID} (Θ [s_{n}]) \to 1$ by application of Chebyshev’s inequality and, hence, $π_{n}^{IID}$ is an asymptotically least favorable prior under the conditions of Theorem 1B. □

5.2. Upper bound proof: Theorem 1C

First, an upper bound on R_N(Θ_n[s_n]) is derived based on the maximum risk of the multivariate product threshold density estimate ${\hat{p}}_{T, CL}$ defined in Theorem 1C. Using the product structure of the threshold estimate as well as that of the unknown future density

{\hat{p}}_{T, CL} (y | x) = \prod_{i = 1}^{n} {\hat{p}}_{T, CL} (y_{i} | x_{i}) and p (y | θ, r) = \prod_{i = 1}^{n} p (y_{i} | θ_{i}, r),

the risk of our multivariate threshold estimate simplifies as an agglomerative coordinate wise risk of the respective univariate density estimates

ρ (θ, {\hat{p}}_{T, CL}) = E_{θ} \log \frac{p (y | θ, r)}{{\hat{p}}_{T, CL} (y | x)} = \sum_{i = 1}^{n} ρ (θ_{i}, {\hat{p}}_{T}) .

Now, maximizing over θ ∈ Θ_n[s_n], we have

R_{n} \leq \sup_{Θ_{n} [s_{n}]} ρ (θ, {\hat{p}}_{T, CL}) \leq (n - s_{n}) ρ (0, {\hat{p}}_{T, CL}) + s_{n} \sup_{θ} ρ (θ, {\hat{p}}_{T, CL}) .

From the univariate study, we know that $ρ (0, {\hat{p}}_{T, CL}) = O (η_{n} λ_{f})$ , which makes $(n - s_{n}) ρ (0, {\hat{p}}_{T, CL}) = O (s_{n} λ_{f})$ negligible relative to

s_{n} \sup_{θ} ρ (θ, {\hat{p}}_{T, CL}) = {(2 r)}^{- 1} s_{n} λ_{f}^{2} (1 + o (1)),

where we used (46). Thus, taking account also of (54), we have the desired upper bound on the minimax risk

R_{n} \leq {(2 r)}^{- 1} s_{n} λ_{f}^{2} (1 + o (1)) \sim {(1 + r)}^{- 1} s_{n} \log (n / s_{n}) .

(55)

Completion of Proof of Theorems 1A, 1B and 1C

As the lower bound (53) and upper bound (55) on R_n match asymptotically, the first order asymptotic minimax risk of Theorem 1A is achieved, and the proof of all parts is done.

5.3. Proof of Proposition 1

Estimates in ℒ and $G$ are products of the form (21) and so $R_{ℒ, n} = R_{ℒ} (Θ_{n} [s_{n}])$ can be studied using the associated univariate problem and decomposition (23). It is shown in Appendix A.2 that

ρ (θ, {\hat{p}}_{L, α}) = \frac{1}{2} \log (1 + \frac{α}{r}) + \frac{{(1 - α)}^{2}}{2 (r + α)} [θ^{2} - \frac{α}{1 - α}] .

(56)

Thus, $\sup_{θ} ρ (θ, {\hat{p}}_{L, α})$ is infinite unless α = 1, that is, the uniform prior estimate ${\hat{p}}_{U}$ , in which case $ρ (θ, {\hat{p}}_{U}) \equiv \frac{1}{2} \log (1 + r^{- 1})$ . Thus,

R_{ℒ, n} = \frac{n}{2} \log (1 + r^{- 1}) ≫ \frac{s_{n}}{1 + r} \log (\frac{n}{s_{n}}) \sim R_{n} .

In particular, $R_{ℒ, n} / R_{n} \to \infty$ when s_n/n → 0.
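The behavior of the linear risk (56) is easy to tabulate. The sketch below evaluates (56) in a form that avoids the quotient α/(1 − α), so that α = 1 is allowed, and cross-checks it against the expression obtained from (60) with d̂ = r + α and θ̂ = αX. The function names and the parameter values are hypothetical.

```python
import math

def linear_risk(theta, alpha, r):
    """Risk (56), with (1 - alpha)^2 [theta^2 - alpha / (1 - alpha)] expanded as
    (1 - alpha) [(1 - alpha) theta^2 - alpha]."""
    quad = (1.0 - alpha) * ((1.0 - alpha) * theta ** 2 - alpha)
    return 0.5 * math.log(1.0 + alpha / r) + quad / (2.0 * (r + alpha))

def linear_risk_direct(theta, alpha, r):
    """Same risk from (60) with d_hat = r + alpha and theta_hat = alpha * X."""
    mse = alpha ** 2 + (1.0 - alpha) ** 2 * theta ** 2   # E_theta (alpha X - theta)^2
    return 0.5 * (math.log(1.0 + alpha / r) + (r + mse) / (r + alpha) - 1.0)

if __name__ == "__main__":
    r = 0.25
    for alpha in (0.5, 0.9, 1.0):
        print(alpha, [round(linear_risk(t, alpha, r), 4) for t in (0.0, 5.0, 50.0)])
```

For α < 1 the risk grows like θ², while for α = 1 it is constant in θ, which is the dichotomy used in the text.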

We turn to the Gaussian class $G$ . Since $E \subset G$ , clearly $R_{G, n} \leq R_{E, n} = {(2 r)}^{- 1} n η_{n} λ_{e}^{2}$ . We give here a heuristic argument for the reverse inequality, which gives the idea for the rigorous proof given in Section S.3 of the supplementary material [Mukherjee and Johnstone (2015)]. From the decomposition (23), any near-optimal estimator in $G$ must have univariate risk at 0 bounded as follows:

ρ (0, {\hat{p}}_{1}) \leq r^{- 1} η_{n} λ_{e}^{2} .

(57)

Now from (60) we know that the risk at the origin for the univariate Gaussian density estimate $p [\hat{θ}, \hat{d}]$ is

ρ (0, p [\hat{θ}, \hat{d}]) = 2^{- 1} E_{0} \{\log (r^{- 1} \hat{d}) + {\hat{d}}^{- 1} (r + {\hat{θ}}^{2}) - 1\},

which for any fixed choice of $\hat{θ}$ achieves its minimum at $d_{opt} [\hat{θ}] = r + {\hat{θ}}^{2}$ . Thus, for such an optimal choice of $\hat{d}$ ,

ρ (0, p [\hat{θ}, d_{opt} [\hat{θ}]]) = 2^{- 1} E_{0} \log (1 + r^{- 1} {\hat{θ}}^{2}),

and for this to satisfy (57), we must have $\hat{θ} (x) \approx 0$ for |x| ≤ λ_e(1 + o(1)). Thus, $\hat{p}$ would approximately need to have the threshold structure (31), (32) for |x| ≤ λ_e and so the bound (35) shows that

ρ (θ, {\hat{p}}_{1}) \geq \frac{θ^{2}}{2 r} P_{θ} (| X | \leq λ_{e}) \sim \frac{λ_{e}^{2}}{2 r} .

Returning to decomposition (23), we can now see that $R_{G, n} \underset{\sim}{>} {(2 r)}^{- 1} s_{n} λ_{e}^{2} \sim R_{E, n}$ , which completes the heuristic argument.
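The pointwise minimization over d̂ used in this heuristic can be checked directly. For a fixed value t of θ̂, the sketch below evaluates the bracket log(d/r) + (r + t²)/d − 1 from the risk-at-origin display and confirms on a grid that it is smallest at d = r + t², where it equals log(1 + t²/r). The function names and the grid are hypothetical.

```python
import math

def origin_bracket(d, theta_hat, r):
    """Bracket log(d / r) + (r + theta_hat^2) / d - 1 inside the risk at 0."""
    return math.log(d / r) + (r + theta_hat ** 2) / d - 1.0

def d_opt(theta_hat, r):
    """Minimizer of the bracket over d > 0."""
    return r + theta_hat ** 2

def grid_minimum(theta_hat, r, lo=0.01, hi=50.0, n=5000):
    """Smallest bracket value over an equally spaced grid of d values."""
    return min(origin_bracket(lo + (hi - lo) * i / n, theta_hat, r)
               for i in range(n + 1))

if __name__ == "__main__":
    r, t = 0.25, 1.5
    print(d_opt(t, r), origin_bracket(d_opt(t, r), t, r), math.log(1.0 + t * t / r))
```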

6. Discussion

Avoiding thresholding

The asymptotic minimax rules ${\hat{p}}_{T}$ described in Theorems 1C and 2C are based on thresholding. It would be desirable to construct a prior π for which the Bayes predictive density ${\hat{p}}_{π}$ in (2) is itself asymptotically minimax, without any use of the discontinuous thresholding operation.

Consider, then, a symmetric univariate prior π_∞[η, r] whose support consists of the origin and an infinite number of equidistant clusters, each containing 2K points in the same spatial alignment as for π_CL[η, r]:

π_{\infty} [η, r] = (1 - η) δ_{0} + \frac{1 - η}{2} \sum_{j = 0}^{\infty} η^{j + 1} \sum_{k = 1}^{K} q_{k} (δ_{μ_{jk}} + δ_{- μ_{jk}}),

where μ_jk = jλ_e + μ_k and, for k = 2, …, K and γ = log η⁻¹, we have $q_{k} = γ^{- k}$ and $q_{1} = 1 - \sum_{2}^{K} q_{k}$ .

Based on π_∞[η_n, r], one can construct a multivariate prior $π_{n, \infty}^{IID}$ using (11), which heuristic arguments indicate will not only be least favorable but also yield a minimax optimal density estimate. A detailed proof is forthcoming.

Approximate sparsity and other extensions

Starting from Johnstone (2013), Chapters 8 and 13, the ℓ₀ sparsity results presented here can be extended to obtain minimax optimal predictive density estimates over weak and strong ℓ_p sparse parameter spaces. An interesting topic for future work will be whether, as in point estimation [Donoho and Johnstone (1994)], the phenomena seen here can be generalized to a family of loss functions. Simple analogues of the connecting equations [Brown, George and Xu (2008), Theorem 1] between the predictive and quadratic PE regimes do not exist in those cases, though some of the decision theoretic parallels can still be proved particularly for the ℓ₂ loss [Gatsonis (1984)].


Acknowledgments

We thank the Associate Editor and three referees for constructive suggestions to shorten and improve the paper.

APPENDIX

A.1. Bayes density estimate for discrete priors

The posterior distribution for the discrete prior $π = \sum_{k = - K}^{K} π_{k} δ_{μ_{k}}$ is given by

π (μ_{k} | x) = {m (x)}^{- 1} ϕ (x | μ_{k}, 1) π_{k} where m (x) = \sum_{k} π_{k} ϕ (x | μ_{k}, 1) .

So, for the Bayes predictive density based on the prior π,

{\hat{p}}_{π} (y | x) = \sum_{k = - K}^{K} ϕ (y | μ_{k}, r) π (μ_{k} | x) = \sum_{k = - K}^{K} ϕ (y | μ_{k}, r) \frac{ϕ (x | μ_{k}, 1) π_{k}}{m (x)} .

(58)

A.2. K–L risk for Gaussian and linear density estimates

The predictive risk of the univariate Gaussian density estimate $p [\hat{θ}, \hat{d}] = N (\hat{θ}, \hat{d})$ is given by

ρ (θ, p [\hat{θ}, \hat{d}]) = E_{θ} \{\log ϕ (Y | θ, r)\} - E_{θ} \{\log ϕ (Y | \hat{θ} (X), \hat{d} (X))\},

where the expectation is over X ~ N(θ, 1) and Y ~ N(θ, r). Noting that $E_{θ} \{\log ϕ (Y | \hat{θ}, \hat{d}) | X = x\} = - \frac{1}{2} \log (2 π \hat{d} (x)) - {(2 \hat{d} (x))}^{- 1} \{r + {(\hat{θ} (x) - θ)}^{2}\}$ and $E_{θ} \log ϕ (Y | θ, r) = - \frac{1}{2} \log (2 π r) - \frac{1}{2}$ , we obtain

L (θ, \hat{p} (\cdot | x)) = \frac{1}{2} \log (r^{- 1} \hat{d}) + \frac{r + {(\hat{θ} (x) - θ)}^{2}}{2 \hat{d}} - \frac{1}{2},

(59)

and the following expression for the K–L risk of members in $G$ :

ρ (θ, p [\hat{θ}, \hat{d}]) = \frac{1}{2} [E_{θ} \log (r^{- 1} \hat{d}) + E_{θ} \{\frac{r + {(\hat{θ} - θ)}^{2}}{\hat{d}}\} - 1] .

(60)

Consider now “linear” estimators. Starting with the conjugate prior θ ~ N(0, α/(1 − α)) for 0 ≤ α ≤ 1, standard calculations show that the posterior density π(θ|x) is N(αx, α) and the predictive density ${\hat{p}}_{L, α}$ , being the convolution of Gaussians, compare (2), is seen to be N(αx, r + α). Now, using $\hat{d} = r + α$ and $\hat{θ} = α X$ in (60), we get

ρ (θ, {\hat{p}}_{L, α}) = \frac{1}{2} [\log (1 + r^{- 1} α) + {(r + α)}^{- 1} \{r + E_{θ} {(α X - θ)}^{2}\} - 1] .

The linear risk formula (56) now follows from the quadratic risk of αX. Next, we present some details about the risk of the particular linear estimate ${\hat{p}}_{U}$ .

Proof of (42)

The estimator ${\hat{p}}_{U} = {\hat{p}}_{L, 1}$ is given by the N(x, 1 + r) distribution, and so from (59)

L (θ, {\hat{p}}_{U} (\cdot | x)) = \frac{1}{2} \log (1 + r^{- 1}) + \frac{r + {(θ - x)}^{2}}{2 (1 + r)} - \frac{1}{2},

from which (42) is immediate.
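The reduction to (42) can be confirmed with a few lines of arithmetic. The sketch below compares the closed-form K–L divergence between N(θ, r) and N(x, 1 + r), which is the content of (59) for this estimate, with a_{1r} + a_{2r}(θ − x)². The function names and test values are hypothetical.

```python
import math

def kl_gaussians(mean_p, var_p, mean_q, var_q):
    """K-L divergence from N(mean_p, var_p) to N(mean_q, var_q), as in (59)."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)

def loss_p_u(theta, x, r):
    """Right side of (42): a_1r + a_2r (theta - x)^2."""
    a_1r = 0.5 * (math.log(1.0 + 1.0 / r) - 1.0 / (1.0 + r))
    a_2r = 0.5 / (1.0 + r)
    return a_1r + a_2r * (theta - x) ** 2

if __name__ == "__main__":
    theta, x, r = 1.2, -0.4, 0.25
    print(kl_gaussians(theta, r, x, 1.0 + r), loss_p_u(theta, x, r))
```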

Footnotes

Supported in part by NSF Grant DMS-09-06812 and NIH Grant R01 EB001988.

MSC2010 subject classifications. Primary 62C20; secondary 62M20, 60G25, 91G70.

SUPPLEMENTARY MATERIAL

Supplementary material to “Exact minimax estimation of the predictive density in sparse Gaussian models” (DOI: 10.1214/14-AOS1251SUPP; .pdf). The supplement [Mukherjee and Johnstone (2015)] contains a brief description of the relevance of the predictive density estimation problem in related application areas along with the proof for the suboptimality of the univariate threshold density estimate ${\hat{p}}_{T, LF}$ (in Section S.2) and the details of the proof of Proposition 1 (in Section S.3). The arguments for the maximum quadratic risk of hard threshold point estimates are reviewed in Section S.4 and the proof of Proposition 7 is presented in Section S.5. Links to the R code used in producing Table 1 and Figure 3 are also provided.

Contributor Information

Gourab Mukherjee, Email: gourab@usc.edu, Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California 90089-0809, USA.

Iain M. Johnstone, Email: imj@stanford.edu, Department of Statistics, Sequoia Hall, 390 Serra Mall, Stanford University, Stanford, California 94305-4065, USA.

References

Aitchison J. Goodness of prediction fit. Biometrika. 1975;62:547–554. MR0391353. [Google Scholar]
Aitchison J, Dunsmore IR. Statistical Prediction Analysis. Cambridge Univ. Press; Cambridge: 1975. MR0408097. [Google Scholar]
Aslan M. Asymptotically minimax Bayes predictive densities. Ann Statist. 2006;34:2921–2938. MR2329473. [Google Scholar]
Barndorff-Nielsen OE, Cox DR. Prediction and asymptotics. Bernoulli. 1996;2:319–340. MR1440272. [Google Scholar]
Bell RM, Cover TM. Competitive optimality of logarithmic investment. Math Oper Res. 1980;5:161–166. MR0571810. [Google Scholar]
Brown L. Lecture notes on statistical decision theory. 1974 Available at http://www-stat.wharton.upenn.edu/~lbrown.
Brown LD, George EI, Xu X. Admissible predictive density estimation. Ann Statist. 2008;36:1156–1170. MR2418653. [Google Scholar]
Cover TM, Thomas JA. Elements of Information Theory. Wiley; New York: 1991. MR112280. [Google Scholar]
Donoho DL, Johnstone IM. Minimax risk over lp-balls for lq-error. Probab Theory Related Fields. 1994;99:277–303. [Google Scholar]
Donoho DL, Johnstone IM, Hoch JC, Stern AS. Maximum entropy and the nearly black object. J R Stat Soc Ser B Stat Methodol. 1992;54:41–81. MR1157714. [Google Scholar]
Fourdrinier D, Marchand É, Righi A, Strawderman WE. On improved predictive density estimation with parametric constraints. Electron J Stat. 2011;5:172–191. MR2792550. [Google Scholar]
Gatsonis CA. Deriving posterior distributions for a location parameter: A decision theoretic approach. Ann Statist. 1984;12:958–970. MR0751285. [Google Scholar]
Geisser S. Predictive Inference: An Introduction Monographs on Statistics and Applied Probability. Vol. 55. Chapman & Hall; New York: 1993. MR1252174. [Google Scholar]
George EI, Liang F, Xu X. Improved minimax predictive densities under Kullback–Leibler loss. Ann Statist. 2006;34:78–91. MR2275235. [Google Scholar]
George EI, Liang F, Xu X. From minimax shrinkage estimation to minimax shrinkage prediction. Statist Sci. 2012;27:82–94. MR2953497. [Google Scholar]
Ghosh M, Mergel V, Datta GS. Estimation, prediction and the Stein phenomenon under divergence loss. J Multivariate Anal. 2008;99:1941–1961. MR2466545. [Google Scholar]
Hartigan JA. The maximum likelihood prior. Ann Statist. 1998;26:2083–2103. MR1700222. [Google Scholar]
Johnstone IM. Gaussian estimation: Sequence and wavelet models. 2013 Available at http://www-stat.stanford.edu/~imj.
Komaki F. On asymptotic properties of predictive distributions. Biometrika. 1996;83:299–313. MR1439785. [Google Scholar]
Komaki F. A shrinkage predictive distribution for multivariate normal observables. Biometrika. 2001;88:859–864. MR1859415. [Google Scholar]
Komaki F. Simultaneous prediction of independent Poisson observables. Ann Statist. 2004;32:1744–1769. MR2089141. [Google Scholar]
Larimore WE. Predictive inference, sufficiency, entropy and an asymptotic likelihood principle. Biometrika. 1983;70:175–181. MR0742987. [Google Scholar]
McMillan B. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory. 1956;2:115–116. [Google Scholar]
Mukherjee G. Sparsity and shrinkage in predictive density estimation. PhD thesis, Stanford Univ. 2013 Available at http://purl.stanford.edu/gm306wz2890.
Mukherjee G, Johnstone IM. Supplement to “Exact minimax estimation of the predictive density in sparse Gaussian models”. 2015 doi: 10.1214/14-AOS1251SUPP. [DOI] [PMC free article] [PubMed] [Google Scholar]
Murray GD. A note on the estimation of probability density functions. Biometrika. 1977;64:150–152. MR0448690. [Google Scholar]
Ng VM. On the estimation of parametric density functions. Biometrika. 1980;67:505–506. MR0581751. [Google Scholar]
Pinsker MS. Optimal filtration of square-integrable signals in Gaussian noise. Probl Inf Transm. 1980;16:120–133. Originally in Russian in Problemy Peredachi Informatsii16 52–67. MR0624591. [Google Scholar]
Xu X, Liang F. Asymptotic minimax risk of predictive density estimation for non-parametric regression. Bernoulli. 2010;16:543–560. MR2668914. [Google Scholar]
Xu X, Zhou D. Empirical Bayes predictive densities for high-dimensional normal models. J Multivariate Anal. 2011;102:1417–1428. MR2819959. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement

NIHMS723163-supplement-Supplement.pdf^{(469.6KB, pdf)}

[R1] Aitchison J. Goodness of prediction fit. Biometrika. 1975;62:547–554. MR0391353. [Google Scholar]

[R2] Aitchison J, Dunsmore IR. Statistical Prediction Analysis. Cambridge Univ. Press; Cambridge: 1975. MR0408097. [Google Scholar]

[R3] Aslan M. Asymptotically minimax Bayes predictive densities. Ann Statist. 2006;34:2921–2938. MR2329473. [Google Scholar]

[R4] Barndorff-Nielsen OE, Cox DR. Prediction and asymptotics. Bernoulli. 1996;2:319–340. MR1440272. [Google Scholar]

[R5] Bell RM, Cover TM. Competitive optimality of logarithmic investment. Math Oper Res. 1980;5:161–166. MR0571810. [Google Scholar]

[R6] Brown L. Lecture notes on statistical decision theory. 1974 Available at http://www-stat.wharton.upenn.edu/~lbrown.

[R7] Brown LD, George EI, Xu X. Admissible predictive density estimation. Ann Statist. 2008;36:1156–1170. MR2418653. [Google Scholar]

[R8] Cover TM, Thomas JA. Elements of Information Theory. Wiley; New York: 1991. MR112280. [Google Scholar]

[R9] Donoho DL, Johnstone IM. Minimax risk over lp-balls for lq-error. Probab Theory Related Fields. 1994;99:277–303. [Google Scholar]

[R10] Donoho DL, Johnstone IM, Hoch JC, Stern AS. Maximum entropy and the nearly black object. J R Stat Soc Ser B Stat Methodol. 1992;54:41–81. MR1157714. [Google Scholar]

[R11] Fourdrinier D, Marchand É, Righi A, Strawderman WE. On improved predictive density estimation with parametric constraints. Electron J Stat. 2011;5:172–191. MR2792550. [Google Scholar]

[R12] Gatsonis CA. Deriving posterior distributions for a location parameter: A decision theoretic approach. Ann Statist. 1984;12:958–970. MR0751285. [Google Scholar]

[R13] Geisser S. Predictive Inference: An Introduction Monographs on Statistics and Applied Probability. Vol. 55. Chapman & Hall; New York: 1993. MR1252174. [Google Scholar]

[R14] George EI, Liang F, Xu X. Improved minimax predictive densities under Kullback–Leibler loss. Ann Statist. 2006;34:78–91. MR2275235.

[R15] George EI, Liang F, Xu X. From minimax shrinkage estimation to minimax shrinkage prediction. Statist Sci. 2012;27:82–94. MR2953497.

[R16] Ghosh M, Mergel V, Datta GS. Estimation, prediction and the Stein phenomenon under divergence loss. J Multivariate Anal. 2008;99:1941–1961. MR2466545.

[R17] Hartigan JA. The maximum likelihood prior. Ann Statist. 1998;26:2083–2103. MR1700222.

[R18] Johnstone IM. Gaussian estimation: Sequence and wavelet models. 2013. Available at http://www-stat.stanford.edu/~imj.

[R19] Komaki F. On asymptotic properties of predictive distributions. Biometrika. 1996;83:299–313. MR1439785.

[R20] Komaki F. A shrinkage predictive distribution for multivariate normal observables. Biometrika. 2001;88:859–864. MR1859415.

[R21] Komaki F. Simultaneous prediction of independent Poisson observables. Ann Statist. 2004;32:1744–1769. MR2089141.

[R22] Larimore WE. Predictive inference, sufficiency, entropy and an asymptotic likelihood principle. Biometrika. 1983;70:175–181. MR0742987.

[R23] McMillan B. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory. 1956;2:115–116.

[R24] Mukherjee G. Sparsity and shrinkage in predictive density estimation. PhD thesis, Stanford Univ. 2013. Available at http://purl.stanford.edu/gm306wz2890.

[R25] Mukherjee G, Johnstone IM. Supplement to “Exact minimax estimation of the predictive density in sparse Gaussian models”. 2015. doi: 10.1214/14-AOS1251SUPP.

[R26] Murray GD. A note on the estimation of probability density functions. Biometrika. 1977;64:150–152. MR0448690.

[R27] Ng VM. On the estimation of parametric density functions. Biometrika. 1980;67:505–506. MR0581751.

[R28] Pinsker MS. Optimal filtration of square-integrable signals in Gaussian noise. Probl Inf Transm. 1980;16:120–133. Originally in Russian in Problemy Peredachi Informatsii 16 52–67. MR0624591.

[R29] Xu X, Liang F. Asymptotic minimax risk of predictive density estimation for non-parametric regression. Bernoulli. 2010;16:543–560. MR2668914.

[R30] Xu X, Zhou D. Empirical Bayes predictive densities for high-dimensional normal models. J Multivariate Anal. 2011;102:1417–1428. MR2819959.
