Author manuscript; available in PMC: 2015 Oct 5.
Published in final edited form as: Ann Stat. 2015;43(3):937–961. doi: 10.1214/14-AOS1251

EXACT MINIMAX ESTIMATION OF THE PREDICTIVE DENSITY IN SPARSE GAUSSIAN MODELS

Gourab Mukherjee 1, Iain M Johnstone 2
PMCID: PMC4593074  NIHMSID: NIHMS723163  PMID: 26448678

Abstract

We consider estimating the predictive density under Kullback–Leibler loss in an $\ell_0$-sparse Gaussian sequence model. Explicit expressions of the first order minimax risk along with its exact constant, asymptotically least favorable priors and optimal predictive density estimates are derived. Compared to the sparse recovery results involving point estimation of the normal mean, new decision theoretic phenomena are seen. Suboptimal performance of the class of plug-in density estimates reflects the predictive nature of the problem and optimal strategies need diversification of the future risk. We find that minimax optimal strategies lie outside the Gaussian family but can be constructed with threshold predictive density estimates. Novel minimax techniques involving simultaneous calibration of the sparsity adjustment and the risk diversification mechanisms are used to design optimal predictive density estimates.

Key words and phrases: Predictive density, risk diversification, minimax, sparsity, high-dimensional, mutual information, plug-in risk, thresholding

1. Introduction

Statistical prediction analysis aims to use past data to choose a probability distribution that will be good in predicting the behavior of future samples. This well-established subject [Aitchison and Dunsmore (1975), Geisser (1993)] finds application in game theory, econometrics, information theory, machine learning, mathematical finance, etc.

In this paper we study predictive density estimation in a high-dimensional setting and, in particular, explore the consequences of sparsity assumptions on the unknown parameters.

1.1. Main results

We begin by describing some of our main results: fuller references, background and interpretation follow in Section 1.2.

We work in the simplest Gaussian model for high-dimensional prediction:

$$X \sim N_n(\theta, v_x I), \qquad Y \sim N_n(\theta, v_y I), \qquad X \perp Y \mid \theta. \tag{1}$$

On the basis of the “past” observation vector X, we seek to predict the distribution of a future observation Y. The past and future observations are independent, but are linked by the common mean parameter θ, assumed to be unknown. Note, however, that the variances, assumed here to be known, may differ. We write p(x|θ, vx) and p(y|θ, vy) for the probability densities of X and Y, respectively.

We seek estimators $\hat p(y|x)$ of the future observation density $p(y|\theta, v_y)$, and compare their performance under sparsity assumptions on θ. We recall two natural ways of generating large classes of estimators. Perhaps simplest are the “plug-in” or estimative densities: given a point estimate $\hat\theta(X)$, simply set $\hat p(y|x) = p(y|\hat\theta)$. We often use the abbreviation $p[\hat\theta]$. Second, given any prior measure $\pi(d\theta)$, proper or improper, such that the posterior $\pi(d\theta|x)$ is well defined, the Bayes predictive density is

$$\hat p_\pi(y|x) = \int p(y|\theta, v_y)\, \pi(d\theta|x). \tag{2}$$

The important case of a uniform prior measure $\pi(d\theta) = d\theta$ leads to predictive density $\hat p_U(y|x)$, easily seen to correspond to $N_n(x, (v_x + v_y)I)$.

We will examine similarities and differences between high-dimensional prediction and high-dimensional estimation. In particular, $\hat p_U(y|x)$ plays in prediction the role of the maximum likelihood estimator $\hat\theta_{MLE}(x) = x$ in the multinormal mean estimation setting. In contrast to the corresponding plug-in estimate $p[\hat\theta_{MLE}]$, the density $\hat p_U$ incorporates the variability of the location estimate, which leads to a flattening of the estimator: $v_x + v_y > v_y$.

To evaluate the performance of a predictive density estimator $\hat p(y|x)$, we use the familiar Kullback–Leibler “distance” as loss function:

$$L(\theta, \hat p(\cdot|x)) = \int p(y|\theta, v_y) \log \frac{p(y|\theta, v_y)}{\hat p(y|x)}\, dy.$$

The corresponding K–L risk function follows by averaging over the distribution of the past observation:

$$\rho(\theta, \hat p) = \int L(\theta, \hat p(\cdot|x))\, p(x|\theta, v_x)\, dx.$$
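For Gaussian estimators the loss and risk above have closed forms, which makes the contrast between the uniform-prior density $N(x, v_x + v_y)$ and the plug-in density $N(x, v_y)$ easy to check numerically. The following is a minimal per-coordinate sketch; the function names are illustrative and not taken from the paper, and it relies only on the standard formula for the K–L divergence between two univariate normals.

```python
import math

def kl_normal(mu0, var0, mu1, var1):
    """K-L divergence KL( N(mu0, var0) || N(mu1, var1) ) for univariate normals."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def risk_uniform_prior(v_x, v_y):
    """Per-coordinate K-L risk of the uniform-prior density N(x, v_x + v_y).

    The loss kl_normal(theta, v_y, x, v_x + v_y) is linear in (theta - x)^2,
    so averaging over X ~ N(theta, v_x) replaces (theta - x)^2 by v_x.
    """
    return 0.5 * (math.log((v_x + v_y) / v_y) + (v_y + v_x) / (v_x + v_y) - 1.0)

def risk_plugin_mle(v_x, v_y):
    """Per-coordinate K-L risk of the plug-in density N(x, v_y)."""
    # The loss is (theta - x)^2 / (2 v_y); its expectation is v_x / (2 v_y).
    return 0.5 * v_x / v_y

# Illustrative values only: v_x = 1 and v_y = 0.5.
print(risk_uniform_prior(1.0, 0.5), risk_plugin_mle(1.0, 0.5))
```

Neither risk depends on θ, and the uniform-prior risk reduces to $\tfrac12\log(1 + v_x/v_y)$, which is smaller than the plug-in risk $v_x/(2v_y)$ for every variance ratio.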

Given a prior measure $\pi(d\theta)$, the average or integrated risk is

$$B(\pi, \hat p) = \int \rho(\theta, \hat p)\, \pi(d\theta). \tag{3}$$

The Bayes predictive density (2) can be shown to minimize both the posterior expected loss $\int L(\theta, \hat p(\cdot|x))\, \pi(d\theta|x)$ and the integrated risk $B(\pi, \hat p)$ in the class of all density estimates. This is a general fact in statistical decision theory [Brown (1974)]. The resulting minimum is the Bayes K–L risk:

$$B(\pi) = \inf_{\hat p} B(\pi, \hat p). \tag{4}$$

Our main focus is on how to optimize the predictive risk $\rho(\theta, \hat p)$ in a high-dimensional setting under an $\ell_0$-sparsity condition on the parameter space. Thus, let $\|\theta\|_0 = \#\{i : \theta_i \neq 0\}$ and

$$\Theta_n[s] = \{\theta \in \mathbb{R}^n : \|\theta\|_0 \le s\}. \tag{5}$$

This “exact” sparsity condition has been widely used in estimation; in this paper we initiate study of its implications for predictive density estimation.

The minimax K–L risk for estimation over Θ is given by

$$R_N(\Theta) = \inf_{\hat p} \sup_{\theta \in \Theta} \rho(\theta, \hat p), \tag{6}$$

where the infimum is taken over all measurable predictive density estimators $\hat p(y|x)$. For comparison, we write $R_{\mathcal{E}}(\Theta) = \inf_{\hat\theta} \sup_{\Theta} \rho(\theta, p[\hat\theta])$ for the minimax risk restricted to the sub-class $\mathcal{E}$ of plug-in or “estimative” densities.

To state our main results, henceforth we will assume $v_x = 1$ and introduce the key parameters

$$r = v_y/v_x = v_y, \qquad v_w = (1 + r^{-1})^{-1}. \tag{7}$$

Here vw is the “oracle variance” which would be the variance of the UMVUE for θ, were both X and Y observed.

In our asymptotic model, the dimensionality n → ∞ and the sparsity s = sn may depend on n, but the variance ratio r remains fixed. The notation an ~ bn denotes an/bn → 1 as n → ∞.

Theorem 1A

Fix r ∈ (0, ∞). If $\eta_n = s_n/n \to 0$, then

$$R_N(\Theta_n[s_n]) \sim \frac{1}{1+r}\, s_n \log(n/s_n) = \frac{1}{1+r}\, n\, \eta_n \log \eta_n^{-1}. \tag{8}$$

The minimax risk is proportional to the sparsity $s_n$, with a logarithmic penalty factor. The case where $s_n \equiv s$ remains constant is included. The expression is quite analogous to that obtained for point estimation with quadratic loss, namely, $2 s_n \log(n/s_n)$ [Donoho and Johnstone (1994), Donoho et al. (1992) and Johnstone (2013), Chapter 8.8, hereafter cited as Johnstone (2013)]. However, we shall see that quite different phenomena emerge in the predictive density setting.

Indeed, the future-to-past variance ratio r is an important parameter of the predictive estimation problem. The minimax risk increases as r decreases: we need to estimate the future observation density based on increasingly noisy past observations (in relative terms, r = vy/vx), and so the difficulty of the density estimation problem increases. However, the rate of convergence with n in (8) does not depend on r, and so exact determination of the constants is needed to show the role of r in this prediction problem.

The inefficiency of plug-in estimators is an immediate consequence of Theorem 1A. Let $q(\theta, \hat\theta) = E_\theta \|\hat\theta(X) - \theta\|^2$ denote the risk of point estimator $\hat\theta$ under squared-error loss. It is straightforward to show for a plug-in density estimate $p[\hat\theta]$ that $\rho(\theta, p[\hat\theta]) = q(\theta, \hat\theta)/(2r)$. Hence, from the point estimation minimax risk just cited,

$$R_{\mathcal{E}}(\Theta_n[s_n]) \sim \frac{1}{r}\, s_n \log(n/s_n) \sim \Big(1 + \frac{1}{r}\Big) R_N(\Theta_n[s_n]).$$

The inefficiency of plug-in estimators thus equals the oracle precision,

$$1/v_w = 1 + 1/r,$$

and becomes arbitrarily large as the variance ratio r → 0.
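To get a feel for the size of this gap, the two leading-order rates can be evaluated directly. The sketch below is only an illustration of the asymptotic expressions, not a finite-sample computation; the function names and the example values of n, s and r are assumptions made for the illustration.

```python
import math

def minimax_rate(n, s, r):
    """Leading term of the minimax K-L risk in (8): s log(n/s) / (1 + r)."""
    return s * math.log(n / s) / (1.0 + r)

def plugin_rate(n, s, r):
    """Leading term of the minimax risk over plug-in densities: s log(n/s) / r."""
    return s * math.log(n / s) / r

# Illustrative values only.
n, s, r = 10**6, 100, 0.25
inefficiency = plugin_rate(n, s, r) / minimax_rate(n, s, r)
print(inefficiency)  # equals 1 + 1/r up to rounding, whatever n and s are
```

The ratio does not depend on n or s, which is why the exact constant in (8), rather than the rate, carries the dependence on r.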

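We turn now to the asymptotically least favorable priors and optimal estimators in Theorem 1A. Let $\delta_\lambda$ denote unit point mass at λ and

$$\pi[\eta, \lambda] = (1 - \eta)\delta_0 + \eta\, \delta_\lambda \tag{9}$$

be a univariate two-point prior: this is a sparse prior when η is small and λ large. Let

$$\lambda_e = \sqrt{2 \log\{\eta_n^{-1}(1 - \eta_n)\}}, \qquad \lambda_f = \sqrt{v_w}\, \lambda_e. \tag{10}$$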

In point estimation based on X, we recall that λe is essentially the threshold of detectability corresponding to sparsity ηn = sn/n. Although Y is not yet observed, we will see that in the prediction setting the UMVUE scaled threshold λf < λe plays a partly analogous role.
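As a small numerical illustration of definitions (7) and (10), the sketch below computes the oracle variance and the two thresholds for a given sparsity η and variance ratio r. The function names are hypothetical, and the example values are arbitrary.

```python
import math

def oracle_variance(r):
    """v_w = (1 + 1/r)^(-1), as in (7) with v_x = 1."""
    return 1.0 / (1.0 + 1.0 / r)

def thresholds(eta, r):
    """Return (lambda_e, lambda_f) from (10) for sparsity eta in (0, 1/2)."""
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    lam_f = math.sqrt(oracle_variance(r)) * lam_e
    return lam_e, lam_f

# Illustrative values only: 1% sparsity and equal past and future variances.
lam_e, lam_f = thresholds(0.01, 1.0)
print(lam_e, lam_f)
```

Since $v_w < 1$ for every finite r, the scaled threshold is always strictly below the detection threshold, and the gap widens as r decreases.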

Build a sparse high-dimensional prior from i.i.d. draws:

$$\pi_n^{IID}(d\theta) = \prod_{i=1}^n \pi[\eta_n, \lambda_f](d\theta_i). \tag{11}$$

If the sparsity sn increases without bound with n, then this i.i.d. prior with scale λf is asymptotically least favorable:

Theorem 1B

If $s_n \to \infty$ and $s_n/n \to 0$, then

$$B(\pi_n^{IID}) = R_N(\Theta_n[s_n])(1 + o(1)).$$

The assumption that $s_n \to \infty$ ensures that $\pi_n^{IID}$ concentrates on $\Theta_n[s_n]$, namely, that $\pi_n^{IID}(\Theta_n[s_n]) \to 1$ as $n \to \infty$. This hypothesis is not needed for Theorem 1A; indeed, a sparse prior built from “independent blocks” is asymptotically least favorable assuming only $s_n/n \to 0$. This more elaborate prior is described in Section 5.

Some of the novel aspects of the predictive density estimation problem appear in the description of optimal estimators, that is, ones that asymptotically attain the minimax bound in Theorem 1A. In point estimation, the simplest asymptotically minimax rule for sparsity $s_n$ is given by co-ordinatewise hard thresholding $\hat\theta_i(x) = x_i I\{|x_i| \ge \lambda_e\}$. For prediction, we consider the following class of univariate density estimators as analogs of hard thresholding:

$$\hat p_T(y_1|x_1) = \begin{cases} \hat p_\pi(y_1|x_1), & \text{if } |x_1| \le \lambda_e, \\ \hat p_U(y_1|x_1), & \text{if } |x_1| > \lambda_e. \end{cases} \tag{12}$$

The univariate density estimates are combined to form a multivariate predictive density estimate via a product rule

$$\hat p_T(y|x) = \prod_{i=1}^n \hat p_T(y_i|x_i). \tag{13}$$

The threshold $\lambda_e$ in (12) is that corresponding to estimation based on X at sparsity $\eta_n = s_n/n$. Above the threshold, the uniform prior predictive density $\hat p_U$ corresponds to the (unbiased) MLE. Below threshold, we shall need the flexibility of the Bayes predictive density (2). Indeed, as explained in Section 4, it does not suffice to use $\pi = \delta_0$, point mass at 0, which would be the predictive analog of thresholding to zero in point estimation.

Instead, we use a sparse univariate cluster prior $\pi = \pi_{CL}[\eta, r]$ given by

$$\pi = (1 - \eta)\delta_0 + \frac{\eta}{2K} \sum_{k=1}^K (\delta_{\mu_k} + \delta_{-\mu_k}). \tag{14}$$

The points $\mu_k = \mu_k(r)$ for $k = 1, \ldots, K$ are geometrically spaced to cover an interval $[\nu_\eta, \lambda_e + a]$ containing $[\lambda_f, \lambda_e]$, as described in more detail below. The key point is that it is necessary to “diversify” the predictive risk by introducing prior support points to cover $[-\lambda_e, -\lambda_f] \cup [\lambda_f, \lambda_e]$.

More specifically, for a parameter $a = a_\eta$ given below, let $\mu_\eta$ be the positive root of the overshoot equation

$$\mu^2 + 2a\mu = \lambda_e^2, \tag{15}$$

that occurs in sparse minimax point estimation [e.g., Johnstone (2013), equation (8.48)], and then set $\nu_\eta = \sqrt{v_w}\, \mu_\eta$: since $\mu_\eta < \lambda_e$, we have $\nu_\eta < \lambda_f$. The support points are

$$\mu_1 = \nu_\eta, \qquad \mu_{k+1} = (1 + 2r)^k \nu_\eta, \quad k \ge 1, \tag{16}$$

with K = max{k : μk ≤ λe + a}. We choose aη=2logλf.
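The construction in (15) and (16) can be sketched in a few lines. In this illustration the overshoot a is passed in by the caller rather than computed, and the value used in the example is arbitrary, chosen only to show the mechanics; the function name is hypothetical.

```python
import math

def cluster_support(eta, r, a):
    """Positive support points mu_1 <= ... <= mu_K of the cluster prior.

    eta : sparsity level in (0, 1/2)
    r   : future-to-past variance ratio
    a   : overshoot parameter (supplied by the caller; illustrative here)
    """
    lam_e = math.sqrt(2.0 * math.log((1.0 - eta) / eta))
    v_w = r / (1.0 + r)
    # Positive root of the overshoot equation mu^2 + 2 a mu = lam_e^2.
    mu_eta = -a + math.sqrt(a * a + lam_e ** 2)
    nu_eta = math.sqrt(v_w) * mu_eta
    points = []
    k = 0
    # mu_{k+1} = (1 + 2r)^k nu_eta; keep points while they stay below lam_e + a.
    while (1.0 + 2.0 * r) ** k * nu_eta <= lam_e + a:
        points.append((1.0 + 2.0 * r) ** k * nu_eta)
        k += 1
    return points

# Illustrative values only; a = 1.5 is not the paper's choice of a_eta.
points = cluster_support(eta=1e-6, r=0.3, a=1.5)
print(len(points), points)
```

The full prior places mass $1 - \eta$ at zero and $\eta/(2K)$ at each of the points returned and their negatives. Smaller r shrinks the geometric ratio $1 + 2r$ and lowers the starting point, so more points are needed to reach $\lambda_e + a$.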

Theorem 1C

Assume $\eta_n = s_n/n \to 0$. Let $\hat p_{T,CL}(y|x)$ be the product predictive threshold estimator defined by (12) and (13) using the cluster prior $\pi_{CL}[\eta_n, r]$. Then $\hat p_{T,CL}$ is asymptotically minimax:

$$\max_{\Theta_n[s_n]} \rho(\theta, \hat p_{T,CL}) = R_N(\Theta_n[s_n])(1 + o(1)).$$

Note that the number of positive support points in the cluster prior $K = K_\eta$ increases as r decreases. For any fixed η, the cluster prior contains in total $(2K_\eta + 1)$ support points. Also, for any fixed r ∈ (0, ∞) as η → 0, we have

$$K(r) = \lim_{\eta \to 0} K_\eta = \left\lceil \frac{\log(1 + r^{-1})}{2 \log(1 + 2r)} \right\rceil.$$

Thus, K(r) is a piecewise constant, right continuous function with jumps as shown in Table 1.

Table 1.

Number K(r) of positive support points in the cluster prior πCL[η, r] as r varies

r 0.1073 0.1235 0.1465 0.1826 0.2485 0.4196 >0.4196
K(r) 7 6 5 4 3 2 1
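The limiting count K(r) is easy to evaluate for a given r. The sketch below reads the bracket in the display for K(r) as a ceiling, which is an assumption; it is evaluated only at values of r away from the jump points, where a ceiling and a floor-plus-one reading give the same answer. The function name is hypothetical.

```python
import math

def limiting_support_count(r):
    """Limiting number of positive support points in the cluster prior.

    Assumes the bracket in the formula for K(r) denotes a ceiling.
    """
    ratio = math.log(1.0 + 1.0 / r) / (2.0 * math.log(1.0 + 2.0 * r))
    return math.ceil(ratio)

# Values of r chosen strictly between consecutive jump points of Table 1.
counts = [limiting_support_count(r) for r in (0.10, 0.11, 0.13, 0.15, 0.20, 0.30, 0.50)]
print(counts)
```

Each successive jump point in Table 1 is where the ratio inside the bracket crosses an integer, so one more support point is needed on each side of zero as r falls below it.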

The results presented above assume vx = 1. These results can be easily extended to the general case by noting that the minimax risk remains invariant and the scale of past observations and parameter is divided by vx.

1.2. Background and previous work

The relative entropy predictive risk $\rho(\theta, \hat p)$ measures the exponential rate of divergence of the joint likelihood ratio over a large number of independent trials [Larimore (1983)]. The minimal predictive risk estimate maximizes the expected growth rate in repeated investment scenarios [Cover and Thomas (1991), Chapters 6, 15]. In data compression, $L(\theta, \hat p(\cdot|x))$ reflects the excess average code length that we need if we use the conditional density estimate $\hat p$ instead of the true density to construct a uniquely decodable code for the data Y given the past x [McMillan (1956)]. Following Bell and Cover (1980), $\ell_0$-constrained minimax optimal predictive density estimates in our model can be used for construction of optimal predictive schemes for gambling, sports betting, portfolio selection and sparse coding [Mukherjee (2013), Chapter 1.3].

Aitchison (1975), Murray (1977) and Ng (1980) showed that in most parametric models there exist Bayes predictive density estimates which are decision theoretically better than the maximum likelihood plug-in estimate. An important issue in predictive inference has always been to compare the performance of the class of point estimation (PE) based plug-in density estimates [Barndorff-Nielsen and Cox (1996)] with that of the optimal predictive density estimate. In parameter spaces of fixed dimension, large sample attributes of the predictive risk of efficient plug-in and Bayes density estimates have been studied by Komaki (1996), Hartigan (1998) and Aslan (2006).

The high-dimensional predictive density estimation problem studied in this paper is relevant to a number of contemporary applications, including data compression, sequential investment with side information and sports betting (SM).

Analogy with point estimation

Decision theoretic parallels between predictive density estimation under Kullback–Leibler loss and point estimation under quadratic loss have been explored in our Gaussian model by George, Liang and Xu (2006), Ghosh, Mergel and Datta (2008), Komaki (2004), Xu and Zhou (2011) and George, Liang and Xu (2012). For unconstrained parameter spaces $\Theta = \mathbb{R}^n$, fundamental ideas in Gaussian point estimation theory can be extended to yield optimal predictive density estimates [Brown, George and Xu (2008), Fourdrinier et al. (2011), Komaki (2001)]. For ellipsoids, Xu and Liang (2010) established an analog of the theorem of Pinsker (1980) by proving that the class of all linear predictive density estimates [see (17)] is minimax optimal.

For sparse estimation, instead of parallels, we found contrasts. Minimax risks in the predictive density problem depend on r, but this dependence is not emphasized in the admissibility results in unrestricted spaces. As we have seen, under sparsity, construction of optimal minimax estimators requires the notion of diversification of the future risk over the interval $[\lambda_f, \lambda_e]$ in a way strongly dependent on r. Thus, efficiency of the prediction schemes depends on careful calibration of the sparsity adjustment and the risk diversification mechanisms.

1.3. Further results

Other classes of estimators

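The class $\mathcal{L}$ of linear estimates consists of Bayes rules based on conjugate product normal priors. The resulting estimators

$$\hat p_{L,\alpha} = \prod_{i=1}^n N(\alpha_i X_i, \alpha_i + r), \qquad \alpha_i \in [0, 1], \tag{17}$$

are still Gaussian but have larger variance than the future density $p(y|\theta, r) = \phi(y|\theta, r)$. We choose the name “linear” because the conjugate prior implies linearity of the posterior mean in X.

The class $\mathcal{G}$ contains all product Gaussian density estimates $p[\hat\theta, \hat d] = \prod_{i=1}^n N(\hat\theta_i, \hat d_i)$. Clearly, $\mathcal{G}$ contains both $\mathcal{L}$ and $\mathcal{E}$, the latter introduced after (6). The minimax risks $R_{\mathcal{L}}(\Theta)$ and $R_{\mathcal{G}}(\Theta)$ are defined by restricting the infimum in (6) to $\mathcal{L}$ and $\mathcal{G}$, respectively.

We have seen after Theorem 1A that $R_{\mathcal{E}}(\Theta_n[s_n]) \sim (1 + r^{-1}) R_N(\Theta_n[s_n])$. It turns out that extending $\mathcal{E}$ to $\mathcal{G}$ does not help, while, as is typical for sparse estimation, the class of linear estimators $\mathcal{L}$ performs very poorly.

Proposition 1

Fix r ∈ (0, ∞). If $s_n/n \to 0$, then

$$R_{\mathcal{L}}(\Theta_n[s_n]) = (n/2) \log(1 + r^{-1}), \qquad R_{\mathcal{L}}(\Theta_n[s_n]) / R_N(\Theta_n[s_n]) \to \infty \qquad \text{and} \qquad R_{\mathcal{G}}(\Theta_n[s_n]) \sim R_{\mathcal{E}}(\Theta_n[s_n]).$$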

Univariate prediction problem

The product structure of our high-dimensional model (1), estimators (13) and priors (11), along with concentration of measure, implies that many aspects of our multivariate results can be understood and proved through an associated univariate prediction problem.

In the univariate setting, assume that the past observation X|θ ~ N(θ, 1) and the future observation Y|θ ~ N(θ, r). Assume that X and Y are independent given θ. In addition, suppose that θ is random with distribution $\pi(d\theta)$, assumed to belong to

$$m(\eta) = \{\pi \in P(\mathbb{R}) : \pi(\theta \neq 0) \le \eta\}, \tag{18}$$

where $P(\mathbb{R})$ is the collection of all probability measures on ℝ.

A predictive density estimator $\hat p(y|x)$ is evaluated through its integrated risk $B(\pi, \hat p)$ defined at (3). The minimax risk for this univariate prediction problem is given by

$$\beta(\eta, r) := \inf_{\hat p} \sup_{\pi \in m(\eta)} B(\pi, \hat p), \tag{19}$$

and we study sparsity through the asymptotic regime η → 0. Recall definition (10) of the scaled threshold λf = λf,η.

Theorem 2
  1. Fix r ∈ (0, ∞). As η → 0,
    $$\beta(\eta, r) = \frac{1}{2r}\, \eta\, \lambda_f^2\, (1 + o(1)). \tag{20}$$
  2. An asymptotically least favorable prior is given by the two-point distribution π[η, λf(η)] of (9).

  3. An asymptotically minimax estimator is given by the thresholding construction (12) combined with sparse univariate cluster prior π = πCL [η, r] defined at (14).

1.4. Organization of the paper

The main results of the paper are multivariate, Theorems 1A, 1B and 1C. However, the main technical issues in the proofs are best handled in the univariate setting of Theorem 2, whose parts a, b and c correspond to Theorems 1A, 1B and 1C, respectively. Section 2 has an overview: it first reviews some connections between the multivariate and univariate settings, then gives heuristic derivations for the lower and upper bounds of univariate Theorem 2. Sections 3 and 4 contain the technical proofs for the lower and upper bounds on the univariate minimax risk, Theorems 2b and 2c, respectively. Together, they complete the proof of Theorem 2. Proofs of the multivariate results in Theorems 1A, 1B and 1C are completed in Section 5. This section also contains a heuristic proof of Proposition 1 whose rigorous proof is presented in the supplementary material [Mukherjee and Johnstone (2015)].

Glossary

[The notation (6)+2 refers to text 2 lines after equation (6)].

Estimators: Bayes $\hat p_\pi$ (2), Uniform prior $\hat p_U$ (2)+2, Threshold $\hat p_T$ (12), Multivariate product $\hat p(y|x)$ (13); Univariate $\hat p(y_1|x_1)$.

Classes of estimators and multivariate minimax risks: all nonlinear $N$, $R_N$ (6), estimative $\mathcal{E}$, $R_{\mathcal{E}}$ (6)+2, “linear” $\mathcal{L}$, $R_{\mathcal{L}}$ (17), Gaussian $\mathcal{G}$, $R_{\mathcal{G}}$ (17)+4.

Univariate minimax risk: β (19).

Parameter spaces: multivariate Θn[s] (5); univariate m(η) (18).

Priors: Univariate: two point π[η,λ] (9), cluster πCL[η,r] (14), Multivariate: πnIID (11).

Parameters: variance ratio r = vy/vx, oracle variance vw (7), sparsity η (9), thresholds λe, λf (10), cluster prior: overshoot a (15), νη (15)+2.

2. Proof overview and interpretation

2.1. Connections between multivariate and univariate settings

Many aspects of the multivariate theorem may be understood, and in part proved, through a discussion of the univariate prediction problem of Theorem 2. An obvious connection between the univariate and multivariate approaches runs as follows: suppose that a multivariate predictive estimator is built as a product of univariate components

$$\hat p(y|x) = \prod_{i=1}^n \hat p_1(y_i|x_i). \tag{21}$$

Suppose also that to a vector $\theta = (\theta_i)$ we associate a univariate (discrete) distribution $\pi_n^e = n^{-1} \sum_{i=1}^n \delta_{\theta_i}$. Since the true multivariate future density p(Y|θ, r) is also a product of univariate components, it is then readily seen that the multivariate and univariate Bayes K–L risks are related by

$$\rho(\theta, \hat p) = \sum_{i=1}^n \rho(\theta_i, \hat p_1) = n B(\pi_n^e, \hat p_1). \tag{22}$$

The sparsity condition $\Theta_n[s_n]$ in the multivariate problem corresponds to requiring that the prior $\pi = \pi_n^e$ in the univariate problem satisfies

$$\pi\{\theta_1 \neq 0\} \le s_n/n = \eta_n,$$

and thus belongs to the class m(η) defined in (18). Next, we outline the minimax risk calculations for the sparse predictive density estimation problem.

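As a first illustration, to which we return later, consider the maximum risk of a product rule over $\Theta_n[s_n]$: using (22) and (3), we have

$$\sup_{\Theta_n[s_n]} \rho(\theta, \hat p) = n \Big[ (1 - \eta_n) \rho(0, \hat p_1) + \eta_n \sup_\theta \rho(\theta, \hat p_1) \Big]. \tag{23}$$

In the univariate problem, using $\hat p_1$, we have the somewhat parallel bound

$$\sup_{m(\eta)} B(\pi, \hat p) = (1 - \eta) \rho(0, \hat p_1) + \eta \sup_\theta \rho(\theta, \hat p_1). \tag{24}$$

Consequently, a careful study of the two univariate quantities

$$\text{risk at zero: } \rho(0, \hat p_1), \qquad \text{maximum risk: } \sup_\theta \rho(\theta, \hat p_1) \tag{25}$$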

is basic for upper bounds for both univariate and multivariate cases.

2.2. Theorem 2B: Univariate lower bound heuristics

To understand the appearance of $\lambda_f^2$ in the minimax risks, we turn to a heuristic discussion of the lower bound, first in the univariate case.

We use the two point priors (9) and the definition (19):

$$\beta(\eta, r) \ge B(\pi[\eta, \lambda]) = (1 - \eta) \rho(0, \hat p_\pi) + \eta \rho(\lambda, \hat p_\pi) \ge \eta \rho(\lambda, \hat p_\pi), \tag{26}$$

and look for a good bound for $\rho(\lambda, \hat p_\pi)$ for a suitable choice of λ.

The key is a mixture representation for predictive risk of a Bayes estimator in terms of quadratic risk, where the weighted mixture is over noise levels $v \in [v_w, 1]$, with $v_w$ being the oracle variance (7). Brown, George and Xu (2008), Theorem 1, show that the predictive risk of the Bayes predictive density estimate $\hat p_\pi$ is

$$\rho(\theta, \hat p_\pi) = \frac{1}{2} \int_{v_w}^1 q(\theta, \hat\theta_{\pi,v}; v)\, \frac{dv}{v^2}, \tag{27}$$

where $q(\theta, \hat\theta_{\pi,v}; v) = E_\theta[\hat\theta_{\pi,v}(W) - \theta]^2$ is the quadratic risk of the Bayes location estimate $\hat\theta_{\pi,v}$ for prior π when W ~ N(θ, v). In point estimation with quadratic loss, it is known [Johnstone (2013), Chapter 8] that as η → 0 an approximately least favorable prior in the class m(η) is given, for noise level v = 1, by the sparse two-point prior $\pi[\eta, \lambda_e(\eta)]$ defined in (9), with $\lambda_e(\eta) = \sqrt{2 \log\{\eta^{-1}(1 - \eta)\}}$. This prior has the remarkable property that points $\theta \le \lambda_e$ are “invisible” in the sense that even when θ is true, the Bayes estimator $\hat\theta_\pi = \hat\theta_{\pi,1}$ effectively estimates 0 rather than θ and so makes a mean squared error

$$q(\theta, \hat\theta_\pi; 1) \sim \theta^2 \qquad \text{for } 0 \le \theta \le \lambda_e. \tag{28}$$

Two issues arise as the noise level v varies. First, the region of invisibility will scale, becoming $0 \le \theta \le \sqrt{v}\, \lambda_e$ at scale v. As v varies in $[v_w, 1]$, the intersection of all regions of invisibility will be $0 \le \theta \le \sqrt{v_w}\, \lambda_e = \lambda_f$ as defined at (10). The second issue is that for a given prior π and predictive Bayes rule $\hat p_\pi$ in (27), the Bayes rules $\hat\theta_{\pi,v}$ vary with v. We return to this second point in the next section; for now we can hope that for all $v \in [v_w, 1]$,

$$q(\lambda_f, \hat\theta_{\pi,v}; v) > \lambda_f^2, \tag{29}$$

and so, from mixture representation (27),

$$\rho(\lambda_f, \hat p_\pi) > \frac{\lambda_f^2}{2} \int_{v_w}^1 \frac{dv}{v^2} = \frac{\lambda_f^2}{2r},$$

since the integral evaluates to $v_w^{-1} - 1 = r^{-1}$. From this we can conjecture that for $\pi = \pi[\eta, \lambda_f] \in m(\eta)$,

$$B(\pi) > \eta \rho(\lambda_f, \hat p_\pi) > \frac{\eta \lambda_f^2}{2r}. \tag{30}$$

A full proof, with slightly modified definitions, is given in Section 3.
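For completeness, the one-line integral behind the constant 1/(2r) in this heuristic can be written out, using only the definition (7) of the oracle variance:

```latex
\int_{v_w}^{1} \frac{dv}{v^{2}}
  = \Big[ -\frac{1}{v} \Big]_{v_w}^{1}
  = \frac{1}{v_w} - 1
  = (1 + r^{-1}) - 1
  = \frac{1}{r},
\qquad \text{so that} \qquad
\frac{\lambda_f^{2}}{2} \int_{v_w}^{1} \frac{dv}{v^{2}} = \frac{\lambda_f^{2}}{2r}.
```

The lower limit $v_w$ is what ties the predictive constant to the variance ratio: as r decreases, $v_w$ shrinks and the weight $v^{-2}$ accumulates more mass near the lower end of the range.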

2.3. Theorem 2C: Univariate upper bound heuristics

We now turn to a heuristic discussion of constructing a density estimate to show that the lower bound (30) is asymptotically correct. Pursuing the analogy with point estimation, we know that in that setting optimal estimators can be found within the family of hard thresholding rules $\hat\theta(x) = x I\{|x| > \lambda\}$. The natural analog for predictive density estimation would have the form

$$\hat p_{T,\pi_0}[\lambda](y|x) = \begin{cases} \hat p_U(y|x), & |x| > \lambda, \\ \hat p_{\pi_0}(y|x), & |x| \le \lambda. \end{cases} \tag{31}$$

To see this, note that $\hat p_U$ is the predictive Bayes rule corresponding to the uniform prior $\pi(d\theta) = d\theta$, which leads to the MLE $\hat\theta(x) = x$ in point estimation, while $\hat p_{\pi_0}(y|x)$ denotes the predictive Bayes rule corresponding to a prior concentrated entirely at 0, so that

$$\hat p_{\pi_0}(y|x) = \phi(y|0, r) \tag{32}$$

is a normal density with mean zero and variance r.

For the upper bound, according to definition (19), we seek an estimator $\hat p_1$ for which $\sup_{m(\eta)} B(\pi, \hat p_1) \sim \eta \lambda_f^2/(2r)$ as η → 0. In bound (24), the first component is the risk at zero, $\rho(0, \hat p_1)$, and it turns out that this determines the possible values of the threshold λ in (31). Thus, in order that

$$\rho(0, \hat p_{T,\pi_0}[\lambda]) = o(\eta \lambda_f^2),$$

it follows [see (51)] that the threshold λ should be chosen as $\lambda = \lambda_e \sim (2 \log \eta^{-1})^{1/2}$ and not smaller.

Turning to the second part of (25), we seek an estimator $\hat p_1$ with

$$\sup_\theta \rho(\theta, \hat p_1) = \frac{\lambda_f^2}{2r}(1 + o(1)). \tag{33}$$

We first argue that the hard thresholding analog (31) cannot work. Decompose the predictive risk of a univariate threshold estimator $\hat p_T$ with threshold $\lambda_e$ into contributions due to X above and below the threshold

$$\rho(\theta, \hat p_T) = E_\theta L(\theta, \hat p(\cdot|X)) = E_\theta[L(\theta, \hat p_U(\cdot|X)), |X| > \lambda_e] + E_\theta[L(\theta, \hat p_\pi(\cdot|X)), |X| \le \lambda_e] = \rho_A(\theta) + \rho_B(\theta), \tag{34}$$

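say. With the “zero prior,” the K–L loss is just quadratic in θ,

$$L(\theta, \hat p_{\pi_0}(\cdot|X)) = E_\theta \log \frac{\phi(Y|\theta, r)}{\phi(Y|0, r)} = \frac{\theta^2}{2r},$$

and so, in particular, for $\theta \le \lambda_e$ we see that

$$\rho(\theta, \hat p_{T,\pi_0}) \ge \rho_B(\theta) > \frac{\theta^2}{2r} P_\theta[|X| \le \lambda_e] \tag{35}$$

could be as large as $\lambda_e^2/(2r)$, and hence larger than our target risk $\lambda_f^2/(2r)$.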

Bearing in mind the role that two-point priors play in the lower bound, it is perhaps natural to ask next if the threshold rule $\hat p_{T,LF}$ with $\pi_0$ in (31) replaced by the (symmetrized) two-point prior $\pi[\eta, \lambda_f]$ could cut off the growth of the quadratic $\theta^2/(2r)$ for $|\theta| \ge \lambda_f$. The 3-point prior $\pi_3[\eta, \lambda_f] \in m(\eta)$ places probability η/2 at each of the two nonzero atoms $\pm\lambda_f$. Remarks in Section 3 show that $\pi_3[\eta, \lambda_f]$ is also asymptotically least favorable for the univariate prediction problem as η → 0. Indeed, it can be shown (see Section 4) that for this prior and for $\lambda_f \le |\theta| \le \lambda_e$,

$$\rho(\theta, \hat p_{T,LF}) \sim \rho_B(\theta) \le \frac{1}{2r} \big\{ \lambda_f^2 - (|\theta| - \lambda_f)[(1 + 2r)\lambda_f - |\theta|] \big\} + o(\lambda_f^2). \tag{36}$$

Consequently, the risk bound dips below $\lambda_f^2/(2r)$ for $\lambda_f \le |\theta| \le (1 + 2r)\lambda_f$ but increases thereafter. So, $\hat p_{T,LF}$ is minimax optimal if $\lambda_e < (1 + 2r)\lambda_f$, which occurs if r is sufficiently large, r > 0.4196 in Table 1. However, the upper bound exceeds our target risk $\lambda_f^2/(2r)$ if r ≤ 0.4196. Section S.2 of the supplementary material [Mukherjee and Johnstone (2015)] shows rigorously that $\hat p_{T,LF}$ is indeed minimax suboptimal for low values of r.
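The critical value of r can be recovered numerically. Since $\lambda_f = \sqrt{v_w}\, \lambda_e$ with $v_w = r/(1+r)$, the condition $\lambda_e < (1 + 2r)\lambda_f$ is equivalent to $(1 + 2r)^2 r/(1 + r) > 1$. The sketch below solves for the boundary by bisection; the function names, bracket and tolerance are choices made for this illustration.

```python
def threshold_gap(r):
    """(1 + 2r)^2 * v_w - 1; positive exactly when lambda_e < (1 + 2r) * lambda_f."""
    v_w = r / (1.0 + r)
    return (1.0 + 2.0 * r) ** 2 * v_w - 1.0

def critical_ratio(lo=0.1, hi=1.0, tol=1e-10):
    """Bisection for the root of threshold_gap, which is increasing in r."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if threshold_gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

r_star = critical_ratio()
print(round(r_star, 4))
```

Clearing the denominator shows that the boundary is the positive root of the cubic $4r^3 + 4r^2 - 1 = 0$, which agrees with the value 0.4196 quoted in the text to four decimals.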

As $\pi_3[\eta, \lambda_f]$ fails to produce minimax optimal density estimates, the strategy then is to introduce extra support points $|\mu_k| \le \lambda_e$ into the prior, chosen to “pull down” the risk $\rho_B(\theta) = E_\theta[L(\theta, \hat p_\pi(\cdot|X)), |X| \le \lambda_e]$ below $\lambda_f^2/(2r)$ whenever it would otherwise exceed this level. The schematic diagram in Figure 1 illustrates this bounding of the maximum risk. The extra support points added in $[\lambda_f, \lambda_e]$ and $[-\lambda_e, -\lambda_f]$ distribute the predictive risk across that range (“risk diversification”) and keep the maximum risk below $\lambda_f^2/(2r)(1 + o(1))$.

Fig. 1.

Schematic diagram of the risk of univariate threshold density estimates for θ ≥ 0. The dotted line is the risk of density estimator $\hat p_{T,LF}$ based on the 3-point prior $\pi_3[\eta, \lambda_f]$. The addition of appropriately spaced prior mass points (shown in red) up to $\lambda_e$ pulls down the risk function of the cluster prior-based density estimate $\hat p_{T,CL}$ below $\lambda_f^2/(2r)$ until the effect of thresholding at $\lambda_e$ takes over.

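To prove that this works, we obtain upper bounds on $\rho_B(\theta)$ for $\hat p_{T,CL}$ by focusing, when $\theta \in [\mu_k, \mu_{k+1}]$, only on the prior support point $\mu_k$. The main inequality is obtained in (50), namely,

$$\rho_B(\theta) \le \frac{1}{2r} \big[ \lambda_f^2 + \min_k q_k(\theta) \big] + o(\lambda_f^2),$$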

where qk(θ) is a quadratic polynomial that is O(λf) on [μk, μk+1]. Putting together this and other bounds, we can then finally establish the uniform bound (33). The details are in Section 4.

3. Theorem 2B: Univariate lower bound proof

This section is devoted to a proof of the lower bound part of Theorem 2. The heuristic discussion of the last section indicated the importance of two-point sparse priors and the invisibility property (28). To formulate a precise statement about the upper limit of invisibility, we start with noise level 1 and bring in the positive solution $\mu_\eta$ of the overshoot equation (15), namely, $\mu^2 + 2a\mu = \lambda_e^2$. Here the “overshoot” parameter $a = a_\eta$ should satisfy both $a_\eta \to \infty$ and $a_\eta = o(\mu_\eta)$; we make the specific choice $a_\eta = 2 \log \lambda_{f,\eta}$.

In preparation for the range of variance scales in mixture representation (27), we consider the collection of two-point priors π[η, μ] for $0 \le \mu \le \mu_\eta$. Using a temporary notation for this section, let $\hat\theta_\mu(x) = E[\theta|x]$ be the Bayes rule for squared error loss for the prior π[η, μ]. The next result shows that when the true parameter is actually μ, and this nonzero support point satisfies $\mu \le \mu_\eta$, then the Bayes rule for π[η, μ] “gets it wrong” by effectively estimating 0 and making an error of size $\mu^2$, uniformly in $\mu \le \mu_\eta$.

Lemma 3

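There exists $\varepsilon_\eta \searrow 0$ as η → 0 such that for all μ in $[0, \mu_\eta]$,

$$q(\mu, \hat\theta_\mu; 1) \ge \mu^2 [1 - \varepsilon_\eta].$$

Proof

Using standard calculations for the two-point prior, the Bayes rule is $\hat\theta_\mu = \mu\, p(\mu|x) = \mu/[1 + m(x)]$, with

$$m(x) = \frac{p(0|x)}{p(\mu|x)} = \frac{1 - \eta}{\eta} \frac{\phi(x)}{\phi(x - \mu)} = \exp\big\{ \tfrac{1}{2}\lambda_e^2 - x\mu + \tfrac{1}{2}\mu^2 \big\}. \tag{37}$$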
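The posterior odds m(x) in (37) and the resulting Bayes rule are simple enough to check numerically. The sketch below computes m(x) both from the density ratio and from the closed-form exponential, using $\lambda_e^2 = 2\log\{(1-\eta)/\eta\}$; the function names and example values are illustrative only.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def odds_from_densities(x, eta, mu):
    """m(x) = ((1 - eta) / eta) * phi(x) / phi(x - mu)."""
    return (1.0 - eta) / eta * phi(x) / phi(x - mu)

def odds_closed_form(x, eta, mu):
    """m(x) = exp(lambda_e^2 / 2 - x mu + mu^2 / 2)."""
    lam_e_sq = 2.0 * math.log((1.0 - eta) / eta)
    return math.exp(0.5 * lam_e_sq - x * mu + 0.5 * mu * mu)

def bayes_rule(x, eta, mu):
    """Posterior mean mu / (1 + m(x)) under the two-point prior pi[eta, mu]."""
    return mu / (1.0 + odds_from_densities(x, eta, mu))

# Illustrative values only: an observation below the nonzero atom.
print(bayes_rule(2.0, 0.01, 3.0))
```

With a sparse prior the odds m(x) stay large until x is well above μ, so the posterior mean remains close to 0 for observations near μ; this is the “invisibility” that the lemma quantifies.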

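Consequently,

$$q(\mu, \hat\theta_\mu; 1) = E_\mu[\hat\theta_\mu - \mu]^2 = \mu^2 E_\mu[(1 + m(X))^{-1} - 1]^2 = \mu^2 E_0[1 + m^{-1}(\mu + Z)]^{-2},$$

where Z ~ N(0, 1), and from (37), $m^{-1}(\mu + z) = \exp\{\tfrac{1}{2}(\mu^2 + 2\mu z - \lambda_e^2)\}$.

Now, using definition (15) of $\mu_\eta$, for $0 \le \mu \le \mu_\eta$, we have

$$\mu^2 + 2\mu z - \lambda_e^2 \le \mu_\eta^2 + 2\mu_\eta z_+ - \lambda_e^2 = -2\mu_\eta(a - z_+),$$

so that for $0 \le \mu \le \mu_\eta$,

$$\mu^{-2} q(\mu, \hat\theta_\mu; 1) \ge E_0\big\{ [1 + \exp(-\mu_\eta(a - Z_+))]^{-2}, Z < a \big\} = 1 - \varepsilon_\eta,$$

say. For each fixed z, we have $\mu_\eta(a - z_+) \to \infty$ since $a \to \infty$, and so from the dominated convergence theorem we conclude that $\varepsilon_\eta \to 0$. □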

With these preparations, we return to the lower bound in the prediction problem. As η → 0, an asymptotically least favorable distribution is given by a sparse two-point prior with the nonzero support point scaled using the oracle standard deviation $v_w^{1/2}$. We shall prove the following:

Lemma 4

Let $\mu_\eta$ be the positive solution to overshoot equation (15) with $a_\eta = 2 \log \lambda_{f,\eta}$. Set $\nu_\eta = v_w^{1/2} \mu_\eta$ and consider the two-point prior $\pi[\eta, \nu_\eta]$. Then as η → 0,

$$\beta(\eta, r) \ge B(\pi[\eta, \nu_\eta]) \ge \frac{\eta \lambda_f^2}{2r}(1 + o(1)).$$

We note here that since $a_\eta = o(\mu_\eta)$, the overshoot equation implies that

$$\mu_\eta \sim \lambda_{e,\eta} \quad \text{and} \quad \nu_\eta \sim \lambda_{f,\eta}. \tag{38}$$

A stronger conclusion, used in the next section, also follows from the overshoot equation, namely,

$$\lambda_{f,\eta}^2 - \nu_\eta^2 = v_w(\lambda_{e,\eta}^2 - \mu_\eta^2) = v_w\, 2a\mu_\eta \le 2a\, v_w \lambda_{e,\eta} = 2a \sqrt{v_w}\, \lambda_{f,\eta}. \tag{39}$$

Proof of Lemma 4

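Recall (26) and (27) in the heuristic discussion. We now clarify the dependence on scale v of the Bayes rule $\hat\theta_{\pi,v}$ in the mixture representation (27). Passing from noise level v to noise level 1 by dividing parameters and estimates by $v^{1/2}$, we obtain the invariance relation

$$q(\theta, \hat\theta_{\pi[\eta,\lambda],v}; v) = v\, q(v^{-1/2}\theta, \hat\theta_{\pi[\eta, v^{-1/2}\lambda]}; 1).$$

Now set $\theta = \nu_\eta$ and substitute into (27) to obtain, for $\pi = \pi[\eta, \nu_\eta]$,

$$\rho(\nu_\eta, \hat p_\pi) = \frac{1}{2} \int_{v_w}^1 q(v^{-1/2}\nu_\eta, \hat\theta_{\pi[\eta, v^{-1/2}\nu_\eta]}; 1)\, \frac{dv}{v}. \tag{40}$$

Now apply Lemma 3 with $\mu = v^{-1/2}\nu_\eta$ being bounded above by $v_w^{-1/2}\nu_\eta = \mu_\eta$. For all $v \in [v_w, 1]$ we obtain

$$q(v^{-1/2}\nu_\eta, \hat\theta_{v^{-1/2}\nu_\eta}; 1) \ge v^{-1}\nu_\eta^2 [1 - \varepsilon_\eta].$$

Putting this into the mixture representation, we get

$$\rho(\nu_\eta, \hat p_\pi) \ge \frac{1}{2}\nu_\eta^2 [1 - \varepsilon_\eta] \int_{v_w}^1 \frac{dv}{v^2} = \frac{\nu_\eta^2}{2r}[1 - \varepsilon_\eta].$$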

Taking into account both (26) and (38), we have established the lemma.

Based on the discussion in Section 2, the above lemma establishes a lower bound on the asymptotic minimax risk β(η, r) in Theorem 2. Similarly, the symmetric 3-point prior

$$\pi_3[\eta, \nu_\eta] = (1 - \eta)\delta_0 + (\eta/2)\{\delta_{\nu_\eta} + \delta_{-\nu_\eta}\}$$

will also be asymptotically least favorable over m (η) as η → 0.

4. Theorem 2C: Univariate upper bound proof

The upper bound on the predictive minimax risk β(η, r) is derived from the upper bound on the maximum Bayes risk of $\hat p_{T,CL}$ over m(η). In this section we will prove the following lemma which along with Lemma 4 completes the proof of Theorem 2.

Lemma 5

For any r ∈ (0, ∞) we have, as η → 0,

$$\sup_{\pi \in m(\eta)} B(\pi, \hat p_{T,CL}) \le \frac{\eta \lambda_f^2}{2r}(1 + o(1)).$$

We consider a threshold predictive density estimate $\hat p_T$ which uses the Bayes predictive density estimate from prior π below the threshold $\lambda_e$ and $\hat p_U$ above the threshold $\lambda_e$. We bound the maximum predictive risk over m(η):

$$\sup_{\pi \in m(\eta)} B(\pi, \hat p_T) \le (1 - \eta)\rho(0, \hat p_T) + \eta \sup_\theta \rho(\theta, \hat p_T). \tag{41}$$

Next, as in (34), we decompose the predictive risk of $\hat p_T$ into contributions due to X above and below the threshold. We calculate explicit expressions for $\rho_A$ and $\rho_B$. The predictive loss of $\hat p_U$ (see Appendix A.2) is given by

$$L(\theta, \hat p_U(\cdot|x)) = a_{1r} + a_{2r}(\theta - x)^2 \tag{42}$$

with $a_{1r} = \tfrac{1}{2}[\log(1 + r^{-1}) - (1 + r)^{-1}]$ and $a_{2r} = \tfrac{1}{2}(1 + r)^{-1}$. Hence, the above-threshold term is

$$\rho_A(\theta) = a_{1r} P_\theta(|X| > \lambda_e) + a_{2r} E_\theta[(X - \theta)^2, |X| > \lambda_e]. \tag{43}$$
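The constants in (42) can be checked against the standard expression for the K–L divergence between two univariate normal densities. A short derivation, offered as a sanity check rather than a substitute for Appendix A.2, with the true density $N(\theta, r)$ and the estimate $N(x, 1 + r)$:

```latex
L(\theta, \hat p_U(\cdot|x))
  = \frac{1}{2}\Big[ \log\frac{1+r}{r} + \frac{r}{1+r} - 1 + \frac{(\theta - x)^2}{1+r} \Big]
  = \underbrace{\frac{1}{2}\Big[ \log(1 + r^{-1}) - \frac{1}{1+r} \Big]}_{a_{1r}}
    + \underbrace{\frac{1}{2(1+r)}}_{a_{2r}} (\theta - x)^2 .
```

Averaging over $X \sim N(\theta, 1)$ without any truncation gives $a_{1r} + a_{2r} = \tfrac{1}{2}\log(1 + r^{-1})$, the constant risk of $\hat p_U$, which is an upper bound for the truncated term $\rho_A(\theta)$.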

As $\rho_B(\theta)$ depends on the prior π used below the threshold, we restrict our attention to the specific choice of the cluster prior. The risk functions of the hard threshold density estimate $\hat p_{T,\pi_0}$ and that of $\hat p_{T,LF}$ can be easily derived from the calculations with the cluster prior.

According to (58) in the Appendix, the Bayes predictive density for a discrete prior $\pi = \sum_{k=-K}^K \pi_k \delta_{\mu_k}$ is given by

$$\hat p_\pi(y|x) = \sum_{k=-K}^K \phi(y|\mu_k, r)\, \pi_k \phi(x - \mu_k)/m(x), \tag{44}$$

where $m(x) = \sum_k \pi_k \phi(x - \mu_k)$ denotes the marginal density of π. The K–L loss of $\hat p_\pi(\cdot|x)$ is given by

$$L(\theta, \hat p_\pi(\cdot|x)) = E_\theta \log \frac{\phi(Y|\theta, r)}{\hat p_\pi(Y|x)}.$$
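Formula (44) says that the Bayes predictive density for a discrete prior is a finite mixture of $N(\mu_k, r)$ densities, weighted by the posterior probabilities of the atoms given x. A minimal sketch follows; the function names and the example atoms and weights are assumptions made for the illustration and are not the cluster prior of (14).

```python
import math

def normal_pdf(z, mean, var):
    """Density of N(mean, var) at z."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayes_predictive_density(y, x, atoms, weights, r):
    """Evaluate (44): sum_k phi(y | mu_k, r) * pi_k * phi(x - mu_k) / m(x)."""
    # Unnormalized posterior weights pi_k * phi(x - mu_k); their sum is m(x).
    post = [w * normal_pdf(x, mu, 1.0) for mu, w in zip(atoms, weights)]
    m_x = sum(post)
    return sum(p / m_x * normal_pdf(y, mu, r) for mu, p in zip(atoms, post))

# Illustrative symmetric three-point prior with 10% mass away from zero.
atoms, weights = [-2.0, 0.0, 2.0], [0.05, 0.9, 0.05]
print(bayes_predictive_density(0.5, 1.0, atoms, weights, r=0.5))
```

Retaining a single term of this mixture can only decrease the density, which is the mechanism behind the upper bound on the loss derived next.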

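A simple but informative upper bound for the K–L loss is obtained by retaining only the kth term in (44):

$$L(\theta, \hat p_\pi(\cdot|x)) \le E_\theta \log \frac{\phi(Y|\theta, r)}{\phi(Y|\mu_k, r)} - \log \frac{\pi_k \phi(x - \mu_k)}{\pi_0 \phi(x)} + \log \frac{m(x)}{\pi_0 \phi(x)} = \frac{1}{2r}(\theta - \mu_k)^2 + \frac{1}{2}(\mu_k^2 - 2x\mu_k) - \log \frac{\pi_k}{\pi_0} + d(x) \tag{45}$$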

where we have set d(x) = log[m(x)/(π0ϕ(x))].

We are now ready to analyze the bound (41). We follow the steps recalled in the quadratic loss case [see Section S.4 of Mukherjee and Johnstone (2015)] and evaluate the predictive risk at the origin and the maximum risk of the threshold density estimate p^T. This organization helps to make clear the new features of the predictive loss setting.

Risk at zero

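It is easy to show that $\rho(0, \hat p_T) = O(\eta\lambda_f)$. First, from (43), we have

$$\rho_A(0) = 2a_{1r}\tilde\Phi(\lambda_e) + a_{2r}\, q_A(0) = O(\eta\lambda_f),$$

where $q_A(0)$ is defined in (S.4.2), $\tilde\Phi = 1 - \Phi$ is the standard Gaussian upper tail probability, and the calculation uses $\tilde\Phi(\lambda_e) \le \lambda_e^{-1}\phi(\lambda_e) = O(\lambda_e^{-1}\eta)$ together with the quadratic risk-at-zero bound (S.4.4).

For the below-threshold term, we set $k = 0$ in (45), note that $\mu_0 = 0$ and apply Jensen's inequality to obtain

$$\rho_B(0) = E_0[L(0, \hat p_\pi(\cdot|X)), |X| \le \lambda_e] \le E_0[d(X)] \le \log E_0[m(X)/(\pi_0\phi(X))].$$

Since $E_0[m(X)/\phi(X)] = \int m(x)\,dx = 1$ and $\pi_0 = 1 - \eta$, we obtain

$$\rho_B(0) \le -\log(1-\eta) \sim \eta.$$

Consequently, $\rho_B(0) = O(\eta)$ and so $\rho(0, \hat p_{T,CL}) = O(\eta\lambda_f)$. Note that the above calculations hold for any $\hat p_{T,\pi}$ with $\pi$ a discrete prior in $\mathfrak{m}(\eta)$.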

Maximum risk

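From decomposition (41), our goal is to show that

$$\sup_\theta \rho(\theta, \hat p_{T,CL}) = (2r)^{-1}\lambda_f^2(1+o(1)). \tag{46}$$

We first isolate the main term in the contributions from $\rho_A(\theta)$ and $\rho_B(\theta)$. From (43), clearly $\rho_A(\theta) \le a_{1r} + a_{2r} = O(1)$, which does not contribute. We turn to

$$\rho_B(\theta) = E_\theta[L(\theta, \hat p_\pi(\cdot|X)), |X| \le \lambda_e]$$

and, returning to (45), we begin by claiming that for $|x| \le \lambda_e$ the final term satisfies $d(x) \le \log 2$. Indeed,

$$\frac{m(x)}{\pi_0\phi(x)} = 1 + \sum_{|k|=1}^{K} \frac{\pi_k}{\pi_0}\frac{\phi(x-\mu_k)}{\phi(x)} = 1 + \sum_{|k|=1}^{K} \frac{\pi_k}{\pi_0}\exp\{x\mu_k - \mu_k^2/2\}. \tag{47}$$

For $|x| \le \lambda_e$, we have

$$x\mu_k - \mu_k^2/2 \le \lambda_e|\mu_k| - \mu_k^2/2 \le \lambda_e^2/2 = \log \eta^{-1}(1-\eta).$$

Since $\pi_0 = 1 - \eta$ and the nonzero atoms carry total mass $\eta$, each summand in (47) is at most $\pi_k/\eta$, so the sum is at most 1 and we arrive at

$$E_\theta[d(X), |X| \le \lambda_e] \le \log 2. \tag{48}$$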
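The bound $d(x) \le \log 2$ for $|x| \le \lambda_e$ does not depend on where the nonzero atoms sit, only on the weights $\pi_0 = 1-\eta$, $\pi_k = \eta/(2K)$ and on $\lambda_e^2 = 2\log((1-\eta)/\eta)$. A minimal sketch that evaluates $d(x)$ through (47) on a grid, with arbitrarily chosen (hypothetical) support points:

```python
import math

def lambda_e(eta):
    # lambda_e^2 = 2 log((1 - eta) / eta), as used in the proof
    return math.sqrt(2.0 * math.log((1.0 - eta) / eta))

def d_cluster(x, eta, mus_pos):
    """d(x) = log[m(x) / (pi_0 phi(x))] computed from (47).

    The prior puts mass 1 - eta at 0 and eta / (2K) at each of +mu and -mu
    for mu in mus_pos (K = len(mus_pos)).
    """
    K = len(mus_pos)
    ratio = (eta / (2.0 * K)) / (1.0 - eta)  # pi_k / pi_0
    total = 1.0
    for mu in mus_pos:
        for atom in (mu, -mu):
            total += ratio * math.exp(x * atom - atom * atom / 2.0)
    return math.log(total)
```

Scanning `x` over `[-lambda_e(eta), lambda_e(eta)]` should never produce a value above `math.log(2)`, whatever positive atoms are supplied.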

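The dependence of (45) on $\theta$ may then be seen by writing $x = \theta + z$. The first two terms in (45) then take the form

$$\frac{1}{2r}\{[\theta - (1+r)\mu_k]^2 - (r^2+r)\mu_k^2\} - \mu_k z,$$

while, after recalling that $\pi_k = \eta/(2K)$ and that $\lambda_e^2 = 2\log(1-\eta)\eta^{-1}$, the third term becomes

$$\frac{1}{2}\lambda_e^2 + \log(2K) = \frac{1}{2r}(1+r)\lambda_f^2 + \log(2K).$$

We may therefore rewrite (45) as

$$L(\theta, \hat p_\pi(\cdot|x)) \le \frac{1}{2r}[\lambda_f^2 + q_k(\theta)] - \mu_k(x-\theta) + \log(2K) + d(x), \tag{49}$$

where the $k$th quadratic polynomial is

$$q_k(\theta) = [\theta - (1+r)\mu_k]^2 - r^2\mu_k^2 + r(\lambda_f^2 - \mu_k^2).$$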
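The passage from (45) to (49) is pure algebra, so it can be confirmed numerically. The sketch below evaluates both sides without the common term $d(x)$, assuming $v_w = r/(1+r)$ so that $\lambda_f^2 = r\lambda_e^2/(1+r)$ (this is what the identity $\frac{1}{2}\lambda_e^2 = \frac{1}{2r}(1+r)\lambda_f^2$ requires). Function names are illustrative.

```python
import math

def q_k(theta, mu, r, lam_f):
    # k-th quadratic polynomial appearing in (49)
    return (theta - (1.0 + r) * mu) ** 2 - r * r * mu * mu + r * (lam_f ** 2 - mu * mu)

def terms_45(theta, x, mu, r, eta, K):
    # First three terms on the right of (45), with pi_k = eta/(2K), pi_0 = 1 - eta
    return ((theta - mu) ** 2 / (2.0 * r)
            + 0.5 * (mu * mu - 2.0 * x * mu)
            - math.log((eta / (2.0 * K)) / (1.0 - eta)))

def terms_49(theta, x, mu, r, eta, K):
    # First three terms on the right of (49)
    lam_e_sq = 2.0 * math.log((1.0 - eta) / eta)
    lam_f = math.sqrt(r / (1.0 + r) * lam_e_sq)
    return ((lam_f ** 2 + q_k(theta, mu, r, lam_f)) / (2.0 * r)
            - mu * (x - theta) + math.log(2.0 * K))
```

The two functions should return the same value for any admissible arguments.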

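Denote the last three terms of (49) by $J_k(x, \theta)$. From (16) and (48) we see that

$$E_\theta[J_k, |X| \le \lambda_e] \le \mu_k + \log(2K) + \log 2 \le \lambda_e + a + \log(4K) = o(\lambda_f^2).$$

Consequently, we obtain the key bound

$$\rho_B(\theta) \le \frac{1}{2r}[\lambda_f^2 + \min_k q_k(\theta)] + o(\lambda_f^2). \tag{50}$$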

Now we use the geometric structure of the support points $\mu_k$, defined at (16). We bound $\min_k q_k(\theta)$ above by considering the quadratic polynomial $q_k(\theta)$ on $I_k = [\mu_k, \mu_{k+1}]$ and observe that these $2K$ intervals cover the range $(-\lambda_e - a, -\lambda_f) \cup (\lambda_f, \lambda_e + a)$ of interest. See Figure 2. Note that $q_k(\theta)$ achieves its maximum on $I_k$ at both endpoints and that

$$q_k(\mu_{k+1}) = q_k((1+2r)\mu_k) = q_k(\mu_k) = r(\lambda_f^2 - \mu_k^2).$$

Fig. 2. Schematic diagram demonstrating the behavior of the quadratic polynomials $q_k(\theta)$ in the interval $[\mu_1, \mu_{K+1}]$. Here $K = 4$. The maximum of $\min_k q_k(\theta)$ for $\theta \in [\mu_1, \mu_{K+1}]$ is bounded by $q_1(\mu_1)$.

These maxima decrease with $k$ and so are bounded by $q_1(\nu_\eta) = r(\lambda_f^2 - \nu_\eta^2)$. Appealing now to bound (39), we have for $\lambda_f < |\theta| < \lambda_e + a$,

$$\min_k q_k(\theta) \le r(\lambda_f^2 - \nu_\eta^2) \le 2r\sqrt{v_w}\, a\lambda_f.$$

Returning to (50), we now see that the last two terms are each $o(\lambda_f^2)$ and so the final bound (46) is proven. This completes the proof of Lemma 5.
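The geometric argument can be illustrated with a short sketch. It assumes only the recursion $\mu_{k+1} = (1+2r)\mu_k$ implied by the endpoint identity above; the starting point `mu1` and the value of `lam_f` in the usage note are arbitrary illustrative inputs rather than the calibrated values from (16).

```python
def cluster_support(mu1, r, K):
    """Return [mu_1, ..., mu_{K+1}] with mu_{k+1} = (1 + 2r) mu_k."""
    return [mu1 * (1.0 + 2.0 * r) ** j for j in range(K + 1)]

def q_k(theta, mu, r, lam_f):
    # k-th quadratic polynomial from (49)
    return (theta - (1.0 + r) * mu) ** 2 - r * r * mu * mu + r * (lam_f ** 2 - mu * mu)

def min_q(theta, mus, r, lam_f):
    """min_k q_k(theta) over the K atoms mu_1..mu_K (mus[-1] is only an endpoint)."""
    return min(q_k(theta, mu, r, lam_f) for mu in mus[:-1])
```

With `mus = cluster_support(2.0, 0.25, 4)` and `lam_f = 2.5`, each `q_k` takes the value `r * (lam_f**2 - mu_k**2)` at both ends of its interval, and `min_q` stays below `q_k(mus[0], mus[0], ...)` across `[mus[0], mus[-1]]`, which mirrors the picture in Figure 2.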

These calculations apply to threshold density estimates based on Bayes estimates of discrete priors. In particular, for $\hat p_{T,LF}$, which is based on the 3-point prior $\pi_3[\eta, \nu_\eta]$, we have $K = 1$ and the bound (36). Thus, the difference $\rho_B(\theta) - \lambda_f^2/(2r)$ in this case is negligible when $|\theta| \le \mu_2$.

Similarly, the asymptotic risk function of the hard threshold plug-in density estimate $\hat p_{T,\pi_0}$ (for which $K = 0$ in our calculations above) exceeds the minimax risk $\beta(\eta, r)$ for $|\theta| \in [\lambda_f, \lambda_e]$ and so is minimax suboptimal for any fixed $r$. Figure 3 shows the numerical evaluation of the risk functions for the different univariate threshold density estimates.

Fig. 3. Numerical evaluation of the asymptotic risk $\rho_B(\theta)$ for $r = 0.25$ of univariate threshold density estimates: the hard threshold plug-in estimate $\hat p_{T,\pi_0}$ (red), $\hat p_{T,LF}$ (green) and the cluster prior-based minimax optimal estimate $\hat p_{T,CL}$ (blue). The brown boxes show the nonzero support points of the cluster prior. The univariate asymptotic minimax risk $\beta(\eta, r) = (2r)^{-1}\lambda_f^2$ and the threshold $\lambda_e$ are denoted by dotted horizontal and vertical lines, respectively. The left plot has $\eta = e^{-20}$ (very high sparsity), $\lambda_f = 2.83$, $\lambda_e = 6.32$; the right plot has $\eta = 0.05$ (moderate sparsity), $\lambda_f = 1.09$, $\lambda_e = 2.45$.
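The threshold values quoted in the caption can be reproduced with a few lines. This sketch assumes the leading-order form $\lambda_e = \sqrt{2\log\eta^{-1}}$ and $\lambda_f = \sqrt{r/(1+r)}\,\lambda_e$, which matches the quoted numbers to two decimals; the proof itself uses $\lambda_e^2 = 2\log((1-\eta)/\eta)$, which differs only at lower order.

```python
import math

def thresholds(eta, r):
    """Return (lambda_e, lambda_f) under the leading-order convention above."""
    lam_e = math.sqrt(2.0 * math.log(1.0 / eta))
    lam_f = math.sqrt(r / (1.0 + r)) * lam_e
    return lam_e, lam_f

for eta in (math.exp(-20.0), 0.05):
    lam_e, lam_f = thresholds(eta, 0.25)
    print(f"eta={eta:.3g}: lambda_e={lam_e:.2f}, lambda_f={lam_f:.2f}")
```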

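Also, note that any threshold estimate $\hat p_T[\lambda]$ with threshold size $\lambda$ less than $\lambda_e$ will be minimax suboptimal, as its risk at the origin will not be negligible compared with $\beta(\eta, r)$. By (34) and (43) we have

$$\rho(0, \hat p_T[\lambda]) \ge 2a_{2r}E[Z^2 I\{Z > \lambda\}] = 2a_{2r}\{\lambda\phi(\lambda) + \tilde\Phi(\lambda)\} \ge \lambda\phi(\lambda)/(1+r), \tag{51}$$

and so for any fixed $\varepsilon > 0$,

$$\liminf_{1 \le \lambda < \lambda_e(\eta) - \varepsilon} \rho(0, \hat p_T[\lambda]) \gg \beta(\eta, r).$$

Thus, $\hat p_T[\lambda]$ is suboptimal unless $\lambda \ge \lambda_e$.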
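The middle step of (51) rests on the truncated second moment of a standard normal variable, $E[Z^2 I\{Z > \lambda\}] = \lambda\phi(\lambda) + \tilde\Phi(\lambda)$ with $\tilde\Phi = 1 - \Phi$. The sketch below checks that identity against a midpoint-rule integral and evaluates the resulting lower bound on the risk at zero; helper names are illustrative.

```python
import math

def phi(z):
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def upper_tail(z):
    # P(Z > z) for Z ~ N(0, 1)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def truncated_second_moment(lam):
    # Closed form of E[Z^2 1{Z > lam}]
    return lam * phi(lam) + upper_tail(lam)

def truncated_second_moment_numeric(lam, upper=12.0, n=20000):
    # Midpoint rule for the integral of z^2 phi(z) over (lam, upper)
    h = (upper - lam) / n
    return sum((lam + (i + 0.5) * h) ** 2 * phi(lam + (i + 0.5) * h) for i in range(n)) * h

def risk_at_zero_lower_bound(lam, r):
    # 2 a_{2r} E[Z^2 1{Z > lam}] with a_{2r} = 1 / (2 (1 + r))
    return truncated_second_moment(lam) / (1.0 + r)
```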

5. Theorem 1: Multivariate minimax risk

Here we use the univariate minimax results developed in the previous sections to evaluate the asymptotic multivariate minimax risk $R_n = R_N(\Theta_n[s_n])$ over the sparse parameter space $\Theta_n[s_n]$.

5.1. Lower bound proof: Theorem 1B and an extension

We first prove a lower bound for the multivariate minimax risk under only the assumption that $s_n/n \to 0$, without requiring, as in Theorem 1B, that also $s_n \to \infty$. This is done using an “independent blocks” sparse prior, along the lines of Johnstone (2013), Chapter 8.6, that we will show to be asymptotically least favorable. This result establishes the lower bound half of Theorem 1A. At the end of the subsection, we prove Theorem 1B using the simpler i.i.d. prior.

Let $\pi_S(\tau; m)$ denote a single spike prior of scale $\tau$ on $\mathbb{R}^m$: choose an index $I \in \{1, \ldots, m\}$ at random and set $\theta = \tau e_I$, where $e_I$ is a unit length vector in the $I$th coordinate direction. We will use a scale $\tau_m = \lambda_m - \log\lambda_m$, which is somewhat smaller than $\lambda_m = \sqrt{2\log m}$.

The independent blocks prior $\pi^{IB}$ on $\Theta[s_n]$ is built by dividing $\{1, \ldots, n\}$ into $s_n$ contiguous blocks $B_j$, $j = 1, \ldots, s_n$, each of length $m = m_n = [n/s_n]$. Draw components $\theta_i$ in each block $B_j$ according to an independent copy of $\pi_S(\nu_m; m)$, where the scale $\nu_m = \sqrt{v_w}\,\tau_m$ is matched to the prediction setting. Finally, set $\theta_i = 0$ for the remaining $n - m_n s_n$ components. Thus, $\pi^{IB}$ is supported on $\Theta[s_n]$ since any draw $\theta$ from $\pi^{IB}$ has exactly $s_n$ nonzero components.
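The construction of the independent blocks prior translates directly into a sampler. The sketch below is an illustration under two assumptions: $[n/s_n]$ is read as the integer part, and $v_w = r/(1+r)$, which is the value implied by (54). Function names are hypothetical.

```python
import math
import random

def spike_scale(m, r):
    """nu_m = sqrt(v_w) * tau_m, with tau_m = lambda_m - log(lambda_m)."""
    lam = math.sqrt(2.0 * math.log(m))
    tau = lam - math.log(lam)
    return math.sqrt(r / (1.0 + r)) * tau

def sample_independent_blocks(n, s, nu, rng):
    """Draw theta from the independent blocks prior with s blocks of length n // s."""
    m = n // s
    theta = [0.0] * n
    for j in range(s):
        # one spike of height nu at a uniformly chosen position in block j
        theta[j * m + rng.randrange(m)] = nu
    return theta  # the trailing n - m*s coordinates stay at zero
```

A draw such as `sample_independent_blocks(103, 10, spike_scale(10, 0.25), random.Random(0))` has exactly 10 nonzero coordinates, one per block, which is why the prior is supported on the sparse parameter space.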

The lower bound half of Theorem 1A follows from the following result, the analog of Theorem 1B for the independent blocks prior.

Theorem 6

Fix $r \in (0, \infty)$. If $s_n/n \to 0$, then

$$R_N(\Theta_n[s_n]) \ge B(\pi_n^{IB}) \sim (1+r)^{-1} s_n \log(n/s_n).$$

Proof

Bounding maximum risk by Bayes risk and using the product structure shows that

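$$R_n = R_N(\Theta_n[s_n]) \ge B(\pi_n^{IB}) = s_n B(\pi_S(\nu_m; m)). \tag{52}$$

Next, using $B_Q^v$ to denote the quadratic loss Bayes risk for noise level $v$, the multivariate form of the connecting equation and scale invariance enable us to write

$$B(\pi_S(\nu_m; m)) = \frac{1}{2}\int_{v_w}^{1} B_Q^v(\pi_S(\nu_m; m))\,\frac{dv}{v^2} = \frac{1}{2}\int_{v_w}^{1} B_Q(\pi_S(\nu_m/\sqrt{v}; m))\,\frac{dv}{v}.$$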

The next lemma, proved in Section S.5 of Mukherjee and Johnstone (2015), provides a uniform lower bound for the quadratic loss Bayes risk of a single spike prior. It is a multivariate analog of Lemma 3.

Proposition 7

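Suppose that $y \sim N_n(0, I)$. Set $\lambda_n = \sqrt{2\log n}$ and $\tau_n = \lambda_n - \log\lambda_n$. Then there exists $\varepsilon_n \to 0$ such that, uniformly in $\tau \in [0, \tau_n]$,

$$B_Q(\pi_S(\tau; n)) \ge \tau^2(1 - \varepsilon_n).$$

Noting that $v \in [v_w, 1]$ implies that $\nu_m/\sqrt{v} \le \nu_m/\sqrt{v_w} = \tau_m$, and then applying the proposition,

$$B(\pi_S(\nu_m; m)) \ge \frac{1-\varepsilon_m}{2}\int_{v_w}^{1} \frac{\nu_m^2}{v^2}\,dv = (1-\varepsilon_m)\frac{\nu_m^2}{2r}.$$

Combining this with (52) and the definition of $\nu_m$, we obtain

$$R_n \ge (1-\varepsilon_m)\, s_n v_w \tau_m^2/(2r) \sim (1+r)^{-1} s_n \log(n/s_n). \tag{53}$$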
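For readers who want the constants spelled out, the two elementary steps behind (53) can be written as follows. This is a worked restatement, assuming $v_w = r/(1+r)$ as implied by (54) and using $\tau_m^2 \sim 2\log m$ with $m = [n/s_n]$.

```latex
\[
\int_{v_w}^{1} \frac{dv}{v^{2}}
  = \frac{1}{v_w} - 1
  = \frac{1+r}{r} - 1
  = \frac{1}{r},
\qquad
\frac{v_w \tau_m^{2}}{2r}
  = \frac{\tau_m^{2}}{2(1+r)}
  \sim \frac{2\log m}{2(1+r)}
  \sim \frac{\log (n/s_n)}{1+r}.
\]
```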

Proof of Theorem 1B

Note that because of the product structure of the problem and the prior $\pi_n^{IID}$ we have

$$B(\pi_n^{IID}) = \sum_{i=1}^{n} \beta(\eta_n, r) = n\beta(\eta_n, r),$$

which is asymptotically equal to $R_N(\Theta[s_n])$, using the univariate Theorem 2 [cf. (20)] and

$$(2r)^{-1}\lambda_f^2 = (2r)^{-1} v_w \lambda_e^2 \sim (1+r)^{-1}\log\eta_n^{-1} \quad \text{as } n \to \infty. \tag{54}$$

Also, as $s_n \to \infty$, $\pi_n^{IID}(\Theta[s_n]) \to 1$ by an application of Chebyshev's inequality and, hence, $\pi_n^{IID}$ is an asymptotically least favorable prior under the conditions of Theorem 1B. □

5.2. Upper bound proof: Theorem 1C

First, an upper bound on $R_N(\Theta_n[s_n])$ is derived from the maximum risk of the multivariate product threshold density estimate $\hat p_{T,CL}$ defined in Theorem 1C. Using the product structure of the threshold estimate as well as that of the unknown future density,

$$\hat p_{T,CL}(y|x) = \prod_{i=1}^{n} \hat p_{T,CL}(y_i|x_i) \quad \text{and} \quad p(y|\theta, r) = \prod_{i=1}^{n} p(y_i|\theta_i, r),$$

the risk of our multivariate threshold estimate reduces to a sum of the coordinatewise risks of the respective univariate density estimates:

$$\rho(\theta, \hat p_{T,CL}) = E_\theta \log\frac{p(y|\theta, r)}{\hat p_{T,CL}(y|x)} = \sum_{i=1}^{n} \rho(\theta_i, \hat p_{T,CL}).$$

Now, maximizing over $\theta \in \Theta_n[s_n]$, we have

$$R_n \le \sup_{\Theta_n[s_n]} \rho(\theta, \hat p_{T,CL}) \le (n - s_n)\,\rho(0, \hat p_{T,CL}) + s_n \sup_\theta \rho(\theta, \hat p_{T,CL}).$$

From the univariate study, we know that $\rho(0, \hat p_{T,CL}) = O(\eta_n\lambda_f)$, which makes $(n - s_n)\,\rho(0, \hat p_{T,CL}) = O(s_n\lambda_f)$ negligible relative to

$$s_n \sup_\theta \rho(\theta, \hat p_{T,CL}) = (2r)^{-1} s_n \lambda_f^2 (1+o(1)),$$

where we used (46). Thus, taking account also of (54), we have the desired upper bound on the minimax risk

$$R_n \le (2r)^{-1} s_n \lambda_f^2 (1+o(1)) \sim (1+r)^{-1} s_n \log(n/s_n). \tag{55}$$
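To see how quickly the two expressions in (55) approach each other, one can compare the main term $(2r)^{-1} s_n \lambda_f^2$ with the rate $(1+r)^{-1} s_n \log(n/s_n)$ for concrete $(n, s_n)$. The sketch assumes $\eta_n = s_n/n$, $\lambda_e^2 = 2\log((1-\eta_n)/\eta_n)$ and $\lambda_f^2 = r\lambda_e^2/(1+r)$; the function names are illustrative.

```python
import math

def upper_bound_main_term(n, s, r):
    # (2r)^{-1} s lambda_f^2 with eta = s / n
    eta = s / n
    lam_e_sq = 2.0 * math.log((1.0 - eta) / eta)
    lam_f_sq = r / (1.0 + r) * lam_e_sq
    return s * lam_f_sq / (2.0 * r)

def minimax_rate(n, s, r):
    # (1 + r)^{-1} s log(n / s)
    return s * math.log(n / s) / (1.0 + r)

def ratio(n, s, r):
    return upper_bound_main_term(n, s, r) / minimax_rate(n, s, r)
```

Under these assumptions the ratio equals $\log((1-\eta_n)/\eta_n)/\log(\eta_n^{-1})$, so it does not depend on $r$ and tends to 1 as $s_n/n \to 0$.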

Completion of Proof of Theorems 1A, 1B and 1C

As the lower bound (53) and upper bound (55) on $R_n$ match asymptotically, the first order asymptotic minimax risk of Theorem 1A is achieved, and the proof of all parts is complete.

5.3. Proof of Proposition 1

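Estimates in $\mathcal{L}$ and $\mathcal{G}$ are products of the form (21) and so $R_{\mathcal{L},n} = R_{\mathcal{L}}(\Theta_n[s_n])$ can be studied using the associated univariate problem and decomposition (23). It is shown in Appendix A.2 that

$$\rho(\theta, \hat p_{L,\alpha}) = \frac{1}{2}\log(1 + \alpha r^{-1}) + \frac{(1-\alpha)^2}{2(r+\alpha)}\left[\theta^2 - \frac{\alpha}{1-\alpha}\right]. \tag{56}$$

Thus, $\sup_\theta \rho(\theta, \hat p_{L,\alpha})$ is infinite unless $\alpha = 1$, that is, the uniform prior estimate $\hat p_U$, in which case $\rho(\theta, \hat p_U) \equiv \frac{1}{2}\log(1 + r^{-1})$. Thus,

$$R_{\mathcal{L},n} = \frac{n}{2}\log(1 + r^{-1}) \gg \frac{s_n}{1+r}\log\left(\frac{n}{s_n}\right) \sim R_n.$$

In particular, $R_{\mathcal{L},n}/R_n \to \infty$ when $s_n/n \to 0$.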
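The behavior of the linear estimates can be explored numerically. The sketch below evaluates the risk of $\hat p_{L,\alpha}$ in two ways: from the unsimplified expression in Appendix A.2, using $E_\theta(\alpha X - \theta)^2 = \alpha^2 + (1-\alpha)^2\theta^2$, and from (56), which is valid for $\alpha < 1$. Function names are illustrative.

```python
import math

def rho_linear(theta, alpha, r):
    """Risk of p_hat_{L,alpha} from the unsimplified form, for 0 <= alpha <= 1."""
    mse = alpha ** 2 + (1.0 - alpha) ** 2 * theta ** 2  # E(alpha X - theta)^2
    return 0.5 * (math.log(1.0 + alpha / r) + (r + mse) / (r + alpha) - 1.0)

def rho_linear_56(theta, alpha, r):
    """Risk of p_hat_{L,alpha} as written in (56); requires alpha < 1."""
    return (0.5 * math.log(1.0 + alpha / r)
            + (1.0 - alpha) ** 2 / (2.0 * (r + alpha)) * (theta ** 2 - alpha / (1.0 - alpha)))
```

For $\alpha < 1$ the risk grows quadratically in $\theta$, while for $\alpha = 1$ `rho_linear` returns the constant $\frac{1}{2}\log(1 + r^{-1})$ for every $\theta$.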

We turn to the Gaussian class $\mathcal{G}$. Since $\mathcal{E} \subset \mathcal{G}$, clearly $R_{\mathcal{G},n} \le R_{\mathcal{E},n} = (2r)^{-1} n\eta_n\lambda_e^2$. We give here a heuristic argument for the reverse inequality, which gives the idea for the rigorous proof given in Section S.3 of the supplementary material [Mukherjee and Johnstone (2015)]. From the decomposition (23), any near-optimal estimator in $\mathcal{G}$ must have univariate risk at 0 bounded as follows:

$$\rho(0, \hat p_1) \le r^{-1}\eta_n\lambda_e^2. \tag{57}$$

Now from (60) we know that the risk at the origin for the univariate Gaussian density estimate $p[\hat\theta, \hat d]$ is

$$\rho(0, p[\hat\theta, \hat d]) = 2^{-1} E_0\{\log(r^{-1}\hat d) + \hat d^{-1}(r + \hat\theta^2) - 1\},$$

which for any fixed choice of $\hat\theta$ achieves its minimum at $d_{opt}[\hat\theta] = r + \hat\theta^2$. Thus, for such an optimal choice of $\hat d$,

$$\rho(0, p[\hat\theta, d_{opt}(\hat\theta)]) = 2^{-1} E_0\log(1 + r^{-1}\hat\theta^2),$$

and for this to satisfy (57), we must have $\hat\theta(x) \approx 0$ for $|x| \le \lambda_e(1+o(1))$. Thus, $\hat p$ would approximately need to have the threshold structure (31), (32) for $|x| \le \lambda_e$ and so the bound (35) shows that

$$\rho(\theta, \hat p_1) \ge \frac{\theta^2}{2r}P_\theta(|X| \le \lambda_e) \approx \frac{\lambda_e^2}{2r}.$$

Returning to decomposition (23), we can now see that $R_{\mathcal{G},n} \gtrsim (2r)^{-1} s_n \lambda_e^2 \sim R_{\mathcal{E},n}$, which completes the heuristic argument.
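The minimization over $\hat d$ used in the heuristic argument is pointwise, so it can be checked for a fixed value of $\hat\theta$. The sketch below evaluates the integrand $\frac{1}{2}\{\log(d/r) + (r + \hat\theta^2)/d - 1\}$ from (60) at $\theta = 0$; the names are illustrative.

```python
import math

def gaussian_risk_term(d, theta_hat, r):
    # Integrand of (60) at theta = 0 for fixed theta_hat and variance choice d
    return 0.5 * (math.log(d / r) + (r + theta_hat ** 2) / d - 1.0)

def d_opt(theta_hat, r):
    # Minimizer over d of gaussian_risk_term
    return r + theta_hat ** 2
```

At `d = d_opt(theta_hat, r)` the value is `0.5 * log(1 + theta_hat**2 / r)`, and no other positive `d` gives a smaller value, since `log(d) + c/d` is minimized at `d = c`.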

6. Discussion

Avoiding thresholding

The asymptotically minimax rules $\hat p_T$ described in Theorems 1C and 2C are based on thresholding. It would be desirable to construct a prior $\pi$ for which the Bayes predictive density $\hat p_\pi$ in (2) is itself asymptotically minimax, without any use of the discontinuous thresholding operation.

Consider, then, a symmetric univariate prior $\pi[\eta, r]$ whose support consists of the origin and an infinite number of equidistant clusters, each containing $2K$ points in the same spatial alignment as for $\pi_{CL}[\eta, r]$:

$$\pi[\eta, r] = (1-\eta)\delta_0 + \frac{1-\eta}{2}\sum_{j=0}^{\infty}\eta^{j+1}\sum_{k=1}^{K} q_k(\delta_{\mu_{jk}} + \delta_{-\mu_{jk}}),$$

where $\mu_{jk} = j\lambda_e + \mu_k$ and, with $\gamma = \log\eta^{-1}$, we have $q_k = \gamma^{-k}$ for $k = 2, \ldots, K$ and $q_1 = 1 - \sum_{k=2}^{K} q_k$.
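As a consistency check on the weights of this prior, the following sketch computes the within-cluster weights $q_k$ and the total mass after truncating the infinite sum at $j = J$. The truncation level and the function names are illustrative assumptions; the weights are only positive when $\gamma = \log\eta^{-1}$ is large enough, which holds for small $\eta$.

```python
import math

def cluster_weights(eta, K):
    """Return [q_1, ..., q_K] with q_k = gamma^{-k} for k >= 2 and q_1 = 1 - sum of the rest."""
    gamma = math.log(1.0 / eta)
    tail = [gamma ** (-k) for k in range(2, K + 1)]
    return [1.0 - sum(tail)] + tail

def total_mass(eta, K, J):
    """Mass of the prior when the cluster sum is truncated at j = J."""
    q = cluster_weights(eta, K)
    mass = 1.0 - eta  # atom at the origin
    for j in range(J + 1):
        # cluster j: weight (1 - eta)/2 * eta^(j+1) * q_k on each of +mu_jk and -mu_jk
        mass += 0.5 * (1.0 - eta) * eta ** (j + 1) * 2.0 * sum(q)
    return mass
```

The truncated mass equals $1 - \eta^{J+2}$, so it converges to 1 geometrically in $J$.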

Based on $\pi[\eta_n, r]$, one can construct a multivariate prior $\pi_{n,IID}$ using (11), which heuristic arguments indicate will not only be least favorable but also yield a minimax optimal density estimate. A detailed proof is forthcoming.

Approximate sparsity and other extensions

Starting from Johnstone (2013), Chapters 8 and 13, the $\ell_0$ sparsity results presented here can be extended to obtain minimax optimal predictive density estimates over weak and strong $\ell_p$ sparse parameter spaces. An interesting topic for future work is whether, as in point estimation [Donoho and Johnstone (1994)], the phenomena seen here can be generalized to a family of loss functions. Simple analogues of the connecting equations [Brown, George and Xu (2008), Theorem 1] between the predictive and quadratic PE regimes do not exist in those cases, though some of the decision theoretic parallels can still be proved, particularly for the $\ell_2$ loss [Gatsonis (1984)].

Acknowledgments

We thank the Associate Editor and three referees for constructive suggestions to shorten and improve the paper.

APPENDIX

A.1. Bayes density estimate for discrete priors

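The posterior distribution for the discrete prior $\pi = \sum_{k=-K}^{K} \pi_k\delta_{\mu_k}$ is given by

$$\pi(\mu_k|x) = \{m(x)\}^{-1}\phi(x|\mu_k, 1)\,\pi_k \quad \text{where} \quad m(x) = \sum_k \pi_k\phi(x|\mu_k, 1).$$

So, for the Bayes predictive density based on the prior $\pi$,

$$\hat p_\pi(y|x) = \sum_{k=-K}^{K}\phi(y|\mu_k, r)\,\pi(\mu_k|x) = \sum_{k=-K}^{K}\phi(y|\mu_k, r)\frac{\phi(x|\mu_k, 1)\,\pi_k}{m(x)}. \tag{58}$$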

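A.2. K–L risk for Gaussian and linear density estimates

The predictive risk of the univariate Gaussian density estimate $p[\hat\theta, \hat d] = N(\hat\theta, \hat d)$ is given by

$$\rho(\theta, p[\hat\theta, \hat d]) = E_\theta\{\log\phi(Y|\theta, r)\} - E_\theta\{\log\phi(Y|\hat\theta(X), \hat d(X))\},$$

where the expectation is over $X \sim N(\theta, 1)$ and $Y \sim N(\theta, r)$. Noting that $E_\theta\{\log\phi(Y|\hat\theta, \hat d)\,|\,X = x\} = -\frac{1}{2}\log(2\pi\hat d(x)) - (2\hat d(x))^{-1}\{r + (\hat\theta(x) - \theta)^2\}$ and $E_\theta\log\phi(Y|\theta, r) = -\frac{1}{2}\log(2\pi r) - \frac{1}{2}$, we obtain

$$L(\theta, \hat p(\cdot|x)) = \frac{1}{2}\log(r^{-1}\hat d) + \frac{r + (\hat\theta(x) - \theta)^2}{2\hat d} - \frac{1}{2}, \tag{59}$$

and the following expression for the K–L risk of members in $\mathcal{G}$:

$$\rho(\theta, p[\hat\theta, \hat d]) = \frac{1}{2}\left[E_\theta\log(r^{-1}\hat d) + E_\theta\left\{\frac{r + (\hat\theta - \theta)^2}{\hat d}\right\} - 1\right]. \tag{60}$$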
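Formula (59) is the K–L divergence from $N(\theta, r)$ to $N(\hat\theta, \hat d)$ at a fixed $x$. A minimal sketch comparing it with a direct midpoint-rule integration is given below; log densities are used to avoid underflow, and the function names, integration width and grid size are illustrative choices.

```python
import math

def log_phi(y, mean, var):
    # log density of N(mean, var) at y
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)

def kl_closed_form(theta, theta_hat, d_hat, r):
    # Right-hand side of (59) for fixed theta_hat and d_hat
    return 0.5 * math.log(d_hat / r) + (r + (theta_hat - theta) ** 2) / (2.0 * d_hat) - 0.5

def kl_numeric(theta, theta_hat, d_hat, r, width=12.0, n=20000):
    # Midpoint rule for E_theta[log phi(Y|theta,r) - log phi(Y|theta_hat,d_hat)]
    sd = math.sqrt(r)
    h = 2.0 * width * sd / n
    total = 0.0
    for i in range(n):
        y = theta - width * sd + (i + 0.5) * h
        lp = log_phi(y, theta, r)
        total += math.exp(lp) * (lp - log_phi(y, theta_hat, d_hat)) * h
    return total
```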

Consider now “linear” estimators. Starting with the conjugate prior $\theta \sim N(0, \alpha/(1-\alpha))$ for $0 \le \alpha \le 1$, standard calculations show that the posterior density $\pi(\theta|x)$ is $N(\alpha x, \alpha)$ and the predictive density $\hat p_{L,\alpha}$, being a convolution of Gaussians [compare (2)], is seen to be $N(\alpha x, r + \alpha)$. Now, using $\hat d = r + \alpha$ and $\hat\theta = \alpha X$ in (60), we get

$$\rho(\theta, \hat p_{L,\alpha}) = \frac{1}{2}\left[\log(1 + r^{-1}\alpha) + (r+\alpha)^{-1}\{r + E_\theta(\alpha X - \theta)^2\} - 1\right].$$

The linear risk formula (56) now follows from the quadratic risk of $\alpha X$. Next, we present some details about the risk of the particular linear estimate $\hat p_U$.

Proof of (42)

The estimator $\hat p_U = \hat p_{L,1}$ is given by the $N(x, 1 + r)$ distribution, and so from (59)

$$L(\theta, \hat p_U(\cdot|x)) = \frac{1}{2}\log(1 + r^{-1}) + \frac{r + (\theta - x)^2}{2(1+r)} - \frac{1}{2},$$

from which (42) is immediate.

Footnotes

1

Supported in part by NSF Grant DMS-09-06812 and NIH Grant R01 EB001988.

MSC2010 subject classifications. Primary 62C20; secondary 62M20, 60G25, 91G70.

SUPPLEMENTARY MATERIAL

Supplementary material to “Exact minimax estimation of the predictive density in sparse Gaussian models” (DOI: 10.1214/14-AOS1251SUPP; .pdf). The supplement [Mukherjee and Johnstone (2015)] contains a brief description of the relevance of the predictive density estimation problem in related application areas, along with the proof of the suboptimality of the univariate threshold density estimate $\hat p_{T,LF}$ (in Section S.2) and the details of the proof of Proposition 1 (in Section S.3). The arguments for the maximum quadratic risk of hard threshold point estimates are reviewed in Section S.4 and the proof of Proposition 7 is presented in Section S.5. Links to the R code used in producing Table 1 and Figure 3 are also provided.

Contributor Information

Gourab Mukherjee, Email: gourab@usc.edu, Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California 90089-0809, USA.

Iain M. Johnstone, Email: imj@stanford.edu, Department of Statistics, Sequoia Hall, 390 Serra Mall, Stanford University, Stanford, California 94305-4065, USA.

References

  1. Aitchison J. Goodness of prediction fit. Biometrika. 1975;62:547–554. MR0391353.
  2. Aitchison J, Dunsmore IR. Statistical Prediction Analysis. Cambridge Univ. Press; Cambridge: 1975. MR0408097.
  3. Aslan M. Asymptotically minimax Bayes predictive densities. Ann Statist. 2006;34:2921–2938. MR2329473.
  4. Barndorff-Nielsen OE, Cox DR. Prediction and asymptotics. Bernoulli. 1996;2:319–340. MR1440272.
  5. Bell RM, Cover TM. Competitive optimality of logarithmic investment. Math Oper Res. 1980;5:161–166. MR0571810.
  6. Brown L. Lecture notes on statistical decision theory. 1974. Available at http://www-stat.wharton.upenn.edu/~lbrown.
  7. Brown LD, George EI, Xu X. Admissible predictive density estimation. Ann Statist. 2008;36:1156–1170. MR2418653.
  8. Cover TM, Thomas JA. Elements of Information Theory. Wiley; New York: 1991. MR112280.
  9. Donoho DL, Johnstone IM. Minimax risk over lp-balls for lq-error. Probab Theory Related Fields. 1994;99:277–303.
  10. Donoho DL, Johnstone IM, Hoch JC, Stern AS. Maximum entropy and the nearly black object. J R Stat Soc Ser B Stat Methodol. 1992;54:41–81. MR1157714.
  11. Fourdrinier D, Marchand É, Righi A, Strawderman WE. On improved predictive density estimation with parametric constraints. Electron J Stat. 2011;5:172–191. MR2792550.
  12. Gatsonis CA. Deriving posterior distributions for a location parameter: A decision theoretic approach. Ann Statist. 1984;12:958–970. MR0751285.
  13. Geisser S. Predictive Inference: An Introduction. Monographs on Statistics and Applied Probability. Vol. 55. Chapman & Hall; New York: 1993. MR1252174.
  14. George EI, Liang F, Xu X. Improved minimax predictive densities under Kullback–Leibler loss. Ann Statist. 2006;34:78–91. MR2275235.
  15. George EI, Liang F, Xu X. From minimax shrinkage estimation to minimax shrinkage prediction. Statist Sci. 2012;27:82–94. MR2953497.
  16. Ghosh M, Mergel V, Datta GS. Estimation, prediction and the Stein phenomenon under divergence loss. J Multivariate Anal. 2008;99:1941–1961. MR2466545.
  17. Hartigan JA. The maximum likelihood prior. Ann Statist. 1998;26:2083–2103. MR1700222.
  18. Johnstone IM. Gaussian estimation: Sequence and wavelet models. 2013. Available at http://www-stat.stanford.edu/~imj.
  19. Komaki F. On asymptotic properties of predictive distributions. Biometrika. 1996;83:299–313. MR1439785.
  20. Komaki F. A shrinkage predictive distribution for multivariate normal observables. Biometrika. 2001;88:859–864. MR1859415.
  21. Komaki F. Simultaneous prediction of independent Poisson observables. Ann Statist. 2004;32:1744–1769. MR2089141.
  22. Larimore WE. Predictive inference, sufficiency, entropy and an asymptotic likelihood principle. Biometrika. 1983;70:175–181. MR0742987.
  23. McMillan B. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory. 1956;2:115–116.
  24. Mukherjee G. Sparsity and shrinkage in predictive density estimation. PhD thesis, Stanford Univ. 2013. Available at http://purl.stanford.edu/gm306wz2890.
  25. Mukherjee G, Johnstone IM. Supplement to “Exact minimax estimation of the predictive density in sparse Gaussian models”. 2015. doi: 10.1214/14-AOS1251SUPP.
  26. Murray GD. A note on the estimation of probability density functions. Biometrika. 1977;64:150–152. MR0448690.
  27. Ng VM. On the estimation of parametric density functions. Biometrika. 1980;67:505–506. MR0581751.
  28. Pinsker MS. Optimal filtration of square-integrable signals in Gaussian noise. Probl Inf Transm. 1980;16:120–133. Originally in Russian in Problemy Peredachi Informatsii 16 52–67. MR0624591.
  29. Xu X, Liang F. Asymptotic minimax risk of predictive density estimation for non-parametric regression. Bernoulli. 2010;16:543–560. MR2668914.
  30. Xu X, Zhou D. Empirical Bayes predictive densities for high-dimensional normal models. J Multivariate Anal. 2011;102:1417–1428. MR2819959.
