Abstract
We consider estimating the predictive density under Kullback–Leibler loss in an ℓ0 sparse Gaussian sequence model. Explicit expressions of the first order minimax risk along with its exact constant, asymptotically least favorable priors and optimal predictive density estimates are derived. Compared to the sparse recovery results involving point estimation of the normal mean, new decision theoretic phenomena are seen. Suboptimal performance of the class of plug-in density estimates reflects the predictive nature of the problem and optimal strategies need diversification of the future risk. We find that minimax optimal strategies lie outside the Gaussian family but can be constructed with threshold predictive density estimates. Novel minimax techniques involving simultaneous calibration of the sparsity adjustment and the risk diversification mechanisms are used to design optimal predictive density estimates.
Key words and phrases: Predictive density, risk diversification, minimax, sparsity, high-dimensional, mutual information, plug-in risk, thresholding
1. Introduction
Statistical prediction analysis aims to use past data to choose a probability distribution that will be good in predicting the behavior of future samples. This well-established subject [Aitchison and Dunsmore (1975), Geisser (1993)] finds application in game theory, econometrics, information theory, machine learning, mathematical finance, etc.
In this paper we study predictive density estimation in a high-dimensional setting and, in particular, explore the consequences of sparsity assumptions on the unknown parameters.
1.1. Main results
We begin by describing some of our main results: fuller references, background and interpretation follow in Section 1.2.
We work in the simplest Gaussian model for high-dimensional prediction:
$$X \sim N_n(\theta, v_x I), \qquad Y \sim N_n(\theta, v_y I). \tag{1}$$
On the basis of the “past” observation vector X, we seek to predict the distribution of a future observation Y. The past and future observations are independent, but are linked by the common mean parameter θ, assumed to be unknown. Note, however, that the variances, assumed here to be known, may differ. We write p(x|θ, vx) and p(y|θ, vy) for the probability densities of X and Y, respectively.
We seek estimators of the future observation density p(y|θ, vy), and compare their performance under sparsity assumptions on θ. We recall two natural ways of generating large classes of estimators. Perhaps simplest are the “plug-in” or estimative densities: given a point estimate θ̂(x), simply set p̂(y|x) = p(y|θ̂(x), vy). We often use the abbreviation p[θ̂]. Second, given any prior measure π(dθ), proper or improper, such that the posterior π(dθ|x) is well defined, the Bayes predictive density is
$$\hat p_\pi(y|x) = \int p(y|\theta, v_y)\,\pi(d\theta|x). \tag{2}$$
The important case of a uniform prior measure π(dθ) = dθ leads to the predictive density p̂U(y|x), easily seen to correspond to Nn(x, (vx + vy)I).
We will examine similarities and differences between high-dimensional prediction and high-dimensional estimation. In particular, p̂U plays in prediction the role of the maximum likelihood estimator θ̂MLE(x) = x in the multinormal mean estimation setting. In contrast to the corresponding plug-in estimate p[θ̂MLE], the density p̂U incorporates the variability of the location estimate, which leads to a flattening of the estimator: vx + vy > vy.
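To evaluate the performance of a predictive density estimator p̂(y|x), we use the familiar Kullback–Leibler “distance” as loss function:
$$L(\theta, \hat p(\cdot|x)) = \int p(y|\theta, v_y)\,\log \frac{p(y|\theta, v_y)}{\hat p(y|x)}\,dy.$$
The corresponding K–L risk function follows by averaging over the distribution of the past observation:
$$\rho(\theta, \hat p) = \int L(\theta, \hat p(\cdot|x))\,p(x|\theta, v_x)\,dx.$$
Given a prior measure π(dθ), the average or integrated risk is
$$B(\pi, \hat p) = \int \rho(\theta, \hat p)\,\pi(d\theta). \tag{3}$$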
The Bayes predictive density (2) can be shown to minimize both the posterior expected loss and the integrated risk in the class of all density estimates. This is a general fact in statistical decision theory [Brown (1974)]; the resulting minimum is the Bayes K–L risk:
$$B(\pi) := \inf_{\hat p} B(\pi, \hat p) = B(\pi, \hat p_\pi). \tag{4}$$
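As a hedged numerical illustration of the flattening effect, the sketch below compares the per-coordinate K–L risk of the plug-in density N(x, vy) with that of the uniform-prior predictive density N(x, vx + vy). The function names are ours, and the closed forms come from the standard Gaussian K–L divergence formula rather than from an expression quoted in the text.

```python
import math

def kl_normal(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for univariate normal distributions."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def risk_plugin_mle(vx, vy):
    """Per-coordinate K-L risk of the plug-in density N(x, vy).

    The loss is (x - theta)^2 / (2 vy); averaging over X ~ N(theta, vx)
    gives vx / (2 vy), which does not depend on theta.
    """
    return vx / (2.0 * vy)

def risk_uniform_prior(vx, vy):
    """Per-coordinate K-L risk of the predictive density N(x, vx + vy).

    Averaging kl_normal(theta, vy, x, vx + vy) over X ~ N(theta, vx)
    gives 0.5 * log(1 + vx / vy), again free of theta.
    """
    return 0.5 * math.log(1.0 + vx / vy)

if __name__ == "__main__":
    for r in (0.1, 0.5, 1.0, 4.0):  # r = vy / vx with vx = 1
        print(r, risk_plugin_mle(1.0, r), risk_uniform_prior(1.0, r))
```

Since log(1 + t) < t for t > 0, the flattened density has strictly smaller risk than the plug-in density for every variance ratio, and the gap widens as vy/vx decreases.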
Our main focus is on how to optimize the predictive risk in a high-dimensional setting under an ℓ0 sparsity condition on the parameter space. Thus, let ‖θ‖0 = #{i:θi ≠0} and
$$\Theta_n[s] = \{\theta \in \mathbb{R}^n : \|\theta\|_0 \le s\}. \tag{5}$$
This “exact” sparsity condition has been widely used in estimation; in this paper we initiate study of its implications for predictive density estimation.
The minimax K–L risk for estimation over Θ is given by
$$R_N(\Theta) = \inf_{\hat p}\,\sup_{\theta \in \Theta} \rho(\theta, \hat p), \tag{6}$$
where the infimum is taken over all measurable predictive density estimators p̂(y|x). For comparison, we write Rℰ(Θ) for the minimax risk restricted to the sub-class ℰ of plug-in or “estimative” densities.
To state our main results, henceforth we will assume vx = 1 and introduce the key parameters
$$r = v_y / v_x, \qquad v_w = (1 + r^{-1})^{-1}. \tag{7}$$
Here vw is the “oracle variance” which would be the variance of the UMVUE for θ, were both X and Y observed.
In our asymptotic model, the dimensionality n → ∞ and the sparsity s = sn may depend on n, but the variance ratio r remains fixed. The notation an ~ bn denotes an/bn → 1 as n → ∞.
Theorem 1A
Fix r ∈ (0, ∞). If ηn = sn/n → 0, then
$$R_N(\Theta_n[s_n]) \sim \frac{1}{1+r}\, s_n \log(n/s_n). \tag{8}$$
The minimax risk is proportional to the sparsity sn, with a logarithmic penalty factor. The case where sn ≡ s remains constant is included. The expression is quite analogous to that obtained for point estimation with quadratic loss, namely, 2snlog(n/sn) [Donoho and Johnstone (1994), Donoho et al. (1992) and Johnstone (2013), Chapter 8.8, hereafter cited as Johnstone (2013)]. However, we shall see that quite different phenomena emerge in the predictive density setting.
Indeed, the future-to-past variance ratio r is an important parameter of the predictive estimation problem. The minimax risk increases as r decreases: we need to estimate the future observation density based on increasingly noisy past observations (in relative terms, r = vy/vx), and so the difficulty of the density estimation problem increases. However, the rate of convergence with n in (8) does not depend on r, and so exact determination of the constants is needed to show the role of r in this prediction problem.
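The inefficiency of plug-in estimators is an immediate consequence of Theorem 1A. Let $q(\hat\theta, \theta) = E_\theta\|\hat\theta(X) - \theta\|^2$ denote the risk of point estimator θ̂ under squared-error loss. It is straightforward to show for a plug-in density estimate that $\rho(\theta, p[\hat\theta]) = q(\hat\theta, \theta)/(2r)$. Hence, from the point estimation minimax risk just cited,
$$R_{\mathcal{E}}(\Theta_n[s_n]) \sim \frac{1}{r}\, s_n \log(n/s_n).$$
The inefficiency of plug-in estimators thus equals the oracle precision,
$$\frac{R_{\mathcal{E}}(\Theta_n[s_n])}{R_N(\Theta_n[s_n])} \sim 1 + r^{-1} = v_w^{-1},$$
and becomes arbitrarily large as the variance ratio r → 0.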
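For concreteness, the following sketch evaluates the two first-order risks for given n, sn and r. It assumes that (8) reads RN ~ sn log(n/sn)/(1 + r), which is the form implied by the plug-in risk 2sn log(n/sn)/(2r) together with the stated inefficiency 1 + r−1; the function name is ours.

```python
import math

def first_order_risks(n, s, r):
    """First-order minimax risks over Theta_n[s] with v_x = 1.

    Returns (risk over all predictive densities, risk over plug-in densities).
    Both expressions are asymptotic approximations, not finite-sample values.
    """
    base = s * math.log(n / s)
    risk_all = base / (1.0 + r)   # assumed form of (8)
    risk_plugin = base / r        # quadratic minimax risk 2*base, divided by 2r
    return risk_all, risk_plugin

if __name__ == "__main__":
    for r in (0.1, 0.25, 1.0, 4.0):
        risk_all, risk_plugin = first_order_risks(10**6, 100, r)
        print(r, risk_all, risk_plugin, risk_plugin / risk_all)
```

The printed ratio equals 1 + 1/r regardless of n and s, so the plug-in penalty is driven entirely by the variance ratio.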
We turn now to the asymptotically least favorable priors and optimal estimators in Theorem 1A. Let δλ denote unit point mass at λ and
$$\pi[\eta, \lambda] = (1 - \eta)\,\delta_0 + \eta\,\delta_\lambda \tag{9}$$
be a univariate two-point prior: this is a sparse prior when η is small and λ large. Let
$$\lambda_e = \sqrt{2 \log \eta^{-1}}, \qquad \lambda_f = v_w^{1/2}\,\lambda_e. \tag{10}$$
In point estimation based on X, we recall that λe is essentially the threshold of detectability corresponding to sparsity ηn = sn/n. Although Y is not yet observed, we will see that in the prediction setting the UMVUE scaled threshold λf < λe plays a partly analogous role.
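The two thresholds are easy to tabulate. The sketch below assumes λe = (2 log η−1)1/2 and λf = vw1/2 λe with vw = r/(1 + r); these forms reproduce the values of λe and λf quoted in the caption of Figure 3. The function names are ours.

```python
import math

def oracle_variance(r):
    """v_w = (1 + 1/r)^(-1) when v_x = 1 and v_y = r."""
    return r / (1.0 + r)

def thresholds(eta, r):
    """Return (lambda_e, lambda_f) for sparsity eta and variance ratio r."""
    lam_e = math.sqrt(2.0 * math.log(1.0 / eta))
    lam_f = math.sqrt(oracle_variance(r)) * lam_e
    return lam_e, lam_f

if __name__ == "__main__":
    print(thresholds(math.exp(-20), 0.25))  # very high sparsity
    print(thresholds(0.05, 0.25))           # moderate sparsity
```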
Build a sparse high-dimensional prior from i.i.d. draws:
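$$\pi_n^{\mathrm{IID}}(d\theta) = \prod_{i=1}^n \pi[\eta_n, \lambda_f](d\theta_i), \qquad \eta_n = s_n/n. \tag{11}$$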
If the sparsity sn increases without bound with n, then this i.i.d. prior with scale λf is asymptotically least favorable:
Theorem 1B
If sn → ∞ and sn/n → 0, then
$$B(\pi_n^{\mathrm{IID}}) = R_N(\Theta_n[s_n])\,(1 + o(1)).$$
The assumption that sn → ∞ ensures that $\pi_n^{\mathrm{IID}}$ concentrates on Θ[sn], namely, that $\pi_n^{\mathrm{IID}}(\Theta_n[s_n]) \to 1$ as n → ∞. This hypothesis is not needed for Theorem 1A; indeed, a sparse prior built from “independent blocks” is asymptotically least favorable assuming only sn/n → 0. This more elaborate prior is described in Section 5.
Some of the novel aspects of the predictive density estimation problem appear in the description of optimal estimators, that is, ones that asymptotically attain the minimax bound in Theorem 1A. In point estimation, the simplest asymptotically minimax rule for sparsity sn is given by co-ordinatewise hard thresholding $\hat\theta_i(x) = x_i I\{|x_i| > \lambda_e\}$. For prediction, we consider the following class of univariate density estimators as analogs of hard thresholding:
$$\hat p_T(y|x) = \begin{cases} \hat p_U(y|x) & \text{if } |x| > \lambda_e, \\ \hat p_\pi(y|x) & \text{if } |x| \le \lambda_e. \end{cases} \tag{12}$$
The univariate density estimates are combined to form a multivariate predictive density estimate via a product rule
$$\hat p_T(y|x) = \prod_{i=1}^n \hat p_T(y_i|x_i). \tag{13}$$
The threshold λe in (12) is that corresponding to estimation based on X at sparsity ηn = sn/n. Above the threshold, the uniform prior predictive density corresponds to the (unbiased) MLE. Below threshold, we shall need the flexibility of the Bayes predictive density (2). Indeed, as explained in Section 4, it does not suffice to use π = δ0, point mass at 0, which would be the predictive analog of thresholding to zero in point estimation.
Instead, we use a sparse univariate cluster prior π = πCL[η, r] given by
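$$\pi_{CL}[\eta, r] = (1 - \eta)\,\delta_0 + \frac{\eta}{2K} \sum_{k=1}^{K} \left(\delta_{\mu_k} + \delta_{-\mu_k}\right). \tag{14}$$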
The points μk = μk(r) for k = 1,…, K are geometrically spaced to cover an interval [νη, λe + a] containing [λf, λe], as described in more detail below. The key point is that it is necessary to “diversify” the predictive risk by introducing prior support points to cover [–λe, –λf] ∪ [λf, λe].
More specifically, for a parameter a = aη given below, let μη be the positive root of the overshoot equation
| (15) |
that occurs in sparse minimax point estimation [e.g., Johnstone (2013), equation (8.48)], and then set $\nu_\eta = v_w^{1/2}\mu_\eta$: since μη < λe, we have νη < λf. The support points are
$$\mu_k = (1 + 2r)^{k-1}\,\nu_\eta, \qquad k = 1, \ldots, K, \tag{16}$$
with K = max{k : μk ≤ λe + a}. We choose .
Theorem 1C
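Assume ηn = sn/n → 0. Let p̂T,CL be the product predictive threshold estimator defined by (12) and (13) using the cluster prior πCL[ηn, r]. Then p̂T,CL is asymptotically minimax:
$$\sup_{\theta \in \Theta_n[s_n]} \rho(\theta, \hat p_{T,CL}) = R_N(\Theta_n[s_n])\,(1 + o(1)).$$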
Note that the number of positive support points in the cluster prior K = Kη increases as r decreases. For any fixed η, the cluster prior contains in total (2Kη + 1) support points. Also, for any fixed r ∈ (0, ∞) as η → 0, we have
$$K_\eta \to K(r) = \left\lceil \frac{\log(1 + r^{-1})}{2 \log(1 + 2r)} \right\rceil.$$
Thus, K(r) is a piecewise constant, right continuous function with jumps as shown in Table 1.
Table 1.
Number K(r) of positive support points in the cluster prior πCL[η, r] as r varies
| r | 0.1073 | 0.1235 | 0.1465 | 0.1826 | 0.2485 | 0.4196 | >0.4196 |
|---|---|---|---|---|---|---|---|
| K(r) | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
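The limiting count K(r) can be computed directly under two assumptions that we infer from the surrounding discussion rather than quote: the positive support points grow geometrically with ratio 1 + 2r, and the first point is asymptotically λf = vw1/2 λe. Counting the points that fit below λe then gives the ceiling expression used in this sketch (the function name is ours).

```python
import math

def limiting_support_count(r):
    """Limiting number K(r) of positive support points as eta -> 0.

    Assumes mu_k = (1 + 2r)^(k-1) * lambda_f and lambda_f = sqrt(r/(1+r)) * lambda_e,
    so that K(r) = ceil( log(1 + 1/r) / (2 log(1 + 2r)) ).
    """
    x = math.log(1.0 + 1.0 / r) / (2.0 * math.log(1.0 + 2.0 * r))
    return max(1, math.ceil(x))

if __name__ == "__main__":
    for r in (1.0, 0.5, 0.3, 0.2, 0.16, 0.135, 0.115, 0.1):
        print(r, limiting_support_count(r))
```

For values of r strictly between the breakpoints listed in Table 1, the output appears consistent with the tabulated counts; exactly at a breakpoint the result depends on rounding of the tabulated r.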
The results presented above assume vx = 1. These results can be easily extended to the general case by noting that the minimax risk remains invariant when the scale of the past observations and the parameter is divided by $v_x^{1/2}$.
1.2. Background and previous work
The relative entropy predictive risk measures the exponential rate of divergence of the joint likelihood ratio over a large number of independent trials [Larimore (1983)]. The minimal predictive risk estimate maximizes the expected growth rate in repeated investment scenarios [Cover and Thomas (1991), Chapters 6, 15]. In data compression, L(θ, p̂(·|x)) reflects the excess average code length that we need if we use the conditional density estimate p̂(·|x) instead of the true density to construct a uniquely decodable code for the data Y given the past x [McMillan (1956)]. Following Bell and Cover (1980), ℓ0-constrained minimax optimal predictive density estimates in our model can be used for construction of optimal predictive schemes for gambling, sports betting, portfolio selection and sparse coding [Mukherjee (2013), Chapter 1.3].
Aitchison (1975), Murray (1977) and Ng (1980) showed that in most parametric models there exist Bayes predictive density estimates which are decision theoretically better than the maximum likelihood plug-in estimate. An important issue in predictive inference has always been to compare the performance of the class ℰ of point estimation (PE) based plug-in density estimates [Barndorff-Nielsen and Cox (1996)] with that of the optimal predictive density estimate. In parameter spaces of fixed dimension, large sample attributes of the predictive risk of efficient plug-in and Bayes density estimates have been studied by Komaki (1996), Hartigan (1998) and Aslan (2006).
The high-dimensional predictive density estimation problem studied in this paper is relevant to a number of contemporary applications, including data compression, sequential investment with side information and sports betting (SM).
Analogy with point estimation
Decision theoretic parallels between predictive density estimation under Kullback–Leibler loss and point estimation under quadratic loss have been explored in our Gaussian model by George, Liang and Xu (2006), Ghosh, Mergel and Datta (2008), Komaki (2004), Xu and Zhou (2011) and George, Liang and Xu (2012). For unconstrained parameter spaces Θ = ℝn, fundamental ideas in Gaussian point estimation theory can be extended to yield optimal predictive density estimates [Brown, George and Xu (2008), Fourdrinier et al. (2011), Komaki (2001)]. For ellipsoids, Xu and Liang (2010) established an analog of the theorem of Pinsker (1980) by proving that the class of all linear predictive density estimates [see (17)] is minimax optimal.
For sparse estimation, instead of parallels, we found contrasts. Minimax risks in the predictive density problem depend on r, but this dependence is not emphasized in the admissibility results in unrestricted spaces. As we have seen, under sparsity, construction of optimal minimax estimators requires the notion of diversification of the future risk over the interval [λf, λe] in a way strongly dependent on r. Thus, efficiency of the prediction schemes depends on careful calibration of the sparsity adjustment and the risk diversification mechanisms.
1.3. Further results
Other classes of estimators
The class of linear estimates ℒ consists of Bayes rules based on conjugate product normal priors. The resulting estimators
| (17) |
are still Gaussian but have larger variance than the future density p(y|θ, vy). We choose the name “linear” because the conjugate prior implies linearity of the posterior mean in X.
The class 𝒢 contains all product Gaussian density estimates. Clearly, 𝒢 contains both ℒ and ℰ, the latter introduced after (6). The minimax risks Rℒ(Θ) and R𝒢(Θ) are defined by restricting the infimum in (6) to ℒ and 𝒢, respectively.
We have seen after Theorem 1A that $R_{\mathcal{E}}(\Theta_n[s_n]) \sim (1 + r^{-1})\,R_N(\Theta_n[s_n])$. It turns out that extending ℰ to 𝒢 does not help, while, as is typical for sparse estimation, the class of linear estimators ℒ performs very poorly.
Proposition 1
Fix r ∈ (0, ∞). If sn/n → 0, then
Univariate prediction problem
The product structure of our high-dimensional model (1), estimators (13) and priors (11), along with concentration of measure, implies that many aspects of our multivariate results can be understood and proved through an associated univariate prediction problem.
In the univariate setting, assume that the past observation X|θ ~ N(θ, 1) and the future observation Y|θ ~ N(θ, r). Assume that X and Y are independent given θ. In addition, suppose that θ is random with distribution π(dθ), assumed to belong to
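$$m(\eta) = \{\pi \in \mathcal{P}(\mathbb{R}) : \pi(\theta \ne 0) \le \eta\}, \tag{18}$$
where 𝒫(ℝ) is the collection of all probability measures on ℝ.
A predictive density estimator is evaluated through its integrated risk defined at (3). The minimax risk for this univariate prediction problem is given by
$$\beta(\eta, r) = \inf_{\hat p}\,\sup_{\pi \in m(\eta)} B(\pi, \hat p), \tag{19}$$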
and we study sparsity through the asymptotic regime η → 0. Recall definition (10) of the scaled threshold λf = λf,η.
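Theorem 2
Fix r ∈ (0, ∞). As η → 0:
(a) the univariate minimax risk satisfies
$$\beta(\eta, r) \sim \frac{1}{1+r}\,\eta \log \eta^{-1} = \frac{\eta \lambda_f^2}{2r}; \tag{20}$$
(b) the two-point prior π[η, λf] is asymptotically least favorable: B(π[η, λf]) = β(η, r)(1 + o(1));
(c) the univariate threshold estimator p̂T,CL based on the cluster prior πCL[η, r] is asymptotically minimax: $\sup_{\pi \in m(\eta)} B(\pi, \hat p_{T,CL}) = \beta(\eta, r)(1 + o(1))$.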
1.4. Organization of the paper
The main results of the paper are multivariate, Theorems 1A, 1B and 1C. However, the main technical issues in the proofs are best handled in the univariate setting of Theorem 2, whose parts (a), (b) and (c) correspond to Theorems 1A, 1B and 1C, respectively. Section 2 has an overview: it first reviews some connections between the multivariate and univariate settings, then gives heuristic derivations for the lower and upper bounds of univariate Theorem 2. Sections 3 and 4 contain the technical proofs for the lower and upper bounds on the univariate minimax risk, Theorems 2B and 2C, respectively. Together, they complete the proof of Theorem 2. Proofs of the multivariate results in Theorems 1A, 1B and 1C are completed in Section 5. This section also contains a heuristic proof of Proposition 1, whose rigorous proof is presented in the supplementary material [Mukherjee and Johnstone (2015)].
Glossary
[The notation (6)+2 refers to text 2 lines after equation (6)].
Estimators: Bayes (2), Uniform prior (2)+2, Threshold (12), Multivariate product (13); Univariate .
Classes of estimators and multivariate minimax risks: all nonlinear N, RN (6), estimative ℰ, Rℰ (6)+2, “linear” ℒ, Rℒ (17), Gaussian , (17)+4.
Univariate minimax risk: β (19).
Parameter spaces: multivariate Θn[s] (5); univariate m(η) (18).
Priors: Univariate: two point π[η,λ] (9), cluster πCL[η,r] (14), Multivariate: (11).
Parameters: variance ratio r = vy/vx, oracle variance vw (7), sparsity η (9), thresholds λe, λf (10), cluster prior: overshoot a (15), νη (15)+2.
2. Proof overview and interpretation
2.1. Connections between multivariate and univariate settings
Many aspects of the multivariate theorem may be understood, and in part proved, through a discussion of the univariate prediction problem of Theorem 2. An obvious connection between the univariate and multivariate approaches runs as follows: suppose that a multivariate predictive estimator is built as a product of univariate components
$$\hat p(y|x) = \prod_{i=1}^n \hat p_1(y_i|x_i). \tag{21}$$
Suppose also that to a vector θ = (θi) we associate a univariate (discrete) distribution $\pi_n = n^{-1}\sum_{i=1}^n \delta_{\theta_i}$. Since the true multivariate future density p(Y|θ, r) is also a product of univariate components, it is then readily seen that the multivariate and univariate Bayes K–L risks are related by
$$\rho(\theta, \hat p) = n\,B(\pi_n, \hat p_1). \tag{22}$$
The sparsity condition Θn[sn] in the multivariate problem corresponds to requiring that the prior πn in the univariate problem satisfies
$$\pi_n\{\theta \ne 0\} \le s_n/n = \eta_n,$$
and thus belongs to the class m(η) defined in (18). Next, we outline the minimax risk calculations for the sparse predictive density estimation problem.
As a first illustration, to which we return later, consider the maximum risk of a product rule over Θn[sn]: using (22) and (3), we have
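$$\sup_{\theta \in \Theta_n[s_n]} \rho(\theta, \hat p) \le n\Big[(1 - \eta_n)\,\rho(0, \hat p_1) + \eta_n \sup_{\theta} \rho(\theta, \hat p_1)\Big]. \tag{23}$$
In the univariate problem, using π{θ ≠ 0} ≤ η for π ∈ m(η), we have the somewhat parallel bound
$$\sup_{\pi \in m(\eta)} B(\pi, \hat p_1) \le (1 - \eta)\,\rho(0, \hat p_1) + \eta \sup_{\theta} \rho(\theta, \hat p_1). \tag{24}$$
Consequently, a careful study of the two univariate quantities
$$\rho(0, \hat p_1) \quad \text{and} \quad \sup_{\theta} \rho(\theta, \hat p_1) \tag{25}$$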
is basic for upper bounds for both univariate and multivariate cases.
2.2. Theorem 2B: Univariate lower bound heuristics
To understand the appearance of λf in the minimax risks, we turn to a heuristic discussion of the lower bound, first in the univariate case.
We use the two-point priors (9) and the definition (19):
$$\beta(\eta, r) \ge B(\pi[\eta, \lambda]), \tag{26}$$
and look for a good bound for B(π[η, λ]) for a suitable choice of λ.
The key is a mixture representation for predictive risk of a Bayes estimator in terms of quadratic risk, where the weighted mixture is over noise levels v ∈ [vw, 1], with vw being the oracle variance (7). Brown, George and Xu (2008), Theorem 1, show that the predictive risk of the Bayes predictive density estimate p̂π is
$$\rho(\theta, \hat p_\pi) = \frac{1}{2} \int_{v_w}^{1} q(\theta, \hat\theta_{\pi,v}; v)\,\frac{dv}{v^2}, \tag{27}$$
where $q(\theta, \hat\theta_{\pi,v}; v)$ is the quadratic risk of the Bayes location estimate $\hat\theta_{\pi,v}$ for prior π when W ~ N(θ, v). In point estimation with quadratic loss, it is known [Johnstone (2013), Chapter 8], that as η → 0 an approximately least favorable prior in the class m(η) is given, for noise level v = 1, by the sparse two-point prior π[η, λe(η)] defined in (9) and (10). This prior has the remarkable property that points θ ≤ λe are “invisible” in the sense that even when θ is true, the Bayes estimator effectively estimates 0 rather than θ and so makes a mean squared error
$$q(\theta, \hat\theta_{\pi,1}; 1) \sim \theta^2. \tag{28}$$
Two issues arise as the noise level v varies. First, the region of invisibility will scale, becoming $[0, v^{1/2}\lambda_e]$ at scale v. As v varies in [vw, 1], the intersection of all regions of invisibility will be [0, λf], with λf as defined at (10). The second issue is that for a given prior π and predictive Bayes rule in (27), the Bayes rules $\hat\theta_{\pi,v}$ vary with v. We return to this second point in the next section; for now we can hope that for all v ∈ [vw, 1],
$$q(\lambda_f, \hat\theta_{\pi,v}; v) \sim \lambda_f^2, \tag{29}$$
and so, from mixture representation (27),
$$\rho(\lambda_f, \hat p_\pi) \sim \frac{\lambda_f^2}{2} \int_{v_w}^{1} \frac{dv}{v^2} = \frac{\lambda_f^2}{2r},$$
since the integral evaluates to $v_w^{-1} - 1 = r^{-1}$. From this we can conjecture that for π = π[η, λf] ∈ m(η),
$$B(\pi) \ge \eta\,\rho(\lambda_f, \hat p_\pi) \sim \frac{\eta \lambda_f^2}{2r}. \tag{30}$$
A full proof, with slightly modified definitions, is given in Section 3.
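The arithmetic behind this heuristic can be checked numerically. The sketch below assumes that the mixture representation has the form ρ(θ, p̂π) = ½ ∫ q(v) v−2 dv over v ∈ [vw, 1], where q(v) stands for the quadratic risk at noise level v; this is our reading of the Brown, George and Xu (2008) identity with vx = 1, and the helper names are ours.

```python
import math

def midpoint_integral(f, a, b, steps=20000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def predictive_risk_from_quadratic(q, r):
    """0.5 * integral over v in [v_w, 1] of q(v) / v^2, with v_w = r / (1 + r)."""
    vw = r / (1.0 + r)
    return 0.5 * midpoint_integral(lambda v: q(v) / v ** 2, vw, 1.0)

if __name__ == "__main__":
    r = 0.25
    # Quadratic risk of the MLE at noise level v is v: recovers 0.5 * log(1 + 1/r).
    print(predictive_risk_from_quadratic(lambda v: v, r), 0.5 * math.log(1 + 1 / r))
    # Constant quadratic risk lambda_f^2 = 8 at every noise level: recovers 8 / (2r).
    print(predictive_risk_from_quadratic(lambda v: 8.0, r), 8.0 / (2 * r))
```

The first check reproduces the constant risk of the uniform-prior predictive density, and the second shows how a quadratic error of size λf² at every noise level turns into the predictive risk λf²/(2r).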
2.3. Theorem 2C: Univariate upper bound heuristics
We now turn to a heuristic discussion of constructing a density estimate to show that the lower bound (30) is asymptotically correct. Pursuing the analogy with point estimation, we know that in that setting optimal estimators can be found within the family of hard thresholding rules $\hat\theta(x) = x I\{|x| > \lambda\}$. The natural analog for predictive density estimation would have the form
$$\hat p_T(y|x) = \begin{cases} \hat p_U(y|x) & \text{if } |x| > \lambda, \\ \hat p_{\pi_0}(y|x) & \text{if } |x| \le \lambda. \end{cases} \tag{31}$$
To see this, note that p̂U is the predictive Bayes rule corresponding to the uniform prior π(dθ) = dθ, which leads to the MLE in point estimation, while p̂π0 denotes the predictive Bayes rule corresponding to a prior concentrated entirely at 0, so that
$$\hat p_{\pi_0}(y|x) = r^{-1/2}\,\phi(y/r^{1/2}) \tag{32}$$
is a normal density with mean zero and variance r.
For the upper bound, according to definition (19), we seek an estimator for which $\sup_{\pi \in m(\eta)} B(\pi, \hat p) \le \frac{\eta\lambda_f^2}{2r}(1 + o(1))$ as η → 0. In bound (24), the first component is the risk at zero, ρ(0, p̂), and it turns out that this determines the possible values of the threshold λ in (31). Thus, in order that
$$\rho(0, \hat p_T) = o(\eta \lambda_f^2),$$
it follows [see (51)] that the threshold λ should be chosen as $\lambda = \lambda_e \sim (2 \log \eta^{-1})^{1/2}$ and not smaller.
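Turning to the second part of (25), we seek an estimator with
$$\sup_{\theta} \rho(\theta, \hat p) \le \frac{\lambda_f^2}{2r}\,(1 + o(1)). \tag{33}$$
We first argue that the hard thresholding analog (31) cannot work. Decompose the predictive risk of a univariate threshold estimator with threshold λe into contributions due to X above and below the threshold
$$\rho(\theta, \hat p_T) = E_\theta\big[L(\theta, \hat p_\pi(\cdot|X));\, |X| \le \lambda_e\big] + E_\theta\big[L(\theta, \hat p_U(\cdot|X));\, |X| > \lambda_e\big] = \rho_B(\theta) + \rho_A(\theta), \tag{34}$$
say. With the “zero prior,” the K–L loss is just quadratic in θ,
$$L(\theta, \hat p_{\pi_0}(\cdot|x)) = \frac{\theta^2}{2r},$$
and so, in particular, for θ ≤ λe we see that
$$\rho_B(\theta) = \frac{\theta^2}{2r}\,P_\theta(|X| \le \lambda_e) \tag{35}$$
could be nearly as large as $\lambda_e^2/(2r)$, and hence larger than our target risk $\lambda_f^2/(2r)$.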
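A small numerical check illustrates how far the zero-prior rule overshoots. The sketch assumes the below-threshold risk of the zero-prior rule is θ²/(2r) times the probability that |X| ≤ λe with X ~ N(θ, 1), and compares it with the target λf²/(2r). The evaluation point θ = λe − 2 is an arbitrary choice of ours for illustration.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def below_threshold_risk_zero_prior(theta, lam, r):
    """theta^2 / (2r) * P(|theta + Z| <= lam) for Z ~ N(0, 1)."""
    prob = norm_cdf(lam - theta) - norm_cdf(-lam - theta)
    return theta ** 2 / (2.0 * r) * prob

if __name__ == "__main__":
    eta, r = math.exp(-20), 0.25
    lam_e = math.sqrt(2.0 * math.log(1.0 / eta))
    target = (r / (1.0 + r)) * lam_e ** 2 / (2.0 * r)  # lambda_f^2 / (2r)
    print(target, below_threshold_risk_zero_prior(lam_e - 2.0, lam_e, r))
```

With these inputs the below-threshold risk is more than twice the target, which is why extra prior mass between λf and λe is needed.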
Bearing in mind the role that two-point priors play in the lower bound, it is perhaps natural to ask next if the threshold rule with π0 in (31) replaced by the (symmetrized) two-point prior π[η, λf] could cut off the growth of the quadratic θ2/(2r) for |θ| ≥ λf. The 3-point prior π3[η, λf] ∈ m(η) places probability η/2 at the two nonzero atoms at ±λf. Remarks in Section 3 show that π3[η, λf] is also asymptotically least favorable for the univariate prediction problem as η → 0. Indeed, it can be shown (see Section 4) that for this prior and for λf ≤|θ|≤ λe,
| (36) |
Consequently, the risk bound dips below $\lambda_f^2/(2r)$ for λf ≤ |θ| ≤ (1 + 2r)λf but increases thereafter. So, the threshold estimate based on π3[η, λf] is minimax optimal if λe < (1 + 2r)λf, which occurs if r is sufficiently large, r > 0.4196 in Table 1. However, the upper bound exceeds our target risk if r ≤ 0.4196. Section S.2 of the supplementary material [Mukherjee and Johnstone (2015)] shows rigorously that this estimate is indeed minimax suboptimal for low values of r.
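The boundary value 0.4196 can be recovered numerically. Assuming λf = vw1/2 λe with vw = r/(1 + r), the condition λe < (1 + 2r)λf is equivalent to (1 + 2r)² > 1 + 1/r, and the sketch below locates the crossing point by bisection (the function name and bracketing interval are ours).

```python
def k1_boundary(lo=0.1, hi=1.0, tol=1e-10):
    """Smallest r with (1 + 2r)^2 >= 1 + 1/r, found by bisection on [lo, hi]."""
    def gap(r):
        # Positive exactly when lambda_e < (1 + 2r) * lambda_f.
        return (1.0 + 2.0 * r) ** 2 - (1.0 + 1.0 / r)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(round(k1_boundary(), 4))
```

The root agrees with the last breakpoint in Table 1 to the four decimals shown there.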
As π3[η, λf] fails to produce minimax optimal density estimates for small r, the strategy then is to introduce extra support points |μk| ≤ λe into the prior, chosen to “pull down” the risk below $\lambda_f^2/(2r)$ whenever it would otherwise exceed this level. The schematic diagram in Figure 1 illustrates this bounding of the maximum risk. The extra support points added in [λf, λe] and [−λe, −λf] distribute the predictive risk across that range (“risk diversification”) and keep the maximum risk below $\lambda_f^2/(2r)$.
Fig. 1.

Schematic diagram of the risk of univariate threshold density estimates for θ ≥ 0. The dotted line is the risk of the density estimator based on the 3-point prior π3[η, λf]. The addition of appropriately spaced prior mass points (shown in red) up to λe pulls down the risk function of the cluster prior-based density estimate below $\lambda_f^2/(2r)$ until the effect of thresholding at λe takes over.
To prove that this works, we obtain upper bounds on ρB (θ) for by focusing, when , only on the prior support point μk. The main inequality is obtained in (50), namely,
where qk(θ) is a quadratic polynomial that is O(λf) on [μk, μk+1]. Putting together this and other bounds, we can then finally establish the uniform bound (33). The details are in Section 4.
3. Theorem 2B: Univariate lower bound proof
This section is devoted to a proof of the lower bound part of Theorem 2. The heuristic discussion of the last section indicated the importance of two-point sparse priors and the invisibility property (28). To formulate a precise statement about the upper limit of invisibility, we start with noise level 1 and bring in the positive solution μη of the overshoot equation (15), namely, . Here the “overshoot” parameter a = aη should satisfy both and aη = o(μη); we make the specific choice .
In preparation for the range of variance scales in mixture representation (27), we consider the collection of two-point priors π[η, μ] for 0≤μ≤μη. Using a temporary notation for this section, let be the Bayes rule for squared error loss for the prior π[η, μ]. The next result shows that when the true parameter is actually μ, and this nonzero support point , then the Bayes rule for π[η,μ] “gets it wrong” by effectively estimating 0 and making an error of size μ2, uniformly in .
Lemma 3
There exists εη ↘ 0 as η → 0 such that for all μ in [0,μη],
Proof
Using standard calculations for the two-point prior, the Bayes rule , with
| (37) |
Consequently,
where Z ~ N(0, 1), and from (37), .
Now, using definition (15) of μη, for 0 ≤ μ ≤ μη, we have
so that for 0 ≤ μ ≤μη,
say. For each fixed z, we have since , and so from the dominated convergence theorem we conclude that . □
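The invisibility phenomenon in Lemma 3 can be seen numerically. The sketch below uses the standard posterior calculation for the two-point prior (1 − η)δ0 + ηδμ with X ~ N(θ, 1): the Bayes rule is μ times the posterior probability of the atom μ. The values μ = λe − 2 and μ = λe + 3 are hypothetical stand-ins of ours, since the exact overshoot parameter is not reproduced here.

```python
import math

def posterior_mass_at_mu(x, eta, mu):
    """P(theta = mu | x) under the prior (1 - eta) delta_0 + eta delta_mu."""
    # log of ((1 - eta) / eta) * phi(x) / phi(x - mu)
    log_m = math.log((1.0 - eta) / eta) - x * mu + 0.5 * mu * mu
    if log_m > 0:
        e = math.exp(-log_m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(log_m))

def quadratic_risk_at_mu(eta, mu, half_width=8.0, steps=8000):
    """E_mu[(theta_hat(X) - mu)^2], by the midpoint rule over Z = X - mu."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * h
        err = mu * (1.0 - posterior_mass_at_mu(mu + z, eta, mu))
        total += err * err * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * h
    return total

if __name__ == "__main__":
    eta = math.exp(-20)
    lam_e = math.sqrt(2.0 * math.log(1.0 / eta))
    for mu in (lam_e - 2.0, lam_e + 3.0):
        print(mu, quadratic_risk_at_mu(eta, mu) / mu ** 2)
```

Below the threshold the ratio of the risk to μ² is close to 1, so the Bayes rule effectively estimates 0; well above the threshold the ratio is close to 0 and the atom is detected.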
With these preparations, we return to the lower bound in the prediction problem. As η → 0, an asymptotically least favorable distribution is given by a sparse two-point prior with the nonzero support point scaled using the oracle standard deviation $v_w^{1/2}$. We shall prove the following:
Lemma 4
Let μη be the positive solution to overshoot equation (15) with and consider the two-point prior π[η, νη]. Then as η → 0,
We note here that since aη = o(μη), the overshoot equation implies that
| (38) |
A stronger conclusion, used in the next section, also follows from the overshoot equation, namely,
| (39) |
Proof of Lemma 4
Recall (26) and (27) in the heuristic discussion. We now clarify the dependence on scale v of the Bayes rule in the mixture representation (27). Passing from noise level v to noise level 1 by dividing parameters and estimates by v1/2, we obtain the invariance relation
Now set θ = νη and substitute into (27) to obtain, for π = π[η, νη],
| (40) |
Now apply Lemma 3 with $\mu = v^{-1/2}\nu_\eta$, which is bounded above by $v_w^{-1/2}\nu_\eta = \mu_\eta$. For all v ∈ [vw, 1] we obtain
Putting this into the mixture representation, we get
Taking into account both (26) and (38), we have established the lemma.
Based on the discussion in Section 2, the above lemma establishes a lower bound on the asymptotic minimax risk β(η, r) in Theorem 2. Similarly, the symmetric 3-point prior
$$\pi_3[\eta, \nu_\eta] = (1 - \eta)\,\delta_0 + \frac{\eta}{2}\left(\delta_{\nu_\eta} + \delta_{-\nu_\eta}\right)$$
will also be asymptotically least favorable over m(η) as η → 0.
4. Theorem 2C: Univariate upper bound proof
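The upper bound on the predictive minimax risk β(η, r) is derived from the upper bound on the maximum Bayes risk of p̂T,CL over m(η). In this section we will prove the following lemma which along with Lemma 4 completes the proof of Theorem 2.
Lemma 5
For any r ∈ (0, ∞) we have, as η → 0,
$$\sup_{\pi \in m(\eta)} B(\pi, \hat p_{T,CL}) \le \frac{\eta \lambda_f^2}{2r}\,(1 + o(1)).$$
We consider a threshold predictive density estimate p̂T which uses the Bayes predictive density estimate p̂π from prior π below the threshold λe and p̂U above the threshold λe. We bound the maximum predictive risk over m(η):
$$\sup_{\pi \in m(\eta)} B(\pi, \hat p_T) \le (1 - \eta)\,\rho(0, \hat p_T) + \eta \sup_{\theta} \rho(\theta, \hat p_T). \tag{41}$$
Next, as in (34), we decompose the predictive risk of p̂T into contributions due to X above and below the threshold. We calculate explicit expressions for ρA and ρB. The predictive loss of p̂U (see Appendix A.2) is given by
$$L(\theta, \hat p_U(\cdot|x)) = a_{1r} + a_{2r}\,(x - \theta)^2, \tag{42}$$
with $a_{1r} = \tfrac{1}{2}\big[\log(1 + r^{-1}) - (1 + r)^{-1}\big]$ and $a_{2r} = \tfrac{1}{2}(1 + r)^{-1}$. Hence, the above threshold term is
$$\rho_A(\theta) = E_\theta\big[\{a_{1r} + a_{2r}(X - \theta)^2\}\, I\{|X| > \lambda_e\}\big]. \tag{43}$$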
As ρB (θ) depends on the prior π used below the threshold, we restrict our attention to the specific choice of the cluster prior. The risk functions of the hard threshold density estimate and that of can be easily derived from the calculations with the cluster prior.
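According to (58) in the Appendix, the Bayes predictive density for a discrete prior $\pi = \sum_k \pi_k \delta_{\mu_k}$ is given by
$$\hat p_\pi(y|x) = \frac{1}{m(x)} \sum_k \pi_k\,\phi(x - \mu_k)\, r^{-1/2}\phi\big((y - \mu_k)/r^{1/2}\big), \tag{44}$$
where $m(x) = \sum_k \pi_k \phi(x - \mu_k)$ denotes the marginal density of π. The K–L loss of p̂π is given by
$$L(\theta, \hat p_\pi(\cdot|x)) = E_\theta \log \frac{r^{-1/2}\phi\big((Y - \theta)/r^{1/2}\big)}{\hat p_\pi(Y|x)}.$$
A simple but informative upper bound for the K–L loss is obtained by retaining only the kth term in (44):
$$L(\theta, \hat p_\pi(\cdot|x)) \le \frac{(\theta - \mu_k)^2}{2r} + \frac{1}{2}\big[(x - \mu_k)^2 - x^2\big] + \log\frac{\pi_0}{\pi_k} + d(x), \tag{45}$$
where we have set d(x) = log[m(x)/(π0ϕ(x))].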
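The Bayes predictive density for a discrete prior is a posterior-weighted mixture of N(μk, r) densities, with weights proportional to πk ϕ(x − μk). The sketch below implements that mixture and the univariate threshold rule that switches to N(x, 1 + r) above a threshold; the function names, the three-atom prior and the threshold value in the usage lines are illustrative choices of ours, not the paper's cluster prior.

```python
import math

def normal_pdf(x, mean=0.0, var=1.0):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayes_predictive_discrete(y, x, atoms, weights, r):
    """Predictive density at y given x for a discrete prior with positive weights."""
    # Posterior weights are proportional to pi_k * phi(x - mu_k); use logs for stability.
    logs = [math.log(w) - 0.5 * (x - mu) ** 2 for mu, w in zip(atoms, weights)]
    top = max(logs)
    post = [math.exp(v - top) for v in logs]
    total = sum(post)
    return sum(p / total * normal_pdf(y, mu, r) for p, mu in zip(post, atoms))

def threshold_predictive(y, x, atoms, weights, r, lam):
    """Uniform-prior density N(x, 1 + r) above the threshold, Bayes density below."""
    if abs(x) > lam:
        return normal_pdf(y, x, 1.0 + r)
    return bayes_predictive_discrete(y, x, atoms, weights, r)

if __name__ == "__main__":
    atoms, weights = [-3.0, 0.0, 3.0], [0.05, 0.9, 0.05]
    print(threshold_predictive(0.0, 1.0, atoms, weights, 0.25, 2.45))  # below threshold
    print(threshold_predictive(4.0, 4.0, atoms, weights, 0.25, 2.45))  # above threshold
```

Working with log weights avoids a division by zero when x is far from every atom, which matters once support points extend toward λe.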
We are now ready to analyze the bound (41). We follow the steps recalled in the quadratic loss case [see Section S.4 of Mukherjee and Johnstone (2015)] and evaluate the predictive risk at the origin and the maximum risk of the threshold density estimate . This organization helps to make clear the new features of the predictive loss setting.
Risk at zero
It is easy to show that . First, from (43), we have
where qA(0) is defined in (S.4.2) and the above calculation follows by using and the quadratic risk-at-zero bound (S.4.4).
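For the below-threshold term, we set k = 0 in (45), note that μ0 = 0 and apply Jensen’s inequality to obtain
$$\rho_B(0) \le E_0\, d(X) \le \log E_0\!\left[\frac{m(X)}{\pi_0\,\phi(X)}\right].$$
Since $\int m(x)\,dx = 1$ and π0 = 1 − η, we obtain that
$$\rho_B(0) \le -\log(1 - \eta).$$
Consequently, ρB(0) = O(η) and so ρ(0, p̂T,CL) = O(ηλf). Note that the above calculations hold for any p̂T with π being a discrete prior in m(η).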
Maximum risk
From decomposition (41), our goal is to show that
| (46) |
We first isolate the main term in the contributions from ρA(θ) and ρB(θ). From (43), clearly ρA(θ) ≤ a1r + a2r = O(1), which does not contribute. We turn to
and returning to (45), we begin by claiming that for the final term . Indeed
| (47) |
For , we have
Since , we arrive at
| (48) |
The dependence of (45) on θ may then be seen by writing x = θ + z. The first two terms in (45) then take the form
while, after recalling that and that , the third term becomes
We may therefore rewrite (45) as
| (49) |
where the kth quadratic polynomial
Denote the last three terms of (49) by Jk(x, θ). From (16) and (48) we see that
Consequently, we obtain the key bound
| (50) |
Now we use the geometric structure of the support points μk, defined at (16). We bound mink qk(θ) above by considering the quadratic polynomial qk(θ) on Ik = [μk,μk+1] and observe that these 2K intervals cover the range (–λe –a,–λf) ∪ (λf, λe +a) of interest. See Figure 2. Note that qk(θ) achieves its maximum on Ik at both endpoints and that
Fig. 2.

Schematic diagram demonstrating the behavior of the quadratic polynomials qk(θ) in the interval [μ1, μK+1]. Here K = 4. The maximum of minkqk(θ) for θ ∈ [μ1, μK+1] is bounded by q1(μ1).
These maxima decrease with k and so are bounded by . Appealing now to bound (39), we have for λf < |θ| < λe + a,
Returning to (50), we now see that the last two terms are each and so the final bound (46) is proven. This completes the proof of Lemma 5.
These calculations apply to threshold density estimates based on Bayes estimates of discrete priors. In particular, for which is based on the 3-point prior π3[η, νη], we have K = 1 and the bound (36). Thus, the difference in this case is negligible when |θ| ≤ μ2.
Similarly, the asymptotic risk function of the hard threshold plug-in density estimate (for which K = 0 in our calculations above) exceeds the minimax risk β(η, r) for |θ| ∈ [λf, λe] and so is minimax suboptimal for any fixed r. Figure 3 shows the numerical evaluation of the risk functions for the different univariate threshold density estimates.
Fig. 3.

Numerical evaluation of the asymptotic risk ρB(θ) for r = 0.25 of univariate threshold density estimates: hard threshold plug-in estimate (red), the estimate based on the 3-point prior (green) and the cluster prior-based minimax optimal estimate (blue). The brown boxes show the nonzero support points of the cluster prior; the univariate asymptotic minimax risk and the threshold λe are denoted by dotted horizontal and vertical lines, respectively. The plot on the left has $\eta = e^{-20}$ (very high sparsity), λf = 2.83, λe = 6.32 and the right one has η = 0.05 (moderate sparsity), λf = 1.09, λe = 2.45.
Also, note that any threshold estimate with threshold size λ less than λe will be minimax suboptimal, as its risk at the origin will not be negligible as compared to β(η, r). By (34) and (43) we have
| (51) |
and so for any fixed ε > 0,
Thus, is suboptimal unless λ ≥ λe.
5. Theorem 1: Multivariate minimax risk
Here we will use the univariate minimax results developed in the previous sections to evaluate the asymptotic multivariate minimax risk Rn = RN(Θn[sn]) over the sparse parameter space Θn[sn].
5.1. Lower bound proof: Theorem 1B and an extension
We first prove a lower bound for the multivariate minimax risk under only the assumption that sn/n → 0—without requiring, as in Theorem 1B, that also sn → ∞. This is done using an “independent blocks” sparse prior, along the lines of Johnstone (2013), Chapter 8.6, that we will show to be asymptotically least favorable. This result establishes the lower bound half of Theorem 1a. At the end of the subsection, we prove Theorem 1B using the simpler i.i.d. prior.
Let πS(τ; m) denote a single spike prior of scale τ on ℝm: choose an index I ∈ {1, …, m} at random and set θ = τeI, where eI is a unit length vector in the Ith coordinate direction. We will use a scale τm = λm − log λm, which is somewhat smaller than λm = √(2 log m).
The independent blocks prior πIB on Θ[sn] is built by dividing {1, …, n} into sn contiguous blocks Bj, j = 1, …, sn, each of length m = mn = [n/sn]. Draw components θi in each block Bj according to an independent copy of πS(νm; m), where the scale is matched to the prediction setting. Finally, set θi = 0 for the remaining n − mnsn components. Thus, πIB is supported on Θ[sn] since any draw θ from πIB has exactly sn nonzero components.
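The construction can be illustrated by a short sampler. This is a minimal sketch, not the authors' code: the function name `sample_independent_blocks` and the argument `scale` (standing in for the spike height νm, whose value is not fixed here) are hypothetical.

```python
import random

def sample_independent_blocks(n, s_n, scale, rng=None):
    """Draw theta in R^n from an independent-blocks single-spike prior.

    `scale` stands in for the spike height (nu_m in the text); it is an
    input here and must be supplied by the caller.
    """
    rng = rng or random.Random()
    m = n // s_n                  # block length m_n = [n / s_n]
    theta = [0.0] * n             # trailing n - m * s_n entries stay zero
    for j in range(s_n):          # blocks B_1, ..., B_{s_n}
        i = rng.randrange(m)      # uniformly chosen index within block j
        theta[j * m + i] = scale  # single spike: theta = scale * e_I
    return theta

theta = sample_independent_blocks(n=1000, s_n=7, scale=3.0,
                                  rng=random.Random(0))
print(sum(t != 0.0 for t in theta))  # number of nonzero coordinates
```

Each block contributes exactly one nonzero coordinate, so every draw has exactly sn nonzero entries whenever `scale` is nonzero.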
The lower bound half of Theorem 1A follows from the following result, the analog of Theorem 1B for the independent blocks prior.
Theorem 6
Fix r ∈ (0, ∞). If sn/n → 0, then
Proof
Bounding maximum risk by Bayes risk and using the product structure shows that
| (52) |
Next, using to denote the Bayes risk for noise level v, the multivariate form of the connecting equation and scale invariance enable us to write
The next proposition, proved in Section S.5 of Mukherjee and Johnstone (2015), provides a uniform lower bound for the quadratic loss Bayes risk of a single spike prior. It is a multivariate analog of Lemma 3.
Proposition 7
Suppose that y ~ Nn(0, I). Set λn = √(2 log n) and τn = λn − log λn. Then there exists εn → 0 such that uniformly in τ ∈ [0, τn],
Noting that v ∈ [vw, 1] implies that , and then applying the proposition,
Combining this with (52) and the definition of νm, we obtain
| (53) |
Proof of Theorem 1B
Note that because of the product structure of the problem and the prior we have
which is asymptotically equal to RN(Θ[sn]), using the univariate Theorem 2 [cf. (20)] and
| (54) |
Also, as sn → ∞, by application of Chebyshev’s inequality and, hence, is an asymptotically least favorable prior under the conditions of Theorem 1B. □
5.2. Upper bound proof: Theorem 1C
First, an upper bound on RN(Θn[sn]) is derived based on the maximum risk of the multivariate product threshold density estimate defined in Theorem 1C. Using the product structure of the threshold estimate as well as that of the unknown future density
the risk of our multivariate threshold estimate simplifies to a sum of the coordinatewise risks of the respective univariate density estimates
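The additivity used here is the standard chain rule for Kullback–Leibler divergence between product densities. As a sketch in assumed notation (ρ for the K–L risk and p̂i for the univariate factors of the product estimate; neither symbol is taken from the displays of this section), if p̂(y|x) = ∏i p̂i(yi|xi) and the coordinates of X are independent, then

```latex
\begin{aligned}
\rho(\theta, \hat p)
  &= \mathbb{E}_{\theta} \int p(y \mid \theta, v_y)\,
     \log \frac{p(y \mid \theta, v_y)}{\hat p(y \mid X)} \, dy \\
  &= \sum_{i=1}^{n} \mathbb{E}_{\theta_i} \int p(y_i \mid \theta_i, v_y)\,
     \log \frac{p(y_i \mid \theta_i, v_y)}{\hat p_i(y_i \mid X_i)} \, dy_i
   \;=\; \sum_{i=1}^{n} \rho(\theta_i, \hat p_i).
\end{aligned}
```

Each summand depends on θ only through θi, so the multivariate maximum risk can be studied through the univariate risk function.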
Now, maximizing over θ ∈ Θn[sn], we have
From the univariate study, we know that , which makes negligible relative to
where we used (46). Thus, taking account also of (54), we have the desired upper bound on the minimax risk
| (55) |
Completion of Proof of Theorems 1A, 1B and 1C
As the lower bound (53) and upper bound (55) on Rn match asymptotically, the first order asymptotic minimax risk of Theorem 1A is achieved, and the proof of all parts is done.
5.3. Proof of Proposition 1
Estimates in ℒ and in the Gaussian class are products of the form (21) and so can be studied using the associated univariate problem and decomposition (23). It is shown in Appendix A.2 that
| (56) |
Thus, is infinite unless α = 1, that is, the uniform prior estimate , in which case . Thus,
In particular, Rℒ,n/Rn → ∞ when sn/n → 0.
We turn to the Gaussian class . Since , clearly . We give here a heuristic argument for the reverse inequality, which gives the idea for the rigorous proof given in Section S.3 of the supplementary material [Mukherjee and Johnstone (2015)]. From the decomposition (23), any near-optimal estimator in must have univariate risk at 0 bounded as follows:
| (57) |
Now from (60) we know that the risk at the origin for the univariate Gaussian density estimate is
which for any fixed choice of achieves its minimum at . Thus, for such an optimal choice of ,
and for this to satisfy (57), we must have for |x| ≤ λe(1 + o(1)). Thus, would approximately need to have the threshold structure (31), (32) for |x| ≤ λe and so the bound (35) shows that
Returning to decomposition (23), we can now see that , which completes the heuristic argument.
6. Discussion
Avoiding thresholding
The asymptotic minimax rules described in Theorems 1C and 2C are based on thresholding. It would be desirable to construct a prior π for which the Bayes predictive density in (2) is itself asymptotically minimax, without any use of the discontinuous thresholding operation.
Consider, then, a symmetric univariate prior π∞[η, r] whose support consists of the origin and an infinite number of equidistant clusters, each containing 2K points in the same spatial alignment as for πCL[η, r]:
where μjk = jλe + μk and for k = 2, …, K and γ = log η−1, we have qk = γ−k and .
Based on π∞[ηn, r], one can construct a multivariate prior using (11), which heuristic arguments indicate will not only be least favorable but also yield a minimax optimal density estimate. A detailed proof is forthcoming.
Approximate sparsity and other extensions
Starting from Johnstone (2013), Chapters 8 and 13, the ℓ0 sparsity results presented here can be extended to obtain minimax optimal predictive density estimates over weak and strong ℓp sparse parameter spaces. An interesting topic for future work is whether, as in point estimation [Donoho and Johnstone (1994)], the phenomena seen here can be generalized to a family of loss functions. Simple analogues of the connecting equations [Brown, George and Xu (2008), Theorem 1] between the predictive and quadratic PE regimes do not exist in those cases, though some of the decision theoretic parallels can still be proved, particularly for the ℓ2 loss [Gatsonis (1984)].
Supplementary Material
Acknowledgments
We thank the Associate Editor and three referees for constructive suggestions to shorten and improve the paper.
APPENDIX
A.1. Bayes density estimate for discrete priors
The posterior distribution for the discrete prior is given by
So, for the Bayes predictive density based on the prior π,
| (58) |
A.2. K–L risk for Gaussian and linear density estimates
The predictive risk of the univariate Gaussian density estimate is given by
where the expectation is over X ~ N(θ, 1) and Y ~ N(θ, r). Noting that and , we obtain
| (59) |
and the following expression for the K–L risk of members in :
| (60) |
Consider now “linear” estimators. Starting with the conjugate prior θ ~ N(0, α/(1 − α)) for 0 ≤ α ≤ 1, standard calculations show that the posterior density π(θ|x) is N(αx, α) and the predictive density , being the convolution of Gaussians, compare (2), is seen to be N(αx, r + α). Now, using and in (60), we get
The linear risk formula (56) now follows from the quadratic risk of αX. Next, we present some details about the risk of the particular linear estimate .
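The risk of the linear estimate N(αx, r + α) can be checked numerically. The sketch below does not reproduce displays (56), (59) or (60); it uses only the standard closed form for the K–L divergence between two univariate Gaussians together with E(αX − θ)² = α² + (1 − α)²θ² for X ~ N(θ, 1). The function names are hypothetical, and the Monte Carlo routine is included only as a sanity check on the closed form.

```python
import math
import random

def kl_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for univariate Gaussians."""
    return 0.5 * (math.log(var1 / var0) + var0 / var1 - 1.0
                  + (mu0 - mu1) ** 2 / var1)

def linear_risk(theta, alpha, r):
    """Closed-form K-L risk of N(alpha * x, r + alpha), X ~ N(theta, 1)."""
    d = r + alpha
    mse = alpha ** 2 + (1.0 - alpha) ** 2 * theta ** 2  # E(alpha*X - theta)^2
    return 0.5 * (math.log(d / r) + r / d - 1.0 + mse / d)

def mc_linear_risk(theta, alpha, r, n_draws=100_000, seed=0):
    """Monte Carlo average of the K-L loss over X ~ N(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(theta, 1.0)
        total += kl_gauss(theta, r, alpha * x, r + alpha)
    return total / n_draws

r = 0.25
for theta in (0.0, 2.0, 10.0):
    print(theta, linear_risk(theta, 1.0, r), linear_risk(theta, 0.5, r))
```

For α = 1 the closed form is constant in θ and equals ½ log(1 + r^{−1}), while for α < 1 it grows quadratically in θ, which is consistent with the maximum risk over an unbounded parameter set being infinite unless α = 1.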
Proof of (42)
The estimator is given by the N(x, 1 + r) distribution, and so from (59)
from which (42) is immediate.
Footnotes
Supported in part by NSF Grant DMS-09-06812 and NIH Grant R01 EB001988.
MSC2010 subject classifications. Primary 62C20; secondary 62M20, 60G25, 91G70.
Supplementary material to “Exact minimax estimation of the predictive density in sparse Gaussian models” (DOI: 10.1214/14-AOS1251SUPP; .pdf). The supplement [Mukherjee and Johnstone (2015)] contains a brief description of the relevance of the predictive density estimation problem in related application areas, along with the proof of the suboptimality of the univariate threshold density estimate (in Section S.2) and the details of the proof of Proposition 1 (in Section S.3). The arguments for the maximum quadratic risk of hard threshold point estimates are reviewed in Section S.4 and the proof of Proposition 7 is presented in Section S.5. Links to the R code used in producing Table 1 and Figure 3 are also provided.
Contributor Information
Gourab Mukherjee, Email: gourab@usc.edu, Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California 90089-0809, USA.
Iain M. Johnstone, Email: imj@stanford.edu, Department of Statistics, Sequoia Hall, 390 Serra Mall, Stanford University, Stanford, California 94305-4065, USA.
References
- Aitchison J. Goodness of prediction fit. Biometrika. 1975;62:547–554. MR0391353.
- Aitchison J, Dunsmore IR. Statistical Prediction Analysis. Cambridge Univ. Press; Cambridge: 1975. MR0408097.
- Aslan M. Asymptotically minimax Bayes predictive densities. Ann Statist. 2006;34:2921–2938. MR2329473.
- Barndorff-Nielsen OE, Cox DR. Prediction and asymptotics. Bernoulli. 1996;2:319–340. MR1440272.
- Bell RM, Cover TM. Competitive optimality of logarithmic investment. Math Oper Res. 1980;5:161–166. MR0571810.
- Brown L. Lecture notes on statistical decision theory. 1974. Available at http://www-stat.wharton.upenn.edu/~lbrown.
- Brown LD, George EI, Xu X. Admissible predictive density estimation. Ann Statist. 2008;36:1156–1170. MR2418653.
- Cover TM, Thomas JA. Elements of Information Theory. Wiley; New York: 1991. MR112280.
- Donoho DL, Johnstone IM. Minimax risk over lp-balls for lq-error. Probab Theory Related Fields. 1994;99:277–303.
- Donoho DL, Johnstone IM, Hoch JC, Stern AS. Maximum entropy and the nearly black object. J R Stat Soc Ser B Stat Methodol. 1992;54:41–81. MR1157714.
- Fourdrinier D, Marchand É, Righi A, Strawderman WE. On improved predictive density estimation with parametric constraints. Electron J Stat. 2011;5:172–191. MR2792550.
- Gatsonis CA. Deriving posterior distributions for a location parameter: A decision theoretic approach. Ann Statist. 1984;12:958–970. MR0751285.
- Geisser S. Predictive Inference: An Introduction. Monographs on Statistics and Applied Probability. Vol. 55. Chapman & Hall; New York: 1993. MR1252174.
- George EI, Liang F, Xu X. Improved minimax predictive densities under Kullback–Leibler loss. Ann Statist. 2006;34:78–91. MR2275235.
- George EI, Liang F, Xu X. From minimax shrinkage estimation to minimax shrinkage prediction. Statist Sci. 2012;27:82–94. MR2953497.
- Ghosh M, Mergel V, Datta GS. Estimation, prediction and the Stein phenomenon under divergence loss. J Multivariate Anal. 2008;99:1941–1961. MR2466545.
- Hartigan JA. The maximum likelihood prior. Ann Statist. 1998;26:2083–2103. MR1700222.
- Johnstone IM. Gaussian estimation: Sequence and wavelet models. 2013. Available at http://www-stat.stanford.edu/~imj.
- Komaki F. On asymptotic properties of predictive distributions. Biometrika. 1996;83:299–313. MR1439785.
- Komaki F. A shrinkage predictive distribution for multivariate normal observables. Biometrika. 2001;88:859–864. MR1859415.
- Komaki F. Simultaneous prediction of independent Poisson observables. Ann Statist. 2004;32:1744–1769. MR2089141.
- Larimore WE. Predictive inference, sufficiency, entropy and an asymptotic likelihood principle. Biometrika. 1983;70:175–181. MR0742987.
- McMillan B. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory. 1956;2:115–116.
- Mukherjee G. Sparsity and shrinkage in predictive density estimation. PhD thesis, Stanford Univ. 2013. Available at http://purl.stanford.edu/gm306wz2890.
- Mukherjee G, Johnstone IM. Supplement to “Exact minimax estimation of the predictive density in sparse Gaussian models”. 2015. doi: 10.1214/14-AOS1251SUPP.
- Murray GD. A note on the estimation of probability density functions. Biometrika. 1977;64:150–152. MR0448690.
- Ng VM. On the estimation of parametric density functions. Biometrika. 1980;67:505–506. MR0581751.
- Pinsker MS. Optimal filtration of square-integrable signals in Gaussian noise. Probl Inf Transm. 1980;16:120–133. Originally in Russian in Problemy Peredachi Informatsii 16 52–67. MR0624591.
- Xu X, Liang F. Asymptotic minimax risk of predictive density estimation for non-parametric regression. Bernoulli. 2010;16:543–560. MR2668914.
- Xu X, Zhou D. Empirical Bayes predictive densities for high-dimensional normal models. J Multivariate Anal. 2011;102:1417–1428. MR2819959.