Abstract
We study a class of non-parametric density estimators under Bayesian settings. The estimators are obtained by adaptively partitioning the sample space. Under a suitable prior, we analyze the concentration rate of the posterior distribution, and demonstrate that the rate does not directly depend on the dimension of the problem in several special cases. Another advantage of this class of Bayesian density estimators is that it can adapt to the unknown smoothness of the true density function, thus achieving the optimal convergence rate without artificial conditions on the density. We also validate the theoretical results on a variety of simulated data sets.
1. Introduction
In this paper, we study the asymptotic behavior of posterior distributions of a class of Bayesian density estimators based on adaptive partitioning. Density estimation is a building block for many other statistical methods, such as classification, nonparametric testing, clustering, and data compression.
With univariate (or bivariate) data, the most basic non-parametric method for density estimation is the histogram. In this method, the sample space is partitioned into regular intervals (or rectangles), and the density is estimated by the relative frequency of data points falling into each interval (rectangle). However, this method is of limited utility in higher dimensional spaces because the number of cells in a regular partition of a p-dimensional space grows exponentially with p, which makes the relative frequencies highly variable unless the sample size is extremely large. In this situation the histogram may be improved by adapting the partition to the data, so that larger rectangles are used in the parts of the sample space where data is sparse. Motivated by this consideration, researchers have recently developed several multivariate density estimation methods based on adaptive partitioning [13, 12]. For example, by generalizing the classical Pólya tree construction [7], the authors of [22] developed the Optional Pólya Tree (OPT) prior on the space of simple functions. Computational issues related to OPT density estimates were discussed in [13], where efficient algorithms were developed to compute the OPT estimate. The method performs quite well when the dimension is moderately large (from 10 to 50).
The purpose of the current paper is to address the following questions about such Bayesian density estimates based on partition learning. Question 1: what is the class of density functions that can be “well estimated” by partition-learning based methods? Question 2: at what rate does the posterior distribution concentrate around the true density as the sample size increases? Our main contributions lie in the following aspects:
We impose a suitable prior on the space of density functions defined on binary partitions and, under mild assumptions, calculate the posterior concentration rate with respect to the Hellinger distance. The rate is adaptive to the unknown smoothness of the true density.
For two-dimensional density functions of bounded variation, the posterior contraction rate of our method is $n^{-1/4}(\log n)^{3}$.
For Hölder continuous (one-dimensional case) or mixed-Hölder continuous (multi-dimensional case) density functions with regularity parameter β ∈ (0, 1], the posterior concentration rate is $n^{-\beta/(2\beta+1)}$ up to a logarithmic factor, whereas the minimax rate for one-dimensional β-Hölder continuous functions is $n^{-\beta/(2\beta+1)}$.
When the true density function is sparse, in the sense that its Haar wavelet coefficients satisfy a weak-ℓq (q > 1/2) constraint, we obtain an explicit posterior concentration rate determined by q rather than by the full dimension p.
We can use a computationally efficient algorithm to sample from the posterior distribution. We demonstrate the theoretical results on several simulated data sets.
1.1. Related work
An important feature of our method is that it can adapt to the unknown smoothness of the true density function. The adaptivity of Bayesian approaches has drawn great attention in recent years. In terms of density estimation, there are mainly two categories of adaptive Bayesian nonparametric approaches. The first category of work relies on basis expansion of the density function and typically imposes a random series prior [15, 17]. When the prior on the coefficients of the expansion is set to be normal [4], it is also a Gaussian process prior. In the multivariate case, most existing work [4, 17] uses tensor-product basis. Our improvement over these methods mainly lies in the adaptive structure. In fact, as the dimension increases the number of tensor-product basis functions can be prohibitively large, which imposes a great challenge on computation. By introducing adaptive partition, we are able to handle the multivariate case even when the dimension is 30 (Example 2 in Section 4).
Another line of work considers mixture priors [16, 11, 18]. Although the mixture distributions have good approximation properties and naturally lead to adaptivity to very high smoothness levels, they may fail to detect or characterize the local features. On the other hand, by learning a partition of the sample space, the partition based approaches can provide an informative summary of the structure, and allow us to examine the density at different resolutions [14, 21].
The paper is organized as follows. In Section 2 we provide more details of the density functions on binary partitions and define the prior distribution. Section 3 summarizes the theoretical results on posterior concentration rates. The results are further validated in Section 4 by several experiments.
2. Bayesian multivariate density estimation
We focus on density estimation problems in the p-dimensional Euclidean space. Let $(\Omega, \mathcal{B})$ be a measurable space and $f_0$ a compactly supported density function with respect to the Lebesgue measure μ. Let $Y_1, Y_2, \ldots, Y_n$ be a sequence of independent random variables distributed according to $f_0$. After translation and scaling, we can always assume that the support of $f_0$ is contained in the unit cube in $\mathbb{R}^p$; that is, we take $\Omega = \{(y_1, y_2, \ldots, y_p): y_l \in [0,1],\ l = 1, \ldots, p\}$. Let $\mathcal{F}$ denote the collection of all density functions on $(\Omega, \mu)$. Then $\mathcal{F}$ constitutes the parameter space in this problem. Note that $\mathcal{F}$ is an infinite-dimensional parameter space.
2.1. Densities on binary partitions
To address the infinite dimensionality of $\mathcal{F}$, we construct a sequence of finite-dimensional approximating spaces Θ1, Θ2, ⋯, ΘI, ⋯ based on binary partitions. With growing complexity, these spaces provide more and more accurate approximations to the initial parameter space $\mathcal{F}$. Here, we use a recursive procedure to define a binary partition with I subregions of the unit cube in $\mathbb{R}^p$. Let Ω = {(y1, y2, ⋯, yp): yl ∈ [0,1]} be the unit cube in $\mathbb{R}^p$. In the first step, we choose one of the coordinates yl and cut Ω into two subregions along the midpoint of the range of yl. That is, $\Omega = \Omega_1 \cup \Omega_2$, where $\Omega_1 = \{y \in \Omega: y_l \le 1/2\}$ and $\Omega_2 = \Omega \setminus \Omega_1$. In this way, we get a partition with two subregions. Note that the total number of possible partitions after the first step is equal to the dimension p. Suppose that after I − 1 steps of the recursion we have obtained a partition {Ω1, ⋯, ΩI} with I subregions. In the I-th step, the partition is further refined as follows:
Choose a region from Ω1, ⋯, ΩI. Denote it as Ωi0.
Choose one coordinate yl and divide Ωi0 into two subregions along the midpoint of the range of yl.
Such a partition obtained by I − 1 recursive steps is called a binary partition of size I. Figure 1 displays all possible two dimensional binary partitions when I is 1, 2 and 3.
Figure 1:
Binary partitions
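To make the recursive construction concrete, the following sketch (illustrative code of our own, not from the paper) builds a binary partition of size I by repeatedly splitting a randomly chosen subregion at the midpoint of a randomly chosen coordinate; each subregion is stored as a pair of lower and upper corner vectors.

```python
import numpy as np

def random_binary_partition(p, I, rng=None):
    """Generate one binary partition of [0,1]^p with I subregions.

    Each subregion is a rectangle stored as (lower, upper) corner vectors.
    At every step a region and a coordinate are chosen (uniformly at random
    here, purely for illustration) and the region is cut at the midpoint of
    its range along that coordinate.
    """
    rng = rng or np.random.default_rng()
    regions = [(np.zeros(p), np.ones(p))]        # start from the unit cube
    for _ in range(I - 1):
        i0 = rng.integers(len(regions))          # region Omega_{i0} to split
        lower, upper = regions.pop(i0)
        l = rng.integers(p)                      # coordinate y_l to cut along
        mid = 0.5 * (lower[l] + upper[l])        # midpoint of the range of y_l
        left_upper, right_lower = upper.copy(), lower.copy()
        left_upper[l], right_lower[l] = mid, mid
        regions += [(lower, left_upper), (right_lower, upper)]
    return regions

# Example: one binary partition of the unit square with 3 subregions.
for lo, hi in random_binary_partition(p=2, I=3, rng=np.random.default_rng(0)):
    print(lo, hi, "volume =", np.prod(hi - lo))
```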
Now, let
$$\Theta_I = \Big\{ f : f = \sum_{i=1}^{I} \frac{\theta_i}{|\Omega_i|} \mathbf{1}_{\Omega_i},\ \theta_i \ge 0,\ \sum_{i=1}^{I} \theta_i = 1,\ \{\Omega_1, \ldots, \Omega_I\} \text{ is a binary partition of } \Omega \text{ of size } I \Big\},$$
where |Ωi| is the volume of Ωi. Then ΘI is the collection of the density functions supported by the binary partitions of size I. These spaces constitute a sequence of approximating spaces (i.e. a sieve; see [10, 20] for background on sieve theory). Let $\Theta = \bigcup_{I=1}^{\infty} \Theta_I$ be the space containing all the density functions supported by binary partitions. Then Θ is an approximation of the initial parameter space $\mathcal{F}$ up to a certain approximation error, which will be characterized later.
We take the metric on $\mathcal{F}$, Θ and ΘI to be the Hellinger distance, defined as
$$\rho(f, g) = \left( \int_\Omega \big(\sqrt{f} - \sqrt{g}\big)^2 \, d\mu \right)^{1/2}. \tag{1}$$
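When two densities lie in the same ΘI and are supported on a common partition, the integral in (1) collapses to a finite sum over the subregions, because on Ωi the two densities take the constant values θi/|Ωi| and ηi/|Ωi|. A minimal sketch of this computation (our own illustration):

```python
import numpy as np

def hellinger_same_partition(theta, eta):
    """Hellinger distance between two piecewise-constant densities that
    share the partition {Omega_1, ..., Omega_I}.

    For f = sum_i theta_i/|Omega_i| 1_{Omega_i} and
        g = sum_i  eta_i /|Omega_i| 1_{Omega_i},
    the integral in (1) equals sum_i (sqrt(theta_i) - sqrt(eta_i))^2,
    independent of the region volumes.
    """
    theta, eta = np.asarray(theta, float), np.asarray(eta, float)
    return np.sqrt(np.sum((np.sqrt(theta) - np.sqrt(eta)) ** 2))

# Two weight vectors on the same 3-region partition.
print(hellinger_same_partition([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
```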
2.2. Prior distribution
An ideal prior Π on Θ should balance the approximation error and the complexity of Θ. The prior in this paper penalizes the size of the partition, in the sense that the probability mass assigned to ΘI is proportional to exp(−λI log I). Given a sample of size n, we restrict our attention to the spaces ΘI with I ≤ n/log n, because in practice we need enough samples within each subregion to obtain a meaningful estimate of the density. That is, Π(ΘI) ∝ exp(−λI log I) when I ≤ n/log n, and Π(ΘI) = 0 otherwise.
If we use TI to denote the total number of possible binary partitions of size I, then it is not hard to see that log TI ≤ c*I log I, where c* is a constant. Within each ΘI, the prior is uniform across all binary partitions. In other words, let $\mathcal{A}_I = \{\Omega_1, \ldots, \Omega_I\}$ be a binary partition of Ω of size I, and let $\mathcal{F}(\mathcal{A}_I)$ be the collection of piecewise constant density functions on this partition (i.e. those of the form displayed above); then
$$\Pi\big(\mathcal{F}(\mathcal{A}_I)\big) = \frac{\Pi(\Theta_I)}{T_I} \propto \frac{\exp(-\lambda I \log I)}{T_I}. \tag{2}$$
Given a partition $\mathcal{A}_I$, the weights θi on the subregions follow a Dirichlet distribution with all parameters equal to α (α < 1). That is, for $x_1, \ldots, x_I \ge 0$ with $\sum_{i=1}^{I} x_i = 1$,
$$\pi\big(\theta_1 = x_1, \ldots, \theta_I = x_I \mid \mathcal{A}_I\big) = \frac{\Gamma(I\alpha)}{\Gamma(\alpha)^{I}} \prod_{i=1}^{I} x_i^{\alpha - 1}. \tag{3}$$
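To make the prior concrete, here is a minimal sketch (our own illustration) of drawing the partition size I and the weight vector from the prior just described; sampling the partition itself uniformly over all TI binary partitions of size I is nontrivial and is not attempted here.

```python
import numpy as np

def sample_prior_size_and_weights(n, lam=1.0, alpha=0.5, rng=None):
    """Draw (I, theta) from the prior of Section 2.2.

    Pi(Theta_I) is proportional to exp(-lam * I * log I) for I <= n / log n
    and zero otherwise; given I, the weights follow a symmetric
    Dirichlet(alpha, ..., alpha) with alpha < 1.
    """
    rng = rng or np.random.default_rng()
    I_max = max(1, int(n / np.log(n)))
    I_vals = np.arange(1, I_max + 1)
    log_mass = -lam * I_vals * np.log(I_vals)       # penalty on the size I
    probs = np.exp(log_mass - log_mass.max())
    probs /= probs.sum()
    I = int(rng.choice(I_vals, p=probs))
    theta = rng.dirichlet(alpha * np.ones(I))       # weights on the I subregions
    return I, theta

I, theta = sample_prior_size_and_weights(n=1000, rng=np.random.default_rng(0))
print(I, theta.sum())
```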
Let Πn(·|Y1, ⋯, Yn) denote the posterior distribution. After integrating out the weights θi, we can compute the marginal posterior probability of $\mathcal{F}(\mathcal{A}_I)$:
$$\Pi_n\big(\mathcal{F}(\mathcal{A}_I) \mid Y_1, \ldots, Y_n\big) \propto \frac{\exp(-\lambda I \log I)}{T_I} \cdot \frac{\Gamma(I\alpha)}{\Gamma(\alpha)^{I}} \cdot \frac{\prod_{i=1}^{I} \Gamma(n_i + \alpha)}{\Gamma(n + I\alpha)} \cdot \prod_{i=1}^{I} \left(\frac{1}{|\Omega_i|}\right)^{n_i}, \tag{4}$$
where ni is the number of observations in Ωi. Under the prior introduced in [13], the marginal posterior distribution is:
| (5) |
while the maximum log-likelihood achieved by histograms on the partition $\mathcal{A}_I$ is:
$$\max_{f \in \mathcal{F}(\mathcal{A}_I)} \sum_{j=1}^{n} \log f(Y_j) = \sum_{i=1}^{I} n_i \log \frac{n_i}{n\,|\Omega_i|}. \tag{6}$$
From a model selection perspective, we may treat the histograms on each binary partition as a model of the data. When I ≪ n, asymptotically,
$$\log \Pi_n\big(\mathcal{F}(\mathcal{A}_I) \mid Y_1, \ldots, Y_n\big) = \max_{f \in \mathcal{F}(\mathcal{A}_I)} \sum_{j=1}^{n} \log f(Y_j) - \frac{I-1}{2}\log n + O_P(1). \tag{7}$$
This is to say, in [13], selecting the partition which maximizes the marginal posterior distribution is equivalent to applying the Bayesian information criterion (BIC) to perform model selection. However, if we allow I to increase with n, (7) no longer holds. But if we use the prior introduced in this section, then under a suitable growth condition on I, as n → ∞ we still have
| (8) |
From a model selection perspective, this is closer to the risk inflation criterion (RIC, [8]).
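As a concrete illustration of how the marginal posterior in (4) can be evaluated for a fixed partition, the following sketch (our own helper, not code from the paper) computes the unnormalized log score from the cell counts and volumes; the combinatorial factor 1/TI is omitted here, so the score is only comparable across partitions of the same size I.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_posterior(counts, volumes, lam=1.0, alpha=0.5):
    """Unnormalized log marginal posterior of a partition, following (4).

    counts[i]  = n_i, the number of observations falling in Omega_i
    volumes[i] = |Omega_i|
    The factor 1/T_I is dropped (it depends only on I, with
    log T_I <= c* I log I), so this is an illustrative score rather than
    the exact posterior probability.
    """
    counts = np.asarray(counts, dtype=float)
    volumes = np.asarray(volumes, dtype=float)
    n, I = counts.sum(), len(counts)
    score = -lam * I * np.log(I)                          # prior penalty on I
    score += gammaln(I * alpha) - I * gammaln(alpha)      # Dirichlet normalizer
    score += gammaln(counts + alpha).sum() - gammaln(n + I * alpha)
    score -= (counts * np.log(volumes)).sum()             # (1/|Omega_i|)^{n_i}
    return score
```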
3. Posterior concentration rates
We are interested in how fast the posterior probability measure concentrates around the true density f0. Under the prior specified above, the posterior probability is the random measure given by
$$\Pi_n\big(A \mid Y_1, \ldots, Y_n\big) = \frac{\int_A \prod_{j=1}^{n} f(Y_j)\, d\Pi(f)}{\int \prod_{j=1}^{n} f(Y_j)\, d\Pi(f)}.$$
A Bayesian estimator is said to be consistent if the posterior distribution concentrates on arbitrarily small neighborhoods of f0, with probability tending to 1 under $P_0^{n}$ (where P0 is the probability measure corresponding to the density function f0). The posterior concentration rate refers to the rate at which these neighborhoods shrink to zero while still possessing most of the posterior mass. More explicitly, we want to find a sequence ϵn → 0 such that for sufficiently large M,
$$\Pi_n\big(f: \rho(f, f_0) \ge M\epsilon_n \mid Y_1, \ldots, Y_n\big) \to 0 \quad \text{in } P_0^{n}\text{-probability}.$$
In [6] and [2], the authors demonstrated that it is impossible to find an estimator which works uniformly well for every $f \in \mathcal{F}$: for any estimator, there always exists some density in $\mathcal{F}$ for which it is inconsistent. Given this limitation, we have to restrict our attention to a subset of the original parameter space $\mathcal{F}$. Here, we focus on the class of density functions that can be well approximated by the ΘI’s. To be more rigorous, a density function f is said to be well approximated by elements of Θ if there exists a sequence $f_I \in \Theta_I$ satisfying $\rho(f_I, f) = O(I^{-r})$ for some r > 0. Let $\mathcal{F}^{r}$ denote the collection of these density functions. We will first derive the posterior concentration rate for elements of $\mathcal{F}^{r}$ as a function of r. For different function classes, this approximation rate r can be calculated explicitly. In addition to this, we also assume that f0 has finite second moment.
The following theorem gives the posterior concentration rate under the prior introduced in Section 2.2.
Theorem 3.1. Let Y1, ⋯, Yn be a sequence of independent random variables distributed according to f0, and let P0 be the probability measure corresponding to f0. Θ is the collection of p-dimensional density functions supported by the binary partitions as defined in Section 2.1. With the prior distribution introduced in Section 2.2, if $f_0 \in \mathcal{F}^{r}$, then the posterior concentration rate is $\epsilon_n = n^{-r/(2r+1)}$ up to a logarithmic factor in n.
The strategy for proving this theorem is to write the posterior probability that ρ(f, f0) ≥ Mϵn as
$$\Pi_n\big(f: \rho(f, f_0) \ge M\epsilon_n \mid Y_1, \ldots, Y_n\big) = \frac{\int_{\{f:\, \rho(f, f_0) \ge M\epsilon_n\}} \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)}\, d\Pi(f)}{\int \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)}\, d\Pi(f)}. \tag{9}$$
The proof employs the machinery developed in the landmark works [9] and [19]. We first obtain upper bounds for the terms in the numerator by dividing them into three blocks, which account for bias, variance, and the rapidly decaying prior respectively, and bound each block separately. Then we establish the prior thickness result, i.e., we bound from below the prior mass of a ball around the true density. Due to space constraints, the details of the proof are provided in the appendix.
This theorem carries two take-away messages: 1. The rate is adaptive to the unknown smoothness of the true density. 2. The posterior contraction rate is $n^{-r/(2r+1)}$ (up to a logarithmic factor), which does not directly depend on the dimension p. For some density functions, r may depend on p. But in several special cases, such as when the density function is spatially sparse or lies in a low-dimensional subspace, we will show that the rate is not affected by the full dimension of the problem.
In the following three subsections, we will calculate the explicit rates for three density classes. Again, all proofs are given in the appendix.
3.1. Spatial adaptation
First, we assume that the density concentrates spatially. Mathematically, this implies the density function satisfies a type of sparsity. In the past two decades, sparsity has become one of the most discussed types of structure under which we are able to overcome the curse of dimensionality. A remarkable example is that it allows us to solve high-dimensional linear models, especially when the system is underdetermined.
Let f be a p-dimensional density function and Ψ the p-dimensional Haar basis. We will work with $g = \sqrt{f}$ first; note that $g \in L_2([0,1]^p)$. Thus we can expand g with respect to Ψ as $g = \sum_{\psi \in \Psi} \langle g, \psi\rangle\, \psi$. We rearrange this summation by the size of the wavelet coefficients; in other words, we order the coefficients by decreasing magnitude,
$$|\theta|_{(1)} \ge |\theta|_{(2)} \ge |\theta|_{(3)} \ge \cdots,$$
where $|\theta|_{(k)}$ denotes the k-th largest coefficient in absolute value. The sparsity condition imposed on the density functions is that the decay of the wavelet coefficients follows a power law,
$$|\theta|_{(k)} \le C\, k^{-q}, \qquad k = 1, 2, \ldots, \tag{10}$$
where C is a constant.
We call such a constraint a weak−lq constraint. The condition has been widely used to characterize the sparsity of signals and images [1, 3]. In particular, in [5], it was shown that for two-dimensional cases, when q > 1/2, this condition reasonably captures the sparsity of real world images.
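The decay condition (10) can be inspected empirically for a discretized function by computing its Haar coefficients and sorting their magnitudes, which is essentially what the log-scale panel of Figure 2 displays. Below is a small sketch on a dyadic grid; the transform is a standard multilevel 2-D Haar transform, and the test function is an arbitrary toy example, not the density from the paper.

```python
import numpy as np

def haar_step(block):
    """One orthonormal Haar averaging/differencing step along axis 0."""
    avg = (block[0::2] + block[1::2]) / np.sqrt(2.0)
    dif = (block[0::2] - block[1::2]) / np.sqrt(2.0)
    return np.concatenate([avg, dif], axis=0)

def haar2d(a):
    """Multilevel orthonormal 2-D Haar transform of a 2^J x 2^J array."""
    a = np.asarray(a, dtype=float).copy()
    n = a.shape[0]
    while n > 1:
        sub = a[:n, :n]
        sub = haar_step(sub)          # transform the columns
        sub = haar_step(sub.T).T      # transform the rows
        a[:n, :n] = sub
        n //= 2
    return a

# Sorted magnitudes of the Haar coefficients of a toy 2-D bump on a 64 x 64 grid.
J = 6
x = (np.arange(2 ** J) + 0.5) / 2 ** J
f = np.exp(-20 * ((x[:, None] - 0.3) ** 2 + (x[None, :] - 0.7) ** 2))
mags = np.sort(np.abs(haar2d(f)).ravel())[::-1]
# A roughly linear plot of log(mags) against log(rank) indicates the
# power-law decay required in (10).
```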
Corollary 3.2 (Application to spatial adaptation). Suppose f0 is a p-dimensional density function satisfying condition (10). If we apply our approach to this type of density function, the posterior concentration rate takes the form given in Theorem 3.1, with r determined by q.
3.2. Density functions of bounded variation
Let Ω = [0, 1)² be a domain in $\mathbb{R}^2$. We first characterize the space BV(Ω) of functions of bounded variation on Ω.
For a vector $v \in \mathbb{R}^2$, the difference operator $\Delta_v$ along the direction v is defined by
$$\Delta_v(f, y) := f(y + v) - f(y).$$
For functions f defined on Ω, $\Delta_v(f, y)$ is defined whenever y ∈ Ω(v), where Ω(v) := {y : [y, y + v] ⊂ Ω} and [y, y + v] is the line segment connecting y and y + v. Denote by $e_l$, l = 1, 2, the two coordinate vectors in $\mathbb{R}^2$. We say that a function f ∈ L1(Ω) is in BV(Ω) if and only if
$$V_\Omega(f) := \sup_{h > 0}\, h^{-1} \sum_{l=1}^{2} \big\| \Delta_{h e_l}(f, \cdot) \big\|_{L_1(\Omega(h e_l))}$$
is finite. The quantity $V_\Omega(f)$ is the variation of f over Ω.
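The variation VΩ(f) can be approximated on a regular grid by replacing the supremum over h with a single small increment h = 1/N; a rough sketch of this finite-difference estimate (our own illustration, not a rigorous evaluation of the supremum):

```python
import numpy as np

def variation_estimate(f, N=256):
    """Finite-difference estimate of V_Omega(f) on Omega = [0,1)^2.

    Approximates h^{-1} * sum_l ||Delta_{h e_l}(f, .)||_{L1} with h = 1/N
    on an N x N grid of midpoints.
    """
    x = (np.arange(N) + 0.5) / N
    F = f(x[:, None], x[None, :])                   # grid evaluation of f
    d1 = np.abs(np.diff(F, axis=0)).sum() / N       # differences along e_1
    d2 = np.abs(np.diff(F, axis=1)).sum() / N       # differences along e_2
    return d1 + d2

# Example: f(y1, y2) = y1 + y2 has variation 2 over the unit square,
# and the estimate below is close to 2.
print(variation_estimate(lambda y1, y2: y1 + y2))
```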
Corollary 3.3. Assume that f0 ∈ BV(Ω). If we apply the Bayesian multivariate density estimator based on adaptive partitioning to estimate f0, the posterior concentration rate is $n^{-1/4}(\log n)^{3}$.
3.3. Hölder space
In the one-dimensional case, the class of Hölder functions with regularity parameter β is defined as follows: let κ be the largest integer smaller than β and denote by $f^{(\kappa)}$ the κ-th derivative of f; then f belongs to the Hölder class with parameter β if
$$\big| f^{(\kappa)}(x) - f^{(\kappa)}(y) \big| \le C\, |x - y|^{\beta - \kappa} \quad \text{for all } x, y.$$
In multi-dimensional cases, we introduce mixed-Hölder continuity. To simplify the notation, we give the definition when the dimension is two; it can easily be generalized to higher-dimensional cases. A real-valued function f on $[0,1]^2$ is called mixed-Hölder continuous with some nonnegative constant C and β ∈ (0, 1] if, for any $(x_1, y_1), (x_2, y_2) \in [0,1]^2$,
$$\big| f(x_2, y_2) - f(x_2, y_1) - f(x_1, y_2) + f(x_1, y_1) \big| \le C\, |x_1 - x_2|^{\beta}\, |y_1 - y_2|^{\beta}.$$
Corollary 3.4. Let f0 be a p-dimensional density function. If f0 is Hölder continuous (when p = 1) or mixed-Hölder continuous (when p ≥ 2) with regularity parameter β ∈ (0, 1], then the posterior concentration rate of the Bayes estimator is $n^{-\beta/(2\beta+1)}$ up to a logarithmic factor.
This result also implies that if f0 depends only on $\tilde{p}$ of the p variables ($\tilde{p} < p$), but we do not know in advance which variables these are, then the rate of this method is determined by the effective dimension $\tilde{p}$ of the problem, since the smoothness parameter r is a function of $\tilde{p}$ only. In the next section, we will use a simulated data set to illustrate this point.
4. Simulation
4.1. Sequential importance sampling
Each partition is obtained by recursively partitioning the sample space, so we can use the sequence of intermediate partitions $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_I$ to keep track of the path leading to $\mathcal{A}_I$. Let Πn(·) denote the posterior distribution Πn(·|Y1, ⋯, Yn) for simplicity, and let $\Pi_n^{I}(\cdot)$ be the posterior distribution conditioning on ΘI. Then $\Pi_n^{I}(\mathcal{A}_I)$ can be decomposed as
$$\Pi_n^{I}(\mathcal{A}_I) = \Pi_n^{I}(\mathcal{A}_1)\, \Pi_n^{I}(\mathcal{A}_2 \mid \mathcal{A}_1) \cdots \Pi_n^{I}(\mathcal{A}_I \mid \mathcal{A}_{I-1}).$$
The conditional distribution can be calculated as $\Pi_n^{I}(\mathcal{A}_{i+1} \mid \mathcal{A}_i) = \Pi_n^{I}(\mathcal{A}_{i+1}) / \Pi_n^{I}(\mathcal{A}_i)$. However, the computation of the marginal distribution $\Pi_n^{I}(\mathcal{A}_i)$ is sometimes infeasible, especially when both I and I − i are large, because we need to sum the marginal posterior probability over all binary partitions of size I for which the first i steps in the partition-generating path are the same as those of $\mathcal{A}_i$. Therefore, we adopt the sequential importance sampling algorithm proposed in [13]. In order to build a sequence of binary partitions, at each step the conditional distribution $\Pi_n^{I}(\mathcal{A}_{i+1} \mid \mathcal{A}_i)$ is approximated by $\Pi_n^{i+1}(\mathcal{A}_{i+1} \mid \mathcal{A}_i)$, the corresponding conditional under the posterior restricted to partitions of size i + 1. The obtained partition is assigned a weight to compensate for the approximation, where the weight is
$$w(\mathcal{A}_I) = \frac{\Pi_n^{I}(\mathcal{A}_I)}{\prod_{i=1}^{I-1} \Pi_n^{i+1}(\mathcal{A}_{i+1} \mid \mathcal{A}_i)}.$$
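A heavily simplified sketch of such a sequential importance sampler (our own illustration, not the implementation of [13]): at each step every single-cut refinement of the current partition is scored with the log_marginal_posterior helper sketched in Section 2.2, the next cut is sampled proportionally to these scores, and the importance weight is the target score of the final partition relative to the probability of the path that produced it.

```python
import numpy as np

def refine(regions, i0, l, mid):
    """Copy of `regions` with region i0 cut at `mid` along coordinate l."""
    lo, hi = regions[i0]
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[l], right_lo[l] = mid, mid
    return regions[:i0] + regions[i0 + 1:] + [(lo, left_hi), (right_lo, hi)]

def partition_score(Y, regions, lam, alpha):
    """Log marginal posterior score of a partition for data Y (regions half-open)."""
    counts = np.array([np.all((Y >= lo) & (Y < hi), axis=1).sum()
                       for lo, hi in regions], dtype=float)
    volumes = np.array([np.prod(hi - lo) for lo, hi in regions])
    return log_marginal_posterior(counts, volumes, lam, alpha)   # helper from Section 2.2

def sis_sample_partition(Y, I, lam=1.0, alpha=0.5, rng=None):
    """Grow one binary partition of size I by sequential importance sampling."""
    rng = rng or np.random.default_rng()
    n, p = Y.shape
    regions = [(np.zeros(p), np.ones(p))]
    log_q = 0.0                                   # log probability of the sampled path
    for _ in range(I - 1):
        cands, scores = [], []
        for i0 in range(len(regions)):
            lo, hi = regions[i0]
            for l in range(p):
                cand = refine(regions, i0, l, 0.5 * (lo[l] + hi[l]))
                cands.append(cand)
                scores.append(partition_score(Y, cand, lam, alpha))
        scores = np.asarray(scores)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        k = int(rng.choice(len(cands), p=probs))
        log_q += np.log(probs[k])
        regions = cands[k]
    # Unnormalized log importance weight: target score of the final partition
    # minus the log proposal probability of the path (we ignore, for simplicity,
    # that several cut sequences can lead to the same partition).
    log_w = partition_score(Y, regions, lam, alpha) - log_q
    return regions, log_w
```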
In order to make the data points as uniform as possible, we apply a copula transformation to each variable in advance whenever the dimension exceeds 3. More specifically, we estimate the marginal distribution of each variable Yj by our approach, obtaining an estimated cdf $\hat{F}_j$, and transform each point (y1, ⋯, yp) to $(\hat{F}_1(y_1), \ldots, \hat{F}_p(y_p))$. Another advantage of this transformation is that afterwards the sample space naturally becomes [0, 1]^p.
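A minimal version of this preprocessing step, with the empirical marginal cdf used as a simple stand-in for the partition-based marginal estimates (an assumption made here purely for illustration):

```python
import numpy as np
from scipy.stats import rankdata

def copula_transform(Y):
    """Map each coordinate of the sample to (0, 1) via its empirical marginal cdf.

    The paper estimates each marginal with the partition-based method itself;
    the empirical cdf below is only a convenient substitute.  Dividing the
    ranks by n + 1 keeps the transformed points strictly inside (0, 1).
    """
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    return np.column_stack([rankdata(Y[:, j]) / (n + 1) for j in range(Y.shape[1])])
```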
Example 1 Assume that the two-dimensional density function is
This density function both satisfies the spatial sparsity condition and belongs to the space of functions of bounded variation. Figure 2 shows the heatmap of the density function and its Haar coefficients. The last panel in the second plot displays the sorted coefficients with the abscissa in log-scale. From this we can clearly see that the power-law decay defined in Section 3.1 is satisfied.
Figure 2:
Heatmap of the density and plots of the 2-dimensional Haar coefficients. For the plot on the right, the left panel is the plot of the Haar coefficients from low resolution to high resolution up to level 6. The middle one is the plot of the sorted coefficients according to their absolute values. And the right one is the same as the middle plot but with the abscissa in log scale.
We apply the adaptive partitioning approach to estimate the density, and allow the sample size to increase from 10^2 to 10^5. In Figure 3, the left plot is the density estimation result based on a sample with 10000 data points. The right one plots the Kullback-Leibler (KL) divergence from the estimated density to f0 against the sample size in log-scale. The sample sizes are set to be 100, 500, 1000, 5000, 10^4, and 10^5. The linear trend in the plot validates the posterior concentration rates calculated in Section 3. The reason why we use the KL divergence instead of the Hellinger distance is that, for the densities involved here, we can show that the KL divergence and the Hellinger distance are of the same order, while the KL divergence is relatively easier to compute in our setting, since we can show that it is linear in the logarithm of the posterior marginal probability of a partition. The proof is provided in the appendix. For each fixed sample size, we run the experiment 10 times and estimate the standard error, which is shown by the lighter blue band in the plot.
Figure 3:
Plot of the estimated density and KL divergence against sample size. We use the posterior mean as the estimate. The right plot is on log-log scale, while the labels of x and y axes still represent the sample size and the KL divergence before we take the logarithm.
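For completeness, when the true density is known (as in these simulations), the KL divergence can also be approximated directly by Monte Carlo; the sketch below (our own illustration, not the closed-form relation to the marginal posterior mentioned above) estimates KL(f0 ‖ f̂) from draws from f0.

```python
import numpy as np

def kl_from_truth(f0_pdf, f0_sampler, fhat_pdf, m=100_000, rng=None):
    """Monte Carlo estimate of KL(f0 || fhat) = E_{f0}[log f0(Y) - log fhat(Y)].

    f0_pdf and fhat_pdf evaluate the densities at an (m, p) array of points,
    and f0_sampler(m, rng) draws m points from f0.
    """
    rng = rng or np.random.default_rng()
    Y = f0_sampler(m, rng)
    return np.mean(np.log(f0_pdf(Y)) - np.log(fhat_pdf(Y)))

# Toy 1-D check: f0 = Uniform[0,1), fhat piecewise constant (0.9 and 1.1 on the halves).
fhat = lambda y: np.where(y[:, 0] < 0.5, 0.9, 1.1)
print(kl_from_truth(lambda y: np.ones(len(y)),
                    lambda m, rng: rng.random((m, 1)),
                    fhat))
```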
Example 2 In the second example we work with a density function of moderately high dimension. Assume that the first five random variables Y1, ⋯, Y5 are generated from the following location mixture of Gaussian distributions:
the other components Y6, ⋯, Yp are independently and uniformly distributed. We run experiments for p = 5, 10, and 30. For a fixed p, we generate n ∈ {500, 1000, 5000, 10^4, 10^5} data points. For each pair of p and n, we repeat the experiment 10 times and calculate the standard error. Figure 4 displays the plot of the KL divergence vs. the sample size on log-log scale. The density function is continuously differentiable, and therefore satisfies the mixed-Hölder continuity condition. The effective dimension of this example is 5, and this is reflected in the plot: the slopes of the three lines, which correspond to the concentration rates under different dimensions, remain almost the same as we increase the full dimension of the problem.
Figure 4:
KL divergence vs. sample size. The blue, purple and red curves correspond to the cases when p = 5, p = 10 and p = 30 respectively. The slopes of the three lines are almost the same, implying that the concentration rate only depends on the effective dimension of the problem (which is 5 in this example).
5. Conclusion
In this paper, we study the posterior concentration rate of a class of Bayesian density estimators based on adaptive partitioning. We obtain explicit rates when the density function is spatially sparse, belongs to the space of bounded variation, or is Hölder continuous. For the last case, the rate is minimax up to a logarithmic term. When the density function is sparse or lies in a low-dimensional subspace, the rate will not be affected by the dimension of the problem. Another advantage of this method is that it can adapt to the unknown smoothness of the underlying density function.
Contributor Information
Linxi Liu, Department of Statistics, Columbia University.
Dangna Li, ICME, Stanford University.
Wing Hung Wong, Department of Statistics, Stanford University.
Bibliography
- [1] F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 34(2):584–653, 2006.
- [2] L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.
- [3] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
- [4] R. de Jonge and J. H. van Zanten. Adaptive estimation of multivariate functions using conditionally Gaussian tensor-product spline priors. Electronic Journal of Statistics, 6:1984–2001, 2012.
- [5] R. A. DeVore, B. Jawerth, and B. J. Lucier. Image compression through wavelet transform coding. IEEE Transactions on Information Theory, 38(2):719–746, 1992.
- [6] R. H. Farrell. On the lack of a uniformly consistent sequence of estimators of a density function in certain cases. The Annals of Mathematical Statistics, 38(2):471–474, 1967.
- [7] T. S. Ferguson. Prior distributions on spaces of probability measures. The Annals of Statistics, 2:615–629, 1974.
- [8] D. P. Foster and E. I. George. The risk inflation criterion for multiple regression. The Annals of Statistics, 22(4):1947–1975, 1994.
- [9] S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531, 2000.
- [10] U. Grenander. Abstract Inference. Probability and Statistics Series. John Wiley & Sons, 1981.
- [11] W. Kruijer, J. Rousseau, and A. van der Vaart. Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics, 4:1225–1257, 2010.
- [12] D. Li, K. Yang, and W. H. Wong. Density estimation via discrepancy based adaptive sequential partition. In 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.
- [13] L. Lu, H. Jiang, and W. H. Wong. Multivariate density estimation by Bayesian sequential partitioning. Journal of the American Statistical Association, 108(504):1402–1410, 2013.
- [14] L. Ma and W. H. Wong. Coupling optional Pólya trees and the two sample problem. Journal of the American Statistical Association, 106(496):1553–1565, 2011.
- [15] V. Rivoirard and J. Rousseau. Posterior concentration rates for infinite dimensional exponential families. Bayesian Analysis, 7(2):311–334, 2012.
- [16] J. Rousseau. Rates of convergence for the posterior distributions of mixtures of betas and adaptive nonparametric estimation of the density. The Annals of Statistics, 38(1):146–180, 2010.
- [17] W. Shen and S. Ghosal. Adaptive Bayesian procedures using random series priors. Scandinavian Journal of Statistics, 42(4):1194–1213, 2015.
- [18] W. Shen, S. T. Tokdar, and S. Ghosal. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika, 100(3):623–640, 2013.
- [19] X. Shen and L. Wasserman. Rates of convergence of posterior distributions. The Annals of Statistics, 29(3):687–714, 2001.
- [20] X. Shen and W. H. Wong. Convergence rate of sieve estimates. The Annals of Statistics, 22(2):580–615, 1994.
- [21] J. Soriano and L. Ma. Probabilistic multi-resolution scanning for two-sample differences. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):547–572, 2017.
- [22] W. H. Wong and L. Ma. Optional Pólya tree and Bayesian inference. The Annals of Statistics, 38(3):1433–1459, 2010.