Distribution Free Prediction Sets

Jing Lei; James Robins; Larry Wasserman

doi:10.1080/01621459.2012.751873

. Author manuscript; available in PMC: 2014 Sep 16.

Published in final edited form as: J Am Stat Assoc. 2013 Mar 15;108(501):278–287. doi: 10.1080/01621459.2012.751873

Distribution Free Prediction Sets

Jing Lei ¹, James Robins ², Larry Wasserman ³

PMCID: PMC4164906 NIHMSID: NIHMS570924 PMID: 25237208

Abstract

This paper introduces a new approach to prediction by bringing together two different nonparametric ideas: distribution free inference and nonparametric smoothing. Specifically, we consider the problem of constructing nonparametric tolerance/prediction sets. We start from the general conformal prediction approach and we use a kernel density estimator as a measure of agreement between a sample point and the underlying distribution. The resulting prediction set is shown to be closely related to plug-in density level sets with carefully chosen cut-off values. Under standard smoothness conditions, we get an asymptotic efficiency result that is near optimal for a wide range of function classes. But the coverage is guaranteed whether or not the smoothness conditions hold and regardless of the sample size. The performance of our method is investigated through simulation studies and illustrated in a real data example.

Keywords: prediction sets, conformal prediction, kernel density, distribution free, finite sample

1. INTRODUCTION

1.1 Prediction sets and density level sets

Suppose we observe iid data Y₁, … , Y_n ∈ ℝ^d from a distribution P . Our goal is to construct a prediction set C_n = C_n(Y₁, … , Y_n) ⊆ ℝ^d such that

ℙ (Y_{n + 1} \in C_{n}) \geq 1 - α

(1)

for a fixed 0 < α < 1, where ℙ = P ⁿ⁺¹ is the product probability measure over the (n + 1)-tuple (Y₁, … , Y_n₊₁). In general, we let ℙ denote P ⁿor P ⁿ⁺¹ depending on the context.

The prediction set problem has a natural connection to density level sets and density based clustering. Given a random sample from a distribution, it is often of interest to ask where most of the probability mass is concentrated. A natural answer to this question is the density level set L(t) = {y ∈ ℝ^d : p(y) ≥ t}, where p is the density function of P . When the distribution P is multimodal, a suitably chosen t will give a clustering of the underlying distribution (Hartigan 1975). When t is given, consistent estimators of L(t) and rates of convergence have been studied in detail (Polonik 1995; Tsybakov 1997; Baillo, Cuestas-Alberto & Cuevas 2001; Baillo 2003; Cadre 2006; Willett & Nowak 2007; Rigollet & Vert 2009; Rinaldo & Wasserman 2010). It often makes sense to define t implicitly using the desired probability coverage (1 − α):

t (α) = \inf {t : P (L (t)) \geq 1 - α} .

(2)

Let μ(·) denote the Lebesgue measure on ℝ^d. If the contour {y : p(y) = t(α)} has zero Lebesgue measure, then it is easily shown that

C^{(α)} ≔ L (t (α)) = \arg \min_{C} μ (C),

(3)

where the min is over {C : P (C) ≥ 1 − α}. Therefore, the density based clustering problem can sometimes be formulated as estimation of the minimum volume prediction set.

The study of prediction sets has a long history in statistics under various names such as “tolerance regions” and “minimum volume sets”; see, for example, Wilks (1941), Wald (1943), Fraser & Guttman (1956), Guttman (1970), Aichison & Dunsmore (1975), Chatterjee & Patra (1980), Di Bucchianico, Einmahl & Mushkudiani (2001), Cadre (2006), and Li & Liu (2008). Also related is the notion of quantile contours (Wei 2008). In this paper we study a newer method due to Vovk, Gammerman & Shafer (2005) which we describe in Section 2.

1.2 Main results

Let C_n be a prediction set. There are two natural criteria to measure its quality: validity and efficiency. By validity we mean that C_n has the desired coverage for all P (for example, in the sense of (1)). We measure the efficiency of C_n in terms of its closeness to the optimal (oracle) set C^(α). Since p is unknown, C^(α) cannot be used as an estimator but only as a benchmark in evaluating the efficiency. We define the loss function of C_n by

R (C_{n}) = μ (C_{n} Δ C^{(α)}),

(4)

where Δ, denotes the symmetric set difference. We say that C_nis efficient at rate r_n for a class of distributions P if, for every P ∈ P, ℙ(R(C_n) ≥ r_n) → 0 as n → ∞. Such loss functions have been used, for example, by Chatterjee & Patra (1980) and Li & Liu (2008) in nonparametric prediction set estimation and by Tsybakov (1997); Rigollet & Vert (2009) in density level set estimation.

In this paper, we construct C_n with the following properties.

Finite sample validity: C_n satisfies (1) for all P and n under no assumption other than iid.
Asymptotic efficiency: C_n is efficient at rate (log n/n)^cp,α for some constant c_p,α> 0 depending only on the smoothness of p.
For any y ∈ ℝ^d, the computational cost of evaluating 1(y ∈ C_n) is linear in n.

Our prediction set is obtained by combining the idea of conformal prediction (Vovk et al. 2005) with density estimation. We show that such a set, whose analytical form may be intractable, is sandwiched by two kernel density level sets with carefully tuned cut-off values. Therefore, the efficiency of the conformal prediction set can be approximated by those of the two kernel density level sets. As a by-product, we obtain a kernel density level set that always contains the conformal prediction set, and satisfies finite sample validity as well as asymptotic efficiency. In the efficiency argument, we refine the rates of convergence for plug-in density level sets at implicitly defined levels first developed in Cadre (2006); Cadre, Pelletier & Pudlo (2009), which may be of independent interest. We remark that, while the method gives valid prediction regions in any dimension, the efficiency of the region can be poor in higher dimensions.

1.3 Related work

The conformal prediction method (Vovk et al. 2005; Shafer & Vovk 2008) is a general approach for constructing distribution free, sequential prediction sets using exchangeability, and is usually applied to sequential classification and regression problems (Vovk, Nouretdinov & Gammerman 2009). We show that one can adapt the method to the prediction task described in (1). We describe this general method in Section 2 and our adaptation in Section 3.

In multivariate prediction set estimation, common approaches include methods based on statistically equivalent blocks (Tukey 1947; Li & Liu 2008) and plug-in density level sets (Chatterjee & Patra 1980; Hyndman 1996; Cadre 2006). In the former, an ordering function taking values in ℝ¹ is used to order the data points. Then one-dimensional tolerance interval methods (e.g. Wilks (1941)) can be applied. Such methods usually give accurate coverage but efficiency is hard to prove. Li & Liu (2008) proposed an estimator, with a high computational cost, using the multivariate spacing depth as the ordering function. Consistency is only proved when the level sets are convex. On the other hand, the plug-in methods (Chatterjee & Patra 1980) give provable validity and efficiency in an asymptotic sense regardless of the shape of the distribution, with a much easier implementation. As mentioned earlier, our estimator can be approximated by plug-in level sets, which are similar to those introduced in Chatterjee & Patra (1980); Hyndman (1996); Cadre (2006); Park, Huang & Ding (2010). However, these methods do not give finite sample validity.

Other important work on estimating tolerance regions and minimum volume prediction sets includes Polonik (1997), Walther (1997), Di Bucchianico et al. (2001), and Scott & Nowak (2006). Scott & Nowak (2006) does have finite sample results but does not have the guarantee given in Equation (1) which is the focus of this paper. Bandwidth selection for level sets is discussed in Samworth & Wand (2010). There is also a literature on anomaly detection which amounts to constructing prediction sets. Recent advances in this area include Zhao & Saligrama (2009), Sricharan & Hero (2011) and Steinwart, Hush & Scovel (2005).

In Section 2 we introduce conformal prediction. In Section 3 we describe a construction of prediction sets by combining conformal prediction with kernel density estimators. The approximation result (sandwich lemma) and asymptotic properties are also discussed. A method for choosing the bandwidth is given in Section 4. Simulation and a real data example are presented in Section 5. Some technical proofs are given the Appendix.

2. CONFORMAL PREDICTION

Let Y₁, …, Y_n be a random sample from P and let Y = (Y₁, …, Y_n). Fix some y ∈ ℝ^d and let us tentatively set Y_n₊₁= y. Let σ_i = σ({Y₁, … , Y_n₊₁}, Y_i) be a “conformity score” that measures how similar Y_i is to {Y₁, … , Y_n₊₁}. We only require that σ be symmetric in the entries of it first argument. We test the hypothesis H₀: Y_n₊₁= y by computing the p-value

π_{n} (y) = \frac{1}{n + 1} \sum_{j = 1}^{n + 1} 1 [σ_{j} \leq σ_{n + 1}] .

By symmetry, under H₀the ranks of the σ_i are uniformly distributed among {1/(n + 1), 2/(n + 1) … 1} and hence for any α ∈ (0, 1) we have $ℙ (π_{n} (y) \leq \tilde{α}) \leq α$ where $\tilde{α} = ⌊ (n + 1) α ⌋ ∕ (n + 1) \approx α$ . Let

{\hat{C}}^{(α)} (Y_{1}, \dots, Y_{n}) = {y : π_{n} (y) \geq \tilde{α}} .

(5)

It follows that under H₀we have $ℙ [Y_{n + 1} ∊ {\hat{C}}^{(α)} (Y_{1}, \dots, Y_{n})] \geq 1 - α$ . Based on the above discussion, any conformity measure σ can be used to construct prediction sets with finite sample validity, with no assumptions on P . The only requirement is exchangeability of the data. In this paper we will $σ_{i} = \hat{p} (Y_{i})$ where $\hat{p}$ is an appropriate density estimator.

3. CONFORMAL PREDICTION WITH KERNEL DENSITY

3.1 The method

For a given bandwidth h_n and kernel function K, let

{\hat{p}}_{n} (u) = \frac{1}{{nh}_{n}^{d}} \sum_{i = 1}^{n} K (\frac{u - Y_{i}}{h_{n}})

(6)

be the usual kernel density estimator. For now, we focus on a given bandwidth h_n. The theoretical and practical aspects of choosing h_n will be discussed in Subsection 3.3 and Section 4, respectively. For any given y ∈ ℝ^d, let Y_n₊₁= y and define the augmented density estimator

{\hat{p}}_{n}^{y} (u) = \frac{1}{h_{n}^{d} (n + 1)} \sum_{i = 1}^{n + 1} K (\frac{u - Y_{i}}{h_{n}}) = (\frac{n}{n + 1}) {\hat{p}}_{n} (u) + \frac{1}{h_{n}^{d} (n + 1)} K (\frac{u - Y}{h_{n}}) .

(7)

Now we use the conformity measure $σ_{i} = {\hat{p}}_{n}^{y} (Y_{i})$ and the p-value becomes

π_{n} (y) ≔ \frac{1}{n + 1} \sum_{i = 1}^{n + 1} 1 [{\hat{p}}_{n}^{y} (Y_{i}) \leq {\hat{p}}_{n}^{y} (y)] .

The resulting prediction set is ${\hat{C}}^{(α)} = {y : π_{n} (y) \geq \tilde{α}}$ . It follows that $ℙ [Y_{n + 1} ∊ {\hat{C}}^{(α)}] \geq 1 - α$ for all P and all n as required.

Figure 1 shows a one-dimensional example of the procedure. The top left plot shows a histogram of some data of sample size 20 from a two-component Gaussian mixture. The next three plots (top middle, top right, bottom left) show three kernel density estimators with increasing bandwidths as well as the conformal prediction sets derived from these estimators with α = 0.05. Every bandwidth leads to a valid set, but undersmoothing and oversmoothing lead to larger sets. The bottom middle plot shows the Lebesgue measure of the set as a function of bandwidth. The bottom right plot shows the estimator and prediction set based on the bandwidth whose corresponding conformal prediction set has the minimal Lebesgue measure.

Top left: histogram of some data. Top middle, top right, and bottom left show three kernel density estimators and the corresponding conformal prediction sets with bandwidth 0.1, 1, and 10. Bottom middle: Lebesgue measure as a function of bandwidth. Bottom right: estimator and prediction set obtained from the bandwidth with smallest prediction set.

3.2 An approximation

The conformal prediction set is expensive to compute since we have to compute π_n(y) for every y ∈ ℝ^d. Here we derive an approximation to ${\hat{C}}^{(α)}$ that can be computed quickly and maintains finite sample validity. Define the upper and lower level sets of density p at level t, respectively:

L t = {y : p (y) \geq t}, and L^{ℓ} (t) = {y : p (y) \leq t} .

(8)

The corresponding level sets of ${\hat{p}}_{n}$ are denoted L_n(t) and $L_{n}^{l} (t)$ , respectively. Let Y₍₁₎, … , Y_(n)be the reordered data so that ${\hat{p}}_{n} (Y_{(1)}) \leq \dots \leq {\hat{p}}_{n} (Y_{(n)})$ , and define the inner and outer sandwiching sets:

L_{n}^{-} = L_{n} ({\hat{p}}_{n} (Y_{(i_{n, α})})), L_{n}^{+} = L_{n} ({\hat{p}}_{n} (Y_{(i_{n, α})}) - {({nh}^{d})}^{- 1} ψ K)

where ψ_K = sup_u,u′ |K(u) − K(u′)|. Then we have the following “sandwich” lemma, whose prrof can be found in Appendix B.

Lemma 3.1 (Sandwich Lemma). Let ${\hat{C}}^{(α)}$ be the conformal prediction set based on the kernel density estimator. Assume that sup_u|K(u)| = K(0). Then

L_{n}^{-} \subseteq {\hat{C}}^{(α)} \subseteq L_{n}^{+} .

(9)

According to the sandwich lemma, $L_{n}^{+}$ also guarantees distribution free finite sample coverage and is easier to analyze. Moreover, it is much faster to compute since it avoids ever having to compute the kernel density estimator based on the augmented data. The inner set, $L_{n}^{-}$ , which is used as an estimate of C^(α) in related work such as in Chatterjee & Patra (1980); Hyndman (1996); Cadre et al. (2009), generally does not have finite sample validity. We confirm this through simulations in Section 5. Next we investigate the efficiency of these prediction sets.

3.3 Asymptotic properties

The inner and outer sandwiching sets $L_{n}^{-}$ and $L_{n}^{+}$ are plug-in estimators of density level sets of the form: $L_{n} (t_{n}^{(α)}) = {y : {\hat{p}}_{n} (y) \geq t_{n}^{(α)}}$ , where $t_{n}^{(α)} = {\hat{p}}_{n} (Y_{(i_{n, α})})$ for the inner set $L_{n}^{-}$ and $t_{n}^{(α)} = {\hat{p}}_{n} (Y_{(i_{n, α})}) - {(n h_{n}^{d})}^{- 1} ψ K$ for the outer set $L_{n}^{+}$ . Here we can view $t_{n}^{(α)}$ as an estimate of t(α). In Cadre et al. (2009) it is shown that, under regularity conditions of the density p, the plug-in estimators $t_{n}^{(α)}$ and L_n( $t_{n}^{(α)}$ ) are consistent with convergence rate $1 ∕ \sqrt{n h_{n}^{d}}$ for a range of h_n. Here we refine the results under more general conditions. We note that similar convergence rates for plug-in density level sets with a fixed and known level are obtained in Rigollet & Vert (2009). The extension to unknown levels is nontrivial and needs slightly stronger regularity conditions.

Intuitively speaking, the plug-in density level set L_n( $t_{n}^{(α)}$ ) is an accurate estimator of L(t^(α)) if ${\hat{p}}_{n}$ and $t_{n}^{(α)}$ are accurate estimators of p and t^(α), and p is not too flat at level t^(α). The following smoothness condition is assumed for p and K to ensure accurate density estimation.

A1. The density p is Hölder smooth of order β, with β > 0, and K is a valid kernel of order β. Hölder smoothness and valid kernels are standard assumptions for nonparametric density estimation. We give their definitions in Appendix A.

Remark: Assumption A1 can be relaxed in a similar way as in Rigollet & Vert (2009). The idea is that we only need to estimate the density very accurately in a neighborhood of ∂C^(α) (the boundary of the optimal set). Therefore, it would be sufficient to have the strong β′-Hölder smoothness condition near ∂C^(α), together with a weaker β′-Hölder smoothness condition (β^t ≤ β) everywhere else. For presentation simplicity, we stick with the global smoothness condition in A1.

To control the regularity of p at level t^(α), a common assumption is the γ-exponent condition, which was first introduced by Polonik (1995) and has been used by many others (see Tsybakov (1997) and Rigollet & Vert (2009) for example). In our argument, such an assumption is also related to estimating t^(α) itself. Specifically, we assume

A2. There exist constants 0 < c₁≤ c₂and ∈₀ > 0 such that

c_{1} {∣ ∊ ∣}^{γ} \leq ∣ P ({y : p (y) \leq t^{(α)} + ∊}) - α ∣ \leq c_{2} {∣ ∊ ∣}^{γ}, \forall - ∊_{0} \leq ∊ \leq ∊_{0} .

(10)

The gamma exponent condition requires that the density to be neither flat (for stability of level set) nor steep (for accuracy of $t_{n}^{(α)}$ ). As indicated in Audibert & Tsybakov (2007), A1 and A2 cannot hold simultaneously unless γ(1 ^ β) ≤ 1. In the common case γ = 1, this always holds.

Assumptions A1 and A2 extend those in Cadre et al. (2009), where β = γ = 1 is considered. The next theorem states the quality of cut-off values used in the sandwiching sets $L_{n}^{-}$ and $L_{n}^{+}$ .

Theorem 3.2. Let $t_{n}^{(α)} = {\hat{p}}_{n} (Y_{(i_{n, α})})$ , where ${\hat{p}}_{n}$ is the kernel density estimator given by eq. (6), and Y_(i)and i_n,α are defined as in Section 3.2. Assume that A1-A2 hold and choose $h_{n} ≍ {(\log n ∕ n)}^{1 ∕ (2 β + d)}$ . Then for any λ > 0, there exist constants A_λ, $A_{λ}^{'}$ depending only on p, K and α, such that

ℙ (∣ t_{n}^{(α)} - t^{(α)} ∣ \geq A_{λ} {(\frac{\log n}{n})}^{\frac{β}{2 β + d}} + A_{λ}^{'} {(\frac{\log n}{n})}^{\frac{1}{2 γ}}) = O (n^{- λ}) .

(11)

We give the proof of Theorem 3.2 in Appendix C. Theorem 3.2 is useful for establishing the convergence of the corresponding level set. Observing that $(n h_{n}^{d}) = o ({(\log n ∕ n)}^{β ∕ (2 β + d)})$ , it follows immediately that the cut-off value used in $L_{n}^{+}$ also satisfies (11). The next theorem, proved in Appendix C, gives the rate of convergence for our estimators.

Theorem 3.3. Under same conditions as in Theorem 3.2, for any λ > 0, there exist constants B_λ, $B_{λ}^{'}$ depending on p, K and α only, such that, for all $\hat{C} ∊ {{\hat{C}}^{(α)}, L_{n}^{-}, L_{n}^{+}}$ ,

ℙ (μ (\hat{C} Δ C^{(α)}) \geq B_{λ} {(\frac{\log n}{n})}^{\frac{β γ}{2 β + d}} + B_{λ}^{'} {(\frac{\log n}{n})}^{\frac{1}{2}}) = O (n^{- λ}) .

(12)

Remark: In the most common cases γ = 1, or β ≥ 1/2, γβ ≤ 1, the term (log n/n)^βγ/(2β+d) dominates the convergence rate. It matches the minimax risk rate of the plug-in density level set at a known level developed by Rigollet & Vert (2009). As a result, not knowing the cut-off value t^(α) does not change the difficulty of estimation. When βγ/(2β + d) > 1/2, the rate is dominated by (log n/n)^1/2 and does not agree with the known minimax lower bound and we do not know if the $\sqrt{\log n ∕ n}$ can be eliminated from the result.

Remark: The theorems above were stated for the optimal choice of bandwidth. The method is still consistent with similar arguments whenever $n h_{n}^{d} ∕ \log n \to \infty$ and h_n → 0, although the resulting rates will no longer be optimal.

Remark: The same conclusions in Theorems 3.2 and 3.3 hold under a weaker version of Assumption A1. To make this idea more precise, suppose the density function is only β-Hölder smooth in a neighborhood of the level set contour {y : p(y) = t^(α)}, but less smooth everywhere else. Then the same proofs of Theorems 3.2 and 3.3 can be used to obtain a slower rate of convergence. After establishing this first consistency result, one can apply the argument again, with the analysis confined in the smooth neighborhood, to obtain the desired rate of convergence. However, in the interest of space and clarity, we will prove our results only under the more restrictive smoothness assumptions that we have stated.

Algorithm 1: Tuning With Sample Splitting.

Input: sample Y = (Y₁, …, Y_n), prediction set estimator $\hat{C}$ level α, and candidate set H

Split the sample randomly into two equal sized subsamples, Y₁and Y₂.
Construct prediction sets ${{\hat{C}}_{h, 1} : h ∊ H}$ each at level 1 − α, using subsample Y₁.
Let $\hat{h}$ = arg min_hμ( ${\hat{C}}_{h, 1}$ ).
4. Return ${\hat{C}}_{\hat{h}, 2}$ , which is constructed using bandwidth $\hat{h}$ and subsample Y₂.

4. CHOOSING THE BANDWIDTH

As illustrated in Figure 1, the efficiency of ${\hat{C}}^{(α)}$ depends on the choice of h_n. The size of estimated prediction sets can be very large if the bandwidth is either too large or too small. Therefore, in practice it is desirable to choose a good bandwidth in an automatic and data driven manner. In kernel density estimation, the choice of bandwidth has been one of the most important topics and many approaches have been studied; see Loader (1999), Mammen, Miranda, Nielsen & Sperlich (2011), Samworth & Wand (2010) and references therein. Here we consider choosing the bandwidth by minimizing the volume of the conformal prediction set.

Let H = {h₁, … , h_m} be a grid of candidate bandwidths. We compute the prediction set for each h ∈ H and choose the one with the smallest volume. To preserve finite sample validity, we use sample splitting as described in Algorithm 1. We state the following result and omit its proof.

Proposition 4.1. If $\hat{C}$ satisfies finite sample validity for all h, then ${\hat{C}}_{\hat{h}, 2}$ , the output of the sample splitting tuning algorithm, also satisfies finite sample validity.

There are two justifications for choosing a bandwidth to make $μ ({\hat{C}}_{h})$ small. The first is pragmatic: in making predictions it seems desirable to have a small prediction set. The second reason is that minimizing μ(C) can potentially lead to good risk properties in terms of the loss μ(CΔC^(α)) as we now show. Recall that R(C) = μ(CΔC^(α)) and define ε(C) = μ(C) − μ(C^(α)). To avoid technical complications, we will assume in this section that the sample space is compact and focus on the simple case γ = 1 in condition A2.

Lemma 4.2. Let $\hat{C}$ be an estimator of C^(α). Then $ε (\hat{C}) \leq R (\hat{C})$ Furthermore, if $\hat{C}$ is finite sample valid and A2 holds with γ = 1, then $𝔼 (R (\hat{C})) \leq c_{1} {[𝔼 (ε (\hat{C}))]}^{1 ∕ 2}$ for some constant c₁.

The bandwidth selection algorithm makes $ε (\hat{C})$ small. The lemma gives at least us some assurance that making $ε (\hat{C})$ small will help to make $R (\hat{C})$ small. The proof of Lemma 4.2 is given in Appendix D. (A similar result can be found in Scott & Nowak (2006).) However, it is an open question whether $R (\hat{C})$ achieves the minimax rate.

5. NUMERICAL EXAMPLES

We first consider simulations on Gaussian mixtures and double-exponential mixtures in two and three dimensions. We apply the bandwidth selector presented in Section 4 to both ${\hat{C}}^{(α)}$ and $L_{n}^{+}$ . The bandwidth used for $L_{n}^{-}$ is the same as that for $L_{n}^{+}$ . Therefore, in the results it is possible to see if $L_{n}^{-}$ is bigger than ${\hat{C}}^{(α)}$ , or if ${\hat{C}}^{(α)}$ is bigger than $L_{n}^{-}$ because of different bandwidths and data splitting.

5.1 2D Gaussian mixture

We first consider a two-component Gaussian mixture in ℝ². The first component has mean ( $\sqrt{2 \log n} - 2, 0$ ) and variance diag(4, 1/4), and the second component has mean (0, $\sqrt{2 \log n} - 2$ ) and variance diag(1/4, 4) (see Figure 2). This choice of component centers is to make a moderate overlap between the data clouds from the two components. It makes the prediction set problem more challenging.

Conformal prediction set (left) and the convex hull of the multivariate spacing depth based tolerance set (right), with data from a two-component Gaussian mixture.

Table 1 shows the coverage and Lebesgue measure of the prediction set at level 0.9 (α = 0.1) over 100 repetitions. The coverage is excellent and the size of the set is close to optimal. Both the conformal set ${\hat{C}}^{(α)}$ and the outer sandwiching set $L_{n}^{+}$ give correct coverage regardless of the sample size. It is worth noting that the inner sandwiching set $L_{n}^{-}$ (corresponding to the method in Hyndman (1996); Park et al. (2010)) does not give the desired coverage, which suggests that decreasing the cut-off value in $L_{n}^{+}$ is not merely an artifact of proof, but a necessary tuning. The observed excess loss also reflects a rate of convergence that supports our theoretical results on the symmetric difference loss. We compare our method with the approach introduced by Zhao & Saligrama (2009) (Ĉ_ZS), where the prediction set is constructed by ranking the distances from each data point to its kth nearest neighbor. It has been reported that the choice of k is not crucial and we use k = 6. (We remark further on the choice of k at the end of this section.) This method is similar to ours but does not have finite sample validity. We observe that the finite sample coverage Ĉ_ZS is less than the nominal level.

Table 1.

The simulation results for 2-d Gaussian mixture with α = 0.1 over 100 repetitions (mean and one standard deviation). The Lebesgue measure of the ideal set ≈ 28.02.

		Coverage		Lebesgue Measure
	n = 100	n = 200	n = 1000	n = 100	n = 200	n = 1000
${\hat{C}}^{(α)}$	0.886 ± 0.005	0.897 ± 0.002	0.900 ± 0.001	35.6 ± 0.7	34.3 ± 0.3	31.1 ± 0.2
$L_{n}^{-}$	0.861 ± 0.004	0.882 ± 0.001	0.896 ± 0.001	29.8 ± 0.3	34.1 ± 0.2	32.2 ± 0.1
$L_{n}^{+}$	0.907 ± 0.003	0.900 ± 0.001	0.907 ± 0.001	36.2 ± 0.4	36.9 ± 0.2	34.1 ± 0.1
${\hat{C}}^{ZS}$	0.853 ± 0.004	0.867 ± 0.002	0.881 ± 0.001	28.1 ± 0.4	28.2 ± 0.2	28.0 ± 0.1

Open in a new tab

Figure 2 shows a typical realization of the estimators. In both panels, the dots are data points when n = 200. The left panel shows the conformal prediction set with sample splitting (blue solid curve), together with the inner and outer sandwiching sets (red dashed and green dotted curves, respectively). Also plotted is the ideal set C^(α) (grey dash-dotted curve). It is clear that all three estimated sets capture the main part of the ideal set, and they are mutually close. On the right panel we plot a realization of the depth based approach from Li & Liu (2008). This approach does not require any tuning parameter. However, it takes O(n^d⁺¹) time to evaluate 1(y ∈ Ĉ) for any single y. In practice it is recommended to compute the empirical depth only for all the data points and use the convex hull of all data points with high depth as the estimated prediction set. Such a convex hull construction misses the “L” shape of the ideal set. Moreover, in our implementation the running time of the kernel density method is much shorter even when n = 200.

Figure 3 shows the effect of bandwidth on the excess loss ε(Ĉ) = μ(Ĉ) − μ(Ĉ)^(α)) based on a typical implementation with n = 200, where the y axis is the Lebesgue measure of the estimated sets. We observe that for the conformal prediction set Ĉ^(α), the excess loss is stable for a wide range of bandwidths, especially that moderate undersmoothing does not harm the performance very much. An intuitive explanation is that the data near the contour are dense enough to allow for moderate undersmoothing. Similar phenomenon should be expected whenever α is not too small. Moreover, the selected bandwidth from the outer sandwiching set $L_{n}^{+}$ is close to that obtained from the conformal set. This observation may be of practical interest since it is usually much faster to compute $L_{n}^{+}$ .

Lebesgue measure of prediction sets versus bandwidth.

Remark: The Ĉ^ZS method requires a choice of k. We tried k = 2, 3, … , 20. The coverage increases with k but does not reach the nominal 0.9 level even when k = 20. The Lebesgue measure also increases with k and after k = 20, it becomes larger than the conformal region.

5.2 Further simulations

We now investigate the performance of our method using distributions with heavier tails and in higher dimensions. These simulations confirm that our method always give finite sample coverage, even when the density estimation is very challenging.

Double exponential distribution

In this setting, the distribution also has two balanced components. The first component has independent double exponential coordinates: Y (1) ~ 2 DoubleExp(1)+2.2 log n, Y(2) ~ 0.5 DoubleExp(1), where DoubleExp(1) has density exp(−|y|)/2. The second component has the two coordinates switched. The centering at 2.2 log n is chosen so that there is moderate overlap between data clouds from two components. The results are summarized in Table 2.

Table 2.

The simulation results for 2-d double exponential mixture with α = 0.1 over 100 repetitions (mean and one standard deviation). The Lebesgue measure of the ideal set ≈ 55.

		Coverage		Lebesgue Measure
	n = 100	n = 200	n = 1000	n = 100	n = 200	n = 1000
${\hat{C}}^{(α)}$	0.895 ± 0.005	0.916 ± 0.003	0.91 ± 0.002	77.7 ± 3	76.6 ± 1.6	62.3 ± 0.6
$L_{n}^{-}$	0.864 ± 0.006	0.897 ± 0.003	0.90 ± 0.001	66.5 ± 2.3	71.7 ± 1.2	58.3 ± 0.3
$L_{n}^{+}$	0.893 ± 0.005	0.912 ± 0.003	0.92 ± 0.001	86.1 ± 7.4	78.2 ± 1.3	65.0 ± 0.4
${\hat{C}}^{ZS}$	0.871 ± 0.004	0.892 ± 0.003	0.897 ± 0.001	58.2 ± 1.5	60.2 ± 1.0	55.2 ± 0.4

Open in a new tab

Three-dimensional data

Now we increase the dimension of data. The Gaussian mixture is the same as in the 2-dimensional setup, with the third coordinate being an independent Gaussian with mean zero and variance 1/4. The results are summarized in Table 3.

Table 3.

The simulation results for 3-d Gaussian mixture with α = 0.1 over 100 repetitions (mean and one standard deviation). The Lebesgue measure of the ideal set ≈ 62.

		Coverage		Lebesgue Measure
	n = 100	n = 200	n = 1000	n = 100	n = 200	n = 1000
${\hat{C}}^{(α)}$	0.917 ± 0.004	0.902 ± 0.003	0.900 ± 0.002	109 ± 2.4	89 ± 1.5	74 ± 0.7
$L_{n}^{-}$	0.875 ± 0.005	0.880 ± 0.003	0.889 ± 0.002	109 ± 2.1	98 ± 1.5	81 ± 0.7
$L_{n}^{+}$	0.892 ± 0.004	0.898 ± 0.003	0.916 ± 0.002	118 ± 2.2	109 ± 1.6	96 ± 0.9
${\hat{C}}^{ZS}$	0.869 ± 0.003	0.872 ± 0.002	0.879 ± 0.001	75 ± 1.3	69 ± 0.8	64 ± 0.4

Open in a new tab

Remark: In the above two simulation settings, the conformal prediction sets are much larger than the ideal (oracle) set unless the sample size is very large (n = 1000). This is because of the difficulty of multivariate nonparametric density estimation. In fact, the kernel density estimator may no longer lead to a good conformity score in this case. However, the theory of conformal prediction is still valid as reflected by the coverage. Thus, one may use other conformity scores such as the k-nearest-neighbor radius, for which a non-conformal version has been reported in Zhao & Saligrama (2009). Other possible choices include Gaussian mixture density estimators and semi-parametric models. These extensions will be pursued in future work.

5.3 Application to Breast Cancer Data

In this subsection we apply our method to the Wisconsin Breast Cancer Dataset (available at the UCI machine learning repository). The data contains nine features of 699 patients among which 241 are malignant and 458 are benign. Although this data set is commonly used to test classification algorithms, it has been used to test prediction region methods in the literature (see Park et al. (2010) for example). In this example we use prediction sets to tell malignant cases from benign ones. Formally, we assume that the benign cases are sampled from a common distribution, and we construct a 95% prediction set corresponding to the high density region of the underlying distribution. Although the prediction sets are constructed using only the benign cases, the efficiency of the estimated prediction/tolerance set can be measured not only in terms of its Lebesgue measure, but also in terms of the number of false negatives (i.e., the number of malignant cases covered by the prediction set). Ideally the prediction set shall contain most of benign cases but few malignant cases and hence can be used as a classifier.

In our implementation, the data dimension is reduced to two using standard principal components analysis. Such a dimension reduction simplifies visualization and has also been used in Park et al. (2010). If no dimension reduction is used, the data concentrates near a low dimensional subset of the space, and other conformity scores, such as the k nearest neighbors radius, can be used instead of kernel density estimation. To test the out of sample performance of our method, we randomly choose 100 out of 458 benign cases as testing data. The prediction region is constructed using only the remaining 358 benign cases with coverage level 0.95 and kernel density bandwidth 0.8. We repeat this experiment 100 times. A typical implementation is plotted in Figure 4. In Table 4 we report the mean coverage on the testing data as well as the malignant data. The resulting conformal prediction sets give the desired coverage for the benign cases and low false coverage for the malignant cases. Note that in this case the inner density level set $L_{n}^{-}$ is equivalent to the method proposed in Park et al. (2010), which in general does not have finite sample validity. In our experiment, the average out-of-sample coverage is slightly below the nominal level (by about one standard deviation). In this example, we see that the conformal methods (Ĉ^(α) and $L_{n}^{+}$ ) give similar empirical performance as the conventional non-conformal method ( $L_{n}^{-}$ ), with additional finite sample guarantee.

Prediction sets for benign instances. Crosses: benign; diamonds: malignant. Blue dashed curve: $L_{n}^{+}$ ; Black dotted curve: $L_{n}^{-}$ ; Red solid curve: Ĉ^(α).

Table 4.

Application to the breast cancer data with α = 0.05 over 100 repetitions. Reported are the mean and one estimated standard deviation of the empirical coverage on the testing benign data and the malignant data.

method	${\hat{C}}^{(α)}$	$L_{n}^{-}$	$L_{n}^{+}$
test sample coverage	0.9514 ± 0.0012	0.9488 ± 0.0012	0.9534 ± 0.0013
malignant data coverage	0.0141 ± 0.0002	0.0044 ± 0.0001	0.0420 ± 0.0004

Open in a new tab

APPENDIX A. DEFINITIONS

A.1 Hölder smooth functions

The Hölder class is a popular smoothness condition in nonparametric inferences (Tsybakov 2009, Section 1.2). Here we use the version given in (Rigollet & Vert 2009).

Let s = (s₁, …, s_d) be a d-tuple of non-negative integers and |s| = s₁+ … + s_d. For any x ∈ ℝ^d, let $x^{s} = x_{1}^{S_{1}} \dots x_{d}^{S_{d}}$ and D^sbe the differential operator:

D^{s} f = \frac{\partial^{∣ s ∣} f}{\partial x_{1}^{s_{1}} \dots \partial x_{d}^{s_{d}}} (x_{1}, \dots, x_{d}) .

Given β > 0, for any functions f that are [β] times differentiable, denote its Taylor expansion of

degree [βJ at x₀by

f_{x_{0}}^{(β)} (x) = \sum_{∣ s ∣ \leq β} \frac{{(x - x_{0})}^{s}}{s_{1}! \dots s_{d}!} D^{s} f (x_{0}) .

Definition A.1 (Hölder class). For constants β > 0, L > 0, define the Hölder class Σ(β, L) to be the set of lβJ-times differentiable functions on ℝ^d such that,

∣ f (x) - f_{x_{0}}^{(β)} (x) ∣ \leq L {∥ x - x_{0} ∥}^{β} .

(A.1)

A.2 Valid kernels

A standard condition on the kernel is the notion of β-valid kernels.

Definition A.2 (β-valid kernel). For any β > 0, function K : ℝ^d → ℝ¹is a β-valid kernel if (a) K is supported on [−1, 1]^d; (b) ∫ K = 1; (c) ∫ |K|^r < ∞, all r ≥ 1; (d) ∫ y^sK(y)dy = 0 for all 1 ≤ |s| ≤ β.

The last condition is interpreted elementwise. In the literature, β-valid kernels are usually used with Hölder class of functions to derive fast rates of convergence. The existence of univariate β-valid kernels can be found in Section 1.2 of Tsybakov (2009). A multivariate β-valid kernel can be obtained by taking direct product of univariate β-valid kernels.

APPENDIX B. PROOF OF LEMMA 3.2

Proof Lemma 3.1. Let $P_{n}^{y} = \frac{n}{n + 1} P_{n} + \frac{1}{n + 1} δ_{y}$ , where P_n is the empirical distribution defined by the sample Y = (Y₁, …, Y_n), and δ_y is the point mass distribution at y. Define functions

\begin{matrix} G (t) & = P (L^{ℓ} (t)), \\ G_{n} (t) & = P_{n} (L_{n}^{ℓ} (t)) = n^{- 1} \sum_{i = 1}^{n} 1 ({\hat{p}}_{n} (Y_{i}) \leq t), \\ G_{n}^{y} (t) & = P_{n}^{y} ({\hat{p}}_{n}^{y} (Y) \leq t) = \frac{1}{n + 1} (\sum_{i = 1}^{n} 1 ({\hat{p}}_{n}^{y} (Y_{i}) \leq t) + 1 ({\hat{p}}_{n}^{y} (y) \leq t)) . \end{matrix}

The functions G, G_n and $G_{n}^{y}$ defined above are the cumulative distribution function (CDF) of p(Y ) and its empirical versions with sample Y and aug(Y, y), respectively, where aug(Y, y) = (Y₁, … , Y_n, y). By (5) and Algorithm 1, the conformal prediction set can be written as

{\hat{C}}^{(α)} = {y \in ℝ^{d} : G_{n}^{y} ({\hat{p}}_{n}^{y} (y)) \geq \tilde{α}} .

The proof is based on a direct characterization of $L_{n}^{-}$ and $L_{n}^{+}$ . First, for each y ∈ $L_{n}^{-}$ and i ≤ i_n,α, we have

{\hat{p}}_{n}^{y} (y) - {\hat{p}}_{n}^{y} (Y_{(i)}) = \frac{n}{n + 1} ({\hat{p}}_{n} (y) - {\hat{p}}_{n} (Y_{(i)})) + \frac{1}{(n + 1) h^{d}} (K (0) - K (\frac{Y_{(i)} - y}{h})) \geq 0 .

As a result, $G_{n}^{y} ({\hat{p}}_{n}^{y} (y)) \leq i_{n, α} ∕ (n + 1) = \tilde{α}$ and hence y ∈ Ĉ^(α). Similarly, for each y ∈ $L_{n}^{+}$ and i ≥ i_n,α we have

\begin{matrix} {\hat{p}}_{h}^{y} (y) - {\hat{p}}_{h}^{y} (Y_{(i)}) & = \frac{n}{n + 1} ({\hat{p}}_{h} (y) - {\hat{p}}_{h} (Y_{(i)})) + \frac{1}{(n + 1) h^{d}} (K (0) - K (\frac{Y_{(i)} - y}{h})) \\ \leq \frac{n}{n + 1} ({\hat{p}}_{h} (y) - {\hat{p}}_{h} (Y_{(i_{n, α})})) + \frac{1}{(n + 1) h^{d}} ψ_{K} < 0 . \end{matrix}

Therefore, $G_{n}^{y} ({\hat{p}}_{n}^{y} (y)) \leq (i_{n, α 1}) ∕ (n - 1) = \tilde{α}$ and hence y ∈ Ĉ^(α).

APPENDIX C. PROOF OF THEOREM 3.3

The bias in the estimated cut-off level $t_{n}^{(α)}$ can be bounded in terms of two quantities:

V_{n} = \sup_{t > 0} ∣ P_{n} (L^{ℓ} (t)) - P (L^{ℓ} (t)) ∣, R_{n} = {∥ {\hat{p}}_{n} - p ∥}_{\infty} .

Here V_n can be viewed as the maximum of the empirical process P_n − P over a nested class of sets, and R_n is the L_∞loss of the density estimator. As a result, V_n can be bounded using the standard empirical process and VC dimension argument, and R_n can be bounded using the smoothness of p and kernel K with a suitable choice of bandwidth. Formally, we provide upper bounds for these two quantities through the following lemma.

Lemma C.1. Let V_n, R_nbe defined as above, then under Assumptions A1 and A2, for any λ > 0, there exist constants A_1,λand A_2,λdepending on λ only, such that,

ℙ (V_{n} \geq A_{1, λ} \sqrt[]{\frac{\log n}{n}}) = O (n^{- λ}), ℙ (R_{n} \geq A_{2, λ} {(\frac{\log n}{n})}^{\frac{β}{2 β + d}}) = O (n^{- λ}) .

Proof. First, it is easy to check that the class of sets {L^f(t) : t > 0} are nested with VC (Vapnik-Chervonenkis) dimension 2 and hence by classical empirical process theory (see, for example, van der Vaart & Wellner (1996), Section 2.14) , there exists a constant C₀> 0 such that for all η > 0

ℙ (V_{n} \geq η) \leq C_{0} n^{2} \exp (- n η^{2} ∕ 32) .

(A.2)

Let η = $A \sqrt{\log n ∕ n}$ , we have

ℙ (V_{n} \geq A \sqrt[]{\log n ∕ n}) \leq C_{0} n^{2} \exp (- A^{2} \log n ∕ 32) = C_{0} n^{- (A^{2} ∕ 32 - 2)} .

(A.3)

The first result then follows by choosing $A_{1, λ} = \sqrt{32 (λ + 2)}$ . Next we bound R_n. Let p̄ = 𝔼[p̂_n], and ∈_n= (log n/n)^β/(2β+d). By triangle inequality R_n ≤ || p̂_n − p̄||∞ + ||p̄ − p||_∞. Due to a result of Giné & Guillou (2002) (see also (49) in Chapter 3 of Prakasa Rao (1983)), under Assumption A1, there exist constants C₁, C₂and B₀> 0 such that have for all B ≥ B₀,

ℙ ({∥ {\hat{p}}_{n} - \overset{‒}{p} ∥}_{\infty} \geq B ∊_{n}) \leq C_{1} \exp (- C_{2} B^{2} \log (h_{n}^{- 1})) = C_{1} h_{n}^{C_{2} B^{2}} .

(A.4)

On the other hand, by Assumption A1, for some constant C₃

{∥ \overset{‒}{p} - p ∥}_{\infty} \leq C_{3} h_{n}^{β} .

(A.5)

In (A.3), (A.4) and (A.5) the constants C_i, i = 0, …, 3, depend on p and K only. Hence,

ℙ ({∥ \hat{p} - \overset{‒}{p} ∥}_{\infty} \geq (C_{3} + B) ∊_{n}) \leq C_{1} h_{n}^{C_{2} B^{2}},

(A.6)

which concludes the second part by choosing $A_{2, λ} = C_{3} + \sqrt{\frac{(2 β + d) λ}{C_{2}}}$ . □

Proof of Theorem 3.2. Let α_n = i_n,α/n = l(n + 1)αJ/n. We have |α_n − α| ≤ 1/n. Recall that the ideal level t^(α) can be written as t^(α) = G⁻¹(α) where the function G is the cumulative distribution function of p(Y ), as defined in Subsection 3.2. By the γ-exponent condition the inverse of G is well defined in a small neighborhood of α. When n is large enough, we can define t^(αn) as t^(αn) = G⁻¹(α_n).

Again, by the γ-exponent condition, $c_{1} ∣ t^{(α_{n})} - t^{(α)} ∣^{γ} \leq ∣ G (t^{(α_{n})}) - G (t^{(α)}) ∣ = ∣ α_{n} - α \leq \frac{1}{n}$ . Therefore, for n large enough

∣ t^{(α_{n})} - t^{(α)} ∣ \leq {(c_{1} n)}^{- 1 ∕ γ} .

(A.7)

Equation (A.7) allows us to switch to the problem of bounding $∣ t_{n}^{(α)} - t^{(α_{n})} ∣$ . Recall that $t_{n}^{(α)} = {\hat{p}}_{n} (Y_{(i_{n, α})})$ . The key of the proof is to observe that $t_{n}^{(α)} = G_{n}^{- 1} (α_{n}) : = \inf {t : G_{n} (t) \geq α_{n}}$ . Then it suffices to show that G⁻¹ and G⁻¹ are close at α_n. In fact, by definition of R_n we have for all $t q 0 : L^{l} (t - R_{n}) \subseteq L_{n}^{l} (t) \subseteq L^{l} (t + R_{n})$ . As a result, we have

P_{n} (L^{ℓ} (t - R_{n})) \leq P_{n} (L_{n}^{ℓ} (t)) \leq P_{n} (L^{ℓ} (t + R_{n})) .

By definition of V_n,

P (L^{ℓ} (t - R_{n})) - V_{n} \leq P_{n} (L_{n}^{ℓ} (t)) \leq P (L^{ℓ} (t + R_{n})) + V_{n} .

By definition of G and G_n, the above inequality becomes

G (t - R_{n}) - V_{n} \leq G_{n} (t) \leq G (t + R_{n}) + V_{n} .

Let W_n = R_n + (2V_n/c₁)^1/γ . Suppose n is large enough such that

{(\frac{c_{1}}{n})}^{\frac{1}{γ}} + {(\frac{2 A_{1, λ}}{c_{1}} \sqrt[]{\frac{\log n}{n}})}^{\frac{1}{γ}} < ∊_{0},

then on the event $V_{n} \leq A_{1, λ} \sqrt{\frac{\log n}{n}}$ ,

\begin{matrix} G_{n} (t^{(α_{n})} - W_{n}) & \leq G (t^{(α_{n})} - W_{n} + R_{n}) + V_{n} \\ = G (t^{(α_{n})} - {(2 V_{n} ∕ c_{1})}^{1 ∕ γ}) - G (t^{(α_{n})}) + α_{n} + V_{n} \\ \leq α_{n} - V_{n} < α_{n} . \end{matrix}

where the last inequality uses the left side of the γ-exponent condition. Similarly, G_n(t^(αn) + W_n) > α_n. Hence, for n large enough, if $V_{n} \leq A_{1, λ} \sqrt{(\log n) ∕ n}$ then,

∣ t_{n}^{(α)} - t^{(α_{n})} ∣ \leq W_{n} .

(A.8)

To conclude the proof, first note that ${(\frac{c_{1}}{n})}^{\frac{1}{γ}} = o ({(\frac{\log n}{n})}^{\frac{1}{2 γ}})$ . Then we can find constant $A_{λ}^{'}$ such that for all n large enough,

(A_{λ}^{'} - {(\frac{2 A_{1, λ}}{c_{1}})}^{\frac{1}{γ}}) {(\frac{\log n}{n})}^{\frac{1}{2 γ}} \geq {(\frac{c_{1}}{n})}^{\frac{1}{γ}} .

(A.9)

Let A_λ = A_2,λ. Combining equations (A.7) and (A.8), on the event

E_{n, λ} ≔ {R_{n} \leq A_{λ} {(\frac{\log n}{n})}^{\frac{β}{2 β + d}}, V_{n} \leq A_{1, λ} {(\frac{\log n}{n})}^{\frac{1}{2}}},

(A.10)

we have, for n large enough,

\begin{matrix} ∣ t_{n}^{(α)} - t^{(α)} ∣ & \leq ∣ t_{n}^{(α)} - t^{(α_{n})} ∣ + {(\frac{c_{1}}{n})}^{\frac{1}{γ}} \leq W_{n} + {(\frac{c_{1}}{n})}^{\frac{1}{γ}} \\ \leq R_{n} + (2 c_{c_{1}}^{- 1} {V_{n}}^{1 ∕ γ}) + {(\frac{c_{1}}{n})}^{\frac{1}{γ}} \\ \leq A_{λ} {(\frac{\log n}{n})}^{\frac{β}{2 β + d}} + {(\frac{2 A_{1, λ}}{c_{1}} \sqrt[]{\frac{\log n}{n}})}^{\frac{1}{γ}} + {(\frac{c_{1}}{n})}^{\frac{1}{γ}} \\ \leq A_{λ} {(\frac{\log n}{n})}^{\frac{β}{2 β + d}} + A_{λ}^{'} {(\frac{\log n}{n})}^{\frac{1}{2 γ}} \end{matrix}

where the second last inequality is from the definition of E_n,λ and the last inequality is from the choice of $A_{λ}^{'}$ . The proof is concluded by observing $ℙ (E_{n, λ}^{c}) = O (n^{- λ})$ , a consequence of Lemma C.1. □

Proof of Theorem 3.3. In the proof we write t_n for $ℙ (E_{n, λ}^{c}) = O (n^{- λ})$ as a generic estimate of t^(α) that satisfies (11). Observe that

μ (L_{n} (t_{n}) Δ C^{(α)}) = μ ({{\hat{p}}_{n} \geq t_{n}, p < t^{(α)}}) + μ ({{\hat{p}}_{n} < t_{n}, p \geq t^{(α)}}) .

(A.11)

Note that

{{\hat{p}}_{n} \geq t_{n}, p < t^{(α)}} \subseteq {t^{(α)} - ∣ t_{n} - t^{(α)} ∣ - R_{n} \leq p < t^{(α)}},

(A.12)

and Therefore

{{\hat{p}}_{n} < t_{n}, p \geq t^{(α)}} \subseteq {t^{(α)} < p \leq t^{(α)} + ∣ t^{(α)} - t_{n} ∣ + R_{n}} .

(A.13)

Suppose n is large enough such that

L_{n} (t_{n}) Δ C^{(α)} \subseteq {t^{(α)} - ∣ t_{n} - t^{(α)} ∣ - R_{n} < p \leq t^{(α)} + ∣ t^{(α)} - t_{n} ∣ + R_{n}} .

(A.14)

Suppose n is large enough such that

2 A_{2, λ} {(\frac{\log n}{n})}^{\frac{β}{2 β + d}} + A_{λ}^{'} {(\frac{\log n}{n})}^{\frac{1}{2 γ}} < (∊_{0} \land \frac{t^{(α)}}{2}),

where the constant A_2,λis defined as in Lemma C.1 and $A_{λ}^{'}$ is defined as in equation (A.9). Then on the event E_n,λ as defined in equation (A.10), applying Theorem 3.2 and condition (10) on the right hand side of (A.14) yields

\begin{matrix} μ (L_{n} (t_{n}) Δ C^{(α)}) & \leq \frac{P (L_{n} (t_{n}) Δ C^{(α)})}{t^{(α)} - ∣ t_{n} - t^{(α)} ∣ - R_{n}} \\ \leq \frac{2}{t^{(α)}} c_{2} {(2 A_{2, λ} {(\frac{\log n}{n})}^{\frac{β}{2 β + d}} + A_{λ}^{'} {(\frac{\log n}{n})}^{\frac{1}{2 γ}})}^{γ} \\ \leq B_{λ} {(\frac{\log n}{n})}^{\frac{β γ}{2 β + d}} + B_{λ}^{'} {(\frac{\log n}{n})}^{\frac{1}{2}}, \end{matrix}

(A.15)

where B_λ, $B_{λ}^{'}$ are positive constants depending only on p, K, α and γ. As a result, both $L_{n}^{-}$ and $L_{n}^{+}$ satisfies the claim of Theorem 3.3. The claim also holds for Ĉ^α by the sandwich Lemma. □

APPENDIX D. PROOFS OF LEMMA 4.3

Proof of Lemma 4.2. The first statement follows since

\begin{matrix} ε (C) & = μ (C) - μ (C_{*}) = μ (C \cap C_{*}^{c}) + μ (C \cap C_{*}) - [μ (C_{*} \cap C) + μ (C_{*} \cap C^{c})] \\ = μ (C \cap C_{*}^{c}) - μ (C_{*} \cap C^{c})] \leq μ (C \cap C_{*}^{c}) + μ (C_{*} \cap C^{c})] = R (C) . \end{matrix}

For the second statement, let I denote the indicator function for C and let I_*denote the indicator function for C_*. Note that, for all y, (I(y) − I_*(y))(λ − p(y)) ≥ 0. Let λ = λ_α and define W_∈ = {y : |p(y) − λ| > ∈}. From Assumption A2 with γ = 1 we have that μ(CΔC_*) ≤ μ((CΔC_*) ∩ W_E) + c∈ for some c > 0. Hence,

\begin{matrix} μ (C Δ C_{*}) & \leq μ ((C Δ C_{*}) \cap W_{∊}) + c ∊ \\ = \frac{1}{∊} \int_{W_{e}} ∣ I (y) - I_{*} (y) ∣ ∊ d μ (y) + c ∊ \\ \leq \frac{1}{∊} \int_{W_{e}} ∣ I (y) - I_{*} (y) ∣ ∣ λ - p (y) ∣ d μ (y) + c ∊ \\ \leq \frac{1}{∊} \int (I (y) - I_{*} (y)) (λ - p (y)) d μ (y) + c ∊ \\ = \frac{λ}{∊} [μ (C) - μ (C_{*})] - \frac{1}{∊} [P (C) - P (C_{*})] + c ∊ \\ = \frac{λ}{∊} ε (C) - \frac{1}{∊} [P (C) - (1 - α)] + c ∊ . \end{matrix}

Since 𝔼(P (C)) ≥ 1−α, if we take expected values of both sides we have that $𝔼 (R (C)) \leq \frac{λ}{∊} 𝔼 (ε (C)) + c ∊$ . The conclusion follows by setting $∊ = \sqrt{λ 𝔼 (ε (E)) ∕ c}$ .

Contributor Information

Jing Lei, Department of Statistics, Carnegie Mellon University Pittsburgh, PA 15213.

James Robins, Department of Biostatistics, Harvard University Boston, MA 02115.

Larry Wasserman, Department of Statistics and Machine Learning Department Carnegie Mellon University Pittsburgh, PA 15213.

REFERENCES

Aichison J, Dunsmore IR. Statistical Prediction Analysis. Cambridge Univ. Press; 1975. [Google Scholar]
Audibert J, Tsybakov A. Fast learning for plug-in classifiers. The Annals of Statistics. 2007;35:608–633. [Google Scholar]
Baillo A. Total error in a plug-in estimator of level sets. Statistics & Probability Letters. 2003;65:411–417. [Google Scholar]
Baillo A, Cuestas-Alberto J, Cuevas A. Convergence rates in nonparametric estimation of level sets. Statistics & Probability Letters. 2001;53:27–35. [Google Scholar]
Cadre B. Kernel estimation of density level sets. Journal of multivariate analysis. 2006;97:999–1023. [Google Scholar]
Cadre B, Pelletier B, Pudlo P. Clustering by estimation of density level sets at a fixed probability. 2009 manuscript. [Google Scholar]
Chatterjee SK, Patra NK. Asymptotically minimal multivariate tolerance sets. Calcutta Statist. Assoc. Bull. 1980;29:73–93. [Google Scholar]
Di Bucchianico A, Einmahl JH, Mushkudiani NA. Smallest nonparametric tolerance regions. The Annals of Statistics. 2001;29:1320–1343. [Google Scholar]
Fraser DAS, Guttman I. Tolerance regions. The Annals of Mathematical Statistics. 1956;27:162–179. [Google Scholar]
Giné E, Guillou A. Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l’Institut Henri Poincare (B) Probability and Statistics. 2002;38:907–921. [Google Scholar]
Guttman I. In: Statistical Tolerance Regions: Classical and Bayesian Griffin. Hartigan J, editor. London.; Clustering Algorithms John Wiley; New York: 1970. 1975. [Google Scholar]
Hyndman R. Computing and Graphing Highest Density Regions. The American Statistician. 1996;50:120–125. [Google Scholar]
Li J, Liu R. Multivariate spacings based on data depth: I. construction of nonparametric multivariate tolerance regions. The Annals of Statistics. 2008;36:1299–1323. [Google Scholar]
Loader C. Bandwidth selection: classical or plug-in? The Annals of Statistics. 1999;27:415–438. [Google Scholar]
Mammen E, Miranda MDM, Nielsen JP, Sperlich S. Do-Validation for kernel density estimation. Journal of the American Statistical Association. 2011;106:651–660. [Google Scholar]
Park C, Huang JZ, Ding Y. A Computable Plug-In Estimator of Minimum Volume Sets for Novelty Detection. Operations Research. 2010;58:1469–1480. [Google Scholar]
Polonik W. Measuring mass concentrations and estimating density contour clusters - an excess mass approach. The Annals of Statistics. 1995;23:855–881. [Google Scholar]
Polonik W. Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications. 1997;69(1):1–24. [Google Scholar]
Prakasa Rao B. Nonparametric Functional Estimation. Academic Press; 1983. [Google Scholar]
Rigollet P, Vert R. Optimal rates for plug-in estimators of denslty level sets. Bernoulli. 2009;14:1154–1178. [Google Scholar]
Rinaldo A, Wasserman L. Generalized density clustering. The Annals of Statistics. 2010;38:2678–2722. [Google Scholar]
Samworth RJ, Wand MP. Asymptotics and optimal bandwidth selection for highest density region estimation. The Annals of Statistics. 2010;38:1767–1792. [Google Scholar]
Scott CD, Nowak RD. Learning Minimum Volume Sets. Journal of Machine Learning Research. 2006;7:665–704. [Google Scholar]
Shafer G, Vovk V. A tutorial on conformal prediction. Journal of Machine Learning Research. 2008;9:371–421. [Google Scholar]
Sricharan K, Hero A. Efficient anomaly detection using bipartite k-NN graphs. In: Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger K, editors. Advances in Neural Information Processing Systems. Vol. 24. 2011. pp. 478–486. [Google Scholar]
Steinwart I, Hush D, Scovel C. A Classification Framework for Anomaly Detection. Journal of Machine Learning Research. 2005;6:211–232. [Google Scholar]
Tsybakov A. On nonparametric estimation of density level sets. The Annals of Statistics. 1997;25:948–969. [Google Scholar]
Tsybakov A. Introduction to nonparametric estimation. Springer; 2009. [Google Scholar]
Tukey J. Nonparametric estimation. II. Statistical equivalent blocks and multivarate tolerance regions,” The Annals of Mathematical Statistics. 1947;18:529–539. [Google Scholar]
van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer; 1996. [Google Scholar]
Vovk V, Gammerman A, Shafer G. Algorithmic Learning in a Random World. Springer; 2005. [Google Scholar]
Vovk V, Nouretdinov I, Gammerman A. On-line preditive linear regression. The Annals of Statistics. 2009;37:1566–1590. [Google Scholar]
Wald A. An extension of Wilks method for setting tolerance limits. The Annals of Mathematical Statistics. 1943;14:45–55. [Google Scholar]
Walther G. Granulometric Smoothing. The Annals of Statistics. 1997;25(6):2273–2299. [Google Scholar]
Wei Y. An approach to multivariate covariate-dependent quantile contours with application to bivariate conditional growth charts. Journal of the American Statistical Association. 2008;103(481):397–409. [Google Scholar]
Wilks S. Determination of sample sizes for setting tolerance limits. The Annals of Mathematical Statistics. 1941;12:91–96. [Google Scholar]
Willett R, Nowak R. Minimax optimal level-set estimation. IEEE Transactions on Image Processing. 2007;16:2965–2979. doi: 10.1109/tip.2007.910175. [DOI] [PubMed] [Google Scholar]
Zhao M, Saligrama V. Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A, editors. Anomaly Detection with Score functions based on Nearest Neighbor Graphs. Advances in Neural Information Processing Systems. 2009;22:2250–2258. [Google Scholar]

[R1] Aichison J, Dunsmore IR. Statistical Prediction Analysis. Cambridge Univ. Press; 1975. [Google Scholar]

[R2] Audibert J, Tsybakov A. Fast learning for plug-in classifiers. The Annals of Statistics. 2007;35:608–633. [Google Scholar]

[R3] Baillo A. Total error in a plug-in estimator of level sets. Statistics & Probability Letters. 2003;65:411–417. [Google Scholar]

[R4] Baillo A, Cuestas-Alberto J, Cuevas A. Convergence rates in nonparametric estimation of level sets. Statistics & Probability Letters. 2001;53:27–35. [Google Scholar]

[R5] Cadre B. Kernel estimation of density level sets. Journal of multivariate analysis. 2006;97:999–1023. [Google Scholar]

[R6] Cadre B, Pelletier B, Pudlo P. Clustering by estimation of density level sets at a fixed probability. 2009 manuscript. [Google Scholar]

[R7] Chatterjee SK, Patra NK. Asymptotically minimal multivariate tolerance sets. Calcutta Statist. Assoc. Bull. 1980;29:73–93. [Google Scholar]

[R8] Di Bucchianico A, Einmahl JH, Mushkudiani NA. Smallest nonparametric tolerance regions. The Annals of Statistics. 2001;29:1320–1343. [Google Scholar]

[R9] Fraser DAS, Guttman I. Tolerance regions. The Annals of Mathematical Statistics. 1956;27:162–179. [Google Scholar]

[R10] Giné E, Guillou A. Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l’Institut Henri Poincare (B) Probability and Statistics. 2002;38:907–921. [Google Scholar]

[R11] Guttman I. In: Statistical Tolerance Regions: Classical and Bayesian Griffin. Hartigan J, editor. London.; Clustering Algorithms John Wiley; New York: 1970. 1975. [Google Scholar]

[R12] Hyndman R. Computing and Graphing Highest Density Regions. The American Statistician. 1996;50:120–125. [Google Scholar]

[R13] Li J, Liu R. Multivariate spacings based on data depth: I. construction of nonparametric multivariate tolerance regions. The Annals of Statistics. 2008;36:1299–1323. [Google Scholar]

[R14] Loader C. Bandwidth selection: classical or plug-in? The Annals of Statistics. 1999;27:415–438. [Google Scholar]

[R15] Mammen E, Miranda MDM, Nielsen JP, Sperlich S. Do-Validation for kernel density estimation. Journal of the American Statistical Association. 2011;106:651–660. [Google Scholar]

[R16] Park C, Huang JZ, Ding Y. A Computable Plug-In Estimator of Minimum Volume Sets for Novelty Detection. Operations Research. 2010;58:1469–1480. [Google Scholar]

[R17] Polonik W. Measuring mass concentrations and estimating density contour clusters - an excess mass approach. The Annals of Statistics. 1995;23:855–881. [Google Scholar]

[R18] Polonik W. Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications. 1997;69(1):1–24. [Google Scholar]

[R19] Prakasa Rao B. Nonparametric Functional Estimation. Academic Press; 1983. [Google Scholar]

[R20] Rigollet P, Vert R. Optimal rates for plug-in estimators of denslty level sets. Bernoulli. 2009;14:1154–1178. [Google Scholar]

[R21] Rinaldo A, Wasserman L. Generalized density clustering. The Annals of Statistics. 2010;38:2678–2722. [Google Scholar]

[R22] Samworth RJ, Wand MP. Asymptotics and optimal bandwidth selection for highest density region estimation. The Annals of Statistics. 2010;38:1767–1792. [Google Scholar]

[R23] Scott CD, Nowak RD. Learning Minimum Volume Sets. Journal of Machine Learning Research. 2006;7:665–704. [Google Scholar]

[R24] Shafer G, Vovk V. A tutorial on conformal prediction. Journal of Machine Learning Research. 2008;9:371–421. [Google Scholar]

[R25] Sricharan K, Hero A. Efficient anomaly detection using bipartite k-NN graphs. In: Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger K, editors. Advances in Neural Information Processing Systems. Vol. 24. 2011. pp. 478–486. [Google Scholar]

[R26] Steinwart I, Hush D, Scovel C. A Classification Framework for Anomaly Detection. Journal of Machine Learning Research. 2005;6:211–232. [Google Scholar]

[R27] Tsybakov A. On nonparametric estimation of density level sets. The Annals of Statistics. 1997;25:948–969. [Google Scholar]

[R28] Tsybakov A. Introduction to nonparametric estimation. Springer; 2009. [Google Scholar]

[R29] Tukey J. Nonparametric estimation. II. Statistical equivalent blocks and multivarate tolerance regions,” The Annals of Mathematical Statistics. 1947;18:529–539. [Google Scholar]

[R30] van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer; 1996. [Google Scholar]

[R31] Vovk V, Gammerman A, Shafer G. Algorithmic Learning in a Random World. Springer; 2005. [Google Scholar]

[R32] Vovk V, Nouretdinov I, Gammerman A. On-line preditive linear regression. The Annals of Statistics. 2009;37:1566–1590. [Google Scholar]

[R33] Wald A. An extension of Wilks method for setting tolerance limits. The Annals of Mathematical Statistics. 1943;14:45–55. [Google Scholar]

[R34] Walther G. Granulometric Smoothing. The Annals of Statistics. 1997;25(6):2273–2299. [Google Scholar]

[R35] Wei Y. An approach to multivariate covariate-dependent quantile contours with application to bivariate conditional growth charts. Journal of the American Statistical Association. 2008;103(481):397–409. [Google Scholar]

[R36] Wilks S. Determination of sample sizes for setting tolerance limits. The Annals of Mathematical Statistics. 1941;12:91–96. [Google Scholar]

[R37] Willett R, Nowak R. Minimax optimal level-set estimation. IEEE Transactions on Image Processing. 2007;16:2965–2979. doi: 10.1109/tip.2007.910175. [DOI] [PubMed] [Google Scholar]

[R38] Zhao M, Saligrama V. Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A, editors. Anomaly Detection with Score functions based on Nearest Neighbor Graphs. Advances in Neural Information Processing Systems. 2009;22:2250–2258. [Google Scholar]

PERMALINK

Distribution Free Prediction Sets

Jing Lei

James Robins

Larry Wasserman

Abstract

1. INTRODUCTION

1.1 Prediction sets and density level sets

1.2 Main results

1.3 Related work

2. CONFORMAL PREDICTION

3. CONFORMAL PREDICTION WITH KERNEL DENSITY

3.1 The method

Figure 1.

3.2 An approximation

3.3 Asymptotic properties

Algorithm 1: Tuning With Sample Splitting.

4. CHOOSING THE BANDWIDTH

5. NUMERICAL EXAMPLES

5.1 2D Gaussian mixture

Figure 2.

Table 1.

Figure 3.

5.2 Further simulations

Double exponential distribution

Table 2.

Three-dimensional data

Table 3.

5.3 Application to Breast Cancer Data

Figure 4.

Table 4.

APPENDIX A. DEFINITIONS

A.1 Hölder smooth functions

A.2 Valid kernels

APPENDIX B. PROOF OF LEMMA 3.2

APPENDIX C. PROOF OF THEOREM 3.3

APPENDIX D. PROOFS OF LEMMA 4.3

Contributor Information

REFERENCES

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases