Author manuscript; available in PMC: 2016 Jan 1.
Published in final edited form as: J Multivar Anal. 2015 Jan 1;133:38–50. doi: 10.1016/j.jmva.2014.08.011

Conditional Density Estimation in Measurement Error Problems

Xiao-Feng Wang a,*, Deping Ye b
PMCID: PMC4183069  NIHMSID: NIHMS629473  PMID: 25284902

Abstract

This paper is motivated by a wide range of background correction problems in gene array data analysis, where the raw gene expression intensities are measured with error. Estimating a conditional density function from the contaminated expression data is a key aspect of statistical inference and visualization in these studies. We propose re-weighted deconvolution kernel methods to estimate the conditional density function in an additive error model, when the error distribution is known as well as when it is unknown. Theoretical properties of the proposed estimators are investigated with respect to the mean absolute error from a "double asymptotic" view. Practical rules are developed for the selection of smoothing parameters. Simulated examples and an application to an Illumina bead microarray study are presented to illustrate the viability of the methods.

Keywords: measurement error, gene microarray, conditional density, deconvolution, ridge parameter, kernel, bandwidth selection

1. Introduction

Measurement error problems have attracted a great deal of interest in the past two decades. A variety of models and methods for these problems have been applied in scientific fields such as medicine, economics, and astronomy. Statistical deconvolution is an important component of measurement error models. The fundamental objective of deconvolution is to recover the unknown probability density function of a random variable when its observed values are contaminated with error. Let X be the variable of interest, which cannot be observed directly. Instead, we observe a sample of W,

W_j = X_j + U_j, \quad \text{for } 1 \le j \le n, \qquad (1)

where the Xj's are identically distributed as X, the Uj's are identically distributed as U, and the two sequences are mutually independent. The most popular approach to estimating the density of X is the deconvolution kernel estimator, which applies an inverse Fourier transform together with a kernel technique (Carroll and Hall, 1988; Stefanski and Carroll, 1990; Fan, 1991a,b). Other estimation procedures include the truncated Fourier inversion method (Diggle and Hall, 1993), the wavelet-based method (Fan and Koo, 2002), and the penalization approach (Comte and Lacour, 2011), among others. Deconvolution problems based on more complicated model settings have also been extensively studied. Delaigle and Meister (2008), Wang et al. (2010), and McIntyre and Stefanski (2011) considered the problems of heteroscedastic measurement errors. Hall and Maiti (2009) investigated nonparametric deconvolution methods in two-level mixed models. Neumann (1997), Johannes (2009), and Wang and Ye (2012) studied density deconvolution with unknown error distribution. Delaigle and Meister (2011) investigated kernel deconvolution when the characteristic function of the measurement errors contains zeros. Wang and Wang (2011) discussed fast Fourier transform algorithms in measurement error models and developed an R software package. The literature on deconvolution problems is vast and is surveyed in the monograph by Meister (2009).

In this paper, we consider the estimation problem of the conditional density of X given W, fX|W, from the contaminated data Wj's. The problem is motivated by a wide range of background correction problems in gene array data analysis. Gene microarray techniques have become very popular in medical studies. A microarray is a collection of microscopic DNA spots attached to a solid surface. Hundreds of thousands of gene expression values are obtained from one array chip simultaneously. However, reading the expression values from a microarray is a noisy measurement process. The sources of measurement error include, for instance, irregularities in the array surface, variations in the laboratory process, different image scanner settings, and dye effects.

Typically, the first step in gene array data analysis is known as background correction, which refers to adjustments to the contaminated data intended to remove measurement error from the measured signal. Estimating the conditional density function from contaminated gene expression data is therefore a key aspect of statistical inference and visualization here. It provides the most informative summary of the relationship between the contaminated gene intensities and the unobserved true signals. The currently popular model for background correction in bioinformatics is the normal-exponential model (Irizarry et al., 2003; Silver et al., 2009). It assumes that the observed intensity is equal to the true intensity plus the background noise, where the true signal follows an exponential distribution with mean α, and the background noise follows a normal distribution with mean μ and variance σ2. However, the validity of these parametric assumptions is unknown in real gene array studies. Thus, it is of particular interest to nonparametrically estimate the conditional density from the contaminated gene intensities.

A variety of papers discuss nonparametric conditional density estimation when bivariate data are available. Hyndman et al. (1996) studied a kernel estimator. Bashtannyk and Hyndman (2001) and Fan and Yim (2004) proposed several rules for selecting smoothing parameters. De Gooijer and Zerom (2003) proposed a modification of the Nadaraya-Watson type of smoother. Hall et al. (2004) discussed cross-validation and the estimation of conditional probability densities. Efromovich (2007) studied conditional density estimation in a regression setting.

Unlike the conventional conditional density estimation problem from bivariate data, observations of the variable of interest, X, are not available in the measurement error problem. In this paper, we investigate the estimation of the conditional density fX|W from the only available contaminated sample Wj's under the model (1). In Section 2, estimators of fX|W are constructed for the cases of known and unknown error density. In Section 3, theoretical properties of the estimators are investigated with respect to the mean absolute error. In Section 4, practical rules are developed for the selection of smoothing parameters. Simulated examples and an application to an Illumina bead microarray study are presented in Section 5. The proofs of theorems are given in the Appendix, and some additional asymptotic results are provided in the supplement of the article.

2. Methodology

Under the additive measurement error model (1), let fX, fU, and fW be the density functions of X, U, and W, respectively. Denote fX,W(x, w) as the joint density of (X, W). The conditional density of X given W = w is

f_{X|W}(x|w) = f_{X,W}(x,w)/f_W(w) = f_U(w-x)\,f_X(x)/f_W(w). \qquad (2)

2.1. Estimation of fX|W with known error distribution

If one assumes that the error density fU is known explicitly, fX can be estimated by the classical deconvolution kernel approach. It is given by,

\hat f_X(x) = \frac{1}{n}\sum_{j=1}^{n} K_h^*(x - W_j), \qquad (3)

where K_h^*(\cdot) = K^*(\cdot/h)/h,

K^*(z) = \frac{1}{2\pi}\int e^{-itz}\,\frac{\phi_K(t)}{\phi_U(t/h)}\,dt, \qquad (4)

is known as the deconvoluting kernel, and h > 0 is a smoothing parameter. In (4), ϕU is the characteristic function of U, and ϕK(t) = ∫e^{itx}K(x)dx is the Fourier transform of K(x), a symmetric probability kernel with a finite variance ∫x²K(x)dx < ∞. Under the common assumption that ϕK is compactly supported and ϕU does not vanish on the real line, the deconvoluting kernel K*(·) is well defined and finite.

The estimation of fX|W is naturally approached by replacing fX with its deconvolution kernel estimator and fW with its ordinary kernel estimator in (2). This results in a re-weighted deconvolution kernel estimator (RDKE), defined by

\hat f_{X|W}(x|w) = \hat\tau_0(x|w)\sum_{j=1}^{n} K_h^*(x - W_j), \qquad (5)

where

\hat\tau_0(x|w) = \frac{f_U(w-x)}{\sum_{j=1}^{n} L_b(w - W_j)}, \qquad (6)

Lb(·) = L(·/b)/b, L(·) is a real, non-negative kernel function, and b is the bandwidth associated with the kernel density estimate of fW. The estimator (5) is called the RDKE because, compared with the conventional deconvolution kernel estimator, the factor 1/n in (3) is replaced by the weight function τ̂0(·|·) in this new estimator.

2.2. Estimation of fX|W with unknown error distribution

In practice, the exact distribution of the measurement error is typically unknown. Thus, one often conducts a separate independent experiment to collect an additional noise sample. For example, in Illumina bead microarray studies, such an additional noise sample is always available for each gene array. Density deconvolution from a contaminated sample coupled with an additional noise sample has been studied by Neumann (1997), Johannes (2009), and Wang and Ye (2012). For conditional density estimation in the presence of error with unknown distribution, we propose here a ridge-based re-weighted kernel estimator. Let {U0j}, j = 1, ⋯, n0, be direct observations from the error distribution, independent and identically distributed as U. The empirical characteristic function of U is then \hat\phi_U(t) = n_0^{-1}\sum_{j=1}^{n_0} e^{itU_{0j}}. Our estimator takes the form

\hat f_{X|W,\rho}(x|w) = \hat\tau(x|w)\sum_{j=1}^{n} K_{h,\rho}^*(x - W_j), \qquad (7)

where K_{h,\rho}^*(\cdot) = K_\rho^*(\cdot/h)/h, and

K_\rho^*(z) = \frac{1}{2\pi}\int e^{-itz}\,\frac{\phi_K(t)\,\overline{\hat\phi_U(t/h)}}{\max\{|\hat\phi_U(t/h)|^2,\ \rho\}}\,dt, \qquad (8)

\hat\tau(x|w) = \frac{n_0^{-1}\sum_{k=1}^{n_0} G_a(w - x - U_{0k})}{\sum_{j=1}^{n} L_b(w - W_j)}. \qquad (9)

In this estimator, Ga(·) = G(·/a)/a and Lb(·) = L(·/b)/b, where G and L are real, non-negative kernel functions, and a and b are the bandwidths associated with the kernel density estimates of fU and fW, respectively.

We introduce a ridge parameter ρ in (8) in order to prevent ϕ̂U from being too close to zero. When the error distribution is unknown, ϕU might be estimated by its empirical counterpart, ϕ̂U. However, estimating the conditional density becomes unstable if one applies ϕ̂U without any regularization. Note that ϕU can be estimated by ϕ̂U(t) at each point t at the rate n0^{−1/2}. The estimator ϕ̂U is reliable when |ϕ̂U| ≫ n0^{−1/2}, while it becomes unstable when |ϕ̂U| ≪ n0^{−1/2}. Hence, in practice, we simply take the ridge parameter ρ = n0^{−1} in order to avoid the difficulty of simultaneously selecting multiple tuning parameters in (8).
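The ridge-regularized kernel (8) with ρ = 1/n0 can be sketched numerically as follows (our own illustration; `ridge_deconv_kernel` is a hypothetical name, and ϕK(t) = (1 − t²)³ on [−1, 1] is again assumed):

```python
import numpy as np

def ridge_deconv_kernel(z, h, U0):
    """Ridge-regularized deconvoluting kernel K*_rho(z) of (8), with
    rho = 1/n0 as suggested in the text. U0 is the additional noise
    sample defining the empirical characteristic function phi_U_hat."""
    U0 = np.asarray(U0, dtype=float)
    n0 = U0.size
    rho = 1.0 / n0
    t = np.linspace(-1.0, 1.0, 1001)
    phi_K = (1.0 - t**2) ** 3
    # empirical characteristic function evaluated at t/h
    phi_U_hat = np.exp(1j * np.outer(t / h, U0)).mean(axis=1)
    ratio = phi_K * np.conj(phi_U_hat) / np.maximum(np.abs(phi_U_hat) ** 2, rho)
    vals = np.exp(-1j * np.outer(np.asarray(z, float), t)) * ratio
    return np.real(np.trapz(vals, t, axis=1)) / (2.0 * np.pi)
```

A useful sanity check: when the noise sample is identically zero, ϕ̂U ≡ 1 and K*_ρ reduces to the ordinary kernel K, whose value at zero is (1/2π)∫(1 − t²)³dt = (32/35)/(2π).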

3. Theoretical properties

We present here asymptotic results based on the mean absolute distance between f̂X|W and fX|W. Define the mean absolute error (MAE) by MAE(f̂X|W) = E|f̂X|W − fX|W|, which is the "local" analogue of the L1 distance between fX|W and its estimate. The L1 view of conventional kernel density estimation for error-free data has attracted great attention. Devroye (1985) gave compelling illustrations of the mathematical attractions of the L1 distance. For instance, it is always well defined on the space of density functions, and it is invariant under monotone transformations. Studying the MAE properties is also motivated by the practical interest in gene array background correction. Some additional theoretical results based on the mean square distance can be found in the supplement of this paper.

In measurement error problems, the quality of a sample does not only depend on its size but also crucially relates to the magnitude of the error variance σ2. This phenomenon was observed by Fan (1992) and has been extensively studied by Delaigle (2008). Delaigle (2008) made the following argument for an alternative asymptotic view of measurement error problems. In standard asymptotic theory, n may not be particularly large in practice, yet statisticians analyze theoretical properties in the idealized situation n → ∞, which reveals the behavior of an estimator when the sample size is not too small. In measurement error problems, just as any given sample size can be considered a finite-sample approximation of n → ∞, any given σ2 can likewise be considered a finite-sample version of σ2 → 0. Considering the asymptotic properties where both σ2 → 0 and n → ∞ (named the double asymptotics by Delaigle (2008)) provides a more appropriate way to uncover important properties of an estimator when σ2 is not too large.

Following Delaigle's recommendation, we study the double asymptotic properties of our estimators by considering both σ2 → 0 and n → ∞. The rates of convergence under the standard asymptotic framework are also provided in this section. Let us rewrite the model (1) as:

W_j = X_j + \sigma Z_j, \quad \text{for } 1 \le j \le n, \qquad (10)

where Uj = σZj, and the Zj's are independent and identically distributed as Z with density fZ and variance var(Z) = 1. Hereinafter, when we refer to model (10), the asymptotic properties are for both σ2 → 0 and n → ∞; when we refer to model (1), they are for n → ∞ only. We always assume that the density function fW is strictly positive and has a continuous second derivative. The unknown probability density function fX belongs to the class

\Psi_{m,\alpha,B} = \{f(x): |f^{(m)}(x) - f^{(m)}(x+\delta)| \le B\,\delta^\alpha\},

where α ∊ [0,1), m ∊ℕ, and B > 0 are known constants. We consider two classes of errors as in the classical deconvolution literature: ordinary smooth error U of order β if

d_0\,|t|^{-\beta} \le |\phi_U(t)| \le d_1\,|t|^{-\beta}, \quad \text{as } t \to \infty,

for some positive constants d0,d1,β; and supersmooth error U of order β if

d_0\,|t|^{\beta_0}\exp(-|t|^\beta/\gamma) \le |\phi_U(t)| \le d_1\,|t|^{\beta_1}\exp(-|t|^\beta/\gamma), \quad \text{as } t \to \infty,

for some positive constants d0, d1, γ, β and constants β0, β1. We further state the conditions for the kernel functions:

  • (A1) K is bounded, continuous, and ∫|y|^{m+α}|K(y)|dy < ∞. Moreover, its characteristic function, ϕK, is symmetric and satisfies ϕK(t) = 1 + O(|t|^{m+α}) as t → 0.

  • (A2) The kernel function L(x) is a real, non-negative, even kernel function on ℝ such that ∫ L(x)dx = 1, ∫ xL(x)dx = 0, and ∫ x2L(x)dx < ∞. The kernel function G(·) follows the same conditions.

Condition (A1) asserts that K is essentially a kernel function of order m + α.

3.1. Theoretical properties when fU is known

We use c(n) ∼ b(n) to represent d2c(n) ≤ b(n) ≤ d3c(n) for some constants d2, d3 > 0. c(n) ≫ b(n) (resp. c(n) ≪ b(n)) represents b(n) = o(c(n)) (resp. c(n) = o(b(n))).

  • (B1) ∫|t|^β|ϕK(t)|dt < ∞ and ∫|t|^{2β}|ϕK(t)|²dt < ∞.

Theorem 1

Assume that the variable Z in (10) is ordinary smooth of order β. Under the assumptions (A1), (A2), (B1), and b ~ n^{−1/5},

  1. if \sigma = O(n^{-\frac{1}{2m+2\alpha+1}}) and h \sim n^{-\frac{1}{2m+2\alpha+1}}, then ∀ x, w ∊ ℝ and ∀ fX ∊ Ψm,α,B,
     E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5});
  2. if \sigma \gg n^{-\frac{1}{2m+2\alpha+1}} and h \sim \sigma^{\frac{2\beta}{2(m+\alpha+\beta)+1}}\,n^{-\frac{1}{2(m+\alpha+\beta)+1}}, then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O\big(\sigma^{\frac{2(m+\alpha)\beta}{2(m+\alpha+\beta)+1}}\,n^{-\frac{m+\alpha}{2(m+\alpha+\beta)+1}}\big).

Remark 1. For the model (1), by considering σ to be a constant, one gets

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O\big(n^{-\frac{m+\alpha}{2m+2\alpha+2\beta+1}}\big).

Notice that the assumption b ~ n^{−1/5} is standard in kernel density estimation and h ~ n^{−1/(2m+2α+2β+1)} is standard in deconvolution kernel density estimation. If m + α < 4β + 2, then \frac{m+\alpha}{2m+2\alpha+2\beta+1} < \frac{2}{5}, so the rate of convergence is O(n^{−(m+α)/(2m+2α+2β+1)}). On the other hand, if m + α ≥ 4β + 2, then the rate of convergence is O(n^{−2/5}). If we assume that fW(·) has higher-order derivatives and we choose a higher-order kernel function L, for instance of 2r-th order, then we can let b ~ n^{−1/(4r+1)}. In this case, the rate of convergence becomes O(n^{−2r/(4r+1)}) + O(n^{−(m+α)/(2m+2α+2β+1)}).

Next we address the theorem in the case of supersmooth error. We need the following condition.

  • (C1) ϕK(t) = 0 for all |t| > 1, that is, ϕK has support on [−1, 1]. Moreover, ∫|ϕK(t)|²|t|^{−2β₀}dt < ∞.

    More generally, in condition (C1), one can assume that ϕK has compact support [−M, M] for some 0 < M < ∞.

Theorem 2

Assume that the variable Z in (10) is supersmooth with ϕZ(t) ≠ 0 for any t. Under the assumptions (A1), (A2), (C1), and b ~ n^{−1/5},

  1. if \sigma = O(n^{-\frac{1}{2(m+\alpha)+1}}) and h \sim n^{-\frac{1}{2(m+\alpha)+1}}, then ∀ x, w ∊ ℝ and ∀ fX ∊ Ψm,α,B,

     E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5});
  2. if \sigma = n^{-\frac{1}{2(m+\alpha)+1}}\,a(n) and h = (2/(\gamma D))^{1/\beta}\,\sigma\,\{\log a(n)\}^{-1/\beta} with 1 \ll a(n) \ll n^{\frac{1}{2(m+\alpha)+1}} and D < 2m + 2\alpha + 1, then ∀ x, w ∊ ℝ and ∀ fX ∊ Ψm,α,B,

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O\big(\sigma^{m+\alpha}\{\log a(n)\}^{-\frac{m+\alpha}{\beta}}\big).

Remark 2. For the model (1), by considering σ to be a constant, one has,

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O\big((\log n)^{-\frac{m+\alpha}{\beta}}\big).

The choice of b ~ n^{−1/5} is standard for kernel density estimation and h = (4/γ)^{1/β}(log n)^{−1/β} is standard for deconvolution kernel density estimation. Here, the order of the derivatives of fW(·) and the order of the kernel function L do not change the rate of convergence.

3.2. Theoretical properties when fU is unknown

For the ordinary smooth error distribution, we need the following condition:

  • (B2) ∫|ϕK(t)||t|^{2βs}dt < ∞ for s = 0, 1, 2.

Theorem 3

Assume that the variable Z is ordinary smooth of order β but fZ is unknown. Under the assumptions (A1), (A2), (B1), (B2), and by choosing a ~ n0^{−1/5}, b ~ n^{−1/5},

  1. if \sigma = O(h) and h \sim \max\{n^{-\frac{1}{2m+2\alpha+1}},\ n_0^{-\frac{1}{2(m+\alpha+1)}}\}, then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

     E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O(n_0^{-2/5}) + O\big(n_0^{-\frac{m+\alpha}{2(m+\alpha+1)}}\big);
  2. if \sigma \gg \max\{n^{-\frac{1}{2m+2\alpha+1}},\ n_0^{-\frac{1}{2m+2\alpha+2}}\}, and

h \sim \max\big\{\sigma^{\frac{2\beta}{2(m+\alpha+\beta)+1}}\,n^{-\frac{1}{2m+2\alpha+2\beta+1}},\ \sigma^{\frac{\beta}{m+\alpha+\beta+1}}\,n_0^{-\frac{1}{2m+2\alpha+2\beta+2}}\big\},

then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O\big(\sigma^{\frac{2(m+\alpha)\beta}{2(m+\alpha+\beta)+1}}\,n^{-\frac{m+\alpha}{2(m+\alpha+\beta)+1}}\big) + O\big(\sigma^{\frac{(m+\alpha)\beta}{m+\alpha+\beta+1}}\,n_0^{-\frac{m+\alpha}{2(m+\alpha+\beta+1)}}\big) + O(n^{-2/5}) + O(n_0^{-2/5}).

Remark 3. For the model (1), by considering σ to be a constant, one has

E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O\big(n^{-\frac{m+\alpha}{2(m+\alpha+\beta)+1}}\big) + O(n^{-2/5}) + O(n_0^{-2/5}) + O\big(n_0^{-\frac{m+\alpha}{2(m+\alpha+\beta+1)}}\big).
Theorem 4

Assume that the variable Z is supersmooth with ϕZ(t) ≠ 0 for any t but fZ is unknown. Under the assumptions (A1), (A2), (C1), and by choosing a ~ n0^{−1/5}, b ~ n^{−1/5},

  1. if \sigma = O(h) and h \sim \max\{n^{-\frac{1}{2m+2\alpha+1}},\ n_0^{-\frac{1}{2(m+\alpha+1)}}\}, then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

     E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O(n_0^{-2/5}) + O\big(n_0^{-\frac{m+\alpha}{2(m+\alpha+1)}}\big);
  2. if \sigma = \max\{n^{-\frac{1}{2(m+\alpha)+1}},\ n_0^{-\frac{1}{2(m+\alpha+1)}}\}\,a(n, n_0) and h = (2/(\gamma D))^{1/\beta}\,\sigma\,\{\log a(n, n_0)\}^{-1/\beta}, with 1 \ll a(n, n_0) \ll \min\{n^{\frac{1}{2(m+\alpha)+1}},\ n_0^{\frac{1}{2(m+\alpha+1)}}\} and D < 2m + 2\alpha + 1, then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

     E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O(n_0^{-2/5}) + O\big(\sigma^{m+\alpha}\{\log a(n, n_0)\}^{-\frac{m+\alpha}{\beta}}\big).

Remark 4. For the model (1), by considering σ to be a constant, one has

E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O\big(\{\log n\}^{-\frac{m+\alpha}{\beta}}\big) + O\big(\{\log n_0\}^{-\frac{m+\alpha}{\beta}}\big).

The above four theorems offer the double asymptotic view of the proposed RDKEs, which provides a better interpretation of the asymptotic behavior of the estimators than the results from the standard asymptotic view. For either ordinary smooth or supersmooth error, there are two different rates of convergence depending on the error magnitude. For instance, the convergence rate for supersmooth error varies from the rate of error-free kernel density estimation to the very slow rate of classical supersmooth deconvolution. The theoretical results confirm that the quality of the RDKEs depends not only on the sample size but also on the error magnitude. They also suggest that, in practice, the RDKEs can perform well with a moderate sample size in the case of Gaussian error, as long as σ2 is not too large.

4. Selection of smoothing parameters

Bandwidth plays a critical role in the practical implementation of smoothing techniques. The selection of bandwidth in deconvolution kernel density estimation has been broadly studied in the literature (Delaigle and Gijbels, 2004). Here we propose a simple but intuitively appealing method for selecting the smoothing parameters in the conditional density estimation. Let us focus on the case of known error distribution. Our procedure includes two steps. First, we select b for the kernel density estimate \hat f_W(w) = n^{-1}\sum_{j=1}^{n} L_b(w - W_j). Many bandwidth selection methods are available in classical kernel density estimation; here we use the one that minimizes the mean absolute distance (Hall and Wand, 1988a,b). Second, for given b and w, the proposed bandwidth h for estimating f̂X|W(x|w) using (5) is the minimizer of the mean integrated absolute error (MIAE) of g(x|w) = fU(w − x)fX(x), i.e.,

h_{MIAE} = \arg\min_h \Big\{E\int |\hat g(x|w; h) - g(x|w)|\,dx\Big\}, \qquad (11)

where ĝ(x|w; h) = fU(w − x)f̂X(x) and f̂X is the deconvolution kernel estimate defined by (3).

Let K be an m-th order kernel function with ∫|K(z)z^{m+1}|dz < ∞. Assume that the density function fX has a continuous, bounded (m + 1)-th order derivative. Denote Rm(K) = ∫x^m K(x)dx and Sm(fX) = ∫|fU(w − x)fX^{(m)}(x)|dx for any positive integer m, and let ‖fX‖∞ be the supremum norm of fX. It can be shown that the asymptotic dominating term of the MIAE of ĝ(x|w) is given by

\mathrm{AMIAE}(\hat g(x|w; h)) = \frac{h^m}{m!}\,|R_m(K)|\,S_m(f_X) + \sqrt{\frac{\|f_X\|_\infty}{2\pi h n}\int \frac{|\phi_K(t)|^2}{|\phi_U(t/h)|^2}\,dt}. \qquad (12)

In practice, we use this asymptotic approximation rather than the exact MIAE because of its rather simple expression. Formula (12) involves the unknown quantities Sm(fX) and ‖fX‖∞. We suggest two possible estimators for them, which yield an estimator \widehat{\mathrm{AMIAE}}(\hat g(x|w; h)). The first is a simple normal reference approach. Assume that X is from a normal distribution N(μX, σX²). Then one calculates μ̂X = μ̂W and σ̂X² = σ̂W² − σU², where μ̂W and σ̂W² are the sample mean and variance of the observations Wj's. From these, one can easily calculate Sm(f̂X) and ‖f̂X‖∞. This approach may not perform well when X is far from normally distributed. Our second approach is to estimate fX through the classical deconvolution kernel method and then numerically evaluate Sm(f̂X) and ‖f̂X‖∞. In our simulation experience, the second method often performs better than the normal reference method, although it is more computationally involved.

The proposed bandwidth hMIAE depends on w. If necessary, one could select a global bandwidth

h_{MIAE} = \arg\min_h \Big\{E\iint |\hat g(x|w; h) - g(x|w)|\,dx\,dw\Big\},

where the integration is over the region of x and w of interest.

Certainly, one may consider using the criterion of minimizing the mean integrated squared error instead. Our simulation experience suggests that there is negligible difference between the two criteria in practical bandwidth selection. The above methods for selecting smoothing parameters extend naturally to the case of unknown error distribution, where a and b are pre-determined from the additional noise data and the contaminated data, respectively.
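The normal reference step of the first approach is easy to sketch (our own illustration; `normal_reference_fX` is a hypothetical name, and the truncation of the variance estimate at a small positive value is our own safeguard, not stated in the paper):

```python
import numpy as np

def normal_reference_fX(W, sigma_U):
    """Normal-reference step of Section 4: treat X as N(mu_X, s2_X) with
    mu_X = mean(W) and s2_X = var(W) - sigma_U^2 (truncated at a small
    positive value to guard against a negative variance estimate)."""
    W = np.asarray(W, dtype=float)
    mu_X = W.mean()
    s2_X = max(W.var(ddof=1) - sigma_U**2, 1e-8)
    sup_fX = 1.0 / np.sqrt(2.0 * np.pi * s2_X)   # sup norm of a normal density
    return mu_X, s2_X, sup_fX
```

Under the reference normal, quantities such as Sm(fX) and ‖fX‖∞ in (12) then follow in closed form from (mu_X, s2_X).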

5. Numerical properties

5.1. Simulated examples

We illustrate our methods via two simulated models. We take the kernel K to be a second-order kernel function with ϕK(t) = (1 − t²)₊³, and the kernels L and G to be Gaussian.

Example 1. Consider a normal-exponential convolution model, Wj = Xj + Uj, where Xj is exponentially distributed with mean α, while Uj is normally distributed with mean μ and variance σ2. It can be shown, with some simple transformation and algebra, that the conditional density of X given W is

f_{X|W}(x|w) = \frac{\varphi(x;\ w - \mu - \sigma^2/\alpha,\ \sigma^2)}{1 - \Phi(0;\ w - \mu - \sigma^2/\alpha,\ \sigma^2)}, \quad x > 0,

where φ(·) and Φ(·) denote the Gaussian density and distribution functions, respectively. A sample of 1000 was generated from this model with parameters α = 10, μ = 2, σ = 2. We assume that the measurement error parameters are known. Formula (5) was applied to estimate the conditional density functions from the contaminated data Wj's, with bandwidths selected by the method proposed in Section 4. Figure 1, which presents a typical simulated example, displays the estimated conditional densities using the RDKE with known error distribution. In (a)-(c), the estimated conditional densities (solid curves) for (a) w = 5, (b) w = 10, (c) w = 30 are compared with the true densities (dashed curves). The RDKE performed quite well in recovering the true functions. (d) shows a "stacked conditional density plot" (Hyndman et al., 1996), which displays a number of densities plotted side by side in a perspective plot. The plot highlights the conditioning, which allows us to evaluate how the densities change with w.

Figure 1.

A simulated example of conditional density estimation with measurement error for the normal-exponential model. In (a)-(c), estimated conditional densities (solid curves) for (a) w = 5, (b) w = 10, (c) w = 30 using the re-weighted deconvolution kernel method with known error distribution are compared with the true densities (dashed curves). (d) shows a number of densities plotted side by side in a perspective plot.

Example 2. Consider a normal-normal convolution model, Wj = Xj + Uj, where Xj is normally distributed with mean μ1 and variance σ12, while Uj is normally distributed with mean μ2 and variance σ22. The true conditional density of X given W is

f_{X|W}(x|w) = \frac{\varphi(x;\,\mu_1,\,\sigma_1^2)\,\varphi(w - x;\,\mu_2,\,\sigma_2^2)}{\varphi(w;\,\mu_1 + \mu_2,\,\sigma_1^2 + \sigma_2^2)}.

We generated the Xj's and Uj's from N(2, 9) and N(0, 1), respectively, with sample size n = 1000. In this example, we assumed that the error distribution was unknown, so an additional noise sample of U0j's was generated from N(0, 1) with size n0 = 500. Formula (7) was applied to estimate the conditional density functions from the contaminated data Wj's coupled with the noise data U0j's. Figure 2 shows the estimated conditional densities in a simulated example. In (a)-(c), the estimated conditional densities (solid curves) for (a) w = −2.5, (b) w = 0, (c) w = 2.5 are compared with the true densities (dashed curves). The estimated curves almost coincide with the true curves. (d) exhibits the stacked conditional density plot in a perspective view.
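For this model the displayed ratio of normal densities coincides with the familiar conjugate-normal posterior, which provides another quick check (our own sketch; names are hypothetical, and v1, v2 denote variances):

```python
import numpy as np
from math import pi

def norm_pdf(x, m, v):
    # Gaussian density with mean m and variance v
    return np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * pi * v)

def cond_density_normnorm(x, w, mu1, v1, mu2, v2):
    """f_{X|W}(x|w) of Example 2 as a ratio of normal densities."""
    return norm_pdf(x, mu1, v1) * norm_pdf(w - x, mu2, v2) \
        / norm_pdf(w, mu1 + mu2, v1 + v2)
```

Standard Gaussian algebra gives the same density as N(μ1 + σ1²(w − μ1 − μ2)/(σ1² + σ2²), σ1²σ2²/(σ1² + σ2²)); for the parameters of this example and w = 2.5, that is N(2.45, 0.9).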

Figure 2.

A simulated example of conditional density estimation with measurement error for the normal-normal model. In (a)-(c), estimated conditional densities (solid curves) for (a) w = −2.5, (b) w = 0, (c) w = 2.5 using the re-weighted deconvolution kernel method with unknown error distribution are compared with the true densities (dashed curves). (d) shows a number of densities plotted side by side in a perspective plot.

5.2. An application to an Illumina bead microarray study

The Illumina bead microarray is one of the most popular microarray platforms in genetics. One distinctive feature of the bead array technology by Illumina Inc. is that more than one thousand "control bead types", in addition to gene sequences, are allocated on each array. These control beads do not correspond to any expressed sequences in the genome. The control bead data provide the additional noise sample that is used to evaluate the noise distribution on the array in an experiment.

We illustrate our methods using the Illumina microarray data from a leukemia study by Ding et al. (2008), which investigated the pathogenesis of leukemia. Irradiated mice that subsequently developed acute myeloid leukemia (AML) were used to study the leukemogenic process, and Illumina Mouse-6 V1 BeadChip whole-genome expression arrays were used to obtain the gene expression profiles of the AML samples. Ding et al. (2008) considered the normal-exponential model for the observed gene intensities in their analysis. Here we demonstrate conditional density estimation for the third bead array; the other bead arrays showed similar results. The intensity values of 46120 genes and 1655 negative controls were obtained from this array. We estimate the conditional densities nonparametrically using the re-weighted deconvolution kernel method with unknown error distribution. The results are displayed in Figure 3. In (a)-(c), the solid curves denote estimated conditional densities for the observed intensity (a) w = 180, (b) w = 190, (c) w = 200. For comparison, estimated conditional densities based on the parametric normal-exponential model are also displayed (dashed curves). The nonparametric fit shows that the conditional densities are right-skewed, deviating from the fit based on the parametric model. The stacked conditional density plot in (d) unveils the evolution of the conditional densities over the observed intensities. The results suggest that the normal-exponential assumption is not realistic in this study; a model that relaxes the parametric assumptions may be useful for gene background correction here.

Figure 3.

The analysis of the Illumina bead microarray data. In (a)-(c), estimated conditional densities (solid curves) for (a) w = 180, (b) w = 190, (c) w = 200 using the re-weighted deconvolution kernel method with unknown error distribution are compared with the estimated conditional densities (dashed curves) using the normal-exponential method. (d) shows the nonparametric conditional density estimates plotted side by side in a perspective plot.

Supplementary Material

supplement

Acknowledgments

The authors are grateful to the editor and the reviewers for their valuable comments. The research of XFW is supported in part by NIH UL1 RR024989, and the research of DY is supported by a NSERC grant and a grant from Memorial University of Newfoundland.

Appendix

We now outline the key ideas of the proofs. We shall use Proposition 31.8 of Port (1994), which is restated as Lemma 1.

Lemma 1

Let q1(Xi) and q2(Xi) be two random variables with means μ1 and μ2, variances ν1 and ν2, respectively, and with covariance ν12. Let {X1, ⋯, Xn} be an i.i.d. sequence of random variables and define

\hat\mu_j = \frac{1}{n}\sum_{i=1}^{n} q_j(X_i), \quad \text{for } j = 1, 2, \quad \text{and} \quad \hat R = \hat\mu_1/\hat\mu_2.

Then the second-order approximation of E(R̂) is

E\hat R \approx \frac{\mu_1}{\mu_2} + \frac{1}{n}\Big(\frac{\mu_1\nu_2}{\mu_2^3} - \frac{\nu_{12}}{\mu_2^2}\Big),

and the first-order approximation of var(R̂) is

\mathrm{var}(\hat R) \approx \frac{1}{n\mu_2^2}\Big(\nu_1 + \frac{\mu_1^2\nu_2}{\mu_2^2} - \frac{2\mu_1\nu_{12}}{\mu_2}\Big).
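Lemma 1's two delta-method approximations can be checked by a quick Monte Carlo experiment (our own illustration, not from the paper). Take q1(X) = X and q2(X) = X² with X ~ Exp(1), for which μ1 = 1, μ2 = 2, ν1 = 1, ν2 = 20, and ν12 = 4:

```python
import numpy as np

# Monte Carlo sanity check of Lemma 1 for a ratio of sample means:
# R_hat = mean(X) / mean(X^2) with X ~ Exp(1) and n = 200.
rng = np.random.default_rng(3)
n, reps = 200, 5000
X = rng.exponential(1.0, size=(reps, n))
R = X.mean(axis=1) / (X**2).mean(axis=1)      # one R_hat per replication

mu1, mu2 = 1.0, 2.0                            # E X, E X^2
v1, v2, v12 = 1.0, 20.0, 4.0                   # var X, var X^2, cov(X, X^2)
mean_approx = mu1 / mu2 + (mu1 * v2 / mu2**3 - v12 / mu2**2) / n
var_approx = (v1 + mu1**2 * v2 / mu2**2 - 2 * mu1 * v12 / mu2) / (n * mu2**2)
```

With these moments, mean_approx = 0.5075 and var_approx = 0.0025; the simulated mean and variance of R̂ land close to both, and the 1/n bias correction beyond the naive ratio μ1/μ2 = 0.5 is clearly visible.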

Employing Lemma 1 with the random variables q1(Wj) = 1 and q_2(W_j) = b^{-1}L\big(\frac{w - W_j}{b}\big), one has

E([\hat f_W(w)]^{-1}) = [E\hat f_W(w)]^{-1} + O((nb)^{-1}), \qquad (.1)
E([\hat f_W(w)]^{-2}) = [E\hat f_W(w)]^{-2} + O((nb)^{-1}), \qquad (.2)

where \hat f_W(w) = n^{-1}\sum_{j=1}^{n} q_2(W_j) is the kernel estimator for fW.

Lemma 2

Let τ(w|x) = fU(w − x)/fW(w) and let τ̂(w|x) be as in (9). By choosing a ~ n0^{−1/5} and b ~ n^{−1/5}, one has, for all x, w ∊ ℝ,

E|\hat\tau(w|x) - \tau(w|x)|^2 = O(n_0^{-4/5}) + O(n^{-4/5}), \qquad (.3)
E|\hat\tau_0(w|x) - \tau(w|x)|^2 = O(n^{-4/5}). \qquad (.4)

Proof. Let us prove formula (.3) first. Recall that {U0k}, k = 1, …, n0, and {Wj}, j = 1, …, n, are independent. Consider \hat f_W(w) = \sum_{i=1}^{n} \frac{1}{n} g_2(W_i) with

g_2(W_i) = \frac{1}{b} L\Big(\frac{w - W_i}{b}\Big).

Then, one has (see for instance Lemma 1 of Hyndman et al. (1996))

\mu_2 = E g_2 = f_W(w) + \frac{b^2\sigma_L^2}{2} f_W''(w) + O(b^4) = f_W(w)[1 + O(b^2)], \qquad (.5)
\nu_2 = \mathrm{var}\Big(\frac{1}{b}L\Big(\frac{w - W_i}{b}\Big)\Big) = \frac{f_W(w)\,R(L)}{b}[1 + O(b)], \qquad (.6)

where \sigma_L^2 = \int \omega^2 L(\omega)\,d\omega and R(L) = \int L^2(\omega)\,d\omega.

By formulas (.1) and (.5), one has, for all x,w,

|E\hat\tau(w|x) - \tau(w|x)| = \Big|E\hat f_U(w-x)\,E\Big(\frac{1}{\hat f_W(w)}\Big) - \tau(w|x)\Big| = \Big|\frac{f_U(w-x) + O(a^2)}{f_W(w)[1 + O(b^2)]} + O\Big(\frac{1}{nb}\Big) - \tau(w|x)\Big| = O(a^2) + O(b^2) + O\Big(\frac{1}{nb}\Big).

Similarly, formulas (.1), (.2), (.5), and (.6) imply

\mathrm{var}(\hat\tau(w|x)) = E\Big|\frac{\hat f_U(w-x)}{\hat f_W(w)}\Big|^2 - \Big|E\Big(\frac{\hat f_U(w-x)}{\hat f_W(w)}\Big)\Big|^2 = O\Big(\frac{1}{n_0 a}\Big) + O\Big(\frac{1}{nb}\Big).

Hence, a ~ n0^{−1/5} and b ~ n^{−1/5} imply the desired result in Lemma 2.

To prove formula (.4), one can repeat the above calculation, removing all terms involving a, since here f̂U is replaced by fU.

Proof of Theorem 1

Recall that f̂X|W(x|w) = τ̂0(x|w)f̂X(x). We need the following inequality, which is a direct consequence of the triangle inequality and the Cauchy-Schwarz inequality:

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = E|\hat f_X\hat\tau_0 - \hat f_X\tau + \hat f_X\tau - f_X\tau| \le E|\hat f_X\hat\tau_0 - \hat f_X\tau| + E|\hat f_X\tau - f_X\tau| \le \sqrt{E|\hat f_X|^2\,E|\hat\tau_0 - \tau|^2} + |\tau|\sqrt{E|\hat f_X - f_X|^2} \le \sqrt{2\big(E|\hat f_X - f_X|^2 + |f_X|^2\big)\,E|\hat\tau_0 - \tau|^2} + |\tau|\sqrt{E|\hat f_X - f_X|^2} = O\big(\sqrt{E|\hat\tau_0 - \tau|^2}\big) + O\big(\sqrt{E|\hat f_X - f_X|^2}\big). \qquad (.7)

It therefore suffices to estimate E|f̂X − fX|². Under the assumptions of Theorem 1 and by the Taylor expansion formula, one has (see e.g. Fan (1991b))

E(\hat f_X(x)) = f_X(x) + O(h^{m+\alpha}). \qquad (.8)

Recall that sup_{f∊Ψm,α,B} f(x) ≤ C for some constant C > 0 (see Bickel and Ritov (1988)). By the results in Fan (1991b) (see page 1266), one has

\mathrm{var}(\hat f_X(x)) \le \frac{C}{2\pi n h}\int \frac{|\phi_K(t)|^2}{|\phi_Z(\sigma t/h)|^2}\,dt, \qquad (.9)

where we have used ϕU(t) = ϕσZ(t) = ϕZ(σt). Since random variable Z is ordinary smooth, there exists a constant Q > 0 such that

d_0\,|\sigma t/h|^{-\beta} \le |\phi_Z(\sigma t/h)| \le d_1\,|\sigma t/h|^{-\beta}, \quad \text{for } |\sigma t/h| > Q,

for positive constants d0, d1 and β. Hence, by assumptions of Theorem 1,

\mathrm{var}(\hat f_X(x)) \le \frac{C}{2\pi nh}\Big[\int_{|t|\le Qh/\sigma}\frac{|\phi_K(t)|^2}{|\phi_Z(\sigma t/h)|^2}\,dt + \int_{|t|>Qh/\sigma}\frac{|\phi_K(t)|^2}{|\phi_Z(\sigma t/h)|^2}\,dt\Big] \le \frac{C}{2\pi nh}\Big[\int_{|t|\le Qh/\sigma}|\phi_K(t)|^2\,dt + \frac{\sigma^{2\beta}}{h^{2\beta}}\int_{|t|>Qh/\sigma}|\phi_K(t)|^2|t|^{2\beta}\,dt\Big]. \qquad (.10)
  1. σ = O(h). Note that condition (B1) implies that ∫|ϕK(t)|²dt < ∞. Inequality (.10) then implies var(f̂X(x)) = O(1/(nh)). By (.8),

    E|\hat f_X - f_X|^2 = \mathrm{var}(\hat f_X) + |E\hat f_X - f_X|^2 = \mathrm{var}(\hat f_X) + O(h^{2(m+\alpha)}) = O(h^{2(m+\alpha)}) + O\Big(\frac{1}{nh}\Big). \qquad (.11)

    Combining with (.4) and (.7), one gets the desired conclusion by choosing b ~ n^{−1/5} and h ~ n^{−1/(2m+2α+1)}, and noting that \frac{2m+2\alpha}{2m+2\alpha+1} \ge \frac{4}{5} for m + α ≥ 2.

  2. σ ≫ h. Inequality (.10) implies var(f̂X(x)) = O(σ^{2β}n^{−1}h^{−(2β+1)}) (see also Delaigle (2008)).

    The conclusion follows from (.4), (.7), and (.11) by choosing b ~ n^{−1/5} and h \sim \sigma^{\frac{2\beta}{2m+2\alpha+2\beta+1}}\,n^{-1/(2m+2\alpha+2\beta+1)}, which implies \sigma \gg n^{-\frac{1}{2m+2\alpha+1}}.

Proof of Theorem 2

  1. σ = O(h) implies that σ/h is bounded from above by a constant. Hence, formula (.9) implies var(f̂X) = O(1/(nh)) by condition (C1). The rest of the proof is the same as that of part (i) of Theorem 1.

  2. σ ≫ h. Since the random variable Z is supersmooth, there exists Q > 0 such that

\[
|\phi_Z(\sigma t/h)| \ge d_0\,|\sigma t/h|^{\beta_0}\exp\big(-\sigma^\beta|t|^\beta/(\gamma h^\beta)\big), \qquad \text{for } |\sigma t/h| > Q.
\]

    Hence, by assumptions of Theorem 2 and (.9),

\[
\mathrm{var}(\hat f_X) \le \frac{C}{2\pi n h}\left[\int_{|t|\le Qh/\sigma} |\phi_K(t)|^2\,dt + h^{l}\sigma^{-l}\int_{|t|> Qh/\sigma} |\phi_K(t)|^2 \exp\Big(\frac{2\sigma^\beta|t|^\beta}{\gamma h^\beta}\Big)\,|t|^{-2\beta_0}\,dt\right]
\]

    with l = 0 if β_0 ≥ 0 and l = 2β_0 if β_0 < 0. Then var(f̂_X) = O(h^{l−1} n^{−1} σ^{−l} exp(2σ^β/(γh^β))). Combining with (.4), (.7), and (.11), one gets the desired conclusion by taking b ∼ n^{−1/5}, σ = n^{−1/(2(m+α)+1)} a(n), and h = (2/(γD))^{1/β} σ {log a(n)}^{−1/β}, with 1 ≪ a(n) ≪ n^{1/(2(m+α)+1)} and 0 < D < 2m + 2α + 1.
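The particular form of h in part (ii) is chosen so that the exponential factor in the variance bound grows only polynomially in a(n); indeed, with h = (2/(γD))^{1/β} σ {log a(n)}^{−1/β},

```latex
\frac{2\sigma^{\beta}}{\gamma h^{\beta}}
= \frac{2\sigma^{\beta}}{\gamma}\cdot\frac{\gamma D \log a(n)}{2\sigma^{\beta}}
= D \log a(n),
\qquad\text{so}\qquad
\exp\!\left(\frac{2\sigma^{\beta}}{\gamma h^{\beta}}\right) = a(n)^{D},
```

and hence n^{−1} exp(2σ^β/(γh^β)) = n^{−1} a(n)^D → 0 whenever a(n) ≪ n^{1/(2(m+α)+1)} and 0 < D < 2(m+α) + 1.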

Proof of Theorem 3

For all x ∊ ℝ, we consider an estimator of fX(x) as

\[
\hat f_{X,\rho}(x) = \frac{1}{2\pi}\int e^{-itx}\,\phi_K(th)\,\frac{\hat\phi_W(t)\,\overline{\hat\phi_U(t)}}{\max\{|\hat\phi_U(t)|^2,\; n_0^{-1}\}}\,dt.
\]
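To make the definition concrete, a minimal numerical sketch of this ridge estimator follows. The simulation setup (standard normal X, Laplace error, the kernel with φ_K(t) = (1 − t²)³ on [−1, 1], and all sample sizes and bandwidths) is an illustrative assumption, not taken from the paper; φ̂_W and φ̂_U are the empirical characteristic functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_deconv_density(x_grid, W, U_sample, h):
    """Sketch of the ridge deconvolution estimator f_hat_{X,rho}: the unknown
    phi_U is replaced by the empirical characteristic function of a direct
    error sample, and the denominator is ridged by max{|phi_hat_U|^2, 1/n0}
    to avoid division by near-zero values."""
    n0 = len(U_sample)
    t = np.linspace(-1.0 / h, 1.0 / h, 801)      # phi_K(th) vanishes outside [-1/h, 1/h]
    phi_K = (1.0 - (t * h) ** 2) ** 3            # illustrative kernel cf, support [-1, 1]
    phi_W = np.exp(1j * np.outer(t, W)).mean(axis=1)          # empirical cf of W
    phi_U = np.exp(1j * np.outer(t, U_sample)).mean(axis=1)   # empirical cf of U
    weights = phi_K * phi_W * np.conj(phi_U) / np.maximum(np.abs(phi_U) ** 2, 1.0 / n0)
    # f_hat(x) = (1/2pi) * int e^{-itx} weights(t) dt, approximated by a Riemann sum
    dt = t[1] - t[0]
    vals = (np.exp(-1j * np.outer(x_grid, t)) * weights).sum(axis=1) * dt
    return vals.real / (2.0 * np.pi)

# illustrative data: X ~ N(0,1) contaminated by Laplace(0, 0.3) error
n = n0 = 2000
W = rng.normal(0.0, 1.0, n) + rng.laplace(0.0, 0.3, n)
U_sample = rng.laplace(0.0, 0.3, n0)             # direct observations of the error
xs = np.linspace(-4.0, 4.0, 161)
fhat = ridge_deconv_density(xs, W, U_sample, h=0.35)
```

The estimate should roughly recover the N(0,1) shape; with a known error distribution one would replace φ̂_U by the exact characteristic function φ_U.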

The Cauchy–Schwarz inequality and Fubini's theorem imply that

\[
\begin{aligned}
\mathbb{E}\big|\hat f_{X,\rho}(x) - \hat f_X(x)\big|^2 &= \mathbb{E}\left|\frac{1}{2\pi}\int e^{-itx}\,\hat\phi_W(t)\,\frac{\phi_K(th)}{\phi_U(t)}\left(\frac{\overline{\hat\phi_U(t)}\,\phi_U(t)}{\max\{|\hat\phi_U(t)|^2, n_0^{-1}\}} - 1\right)dt\right|^2 \\
&\le \frac{1}{4\pi^2}\left(\int |\phi_K(th)|\,dt\right)\left[\int \frac{|\phi_K(th)|}{|\phi_U(t)|^2}\,\mathbb{E}|\hat\phi_W(t)|^2\,\mathbb{E}\left|\frac{\overline{\hat\phi_U(t)}\,\phi_U(t)}{\max\{|\hat\phi_U(t)|^2, n_0^{-1}\}} - 1\right|^2 dt\right].
\end{aligned}
\]

As shown in Wang and Ye (2012), one has

\[
\mathbb{E}_U\left|\frac{\overline{\hat\phi_U(t)}}{\max\{|\hat\phi_U(t)|^2, n_0^{-1}\}} - \frac{1}{\phi_U(t)}\right|^2 = O\left(\frac{n_0^{-1}}{|\phi_U(t)|^4}\right).
\]

Recall that 𝔼|φ̂_W(t)|² = n^{−1}(1 − |φ_W(t)|²) + |φ_W(t)|², that ∫|φ_K(th)| dt = O(1/h) by condition (B2), and that |φ_W(t)| = |φ_X(t)φ_U(t)| ≤ |φ_U(t)| = |φ_Z(σt)|. One has

\[
I := \mathbb{E}\big|\hat f_{X,\rho}(x) - \hat f_X(x)\big|^2 = O\left(\frac{1}{n_0 h}\int \frac{|\phi_K(th)|}{|\phi_Z(\sigma t)|^2}\,dt\right).
\]

Similar to the proof of inequality (.7), one has,

\[
\mathbb{E}\big|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)\big| = \mathbb{E}\big|\hat f_{X,\rho}\hat\tau - \hat f_{X,\rho}\tau + \hat f_{X,\rho}\tau - f_X\tau\big| = O\Big(\sqrt{\mathbb{E}|\hat\tau-\tau|^2}\Big) + O\Big(\sqrt{\mathbb{E}|\hat f_{X,\rho} - f_X|^2}\Big). \qquad (.12)
\]
  1. σ = O(h). Take h ∼ max{n^{−1/(2m+2α+1)}, n_0^{−1/(2(m+α+1))}}. Similar to the calculation of inequality (.10), I = O(n_0^{−1} h^{−2}). Recall that the proof of (i) in Theorem 1 gives var(f̂_X) = O((nh)^{−1}). Hence, combining with (.8), for all x ∈ ℝ and all f_X ∈ Ψ_{m,α,B}, 𝔼|f̂_X(x) − f_X(x)|² = O(h^{2m+2α}) + O(1/(nh)). By the Cauchy–Schwarz inequality,

\[
\mathbb{E}\big|\hat f_{X,\rho}(x) - f_X(x)\big|^2 \le 2\,\mathbb{E}\big|\hat f_X(x) - f_X(x)\big|^2 + 2\,\mathbb{E}\big|\hat f_{X,\rho}(x) - \hat f_X(x)\big|^2 = O\Big(n^{-\frac{2m+2\alpha}{2m+2\alpha+1}}\Big) + O\Big(n_0^{-\frac{m+\alpha}{m+\alpha+1}}\Big).
\]

    Combining with inequalities (.3) and (.12), one gets the desired result in Theorem 3 by choosing m + α ≥ 2, a ∼ n_0^{−1/5}, and b ∼ n^{−1/5}.

  2. Let σ ≫ h. Take h ∼ max{(σ^{2β} n^{−1})^{1/(2m+2α+2β+1)}, (σ^{2β} n_0^{−1})^{1/(2m+2α+2β+2)}}, which implies σ ≫ max{n^{−1/(2m+2α+1)}, n_0^{−1/(2m+2α+2)}}. Similar to (.10), one has I = O(σ^{2β} n_0^{−1} h^{−(2β+2)}). Recall that the proof of (ii) in Theorem 1 gives var(f̂_X) = O(σ^{2β} n^{−1} h^{−(2β+1)}). Hence, combining with (.8), one has
\[
\mathbb{E}\big|\hat f_X(x) - f_X(x)\big|^2 = O\big(h^{2m+2\alpha}\big) + O\big(\sigma^{2\beta} n^{-1} h^{-(2\beta+1)}\big).
\]

    Therefore, for all x ∊ ℝ and for all fX ∊ Ψm,α,B,

\[
\mathbb{E}\big|\hat f_{X,\rho}(x) - f_X(x)\big|^2 \le O\Big(\sigma^{\frac{4(m+\alpha)\beta}{2(m+\alpha+\beta)+1}}\, n^{-\frac{2(m+\alpha)}{2(m+\alpha+\beta)+1}}\Big) + O\Big(\sigma^{\frac{2(m+\alpha)\beta}{m+\alpha+\beta+1}}\, n_0^{-\frac{m+\alpha}{m+\alpha+\beta+1}}\Big).
\]

    As in (i), the conclusion in (ii) follows from inequalities (.3) and (.12) with a ∼ n_0^{−1/5} and b ∼ n^{−1/5}.

Proof of Theorem 4

  1. σ = O(h) implies that |σ/h| is bounded from above by a constant. Then, by (.9), I = O(n_0^{−1} h^{−2}). The rest of the proof is the same as that of (i) in Theorem 3.

  2. Let σ ≫ h. As in (ii) of Theorem 2, one has I = O(h^{l−2} n_0^{−1} σ^{−l} exp(2σ^β/(γh^β))) and 𝔼|f̂_X(x) − f_X(x)|² ≤ O(h^{2m+2α}) + O(h^{l−1} n_0^{−1} σ^{−l} exp(2σ^β/(γh^β))). By the choice of the bandwidth h, one has, for all x ∈ ℝ and all f_X ∈ Ψ_{m,α,B},

\[
\mathbb{E}\big|\hat f_{X,\rho}(x) - f_X(x)\big|^2 = O\Big(\sigma^{2(m+\alpha)}\,\{\log a(n, n_0)\}^{-2(m+\alpha)/\beta}\Big).
\]

    Combining with inequalities (.3) and (.12), one gets the desired results by choosing a ∼ n_0^{−1/5} and b ∼ n^{−1/5}.

Derivation for the asymptotic dominating term of MIAE(ĝ(x|w))

Let K be an m-th order kernel function with ∫|K(z) z^{m+1}| dz < ∞. Assume that the density function f_X has a continuous, bounded (m + 1)-th order derivative. One can then bound MIAE(ĝ(x|w; h)) from above as follows.

\[
\begin{aligned}
\mathrm{MIAE}\big(\hat g(x|w;h)\big) &= \mathbb{E}\int |f_U(w-x)|\,\big|\hat f_X - \mathbb{E}\hat f_X + \mathbb{E}\hat f_X - f_X\big|\,dx \\
&\le \int \big|f_U(w-x)\,\mathrm{Bias}(\hat f_X)\big|\,dx + \int |f_U(w-x)|\,\mathbb{E}\big|\hat f_X - \mathbb{E}\hat f_X\big|\,dx \\
&\le \int \big|f_U(w-x)\,\mathrm{Bias}(\hat f_X)\big|\,dx + \int |f_U(w-x)|\,\sqrt{\mathrm{var}(\hat f_X)}\,dx.
\end{aligned}
\]

It is known that the bias of f̂_X is

\[
\begin{aligned}
\mathrm{Bias}(\hat f_X) &= \frac{1}{h}\int f_X(y)\,K\big((x-y)/h\big)\,dy - f_X(x) = \int \big[f_X(x-hz) - f_X(x)\big]K(z)\,dz \\
&= \frac{(-1)^m}{m!}\,f_X^{(m)}(x)\int z^m K(z)\,dz\; h^m + O(h^{m+1}) = \frac{(-1)^m}{m!}\,f_X^{(m)}(x)\,R_m(K)\,h^m + O(h^{m+1}),
\end{aligned}
\]

where, by the assumption on f_X, |O(h^{m+1})| ≤ c · h^{m+1} for some constant c > 0. Hence, if ∫|f_U(w−x) f_X^{(m)}(x)| dx exists, then

\[
\int \big|f_U(w-x)\,\mathrm{Bias}(\hat f_X)\big|\,dx = \frac{h^m}{m!}\,|R_m(K)|\int \big|f_U(w-x)\,f_X^{(m)}(x)\big|\,dx + O(h^{m+1}).
\]
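The leading bias term can be checked numerically. For a Gaussian kernel (a second-order kernel, so m = 2 and R₂(K) = ∫z²K(z)dz = 1) and the illustrative choice f_X = N(0,1), the smoothed density ∫f_X(x − hz)K(z)dz is exactly the N(0, 1 + h²) density, so the exact bias is available in closed form and can be compared with (h²/2)f_X″(x):

```python
import numpy as np

def normal_pdf(x, var=1.0):
    """Density of N(0, var)."""
    return np.exp(-x * x / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

x, h = 0.5, 0.05
# Exact bias: convolving N(0,1) with a Gaussian kernel at bandwidth h
# gives N(0, 1 + h^2), so E f_hat_X(x) - f_X(x) is known exactly.
exact_bias = normal_pdf(x, 1.0 + h * h) - normal_pdf(x)
# Leading term of the expansion: (h^2 / 2!) * f_X''(x) * R_2(K), with
# R_2(K) = 1 and f_X''(x) = (x^2 - 1) * phi(x) for the standard normal.
leading = 0.5 * h * h * (x * x - 1.0) * normal_pdf(x)
```

The ratio exact_bias / leading is close to 1 and approaches 1 as h → 0, in line with the higher-order remainder in the expansion.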

For the variance, one has

\[
\begin{aligned}
\mathrm{var}(\hat f_X) &= \frac{1}{4\pi^2 h^2 n}\,\mathbb{E}\left|\int \exp\Big(it\,\frac{x-Y}{h}\Big)\frac{\phi_K(t)}{\phi_U(t/h)}\,dt\right|^2 - \frac{1}{n}\big(\mathbb{E}\hat f_X\big)^2 \\
&\le \frac{\|f_X\|_\infty}{2\pi h n}\int \frac{|\phi_K(t)|^2}{|\phi_U(t/h)|^2}\,dt + o(1/n),
\end{aligned}
\]

where ‖f_X‖_∞ denotes the supremum norm of f_X (see the last formula on page 1266 of Fan (1991b)). Therefore,

\[
\int |f_U(w-x)|\,\sqrt{\mathrm{var}(\hat f_X)}\,dx \le \sqrt{\frac{\|f_X\|_\infty}{2\pi h n}\int \frac{|\phi_K(t)|^2}{|\phi_U(t/h)|^2}\,dt} + o\big(n^{-1/2}\big).
\]

That is,

\[
\mathrm{MIAE}\big(\hat g(x|w;h)\big) \le \frac{h^m\,|R_m(K)|}{m!}\int \big|f_U(w-x)\,f_X^{(m)}(x)\big|\,dx + O(h^{m+1}) + \sqrt{\frac{\|f_X\|_\infty}{2\pi h n}\int \frac{|\phi_K(t)|^2}{|\phi_U(t/h)|^2}\,dt} + o\big(n^{-1/2}\big).
\]
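This dominating term suggests a plug-in bandwidth: minimize the sum of the bias and variance terms over h. The sketch below does so numerically for a Laplace error and the kernel with φ_K(t) = (1 − t²)³ on [−1, 1]; the constants C1 = ∫|f_U(w−x) f_X^{(m)}(x)| dx and C2 = ‖f_X‖_∞, as well as m, n, and the error scale sigma0, are illustrative assumptions rather than values from the paper.

```python
import math
import numpy as np

def miae_bound(h, n, sigma0=0.3, C1=1.0, C2=0.4, m=2):
    """Asymptotic MIAE upper bound at bandwidth h: bias term
    h^m |R_m(K)| C1 / m!  plus the variance term
    sqrt(C2 / (2 pi h n) * int |phi_K(t)|^2 / |phi_U(t/h)|^2 dt),
    for phi_K(t) = (1 - t^2)^3 on [-1, 1] and Laplace(sigma0) error,
    whose characteristic function is phi_U(s) = 1 / (1 + sigma0^2 s^2)."""
    t = np.linspace(-1.0, 1.0, 2001)
    dt = t[1] - t[0]
    # |phi_U(t/h)|^{-2} = (1 + sigma0^2 t^2 / h^2)^2 for the Laplace error
    var_int = np.sum((1.0 - t**2) ** 6 * (1.0 + (sigma0 * t / h) ** 2) ** 2) * dt
    R_m = 6.0  # int z^2 K(z) dz = -phi_K''(0) = 6 for this kernel
    bias = (h ** m) * R_m * C1 / math.factorial(m)
    var = np.sqrt(C2 * var_int / (2.0 * np.pi * h * n))
    return bias + var

hs = np.linspace(0.05, 1.0, 96)
bounds = np.array([miae_bound(h, n=2000) for h in hs])
h_star = float(hs[int(np.argmin(bounds))])   # minimizer of the asymptotic bound
```

The minimizer lies strictly inside the grid: very small h inflates the variance term (the deconvolution factor |φ_U(t/h)|⁻² blows up), while large h inflates the bias term.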


References

  1. Bashtannyk D, Hyndman R. Bandwidth selection for kernel conditional density estimation. Computational Statistics & Data Analysis. 2001;36(3):279–298.
  2. Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A. 1988;50(3):381–393.
  3. Carroll RJ, Hall P. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association. 1988;83:1184–1186.
  4. Comte F, Lacour C. Data-driven density estimation in the presence of additive noise with unknown distribution. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2011;73(4):601–627.
  5. De Gooijer J, Zerom D. On conditional density estimation. Statistica Neerlandica. 2003;57(2):159–176.
  6. Delaigle A. An alternative view of the deconvolution problem. Statistica Sinica. 2008;18(3):1025–1045.
  7. Delaigle A, Gijbels I. Practical bandwidth selection in deconvolution kernel density estimation. Computational Statistics & Data Analysis. 2004;45(2):249–267.
  8. Delaigle A, Meister A. Density estimation with heteroscedastic error. Bernoulli. 2008;14(2):562–579.
  9. Delaigle A, Meister A. Nonparametric function estimation under Fourier-oscillating noise. Statistica Sinica. 2011;21:1065–1092.
  10. Devroye L. A note on the L1 consistency of variable kernel estimates. Annals of Statistics. 1985;13(3):1041–1049.
  11. Diggle PJ, Hall P. A Fourier approach to nonparametric deconvolution of a density estimate. Journal of the Royal Statistical Society Series B (Methodological). 1993;55(2):523–531.
  12. Ding LH, Xie Y, Park S, Xiao G, Story MD. Enhanced identification and biological validation of differential gene expression via Illumina whole-genome expression arrays through the use of the model-based background correction methodology. Nucleic Acids Research. 2008;36(10):e58. doi:10.1093/nar/gkn234.
  13. Efromovich S. Conditional density estimation in a regression setting. Annals of Statistics. 2007;35(6):2504–2535.
  14. Fan J. Asymptotic normality for deconvolution kernel density estimators. Sankhyā: The Indian Journal of Statistics, Series A. 1991a;53(1):97–110.
  15. Fan J. On the optimal rates of convergence for nonparametric deconvolution problems. Annals of Statistics. 1991b;19(3):1257–1272.
  16. Fan J. Deconvolution with supersmooth distributions. Canadian Journal of Statistics. 1992;20(2):155–169.
  17. Fan J, Koo J. Wavelet deconvolution. IEEE Transactions on Information Theory. 2002;48(3):734–747.
  18. Fan J, Yim T. A crossvalidation method for estimating conditional densities. Biometrika. 2004;91(4):819–834.
  19. Hall P, Maiti T. Deconvolution methods for non-parametric inference in two-level mixed models. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2009;71:703–718.
  20. Hall P, Racine J, Li Q. Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association. 2004;99(468):1015–1026.
  21. Hall P, Wand MP. Minimizing L1 distance in nonparametric density estimation. Journal of Multivariate Analysis. 1988a;26(1):59–88.
  22. Hall P, Wand MP. On the minimization of absolute distance in kernel density estimation. Statistics & Probability Letters. 1988b;6(5):311–314.
  23. Hyndman RJ, Bashtannyk DM, Grunwald GK. Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics. 1996;5(4):315–336.
  24. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–264. doi:10.1093/biostatistics/4.2.249.
  25. Johannes J. Deconvolution with unknown error distribution. Annals of Statistics. 2009;37(5A):2301–2323.
  26. McIntyre J, Stefanski LA. Density estimation with replicate heteroscedastic measurements. Annals of the Institute of Statistical Mathematics. 2011;63(1):81–99. doi:10.1007/s10463-009-0220-x.
  27. Meister A. Deconvolution Problems in Nonparametric Statistics. Lecture Notes in Statistics. Springer; New York: 2009.
  28. Neumann MH. On the effect of estimating the error density in nonparametric deconvolution. Journal of Nonparametric Statistics. 1997;7:307–330.
  29. Port SC. Theoretical Probability for Applications. John Wiley; New York: 1994.
  30. Silver JD, Ritchie ME, Smyth GK. Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. 2009;10(2):352–363. doi:10.1093/biostatistics/kxn042.
  31. Stefanski LA, Carroll RJ. Deconvoluting kernel density estimators. Statistics. 1990;21:169–184.
  32. Wang XF, Fan Z, Wang B. Estimating smooth distribution function in the presence of heterogeneous measurement errors. Computational Statistics and Data Analysis. 2010:25–36. doi:10.1016/j.csda.2009.08.012.
  33. Wang XF, Wang B. Deconvolution estimation in measurement error models: The R package decon. Journal of Statistical Software. 2011;39(10):1–24.
  34. Wang XF, Ye D. The effects of error magnitude and bandwidth selection for deconvolution with unknown error distribution. Journal of Nonparametric Statistics. 2012;24(1):153–167. doi:10.1080/10485252.2011.647024.
