Author manuscript; available in PMC: 2016 Jan 1.
Published in final edited form as: J Multivar Anal. 2015 Jan 1;133:38–50. doi: 10.1016/j.jmva.2014.08.011

Conditional Density Estimation in Measurement Error Problems

Xiao-Feng Wang a,*, Deping Ye b
PMCID: PMC4183069  NIHMSID: NIHMS629473  PMID: 25284902

Abstract

This paper is motivated by a wide range of background correction problems in gene array data analysis, where the raw gene expression intensities are measured with error. Estimating a conditional density function from the contaminated expression data is a key aspect of statistical inference and visualization in these studies. We propose re-weighted deconvolution kernel methods to estimate the conditional density function in an additive error model, when the error distribution is known as well as when it is unknown. Theoretical properties of the proposed estimators are investigated with respect to the mean absolute error from a "double asymptotic" view. Practical rules are developed for the selection of smoothing parameters. Simulated examples and an application to an Illumina bead microarray study are presented to illustrate the viability of the methods.

Keywords: measurement error, gene microarray, conditional density, deconvolution, ridge parameter, kernel, bandwidth selection

1. Introduction

Measurement error problems have attracted a great deal of interest in the past two decades. A variety of models and methods for these problems have been applied in scientific fields such as medicine, economics, and astronomy. Statistical deconvolution is an important component of measurement error models. The fundamental objective of deconvolution is to recover the unknown probability density function of a random variable when its observed values are contaminated with error. Let X be the variable of interest, which cannot be observed directly. Instead, we observe a sample of W,

W_j = X_j + U_j, \quad \text{for } 1 \le j \le n, \qquad (1)

where the Xj's are identically distributed as X, the Uj's are identically distributed as U, and the two sequences are mutually independent. The most popular approach to estimating the density of X is the deconvolution kernel estimator, which applies an inverse Fourier transform together with a kernel technique (Carroll and Hall, 1988; Stefanski and Carroll, 1990; Fan, 1991a,b). Other estimation procedures include the truncated Fourier inversion method (Diggle and Hall, 1993), the wavelet-based method (Fan and Koo, 2002), and the penalization approach (Comte and Lacour, 2011), among others. Deconvolution problems based on more complicated model settings have also been extensively studied. Delaigle and Meister (2008), Wang et al. (2010), and McIntyre and Stefanski (2011) considered the problems of heteroscedastic measurement errors. Hall and Maiti (2009) investigated nonparametric deconvolution methods in two-level mixed models. Neumann (1997), Johannes (2009), and Wang and Ye (2012) studied density deconvolution with unknown error distribution. Delaigle and Meister (2011) investigated kernel deconvolution when the characteristic function of the measurement errors contains zeros. Wang and Wang (2011) discussed fast Fourier transform algorithms in measurement error models and developed an R software package. The literature on deconvolution problems is vast and is surveyed in the monograph by Meister (2009).

In this paper, we consider the estimation problem of the conditional density of X given W, fX|W, from the contaminated data Wj's. The problem is motivated by a wide range of background correction problems in gene array data analysis. Gene microarray techniques have become very popular in medical studies. A microarray is a collection of microscopic DNA spots attached to a solid surface. Hundreds of thousands of gene expression values are obtained from one array chip simultaneously. However, reading the expression values from a microarray is a noisy measurement process. The sources of measurement error include, for instance, irregularities in the array surface, variations in the laboratory process, different image scanner settings, and dye effects.

Typically, the first step in gene array data analysis is known as background correction, which refers to adjustments to the contaminated data intended to remove measurement error from the measured signal. Estimating the conditional density function from contaminated gene expression data is therefore a key aspect of statistical inference and visualization here. It provides the most informative summary of the relationship between the contaminated gene intensities and the unobserved true signals. The currently popular model for background correction in bioinformatics is the normal-exponential model (Irizarry et al., 2003; Silver et al., 2009). It assumes that the observed intensity is equal to the true intensity plus the background noise, where the true signal follows an exponential distribution with mean α, and the background noise follows a normal distribution with mean μ and variance σ2. However, the validity of these parametric assumptions is unknown in real gene array studies. Thus, it is of particular interest to nonparametrically estimate the conditional density from the contaminated gene intensities.

A variety of papers discuss nonparametric conditional density estimation when bivariate data are available. Hyndman et al. (1996) studied a kernel estimator. Bashtannyk and Hyndman (2001) and Fan and Yim (2004) proposed several rules for selecting smoothing parameters. De Gooijer and Zerom (2003) proposed a modification of the Nadaraya-Watson type of smoother. Hall et al. (2004) discussed cross-validation and the estimation of conditional probability densities. Efromovich (2007) studied conditional density estimation in a regression setting.

Unlike the conventional conditional density estimation problem from bivariate data, observations of the variable of interest, X, are not available in the measurement error problem. In this paper, we investigate the estimation of the conditional density fX|W from the only available contaminated sample Wj's under the model (1). In Section 2, estimators of fX|W are constructed for the cases of known and unknown error density. In Section 3, theoretical properties of the estimators are investigated with respect to the mean absolute error. In Section 4, practical rules are developed for the selection of smoothing parameters. Simulated examples and an application to an Illumina bead microarray study are presented in Section 5. The proofs of theorems are given in the Appendix, and some additional asymptotic results are provided in the supplement of the article.

2. Methodology

Under the additive measurement error model (1), let fX, fU, and fW be the density functions of X, U, and W, respectively. Denote fX,W(x, w) as the joint density of (X, W). The conditional density of X given W = w is

f_{X|W}(x|w) = f_{X,W}(x,w)/f_W(w) = f_U(w-x)\,f_X(x)/f_W(w). \qquad (2)

2.1. Estimation of fX|W with known error distribution

If one assumes that the error density fU is known explicitly, fX can be estimated by the classical deconvolution kernel approach. It is given by,

\hat f_X(x) = \frac{1}{n}\sum_{j=1}^{n} K_h^*(x - W_j), \qquad (3)

where K_h^*(\cdot) = K^*(\cdot/h)/h,

K^*(z) = \frac{1}{2\pi}\int e^{-itz}\,\frac{\phi_K(t)}{\phi_U(t/h)}\,dt, \qquad (4)

is known as the deconvoluting kernel, and h > 0 is a smoothing parameter. In (4), ϕU is the characteristic function of U, and ϕK(t) = ∫e^{itx}K(x)dx is the Fourier transform of K(x), a symmetric probability kernel with a finite variance ∫x²K(x)dx < ∞. Under the common assumption that ϕK is compactly supported and ϕU does not vanish on the real line, the deconvoluting kernel K*(·) is well defined and finite.

The estimation of fX|W is naturally approached by replacing fX with its deconvolution kernel estimator and fW with its ordinary kernel estimator in (2). This results in a re-weighted deconvolution kernel estimator (RDKE), defined by

\hat f_{X|W}(x|w) = \hat\tau_0(x|w)\sum_{j=1}^{n} K_h^*(x - W_j), \qquad (5)

where

\hat\tau_0(x|w) = \frac{f_U(w-x)}{\sum_{j=1}^{n} L_b(w - W_j)}, \qquad (6)

Lb(·) = L(·/b)/b, L(·) is a real, non-negative kernel function, and b is the bandwidth associated with the kernel density estimate of fW. The estimator (5) is called the RDKE because, compared with the conventional deconvolution kernel estimator, the factor 1/n in (3) is replaced by the weight function τ̂0(·|·) in this new estimator.

2.2. Estimation of fX|W with unknown error distribution

In practice, the exact distribution of the measurement error is typically unknown. Thus, one often conducts a separate independent experiment to collect an additional noise sample. For example, in Illumina bead microarray studies, such an additional noise sample is always available for each gene array. Density deconvolution from a contaminated sample coupled with an additional noise sample has been studied by Neumann (1997), Johannes (2009), and Wang and Ye (2012). For conditional density estimation in the presence of error with unknown distribution, we propose here a ridge-based re-weighted kernel estimator. Let {U0j}, j = 1, ⋯, n0, be direct observations from the error distribution, independent and identically distributed as U. The empirical characteristic function of U is then \hat\phi_U(t) = n_0^{-1}\sum_{j=1}^{n_0} e^{itU_{0j}}. Our estimator takes the form

\hat f_{X|W,\rho}(x|w) = \hat\tau(x|w)\sum_{j=1}^{n} K_{h,\rho}^*(x - W_j), \qquad (7)

where K_{h,\rho}^*(\cdot) = K_\rho^*(\cdot/h)/h, and

K_\rho^*(z) = \frac{1}{2\pi}\int e^{-itz}\,\frac{\phi_K(t)\,\overline{\hat\phi_U(t/h)}}{\max\{|\hat\phi_U(t/h)|^2,\ \rho\}}\,dt, \qquad (8)

\hat\tau(x|w) = \frac{n_0^{-1}\sum_{k=1}^{n_0} G_a(w - x - U_{0k})}{\sum_{j=1}^{n} L_b(w - W_j)}. \qquad (9)

In this estimator, Ga(·) = G(·/a)/a and Lb(·) = L(·/b)/b, where G and L are real, non-negative kernel functions, and a and b are the bandwidths associated with the kernel density estimates of fU and fW, respectively.

We introduce a ridge parameter ρ in (8) in order to prevent ϕ̂U from being too close to zero. When the error distribution is unknown, ϕU might be estimated by its empirical counterpart, ϕ̂U. However, estimating the conditional density becomes unstable if one applies ϕ̂U without any regularization. Note that ϕU can be estimated by ϕ̂U(t) at each point t at the rate n0^{−1/2}. The estimator ϕ̂U is reliable when |ϕ̂U| ≫ n0^{−1/2}, while it becomes unstable when |ϕ̂U| ≪ n0^{−1/2}. Hence, in practice, we simply take the ridge parameter ρ = n0^{−1} in order to avoid the difficulty of simultaneously selecting multiple tuning parameters in (8).
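The ridge-regularized kernel (8) with ρ = 1/n0 can be sketched numerically as follows (our own illustration; `ridge_deconv_kernel` is a hypothetical name, and ϕK(t) = (1 − t²)³ on [−1, 1] is again assumed):

```python
import numpy as np

def ridge_deconv_kernel(z, h, U0):
    """Ridge-regularized deconvoluting kernel K*_rho(z) of (8), with
    rho = 1/n0 as suggested in the text. U0 is the additional noise
    sample defining the empirical characteristic function phi_U_hat."""
    U0 = np.asarray(U0, dtype=float)
    n0 = U0.size
    rho = 1.0 / n0
    t = np.linspace(-1.0, 1.0, 1001)
    phi_K = (1.0 - t**2) ** 3
    # empirical characteristic function evaluated at t/h
    phi_U_hat = np.exp(1j * np.outer(t / h, U0)).mean(axis=1)
    ratio = phi_K * np.conj(phi_U_hat) / np.maximum(np.abs(phi_U_hat) ** 2, rho)
    vals = np.exp(-1j * np.outer(np.asarray(z, float), t)) * ratio
    return np.real(np.trapz(vals, t, axis=1)) / (2.0 * np.pi)
```

A useful sanity check: when the noise sample is identically zero, ϕ̂U ≡ 1 and K*_ρ reduces to the ordinary kernel K, whose value at zero is (1/2π)∫(1 − t²)³dt = (32/35)/(2π).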

3. Theoretical properties

We present here asymptotic results based on the mean absolute distance between f̂X|W and fX|W. Define the mean absolute error (MAE) by MAE(f̂X|W) = E|f̂X|W − fX|W|, which is the "local" analogue of the L1 distance between fX|W and its estimate. The L1 view of conventional kernel density estimation for error-free data has attracted great attention. Devroye (1985) gave compelling illustrations of the mathematical attractions of the L1 distance. For instance, it is always well defined on the space of density functions, and it is invariant under monotone transformations. Studying the MAE properties is also motivated by the practical interest in gene array background correction. Some additional theoretical results based on the mean square distance can be found in the supplement of this paper.

In measurement error problems, the quality of a sample does not only depend on its size but also crucially relates to the magnitude of the error variance σ2. This phenomenon was observed by Fan (1992) and has been extensively studied by Delaigle (2008). Delaigle (2008) made the following argument for an alternative asymptotic view of measurement error problems. In standard asymptotic theory, n may not be particularly large in practice, yet statisticians analyze theoretical properties in the idealized situation n → ∞, which reveals the behavior of an estimator when the sample size is not too small. In measurement error problems, just as any given sample size can be considered a finite-sample approximation of n → ∞, any given σ2 can likewise be considered a finite-sample version of σ2 → 0. Considering the asymptotic properties where both σ2 → 0 and n → ∞ (named the double asymptotics by Delaigle (2008)) provides a more appropriate way to uncover important properties of an estimator when σ2 is not too large.

Following Delaigle's recommendation, we study the double asymptotic properties of our estimators by considering both σ2 → 0 and n → ∞. The rates of convergence under the standard asymptotic framework are also provided in this section. Let us rewrite the model (1) as:

W_j = X_j + \sigma Z_j, \quad \text{for } 1 \le j \le n, \qquad (10)

where Uj = σZj, and the Zj's are independent and identically distributed as Z with density fZ and variance var(Z) = 1. Hereinafter, when we refer to model (10), the asymptotic properties are for both σ2 → 0 and n → ∞; when we refer to model (1), they are for n → ∞ only. We always assume that the density function fW is strictly positive and has a continuous second derivative. The unknown probability density function fX belongs to the class

\Psi_{m,\alpha,B} = \{f(x): |f^{(m)}(x) - f^{(m)}(x+\delta)| \le B\,\delta^\alpha\},

where α ∊ [0,1), m ∊ℕ, and B > 0 are known constants. We consider two classes of errors as in the classical deconvolution literature: ordinary smooth error U of order β if

d_0\,|t|^{-\beta} \le |\phi_U(t)| \le d_1\,|t|^{-\beta}, \quad \text{as } t \to \infty,

for some positive constants d0,d1,β; and supersmooth error U of order β if

d_0\,|t|^{\beta_0}\exp(-|t|^\beta/\gamma) \le |\phi_U(t)| \le d_1\,|t|^{\beta_1}\exp(-|t|^\beta/\gamma), \quad \text{as } t \to \infty,

for some positive constants d0, d1, γ, β and constants β0, β1. We further state the conditions for the kernel functions:

  • (A1) K is bounded, continuous, and ∫|y|^{m+α}|K(y)|dy < ∞. Moreover, its characteristic function, ϕK, is symmetric and satisfies ϕK(t) = 1 + O(|t|^{m+α}) as t → 0.

  • (A2) The kernel function L(x) is a real, non-negative, even kernel function on ℝ such that ∫ L(x)dx = 1, ∫ xL(x)dx = 0, and ∫ x2L(x)dx < ∞. The kernel function G(·) follows the same conditions.

Condition (A1) asserts that K is essentially a kernel function of order m + α.

3.1. Theoretical properties when fU is known

We use c(n) ∼ b(n) to represent d2c(n) ≤ b(n) ≤ d3c(n) for some constants d2, d3 > 0. c(n) ≫ b(n) (resp. c(n) ≪ b(n)) represents b(n) = o(c(n)) (resp. c(n) = o(b(n))).

  • (B1) ∫|t|^β|ϕK(t)|dt < ∞ and ∫|t|^{2β}|ϕK(t)|²dt < ∞.

Theorem 1

Assume that the variable Z in (10) is ordinary smooth of order β. Under the assumptions (A1), (A2), (B1), and b ~ n^{−1/5},

  1. if \sigma = O(n^{-\frac{1}{2m+2\alpha+1}}) and h \sim n^{-\frac{1}{2m+2\alpha+1}}, then ∀ x, w ∊ ℝ and ∀ fX ∊ Ψm,α,B,
     E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5});
  2. if \sigma \gg n^{-\frac{1}{2m+2\alpha+1}} and h \sim \sigma^{\frac{2\beta}{2(m+\alpha+\beta)+1}}\,n^{-\frac{1}{2(m+\alpha+\beta)+1}}, then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O\big(\sigma^{\frac{2(m+\alpha)\beta}{2(m+\alpha+\beta)+1}}\,n^{-\frac{m+\alpha}{2(m+\alpha+\beta)+1}}\big).

Remark 1. For the model (1), by considering σ to be a constant, one gets

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O\big(n^{-\frac{m+\alpha}{2m+2\alpha+2\beta+1}}\big).

Notice that the assumption b ~ n^{−1/5} is standard in kernel density estimation and h ~ n^{−1/(2m+2α+2β+1)} is standard in deconvolution kernel density estimation. If m + α < 4β + 2, then \frac{m+\alpha}{2m+2\alpha+2\beta+1} < \frac{2}{5}, so the rate of convergence is O(n^{−(m+α)/(2m+2α+2β+1)}). On the other hand, if m + α ≥ 4β + 2, then the rate of convergence is O(n^{−2/5}). If we assume that fW(·) has higher-order derivatives and we choose a higher-order kernel function L, for instance of 2r-th order, then we can let b ~ n^{−1/(4r+1)}. In this case, the rate of convergence becomes O(n^{−2r/(4r+1)}) + O(n^{−(m+α)/(2m+2α+2β+1)}).

Next we address the theorem in the case of supersmooth error. We need the following condition.

  • (C1) ϕK(t) = 0 for all |t| > 1, that is, ϕK has support on [−1, 1]. Moreover, ∫|ϕK(t)|²|t|^{−2β₀}dt < ∞.

    More generally, in condition (C1), one can assume that ϕK has compact support [−M, M] for some 0 < M < ∞.

Theorem 2

Assume that the variable Z in (10) is supersmooth with ϕZ(t) ≠ 0 for any t. Under the assumptions (A1), (A2), (C1), and b ~ n^{−1/5},

  1. if \sigma = O(n^{-\frac{1}{2(m+\alpha)+1}}) and h \sim n^{-\frac{1}{2(m+\alpha)+1}}, then ∀ x, w ∊ ℝ and ∀ fX ∊ Ψm,α,B,

     E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5});
  2. if \sigma = n^{-\frac{1}{2(m+\alpha)+1}}\,a(n) and h = (2/(\gamma D))^{1/\beta}\,\sigma\,\{\log a(n)\}^{-1/\beta} with 1 \ll a(n) \ll n^{\frac{1}{2(m+\alpha)+1}} and D < 2m + 2\alpha + 1, then ∀ x, w ∊ ℝ and ∀ fX ∊ Ψm,α,B,

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O\big(\sigma^{m+\alpha}\{\log a(n)\}^{-\frac{m+\alpha}{\beta}}\big).

Remark 2. For the model (1), by considering σ to be a constant, one has,

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = O\big((\log n)^{-\frac{m+\alpha}{\beta}}\big).

The choice of b ~ n^{−1/5} is standard for kernel density estimation and h = (4/γ)^{1/β}(log n)^{−1/β} is standard for deconvolution kernel density estimation. Here, the order of the derivatives of fW(·) and the order of the kernel function L do not change the rate of convergence.

3.2. Theoretical properties when fU is unknown

For the ordinary smooth error distribution, we need the following condition:

  • (B2) ∫|ϕK(t)||t|^{2βs}dt < ∞ for s = 0, 1, 2.

Theorem 3

Assume that the variable Z is ordinary smooth of order β but fZ is unknown. Under the assumptions (A1), (A2), (B1), (B2), and by choosing a ~ n0^{−1/5}, b ~ n^{−1/5},

  1. if \sigma = O(h) and h \sim \max\{n^{-\frac{1}{2m+2\alpha+1}},\ n_0^{-\frac{1}{2(m+\alpha+1)}}\}, then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

     E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O(n_0^{-2/5}) + O\big(n_0^{-\frac{m+\alpha}{2(m+\alpha+1)}}\big);
  2. if \sigma \gg \max\{n^{-\frac{1}{2m+2\alpha+1}},\ n_0^{-\frac{1}{2m+2\alpha+2}}\}, and

h \sim \max\big\{\sigma^{\frac{2\beta}{2(m+\alpha+\beta)+1}}\,n^{-\frac{1}{2m+2\alpha+2\beta+1}},\ \sigma^{\frac{\beta}{m+\alpha+\beta+1}}\,n_0^{-\frac{1}{2m+2\alpha+2\beta+2}}\big\},

then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O\big(\sigma^{\frac{2(m+\alpha)\beta}{2(m+\alpha+\beta)+1}}\,n^{-\frac{m+\alpha}{2(m+\alpha+\beta)+1}}\big) + O\big(\sigma^{\frac{(m+\alpha)\beta}{m+\alpha+\beta+1}}\,n_0^{-\frac{m+\alpha}{2(m+\alpha+\beta+1)}}\big) + O(n^{-2/5}) + O(n_0^{-2/5}).

Remark 3. For the model (1), by considering σ to be a constant, one has

E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O\big(n^{-\frac{m+\alpha}{2(m+\alpha+\beta)+1}}\big) + O(n^{-2/5}) + O(n_0^{-2/5}) + O\big(n_0^{-\frac{m+\alpha}{2(m+\alpha+\beta+1)}}\big).
Theorem 4

Assume that the variable Z is supersmooth with ϕZ(t) ≠ 0 for any t but fZ is unknown. Under the assumptions (A1), (A2), (C1), and by choosing a ~ n0^{−1/5}, b ~ n^{−1/5},

  1. if \sigma = O(h) and h \sim \max\{n^{-\frac{1}{2m+2\alpha+1}},\ n_0^{-\frac{1}{2(m+\alpha+1)}}\}, then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

     E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O(n_0^{-2/5}) + O\big(n_0^{-\frac{m+\alpha}{2(m+\alpha+1)}}\big);
  2. if \sigma = \max\{n^{-\frac{1}{2(m+\alpha)+1}},\ n_0^{-\frac{1}{2(m+\alpha+1)}}\}\,a(n, n_0) and h = (2/(\gamma D))^{1/\beta}\,\sigma\,\{\log a(n, n_0)\}^{-1/\beta}, with 1 \ll a(n, n_0) \ll \min\{n^{\frac{1}{2(m+\alpha)+1}},\ n_0^{\frac{1}{2(m+\alpha+1)}}\} and D < 2m + 2\alpha + 1, then for all x, w ∊ ℝ and for all fX ∊ Ψm,α,B,

     E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O(n^{-2/5}) + O(n_0^{-2/5}) + O\big(\sigma^{m+\alpha}\{\log a(n, n_0)\}^{-\frac{m+\alpha}{\beta}}\big).

Remark 4. For the model (1), by considering σ to be a constant, one has

E|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)| = O\big(\{\log n\}^{-\frac{m+\alpha}{\beta}}\big) + O\big(\{\log n_0\}^{-\frac{m+\alpha}{\beta}}\big).

The above four theorems offer the double asymptotic view of the proposed RDKEs, which provides a better interpretation of the asymptotic behavior of the estimators than the results from the standard asymptotic view. For either ordinary smooth or supersmooth error, there are two different rates of convergence depending on the error magnitude. For instance, the convergence rate for supersmooth error varies from the rate of error-free kernel density estimation to the very slow rate of classical supersmooth deconvolution. The theoretical results confirm that the quality of the RDKEs depends not only on the sample size but also on the error magnitude. They also suggest that, in practice, the RDKEs can perform well with a moderate sample size in the case of Gaussian error, as long as σ2 is not too large.

4. Selection of smoothing parameters

Bandwidth plays a critical role in the practical implementation of smoothing techniques. The selection of bandwidth in deconvolution kernel density estimation has been broadly studied in the literature (Delaigle and Gijbels, 2004). Here we propose a simple but intuitively appealing method for selecting the smoothing parameters in the conditional density estimation. Let us focus on the case of known error distribution. Our procedure includes two steps. First, we select b for the kernel density estimate \hat f_W(w) = n^{-1}\sum_{j=1}^{n} L_b(w - W_j). Many bandwidth selection methods are available in classical kernel density estimation; here we use the one that minimizes the mean absolute distance (Hall and Wand, 1988a,b). Second, for given b and w, the proposed bandwidth h for estimating f̂X|W(x|w) using (5) is the minimizer of the mean integrated absolute error (MIAE) of g(x|w) = fU(w − x)fX(x), i.e.,

h_{MIAE} = \arg\min_h \Big\{E\int |\hat g(x|w; h) - g(x|w)|\,dx\Big\}, \qquad (11)

where ĝ(x|w; h) = fU(w − x)f̂X(x) and f̂X is the deconvolution kernel estimate defined by (3).

Let K be an m-th order kernel function with ∫|K(z)z^{m+1}|dz < ∞. Assume that the density function fX has a continuous, bounded (m + 1)-th order derivative. Denote Rm(K) = ∫x^m K(x)dx and Sm(fX) = ∫|fU(w − x)fX^{(m)}(x)|dx for any positive integer m, and let ‖fX‖∞ be the supremum norm of fX. It can be shown that the asymptotic dominating term of the MIAE of ĝ(x|w) is given by

\mathrm{AMIAE}(\hat g(x|w; h)) = \frac{h^m}{m!}\,|R_m(K)|\,S_m(f_X) + \sqrt{\frac{\|f_X\|_\infty}{2\pi h n}\int \frac{|\phi_K(t)|^2}{|\phi_U(t/h)|^2}\,dt}. \qquad (12)

In practice, we use this asymptotic approximation rather than the exact MIAE because of its rather simple expression. Formula (12) involves the unknown quantities Sm(fX) and ‖fX‖∞. We suggest two possible estimators for them, which yield an estimator \widehat{\mathrm{AMIAE}}(\hat g(x|w; h)). The first is a simple normal reference approach. Assume that X is from a normal distribution N(μX, σX²). Then one calculates μ̂X = μ̂W and σ̂X² = σ̂W² − σU², where μ̂W and σ̂W² are the sample mean and variance of the observations Wj's. From these, one can easily calculate Sm(f̂X) and ‖f̂X‖∞. This approach may not perform well when X is far from normally distributed. Our second approach is to estimate fX through the classical deconvolution kernel method and then numerically evaluate Sm(f̂X) and ‖f̂X‖∞. In our simulation experience, the second method often performs better than the normal reference method, although it is more computationally involved.

The proposed bandwidth hMIAE depends on w. If necessary, one could select a global bandwidth

h_{MIAE} = \arg\min_h \Big\{E\iint |\hat g(x|w; h) - g(x|w)|\,dx\,dw\Big\},

where the integration is over the region of x and w of interest.

Certainly, one may consider using the criterion of minimizing the mean integrated squared error instead. Our simulation experience suggests that there is negligible difference between the two criteria in practical bandwidth selection. The above methods for selecting smoothing parameters extend naturally to the case of unknown error distribution, where a and b are pre-determined from the additional noise data and the contaminated data, respectively.
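The normal reference step of the first approach is easy to sketch (our own illustration; `normal_reference_fX` is a hypothetical name, and the truncation of the variance estimate at a small positive value is our own safeguard, not stated in the paper):

```python
import numpy as np

def normal_reference_fX(W, sigma_U):
    """Normal-reference step of Section 4: treat X as N(mu_X, s2_X) with
    mu_X = mean(W) and s2_X = var(W) - sigma_U^2 (truncated at a small
    positive value to guard against a negative variance estimate)."""
    W = np.asarray(W, dtype=float)
    mu_X = W.mean()
    s2_X = max(W.var(ddof=1) - sigma_U**2, 1e-8)
    sup_fX = 1.0 / np.sqrt(2.0 * np.pi * s2_X)   # sup norm of a normal density
    return mu_X, s2_X, sup_fX
```

Under the reference normal, quantities such as Sm(fX) and ‖fX‖∞ in (12) then follow in closed form from (mu_X, s2_X).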

5. Numerical properties

5.1. Simulated examples

We illustrate our methods via two simulated models. We take the kernel K to be a second-order kernel function with ϕK(t) = (1 − t²)₊³, and the kernels L and G to be Gaussian.

Example 1. Consider a normal-exponential convolution model, Wj = Xj + Uj, where Xj is exponentially distributed with mean α, while Uj is normally distributed with mean μ and variance σ2. It can be shown, with some simple transformation and algebra, that the conditional density of X given W is

f_{X|W}(x|w) = \frac{\varphi(x;\ w - \mu - \sigma^2/\alpha,\ \sigma^2)}{1 - \Phi(0;\ w - \mu - \sigma^2/\alpha,\ \sigma^2)}, \quad x > 0,

where φ(·) and Φ(·) denote the Gaussian density and distribution functions, respectively. A sample of 1000 was generated from this model with parameters α = 10, μ = 2, σ = 2. We assume that the measurement error parameters are known. Formula (5) was applied to estimate the conditional density functions from the contaminated data Wj's, with bandwidths selected by the method proposed in Section 4. Figure 1, which presents a typical simulated example, displays the estimated conditional densities using the RDKE with known error distribution. In (a)-(c), the estimated conditional densities (solid curves) for (a) w = 5, (b) w = 10, (c) w = 30 are compared with the true densities (dashed curves). The RDKE performed quite well in recovering the true functions. (d) shows a "stacked conditional density plot" (Hyndman et al., 1996), which displays a number of densities plotted side by side in a perspective plot. The plot highlights the conditioning, which allows us to evaluate how the densities change with w.

Figure 1.

A simulated example of conditional density estimation with measurement error for the normal-exponential model. In (a)-(c), estimated conditional densities (solid curves) for (a) w = 5, (b) w = 10, (c) w = 30 using the re-weighted deconvolution kernel method with known error distribution are compared with the true densities (dashed curves). (d) shows a number of densities plotted side by side in a perspective plot.

Example 2. Consider a normal-normal convolution model, Wj = Xj + Uj, where Xj is normally distributed with mean μ1 and variance σ12, while Uj is normally distributed with mean μ2 and variance σ22. The true conditional density of X given W is

f_{X|W}(x|w) = \frac{\varphi(x;\,\mu_1,\,\sigma_1^2)\,\varphi(w - x;\,\mu_2,\,\sigma_2^2)}{\varphi(w;\,\mu_1 + \mu_2,\,\sigma_1^2 + \sigma_2^2)}.

We generated the Xj's and Uj's from N(2, 9) and N(0, 1), respectively, with sample size n = 1000. In this example, we assumed that the error distribution was unknown, so an additional noise sample of U0j's was generated from N(0, 1) with size n0 = 500. Formula (7) was applied to estimate the conditional density functions from the contaminated data Wj's coupled with the noise data U0j's. Figure 2 shows the estimated conditional densities in a simulated example. In (a)-(c), the estimated conditional densities (solid curves) for (a) w = −2.5, (b) w = 0, (c) w = 2.5 are compared with the true densities (dashed curves). The estimated curves almost coincide with the true curves. (d) exhibits the stacked conditional density plot in a perspective view.
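For this model the displayed ratio of normal densities coincides with the familiar conjugate-normal posterior, which provides another quick check (our own sketch; names are hypothetical, and v1, v2 denote variances):

```python
import numpy as np
from math import pi

def norm_pdf(x, m, v):
    # Gaussian density with mean m and variance v
    return np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * pi * v)

def cond_density_normnorm(x, w, mu1, v1, mu2, v2):
    """f_{X|W}(x|w) of Example 2 as a ratio of normal densities."""
    return norm_pdf(x, mu1, v1) * norm_pdf(w - x, mu2, v2) \
        / norm_pdf(w, mu1 + mu2, v1 + v2)
```

Standard Gaussian algebra gives the same density as N(μ1 + σ1²(w − μ1 − μ2)/(σ1² + σ2²), σ1²σ2²/(σ1² + σ2²)); for the parameters of this example and w = 2.5, that is N(2.45, 0.9).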

Figure 2.

A simulated example of conditional density estimation with measurement error for the normal-normal model. In (a)-(c), estimated conditional densities (solid curves) for (a) w = −2.5, (b) w = 0, (c) w = 2.5 using the re-weighted deconvolution kernel method with unknown error distribution are compared with the true densities (dashed curves). (d) shows a number of densities plotted side by side in a perspective plot.

5.2. An application to an Illumina bead microarray study

The Illumina bead microarray is one of the most popular microarray platforms in genetics. One distinctive feature of the bead array technology by Illumina Inc. is that more than one thousand "control bead types", in addition to gene sequences, are allocated on each array. These control beads do not correspond to any expressed sequences in the genome. The control bead data provide the additional noise sample that is used to evaluate the noise distribution on the array in an experiment.

We illustrate our methods using the Illumina microarray data from a leukemia study by Ding et al. (2008), which investigated the pathogenesis of leukemia. Irradiated mice that subsequently developed acute myeloid leukemia (AML) were used to study the leukemogenic process, and Illumina Mouse-6 V1 BeadChip whole-genome expression arrays were used to obtain the gene expression profiles of the AML samples. Ding et al. (2008) considered the normal-exponential model for the observed gene intensities in their analysis. Here we demonstrate conditional density estimation for the third bead array; the other bead arrays showed similar results. The intensity values of 46120 genes and 1655 negative controls were obtained from this array. We estimate the conditional densities nonparametrically using the re-weighted deconvolution kernel method with unknown error distribution. The results are displayed in Figure 3. In (a)-(c), the solid curves denote estimated conditional densities for the observed intensity (a) w = 180, (b) w = 190, (c) w = 200. For comparison, estimated conditional densities based on the parametric normal-exponential model are also displayed (dashed curves). The nonparametric fit shows that the conditional densities are right-skewed, deviating from the fit based on the parametric model. The stacked conditional density plot in (d) unveils the evolution of the conditional densities over the observed intensities. The results suggest that the normal-exponential assumption is not realistic in this study; a model that relaxes the parametric assumptions may be useful for gene background correction here.

Figure 3.

The analysis of the Illumina bead microarray data. In (a)-(c), estimated conditional densities (solid curves) for (a) w = 180, (b) w = 190, (c) w = 200 using the re-weighted deconvolution kernel method with unknown error distribution are compared with the estimated conditional densities (dashed curves) using the normal-exponential method. (d) shows the nonparametric conditional density estimates plotted side by side in a perspective plot.

Supplementary Material

supplement

Acknowledgments

The authors are grateful to the editor and the reviewers for their valuable comments. The research of XFW is supported in part by NIH UL1 RR024989, and the research of DY is supported by a NSERC grant and a grant from Memorial University of Newfoundland.

Appendix

We now outline the key ideas of the proofs. We shall use Proposition 31.8 of Port (1994), which is restated as Lemma 1.

Lemma 1

Let q1(Xi) and q2(Xi) be two random variables with means μ1 and μ2, variances ν1 and ν2, respectively, and with covariance ν12. Let {X1, ⋯, Xn} be an i.i.d. sequence of random variables and define

\hat\mu_j = \frac{1}{n}\sum_{i=1}^{n} q_j(X_i), \quad \text{for } j = 1, 2, \quad \text{and} \quad \hat R = \hat\mu_1/\hat\mu_2.

Then the second-order approximation of E(R̂) is

E\hat R \approx \frac{\mu_1}{\mu_2} + \frac{1}{n}\Big(\frac{\mu_1\nu_2}{\mu_2^3} - \frac{\nu_{12}}{\mu_2^2}\Big),

and the first-order approximation of var(R̂) is

\mathrm{var}(\hat R) \approx \frac{1}{n\mu_2^2}\Big(\nu_1 + \frac{\mu_1^2\nu_2}{\mu_2^2} - \frac{2\mu_1\nu_{12}}{\mu_2}\Big).
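Lemma 1's two delta-method approximations can be checked by a quick Monte Carlo experiment (our own illustration, not from the paper). Take q1(X) = X and q2(X) = X² with X ~ Exp(1), for which μ1 = 1, μ2 = 2, ν1 = 1, ν2 = 20, and ν12 = 4:

```python
import numpy as np

# Monte Carlo sanity check of Lemma 1 for a ratio of sample means:
# R_hat = mean(X) / mean(X^2) with X ~ Exp(1) and n = 200.
rng = np.random.default_rng(3)
n, reps = 200, 5000
X = rng.exponential(1.0, size=(reps, n))
R = X.mean(axis=1) / (X**2).mean(axis=1)      # one R_hat per replication

mu1, mu2 = 1.0, 2.0                            # E X, E X^2
v1, v2, v12 = 1.0, 20.0, 4.0                   # var X, var X^2, cov(X, X^2)
mean_approx = mu1 / mu2 + (mu1 * v2 / mu2**3 - v12 / mu2**2) / n
var_approx = (v1 + mu1**2 * v2 / mu2**2 - 2 * mu1 * v12 / mu2) / (n * mu2**2)
```

With these moments, mean_approx = 0.5075 and var_approx = 0.0025; the simulated mean and variance of R̂ land close to both, and the 1/n bias correction beyond the naive ratio μ1/μ2 = 0.5 is clearly visible.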

Employing Lemma 1 with the random variables q1(Wj) = 1 and q_2(W_j) = b^{-1}L\big(\frac{w - W_j}{b}\big), one has

E([\hat f_W(w)]^{-1}) = [E\hat f_W(w)]^{-1} + O((nb)^{-1}), \qquad (.1)
E([\hat f_W(w)]^{-2}) = [E\hat f_W(w)]^{-2} + O((nb)^{-1}), \qquad (.2)

where \hat f_W(w) = n^{-1}\sum_{j=1}^{n} q_2(W_j) is the kernel estimator for fW.

Lemma 2

Let τ(w|x) = fU(w − x)/fW(w) and let τ̂(w|x) be as in (9). By choosing a ~ n0^{−1/5} and b ~ n^{−1/5}, one has, for all x, w ∊ ℝ,

E|\hat\tau(w|x) - \tau(w|x)|^2 = O(n_0^{-4/5}) + O(n^{-4/5}), \qquad (.3)
E|\hat\tau_0(w|x) - \tau(w|x)|^2 = O(n^{-4/5}). \qquad (.4)

Proof. Let us prove formula (.3) first. Recall that {U0k}, k = 1, …, n0, and {Wj}, j = 1, …, n, are independent. Consider \hat f_W(w) = \sum_{i=1}^{n} \frac{1}{n} g_2(W_i) with

g_2(W_i) = \frac{1}{b} L\Big(\frac{w - W_i}{b}\Big).

Then, one has (see for instance Lemma 1 of Hyndman et al. (1996))

\mu_2 = E g_2 = f_W(w) + \frac{b^2\sigma_L^2}{2} f_W''(w) + O(b^4) = f_W(w)[1 + O(b^2)], \qquad (.5)
\nu_2 = \mathrm{var}\Big(\frac{1}{b}L\Big(\frac{w - W_i}{b}\Big)\Big) = \frac{f_W(w)\,R(L)}{b}[1 + O(b)], \qquad (.6)

where \sigma_L^2 = \int \omega^2 L(\omega)\,d\omega and R(L) = \int L^2(\omega)\,d\omega.

By formulas (.1) and (.5), one has, for all x,w,

|E\hat\tau(w|x) - \tau(w|x)| = \Big|E\hat f_U(w-x)\,E\Big(\frac{1}{\hat f_W(w)}\Big) - \tau(w|x)\Big| = \Big|\frac{f_U(w-x) + O(a^2)}{f_W(w)[1 + O(b^2)]} + O\Big(\frac{1}{nb}\Big) - \tau(w|x)\Big| = O(a^2) + O(b^2) + O\Big(\frac{1}{nb}\Big).

Similarly, formulas (.1), (.2), (.5), and (.6) imply

\mathrm{var}(\hat\tau(w|x)) = E\Big|\frac{\hat f_U(w-x)}{\hat f_W(w)}\Big|^2 - \Big|E\Big(\frac{\hat f_U(w-x)}{\hat f_W(w)}\Big)\Big|^2 = O\Big(\frac{1}{n_0 a}\Big) + O\Big(\frac{1}{nb}\Big).

Hence, a ~ n0^{−1/5} and b ~ n^{−1/5} imply the desired result in Lemma 2.

To prove formula (.4), one can repeat the above calculation, removing all terms involving a, since here f̂U is replaced by fU.

Proof of Theorem 1

Recall that f̂X|W(x|w) = τ̂0(x|w)f̂X(x). We need the following inequality, which is a direct consequence of the triangle inequality and the Cauchy-Schwarz inequality:

E|\hat f_{X|W}(x|w) - f_{X|W}(x|w)| = E|\hat f_X\hat\tau_0 - \hat f_X\tau + \hat f_X\tau - f_X\tau| \le E|\hat f_X\hat\tau_0 - \hat f_X\tau| + E|\hat f_X\tau - f_X\tau| \le \sqrt{E|\hat f_X|^2\,E|\hat\tau_0 - \tau|^2} + |\tau|\sqrt{E|\hat f_X - f_X|^2} \le \sqrt{2\big(E|\hat f_X - f_X|^2 + |f_X|^2\big)\,E|\hat\tau_0 - \tau|^2} + |\tau|\sqrt{E|\hat f_X - f_X|^2} = O\big(\sqrt{E|\hat\tau_0 - \tau|^2}\big) + O\big(\sqrt{E|\hat f_X - f_X|^2}\big). \qquad (.7)

It therefore suffices to estimate E|f̂X − fX|². Under the assumptions of Theorem 1 and by the Taylor expansion formula, one has (see e.g. Fan (1991b))

E(\hat f_X(x)) = f_X(x) + O(h^{m+\alpha}). \qquad (.8)

Recall that sup_{f∊Ψm,α,B} f(x) ≤ C for some constant C > 0 (see Bickel and Ritov (1988)). By the results in Fan (1991b) (see page 1266), one has

\mathrm{var}(\hat f_X(x)) \le \frac{C}{2\pi n h}\int \frac{|\phi_K(t)|^2}{|\phi_Z(\sigma t/h)|^2}\,dt, \qquad (.9)

where we have used ϕU(t) = ϕσZ(t) = ϕZ(σt). Since random variable Z is ordinary smooth, there exists a constant Q > 0 such that

d_0\,|\sigma t/h|^{-\beta} \le |\phi_Z(\sigma t/h)| \le d_1\,|\sigma t/h|^{-\beta}, \quad \text{for } |\sigma t/h| > Q,

for positive constants d0, d1 and β. Hence, by assumptions of Theorem 1,

\mathrm{var}(\hat f_X(x)) \le \frac{C}{2\pi nh}\Big[\int_{|t|\le Qh/\sigma}\frac{|\phi_K(t)|^2}{|\phi_Z(\sigma t/h)|^2}\,dt + \int_{|t|>Qh/\sigma}\frac{|\phi_K(t)|^2}{|\phi_Z(\sigma t/h)|^2}\,dt\Big] \le \frac{C}{2\pi nh}\Big[\int_{|t|\le Qh/\sigma}|\phi_K(t)|^2\,dt + \frac{\sigma^{2\beta}}{h^{2\beta}}\int_{|t|>Qh/\sigma}|\phi_K(t)|^2|t|^{2\beta}\,dt\Big]. \qquad (.10)
  1. σ = O(h). Note that condition (B1) implies that ∫|ϕK(t)|²dt < ∞. Inequality (.10) then implies var(f̂X(x)) = O(1/(nh)). By (.8),

    E|\hat f_X - f_X|^2 = \mathrm{var}(\hat f_X) + |E\hat f_X - f_X|^2 = \mathrm{var}(\hat f_X) + O(h^{2(m+\alpha)}) = O(h^{2(m+\alpha)}) + O\Big(\frac{1}{nh}\Big). \qquad (.11)

    Combining with (.4) and (.7), one gets the desired conclusion by choosing b ~ n^{−1/5} and h ~ n^{−1/(2m+2α+1)}, and noting that \frac{2m+2\alpha}{2m+2\alpha+1} \ge \frac{4}{5} for m + α ≥ 2.

  2. σ ≫ h. Inequality (.10) implies var(f̂X(x)) = O(σ^{2β}n^{−1}h^{−(2β+1)}) (see also Delaigle (2008)).

    The conclusion follows from (.4), (.7), and (.11) by choosing b ~ n^{−1/5} and h \sim \sigma^{\frac{2\beta}{2m+2\alpha+2\beta+1}}\,n^{-1/(2m+2\alpha+2\beta+1)}, which implies \sigma \gg n^{-\frac{1}{2m+2\alpha+1}}.

Proof of Theorem 2

  1. σ = O(h) implies that σ/h is bounded from above by a constant. Hence, formula (.9) implies var(f̂X) = O(1/(nh)) by condition (C1). The rest of the proof is the same as that of part (i) of Theorem 1.

  2. σ ≫ h. Since the random variable Z is supersmooth, there exists Q > 0 such that

\[
|\phi_Z(\sigma t/h)| \ge d_0\,|\sigma t/h|^{\beta_0}\exp\big(-\sigma^\beta|t|^\beta/(\gamma h^\beta)\big), \qquad \text{for } |\sigma t/h| > Q.
\]

    Hence, by assumptions of Theorem 2 and (.9),

\[
\mathrm{var}(\hat f_X) \le \frac{C}{2\pi n h}\left[\int_{|t|\le Qh/\sigma} |\phi_K(t)|^2\,dt + h^{l}\sigma^{-l}\int_{|t|> Qh/\sigma} |\phi_K(t)|^2 \exp\Big(\frac{2\sigma^\beta|t|^\beta}{\gamma h^\beta}\Big)\,|t|^{-2\beta_0}\,dt\right]
\]

    with l = 0 if β_0 ≥ 0 and l = 2β_0 if β_0 < 0. Then var(f̂_X) = O(h^{l−1} n^{−1} σ^{−l} exp(2σ^β/(γh^β))). Combining with (.4), (.7), and (.11), one gets the desired conclusion by taking b ∼ n^{−1/5}, σ = n^{−1/(2(m+α)+1)} a(n), and h = (2/(γD))^{1/β} σ {log a(n)}^{−1/β}, with 1 ≪ a(n) ≪ n^{1/(2(m+α)+1)} and 0 < D < 2m + 2α + 1.
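The particular form of h in part (ii) is chosen so that the exponential factor in the variance bound grows only polynomially in a(n); indeed, with h = (2/(γD))^{1/β} σ {log a(n)}^{−1/β},

```latex
\frac{2\sigma^{\beta}}{\gamma h^{\beta}}
= \frac{2\sigma^{\beta}}{\gamma}\cdot\frac{\gamma D \log a(n)}{2\sigma^{\beta}}
= D \log a(n),
\qquad\text{so}\qquad
\exp\!\left(\frac{2\sigma^{\beta}}{\gamma h^{\beta}}\right) = a(n)^{D},
```

and hence n^{−1} exp(2σ^β/(γh^β)) = n^{−1} a(n)^D → 0 whenever a(n) ≪ n^{1/(2(m+α)+1)} and 0 < D < 2(m+α) + 1.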

Proof of Theorem 3

For all x ∊ ℝ, we consider an estimator of fX(x) as

\[
\hat f_{X,\rho}(x) = \frac{1}{2\pi}\int e^{-itx}\,\phi_K(th)\,\frac{\hat\phi_W(t)\,\overline{\hat\phi_U(t)}}{\max\{|\hat\phi_U(t)|^2,\; n_0^{-1}\}}\,dt.
\]
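To make the definition concrete, a minimal numerical sketch of this ridge estimator follows. The simulation setup (standard normal X, Laplace error, the kernel with φ_K(t) = (1 − t²)³ on [−1, 1], and all sample sizes and bandwidths) is an illustrative assumption, not taken from the paper; φ̂_W and φ̂_U are the empirical characteristic functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_deconv_density(x_grid, W, U_sample, h):
    """Sketch of the ridge deconvolution estimator f_hat_{X,rho}: the unknown
    phi_U is replaced by the empirical characteristic function of a direct
    error sample, and the denominator is ridged by max{|phi_hat_U|^2, 1/n0}
    to avoid division by near-zero values."""
    n0 = len(U_sample)
    t = np.linspace(-1.0 / h, 1.0 / h, 801)      # phi_K(th) vanishes outside [-1/h, 1/h]
    phi_K = (1.0 - (t * h) ** 2) ** 3            # illustrative kernel cf, support [-1, 1]
    phi_W = np.exp(1j * np.outer(t, W)).mean(axis=1)          # empirical cf of W
    phi_U = np.exp(1j * np.outer(t, U_sample)).mean(axis=1)   # empirical cf of U
    weights = phi_K * phi_W * np.conj(phi_U) / np.maximum(np.abs(phi_U) ** 2, 1.0 / n0)
    # f_hat(x) = (1/2pi) * int e^{-itx} weights(t) dt, approximated by a Riemann sum
    dt = t[1] - t[0]
    vals = (np.exp(-1j * np.outer(x_grid, t)) * weights).sum(axis=1) * dt
    return vals.real / (2.0 * np.pi)

# illustrative data: X ~ N(0,1) contaminated by Laplace(0, 0.3) error
n = n0 = 2000
W = rng.normal(0.0, 1.0, n) + rng.laplace(0.0, 0.3, n)
U_sample = rng.laplace(0.0, 0.3, n0)             # direct observations of the error
xs = np.linspace(-4.0, 4.0, 161)
fhat = ridge_deconv_density(xs, W, U_sample, h=0.35)
```

The estimate should roughly recover the N(0,1) shape; with a known error distribution one would replace φ̂_U by the exact characteristic function φ_U.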

The Cauchy–Schwarz inequality and Fubini's theorem imply that

\[
\begin{aligned}
\mathbb{E}\big|\hat f_{X,\rho}(x) - \hat f_X(x)\big|^2 &= \mathbb{E}\left|\frac{1}{2\pi}\int e^{-itx}\,\hat\phi_W(t)\,\frac{\phi_K(th)}{\phi_U(t)}\left(\frac{\overline{\hat\phi_U(t)}\,\phi_U(t)}{\max\{|\hat\phi_U(t)|^2, n_0^{-1}\}} - 1\right)dt\right|^2 \\
&\le \frac{1}{4\pi^2}\left(\int |\phi_K(th)|\,dt\right)\left[\int \frac{|\phi_K(th)|}{|\phi_U(t)|^2}\,\mathbb{E}|\hat\phi_W(t)|^2\,\mathbb{E}\left|\frac{\overline{\hat\phi_U(t)}\,\phi_U(t)}{\max\{|\hat\phi_U(t)|^2, n_0^{-1}\}} - 1\right|^2 dt\right].
\end{aligned}
\]

As shown in Wang and Ye (2012), one has

\[
\mathbb{E}_U\left|\frac{\overline{\hat\phi_U(t)}}{\max\{|\hat\phi_U(t)|^2, n_0^{-1}\}} - \frac{1}{\phi_U(t)}\right|^2 = O\left(\frac{n_0^{-1}}{|\phi_U(t)|^4}\right).
\]

Recall that 𝔼|φ̂_W(t)|² = n^{−1}(1 − |φ_W(t)|²) + |φ_W(t)|², that ∫|φ_K(th)| dt = O(1/h) by condition (B2), and that |φ_W(t)| = |φ_X(t)φ_U(t)| ≤ |φ_U(t)| = |φ_Z(σt)|. One has

\[
I := \mathbb{E}\big|\hat f_{X,\rho}(x) - \hat f_X(x)\big|^2 = O\left(\frac{1}{n_0 h}\int \frac{|\phi_K(th)|}{|\phi_Z(\sigma t)|^2}\,dt\right).
\]

Similar to the proof of inequality (.7), one has,

\[
\mathbb{E}\big|\hat f_{X|W,\rho}(x|w) - f_{X|W}(x|w)\big| = \mathbb{E}\big|\hat f_{X,\rho}\hat\tau - \hat f_{X,\rho}\tau + \hat f_{X,\rho}\tau - f_X\tau\big| = O\Big(\sqrt{\mathbb{E}|\hat\tau-\tau|^2}\Big) + O\Big(\sqrt{\mathbb{E}|\hat f_{X,\rho} - f_X|^2}\Big). \qquad (.12)
\]
  1. σ = O(h). Take h ∼ max{n^{−1/(2m+2α+1)}, n_0^{−1/(2(m+α+1))}}. Similar to the calculation of inequality (.10), I = O(n_0^{−1} h^{−2}). Recall that the proof of (i) in Theorem 1 gives var(f̂_X) = O((nh)^{−1}). Hence, combining with (.8), for all x ∈ ℝ and all f_X ∈ Ψ_{m,α,B}, 𝔼|f̂_X(x) − f_X(x)|² = O(h^{2m+2α}) + O(1/(nh)). By the Cauchy–Schwarz inequality,

\[
\mathbb{E}\big|\hat f_{X,\rho}(x) - f_X(x)\big|^2 \le 2\,\mathbb{E}\big|\hat f_X(x) - f_X(x)\big|^2 + 2\,\mathbb{E}\big|\hat f_{X,\rho}(x) - \hat f_X(x)\big|^2 = O\Big(n^{-\frac{2m+2\alpha}{2m+2\alpha+1}}\Big) + O\Big(n_0^{-\frac{m+\alpha}{m+\alpha+1}}\Big).
\]

    Combining with inequalities (.3) and (.12), one gets the desired result in Theorem 3 by choosing m + α ≥ 2, a ∼ n_0^{−1/5}, and b ∼ n^{−1/5}.

  2. Let σ ≫ h. Take h ∼ max{(σ^{2β} n^{−1})^{1/(2m+2α+2β+1)}, (σ^{2β} n_0^{−1})^{1/(2m+2α+2β+2)}}, which implies σ ≫ max{n^{−1/(2m+2α+1)}, n_0^{−1/(2m+2α+2)}}. Similar to (.10), one has I = O(σ^{2β} n_0^{−1} h^{−(2β+2)}). Recall that the proof of (ii) in Theorem 1 gives var(f̂_X) = O(σ^{2β} n^{−1} h^{−(2β+1)}). Hence, combining with (.8), one has
\[
\mathbb{E}\big|\hat f_X(x) - f_X(x)\big|^2 = O\big(h^{2m+2\alpha}\big) + O\big(\sigma^{2\beta} n^{-1} h^{-(2\beta+1)}\big).
\]

    Therefore, for all x ∊ ℝ and for all fX ∊ Ψm,α,B,

\[
\mathbb{E}\big|\hat f_{X,\rho}(x) - f_X(x)\big|^2 \le O\Big(\sigma^{\frac{4(m+\alpha)\beta}{2(m+\alpha+\beta)+1}}\, n^{-\frac{2(m+\alpha)}{2(m+\alpha+\beta)+1}}\Big) + O\Big(\sigma^{\frac{2(m+\alpha)\beta}{m+\alpha+\beta+1}}\, n_0^{-\frac{m+\alpha}{m+\alpha+\beta+1}}\Big).
\]

    As in (i), the conclusion in (ii) follows from inequalities (.3) and (.12) with a ∼ n_0^{−1/5} and b ∼ n^{−1/5}.

Proof of Theorem 4

  1. σ = O(h) implies that |σ/h| is bounded from above by a constant. Then, by (.9), I = O(n_0^{−1} h^{−2}). The rest of the proof is the same as that of (i) in Theorem 3.

  2. Let σ ≫ h. As in (ii) of Theorem 2, one has I = O(h^{l−2} n_0^{−1} σ^{−l} exp(2σ^β/(γh^β))) and 𝔼|f̂_X(x) − f_X(x)|² ≤ O(h^{2m+2α}) + O(h^{l−1} n_0^{−1} σ^{−l} exp(2σ^β/(γh^β))). By the choice of the bandwidth h, one has, for all x ∈ ℝ and all f_X ∈ Ψ_{m,α,B},

\[
\mathbb{E}\big|\hat f_{X,\rho}(x) - f_X(x)\big|^2 = O\Big(\sigma^{2(m+\alpha)}\,\{\log a(n, n_0)\}^{-2(m+\alpha)/\beta}\Big).
\]

    Combining with inequalities (.3) and (.12), one gets the desired results by choosing a ∼ n_0^{−1/5} and b ∼ n^{−1/5}.

Derivation for the asymptotic dominating term of MIAE(ĝ(x|w))

Let K be an m-th order kernel function with ∫|K(z) z^{m+1}| dz < ∞. Assume that the density function f_X has a continuous, bounded (m + 1)-th order derivative. One can then bound MIAE(ĝ(x|w; h)) from above as follows.

\[
\begin{aligned}
\mathrm{MIAE}\big(\hat g(x|w;h)\big) &= \mathbb{E}\int |f_U(w-x)|\,\big|\hat f_X - \mathbb{E}\hat f_X + \mathbb{E}\hat f_X - f_X\big|\,dx \\
&\le \int \big|f_U(w-x)\,\mathrm{Bias}(\hat f_X)\big|\,dx + \int |f_U(w-x)|\,\mathbb{E}\big|\hat f_X - \mathbb{E}\hat f_X\big|\,dx \\
&\le \int \big|f_U(w-x)\,\mathrm{Bias}(\hat f_X)\big|\,dx + \int |f_U(w-x)|\,\sqrt{\mathrm{var}(\hat f_X)}\,dx.
\end{aligned}
\]

It is known that the bias of f̂_X is

\[
\begin{aligned}
\mathrm{Bias}(\hat f_X) &= \frac{1}{h}\int f_X(y)\,K\big((x-y)/h\big)\,dy - f_X(x) = \int \big[f_X(x-hz) - f_X(x)\big]K(z)\,dz \\
&= \frac{(-1)^m}{m!}\,f_X^{(m)}(x)\int z^m K(z)\,dz\; h^m + O(h^{m+1}) = \frac{(-1)^m}{m!}\,f_X^{(m)}(x)\,R_m(K)\,h^m + O(h^{m+1}),
\end{aligned}
\]

where, by the assumption on f_X, |O(h^{m+1})| ≤ c · h^{m+1} for some constant c > 0. Hence, if ∫|f_U(w−x) f_X^{(m)}(x)| dx exists, then

\[
\int \big|f_U(w-x)\,\mathrm{Bias}(\hat f_X)\big|\,dx = \frac{h^m}{m!}\,|R_m(K)|\int \big|f_U(w-x)\,f_X^{(m)}(x)\big|\,dx + O(h^{m+1}).
\]
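The leading bias term can be checked numerically. For a Gaussian kernel (a second-order kernel, so m = 2 and R₂(K) = ∫z²K(z)dz = 1) and the illustrative choice f_X = N(0,1), the smoothed density ∫f_X(x − hz)K(z)dz is exactly the N(0, 1 + h²) density, so the exact bias is available in closed form and can be compared with (h²/2)f_X″(x):

```python
import numpy as np

def normal_pdf(x, var=1.0):
    """Density of N(0, var)."""
    return np.exp(-x * x / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

x, h = 0.5, 0.05
# Exact bias: convolving N(0,1) with a Gaussian kernel at bandwidth h
# gives N(0, 1 + h^2), so E f_hat_X(x) - f_X(x) is known exactly.
exact_bias = normal_pdf(x, 1.0 + h * h) - normal_pdf(x)
# Leading term of the expansion: (h^2 / 2!) * f_X''(x) * R_2(K), with
# R_2(K) = 1 and f_X''(x) = (x^2 - 1) * phi(x) for the standard normal.
leading = 0.5 * h * h * (x * x - 1.0) * normal_pdf(x)
```

The ratio exact_bias / leading is close to 1 and approaches 1 as h → 0, in line with the higher-order remainder in the expansion.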

For the variance, one has

\[
\begin{aligned}
\mathrm{var}(\hat f_X) &= \frac{1}{4\pi^2 h^2 n}\,\mathbb{E}\left|\int \exp\Big(it\,\frac{x-Y}{h}\Big)\frac{\phi_K(t)}{\phi_U(t/h)}\,dt\right|^2 - \frac{1}{n}\big(\mathbb{E}\hat f_X\big)^2 \\
&\le \frac{\|f_X\|_\infty}{2\pi h n}\int \frac{|\phi_K(t)|^2}{|\phi_U(t/h)|^2}\,dt + o(1/n),
\end{aligned}
\]

where ‖f_X‖_∞ denotes the supremum norm of f_X (see the last formula on page 1266 of Fan (1991b)). Therefore,

\[
\int |f_U(w-x)|\,\sqrt{\mathrm{var}(\hat f_X)}\,dx \le \sqrt{\frac{\|f_X\|_\infty}{2\pi h n}\int \frac{|\phi_K(t)|^2}{|\phi_U(t/h)|^2}\,dt} + o\big(n^{-1/2}\big).
\]

That is,

\[
\mathrm{MIAE}\big(\hat g(x|w;h)\big) \le \frac{h^m\,|R_m(K)|}{m!}\int \big|f_U(w-x)\,f_X^{(m)}(x)\big|\,dx + O(h^{m+1}) + \sqrt{\frac{\|f_X\|_\infty}{2\pi h n}\int \frac{|\phi_K(t)|^2}{|\phi_U(t/h)|^2}\,dt} + o\big(n^{-1/2}\big).
\]
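This dominating term suggests a plug-in bandwidth: minimize the sum of the bias and variance terms over h. The sketch below does so numerically for a Laplace error and the kernel with φ_K(t) = (1 − t²)³ on [−1, 1]; the constants C1 = ∫|f_U(w−x) f_X^{(m)}(x)| dx and C2 = ‖f_X‖_∞, as well as m, n, and the error scale sigma0, are illustrative assumptions rather than values from the paper.

```python
import math
import numpy as np

def miae_bound(h, n, sigma0=0.3, C1=1.0, C2=0.4, m=2):
    """Asymptotic MIAE upper bound at bandwidth h: bias term
    h^m |R_m(K)| C1 / m!  plus the variance term
    sqrt(C2 / (2 pi h n) * int |phi_K(t)|^2 / |phi_U(t/h)|^2 dt),
    for phi_K(t) = (1 - t^2)^3 on [-1, 1] and Laplace(sigma0) error,
    whose characteristic function is phi_U(s) = 1 / (1 + sigma0^2 s^2)."""
    t = np.linspace(-1.0, 1.0, 2001)
    dt = t[1] - t[0]
    # |phi_U(t/h)|^{-2} = (1 + sigma0^2 t^2 / h^2)^2 for the Laplace error
    var_int = np.sum((1.0 - t**2) ** 6 * (1.0 + (sigma0 * t / h) ** 2) ** 2) * dt
    R_m = 6.0  # int z^2 K(z) dz = -phi_K''(0) = 6 for this kernel
    bias = (h ** m) * R_m * C1 / math.factorial(m)
    var = np.sqrt(C2 * var_int / (2.0 * np.pi * h * n))
    return bias + var

hs = np.linspace(0.05, 1.0, 96)
bounds = np.array([miae_bound(h, n=2000) for h in hs])
h_star = float(hs[int(np.argmin(bounds))])   # minimizer of the asymptotic bound
```

The minimizer lies strictly inside the grid: very small h inflates the variance term (the deconvolution factor |φ_U(t/h)|⁻² blows up), while large h inflates the bias term.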


References

  1. Bashtannyk D, Hyndman R. Bandwidth selection for kernel conditional density estimation. Computational Statistics & Data Analysis. 2001;36(3):279–298.
  2. Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A. 1988;50(3):381–393.
  3. Carroll RJ, Hall P. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association. 1988;83:1184–1186.
  4. Comte F, Lacour C. Data-driven density estimation in the presence of additive noise with unknown distribution. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2011;73(4):601–627.
  5. De Gooijer J, Zerom D. On conditional density estimation. Statistica Neerlandica. 2003;57(2):159–176.
  6. Delaigle A. An alternative view of the deconvolution problem. Statistica Sinica. 2008;18(3):1025–1045.
  7. Delaigle A, Gijbels I. Practical bandwidth selection in deconvolution kernel density estimation. Computational Statistics & Data Analysis. 2004;45(2):249–267.
  8. Delaigle A, Meister A. Density estimation with heteroscedastic error. Bernoulli. 2008;14(2):562–579.
  9. Delaigle A, Meister A. Nonparametric function estimation under Fourier-oscillating noise. Statistica Sinica. 2011;21:1065–1092.
  10. Devroye L. A note on the L1 consistency of variable kernel estimates. Annals of Statistics. 1985;13(3):1041–1049.
  11. Diggle PJ, Hall P. A Fourier approach to nonparametric deconvolution of a density estimate. Journal of the Royal Statistical Society Series B (Methodological). 1993;55(2):523–531.
  12. Ding LH, Xie Y, Park S, Xiao G, Story MD. Enhanced identification and biological validation of differential gene expression via Illumina whole-genome expression arrays through the use of the model-based background correction methodology. Nucleic Acids Research. 2008;36(10):e58. doi:10.1093/nar/gkn234.
  13. Efromovich S. Conditional density estimation in a regression setting. Annals of Statistics. 2007;35(6):2504–2535.
  14. Fan J. Asymptotic normality for deconvolution kernel density estimators. Sankhyā: The Indian Journal of Statistics, Series A. 1991a;53(1):97–110.
  15. Fan J. On the optimal rates of convergence for nonparametric deconvolution problems. Annals of Statistics. 1991b;19(3):1257–1272.
  16. Fan J. Deconvolution with supersmooth distributions. Canadian Journal of Statistics. 1992;20(2):155–169.
  17. Fan J, Koo J. Wavelet deconvolution. IEEE Transactions on Information Theory. 2002;48(3):734–747.
  18. Fan J, Yim T. A crossvalidation method for estimating conditional densities. Biometrika. 2004;91(4):819–834.
  19. Hall P, Maiti T. Deconvolution methods for non-parametric inference in two-level mixed models. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2009;71:703–718.
  20. Hall P, Racine J, Li Q. Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association. 2004;99(468):1015–1026.
  21. Hall P, Wand MP. Minimizing L1 distance in nonparametric density estimation. Journal of Multivariate Analysis. 1988a;26(1):59–88.
  22. Hall P, Wand MP. On the minimization of absolute distance in kernel density estimation. Statistics & Probability Letters. 1988b;6(5):311–314.
  23. Hyndman RJ, Bashtannyk DM, Grunwald GK. Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics. 1996;5(4):315–336.
  24. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–264. doi:10.1093/biostatistics/4.2.249.
  25. Johannes J. Deconvolution with unknown error distribution. Annals of Statistics. 2009;37(5A):2301–2323.
  26. McIntyre J, Stefanski LA. Density estimation with replicate heteroscedastic measurements. Annals of the Institute of Statistical Mathematics. 2011;63(1):81–99. doi:10.1007/s10463-009-0220-x.
  27. Meister A. Deconvolution Problems in Nonparametric Statistics. Lecture Notes in Statistics. Springer; New York: 2009.
  28. Neumann MH. On the effect of estimating the error density in nonparametric deconvolution. Journal of Nonparametric Statistics. 1997;7:307–330.
  29. Port SC. Theoretical Probability for Applications. John Wiley; New York: 1994.
  30. Silver JD, Ritchie ME, Smyth GK. Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. 2009;10(2):352–363. doi:10.1093/biostatistics/kxn042.
  31. Stefanski LA, Carroll RJ. Deconvoluting kernel density estimators. Statistics. 1990;21:169–184.
  32. Wang XF, Fan Z, Wang B. Estimating smooth distribution function in the presence of heterogeneous measurement errors. Computational Statistics and Data Analysis. 2010:25–36. doi:10.1016/j.csda.2009.08.012.
  33. Wang XF, Wang B. Deconvolution estimation in measurement error models: The R package decon. Journal of Statistical Software. 2011;39(10):1–24.
  34. Wang XF, Ye D. The effects of error magnitude and bandwidth selection for deconvolution with unknown error distribution. Journal of Nonparametric Statistics. 2012;24(1):153–167. doi:10.1080/10485252.2011.647024.
