Author manuscript; available in PMC: 2018 Apr 4.
Published in final edited form as: Electron J Stat. 2009 Dec 4;3:1305–1321. doi: 10.1214/09-EJS479

Semiparametric Minimax Rates

James Robins 1, Eric Tchetgen Tchetgen 2, Lingling Li 3, Aad van der Vaart 4
PMCID: PMC5884174  NIHMSID: NIHMS879384  PMID: 29629001

Abstract

We consider the minimax rate of testing (or estimation) of non-linear functionals defined on semiparametric models. Existing methods appear incapable of determining a lower bound on the minimax rate of testing (or estimation) for certain functionals of interest, in particular when the semiparametric model is indexed by several infinite-dimensional parameters. To cover these examples we extend the approach of [1], which is based on comparing a “true distribution” to a convex mixture of perturbed distributions, to a comparison of two convex mixtures. The first mixture is obtained by perturbing a first parameter of the model, and the second by perturbing in addition a second parameter. We apply the new result to two examples of semiparametric functionals: the estimation of a mean response when response data are missing at random, and the estimation of an expected conditional covariance functional.

AMS 2000 subject classifications: Primary 62G05, 62G20, 62F25

Keywords and phrases: Nonlinear functional, nonparametric estimation, Hellinger distance

1. Introduction

Let X1, X2, …, Xn be a random sample from a density p relative to a measure ν on a sample space (𝒳, 𝒜). It is known that p belongs to a collection 𝒫 of densities, and we wish to estimate the value χ(p) of a functional χ: 𝒫 → ℝ. In this setting the minimax rate of estimation of χ(p) relative to squared error loss can be defined as the root of

\[
\inf_{T_n}\,\sup_{p\in\mathcal P}\,E_p\bigl|T_n-\chi(p)\bigr|^2,
\]

where the infimum is taken over all estimators Tn = Tn(X1, …, Xn). Determination of a minimax rate in a particular problem often consists of proving a “lower bound”, showing that the mean square error of no estimator tends to zero faster than some rate ε_n², combined with the explicit construction of an estimator whose mean square error is of the order ε_n².

The lower bound is often proved by a testing argument, which tries to separate two subsets of the set {P^n: p ∈ 𝒫} of possible distributions of the observation (X1, …, Xn). Even though testing is a statistically easier problem than estimation under quadratic loss, the corresponding minimax rates are often of the same order. The testing argument can be formulated as follows. If Pn and Qn are in the convex hulls of the sets {P^n: p ∈ 𝒫, χ(p) ≤ 0} and {P^n: p ∈ 𝒫, χ(p) ≥ ε_n}, respectively, and there exists no sequence of tests of Pn versus Qn with both error probabilities tending to zero, then the minimax rate is not faster than a multiple of ε_n. Here existence of a sequence of tests with errors tending to zero (a perfect sequence of tests) is determined by the asymptotic separation of the sequences Pn and Qn and can be described, for instance, in terms of the Hellinger affinity

\[
\rho(P_n,Q_n)=\int\sqrt{dP_n\,dQ_n}.
\]

If ρ(Pn, Qn) is bounded away from zero as n → ∞, then no perfect sequence of tests exists (see e.g. Section 14.5 in [2]).
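For intuition, the affinity of two product measures factorizes, ρ(Pⁿ, Qⁿ) = ρ(P, Q)ⁿ, so it stays bounded away from zero only if 1 − ρ(P, Q) = O(1/n). A minimal numerical sketch for discrete distributions (the Bernoulli example and the value n = 1000 below are illustrative choices, not taken from the paper):

```python
import math

def affinity(p, q):
    """Hellinger affinity rho(P, Q) = sum_x sqrt(p(x) q(x)) for discrete P, Q."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

# Two nearby Bernoulli laws: P = Bern(1/2), Q = Bern(1/2 + eps).
eps = 0.01
P = [0.5, 0.5]
Q = [0.5 - eps, 0.5 + eps]

rho = affinity(P, Q)   # close to 1; in fact 1 - rho is of order eps^2
rho_n = rho ** 1000    # affinity of the n-fold product measures, n = 1000
```

With eps of order n^(−1/2), the n-fold affinity stays bounded away from zero, which is exactly the regime exploited in testing lower bounds.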

One difficulty in applying this simple argument is that the relevant (least favorable) two sequences of measures Pn and Qn need not be product measures, but can be arbitrary convex combinations of product measures. In particular, it appears that for nonlinear functionals at least one of the two sequences must be a true mixture. This complicates the computation of the affinity ρ(Pn, Qn) considerably. [1] derived an elegant lower bound on the affinity when Pn is a product measure and Qn a convex mixture of product measures, and used it to determine the testing rate for functionals of the type ∫ f(p) dν, for a given smooth function f: ℝ → ℝ, the function f(x) = x² being the crucial example.

In this paper we are interested in structured models 𝒫 that are indexed by several subparameters and where the functional is defined in terms of the subparameters. It appears that testing a product versus a mixture is often not least favorable in this situation, but testing two mixtures is. Thus we extend the bound of [1] to the case that both Pn and Qn are mixtures. In our examples Pn is equal to a convex mixture obtained by perturbing a first parameter of the model, and Qn is obtained by perturbing in addition a second parameter. We also refine the bound in other, less essential directions.

The main general results of the paper are given in Section 2. In Section 3 we apply these results to two examples of interest.

2. Main result

For k ∈ ℕ let 𝒳 = ∪_{j=1}^k 𝒳_j be a measurable partition of the sample space. Given a vector λ = (λ1, …, λk) in some product measurable space Λ = Λ1 × ⋯ × Λk let Pλ and Qλ be probability measures on 𝒳 such that

  1. Pλ(𝒳j) = Qλ(𝒳j) = pj for every λ ∈ Λ, for some probability vector (p1, …, pk).

  2. The restrictions of Pλ and Qλ to 𝒳j depend on the jth coordinate λj of λ = (λ1, …, λk) only.

For pλ and qλ densities of the measures Pλ and Qλ that are jointly measurable in the parameter λ and the observation, and π a probability measure on Λ, define p = ∫ pλ dπ(λ) and q = ∫ qλ dπ(λ), and set

\[
a=\max_j\sup_\lambda\int_{\mathcal X_j}\frac{(p_\lambda-p)^2}{p_\lambda}\,\frac{d\nu}{p_j},
\qquad
b=\max_j\sup_\lambda\int_{\mathcal X_j}\frac{(q_\lambda-p_\lambda)^2}{p_\lambda}\,\frac{d\nu}{p_j},
\qquad
d=\max_j\sup_\lambda\int_{\mathcal X_j}\frac{(q-p)^2}{p_\lambda}\,\frac{d\nu}{p_j}.
\]

Theorem 2.1

If np_j(1 ∨ a ∨ b) ≤ A for all j, and B̲ ≤ pλ/p ≤ B̄ for all λ, for positive constants A, B̲, B̄, then there exists a constant C that depends only on A, B̲, B̄ such that, for any product probability measure π = π1 ⊗ ⋯ ⊗ πk,

\[
\rho\Bigl(\int P_\lambda^n\,d\pi(\lambda),\,\int Q_\lambda^n\,d\pi(\lambda)\Bigr)\ge 1-Cn^2\bigl(\max_j p_j\bigr)\bigl(b^2+ab\bigr)-Cnd.
\]

Proof

The numbers a, b and d are the maxima over j of the numbers a, b and d defined in Lemma 2.2, but with the measures Pλ and Qλ replaced there by the measures P_{j,λ_j} and Q_{j,λ_j} given in (2.1). Define a number c similarly as

\[
c=\max_j\sup_\lambda\int_{\mathcal X_j}\frac{p^2}{p_\lambda}\,\frac{d\nu}{p_j}.
\]

Under the assumptions of the theorem c is bounded above by 1/B̲.

By applying Lemma 2.1 and next Lemma 2.2 we see that the left side is at least

\[
E\prod_{j=1}^k\Bigl(1-\tfrac14\sum_{r=2}^{N_j}\binom{N_j}{r}b^r-\tfrac12 N_j^2\sum_{r=1}^{N_j-1}\binom{N_j-1}{r}a^rb-\tfrac12 N_j^2c^{N_j-1}d\Bigr)
\ge 1-E\sum_{j=1}^k\Bigl(\tfrac14\sum_{r=2}^{N_j}\binom{N_j}{r}b^r+\tfrac12 N_j^2\sum_{r=1}^{N_j-1}\binom{N_j-1}{r}a^rb+\tfrac12 N_j^2c^{N_j-1}d\Bigr),
\]

because ∏_{j=1}^k (1 − a_j) ≥ 1 − Σ_{j=1}^k a_j for any nonnegative numbers a_1, …, a_k. The expected values over the binomial variables N_j can be evaluated explicitly, using the following identities, for N a binomial variable with parameters n and p:

\[
E\sum_{r=2}^{N}\binom{N}{r}b^r=E\bigl((1+b)^N-1-Nb\bigr)=(1+bp)^n-1-npb,
\]
\[
EN^2c^{N-1}=np\,(cp+1-p)^{n-2}(cnp+1-p),
\]
\[
EN^2\sum_{r=1}^{N-1}\binom{N-1}{r}a^r=EN^2\bigl((1+a)^{N-1}-1\bigr)=np\,(1+ap)^{n-2}(1+nap+np-p)-np(1-p)-n^2p^2.
\]

Under the assumption that np(1 ∨ a ∨ b ∨ c) ≲ 1, the right sides of these expressions can be seen to be bounded by multiples of (npb)², np and (np)²a, respectively. We substitute these bounds (with p = p_j) in the first display of the proof, and use the equality Σ_j p_j = 1 to complete the proof.
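The three binomial identities can be verified numerically by exact summation against the Binomial(n, p) mass function; a quick sketch (the particular values of n, p, a, b, c are arbitrary):

```python
import math

def binom_pmf(n, p, m):
    return math.comb(n, m) * p**m * (1 - p) ** (n - m)

def expect(n, p, f):
    """E f(N) for N ~ Binomial(n, p), computed by exact summation."""
    return sum(f(m) * binom_pmf(n, p, m) for m in range(n + 1))

n, p, a, b, c = 12, 0.3, 0.2, 0.4, 0.7

# E sum_{r>=2} C(N,r) b^r = (1 + bp)^n - 1 - npb
lhs1 = expect(n, p, lambda m: sum(math.comb(m, r) * b**r for r in range(2, m + 1)))
rhs1 = (1 + b * p) ** n - 1 - n * p * b

# E N^2 c^{N-1} = np (cp + 1 - p)^{n-2} (cnp + 1 - p)
lhs2 = expect(n, p, lambda m: m**2 * c ** (m - 1))
rhs2 = n * p * (c * p + 1 - p) ** (n - 2) * (c * n * p + 1 - p)

# E N^2 sum_{r=1}^{N-1} C(N-1,r) a^r
#     = np (1+ap)^{n-2} (1 + nap + np - p) - np(1-p) - n^2 p^2
lhs3 = expect(n, p, lambda m: m**2 * sum(math.comb(m - 1, r) * a**r for r in range(1, m)))
rhs3 = (n * p * (1 + a * p) ** (n - 2) * (1 + n * a * p + n * p - p)
        - n * p * (1 - p) - n**2 * p**2)
```

All three follow from the binomial generating function E t^N = (1 − p + pt)^n and its first two derivatives evaluated at t = 1 + b, t = c and t = 1 + a.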

Remark 2.1

If min_j p_j ∼ max_j p_j ∼ 1/n^{1+ε} for some ε > 0, which arises for equiprobable partitions in k ∼ n^{1+ε} sets, then there exists a number n_0 such that P(max_j N_j > n_0) → 0. (Indeed, the probability is bounded by k(n max_j p_j)^{n_0+1}.) Under this slightly stronger assumption the computations need only address N_j ≤ n_0 and hence can be simplified.
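The parenthetical bound follows from a union bound over the k cells together with P(N_j ≥ m) ≤ C(n, m) p_j^m ≤ (np_j)^m. A small numerical check of this chain, with exact binomial tails (the values n = 50, ε = 0.5, n_0 = 3 are arbitrary):

```python
import math

def binom_tail(n, p, m):
    """P(N >= m) for N ~ Binomial(n, p), by exact summation."""
    return sum(math.comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(m, n + 1))

# Equiprobable partition into k ~ n^{1+eps} sets, so p_j ~ 1/n^{1+eps}.
n, eps, n0 = 50, 0.5, 3
k = round(n ** (1 + eps))
p_j = 1.0 / k

# P(max_j N_j > n0) <= k P(N_j >= n0+1) <= k C(n, n0+1) p_j^{n0+1} <= k (n p_j)^{n0+1}.
union_bound = k * binom_tail(n, p_j, n0 + 1)
crude_bound = k * (n * p_j) ** (n0 + 1)
```

Since n·max_j p_j → 0 under the stated assumption, k(n max_j p_j)^{n_0+1} → 0 as soon as n_0 + 1 > (1 + ε)/ε.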

The proof of Theorem 2.1 is based on two lemmas. The first lemma factorizes the affinity into the affinities of the restrictions to the partitioning sets, which are next lower bounded using the second lemma. The reduction to the partitioning sets is useful, because it reduces the n-fold products to lower order products for which the second lemma is accurate.

Define probability measures P_{j,λ_j} and Q_{j,λ_j} on 𝒳_j by

\[
dP_{j,\lambda_j}=\frac{1_{\mathcal X_j}\,dP_\lambda}{p_j},\qquad dQ_{j,\lambda_j}=\frac{1_{\mathcal X_j}\,dQ_\lambda}{p_j}. \tag{2.1}
\]

Lemma 2.1

For any product probability measure π = π1 ⊗ ⋯ ⊗ πk on Λ and every n ∈ ℕ,

\[
\rho\Bigl(\int P_\lambda^n\,d\pi(\lambda),\,\int Q_\lambda^n\,d\pi(\lambda)\Bigr)=E\prod_{j=1}^k\rho_j(N_j),
\]

where (N1, …, Nk) is multinomially distributed on n trials with success probability vector (p1, …, pk) and ρj : {0, …, n} → [0, 1] is defined by ρj(0) = 1 and

\[
\rho_j(m)=\rho\Bigl(\int P_{j,\lambda_j}^m\,d\pi_j(\lambda_j),\,\int Q_{j,\lambda_j}^m\,d\pi_j(\lambda_j)\Bigr),\qquad m\ge1.
\]

Proof

Set P̄^n = ∫ Pλ^n dπ(λ) and consider this as the distribution of the vector (X1, …, Xn). Then, for pλ and qλ densities of Pλ and Qλ relative to some dominating measure, the left side of the lemma can be written as

\[
\rho\Bigl(\int P_\lambda^n\,d\pi(\lambda),\,\int Q_\lambda^n\,d\pi(\lambda)\Bigr)
=E_{\bar P^n}\sqrt{\frac{\int\prod_{j=1}^k\prod_{i:X_i\in\mathcal X_j}q_\lambda(X_i)\,d\pi(\lambda)}{\int\prod_{j=1}^k\prod_{i:X_i\in\mathcal X_j}p_\lambda(X_i)\,d\pi(\lambda)}}.
\]

Because by assumption on each partitioning set 𝒳j the measures Qλ and Pλ depend on λj only, the expressions ∏_{i:Xi∈𝒳j} qλ(Xi) and ∏_{i:Xi∈𝒳j} pλ(Xi) depend on λ only through λj. In fact, within the quotient on the right side of the preceding display, they can be replaced by ∏_{i:Xi∈𝒳j} q_{j,λ_j}(Xi) and ∏_{i:Xi∈𝒳j} p_{j,λ_j}(Xi), for q_{j,λ_j} and p_{j,λ_j} densities of the measures Q_{j,λ_j} and P_{j,λ_j} (the norming constants p_j cancel from numerator and denominator). Because π is a product measure, we can next use Fubini’s theorem and rewrite the resulting expression as

\[
E_{\bar P^n}\sqrt{\frac{\prod_{j=1}^k\int\prod_{i:X_i\in\mathcal X_j}q_{j,\lambda_j}(X_i)\,d\pi_j(\lambda_j)}{\prod_{j=1}^k\int\prod_{i:X_i\in\mathcal X_j}p_{j,\lambda_j}(X_i)\,d\pi_j(\lambda_j)}}.
\]

Here the two products over j can be pulled out of the square root and replaced by a single product preceding it. A product over an empty set (if there is no Xi ∈ 𝒳j) is interpreted as 1.

Define variables I1, …, In that indicate the partitioning sets that contain the observations: Ii = j if Xi ∈ 𝒳j for every i and j, and let Nj = #{1 ≤ i ≤ n: Ii = j} be the number of Xi falling in 𝒳j.

The measure P̄^n arises as the distribution of (X1, …, Xn) if this vector is generated in two steps. First λ is chosen from π, and next, given this λ, the variables X1, …, Xn are generated independently from Pλ. Then given λ the vector (N1, …, Nk) is multinomially distributed on n trials and probability vector (Pλ(𝒳1), …, Pλ(𝒳k)). Because the latter vector is independent of λ and equal to (p1, …, pk) by assumption, the vector (N1, …, Nk) is stochastically independent of λ and hence also unconditionally, under P̄^n, multinomially distributed with parameters n and (p1, …, pk). Similarly, given λ the variables I1, …, In are independent and the event Ii = j has probability Pλ(𝒳j), which is independent of λ by assumption. It follows that the random elements (I1, …, In) and λ are stochastically independent under P̄^n.

The conditional distribution of X1, …, Xn given λ and I1, …, In can be described as: for each partitioning set 𝒳j generate Nj variables independently from Pλ restricted and renormalized to 𝒳j, i.e. from the measure P_{j,λ_j}; do so independently across the partitioning sets; and attach correct labels {1, …, n} consistent with I1, …, In to the n realizations obtained. The conditional distribution under P̄^n of X1, …, Xn given (I1, …, In) is the mixture of this distribution relative to the conditional distribution of λ given (I1, …, In), which was seen to be the unconditional distribution π. Thus we obtain a sample from the conditional distribution under P̄^n of (X1, …, Xn) given (I1, …, In) by generating for each partitioning set 𝒳j a set of Nj variables from the measure ∫ P_{j,λ_j}^{N_j} dπ_j(λ_j), independently across the partitioning sets, and next attaching labels consistent with I1, …, In.

Now rewrite the right side of the last display by conditioning on I1, …, In as

\[
E_{\bar P^n}\,E_{\bar P^n}\Bigl[\prod_{j=1}^k\sqrt{\frac{\int\prod_{i:I_i=j}q_{j,\lambda_j}(X_i)\,d\pi_j(\lambda_j)}{\int\prod_{i:I_i=j}p_{j,\lambda_j}(X_i)\,d\pi_j(\lambda_j)}}\;\Big|\;I_1,\ldots,I_n\Bigr].
\]

The product over j can be pulled out of the conditional expectation by the conditional independence across the partitioning sets. The resulting expression can be seen to be of the form as claimed in the lemma.

The second lemma does not use the partitioning structure, but is valid for mixtures of products of arbitrary measures on a measurable space. For λ in a measurable space Λ let Pλ and Qλ be probability measures on a given sample space (𝒳, 𝒜), with densities pλ and qλ relative to a given dominating measure ν, which are jointly measurable. For a given (arbitrary) density p define functions ℓλ = qλpλ and κλ = pλp, and set

\[
a=\sup_{\lambda\in\Lambda}\int\frac{\kappa_\lambda^2}{p_\lambda}\,d\nu,\qquad
b=\sup_{\lambda\in\Lambda}\int\frac{\ell_\lambda^2}{p_\lambda}\,d\nu,\qquad
c=\sup_{\lambda\in\Lambda}\int\frac{p^2}{p_\lambda}\,d\nu,\qquad
d=\sup_{\lambda\in\Lambda}\int\frac{\bigl(\int\ell_\mu\,d\pi(\mu)\bigr)^2}{p_\lambda}\,d\nu.
\]

Lemma 2.2

For any probability measure π on Λ and every n ∈ ℕ,

\[
\rho\Bigl(\int P_\lambda^n\,d\pi(\lambda),\,\int Q_\lambda^n\,d\pi(\lambda)\Bigr)\ge 1-\tfrac14\sum_{r=2}^n\binom{n}{r}b^r-\tfrac12 n^2\sum_{r=1}^{n-1}\binom{n-1}{r}a^rb-\tfrac12 n^2c^{n-1}d.
\]

Proof

Consider the measure P̄^n = ∫ Pλ^n dπ(λ), which has density p̄^n(x^n) = ∫ ∏_{i=1}^n pλ(x_i) dπ(λ) relative to ν^n, as the distribution of (X1, …, Xn). Using the inequality E√(1 + Y) ≥ 1 − EY²/8, valid for any random variable Y with 1 + Y ≥ 0 and EY = 0 (see for example [1]), we see that

\[
\rho\Bigl(\int P_\lambda^n\,d\pi(\lambda),\,\int Q_\lambda^n\,d\pi(\lambda)\Bigr)
=E_{\bar P^n}\sqrt{1+\frac{\int\bigl[\prod_{i=1}^nq_\lambda(X_i)-\prod_{i=1}^np_\lambda(X_i)\bigr]\,d\pi(\lambda)}{\bar p^n(X_1,\ldots,X_n)}}
\ge 1-\frac18\,E_{\bar P^n}\frac{\Bigl(\int\bigl[\prod_{i=1}^nq_\lambda(X_i)-\prod_{i=1}^np_\lambda(X_i)\bigr]\,d\pi(\lambda)\Bigr)^2}{\bar p^n(X_1,\ldots,X_n)^2}. \tag{2.2}
\]

It suffices to upper bound the expected value on the right side. To this end we expand the difference ∏_{i=1}^n qλ(Xi) − ∏_{i=1}^n pλ(Xi) as Σ_{|I|≥1} ∏_{i∈I^c} pλ(Xi) ∏_{i∈I} ℓλ(Xi), where the sum ranges over all nonempty subsets I ⊂ {1, …, n}. We split this sum into two parts, consisting of the terms indexed by subsets of size 1 and the subsets that contain at least 2 elements, and separate the square of the sum of these two parts by the inequality (A + B)² ≤ 2A² + 2B².

If n = 1, then there are no subsets with at least two elements and the second part is empty. Otherwise the sum over subsets with at least two elements contributes two times

\[
\int\frac{\Bigl(\int\sum_{|I|\ge2}\prod_{i\in I^c}p_\lambda(x_i)\prod_{i\in I}\ell_\lambda(x_i)\,d\pi(\lambda)\Bigr)^2}{\int\prod_ip_\lambda(x_i)\,d\pi(\lambda)}\,d\nu^n(x^n)
\le\iint\Bigl(\sum_{|I|\ge2}\prod_{i\in I^c}\sqrt{p_\lambda}(x_i)\prod_{i\in I}\frac{\ell_\lambda}{\sqrt{p_\lambda}}(x_i)\Bigr)^2d\pi(\lambda)\,d\nu^n(x^n)
=\iint\sum_{|I|\ge2}\prod_{i\in I^c}p_\lambda(x_i)\prod_{i\in I}\frac{\ell_\lambda^2}{p_\lambda}(x_i)\,d\pi(\lambda)\,d\nu^n(x^n).
\]

To derive the first inequality we use the inequality (EU)²/EV ≤ E(U²/V), valid for any random variables U and V ≥ 0, which can be derived from the Cauchy–Schwarz or Jensen inequality. The last step follows by writing the square of the sum as a double sum and noting that all off-diagonal terms vanish, as they contain at least one factor ℓλ(x_i) and ∫ ℓλ dν = 0. The order of integration on the right side can be exchanged, and next the integral relative to ν^n can be factorized, where the integrals ∫ pλ dν are equal to 1. This yields the contribution 2 Σ_{|I|≥2} b^{|I|} to the bound on the expectation in (2.2).
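The inequality (EU)²/EV ≤ E(U²/V) is Cauchy–Schwarz in disguise: EU = E[(U/√V)·√V] ≤ √(E[U²/V]·EV). A quick randomized sanity check on finite sample spaces (a sketch, with uniform weights standing in for a general distribution):

```python
import random

def check_once(rng):
    """Sample U and V > 0 on a finite space; test (EU)^2 / EV <= E(U^2 / V)."""
    m = rng.randint(1, 10)
    U = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    V = [rng.uniform(0.1, 2.0) for _ in range(m)]
    EU = sum(U) / m
    EV = sum(V) / m
    E_U2_over_V = sum(u * u / v for u, v in zip(U, V)) / m
    return EU**2 / EV <= E_U2_over_V + 1e-12  # small slack for rounding

rng = random.Random(0)
all_hold = all(check_once(rng) for _ in range(1000))
```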

The sum over sets with exactly one element contributes two times

\[
\int\frac{\Bigl(\int\sum_{j=1}^n\prod_{i\ne j}p_\lambda(x_i)\,\ell_\lambda(x_j)\,d\pi(\lambda)\Bigr)^2}{\int\prod_ip_\lambda(x_i)\,d\pi(\lambda)}\,d\nu^n(x^n). \tag{2.3}
\]

Here we expand

\[
\prod_{i\ne j}p_\lambda(x_i)-\prod_{i\ne j}p(x_i)=\prod_{i\ne j}p_\lambda(x_i)-\prod_{i\ne j}(p_\lambda-\kappa_\lambda)(x_i)
=-\sum_{|I|\ge1,\,j\notin I}\prod_{i\in I^c}p_\lambda(x_i)\prod_{i\in I}(-\kappa_\lambda)(x_i),
\]

where the sum is over all nonempty subsets I ⊂ {1, …, n} that do not contain j, and I^c denotes the complement of I within {1, …, n}\{j}. Replacement of ∏_{i≠j} pλ(x_i) by ∏_{i≠j} p(x_i) changes (2.3) into

\[
\int\frac{\Bigl(\int\sum_{j=1}^n\prod_{i\ne j}p(x_i)\,\ell_\lambda(x_j)\,d\pi(\lambda)\Bigr)^2}{\int\prod_ip_\lambda(x_i)\,d\pi(\lambda)}\,d\nu^n(x^n)
\le n\sum_{j=1}^n\int\frac{\prod_{i\ne j}p^2(x_i)\Bigl(\int\ell_\lambda(x_j)\,d\pi(\lambda)\Bigr)^2}{\int\prod_ip_\lambda(x_i)\,d\pi(\lambda)}\,d\nu^n(x^n)
\le n\sum_{j=1}^n\iint\prod_{i\ne j}\frac{p^2}{p_\mu}(x_i)\,\frac{\bigl(\int\ell_\lambda\,d\pi(\lambda)\bigr)^2}{p_\mu}(x_j)\,d\pi(\mu)\,d\nu^n(x^n).
\]

In the last step we use that 1/EV ≤ E(1/V) for any positive random variable V. The integral with respect to ν^n on the right side can be factorized, and the expression bounded by n²c^{n−1}d. Four times this must be added to the bound on the expectation in (2.2).

Finally the remainder after substituting ∏_{i≠j} p(x_i) for ∏_{i≠j} pλ(x_i) in (2.3) contributes

\[
\int\frac{\Bigl(\int\sum_{j=1}^n\sum_{|I|\ge1,\,j\notin I}\prod_{i\in I^c}p_\lambda(x_i)\prod_{i\in I}(-\kappa_\lambda)(x_i)\,\ell_\lambda(x_j)\,d\pi(\lambda)\Bigr)^2}{\int\prod_ip_\lambda(x_i)\,d\pi(\lambda)}\,d\nu^n(x^n)
\le\iint\Bigl(\sum_{j=1}^n\sum_{|I|\ge1,\,j\notin I}\prod_{i\in I^c}\sqrt{p_\lambda}(x_i)\prod_{i\in I}\frac{\kappa_\lambda}{\sqrt{p_\lambda}}(x_i)\,\frac{\ell_\lambda}{\sqrt{p_\lambda}}(x_j)\Bigr)^2d\pi(\lambda)\,d\nu^n(x^n)
\]
\[
\le n\sum_{j=1}^n\iint\Bigl(\sum_{|I|\ge1,\,j\notin I}\prod_{i\in I^c}\sqrt{p_\lambda}(x_i)\prod_{i\in I}\frac{\kappa_\lambda}{\sqrt{p_\lambda}}(x_i)\Bigr)^2\frac{\ell_\lambda^2}{p_\lambda}(x_j)\,d\pi(\lambda)\,d\nu^n(x^n)
=n\sum_{j=1}^n\iint\sum_{|I|\ge1,\,j\notin I}\prod_{i\in I^c}p_\lambda(x_i)\prod_{i\in I}\frac{\kappa_\lambda^2}{p_\lambda}(x_i)\,\frac{\ell_\lambda^2}{p_\lambda}(x_j)\,d\pi(\lambda)\,d\nu^n(x^n).
\]

We exchange the order of integration and factorize the integral with respect to ν^n to bound this by n² Σ_{|I|≥1, j∉I} a^{|I|} b = n² Σ_{r=1}^{n−1} \binom{n−1}{r} a^r b.

3. Applications

3.1. Estimating the mean response in missing data models

Suppose that a typical observation is distributed as X = (Y A, A, Z) for Y and A taking values in the two-point set {0, 1} and conditionally independent given Z. We think of Y as a response variable, which is observed only if the indicator A takes the value 1, and are interested in estimating the mean response EY. The covariate Z is chosen such that it contains all information on the dependence between response and missingness indicator (“missing at random”). We assume that Z takes its values in 𝒵 = [0, 1]d.

The model can be parameterized by the marginal density f of Z relative to the Lebesgue measure ν on 𝒵, and the probabilities b(z) = P(Y = 1 | Z = z) and a(z)^{−1} = P(A = 1 | Z = z). Alternatively, the model can be parameterized by the function g = f/a, which is the conditional density of Z given A = 1 up to the norming factor P(A = 1). Under this latter parametrization, which we adopt henceforth, the density p of an observation X is described by the triple (a, b, g) and the functional of interest is expressed as χ(p) = ∫ abg dν.

Define C_M^α[0,1]^d as M times the unit ball of the Hölder space of α-smooth functions on [0, 1]^d. For given positive constants α, β, γ, M and M̲, we consider the models

  • 𝒫1 = {(a, b, g): a ∈ C_M^α[0,1]^d, b ∈ C_M^β[0,1]^d, g = 1/2, a, b ≥ M̲}.
  • 𝒫2 = {(a, b, g): a ∈ C_M^α[0,1]^d, b ∈ C_M^β[0,1]^d, g ∈ C_M^γ[0,1]^d, a, b ≥ M̲}.

If (α + β)/2 ≥ d/4, then a √n-rate is attainable over 𝒫2 (see [3]), and a standard “two-point” proof can show that this rate cannot be improved. Here we are interested in the case (α + β)/2 < d/4, when the rate becomes slower than 1/√n. The paper [3] constructs an estimator that attains the rate n^{−(2α+2β)/(2α+2β+d)} uniformly over 𝒫2 if

\[
\frac{\gamma}{2\gamma+d}>\Bigl(\frac{\alpha\vee\beta}{d}\Bigr)\Bigl(\frac{d-2\alpha-2\beta}{d+2\alpha+2\beta}\Bigr)=:\gamma(\alpha,\beta). \tag{3.1}
\]

We shall show that this result is optimal by showing that the minimax rate over the smaller model 𝒫1 is not faster than n−(2α+2β)/(2α+2β+d).

In the case that α = β these results can be proved using the method of [1], but in general we need a construction as in Section 2 with Pλ based on a perturbation of the smoothest parameter of the pair (a, b) and Qλ constructed by perturbing in addition the coarsest of the two parameters.

Theorem 3.1

If (α + β)/2 < d/4, the minimax rate over 𝒫1 for estimating ∫ abg dν is at least n^{−(2α+2β)/(2α+2β+d)}.

Proof

Let H: ℝ^d → ℝ be a C^∞ function supported on the cube [0, 1/2]^d with ∫ H dν = 0 and ∫ H² dν = 1. Let k be the integer closest to n^{2d/(2α+2β+d)} and let 𝒵1, …, 𝒵k be translates of the cube k^{−1/d}[0, 1/2]^d that are disjoint and contained in [0, 1]^d. For z1, …, zk the bottom left corners of these cubes and λ = (λ1, …, λk) ∈ Λ = {−1, 1}^k, let

\[
a_\lambda(z)=2+k^{-\alpha/d}\sum_{i=1}^k\lambda_iH\bigl((z-z_i)k^{1/d}\bigr),
\qquad
b_\lambda(z)=\tfrac12+k^{-\beta/d}\sum_{i=1}^k\lambda_iH\bigl((z-z_i)k^{1/d}\bigr).
\]

These functions can be seen to be contained in C^α[0,1]^d and C^β[0,1]^d with norms that are uniformly bounded in k. We choose a uniform prior π on Λ, so that λ1, …, λk are i.i.d. Rademacher variables.
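The scaling behind this construction can be checked numerically in dimension d = 1: the k rescaled bumps have disjoint supports, each contributes ∫ H((z − z_i)k)² dz = 1/k, so the perturbation Σ_i λ_i H((z − z_i)k) has unit L₂(ν)-norm. A sketch with a sine bump standing in for the C^∞ function H (assumptions: d = 1, k = 8, λ_i ≡ 1; the bump is smooth on its support but not C^∞ at the endpoints, which is immaterial for this check):

```python
import math

# Concrete d = 1 stand-in for H: a sine bump on [0, 1/2], zero elsewhere,
# normalized so that int H d(nu) = 0 and int H^2 d(nu) = 1.
def H(x):
    return 2.0 * math.sin(4.0 * math.pi * x) if 0.0 <= x <= 0.5 else 0.0

def riemann(f, lo, hi, m=100000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / m
    return sum(f(lo + (i + 0.5) * h) for i in range(m)) * h

k = 8
z = [i / k for i in range(k)]      # bottom corners of the k cells

def perturbation(x):
    """sum_i lambda_i H((x - z_i) k) with lambda_i = 1; the bumps are disjoint."""
    return sum(H((x - zi) * k) for zi in z)

int_H = riemann(H, 0.0, 0.5)                                   # ~ 0
int_H2 = riemann(lambda x: H(x) ** 2, 0.0, 0.5)                # ~ 1
int_pert2 = riemann(lambda x: perturbation(x) ** 2, 0.0, 1.0)  # ~ 1
```

The amplitudes k^{−α/d} and k^{−β/d} in the display above are exactly what keeps the Hölder norms of a_λ and b_λ bounded as k grows.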

We partition the sample space {0, 1}×{0, 1}×𝒵 into the sets {0, 1}×{0, 1}×𝒵j and the remaining set.

We parameterize the model by the triple (a, b, g). The likelihood can then be written as

\[
(a-1)^{1-A}(Z)\,\bigl(b^Y(Z)\,(1-b)^{1-Y}(Z)\bigr)^{A}.
\]

Because ∫ H dν = 0, the values of the functional ∫ abg dν at the parameter values (aλ, 1/2, 1/2) and (2, bλ, 1/2) are both equal to 1/2, whereas the value at (aλ, bλ, 1/2) is equal to

\[
\int a_\lambda b_\lambda\,\frac{d\nu}{2}=\frac12+\Bigl(\frac1k\Bigr)^{\alpha/d+\beta/d}\int\Bigl(\sum_{i=1}^kH\bigl((z-z_i)k^{1/d}\bigr)\Bigr)^2\frac{d\nu}{2}=\frac12+\frac12\Bigl(\frac1k\Bigr)^{\alpha/d+\beta/d}.
\]

Thus the minimax rate is not faster than (1/k)α/d+β/d for k = kn such that the convex mixtures of the products of the perturbations do not separate completely as n → ∞. We choose the mixtures differently in the cases αβ and αβ.

α ≤ β. We define pλ by the parameter (aλ, 1/2, 1/2) and qλ by the parameter (aλ, bλ, 1/2). Because ∫ aλ dπ(λ) = 2 and ∫ bλ dπ(λ) = 1/2, we have

\[
p(X)=\int p_\lambda(X)\,d\pi(\lambda)=\bigl(b^Y(Z)\,(1-b)^{1-Y}(Z)\bigr)^{A},\qquad b=\tfrac12,
\]
\[
(p_\lambda-p)(X)=(1-A)\,(a_\lambda-2)(Z),
\]
\[
(q_\lambda-p_\lambda)(X)=A\,(b_\lambda-\tfrac12)^{Y}\,(\tfrac12-b_\lambda)^{1-Y}(Z),
\]
\[
(q-p)(X)=\int(q_\lambda-p_\lambda)(X)\,d\pi(\lambda)=0.
\]

Therefore, it follows that the number d in Theorem 2.1 vanishes, while the numbers a and b are of the orders k−2α/d and k−2β/d times

\[
\max_j\int_{\mathcal Z_j}\Bigl(\sum_{i=1}^k\lambda_iH\bigl((z-z_i)k^{1/d}\bigr)\Bigr)^2\frac{d\nu}{1/k}\lesssim1,
\]

respectively. Theorem 2.1 shows that

\[
\rho\Bigl(\int P_\lambda^n\,d\pi(\lambda),\,\int Q_\lambda^n\,d\pi(\lambda)\Bigr)\ge1-Cn^2\,\frac1k\bigl(k^{-4\beta/d}+k^{-2\alpha/d}k^{-2\beta/d}\bigr).
\]

For k ~ n2d/(2α+2β+d) the right side is bounded away from 0. Substitution of this number in the magnitude of separation (1/k)α/d+β/d leads to the rate as claimed in the theorem.
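The bookkeeping for this choice of k can be checked with exact rational arithmetic: the exponent of n in the cross term n²k^{−1}k^{−2(α+β)/d} is exactly 0, the first term has non-positive exponent whenever α ≤ β, and the separation (1/k)^{α/d+β/d} becomes n^{−2(α+β)/(2α+2β+d)}. A sketch (the values α = 1, β = 2, d = 20 are one arbitrary admissible example):

```python
from fractions import Fraction as F

def exponents_of_n(alpha, beta, d):
    """Exponents of n in the bound when k = n^{2d/(2 alpha + 2 beta + d)}."""
    t = F(2 * d, 2 * alpha + 2 * beta + d)           # k = n^t
    first = 2 - t * F(d + 4 * beta, d)               # n^2 (1/k) k^{-4 beta/d}
    cross = 2 - t * F(d + 2 * alpha + 2 * beta, d)   # n^2 (1/k) k^{-2(alpha+beta)/d}
    sep = -t * F(alpha + beta, d)                    # separation (1/k)^{(alpha+beta)/d}
    return first, cross, sep

# Example with alpha <= beta and (alpha + beta)/2 < d/4: alpha = 1, beta = 2, d = 20.
first, cross, sep = exponents_of_n(1, 2, 20)
```

Both terms of the bound thus stay bounded as n → ∞, while the separation exponent equals −2(α + β)/(2α + 2β + d), as claimed.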

α ≥ β. We define pλ by the parameter (2, bλ, 1/2) and qλ by the parameter (aλ, bλ, 1/2). The computations are very similar to the ones in the case α ≤ β.

3.2. Estimating an expected conditional covariance

Suppose that we observe n independent and identically distributed copies of X = (Y, A, Z), where, as in the previous section, Y and A are dichotomous, and Z takes its values in 𝒵 = [0, 1]^d with marginal density f. Let b(z) = P(Y = 1 | Z = z) and a(z) = P(A = 1 | Z = z). We note that

\[
\begin{aligned}
b(Z)&=P(Y=1\,|\,A=1,Z)\,a(Z)+\{1-a(Z)\}\,P(Y=1\,|\,A=0,Z)\\
&=\{P(Y=1\,|\,A=1,Z)-P(Y=1\,|\,A=0,Z)\}\,a(Z)+P(Y=1\,|\,A=0,Z)\\
&=\{P(Y=1\,|\,A=0,Z)-P(Y=1\,|\,A=1,Z)\}\,\{1-a(Z)\}+P(Y=1\,|\,A=1,Z),
\end{aligned}
\]

so that by combining the last two equations above, we can write

\[
P(Y=1\,|\,A,Z)=\Delta(Z)\,\{A-a(Z)\}+b(Z),
\]

where Δ(Z) = P(Y = 1 | A = 1, Z) − P(Y = 1 | A = 0, Z). This allows us to parametrize the density p of an observation by (Δ, a, b, f). The functional χ(p) is given by the expected conditional covariance

\[
\chi(p)=E_f\{\mathrm{cov}_{\Delta,a,b}(Y,A\,|\,Z)\}=E_{\Delta,a,b,f}(YA)-\int ab\,f\,d\nu. \tag{3.2}
\]

We consider the models

  • ℬ1 = {(Δ, a, b, f): Δ is unrestricted, a ∈ C_M^α[0,1]^d, b ∈ C_M^β[0,1]^d, f = 1, a, b ≥ M̲}
  • ℬ2 = {(Δ, a, b, f): Δ is unrestricted, a ∈ C_M^α[0,1]^d, b ∈ C_M^β[0,1]^d, f ∈ C_M^γ[0,1]^d, a, b ≥ M̲}

We are mainly interested in the case (α + β)/2 < d/4, when the rate of estimation of χ(p) becomes slower than 1/√n. The paper [3] constructs an estimator that attains the rate n^{−(2α+2β)/(2α+2β+d)} uniformly over ℬ2 if condition (3.1) of the previous section holds. We will show that this rate is optimal by showing that the minimax rate over the smaller model ℬ1 is not faster than n^{−(2α+2β)/(2α+2β+d)}.

The first term of the difference on the right side of equation (3.2) can be estimated by the sample average n^{−1} Σ_{i=1}^n Y_iA_i at rate n^{−1/2}. It follows that χ(p) can be estimated at the maximum of n^{−1/2} and the rate of estimation of ∫ abf dν. In other words, to establish that the minimax rate for estimating χ(p) over ℬ1 is n^{−(2α+2β)/(2α+2β+d)}, we shall show that the minimax rate for estimating ∫ abf dν over ℬ1 is n^{−(2α+2β)/(2α+2β+d)}.

Theorem 3.2

If (α + β)/2 < d/4, the minimax rate over ℬ1 for estimating ∫ abf dν is at least n^{−2(α+β)/(2α+2β+d)}.

Proof

Under the parametrization (Δ, a, b, f), the density of an observation X is given by

\[
\begin{aligned}
&\bigl(\bigl[\Delta(Z)\{A-a(Z)\}+b(Z)\bigr]\,a(Z)\bigr)^{YA}
\times\bigl(\bigl[1-\Delta(Z)\{A-a(Z)\}-b(Z)\bigr]\,a(Z)\bigr)^{(1-Y)A}\\
&\quad\times\bigl(\bigl[\Delta(Z)\{A-a(Z)\}+b(Z)\bigr]\,\{1-a(Z)\}\bigr)^{Y(1-A)}
\times\bigl(\bigl[1-\Delta(Z)\{A-a(Z)\}-b(Z)\bigr]\,\{1-a(Z)\}\bigr)^{(1-Y)(1-A)}
\times f(Z).
\end{aligned}
\]

Suppose α < β and set

\[
a_\lambda(z)=\tfrac12+\delta a_\lambda(z)=\tfrac12+k^{-\alpha/d}\sum_{i=1}^k\lambda_iH\bigl((z-z_i)k^{1/d}\bigr),
\]
\[
b_\lambda(z)=\tfrac12+\delta b_\lambda(z)=\tfrac12+k^{-\beta/d}\sum_{i=1}^k\lambda_iH\bigl((z-z_i)k^{1/d}\bigr),
\]
\[
\Delta_\lambda(Z)=\frac{-\,\delta b_\lambda(Z)}{\tfrac12-\delta a_\lambda(Z)},
\]

then at the parameter values (0, aλ, 1/2, 1) we have ∫ abf dν = 1/4, with corresponding likelihood pλ = {aλ(Z)/2}^A × [{1 − aλ(Z)}/2]^{1−A}, whereas at parameter values (Δλ, aλ, bλ, 1) we have ∫ abf dν = 1/4 + n^{−2(α+β)/(d+2(α+β))} and the likelihood is given by

\[
q_\lambda(X)=\{a_\lambda(Z)/2\}^{YA}\times\{a_\lambda(Z)/2\}^{(1-Y)A}\times\bigl[\{1-a_\lambda(Z)\}/2+\delta b_\lambda(Z)\bigr]^{Y(1-A)}\times\bigl[\{1-a_\lambda(Z)\}/2-\delta b_\lambda(Z)\bigr]^{(1-Y)(1-A)},
\]

so that

\[
(q_\lambda-p_\lambda)(X)=(1-A)\times\{\delta b_\lambda(Z)\}^{Y}\times\{-\,\delta b_\lambda(Z)\}^{1-Y},
\]

and we conclude that (q − p)(X) = ∫(qλ − pλ)(X) dπ(λ) = 0. Furthermore

\[
(p_\lambda-p)(X)=\{\delta a_\lambda(Z)/2\}^{A}\times\{-\,\delta a_\lambda(Z)/2\}^{1-A},
\]

so that

\[
\max_j\sup_\lambda\int_{\mathcal X_j}\frac{(q_\lambda-p_\lambda)^2}{p_\lambda}\,\frac{d\nu}{P(\mathcal X_j)}
=k^{-2\beta/d}\max_j\sup_\lambda\int_{\mathcal X_j}\frac{\bigl(\sum_{i=1}^k\lambda_iH\bigl((z-z_i)k^{1/d}\bigr)\bigr)^2}{p_\lambda}\,\frac{d\nu}{P(\mathcal X_j)}\lesssim k^{-2\beta/d},
\]
\[
\sup_\lambda\int_{\mathcal X_j}\frac{\bigl(\int(q_\lambda-p_\lambda)\,d\pi(\lambda)\bigr)^2}{p_\lambda}\,\frac{d\nu}{P(\mathcal X_j)}=0,
\]

and

\[
\max_j\sup_\lambda\int_{\mathcal X_j}\frac{(p_\lambda-p)^2}{p_\lambda}\,\frac{d\nu}{P(\mathcal X_j)}
=k^{-2\alpha/d}\max_j\sup_\lambda\int_{\mathcal X_j}\frac{\bigl(\sum_{i=1}^k\lambda_iH\bigl((z-z_i)k^{1/d}\bigr)\bigr)^2}{p_\lambda}\,\frac{d\nu}{P(\mathcal X_j)}\lesssim k^{-2\alpha/d},
\]

Therefore, it follows that the number d of Theorem 2.1 vanishes, while the numbers a and b are of order k^{−2α/d} and k^{−2β/d}, respectively. Theorem 2.1 shows that

\[
\rho\Bigl(\int P_\lambda^n\,d\pi(\lambda),\,\int Q_\lambda^n\,d\pi(\lambda)\Bigr)\ge1-Cn^2\,\frac1k\bigl(k^{-4\beta/d}+k^{-2\beta/d}k^{-2\alpha/d}\bigr),
\]

which gives the desired result for the choice of k ~ n2d/(2α+2β+d).

Next, suppose α > β, set aλ (Z) and bλ (Z) as above, and let

\[
\Delta_\lambda(Z)=\frac{-\,\delta a_\lambda(Z)\,b_\lambda(Z)}{\bigl(\tfrac12-\delta a_\lambda(Z)\bigr)\,a_\lambda(Z)},
\]

then at the parameter values (0, 1/2, bλ, 1) we have ∫ abf dν = 1/4, with corresponding likelihood

\[
p_\lambda(X)=\bigl[b_\lambda(Z)/2\bigr]^{Y}\times\bigl[\{1-b_\lambda(Z)\}/2\bigr]^{1-Y},
\]

whereas at parameter values (Δλ, aλ, bλ, 1) we have ∫ abf dν = 1/4 + n^{−2(α+β)/(d+2(α+β))}, with corresponding likelihood given by

\[
q_\lambda(X)=\bigl[b_\lambda(Z)/2\bigr]^{Y}\times\bigl[a_\lambda(Z)-b_\lambda(Z)/2\bigr]^{(1-Y)A}\times\bigl[\tfrac12-\delta a_\lambda(Z)-b_\lambda(Z)/2\bigr]^{(1-Y)(1-A)},
\]

so that

\[
(q_\lambda-p_\lambda)(X)=(1-Y)\times\{\delta a_\lambda(Z)\}^{A}\times\{-\,\delta a_\lambda(Z)\}^{1-A},
\]

and we conclude that (q − p)(X) = ∫(qλ − pλ)(X) dπ(λ) = 0. Furthermore

\[
(p_\lambda-p)(X)=\{\delta b_\lambda(Z)/2\}^{Y}\times\{-\,\delta b_\lambda(Z)/2\}^{1-Y},
\]

so that

\[
\max_j\sup_\lambda\int_{\mathcal X_j}\frac{(q_\lambda-p_\lambda)^2}{p_\lambda}\,\frac{d\nu}{P(\mathcal X_j)}\lesssim k^{-2\alpha/d},
\qquad
\sup_\lambda\int_{\mathcal X_j}\frac{\bigl(\int(q_\lambda-p_\lambda)\,d\pi(\lambda)\bigr)^2}{p_\lambda}\,\frac{d\nu}{P(\mathcal X_j)}=0,
\]

and

\[
\max_j\sup_\lambda\int_{\mathcal X_j}\frac{(p_\lambda-p)^2}{p_\lambda}\,\frac{d\nu}{P(\mathcal X_j)}\lesssim k^{-2\beta/d},
\]

which yields the desired result by arguments similar to the previous case.

Contributor Information

James Robins, Department of Biostatistics and Epidemiology, School of Public Health, Harvard University.

Eric Tchetgen Tchetgen, Department of Biostatistics and Epidemiology, School of Public Health, Harvard University.

Lingling Li, Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care, Boston, MA, 02215.

Aad van der Vaart, Department of Mathematics, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands.

References

  • 1. Birgé L, Massart P. Estimation of integral functionals of a density. Annals of Statistics. 1995;23:11–29.
  • 2. van der Vaart A. Asymptotic Statistics. Cambridge University Press; 1998.
  • 3. Robins J, Li L, Tchetgen Tchetgen E, van der Vaart A. Higher order influence functions and minimax estimation of nonlinear functionals. IMS Lecture Notes–Monograph Series, Probability and Statistics: Essays in Honor of David A. Freedman. 2008;2:335–421.
