Abstract
We introduce a new method of estimation of parameters in semiparametric and nonparametric models. The method is based on estimating equations that are U-statistics in the observations. The U-statistics are based on higher order influence functions that extend ordinary linear influence functions of the parameter of interest, and represent higher derivatives of this parameter. For parameters for which the representation cannot be perfect the method leads to a bias-variance trade-off, and results in estimators that converge at a slower than n−1/2 rate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at n−1/2 rate, but we also consider efficient n−1/2 estimation using novel nonlinear estimators. The general approach is applied in detail to the example of estimating a mean response when the response is not always observed.
AMS 2000 subject classifications: Primary 62G05, 62G20, 62F25
Keywords: Nonlinear functional, nonparametric estimation, U-statistic, influence function, tangent space
1. Introduction
Let X1, X2, …, Xn be a random sample from a density p relative to a measure μ on a sample space (𝒳, 𝒜). It is known that p belongs to a collection 𝒫 of densities, and the problem is to estimate the value χ(p) of a functional χ: 𝒫 → ℝ. Our main interest is in the situation of a semiparametric or nonparametric model, where 𝒫 is infinite dimensional.
Estimating equations have been found a good strategy for constructing estimators in semiparametric models [2, 34, 40]. Because the model is of (much) higher dimension than the parameter of interest, setting up a good estimating equation often requires an initial estimator of a "nuisance parameter" η, and an estimating equation for θ = χ(p) may take the form
ℙnψθ,η̂ = 0.
Here ℙnψθ,η̂ is short for n−1 Σi=1n ψθ,η̂(Xi), and x ↦ ψθ,η(x) is a given measurable map, for each (θ, η). In the present paper it will be more convenient to work with a one-step version of this estimator, defined by the method of Newton–Raphson from a linearization of the map θ ↦ ℙnψθ,η̂ around an initial estimator θ̂n, leading to an estimator of the form θ̂n − V̂n−1ℙnψθ̂n,η̂, for V̂n an estimate of the derivative of the estimating equation. In more general notation such an estimator can be written as
χ̂n = χ(p̂n) + ℙnχp̂n | (1.1) |
for an initial estimator p̂n for p, and x ↦ χp(x) a given measurable function, for each p ∈ 𝒫.
One possible choice in (1.1) is χp = 0, leading to the plug-in estimator χ(p̂n). However, unless the initial estimator p̂n possesses special properties, this choice is typically suboptimal. Better functions χp can be constructed by consideration of the tangent space of the model. To see this, we write (with Pf shorthand for ∫ f dP)
χ̂n − χ(p) = (ℙn − P)χp̂n + [χ(p̂n) − χ(p) + Pχp̂n]. | (1.2) |
Because it is properly centered, we may expect the sequence √n(ℙn − P)χp̂n to tend in distribution to a mean-zero normal distribution. The term between square brackets on the right of (1.2), which we shall refer to as the bias term, depends on the initial estimator p̂n, and it would be natural to construct the function χp such that this term does not contribute to the limit distribution, or at least is not dominating the expression. Thus we would like to choose this function such that the "bias term" is at most of the order OP (n−1/2). A good choice is to ensure that the term Pχp̂n acts as minus the first derivative of the functional χ in the "direction" p̂n − p. Functions x ↦ χp(x) with this property are known as influence functions in semiparametric theory [16, 22, 35, 5, 2], go back to the von Mises calculus due to [33], and play an important role in robust statistics [14, 11], or [40], Chapter 20.
For an influence function we may expect that the "bias term" is quadratic in the error d(p̂n, p), for an appropriate distance d. In that case it is certainly negligible as soon as this error is of order oP (n−1/4). Such a "no-bias" condition is well known in semiparametric theory (e.g. condition (25.52) in [40] or (11) in [20]). However, typically it requires that the model be "not too big". For instance, a regression or density function on d-dimensional space can be estimated at rate n−1/4 if it is a-priori known to have at least d/2 derivatives (indeed α/(2α + d) ≥ 1/4 if α ≥ d/2). The purpose of this paper is to develop estimation procedures for the case that no estimators exist that attain a OP (n−1/4) rate of convergence. The estimator (1.1) is then suboptimal, because it fails to make a proper trade-off between "bias" and "variance": the two terms in (1.2) have different magnitudes. Our strategy is to replace the linear term ℙnχp̂n by a general U-statistic 𝕌nχp̂n, for an appropriate m-dimensional influence function (x1, …, xm) ↦ χp(x1, …, xm), chosen using a von Mises expansion of p ↦ χ(p). Here the order m is adapted to the size of the model and the type of functional to be estimated.
Unfortunately, "exact" higher-order influence functions turn out to exist only for special functionals χ. To treat general functionals χ we approximate these by simpler functionals, or use approximate influence functions. The rate of the resulting estimator is then determined by a trade-off between bias and variance terms. It may still be of order n−1/2, but it is often slower. In the former case, surprisingly, one may obtain semiparametric efficiency by estimators whose variance is determined by the linear term, but whose bias is corrected using higher order influence functions.
The conclusion that the "bias term" in (1.2) is quadratic in the estimation error d(p̂n, p) is based on a worst case analysis. First, there exists a large number of models and functionals of interest that permit a first order influence function that is unbiased in the nuisance parameter. (E.g. adaptive models as considered in [1], models allowing a sufficient statistic for the nuisance parameter as in [38, 39], mixture models as considered in [19, 24, 36], and convex-linear models in survival analysis.) In such models there is no need for higher-order estimating equations. Second, the analysis does not take special, structural properties of the initial estimators into account. An alternative approach would be to study the bias of a particular estimator in detail, and adapt the estimating equation to this special estimator. The strategy in this paper is not to use such special properties, but to focus on estimating equations that work with general initial estimators p̂n.
The motivation for our new estimators stems from studies in epidemiology and econometrics that include covariates whose influence on an outcome of interest cannot be reliably modelled by a simple model. These covariates may themselves not be of interest, but are included in the analysis to adjust the analysis for possible bias. For instance, the mechanism that describes why certain data is missing is in terms of conditional probabilities given several covariates, but the functional form of this dependence is unknown. Or, to permit a causal interpretation in an observational study one conditions on a set of covariates to control for confounding, but the form of the dependence on the confounding variables is unknown. One may hypothesize in such situations that the functional dependence on a set of (continuous) covariates is smooth (e.g. d/2 times differentiable in the case of d covariates), or even linear. Then the usual estimators will be accurate (at order OP(n−1/2)) if the hypothesis is true, but they will be badly biased in the other case. In particular, the usual normal-theory based confidence intervals may be totally misleading: they will be both too narrow and wrongly located. The methods in this paper yield estimators with (typically) wider corresponding confidence intervals, but they are correct under weaker assumptions.
The mathematical contributions of the paper are to provide a heuristic for constructing minimax estimators in semiparametric models, and to apply this to a concrete model, which is a template for a number of other models (see [27, 37]). The methods connect to earlier work [13, 21] on the estimation of functionals on nonparametric models, but differ by our focus on functionals that are defined in terms of the structure of a semiparametric model. This requires an analysis of the inverse map from the density of the observations to the parameters, in terms of the semiparametric tangent spaces of the models. Our second order estimators are related to work on quadratic functionals, or functionals that are well approximated by quadratic functionals, as in [10, 15, 3, 4, 17, 18, 6, 7]. While we place the construction of minimax estimators for these special functionals in a wider framework, our focus differs by going beyond quadratic estimators and by considering semiparametric models.
Our mathematical results are in part conditional on a scale of regularity parameters, through a dimension (8.9) and a partition of this dimension that depends on two of these parameters. We hope to discuss adaptation to these parameters in future work.
General heuristics of our construction are given in Section 3. Sections 4–8 are devoted to constructing new estimators for the mean response effect in missing data problems. In Section 9 we briefly discuss some other problems, including the problem of estimating a density at a point, where already first order influence functions do not exist and our heuristics naturally lead to projection estimators. Section 10 collects technical proofs. Sections 11, 12 and 13 (in the supplement [26]) discuss three key concepts of the paper: influence functions, projections and U-statistics.
2. Notation
Let 𝕌n denote the empirical U-statistic measure, viewed as an operator on functions. For given k ≤ n and a function f: 𝒳k → ℝ on the sample space this is defined by
𝕌nf = ((n − k)!/n!) Σ f(Xi1, …, Xik),
where the sum is over all k-tuples (i1, …, ik) of distinct integers from {1, …, n}.
We do not let the order k show up in the notation 𝕌nf. This is unnecessary, as the notation is consistent in the following sense: if a function f of l < k arguments is considered a function of k arguments that is constant in its last k − l arguments, then the right side of the preceding display is well defined and is exactly the corresponding U-statistic of order l. In particular, 𝕌nf is the empirical distribution ℙn applied to f if f depends on only one argument.
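A brute-force numerical sketch of this definition and of the consistency of notation just described (the function name U_statistic and the toy data are ours, not from the paper):

```python
import itertools
import math
import random

def U_statistic(xs, f, k):
    """Average of f over all ordered k-tuples of distinct indices:
    U_n f = ((n-k)!/n!) * sum over distinct (i_1,...,i_k) of f(X_{i_1},...,X_{i_k})."""
    n = len(xs)
    total = sum(f(*(xs[i] for i in idx))
                for idx in itertools.permutations(range(n), k))
    return total * math.factorial(n - k) / math.factorial(n)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(8)]

# A function of one argument, viewed as a function of two arguments that is
# constant in its second argument, gives back the order-1 U-statistic,
# which is just the empirical mean.
f1 = lambda x: x * x
f2 = lambda x, y: x * x            # constant in its second argument
assert abs(U_statistic(xs, f1, 1) - U_statistic(xs, f2, 2)) < 1e-12
assert abs(U_statistic(xs, f1, 1) - sum(x * x for x in xs) / len(xs)) < 1e-12
```

The implementation enumerates all ordered tuples and is exponentially slow in k; it is meant only to make the normalization (n − k)!/n! concrete.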
We write Pnf for the expectation of f(X1, …, Xn) if X1, …, Xn are distributed according to the probability measure P. We also use this operator notation for the expectations of statistics in general.
We call f degenerate relative to P if ∫ f(x1, …, xk) dP(xi) = 0 for every i and every (xj: j ≠ i), and we call f symmetric if f(x1, …, xk) is invariant under permutation of the arguments x1, …, xk. Given an arbitrary measurable function f: 𝒳k → ℝ we can form a function DPf that is degenerate relative to P by subtracting the orthogonal projection in L2 (Pk) onto the functions of at most k − 1 variables. This degenerate function can be written in the form (e.g. [40], Lemma 11.11)
(DPf)(x1, …, xk) = ΣA (−1)k−|A| EP[f(X1, …, Xk)| Xi = xi, i ∈ A], | (2.1) |
where the sum is over all subsets A of {1, …, k}, including the empty set, for which the conditional expectation is understood to be Pkf. If the function f is symmetric, then so is the function DPf.
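For k = 2 formula (2.1) reduces to DPf(x1, x2) = f(x1, x2) − EPf(X, x2) − EPf(x1, X) + P2f, and the degeneracy can be verified exactly on a finite sample space (the particular P and f below are arbitrary illustrative choices):

```python
# Finite sample space with probabilities P; f an arbitrary kernel of k = 2 arguments.
points = [0.0, 1.0, 2.5]
prob = {0.0: 0.2, 1.0: 0.5, 2.5: 0.3}
f = lambda x, y: x * y + x ** 2

E = lambda g: sum(g(x) * prob[x] for x in points)          # E_P g(X)
Pf = E(lambda x: E(lambda y: f(x, y)))                     # P^2 f

def D_P_f(x1, x2):
    # (2.1) for k = 2: alternating sum over the subsets A of {1, 2}
    return f(x1, x2) - E(lambda x: f(x, x2)) - E(lambda y: f(x1, y)) + Pf

# Degeneracy: integrating out either argument gives 0, whatever the other argument.
for x in points:
    assert abs(E(lambda y: D_P_f(x, y))) < 1e-12
    assert abs(E(lambda y: D_P_f(y, x))) < 1e-12
```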
Given two functions g and h we write g × h for the function (x, y) ↦ g(x)h(y). More generally, given k functions g1, …, gk we write g1 × ⋯ × gk for the tensor product of these functions. Such product functions are degenerate iff all functions in the product have mean zero.
A kernel operator takes the form Kf(x) = ∫ K(x, y)f(y) dμ(y) for some measurable function K: 𝒳 × 𝒳 → ℝ. We shall abuse notation in denoting the operator K and the kernel with the same symbol. A (weighted) projection on a finite-dimensional space is a kernel operator. We discuss such projections in Section 12.
The set of measurable functions whose rth absolute power is μ-integrable is denoted Lr(μ), with norm ‖·‖r,μ, or ‖·‖r if the measure is clear; or also as Lr(w) with norm ‖·‖r,w if w is a density relative to a given dominating measure. For r = ∞ the notation ‖·‖∞ refers to the uniform norm.
3. Heuristics
Our basic estimator has the form (1.1) except that we replace the linear term by a general U-statistic. Given measurable functions χp: 𝒳m → ℝ, for a fixed order m, we consider estimators of χ(p) of the type
χ̂n = χ(p̂n) + 𝕌nχp̂n. | (3.1) |
The initial estimators p̂n are thought to have a certain (optimal) convergence rate, but need not possess (further) special properties. Throughout we shall treat these estimators as being based on an independent sample of observations, so that the stochasticities in (3.1) present in p̂n and 𝕌n are independent. This takes away technical complications, and allows us to focus on rates of estimation in full generality. (A simple way to avoid the resulting asymmetry would be to swap the two samples, calculate the estimator a second time and take the average.)
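The splitting-and-swapping device can be sketched on a toy one-dimensional problem; the functional χ(p) = (EX)2 with first order influence function 2μ(x − μ) is our illustrative choice, not a functional treated in the paper:

```python
import random

random.seed(1)
mu = 0.3
data = [random.gauss(mu, 1.0) for _ in range(2000)]

# One-step estimator (1.1) for chi(p) = (E X)^2, whose first order influence
# function is chi_p(x) = 2 mu (x - mu); p_hat is fitted on an independent half.
def one_step(train, eval_half):
    mu_hat = sum(train) / len(train)                      # initial estimator
    correction = sum(2 * mu_hat * (x - mu_hat) for x in eval_half) / len(eval_half)
    return mu_hat ** 2 + correction

half1, half2 = data[:1000], data[1000:]
# Swap the roles of the two halves and average, so that no observation is used
# both for p_hat and for the empirical correction term.
chi_hat = 0.5 * (one_step(half1, half2) + one_step(half2, half1))
```

With the seed above the estimate lands close to the true value μ2 = 0.09; the averaging over the two orderings removes the asymmetry mentioned in the text.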
3.1. Influence functions
The key is to find suitable “influence functions” χp. A decomposition of type (1.2) for the estimator (3.1) yields
χ̂n − χ(p) = (𝕌n − Pm)χp̂n + [χ(p̂n) − χ(p) + Pmχp̂n]. | (3.2) |
This suggests constructing the influence functions such that Pmχp̂n represents the first m terms of the Taylor expansion of χ(p) − χ(p̂n). First this implies that the influence function used in (3.1) must be unbiased:
Pmχp = 0.
Next, to operationalize a "Taylor expansion" on the (infinite-dimensional) "manifold" 𝒫 we employ "smooth" submodels t ↦ pt mapping a neighbourhood of 0 ∈ ℝ to 𝒫 and passing through p at t = 0 (i.e. p0 = p). We determine χp such that for each chosen submodel
(dj/dtj)|t=0 χ(pt) = −(dj/dtj)|t=0 Pmχpt, j = 1, …, m.
A slight strengthening is to impose this condition "everywhere" on the path, i.e. the jth derivative of t ↦ χ(pt) at t is minus the jth derivative of h ↦ Ptmχpt+h at h = 0, for every t. If the map (s, t) ↦ Psmχpt is smooth, then the latter implies (cf. Lemma 11.1)
(dj/dtj)|t=0 χ(pt) = (dj/dtj)|t=0 Ptmχp, j = 1, …, m. | (3.3) |
Relative to the previous formula the subscript t on the right hand side has changed places, and the negative sign has disappeared. Under regularity conditions equation (3.3) for m = 1 can be written in the form
(d/dt)|t=0 χ(pt) = ∫ χp g p dμ, | (3.4) |
where g = (d/dt)t=0pt/p is the score function of the model t ↦ pt at t = 0.
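Identity (3.4) can be checked numerically on a finite sample space, comparing a difference quotient of t ↦ χ(pt) with the inner product of the influence function and the score; the functional χ(p) = EpX2 and the particular score g below are illustrative choices:

```python
# Finite sample space check of (3.4): d/dt chi(p_t)|_0 = integral of chi_p * g * p.
points = [-1.0, 0.0, 2.0]
p = {-1.0: 0.3, 0.0: 0.4, 2.0: 0.3}
g = {-1.0: 1.0, 0.0: -0.5, 2.0: -1.0 / 3.0}        # a score: mean zero under p
assert abs(sum(g[x] * p[x] for x in points)) < 1e-12

chi = lambda q: sum(x * x * q[x] for x in points)          # chi(p) = E_p X^2
chi_p = {x: x * x - chi(p) for x in points}                # its influence function

def p_t(t):
    # path p_t = p (1 + t g) through p, with score g at t = 0
    return {x: p[x] * (1.0 + t * g[x]) for x in points}

h = 1e-6
lhs = (chi(p_t(h)) - chi(p_t(-h))) / (2 * h)               # d/dt chi(p_t) at t = 0
rhs = sum(chi_p[x] * g[x] * p[x] for x in points)          # P(chi_p g), as in (3.4)
assert abs(lhs - rhs) < 1e-6
```

Note that centering the influence function (subtracting χ(p)) changes nothing here, since the score g integrates to zero against p, in line with the remark below that influence functions are unique only up to such modifications.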
A function χp satisfying (3.3) is exactly what is called an influence function in semiparametric theory: a function in L2(p) whose inner products with the elements of the tangent space (the set of score functions g of appropriate submodels t ↦ pg,t) represent the derivative of the functional ([40], page 363, or [2, 22, 39, 35]). For m > 1 equation (3.3) can be expanded similarly in terms of inner products of the influence function with score functions, but higher-order score functions arise next to ordinary score functions. Suitable higher-order tangent spaces are discussed in [27] (also see [37]), using score functions as defined in [43]. A discussion of second order scores and tangent spaces can be found in [28]. Second order tangent spaces are also discussed in [23], from a different point of view and with a different purpose.
Here we take a different route, defining higher order influence functions as influence functions of lower order ones. Any mth order, zero-mean U-statistic can be decomposed as the sum of m degenerate U-statistics of orders 1, 2, …, m, by way of its Hoeffding decomposition. In the present situation we can write
𝕌nχp = Σj=1m 𝕌nχp(j),
where χp(j) is a degenerate kernel of j arguments, defined uniquely as a projection of χp (cf. [42] and (2.1)). (At m = n the left side evaluates to the symmetrized function χp, which is expressed in the functions χp(j).) Suitable functions χp(j) in this decomposition can be found by the following algorithm:
[1] Let χp(1) be a first order influence function of the functional p ↦ χ(p).
[2] Let χ̄p(j)(x1, …, xj) be a first order influence function of the functional p ↦ χ̄p(j−1)(x1, …, xj−1), for each fixed x1, …, xj−1, and j = 2, …, m (with χ̄p(1) = χp(1)).
[3] Let χp(j) be the degenerate part of χ̄p(j) relative to P, as defined in (2.1).
See Lemma 11.2 for a proof. Thus higher order influence functions are constructed as first order influence functions of influence functions. Somewhat abusing language we shall refer to the function χp(j) also as a "jth order influence function". The overall order m will be fixed at a suitable value; for simplicity we do not let it show up in the notation χp.
Because only inner products with scores matter, an “influence function” is unique only up to projections onto the tangent space and (averaging over) permutations of its arguments. With any choice of influence functions the algorithm produces some influence function. In particular, the starting influence function in step [1] may be any function χp that satisfies (3.4) for every score g; it does not have to possess mean zero, or be an element of the tangent space. A similar remark applies to the (first order) influence functions found in step [2]. It is only in step [3] that we make the influence functions degenerate.
3.2. Bias-variance trade-off
Because it is centered, the "variance part" in (3.2), the variable (𝕌n − Pm)χp̂n, should not change noticeably if we replace p̂n by p, and be of the same order as (𝕌n − Pm)χp. For a fixed square-integrable function χp the latter centered U-statistic is well known to be of order OP(n−1/2), and asymptotically normal if suitably scaled. A completely successful representation of the "bias" in (3.2) would lead to an error of the order d(p̂n, p)m+1, which becomes smaller with increasing order m. Were this achievable for any m, then a √n rate of convergence would exist no matter how slow the convergence rate of the initial estimator. Not surprisingly, in many cases of interest this ideal situation is not real. This is due to the non-existence of influence functions that can exactly represent the Taylor expansion of χ. The technical reason is that multi-linear maps (even smooth ones) may not be representable through kernels.
In general, we have to content ourselves with a partial representation. Next to a remainder term of the order d(p̂n, p)m+1, we then incur a "representation bias". The latter bias can be made arbitrarily small by choice of the influence function, but only at the cost of increasing its variance. We thus obtain a trade-off between a variance and two biases. This typically results in a variance that is larger than 1/n, and a rate of convergence that is slower than n−1/2, although sometimes a nontrivial bias correction is possible without increasing the variance.
3.3. Approximate functionals
An attractive method to find approximating influence functions is to compute exact influence functions for an approximate functional. Because smooth functionals on finite-dimensional models typically possess influence functions to any order, projections on finite-dimensional models may deliver such approximations.
A simple approximation would be χ(p̃) for a given map p ↦ p̃ mapping the model 𝒫 onto a suitable "smaller" model 𝒫̃ (typically a submodel 𝒫̃ ⊂ 𝒫). A closer approximation can be obtained by also including a derivative term. Consider the functional χ̃ defined by, for a given map p ↦ p̃,
χ̃(p) = χ(p̃) + Pχp̃. | (3.5) |
(A more complete notation would be χ̃(p) = χ(p̃(p)) + Pχp̃(p); the right hand side depends on p in three ways.) By the definition of an influence function the term Pχp̃ acts as the first order Taylor expansion of χ(p) − χ(p̃). Consequently, we may expect that
χ̃(p) − χ(p) = O(d2(p, p̃)). | (3.6) |
This ought to be true for any "projection" p ↦ p̃. If we choose the projection such that, for any path t ↦ pt,
(d/dt)|t=0 [χ(p̃t) + P0χp̃t] = 0, | (3.7) |
then the functional χ̃ will be locally (around p0) equivalent to the functional p ↦ χ(p̃0) + Pχp̃0 (which depends on p in only one place, p̃0 being fixed) in the sense that the first order influence functions are the same. The first order influence function of the second, linear functional at p0 is equal to χp̃0 − P0χp̃0, and hence for a projection satisfying (3.7) the first order influence function of the functional χ̃ will be
χ̃p0(1) = χp̃0 − P0χp̃0. | (3.8) |
In words, this means that the influence function of the approximating functional χ̃ satisfying (3.5) and (3.7) at p is obtained by substituting p̃ for p in the influence function of the original functional.
This is relevant when obtaining higher order influence functions. As these are recursive derivatives of the first order influence function (see [1]–[3] in Section 3.1), the preceding display shows that we must compute influence functions of the maps
p ↦ χp̃(x1),
i.e. we "differentiate on the model 𝒫̃". If the latter model is sufficiently simple, for instance finite-dimensional, then exact higher order influence functions of the functional χ̃ ought to exist. We can use these as approximate influence functions of χ.
4. Estimating the mean response in missing data models
Suppose that a typical observation is distributed as X = (Y A, A, Z), for Y and A taking values in the two-point set {0, 1} and conditionally independent given Z.
This model is standard in biostatistical applications, with Y an “outcome” or “response variable”, which is observed only if the indicator A takes the value 1. The covariate Z is chosen such that it contains all information on the dependence between the response and the missingness indicator A, thus making the response missing at random. Alternatively, we think of Y as a “counterfactual” outcome if a treatment were given (A = 1) and estimate (half) the treatment effect under the assumption of no unmeasured confounders.
The model can be parameterized by the marginal density f of Z (relative to some dominating measure ν) and the probabilities b(z) = P(Y = 1|Z = z) and a(z)−1 = P(A = 1| Z = z). (Using a for the inverse probability simplifies later formulas.) Alternatively, the model can be parameterized by the pair (a, b) and the function g = f/a, which is the conditional density of Z given A = 1, up to the norming factor P(A = 1). Thus the density p of an observation X is described by the triplet (a, b, f), or equivalently the triplet (a, b, g). For simplicity of notation we write p instead of pa,b,f or pa,b,g, with the implicit understanding that a generic p corresponds one-to-one to a generic (a, b, f) or (a, b, g).
We wish to estimate the mean response EY = Eb(Z), i.e. the functional
χ(p) = ∫ b f dν.
Estimators that are √n-consistent and asymptotically efficient in the semiparametric sense have been constructed using a variety of methods (e.g. [30, 31]), but only if one or both of the parameters a and b are restricted to sufficiently small regularity classes. For instance, if the covariate Z ranges over a compact, convex subset of ℝd, then the mentioned papers provide such estimators under the assumption that a and b belong to Hölder classes 𝒞α and 𝒞β with α and β large enough that
α/(2α + d) + β/(2β + d) ≥ 1/2. | (4.1) |
(See e.g. Section 2.7.1 in [41] for the definition of Hölder classes). For moderate to large dimensions d this is a restrictive requirement. In the sequel we consider estimation for arbitrarily small α and β.
Throughout we assume that the parameters a, b and g are contained in Hölder spaces 𝒞α, 𝒞β and 𝒞γ of functions on a compact, convex domain in ℝd. We derive two types of results:
- In Section 7 we show that a √n rate is attainable by using a higher order estimating equation (of order determined by γ) as long as
(α + β)/2 ≥ d/4. (4.2)
This condition is strictly weaker than the condition (4.1) under which the linear estimator attains a √n rate. Thus even in the √n situation higher order estimating equations may yield estimators that are applicable in a wider range of models. For instance, in the case that α = β the cut-off (4.1) arises for α = β ≥ d/2, whereas (4.2) reduces to α = β ≥ d/4.
- We consider minimax estimation in the case (α + β)/2 < d/4, when the rate becomes slower than n−1/2. It is shown in [29] that even if g = f/a were known, then the minimax rate for a and b ranging over balls in the Hölder classes 𝒞α and 𝒞β cannot be faster than n−(2α+2β)/(2α+2β+d). In Section 8 we show that this rate is attainable if g is known, and also if g is unknown, but is a-priori known to belong to a Hölder class 𝒞γ for sufficiently large γ, as given by (8.11). (Heuristic arguments, not discussed in this paper, appear to indicate that for smaller γ the minimax rate is slower than n−(2α+2β)/(2α+2β+d).)
After reviewing the tangent space and first order theory in Section 4.1, we discuss the second order estimator separately in Section 5. The preceding results are next obtained in Sections 7 (√n rate if (α + β)/2 ≥ d/4) and 8 (slower rate if (α + β)/2 < d/4), using the higher-order influence functions of an approximate functional, which is defined in the intermediate Section 6.
Assumption 4.1
We assume throughout that the functions 1/a, b, g and their preliminary estimators are bounded away from their extremes: 0 and 1 for the first two, and 0 and ∞ for the third.
4.1. Tangent space and first order influence function
The one-dimensional submodels t ↦ pt induced by paths of the form at = a + tα, bt = b + tβ, and ft = f(1 + tϕ) for given directions α, β and ϕ (where ∫ ϕ f dν = 0) yield score functions
Bpaα(X) = −(Aa(Z) − 1) α(Z)/(a(a − 1))(Z),
Bpbβ(X) = A(Y − b(Z)) β(Z)/(b(1 − b))(Z),
Bpfϕ(X) = ϕ(Z).
Here Bpa, Bpb, Bpf are the score operators for the three parameters, whose direct sum is the overall score operator, which we write as Bp: Bp(α, β, ϕ)(X) is the sum of the three left sides of the preceding equation. The first-order influence function is well known to take the form
χp(1)(X) = Aa(Z)(Y − b(Z)) + b(Z) − χ(p). | (4.3) |
Indeed, it is straightforward to verify that this function satisfies, for every path t ↦ pt as described previously,
(d/dt)|t=0 χ(pt) = Ep[χp(1)(X) Bp(α, β, ϕ)(X)].
The advantage of choosing a as an inverse probability is clear from the form of the (random part of the) influence function, which is bilinear in (a, b). The corresponding "first order bias" can be computed to be
χ(p̂) − χ(p) + Pχp̂(1) = −∫ (â − a)(b̂ − b) g dν. | (4.4) |
In agreement with the heuristics given in Sections 1 and 3 this bias is quadratic in the errors of the initial estimator.
Actually, the form of the bias term is special in that squared estimation errors (â − a)2 and (b̂ − b)2 of the two initial estimators â and b̂ do not arise, but only the product of their errors. This property, termed "double robustness" in [34], makes that for first order inference it suffices that one of the two parameters be estimated well. A prior assumption that the parameters a and b are α and β regular, respectively, would allow estimation errors of the orders n−α/(2α+d) and n−β/(2β+d). If the product of these rates is O(n−1/2), then the bias term matches the variance. This leads to the (unnecessarily restrictive) condition (4.1).
If the preliminary estimators â and b̂ are solely selected for having small errors ‖â − a‖ and ‖b̂ − b‖ (e.g. minimax in the L2-norm), then it is hard to see why (4.4) would be small unless the product of the errors is small. Special estimators might exploit that the bias is an integral, in which cancellation of errors could occur. As we do not wish to use special estimators, our approach will be to replace the linear estimating equation by a higher order one, leading to an analogue of (4.4) that is a cubic or higher order polynomial in the estimation errors.
It may be noted that the marginal density f (or g) does not enter into the first order influence function (4.3). Even though the functional depends on f (or g), a rate on the initial estimator of this function is not needed for the construction of the first order estimator. This will be different at higher orders.
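A minimal simulation of the first order estimator built from (4.3) illustrates the double robustness of the bias (4.4): below the working estimate of b is deliberately wrong while that of 1/a is correct, and the estimator remains nearly unbiased (the functional forms of a, b and the sample size are illustrative choices):

```python
import random

random.seed(2)
n = 4000

def simulate():
    # Z uniform on [0,1]; P(A=1|Z) = 1/a(Z); P(Y=1|Z) = b(Z); Y is observed only if A = 1.
    data = []
    for _ in range(n):
        z = random.random()
        b_z = 0.2 + 0.6 * z                    # true b
        inv_a_z = 0.3 + 0.4 * z                # true 1/a = P(A = 1 | Z)
        a_obs = 1 if random.random() < inv_a_z else 0
        y = 1 if random.random() < b_z else 0
        data.append((y * a_obs, a_obs, z))     # observation X = (YA, A, Z)
    return data

def chi_first_order(data, a_hat, b_hat):
    # Empirical mean of the influence function (4.3) plus chi(p_hat):
    # A a_hat(Z) (Y - b_hat(Z)) + b_hat(Z), with working estimates a_hat, b_hat.
    s = sum(a_obs * a_hat(z) * (ya - b_hat(z)) + b_hat(z) for ya, a_obs, z in data)
    return s / len(data)

data = simulate()
truth = 0.5                                     # E b(Z) = 0.2 + 0.6 * E Z
# b_hat misspecified (constant), a_hat correct: by (4.4) the bias is the
# integral of the *product* of the two errors, hence (nearly) zero here.
est = chi_first_order(data, lambda z: 1.0 / (0.3 + 0.4 * z), lambda z: 0.5)
```

Swapping which of the two working models is misspecified gives the same conclusion, while misspecifying both leaves a bias of the order of the product of the two errors.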
5. Second order estimator
In this section we derive a second order influence function for the missing data problem, and analyze the risk of the corresponding estimator. This estimator is minimax if (α + β)/2 ≥ d/4 and
| (5.1) |
In the other case, higher order estimators have smaller risk, as shown in Sections 7–8. However, it is worthwhile to treat the second order estimator separately, as its construction exemplifies the essential elements, without involving the technicalities attached to the higher order estimators.
To find a second order influence function, we follow the strategy [1]–[3] of Section 3.1, and try to find a function χp(2): 𝒳2 → ℝ such that, for every x1 = (y1a1, a1, z1), and all directions α, β, ϕ,
(d/dt)|t=0 [χ(pt) + χpt(1)(x1)] = Ep[χp(2)(x1, X2) Bp(α, β, ϕ)(X2)].
Here the expectation Ep on the right side is relative to the variable X2 only, with x1 fixed. This equation expresses that x2 ↦ χp(2)(x1, x2) is a first order influence function of p ↦ χ(p) + χp(1)(x1), for fixed x1. On the left side we added the "constant" χ(pt) to the first order influence function (giving another first order influence function) to facilitate the computations. This is justified as the strategy [1]–[3] works with any influence function. In view of (4.3) and the definitions of the paths t ↦ a + tα, t ↦ b + tβ and t ↦ f(1 + tϕ), this leads to the equation
α(z1)a1(y1 − b(z1)) − β(z1)(a1a(z1) − 1) = Ep[χp(2)(x1, X2) Bp(α, β, ϕ)(X2)]. | (5.2) |
Unfortunately, no function χp(2) that solves this equation for every (α, β, ϕ) exists. To see this note that for the special triplets with β = ϕ = 0 the requirement can be written in the form
α(z1)a1(y1 − b(z1)) = Ep[χp(2)(x1, X2) Bpaα(X2)].
The right side of the equation can be written as ∫ K(z1, z2)α(z2) dF (z2), for K(z1, Z2) the conditional expectation of the function in square brackets given Z2. Thus it is the image of α under the kernel operator with kernel K. If the equation were true for any α, then this kernel operator would work as the identity operator. However, on infinite-dimensional domains the identity operator is not given by a kernel. (Its kernel would be a “Dirac function on the diagonal”.)
Therefore, we have to be satisfied with an influence function that gives a partial representation only. In particular, a projection onto a finite-dimensional linear space possesses a kernel, and acts as the identity on this linear space. A "large" linear space gives representation in "many" directions. By reducing the expectation in (5.2) to an integral relative to the marginal distribution of Z2, we can use an orthogonal projection Πp: L2(g) → L2(g) onto a subspace L of L2(g). Writing also Πp for its kernel, and letting S2h denote the symmetrization (h(X1, X2) + h(X2, X1))/2 of a function h: 𝒳2 → ℝ, we define
χp(2)(X1, X2) = −2S2[(A1a(Z1) − 1) Πp(Z1, Z2) A2(Y2 − b(Z2))]. | (5.3) |
Lemma 5.1
For χp(2) defined by (5.3) with Πp the kernel of an orthogonal projection Πp: L2(g) → L2(g) onto a subspace L ⊂ L2(g), equation (5.2) is satisfied for every path t ↦ pt corresponding to directions (α, β, ϕ) such that α ∈ L and β ∈ L.
Proof
By definition E(A|Z) = (1/a)(Z) and E(Y|Z) = b(Z). Also var(Aa(Z)|Z) = a(Z) − 1 and var(Y|Z) = b(Z)(1 − b)(Z). By direct computation using these identities, we find that for the influence function (5.3) the right side of (5.2) reduces to
(Πpα)(z1)a1(y1 − b(z1)) − (Πpβ)(z1)(a1a(z1) − 1).
Thus (5.2) holds for every (α, β, ϕ) such that Πpα = α and Πpβ = β.
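The key fact, that the kernel of a weighted finite-rank projection reproduces every function in its range space, can be checked directly; below the projection is onto functions that are constant on each of two bins, with a discrete weight standing in for g (all concrete choices are illustrative):

```python
# Weighted projection onto piecewise constant functions: the kernel
# Pi(z1, z2) = sum_j 1{z1 in B_j} 1{z2 in B_j} / g(B_j) acts as the identity
# on functions that are constant on each bin B_j.
points = [i / 10.0 + 0.05 for i in range(10)]                  # grid on [0, 1]
g = {z: 0.05 + 0.01 * (z > 0.5) for z in points}               # discrete weight for L2(g)
bins = [(0.0, 0.5), (0.5, 1.0)]                                # two bins -> rank-2 projection

def in_bin(z, b):
    return b[0] <= z < b[1]

gB = [sum(g[z] for z in points if in_bin(z, b)) for b in bins]  # g-mass of each bin

def Pi(z1, z2):
    return sum(in_bin(z1, b) * in_bin(z2, b) / gB[j] for j, b in enumerate(bins))

def project(alpha):
    # (Pi alpha)(z1) = sum over z2 of Pi(z1, z2) alpha(z2) g(z2)
    return lambda z1: sum(Pi(z1, z2) * alpha(z2) * g[z2] for z2 in points)

alpha_in_L = lambda z: 3.0 if z < 0.5 else -1.0                # an element of the range L
for z in points:
    assert abs(project(alpha_in_L)(z) - alpha_in_L(z)) < 1e-12
```

Functions outside L are mapped to their bin-wise weighted averages, which is the source of the remainder terms I − Πp appearing in the bias below.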
Together with the first order influence function (4.3) the influence function (5.3) defines the (approximate) influence function χp = χp(1) + χp(2). For an initial estimator p̂n based on independent observations we now construct the estimator (3.1), i.e.
χ̂n = χ(p̂n) + ℙnχp̂n(1) + 𝕌nχp̂n(2). | (5.4) |
Unlike the first order influence function, the second order influence function does depend on the density f of the covariates, or rather on the function g = f/a (through the kernel Πp, which is defined relative to L2(g)), and hence the estimator (5.4) involves a preliminary estimator ĝ of g. As a consequence, the quality of the estimator χ̂n of the functional χ depends on the precision by which g (as part of the plug-in p̂n) can be estimated.
Let Ê and v̂ar denote the conditional expectation and variance given the observations used to construct p̂n, let ‖·‖r be the norm of Lr(g), and let ‖Π‖r denote the norm of an operator Π: Lr(g) → Lr(g).
Theorem 5.1
The estimator χ̂n given in (5.4) with influence functions χp(1) and χp(2) defined by (4.3) and (5.3), for Πp the kernel of an orthogonal projection in L2(g) onto a k-dimensional linear subspace, satisfies, for r ≥ 2 (with r/(r − 2) = ∞ if r = 2),
The two terms in the bias result from having to estimate p in the second order influence function (giving a "third order bias") and from using an approximate influence function (leaving the remainders I − Πp after projection), respectively. The terms 1/n and k/n2 in the variance appear as the variances of ℙnχp̂n(1) and 𝕌nχp̂n(2), the second being a degenerate second order U-statistic (giving 1/n2, see (13.1)) with a kernel of variance of order k.
The proof of the theorem is deferred to Section 10.1.
Assume now that the range space of the projections Πp can be chosen such that, for some constant C,
| (5.5) |
Furthermore, assume that there exist estimators â, b̂ and ĝ that achieve convergence rates n−α/(2α+d), n−β/(2β+d) and n−γ/(2γ+d), respectively, in Lr(g) and Lr/(r−2)(g), uniformly over these a-priori models and a model for g (e.g. for r = 3), and that the preceding displays also hold for â and b̂. These assumptions are satisfied if the unknown functions a and b are "regular" of orders α and β on a compact subset of ℝd (see e.g. [32]). Then the estimator χ̂n of Theorem 5.1 attains the square rate of convergence
| (5.6) |
We shall see in the next section that the first of the four terms in this maximum can be made smaller by choosing an estimating equation of order higher than 2, while the other three terms arise at any order. This motivates determining a "second order optimal" value of k by balancing the second, third and fourth terms. We would then use the second order estimator if γ is large enough that the first term is negligible relative to the other terms.
For (α+β)/2 ≥ d/4 we can choose k = n and the resulting rate (the square root of (5.6)) is n−1/2 provided that (5.1) holds. The latter condition is certainly satisfied under the sufficient condition (4.1) for the linear estimator to yield rate n−1/2.
More interestingly, for (α + β)/2 < d/4 we choose k ∼ n2d/(d+2α+2β) and obtain, provided that (5.1) holds, the rate n−(2α+2β)/(2α+2β+d).
This rate is slower than n−1/2, but better than the rate n−α/(2α+d)−β/(2β+d) obtained by the linear estimator. In [29] this rate is shown to be the fastest possible in the minimax sense, for the model in which a and b range over balls in and , and g is known.
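The choice k ∼ n2d/(d+2α+2β) can be recovered by balancing the squared approximation bias against the k/n2 variance term of Theorem 5.1; a sketch, assuming the bias term is of order (1/k)(α+β)/d as under (5.5):

```latex
\Bigl(\tfrac{1}{k}\Bigr)^{2(\alpha+\beta)/d} \;\asymp\; \frac{k}{n^2}
\quad\Longrightarrow\quad
k \;\asymp\; n^{2d/(d+2\alpha+2\beta)},
\qquad
\frac{k}{n^2} \;\asymp\; n^{-2(2\alpha+2\beta)/(2\alpha+2\beta+d)} .
```

Taking the square root of the last expression gives the rate n−(2α+2β)/(2α+2β+d) shown to be minimax in [29].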
In both cases the second order estimator is better than the linear estimator, but minimax only for sufficiently large γ. This motivates considering higher order estimators.
6. Approximate functional
Even though the functional of interest does not possess an exact second-order influence function, we might proceed to higher orders by differentiating the approximate second-order influence function given in (5.3), and balancing the various terms obtained. However, the formulas are much more transparent if we compute exact higher-order influence functions of an approximating functional instead. In this section we first define a suitable functional and next compute its influence functions.
Following the heuristics of Section 3.3, we define an approximate functional by equation (3.5), using a particular projection of the parameters. We choose this projection to map the parameters a and b onto finite-dimensional models and leave the parameter g unaltered: p is mapped into an element of the approximating model, or equivalently a triplet (a, b, g) into a triplet in the approximating model for the three parameters (where g is unaltered). (Even though this is not evident in the notation, the projection is joint in the three parameters: the induced maps and do not reduce to maps and , but and depend on the full triplet (a, b, g).)
As “model” for (a, b) we consider the product of two affine linear spaces
| (6.1) |
for a given finite-dimensional subspace L of L2(ν) and fixed functions that are bounded away from zero and infinity. (Later the functions and are taken equal to the preliminary estimators; one choice for the other functions is .) The pair of projections are defined as elements of the model (6.1) satisfying equation (3.7). In view of (4.4), for any path , for given l, l′ ∈ L,
| (6.2) |
Equation (3.7) requires that the derivative of this expression with respect to t at t = 0 vanishes. Thus the functions and must be chosen to satisfy the set of stationary equations, for every l, l′ ∈ L,
| (6.3) |
| (6.4) |
Because the functions and are required to be in L, the second way of writing these equations shows that the latter two functions are the orthogonal projections of the functions and onto L in .
As explained in Section 3.3, as it satisfies (3.7) the projection renders the first order influence function of the approximate functional equal to the first order influence function of χ evaluated at the projection. Furthermore, the difference between χ and is quadratic in the distance between and p (see (3.6)). The following theorem summarizes the preceding and verifies these properties in the present concrete situation.
Theorem 6.1
For given measurable functions with and bounded away from zero and infinity, define a map by letting and be the orthogonal projections of and in onto a closed subspace L. Let correspond to and define . Then has influence function
| (6.5) |
Furthermore, for ,
Proof
The formula for the influence function agrees with the combination of equations (3.8) and (4.3), and can also be verified directly. In view of (3.5) and (4.4),
We rewrite the right side as an integral relative to dν, and next apply the Cauchy-Schwarz inequality. Finally we note that , and similarly for b.
The approximation error can be rendered arbitrarily small by choosing the space L large enough. Of course, we choose L to be appropriate relative to a-priori assumptions on the functions a and b. If these functions are known to belong to Hölder classes, then L can for instance be chosen as the linear span of the first k basis elements of a suitable orthonormal wavelet basis of L2(ν).
To compute higher order influence functions of we recursively determine influence functions of influence functions, according to the algorithm [1]–[3] in Section 3.1, starting with the influence function of , for a fixed x1. We defer the details of this derivation to Section 10.2, and summarize the result in the following theorem.
To simplify notation, define
| (6.6) |
These are the generic variables; indexed versions are defined by adding an index to every variable in the equalities. With this notation and with the second order influence function (5.3) at can be written as the symmetrization of . This function was derived in an ad-hoc manner as an approximate or partial influence function of χ, but it is also the exact influence function of . The higher order influence functions of possess an equally attractive form.
Theorem 6.2
An mth order influence function evaluated at (X1,…, Xm) of the functional defined in Theorem 6.1 is the degenerate (in L2(p)) part of the variable
Here ∏i,j is the kernel of the orthogonal projection in onto L, evaluated at (Zi, Zj).
To obtain the degenerate part of the variable in the preceding lemma, we apply the general formula (2.1) together with Lemma 10.2. Assertions (i) and (ii) of the latter lemma show that the variable is already degenerate relative to X1 and Xm, while assertion (iii) shows that integrating out the variable Xi for 1 < i < m simply collapses the product Πi−1,iΠi,i+1 into Πi−1,i+1. For instance, with Sm denoting symmetrization of a function of m variables,
| (6.7) |
As shown on the left, but not on the right of the equations, these quantities depend on the unknown parameter p = (a, b, g). In the right sides, the variables and depend on p through and , and hence are not observables. Furthermore, the kernels Πi,j depend on g as they are orthogonal projections in .
7. Parametric rate ((α + β)/2 ≥ d/4)
In this section we show that the parameter χ(p) is estimable at the parametric rate n−1/2 provided the average smoothness (α + β)/2 is at least d/4. We achieve this using the estimator
| (7.1) |
with the influence functions those of the approximate functional in Section 6: they are given in Theorems 6.1 and 6.2 for j = 1 and j = 2, …, m, respectively. (Because the map maps into itself, the influence function for j = 1 in the display is also the first order influence function (6.5) of χ, when evaluated at .)
We assume that the projections Πp and map to , for every s ∈ [r/(r − 1), r], with uniformly bounded norms. (For r = 2 this entails only s = 2; in this case we define r/(r − 2) = ∞.)
Theorem 7.1
The estimator (7.1), with Πp a kernel of an orthogonal projection in satisfying (12.1) with supx Πp(x, x) ≲ k, satisfies, for a constant c that depends on only, and r ≥ 2,
The first term in the bias is of the order 1 + 1 + (m − 1) = m + 1, as to be expected for an estimator based on an mth order influence function; the second term is due to estimating rather than χ; it is independent of m, and the same as in Theorem 5.1 if . The bound on the variance can roughly be understood in that each of the degenerate U-statistics in (7.1) contributes a term of order kj−1/nj.
For α-, β- and γ-regular parameters a, b, g on a d-dimensional domain the range space of the projections Πp can be chosen so that (5.5) holds and such that there exist estimators of a, b, g, with the first two taking values in this range space, with convergence rates n−α/(2α+d), n−β/(2β+d) and n−γ/(2γ+d). Then the second term in the bias (with ) is of order (1/k)α/d+β/d. If (α + β)/2 ≥ d/4 and we choose k = n, then this is of order n−(α+β)/d ≤ n−1/2. For k = n the standard deviation of the resulting estimator is also of the order n−1/2, while the first term in the bias can be made arbitrarily small by choosing a sufficiently large order m. Specifically, the estimator attains an n−1/2 rate of convergence as soon as
| (7.2) |
For any γ > 0 there exists an order m that satisfies this, and hence the parameter is estimable at the rate n−1/2 as soon as (α + β)/2 ≥ d/4.
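The role of the condition (α + β)/2 ≥ d/4 is elementary arithmetic on the second bias term with k = n; a sketch:

```latex
\Bigl(\tfrac{1}{k}\Bigr)^{\alpha/d + \beta/d}\Bigr|_{k=n}
  \;=\; n^{-(\alpha+\beta)/d} \;\le\; n^{-1/2}
\quad\Longleftrightarrow\quad
\frac{\alpha+\beta}{2} \;\ge\; \frac{d}{4},
```

so in the regime of this section this bias term never dominates the n−1/2 standard deviation.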
More ambitiously, we may aim at attaining the parametric rate for every γ > 0, without a-priori knowledge of γ. This can be achieved if (α + β)/2 > d/4 by using orders m = mn that increase to infinity with the sample size. In this case the estimator can also be shown to be asymptotically efficient in the semiparametric sense.
Theorem 7.2
If (α + β)/2 > d/4, then the estimator (7.1), with m = log n and Πp a kernel of an orthogonal projection in on a k = n/(log n)2-dimensional space satisfying (5.5) and (12.1) with supx Πp(x, x) ≲ k, based on preliminary estimators that attain rates (n/log n)−δ/(2δ+d) relative to the uniform norm, satisfies
An estimator that is asymptotically linear in the first order efficient influence function, as in the theorem, is asymptotically optimal in terms of the local asymptotic minimax and convolution theorems (see e.g. [40], Chapter 25). The present estimator actually loses its efficiency by splitting the sample into a part used to construct the preliminary estimators and a part used to form . This can easily be remedied by crossing over the two parts of the split and taking the average of the two estimators so obtained. By the theorem these are both asymptotically linear in their own samples, and hence their average is asymptotically linear in the full sample and asymptotically efficient.
The proofs of the theorems are deferred to Section 10.3.
8. Minimax rate at lower smoothness ((α + β)/2 < d/4)
If the average a-priori smoothness (α + β)/2 of the functions a and b falls below d/4, then the functional χ cannot be estimated any more at the parametric rate ([29]). The estimator (7.1) of Theorem 7.1 can still be used and, with its bias and variance as given in the theorem properly balanced, attains a certain rate of convergence, faster than the current state-of-the-art linear estimators. However, in this section we present an estimator that is always better, and attains the minimax rate of convergence n−(2α+2β)/(2α+2β+d) provided that the parameter g is sufficiently regular.
This estimator takes the same general form
| (8.1) |
as the estimator (7.1), but the influence functions for j ≥ 3 will be different. The idea is to “cut out” certain terms from the influence functions in (7.1) in order to decrease the variance, but without increasing the bias. For clarity we first consider the third order estimator, and next extend to the general mth order. To attain the minimax rate the order m must be fixed to a large enough value so that the first term in the bias given in Theorem 7.1 is no larger than n−(2α+2β)/(2α+2β+d). (Apart from added complexity there is no loss in choosing m larger than needed.)
The third order kernel in (6.7) is the symmetrization of the variable
Here Πp is the kernel of an orthogonal projection in onto a k-dimensional linear space, which we may view as the sum of k projections on one-dimensional spaces. The quantity k2 in the order O(k2/n3) of the variance in Theorem 7.1 for m = 3 arises as the number of terms in the product of the two k-dimensional projection kernels. It turns out that this order can be reduced without increasing the bias by cutting out “products of projections on higher base elements”.
To make this precise, we partition the projection space in blocks, and decompose the two projections in the influence function over the blocks:
| (8.2) |
Here is the projection on the subspace spanned by base elements with index in the interval (m, n], and 1 = k−1 < k0 < k1 < ⋯ < kR = k and 1 = l−1 < l0 < l1 < ⋯ < lS = k are suitable partitions of the set {1, …, k}. (“Full” partitions in singleton sets would make the construction conceptually simpler, but a small number of blocks will be needed in our proofs.) The product of the two kernels now becomes a double sum, from which we retain only the terms with small values of (r, s). The improved third order influence function is, with S3 denoting symmetrization as before,
| (8.3) |
The negative term in the display is the conditional expectation given Z1, Z3 of the leading term, and maintains the degeneracy of the kernel. For the decomposition (8.2) to be valid, the subspaces corresponding to the blocks must be orthogonal in . We may achieve this by starting with a standard basis e1, e2, …, with good approximation properties for a target model, and next replacing this by an orthonormal basis in by the Gram-Schmidt procedure. For a bounded g the approximation properties will be preserved.
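The Gram-Schmidt step just described can be sketched numerically; a minimal illustration, assuming a discretized inner product ⟨u, v⟩ = ∫ u v g dν approximated by a Riemann sum on a grid (the monomial basis, the weight g, and the grid are hypothetical stand-ins, not choices made in the paper):

```python
import numpy as np

def weighted_gram_schmidt(basis, g, dx):
    """Orthonormalize the rows of `basis` with respect to the discretized
    weighted inner product <u, v> = sum(u * v * g) * dx, a Riemann-sum
    approximation of the L2(g) inner product."""
    ortho = []
    for f in basis:
        v = f.astype(float).copy()
        for e in ortho:
            v -= np.sum(v * e * g) * dx * e   # remove the component along e
        norm = np.sqrt(np.sum(v * v * g) * dx)
        ortho.append(v / norm)
    return np.array(ortho)

# Example: monomials on [0, 1] and a weight bounded away from 0 and infinity.
x = np.linspace(0, 1, 2001)
dx = x[1] - x[0]
g = 1.0 + 0.5 * np.sin(2 * np.pi * x)
basis = np.array([x**j for j in range(4)])
E = weighted_gram_schmidt(basis, g, dx)
G = (E * g) @ E.T * dx   # discrete Gram matrix; should be ~ the identity
```

Because the orthonormalization and the Gram matrix use the same discrete inner product, G equals the identity up to floating-point roundoff; approximation properties of the original basis are inherited by the span, as noted above.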
The grids are defined by
| (8.4) |
| (8.5) |
where R and S are chosen such that kR ∼ lS ∼ k (note that k0 = l0 = n). In these definitions the notation ~ means “equal up to a fixed multiple” (needed to allow that kr and ls are (dyadic) integers). For ease of notation let ls = l−1 for s ≤ −1, and ls = lS for s ≥ S.
The grids k0 < k1 < ⋯ < kR and l0 < l1 < ⋯ < lS partition the integers n, n + 1, …, k into R and S groups. As , for every r, s ≥ 0, the cut-off r + s ≤ D in (8.3) is delimited by the “hyperbola” iαjβ ∼ 2Dnα+β in the space of indices (i, j) ∈ {1, …, k}2 of the base elements used in the two kernels, with only the pairs below the hyperbola retained (see Figure 1). The intuition behind this hyperbolic cut-off is the product form of the bias (4.4): a higher order correction on the estimator of a may combine with a lower order correction on b, and vice versa, to give an overall correction of the desired order. The overall bias is smaller if the cut-off D is chosen larger, but then more terms are included in the estimator and the variance is bigger.
Fig 1.

Both axes carry the indices of the basis functions spanning the projection space L, and a point in the plane refers to a product of two projections. Products of projections on pairs of basis functions in the shaded area are included in the third order influence function. The step function refers to the partitions of the indices as in (8.2).
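The retention region of Figure 1 can also be sketched numerically; a small illustration (all parameter values below are hypothetical stand-ins) marking the index pairs below the hyperbola iαjβ ≤ 2Dnα+β:

```python
import numpy as np

# Hypothetical values of the smoothness indices, cut-off and sample sizes,
# chosen for illustration only.
alpha, beta, D = 1.0, 2.0, 4
n, k = 100, 1000

i = np.arange(1, k + 1, dtype=float)
logi = np.log(i)
# Work on the log scale: a pair (i, j) is retained when
#   alpha * log i + beta * log j <= D * log 2 + (alpha + beta) * log n.
below = np.add.outer(alpha * logi, beta * logi) \
        <= D * np.log(2) + (alpha + beta) * np.log(n)
frac_retained = below.mean()   # fraction of the k^2 index pairs kept
```

All pairs with both indices at most n lie below the hyperbola, while pairs of high-index basis functions are cut out, which is exactly the variance reduction described above.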
Before deriving an optimal value of D, we introduce the mth order estimator for general m ≥ 3. Again we take the estimator of Theorem 7.1 as our starting point, but we modify the higher order influence functions , for j = 4, …, m, similarly to, and in addition to, the modification of the third order influence function. For given j the former influence function is given in Theorem 6.2 (with the m of the theorem taken equal to j), and is based on a product of j − 1 projection kernels. We modify this in two steps. For each of the j − 2 contiguous pairs of kernels ((1st, 2nd), (2nd, 3rd), …, ((j − 2)th, (j − 1)th)) we form a new kernel by truncating the pair at the hyperbola as described previously for the third order kernel, and truncating all other kernels at n. Next the modified jth order kernel is the sum of the resulting j − 2 kernels. More formally, the modified jth order kernel is equal to
| (8.6) |
where is the symmetrized, degenerate (relative to L2(p)) part of the variable, for i = 1, …, j − 2, written in the notation of Theorem 7.1,
For j = 3 there is only one pair of kernels, and the construction reduces to the modification (8.3) as discussed previously.
We assume that the projections and map to , for every s ∈ [r/(r − 1), r], with uniformly bounded norms.
Theorem 8.1
The estimator (8.1) for m ≥ 3 with the influence functions and given in (6.5) and (6.7) for j = 1, 2, respectively, and in (8.6) for j ≥ 3, and with kernels of orthogonal projections in satisfying (12.1) with , satisfies, for r ≥ 2 (and r/(r − 2) = ∞ if r = 2),
A proof of the theorem is presented in Sections 10.4 and 10.5.
The first two terms in the bias are the same as in Theorem 7.1; the third and fourth terms are the price paid for cutting out terms from the influence function. The benefit is a reduced variance. We shall show that the boundary parameter D can be chosen such that the third term in the variance (resulting from the third and higher order parts of the influence function) is not bigger than the second term, while the increase in bias is negligible.
Assume that the functions a and b and their estimates are known to belong to models that are well approximated by the base functions e1, e2, … in the sense that, for , and every value l in one of the two grids (8.4)–(8.5),
| (8.7) |
| (8.8) |
Then the second term in the bias is of the order (1/k)α/d+β/d, as in Theorem 7.1, which is smaller than the minimax rate n−(2α+2β)/(2α+2β+d) for
| (8.9) |
With this choice of k, the upper bound on the variance is of the square minimax rate n−(4α+4β)/(2α+2β+d) if D is chosen to satisfy
| (8.10) |
Furthermore, under (8.9) the numbers R, S of grid points are of the order log n.
In the third term of the bias we apply assumptions (8.7)–(8.8) and the identity , which results from (8.4)–(8.5), to see that this term is of order
If the convergence rate of is n−γ/(2γ+d), then, for the choice of D given in (8.10), this can (by a calculation) be seen to be of smaller order than the minimax rate n−(2α+2β)/(2α+2β+d) if γ is large enough that
| (8.11) |
The fourth term in the bias can, by a similar analysis, be seen to be of the order
Again this is smaller than the minimax rate if γ satisfies assumption (8.11).
Finally, if the convergence rates of and are n−α/(2α+d) and n−β/(2β+d), then the first term in the upper bound of the bias is of the order
We choose m large enough so that this is of smaller order than the preceding terms. In particular, we can choose it so that this is smaller than the minimax rate.
We summarize this in the following corollary, which is the most advanced result of the paper.
Corollary 8.1
If (8.7)–(8.11) hold, and are kernels of orthogonal projections in satisfying (12.1) with , then the mth order estimator with the kernels (8.6) for j ≥ 3 and sufficiently large m and suitable initial estimators, attains the rate n−(2α+2β)/(2α+2β+d) for estimating χ(p).
9. Other examples
In this section we briefly indicate a number of other examples for which our general heuristics have been worked out, leading to well known or novel estimators.
9.1. Density estimation
Consider estimating a density χ(p) = p(a) at the fixed point a based on a random sample from p. A first order influence function of this functional would satisfy, for every smooth path t ↦ pt with score function g at t = 0,
In a nonparametric situation every zero-mean function g arises as a score function, and hence would have to be a “Dirac function at a”. Because such a function does not exist (except for very special p), in this example already the first order influence function fails to exist.
We may approximate the Dirac function by the function x ↦ Π(a, x) for Π the kernel of an orthogonal projection onto a given (large) subspace L of L2(μ). Because ∫ Π(a, x)g(x)p(x) dμ(x) = g(a)p(a) for every function g such that gp ∈ L, the function x ↦ Π(a, x) achieves representation for a large set of scores. The corresponding degenerate version is x ↦ Π(a, x) − Πp(a), for Πp = ∫ Π(·, x)p(x) dμ(x) the projection of p. The corresponding first order estimator (3.1) is
If , then the second term vanishes and the estimator reduces to . This is the usual projection estimator (cf. [25, 32]): if L is spanned by the orthonormal set e1, e2, …, ek, then and .
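The projection estimator can be illustrated with a concrete orthonormal basis; a minimal numerical sketch using the cosine basis of L2[0, 1] (the basis, sample size and evaluation point are illustrative choices, not taken from the paper):

```python
import numpy as np

def cosine_basis(x, k):
    """First k elements of the orthonormal cosine basis of L2[0, 1]."""
    cols = [np.ones_like(x)] + \
           [np.sqrt(2) * np.cos(j * np.pi * x) for j in range(1, k)]
    return np.stack(cols)                       # shape (k, len(x))

def projection_estimator(a, sample, k):
    """hat p(a) = (1/n) sum_i Pi(a, X_i), for Pi the rank-k projection
    kernel Pi(a, x) = sum_j e_j(a) e_j(x)."""
    ea = cosine_basis(np.atleast_1d(a), k)      # basis evaluated at the point a
    ex = cosine_basis(sample, k)                # basis evaluated at the data
    return (ea.T @ ex).mean(axis=1)             # average the kernel over the sample

rng = np.random.default_rng(0)
sample = rng.uniform(size=5000)                 # true density: p = 1 on [0, 1]
est = projection_estimator(0.5, sample, k=10)[0]
```

For uniform data the estimate fluctuates around the true value p(0.5) = 1; increasing k reduces the approximation bias at the cost of variance, in line with the trade-off discussed in the earlier sections.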
Alternative to viewing x ↦ Π(a, x) as an approximation to the “ideal” influence function, we can derive it as the exact influence function of the approximate functional .
9.2. Quadratic functionals
Consider estimating the functional χ(p) = ∫ p2 dμ based on a random sample of size n from the density p.
The first order influence function of this functional exists on the full non-parametric model, and can be seen to take the form
By the algorithm [1]–[3] of Section 3.1 a second order influence function can be computed as the degenerate part of an influence function of the functional , for fixed x1. As seen in Section 9.1, point evaluation is not a differentiable functional, but has the kernel Π of an orthogonal projection in L2(μ) as an approximate influence function. Thus an approximate second order influence function of the present functional, minus its projection onto the degenerate functions, is given by
This may also be derived as an exact influence function of the approximate functional .
It can be checked that the estimator (7.1) for m = 2, given an initial estimator that is contained in the range of Π, reduces to , which is a well known estimator ([17]).
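A U-statistic of this type is easy to implement; a minimal numerical sketch with the cosine basis on [0, 1] (the basis and the sample are illustrative assumptions, not choices made in the paper):

```python
import numpy as np

def quadratic_functional_estimate(sample, k):
    """U-statistic estimate of int p^2 dmu based on a rank-k cosine-basis
    projection kernel: (n(n-1))^{-1} sum_{i != j} Pi(X_i, X_j)."""
    n = len(sample)
    cols = [np.ones_like(sample)] + \
           [np.sqrt(2) * np.cos(j * np.pi * sample) for j in range(1, k)]
    E = np.stack(cols)                   # (k, n) basis evaluations
    s = E.sum(axis=1)                    # sum_i e_l(X_i), for each l
    # sum_{i != j} Pi(X_i, X_j) = sum_l [ (sum_i e_l(X_i))^2 - sum_i e_l(X_i)^2 ]
    total = np.sum(s ** 2) - np.sum(E ** 2)
    return total / (n * (n - 1))

rng = np.random.default_rng(1)
sample = rng.uniform(size=4000)          # true density p = 1, so int p^2 = 1
est = quadratic_functional_estimate(sample, k=8)
```

Excluding the diagonal terms i = j removes the bias of the naive plug-in, which is the role of the degenerate second order U-statistic in the general construction.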
9.3. Doubly robust models
The heuristics described in Section 3 ought to be applicable in a wide range of estimation problems, but the detailed treatment of the missing data problem in Sections 4–8 shows that their implementation can be involved. Inspection of the proofs reveals that the particular implementation in the latter sections is based on the structure (4.3) of the first order influence function in the missing data problem. The argument extends to semiparametric models with first order influence function of the form
| (9.1) |
for known functions Si(x) of the data (i.e. S = (S1, S2, S3, S4) is a given statistic). The full parameter may be a quadruplet p ↔ (a, b, c, f), in which f is the marginal density of an observable covariate Z, and c does not appear in (9.1). Other examples of this structure are described in [27, 37].
10. Proofs
10.1. Proof of Theorem 5.1
Write and Π for and Πp, respectively, for both the kernels and the corresponding projection operators, and drop p also in and . From (4.4) and (5.3) we have
The double integral on the far right with replaced by Π can be written as the single integral , for the image of under the projection Π. Added to the first integral on the right this gives , which is bounded in absolute value by the second term in the upper bound for the bias.
Replacement of by Π in the double integral gives a difference
by Hölder’s inequality, for a conjugate pair (r, s). Considering as the projection in with weight 1, and Π as the weighted projection in with weight function , we can apply Lemma 12.7(i) (with q = s/r and rp = s/(s − 2)) to see that this is bounded in absolute value by
Because is assumed bounded away from 0 and infinity, this is of the same order as the first term in the upper bound on the bias (if r replaces s).
Because the function is uniformly bounded, the (conditional) variance of is of the order O(1/n). Thus for the variance bound it suffices to consider the (conditional) variance of . In view of Lemma 13.1 and (13.1) this is bounded above by a multiple of
The variables and are uniformly bounded. Hence the last term on the right is bounded above by a multiple of , which is equal to k/n2, by Lemma 12.3.
10.2. Proof of Theorem 6.2
We compute the higher order influence functions of the approximate functional using the algorithm [1]–[3] in Section 3. This starts by computing the second order influence function as the derivative of , for fixed x1. Because the latter functional (given in (6.5)) depends on the parameters only through and , the following lemma does the main part of the work.
Lemma 10.1
For fixed z1 influence functions of and are given by
where Πp is the kernel of the orthogonal projection in onto L.
Proof
We can write the equation (6.3) determining as , for every l ∈ L. Insert a sufficiently regular path pt, given by parameters (at, bt, ft), and differentiate the equality relative to t at t = 0 to find, with γ a score function of the path
Using the fact that E(A|Z) = 1/a(Z), where a is bounded away from zero, we can also write this as
Because the function is contained in L for every t by construction, the function is also contained in L. Combined with the validity of the preceding display for every l ∈ L, we conclude that is the weighted projection of in L2(P) onto the space {l(Z): l ∈ L} relative to the weight . The projection can be represented in terms of a kernel operator (cf. Lemma 12.1). If denotes the kernel, then
This represents the derivative on the left as an inner product of the score function γ with the function on the right of the first equation of the lemma (evaluated at X2). Thus the first assertion of the lemma is proved.
The second assertion is proved similarly. Using that E(Y|Z) = b(Z) and E(A|Z) = 1/a(Z), we start by writing the equation (6.4) defining as , for every l ∈ L. By the same arguments as before we conclude that is the weighted projection of in L2(P) onto the space {l(Z): l ∈ L}, relative to the weight (ab/a)(Z). ■
The first order influence function (6.5) depends on p only through and and hence the chain rule and the preceding lemma imply that a second order influence function of is given by the degenerate part of
| (10.1) |
(Note that this function is symmetric in (X1, X2); Πp is symmetric, because it is an orthogonal projection kernel.) Actually, this function is already degenerate and hence is the second order influence function of .
Lemma 10.2
For any fixed z1 and z3,
.
.
.
Proof
Because and are the weighted projections in L2(P) of and , respectively, onto {l(Z):l ∈ L} relative to the weights (ab/a)(Z),
| (10.2) |
| (10.3) |
These two assertions imply (i) and (ii). The third assertion follows from the fact that Πp is the kernel of the weighted projection in L2(P) onto L relative to the weight (ab/a)(Z). ■
The second order influence function (10.1) depends on p through and and through the kernel Πp. We proceed to higher orders by differentiating the influence function relative to these components, and applying the chain rule, where we use the influence functions of and as given previously in Lemma 10.1, and the influence function of as given in Lemma 12.8.
Proof of Theorem 6.2
Denote the symmetrization of the variable in the theorem by . Then is the function given by (10.1), which was seen to be a second order influence function in the preceding discussion. We show by induction on m that is an influence function of . The theorem is then a corollary of Lemma 11.2.
By Lemmas 10.1 and 12.8,
The influence function of is
The influence function of is .
The influence function of is zero.
The influence function of is .
Applying this repeatedly readily gives an expression for the influence function of
. The symmetrization of this expression is the same expression, but then with m replaced by m + 1 and an added minus sign. ■
10.3. Proof of Theorems 7.1 and 7.2
Let  and Ŷ be and as in (6.6) with a and b in their definitions replaced by â and . Because â and are projected onto themselves under the map (see Theorem 6.1), we actually obtain the same variables by replacing and by â and , respectively: and . Furthermore, let Π and denote the operators Πp and , respectively, and Πi,j and their kernels evaluated at (Zi, Zj).
By explicit calculations,
| (10.4) |
for defined by
The variable is bounded by the second term in the expression for in the statement of the theorem. We next show by induction on m that
| (10.5) |
The analysis of the bias can then be concluded by showing that the right side of (10.5) is of the order as the first term given in the theorem.
Equation (10.4) and the definition of readily show that identity (10.5) is true for m = 2. We proceed to general m by induction. Relative to its value for m, the left side receives for m + 1 the extra term , which is equal to (−1)m times minus a sum of terms resulting from projections of this leading term. This extra term without the factor (−1)m (but including the projections) can be written (cf. (6.7) and (2.1))
| (10.6) |
To prove the induction hypothesis for m + 1 it suffices to show that this is equal to
| (10.7) |
To achieve this we expand the two terms of the preceding display into sums of expressions of the form, with each equal to or Πj,j+1 and l the number of j for which the first alternative is true,
| (10.8) |
and of the same form with m + 1 replacing m for the second term of (10.7). As the notation suggests, the expression in (10.8) depends on l (and on m, but this is fixed), but not on which K are equal to or Π. To see this we use that Π is a projection onto L in L2(abg), so that ∫ Π1,2γ(z2)(abg)(z2) dν(z2) = γ(z1) for every γ ∈ L; and is also a projection onto L, so that as a function of one argument is contained in L. This observation yields the identities, for K equal to or Π,
This allows to reduce (10.8) to
Thus after expanding the two terms of (10.7) in the quantities Bl, and simplifying these quantities, we can write their sum (10.7) as
The difference of the binomial coefficients is . The expression is equal to (10.6), as claimed. This completes the proof of (10.5).
Next we bound the right side of (10.5), by taking the expectation in turn with respect to Xm, Xm−1, …, X1. For Mŵ multiplication by the function ŵ = g/ĝ,
Next, for any function h and i = m − 1, m − 2, …, 2,
Combining these equations, we can write the right side of (10.5) in the form
We bound this by first applying Hölder’s inequality, with conjugate pair (τ, t) with τ equal to r as in the statement of the theorem, and next Lemma 12.7(iii), with and Π viewed as weighted orthogonal projections in L2(abĝ) with weights 1 and ŵ, respectively, and r = τ(m−1)/(m+τ−3), p = (m+τ−3)/(τ−2) and q = (m+τ−3)/(m−1), so that rp = (m−1)τ/(τ−2) and rq = τ (and m of the lemma taken equal to the present m minus 1).
To bound the (conditional) variance of we use Lemma 13.1 to see that
because is degenerate under . The variable is the symmetrization of the projection of onto the degenerate variables. Because the second moment of a mean of (arbitrary) random variables is bounded above by the maximum of the second moments of the terms, we can ignore the symmetrization, while the projection decreases the second moment. This shows that
by Lemma 12.4 and the assumption that the kernels are bounded by k on the diagonal.
We complete the proof of Theorem 7.1 by bounding the square of by . The extra factor 2j can be incorporated in the constant c in the theorem.
For the proof of Theorem 7.2 it clearly suffices to show that
Because an influence function is centered at mean zero, the first is simply √n times the bias of . By Theorem 7.1 the bias is of the order
The first term is trivially o(n−1/2), as mn → ∞. In the second we write (α + β)/d = r/2, where r > 1 by assumption, and see that it is o(n−1/2), since kn−1/r → ∞.
To handle the variance we split the estimator in its linear and higher order terms. The sum of the variances of the U-statistics of orders 2 to m in is bounded by the sum of the terms j ≥ 2 in Theorem 7.1, i.e.
by the inequalities , for j < n/2, and , by Stirling’s approximation with bound. The expression in brackets is bounded by 2ckm/n ≲ 1/log n, for m ~ log n and k ~ n/(log n)2. Thus the sum tends to zero by dominated convergence. Finally the linear term in gives the contribution
From the explicit expression (4.3) for the first order influence function (or (6.5) in the case of , which gives an identical function), this is seen to tend to zero by the dominated convergence theorem.
10.4. Proof of Theorem 8.1 for m = 3
The theorem asserts that the bias of the estimator is equal to the sum of four terms, the first two of which also arise in the bias of the estimator considered in Theorem 7.1. Therefore, we can prove the assertion on the bias by showing that the expected values of the current estimator (for m = 3) and the estimator in Theorem 7.1 differ by less than the additional bias terms in Theorem 8.1.
The two estimators differ only in their third order influence functions, where the present estimator retains only the terms in the double sum (8.3) with r = 0, s = 0, or r + s ≤ D. Thus the difference of the expectations of the two estimators is equal to
The expectation Êp refers to the variable (X1, X2, X3) for fixed values of the preliminary samples, which are indicated in the “hat” symbols on Â1, Ŷ3 and the kernels, and hence is an integral relative to the density (x1, x2, x3) ↦ p(x1)p(x2)p(x3). If we replace p(x2) in this density by , then the integral will be zero, as the kernel is degenerate under . Thus we may integrate against . In that case the projection term integrates to zero, as it does not depend on X2 and , and hence can be dropped. Next we condition and on Z1, Z2, Z3 and write the preceding display in the form
for ρ and the measures defined by dρ = abg dv and . The double sum can be rewritten as the sum over r running from 1 to R and over s from D − r + 1 to S, which gives the equivalent representation, with the × referring to “tensor products” as explained in Section 2,
We write , and next arrive at the difference of two expressions of the type, with and , respectively,
If the measure of integration were (with instead of ρ), then we could perform the integrals on z1 and z3 and next apply Hölder’s inequality to bound the resulting expression in absolute value by
where the norms are those of , which are equivalent to those of L2(ν), by assumption. We can write and use the assumed boundedness of as an operator on to bound this by the third term in the bias.
Replacing by can be achieved by writing the first and last occurrence of ρ as and expanding the resulting expression on the + signs into four terms. One of these has the measure . The other three terms have two or three occurrences of , and can be bounded by the first term in the bias (with m = 3). This is argued precisely under (10.12) below.
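The degeneracy argument above (the integral vanishes once p(x2) is replaced by the centering density, because the kernel has mean zero in that argument) can be illustrated with a toy kernel; the kernel and distributions below are hypothetical, not those of (8.3):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy degenerate kernel: h(x1, x2) = f(x1) * (g(x2) - E g(X2)) integrates to
# zero over x2 under the centering distribution, so the full expectation
# vanishes regardless of the law of X1.
def h(x1, x2, g_mean):
    return np.sin(x1) * (x2 ** 2 - g_mean)

x2 = rng.uniform(0, 1, 500_000)   # centering density: Uniform(0, 1)
g_mean = 1 / 3                    # E[X2^2] under Uniform(0, 1)
x1 = rng.normal(size=500_000)     # arbitrary law for X1

est = h(x1, x2, g_mean).mean()    # Monte Carlo expectation
assert abs(est) < 5e-3            # degenerate kernel: expectation is ~ 0
```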
Because the first and second order influence functions are equal to those of the estimator considered in Theorem 7.1, the (conditional) variances of for j = 1, 2 can be seen to be of the orders O(1/n) and O(k/n^2), respectively, by the same proof. By Lemma 13.1 the variance for j = 3 is bounded by (see (13.1))
where . After bounding out and , we write the squared sum as a double sum. From the fact that the projections are orthogonal for different r, it follows that the off-diagonal terms of the double sum vanish (the expectation with respect to X1 is zero). Thus the preceding display is bounded above by a multiple of
By Lemmas 12.4 and 12.3 and the assumption that this is bounded by a multiple of
By (8.4) for r ≥ 1. On substituting this in the display, and noting that l_{D−r} = 0 if r > D, we see that this is bounded by k/n^2 + 2^{D/α ∨ D/β}/n if α ≠ β and bounded by k/n^2 + D 2^{D/α}/n if α = β.
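The orthogonality used above — projections onto distinct resolution levels are orthogonal, so the off-diagonal terms of the double sum vanish — can be seen in a small Haar computation; the basis and the two levels chosen are illustrative:

```python
import numpy as np

# Haar detail functions psi_{level,j}(x) = 2^(level/2) psi(2^level x - j)
# on [0, 1); inner products across distinct levels vanish.
def haar_details(level, grid):
    funcs = []
    for j in range(2 ** level):
        f = np.where(
            (grid >= j * 2.0 ** -level) & (grid < (j + 0.5) * 2.0 ** -level), 1.0,
            np.where(
                (grid >= (j + 0.5) * 2.0 ** -level) & (grid < (j + 1) * 2.0 ** -level),
                -1.0, 0.0,
            ),
        )
        funcs.append(2 ** (level / 2) * f)
    return np.array(funcs)

N = 2 ** 12
grid = (np.arange(N) + 0.5) / N  # midpoint grid on [0, 1)
Psi_r = haar_details(2, grid)    # level r = 2: 4 detail functions
Psi_s = haar_details(4, grid)    # level s = 4: 16 detail functions

# Discrete inner products approximate the L2 integrals; cross-level
# inner products are exactly zero, so the Gram matrix vanishes.
gram = Psi_r @ Psi_s.T / N
assert np.abs(gram).max() < 1e-10
```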
Acknowledgments
The research leading to these results has received funding from the European Research Council under ERC Grant Agreement 320637.
SUPPLEMENTARY MATERIAL
Supplement: Higher order estimating equations
(doi: COMPLETED BY THE TYPESETTER;.pdf). The remainder of the paper is given in the supplement.
References
- 1. Bickel PJ. On adaptive estimation. Ann Statist. 1982;10(3):647–671. MR663424.
- 2. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins Series in the Mathematical Sciences. Johns Hopkins University Press; Baltimore, MD: 1993. MR1245941.
- 3. Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā Ser A. 1988;50(3):381–393. MR1065550.
- 4. Birgé L, Massart P. Estimation of integral functionals of a density. Ann Statist. 1995;23(1):11–29. MR1331653.
- 5. Bolthausen E, Perkins E, van der Vaart A. Lectures on probability theory and statistics. Bernard P, editor. Lecture Notes in Mathematics, Vol. 1781. Springer-Verlag; Berlin: 2002. Lectures from the 29th Summer School on Probability Theory held in Saint-Flour, July 8–24, 1999. MR1915443.
- 6. Cai TT, Low MG. Nonquadratic estimators of a quadratic functional. Ann Statist. 2005;33(6):2930–2956. doi:10.1214/009053605000000147. MR2253108.
- 7. Cai TT, Low MG. Optimal adaptive estimation of a quadratic functional. Ann Statist. 2006;34(5):2298–2325. doi:10.1214/009053606000000849. MR2291501.
- 8. Cohen A, Dahmen W, Daubechies I, DeVore R. Tree approximation and optimal encoding. Appl Comput Harmon Anal. 2001;11(2):192–226. doi:10.1006/acha.2001.0336. MR1848303.
- 9. Daubechies I. Ten lectures on wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 61. SIAM; Philadelphia, PA: 1992. MR1162107.
- 10. Donoho DL, Nussbaum M. Minimax quadratic estimation of a quadratic functional. J Complexity. 1990;6(3):290–323. doi:10.1016/0885-064X(90)90025-9. MR1081043.
- 11. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust statistics: the approach based on influence functions. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons; New York: 1986. MR829458.
- 12. Härdle W, Kerkyacharian G, Picard D, Tsybakov A. Wavelets, approximation, and statistical applications. Lecture Notes in Statistics, Vol. 129. Springer-Verlag; New York: 1998. MR1618204.
- 13. Has’minskiĭ RZ, Ibragimov IA. On the nonparametric estimation of functionals. In: Proceedings of the Second Prague Symposium on Asymptotic Statistics (Hradec Králové, 1978). North-Holland; Amsterdam–New York: 1979. pp. 41–51. MR571174.
- 14. Huber PJ, Ronchetti EM. Robust statistics. Second edition. Wiley Series in Probability and Statistics. John Wiley & Sons; Hoboken, NJ: 2009. MR2488795.
- 15. Kerkyacharian G, Picard D. Estimating nonquadratic functionals of a density using Haar wavelets. Ann Statist. 1996;24(2):485–507. MR1394973.
- 16. Koševnik JA, Levit BJ. On a nonparametric analogue of the information matrix. Teor Verojatnost i Primenen. 1976;21(4):759–774. MR0428578.
- 17. Laurent B. Estimation of integral functionals of a density and its derivatives. Bernoulli. 1997;3(2):181–211. MR1466306.
- 18. Laurent B, Massart P. Adaptive estimation of a quadratic functional by model selection. Ann Statist. 2000;28(5):1302–1338. MR1805785.
- 19. Lindsay BG. Efficiency of the conditional score in a mixture setting. Ann Statist. 1983;11(2):486–497. MR696061.
- 20. Murphy SA, van der Vaart AW. On profile likelihood. J Amer Statist Assoc. 2000;95(450):449–485. With comments and a rejoinder by the authors. MR1803168.
- 21. Nemirovski A. Topics in non-parametric statistics. In: Lectures on probability theory and statistics (Saint-Flour, 1998). Lecture Notes in Mathematics, Vol. 1738. Springer; Berlin: 2000. pp. 85–277. MR1775640.
- 22. Pfanzagl J. Contributions to a general asymptotic statistical theory. Lecture Notes in Statistics, Vol. 13. Springer-Verlag; New York: 1982. With the assistance of W. Wefelmeyer. MR675954.
- 23. Pfanzagl J. Asymptotic expansions for general statistical models. Lecture Notes in Statistics, Vol. 31. Springer-Verlag; Berlin: 1985. With the assistance of W. Wefelmeyer. MR810004.
- 24. Pfanzagl J. Estimation in semiparametric models: some recent developments. Lecture Notes in Statistics, Vol. 63. Springer-Verlag; New York: 1990. MR1048589.
- 25. Prakasa Rao BLS. Nonparametric functional estimation. Probability and Mathematical Statistics. Academic Press; New York: 1983. MR740865.
- 26. Robins J, Li L, Tchetgen E, van der Vaart A. Supplement to “Higher order estimating equations for high-dimensional models”. doi:10.1214/16-AOS1515.
- 27. Robins J, Li L, Tchetgen E, van der Vaart A. Higher order influence functions and minimax estimation of nonlinear functionals. In: Probability and statistics: essays in honor of David A. Freedman. Inst Math Stat Collect, Vol. 2. Inst Math Statist; Beachwood, OH: 2008. pp. 335–421. MR2459958.
- 28. Robins J, Li L, Tchetgen E, van der Vaart A. Quadratic semiparametric von Mises calculus. Metrika. 2009a;69:227–247. doi:10.1007/s00184-008-0214-3.
- 29. Robins J, Li L, Tchetgen E, van der Vaart A. Semiparametric minimax rates. Electron J Stat. 2009b;3:1305–1321. doi:10.1214/09-EJS479.
- 30. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Amer Statist Assoc. 1995;90(429):122–129. MR1325119.
- 31. Rotnitzky A, Robins JM. Semi-parametric estimation of models for means and covariances in the presence of missing data. Scand J Statist. 1995;22(3):323–333. MR1363216.
- 32. Tsybakov AB. Introduction à l’estimation non-paramétrique. Mathématiques & Applications, Vol. 41. Springer-Verlag; Berlin: 2004. MR2013911.
- 33. von Mises R. On the asymptotic distribution of differentiable statistical functions. Ann Math Statistics. 1947;18:309–348. MR0022330.
- 34. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer Series in Statistics. Springer-Verlag; New York: 2003. MR1958123.
- 35. van der Vaart A. On differentiable functionals. Ann Statist. 1991;19(1):178–204. MR1091845.
- 36. van der Vaart A. Efficient maximum likelihood estimation in semiparametric mixture models. Ann Statist. 1996;24(2):862–878. doi:10.1214/aos/1032894470. MR1394993.
- 37. van der Vaart A. Higher order tangent spaces and influence functions. Statist Sci. 2014;29(4):679–686. doi:10.1214/14-STS478. MR3300365.
- 38. van der Vaart AW. Estimating a real parameter in a class of semiparametric models. Ann Statist. 1988a;16(4):1450–1474. doi:10.1214/aos/1176351048. MR964933.
- 39. van der Vaart AW. Statistical estimation in large parameter spaces. CWI Tract, Vol. 44. Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica; Amsterdam: 1988b. MR927725.
- 40. van der Vaart AW. Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Vol. 3. Cambridge University Press; Cambridge: 1998. MR1652247.
- 41. van der Vaart AW, Wellner JA. Weak convergence and empirical processes: with applications to statistics. Springer Series in Statistics. Springer-Verlag; New York: 1996. MR1385671.
- 42. van Zwet WR. A Berry-Esseen bound for symmetric statistics. Z Wahrsch Verw Gebiete. 1984;66(3):425–440. doi:10.1007/BF00533707. MR751580.
- 43. Waterman RP, Lindsay BG. Projected score methods for approximating conditional scores. Biometrika. 1996;83(1):1–13. doi:10.1093/biomet/83.1.1. MR1399151.