Author manuscript; available in PMC 2019 Apr 8. Published in final edited form as: Ann. Statist. 2017 Oct 31;45(5):1951–1987. doi: 10.1214/16-AOS1515

HIGHER ORDER ESTIMATING EQUATIONS FOR HIGH-DIMENSIONAL MODELS

James Robins 1, Lingling Li 1, Rajarshi Mukherjee 1, Eric Tchetgen Tchetgen 1, Aad van der Vaart 1

Abstract

We introduce a new method of estimation of parameters in semi-parametric and nonparametric models. The method is based on estimating equations that are U-statistics in the observations. The U-statistics are based on higher order influence functions that extend ordinary linear influence functions of the parameter of interest, and represent higher derivatives of this parameter. For parameters for which the representation cannot be perfect the method leads to a bias-variance trade-off, and results in estimators that converge at a slower than $\sqrt n$-rate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at $\sqrt n$-rate, but we also consider efficient $\sqrt n$-estimation using novel nonlinear estimators. The general approach is applied in detail to the example of estimating a mean response when the response is not always observed.

AMS 2000 subject classifications: Primary 62G05, 62G20, 62F25

Keywords: Nonlinear functional, nonparametric estimation, U-statistic, influence function, tangent space

1. Introduction

Let $X_1, X_2, \ldots, X_n$ be a random sample from a density $p$ relative to a measure $\mu$ on a sample space $(\mathcal X, \mathcal A)$. It is known that $p$ belongs to a collection $\mathcal P$ of densities, and the problem is to estimate the value $\chi(p)$ of a functional $\chi\colon \mathcal P \to \mathbb R$. Our main interest is in the situation of a semiparametric or nonparametric model, where $\mathcal P$ is infinite dimensional.

Estimating equations have been found to be a good strategy for constructing estimators in semiparametric models [2, 34, 40]. Because the model is of (much) higher dimension than the parameter of interest, setting up a good estimating equation often requires an initial estimator $\hat\eta_n$ of a “nuisance parameter”, and an estimating equation for $\theta = \chi(p)$ may take the form

$$\mathbb P_n \psi_{\theta, \hat\eta_n} = 0.$$

Here $\mathbb P_n f$ is short for $n^{-1}\sum_{i=1}^n f(X_i)$, and $x \mapsto \psi_{\theta,\eta}(x)$ is a given measurable map, for each $(\theta, \eta)$. In the present paper it will be more convenient to work with a one-step version of this estimator, defined by the method of Newton–Raphson from a linearization of the map $\theta \mapsto \mathbb P_n\psi_{\theta,\hat\eta_n}$ around an initial estimator $\hat\theta_n$, leading to an estimator of the form $\hat\theta_n + \hat V_n^{-1}\mathbb P_n\psi_{\hat\theta_n,\hat\eta_n}$, for $\hat V_n$ an estimate of the derivative of the estimating equation. In more general notation such an estimator can be written as

$$\hat\chi_n = \chi(\hat p_n) + \mathbb P_n \chi_{\hat p_n}, \tag{1.1}$$

for $\hat p_n$ an initial estimator for $p$, and $x \mapsto \chi_p(x)$ a given measurable function, for each $p \in \mathcal P$.

One possible choice in (1.1) is $\chi_p = 0$, leading to the plug-in estimator $\chi(\hat p_n)$. However, unless the initial estimator $\hat p_n$ possesses special properties, this choice is typically suboptimal. Better functions $\chi_p$ can be constructed by consideration of the tangent space of the model. To see this, we write (with $P\chi_{\hat p}$ shorthand for $\int \chi_{\hat p}(x)\,dP(x)$)

$$\hat\chi_n - \chi(p) = \bigl[\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}\bigr] + (\mathbb P_n - P)\chi_{\hat p_n}. \tag{1.2}$$

Because it is properly centered, we may expect the sequence $\sqrt n\,(\mathbb P_n - P)\chi_{\hat p_n}$ to tend in distribution to a mean-zero normal distribution. The term between square brackets on the right of (1.2), which we shall refer to as the bias term, depends on the initial estimator $\hat p_n$, and it would be natural to construct the function $\chi_p$ such that this term does not contribute to the limit distribution, or at least is not dominating the expression. Thus we would like to choose this function such that the “bias term” is at least of the order $O_P(n^{-1/2})$. A good choice is to ensure that the term $P\chi_{\hat p_n}$ acts as minus the first derivative of the functional $\chi$ in the “direction” $\hat p_n - p$. Functions $x \mapsto \chi_p(x)$ with this property are known as influence functions in semiparametric theory [16, 22, 35, 5, 2], go back to the von Mises calculus due to [33], and play an important role in robust statistics [14, 11], or [40], Chapter 20.

For an influence function we may expect that the “bias term” is quadratic in the error $d(\hat p_n, p)$, for an appropriate distance $d$. In that case it is certainly negligible as soon as this error is of order $o_P(n^{-1/4})$. Such a “no-bias” condition is well known in semiparametric theory (e.g. condition (25.52) in [40] or (11) in [20]). However, typically it requires that the model $\mathcal P$ be “not too big”. For instance, a regression or density function on $d$-dimensional space can be estimated at rate $n^{-1/4}$ if it is a-priori known to have at least $d/2$ derivatives (indeed $\alpha/(2\alpha+d) \ge 1/4$ iff $\alpha \ge d/2$). The purpose of this paper is to develop estimation procedures for the case that no estimators exist that attain an $O_P(n^{-1/4})$ rate of convergence. The estimator (1.1) is then suboptimal, because it fails to make a proper trade-off between “bias” and “variance”: the two terms in (1.2) have different magnitudes. Our strategy is to replace the linear term $\mathbb P_n\chi_p$ by a general U-statistic $\mathbb U_n\chi_p$, for an appropriate $m$-dimensional influence function $(x_1, \ldots, x_m) \mapsto \chi_p(x_1, \ldots, x_m)$, chosen using a von Mises expansion of $p \mapsto \chi(p)$. Here the order $m$ is adapted to the size of the model $\mathcal P$ and the type of functional to be estimated.

Unfortunately, “exact” higher-order influence functions turn out to exist only for special functionals $\chi$. To treat general functionals $\chi$ we approximate these by simpler functionals, or use approximate influence functions. The rate of the resulting estimator is then determined by a trade-off between bias and variance terms. It may still be of order $1/\sqrt n$, but it is often slower. In the former case, surprisingly, one may obtain semiparametric efficiency by estimators whose variance is determined by the linear term, but whose bias is corrected using higher order influence functions.

The conclusion that the “bias term” in (1.2) is quadratic in the estimation error $d(\hat p_n, p)$ is based on a worst case analysis. First, there exist a large number of models and functionals of interest that permit a first order influence function that is unbiased in the nuisance parameter. (E.g. adaptive models as considered in [1], models allowing a sufficient statistic for the nuisance parameter as in [38, 39], mixture models as considered in [19, 24, 36], and convex-linear models in survival analysis.) In such models there is no need for higher-order estimating equations. Second, the analysis does not take special, structural properties of the initial estimators $\hat p_n$ into account. An alternative approach would be to study the bias of a particular estimator in detail, and adapt the estimating equation to this special estimator. The strategy in this paper is not to use such special properties, but to focus on estimating equations that work with general initial estimators $\hat p_n$.

The motivation for our new estimators stems from studies in epidemiology and econometrics that include covariates whose influence on an outcome of interest cannot be reliably modelled by a simple model. These covariates may themselves not be of interest, but are included in the analysis to adjust for possible bias. For instance, the mechanism that describes why certain data is missing may be given in terms of conditional probabilities given several covariates, but the functional form of this dependence is unknown. Or, to permit a causal interpretation in an observational study one conditions on a set of covariates to control for confounding, but the form of the dependence on the confounding variables is unknown. One may hypothesize in such situations that the functional dependence on a set of (continuous) covariates is smooth (e.g. $d/2$ times differentiable in the case of $d$ covariates), or even linear. Then the usual estimators will be accurate (at order $O_P(n^{-1/2})$) if the hypothesis is true, but they will be badly biased otherwise. In particular, the usual normal-theory based confidence intervals may be totally misleading: they will be both too narrow and wrongly located. The methods in this paper yield estimators with (typically) wider corresponding confidence intervals, but these are correct under weaker assumptions.

The mathematical contributions of the paper are to provide a heuristic for constructing minimax estimators in semiparametric models, and to apply this to a concrete model, which is a template for a number of other models (see [27, 37]). The methods connect to earlier work [13, 21] on the estimation of functionals on nonparametric models, but differ by our focus on functionals that are defined in terms of the structure of a semiparametric model. This requires an analysis of the inverse map from the density of the observations to the parameters, in terms of the semiparametric tangent spaces of the models. Our second order estimators are related to work on quadratic functionals, or functionals that are well approximated by quadratic functionals, as in [10, 15, 3, 4, 17, 18, 6, 7]. While we place the construction of minimax estimators for these special functionals in a wider framework, our focus differs by going beyond quadratic estimators and considering semiparametric models.

Our mathematical results are in part conditional on a scale of regularity parameters, through a dimension (8.9) and a partition of this dimension that depend on two of these parameters. We hope to discuss adaptation to these parameters in future work.

General heuristics of our construction are given in Section 3. Sections 4–8 are devoted to constructing new estimators for the mean response effect in missing data problems. In Section 9 we briefly discuss some other problems, including the problem of estimating a density at a point, where already first order influence functions do not exist and our heuristics naturally lead to projection estimators. Section 10 collects technical proofs. Sections 11, 12 and 13 (in the supplement [26]) discuss three key concepts of the paper: influence functions, projections and U-statistics.

2. Notation

Let $\mathbb U_n$ denote the empirical U-statistic measure, viewed as an operator on functions. For given $k \le n$ and a function $f\colon \mathcal X^k \to \mathbb R$ on the sample space this is defined by

$$\mathbb U_n f = \frac{1}{n(n-1)\cdots(n-k+1)} \sum_{1 \le i_1 \ne i_2 \ne \cdots \ne i_k \le n} f(X_{i_1}, X_{i_2}, \ldots, X_{i_k}).$$

We do not let the order $k$ show up in the notation $\mathbb U_n f$. This is unnecessary, as the notation is consistent in the following sense: if a function $f\colon \mathcal X^l \to \mathbb R$ of $l < k$ arguments is considered a function of $k$ arguments that is constant in its last $k - l$ arguments, then the right side of the preceding display is well defined and is exactly the corresponding U-statistic of order $l$. In particular, $\mathbb U_n f$ is the empirical distribution $\mathbb P_n$ applied to $f$ if $f\colon \mathcal X \to \mathbb R$ depends on only one argument.
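As a concrete illustration (ours, not part of the original text), the following sketch evaluates $\mathbb U_n f$ by brute force; numpy is assumed, and all names are our own.

```python
import itertools
import math

import numpy as np

def u_statistic(f, x, k):
    """Naive O(n^k) evaluation of U_n f: average f over all ordered
    k-tuples of distinct indices, as in the display above."""
    n = len(x)
    total = sum(f(*(x[i] for i in idx))
                for idx in itertools.permutations(range(n), k))
    return total / math.perm(n, k)

rng = np.random.default_rng(0)
x = rng.normal(size=30)

# For a kernel of one argument, U_n f coincides with the empirical
# mean P_n f, illustrating the consistency of the notation.
assert abs(u_statistic(lambda a: a, x, 1) - x.mean()) < 1e-12

# A symmetric kernel of two arguments: U_n of f(a, b) = a * b is an
# unbiased estimator of (E X)^2.
print(u_statistic(lambda a, b: a * b, x, 2))
```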

We write $P^n \mathbb U_n f = P^k f$ for the expectation of $\mathbb U_n f$ if $X_1, \ldots, X_n$ are distributed according to the probability measure $P$. We also use this operator notation for the expectations of statistics in general.

We call $f$ degenerate relative to $P$ if $\int f(x_1, \ldots, x_k)\,dP(x_i) = 0$ for every $i$ and every $(x_j\colon j \ne i)$, and we call $f$ symmetric if $f(x_1, \ldots, x_k)$ is invariant under permutation of the arguments $x_1, \ldots, x_k$. Given an arbitrary measurable function $f\colon \mathcal X^k \to \mathbb R$ we can form a function that is degenerate relative to $P$ by subtracting the orthogonal projection in $L_2(P^k)$ onto the functions of at most $k - 1$ variables. This degenerate function can be written in the form (e.g. [40], Lemma 11.11)

$$(D_P f)(X_1, \ldots, X_k) = \sum_{A \subset \{1, \ldots, k\}} (-1)^{k - |A|}\, E_P\bigl[f(X_1, \ldots, X_k)\,\big|\, X_i\colon i \in A\bigr], \tag{2.1}$$

where the sum is over all subsets $A$ of $\{1, \ldots, k\}$, including the empty set, for which the conditional expectation is understood to be $P^k f$. If the function $f$ is symmetric, then so is the function $D_P f$.
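For $k = 2$ formula (2.1) reads $D_P f(x_1, x_2) = f(x_1, x_2) - E_P f(x_1, X_2) - E_P f(X_1, x_2) + P^2 f$. The Monte Carlo sketch below (our illustration; it assumes a kernel vectorized over numpy arrays) makes this concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(size=200_000)      # a large sample from P = N(0,1)
half = len(draws) // 2

def degenerate_part(f, x1, x2):
    """Monte Carlo version of (2.1) for k = 2: subtract both
    one-variable conditional expectations, add back P^2 f."""
    m1 = f(x1, draws).mean()                    # E_P f(x1, X2)
    m2 = f(draws, x2).mean()                    # E_P f(X1, x2)
    m0 = f(draws[:half], draws[half:]).mean()   # P^2 f
    return f(x1, x2) - m1 - m2 + m0

# f(x1, x2) = x1 * x2 is already degenerate for a centered P, so
# D_P f returns (approximately) f itself.
print(degenerate_part(lambda a, b: a * b, 1.5, -0.7))  # ~ -1.05
```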

Given two functions $g, h\colon \mathcal X \to \mathbb R$ we write $g \times h$ for the function $(x, y) \mapsto g(x)h(y)$. More generally, given $k$ functions $g_1, \ldots, g_k$ we write $g_1 \times\cdots\times g_k$ for the tensor product of these functions. Such product functions are degenerate iff all functions in the product have mean zero.

A kernel operator $K\colon L_r(\mathcal X, \mathcal A, \mu) \to L_r(\mathcal X, \mathcal A, \mu)$ takes the form $(Kf)(x) = \int \bar K(x, y)\, f(y)\,d\mu(y)$ for some measurable function $\bar K\colon \mathcal X^2 \to \mathbb R$. We shall abuse notation in denoting the operator $K$ and the kernel $\bar K$ with the same symbol: $K = \bar K$. A (weighted) projection on a finite-dimensional space is a kernel operator. We discuss such projections in Section 12.

The set of measurable functions whose $r$th absolute power is $\mu$-integrable is denoted $L_r(\mu)$, with norm $\|\cdot\|_{r,\mu}$, or $\|\cdot\|_r$ if the measure is clear; or also as $L_r(w)$ with norm $\|\cdot\|_{r,w}$ if $w$ is a density relative to a given dominating measure. For $r = \infty$ the notation $\|\cdot\|_\infty$ refers to the uniform norm.

3. Heuristics

Our basic estimator has the form (1.1) except that we replace the linear term by a general U-statistic. Given measurable functions $\chi_p\colon \mathcal X^m \to \mathbb R$, for a fixed order $m$, we consider estimators $\hat\chi_n$ of $\chi(p)$ of the type

$$\hat\chi_n = \chi(\hat p_n) + \mathbb U_n \chi_{\hat p_n}. \tag{3.1}$$

The initial estimators $\hat p_n$ are thought to have a certain (optimal) convergence rate $d(\hat p_n, p) \to 0$, but need not possess (further) special properties. Throughout we shall treat these estimators as being based on an independent sample of observations, so that the stochasticities in (3.1) present in $\hat p_n$ and $\mathbb U_n$ are independent. This takes away technical complications, and allows us to focus on rates of estimation in full generality. (A simple way to avoid the resulting asymmetry would be to swap the two samples, calculate the estimator a second time and take the average.)
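The sample-splitting scheme, including the symmetrizing swap suggested above, can be sketched as follows (our illustration; `estimator(train, eval_)` is a hypothetical interface for any estimator of the form (3.1)).

```python
import numpy as np

def cross_fit(estimator, data, rng):
    """Split the sample in half, build the preliminary estimate on
    one part and evaluate the U-statistic on the other, then swap
    the roles and average the two resulting estimates."""
    idx = rng.permutation(len(data))
    first, second = np.array_split(idx, 2)
    return 0.5 * (estimator(data[first], data[second]) +
                  estimator(data[second], data[first]))
```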

3.1. Influence functions

The key is to find suitable “influence functions” $\chi_p$. A decomposition of type (1.2) for the estimator (3.1) yields

$$\hat\chi_n - \chi(p) = \bigl[\chi(\hat p_n) - \chi(p) + P^m \chi_{\hat p_n}\bigr] + (\mathbb U_n - P^m)\chi_{\hat p_n}. \tag{3.2}$$

This suggests to construct the influence functions such that $-P^m\chi_{\hat p_n}$ represents the first $m$ terms of the Taylor expansion of $\chi(\hat p_n) - \chi(p)$. First this implies that the influence function used in (3.1) must be unbiased:

$$P^m \chi_p = 0.$$

Next, to operationalize a “Taylor expansion” on the (infinite-dimensional) “manifold” $\mathcal P$ we employ “smooth” submodels $t \mapsto p_t$, mapping a neighbourhood of $0 \in \mathbb R$ to $\mathcal P$ and passing through $p$ at $t = 0$ (i.e. $p_0 = p$). We determine $\chi_p$ such that for each chosen submodel

$$\frac{d^j}{dt^j}\bigg|_{t=0} \chi(p_t) = -\frac{d^j}{dt^j}\bigg|_{t=0} P^m \chi_{p_t}, \qquad j = 1, \ldots, m.$$

A slight strengthening is to impose this condition “everywhere” on the path, i.e. the $j$th derivative of $t \mapsto \chi(p_t)$ at $t$ is the $j$th derivative of $h \mapsto -P_t^m \chi_{p_{t+h}}$ at $h = 0$, for every $t$. If the map $(s, t) \mapsto P_s^m \chi_{p_t}$ is smooth, then the latter implies (cf. Lemma 11.1 applied with $f(t) = \chi(p_t)$ and $g(s, t) = -P_t^m \chi_{p_s}$)

$$\frac{d^j}{dt^j}\bigg|_{t=0} \chi(p_t) = \frac{d^j}{dt^j}\bigg|_{t=0} P_t^m \chi_p, \qquad j = 1, \ldots, m. \tag{3.3}$$

Relative to the previous formula the subscript $t$ on the right hand side has changed places, and the negative sign has disappeared. Under regularity conditions equation (3.3) for $m = 1$ can be written in the form

$$\frac{d}{dt}\bigg|_{t=0} \chi(p_t) = \frac{d}{dt}\bigg|_{t=0} P_t \chi_p = P \chi_p\, g, \tag{3.4}$$

where $g = (d/dt)|_{t=0}\, p_t/p$ is the score function of the model $t \mapsto p_t$ at $t = 0$.

A function $\chi_p$ satisfying (3.3) is exactly what is called an influence function in semiparametric theory: a function in $L_2(p)$ whose inner products with the elements of the tangent space (the set of score functions $g$ of appropriate submodels $t \mapsto p_{g,t}$) represent the derivative of the functional ([40], page 363, or [2, 22, 39, 35]). For $m > 1$ equation (3.3) can be expanded similarly in terms of inner products of the influence function with score functions, but higher-order score functions arise next to ordinary score functions. Suitable higher-order tangent spaces are discussed in [27] (also see [37]), using score functions as defined in [43]. A discussion of second order scores and tangent spaces can be found in [28]. Second order tangent spaces are also discussed in [23], from a different point of view and with a different purpose.

Here we take a different route, defining higher order influence functions as influence functions of lower order ones. Any $m$th order, zero-mean U-statistic can be decomposed as the sum of $m$ degenerate U-statistics of orders $1, 2, \ldots, m$, by way of its Hoeffding decomposition. In the present situation we can write

$$\mathbb U_n \chi_p = \mathbb U_n \chi_p^{(1)} + \frac{1}{2!}\,\mathbb U_n \chi_p^{(2)} + \cdots + \frac{1}{m!}\,\mathbb U_n \chi_p^{(m)},$$

where $\chi_p^{(j)}\colon \mathcal X^j \to \mathbb R$ is a degenerate kernel of $j$ arguments, defined uniquely as a projection of $\chi_p$ (cf. [42] and (2.1)). (At $m = n$ the left side evaluates to the symmetrized function $\chi_p$, which is thus expressed in the functions $\chi_p^{(j)}$.) Suitable functions $\chi_p^{(j)}$ in this decomposition can be found by the following algorithm:

  1. Let $x_1 \mapsto \bar\chi_p^{(1)}(x_1)$ be a first order influence function of the functional $p \mapsto \chi(p)$.

  2. Let $x_j \mapsto \bar\chi_p^{(j)}(x_1, \ldots, x_j)$ be a first order influence function of the functional $p \mapsto \bar\chi_p^{(j-1)}(x_1, \ldots, x_{j-1})$, for each fixed $x_1, \ldots, x_{j-1}$, and $j = 2, \ldots, m$.

  3. Let $\chi_p^{(j)} = D_P \bar\chi_p^{(j)}$ be the degenerate part of $\bar\chi_p^{(j)}$ relative to $P$, as defined in (2.1).

See Lemma 11.2 for a proof. Thus higher order influence functions are constructed as first order influence functions of influence functions. Somewhat abusing language we shall refer to the function $\chi_p^{(j)}$ also as a “$j$th order influence function”. The overall order $m$ will be fixed at a suitable value; for simplicity we do not let this show up in the notation $\chi_p$.

Because only inner products with scores matter, an “influence function” is unique only up to projections onto the tangent space and (averaging over) permutations of its arguments. With any choice of influence functions the algorithm produces some influence function. In particular, the starting influence function $\bar\chi_p^{(1)}$ in step 1 may be any function $\chi_p$ that satisfies (3.4) for every score $g$; it does not have to possess mean zero, or be an element of the tangent space. A similar remark applies to the (first order) influence functions found in step 2. It is only in step 3 that we make the influence functions degenerate.

3.2. Bias-variance trade-off

Because it is centered, the “variance part” in (3.2), the variable $(\mathbb U_n - P^m)\chi_{\hat p_n}$, should not change noticeably if we replace $\hat p_n$ by $p$, and be of the same order as $(\mathbb U_n - P^m)\chi_p$. For a fixed square-integrable function $\chi_p$ the latter centered U-statistic is well known to be of order $O_P(n^{-1/2})$, and asymptotically normal if suitably scaled. A completely successful representation of the “bias” $R_n = \chi(\hat p_n) - \chi(p) + P^m \chi_{\hat p_n}$ in (3.2) would lead to an error $R_n = O_P\bigl(d(\hat p_n, p)^{m+1}\bigr)$, which becomes smaller with increasing order $m$. Were this achievable for any $m$, then a $\sqrt n$-estimator would exist no matter how slow the convergence rate $d(\hat p_n, p)$ of the initial estimator. Not surprisingly, in many cases of interest this ideal situation is not real. This is due to the non-existence of influence functions that can exactly represent the Taylor expansion of $\chi(\hat p_n) - \chi(p)$. The technical reason is that multi-linear maps (even smooth ones) may not be representable through kernels.

In general, we have to content ourselves with a partial representation. Next to a remainder term of order $O_P\bigl(d(\hat p_n, p)^{m+1}\bigr)$, we then incur a “representation bias”. The latter bias can be made arbitrarily small by choice of the influence function, but only at the cost of increasing its variance. We thus obtain a trade-off between a variance and two biases. This typically results in a variance that is larger than $1/n$, and a rate of convergence that is slower than $1/\sqrt n$, although sometimes a nontrivial bias correction is possible without increasing the variance.

3.3. Approximate functionals

An attractive method to find approximating influence functions is to compute exact influence functions for an approximate functional. Because smooth functionals on finite-dimensional models typically possess influence functions to any order, projections on finite-dimensional models may deliver such approximations.

A simple approximation would be $\chi(\tilde p)$ for a given map $p \mapsto \tilde p$ mapping the model $\mathcal P$ onto a suitable “smaller” model $\tilde{\mathcal P}$ (typically a submodel $\tilde{\mathcal P} \subset \mathcal P$). A closer approximation can be obtained by also including a derivative term. Consider the functional $\tilde\chi\colon \mathcal P \to \mathbb R$ defined by, for a given map $p \mapsto \tilde p$,

$$\tilde\chi(p) = \chi(\tilde p) + P \chi_{\tilde p}^{(1)}. \tag{3.5}$$

(A more complete notation would be $\tilde p(p)$; the right hand side depends on $p$ in three ways.) By the definition of an influence function the term $P\chi_{\tilde p}^{(1)}$ acts as the first order Taylor expansion of $\chi(p) - \chi(\tilde p)$. Consequently, we may expect that

$$\bigl|\tilde\chi(p) - \chi(p)\bigr| = O\bigl(d(p, \tilde p)^2\bigr). \tag{3.6}$$

This ought to be true for any “projection” $p \mapsto \tilde p$. If we choose the projection such that, for any path $t \mapsto p_t$,

$$\frac{d}{dt}\bigg|_{t=0}\Bigl(\chi(\tilde p_t) + P_0 \chi_{\tilde p_t}^{(1)}\Bigr) = 0, \tag{3.7}$$

then the functional $p \mapsto \tilde\chi(p)$ will be locally (around $p_0$) equivalent to the functional $p \mapsto \chi(\tilde p_0) + P\chi_{\tilde p_0}^{(1)}$ (which depends on $p$ in only one place, $\tilde p_0$ being fixed) in the sense that the first order influence functions are the same. The first order influence function of the second, linear functional at $p_0$ is equal to $\chi_{\tilde p_0}^{(1)}$, and hence for a projection satisfying (3.7) the first order influence function of the functional $p \mapsto \tilde\chi(p)$ will be

$$\tilde\chi_p^{(1)} = \chi_{\tilde p}^{(1)}. \tag{3.8}$$

In words, this means that the influence function of the approximating functional $\tilde\chi$ satisfying (3.5) and (3.7) at $p$ is obtained by substituting $\tilde p$ for $p$ in the influence function of the original functional.

This is relevant when obtaining higher order influence functions. As these are recursive derivatives of the first order influence function (see steps 1–3 in Section 3.1), the preceding display shows that we must compute influence functions of

$$p \mapsto \chi_{\tilde p}^{(1)}(x),$$

i.e. we “differentiate on the model $\tilde{\mathcal P}$”. If the latter model is sufficiently simple, for instance finite-dimensional, then exact higher order influence functions of the functional $p \mapsto \tilde\chi(p)$ ought to exist. We can use these as approximate influence functions of $p \mapsto \chi(p)$.

4. Estimating the mean response in missing data models

Suppose that a typical observation is distributed as $X = (YA, A, Z)$, for $Y$ and $A$ taking values in the two-point set $\{0, 1\}$ and conditionally independent given $Z$.

This model is standard in biostatistical applications, with Y an “outcome” or “response variable”, which is observed only if the indicator A takes the value 1. The covariate Z is chosen such that it contains all information on the dependence between the response and the missingness indicator A, thus making the response missing at random. Alternatively, we think of Y as a “counterfactual” outcome if a treatment were given (A = 1) and estimate (half) the treatment effect under the assumption of no unmeasured confounders.

The model can be parameterized by the marginal density $f$ of $Z$ (relative to some dominating measure $\nu$) and the probabilities $b(z) = P(Y = 1\,|\, Z = z)$ and $a(z)^{-1} = P(A = 1\,|\, Z = z)$. (Using $a$ for the inverse probability simplifies later formulas.) Alternatively, the model can be parameterized by the pair $(a, b)$ and the function $g = f/a$, which is the conditional density of $Z$ given $A = 1$, up to the norming factor $P(A = 1)$. Thus the density $p$ of an observation $X$ is described by the triplet $(a, b, f)$, or equivalently the triplet $(a, b, g)$. For simplicity of notation we write $p$ instead of $p_{a,b,f}$ or $p_{a,b,g}$, with the implicit understanding that a generic $p$ corresponds one-to-one to a generic $(a, b, f)$ or $(a, b, g)$.

We wish to estimate the mean response $EY = Eb(Z)$, i.e. the functional

$$\chi(p) = \int b f\,d\nu = \int a b\, g\,d\nu.$$

Estimators that are $\sqrt n$-consistent and asymptotically efficient in the semiparametric sense have been constructed using a variety of methods (e.g. [30, 31]), but only if $a$ or $b$, or both, parameters are restricted to sufficiently small regularity classes. For instance, if the covariate $Z$ ranges over a compact, convex subset $\mathcal Z$ of $\mathbb R^d$, then the mentioned papers provide $\sqrt n$-consistent estimators under the assumption that $a$ and $b$ belong to Hölder classes $C^\alpha(\mathcal Z)$ and $C^\beta(\mathcal Z)$ with $\alpha$ and $\beta$ large enough that

$$\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} \ge \frac12. \tag{4.1}$$

(See e.g. Section 2.7.1 in [41] for the definition of Hölder classes.) For moderate to large dimensions $d$ this is a restrictive requirement. In the sequel we consider estimation for arbitrarily small $\alpha$ and $\beta$.

Throughout we assume that the parameters $a$, $b$ and $g$ are contained in Hölder spaces $C^\alpha(\mathcal Z)$, $C^\beta(\mathcal Z)$ and $C^\gamma(\mathcal Z)$ of functions on a compact, convex domain $\mathcal Z \subset \mathbb R^d$. We derive two types of results:

  1. In Section 7 we show that a $\sqrt n$-rate is attainable by using a higher order estimating equation (of order determined by $\gamma$) as long as

$$\frac{\alpha+\beta}{2} \ge \frac d4. \tag{4.2}$$

    This condition is strictly weaker than the condition (4.1) under which the linear estimator attains a $\sqrt n$-rate. Thus even in the $\sqrt n$-situation higher order estimating equations may yield estimators that are applicable in a wider range of models. For instance, in the case that $\alpha = \beta$ the cut-off (4.1) arises for $\alpha = \beta \ge d/2$, whereas (4.2) reduces to $\alpha = \beta \ge d/4$.

  2. We consider minimax estimation in the case $(\alpha+\beta)/2 < d/4$, when the rate becomes slower than $1/\sqrt n$. It is shown in [29] that even if $g = f/a$ were known, then the minimax rate for $a$ and $b$ ranging over balls in the Hölder classes $C^\alpha(\mathcal Z)$ and $C^\beta(\mathcal Z)$ cannot be faster than $n^{-(2\alpha+2\beta)/(2\alpha+2\beta+d)}$. In Section 8 we show that this rate is attainable if $g$ is known, and also if $g$ is unknown but is a-priori known to belong to a Hölder class $C^\gamma(\mathcal Z)$ for sufficiently large $\gamma$, as given by (8.11). (Heuristic arguments, not discussed in this paper, appear to indicate that for smaller $\gamma$ the minimax rate is slower than $n^{-(2\alpha+2\beta)/(2\alpha+2\beta+d)}$.)

After reviewing the tangent space and first order theory in Section 4.1, we discuss the second order estimator separately in Section 5. The preceding results are next obtained in Sections 7 ($\sqrt n$-rate if $(\alpha+\beta)/2 \ge d/4$) and 8 (slower rate if $(\alpha+\beta)/2 < d/4$), using the higher-order influence functions of an approximate functional, which is defined in the intermediate Section 6.

Assumption 4.1

We assume throughout that the functions $1/a$, $b$, $g$ and their preliminary estimators $1/\hat a$, $\hat b$, $\hat g$ are bounded away from their extremes: 0 and 1 for the first two, and 0 and $\infty$ for the third.

4.1. Tangent space and first order influence function

The one-dimensional submodels $t \mapsto p_t$ induced by paths of the form $a_t = a + t\alpha$, $b_t = b + t\beta$, and $f_t = f(1 + t\phi)$ for given directions $\alpha$, $\beta$ and $\phi$ (where $\int \phi f\,d\nu = 0$) yield score functions

$$B_{pa}\alpha(X) = -\frac{Aa(Z) - 1}{a(Z)\,(a-1)(Z)}\,\alpha(Z), \qquad a\text{-score},$$
$$B_{pb}\beta(X) = \frac{A\bigl(Y - b(Z)\bigr)}{b(Z)\,(1-b)(Z)}\,\beta(Z), \qquad b\text{-score},$$
$$B_{pf}\phi(X) = \phi(Z), \qquad f\text{-score}.$$

Here $B_{pa}$, $B_{pb}$, $B_{pf}$ are the score operators for the three parameters, whose direct sum is the overall score operator, which we write as $B_p$: $B_p(\alpha, \beta, \phi)(X)$ is the sum of the three left sides of the preceding display. The first-order influence function is well known to take the form

$$\chi_p^{(1)}(X) = Aa(Z)\bigl(Y - b(Z)\bigr) + b(Z) - \chi(p). \tag{4.3}$$

Indeed, it is straightforward to verify that this function satisfies, for every path $t \mapsto p_t$ as described previously,

$$\frac{d}{dt}\bigg|_{t=0}\chi(p_t) = E_p\,\chi_p^{(1)}(X)\, B_p(\alpha, \beta, \phi)(X).$$

The advantage of choosing $a$ as an inverse probability is clear from the form of the (random part of the) influence function, which is bilinear in $(a, b)$. The corresponding “first order bias” can be computed to be

$$\chi(\hat p) - \chi(p) + P\chi_{\hat p}^{(1)} = -\int (\hat a - a)(\hat b - b)\, g\,d\nu. \tag{4.4}$$

In agreement with the heuristics given in Sections 1 and 3 this bias is quadratic in the errors of the initial estimators.

Actually, the form of the bias term is special in that square estimation errors $(\hat a - a)^2$ and $(\hat b - b)^2$ of the two initial estimators $\hat a$ and $\hat b$ do not arise, but only the product $(\hat a - a)(\hat b - b)$ of their errors. This property, termed “double robustness” in [34], means that for first order inference it suffices that one of the two parameters be estimated well. A prior assumption that the parameters $a$ and $b$ are $\alpha$ and $\beta$ regular, respectively, would allow estimation errors of the orders $n^{-\alpha/(2\alpha+d)}$ and $n^{-\beta/(2\beta+d)}$. If the product of these rates is $O(n^{-1/2})$, then the bias term matches the variance. This leads to the (unnecessarily restrictive) condition (4.1).

If the preliminary estimators $\hat a$ and $\hat b$ are solely selected for having small errors $\hat a - a$ and $\hat b - b$ (e.g. minimax in the $L_2$-norm), then it is hard to see why (4.4) would be small unless the product $\|\hat a - a\|\,\|\hat b - b\|$ of the errors is small. Special estimators might exploit that the bias is an integral, in which cancellation of errors could occur. As we do not wish to use special estimators, our approach will be to replace the linear estimating equation by a higher order one, leading to an analogue of (4.4) that is a cubic or higher order polynomial in the estimation errors.

It may be noted that the marginal density f (or g) does not enter into the first order influence function (4.3). Even though the functional depends on f (or g), a rate on the initial estimator of this function is not needed for the construction of the first order estimator. This will be different at higher orders.
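To fix ideas, here is a minimal simulation (ours, not from the paper) of the first order estimator $\chi(\hat p) + \mathbb P_n\chi_{\hat p}^{(1)}$, i.e. the empirical mean of $A\hat a(Z)\bigl(Y - \hat b(Z)\bigr) + \hat b(Z)$ from (4.3); the data-generating functions and the deliberately perturbed nuisance “estimators” are hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Simulate (YA, A, Z): Z uniform on [0,1], b(z) = P(Y=1 | Z=z) and
# 1/a(z) = P(A=1 | Z=z).  The target is chi = E b(Z) = 0.5.
z = rng.uniform(size=n)
b = lambda t: 0.3 + 0.4 * t                # true response curve
inv_a = lambda t: 0.2 + 0.6 * t            # true P(A=1 | Z)
a_ind = rng.binomial(1, inv_a(z))
ya = rng.binomial(1, b(z)) * a_ind         # only Y*A is observed

# Stand-ins for preliminary estimators fitted on an independent
# sample; here the truths, deliberately perturbed.
b_hat = lambda t: b(t) + 0.1 * np.sin(6 * t)
a_hat = lambda t: 1.0 / (1.05 * inv_a(t))

# One-step estimator; note A(Y - b(Z)) = A(YA - b(Z)) since AY = A*YA.
chi_hat = np.mean(a_ind * a_hat(z) * (ya - b_hat(z)) + b_hat(z))
print(chi_hat)                             # close to 0.5
```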

5. Second order estimator

In this section we derive a second order influence function for the missing data problem, and analyze the risk of the corresponding estimator. This estimator is minimax if $(\alpha+\beta)/2 \ge d/4$ and

$$\frac{\gamma}{2\gamma+d} \;\ge\; \frac12 \wedge \frac{2\alpha+2\beta}{d+2\alpha+2\beta} \;-\; \frac{\alpha}{2\alpha+d} \;-\; \frac{\beta}{2\beta+d}. \tag{5.1}$$

In the other case, higher order estimators have smaller risk, as shown in Sections 7–8. However, it is worthwhile to treat the second order estimator separately, as its construction exemplifies the essential elements, without involving the technicalities attached to the higher order estimators.

To find a second order influence function, we follow the strategy of steps 1–3 of Section 3.1, and try to find a function $\chi_p^{(2)}\colon \mathcal X^2 \to \mathbb R$ such that, for every $x_1 = (y_1 a_1, a_1, z_1)$, and all directions $\alpha$, $\beta$, $\phi$,

$$\frac{d}{dt}\bigg|_{t=0}\Bigl[\chi_{p_t}^{(1)}(x_1) + \chi(p_t)\Bigr] = E_p\,\chi_p^{(2)}(x_1, X_2)\, B_p(\alpha, \beta, \phi)(X_2).$$

Here the expectation $E_p$ on the right side is relative to the variable $X_2$ only, with $x_1$ fixed. This equation expresses that $x_2 \mapsto \chi_p^{(2)}(x_1, x_2)$ is a first order influence function of $p \mapsto \chi_p^{(1)}(x_1) + \chi(p)$, for fixed $x_1$. On the left side we added the “constant” $\chi(p_t)$ to the first order influence function (giving another first order influence function) to facilitate the computations. This is justified as the strategy of steps 1–3 works with any influence function. In view of (4.3) and the definitions of the paths $t \mapsto a + t\alpha$, $t \mapsto b + t\beta$ and $t \mapsto f(1 + t\phi)$, this leads to the equation

$$a_1\bigl(y_1 - b(z_1)\bigr)\,\alpha(z_1) - \bigl(a_1 a(z_1) - 1\bigr)\,\beta(z_1) = E_p\,\chi_p^{(2)}(x_1, X_2)\, B_p(\alpha, \beta, \phi)(X_2). \tag{5.2}$$

Unfortunately, no function $\chi_p^{(2)}$ that solves this equation for every $(\alpha, \beta, \phi)$ exists. To see this note that for the special triplets with $\beta = \phi = 0$ the requirement can be written in the form

$$\alpha(z_1) = E_p\biggl\{\biggl[\chi_p^{(2)}(x_1, X_2)\,\frac{1}{a_1\bigl(y_1 - b(z_1)\bigr)}\,\frac{1 - A_2 a(Z_2)}{a(Z_2)\,(a-1)(Z_2)}\biggr]\,\alpha(Z_2)\biggr\}.$$

The right side of the equation can be written as ∫ K(z1, z2)α(z2) dF (z2), for K(z1, Z2) the conditional expectation of the function in square brackets given Z2. Thus it is the image of α under the kernel operator with kernel K. If the equation were true for any α, then this kernel operator would work as the identity operator. However, on infinite-dimensional domains the identity operator is not given by a kernel. (Its kernel would be a “Dirac function on the diagonal”.)

Therefore, we have to be satisfied with an influence function that gives a partial representation only. In particular, a projection onto a finite-dimensional linear space possesses a kernel, and acts as the identity on this linear space. A “large” linear space gives representation in “many” directions. By reducing the expectation in (5.2) to an integral relative to the marginal distribution of $Z_2$, we can use an orthogonal projection $\Pi_p\colon L_2(g) \to L_2(g)$ onto a subspace $L$ of $L_2(g)$. Writing also $\Pi_p$ for its kernel, and letting $S_2 h$ denote the symmetrization $\bigl(h(X_1, X_2) + h(X_2, X_1)\bigr)/2$ of a function $h\colon \mathcal X^2 \to \mathbb R$, we define

$$\chi_p^{(2)}(X_1, X_2) = -2\, S_2\Bigl[A_1\bigl(Y_1 - b(Z_1)\bigr)\,\Pi_p(Z_1, Z_2)\,\bigl(A_2 a(Z_2) - 1\bigr)\Bigr]. \tag{5.3}$$

Lemma 5.1

For $\chi_p^{(2)}$ defined by (5.3) with $\Pi_p$ the kernel of an orthogonal projection $\Pi_p\colon L_2(g) \to L_2(g)$ onto a subspace $L \subset L_2(g)$, equation (5.2) is satisfied for every path $t \mapsto p_t$ corresponding to directions $(\alpha, \beta, \phi)$ such that $\alpha \in L$ and $\beta \in L$.

Proof

By definition $E(A\,|\,Z) = (1/a)(Z)$ and $E(Y\,|\,Z) = b(Z)$. Also $\mathrm{var}\bigl(Aa(Z)\,|\,Z\bigr) = a(Z) - 1$ and $\mathrm{var}(Y\,|\,Z) = b(Z)(1-b)(Z)$. By direct computation using these identities, we find that for the influence function (5.3) the right side of (5.2) reduces to

$$a_1\bigl(y_1 - b(z_1)\bigr)\,\Pi_p\alpha(z_1) - \bigl(a_1 a(z_1) - 1\bigr)\,\Pi_p\beta(z_1).$$

Thus (5.2) holds for every $(\alpha, \beta, \phi)$ such that $\Pi_p\alpha = \alpha$ and $\Pi_p\beta = \beta$.

Together with the first order influence function (4.3) the influence function (5.3) defines the (approximate) influence function $\chi_p = \chi_p^{(1)} + \frac12 \chi_p^{(2)}$. For an initial estimator $\hat p$ based on independent observations we now construct the estimator (3.1), i.e.

$$\hat\chi_n = \chi(\hat p) + \mathbb P_n \chi_{\hat p}^{(1)} + \frac12\, \mathbb U_n \chi_{\hat p}^{(2)}. \tag{5.4}$$

Unlike the first order influence function, the second order influence function does depend on the density $f$ of the covariates, or rather the function $g = f/a$ (through the kernel $\Pi_p$, which is defined relative to $L_2(g)$), and hence the estimator (5.4) involves a preliminary estimator of $g$. As a consequence, the quality of the estimator of the functional $\chi$ depends on the precision by which $g$ (as part of the plug-in $\hat p = (\hat a, \hat b, \hat g)$) can be estimated.
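The following sketch (our illustration, continuing the simulation style above) computes the quadratic correction $\frac12\mathbb U_n\chi_{\hat p}^{(2)}$ from (5.3)–(5.4), taking the cosine basis on $[0,1]$ as the projection space; treating this basis as orthonormal in $L_2(g)$ amounts to assuming $\hat g \approx 1$, which is an assumption of the sketch, not of the paper.

```python
import numpy as np

def second_order_correction(z, a_ind, ya, b_hat, a_hat, k):
    """(1/2) U_n chi^(2) with chi^(2) from (5.3): the kernel is
    -2 S_2[A1 (Y1 - b(Z1)) Pi(Z1, Z2) (A2 a(Z2) - 1)], so the
    correction equals minus the mean, over ordered pairs i != j,
    of u_i Pi(Z_i, Z_j) v_j."""
    n = len(z)
    j = np.arange(1, k)
    # cosine basis, orthonormal in L2 of the uniform distribution
    E = np.hstack([np.ones((n, 1)),
                   np.sqrt(2.0) * np.cos(np.pi * np.outer(z, j))])
    u = a_ind * (ya - b_hat(z))            # A (Y - b_hat(Z))
    v = a_ind * a_hat(z) - 1.0             # A a_hat(Z) - 1
    full = (E.T @ u) @ (E.T @ v)           # sum over all pairs (i, j)
    diag = np.einsum('ij,ij,i,i->', E, E, u, v)   # the i = j terms
    return -(full - diag) / (n * (n - 1))
```

Adding `second_order_correction(z, a_ind, ya, b_hat, a_hat, k=50)` to the first order estimate from the previous snippet implements (5.4), up to the choice of basis.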

Let $\hat E_p$ and $\widehat{\mathrm{var}}_p$ denote conditional expectations given the observations used to construct $\hat p$, let $\|\cdot\|_r$ be the norm of $L_r(g)$, and let $\|\Pi\|_r$ denote the norm of an operator $\Pi\colon L_r(g) \to L_r(g)$.

Theorem 5.1

The estimator $\hat\chi_n$ given in (5.4) with influence functions $\chi_p^{(1)}$ and $\chi_p^{(2)}$ defined by (4.3) and (5.3), for $\Pi_p$ the kernel of an orthogonal projection in $L_2(g)$ onto a $k$-dimensional linear subspace, satisfies, for $r \ge 2$ (with $r/(r-2) = \infty$ if $r = 2$),

$$\hat E_p \hat\chi_n - \chi(p) = O_P\bigl(\|\Pi_p\|_r\,\|\Pi_{\hat p}\|_r\,\|\hat a - a\|_r\,\|\hat b - b\|_r\,\|\hat g - g\|_{r/(r-2)}\bigr) + O_P\bigl(\|(I - \Pi_p)(a - \hat a)\|_2\,\|(I - \Pi_p)(b - \hat b)\|_2\bigr),$$
$$\widehat{\mathrm{var}}_p\, \hat\chi_n = O_P\Bigl(\frac1n + \frac{k}{n^2}\Bigr).$$

The two terms in the bias result from having to estimate $p$ in the second order influence function (giving “third order bias”) and using an approximate influence function (leaving the remainders $I - \Pi_p$ after projection), respectively. The terms $1/n$ and $k/n^2$ in the variance appear as the variances of $\mathbb U_n\chi_p^{(1)}$ and $\mathbb U_n\chi_p^{(2)}$, the second being a degenerate second order U-statistic (giving $1/n^2$, see (13.1)) with a kernel of variance $k$.

The proof of the theorem is deferred to Section 10.1.

Assume now that the range space of the projections $\Pi_p$ can be chosen such that, for some constant $C$,

$$\|a - \Pi_p a\|_2 \le C\Bigl(\frac1k\Bigr)^{\alpha/d}, \qquad \|b - \Pi_p b\|_2 \le C\Bigl(\frac1k\Bigr)^{\beta/d}. \tag{5.5}$$

Furthermore, assume that there exist estimators $\hat a$, $\hat b$ and $\hat g$ that achieve convergence rates $n^{-\alpha/(2\alpha+d)}$, $n^{-\beta/(2\beta+d)}$ and $n^{-\gamma/(2\gamma+d)}$, respectively, in $L_r(g)$ and $L_{r/(r-2)}(g)$, uniformly over these a-priori models and a model for $g$ (e.g. for $r = 3$), and that the preceding displays also hold for $\hat a$ and $\hat b$. These assumptions are satisfied if the unknown functions $a$ and $b$ are “regular” of orders $\alpha$ and $\beta$ on a compact subset of $\mathbb R^d$ (see e.g. [32]). Then the estimator $\hat\chi_n$ of Theorem 5.1 attains the square rate of convergence

$$\Bigl(\frac1n\Bigr)^{2\alpha/(2\alpha+d) + 2\beta/(2\beta+d) + 2\gamma/(2\gamma+d)} \vee \Bigl(\frac1k\Bigr)^{(2\alpha+2\beta)/d} \vee \frac1n \vee \frac{k}{n^2}. \tag{5.6}$$

We shall see in the next section that the first of the four terms in this maximum can be made smaller by choosing an estimating equation of order higher than 2, while the other three terms arise at any order. This motivates determining a “second order optimal” value of $k$ by balancing the second, third and fourth terms. We next would use the second order estimator if $\gamma$ is large enough so that the first term is negligible relative to the other terms.

For $(\alpha+\beta)/2 \ge d/4$ we can choose $k = n$ and the resulting rate (the square root of (5.6)) is $n^{-1/2}$ provided that (5.1) holds. The latter condition is certainly satisfied under the sufficient condition (4.1) for the linear estimator to yield rate $n^{-1/2}$.

More interestingly, for $(\alpha+\beta)/2 < d/4$ we choose $k \sim n^{2d/(d+2\alpha+2\beta)}$ and obtain the rate, provided that (5.1) holds,

$$n^{-(2\alpha+2\beta)/(d+2\alpha+2\beta)}.$$

This rate is slower than $n^{-1/2}$, but better than the rate $n^{-\alpha/(2\alpha+d) - \beta/(2\beta+d)}$ obtained by the linear estimator. In [29] this rate is shown to be the fastest possible in the minimax sense, for the model in which $a$ and $b$ range over balls in $C^\alpha(\mathcal Z)$ and $C^\beta(\mathcal Z)$, and $g$ being known.
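As a quick check (ours, not in the original), one can verify that this choice of $k$ balances the second and fourth terms of (5.6) and produces the displayed rate:

```latex
\[
\Bigl(\tfrac1k\Bigr)^{(2\alpha+2\beta)/d} = \frac{k}{n^2}
\;\Longleftrightarrow\;
k^{1+(2\alpha+2\beta)/d} = n^2
\;\Longleftrightarrow\;
k \sim n^{2d/(d+2\alpha+2\beta)},
\]
\[
\text{and then}\quad
\frac{k}{n^2} = n^{2d/(d+2\alpha+2\beta)-2}
             = n^{-(4\alpha+4\beta)/(d+2\alpha+2\beta)},
\]
% the square of the rate n^{-(2alpha+2beta)/(d+2alpha+2beta)} above.
```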

In both cases the second order estimator is better than the linear estimator, but minimax only for sufficiently large $\gamma$. This motivates considering higher order estimators.

6. Approximate functional

Even though the functional of interest does not possess an exact second-order influence function, we might proceed to higher orders by differentiating the approximate second-order influence function $\chi_p^{(2)}$ given in (5.3), and balancing the various terms obtained. However, the formulas are much more transparent if we compute exact higher-order influence functions of an approximating functional instead. In this section we first define a suitable functional and next compute its influence functions.

Following the heuristics of Section 3.3, we define an approximate functional by equation (3.5), using a particular projection $p \mapsto \tilde p$ of the parameters. We choose this projection to map the parameters $a$ and $b$ onto finite-dimensional models and leave the parameter $g$ unaltered: $p$ is mapped into an element $\tilde p$ of the approximating model, or equivalently a triplet $(a, b, g)$ into a triplet $(\tilde a, \tilde b, g)$ in the approximating model for the three parameters (where $g$ is unaltered). (Even though this is not evident in the notation, the projection is joint in the three parameters: the induced maps $(a, b, g) \mapsto \tilde a$ and $(a, b, g) \mapsto \tilde b$ do not reduce to maps $a \mapsto \tilde a$ and $b \mapsto \tilde b$, but $\tilde a$ and $\tilde b$ depend on the full triplet $(a, b, g)$.)

As “model” for $(a, b)$ we consider the product of two affine linear spaces

$$(\hat a + \underline a L) \times (\hat b + \underline b L), \tag{6.1}$$

for a given finite-dimensional subspace $L$ of $L_2(\nu)$ and fixed functions $\hat a, \underline a, \hat b, \underline b\colon \mathcal Z \to \mathbb R$ that are bounded away from zero and infinity. (Later the functions $\hat a$ and $\hat b$ are taken equal to the preliminary estimators; one choice for the other functions is $\underline a = \underline b = 1$.) The pair $(\tilde a, \tilde b)$ of projections are defined as elements of the model (6.1) satisfying equation (3.7). In view of (4.4), for any path $p_t \leftrightarrow (a_t, b_t, g) = (\tilde a + t\underline a l,\ \tilde b + t\underline b l',\ g)$, for given $l, l' \in L$,

$$\chi(p_t) + P\chi_{p_t}^{(1)} = \chi(p) - \int \bigl(\tilde a + t\underline a l - a\bigr)\bigl(\tilde b + t\underline b l' - b\bigr)\, g\,d\nu. \tag{6.2}$$

Equation (3.7) requires that the derivative of this expression with respect to $t$ at $t = 0$ vanishes. Thus the functions $\tilde a$ and $\tilde b$ must be chosen to satisfy the set of stationary equations, for every $l, l' \in L$,

$$0 = \int (\tilde a - a)\,\underline b\, l'\, g\,d\nu = \int \Bigl(\frac{\tilde a - \hat a}{\underline a} - \frac{a - \hat a}{\underline a}\Bigr)\, l'\, \underline a \underline b\, g\,d\nu, \qquad l' \in L, \tag{6.3}$$
$$0 = \int \underline a\, l\,(\tilde b - b)\, g\,d\nu = \int \Bigl(\frac{\tilde b - \hat b}{\underline b} - \frac{b - \hat b}{\underline b}\Bigr)\, l\, \underline a \underline b\, g\,d\nu, \qquad l \in L. \tag{6.4}$$

Because the functions $(\tilde a - \hat a)/\underline a$ and $(\tilde b - \hat b)/\underline b$ are required to be in $L$, the second way of writing these equations shows that these two functions are the orthogonal projections of the functions $(a - \hat a)/\underline a$ and $(b - \hat b)/\underline b$ onto $L$ in $L_2(\underline a \underline b\, g)$.

As explained in Section 3.3, because it satisfies (3.7) the projection $(a, b, g) \mapsto (\tilde a, \tilde b, g)$ renders the first order influence function of the approximate functional $\tilde\chi$ equal to the first order influence function of $\chi$ evaluated at the projection. Furthermore, the difference between $\tilde\chi$ and $\chi$ is quadratic in the distance between $p$ and $\tilde p$ (see (3.6)). The following theorem summarizes the preceding and verifies these properties in the present concrete situation.

Theorem 6.1

For given measurable functions $\hat a, \underline a, \hat b, \underline b\colon \mathcal Z \to \mathbb R$ with $\underline a$ and $\underline b$ bounded away from zero and infinity, define a map $(a, b, g) \mapsto (\tilde a, \tilde b, g)$ by letting $(\tilde a - \hat a)/\underline a$ and $(\tilde b - \hat b)/\underline b$ be the orthogonal projections of $(a - \hat a)/\underline a$ and $(b - \hat b)/\underline b$ in $L_2(\underline a \underline b\, g)$ onto a closed subspace $L$. Let $\tilde p$ correspond to $(\tilde a, \tilde b, g)$ and define $\tilde\chi(p) = \chi(\tilde p) + P\chi_{\tilde p}^{(1)}$. Then $\tilde\chi$ has influence function

$$\tilde\chi_p^{(1)}(X) = A\tilde a(Z)\bigl(Y - \tilde b(Z)\bigr) + \tilde b(Z) - \tilde\chi(p). \tag{6.5}$$

Furthermore, for $\underline g = \underline a \underline b\, g$,

$$\bigl|\tilde\chi(p) - \chi(p)\bigr| \le \Bigl\|(I - \Pi_p)\frac{\hat a - a}{\underline a}\Bigr\|_{2,\underline g}\, \Bigl\|(I - \Pi_p)\frac{\hat b - b}{\underline b}\Bigr\|_{2,\underline g}.$$

Proof

The formula for the influence function agrees with the combination of equations (3.8) and (4.3), and can also be verified directly. In view of (3.5) and (4.4),

$$\tilde\chi(p) - \chi(p) = -\int (\tilde a - a)(\tilde b - b)\, g\,d\nu.$$

We rewrite the right side as an integral relative to $\underline g$, and next apply the Cauchy–Schwarz inequality. Finally we note that $(\tilde a - a)/\underline a = (\tilde a - \hat a)/\underline a - (a - \hat a)/\underline a = (I - \Pi_p)\bigl((\hat a - a)/\underline a\bigr)$, and similarly for $b$.

The approximation error $\tilde\chi(p) - \chi(p)$ can be rendered arbitrarily small by choosing the space $L$ large enough. Of course, we choose $L$ to be appropriate relative to a-priori assumptions on the functions $a$ and $b$. If these functions are known to belong to Hölder classes, then $L$ can for instance be chosen as the linear span of the first $k$ basis elements of a suitable orthonormal wavelet basis of $L_2(\nu)$.

To compute higher order influence functions of $\tilde\chi$ we recursively determine influence functions of influence functions, according to the algorithm of steps 1–3 in Section 3.1, starting with the influence function of $p \mapsto \tilde\chi_p^{(1)}(x_1) + \tilde\chi(p)$, for a fixed $x_1$. We defer the details of this derivation to Section 10.2, and summarize the result in the following theorem.

To simplify notation, define

$$\mathcal Y = A\bigl(Y - b(Z)\bigr)\,\underline a(Z), \qquad \mathcal A = \bigl(Aa(Z) - 1\bigr)\,\underline b(Z), \qquad \underline{\mathcal A} = A\,\underline a(Z)\,\underline b(Z). \tag{6.6}$$

These are the generic variables; indexed versions $\mathcal Y_i, \mathcal A_i, \underline{\mathcal A}_i$ are defined by adding an index to every variable in the equalities. With this notation and with $\underline a = \underline b = 1$ the second order influence function (5.3) at $p = \tilde p$ can be written as the symmetrization of $-2\,\mathcal Y_1\, \Pi_{\tilde p}(Z_1, Z_2)\,\mathcal A_2$. This function was derived in an ad-hoc manner as an approximate or partial influence function of $\chi$, but it is also the exact influence function of $\tilde\chi$. The higher order influence functions of $\tilde\chi$ possess an equally attractive form.

Theorem 6.2

An $m$th order influence function $\tilde\chi_p^{(m)}$ evaluated at $(X_1, \ldots, X_m)$ of the functional $\tilde\chi$ defined in Theorem 6.1 is the degenerate (in $L_2(p)$) part of the variable

$$(-1)^{m-1}\, m!\;\mathcal A_1\, \Pi_{1,2}\, \underline{\mathcal A}_2\, \Pi_{2,3}\, \underline{\mathcal A}_3\, \Pi_{3,4}\, \underline{\mathcal A}_4 \times\cdots\times \underline{\mathcal A}_{m-1}\, \Pi_{m-1,m}\, \mathcal Y_m.$$

Here $\Pi_{i,j}$ is the kernel of the orthogonal projection in $L_2(\underline a \underline b\, g)$ onto $L$, evaluated at $(Z_i, Z_j)$.

To obtain the degenerate part of the variable in the preceding theorem, we apply the general formula (2.1) together with Lemma 10.2. Assertions (i) and (ii) of the latter lemma show that the variable is already degenerate relative to $X_1$ and $X_m$, while assertion (iii) shows that integrating out the variable $X_i$ for $1 < i < m$ simply collapses $\Pi_{i-1,i}\,\underline{\mathcal A}_i\,\Pi_{i,i+1}$ into $\Pi_{i-1,i+1}$. For instance, with $S_m$ denoting symmetrization of a function of $m$ variables,

$$\tilde\chi_p^{(2)}(X_1, X_2) = -2\, S_2\bigl[\mathcal A_1 \Pi_{1,2} \mathcal Y_2\bigr],$$
$$\tilde\chi_p^{(3)}(X_1, X_2, X_3) = 6\, S_3\bigl[\mathcal A_1 \Pi_{1,2} \underline{\mathcal A}_2 \Pi_{2,3} \mathcal Y_3 - \mathcal A_1 \Pi_{1,3} \mathcal Y_3\bigr],$$
$$\tilde\chi_p^{(4)}(X_1, X_2, X_3, X_4) = -24\, S_4\bigl[\mathcal A_1 \Pi_{1,2} \underline{\mathcal A}_2 \Pi_{2,3} \underline{\mathcal A}_3 \Pi_{3,4} \mathcal Y_4 - \mathcal A_1 \Pi_{1,3} \underline{\mathcal A}_3 \Pi_{3,4} \mathcal Y_4 - \mathcal A_1 \Pi_{1,2} \underline{\mathcal A}_2 \Pi_{2,4} \mathcal Y_4 + \mathcal A_1 \Pi_{1,4} \mathcal Y_4\bigr]. \tag{6.7}$$

As shown on the left, but not on the right of these equations, these quantities depend on the unknown parameter $\tilde p = (\tilde a, \tilde b, g)$. In the right sides, the variables $\mathcal Y_i$ and $\mathcal A_i$ depend on $\tilde p$ through $\tilde b$ and $\tilde a$, and hence are not observable. Furthermore, the kernels $\Pi_{i,j}$ depend on $g$ as they are orthogonal projections in $L_2(\underline a \underline b\, g)$.
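For concreteness, the third order kernel of (6.7) (with $\underline a = \underline b = 1$, so that $\underline{\mathcal A}_i = A_i$) can be evaluated by brute force; this sketch is ours, reuses the cosine-basis assumption of the earlier snippets, and is $O(n^3)$, so it is meant for small subsamples only.

```python
import numpy as np

def third_order_u_statistic(z, a_ind, ya, b_hat, a_hat, k):
    """(1/3!) U_n of 6 S_3[A_1 (Pi12 A_2 Pi23 - Pi13) Y_3] from
    (6.7), as the mean over ordered distinct triples (i, j, l)."""
    n = len(z)
    m = np.arange(1, k)
    E = np.hstack([np.ones((n, 1)),
                   np.sqrt(2.0) * np.cos(np.pi * np.outer(z, m))])
    Pi = E @ E.T                            # projection kernel matrix
    A_cal = a_ind * a_hat(z) - 1.0          # the variables 'A'
    Y_cal = a_ind * (ya - b_hat(z))         # the variables 'Y'
    total = 0.0
    for i in range(n):
        for j in range(n):
            for l in range(n):
                if i != j and j != l and i != l:
                    total += A_cal[i] * (Pi[i, j] * a_ind[j] * Pi[j, l]
                                         - Pi[i, l]) * Y_cal[l]
    return total / (n * (n - 1) * (n - 2))
```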

7. Parametric rate ((α + β)/2 ≥ d/4)

In this section we show that the parameter $\chi(p)$ is estimable at $1/\sqrt n$-rate provided the average smoothness $(\alpha+\beta)/2$ is at least $d/4$. We achieve this using the estimator

$$\hat\chi_n = \chi(\hat p) + \mathbb U_n\Bigl(\tilde\chi_{\hat p}^{(1)} + \frac{1}{2!}\,\tilde\chi_{\hat p}^{(2)} + \cdots + \frac{1}{m!}\,\tilde\chi_{\hat p}^{(m)}\Bigr), \tag{7.1}$$

with the influence functions $\tilde\chi_p^{(j)}$ those of the approximate functional $\tilde\chi$ in Section 6: they are given in Theorems 6.1 and 6.2 for $j = 1$, and $j = 2, \ldots, m$, respectively. (Because the map $p \mapsto \tilde p$ maps $\hat p$ into itself, the influence function for $j = 1$ in the display is also the first order influence function (6.5) of $\tilde\chi$, when evaluated at $p = \hat p$.)

We assume that the projections $\Pi_p$ and $\Pi_{\hat p}$ map $L_s(\underline a \underline b\, g)$ to $L_s(\underline a \underline b\, g)$, for every $s \in [r/(r-1), r]$, with uniformly bounded norms. (For $r = 2$ this entails only $s = 2$; in this case we define $r/(r-2) = \infty$.)

Theorem 7.1

The estimator (7.1), with $\Pi_p$ a kernel of an orthogonal projection in $L_2(\underline a \underline b\, g)$ satisfying (12.1) with $\sup_x \Pi_p(x, x) \lesssim k$, satisfies, for a constant $c$ that depends on $p/\hat p$ only, and $r \ge 2$,

$$\hat E_p \hat\chi_n - \chi(p) = O\bigl(\|\hat a - a\|_r\, \|\hat b - b\|_r\, \|\hat g - g\|_{(m-1)r/(r-2)}^{m-1}\bigr) + O\Bigl(\Bigl\|(I - \Pi_p)\frac{\hat a - a}{\underline a}\Bigr\|_2\, \Bigl\|(I - \Pi_p)\frac{\hat b - b}{\underline b}\Bigr\|_2\Bigr),$$
$$\widehat{\mathrm{var}}_p\, \hat\chi_n \lesssim \sum_{j=1}^m \binom{n}{j}^{-1} c^j\, k^{j-1}.$$

The first term in the bias is of the order $1 + 1 + (m-1) = m+1$ in the estimation errors, as is to be expected for an estimator based on an $m$th order influence function; the second term is due to estimating $\tilde\chi$ rather than $\chi$; it is independent of $m$, and the same as in Theorem 5.1 if $\underline a = \underline b = 1$. The bound on the variance can roughly be understood in that each of the degenerate U-statistics $\mathbb U_n \tilde\chi_{\hat p}^{(j)}$ in (7.1) contributes a term of order $k^{j-1}/n^j$.

For $\alpha$-, $\beta$- and $\gamma$-regular parameters $a$, $b$, $g$ on a $d$-dimensional domain the range space of the projections $\Pi_p$ can be chosen so that (5.5) holds and such that there exist estimators $\hat a, \hat b, \hat g$ of $a, b, g$, with the first two taking values in this range space, with convergence rates $n^{-\alpha/(2\alpha+d)}$, $n^{-\beta/(2\beta+d)}$ and $n^{-\gamma/(2\gamma+d)}$. Then the second term in the bias (with $\underline a = \underline b = 1$) is of order $(1/k)^{\alpha/d + \beta/d}$. If $(\alpha+\beta)/2 \ge d/4$ and we choose $k = n$, then this is of order $1/\sqrt n$. For $k = n$ the standard deviation of the resulting estimator is also of the order $1/\sqrt n$, while the first term in the bias can be made arbitrarily small by choosing a sufficiently large order $m$. Specifically, the estimator $\hat\chi_n$ attains a $\sqrt n$-rate of convergence as soon as

$$m - 1 \ge \Bigl(\frac12 - \frac{\alpha}{2\alpha+d} - \frac{\beta}{2\beta+d}\Bigr)\Bigl(\frac{2\gamma+d}{\gamma}\Bigr). \tag{7.2}$$

For any $\gamma > 0$ there exists an order $m$ that satisfies this, and hence the parameter is $\sqrt n$-estimable as soon as $(\alpha+\beta)/2 \ge d/4$.

More ambitiously, we may aim at attaining the parametric rate for every γ > 0, without a-priori knowledge of γ. This can be achieved if (α + β)/2 > d/4 by using orders m = mn that increase to infinity with the sample size. In this case the estimator can also be shown to be asymptotically efficient in the semiparametric sense.

Theorem 7.2

If $(\alpha+\beta)/2 > d/4$, then the estimator (7.1), with $m = \log n$ and $\Pi_p$ a kernel of an orthogonal projection in $L_2(\underline a \underline b\, g)$ on a $k = n/(\log n)^2$-dimensional space satisfying (5.5) and (12.1) with $\sup_x \Pi_p(x, x) \lesssim k$, based on preliminary estimators $\hat a, \hat b, \hat g$ that attain the rates $(\log n/n)^{\delta/(2\delta+d)}$, for $\delta$ the corresponding smoothness level, relative to the uniform norm, satisfies

$$\sqrt n\,\Bigl(\hat\chi_n - \chi(p) - \mathbb P_n \chi_p^{(1)}\Bigr) \xrightarrow{\;P\;} 0.$$

An estimator that is asymptotically linear in the first order efficient influence function, as in the theorem, is asymptotically optimal in terms of the local asymptotic minimax and convolution theorems (see e.g. [40], Chapter 25). The present estimator $\hat\chi_n$ actually loses its efficiency by splitting the sample in a part used to construct the preliminary estimators and a part to form $\mathbb P_n$. This can easily be remedied by crossing over the two parts of the split, and taking the average of the two estimators so obtained. By the theorem these are both asymptotically linear in their sample, and hence their average is asymptotically linear in the full sample and asymptotically efficient.

The proofs of the theorems are deferred to Section 10.3.

8. Minimax rate at lower smoothness ((α + β)/2 < d/4)

If the average a-priori smoothness $(\alpha+\beta)/2$ of the functions $a$ and $b$ falls below $d/4$, then the functional $\chi$ can no longer be estimated at the parametric rate ([29]). The estimator (7.1) of Theorem 7.1 can still be used and, with its bias and variance as given in the theorem properly balanced, attains a certain rate of convergence, faster than the current state-of-the-art linear estimators. However, in this section we present an estimator that is always better, and attains the minimax rate of convergence $n^{-(2\alpha+2\beta)/(2\alpha+2\beta+d)}$ provided that the parameter $g$ is sufficiently regular.

This estimator takes the same general form

$$\hat\chi_n = \chi(\hat p) + \mathbb U_n\Bigl(\tilde\chi_{\hat p}^{(1)} + \frac{1}{2!}\,\tilde\chi_{\hat p}^{(2)} + \cdots + \frac{1}{m!}\,\check\chi_{\hat p}^{(m)}\Bigr), \tag{8.1}$$

as the estimator (7.1), but the influence functions $\check\chi_p^{(j)}$ for $j \ge 3$ will be different. The idea is to “cut out” certain terms from the influence functions in (7.1) in order to decrease the variance, but without increasing the bias. For clarity we first consider the third order estimator, and next extend to the general $m$th order. To attain the minimax rate the order $m$ must be fixed to a large enough value so that the first term in the bias given in Theorem 7.1 is no larger than $n^{-(2\alpha+2\beta)/(2\alpha+2\beta+d)}$. (Apart from added complexity there is no loss in choosing $m$ larger than needed.)

The third order kernel $\tilde\chi_p^{(3)}$ in (6.7) is the symmetrization of the variable

$$6\,\mathcal A_1\Bigl(\Pi_p(Z_1, Z_2)\,\underline{\mathcal A}_2\,\Pi_p(Z_2, Z_3) - \Pi_p(Z_1, Z_3)\Bigr)\,\mathcal Y_3.$$

Here $\Pi_p$ is the kernel of an orthogonal projection in $L_2(\underline a \underline b\, g)$ onto a $k$-dimensional linear space, which we may view as the sum of $k$ projections onto one-dimensional spaces. The quantity $k^2$ in the order $O(k^2/n^3)$ of the variance in Theorem 7.1 for $m = 3$ arises as the number of terms in the product $\Pi_p(Z_1, Z_2)\,\underline{\mathcal A}_2\,\Pi_p(Z_2, Z_3)$ of the two $k$-dimensional projection kernels. It turns out that this order can be reduced without increasing the bias by cutting out “products of projections on higher base elements”.

To make this precise, we partition the projection space in blocks, and decompose the two projections in the influence function over the blocks:

$$\Pi_p = \sum_{r=0}^R \Pi_p^{(k_{r-1}, k_r]}, \qquad \Pi_p = \sum_{s=0}^S \Pi_p^{(l_{s-1}, l_s]}. \tag{8.2}$$

Here $\Pi_p^{(m, n]}$ is the projection onto the subspace spanned by the base elements with index in the interval $(m, n]$, and $1 = k_{-1} < k_0 < k_1 < \cdots < k_R = k$ and $1 = l_{-1} < l_0 < l_1 < \cdots < l_S = k$ are suitable partitions of the set $\{1, \ldots, k\}$. (“Full” partitions in singleton sets would make the construction conceptually simpler, but a small number of blocks will be needed in our proofs.) The product of the two kernels now becomes a double sum, from which we retain only terms with small values of $(r, s)$. The improved third order influence function is, with as before $S_3$ denoting symmetrization,

$$\check\chi_p^{(3)}(X_1, X_2, X_3) = 6\, S_3\Biggl[\sum_{\substack{(r,s)\colon r+s \le D\\ r \ge 0,\, s \ge 0}} \mathcal A_1\Bigl(\Pi_p^{(k_{r-1}, k_r]}(Z_1, Z_2)\,\underline{\mathcal A}_2\,\Pi_p^{(l_{s-1}, l_s]}(Z_2, Z_3) - \Pi_p^{(k_{r-1} \vee l_{s-1},\, k_r \wedge l_s]}(Z_1, Z_3)\Bigr)\mathcal Y_3\Biggr]. \tag{8.3}$$

The negative term in the display is the conditional expectation given $Z_1, Z_3$ of the leading term, and maintains the degeneracy of the kernel. For the decomposition (8.2) to be valid, the subspaces corresponding to the blocks must be orthogonal in $L_2(\underline a \underline b\, g)$. We may achieve this by starting with a standard basis $e_1, e_2, \ldots$, with good approximation properties for a target model, and next replacing this by an orthonormal basis in $L_2(\underline a \underline b\, g)$ by the Gram–Schmidt procedure. For a bounded $g$ the approximation properties will be preserved.

The grids are defined by

$$k_{-1} = 1, \qquad k_r \sim n\, 2^{r/\alpha}, \qquad r = 0, \ldots, R, \tag{8.4}$$
$$l_{-1} = 1, \qquad l_s \sim n\, 2^{s/\beta}, \qquad s = 0, \ldots, S, \tag{8.5}$$

where $R$ and $S$ are chosen such that $k_R \sim l_S \sim k$ (note that $k_0 = l_0 = n$). In these definitions the notation $\sim$ means “equal up to a fixed multiple” (needed to allow that $k_r$ and $l_s$ are (dyadic) integers). For ease of notation let $l_s = l_{-1}$ for $s \le -1$, and $l_s = l_S$ for $s \ge S$.

The grids $k_0 < k_1 < \cdots < k_R$ and $l_0 < l_1 < \cdots < l_S$ partition the integers $n, n+1, \ldots, k$ into $R$ and $S$ groups. As $k_r^\alpha\, l_s^\beta = 2^{r+s}\, n^{\alpha+\beta}$, for every $r, s \ge 0$, the cut-off $r + s \le D$ in (8.3) is delimited by the “hyperbola” $i^\alpha j^\beta \sim 2^D n^{\alpha+\beta}$ in the space of indices $(i, j) \in \{1, \ldots, k\}^2$ of base elements used in the two kernels, with only the pairs below the hyperbola retained (see Figure 1). The intuition behind this hyperbolic cut-off is the product form of the bias (4.4): a higher order correction on the estimator of $a$ may combine with a lower order correction on $b$, and vice versa, to give an overall correction of the desired order. The overall bias is smaller if the cut-off $D$ is chosen larger, but then more terms are included in the estimator and the variance will be bigger.

Fig 1. Both axes carry the indices of the basis functions spanning the projection space $L$, and a point in the plane refers to a product of two projections. Products of projections on pairs of basis functions in the shaded area are included in the third order influence function. The step function refers to the partitions of the indices as in (8.2).
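A direct transcription of the grids (8.4)–(8.5) and the hyperbolic cut-off (our sketch; the capping at $k$ and the rounding are implementation choices):

```python
import math

def retained_blocks(n, alpha, beta, D, k):
    """Index blocks ((k_{r-1}, k_r], (l_{s-1}, l_s]) with r + s <= D,
    for the grids k_r ~ n 2^(r/alpha) and l_s ~ n 2^(s/beta)."""
    def grid(exponent):
        g, r = [1], 0                  # g[0] is the value at index -1
        while g[-1] < k:
            g.append(min(k, math.ceil(n * 2 ** (r / exponent))))
            r += 1
        return g
    ks, ls = grid(alpha), grid(beta)
    R, S = len(ks) - 2, len(ls) - 2
    return [((ks[r], ks[r + 1]), (ls[s], ls[s + 1]))
            for r in range(R + 1) for s in range(S + 1) if r + s <= D]

# Blocks below the hyperbola i^alpha j^beta ~ 2^D n^(alpha+beta):
print(retained_blocks(n=1000, alpha=0.5, beta=0.75, D=3, k=10**5))
```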

Before deriving an optimal value of $D$, we introduce the $m$th order estimator for general $m \ge 3$. Again we take the estimator of Theorem 7.1 as starting point, but modify the higher order influence functions $\tilde\chi_p^{(j)}$, for $j = 4, \ldots, m$, similarly to, and in addition to, the modification of the third order influence function. For given $j$ the former influence function is given in Theorem 6.2 (with $m$ of the theorem taken equal to $j$), and is based on a product of $j - 1$ projection kernels. We modify this in two steps. For each of the $j - 2$ contiguous pairs of kernels ((1st, 2nd), (2nd, 3rd), …, (($j-2$)th, ($j-1$)th)) we form a new kernel by truncating the pair at the hyperbola as described previously for the third order kernel, and truncating all other kernels at $n$. Next the modified $j$th order kernel is the sum of the resulting $j - 2$ kernels. More formally, the modified $j$th order kernel is equal to

$$\check\chi_p^{(j)}(X_1, \ldots, X_j) = \sum_{i=1}^{j-2} \check\chi_p^{(j,i)}(X_1, \ldots, X_j), \tag{8.6}$$

where $\check\chi_p^{(j,i)}(X_1, \ldots, X_j)$ is the symmetrized, degenerate (relative to $L_2(p)$) part of the variable, for $i = 1, \ldots, j-2$, written in the notation of Theorem 6.2,

$$(-1)^{j-1}\, j!\;\mathcal Y_1\, \Pi_{1,2}^{(0,n]}\, \underline{\mathcal A}_2 \times\cdots\times \underline{\mathcal A}_{i-1}\, \Pi_{i-1,i}^{(0,n]}\, \underline{\mathcal A}_i \Biggl[\,\sum_{\substack{(r,s)\colon r+s \le D\\ r \ge 0,\, s \ge 0}} \Pi_{i,i+1}^{(k_{r-1}, k_r]}\, \underline{\mathcal A}_{i+1}\, \Pi_{i+1,i+2}^{(l_{s-1}, l_s]}\Biggr]\, \underline{\mathcal A}_{i+2}\, \Pi_{i+2,i+3}^{(0,n]} \times\cdots\times \underline{\mathcal A}_{j-1}\, \Pi_{j-1,j}^{(0,n]}\, \mathcal A_j.$$

For j = 3 there is only one pair of kernels, and the construction reduces to the modification (8.3) as discussed previously.

We assume that the projections $\Pi_p^{(0,l]}$ and $\Pi_{\hat p}^{(0,l]}$ map $L_s(\underline a \underline b\, g)$ to $L_s(\underline a \underline b\, g)$, for every $s \in [r/(r-1), r]$, with uniformly bounded norms.

Theorem 8.1

The estimator (8.1) for $m \ge 3$ with the influence functions $\tilde\chi_p^{(j)}$ given in (6.5) and (6.7) for $j = 1, 2$, respectively, and the kernels $\check\chi_p^{(j)}$ given in (8.6) for $j \ge 3$, and with $\Pi_p^{(0,l]}$ kernels of orthogonal projections in $L_2(\underline a \underline b\, g)$ satisfying (12.1) with $\sup_x \Pi_{\hat p}^{(0,l]}(x, x) \lesssim l$, satisfies, for $r \ge 2$ (and $r/(r-2) = \infty$ if $r = 2$),

$$\hat E_p \hat\chi_n - \chi(p) = O\bigl(\|\hat a - a\|_r\, \|\hat b - b\|_r\, \|\hat g - g\|_{mr/(r-2)}^{m-1}\bigr) + O\Bigl(\Bigl\|(I - \Pi_p^{(0,k]})\frac{\hat a - a}{\underline a}\Bigr\|_2\, \Bigl\|(I - \Pi_p^{(0,k]})\frac{\hat b - b}{\underline b}\Bigr\|_2\Bigr)$$
$$\qquad +\ O\Biggl(\sum_{r=1}^R \Bigl\|(I - \Pi_{\hat p}^{(0,k_{r-1}]})\Bigl(\frac{\hat a - a}{\underline a}\Bigr)\Bigr\|_r\, \Bigl\|(I - \Pi_{\hat p}^{(0,l_{D-r}]})\Bigl(\frac{\hat b - b}{\underline b}\Bigr)\Bigr\|_r\, \|\hat g - g\|_{r/(r-2)}\Biggr)$$
$$\qquad +\ O\Bigl(R\,\Bigl\|(I - \Pi_{\hat p}^{(0,n]})\frac{\hat a - a}{\underline a}\Bigr\|_r\, \Bigl\|(I - \Pi_{\hat p}^{(0,n]})\frac{\hat b - b}{\underline b}\Bigr\|_r\, \|\hat g - g\|_{mr/(r-2)}^2\Bigr),$$
$$\widehat{\mathrm{var}}_p\, \hat\chi_n \lesssim \frac1n + \frac{k}{n^2} + D\,\frac{2^{(1/\alpha \vee 1/\beta)\, D}}{n}.$$

A proof of the theorem is presented in Sections 10.4 and 10.5.

The first two terms in the bias are the same as in Theorem 7.1; the third and fourth terms are the price paid for cutting out terms from the influence function. The benefit is a reduced variance. We shall show that the boundary parameter D can be chosen such that the third term in the variance (resulting from the third and higher order parts of the influence function) is not bigger than the second term, while the increase in bias is negligible.

Assume that the functions $a$ and $b$ and their estimates are known to belong to models that are well approximated by the base functions $e_1, e_2, \ldots$ in the sense that, for $\bar p \in \{p, \hat p\}$, and every value $l$ in one of the two grids (8.4)–(8.5),

$$\Bigl\|(I - \Pi_{\bar p}^{(0,l]})\Bigl(\frac{\hat a - a}{\underline a}\Bigr)\Bigr\|_r \lesssim \Bigl(\frac1l\Bigr)^{\alpha/d}, \tag{8.7}$$
$$\Bigl\|(I - \Pi_{\bar p}^{(0,l]})\Bigl(\frac{\hat b - b}{\underline b}\Bigr)\Bigr\|_r \lesssim \Bigl(\frac1l\Bigr)^{\beta/d}. \tag{8.8}$$

Then the second term in the bias is of the order $(1/k)^{\alpha/d + \beta/d}$, as in Theorem 7.1, which is smaller than the minimax rate $n^{-(2\alpha+2\beta)/(2\alpha+2\beta+d)}$ for

$$k \sim n^{2d/(2\alpha+2\beta+d)}. \tag{8.9}$$

With this choice of $k$, the upper bound on the variance is of the order of the square minimax rate $n^{-(4\alpha+4\beta)/(2\alpha+2\beta+d)}$ if $D$ is chosen to satisfy

$$2^{(1/\alpha \vee 1/\beta)\, D} \sim \frac{1}{\log n}\, n^{(d - 2\alpha - 2\beta)/(d + 2\alpha + 2\beta)}. \tag{8.10}$$

Furthermore, under (8.9) the numbers $R$, $S$ of grid points are of the order $\log n$.

In the third term of the bias we apply assumptions (8.7)–(8.8) and the identity $k_{r-1}^\alpha\, l_{D-r}^\beta \sim n^{\alpha+\beta}\, 2^D$, which results from (8.4)–(8.5), to see that the third term of the bias is of order

$$\sum_{r=1}^R \Bigl(\frac{1}{k_{r-1}}\Bigr)^{\alpha/d} \Bigl(\frac{1}{l_{D-r}}\Bigr)^{\beta/d}\, \|\hat g - g\|_{r/(r-2)} \lesssim R\,\Bigl(\frac{1}{n^{\alpha+\beta}\, 2^D}\Bigr)^{1/d}\, \|\hat g - g\|_{r/(r-2)}.$$

If the convergence rate of $\hat g$ is $n^{-\gamma/(2\gamma+d)}$, then, for the choice of $D$ given in (8.10), this can (by a calculation) be seen to be of smaller order than the minimax rate $n^{-(2\alpha+2\beta)/(2\alpha+2\beta+d)}$ if $\gamma$ is large enough that

$$\frac{\gamma}{2\gamma+d} > \Bigl(\frac{\alpha \vee \beta}{d}\Bigr)\Bigl(\frac{d - 2\alpha - 2\beta}{d + 2\alpha + 2\beta}\Bigr). \tag{8.11}$$

The fourth term in the bias can by a similar analysis be seen to be of the order

$$R\,\Bigl(\frac1n\Bigr)^{\alpha/d}\Bigl(\frac1n\Bigr)^{\beta/d}\, \|\hat g - g\|_{(m-2)r/(r-2)}^2.$$

Again this is smaller than the minimax rate if γ satisfies assumption (8.11).

Finally, if the convergence rates of $\hat a$ and $\hat b$ are $n^{-\alpha/(2\alpha+d)}$ and $n^{-\beta/(2\beta+d)}$, then the first term in the upper bound of the bias is of the order

$$\Bigl(\frac1n\Bigr)^{\alpha/(2\alpha+d) + \beta/(2\beta+d) + (m-1)\gamma/(2\gamma+d)}.$$

We choose m large enough so that this is of smaller order than the preceding terms. In particular, we can choose it so that this is smaller than the minimax rate.

We summarize this in the following corollary, which is the most advanced result of the paper.

Corollary 8.1

If (8.7)–(8.11) hold, and the $\Pi_p^{(0,l]}$ are kernels of orthogonal projections in $L_2(\underline a \underline b\, g)$ satisfying (12.1) with $\sup_x \Pi_{\hat p}^{(0,l]}(x, x) \lesssim l$, then the $m$th order estimator with the kernels (8.6) for $j \ge 3$, sufficiently large $m$ and suitable initial estimators attains the rate $n^{-(2\alpha+2\beta)/(2\alpha+2\beta+d)}$ for estimating $\chi(p)$.

9. Other examples

In this section we briefly indicate a number of other examples for which our general heuristics have been worked out, leading to well known or novel estimators.

9.1. Density estimation

Consider estimating a density $\chi(p) = p(a)$ at a fixed point $a$ based on a random sample from $p$. A first order influence function of this functional would satisfy, for every smooth path $t \mapsto p_t$ with score function $g$ at $t = 0$,

$$\int \chi_p^{(1)}\, g\, p\,d\mu = \frac{d}{dt}\bigg|_{t=0}\chi(p_t) = g(a)\, p(a).$$

In a nonparametric situation every zero-mean function $g$ arises as a score function, and hence $\chi_p^{(1)}$ would have to be a “Dirac function at $a$”. Because this does not exist (except for very special $p$), in this example already a first order influence function fails to exist.

We may approximate the Dirac function by the function $x \mapsto \Pi(a, x)$ for $\Pi$ the kernel of an orthogonal projection onto a given (large) subspace $L$ of $L_2(\mu)$. Because $\int \Pi(a, x)\, g(x)\, p(x)\,d\mu(x) = g(a)\, p(a)$ for every function $g$ such that $gp \in L$, the function $x \mapsto \Pi(a, x)$ achieves representation for a large set of scores. The corresponding degenerate version is $x \mapsto \Pi(a, x) - \Pi p(a)$, for $\Pi p = \int \Pi(\cdot, x)\, p(x)\,d\mu(x)$ the projection of $p$. The corresponding first order estimator (3.1) is

$$\hat\chi_n = \chi(\hat p_n) + \mathbb P_n\bigl(\Pi(a, \cdot) - \Pi \hat p_n(a)\bigr) = \mathbb P_n \Pi(a, \cdot) + \bigl((I - \Pi)\hat p_n\bigr)(a).$$

If $\hat p_n \in L$, then the second term vanishes and the estimator reduces to $\mathbb P_n \Pi(a, \cdot)$. This is the usual projection estimator (cf. [25, 32]): if $L$ is spanned by the orthonormal set $e_1, e_2, \ldots, e_k$, then $\Pi(x_1, x_2) = \sum_{i=1}^k e_i(x_1)\, e_i(x_2)$ and $\hat\chi_n = \sum_{i=1}^k (\mathbb P_n e_i)\, e_i(a)$.
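In code, the projection estimator takes just a few lines (our sketch, using the cosine basis on $[0,1]$; the function name is ours):

```python
import numpy as np

def projection_density_estimate(x, a, k):
    """chi_hat = sum_i (P_n e_i) e_i(a) for the first k cosine basis
    functions, estimating the density p at the point a."""
    j = np.arange(1, k)
    coef = np.concatenate(
        ([1.0],                                          # P_n e_0
         np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j)).mean(axis=0)))
    e_at_a = np.concatenate(([1.0], np.sqrt(2.0) * np.cos(np.pi * j * a)))
    return coef @ e_at_a

rng = np.random.default_rng(3)
x = rng.beta(2, 2, size=10_000)          # density 6a(1-a) on [0,1]
print(projection_density_estimate(x, a=0.5, k=20))   # ~ 1.5
```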

As an alternative to viewing $x \mapsto \Pi(a, x)$ as an approximation to the “ideal” influence function, we can derive it as the exact influence function of the approximate functional $\tilde\chi(p) = \chi(\Pi p)$.

9.2. Quadratic functionals

Consider estimating the functional $\chi(p) = \int p^2\,d\mu$ based on a random sample of size $n$ from the density $p$.

The first order influence function of this functional exists on the full nonparametric model, and can be seen to take the form

$$\chi_p^{(1)}(x)=2\big(p(x)-\chi(p)\big).$$
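This form can be verified directly (a one-line check we add for completeness): along a path $t\mapsto p_t$ with score g,

$$\frac{d}{dt}\Big|_{t=0}\int p_t^2\,d\mu=2\int p\,g\,p\,d\mu=\int 2\big(p-\chi(p)\big)\,g\,p\,d\mu,$$

because $\int g\,p\,d\mu=0$ for a score function g.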

By the algorithm [1]–[3] of Section 3.1 a second order influence function can be computed as the degenerate part of an influence function of the functional $p\mapsto\bar\chi_p^{(1)}(x_1)=2p(x_1)$, for fixed $x_1$. As seen in Section 9.1, point evaluation is not a differentiable functional, but it has the kernel Π of an orthogonal projection in $L_2(\mu)$ as an approximate influence function. Thus an approximate second order influence function of the present functional, minus its projection onto the degenerate functions, is given by

$$\chi_p^{(2)}(x_1,x_2)=2\Pi(x_1,x_2)-2\Pi p(x_1)-2\Pi p(x_2)+2\int(\Pi p)^2\,d\mu.$$

This may also be derived as an exact influence function of the approximate functional $p\mapsto\chi(\Pi p)$.

It can be checked that the estimator (7.1) for m = 2, given an initial estimator $\hat p_n$ that is contained in the range of Π, reduces to $\hat\chi_n=U_n\Pi$, a well known estimator ([17]).
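A minimal numerical sketch of $\hat\chi_n=U_n\Pi$ (again with our own illustrative choice of basis and sampling density; the function name is hypothetical):

```python
import numpy as np

def quadratic_functional_estimate(sample, k):
    """U-statistic U_n Pi estimating int p^2 on [0,1], with Pi the kernel of
    the projection onto the first k cosine basis functions:
    U_n Pi = binom(n,2)^{-1} sum_{i<j} Pi(X_i, X_j)."""
    n = sample.size
    E = np.stack([np.ones(n)] + [np.sqrt(2) * np.cos(i * np.pi * sample)
                                 for i in range(1, k)])        # k x n
    s = E.sum(axis=1)                                          # sum_i e_b(X_i)
    # off-diagonal pair sum equals sum_b s_b^2 minus the diagonal contribution
    return float(((s ** 2).sum() - (E ** 2).sum()) / (n * (n - 1)))

rng = np.random.default_rng(0)
x = rng.beta(2.0, 3.0, size=5000)     # here int p^2 = 48/35 = 1.3714...
print(quadratic_functional_estimate(x, k=15))
```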

9.3. Doubly robust models

The heuristics described in Section 3 ought to be applicable in a wide range of estimation problems, but the detailed treatment of the missing data problem in Sections 4–8 shows that their implementation can be involved. Inspection of the proofs reveals that the particular implementation in the latter sections is based on the structure (4.3) of the first order influence function in the missing data problem. The argument extends to semiparametric models with first order influence function of the form

$$\chi_p^{(1)}(x)=a(z)b(z)\,S_1(x)+a(z)\,S_2(x)+b(z)\,S_3(x)+S_4(x)-\chi(p), \qquad (9.1)$$

for known functions Si(x) of the data (i.e. S = (S1, S2, S3, S4) is a given statistic). The full parameter may be a quadruplet p ↔ (a, b, c, f), in which f is the marginal density of an observable covariate Z, and c does not appear in (9.1). Other examples of this structure are described in [27, 37].
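As a concrete instance (our illustration: we write the missing data influence function in its familiar AIPW form, with δ denoting the response indicator, notation not used elsewhere in this section), $\chi_p^{(1)}(x)=\delta\,a(z)\big(y-b(z)\big)+b(z)-\chi(p)$ matches (9.1) with

$$S_1(x)=-\delta,\qquad S_2(x)=\delta\,y,\qquad S_3(x)=1,\qquad S_4(x)=0.$$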

10. Proofs

10.1. Proof of Theorem 5.1

Write $\hat\Pi$ and $\Pi$ for $\Pi_{\hat p}$ and $\Pi_p$, respectively, for both the kernels and the corresponding projection operators, and drop the subscript p also in $\hat E_p$ and $\widehat{\operatorname{var}}_p$. From (4.4) and (5.3) we have

$$\hat E\hat\chi_n-\chi(p)=-\int(\hat a-a)(\hat b-b)\,g\,d\nu-\hat E\,A_1\big(Y_1-\hat b(Z_1)\big)\big(A_2\hat a(Z_2)-1\big)\hat\Pi(Z_1,Z_2)=-\int(\hat a-a)(\hat b-b)\,g\,d\nu+\iint\big[(\hat a-a)\times(\hat b-b)\big](g\times g)\,\hat\Pi\;d(\nu\times\nu).$$

The double integral on the far right with $\hat\Pi$ replaced by $\Pi$ can be written as the single integral $\int(\hat a-a)\,\Pi(\hat b-b)\,g\,d\nu$, for $\Pi(\hat b-b)$ the image of $\hat b-b$ under the projection Π. Added to the first integral on the right this gives $-\int(\hat a-a)(I-\Pi)(\hat b-b)\,g\,d\nu$, which is bounded in absolute value by the second term in the upper bound for the bias.

Replacement of $\hat\Pi$ by $\Pi$ in the double integral gives a difference

$$\iint\big[(\hat a-a)\times(\hat b-b)\big](g\times g)\big(\hat\Pi-\Pi\big)\,d(\nu\times\nu)=\int(\hat a-a)\Big(\hat\Pi\big((\hat b-b)\tfrac{g}{\hat g}\big)-\Pi(\hat b-b)\Big)\,g\,d\nu\le\|\hat a-a\|_s\,\Big\|\hat\Pi\big((\hat b-b)\tfrac{g}{\hat g}\big)-\Pi(\hat b-b)\Big\|_{r,\hat g}\,\big\|g/\hat g\big\|_\infty^{1/r},$$

by Hölder's inequality, for a conjugate pair (r, s). Considering $\hat\Pi$ as the projection in $L_2(\hat g)$ with weight 1, and $\Pi$ as the weighted projection in $L_2(\hat g)$ with weight function $\hat w=g/\hat g$, we can apply Lemma 12.7(i) (with q = s/r and rp = s/(s − 2)) to see that this is bounded in absolute value by

$$\|\hat a-a\|_s\,\|\hat\Pi\|_{s,\hat g}\,\|\Pi\|_{s,\hat g}\,\|\hat b-b\|_{s,\hat g}\,\|\hat w-1\|_{s/(s-2),\hat g}\,\|\hat w\|_\infty^{1/r}.$$

Because $\hat w$ is assumed bounded away from 0 and infinity, this is of the same order as the first term in the upper bound on the bias (if r replaces s).

Because the function $\chi_{\hat p}^{(1)}$ is uniformly bounded, the (conditional) variance of $U_n\chi_{\hat p}^{(1)}$ is of the order O(1/n). Thus for the variance bound it suffices to consider the (conditional) variance of $U_n\chi_{\hat p}^{(2)}$. In view of Lemma 13.1 and (13.1) this is bounded above by a multiple of

$$\Big(1+\|p/\hat p\|_\infty\Big)^2\,\hat P^n\big(U_n\chi_{\hat p}^{(2)}\big)^2=\Big(1+\|p/\hat p\|_\infty\Big)^2\,\binom{n}{2}^{-1}\hat P^2\big(\chi_{\hat p}^{(2)}\big)^2.$$

The variables $A\big(Y-\hat b(Z)\big)$ and $\big(A\hat a(Z)-1\big)$ are uniformly bounded. Hence the last term on the right is bounded above by a multiple of $n^{-2}\iint\hat\Pi^2\,(\hat g\times\hat g)\,d(\nu\times\nu)$, which is equal to $k/n^2$, by Lemma 12.3.
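To make these objects concrete, here is a minimal simulation sketch of a second order estimator of this type. It is strictly our illustration: we assume the AIPW form $Aa(Z)(Y-b(Z))+b(Z)$ for the first order part, replace the weighted projection $\hat\Pi$ of Theorem 5.1 by a cosine basis empirically orthonormalized under the estimated propensity weight, and invent the data-generating process and all names.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data-generating process (our choice, not from the paper).
n = 4000
Z = rng.uniform(size=n)
pi = 0.3 + 0.4 * Z                                  # P(A=1 | Z); a = 1/pi
A = rng.binomial(1, pi)
b = lambda z: np.sin(2 * np.pi * z) + 2 * z         # E(Y | Z); E b(Z) = 1
Y = A * (b(Z) + 0.5 * rng.standard_normal(n))       # Y used only where A = 1

# Deliberately crude initial estimators (hypothetical; in the paper these
# would be fit on a separate preliminary sample).
a_hat = lambda z: 1.0 / (0.4 + 0.2 * z)             # misspecified 1/propensity
b_hat = lambda z: 2.0 * z                           # misses the sine component

# First order estimator: chi(p_hat) + P_n chi^(1)_{p_hat} (AIPW form).
chi1 = np.mean(A * a_hat(Z) * (Y - b_hat(Z)) + b_hat(Z))

# Second order correction: U-statistic with a projection kernel.
k = 12
raw = np.stack([np.ones(n)] + [np.sqrt(2) * np.cos(i * np.pi * Z)
                               for i in range(1, k)])
w = 1.0 / a_hat(Z)                                  # estimated propensity weight
G = (raw * w) @ raw.T / n                           # weighted Gram matrix
E = np.linalg.solve(np.linalg.cholesky(G), raw)     # rows orthonormal in L2(w)

u = A * (Y - b_hat(Z))                              # outcome residual factor
v = A * a_hat(Z) - 1.0                              # propensity residual factor
su, sv = E @ u, E @ v
diag = np.einsum('ki,ki->i', E, E)                  # Pi_hat(Z_i, Z_i)
cross = su @ sv - np.sum(u * v * diag)              # sum over pairs i != j
chi2 = chi1 - cross / (n * (n - 1))                 # second order estimator

print(f"first order {chi1:.3f}, second order {chi2:.3f}, truth 1.000")
```

The second order term estimates the projected part of the product bias left by the two crude initial estimators, which is the mechanism behind the $(I-\Pi)$ factor in the bias bound above.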

10.2. Proof of Theorem 6.2

We compute the higher order influence functions of the approximate functional using the algorithm [1]–[3] in Section 3. This starts by computing the second order influence function as the derivative of $p\mapsto\chi_p^{(1)}(x_1)+\chi(p)$, for fixed $x_1$. Because the latter functional (given in (6.5)) depends on the parameters only through $a(z_1)$ and $b(z_1)$, the following lemma does the main part of the work.

Lemma 10.1

For fixed $z_1$, influence functions of $p\mapsto a(z_1)$ and $p\mapsto b(z_1)$ are given by

$$x_2\mapsto-\underline a(z_1)\,\Pi_p(z_1,z_2)\,\big(a_2a(z_2)-1\big)\,\underline b(z_2),\qquad x_2\mapsto\underline b(z_1)\,\Pi_p(z_1,z_2)\,a_2\big(y_2-b(z_2)\big)\,\underline a(z_2),$$

where $\Pi_p$ is the kernel of the orthogonal projection in $L_2(\underline a\,\underline b\,g)$ onto L.

Proof

We can write the equation (6.3) determining a as $E\big(Aa(Z)-1\big)\,\underline b(Z)\,l(Z)=0$, for every $l\in L$. Insert a sufficiently regular path $p_t$, given by parameters $(a_t,b_t,f_t)$, and differentiate the equality relative to t at t = 0 to find, with γ a score function of the path,

$$E\,\frac{d}{dt}\Big|_{t=0}a_t(Z)\,A\,\underline b(Z)\,l(Z)=-E\,\big(Aa(Z)-1\big)\,\underline b(Z)\,l(Z)\,\gamma(X).$$

Using the fact that $E(A\mid Z)=1/a(Z)$, where a is bounded away from zero, we can also write this as

$$E\,\frac{d}{dt}\Big|_{t=0}\frac{a_t(Z)}{\underline a(Z)}\,\frac{(\underline a\,\underline b)}{a}(Z)\,l(Z)=-E\,\frac{\big(Aa(Z)-1\big)\,a(Z)\,\gamma(X)}{\underline a(Z)}\,\frac{(\underline a\,\underline b)(Z)}{a(Z)}\,l(Z).$$

Because the function $(a_t-\hat a)/\underline a$ is contained in L for every t by construction, the function $(d/dt)|_{t=0}\,a_t/\underline a$ is also contained in L. Combined with the validity of the preceding display for every $l\in L$, we conclude that $(d/dt)|_{t=0}\,a_t(Z)/\underline a(Z)$ is the weighted projection of $-\big(Aa(Z)-1\big)a(Z)\gamma(X)/\underline a(Z)$ in $L_2(P)$ onto the space $\{l(Z):l\in L\}$ relative to the weight $(\underline a\,\underline b/a)(Z)$. The projection can be represented in terms of a kernel operator (cf. Lemma 12.1). If $\Pi_p(z_1,z_2)\,(\underline a\,\underline b)(z_2)/a(z_2)$ denotes the kernel, then

$$\frac{d}{dt}\Big|_{t=0}\frac{a_t(z_1)}{\underline a(z_1)}=-E\,\Pi_p(z_1,Z_2)\,\frac{\big(A_2a(Z_2)-1\big)\,a(Z_2)\,\gamma(X_2)}{\underline a(Z_2)}\,\Big(\frac{\underline a\,\underline b}{a}\Big)(Z_2)=-E\,\Pi_p(z_1,Z_2)\,\big(A_2a(Z_2)-1\big)\,\underline b(Z_2)\,\gamma(X_2).$$

This represents the derivative on the left as an inner product of the score function γ with the function on the right of the first equation of the lemma (evaluated at X2). Thus the first assertion of the lemma is proved.

The second assertion is proved similarly. Using that $E(Y\mid Z)=b(Z)$ and $E(A\mid Z)=1/a(Z)$, we start by writing the equation (6.4) defining b as $E\,\big(b(Z)-Y\big)/\underline b(Z)\,\big(\underline a\,\underline b/a\big)(Z)\,l(Z)=0$, for every $l\in L$. By the same arguments as before we conclude that $(d/dt)|_{t=0}\,b_t(Z)/\underline b(Z)$ is the weighted projection of $\big(Y-b(Z)\big)\gamma(X)/\underline b(Z)$ in $L_2(P)$ onto the space $\{l(Z):l\in L\}$, relative to the weight $(\underline a\,\underline b/a)(Z)$. ■

The first order influence function (6.5) depends on p only through a and b, and hence the chain rule and the preceding lemma imply that a second order influence function is given by the degenerate part of

$$\chi_p^{(2)}(X_1,X_2)=-\Pi_p(Z_1,Z_2)\Big[A_1\big(Y_1-b(Z_1)\big)\underline a(Z_1)\,\big(A_2a(Z_2)-1\big)\underline b(Z_2)+\big(A_1a(Z_1)-1\big)\underline b(Z_1)\,A_2\big(Y_2-b(Z_2)\big)\underline a(Z_2)\Big]. \qquad (10.1)$$

(Note that this function is symmetric in $(X_1,X_2)$; $\Pi_p$ is symmetric because it is an orthogonal projection kernel.) Actually, this function is already degenerate, and hence it is itself the second order influence function.

Lemma 10.2

For any fixed $z_1$ and $z_3$,

  1. $E_p\,\Pi_p(z_1,Z_2)\,\big(A_2a(Z_2)-1\big)\,\underline b(Z_2)=0$.

  2. $E_p\,\Pi_p(z_1,Z_2)\,A_2\big(Y_2-b(Z_2)\big)\,\underline a(Z_2)=0$.

  3. $E_p\,\Pi_p(z_1,Z_2)\,A_2\,(\underline a\,\underline b)(Z_2)\,\Pi_p(Z_2,z_3)=\Pi_p(z_1,z_3)$.

Proof

Because $(a-\hat a)(Z)/\underline a(Z)$ and $(b-\hat b)(Z)/\underline b(Z)$ are the weighted projections in $L_2(P)$ of $-\big(A\hat a(Z)-1\big)a(Z)/\underline a(Z)$ and $\big(Y-\hat b(Z)\big)/\underline b(Z)$, respectively, onto $\{l(Z):l\in L\}$ relative to the weight $(\underline a\,\underline b/a)(Z)$,

$$E_{X_2}\,\Pi_p(Z_1,Z_2)\Big[\frac{a(Z_2)-\hat a(Z_2)}{\underline a(Z_2)}+\frac{\big(A_2\hat a(Z_2)-1\big)\,a(Z_2)}{\underline a(Z_2)}\Big]\,\frac{\underline a\,\underline b}{a}(Z_2)=0, \qquad (10.2)$$
$$E_{X_2}\,\Pi_p(Z_1,Z_2)\Big[\frac{b(Z_2)-\hat b(Z_2)}{\underline b(Z_2)}-\frac{Y_2-\hat b(Z_2)}{\underline b(Z_2)}\Big]\,(\underline a\,\underline b)(Z_2)\,A_2=0. \qquad (10.3)$$

These two assertions imply (i) and (ii). The third assertion follows from the fact that $\Pi_p$ is the kernel of the weighted projection in $L_2(P)$ onto L relative to the weight $(\underline a\,\underline b/a)(Z)$. ■

The second order influence function (10.1) depends on p through a and b and through the kernel $\Pi_p$. We proceed to higher orders by differentiating the influence function relative to these components and applying the chain rule, where we use the influence functions of $p\mapsto a(z)$ and $p\mapsto b(z)$ as given in Lemma 10.1, and the influence function of $p\mapsto\Pi_p(z_1,z_2)$ as given in Lemma 12.8.

Proof of Theorem 6.2

Denote the symmetrization of the variable in the theorem by $\chi_p^{(m)}(X_1,\ldots,X_m)$. Then $\chi_p^{(2)}$ is the function given by (10.1), which was seen to be a second order influence function in the preceding discussion. We show by induction on m that $x_{m+1}\mapsto\chi_p^{(m+1)}(x_1,\ldots,x_m,x_{m+1})$ is an influence function of $p\mapsto\chi_p^{(m)}(x_1,\ldots,x_m)$. The theorem is then a corollary of Lemma 11.2.

By Lemmas 10.1 and 12.8,

  1. The influence function of $p\mapsto Y_1$ is $x_{m+1}\mapsto-\Pi_p(Z_1,z_{m+1})\,\underline A_1\,y_{m+1}$.

  2. The influence function of $p\mapsto A_1$ is $x_{m+1}\mapsto-\Pi_p(Z_1,z_{m+1})\,\underline A_1\,a_{m+1}$.

  3. The influence function of $p\mapsto\underline A_1$ is zero.

  4. The influence function of $p\mapsto\Pi_p(Z_1,Z_2)$ is $x_{m+1}\mapsto-\Pi_p(Z_1,z_{m+1})\,\underline A_{m+1}\,\Pi_p(z_{m+1},Z_2)$.

Applying this repeatedly readily gives an expression for the influence function of

$$p\mapsto A_1\,\Pi_{1,2}\,\underline A_2\,\Pi_{2,3}\,\underline A_3\,\Pi_{3,4}\,\underline A_4\times\cdots\times\underline A_{m-1}\,\Pi_{m-1,m}\,Y_m.$$

The symmetrization of this expression is the same expression, but with m replaced by m + 1 and an added minus sign. ■

10.3. Proof of Theorems 7.1 and 7.2

Let $\hat A$ and $\hat Y$ be A and Y as in (6.6) with a and b in their definitions replaced by $\hat a$ and $\hat b$. Because $\hat a$ and $\hat b$ are projected onto themselves under the mapping defined in Theorem 6.1, we actually obtain the same variables by replacing a and b by $\hat a$ and $\hat b$, respectively: $\hat A=\big(A\hat a(Z)-1\big)\underline b(Z)$ and $\hat Y=A\big(Y-\hat b(Z)\big)\underline a(Z)$. Furthermore, let $\Pi$ and $\hat\Pi$ denote the operators $\Pi_p$ and $\Pi_{\hat p}$, respectively, and $\Pi_{i,j}$ and $\hat\Pi_{i,j}$ their kernels evaluated at $(Z_i,Z_j)$.

By explicit calculations,

$$\chi(\hat p)+\hat E_p\chi_{\hat p}^{(1)}-\chi(p)=-\int(\hat a-a)(\hat b-b)\,g\,d\nu=\hat E\,\hat A_1\,\Pi_{1,2}\,\hat Y_2-\hat R, \qquad (10.4)$$

for $\hat R$ defined by

$$\hat R=\int\Big(\frac{\hat a-a}{\underline a}\Big)\,(I-\Pi)\Big(\frac{\hat b-b}{\underline b}\Big)\,\underline a\,\underline b\,g\,d\nu.$$

The variable $\hat R$ is bounded by the second term in the expression for $\hat E_p\hat\chi_n-\chi(p)$ in the statement of the theorem. We next show by induction on m that

$$\hat R+\chi(\hat p)+\hat E\chi_{\hat p}^{(1)}+\cdots+\frac{1}{m!}\,\hat E\chi_{\hat p}^{(m)}-\chi(p)=(-1)^{m-1}\,\hat E\,\hat A_1(\hat\Pi-\Pi)_{1,2}\,\underline A_2\,(\hat\Pi-\Pi)_{2,3}\times\cdots\times\underline A_{m-1}\,(\hat\Pi-\Pi)_{m-1,m}\,\hat Y_m. \qquad (10.5)$$

The analysis of the bias can then be concluded by showing that the right side of (10.5) is of the same order as the first term given in the theorem.

Equation (10.4) and the definition of $\chi_p^{(2)}$ readily show that identity (10.5) is true for m = 2. We proceed to general m by induction. Relative to its value for m, the left side receives for m + 1 the extra term $\hat E\chi_{\hat p}^{(m+1)}/(m+1)!$, which is equal to $(-1)^m$ times $\hat E\,\hat A_1\hat\Pi_{1,2}\underline A_2\hat\Pi_{2,3}\times\cdots\times\underline A_m\hat\Pi_{m,m+1}\hat Y_{m+1}$, minus a sum of terms resulting from projections of this leading term. This extra term without the factor $(-1)^m$ (but including the projections) can be written (cf. (6.7) and (2.1))

$$\sum_{i=0}^{m-1}\binom{m-1}{i}\,\hat E\,\hat A_1\hat\Pi_{1,2}\underline A_2\hat\Pi_{2,3}\times\cdots\times\underline A_{m-i}\hat\Pi_{m-i,m-i+1}\hat Y_{m-i+1}\,(-1)^i. \qquad (10.6)$$

To prove the induction hypothesis for m + 1 it suffices to show that this is equal to

$$\hat E\,\hat A_1(\hat\Pi-\Pi)_{1,2}\underline A_2(\hat\Pi-\Pi)_{2,3}\times\cdots\times\underline A_{m-1}(\hat\Pi-\Pi)_{m-1,m}\hat Y_m+\hat E\,\hat A_1(\hat\Pi-\Pi)_{1,2}\underline A_2(\hat\Pi-\Pi)_{2,3}\times\cdots\times\underline A_m(\hat\Pi-\Pi)_{m,m+1}\hat Y_{m+1}. \qquad (10.7)$$

To achieve this we expand the two terms of the preceding display into sums of expressions of the form, with each $K_{j,j+1}^{(j)}$ equal to $\hat\Pi_{j,j+1}$ or $\Pi_{j,j+1}$ and l the number of j for which the first alternative is true,

$$B_l:=(-1)^{m-1-l}\,\hat E\,\hat A_1K_{1,2}^{(1)}\underline A_2K_{2,3}^{(2)}\times\cdots\times\underline A_{m-1}K_{m-1,m}^{(m-1)}\hat Y_m, \qquad (10.8)$$

and of the same form with m + 1 replacing m for the second term of (10.7). As the notation suggests, the expression in (10.8) depends on l (and on m, but this is fixed), but not on which K are equal to $\hat\Pi$ or $\Pi$. To see this we use that Π is a projection onto L in $L_2(\underline a\,\underline b\,g)$, so that $\int\Pi_{1,2}\,\gamma(z_2)\,(\underline a\,\underline b\,g)(z_2)\,d\nu(z_2)=\gamma(z_1)$ for every $\gamma\in L$; and $\hat\Pi$ is also a projection onto L, so that as a function of one argument $\hat\Pi_{1,2}$ is contained in L. This observation yields the identities, for K equal to $\hat\Pi$ or $\Pi$,

$$\hat E_{Z_j}\,\Pi_{j-1,j}\,\underline A_j\,K_{j,j+1}=K_{j-1,j+1}=\hat E_{Z_j}\,K_{j-1,j}\,\underline A_j\,\Pi_{j,j+1}.$$

This allows us to reduce (10.8) to

$$B_l=(-1)^{m-1-l}\,\hat E\,\hat A_1\hat\Pi_{1,2}\underline A_2\hat\Pi_{2,3}\times\cdots\times\underline A_l\hat\Pi_{l,l+1}\hat Y_{l+1},\qquad l\ge1,$$
$$B_0=(-1)^{m-1}\,\hat E\,\hat A_1\Pi_{1,2}\hat Y_2.$$

Thus, after expanding the two terms of (10.7) in the quantities $B_l$ and simplifying these quantities, we can write their sum (10.7) as

$$(B_0-B_0)+\sum_{l=1}^{m-1}\Big(\binom{m}{l}-\binom{m-1}{l}\Big)B_l(-1)^{m-l}+B_m.$$

The difference of the binomial coefficients is $\binom{m-1}{l-1}$. The expression is equal to (10.6), as claimed. This completes the proof of (10.5).

Next we bound the right side of (10.5), by taking the expectation in turn with respect to $X_m,X_{m-1},\ldots,X_1$. For $M_{\hat w}$ multiplication by the function $\hat w=g/\hat g$,

$$\hat E_{X_m}\,(\hat\Pi-\Pi)_{m-1,m}\,\hat Y_m=\big(\hat\Pi M_{\hat w}-\Pi\big)\Big(\frac{\hat b-b}{\underline b}\Big)(Z_{m-1}).$$

Next, for any function h and i = m − 1, m − 2, …, 2,

$$\hat E_{X_i}\,(\hat\Pi-\Pi)_{i-1,i}\,\underline A_i\,h(Z_i)=\big(\hat\Pi M_{\hat w}-\Pi\big)h(Z_{i-1}).$$

Combining these equations, we can write the right side of (10.5) in the form

$$(-1)^{m-1}\int\Big(\frac{a-\hat a}{\underline a}\Big)\Big[\big(\hat\Pi M_{\hat w}-\Pi\big)^{m-1}\Big(\frac{\hat b-b}{\underline b}\Big)\Big]\,\underline a\,\underline b\,g\,d\nu.$$

We bound this by first applying Hölder's inequality, with a conjugate pair (τ, t) in which τ is equal to the r in the statement of the theorem, and next Lemma 12.7(iii), with $\hat\Pi$ and $\Pi$ viewed as weighted orthogonal projections in $L_2(\underline a\,\underline b\,\hat g)$ with weights 1 and $\hat w$, respectively, and r = τ(m−1)/(m+τ−3), p = (m+τ−3)/(τ−2) and q = (m+τ−3)/(m−1), so that rp = (m−1)τ/(τ−2) and rq = τ (and the m of the lemma taken equal to the present m minus 1).

To bound the (conditional) variance of $\hat\chi_n$ we use Lemma 13.1 to see that

$$P^n\big(U_n\chi_{\hat p}^{(j)}\big)^2\le2^j\big[1+\|p/\hat p\|_\infty\big]^{2j}\,\hat P^n\big(U_n\chi_{\hat p}^{(j)}\big)^2\le\big[1+\|p/\hat p\|_\infty\big]^{2j}\,\frac{2^j}{\binom{n}{j}}\,\hat P^j\big(\chi_{\hat p}^{(j)}\big)^2,$$

because $\chi_{\hat p}^{(j)}$ is degenerate under $\hat P$. The variable $\chi_{\hat p}^{(j)}(X_1,\ldots,X_j)/j!$ is the symmetrization of the projection of $\hat A_1\hat\Pi_{1,2}\underline A_2\cdots\hat\Pi_{j-1,j}\hat Y_j$ onto the degenerate variables. Because the second moment of a mean of (arbitrary) random variables is bounded above by the maximum of the second moments of the terms, we can ignore the symmetrization, while the projection decreases the second moment. This shows that

$$\frac{1}{(j!)^2}\,\hat P^j\big(\chi_{\hat p}^{(j)}\big)^2\le\hat P^j\big(\hat A_1\hat\Pi_{1,2}\underline A_2\cdots\hat\Pi_{j-1,j}\hat Y_j\big)^2\lesssim k^{j-1},$$

by Lemma 12.4 and the assumption that the kernels are bounded by k on the diagonal.

We complete the proof of Theorem 7.1 by bounding the square of $\hat\chi_n-\chi(\hat p)$ by $\sum_{j=1}^m2^j\big(U_n\chi_{\hat p}^{(j)}/j!\big)^2$, by the Cauchy–Schwarz inequality with weights $2^{-j}$. The extra factor $2^j$ can be incorporated in the constant c in the theorem.

For the proof of Theorem 7.2 it clearly suffices to show that

$$\hat E_p\,\sqrt n\big(\hat\chi_n-\chi(p)-\mathbb{P}_n\chi_p^{(1)}\big)\to_p0,$$
$$\operatorname{var}_p\,\sqrt n\big(\hat\chi_n-\chi(p)-\mathbb{P}_n\chi_p^{(1)}\big)\to_p0.$$

Because an influence function is centered at mean zero, the first is simply $\sqrt n$ times the bias of $\hat\chi_n$. By Theorem 7.1 the bias is of the order

$$\Big(\frac{\log n}{n}\Big)^{\alpha/(2\alpha+d)+\beta/(2\beta+d)+\gamma(m-1)/(2\gamma+d)}+\Big(\frac{1}{k}\Big)^{(\alpha+\beta)/d}.$$

The first term is trivially $o(n^{-1/2})$, as $m_n\to\infty$. In the second term we write $(\alpha+\beta)/d=r/2$, where r > 1 by assumption, and see that it is $o(n^{-1/2})$, since $kn^{-1/r}\to\infty$.
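In symbols (a one-line elaboration we add): $\sqrt n\,(1/k)^{(\alpha+\beta)/d}=\sqrt n\,k^{-r/2}=\big(n^{1/r}/k\big)^{r/2}\to0$ precisely because $kn^{-1/r}\to\infty$.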

To handle the variance we split the estimator $\hat\chi_n$ into its linear and higher order terms. The sum of the variances of the U-statistics of orders 2 to m in $\hat\chi_n$ is bounded by the sum of the terms j ≥ 2 in Theorem 7.1, i.e.

$$\sum_{j=2}^m\frac{c^jk^{j-1}}{\binom{n}{j}}\lesssim\frac{1}{n}\sum_{j=2}^m\Big(\frac{2ckj}{n}\Big)^{j-1}c\,2j\sqrt j\,e^{-j},$$

by the inequalities $\binom{n}{j}\ge(n/2)^j/j!$, for j < n/2, and $j!\lesssim(j/e)^j\,j$, by Stirling's approximation with bound. The expression in brackets is bounded by $2ckm/n\lesssim1/\log n$, for $m\sim\log n$ and $k\sim n/(\log n)^2$, so the terms are dominated by a convergent series. Thus the sum tends to zero by dominated convergence. Finally the linear term in $\hat\chi_n$ gives the contribution

$$\widehat{\operatorname{var}}_p\,\sqrt n\big(\mathbb{P}_n\chi_{\hat p}^{(1)}-\chi(p)-\mathbb{P}_n\chi_p^{(1)}\big)=\widehat{\operatorname{var}}\big(\chi_{\hat p}^{(1)}-\chi_p^{(1)}\big).$$

From the explicit expression (4.3) for the first order influence function (or (6.5) in the case of $\hat p$, which gives an identical function), this is seen to tend to zero by the dominated convergence theorem.

10.4. Proof of Theorem 8.1 for m = 3

The theorem asserts that the bias of the estimator $\hat\chi_n$ is equal to the sum of four terms, the first two of which also arise in the bias of the estimator considered in Theorem 7.1. Therefore, we can prove the assertion on the bias by showing that the expected values of the current estimator $\hat\chi_n$ (for m = 3) and the estimator in Theorem 7.1 differ by less than the additional bias terms in Theorem 8.1.

The two estimators differ only in their third order influence functions, where the present estimator retains only the terms in the double sum (8.3) with r = 0, s = 0, or r + s ≤ D. Thus the difference of the expectations of the two estimators is equal to

$$\sum_{\substack{r+s>D\\ r,s\ge1}}\hat E_p\,\hat A_1\Big[\hat\Pi^{(k_{r-1},k_r]}(Z_1,Z_2)\,\underline A_2\,\hat\Pi^{(l_{s-1},l_s]}(Z_2,Z_3)-\hat\Pi^{(k_{r-1}l_{s-1},k_rl_s]}(Z_1,Z_3)\Big]\,\hat Y_3.$$

The expectation $\hat E_p$ refers to the variable $(X_1,X_2,X_3)$ for fixed values of the preliminary samples, which are indicated by the "hat" symbols on $\hat A_1$, $\hat Y_3$ and the kernels, and hence is an integral relative to the density $(x_1,x_2,x_3)\mapsto p(x_1)p(x_2)p(x_3)$. If we replace $p(x_2)$ in this density by $\hat p(x_2)$, then the integral will be zero, as the kernel is degenerate under $\hat P$. Thus we may integrate against $(x_1,x_2,x_3)\mapsto p(x_1)(p-\hat p)(x_2)p(x_3)$ instead. In that case the projection term $\hat A_1\hat\Pi^{(k_{r-1}l_{s-1},k_rl_s]}(Z_1,Z_3)\hat Y_3$ integrates to zero, as it does not depend on $X_2$ and $\int(p-\hat p)\,d\mu=0$, and hence can be dropped. Next we condition $\hat A_1$ and $\hat Y_3$ on $Z_1,Z_2,Z_3$ and write the preceding display in the form

$$\sum_{\substack{r+s>D\\ r,s\ge1}}\iiint\frac{\hat a-a}{\underline a}(z_1)\,\hat\Pi^{(k_{r-1},k_r]}(z_1,z_2)\,\hat\Pi^{(l_{s-1},l_s]}(z_2,z_3)\,\frac{\hat b-b}{\underline b}(z_3)\,d\rho(z_1)\,d(\rho-\hat\rho)(z_2)\,d\rho(z_3),$$

for ρ and $\hat\rho$ the measures defined by $d\rho=\underline a\,\underline b\,g\,d\nu$ and $d\hat\rho=\underline a\,\underline b\,\hat g\,d\nu$. The double sum can be rewritten as the sum over r running from 1 to R and over s running from D − r + 1 to S, which gives the equivalent representation, with the × referring to "tensor products" as explained in Section 2,

$$\sum_{r=1}^R\int\Big(\frac{\hat a-a}{\underline a}\times1\times\frac{\hat b-b}{\underline b}\Big)\Big(\hat\Pi^{(k_{r-1},k_r]}\times\hat\Pi^{(l_{D-r},k]}\Big)\,d\big(\rho\times(\rho-\hat\rho)\times\rho\big).$$

We write $\hat\Pi^{(k_{r-1},k_r]}=\hat\Pi^{(k_{r-1},k]}-\hat\Pi^{(k_r,k]}$, and next arrive at the difference of two expressions of the type, with $k_r'=k_{r-1}$ and $k_r'=k_r$, respectively,

$$\sum_{r=1}^R\int\Big(\frac{\hat a-a}{\underline a}\times1\times\frac{\hat b-b}{\underline b}\Big)\Big(\hat\Pi^{(k_r',k]}\times\hat\Pi^{(l_{D-r},k]}\Big)\,d\big(\rho\times(\rho-\hat\rho)\times\rho\big).$$

If the measure of integration were $\hat\rho\times(\rho-\hat\rho)\times\hat\rho$ (with $\hat\rho$ instead of ρ in the first and third coordinates), then we could perform the integrals on $z_1$ and $z_3$ and next apply Hölder's inequality to bound the resulting expression in absolute value by

$$\sum_{r=1}^R\Big\|\hat\Pi^{(k_r',k]}\Big(\frac{\hat a-a}{\underline a}\Big)\Big\|_r\,\Big\|\hat\Pi^{(l_{D-r},k]}\Big(\frac{\hat b-b}{\underline b}\Big)\Big\|_r\,\Big\|\frac{g}{\hat g}-1\Big\|_{r/(r-2)},$$

where the norms are those of $L_r(\underline a\,\underline b\,\hat g)$, which are equivalent to those of $L_r(\nu)$, by assumption. We can write $\hat\Pi^{(l,k]}=\hat\Pi^{(0,k]}\big(I-\hat\Pi^{(0,l]}\big)$ and use the assumed boundedness of $\hat\Pi^{(0,l]}$ as an operator on $L_r(\underline a\,\underline b\,g)$ to bound this by the third term in the bias.

Replacing $\rho\times(\rho-\hat\rho)\times\rho$ by $\hat\rho\times(\rho-\hat\rho)\times\hat\rho$ can be achieved by writing the first and last occurrence of ρ as $\rho=\hat\rho+(\rho-\hat\rho)$ and expanding the resulting expression on the + signs into four terms. One of these has the measure $\hat\rho\times(\rho-\hat\rho)\times\hat\rho$. The other three terms have two or three occurrences of $\rho-\hat\rho$, and can be bounded by the first term in the bias (with m = 3). This is argued precisely under (10.12) below.

Because the first and second order influence functions are equal to those of the estimator considered in Theorem 7.1, the (conditional) variances of $U_n\chi_{\hat p}^{(j)}$ for j = 1, 2 can be seen to be of the orders O(1/n) and O(k/n²), respectively, by the same proof. By Lemma 13.1 the variance for j = 3 is bounded by (see (13.1))

$$6\big(1+\|p/\hat p\|_\infty\big)^6\,\hat P^n\big(U_n\chi_{\hat p}^{(3)}\big)^2\lesssim\frac{1}{\binom{n}{3}}\,\hat P^3\Big(\sum_{r=0}^R\hat A_1\,\hat\Pi^{(k_{r-1},k_r]}(Z_1,Z_2)\,\underline A_2\,\hat\Pi^{(0,l_{D-r}]}(Z_2,Z_3)\,\hat Y_3\Big)^2,$$

where $l_{D-r}=l_{D-r}\vee n$. After bounding out $\hat A_1^2$ and $\hat Y_3^2$, we write the squared sum as a double sum. From the fact that the projections $\hat\Pi^{(k_{r-1},k_r]}$ are orthogonal for different r, it follows that the off-diagonal terms of the double sum vanish (the expectation with respect to $X_1$ is zero). Thus the preceding display is bounded above by a multiple of

$$\frac{1}{n^3}\sum_{r=0}^R\hat P^3\Big(\hat\Pi^{(k_{r-1},k_r]}(Z_1,Z_2)\,\underline A_2\,\hat\Pi^{(0,l_{D-r}]}(Z_2,Z_3)\Big)^2.$$

By Lemmas 12.4 and 12.3 and the assumption that $\sup_z\hat\Pi^{(0,l]}(z,z)\lesssim l$ this is bounded by a multiple of

$$\frac{1}{n^3}\sum_{r=0}^R(k_r-k_{r-1})\,l_{D-r}\lesssim\frac{1}{n^3}\Big(nk+\sum_{r=1}^R(k_r-k_{r-1})(l_{D-r}+n)\Big).$$

By (8.4), $k_r-k_{r-1}=(1-2^{-1/\alpha})k_r$ for $k_r=n2^{r/\alpha}$ and r ≥ 1. On substituting this in the display, and noting that $l_{D-r}=0$ if r > D, we see that this is bounded by $k/n^2+2^{D/\alpha\vee D/\beta}/n$ if $\alpha\ne\beta$, and bounded by $k/n^2+D\,2^{D/\alpha}/n$ if α = β.


Acknowledgments

The research leading to these results has received funding from the European Research Council under ERC Grant Agreement 320637.

Footnotes

SUPPLEMENTARY MATERIAL

Supplement: Higher order estimating equations

(doi: COMPLETED BY THE TYPESETTER;.pdf). The remainder of the paper is given in the supplement.

References

  • 1. Bickel PJ. On adaptive estimation. Ann Statist. 1982;10(3):647–671. MR663424 (84a:62045)
  • 2. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins Series in the Mathematical Sciences. Johns Hopkins University Press; Baltimore, MD: 1993. MR1245941 (94m:62007)
  • 3. Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā Ser A. 1988;50(3):381–393. MR1065550 (91e:62079)
  • 4. Birgé L, Massart P. Estimation of integral functionals of a density. Ann Statist. 1995;23(1):11–29. MR1331653 (96c:62065)
  • 5. Bolthausen E, Perkins E, van der Vaart A. Lectures on probability theory and statistics. Lecture Notes in Mathematics, Vol. 1781 (Bernard P, editor). Springer-Verlag; Berlin: 2002. (Lectures from the 29th Summer School on Probability Theory held in Saint-Flour, July 8–24, 1999.) MR1915443 (2003d:60004)
  • 6. Cai TT, Low MG. Nonquadratic estimators of a quadratic functional. Ann Statist. 2005;33(6):2930–2956. doi: 10.1214/009053605000000147. MR2253108 (2007k:62058)
  • 7. Cai TT, Low MG. Optimal adaptive estimation of a quadratic functional. Ann Statist. 2006;34(5):2298–2325. doi: 10.1214/009053606000000849. MR2291501 (2008m:62054)
  • 8. Cohen A, Dahmen W, Daubechies I, DeVore R. Tree approximation and optimal encoding. Appl Comput Harmon Anal. 2001;11(2):192–226. doi: 10.1006/acha.2001.0336. MR1848303 (2002g:42048)
  • 9. Daubechies I. Ten lectures on wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 61. Society for Industrial and Applied Mathematics (SIAM); Philadelphia, PA: 1992. MR1162107 (93e:42045)
  • 10. Donoho DL, Nussbaum M. Minimax quadratic estimation of a quadratic functional. J Complexity. 1990;6(3):290–323. doi: 10.1016/0885-064X(90)90025-9. MR1081043 (91m:65343)
  • 11. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust statistics: the approach based on influence functions. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc.; New York: 1986. MR829458 (87k:62054)
  • 12. Härdle W, Kerkyacharian G, Picard D, Tsybakov A. Wavelets, approximation, and statistical applications. Lecture Notes in Statistics, Vol. 129. Springer-Verlag; New York: 1998. MR1618204 (99f:42065)
  • 13. Has'minskiĭ RZ, Ibragimov IA. On the nonparametric estimation of functionals. In: Proceedings of the Second Prague Symposium on Asymptotic Statistics (Hradec Králové, 1978). North-Holland; Amsterdam–New York: 1979. pp. 41–51. MR571174 (81j:62076)
  • 14. Huber PJ, Ronchetti EM. Robust statistics. Second edition. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc.; Hoboken, NJ: 2009. MR2488795 (2010j:62004)
  • 15. Kerkyacharian G, Picard D. Estimating nonquadratic functionals of a density using Haar wavelets. Ann Statist. 1996;24(2):485–507. MR1394973 (97e:62062)
  • 16. Koševnik JA, Levit BJ. On a nonparametric analogue of the information matrix. Teor Verojatnost i Primenen. 1976;21(4):759–774. MR0428578 (55 #1599)
  • 17. Laurent B. Estimation of integral functionals of a density and its derivatives. Bernoulli. 1997;3(2):181–211. MR1466306 (99c:62144)
  • 18. Laurent B, Massart P. Adaptive estimation of a quadratic functional by model selection. Ann Statist. 2000;28(5):1302–1338. MR1805785 (2002c:62052)
  • 19. Lindsay BG. Efficiency of the conditional score in a mixture setting. Ann Statist. 1983;11(2):486–497. MR696061 (84h:62050)
  • 20. Murphy SA, van der Vaart AW. On profile likelihood. J Amer Statist Assoc. 2000;95(450):449–485. With comments and a rejoinder by the authors. MR1803168 (2002a:62143)
  • 21. Nemirovski A. Topics in non-parametric statistics. In: Lectures on probability theory and statistics (Saint-Flour, 1998). Lecture Notes in Math, Vol. 1738. Springer; Berlin: 2000. pp. 85–277. MR1775640 (2001h:62074)
  • 22. Pfanzagl J. Contributions to a general asymptotic statistical theory. Lecture Notes in Statistics, Vol. 13. Springer-Verlag; New York: 1982. With the assistance of W. Wefelmeyer. MR675954 (84i:62036)
  • 23. Pfanzagl J. Asymptotic expansions for general statistical models. Lecture Notes in Statistics, Vol. 31. Springer-Verlag; Berlin: 1985. With the assistance of W. Wefelmeyer. MR810004 (87i:62004)
  • 24. Pfanzagl J. Estimation in semiparametric models: some recent developments. Lecture Notes in Statistics, Vol. 63. Springer-Verlag; New York: 1990. MR1048589 (91f:62074)
  • 25. Prakasa Rao BLS. Nonparametric functional estimation. Probability and Mathematical Statistics. Academic Press Inc.; New York: 1983. MR740865 (86m:62076)
  • 26. Robins J, Li L, Tchetgen E, van der Vaart A. Supplement to "Higher order estimating equations for high-dimensional models". doi: 10.1214/16-AOS1515.
  • 27. Robins J, Li L, Tchetgen E, van der Vaart A. Higher order influence functions and minimax estimation of nonlinear functionals. In: Probability and statistics: essays in honor of David A. Freedman. Inst Math Stat Collect, Vol. 2. Inst Math Statist; Beachwood, OH: 2008. pp. 335–421. MR2459958 (2010b:62115)
  • 28. Robins J, Li L, Tchetgen E, van der Vaart A. Quadratic semiparametric von Mises calculus. Metrika. 2009a;69:227–247. doi: 10.1007/s00184-008-0214-3.
  • 29. Robins J, Li L, Tchetgen E, van der Vaart A. Semiparametric minimax rates. Electron J Stat. 2009b;3:1305–1321. doi: 10.1214/09-EJS479.
  • 30. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Amer Statist Assoc. 1995;90(429):122–129. MR1325119 (96d:62084)
  • 31. Rotnitzky A, Robins JM. Semi-parametric estimation of models for means and covariances in the presence of missing data. Scand J Statist. 1995;22(3):323–333. MR1363216 (96j:62090)
  • 32. Tsybakov AB. Introduction à l'estimation non-paramétrique. Mathématiques & Applications, Vol. 41. Springer-Verlag; Berlin: 2004. MR2013911 (2005a:62007)
  • 33. von Mises R. On the asymptotic distribution of differentiable statistical functions. Ann Math Statistics. 1947;18:309–348. MR0022330 (9,194h)
  • 34. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer Series in Statistics. Springer-Verlag; New York: 2003. MR1958123 (2003m:62003)
  • 35. van der Vaart A. On differentiable functionals. Ann Statist. 1991;19(1):178–204. MR1091845 (92i:62100)
  • 36. van der Vaart A. Efficient maximum likelihood estimation in semiparametric mixture models. Ann Statist. 1996;24(2):862–878. doi: 10.1214/aos/1032894470. MR1394993 (97d:62096)
  • 37. van der Vaart A. Higher order tangent spaces and influence functions. Statist Sci. 2014;29(4):679–686. doi: 10.1214/14-STS478. MR3300365
  • 38. van der Vaart AW. Estimating a real parameter in a class of semiparametric models. Ann Statist. 1988a;16(4):1450–1474. doi: 10.1214/aos/1176351048. MR964933 (89m:62032)
  • 39. van der Vaart AW. Statistical estimation in large parameter spaces. CWI Tract, Vol. 44. Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica; Amsterdam: 1988b. MR927725 (89e:62049)
  • 40. van der Vaart AW. Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Vol. 3. Cambridge University Press; Cambridge: 1998. MR1652247 (2000c:62003)
  • 41. van der Vaart AW, Wellner JA. Weak convergence and empirical processes: with applications to statistics. Springer Series in Statistics. Springer-Verlag; New York: 1996. MR1385671 (97g:60035)
  • 42. van Zwet WR. A Berry–Esseen bound for symmetric statistics. Z Wahrsch Verw Gebiete. 1984;66(3):425–440. doi: 10.1007/BF00533707. MR751580 (86h:60063)
  • 43. Waterman RP, Lindsay BG. Projected score methods for approximating conditional scores. Biometrika. 1996;83(1):1–13. doi: 10.1093/biomet/83.1.1. MR1399151 (98g:62044)
