Abstract
We introduce a new method of estimation of parameters in semiparametric and nonparametric models. The method is based on estimating equations that are U-statistics in the observations. The U-statistics are based on higher order influence functions that extend ordinary linear influence functions of the parameter of interest, and represent higher derivatives of this parameter. For parameters for which the representation cannot be perfect the method leads to a bias-variance trade-off, and results in estimators that converge at a slower than n−1/2 rate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at n−1/2 rate, but we also consider efficient n−1/2 estimation using novel nonlinear estimators. The general approach is applied in detail to the example of estimating a mean response when the response is not always observed.
AMS 2000 subject classifications: Primary 62G05, 62G20, 62F25
Keywords: Nonlinear functional, nonparametric estimation, U-statistic, influence function, tangent space
1. Introduction
Let X1, X2, …, Xn be a random sample from a density p relative to a measure μ on a sample space (𝒳, 𝒜). It is known that p belongs to a collection 𝒫 of densities, and the problem is to estimate the value χ(p) of a functional χ: 𝒫 → ℝ. Our main interest is in the situation of a semiparametric or nonparametric model, where 𝒫 is infinite dimensional.
Estimating equations have been found a good strategy for constructing estimators in semiparametric models [2, 34, 40]. Because the model is of (much) higher dimension than the parameter of interest, setting up a good estimating equation often requires an initial estimator of a "nuisance parameter" η, and an estimating equation for θ = χ(p) may take the form
ℙnψθ,η̂ = 0.
Here ℙnψθ,η̂ is short for n−1 Σi=1n ψθ,η̂(Xi), and x ↦ ψθ,η(x) is a given measurable map, for each (θ, η). In the present paper it will be more convenient to work with a one-step version of this estimator, defined by the method of Newton–Raphson from a linearization of the map θ ↦ ℙnψθ,η̂ around an initial estimator θ̂n, leading to an estimator of the form θ̂n − V̂n−1ℙnψθ̂n,η̂, for V̂n an estimate of the derivative of the estimating equation. In more general notation such an estimator can be written as
χ̂n = χ(p̂n) + ℙnχp̂n | (1.1) |
for an initial estimator p̂n for p, and x ↦ χp(x) a given measurable function, for each p ∈ 𝒫.
One possible choice in (1.1) is χp = 0, leading to the plug-in estimator χ(p̂n). However, unless the initial estimator p̂n possesses special properties, this choice is typically suboptimal. Better functions χp can be constructed by consideration of the tangent space of the model. To see this, we write (with Pf shorthand for ∫ f dP)
χ̂n − χ(p) = (ℙn − P)χp̂n + [χ(p̂n) − χ(p) + Pχp̂n]. | (1.2) |
Because it is properly centered, we may expect the sequence √n(ℙn − P)χp̂n to tend in distribution to a mean-zero normal distribution. The term between square brackets on the right of (1.2), which we shall refer to as the bias term, depends on the initial estimator p̂n, and it would be natural to construct the function χp such that this term does not contribute to the limit distribution, or at least is not dominating the expression. Thus we would like to choose this function such that the "bias term" is at most of the order OP (n−1/2). A good choice is to ensure that the term Pχp̂n acts as minus the first derivative of the functional χ in the "direction" p̂n − p. Functions x ↦ χp(x) with this property are known as influence functions in semiparametric theory [16, 22, 35, 5, 2], go back to the von Mises calculus due to [33], and play an important role in robust statistics [14, 11], or [40], Chapter 20.
For an influence function we may expect that the "bias term" is quadratic in the error d(p̂n, p), for an appropriate distance d. In that case it is certainly negligible as soon as this error is of order oP (n−1/4). Such a "no-bias" condition is well known in semiparametric theory (e.g. condition (25.52) in [40] or (11) in [20]). However, typically it requires that the model be "not too big". For instance, a regression or density function on d-dimensional space can be estimated at rate n−1/4 if it is a-priori known to have at least d/2 derivatives (indeed α/(2α + d) ≥ 1/4 if α ≥ d/2). The purpose of this paper is to develop estimation procedures for the case that no estimators exist that attain a OP (n−1/4) rate of convergence. The estimator (1.1) is then suboptimal, because it fails to make a proper trade-off between "bias" and "variance": the two terms in (1.2) have different magnitudes. Our strategy is to replace the linear term ℙnχp̂n by a general U-statistic 𝕌nχp̂n, for an appropriate m-dimensional influence function (x1, …, xm) ↦ χp(x1, …, xm), chosen using a von Mises expansion of p ↦ χ(p). Here the order m is adapted to the size of the model and the type of functional to be estimated.
Unfortunately, "exact" higher-order influence functions turn out to exist only for special functionals χ. To treat general functionals χ we approximate these by simpler functionals, or use approximate influence functions. The rate of the resulting estimator is then determined by a trade-off between bias and variance terms. It may still be of order n−1/2, but it is often slower. In the former case, surprisingly, one may obtain semiparametric efficiency by estimators whose variance is determined by the linear term, but whose bias is corrected using higher order influence functions.
The conclusion that the "bias term" in (1.2) is quadratic in the estimation error d(p̂n, p) is based on a worst case analysis. First, there exists a large number of models and functionals of interest that permit a first order influence function that is unbiased in the nuisance parameter. (E.g. adaptive models as considered in [1], models allowing a sufficient statistic for the nuisance parameter as in [38, 39], mixture models as considered in [19, 24, 36], and convex-linear models in survival analysis.) In such models there is no need for higher-order estimating equations. Second, the analysis does not take special, structural properties of the initial estimators into account. An alternative approach would be to study the bias of a particular estimator in detail, and adapt the estimating equation to this special estimator. The strategy in this paper is not to use such special properties, but to focus on estimating equations that work with general initial estimators p̂n.
The motivation for our new estimators stems from studies in epidemiology and econometrics that include covariates whose influence on an outcome of interest cannot be reliably modelled by a simple model. These covariates may themselves not be of interest, but are included in the analysis to adjust the analysis for possible bias. For instance, the mechanism that describes why certain data is missing is in terms of conditional probabilities given several covariates, but the functional form of this dependence is unknown. Or, to permit a causal interpretation in an observational study one conditions on a set of covariates to control for confounding, but the form of the dependence on the confounding variables is unknown. One may hypothesize in such situations that the functional dependence on a set of (continuous) covariates is smooth (e.g. d/2 times differentiable in the case of d covariates), or even linear. Then the usual estimators will be accurate (at order OP(n−1/2)) if the hypothesis is true, but they will be badly biased in the other case. In particular, the usual normal-theory based confidence intervals may be totally misleading: they will be both too narrow and wrongly located. The methods in this paper yield estimators with (typically) wider corresponding confidence intervals, but they are correct under weaker assumptions.
The mathematical contributions of the paper are to provide a heuristic for constructing minimax estimators in semiparametric models, and to apply this to a concrete model, which is a template for a number of other models (see [27, 37]). The methods connect to earlier work [13, 21] on the estimation of functionals on nonparametric models, but differ by our focus on functionals that are defined in terms of the structure of a semiparametric model. This requires an analysis of the inverse map from the density of the observations to the parameters, in terms of the semiparametric tangent spaces of the models. Our second order estimators are related to work on quadratic functionals, or functionals that are well approximated by quadratic functionals, as in [10, 15, 3, 4, 17, 18, 6, 7]. While we place the construction of minimax estimators for these special functionals in a wider framework, our focus differs by going beyond quadratic estimators and by considering semiparametric models.
Our mathematical results are in part conditional on a scale of regularity parameters, through a dimension (8.9) and a partition of this dimension that depends on two of these parameters. We hope to discuss adaptation to these parameters in future work.
General heuristics of our construction are given in Section 3. Sections 4–8 are devoted to constructing new estimators for the mean response effect in missing data problems. In Section 9 we briefly discuss some other problems, including the problem of estimating a density at a point, where already first order influence functions do not exist and our heuristics naturally lead to projection estimators. Section 10 collects technical proofs. Sections 11, 12 and 13 (in the supplement [26]) discuss three key concepts of the paper: influence functions, projections and U-statistics.
2. Notation
Let 𝕌n denote the empirical U-statistic measure, viewed as an operator on functions. For given k ≤ n and a function f: 𝒳k → ℝ on the sample space this is defined by
𝕌nf = ((n − k)!/n!) Σ f(Xi1, …, Xik),
where the sum is over all k-tuples (i1, …, ik) of distinct integers from {1, …, n}.
We do not let the order k show up in the notation 𝕌nf. This is unnecessary, as the notation is consistent in the following sense: if a function f of l < k arguments is considered a function of k arguments that is constant in its last k − l arguments, then the right side of the preceding display is well defined and is exactly the corresponding U-statistic of order l. In particular, 𝕌nf is the empirical distribution ℙn applied to f if f depends on only one argument.
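A brute-force numerical sketch of this definition and of the consistency of notation just described (the function name U_statistic and the toy data are ours, not from the paper):

```python
import itertools
import math
import random

def U_statistic(xs, f, k):
    """Average of f over all ordered k-tuples of distinct indices:
    U_n f = ((n-k)!/n!) * sum over distinct (i_1,...,i_k) of f(X_{i_1},...,X_{i_k})."""
    n = len(xs)
    total = sum(f(*(xs[i] for i in idx))
                for idx in itertools.permutations(range(n), k))
    return total * math.factorial(n - k) / math.factorial(n)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(8)]

# A function of one argument, viewed as a function of two arguments that is
# constant in its second argument, gives back the order-1 U-statistic,
# which is just the empirical mean.
f1 = lambda x: x * x
f2 = lambda x, y: x * x            # constant in its second argument
assert abs(U_statistic(xs, f1, 1) - U_statistic(xs, f2, 2)) < 1e-12
assert abs(U_statistic(xs, f1, 1) - sum(x * x for x in xs) / len(xs)) < 1e-12
```

The implementation enumerates all ordered tuples and is exponentially slow in k; it is meant only to make the normalization (n − k)!/n! concrete.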
We write Pnf for the expectation of f(X1, …, Xn) if X1, …, Xn are distributed according to the probability measure P. We also use this operator notation for the expectations of statistics in general.
We call f degenerate relative to P if ∫ f(x1, …, xk) dP(xi) = 0 for every i and every (xj: j ≠ i), and we call f symmetric if f(x1, …, xk) is invariant under permutation of the arguments x1, …, xk. Given an arbitrary measurable function f: 𝒳k → ℝ we can form a function DPf that is degenerate relative to P by subtracting the orthogonal projection in L2 (Pk) onto the functions of at most k − 1 variables. This degenerate function can be written in the form (e.g. [40], Lemma 11.11)
(DPf)(x1, …, xk) = ΣA (−1)k−|A| EP[f(X1, …, Xk)| Xi = xi, i ∈ A], | (2.1) |
where the sum is over all subsets A of {1, …, k}, including the empty set, for which the conditional expectation is understood to be Pkf. If the function f is symmetric, then so is the function DPf.
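For k = 2 formula (2.1) reduces to DPf(x1, x2) = f(x1, x2) − EPf(X, x2) − EPf(x1, X) + P2f, and the degeneracy can be verified exactly on a finite sample space (the particular P and f below are arbitrary illustrative choices):

```python
# Finite sample space with probabilities P; f an arbitrary kernel of k = 2 arguments.
points = [0.0, 1.0, 2.5]
prob = {0.0: 0.2, 1.0: 0.5, 2.5: 0.3}
f = lambda x, y: x * y + x ** 2

E = lambda g: sum(g(x) * prob[x] for x in points)          # E_P g(X)
Pf = E(lambda x: E(lambda y: f(x, y)))                     # P^2 f

def D_P_f(x1, x2):
    # (2.1) for k = 2: alternating sum over the subsets A of {1, 2}
    return f(x1, x2) - E(lambda x: f(x, x2)) - E(lambda y: f(x1, y)) + Pf

# Degeneracy: integrating out either argument gives 0, whatever the other argument.
for x in points:
    assert abs(E(lambda y: D_P_f(x, y))) < 1e-12
    assert abs(E(lambda y: D_P_f(y, x))) < 1e-12
```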
Given two functions g and h we write g × h for the function (x, y) ↦ g(x)h(y). More generally, given k functions g1, …, gk we write g1 × ⋯ × gk for the tensor product of these functions. Such product functions are degenerate iff all functions in the product have mean zero.
A kernel operator takes the form Kf(x) = ∫ K(x, y)f(y) dμ(y) for some measurable function K: 𝒳 × 𝒳 → ℝ. We shall abuse notation in denoting the operator K and the kernel with the same symbol. A (weighted) projection on a finite-dimensional space is a kernel operator. We discuss such projections in Section 12.
The set of measurable functions whose rth absolute power is μ-integrable is denoted Lr(μ), with norm ‖·‖r,μ, or ‖·‖r if the measure is clear; or also as Lr(w) with norm ‖·‖r,w if w is a density relative to a given dominating measure. For r = ∞ the notation ‖·‖∞ refers to the uniform norm.
3. Heuristics
Our basic estimator has the form (1.1) except that we replace the linear term by a general U-statistic. Given measurable functions χp: 𝒳m → ℝ, for a fixed order m, we consider estimators of χ(p) of the type
χ̂n = χ(p̂n) + 𝕌nχp̂n. | (3.1) |
The initial estimators p̂n are thought to have a certain (optimal) convergence rate, but need not possess (further) special properties. Throughout we shall treat these estimators as being based on an independent sample of observations, so that the stochasticities in (3.1) present in p̂n and 𝕌n are independent. This takes away technical complications, and allows us to focus on rates of estimation in full generality. (A simple way to avoid the resulting asymmetry would be to swap the two samples, calculate the estimator a second time and take the average.)
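The splitting-and-swapping device can be sketched on a toy one-dimensional problem; the functional χ(p) = (EX)2 with first order influence function 2μ(x − μ) is our illustrative choice, not a functional treated in the paper:

```python
import random

random.seed(1)
mu = 0.3
data = [random.gauss(mu, 1.0) for _ in range(2000)]

# One-step estimator (1.1) for chi(p) = (E X)^2, whose first order influence
# function is chi_p(x) = 2 mu (x - mu); p_hat is fitted on an independent half.
def one_step(train, eval_half):
    mu_hat = sum(train) / len(train)                      # initial estimator
    correction = sum(2 * mu_hat * (x - mu_hat) for x in eval_half) / len(eval_half)
    return mu_hat ** 2 + correction

half1, half2 = data[:1000], data[1000:]
# Swap the roles of the two halves and average, so that no observation is used
# both for p_hat and for the empirical correction term.
chi_hat = 0.5 * (one_step(half1, half2) + one_step(half2, half1))
```

With the seed above the estimate lands close to the true value μ2 = 0.09; the averaging over the two orderings removes the asymmetry mentioned in the text.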
3.1. Influence functions
The key is to find suitable “influence functions” χp. A decomposition of type (1.2) for the estimator (3.1) yields
χ̂n − χ(p) = (𝕌n − Pm)χp̂n + [χ(p̂n) − χ(p) + Pmχp̂n]. | (3.2) |
This suggests constructing the influence functions such that Pmχp̂n represents the first m terms of the Taylor expansion of χ(p) − χ(p̂n). First this implies that the influence function used in (3.1) must be unbiased:
Pmχp = 0.
Next, to operationalize a "Taylor expansion" on the (infinite-dimensional) "manifold" 𝒫 we employ "smooth" submodels t ↦ pt mapping a neighbourhood of 0 ∈ ℝ to 𝒫 and passing through p at t = 0 (i.e. p0 = p). We determine χp such that for each chosen submodel
(dj/dtj)|t=0 χ(pt) = −(dj/dtj)|t=0 Pmχpt, j = 1, …, m.
A slight strengthening is to impose this condition "everywhere" on the path, i.e. the jth derivative of t ↦ χ(pt) at t is minus the jth derivative of h ↦ Ptmχpt+h at h = 0, for every t. If the map (s, t) ↦ Psmχpt is smooth, then the latter implies (cf. Lemma 11.1)
(dj/dtj)|t=0 χ(pt) = (dj/dtj)|t=0 Ptmχp, j = 1, …, m. | (3.3) |
Relative to the previous formula the subscript t on the right hand side has changed places, and the negative sign has disappeared. Under regularity conditions equation (3.3) for m = 1 can be written in the form
(d/dt)|t=0 χ(pt) = ∫ χp g p dμ, | (3.4) |
where g = (d/dt)t=0pt/p is the score function of the model t ↦ pt at t = 0.
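Identity (3.4) can be checked numerically on a finite sample space, comparing a difference quotient of t ↦ χ(pt) with the inner product of the influence function and the score; the functional χ(p) = EpX2 and the particular score g below are illustrative choices:

```python
# Finite sample space check of (3.4): d/dt chi(p_t)|_0 = integral of chi_p * g * p.
points = [-1.0, 0.0, 2.0]
p = {-1.0: 0.3, 0.0: 0.4, 2.0: 0.3}
g = {-1.0: 1.0, 0.0: -0.5, 2.0: -1.0 / 3.0}        # a score: mean zero under p
assert abs(sum(g[x] * p[x] for x in points)) < 1e-12

chi = lambda q: sum(x * x * q[x] for x in points)          # chi(p) = E_p X^2
chi_p = {x: x * x - chi(p) for x in points}                # its influence function

def p_t(t):
    # path p_t = p (1 + t g) through p, with score g at t = 0
    return {x: p[x] * (1.0 + t * g[x]) for x in points}

h = 1e-6
lhs = (chi(p_t(h)) - chi(p_t(-h))) / (2 * h)               # d/dt chi(p_t) at t = 0
rhs = sum(chi_p[x] * g[x] * p[x] for x in points)          # P(chi_p g), as in (3.4)
assert abs(lhs - rhs) < 1e-6
```

Note that centering the influence function (subtracting χ(p)) changes nothing here, since the score g integrates to zero against p, in line with the remark below that influence functions are unique only up to such modifications.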
A function χp satisfying (3.3) is exactly what is called an influence function in semiparametric theory: a function in L2(p) whose inner products with the elements of the tangent space (the set of score functions g of appropriate submodels t ↦ pg,t) represent the derivative of the functional ([40], page 363, or [2, 22, 39, 35]). For m > 1 equation (3.3) can be expanded similarly in terms of inner products of the influence function with score functions, but higher-order score functions arise next to ordinary score functions. Suitable higher-order tangent spaces are discussed in [27] (also see [37]), using score functions as defined in [43]. A discussion of second order scores and tangent spaces can be found in [28]. Second order tangent spaces are also discussed in [23], from a different point of view and with a different purpose.
Here we take a different route, defining higher order influence functions as influence functions of lower order ones. Any mth order, zero-mean U-statistic can be decomposed as the sum of m degenerate U-statistics of orders 1, 2, …, m, by way of its Hoeffding decomposition. In the present situation we can write
𝕌nχp = Σj=1m 𝕌nχp(j),
where χp(j) is a degenerate kernel of j arguments, defined uniquely as a projection of χp (cf. [42] and (2.1)). (At m = n the left side evaluates to the symmetrized function χp, which is expressed in the functions χp(j).) Suitable functions χp(j) in this decomposition can be found by the following algorithm:
[1] Let χp(1) be a first order influence function of the functional p ↦ χ(p).
[2] Let χ̄p(j)(x1, …, xj) be a first order influence function of the functional p ↦ χ̄p(j−1)(x1, …, xj−1), for each fixed x1, …, xj−1, and j = 2, …, m (with χ̄p(1) = χp(1)).
[3] Let χp(j) be the degenerate part of χ̄p(j) relative to P, as defined in (2.1).
See Lemma 11.2 for a proof. Thus higher order influence functions are constructed as first order influence functions of influence functions. Somewhat abusing language we shall refer to the function χp(j) also as a "jth order influence function". The overall order m will be fixed at a suitable value; for simplicity we do not let it show up in the notation χp.
Because only inner products with scores matter, an “influence function” is unique only up to projections onto the tangent space and (averaging over) permutations of its arguments. With any choice of influence functions the algorithm produces some influence function. In particular, the starting influence function in step [1] may be any function χp that satisfies (3.4) for every score g; it does not have to possess mean zero, or be an element of the tangent space. A similar remark applies to the (first order) influence functions found in step [2]. It is only in step [3] that we make the influence functions degenerate.
3.2. Bias-variance trade-off
Because it is centered, the "variance part" in (3.2), the variable (𝕌n − Pm)χp̂n, should not change noticeably if we replace p̂n by p, and be of the same order as (𝕌n − Pm)χp. For a fixed square-integrable function χp the latter centered U-statistic is well known to be of order OP(n−1/2), and asymptotically normal if suitably scaled. A completely successful representation of the "bias" in (3.2) would lead to an error of the order d(p̂n, p)m+1, which becomes smaller with increasing order m. Were this achievable for any m, then a √n rate of convergence would exist no matter how slow the convergence rate of the initial estimator. Not surprisingly, in many cases of interest this ideal situation is not real. This is due to the non-existence of influence functions that can exactly represent the Taylor expansion of χ. The technical reason is that multi-linear maps (even smooth ones) may not be representable through kernels.
In general, we have to content ourselves with a partial representation. Next to a remainder term of the order d(p̂n, p)m+1, we then incur a "representation bias". The latter bias can be made arbitrarily small by choice of the influence function, but only at the cost of increasing its variance. We thus obtain a trade-off between a variance and two biases. This typically results in a variance that is larger than 1/n, and a rate of convergence that is slower than n−1/2, although sometimes a nontrivial bias correction is possible without increasing the variance.
3.3. Approximate functionals
An attractive method to find approximating influence functions is to compute exact influence functions for an approximate functional. Because smooth functionals on finite-dimensional models typically possess influence functions to any order, projections on finite-dimensional models may deliver such approximations.
A simple approximation would be χ(p̃) for a given map p ↦ p̃ mapping the model 𝒫 onto a suitable "smaller" model 𝒫̃ (typically a submodel 𝒫̃ ⊂ 𝒫). A closer approximation can be obtained by also including a derivative term. Consider the functional χ̃ defined by, for a given map p ↦ p̃,
χ̃(p) = χ(p̃) + Pχp̃. | (3.5) |
(A more complete notation would be χ̃(p) = χ(p̃(p)) + Pχp̃(p); the right hand side depends on p in three ways.) By the definition of an influence function the term Pχp̃ acts as the first order Taylor expansion of χ(p) − χ(p̃). Consequently, we may expect that
χ̃(p) − χ(p) = O(d2(p, p̃)). | (3.6) |
This ought to be true for any "projection" p ↦ p̃. If we choose the projection such that, for any path t ↦ pt,
(d/dt)|t=0 [χ(p̃t) + P0χp̃t] = 0, | (3.7) |
then the functional χ̃ will be locally (around p0) equivalent to the functional p ↦ χ(p̃0) + Pχp̃0 (which depends on p in only one place, p̃0 being fixed) in the sense that the first order influence functions are the same. The first order influence function of the second, linear functional at p0 is equal to χp̃0 − P0χp̃0, and hence for a projection satisfying (3.7) the first order influence function of the functional χ̃ will be
χ̃p0(1) = χp̃0 − P0χp̃0. | (3.8) |
In words, this means that the influence function of the approximating functional χ̃ satisfying (3.5) and (3.7) at p is obtained by substituting p̃ for p in the influence function of the original functional.
This is relevant when obtaining higher order influence functions. As these are recursive derivatives of the first order influence function (see [1]–[3] in Section 3.1), the preceding display shows that we must compute influence functions of the maps
p ↦ χp̃(x1),
i.e. we "differentiate on the model 𝒫̃". If the latter model is sufficiently simple, for instance finite-dimensional, then exact higher order influence functions of the functional χ̃ ought to exist. We can use these as approximate influence functions of χ.
4. Estimating the mean response in missing data models
Suppose that a typical observation is distributed as X = (Y A, A, Z), for Y and A taking values in the two-point set {0, 1} and conditionally independent given Z.
This model is standard in biostatistical applications, with Y an “outcome” or “response variable”, which is observed only if the indicator A takes the value 1. The covariate Z is chosen such that it contains all information on the dependence between the response and the missingness indicator A, thus making the response missing at random. Alternatively, we think of Y as a “counterfactual” outcome if a treatment were given (A = 1) and estimate (half) the treatment effect under the assumption of no unmeasured confounders.
The model can be parameterized by the marginal density f of Z (relative to some dominating measure ν) and the probabilities b(z) = P(Y = 1|Z = z) and a(z)−1 = P(A = 1| Z = z). (Using a for the inverse probability simplifies later formulas.) Alternatively, the model can be parameterized by the pair (a, b) and the function g = f/a, which is the conditional density of Z given A = 1, up to the norming factor P(A = 1). Thus the density p of an observation X is described by the triplet (a, b, f), or equivalently the triplet (a, b, g). For simplicity of notation we write p instead of pa,b,f or pa,b,g, with the implicit understanding that a generic p corresponds one-to-one to a generic (a, b, f) or (a, b, g).
We wish to estimate the mean response EY = Eb(Z), i.e. the functional
χ(p) = ∫ b f dν.
Estimators that are √n-consistent and asymptotically efficient in the semiparametric sense have been constructed using a variety of methods (e.g. [30, 31]), but only if one or both of the parameters a and b are restricted to sufficiently small regularity classes. For instance, if the covariate Z ranges over a compact, convex subset of ℝd, then the mentioned papers provide such estimators under the assumption that a and b belong to Hölder classes 𝒞α and 𝒞β with α and β large enough that
α/(2α + d) + β/(2β + d) ≥ 1/2. | (4.1) |
(See e.g. Section 2.7.1 in [41] for the definition of Hölder classes). For moderate to large dimensions d this is a restrictive requirement. In the sequel we consider estimation for arbitrarily small α and β.
Throughout we assume that the parameters a, b and g are contained in Hölder spaces 𝒞α, 𝒞β and 𝒞γ of functions on a compact, convex domain in ℝd. We derive two types of results:
- In Section 7 we show that a √n rate is attainable by using a higher order estimating equation (of order determined by γ) as long as
(α + β)/2 ≥ d/4. (4.2)
This condition is strictly weaker than the condition (4.1) under which the linear estimator attains a √n rate. Thus even in the √n situation higher order estimating equations may yield estimators that are applicable in a wider range of models. For instance, in the case that α = β the cut-off (4.1) arises for α = β ≥ d/2, whereas (4.2) reduces to α = β ≥ d/4.
- We consider minimax estimation in the case (α + β)/2 < d/4, when the rate becomes slower than n−1/2. It is shown in [29] that even if g = f/a were known, then the minimax rate for a and b ranging over balls in the Hölder classes 𝒞α and 𝒞β cannot be faster than n−(2α+2β)/(2α+2β+d). In Section 8 we show that this rate is attainable if g is known, and also if g is unknown, but is a-priori known to belong to a Hölder class 𝒞γ for sufficiently large γ, as given by (8.11). (Heuristic arguments, not discussed in this paper, appear to indicate that for smaller γ the minimax rate is slower than n−(2α+2β)/(2α+2β+d).)
After reviewing the tangent space and first order theory in Section 4.1, we discuss the second order estimator separately in Section 5. The preceding results are next obtained in Sections 7 (√n rate if (α + β)/2 ≥ d/4) and 8 (slower rate if (α + β)/2 < d/4), using the higher-order influence functions of an approximate functional, which is defined in the intermediate Section 6.
Assumption 4.1
We assume throughout that the functions 1/a, b, g and their preliminary estimators are bounded away from their extremes: 0 and 1 for the first two, and 0 and ∞ for the third.
4.1. Tangent space and first order influence function
The one-dimensional submodels t ↦ pt induced by paths of the form at = a + tα, bt = b + tβ, and ft = f(1 + tϕ) for given directions α, β and ϕ (where ∫ ϕ f dν = 0) yield score functions
Bpaα(X) = −(Aa(Z) − 1) α(Z)/(a(a − 1))(Z),
Bpbβ(X) = A(Y − b(Z)) β(Z)/(b(1 − b))(Z),
Bpfϕ(X) = ϕ(Z).
Here Bpa, Bpb, Bpf are the score operators for the three parameters, whose direct sum is the overall score operator, which we write as Bp: Bp(α, β, ϕ)(X) is the sum of the three left sides of the preceding equation. The first-order influence function is well known to take the form
χp(1)(X) = Aa(Z)(Y − b(Z)) + b(Z) − χ(p). | (4.3) |
Indeed, it is straightforward to verify that this function satisfies, for every path t ↦ pt as described previously,
(d/dt)|t=0 χ(pt) = Ep[χp(1)(X) Bp(α, β, ϕ)(X)].
The advantage of choosing a as an inverse probability is clear from the form of the (random part of the) influence function, which is bilinear in (a, b). The corresponding "first order bias" can be computed to be
χ(p̂) − χ(p) + Pχp̂(1) = −∫ (â − a)(b̂ − b) g dν. | (4.4) |
In agreement with the heuristics given in Sections 1 and 3 this bias is quadratic in the errors of the initial estimator.
Actually, the form of the bias term is special in that squared estimation errors (â − a)2 and (b̂ − b)2 of the two initial estimators â and b̂ do not arise, but only the product of their errors. This property, termed "double robustness" in [34], makes that for first order inference it suffices that one of the two parameters be estimated well. A prior assumption that the parameters a and b are α and β regular, respectively, would allow estimation errors of the orders n−α/(2α+d) and n−β/(2β+d). If the product of these rates is O(n−1/2), then the bias term matches the variance. This leads to the (unnecessarily restrictive) condition (4.1).
If the preliminary estimators â and b̂ are solely selected for having small errors ‖â − a‖ and ‖b̂ − b‖ (e.g. minimax in the L2-norm), then it is hard to see why (4.4) would be small unless the product of the errors is small. Special estimators might exploit that the bias is an integral, in which cancellation of errors could occur. As we do not wish to use special estimators, our approach will be to replace the linear estimating equation by a higher order one, leading to an analogue of (4.4) that is a cubic or higher order polynomial in the estimation errors.
It may be noted that the marginal density f (or g) does not enter into the first order influence function (4.3). Even though the functional depends on f (or g), a rate on the initial estimator of this function is not needed for the construction of the first order estimator. This will be different at higher orders.
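A minimal simulation of the first order estimator built from (4.3) illustrates the double robustness of the bias (4.4): below the working estimate of b is deliberately wrong while that of 1/a is correct, and the estimator remains nearly unbiased (the functional forms of a, b and the sample size are illustrative choices):

```python
import random

random.seed(2)
n = 4000

def simulate():
    # Z uniform on [0,1]; P(A=1|Z) = 1/a(Z); P(Y=1|Z) = b(Z); Y is observed only if A = 1.
    data = []
    for _ in range(n):
        z = random.random()
        b_z = 0.2 + 0.6 * z                    # true b
        inv_a_z = 0.3 + 0.4 * z                # true 1/a = P(A = 1 | Z)
        a_obs = 1 if random.random() < inv_a_z else 0
        y = 1 if random.random() < b_z else 0
        data.append((y * a_obs, a_obs, z))     # observation X = (YA, A, Z)
    return data

def chi_first_order(data, a_hat, b_hat):
    # Empirical mean of the influence function (4.3) plus chi(p_hat):
    # A a_hat(Z) (Y - b_hat(Z)) + b_hat(Z), with working estimates a_hat, b_hat.
    s = sum(a_obs * a_hat(z) * (ya - b_hat(z)) + b_hat(z) for ya, a_obs, z in data)
    return s / len(data)

data = simulate()
truth = 0.5                                     # E b(Z) = 0.2 + 0.6 * E Z
# b_hat misspecified (constant), a_hat correct: by (4.4) the bias is the
# integral of the *product* of the two errors, hence (nearly) zero here.
est = chi_first_order(data, lambda z: 1.0 / (0.3 + 0.4 * z), lambda z: 0.5)
```

Swapping which of the two working models is misspecified gives the same conclusion, while misspecifying both leaves a bias of the order of the product of the two errors.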
5. Second order estimator
In this section we derive a second order influence function for the missing data problem, and analyze the risk of the corresponding estimator. This estimator is minimax if (α + β)/2 ≥ d/4 and
| (5.1) |
In the other case, higher order estimators have smaller risk, as shown in Sections 7–8. However, it is worthwhile to treat the second order estimator separately, as its construction exemplifies the essential elements, without involving the technicalities attached to the higher order estimators.
To find a second order influence function, we follow the strategy [1]–[3] of Section 3.1, and try to find a function χp(2): 𝒳2 → ℝ such that, for every x1 = (y1a1, a1, z1), and all directions α, β, ϕ,
(d/dt)|t=0 [χ(pt) + χpt(1)(x1)] = Ep[χp(2)(x1, X2) Bp(α, β, ϕ)(X2)].
Here the expectation Ep on the right side is relative to the variable X2 only, with x1 fixed. This equation expresses that x2 ↦ χp(2)(x1, x2) is a first order influence function of p ↦ χ(p) + χp(1)(x1), for fixed x1. On the left side we added the "constant" χ(pt) to the first order influence function (giving another first order influence function) to facilitate the computations. This is justified as the strategy [1]–[3] works with any influence function. In view of (4.3) and the definitions of the paths t ↦ a + tα, t ↦ b + tβ and t ↦ f(1 + tϕ), this leads to the equation
α(z1)a1(y1 − b(z1)) − β(z1)(a1a(z1) − 1) = Ep[χp(2)(x1, X2) Bp(α, β, ϕ)(X2)]. | (5.2) |
Unfortunately, no function χp(2) that solves this equation for every (α, β, ϕ) exists. To see this note that for the special triplets with β = ϕ = 0 the requirement can be written in the form
α(z1)a1(y1 − b(z1)) = Ep[χp(2)(x1, X2) Bpaα(X2)].
The right side of the equation can be written as ∫ K(z1, z2)α(z2) dF (z2), for K(z1, Z2) the conditional expectation of the function in square brackets given Z2. Thus it is the image of α under the kernel operator with kernel K. If the equation were true for any α, then this kernel operator would work as the identity operator. However, on infinite-dimensional domains the identity operator is not given by a kernel. (Its kernel would be a “Dirac function on the diagonal”.)
Therefore, we have to be satisfied with an influence function that gives a partial representation only. In particular, a projection onto a finite-dimensional linear space possesses a kernel, and acts as the identity on this linear space. A "large" linear space gives representation in "many" directions. By reducing the expectation in (5.2) to an integral relative to the marginal distribution of Z2, we can use an orthogonal projection Πp: L2(g) → L2(g) onto a subspace L of L2(g). Writing also Πp for its kernel, and letting S2h denote the symmetrization (h(X1, X2) + h(X2, X1))/2 of a function h: 𝒳2 → ℝ, we define
χp(2)(X1, X2) = −2S2[(A1a(Z1) − 1) Πp(Z1, Z2) A2(Y2 − b(Z2))]. | (5.3) |
Lemma 5.1
For χp(2) defined by (5.3) with Πp the kernel of an orthogonal projection Πp: L2(g) → L2(g) onto a subspace L ⊂ L2(g), equation (5.2) is satisfied for every path t ↦ pt corresponding to directions (α, β, ϕ) such that α ∈ L and β ∈ L.
Proof
By definition E(A|Z) = (1/a)(Z) and E(Y|Z) = b(Z). Also var(Aa(Z)|Z) = a(Z) − 1 and var(Y|Z) = b(Z)(1 − b)(Z). By direct computation using these identities, we find that for the influence function (5.3) the right side of (5.2) reduces to
(Πpα)(z1)a1(y1 − b(z1)) − (Πpβ)(z1)(a1a(z1) − 1).
Thus (5.2) holds for every (α, β, ϕ) such that Πpα = α and Πpβ = β.
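The key fact, that the kernel of a weighted finite-rank projection reproduces every function in its range space, can be checked directly; below the projection is onto functions that are constant on each of two bins, with a discrete weight standing in for g (all concrete choices are illustrative):

```python
# Weighted projection onto piecewise constant functions: the kernel
# Pi(z1, z2) = sum_j 1{z1 in B_j} 1{z2 in B_j} / g(B_j) acts as the identity
# on functions that are constant on each bin B_j.
points = [i / 10.0 + 0.05 for i in range(10)]                  # grid on [0, 1]
g = {z: 0.05 + 0.01 * (z > 0.5) for z in points}               # discrete weight for L2(g)
bins = [(0.0, 0.5), (0.5, 1.0)]                                # two bins -> rank-2 projection

def in_bin(z, b):
    return b[0] <= z < b[1]

gB = [sum(g[z] for z in points if in_bin(z, b)) for b in bins]  # g-mass of each bin

def Pi(z1, z2):
    return sum(in_bin(z1, b) * in_bin(z2, b) / gB[j] for j, b in enumerate(bins))

def project(alpha):
    # (Pi alpha)(z1) = sum over z2 of Pi(z1, z2) alpha(z2) g(z2)
    return lambda z1: sum(Pi(z1, z2) * alpha(z2) * g[z2] for z2 in points)

alpha_in_L = lambda z: 3.0 if z < 0.5 else -1.0                # an element of the range L
for z in points:
    assert abs(project(alpha_in_L)(z) - alpha_in_L(z)) < 1e-12
```

Functions outside L are mapped to their bin-wise weighted averages, which is the source of the remainder terms I − Πp appearing in the bias below.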
Together with the first order influence function (4.3) the influence function (5.3) defines the (approximate) influence function χp = χp(1) + χp(2). For an initial estimator p̂n based on independent observations we now construct the estimator (3.1), i.e.
χ̂n = χ(p̂n) + ℙnχp̂n(1) + 𝕌nχp̂n(2). | (5.4) |
Unlike the first order influence function, the second order influence function does depend on the density f of the covariates, or rather on the function g = f/a (through the kernel Πp, which is defined relative to L2(g)), and hence the estimator (5.4) involves a preliminary estimator ĝ of g. As a consequence, the quality of the estimator χ̂n of the functional χ depends on the precision by which g (as part of the plug-in p̂n) can be estimated.
Let Ê and v̂ar denote the conditional expectation and variance given the observations used to construct p̂n, let ‖·‖r be the norm of Lr(g), and let ‖Π‖r denote the norm of an operator Π: Lr(g) → Lr(g).
Theorem 5.1
The estimator χ̂n given in (5.4) with influence functions χp(1) and χp(2) defined by (4.3) and (5.3), for Πp the kernel of an orthogonal projection in L2(g) onto a k-dimensional linear subspace, satisfies, for r ≥ 2 (with r/(r − 2) = ∞ if r = 2),
The two terms in the bias result from having to estimate p in the second order influence function (giving a "third order bias") and from using an approximate influence function (leaving the remainders I − Πp after projection), respectively. The terms 1/n and k/n2 in the variance appear as the variances of ℙnχp̂n(1) and 𝕌nχp̂n(2), the second being a degenerate second order U-statistic (giving 1/n2, see (13.1)) with a kernel of variance of order k.
The proof of the theorem is deferred to Section 10.1.
Assume now that the range space of the projections Πp can be chosen such that, for some constant C,
| (5.5) |
Furthermore, assume that there exist estimators â, b̂ and ĝ that achieve convergence rates n−α/(2α+d), n−β/(2β+d) and n−γ/(2γ+d), respectively, in Lr(g) and Lr/(r−2)(g), uniformly over these a-priori models and a model for g (e.g. for r = 3), and that the preceding displays also hold for â and b̂. These assumptions are satisfied if the unknown functions a and b are "regular" of orders α and β on a compact subset of ℝd (see e.g. [32]). Then the estimator χ̂n of Theorem 5.1 attains the square rate of convergence
| (5.6) |
We shall see in the next section that the first of the four terms in this maximum can be made smaller by choosing an estimating equation of order higher than 2, while the other three terms arise at any order. This motivates determining a "second order optimal" value of k by balancing the second, third and fourth terms. We would then use the second order estimator if γ is large enough that the first term is negligible relative to the other terms.
For (α+β)/2 ≥ d/4 we can choose k = n and the resulting rate (the square root of (5.6)) is n−1/2 provided that (5.1) holds. The latter condition is certainly satisfied under the sufficient condition (4.1) for the linear estimator to yield rate n−1/2.
More interestingly, for (α + β)/2 < d/4 we choose k ∼ n2d/(d+2α+2β) and obtain, provided that (5.1) holds, the rate n−(2α+2β)/(2α+2β+d).
This rate is slower than n−1/2, but better than the rate n−α/(2α+d)−β/(2β+d) obtained by the linear estimator. In [29] this rate is shown to be the fastest possible in the minimax sense, for the model in which a and b range over balls in and , and g is known.
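The choice k ∼ n2d/(d+2α+2β) can be recovered by balancing the squared approximation bias against the k/n2 variance term of Theorem 5.1; a sketch, assuming the bias term is of order (1/k)(α+β)/d as under (5.5):

```latex
\Bigl(\tfrac{1}{k}\Bigr)^{2(\alpha+\beta)/d} \;\asymp\; \frac{k}{n^2}
\quad\Longrightarrow\quad
k \;\asymp\; n^{2d/(d+2\alpha+2\beta)},
\qquad
\frac{k}{n^2} \;\asymp\; n^{-2(2\alpha+2\beta)/(2\alpha+2\beta+d)} .
```

Taking the square root of the last expression gives the rate n−(2α+2β)/(2α+2β+d) shown to be minimax in [29].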
In both cases the second order estimator is better than the linear estimator, but minimax only for sufficiently large γ. This motivates considering higher order estimators.
6. Approximate functional
Even though the functional of interest does not possess an exact second-order influence function, we might proceed to higher orders by differentiating the approximate second-order influence function given in (5.3), and balancing the various terms obtained. However, the formulas are much more transparent if we compute exact higher-order influence functions of an approximating functional instead. In this section we first define a suitable functional and next compute its influence functions.
Following the heuristics of Section 3.3, we define an approximate functional by equation (3.5), using a particular projection of the parameters. We choose this projection to map the parameters a and b onto finite-dimensional models and leave the parameter g unaltered: p is mapped into an element of the approximating model, or equivalently a triplet (a, b, g) into a triplet in the approximating model for the three parameters (where g is unaltered). (Even though this is not evident in the notation, the projection is joint in the three parameters: the induced maps and do not reduce to maps and , but and depend on the full triplet (a, b, g).)
As “model” for (a, b) we consider the product of two affine linear spaces
| (6.1) |
for a given finite-dimensional subspace L of L2(ν) and fixed functions that are bounded away from zero and infinity. (Later the functions and are taken equal to the preliminary estimators; one choice for the other functions is .) The pair of projections are defined as elements of the model (6.1) satisfying equation (3.7). In view of (4.4), for any path , for given l, l′ ∈ L,
| (6.2) |
Equation (3.7) requires that the derivative of this expression with respect to t at t = 0 vanishes. Thus the functions and must be chosen to satisfy the set of stationary equations, for every l, l′ ∈ L,
| (6.3) |
| (6.4) |
Because the functions and are required to be in L, the second way of writing these equations shows that the latter two functions are the orthogonal projections of the functions and onto L in .
As explained in Section 3.3, as it satisfies (3.7) the projection renders the first order influence function of the approximate functional equal to the first order influence function of χ evaluated at the projection. Furthermore, the difference between χ and is quadratic in the distance between and p (see (3.6)). The following theorem summarizes the preceding and verifies these properties in the present concrete situation.
Theorem 6.1
For given measurable functions with and bounded away from zero and infinity, define a map by letting and be the orthogonal projections of and in onto a closed subspace L. Let correspond to and define . Then has influence function
| (6.5) |
Furthermore, for ,
Proof
The formula for the influence function agrees with the combination of equations (3.8) and (4.3), and can also be verified directly. In view of (3.5) and (4.4),
We rewrite the right side as an integral relative to dν, and next apply the Cauchy-Schwarz inequality. Finally we note that , and similarly for b.
The approximation error can be rendered arbitrarily small by choosing the space L large enough. Of course, we choose L to be appropriate relative to a-priori assumptions on the functions a and b. If these functions are known to belong to Hölder classes, then L can for instance be chosen as the linear span of the first k basis elements of a suitable orthonormal wavelet basis of L2(ν).
To compute higher order influence functions of we recursively determine influence functions of influence functions, according to the algorithm [1]–[3] in Section 3.1, starting with the influence function of , for a fixed x1. We defer the details of this derivation to Section 10.2, and summarize the result in the following theorem.
To simplify notation, define
| (6.6) |
These are the generic variables; indexed versions are defined by adding an index to every variable in the equalities. With this notation and with the second order influence function (5.3) at can be written as the symmetrization of . This function was derived in an ad-hoc manner as an approximate or partial influence function of χ, but it is also the exact influence function of . The higher order influence functions of possess an equally attractive form.
Theorem 6.2
An mth order influence function evaluated at (X1,…, Xm) of the functional defined in Theorem 6.1 is the degenerate (in L2(p)) part of the variable
Here ∏i,j is the kernel of the orthogonal projection in onto L, evaluated at (Zi, Zj).
To obtain the degenerate part of the variable in the preceding lemma, we apply the general formula (2.1) together with Lemma 10.2. Assertions (i) and (ii) of the latter lemma show that the variable is already degenerate relative to X1 and Xm, while assertion (iii) shows that integrating out the variable Xi for 1 < i < m simply collapses the product Πi−1,iΠi,i+1 into Πi−1,i+1. For instance, with Sm denoting symmetrization of a function of m variables,
| (6.7) |
As shown on the left, but not on the right of the equations, these quantities depend on the unknown parameter p = (a, b, g). In the right sides, the variables and depend on p through and , and hence are not observables. Furthermore, the kernels Πi,j depend on g as they are orthogonal projections in .
7. Parametric rate ((α + β)/2 ≥ d/4)
In this section we show that the parameter χ(p) is estimable at the parametric rate n−1/2 provided the average smoothness (α + β)/2 is at least d/4. We achieve this using the estimator
| (7.1) |
with the influence functions those of the approximate functional in Section 6: they are given in Theorems 6.1 and 6.2 for j = 1 and j = 2, …, m, respectively. (Because the map maps into itself, the influence function for j = 1 in the display is also the first order influence function (6.5) of χ, when evaluated at .)
We assume that the projections Πp and map to , for every s ∈ [r/(r − 1), r], with uniformly bounded norms. (For r = 2 this entails only s = 2; in this case we define r/(r − 2) = ∞.)
Theorem 7.1
The estimator (7.1), with Πp a kernel of an orthogonal projection in satisfying (12.1) with supx Πp(x, x) ≲ k, satisfies, for a constant c that depends on only, and r ≥ 2,
The first term in the bias is of the order 1 + 1 + (m − 1) = m + 1, as to be expected for an estimator based on an mth order influence function; the second term is due to estimating rather than χ; it is independent of m, and the same as in Theorem 5.1 if . The bound on the variance can roughly be understood in that each of the degenerate U-statistics in (7.1) contributes a term of order kj−1/nj.
For α-, β- and γ-regular parameters a, b, g on a d-dimensional domain the range space of the projections Πp can be chosen so that (5.5) holds and such that there exist estimators of a, b, g, with the first two taking values in this range space, with convergence rates n−α/(2α+d), n−β/(2β+d) and n−γ/(2γ+d). Then the second term in the bias (with ) is of order (1/k)α/d+β/d. If (α + β)/2 ≥ d/4 and we choose k = n, then this is of order n−(α+β)/d ≤ n−1/2. For k = n the standard deviation of the resulting estimator is also of the order n−1/2, while the first term in the bias can be made arbitrarily small by choosing a sufficiently large order m. Specifically, the estimator attains an n−1/2 rate of convergence as soon as
| (7.2) |
For any γ > 0 there exists an order m that satisfies this, and hence the parameter is estimable at the rate n−1/2 as soon as (α + β)/2 ≥ d/4.
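The role of the condition (α + β)/2 ≥ d/4 is elementary arithmetic on the second bias term with k = n; a sketch:

```latex
\Bigl(\tfrac{1}{k}\Bigr)^{\alpha/d + \beta/d}\Bigr|_{k=n}
  \;=\; n^{-(\alpha+\beta)/d} \;\le\; n^{-1/2}
\quad\Longleftrightarrow\quad
\frac{\alpha+\beta}{2} \;\ge\; \frac{d}{4},
```

so in the regime of this section this bias term never dominates the n−1/2 standard deviation.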
More ambitiously, we may aim at attaining the parametric rate for every γ > 0, without a-priori knowledge of γ. This can be achieved if (α + β)/2 > d/4 by using orders m = mn that increase to infinity with the sample size. In this case the estimator can also be shown to be asymptotically efficient in the semiparametric sense.
Theorem 7.2
If (α + β)/2 > d/4, then the estimator (7.1), with m = log n and Πp a kernel of an orthogonal projection in on a k = n/(log n)2-dimensional space satisfying (5.5) and (12.1) with supx Πp(x, x) ≲ k, based on preliminary estimators that attain rates (n/log n)−δ/(2δ+d) relative to the uniform norm, satisfies
An estimator that is asymptotically linear in the first order efficient influence function, as in the theorem, is asymptotically optimal in terms of the local asymptotic minimax and convolution theorems (see e.g. [40], Chapter 25). The present estimator actually loses its efficiency by splitting the sample into a part used to construct the preliminary estimators and a part used to form . This can easily be remedied by crossing over the two parts of the split and taking the average of the two estimators so obtained. By the theorem these are both asymptotically linear in their own samples, and hence their average is asymptotically linear in the full sample and asymptotically efficient.
The proofs of the theorems are deferred to Section 10.3.
8. Minimax rate at lower smoothness ((α + β)/2 < d/4)
If the average a-priori smoothness (α + β)/2 of the functions a and b falls below d/4, then the functional χ cannot be estimated any more at the parametric rate ([29]). The estimator (7.1) of Theorem 7.1 can still be used and, with its bias and variance as given in the theorem properly balanced, attains a certain rate of convergence, faster than the current state-of-the-art linear estimators. However, in this section we present an estimator that is always better, and attains the minimax rate of convergence n−(2α+2β)/(2α+2β+d) provided that the parameter g is sufficiently regular.
This estimator takes the same general form
| (8.1) |
as the estimator (7.1), but the influence functions for j ≥ 3 will be different. The idea is to “cut out” certain terms from the influence functions in (7.1) in order to decrease the variance, but without increasing the bias. For clarity we first consider the third order estimator, and next extend to the general mth order. To attain the minimax rate the order m must be fixed to a large enough value so that the first term in the bias given in Theorem 7.1 is no larger than n−(2α+2β)/(2α+2β+d). (Apart from added complexity there is no loss in choosing m larger than needed.)
The third order kernel in (6.7) is the symmetrization of the variable
Here Πp is the kernel of an orthogonal projection in onto a k-dimensional linear space, which we may view as the sum of k projections on one-dimensional spaces. The quantity k2 in the order O(k2/n3) of the variance in Theorem 7.1 for m = 3 arises as the number of terms in the product of the two k-dimensional projection kernels. It turns out that this order can be reduced without increasing the bias by cutting out “products of projections on higher base elements”.
To make this precise, we partition the projection space in blocks, and decompose the two projections in the influence function over the blocks:
| (8.2) |
Here is the projection on the subspace spanned by base elements with index in the interval (m, n], and 1 = k−1 < k0 < k1 < ⋯ < kR = k and 1 = l−1 < l0 < l1 < ⋯ < lS = k are suitable partitions of the set {1, …, k}. (“Full” partitions in singleton sets would make the construction conceptually simpler, but a small number of blocks will be needed in our proofs.) The product of the two kernels now becomes a double sum, from which we retain only the terms with small values of (r, s). The improved third order influence function is, with S3 denoting symmetrization as before,
| (8.3) |
The negative term in the display is the conditional expectation given Z1, Z3 of the leading term, and maintains the degeneracy of the kernel. For the decomposition (8.2) to be valid, the subspaces corresponding to the blocks must be orthogonal in . We may achieve this by starting with a standard basis e1, e2, …, with good approximation properties for a target model, and next replacing this by an orthonormal basis in by the Gram-Schmidt procedure. For a bounded g the approximation properties will be preserved.
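The Gram-Schmidt step just described can be sketched numerically; a minimal illustration, assuming a discretized inner product ⟨u, v⟩ = ∫ u v g dν approximated by a Riemann sum on a grid (the monomial basis, the weight g, and the grid are hypothetical stand-ins, not choices made in the paper):

```python
import numpy as np

def weighted_gram_schmidt(basis, g, dx):
    """Orthonormalize the rows of `basis` with respect to the discretized
    weighted inner product <u, v> = sum(u * v * g) * dx, a Riemann-sum
    approximation of the L2(g) inner product."""
    ortho = []
    for f in basis:
        v = f.astype(float).copy()
        for e in ortho:
            v -= np.sum(v * e * g) * dx * e   # remove the component along e
        norm = np.sqrt(np.sum(v * v * g) * dx)
        ortho.append(v / norm)
    return np.array(ortho)

# Example: monomials on [0, 1] and a weight bounded away from 0 and infinity.
x = np.linspace(0, 1, 2001)
dx = x[1] - x[0]
g = 1.0 + 0.5 * np.sin(2 * np.pi * x)
basis = np.array([x**j for j in range(4)])
E = weighted_gram_schmidt(basis, g, dx)
G = (E * g) @ E.T * dx   # discrete Gram matrix; should be ~ the identity
```

Because the orthonormalization and the Gram matrix use the same discrete inner product, G equals the identity up to floating-point roundoff; approximation properties of the original basis are inherited by the span, as noted above.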
The grids are defined by
| (8.4) |
| (8.5) |
where R and S are chosen such that kR ∼ lS ∼ k (note that k0 = l0 = n). In these definitions the notation ~ means “equal up to a fixed multiple” (needed to allow that kr and ls are (dyadic) integers). For ease of notation let ls = l−1 for s ≤ −1, and ls = lS for s ≥ S.
The grids k0 < k1 < ⋯ < kR and l0 < l1 < ⋯ < lS partition the integers n, n + 1, …, k into R and S groups. As , for every r, s ≥ 0, the cut-off r + s ≤ D in (8.3) is delimited by the “hyperbola” iαjβ ∼ 2Dnα+β in the space of indices (i, j) ∈ {1, …, k}2 of the base elements used in the two kernels, with only the pairs below the hyperbola retained (see Figure 1). The intuition behind this hyperbolic cut-off is the product form of the bias (4.4): a higher order correction on the estimator of a may combine with a lower order correction on b, and vice versa, to give an overall correction of the desired order. The overall bias is smaller if the cut-off D is chosen larger, but then more terms are included in the estimator and the variance is bigger.
Fig 1.

Both axes carry the indices of the basis functions spanning the projection space L, and a point in the plane refers to a product of two projections. Products of projections on pairs of basis functions in the shaded area are included in the third order influence function. The step function refers to the partitions of the indices as in (8.2).
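The retention region of Figure 1 can also be sketched numerically; a small illustration (all parameter values below are hypothetical stand-ins) marking the index pairs below the hyperbola iαjβ ≤ 2Dnα+β:

```python
import numpy as np

# Hypothetical values of the smoothness indices, cut-off and sample sizes,
# chosen for illustration only.
alpha, beta, D = 1.0, 2.0, 4
n, k = 100, 1000

i = np.arange(1, k + 1, dtype=float)
logi = np.log(i)
# Work on the log scale: a pair (i, j) is retained when
#   alpha * log i + beta * log j <= D * log 2 + (alpha + beta) * log n.
below = np.add.outer(alpha * logi, beta * logi) \
        <= D * np.log(2) + (alpha + beta) * np.log(n)
frac_retained = below.mean()   # fraction of the k^2 index pairs kept
```

All pairs with both indices at most n lie below the hyperbola, while pairs of high-index basis functions are cut out, which is exactly the variance reduction described above.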
Before deriving an optimal value of D, we introduce the mth order estimator for general m ≥ 3. Again we take the estimator of Theorem 7.1 as our starting point, but we modify the higher order influence functions , for j = 4, …, m, similarly to, and in addition to, the modification of the third order influence function. For given j the former influence function is given in Theorem 6.2 (with the m of the theorem taken equal to j), and is based on a product of j − 1 projection kernels. We modify this in two steps. For each of the j − 2 contiguous pairs of kernels ((1st, 2nd), (2nd, 3rd), …, ((j − 2)th, (j − 1)th)) we form a new kernel by truncating the pair at the hyperbola as described previously for the third order kernel, and truncating all other kernels at n. Next the modified jth order kernel is the sum of the resulting j − 2 kernels. More formally, the modified jth order kernel is equal to
| (8.6) |
where is the symmetrized, degenerate (relative to L2(p)) part of the variable, for i = 1, …, j − 2, written in the notation of Theorem 7.1,
For j = 3 there is only one pair of kernels, and the construction reduces to the modification (8.3) as discussed previously.
We assume that the projections and map to , for every s ∈ [r/(r − 1), r], with uniformly bounded norms.
Theorem 8.1
The estimator (8.1) for m ≥ 3 with the influence functions and given in (6.5) and (6.7) for j = 1, 2, respectively, and in (8.6) for j ≥ 3, and with kernels of orthogonal projections in satisfying (12.1) with , satisfies, for r ≥ 2 (and r/(r − 2) = ∞ if r = 2),
A proof of the theorem is presented in Sections 10.4 and 10.5.
The first two terms in the bias are the same as in Theorem 7.1; the third and fourth terms are the price paid for cutting out terms from the influence function. The benefit is a reduced variance. We shall show that the boundary parameter D can be chosen such that the third term in the variance (resulting from the third and higher order parts of the influence function) is not bigger than the second term, while the increase in bias is negligible.
Assume that the functions a and b and their estimates are known to belong to models that are well approximated by the base functions e1, e2, … in the sense that, for , and every value l in one of the two grids (8.4)–(8.5),
| (8.7) |
| (8.8) |
Then the second term in the bias is of the order (1/k)α/d+β/d, as in Theorem 7.1, which is smaller than the minimax rate n−(2α+2β)/(2α+2β+d) for
| (8.9) |
With this choice of k, the upper bound on the variance is of the square minimax rate n−(4α+4β)/(2α+2β+d) if D is chosen to satisfy
| (8.10) |
Furthermore, under (8.9) the numbers R, S of grid points are of the order log n.
In the third term of the bias we apply assumptions (8.7)–(8.8) and the identity , which results from (8.4)–(8.5), to see that this term is of order
If the convergence rate of is n−γ/(2γ+d), then, for the choice of D given in (8.10), this can (by a calculation) be seen to be of smaller order than the minimax rate n−(2α+2β)/(2α+2β+d) if γ is large enough that
| (8.11) |
The fourth term in the bias can, by a similar analysis, be seen to be of the order
Again this is smaller than the minimax rate if γ satisfies assumption (8.11).
Finally, if the convergence rates of and are n−α/(2α+d) and n−β/(2β+d), then the first term in the upper bound of the bias is of the order
We choose m large enough so that this is of smaller order than the preceding terms. In particular, we can choose it so that this is smaller than the minimax rate.
We summarize this in the following corollary, which is the most advanced result of the paper.
Corollary 8.1
If (8.7)–(8.11) hold, and are kernels of orthogonal projections in satisfying (12.1) with , then the mth order estimator with the kernels (8.6) for j ≥ 3 and sufficiently large m and suitable initial estimators, attains the rate n−(2α+2β)/(2α+2β+d) for estimating χ(p).
9. Other examples
In this section we briefly indicate a number of other examples for which our general heuristics have been worked out, leading to well known or novel estimators.
9.1. Density estimation
Consider estimating a density χ(p) = p(a) at the fixed point a based on a random sample from p. A first order influence function of this functional would satisfy, for every smooth path t ↦ pt with score function g at t = 0,
In a nonparametric situation every zero-mean function g arises as a score function, and hence would have to be a “Dirac function at a”. Because such a function does not exist (except for very special p), in this example already the first order influence function fails to exist.
We may approximate the Dirac function by the function x ↦ Π(a, x) for Π the kernel of an orthogonal projection onto a given (large) subspace L of L2(μ). Because ∫ Π(a, x)g(x)p(x) dμ(x) = g(a)p(a) for every function g such that gp ∈ L, the function x ↦ Π(a, x) achieves representation for a large set of scores. The corresponding degenerate version is x ↦ Π(a, x) − Πp(a), for Πp = ∫ Π(·, x)p(x) dμ(x) the projection of p. The corresponding first order estimator (3.1) is
If , then the second term vanishes and the estimator reduces to . This is the usual projection estimator (cf. [25, 32]): if L is spanned by the orthonormal set e1, e2, …, ek, then and .
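The projection estimator can be illustrated with a concrete orthonormal basis; a minimal numerical sketch using the cosine basis of L2[0, 1] (the basis, sample size and evaluation point are illustrative choices, not taken from the paper):

```python
import numpy as np

def cosine_basis(x, k):
    """First k elements of the orthonormal cosine basis of L2[0, 1]."""
    cols = [np.ones_like(x)] + \
           [np.sqrt(2) * np.cos(j * np.pi * x) for j in range(1, k)]
    return np.stack(cols)                       # shape (k, len(x))

def projection_estimator(a, sample, k):
    """hat p(a) = (1/n) sum_i Pi(a, X_i), for Pi the rank-k projection
    kernel Pi(a, x) = sum_j e_j(a) e_j(x)."""
    ea = cosine_basis(np.atleast_1d(a), k)      # basis evaluated at the point a
    ex = cosine_basis(sample, k)                # basis evaluated at the data
    return (ea.T @ ex).mean(axis=1)             # average the kernel over the sample

rng = np.random.default_rng(0)
sample = rng.uniform(size=5000)                 # true density: p = 1 on [0, 1]
est = projection_estimator(0.5, sample, k=10)[0]
```

For uniform data the estimate fluctuates around the true value p(0.5) = 1; increasing k reduces the approximation bias at the cost of variance, in line with the trade-off discussed in the earlier sections.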
Alternative to viewing x ↦ Π(a, x) as an approximation to the “ideal” influence function, we can derive it as the exact influence function of the approximate functional .
9.2. Quadratic functionals
Consider estimating the functional χ(p) = ∫ p2 dμ based on a random sample of size n from the density p.
The first order influence function of this functional exists on the full non-parametric model, and can be seen to take the form
By the algorithm [1]–[3] of Section 3.1 a second order influence function can be computed as the degenerate part of an influence function of the functional , for fixed x1. As seen in Section 9.1, point evaluation is not a differentiable functional, but has the kernel Π of an orthogonal projection in L2(μ) as an approximate influence function. Thus an approximate second order influence function of the present functional, minus its projection onto the degenerate functions, is given by
This may also be derived as an exact influence function of the approximate functional .
It can be checked that the estimator (7.1) for m = 2, given an initial estimator that is contained in the range of Π, reduces to , which is a well known estimator ([17]).
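A U-statistic of this type is easy to implement; a minimal numerical sketch with the cosine basis on [0, 1] (the basis and the sample are illustrative assumptions, not choices made in the paper):

```python
import numpy as np

def quadratic_functional_estimate(sample, k):
    """U-statistic estimate of int p^2 dmu based on a rank-k cosine-basis
    projection kernel: (n(n-1))^{-1} sum_{i != j} Pi(X_i, X_j)."""
    n = len(sample)
    cols = [np.ones_like(sample)] + \
           [np.sqrt(2) * np.cos(j * np.pi * sample) for j in range(1, k)]
    E = np.stack(cols)                   # (k, n) basis evaluations
    s = E.sum(axis=1)                    # sum_i e_l(X_i), for each l
    # sum_{i != j} Pi(X_i, X_j) = sum_l [ (sum_i e_l(X_i))^2 - sum_i e_l(X_i)^2 ]
    total = np.sum(s ** 2) - np.sum(E ** 2)
    return total / (n * (n - 1))

rng = np.random.default_rng(1)
sample = rng.uniform(size=4000)          # true density p = 1, so int p^2 = 1
est = quadratic_functional_estimate(sample, k=8)
```

Excluding the diagonal terms i = j removes the bias of the naive plug-in, which is the role of the degenerate second order U-statistic in the general construction.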
9.3. Doubly robust models
The heuristics described in Section 3 ought to be applicable in a wide range of estimation problems, but the detailed treatment of the missing data problem in Sections 4–8 shows that their implementation can be involved. Inspection of the proofs reveals that the particular implementation in the latter sections is based on the structure (4.3) of the first order influence function in the missing data problem. The argument extends to semiparametric models with first order influence function of the form
| (9.1) |
for known functions Si(x) of the data (i.e. S = (S1, S2, S3, S4) is a given statistic). The full parameter may be a quadruplet p ↔ (a, b, c, f), in which f is the marginal density of an observable covariate Z, and c does not appear in (9.1). Other examples of this structure are described in [27, 37].
10. Proofs
10.1. Proof of Theorem 5.1
Write and Π for and Πp, respectively, for both the kernels and the corresponding projection operators, and drop p also in and . From (4.4) and (5.3) we have
The double integral on the far right with replaced by Π can be written as the single integral , for the image of under the projection Π. Added to the first integral on the right this gives , which is bounded in absolute value by the second term in the upper bound for the bias.
Replacement of by Π in the double integral gives a difference
by Hölder’s inequality, for a conjugate pair (r, s). Considering as the projection in with weight 1, and Π as the weighted projection in with weight function , we can apply Lemma 12.7(i) (with q = s/r and rp = s/(s − 2)) to see that this is bounded in absolute value by
Because is assumed bounded away from 0 and infinity, this is of the same order as the first term in the upper bound on the bias (if r replaces s).
Because the function is uniformly bounded, the (conditional) variance of is of the order O(1/n). Thus for the variance bound it suffices to consider the (conditional) variance of . In view of Lemma 13.1 and (13.1) this is bounded above by a multiple of
The variables and are uniformly bounded. Hence the last term on the right is bounded above by a multiple of , which is equal to k/n2, by Lemma 12.3.
10.2. Proof of Theorem 6.2
We compute the higher order influence functions of the approximate functional using the algorithm [1]–[3] in Section 3. This starts by computing the second order influence function as the derivative of , for fixed x1. Because the latter functional (given in (6.5)) depends on the parameters only through and , the following lemma does the main part of the work.
Lemma 10.1
For fixed z1 influence functions of and are given by
where Πp is the kernel of the orthogonal projection in onto L.
Proof
We can write the equation (6.3) determining as , for every l ∈ L. Insert a sufficiently regular path pt, given by parameters (at, bt, ft), and differentiate the equality relative to t at t = 0 to find, with γ a score function of the path
Using the fact that E(A|Z) = 1/a(Z), where a is bounded away from zero, we can also write this as
Because the function is contained in L for every t by construction, the function is also contained in L. Combined with the validity of the preceding display for every l ∈ L, we conclude that is the weighted projection of in L2(P) onto the space {l(Z): l ∈ L} relative to the weight . The projection can be represented in terms of a kernel operator (cf. Lemma 12.1). If denotes the kernel, then
This represents the derivative on the left as an inner product of the score function γ with the function on the right of the first equation of the lemma (evaluated at X2). Thus the first assertion of the lemma is proved.
The second assertion is proved similarly. Using that E(Y|Z) = b(Z) and E(A|Z) = 1/a(Z), we start by writing the equation (6.4) defining as , for every l ∈ L. By the same arguments as before we conclude that is the weighted projection of in L2(P) onto the space {l(Z): l ∈ L}, relative to the weight (ab/a)(Z). ■
The first order influence function (6.5) depends on p only through and and hence the chain rule and the preceding lemma imply that a second order influence function of is given by the degenerate part of
| (10.1) |
(Note that this function is symmetric in (X1, X2); Πp is symmetric, because it is an orthogonal projection kernel.) Actually, this function is already degenerate and hence is the second order influence function of .
Lemma 10.2
For any fixed z1 and z3,
.
.
.
Proof
Because and are the weighted projections in L2(P) of and , respectively, onto {l(Z):l ∈ L} relative to the weights (ab/a)(Z),
| (10.2) |
| (10.3) |
These two assertions imply (i) and (ii). The third assertion follows from the fact that Πp is the kernel of the weighted projection in L2(P) onto L relative to the weight (ab/a)(Z). ■
The second order influence function (10.1) depends on p through and and through the kernel Πp. We proceed to higher orders by differentiating the influence function relative to these components, and applying the chain rule, where we use the influence functions of and as given previously in Lemma 10.1, and the influence function of as given in Lemma 12.8.
Proof of Theorem 6.2
Denote the symmetrization of the variable in the theorem by . Then is the function given by (10.1), which was seen to be a second order influence function in the preceding discussion. We show by induction on m that is an influence function of . The theorem is then a corollary of Lemma 11.2.
By Lemmas 10.1 and 12.8,
The influence function of is
The influence function of is .
The influence function of is zero.
The influence function of is .
Applying this repeatedly readily gives an expression for the influence function of
. The symmetrization of this expression is the same expression, but then with m replaced by m + 1 and an added minus sign. ■
10.3. Proof of Theorems 7.1 and 7.2
Let  and Ŷ be and as in (6.6) with a and b in their definitions replaced by â and . Because â and are projected onto themselves under the map (see Theorem 6.1), we actually obtain the same variables by replacing and by â and , respectively: and . Furthermore, let Π and denote the operators Πp and , respectively, and Πi,j and their kernels evaluated at (Zi, Zj).
By explicit calculations,
| (10.4) |
for defined by
The variable is bounded by the second term in the expression for in the statement of the theorem. We next show by induction on m that
| (10.5) |
The analysis of the bias can then be concluded by showing that the right side of (10.5) is of the order as the first term given in the theorem.
Equation (10.4) and the definition of readily show that identity (10.5) is true for m = 2. We proceed to general m by induction. Relative to its value for m, the left side receives for m + 1 the extra term , which is equal to (−1)m times minus a sum of terms resulting from projections of this leading term. This extra term without the factor (−1)m (but including the projections) can be written (cf. (6.7) and (2.1))
| (10.6) |
To prove the induction hypothesis for m + 1 it suffices to show that this is equal to
| (10.7) |
To achieve this we expand the two terms of the preceding display into sums of expressions of the form, with each equal to or Πj,j+1 and l the number of j for which the first alternative is true,
| (10.8) |
and of the same form with m + 1 replacing m for the second term of (10.7). As the notation suggests, the expression in (10.8) depends on l (and on m, but this is fixed), but not on which K are equal to or Π. To see this we use that Π is a projection onto L in L2(abg), so that ∫ Π1,2γ(z2)(abg)(z2) dν(z2) = γ(z1) for every γ ∈ L; and is also a projection onto L, so that as a function of one argument is contained in L. This observation yields the identities, for K equal to or Π,
This allows to reduce (10.8) to
Thus after expanding the two terms of (10.7) in the quantities Bl, and simplifying these quantities, we can write their sum (10.7) as
The difference of the binomial coefficients is . The expression is equal to (10.6), as claimed. This completes the proof of (10.5).
Next we bound the right side of (10.5), by taking the expectation in turn with respect to Xm, Xm−1, …, X1. For Mŵ multiplication by the function ŵ = g/ĝ,
Next, for any function h and i = m − 1, m − 2, …, 2,
Combining these equations, we can write the right side of (10.5) in the form
We bound this by first applying Hölder’s inequality, with conjugate pair (τ, t) with τ equal to r as in the statement of the theorem, and next Lemma 12.7(iii), with and Π viewed as weighted orthogonal projections in L2(abĝ) with weights 1 and ŵ, respectively, and r = τ(m−1)/(m+τ−3), p = (m+τ−3)/(τ−2) and q = (m+τ−3)/(m−1), so that rp = (m−1)τ/(τ−2) and rq = τ (and m of the lemma taken equal to the present m minus 1).
To bound the (conditional) variance of we use Lemma 13.1 to see that
because is degenerate under . The variable is the symmetrization of the projection of onto the degenerate variables. Because the second moment of a mean of (arbitrary) random variables is bounded above by the maximum of the second moments of the terms, we can ignore the symmetrization, while the projection decreases the second moment. This shows that
by Lemma 12.4 and the assumption that the kernels are bounded by k on the diagonal.
We complete the proof of Theorem 7.1 by bounding the square of by . The extra factor 2j can be incorporated in the constant c in the theorem.
For the proof of Theorem 7.2 it clearly suffices to show that
Because an influence function is centered at mean zero, the first is simply √n times the bias of . By Theorem 7.1 the bias is of the order
The first term is trivially o(n−1/2), as mn → ∞. In the second we write (α + β)/d = r/2, where r > 1 by assumption, and see that it is o(n−1/2), since kn−1/r → ∞.
To handle the variance we split the estimator in its linear and higher order terms. The sum of the variances of the U-statistics of orders 2 to m in is bounded by the sum of the terms j ≥ 2 in Theorem 7.1, i.e.
by the inequalities , for j < n/2, and , by Stirling’s approximation with bound. The expression in brackets is bounded by 2ckm/n ≲ 1/log n, for m ~ log n and k ~ n/(log n)2. Thus the sum tends to zero by dominated convergence. Finally the linear term in gives the contribution
From the explicit expression (4.3) for the first order influence function (or (6.5) in the case of , which gives an identical function), this is seen to tend to zero by the dominated convergence theorem.
10.4. Proof of Theorem 8.1 for m = 3
The theorem asserts that the bias of the estimator is equal to the sum of four terms, the first two of which also arise in the bias of the estimator considered in Theorem 7.1. Therefore, we can prove the assertion on the bias by showing that the expected values of the current estimator (for m = 3) and the estimator in Theorem 7.1 differ by less than the additional bias terms in Theorem 8.1.
The two estimators differ only in their third order influence functions, where the present estimator retains only the terms in the double sum (8.3) with r = 0, s = 0, or r + s ≤ D. Thus the difference of the expectations of the two estimators is equal to
The expectation Êp refers to the variable (X1, X2, X3) for fixed values of the preliminary samples, which are indicated in the “hat” symbols on Â1, Ŷ3 and the kernels, and hence is an integral relative to the density (x1, x2, x3) ↦ p(x1)p(x2)p(x3). If we replace p(x2) in this density by , then the integral will be zero, as the kernel is degenerate under . Thus we may integrate against . In that case the projection term integrates to zero, as it does not depend on X2 and , and hence can be dropped. Next we condition and on Z1, Z2, Z3 and write the preceding display in the form
for ρ and the measures defined by dρ = abg dv and . The double sum can be rewritten as the sum over r running from 1 to R and over s from D − r + 1 to S, which gives the equivalent representation, with the × referring to “tensor products” as explained in Section 2,
We write , and next arrive at the difference of two expressions of the type, with and , respectively,
If the measure of integration were (with instead of ρ), then we could perform the integrals on z1 and z3 and next apply Hölder’s inequality to bound the resulting expression in absolute value by
where the norms are those of , which are equivalent to those of L2(ν), by assumption. We can write and use the assumed boundedness of as an operator on to bound this by the third term in the bias.
Replacing by can be achieved by writing the first and last occurrence of ρ as and expanding the resulting expression on the + signs into four terms. One of these has the measure . The other three terms have two or three occurrences of , and can be bounded by the first term in the bias (with m = 3). This is argued precisely under (10.12) below.
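The degeneracy argument above (the integral vanishes once p(x2) is replaced by the centering density, because the kernel has mean zero in that argument) can be illustrated with a toy kernel; the kernel and distributions below are hypothetical, not those of (8.3):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy degenerate kernel: h(x1, x2) = f(x1) * (g(x2) - E g(X2)) integrates to
# zero over x2 under the centering distribution, so the full expectation
# vanishes regardless of the law of X1.
def h(x1, x2, g_mean):
    return np.sin(x1) * (x2 ** 2 - g_mean)

x2 = rng.uniform(0, 1, 500_000)   # centering density: Uniform(0, 1)
g_mean = 1 / 3                    # E[X2^2] under Uniform(0, 1)
x1 = rng.normal(size=500_000)     # arbitrary law for X1

est = h(x1, x2, g_mean).mean()    # Monte Carlo expectation
assert abs(est) < 5e-3            # degenerate kernel: expectation is ~ 0
```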
Because the first and second order influence functions are equal to those of the estimator considered in Theorem 7.1, the (conditional) variances of for j = 1, 2 can be seen to be of the orders O(1/n) and O(k/n^2), respectively, by the same proof. By Lemma 13.1 the variance for j = 3 is bounded by (see (13.1))
where . After bounding out and , we write the squared sum as a double sum. From the fact that the projections are orthogonal for different r, it follows that the off-diagonal terms of the double sum vanish (the expectation with respect to X1 is zero). Thus the preceding display is bounded above by a multiple of
By Lemmas 12.4 and 12.3 and the assumption that this is bounded by a multiple of
By (8.4) for r ≥ 1. On substituting this in the display, and noting that l_{D−r} = 0 if r > D, we see that this is bounded by k/n^2 + 2^{D/α ∨ D/β}/n if α ≠ β and bounded by k/n^2 + D 2^{D/α}/n if α = β.
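The orthogonality used above — projections onto distinct resolution levels are orthogonal, so the off-diagonal terms of the double sum vanish — can be seen in a small Haar computation; the basis and the two levels chosen are illustrative:

```python
import numpy as np

# Haar detail functions psi_{level,j}(x) = 2^(level/2) psi(2^level x - j)
# on [0, 1); inner products across distinct levels vanish.
def haar_details(level, grid):
    funcs = []
    for j in range(2 ** level):
        f = np.where(
            (grid >= j * 2.0 ** -level) & (grid < (j + 0.5) * 2.0 ** -level), 1.0,
            np.where(
                (grid >= (j + 0.5) * 2.0 ** -level) & (grid < (j + 1) * 2.0 ** -level),
                -1.0, 0.0,
            ),
        )
        funcs.append(2 ** (level / 2) * f)
    return np.array(funcs)

N = 2 ** 12
grid = (np.arange(N) + 0.5) / N  # midpoint grid on [0, 1)
Psi_r = haar_details(2, grid)    # level r = 2: 4 detail functions
Psi_s = haar_details(4, grid)    # level s = 4: 16 detail functions

# Discrete inner products approximate the L2 integrals; cross-level
# inner products are exactly zero, so the Gram matrix vanishes.
gram = Psi_r @ Psi_s.T / N
assert np.abs(gram).max() < 1e-10
```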
Acknowledgments
The research leading to these results has received funding from the European Research Council under ERC Grant Agreement 320637.
SUPPLEMENTARY MATERIAL
Supplement: Higher order estimating equations
(doi: COMPLETED BY THE TYPESETTER;.pdf). The remainder of the paper is given in the supplement.
References
- 1. Bickel PJ. On adaptive estimation. Ann Statist. 1982;10(3):647–671. MR663424.
- 2. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins Series in the Mathematical Sciences. Johns Hopkins University Press; Baltimore, MD: 1993. MR1245941.
- 3. Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā Ser A. 1988;50(3):381–393. MR1065550.
- 4. Birgé L, Massart P. Estimation of integral functionals of a density. Ann Statist. 1995;23(1):11–29. MR1331653.
- 5. Bolthausen E, Perkins E, van der Vaart A. Lectures on probability theory and statistics. Bernard P, editor. Lecture Notes in Mathematics, Vol. 1781. Springer-Verlag; Berlin: 2002. Lectures from the 29th Summer School on Probability Theory held in Saint-Flour, July 8–24, 1999. MR1915443.
- 6. Cai TT, Low MG. Nonquadratic estimators of a quadratic functional. Ann Statist. 2005;33(6):2930–2956. doi:10.1214/009053605000000147. MR2253108.
- 7. Cai TT, Low MG. Optimal adaptive estimation of a quadratic functional. Ann Statist. 2006;34(5):2298–2325. doi:10.1214/009053606000000849. MR2291501.
- 8. Cohen A, Dahmen W, Daubechies I, DeVore R. Tree approximation and optimal encoding. Appl Comput Harmon Anal. 2001;11(2):192–226. doi:10.1006/acha.2001.0336. MR1848303.
- 9. Daubechies I. Ten lectures on wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 61. SIAM; Philadelphia, PA: 1992. MR1162107.
- 10. Donoho DL, Nussbaum M. Minimax quadratic estimation of a quadratic functional. J Complexity. 1990;6(3):290–323. doi:10.1016/0885-064X(90)90025-9. MR1081043.
- 11. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust statistics: the approach based on influence functions. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons; New York: 1986. MR829458.
- 12. Härdle W, Kerkyacharian G, Picard D, Tsybakov A. Wavelets, approximation, and statistical applications. Lecture Notes in Statistics, Vol. 129. Springer-Verlag; New York: 1998. MR1618204.
- 13. Has’minskiĭ RZ, Ibragimov IA. On the nonparametric estimation of functionals. In: Proceedings of the Second Prague Symposium on Asymptotic Statistics (Hradec Králové, 1978). North-Holland; Amsterdam–New York: 1979. pp. 41–51. MR571174.
- 14. Huber PJ, Ronchetti EM. Robust statistics. Second edition. Wiley Series in Probability and Statistics. John Wiley & Sons; Hoboken, NJ: 2009. MR2488795.
- 15. Kerkyacharian G, Picard D. Estimating nonquadratic functionals of a density using Haar wavelets. Ann Statist. 1996;24(2):485–507. MR1394973.
- 16. Koševnik JA, Levit BJ. On a nonparametric analogue of the information matrix. Teor Verojatnost i Primenen. 1976;21(4):759–774. MR0428578.
- 17. Laurent B. Estimation of integral functionals of a density and its derivatives. Bernoulli. 1997;3(2):181–211. MR1466306.
- 18. Laurent B, Massart P. Adaptive estimation of a quadratic functional by model selection. Ann Statist. 2000;28(5):1302–1338. MR1805785.
- 19. Lindsay BG. Efficiency of the conditional score in a mixture setting. Ann Statist. 1983;11(2):486–497. MR696061.
- 20. Murphy SA, van der Vaart AW. On profile likelihood. J Amer Statist Assoc. 2000;95(450):449–485. With comments and a rejoinder by the authors. MR1803168.
- 21. Nemirovski A. Topics in non-parametric statistics. In: Lectures on probability theory and statistics (Saint-Flour, 1998). Lecture Notes in Mathematics, Vol. 1738. Springer; Berlin: 2000. pp. 85–277. MR1775640.
- 22. Pfanzagl J. Contributions to a general asymptotic statistical theory. Lecture Notes in Statistics, Vol. 13. Springer-Verlag; New York: 1982. With the assistance of W. Wefelmeyer. MR675954.
- 23. Pfanzagl J. Asymptotic expansions for general statistical models. Lecture Notes in Statistics, Vol. 31. Springer-Verlag; Berlin: 1985. With the assistance of W. Wefelmeyer. MR810004.
- 24. Pfanzagl J. Estimation in semiparametric models: some recent developments. Lecture Notes in Statistics, Vol. 63. Springer-Verlag; New York: 1990. MR1048589.
- 25. Prakasa Rao BLS. Nonparametric functional estimation. Probability and Mathematical Statistics. Academic Press; New York: 1983. MR740865.
- 26. Robins J, Li L, Tchetgen E, van der Vaart A. Supplement to “Higher order estimating equations for high-dimensional models”. doi:10.1214/16-AOS1515.
- 27. Robins J, Li L, Tchetgen E, van der Vaart A. Higher order influence functions and minimax estimation of nonlinear functionals. In: Probability and statistics: essays in honor of David A. Freedman. Inst Math Stat Collect, Vol. 2. Inst Math Statist; Beachwood, OH: 2008. pp. 335–421. MR2459958.
- 28. Robins J, Li L, Tchetgen E, van der Vaart A. Quadratic semiparametric von Mises calculus. Metrika. 2009a;69:227–247. doi:10.1007/s00184-008-0214-3.
- 29. Robins J, Li L, Tchetgen E, van der Vaart A. Semiparametric minimax rates. Electron J Stat. 2009b;3:1305–1321. doi:10.1214/09-EJS479.
- 30. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Amer Statist Assoc. 1995;90(429):122–129. MR1325119.
- 31. Rotnitzky A, Robins JM. Semi-parametric estimation of models for means and covariances in the presence of missing data. Scand J Statist. 1995;22(3):323–333. MR1363216.
- 32. Tsybakov AB. Introduction à l’estimation non-paramétrique. Mathématiques & Applications, Vol. 41. Springer-Verlag; Berlin: 2004. MR2013911.
- 33. von Mises R. On the asymptotic distribution of differentiable statistical functions. Ann Math Statistics. 1947;18:309–348. MR0022330.
- 34. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer Series in Statistics. Springer-Verlag; New York: 2003. MR1958123.
- 35. van der Vaart A. On differentiable functionals. Ann Statist. 1991;19(1):178–204. MR1091845.
- 36. van der Vaart A. Efficient maximum likelihood estimation in semiparametric mixture models. Ann Statist. 1996;24(2):862–878. doi:10.1214/aos/1032894470. MR1394993.
- 37. van der Vaart A. Higher order tangent spaces and influence functions. Statist Sci. 2014;29(4):679–686. doi:10.1214/14-STS478. MR3300365.
- 38. van der Vaart AW. Estimating a real parameter in a class of semiparametric models. Ann Statist. 1988a;16(4):1450–1474. doi:10.1214/aos/1176351048. MR964933.
- 39. van der Vaart AW. Statistical estimation in large parameter spaces. CWI Tract, Vol. 44. Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica; Amsterdam: 1988b. MR927725.
- 40. van der Vaart AW. Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Vol. 3. Cambridge University Press; Cambridge: 1998. MR1652247.
- 41. van der Vaart AW, Wellner JA. Weak convergence and empirical processes: with applications to statistics. Springer Series in Statistics. Springer-Verlag; New York: 1996. MR1385671.
- 42. van Zwet WR. A Berry-Esseen bound for symmetric statistics. Z Wahrsch Verw Gebiete. 1984;66(3):425–440. doi:10.1007/BF00533707. MR751580.
- 43. Waterman RP, Lindsay BG. Projected score methods for approximating conditional scores. Biometrika. 1996;83(1):1–13. doi:10.1093/biomet/83.1.1. MR1399151.