Published in final edited form as: Metrika. 2009;69(2–3):227–247. doi: 10.1007/s00184-008-0214-3

Quadratic semiparametric Von Mises calculus

James Robins 1, Lingling Li 1, Eric Tchetgen 1, Aad W van der Vaart 2

Abstract

We discuss a new method of estimation of parameters in semiparametric and nonparametric models. The method is based on U-statistics constructed from quadratic influence functions. The latter extend ordinary linear influence functions of the parameter of interest as defined in semiparametric theory, and represent second order derivatives of this parameter. For parameters for which the matching cannot be perfect the method leads to a bias-variance trade-off, and results in estimators that converge at a rate slower than $n^{-1/2}$. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at $n^{-1/2}$-rate.

Keywords: Von Mises calculus, Semiparametric models, Missing data, Tangent space, Influence function, Rate of convergence

1 Introduction

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution $P_\eta$ with density $p_\eta$ relative to a measure $\mu$ on a sample space $(\mathcal{X}, \mathcal{A})$, where the parameter $\eta$ is known to belong to a subset $H$ of a normed space. We wish to estimate the value $\chi(\eta)$ of a functional $\chi: H \to \mathbb{R}$ with the help of the observations $X_1, \ldots, X_n$. Our main interest is in the situation of a semiparametric or nonparametric model, where $H$ is an infinite-dimensional set, and the dependence $\eta \mapsto p_\eta$ is smooth.

This problem has been studied under the heading “semiparametric statistics” in the 1980s and 1990s. A theory of asymptotic lower bounds for “regular parameters” χ(η) based on Le Cam's concept of local asymptotic normality (Le Cam 1960) was developed starting with Koševnik and Levit (1976) and Pfanzagl (1982), and worked out for many examples in, among others, Begun et al. (1983), van der Vaart (1988) and Bickel et al. (1993). There are many examples of ad-hoc estimators that attain these bounds, and the behaviour of principled methods such as maximum likelihood (including its sieved and penalized variants) or estimating equations is understood to a certain extent (e.g., van der Vaart 1994; Murphy and van der Vaart 2000; Bolthausen et al. 2002; Wellner et al. 1993; van der Laan and Robins 2003).

Certain combinations of models $(P_\eta: \eta \in H)$ and parameters $\chi(\eta)$ possess structural properties that allow the parameter to be estimated at $n^{-1/2}$-rate, no matter the size of the parameter set $H$. In this paper we are interested in the other situations, where the rate of estimation drops when the complexity of the model exceeds a certain limit. Such examples arise for instance when many covariates must be included in a model to correct for possible confounding in a causal study, or for modelling the probability that an individual is included in a sample in a study with missing observations. If simple (e.g., linear) models for the dependence on these covariates are not plausible, which is typical in epidemiological studies, then the resulting model must be taken so large that the usual methods fail. These methods typically focus on variance only, because the bias is negligible due to the structure of the model, or by explicitly assuming a “no-bias condition” (see Klaassen 1987; Murphy and van der Vaart 2000). In this paper we develop new methods that make a bias-variance trade-off when necessary.

These methods are based on quadratic estimating equations rather than the usual linear estimating equations.

Quadratic expansions for semiparametric models were previously investigated by Pfanzagl and Wefelmeyer (Pfanzagl 1985), but from the very different perspective of second order efficiency, i.e., the refinement of first order bounds by adding a lower order term. Our aim is to show that second order influence functions can be used for first order inference, because they permit balancing of bias and variance.

Following linear and quadratic is cubic, and so on. Extension of our approach to still higher orders is possible, but comes with many new complications. We shall pursue this elsewhere.

The paper is organized as follows. In Sect. 2 we review linear estimators from our current perspective. Next in Sect. 3 we introduce our new method of constructing quadratic estimators. This section has mostly a heuristic nature. In Sects. 4 and 5 we give rigorous constructions and results for two examples. The first is a classical theoretical example. The second is more extensive and concerns estimating a mean response when the response is not always observed.

Notation Let $\mathbb{P}_n$ and $\mathbb{U}_n$ denote the empirical measure and the empirical U-statistic measure, viewed as operators on functions: for given functions $f: \mathcal{X} \to \mathbb{R}$ and $g: \mathcal{X}^2 \to \mathbb{R}$ these are given by

$\mathbb{P}_n f = \frac{1}{n} \sum_{i=1}^n f(X_i), \qquad \mathbb{U}_n g = \frac{1}{n(n-1)} \sum\sum_{1 \le i \ne j \le n} g(X_i, X_j).$

We use the notation $\mathbb{U}_n f$ also for a function $f: \mathcal{X} \to \mathbb{R}$ of one argument, with the interpretation $\mathbb{U}_n f = \mathbb{P}_n f$. This is consistent with the given formulas if a function of one argument is considered as a function of two arguments that is constant in its second argument.

We write $P^n \mathbb{U}_n g = P^2 g$ for the expectation of $\mathbb{U}_n g$ if $X_1, \ldots, X_n$ are distributed according to the probability measure $P$. We also use the operator notation for the expectations of statistics in general.

We call a measurable function $g: \mathcal{X}^2 \to \mathbb{R}$ degenerate relative to $P$ if $\int g(x_1, x_2)\,dP(x_i) = 0$ for $i = 1, 2$, and we call it symmetric if $g(x_1, x_2) = g(x_2, x_1)$ for every $x_1, x_2 \in \mathcal{X}$.

Given two functions $g, h: \mathcal{X} \to \mathbb{R}$ we write $g \times h$ for the function $(x, y) \mapsto g(x)h(y)$. Such tensor product functions are degenerate if both $g$ and $h$ have mean zero. The corresponding notation $P \times Q$ for two measures $P$ and $Q$ denotes the product measure.
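For concreteness, these two empirical operators are easy to compute. The following is a minimal numpy sketch (our illustration, not part of the paper), assuming vectorized functions f and g; note that $\mathbb{U}_n$ excludes the diagonal $i = j$.

```python
import numpy as np

def P_n(f, X):
    """Empirical measure: P_n f = (1/n) sum_i f(X_i)."""
    return np.mean(f(X))

def U_n(g, X):
    """Empirical U-statistic measure: U_n g = (n(n-1))^{-1} sum_{i != j} g(X_i, X_j)."""
    n = len(X)
    G = g(X[:, None], X[None, :])    # matrix of g(X_i, X_j) for a vectorized kernel g
    np.fill_diagonal(G, 0.0)         # the double sum excludes the diagonal i == j
    return G.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
X = rng.standard_normal(100)
print(P_n(lambda x: x**2, X))            # estimates E X^2 = 1
print(U_n(lambda x, y: x * y, X))        # degenerate kernel; estimates (E X)^2 = 0
```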

2 Linear estimator

Given an initial estimator $\hat\eta$ of $\eta$, the plug-in estimator $\chi(\hat\eta)$ is typically a consistent estimator of the parameter of interest $\chi(\eta)$, but it may not be a good estimator. In particular, if $\hat\eta$ is a general purpose estimator, not specially constructed to yield a good plug-in, then $\chi(\hat\eta)$ will often have suboptimal precision. To gain insight into this situation we assume that the parameter permits a Taylor expansion of the form

$\chi(\eta) = \chi(\hat\eta) + \chi'_{\hat\eta}(\eta - \hat\eta) + O(\|\eta - \hat\eta\|^2).$ (1)

Such an expansion suggests that the plug-in estimator will have an error of the order $O_P(\|\eta - \hat\eta\|)$, unless the linear term in the expansion vanishes.

The expansion (1) also suggests that better estimators can be obtained by “estimating” the linear term in the expansion. To achieve this we assume a “generalized von-Mises representation” of the derivative of the form

$\chi'_{\hat\eta}(\eta - \hat\eta) = \int \dot\chi_{\hat\eta}\,d(P_\eta - P_{\hat\eta}) = \int \dot\chi_{\hat\eta}\,dP_\eta + O(\|\eta - \hat\eta\|^2),$ (2)

for some measurable function $\dot\chi_{\hat\eta}: \mathcal{X} \to \mathbb{R}$, referred to as an influence function. The second equality is valid if $\dot\chi_\eta$ is degenerate relative to $P_\eta$ (i.e., $P_\eta \dot\chi_\eta = 0$) for every $\eta$, which can always be arranged by a recentering, as $\int 1\,d(P_\eta - P_{\hat\eta}) = 0$. The von-Mises representation and Eq. (1) suggest the “corrected plug-in estimator”

$\chi(\hat\eta) + \mathbb{P}_n \dot\chi_{\hat\eta}.$ (3)

This estimator should have an error of the order $O_P(n^{-1/2}) + O_P(\|\eta - \hat\eta\|^2)$, as the difference $(\mathbb{P}_n - P_\eta)\dot\chi_{\hat\eta}$ is “centered” and ought to have “variance” of the order $O(1/n)$.

We put “centered” and “variance” in quotes, because the randomness in the initial estimator $\hat\eta$ prevents a simple calculation of mean and variance. Empirical process theory can be used to show that the effect of replacing $\dot\chi_\eta$ by $\dot\chi_{\hat\eta}$ is negligible, if the class of functions $\dot\chi_\eta$ is not too rich. In the present paper we are interested in orders of magnitude only, and then a simpler approach is to split the sample and use separate observations to construct $\hat\eta$ and to construct $\mathbb{P}_n$. Then the orders can be justified by reasoning conditionally on the first sample, and it suffices that $\int \dot\chi_{\hat\eta}^2\,dP_\eta$ remains bounded in probability.
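Schematically, the sample-splitting construction of the corrected plug-in estimator (3) looks as follows. This is a sketch of the recipe only: `fit_eta`, `chi` and `chi_dot` are hypothetical placeholders for the model-specific ingredients.

```python
import numpy as np

def linear_onestep(X, fit_eta, chi, chi_dot, rng):
    """Sample-split corrected plug-in estimator (3): chi(eta_hat) + P_n chi_dot,
    with eta_hat fitted on one half of the data (a numpy array X) and the
    empirical average of the influence function taken over the other half."""
    n = len(X)
    idx = rng.permutation(n)
    train, evaluate = X[idx[: n // 2]], X[idx[n // 2:]]
    eta_hat = fit_eta(train)                          # initial estimator, first half
    correction = np.mean(chi_dot(eta_hat, evaluate))  # P_n of the influence function
    return chi(eta_hat) + correction
```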

Von Mises (1947) originally introduced the expansions that are named after him in order to investigate functionals of empirical distributions. The idea to use expansions (1) for estimation in nonparametric models occurs in Emery et al. (2000). Our situation is more involved, because we are interested in models $(P_\eta: \eta \in H)$ that are structured through a map $\eta \mapsto p_\eta$, and we are interested in a functional $\chi(\eta)$ of the parameter. In this situation a von-Mises type expansion can fail for two reasons. First, a derivative $\chi'_\eta$ is by definition a continuous, linear map on the underlying normed space, and such maps may or may not be representable as an integral, depending on the normed space. Second, our von Mises expansion (2) represents this derivative as an integral relative to the distribution $P_\eta$ and hence also involves the inverse map $P_\eta \mapsto \eta$ from the distribution of the data to the parameter. We require representation through $P_\eta$, because this allows us to construct the estimator (3) by replacing $P_\eta$ by the empirical distribution.

These issues are related to investigations in the theory of semiparametric models (see Koševnik and Levit 1976; Pfanzagl 1982; van der Vaart 1988; Bickel et al. 1993). These papers define a tangent set of a semiparametric model $(P_\eta: \eta \in H)$ as the set of functions $\dot g_\eta: \mathcal{X} \to \mathbb{R}$ obtainable as

$\tfrac{1}{2} \dot g_\eta \sqrt{p_\eta} = \lim_{t \downarrow 0} \frac{\sqrt{p_{\eta_t}} - \sqrt{p_\eta}}{t},$

where the limit is taken in the $L_2$-sense, and $t \mapsto \eta_t$ ranges over a collection of maps from $[0, 1] \subset \mathbb{R}$ to $H$ for which the limit exists. Informally, a “tangent vector” $\dot g_\eta$ is just a score function

$\dot g_\eta = \frac{\partial}{\partial t}\Big|_{t=0} \log p_{\eta_t} = \frac{\partial/\partial t\,\big|_{t=0}\, p_{\eta_t}}{p_\eta},$ (4)

of a one-dimensional submodel $(P_{\eta_t}: t \ge 0)$ at $t = 0$, where $\eta_0 = \eta$. (Taking the derivative in the $L_2$-sense is appropriate for asymptotic information theory, but not necessarily so for the present heuristic discussion.) An influence function is defined as a measurable map $\dot\chi_\eta: \mathcal{X} \to \mathbb{R}$ such that, for all paths $t \mapsto \eta_t$ considered,

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta \dot\chi_\eta \dot g_\eta.$ (5)

It is not difficult to see that the latter influence function is the same as the influence function needed in the von-Mises expansion (2), if the various types of derivatives match up. (Note that the middle expression in (2) with $\eta$ replaced by $\eta_t$ and $\hat\eta$ by $\eta$ expands to $t\,P_\eta \dot\chi_\eta \dot g_\eta + o(t)$, as $p_{\eta_t} - p_\eta = t\, \dot g_\eta p_\eta + o(t)$.) Necessary and sufficient conditions for existence of an influence function in terms of the derivatives of the maps $\eta \mapsto \chi(\eta)$ and $\eta \mapsto p_\eta$ were investigated in van der Vaart (1991).

An influence function is not necessarily unique, as only its inner products with elements $\dot g_\eta$ of the tangent set matter. The version of an influence function that is contained in the closed linear span of the tangent set (the projection of any influence function onto this span) is called the efficient influence function or canonical gradient, as it is the influence function of asymptotically efficient estimators. It minimizes the variance $\mathrm{var}_\eta\, \mathbb{P}_n \dot\chi_\eta$ over all influence functions.

3 Quadratic estimator

If the preliminary estimator $\hat\eta$ attains a rate of convergence $\|\hat\eta - \eta\| = o_P(n^{-1/4})$, then the plug-in estimator (3) attains a $n^{-1/2}$-rate of convergence. Typically this will require that the parameter set $H$ is not too big. If the preliminary estimator is less precise, then the remainder term of the expansion (1) will dominate. This suggests taking the expansion one step further:

$\chi(\eta) = \chi(\hat\eta) + \chi'_{\hat\eta}(\eta - \hat\eta) + \tfrac{1}{2}\chi''_{\hat\eta}(\eta - \hat\eta, \eta - \hat\eta) + O(\|\eta - \hat\eta\|^3).$ (6)

The generalization of the first order construction now requires a von Mises type representation of the form, for measurable functions $\dot\chi_\eta: \mathcal{X} \to \mathbb{R}$ and $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$,

$\chi'_{\hat\eta}(\eta - \hat\eta) + \tfrac{1}{2}\chi''_{\hat\eta}(\eta - \hat\eta, \eta - \hat\eta) = \int \dot\chi_{\hat\eta}\,d(P_\eta - P_{\hat\eta}) + \tfrac{1}{2}\int \ddot\chi_{\hat\eta}\,d(P_\eta - P_{\hat\eta}) \times (P_\eta - P_{\hat\eta}) + O(\|\eta - \hat\eta\|^3).$ (7)

We assume without loss of generality that the functions $\dot\chi_\eta$ and $\ddot\chi_\eta$ are degenerate relative to $P_\eta$. The von-Mises representation then suggests the “corrected plug-in estimator”

$\chi(\hat\eta) + \mathbb{P}_n \dot\chi_{\hat\eta} + \tfrac{1}{2}\mathbb{U}_n \ddot\chi_{\hat\eta}.$ (8)

The empirical measure and the U-statistic measure serve as unbiased estimators of the expectations of their kernels. For simplicity we may again base the initial estimator $\hat\eta$ and these two empirical averages on independent samples of observations. Because the variance of a U-statistic is of order $O(1/n)$, this estimator ought to have an error of the order $O_P(n^{-1/2}) + O_P(\|\eta - \hat\eta\|^3)$. We shall discuss the validity of this later.
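In the same schematic vein, the quadratic estimator (8) adds one half times a U-statistic in the estimated second order influence function. Again `chi`, `chi_dot`, `chi_ddot` and `eta_hat` are hypothetical placeholders (`chi_ddot` assumed vectorized over pairs), and `eta_hat` is assumed fitted on an independent sample.

```python
import numpy as np

def quadratic_estimator(X, eta_hat, chi, chi_dot, chi_ddot):
    """Corrected plug-in estimator (8):
    chi(eta_hat) + P_n chi_dot_{eta_hat} + (1/2) U_n chi_ddot_{eta_hat}."""
    n = len(X)
    linear = np.mean(chi_dot(eta_hat, X))              # P_n of the first order term
    K = chi_ddot(eta_hat, X[:, None], X[None, :])      # kernel values at all pairs
    np.fill_diagonal(K, 0.0)                           # U_n excludes the diagonal
    quadratic = K.sum() / (n * (n - 1))
    return chi(eta_hat) + linear + 0.5 * quadratic
```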

To characterize the first and second order influence functions we can again employ smooth one-dimensional submodels $(P_{\eta_t}: t \ge 0)$. With the first and second order derivatives of these models denoted by

$\dot g_\eta = \frac{\partial/\partial t\,\big|_{t=0}\, p_{\eta_t}}{p_\eta}, \qquad \ddot g_\eta = \frac{\partial^2/\partial t^2\,\big|_{t=0}\, p_{\eta_t}}{p_\eta},$ (9)

the von Mises expansion (7) can informally be seen to imply

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta \dot\chi_\eta \dot g_\eta,$ (10)
$\frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = P_\eta \dot\chi_\eta \ddot g_\eta + P_\eta^2\, \ddot\chi_\eta (\dot g_\eta \times \dot g_\eta).$ (11)

Equation (10) is identical to Eq. (5), and hence a first order influence function $\dot\chi_\eta$ can be taken as before. Following Pfanzagl (1985) we define a second order influence function as a measurable function $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$ that satisfies (11) for every path $t \mapsto \eta_t$ under consideration. From Eq. (11) we see that $\ddot\chi_\eta$ is unique only up to functions that are orthogonal to functions of the form $\dot g_\eta \times \dot g_\eta$, for $\dot g_\eta$ belonging to the tangent set. In particular, a second order influence function $\ddot\chi_\eta$ can always be taken to be symmetric and degenerate relative to $P_\eta$. It must be taken so in the construction of the estimator (8).

The two influence functions occur together in Eq. (11), and hence should be considered as a pair $(\dot\chi_\eta, \ddot\chi_\eta)$ of functions rather than as two separate functions. This is particularly important if the tangent set is not “full”, i.e., smaller than the set of all mean-zero functions in $L_2(P_\eta)$, the tangent set of a nonparametric model. Both first and second order influence functions are then non-unique, but their different versions cannot be freely combined into valid pairs $(\dot\chi_\eta, \ddot\chi_\eta)$. This is connected to the fact that first and second order derivatives $\dot g_\eta$ and $\ddot g_\eta$ are also not clearly separated. A simple change of speed $t \mapsto \phi(t)$ of a path through a second order diffeomorphism $\phi: [0, 1] \to [0, 1]$ leads to the submodel $(P_{\eta_{\phi(t)}}: t \ge 0)$ with first and second order derivatives, by the chain rule,

$\phi'(0)\, \dot g_\eta, \qquad \phi'(0)^2\, \ddot g_\eta + \phi''(0)\, \dot g_\eta.$

Thus the first order derivative becomes part of the second order derivative after reparameterization. Pfanzagl (Pfanzagl 1985, 2.4.4) has shown, under assumptions of smoothness of the tangent set as a function of the parameter, that the sum of every first order derivative and every second order derivative occurs as the second order derivative of some path. Thus the set of second order derivatives $\ddot g_\eta$ is only defined up to equivalence modulo the tangent set.

From Eq. (11) it is also clear that second order influence functions involve the joint distribution of two observations. Correspondingly, we prefer to define a second order tangent space of the model not through the second order derivatives $\ddot g_\eta$ along paths $t \mapsto p_{\eta_t}$, but through the functions of two arguments

$\ddot s_\eta := \frac{\partial^2/\partial t^2\,\big|_{t=0}\, (p_{\eta_t} \times p_{\eta_t})}{p_\eta \times p_\eta} = \ddot g_\eta \times 1 + 2\, \dot g_\eta \times \dot g_\eta + 1 \times \ddot g_\eta.$ (12)

The function $\ddot s_\eta$ is a second order score for the model $(P_\eta \times P_\eta: \eta \in H)$ for two observations. The corresponding first order scores are

$\dot s_\eta := \frac{\partial/\partial t\,\big|_{t=0}\, (p_{\eta_t} \times p_{\eta_t})}{p_\eta \times p_\eta} = \dot g_\eta \times 1 + 1 \times \dot g_\eta.$ (13)

With these notations the Eqs. (10), (11) defining the influence functions can also be written as follows, if $\ddot\chi_\eta$ is chosen degenerate:

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta^2 \big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big)\, \dot s_\eta = \frac{d}{dt}\Big|_{t=0} P_{\eta_t}^2 \big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big),$ (14)
$\frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = P_\eta^2 \big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big)\, \ddot s_\eta = \frac{d^2}{dt^2}\Big|_{t=0} P_{\eta_t}^2 \big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big).$ (15)

Here we interpret the function $\dot\chi_\eta: \mathcal{X} \to \mathbb{R}$ as a function $\dot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$ that depends on the first argument only (and is constant in the second), or (better) replace it by its symmetrization $(x_1, x_2) \mapsto \tfrac{1}{2}\big(\dot\chi_\eta(x_1) + \dot\chi_\eta(x_2)\big)$. The equations show that the overall influence function $\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta$ is characterized by having “correct” inner products with the overall scores $\dot s_\eta$ and $\ddot s_\eta$. This overall influence function uniquely defines its constituents $\dot\chi_\eta$ and $\tfrac{1}{2}\ddot\chi_\eta$ provided $\ddot\chi_\eta$ is restricted to be degenerate. The overall influence function is itself unique only up to projection onto the closed linear span in $L_2(P_\eta \times P_\eta)$ of all functions $\dot s_\eta$ and $\ddot s_\eta$.

The equality of the far left and right sides of Eqs. (14), (15) gives an alternative characterization of the overall influence function (at $\eta_0$) as a function such that the maps $\eta \mapsto \chi(\eta)$ and $\eta \mapsto P_\eta^2(\dot\chi_{\eta_0} + \tfrac{1}{2}\ddot\chi_{\eta_0})$ possess the same first and second order derivatives at $\eta_0$. Because the derivatives of a map $\phi$ on an open subset $H$ of a normed space are completely characterized by the derivatives of the maps $t \mapsto \phi(\eta_0 + th)$, for $h$ ranging over the space (“Gateaux derivatives”), we conclude that in the case of such parameter sets $H$ it suffices to consider linear paths $t \mapsto \eta_t = \eta_0 + th$. (The mixed second derivative $\phi''_{\eta_0}(g, h)$ can be recovered from $\phi''_{\eta_0}(g + h, g + h)$ and $\phi''_{\eta_0}(g - h, g - h)$ by “polarization”.) This is true more generally for parameter spaces $H$ defined by a linear constraint, but in the case of nonlinear constraints the use of curved paths is necessary.

The plug-in estimator (8) can be written $\chi(\hat\eta) + \mathbb{U}_n(\dot\chi_{\hat\eta} + \tfrac{1}{2}\ddot\chi_{\hat\eta})$. A definition of an efficient or canonical second order influence function should therefore refer to the variance of the U-statistic $\mathbb{U}_n(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta)$. Unlike in the linear case this does not translate into the variance of the influence function $\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta$ itself (except for $n = 2$ if $\dot\chi_\eta$ is interpreted as the symmetric function $(x_1, x_2) \mapsto \tfrac{1}{2}(\dot\chi_\eta(x_1) + \dot\chi_\eta(x_2))$). By Lemma 5, if $\ddot\chi_\eta$ is chosen degenerate and symmetric,

$n\,\mathrm{var}_\eta\, \mathbb{U}_n\big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big) = P_\eta \dot\chi_\eta^2 + \frac{1}{2(n-1)} P_\eta^2 \ddot\chi_\eta^2.$

Thus the second order part adds a term of order $O(1/n)$ relative to the first order contribution. The norm of the function $\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta$ in $L_2(P_\eta \times P_\eta)$ is irrelevant, even though the inner product of this space determines the influence functions.

It is possible to resolve this discrepancy by working in the model with n observations. From the expansion

$\prod_{i=1}^n \frac{p_{\eta_t}}{p_\eta}(x_i) = \prod_{i=1}^n \Big(1 + t\,\dot g_\eta(x_i) + \tfrac{1}{2}t^2 \ddot g_\eta(x_i) + \cdots\Big) = 1 + t \sum_{i=1}^n \dot g_\eta(x_i) + t^2 \Big(\tfrac{1}{2}\sum_{i=1}^n \ddot g_\eta(x_i) + \sum\sum_{1 \le i < j \le n} \dot g_\eta(x_i)\,\dot g_\eta(x_j)\Big) + \cdots,$

we see that first and second order scores for the model $(P_\eta^n: \eta \in H)$ take the forms

$\dot s_\eta^{(n)} = \frac{\partial/\partial t\,\big|_{t=0}\,(p_{\eta_t} \times \cdots \times p_{\eta_t})}{p_\eta \times \cdots \times p_\eta} = n \mathbb{P}_n \dot g_\eta,$ (16)
$\ddot s_\eta^{(n)} = \frac{\partial^2/\partial t^2\,\big|_{t=0}\,(p_{\eta_t} \times \cdots \times p_{\eta_t})}{p_\eta \times \cdots \times p_\eta} = n \mathbb{P}_n \ddot g_\eta + n(n-1)\,\mathbb{U}_n(\dot g_\eta \times \dot g_\eta).$ (17)

Rather than in the form Eqs. (14), (15), the Eqs. (10), (11) that define the influence functions can then be written in the form

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta^n \big(\mathbb{U}_n(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta)\big)\, \dot s_\eta^{(n)},$ (18)
$\frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = P_\eta^n \big(\mathbb{U}_n(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta)\big)\, \ddot s_\eta^{(n)}.$ (19)

We conclude that the influence functions are determined by the inner products of the U-statistic $\mathbb{U}_n(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta)$ in $L_2(P_\eta^n)$ with the score functions $\dot s_\eta^{(n)}$ and $\ddot s_\eta^{(n)}$. The influence functions that yield a minimal variance are found by projecting this U-statistic onto the closed linear span of these score functions. Thus it is natural to define the latter span as the second order tangent space of the model.

For computation in examples the defining Eq. (11) or (15) of a second order influence function can be tedious to work with. It is usually easier to apply the rule that a second derivative is the derivative of the first derivative. In the present situation this takes the following form (Pfanzagl 1985, 4.3.11): if $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$ is a function such that $x_2 \mapsto \ddot\chi_\eta(x_1, x_2)$ is a first order influence function of the parameter $\eta \mapsto \dot\chi_\eta(x_1)$, for every fixed $x_1$ and a first order influence function $\dot\chi_\eta$ (not necessarily degenerate), then $\ddot\chi_\eta$ is a second order influence function.

Lemma 1 Suppose that $(P_{\eta_t}: t \ge 0)$ is a sufficiently smooth submodel and $\dot\chi_{\eta_t}: \mathcal{X} \to \mathbb{R}$ and $\ddot\chi_{\eta_t}: \mathcal{X}^2 \to \mathbb{R}$ are measurable functions that satisfy

$\frac{d}{dt}\chi(\eta_t) = \int \dot\chi_{\eta_t}\, \frac{d}{dt} p_{\eta_t}\,d\mu, \quad (t \ge 0), \qquad \frac{d}{dt}\Big|_{t=0} \dot\chi_{\eta_t}(x_1) = \int \ddot\chi_\eta(x_1, x_2)\, \dot g_\eta(x_2)\,dP_\eta(x_2), \quad (x_1 \in \mathcal{X}).$

Then the function $\ddot\chi_\eta$ is a second order influence function, and so is the symmetrization of its orthogonal projection onto the degenerate functions in $L_2(P_\eta \times P_\eta)$.

Proof By differentiation of the first identity (under the integral) we see that

$\frac{d^2}{dt^2}\chi(\eta_t) = \int \frac{d}{dt}\dot\chi_{\eta_t}\, \frac{d}{dt} p_{\eta_t}\,d\mu + \int \dot\chi_{\eta_t}\, \frac{d^2}{dt^2} p_{\eta_t}\,d\mu.$

We evaluate this at $t = 0$ and substitute the second identity in the first term on the right to arrive at Eq. (11). The equation remains valid if $\ddot\chi_\eta$ is replaced by its projection and symmetrization.

Just as for first order influence functions there is no guarantee that a second order influence function exists. The difference is that, for the examples we are interested in, nonexistence of a second order influence function is typical. A first indication that this might happen is that the informal conclusion reached above, that the quadratic estimator (8) will have an error of the order $O_P(n^{-1/2}) + O_P(\|\eta - \hat\eta\|^3)$, is overly optimistic. In comparison to the linear estimator (3), this estimator would have reduced the dependence on the preliminary estimator $\hat\eta$ from $O_P(\|\eta - \hat\eta\|^2)$ to $O_P(\|\eta - \hat\eta\|^3)$, apparently without a serious penalty on the variance of the estimator. In our examples this does not occur, simply because a second order influence function does not exist.

As for the first order influence function, the nonexistence of the second order influence function may be caused by a lack of invertibility of the map $\eta \mapsto p_\eta$ or by failure of a von Mises type representation. The invertibility is again necessary, because we need representation of the derivatives of $\eta \mapsto \chi(\eta)$ in terms of the distribution $P_\eta$ of the observation. This is similar to the linear situation. The second cause for failure of representation also arose in the linear situation, but appears to arise in a much more serious way at the second order. Whereas a continuous, linear map $B: L_2(P_\eta) \to \mathbb{R}$ is always representable as an inner product $B(g) = P_\eta g \dot\chi_\eta$ for some function $\dot\chi_\eta$, a continuous, bilinear map $B: L_2(P_\eta) \times L_2(P_\eta) \to \mathbb{R}$ is not necessarily representable through a measurable function $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$, in the form

$B(g, h) = \int\int g(x_1)\, \ddot\chi_\eta(x_1, x_2)\, h(x_2)\,dP_\eta(x_1)\,dP_\eta(x_2).$ (20)

It can be shown that a continuous, bilinear map can always be written in the form $B(g, h) = \int g (Ah)\,dP_\eta$ for a continuous, linear operator $A: L_2(P_\eta) \to L_2(P_\eta)$, but the latter operator is not necessarily a kernel operator in that $Ah(x_1) = \int \ddot\chi_\eta(x_1, x_2)\, h(x_2)\,dP_\eta(x_2)$ for some kernel $\ddot\chi_\eta$. The latter representation is necessary for the von-Mises representation (7) of the second derivative.

Failure of existence of $\ddot\chi_\eta$ does not mean that the idea to use a quadratic expansion for improved estimation is not fruitful. Failure does mean that we cannot construct the estimator (8) and the estimation rate $O_P(n^{-1/2}) + O_P(\|\hat\eta - \eta\|^3)$ may not be attainable. However, we may return to Eq. (6) and try to estimate the quadratic term as well as possible, and still improve on the linear estimator. A key observation is that a bilinear map on a finite-dimensional subspace $L \times L \subset L_2(P_\eta) \times L_2(P_\eta)$ is always representable by a kernel.

Lemma 2 If $L \subset L_2(P_\eta)$ is a finite-dimensional subspace and $B: L \times L \to \mathbb{R}$ is continuous and bilinear, then there exists a function $\ddot\chi_\eta \in L_2(P_\eta \times P_\eta)$ such that (20) holds for every $g, h \in L$.

Proof For an arbitrary orthonormal basis $e_1, \ldots, e_k$ of $L$ we can express an element $g \in L$ as $g = \sum_{i=1}^k \langle g, e_i\rangle_\eta\, e_i$, for $\langle\cdot, \cdot\rangle_\eta$ the inner product of $L_2(P_\eta)$. By bilinearity

$B(g, h) = \sum_{i=1}^k \sum_{j=1}^k \langle g, e_i\rangle_\eta \langle h, e_j\rangle_\eta\, B(e_i, e_j) = \int\int g(x_1)\, h(x_2) \sum_{i=1}^k \sum_{j=1}^k B(e_i, e_j)\, e_i(x_1)\, e_j(x_2)\,dP_\eta(x_1)\,dP_\eta(x_2).$

Thus the function $(x_1, x_2) \mapsto \sum_{i=1}^k \sum_{j=1}^k B(e_i, e_j)\, e_i(x_1)\, e_j(x_2)$ is a kernel for the map $B$.

If the invertibility of $\eta \mapsto p_\eta$ can be resolved, we can therefore always represent the second derivative in Eq. (6) at differences $\eta - \hat\eta$ within a given finite-dimensional linear space. The estimator (8) based on the resulting “partial second order influence function” will then add a representation error to the remainder $O_P(\|\hat\eta - \eta\|^3)$. This representation error can be made arbitrarily small by choosing the finite-dimensional linear space sufficiently large. However, the corresponding partial influence functions depend on the approximating linear spaces, the estimator now having the form

$\chi(\hat\eta) + \mathbb{P}_n \dot\chi_{\hat\eta} + \tfrac{1}{2}\mathbb{U}_n \ddot\chi_{L,\hat\eta},$ (21)

where $\ddot\chi_{L,\eta}$ is a partial second order influence function based on an approximating space $L$. To obtain a good estimator we must balance the representation error, the remainder $O(\|\hat\eta - \eta\|^3)$, and the variance of the estimator. In an asymptotic framework we let the approximating space $L$ increase to the full space as $n \to \infty$. We shall see that this may cause the variance of $\mathbb{U}_n \ddot\chi_{L,\hat\eta}$ to dominate the variance of the linear term $\mathbb{P}_n \dot\chi_{\hat\eta}$, so that the overall variance may be bigger than $O(1/n)$. However, by proper balancing of the three terms we never do worse than the linear estimator (3), and we gain over it if the parameter set $H$ is large.

4 Estimating the square of a density

Consider the problem of estimating the functional $\chi(p) = \int p^2$ based on a random sample of size $n$ from the density $p$. This problem was discussed in, among others, Bickel and Ritov (1988) and Laurent (1996, 1997). We shall rederive the estimator of Laurent (1996) through our general approach.

As the underlying model $\mathcal{P}$ we use a set of densities that is restricted only qualitatively, for instance a Hölder space of functions on the unit cube in $\mathbb{R}^d$. We parameterize this model by the density itself, which we denote by $p$ (hence $p_\eta = \eta = p$). The tangent space of the model can then be taken equal to the set of all mean zero functions $\dot g_p: \mathcal{X} \to \mathbb{R}$ in $L_2(P)$, and the first order influence function takes the form

$\dot\chi_p(x) = 2\big(p(x) - \chi(p)\big).$ (22)

To see this, it suffices to note that this function is mean-zero (i.e., degenerate) and satisfies

$\frac{d}{dt}\Big|_{t=0} \chi(p_t) = 2\int p_t\, \dot p_t\,d\mu\Big|_{t=0} = P\, 2p\, \dot g_p = P \dot\chi_p \dot g_p,$

for any sufficiently regular path $t \mapsto p_t$ with $p_0 = p$ and score function $\dot g_p = \dot p_0/p_0$ at $t = 0$. This first order influence function exists without making assumptions on $p$ or $\mathcal{P}$.

We compute a second order influence function as the influence function of the functional $p \mapsto 2p(x_1)$, which is the first order influence function up to centering. This entails point evaluation at a fixed point $x_1$, which, unfortunately, is not a differentiable functional in the sense of possessing an influence function. For any sufficiently regular path $t \mapsto p_t$ with score function $\dot g_p$,

$\frac{d}{dt}\Big|_{t=0} p_t(x_1) = \dot g_p(x_1)\, p(x_1).$

Existence of an influence function of the functional $p \mapsto p(x_1)$ would require the map $g \mapsto g(x_1)p(x_1)$ to be representable as an inner product in $L_2(P)$ on the tangent space. Such a representation is not possible (unless $p$ has finite support), because the map is not continuous relative to the $L_2(P)$-norm.

Thus we content ourselves with a partial representation of the second derivative. To this aim it is useful to think of the point evaluation map as integration against the Dirac measure (at $x_1$). Full representation of the functional $g \mapsto g(x_1)p(x_1)$ would be possible if there existed a function $\Pi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$g(x_1)\, p(x_1) = \int \Pi(x_1, x_2)\, g(x_2)\,d\mu(x_2).$ (23)

If this were true for every function $g$, then the measure $B \mapsto \int_B \Pi(x_1, x_2)\,d\mu(x_2)$ would, for each fixed $x_1$, act as a Dirac measure at $x_1$. In other words, the desired, but nonexistent, function $\Pi$ would be a “Dirac measure” on the diagonal of $\mathcal{X} \times \mathcal{X}$. Our second best is a function for which Eq. (23) is true, if not for all, then for a large collection of $g$. The kernel $\Pi$ of a projection operator $\Pi: L_2(\mu) \to L_2(\mu)$ onto a (large) subspace is a candidate, because it satisfies the display whenever $gp$ is in the subspace: if $\Pi f(x_1) = \int \Pi(x_1, x_2)\, f(x_2)\,d\mu(x_2)$, then the equation $gp = \Pi(gp)$, which is valid for every $gp$ in the range of the projection, gives the preceding display.

Lemma 3 An orthogonal projection $\Pi: L_2(\mu) \to L \subset L_2(\mu)$ onto a finite-dimensional subspace $L$ can be represented as $\Pi f(x_1) = \int \Pi(x_1, x_2)\, f(x_2)\,d\mu(x_2)$ for the kernel function $\Pi(x_1, x_2) = \sum_{i=1}^k e_i(x_1)\, e_i(x_2)$ and $e_1, \ldots, e_k$ an orthonormal basis of $L$. This kernel satisfies $\int \Pi^2\,d\mu \times \mu = k$.

Proof We have $\Pi f(x_1) = \sum_{i=1}^k \langle f, e_i\rangle_\mu\, e_i(x_1)$ for $\langle f, e_i\rangle_\mu = \int f e_i\,d\mu$. The representation follows by exchanging the order of summation and integration.

The square of the kernel is $\sum_{i=1}^k \sum_{j=1}^k e_i(x_1)\, e_j(x_1)\, e_i(x_2)\, e_j(x_2)$. By the orthonormality of the basis $(e_i)$ the (double) integral of each off-diagonal term ($i \ne j$) vanishes and the double integral of each diagonal term is equal to 1. Thus the double integral is $k$.
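As a numerical illustration of Lemma 3 (with the cosine basis on [0, 1], our arbitrary choice), the kernel and the identity $\int \Pi^2\,d\mu \times \mu = k$ can be checked directly:

```python
import numpy as np

def e(i, x):
    """Cosine basis on [0, 1], orthonormal in L2(Lebesgue)."""
    return np.ones_like(x) if i == 0 else np.sqrt(2) * np.cos(i * np.pi * x)

def Pi(x1, x2, k):
    """Projection kernel of Lemma 3: Pi(x1, x2) = sum_{i<k} e_i(x1) e_i(x2)."""
    return sum(e(i, x1) * e(i, x2) for i in range(k))

k = 5
grid = (np.arange(400) + 0.5) / 400          # midpoint quadrature rule on [0, 1]
G = Pi(grid[:, None], grid[None, :], k)      # kernel evaluated on the grid
print((G ** 2).mean())                       # double integral of Pi^2: approximately k
```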

We also arrive at a projection operator from the formula $\chi''_p(g, h) = 2\int g h\, p^2\,d\mu$ for the second derivative of $\chi$. We can write this in the form $\chi''_p(g, h) = 2\int g (A_p h)\,dP$ for the operator $A_p: L_2(P) \to L_2(P)$ given by $A_p h = hp$. The operator $A_p$ is not of kernel form, but we can approximate it by $\Pi A_p$, leading to the approximation $2\int g(\Pi A_p h)\,dP = 2\int gp\, \Pi(hp)\,d\mu$ for $\chi''_p(g, h)$.

For a given orthonormal basis $e_1, e_2, \ldots$ of $L_2(\mu)$ we take the kernel $\Pi(x_1, x_2)$ of the projection onto the span of the first $k$ elements, given by Lemma 3, as a “partial” influence function of the functional $p \mapsto p(x_1)$, and $x_2 \mapsto 2\Pi(x_1, x_2)$ as a “partial” influence function of the functional $p \mapsto \dot\chi_p(x_1)$. The projection of this function onto the degenerate functions is

$\ddot\chi_p(x_1, x_2) = 2\Pi(x_1, x_2) - 2\Pi p(x_1) - 2\Pi p(x_2) + 2\int (\Pi p)^2\,d\mu.$ (24)

The quadratic estimator (8), given the initial estimator $\hat p$, takes the form

$\chi(\hat p) + \mathbb{P}_n \dot\chi_{\hat p} + \tfrac{1}{2}\mathbb{U}_n \ddot\chi_{\hat p} = \mathbb{U}_n \Pi + 2\mathbb{P}_n\big((I - \Pi)\hat p\big) - \int \big((I - \Pi)\hat p\big)^2\,d\mu = \mathbb{U}_n\Big(\sum_{i=1}^k e_i \times e_i\Big) + 2\mathbb{P}_n\Big(\sum_{i=k+1}^\infty \hat\theta_i e_i\Big) - \sum_{i=k+1}^\infty \hat\theta_i^2,$

for $\hat\theta_i = \int \hat p\, e_i\,d\mu$ the Fourier coefficients of $\hat p$. If we choose the initial estimator to take its values in the range of $\Pi$, then $\hat\theta_i = 0$ for $i > k$ and all terms after the first vanish. The resulting estimator reduces to the estimator considered by Laurent (1996, 1997), who showed that the estimator is minimax if $p$ is a-priori known to belong to a multiple of the unit ball in the Hölder space $C^\beta[0, 1]$ of regularity $\beta$ and $(e_i)$ is a basis suited to this a-priori model. In fact, mean and variance of the estimator satisfy, with $\theta_i$ the Fourier coefficients of $p$,

$E_p\, \mathbb{U}_n\Big(\sum_{i=1}^k e_i \times e_i\Big) = P^2 \sum_{i=1}^k e_i \times e_i = \sum_{i=1}^k \theta_i^2, \qquad \mathrm{var}_p\, \mathbb{U}_n\Big(\sum_{i=1}^k e_i \times e_i\Big) \lesssim \frac{4}{n} P(\Pi p)^2 + \frac{2k}{n(n-1)}.$

The bound on the variance follows from Lemma 6 (combined with Lemma 3 and the boundedness of $p$). If it is a-priori known that $\sum_{i=1}^\infty \theta_i^2 i^{2\beta} < \infty$, then the bias is bounded by $\sum_{i > k} \theta_i^2 \lesssim k^{-2\beta}$. The square bias is balanced against the variance if $k$ is chosen of the order $k_n = n^{2/(4\beta+1)}$ if $\beta \le 1/4$ and $k_n = n$ if $\beta \ge 1/4$. The resulting rate of convergence is $n^{-4\beta/(4\beta+1)}$ if $\beta \le 1/4$ and $n^{-1/2}$ if $\beta \ge 1/4$. In Robins et al. (2007) it is shown that the estimator is also asymptotically normal.
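The reduced estimator $\mathbb{U}_n(\sum_{i \le k} e_i \times e_i)$ is simple to implement. The following self-contained simulation (our illustration; the cosine basis and the Beta(2, 3) sampling density are arbitrary choices) uses the identity $n(n-1)\,\mathbb{U}_n(e \times e) = (\sum_j e(X_j))^2 - \sum_j e(X_j)^2$ for each basis function:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(1)
n, k = 2000, 50
X = rng.beta(2, 3, size=n)          # sample from p(x) = 12 x (1-x)^2 on [0, 1]

def e(i, x):                        # orthonormal cosine basis of L2[0, 1]
    return np.ones_like(x) if i == 0 else np.sqrt(2) * np.cos(i * np.pi * x)

est = 0.0
for i in range(k):                  # U_n(sum_{i<=k} e_i x e_i), term by term
    v = e(i, X)
    est += (v.sum() ** 2 - (v ** 2).sum()) / (n * (n - 1))

truth = 144 * factorial(2) * factorial(4) / factorial(7)   # int p^2 = 144 B(3, 5)
print(est, "vs truth", truth)
```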

5 Estimating the mean response in missing data models

Suppose that a typical observation is distributed as $X = (YA, A, Z)$ for $Y$ and $A$ taking values in the two-point set $\{0, 1\}$ and conditionally independent given $Z$. We think of $Y$ as a response variable, which is observed only if the indicator $A$ takes the value 1. The covariate $Z$ is chosen such that it contains all information on the dependence between response and missingness indicator (missing at random). Alternatively, we think of $Y$ as a counterfactual outcome if a treatment were given ($A = 1$) and estimate (half) the treatment effect under the assumption of “no unmeasured confounders”. Both applications may require that $Z$ is high-dimensional (e.g., of dimension 10), and there is typically insufficient a-priori information to model the dependence of $A$ and $Y$ on $Z$.

The model can be parameterized by the marginal density $f$ of $Z$ (relative to some dominating measure $\nu$) and the probabilities $b(z) = P(Y = 1 \mid Z = z)$ and $a(z)^{-1} = P(A = 1 \mid Z = z)$. (We use $a$ for the inverse probability, because this simplifies later formulas.) Thus the density $p_\eta$ of an observation $X$ is described by the triple $\eta = (a, b, f)$.

We wish to estimate the mean response $EY$, i.e., the parameter

$\chi(\eta) = \int b f\,d\nu.$

Estimators that are $n^{-1/2}$-consistent and asymptotically efficient in the semiparametric sense have been constructed using a variety of methods (e.g., Robins and Rotnitzky 1992; van der Laan and Robins 2003; van der Vaart 1998), but only if $a$ or $b$, or both, parameters are restricted to sufficiently small regularity classes. For instance, if the covariate ranges over a compact, convex subset $\mathcal{Z}$ of $\mathbb{R}^d$, then the mentioned papers provide $n^{-1/2}$-consistent estimators under the assumption that $a$ and $b$ belong to Hölder classes $C^\alpha(\mathcal{Z})$ and $C^\beta(\mathcal{Z})$ with $\alpha$ and $\beta$ large enough that

$\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} \ge \frac{1}{2}.$ (25)

For moderate to large dimensions $d$ this is a restrictive requirement. We shall show that a quadratic estimator of the type (8) can attain a $n^{-1/2}$-rate in a bigger model, and attains a strictly better rate than the usual estimators if the $n^{-1/2}$-rate is not obtainable.

Preliminary estimators

The parameter $1/a(z) = E(A \mid Z = z)$ is the regression of $A$ on $Z$ and hence can be estimated by any nonparametric regression estimator, such as a kernel or a truncated series estimator. Similarly, the function $b(z) = P(Y = 1 \mid Z = z, A = 1)$ is the regression of the observed $Y$ on $Z$ and can be estimated by nonparametric regression of the subsample $(Y_i: A_i = 1)$ on the corresponding $Z_i$. We shall see below that the parameter $f/a$ is more fundamental than the parameter $f$. By Bayes’ rule $(f/a)(z) = P(A = 1 \mid Z = z)\, f(z)$ is $P(A = 1)$ times the conditional density of $Z$ given $A = 1$. Therefore, we may estimate $f/a$ by a nonparametric density estimator based on the subsample $(Z_i: A_i = 1)$ times $n^{-1}\sum_{i=1}^n A_i$.
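For instance (a sketch under the simplifying assumptions of a scalar covariate $Z$ in [0, 1] and a cosine series; any nonparametric regression method would do), truncated-series regression yields such preliminary estimators:

```python
import numpy as np

def cos_basis(i, z):
    return np.ones_like(z) if i == 0 else np.sqrt(2) * np.cos(i * np.pi * z)

def series_regression(Z, T, k):
    """Truncated cosine-series estimate of the regression E(T | Z = z),
    for scalar Z in [0, 1]; returns the fit as a callable function."""
    E = np.stack([cos_basis(i, Z) for i in range(k)], axis=1)
    coef, *_ = np.linalg.lstsq(E, T, rcond=None)
    return lambda z: np.stack([cos_basis(i, z) for i in range(k)], axis=1) @ coef

# 1/a: regression of A on Z;  b: regression of Y on Z within the subsample A == 1
# inv_a_hat = series_regression(Z, A, k)
# b_hat     = series_regression(Z[A == 1], Y[A == 1], k)
```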

Tangent space and first order influence function

The one-dimensional submodels $t \mapsto p_{\eta_t}$ induced by paths of the form $a_t = a + t\alpha$, $b_t = b + t\beta$, and $f_t = f(1 + t\phi)$ for given directions $\alpha$, $\beta$ and $\phi$ yield scores $B_\eta(\alpha, \beta, \phi) = B_\eta^a \alpha + B_\eta^b \beta + B_\eta^f \phi$, for $B_\eta^a$, $B_\eta^b$, $B_\eta^f$ the score operators for the three parameters, given by

$B_\eta^a \alpha(X) = -\frac{A a(Z) - 1}{a(Z)(a - 1)(Z)}\,\alpha(Z), \quad a\text{-score}, \qquad B_\eta^b \beta(X) = \frac{A(Y - b(Z))}{b(Z)(1 - b)(Z)}\,\beta(Z), \quad b\text{-score}, \qquad B_\eta^f \phi(X) = \phi(Z), \quad f\text{-score}.$

The first-order influence function is well known to take the form

$\dot\chi_\eta(X) = A a(Z)\big(Y - b(Z)\big) + b(Z) - \chi(\eta).$ (26)

To see this it must be verified that this function satisfies, for every path $t \mapsto p_{\eta_t}$ as described previously,

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = E_\eta\, \dot\chi_\eta(X)\, B_\eta(\alpha, \beta, \phi)(X).$

For the paths $a_t = a + t\alpha$, $b_t = b + t\beta$ and $f_t = f(1 + t\phi)$ the left side of this equation is $\int (\beta + b\phi)\, f\,d\nu$. The right side can easily be evaluated to be the same, where it may be noted that conditional expectations of functions of $Y$ and $A$ given $Z$ factorize, with $E(Y - b(Z) \mid Z) = E(A a(Z) - 1 \mid Z) = 0$ and $E\big((Y - b(Z))^2 \mid Z\big) = b(1 - b)(Z)$.

The advantage of parameterizing by the inverse probability $a$ is clear from the form of the (random part of the) influence function, which is bilinear in $(a, b)$. The error of the corresponding von-Mises representation can be computed to be, for a given initial estimator $\hat\eta = (\hat a, \hat b, \hat f)$,

$\chi(\hat\eta) - \chi(\eta) + P_\eta \dot\chi_{\hat\eta} = -\int (\hat a - a)(\hat b - b)\, \frac{f}{a}\,d\nu.$ (27)

This is quadratic in the errors of the initial estimators. Actually, the form of the bias term is special in that squared estimation errors of the two initial estimators $\hat a$ and $\hat b$ do not arise, but only the product of their errors. This property, termed “double robustness” in Rotnitzky and Robins (1995), Robins and Rotnitzky (2001), van der Laan and Robins (2003), means that it suffices that one of the two parameters is estimated well. A prior assumption that the parameters $a$ and $b$ are $\alpha$ and $\beta$ regular, respectively, would allow estimation errors with rates $n^{-\alpha/(2\alpha+d)}$ and $n^{-\beta/(2\beta+d)}$. If the product of these rates is $o(n^{-1/2})$, then the bias term is negligible, and the linear estimator (3) attains a rate $n^{-1/2}$. This leads to the condition (25). If this condition fails, then the “bias” (27) is of larger order than $n^{-1/2}$. The linear estimator then does not balance bias and variance and is suboptimal.
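In code the linear estimator (3) built from (26) is a one-line average. The toy check below (our illustration; here $a$ and $b$ are plugged in exactly, so the bias term (27) vanishes) recovers $EY$:

```python
import numpy as np

def linear_mean_response(YA, A, Z, a_hat, b_hat):
    """Estimator (3) with influence function (26): P_n [A a(Z)(Y - b(Z)) + b(Z)].
    YA is the observed product A*Y, a_hat(z) estimates 1/P(A=1|Z=z), and
    b_hat(z) estimates b(z); the estimator is doubly robust in (a_hat, b_hat)."""
    return np.mean(a_hat(Z) * (YA - A * b_hat(Z)) + b_hat(Z))

rng = np.random.default_rng(2)
n = 20000
Z = rng.uniform(size=n)
A = rng.binomial(1, 0.5 + 0.4 * Z)     # P(A=1|Z) = 0.5 + 0.4 Z, so a = 1/(0.5 + 0.4 Z)
Y = rng.binomial(1, Z)                 # b(z) = z, hence EY = 1/2
print(linear_mean_response(Y * A, A, Z, lambda z: 1 / (0.5 + 0.4 * z), lambda z: z))
```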

It may be noted that the marginal density f does not enter the first order influence function. Even though the functional depends on f, a rate on the initial estimator of this function is not needed for the construction of the first order estimator. This will be different at second order.

Quadratic estimator

We proceed to the computation of a second order influence function using Lemma 1, by searching for a function $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$ such that, for every $x_1 = (y_1 a_1, a_1, z_1)$, and all directions $\alpha$, $\beta$, $\phi$,

$a_1\big(y_1 - b(z_1)\big)\,\alpha(z_1) - \big(a_1 a(z_1) - 1\big)\,\beta(z_1) = \frac{d}{dt}\Big|_{t=0}\big[\dot\chi_{\eta_t}(x_1) + \chi(\eta_t)\big] = E_\eta\, \ddot\chi_\eta(x_1, X_2)\, B_\eta(\alpha, \beta, \phi)(X_2).$ (28)

Here the expectation is relative to the variable $X_2$ only. Let $K_\eta: \mathcal{Z}^2 \to \mathbb{R}$ be the kernel of an operator $K_\eta: L_2(f) \to L_2(f)$ (i.e., $K_\eta g(z_1) = \int K_\eta(z_1, z_2)\, g(z_2)\, f(z_2)\,d\nu(z_2)$), and define

$\ddot\chi_\eta(X_1, X_2) = -A_1\big(Y_1 - b(Z_1)\big)\, a(Z_2)\big(A_2 a(Z_2) - 1\big)\, K_\eta(Z_1, Z_2) - \big(A_1 a(Z_1) - 1\big)\, a(Z_2) A_2\big(Y_2 - b(Z_2)\big)\, K_\eta(Z_1, Z_2).$ (29)

For this choice the right side of Eq. (28) can be seen to reduce to

$a_1\big(y_1 - b(z_1)\big)\, K_\eta\alpha(z_1) - \big(a_1 a(z_1) - 1\big)\, K_\eta\beta(z_1).$

(Note that $\mathrm{var}(A a(Z) \mid Z) = a(Z) - 1$.) Thus the choice Eq. (29) of $\ddot\chi_\eta$ satisfies Eq. (28) for every $(\alpha, \beta, \phi)$ such that $K_\eta\alpha = \alpha$ and $K_\eta\beta = \beta$. Were $K_\eta$ equal to the identity operator, then Eq. (28) would be satisfied for every $(\alpha, \beta, \phi)$, and an exact second order influence function would exist. Unfortunately, the identity operator is not given by a kernel. As in Sect. 4 we have to be satisfied with an influence function that gives partial representation.

To ensure that $\ddot\chi_\eta$ is symmetric we choose $K_\eta(z_1, z_2) = \Pi_\eta(z_1, z_2)/a(z_2)$ for $\Pi_\eta$ a symmetric function. Specifically, we choose $\Pi_\eta$ the kernel of an orthogonal projection $\Pi_\eta: L_2(f/a) \to L_2(f/a)$ onto a space $L$. The corresponding operators then (trivially) satisfy $K_\eta g = \Pi_\eta g$ for every $g \in L_2(f/a)$, and hence $K_\eta$ will approximate the identity if $L$ is large. The function (29) that results from this choice can be seen to be both symmetric and degenerate, and hence is a candidate “approximate” influence function. If $S_2$ symmetrizes a function of two variables (i.e., $2 S_2 g(X_1, X_2) = g(X_1, X_2) + g(X_2, X_1)$), then this influence function can be written as

$\ddot\chi_\eta(X_1, X_2) = -2 S_2\big[A_1\big(Y_1 - b(Z_1)\big)\, \Pi_\eta(Z_1, Z_2)\, \big(A_2 a(Z_2) - 1\big)\big].$ (30)

For an initial estimator $\hat\eta$ based on independent observations we now construct the estimator (8).
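A sketch of the second order term $\tfrac{1}{2}\mathbb{U}_n \ddot\chi_{\hat\eta}$, under the sign convention of Eqs. (29)–(30) as reconstructed here; the projection kernel is supplied as an input, and the symmetrization $S_2$ is automatic because $\mathbb{U}_n$ averages over ordered pairs:

```python
import numpy as np

def quadratic_correction(YA, A, Z, a_hat, b_hat, Pi):
    """(1/2) U_n chi_ddot from (30): minus U_n of the kernel
    A_1 (Y_1 - b_hat(Z_1)) Pi(Z_1, Z_2) (A_2 a_hat(Z_2) - 1).
    Pi(z1, z2) stands for the kernel of the k-dimensional projection
    in L2(f/a); here it is simply assumed given and vectorized."""
    n = len(Z)
    res_b = YA - A * b_hat(Z)             # A_i (Y_i - b_hat(Z_i))
    res_a = A * a_hat(Z) - 1.0            # A_j a_hat(Z_j) - 1
    M = np.outer(res_b, res_a) * Pi(Z[:, None], Z[None, :])
    np.fill_diagonal(M, 0.0)              # the U-statistic excludes i == j
    return -M.sum() / (n * (n - 1))
```

Adding this correction to the linear estimator gives the estimator $\hat\chi_n$ of Theorem 1 below.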

Let $\hat E$ and $\hat{\mathrm{var}}$ denote conditional expectation and variance given the observations used to construct $\hat\eta$, and let $\|\cdot\|_2$ be the norm of $L_2(f/a)$. Assume that the true functions $a$, $f$ and the estimators $\hat a$, $\hat f$ are bounded away from 0 and $\infty$.

Theorem 1 The estimator $\hat\chi_n = \chi(\hat\eta) + \mathbb{P}_n \dot\chi_{\hat\eta} + \tfrac{1}{2}\mathbb{U}_n \ddot\chi_{\hat\eta}$ with (approximate) influence functions $\dot\chi_\eta$ and $\ddot\chi_\eta$ defined by (26) and (30) with $\Pi_\eta$ the kernel of an orthogonal projection in $L_2(f/a)$ onto a $k$-dimensional linear subspace satisfies

$\hat E_\eta \hat\chi_n - \chi(\eta) = O_P\Big(\|\hat a - a\|_2\, \|\hat b - b\|_2\, \Big\|\frac{\hat f}{\hat a} - \frac{f}{a}\Big\|_2\Big) + O_P\big(\|a - \Pi_\eta a\|_2\, \|b - \Pi_\eta b\|_2\big), \qquad \hat{\mathrm{var}}_\eta\, \hat\chi_n = O_P\Big(\frac{k}{n^2} \vee \frac{1}{n}\Big).$

Proof From Eqs. (27) and (30) we have

$\hat E \hat\chi_n - \chi(\eta) = -\int (\hat a - a)(\hat b - b)\, \frac{f}{a}\,d\nu - \hat E\, A_1\big(Y_1 - \hat b(Z_1)\big)\, \Pi_{\hat\eta}(Z_1, Z_2)\, \big(A_2 \hat a(Z_2) - 1\big) = -\int (\hat a - a)(\hat b - b)\, \frac{f}{a}\,d\nu + \int\int \big((\hat a - a) \times (\hat b - b)\big)\, \Pi_{\hat\eta}\, \Big(\frac{f}{a} \times \frac{f}{a}\Big)\,d\nu \times \nu.$

The double integral on the far right with $\Pi_{\hat\eta}$ replaced by $\Pi_\eta$ can be written as the single integral $\int (\hat a - a)\, \Pi_\eta(\hat b - b)\, \frac{f}{a}\,d\nu$. Added to the first integral on the right this gives

$-\int (\hat a - a)\, (I - \Pi_\eta)(\hat b - b)\, \frac{f}{a}\,d\nu.$

By the Cauchy-Schwarz inequality this is bounded in absolute value by the second term in the upper bound for the bias.

Replacement of $\Pi_{\hat\eta}$ by $\Pi_\eta$ in the double integral gives a difference

$\int\int \big((\hat a - a) \times (\hat b - b)\big)\, \big(\Pi_{\hat\eta} - \Pi_\eta\big)\, \Big(\frac{f}{a} \times \frac{f}{a}\Big)\,d\nu \times \nu = \int (\hat a - a)\, \Big(\Pi_{\hat\eta}\Big((\hat b - b)\, \frac{f/a}{\hat f/\hat a}\Big) - \Pi_\eta(\hat b - b)\Big)\, \frac{f}{a}\,d\nu.$

By the Cauchy-Schwarz inequality the absolute value of this is bounded above by

$\|\hat a - a\|_2\, \big\|(\Pi_{\hat\eta} M_{\hat w} - \Pi_\eta)(\hat b - b)\big\|_{2,\hat\nu}\, \|\hat w\|_\infty^{1/2}.$

Here $M_{\hat w}$ is multiplication by the function $\hat w = (f/a)/(\hat f/\hat a)$ (defined by $M_{\hat w} g = g\hat w$), and $\|\cdot\|_{2,\hat\nu}$ is the $L_2(\hat\nu)$-norm for the measure $\hat\nu$ defined by $d\hat\nu = (\hat f/\hat a)\,d\nu$. Considering $\Pi_{\hat\eta}$ as the projection in $L_2(\hat\nu)$ with weight 1, and $\Pi_\eta$ as the weighted projection in $L_2(\hat\nu)$ with weight function $\hat w$, we can apply Lemma 4 to the middle term and conclude that this is bounded in absolute value by $\|\Pi_\eta(\hat b - b)\|_{2,\hat\nu}\, \|\hat w - 1\|_\infty \lesssim \|\hat b - b\|_{2,\hat\nu}\, \|\hat w - 1\|_\infty$. Because we assume that the functions $f/a$ and $\hat f/\hat a$ are bounded away from zero and infinity, this can be seen to yield the first term in the upper bound on the bias.

The function $\dot\chi_{\hat\eta}$ is uniformly bounded and hence the (conditional) variance of $\mathbb{P}_n \dot\chi_{\hat\eta}$ is of the order $O_P(1/n)$. Thus for the variance bound it suffices to consider the (conditional) variance of $\mathbb{U}_n \ddot\chi_{\hat\eta}$. In view of Lemma 6 this is bounded above by

$\frac{4}{n} E_\eta\Big(E_\eta\big(\ddot\chi_{\hat\eta}(X_1, X_2) \mid X_1\big)\Big)^2 + \frac{2}{n(n-1)} E_\eta\, \ddot\chi_{\hat\eta}^2(X_1, X_2).$

The variables $A(Y - \hat b(Z))$ and $(A\hat a(Z) - 1)$ are uniformly bounded. Hence the last term on the right is bounded above by a multiple of $n^{-2}\int\int \Pi_{\hat\eta}^2\, \big(\frac{f}{a} \times \frac{f}{a}\big)\,d\nu \times \nu$, which is bounded by $\|\hat w\|_\infty^2\, k/n^2$, by Lemma 3. The first term is of the order $O(1/n)$. To see this we first note that

$E_\eta\big(\ddot\chi_{\hat\eta}(X_1, X_2) \mid X_1\big) = -A_1\big(Y_1 - \hat b(Z_1)\big)\, \Pi_{\hat\eta}\big((\hat a - a)\hat w\big)(Z_1) + \big(A_1 \hat a(Z_1) - 1\big)\, \Pi_{\hat\eta}\big((\hat b - b)\hat w\big)(Z_1).$

Here the variables $A_1(Y_1 - \hat b(Z_1))$ and $(A_1 \hat a(Z_1) - 1)$ are uniformly bounded, and the second moment of $\Pi_{\hat\eta} g$ is bounded by a multiple of $\|\hat w\|_\infty$ times the second moment of $g$ in $L_2(\hat\nu)$, for every $g$.

Conclusion

Assume that the parameters $a$, $b$ and $f/a$ are known to be “regular” of degrees $\alpha$, $\beta$ and $\phi$, respectively, in the sense that there exists a sequence of $k$-dimensional linear spaces $L_k$ such that, for some constant $C$,

$\|a - L_k\|_2 \le C\Big(\frac{1}{k}\Big)^{\alpha/d}, \qquad \|b - L_k\|_2 \le C\Big(\frac{1}{k}\Big)^{\beta/d}, \qquad \Big\|\frac{f}{a} - L_k\Big\|_2 \le C\Big(\frac{1}{k}\Big)^{\phi/d},$

where $\|g - L_k\|_2$ denotes the distance of $g$ to the space $L_k$.

This is true, for instance, if the functions a, b and f/a are defined on a compact, convex domain in Rd and are known to belong to Hölder (or Besov) spaces of functions of smoothness α, β and ϕ. The approximation is then valid even with the uniform norm on the left side, where the spaces Lk can be taken to be generated by polynomials, splines or wavelets.

In this case there also exist estimators $\hat a$, $\hat b$ and $\hat f/\hat a$ that achieve convergence rates $n^{-\alpha/(2\alpha+d)}$, $n^{-\beta/(2\beta+d)}$ and $n^{-\phi/(2\phi+d)}$, respectively, uniformly over these a-priori models. Then the estimator $\hat\chi_n$ of Theorem 1 attains the square rate of convergence

$\frac{k}{n^2} \vee \frac{1}{n} \vee \Big(\frac{1}{n}\Big)^{\frac{2\alpha}{2\alpha+d} + \frac{2\beta}{2\beta+d} + \frac{2\phi}{2\phi+d}} \vee \Big(\frac{1}{k}\Big)^{(2\alpha+2\beta)/d}.$

The optimal value of $k$ balances the first and fourth terms and is of the order $k \sim n^{2d/(d+2\alpha+2\beta)}$. The resulting rate is $n^{-\gamma}$ for

$\gamma = \Big(\frac{1}{2}\Big) \wedge \Big(\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} + \frac{\phi}{2\phi+d}\Big) \wedge \Big(\frac{2\alpha+2\beta}{d+2\alpha+2\beta}\Big).$

This reduces to the rate $n^{-1/2}$ under condition (25), but also if $(\alpha + \beta)/2 \ge d/4$ and $\phi$ is sufficiently large:

$\frac{\phi}{2\phi+d} \ge \frac{1}{2} - \frac{\alpha}{2\alpha+d} - \frac{\beta}{2\beta+d}.$

(In this case we can also choose $k = n$ independent of $\alpha$ and $\beta$.) In case the rate $n^{-\gamma}$ is slower than $n^{-1/2}$, it is still better than the rate $n^{-\alpha/(2\alpha+d)-\beta/(2\beta+d)}$ obtained by the linear estimator (3).

Thus the quadratic estimator outperforms the linear estimator.
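As a quick numeric illustration of this comparison (our sketch; the exponents are read off from the displays above), with $\alpha = \beta = 1$, $\phi = 2$ and $d = 10$ the quadratic estimator attains $\gamma \approx 0.286$, while the linear estimator attains only about $0.167$:

```python
def rate_exponents(alpha, beta, phi, d):
    """Rate exponents gamma in n^{-gamma}: the quadratic estimator's gamma from
    the display above, and alpha/(2 alpha + d) + beta/(2 beta + d) (capped at
    1/2) for the linear estimator."""
    s = lambda r: r / (2 * r + d)
    gamma_quad = min(0.5,
                     s(alpha) + s(beta) + s(phi),
                     (2 * alpha + 2 * beta) / (d + 2 * alpha + 2 * beta))
    gamma_lin = min(0.5, s(alpha) + s(beta))
    return gamma_quad, gamma_lin

print(rate_exponents(alpha=1.0, beta=1.0, phi=2.0, d=10))   # (0.2857..., 0.1666...)
```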

6 Technical results

Let $L$ be a given closed subspace of $L_2(\mathcal{X}, \mathcal{A}, \mu)$ and $w: \mathcal{X} \to \mathbb{R}$ a bounded, measurable function. Define operators $\Pi, \Pi_w: L_2(\mu) \to L_2(\mu)$ by

$\Pi g = \underset{l \in L}{\mathrm{argmin}} \int (g - l)^2\,d\mu, \qquad \Pi_w g = \underset{l \in L}{\mathrm{argmin}} \int (g - l)^2\, w\,d\mu.$

Thus $\Pi$ is the ordinary orthogonal projection onto the space $L$, and $\Pi_w$ is a weighted projection. The projections can be characterized by the orthogonality relationships $\int (g - \Pi g)\, l\,d\mu = 0$ and $\int (g - \Pi_w g)\, l\, w\,d\mu = 0$, for every $l \in L$.

Lemma 4 Let $\Pi_w$ and $\Pi$ be the weighted projections onto a fixed subspace $L$ of $L_2(\mu)$ relative to the weight functions $w$ and 1, respectively, and let $M_w$ be multiplication by the function $w$. Then $\|\Pi_w g - \Pi M_w g\|_2 \le \|\Pi_w g\|_2\, \|w - 1\|_\infty$ for every $g \in L_2(\mu)$.

Proof The orthogonality relationships for the projections $\Pi$ and $\Pi_w$ imply that, for every $l \in L$ and $g$,

$\int \Pi(wg)\, l\,d\mu = \int wg\, l\,d\mu = \int w(\Pi_w g)\, l\,d\mu.$

Because $\Pi_w g - \Pi(wg)$ is contained in $L$,

$\|\Pi_w g - \Pi(wg)\|_2^2 = \int \big(\Pi_w g - \Pi(wg)\big)\big(\Pi_w g - \Pi(wg)\big)\,d\mu = \int \big(\Pi_w g - \Pi(wg)\big)\big(\Pi_w g - (\Pi_w g) w\big)\,d\mu.$

An application of the Cauchy-Schwarz inequality and next cancellation of one factor $\|\Pi_w g - \Pi(wg)\|_2$ gives that $\|\Pi_w g - \Pi(wg)\|_2 \le \|(\Pi_w g)(1 - w)\|_2$. The right side is bounded above by $\|\Pi_w g\|_2\, \|1 - w\|_\infty$.

Lemma 5 For degenerate, symmetric functions $f, g: \mathcal{X}^2 \to \mathbb{R}$ we have $P^n \mathbb{U}_n f = 0$ and

$P^n (\mathbb{U}_n f)(\mathbb{U}_n g) = \frac{1}{\binom{n}{2}} P^2 fg.$

Lemma 6 For any measurable function $f: \mathcal{X}^2 \to \mathbb{R}$, and $f_1(x_1) = \int f(x_1, x_2)\,dP(x_2)$,

$\mathrm{var}\, \mathbb{U}_n f \le \frac{4}{n} P f_1^2 + \frac{2}{n(n-1)} P^2 f^2.$

Proof The first lemma follows by writing the product $(\mathbb{U}_n f)(\mathbb{U}_n g)$ as a double sum (over ordered pairs $i < j$). The expected values of the off-diagonal terms vanish by degeneracy.

For a general measurable function $f: \mathcal{X}^2 \to \mathbb{R}$ the mean $P^2 f$ is the projection of $f$ onto the constant functions, and the function $f_1$ defined by $f_1(x_1) = \int f(x_1, x_2)\,dP(x_2) - P^2 f$ is the projection of $f$ in $L_2(P^2)$ onto the mean zero functions of one variable. The decomposition

$f(x_1, x_2) = P^2 f + f_1(x_1) + f_1(x_2) + f_{12}(x_1, x_2),$

where $f_{12}$ is defined by the equation, yields the Hoeffding decomposition $\mathbb{U}_n f = P^2 f + 2\mathbb{P}_n f_1 + \mathbb{U}_n f_{12}$ of the U-statistic into orthogonal parts, with $\mathbb{U}_n f_{12}$ degenerate. Using Lemma 5 we see that the variance of $\mathbb{U}_n f$ is equal to $(4/n) P f_1^2 + \big(2/(n(n-1))\big) P^2 f_{12}^2$. The norm of the recentered function $f_1$ used here is smaller than the norm of the function $f_1$ in the statement of the lemma. Because $f_{12}$ is a projection of $f$, its norm is bounded by the norm of $f$.
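The bound of Lemma 6 is easy to check by simulation. Below is a small Monte Carlo sketch (our illustration) with $P$ the uniform distribution on [0, 1] and $f(x_1, x_2) = x_1 x_2 + x_1 + x_2$, for which $f_1(x) = 1.5x + 0.5$ and $P f_1^2 = 1.75$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 30, 4000

def f(x1, x2):
    return x1 * x2 + x1 + x2

def U_n(X):
    F = f(X[:, None], X[None, :])
    np.fill_diagonal(F, 0.0)
    return F.sum() / (len(X) * (len(X) - 1))

var_mc = np.var([U_n(rng.uniform(size=n)) for _ in range(reps)])

Pf1sq = 1.75                                   # int_0^1 (1.5 x + 0.5)^2 dx
x = (np.arange(1000) + 0.5) / 1000             # midpoint rule for P^2 f^2
P2f2 = np.mean(f(x[:, None], x[None, :]) ** 2)
print(var_mc, "<=", 4 / n * Pf1sq + 2 / (n * (n - 1)) * P2f2)
```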

Footnotes

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

References

  1. Begun JM, Hall WJ, Huang WM, Wellner JA. Information and asymptotic efficiency in parametric–nonparametric models. Ann Stat. 1983;11(2):432–452.
  2. Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā Ser A. 1988;50(3):381–393.
  3. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins series in the mathematical sciences. Johns Hopkins University Press; Baltimore: 1993.
  4. Bolthausen E, Perkins E, van der Vaart A. Lectures on probability theory and statistics. In: Bernard P, editor. Lecture notes in mathematics. Lectures from the 29th summer school on probability theory held in Saint-Flour, July 8–24, 1999. Springer; Berlin: 2002.
  5. Emery M, Nemirovski A, Voiculescu D. Lectures on probability theory and statistics. In: Bernard P, editor. Lecture notes in mathematics. Lectures from the 28th summer school on probability theory held in Saint-Flour, 17 August–3 September, 1998. Springer; Berlin: 2000.
  6. Klaassen CAJ. Consistent estimation of the influence function of locally asymptotically linear estimators. Ann Stat. 1987;15(4):1548–1562.
  7. Koševnik JA, Levit BJ. On a nonparametric analogue of the information matrix. Teor Verojatnost i Primenen. 1976;21(4):759–774.
  8. Laurent B. Efficient estimation of integral functionals of a density. Ann Stat. 1996;24(2):659–681.
  9. Laurent B. Estimation of integral functionals of a density and its derivatives. Bernoulli. 1997;3(2):181–211.
  10. Le Cam L. Locally asymptotically normal families of distributions. Certain approximations to families of distributions and their use in the theory of estimation and testing hypotheses. Univ California Publ Stat. 1960;3:37–98.
  11. Murphy SA, van der Vaart AW. On profile likelihood. J Am Stat Assoc. 2000;95(450):449–485. (with comments and a rejoinder by the authors)
  12. Pfanzagl J. Contributions to a general asymptotic statistical theory. Lecture notes in statistics. Vol. 13. Springer; New York: 1982. (with the assistance of Wefelmeyer W)
  13. Pfanzagl J. Asymptotic expansions for general statistical models. Lecture notes in statistics. Vol. 31. Springer; Berlin: 1985. (with the assistance of Wefelmeyer W)
  14. Robins J, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: some questions and an answer”. Stat Sin. 2001;11(4):920–936.
  15. Robins J, Li L, Tchetgen E, van der Vaart A. Asymptotic normality of quadratic estimators. 2007. (submitted) doi: 10.1016/j.spa.2016.04.005.
  16. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. 1992. pp. 297–331.
  17. Rotnitzky A, Robins JM. Semi-parametric estimation of models for means and covariances in the presence of missing data. Scand J Stat. 1995;22(3):323–333.
  18. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer series in statistics. Springer; New York: 2003.
  19. van der Vaart A. On differentiable functionals. Ann Stat. 1991;19(1):178–204.
  20. van der Vaart A. Maximum likelihood estimation with partially censored data. Ann Stat. 1994;22(4):1896–1916.
  21. van der Vaart AW. Statistical estimation in large parameter spaces. CWI Tract. Vol. 44. Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica; Amsterdam: 1988.
  22. van der Vaart AW. Asymptotic statistics. Cambridge series in statistical and probabilistic mathematics. Vol. 3. Cambridge University Press; Cambridge: 1998.
  23. Von Mises R. On the asymptotic distribution of differentiable statistical functions. Ann Math Stat. 1947;18:309–348.
  24. Wellner JA, Klaassen CAJ, Ritov Y. Semiparametric models: a review of progress since BKRW (1993). In: Frontiers in statistics. Imp Coll Press; London: 2006. pp. 25–44.
