Published in final edited form as: Metrika. 2009;69(2–3):227–247. doi: 10.1007/s00184-008-0214-3

Quadratic semiparametric Von Mises calculus

James Robins 1, Lingling Li 1, Eric Tchetgen 1, Aad W van der Vaart 2

Abstract

We discuss a new method of estimation of parameters in semiparametric and nonparametric models. The method is based on U-statistics constructed from quadratic influence functions. The latter extend ordinary linear influence functions of the parameter of interest as defined in semiparametric theory, and represent second order derivatives of this parameter. For parameters for which the matching cannot be perfect the method leads to a bias-variance trade-off, and results in estimators that converge at a rate slower than $n^{-1/2}$. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at $n^{-1/2}$-rate.

Keywords: Von Mises calculus, Semiparametric models, Missing data, Tangent space, Influence function, Rate of convergence

1 Introduction

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution $P_\eta$ with density $p_\eta$ relative to a measure $\mu$ on a sample space $(\mathcal{X}, \mathcal{A})$, where the parameter $\eta$ is known to belong to a subset $H$ of a normed space. We wish to estimate the value $\chi(\eta)$ of a functional $\chi: H \to \mathbb{R}$ with the help of the observations $X_1, \ldots, X_n$. Our main interest is in the situation of a semiparametric or nonparametric model, where $H$ is an infinite-dimensional set, and the dependence $\eta \mapsto p_\eta$ is smooth.

This problem has been studied under the heading “semiparametric statistics” in the 1980s and 1990s. A theory of asymptotic lower bounds for “regular parameters” χ(η) based on Le Cam's concept of local asymptotic normality (Le Cam 1960) was developed starting with Koševnik and Levit (1976) and Pfanzagl (1982), and worked out for many examples in, among others, Begun et al. (1983), van der Vaart (1988) and Bickel et al. (1993). There are many examples of ad-hoc estimators that attain these bounds, and the behaviour of principled methods such as maximum likelihood (including its sieved and penalized variants) or estimating equations is understood to a certain extent (e.g., van der Vaart 1994; Murphy and van der Vaart 2000; Bolthausen et al. 2002; Wellner et al. 1993; van der Laan and Robins 2003).

Certain combinations of models $(P_\eta: \eta \in H)$ and parameters $\chi(\eta)$ possess structural properties that allow the parameter to be estimated at $n^{-1/2}$-rate, no matter the size of the parameter set $H$. In this paper we are interested in the other situations, where the rate of estimation drops when the complexity of the model exceeds a certain limit. Such examples arise for instance when many covariates must be included in a model to correct for possible confounding in a causal study, or for modelling the probability that an individual is included in a sample in a study with missing observations. If simple (e.g., linear) models for the dependence on these covariates are not plausible, which is typical in epidemiological studies, then the resulting model must be taken so large that the usual methods fail. These methods typically focus on variance only, because the bias is negligible due to the structure of the model, or by explicitly assuming a “no-bias condition” (see Klaassen 1987; Murphy and van der Vaart 2000). In this paper we develop new methods that make a bias-variance trade-off when necessary.

These methods are based on quadratic estimating equations rather than the usual linear estimating equations.

Quadratic expansions for semiparametric models were previously investigated by Pfanzagl and Wefelmeyer (Pfanzagl 1985), but from the very different perspective of second order efficiency, i.e., the refinement of first order bounds by adding a lower order term. Our aim is to show that second order influence functions can be used for first order inference, because they permit balancing of bias and variance.

Following linear and quadratic is cubic, and so on. Extension of our approach to still higher orders is possible, but comes with many new complications. We shall pursue this elsewhere.

The paper is organized as follows. In Sect. 2 we review linear estimators from our current perspective. Next in Sect. 3 we introduce our new method of constructing quadratic estimators. This section has mostly a heuristic nature. In Sects. 4 and 5 we give rigorous constructions and results for two examples. The first is a classical theoretical example. The second is more extensive and concerns estimating a mean response when the response is not always observed.

Notation Let $\mathbb{P}_n$ and $\mathbb{U}_n$ denote the empirical measure and the empirical U-statistic measure, viewed as operators on functions: for given functions $f: \mathcal{X} \to \mathbb{R}$ and $g: \mathcal{X}^2 \to \mathbb{R}$ these are given by

$\mathbb{P}_n f = \frac{1}{n} \sum_{i=1}^n f(X_i), \qquad \mathbb{U}_n g = \frac{1}{n(n-1)} \sum\sum_{1 \le i \ne j \le n} g(X_i, X_j).$

We use the notation $\mathbb{U}_n f$ also for a function $f: \mathcal{X} \to \mathbb{R}$ of one argument, with the interpretation $\mathbb{U}_n f = \mathbb{P}_n f$. This is consistent with the given formulas if a function of one argument is considered as a function of two arguments that is constant in its second argument.

We write $P^n \mathbb{U}_n g = P^2 g$ for the expectation of $\mathbb{U}_n g$ if $X_1, \ldots, X_n$ are distributed according to the probability measure $P$. We also use the operator notation for the expectations of statistics in general.

We call a measurable function $g: \mathcal{X}^2 \to \mathbb{R}$ degenerate relative to $P$ if $\int g(x_1, x_2)\,dP(x_i) = 0$ for $i = 1, 2$, and we call it symmetric if $g(x_1, x_2) = g(x_2, x_1)$ for every $x_1, x_2 \in \mathcal{X}$.

Given two functions $g, h: \mathcal{X} \to \mathbb{R}$ we write $g \times h$ for the function $(x, y) \mapsto g(x)h(y)$. Such tensor product functions are degenerate if both $g$ and $h$ have mean zero. The corresponding notation $P \times Q$ for two measures $P$ and $Q$ denotes the product measure.
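For concreteness, these two empirical operators are easy to compute. The following is a minimal numpy sketch (our illustration, not part of the paper), assuming vectorized functions f and g; note that $\mathbb{U}_n$ excludes the diagonal $i = j$.

```python
import numpy as np

def P_n(f, X):
    """Empirical measure: P_n f = (1/n) sum_i f(X_i)."""
    return np.mean(f(X))

def U_n(g, X):
    """Empirical U-statistic measure: U_n g = (n(n-1))^{-1} sum_{i != j} g(X_i, X_j)."""
    n = len(X)
    G = g(X[:, None], X[None, :])    # matrix of g(X_i, X_j) for a vectorized kernel g
    np.fill_diagonal(G, 0.0)         # the double sum excludes the diagonal i == j
    return G.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
X = rng.standard_normal(100)
print(P_n(lambda x: x**2, X))            # estimates E X^2 = 1
print(U_n(lambda x, y: x * y, X))        # degenerate kernel; estimates (E X)^2 = 0
```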

2 Linear estimator

Given an initial estimator $\hat\eta$ of $\eta$, the plug-in estimator $\chi(\hat\eta)$ is typically a consistent estimator of the parameter of interest $\chi(\eta)$, but it may not be a good estimator. In particular, if $\hat\eta$ is a general purpose estimator, not specially constructed to yield a good plug-in, then $\chi(\hat\eta)$ will often have suboptimal precision. To gain insight into this situation we assume that the parameter permits a Taylor expansion of the form

$\chi(\eta) = \chi(\hat\eta) + \chi'_{\hat\eta}(\eta - \hat\eta) + O(\|\eta - \hat\eta\|^2).$ (1)

Such an expansion suggests that the plug-in estimator will have an error of the order $O_P(\|\eta - \hat\eta\|)$, unless the linear term in the expansion vanishes.

The expansion (1) also suggests that better estimators can be obtained by “estimating” the linear term in the expansion. To achieve this we assume a “generalized von-Mises representation” of the derivative of the form

$\chi'_{\hat\eta}(\eta - \hat\eta) = \int \dot\chi_{\hat\eta}\,d(P_\eta - P_{\hat\eta}) = \int \dot\chi_{\hat\eta}\,dP_\eta + O(\|\eta - \hat\eta\|^2),$ (2)

for some measurable function $\dot\chi_{\hat\eta}: \mathcal{X} \to \mathbb{R}$, referred to as an influence function. The second equality is valid if $\dot\chi_\eta$ is degenerate relative to $P_\eta$ (i.e., $P_\eta \dot\chi_\eta = 0$) for every $\eta$, which can always be arranged by a recentering, as $\int 1\,d(P_\eta - P_{\hat\eta}) = 0$. The von-Mises representation and Eq. (1) suggest the “corrected plug-in estimator”

$\chi(\hat\eta) + \mathbb{P}_n \dot\chi_{\hat\eta}.$ (3)

This estimator should have an error of the order $O_P(n^{-1/2}) + O_P(\|\eta - \hat\eta\|^2)$, as the difference $(\mathbb{P}_n - P_\eta)\dot\chi_{\hat\eta}$ is “centered” and ought to have “variance” of the order $O(1/n)$.

We put “centered” and “variance” in quotes, because the randomness in the initial estimator $\hat\eta$ prevents a simple calculation of mean and variance. Empirical process theory can be used to show that the effect of replacing $\dot\chi_\eta$ by $\dot\chi_{\hat\eta}$ is negligible, if the class of functions $\dot\chi_\eta$ is not too rich. In the present paper we are interested in orders of magnitude only, and then a simpler approach is to split the sample and use separate observations to construct $\hat\eta$ and to construct $\mathbb{P}_n$. Then the orders can be justified by reasoning conditionally on the first sample, and it suffices that $\int \dot\chi_{\hat\eta}^2\,dP_\eta$ remains bounded in probability.
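Schematically, the sample-splitting construction of the corrected plug-in estimator (3) looks as follows. This is a sketch of the recipe only: `fit_eta`, `chi` and `chi_dot` are hypothetical placeholders for the model-specific ingredients.

```python
import numpy as np

def linear_onestep(X, fit_eta, chi, chi_dot, rng):
    """Sample-split corrected plug-in estimator (3): chi(eta_hat) + P_n chi_dot,
    with eta_hat fitted on one half of the data (a numpy array X) and the
    empirical average of the influence function taken over the other half."""
    n = len(X)
    idx = rng.permutation(n)
    train, evaluate = X[idx[: n // 2]], X[idx[n // 2:]]
    eta_hat = fit_eta(train)                          # initial estimator, first half
    correction = np.mean(chi_dot(eta_hat, evaluate))  # P_n of the influence function
    return chi(eta_hat) + correction
```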

Von Mises (1947) originally introduced the expansions that are named after him in order to investigate functionals of empirical distributions. The idea to use expansions (1) for estimation in nonparametric models occurs in Emery et al. (2000). Our situation is more involved, because we are interested in models $(P_\eta: \eta \in H)$ that are structured through a map $\eta \mapsto p_\eta$, and we are interested in a functional $\chi(\eta)$ of the parameter. In this situation a von-Mises type expansion can fail for two reasons. First, a derivative $\chi'_\eta$ is by definition a continuous, linear map on the underlying normed space, and such maps may or may not be representable as an integral, depending on the normed space. Second, our von Mises expansion (2) represents this derivative as an integral relative to the distribution $P_\eta$ and hence also involves the inverse map $P_\eta \mapsto \eta$ from the distribution of the data to the parameter. We require representation through $P_\eta$, because this allows us to construct the estimator (3) by replacing $P_\eta$ by the empirical distribution.

These issues are related to investigations in the theory of semiparametric models (see Koševnik and Levit 1976; Pfanzagl 1982; van der Vaart 1988; Bickel et al. 1993). These papers define a tangent set of a semiparametric model $(P_\eta: \eta \in H)$ as the set of functions $\dot g_\eta: \mathcal{X} \to \mathbb{R}$ obtainable as

$\tfrac{1}{2} \dot g_\eta \sqrt{p_\eta} = \lim_{t \downarrow 0} \frac{\sqrt{p_{\eta_t}} - \sqrt{p_\eta}}{t},$

where the limit is taken in the $L_2$-sense, and $t \mapsto \eta_t$ ranges over a collection of maps from $[0, 1] \subset \mathbb{R}$ to $H$ for which the limit exists. Informally, a “tangent vector” $\dot g_\eta$ is just a score function

$\dot g_\eta = \frac{\partial}{\partial t}\Big|_{t=0} \log p_{\eta_t} = \frac{\partial/\partial t\,\big|_{t=0}\, p_{\eta_t}}{p_\eta},$ (4)

of a one-dimensional submodel $(P_{\eta_t}: t \ge 0)$ at $t = 0$, where $\eta_0 = \eta$. (Taking the derivative in the $L_2$-sense is appropriate for asymptotic information theory, but not necessarily so for the present heuristic discussion.) An influence function is defined as a measurable map $\dot\chi_\eta: \mathcal{X} \to \mathbb{R}$ such that, for all paths $t \mapsto \eta_t$ considered,

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta \dot\chi_\eta \dot g_\eta.$ (5)

It is not difficult to see that the latter influence function is the same as the influence function needed in the von-Mises expansion (2), if the various types of derivatives match up. (Note that the middle expression in (2) with $\eta$ replaced by $\eta_t$ and $\hat\eta$ by $\eta$ expands to $t\,P_\eta \dot\chi_\eta \dot g_\eta + o(t)$, as $p_{\eta_t} - p_\eta = t\, \dot g_\eta p_\eta + o(t)$.) Necessary and sufficient conditions for existence of an influence function in terms of the derivatives of the maps $\eta \mapsto \chi(\eta)$ and $\eta \mapsto p_\eta$ were investigated in van der Vaart (1991).

An influence function is not necessarily unique, as only its inner products with elements $\dot g_\eta$ of the tangent set matter. The version of an influence function that is contained in the closed linear span of the tangent set (the projection of any influence function onto this span) is called the efficient influence function or canonical gradient, as it is the influence function of asymptotically efficient estimators. It minimizes the variance $\mathrm{var}_\eta\, \mathbb{P}_n \dot\chi_\eta$ over all influence functions.

3 Quadratic estimator

If the preliminary estimator $\hat\eta$ attains a rate of convergence $\|\hat\eta - \eta\| = o_P(n^{-1/4})$, then the plug-in estimator (3) attains a $n^{-1/2}$-rate of convergence. Typically this will require that the parameter set $H$ is not too big. If the preliminary estimator is less precise, then the remainder term of the expansion (1) will dominate. This suggests taking the expansion one step further:

$\chi(\eta) = \chi(\hat\eta) + \chi'_{\hat\eta}(\eta - \hat\eta) + \tfrac{1}{2}\chi''_{\hat\eta}(\eta - \hat\eta, \eta - \hat\eta) + O(\|\eta - \hat\eta\|^3).$ (6)

The generalization of the first order construction now requires a von Mises type representation of the form, for measurable functions $\dot\chi_\eta: \mathcal{X} \to \mathbb{R}$ and $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$,

$\chi'_{\hat\eta}(\eta - \hat\eta) + \tfrac{1}{2}\chi''_{\hat\eta}(\eta - \hat\eta, \eta - \hat\eta) = \int \dot\chi_{\hat\eta}\,d(P_\eta - P_{\hat\eta}) + \tfrac{1}{2}\int \ddot\chi_{\hat\eta}\,d(P_\eta - P_{\hat\eta}) \times (P_\eta - P_{\hat\eta}) + O(\|\eta - \hat\eta\|^3).$ (7)

We assume without loss of generality that the functions $\dot\chi_\eta$ and $\ddot\chi_\eta$ are degenerate relative to $P_\eta$. The von-Mises representation then suggests the “corrected plug-in estimator”

$\chi(\hat\eta) + \mathbb{P}_n \dot\chi_{\hat\eta} + \tfrac{1}{2}\mathbb{U}_n \ddot\chi_{\hat\eta}.$ (8)

The empirical measure and the U-statistic measure serve as unbiased estimators of the expectations of their kernels. For simplicity we may again base the initial estimator $\hat\eta$ and these two empirical averages on independent samples of observations. Because the variance of a U-statistic is of order $O(1/n)$, this estimator ought to have an error of the order $O_P(n^{-1/2}) + O_P(\|\eta - \hat\eta\|^3)$. We shall discuss the validity of this later.
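In the same schematic vein, the quadratic estimator (8) adds one half times a U-statistic in the estimated second order influence function. Again `chi`, `chi_dot`, `chi_ddot` and `eta_hat` are hypothetical placeholders (`chi_ddot` assumed vectorized over pairs), and `eta_hat` is assumed fitted on an independent sample.

```python
import numpy as np

def quadratic_estimator(X, eta_hat, chi, chi_dot, chi_ddot):
    """Corrected plug-in estimator (8):
    chi(eta_hat) + P_n chi_dot_{eta_hat} + (1/2) U_n chi_ddot_{eta_hat}."""
    n = len(X)
    linear = np.mean(chi_dot(eta_hat, X))              # P_n of the first order term
    K = chi_ddot(eta_hat, X[:, None], X[None, :])      # kernel values at all pairs
    np.fill_diagonal(K, 0.0)                           # U_n excludes the diagonal
    quadratic = K.sum() / (n * (n - 1))
    return chi(eta_hat) + linear + 0.5 * quadratic
```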

To characterize the first and second order influence functions we can again employ smooth one-dimensional submodels $(P_{\eta_t}: t \ge 0)$. With the first and second order derivatives of these models denoted by

$\dot g_\eta = \frac{\partial/\partial t\,\big|_{t=0}\, p_{\eta_t}}{p_\eta}, \qquad \ddot g_\eta = \frac{\partial^2/\partial t^2\,\big|_{t=0}\, p_{\eta_t}}{p_\eta},$ (9)

the von Mises expansion (7) can informally be seen to imply

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta \dot\chi_\eta \dot g_\eta,$ (10)
$\frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = P_\eta \dot\chi_\eta \ddot g_\eta + P_\eta^2\, \ddot\chi_\eta (\dot g_\eta \times \dot g_\eta).$ (11)

Equation (10) is identical to Eq. (5), and hence a first order influence function $\dot\chi_\eta$ can be taken as before. Following Pfanzagl (1985) we define a second order influence function as a measurable function $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$ that satisfies (11) for every path $t \mapsto \eta_t$ under consideration. From Eq. (11) we see that $\ddot\chi_\eta$ is unique only up to functions that are orthogonal to functions of the form $\dot g_\eta \times \dot g_\eta$, for $\dot g_\eta$ belonging to the tangent set. In particular, a second order influence function $\ddot\chi_\eta$ can always be taken to be symmetric and degenerate relative to $P_\eta$. It must be taken so in the construction of the estimator (8).

The two influence functions occur together in Eq. (11), and hence should be considered as a pair $(\dot\chi_\eta, \ddot\chi_\eta)$ of functions rather than as two separate functions. This is particularly important if the tangent set is not “full”, i.e., smaller than the set of all mean-zero functions in $L_2(P_\eta)$, the tangent set of a nonparametric model. Both first and second order influence functions are then non-unique, but their different versions cannot be freely combined into valid pairs $(\dot\chi_\eta, \ddot\chi_\eta)$. This is connected to the fact that first and second order derivatives $\dot g_\eta$ and $\ddot g_\eta$ are also not clearly separated. A simple change of speed $t \mapsto \phi(t)$ of a path through a second order diffeomorphism $\phi: [0, 1] \to [0, 1]$ leads to the submodel $(P_{\eta_{\phi(t)}}: t \ge 0)$ with first and second order derivatives, by the chain rule,

$\phi'(0)\, \dot g_\eta, \qquad \phi'(0)^2\, \ddot g_\eta + \phi''(0)\, \dot g_\eta.$

Thus the first order derivative becomes part of the second order derivative after reparameterization. Pfanzagl (Pfanzagl 1985, 2.4.4) has shown, under assumptions of smoothness of the tangent set as a function of the parameter, that the sum of every first order derivative and every second order derivative occurs as the second order derivative of some path. Thus the set of second order derivatives $\ddot g_\eta$ is only defined up to equivalence modulo the tangent set.

From Eq. (11) it is also clear that second order influence functions involve the joint distribution of two observations. Correspondingly, we prefer to define a second order tangent space of the model not through the second order derivatives $\ddot g_\eta$ along paths $t \mapsto p_{\eta_t}$, but through the functions of two arguments

$\ddot s_\eta := \frac{\partial^2/\partial t^2\,\big|_{t=0}\, (p_{\eta_t} \times p_{\eta_t})}{p_\eta \times p_\eta} = \ddot g_\eta \times 1 + 2\, \dot g_\eta \times \dot g_\eta + 1 \times \ddot g_\eta.$ (12)

The function $\ddot s_\eta$ is a second order score for the model $(P_\eta \times P_\eta: \eta \in H)$ for two observations. The corresponding first order scores are

$\dot s_\eta := \frac{\partial/\partial t\,\big|_{t=0}\, (p_{\eta_t} \times p_{\eta_t})}{p_\eta \times p_\eta} = \dot g_\eta \times 1 + 1 \times \dot g_\eta.$ (13)

With these notations the Eqs. (10), (11) defining the influence functions can also be written as follows, if $\ddot\chi_\eta$ is chosen degenerate:

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta^2 \big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big)\, \dot s_\eta = \frac{d}{dt}\Big|_{t=0} P_{\eta_t}^2 \big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big),$ (14)
$\frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = P_\eta^2 \big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big)\, \ddot s_\eta = \frac{d^2}{dt^2}\Big|_{t=0} P_{\eta_t}^2 \big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big).$ (15)

Here we interpret the function $\dot\chi_\eta: \mathcal{X} \to \mathbb{R}$ as a function $\dot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$ that depends on the first argument only (and is constant in the second), or (better) replace it by its symmetrization $(x_1, x_2) \mapsto \tfrac{1}{2}\big(\dot\chi_\eta(x_1) + \dot\chi_\eta(x_2)\big)$. The equations show that the overall influence function $\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta$ is characterized by having “correct” inner products with the overall scores $\dot s_\eta$ and $\ddot s_\eta$. This overall influence function uniquely defines its constituents $\dot\chi_\eta$ and $\tfrac{1}{2}\ddot\chi_\eta$ provided $\ddot\chi_\eta$ is restricted to be degenerate. The overall influence function is itself unique only up to projection onto the closed linear span in $L_2(P_\eta \times P_\eta)$ of all functions $\dot s_\eta$ and $\ddot s_\eta$.

The equality of the far left and right sides of Eqs. (14), (15) gives an alternative characterization of the overall influence function (at $\eta_0$) as a function such that the maps $\eta \mapsto \chi(\eta)$ and $\eta \mapsto P_\eta^2(\dot\chi_{\eta_0} + \tfrac{1}{2}\ddot\chi_{\eta_0})$ possess the same first and second order derivatives at $\eta_0$. Because the derivatives of a map $\phi$ on an open subset $H$ of a normed space are completely characterized by the derivatives of the maps $t \mapsto \phi(\eta_0 + th)$, for $h$ ranging over the space (“Gateaux derivatives”), we conclude that in the case of such parameter sets $H$ it suffices to consider linear paths $t \mapsto \eta_t = \eta_0 + th$. (The mixed second derivative $\phi''_{\eta_0}(g, h)$ can be recovered from $\phi''_{\eta_0}(g + h, g + h)$ and $\phi''_{\eta_0}(g - h, g - h)$ by “polarization”.) This is true more generally for parameter spaces $H$ defined by a linear constraint, but in the case of nonlinear constraints the use of curved paths is necessary.

The plug-in estimator (8) can be written $\chi(\hat\eta) + \mathbb{U}_n(\dot\chi_{\hat\eta} + \tfrac{1}{2}\ddot\chi_{\hat\eta})$. A definition of an efficient or canonical second order influence function should therefore refer to the variance of the U-statistic $\mathbb{U}_n(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta)$. Unlike in the linear case this does not translate into the variance of the influence function $\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta$ itself (except for $n = 2$ if $\dot\chi_\eta$ is interpreted as the symmetric function $(x_1, x_2) \mapsto \tfrac{1}{2}(\dot\chi_\eta(x_1) + \dot\chi_\eta(x_2))$). By Lemma 5, if $\ddot\chi_\eta$ is chosen degenerate and symmetric,

$n\,\mathrm{var}_\eta\, \mathbb{U}_n\big(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta\big) = P_\eta \dot\chi_\eta^2 + \frac{1}{2(n-1)} P_\eta^2 \ddot\chi_\eta^2.$

Thus the second order part adds a term of order $O(1/n)$ relative to the first order contribution. The norm of the function $\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta$ in $L_2(P_\eta \times P_\eta)$ is irrelevant, even though the inner product of this space determines the influence functions.

It is possible to resolve this discrepancy by working in the model with n observations. From the expansion

$\prod_{i=1}^n \frac{p_{\eta_t}}{p_\eta}(x_i) = \prod_{i=1}^n \Big(1 + t\,\dot g_\eta(x_i) + \tfrac{1}{2}t^2 \ddot g_\eta(x_i) + \cdots\Big) = 1 + t \sum_{i=1}^n \dot g_\eta(x_i) + t^2 \Big(\tfrac{1}{2}\sum_{i=1}^n \ddot g_\eta(x_i) + \sum\sum_{1 \le i < j \le n} \dot g_\eta(x_i)\,\dot g_\eta(x_j)\Big) + \cdots,$

we see that first and second order scores for the model $(P_\eta^n: \eta \in H)$ take the forms

$\dot s_\eta^{(n)} = \frac{\partial/\partial t\,\big|_{t=0}\,(p_{\eta_t} \times \cdots \times p_{\eta_t})}{p_\eta \times \cdots \times p_\eta} = n \mathbb{P}_n \dot g_\eta,$ (16)
$\ddot s_\eta^{(n)} = \frac{\partial^2/\partial t^2\,\big|_{t=0}\,(p_{\eta_t} \times \cdots \times p_{\eta_t})}{p_\eta \times \cdots \times p_\eta} = n \mathbb{P}_n \ddot g_\eta + n(n-1)\,\mathbb{U}_n(\dot g_\eta \times \dot g_\eta).$ (17)

Rather than in the form Eqs. (14), (15), the Eqs. (10), (11) that define the influence functions can then be written in the form

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta^n \big(\mathbb{U}_n(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta)\big)\, \dot s_\eta^{(n)},$ (18)
$\frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = P_\eta^n \big(\mathbb{U}_n(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta)\big)\, \ddot s_\eta^{(n)}.$ (19)

We conclude that the influence functions are determined by the inner products of the U-statistic $\mathbb{U}_n(\dot\chi_\eta + \tfrac{1}{2}\ddot\chi_\eta)$ in $L_2(P_\eta^n)$ with the score functions $\dot s_\eta^{(n)}$ and $\ddot s_\eta^{(n)}$. The influence functions that yield a minimal variance are found by projecting this U-statistic onto the closed linear span of these score functions. Thus it is natural to define the latter span as the second order tangent space of the model.

For computation in examples the defining Eq. (11) or (15) of a second order influence function can be tedious to work with. It is usually easier to apply the rule that a second derivative is the derivative of the first derivative. In the present situation this takes the following form (Pfanzagl 1985, 4.3.11): if $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$ is a function such that $x_2 \mapsto \ddot\chi_\eta(x_1, x_2)$ is a first order influence function of the parameter $\eta \mapsto \dot\chi_\eta(x_1)$, for every fixed $x_1$ and a first order influence function $\dot\chi_\eta$ (not necessarily degenerate), then $\ddot\chi_\eta$ is a second order influence function.

Lemma 1 Suppose that $(P_{\eta_t}: t \ge 0)$ is a sufficiently smooth submodel and $\dot\chi_{\eta_t}: \mathcal{X} \to \mathbb{R}$ and $\ddot\chi_{\eta_t}: \mathcal{X}^2 \to \mathbb{R}$ are measurable functions that satisfy

$\frac{d}{dt}\chi(\eta_t) = \int \dot\chi_{\eta_t}\, \frac{d}{dt} p_{\eta_t}\,d\mu, \quad (t \ge 0), \qquad \frac{d}{dt}\Big|_{t=0} \dot\chi_{\eta_t}(x_1) = \int \ddot\chi_\eta(x_1, x_2)\, \dot g_\eta(x_2)\,dP_\eta(x_2), \quad (x_1 \in \mathcal{X}).$

Then the function $\ddot\chi_\eta$ is a second order influence function, and so is the symmetrization of its orthogonal projection onto the degenerate functions in $L_2(P_\eta \times P_\eta)$.

Proof By differentiation of the first identity (under the integral) we see that

$\frac{d^2}{dt^2}\chi(\eta_t) = \int \frac{d}{dt}\dot\chi_{\eta_t}\, \frac{d}{dt} p_{\eta_t}\,d\mu + \int \dot\chi_{\eta_t}\, \frac{d^2}{dt^2} p_{\eta_t}\,d\mu.$

We evaluate this at $t = 0$ and substitute the second identity in the first term on the right to arrive at Eq. (11). The equation remains valid if $\ddot\chi_\eta$ is replaced by its projection and symmetrization.

Just as for first order influence functions there is no guarantee that a second order influence function exists. The difference is that, for the examples we are interested in, nonexistence of a second order influence function is typical. A first indication that this might happen is that the informal conclusion reached above, that the quadratic estimator (8) will have an error of the order $O_P(n^{-1/2}) + O_P(\|\eta - \hat\eta\|^3)$, is overly optimistic. In comparison to the linear estimator (3), this estimator would have reduced the dependence on the preliminary estimator $\hat\eta$ from $O_P(\|\eta - \hat\eta\|^2)$ to $O_P(\|\eta - \hat\eta\|^3)$, apparently without a serious penalty on the variance of the estimator. In our examples this does not occur, simply because a second order influence function does not exist.

As for the first order influence function, the nonexistence of the second order influence function may be caused by a lack of invertibility of the map $\eta \mapsto p_\eta$ or by failure of a von Mises type representation. The invertibility is again necessary, because we need representation of the derivatives of $\eta \mapsto \chi(\eta)$ in terms of the distribution $P_\eta$ of the observation. This is similar to the linear situation. The second cause for failure of representation also arose in the linear situation, but appears to arise in a much more serious way at the second order. Whereas a continuous, linear map $B: L_2(P_\eta) \to \mathbb{R}$ is always representable as an inner product $B(g) = P_\eta g \dot\chi_\eta$ for some function $\dot\chi_\eta$, a continuous, bilinear map $B: L_2(P_\eta) \times L_2(P_\eta) \to \mathbb{R}$ is not necessarily representable through a measurable function $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$, in the form

$B(g, h) = \int\int g(x_1)\, \ddot\chi_\eta(x_1, x_2)\, h(x_2)\,dP_\eta(x_1)\,dP_\eta(x_2).$ (20)

It can be shown that a continuous, bilinear map can always be written in the form $B(g, h) = \int g (Ah)\,dP_\eta$ for a continuous, linear operator $A: L_2(P_\eta) \to L_2(P_\eta)$, but the latter operator is not necessarily a kernel operator in that $Ah(x_1) = \int \ddot\chi_\eta(x_1, x_2)\, h(x_2)\,dP_\eta(x_2)$ for some kernel $\ddot\chi_\eta$. The latter representation is necessary for the von-Mises representation (7) of the second derivative.

Failure of existence of $\ddot\chi_\eta$ does not mean that the idea to use a quadratic expansion for improved estimation is not fruitful. Failure does mean that we cannot construct the estimator (8) and the estimation rate $O_P(n^{-1/2}) + O_P(\|\hat\eta - \eta\|^3)$ may not be attainable. However, we may return to Eq. (6) and try to estimate the quadratic term as well as possible, and still improve on the linear estimator. A key observation is that a bilinear map on a finite-dimensional subspace $L \times L \subset L_2(P_\eta) \times L_2(P_\eta)$ is always representable by a kernel.

Lemma 2 If $L \subset L_2(P_\eta)$ is a finite-dimensional subspace and $B: L \times L \to \mathbb{R}$ is continuous and bilinear, then there exists a function $\ddot\chi_\eta \in L_2(P_\eta \times P_\eta)$ such that (20) holds for every $g, h \in L$.

Proof For an arbitrary orthonormal basis $e_1, \ldots, e_k$ of $L$ we can express an element $g \in L$ as $g = \sum_{i=1}^k \langle g, e_i\rangle_\eta\, e_i$, for $\langle\cdot, \cdot\rangle_\eta$ the inner product of $L_2(P_\eta)$. By bilinearity

$B(g, h) = \sum_{i=1}^k \sum_{j=1}^k \langle g, e_i\rangle_\eta \langle h, e_j\rangle_\eta\, B(e_i, e_j) = \int\int g(x_1)\, h(x_2) \sum_{i=1}^k \sum_{j=1}^k B(e_i, e_j)\, e_i(x_1)\, e_j(x_2)\,dP_\eta(x_1)\,dP_\eta(x_2).$

Thus the function $(x_1, x_2) \mapsto \sum_{i=1}^k \sum_{j=1}^k B(e_i, e_j)\, e_i(x_1)\, e_j(x_2)$ is a kernel for the map $B$.

If the invertibility of $\eta \mapsto p_\eta$ can be resolved, we can therefore always represent the second derivative in Eq. (6) at differences $\eta - \hat\eta$ within a given finite-dimensional linear space. The estimator (8) based on the resulting “partial second order influence function” will then add a representation error to the remainder $O_P(\|\hat\eta - \eta\|^3)$. This representation error can be made arbitrarily small by choosing the finite-dimensional linear space sufficiently large. However, the corresponding partial influence functions depend on the approximating linear spaces, the estimator now having the form

$\chi(\hat\eta) + \mathbb{P}_n \dot\chi_{\hat\eta} + \tfrac{1}{2}\mathbb{U}_n \ddot\chi_{L,\hat\eta},$ (21)

where $\ddot\chi_{L,\eta}$ is a partial second order influence function based on an approximating space $L$. To obtain a good estimator we must balance the representation error, the remainder $O(\|\hat\eta - \eta\|^3)$, and the variance of the estimator. In an asymptotic framework we let the approximating space $L$ increase to the full space as $n \to \infty$. We shall see that this may cause the variance of $\mathbb{U}_n \ddot\chi_{L,\hat\eta}$ to dominate the variance of the linear term $\mathbb{P}_n \dot\chi_{\hat\eta}$, so that the overall variance may be bigger than $O(1/n)$. However, by proper balancing of the three terms we never do worse than the linear estimator (3), and we gain over it if the parameter set $H$ is large.

4 Estimating the square of a density

Consider the problem of estimating the functional $\chi(p) = \int p^2$ based on a random sample of size $n$ from the density $p$. This problem was discussed in, among others, Bickel and Ritov (1988) and Laurent (1996, 1997). We shall rederive the estimator of Laurent (1996) through our general approach.

As the underlying model $\mathcal{P}$ we use a set of densities that is restricted only qualitatively, for instance a Hölder space of functions on the unit cube in $\mathbb{R}^d$. We parameterize this model by the density itself, which we denote by $p$ (hence $p_\eta = \eta = p$). The tangent space of the model can then be taken equal to the set of all mean zero functions $\dot g_p: \mathcal{X} \to \mathbb{R}$ in $L_2(P)$, and the first order influence function takes the form

$\dot\chi_p(x) = 2\big(p(x) - \chi(p)\big).$ (22)

To see this, it suffices to note that this function is mean-zero (i.e., degenerate) and satisfies

$\frac{d}{dt}\Big|_{t=0} \chi(p_t) = 2\int p_t\, \dot p_t\,d\mu\Big|_{t=0} = P\, 2p\, \dot g_p = P \dot\chi_p \dot g_p,$

for any sufficiently regular path $t \mapsto p_t$ with $p_0 = p$ and score function $\dot g_p = \dot p_0/p_0$ at $t = 0$. This first order influence function exists without making assumptions on $p$ or $\mathcal{P}$.

We compute a second order influence function as the influence function of the functional $p \mapsto 2p(x_1)$, which is the first order influence function up to centering. This entails point evaluation at a fixed point $x_1$, which, unfortunately, is not a differentiable functional in the sense of possessing an influence function. For any sufficiently regular path $t \mapsto p_t$ with score function $\dot g_p$,

$\frac{d}{dt}\Big|_{t=0} p_t(x_1) = \dot g_p(x_1)\, p(x_1).$

Existence of an influence function of the functional $p \mapsto p(x_1)$ would require the map $g \mapsto g(x_1)p(x_1)$ to be representable as an inner product in $L_2(P)$ on the tangent space. Such a representation is not possible (unless $p$ has finite support), because the map is not continuous relative to the $L_2(P)$-norm.

Thus we content ourselves with a partial representation of the second derivative. To this aim it is useful to think of the point evaluation map as integration against the Dirac measure (at $x_1$). Full representation of the functional $g \mapsto g(x_1)p(x_1)$ would be possible if there existed a function $\Pi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$g(x_1)\, p(x_1) = \int \Pi(x_1, x_2)\, g(x_2)\,d\mu(x_2).$ (23)

If this were true for every function $g$, then the measure $B \mapsto \int_B \Pi(x_1, x_2)\,d\mu(x_2)$ would, for each fixed $x_1$, act as a Dirac measure at $x_1$. In other words, the desired, but nonexistent, function $\Pi$ would be a “Dirac measure” on the diagonal of $\mathcal{X} \times \mathcal{X}$. Our second best is a function for which Eq. (23) is true, if not for all, then for a large collection of $g$. The kernel $\Pi$ of a projection operator $\Pi: L_2(\mu) \to L_2(\mu)$ onto a (large) subspace is a candidate, because it satisfies the display whenever $gp$ is in the subspace: if $\Pi f(x_1) = \int \Pi(x_1, x_2)\, f(x_2)\,d\mu(x_2)$, then the equation $gp = \Pi(gp)$, which is valid for every $gp$ in the range of the projection, gives the preceding display.

Lemma 3 An orthogonal projection $\Pi: L_2(\mu) \to L \subset L_2(\mu)$ onto a finite-dimensional subspace $L$ can be represented as $\Pi f(x_1) = \int \Pi(x_1, x_2)\, f(x_2)\,d\mu(x_2)$ for the kernel function $\Pi(x_1, x_2) = \sum_{i=1}^k e_i(x_1)\, e_i(x_2)$ and $e_1, \ldots, e_k$ an orthonormal basis of $L$. This kernel satisfies $\int \Pi^2\,d\mu \times \mu = k$.

Proof We have $\Pi f(x_1) = \sum_{i=1}^k \langle f, e_i\rangle_\mu\, e_i(x_1)$ for $\langle f, e_i\rangle_\mu = \int f e_i\,d\mu$. The representation follows by exchanging the order of summation and integration.

The square of the kernel is $\sum_{i=1}^k \sum_{j=1}^k e_i(x_1)\, e_j(x_1)\, e_i(x_2)\, e_j(x_2)$. By the orthonormality of the basis $(e_i)$ the (double) integral of each off-diagonal term ($i \ne j$) vanishes and the double integral of each diagonal term is equal to 1. Thus the double integral is $k$.
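As a numerical illustration of Lemma 3 (with the cosine basis on [0, 1], our arbitrary choice), the kernel and the identity $\int \Pi^2\,d\mu \times \mu = k$ can be checked directly:

```python
import numpy as np

def e(i, x):
    """Cosine basis on [0, 1], orthonormal in L2(Lebesgue)."""
    return np.ones_like(x) if i == 0 else np.sqrt(2) * np.cos(i * np.pi * x)

def Pi(x1, x2, k):
    """Projection kernel of Lemma 3: Pi(x1, x2) = sum_{i<k} e_i(x1) e_i(x2)."""
    return sum(e(i, x1) * e(i, x2) for i in range(k))

k = 5
grid = (np.arange(400) + 0.5) / 400          # midpoint quadrature rule on [0, 1]
G = Pi(grid[:, None], grid[None, :], k)      # kernel evaluated on the grid
print((G ** 2).mean())                       # double integral of Pi^2: approximately k
```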

We also arrive at a projection operator from the formula $\chi''_p(g, h) = 2\int g h\, p^2\,d\mu$ for the second derivative of $\chi$. We can write this in the form $\chi''_p(g, h) = 2\int g (A_p h)\,dP$ for the operator $A_p: L_2(P) \to L_2(P)$ given by $A_p h = hp$. The operator $A_p$ is not of kernel form, but we can approximate it by $\Pi A_p$, leading to the approximation $2\int g(\Pi A_p h)\,dP = 2\int gp\, \Pi(hp)\,d\mu$ for $\chi''_p(g, h)$.

For a given orthonormal basis $e_1, e_2, \ldots$ of $L_2(\mu)$ we take the kernel $\Pi(x_1, x_2)$ of the projection onto the span of the first $k$ elements, given by Lemma 3, as a “partial” influence function of the functional $p \mapsto p(x_1)$, and $x_2 \mapsto 2\Pi(x_1, x_2)$ as a “partial” influence function of the functional $p \mapsto \dot\chi_p(x_1)$. The projection of this function onto the degenerate functions is

$\ddot\chi_p(x_1, x_2) = 2\Pi(x_1, x_2) - 2\Pi p(x_1) - 2\Pi p(x_2) + 2\int (\Pi p)^2\,d\mu.$ (24)

The quadratic estimator (8), given the initial estimator $\hat p$, takes the form

$\chi(\hat p) + \mathbb{P}_n \dot\chi_{\hat p} + \tfrac{1}{2}\mathbb{U}_n \ddot\chi_{\hat p} = \mathbb{U}_n \Pi + 2\mathbb{P}_n\big((I - \Pi)\hat p\big) - \int \big((I - \Pi)\hat p\big)^2\,d\mu = \mathbb{U}_n\Big(\sum_{i=1}^k e_i \times e_i\Big) + 2\mathbb{P}_n\Big(\sum_{i=k+1}^\infty \hat\theta_i e_i\Big) - \sum_{i=k+1}^\infty \hat\theta_i^2,$

for $\hat\theta_i = \int \hat p\, e_i\,d\mu$ the Fourier coefficients of $\hat p$. If we choose the initial estimator to take its values in the range of $\Pi$, then $\hat\theta_i = 0$ for $i > k$ and all terms after the first vanish. The resulting estimator reduces to the estimator considered by Laurent (1996, 1997), who showed that the estimator is minimax if $p$ is a-priori known to belong to a multiple of the unit ball in the Hölder space $C^\beta[0, 1]$ of regularity $\beta$ and $(e_i)$ is a basis suited to this a-priori model. In fact, mean and variance of the estimator satisfy, with $\theta_i$ the Fourier coefficients of $p$,

$E_p\, \mathbb{U}_n\Big(\sum_{i=1}^k e_i \times e_i\Big) = P^2 \sum_{i=1}^k e_i \times e_i = \sum_{i=1}^k \theta_i^2, \qquad \mathrm{var}_p\, \mathbb{U}_n\Big(\sum_{i=1}^k e_i \times e_i\Big) \lesssim \frac{4}{n} P(\Pi p)^2 + \frac{2k}{n(n-1)}.$

The bound on the variance follows from Lemma 6 (combined with Lemma 3 and the boundedness of $p$). If it is a-priori known that $\sum_{i=1}^\infty \theta_i^2 i^{2\beta} < \infty$, then the bias is bounded by $\sum_{i > k} \theta_i^2 \lesssim k^{-2\beta}$. The square bias is balanced against the variance if $k$ is chosen of the order $k_n = n^{2/(4\beta+1)}$ if $\beta \le 1/4$ and $k_n = n$ if $\beta \ge 1/4$. The resulting rate of convergence is $n^{-4\beta/(4\beta+1)}$ if $\beta \le 1/4$ and $n^{-1/2}$ if $\beta \ge 1/4$. In Robins et al. (2007) it is shown that the estimator is also asymptotically normal.
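The reduced estimator $\mathbb{U}_n(\sum_{i \le k} e_i \times e_i)$ is simple to implement. The following self-contained simulation (our illustration; the cosine basis and the Beta(2, 3) sampling density are arbitrary choices) uses the identity $n(n-1)\,\mathbb{U}_n(e \times e) = (\sum_j e(X_j))^2 - \sum_j e(X_j)^2$ for each basis function:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(1)
n, k = 2000, 50
X = rng.beta(2, 3, size=n)          # sample from p(x) = 12 x (1-x)^2 on [0, 1]

def e(i, x):                        # orthonormal cosine basis of L2[0, 1]
    return np.ones_like(x) if i == 0 else np.sqrt(2) * np.cos(i * np.pi * x)

est = 0.0
for i in range(k):                  # U_n(sum_{i<=k} e_i x e_i), term by term
    v = e(i, X)
    est += (v.sum() ** 2 - (v ** 2).sum()) / (n * (n - 1))

truth = 144 * factorial(2) * factorial(4) / factorial(7)   # int p^2 = 144 B(3, 5)
print(est, "vs truth", truth)
```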

5 Estimating the mean response in missing data models

Suppose that a typical observation is distributed as $X = (YA, A, Z)$ for $Y$ and $A$ taking values in the two-point set $\{0, 1\}$ and conditionally independent given $Z$. We think of $Y$ as a response variable, which is observed only if the indicator $A$ takes the value 1. The covariate $Z$ is chosen such that it contains all information on the dependence between response and missingness indicator (missing at random). Alternatively, we think of $Y$ as a counterfactual outcome if a treatment were given ($A = 1$) and estimate (half) the treatment effect under the assumption of “no unmeasured confounders”. Both applications may require that $Z$ is high-dimensional (e.g., of dimension 10), and there is typically insufficient a-priori information to model the dependence of $A$ and $Y$ on $Z$.

The model can be parameterized by the marginal density $f$ of $Z$ (relative to some dominating measure $\nu$) and the probabilities $b(z) = P(Y = 1 \mid Z = z)$ and $a(z)^{-1} = P(A = 1 \mid Z = z)$. (We use $a$ for the inverse probability, because this simplifies later formulas.) Thus the density $p_\eta$ of an observation $X$ is described by the triple $\eta = (a, b, f)$.

We wish to estimate the mean response $EY$, i.e., the parameter

$\chi(\eta) = \int b f\,d\nu.$

Estimators that are $n^{-1/2}$-consistent and asymptotically efficient in the semiparametric sense have been constructed using a variety of methods (e.g., Robins and Rotnitzky 1992; van der Laan and Robins 2003; van der Vaart 1998), but only if $a$ or $b$, or both, parameters are restricted to sufficiently small regularity classes. For instance, if the covariate ranges over a compact, convex subset $\mathcal{Z}$ of $\mathbb{R}^d$, then the mentioned papers provide $n^{-1/2}$-consistent estimators under the assumption that $a$ and $b$ belong to Hölder classes $C^\alpha(\mathcal{Z})$ and $C^\beta(\mathcal{Z})$ with $\alpha$ and $\beta$ large enough that

$\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} \ge \frac{1}{2}.$ (25)

For moderate to large dimensions $d$ this is a restrictive requirement. We shall show that a quadratic estimator of the type (8) can attain a $n^{-1/2}$-rate in a bigger model, and attains a strictly better rate than the usual estimators if the $n^{-1/2}$-rate is not obtainable.

Preliminary estimators

The parameter $1/a(z) = E(A \mid Z = z)$ is the regression of $A$ on $Z$ and hence can be estimated by any nonparametric regression estimator, such as a kernel or a truncated series estimator. Similarly, the function $b(z) = P(Y = 1 \mid Z = z, A = 1)$ is the regression of the observed $Y$ on $Z$ and can be estimated by nonparametric regression of the subsample $(Y_i: A_i = 1)$ on the corresponding $Z_i$. We shall see below that the parameter $f/a$ is more fundamental than the parameter $f$. By Bayes’ rule $(f/a)(z) = P(A = 1 \mid Z = z)\, f(z)$ is $P(A = 1)$ times the conditional density of $Z$ given $A = 1$. Therefore, we may estimate $f/a$ by a nonparametric density estimator based on the subsample $(Z_i: A_i = 1)$ times $n^{-1}\sum_{i=1}^n A_i$.
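For instance (a sketch under the simplifying assumptions of a scalar covariate $Z$ in [0, 1] and a cosine series; any nonparametric regression method would do), truncated-series regression yields such preliminary estimators:

```python
import numpy as np

def cos_basis(i, z):
    return np.ones_like(z) if i == 0 else np.sqrt(2) * np.cos(i * np.pi * z)

def series_regression(Z, T, k):
    """Truncated cosine-series estimate of the regression E(T | Z = z),
    for scalar Z in [0, 1]; returns the fit as a callable function."""
    E = np.stack([cos_basis(i, Z) for i in range(k)], axis=1)
    coef, *_ = np.linalg.lstsq(E, T, rcond=None)
    return lambda z: np.stack([cos_basis(i, z) for i in range(k)], axis=1) @ coef

# 1/a: regression of A on Z;  b: regression of Y on Z within the subsample A == 1
# inv_a_hat = series_regression(Z, A, k)
# b_hat     = series_regression(Z[A == 1], Y[A == 1], k)
```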

Tangent space and first order influence function

The one-dimensional submodels $t \mapsto p_{\eta_t}$ induced by paths of the form $a_t = a + t\alpha$, $b_t = b + t\beta$, and $f_t = f(1 + t\phi)$ for given directions $\alpha$, $\beta$ and $\phi$ yield scores $B_\eta(\alpha, \beta, \phi) = B_\eta^a \alpha + B_\eta^b \beta + B_\eta^f \phi$, for $B_\eta^a$, $B_\eta^b$, $B_\eta^f$ the score operators for the three parameters, given by

$B_\eta^a \alpha(X) = -\frac{A a(Z) - 1}{a(Z)(a - 1)(Z)}\,\alpha(Z), \quad a\text{-score}, \qquad B_\eta^b \beta(X) = \frac{A(Y - b(Z))}{b(Z)(1 - b)(Z)}\,\beta(Z), \quad b\text{-score}, \qquad B_\eta^f \phi(X) = \phi(Z), \quad f\text{-score}.$

The first-order influence function is well known to take the form

$\dot\chi_\eta(X) = A a(Z)\big(Y - b(Z)\big) + b(Z) - \chi(\eta).$ (26)

To see this it must be verified that this function satisfies, for every path $t \mapsto p_{\eta_t}$ as described previously,

$\frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = E_\eta\, \dot\chi_\eta(X)\, B_\eta(\alpha, \beta, \phi)(X).$

For the paths $a_t = a + t\alpha$, $b_t = b + t\beta$ and $f_t = f(1 + t\phi)$ the left side of this equation is $\int (\beta + b\phi)\, f\,d\nu$. The right side can easily be evaluated to be the same, where it may be noted that conditional expectations of functions of $Y$ and $A$ given $Z$ factorize, with $E(Y - b(Z) \mid Z) = E(A a(Z) - 1 \mid Z) = 0$ and $E\big((Y - b(Z))^2 \mid Z\big) = b(1 - b)(Z)$.

The advantage of parameterizing by the inverse probability $a$ is clear from the form of the (random part of the) influence function, which is bilinear in $(a, b)$. The error of the corresponding von-Mises representation can be computed to be, for a given initial estimator $\hat\eta = (\hat a, \hat b, \hat f)$,

$\chi(\hat\eta) - \chi(\eta) + P_\eta \dot\chi_{\hat\eta} = -\int (\hat a - a)(\hat b - b)\, \frac{f}{a}\,d\nu.$ (27)

This is quadratic in the errors of the initial estimators. Actually, the form of the bias term is special in that squared estimation errors of the two initial estimators $\hat a$ and $\hat b$ do not arise, but only the product of their errors. This property, termed “double robustness” in Rotnitzky and Robins (1995), Robins and Rotnitzky (2001), van der Laan and Robins (2003), means that it suffices that one of the two parameters is estimated well. A prior assumption that the parameters $a$ and $b$ are $\alpha$ and $\beta$ regular, respectively, would allow estimation errors with rates $n^{-\alpha/(2\alpha+d)}$ and $n^{-\beta/(2\beta+d)}$. If the product of these rates is $o(n^{-1/2})$, then the bias term is negligible, and the linear estimator (3) attains a rate $n^{-1/2}$. This leads to the condition (25). If this condition fails, then the “bias” (27) is of larger order than $n^{-1/2}$. The linear estimator then does not balance bias and variance and is suboptimal.
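In code the linear estimator (3) built from (26) is a one-line average. The toy check below (our illustration; here $a$ and $b$ are plugged in exactly, so the bias term (27) vanishes) recovers $EY$:

```python
import numpy as np

def linear_mean_response(YA, A, Z, a_hat, b_hat):
    """Estimator (3) with influence function (26): P_n [A a(Z)(Y - b(Z)) + b(Z)].
    YA is the observed product A*Y, a_hat(z) estimates 1/P(A=1|Z=z), and
    b_hat(z) estimates b(z); the estimator is doubly robust in (a_hat, b_hat)."""
    return np.mean(a_hat(Z) * (YA - A * b_hat(Z)) + b_hat(Z))

rng = np.random.default_rng(2)
n = 20000
Z = rng.uniform(size=n)
A = rng.binomial(1, 0.5 + 0.4 * Z)     # P(A=1|Z) = 0.5 + 0.4 Z, so a = 1/(0.5 + 0.4 Z)
Y = rng.binomial(1, Z)                 # b(z) = z, hence EY = 1/2
print(linear_mean_response(Y * A, A, Z, lambda z: 1 / (0.5 + 0.4 * z), lambda z: z))
```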

It may be noted that the marginal density f does not enter the first order influence function. Even though the functional depends on f, a rate on the initial estimator of this function is not needed for the construction of the first order estimator. This will be different at second order.

Quadratic estimator

We proceed to the computation of a second order influence function using Lemma 1, by searching for a function $\ddot\chi_\eta: \mathcal{X}^2 \to \mathbb{R}$ such that, for every $x_1 = (y_1 a_1, a_1, z_1)$, and all directions $\alpha$, $\beta$, $\phi$,

$a_1\big(y_1 - b(z_1)\big)\,\alpha(z_1) - \big(a_1 a(z_1) - 1\big)\,\beta(z_1) = \frac{d}{dt}\Big|_{t=0}\big[\dot\chi_{\eta_t}(x_1) + \chi(\eta_t)\big] = E_\eta\, \ddot\chi_\eta(x_1, X_2)\, B_\eta(\alpha, \beta, \phi)(X_2).$ (28)

Here the expectation is relative to the variable $X_2$ only. Let $K_\eta: \mathcal{Z}^2 \to \mathbb{R}$ be the kernel of an operator $K_\eta: L_2(f) \to L_2(f)$ (i.e., $K_\eta g(z_1) = \int K_\eta(z_1, z_2)\, g(z_2)\, f(z_2)\,d\nu(z_2)$), and define

$\ddot\chi_\eta(X_1, X_2) = -A_1\big(Y_1 - b(Z_1)\big)\, a(Z_2)\big(A_2 a(Z_2) - 1\big)\, K_\eta(Z_1, Z_2) - \big(A_1 a(Z_1) - 1\big)\, a(Z_2) A_2\big(Y_2 - b(Z_2)\big)\, K_\eta(Z_1, Z_2).$ (29)

For this choice the right side of Eq. (28) can be seen to reduce to

$a_1\big(y_1 - b(z_1)\big)\, K_\eta\alpha(z_1) - \big(a_1 a(z_1) - 1\big)\, K_\eta\beta(z_1).$

(Note that $\mathrm{var}(A a(Z) \mid Z) = a(Z) - 1$.) Thus the choice Eq. (29) of $\ddot\chi_\eta$ satisfies Eq. (28) for every $(\alpha, \beta, \phi)$ such that $K_\eta\alpha = \alpha$ and $K_\eta\beta = \beta$. Were $K_\eta$ equal to the identity operator, then Eq. (28) would be satisfied for every $(\alpha, \beta, \phi)$, and an exact second order influence function would exist. Unfortunately, the identity operator is not given by a kernel. As in Sect. 4 we have to be satisfied with an influence function that gives partial representation.

To ensure that $\ddot\chi_\eta$ is symmetric we choose $K_\eta(z_1, z_2) = \Pi_\eta(z_1, z_2)/a(z_2)$ for $\Pi_\eta$ a symmetric function. Specifically, we choose $\Pi_\eta$ the kernel of an orthogonal projection $\Pi_\eta: L_2(f/a) \to L_2(f/a)$ onto a space $L$. The corresponding operators then (trivially) satisfy $K_\eta g = \Pi_\eta g$ for every $g \in L_2(f/a)$, and hence $K_\eta$ will approximate the identity if $L$ is large. The function (29) that results from this choice can be seen to be both symmetric and degenerate, and hence is a candidate “approximate” influence function. If $S_2$ symmetrizes a function of two variables (i.e., $2 S_2 g(X_1, X_2) = g(X_1, X_2) + g(X_2, X_1)$), then this influence function can be written as

$\ddot\chi_\eta(X_1, X_2) = -2 S_2\big[A_1\big(Y_1 - b(Z_1)\big)\, \Pi_\eta(Z_1, Z_2)\, \big(A_2 a(Z_2) - 1\big)\big].$ (30)

For an initial estimator $\hat\eta$ based on independent observations we now construct the estimator (8).
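A sketch of the second order term $\tfrac{1}{2}\mathbb{U}_n \ddot\chi_{\hat\eta}$, under the sign convention of Eqs. (29)–(30) as reconstructed here; the projection kernel is supplied as an input, and the symmetrization $S_2$ is automatic because $\mathbb{U}_n$ averages over ordered pairs:

```python
import numpy as np

def quadratic_correction(YA, A, Z, a_hat, b_hat, Pi):
    """(1/2) U_n chi_ddot from (30): minus U_n of the kernel
    A_1 (Y_1 - b_hat(Z_1)) Pi(Z_1, Z_2) (A_2 a_hat(Z_2) - 1).
    Pi(z1, z2) stands for the kernel of the k-dimensional projection
    in L2(f/a); here it is simply assumed given and vectorized."""
    n = len(Z)
    res_b = YA - A * b_hat(Z)             # A_i (Y_i - b_hat(Z_i))
    res_a = A * a_hat(Z) - 1.0            # A_j a_hat(Z_j) - 1
    M = np.outer(res_b, res_a) * Pi(Z[:, None], Z[None, :])
    np.fill_diagonal(M, 0.0)              # the U-statistic excludes i == j
    return -M.sum() / (n * (n - 1))
```

Adding this correction to the linear estimator gives the estimator $\hat\chi_n$ of Theorem 1 below.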

Let $\hat E$ and $\hat{\mathrm{var}}$ denote conditional expectation and variance given the observations used to construct $\hat\eta$, and let $\|\cdot\|_2$ be the norm of $L_2(f/a)$. Assume that the true functions $a$, $f$ and the estimators $\hat a$, $\hat f$ are bounded away from 0 and $\infty$.

Theorem 1 The estimator $\hat\chi_n = \chi(\hat\eta) + \mathbb{P}_n \dot\chi_{\hat\eta} + \tfrac{1}{2}\mathbb{U}_n \ddot\chi_{\hat\eta}$ with (approximate) influence functions $\dot\chi_\eta$ and $\ddot\chi_\eta$ defined by (26) and (30) with $\Pi_\eta$ the kernel of an orthogonal projection in $L_2(f/a)$ onto a $k$-dimensional linear subspace satisfies

$\hat E_\eta \hat\chi_n - \chi(\eta) = O_P\Big(\|\hat a - a\|_2\, \|\hat b - b\|_2\, \Big\|\frac{\hat f}{\hat a} - \frac{f}{a}\Big\|_2\Big) + O_P\big(\|a - \Pi_\eta a\|_2\, \|b - \Pi_\eta b\|_2\big), \qquad \hat{\mathrm{var}}_\eta\, \hat\chi_n = O_P\Big(\frac{k}{n^2} \vee \frac{1}{n}\Big).$

Proof From Eqs. (27) and (30) we have

$\hat E \hat\chi_n - \chi(\eta) = -\int (\hat a - a)(\hat b - b)\, \frac{f}{a}\,d\nu - \hat E\, A_1\big(Y_1 - \hat b(Z_1)\big)\, \Pi_{\hat\eta}(Z_1, Z_2)\, \big(A_2 \hat a(Z_2) - 1\big) = -\int (\hat a - a)(\hat b - b)\, \frac{f}{a}\,d\nu + \int\int \big((\hat a - a) \times (\hat b - b)\big)\, \Pi_{\hat\eta}\, \Big(\frac{f}{a} \times \frac{f}{a}\Big)\,d\nu \times \nu.$

The double integral on the far right with $\Pi_{\hat\eta}$ replaced by $\Pi_\eta$ can be written as the single integral $\int (\hat a - a)\, \Pi_\eta(\hat b - b)\, \frac{f}{a}\,d\nu$. Added to the first integral on the right this gives

$-\int (\hat a - a)\, (I - \Pi_\eta)(\hat b - b)\, \frac{f}{a}\,d\nu.$

By the Cauchy-Schwarz inequality this is bounded in absolute value by the second term in the upper bound for the bias.

Replacement of $\Pi_{\hat\eta}$ by $\Pi_\eta$ in the double integral gives a difference

$\int\int \big((\hat a - a) \times (\hat b - b)\big)\, \big(\Pi_{\hat\eta} - \Pi_\eta\big)\, \Big(\frac{f}{a} \times \frac{f}{a}\Big)\,d\nu \times \nu = \int (\hat a - a)\, \Big(\Pi_{\hat\eta}\Big((\hat b - b)\, \frac{f/a}{\hat f/\hat a}\Big) - \Pi_\eta(\hat b - b)\Big)\, \frac{f}{a}\,d\nu.$

By the Cauchy-Schwarz inequality the absolute value of this is bounded above by

$\|\hat a - a\|_2\, \big\|(\Pi_{\hat\eta} M_{\hat w} - \Pi_\eta)(\hat b - b)\big\|_{2,\hat\nu}\, \|\hat w\|_\infty^{1/2}.$

Here $M_{\hat w}$ is multiplication by the function $\hat w = (f/a)/(\hat f/\hat a)$ (defined by $M_{\hat w} g = g\hat w$), and $\|\cdot\|_{2,\hat\nu}$ is the $L_2(\hat\nu)$-norm for the measure $\hat\nu$ defined by $d\hat\nu = (\hat f/\hat a)\,d\nu$. Considering $\Pi_{\hat\eta}$ as the projection in $L_2(\hat\nu)$ with weight 1, and $\Pi_\eta$ as the weighted projection in $L_2(\hat\nu)$ with weight function $\hat w$, we can apply Lemma 4 to the middle term and conclude that this is bounded in absolute value by $\|\Pi_\eta(\hat b - b)\|_{2,\hat\nu}\, \|\hat w - 1\|_\infty \lesssim \|\hat b - b\|_{2,\hat\nu}\, \|\hat w - 1\|_\infty$. Because we assume that the functions $f/a$ and $\hat f/\hat a$ are bounded away from zero and infinity, this can be seen to yield the first term in the upper bound on the bias.

The function $\dot\chi_{\hat\eta}$ is uniformly bounded and hence the (conditional) variance of $\mathbb{P}_n \dot\chi_{\hat\eta}$ is of the order $O_P(1/n)$. Thus for the variance bound it suffices to consider the (conditional) variance of $\mathbb{U}_n \ddot\chi_{\hat\eta}$. In view of Lemma 6 this is bounded above by

$\frac{4}{n} E_\eta\Big(E_\eta\big(\ddot\chi_{\hat\eta}(X_1, X_2) \mid X_1\big)\Big)^2 + \frac{2}{n(n-1)} E_\eta\, \ddot\chi_{\hat\eta}^2(X_1, X_2).$

The variables $A(Y - \hat b(Z))$ and $(A\hat a(Z) - 1)$ are uniformly bounded. Hence the last term on the right is bounded above by a multiple of $n^{-2}\int\int \Pi_{\hat\eta}^2\, \big(\frac{f}{a} \times \frac{f}{a}\big)\,d\nu \times \nu$, which is bounded by $\|\hat w\|_\infty^2\, k/n^2$, by Lemma 3. The first term is of the order $O(1/n)$. To see this we first note that

$E_\eta\big(\ddot\chi_{\hat\eta}(X_1, X_2) \mid X_1\big) = -A_1\big(Y_1 - \hat b(Z_1)\big)\, \Pi_{\hat\eta}\big((\hat a - a)\hat w\big)(Z_1) + \big(A_1 \hat a(Z_1) - 1\big)\, \Pi_{\hat\eta}\big((\hat b - b)\hat w\big)(Z_1).$

Here the variables $A_1(Y_1 - \hat b(Z_1))$ and $(A_1 \hat a(Z_1) - 1)$ are uniformly bounded, and the second moment of $\Pi_{\hat\eta} g$ is bounded by a multiple of $\|\hat w\|_\infty$ times the second moment of $g$ in $L_2(\hat\nu)$, for every $g$.

Conclusion

Assume that the parameters $a$, $b$ and $f/a$ are known to be “regular” of degrees $\alpha$, $\beta$ and $\phi$, respectively, in the sense that there exists a sequence of $k$-dimensional linear spaces $L_k$ such that, for some constant $C$,

$\|a - L_k\|_2 \le C\Big(\frac{1}{k}\Big)^{\alpha/d}, \qquad \|b - L_k\|_2 \le C\Big(\frac{1}{k}\Big)^{\beta/d}, \qquad \Big\|\frac{f}{a} - L_k\Big\|_2 \le C\Big(\frac{1}{k}\Big)^{\phi/d},$

where $\|g - L_k\|_2$ denotes the distance of $g$ to the space $L_k$.

This is true, for instance, if the functions a, b and f/a are defined on a compact, convex domain in Rd and are known to belong to Hölder (or Besov) spaces of functions of smoothness α, β and ϕ. The approximation is then valid even with the uniform norm on the left side, where the spaces Lk can be taken to be generated by polynomials, splines or wavelets.

In this case there also exist estimators $\hat a$, $\hat b$ and $\hat f/\hat a$ that achieve convergence rates $n^{-\alpha/(2\alpha+d)}$, $n^{-\beta/(2\beta+d)}$ and $n^{-\phi/(2\phi+d)}$, respectively, uniformly over these a-priori models. Then the estimator $\hat\chi_n$ of Theorem 1 attains the square rate of convergence

$\frac{k}{n^2} \vee \frac{1}{n} \vee \Big(\frac{1}{n}\Big)^{\frac{2\alpha}{2\alpha+d} + \frac{2\beta}{2\beta+d} + \frac{2\phi}{2\phi+d}} \vee \Big(\frac{1}{k}\Big)^{(2\alpha+2\beta)/d}.$

The optimal value of $k$ balances the first and fourth terms and is of the order $k \sim n^{2d/(d+2\alpha+2\beta)}$. The resulting rate is $n^{-\gamma}$ for

$\gamma = \Big(\frac{1}{2}\Big) \wedge \Big(\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} + \frac{\phi}{2\phi+d}\Big) \wedge \Big(\frac{2\alpha+2\beta}{d+2\alpha+2\beta}\Big).$

This reduces to the rate $n^{-1/2}$ under condition (25), but also if $(\alpha + \beta)/2 \ge d/4$ and $\phi$ is sufficiently large:

$\frac{\phi}{2\phi+d} \ge \frac{1}{2} - \frac{\alpha}{2\alpha+d} - \frac{\beta}{2\beta+d}.$

(In this case we can also choose $k = n$ independent of $\alpha$ and $\beta$.) In case the rate $n^{-\gamma}$ is slower than $n^{-1/2}$, it is still better than the rate $n^{-\alpha/(2\alpha+d)-\beta/(2\beta+d)}$ obtained by the linear estimator (3).

Thus the quadratic estimator outperforms the linear estimator.
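As a quick numeric illustration of this comparison (our sketch; the exponents are read off from the displays above), with $\alpha = \beta = 1$, $\phi = 2$ and $d = 10$ the quadratic estimator attains $\gamma \approx 0.286$, while the linear estimator attains only about $0.167$:

```python
def rate_exponents(alpha, beta, phi, d):
    """Rate exponents gamma in n^{-gamma}: the quadratic estimator's gamma from
    the display above, and alpha/(2 alpha + d) + beta/(2 beta + d) (capped at
    1/2) for the linear estimator."""
    s = lambda r: r / (2 * r + d)
    gamma_quad = min(0.5,
                     s(alpha) + s(beta) + s(phi),
                     (2 * alpha + 2 * beta) / (d + 2 * alpha + 2 * beta))
    gamma_lin = min(0.5, s(alpha) + s(beta))
    return gamma_quad, gamma_lin

print(rate_exponents(alpha=1.0, beta=1.0, phi=2.0, d=10))   # (0.2857..., 0.1666...)
```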

6 Technical results

Let $L$ be a given closed subspace of $L_2(\mathcal{X}, \mathcal{A}, \mu)$ and $w: \mathcal{X} \to \mathbb{R}$ a bounded, measurable function. Define operators $\Pi, \Pi_w: L_2(\mu) \to L_2(\mu)$ by

$\Pi g = \underset{l \in L}{\mathrm{argmin}} \int (g - l)^2\,d\mu, \qquad \Pi_w g = \underset{l \in L}{\mathrm{argmin}} \int (g - l)^2\, w\,d\mu.$

Thus $\Pi$ is the ordinary orthogonal projection onto the space $L$, and $\Pi_w$ is a weighted projection. The projections can be characterized by the orthogonality relationships $\int (g - \Pi g)\, l\,d\mu = 0$ and $\int (g - \Pi_w g)\, l\, w\,d\mu = 0$, for every $l \in L$.

Lemma 4 Let $\Pi_w$ and $\Pi$ be the weighted projections onto a fixed subspace $L$ of $L_2(\mu)$ relative to the weight functions $w$ and 1, respectively, and let $M_w$ be multiplication by the function $w$. Then $\|\Pi_w g - \Pi M_w g\|_2 \le \|\Pi_w g\|_2\, \|w - 1\|_\infty$ for every $g \in L_2(\mu)$.

Proof The orthogonality relationships for the projections $\Pi$ and $\Pi_w$ imply that, for every $l \in L$ and $g$,

$\int \Pi(wg)\, l\,d\mu = \int wg\, l\,d\mu = \int w(\Pi_w g)\, l\,d\mu.$

Because $\Pi_w g - \Pi(wg)$ is contained in $L$,

$\|\Pi_w g - \Pi(wg)\|_2^2 = \int \big(\Pi_w g - \Pi(wg)\big)\big(\Pi_w g - \Pi(wg)\big)\,d\mu = \int \big(\Pi_w g - \Pi(wg)\big)\big(\Pi_w g - (\Pi_w g) w\big)\,d\mu.$

An application of the Cauchy-Schwarz inequality and next cancellation of one factor $\|\Pi_w g - \Pi(wg)\|_2$ gives that $\|\Pi_w g - \Pi(wg)\|_2 \le \|(\Pi_w g)(1 - w)\|_2$. The right side is bounded above by $\|\Pi_w g\|_2\, \|1 - w\|_\infty$.

Lemma 5 For degenerate, symmetric functions $f, g: \mathcal{X}^2 \to \mathbb{R}$ we have $P^n \mathbb{U}_n f = 0$ and

$P^n (\mathbb{U}_n f)(\mathbb{U}_n g) = \frac{1}{\binom{n}{2}} P^2 fg.$

Lemma 6 For any measurable function $f: \mathcal{X}^2 \to \mathbb{R}$, and $f_1(x_1) = \int f(x_1, x_2)\,dP(x_2)$,

$\mathrm{var}\, \mathbb{U}_n f \le \frac{4}{n} P f_1^2 + \frac{2}{n(n-1)} P^2 f^2.$

Proof The first lemma follows by writing the product $(\mathbb{U}_n f)(\mathbb{U}_n g)$ as a double sum (over ordered pairs $i < j$). The expected values of the off-diagonal terms vanish by degeneracy.

For a general measurable function $f: \mathcal{X}^2 \to \mathbb{R}$ the mean $P^2 f$ is the projection of $f$ onto the constant functions, and the function $f_1$ defined by $f_1(x_1) = \int f(x_1, x_2)\,dP(x_2) - P^2 f$ is the projection of $f$ in $L_2(P^2)$ onto the mean zero functions of one variable. The decomposition

$f(x_1, x_2) = P^2 f + f_1(x_1) + f_1(x_2) + f_{12}(x_1, x_2),$

where $f_{12}$ is defined by the equation, yields the Hoeffding decomposition $\mathbb{U}_n f = P^2 f + 2\mathbb{P}_n f_1 + \mathbb{U}_n f_{12}$ of the U-statistic into orthogonal parts, with $\mathbb{U}_n f_{12}$ degenerate. Using Lemma 5 we see that the variance of $\mathbb{U}_n f$ is equal to $(4/n) P f_1^2 + \big(2/(n(n-1))\big) P^2 f_{12}^2$. The norm of the recentered function $f_1$ used here is smaller than the norm of the function $f_1$ in the statement of the lemma. Because $f_{12}$ is a projection of $f$, its norm is bounded by the norm of $f$.
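The bound of Lemma 6 is easy to check by simulation. Below is a small Monte Carlo sketch (our illustration) with $P$ the uniform distribution on [0, 1] and $f(x_1, x_2) = x_1 x_2 + x_1 + x_2$, for which $f_1(x) = 1.5x + 0.5$ and $P f_1^2 = 1.75$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 30, 4000

def f(x1, x2):
    return x1 * x2 + x1 + x2

def U_n(X):
    F = f(X[:, None], X[None, :])
    np.fill_diagonal(F, 0.0)
    return F.sum() / (len(X) * (len(X) - 1))

var_mc = np.var([U_n(rng.uniform(size=n)) for _ in range(reps)])

Pf1sq = 1.75                                   # int_0^1 (1.5 x + 0.5)^2 dx
x = (np.arange(1000) + 0.5) / 1000             # midpoint rule for P^2 f^2
P2f2 = np.mean(f(x[:, None], x[None, :]) ** 2)
print(var_mc, "<=", 4 / n * Pf1sq + 2 / (n * (n - 1)) * P2f2)
```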

Footnotes

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

References

  1. Begun JM, Hall WJ, Huang WM, Wellner JA. Information and asymptotic efficiency in parametric–nonparametric models. Ann Stat. 1983;11(2):432–452.
  2. Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā Ser A. 1988;50(3):381–393.
  3. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins series in the mathematical sciences. Johns Hopkins University Press; Baltimore: 1993.
  4. Bolthausen E, Perkins E, van der Vaart A. Lectures on probability theory and statistics. In: Bernard P, editor. Lecture notes in mathematics. Lectures from the 29th summer school on probability theory held in Saint-Flour, July 8–24, 1999. Springer; Berlin: 2002.
  5. Emery M, Nemirovski A, Voiculescu D. Lectures on probability theory and statistics. In: Bernard P, editor. Lecture notes in mathematics. Lectures from the 28th summer school on probability theory held in Saint-Flour, 17 August–3 September, 1998. Springer; Berlin: 2000.
  6. Klaassen CAJ. Consistent estimation of the influence function of locally asymptotically linear estimators. Ann Stat. 1987;15(4):1548–1562.
  7. Koševnik JA, Levit BJ. On a nonparametric analogue of the information matrix. Teor Verojatnost i Primenen. 1976;21(4):759–774.
  8. Laurent B. Efficient estimation of integral functionals of a density. Ann Stat. 1996;24(2):659–681.
  9. Laurent B. Estimation of integral functionals of a density and its derivatives. Bernoulli. 1997;3(2):181–211.
  10. Le Cam L. Locally asymptotically normal families of distributions. Certain approximations to families of distributions and their use in the theory of estimation and testing hypotheses. Univ California Publ Stat. 1960;3:37–98.
  11. Murphy SA, van der Vaart AW. On profile likelihood. J Am Stat Assoc. 2000;95(450):449–485. (with comments and a rejoinder by the authors)
  12. Pfanzagl J. Contributions to a general asymptotic statistical theory. Lecture notes in statistics. Vol. 13. Springer; New York: 1982. (with the assistance of Wefelmeyer W)
  13. Pfanzagl J. Asymptotic expansions for general statistical models. Lecture notes in statistics. Vol. 31. Springer; Berlin: 1985. (with the assistance of Wefelmeyer W)
  14. Robins J, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: some questions and an answer”. Stat Sin. 2001;11(4):920–936.
  15. Robins J, Li L, Tchetgen E, van der Vaart A. Asymptotic normality of quadratic estimators. 2007. (submitted) doi: 10.1016/j.spa.2016.04.005.
  16. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. 1992. pp. 297–331.
  17. Rotnitzky A, Robins JM. Semi-parametric estimation of models for means and covariances in the presence of missing data. Scand J Stat. 1995;22(3):323–333.
  18. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer series in statistics. Springer; New York: 2003.
  19. van der Vaart A. On differentiable functionals. Ann Stat. 1991;19(1):178–204.
  20. van der Vaart A. Maximum likelihood estimation with partially censored data. Ann Stat. 1994;22(4):1896–1916.
  21. van der Vaart AW. Statistical estimation in large parameter spaces. CWI Tract. Vol. 44. Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica; Amsterdam: 1988.
  22. van der Vaart AW. Asymptotic statistics. Cambridge series in statistical and probabilistic mathematics. Vol. 3. Cambridge University Press; Cambridge: 1998.
  23. Von Mises R. On the asymptotic distribution of differentiable statistical functions. Ann Math Stat. 1947;18:309–348.
  24. Wellner JA, Klaassen CAJ, Ritov Y. Semiparametric models: a review of progress since BKRW (1993). In: Frontiers in statistics. Imp Coll Press; London: 2006. pp. 25–44.
