Abstract
We discuss a new method of estimation of parameters in semiparametric and nonparametric models. The method is based on U-statistics constructed from quadratic influence functions. The latter extend ordinary linear influence functions of the parameter of interest as defined in semiparametric theory, and represent second order derivatives of this parameter. For parameters for which the matching cannot be perfect the method leads to a bias-variance trade-off, and results in estimators that converge at a slower than n^{-1/2}-rate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at n^{-1/2}-rate.
Keywords: Von Mises calculus, Semiparametric models, Missing data, Tangent space, Influence function, Rate of convergence
1 Introduction
Let X1, X2, . . . , Xn be a random sample from a distribution P_η with density p_η relative to a measure μ on a sample space 𝒳, where the parameter η is known to belong to a subset H of a normed space. We wish to estimate the value χ(η) of a functional χ: H → ℝ with the help of the observations X1, . . . , Xn. Our main interest is in the situation of a semiparametric or nonparametric model, where H is an infinite-dimensional set, and the dependence η ↦ P_η is smooth.
This problem has been studied under the heading “semiparametric statistics” in the 1980s and 1990s. A theory of asymptotic lower bounds for “regular parameters” χ(η) based on Le Cam's concept of local asymptotic normality (Le Cam 1960) was developed starting with Koševnik and Levit (1976) and Pfanzagl (1982), and worked out for many examples in, among others, Begun et al. (1983), van der Vaart (1988) and Bickel et al. (1993). There are many examples of ad-hoc estimators that attain these bounds, and the behaviour of principled methods such as maximum likelihood (including its sieved and penalized variants) or estimating equations is understood to a certain extent (e.g., van der Vaart 1994; Murphy and van der Vaart 2000; Bolthausen et al. 2002; Wellner et al. 2006; van der Laan and Robins 2003).
Certain combinations of models (P_η: η ∈ H) and parameters χ(η) possess structural properties that allow one to estimate the parameter at n^{-1/2}-rate, no matter the size of the parameter set H. In this paper we are interested in the other situations, where the rate of estimation drops when the complexity of the model exceeds a certain limit. Such examples arise, for instance, when many covariates must be included in a model to correct for possible confounding in a causal study, or to model the probability that an individual is included in the sample in a study with missing observations. If simple (e.g., linear) models for the dependence on these covariates are not plausible, which is typical in epidemiological studies, then the resulting model must be taken so large that the usual methods fail. These methods typically focus on the variance only, because the bias is negligible due to the structure of the model, or by explicitly assuming a “no-bias condition” (see Klaassen 1987; Murphy and van der Vaart 2000). In this paper we develop new methods that make a bias-variance trade-off when necessary.
These methods are based on quadratic estimating equations rather than the usual linear estimating equations.
Quadratic expansions for semiparametric models were previously investigated by Pfanzagl and Wefelmeyer (Pfanzagl 1985), but from the very different perspective of second order efficiency, i.e., the refinement of first order bounds by adding a lower order term. Our aim is to show that second order influence functions can be used for first order inference, because they permit balancing of bias and variance.
Following linear and quadratic is cubic, and so on. Extension of our approach to still higher orders is possible, but comes with many new complications. We shall pursue this elsewhere.
The paper is organized as follows. In Sect. 2 we review linear estimators from our current perspective. Next in Sect. 3 we introduce our new method of constructing quadratic estimators. This section has mostly a heuristic nature. In Sects. 4 and 5 we give rigorous constructions and results for two examples. The first is a classical theoretical example. The second is more extensive and concerns estimating a mean response when the response is not always observed.
Notation Let ℙ_n and 𝕌_n denote the empirical measure and the empirical U-statistic measure, viewed as operators on functions: for given functions f: 𝒳 → ℝ and g: 𝒳 × 𝒳 → ℝ these are given by
\[ \mathbb{P}_n f = \frac1n \sum_{i=1}^n f(X_i), \qquad \mathbb{U}_n g = \frac{1}{n(n-1)} \sum_{i \neq j} g(X_i, X_j). \]
We use the notation 𝕌_n g also for a function g of one argument, with the interpretation 𝕌_n g = ℙ_n g. This is consistent with the given formulas if a function of one argument is considered as a function of two arguments that is constant in its second argument.
We write Pf for the expectation ∫ f dP of f(X_1) if X1, . . . , Xn are distributed according to the probability measure P. We also use this operator notation for the expectations of statistics in general; for instance, (P × P)g = ∫∫ g dP dP.
We call a measurable function g: 𝒳 × 𝒳 → ℝ degenerate relative to P if ∫ g(x_1, x_2) dP(x_i) = 0 for i = 1, 2, and we call it symmetric if g(x_1, x_2) = g(x_2, x_1) for every (x_1, x_2).
Given two functions g, h: 𝒳 → ℝ we write g × h for the function (x_1, x_2) ↦ g(x_1) h(x_2). Such tensor product functions are degenerate if both functions g and h have mean zero. The corresponding notation P × Q for two measures P and Q denotes the product measure.
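In code, the two empirical operators might be sketched as follows (a minimal illustration of my own; the function names are hypothetical):

```python
import numpy as np

# Sketch of the empirical measure P_n and the U-statistic measure U_n as
# operators on functions of one and two arguments (illustration only).
def P_n(f, X):
    # P_n f = n^{-1} sum_i f(X_i)
    return np.mean([f(x) for x in X])

def U_n(g, X):
    # U_n g = (n(n-1))^{-1} sum over ordered pairs i != j of g(X_i, X_j)
    n = len(X)
    return sum(g(X[i], X[j]) for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

# A function of one argument, viewed as constant in its second argument,
# satisfies U_n g = P_n g, as in the convention above.
rng = np.random.default_rng(0)
X = rng.normal(size=30)
f = lambda x: x ** 2
assert np.isclose(U_n(lambda x1, x2: f(x1), X), P_n(f, X))
```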
2 Linear estimator
Given an initial estimator η̂ of η, the plug-in estimator χ(η̂) is typically a consistent estimator of the parameter of interest χ(η), but it may not be a good estimator. In particular, if η̂ is a general purpose estimator, not specially constructed to yield a good plug-in, then χ(η̂) will often have a suboptimal precision. To gain insight into this situation we assume that the parameter permits a Taylor expansion of the form
\[ \chi(\eta) = \chi(\hat\eta) + \chi'_{\hat\eta}(\eta - \hat\eta) + O\bigl(\|\eta - \hat\eta\|^2\bigr). \tag{1} \]
Such an expansion suggests that the plug-in estimator will have an error of the order ‖η̂ − η‖, unless the linear term in the expansion vanishes.
The expansion (1) also suggests that better estimators can be obtained by “estimating” the linear term in the expansion. To achieve this we assume a “generalized von-Mises representation” of the derivative of the form
\[ \chi'_{\hat\eta}(\eta - \hat\eta) = \int \tilde\chi_{\hat\eta}\, d(P_\eta - P_{\hat\eta}) = P_\eta \tilde\chi_{\hat\eta}, \tag{2} \]
for some measurable function χ̃_{η̂}: 𝒳 → ℝ, referred to as an influence function. The second equality is valid if χ̃_η is degenerate relative to P_η for every η, which can always be arranged by a recentering, as (P_η − P_{η̂})c = 0 for every constant c. The von-Mises representation and Eq. (1) suggest the “corrected plug-in estimator”
\[ \hat\chi = \chi(\hat\eta) + \mathbb{P}_n \tilde\chi_{\hat\eta}. \tag{3} \]
This estimator should have an error of the order O_P(‖η̂ − η‖² + n^{-1/2}), as the difference χ(η̂) + ℙ_n χ̃_{η̂} − χ(η) is “centered” and ought to have “variance” of the order O(1/n).
We put “centered” and “variance” in quotes, because the randomness in the initial estimator prevents a simple calculation of mean and variance. Empirical process theory can be used to show that the effect of replacing χ̃_η by χ̃_{η̂} is negligible, if the class of functions {χ̃_η: η ∈ H} is not too rich. In the present paper we are interested in orders of magnitude only, and then a simpler approach is to split the sample and use separate observations to construct η̂ and to construct ℙ_n χ̃_{η̂}. Then the orders can be justified by reasoning conditionally on the first sample, and it suffices that P_η χ̃_{η̂}² remains bounded in probability.
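As an illustration of (3) and of the sample-splitting device just described, here is a minimal sketch (my own, anticipating the squared-density functional χ(p) = ∫ p² dμ of Sect. 4, with a histogram initial estimator; these concrete choices are assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.beta(2, 2, size=2000)
X1, X2 = X[:1000], X[1000:]           # split: X1 builds p_hat, X2 builds the correction

bins = np.linspace(0, 1, 21)
counts, _ = np.histogram(X1, bins=bins)
p_hat = counts / (len(X1) * np.diff(bins))   # histogram density estimate from first half

def p_hat_at(x):
    idx = np.clip(np.searchsorted(bins, x, side="right") - 1, 0, len(p_hat) - 1)
    return p_hat[idx]

chi_plug_in = np.sum(p_hat ** 2 * np.diff(bins))   # chi(p_hat) = integral of p_hat^2
infl = 2 * (p_hat_at(X2) - chi_plug_in)            # influence function of Sect. 4, eq. (22)
chi_corrected = chi_plug_in + infl.mean()          # estimator (3): chi(p_hat) + P_n correction
print(chi_corrected)                               # true value for Beta(2,2) is 1.2
```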
Von Mises (1947) originally introduced the expansions that are named after him in order to investigate functionals of empirical distributions. The idea to use expansions (1) for estimation in nonparametric models occurs in Emery et al. (2000). Our situation is more involved, because we are interested in models (P_η: η ∈ H) that are structured through a map η ↦ P_η, and we are interested in a functional χ(η) of the parameter. In this situation a von-Mises type expansion can fail for two reasons. First a derivative is by definition a continuous, linear map on the underlying normed space, and such maps may or may not be representable as an integral, depending on the normed space. Second, our von Mises expansion (2) represents this derivative as an integral relative to the distribution P_η and hence also involves the inverse map P_η ↦ η from the distribution of the data to the parameter. We require representation through P_η, because this allows us to construct the estimator (3) by replacing P_η by the empirical distribution.
These issues are related to investigations in the theory of semiparametric models (see Koševnik and Levit 1976; Pfanzagl 1982; van der Vaart 1988; Bickel et al. 1993). These papers define a tangent set of a semiparametric model (P_η: η ∈ H) as the set of functions obtainable as
\[ g = \lim_{t \downarrow 0} \frac{p_{\eta_t} - p_\eta}{t\, p_\eta}, \]
where the limit is taken in the L2-sense, and t ↦ η_t ranges over a collection of maps from an interval [0, ε) to H for which the limit exists. Informally, a “tangent vector” is just a score function
\[ g(x) = \frac{\partial}{\partial t}\Big|_{t=0} \log p_{\eta_t}(x) \tag{4} \]
of a one-dimensional submodel (P_{η_t}: t ≥ 0) at t = 0, where η_0 = η. (Taking the derivative in the L2-sense is appropriate for asymptotic information theory, but not necessarily so for the present heuristic discussion.) An influence function is defined as a measurable map χ̃_η: 𝒳 → ℝ such that, for all paths considered,
\[ \frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = \int \tilde\chi_\eta\, g\, dP_\eta = P_\eta \tilde\chi_\eta g. \tag{5} \]
It is not difficult to see that the latter influence function is the same as the influence function needed in the von-Mises expansion (2), if the various types of derivatives match up. (Note that the middle expression in (2) with η replaced by η_t and η̂ by η expands to t P_η χ̃_η g + o(t), as t ↓ 0.) Necessary and sufficient conditions for existence of an influence function in terms of the derivatives of the maps η ↦ P_η and η ↦ χ(η) were investigated in van der Vaart (1991).
An influence function is not necessarily unique, as only its inner products with elements of the tangent set matter. The projection of any influence function onto the closed linear span of the tangent set is called the efficient influence function or canonical gradient, as it is the influence function of asymptotically efficient estimators. It minimizes the variance over all influence functions.
3 Quadratic estimator
If the preliminary estimator η̂ attains a rate of convergence o_P(n^{-1/4}), then the estimator (3) attains an n^{-1/2}-rate of convergence. Typically this will require that the parameter set H is not too big. If the preliminary estimator is less precise, then the remainder term of the expansion (1) will dominate. This suggests taking the expansion further to
\[ \chi(\eta) = \chi(\hat\eta) + \chi'_{\hat\eta}(\eta - \hat\eta) + \tfrac12\, \chi''_{\hat\eta}(\eta - \hat\eta, \eta - \hat\eta) + O\bigl(\|\eta - \hat\eta\|^3\bigr). \tag{6} \]
The generalization of the first order construction now requires a von Mises type representation of the form, for measurable functions χ̃^{(1)}_{η̂}: 𝒳 → ℝ and χ̃^{(2)}_{η̂}: 𝒳 × 𝒳 → ℝ,
\[ \chi'_{\hat\eta}(\eta - \hat\eta) + \tfrac12\, \chi''_{\hat\eta}(\eta - \hat\eta, \eta - \hat\eta) = P_\eta \tilde\chi^{(1)}_{\hat\eta} + \tfrac12\, (P_\eta \times P_\eta)\, \tilde\chi^{(2)}_{\hat\eta}. \tag{7} \]
We assume without loss of generality that the functions χ̃^{(1)}_{η̂} and χ̃^{(2)}_{η̂} are degenerate relative to P_{η̂}. The von-Mises representation then suggests the “corrected plug-in estimator”
\[ \hat\chi = \chi(\hat\eta) + \mathbb{P}_n \tilde\chi^{(1)}_{\hat\eta} + \tfrac12\, \mathbb{U}_n \tilde\chi^{(2)}_{\hat\eta}. \tag{8} \]
The empirical measure and the U-statistic measure serve as unbiased estimators of the expectations of their kernels. For simplicity we may again base the initial estimator η̂ and these two empirical averages on independent samples of observations. Because the variance of a U-statistic is of order O(1/n), this estimator ought to have an error of the order O_P(‖η̂ − η‖³ + n^{-1/2}). We shall discuss the validity of this later.
To characterize the first and second order influence functions we can again employ smooth one-dimensional submodels (Pηt : t ≥ 0). With the first and second order derivatives of these models denoted by
\[ g(x) = \frac{\dot p_\eta(x)}{p_\eta(x)}, \qquad h(x) = \frac{\ddot p_\eta(x)}{p_\eta(x)}, \qquad \dot p_\eta = \frac{\partial}{\partial t}\Big|_{t=0} p_{\eta_t}, \quad \ddot p_\eta = \frac{\partial^2}{\partial t^2}\Big|_{t=0} p_{\eta_t}, \tag{9} \]
the von Mises expansion (7) can informally be seen to imply
\[ \frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta\, \tilde\chi^{(1)}_\eta\, g, \tag{10} \]
\[ \frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = P_\eta\, \tilde\chi^{(1)}_\eta\, h + (P_\eta \times P_\eta)\, \tilde\chi^{(2)}_\eta\, (g \times g). \tag{11} \]
The Eq. (10) is identical to Eq. (5), and hence a first order influence function can be taken as before. Following Pfanzagl (1985) we define a second order influence function as a measurable function χ̃^{(2)}_η: 𝒳 × 𝒳 → ℝ that satisfies (11) for every path under consideration. From Eq. (11) we see that χ̃^{(2)}_η is unique only up to functions that are orthogonal to functions of the form g × g, for g belonging to the tangent set. In particular, a second order influence function can always be taken to be symmetric and degenerate relative to P_η. It must be taken so in the construction of the estimator (8).
The two influence functions occur together in Eq. (11), and hence should be considered a pair (χ̃^{(1)}_η, χ̃^{(2)}_η) of functions rather than as two separate functions. This is particularly important if the tangent set is not “full”, i.e., smaller than the set of all mean-zero functions in L2(P_η), the tangent set of a nonparametric model. Both first and second order influence functions are then non-unique, but their different versions cannot be freely combined into valid pairs (χ̃^{(1)}_η, χ̃^{(2)}_η). This is connected to the fact that first and second order derivatives g and h are also not clearly separated. A simple change of speed of a path through a second order diffeomorphism ϕ: [0, 1] → [0, 1] leads to the submodel (P_{η_{ϕ(t)}}: t ≥ 0) with first and second order derivatives, by the chain rule,
\[ \phi'(0)\, g \qquad\text{and}\qquad \phi'(0)^2\, h + \phi''(0)\, g. \]
Thus the first order derivative becomes part of the second order derivative after reparameterization. Pfanzagl (1985, 2.4.4) has shown, under assumptions of smoothness of the tangent set as a function of the parameter, that the sum of every first order derivative and every second order derivative occurs as the second order derivative of some path. Thus the set of second order derivatives is only defined up to equivalence modulo the tangent set.
From Eq. (11) it is also clear that second order influence functions involve the joint distribution of two observations. Correspondingly, we prefer to define a second order tangent space of the model not through the second order derivatives h along paths, but through the functions of two arguments
\[ \ell(x_1, x_2) = \frac{\partial^2}{\partial t^2}\Big|_{t=0} \frac{p_{\eta_t}(x_1)\, p_{\eta_t}(x_2)}{p_\eta(x_1)\, p_\eta(x_2)} = h(x_1) + h(x_2) + 2\, g(x_1)\, g(x_2). \tag{12} \]
The function ℓ is a second order score for the model (P_η × P_η: η ∈ H) for two observations. The corresponding first order scores are the functions
\[ (x_1, x_2) \mapsto g(x_1) + g(x_2). \tag{13} \]
With these notations the Eqs. (10), (11) defining the influence functions can also be written as follows, where χ̃^{(2)}_η is chosen degenerate and χ̃_η = χ̃^{(1)}_η + ½ χ̃^{(2)}_η denotes the overall influence function:
\[ \frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = (P_\eta \times P_\eta)\, \tilde\chi_\eta\, \bigl(g(x_1) + g(x_2)\bigr) = \frac{d}{dt}\Big|_{t=0} (P_{\eta_t} \times P_{\eta_t})\, \tilde\chi_\eta, \tag{14} \]
\[ \frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = (P_\eta \times P_\eta)\, \tilde\chi_\eta\, \ell = \frac{d^2}{dt^2}\Big|_{t=0} (P_{\eta_t} \times P_{\eta_t})\, \tilde\chi_\eta. \tag{15} \]
Here we interpret the function χ̃^{(1)}_η as a function of two arguments that depends on the first argument only (and is constant in the second), or (better) replace it by its symmetrization (x_1, x_2) ↦ ½(χ̃^{(1)}_η(x_1) + χ̃^{(1)}_η(x_2)). The equations show that the overall influence function χ̃_η is characterized by having “correct” inner products with the overall scores g(x_1) + g(x_2) and ℓ. This overall influence function uniquely defines its constituents χ̃^{(1)}_η and χ̃^{(2)}_η provided χ̃^{(2)}_η is restricted to be degenerate. The overall influence function is itself unique only up to projection onto the closed linear span in L2(P_η × P_η) of all functions g(x_1) + g(x_2) and ℓ.
The equality of the far left and right sides of Eqs. (14), (15) gives an alternative characterization of the overall influence function (at η_0) as a function χ̃_{η_0} such that the maps η ↦ χ(η) and η ↦ (P_η × P_η) χ̃_{η_0} possess the same first and second order derivatives at η_0. Because the derivatives of a map ϕ on an open subset H of a normed space are completely characterized by the derivatives of the maps t ↦ ϕ(η + th), for h ranging over the space (“Gateaux derivatives”), we conclude that in the case of such parameter sets H it suffices to consider linear paths t ↦ η + th. (The mixed second derivative can be recovered from the second derivatives along the directions g + h and g − h by “polarization”.) This is true more generally for parameter spaces H defined by a linear constraint, but in the case of nonlinear constraints the use of curved paths is necessary.
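For completeness, the polarization identity alluded to here reads, for a symmetric bilinear second derivative,
\[ \chi''_\eta(g, h) = \tfrac14\Bigl(\chi''_\eta(g + h,\, g + h) - \chi''_\eta(g - h,\, g - h)\Bigr), \]
so that knowledge of the second derivatives along all linear paths indeed determines the full bilinear map.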
The plug-in estimator (8) can be written χ(η̂) + 𝕌_n χ̃_{η̂}. A definition of an efficient or canonical second order influence function should therefore refer to the variance of the U-statistic 𝕌_n χ̃_η. Unlike in the linear case this does not translate into the variance of the influence function itself (except for n = 2, if χ̃^{(1)}_η is interpreted as the symmetric function ½(χ̃^{(1)}_η(x_1) + χ̃^{(1)}_η(x_2))). By Lemma 5, if χ̃^{(2)}_η is chosen degenerate and symmetric,
\[ \operatorname{var}_\eta \mathbb{U}_n \tilde\chi_\eta = \frac1n\, P_\eta \bigl(\tilde\chi^{(1)}_\eta\bigr)^2 + \frac{1}{2\, n(n-1)}\, (P_\eta \times P_\eta) \bigl(\tilde\chi^{(2)}_\eta\bigr)^2. \]
Thus the second order part adds a term of order O(1/n) relative to the first order contribution. The norm of the function χ̃^{(2)}_η in L2(P_η × P_η) is irrelevant, even though the inner product of this space determines the influence functions.
It is possible to resolve this discrepancy by working in the model with n observations. From the expansion
\[ \prod_{i=1}^n \frac{p_{\eta_t}}{p_\eta}(x_i) = 1 + t \sum_{i=1}^n g(x_i) + \frac{t^2}{2}\Bigl(\sum_{i=1}^n h(x_i) + \sum_{i \neq j} g(x_i)\, g(x_j)\Bigr) + o(t^2), \]
we see that first and second order scores of the model (P_η^n: η ∈ H) take the forms
\[ \sum_{i=1}^n g(x_i), \tag{16} \]
\[ \sum_{i=1}^n h(x_i) + \sum_{i \neq j} g(x_i)\, g(x_j). \tag{17} \]
Rather than in the form Eqs. (14), (15), the Eqs. (10), (11) that define the influence functions can then be written in the form
\[ \frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = E_\eta\, \Bigl(\mathbb{U}_n \tilde\chi_\eta \sum_{i=1}^n g(X_i)\Bigr), \tag{18} \]
\[ \frac{d^2}{dt^2}\Big|_{t=0} \chi(\eta_t) = E_\eta\, \Bigl(\mathbb{U}_n \tilde\chi_\eta \Bigl(\sum_{i=1}^n h(X_i) + \sum_{i \neq j} g(X_i)\, g(X_j)\Bigr)\Bigr). \tag{19} \]
We conclude that the influence functions are determined by the inner products of the U-statistic 𝕌_n χ̃_η in L2(P_η^n) with the score functions (16) and (17). The influence functions that yield a minimal variance are found by projecting this U-statistic onto the closed linear span of these score functions. Thus it is natural to define the latter span as the second order tangent space of the model.
For computation in examples the defining Eq. (11) or (15) of a second order influence function can be tedious. It is usually easier to apply the rule that a second derivative is the derivative of the first derivative. In the present situation this takes the following form (Pfanzagl 1985, 4.3.11): if χ̃^{(2)}_η is a function such that x_2 ↦ χ̃^{(2)}_η(x_1, x_2) is a first order influence function of the parameter η ↦ χ̃^{(1)}_η(x_1), for every fixed x_1, and χ̃^{(1)}_η is a first order influence function (not necessarily degenerate), then χ̃^{(2)}_η is a second order influence function.
Lemma 1 Suppose that (P_{η_t}: t ≥ 0) is a sufficiently smooth submodel with score g_t along the path (g_0 = g), and χ̃^{(1)}_η and χ̃^{(2)}_η are measurable functions that satisfy
\[ \frac{d}{dt} \chi(\eta_t) = P_{\eta_t}\, \tilde\chi^{(1)}_{\eta_t}\, g_t, \qquad \frac{d}{dt}\Big|_{t=0} \tilde\chi^{(1)}_{\eta_t}(x_1) = P_\eta\, \tilde\chi^{(2)}_\eta(x_1, \cdot)\, g. \]
Then the function χ̃^{(2)}_η is a second order influence function, and so is the symmetrization of its orthogonal projection onto the degenerate functions in L2(P_η × P_η).
Proof By differentiation of the first identity (under the integral, and using g_t p_{η_t} = ∂p_{η_t}/∂t) we see that
\[ \frac{d^2}{dt^2} \chi(\eta_t) = \int \Bigl(\frac{d}{dt} \tilde\chi^{(1)}_{\eta_t}\Bigr)\, \dot p_{\eta_t}\, d\mu + \int \tilde\chi^{(1)}_{\eta_t}\, \ddot p_{\eta_t}\, d\mu. \]
We evaluate this at t = 0 and substitute the second identity in the first term on the right to arrive at Eq. (11). The equation remains valid if χ̃^{(2)}_η is replaced by its projection and symmetrization.
Just as for first order influence functions there is no guarantee that a second order influence function exists. The difference is that, for the examples we are interested in, nonexistence of a second order influence function is typical. A first indication that this might happen is that the informal conclusion reached in the preceding, that the quadratic estimator (8) will have an error of the order O_P(‖η̂ − η‖³ + n^{-1/2}), is overly optimistic. In comparison to the linear estimator (3), this estimator would have reduced the dependence on the preliminary estimator from O(‖η̂ − η‖²) to O(‖η̂ − η‖³), apparently without a serious penalty on the variance of the estimator. In our examples this does not occur, simply because a second order influence function does not exist.
As for the first order influence function, the nonexistence of the second order influence function may be caused by a lack of invertibility of the map η → p_η or by failure of a von Mises type representation. The invertibility is again necessary, because we need representation of the derivatives of η ↦ χ(η) in terms of the distribution P_η of the observation. This is similar as in the linear situation. The second cause for failure of representation also arose in the linear situation, but appears to arise in a much more serious way at the second order. Whereas a continuous, linear map B: L2(P_η) → ℝ is always representable as an inner product B(g) = ∫ χ̃ g dP_η for some function χ̃ ∈ L2(P_η), a continuous, bilinear map B: L2(P_η) × L2(P_η) → ℝ is not necessarily representable through a measurable function χ̃^{(2)}: 𝒳 × 𝒳 → ℝ, in the form
\[ B(g, h) = \iint \tilde\chi^{(2)}(x_1, x_2)\, g(x_1)\, h(x_2)\, dP_\eta(x_1)\, dP_\eta(x_2). \tag{20} \]
It can be shown that a continuous, bilinear map can always be written in the form B(g, h) = ∫ g(Ah) dP_η for a continuous, linear operator A: L2(P_η) → L2(P_η), but the latter operator is not necessarily a kernel operator, in the sense that Ah(x_1) = ∫ χ̃^{(2)}(x_1, x_2) h(x_2) dP_η(x_2) for some kernel χ̃^{(2)}. The latter representation is necessary for the von-Mises representation (7) of the second derivative.
Failure of existence of χ̃^{(2)}_η does not mean that the idea to use a quadratic expansion for improved estimation is not fruitful. Failure does mean that we cannot construct the estimator (8), and the estimation rate O_P(‖η̂ − η‖³ + n^{-1/2}) may not be attainable. However, we may return to Eq. (6) and try to estimate the quadratic term as well as possible, and still improve on the linear estimator. A key observation is that a bilinear map on a finite-dimensional subspace L × L ⊂ L2(P_η) × L2(P_η) is always representable by a kernel.
Lemma 2 If L ⊂ L2(P_η) is a finite-dimensional subspace and B: L × L → ℝ is continuous and bilinear, then there exists a function χ̃^{(2)}: 𝒳 × 𝒳 → ℝ such that (20) holds for every g, h ∈ L.
Proof For an arbitrary orthonormal basis e_1, . . . , e_k of L we can express an element g ∈ L as g = Σ_{i=1}^k ⟨g, e_i⟩_η e_i, for ⟨·, ·⟩_η the inner product of L2(P_η). By bilinearity
\[ B(g, h) = \sum_{i=1}^k \sum_{j=1}^k \langle g, e_i\rangle_\eta\, \langle h, e_j\rangle_\eta\, B(e_i, e_j). \]
Thus the function χ̃^{(2)}(x_1, x_2) = Σ_{i=1}^k Σ_{j=1}^k B(e_i, e_j)\, e_i(x_1)\, e_j(x_2) is a kernel for the map B.
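A toy numerical check of Lemma 2 (my own construction) on a finite sample space, where L2(P_η) is finite-dimensional and everything can be computed exactly:

```python
import numpy as np

# On a discrete P_eta, a bilinear map B restricted to a k-dimensional subspace L
# is represented by the kernel sum_{i,j} B(e_i, e_j) e_i(x1) e_j(x2).
rng = np.random.default_rng(2)
m, k = 7, 3
P = rng.dirichlet(np.ones(m))                  # a discrete P_eta on m points
A = rng.normal(size=(m, k))
Q, _ = np.linalg.qr(np.sqrt(P)[:, None] * A)
E = Q / np.sqrt(P)[:, None]                    # columns: basis of L, orthonormal in L2(P)

M = rng.normal(size=(m, m))                    # an arbitrary bilinear map B(g, h) = g' M h
Bmat = E.T @ M @ E                             # matrix of values B(e_i, e_j)
K = E @ Bmat @ E.T                             # the kernel of Lemma 2, tabulated

c, d = rng.normal(size=k), rng.normal(size=k)
g, h = E @ c, E @ d                            # two elements of L
lhs = g @ M @ h                                # B(g, h)
rhs = (g * P) @ K @ (h * P)                    # integral of g(x1) K(x1,x2) h(x2) dP dP
assert np.isclose(lhs, rhs)
```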
If the invertibility can be resolved, we can therefore always represent the second derivative in Eq. (6) at differences η − η̂ within a given finite-dimensional linear space. The estimator (8) based on the resulting “partial second order influence function” then will add a representation error to the remainder O(‖η − η̂‖³). This representation error can be made arbitrarily small by choosing the finite-dimensional linear space sufficiently large. However, the corresponding partial influence functions depend on the approximating linear spaces, the estimator now having the form
\[ \hat\chi_L = \chi(\hat\eta) + \mathbb{P}_n \tilde\chi^{(1)}_{\hat\eta} + \tfrac12\, \mathbb{U}_n \tilde\chi^{(2)}_{\hat\eta, L}, \tag{21} \]
where χ̃^{(2)}_{η̂,L} is a partial second order influence function based on an approximating space L. To obtain a good estimator we must balance the representation error, the remainder O(‖η − η̂‖³), and the variance of the estimator. In an asymptotic framework we let the approximating space L increase to the full space as n → ∞. We shall see that this may cause the variance of 𝕌_n χ̃^{(2)}_{η̂,L} to dominate the variance of the linear term, and the overall variance may be bigger than O(1/n). However, by proper balancing of the three terms we never do worse than the linear estimator (3), and we gain over it if the parameter set H is large.
4 Estimating the square of a density
Consider the problem of estimating the functional χ(p) = ∫ p² dμ based on a random sample of size n from the density p. This problem was discussed, among others, in Bickel and Ritov (1988) and Laurent (1996, 1997). We shall rederive the estimator of Laurent (1996) through our general approach.
As the underlying model we use a set of densities that is restricted only qualitatively, for instance a Hölder ball of functions on the unit interval in ℝ. We parameterize this model by the density itself, which we denote by p (hence p_η = η = p). The tangent space of the model can then be taken equal to the set of all mean-zero functions g in L2(P), and the first order influence function takes the form
\[ \tilde\chi_p(x) = 2\bigl(p(x) - \chi(p)\bigr). \tag{22} \]
To see this, it suffices to note that this function is mean-zero (i.e., degenerate) and satisfies
\[ \frac{d}{dt}\Big|_{t=0} \chi(p_t) = 2 \int p\, g\, p\, d\mu = P \tilde\chi_p\, g \]
for any sufficiently regular path t ↦ p_t with p_0 = p and score function g at t = 0. This first order influence function exists without making assumptions on p or the model.
We compute a second order influence function as the influence function of the functional p ↦ 2p(x_1), which is the first order influence function up to centering. This entails point evaluation at a fixed point x_1, which, unfortunately, is not a differentiable functional in the sense of possessing an influence function. For any sufficiently regular path t ↦ p_t with score function g,
\[ \frac{d}{dt}\Big|_{t=0} p_t(x_1) = (g\, p)(x_1). \]
Existence of an influence function of the functional p ↦ p(x_1) would require the map g ↦ (gp)(x_1) to be representable as an inner product in L2(P) on the tangent space. Such a representation is not possible (unless p has finite support), because the map is not continuous relative to the L2(P)-norm.
Thus we content ourselves with partial representation of the second derivative. To this aim it is useful to think of the point evaluation map as integrating versus the Dirac measure (at x_1). Full representation of the functional would be possible if there existed a function Π: 𝒳 × 𝒳 → ℝ such that,
\[ (g\, p)(x_1) = \int \Pi(x_1, x_2)\, (g\, p)(x_2)\, d\mu(x_2). \tag{23} \]
If this were true for every function g, then the measure A ↦ ∫_A Π(x_1, x_2) dμ(x_2) would, for each fixed x_1, act as a Dirac measure at x_1. In other words, the desired, but not existing, function Π would be a “Dirac measure” on the diagonal of 𝒳 × 𝒳. Our second best is a function Π for which Eq. (23) is true, if not for all, then for a large collection of g. The kernel Π of a projection operator Π: L2(μ) → L2(μ) onto a (large) subspace is a candidate, because it satisfies the display whenever gp is in the subspace: if Πf(x_1) = ∫ Π(x_1, x_2) f(x_2) dμ(x_2), then the equation gp = Π(gp), which is valid for every gp in the range of the projection, gives the preceding display.
Lemma 3 An orthogonal projection Π: L2(μ) → L ⊂ L2(μ) onto a finite-dimensional subspace L can be represented as Πf(x_1) = ∫ Π(x_1, x_2) f(x_2) dμ(x_2) for the kernel function Π(x_1, x_2) = Σ_{i=1}^k e_i(x_1) e_i(x_2), with e_1, . . . , e_k an orthonormal basis of L. This kernel satisfies ∫ Π² dμ × μ = k.
Proof We have Πf = Σ_{i=1}^k ⟨f, e_i⟩ e_i for ⟨f, e_i⟩ = ∫ f e_i dμ. The representation follows by exchanging the order of summation and integration.
The square kernel is Π²(x_1, x_2) = Σ_i Σ_j e_i(x_1) e_i(x_2) e_j(x_1) e_j(x_2). By the orthonormality of the basis (e_i) the (double) integral of the off-diagonal terms (i ≠ j) vanishes and the double integral of each diagonal term is equal to 1. Thus the double integral is k.
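For instance (an illustration of my own), the kernel of the projection onto the first k functions of the cosine basis of L2[0, 1], with the normalization ∫ Π² dμ × μ = k checked numerically on a grid:

```python
import numpy as np

# Projection kernel of Lemma 3 for the cosine basis e_1 = 1, e_{i+1}(x) = sqrt(2) cos(i pi x).
k = 5
x = np.linspace(0, 1, 2001)
E = np.vstack([np.ones_like(x)] + [np.sqrt(2) * np.cos(i * np.pi * x) for i in range(1, k)])
Pi = E.T @ E                 # Pi(x1, x2) = sum_i e_i(x1) e_i(x2), tabulated on the grid
print((Pi ** 2).mean())      # approximates the double integral of Pi^2, which equals k
```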
We also arrive at a projection operator from the formula χ''_p(g, h) = 2 ∫ g h p² dμ for the second derivative of χ. We can write this in the form χ''_p(g, h) = 2 ∫ g(A_p h) dP for the operator A_p: L2(P) → L2(P) given by A_p h = hp. The operator A_p is not of kernel form, but we can approximate it by ΠA_p, leading to the approximation 2 ∫ g(ΠA_p h) dP = 2 ∫ gp (Π(hp)) dμ for Π: L2(μ) → L2(μ) the projection of Lemma 3.
For a given orthonormal basis e_1, e_2, . . . of L2(μ) we take the kernel Π(x_1, x_2) of the projection onto the span of the first k elements, given by Lemma 3, as a “partial” influence function of the functional p ↦ p(x_1), and 2Π(x_1, x_2) as a “partial” influence function of the functional p ↦ 2p(x_1). The projection of this function onto the degenerate functions is
\[ \tilde\chi^{(2)}_p(x_1, x_2) = 2\Bigl(\Pi(x_1, x_2) - \Pi p(x_1) - \Pi p(x_2) + \int p\, \Pi p\, d\mu\Bigr). \tag{24} \]
The quadratic estimator (8), given the initial estimator p̂, takes the form
\[ \hat\chi = \mathbb{U}_n \Pi + \sum_{i > k} \hat\theta_i\, \bigl(2\, \mathbb{P}_n e_i - \hat\theta_i\bigr), \]
for θ̂_i = ∫ e_i p̂ dμ the Fourier coefficients of p̂. If we choose the initial estimator p̂ to take its values in the range of Π, then θ̂_i = 0 for i > k and the second term vanishes. The resulting estimator reduces to 𝕌_n Π, the estimator considered by Laurent (1996, 1997), who showed that the estimator is minimax if p is a-priori known to belong to a multiple of the unit ball in the Hölder space Cβ[0, 1] of regularity β and (e_i) is a basis suited to this a-priori model. In fact, mean and variance of the estimator satisfy, with θ_i = ∫ e_i p dμ the Fourier coefficients of p,
\[ E\, \mathbb{U}_n \Pi - \chi(p) = -\sum_{i > k} \theta_i^2, \qquad \operatorname{var} \mathbb{U}_n \Pi \lesssim \frac1n + \frac{k}{n^2}. \]
The bound on the variance follows from Lemma 6. If it is a-priori known that Σ_i θ_i² i^{2β} = O(1), then the bias is bounded by k^{-2β}. The square bias is balanced against the variance if k is chosen of the order k_n = n^{2/(4β+1)} if β ≤ 1/4 and k_n = n if β ≥ 1/4. The resulting rate of convergence is n^{-4β/(4β+1)} if β ≤ 1/4 and n^{-1/2} if β ≥ 1/4. In Robins et al. (2007) it is shown that the estimator is also asymptotically normal.
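A small sketch of the resulting estimator 𝕌_n Π (my implementation; the cosine basis and the Beta(2, 2) test density are my own choices):

```python
import numpy as np

# U_n Pi = (n(n-1))^{-1} sum_{i != j} Pi(X_i, X_j), an unbiased estimator of
# sum_{l <= k} theta_l^2, computed in O(nk) via (sum e)^2 - sum e^2 per basis function.
def squared_density_estimate(X, k):
    n = len(X)
    chi = 0.0
    for i in range(k):
        e = np.ones(n) if i == 0 else np.sqrt(2) * np.cos(i * np.pi * X)
        chi += (e.sum() ** 2 - (e ** 2).sum()) / (n * (n - 1))
    return chi

rng = np.random.default_rng(3)
X = rng.beta(2, 2, size=5000)                # for Beta(2,2), chi(p) = 36 B(3,3) = 1.2
print(squared_density_estimate(X, k=20))     # should be close to 1.2
```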
5 Estimating the mean response in missing data models
Suppose that a typical observation is distributed as X = (Y A, A, Z) for Y and A taking values in the two-point set {0, 1} and conditionally independent given Z. We think of Y as a response variable, which is observed only if the indicator A takes the value 1. The covariate Z is chosen such that it contains all information on the dependence between response and missingness indicator (missing at random). Alternatively, we think of Y as a counterfactual outcome if a treatment were given (A = 1) and estimate (half) the treatment effect under the assumption of “no unmeasured confounders”. Both applications may require that Z is high-dimensional (e.g., of dimension 10), and there is typically insufficient a-priori information to model the dependence of A and Y on Z.
The model can be parameterized by the marginal density f of Z (relative to some dominating measure ν) and the probabilities b(z) = P(Y = 1|Z = z) and a(z)^{-1} = P(A = 1|Z = z). (We use a for the inverse probability, because this simplifies later formulas.) Thus the density p_η of an observation X is described by the triple η = (a, b, f).
We wish to estimate the mean response EY, i.e., the parameter
\[ \chi(\eta) = E_\eta Y = \int b\, f\, d\nu. \]
Estimators that are n^{-1/2}-consistent and asymptotically efficient in the semiparametric sense have been constructed using a variety of methods (e.g., Robins and Rotnitzky 1992; van der Laan and Robins 2003; van der Vaart 1998), but only if one or both of the parameters a and b are restricted to sufficiently small regularity classes. For instance, if the covariate ranges over a compact, convex subset Z of ℝ^d, then the mentioned papers provide n^{-1/2}-consistent estimators under the assumption that a and b belong to Hölder classes Cα(Z) and Cβ(Z) with α and β large enough that
\[ \frac{\alpha}{2\alpha + d} + \frac{\beta}{2\beta + d} \ge \frac12. \tag{25} \]
For moderate to large dimensions d this is a restrictive requirement. We shall show that a quadratic estimator of the type (8) can attain an n^{-1/2}-rate in a bigger model, and attains a strictly better rate than the usual estimators if the n^{-1/2}-rate is not obtainable.
Preliminary estimators
The parameter 1/a(z) = E(A|Z = z) is the regression of A on Z and hence can be estimated by any nonparametric regression estimator, such as a kernel or a truncated series estimator. Similarly, the function b(z) = P(Y = 1|Z = z, A = 1) is the regression of the observed Y on Z and can be estimated by nonparametric regression of the Y_i with A_i = 1 on the corresponding Z_i. We shall see below that the parameter f/a is more fundamental than the parameter f. By Bayes' rule (f/a)(z) = P(A = 1|Z = z) f(z) is P(A = 1) times the conditional density of Z given A = 1. Therefore, we may estimate f/a by a nonparametric density estimator based on the subsample (Z_i : A_i = 1), times the empirical fraction ℙ_n A of observations with A_i = 1.
Tangent space and first order influence function
The one-dimensional submodels induced by paths of the form a_t = a + tα, b_t = b + tβ, and f_t = f(1 + tϕ) for given directions α, β and ϕ yield scores A_η α + B_η β + C_η ϕ, for A_η, B_η, C_η the score operators for the three parameters, which can be computed to be
\[ A_\eta\alpha(X) = -\frac{A\, a(Z) - 1}{a(Z)\,(a(Z) - 1)}\,\alpha(Z), \qquad B_\eta\beta(X) = \frac{A\,(Y - b(Z))}{b(Z)\,(1 - b(Z))}\,\beta(Z), \qquad C_\eta\phi(X) = \phi(Z). \]
The first-order influence function is well known to take the form
\[ \tilde\chi^{(1)}_\eta(X) = A\, a(Z)\,\bigl(Y - b(Z)\bigr) + b(Z) - \chi(\eta). \tag{26} \]
To see this it must be verified that this function satisfies, for every path as described previously,
\[ \frac{d}{dt}\Big|_{t=0} \chi(\eta_t) = P_\eta\, \tilde\chi^{(1)}_\eta\, \bigl(A_\eta\alpha + B_\eta\beta + C_\eta\phi\bigr). \]
For the paths a_t = a + tα, b_t = b + tβ and f_t = f(1 + tϕ) the left side of this equation is ∫ (β + bϕ) f dν. The right side can easily be evaluated to be the same, where it may be noted that conditional expectations of functions of Y and A given Z factorize, with E(Y − b(Z)|Z) = E(A a(Z) − 1|Z) = 0 and E((Y − b(Z))²|Z) = b(1 − b)(Z).
The advantage of parameterizing by the inverse probability a is clear from the form of the (random part of the) influence function, which is a bilinear function of (a, b). The error of the corresponding von-Mises representation can be computed to be, for a given initial estimator η̂ = (â, b̂, f̂),
\[ \chi(\hat\eta) + P_\eta \tilde\chi^{(1)}_{\hat\eta} - \chi(\eta) = \int (a - \hat a)\,(\hat b - b)\, \frac{f}{a}\, d\nu. \tag{27} \]
This is quadratic in the errors of the initial estimators. Actually, the form of the bias term is special in that squared estimation errors of the two initial estimators â and b̂ do not arise, but only the product of their errors. This property, termed “double robustness” in Rotnitzky and Robins (1995), Robins and Rotnitzky (2001), van der Laan and Robins (2003), means that it suffices that one of the two parameters is estimated well. An a-priori assumption that the parameters a and b are α- and β-regular, respectively, would allow estimation errors with rates n^{-α/(2α+d)} and n^{-β/(2β+d)}. If the product of these rates is o(n^{-1/2}), then the bias term is negligible, and the linear estimator (3) attains a rate n^{-1/2}. This leads to the condition (25). If this condition fails, then the “bias” (27) is greater than O(n^{-1/2}). The linear estimator then does not balance bias and variance and is suboptimal.
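A minimal sketch of the linear estimator (3) built from the influence function (26), i.e., the sample average of A â(Z)(Y − b̂(Z)) + b̂(Z). The helper name and the simulated data are mine; â and b̂ would come from nonparametric regressions on an independent sample, here replaced by oracle values for illustration:

```python
import numpy as np

def linear_mean_response(Y, A, Z, a_hat, b_hat):
    # P_n [ A a_hat(Z) (Y - b_hat(Z)) + b_hat(Z) ], the correction term of (26)
    return np.mean(A * a_hat(Z) * (Y - b_hat(Z)) + b_hat(Z))

rng = np.random.default_rng(4)
n = 10000
Z = rng.uniform(size=n)
A = rng.binomial(1, 0.3 + 0.4 * Z)        # P(A = 1 | Z) = 1 / a(Z)
Y = rng.binomial(1, 0.2 + 0.6 * Z) * A    # b(z) = P(Y = 1 | Z = z); Y used only when A = 1
print(linear_mean_response(Y, A, Z, lambda z: 1 / (0.3 + 0.4 * z),
                           lambda z: 0.2 + 0.6 * z))   # close to chi = E b(Z) = 0.5
```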
It may be noted that the marginal density f does not enter the first order influence function. Even though the functional depends on f, a rate on the initial estimator of this function is not needed for the construction of the first order estimator. This will be different at second order.
Quadratic estimator
We proceed to the computation of a second order influence function using Lemma 1, by searching for a function χ̃^{(2)}_η such that, for every x_1 = (y_1 a_1, a_1, z_1), and all directions α, β, ϕ,
\[ \frac{\partial}{\partial t}\Big|_{t=0} \tilde\chi^{(1)}_{\eta_t}(x_1) = E_\eta\, \tilde\chi^{(2)}_\eta(x_1, X_2)\, \bigl(A_\eta\alpha + B_\eta\beta + C_\eta\phi\bigr)(X_2). \tag{28} \]
Here the expectation is relative to the variable X_2 only. Let K_η be the kernel of an operator K_η: L2(f) → L2(f) (i.e., K_η g(z_1) = ∫ K_η(z_1, z_2) g(z_2) f(z_2) dν(z_2)), and define
\[ \tilde\chi^{(2)}_\eta(x_1, x_2) = -2\, S_2\Bigl[a_1\, a(z_1)\,\bigl(y_1 - b(z_1)\bigr)\, K_\eta(z_1, z_2)\,\bigl(a_2\, a(z_2) - 1\bigr)\Bigr]. \tag{29} \]
For this choice the right side of Eq. (28) can be seen to reduce to
\[ a_1\,\bigl(y_1 - b(z_1)\bigr)\,(K_\eta\alpha)(z_1) + \bigl(1 - a_1\, a(z_1)\bigr)\,(K_\eta\beta)(z_1), \]
up to terms that depend on one argument only and disappear in the projection onto the degenerate functions.
(Note that var(A a(Z)|Z) = a(Z) − 1.) Thus the choice (29) of χ̃^{(2)}_η satisfies Eq. (28) for every (α, β, ϕ) such that K_ηα = α and K_ηβ = β. Were K_η equal to the identity operator, then Eq. (28) would be satisfied for every (α, β, ϕ), and an exact second order influence function would exist. Unfortunately, the identity operator is not given by a kernel. As in Sect. 4 we have to be satisfied with an influence function that gives partial representation.
To ensure that χ̃^{(2)}_η is symmetric we choose K_η(z_1, z_2) = Π_η(z_1, z_2)/a(z_2) for Π_η a symmetric function. Specifically, we choose Π_η the kernel of an orthogonal projection Π_η: L2(f/a) → L2(f/a) onto a space L. The corresponding operators then (trivially) satisfy K_η g = Π_η g for every g ∈ L2(f/a), and hence K_η will approximate the identity if L is large. The function (29) that results from this choice can be seen to be both symmetric and degenerate, and hence is a candidate “approximate” influence function. If S_2 symmetrizes a function of two variables (i.e., 2 S_2 g(X_1, X_2) = g(X_1, X_2) + g(X_2, X_1)), then this influence function can be written as
\[ \tilde\chi^{(2)}_\eta(x_1, x_2) = -2\, S_2\Bigl[a_1\, a(z_1)\,\bigl(y_1 - b(z_1)\bigr)\, \frac{\Pi_\eta(z_1, z_2)}{a(z_2)}\,\bigl(a_2\, a(z_2) - 1\bigr)\Bigr]. \tag{30} \]
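Under the reconstructed form (30), degeneracy can be checked directly from the two conditional-mean identities noted earlier: for every fixed x_1,
\[ E_\eta\, \tilde\chi^{(2)}_\eta(x_1, X_2) = 0, \qquad\text{since}\qquad E\bigl(A\, a(Z) - 1 \mid Z\bigr) = 0 \ \text{ and } \ E\bigl(A\, a(Z)(Y - b(Z)) \mid Z\bigr) = 0, \]
and the same holds in the other argument by symmetry.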
For an initial estimator η̂ = (â, b̂, f̂) based on independent observations we now construct the estimator (8), as sketched below.
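A self-contained sketch (my own; it instantiates the reconstructed form (30), so the exact weighting should be checked against the original) with a histogram basis on [0, 1]: for piecewise-constant projections the weight f/a enters only through the cell masses W_c = ∫_cell (f/a) dν = P(A = 1, Z ∈ cell), so the projection kernel is Π(z_1, z_2) = 1{z_1, z_2 in same cell}/W_cell.

```python
import numpy as np

def quadratic_mean_response(Y, A, Z, a_hat, b_hat, W_hat, k):
    n = len(Y)
    cell = np.minimum((Z * k).astype(int), k - 1)     # histogram cell of each Z_i
    res_b = A * a_hat(Z) * (Y - b_hat(Z))             # A a(Z) (Y - b(Z))
    res_a = (A * a_hat(Z) - 1) / a_hat(Z)             # (A a(Z) - 1) / a(Z)
    linear = np.mean(res_b + b_hat(Z))                # chi(eta_hat) + P_n of (26)
    # (1/2) U_n of (30): -(n(n-1))^{-1} sum_{i != j} res_b[i] Pi(Z_i, Z_j) res_a[j]
    S_b = np.bincount(cell, weights=res_b, minlength=k)
    S_a = np.bincount(cell, weights=res_a, minlength=k)
    S_ba = np.bincount(cell, weights=res_b * res_a, minlength=k)
    corr = -np.sum((S_b * S_a - S_ba) / W_hat) / (n * (n - 1))
    return linear + corr

# Illustration with oracle nuisances; in practice a_hat, b_hat and the cell
# masses W_hat would be computed from an independent sample, as in the text.
rng = np.random.default_rng(5)
n, k = 20000, 50
Z = rng.uniform(size=n)
A = rng.binomial(1, 0.3 + 0.4 * Z)
Y = rng.binomial(1, 0.2 + 0.6 * Z) * A
centers = (np.arange(k) + 0.5) / k
W = (0.3 + 0.4 * centers) / k                         # oracle masses of f/a = P(A=1|z) f(z)
print(quadratic_mean_response(Y, A, Z, lambda z: 1 / (0.3 + 0.4 * z),
                              lambda z: 0.2 + 0.6 * z, W, k))   # close to chi = 0.5
```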
Let Ê and v̂ar denote conditional expectation and variance given the observations used to construct η̂, and let ‖ · ‖₂ be the norm of L2(f/a). Assume that the true functions a, f and the estimators â, f̂ are bounded away from 0 and ∞.
Theorem 1 The estimator χ̂ = χ(η̂) + ℙ_n χ̃^{(1)}_{η̂} + ½ 𝕌_n χ̃^{(2)}_{η̂} with (approximate) influence functions χ̃^{(1)}_{η̂} and χ̃^{(2)}_{η̂} defined by (26) and (30), with Π_{η̂} the kernel of an orthogonal projection in L2(f̂/â) onto a k-dimensional linear subspace L, satisfies
\[ \bigl|\hat E\,(\hat\chi - \chi(\eta))\bigr| \lesssim \Bigl\|\frac fa - \frac{\hat f}{\hat a}\Bigr\|_2\, \|\hat a - a\|_2\, \|\hat b - b\|_2 + \bigl\|(I - \Pi_\eta)(\hat a - a)\bigr\|_2\, \bigl\|(I - \Pi_\eta)(\hat b - b)\bigr\|_2, \]
\[ \widehat{\operatorname{var}}\,(\hat\chi) \lesssim \frac1n + \frac{k}{n^2}, \]
for Π_η the corresponding orthogonal projection in L2(f/a).
Proof From Eqs. (27) and (30) we have
\[ \hat E\,(\hat\chi - \chi(\eta)) = \int (a - \hat a)(\hat b - b)\, \frac fa\, d\nu + \tfrac12\, (P_\eta \times P_\eta)\, \tilde\chi^{(2)}_{\hat\eta}, \]
where the second term on the right is a double integral with the estimated kernel Π_{η̂} appearing between (functions of) the errors â − a and b̂ − b.
The double integral on the far right with Π_{η̂} replaced by Π_η can be written as the single integral ∫ (â − a) Π_η(b̂ − b) (f/a) dν. Added to the first integral on the right this gives
\[ -\int (\hat a - a)\, (I - \Pi_\eta)(\hat b - b)\, \frac fa\, d\nu = -\int (I - \Pi_\eta)(\hat a - a)\; (I - \Pi_\eta)(\hat b - b)\, \frac fa\, d\nu. \]
By the Cauchy-Schwarz inequality this is bounded in absolute value by the second term in the upper bound for the bias.
Replacement of Π_{η̂} by Π_η in the double integral gives a difference that is a double integral against the kernel Π_{η̂} − Π_η.
By the Cauchy-Schwarz inequality the absolute value of this is bounded above by
\[ \|\hat a - a\|_2\; \bigl\|\Pi_{\hat\eta} - \Pi_\eta\bigr\|\; \|\hat b - b\|_2, \]
for ‖Π_{η̂} − Π_η‖ an operator norm.
Here let M_w be multiplication by the function w = (f/a)/(f̂/â), and let Π_{η̂} be the orthogonal projection for the measure defined by the density f̂/â. Considering Π_{η̂} as the projection in L2(f̂/â) with weight 1, and Π_η as the weighted projection in L2(f̂/â) with weight function w, we can apply Lemma 4 to the middle term and conclude that this is bounded in absolute value by a multiple of ‖1 − w‖. Because we assume that the functions f/a and f̂/â are bounded away from zero and infinity, this can be seen to yield the first term in the upper bound on the bias.
The function χ̃^{(1)}_{η̂} is uniformly bounded and hence the (conditional) variance of ℙ_n χ̃^{(1)}_{η̂} is of the order O_P(1/n). Thus for the variance bound it suffices to consider the (conditional) variance of 𝕌_n χ̃^{(2)}_{η̂}. In view of Lemma 6 this is bounded above by
\[ \frac4n\, \hat E\, \bigl(\tilde\chi^{(2)}_{\hat\eta}\bigr)_1^2(X_1) + \frac{2}{n(n-1)}\, \hat E\, \bigl(\tilde\chi^{(2)}_{\hat\eta}\bigr)^2(X_1, X_2), \qquad \bigl(\tilde\chi^{(2)}_{\hat\eta}\bigr)_1(x_1) = \int \tilde\chi^{(2)}_{\hat\eta}(x_1, x_2)\, dP_\eta(x_2). \]
The variables A â(Z)(Y − b̂(Z)) and (A â(Z) − 1)/â(Z) are uniformly bounded. Hence the last term on the right is bounded above by a multiple of (P_η × P_η) Π_{η̂}²/(n(n−1)), which is bounded by a multiple of k/(n(n−1)), by Lemma 3. The first order term is of the order O(1/n). To see this we first note that
\[ \bigl(\tilde\chi^{(2)}_{\hat\eta}\bigr)_1(x_1) = -\, a_1 \hat a(z_1)\bigl(y_1 - \hat b(z_1)\bigr)\, \bigl(\Pi_{\hat\eta}\, g_1\bigr)(z_1) - \frac{a_1 \hat a(z_1) - 1}{\hat a(z_1)}\, \bigl(\Pi_{\hat\eta}\, g_2\bigr)(z_1), \]
for functions g_1 and g_2 with ‖g_1‖ ≲ ‖â − a‖₂ and ‖g_2‖ ≲ ‖b̂ − b‖₂.
Here the variables a_1 â(z_1)(y_1 − b̂(z_1)) and (a_1 â(z_1) − 1)/â(z_1) are uniformly bounded, and the second moment of Π_{η̂} g under P_η is bounded by a multiple of the second moment of g in L2(f̂/â), for every g.
Conclusion
Assume that the parameters a, b and f/a are known to be “regular” of degrees α, β and ϕ, respectively, in the sense that there exists a sequence of k-dimensional linear spaces L_k such that, for some constant C,
\[ \min_{l \in L_k} \|a - l\|_2 \le C\, k^{-\alpha/d}, \qquad \min_{l \in L_k} \|b - l\|_2 \le C\, k^{-\beta/d}, \qquad \min_{l \in L_k} \Bigl\|\frac fa - l\Bigr\|_2 \le C\, k^{-\phi/d}. \]
This is true, for instance, if the functions a, b and f/a are defined on a compact, convex domain in ℝ^d and are known to belong to Hölder (or Besov) spaces of functions of smoothness α, β and ϕ. The approximation is then valid even with the uniform norm on the left side, where the spaces L_k can be taken to be generated by polynomials, splines or wavelets.
In this case there also exist estimators â, b̂ and f̂/â that achieve convergence rates n^{-α/(2α+d)}, n^{-β/(2β+d)} and n^{-ϕ/(2ϕ+d)}, respectively, uniformly over these a-priori models. Then the estimator of Theorem 1 attains the square rate of convergence
\[ k^{-2(\alpha+\beta)/d} + n^{-2\bigl(\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} + \frac{\phi}{2\phi+d}\bigr)} + \frac1n + \frac{k}{n^2}. \]
The optimal value of k balances the first and fourth terms and is of the order k ∼ n^{2d/(d+2α+2β)}. The resulting rate is n^{-γ} for
\[ \gamma = \frac{2(\alpha+\beta)}{2(\alpha+\beta)+d} \;\wedge\; \Bigl(\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} + \frac{\phi}{2\phi+d}\Bigr) \;\wedge\; \frac12. \]
This reduces to the rate n^{-1/2} under condition (25), but also if (α + β)/2 ≥ d/4 and ϕ is sufficiently large:
\[ \frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} + \frac{\phi}{2\phi+d} \ge \frac12. \]
(In this case we can also choose k = n, independently of α and β.) If the rate n^{-γ} is slower than n^{-1/2}, then it is still better than the rate n^{-α/(2α+d)-β/(2β+d)} obtained by the linear estimator (3).
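To make the comparison concrete, a worked example under the reconstructed rate formula above: for d = 8 and α = β = 3/2 condition (25) fails, since
\[ \frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} = \frac{3}{22} + \frac{3}{22} = \frac{3}{11} < \frac12, \qquad \gamma = \frac{2(\alpha+\beta)}{2(\alpha+\beta)+d} = \frac{6}{14} = \frac37, \]
so that (for sufficiently large ϕ) the quadratic estimator attains the rate n^{-3/7}, whereas the linear estimator attains only n^{-3/11}.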
Thus the quadratic estimator outperforms the linear estimator.
6 Technical results
Let L be a given closed subspace of L2(μ) and w: 𝒳 → ℝ a bounded, measurable function. Define operators Π, Π_w: L2(μ) → L by
\[ \Pi g = \operatorname*{argmin}_{l \in L} \int (g - l)^2\, d\mu, \qquad \Pi_w g = \operatorname*{argmin}_{l \in L} \int (g - l)^2\, w\, d\mu. \]
Thus Π is the ordinary orthogonal projection onto the space L, and Π_w is a weighted projection. The projections can be characterized by the orthogonality relationships ∫ (g − Πg) l dμ = 0 and ∫ (g − Π_w g) l w dμ = 0, for every l ∈ L.
Lemma 4 Let Π_w and Π be the weighted projections onto a fixed subspace L of L2(μ) relative to the weight functions w and 1, respectively, and let M_w be multiplication by the function w. Then ‖Π_w g − Π M_w g‖ ≤ ‖(I − M_w) Π_w g‖ for every g ∈ L2(μ).
Proof The orthogonality relationships for the projections Π and Π_w imply that, for every l ∈ L and g ∈ L2(μ),
\[ \int \bigl(\Pi_w g - \Pi(wg)\bigr)\, l\, d\mu = \int \Pi_w g\, l\, d\mu - \int w g\, l\, d\mu = \int (1 - w)\, \Pi_w g\; l\, d\mu. \]
Because Π_w g − Π(wg) is contained in L, we may take l = Π_w g − Π(wg) to obtain
\[ \bigl\|\Pi_w g - \Pi(wg)\bigr\|^2 = \int (1 - w)\, \Pi_w g\, \bigl(\Pi_w g - \Pi(wg)\bigr)\, d\mu. \]
An application of the Cauchy-Schwarz inequality and next cancellation of one factor gives that ‖Π_w g − Π(wg)‖ ≤ ‖(1 − w) Π_w g‖. The right side is bounded above by ‖1 − w‖_∞ ‖Π_w g‖.
Lemma 5 For degenerate, symmetric functions g: 𝒳 × 𝒳 → ℝ we have E 𝕌_n g = 0 and
\[ \operatorname{var} \mathbb{U}_n g = \frac{2}{n(n-1)}\, (P \times P)\, g^2. \]
Lemma 6 For any measurable, symmetric function f: 𝒳 × 𝒳 → ℝ, and f_1(x_1) = ∫ f(x_1, x_2) dP(x_2),
\[ \operatorname{var} \mathbb{U}_n f \le \frac4n\, P f_1^2 + \frac{2}{n(n-1)}\, (P \times P)\, f^2. \]
Proof The first lemma follows by writing the square of the sum as a double sum (over pairs of pairs of indices). The expected values of the off-diagonal terms vanish by degeneracy.
For a general measurable function f the mean (P × P)f is the projection of f onto the constant functions, and the function f̄_1 defined by f̄_1(x_1) = f_1(x_1) − (P × P)f is the projection of f in L2(P × P) onto the mean-zero functions of one variable. The decomposition
\[ f(x_1, x_2) = (P \times P)f + \bar f_1(x_1) + \bar f_1(x_2) + f_{12}(x_1, x_2), \]
where f_{12} is defined by the displayed equation, yields the Hoeffding decomposition of the U-statistic 𝕌_n f in orthogonal parts, with f_{12} degenerate. Using Lemma 5 we see that the variance of 𝕌_n f is equal to (4/n) P f̄_1² + 2 (P × P) f_{12}²/(n(n−1)). The norm of f̄_1 is smaller than the norm of f_1. Because f_{12} is a projection of f, its norm is bounded by the norm of f.
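A quick Monte Carlo check of Lemma 5 (my own, for a textbook kernel): with g(x_1, x_2) = x_1 x_2, which is degenerate and symmetric under P the standard normal distribution, var(𝕌_n g) = 2 (P × P) g²/(n(n − 1)) = 2/(n(n − 1)).

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 10, 200_000
X = rng.normal(size=(reps, n))
# U_n g = ((sum X)^2 - sum X^2) / (n(n-1)) for the kernel g(x1, x2) = x1 x2
U = (X.sum(axis=1) ** 2 - (X ** 2).sum(axis=1)) / (n * (n - 1))
print(U.var(), 2 / (n * (n - 1)))   # both approximately 0.0222
```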
Footnotes
Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References
- Begun JM, Hall WJ, Huang WM, Wellner JA. Information and asymptotic efficiency in parametric–nonparametric models. Ann Stat. 1983;11(2):432–452.
- Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā Ser A. 1988;50(3):381–393.
- Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins series in the mathematical sciences. Baltimore: Johns Hopkins University Press; 1993.
- Bolthausen E, Perkins E, van der Vaart A. Lectures on probability theory and statistics. In: Bernard P, editor. Lecture notes in mathematics. Lectures from the 29th summer school on probability theory held in Saint-Flour, July 8–24, 1999. Berlin: Springer; 2002.
- Emery M, Nemirovski A, Voiculescu D. Lectures on probability theory and statistics. In: Bernard P, editor. Lecture notes in mathematics. Lectures from the 28th summer school on probability theory held in Saint-Flour, 17 August–3 September, 1998. Berlin: Springer; 2000.
- Klaassen CAJ. Consistent estimation of the influence function of locally asymptotically linear estimators. Ann Stat. 1987;15(4):1548–1562.
- Koševnik JA, Levit BJ. On a nonparametric analogue of the information matrix. Teor Verojatnost i Primenen. 1976;21(4):759–774.
- Laurent B. Efficient estimation of integral functionals of a density. Ann Stat. 1996;24(2):659–681.
- Laurent B. Estimation of integral functionals of a density and its derivatives. Bernoulli. 1997;3(2):181–211.
- Le Cam L. Locally asymptotically normal families of distributions. Certain approximations to families of distributions and their use in the theory of estimation and testing hypotheses. Univ California Publ Stat. 1960;3:37–98.
- Murphy SA, van der Vaart AW. On profile likelihood (with comments and a rejoinder by the authors). J Am Stat Assoc. 2000;95(450):449–485.
- Pfanzagl J. Contributions to a general asymptotic statistical theory (with the assistance of Wefelmeyer W). Lecture notes in statistics, vol 13. New York: Springer; 1982.
- Pfanzagl J. Asymptotic expansions for general statistical models (with the assistance of Wefelmeyer W). Lecture notes in statistics, vol 31. Berlin: Springer; 1985.
- Robins J, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: some questions and an answer”. Stat Sin. 2001;11(4):920–936.
- Robins J, Li L, Tchetgen E, van der Vaart A. Asymptotic normality of quadratic estimators. Ann Stat. 2007 (submitted).
- Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS epidemiology: methodological issues. Boston: Birkhäuser; 1992. pp. 297–331.
- Rotnitzky A, Robins JM. Semi-parametric estimation of models for means and covariances in the presence of missing data. Scand J Stat. 1995;22(3):323–333.
- van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer series in statistics. New York: Springer; 2003.
- van der Vaart A. On differentiable functionals. Ann Stat. 1991;19(1):178–204.
- van der Vaart A. Maximum likelihood estimation with partially censored data. Ann Stat. 1994;22(4):1896–1916.
- van der Vaart AW. Statistical estimation in large parameter spaces. CWI tract, vol 44. Amsterdam: Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica; 1988.
- van der Vaart AW. Asymptotic statistics. Cambridge series in statistical and probabilistic mathematics, vol 3. Cambridge: Cambridge University Press; 1998.
- Von Mises R. On the asymptotic distribution of differentiable statistical functions. Ann Math Stat. 1947;18:309–348.
- Wellner JA, Klaassen CAJ, Ritov Y. Semiparametric models: a review of progress since BKRW (1993). In: Frontiers in statistics. London: Imp Coll Press; 2006. pp. 25–44.