Author manuscript; available in PMC 2020 May 13. Published in final edited form as: J Am Stat Assoc. 2018 Sep 13;114(527):1174–1190. doi:10.1080/01621459.2018.1482752

Toward computerized efficient estimation in infinite-dimensional models

Marco Carone 1, Alexander R Luedtke 2, Mark J van der Laan 3

Abstract

Despite the risk of misspecification they are tied to, parametric models continue to be used in statistical practice because they are simple and convenient to use. In particular, efficient estimation procedures in parametric models are easy to describe and implement. Unfortunately, the same cannot be said of semiparametric and nonparametric models. While the latter often reflect the level of available scientific knowledge more appropriately, performing efficient inference in these models is generally challenging. The efficient influence function is a key analytic object from which the construction of asymptotically efficient estimators can potentially be streamlined. However, the theoretical derivation of the efficient influence function requires specialized knowledge and is often a difficult task, even for experts. In this paper, we present a novel representation of the efficient influence function and describe a numerical procedure for approximating its evaluation. The approach generalizes the nonparametric procedures of Frangakis et al. (2015) and Luedtke et al. (2015) to arbitrary models. We present theoretical results to support our proposal, and illustrate the method in the context of several semiparametric problems. The proposed approach is an important step toward automating efficient estimation in general statistical models, thereby rendering more accessible the use of realistic models in statistical analyses.

Keywords: asymptotic efficiency, canonical gradient, efficient influence function, nonparametric and semiparametric models, pathwise differentiability

1. Introduction

1.1. Efficient estimation of smooth target parameters

Efficient estimation techniques are appealing because they maximally exploit available information and minimize the uncertainty of the resulting scientific findings. Efficiency is most often studied in an asymptotic sense due to its intractability in many finite-sample problems. Characterizing asymptotic efficiency and constructing asymptotically efficient estimators have been an important focus of methodological and theoretical research in statistics. For convenience, from here on, we ascribe without explicit mention an asymptotic sense to the terms efficient and efficiency.

In the context of parametric models, a simple efficiency theory has been available for nearly a century, largely fueled by Fisher’s work on maximum likelihood estimation. In such models, efficiency is characterized by the Cramér–Rao bounds and efficient estimators can generally be obtained via maximum likelihood (e.g., Hájek, 1970, 1972; Le Cam, 1972). When parametric models are adopted in practice, it is often because they are simple and convenient to use. However, the use of such models carries the risk of model misspecification, which may adversely affect the scientific process. In many scientific problems, the available background knowledge simply does not justify the use of such restrictive statistical models.

Infinite-dimensional models – either nonparametric or semiparametric – offer a more flexible alternative. These richer models mitigate the potential for model misspecification and often more accurately reflect the level of available prior knowledge. Unfortunately, establishing efficiency bounds for target parameters in infinite-dimensional models can be a complex task. The development of a general efficiency theory, valid for arbitrary statistical models, is a more recent accomplishment: except for the early seminal contribution of Stein (1956), developments in this area began in the late 1970s and early 1980s with the works of Koshevnik and Levit (1977), Pfanzagl (1982) and Begun et al. (1983), among others, and continued throughout the 1990s (e.g., van der Vaart, 1991; Newey, 1994). Notably, it builds upon notions of differential geometry and functional analysis. In certain cases, a generalized notion of maximum likelihood, as described by Kiefer and Wolfowitz (1956), for example, can still be used to produce efficient estimators of certain parameters – see, for example, Wong and Severini (1991) for a general study under broad conditions. In other cases though, the statistical model or target parameter may be too complex for a maximum likelihood estimator to exist, let alone be well-behaved.

Several approaches have been proposed for efficient estimation in the context of infinite-dimensional models. In problems in which the likelihood approach is problematic, some authors have successfully repaired the likelihood function (e.g., van der Laan, 1996; Maathuis and Wellner, 2008). Regularization techniques, including penalization and the method of sieves, have proven useful for devising efficient estimators (e.g., Shen, 1997; Newey, 1997; Chen, 2007). General approaches for efficient estimation based upon knowledge of the efficient influence function have been developed and extensively studied (e.g., Pfanzagl, 1982; Bickel et al., 1997; van der Laan and Robins, 2003; van der Laan and Rubin, 2006). These flexible approaches have been used in the past decade to perform valid inference incorporating a variety of machine learning tools. In particular, they have been employed to conduct robust and efficient inference via ensemble learning, wherein nuisance functions are estimated using a data-driven amalgamation of many learning strategies. The latter can range from simple parametric techniques to complex machine learning algorithms, and can include penalization and sieve-based learning methods. A variety of illustrations are provided in van der Laan and Rose (2011, 2018), for example.

1.2. The efficient influence function and its role in efficient estimation

The efficient influence function, hereafter referred to as the EIF, is a key object in efficiency theory for infinite-dimensional models. It bears this name because it is the influence function of any efficient estimator of the parameter of interest relative to the statistical model with respect to which it was calculated. As such, the efficiency bound for this parameter, and thus the asymptotic variance of any efficient estimator, can be derived from the EIF. Knowledge of the EIF has important implications for practice. First, the relative efficiency of a candidate estimator can be assessed objectively by comparing its variance to the efficiency bound implied by the EIF. Second, using an EIF-based estimate of the efficiency bound, asymptotically valid Wald confidence intervals can be constructed using any available efficient estimator, thereby circumventing the need to rely on bootstrap schemes that may either be invalid or computationally infeasible in a given problem. Third, knowledge of the EIF can assist investigators in planning the design of their study by helping quantify the effect of different sampling strategies on the variability of any efficient estimator used. Concrete examples include the optimal selection of subgroups on which to process resource-intensive measurements in two-phase sampling designs (see, e.g., Gilbert et al., 2014) and the optimal design of group-sequential adaptive trials (e.g., Chambaz and van der Laan, 2014). Fourth, as it is the influence function of any efficient estimator, the EIF can be used to investigate the robustness of efficient estimators (e.g., Hampel et al., 2011). For example, the EIF can be used to assess the approximate (i.e., asymptotic) contribution of individual observations to the value of an efficient estimator. Finally, and perhaps most importantly, as suggested above, if the analytic form of the EIF is available, efficient estimators can themselves be constructed rather easily. To do so, several approaches may be used, including, for example, gradient-based estimating equations (e.g., van der Laan and Robins, 2003; Tsiatis, 2007), Newton-Raphson one-step corrections (e.g., Pfanzagl, 1982) and targeted minimum loss-based estimation (e.g., van der Laan and Rose, 2011). Thus, there is a strong motivation for deriving the EIF in a given statistical problem.

Unfortunately, the analytic computation of the EIF is seldom straightforward. It generally involves finding a candidate influence function and projecting it onto the tangent space of the statistical model, which must itself be characterized – the effort can be mathematically intricate. Over the years, many techniques have been developed to facilitate this task in certain classes of problems – the discretization technique of Chamberlain (1987) is one such example. Nevertheless, this calculation remains a rather specialized skill, mastered mostly by a small collection of theoretically-inclined researchers. Worse still, in some problems, the EIF does not even have a closed form, rendering its evaluation and use particularly difficult (e.g., Geskus and Groeneboom, 1999; Quale et al., 2006). The theoretical derivation of EIFs is generally not in the skill set of practicing statisticians. Yet, in many problems, it is a skill needed to facilitate optimal inference in more realistic statistical models. The paucity of this skill may have impeded a broader appreciation and adoption of semiparametric and nonparametric techniques in applications.

In view of this barrier, one naturally wonders whether a suitable numerical approximation could serve as a substitute for the analytic form of the EIF, and further whether its calculation could be computerized. An affirmative answer to this question would render the implementation of efficient inferential techniques in semiparametric and nonparametric models much more accessible to practitioners, and the impact on statistical practice could be significant. Recently, an important first step toward this goal was made by Frangakis et al. (2015): these authors proposed a simple numerical routine for calculating the EIF in the context of nonparametric models when the data are discrete-valued or when the parameter is a smooth functional of the distribution function. In our discussion of their article (see Luedtke et al., 2015), we suggested a regularization of their technique that is valid more broadly within the context of nonparametric models. Nevertheless, neither of these methods formally addresses the more difficult problem of numerically computing the EIF in semiparametric models. As opposed to nonparametric models, for which the tangent space is trivially described, semiparametric models generally have much more complex tangent spaces, and projecting onto them often requires great experience. Identifying a numerical approach for computing the EIF in semiparametric models is therefore a more difficult but also more sorely needed innovation.

1.3. Contributions and organization of this article

In this article, we establish and study a novel representation of the EIF that directly generalizes the nonparametric results of Frangakis et al. (2015) and Luedtke et al. (2015) to arbitrary models. As we illustrate through several examples, this representation can be used to calculate, either analytically or numerically, the EIF of a given parameter in a given statistical model without the specialized knowledge traditionally required for this task. As such, it holds great promise in facilitating the computerization of efficient estimation not only in nonparametric models but also in semiparametric models, as we discuss below.

This paper is organized as follows. In Section 2, we present our novel representation of the EIF for use in arbitrary statistical models and present high-level conditions under which it holds. In Section 3, we study each of these conditions and determine sufficient conditions under which the validity of this representation is guaranteed. We explore various practical issues regarding the implementation of our proposal and provide guidelines for use in Section 4. In Section 5, we illustrate the use of the approach in the context of four semiparametric problems. We provide concluding remarks in Section 6. While Theorem 1 is proved in the body of the paper, the proofs of Theorems 2–5 are provided in an Appendix.

2. Numerical calculation of the efficient influence function

2.1. Preliminaries

Suppose that we observe independent d-dimensional variates $X_1,X_2,\ldots,X_n$ following a distribution $P_0$ known only to belong to the statistical model $\mathcal{M}$. We denote by $\mathcal{X}(P)\subseteq\mathbb{R}^d$ the sample space associated to $P\in\mathcal{M}$. We are interested in efficiently inferring about $\psi_0 := \Psi(P_0)$ using the available data, where $\Psi:\mathcal{M}\to\mathbb{R}^q$ represents a pathwise differentiable parameter mapping of interest. Pathwise differentiability ensures the parameter is a sufficiently smooth mapping so as to admit an efficiency theory (see, e.g., Pfanzagl, 1982; Bickel et al., 1997). We denote by $L^2_0(P)$ the Hilbert space of functions from $\mathcal{X}(P)$ to $\mathbb{R}^q$ with mean zero and finite variance under P. The parameter Ψ is said to be pathwise differentiable if there exists some $\chi_P\in L^2_0(P)$ such that, for each regular one-dimensional parametric submodel $\mathcal{M}_0 := \{P_\epsilon : \epsilon\in\mathcal{E}\}\subseteq\mathcal{M}$ with $\mathcal{E}$ an interval containing zero and $P_{\epsilon=0}=P$, the pathwise derivative $\tfrac{d}{d\epsilon}\Psi(P_\epsilon)|_{\epsilon=0}$ can be represented as the inner product $\int \chi_P(u)\,s(u)\,dP(u)$, where s is the score for ϵ at ϵ = 0 in $\mathcal{M}_0$ (Pfanzagl, 1982). Any such element $\chi_P$ is said to be a gradient of Ψ at P relative to $\mathcal{M}$. The tangent space $T_{\mathcal{M}}(P)\subseteq L^2_0(P)$ of $\mathcal{M}$ at P is defined as the closure of the linear span of scores at P arising from regular one-dimensional P-dominated parametric submodels of $\mathcal{M}$ through P. The canonical gradient is the unique gradient contained in $T_{\mathcal{M}}(P)$ and corresponds to the EIF under sampling from P. Throughout, we refer to the EIF at P as $\phi_P$ and write $\phi_P(x)$ for the evaluation of $\phi_P$ at the observation value x. The asymptotic variance of an efficient estimator of $\psi_0$ relative to model $\mathcal{M}$ is given by $\int \phi_{P_0}(u)\,\phi_{P_0}(u)^\top dP_0(u)$. Without loss of generality, we assume q = 1, as the general case can be handled by applying the developments herein to each component.

If pathwise differentiability holds uniformly over paths in a neighborhood around P, then for any $P_1\in\mathcal{M}$ close enough to P, the parameter admits the linearization

$$\Psi(P_1)-\Psi(P) = \int \phi_{P_1}(u)\,d(P_1-P)(u)+R(P_1,P) = -\int \phi_{P_1}(u)\,dP(u)+R(P_1,P) \qquad (2.1)$$

where $R(P_1,P)$ is a second-order remainder term, and the second equality follows since $\int \phi_{P_1}(u)\,dP_1(u)=0$. This representation, which is no more than a first-order Taylor approximation over the model space, holds for most smooth parameters arising in practice. The precise form of R is generally established by hand on a case-by-case basis. This linearization is critical for motivating and studying the use of both Newton-Raphson one-step correction and targeted minimum loss-based estimation to construct efficient estimators. It is also at the heart of our current proposal for obtaining a numerical approximation to the EIF value 𝜙P(x) at a given distribution P ∈ M and observation value x ∈ X(P).

2.2. Nonparametric models

Recently, Frangakis et al. (2015) presented one such proposal based on the representation of 𝜙P(x) as the Gâteaux derivative of Ψ at P in the direction of $\delta_x - P$, where $\delta_x$ represents the degenerate distribution at x. Of course, this can also be seen as the pathwise derivative $\tfrac{d}{d\epsilon}\Psi(P_\epsilon)|_{\epsilon=0}$ of Ψ at P along the linear perturbation path $\{P_\epsilon := (1-\epsilon)P+\epsilon\delta_x : 0\le\epsilon\le1\}$ between P and $\delta_x$ – this simple observation will be helpful when tackling arbitrary models. Here and throughout, any such derivative is of course interpreted as a right derivative. To computerize the process of calculating 𝜙P(x), these authors suggested approximating this derivative by the slope of the secant line through $(0,\Psi(P))$ and $(\epsilon,\Psi(P_\epsilon))$ for a small ϵ > 0. In our discussion of Frangakis et al. (2015), we pointed out sufficient conditions that guarantee that this indeed approximates 𝜙P(x) (Luedtke et al., 2015). For example, this approach is valid whenever the model $\mathcal{M}$ is nonparametric and the sample space $\mathcal{X}(P)$ is finite. However, if the parameter Ψ depends on local features of the distribution, this method may fail when $\mathcal{X}(P)$ is infinite, such as when any component of X is continuous under P. We proposed a slight modification of the procedure of Frangakis et al. (2015) to remedy this limitation. Specifically, we proposed replacing the degenerate distribution $\delta_x$ at x by a distribution $H_{x,\lambda}$ dominated by P and such that $\int g(u)\,dH_{x,\lambda}(u)\to g(x)$ as λ → 0 for all g in a sufficiently large class of functions. This amounts to replacing the degenerate distribution by a nearly degenerate distribution with smoothing parameter λ > 0. In a technical report published contemporaneously, Ichimura and Newey (2015) also suggested this approach. As stated in Luedtke et al. (2015), under certain regularity conditions and provided $\mathcal{M}$ is nonparametric, it is generally the case that

$$\phi_P(x) = \lim_{\lambda\to0}\left[\frac{d}{d\epsilon}\Psi(P_{\epsilon,\lambda})\Big|_{\epsilon=0}\right] \qquad (2.2)$$

where we have defined the linear perturbation path $P_{\epsilon,\lambda} := (1-\epsilon)P + \epsilon H_{x,\lambda}$. Representation (2.2) is useful when the parameter is simple enough that calculating the derivative of ϵ ↦ Ψ(Pϵ,𝜆) is analytically convenient. Otherwise, as in Frangakis et al. (2015), the derivative in (2.2) can be approximated, for example, by the secant line slope

$$\frac{\Psi(P_{\epsilon,\lambda})-\Psi(P)}{\epsilon}$$

for small ϵ and λ. This operation only requires the ability to evaluate Ψ on a given distribution. Generally, as we highlighted in Luedtke et al. (2015), ϵ must be chosen much smaller than λ to obtain an accurate approximation – this emphasizes that the limit and derivative operators in (2.2) cannot generally be interchanged. We discuss this point in detail later. Because representation (2.2) is a special case of the general result we present below, we defer a statement of regularity conditions and a formal proof until then.
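
As a minimal illustration of this secant-slope recipe, consider the toy nonparametric parameter Ψ(P) := (E_P[X])², whose EIF is known to be 𝜙P(x) = 2E_P[X]{x − E_P[X]}. The sketch below is ours (the choice of P, of x, and all names are illustrative assumptions); for this smooth functional of the mean, the mixture mean is available in closed form and λ plays essentially no role, unlike for parameters depending on local features of the distribution.

```python
import numpy as np
from scipy import stats

# Sketch: secant-slope approximation (2.2) of the nonparametric EIF of
# Psi(P) = (E_P[X])^2, whose known EIF is phi_P(x) = 2 E_P[X] (x - E_P[X]).
P = stats.norm(1.0, 1.0)            # assumed toy choice of P
x, lam, eps = 2.0, 1e-3, 1e-7      # eps taken much smaller than lam

def Psi_mixture(eps, lam):
    # Mean of P_{eps,lam} = (1 - eps) P + eps Uniform(x - lam, x + lam);
    # the uniform H_{x,lam} has mean x, so the mixture mean is closed form.
    return ((1 - eps) * P.mean() + eps * x) ** 2

slope = (Psi_mixture(eps, lam) - Psi_mixture(0.0, lam)) / eps
print(slope)                        # approx 2 * 1 * (2 - 1) = 2
```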

2.3. Arbitrary models

When the model M is not nonparametric, representation (2.2) generally does not hold. Except when ϵ = 0, the linear perturbation path described by Pϵ,λ is not necessarily contained in the model. As such, the parameter may not even be defined on this path. Even if it is, the approximation suggested by this representation generally only yields the EIF of Ψ relative to a nonparametric model rather than the actual model. While the nonparametric EIF is still an influence function in M, it is not typically efficient relative to M. This should not be surprising since the expression in (2.2) in no way acknowledges constraints implied by M.

The linear perturbation approach of Frangakis et al. (2015) and Luedtke et al. (2015) does not generally lead to the EIF in semiparametric and parametric models. Nevertheless, a first-order analysis of the parameter mapping is helpful to determine conditions under which, in arbitrary models, the EIF can be expressed in terms of a pathwise derivative over some perturbation path. Below, we denote by $P^*_{\epsilon,\lambda}\in\mathcal{M}$ a candidate perturbation of P determined by a choice of $H_{x,\lambda}$ and ϵ. High-level conditions under which $P^*_{\epsilon,\lambda}$ serves as an adequate perturbation are outlined in the theorem below. Here and throughout, we use the shorthand notation $\phi^*_{\epsilon,\lambda}$ to denote the evaluation $\phi_{P^*_{\epsilon,\lambda}}$ of the EIF at $P^*_{\epsilon,\lambda}$.

Theorem 1.

The EIF of Ψ relative to $\mathcal{M}$ at P ∈ M evaluated at observation value x is given by

$$\phi_P(x) = \lim_{\lambda\to0}\left[\frac{d}{d\epsilon}\Psi(P^*_{\epsilon,\lambda})\Big|_{\epsilon=0}\right] \qquad (2.3)$$

provided the following conditions hold:

  • (A1)

    (near-solution of EIF estimating equation) $\lim_{\lambda\to0}\lim_{\epsilon\to0}\int \phi^*_{\epsilon,\lambda}(u)\,dP_{\epsilon,\lambda}(u)\big/\epsilon = 0$;

  • (A2)

    (continuity of EIF) $\lim_{\lambda\to0}\lim_{\epsilon\to0}\int \phi^*_{\epsilon,\lambda}(u)\,d(H_{x,\lambda}-P)(u) = \phi_P(x)$;

  • (A3)

    (preservation of convergence rate) $\lim_{\lambda\to0}\lim_{\epsilon\to0} R(P^*_{\epsilon,\lambda},P)\big/\epsilon = 0$.

Proof. Setting $P_1 = P^*_{\epsilon,\lambda}$ in (2.1), we note that

$$\Psi(P^*_{\epsilon,\lambda})-\Psi(P) = -\int \phi^*_{\epsilon,\lambda}(u)\,dP(u)+R(P^*_{\epsilon,\lambda},P) = \int \phi^*_{\epsilon,\lambda}(u)\,d(P_{\epsilon,\lambda}-P)(u) - \int \phi^*_{\epsilon,\lambda}(u)\,dP_{\epsilon,\lambda}(u) + R(P^*_{\epsilon,\lambda},P).$$

The result follows from (A1), (A2) and (A3) upon noting that, since $P_{\epsilon,\lambda}-P=\epsilon(H_{x,\lambda}-P)$, we have that

$$\frac{\Psi(P^*_{\epsilon,\lambda})-\Psi(P)}{\epsilon} = \int \phi^*_{\epsilon,\lambda}(u)\,d(H_{x,\lambda}-P)(u) - \frac{\int \phi^*_{\epsilon,\lambda}(u)\,dP_{\epsilon,\lambda}(u)}{\epsilon} + \frac{R(P^*_{\epsilon,\lambda},P)}{\epsilon}.$$

We must now determine a strategy for constructing a perturbation $P^*_{\epsilon,\lambda}$ that can be expected to satisfy (A1)–(A3). Although the linear path $\{P_{\epsilon,\lambda} : 0\le\epsilon\le1\}$ is generally inadequate, it appears natural to consider the path determined by the projection $P^*_{\epsilon,\lambda}$ of $P_{\epsilon,\lambda}$ into $\mathcal{M}$, or a suitably regularized version thereof, according to an appropriate divergence. For any $P_1\in\mathcal{M}$, we define $\mathcal{M}(P_1) := \{P_2\in\mathcal{M} : P_2\ll P_1\}\subseteq\mathcal{M}$ as the subset of all probability measures in $\mathcal{M}$ dominated by $P_1$, and let $\mathcal{M}_{NP}\supseteq\mathcal{M}$ represent a nonparametric model containing $\mathcal{M}$. Then, we formally define a novel perturbation path in the model as

$$P^*_{\epsilon,\lambda} := \operatorname*{argmin}_{P_1\in\mathcal{M}(P)} D(P_1, P_{\epsilon,\lambda}), \qquad (2.4)$$

where the function $D:\mathcal{M}_{NP}\times\mathcal{M}_{NP}\to[0,+\infty]$ satisfies the following conditions:

  • (B1)

    for each $P_1,P_2\in\mathcal{M}$, $D(P_1,P_2)\ge 0$, and $D(P_1,P_2)=0$ if and only if $P_1 = P_2$;

  • (B2)
    for any $P_2\in\mathcal{M}$ and $P_1\in\mathcal{M}(P_2)$ such that the Radon–Nikodym derivative of $P_1$ relative to $P_2$ is uniformly bounded in an interval $[m_0,m_1]\subset(0,+\infty)$, provided $P_1\neq P_2$, the inequality

$$K_0(m_0,m_1) \le \frac{D(P_1,P_2)}{\int \left[\frac{dP_1}{dP_2}(u)-1\right]^{2}\,dP_2(u)} \le K_1(m_0,m_1)$$

    holds for constants $0 < K_0(m_0,m_1) < K_1(m_0,m_1) < +\infty$ depending on $m_0$ and $m_1$ alone;

  • (B3)
    for each $P_2\in\mathcal{M}_{NP}$ and each $K\in(0,+\infty)$, setting $P_2^* := \operatorname{argmin}_{P_1\in\mathcal{M}(P_2)} D(P_1,P_2)$, provided the Radon–Nikodym derivative of $P_2^*$ relative to $P_2$ is uniformly bounded in an interval $[m_0,m_1]\subset(0,+\infty)$, there exists a map $B : [0,+\infty)\to[0,+\infty)$, possibly depending on $(K,m_0,m_1)$ and with $\lim_{\delta\to0} B(\delta) = 0$, for which

$$\left\|\frac{dP_2^*}{dP_2}-1\right\|_{2,P_2}\le\delta \quad\text{implies that}\quad \left|\int h(u)\,dP_2(u)\right|\le\delta\,B(\delta)$$

    for each $h\in T_{\mathcal{M}}(P_2^*)$ with $\|h\|_\infty < K$.

Condition (B1) simply requires D to be a divergence function. Condition (B2) allows us to relate values of this divergence and relevant L2 norms. Condition (B3) is used to ensure condition (A1). Heuristically, it requires that, under a given distribution, elements of the tangent space at the projection of this distribution have mean nearly zero if the distribution is close enough to the model. Additional details are provided below.

At least two of the most important divergences fit into this general setup and satisfy all conditions: the Kullback-Leibler (KL) divergence and the Hellinger distance, corresponding respectively to

$$D_{KL}:(P_1,P_2)\mapsto -\int \log\left[\frac{dP_1}{dP_2}(u)\right]dP_2(u) \qquad\text{and}\qquad D_{H}:(P_1,P_2)\mapsto \int\left[1-\sqrt{\frac{dP_1}{dP_2}(u)}\right]^{2}dP_2(u).$$

The fact that these particular divergences indeed satisfy (B1)–(B3) is established in the Supplementary Material. The KL divergence appears to be the most natural choice for several reasons. First, for this choice of divergence, for any given $P_2\in\mathcal{M}_{NP}$ and any $h\in T_{\mathcal{M}}(P_2^*)$ with $P_2^*$ the projection of $P_2$ into $\mathcal{M}(P_2)$, the equality $\int h(u)\,dP_2(u)=0$ can be shown to hold exactly. To see this, suppose that h is in the interior of $T_{\mathcal{M}}(P_2^*)$ and construct the submodel $dP^*_{2,\eta}=(1+\eta h)\,dP_2^*$ for η in a neighborhood of zero. Because $P_2^*$ is a global minimizer of the KL divergence, we have that

$$0 = \frac{d}{d\eta}\left\{-\int \log\left[\frac{dP^*_{2,\eta}}{dP_2}(u)\right]dP_2(u)\right\}\bigg|_{\eta=0} = -\int h(u)\,dP_2(u).$$

If h is instead on the boundary of $T_{\mathcal{M}}(P_2^*)$, the argument above can be repeated for a sequence of interior elements tending to h. Thus, for the KL divergence, an even stronger property than (B3) holds, and it implies in particular that $\int\phi^*_{\epsilon,\lambda}(u)\,dP_{\epsilon,\lambda}(u)=0$ for each ϵ and λ. As a consequence, a stronger version of (A1) holds – the EIF estimating equation is in fact exactly solved – which could improve the resulting approximation of the EIF in practice. Second, because of its central role in likelihood methods, the KL divergence has been studied extensively, and many strategies for likelihood maximization have been described in the statistical literature. Whether other appropriate divergences, such as the Hellinger distance, provide concrete benefits in some problems remains to be seen. In the illustrations of Section 5, we focus primarily on the KL divergence, although we also discuss the use of the Hellinger distance in one example. In the Supplementary Material, we also provide an example in which the use of a seemingly natural yet inappropriate divergence leads to a violation of (2.3). This highlights the importance of verifying conditions (B1)–(B3) for divergences other than $D_{KL}$ and $D_H$, both of which we have already verified to satisfy these conditions.
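
For concreteness, both divergences are straightforward to evaluate numerically for a dominated pair of distributions. The following sketch, with two arbitrary Gaussian choices standing in for $P_1$ and $P_2$ (an assumption of ours), computes $D_{KL}$ and $D_H$ by quadrature.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Sketch: numerical evaluation of D_KL and D_H above for a dominated pair
# P1 << P2 via quadrature; the two Gaussian choices are illustrative only.
u = np.linspace(-10, 10, 100001)
p2 = stats.norm(0.0, 1.0).pdf(u)          # density of P2
p1 = stats.norm(0.3, 1.2).pdf(u)          # density of P1
ratio = p1 / p2                           # Radon-Nikodym derivative dP1/dP2

D_KL = trapezoid(-np.log(ratio) * p2, u)  # -int log(dP1/dP2) dP2
D_H = trapezoid((1 - np.sqrt(ratio)) ** 2 * p2, u)
print(D_KL, D_H)                          # both positive; zero iff P1 = P2
```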

It is useful to scrutinize the conditions of Theorem 1 in the context of the projected perturbation path. Condition (A1) drives in large part our generalization of the procedures of Frangakis et al. (2015) and Luedtke et al. (2015) to arbitrary models. Projecting the path $\{P_{\epsilon,\lambda} : 0\le\epsilon\le1\}$ into $\mathcal{M}$ to obtain $\{P^*_{\epsilon,\lambda} : 0\le\epsilon\le1\}$ is expected to ensure that the score-like equation described in (A1) is solved. Condition (A2) imposes mild continuity requirements on the EIF. Since R is a second-order term, $R(P_{\epsilon,\lambda},P)$ is generally of order $O(\epsilon^2)$ for each fixed λ > 0. Condition (A3) requires that $R(P^*_{\epsilon,\lambda},P)$ be of order o(ϵ) for λ small and ϵ sufficiently smaller. Determining how the projection step and the smoothing parameter λ > 0 affect the rate of this term is critical to establishing whether (A3) holds. This is studied in detail in the next section.

Much of the effort required to use representation (2.3) to compute the EIF goes into identifying the projection $P^*_{\epsilon,\lambda}$ of $P_{\epsilon,\lambda}$ onto the model space. In some cases, this projection can be determined analytically, while in others, a numerical approach must be taken. Regardless, the definition of $P^*_{\epsilon,\lambda}$ does not involve the parameter of interest. Hence, the more challenging portion of the approach is exclusively model-specific, and once it has been successfully tackled, the resulting projection can be used for any parameter a practitioner may wish to study. This contrasts sharply with the traditional approach to deriving the EIF, wherein the statistician must first derive an influence function, characterize the tangent space of the model, and finally project the influence function onto this tangent space. In that approach, both the parameter-specific task – finding an influence function – and the model-specific task – studying the tangent space and how to project onto it – require specialized knowledge. Performing these tasks for a given parameter and model combination does not necessarily provide an easy way of tackling other parameters, in contrast to the approach we propose.

3. Verification of technical conditions

The validity of representation (2.3) is guaranteed under the high-level technical conditions (A1), (A2) and (A3). We now identify lower-level sufficient conditions under which the projected perturbation path constructed in (2.4) satisfies conditions (A1), (A2) and (A3), and thus under which Theorem 1 holds.

Here and throughout, given a function h defined on $\mathcal{X}(P)$, we write $\|h\|_{k,P} := [\int |h(u)|^k\,dP(u)]^{1/k}$ and $\|h\|_{\infty,A} := \sup_{u\in A}|h(u)|$ for any set A. We denote by $S_{x,\lambda}$ the support of $H_{x,\lambda}$, and we define

$$r_k(\lambda) := \left\|\frac{dH_{x,\lambda}}{dP}\right\|_{k,P}.$$

This Radon–Nikodym derivative of $H_{x,\lambda}$ relative to P is defined for each λ > 0 since $H_{x,\lambda}$ is dominated by P by construction. Unless P assigns positive mass to {x} and $H_{x,\lambda}$ is the degenerate distribution at {x}, the value of $r_k(\lambda)$ usually tends to infinity as λ tends to zero. The rate at which this occurs will be critical in our study of the technical conditions listed in Theorem 1.

3.1. Near-solution of the EIF estimating equation

By virtue of being a global optimizer, and in view of condition (B3), $P^*_{\epsilon,\lambda}$ is expected to solve, or at least nearly solve, a rich collection of equations indexed by elements in the tangent space of $\mathcal{M}$ at $P^*_{\epsilon,\lambda}$, including that exhibited in condition (A1). The following theorem establishes a formal regularity condition under which this is indeed the case.

Theorem 2.

Suppose that there exists an interval $J_0 = [m_0,m_1]\subset(0,+\infty)$ such that, for each small λ > 0, the Radon–Nikodym derivative of $P^*_{\epsilon,\lambda}$ relative to P is uniformly contained in $J_0$ over the support of P for sufficiently small ϵ. Then, condition (A1) holds.

The condition in the above theorem is expected to hold in many situations since, for ϵ = 0 and any λ > 0, $P_{0,\lambda}=P$ is in $\mathcal{M}$ and thus $P^*_{0,\lambda}=P$ trivially. As such, under continuity of the projection operator relative to the supremum norm, for ϵ small enough, the Radon–Nikodym derivative of $P^*_{\epsilon,\lambda}$ relative to P is expected to be close to one, and therefore bounded both away from zero and above. In practice, it is often possible to either verify this condition analytically or to assess its plausibility numerically.

3.2. Continuity of the EIF

We relied on certain notions of continuity to establish the validity of representation (2.3). The theorem below highlights how the continuity requirement stated in condition (A2) can be more concretely verified.

Theorem 3.

Suppose that $\lim_{\lambda\to0}\int\phi_P(u)\,dH_{x,\lambda}(u) = \phi_P(x)$ and $\lim_{\lambda\to0}\lim_{\epsilon\to0}\int\phi^*_{\epsilon,\lambda}(u)\,dP(u)=0$. Condition (A2) holds provided either

$$\text{(a)}\ \ \lim_{\lambda\to0}\lim_{\epsilon\to0}\big\|\phi^*_{\epsilon,\lambda}-\phi_P\big\|_{\infty,S_{x,\lambda}}=0 \qquad\text{or}\qquad \text{(b)}\ \ \lim_{\lambda\to0}\lim_{\epsilon\to0} r_2(\lambda)\,\big\|\phi^*_{\epsilon,\lambda}-\phi_P\big\|_{2,P}=0.$$

The requirement that $\int\phi_P(u)\,dH_{x,\lambda}(u)$ approximates 𝜙P(x) as λ tends to zero simply stipulates that averaging 𝜙P with respect to a distribution eventually concentrating all its probability mass on {x} should approximately yield 𝜙P(x). Furthermore, this theorem requires that $\int\phi^*_{\epsilon,\lambda}(u)\,dP(u)$ tends to zero, which is reasonable under some continuity since $\phi^*_{\epsilon,\lambda}$ tends to 𝜙P and $\int\phi_P(u)\,dP(u) = 0$. Beyond this, in order for condition (A2) to hold, it suffices either for $\phi^*_{\epsilon,\lambda}$ to approximate 𝜙P in supremum norm in a neighborhood of x or in $L^2(P)$-norm at a rate faster than $r_2(\lambda)^{-1}$. These statements each hinge on a certain notion of continuity that appears needed whenever P is not finitely supported.

3.3. Preservation of the rate of convergence

The proof of representation (2.3) hinges upon a linearization of the difference between $\Psi(P^*_{\epsilon,\lambda})$ and Ψ(P). To ignore the remainder term from this linearization, we require that $R(P^*_{\epsilon,\lambda},P)/\epsilon$ be arbitrarily small for small enough λ and sufficiently smaller ϵ. The following theorem establishes that this is the case under certain regularity conditions.

Theorem 4.

Suppose that there exists an interval $J_0 = [m_0,m_1]\subset(0,+\infty)$ such that, for each small λ > 0, the Radon–Nikodym derivative of $P^*_{\epsilon,\lambda}$ relative to P is uniformly contained in $J_0$ over the support of P for sufficiently small ϵ. Suppose also that there exists some 0 < C < +∞ such that, for any $P_1\in\mathcal{M}$ with Radon–Nikodym derivative relative to P bounded above by $m_1$ over the support of P, we have that

$$|R(P_1,P)| \le C\left\|\frac{dP_1}{dP}-1\right\|_{2,P}^{2}.$$

Then, condition (A3) holds.

We have already discussed the first condition of this theorem since it is also used in Theorem 2. As for the second condition, it is often the case that the remainder term, being a second-order term arising from a linearization, can be bounded by the squared norm of the difference between the Radon–Nikodym derivative of $P^*_{\epsilon,\lambda}$ relative to P and its value at ϵ = 0. This inequality often follows quite easily from an application of the Cauchy–Schwarz inequality on the remainder term, and generally holds under rather mild conditions.

4. Practical considerations

Representation (2.3) provides the theoretical foundations for our strategy for numerically approximating the EIF and thus constructing efficient estimators. The implementation of the approach suggested by this representation nevertheless presents specific challenges. The practical guidelines provided below may facilitate the successful implementation of our proposal.

4.1. Construction of the perturbation path

In constructing the linear perturbation path that defines $P_{\epsilon,\lambda}$ and thus $P^*_{\epsilon,\lambda}$, the nearly degenerate distribution at {x} is used instead of its purely degenerate counterpart because it ensures that all distributions along the perturbation path are dominated by P. This is required to ensure the validity of the representation we have proposed. Clearly, there is no need for smoothing in the components of the data unit for which the corresponding marginal distribution implied by P is dominated by a counting measure. In fact, as we stress below, unnecessary smoothing can needlessly increase the computational burden of the approximation procedure. For components for which the corresponding marginal distribution is dominated by the Lebesgue measure, smoothing is generally needed. In practice, we suggest the use of product kernels for those components. Specifically, suppose that the data unit X is d-dimensional and can be partitioned into $X = (X_L, X_C)$, where $X_L := (X_{L1},X_{L2},\ldots,X_{Ld_1})$ and $X_C := (X_{C1},X_{C2},\ldots,X_{Cd_2})$ with $d_1+d_2 = d$, and that the marginal distributions of $X_L$ and $X_C$ under P are respectively dominated by the Lebesgue measure and a discrete counting measure. In this case, we can typically use the product kernel with density function

$$u\mapsto h_{x,\lambda}(u) := \left[\prod_{j=1}^{d_1} K_\lambda(u_{Lj}-x_{Lj})\right]\times\left[\prod_{j=1}^{d_2} I(u_{Cj}=x_{Cj})\right] \qquad (4.1)$$

relative to the sum of Lebesgue and counting measures, where u := (u_L, u_C) with u_L and u_C possible realizations of $X_L$ and $X_C$, respectively, and $K_\lambda : w\mapsto \lambda^{-1}K(\lambda^{-1}w)$ with K some symmetric and absolutely continuous univariate density function. The uniform kernel K(w) := I(−1 < 2w < +1) is particularly appealing due to its simplicity, which translates to greater practical feasibility of our numerical approximation procedure. If the uniform kernel is used, it is easy to verify that $r_k(\lambda)$ is of order $\lambda^{-d_1(1-1/k)}$ provided, for example, the sectional evaluation $u_L\mapsto p(u_L, x_C)$ of the density p of P is continuous and bounded away from zero in a neighborhood of $x_L$. As suggested by Theorem 5 provided in the next subsection, for a secant line slope approximation to the derivative in (2.3) to be accurate, ϵ must generally be chosen such that $\epsilon\ll\lambda^{d_1}$. If $d_1$ is large, this requirement may be prohibitive, possibly even to the point of requiring a value of ϵ beyond the computer’s standard level of precision and thus requiring special computational techniques. Of course, while this guideline is sufficient, it may be conservative in some applications. Later in this section, we provide a more practical means of selecting the values of ϵ and λ.
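
A direct transcription of the product kernel (4.1) with the uniform kernel is sketched below; the function and argument names are our own illustrative choices. At u = x the density equals $\lambda^{-d_1}$, the scale that drives the behavior of $r_k(\lambda)$ discussed above.

```python
import numpy as np

# Sketch of the product-kernel density (4.1) with the uniform kernel
# K(w) = I(-1 < 2w < 1) on the continuous block and point masses on the
# discrete block; names and the toy evaluation are our own choices.
def h_x_lam(u_L, u_C, x_L, x_C, lam):
    # K_lam(w) = K(w / lam) / lam = I(|w| < lam / 2) / lam, per component.
    cont = np.all(np.abs(np.asarray(u_L) - np.asarray(x_L)) < lam / 2)
    disc = np.all(np.asarray(u_C) == np.asarray(x_C))
    d1 = np.size(x_L)
    return float(cont) * float(disc) / lam ** d1

# At u = x the density equals lam^{-d_1}:
print(h_x_lam([0.5, 0.2], [1], [0.5, 0.2], [1], lam=0.1))  # 100.0
```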

As alluded to above, if we were to include smoothing over $X_C$ as well in our choice of $H_{x,\lambda}$, we would need $\epsilon\ll\lambda^{d}$. This can be much more prohibitive computationally than requiring that $\epsilon\ll\lambda^{d_1}$, particularly if $d_2$ is large. For this reason, smoothing in the construction of the linear perturbation path should be avoided for all components except those for which the corresponding marginal distribution under P is absolutely continuous. Additionally, for some parameters, smoothing can be avoided altogether for certain continuous components. As a general guideline for which supporting theory remains to be developed, we expect no smoothing to be required at all if the parameter is sufficiently smooth at P ∈ M, in the sense that $\Psi(P_m)$ tends to Ψ(P) for any sequence $\{P_m\in\mathcal{M} : m = 1,2,\ldots\}$ for which the distribution function of $P_m$ tends to that of P uniformly as m tends to infinity. Alternatively, if the MLE $P_n^*$ of P based on observations $X_1,X_2,\ldots,X_n$ from P is such that $\Psi(P_n^*)$ is a consistent estimator of Ψ(P), no smoothing will generally be required. If, however, some regularization of the MLE is needed to ensure consistency, e.g., as in van der Laan (1996), smoothing will usually be critical.

4.2. Finite-difference derivative approximations

When the projection of the linear perturbation path and the parameter of interest have an explicit form, it may be possible to compute analytically the pathwise derivative in representation (2.3). When analytic differentiation is possible, it is useful to do so, since an approximation for 𝜙P(x) is then simply given by the evaluation of the pathwise derivative at ϵ = 0 for a small λ > 0. The approximation thus involves only a single approximation parameter. Otherwise, numerical differentiation techniques must be used.

Since the derivative in representation (2.3) can be expressed as the limit of the slope of the secant line between the points $(0,\Psi(P))$ and $(\epsilon,\Psi(P^*_{\epsilon,\lambda}))$, the alternative representation

$$\phi_P(x) = \lim_{\lambda\to0}\lim_{\epsilon\to0}\frac{\Psi(P^*_{\epsilon,\lambda})-\Psi(P)}{\epsilon} \qquad (4.2)$$

is available under the conditions of Theorem 1. This representation suggests that 𝜙P(x) may be approximated by $[\Psi(P^*_{\epsilon,\lambda})-\Psi(P)]/\epsilon$ for appropriately small λ and ϵ. As indicated before, in general, the two limits in (4.2) cannot be interchanged. As a consequence, ϵ must be taken to be much smaller than λ. Practical guidelines for choosing ϵ and λ are provided in the next subsection. We note that, beyond the ability to compute the projection $P^*_{\epsilon,\lambda}$, use of (4.2) to approximate 𝜙P(x) requires no more than the ability to evaluate Ψ.

Representation (4.2) is based on a two-point forward finite-difference approximation to the derivative in (2.3). A multi-point forward finite-difference scheme could be used to produce accurate approximations for possibly less prohibitive values of ϵ and λ. For m ∈ {1,2,...}, the (m + 1)-point forward finite-difference representation of the EIF value is

$$\phi_P(x) = \lim_{\lambda\to0}\lim_{\epsilon\to0}\frac{\sum_{j=0}^{m} a_{j,m}\,\Psi(P^*_{j\epsilon,\lambda})}{\epsilon} \qquad (4.3)$$

for a specified vector $a_m := (a_{0,m},a_{1,m},\ldots,a_{m,m})$. Fornberg (1988) gives a recursive formula for $a_m$ and numerical values up to m = 8. For example, we have that $a_1 = (-1, +1)$, $a_2=(-\tfrac{3}{2},+2,-\tfrac{1}{2})$, $a_3=(-\tfrac{11}{6},+3,-\tfrac{3}{2},+\tfrac{1}{3})$ and $a_4=(-\tfrac{25}{12},+4,-3,+\tfrac{4}{3},-\tfrac{1}{4})$. Use of (4.3) as a basis for numerically approximating 𝜙P(x) indeed yields an improved approximation, as the following theorem indicates. Below, we denote by $\tilde\Psi:\mathcal{M}_{NP}\to\mathbb{R}$ the mapping $P\mapsto\Psi(P^*)$ for $P^*$ the projection of P into $\mathcal{M}(P)$, and by $\mathcal{M}_{0,\lambda}(P)$ the parametric submodel $\{P_{\epsilon,\lambda} : 0\le\epsilon\le1\}$.

Theorem 5.

Suppose that conditions (A1)–(A3) hold, that $H_{x,\lambda}$ is a second-order kernel of the form (4.1), and that 𝜙P has bounded second derivatives in a neighborhood of x. Suppose also that, for small enough λ > 0, the parameter $\tilde\Psi$ is $(m_0+1)$-times pathwise differentiable in $\mathcal{M}_{0,\lambda}(P)$, with gradient of order $m_0+1$ evaluated at $P_{\epsilon,\lambda}$ bounded uniformly for sufficiently small ϵ > 0. Then, with $s := \min(m,m_0)$, we have that

$$\frac{\sum_{j=0}^{m} a_{j,m}\,\Psi(P^*_{j\epsilon,\lambda})}{\epsilon} = \phi_P(x) + O\!\left(\epsilon^s\, r_{s+1}(\lambda)^{s+1} + \lambda^2\right).$$

As indicated before, if $H_{x,\lambda}$ is taken to be the uniform kernel, we find that $r_{s+1}(\lambda)$ is of order $\lambda^{-d_1 s/(s+1)}$, and so the approximation error is of order $(\epsilon/\lambda^{d_1})^s+\lambda^2$. This result suggests that, at the cost of computing additional projections and evaluating the parameter at each of these projections, it may indeed be possible to increase approximation accuracy using a multi-point forward finite-difference scheme. Higher-order differentiability is required to ensure that such schemes yield valid results – in particular, this requires that the projected path be sufficiently smooth. More importantly, the theorem also indicates that there is no significant theoretical penalty incurred from using a multi-point scheme with a number of points greater than the degree of differentiation, thereby supporting the use of multi-point schemes in practice. In Section 5, we show concrete benefits that may be derived from using a multi-point scheme in the context of Example 1.
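
The multi-point scheme itself is a few lines of code once a routine evaluating ϵ ↦ Ψ(P*_{ϵ,λ}) is available. The sketch below hard-codes the Fornberg (1988) weights listed above and checks the scheme on a smooth stand-in function; `psi` is a placeholder for the problem-specific projection-and-evaluation routine.

```python
import numpy as np

# Sketch: the (m+1)-point forward finite-difference scheme (4.3), given a
# user-supplied routine psi(eps) returning Psi(P*_{eps,lam}) at fixed lam.
A = {1: [-1, 1],
     2: [-3/2, 2, -1/2],
     3: [-11/6, 3, -3/2, 1/3],
     4: [-25/12, 4, -3, 4/3, -1/4]}   # forward-difference weights a_m

def eif_multipoint(psi, eps, m=2):
    a = np.array(A[m])
    vals = np.array([psi(j * eps) for j in range(m + 1)])
    return float(a @ vals / eps)

# Toy check on a smooth stand-in for eps -> Psi(P*_{eps,lam}): f'(0) = 2.
f = lambda e: np.exp(2 * e)
print(eif_multipoint(f, 1e-3, m=4))   # approx 2.0
```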

4.3. Selection of ϵ and λ values

When the pathwise derivative in (2.3) can be calculated analytically, the proposed approximation strategy only involves the smoothing parameter λ. The supporting theory suggests taking λ as small as possible. As we will illustrate in Section 5, in some cases there is little sensitivity to the choice of λ when (2.3) is used, and even a relatively large value of λ > 0 will yield stringent control of the approximation error.

Whenever the involved projection is not available in closed form or differentiation with respect to ϵ is too cumbersome to perform analytically, a multi-point finite-difference scheme may be used to numerically approximate this analytic derivative, as described in the previous subsection. In such cases, ϵ and λ must both be chosen, and more care is needed to ensure the reliability of the proposed procedure. The order of the limits in (4.2) and (4.3) suggests that we must select a small value of λ and an even smaller value of ϵ. This was made more precise in the previous subsection, where in the case of a two-point scheme, for example, it is prescribed to choose ϵ to be much smaller than $\lambda^{d_1}$, with $d_1$ the number of components of P over which smoothing is performed. While this theoretical requirement may serve as a rough guide in practice, it does not provide a concrete means of selecting values for ϵ and λ. For this purpose, it is useful to produce a matrix representing the value of the approximation as a function of ϵ and λ, both ranging over an exponential scale – for example, we could consider both ϵ and λ in the set $\{10^{-1},10^{-2},10^{-3},10^{-4},10^{-5},\ldots\}$. We refer to the resulting display as an epsilon-lambda plot. As a convention, the y-axis is used to represent ϵ values while λ values are represented on the x-axis. Our theoretical findings suggest that the right balance between ϵ and λ will be achieved in a possibly curvilinear triangular region nested in the upper left portion of the epsilon-lambda plot. In this region, the finite-difference approximation of the EIF value should be essentially constant. One practical means of selecting ϵ and λ would then consist of identifying this region visually by determining the quasi-triangular region in the upper left portion of the matrix over which the approximated EIF value is fixed up to a certain level of precision. As an illustration, without yet providing details regarding the specific parameter and model under consideration, we may scrutinize the epsilon-lambda plot arising in Example 1 from Section 5 and based upon a two-point scheme. This plot, provided as Figure 1, suggests that, up to three decimal places, the EIF value of interest is −0.963. This is indeed verified using theoretical calculations, as discussed in more detail in Section 5. The epsilon-lambda plot may therefore be a particularly useful tool for implementing the proposed approach for numerically approximating the EIF in practice.
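
A generic sketch of the underlying computation is given below; `eif_approx` is a placeholder for a problem-specific routine returning the secant-slope approximation, and the toy stand-in is ours, included only so the snippet runs.

```python
import numpy as np

# Sketch: assembling the epsilon-lambda matrix behind the epsilon-lambda
# plot. eif_approx(eps, lam) stands in for a problem-specific routine
# returning the secant-slope approximation [Psi(P*_{eps,lam}) - Psi(P)]/eps.
def epsilon_lambda_matrix(eif_approx, n_eps=6, n_lam=6):
    eps_grid = 10.0 ** -np.arange(1, n_eps + 1)   # rows: eps = 1e-1, 1e-2, ...
    lam_grid = 10.0 ** -np.arange(1, n_lam + 1)   # columns: lam = 1e-1, ...
    return np.array([[eif_approx(e, l) for l in lam_grid] for e in eps_grid])

# Toy stand-in whose exact limit is cos(0) = 1; entries stabilize where
# eps is much smaller than lam, mimicking the region described in the text.
toy = lambda e, l: (np.sin(l + e) - np.sin(l)) / e
print(np.round(epsilon_lambda_matrix(toy), 3))
```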

Figure 1: Epsilon-lambda plot of approximated values of the EIF using a secant line slope as a function of ϵ and λ in Example 1.

4.4. Numeric computation of the model space projection

In implementing our proposal, the main challenge consists of operationalizing the optimization problem that characterizes the projection of the linear perturbation path $\{P_{\epsilon,\lambda} : 0\le\epsilon\le1\}$ onto the model space $\mathcal{M}$. An analytic – or nearly analytic – form can be found for the projection in many problems, including the illustrations provided in Section 5. In other problems, the optimization problem is less analytically tractable and a numeric approach may be needed.

A general strategy for numerically approximating the required projection is to instead consider the corresponding optimization problem over $\mathcal{M}_m$, where $\mathcal{M}_1\subseteq\mathcal{M}_2\subseteq\cdots\subseteq\mathcal{M}$ is a sequence of finite-dimensional submodels of $\mathcal{M}$ such that $\overline{\bigcup_{m=1}^{\infty}\mathcal{M}_m}=\mathcal{M}$. We illustrate this in the context of families of tilted densities, though other parametrizations are possible. We note that any distribution Q dominated by P can be described as a tilted form $dQ(u) = \exp[h(u)]\,dP(u)/\int\exp[h(w)]\,dP(w)$ of P for some function h in a function class $\mathcal{H}=\mathcal{H}(\mathcal{M})$ determined by the model $\mathcal{M}$. Here, h characterizes the deviation of Q from P. It is often easier to determine suitable approximating finite-dimensional subspaces for $\mathcal{H}$ than for $\mathcal{M}$. Suppose that $\{h_1,h_2,\ldots\}\subseteq\mathcal{H}$ forms a basis for $\mathcal{H}$, and let $\mathcal{H}_m$ denote the linear span of $\{h_1,h_2,\ldots,h_m\}$. If P has density p relative to ν, the submodel $\mathcal{M}_m$ implied by $\mathcal{H}_m$ then consists of all distributions Q with density

$$\frac{dQ}{d\nu}(u) = \frac{\exp\left[\sum_{j=1}^{m} \beta_j h_j(u)\right] p(u)}{\int \exp\left[\sum_{j=1}^{m} \beta_j h_j(w)\right] p(w)\,\nu(dw)}$$

for some $\underline\beta_m := (\beta_1,\beta_2,\ldots,\beta_m)\in\mathbb{R}^m$. The choice $\underline\beta_m=(0,0,\ldots,0)$ leads to Q = P. Denote by $\underline\beta^*_{m,\epsilon,\lambda}$ the index of the projection $P^*_{m,\epsilon,\lambda}$ of $P_{\epsilon,\lambda}$ into $\mathcal{M}_m$. For small ϵ, it is expected that $P^*_{m,\epsilon,\lambda}\approx P$, which suggests that the corresponding index $\underline\beta^*_{m,\epsilon,\lambda}$ should be near zero. Thus, the search for the optimizer can be focused in a neighborhood surrounding the origin in $\mathbb{R}^m$. In our experience, this simple observation can sometimes greatly accelerate the numerical optimization routine used. In practice, a sufficiently large m must be selected to ensure that the resulting approximation of the projection is accurate enough to ensure the validity of the numerical evaluation of the EIF based on (4.3). If the KL divergence is used as statistical distance, up to an additive constant, the resulting optimization problem is to maximize the objective function

$$L(\underline\beta_m) := \sum_{j=1}^{m} \beta_j \int h_j(u)\,dP_{\epsilon,\lambda}(u) - \log \int \exp\left[\sum_{j=1}^{m} \beta_j h_j(w)\right] p(w)\,\nu(dw).$$

Since derivatives of $L(\underline\beta_m)$ are easy to write down explicitly, many algorithms are available to solve this optimization problem efficiently, including the Newton-Raphson method. We note here that in this setup m should be taken as large as computationally feasible so as to minimize the bias induced by the approximation of $\mathcal{M}$ by $\mathcal{M}_m$. This is in contrast to traditional sieve estimation, where the choice of m generally requires a careful balance whenever the unsieved maximum likelihood estimator either does not exist or is inconsistent. This difference occurs because in our scheme all distributions in the regularized model are dominated by the distribution being projected into the model.
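
As a concrete sketch of this sieve strategy, the snippet below maximizes L(β) by numerical optimization for a simple tilted family around P = N(0,1); the basis functions, perturbation, and tuning values are our own illustrative assumptions, not prescribed by the paper.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid
from scipy.optimize import minimize

# Sketch: maximizing the tilted-density KL objective L(beta) on a grid.
# The bounded tanh basis is an assumption chosen to keep exp(tilt) stable.
u = np.linspace(-8.0, 8.0, 8001)
p = stats.norm.pdf(u)                         # density of P
x, lam, eps = 1.0, 0.1, 1e-3
q = (1 - eps) * p + eps * stats.uniform(x - lam, 2 * lam).pdf(u)  # P_{eps,lam}
H = np.vstack([np.tanh(u), np.tanh(2 * u), np.tanh(3 * u)])       # h_1..h_3

def negL(beta):
    tilt = H.T @ beta                         # sum_j beta_j h_j(u)
    log_norm = np.log(trapezoid(np.exp(tilt) * p, u))
    return -(trapezoid(tilt * q, u) - log_norm)

beta_star = minimize(negL, np.zeros(3), method="BFGS").x
print(beta_star)    # small, as expected: P_{eps,lam} is a slight tilt of P
```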

It may sometimes be useful to consider a stochastic version of this deterministic optimization problem. For example, we could minimize the divergence between the approximating finite-dimensional submodel and the weighted empirical distribution based on a very large number of draws from a uniform mixture between P and $H_{x,\lambda}$, with weights 2(1 − ϵ) and 2ϵ assigned to observations from P and $H_{x,\lambda}$, respectively. We are then faced with a standard parametric estimation problem, albeit one that may be high-dimensional. When a clever parametrization of the approximating submodel is used, it may be possible to use standard statistical learning techniques, including regularization methods from the machine learning literature, with computationally efficient and stable off-the-shelf implementations. When adopting this approach, it appears critical to ensure that the size of the dataset generated is very large compared to the richness of the approximating submodel, since otherwise the variability resulting from this parametric estimation problem could limit our ability to achieve the required level of accuracy. A detailed study of numerical strategies, including the approach described in this subsection, will be the focus of future work.

4.5. Construction of an efficient estimator

As emphasized earlier, knowledge of the EIF facilitates the construction of efficient estimators in infinite-dimensional models. For example, if $\hat P_n$ is a consistent estimator of $P_0\in\mathcal{M}$ based on independent draws $X_1,X_2,\ldots,X_n$ from $P_0$, the one-step Newton-Raphson estimator, defined as

$$\psi_n^+ := \Psi(\hat P_n) + \frac{1}{n}\sum_{i=1}^{n} \phi_{\hat P_n}(X_i),$$

is an efficient estimator of $\psi_0$ under certain regularity conditions. The one-step approach appears to be the constructive method most amenable to an implementation based on numerical approximations of the EIF. Indeed, if the analytic form of the EIF is not known, it suffices to numerically approximate the value of $\phi_{\hat P_n}(X_i)$ for each i = 1,2,...,n, rather than the entire function $u\mapsto\phi_{\hat P_n}(u)$, to calculate $\psi_n^+$. Thus, the procedure we propose can be used to approximate each of these n values. Nevertheless, when the projection step required to utilize the proposed representation of the EIF is computationally burdensome and the sample size n is large, computing each of these values may be challenging. One need not obtain an approximation of each $\phi_{\hat P_n}(X_i)$ if the objective is to simply compute the one-step estimator $\psi_n^+$ – in this case it suffices to obtain an approximation of the correction term $B_n := \frac{1}{n}\sum_{i=1}^{n} \phi_{\hat P_n}(X_i)$. This simple observation is useful because a slight modification to (2.3) yields a numerical procedure for approximating the required empirical average. Specifically, the proof of Theorem 1 can be adapted to show that, under similar regularity conditions, if we define the linear perturbation $\hat P_{n,\epsilon,\lambda} := (1-\epsilon)\hat P_n + \epsilon\,\frac{1}{n}\sum_{i=1}^{n} H_{X_i,\lambda}$ between $\hat P_n$ and a uniform mixture of nearly degenerate distributions with mass concentrated around $X_1, X_2, \ldots, X_n$, then

$$B_n = \lim_{\lambda\to0}\left[\frac{d}{d\epsilon}\Psi(\hat P^*_{n,\epsilon,\lambda})\Big|_{\epsilon=0}\right]$$

with $\hat P^*_{n,\epsilon,\lambda}$ denoting the projection of $\hat P_{n,\epsilon,\lambda}$ onto the model. As such, a numerical approximation of the one-step estimator can be computed in a single numerical step as

$$\Psi(\hat P_n) + \frac{d}{d\epsilon}\Psi(\hat P^*_{n,\epsilon,\lambda})\Big|_{\epsilon=0} \approx \Psi(\hat P_n) + \frac{\sum_{j=0}^{m} a_{j,m}\,\Psi(\hat P^*_{n,j\epsilon,\lambda})}{\epsilon}$$

for appropriately selected ϵ and λ values. We note here that while one may wish to approximate $B_n$ as accurately as feasible, it suffices that $B_n$ be numerically approximated up to order $o_P(n^{-1/2})$ for the one-step estimator to preserve its desirable asymptotic properties. As such, the level of accuracy that must be enforced in practice very much depends on sample size.
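
In code, the resulting one-step recipe reduces to a single projected perturbation. In the generic sketch below, `Psi`, `project`, and `perturb` are placeholders that a user must supply for their model and parameter; nothing here is specific to a particular problem.

```python
# Sketch: one-step estimator computed via a single projected perturbation,
# as in the display above. The three callables are user-supplied:
#   perturb(P_hat, X, eps, lam): (1 - eps) * P_hat + eps * mean_i H_{X_i,lam}
#   project(Q): divergence-based projection of Q onto the model M
#   Psi(P): evaluation of the target parameter at P
def one_step(Psi, project, perturb, P_hat, X, eps=1e-6, lam=1e-2):
    P_star = project(perturb(P_hat, X, eps, lam))
    correction = (Psi(P_star) - Psi(P_hat)) / eps   # secant-slope estimate of B_n
    return Psi(P_hat) + correction
```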

5. Illustration and numerical studies

To illustrate the use of (2.3), we study four examples in which the calculation of the EIF can be difficult for non-experts, whereas the suggested approach renders the problem straightforward. Below, we describe the steps involved in using representation (2.3) to approximate the EIF in the context of each of these examples, and provide numerical illustrations of its use. In the Supplementary Material, we show how representation (2.3) can instead be used to obtain the analytic form of the EIF without specialized knowledge.

5.1. Example 1: Average value of density function with known mean

5.1.1. Background

Given a distribution P with univariate Lebesgue density p, the average density value parameter is given by

$$\Psi(P) := E_P[p(X)] = \int p(u)^2\,du.$$

Estimation and inference for the average density value has been extensively studied in the semiparametric efficiency literature (e.g., Bickel and Ritov, 1988). We use this parameter as our first illustration because it is simple to describe yet requires specialized knowledge to study using conventional techniques. Suppose that $\mathcal{M}_{NP}$ denotes the nonparametric model consisting of all univariate absolutely continuous distributions with finite-valued density. Suppose that μ is fixed and known, and denote by $\mathcal{M}\subseteq\mathcal{M}_{NP}$ the semiparametric model consisting of all distributions in $\mathcal{M}_{NP}$ with mean μ. We wish to compute the EIF of Ψ relative to $\mathcal{M}$ at a P ∈ M evaluated at an observation value x.

The EIF $\phi_{NP,P}$ of Ψ relative to the nonparametric model $\mathcal{M}_{NP}$ evaluated at P ∈ M is given by $u\mapsto\phi_{NP,P}(u) := 2[p(u)-\Psi(P)]$. It is straightforward to derive this analytic form from first principles. Observing that $\mathcal{M}=\{P\in\mathcal{M}_{NP} : \Theta(P)=0\}$, where $\Theta(P) := \int u\,dP(u)-\mu$ is a pathwise differentiable parameter with EIF relative to $\mathcal{M}_{NP}$ at P ∈ M given by $u\mapsto\varphi_P(u) := u-\mu$, Example 1 of Section 6.2 of Bickel et al. (1997) implies that the EIF of Ψ relative to $\mathcal{M}$ can be obtained as

$$u\mapsto \phi_P(u) := \phi_{NP,P}(u) - \frac{\int \phi_{NP,P}(w)\,\varphi_P(w)\,dP(w)}{\int \varphi_P(w)^2\,dP(w)}\,\varphi_P(u) = 2\left[p(u)-\Psi(P)-\frac{\int (w-\mu)\,p(w)\,dP(w)}{\int (w-\mu)^2\,dP(w)}\,(u-\mu)\right].$$

While the resulting analytic form of this EIF is relatively simple, its derivation hinges on specialized knowledge unlikely to be available to most practitioners. Use of our novel representation of the EIF provides an alternative approach that avoids the need for such knowledge, as highlighted below.

5.1.2. Implementation and results

To utilize our representation, we must understand how to project a given distribution $Q\in\mathcal{M}_{NP}$, say with Lebesgue density q, into $\mathcal{M}$, say relative to the KL divergence. Suppose that the support of Q has finite lower and upper limits a and b, respectively, satisfying a < μ < b. An application of the method of Lagrange multipliers yields that the maximizer in p of $\int\log p(u)\,dQ(u)$ over the class of all Lebesgue densities with mean μ is given by $q^*(u) := [1-\xi_0(u-\mu)]^{-1}\,q(u)$, where $\xi_0$ solves the equation $U(\xi)=0$ with

$$U(\xi) := \int\left[\frac{u}{1-\xi(u-\mu)}-\mu\right]dQ(u),$$

and lies strictly between $(a-\mu)^{-1}$ and $(b-\mu)^{-1}$.

To compute 𝜙P(x) using the approach proposed in this paper, we first construct the linear perturbation $P_{\epsilon,\lambda} := (1-\epsilon)P+\epsilon H_{x,\lambda}$, where $H_{x,\lambda}$ is an absolutely continuous distribution that concentrates its mass on shrinking neighborhoods of the set {x} as λ tends to zero. In our numerical implementation, we take $H_{x,\lambda}$ to be the uniform distribution on the interval (x−λ, x+λ). The projection of $P_{\epsilon,\lambda}$ onto $\mathcal{M}$ is then obtained as described in the preceding paragraph with $Q = P_{\epsilon,\lambda}$. As such, it has a closed-form analytic expression up to the constant $\xi_0 = \xi_0(\epsilon,\lambda)$ that can be solved for numerically. We may then approximate 𝜙P(x) by the secant line slope or using an improved multi-point finite-difference scheme, as suggested above. A minimal implementation of these steps is sketched below.
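
In this sketch, P is taken to be the Beta(3,5) distribution (so μ = 3/8) and x = 0.6, anticipating the numerical study described next; the quadrature grid, root-finding bracket, and tuning values are our own choices.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid
from scipy.optimize import brentq

# Sketch of Example 1: P = Beta(3, 5) with known mean mu = 3/8, target
# Psi(P) = int p(u)^2 du, and H_{x,lam} = Uniform(x - lam, x + lam), x = 0.6.
mu, x = 3 / 8, 0.6
u = np.linspace(0.0, 1.0, 200001)
p = stats.beta(3, 5).pdf(u)

def Psi_projected(eps, lam):
    # Density of the linear perturbation P_{eps,lam}.
    q = (1 - eps) * p + eps * stats.uniform(x - lam, 2 * lam).pdf(u)
    # KL projection onto the mean-mu model: q*(u) = q(u) / [1 - xi (u - mu)],
    # with xi solving the mean constraint U(xi) = 0.
    U = lambda xi: trapezoid((u / (1 - xi * (u - mu)) - mu) * q, u)
    xi = brentq(U, -1.0, 1.0)
    q_star = q / (1 - xi * (u - mu))
    return trapezoid(q_star ** 2, u)

eps, lam = 1e-6, 1e-2
slope = (Psi_projected(eps, lam) - Psi_projected(0.0, lam)) / eps
print(slope)   # approx -0.963, matching the known EIF value at x = 0.6
```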

We evaluated this procedure numerically for a particular distribution P and observation value x. Specifically, we took P to be the Beta distribution with parameters α = 3 and β = 5, and evaluated our numerical procedure for approximating the true value of 𝜙P(0.6) ≈ −0.963. Figure 2 provides the percent error of our numerical approximation based upon a secant line slope for various combinations of values for ϵ and λ. This approximation is inaccurate if either ϵ is not small enough or λ is too small relative to ϵ. For small λ and much smaller ϵ, the secant line slope approximates the true value of 𝜙P(x) with high accuracy, as depicted by the dark green squares in Figure 2. This plot confirms what theory suggests regarding the choice of ϵ and λ. It also reaffirms the usefulness of the epsilon-lambda plot for selecting appropriate values of ϵ and λ. Figure 3 compares the accuracy of multi-point finite-difference schemes using 2, 3, 4 or 5 points over the same range of values for ϵ and λ. As more points are used in the approximation scheme, the high-accuracy region grows substantially, and a good approximation can be obtained with much less stringent choices of ϵ and λ. This plot confirms the significant practical benefits that can be derived from using multi-point schemes.

Figure 2: Absolute % error in the approximation of the EIF value using a secant line slope as a function of ϵ and λ in Example 1.

Figure 3: Absolute % error in the approximation of the EIF value using forward multi-point schemes with different numbers of points (top left: 2, top right: 3, bottom left: 4, bottom right: 5) as a function of ϵ and λ in Example 1. Color coding of entries is identical to that in Figure 2.

5.2. Example 2: Center of a symmetric distribution

5.2.1. Background

In this example, the statistical problem is to efficiently estimate the location parameter indexing a univariate distribution in a semiparametric symmetry model. Specifically, we consider the model consisting of each distribution P with symmetric density p relative to a fixed dominating measure ν. Denoting by $P_{\mu,f}$ the distribution with density $u\mapsto p(u) = f(u-\mu)$ relative to ν, the model can be written as $\mathcal{M}=\{P_{\mu,f} : \mu\in\mathbb{R}, f\in\mathcal{F}_0(\nu)\}$ for $\mathcal{F}_0(\nu)$ the class of all densities relative to ν symmetric about zero. This semiparametric model is a subset of the nonparametric model $\mathcal{M}_{NP}$ of all distributions dominated by ν. If the center of symmetry is of scientific interest, the parameter considered is $\Psi(P_{\mu,f}) = \mu$. We wish to compute the EIF of Ψ relative to $\mathcal{M}$ at a distribution $P = P_{\mu,f}$ with $\mu\in\mathbb{R}$ and $f\in\mathcal{F}_0(\nu)$, evaluated at an observation value x.

As one of the first problems considered in the development of efficiency theory for infinite-dimensional models – see Stein (1956) – this problem has a rich history in statistics. The EIF of Ψ relative to M at P = Pμ,f has the simple form

$$u\mapsto \phi_P(u) := -\frac{1}{I(f)}\,\frac{\dot f(u-\mu)}{f(u-\mu)},$$

where $\dot f$ is the first derivative of f and $I(f) := \int[\dot f(w)/f(w)]^2 f(w)\,\nu(dw)$. This EIF can be derived in many ways, though in this case the efficient score approach for classical semiparametric models is particularly expedient. Tsiatis (2007) provides an accessible reference for this approach. We show below how the use of the representation in this paper negates the need for this specialized knowledge in EIF calculation.

5.2.2. Implementation and results

As before, we must determine the projection $Q^*$ of an arbitrary distribution $Q\in\mathcal{M}_{NP}$, say with density q, into $\mathcal{M}$. Because $Q^*\in\mathcal{M}$, there exist $\mu^*\in\mathbb{R}$ and $f^*\in\mathcal{F}_0(\nu)$ such that $Q^*=P_{\mu^*,f^*}$. If the KL divergence is used, we have that

$$\mu^* = \operatorname*{argmax}_{v\in\mathbb{R}}\ \sup_{r\in\mathcal{F}_0(\nu)} \int \log r(u-v)\,q(u)\,\nu(du).$$

Using Jensen’s inequality, it is not difficult to show that, for a fixed candidate center of symmetry v, the maximizer $\hat r_v$ of $\int\log r(u-v)\,q(u)\,\nu(du)$ over $\mathcal{F}_0(\nu)$ is given explicitly as the symmetrization of q,

$$u\mapsto \hat r_v(u) := \frac{q(v+u)+q(v-u)}{2}.$$

It follows then that $\mu^*$ is the maximizer of $v\mapsto A_{KL}(v) := \int\log \hat r_v(u-v)\,q(u)\,\nu(du)$ over $\mathbb{R}$, and $f^*$ is given by $\hat r_{\mu^*}$. If instead the Hellinger distance is used, it is not difficult to show that

$$\mu^* = \operatorname*{argmax}_{v\in\mathbb{R}}\ \sup_{r\in\mathcal{F}_0(\nu)} \int \left[r(u-v)\,q(u)\right]^{1/2}\nu(du),$$

and for a fixed center of symmetry v, the maximizer $\hat r_v$ of $\int[r(u-v)\,q(u)]^{1/2}\,du$ over $\mathcal{F}_0(\nu)$ is given by

$$u\mapsto \hat r_v(u) := \frac{1}{B(v;q)}\left[\frac{\sqrt{q(v+u)}+\sqrt{q(v-u)}}{2}\right]^2,$$

where $B(v;q) := \tfrac12\left[1+\int\sqrt{q(u)\,q(2v-u)}\,du\right]$. Then, $\mu^*$ is the maximizer of $v\mapsto\int[\hat r_v(u-v)\,q(u)]^{1/2}\,du$, or equivalently, of $v\mapsto A_{H}(v) := \int\sqrt{q(u)\,q(2v-u)}\,du$, and as before, $f^*$ is given by $\hat r_{\mu^*}$. Thus, once again, the projection $Q^*$ has an explicit form up to an implicitly defined constant. Since $\Psi(Q^*) = \mu^*$, the parameter evaluation at the projection of Q simply equals the computed maximizer of $A_{KL}$ or $A_{H}$.

To compute 𝜙P(x) for $P = P_{\mu,f}$, we again begin by constructing the linear perturbation $P_{\epsilon,\lambda} := (1-\epsilon)P+\epsilon H_{x,\lambda}$, where $H_{x,\lambda}$ denotes an absolutely continuous distribution that concentrates its mass on shrinking neighborhoods of the set {x} as λ tends to zero. We then compute the maximizer $\mu^*_{\epsilon,\lambda}$ of $A_{KL}$ or $A_{H}$ with $Q = P_{\epsilon,\lambda}$, and approximate 𝜙P(x) using, for example, the secant line slope $(\mu^*_{\epsilon,\lambda}-\mu)/\epsilon$. A minimal numerical sketch of the KL-based version follows.
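
The following sketch implements the KL-based version for the standard normal example considered next. For numerical smoothness we take $H_{x,\lambda}$ to be the Gaussian distribution N(x, λ²) rather than a uniform kernel – an assumption of ours – and all tuning values are illustrative.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid
from scipy.optimize import minimize_scalar

# Sketch of Example 2: P = N(0,1), x = 1, KL projection onto the symmetric-
# density model via the symmetrization above.
x, lam, eps = 1.0, 0.05, 1e-4
u = np.linspace(-10, 10, 200001)

def q(t):  # density of P_{eps,lam} = (1 - eps) P + eps H_{x,lam}
    return (1 - eps) * stats.norm.pdf(t) + eps * stats.norm(x, lam).pdf(t)

def A_KL(v):
    # A_KL(v) = int log r_hat_v(u - v) q(u) du, where the symmetrization of
    # q about v satisfies r_hat_v(u - v) = [q(u) + q(2v - u)] / 2.
    return trapezoid(np.log((q(u) + q(2 * v - u)) / 2) * q(u), u)

res = minimize_scalar(lambda v: -A_KL(v), bounds=(-0.01, 0.01),
                      method="bounded", options={"xatol": 1e-10})
print(res.x / eps)   # secant slope (mu*_{eps,lam} - 0)/eps, approx phi_P(1) = 1
```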

We evaluated this EIF approximation scheme by implementing the numerical algorithm above to obtain the value of $\phi_P(x)$ for $x = 1$ and $P$ taken to be the normal distribution with mean zero and unit variance. Using the known analytic form of $\phi_P$ provided above, we find that $\phi_P(x) = x$ in this case. This is expected, since this is the influence function of the sample mean, which is an efficient estimator of the center of symmetry in a normal location family. The value we seek to approximate is therefore $\phi_P(1) = 1$. The relative error of the secant line slope approximation based on the KL divergence is displayed in Figure 4 for different combinations of $\epsilon$ and $\lambda$. The pattern observed is similar to that in Example 1, with high accuracy obtained for small $\lambda$ and even smaller $\epsilon$. The implementation using the Hellinger distance led to very similar results; in fact, the resulting relative error plot was indistinguishable from the plot in Figure 4.

Figure 4:

Absolute % error in the approximation of the EIF value using a secant line slope as a function of ϵ and λ in Example 2, based on the Kullback-Leibler divergence. The corresponding plot based on the Hellinger distance is identical and is not shown.

5.3. Example 3: Coefficient in semiparametric Poisson regression model

5.3.1. Background

We now consider the problem of efficient estimation of the regression coefficient in a partially linear multiplicative Poisson regression model. Suppose that the data unit is $X := (Y, Z, W) \sim P \in \mathcal{M}$, where $Y$ is a count outcome, $Z$ is a univariate exposure, and $W$ is a vector of possible confounders. Let $\mathcal{G}$ be the space of univariate real-valued functions, and let $\mathcal{F}$ be the space of all bivariate distribution functions. We consider the model $\mathcal{M} := \{P_{\beta,g,F} : \beta \in \mathbb{R},\, g \in \mathcal{G},\, F \in \mathcal{F}\}$, where $P_{\beta,g,F}$ denotes the joint distribution implying

$$P_{\beta,g,F}(Y = y \,|\, Z = z, W = w) = \frac{e^{-\mu_{\beta,g}(z,w)}\,\mu_{\beta,g}(z,w)^y}{y!}, \qquad y = 0, 1, 2, \ldots,$$

with $\log \mu_{\beta,g}(z,w) = \beta z + g(w)$, and $P_{\beta,g,F}(Z \le z, W \le w) = F(z,w)$ for each $z$ and $w$. If $Z$ is the exposure of interest, the coefficient $\beta$ serves as a parsimonious summary of the relationship between $Y$ and $Z$ adjusting for $W$. As such, we consider the parameter $\Psi(P_{\beta,g,F}) := \beta$. We wish to compute the EIF of $\Psi$ relative to $\mathcal{M}$ at a distribution $P = P_{\beta,g,F}$ evaluated at an observation value $x := (y, z, w)$.

Again, efficiency calculations in this case can be performed most expediently using techniques for classical semiparametric models. The EIF can be expressed as an appropriate renormalization of the efficient score for $\beta$, defined as the projection of the score for $\beta$ onto the orthogonal complement of the nuisance tangent space. Using this technique, the EIF of $\Psi$ relative to $\mathcal{M}$ at $P = P_{\beta,g,F}$ is found to be

$$x \mapsto \phi_P(x) := \frac{[z - a_P(w)]\,[y - \mu_{\beta,g}(z,w)]}{\mathrm{var}_P\{[Z - a_P(W)]\,[Y - \mu_{\beta,g}(Z,W)]\}},$$

where $a_P(w) := E_P(ZY \,|\, W = w)/E_P(Y \,|\, W = w)$ for each $w$. Below, we discuss the use of our proposed numerical approach for computing this EIF without any recourse to semiparametric efficiency theory.

5.3.2. Implementation and results

Let $Q \in \mathcal{M}_{NP}$ be an arbitrary distribution. The projection $Q^*$ of $Q$ into $\mathcal{M}$ can be written as $Q^* = P_{\beta^*, g^*, F^*}$ for some $\beta^* \in \mathbb{R}$, $g^* \in \mathcal{G}$ and $F^* \in \mathcal{F}$. Here, we focus on the KL divergence. Write
$$L(\beta, g) := \iint \sum_y \log p_{\beta,g}(y \,|\, z, w)\; q(y \,|\, z, w)\; F_Q(dz, dw),$$
where $p_{\beta,g}(y \,|\, z, w)$ denotes the conditional probability mass function of $Y$ given $Z = z$ and $W = w$ evaluated at $y$ and implied by the parameter choices $(\beta, g)$, $q(y \,|\, z, w)$ denotes the corresponding conditional probability mass function implied by $Q$, and $F_Q$ denotes the distribution of $(Z, W)$ implied by $Q$. We then have that $\beta^* = \operatorname{argmax}_\beta \sup_{g \in \mathcal{G}} L(\beta, g)$. For fixed $\beta$, the maximizer $\hat g_\beta$ of $L(\beta, g)$ is given by $w \mapsto \hat g_\beta(w) := \log E_Q(Y \,|\, W = w) - \log E_Q(e^{\beta Z} \,|\, W = w)$. Under conditions allowing the interchange of integrals and derivatives, the maximizer $\beta^*$ of $L_0(\beta) := L(\beta, \hat g_\beta)$ can be shown to also be the solution of the profile score equation $U_0(\beta) = 0$, where

$$U_0(\beta) := E_Q\!\left[\frac{E_Q(ZY \,|\, W)}{E_Q(Y \,|\, W)} - \frac{E_Q(Z e^{\beta Z} \,|\, W)}{E_Q(e^{\beta Z} \,|\, W)}\right].$$

It then follows that $g^* = \hat g_{\beta^*}$. Since $\mathcal{M}$ does not restrict the distribution of $(Z, W)$, we have that $F^* = F_Q$.
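To illustrate this step computationally, the profile score equation can be handed to a generic one-dimensional root-finder once the conditional expectations appearing in $U_0$ are available. The minimal sketch below assumes these are supplied as user-defined callables (the names and interfaces are hypothetical), and approximates the outer expectation by an average over draws of $W$ under $Q$.

```python
from scipy.optimize import brentq

def profile_score_root(cond_ZY, cond_Y, cond_ZexpbZ, cond_expbZ, w_draws,
                       lo=-5.0, hi=5.0):
    """Solve U_0(beta) = 0 by bracketed root finding.

    cond_ZY(w), cond_Y(w), cond_ZexpbZ(beta, w) and cond_expbZ(beta, w) are
    assumed to return E_Q(ZY | W = w), E_Q(Y | W = w),
    E_Q(Z e^{beta Z} | W = w) and E_Q(e^{beta Z} | W = w), respectively;
    w_draws are draws of W under Q used to approximate the outer expectation.
    The bracket [lo, hi] must contain a sign change of U_0."""
    def U0(beta):
        vals = [cond_ZY(w) / cond_Y(w)
                - cond_ZexpbZ(beta, w) / cond_expbZ(beta, w)
                for w in w_draws]
        return sum(vals) / len(vals)
    return brentq(U0, lo, hi)
```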

To compute $\phi_P(x)$ for $P = P_{\beta,g,F}$, we begin by constructing the linear perturbation $P_{\epsilon,\lambda} := (1-\epsilon)P + \epsilon H_{x,\lambda}$. If $Z$ is discrete-valued under $P$, $H_{x,\lambda}$ is a joint distribution for $X$ that concentrates its mass on shrinking neighborhoods of the set $\{x\}$ as $\lambda$ tends to zero and such that the event $\{Y = y, Z = z\}$ has probability one under $H_{x,\lambda}$ for every $\lambda > 0$. If $Z$ is continuous-valued, the latter requirement can instead be taken to be that $\{Y = y\}$ has probability one under $H_{x,\lambda}$. We then compute $\beta^*_{\epsilon,\lambda}$ as the solution of $U_0(\beta) = 0$ with $Q = P_{\epsilon,\lambda}$, and approximate the EIF value $\phi_P(x)$ by the secant line slope $(\beta^*_{\epsilon,\lambda} - \beta)/\epsilon$. We evaluated our proposed numerical scheme for approximating the value of $\phi_P(x)$ for $x = (1, 0, 0.5)$ and a distribution $P$ under which: conditional on $(Z, W) = (z, w)$, $Y$ has a Poisson distribution with mean $\exp\{\log(2)z + w\}$; conditional on $W = w$, $Z$ is a Bernoulli random variable with success probability $\mathrm{expit}(1 - 2w)$; and $W$ has a uniform distribution on the interval $(0, 1)$. The true value of the EIF can be calculated to be $\phi_P(1, 0, 0.5) \approx 0.793$. Figure 5 displays the relative error of our proposed secant line approximation for different combinations of $\epsilon$ and $\lambda$, and confirms the patterns observed in previous examples.
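As a check on this target value, the analytic form of $\phi_P$ displayed earlier can be evaluated directly under this data-generating mechanism. Because the EIF has mean zero, the variance in its denominator equals $E_P\{[Z - a_P(W)]^2\, \mu_{\beta,g}(Z,W)\}$, reducing the calculation to a one-dimensional integral over $W$; the following minimal sketch carries this out.

```python
import numpy as np
from scipy.integrate import quad

expit = lambda t: 1.0 / (1.0 + np.exp(-t))
mu = lambda z, w: np.exp(np.log(2.0) * z + w)   # E_P(Y | Z = z, W = w)
pi = lambda w: expit(1.0 - 2.0 * w)             # P(Z = 1 | W = w)

# a_P(w) = E_P(ZY | W = w) / E_P(Y | W = w)
a = lambda w: pi(w) * mu(1, w) / (pi(w) * mu(1, w) + (1 - pi(w)) * mu(0, w))

# Denominator: E_P{[Z - a_P(W)]^2 mu(Z, W)}, using var_P(Y | Z, W) = mu(Z, W).
integrand = lambda w: (pi(w) * (1 - a(w)) ** 2 * mu(1, w)
                       + (1 - pi(w)) * a(w) ** 2 * mu(0, w))
denom, _ = quad(integrand, 0.0, 1.0)

y, z, w = 1, 0, 0.5
print((z - a(w)) * (y - mu(z, w)) / denom)      # approximately 0.793
```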

Figure 5:

Absolute % error in the approximation of the EIF value using a secant line slope as a function of ϵ and λ in Example 3

5.4. Example 4: G-computation parameter under Markov structure

5.4.1. Background

We finally consider efficient estimation of a more complex parameter arising in the causal inference literature. Suppose that the data unit consists of the longitudinal observation $X := (L_0, A_0, \ldots, L_K, A_K, L_{K+1}) \sim P$, where $L_0, L_1, \ldots, L_K$ is a sequence of measurements collected at $K+1$ distinct instances through time, $L_{K+1}$ is the outcome of interest, and $A_0, A_1, \ldots, A_K$ are intervention indicators corresponding to each pre-outcome timepoint. For simplicity, we take all treatment indicators to be binary. Let $\mathcal{M}_{NP}$ be a nonparametric model for $P$. We may be interested in the covariate-adjusted, treatment-specific mean $\Psi(P)$ corresponding to the intervention $(A_0, A_1, \ldots, A_K) = (1, 1, \ldots, 1)$. Mathematically, $\Psi(P)$ is given explicitly by $E_P[m_{0,P}(L_0)]$ via the G-computation recursion

$$m_{j,P}(\bar l_j) := E_P\!\left[m_{j+1,P}(\bar L_{j+1}) \,\middle|\, \bar L_j = \bar l_j,\ A_j = A_{j-1} = \cdots = A_0 = 1\right]$$

for $j = K, K-1, \ldots, 0$, where we have set $m_{K+1,P}(\bar L_{K+1}) := L_{K+1}$. Here, for any vector $u := (u_0, u_1, \ldots)$ we write $\bar u_k := (u_0, u_1, \ldots, u_k)$. This parameter depends on $P$ only through the conditional distribution $P_{j+1,1}$ of $\bar L_{j+1}$ given $\bar L_j$ and $A_0 = A_1 = \cdots = A_j = 1$ for $j = 0, 1, \ldots, K$, and the marginal distribution $P_{0,1}$ of $L_0$. Under certain untestable causal assumptions, $\Psi(P)$ corresponds to the mean of the counterfactual outcome $Y$ defined by an intervention setting all treatment nodes to one (Robins, 1986). With respect to $\mathcal{M}_{NP}$, or any model with restrictions only on the conditional distribution of $A_j$ given $\bar A_{j-1}$ and $\bar L_j$, possibly for any $j \in \{0, 1, \ldots, K\}$, the EIF of $\Psi$ at $P$ is known to be given by $\phi_{NP,P} := \sum_{j=0}^{K+1} \phi_{j,NP,P}$, where $\phi_{0,NP,P}(x) := m_{0,P}(l_0) - \Psi(P)$ and

$$\phi_{j,NP,P}(x) := \frac{a_0 a_1 \cdots a_{j-1}}{\prod_{r=0}^{j-1} P(A_r = 1 \,|\, \bar L_r = \bar l_r,\ A_0 = A_1 = \cdots = A_{r-1} = 1)}\,\left\{m_{j,P}(\bar l_j) - m_{j-1,P}(\bar l_{j-1})\right\}$$

for j = 1,2,...,K + 1.
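For concreteness, the G-computation recursion defining $\Psi(P)$ is straightforward to implement once the relevant conditional distributions are available. The minimal sketch below evaluates $\Psi(P)$ for a hypothetical discrete example with $K = 1$ and binary variables; all numerical values are illustrative assumptions.

```python
# Hypothetical discrete example with K = 1: X = (L0, A0, L1, A1, L2).
p_L0 = {0: 0.5, 1: 0.5}                                    # marginal of L0
p_L1 = lambda l0: {0: 0.6 - 0.2 * l0, 1: 0.4 + 0.2 * l0}   # L1 | L0 = l0, A0 = 1

# m1(l0, l1) = E[L2 | L0 = l0, L1 = l1, A1 = A0 = 1]  (i.e., m_{1,P})
m1 = lambda l0, l1: 0.2 + 0.3 * l0 + 0.4 * l1

# Backward recursion: m0(l0) = E[m1(L0, L1) | L0 = l0, A0 = 1]  (i.e., m_{0,P})
m0 = lambda l0: sum(m1(l0, l1) * p_L1(l0)[l1] for l1 in (0, 1))

# Psi(P) = E[m0(L0)]
psi = sum(m0(l0) * p_L0[l0] for l0 in (0, 1))
print(psi)   # 0.55 for these illustrative inputs
```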

We consider instead the model $\mathcal{M}$ consisting of the subset of distributions in $\mathcal{M}_{NP}$ under which, for each $j = 2, 3, \ldots, K+1$, $L_j$ and $\bar L_{j-2}$ are independent given $L_{j-1}$ and $A_{j-1} = A_{j-2} = \cdots = A_0 = 1$. If $P \in \mathcal{M}$, we note that $m_{j,P}(\bar l_j) = m_{j,P}(l_j)$ for each $j$. The EIF of $\Psi$ at $P$ relative to $\mathcal{M}$ is given by $\phi_P := \sum_{j=0}^{K+1} \phi_{j,P}$, where $\phi_{j,P}$ is defined pointwise as

$$x \mapsto \phi_{j,P}(x) := E_P\!\left[\phi_{j,NP,P}(X) \,\middle|\, L_j = l_j,\, L_{j-1} = l_{j-1},\, \bar A_{j-1} = \bar a_{j-1}\right] - E_P\!\left[\phi_{j,NP,P}(X) \,\middle|\, L_{j-1} = l_{j-1},\, \bar A_{j-1} = \bar a_{j-1}\right] = a_0 a_1 \cdots a_{j-1}\, T_j(P)(x)\,\left\{m_{j,P}(l_j) - m_{j-1,P}(l_{j-1})\right\}$$

for $j = 1, 2, \ldots, K+1$, where we use $T_j(P)(x)$ to denote

$$E_P\!\left[\frac{1}{\prod_{r=0}^{j-1} P(A_r = 1 \,|\, \bar L_r,\ A_0 = A_1 = \cdots = A_{r-1} = 1)} \,\middle|\, L_j = l_j,\, L_{j-1} = l_{j-1},\, \bar A_{j-1} = \bar a_{j-1}\right].$$

Deriving this expression requires specialized knowledge of and familiarity with efficiency theory for longitudinal data structures. Furthermore, even given this analytic expression, the EIF may often be difficult to compute, since it involves rather elaborate conditional expectations.

5.4.2. Implementation and results

Again, we must understand how to project a given distribution $Q \in \mathcal{M}_{NP}$ into $\mathcal{M}$, say according to the KL divergence. Given a dominating measure $\nu$, we denote the density function of $Q$ with respect to $\nu$ by $q$. Furthermore, we denote by $q_{L_j}$ the density of the conditional distribution of $L_j$ given $\bar L_{j-1}$ and $\bar A_{j-1}$, and by $q_{A_j}$ the density of the conditional distribution of $A_j$ given $\bar L_j$ and $\bar A_{j-1}$. We also denote by $q_{L_j,1}$ the density $q_{L_j}$ with $\bar a_{j-1} = (1, 1, \ldots, 1)$. We use the same notational convention for any other candidate density $r$. Because for any candidate $r$ we can write

$$\int \log r(u)\, dQ(u) = \sum_{j=0}^{K+1} \int \log r_{L_j}(l_j \,|\, \bar l_{j-1}, \bar a_{j-1})\, dQ(u) + \sum_{j=0}^{K} \int \log r_{A_j}(a_j \,|\, \bar l_j, \bar a_{j-1})\, dQ(u)$$

and $\mathcal{M}$ can be written as a product model for the set of conditional distributions implied by the joint distribution, the required optimization can be performed separately for each conditional density. Because computing $\Psi(Q^*)$ does not require any component of $Q^*$ beyond $q^*_{L_j,1}$ for $j = 0, 1, \ldots, K+1$, we focus our attention on the corresponding optimization problems alone. Below, we denote by $\bar q(l_{j-1}, l_j)$ the marginalized density $\int q(l_0, 1, l_1, 1, \ldots, l_{j-1}, 1, l_j)\, \nu(dl_0, dl_1, \ldots, dl_{j-2})$. To find $q^*_{L_j,1}$ for $j = 2, 3, \ldots, K+1$, we must maximize the criterion

$$\mathcal{L}_j(r_{L_j,1}) := \int \log r_{L_j,1}(l_j \,|\, l_{j-1})\; q(l_0, 1, l_1, 1, \ldots, l_{j-1}, 1, l_j)\; \nu(dl_0, dl_1, \ldots, dl_j) = \int \log r_{L_j,1}(l_j \,|\, l_{j-1})\; \bar q(l_{j-1}, l_j)\; \nu(dl_{j-1}, dl_j) = \int\!\left[\int \log r_{L_j,1}(l_j \,|\, l_{j-1})\, \frac{\bar q(l_{j-1}, l_j)}{\int \bar q(l_{j-1}, l_j')\,\nu(dl_j')}\,\nu(dl_j)\right]\!\left[\int \bar q(l_{j-1}, l_j')\,\nu(dl_j')\right] \nu(dl_{j-1})$$

over the class of candidate conditional densities that do not depend on $\bar l_{j-2}$, here represented by $r_{L_j,1}$. Since for each fixed $l_{j-1}$ the mapping $l_j \mapsto \bar q(l_{j-1}, l_j)/\int \bar q(l_{j-1}, l_j')\,\nu(dl_j')$ defines a proper conditional density, it follows by Jensen's inequality that $\mathcal{L}_j(r_{L_j,1})$ is maximized by

$$q^*_{L_j,1}(l_j \,|\, l_{j-1}) = \frac{\bar q(l_{j-1}, l_j)}{\int \bar q(l_{j-1}, l_j')\, \nu(dl_j')}.$$

It is easy to see that $\mathcal{M}$ constrains neither $r_{L_1,1}$ nor $r_{L_0}$, and therefore $q^*_{L_1,1} = q_{L_1,1}$ and $q^*_{L_0} = q_{L_0}$. Thus, in the context of a longitudinal data structure, the projection of any given distribution $Q$ into a model constrained only by a Markov structure has an explicit analytic form.
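For discrete $L_j$, this projection amounts to marginalizing a joint probability table and renormalizing. A minimal sketch, with a randomly generated table standing in for $q$, follows.

```python
import numpy as np

def project_markov(q_joint):
    """Given the joint probabilities q(l_0, 1, l_1, 1, ..., l_{j-1}, 1, l_j)
    as an array with one axis per L_0, ..., L_j, return the table
    q*_{L_j,1}(l_j | l_{j-1}) obtained by marginalizing out L_0, ..., L_{j-2}
    and normalizing over l_j."""
    q_bar = q_joint.sum(axis=tuple(range(q_joint.ndim - 2)))  # axes (l_{j-1}, l_j)
    return q_bar / q_bar.sum(axis=1, keepdims=True)

# Toy usage with binary L_0, L_1, L_2 (j = 2) and a hypothetical joint table.
rng = np.random.default_rng(0)
q_joint = rng.random((2, 2, 2))
q_joint /= q_joint.sum()
print(project_markov(q_joint))   # 2 x 2 table of q*_{L_2,1}(l_2 | l_1)
```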

To compute $\phi_P(x)$ numerically, we first construct the linear perturbation $P_{\epsilon,\lambda} := (1-\epsilon)P + \epsilon H_{x,\lambda}$, where $H_{x,\lambda}$ is a distribution dominated by $P$ that concentrates its mass on shrinking neighborhoods of the set $\{x\}$ as $\lambda$ tends to zero. The projection $P^*_{\epsilon,\lambda}$ onto $\mathcal{M}$ is obtained by setting $Q = P_{\epsilon,\lambda}$ in the preceding paragraph. As in previous examples, we may approximate $\phi_P(x)$ by the secant line slope for small $\lambda$ and even smaller $\epsilon$. Because in this example $P^*_{\epsilon,\lambda}$ is available in closed form, $\phi_P(x)$ can alternatively be approximated by evaluating the derivative $\frac{d}{d\epsilon}\Psi(P^*_{\epsilon,\lambda})$ analytically for small $\lambda$.

For simplicity, in our numerical evaluation of the EIF, we restricted our attention to a setting with two post-baseline time points. We considered the joint distribution $P$ of $X$ defined in terms of the following conditional distributions. The baseline covariate $L_0$ has a discrete uniform distribution on the set $\{0, 1, 2, 3, 4\}$. Given $L_0 = l_0$, $A_0$ has a Bernoulli distribution with success probability $\mathrm{expit}(-1 + 0.5\,l_0)$. Given $A_0 = a_0$ and $L_0 = l_0$, $L_1$ has a normal distribution with mean $3l_0 - 3a_0$ and variance 4. Given $L_1 = l_1$, $A_0 = a_0$ and $L_0 = l_0$, $A_1$ has a Bernoulli distribution with success probability $\mathrm{expit}\{-5 + c_{10}(l_1) + a_0 + 0.5\,l_0\}$, where we define $c_{10}$ to be the trimming function $u \mapsto -10 \cdot I_{(-\infty,-10)}(u) + u \cdot I_{[-10,+10]}(u) + 10 \cdot I_{(+10,+\infty)}(u)$. Given $A_1 = a_1$, $L_1 = l_1$, $A_0 = a_0$ and $L_0 = l_0$, the outcome $Y$ has a Bernoulli distribution with success probability $\mathrm{expit}\{-1 + 0.5\,c_{10}(l_1) - 0.5\,a_1 a_0\}$. We evaluated approximations of $\phi_P(x)$ for $x = (0, 1, 2, 1, 1)$ based on either the secant line slope or the analytic pathwise derivative. We report the relative error of the approximation obtained using the secant line slope in Figure 6 and using the analytic derivative in Figure 7. Once again, the pattern observed in Figure 6 is similar to that seen in other examples. From Figure 7, we note that a high level of accuracy is achieved even with a relatively large $\lambda > 0$; thus, use of the analytic derivative essentially eliminates the need for the careful selection of approximation parameters otherwise required. These implementations were repeated at additional values of $x$; because very similar patterns emerged, those results are not reported here.
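For reference, the data-generating mechanism just described is easy to simulate directly; the sketch below draws observations under $P$ (the seed and interface are incidental choices).

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda t: 1 / (1 + np.exp(-t))
c10 = lambda u: np.clip(u, -10, 10)   # trimming function defined above

def draw(n):
    """Draw n copies of X = (L0, A0, L1, A1, Y) under the distribution P
    described in the text."""
    L0 = rng.integers(0, 5, size=n)                       # uniform on {0,...,4}
    A0 = rng.binomial(1, expit(-1 + 0.5 * L0))
    L1 = rng.normal(3 * L0 - 3 * A0, 2.0)                 # sd 2, i.e. variance 4
    A1 = rng.binomial(1, expit(-5 + c10(L1) + A0 + 0.5 * L0))
    Y = rng.binomial(1, expit(-1 + 0.5 * c10(L1) - 0.5 * A1 * A0))
    return np.column_stack([L0, A0, L1, A1, Y])

print(draw(5))
```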

Figure 6:

Absolute % error in the approximation of the EIF value using a secant line slope as a function of ϵ and λ in Example 4

Figure 7:

Absolute % error in the approximation of the EIF value using analytic differentiation as a function of λ in Example 4

6. Concluding remarks

The representation of the EIF we have presented in this paper provides a novel approach for analytically deriving the EIF and suggests natural strategies for approximating it numerically. This representation holds in arbitrary models under mild regularity conditions. Use of the representation requires the ability to project a given distribution into the statistical model; if the KL divergence is used, this is essentially no more than a maximum likelihood step that can be tackled by most practitioners. Most importantly, the work involved requires neither knowledge of efficiency theory nor familiarity with concepts from functional analysis or differential geometry. As such, this representation has the potential to democratize the calculation of the EIF, and thus the construction of efficient estimators, in nonparametric and semiparametric models. Even for seasoned researchers in semiparametric and nonparametric theory, it provides an alternative means of tackling difficult problems, including those for which the EIF is either difficult or impossible to derive analytically.

In most problems, we anticipate that the analytic work required to project the linear perturbation path onto the model space will be much simpler than that needed for the conventional tangent space approach. Nevertheless, this may still constitute a barrier in practice. However, because the task of projecting onto the model space is no more than an optimization problem, albeit an infinite-dimensional one, computational tools may readily be used to circumvent most, if not all, of the analytic work otherwise required. This is encouraging, since strong computational skills are commonplace in statistics and data science. Furthermore, we expect the numerical challenge to become increasingly surmountable as the capability of our computational devices continues to grow. While in this paper we established theoretical results that may facilitate computerization, the practical challenges of numerical implementation must still be carefully studied and addressed to achieve the goal of automated efficient estimation. For example, robust protocols must be devised for automatically selecting suitable values of the approximation parameters involved in the numerical computation of efficient estimators. These important questions are the focus of ongoing research.

As with all methods that incorporate some level of automation and more readily lend themselves to use by non-specialists, there is a clear potential for misuse of the results we have presented. This appears to be an inevitable risk inherent to this type of proposal, and it equally applies to some of the most celebrated tools in current statistical practice, including the bootstrap. Deriving the EIF analytically undoubtedly remains the gold-standard approach, and it should be preferred whenever possible, since much can be learned about the problem at hand from the analytic form of the EIF. In particular, verification of the regularity conditions invoked in this paper can be difficult without prior analytic knowledge of the EIF. Nevertheless, the representation introduced in this paper has the potential to serve as an important new tool in the arsenal of statistical researchers and practitioners alike for performing semiparametric and nonparametric analyses. Devising algorithms for verifying the required regularity conditions in any given problem is an important avenue for future research.

We have noted that a distinct advantage of the representation we have provided is that, once it has been used to compute the EIF of a certain parameter in a given statistical model, the EIF of any other parameter can be obtained without any additional work, since the bulk of the work required is exclusively model-specific. Nevertheless, the computational work involved must be repeated for each observation value at which we wish to evaluate the EIF. In particular, this makes it difficult to approximate the entire EIF as a function, particularly in the case of continuous or longitudinal data units. While the one-step approach only requires the EIF at the observed data points, the implementation of other efficient estimators with potentially better properties, such as targeted minimum loss-based estimators (TMLE), generally requires the entire EIF. The representation presented in this paper is therefore not conducive to a computerized implementation of TMLE. There is promise that alternative representations may be better suited for this purpose; this too is an area of active research.

Supplementary Material

Supp1

Acknowledgments

The authors gratefully acknowledge the support of the Career Development Fund of the Department of Biostatistics at the University of Washington (MC), the New Development Fund of the Fred Hutchinson Cancer Research Center (ARL), and NIH/NIAID grants 5UM1AI068635 (MC, ARL) and 5R01AI074345 (MJvdL).

Appendix

Throughout this Appendix, we will use the acronym RND to denote the Radon-Nikodym derivative.

Proof of Theorem 2. Denote by $r(\lambda)$ the supremum norm over $\mathcal{X}(P)$ of the RND of $H_{x,\lambda}$ relative to $P$. It is easy to verify that, for any fixed $\lambda > 0$, the RND of $P$ relative to $P_{\epsilon,\lambda}$ is uniformly bounded in $(0.5, 1.5)$ over $\mathcal{X}(P)$ provided $\epsilon < 1/\max\{r(\lambda) - 1, 3\}$. Using the fact that $P^*_{\epsilon,\lambda}$ minimizes $D(P_1, P_{\epsilon,\lambda})$ over all $P_1 \in \mathcal{M}(P_{\epsilon,\lambda})$, and that $P \in \mathcal{M}(P_{\epsilon,\lambda})$ since $P_{\epsilon,\lambda}$ dominates $P$ by construction, we then have that

$$0 \le D(P^*_{\epsilon,\lambda}, P_{\epsilon,\lambda}) \le D(P, P_{\epsilon,\lambda}) \le \tilde K_1 \int \left[\frac{dP}{dP_{\epsilon,\lambda}}(u) - 1\right]^2 dP_{\epsilon,\lambda}(u) = \tilde K_1 \int \left[\frac{dP_{\epsilon,\lambda}}{dP}(u) - 1\right]^2 \frac{dP}{dP_{\epsilon,\lambda}}(u)\, dP(u) \le \frac{3\tilde K_1}{2} \int \left[\frac{dP_{\epsilon,\lambda}}{dP}(u) - 1\right]^2 dP(u) \le \frac{3\tilde K_1}{2}\, \epsilon^2 r_2(\lambda)^2,$$

where we have set $\tilde K_1 := K_1(0.5, 1.5)$ and used conditions (B1) and (B2). It is straightforward to show that the RND of $P^*_{\epsilon,\lambda}$ relative to $P_{\epsilon,\lambda}$ is uniformly bounded between $0.5\,m_0$ and $1.5\,m_1$ for $\epsilon < 1/\max\{r(\lambda) - 1, 3\}$. Using condition (B2), this allows us to write that

$$0 \le \left\|\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}} - 1\right\|_{2,P_{\epsilon,\lambda}}^2 = \int \left[\sqrt{\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}}(u)} - 1\right]^2 \left[\sqrt{\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}}(u)} + 1\right]^2 dP_{\epsilon,\lambda}(u) \le \tilde K_2 \int \left[\sqrt{\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}}(u)} - 1\right]^2 dP_{\epsilon,\lambda}(u) \le \tilde K_0 \tilde K_2\, D(P^*_{\epsilon,\lambda}, P_{\epsilon,\lambda}) \le \frac{3}{2}\tilde K_0 \tilde K_1 \tilde K_2\, \epsilon^2 r_2(\lambda)^2$$

with $\tilde K_0 := K_0(0.5\,m_0, 1.5\,m_1)$ and $\tilde K_2 := 1.5\,m_1 + 2\sqrt{1.5\,m_1} + 1$. Writing $\tilde K := (\tfrac{3}{2}\tilde K_0 \tilde K_1 \tilde K_2)^{1/2}$ and recalling that $\phi^*_{\epsilon,\lambda} \in T_{\mathcal{M}}(P^*_{\epsilon,\lambda})$, the latter inequality implies, in view of condition (B3), that

$$\frac{\left|\int \phi^*_{\epsilon,\lambda}(u)\, dP_{\epsilon,\lambda}(u)\right|}{\epsilon} \le \frac{\tilde K \epsilon\, r_2(\lambda)\, B\!\left(\tilde K \epsilon\, r_2(\lambda)\right)}{\epsilon} = \tilde K r_2(\lambda)\, B\!\left(\tilde K \epsilon\, r_2(\lambda)\right) \longrightarrow 0$$

as $\epsilon$ tends to zero. This establishes that $\lim_{\lambda \to 0} \lim_{\epsilon \to 0} \int \phi^*_{\epsilon,\lambda}(u)\, dP_{\epsilon,\lambda}(u)/\epsilon = 0$, as required. □

Proof of Theorem 3. We first note that

$$\left|\int \phi^*_{\epsilon,\lambda}(u)\, d(H_{x,\lambda} - P)(u) - \phi_P(x)\right| = \left|\int \{\phi^*_{\epsilon,\lambda}(u) - \phi_P(u)\}\, dH_{x,\lambda}(u) + \int \phi_P(u)\, dH_{x,\lambda}(u) - \phi_P(x) - \int \phi^*_{\epsilon,\lambda}(u)\, dP(u)\right| \le \left|\int \{\phi^*_{\epsilon,\lambda}(u) - \phi_P(u)\}\, dH_{x,\lambda}(u)\right| + \left|\int \phi_P(u)\, dH_{x,\lambda}(u) - \phi_P(x)\right| + \left|\int \phi^*_{\epsilon,\lambda}(u)\, dP(u)\right|$$

and because by assumption the second and third summands on the second line tend to zero as $\lambda$ tends to zero, it suffices to study the first summand. We can bound this term by $\|\phi^*_{\epsilon,\lambda} - \phi_P\|_{\infty, S_{x,\lambda}}$, and so, if condition (a) holds, the result follows immediately. Alternatively, we can write this term as

$$\left|\int \{\phi^*_{\epsilon,\lambda}(u) - \phi_P(u)\}\, dH_{x,\lambda}(u)\right| = \left|\int \frac{dH_{x,\lambda}}{dP}(u)\, \{\phi^*_{\epsilon,\lambda}(u) - \phi_P(u)\}\, dP(u)\right| \le \left\|\phi^*_{\epsilon,\lambda} - \phi_P\right\|_{2,P} \left\|\frac{dH_{x,\lambda}}{dP}\right\|_{2,P}$$

and thus, if condition (b) holds, the result is also guaranteed to hold. □

Proof of Theorem 4. We build upon notation and results established in the proof of Theorem 2. Below, we take a small but otherwise arbitrary $\lambda > 0$ and consider only $\epsilon < 1/\max\{r(\lambda) - 1, 3\}$. We first note that

$$\frac{dP^*_{\epsilon,\lambda}}{dP} - 1 = \frac{dP_{\epsilon,\lambda}}{dP}\left(\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}} - 1\right) + \left(\frac{dP_{\epsilon,\lambda}}{dP} - 1\right).$$

We can write that

$$\left\|\frac{dP_{\epsilon,\lambda}}{dP}\left(\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}} - 1\right)\right\|_{2,P}^2 = \int \left[\frac{dP_{\epsilon,\lambda}}{dP}(u)\right]^2 \left[\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}}(u) - 1\right]^2 dP(u) \le \left\|\frac{dP_{\epsilon,\lambda}}{dP}\right\|_{\infty, \mathcal{X}(P)} \left\|\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}} - 1\right\|_{2,P_{\epsilon,\lambda}}^2 \le \left[1 - \epsilon + \epsilon\, r(\lambda)\right] \left[\tilde K \epsilon\, r_2(\lambda)\right]^2$$

and additionally, that

$$\left\|\frac{dP_{\epsilon,\lambda}}{dP} - 1\right\|_{2,P}^2 = \epsilon^2 \int \left[1 - \frac{dH_{x,\lambda}}{dP}(u)\right]^2 dP(u) \le \epsilon^2 r_2(\lambda)^2.$$

Thus, using the above inequalities and the triangle inequality, we find that

$$0 \le |R(P^*_{\epsilon,\lambda}, P)| \le C\left\|\frac{dP^*_{\epsilon,\lambda}}{dP} - 1\right\|_{2,P}^2 \le C \epsilon^2 r_2(\lambda)^2 \left[\sqrt{1 - \epsilon + \epsilon\, r(\lambda)}\; \tilde K + 1\right]^2,$$

which then implies that $\lim_{\lambda \to 0} \lim_{\epsilon \to 0} R(P^*_{\epsilon,\lambda}, P)/\epsilon = 0$, as required. □

Proof of Theorem 5. We denote by $f_r(\epsilon, \lambda)$ the derivative $\frac{d^r}{d\epsilon^r}\Psi(P^*_{\epsilon,\lambda})$ and by $\phi^{\circ}_{\epsilon,\lambda,m}$ the $(m+1)$-point forward finite-difference approximation to $\phi_P(x)$. The approximation error can be written as

$$\phi^{\circ}_{\epsilon,\lambda,m} - \phi_P(x) = \left[\phi^{\circ}_{\epsilon,\lambda,m} - f_1(0,\lambda)\right] + \left[f_1(0,\lambda) - \phi_P(x)\right],$$

where we will see that the first bracketed summand benefits from the use of the multi-point scheme. Define the Taylor remainder $R_{j\epsilon,\lambda,s} := f_0(j\epsilon,\lambda) - f_0(0,\lambda) - \sum_{r=1}^{s} \frac{(j\epsilon)^r}{r!}\, f_r(0,\lambda)$. Setting $A(r,m) := \sum_{j=0}^{m} j^r a_{j,m}$, with the convention $0^0 = 1$, and in view of the fact that $A(0,m) = 0$, $A(1,m) = 1$ and $A(r,m) = 0$ for $r = 2, 3, \ldots, m$ (Fornberg, 1988), we note that

$$\begin{aligned} \phi^{\circ}_{\epsilon,\lambda,m} &= \frac{1}{\epsilon}\sum_{j=0}^{m} a_{j,m}\, \Psi(P^*_{j\epsilon,\lambda}) = \sum_{j=1}^{m} a_{j,m} \left[\frac{\Psi(P^*_{j\epsilon,\lambda}) - \Psi(P)}{\epsilon}\right] = \sum_{j=1}^{m} j\, a_{j,m} \left[\frac{\Psi(P^*_{j\epsilon,\lambda}) - \Psi(P)}{j\epsilon}\right] \\ &= \sum_{j=1}^{m} j\, a_{j,m} \left[f_1(0,\lambda) + \sum_{r=2}^{s} \frac{(j\epsilon)^{r-1}}{r!}\, f_r(0,\lambda) + \frac{R_{j\epsilon,\lambda,s}}{j\epsilon}\right] \\ &= f_1(0,\lambda)\, A(1,m) + \sum_{r=2}^{s} \frac{\epsilon^{r-1}}{r!}\, f_r(0,\lambda)\, A(r,m) + \frac{1}{\epsilon}\sum_{j=1}^{m} a_{j,m}\, R_{j\epsilon,\lambda,s} = f_1(0,\lambda) + \frac{1}{\epsilon}\sum_{j=1}^{m} a_{j,m}\, R_{j\epsilon,\lambda,s}, \end{aligned}$$

where the second line follows from the first by a Taylor expansion of $\eta \mapsto \Psi(P^*_{\eta,\lambda})$ around $\eta = 0$ evaluated at $\eta = j\epsilon$. By Taylor's theorem, the remainder term has the form

$$R_{j\epsilon,\lambda,s} = \frac{(j\epsilon)^{s+1}}{(s+1)!}\, f_{s+1}(\xi_{j,\epsilon,\lambda}, \lambda)$$

for some $\xi_{j,\epsilon,\lambda}$ between $0$ and $j\epsilon$. By the differentiability condition, for small $\omega > 0$, we can write

$$f_{s+1}(\omega, \lambda) = \int \Upsilon_{s+1,\lambda}(P_{\omega,\lambda})(\underline{u}_{s+1}) \prod_{k=1}^{s+1}\left[\frac{dH_{x,\lambda}}{dP}(u_k) - 1\right] dP(u_1)\, dP(u_2) \cdots dP(u_{s+1}),$$

where $\underline{u}_{s+1} := (u_1, u_2, \ldots, u_{s+1})$ and $\Upsilon_{s+1,\lambda}(P_{\omega,\lambda})$ denotes an $(s+1)$th order gradient of $\tilde\Psi$ relative to the model $\mathcal{M}_{0,\lambda}$ evaluated at $P_{\omega,\lambda}$ (see, e.g., van der Vaart, 2014). We thus find that

$$|f_{s+1}(\xi_{j,\epsilon,\lambda}, \lambda)| \le \sup_{0 < \omega < m\epsilon} |f_{s+1}(\omega, \lambda)| \le \sup_{0 < \omega < m\epsilon} \left\|\Upsilon_{s+1,\lambda}(P_{\omega,\lambda})\right\|_{2,P^{s+1}} \left[1 + r_{s+1}(\lambda)\right]^{s+1}$$

with $P^{s+1} := P \times \cdots \times P$ a product measure of dimension $s+1$. For $\epsilon$ and $\lambda$ small enough, this gives that

$$\left|\phi^{\circ}_{\epsilon,\lambda,m} - f_1(0,\lambda)\right| = \left|\frac{1}{\epsilon}\sum_{j=1}^{m} a_{j,m}\, \frac{(j\epsilon)^{s+1}}{(s+1)!}\, f_{s+1}(\xi_{j,\epsilon,\lambda}, \lambda)\right| \le \epsilon^s \sum_{j=1}^{m} \frac{|a_{j,m}|\, j^{s+1}}{(s+1)!}\, \left|f_{s+1}(\xi_{j,\epsilon,\lambda}, \lambda)\right| \le \epsilon^s \left[1 + r_{s+1}(\lambda)\right]^{s+1} \frac{\sup_{0<\omega<m\epsilon} \left\|\Upsilon_{s+1,\lambda}(P_{\omega,\lambda})\right\|_{2,P^{s+1}}}{(s+1)!} \sum_{j=1}^{m} j^{s+1}\, |a_{j,m}| = O\!\left(\epsilon^s\, r_{s+1}(\lambda)^{s+1}\right).$$

Finally, we note that $f_1(0,\lambda) = \int \phi_P(u)\, H_{x,\lambda}(du)$ in view of the proof of Theorem 1. If $u \mapsto \phi_P(u)$ has bounded second derivatives near $u = x$, and $H_{x,\lambda}$ is based on a second-order kernel, it follows from standard results in the kernel smoothing literature that the systematic deviation $f_1(0,\lambda) - \phi_P(x)$ has order $O(\lambda^2)$. □
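For intuition behind this last step, suppose (as an illustrative assumption consistent with the above) that $H_{x,\lambda}$ has density $u \mapsto \lambda^{-1}K((u-x)/\lambda)$ for a symmetric, second-order kernel $K$. A Taylor expansion of $\phi_P$ around $x$ then gives
$$\int \phi_P(u)\, H_{x,\lambda}(du) - \phi_P(x) = \frac{\lambda^2}{2}\, \phi_P''(x) \int w^2 K(w)\, dw + o(\lambda^2),$$
which is precisely the $O(\lambda^2)$ systematic deviation invoked above.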

References

1. Begun J, Hall W, Huang W, and Wellner J (1983). Information and asymptotic efficiency in parametric-nonparametric models. The Annals of Statistics, pages 432–452.
2. Bickel P, Klaassen C, Ritov Y, and Wellner J (1997). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
3. Bickel P and Ritov Y (1988). Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A, pages 381–393.
4. Chambaz A and van der Laan MJ (2014). Inference in targeted group-sequential covariate-adjusted randomized clinical trials. Scandinavian Journal of Statistics, 41(1):104–140.
5. Chamberlain G (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334.
6. Chen X (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632.
7. Fornberg B (1988). Generation of finite difference formulas on arbitrarily spaced grids. Mathematics of Computation, 51(184):699–706.
8. Frangakis C, Qian T, Wu Z, and Díaz I (2015). Deductive derivation and Turing-computerization of semiparametric efficient estimation (with discussion). Biometrics.
9. Geskus R and Groeneboom P (1999). Asymptotically optimal estimation of smooth functionals for interval censoring, case 2. The Annals of Statistics, 27(2):627–674.
10. Gilbert PB, Yu X, and Rotnitzky A (2014). Optimal auxiliary-covariate-based two-phase sampling design for semiparametric efficient estimation of a mean or mean difference, with application to clinical trials. Statistics in Medicine, 33(6):901–917.
11. Hájek J (1970). A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(4):323–330.
12. Hájek J (1972). Local asymptotic minimax and admissibility in estimation. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 175–194.
13. Hampel FR, Ronchetti EM, Rousseeuw PJ, and Stahel WA (2011). Robust Statistics: The Approach Based on Influence Functions, volume 196. John Wiley & Sons.
14. Ichimura H and Newey W (2015). The influence function of semiparametric estimators. arXiv preprint arXiv:1508.01378.
15. Kiefer J and Wolfowitz J (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, pages 887–906.
16. Koshevnik Y and Levit B (1977). On a non-parametric analogue of the information matrix. Theory of Probability & Its Applications, 21(4):738–753.
17. Le Cam L (1972). Limits of experiments. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 245–261.
18. Luedtke A, Carone M, and van der Laan M (2015). A discussion of "Deductive derivation and Turing-computerization of semiparametric efficient estimation" by Frangakis et al. Biometrics.
19. Maathuis MH and Wellner JA (2008). Inconsistency of the MLE for the joint distribution of interval-censored survival times and continuous marks. Scandinavian Journal of Statistics, 35(1):83–103.
20. Newey W (1994). The asymptotic variance of semiparametric estimators. Econometrica, pages 1349–1382.
21. Newey WK (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168.
22. Pfanzagl J (1982). Contributions to a General Asymptotic Statistical Theory. Springer.
23. Quale CM, van der Laan MJ, and Robins JR (2006). Locally efficient estimation with bivariate right-censored data. Journal of the American Statistical Association, 101(475):1076–1084.
24. Robins JM (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–1512.
25. Shen X (1997). On methods of sieves and penalization. The Annals of Statistics, pages 2555–2591.
26. Stein C (1956). Efficient nonparametric testing and estimation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 187–195.
27. Tsiatis A (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media.
28. van der Laan M (1996). Efficient estimation in the bivariate censoring model and repairing NPMLE. The Annals of Statistics, 24(2):596–627.
29. van der Laan M and Robins J (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer.
30. van der Laan M and Rose S (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.
31. van der Laan M and Rubin D (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1).
32. van der Laan MJ and Rose S (2018). Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Springer.
33. van der Vaart A (1991). On differentiable functionals. The Annals of Statistics, pages 178–204.
34. van der Vaart A (2014). Higher order tangent spaces and influence functions. Statistical Science, 29(4):679–686.
35. Wong W and Severini T (1991). On maximum likelihood estimation in infinite dimensional parameter spaces. The Annals of Statistics, pages 603–632.
