Author manuscript; available in PMC 2020 May 13. Published in final edited form as: J Am Stat Assoc. 2018 Sep 13;114(527):1174–1190. doi:10.1080/01621459.2018.1482752

Toward computerized efficient estimation in infinite-dimensional models

Marco Carone 1, Alexander R Luedtke 2, Mark J van der Laan 3

Abstract

Despite the risk of misspecification they are tied to, parametric models continue to be used in statistical practice because they are simple and convenient to use. In particular, efficient estimation procedures in parametric models are easy to describe and implement. Unfortunately, the same cannot be said of semiparametric and nonparametric models. While the latter often reflect the level of available scientific knowledge more appropriately, performing efficient inference in these models is generally challenging. The efficient influence function is a key analytic object from which the construction of asymptotically efficient estimators can potentially be streamlined. However, the theoretical derivation of the efficient influence function requires specialized knowledge and is often a difficult task, even for experts. In this paper, we present a novel representation of the efficient influence function and describe a numerical procedure for approximating its evaluation. The approach generalizes the nonparametric procedures of Frangakis et al. (2015) and Luedtke et al. (2015) to arbitrary models. We present theoretical results to support our proposal, and illustrate the method in the context of several semiparametric problems. The proposed approach is an important step toward automating efficient estimation in general statistical models, thereby rendering more accessible the use of realistic models in statistical analyses.

Keywords: asymptotic efficiency, canonical gradient, efficient influence function, nonparametric and semiparametric models, pathwise differentiability

1. Introduction

1.1. Efficient estimation of smooth target parameters

Efficient estimation techniques are appealing because they maximally exploit available information and minimize the uncertainty of the resulting scientific findings. Efficiency is most often studied in an asymptotic sense due to its intractability in many finite-sample problems. Characterizing asymptotic efficiency and constructing asymptotically efficient estimators have been an important focus of methodological and theoretical research in statistics. For convenience, from here on, we ascribe without explicit mention an asymptotic sense to the terms efficient and efficiency.

In the context of parametric models, a simple efficiency theory has been available for nearly a century, largely fueled by Fisher’s work on maximum likelihood estimation. In such models, efficiency is characterized by the Cramér–Rao bounds and efficient estimators can generally be obtained via maximum likelihood (e.g., Hájek, 1970, 1972; Le Cam, 1972). When parametric models are adopted in practice, it is often because they are simple and convenient to use. However, the use of such models carries the risk of model misspecification, which may adversely affect the scientific process. In many scientific problems, the available background knowledge simply does not justify the use of such restrictive statistical models.

Infinite-dimensional models – either nonparametric or semiparametric – offer a more flexible alternative. These richer models mitigate the potential for model misspecification and often more accurately reflect the level of available prior knowledge. Unfortunately, establishing efficiency bounds for target parameters in infinite-dimensional models can be a complex task. The development of a general efficiency theory, valid for arbitrary statistical models, is a more recent accomplishment: except for the early seminal contribution of Stein (1956), developments in this area began in the late 1970s and early 1980s with the works of Koshevnik and Levit (1977), Pfanzagl (1982) and Begun et al. (1983), among others, and continued throughout the 1990s (e.g., van der Vaart, 1991; Newey, 1994). Notably, it builds upon notions of differential geometry and functional analysis. In certain cases, a generalized notion of maximum likelihood, as described by Kiefer and Wolfowitz (1956), for example, can still be used to produce efficient estimators of certain parameters – see, for example, Wong and Severini (1991) for a general study under broad conditions. In other cases though, the statistical model or target parameter may be too complex for a maximum likelihood estimator to exist, let alone be well-behaved.

Several approaches have been proposed for efficient estimation in the context of infinite-dimensional models. In problems in which the likelihood approach is problematic, some authors have successfully repaired the likelihood function (e.g., van der Laan, 1996; Maathuis and Wellner, 2008). Regularization techniques, including penalization and the method of sieves, have proven useful for devising efficient estimators (e.g., Shen, 1997; Newey, 1997; Chen, 2007). General approaches for efficient estimation based upon knowledge of the efficient influence function have been developed and extensively studied (e.g., Pfanzagl, 1982; Bickel et al., 1997; van der Laan and Robins, 2003; van der Laan and Rubin, 2006). These flexible approaches have been used in the past decade to perform valid inference incorporating a variety of machine learning tools. In particular, they have been employed to conduct robust and efficient inference via ensemble learning, wherein nuisance functions are estimated using a data-driven amalgamation of many learning strategies. The latter can range from simple parametric techniques to complex machine learning algorithms, and can include penalization and sieve-based learning methods. A variety of illustrations are provided in van der Laan and Rose (2011, 2018), for example.

1.2. The efficient influence function and its role in efficient estimation

The efficient influence function, hereafter referred to as the EIF, is a key object in efficiency theory for infinite-dimensional models. It bears this name because it is the influence function of any efficient estimator of the parameter of interest relative to the statistical model with respect to which it was calculated. As such, the efficiency bound for this parameter, and thus the asymptotic variance of any efficient estimator, can be derived from the EIF. Knowledge of the EIF has important implications for practice. First, the relative efficiency of a candidate estimator can be assessed objectively by comparing its variance to the efficiency bound implied by the EIF. Second, using an EIF-based estimate of the efficiency bound, asymptotically valid Wald confidence intervals can be constructed using any available efficient estimator, thereby circumventing the need to rely on bootstrap schemes that may either be invalid or computationally infeasible in a given problem. Third, knowledge of the EIF can assist investigators in planning the design of their study by helping quantify the effect of different sampling strategies on the variability of any efficient estimator used. Concrete examples include the optimal selection of subgroups on which to process resource-intensive measurements in two-phase sampling designs (see, e.g., Gilbert et al., 2014) and the optimal design of group-sequential adaptive trials (e.g., Chambaz and van der Laan, 2014). Fourth, as it is the influence function of any efficient estimator, the EIF can be used to investigate the robustness of efficient estimators (e.g., Hampel et al., 2011). For example, the EIF can be used to assess the approximate (i.e., asymptotic) contribution of individual observations to the value of an efficient estimator. Finally, and perhaps most importantly, as suggested above, if the analytic form of the EIF is available, efficient estimators can themselves be constructed rather easily. To do so, several approaches may be used, including, for example, gradient-based estimating equations (e.g., van der Laan and Robins, 2003; Tsiatis, 2007), Newton-Raphson one-step corrections (e.g., Pfanzagl, 1982) and targeted minimum loss-based estimation (e.g., van der Laan and Rose, 2011). Thus, there is a strong motivation for deriving the EIF in a given statistical problem.

Unfortunately, the analytic computation of the EIF is seldom straightforward. It generally involves finding a candidate influence function and projecting it onto the tangent space of the statistical model, which must itself be characterized – the effort can be mathematically intricate. Over the years, many techniques have been developed to facilitate this task in certain classes of problems – the discretization technique of Chamberlain (1987) is one such example. Nevertheless, this calculation remains a rather specialized skill, mastered mostly by a small collection of theoretically-inclined researchers. Worse still, in some problems, the EIF does not even have a closed form, rendering its evaluation and use particularly difficult (e.g., Geskus and Groeneboom, 1999; Quale et al., 2006). The theoretical derivation of EIFs is generally not in the skill set of practicing statisticians. Yet, in many problems, it is a skill needed to facilitate optimal inference in more realistic statistical models. The paucity of this skill may have impeded a broader appreciation and adoption of semiparametric and nonparametric techniques in applications.

In view of this barrier, one naturally wonders whether a suitable numerical approximation could serve as a substitute for the analytic form of the EIF, and further whether its calculation could be computerized. An affirmative answer to this question would render the implementation of efficient inferential techniques in semiparametric and nonparametric models much more accessible to practitioners, and the impact on statistical practice could be significant. Recently, an important first step toward this goal was made by Frangakis et al. (2015): these authors proposed a simple numerical routine for calculating the EIF in the context of nonparametric models when the data are discrete-valued or when the parameter is a smooth functional of the distribution function. In our discussion of their article (see Luedtke et al., 2015), we suggested a regularization of their technique that is valid more broadly within the context of nonparametric models. Nevertheless, neither of these methods formally addresses the more difficult problem of numerically computing the EIF in semiparametric models. As opposed to nonparametric models, for which the tangent space is trivially described, semiparametric models generally have much more complex tangent spaces, and projecting onto them often requires great experience. Identifying a numerical approach for computing the EIF in semiparametric models is therefore a more difficult but also more sorely needed innovation.

1.3. Contributions and organization of this article

In this article, we establish and study a novel representation of the EIF that directly generalizes the nonparametric results of Frangakis et al. (2015) and Luedtke et al. (2015) to arbitrary models. As we illustrate through several examples, this representation can be used to calculate, either analytically or numerically, the EIF of a given parameter in a given statistical model without the specialized knowledge traditionally required for this task. As such, it holds great promise in facilitating the computerization of efficient estimation not only in nonparametric models but also in semiparametric models, as we discuss below.

This paper is organized as follows. In Section 2, we present our novel representation of the EIF for use in arbitrary statistical models and present high-level conditions under which it holds. In Section 3, we study each of these conditions and determine sufficient conditions under which the validity of this representation is guaranteed. We explore various practical issues regarding the implementation of our proposal and provide guidelines for use in Section 4. In Section 5, we illustrate the use of the approach in the context of four semiparametric problems. We provide concluding remarks in Section 6. While Theorem 1 is proved in the body of the paper, the proofs of Theorems 2–5 are provided in an Appendix.

2. Numerical calculation of the efficient influence function

2.1. Preliminaries

Suppose that we observe independent d-dimensional variates $X_1,X_2,\ldots,X_n$ following a distribution $P_0$ known only to belong to the statistical model $\mathcal{M}$. We denote by $\mathcal{X}(P)\subseteq\mathbb{R}^d$ the sample space associated to $P\in\mathcal{M}$. We are interested in efficiently inferring about $\psi_0 := \Psi(P_0)$ using the available data, where $\Psi:\mathcal{M}\to\mathbb{R}^q$ represents a pathwise differentiable parameter mapping of interest. Pathwise differentiability ensures the parameter is a sufficiently smooth mapping so as to admit an efficiency theory (see, e.g., Pfanzagl, 1982; Bickel et al., 1997). We denote by $L^2_0(P)$ the Hilbert space of functions from $\mathcal{X}(P)$ to $\mathbb{R}^q$ with mean zero and finite variance under P. The parameter Ψ is said to be pathwise differentiable if there exists some $\chi_P\in L^2_0(P)$ such that, for each regular one-dimensional parametric submodel $\mathcal{M}_0 := \{P_\epsilon : \epsilon\in\mathcal{E}\}\subseteq\mathcal{M}$ with $\mathcal{E}$ an interval containing zero and $P_{\epsilon=0}=P$, the pathwise derivative $\tfrac{d}{d\epsilon}\Psi(P_\epsilon)|_{\epsilon=0}$ can be represented as the inner product $\int \chi_P(u)\,s(u)\,dP(u)$, where s is the score for ϵ at ϵ = 0 in $\mathcal{M}_0$ (Pfanzagl, 1982). Any such element $\chi_P$ is said to be a gradient of Ψ at P relative to $\mathcal{M}$. The tangent space $T_{\mathcal{M}}(P)\subseteq L^2_0(P)$ of $\mathcal{M}$ at P is defined as the closure of the linear span of scores at P arising from regular one-dimensional P-dominated parametric submodels of $\mathcal{M}$ through P. The canonical gradient is the unique gradient contained in $T_{\mathcal{M}}(P)$ and corresponds to the EIF under sampling from P. Throughout, we refer to the EIF at P as $\phi_P$ and write $\phi_P(x)$ for the evaluation of $\phi_P$ at the observation value x. The asymptotic variance of an efficient estimator of $\psi_0$ relative to model $\mathcal{M}$ is given by $\int \phi_{P_0}(u)\,\phi_{P_0}(u)^\top dP_0(u)$. Without loss of generality, we assume q = 1, as the general case can be handled by applying the developments herein to each component.

If pathwise differentiability holds uniformly over paths in a neighborhood around P, then for any $P_1\in\mathcal{M}$ close enough to P, the parameter admits the linearization

$$\Psi(P_1)-\Psi(P) = \int \phi_{P_1}(u)\,d(P_1-P)(u)+R(P_1,P) = -\int \phi_{P_1}(u)\,dP(u)+R(P_1,P) \qquad (2.1)$$

where $R(P_1,P)$ is a second-order remainder term, and the second equality follows since $\int \phi_{P_1}(u)\,dP_1(u)=0$. This representation, which is no more than a first-order Taylor approximation over the model space, holds for most smooth parameters arising in practice. The precise form of R is generally established by hand on a case-by-case basis. This linearization is critical for motivating and studying the use of both Newton-Raphson one-step correction and targeted minimum loss-based estimation to construct efficient estimators. It is also at the heart of our current proposal for obtaining a numerical approximation to the EIF value 𝜙P(x) at a given distribution P ∈ M and observation value x ∈ X(P).

2.2. Nonparametric models

Recently, Frangakis et al. (2015) presented one such proposal based on the representation of 𝜙P(x) as the Gâteaux derivative of Ψ at P in the direction of $\delta_x - P$, where $\delta_x$ represents the degenerate distribution at x. Of course, this can also be seen as the pathwise derivative $\tfrac{d}{d\epsilon}\Psi(P_\epsilon)|_{\epsilon=0}$ of Ψ at P along the linear perturbation path $\{P_\epsilon := (1-\epsilon)P+\epsilon\delta_x : 0\le\epsilon\le1\}$ between P and $\delta_x$ – this simple observation will be helpful when tackling arbitrary models. Here and throughout, any such derivative is of course interpreted as a right derivative. To computerize the process of calculating 𝜙P(x), these authors suggested approximating this derivative by the slope of the secant line through $(0,\Psi(P))$ and $(\epsilon,\Psi(P_\epsilon))$ for a small ϵ > 0. In our discussion of Frangakis et al. (2015), we pointed out sufficient conditions that guarantee that this indeed approximates 𝜙P(x) (Luedtke et al., 2015). For example, this approach is valid whenever the model $\mathcal{M}$ is nonparametric and the sample space $\mathcal{X}(P)$ is finite. However, if the parameter Ψ depends on local features of the distribution, this method may fail when $\mathcal{X}(P)$ is infinite, such as when any component of X is continuous under P. We proposed a slight modification of the procedure of Frangakis et al. (2015) to remedy this limitation. Specifically, we proposed replacing the degenerate distribution $\delta_x$ at x by a distribution $H_{x,\lambda}$ dominated by P and such that $\int g(u)\,dH_{x,\lambda}(u)\to g(x)$ as λ → 0 for all g in a sufficiently large class of functions. This amounts to replacing the degenerate distribution by a nearly degenerate distribution with smoothing parameter λ > 0. In a technical report published contemporaneously, Ichimura and Newey (2015) also suggested this approach. As stated in Luedtke et al. (2015), under certain regularity conditions and provided $\mathcal{M}$ is nonparametric, it is generally the case that

$$\phi_P(x) = \lim_{\lambda\to0}\left[\frac{d}{d\epsilon}\Psi(P_{\epsilon,\lambda})\Big|_{\epsilon=0}\right] \qquad (2.2)$$

where we have defined the linear perturbation path $P_{\epsilon,\lambda} := (1-\epsilon)P + \epsilon H_{x,\lambda}$. Representation (2.2) is useful when the parameter is simple enough that calculating the derivative of ϵ ↦ Ψ(Pϵ,𝜆) is analytically convenient. Otherwise, as in Frangakis et al. (2015), the derivative in (2.2) can be approximated, for example, by the secant line slope

$$\frac{\Psi(P_{\epsilon,\lambda})-\Psi(P)}{\epsilon}$$

for small ϵ and λ. This operation only requires the ability to evaluate Ψ on a given distribution. Generally, as we highlighted in Luedtke et al. (2015), ϵ must be chosen much smaller than λ to obtain an accurate approximation – this emphasizes that the limit and derivative operators in (2.2) cannot generally be interchanged. We discuss this point in detail later. Because representation (2.2) is a special case of the general result we present below, we defer a statement of regularity conditions and a formal proof until then.
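
As a minimal illustration of this secant-slope recipe, consider the toy nonparametric parameter Ψ(P) := (E_P[X])², whose EIF is known to be 𝜙P(x) = 2E_P[X]{x − E_P[X]}. The sketch below is ours (the choice of P, of x, and all names are illustrative assumptions); for this smooth functional of the mean, the mixture mean is available in closed form and λ plays essentially no role, unlike for parameters depending on local features of the distribution.

```python
import numpy as np
from scipy import stats

# Sketch: secant-slope approximation (2.2) of the nonparametric EIF of
# Psi(P) = (E_P[X])^2, whose known EIF is phi_P(x) = 2 E_P[X] (x - E_P[X]).
P = stats.norm(1.0, 1.0)            # assumed toy choice of P
x, lam, eps = 2.0, 1e-3, 1e-7      # eps taken much smaller than lam

def Psi_mixture(eps, lam):
    # Mean of P_{eps,lam} = (1 - eps) P + eps Uniform(x - lam, x + lam);
    # the uniform H_{x,lam} has mean x, so the mixture mean is closed form.
    return ((1 - eps) * P.mean() + eps * x) ** 2

slope = (Psi_mixture(eps, lam) - Psi_mixture(0.0, lam)) / eps
print(slope)                        # approx 2 * 1 * (2 - 1) = 2
```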

2.3. Arbitrary models

When the model M is not nonparametric, representation (2.2) generally does not hold. Except when ϵ = 0, the linear perturbation path described by Pϵ,λ is not necessarily contained in the model. As such, the parameter may not even be defined on this path. Even if it is, the approximation suggested by this representation generally only yields the EIF of Ψ relative to a nonparametric model rather than the actual model. While the nonparametric EIF is still an influence function in M, it is not typically efficient relative to M. This should not be surprising since the expression in (2.2) in no way acknowledges constraints implied by M.

The linear perturbation approach of Frangakis et al. (2015) and Luedtke et al. (2015) does not generally lead to the EIF in semiparametric and parametric models. Nevertheless, a first-order analysis of the parameter mapping is helpful to determine conditions under which, in arbitrary models, the EIF can be expressed in terms of a pathwise derivative over some perturbation path. Below, we denote by $P^*_{\epsilon,\lambda}\in\mathcal{M}$ a candidate perturbation of P determined by a choice of $H_{x,\lambda}$ and ϵ. High-level conditions under which $P^*_{\epsilon,\lambda}$ serves as an adequate perturbation are outlined in the theorem below. Here and throughout, we use the shorthand notation $\phi^*_{\epsilon,\lambda}$ to denote the evaluation $\phi_{P^*_{\epsilon,\lambda}}$ of the EIF at $P^*_{\epsilon,\lambda}$.

Theorem 1.

The EIF of Ψ relative to $\mathcal{M}$ at P ∈ M evaluated at observation value x is given by

$$\phi_P(x) = \lim_{\lambda\to0}\left[\frac{d}{d\epsilon}\Psi(P^*_{\epsilon,\lambda})\Big|_{\epsilon=0}\right] \qquad (2.3)$$

provided the following conditions hold:

  • (A1)

    (near-solution of EIF estimating equation) $\lim_{\lambda\to0}\lim_{\epsilon\to0}\int \phi^*_{\epsilon,\lambda}(u)\,dP_{\epsilon,\lambda}(u)\big/\epsilon = 0$;

  • (A2)

    (continuity of EIF) $\lim_{\lambda\to0}\lim_{\epsilon\to0}\int \phi^*_{\epsilon,\lambda}(u)\,d(H_{x,\lambda}-P)(u) = \phi_P(x)$;

  • (A3)

    (preservation of convergence rate) $\lim_{\lambda\to0}\lim_{\epsilon\to0} R(P^*_{\epsilon,\lambda},P)\big/\epsilon = 0$.

Proof. Setting $P_1 = P^*_{\epsilon,\lambda}$ in (2.1), we note that

$$\Psi(P^*_{\epsilon,\lambda})-\Psi(P) = -\int \phi^*_{\epsilon,\lambda}(u)\,dP(u)+R(P^*_{\epsilon,\lambda},P) = \int \phi^*_{\epsilon,\lambda}(u)\,d(P_{\epsilon,\lambda}-P)(u) - \int \phi^*_{\epsilon,\lambda}(u)\,dP_{\epsilon,\lambda}(u) + R(P^*_{\epsilon,\lambda},P).$$

The result follows from (A1), (A2) and (A3) upon noting that, since $P_{\epsilon,\lambda}-P=\epsilon(H_{x,\lambda}-P)$, we have that

$$\frac{\Psi(P^*_{\epsilon,\lambda})-\Psi(P)}{\epsilon} = \int \phi^*_{\epsilon,\lambda}(u)\,d(H_{x,\lambda}-P)(u) - \frac{\int \phi^*_{\epsilon,\lambda}(u)\,dP_{\epsilon,\lambda}(u)}{\epsilon} + \frac{R(P^*_{\epsilon,\lambda},P)}{\epsilon}.$$

We must now determine a strategy for constructing a perturbation $P^*_{\epsilon,\lambda}$ that can be expected to satisfy (A1)–(A3). Although the linear path $\{P_{\epsilon,\lambda} : 0\le\epsilon\le1\}$ is generally inadequate, it appears natural to consider the path determined by the projection $P^*_{\epsilon,\lambda}$ of $P_{\epsilon,\lambda}$ into $\mathcal{M}$, or a suitably regularized version thereof, according to an appropriate divergence. For any $P_1\in\mathcal{M}$, we define $\mathcal{M}(P_1) := \{P_2\in\mathcal{M} : P_2\ll P_1\}\subseteq\mathcal{M}$ as the subset of all probability measures in $\mathcal{M}$ dominated by $P_1$, and let $\mathcal{M}_{NP}\supseteq\mathcal{M}$ represent a nonparametric model containing $\mathcal{M}$. Then, we formally define a novel perturbation path in the model as

$$P^*_{\epsilon,\lambda} := \operatorname*{argmin}_{P_1\in\mathcal{M}(P)} D(P_1, P_{\epsilon,\lambda}), \qquad (2.4)$$

where the function $D:\mathcal{M}_{NP}\times\mathcal{M}_{NP}\to[0,+\infty]$ satisfies the following conditions:

  • (B1)

    for each $P_1,P_2\in\mathcal{M}$, $D(P_1,P_2)\ge 0$, and $D(P_1,P_2)=0$ if and only if $P_1 = P_2$;

  • (B2)
    for any $P_2\in\mathcal{M}$ and $P_1\in\mathcal{M}(P_2)$ such that the Radon–Nikodym derivative of $P_1$ relative to $P_2$ is uniformly bounded in an interval $[m_0,m_1]\subset(0,+\infty)$, provided $P_1\neq P_2$, the inequality

$$K_0(m_0,m_1) \le \frac{D(P_1,P_2)}{\int \left[\frac{dP_1}{dP_2}(u)-1\right]^{2}\,dP_2(u)} \le K_1(m_0,m_1)$$

    holds for constants $0 < K_0(m_0,m_1) < K_1(m_0,m_1) < +\infty$ depending on $m_0$ and $m_1$ alone;

  • (B3)
    for each $P_2\in\mathcal{M}_{NP}$ and each $K\in(0,+\infty)$, setting $P_2^* := \operatorname{argmin}_{P_1\in\mathcal{M}(P_2)} D(P_1,P_2)$, provided the Radon–Nikodym derivative of $P_2^*$ relative to $P_2$ is uniformly bounded in an interval $[m_0,m_1]\subset(0,+\infty)$, there exists a map $B : [0,+\infty)\to[0,+\infty)$, possibly depending on $(K,m_0,m_1)$ and with $\lim_{\delta\to0} B(\delta) = 0$, for which

$$\left\|\frac{dP_2^*}{dP_2}-1\right\|_{2,P_2}\le\delta \quad\text{implies that}\quad \left|\int h(u)\,dP_2(u)\right|\le\delta\,B(\delta)$$

    for each $h\in T_{\mathcal{M}}(P_2^*)$ with $\|h\|_\infty < K$.

Condition (B1) simply requires D to be a divergence function. Condition (B2) allows us to relate values of this divergence and relevant L2 norms. Condition (B3) is used to ensure condition (A1). Heuristically, it requires that, under a given distribution, elements of the tangent space at the projection of this distribution have mean nearly zero if the distribution is close enough to the model. Additional details are provided below.

At least two of the most important divergences fit into this general setup and satisfy all conditions: the Kullback-Leibler (KL) divergence and the Hellinger distance, corresponding respectively to

$$D_{KL}:(P_1,P_2)\mapsto -\int \log\left[\frac{dP_1}{dP_2}(u)\right]dP_2(u) \qquad\text{and}\qquad D_{H}:(P_1,P_2)\mapsto \int\left[1-\sqrt{\frac{dP_1}{dP_2}(u)}\right]^{2}dP_2(u).$$

The fact that these particular divergences indeed satisfy (B1)–(B3) is established in the Supplementary Material. The KL divergence appears to be the most natural choice for several reasons. First, for this choice of divergence, for any given $P_2\in\mathcal{M}_{NP}$ and any $h\in T_{\mathcal{M}}(P_2^*)$ with $P_2^*$ the projection of $P_2$ into $\mathcal{M}(P_2)$, the equality $\int h(u)\,dP_2(u)=0$ can be shown to hold exactly. To see this, suppose that h is in the interior of $T_{\mathcal{M}}(P_2^*)$ and construct the submodel $dP^*_{2,\eta}=(1+\eta h)\,dP_2^*$ for η in a neighborhood of zero. Because $P_2^*$ is a global minimizer of the KL divergence, we have that

$$0 = \frac{d}{d\eta}\left\{-\int \log\left[\frac{dP^*_{2,\eta}}{dP_2}(u)\right]dP_2(u)\right\}\bigg|_{\eta=0} = -\int h(u)\,dP_2(u).$$

If h is instead on the boundary of $T_{\mathcal{M}}(P_2^*)$, the argument above can be repeated for a sequence of interior elements tending to h. Thus, for the KL divergence, an even stronger property than (B3) holds, and it implies in particular that $\int\phi^*_{\epsilon,\lambda}(u)\,dP_{\epsilon,\lambda}(u)=0$ for each ϵ and λ. As a consequence, a stronger version of (A1) holds – the EIF estimating equation is in fact exactly solved – which could improve the resulting approximation of the EIF in practice. Second, because of its central role in likelihood methods, the KL divergence has been studied extensively, and many strategies for likelihood maximization have been described in the statistical literature. Whether other appropriate divergences, such as the Hellinger distance, provide concrete benefits in some problems remains to be seen. In the illustrations of Section 5, we focus primarily on the KL divergence, although we also discuss the use of the Hellinger distance in one example. In the Supplementary Material, we also provide an example in which the use of a seemingly natural yet inappropriate divergence leads to a violation of (2.3). This highlights the importance of verifying conditions (B1)–(B3) for divergences other than $D_{KL}$ and $D_H$, both of which we have already verified to satisfy these conditions.
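
For concreteness, both divergences are straightforward to evaluate numerically for a dominated pair of distributions. The following sketch, with two arbitrary Gaussian choices standing in for $P_1$ and $P_2$ (an assumption of ours), computes $D_{KL}$ and $D_H$ by quadrature.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Sketch: numerical evaluation of D_KL and D_H above for a dominated pair
# P1 << P2 via quadrature; the two Gaussian choices are illustrative only.
u = np.linspace(-10, 10, 100001)
p2 = stats.norm(0.0, 1.0).pdf(u)          # density of P2
p1 = stats.norm(0.3, 1.2).pdf(u)          # density of P1
ratio = p1 / p2                           # Radon-Nikodym derivative dP1/dP2

D_KL = trapezoid(-np.log(ratio) * p2, u)  # -int log(dP1/dP2) dP2
D_H = trapezoid((1 - np.sqrt(ratio)) ** 2 * p2, u)
print(D_KL, D_H)                          # both positive; zero iff P1 = P2
```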

It is useful to scrutinize the conditions of Theorem 1 in the context of the projected perturbation path. Condition (A1) drives in large part our generalization of the procedures of Frangakis et al. (2015) and Luedtke et al. (2015) to arbitrary models. Projecting the path $\{P_{\epsilon,\lambda} : 0\le\epsilon\le1\}$ into $\mathcal{M}$ to obtain $\{P^*_{\epsilon,\lambda} : 0\le\epsilon\le1\}$ is expected to ensure that the score-like equation described in (A1) is solved. Condition (A2) imposes mild continuity requirements on the EIF. Since R is a second-order term, $R(P_{\epsilon,\lambda},P)$ is generally of order $O(\epsilon^2)$ for each fixed λ > 0. Condition (A3) requires that $R(P^*_{\epsilon,\lambda},P)$ be of order o(ϵ) for λ small and ϵ sufficiently smaller. Determining how the projection step and the smoothing parameter λ > 0 affect the rate of this term is critical to establishing whether (A3) holds. This is studied in detail in the next section.

Much of the effort required to use representation (2.3) to compute the EIF goes into identifying the projection $P^*_{\epsilon,\lambda}$ of $P_{\epsilon,\lambda}$ onto the model space. In some cases, this projection can be determined analytically, while in others, a numerical approach must be taken. Regardless, the definition of $P^*_{\epsilon,\lambda}$ does not involve the parameter of interest. Hence, the more challenging portion of the approach is exclusively model-specific, and once it has been successfully tackled, the resulting projection can be used for any parameter a practitioner may wish to study. This contrasts sharply with the traditional approach to deriving the EIF, wherein the statistician must first derive an influence function, characterize the tangent space of the model, and finally project the influence function onto this tangent space. In that approach, both the parameter-specific task – finding an influence function – and the model-specific task – studying the tangent space and how to project onto it – require specialized knowledge. Performing these tasks for a given parameter and model combination does not necessarily provide an easy way of tackling other parameters, in contrast to the approach we propose.

3. Verification of technical conditions

The validity of representation (2.3) is guaranteed under the high-level technical conditions (A1), (A2) and (A3). We now identify lower-level sufficient conditions under which the projected perturbation path constructed in (2.4) satisfies conditions (A1), (A2) and (A3), and thus under which Theorem 1 holds.

Here and throughout, given a function h defined on $\mathcal{X}(P)$, we write $\|h\|_{k,P} := [\int |h(u)|^k\,dP(u)]^{1/k}$ and $\|h\|_{\infty,A} := \sup_{u\in A}|h(u)|$ for any set A. We denote by $S_{x,\lambda}$ the support of $H_{x,\lambda}$, and we define

$$r_k(\lambda) := \left\|\frac{dH_{x,\lambda}}{dP}\right\|_{k,P}.$$

This Radon–Nikodym derivative of $H_{x,\lambda}$ relative to P is defined for each λ > 0 since $H_{x,\lambda}$ is dominated by P by construction. Unless P assigns positive mass to {x} and $H_{x,\lambda}$ is the degenerate distribution at {x}, the value of $r_k(\lambda)$ usually tends to infinity as λ tends to zero. The rate at which this occurs will be critical in our study of the technical conditions listed in Theorem 1.

3.1. Near-solution of the EIF estimating equation

By virtue of being a global optimizer, and in view of condition (B3), $P^*_{\epsilon,\lambda}$ is expected to solve, or at least nearly solve, a rich collection of equations indexed by elements in the tangent space of $\mathcal{M}$ at $P^*_{\epsilon,\lambda}$, including that exhibited in condition (A1). The following theorem establishes a formal regularity condition under which this is indeed the case.

Theorem 2.

Suppose that there exists an interval $J_0 = [m_0,m_1]\subset(0,+\infty)$ such that, for each small λ > 0, the Radon–Nikodym derivative of $P^*_{\epsilon,\lambda}$ relative to P is uniformly contained in $J_0$ over the support of P for sufficiently small ϵ. Then, condition (A1) holds.

The condition in the above theorem is expected to hold in many situations since, for ϵ = 0 and any λ > 0, $P_{0,\lambda}=P$ is in $\mathcal{M}$ and thus $P^*_{0,\lambda}=P$ trivially. As such, under continuity of the projection operator relative to the supremum norm, for ϵ small enough, the Radon–Nikodym derivative of $P^*_{\epsilon,\lambda}$ relative to P is expected to be close to one, and therefore bounded both away from zero and above. In practice, it is often possible to either verify this condition analytically or to assess its plausibility numerically.

3.2. Continuity of the EIF

We relied on certain notions of continuity to establish the validity of representation (2.3). The theorem below highlights how the continuity requirement stated in condition (A2) can be more concretely verified.

Theorem 3.

Suppose that $\lim_{\lambda\to0}\int\phi_P(u)\,dH_{x,\lambda}(u) = \phi_P(x)$ and $\lim_{\lambda\to0}\lim_{\epsilon\to0}\int\phi^*_{\epsilon,\lambda}(u)\,dP(u)=0$. Condition (A2) holds provided either

$$\text{(a)}\ \ \lim_{\lambda\to0}\lim_{\epsilon\to0}\big\|\phi^*_{\epsilon,\lambda}-\phi_P\big\|_{\infty,S_{x,\lambda}}=0 \qquad\text{or}\qquad \text{(b)}\ \ \lim_{\lambda\to0}\lim_{\epsilon\to0} r_2(\lambda)\,\big\|\phi^*_{\epsilon,\lambda}-\phi_P\big\|_{2,P}=0.$$

The requirement that $\int\phi_P(u)\,dH_{x,\lambda}(u)$ approximates 𝜙P(x) as λ tends to zero simply stipulates that averaging 𝜙P with respect to a distribution eventually concentrating all its probability mass on {x} should approximately yield 𝜙P(x). Furthermore, this theorem requires that $\int\phi^*_{\epsilon,\lambda}(u)\,dP(u)$ tends to zero, which is reasonable under some continuity since $\phi^*_{\epsilon,\lambda}$ tends to 𝜙P and $\int\phi_P(u)\,dP(u) = 0$. Beyond this, in order for condition (A2) to hold, it suffices either for $\phi^*_{\epsilon,\lambda}$ to approximate 𝜙P in supremum norm in a neighborhood of x or in $L^2(P)$-norm at a rate faster than $r_2(\lambda)^{-1}$. These statements each hinge on a certain notion of continuity that appears needed whenever P is not finitely supported.

3.3. Preservation of the rate of convergence

The proof of representation (2.3) hinges upon a linearization of the difference between $\Psi(P^*_{\epsilon,\lambda})$ and Ψ(P). To ignore the remainder term from this linearization, we require that $R(P^*_{\epsilon,\lambda},P)/\epsilon$ be arbitrarily small for small enough λ and sufficiently smaller ϵ. The following theorem establishes that this is the case under certain regularity conditions.

Theorem 4.

Suppose that there exists an interval $J_0 = [m_0,m_1]\subset(0,+\infty)$ such that, for each small λ > 0, the Radon–Nikodym derivative of $P^*_{\epsilon,\lambda}$ relative to P is uniformly contained in $J_0$ over the support of P for sufficiently small ϵ. Suppose also that there exists some 0 < C < +∞ such that, for any $P_1\in\mathcal{M}$ with Radon–Nikodym derivative relative to P bounded above by $m_1$ over the support of P, we have that

$$|R(P_1,P)| \le C\left\|\frac{dP_1}{dP}-1\right\|_{2,P}^{2}.$$

Then, condition (A3) holds.

We have already discussed the first condition of this theorem since it is also used in Theorem 2. As for the second condition, it is often the case that the remainder term, being a second-order term arising from a linearization, can be bounded by the squared norm of the difference between the Radon–Nikodym derivative of $P^*_{\epsilon,\lambda}$ relative to P and its value at ϵ = 0. This inequality often follows quite easily from an application of the Cauchy–Schwarz inequality on the remainder term, and generally holds under rather mild conditions.

4. Practical considerations

Representation (2.3) provides the theoretical foundations for our strategy for numerically approximating the EIF and thus constructing efficient estimators. The implementation of the approach suggested by this representation nevertheless presents specific challenges. The practical guidelines provided below may facilitate the successful implementation of our proposal.

4.1. Construction of the perturbation path

In constructing the linear perturbation path that defines $P_{\epsilon,\lambda}$ and thus $P^*_{\epsilon,\lambda}$, the nearly degenerate distribution at {x} is used instead of its purely degenerate counterpart because it ensures that all distributions along the perturbation path are dominated by P. This is required to ensure the validity of the representation we have proposed. Clearly, there is no need for smoothing in the components of the data unit for which the corresponding marginal distribution implied by P is dominated by a counting measure. In fact, as we stress below, unnecessary smoothing can needlessly increase the computational burden of the approximation procedure. For components for which the corresponding marginal distribution is dominated by the Lebesgue measure, smoothing is generally needed. In practice, we suggest the use of product kernels for those components. Specifically, suppose that the data unit X is d-dimensional and can be partitioned into $X = (X_L, X_C)$, where $X_L := (X_{L1},X_{L2},\ldots,X_{Ld_1})$ and $X_C := (X_{C1},X_{C2},\ldots,X_{Cd_2})$ with $d_1+d_2 = d$, and that the marginal distributions of $X_L$ and $X_C$ under P are respectively dominated by the Lebesgue measure and a discrete counting measure. In this case, we can typically use the product kernel with density function

$$u\mapsto h_{x,\lambda}(u) := \left[\prod_{j=1}^{d_1} K_\lambda(u_{Lj}-x_{Lj})\right]\times\left[\prod_{j=1}^{d_2} I(u_{Cj}=x_{Cj})\right] \qquad (4.1)$$

relative to the sum of Lebesgue and counting measures, where u := (u_L, u_C) with u_L and u_C possible realizations of $X_L$ and $X_C$, respectively, and $K_\lambda : w\mapsto \lambda^{-1}K(\lambda^{-1}w)$ with K some symmetric and absolutely continuous univariate density function. The uniform kernel K(w) := I(−1 < 2w < +1) is particularly appealing due to its simplicity, which translates to greater practical feasibility of our numerical approximation procedure. If the uniform kernel is used, it is easy to verify that $r_k(\lambda)$ is of order $\lambda^{-d_1(1-1/k)}$ provided, for example, the sectional evaluation $u_L\mapsto p(u_L, x_C)$ of the density p of P is continuous and bounded away from zero in a neighborhood of $x_L$. As suggested by Theorem 5 provided in the next subsection, for a secant line slope approximation to the derivative in (2.3) to be accurate, ϵ must generally be chosen such that $\epsilon\ll\lambda^{d_1}$. If $d_1$ is large, this requirement may be prohibitive, possibly even to the point of requiring a value of ϵ beyond the computer’s standard level of precision and thus requiring special computational techniques. Of course, while this guideline is sufficient, it may be conservative in some applications. Later in this section, we provide a more practical means of selecting the values of ϵ and λ.
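
A direct transcription of the product kernel (4.1) with the uniform kernel is sketched below; the function and argument names are our own illustrative choices. At u = x the density equals $\lambda^{-d_1}$, the scale that drives the behavior of $r_k(\lambda)$ discussed above.

```python
import numpy as np

# Sketch of the product-kernel density (4.1) with the uniform kernel
# K(w) = I(-1 < 2w < 1) on the continuous block and point masses on the
# discrete block; names and the toy evaluation are our own choices.
def h_x_lam(u_L, u_C, x_L, x_C, lam):
    # K_lam(w) = K(w / lam) / lam = I(|w| < lam / 2) / lam, per component.
    cont = np.all(np.abs(np.asarray(u_L) - np.asarray(x_L)) < lam / 2)
    disc = np.all(np.asarray(u_C) == np.asarray(x_C))
    d1 = np.size(x_L)
    return float(cont) * float(disc) / lam ** d1

# At u = x the density equals lam^{-d_1}:
print(h_x_lam([0.5, 0.2], [1], [0.5, 0.2], [1], lam=0.1))  # 100.0
```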

As alluded to above, if we were to include smoothing over $X_C$ as well in our choice of $H_{x,\lambda}$, we would need $\epsilon\ll\lambda^{d}$. This can be much more prohibitive computationally than requiring that $\epsilon\ll\lambda^{d_1}$, particularly if $d_2$ is large. For this reason, smoothing in the construction of the linear perturbation path should be avoided for all components except those for which the corresponding marginal distribution under P is absolutely continuous. Additionally, for some parameters, smoothing can be avoided altogether for certain continuous components. As a general guideline for which supporting theory remains to be developed, we expect no smoothing to be required at all if the parameter is sufficiently smooth at P ∈ M, in the sense that $\Psi(P_m)$ tends to Ψ(P) for any sequence $\{P_m\in\mathcal{M} : m = 1,2,\ldots\}$ for which the distribution function of $P_m$ tends to that of P uniformly as m tends to infinity. Alternatively, if the MLE $P_n^*$ of P based on observations $X_1,X_2,\ldots,X_n$ from P is such that $\Psi(P_n^*)$ is a consistent estimator of Ψ(P), no smoothing will generally be required. If, however, some regularization of the MLE is needed to ensure consistency, e.g., as in van der Laan (1996), smoothing will usually be critical.

4.2. Finite-difference derivative approximations

When the projection of the linear perturbation path and the parameter of interest have an explicit form, it may be possible to compute analytically the pathwise derivative in representation (2.3). When analytic differentiation is possible, it is useful to do so, since an approximation for 𝜙P(x) is then simply given by the evaluation of the pathwise derivative at ϵ = 0 for a small λ > 0. The approximation thus involves only a single approximation parameter. Otherwise, numerical differentiation techniques must be used.

Since the derivative in representation (2.3) can be expressed as the limit of the slope of the secant line between the points $(0,\Psi(P))$ and $(\epsilon,\Psi(P^*_{\epsilon,\lambda}))$, the alternative representation

$$\phi_P(x) = \lim_{\lambda\to0}\lim_{\epsilon\to0}\frac{\Psi(P^*_{\epsilon,\lambda})-\Psi(P)}{\epsilon} \qquad (4.2)$$

is available under the conditions of Theorem 1. This representation suggests that 𝜙P(x) may be approximated by $[\Psi(P^*_{\epsilon,\lambda})-\Psi(P)]/\epsilon$ for appropriately small λ and ϵ. As indicated before, in general, the two limits in (4.2) cannot be interchanged. As a consequence, ϵ must be taken to be much smaller than λ. Practical guidelines for choosing ϵ and λ are provided in the next subsection. We note that, beyond the ability to compute the projection $P^*_{\epsilon,\lambda}$, use of (4.2) to approximate 𝜙P(x) requires no more than the ability to evaluate Ψ.

Representation (4.2) is based on a two-point forward finite-difference approximation to the derivative in (2.3). A multi-point forward finite-difference scheme could be used to produce accurate approximations for possibly less prohibitive values of ϵ and λ. For m ∈ {1,2,...}, the (m + 1)-point forward finite-difference representation of the EIF value is

$$\phi_P(x) = \lim_{\lambda\to0}\lim_{\epsilon\to0}\frac{\sum_{j=0}^{m} a_{j,m}\,\Psi(P^*_{j\epsilon,\lambda})}{\epsilon} \qquad (4.3)$$

for a specified vector $a_m := (a_{0,m},a_{1,m},\ldots,a_{m,m})$. Fornberg (1988) gives a recursive formula for $a_m$ and numerical values up to m = 8. For example, we have that $a_1 = (-1, +1)$, $a_2=(-\tfrac{3}{2},+2,-\tfrac{1}{2})$, $a_3=(-\tfrac{11}{6},+3,-\tfrac{3}{2},+\tfrac{1}{3})$ and $a_4=(-\tfrac{25}{12},+4,-3,+\tfrac{4}{3},-\tfrac{1}{4})$. Use of (4.3) as a basis for numerically approximating 𝜙P(x) indeed yields an improved approximation, as the following theorem indicates. Below, we denote by $\tilde\Psi:\mathcal{M}_{NP}\to\mathbb{R}$ the mapping $P\mapsto\Psi(P^*)$ for $P^*$ the projection of P into $\mathcal{M}(P)$, and by $\mathcal{M}_{0,\lambda}(P)$ the parametric submodel $\{P_{\epsilon,\lambda} : 0\le\epsilon\le1\}$.

Theorem 5.

Suppose that conditions (A1)–(A3) hold, that $H_{x,\lambda}$ is a second-order kernel of the form (4.1), and that 𝜙P has bounded second derivatives in a neighborhood of x. Suppose also that, for small enough λ > 0, the parameter $\tilde\Psi$ is $(m_0+1)$-times pathwise differentiable in $\mathcal{M}_{0,\lambda}(P)$, with gradient of order $m_0+1$ evaluated at $P_{\epsilon,\lambda}$ bounded uniformly for sufficiently small ϵ > 0. Then, with $s := \min(m,m_0)$, we have that

$$\frac{\sum_{j=0}^{m} a_{j,m}\,\Psi(P^*_{j\epsilon,\lambda})}{\epsilon} = \phi_P(x) + O\!\left(\epsilon^s\, r_{s+1}(\lambda)^{s+1} + \lambda^2\right).$$

As indicated before, if $H_{x,\lambda}$ is taken to be the uniform kernel, we find that $r_{s+1}(\lambda)$ is of order $\lambda^{-d_1 s/(s+1)}$, and so the approximation error is of order $(\epsilon/\lambda^{d_1})^s+\lambda^2$. This result suggests that, at the cost of computing additional projections and evaluating the parameter at each of these projections, it may indeed be possible to increase approximation accuracy using a multi-point forward finite-difference scheme. Higher-order differentiability is required to ensure that such schemes yield valid results – in particular, this requires that the projected path be sufficiently smooth. More importantly, the theorem also indicates that there is no significant theoretical penalty incurred from using a multi-point scheme with a number of points greater than the degree of differentiation, thereby supporting the use of multi-point schemes in practice. In Section 5, we show concrete benefits that may be derived from using a multi-point scheme in the context of Example 1.
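
The multi-point scheme itself is a few lines of code once a routine evaluating ϵ ↦ Ψ(P*_{ϵ,λ}) is available. The sketch below hard-codes the Fornberg (1988) weights listed above and checks the scheme on a smooth stand-in function; `psi` is a placeholder for the problem-specific projection-and-evaluation routine.

```python
import numpy as np

# Sketch: the (m+1)-point forward finite-difference scheme (4.3), given a
# user-supplied routine psi(eps) returning Psi(P*_{eps,lam}) at fixed lam.
A = {1: [-1, 1],
     2: [-3/2, 2, -1/2],
     3: [-11/6, 3, -3/2, 1/3],
     4: [-25/12, 4, -3, 4/3, -1/4]}   # forward-difference weights a_m

def eif_multipoint(psi, eps, m=2):
    a = np.array(A[m])
    vals = np.array([psi(j * eps) for j in range(m + 1)])
    return float(a @ vals / eps)

# Toy check on a smooth stand-in for eps -> Psi(P*_{eps,lam}): f'(0) = 2.
f = lambda e: np.exp(2 * e)
print(eif_multipoint(f, 1e-3, m=4))   # approx 2.0
```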

4.3. Selection of ϵ and λ values

When the pathwise derivative in (2.3) can be calculated analytically, the proposed approximation strategy only involves the smoothing parameter λ. The supporting theory suggests taking λ as small as possible. As we will illustrate in Section 5, in some cases there is little sensitivity to the choice of λ when (2.3) is used, and even a relatively large value of λ > 0 will yield stringent control of the approximation error.

Whenever the involved projection is not available in closed form or differentiation with respect to ϵ is too cumbersome to perform analytically, a multi-point finite-difference scheme may be used to numerically approximate this analytic derivative, as described in the previous subsection. In such cases, ϵ and λ must both be chosen, and more care is needed to ensure the reliability of the proposed procedure. The order of the limits in (4.2) and (4.3) suggests that we must select a small value of λ and an even smaller value of ϵ. This was made more precise in the previous subsection, where in the case of a two-point scheme, for example, it is prescribed to choose ϵ to be much smaller than $\lambda^{d_1}$, with $d_1$ the number of components of P over which smoothing is performed. While this theoretical requirement may serve as a rough guide in practice, it does not provide a concrete means of selecting values for ϵ and λ. For this purpose, it is useful to produce a matrix representing the value of the approximation as a function of ϵ and λ, both ranging over an exponential scale – for example, we could consider both ϵ and λ in the set $\{10^{-1},10^{-2},10^{-3},10^{-4},10^{-5},\ldots\}$. We refer to the resulting display as an epsilon-lambda plot. As a convention, the y-axis is used to represent ϵ values while λ values are represented on the x-axis. Our theoretical findings suggest that the right balance between ϵ and λ will be achieved in a possibly curvilinear triangular region nested in the upper left portion of the epsilon-lambda plot. In this region, the finite-difference approximation of the EIF value should be essentially constant. One practical means of selecting ϵ and λ would then consist of identifying this region visually by determining the quasi-triangular region in the upper left portion of the matrix over which the approximated EIF value is fixed up to a certain level of precision. As an illustration, without yet providing details regarding the specific parameter and model under consideration, we may scrutinize the epsilon-lambda plot arising in Example 1 from Section 5 and based upon a two-point scheme. This plot, provided as Figure 1, suggests that, up to three decimal places, the EIF value of interest is −0.963. This is indeed verified using theoretical calculations, as discussed in more detail in Section 5. The epsilon-lambda plot may therefore be a particularly useful tool for implementing the proposed approach for numerically approximating the EIF in practice.
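
A generic sketch of the underlying computation is given below; `eif_approx` is a placeholder for a problem-specific routine returning the secant-slope approximation, and the toy stand-in is ours, included only so the snippet runs.

```python
import numpy as np

# Sketch: assembling the epsilon-lambda matrix behind the epsilon-lambda
# plot. eif_approx(eps, lam) stands in for a problem-specific routine
# returning the secant-slope approximation [Psi(P*_{eps,lam}) - Psi(P)]/eps.
def epsilon_lambda_matrix(eif_approx, n_eps=6, n_lam=6):
    eps_grid = 10.0 ** -np.arange(1, n_eps + 1)   # rows: eps = 1e-1, 1e-2, ...
    lam_grid = 10.0 ** -np.arange(1, n_lam + 1)   # columns: lam = 1e-1, ...
    return np.array([[eif_approx(e, l) for l in lam_grid] for e in eps_grid])

# Toy stand-in whose exact limit is cos(0) = 1; entries stabilize where
# eps is much smaller than lam, mimicking the region described in the text.
toy = lambda e, l: (np.sin(l + e) - np.sin(l)) / e
print(np.round(epsilon_lambda_matrix(toy), 3))
```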

Figure 1: Epsilon-lambda plot of approximated values of the EIF using a secant line slope as a function of ϵ and λ in Example 1.

4.4. Numeric computation of the model space projection

In implementing our proposal, the main challenge consists of operationalizing the optimization problem that characterizes the projection of the linear perturbation path $\{P_{\epsilon,\lambda} : 0\le\epsilon\le1\}$ onto the model space $\mathcal{M}$. An analytic – or nearly analytic – form can be found for the projection in many problems, including the illustrations provided in Section 5. In other problems, the optimization problem is less analytically tractable and a numeric approach may be needed.

A general strategy for numerically approximating the required projection is to instead consider the corresponding optimization problem over $\mathcal{M}_m$, where $\mathcal{M}_1\subseteq\mathcal{M}_2\subseteq\cdots\subseteq\mathcal{M}$ is a sequence of finite-dimensional submodels of $\mathcal{M}$ such that $\overline{\bigcup_{m=1}^{\infty}\mathcal{M}_m}=\mathcal{M}$. We illustrate this in the context of families of tilted densities, though other parametrizations are possible. We note that any distribution Q dominated by P can be described as a tilted form $dQ(u) = \exp[h(u)]\,dP(u)/\int\exp[h(w)]\,dP(w)$ of P for some function h in a function class $\mathcal{H}=\mathcal{H}(\mathcal{M})$ determined by the model $\mathcal{M}$. Here, h characterizes the deviation of Q from P. It is often easier to determine suitable approximating finite-dimensional subspaces for $\mathcal{H}$ than for $\mathcal{M}$. Suppose that $\{h_1,h_2,\ldots\}\subseteq\mathcal{H}$ forms a basis for $\mathcal{H}$, and let $\mathcal{H}_m$ denote the linear span of $\{h_1,h_2,\ldots,h_m\}$. If P has density p relative to ν, the submodel $\mathcal{M}_m$ implied by $\mathcal{H}_m$ then consists of all distributions Q with density

$$\frac{dQ}{d\nu}(u) = \frac{\exp\left[\sum_{j=1}^{m} \beta_j h_j(u)\right] p(u)}{\int \exp\left[\sum_{j=1}^{m} \beta_j h_j(w)\right] p(w)\,\nu(dw)}$$

for some $\underline\beta_m := (\beta_1,\beta_2,\ldots,\beta_m)\in\mathbb{R}^m$. The choice $\underline\beta_m=(0,0,\ldots,0)$ leads to Q = P. Denote by $\underline\beta^*_{m,\epsilon,\lambda}$ the index of the projection $P^*_{m,\epsilon,\lambda}$ of $P_{\epsilon,\lambda}$ into $\mathcal{M}_m$. For small ϵ, it is expected that $P^*_{m,\epsilon,\lambda}\approx P$, which suggests that the corresponding index $\underline\beta^*_{m,\epsilon,\lambda}$ should be near zero. Thus, the search for the optimizer can be focused in a neighborhood surrounding the origin in $\mathbb{R}^m$. In our experience, this simple observation can sometimes greatly accelerate the numerical optimization routine used. In practice, a sufficiently large m must be selected to ensure that the resulting approximation of the projection is accurate enough to ensure the validity of the numerical evaluation of the EIF based on (4.3). If the KL divergence is used as statistical distance, up to an additive constant, the resulting optimization problem is to maximize the objective function

$$L(\underline\beta_m) := \sum_{j=1}^{m} \beta_j \int h_j(u)\,dP_{\epsilon,\lambda}(u) - \log \int \exp\left[\sum_{j=1}^{m} \beta_j h_j(w)\right] p(w)\,\nu(dw).$$

Since derivatives of $L(\underline\beta_m)$ are easy to write down explicitly, many algorithms are available to solve this optimization problem efficiently, including the Newton-Raphson method. We note here that in this setup m should be taken as large as computationally feasible so as to minimize the bias induced by the approximation of $\mathcal{M}$ by $\mathcal{M}_m$. This is in contrast to traditional sieve estimation, where the choice of m generally requires a careful balance whenever the unsieved maximum likelihood estimator either does not exist or is inconsistent. This difference occurs because in our scheme all distributions in the regularized model are dominated by the distribution being projected into the model.
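
As a concrete sketch of this sieve strategy, the snippet below maximizes L(β) by numerical optimization for a simple tilted family around P = N(0,1); the basis functions, perturbation, and tuning values are our own illustrative assumptions, not prescribed by the paper.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid
from scipy.optimize import minimize

# Sketch: maximizing the tilted-density KL objective L(beta) on a grid.
# The bounded tanh basis is an assumption chosen to keep exp(tilt) stable.
u = np.linspace(-8.0, 8.0, 8001)
p = stats.norm.pdf(u)                         # density of P
x, lam, eps = 1.0, 0.1, 1e-3
q = (1 - eps) * p + eps * stats.uniform(x - lam, 2 * lam).pdf(u)  # P_{eps,lam}
H = np.vstack([np.tanh(u), np.tanh(2 * u), np.tanh(3 * u)])       # h_1..h_3

def negL(beta):
    tilt = H.T @ beta                         # sum_j beta_j h_j(u)
    log_norm = np.log(trapezoid(np.exp(tilt) * p, u))
    return -(trapezoid(tilt * q, u) - log_norm)

beta_star = minimize(negL, np.zeros(3), method="BFGS").x
print(beta_star)    # small, as expected: P_{eps,lam} is a slight tilt of P
```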

It may sometimes be useful to consider a stochastic version of this deterministic optimization problem. For example, we could minimize the divergence between the approximating finite-dimensional submodel and the weighted empirical distribution based on a very large number of draws from a uniform mixture between P and $H_{x,\lambda}$, with weights 2(1 − ϵ) and 2ϵ assigned to observations from P and $H_{x,\lambda}$, respectively. We are then faced with a standard parametric estimation problem, albeit one that may be high-dimensional. When a clever parametrization of the approximating submodel is used, it may be possible to use standard statistical learning techniques, including regularization methods from the machine learning literature, with computationally efficient and stable off-the-shelf implementations. When adopting this approach, it appears critical to ensure that the size of the dataset generated is very large compared to the richness of the approximating submodel, since otherwise the variability resulting from this parametric estimation problem could limit our ability to achieve the required level of accuracy. A detailed study of numerical strategies, including the approach described in this subsection, will be the focus of future work.

4.5. Construction of an efficient estimator

As emphasized earlier, knowledge of the EIF facilitates the construction of efficient estimators in infinite-dimensional models. For example, if $\hat P_n$ is a consistent estimator of $P_0\in\mathcal{M}$ based on independent draws $X_1,X_2,\ldots,X_n$ from $P_0$, the one-step Newton-Raphson estimator, defined as

$$\psi_n^+ := \Psi(\hat P_n) + \frac{1}{n}\sum_{i=1}^{n} \phi_{\hat P_n}(X_i),$$

is an efficient estimator of $\psi_0$ under certain regularity conditions. The one-step approach appears to be the constructive method most amenable to an implementation based on numerical approximations of the EIF. Indeed, if the analytic form of the EIF is not known, it suffices to numerically approximate the value of $\phi_{\hat P_n}(X_i)$ for each i = 1,2,...,n, rather than the entire function $u\mapsto\phi_{\hat P_n}(u)$, to calculate $\psi_n^+$. Thus, the procedure we propose can be used to approximate each of these n values. Nevertheless, when the projection step required to utilize the proposed representation of the EIF is computationally burdensome and the sample size n is large, computing each of these values may be challenging. One need not obtain an approximation of each $\phi_{\hat P_n}(X_i)$ if the objective is to simply compute the one-step estimator $\psi_n^+$ – in this case it suffices to obtain an approximation of the correction term $B_n := \frac{1}{n}\sum_{i=1}^{n} \phi_{\hat P_n}(X_i)$. This simple observation is useful because a slight modification to (2.3) yields a numerical procedure for approximating the required empirical average. Specifically, the proof of Theorem 1 can be adapted to show that, under similar regularity conditions, if we define the linear perturbation $\hat P_{n,\epsilon,\lambda} := (1-\epsilon)\hat P_n + \epsilon\,\frac{1}{n}\sum_{i=1}^{n} H_{X_i,\lambda}$ between $\hat P_n$ and a uniform mixture of nearly degenerate distributions with mass concentrated around $X_1, X_2, \ldots, X_n$, then

$$B_n = \lim_{\lambda\to0}\left[\frac{d}{d\epsilon}\Psi(\hat P^*_{n,\epsilon,\lambda})\Big|_{\epsilon=0}\right]$$

with $\hat P^*_{n,\epsilon,\lambda}$ denoting the projection of $\hat P_{n,\epsilon,\lambda}$ onto the model. As such, a numerical approximation of the one-step estimator can be computed in a single numerical step as

$$\Psi(\hat P_n) + \frac{d}{d\epsilon}\Psi(\hat P^*_{n,\epsilon,\lambda})\Big|_{\epsilon=0} \approx \Psi(\hat P_n) + \frac{\sum_{j=0}^{m} a_{j,m}\,\Psi(\hat P^*_{n,j\epsilon,\lambda})}{\epsilon}$$

for appropriately selected ϵ and λ values. We note here that while one may wish to approximate $B_n$ as accurately as feasible, it suffices that $B_n$ be numerically approximated up to order $o_P(n^{-1/2})$ for the one-step estimator to preserve its desirable asymptotic properties. As such, the level of accuracy that must be enforced in practice very much depends on sample size.
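
In code, the resulting one-step recipe reduces to a single projected perturbation. In the generic sketch below, `Psi`, `project`, and `perturb` are placeholders that a user must supply for their model and parameter; nothing here is specific to a particular problem.

```python
# Sketch: one-step estimator computed via a single projected perturbation,
# as in the display above. The three callables are user-supplied:
#   perturb(P_hat, X, eps, lam): (1 - eps) * P_hat + eps * mean_i H_{X_i,lam}
#   project(Q): divergence-based projection of Q onto the model M
#   Psi(P): evaluation of the target parameter at P
def one_step(Psi, project, perturb, P_hat, X, eps=1e-6, lam=1e-2):
    P_star = project(perturb(P_hat, X, eps, lam))
    correction = (Psi(P_star) - Psi(P_hat)) / eps   # secant-slope estimate of B_n
    return Psi(P_hat) + correction
```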

5. Illustration and numerical studies

To illustrate the use of (2.3), we study four examples in which the calculation of the EIF can be difficult for non-experts, whereas the suggested approach renders the problem straightforward. Below, we describe the steps involved in using representation (2.3) to approximate the EIF in the context of each of these examples, and provide numerical illustrations of its use. In the Supplementary Material, we show how representation (2.3) can instead be used to obtain the analytic form of the EIF without specialized knowledge.

5.1. Example 1: Average value of density function with known mean

5.1.1. Background

Given a distribution P with univariate Lebesgue density p, the average density value parameter is given by

$$\Psi(P) := E_P[p(X)] = \int p(u)^2\,du.$$

Estimation and inference for the average density value has been extensively studied in the semiparametric efficiency literature (e.g., Bickel and Ritov, 1988). We use this parameter as our first illustration because it is simple to describe yet requires specialized knowledge to study using conventional techniques. Suppose that $\mathcal{M}_{NP}$ denotes the nonparametric model consisting of all univariate absolutely continuous distributions with finite-valued density. Suppose that μ is fixed and known, and denote by $\mathcal{M}\subseteq\mathcal{M}_{NP}$ the semiparametric model consisting of all distributions in $\mathcal{M}_{NP}$ with mean μ. We wish to compute the EIF of Ψ relative to $\mathcal{M}$ at a P ∈ M evaluated at an observation value x.

The EIF $\phi_{NP,P}$ of Ψ relative to the nonparametric model $\mathcal{M}_{NP}$ evaluated at P ∈ M is given by $u\mapsto\phi_{NP,P}(u) := 2[p(u)-\Psi(P)]$. It is straightforward to derive this analytic form from first principles. Observing that $\mathcal{M}=\{P\in\mathcal{M}_{NP} : \Theta(P)=0\}$, where $\Theta(P) := \int u\,dP(u)-\mu$ is a pathwise differentiable parameter with EIF relative to $\mathcal{M}_{NP}$ at P ∈ M given by $u\mapsto\varphi_P(u) := u-\mu$, Example 1 of Section 6.2 of Bickel et al. (1997) implies that the EIF of Ψ relative to $\mathcal{M}$ can be obtained as

$$u\mapsto \phi_P(u) := \phi_{NP,P}(u) - \frac{\int \phi_{NP,P}(w)\,\varphi_P(w)\,dP(w)}{\int \varphi_P(w)^2\,dP(w)}\,\varphi_P(u) = 2\left[p(u)-\Psi(P)-\frac{\int (w-\mu)\,p(w)\,dP(w)}{\int (w-\mu)^2\,dP(w)}\,(u-\mu)\right].$$

While the resulting analytic form of this EIF is relatively simple, its derivation hinges on specialized knowledge unlikely to be available to most practitioners. Use of our novel representation of the EIF provides an alternative approach that avoids the need for such knowledge, as highlighted below.

5.1.2. Implementation and results

To utilize our representation, we must understand how to project a given distribution $Q\in\mathcal{M}_{NP}$, say with Lebesgue density q, into $\mathcal{M}$, say relative to the KL divergence. Suppose that the support of Q has finite lower and upper limits a and b, respectively, satisfying a < μ < b. An application of the method of Lagrange multipliers yields that the maximizer in p of $\int\log p(u)\,dQ(u)$ over the class of all Lebesgue densities with mean μ is given by $q^*(u) := [1-\xi_0(u-\mu)]^{-1}\,q(u)$, where $\xi_0$ solves the equation $U(\xi)=0$ with

$$U(\xi) := \int\left[\frac{u}{1-\xi(u-\mu)}-\mu\right]dQ(u),$$

and lies strictly between $(a-\mu)^{-1}$ and $(b-\mu)^{-1}$.

To compute 𝜙P(x) using the approach proposed in this paper, we first construct the linear perturbation $P_{\epsilon,\lambda} := (1-\epsilon)P+\epsilon H_{x,\lambda}$, where $H_{x,\lambda}$ is an absolutely continuous distribution that concentrates its mass on shrinking neighborhoods of the set {x} as λ tends to zero. In our numerical implementation, we take $H_{x,\lambda}$ to be the uniform distribution on the interval (x−λ, x+λ). The projection of $P_{\epsilon,\lambda}$ onto $\mathcal{M}$ is then obtained as described in the preceding paragraph with $Q = P_{\epsilon,\lambda}$. As such, it has a closed-form analytic expression up to the constant $\xi_0 = \xi_0(\epsilon,\lambda)$ that can be solved for numerically. We may then approximate 𝜙P(x) by the secant line slope or using an improved multi-point finite-difference scheme, as suggested above. A minimal implementation of these steps is sketched below.
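
In this sketch, P is taken to be the Beta(3,5) distribution (so μ = 3/8) and x = 0.6, anticipating the numerical study described next; the quadrature grid, root-finding bracket, and tuning values are our own choices.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid
from scipy.optimize import brentq

# Sketch of Example 1: P = Beta(3, 5) with known mean mu = 3/8, target
# Psi(P) = int p(u)^2 du, and H_{x,lam} = Uniform(x - lam, x + lam), x = 0.6.
mu, x = 3 / 8, 0.6
u = np.linspace(0.0, 1.0, 200001)
p = stats.beta(3, 5).pdf(u)

def Psi_projected(eps, lam):
    # Density of the linear perturbation P_{eps,lam}.
    q = (1 - eps) * p + eps * stats.uniform(x - lam, 2 * lam).pdf(u)
    # KL projection onto the mean-mu model: q*(u) = q(u) / [1 - xi (u - mu)],
    # with xi solving the mean constraint U(xi) = 0.
    U = lambda xi: trapezoid((u / (1 - xi * (u - mu)) - mu) * q, u)
    xi = brentq(U, -1.0, 1.0)
    q_star = q / (1 - xi * (u - mu))
    return trapezoid(q_star ** 2, u)

eps, lam = 1e-6, 1e-2
slope = (Psi_projected(eps, lam) - Psi_projected(0.0, lam)) / eps
print(slope)   # approx -0.963, matching the known EIF value at x = 0.6
```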

We evaluated this procedure numerically for a particular distribution P and observation value x. Specifically, we took P to be the Beta distribution with parameters α = 3 and β = 5, and evaluated our numerical procedure for approximating the true value of 𝜙P(0.6) ≈ −0.963. Figure 2 provides the percent error of our numerical approximation based upon a secant line slope for various combinations of values for ϵ and λ. This approximation is inaccurate if either ϵ is not small enough or λ is too small relative to ϵ. For small λ and much smaller ϵ, the secant line slope approximates the true value of 𝜙P(x) with high accuracy, as depicted by the dark green squares in Figure 2. This plot confirms what theory suggests regarding the choice of ϵ and λ. It also reaffirms the usefulness of the epsilon-lambda plot for selecting appropriate values of ϵ and λ. Figure 3 compares the accuracy of multi-point finite-difference schemes using 2, 3, 4 or 5 points over the same range of values for ϵ and λ. As more points are used in the approximation scheme, the high-accuracy region grows substantially, and a good approximation can be obtained with much less stringent choices of ϵ and λ. This plot confirms the significant practical benefits that can be derived from using multi-point schemes.

Figure 2: Absolute % error in the approximation of the EIF value using a secant line slope as a function of ϵ and λ in Example 1.

Figure 3: Absolute % error in the approximation of the EIF value using forward multi-point schemes with different numbers of points (top left: 2, top right: 3, bottom left: 4, bottom right: 5) as a function of ϵ and λ in Example 1. Color coding of entries is identical to that in Figure 2.

5.2. Example 2: Center of a symmetric distribution

5.2.1. Background

In this example, the statistical problem is to efficiently estimate the location parameter indexing a univariate distribution in a semiparametric symmetry model. Specifically, we consider the model consisting of each distribution P with symmetric density p relative to a fixed dominating measure ν. Denoting by $P_{\mu,f}$ the distribution with density $u\mapsto p(u) = f(u-\mu)$ relative to ν, the model can be written as $\mathcal{M}=\{P_{\mu,f} : \mu\in\mathbb{R}, f\in\mathcal{F}_0(\nu)\}$ for $\mathcal{F}_0(\nu)$ the class of all densities relative to ν symmetric about zero. This semiparametric model is a subset of the nonparametric model $\mathcal{M}_{NP}$ of all distributions dominated by ν. If the center of symmetry is of scientific interest, the parameter considered is $\Psi(P_{\mu,f}) = \mu$. We wish to compute the EIF of Ψ relative to $\mathcal{M}$ at a distribution $P = P_{\mu,f}$ with $\mu\in\mathbb{R}$ and $f\in\mathcal{F}_0(\nu)$, evaluated at an observation value x.

As one of the first problems considered in the development of efficiency theory for infinite-dimensional models – see Stein (1956) – this problem has a rich history in statistics. The EIF of Ψ relative to M at P = Pμ,f has the simple form

$$u\mapsto \phi_P(u) := -\frac{1}{I(f)}\,\frac{\dot f(u-\mu)}{f(u-\mu)},$$

where $\dot f$ is the first derivative of f and $I(f) := \int[\dot f(w)/f(w)]^2 f(w)\,\nu(dw)$. This EIF can be derived in many ways, though in this case the efficient score approach for classical semiparametric models is particularly expedient. Tsiatis (2007) provides an accessible reference for this approach. We show below how the use of the representation in this paper negates the need for this specialized knowledge in EIF calculation.

5.2.2. Implementation and results

As before, we must determine the projection $Q^*$ of an arbitrary distribution $Q\in\mathcal{M}_{NP}$, say with density q, into $\mathcal{M}$. Because $Q^*\in\mathcal{M}$, there exist $\mu^*\in\mathbb{R}$ and $f^*\in\mathcal{F}_0(\nu)$ such that $Q^*=P_{\mu^*,f^*}$. If the KL divergence is used, we have that

$$\mu^* = \operatorname*{argmax}_{v\in\mathbb{R}}\ \sup_{r\in\mathcal{F}_0(\nu)} \int \log r(u-v)\,q(u)\,\nu(du).$$

Using Jensen’s inequality, it is not difficult to show that, for a fixed candidate center of symmetry v, the maximizer $\hat r_v$ of $\int\log r(u-v)\,q(u)\,\nu(du)$ over $\mathcal{F}_0(\nu)$ is given explicitly as the symmetrization of q,

$$u\mapsto \hat r_v(u) := \frac{q(v+u)+q(v-u)}{2}.$$

It follows then that $\mu^*$ is the maximizer of $v\mapsto A_{KL}(v) := \int\log \hat r_v(u-v)\,q(u)\,\nu(du)$ over $\mathbb{R}$, and $f^*$ is given by $\hat r_{\mu^*}$. If instead the Hellinger distance is used, it is not difficult to show that

$$\mu^* = \operatorname*{argmax}_{v\in\mathbb{R}}\ \sup_{r\in\mathcal{F}_0(\nu)} \int \left[r(u-v)\,q(u)\right]^{1/2}\nu(du),$$

and for a fixed center of symmetry v, the maximizer $\hat r_v$ of $\int[r(u-v)\,q(u)]^{1/2}\,du$ over $\mathcal{F}_0(\nu)$ is given by

$$u\mapsto \hat r_v(u) := \frac{1}{B(v;q)}\left[\frac{\sqrt{q(v+u)}+\sqrt{q(v-u)}}{2}\right]^2,$$

where $B(v;q) := \tfrac12\left[1+\int\sqrt{q(u)\,q(2v-u)}\,du\right]$. Then, $\mu^*$ is the maximizer of $v\mapsto\int[\hat r_v(u-v)\,q(u)]^{1/2}\,du$, or equivalently, of $v\mapsto A_{H}(v) := \int\sqrt{q(u)\,q(2v-u)}\,du$, and as before, $f^*$ is given by $\hat r_{\mu^*}$. Thus, once again, the projection $Q^*$ has an explicit form up to an implicitly defined constant. Since $\Psi(Q^*) = \mu^*$, the parameter evaluation at the projection of Q simply equals the computed maximizer of $A_{KL}$ or $A_{H}$.

To compute 𝜙P(x) for $P = P_{\mu,f}$, we again begin by constructing the linear perturbation $P_{\epsilon,\lambda} := (1-\epsilon)P+\epsilon H_{x,\lambda}$, where $H_{x,\lambda}$ denotes an absolutely continuous distribution that concentrates its mass on shrinking neighborhoods of the set {x} as λ tends to zero. We then compute the maximizer $\mu^*_{\epsilon,\lambda}$ of $A_{KL}$ or $A_{H}$ with $Q = P_{\epsilon,\lambda}$, and approximate 𝜙P(x) using, for example, the secant line slope $(\mu^*_{\epsilon,\lambda}-\mu)/\epsilon$. A minimal numerical sketch of the KL-based version follows.
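
The following sketch implements the KL-based version for the standard normal example considered next. For numerical smoothness we take $H_{x,\lambda}$ to be the Gaussian distribution N(x, λ²) rather than a uniform kernel – an assumption of ours – and all tuning values are illustrative.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid
from scipy.optimize import minimize_scalar

# Sketch of Example 2: P = N(0,1), x = 1, KL projection onto the symmetric-
# density model via the symmetrization above.
x, lam, eps = 1.0, 0.05, 1e-4
u = np.linspace(-10, 10, 200001)

def q(t):  # density of P_{eps,lam} = (1 - eps) P + eps H_{x,lam}
    return (1 - eps) * stats.norm.pdf(t) + eps * stats.norm(x, lam).pdf(t)

def A_KL(v):
    # A_KL(v) = int log r_hat_v(u - v) q(u) du, where the symmetrization of
    # q about v satisfies r_hat_v(u - v) = [q(u) + q(2v - u)] / 2.
    return trapezoid(np.log((q(u) + q(2 * v - u)) / 2) * q(u), u)

res = minimize_scalar(lambda v: -A_KL(v), bounds=(-0.01, 0.01),
                      method="bounded", options={"xatol": 1e-10})
print(res.x / eps)   # secant slope (mu*_{eps,lam} - 0)/eps, approx phi_P(1) = 1
```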

We evaluated this EIF approximation scheme by implementing the numerical algorithm above to obtain the value of $\phi_P(x)$ for $x = 1$ and $P$ taken to be the normal distribution with mean zero and unit variance. Using the known analytic form of $\phi_P$ provided above, we find that $\phi_P(x) = x$ in this case. This is expected, since this is the influence function of the sample mean, which is an efficient estimator of the center of symmetry in a normal location family. The value we seek to approximate is therefore $\phi_P(1) = 1$. The relative error of the secant line slope approximation based on the KL divergence is displayed in Figure 4 for different combinations of $\epsilon$ and $\lambda$. The pattern observed is similar to that in Example 1, with high accuracy obtained for small $\lambda$ and even smaller $\epsilon$. The implementation using the Hellinger distance led to very similar results; in fact, the resulting relative error plot was indistinguishable from the plot in Figure 4.

Figure 4:

Absolute % error in the approximation of the EIF value using a secant line slope as a function of ϵ and λ in Example 2, based on the Kullback-Leibler divergence. The corresponding plot based on the Hellinger distance is identical and is not shown.

5.3. Example 3: Coefficient in semiparametric Poisson regression model

5.3.1. Background

We now consider the problem of efficient estimation of the regression coefficient in a partially linear multiplicative Poisson regression model. Suppose that the data unit is $X := (Y, Z, W) \sim P \in \mathcal{M}$, where $Y$ is a count outcome, $Z$ is a univariate exposure, and $W$ is a vector of possible confounders. Let $\mathcal{G}$ be the space of univariate real-valued functions, and let $\mathcal{F}$ be the space of all bivariate distribution functions. We consider the model $\mathcal{M} := \{P_{\beta,g,F} : \beta \in \mathbb{R},\, g \in \mathcal{G},\, F \in \mathcal{F}\}$, where $P_{\beta,g,F}$ denotes the joint distribution implying

$$P_{\beta,g,F}(Y = y \,|\, Z = z, W = w) = \frac{e^{-\mu_{\beta,g}(z,w)}\,\mu_{\beta,g}(z,w)^y}{y!}, \qquad y = 0, 1, 2, \ldots,$$

with $\log \mu_{\beta,g}(z,w) = \beta z + g(w)$, and $P_{\beta,g,F}(Z \le z, W \le w) = F(z,w)$ for each $z$ and $w$. If $Z$ is the exposure of interest, the coefficient $\beta$ serves as a parsimonious summary of the relationship between $Y$ and $Z$ adjusting for $W$. As such, we consider the parameter $\Psi(P_{\beta,g,F}) := \beta$. We wish to compute the EIF of $\Psi$ relative to $\mathcal{M}$ at a distribution $P = P_{\beta,g,F}$ evaluated at an observation value $x := (y, z, w)$.

Again, efficiency calculations in this case can be performed most expediently using techniques for classical semiparametric models. The EIF can be expressed as an appropriate renormalization of the efficient score for $\beta$, defined as the projection of the score for $\beta$ onto the orthogonal complement of the nuisance tangent space. Using this technique, the EIF of $\Psi$ relative to $\mathcal{M}$ at $P = P_{\beta,g,F}$ is found to be

$$x \mapsto \phi_P(x) := \frac{[z - a_P(w)]\,[y - \mu_{\beta,g}(z,w)]}{\mathrm{var}_P\{[Z - a_P(W)]\,[Y - \mu_{\beta,g}(Z,W)]\}},$$

where $a_P(w) := E_P(ZY \,|\, W = w)/E_P(Y \,|\, W = w)$ for each $w$. Below, we discuss the use of our proposed numerical approach for computing this EIF without any recourse to semiparametric efficiency theory.

5.3.2. Implementation and results

Let $Q \in \mathcal{M}_{NP}$ be an arbitrary distribution. The projection $Q^*$ of $Q$ into $\mathcal{M}$ can be written as $Q^* = P_{\beta^*, g^*, F^*}$ for some $\beta^* \in \mathbb{R}$, $g^* \in \mathcal{G}$ and $F^* \in \mathcal{F}$. Here, we focus on the KL divergence. Write
$$L(\beta, g) := \iint \sum_y \log p_{\beta,g}(y \,|\, z, w)\; q(y \,|\, z, w)\; F_Q(dz, dw),$$
where $p_{\beta,g}(y \,|\, z, w)$ denotes the conditional probability mass function of $Y$ given $Z = z$ and $W = w$ evaluated at $y$ and implied by the parameter choices $(\beta, g)$, $q(y \,|\, z, w)$ denotes the corresponding conditional probability mass function implied by $Q$, and $F_Q$ denotes the distribution of $(Z, W)$ implied by $Q$. We then have that $\beta^* = \operatorname{argmax}_\beta \sup_{g \in \mathcal{G}} L(\beta, g)$. For fixed $\beta$, the maximizer $\hat g_\beta$ of $L(\beta, g)$ is given by $w \mapsto \hat g_\beta(w) := \log E_Q(Y \,|\, W = w) - \log E_Q(e^{\beta Z} \,|\, W = w)$. Under conditions allowing the interchange of integrals and derivatives, the maximizer $\beta^*$ of $L_0(\beta) := L(\beta, \hat g_\beta)$ can be shown to also be the solution of the profile score equation $U_0(\beta) = 0$, where

$$U_0(\beta) := E_Q\!\left[\frac{E_Q(ZY \,|\, W)}{E_Q(Y \,|\, W)} - \frac{E_Q(Z e^{\beta Z} \,|\, W)}{E_Q(e^{\beta Z} \,|\, W)}\right].$$

It then follows that $g^* = \hat g_{\beta^*}$. Since $\mathcal{M}$ does not restrict the distribution of $(Z, W)$, we have that $F^* = F_Q$.
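To illustrate this step computationally, the profile score equation can be handed to a generic one-dimensional root-finder once the conditional expectations appearing in $U_0$ are available. The minimal sketch below assumes these are supplied as user-defined callables (the names and interfaces are hypothetical), and approximates the outer expectation by an average over draws of $W$ under $Q$.

```python
from scipy.optimize import brentq

def profile_score_root(cond_ZY, cond_Y, cond_ZexpbZ, cond_expbZ, w_draws,
                       lo=-5.0, hi=5.0):
    """Solve U_0(beta) = 0 by bracketed root finding.

    cond_ZY(w), cond_Y(w), cond_ZexpbZ(beta, w) and cond_expbZ(beta, w) are
    assumed to return E_Q(ZY | W = w), E_Q(Y | W = w),
    E_Q(Z e^{beta Z} | W = w) and E_Q(e^{beta Z} | W = w), respectively;
    w_draws are draws of W under Q used to approximate the outer expectation.
    The bracket [lo, hi] must contain a sign change of U_0."""
    def U0(beta):
        vals = [cond_ZY(w) / cond_Y(w)
                - cond_ZexpbZ(beta, w) / cond_expbZ(beta, w)
                for w in w_draws]
        return sum(vals) / len(vals)
    return brentq(U0, lo, hi)
```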

To compute $\phi_P(x)$ for $P = P_{\beta,g,F}$, we begin by constructing the linear perturbation $P_{\epsilon,\lambda} := (1-\epsilon)P + \epsilon H_{x,\lambda}$. If $Z$ is discrete-valued under $P$, $H_{x,\lambda}$ is a joint distribution for $X$ that concentrates its mass on shrinking neighborhoods of the set $\{x\}$ as $\lambda$ tends to zero and such that the event $\{Y = y, Z = z\}$ has probability one under $H_{x,\lambda}$ for every $\lambda > 0$. If $Z$ is continuous-valued, the latter requirement can instead be taken to be that $\{Y = y\}$ has probability one under $H_{x,\lambda}$. We then compute $\beta^*_{\epsilon,\lambda}$ as the solution of $U_0(\beta) = 0$ with $Q = P_{\epsilon,\lambda}$, and approximate the EIF value $\phi_P(x)$ by the secant line slope $(\beta^*_{\epsilon,\lambda} - \beta)/\epsilon$. We evaluated our proposed numerical scheme for approximating the value of $\phi_P(x)$ for $x = (1, 0, 0.5)$ and a distribution $P$ under which: conditional on $(Z, W) = (z, w)$, $Y$ has a Poisson distribution with mean $\exp\{\log(2)z + w\}$; conditional on $W = w$, $Z$ is a Bernoulli random variable with success probability $\mathrm{expit}(1 - 2w)$; and $W$ has a uniform distribution on the interval $(0, 1)$. The true value of the EIF can be calculated to be $\phi_P(1, 0, 0.5) \approx 0.793$. Figure 5 displays the relative error of our proposed secant line approximation for different combinations of $\epsilon$ and $\lambda$, and confirms the patterns observed in previous examples.
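As a check on this target value, the analytic form of $\phi_P$ displayed earlier can be evaluated directly under this data-generating mechanism. Because the EIF has mean zero, the variance in its denominator equals $E_P\{[Z - a_P(W)]^2\, \mu_{\beta,g}(Z,W)\}$, reducing the calculation to a one-dimensional integral over $W$; the following minimal sketch carries this out.

```python
import numpy as np
from scipy.integrate import quad

expit = lambda t: 1.0 / (1.0 + np.exp(-t))
mu = lambda z, w: np.exp(np.log(2.0) * z + w)   # E_P(Y | Z = z, W = w)
pi = lambda w: expit(1.0 - 2.0 * w)             # P(Z = 1 | W = w)

# a_P(w) = E_P(ZY | W = w) / E_P(Y | W = w)
a = lambda w: pi(w) * mu(1, w) / (pi(w) * mu(1, w) + (1 - pi(w)) * mu(0, w))

# Denominator: E_P{[Z - a_P(W)]^2 mu(Z, W)}, using var_P(Y | Z, W) = mu(Z, W).
integrand = lambda w: (pi(w) * (1 - a(w)) ** 2 * mu(1, w)
                       + (1 - pi(w)) * a(w) ** 2 * mu(0, w))
denom, _ = quad(integrand, 0.0, 1.0)

y, z, w = 1, 0, 0.5
print((z - a(w)) * (y - mu(z, w)) / denom)      # approximately 0.793
```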

Figure 5:

Absolute % error in the approximation of the EIF value using a secant line slope as a function of ϵ and λ in Example 3

5.4. Example 4: G-computation parameter under Markov structure

5.4.1. Background

We finally consider efficient estimation of a more complex parameter arising in the causal inference literature. Suppose that the data unit consists of the longitudinal observation $X := (L_0, A_0, \ldots, L_K, A_K, L_{K+1}) \sim P$, where $L_0, L_1, \ldots, L_K$ is a sequence of measurements collected at $K+1$ distinct instances through time, $L_{K+1}$ is the outcome of interest, and $A_0, A_1, \ldots, A_K$ are intervention indicators corresponding to each pre-outcome timepoint. For simplicity, we take all treatment indicators to be binary. Let $\mathcal{M}_{NP}$ be a nonparametric model for $P$. We may be interested in the covariate-adjusted, treatment-specific mean $\Psi(P)$ corresponding to the intervention $(A_0, A_1, \ldots, A_K) = (1, 1, \ldots, 1)$. Mathematically, $\Psi(P)$ is given explicitly by $E_P[m_{0,P}(L_0)]$ via the G-computation recursion

$$m_{j,P}(\bar l_j) := E_P\!\left[m_{j+1,P}(\bar L_{j+1}) \,\middle|\, \bar L_j = \bar l_j,\ A_j = A_{j-1} = \cdots = A_0 = 1\right]$$

for $j = K, K-1, \ldots, 0$, where we have set $m_{K+1,P}(\bar L_{K+1}) := L_{K+1}$. Here, for any vector $u := (u_0, u_1, \ldots)$ we write $\bar u_k := (u_0, u_1, \ldots, u_k)$. This parameter depends on $P$ only through the conditional distribution $P_{j+1,1}$ of $\bar L_{j+1}$ given $\bar L_j$ and $A_0 = A_1 = \cdots = A_j = 1$ for $j = 0, 1, \ldots, K$, and the marginal distribution $P_{0,1}$ of $L_0$. Under certain untestable causal assumptions, $\Psi(P)$ corresponds to the mean of the counterfactual outcome $Y$ defined by an intervention setting all treatment nodes to one (Robins, 1986). With respect to $\mathcal{M}_{NP}$, or any model with restrictions only on the conditional distribution of $A_j$ given $\bar A_{j-1}$ and $\bar L_j$, possibly for any $j \in \{0, 1, \ldots, K\}$, the EIF of $\Psi$ at $P$ is known to be given by $\phi_{NP,P} := \sum_{j=0}^{K+1} \phi_{j,NP,P}$, where $\phi_{0,NP,P}(x) := m_{0,P}(l_0) - \Psi(P)$ and

$$\phi_{j,NP,P}(x) := \frac{a_0 a_1 \cdots a_{j-1}}{\prod_{r=0}^{j-1} P(A_r = 1 \,|\, \bar L_r = \bar l_r,\ A_0 = A_1 = \cdots = A_{r-1} = 1)}\,\left\{m_{j,P}(\bar l_j) - m_{j-1,P}(\bar l_{j-1})\right\}$$

for j = 1,2,...,K + 1.
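For concreteness, the G-computation recursion defining $\Psi(P)$ is straightforward to implement once the relevant conditional distributions are available. The minimal sketch below evaluates $\Psi(P)$ for a hypothetical discrete example with $K = 1$ and binary variables; all numerical values are illustrative assumptions.

```python
# Hypothetical discrete example with K = 1: X = (L0, A0, L1, A1, L2).
p_L0 = {0: 0.5, 1: 0.5}                                    # marginal of L0
p_L1 = lambda l0: {0: 0.6 - 0.2 * l0, 1: 0.4 + 0.2 * l0}   # L1 | L0 = l0, A0 = 1

# m1(l0, l1) = E[L2 | L0 = l0, L1 = l1, A1 = A0 = 1]  (i.e., m_{1,P})
m1 = lambda l0, l1: 0.2 + 0.3 * l0 + 0.4 * l1

# Backward recursion: m0(l0) = E[m1(L0, L1) | L0 = l0, A0 = 1]  (i.e., m_{0,P})
m0 = lambda l0: sum(m1(l0, l1) * p_L1(l0)[l1] for l1 in (0, 1))

# Psi(P) = E[m0(L0)]
psi = sum(m0(l0) * p_L0[l0] for l0 in (0, 1))
print(psi)   # 0.55 for these illustrative inputs
```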

We consider instead the model $\mathcal{M}$ consisting of the subset of distributions in $\mathcal{M}_{NP}$ under which, for each $j = 2, 3, \ldots, K+1$, $L_j$ and $\bar L_{j-2}$ are independent given $L_{j-1}$ and $A_{j-1} = A_{j-2} = \cdots = A_0 = 1$. If $P \in \mathcal{M}$, we note that $m_{j,P}(\bar l_j) = m_{j,P}(l_j)$ for each $j$. The EIF of $\Psi$ at $P$ relative to $\mathcal{M}$ is given by $\phi_P := \sum_{j=0}^{K+1} \phi_{j,P}$, where $\phi_{j,P}$ is defined pointwise as

$$x \mapsto \phi_{j,P}(x) := E_P\!\left[\phi_{j,NP,P}(X) \,\middle|\, L_j = l_j,\, L_{j-1} = l_{j-1},\, \bar A_{j-1} = \bar a_{j-1}\right] - E_P\!\left[\phi_{j,NP,P}(X) \,\middle|\, L_{j-1} = l_{j-1},\, \bar A_{j-1} = \bar a_{j-1}\right] = a_0 a_1 \cdots a_{j-1}\, T_j(P)(x)\,\left\{m_{j,P}(l_j) - m_{j-1,P}(l_{j-1})\right\}$$

for $j = 1, 2, \ldots, K+1$, where we use $T_j(P)(x)$ to denote

$$E_P\!\left[\frac{1}{\prod_{r=0}^{j-1} P(A_r = 1 \,|\, \bar L_r,\ A_0 = A_1 = \cdots = A_{r-1} = 1)} \,\middle|\, L_j = l_j,\, L_{j-1} = l_{j-1},\, \bar A_{j-1} = \bar a_{j-1}\right].$$

Deriving this expression requires specialized knowledge of and familiarity with efficiency theory for longitudinal data structures. Furthermore, even given this analytic expression, the EIF may often be difficult to compute, since it involves rather elaborate conditional expectations.

5.4.2. Implementation and results

Again, we must understand how to project a given distribution $Q \in \mathcal{M}_{NP}$ into $\mathcal{M}$, say according to the KL divergence. Given a dominating measure $\nu$, we denote the density function of $Q$ with respect to $\nu$ by $q$. Furthermore, we denote by $q_{L_j}$ the density of the conditional distribution of $L_j$ given $\bar L_{j-1}$ and $\bar A_{j-1}$, and by $q_{A_j}$ the density of the conditional distribution of $A_j$ given $\bar L_j$ and $\bar A_{j-1}$. We also denote by $q_{L_j,1}$ the density $q_{L_j}$ with $\bar a_{j-1} = (1, 1, \ldots, 1)$. We use the same notational convention for any other candidate density $r$. Because for any candidate $r$ we can write

$$\int \log r(u)\, dQ(u) = \sum_{j=0}^{K+1} \int \log r_{L_j}(l_j \,|\, \bar l_{j-1}, \bar a_{j-1})\, dQ(u) + \sum_{j=0}^{K} \int \log r_{A_j}(a_j \,|\, \bar l_j, \bar a_{j-1})\, dQ(u)$$

and $\mathcal{M}$ can be written as a product model for the set of conditional distributions implied by the joint distribution, the required optimization can be performed separately for each conditional density. Because computing $\Psi(Q^*)$ does not require any component of $Q^*$ beyond $q^*_{L_j,1}$ for $j = 0, 1, \ldots, K+1$, we focus our attention on the corresponding optimization problems alone. Below, we denote by $\bar q(l_{j-1}, l_j)$ the marginalized density $\int q(l_0, 1, l_1, 1, \ldots, l_{j-1}, 1, l_j)\, \nu(dl_0, dl_1, \ldots, dl_{j-2})$. To find $q^*_{L_j,1}$ for $j = 2, 3, \ldots, K+1$, we must maximize the criterion

$$\mathcal{L}_j(r_{L_j,1}) := \int \log r_{L_j,1}(l_j \,|\, l_{j-1})\; q(l_0, 1, l_1, 1, \ldots, l_{j-1}, 1, l_j)\; \nu(dl_0, dl_1, \ldots, dl_j) = \int \log r_{L_j,1}(l_j \,|\, l_{j-1})\; \bar q(l_{j-1}, l_j)\; \nu(dl_{j-1}, dl_j) = \int\!\left[\int \log r_{L_j,1}(l_j \,|\, l_{j-1})\, \frac{\bar q(l_{j-1}, l_j)}{\int \bar q(l_{j-1}, l_j')\,\nu(dl_j')}\,\nu(dl_j)\right]\!\left[\int \bar q(l_{j-1}, l_j')\,\nu(dl_j')\right] \nu(dl_{j-1})$$

over the class of candidate conditional densities that do not depend on $\bar l_{j-2}$, here represented by $r_{L_j,1}$. Since for each fixed $l_{j-1}$ the mapping $l_j \mapsto \bar q(l_{j-1}, l_j)/\int \bar q(l_{j-1}, l_j')\,\nu(dl_j')$ defines a proper conditional density, it follows by Jensen's inequality that $\mathcal{L}_j(r_{L_j,1})$ is maximized by

$$q^*_{L_j,1}(l_j \,|\, l_{j-1}) = \frac{\bar q(l_{j-1}, l_j)}{\int \bar q(l_{j-1}, l_j')\, \nu(dl_j')}.$$

It is easy to see that $\mathcal{M}$ constrains neither $r_{L_1,1}$ nor $r_{L_0}$, and therefore $q^*_{L_1,1} = q_{L_1,1}$ and $q^*_{L_0} = q_{L_0}$. Thus, in the context of a longitudinal data structure, the projection of any given distribution $Q$ into a model constrained only by a Markov structure has an explicit analytic form.
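For discrete $L_j$, this projection amounts to marginalizing a joint probability table and renormalizing. A minimal sketch, with a randomly generated table standing in for $q$, follows.

```python
import numpy as np

def project_markov(q_joint):
    """Given the joint probabilities q(l_0, 1, l_1, 1, ..., l_{j-1}, 1, l_j)
    as an array with one axis per L_0, ..., L_j, return the table
    q*_{L_j,1}(l_j | l_{j-1}) obtained by marginalizing out L_0, ..., L_{j-2}
    and normalizing over l_j."""
    q_bar = q_joint.sum(axis=tuple(range(q_joint.ndim - 2)))  # axes (l_{j-1}, l_j)
    return q_bar / q_bar.sum(axis=1, keepdims=True)

# Toy usage with binary L_0, L_1, L_2 (j = 2) and a hypothetical joint table.
rng = np.random.default_rng(0)
q_joint = rng.random((2, 2, 2))
q_joint /= q_joint.sum()
print(project_markov(q_joint))   # 2 x 2 table of q*_{L_2,1}(l_2 | l_1)
```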

To compute $\phi_P(x)$ numerically, we first construct the linear perturbation $P_{\epsilon,\lambda} := (1-\epsilon)P + \epsilon H_{x,\lambda}$, where $H_{x,\lambda}$ is a distribution dominated by $P$ that concentrates its mass on shrinking neighborhoods of the set $\{x\}$ as $\lambda$ tends to zero. The projection $P^*_{\epsilon,\lambda}$ onto $\mathcal{M}$ is obtained by setting $Q = P_{\epsilon,\lambda}$ in the preceding paragraph. As in previous examples, we may approximate $\phi_P(x)$ by the secant line slope for small $\lambda$ and even smaller $\epsilon$. Because in this example $P^*_{\epsilon,\lambda}$ is available in closed form, $\phi_P(x)$ can alternatively be approximated by evaluating the derivative $\frac{d}{d\epsilon}\Psi(P^*_{\epsilon,\lambda})$ analytically for small $\lambda$.

For simplicity, in our numerical evaluation of the EIF, we restricted our attention to a setting with two post-baseline time points. We considered the joint distribution $P$ of $X$ defined in terms of the following conditional distributions. The baseline covariate $L_0$ has a discrete uniform distribution on the set $\{0, 1, 2, 3, 4\}$. Given $L_0 = l_0$, $A_0$ has a Bernoulli distribution with success probability $\mathrm{expit}(-1 + 0.5\,l_0)$. Given $A_0 = a_0$ and $L_0 = l_0$, $L_1$ has a normal distribution with mean $3l_0 - 3a_0$ and variance 4. Given $L_1 = l_1$, $A_0 = a_0$ and $L_0 = l_0$, $A_1$ has a Bernoulli distribution with success probability $\mathrm{expit}\{-5 + c_{10}(l_1) + a_0 + 0.5\,l_0\}$, where we define $c_{10}$ to be the trimming function $u \mapsto -10 \cdot I_{(-\infty,-10)}(u) + u \cdot I_{[-10,+10]}(u) + 10 \cdot I_{(+10,+\infty)}(u)$. Given $A_1 = a_1$, $L_1 = l_1$, $A_0 = a_0$ and $L_0 = l_0$, the outcome $Y$ has a Bernoulli distribution with success probability $\mathrm{expit}\{-1 + 0.5\,c_{10}(l_1) - 0.5\,a_1 a_0\}$. We evaluated approximations of $\phi_P(x)$ for $x = (0, 1, 2, 1, 1)$ based on either the secant line slope or the analytic pathwise derivative. We report the relative error of the approximation obtained using the secant line slope in Figure 6 and using the analytic derivative in Figure 7. Once again, the pattern observed in Figure 6 is similar to that seen in other examples. From Figure 7, we note that a high level of accuracy is achieved even with a relatively large $\lambda > 0$; thus, use of the analytic derivative essentially eliminates the need for the careful selection of approximation parameters otherwise required. These implementations were repeated at additional values of $x$; because very similar patterns emerged, those results are not reported here.
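For reference, the data-generating mechanism just described is easy to simulate directly; the sketch below draws observations under $P$ (the seed and interface are incidental choices).

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda t: 1 / (1 + np.exp(-t))
c10 = lambda u: np.clip(u, -10, 10)   # trimming function defined above

def draw(n):
    """Draw n copies of X = (L0, A0, L1, A1, Y) under the distribution P
    described in the text."""
    L0 = rng.integers(0, 5, size=n)                       # uniform on {0,...,4}
    A0 = rng.binomial(1, expit(-1 + 0.5 * L0))
    L1 = rng.normal(3 * L0 - 3 * A0, 2.0)                 # sd 2, i.e. variance 4
    A1 = rng.binomial(1, expit(-5 + c10(L1) + A0 + 0.5 * L0))
    Y = rng.binomial(1, expit(-1 + 0.5 * c10(L1) - 0.5 * A1 * A0))
    return np.column_stack([L0, A0, L1, A1, Y])

print(draw(5))
```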

Figure 6:

Absolute % error in the approximation of the EIF value using a secant line slope as a function of ϵ and λ in Example 4

Figure 7:

Absolute % error in the approximation of the EIF value using analytic differentiation as a function of λ in Example 4

6. Concluding remarks

The representation of the EIF we have presented in this paper provides a novel approach for analytically deriving the EIF and suggests natural strategies for approximating it numerically. This representation holds in arbitrary models under mild regularity conditions. Use of the representation requires the ability to project a given distribution into the statistical model; if the KL divergence is used, this is essentially no more than a maximum likelihood step that can be tackled by most practitioners. Most importantly, the work involved requires neither knowledge of efficiency theory nor familiarity with concepts from functional analysis or differential geometry. As such, this representation has the potential to democratize the calculation of the EIF, and thus the construction of efficient estimators, in nonparametric and semiparametric models. Even for seasoned researchers in semiparametric and nonparametric theory, it provides an alternative means of tackling difficult problems, including those for which the EIF is either difficult or impossible to derive analytically.

In most problems, we anticipate that the analytic work required to project the linear perturbation path onto the model space will be much simpler than that needed for the conventional tangent space approach. Nevertheless, this may still constitute a barrier in practice. However, because the task of projecting onto the model space is no more than an optimization problem, albeit an infinite-dimensional one, computational tools may readily be used to circumvent most, if not all, of the analytic work otherwise required. This is encouraging, since strong computational skills are commonplace in statistics and data science. Furthermore, we expect the numerical challenge to become increasingly surmountable as the capability of our computational devices continues to grow. While in this paper we established theoretical results that may facilitate computerization, the practical challenges of numerical implementation must still be carefully studied and addressed to achieve the goal of automated efficient estimation. For example, robust protocols must be devised for automatically selecting suitable values of the approximation parameters involved in the numerical computation of efficient estimators. These important questions are the focus of ongoing research.

As with all methods that incorporate some level of automation and more readily lend themselves to use by non-specialists, there is a clear potential for misuse of the results we have presented. This appears to be an inevitable risk inherent to this type of proposal, and it equally applies to some of the most celebrated tools in current statistical practice, including the bootstrap. Deriving the EIF analytically undoubtedly remains the gold-standard approach, and it should be preferred whenever possible, since much can be learned about the problem at hand from the analytic form of the EIF. In particular, verification of the regularity conditions invoked in this paper can be difficult without prior analytic knowledge of the EIF. Nevertheless, the representation introduced in this paper has the potential to serve as an important new tool in the arsenal of statistical researchers and practitioners alike for performing semiparametric and nonparametric analyses. Devising algorithms for verifying the required regularity conditions in any given problem is an important avenue for future research.

We have noted that a distinct advantage of the representation we have provided is that, once it has been used to compute the EIF of a certain parameter in a given statistical model, the EIF of any other parameter can be obtained without any additional work, since the bulk of the work required is exclusively model-specific. Nevertheless, the computational work involved must be repeated for each observation value at which we wish to evaluate the EIF. In particular, this makes it difficult to approximate the entire EIF as a function, particularly in the case of continuous or longitudinal data units. While the one-step approach only requires the EIF at the observed data points, the implementation of other efficient estimators with potentially better properties, such as targeted minimum loss-based estimators (TMLE), generally requires the entire EIF. The representation presented in this paper is therefore not conducive to a computerized implementation of TMLE. There is promise that alternative representations may be better suited for this purpose; this too is an area of active research.

Supplementary Material

Supp1

Acknowledgments

The authors gratefully acknowledge the support of the Career Development Fund of the Department of Biostatistics at the University of Washington (MC), the New Development Fund of the Fred Hutchinson Cancer Research Center (ARL), and NIH/NIAID grants 5UM1AI068635 (MC, ARL) and 5R01AI074345 (MJvdL).

Appendix

Throughout this Appendix, we will use the acronym RND to denote the Radon-Nikodym derivative.

Proof of Theorem 2. Denote by $r(\lambda)$ the supremum norm over $\mathcal{X}(P)$ of the RND of $H_{x,\lambda}$ relative to $P$. It is easy to verify that, for any fixed $\lambda > 0$, the RND of $P$ relative to $P_{\epsilon,\lambda}$ is uniformly bounded in $(0.5, 1.5)$ over $\mathcal{X}(P)$ provided $\epsilon < 1/\max\{r(\lambda) - 1, 3\}$. Using the fact that $P^*_{\epsilon,\lambda}$ minimizes $D(P_1, P_{\epsilon,\lambda})$ over all $P_1 \in \mathcal{M}(P_{\epsilon,\lambda})$, and that $P \in \mathcal{M}(P_{\epsilon,\lambda})$ since $P_{\epsilon,\lambda}$ dominates $P$ by construction, we then have that

$$0 \le D(P^*_{\epsilon,\lambda}, P_{\epsilon,\lambda}) \le D(P, P_{\epsilon,\lambda}) \le \tilde K_1 \int \left[\frac{dP}{dP_{\epsilon,\lambda}}(u) - 1\right]^2 dP_{\epsilon,\lambda}(u) = \tilde K_1 \int \left[\frac{dP_{\epsilon,\lambda}}{dP}(u) - 1\right]^2 \frac{dP}{dP_{\epsilon,\lambda}}(u)\, dP(u) \le \frac{3\tilde K_1}{2} \int \left[\frac{dP_{\epsilon,\lambda}}{dP}(u) - 1\right]^2 dP(u) \le \frac{3\tilde K_1}{2}\, \epsilon^2 r_2(\lambda)^2,$$

where we have set $\tilde K_1 := K_1(0.5, 1.5)$ and used conditions (B1) and (B2). It is straightforward to show that the RND of $P^*_{\epsilon,\lambda}$ relative to $P_{\epsilon,\lambda}$ is uniformly bounded between $0.5\,m_0$ and $1.5\,m_1$ for $\epsilon < 1/\max\{r(\lambda) - 1, 3\}$. Using condition (B2), this allows us to write that

$$0 \le \left\|\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}} - 1\right\|_{2,P_{\epsilon,\lambda}}^2 = \int \left[\sqrt{\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}}(u)} - 1\right]^2 \left[\sqrt{\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}}(u)} + 1\right]^2 dP_{\epsilon,\lambda}(u) \le \tilde K_2 \int \left[\sqrt{\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}}(u)} - 1\right]^2 dP_{\epsilon,\lambda}(u) \le \tilde K_0 \tilde K_2\, D(P^*_{\epsilon,\lambda}, P_{\epsilon,\lambda}) \le \frac{3}{2}\tilde K_0 \tilde K_1 \tilde K_2\, \epsilon^2 r_2(\lambda)^2$$

with $\tilde K_0 := K_0(0.5\,m_0, 1.5\,m_1)$ and $\tilde K_2 := 1.5\,m_1 + 2\sqrt{1.5\,m_1} + 1$. Writing $\tilde K := (\tfrac{3}{2}\tilde K_0 \tilde K_1 \tilde K_2)^{1/2}$ and recalling that $\phi^*_{\epsilon,\lambda} \in T_{\mathcal{M}}(P^*_{\epsilon,\lambda})$, the latter inequality implies, in view of condition (B3), that

$$\frac{\left|\int \phi^*_{\epsilon,\lambda}(u)\, dP_{\epsilon,\lambda}(u)\right|}{\epsilon} \le \frac{\tilde K \epsilon\, r_2(\lambda)\, B\!\left(\tilde K \epsilon\, r_2(\lambda)\right)}{\epsilon} = \tilde K r_2(\lambda)\, B\!\left(\tilde K \epsilon\, r_2(\lambda)\right) \longrightarrow 0$$

as $\epsilon$ tends to zero. This establishes that $\lim_{\lambda \to 0} \lim_{\epsilon \to 0} \int \phi^*_{\epsilon,\lambda}(u)\, dP_{\epsilon,\lambda}(u)/\epsilon = 0$, as required. □

Proof of Theorem 3. We first note that

$$\left|\int \phi^*_{\epsilon,\lambda}(u)\, d(H_{x,\lambda} - P)(u) - \phi_P(x)\right| = \left|\int \{\phi^*_{\epsilon,\lambda}(u) - \phi_P(u)\}\, dH_{x,\lambda}(u) + \int \phi_P(u)\, dH_{x,\lambda}(u) - \phi_P(x) - \int \phi^*_{\epsilon,\lambda}(u)\, dP(u)\right| \le \left|\int \{\phi^*_{\epsilon,\lambda}(u) - \phi_P(u)\}\, dH_{x,\lambda}(u)\right| + \left|\int \phi_P(u)\, dH_{x,\lambda}(u) - \phi_P(x)\right| + \left|\int \phi^*_{\epsilon,\lambda}(u)\, dP(u)\right|$$

and because by assumption the second and third summands on the second line tend to zero as $\lambda$ tends to zero, it suffices to study the first summand. We can bound this term by $\|\phi^*_{\epsilon,\lambda} - \phi_P\|_{\infty, S_{x,\lambda}}$, and so, if condition (a) holds, the result follows immediately. Alternatively, we can write this term as

$$\left|\int \{\phi^*_{\epsilon,\lambda}(u) - \phi_P(u)\}\, dH_{x,\lambda}(u)\right| = \left|\int \frac{dH_{x,\lambda}}{dP}(u)\, \{\phi^*_{\epsilon,\lambda}(u) - \phi_P(u)\}\, dP(u)\right| \le \left\|\phi^*_{\epsilon,\lambda} - \phi_P\right\|_{2,P} \left\|\frac{dH_{x,\lambda}}{dP}\right\|_{2,P}$$

and thus, if condition (b) holds, the result is also guaranteed to hold. □

Proof of Theorem 4. We build upon notation and results established in the proof of Theorem 2. Below, we take a small but otherwise arbitrary $\lambda > 0$ and consider only $\epsilon < 1/\max\{r(\lambda) - 1, 3\}$. We first note that

$$\frac{dP^*_{\epsilon,\lambda}}{dP} - 1 = \frac{dP_{\epsilon,\lambda}}{dP}\left(\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}} - 1\right) + \left(\frac{dP_{\epsilon,\lambda}}{dP} - 1\right).$$

We can write that

$$\left\|\frac{dP_{\epsilon,\lambda}}{dP}\left(\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}} - 1\right)\right\|_{2,P}^2 = \int \left[\frac{dP_{\epsilon,\lambda}}{dP}(u)\right]^2 \left[\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}}(u) - 1\right]^2 dP(u) \le \left\|\frac{dP_{\epsilon,\lambda}}{dP}\right\|_{\infty, \mathcal{X}(P)} \left\|\frac{dP^*_{\epsilon,\lambda}}{dP_{\epsilon,\lambda}} - 1\right\|_{2,P_{\epsilon,\lambda}}^2 \le \left[1 - \epsilon + \epsilon\, r(\lambda)\right] \left[\tilde K \epsilon\, r_2(\lambda)\right]^2$$

and additionally, that

$$\left\|\frac{dP_{\epsilon,\lambda}}{dP} - 1\right\|_{2,P}^2 = \epsilon^2 \int \left[1 - \frac{dH_{x,\lambda}}{dP}(u)\right]^2 dP(u) \le \epsilon^2 r_2(\lambda)^2.$$

Thus, using the above inequalities and the triangle inequality, we find that

$$0 \le |R(P^*_{\epsilon,\lambda}, P)| \le C\left\|\frac{dP^*_{\epsilon,\lambda}}{dP} - 1\right\|_{2,P}^2 \le C \epsilon^2 r_2(\lambda)^2 \left[\sqrt{1 - \epsilon + \epsilon\, r(\lambda)}\; \tilde K + 1\right]^2,$$

which then implies that $\lim_{\lambda \to 0} \lim_{\epsilon \to 0} R(P^*_{\epsilon,\lambda}, P)/\epsilon = 0$, as required. □

Proof of Theorem 5. We denote by $f_r(\epsilon, \lambda)$ the derivative $\frac{d^r}{d\epsilon^r}\Psi(P^*_{\epsilon,\lambda})$ and by $\phi^{\circ}_{\epsilon,\lambda,m}$ the $(m+1)$-point forward finite-difference approximation to $\phi_P(x)$. The approximation error can be written as

$$\phi^{\circ}_{\epsilon,\lambda,m} - \phi_P(x) = \left[\phi^{\circ}_{\epsilon,\lambda,m} - f_1(0,\lambda)\right] + \left[f_1(0,\lambda) - \phi_P(x)\right],$$

where we will see that the first bracketed summand benefits from the use of the multi-point scheme. Define the Taylor remainder $R_{j\epsilon,\lambda,s} := f_0(j\epsilon,\lambda) - f_0(0,\lambda) - \sum_{r=1}^{s} \frac{(j\epsilon)^r}{r!}\, f_r(0,\lambda)$. Setting $A(r,m) := \sum_{j=0}^{m} j^r a_{j,m}$, with the convention $0^0 = 1$, and in view of the fact that $A(0,m) = 0$, $A(1,m) = 1$ and $A(r,m) = 0$ for $r = 2, 3, \ldots, m$ (Fornberg, 1988), we note that

$$\begin{aligned} \phi^{\circ}_{\epsilon,\lambda,m} &= \frac{1}{\epsilon}\sum_{j=0}^{m} a_{j,m}\, \Psi(P^*_{j\epsilon,\lambda}) = \sum_{j=1}^{m} a_{j,m} \left[\frac{\Psi(P^*_{j\epsilon,\lambda}) - \Psi(P)}{\epsilon}\right] = \sum_{j=1}^{m} j\, a_{j,m} \left[\frac{\Psi(P^*_{j\epsilon,\lambda}) - \Psi(P)}{j\epsilon}\right] \\ &= \sum_{j=1}^{m} j\, a_{j,m} \left[f_1(0,\lambda) + \sum_{r=2}^{s} \frac{(j\epsilon)^{r-1}}{r!}\, f_r(0,\lambda) + \frac{R_{j\epsilon,\lambda,s}}{j\epsilon}\right] \\ &= f_1(0,\lambda)\, A(1,m) + \sum_{r=2}^{s} \frac{\epsilon^{r-1}}{r!}\, f_r(0,\lambda)\, A(r,m) + \frac{1}{\epsilon}\sum_{j=1}^{m} a_{j,m}\, R_{j\epsilon,\lambda,s} = f_1(0,\lambda) + \frac{1}{\epsilon}\sum_{j=1}^{m} a_{j,m}\, R_{j\epsilon,\lambda,s}, \end{aligned}$$

where the second line follows from the first by a Taylor expansion of $\eta \mapsto \Psi(P^*_{\eta,\lambda})$ around $\eta = 0$ evaluated at $\eta = j\epsilon$. By Taylor's theorem, the remainder term has the form

$$R_{j\epsilon,\lambda,s} = \frac{(j\epsilon)^{s+1}}{(s+1)!}\, f_{s+1}(\xi_{j,\epsilon,\lambda}, \lambda)$$

for some $\xi_{j,\epsilon,\lambda}$ between $0$ and $j\epsilon$. By the differentiability condition, for small $\omega > 0$, we can write

$$f_{s+1}(\omega, \lambda) = \int \Upsilon_{s+1,\lambda}(P_{\omega,\lambda})(\underline{u}_{s+1}) \prod_{k=1}^{s+1}\left[\frac{dH_{x,\lambda}}{dP}(u_k) - 1\right] dP(u_1)\, dP(u_2) \cdots dP(u_{s+1}),$$

where $\underline{u}_{s+1} := (u_1, u_2, \ldots, u_{s+1})$ and $\Upsilon_{s+1,\lambda}(P_{\omega,\lambda})$ denotes an $(s+1)$th order gradient of $\tilde\Psi$ relative to the model $\mathcal{M}_{0,\lambda}$ evaluated at $P_{\omega,\lambda}$ (see, e.g., van der Vaart, 2014). We thus find that

$$|f_{s+1}(\xi_{j,\epsilon,\lambda}, \lambda)| \le \sup_{0 < \omega < m\epsilon} |f_{s+1}(\omega, \lambda)| \le \sup_{0 < \omega < m\epsilon} \left\|\Upsilon_{s+1,\lambda}(P_{\omega,\lambda})\right\|_{2,P^{s+1}} \left[1 + r_{s+1}(\lambda)\right]^{s+1}$$

with $P^{s+1} := P \times \cdots \times P$ a product measure of dimension $s+1$. For $\epsilon$ and $\lambda$ small enough, this gives that

$$\left|\phi^{\circ}_{\epsilon,\lambda,m} - f_1(0,\lambda)\right| = \left|\frac{1}{\epsilon}\sum_{j=1}^{m} a_{j,m}\, \frac{(j\epsilon)^{s+1}}{(s+1)!}\, f_{s+1}(\xi_{j,\epsilon,\lambda}, \lambda)\right| \le \epsilon^s \sum_{j=1}^{m} \frac{|a_{j,m}|\, j^{s+1}}{(s+1)!}\, \left|f_{s+1}(\xi_{j,\epsilon,\lambda}, \lambda)\right| \le \epsilon^s \left[1 + r_{s+1}(\lambda)\right]^{s+1} \frac{\sup_{0<\omega<m\epsilon} \left\|\Upsilon_{s+1,\lambda}(P_{\omega,\lambda})\right\|_{2,P^{s+1}}}{(s+1)!} \sum_{j=1}^{m} j^{s+1}\, |a_{j,m}| = O\!\left(\epsilon^s\, r_{s+1}(\lambda)^{s+1}\right).$$

Finally, we note that $f_1(0,\lambda) = \int \phi_P(u)\, H_{x,\lambda}(du)$ in view of the proof of Theorem 1. If $u \mapsto \phi_P(u)$ has bounded second derivatives near $u = x$, and $H_{x,\lambda}$ is based on a second-order kernel, it follows from standard results in the kernel smoothing literature that the systematic deviation $f_1(0,\lambda) - \phi_P(x)$ has order $O(\lambda^2)$. □
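For intuition behind this last step, suppose (as an illustrative assumption consistent with the above) that $H_{x,\lambda}$ has density $u \mapsto \lambda^{-1}K((u-x)/\lambda)$ for a symmetric, second-order kernel $K$. A Taylor expansion of $\phi_P$ around $x$ then gives
$$\int \phi_P(u)\, H_{x,\lambda}(du) - \phi_P(x) = \frac{\lambda^2}{2}\, \phi_P''(x) \int w^2 K(w)\, dw + o(\lambda^2),$$
which is precisely the $O(\lambda^2)$ systematic deviation invoked above.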

References

1. Begun J, Hall W, Huang W, and Wellner J (1983). Information and asymptotic efficiency in parametric-nonparametric models. The Annals of Statistics, pages 432–452.
2. Bickel P, Klaassen C, Ritov Y, and Wellner J (1997). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
3. Bickel P and Ritov Y (1988). Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A, pages 381–393.
4. Chambaz A and van der Laan MJ (2014). Inference in targeted group-sequential covariate-adjusted randomized clinical trials. Scandinavian Journal of Statistics, 41(1):104–140.
5. Chamberlain G (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334.
6. Chen X (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632.
7. Fornberg B (1988). Generation of finite difference formulas on arbitrarily spaced grids. Mathematics of Computation, 51(184):699–706.
8. Frangakis C, Qian T, Wu Z, and Díaz I (2015). Deductive derivation and Turing-computerization of semiparametric efficient estimation (with discussion). Biometrics.
9. Geskus R and Groeneboom P (1999). Asymptotically optimal estimation of smooth functionals for interval censoring, case 2. The Annals of Statistics, 27(2):627–674.
10. Gilbert PB, Yu X, and Rotnitzky A (2014). Optimal auxiliary-covariate-based two-phase sampling design for semiparametric efficient estimation of a mean or mean difference, with application to clinical trials. Statistics in Medicine, 33(6):901–917.
11. Hájek J (1970). A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(4):323–330.
12. Hájek J (1972). Local asymptotic minimax and admissibility in estimation. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 175–194.
13. Hampel FR, Ronchetti EM, Rousseeuw PJ, and Stahel WA (2011). Robust Statistics: The Approach Based on Influence Functions, volume 196. John Wiley & Sons.
14. Ichimura H and Newey W (2015). The influence function of semiparametric estimators. arXiv preprint arXiv:1508.01378.
15. Kiefer J and Wolfowitz J (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, pages 887–906.
16. Koshevnik Y and Levit B (1977). On a non-parametric analogue of the information matrix. Theory of Probability & Its Applications, 21(4):738–753.
17. Le Cam L (1972). Limits of experiments. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 245–261.
18. Luedtke A, Carone M, and van der Laan M (2015). A discussion of "Deductive derivation and Turing-computerization of semiparametric efficient estimation" by Frangakis et al. Biometrics.
19. Maathuis MH and Wellner JA (2008). Inconsistency of the MLE for the joint distribution of interval-censored survival times and continuous marks. Scandinavian Journal of Statistics, 35(1):83–103.
20. Newey W (1994). The asymptotic variance of semiparametric estimators. Econometrica, pages 1349–1382.
21. Newey WK (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168.
22. Pfanzagl J (1982). Contributions to a General Asymptotic Statistical Theory. Springer.
23. Quale CM, van der Laan MJ, and Robins JR (2006). Locally efficient estimation with bivariate right-censored data. Journal of the American Statistical Association, 101(475):1076–1084.
24. Robins JM (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–1512.
25. Shen X (1997). On methods of sieves and penalization. The Annals of Statistics, pages 2555–2591.
26. Stein C (1956). Efficient nonparametric testing and estimation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 187–195.
27. Tsiatis A (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media.
28. van der Laan M (1996). Efficient estimation in the bivariate censoring model and repairing NPMLE. The Annals of Statistics, 24(2):596–627.
29. van der Laan M and Robins J (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer.
30. van der Laan M and Rose S (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.
31. van der Laan M and Rubin D (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1).
32. van der Laan MJ and Rose S (2018). Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Springer.
33. van der Vaart A (1991). On differentiable functionals. The Annals of Statistics, pages 178–204.
34. van der Vaart A (2014). Higher order tangent spaces and influence functions. Statistical Science, 29(4):679–686.
35. Wong W and Severini T (1991). On maximum likelihood estimation in infinite dimensional parameter spaces. The Annals of Statistics, pages 603–632.
