A Generally Efficient Targeted Minimum Loss Based Estimator based on the Highly Adaptive Lasso

Mark van der Laan

doi:10.1515/ijb-2015-0097

Author manuscript; available in PMC: 2018 Jul 22.

Published in final edited form as: Int J Biostat. 2017 Oct 12;13(2):/j/ijb.2017.13.issue-2/ijb-2015-0097/ijb-2015-0097.xml. doi: 10.1515/ijb-2015-0097

PMCID: PMC6054860 NIHMSID: NIHMS979184 PMID: 29023235

Abstract

Suppose we observe n independent and identically distributed observations of a finite dimensional bounded random variable. This article is concerned with the construction of an efficient targeted minimum loss-based estimator (TMLE) of a pathwise differentiable target parameter of the data distribution based on a realistic statistical model. The only smoothness condition we will enforce on the statistical model is that the nuisance parameters of the data distribution that are needed to evaluate the canonical gradient of the pathwise derivative of the target parameter are multivariate real valued cadlag functions (right-continuous with left-hand limits; G. Neuhaus. On weak convergence of stochastic processes with multidimensional time parameter. Ann Math Stat 1971;42:1285–1295) and have a finite supremum and (sectional) variation norm. Each nuisance parameter is defined as a minimizer of the expectation of a loss function over all functions in its parameter space. For each nuisance parameter, we propose a new minimum loss based estimator that minimizes the loss-specific empirical risk over the functions in its parameter space under the additional constraint that the variation norm of the function is bounded by a set constant. The constant is selected with cross-validation. We show such an MLE can be represented as the minimizer of the empirical risk over linear combinations of indicator basis functions under the constraint that the sum of the absolute value of the coefficients is bounded by the constant: i.e., the variation norm corresponds with this L₁-norm of the vector of coefficients. We will refer to this estimator as the highly adaptive Lasso (HAL)-estimator. We prove that for all models the HAL-estimator converges to the true nuisance parameter value at a rate that is faster than n^−1/4 w.r.t. square-root of the loss-based dissimilarity. 
We also show that if this HAL-estimator is included in the library of an ensemble super-learner, then the super-learner will at a minimum achieve the rate of convergence of the HAL, but, by previous results, it will actually be asymptotically equivalent with the oracle (i.e., in some sense best) estimator in the library. Subsequently, we establish that a one-step TMLE using such a super-learner as initial estimator for each of the nuisance parameters is asymptotically efficient at any data generating distribution in the model, under weak structural conditions on the target parameter mapping and model and a strong positivity assumption (e.g., the canonical gradient is uniformly bounded). We demonstrate our general theorem by constructing such a one-step TMLE of the average causal effect in a nonparametric model, and establishing that it is asymptotically efficient.

Keywords: asymptotic linear estimator, canonical gradient, cross-validated targeted minimum loss estimation (CV-TMLE), Donsker class, efficient influence curve, efficient estimator, empirical process, entropy, highly adaptive Lasso, influence curve, one-step TMLE, super-learning, targeted minimum loss estimation (TMLE)

1 Introduction

We consider the general statistical estimation problem defined by a statistical model for the data distribution, a Euclidean valued target parameter mapping defined on the statistical model, and observing n independent and identically distributed draws from the data distribution. Our goal is to construct a generally asymptotically efficient substitution estimator of the target parameter. An estimator is asymptotically efficient if and only if it is asymptotically linear with influence curve equal to the canonical gradient (also called the efficient influence curve) of the pathwise derivative of the target parameter [1]. For realistic statistical models construction of efficient estimators requires using highly data adaptive estimators of the relevant parts of the data distribution the efficient influence curve depends upon. We will refer to these relevant parts of the data distribution as nuisance parameters.

One can construct an asymptotically efficient estimator with the following two general methods. Firstly, the one-step estimator is defined by adding to an initial plug-in estimator of the target parameter an empirical mean of an estimator of the efficient influence curve at this same initial estimator [1]. In the special case that the efficient influence curve can be represented as an estimating function, one can represent this methodology as the first step of the Newton-Raphson algorithm for solving the estimating equation defined by setting the empirical mean of the efficient influence curve equal to zero. Such general estimating equation methodology for construction of efficient estimators has been developed for censored and causal inference models in the literature (e.g., [2, 3]). Secondly, the TMLE defines a least favorable parametric submodel through an initial estimator of the relevant parts (nuisance parameters) of the data distribution, and updates the initial estimator with the MLE over this least favorable parametric submodel. The one-step TMLE of the target parameter is now the resulting plug-in estimator [4–6]. In this article we focus on the one-step TMLE since it is a more robust estimator by respecting the global constraints of the statistical model, which becomes evident when comparing the one-step estimator and TMLE in simulations for which the information is low for the target parameter (e.g., even resulting in one-step estimators of probabilities that are outside the (0, 1) range) (e.g., [7–9]). Nonetheless, the results in this article have immediate analogues for the one-step estimator and estimating equation method.

The asymptotic linearity and efficiency of the TMLE and one-step estimator rely on a second order remainder being o_P(n^−1/2), which typically requires that the nuisance parameters are estimated at a rate faster than n^−1/4 w.r.t. an L²(P₀)-norm (e.g., see our example in Section 7). To make the TMLE highly data adaptive and thereby efficient for large statistical models we have recommended estimating the nuisance parameters with a super-learner based on a large library of candidate estimators [10–13]. Due to the oracle inequality for the cross-validation selector, the super-learner will be asymptotically equivalent with the oracle selected estimator w.r.t. loss-based dissimilarity, even when the number of candidate estimators in the library grows polynomially in sample size. The loss-based dissimilarity (e.g., Kullback-Leibler divergence or loss-based dissimilarity for the squared error loss) behaves as a square of an L²(P₀)-norm (see, for example, Lemma 4 in our example). Therefore, in order to control the second order remainder, our goal should be to construct a candidate estimator in the library of the super-learner which will converge at a faster rate than n^−1/4 w.r.t. square-root of the loss-based dissimilarity.

In this article, for each nuisance parameter, we propose a new minimum loss based estimator that minimizes the loss-specific empirical risk over its parameter space under the additional constraint that the variation norm is bounded by a set constant. The constant is selected with cross-validation. We show that these MLEs can be represented as the minimizer of the empirical risk over linear combinations of indicator basis functions under the constraint that the sum of the absolute value of the coefficients is bounded by the constant: i.e., the variation norm corresponds with this L₁-norm of the vector of coefficients. We will refer to this estimator as the highly adaptive Lasso (HAL)-estimator. We prove that the HAL-estimator converges at a rate that is for all models faster than n^−1/4 w.r.t. square-root of the loss-based dissimilarity. This even holds if the model only assumes that the true nuisance parameters have a finite variation norm. As a corollary of the general oracle inequality for cross-validation, we will then show that the super-learner including this HAL-estimator in its library is guaranteed to converge to its true counterparts at the same rate as this HAL-estimator (and thus faster than n^−1/4). By also including a large variety of other estimators in the library of the super-learner, the super-learner will also have excellent practical performance for finite samples relative to competing estimators [14]. Based on this fundamental result for the HAL-estimator and the super-learner, we proceed in this article with proving a general theorem for asymptotic efficiency of the one-step TMLE for arbitrary statistical models. In this article we will use a one-step cross-validated-TMLE (CV-TMLE), which avoids the Donsker-class entropy condition on the nuisance parameter space, in order to further minimize the conditions for asymptotic efficiency [5, 15]. In our accompanying technical report [16] we present the analogous results for the one-step TMLE. 
Beyond establishing these fundamental theoretical general results, we will also discuss the practical implementation of the HAL-estimator and corresponding TMLE.

2 Example: Treatment specific mean in nonparametric model

Before we start the main part of this article, in this section we will first introduce an example, and use this example to provide the reader with a guide through the different sections.

2.1 Defining the statistical estimation problem

Let O = (W, A, Y) ∼ P₀ be a d-dimensional random variable consisting of a (d − 2)-dimensional vector of baseline covariates W, binary treatment A ∈ {0, 1} and binary outcome Y ∈ {0, 1}. We observe n i.i.d. copies O₁, …, O_n of O ∼ P₀. Let $\bar{Q} (P) (W) = E_{P} (Y | A = 1, W)$ and $\bar{G} (P) (W) = E_{P} (A | W)$ . Let Q₂(P) be the marginal cumulative probability distribution of W, and $Q = (Q_{1} = \bar{Q}, Q_{2})$ . Let the statistical model be of the form $M = \{P : G (P) \in G, Q (P) \in Q\}$ , where $G$ is a possibly restricted set, and $Q$ is nonparametric. The only key assumption we will enforce on $Q$ and $G$ is that for each $P \in M$ , $W \mapsto \bar{Q} (P) (W)$ and $W \mapsto \bar{G} (P) (W)$ are cadlag functions in W on a set [0, τ_P] ⊂ ℝ^{d−2} [17], and that the variation norms of these functions $\bar{Q} (P)$ and $\bar{G} (P)$ are bounded. The definition of variation norm will be presented in the next section. Suppose that $G$ assumes that $\bar{G}$ only depends on W through a subset of covariates of dimension d₂ ≤ d − 2: if d₂ = d − 2, then this does not represent an assumption.

Our target parameter $Ψ : M \to ℝ$ is defined by $Ψ (P) = \int \bar{Q} (w) d Q_{2} (w) \equiv Ψ_{1} (Q_{1} = \bar{Q}, Q_{2})$ . For notational convenience, we will use Ψ for both mappings Ψ and Ψ₁. It is well known that Ψ is pathwise differentiable so that for each 1-dimensional parametric submodel $\{P_{ε} : ε\} \subset M$ through P with score S at ε = 0, we have

{\frac{d}{d ε} Ψ (P_{ε}) |}_{ε = 0} = P D (P) S = \int_{o} D (P) (o) S (o) d P (o),

for some D(P) ∈ L²(P), where L²(P) is the Hilbert space of functions of O with mean zero endowed with inner product 〈f, g〉_P = Pfg. Here we use the notation Pf ≡ ∫ f(o)dP(o). Such an object D(P) is called a gradient at P of the pathwise derivative. The unique gradient that is also an element of the tangent space T(P) is defined as the canonical gradient. The tangent space T(P) at P is defined as the closure of the linear span of the set of scores of the class of 1-dimensional parametric submodels we consider. In this example the canonical gradient D^*(P) = D^*(Q(P), G(P)) at P is given by:

D^{*} (Q, G) (O) = \frac{A}{\bar{G} (W)} (Y - \bar{Q} (W)) + \bar{Q} (W) - Ψ (Q) .

Let $D_{1}^{*} (Q, G) = A / \bar{G} (W) (Y - \bar{Q} (W))$ and $D_{2}^{*} (Q) = \bar{Q} (W) - Ψ (Q)$ and note that $D^{*} (Q, G) = D_{1}^{*} (Q, G) + D_{2}^{*} (Q)$ .

An estimator ψ_n of ψ₀ = Ψ(P₀) is asymptotically efficient (among the class of all regular estimators) if and only if it is asymptotically linear with influence curve equal to the canonical gradient D^*(P₀) [1]:

ψ_{n} - ψ_{0} = P_{n} D^{*} (P_{0}) + o_{P} (n^{- 1 / 2}),

where P_n is the empirical probability distribution of O₁, …, O_n. Therefore, the canonical gradient is also called the efficient influence curve.

We have that

Ψ (P) - Ψ (P_{0}) = (P - P_{0}) D^{*} (Q, G) + R_{20} ((\bar{Q}, \bar{G}), ({\bar{Q}}_{0}, {\bar{G}}_{0})),

(1)

where Q = Q(P), G = G(P), and the second order remainder R₂₀() is defined as follows:

R_{20} ((\bar{Q}, \bar{G}), ({\bar{Q}}_{0}, {\bar{G}}_{0})) \equiv \int \frac{\bar{G} (w) - {\bar{G}}_{0} (w)}{\bar{G} (w)} (\bar{Q} (w) - {\bar{Q}}_{0} (w)) d P_{0} (w) .

Of course, PD^*(Q, G) = 0.
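The identity (1) can be checked numerically in a small discrete model. The following Python sketch is only an illustration: it assumes a hypothetical covariate W with three levels, and all probability vectors (`q2_0`, `Qbar_0`, `Gbar_0` for P₀ and `q2`, `Qbar`, `Gbar` for P) are made-up values, not quantities from this article. It evaluates the expectation of D^*(Q, G) exactly by enumeration and confirms that PD^*(Q, G) = 0 and that Ψ(P) − Ψ(P₀) = (P − P₀)D^*(Q, G) + R₂₀.

```python
import numpy as np

def eic(w, a, y, Qbar, Gbar, psi):
    """D*(Q, G)(O) = A / Gbar(W) * (Y - Qbar(W)) + Qbar(W) - psi."""
    return a / Gbar[w] * (y - Qbar[w]) + Qbar[w] - psi

def mean_of_eic(law, Qbar, Gbar, psi):
    """Exact expectation of D*(Q, G) under a law given as (q2, Qbar, Gbar).

    The law of Y given A = 0 does not matter: A multiplies the only term of
    D* that involves Y, so the A = 1 outcome law is simply reused for A = 0.
    """
    q2_l, Qbar_l, Gbar_l = law
    total = 0.0
    for w in range(len(q2_l)):
        for a in (0, 1):
            p_a = Gbar_l[w] if a == 1 else 1.0 - Gbar_l[w]
            for y in (0, 1):
                p_y = Qbar_l[w] if y == 1 else 1.0 - Qbar_l[w]
                total += q2_l[w] * p_a * p_y * eic(w, a, y, Qbar, Gbar, psi)
    return total

# Hypothetical true distribution P0 (W has three levels).
q2_0 = np.array([0.2, 0.5, 0.3])     # marginal distribution of W
Qbar_0 = np.array([0.3, 0.6, 0.8])   # E_0(Y | A = 1, W)
Gbar_0 = np.array([0.4, 0.5, 0.7])   # E_0(A | W)

# Hypothetical other distribution P in the model.
q2 = np.array([0.3, 0.3, 0.4])
Qbar = np.array([0.35, 0.5, 0.9])
Gbar = np.array([0.5, 0.45, 0.6])

psi_0 = float(np.sum(Qbar_0 * q2_0))   # Psi(P0)
psi = float(np.sum(Qbar * q2))         # Psi(P)

P_D = mean_of_eic((q2, Qbar, Gbar), Qbar, Gbar, psi)         # P D*(Q, G)
P0_D = mean_of_eic((q2_0, Qbar_0, Gbar_0), Qbar, Gbar, psi)  # P0 D*(Q, G)

# Second order remainder R20: an integral with respect to the P0-law of W.
R20 = float(np.sum(q2_0 * (Gbar - Gbar_0) / Gbar * (Qbar - Qbar_0)))

lhs = psi - psi_0
rhs = (P_D - P0_D) + R20
print(lhs, rhs, P_D)
```

The remainder is a product of the two differences, which is why it is second order: halving both differences reduces R₂₀ by a factor of four.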

We define the following three log-likelihood loss functions for $\bar{Q}$ , Q₂ and $\bar{G}$ , respectively:

L_{11} (\bar{Q}) (O) = - A \{Y \log \bar{Q} (W) + (1 - Y) \log (1 - \bar{Q} (W))\}; L_{12} (Q_{2}) (O) = - \log d Q_{2} (W); L_{2} (\bar{G}) (O) = - \{A \log \bar{G} (W) + (1 - A) \log (1 - \bar{G} (W))\} .

We also define the corresponding Kullback-Leibler dissimilarities $d_{10, 1} (\bar{Q}, {\bar{Q}}_{0}) = P_{0} \{L_{11} (\bar{Q}) - L_{11} ({\bar{Q}}_{0})\}$ , $d_{10, 2} (Q_{2}, Q_{20}) = P_{0} \{L_{12} (Q_{2}) - L_{12} (Q_{20})\}$ , and $d_{20} (\bar{G}, {\bar{G}}_{0}) = P_{0} \{L_{2} (\bar{G}) - L_{2} ({\bar{G}}_{0})\}$ . Here Q₂ represents an easy to estimate parameter which we will estimate with the empirical probability distribution $Q_{2 n} = {\hat{Q}}_{2} (P_{n})$ of W₁, …, W_n.

Let the submodel $M (δ) \subset M$ be defined by the extra restriction that $δ < \bar{Q} (W) < 1 - δ$ and $\bar{G} (W) > δ$ P₀-a.e. If we replace the log-likelihood loss $L_{11} (\bar{Q})$ (which becomes unbounded if $\bar{Q}$ approaches 0 or 1) by the squared error loss ${(Y - \bar{Q} (W))}^{2} A$ , then we can remove the restriction $δ < \bar{Q} (W) < 1 - δ$ in the definition of $M (δ)$ . Given a sequence δ_n → 0 as n → ∞, we can define a sequence of models $M_{n} = M (δ_{n})$ which grows from below to $M$ as n → ∞. By assumption, there exists an N₀ = N(P₀) < ∞ so that for n > N₀ we have $P_{0} \in M_{n}$ .

Let $Q_{n} = Q_{1 n} \times Q_{2 n}$ and $G_{n}$ be the corresponding parameter spaces for $Q = (\bar{Q}, Q_{2})$ and $\bar{G}$ , respectively, and specifically, $Q_{1 n} = \{\bar{Q} : δ_{n} < \bar{Q} < 1 - δ_{n}\}$ , while $Q_{2 n} = Q_{2}$ .

2.2 One step CV-TMLE

Let $\hat{\bar{Q}} : M_{nonp} \to Q_{1 n}$ and $\hat{\bar{G}} : M_{nonp} \to G_{n}$ be initial estimators of ${\bar{Q}}_{0}$ , ${\bar{G}}_{0}$ , respectively, where $M_{nonp}$ denotes a nonparametric model so that the estimator is defined for all realizations of the empirical probability distribution. Let $\hat{Q} : M_{nonp} \to Q_{n}$ be the estimator $\hat{Q} (P_{n}) = (\hat{\bar{Q}} (P_{n}), {\hat{Q}}_{2} (P_{n}))$ of $Q_{0} = ({\bar{Q}}_{0}, Q_{20})$ . For a given cross-validation scheme B_n ∈ {0, 1}ⁿ, let $P_{n, B_{n}}^{1}$ , $P_{n, B_{n}}^{0}$ be the empirical probability distributions of the validation sample {O_i : B_n(i) = 1} and training sample {O_i : B_n(i) = 0}, respectively. It is assumed that the proportion of observations in the validation sample (i.e., Σ_i B_n(i)/n) is between δ and 1−δ for some 0 < δ < 1. Let $Q_{n, B_{n}} = ({\bar{Q}}_{n, B_{n}}, Q_{2 n, B_{n}}) = \hat{Q} (P_{n, B_{n}}^{0})$ and ${\bar{G}}_{n, B_{n}} = \hat{\bar{G}} (P_{n, B_{n}}^{0})$ be the estimators applied to the training sample $P_{n, B_{n}}^{0}$ . Given a $(\bar{Q}, \bar{G})$ , consider the universal least favorable submodel (van der Laan and Gruber, 2015)

Logit {\bar{Q}}_{ε_{1}} = Logit \bar{Q} + ε_{1} H_{\bar{G}}

through $\bar{Q}$ at ε₁ = 0, where $H_{\bar{G}} (W) = 1 / \bar{G} (W)$ . We indeed have $\frac{d}{d ε_{1}} L_{11} ({\bar{Q}}_{ε_{1}}) = D_{1}^{*} ({\bar{Q}}_{ε_{1}}, \bar{G})$ for all ε₁. Given a $Q = (\bar{Q}, Q_{2})$ , consider also the local least favorable submodel

d Q_{2, ε_{2}}^{lfm} (W) = d Q_{2} (W) (1 + ε_{2} D_{2}^{*} (Q) (W))

through Q₂ at ε₂ = 0. Indeed, ${\frac{d}{d ε_{2}} L_{12} (Q_{2, ε_{2}}^{lfm}) |}_{ε_{2} = 0} = D_{2}^{*} (\bar{Q}, Q_{2})$ . This local least favorable submodel implies the following universal least favorable submodel (van der Laan and Gruber, 2015): for ε₂ ≥ 0

d Q_{2, ε_{2}} = d Q_{2} \exp (\int_{0}^{ε_{2}} D_{2}^{*} (\bar{Q}, Q_{2, x}) d x) .

This universal least favorable submodel implies a recursive construction of $Q_{2, ε_{2}}$ for all ε₂-values, by starting at ε₂ = 0 and moving upwards. For negative values of ε₂, we define $\int_{0}^{ε_{2}} = \int_{ε_{2}}^{0}$ . For all ε₂, $\frac{d}{d ε_{2}} L_{12} (Q_{2, ε_{2}}) = D_{2}^{*} (\bar{Q}, Q_{2, ε_{2}})$ , which shows that this is indeed a universal least favorable submodel for Q₂.

Let $ε_{1 n} = \arg \min_{ε_{1}} E_{B_{n}} P_{n, B_{n}}^{1} L_{11} ({\bar{Q}}_{n, B_{n}, ε_{1}})$ , and ${\bar{Q}}_{n, B_{n}}^{*} = {\bar{Q}}_{n, B_{n}, ε_{1 n}}$ . The score equation for $ε_{1 n}$ shows that $E_{B_{n}} P_{n, B_{n}}^{1} D_{1}^{*} ({\bar{Q}}_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}) = 0$ . Let $ε_{2 n} = \arg \min_{ε_{2}} E_{B_{n}} P_{n, B_{n}}^{1} L_{12} (Q_{2 n, B_{n}, ε_{2}})$ and $Q_{2 n, B_{n}}^{*} = Q_{2 n, B_{n}, ε_{2 n}}$ . The score equation for $ε_{2 n}$ shows that $E_{B_{n}} P_{n, B_{n}}^{1} D_{2}^{*} ({\bar{Q}}_{n, B_{n}}^{*}, Q_{2 n, B_{n}}^{*}) = 0$ , which implies

E_{B_{n}} P_{n, B_{n}}^{1} {\bar{Q}}_{n, B_{n}}^{*} = E_{B_{n}} Q_{2 n, B_{n}}^{*} {\bar{Q}}_{n, B_{n}}^{*} .

(2)

The CV-TMLE of Ψ(Q₀) is defined as $ψ_{n}^{*} \equiv E_{B_{n}} Ψ (Q_{n, B_{n}}^{*})$ , where $Q_{n, B_{n}}^{*} = ({\bar{Q}}_{n, B_{n}}^{*}, Q_{2 n, B_{n}}^{*})$ . By eq. (2) this implies that the CV-TMLE can also be represented as:

ψ_{n}^{*} = E_{B_{n}} P_{n, B_{n}}^{1} {\bar{Q}}_{n, B_{n}}^{*} .

(3)

Note that this latter representation proves that we never have to carry out the TMLE-update step for $Q_{2 n}$ , but that the CV-TMLE is a simple empirical mean of ${\bar{Q}}_{n, B_{n}}^{*}$ over the validation sample, averaged across the different splits B_n. We also conclude that this one-step CV-TMLE solves the crucial cross-validated efficient influence curve equation

E_{B_{n}} P_{n, B_{n}}^{1} D^{*} (Q_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}) = 0.

(4)
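The construction in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for this article: W is taken to be a single discrete covariate, the initial estimators are hypothetical truncated stratum means (`stratum_means`) rather than HAL-super-learners, the folds have equal size so that averaging over the splits B_n equals averaging over all observations, and a plain Newton iteration is used to solve the score equation for ε₁. All function names, variable names and simulation settings are invented for the example.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def stratum_means(w, target, n_levels, delta):
    """Hypothetical initial estimator: mean of `target` per level of a
    discrete W, truncated to [delta, 1 - delta]."""
    est = np.full(n_levels, 0.5)
    for level in range(n_levels):
        mask = w == level
        if mask.any():
            est[level] = target[mask].mean()
    return np.clip(est, delta, 1.0 - delta)

def cv_tmle(w, a, y, n_levels, n_folds=5, delta=0.05, seed=0):
    n = len(y)  # assumed to be a multiple of n_folds (equal fold sizes)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % n_folds
    offset = np.empty(n)  # logit of the initial Qbar, fitted on training data
    h = np.empty(n)       # clever covariate 1 / Gbar, fitted on training data
    for v in range(n_folds):
        train, valid = fold != v, fold == v
        treated = train & (a == 1)
        qbar = stratum_means(w[treated], y[treated], n_levels, delta)
        gbar = stratum_means(w[train], a[train], n_levels, delta)
        offset[valid] = logit(qbar[w[valid]])
        h[valid] = 1.0 / gbar[w[valid]]
    # One epsilon shared by all splits: Newton steps on the cross-validated
    # log-likelihood risk of the logistic fluctuation model.
    eps = 0.0
    for _ in range(50):
        q_eps = expit(offset + eps * h)
        score = np.sum(a * h * (y - q_eps))
        info = np.sum(a * h ** 2 * q_eps * (1.0 - q_eps))
        step = score / info
        eps += step
        if abs(step) < 1e-12:
            break
    q_star = expit(offset + eps * h)
    psi = float(q_star.mean())                      # representation (3)
    d1_mean = float(np.mean(a * h * (y - q_star)))  # CV mean of D1*
    return psi, eps, d1_mean

# Simulated data from a made-up distribution with true value 0.475.
rng = np.random.default_rng(1)
n = 2000
g0 = np.array([0.3, 0.4, 0.6, 0.7])
q0 = np.array([0.2, 0.4, 0.5, 0.8])
w = rng.integers(0, 4, size=n)
a = rng.binomial(1, g0[w])
y = rng.binomial(1, np.where(a == 1, q0[w], 0.3))
psi, eps, d1_mean = cv_tmle(w, a, y, n_levels=4)
print(psi, eps, d1_mean)
```

The returned `d1_mean` is the cross-validated empirical mean of the D₁^* component, which vanishes up to numerical tolerance at the fitted ε₁. Because the estimate is computed through representation (3), no update of Q₂ is carried out.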

2.3 Guide for article based on this example

Section 3: Formulation of general estimation problem

The goal of this article goes far beyond establishing asymptotic efficiency of the CV-TMLE eq. (3) in this example. Therefore, we start in Section 3 by defining a general model and general target parameter, essentially generalizing the above notation for this example. Having read the above example, the reader will find the presentation in Section 3 of a very general estimation problem easier to follow. Our subsequent definition and results for the HAL-estimator, the HAL-super-learner, and the CV-TMLE in Sections 4–6 apply to our general model and target parameter, thereby establishing asymptotic efficiency of the CV-TMLE for an enormously large class of semi-parametric statistical estimation problems, including our example as a special case.

Let’s now return to our example to point out the specific tasks that are solved in each section of this article. By eqs (1) and (4), we have the following starting identity for the CV-TMLE:

E_{B_{n}} Ψ (Q_{n, B_{n}}^{*}) - Ψ (Q_{0}) = E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) D^{*} (Q_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}) + E_{B_{n}} R_{20} (({\bar{Q}}_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}), ({\bar{Q}}_{0}, {\bar{G}}_{0})) .

(5)

By the Cauchy-Schwarz inequality and bounding $1 / {\bar{G}}_{n, B_{n}}$ by 1/δ_n, we can bound the second order remainder as follows:

| E_{B_{n}} R_{20} (({\bar{Q}}_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}), ({\bar{Q}}_{0}, {\bar{G}}_{0})) | \leq \frac{1}{δ_{n}} E_{B_{n}} {‖ {\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0} ‖}_{P_{0}} {‖ {\bar{G}}_{n, B_{n}} - {\bar{G}}_{0} ‖}_{P_{0}},

(6)

where ${‖ f ‖}_{P_{0}} \equiv {(P_{0} f^{2})}^{1 / 2}$ . Suppose we can construct estimators $\hat{\bar{Q}}$ and $\hat{\bar{G}}$ of ${\bar{Q}}_{0}$ and ${\bar{G}}_{0}$ so that ${‖ {\bar{Q}}_{n} - {\bar{Q}}_{0} ‖}_{P_{0}} = O_{P} (n^{- 1 / 4 - α_{1}})$ and ${‖ {\bar{G}}_{n} - {\bar{G}}_{0} ‖}_{P_{0}} = O_{P} (n^{- 1 / 4 - α_{2}})$ for some α₁ > 0, α₂ > 0. Since the size of the training sample is proportional to the sample size n, this immediately implies ${‖ {\bar{G}}_{n, B_{n}} - {\bar{G}}_{0} ‖}_{P_{0}} = O_{P} (n^{- 1 / 4 - α_{2}})$ and ${‖ {\bar{Q}}_{n, B_{n}} - {\bar{Q}}_{0} ‖}_{P_{0}} = O_{P} (n^{- 1 / 4 - α_{1}})$ . In addition, it is easy to show (as we will formally establish in general) that the rate of convergence of the initial estimator ${\bar{Q}}_{n, B_{n}}$ carries over to its targeted version so that ${‖ {\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0} ‖}_{P_{0}} = O_{P} (n^{- 1 / 4 - α_{1}})$ . Thus, with such initial estimators, we obtain

E_{B_{n}} R_{20} (({\bar{Q}}_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}), ({\bar{Q}}_{0}, {\bar{G}}_{0})) = O_{P} (δ_{n}^{- 1} n^{- 1 / 2 - α_{1} - α_{2}}) .

(7)

Thus, by selecting δ_n so that $δ_{n}^{- 1} n^{- α_{1} - α_{2}} \to 0$ , we obtain $E_{B_{n}} R_{20} (({\bar{Q}}_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}), ({\bar{Q}}_{0}, {\bar{G}}_{0})) = o_{P} (n^{- 1 / 2})$ .
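As a worked illustration of this rate condition, suppose, purely hypothetically, that α₁ = α₂ = 1/8 and that we take δ_n = n^{−γ} for some 0 < γ < α₁ + α₂ = 1/4. These numerical values are assumptions chosen to make the arithmetic concrete, not rates established in this article. The bound on the remainder then reads:

```latex
\delta_n^{-1} \, n^{-1/2 - \alpha_1 - \alpha_2}
  = n^{\gamma} \, n^{-1/2 - 1/4}
  = n^{-1/2} \, n^{-(1/4 - \gamma)}
  = o(n^{-1/2}) \quad \text{because } \gamma < 1/4 .
```

For instance, γ = 1/8 gives the bound n^{−5/8}. The truncation level δ_n may therefore converge to zero, but only more slowly than n^{−(α₁ + α₂)}.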

Section 4: Construction and analysis of an M-specific HAL-estimator that converges at a rate faster than n^−1/4

This challenge of constructing such estimators $\hat{\bar{Q}}$ and $\hat{\bar{G}}$ is addressed in Section 4. In the context of our example, in Section 4 we define a minimum loss estimator (MLE) ${\bar{Q}}_{n, M} = \arg \min_{{‖ \bar{Q} ‖}_{ν} < M} P_{n} L_{11} (\bar{Q})$ that minimizes the empirical risk over all cadlag functions with variation norm smaller than M. In Section 4 we then show that, if M is chosen larger than the variation norm of ${\bar{Q}}_{0}$ , $d_{10, 1}^{1 / 2} ({\bar{Q}}_{n, M}, {\bar{Q}}_{0})$ converges to zero at a faster rate than $n^{- 1 / 4 - α_{1}}$ for some α₁ = α₁(d) > 0 (for each dimension d). We provide an explicit representation eq. (17) of a cadlag function with finite variation norm M as an infinite linear combination of indicator functions for which the sum of the absolute value of the coefficients is bounded by M. As a consequence, it is shown in Appendix D that this M-specific minimum loss-based estimator can be approximated by (or can be exactly defined as) a Lasso-generalized linear regression problem in which the sum of the absolute values of the coefficients is bounded by M. Therefore, we will refer to ${\bar{Q}}_{n, M}$ as the M-specific HAL-estimator. Our proof of Lemma 1 in Section 4, which establishes the rate of convergence of the M-specific HAL-estimator, relies on an empirical process result in [18] that expresses the upper bound for this rate of convergence in terms of the entropy of the model space $Q_{1}$ of $\bar{Q}$ . The representation eq. (17) demonstrates that the set of cadlag functions that have variation norm smaller than a constant M is a difference of a "convex" hull of indicator functions, and, as a consequence of a general convex hull result in [19], this proves that it is a Donsker class with a specified upper bound on its entropy. In this way, we obtain an explicit entropy bound for our model space $Q_{1}$ . 
Given this explicit upper bound for the entropy, the result in [18] establishes a rate of convergence of the M-specific HAL-estimator faster than $n^{- 1 / 4 - α_{1}}$ for a specified α₁ > 0. By selecting M larger than the unknown variation norm of the true nuisance parameter value, we obtain an HAL-estimator that converges at a faster rate than n^−1/4.
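To make the Lasso representation concrete, the following Python sketch fits a one-dimensional version of such an estimator. It is an illustration under several assumptions that are not taken from this article: the covariate is univariate, the squared error loss replaces the log-likelihood loss, the L₁-constraint is handled in its penalized form with a hypothetical penalty `lam` rather than as an explicit bound M, and a generic coordinate descent routine (`lasso_cd`) stands in for whatever solver is used in practice. The basis consists of the indicators 1(x ≥ x_j) at the observed points, so the sum of the absolute coefficients is the variation of the fitted step function, apart from the unpenalized intercept.

```python
import numpy as np

def hal_basis(x, knots):
    """Indicator basis: column j is 1(x >= knots[j])."""
    return (x[:, None] >= knots[None, :]).astype(float)

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(phi, y, lam, n_sweeps=100):
    """Coordinate descent for
    (1 / 2n) * ||y - b0 - phi @ beta||^2 + lam * ||beta||_1,
    with an unpenalized intercept b0."""
    n, p = phi.shape
    beta = np.zeros(p)
    b0 = float(y.mean())
    resid = y - b0
    col_sq = (phi ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        shift = resid.mean()          # exact update of the intercept
        b0 += shift
        resid = resid - shift
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = phi[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid = resid + phi[:, j] * (beta[j] - new)
            beta[j] = new
    return b0, beta

# Made-up data: a noisy step function on [0, 1].
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 1.0, size=n)
y = (x > 0.5).astype(float) + rng.normal(0.0, 0.1, size=n)

knots = np.unique(x)
phi = hal_basis(x, knots)

b0, beta = lasso_cd(phi, y, lam=0.01)
fit = b0 + phi @ beta
variation = float(np.abs(beta).sum())   # L1-norm of the coefficients

# A very large penalty forces all coefficients to zero (constant fit).
b0_big, beta_big = lasso_cd(phi, y, lam=1.0)
print(variation, float(np.mean((y - fit) ** 2)))
```

Decreasing `lam` corresponds to allowing a larger bound M on the variation norm, which is the tuning parameter that the article proposes to select with cross-validation.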

Section 5: Construction and analysis of an HAL-super-learner

Instead of assuming that the variation norm of ${\bar{Q}}_{0}$ is bounded by a known M and using the corresponding M-specific HAL-estimator, in Section 5 we define a collection of such M-specific estimators for a set of M-values for which the maximum value converges to infinity as sample size converges to infinity. We then use cross-validation to data adaptively select M. We show that the resulting cross-validated selected estimator of ${\bar{Q}}_{0}$ will be asymptotically equivalent with the oracle (i.e., best w.r.t. loss-based dissimilarity) choice. This follows from a previously established oracle inequality for the cross-validation selector, as long as the supremum norm bound on the loss-function at the candidate estimators does not grow too fast to infinity as a function of sample size (e.g., [11, 13]). Using such a data adaptively selected M yields an estimator with better practical performance and avoids having to know an upper bound M. As a consequence, our statistical model does not need to assume a universal bound M on the variation norm of the nuisance parameters, but it only needs to assume that each nuisance parameter value has a finite variation norm. For the sake of finite sample performance, we want to use a super-learner that uses cross-validation to select an estimator from a library of candidate estimators that includes these M-specific estimators as candidates, in addition to other candidate estimators. In this way, the choice of estimator will be adapted to what works well for the actual data set. Therefore, in Section 5, we actually define such a general super-learner $\hat{\bar{Q}}$ and Theorem 2 states that it will converge at least as fast as the best choice in the library, and thus certainly as fast as the M-specific HAL-estimator using M equal to the true variation norm of ${\bar{Q}}_{0}$ . We refer to a super-learner whose library includes this collection of M-specific HAL-estimators as an HAL-super-learner. 
We will use an analogous HAL-super-learner of ${\bar{G}}_{0}$ (Theorem 6).

The convergence results for this super-learner in terms of the Kullback-Leibler loss-based dissimilarities also imply corresponding results for L²(P₀)-convergence as needed to control the second order remainder eq. (6): see Lemma 4.
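The cross-validation selector underlying such a super-learner can be sketched as follows in Python. The sketch makes assumptions that are not part of this article: the squared error loss is used, the covariate is univariate on [0, 1], and the library consists of hypothetical piecewise-constant fits (`fit_regressogram`) indexed by the number of bins, which merely stand in for a collection of M-specific HAL-estimators and other candidates. The function `cv_select` and all other names are invented for the example.

```python
import numpy as np

def fit_regressogram(x, y, k):
    """Hypothetical candidate: piecewise-constant fit on k equal-width
    bins of [0, 1]; empty bins fall back to the overall training mean."""
    bins = np.minimum((x * k).astype(int), k - 1)
    means = np.full(k, y.mean())
    for b in range(k):
        if np.any(bins == b):
            means[b] = y[bins == b].mean()
    return lambda x_new: means[np.minimum((x_new * k).astype(int), k - 1)]

def cv_select(x, y, candidates, n_folds=5, seed=0):
    """Return the candidate with the smallest cross-validated risk,
    together with the cross-validated risks of all candidates."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(y)) % n_folds
    risks = {}
    for name, fit in candidates.items():
        fold_risks = []
        for v in range(n_folds):
            train, valid = fold != v, fold == v
            predict = fit(x[train], y[train])   # fit on the training sample
            fold_risks.append(np.mean((y[valid] - predict(x[valid])) ** 2))
        risks[name] = float(np.mean(fold_risks))
    best = min(risks, key=risks.get)
    return best, risks

# Made-up data: a noisy step function on [0, 1).
rng = np.random.default_rng(2)
n = 400
x = rng.uniform(0.0, 1.0, size=n)
y = (x > 0.5).astype(float) + rng.normal(0.0, 0.1, size=n)

# Library indexed by a complexity parameter (number of bins).
candidates = {k: (lambda xs, ys, k=k: fit_regressogram(xs, ys, k))
              for k in (1, 4, 16, 64)}
best, risks = cv_select(x, y, candidates)
print(best, risks)
```

Each candidate is fitted on the training sample only and evaluated on the validation sample, which is the structure that the oracle inequality for the cross-validation selector relies on.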

Section 6: Construction and analysis of HAL-CV-TMLE

To control the remainder we need to understand the behavior of the updated initial estimator ${\bar{Q}}_{n, B_{n}}^{*}$ instead of the initial estimator ${\bar{Q}}_{n, B_{n}}$ itself. In our example, since the updated estimator only involves a single updating step of the initial estimator, using a cross-validated MLE selector of ε, we can easily show that ${\bar{Q}}_{n, B_{n}}^{*}$ converges at the same rate to ${\bar{Q}}_{0}$ as the initial estimator ${\bar{Q}}_{n, B_{n}}$ . In general, in Section 6 we define a one-step CV-TMLE for our general model and target parameter so that the targeted version of the initial estimator of ${\bar{Q}}_{0}$ converges at the same rate as the initial HAL-super-learner estimator ${\bar{Q}}_{n}$ . (Since the initial estimator is an HAL-super-learner, we refer to this type of CV-TMLE as an HAL-CV-TMLE.) This concerns a choice of least favorable submodel for which the CV-TMLE-step separately updates each of the components of the initial estimator $\hat{Q}$ . We then show that with this choice of least favorable submodel the CV-TMLE-step preserves the convergence rate of the initial estimator (Lemma 3). We also establish in Appendix D that the one-step CV-TMLE already solves the desired cross-validated efficient influence curve equation (4) up to an o_P(n^−1/2)-term, so that an iterative CV-TMLE can be avoided (Lemma 13 and Lemma 14). At that point, we have shown that the generalized analogue of eq. (7) indeed holds with a specified α₁ > 0, α₂ > 0. In the final subsection of Section 6, Theorem 1 then establishes the asymptotic efficiency of the HAL-CV-TMLE, which now also involves analyzing the cross-validated empirical process term, specifically, showing that

E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) D^{*} (Q_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}) = (P_{n} - P_{0}) D^{*} (Q_{0}, {\bar{G}}_{0}) + o_{P} (n^{- 1 / 2}) .

(8)

This will hold under weak conditions, given that we have estimators $Q_{n, B_{n}}^{*}$ , $G_{n, B_{n}}$ that converge at specified rates to their true counterparts and that, for each split B_n, conditional on the training sample, the empirical process is indexed by a finite dimensional (i.e., of the dimension of ε) class of functions.

Section 7: Returning to our example

In Section 7 we return to our example to present a formal Theorem 2 with specified conditions, involving an application of our general efficiency Theorem 1 in Section 6.

Appendix: Various technical results are presented in the Appendix.

3 Statistical formulation of the estimation problem

Let O₁, …, O_n be n independent and identically distributed copies of a d-dimensional random variable O with probability distribution P₀ that is known to be an element of a statistical model $M$ . Let $Ψ : M \to ℝ$ be a one-dimensional target parameter, so that ψ₀ = Ψ(P₀) is the estimand of interest we aim to learn from the n observations o₁, …, o_n. We assume that Ψ is pathwise differentiable at any $P \in M$ with canonical gradient D^*(P): for a specified rich class of one-dimensional submodels $\{P_{ε} : ε \in (- δ, δ)\} \subset M$ through P at ε = 0 and score $S = \frac{d}{d ε} \log {d P_{ε} / d P |}_{ε = 0}$ , we have

{\frac{d}{d ε} Ψ (P_{ε}) |}_{ε = 0} = P D^{*} (P) S \equiv \int_{o} D^{*} (P) (o) S (o) d P (o) .

Our goal in this article is to construct a substitution estimator (i.e., a TMLE $Ψ (P_{n}^{*})$ for a targeted estimator $P_{n}^{*}$ of P₀) that is asymptotically efficient under minimal conditions.

Relevant nuisance parameters Q, G and their loss functions

Let Q(P) be a nuisance parameter of P so that Ψ(P) = Ψ₁(Q(P)) for some Ψ₁, so that Ψ(P) only depends on P through Q(P). Let $Q = Q (M) = \{Q (P) : P \in M\}$ be the parameter space of this parameter $Q : M \to Q$ . Suppose that Q(P) = (Q_j(P) : j = 1, …, k₁ + 1) has k₁ + 1 components, and $Q_{j} : M \to Q_{j}$ are variation independent parameters, j = 1, …, k₁ + 1. Let $Q_{j} = Q_{j} (M)$ be the parameter space of Q_j. Thus, the parameter space of Q is a cartesian product $Q = \prod_{j = 1}^{k_{1} + 1} Q_{j}$ . In addition, suppose that for j = 1, …, k₁ + 1, $Q_{j} (P_{0}) = \arg \min_{Q_{j} \in Q_{j}} P_{0} L_{1 j} (Q_{j})$ for specified loss functions $(O, Q_{j}) \mapsto L_{1 j} (Q_{j}) (O)$ . Let $\bar{Q} = (Q_{1}, \dots, Q_{k_{1}})$ represent parameters that require data adaptive estimation trading off variance and bias (e.g., densities), while $Q_{k_{1} + 1}$ represents an easy to estimate parameter for which we have an empirical estimator ${\hat{Q}}_{k_{1} + 1}$ available with negligible bias. In our treatment specific mean example above $Q = (Q_{1} = \bar{Q}, Q_{2})$ , where the easy to estimate parameter Q₂ was the probability distribution of W which is naturally estimated with the empirical probability distribution. The parameter $\bar{Q} (P_{0})$ will be estimated with our proposed loss-based HAL-super-learner. In the special case that each of the components of Q requires a super-learner type-estimator, we define $Q_{k_{1} + 1}$ as empty (or equivalently, a known value), and in that case $Q = \bar{Q}$ . We define corresponding loss-based dissimilarities $d_{10 j} (Q_{j}, Q_{j 0}) = P_{0} L_{1 j} (Q_{j}) - P_{0} L_{1 j} (Q_{j 0})$ , j = 1, …, k₁ + 1. We assume that $d_{10 (k_{1} + 1)} ({\hat{Q}}_{k_{1} + 1} (P_{n}), Q_{(k_{1} + 1) 0}) = O_{P} (r_{Q, k_{1} + 1} (n))$ for a known rate of convergence $r_{Q, k_{1} + 1} (n)$ . Let

d_{10} (Q, Q_{0}) = (d_{10 j} (Q_{j}, Q_{j 0}) : j = 1, \dots, k_{1} + 1)

(9)

be the collection of these k₁ + 1 loss-based dissimilarities. We use the notation $d_{10} (\bar{Q}, {\bar{Q}}_{0}) = (d_{10 j} (Q_{j}, Q_{j 0}) : j = 1, \dots k_{1})$ for the vector of k₁ loss-based dissimilarities for $\bar{Q}$ .

Suppose that D^*(P) only depends on P through Q(P) and an additional nuisance parameter G(P). In the special case that D^*(P) only depends on P through Q(P), we define G as empty (or equivalently, as a known value). Let $G = (G_{1}, \dots, G_{k_{2} + 1})$ be a collection of k₂ + 1 variation independent parameters for some integer k₂ + 1 ≥ 1. Thus the parameter space of G is a cartesian product $G = \prod_{j = 1}^{k_{2} + 1} G_{j}$ , where $G_{j}$ is the parameter space of $G_{j} : M \to G_{j}$ . Let $G_{j 0} = \arg \min_{G_{j} \in G_{j}} P_{0} L_{2 j} (G_{j})$ for a loss function $(O, G_{j}) \mapsto L_{2 j} (G_{j}) (O)$ , and let $d_{20 j} (G_{j}, G_{j 0}) = P_{0} L_{2 j} (G_{j}) - P_{0} L_{2 j} (G_{j 0})$ be the corresponding loss-based dissimilarity, j = 1, …, k₂ + 1. Let $G_{k_{2} + 1}$ represent an easy to estimate parameter for which we have a well behaved and understood estimator ${\hat{G}}_{k_{2} + 1}$ available. The parameter $\bar{G} (P_{0})$ will be estimated with our proposed HAL-super-learner. We assume that $d_{20 (k_{2} + 1)} ({\hat{G}}_{k_{2} + 1} (P_{n}), G_{(k_{2} + 1) 0}) = O_{P} (r_{G, k_{2} + 1} (n))$ for a known rate of convergence $r_{G, k_{2} + 1} (n)$ . As above, let $d_{20} (G, G_{0}) = (d_{20 j} (G_{j}, G_{j 0}) : j = 1, \dots, k_{2} + 1)$ be the collection of these loss-based dissimilarities, and let $d_{20} (\bar{G}, {\bar{G}}_{0}) = (d_{20 j} (G_{j}, G_{j 0}) : j = 1, \dots, k_{2})$ , where $\bar{G} = (G_{1}, \dots, G_{k_{2}})$ . In the special case that each $G_{j}$ requires a super-learner based estimator, we define $G_{k_{2} + 1}$ as empty, and $G = \bar{G}$ .

We also define

d_{0} ((Q, G), (Q_{0}, G_{0})) = (d_{10 j_{1}} (Q_{j_{1}}, Q_{j_{1} 0}), d_{20 j_{2}} (G_{j_{2}}, G_{j_{2} 0}) : j_{1}, j_{2})

(10)

as the vector of k₁ + k₂ + 2 loss-based dissimilarities. We will also use the short-hand notation d₀(P, P₀) for d₀((Q, G), (Q₀, G₀)).

We define

L_{1} (Q) = (L_{1 j} (Q_{j}) : j = 1, \dots, k_{1} + 1)

(11)

as the vector of k₁ + 1-loss functions for $Q = (Q_{1}, \dots, Q_{k_{1} + 1})$ , and similarly we define

L_{2} (G) = (L_{2 j} (G_{j}) : j = 1, \dots, k_{2} + 1) .

(12)

We will also use the notation $L_{1} (\bar{Q}) = (L_{1 j} (Q_{j}) : j = 1, \dots, k_{1})$ and $L_{2} (\bar{G}) = (L_{2 j} (G_{j}) : j = 1, \dots, k_{2})$ . We will assume that $\bar{Q} \mapsto L_{1} (\bar{Q})$ is a convex function in the sense that, for any ${\bar{Q}}_{1} = (Q_{j 1} : j = 1, \dots, k_{1}), \dots, {\bar{Q}}_{m} = (Q_{j m} : j = 1, \dots, k_{1})$ , for each j = 1, …, k₁

P_{0} L_{1 j} (\sum_{k = 1}^{m} α_{k} Q_{j k}) \leq \sum_{k = 1}^{m} α_{k} P_{0} L_{1 j} (Q_{j k})

(13)

when Σ_k α_k = 1 and min_k α_k ≥ 0. Similarly, we assume $\bar{G} \mapsto L_{2} (\bar{G})$ is a convex function. Our results for the TMLE generalize to non-convex loss functions, but the convexity of the loss functions allows a nicer representation for the super-learner oracle inequality, and in most applications a natural convex loss function is available.

We will abuse notation by also denoting Ψ(P) and D^*(P) with Ψ(Q) and D^*(Q, G), respectively. A special case is that D^*(P) = D^*(Q(P)) does not depend on an additional nuisance parameter G. For example, if O ∈ ℝ, $M$ is nonparametric, and Ψ(P) = ∫p(o)²do is the integral of the square of the Lebesgue density p of P, then the canonical gradient is given by D^*(P) = 2p − 2Ψ(P), which has mean zero under P since ∫p(o)²do = Ψ(P), so that one would define Q(P) = p, and there is no G.

Second order remainder for target parameter

We define the second order remainder R₂(P, P₀) as follows:

R_{2} (P, P_{0}) \equiv Ψ (P) - Ψ (P_{0}) + P_{0} D^{*} (P) .

(14)

We will also denote R₂(P, P₀) with R₂₀((Q, G), (Q₀, G₀)) to indicate that it involves differences between Q and Q₀ and G and G₀, beyond possibly some additional dependence on P₀. In our experience, this remainder R₂(P, P₀) can be represented as a sum of terms of the type ∫(H₁(P) − H₁(P₀))(H₂(P) − H₂(P₀))f(P, P₀)dP₀(o) for some functionals H₁, H₂ and f, where, typically, H₁(P) and H₂(P) represent functions of Q(P) or G(P). In certain classes of problems we have that R₂(P, P₀) only involves cross-terms of the type ∫(H₁(Q) − H₁(Q₀))(H₂(G) − H₂(G₀))f(P, P₀)dP₀, so that R₂₀((Q, G), (Q₀, G₀)) = 0 if either Q = Q₀ or G = G₀. In these cases, we say that the efficient influence curve is double robust w.r.t. misspecification of Q₀ and G₀:

P_{0} D^{*} (P) = Ψ (P_{0}) - Ψ (P) \text{ if } G (P) = G (P_{0}) \text{ or } Q (P) = Q (P_{0}) .

Given the above double robustness property of the canonical gradient (i.e., of the target parameter), if P solves P₀D^*(P) = 0, and either G(P) = G₀ or Q(P) = Q₀, then Ψ(P) = Ψ(P₀). This allows for the construction of so-called double robust estimators of ψ₀ that will be consistent if either the estimator of Q₀ is consistent or the estimator of G₀ is consistent.

Support of data distribution

The support of $P \in M$ is defined as a set $O_{P} \subset ℝ^{d}$ so that $P (O_{P}) = 1$ . It is assumed that for each $P \in M$ , $O_{P} \subset [0, τ_{P}]$ for some finite $τ_{P} \in ℝ_{> 0}^{d}$ . We define

τ = \sup_{P \in M} τ_{P},

(15)

so that [0, τ_P] ⊂ [0, τ] for all $P \in M$ , where τ = ∞ is allowed, in which case $[0, τ] \equiv ℝ_{\geq 0}^{d}$ . That is, [0, τ] is an upper bound of all the supports, and the model $M$ states that the support of the data structure O is known to be contained in [0, τ].

Cadlag functions on [0, τ], supremum norm and variation norm

Suppose τ is finite; if τ is not finite, we will apply the definitions below to a finite τ = τ_n that converges to τ. Let $D [0, τ]$ be the Banach space of d-variate real valued cadlag functions (right-continuous with left-hand limits) [17]. For an $f \in D [0, τ]$ , let ${‖ f ‖}_{\infty} = \sup_{x \in [0, τ]} | f (x) |$ be the supremum norm. For an $f \in D [0, τ]$ , we define the variation norm of f [20] as

{‖ f ‖}_{ν} = | f (0) | + \sum_{s \subset \{1, \dots, d\}} \int_{(0_{s}, τ_{s}]} | f (d x_{s}, 0_{- s}) | .

(16)

For a subset s ⊂ {1, …, d}, $x_{s} = (x_{j} : j \in s)$ , $x_{- s} = (x_{j} : j \notin s)$ , and the sum over s in the above definition of the variation norm is over all subsets of {1, …, d}. In addition, $x_{s} \mapsto f (x_{s}, 0_{- s})$ is the s-specific section of $x \mapsto f (x)$ that sets the coordinates in the complement of s equal to 0. Note that ‖ f ‖_ν is the sum of variation norms of s-specific sections of f (including f itself). Therefore, one might refer to this norm as the sectional variation norm, but, for convenience, for the purpose of this article, we will just refer to it as variation norm. If ‖ f ‖_ν < ∞, then we can, in fact, represent f as follows [20]:

f (x) = f (0) + \sum_{s \subset \{1, \dots, d\}} \int_{(0_{s}, x_{s}]} f (d u_{s}, 0_{- s}),

(17)

where $f (d u_{s}, 0_{- s})$ is the measure generated by the cadlag function $u_{s} \mapsto f (u_{s}, 0_{- s})$ . For an $M \in ℝ_{\geq 0}$ , let

F_{ν, M} = \{f \in D [0, τ] : {‖ f ‖}_{ν} < M\}

denote the set of cadlag functions f : [0, τ] → ℝ with variation norm bounded by M.
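As an illustration of definition (16), the following minimal sketch computes the variation norm of a function that is piecewise constant on a rectangular grid. It rests on assumptions that are not part of the paper: f is supplied as a d-dimensional NumPy array of its values at the grid points, the first grid point along each axis is the origin, and f is right-continuous and constant between grid points, so that each section generates a discrete measure whose point masses are iterated first differences. The function name `sectional_variation_norm` is hypothetical.

```python
import itertools

import numpy as np


def sectional_variation_norm(values):
    """Variation norm (16) of a step function tabulated on a grid.

    values: d-dimensional array; values[i_1, ..., i_d] is f at the grid
    point with those indices, and index 0 along every axis is coordinate 0.
    """
    values = np.asarray(values, dtype=float)
    d = values.ndim
    total = abs(values[(0,) * d])  # |f(0)|
    for size in range(1, d + 1):
        for s in itertools.combinations(range(d), size):
            # s-specific section: coordinates outside s are fixed at 0.
            index = tuple(slice(None) if j in s else 0 for j in range(d))
            section = values[index]
            # Point masses of the measure generated by the section are
            # first differences taken along every remaining axis.
            for axis in range(section.ndim):
                section = np.diff(section, axis=axis)
            total += np.abs(section).sum()
    return float(total)


# f(x1, x2) = x1 * x2 tabulated on the grid {0, 1, 2} x {0, 1, 2}.
grid = np.arange(3)
product = np.multiply.outer(grid, grid)
print(sectional_variation_norm(product))
```

For this product function the one-dimensional sections through the origin vanish, so the norm comes entirely from the two-dimensional section.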

Cartesian product of cadlag function spaces, and its component-wise operations

Let D^k[0, τ] be the product Banach space of k-dimensional (f₁, …, f_k) where each $f_{j} \in D [0, τ]$ , j = 1, …, k. If f ∈ D^k[0, τ], then we define ‖ f ‖_∞ = (‖ f_j ‖_∞ : j = 1, …, k) as a vector whose j-th component equals the supremum norm of the j-th component f_j of f. Similarly we define a variation norm of f ∈ D^k[0, τ] as a vector

{‖ f ‖}_{ν} = ({‖ f_{j} ‖}_{ν} : j = 1, \dots, k)

of variation norms. If f ∈ D^k[0, τ], then ${‖ f ‖}_{P_{0}} = ({‖ f_{j} ‖}_{P_{0}} : j = 1, \dots, k)$ is a vector whose components are the L²(P₀)-norms of the components of f. Generally speaking, in this paper any operation on a function f ∈ D^k[0, τ], such as taking a norm ${‖ f ‖}_{P_{0}}$ or an expectation P₀f, and any operation on a pair of functions f, g ∈ D^k[0, τ], such as f/g, f × g, max(f, g) or an inequality f < g, is carried out component wise: for example, max(f, g) = (max(f_j, g_j) : j = 1, …, k) and $\inf_{Q \in Q} P_{0} L_{1} (Q) = (\inf_{Q_{j} \in Q_{j}} P_{0} L_{1 j} (Q_{j}) : j = 1, \dots, k_{1} + 1)$ . In a similar manner, for an $M \in ℝ_{> 0}^{k}$ , let $F_{ν, M} = \prod_{j = 1}^{k} F_{ν, M_{j}}$ denote the cartesian product. This general notation allows us to present results with minimal notation, avoiding the need to continually enumerate all the components.

Our results will hold for general models and pathwise differentiable target parameters, as long as the statistical model satisfies the following key smoothness assumption:

Assumption 1. (Smoothness Assumption)

For each $P \in M$ , $\bar{Q} = \bar{Q} (P) \in D^{k_{1}} [0, τ]$ , $\bar{G} = \bar{G} (P) \in D^{k_{2}} [0, τ]$ , $D^{*} (P) = D^{*} (Q, G) \in D [0, τ]$ , $L_{1} (\bar{Q}) \in D^{k_{1}} [0, τ]$ , $L_{2} (\bar{G}) \in D^{k_{2}} [0, τ]$ , and $\bar{Q}$ , $\bar{G}$ , D^*(P), $L_{1} (\bar{Q})$ , $L_{2} (\bar{G})$ have a finite supremum and variation norm.

Definition of bounds on the statistical model

The properties of the super-learner and TMLE rely on bounds on the model $M$ . Our estimators will also allow for unbounded models by using a sieve of models for which its finite bounds slowly approximate the actual model bound as sample size converges to infinity. These bounds will be defined now:

τ = τ (M) = \sup_{P \in M} τ (P),
M_{1 Q} = M_{1 Q} (M) = \sup_{Q, Q_{0} \in Q} {‖ L_{1} (\bar{Q}) - L_{1} ({\bar{Q}}_{0}) ‖}_{\infty},
M_{2 Q} = M_{2 Q} (M) = \sup_{P, P_{0} \in M} \frac{{‖ L_{1} (\bar{Q}) - L_{1} ({\bar{Q}}_{0}) ‖}_{P_{0}}}{{d_{10} (\bar{Q}, {\bar{Q}}_{0})}^{1 / 2}},
M_{1 G} = M_{1 G} (M) = \sup_{G, G_{0} \in G} {‖ L_{2} (\bar{G}) - L_{2} ({\bar{G}}_{0}) ‖}_{\infty},
M_{2 G} = M_{2 G} (M) = \sup_{P, P_{0} \in M} \frac{{‖ L_{2} (\bar{G}) - L_{2} ({\bar{G}}_{0}) ‖}_{P_{0}}}{{d_{20} (\bar{G}, {\bar{G}}_{0})}^{1 / 2}},
M_{D^{*}} = M_{D^{*}} (M) = \sup_{P \in M} {‖ D^{*} (P) ‖}_{\infty} .

(18)

Note that $M_{1 Q}, M_{2 Q} \in ℝ_{\geq 0}^{k_{1}}$ and $M_{1 G}, M_{2 G} \in ℝ_{\geq 0}^{k_{2}}$ are defined as vectors of constants, a constant for each component of $\bar{Q}$ and $\bar{G}$ , respectively. The bounds $M_{1 Q}$ , $M_{2 Q}$ guarantee excellent properties of the cross-validation selector based on the loss-function $L_{1} (\bar{Q})$ (e.g., [11, 13]). A bound on $M_{2 Q}$ shows that the loss-based dissimilarity $d_{10} (\bar{Q}, {\bar{Q}}_{0})$ behaves as a square of a difference between $\bar{Q}$ and ${\bar{Q}}_{0}$ . Similarly, the bounds $M_{1 G}$ , $M_{2 G}$ control the behavior of the cross-validation selector based on the loss function $L_{2} (\bar{G})$ .

Bounded and Unbounded Models

We will call the model $M$ bounded if τ < ∞ (i.e., universally bounded support) and $M_{1 Q}$ , $M_{2 Q}$ , $M_{1 G}$ , $M_{2 G}$ , $M_{D^{*}}$ are finite. In essence, a bounded model is a model for which the support and the supremum norm of $\bar{Q} (P)$ , $\bar{G} (P)$ , $L_{1} (\bar{Q})$ , $L_{2} (\bar{G})$ and D^*(Q, G) are uniformly (over the model) bounded. Any model that is not bounded will be called an unbounded model.

Sequence of bounded submodels approximating the unbounded model

For an unbounded model $M$ , our initial estimators $({\bar{Q}}_{n}, {\bar{G}}_{n})$ of $({\bar{Q}}_{0}, {\bar{G}}_{0})$ are defined in terms of a sequence of bounded submodels $M_{n} \subset M$ that are increasing in n and approximate the actual model $M$ as n converges to infinity. The counterparts of the above defined universal bounds on $M$ applied to $M_{n}$ are denoted with $τ_{n}$ , $M_{1 Q, n}$ , $M_{2 Q, n}$ , $M_{1 G, n}$ , $M_{2 G, n}$ , $M_{D^{*}, n}$ . The conditions of our general asymptotic efficiency Theorem 1 will enforce that these bounds converge slowly enough to infinity (in the case the corresponding true model bound is infinity). This model $M_{n}$ could be defined as the largest subset of $M$ for which these latter bounds apply. By Assumption 1, with this choice of definition of $M_{n}$ , for any $P_{0} \in M$ , there exists an N₀ = N(P₀) so that $P_{0} \in M_{n}$ for n > N₀. Either way, we assume that $M_{n}$ is defined such that the latter is true.

Let $Q_{n} = Q (M_{n})$ and $G_{n} = G (M_{n})$ be the parameter spaces of Q and G under model $M_{n}$ , and let ${\bar{Q}}_{n} = \bar{Q} (M_{n})$ and ${\bar{G}}_{n} = \bar{G} (M_{n})$ be the parameter spaces of $\bar{Q}$ and $\bar{G}$ . We define the following true parameters corresponding with this model $M_{n}$ :

{\bar{Q}}_{0 n} = \arg \min_{\bar{Q} \in {\bar{Q}}_{n}} P_{0} L_{1} (\bar{Q}),
{\bar{G}}_{0 n} = \arg \min_{\bar{G} \in {\bar{G}}_{n}} P_{0} L_{2} (\bar{G}) .

We will assume that $M_{n}$ is chosen so that $Q_{k_{1} + 1} (P_{0 n}) = Q_{k_{1} + 1} (P_{0})$ and $G_{k_{2} + 1} (P_{0 n}) = G_{k_{2} + 1} (P_{0})$ , where $P_{0 n} = \arg \max_{P \in M_{n}} P_{0} \log \frac{d P}{d P_{0}}$ . That is, our sieve is not affecting the estimation of the “easy” nuisance parameters $Q_{(k_{1} + 1) 0}$ and $G_{(k_{2} + 1) 0}$ . Note that for n > N₀, we have $Q_{0 n} = Q_{0}$ and $G_{0 n} = G_{0}$ .

In this paper our initial estimators of ${\bar{Q}}_{0}$ and ${\bar{G}}_{0}$ are always enforced to be in the parameter spaces of this sequence of models $M_{n}$ , but if the model $M$ is already bounded, then one can set $M_{n} = M$ for all n. However, even for bounded models $M$ , the utilization of a sequence of submodels $M_{n}$ with stronger universal bounds than $M$ could result in finite sample improvements (e.g., if the universal bounds on $M$ are very large relative to sample size and the dimension of the data).

4 Highly adaptive Lasso estimator of Nuisance parameters

Let M₁ < ∞ be given. Our M₁-specific HAL-estimator of ${\bar{Q}}_{0}$ is defined as the minimizer of the empirical risk $P_{n} L_{1} (\bar{Q})$ over $\bar{Q} \in {\bar{Q}}_{n}$ for which $L_{1} (\bar{Q})$ has a variation norm bounded by M₁ (see eq. (21)). The rate of convergence of a minimum empirical risk estimator is driven by the rate of convergence of the covering number of the parameter space over which one minimizes (e.g., [19]). This explains why the rate of convergence of the covering number of this set of functions $L_{1} (\bar{Q})$ defines a minimal rate of convergence for this HAL-estimator (while M₁ will be selected with the cross-validation selector). Similarly, this applies to our HAL-estimator of ${\bar{G}}_{0}$ . In the next subsection we define the relevant covering numbers and their rates α₁, α₂, and establish an upper bound on them. Subsequently, we establish in Lemma 1 the minimal rate of convergence of the HAL-estimator in terms of these rates α₁, α₂.

4.1 Upper bounding the entropy of the parameter space for the HAL-estimator

We remind the reader that a covering number $N (ε, F, L^{2} (Λ))$ is defined as the minimal number of balls of size ε w.r.t. L²(Λ)-norm that are needed to cover the set $F$ of functions embedded in L²(Λ). Let $α_{1} \in ℝ_{\geq 0}^{k_{1}}$ and $α_{2} \in ℝ_{\geq 0}^{k_{2}}$ be such that for fixed M₁, M₂

\sup_{Λ} \log^{1 / 2} N (ε, L_{1} ({\bar{Q}}_{n, M_{1}}), L^{2} (Λ)) = O (ε^{- (1 - α_{1})}),
\sup_{Λ} \log^{1 / 2} N (ε, L_{2} ({\bar{G}}_{n, M_{2}}), L^{2} (Λ)) = O (ε^{- (1 - α_{2})}),

(19)

where $L_{1} ({\bar{Q}}_{n, M_{1}}) = \{L_{1} (\bar{Q}) : \bar{Q} \in {\bar{Q}}_{n, M_{1}}\}$ , $L_{2} ({\bar{G}}_{n, M_{2}}) = \{L_{2} (\bar{G}) : \bar{G} \in {\bar{G}}_{n, M_{2}}\}$ , and

{\bar{Q}}_{n, M_{1}} \equiv \{\bar{Q} \in {\bar{Q}}_{n} : {‖ L_{1} (\bar{Q}) ‖}_{ν} < M_{1}\},
{\bar{G}}_{n, M_{2}} \equiv \{\bar{G} \in {\bar{G}}_{n} : {‖ L_{2} (\bar{G}) ‖}_{ν} < M_{2}\} .

(20)

The minimal rates of convergence of our HAL-estimator of ${\bar{Q}}_{0}$ and ${\bar{G}}_{0}$ are defined in terms of α₁ and α₂, respectively.

By eq. (17) it follows that any cadlag function with finite variation norm can be represented as a difference of two bounded monotone increasing functions (i.e., cumulative distribution functions). The class of d-variate monotone increasing/cumulative distribution functions is a convex hull of d-variate indicator functions, which is again concretely implied by the representation eq. (17) by noting that $\int_{0}^{x} d f (u) = \int I (u \leq x) d f (u)$ . Thus, $F_{ν, M}$ consists of differences of elements of two convex hulls of d-variate indicator functions. By Theorem 2.6.9 in [19], which maps the covering number of a set of functions into a covering number of the convex hull of these functions, for a fixed M < ∞, we have that the universal covering number of $F_{ν, M}$ is bounded as follows:

\sup_{Λ} \log^{1 / 2} N (ε, F_{ν, M}, L^{2} (Λ)) = O (ε^{- (1 - α (d))}),

where α(d) = 2/(d + 2). Let $d_{1} \in ℕ_{> 0}^{k_{1}}$ be the vector of integers indicating the dimension of the domain of $\bar{Q} = (Q_{1}, \dots, Q_{k_{1}})$ , and similarly, let $d_{2} \in ℕ_{> 0}^{k_{2}}$ be the vector of integers indicating the dimension of the domain of $\bar{G} = (G_{1}, \dots, G_{k_{2}})$ . Since $L_{1} ({\bar{Q}}_{n, M_{1}}) \subset F_{ν, M_{1}}$ with d = d₁, $L_{2} ({\bar{G}}_{n, M_{2}}) \subset F_{ν, M_{2}}$ with d = d₂, we have that α₁ ≥ α(d₁) and α₂ ≥ α(d₂).
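The indicator representation used in this argument can be made concrete with a small sketch. It builds the design matrix of d-variate indicator functions $x \mapsto I (x_{s} \geq u_{s})$ , one column per nonempty subset s and knot point u. Placing the knots at a supplied set of points (for instance the observations) is an assumption made here for illustration, and the function names `indicator_basis` and `l1_norm` are hypothetical. Under such a representation the L₁-norm of the coefficient vector plays the role of the variation norm, as described in the abstract.

```python
import itertools

import numpy as np


def indicator_basis(knots, x_eval):
    """Design matrix of indicators I(x_s >= knot_s).

    knots:  array of shape (n_knots, d), assumed knot points.
    x_eval: array of shape (m, d), points at which to evaluate the basis.
    Returns an array of shape (m, n_knots * (2**d - 1)); the intercept
    column is not included.
    """
    knots = np.asarray(knots, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    d = knots.shape[1]
    blocks = []
    for size in range(1, d + 1):
        for s in itertools.combinations(range(d), size):
            cols = list(s)
            xs = x_eval[:, cols][:, None, :]  # shape (m, 1, |s|)
            us = knots[:, cols][None, :, :]   # shape (1, n_knots, |s|)
            # Entry (i, k) is 1 when x_eval[i] dominates knot k on s.
            blocks.append(np.all(xs >= us, axis=2).astype(float))
    return np.hstack(blocks)


def l1_norm(intercept, coefficients):
    """L1 norm of the coefficients, including the intercept f(0)."""
    return abs(intercept) + float(np.abs(coefficients).sum())


knots = np.array([[0.2], [0.5]])
points = np.array([[0.1], [0.3], [0.9]])
print(indicator_basis(knots, points))
```

A fit of the form intercept plus basis times coefficients is then a cadlag step function, and constraining `l1_norm` bounds its variation norm.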

4.2 Minimal rate of convergence of the HAL-estimator

Lemma 1 below proves that the minimal rates $r_{Q, 1 : k_{1}} (n) \in ℝ^{k_{1}}$ and $r_{G, 1 : k_{2}} (n) \in ℝ^{k_{2}}$ of our HAL-estimator of ${\bar{Q}}_{0}$ and ${\bar{G}}_{0}$ w.r.t. the loss-based dissimilarities d₀₁(Q, Q₀) and d₀₂(G, G₀) are given by:

r_{\bar{Q}} (n) = r_{Q, 1 : k_{1}} (n) = n^{- (1 / 2 + α_{1} / 4)},
r_{\bar{G}} (n) = r_{G, 1 : k_{2}} (n) = n^{- (1 / 2 + α_{2} / 4)} .

Let $r_{Q, k_{1} + 1}$ and $r_{G, k_{2} + 1}$ be the rates of the simple estimators ${\hat{Q}}_{k_{1} + 1}$ and ${\hat{G}}_{k_{2} + 1}$ of $Q_{(k_{1} + 1) 0}$ and $G_{(k_{2} + 1) 0}$ , respectively. This defines $r_{Q} (n) \in ℝ^{k_{1} + 1}$ and $r_{G} (n) \in ℝ^{k_{2} + 1}$ .

Lemma 1

For a given vector $M \in ℝ_{\geq 0}^{k_{1}}$ of constants, let ${\bar{Q}}_{n, M} \subset {\bar{Q} \in {\bar{Q}}_{n} : {‖ L_{1} (\bar{Q}) ‖}_{ν} \leq M} \subset F_{ν, M}$ be the set of all functions in the parameter space ${\bar{Q}}_{n}$ for ${\bar{Q}}_{0 n}$ for which the variation norm of its loss is smaller than M < ∞. (In this definition one can also incorporate some extra M-constraints, as long as ${\bar{Q}}_{n, M = \infty} = {\bar{Q}}_{n}$ .) Let ${\bar{Q}}_{0 n}^{M} \in {\bar{Q}}_{n, M}$ be so that $P_{0} L_{1} ({\bar{Q}}_{0 n}^{M}) = \inf_{\bar{Q} \in {\bar{Q}}_{n, M}} P_{0} L_{1} (\bar{Q})$ . Assume that for a fixed M < ∞,

M_{2 Q, M} \equiv \lim \sup_{n \to \infty} \sup_{\bar{Q} \in {\bar{Q}}_{n, M}} \frac{{‖ L_{1} (\bar{Q}) - L_{1} ({\bar{Q}}_{0 n}^{M}) ‖}_{P_{0}}}{{d_{10} (\bar{Q}, {\bar{Q}}_{0 n}^{M})}^{1 / 2}} < \infty .

Consider an estimator ${\bar{Q}}_{n}^{M}$ for which

P_{n} L_{1} ({\bar{Q}}_{n}^{M}) = \inf_{\bar{Q} \in {\bar{Q}}_{n, M}} P_{n} L_{1} (\bar{Q}) + r_{n},

(21)

where $r_{n} = o_{P} (n^{- 1 / 2})$ . Then

0 \leq d_{01} ({\bar{Q}}_{n}^{M}, {\bar{Q}}_{0 n}^{M}) \leq - (P_{n} - P_{0}) \{L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M})\} + r_{n},

(22)

and

d_{01} ({\bar{Q}}_{n}^{M}, {\bar{Q}}_{0 n}^{M}) = O_{P} (r_{\bar{Q}} (n)) + r_{n} .

Proof

We have

0 \leq d_{01} ({\bar{Q}}_{n}^{M}, {\bar{Q}}_{0 n}^{M}) = P_{0} \{L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M})\}
= - (P_{n} - P_{0}) \{L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M})\} + P_{n} \{L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M})\}
\leq - (P_{n} - P_{0}) \{L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M})\} + r_{n},

which proves eq. (22). Since $L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M})$ falls in a P₀-Donsker class $F_{ν, M}$ , it follows that the right-hand side is $O_{P} (n^{- 1 / 2})$ , and thus $d_{01} ({\bar{Q}}_{n}^{M}, {\bar{Q}}_{0 n}^{M}) = O_{P} (n^{- 1 / 2})$ . Since $M_{2 Q, M} < \infty$ , this also implies that ${‖ L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M}) ‖}_{P_{0}}^{2} = O_{P} (n^{- 1 / 2})$ . By empirical process theory we have that $n^{1 / 2} (P_{n} - P_{0}) f_{n} \to_{p} 0$ if f_n falls in a P₀-Donsker class with probability tending to 1, and $P_{0} f_{n}^{2} \to_{p} 0$ as n → ∞. Applying this to $f_{n} = L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M})$ shows that $(P_{n} - P_{0}) (L_{1} ({\bar{Q}}_{n}^{M}) - L_{1} ({\bar{Q}}_{0 n}^{M})) = o_{P} (n^{- 1 / 2})$ , which proves $d_{01} ({\bar{Q}}_{n}^{M}, {\bar{Q}}_{0 n}^{M}) = o_{P} (n^{- 1 / 2})$ .

We now apply Lemma 7 with $F_{n} = \{L_{1} (\bar{Q}) - L_{1} ({\bar{Q}}_{0 n}^{M}) : \bar{Q} \in {\bar{Q}}_{n, M}\}$ , α = α₁ (see eq. (19)), envelope bound $M_{n} = M$ and $r_{0} (n) = n^{- 1 / 4}$ , which proves that

| n^{1 / 2} (P_{n} - P_{0}) f_{n} | = O_{P} (n^{- α_{1} / 4}) .

This proves $d_{01} ({\bar{Q}}_{n}^{M}, {\bar{Q}}_{0 n}^{M}) = O_{P} (n^{- (1 / 2 + α_{1} / 4)}) + r_{n}$ . □
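To give a feel for the exponent in the last display of the proof, the sketch below evaluates α(d) = 2/(d + 2) from Section 4.1 and the resulting exponent 1/2 + α(d)/4 for a few dimensions d. Because α₁ ≥ α(d₁), plugging in α(d) gives a worst-case exponent. The function names are hypothetical and exact fractions are used only for readability.

```python
from fractions import Fraction


def entropy_alpha(d):
    """alpha(d) = 2 / (d + 2), the entropy rate bound of Section 4.1."""
    return Fraction(2, d + 2)


def dissimilarity_exponent(d):
    """Exponent e in O_P(n^{-e}) obtained by setting alpha_1 = alpha(d)."""
    return Fraction(1, 2) + entropy_alpha(d) / 4


for d in (1, 2, 5, 10):
    print(d, entropy_alpha(d), dissimilarity_exponent(d))
```

The exponent stays strictly above 1/2 for every finite d, and it approaches 1/2 as d grows.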

5 Super-learning: HAL-estimator tuning the variation norm of the fit with cross-validation

Defining the library of candidate estimators

For an $M \in ℝ_{> 0}^{k_{1}}$ , let ${\hat{\bar{Q}}}_{M} : M_{nonp} \to {\bar{Q}}_{n, M} \subset F_{ν, M}$ be the HAL-estimator eq. (21) and let ${\bar{Q}}_{n, M} = {\hat{\bar{Q}}}_{M} (P_{n})$ . By Lemma 1 we have $d_{01} ({\bar{Q}}_{n, M} = {\hat{\bar{Q}}}_{M} (P_{n}), {\bar{Q}}_{0 n}^{M}) = O_{P} (r_{\bar{Q}}^{2} (n))$ , assuming that the numerical approximation error r_n is of smaller order. Let $K_{1, n, ν}$ be an ordered collection $M_{1}^{n} < M_{2}^{n} < \dots < M_{K_{1, n, ν}}^{n}$ of k₁-dimensional constants, and consider the corresponding collection of $K_{1, n, ν}$ candidate estimators ${\hat{\bar{Q}}}_{M}$ with $M \in K_{1, n, ν}$ . We impose that this index set $K_{1, n, ν}$ is increasing in n such that $\lim \sup_{n \to \infty} M_{K_{1, n, ν}}^{n}$ equals $\sup_{P \in M} {‖ L_{1} (\bar{Q} (P)) ‖}_{ν}$ , so that for any $P \in M$ , there exists an N(P) so that for n > N(P), we will have that $M_{K_{1, n, ν}}^{n} > {‖ L_{1} (\bar{Q} (P)) ‖}_{ν}$ . Note that for all $M \in K_{1, n, ν}$ with $M > {‖ L_{1} ({\bar{Q}}_{0}) ‖}_{ν}$ , we have that $d_{01} ({\hat{\bar{Q}}}_{M} (P_{n}), {\bar{Q}}_{0}) = O_{P} (r_{\bar{Q}}^{2} (n))$ . In addition, let ${\hat{\bar{Q}}}_{j} : M_{nonp} \to Q_{n}, j \in K_{1, n, a}$ be an additional collection of $K_{1, n, a}$ estimators of ${\bar{Q}}_{0}$ . For example, these candidate estimators could include a variety of parametric models as well as machine learning based estimators. This defines an index set $K_{1, n} = K_{1, n, ν} \cup K_{1, n, a}$ representing a collection of $K_{1 n} = K_{1, n, ν} + K_{1, n, a}$ candidate estimators $\{{\hat{\bar{Q}}}_{k} : k \in K_{1 n}\}$ .

Super Learner

Let B_n ∈ {0, 1}ⁿ denote a random cross-validation scheme that randomly splits the sample {O₁, …, O_n} into a training sample {O_i : B_n(i) = 0} and a validation sample {O_i : B_n(i) = 1}. Let $q_{n} = \sum_{i = 1}^{n} B_{n} (i) / n$ denote the proportion of observations in the validation sample. We impose throughout the article that q < q_n ≤ 1/2 for some q > 0 and that this random vector B_n has a finite number V of possible realizations for a fixed V < ∞. In addition, $P_{n, B_{n}}^{1}$ , $P_{n, B_{n}}^{0}$ will denote the empirical probability distributions of the validation and training sample, respectively. Thus, the cross-validated risk of an estimator $\hat{\bar{Q}} : M_{nonp} \to {\bar{Q}}_{n}$ of ${\bar{Q}}_{0}$ is defined as $E_{B_{n}} P_{n, B_{n}}^{1} L_{1} (\hat{\bar{Q}} (P_{n, B_{n}}^{0}))$ .

We define the cross-validation selector as the index

k_{1 n} = {\hat{K}}_{1} (P_{n}) = \arg min_{k \in K_{1 n}} E_{B_{n}} P_{n, B_{n}}^{1} L_{1} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}))

that minimizes the cross-validated risk $E_{B_{n}} P_{n, B_{n}}^{1} L_{1} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}))$ over all choices $k \in K_{1 n}$ of candidate estimators. Our proposed super-learner is defined by

{\bar{Q}}_{n} = \hat{\bar{Q}} (P_{n}) \equiv E_{B_{n}} {\hat{\bar{Q}}}_{k_{1 n}} (P_{n, B_{n}}^{0}) .

(23)
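The selector and the super-learner eq. (23) can be sketched in a few lines. This is only an illustration under stated assumptions: the loss is squared error, the cross-validation scheme is V-fold with each fold serving once as validation sample (so that $E_{B_{n}}$ is an average over V equally likely splits), and the two candidate estimators `fit_mean` and `fit_linear` are hypothetical toy learners standing in for the HAL-estimators and other library members.

```python
import numpy as np


def fit_mean(x, y):
    """Toy candidate: constant fit equal to the training mean."""
    m = float(np.mean(y))
    return lambda x_new: np.full(len(x_new), m)


def fit_linear(x, y):
    """Toy candidate: least-squares line."""
    coef = np.polyfit(x, y, 1)
    return lambda x_new: np.polyval(coef, x_new)


def cv_super_learner(x, y, candidates, n_folds=5, seed=0):
    """Return (selected index, cross-validated risks, super-learner fit)."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(y)) % n_folds  # fold label of each observation
    risks = np.zeros(len(candidates))
    for v in range(n_folds):
        train, valid = fold != v, fold == v
        for k, fit in enumerate(candidates):
            predict = fit(x[train], y[train])  # trained on training sample
            # Validation-sample risk, averaged over the V splits.
            risks[k] += np.mean((y[valid] - predict(x[valid])) ** 2) / n_folds
    k_sel = int(np.argmin(risks))  # cross-validation selector
    # Average over splits of the selected candidate fitted on each
    # training sample, mirroring the form of eq. (23).
    fits = [candidates[k_sel](x[fold != v], y[fold != v]) for v in range(n_folds)]

    def super_learner(x_new):
        return np.mean([f(x_new) for f in fits], axis=0)

    return k_sel, risks, super_learner


rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200)
y = 2.0 * x + 0.1 * rng.normal(size=200)
k_sel, risks, sl = cv_super_learner(x, y, [fit_mean, fit_linear])
print(k_sel, risks)
```

Averaging the split-specific fits, rather than refitting the selected candidate on the full sample, follows the definition in eq. (23); with a convex loss this average is no worse in risk than the average of the split-specific risks.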

The following lemma proves that the super-learner $\hat{\bar{Q}} (P_{n})$ converges to ${\bar{Q}}_{0}$ at least at the rate $r_{\bar{Q}} (n)$ at which the HAL-estimator converges to ${\bar{Q}}_{0}$ : $d_{01} (\hat{\bar{Q}} (P_{n}), {\bar{Q}}_{0}) = O_{P} (r_{\bar{Q}} (n))$ . This lemma also shows that the super-learner is either asymptotically equivalent with the oracle selected candidate estimator, or achieves the parametric rate 1/n of a correctly specified parametric model.

Lemma 2

Recall the definition of the model bounds $M_{1 Q, n}$ , $M_{2 Q, n}$ (eq. (18)), and let $C (M_{1}, M_{2}, δ) \equiv 2 {(1 + δ)}^{2} (2 M_{1} / 3 + M_{2}^{2} / δ)$ .

For any fixed δ > 0,

d_{01} ({\bar{Q}}_{n}, {\bar{Q}}_{0 n}) \leq (1 + 2 δ) E_{B_{n}} min_{k \in K_{1 n}} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0 n}) + O_{P} (C (M_{1 Q, n}, M_{2 Q, n}, δ) \frac{\log K_{1 n}}{n}) .

If for each fixed δ > 0, $C (M_{1 Q, n}, M_{2 Q, n}, δ) \log K_{1 n} / n$ divided by $E_{B_{n}} \min_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0 n})$ is o_P(1), then

\frac{d_{01} (\hat{\bar{Q}} (P_{n}), {\bar{Q}}_{0 n})}{E_{B_{n}} {min}_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0 n})} - 1 = o_{P} (1) .

If for each fixed δ > 0, $E_{B_{n}} {min}_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0 n}) = O_{P} (C (M_{1 Q, n}, M_{2 Q, n}, δ) \log K_{1 n} / n)$ , then

d_{01} (\hat{\bar{Q}} (P_{n}), {\bar{Q}}_{0 n}) = O_{P} (\frac{C (M_{1 Q, n}, M_{2 Q, n}, δ) \log K_{1 n}}{n}) .

Suppose that for each finite M, the conditions of Lemma 1 hold with negligible numerical approximation error r_n, so that $d_{01} ({\bar{Q}}_{n, M} = {\hat{\bar{Q}}}_{M} (P_{n}), {\bar{Q}}_{0 n}^{M}) = O_{P} (r_{\bar{Q}}^{2} (n))$ . Let $λ_{1} \in ℝ_{> 0}^{k_{1}}$ be chosen so that $r_{\bar{Q}}^{2} (n) = O (n^{- λ_{1}})$ . For each fixed δ > 0, we have

d_{01} ({\bar{Q}}_{n}, {\bar{Q}}_{0 n}) = O_{P} (n^{- λ_{1}}) + O_{P} (C (M_{1 Q, n}, M_{2 Q, n}, δ) \frac{\log K_{1 n}}{n}) .

(24)

The proof of this lemma is a simple corollary of the finite sample oracle inequality for cross-validation [11, 13, 21, 33, 34], also presented in Lemma 5 in Section A of the Appendix. It uses the convexity of the loss function to bring the $E_{B_{n}}$ inside the loss-based dissimilarity.
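The size of the remainder term in Lemma 2 is easy to tabulate. The sketch below evaluates the constant C(M₁, M₂, δ) defined in the lemma and the term C log K₁_n / n for scalar bounds; the function names and the numerical inputs are hypothetical and serve only to show how the term scales with the number of candidates and the sample size.

```python
import math


def oracle_constant(m1, m2, delta):
    """C(M1, M2, delta) = 2 (1 + delta)^2 (2 M1 / 3 + M2^2 / delta)."""
    return 2.0 * (1.0 + delta) ** 2 * (2.0 * m1 / 3.0 + m2 ** 2 / delta)


def oracle_remainder(m1, m2, delta, n_candidates, n):
    """Remainder C(M1, M2, delta) * log(K) / n of the oracle inequality."""
    return oracle_constant(m1, m2, delta) * math.log(n_candidates) / n


for n in (100, 1000, 10000):
    print(n, oracle_remainder(3.0, 1.0, 1.0, n_candidates=50, n=n))
```

The term grows only logarithmically in the number of candidate estimators, which is why the library may contain a large grid of variation norm bounds.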

In the Appendix we present the analogous super-learner eq. (37) of G₀ and its corresponding Lemma 6.

6 One-step CV-HAL-TMLE

Cross-validated TMLE (CV-TMLE) robustifies the bias-reduction of the TMLE-step by selecting ε based on the cross-validated risk [5, 15]. In the next subsection we define the CV-TMLE and propose a particular type of local least favorable submodel that separately updates the initial estimator of $Q_{j 0}$ for each j = 1, …, k₁. Due to this choice, in subsection 2 we easily establish that the CV-TMLE of ${\bar{Q}}_{0}$ converges at the same rate to ${\bar{Q}}_{0}$ as the initial estimator, which is important for control of the second order remainder in the asymptotic efficiency proof of the CV-TMLE. In subsection 3 we establish the asymptotic efficiency of the CV-TMLE.

6.1 The CV-HAL-TMLE

Definition of one-step CV-HAL-TMLE for general local least favorable submodel

Let ${\bar{L}}_{1} (Q) \equiv \sum_{j = 1}^{k_{1} + 1} L_{1 j} (Q_{j})$ be the sum loss-function. For a given (Q, G), let $\{Q_{ε} : ε\} \subset Q_{n} \subset Q$ be a parametric submodel through Q at ε = 0 such that the linear span of $\frac{d}{d ε} {\bar{L}}_{1} (Q_{ε})$ at ε = 0 includes the canonical gradient D^*(Q, G). Let $\hat{Q} : M_{nonp} \to Q_{n}$ and $\hat{G} : M_{nonp} \to G_{n}$ be our initial estimators of $Q_{0} = ({\bar{Q}}_{0}, Q_{0, k_{1} + 1})$ and $G_{0} = ({\bar{G}}_{0}, G_{0, k_{2} + 1})$ . We recommend defining the initial estimators $\hat{\bar{Q}}$ and $\hat{\bar{G}}$ of ${\bar{Q}}_{0}$ and ${\bar{G}}_{0}$ to be HAL-super-learners as defined by eqs (23) and (37), so that $d_{10} (\hat{Q} (P_{n}), Q_{0 n}) = O_{P} (r_{Q}^{2} (n))$ and $d_{20} (\hat{G} (P_{n}), G_{0 n}) = O_{P} (r_{G}^{2} (n))$ . Given a cross-validation scheme B_n ∈ {0, 1}ⁿ, let $Q_{n, B_{n}} = \hat{Q} (P_{n, B_{n}}^{0}) \in Q_{n}$ be the estimator $\hat{Q}$ applied to the training sample $P_{n, B_{n}}^{0}$ . Similarly, let $G_{n, B_{n}} = \hat{G} (P_{n, B_{n}}^{0})$ . Let $\{Q_{n, B_{n}, ε} : ε\}$ be the above submodel with $(Q, G) = (Q_{n, B_{n}}, G_{n, B_{n}})$ through $Q_{n, B_{n}}$ at ε = 0. Let

ε_{n} = \arg \min_{ε} E_{B_{n}} P_{n, B_{n}}^{1} {\bar{L}}_{1} (Q_{n, B_{n}, ε})

be the MLE of ε minimizing the cross-validated empirical risk. This defines $Q_{n, B_{n}}^{*} = Q_{n, B_{n}, ε_{n}}$ as the B_n-specific targeted fit of Q₀. The one-step CV-TMLE of ψ₀ is defined as

ψ_{n}^{*} = E_{B_{n}} Ψ (Q_{n, B_{n}}^{*}) .

One-step CV-HAL-TMLE solves cross-validated efficient score equation

Our efficiency Theorem 1 assumes that

E_{B_{n}} P_{n, B_{n}}^{1} D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) = o_{P} (n^{- 1 / 2}) .

(25)

That is, it is assumed that the one-step CV-TMLE already solves the cross-validated efficient influence curve equation up to an asymptotically negligible approximation error. By definition of ε_n we have that it solves its score equation $E_{B_{n}} P_{n, B_{n}}^{1} \frac{d}{d ε_{n}} {\bar{L}}_{1} (Q_{n, B_{n}, ε_{n}}) = 0$ , which provides a basis for verifying eq. (25). As formalized by Lemma 13 in Appendix D, for our choice of $n^{- (1 / 4 +)}$-consistent initial estimators Q_n, G_n of Q₀, G₀, a one-step CV-TMLE will satisfy eq. (25) for one-dimensional local least favorable submodels under weak regularity conditions. We believe that such a result can be proved in great generality for arbitrary (also multivariate) local least favorable submodels. Instead, below we propose a particular class of multivariate local least favorable submodels eq. (26) for which we establish eq. (25) under regularity conditions. In van der Laan and Gruber (2015) it is shown that one can always construct a so-called universal least favorable submodel through Q with a one dimensional ε so that $\frac{d}{d ε} {\bar{L}}_{1} (Q_{ε}) = D^{*} (Q_{ε}, G)$ at each ε, so that $E_{B_{n}} P_{n, B_{n}}^{1} D^{*} (Q_{n, B_{n}, ε_{n}}^{*}, G_{n, B_{n}}) = 0$ (exactly), independent of the properties of the initial estimator (Q_n, G_n).
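The statement that ε_n solves its own score equation can be checked numerically in a simple special case. The sketch below assumes, purely for illustration, a squared-error loss and a one-dimensional linear fluctuation Q + εH, where H is a hypothetical direction (a so-called clever covariate) evaluated on each validation sample. Under these assumptions the minimizer of the cross-validated empirical risk has a closed form, and the cross-validated score vanishes up to rounding error. None of the function names comes from the paper.

```python
import numpy as np


def pooled_epsilon(y_valid, q_valid, h_valid):
    """Minimizer over eps of the cross-validated risk
    E_Bn P^1_{n,Bn} (Y - Q_{n,Bn} - eps * H_{n,Bn})^2.

    Each argument is a list over splits of validation-sample arrays:
    outcomes, initial fits from the training sample, and directions H.
    """
    num = sum(np.mean(h * (y - q)) for y, q, h in zip(y_valid, q_valid, h_valid))
    den = sum(np.mean(h * h) for h in h_valid)
    return num / den


def cv_score(eps, y_valid, q_valid, h_valid):
    """Cross-validated score E_Bn P^1_{n,Bn} H (Y - Q - eps * H)."""
    return float(np.mean([
        np.mean(h * (y - q - eps * h))
        for y, q, h in zip(y_valid, q_valid, h_valid)
    ]))


rng = np.random.default_rng(0)
ys = [rng.normal(size=50) for _ in range(5)]          # validation outcomes
qs = [rng.normal(size=50) for _ in range(5)]          # initial fits
hs = [rng.uniform(1.0, 3.0, size=50) for _ in range(5)]  # directions H
eps_n = pooled_epsilon(ys, qs, hs)
print(eps_n, cv_score(eps_n, ys, qs, hs))
```

Note that a single ε is pooled across all splits; this is what distinguishes the cross-validated update from running a separate TMLE-step within each split.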

One-step CV-HAL-TMLE preserves fast rate of convergence of initial estimator

Our efficiency Theorem 1 also assumes that the updated estimator $Q_{n, B_{n}}^{*}$ satisfies, for each split $B_{n}$ , $d_{01} (Q_{n, B_{n}}^{*}, Q_{0}) = o_{P} (n^{- 1 / 2})$ . This is generally a very reasonable condition given that $d_{01} (Q_{n, B_{n}}, Q_{0}) = O_{P} (n^{- λ_{1}})$ for a specified λ₁ > 1/2. Our proposed class of local least favorable submodels eq. (26) below guarantees that the rate of convergence of the initial estimator $Q_{n, B_{n}}$ is completely preserved by $Q_{n, B_{n}}^{*}$ , so that this condition is automatically guaranteed to hold.

A class of multivariate local least favorable submodels that separately updates each nuisance parameter component

One way to guarantee that $d_{01} (Q_{n, B_{n}}^{*}, Q_{0}) = o_{P} (n^{- 1 / 2})$ is to make sure that the updated estimator $Q_{n, B_{n}}^{*}$ converges as fast to Q₀ as the initial estimator $Q_{n, B_{n}}$ . For that purpose we propose a k₁ + 1-dimensional local least favorable submodel of the type

Q_{ε} = (Q_{1, ε_{1}}, \dots, Q_{k_{1} + 1, ε_{k_{1} + 1}}) \text{ such that } \left. \frac{d}{d ε_{j}} L_{1 j} (Q_{j, ε_{j}}) \right|_{ε_{j} = 0} = D_{j}^{*} (Q, G),

(26)

for j = 1, …, k₁ + 1, and where $D^{*} (Q, G) = \sum_{j = 1}^{k_{1} + 1} D_{j}^{*} (Q, G)$ . By using such a submodel we have $Q_{j, n, B_{n}}^{*} = Q_{j, n, B_{n}, ε_{n} (j)}$ and $ε_{n} (j) = \arg \min_{ε} E_{B_{n}} P_{n, B_{n}}^{1} L_{1 j} (Q_{j, n, B_{n}, ε})$ . Thus, in this case $Q_{j, n, B_{n}}$ is updated with its own ε_n(j), j = 1, …, k₁ + 1. The advantage of such a least favorable submodel is that the one-step update of ${\bar{Q}}_{j, n, B_{n}}$ is not affected by the statistical behavior of the other estimators ${\bar{Q}}_{l, n, B_{n}}, l \neq j$ . On the other hand, if one uses a local least favorable submodel with a single ε, the MLE ε_n is very much driven by the worst performing estimator ${\bar{Q}}_{j, n, B_{n}}$ . Lemma 3 below shows that, by using such a k₁ + 1-variate local least favorable submodel satisfying eq. (26), the rate of convergence of the initial estimator ${\bar{Q}}_{j, n}$ is fully preserved by the TMLE-update ${\bar{Q}}_{j, n, B_{n}}^{*}$ .

How to construct a local least favorable submodel of type eq. (26)

A general approach for constructing such a (k₁ + 1)-variate least favorable submodel is the following. Let $D_{j}^{*} (P)$ be the efficient influence curve at P for the parameter $Ψ_{j, P} : M \to ℝ$ defined by $Ψ_{j, P} (P_{1}) = Ψ (Q_{- j} (P), Q_{j} (P_{1}))$, which fixes all other components $Q_{l}$, l ≠ j, at their values under P, j = 1, …, k₁ + 1. Then, it follows immediately from the definition of the pathwise derivative that

D^{*} (P) = \sum_{j = 1}^{k_{1} + 1} D_{j}^{*} (P),

so that D^*(P) is an element of the linear span of $\{D_{j}^{*} (P) : j = 1, \dots, k_{1} + 1\}$. Let $\{Q_{j, ε (j)} : ε (j)\} \subset Q_{j n}$ be a one-dimensional submodel through $Q_{j}$ so that

{\frac{d}{d ε (j)} L_{1 j} (Q_{j, ε (j)}) |}_{ε (j) = 0} = D_{j}^{*} (Q, G), j = 1, \dots, k_{1} + 1.

That is, $\{Q_{j, ε (j)} : ε (j)\}$ is a local least favorable submodel at (Q, G) for the parameter $Ψ_{j, Q} : M \to ℝ$, j = 1, …, k₁ + 1. Now, define $(Q_{ε} : ε) \subset Q_{n}$ by $Q_{ε} = (Q_{j, ε (j)} : j = 1, \dots, k_{1} + 1)$. Then, we have

{\frac{d}{d ε} \bar{L} (Q_{ε}) |}_{ε = 0} = {(D_{j}^{*} (Q, G) : j = 1, \dots, k_{1} + 1)}^{⊤},

so that the submodel is indeed a local least favorable submodel.

Lemma 14 provides a sufficient set of minor conditions under which the one-step-HAL-CV-TMLE using a local least favorable submodel of the type eq. (26) will satisfy eq. (25). Therefore, the class of local least favorable submodels eq. (26) yields both crucial conditions for the HAL-CV-TMLE: it solves eq. (25) and it preserves the rate of convergence of the initial estimator.

6.2 Preservation of the rate of initial estimator for the one-step CV-HAL-TMLE using eq. (26)

Consider the submodel {Q_ε : ε} of the type eq. (26) presented above. Given an initial estimator $\hat{Q} : M_{nonp} \to Q_{n}$, recall the definition $Q_{n, B_{n}, ε} = {\hat{Q}}_{ε} (P_{n, B_{n}}^{0})$ as the fluctuated version of the initial estimator applied to the training sample, and $ε_{n} = \arg \min_{ε} E_{B_{n}} P_{n, B_{n}}^{1} L_{1} (Q_{n, B_{n}, ε})$. We want to show that $Q_{n, B_{n}, ε_{n}}$ converges to Q₀ at the same rate as the initial estimator $Q_{n, B_{n}}$ (and thus also $\hat{Q} (P_{n})$). The following lemma establishes this result, and it is an immediate consequence of the oracle inequality of the cross-validation selector for the loss function $L_{1 j}$, applied to the set of candidate estimators $P_{n} \to Q_{j n, ε (j)} = {\hat{Q}}_{j, ε (j)} (P_{n})$ indexed by ε(j), for each j = 1, …, k₁ + 1.
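To make the definition of ε_n concrete, the following is a minimal sketch for the treatment specific mean example. It assumes a logistic fluctuation logit Q̄_ε(W) = logit Q̄(W) + ε/Ḡ(W) combined with the log-likelihood loss restricted to observations with A = 1. This particular submodel, the function name `cv_tmle_epsilon`, and the bisection solver are illustrative assumptions rather than the implementation used for this article. All inputs are pooled over the validation samples: for each observation, `q_init` and `g` are predictions from fits on the training sample of the split in which that observation is held out.

```python
import numpy as np


def expit(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def cv_tmle_epsilon(y, a, q_init, g, lo=-10.0, hi=10.0, n_iter=200):
    """Pooled fluctuation parameter for a one-step CV-TMLE (illustrative sketch).

    y      : outcomes in [0, 1]
    a      : binary treatment indicators
    q_init : initial predictions of E(Y | A = 1, W), strictly inside (0, 1)
    g      : predictions of P(A = 1 | W), bounded away from zero
    Returns (eps, q_star) with q_star the updated predictions.
    """
    y, a, q_init, g = (np.asarray(v, dtype=float) for v in (y, a, q_init, g))
    offset = np.log(q_init / (1.0 - q_init))  # logit of the initial estimator
    h = 1.0 / g                               # assumed fluctuation direction

    def score(eps):
        # Derivative in eps of the cross-validated log-likelihood;
        # it is decreasing in eps, so bisection finds its unique root.
        return np.mean(a * h * (y - expit(offset + eps * h)))

    if score(lo) < 0 or score(hi) > 0:
        raise ValueError("root not bracketed; widen [lo, hi]")
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    eps = 0.5 * (lo + hi)
    return eps, expit(offset + eps * h)
```

At the returned ε the empirical mean of (A/Ḡ)(Y − Q̄_ε) over the validation observations is zero up to numerical precision, which is the sense in which a single update step can solve the relevant score equation in this sketch.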

Lemma 3

Let $ε_{n} = \arg {min}_{ε} E_{B_{n}} P_{n, B_{n}}^{1} L_{1} (Q_{n, B_{n}, ε})$ . We have

E_{B_{n}} d_{01} ({\hat{Q}}_{ε_{n}} (P_{n, B_{n}}^{0}), Q_{0 n}) \leq (1 + 2 δ) \min_{ε} E_{B_{n}} d_{01} ({\hat{Q}}_{ε} (P_{n, B_{n}}^{0}), Q_{0 n}) + O_{P} (\frac{C (M_{1 Q, n}, M_{2 Q, n}, δ) \log K_{1 n}}{n q}) .

By convexity of the loss function L₁(Q), this implies

d_{01} (E_{B_{n}} {\hat{Q}}_{ε_{n}} (P_{n, B_{n}}^{0}), Q_{0 n}) \leq (1 + 2 δ) \min_{ε} E_{B_{n}} d_{01} ({\hat{Q}}_{ε} (P_{n, B_{n}}^{0}), Q_{0 n}) + O_{P} (\frac{C (M_{1 Q, n}, M_{2 Q, n}, δ) \log K_{1 n}}{n q}) .

We have

\min_{ε} E_{B_{n}} d_{01} ({\hat{Q}}_{ε} (P_{n, B_{n}}^{0}), Q_{0 n}) \leq E_{B_{n}} d_{01} (\hat{Q} (P_{n, B_{n}}^{0}), Q_{0 n}) .

Thus, if for some $λ_{1} > 0$, $C (M_{1 Q, n}, M_{2 Q, n}, δ) \log K_{1 n} / (n q) = O (n^{- λ_{1}})$ and, for each $B_{n}$, $d_{01} (\hat{Q} (P_{n, B_{n}}^{0}), Q_{0 n}) = O_{P} (n^{- λ_{1}})$, then

d_{01} (E_{B_{n}} Q_{n, B_{n}, ε_{n}}, Q_{0 n}) = O_{P} (n^{- λ_{1}}) .

It then also follows that for each B_n, $d_{01} ({\hat{Q}}_{ε_{n}} (P_{n, B_{n}}^{0}), Q_{0 n}) = O_{P} (n^{- λ_{1}})$ .

6.3 Efficiency of the one-step CV-HAL-TMLE

We have the following theorem.

Theorem 1

Consider the above defined corresponding one-step CV-TMLE $ψ_{n}^{*} = E_{B_{n}} Ψ (Q_{n, B_{n}, ε_{n}})$ of Ψ(Q₀).

Initial estimator conditions

Consider the HAL-super-learners $\hat{\bar{Q}} (P_{n})$ and $\hat{\bar{G}} (P_{n})$ defined by eqs (23) and (37), respectively, and, recall that we are given simple estimators ${\hat{Q}}_{k_{1} + 1}$ and ${\hat{G}}_{k_{2} + 1}$ of $Q_{0, k_{1} + 1}$ and $G_{0, k_{2} + 1}$ . Let λ₁ and λ₂ be chosen so that $r_{\bar{Q}} (n) = O (n^{- λ_{1}})$ and $r_{\bar{G}} (n) = O (n^{- λ_{2}})$ . Assume the conditions of Theorem 2 and Theorem 6 so that we have

d_{01} (\hat{\bar{Q}} (P_{n}), {\bar{Q}}_{0}) = O_{P} (n^{- λ_{1} (1 : k_{1})}) + O_{P} (C (M_{1 Q, n}, M_{2 Q, n}, δ) \log K_{1 n} / n),

d_{02} (\hat{\bar{G}} (P_{n}), {\bar{G}}_{0}) = O_{P} (n^{- λ_{2} (1 : k_{2})}) + O_{P} (C (M_{1 G, n}, M_{2 G, n}, δ) \log K_{2 n} / n),

where λ₁(1 : k₁) > 1/2 and λ₂(1 : k₂) > 1/2. Let $\hat{Q} = (\hat{\bar{Q}}, {\hat{Q}}_{k_{1} + 1})$ and $\hat{G} = (\hat{\bar{G}}, {\hat{G}}_{k_{2} + 1})$ be the corresponding estimators of Q₀ and G₀, respectively.

“Preserve rate of convergence of initial estimator”-condition

In addition, assume that either (Case A) the CV-TMLE uses a local least favorable submodel of the type eq. (26) so that Lemma 3 applies, or (Case B) for each split $B_{n}$, $d_{01} (Q_{n, B_{n}}^{*}, Q_{0}) = O_{P} (n^{- λ_{1}^{*}})$ for some $λ_{1}^{*} > 1 / 2$.

Efficient influence curve score equation condition and second order remainder condition

Define $f_{n, ε} = D^{*} ({\hat{Q}}_{ε} (P_{n, B_{n}}^{0}), G_{n, B_{n}}) - D^{*} (Q_{0}, G_{0})$ and the class of functions $F_{n} = \{f_{n, ε} : ε\}$. Assume

E_{B_{n}} P_{n, B_{n}}^{1} D^{*} (Q_{n, B_{n}, ε_{n}}, G_{n, B_{n}}) = o_{P} (n^{- 1 / 2}),

(27)

{‖ D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) - D^{*} (Q_{0}, G_{0}) ‖}_{P_{0}} = o_{P} (r_{D^{*}, n}) \text{ for } r_{D^{*}, n} = o (1),

(28)

E_{B_{n}} R_{20} ((Q_{n, B_{n}}^{*}, G_{n, B_{n}}), (Q_{0}, G_{0})) = o_{P} (n^{- 1 / 2}),

(29)

\frac{\max (M_{1 Q, n}, M_{2 Q, n}^{2}) \log K_{1 n}}{n} = O (n^{- λ_{1}}),

(30)

\frac{\max (M_{1 G, n}, M_{2 G, n}^{2}) \log K_{2 n}}{n} = O (n^{- λ_{2}}),

(31)

\sup_{Λ} N (ε M_{D^{*}, n}, F_{n}, L^{2} (Λ)) < K ε^{- p} \text{ for a } K < \infty, p < \infty .

(32)

In Case A, for verification of assumption eq. (27) one could apply Lemma 14.

In Case A, for verification of the two assumptions eqs (28) and (29) one can use that for each of the V realizations of B_n, $d_{01} (Q_{n, B_{n}}^{*}, Q_{0}) = O_{P} (n^{- λ_{1}})$ and $d_{02} (G_{n, B_{n}}, G_{0}) = O_{P} (n^{- λ_{2}})$.

In Case B, for verification of the latter two assumptions eqs (28) and (29) one can use that for each of the V realizations of B_n, $d_{01} (Q_{n, B_{n}}^{*}, Q_{0}) = O_{P} (n^{- λ_{1}^{*}})$ and $d_{02} (G_{n, B_{n}}, G_{0}) = O_{P} (n^{- λ_{2}})$.

Then, $ψ_{n}^{*} = E_{B_{n}} Ψ (Q_{n, B_{n}, ε_{n}})$ is asymptotically efficient:

ψ_{n}^{*} - ψ_{0} = (P_{n} - P_{0}) D^{*} (Q_{0}, G_{0}) + o_{P} (n^{- 1 / 2}) .

(33)
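An expansion of the form eq. (33) is what typically justifies Wald-type confidence intervals based on the sample variance of an estimated influence curve. The sketch below illustrates that standard construction; it is not part of the theorem, and the function name `wald_interval` and the plug-in variance estimator are assumptions made for illustration. The argument `ic_values` is assumed to hold the estimated efficient influence curve evaluated at each of the n observations.

```python
import math
from statistics import NormalDist


def wald_interval(psi, ic_values, level=0.95):
    """Wald-type interval psi +/- z * sd(IC) / sqrt(n) (illustrative sketch)."""
    n = len(ic_values)
    mean_ic = sum(ic_values) / n
    # Sample variance of the estimated influence curve values.
    var_ic = sum((d - mean_ic) ** 2 for d in ic_values) / (n - 1)
    se = math.sqrt(var_ic / n)
    # Standard normal quantile for a two-sided interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return psi - z * se, psi + z * se
```

The interval is only as good as the influence curve estimate that is plugged in, so its validity rests on the conditions of the theorem and not on the code.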

Condition eq. (32) will practically always trivially hold for p = k₁ + 1 equal to the dimension of ε: note that this is even true for unbounded models due to the normalizing constant $M_{D^{*}, n}$. We already discussed the crucial condition eq. (27) in our subsection defining the CV-TMLE. Conditions eqs (30) and (31) are easily satisfied by controlling the speed at which the model bounds $M_{1 Q, n}$, $M_{2 Q, n}$, $M_{1 G, n}$, $M_{2 G, n}$ can converge to infinity, and are always true for bounded models (as long as the size of the library of the super-learner behaves as a polynomial power of sample size). For bounded models $M$, condition eq. (28) will typically hold with $r_{D^{*}, n} = n^{- λ}$ and λ equal to the minimum of the components of λ₁/2 and λ₂/2: i.e., the efficient influence curve estimator will converge to its true counterpart as fast as the slowest converging nuisance parameter estimator. If the model $M$ is unbounded so that the model bounds of the sieve $M_{n}$ will converge to infinity, then eq. (28) will hold with $r_{D^{*}, n} = n^{- λ} M_{n}$ for some M_n converging to infinity (e.g., $M_{n} = M_{D^{*}, n}$). So, in the latter case one has to control the rate at which the model bounds of the sieve $M_{n}$, such as the supremum norm bound $M_{D^{*}, n}$ for the efficient influence curve, converge to infinity. Finally, the crucial condition eq. (29) will easily hold for bounded models $M$ if this slowest rate λ is larger than 1/4, which we know to be true for the HAL-estimator and its super-learner. For unbounded models, condition eq. (29) will put a serious brake on the speed at which the model bounds of $M_{n}$ can converge to infinity.

Proof

By assumptions eqs (30) and (31), we have

d_{0} ((\hat{Q} (P_{n, B_{n}}^{0}), \hat{G} (P_{n, B_{n}}^{0})), (Q_{0}, G_{0})) = O_{P} (n^{- λ_{1}}, n^{- λ_{2}}) .

Consider Case A. Lemma 3 proves that under these same assumptions eqs (30) and (31), we also have, for each B_n, $d_{01} (Q_{n, B_{n}, ε_{n}}, Q_{0 n}) = O_{P} (n^{- λ_{1}})$. This proves that for each B_n, $d_{0} ((Q_{n, B_{n}}^{*} = Q_{n, B_{n}, ε_{n}}, G_{n, B_{n}}), (Q_{0}, G_{0})) = O_{P} (n^{- λ_{1}}, n^{- λ_{2}})$. For Case B, we replace λ₁ by $λ_{1}^{*}$ in the latter expression. Suppose n > N₀ so that $Q_{0 n} = Q_{0}$ and $G_{0 n} = G_{0}$. By the identity $Ψ (Q_{n, B_{n}}^{*}) - Ψ (Q_{0}) = - P_{0} D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) + R_{20} ((Q_{n, B_{n}}^{*}, G_{n, B_{n}}), (Q_{0}, G_{0}))$, we have

E_{B_{n}} Ψ (Q_{n, B_{n}}^{*}) - Ψ (Q_{0}) = - E_{B_{n}} P_{0} D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) + E_{B_{n}} R_{20} ((Q_{n, B_{n}}^{*}, G_{n, B_{n}}), (Q_{0}, G_{0})) .

Combining this with eq. (27) yields the following identity:

ψ_{n}^{*} - Ψ (Q_{0}) = E_{B_{n}} Ψ (Q_{n, B_{n}}^{*}) - Ψ (Q_{0}) = E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) + E_{B_{n}} R_{20} ((Q_{n, B_{n}}^{*}, G_{n, B_{n}}), (Q_{0}, G_{0})) + o_{P} (n^{- 1 / 2}) .

By assumption eq. (29) we have that $E_{B_{n}} R_{20} ((Q_{n, B_{n}}^{*}, G_{n, B_{n}}), (Q_{0}, G_{0})) = o_{P} (n^{- 1 / 2})$ . Thus, we have shown

ψ_{n}^{*} - Ψ (Q_{0}) = E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) + o_{P} (n^{- 1 / 2}) .

We now note

E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) = E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) D^{*} (Q_{0}, G_{0}) + E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) \{D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) - D^{*} (Q_{0}, G_{0})\} = (P_{n} - P_{0}) D^{*} (Q_{0}, G_{0}) + E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) \{D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) - D^{*} (Q_{0}, G_{0})\} .

Thus, it remains to prove that $E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) \{D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) - D^{*} (Q_{0}, G_{0})\} = o_{P} (n^{- 1 / 2})$. For this we apply Lemma 10 with $f_{n, ε} = D^{*} ({\hat{Q}}_{ε} (P_{n, B_{n}}^{0}), G_{n, B_{n}}) - D^{*} (Q_{0}, G_{0})$, conditional on $P_{n, B_{n}}^{0}$, and $F_{n} = \{f_{n, ε} : ε\}$. By assumption eq. (28), there exists a rate $r_{D^{*}, n} = o (1)$ so that ${‖ f_{n, ε_{n}} ‖}_{P_{0}} = O_{P} (r_{D^{*}, n})$, where (e.g., for Case A) this rate will be determined based upon $d_{0} ((Q_{n, B_{n}}^{*}, G_{n, B_{n}}), (Q_{0}, G_{0})) = O_{P} (n^{- λ_{1}}, n^{- λ_{2}})$. Note also that the envelope of $F_{n}$ satisfies ${‖ F_{n} ‖}_{Λ} \leq M_{D^{*}, n}$ for any measure Λ (see eq. (18)). Since ε is p-dimensional for some integer p, the entropy of $F_{n}$ easily satisfies $\sup_{Λ} N (ε {‖ F_{n} ‖}_{Λ}, F_{n}, L^{2} (Λ)) = O (ε^{- p})$, which is assumed to hold by condition eq. (32). Application of Lemma 10 now proves that, if $r_{D^{*}, n} = o (1)$, then, given $P_{n, B_{n}}^{0}$,

(P_{n, B_{n}}^{1} - P_{0}) f_{n, ε_{n}} = o_{P} (n^{- 1 / 2}) .

This proves also that $E_{B_{n}} (P_{n, B_{n}}^{1} - P_{0}) f_{n, ε_{n}} = o_{P} (n^{- 1 / 2})$ . This completes the proof. □

7 Example: Treatment specific mean

We will now apply Theorem 1 to the example introduced in Section 2. We have the following sieve model bounds (van der Laan et al., 2004): $M_{1 Q, n} = O (\log δ_{n}^{- 1})$ ; $M_{2 Q, n} = O (1 / δ_{n})$ ; $M_{1 G, n} = O (\log δ_{n}^{- 1})$ ; $M_{2 G, n} = O (1 / δ_{n})$ ; $M_{D^{*}, n} = O (1 / δ_{n})$ .

Since the parameter space $Q_{1 n}$ consists of the cadlag functions with bounded variation norms, without any further restrictions beyond the global bound δ_n, we can select the entropy quantities for $Q_{1}$ as follows: α₁ = α(d₁) = 2/(d₁ + 2), where d₁ = d−2 is the dimension of W. Similarly, if $G_{n}$ consists of all cadlag functions of dimension d₂, without further meaningful restrictions beyond δ_n, then we can select the entropy quantities for $G_{n}$ as α₂ = α(d₂) = 2/(d₂ + 2). If the model $G$ enforces more meaningful restrictions than that A only depends on W through a subset of W of dimension d₂, then α₂ can be replaced by a sharper upper bound than α(d₂). We already established that condition eq. (27) in Theorem 1 holds exactly. Condition eq. (32) trivially holds.

Verification of eqs (30) and (31)

Let ${\bar{Q}}_{n} \in Q_{1 n}$ be a super-learner of ${\bar{Q}}_{0}$ of the type presented in eq. (23). Similarly, let ${\bar{G}}_{n} \in G_{n}$ be such a super-learner of ${\bar{G}}_{0}$ as presented in eq. (37). Suppose that $\max (M_{1 Q, n}, M_{2 Q, n}^{2}) \log K_{1 n} / n = O (n^{- λ (d_{1})})$ and $\max (M_{1 G, n}, M_{2 G, n}^{2}) \log K_{2 n} / n = O (n^{- λ (d_{2})})$, where λ(d) = 1/2 + α(d)/4. Then, by Lemma 2 and Lemma 6, we have $d_{10, 1} ({\bar{Q}}_{n}, {\bar{Q}}_{0}) = O_{P} (n^{- λ (d_{1})})$ and $d_{02} ({\bar{G}}_{n}, {\bar{G}}_{0}) = O_{P} (n^{- λ (d_{2})})$. Plugging in the above bounds for $M_{1 Q, n}$, $M_{2 Q, n}$, $M_{1 G, n}$, $M_{2 G, n}$, it follows that it suffices to select δ_n so that $δ_{n}^{- 1} = O (n^{1 / 2 - λ (d_{1}) / 2} {(\max (\log K_{1 n}, \log K_{2 n}))}^{- 1 / 2})$. (Improvements can be obtained by selecting a separate $δ_{1 n}$ for truncating $\bar{Q}$ and $δ_{2 n}$ for truncating $\bar{G}$.) Let $K_{n} = \max (K_{1 n}, K_{2 n})$ and impose that $\log K_{n} = O (n^{1 / 2 - α (d_{1}) / 2})$. Then, it follows that this bound for $δ_{n}^{- 1}$ is larger than $n^{α (d_{1}) / 6}$, so that this constraint on δ_n is dominated by the constraint $δ_{n}^{- 1} = o (n^{α (d_{1}) / 6})$ that we impose below.

Above we showed that if $δ_{n}^{- 1} = O (n^{1 / 2 - λ (d_{1}) / 2} {(\max (\log K_{1 n}, \log K_{2 n}))}^{- 1 / 2})$, then the two super-learners ${\bar{Q}}_{n, B_{n}}$ and ${\bar{G}}_{n, B_{n}}$ of ${\bar{Q}}_{0}$ and ${\bar{G}}_{0}$ based on the training sample $P_{n, B_{n}}^{0}$ converge at the rates $n^{- λ (d_{1})}$ and $n^{- λ (d_{2})}$ w.r.t. the loss-based dissimilarities $d_{10, 1}$ and $d_{02}$, respectively. By Lemma 3, under the same conditions stated above for $d_{01} ({\bar{Q}}_{n}, {\bar{Q}}_{0}) = O_{P} (n^{- λ (d_{1})})$, the TMLE update ${\bar{Q}}_{n, B_{n}}^{*}$ converges at this same rate: for each split B_n, we have $d_{01} ({\bar{Q}}_{n, B_{n}}^{*}, {\bar{Q}}_{0}) = O_{P} (n^{- λ (d_{1})})$.

Verification of eq. (28)

Using straightforward algebra and using the triangle inequality for a norm, we obtain

{‖ D^{*} (Q_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}) - D^{*} (Q_{0}, {\bar{G}}_{0}) ‖}_{P_{0}} \leq {‖ A \frac{{\bar{G}}_{n, B_{n}} - {\bar{G}}_{0}}{{\bar{G}}_{n, B_{n}} {\bar{G}}_{0}} (Y - {\bar{Q}}_{0}) ‖}_{P_{0}} + {‖ \frac{A}{{\bar{G}}_{n, B_{n}}} ({\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0}) ‖}_{P_{0}} + {‖ {\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0} ‖}_{P_{0}} + | E_{B_{n}} Ψ (Q_{n, B_{n}}^{*}) - Ψ (Q_{0}) | .

Using that $\min ({\bar{G}}_{n, B_{n}}, {\bar{G}}_{0}) > δ_{n}$ and $| Y - {\bar{Q}}_{0} | < 1$, it follows that the first term is bounded by $δ_{n}^{- 3 / 2} {‖ {\bar{G}}_{n, B_{n}} - {\bar{G}}_{0} ‖}_{P_{0}}$. Using that ${\bar{G}}_{n, B_{n}} > δ_{n}$, it follows that the second term is bounded by $δ_{n}^{- 1} {‖ {\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0} ‖}_{P_{0}}$. So, we have

{‖ D^{*} (Q_{n, B_{n}}^{*}, G_{n, B_{n}}) - D^{*} (Q_{0}, G_{0}) ‖}_{P_{0}} \leq δ_{n}^{- 3 / 2} {‖ {\bar{G}}_{n, B_{n}} - {\bar{G}}_{0} ‖}_{P_{0}} + 2 δ_{n}^{- 1} {‖ {\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0} ‖}_{P_{0}} + | E_{B_{n}} Ψ (Q_{n, B_{n}}^{*}) - Ψ (Q_{0}) | .

We bound the last term as follows:

E_{B_{n}} Ψ (Q_{n, B_{n}}^{*}) - Ψ (Q_{0}) = E_{B_{n}} Q_{2 n, B_{n}}^{1} {\bar{Q}}_{n, B_{n}}^{*} - Q_{20} {\bar{Q}}_{0} = E_{B_{n}} (Q_{2 n, B_{n}}^{1} - Q_{20}) {\bar{Q}}_{0} + E_{B_{n}} Q_{2 n, B_{n}}^{1} ({\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0}) = O_{P} (n^{- 1 / 2}) + E_{B_{n}} (Q_{2 n, B_{n}}^{1} - Q_{20}) ({\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0}) + E_{B_{n}} Q_{20} ({\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0}) = O_{P} (n^{- 1 / 2}) + E_{B_{n}} (Q_{2 n, B_{n}}^{1} - Q_{20}) ({\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0}) + O_{P} (E_{B_{n}} d_{10, 1}^{1 / 2} ({\bar{Q}}_{n, B_{n}}^{*}, {\bar{Q}}_{0})),

where we used at the third equality that, for each split $B_{n}$, $(Q_{2 n, B_{n}}^{1} - Q_{20}) {\bar{Q}}_{0} = O_{P} (n^{- 1 / 2})$ by the standard central limit theorem.

In order to bound the second empirical process term we apply Lemma 10 to the term $n^{1 / 2} (Q_{2 n, B_{n}}^{1} - Q_{20}) ({\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0})$ . Lemma 4 below shows that ${‖ {\bar{Q}}_{n, B_{n}} - {\bar{Q}}_{0} ‖}_{P_{0}} = O_{P} (n^{- λ (d_{1}) / 2} δ_{n}^{- 1 / 2})$ . Therefore, we can apply Lemma 10 with $r_{D^{*}, n}$ equal to this latter rate. This yields the following bound:

E_{B_{n}} (Q_{2 n, B_{n}}^{1} - Q_{20}) ({\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0}) = O_{P} (n^{- λ (d_{1}) / 2} δ_{n}^{- 1 / 2} (1 + \log n + \log δ_{n})) .

Thus, we have shown

{‖ D^{*} (Q_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}) - D^{*} (Q_{0}, {\bar{G}}_{0}) ‖}_{P_{0}} = O_{P} (n^{- λ (d_{1}) / 2} δ_{n}^{- 1 / 2} (1 + \log n + \log δ_{n})) + O_{P} (δ_{n}^{- 1} {‖ {\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0} ‖}_{P_{0}}) + O_{P} (δ_{n}^{- 3 / 2} {‖ {\bar{G}}_{n, B_{n}} - {\bar{G}}_{0} ‖}_{P_{0}}) .

We have $d_{10, 1} ({\bar{Q}}_{n, B_{n}}^{*}, {\bar{Q}}_{0}) = O_{P} (n^{- λ (d_{1})})$ and $d_{02} ({\bar{G}}_{n, B_{n}}, {\bar{G}}_{0}) = O_{P} (n^{- λ (d_{2})})$. These rates first need to be translated into rates for the L²(P₀)-norms in order to utilize the above bound. Lemma 4 below shows that ${‖ {\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0} ‖}_{P_{0}} = O_{P} (n^{- λ (d_{1}) / 2} δ_{n}^{- 1 / 2})$ and ${‖ {\bar{G}}_{n, B_{n}} - {\bar{G}}_{0} ‖}_{P_{0}} = O_{P} (n^{- λ (d_{2}) / 2})$. So we obtain the following bound:

{‖ D^{*} (Q_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}) - D^{*} (Q_{0}, {\bar{G}}_{0}) ‖}_{P_{0}} = O_{P} (n^{- λ (d_{1}) / 2} δ_{n}^{- 1 / 2} (1 + \log n + \log δ_{n})) + O_{P} (δ_{n}^{- 3 / 2} n^{- λ (d_{1}) / 2}) + O_{P} (δ_{n}^{- 3 / 2} n^{- λ (d_{2}) / 2}) .

We can conservatively bound this as follows:

{‖ D^{*} (Q_{n, B_{n}}^{*}, {\bar{G}}_{n, B_{n}}) - D^{*} (Q_{0}, {\bar{G}}_{0}) ‖}_{P_{0}} = O_{P} (δ_{n}^{- 3 / 2} n^{- λ (d_{1}) / 2} \log n),

where we used conservative bounding by not utilizing that d₂ could be significantly smaller than d₁. We conclude that we can set $r_{D^{*}, n} = δ_{n}^{- 3 / 2} n^{- λ (d_{1}) / 2} \log n$. We need that $r_{D^{*}, n} = o (1)$ and thus that $δ_{n}^{- 3 / 2} = o (n^{λ (d_{1}) / 2} / \log n)$, or $δ_{n}^{- 1} = o (n^{1 / 6 + α (d_{1}) / 12} {(\log n)}^{- 2 / 3})$. The latter condition is dominated by the condition $δ_{n}^{- 1} = o (n^{α (d_{1}) / 6})$ we need in the analysis below of the second order remainder.

Verification of eq. (29)

By eq. (6), we can bound the second order remainder as follows:

R_{20} (P_{n, B_{n}}^{*}, P_{0}) \leq δ_{n}^{- 1} {‖ {\bar{G}}_{n, B_{n}} - {\bar{G}}_{0} ‖}_{P_{0}} {‖ {\bar{Q}}_{n, B_{n}}^{*} - {\bar{Q}}_{0} ‖}_{P_{0}} = O_{P} (δ_{n}^{- 3 / 2} n^{- λ (d_{1}) / 2 - λ (d_{2}) / 2}) .

Thus, it suffices to assume that $δ_{n}^{- 3 / 2} n^{- λ (d_{1})} = o (n^{- 1 / 2})$ , and thus $δ_{n}^{- 1} = o (n^{α (d_{1}) / 6})$ .
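For readers who want the exponent arithmetic spelled out, the following worked step is a sketch that uses only λ(d₁) = 1/2 + α(d₁)/4 and the conservative replacement of λ(d₂) by λ(d₁) made above. It recovers the stated constraint on δ_n.

```latex
\begin{aligned}
\delta_{n}^{-3/2}\, n^{-\lambda(d_{1})} = o\left(n^{-1/2}\right)
  &\iff \delta_{n}^{-3/2} = o\left(n^{\lambda(d_{1}) - 1/2}\right)
        = o\left(n^{\alpha(d_{1})/4}\right) \\
  &\iff \delta_{n}^{-1} = o\left(n^{(2/3)\,\alpha(d_{1})/4}\right)
        = o\left(n^{\alpha(d_{1})/6}\right).
\end{aligned}
```

The second line raises both sides to the power 2/3, which preserves the little-o relation because all quantities are positive.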

We verified the conditions of Theorem 1. Application of Theorem 1 yields the following result.

Theorem 2

Consider the nonparametric statistical model $M$ for P₀ of the d-dimensional $O = (W, A, Y) \sim P_{0} \in M$ and target parameter $Ψ : M \to ℝ$ defined by $Ψ (P) = E_{P} E_{P} (Y | A = 1, W)$. In this nonparametric model we only assume that for each $P \in M$, $\bar{Q} (P) = E_{P} (Y | A = 1, W)$ and $\bar{G} (P) = E_{P} (A | W)$ are cadlag functions on $[0, τ] \subset ℝ_{\geq 0}^{d - 2}$ for some finite τ with finite variation norm.

Consider the above defined one-step CV-TMLE $ψ_{n}^{*} = E_{B_{n}} Ψ (Q_{n, B_{n}}^{*})$ of Ψ(Q₀) based on the HAL-super-learner ${\bar{Q}}_{n}$ and ${\bar{G}}_{n}$ of type eqs (23) and (37), where ${\bar{Q}}_{n}$ and ${\bar{G}}_{n}$ are enforced to be contained in interval (δ_n, 1 − δ_n). Let d₁ = d − 2. Let α(d₁) = 2/(d₁ + 2), λ(d₁) = 1/2 + α(d₁)/4, and K_n = max(K₁_n, K₂_n).

Assume that $\log K_{n} = O (n^{1 / 2 - α (d_{1}) / 2})$, and that $δ_{n}^{- 1}$ converges slowly enough to ∞ so that $δ_{n}^{- 1} = o (n^{α (d_{1}) / 6})$. Then $ψ_{n}^{*}$ is a regular asymptotically linear estimator with influence curve equal to the efficient influence curve D^*(P₀), and is thus asymptotically efficient.

Thus, for large dimension d, $δ_{n}^{- 1}$ is only allowed to converge to infinity at a very slow rate. Note that a bound on $δ_{n}^{- 1}$ immediately implies a bound on the efficient influence curve, and such bounds are naturally crucial.
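To see how quickly the allowed growth of $δ_{n}^{- 1}$ degrades with dimension, the following small sketch tabulates the quantities appearing in Theorem 2. It uses only the formulas α(d₁) = 2/(d₁ + 2), λ(d₁) = 1/2 + α(d₁)/4, and the exponent α(d₁)/6 from the constraint on δ_n; the function name `hal_rates` is hypothetical.

```python
from fractions import Fraction


def hal_rates(d1):
    """Return (alpha, lambda, delta exponent) for covariate dimension d1."""
    alpha = Fraction(2, d1 + 2)          # entropy quantity alpha(d1)
    lam = Fraction(1, 2) + alpha / 4     # rate exponent lambda(d1)
    delta_exp = alpha / 6                # delta_n^{-1} = o(n^{delta_exp})
    return alpha, lam, delta_exp


for d1 in (1, 2, 5, 10, 50):
    alpha, lam, delta_exp = hal_rates(d1)
    print(f"d1={d1}: alpha={alpha}, lambda={lam}, delta exponent={delta_exp}")
```

For instance, d₁ = 10 gives α = 1/6 and an exponent of 1/36, so the truncation level is allowed to shrink only marginally with n.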

Above we used the following lemma.

Lemma 4

We have

{‖ \bar{Q} - {\bar{Q}}_{0} ‖}_{P_{0}}^{2} \leq 4 δ_{n}^{- 1} d_{01} (\bar{Q}, {\bar{Q}}_{0}) .

(34)

We also have

{‖ \bar{G} - {\bar{G}}_{0} ‖}_{P_{0}}^{2} \leq 4 d_{02} (\bar{G}, {\bar{G}}_{0}) .

(35)

Proof

We first prove eq. (34). Let

K L (\bar{Q} (W), {\bar{Q}}_{0} (W)) = {\bar{Q}}_{0} (W) \log \frac{{\bar{Q}}_{0} (W)}{\bar{Q} (W)} + (1 - {\bar{Q}}_{0} (W)) \log \frac{1 - {\bar{Q}}_{0} (W)}{1 - \bar{Q} (W)}

be the Kullback-Leibler divergence between the Bernoulli laws with probabilities $\bar{Q} (W)$ and ${\bar{Q}}_{0} (W)$ . Then,

d_{01} (\bar{Q}, {\bar{Q}}_{0}) = E_{P_{0}} {\bar{G}}_{0} (W) K L (\bar{Q} (W), {\bar{Q}}_{0} (W)) .

In van der Vaart (1998, page 62) it is shown that for two densities p, p₀, we have ${‖ p^{1 / 2} - p_{0}^{1 / 2} ‖}_{P_{0}}^{2} \leq - \int \log (p / p_{0}) d P_{0}$ . Applying this inequality to Bernoulli laws with probabilities $\bar{Q} (W)$ and ${\bar{Q}}_{0} (W)$ yields:

K L (\bar{Q} (\cdot), {\bar{Q}}_{0} (\cdot)) \geq {\bar{Q}}_{0} {({\bar{Q}}^{1 / 2} - {\bar{Q}}_{0}^{1 / 2})}^{2} + (1 - {\bar{Q}}_{0}) {({(1 - \bar{Q})}^{1 / 2} - {(1 - {\bar{Q}}_{0})}^{1 / 2})}^{2} .

Applying the inequality $(a - b)^{2} \leq 4 (a^{1 / 2} - b^{1 / 2})^{2}$ (for a, b ∈ [0, 1]) to the squared terms on the right-hand side now yields:

K L (\bar{Q} (\cdot), {\bar{Q}}_{0} (\cdot)) \geq 4^{- 1} {(\bar{Q} - {\bar{Q}}_{0})}^{2} .

(36)

Now, note that $d_{01} (\bar{Q}, {\bar{Q}}_{0}) = E_{P_{0}} {\bar{G}}_{0} (W) K L (\bar{Q} (W), {\bar{Q}}_{0} (W))$ . We can use that ${\bar{G}}_{0} > δ_{n}$ , which provides us with the following bound:

d_{01} (\bar{Q}, {\bar{Q}}_{0}) \geq δ_{n} E_{P_{0}} K L (\bar{Q} (W), {\bar{Q}}_{0} (W)) \geq δ_{n} 4^{- 1} E_{P_{0}} {(\bar{Q} - {\bar{Q}}_{0})}^{2} (W) = δ_{n} 4^{- 1} {‖ \bar{Q} - {\bar{Q}}_{0} ‖}_{P_{0}}^{2} .

This completes the proof of eq. (34). We have

d_{02} (\bar{G}, {\bar{G}}_{0}) = E_{P_{0}} K L (\bar{G} (W), {\bar{G}}_{0} (W)) .

Completely analogous to the derivation above of eq. (36), we obtain

K L (\bar{G} (\cdot), {\bar{G}}_{0} (\cdot)) \geq 4^{- 1} {(\bar{G} - {\bar{G}}_{0})}^{2},

and thus

d_{02} (\bar{G}, {\bar{G}}_{0}) \geq 4^{- 1} {‖ \bar{G} - {\bar{G}}_{0} ‖}_{P_{0}}^{2} .

This proves eq. (35). □
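As a numerical sanity check on the two elementary inequalities used in this proof, the sketch below evaluates them on a grid of Bernoulli probabilities. It is a check on a finite grid only, not a proof, and the helper name `bernoulli_kl` and the grid are assumptions made for illustration.

```python
import numpy as np


def bernoulli_kl(q, q0):
    """Kullback-Leibler divergence with expectation taken under Bernoulli(q0)."""
    return q0 * np.log(q0 / q) + (1.0 - q0) * np.log((1.0 - q0) / (1.0 - q))


# Grid of probabilities strictly inside (0, 1) to avoid log(0).
grid = np.linspace(0.01, 0.99, 99)
q, q0 = np.meshgrid(grid, grid)

# Pointwise bound of eq. (36): KL >= (q - q0)^2 / 4.
kl_bound_holds = bool(np.all(bernoulli_kl(q, q0) >= 0.25 * (q - q0) ** 2 - 1e-12))

# Elementary bound (a - b)^2 <= 4 (sqrt(a) - sqrt(b))^2 on [0, 1].
sqrt_bound_holds = bool(
    np.all((q - q0) ** 2 <= 4.0 * (np.sqrt(q) - np.sqrt(q0)) ** 2 + 1e-12)
)
```

The second bound follows from writing a − b as the product of a^{1/2} − b^{1/2} and a^{1/2} + b^{1/2}, with the latter factor at most 2 on [0, 1].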

8 Discussion

In this article we established that a one-step CV-TMLE, using a super-learner with a library that includes L¹-penalized MLEs that minimize the empirical risk over high dimensional linear combinations of indicator basis functions under a series of L¹-constraints, will be asymptotically efficient. This was shown to hold under remarkably weak conditions and for an arbitrary dimension of the data structure O.

This remarkable fact is heavily driven by the fact that this super-learner will always converge at a rate faster than $n^{- 1 / 4}$ w.r.t. the loss-based dissimilarity, which is typically equivalent to the L²(P₀)-norm. This holds for every dimension of the data and any underlying smoothness of the true nuisance parameter values, as long as these true nuisance parameter values have a finite variation norm. Since the second order remainder $R_{2} (P_{n}^{*}, P_{0})$ of the first order expansion for the TMLE can be bounded in terms of these loss-based dissimilarities between the super-learner and its true counterpart, this rate of convergence is fast enough to make the second order remainder asymptotically negligible. As a consequence, the first order empirical mean of the canonical gradient/efficient influence curve drives the asymptotics of the TMLE.

In order to prove our theorems it was also important to establish that a one-step TMLE already approximately solves the efficient influence curve equation, under very general reasonable conditions. In this article we focused on a one-step TMLE that updates each nuisance parameter with its own one-dimensional MLE update step. This choice of local least favorable submodel guarantees that the one-step TMLE update of the super-learner of the nuisance parameters is not driven by the nuisance parameter component that is hardest to estimate, which might have finite sample advantages. Nonetheless, our asymptotic efficiency theorem applies to any local least favorable submodel.

The fact that a one-step TMLE already solves the efficient influence curve equation is particularly important in problems in which the TMLE update step is very demanding due to a high complexity of the efficient influence curve. In addition, a one-step TMLE has a more predictable robust behavior than a limit of an iterative algorithm. We could have focused on the universal least favorable submodels so that the TMLE is always a one-step TMLE, but in various problems local least favorable submodels are easier to fit and can thus have practical advantages.

By now, we have also implemented the HAL-estimator for nonparametric regression and dimensions d ≤ 10, and established that its practical performance appears to be very good [22]. In addition, we implemented the HAL-TMLE for the ATE (i.e., our example) for such low dimensions, and the coverage of the confidence intervals has been remarkably good for normal sample sizes, suggesting that the asymptotics of the HAL-TMLE kick in at smaller sample sizes than theory would predict. We suspect that part of the reason for the excellent practical performance is the double robust nature of the second order remainder, which suggests more finite sample bias cancelation than an actual square of a difference. The practical implementation and evaluation of the HAL-estimator and HAL-TMLE across a diversity of problems remains an area of future research.

In this article we assumed independent and identically distributed observations. Nonetheless, this type of super learner and the resulting asymptotic efficiency of the one-step TMLE will be generalizable to a variety of dependent data structures such as data generated by a statistical graph that assumes sufficient conditional independencies so that the desired central limit theorems can still be established [4, 23–26].

This article focused on a CV-TMLE that represents the statistical target parameter Ψ(P) as a function $Ψ (Q_{1} (P), \dots, Q_{k_{1} + 1} (P))$ of variation independent nuisance parameters $(Q_{1}, \dots, Q_{k_{1} + 1})$. However, there are key examples in which representing Ψ(P) in terms of recursively defined nuisance parameters has key advantages. For example, the longitudinal one-step TMLE of causal effects of multiple time point interventions in [27, 28] relies on a sequential regression representation of the target parameter [29]. In this case, the next regression is defined as the regression of the previous regression on a shrinking history, across a number of regressions, one for each time point at which an intervention takes place. In this case, a super-learner of nuisance parameter $Q_{k}$ is based on a loss function $L_{1, k, Q_{k + 1}} (Q_{k})$ that depends on a next nuisance parameter $Q_{k + 1}$ (representing the outcome for the regression defining $Q_{k}$), k = 1, …, k₁ + 1. One would now start with obtaining the desired result for the super-learner of $Q_{k_{1} + 1}$, whose loss function does not depend on other nuisance parameters. For the second super-learner of $Q_{k_{1}}$ based on candidate estimators ${\hat{Q}}_{k_{1}, j}$, j = 1, …, J, we would use as cross-validated risk $E_{B_{n}} P_{n, B_{n}}^{1} L_{1, k_{1}, {\hat{Q}}_{k_{1} + 1} (P_{n, B_{n}}^{0})} ({\hat{Q}}_{k_{1}, j})$. In other words, one estimates the nuisance parameter of the loss-function based on the training sample. In [11, 30, 31] we establish oracle inequalities for the cross-validation selector based on loss-functions indexed by an unknown nuisance parameter, which now also rely on a remainder concerning the rate at which ${\hat{Q}}_{k_{1} + 1} (P_{n})$ converges to $Q_{k_{1} + 1, 0}$. In this manner, one can establish that the super-learner based on the candidates ${\hat{Q}}_{k_{1}, j}$ will converge at the same or better rate than the super-learner of $Q_{k_{1} + 1, 0}$. This process can be iterated to establish convergence of all the super-learners at the same or better rate than the initial super-learner of $Q_{k_{1} + 1, 0}$. Our asymptotic efficiency results for the one-step TMLE and one-step CV-TMLE can now be generalized to one-step TMLE and CV-TMLE that rely on sequential targeted learning. The disadvantage of sequential learning is that the behavior of previous super-learners affects the behavior of the next super-learners in the sequence, but the practical implementation of a sequential super-learner can be significantly easier.

Our general theorems and specifically the theorems for our example demonstrate that the model bound on the variance of the efficient influence curve heavily affects the stability of the TMLE, and that we can only let this bound converge to infinity at a slow rate when the dimension of the data is large. Therefore, knowing this bound instead of enforcing it in a data adaptive manner is crucial for good behavior of these efficient estimators. This is also evident from the well known finite sample behavior of various efficient estimators in causal inference and censored data models that almost always rely on using truncation of the treatment and/or censoring mechanism. If one uses highly data adaptive estimators, even when the censoring or treatment mechanism is bounded away from zero, the estimators of these nuisance parameters could easily get very close to zero, so that truncation is crucial. Careful data adaptive selection of this truncation level is therefore an important component in the definition of these efficient estimators.

Alternatively, one can define target parameters in such a way that the variance of their efficient influence curve is uniformly bounded over the model (e.g., [32]). For example, in our example we could have defined the target parameter $E Y_{d_{1}} - E Y_{d_{0}}$, where $d_{1} (W) = I ({\bar{G}}_{n} (W) > δ)$ and $d_{0} (W) = 1 - I (1 - {\bar{G}}_{n} (W) > δ)$, ${\bar{G}}_{n}$ is the super-learner of ${\bar{G}}_{0} = E_{0} (A | W)$, and δ > 0 is a user supplied constant. In this case, the static interventions have been replaced by data dependent realistic dynamic interventions that approximate the static interventions but are guaranteed to only carry out the intervention when there is enough support in the data. Because such parameters have a guaranteed amount of support in the data, the variance of the efficient influence curve is uniformly bounded over the model: i.e., $M_{D^{*}} < \infty$.
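The realistic dynamic interventions described above can be sketched as follows, assuming the rules d₁(W) = I(Ḡ_n(W) > δ) and d₀(W) = 1 − I(1 − Ḡ_n(W) > δ). The function name `realistic_rules` is hypothetical, and `g_hat` stands in for the fitted values Ḡ_n(W) of whatever estimator of the treatment mechanism is used.

```python
import numpy as np


def realistic_rules(g_hat, delta):
    """Treatment assignments under the two realistic dynamic rules (sketch).

    g_hat : estimated probabilities of A = 1 given W
    delta : user supplied support threshold, delta > 0
    """
    g_hat = np.asarray(g_hat, dtype=float)
    # d1 assigns treatment only where treatment has enough support.
    d1 = (g_hat > delta).astype(int)
    # d0 assigns control only where control has enough support.
    d0 = 1 - (1.0 - g_hat > delta).astype(int)
    return d1, d0
```

With `g_hat = [0.05, 0.5, 0.97]` and `delta = 0.1`, the first unit is left untreated under both rules and the last unit is treated under both rules, so only the middle unit contributes a contrast between the two interventions.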

Acknowledgments

This research is funded by NIH-grant 5R01AI074345-07. The author thanks Marco Carone, Antoine Chambaz, and Alex Luedtke for stimulating discussions, and the reviewers for their very helpful comments.

Appendix

A Oracle inequality for the cross-validation selector

Lemma 2 is a simple corollary of the following finite sample oracle inequality for cross-validation [11, 13], combined with exploiting the convexity of the loss function allowing us to bring the $E_{B_{n}}$ inside the loss-based dissimilarity.

Lemma 5

For any δ > 0, there exists a constant $C (M_{1 Q, n}, M_{2 Q, n}, δ) = 2 {(1 + δ)}^{2} (2 M_{1 Q, n} / 3 + M_{2 Q, n}^{2} / δ)$ such that

E_{0} {E_{B_{n}} d_{01} ({\hat{\bar{Q}}}_{k_{1 n}} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0})} \leq (1 + 2 δ) E_{0} {E_{B_{n}} min_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0})} + 2 C (M_{1 Q, n}, M_{2 Q, n}, δ) \frac{\log K_{1 n}}{n {\bar{B}}_{n}} .

Similarly, for any δ > 0,

E_{B_{n}} d_{01} ({\hat{\bar{Q}}}_{k_{1 n}} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0}) \leq (1 + 2 δ) E_{B_{n}} min_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0}) + R_{n},

where $E R_{n} \leq 2 C (M_{1 Q, n}, M_{2 Q, n}, δ) \frac{\log K_{1 n}}{n {\bar{B}}_{n}}$ .

If $\log K_{1 n} / n$ divided by $E_{B_{n}} {min}_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0})$ converges to zero in probability, then we also have

\frac{E_{B_{n}} d_{01} ({\hat{\bar{Q}}}_{k_{1 n}} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0})}{E_{B_{n}} {min}_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0})} \to_{p} 1.

Similarly, if $\log K_{1 n} / n$ divided by $E_{0} E_{B_{n}} {min}_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0})$ converges to zero, then we also have

\frac{E_{0} E_{B_{n}} d_{01} ({\hat{\bar{Q}}}_{k_{1 n}} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0})}{E_{0} E_{B_{n}} {min}_{k} d_{01} ({\hat{\bar{Q}}}_{k} (P_{n, B_{n}}^{0}), {\bar{Q}}_{0})} \to 1.

B Super learner of G₀

Completely analogous to the super-learner eq. (23), we can define such a super-learner of G₀, which we will do here. For an $M \in ℝ_{> 0}^{k_{2}}$ , let ${\hat{\bar{G}}}_{M} : M_{nonp} \to {\bar{G}}_{n, M} \subset F_{ν, M}$ be the MLE for which $d_{02} ({\bar{G}}_{n, M} = {\hat{\bar{G}}}_{M} (P_{n}), {\bar{G}}_{0 n}^{M}) = O_{P} (r_{\bar{G}}^{2} (n))$ . Let $K_{2, n, ν}$ be an ordered collection of k₂-dimensional constants, and consider the corresponding collection of candidate estimators ${\hat{\bar{G}}}_{M}$ with $M \in K_{2, n, ν}$ . We assume the index set $K_{2, n, ν}$ is increasing in n and that $\lim \sup_{n \to \infty} M_{K_{2, n, ν}} = max (M_{G, ν}, M_{L_{2} (G), ν})$ . Note that for all $M \in K_{2, n, ν}$ with $M > {‖ L_{2} ({\bar{G}}_{0}) ‖}_{ν}$ , we have that $d_{02} ({\hat{\bar{G}}}_{M} (P_{n}), {\bar{G}}_{0}) = O_{P} (n^{- λ_{2}})$ . In addition, let ${\hat{\bar{G}}}_{j} : M_{nonp} \to {\bar{G}}_{n}$ , $j \in K_{2, n, a}$ , be an additional collection of $K_{2, n, a}$ estimators of G₀. This defines a collection of $K_{2 n} = K_{2, n, ν} + K_{2, n, a}$ candidate estimators ${{\hat{\bar{G}}}_{k} : k \in K_{2 n}}$ of ${\bar{G}}_{0}$ .

We define the cross-validation selector as the index

k_{2 n} = {\hat{K}}_{2} (P_{n}) = \arg min_{k \in K_{2 n}} E_{B_{n}} P_{n, B_{n}}^{1} L_{2} ({\hat{\bar{G}}}_{k} (P_{n, B_{n}}^{0}))

that minimizes the cross-validated risk $E_{B_{n}} P_{n, B_{n}}^{1} L_{2} ({\hat{\bar{G}}}_{k} (P_{n, B_{n}}^{0}))$ over all choices k of candidate estimators. Our proposed super-learner of ${\bar{G}}_{0}$ is defined by

{\bar{G}}_{n} = \hat{\bar{G}} (P_{n}) = E_{B_{n}} {\hat{\bar{G}}}_{k_{2 n}} (P_{n, B_{n}}^{0}) .

(37)

The same Lemma 2 applies to this estimator $\hat{\bar{G}} (P_{n})$ of ${\bar{G}}_{0}$ .
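The cross-validation selector and the averaging over sample splits in eq. (37) can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the data are pairs (W, A), the loss is taken to be the squared error, the split scheme $B_{n}$ is V-fold, and the two candidate estimators `fit_constant` and `fit_two_bins` are hypothetical stand-ins for the library of candidates.

```python
import random

def fit_constant(train):
    """Candidate 1: ignore W and predict the marginal mean of A."""
    mean_a = sum(a for _, a in train) / len(train)
    return lambda w: mean_a

def fit_two_bins(train):
    """Candidate 2: separate means of A for W < 0.5 and W >= 0.5."""
    overall = sum(a for _, a in train) / len(train)
    low = [a for w, a in train if w < 0.5]
    high = [a for w, a in train if w >= 0.5]
    mean_low = sum(low) / len(low) if low else overall
    mean_high = sum(high) / len(high) if high else overall
    return lambda w: mean_low if w < 0.5 else mean_high

def cv_super_learner(data, candidates, n_folds=5, seed=0):
    """Select the candidate with the smallest cross-validated risk and
    return the average over splits of its training-sample fits."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    folds = [order[v::n_folds] for v in range(n_folds)]
    cv_risk = [0.0] * len(candidates)
    fits_per_split = []
    for validation in folds:
        held_out = set(validation)
        train = [data[i] for i in order if i not in held_out]
        fits = [fit(train) for fit in candidates]
        fits_per_split.append(fits)
        for k, g in enumerate(fits):
            # Empirical risk on the validation sample (squared error assumed).
            risk = sum((data[i][1] - g(data[i][0])) ** 2 for i in validation)
            cv_risk[k] += risk / len(validation) / n_folds
    k_selected = min(range(len(candidates)), key=lambda k: cv_risk[k])

    def g_bar(w):
        # Average over splits of the selected candidate's training fits.
        return sum(fits[k_selected](w) for fits in fits_per_split) / n_folds

    return k_selected, cv_risk, g_bar

# Simulated example: P(A = 1 | W) is 0.2 for W < 0.5 and 0.8 otherwise.
rng = random.Random(1)
data = []
for _ in range(400):
    w = rng.random()
    a = 1 if rng.random() < (0.2 if w < 0.5 else 0.8) else 0
    data.append((w, a))
k_selected, cv_risk, g_bar = cv_super_learner(data, [fit_constant, fit_two_bins])
```

Averaging the split-specific fits, rather than refitting the selected candidate on the full sample, mirrors the definition in eq. (37).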

Lemma 6

Recall the definition of the model bounds $M_{1 G, n}$ , $M_{2 G, n}$ in eq. (18), and let $C (M_{1}, M_{2}, δ) \equiv 2 {(1 + δ)}^{2} (2 M_{1} / 3 + M_{2}^{2} / δ)$ . For any fixed δ > 0,

d_{02} ({\bar{G}}_{n}, {\bar{G}}_{0 n}) \leq (1 + 2 δ) E_{B_{n}} min_{k \in K_{2 n}} d_{02} ({\hat{\bar{G}}}_{k} (P_{n, B_{n}}^{0}), {\bar{G}}_{0 n}) + O_{P} (C (M_{1 G, n}, M_{2 G, n}, δ) \frac{\log K_{2 n}}{n}),

If for each fixed δ > 0, $C (M_{1 G, n}, M_{2 G, n}, δ) \log K_{2 n} / n$ divided by $E_{B_{n}} {min}_{k} d_{02} ({\hat{\bar{G}}}_{k} (P_{n, B_{n}}^{0}), {\bar{G}}_{0 n})$ is $o_{P} (1)$ , then

\frac{d_{02} (\hat{\bar{G}} (P_{n}), {\bar{G}}_{0 n})}{E_{B_{n}} {min}_{k} d_{02} ({\hat{\bar{G}}}_{k} (P_{n, B_{n}}^{0}), {\bar{G}}_{0 n})} - 1 = o_{P} (1) .

If for a fixed δ > 0, $E_{B_{n}} {min}_{k} d_{02} ({\hat{\bar{G}}}_{k} (P_{n, B_{n}}^{0}), {\bar{G}}_{0 n}) = O_{P} (C (M_{1 G, n}, M_{2 G, n}, δ) \log K_{2 n} / n)$ , then

d_{02} (\hat{\bar{G}} (P_{n}), {\bar{G}}_{0 n}) = O_{P} (\frac{C (M_{1 G, n}, M_{2 G, n}, δ) \log K_{2 n}}{n}) .

Suppose that for each fixed M the conditions of Lemma 1 hold with negligible numerical approximation error r_n, so that $d_{02} ({\bar{G}}_{n, M}, {\bar{G}}_{0 n}^{M}) = O_{P} (r_{\bar{G}}^{2} (n))$ . Let λ₂ be chosen so that $r_{\bar{G}}^{2} (n) = O (n^{- λ_{2}})$ . For each fixed δ > 0, we have

d_{02} (\hat{\bar{G}} (P_{n}), {\bar{G}}_{0 n}) = O_{P} (n^{- λ_{2}}) + O_{P} (C (M_{1 G, n}, M_{2 G, n}, δ) \frac{\log K_{2 n}}{n}) .

(38)

C Empirical process results

Theorem 2.1 in [18] establishes the following result for a Donsker class $F_{n}$ with uniformly bounded envelope F_n and for which, for each $f \in F_{n}$ , $P_{0} f^{2} \leq δ^{2} P_{0} F_{n}^{2}$ :

E {‖ G_{n} ‖}_{F_{n}} \underset{\sim}{<} J (δ, F_{n}) (1 + \frac{J (δ, F_{n})}{δ^{2} n^{1 / 2} {‖ F_{n} ‖}_{P_{0}}}) {‖ F_{n} ‖}_{P_{0}},

where $G_{n} (f) = n^{1 / 2} (P_{n} - P_{0}) f$ and

J (δ, F_{n}) \equiv \sup_{Λ} \int_{0}^{δ} \log^{1 / 2} (1 + N (ε {‖ F_{n} ‖}_{Λ}, F_{n}, L^{2} (Λ))) d ε

is the entropy integral from 0 to δ. This definition of the entropy integral is slightly different from a common definition in which the supremum over Λ is taken within the integral.

Suppose we want a bound on $\sup_{f \in F_{n}, {‖ f ‖}_{P_{0}} < δ} | G_{n} (f) |$ . Of course, ${‖ f ‖}_{P_{0}} < δ$ is equivalent with ${‖ f ‖}_{P_{0}} < δ_{1} {‖ F_{n} ‖}_{P_{0}}$ , where $δ_{1} = δ / {‖ F_{n} ‖}_{P_{0}}$ . Application of the above result with this choice of δ = δ₁ yields:

E \sup_{f \in F_{n}, {‖ f ‖}_{P_{0}} < δ} | G_{n} (f) | \underset{\sim}{<} J (δ / {‖ F_{n} ‖}_{P_{0}}, F_{n}) (1 + \frac{J (δ / {‖ F_{n} ‖}_{P_{0}}, F_{n}) {‖ F_{n} ‖}_{P_{0}}}{δ^{2} n^{1 / 2}}) {‖ F_{n} ‖}_{P_{0}} .

(39)

Suppose that $\sup_{Λ} \log^{1 / 2} (1 + N (ε {‖ F_{n} ‖}_{Λ}, F_{n}, L^{2} (Λ))) = O (ε^{- (1 - α)})$ for some α ∈ (0, 1). Then,

J (δ / {‖ F_{n} ‖}_{P_{0}}, F_{n}) = O (δ^{α} {‖ F_{n} ‖}_{P_{0}}^{- α}) .

Thus, we have

E \sup_{f \in F_{n}, {‖ f ‖}_{P_{0}} < δ} | G_{n} (f) | \underset{\sim}{<} δ^{α} {‖ F_{n} ‖}_{P_{0}}^{1 - α} + δ^{2 α - 2} n^{- 1 / 2} {‖ F_{n} ‖}_{P_{0}}^{2 - 2 α} .

Note that this is an increasing function of ${‖ F_{n} ‖}_{P_{0}}$ , since both exponents 1 − α and 2 − 2α are positive. Given a bound M_n so that ${‖ F_{n} ‖}_{P_{0}} < M_{n}$ , a conservative bound is obtained by replacing ${‖ F_{n} ‖}_{P_{0}}$ by M_n.
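For completeness, the two terms of the last display follow from eq. (39) by substituting the entropy integral bound term by term. Writing J for $J (δ / {‖ F_{n} ‖}_{P_{0}}, F_{n})$ , the worked substitution is:

```latex
\begin{align*}
J \, \| F_n \|_{P_0}
  &\lesssim \delta^{\alpha} \| F_n \|_{P_0}^{-\alpha} \, \| F_n \|_{P_0}
   = \delta^{\alpha} \| F_n \|_{P_0}^{1-\alpha}, \\
\frac{J^{2} \, \| F_n \|_{P_0}^{2}}{\delta^{2} n^{1/2}}
  &\lesssim \frac{\delta^{2\alpha} \| F_n \|_{P_0}^{-2\alpha} \, \| F_n \|_{P_0}^{2}}{\delta^{2} n^{1/2}}
   = \delta^{2\alpha-2} \, n^{-1/2} \, \| F_n \|_{P_0}^{2-2\alpha}.
\end{align*}
```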

This proves the following lemma.

Lemma 7

Consider $F_{n}$ with ${‖ F_{n} ‖}_{P_{0}} < M_{n}$ and $\sup_{Λ} \log^{1 / 2} (1 + N (ε {‖ F_{n} ‖}_{Λ}, F_{n}, L^{2} (Λ))) = O (ε^{- (1 - α)})$ for some α ∈ (0, 1). Then,

E \sup_{f \in F_{n}, {‖ f ‖}_{P_{0}} < r_{0} (n)} | G_{n} (f) | \underset{\sim}{<} {r_{0} (n) / M_{n}}^{α} M_{n} + {r_{0} (n) / M_{n}}^{2 α - 2} n^{- 1 / 2} .

If $r_{0} (n) < n^{- 1 / 4}$ , one should select $r_{0} (n) = n^{- 1 / 4}$ in the above right-hand side, giving the bound:

E \sup_{f \in F_{n}, {‖ f ‖}_{P_{0}} < r_{0} (n)} | G_{n} (f) | \underset{\sim}{<} {n^{- 1 / 4} / M_{n}}^{α} M_{n} + {M_{n}}^{2 - 2 α} n^{- α / 2} .

Consider eq. (39) again, but suppose now that $\sup_{Λ} N (ε {‖ F_{n} ‖}_{Λ}, F_{n}, L^{2} (Λ)) = O (ε^{- p})$ for some p > 0. Then,

J (δ / {‖ F_{n} ‖}_{P_{0}}, F_{n}) = p^{1 / 2} \int_{0}^{δ / {‖ F_{n} ‖}_{P_{0}}} \log^{1 / 2} ε^{- 1} d ε .

We can conservatively bound $\log^{1 / 2} ε^{- 1}$ by $\log ε^{- 1}$ for ε small enough, and then note $\int_{0}^{x} \log ε^{- 1} d ε = x (1 - \log x)$ . Thus, we have the bound

J (δ / {‖ F_{n} ‖}_{P_{0}}, F_{n}) = O (δ {‖ F_{n} ‖}_{P_{0}}^{- 1} (1 - \log (δ / {‖ F_{n} ‖}_{P_{0}}))) .
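The closed form x(1 − log x) for the integral of $\log ε^{- 1}$ over (0, x] can be checked numerically. The sketch below uses a plain midpoint rule; the function names are illustrative only.

```python
import math

def integral_log_inverse(x, n_cells=100_000):
    """Midpoint-rule approximation of the integral of log(1/eps) over (0, x]."""
    h = x / n_cells
    return sum(-math.log((i + 0.5) * h) for i in range(n_cells)) * h

def closed_form(x):
    """Closed form x * (1 - log x) of the same integral."""
    return x * (1.0 - math.log(x))

def sqrt_log_is_dominated(eps):
    """Check log^{1/2}(1/eps) <= log(1/eps), which holds once log(1/eps) >= 1."""
    value = math.log(1.0 / eps)
    return math.sqrt(value) <= value
```

The domination of the square root by the logarithm holds for all ε ≤ 1/e, which is the sense of "ε small enough" in the text.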

By plugging this latter bound into eq. (39) we obtain

E \sup_{f \in F_{n}, {‖ f ‖}_{P_{0}} < δ} | G_{n} (f) | \underset{\sim}{<} δ (1 - \log (δ / {‖ F_{n} ‖}_{P_{0}})) + {(1 - \log (δ / {‖ F_{n} ‖}_{P_{0}}))}^{2} n^{- 1 / 2} .

Note that the right-hand side is increasing in ${‖ F_{n} ‖}_{P_{0}}$ . So if we know that ${‖ F_{n} ‖}_{P_{0}} \leq M_{n}$ for some M_n, we obtain the bound

E \sup_{f \in F_{n}, {‖ f ‖}_{P_{0}} < δ} | G_{n} (f) | \underset{\sim}{<} δ (1 - \log (δ / M_{n})) + {(1 - \log (δ / M_{n}))}^{2} n^{- 1 / 2} .

Lemma 8

Consider $F_{n}$ with ${‖ F_{n} ‖}_{P_{0}} < M_{n}$ and $\sup_{Λ} N (ε {‖ F_{n} ‖}_{Λ}, F_{n}, L^{2} (Λ)) = O (ε^{- p})$ for some p > 0. Then,

E \sup_{f \in F_{n}, {‖ f ‖}_{P_{0}} < r_{0} (n)} | G_{n} (f) | \underset{\sim}{<} r_{0} (n) (1 - \log \frac{r_{0} (n)}{M_{n}}) + {(1 - \log \frac{r_{0} (n)}{M_{n}})}^{2} n^{- 1 / 2} .

(40)

The following lemma is proved by first applying Lemma 7 to $(P_{n} - P_{0}) f_{n}$ with $r_{0} (n) = 1$ to obtain an initial rate $r_{0} (n)$ , and then applying Lemma 7 again with this new initial rate $r_{0} (n)$ .

Lemma 9

Consider the following setting:

f_{n} \in F_{n}, \quad {‖ F_{n} ‖}_{P_{0}} \leq M_{n}, \quad \sup_{Λ} \log^{1 / 2} (1 + N (ε {‖ F_{n} ‖}_{Λ}, F_{n}, L^{2} (Λ))) = O (ε^{- (1 - α)}), \quad α \in (0, 1), \quad d_{0} (Q_{n}, Q_{0}) \leq | (P_{n} - P_{0}) f_{n} |, \quad {‖ f_{n} ‖}_{P_{0}} \leq M_{2 n} {d_{0} (Q_{n}, Q_{0})}^{1 / 2}, \quad 1 < M_{n} \underset{\sim}{<} n^{1 / (4 (1 - α))} .

Then

d_{0} (Q_{n}, Q_{0}) \underset{\sim}{<} n^{- 1 / 2} n^{- α / 4} C (M_{n}, M_{2 n}, α),

where

C (M_{n}, M_{2 n}, α) = M_{2 n}^{α} M_{n}^{1 - α / 2 - α^{2} / 2} + n^{- α / 4} M_{2 n}^{2 α - 2} M_{n}^{1 - α^{2}} .

Proof

We have d₀(Q_n, Q₀) ≤ | (P_n − P₀)f_n |. We apply Lemma 7 to the right-hand side with r₀(n) = 1. This yields

E | (P_{n} - P_{0}) f_{n} | \underset{\sim}{<} n^{- 1 / 2} M_{n}^{1 - α} + M_{n}^{2 - 2 α} n^{- 1} .

This shows $d_{0} (Q_{n}, Q_{0}) \underset{\sim}{<} n^{- 1 / 2} M_{n}^{(1 - α)} + M_{n}^{2 - 2 α} n^{- 1}$ . Using that $\sqrt{x + y} \leq \sqrt{x} + \sqrt{y}$ , this implies $d_{0} {(Q_{n}, Q_{0})}^{1 / 2} \underset{\sim}{<} n^{- 1 / 4} M_{n}^{(1 - α) / 2} + M_{n}^{1 - α} n^{- 1 / 2}$ . By assumption, this implies

{‖ f_{n} ‖}_{P_{0}} \underset{\sim}{<} n^{- 1 / 4} M_{2 n} M_{n}^{(1 - α) / 2} + n^{- 1 / 2} M_{2 n} M_{n}^{1 - α} .

The right-hand side is of order $n^{- 1 / 4} M_{2 n} M_{n}^{(1 - α) / 2}$ if $M_{n} \underset{\sim}{<} n^{1 / (4 (1 - α))}$ , which holds by assumption. Let $r_{0} (n) = n^{- 1 / 4} M_{2 n} M_{n}^{(1 - α) / 2}$ . We now apply Lemma 7 to $(P_{n} - P_{0}) f_{n}$ with this choice of $r_{0} (n)$ . Note that $r_{0} (n)$ converges to zero at a rate slower than or equal to $n^{- 1 / 4}$ . Thus, application of Lemma 7 gives the following bound:

E | (P_{n} - P_{0}) f_{n} | \underset{\sim}{<} n^{- 1 / 2} r_{0} {(n)}^{α} M_{n}^{1 - α} + r_{0} {(n)}^{2 α - 2} M_{n}^{2 - 2 α} n^{- 1} \underset{\sim}{<} n^{- 1 / 2} n^{- α / 4} M_{2 n}^{α} M_{n}^{1 - α / 2 - α^{2} / 2} + n^{- 1 / 2 (1 + α)} M_{2 n}^{2 α - 2} M_{n}^{1 - α^{2}} .

We can factor out $n^{- 1 / 2} n^{- α / 4}$ , giving the bound

\underset{\sim}{<} n^{- 1 / 2} n^{- α / 4} {M_{2 n}^{α} M_{n}^{1 - α / 2 - α^{2} / 2} + n^{- α / 4} M_{2 n}^{2 α - 2} M_{n}^{1 - α^{2}}} .

This completes the proof of the lemma. □
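The exponent bookkeeping in the last two displays of the proof can be verified numerically. The sketch below evaluates the Lemma 7 bound (scaled by $n^{- 1 / 2}$ ) at $r_{0} (n) = n^{- 1 / 4} M_{2 n} M_{n}^{(1 - α) / 2}$ and compares it with the factored form; the function names and the particular values of n, M_n, M_2n and α are arbitrary illustrations.

```python
import math

def lemma7_bound(r0, n, m_n, alpha):
    """n^{-1/2} r0^alpha M^{1-alpha} + r0^{2 alpha - 2} M^{2 - 2 alpha} n^{-1}."""
    first = n ** -0.5 * r0 ** alpha * m_n ** (1 - alpha)
    second = r0 ** (2 * alpha - 2) * m_n ** (2 - 2 * alpha) / n
    return first + second

def factored_bound(n, m_n, m_2n, alpha):
    """n^{-1/2} n^{-alpha/4} times the bracketed factor in the proof."""
    bracket = (m_2n ** alpha * m_n ** (1 - alpha / 2 - alpha ** 2 / 2)
               + n ** (-alpha / 4) * m_2n ** (2 * alpha - 2) * m_n ** (1 - alpha ** 2))
    return n ** -0.5 * n ** (-alpha / 4) * bracket

def initial_rate(n, m_n, m_2n, alpha):
    """r0(n) = n^{-1/4} M_2n M_n^{(1 - alpha) / 2}."""
    return n ** -0.25 * m_2n * m_n ** ((1 - alpha) / 2)
```

Both expressions agree for any positive n, M_n, M_2n and α in (0, 1), which confirms the exponent $2 α - 2$ on $M_{2 n}$ in the second term of the proof's final display.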

The following lemma is needed in the analysis of the CV-TMLE, where $f_{n, ε} = D^{*} (Q_{n, B_{n}, ε}, G_{n, B_{n}}) - D^{*} (Q_{0}, G_{0})$ .

Lemma 10

Let $f_{n, ε_{n}} \in F_{n} = {f_{n, ε} : ε}$ where ε varies over a bounded set in $ℝ^{p}$ and $f_{n, ε}$ is a non-random function (i.e., not based on data O₁, …, O_n). Let F_n be the envelope of the class $F_{n}$ and let $M_{D^{*}, n}$ be such that ${‖ F_{n} ‖}_{P_{0}} < M_{D^{*}, n}$ . Assume that $\sup_{Λ} N (ε {‖ F_{n} ‖}_{Λ}, F_{n}, L^{2} (Λ)) = O (ε^{- p})$ . Suppose that ${‖ f_{n, ε_{n}} ‖}_{P_{0}} = o_{P} (r_{D^{*}} (n))$ for a rate $r_{D^{*}} (n) \to 0$ . Then, $G_{n} (f_{n, ε_{n}}) = G_{n} ({\tilde{f}}_{n, ε_{n}}) + E_{n}$ , where

E_{0} | G_{n} ({\tilde{f}}_{n, ε_{n}}) | = O (r_{D^{*}} (n) (1 - \log (r_{D^{*}} (n) / M_{D^{*}, n}))),

and E_n equals 0 with probability tending to 1. Thus, if $r_{D^{*}} (n) \log (M_{D^{*}, n} / r_{D^{*}} (n)) = o (1)$ , then $G_{n} (f_{n, ε_{n}}) = o_{P} (1)$ .

Proof

For notational convenience, denote $f_{n, ε_{n}}$ by f_n. With probability tending to 1, ${‖ f_{n} ‖}_{P_{0}} < r_{D^{*}} (n)$ . We have $f_{n} = f_{n} I ({‖ f_{n} ‖}_{P_{0}} < r_{D^{*}} (n)) + f_{n} I ({‖ f_{n} ‖}_{P_{0}} \geq r_{D^{*}} (n))$ . Denote the first term by ${\tilde{f}}_{n}$ and note that the second term equals zero with probability tending to 1. This shows that $G_{n} (f_{n}) = G_{n} ({\tilde{f}}_{n}) + E_{n}$ , where E_n equals zero with probability tending to 1 while ${‖ {\tilde{f}}_{n} ‖}_{P_{0}} < r_{D^{*}} (n)$ with probability 1. Application of Lemma 8 shows that

E | G_{n} ({\tilde{f}}_{n}) | \underset{\sim}{<} r_{D^{*}} (n) \log (M_{D^{*}, n} / r_{D^{*}} (n)) .

This completes the proof. □

D Implementing the HAL-estimator

For notational convenience, consider the case that $Q_{n} = Q$ . The M-specific HAL-estimator is defined for a given M < ∞ vector, by minimizing $P_{n} L_{1} (\bar{Q})$ over all $\bar{Q} \in \bar{Q}$ for which the variation norm of $L_{1} (\bar{Q})$ is bounded by this M. We need to calculate this estimator for a series of M-vectors ranging from 0 to infinity, and we will then select M with cross-validation (see next section). Suppose that, for a fixed n, there exists an $M_{n, v} \in ℝ^{k_{1}}$ so that for all $\bar{Q} \in \bar{Q}$ , ${‖ L_{1} (\bar{Q}) ‖}_{v} \leq M_{n, v} {‖ \bar{Q} ‖}_{v}$ . This is typically an assumption that is trivially satisfied. Then, calculating this collection of M-specific HAL-estimators across a set of M-vectors can also be achieved by computing an MLE of $\bar{Q} \to P_{n} L_{1} (\bar{Q})$ over all $\bar{Q} \in \bar{Q}$ with ${‖ \bar{Q} ‖}_{v} < M$ , for a series of M-vectors. Therefore we rephrase our goal as to compute a ${\bar{Q}}_{n, M}$ so that

P_{n} L_{1} ({\bar{Q}}_{n, M}) = min_{\bar{Q} \in {\bar{Q}}_{M}} P_{n} L_{1} (\bar{Q}) + r_{n},

(41)

where in this section we redefine ${\bar{Q}}_{M} = {\bar{Q} \in \bar{Q} : {‖ \bar{Q} ‖}_{v} < M}$ , and r_n is a controlled small number. We will now address a strategy for implementation of this MLE ${\bar{Q}}_{n, M}$ .

D.1 Approximating a function with variation norm M by a linear combination of indicator basis functions with L¹-norm of the coefficient vector equal to M

Any cadlag function $f \in D [0, τ]$ with finite variation norm can be represented as follows:

f (x) = f (0) + \sum_{s \subset {1, \dots, p}} \int_{(0_{s}, x_{s}]} f (d u_{s}, 0_{- s}) .

For each subset s of size | s |, consider a partitioning of (0_s, τ_s] in | s |-dimensional cubes with width h_m. Let’s denote these cubes with $R_{h_{m}} (j, s)$ , where j is the index of the j-th cube and j runs over $O (1 / h_{m}^{| s |})$ cubes. Let $R_{h_{m}} (s)$ be the index set, so that we can write $(0_{s}, τ_{s}] = \cup_{j \in R_{h_{m}} (s)} R_{h_{m}} (j, s)$ . By definition of an integral, we have $f (x) = \lim_{h_{m} \to 0} f_{m} (x)$ , where

f_{m} (x) = f_{m} (f) (x) = f (0) + \sum_{s \subset {1, \dots, p}} \sum_{j \in R_{h_{m}} (s)} ϕ_{h_{m}, j}^{s} (x) β_{h_{m}, j}^{s},

$β_{h_{m}, j}^{s} = f (R_{h_{m}} (j, s))$ is the measure f assigns to the cube $R_{h_{m}} (j, s)$ , and $ϕ_{h_{m}, j}^{s} (x) = I (m_{h_{m}} (j, s) \leq x_{s})$ is the indicator that the midpoint $m_{h_{m}} (j, s)$ of the cube $R_{h_{m}} (j, s)$ is less than or equal to $x_{s}$ . By the dominated convergence theorem, it also follows that ${‖ f_{m} (f) - f ‖}_{Λ} \to 0$ for any $L^{2} (Λ)$ -norm. Moreover, the variation norm of f is approximated by the sum of the absolute values of all the coefficients $β_{h_{m}, j}^{s}$ :

{‖ f ‖}_{v} = \lim_{h_{m} \to 0} | f (0) | + \sum_{s \subset {1, \dots, p}} \sum_{j \in R_{h_{m}} (s)} | β_{h_{m}, j}^{s} | .

Let β₀ denote the intercept f(0). Thus, we conclude that given a function $f \in F_{v, M}$ , we can approximate it with a finite linear combination f_m(f) of indicator basis functions $ϕ_{h_{m}, j}^{s}$ plus an intercept β₀ for which the L¹-norm of its coefficient vector $(β_{0}, (β_{h_{m}, j}^{s} : j, s))$ approximates the variation norm of f. The support points $m_{h_{m}} (j, s)$ could also be selected based on the data support {O₁, …, O_n}. Such a strategy is presented and implemented for the HAL-estimator of a nonparametric regression in [22]. In the latter paper we select n support points for each s-specific measure, possibly resulting in as many as $n 2^{d}$ basis functions.
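To make the construction concrete, the following sketch builds indicator basis functions with support points taken from the observed data, one per observation and per non-empty subset s of coordinates. It is an illustration only: the names `hal_basis`, `phi`, `evaluate` and `l1_norm` are hypothetical, and the details of the implementation in [22] may differ.

```python
from itertools import combinations

def hal_basis(points):
    """Return a list of (s, knot) pairs: s is a tuple of coordinate indices
    and knot holds the corresponding coordinates of one observed point."""
    d = len(points[0])
    basis = []
    for size in range(1, d + 1):
        for s in combinations(range(d), size):
            for x in points:
                basis.append((s, tuple(x[j] for j in s)))
    return basis

def phi(basis_fn, x):
    """Indicator that the knot is <= x in every coordinate of s."""
    s, knot = basis_fn
    return 1 if all(knot_j <= x[j] for j, knot_j in zip(s, knot)) else 0

def evaluate(intercept, coefs, basis, x):
    """Value at x of the linear combination of basis functions."""
    return intercept + sum(b * phi(fn, x) for b, fn in zip(coefs, basis))

def l1_norm(intercept, coefs):
    """L1-norm of the coefficient vector, the analogue of the variation norm."""
    return abs(intercept) + sum(abs(b) for b in coefs)

points = [(0.2, 0.7), (0.5, 0.1), (0.9, 0.4)]
basis = hal_basis(points)
```

With n observations in dimension d this yields n(2^d − 1) basis functions before removing duplicates, in line with the order n 2^d mentioned above.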

D.2 An approximation of the MLE over functions of bounded variation using L¹-penalization

For an $M \in ℝ_{> 0}$ , let’s define

F_{v, M}^{m} = {\sum_{s \subset {1, \dots, p}} \sum_{j \in R_{h_{m}} (s)} ϕ_{h_{m}, j}^{s} (x) β_{h_{m}, j}^{s} : \sum_{s, j} | β_{h_{m}, j}^{s} | \leq M}

as the collection of all these finite linear combinations of this collection of basis functions under the constraint that its L¹-norm is bounded by M. Consider the case that the parameter space ${\bar{Q}}_{j}$ for ${\bar{Q}}_{j} (P)$ , j ∈ {1, …, k₁} is nonparametric, so that the MLE over ${\bar{Q}}_{j, M} = F_{v, M}$ of ${\bar{Q}}_{j 0}$ would correspond with minimizing over $F_{v, M}$ . Note that this does not imply that the model $M$ is nonparametric: for example, the data distribution could be parameterized in terms of unspecified functions ${\bar{Q}}_{j}$ of dimension d₁(j), j = 1, …, k₁, and unspecified functions ${\bar{G}}_{j}$ of dimension d₂(j), j = 1, …, k₂.

The next lemma proves that we can approximate such an MLE over $F_{v, M}$ for a loss function $L_{1 j} ({\bar{Q}}_{j})$ by an MLE over $F_{v, M}^{m}$ by selecting m large enough.

Lemma 11

Let $M \in ℝ_{> 0}$ be given. Consider $f_{0} \in F_{v, M} \subset D [0, τ]$ so that for a loss function (O, f) → L(f)(O), we have $P_{0} L (f_{0}) = {min}_{f \in F_{v, M}} P_{0} L (f)$ . Assume that if $f_{m} \in F_{v, M}$ converges pointwise to an $f \in F_{v, M}$ on [0, τ], then L(f_m) converges pointwise to L(f) on a support of P₀, including the support of the empirical distribution P_n. Let $f_{0, m} \in F_{v, M}^{m}$ be such that $P_{0} L (f_{0, m}) = {min}_{f \in F_{v, M}^{m}} P_{0} L (f)$ . We have $P_{0} (L (f_{0, m}) - L (f_{0})) \to 0$ as h_m → 0.

Consider now an $f_{n} \in F_{v, M}$ so that $P_{n} L (f_{n}) = {min}_{f \in F_{v, M}} P_{n} L (f)$ , and let $f_{n, m} \in F_{v, M}^{m}$ be such that $P_{n} L (f_{n, m}) = {min}_{f \in F_{v, M}^{m}} P_{n} L (f)$ . We have $P_{n} (L (f_{n, m}) - L (f_{n})) \to 0$ as h_m → 0.

Proof

We want to show that $P_{0} (L (f_{0, m}) - L (f_{0})) \to 0$ as h_m → 0. By the approximation presented in the previous section, since $f_{0} \in F_{v, M}$ , we can find a sequence $f_{0, m}^{*} \in F_{v, M}^{m}$ so that $f_{0, m}^{*} \to f_{0}$ as h_m → 0, pointwise and in L²(P₀) norm. By assumption and the dominated convergence theorem, this implies $P_{0} L (f_{0, m}^{*}) - P_{0} L (f_{0})$ also converges to zero as h_m → 0. But, since $f_{0, m}$ minimizes $P_{0} L (f)$ over all $f \in F_{v, M}^{m}$ , we have

0 \leq P_{0} L (f_{0, m}) - P_{0} L (f_{0}) \leq P_{0} L (f_{0, m}^{*}) - P_{0} L (f_{0}) \to 0,

which proves that $P_{0} L (f_{0, m}) - P_{0} L (f_{0}) \to 0$ , as h_m → 0.

We now want to show that $P_{n} (L (f_{n, m}) - L (f_{n})) \to 0$ as h_m → 0. Since $f_{n} \in F_{v, M}$ , we can find a sequence $f_{n, m}^{*} \in F_{v, M}^{m}$ so that $f_{n, m}^{*} \to f_{n}$ as h_m → 0, pointwise and in L²(P_n)-norm.

Then, by assumption and the dominated convergence theorem, $P_{n} L (f_{n, m}^{*}) - P_{n} L (f_{n})$ also converges to zero as h_m → 0. But, since $f_{n, m}$ minimizes $P_{n} L (f)$ over all $f \in F_{v, M}^{m}$ , we have

0 \leq P_{n} L (f_{n, m}) - P_{n} L (f_{n}) \leq P_{n} L (f_{n, m}^{*}) - P_{n} L (f_{n}) \to 0,

which proves that $P_{n} L (f_{n, m}) - P_{n} L (f_{n}) \to 0$ , as h_m → 0. □
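As an illustration of minimizing an empirical risk over linear combinations of indicator basis functions with an L¹ restriction, the sketch below fits a one-dimensional example by coordinate descent. Several choices here are assumptions made for the example and are not taken from the article: the loss is the squared error, the L¹ constraint is replaced by its penalized (Lagrangian) form with penalty `lam`, the intercept is left unpenalized, and the knots are the observed points.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used by the coordinate-wise lasso update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def indicator_design(x):
    """Design matrix with entries I(x_i >= knot_j), knots at the observed x."""
    return [[1.0 if x_i >= knot else 0.0 for knot in x] for x_i in x]

def lasso_coordinate_descent(design, y, lam, n_sweeps=200):
    """Minimize (1/2n) sum (y - b0 - design b)^2 + lam * sum |b_j|."""
    n, p = len(design), len(design[0])
    intercept = sum(y) / n
    beta = [0.0] * p
    resid = [y_i - intercept for y_i in y]
    col_ms = [sum(design[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        # Exact minimization over the unpenalized intercept.
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            rho = sum(design[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_ms[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= design[i][j] * delta
                beta[j] = new_beta
    return intercept, beta

def mean_squared_error(design, y, intercept, beta):
    """Empirical squared-error risk of the fitted linear combination."""
    fitted = [intercept + sum(b * v for b, v in zip(beta, row)) for row in design]
    return sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, fitted)) / len(y)

# Example: a single jump at 0.5 observed without noise.
x = [i / 40 for i in range(40)]
y = [1.0 if x_i >= 0.5 else 0.0 for x_i in x]
design = indicator_design(x)
fit_small = lasso_coordinate_descent(design, y, lam=0.01)
fit_large = lasso_coordinate_descent(design, y, lam=10.0)
l1_small = sum(abs(b) for b in fit_small[1])
l1_large = sum(abs(b) for b in fit_large[1])
```

Each value of the penalty corresponds to some bound M on the L¹-norm of the coefficients, so scanning over `lam` plays the role of scanning over M; a large penalty shrinks all jump coefficients to zero, while a small penalty recovers the jump.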

D.3 An approximation of the MLE over the subspace ${\bar{Q}}_{M}$ by an MLE over an L₁-constrained linear model

Above we defined a mapping from a function $f \in F_{v, M}$ into a linear combination $f_{m} (f) \in F_{v, M}^{m}$ of basis functions for which the norm of the coefficient vector approximates the variation norm of f. The following lemma proves in general that we can compute the MLE over ${\bar{Q}}_{M} = \bar{Q} \cap F_{v, M}$ with the MLE over ${\bar{Q}}_{M}^{m} = {{\bar{Q}}_{m} (\bar{Q}) : \bar{Q} \in {\bar{Q}}_{M}}$ , which is a collection of these linear combinations of the basis functions for which the L¹-norm of the coefficient vector is bounded by M. Note that ${\bar{Q}}_{M}^{m}$ is typically not a submodel of ${\bar{Q}}_{M}$ , but it is obtained by replacing each element $\bar{Q}$ in ${\bar{Q}}_{M}$ with its approximation ${\bar{Q}}_{m} (\bar{Q})$ .

Lemma 12

Assume that if ${\bar{Q}}_{m} \in F_{v, M}$ converges pointwise to a $\bar{Q} \in F_{v, M}$ on ${[0, τ]}^{k_{1}}$ , then $L_{1} ({\bar{Q}}_{m})$ converges pointwise to $L_{1} (\bar{Q})$ on a support of P₀, including the support of the empirical distribution P_n. For an $M \in ℝ^{k_{1}}$ , let ${\bar{Q}}_{M} = \bar{Q} \cap F_{v, M}^{k + 1} = {\bar{Q} (P) : P \in M, \bar{Q} (P) \in F_{v, M}}$ be all functions in the parameter space for ${\bar{Q}}_{0}$ that have a variation norm smaller than M < ∞. Let ${\bar{Q}}_{M}^{m} = {{\bar{Q}}_{m} (\bar{Q}) : \bar{Q} \in {\bar{Q}}_{M}}$ , where ${\bar{Q}}_{m} (\bar{Q})$ is defined above as the finite dimensional linear combination of the basis functions ${ϕ_{h_{m}, j}^{s} : j, s}$ with coefficient vector ${β_{h_{m}, j}^{s} (\bar{Q}) : j, s}$ .

Consider a ${\bar{Q}}_{0, M} \in {\bar{Q}}_{M}$ so that $P_{0} L_{1} ({\bar{Q}}_{0, M}) = {min}_{\bar{Q} \in {\bar{Q}}_{M}} P_{0} L_{1} (\bar{Q})$ , and let ${\bar{Q}}_{0, M}^{m} \in {\bar{Q}}_{M}^{m}$ be such that $P_{0} L_{1} ({\bar{Q}}_{0, M}^{m}) = {min}_{\bar{Q} \in {\bar{Q}}_{M}^{m}} P_{0} L_{1} (\bar{Q})$ . Then, $P_{0} (L_{1} ({\bar{Q}}_{0, M}^{m}) - L_{1} ({\bar{Q}}_{0, M})) \to 0$ as h_m → 0.

Similarly, consider a ${\bar{Q}}_{n, M} \in {\bar{Q}}_{M}$ so that $P_{n} L_{1} ({\bar{Q}}_{n, M}) = {min}_{\bar{Q} \in {\bar{Q}}_{M}} P_{n} L_{1} (\bar{Q})$ , and let ${\bar{Q}}_{n, M}^{m} \in {\bar{Q}}_{M}^{m}$ be such that $P_{n} L_{1} ({\bar{Q}}_{n, M}^{m}) = {min}_{\bar{Q} \in {\bar{Q}}_{M}^{m}} P_{n} L_{1} (\bar{Q})$ . Then, $P_{n} (L_{1} ({\bar{Q}}_{n, M}^{m}) - L_{1} ({\bar{Q}}_{n, M})) \to 0$ as h_m → 0.

Proof

We want to show that $P_{0} (L_{1} ({\bar{Q}}_{0, M}^{m}) - L_{1} ({\bar{Q}}_{0, M})) \to 0$ as h_m → 0. By the approximation presented in the previous section, since ${\bar{Q}}_{0, M} \in F_{v, M}$ , we can find a sequence ${\bar{Q}}_{0, M}^{m, *} \in F_{v, M}^{m}$ so that ${\bar{Q}}_{0, M}^{m, *} \to {\bar{Q}}_{0, M}$ as h_m → 0, pointwise and in L²(P₀) norm. By assumption and the dominated convergence theorem, this implies $P_{0} L_{1} ({\bar{Q}}_{0, M}^{m, *}) - P_{0} L_{1} ({\bar{Q}}_{0, M})$ also converges to zero as h_m → 0. But, since ${\bar{Q}}_{0, M}^{m}$ minimizes $P_{0} L_{1} (\bar{Q})$ over all $\bar{Q} \in {\bar{Q}}_{M}^{m}$ , we have

0 \leq P_{0} L_{1} ({\bar{Q}}_{0, M}^{m}) - P_{0} L_{1} ({\bar{Q}}_{0, M}) \leq P_{0} L_{1} ({\bar{Q}}_{0, M}^{m, *}) - P_{0} L_{1} ({\bar{Q}}_{0, M}) \to 0,

which proves that $P_{0} L_{1} ({\bar{Q}}_{0, M}^{m}) - P_{0} L_{1} ({\bar{Q}}_{0, M}) \to 0$ , as h_m → 0.

We now want to show that $P_{n} (L_{1} ({\bar{Q}}_{n, M}^{m}) - L_{1} ({\bar{Q}}_{n, M})) \to 0$ as h_m → 0. Since ${\bar{Q}}_{n, M} \in F_{v, M}$ , we can find a sequence ${\bar{Q}}_{n, M}^{m, *} \in F_{v, M}^{m}$ so that ${\bar{Q}}_{n, M}^{m, *} \to {\bar{Q}}_{n, M}$ as h_m → 0, pointwise and in L²(P_n)-norm.

Then, by assumption and the dominated convergence theorem, $P_{n} L_{1} ({\bar{Q}}_{n, M}^{m, *}) - P_{n} L_{1} ({\bar{Q}}_{n, M})$ also converges to zero as h_m → 0. But, since ${\bar{Q}}_{n, M}^{m}$ minimizes $P_{n} L_{1} (\bar{Q})$ over all $\bar{Q} \in {\bar{Q}}_{M}^{m}$ , we have

0 \leq P_{n} L_{1} ({\bar{Q}}_{n, M}^{m}) - P_{n} L_{1} ({\bar{Q}}_{n, M}) \leq P_{n} L_{1} ({\bar{Q}}_{n, M}^{m, *}) - P_{n} L_{1} ({\bar{Q}}_{n, M}) \to 0,

which proves that $P_{n} L_{1} ({\bar{Q}}_{n, M}^{m}) - P_{n} L_{1} ({\bar{Q}}_{n, M}) \to 0$ , as h_m → 0. □

E A single updating step in TMLE suffices for approximately solving the efficient influence curve equation

In this section we focus on the one-step TMLE, but the results can be straightforwardly generalized to the one-step CV-TMLE.

The following lemma proves that for a local least favorable submodel with a 1-dimensional ε and initial estimators that are consistent at a rate faster than $n^{- 1 / 4}$ , the one-step TMLE already solves $P_{n} D^{*} (Q_{n, ε_{n}}, G_{n}) = o_{P} (n^{- 1 / 2})$ under some regularity conditions.

Lemma 13

Let $Ψ : M \to ℝ$ be a pathwise differentiable parameter at P with canonical gradient D^∗(P), and assume Ψ(P) = Ψ(Q(P)) and D^∗(P) = D^∗(Q(P), G(P)) for parameters $Q : M \to Q = {Q (P) : P \in M}$ and $G : M \to G = {G (P) : P \in M}$ . Let R₂() be defined by Ψ(P) − Ψ(P₀) = (P − P₀)D^∗(P) + R₂(P, P₀), and let R₂(P, P₀) = R₂₀((Q, G), (Q₀, G₀)). Suppose Q₀ = arg min_Q P₀L(Q) for some loss function L(Q) and that, for any $Q \in Q$ and $G \in G$ , ${Q_{ε} : ε} \subset Q$ is a one dimensional parametric submodel through Q with $\frac{d}{d ε} L (Q_{ε}) |_{ε = 0} = D^{*} (Q, G)$ . Let $(Q_{n}, G_{n})$ be an initial estimator of (Q₀, G₀), and consider the one-step TMLE $Ψ (Q_{n, ε_{n}})$ with $ε_{n} = \arg min_{ε} P_{n} L (Q_{n, ε})$ .

Let $f_{n} (ε) = P_{n} D^{*} (Q_{n, ε}, G_{n})$ and $g_{n} (ε) = \frac{d}{d ε} P_{n} L (Q_{n, ε})$ . Let $f_{n}^{'} (ε) = \frac{d}{d ε} f_{n} (ε)$ and $g_{n}^{'} (ε) = \frac{d}{d ε} g_{n} (ε)$ . Let ε₀ = 0. Assume

$f_{n} (ε_{n}) = f_{n} (0) + f_{n}^{'} (0) ε_{n} + O_{P} (ε_{n}^{2})$ and $g_{n} (ε_{n}) = g_{n} (0) + g_{n}^{'} (0) ε_{n} + O_{P} (ε_{n}^{2})$ ;
$ε_{n}^{2} = o_{P} (n^{- 1 / 2});$
${\frac{d}{d ε_{n}} D^{*} (Q_{n, ε_{n}}, G_{n}) - \frac{d^{2}}{d ε_{n}^{2}} L (Q_{n, ε_{n}})} / n^{1 / 4}$ falls in a P₀-Donsker class with probability tending to 1;
$P_{0} {\frac{d}{d ε_{0}} D^{*} (Q_{n, ε_{0}}, G_{n}) - \frac{d}{d ε_{0}} D^{*} (Q_{0, ε_{0}}, G_{0})} = O_{P} (n^{- 1 / 4})$ and $P_{0} {\frac{d^{2}}{d ε_{0}^{2}} L (Q_{n, ε_{0}}) - \frac{d^{2}}{d ε_{0}^{2}} L (Q_{0, ε_{0}})} = O_{P} (n^{- 1 / 4})$ ; (42)
$P_{0} \frac{d^{2}}{d ε_{0}^{2}} L (Q_{0, ε_{0}}) = - P_{0} D^{*} (P_{0}) {D^{*} (P_{0})}^{⊺} .$ (43)
If $L (Q (P)) = - \log p_{Q (P), η (P)}$ for some density parameterization $(Q, η) \to p_{Q, η}$ , then (43) holds;
$\frac{d}{d ε_{0}} R_{20} ((Q_{0, ε_{0}}, G_{0}), (Q_{0}, G_{0})) = 0$ .

Then, $P_{n} D^{*} (Q_{n, ε_{n}}, G_{n}) = o_{P} (n^{- 1 / 2})$ .

The first bullet point condition only assumes that the chosen least favorable submodel is smooth in ε. The second bullet point condition will be satisfied if the initial estimators Q_n, G_n converge to the true Q₀, G₀ at a rate faster than $n^{- 1 / 4}$ . The third bullet point condition will hold without the $n^{- 1 / 4}$ scaling if the estimators Q_n, G_n have uniformly bounded variation norm. Because of the scaling by $n^{- 1 / 4}$ , the condition even allows the variation norm to grow with sample size, again showing that this is a very weak condition. Conditions eq. (42) are expected to hold if Q_n, G_n converge to Q₀, G₀ at a rate $n^{- 1 / 4}$ . Condition eq. (43) holds for loss functions that can be represented as a log-likelihood loss function, and is therefore again a natural condition for a local least favorable submodel w.r.t. loss function L. Finally, consider the last bullet point condition. If this remainder has a double robust form R₂₀((Q, G), (Q₀, G₀)) = ∫(H₁(Q) − H₁(Q₀))(H₂(G) − H₂(G₀))dP₀ for some functionals H₁, H₂, then this condition holds. If the remainder is of the form R₂₀((Q, G), (Q₀, G₀)) = ∫(H(Q) − H(Q₀))²dP₀, then again this condition trivially holds. This shows that the latter condition is also a weak regularity condition.

Proof of Lemma 13

Firstly, by the fact that $Q_{n, ε}$ has score D^∗(Q_n, G_n) at ε = 0, it follows that f_n(0) = g_n(0). We also know that g_n(ε_n) = 0, and we want to show that $f_{n} (ε_{n}) = o_{P} (n^{- 1 / 2})$ . Let ε₀ = 0. By the second order Taylor expansion assumption for f_n, g_n at ε = 0, we have

f_{n} (ε_{n}) = f_{n} (ε_{n}) - g_{n} (ε_{n}) = f_{n} (0) - g_{n} (0) + ε_{n} (f_{n}^{'} - g_{n}^{'}) (0) + O (ε_{n}^{2}) = ε_{n} {\frac{d}{d ε_{0}} P_{n} D^{*} (Q_{n, ε_{0}}, G_{n}) - \frac{d^{2}}{d ε_{0}^{2}} P_{n} L (Q_{n, ε_{0}})} + O (ε_{n}^{2}) .

By assumption, $ε_{n}^{2} = o_{P} (n^{- 1 / 2})$ , so that $O (ε_{n}^{2}) = o_{P} (n^{- 1 / 2})$ . Thus, it remains to show

P_{n} \frac{d}{d ε_{0}} D^{*} (Q_{n, ε_{0}}, G_{n}) - P_{n} \frac{d^{2}}{d ε_{0}^{2}} L (Q_{n, ε_{0}}) = O_{P} (n^{- 1 / 4}) .

By our Donsker class assumption, we have

(P_{n} - P_{0}) {\frac{d}{d ε_{0}} D^{*} (Q_{n, ε_{0}}, G_{n}) - \frac{d^{2}}{d ε_{0}^{2}} L (Q_{n, ε_{0}})} / n^{1 / 4} = O_{P} (n^{- 1 / 2}) .

Thus, it remains to show

\frac{d}{d ε_{0}} P_{0} D^{*} (Q_{n, ε_{0}}, G_{n}) - P_{0} \frac{d^{2}}{d ε_{0}^{2}} L (Q_{n, ε_{0}}) = O_{P} (n^{- 1 / 4})

By assumptions eq. (42), we have that the left-hand side of last expression equals

\frac{d}{d ε_{0}} P_{0} D^{*} (Q_{0, ε_{0}}, G_{0}) - P_{0} \frac{d^{2}}{d ε_{0}^{2}} L (Q_{0, ε_{0}}) + O_{P} (n^{- 1 / 4}),

so that it remains to show that the first term equals zero. By −P₀D^∗(P) = Ψ(P) − Ψ(P₀) − R₂(P, P₀), it follows that

\frac{d}{d ε_{0}} P_{0} D^{*} (Q_{0, ε_{0}}, G_{0}) = - \frac{d}{d ε_{0}} Ψ (Q_{0, ε_{0}}) + \frac{d}{d ε_{0}} R_{2} ((Q_{0, ε_{0}}, G_{0}), (Q_{0}, G_{0})) .

By assumption we have $\frac{d}{d ε_{0}} R_{2} ((Q_{0, ε_{0}}, G_{0}), (Q_{0}, G_{0})) = 0$ . By definition of the pathwise derivative at P₀, we have that the derivative of $Ψ (Q_{0, ε}) = Ψ (P_{0, ε})$ at ε = 0 equals $P_{0} D^{*} (P_{0}) {D^{*} (P_{0})}^{⊺}$ . Thus, we have shown

\frac{d}{d ε_{0}} P_{0} D^{*} (Q_{0, ε_{0}}, G_{0}) = - P_{0} D^{*} (P_{0}) {D^{*} (P_{0})}^{^{⊺}} .

Thus, it remains to show eq. (43), which holds by assumption. Suppose that $L (Q (P)) = - \log p_{Q (P), η (P)}$ for some density parameterization $(Q, η) \to p_{Q, η}$ . Then $L (Q_{0, ε}) = - \log p_{Q_{0, ε}, η_{0}}$ . Since ${p_{Q_{0, ε}, η_{0}} : ε}$ is a correctly specified parametric model, we have that the second derivative of $- P_{0} \log p_{Q_{0, ε}, η_{0}}$ at ε = 0 equals its information matrix (i.e., covariance matrix of its score) $P_{0} \frac{d}{d ε} \log p_{Q_{0, ε}, η_{0}} {\frac{d}{d ε} \log p_{Q_{0, ε}, η_{0}}}^{⊺}$ at ε = 0. However, the latter equals −P₀D^∗(P₀){D^∗(P₀)}^⊺, which proves eq. (43). This completes the proof of $f_{n} (ε_{n}) = o_{P} (n^{- 1 / 2})$ .
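A one-step update of this kind can be illustrated in a simple special case. The sketch below is an assumption-laden example, not the general setting of the lemma: the target is the treatment-specific mean $E_{0} E_{0} (Y | A = 1, W)$ with binary Y, the treatment mechanism is treated as known, the submodel is a logistic fluctuation of a deliberately misspecified initial estimator with covariate H = A / g(W), and all function and variable names are hypothetical. Because the log-likelihood loss is convex in ε, the minimizer is found as the root of its derivative by bisection.

```python
import math
import random

def expit(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def logit(p):
    return math.log(p / (1.0 - p))

def fit_epsilon(y, q_init, h, lo=-20.0, hi=20.0, n_iter=100):
    """Minimize the empirical log-likelihood loss of
    Q_eps = expit(logit(Q) + eps * H) over eps in [lo, hi]."""
    offset = [logit(q) for q in q_init]

    def score(eps):
        # Minus the derivative of the empirical loss; decreasing in eps.
        return sum(h_i * (y_i - expit(o_i + eps * h_i))
                   for y_i, o_i, h_i in zip(y, offset, h))

    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Simulated data with a known treatment mechanism g(W) = 0.3 + 0.4 W.
rng = random.Random(1)
n = 500
w = [rng.random() for _ in range(n)]
g = [0.3 + 0.4 * w_i for w_i in w]
a = [1 if rng.random() < g_i else 0 for g_i in g]
y = [1 if rng.random() < expit(-1.0 + 2.0 * w_i + a_i) else 0
     for w_i, a_i in zip(w, a)]

q_init = [0.5] * n                       # misspecified initial estimator
h = [a_i / g_i for a_i, g_i in zip(a, g)]

eps_n = fit_epsilon(y, q_init, h)
q_star = [expit(logit(q_i) + eps_n * h_i) for q_i, h_i in zip(q_init, h)]
q_star_1 = [expit(logit(q_i) + eps_n / g_i) for q_i, g_i in zip(q_init, g)]
psi_n = sum(q_star_1) / n

# Empirical mean of the efficient influence curve before and after the update.
pn_d_init = sum(h_i * (y_i - q_i) for h_i, y_i, q_i in zip(h, y, q_init)) / n
pn_d_star = sum(h_i * (y_i - q_i) + q1_i - psi_n
                for h_i, y_i, q_i, q1_i in zip(h, y, q_star, q_star_1)) / n
```

In this special submodel the score equals the relevant component of the efficient influence curve at every ε, so the equation is solved up to numerical precision after one step; the lemma covers local least favorable submodels for which this holds only at ε = 0.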

In the main article we have not proposed a 1-dimensional local least favorable submodel as in Lemma 13, even though our results are straightforwardly generalized to that case. Instead we proposed a (k₁ + 1)-dimensional least favorable submodel that uses a 1-dimensional ε(j) for updating $Q_{j n}$ for each j = 1, …, k₁ + 1. We will now state the desired lemma for the one-step TMLE for such a submodel by application of the above lemma across all j.

Lemma 14

Let $Ψ : M \to ℝ$ be pathwise differentiable with canonical gradient D^∗(P) = D^∗(Q, G) and let Ψ(P) = Ψ(Q(P)) for $Q (P) = (Q_{1} (P), \dots, Q_{k_{1} + 1} (P))$ . For a given Q, we define $Ψ_{Q, j} : M \to ℝ$ by $Ψ_{Q, j} (P) = Ψ (Q_{- j}, Q_{j} (P))$ , j = 1, …, k₁ + 1. Let $D_{Q, j}^{*} (P) = D_{Q, j}^{*} (Q_{j} (P), Q_{- j} (P), G (P))$ be the efficient influence curve of $Ψ_{Q, j}$ at P, and define $R_{2, Q, j} (P, P_{0}) = R_{2, Q, j} ((Q (P), G (P)), (Q_{0}, G_{0}))$ by $Ψ_{Q, j} (P) - Ψ_{Q, j} (P_{0}) = (P - P_{0}) D_{Q, j}^{*} (P) + R_{2, Q, j} (P, P_{0}), j = 1, \dots, k_{1} + 1$ . Here $Q_{- j} = (Q_{l} : l \neq j, l \in {1, \dots, k_{1} + 1})$ . We have $D^{*} (P) = \sum_{j = 1}^{k_{1} + 1} D_{Q (P), j}^{*} (P)$ .

Let $Q_{n} \in Q_{n}$ , $G_{n} \in G_{n}$ be a given initial estimator. Let ${Q_{j n, ε (j)} : ε (j)} \subset Q_{j n}$ be a submodel through $Q_{j n}$ at ε(j) = 0 and satisfying ${\frac{d}{d ε (j)} L_{1, j} (Q_{j n, ε (j)}) |}_{ε (j) = 0} = D_{Q_{n}, j}^{*} (Q_{n}, G_{n}), j = 1, \dots, k_{1} + 1$ . Let ${Q_{n, ε} : ε} \subset Q_{n}$ be defined by $Q_{n, ε} = (Q_{j n, ε (j)} : j = 1, \dots, k_{1} + 1)$ . Let $ε_{n} = \arg min_{ε} P_{n} L_{1} (Q_{n, ε})$ , where $P_{n} L_{1} (Q_{n, ε}) = (P_{n} L_{1 j} (Q_{j n, ε (j)}) : j = 1, \dots, k_{1} + 1)$ . Let $Q_{n}^{*} = Q_{n, ε_{n}}$ .

We wish to establish that $P_{n} D^{*} (Q_{n, ε_{n}}, G_{n}) = o_{P} (n^{- 1 / 2})$ , where

P_{n} D^{*} (Q_{n, ε_{n}}, G_{n}) = \sum_{j = 1}^{k_{1} + 1} P_{n} D_{Q_{n, ε_{n}, j}}^{*} (Q_{j n, ε_{n} (j)}, Q_{- j n, ε_{n}}, G_{n}) .

For each j = 1, …, k₁ + 1, assume the following conditions:

Suppose that by application of the previous lemma to $Ψ_{Q_{n}, j} : M \to ℝ$ , submodel ${Q_{j n, ε (j)} : ε (j)}$ , loss function $L_{1 j} (Q_{j})$ , $ε_{n} (j) = \arg min_{ε (j)} P_{n} L_{1 j} (Q_{j n, ε (j)})$ , and one-step TMLE $Q_{j n, ε_{n} (j)}$ , we establish its conclusion $P_{n} D_{Q_{n}, j}^{*} (Q_{j n, ε_{n} (j)}, Q_{- j n}, G_{n}) = o_{P} (n^{- 1 / 2})$ . For completeness, Lemma 15 below explicitly states these j-specific conditions of the previous lemma, which are sufficient for this conclusion.
Let $f_{n j} = D_{Q_{n}, j}^{*} (Q_{j n}^{*}, Q_{- j n}, G_{n}) - D_{Q_{n}, j}^{*} (Q_{j n}^{*}, Q_{- j n}^{*}, G_{n})$ , and assume $(P_{n} - P_{0}) f_{n j} = o_{P} (n^{- 1 / 2})$ . For this to hold it suffices to assume that $P_{0} f_{n j}^{2} \to_{p} 0$ and $\lim \sup_{n \to \infty} {‖ f_{n j} ‖}_{v} < M$ a.e.
Let $f_{n j, 1} = D_{Q_{n}, j}^{*} (Q_{n}^{*}, G_{n}) - D_{Q_{n}^{*}, j}^{*} (Q_{n}^{*}, G_{n})$ , and assume $(P_{n} - P_{0}) f_{n j, 1} = o_{P} (n^{- 1 / 2})$ . For this to hold it suffices to assume that $P_{0} f_{n j, 1}^{2} \to_{p} 0$ and $\lim \sup_{n \to \infty} {‖ f_{n j, 1} ‖}_{v} < M$ a.e.
$R_{2, Q_{n}, j} (((Q_{j n}^{*}, Q_{- j n}^{*}), G_{n}), (Q_{0}, G_{0})) - R_{2, Q_{n}, j} (((Q_{j n}^{*}, Q_{- j n}), G_{n}), (Q_{0}, G_{0})) = o_{P} (n^{- 1 / 2})$ ;
$R_{2,Q_{n}^{*},j}((Q_{n}^{*}, G_{n}), (Q_{0}, G_{0})) - R_{2,Q_{n},j}((Q_{n}^{*}, G_{n}), (Q_{0}, G_{0})) = o_{P}(n^{-1/2})$;
$Ψ_{Q_{n}^{*},j}(Q_{jn}^{*}) - Ψ_{Q_{n}^{*},j}(Q_{j0}) - \{Ψ_{Q_{n},j}(Q_{jn}^{*}) - Ψ_{Q_{n},j}(Q_{j0})\} = o_{P}(n^{-1/2})$.

Then, $P_{n} D^{*} (Q_{n, ε_{n}}, G_{n}) = o_{P} (n^{- 1 / 2})$ .
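
As an illustration, consider the treatment-specific mean example of the main article, with O = (W, A, Y), $Q = (Q_{W}, \bar{Q})$ and $Ψ(Q) = Q_{W} \bar{Q}(1, \cdot)$, for which the canonical gradient is the sum of the component $A / g(W) \, (Y - \bar{Q}(A, W))$ and the component $\bar{Q}(1, W) - Ψ(Q)$. The following Python sketch is not the article's implementation. It assumes a toy data-generating distribution, a known treatment mechanism g, a deliberately crude constant initial estimator of $\bar{Q}$, the empirical distribution as estimator of $Q_{W}$ (which needs no update), and a plain rather than cross-validated TMLE. All function names are hypothetical. It fits the single ε of a logistic fluctuation of $\bar{Q}$ by minimizing the log-likelihood risk, and then evaluates the empirical mean of each component of the canonical gradient.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def simulate(n, seed=0):
    """Draw n i.i.d. copies of O = (W, A, Y) from a toy distribution (assumption)."""
    rng = random.Random(seed)
    obs = []
    for _ in range(n):
        w = rng.random()
        a = 1 if rng.random() < expit(0.5 - w) else 0
        y = 1 if rng.random() < expit(-1.0 + a + w) else 0
        obs.append((w, a, y))
    return obs


def g_n(w):
    # Treatment mechanism P(A = 1 | W), taken as known here (assumption).
    return expit(0.5 - w)


def qbar_init(a, w):
    # Deliberately crude initial estimator of E(Y | A, W).
    return 0.5


def qbar_eps(eps, a, w):
    # Logistic fluctuation through qbar_init with covariate H(A, W) = A / g_n(W).
    return expit(logit(qbar_init(a, w)) + eps * a / g_n(w))


def score(eps, obs):
    # Empirical mean of A / g_n(W) * (Y - Qbar_eps(A, W)).
    # The eps-derivative of the empirical log-likelihood risk is minus this value.
    return sum(a / g_n(w) * (y - qbar_eps(eps, a, w)) for w, a, y in obs) / len(obs)


def fit_eps(obs, lo=-10.0, hi=10.0, iters=80):
    # The score is strictly decreasing in eps, so bisection locates the
    # minimizer of the empirical risk over the one-dimensional submodel.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid, obs) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


obs = simulate(2000)
eps_n = fit_eps(obs)

# Plug-in estimate using the empirical distribution of W.
psi_star = sum(qbar_eps(eps_n, 1, w) for w, _, _ in obs) / len(obs)

pn_d1_init = score(0.0, obs)   # Qbar-component at the initial estimator
pn_d1 = score(eps_n, obs)      # Qbar-component after the update
pn_d2 = sum(qbar_eps(eps_n, 1, w) - psi_star for w, _, _ in obs) / len(obs)

print(pn_d1_init, pn_d1, pn_d2, psi_star)
```

In this toy setting the empirical mean of the $\bar{Q}$-component is generally nonzero at ε = 0 and is solved up to floating-point error after the single update, while the $Q_{W}$-component is solved by construction of the plug-in estimate. The cross terms that Lemma 14 controls through its remainder conditions happen to vanish here because the $\bar{Q}$-component does not depend on $Q_{W}$.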

Lemma 15

Let $f_{nj}(ε(j)) = P_{n} D_{Q_{n},j}^{*}(Q_{jn,ε(j)}, Q_{-jn}, G_{n})$ and $g_{nj}(ε(j)) = \frac{d}{dε(j)} P_{n} L_{1j}(Q_{jn,ε(j)})$. Let $f_{nj}'(ε(j)) = \frac{d}{dε(j)} f_{nj}(ε(j))$ and $g_{nj}'(ε(j)) = \frac{d}{dε(j)} g_{nj}(ε(j))$. Let $ε_{0}(j) = 0$.

Assume the following conditions:

$f_{nj}(ε_{n}(j)) = f_{nj}(0) + f_{nj}'(0) ε_{n}(j) + O_{P}(ε_{n}(j)^{2})$ and $g_{nj}(ε_{n}(j)) = g_{nj}(0) + g_{nj}'(0) ε_{n}(j) + O_{P}(ε_{n}(j)^{2})$;
$ε_{n}(j)^{2} = o_{P}(n^{-1/2})$;
$\{ \frac{d}{dε_{n}(j)} D_{Q_{n},j}^{*}(Q_{jn,ε_{n}(j)}, Q_{-jn}, G_{n}) - \frac{d^{2}}{dε_{n}(j)^{2}} L_{1j}(Q_{jn,ε_{n}(j)}) \} / n^{1/4}$ falls in a P₀-Donsker class with probability tending to 1;
$\frac{d}{dε_{0}(j)} P_{0} \{ D_{Q_{n},j}^{*}(Q_{jn,ε_{0}(j)}, Q_{-jn}, G_{n}) - D_{Q_{n},j}^{*}(Q_{j0,ε_{0}(j)}, Q_{-j0}, G_{0}) \} = O_{P}(n^{-1/4})$ and $\frac{d^{2}}{dε_{0}(j)^{2}} P_{0} \{ L_{1j}(Q_{jn,ε_{0}(j)}) - L_{1j}(Q_{j0,ε_{0}(j)}) \} = O_{P}(n^{-1/4})$;
$P_{0} \frac{d^{2}}{dε_{0}(j)^{2}} L_{1j}(Q_{j0,ε_{0}(j)}) = P_{0} D_{Q_{0},j}^{*}(P_{0}) \{D_{Q_{0},j}^{*}(P_{0})\}^{⊤}$. (44)
If $L_{1j}(Q_{j}(P)) = - \log p_{Q_{j}(P), η(P)}$ for some density parameterization $(Q_{j}, η) \to p_{Q_{j}, η}$, then eq. (44) holds;
$\frac{d}{dε_{0}(j)} R_{2,Q_{0},j}((Q_{j0,ε_{0}(j)}, Q_{-j0}, G_{0}), (Q_{0}, G_{0})) = 0$.

Then, $P_{n} D_{Q_{n}, j}^{*} (Q_{j n, ε_{n} (j)}, Q_{- j n}, G_{n}) = o_{P} (n^{- 1 / 2}) .$

Proof

This is an immediate application of Lemma 13. □
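
The way the first two conditions combine can be seen from a short calculation, given here as a sketch under stated assumptions rather than as a restatement of the proof of Lemma 13. It assumes that $ε_{n}(j)$ is an interior minimizer, so that $g_{nj}(ε_{n}(j)) = 0$, and it uses that the submodel property $\frac{d}{dε(j)} L_{1j}(Q_{jn,ε(j)}) |_{ε(j)=0} = D_{Q_{n},j}^{*}(Q_{n}, G_{n})$ gives $g_{nj}(0) = f_{nj}(0)$.

```latex
\begin{aligned}
f_{nj}(\varepsilon_n(j)) &= f_{nj}(0) + f_{nj}'(0)\,\varepsilon_n(j) + O_P(\varepsilon_n(j)^2), \\
0 = g_{nj}(\varepsilon_n(j)) &= g_{nj}(0) + g_{nj}'(0)\,\varepsilon_n(j) + O_P(\varepsilon_n(j)^2), \\
f_{nj}(\varepsilon_n(j)) &= \{ f_{nj}'(0) - g_{nj}'(0) \}\,\varepsilon_n(j) + O_P(\varepsilon_n(j)^2)
\qquad \text{(subtracting, since } f_{nj}(0) = g_{nj}(0)\text{)}.
\end{aligned}
```

Since $ε_{n}(j)^{2} = o_{P}(n^{-1/2})$, that is $ε_{n}(j) = o_{P}(n^{-1/4})$, the right-hand side is $o_{P}(n^{-1/2})$ as soon as $f_{nj}'(0) - g_{nj}'(0) = O_{P}(n^{-1/4})$. The remaining conditions are of the kind one would use to establish a bound of that order.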

Proof of Lemma 14

Consider a 1-dimensional submodel {P_ε : ε} ⊂ ℳ through P at ε = 0 with score S, and let $Q_{ε} = Q(P_{ε})$ and Q = Q(P). With all derivatives evaluated at ε = 0, we have

\frac{d}{d ε} Ψ (P_{ε}) = \frac{d}{d ε} Ψ (Q_{ε}) = \frac{d}{d ε} Ψ (Q_{1 ε}, \dots, Q_{k_{1} + 1 ε}) = \sum_{j = 1}^{k_{1} + 1} \frac{d}{d ε} Ψ (Q_{- j}, Q_{j ε}) .

By pathwise differentiability of Ψ at P the left-hand side equals PD^*(P)S, while, by pathwise differentiability of Ψ_Q,_j at P, each j-specific term on the right-hand side equals $P D_{Q, j}^{*} (P) S$ . This proves that

P D^{*} (P) S = \sum_{j = 1}^{k_{1} + 1} P D_{Q, j}^{*} (P) S = P {\sum_{j = 1}^{k_{1} + 1} D_{Q, j}^{*} (P)} S .

Since this holds for each S ∈ T(P) and $D_{Q,j}^{*}(P) \in T(P)$ for all j, this implies $D^{*}(P) = \sum_{j=1}^{k_{1}+1} D_{Q,j}^{*}(P)$. This proves the first statement of the lemma. It also shows that $P_{n} D^{*}(Q_{n}^{*}, G_{n}) = \sum_{j=1}^{k_{1}+1} P_{n} D_{Q_{n}^{*},j}^{*}(Q_{n}^{*}, G_{n})$, so it suffices to prove that $P_{n} D_{Q_{n}^{*},j}^{*}(Q_{n}^{*}, G_{n}) = o_{P}(n^{-1/2})$ for each j. In the lemma we assumed that we already established $P_{n} D_{Q_{n},j}^{*}(Q_{jn}^{*}, Q_{-jn}, G_{n}) = o_{P}(n^{-1/2})$, by application of Lemma 15.

Firstly, we want to prove that $P_{n} \{ D_{Q_{n},j}^{*}(Q_{jn}^{*}, Q_{-jn}, G_{n}) - D_{Q_{n},j}^{*}(Q_{jn}^{*}, Q_{-jn}^{*}, G_{n}) \} = o_{P}(n^{-1/2})$, which then shows that $P_{n} D_{Q_{n},j}^{*}(Q_{n}^{*}, G_{n}) = o_{P}(n^{-1/2})$. This term can be represented as $P_{n} f_{nj}$, with $f_{nj}$ as defined in the lemma. We can write $P_{n} f_{nj} = (P_{n} - P_{0}) f_{nj} + P_{0} f_{nj}$. By the assumption on $f_{nj}$, we have $(P_{n} - P_{0}) f_{nj} = o_{P}(n^{-1/2})$. So we now have to consider

P_{0} \{ D_{Q_{n}, j}^{*} (Q_{j n}^{*}, Q_{- j n}, G_{n}) - D_{Q_{n}, j}^{*} (Q_{j n}^{*}, Q_{- j n}^{*}, G_{n}) \} = Ψ_{Q_{n}, j} (Q_{j n}^{*}) - Ψ_{Q_{n}, j} (Q_{j 0}) - R_{2, Q_{n}, j} (((Q_{j n}^{*}, Q_{- j n}^{*}), G_{n}), (Q_{0}, G_{0})) - Ψ_{Q_{n}, j} (Q_{j n}^{*}) + Ψ_{Q_{n}, j} (Q_{j 0}) + R_{2, Q_{n}, j} (((Q_{j n}^{*}, Q_{- j n}), G_{n}), (Q_{0}, G_{0})) = R_{2, Q_{n}, j} (((Q_{j n}^{*}, Q_{- j n}), G_{n}), (Q_{0}, G_{0})) - R_{2, Q_{n}, j} (((Q_{j n}^{*}, Q_{- j n}^{*}), G_{n}), (Q_{0}, G_{0})) .

By the assumed bound on the difference of these two second order remainders, the latter is $o_{P}(n^{-1/2})$. This proves that $P_{n} D_{Q_{n},j}^{*}(Q_{n}^{*}, G_{n}) = o_{P}(n^{-1/2})$.

We now want to prove that $P_{n} \{ D_{Q_{n},j}^{*}(Q_{n}^{*}, G_{n}) - D_{Q_{n}^{*},j}^{*}(Q_{n}^{*}, G_{n}) \} = o_{P}(n^{-1/2})$, so that we can conclude $P_{n} D_{Q_{n}^{*},j}^{*}(Q_{n}^{*}, G_{n}) = o_{P}(n^{-1/2})$. This term can be represented as $P_{n} f_{nj,1}$, with $f_{nj,1}$ as defined in the lemma. We have $P_{n} f_{nj,1} = (P_{n} - P_{0}) f_{nj,1} + P_{0} f_{nj,1}$. By the assumption on $f_{nj,1}$, we have $(P_{n} - P_{0}) f_{nj,1} = o_{P}(n^{-1/2})$. We now have to consider

P_{0} \{ D_{Q_{n}, j}^{*} (Q_{n}^{*}, G_{n}) - D_{Q_{n}^{*}, j}^{*} (Q_{n}^{*}, G_{n}) \} = Ψ_{Q_{n}^{*}, j} (Q_{j n}^{*}) - Ψ_{Q_{n}^{*}, j} (Q_{j 0}) - R_{2, Q_{n}^{*}, j} ((Q_{n}^{*}, G_{n}), (Q_{0}, G_{0})) - Ψ_{Q_{n}, j} (Q_{j n}^{*}) + Ψ_{Q_{n}, j} (Q_{j 0}) + R_{2, Q_{n}, j} ((Q_{n}^{*}, G_{n}), (Q_{0}, G_{0})) .

By assumption, we have $R_{2,Q_{n}^{*},j}((Q_{n}^{*}, G_{n}), (Q_{0}, G_{0})) - R_{2,Q_{n},j}((Q_{n}^{*}, G_{n}), (Q_{0}, G_{0})) = o_{P}(n^{-1/2})$. By the final assumption, the “second order Ψ-difference” $Ψ_{Q_{n}^{*},j}(Q_{jn}^{*}) - Ψ_{Q_{n}^{*},j}(Q_{j0}) - \{Ψ_{Q_{n},j}(Q_{jn}^{*}) - Ψ_{Q_{n},j}(Q_{j0})\}$ is $o_{P}(n^{-1/2})$ as well. □
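
As a compact summary of the argument, offered as a reading aid and not as part of the original proof, the two steps amount to the following telescoping identity, with $f_{nj}$ and $f_{nj,1}$ as defined in the statement of Lemma 14:

```latex
\begin{aligned}
P_n D^*_{Q_n^*, j}(Q_n^*, G_n)
  &= P_n D^*_{Q_n, j}(Q_{jn}^*, Q_{-jn}, G_n) - P_n f_{nj} - P_n f_{nj,1}, \\
P_n f_{nj} &= (P_n - P_0) f_{nj} + P_0 f_{nj}, \\
P_n f_{nj,1} &= (P_n - P_0) f_{nj,1} + P_0 f_{nj,1}.
\end{aligned}
```

The first term on the right of the first line is handled by Lemma 15, the two empirical process terms by the conditions on $f_{nj}$ and $f_{nj,1}$, and the two $P_{0}$-terms by the conditions on the second order remainders and on the Ψ-difference.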

References

1. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. Berlin/Heidelberg/New York: Springer; 1997.
2. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS epidemiology. Basel: Birkhauser; 1992. pp. 297–331.
3. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer; 2003.
4. van der Laan MJ. Estimation based on case-control designs with known prevalence probability. Int J Biostat. 2008;4(1) Article 17. doi: 10.2202/1557-4679.1114.
5. van der Laan MJ, Rose S. Targeted learning: Causal inference for observational and experimental data. Berlin/Heidelberg/New York: Springer; 2011.
6. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1) Article 11.
7. Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat. 2010;6(1) doi: 10.2202/1557-4679.1182.
8. Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. (Working Paper Series. Working Paper 279). Int J Biostat. 2011;7(1) Article 31. doi: 10.2202/1557-4679.1308. Published online Aug 17, 2011. Also available at: U.C. Berkeley Division of Biostatistics. http://www.bepress.com/ucbbiostat/paper279.
9. Sekhon JS, Gruber S, Porter KE, van der Laan MJ. Propensity score-based estimators and C-TMLE. In: van der Laan MJ, Rose S, editors. Targeted learning: Causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer; 2012.
10. Polley EC, Rose S, van der Laan MJ. Super learner. In: van der Laan MJ, Rose S, editors. Targeted learning: Causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer; 2011.
11. van der Laan MJ, Gruber S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. Int J Biostat. 2016;12(1):351–378. doi: 10.1515/ijb-2015-0054.
12. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol. 2007;6(1) Article 25. doi: 10.2202/1544-6115.1309.
13. van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Stat Decis. 2006;24(3):351–371.
14. Polley EC, Rose S, van der Laan MJ. Super learning. In: van der Laan MJ, Rose S, editors. Targeted learning: Causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer; 2012.
15. Zheng W, van der Laan MJ. Cross-validated targeted minimum loss based estimation. In: van der Laan MJ, Rose S, editors. Targeted learning: Causal inference for observational and experimental studies. New York: Springer; 2011.
16. van der Laan MJ. A generally efficient targeted minimum loss-based estimator. Technical Report 300. UC Berkeley; 2015. http://biostats.bepress.com/ucbbiostat/paper343.
17. Neuhaus G. On weak convergence of stochastic processes with multidimensional time parameter. Ann Stat. 1971;42:1285–1295.
18. van der Vaart AW, Wellner JA. A local maximal inequality under uniform entropy. Electr J Stat. 2011;5:192–203. doi: 10.1214/11-EJS605.
19. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. Berlin/Heidelberg/New York: Springer; 1996.
20. Gill RD, van der Laan MJ, Wellner JA. Inefficient estimators of the bivariate survival function for three models. Annales de l’Institut Henri Poincaré. 1995;31:545–597.
21. van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Stat Decis. 2006;24(3):373–395.
22. Benkeser D, van der Laan MJ. The highly adaptive lasso estimator. Proceedings of the IEEE Conference on Data Science and Advanced Analytics. 2016. doi: 10.1109/DSAA.2016.93. To appear.
23. Chambaz A, van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate, theoretical study. (Working paper 258). Int J Biostat. 2011a;7(1):1–32. www.bepress.com/ucbbiostat.
24. Chambaz A, van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate, simulation study. (Working paper 258). Int J Biostat. 2011b;7(1):33. www.bepress.com/ucbbiostat.
25. van der Laan MJ. Causal inference for networks. (Technical Report 300). UC Berkeley; 2012. http://biostats.bepress.com/ucbbiostat/paper300, to appear in Journal of Causal Inference.
26. van der Laan MJ, Balzer LB, Petersen ML. Adaptive matching in randomized trials and observational studies. J Stat Res. 2013;46(2):113–156.
27. Gruber S, van der Laan MJ. Targeted maximum likelihood estimation, R package version 1.2.0-1. 2012. Available at http://cran.rproject.org/web/packages/tmle/tmle.pdf.
28. Petersen M, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan MJ. Targeted maximum likelihood estimation of dynamic and static marginal structural working models. J Causal Inf. 2013;2:147–185. doi: 10.1515/jci-2013-0007.
29. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–972. doi: 10.1111/j.1541-0420.2005.00377.x.
30. Díaz I, van der Laan MJ. Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems. Int J Biostat. doi: 10.1515/ijb-2013-0004. In press.
31. van der Laan MJ, Petersen ML. Targeted learning. In: Zhang C, Ma Y, editors. Ensemble machine learning: methods and applications. New York: Springer; 2012.
32. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat. 2007;3(1) Article 3. doi: 10.2202/1557-4679.1022.
33. van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. (Technical Report 130). Division of Biostatistics, University of California, Berkeley; 2003.
34. van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genet Mol. 2004;3(1) Article 4. doi: 10.2202/1544-6115.1036.

