Abstract
Suppose that we wish to estimate a finite-dimensional summary of one or more function-valued features of an underlying data-generating mechanism under a nonparametric model. One approach to estimation is by plugging in flexible estimates of these features. Unfortunately, in general, such estimators may not be asymptotically efficient, which often makes these estimators difficult to use as a basis for inference. Though there are several existing methods to construct asymptotically efficient plug-in estimators, each such method either can only be derived using knowledge of efficiency theory or is only valid under stringent smoothness assumptions. Among existing methods, sieve estimators stand out as particularly convenient because efficiency theory is not required in their construction, their tuning parameters can be selected data adaptively, and they are universal in the sense that the same fits lead to efficient plug-in estimators for a rich class of estimands. Inspired by these desirable properties, we propose two novel universal approaches for estimating function-valued features that can be analyzed using sieve estimation theory. Compared to traditional sieve estimators, these approaches are valid under more general conditions on the smoothness of the function-valued features by utilizing flexible estimates that can be obtained, for example, using machine learning.
Keywords: nonparametric inference, asymptotic efficiency, sieve estimation
1. Introduction
1.1. Motivation
A common statistical problem consists of using available data in order to learn about a summary of the underlying data-generating mechanism. In many cases, this summary involves function-valued features of the distribution that cannot be estimated at a parametric rate under a nonparametric model — for example, a regression function or the density function of the distribution. Examples of useful summaries involving such features include average treatment effects [23], average derivatives [11], moments of the conditional mean function [24], variable importance measures [33] and treatment effect heterogeneity measures [12]. For ease of implementation and interpretation, in traditional approaches to estimation, these features have typically been restricted to have simple forms encoded by parametric or restrictive semiparametric models. However, when these models are misspecified, both the interpretation and validity of subsequent inferences can be compromised. To circumvent this difficulty, investigators have increasingly relied on machine learning (ML) methods to flexibly estimate these function-valued features.
Once estimates of the function-valued features are obtained, it is natural to consider plug-in estimators of the summary of interest. However, in general, such estimators are not root-n-consistent and asymptotically normal (CAN), and hence not asymptotically efficient (referred to as efficient henceforth). The lack of this property is problematic since asymptotic normality often forms the basis for constructing valid confidence intervals and hypothesis tests [3, 20]. When the function-valued features are estimated by ML methods, in order for the plug-in estimator to be CAN, the ML methods must not only estimate the involved function-valued features well, but must also satisfy a small-bias property with respect to the summary of interest [20, 31]. Unfortunately, because ML methods generally seek to optimize out-of-sample performance, they seldom satisfy the latter property.
1.2. Existing methodological frameworks
The targeted minimum loss-based estimation (TMLE) framework provides a means of constructing efficient plug-in estimators [27, 31]. Given an (almost arbitrary) initial ML fit that provides a good estimate of the function-valued features involved, TMLE produces an adjusted fit such that the resulting plug-in estimator has reduced bias and is efficient. This adjustment process is referred to as targeting since a generic estimate of the function-valued features is modified to better suit the goal of estimating the summary of interest. Though TMLE provides a general template for constructing efficient estimators, its implementation requires specialized expertise, namely knowledge of the analytic expression for an influence function of the summary of interest. Influence functions arise in semiparametric efficiency theory and are key to establishing efficiency, but can be difficult to derive. Furthermore, even when an influence function is known analytically, additional expertise is needed to construct a TMLE for a given problem.
Alternative approaches for constructing efficient plug-in estimators have been proposed in the literature, including the use of undersmoothing [18], twicing kernels [20], and sieves [4, 19, 24]. These methods require neither knowledge of an influence function nor any targeting of the function-valued feature estimates. Hence, the same fits can be used to simultaneously estimate different summaries of the data-generating distribution, even if these summaries were not pre-specified when obtaining the fit. These approaches also circumvent the difficulties involved in deriving an influence function. However, these methods all rely on smoothness conditions on derivatives of the functional features that may be overly stringent. In addition, undersmoothing provides limited guidance on the choice of the tuning parameter, and the twicing kernel method requires the use of a higher-order kernel, which may lead to poor performance in small to moderate samples [14].
In contrast, under some conditions, sieve estimation can produce a flexible fit with the optimal out-of-sample performance while also yielding an efficient — and therefore root-n-consistent and asymptotically normal — plug-in estimator [24]. In this paper, we focus on extensions of this approach. In sieve estimation, we first assume that the unknown function falls in a rich function space, and construct a sequence of approximating subspaces indexed by sample size that increase in complexity as sample size grows. We require that, in the limit, the functions in the subspaces can approximate any function in the rich function space arbitrarily well. These approximating subspaces are referred to as sieves. By using an ordinary fitting procedure that optimizes the estimation of the function-valued feature within the sieve, the bias of the plug-in estimator can decrease sufficiently fast as the sieve grows in order for that estimator to be efficient. Thus sieve estimation requires no explicit targeting for the summary of interest.
The series estimator is one of the best known and most widely used sieve techniques. These sieves are taken as the span of the first finitely many terms in a basis that is chosen by the user to approximate the true function well. Common choices of the basis include polynomials, splines, trigonometric series and wavelets, among others. However, series estimators usually require strong smoothness assumptions on derivatives of the unknown function in order for the flexible fit to converge at a sufficient rate to ensure the resulting plug-in estimator is efficient. As the dimension of the problem increases, the smoothness requirement may become prohibitive. Moreover, even if the smoothness assumption is satisfied, a prohibitively large sample size may be needed for some series estimators to produce a good fit. For example, if the unknown function is smooth but is a constant over a region, estimation based on a polynomial series can perform poorly in small to moderate samples.
Series estimators may also require the user to choose the number of terms in the series in such a way that results in a sufficient convergence rate. The rates at which the number of terms should grow with sample size have been thoroughly studied (e.g. [4, 19, 24]). However, these results only provide minimal guidance for applications because there is no indication on how to select the actual number of terms for a given sample size. In practice, the number of terms in the series is often chosen by cross-validation (CV). Upper bounds on the convergence rate of the series estimator as a function of sample size and the number of terms have been derived, and it has been shown that the optimal number of terms that minimizes the bound can also lead to an efficient plug-in estimator [24]. However, CV tends to select the number of terms that optimizes the actual convergence rate [29], which may differ from the number of terms minimizing the derived bound on the convergence rate. Even though the use of CV-tuned sieve estimators has achieved good numerical performance, to the best of our knowledge, there is no theoretical guarantee that they lead to an efficient plug-in estimator.
Two variants of traditional series estimators were proposed in [3]. These methods can use two bases to approximate the unknown function-valued features and the corresponding gradient separately, whereas in traditional series estimators, only one basis is used for both approximations. Consequently, these variants may be applied to more general cases than traditional series estimators. However, like traditional series estimators, they also suffer from the inflexibility of the pre-specified bases.
1.3. Contributions and organization of this article
In this paper we present two approaches that can partially overcome these shortcomings.
-
Estimating the unknown function with Highly Adaptive Lasso (HAL) [1, 26].
If we are willing to assume the unknown functions have a finite variation norm, then they may be estimated via HAL. If the tuning parameter is chosen carefully, then we may obtain an efficient plug-in estimator. This method can help overcome the stringent smoothness assumptions on derivatives that are required by existing series estimators, as we discussed earlier.
-
Using data-adaptive series based on an initial ML fit.
As long as the initial ML algorithm converges to the unknown function at a sufficient rate, we show that, for certain types of summaries, it is possible to obtain an efficient plug-in estimator with a particular data-adaptive series. The smoothness assumption on the unknown function can be greatly relaxed due to the introduction of the ML algorithm into the procedure. Moreover, for summaries that are highly smooth, we show that the number of terms in the series can be selected by CV.
Although the first approach is not an example of sieve estimation, both approaches are motivated by the sieve literature and can be shown to lead to asymptotically efficient plug-in estimators using the sieve estimation theory derived in [24]. The flexible fits of the functional features from both approaches can be plugged in for a rich class of estimands.
We remark that, although we do not have to restrict ourselves to the plug-in approach in order to construct an asymptotically efficient estimator, other estimators do not overcome the shortcomings described in Section 1.2 and can have other undesirable properties. For example, the popular one-step correction approach (also called debiasing in the recent literature on high-dimensional statistics) [22] constructs efficient estimators by adding a bias reduction term to the plug-in estimator. Thus, it is not a plug-in estimator itself, and as a consequence, one-step estimators may not respect known constraints on the estimand — for example, bounds on a scalar-valued estimand (e.g., the estimand is a probability and must lie in [0, 1]) or shape constraints on a vector-valued estimand (e.g., monotonicity constraints). This drawback is also typical for other non-plug-in estimators, such as those derived via estimating equations [30] and double machine learning [5, 6]. Additionally, as with the other procedures described above, the one-step correction approach requires the analytic expression of an influence function.
Our paper is organized as follows. We introduce the problem setup and notation in Section 2. We consider plug-in estimators based on HAL in Section 3, data-adaptive series in Section 4, and its generalized version that is applicable to more general summaries in Section 5. Section 6 concludes with a discussion. Technical proofs of lemmas and theorems (Appendix D), simulation details (Appendix E) and other additional details are provided in the Appendix.
2. Problem setup and traditional sieve estimation review
Suppose we have independent and identically distributed observations V1, …, Vn drawn from P0. Let Θ be a class of functions, and denote by θ0 ∈ Θ a (possibly vector-valued) functional feature of P0 — for example, θ0 may be a regression function. Throughout this paper we assume that the generic data unit is V = (X, Z) ~ P0, where X is a (possibly vector-valued) random variable corresponding to the argument of θ0, and Z may also be a vector-valued random variable. In some cases V = X and Z is trivial. We use to denote the support of X. The estimand of interest is a finite-dimensional summary Ψ(θ0) of θ0. We consider a plug-in estimator , where is an estimator of θ0, and aim for this plug-in estimator to be asymptotically linear, in the sense that it admits the expansion Ψ(θ0) + (1/n) Σi IF(Vi) + op(n−1/2), with IF an influence function satisfying P0IF = 0 and P0IF2 < ∞. This estimator is efficient under a nonparametric model if it is also regular. By the central limit theorem and Slutsky’s theorem, it follows that an asymptotically linear estimator is a CAN estimator of Ψ(θ0); in particular, n1/2 times the estimation error converges in distribution to a mean-zero normal distribution with variance P0IF2. This provides a basis for constructing valid confidence intervals for Ψ(θ0).
We now list some examples of such problems.
Example 1. Moments of the conditional mean function [24]: Let be the conditional mean function. The κ-th moment of θ0(X), X ~ P0, namely , can be a summary of interest; we denote it by Ψκ(θ0). The values of Ψ1(θ0) and Ψ2(θ0) are useful for defining the proportion of the variance of the outcome that is explained by X, which may be written as . This proportion is a measure of variable importance [33]. More generally, we may consider the generalized moment Ψ(θ0) := P0(f ○ θ0) for a fixed function f.
Example 2. Average derivative [11]: Let X follow a continuous distribution on and be the conditional mean function. Let denote the vector of partial derivatives of θ0. Then summarizes the overall (adjusted) effect of each component of X on Y. Under certain conditions, we can rewrite , where p0 is the Lebesgue density of X and is the vector of partial derivatives of p0. This expression clearly shows the important role of the Lebesgue density of X in this summary.
Example 3. Mean counterfactual outcome [23]: Suppose that Z = (A, Y) where A is a binary treatment indicator and Y is the outcome of interest. Let be the outcome regression function under treatment value 1. Under causal assumptions, the mean counterfactual outcome corresponding to the intervention that assigns treatment 1 to the entire population can be nonparametrically identified by the G-computation formula .
Example 4. Treatment effect heterogeneity measures [12]: Similarly to Example 3, suppose that Z = (A, Y), where A is a binary treatment indicator and Y is the outcome of interest. Let θ0 = (μ00, μ01)T, where is the outcome regression function for treatment arm a ∈ {0, 1}. Then, is an overall summary of treatment effect heterogeneity.
To obtain an asymptotically linear plug-in estimator, must converge to θ0 at a sufficiently fast rate and approximately solve an estimating equation to achieve the small-bias property with respect to the summary of interest [20, 26, 31]. For simplicity, we assume the estimand to be scalar-valued: when the estimand is vector-valued, we can treat each entry as a separate estimand, and the plug-in estimators of all entries are jointly asymptotically linear if each estimator is asymptotically linear. Therefore, there is no loss of generality in this assumption, provided the same fits are used for all entries of the summary of interest.
Sieve estimation allows us to obtain an estimator with the small-bias property with respect to Ψ(θ0) while maintaining the optimal convergence rate of [4, 24]. The construction of sieve estimators is based on a sequence of approximating spaces Θn to Θ. These approximating spaces are referred to as sieves. Usually Θn is much simpler than Θ to avoid over-fitting but complex enough to avoid under-fitting. For example, Θn can be the space of all polynomials of degree at most K or splines with K knots, with K = K(n) → ∞ as n → ∞. In this paper, with a loss function l such that , we consider estimating θ0 by minimizing an empirical risk based on l, i.e., . Under some conditions, the growth rate of Θn can be carefully chosen so that is an asymptotically linear estimator of Ψ(θ0) while converges to θ0 at the optimal rate.
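To make this recipe concrete, the following sketch illustrates a traditional series (sieve) estimator under squared-error loss together with the plug-in step for the second moment of the regression function, as in Example 1 with f(t) = t2. The polynomial basis, the fixed choice K = 8, the toy data-generating mechanism and all names are illustrative only; in theory K = K(n) would grow with n at a carefully chosen rate.

```python
# A minimal sketch of a traditional series (sieve) estimator under squared-error
# loss and of the plug-in principle for Psi(theta_0) = E[theta_0(X)^2]
# (Example 1 with f(t) = t^2).  All choices below are illustrative.
import numpy as np

def polynomial_basis(x, K):
    """Design matrix [1, x, x^2, ..., x^(K-1)] for a scalar covariate x."""
    return np.column_stack([x ** j for j in range(K)])

def sieve_fit(x, z, K):
    """Empirical risk minimizer over Theta_n = span of the first K polynomial
    terms; under squared-error loss this is ordinary least squares."""
    beta, *_ = np.linalg.lstsq(polynomial_basis(x, K), z, rcond=None)
    return lambda xnew: polynomial_basis(xnew, K) @ beta

# Toy data: Z = theta_0(X) + noise with theta_0(x) = sin(3x)
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
z = np.sin(3 * x) + rng.normal(scale=0.5, size=n)

theta_hat = sieve_fit(x, z, K=8)
psi_hat = np.mean(theta_hat(x) ** 2)   # plug-in estimate of E[theta_0(X)^2]
print(psi_hat)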
Throughout this paper, for a probability distribution P and an integrable function f with respect to P, we define . We use Pn to denote the empirical distribution. We take 〈·, ·〉 to be the L2 (P0)-inner product, i.e., , where L2 (P0) is the set of real-valued P0 -squared-integrable functions defined on the support of P0. When the functions are vector-valued, we take . We use ‖·‖ to denote the induced norm of 〈·, ·〉. We assume that Θ ⊆ L2 (P0). We remark that we have committed to a specific choice of inner product and norm to fix ideas; other inner products can also be adopted, and our results will remain valid upon adaptation of our upcoming conditions. We discuss this explicitly via a case study in Appendix A.
For the methods we propose in this article, we assume that Θ is convex. Throughout this paper, we will further require a set of conditions similar to those in [24]. For any θ ∈ Θ, let be the Gâteaux derivative of l at θ0 in the direction θ−θ0 and be the corresponding remainder.
Condition A1 (Linearity and boundedness of Gâteaux derivative operator of loss function). For all θ ∈ Θ, exists and is linear and bounded in θ − θ0.
Condition A2 (Local quadratic behavior of loss function). There exists a constant α0,l ∈ (0, ∞) such that, for all θ ∈ Θ such that P0 {l(θ) − l(θ0)} or ‖θ − θ0‖ is sufficiently small, it holds that P0 {l(θ) − l(θ0)} = α0,l ‖θ − θ0 ‖2 /2 + o(‖θ − θ0‖2).
Remark 1. We now present an equivalent form of A2 that may be easier to verify in practice. For all θ ∈ Θ\{θ0}, define hθ := (θ−θ0)/‖θ−θ0 ‖ and . Requiring Condition A2 is equivalent to requiring that for all θ1, θ2 ∈ Θ\{θ0} and that
Moreover, if A2 holds, then, for any θ ∈ Θ\{θ0}, it is true that α0,l = aθ.
A large class of loss functions satisfies Conditions A1 and A2. For example, in the regression setting where Z is the outcome, the squared-error loss l(θ) : v ↦ [z − θ(x)]2 and the logistic loss l(θ) : v ↦ −zθ(x) + log{1 + exp(θ(x))} both satisfy these conditions; a negative working log-likelihood usually also satisfies these conditions. In Examples 1–4, the unknown functions are all conditional mean functions, which can be estimated with the above loss functions; thus, Conditions A1 and A2 hold. Examples 3 and 4 require a slight modification, discussed in more detail in Appendix A. We also note that Condition A2 is sufficient for Condition B in [24].
Condition A3 (Differentiability of summary of interest). exists for all θ ∈ Θ and is a linear bounded operator.
If Condition A3 holds, then, by the Riesz representation theorem, for a gradient function in the completion of the space spanned by Θ−θ0 := {x ↦ θ(x) − θ0 (x) : θ ∈ Θ}.
Condition A4 (Locally quadratic remainder). There exists a constant C > 0 so that, for all θ with sufficiently small ‖θ − θ0‖, it holds that
The above condition states that the remainder of the linear approximation to Ψ is locally bounded by a quadratic function.
Conditions A3 and A4 hold for Examples 1–4. For the generalized moment of the conditional mean function in Example 1, it holds that . For the average derivative of the conditional mean function in Example 2, it holds that . For the average treatment effect and the treatment effect heterogeneity measure in Examples 3 and 4, as we show in Appendix A, also exists and depends on the propensity score function x ↦ P0 (A = 1 | X = x).
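To illustrate how Conditions A3 and A4 can be verified in a concrete case, consider the generalized moment Ψ(θ) = P0(f ○ θ) of Example 1 and assume, for this illustration only, that f is twice differentiable with bounded second derivative. A pointwise Taylor expansion of f around θ0(x) gives

$$\Psi(\theta)-\Psi(\theta_0) \;=\; P_0\big\{(f'\circ\theta_0)\,(\theta-\theta_0)\big\} \;+\; R(\theta,\theta_0), \qquad |R(\theta,\theta_0)| \;\le\; \tfrac{1}{2}\,\sup_t |f''(t)|\,\|\theta-\theta_0\|^2,$$

so Condition A3 holds with derivative θ − θ0 ↦ 〈f′ ○ θ0, θ − θ0〉, the gradient is f′ ○ θ0, and Condition A4 holds with C = sup_t |f″(t)|/2.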
3. Estimation with Highly Adaptive Lasso
3.1. Brief review of Highly Adaptive Lasso
Recently, the Highly Adaptive Lasso (HAL) was proposed as a flexible ML algorithm that only requires a mild smoothness condition on the unknown function and has a well-described implementation [1, 26]. In this subsection, we briefly review HAL. We first heuristically introduce its definition and desirable properties, and then introduce the definition and implementation more formally. For ease of presentation, for the moment, we assume that θ0 is real-valued.
In HAL, θ0 is assumed to fall in the class of càdlàg functions (right-continuous with left limits) defined on with variation norm bounded by a finite constant M. In this section, we denote this function class by Θv,M. The variation norm of a càdlàg function θ, denoted by ‖θ‖v, characterizes the total variability of θ as its argument ranges over the domain, so ‖ · ‖v is a global smoothness measure and Θv,M is a large function class that even contains functions with discontinuities. Fig. 1 presents some examples of univariate càdlàg functions with finite variation norms for illustration. Because Θv,M is a rich class, it can be plausible that θ0 ∈ Θv,M for some M < ∞. The HAL estimator of θ0 is then . Under this assumption, it has been shown that regardless of the dimension of X under additional mild conditions [26]. Thus, estimation with HAL replaces the usual smoothness requirement on derivatives of traditional series estimators by a requirement on global smoothness, namely θ0 ∈ Θv,M for some M.
Figure 1.

Examples of univariate càdlàg functions with finite variation norms. The top-left, top-right, bottom-left and bottom-right plots present the standard normal density function, a minimax concave penalty function [35], a step function and the real part of a Morlet wavelet [13] respectively.
We next formally present the definition of the variation norm of a càdlàg function θ : [x(l), x(u)] → ℝ. Here, x(l) and x(u) are vectors in ℝd; with ≤ being entrywise, [x(l), x(u)] := {x ∈ ℝd : x(l) ≤ x ≤ x(u)}.
For any nonempty index set s ⊆ {1, 2, …, d} and any x = (x1, x2, …, xd) ∈ [x(l), x(u)], we define xs := {xj : j ∈ s} and x−s := {xj : j ∈ {1, 2, …, d} \ s} to be the entries of x with indices in and not in s respectively. We define the s-section of θ as the function θs : xs ↦ θ(xs, x(l)−s), obtained from θ by fixing the arguments with indices outside s at their lower endpoints. We can subsequently obtain the following representation of θ at any x ∈ [x(l), x(u)] in terms of sums and integrals of the variation of s-sections of θ [9]:
$$\theta(x) \;=\; \theta\big(x^{(l)}\big) \;+\; \sum_{\emptyset \neq s \subseteq \{1,\ldots,d\}} \int_{\big(x^{(l)}_s,\, x_s\big]} \theta_s(\mathrm{d}u_s).$$
The variation norm is then defined as
$$\|\theta\|_v \;:=\; \big|\theta\big(x^{(l)}\big)\big| \;+\; \sum_{\emptyset \neq s \subseteq \{1,\ldots,d\}} \int_{\big(x^{(l)}_s,\, x^{(u)}_s\big]} \big|\theta_s(\mathrm{d}u_s)\big|.$$
We refer to [1] and [26] for more details on variation norm. Notably, this notion of variation norm coincides with that of Hardy and Krause [21].
We finally briefly introduce the algorithm to compute a HAL estimator. It can be shown that an empirical risk minimizer in Θv,M is a step function that only jumps at sample points, namely a function of the form
$$x \;\mapsto\; \beta_0 \;+\; \sum_{\emptyset \neq s \subseteq \{1,\ldots,d\}} \sum_{j=1}^{n} \beta_{s,j}\, \mathbb{1}\big\{x_s \geq X_{j,s}\big\}.$$
Here, β0 and all βs,j are real numbers. To find an empirical risk minimizer in Θv,M of the above form, we may solve the following optimization problem:
$$\min_{\beta_0,\,\{\beta_{s,j}\}} \; P_n\, l(\theta) \quad \text{subject to} \quad |\beta_0| \;+\; \sum_{\emptyset \neq s \subseteq \{1,\ldots,d\}} \sum_{j=1}^{n} |\beta_{s,j}| \;\leq\; M,$$
where θ denotes the function of the above form with coefficients β0 and {βs,j}.
The constraint imposes an upper bound M on the ℓ1 norm of the coefficient vector. Therefore, for common loss functions, we may use software for LASSO regression [25]. For example, if the loss function is the squared-error loss, then we may run a LASSO linear regression to obtain a HAL estimate.
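The following is a minimal sketch of this construction for a bivariate X under squared-error loss. The indicator basis 1{xs ≥ Xj,s} over all nonempty coordinate subsets s and observed knots Xj follows the representation above, but the fitting step uses scikit-learn's penalized Lasso as a stand-in: off-the-shelf Lasso software solves the Lagrangian rather than the constrained form, so the penalty level alpha plays the role of (and would need to be mapped to, or tuned in place of) the bound M. All names and numerical choices are illustrative; in practice one would use a dedicated implementation such as the hal9001 R package.

```python
# A minimal sketch of a HAL fit under squared-error loss (illustrative only).
# The basis consists of the indicators 1{x_s >= X_{j,s}} over all nonempty
# coordinate subsets s and all observed knots X_j, as in the representation
# above.  The penalty level `alpha` below stands in for the bound M.
from itertools import combinations
import numpy as np
from sklearn.linear_model import Lasso

def hal_basis(X, knots):
    """Zero-order (indicator) HAL basis evaluated at the rows of X."""
    n, d = X.shape
    cols = []
    for r in range(1, d + 1):
        for s in combinations(range(d), r):
            idx = list(s)
            # 1{x_s >= X_{j,s}} entrywise over the subset s, for every knot j
            cols.append(np.all(X[:, None, idx] >= knots[None, :, idx], axis=2))
    return np.concatenate(cols, axis=1).astype(float)

rng = np.random.default_rng(1)
n, d = 500, 2
X = rng.uniform(-1, 1, size=(n, d))
z = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

H = hal_basis(X, knots=X)                        # n x (n * (2^d - 1)) design matrix
fit = Lasso(alpha=0.01, max_iter=10000).fit(H, z)
theta_hat = lambda Xnew: fit.predict(hal_basis(Xnew, knots=X))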
3.2. Estimation with an oracle tuning parameter
In this section, we consider plug-in estimators based on HAL. For ease of illustration, for the rest of this section, we consider scalar-valued Ψ, and will discuss vector-valued Ψ only at the end of this subsection. We further introduce the following conditions needed to establish that the HAL-based plug-in estimator is efficient.
Condition B1 (Càdlàg functions). θ0 and are càdlàg.
Condition B2 (Bound on variation norm). For some M < ∞, .
Condition B2 ensures that certain perturbations of θ0 still lie in Θv,M, a crucial requirement for proving the asymptotic linearity of our proposed plug-in estimator. In addition, since may depend on components of P0 other than θ0 as in Examples 2–4, Conditions B1–B2 may also impose conditions on these components.
In this section, we fix an M that satisfies Condition B2. Additional technical conditions can be found in Appendix B.1. Let denote the HAL fit obtained using the bound M in Condition B2.
We note that is not a typical sieve estimator because M is fixed and there is no explicit sequence of growing approximating spaces Θn. Nevertheless, we may view this method as a special case of sieve estimation with degenerate sieves Θn = Θv,M for all n. This allows us to utilize existing results [24] to show the asymptotic linearity and efficiency of the plug-in estimator based on . We next formally present this result.
Theorem 1 (Efficiency of plug-in estimator). Under Conditions A1–A4 and B1–B4, is an asymptotically linear estimator of Ψ(θ0) with the influence function being , that is,
As a consequence, with . In addition, under Conditions E1 and E2 in Appendix B.4, is efficient under a nonparametric model.
We note that, for HAL to achieve the optimal convergence rate, we only need that M ≥ ‖θ0 ‖v [1, 26]. The requirement of a larger M imposed by Condition B2 resembles undersmoothing [18], as using a larger M would result in a fit that is less smooth than that based on the CV-selected bound. The L2 (P0)-convergence rate of the flexible fit using the larger bound remains the same, but the leading constant may be larger. This is in contrast to traditional undersmoothing, which leads to a fit with a suboptimal rate of convergence.
Under some conditions, the following lemma provides a loose bound on in the case that has a particular structure. Such a bound can be used to select an appropriate bound on variation norm that satisfies Condition B2.
Lemma 1. Suppose that , where is differentiable. Let x(l) = sup{x : P0 (X ≥ x) = 1}, where sup and ≥ are entrywise. Assume that θ0 is differentiable. If each of ‖θ0‖v, and is finite, then . Hence, .
As we discussed at the end of Section 2, such structures as are common, especially if we augment θ0 to include other implicitly relevant components of P0. For example, in Example 2, we may augment θ0 with p0 and ; in Examples 3 and 4, we may augment θ0 with the propensity score function.
When θ0 is -valued, θ0 can often be viewed as a collection of q real-valued variation-independent functions η10, …, ηq0. In this case, we can define Θv,M = {(η1, …, ηq) : ηj is càdlàg, ‖ηj‖v ≤ Mj, j = 1, …, q} for a positive vector M = (M1, …, Mq). The subsequent arguments follow analogously, where now each ηj is treated as a separate function.
We remark that an undersmoothing condition such as B2 appears to be necessary for a HAL-based plug-in estimator to be efficient. We illustrate this numerically in Section 3.3. The choice of a sufficiently large bound M required by Theorem 1 is by no means trivial, since this choice requires knowledge that the user may not have. Nevertheless, this result forms the basis of the data-driven method that we propose in Section 3.3 for choosing M. We also remark that, if we wish to plug in the same based on HAL for a rich class of estimands, the chosen bound M needs to be sufficiently large for all estimands of interest.
Another method to construct efficient plug-in estimators based on HAL has been independently developed [28]. Unlike our approach based on sieve theory, in this work, the authors directly analyzed the first-order bias of the plug-in estimator using influence functions. In terms of ease of implementation, their method requires specifying a constant involved in a threshold of the empirical mean of the basis functions, which may be difficult to specify in applications. Our approach in Section 3.3 may also require specifying an unknown constant to obtain a valid upper bound on , but in some cases the constant may be set to zero, and our simulation suggests that the performance is not sensitive to the choice of the constant.
3.3. Data-adaptive selection of the tuning parameter
Since it is hard to prespecify a bound M on the variation norm that is sufficiently large to satisfy Condition B2 but also sufficiently small to avoid overfitting for a given data set, it is desirable to select M in a data-adaptive manner. A seemingly natural approach makes use of k-fold CV. In particular, for each candidate bound M, partition the data into k folds of approximately equal size (k is fixed and does not depend on n), in each fold evaluate the performance of the HAL estimator fitted on all other folds based on this candidate M, and use the candidate bound Mn with the best average performance across all folds to obtain the final fit. It has been shown that can achieve the optimal convergence rate under mild conditions [29], but Mn appears not to satisfy Condition B2 in general. In particular, the derived bound on relies on an empirical process term, namely , and a larger M implies a larger space Θv,M. Therefore, the bound on grows with M. Because k-fold CV seeks to optimize out-of-sample performance, Mn generally appears to be close to ‖θ0‖v and not sufficiently large to obtain an efficient plug-in estimator.
To avoid this issue with the CV-selected bound, we propose a method that takes inspiration from k-fold CV, but modifies the bound so that it is guaranteed to yield an efficient plug-in estimator for Ψ(θ0). This method may require the analytic expression for . In Sections 4 and 5, we present methods that do not require this knowledge.
Derive an upper bound on . This bound is a non-decreasing function of the variation norms of functions that can be learned from data (e.g., using Lemma 1). In other words, find a non-decreasing function F such that for unknown functions η10, …, ηq0 that can be assumed to be càdlàg with finite variation norm and can be estimated with HAL.
Estimate θ0, η10, …, ηq0 by HAL with k-fold CV, and denote the CV-selected bounds for these functions by Mn, M1n, …, Mqn.
For a small ϵ > 0, use the bound Mn + ϵ + F (M1n + ϵ, …, Mqn + ϵ) to estimate θ0 with HAL and plug in the fit. We refer to this step of slightly increasing the bounds as ϵ-relaxation.
It follows from Lemma 2 in the Appendix that this method would yield a sufficiently large bound with probability tending to one. In practice, it is desirable for the bound derived on to be relatively tight to avoid choosing an overly large bound that leads to overfitting in small to moderate samples. We remark that multiplying by 1 + ϵ rather than adding ϵ to each argument also leads to a valid choice for the bound; that is, the bound Mn (1 + ϵ) + F (M1n (1 + ϵ), …, Mqn (1 + ϵ)) is also sufficiently large with probability tending to one. In practice, the user may increase each CV-selected bound by, for example, 5% or 10%. Although it is more natural and convenient to directly use Mn + F (M1n, …, Mqn) as the bound, we have only been able to prove the result with a small ϵ-relaxation. However, if the bound is loose and F is continuous, we can show that ϵ-relaxation is unnecessary. The formal argument can be found after Lemma 2 in the Appendix.
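A minimal sketch of this selection rule is given below, assuming the user has already derived a valid non-decreasing bound F (e.g., via Lemma 1) and has obtained the CV-selected variation-norm bounds. The function name, the 5% relaxation, and the particular F shown in the usage example are illustrative only and are not the paper's prescribed choices.

```python
# A minimal sketch of the proposed data-adaptive choice of the variation-norm
# bound with epsilon-relaxation.  M_cv is the CV-selected bound for theta_0,
# M_js collects the CV-selected bounds for the nuisance functions
# eta_{10}, ..., eta_{q0}, and F is a user-derived non-decreasing upper bound on
# the variation norm of the gradient (e.g., from Lemma 1).  All names are
# illustrative.
def relaxed_bound(M_cv, M_js, F, eps=0.05):
    """Return the enlarged bound M_cv*(1+eps) + F(M_1*(1+eps), ..., M_q*(1+eps))."""
    return M_cv * (1 + eps) + F(*[M * (1 + eps) for M in M_js])

# Hypothetical usage with a made-up F; the final bound would then be used to
# refit HAL, and the resulting fit plugged in.
M_final = relaxed_bound(M_cv=1.8, M_js=[1.8], F=lambda m: 2.0 * m, eps=0.05)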
As for methods based on knowledge of an influence function, deriving the gradient and a bound on its variation norm requires some expertise, but in some cases this task can be straightforward. The derivation of an influence function is typically based on a fluctuation in the space of distributions, but in many cases, the relation between such fluctuations and the summary of interest is implicit and difficult to handle. In contrast, the derivation of is based on a fluctuation of θ0, and the summary of interest explicitly depends on θ0. As a consequence, it can be simpler to derive than to derive an influence function. For example, for the summary in Example 1, we find that by straightforward calculation, whereas the influence function given in Theorem 1 is more difficult to directly derive analytically.
We illustrate the fact that Mn may not be sufficiently large, and show that our proposed method resolves this issue, via a simulation study in which and . We compare the performance of the plug-in estimators based on the 10-fold CV-selected bound on variation norm (M.cv), the bound derived from the analytic expression of with and without ϵ-relaxation (M.gcv+ and M.gcv respectively), and a sufficiently large oracle choice satisfying Condition B2 (M.oracle). According to Lemma 1, M.oracle is 3‖θ0‖v and M.gcv is 3×M.cv. We also investigate the performance of 95% Wald CIs based on the influence function. For each resulting plug-in estimator, we investigate the following quantities: n · MSE, the scaled absolute bias, and CI coverage. More details of this simulation are provided in Appendix E. In theory, for an efficient estimator, we should find that n · MSE tends to a constant (the variance of the influence function), that the scaled absolute bias tends to 0, and that 95% Wald CIs have approximately 95% coverage.
We report performance summaries relative to these criteria in Fig 2 and Table 1, from which it appears that the plug-in estimators with M.oracle and M.gcv+ achieve efficiency, while the plug-in estimator based on M.cv does not. The desirable performance of M.oracle and M.gcv+ agrees with the available theory, whereas the poor performance of M.cv suggests that cross-validation may not yield a valid choice of the bound on variation norm in general. Interestingly, M.gcv performs similarly to M.oracle and M.gcv+. We conjecture that using an ϵ-relaxation is unnecessary in this setting. In Fig 3, we can also see that M.cv tends to ‖θ0‖v and has a high probability of being less than M.oracle. Therefore, this simulation suggests that using a sufficiently large bound — in particular, a bound larger than the CV-selected bound — may be necessary and sufficient for the plug-in estimator to achieve efficiency.
Figure 2.

The relative MSE, n · MSE/ξ2, and the relative absolute bias, , of the plug-in estimator of based on HAL for an oracle choice of the bound on variation norm (M.oracle), the 10-fold CV-selected bound (M.cv), and a bound based on M.cv and the analytic expression of without and with ϵ-relaxation (M.gcv and M.gcv+ respectively). ξ2 := P0IF2 is the asymptotic variance that the n · MSE of an AL estimator should converge to. Note that the n · MSE for M.oracle, M.gcv and M.gcv+ tends to ξ2, but that for M.cv does not.
Table 1.
Coverage probability of 95% Wald CI of the plug-in estimator of based on HAL for an oracle choice of the bound on variation norm (M.oracle), the 10-fold CV-selected bound (M.cv), a bound based on M.cv and analytic expression of without and with ϵ-relaxation (M.gcv and M.gcv+ respectively). The CI is constructed based on the influence function. The coverage for M.oracle, M.gcv and M.gcv+ is approximately 95%, but that for M.cv is not.
| n | M.cv | M.gcv | M.gcv+ | M.oracle |
|---|---|---|---|---|
| 500 | 0.87 | 0.96 | 0.96 | 0.97 |
| 1000 | 0.87 | 0.97 | 0.97 | 0.97 |
| 2000 | 0.90 | 0.95 | 0.95 | 0.96 |
| 5000 | 0.93 | 0.95 | 0.95 | 0.95 |
| 10000 | 0.89 | 0.95 | 0.95 | 0.95 |
Figure 3.

A boxplot of the ratio of bounds based on 10-fold CV and M.oracle. The horizontal gray thick dashed lines are 1 and 1/3. The y-axis is scaled based on logarithm for readability. There is a high probability that M.cv is much smaller than M.oracle; M.cv tends to the variation norm of the function being estimated, ‖θ0‖v, corresponding to 1/3 of M.oracle. Enlarging M.cv according to the analytic expression of with ϵ-relaxation results in sufficiently large bounds. The enlargement without ϵ-relaxation appears to have similar performance.
4. Data-adaptive series
4.1. Proposed method
For ease of illustration, we consider the case that Ψ is scalar-valued in this section. As we will describe next, our proposed estimation procedure for function-valued features does not rely on Ψ and hence can be used for a class of summaries.
Suppose that Θ is a vector space of -valued functions equipped with the L2(P0)-inner product. Further, suppose that for some function: . This holds, for example, when Ψ : θ ↦ P0(f ○ θ) for a fixed differentiable function f in Example 1. In this case, and hence . Particularly useful examples include Examples 1 and 4. For now we assume that the marginal distribution of X is known, so that we only need to estimate θ0 for this summary. We will address the more difficult case in which the marginal distribution of X is unknown in Section 4.3.
Let be a given initial flexible ML fit of θ0 and consider the data-adaptive sieve-like subspaces based on , where ϕ1, ϕ2,… are -valued basis functions in a series defined on and K = K (n) is a deterministic number of terms in the series — we will consider selecting K via CV in Section 4.4. Let denote the series estimator within this data-adaptive sieve-like subspace that minimizes the empirical risk. We propose to use to estimate Ψ(θ0).
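The following sketch walks through the three steps of the proposal for the second-moment summary of Example 1 under squared-error loss: obtain an initial ML fit, minimize the empirical risk over the data-adaptive sieve spanned by a trigonometric basis composed with that fit, and plug the resulting fit in (here using the empirical marginal of X, as in Section 4.3). Gradient boosting as the initial learner, the rescaling of the fitted values to [0, 1], the cosine basis, the toy data and the fixed K are all illustrative choices.

```python
# A minimal sketch of the data-adaptive series estimator for
# Psi(theta_0) = E[theta_0(X)^2] under squared-error loss (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def trig_basis(t, K):
    """Intercept plus the first K cosine terms of a trigonometric series on [0, 1]."""
    return np.column_stack([np.ones_like(t)] + [np.cos(np.pi * k * t) for k in range(1, K + 1)])

rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(-1, 1, size=(n, 3))
theta0 = lambda x: np.where(x[:, 0] > 0, 1.0, 0.0) + x[:, 1]       # discontinuous truth
Z = theta0(X) + rng.normal(scale=0.5, size=n)

# Step 1: initial flexible ML fit of the regression function theta_0
ml = GradientBoostingRegressor().fit(X, Z)
t_hat = ml.predict(X)

# Step 2: empirical risk minimization (least squares) within the data-adaptive
#         sieve span{phi_1 o theta_hat, ..., phi_K o theta_hat}
t01 = (t_hat - t_hat.min()) / (t_hat.max() - t_hat.min())          # rescale fitted values to [0, 1]
K = 6
B = trig_basis(t01, K)
beta, *_ = np.linalg.lstsq(B, Z, rcond=None)
theta_tilde = B @ beta                                             # series fit evaluated at the data

# Step 3: plug in for Psi(theta_0) = E[theta_0(X)^2], approximating the marginal
#         of X by its empirical distribution (Section 4.3)
psi_hat = np.mean(theta_tilde ** 2)
print(psi_hat)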
4.2. Results for a deterministic number of terms
Following [4, 24], our proofs of the validity of our data-adaptive series approach make heavy use of projection operators. We use to denote the projection operator for functions in Θ onto with respect to 〈·, ·〉. For any function θ ∈ Θ, let Πn,θ denote the operator that takes as input a function for which g ○ θ ∈ L2 (P0) and outputs a function such that Πn,θ (g) ○ θ = πn,θ (g ○ θ). In other words, letting βj be the quantity that depends on g and θ such that , we define . The operator Πn,θ may also be interpreted as follows: letting Pθ be the distribution of θ(X) with V = (X, Z) ~ P0, then Πn,θ is the projection operator of functions with respect to the L2 (Pθ)-inner product. We use to denote the identity function in .
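Written out, with βj as in the text, the two operators are related by

$$\pi_{n,\theta}(g\circ\theta)\;=\;\sum_{j=1}^{K}\beta_j\,(\phi_j\circ\theta) \qquad\text{and}\qquad \Pi_{n,\theta}(g)\;:=\;\sum_{j=1}^{K}\beta_j\,\phi_j,$$

so that Πn,θ(g) ○ θ = πn,θ(g ○ θ) by construction.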
We now present additional conditions we will require to ensure that is an efficient estimator of Ψ(θ0).
Condition C1 (Sufficient convergence rate of initial ML fit). .
Condition C2 (Sufficiently small estimation error). .
Condition C3 (Sufficiently small approximation error to for ). .
Condition C4 (Sufficiently small approximation error to for and convergence rate of ).
Appendix B.2 contains further technical conditions and Appendix C discusses their plausibility. As discussed in Appendix C, Conditions C2–C4 typically imply restrictions on the growth rate of K: if K grows too fast with n, then Condition C2 may be violated; if K instead grows too slowly, then Conditions C3 and C4 may be violated. For the generalized moment Ψ : θ ↦ P0(f ○ θ) with a fixed known function f in Example 1, Condition C4 typically also imposes a smoothness condition on f so that f can be approximated well by the series. Our conditions are closely related to the conditions in Theorem 1 of [24]. Conditions C1–C3 and C6 serve as sufficient conditions for the condition on the smoothness of Ψ and the convergence rate of in Theorem 1 of [24]. Together with Conditions C4 and C7, we can derive Lemma 4, which is similar to the first part of Condition C of [24]. The empirical process condition C8 is sufficient for Conditions A, D and the second part of C in Theorem 1 of [24].
We now present a theorem ensuring the asymptotic linearity and efficiency of the plug-in estimator based on .
Theorem 2 (Efficiency of plug-in estimator). Under Conditions A1–A4 and C1–C9, is an asymptotically linear estimator of Ψ(θ0) with the influence function being , that is,
As a consequence, with . In addition, under Conditions E1 and E2 in Appendix B.4, is efficient under a nonparametric model.
Remark 2. Consider the general case in which it may not be true that can be represented as for some . If the analytic expression of can be derived and can be estimated by such that , then our data-adaptive series can take a special form that is targeted towards Ψ. Specifically, letting and Ψ (ϑ0) := Ψ(θ0), it is straightforward to show that the gradient of Ψ is with 0 = (0, 0)T and e2 = (0, 1)T, which is a function composed with ϑ0. We can set and in our data-adaptive series. This approach does not have a growing number of terms in Θn and is not similar to sieve estimation, but can be treated as a special case of data-adaptive series. It can be shown that Conditions C1–C4 are still satisfied for ϑ and Ψ with this choice of Θn, and hence our data-adaptive series estimator leads to an efficient plug-in estimator. We remark that the introduction of ϑ and Ψ is a purely theoretical device, and this targeted approach to estimation is quite similar to that used in the context of TMLE [27, 31].
4.3. Summaries involving the marginal distribution of X
We now generalize the setting considered thus far by allowing the parameter to depend both on θ0 and on P0, i.e., estimating Ψ(θ0, P0). The example given at the beginning of Section 4.1, namely that of estimating Ψ(θ0) = P0 (f ○ θ0), is a special case of this more general setting. In what follows, we will make use of the following conditions:
Condition D1 (Conditions with P0 fixed). When we regard Ψ(θ0, P0) as the mapping θ ↦ Ψ(θ, P0) evaluated at θ0, Conditions A1–A4, C1–C4 and C6–C9 are satisfied for estimating Ψ(θ0, P0).
Condition D2 (Hadamard differentiability with θ0 fixed). The mapping P ↦ Ψ(θ0, P) is Hadamard differentiable at P0.
By the functional delta method, it follows that Ψ(θ0, Pn) = Ψ(θ0, P0) + PnIF0 + op (n−1/2) for a function IF0 satisfying P0IF0 = 0 and .
Condition D3 (Negligible second-order difference).
This condition usually holds, for example, when Ψ(θ0, P0) = P0 (f ○ θ0), as in this case the left-hand side is equal to , which is op (n−1/2) under empirical process conditions.
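For concreteness, in this special case the plug-in estimator analyzed in Theorem 3 below reduces to an empirical mean: writing θ̂n for the fit of θ0 (a notation used here only for illustration),

$$\Psi\big(\hat\theta_n, P_n\big) \;=\; P_n\big(f\circ\hat\theta_n\big) \;=\; \frac{1}{n}\sum_{i=1}^{n} f\big(\hat\theta_n(X_i)\big).$$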
Theorem 3 (Asymptotic linearity of plug-in estimator). Under Conditions D1–D3, is an asymptotically linear estimator of Ψ(θ0, P0) with influence function
that is,
As a consequence, with .
This result is easy to verify by decomposing as
Moreover, under conditions similar to the conditions E1 and E2 given in Appendix B.4, we can show that is efficient under a nonparametric model.
Remark 3. Conditions D2 and D3 can be relaxed. Specifically, if is an estimator of P0 that satisfies that for an influence function IF0 and Condition D3 holds with Pn replaced by , then is an asymptotically linear estimator of Ψ(θ0, P0).
4.4. CV selection of the number of terms in data-adaptive series
In the preceding subsections, we established the efficiency of the plug-in estimator based on suitable rates of growth for K relative to the sample size n. In this subsection, we show that, under some conditions, such a K can be selected by k-fold CV: after obtaining , for each K in a range of candidates, we can calculate the cross-validated risk from k folds and choose the value of K with the smallest CV risk. We denote the number of terms in the series that CV selects by K*. In this section, we use K in the subscripts for notation related to data-adaptive sieve-like spaces and projections; this represents a slight abuse of notation because, in Sections 4.1 and 4.2, these subscripts were instead used for sample size n. That is, we use ΘK,θ to denote Span {φ1, φ2, …, φK} ○ θ, πK,θ to denote the projection onto ΘK,θ, ΠK,θ to denote the operator such that ΠK,θ (g) ○ θ = πK,θ (g ○ θ) for all with g ○ θ ∈ L2 (P0), and to be the data-adaptive series estimator based on and K*.
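A minimal sketch of this selection step is given below, continuing the objects (trig_basis, t01, Z) constructed in the sketch in Section 4.1; the candidate grid for K and the use of 10 folds are illustrative.

```python
# A minimal sketch of selecting K by k-fold CV of the empirical (squared-error)
# risk, holding the initial ML fit fixed.
import numpy as np
from sklearn.model_selection import KFold

def cv_risk(t01, Z, K, basis, n_splits=10, seed=0):
    """Cross-validated squared-error risk of the K-term data-adaptive series fit."""
    risks = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(Z):
        beta, *_ = np.linalg.lstsq(basis(t01[train], K), Z[train], rcond=None)
        risks.append(np.mean((Z[test] - basis(t01[test], K) @ beta) ** 2))
    return np.mean(risks)

K_star = min(range(1, 21), key=lambda K: cv_risk(t01, Z, K, trig_basis))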
Condition C5 (Bounded approximation error of ψ̇ relative to ). There exists a constant C > 0 such that, with probability tending to one, .
This condition is equivalent to
for all K with probability tending to one, which may be interpreted in terms of two simultaneous requirements. The first requirement is that the identity function is not exactly contained in the span of φ1, …, φK for any K, since otherwise, the right-hand side would be zero for all sufficiently large K. Therefore, common series such as polynomial and spline series are not permitted for general summaries. In contrast, other series such as trigonometric series and wavelets satisfy this requirement. The second requirement is that the approximation error of the chosen series for the identity function is not much larger than . If a trigonometric or wavelet series is used, then this condition imposes a strong smoothness condition on derivatives of . Nonetheless, this may not be stringent in some interesting examples. For example, if Ψ(θ) = P0 (f ○ θ) for a fixed function f in Example 1, then equals f′ and hence can be expected to satisfy this strong smoothness condition provided that f is infinitely differentiable with bounded derivatives. The estimands encountered in many applications involve f satisfying this smoothness condition.
The following theorem justifies the use of k-fold CV to select K under appropriate conditions.
Theorem 4 (Efficiency of CV-based plug-in estimator). Assume that Conditions A1–A4, C1–C3, C5, C8 and C9 hold for a deterministic K = K(n), and suppose that part (a) of Condition C7 holds. Then, with K selected by k-fold CV, is an asymptotically linear estimator of Ψ(θ0) with influence function , that is,
As a consequence, with . In addition, under Conditions E1 and E2 in Appendix B.4, is efficient under a nonparametric model.
4.5. Simulation
4.5.1. Demonstration of Theorem 4
We illustrate our method in a simulation in which we take and . This is a special case of Example 1. The true function θ0 is chosen to be discontinuous, which violates the smoothness assumptions commonly required in traditional series estimation. In this case, and so the constant in Condition C5 is 2. We compare the performance of plug-in estimators based on three different nonparametric regressions: (i) polynomial regression with degree selected by 10-fold CV (poly), which results in a traditional sieve estimator, (ii) gradient boosting (xgb) [7, 8, 15, 16], and (iii) data-adaptive trigonometric series estimation with gradient boosting as the initial ML fit and 10-fold CV to select the number of terms in the series (xgb.trig). We also compare these plug-in estimators with the one-step correction estimator [22] based on gradient boosting (xgb.1step). Further details of this simulation can be found in Appendix E.
Fig 4 presents n · MSE and for each estimator, whereas Table 2 presents the coverage probability of 95% Wald CIs based on these estimators. We find that xgb.trig and xgb.1step estimators perform well, while poly and xgb plug-in estimators do not appear to be efficient. Since polynomial series estimators only work well when estimating smooth functions, in this simulation, we would not expect the fit from the polynomial series estimator to converge sufficiently fast, and consequently, we would not expect the resulting plug-in estimator to be efficient. In contrast, gradient boosting is a flexible ML method that can learn discontinuous functions, so we can expect an efficient plug-in estimator based on this ML method. However, gradient boosting is not designed to approximately solve the estimating equation that achieves the small-bias property for this particular summary, so we would not expect its naïve plug-in estimator to be efficient. Based on gradient boosting, our estimator and the one-step corrected estimator both appear to be efficient, but our method has the advantage of being a plug-in estimator. Moreover, the construction of our estimator does not require knowledge of the analytic expression of an influence function.
Figure 4.

The relative MSE, n · MSE/ξ2, and the relative absolute bias, , of estimators of . ξ2 := P0IF2 is the asymptotic variance that the n · MSE of an AL estimator should converge to. poly: plug-in estimator based on polynomial sieve estimation. xgb: plug-in estimator based on gradient boosting. xgb.1step: one-step correction (debiasing) of the plug-in estimator based on gradient boosting. xgb.trig: data-adaptive series with trigonometric series composed with gradient boosting. All tuning parameters are CV-selected. The y-axis for relative MSE is scaled based on logarithm for readability. Note that the n · MSE for xgb.trig and xgb.1step tends to ξ2, but those for poly and xgb do not.
Table 2.
Coverage probability of 95% Wald CI based on estimators of . poly: plug-in estimator based on polynomial sieve estimation. xgb: plug-in estimator based on gradient boosting. xgb.1step: one-step correction (debiasing) of the plug-in estimator based on gradient boosting. xgb.trig: data-adaptive series with trigonometric series composed with gradient boosting. All tuning parameters are CV-selected. The CI is constructed based on the influence function. The coverage probabilities for xgb.trig and xgb.1step are approximately 95%, but those for poly and xgb are not.
| n | poly | xgb | xgb.1step | xgb.trig |
|---|---|---|---|---|
| 500 | 0.90 | 0.90 | 0.95 | 0.95 |
| 1000 | 0.86 | 0.89 | 0.95 | 0.95 |
| 2000 | 0.74 | 0.88 | 0.96 | 0.96 |
| 5000 | 0.47 | 0.88 | 0.94 | 0.94 |
| 10000 | 0.16 | 0.87 | 0.95 | 0.96 |
| 20000 | 0.02 | 0.86 | 0.96 | 0.96 |
We also investigate the effect of the choice of K on the performance of our method. Fig 5 presents n · MSE for the data-adaptive series estimator with different choices of K. We can see that our method is insensitive to the choice of K in this simulation setting. Although a relatively small K performs better, choosing a much larger K does not appear to substantially harm the behavior of the estimator. This insensitivity to the selected tuning parameter suggests that in some applications, without using CV, an almost arbitrary choice of K that is sufficiently large might perform well.
Figure 5.

n · MSE of estimators of based on data-adaptive series with different choices of K. The horizontal gray thick dashed line is the asymptotic variance that the n · MSE of an AL estimator should converge to, ξ2 := P0IF2. Note that the n · MSE is not sensitive to the choice of K over a wide range of K.
4.5.2. Violation of Condition C5
For the k-fold CV selection of K in our method to yield an efficient plug-in estimator, Ψ must be highly smooth in the sense that can be approximated by the series about as well as can the identity function (see Condition C5). Although we have argued that this condition is reasonable, in this section, we explore via simulation the behavior of our method based on CV when is rough. We again take and an artificial summary Ψ(θ0) = P0(f ∘ θ0) where f is an element of C1[−1,1] but not of C2 [−1, 1]. In this case, is very rough, so we do not expect it to be approximated by a trigonometric series as well as the identity function. However, it is sufficiently smooth to allow for the existence of a deterministic K that achieves efficiency. Further simulation details are provided in Appendix E.
Table 3 presents the performance of our estimator based on 10-fold CV. We note that it performs reasonably well in terms of the n · MSE criterion. However, it is unclear whether its scaled bias converges to zero for large n, so our method may be too biased. The coverage of 95% Wald CIs is close to the nominal level, suggesting that the bias is fairly small relative to the standard error of the estimator at the sample sizes considered. One possible explanation for the good performance observed is that the L2 (P0)-convergence rate of is much faster than n−1/14, which allows for a slower convergence rate of the approximation error (see Appendix C). This simulation shows that our proposed method may still perform well even if Condition C5 is violated, especially when the initial ML fit is close to the unknown function.
Table 3.
Performance of the plug-in estimator of Ψ(θ0) = P0 (f ○ θ0) based on data-adaptive series. Here f is not infinitely differentiable. The relative MSE is n · MSE/ξ2, where ξ2 := P0IF2 is the asymptotic variance that the n · MSE of an AL estimator should converge to; the root-n absolute relative bias is . The performance appears to be acceptable in view of the small MSE and reasonable CI coverage.
| n | relative MSE | root-n absolute relative bias | 95% Wald CI coverage |
|---|---|---|---|
| 500 | 0.88 | 3.95 | 0.97 |
| 1000 | 0.89 | 3.73 | 0.96 |
| 2000 | 0.79 | 3.15 | 0.97 |
| 5000 | 0.78 | 2.02 | 0.97 |
| 10000 | 0.88 | 2.57 | 0.97 |
| 20000 | 0.88 | 1.75 | 0.96 |
5. Generalized data-adaptive series
5.1. Proposed method
As in Section 4, we consider the case that Ψ is scalar-valued in this section. The assumption that may be too restrictive for general summaries as in Examples 2–4, especially if is not derived analytically (see Remark 2). In this section, we generalize the method in Section 4 to deal with these summaries. Letting be the identity function defined on , we can readily generalize the above method to the case where can be represented as for a function ; that is, . This form holds trivially if we set , i.e., is independent of its first argument, but we can utilize flexible ML methods if is nontrivial. Again, we assume Θ is a vector space of -valued functions equipped with the L2(P0)-inner product. We assume can be approximated well by a basis , and consider the data-adaptive sieve-like subspace . We propose to use to estimate Ψ(θ0), where denotes the series estimator within Θn minimizing the empirical risk.
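The sketch below illustrates the generalization under squared-error loss: the basis functions are now allowed to depend jointly on the initial fit and on the covariates (here, for simplicity, only on the first coordinate of x through a tensor-product cosine basis), so that gradients depending on both θ0(x) and x can be approximated. The learner, the rescalings, the choice of which coordinates enter the basis, and K1, K2 are all illustrative and are not the paper's exact specification.

```python
# A minimal sketch of the generalized data-adaptive series under squared-error
# loss (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def cos_basis(t, K):
    """First K cosine terms on [0, 1]; k = 0 gives the constant term."""
    return np.column_stack([np.cos(np.pi * k * t) for k in range(K)])

rng = np.random.default_rng(3)
n = 2000
X = rng.uniform(0, 1, size=(n, 3))
theta0 = lambda x: np.where(x[:, 0] > 0.5, 1.0, 0.0) + x[:, 1]     # discontinuous truth
Z = theta0(X) + rng.normal(scale=0.5, size=n)

t_hat = GradientBoostingRegressor().fit(X, Z).predict(X)           # initial ML fit
t01 = (t_hat - t_hat.min()) / (t_hat.max() - t_hat.min())          # rescale to [0, 1]

# Tensor products of a cosine basis in theta_hat(x) and a cosine basis in x_1
K1, K2 = 4, 3
B = np.einsum('ij,ik->ijk', cos_basis(t01, K1), cos_basis(X[:, 0], K2)).reshape(n, -1)
beta, *_ = np.linalg.lstsq(B, Z, rcond=None)
theta_tilde = B @ beta     # the generalized data-adaptive series fit, to be plugged in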
5.2. Results for proposed method
With a slight abuse of notation, in this section we use to denote the function (t, x) ↦ t, where and . Again, we use to denote the projection operator onto . Let Πn,θ be defined such that, for any function with , it holds that ; that is, letting βj be the quantity that depends on g and θ such that , we define .
We introduce conditions and derive theoretical results that are parallel to those in Section 4.
Condition C3* (Sufficiently small approximation error to for ). .
Condition C4* (Sufficiently small approximation error to for and convergence rate of ). Additional regularity conditions can be found in Appendix B.3. Note that, because may depend on components of P0 other than θ0, Condition C4* may impose smoothness conditions on these components so that can be well approximated by the chosen series. For example, in Example 2, Condition C4* requires that and the propensity score can be approximated well by the series; in Examples 3 and 4, Condition C4* imposes the same requirement on the propensity score. We now present a theorem that establishes the efficiency of the plug-in estimator based on .
Theorem 5 (Efficiency of plug-in estimator). Under Conditions A1–A4, C1, C2, C3*, C4*, C6*, C7*, C8 and C9, is an asymptotically linear estimator of Ψ(θ0) with influence function , that is,
As a consequence, with . In addition, under Conditions E1 and E2 in Appendix B.4, is efficient under nonparametric models.
Remark 4. When Ψ depends on both θ0 and P0, we can readily adapt this method as in Section 4.3.
We now present a condition for selecting K via k-fold CV, in parallel with Condition C5 from Section 4.4.
Condition C5* (Bounded approximation error of relative to ). There exists a constant C > 0 such that, with probability tending to one, .
Remark 5. Similarly to Condition C5, Condition C5* requires that the identity function is not contained in the span of finitely many terms of the chosen series and that is sufficiently smooth so that can be approximated well by the chosen series. However, Condition C5* may be far more stringent than Condition C5; in fact, it may be overly stringent in practice. Since may depend on components of P0 other than θ0, Condition C5* may require these components to be sufficiently smooth. When a common candidate series such as the trigonometric series is used, a sufficient condition for Condition C5* is that is infinitely differentiable with bounded derivatives, which further imposes assumptions on the smoothness of other components of P0. For example, in Example 2, a sufficient condition for Condition C5* is that is infinitely differentiable with bounded derivatives; in Examples 3 and 4, a sufficient condition is that the propensity score function satisfies the same requirement. Due to the stringency of Condition C5*, we conduct a simulation in Section 5.3.2 to understand the performance of our proposed method when this condition is violated. The simulation appears to indicate that our method may be robust against violation of Condition C5*.
The following theorem shows that k-fold CV can be used to select K under certain conditions.
Theorem 6 (Efficiency of CV-based plug-in estimator). Assume that Conditions A1–A4, C1, C2, C3*, C6*, C7*, C8 and C9 hold for a deterministic K = K(n), and suppose that part (a) of Condition C7* holds. Then, with K selected by k-fold CV, is an asymptotically linear estimator of Ψ(θ0) with influence function , that is,
Therefore, with . In addition, under Conditions E1 and E2 in Appendix B.4, is efficient under a nonparametric model.
5.3. Simulation
In the following simulations, we consider the problem in Example 4. As we show in Appendix A, letting g0 : x ↦ P0(A = 1|X = x) be the propensity score and setting θ = (μ0, μ1), with ℓ(θ) : v ↦ a[z−μ1(x)]2 +(1−a)[z−μ0(x)]2, the generalized data-adaptive series methodology may be used to obtain an efficient estimator. As in Section 4.5, we conduct two simulation studies, the first demonstrating Theorem 6 and the other exploring the robustness of CV against violation of Condition C5*.
5.3.1. Demonstration of Theorem 6
We choose θ0 to be a discontinuous function while g0 is highly smooth. We compare the performance of plug-in estimators based on three different nonparametric regressions: (i) polynomial regression with the degree selected by 5-fold CV (poly), which results in a traditional sieve estimator, (ii) gradient boosting (xgb) [7, 8, 15, 16], and (iii) generalized data-adaptive trigonometric series estimation with gradient boosting as the initial ML fit and 5-fold CV to select the number of terms in the series (xgb.trig). As in Section 4.5.1, we also compare these plug-in estimators with the one-step correction estimator based on gradient boosting (xgb.1step). Further details of the simulation setting are provided in Appendix E.
Fig 6 presents the relative MSE, n · MSE/ξ², and the relative absolute bias for each estimator, whereas Table 4 presents the coverage probability of 95% Wald CIs based on these estimators. A few runs in the simulation exhibited noticeably poor behavior, so we trimmed the most extreme values (1% of all Monte Carlo runs) when computing the MSE and bias in Fig 6. The outliers may be caused by the variability of gradient boosting and the instability of 5-fold CV; in practice, the user may ensemble several ML methods and use 10-fold CV to mitigate such behavior. We note that the xgb.trig and xgb.1step estimators perform well, while the poly and xgb plug-in estimators do not appear to be efficient. Based on gradient boosting, our estimator and the one-step corrected estimator both appear to be efficient, but the construction of our estimator has the advantage of not requiring the analytic expression of an influence function.
Figure 6.
The relative MSE, n · MSE/ξ², and the relative absolute bias of the estimators of Ψ(θ0), where ξ² := P0IF² is the asymptotic variance to which the n · MSE of an asymptotically linear (AL) estimator should converge. poly: plug-in estimator based on polynomial sieve estimation. xgb: plug-in estimator based on gradient boosting. xgb.1step: one-step correction (debiasing) of the plug-in estimator based on gradient boosting. xgb.trig: generalized data-adaptive series with the trigonometric series composed with gradient boosting. All tuning parameters are CV-selected. The y-axis for the relative MSE is on a logarithmic scale for readability. Note that the n · MSE values for xgb.trig and xgb.1step tend to ξ², but those for poly and xgb do not.
Table 4.
Coverage probability of 95% Wald CIs based on the estimators of Ψ(θ0). poly: plug-in estimator based on polynomial sieve estimation. xgb: plug-in estimator based on gradient boosting. xgb.1step: one-step correction (debiasing) of the plug-in estimator based on gradient boosting. xgb.trig: generalized data-adaptive series with the trigonometric series composed with gradient boosting. All tuning parameters are CV-selected. The CIs are constructed based on the influence function. The coverage probabilities for xgb.trig and xgb.1step are relatively close to 95%, but those for poly and xgb are not.
| n | poly | xgb | xgb.1step | xgb.trig |
|---|---|---|---|---|
| 500 | 0.85 | 0.76 | 0.89 | 0.90 |
| 1000 | 0.68 | 0.78 | 0.93 | 0.93 |
| 2000 | 0.44 | 0.81 | 0.93 | 0.92 |
| 5000 | 0.11 | 0.80 | 0.89 | 0.87 |
| 10000 | 0.00 | 0.79 | 0.92 | 0.90 |
| 20000 | 0.00 | 0.67 | 0.91 | 0.88 |
5.3.2. Violation of Condition C5*
We also study via simulation the behavior of our estimator when Condition C5* is violated. We note that whether Condition C5* holds depends on the smoothness of g0. We choose g0 to be rougher than in Section 5.3.1, with g0 being an element of C²[−1, 1] but not of C³[−1, 1]. Consequently, cannot be approximated by our generalized data-adaptive series as well as , but its smoothness is sufficient for the existence of a deterministic K to achieve efficiency. Appendix E describes further details of this simulation setting.
Table 5 presents the performance of our estimator based on 5-fold CV. We observe that its relative MSE appears to converge to one, but it is unclear whether its root-n relative bias converges to zero at the larger sample sizes, so our method may be asymptotically biased. The coverage of 95% Wald CIs is close to the nominal level, suggesting that the bias is fairly small relative to the standard error of the estimator at the sample sizes considered. Thus, in this simulation, our generalized data-adaptive series methodology appears to be robust against violation of Condition C5*.
Table 5.
Performance of the plug-in estimator of Ψ(θ0) based on the generalized data-adaptive series when the propensity score is rough. The relative MSE is n · MSE/ξ², where ξ² := P0IF² is the asymptotic variance to which the n · MSE of an AL estimator should converge; the root-n absolute relative bias is reported in the second column. The performance appears acceptable in view of the small MSE and reasonable CI coverage.
| n | relative MSE | root-n absolute relative bias | 95% Wald CI coverage |
|---|---|---|---|
| 500 | 1.02 | 0.28 | 0.92 |
| 1000 | 1.13 | 0.26 | 0.91 |
| 2000 | 1.10 | 0.19 | 0.94 |
| 5000 | 1.03 | 0.02 | 0.93 |
| 10000 | 0.96 | 0.23 | 0.95 |
| 20000 | 0.99 | 0.24 | 0.94 |
6. Discussion
Numerous methods have been proposed to construct efficient estimators for statistical parameters under a nonparametric model, but each of them has one or more of the following undesirable limitations: (i) their construction may require specialized expertise that is not accessible to most statisticians; (ii) for any given data set, there may be little guidance, if any, on how to select a key tuning parameter; and (iii) they may require stringent smoothness conditions, especially on derivatives. In this paper, we propose two sieve-like methods that can partially overcome these difficulties.
Our first approach, namely the one based on HAL, can be further generalized to the case in which the flexible fit is an empirical risk minimizer over a function class assumed to contain the unknown function. The key Condition B2 may be modified in that case, as long as the modified condition still ensures that certain perturbations of the unknown function lie in that function class. We note that our methods may also be applied under semiparametric models.
A major direction for future work is to construct valid CIs without knowledge of the influence function of the resulting plug-in estimator. The nonparametric bootstrap is in general invalid when the overall summary is not Hadamard differentiable, especially when the method relies on CV [2, 10]; a model-based bootstrap is a possible alternative (Chapter 28 of [31]). However, in many cases only certain components of the true data-generating distribution must be estimated to obtain a plug-in estimator, while its variance may depend on other components that are not explicitly estimated, so generating valid model-based bootstrap samples is generally difficult.
Our proposed sieve-like methods may be used to construct efficient plug-in estimators for new applications in which the relevant theoretical results are difficult to derive. They may also inspire new methods to construct such estimators under weaker conditions.
Acknowledgements
This work was partially supported by the National Institutes of Health under award numbers DP2-LM013340 and R01HL137808. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Appendix A: Modification of chosen norm for evaluating the conditions: case study of mean counterfactual outcome
In this appendix, we consider a parameter that requires a modification of the chosen norm for evaluating the conditions. In particular, we discuss estimating the counterfactual mean outcome in Example 3.
Let g0 : x ↦ P0(A = 1|X = x) be the propensity score function. A natural choice of the loss function is ℓ(θ) : v ↦ a[y − θ(x)]². Indeed, learning a function with this loss function is equivalent to fitting a function within the stratum of observations that received treatment 1. Unfortunately, this loss function does not satisfy Condition A2 with the L2(P0)-norm, because P0{ℓ(θ) − ℓ(θ0)} = P0{g0 · (θ − θ0)²} cannot be well approximated by α0,ℓP0{(θ − θ0)²}/2 for any constant α0,ℓ > 0 unless g0 is constant. One way to overcome this challenge is to choose the alternative inner product 〈θ1, θ2〉g0 := P0{g0θ1θ2} and its induced norm ‖·‖g0. In this case, Condition A2 is satisfied once ‖·‖ is replaced by ‖·‖g0 in the condition statement. Under this choice, .
We may redefine the corresponding similarly as the function that satisfies
and it immediately follows that . Moreover, under a strong positivity condition, namely that g0(X) ≥ δg > 0 almost surely for some δg, which is a typical condition in the causal inference literature [31, 34], it is straightforward to show that ‖·‖g0 is equivalent to the L2(P0)-norm. Using this fact, it can be shown that all other conditions with respect to the L2(P0)-inner product are equivalent to the corresponding conditions with respect to 〈·, ·〉g0.
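For concreteness, since g0 is a conditional probability (so g0 ≤ 1) and g0 ≥ δg under strong positivity, the norm equivalence follows from the elementary bounds (a short derivation based on the definitions above):
\[
\delta_g\, P_0\{\theta^2\} \;\le\; P_0\{g_0\,\theta^2\} \;=\; \|\theta\|_{g_0}^2 \;\le\; P_0\{\theta^2\},
\]
so that \(\sqrt{\delta_g}\,\|\theta\| \le \|\theta\|_{g_0} \le \|\theta\|\) for every θ, where \(\|\cdot\|\) denotes the L2(P0)-norm; the two norms are therefore equivalent.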
Therefore, the data-adaptive series can be applied to estimation of the counterfactual mean outcome under our conditions for the L2(P0)-inner product. If we use the targeted form in Remark 2, then we need a flexible estimator of g0 and the procedure is almost identical to a TMLE [31]. If we use the generalized data-adaptive series, we would require a sufficient amount of smoothness of g0. In the latter case, the change in norm when evaluating the conditions is a purely technical device, and the estimation procedure is the same as would have been used with the L2(P0)-norm. We also note that the same argument may be used to show that in Example 4, with ℓ(θ) : v ↦ a[z − μ1(x)]² + (1 − a)[z − μ0(x)]² being the usual squared-error loss, we may choose an analogous alternative inner product and find that , as we did in Section 5.3.
Appendix B: Additional conditions
Throughout the rest of this appendix, we use C to denote a general absolute positive constant that can vary line by line.
B.1. HAL
Condition B3 (Empirical processes conditions). For any fixed ϑ ∈ Θv,M and some Δ > 0, it holds that ℓ(θ), and {r[θ − θ0] − r[θ + δ(ϑ − θ) − θ0]}/δ are càdlàg for all θ ∈ Θv,M and all δ ∈ [0, Δ]. Moreover, the following terms are all finite:
In addition, and converge to 0 in probability.
Condition B4 (Finite variance of influence function) .
B.2. Data-adaptive series
Condition C6 (Local Lipschitz continuity of ). For sufficiently large n,
for all θ ∈ Θ with ‖θ − θ0‖ ≤ n−1/4.
Condition C7 (Local Lipschitz continuity of and ). For sufficiently large n, for all θ ∈ Θ with ‖θ − θ0‖ ≤ n−1/4,
;
.
Condition C8 (Empirical process conditions). There exists some constant Δ > 0 such that
Condition C9 (Finite variance of influence function) .
B.3. Generalized data-adaptive series
Condition C6* (Local Lipschitz continuity of projected for ). For sufficiently large n, for all .
Condition C7* (Local Lipschitz continuity of and its projection for ). For sufficiently large n, for all ‖θ − θ0‖ ≤ n−1/4,
;
.
B.4. Conditions for efficiency of the plug-in estimator
Define a collection of submodels
for which: (i) is a subset of and the closure of its linear span is ; and (ii) each is a regular univariate parametric submodel that passes through P0 and has score H for δ at δ = 0. For each and δ ∈ BH, we define . In this appendix, for all small-o and big-O notation, we let δ → 0 with H fixed.
Condition E1 (Sufficiently close risk minimizer). For any given . .
Condition E2 (Quadratic behavior of loss function remainder near 0). For any given and ϑ, there exists positive δ′ = o(δ) such that .
Appendix C: Discussion of technical conditions for data-adaptive series and its generalization
C.1. Theorem 2
Condition C2 usually imposes an upper bound on the growth rate of K. To see this, we show that Condition C2 is equivalent to a term being op(n−1/4), and an upper bound of this term is controlled by K. Let be the true-risk minimizer in Θn. Under Conditions A2, C1, C3 and C6, by Lemma 5, it follows that Condition C2 is equivalent to requiring that . Note that minimizes the empirical risk in Θn, and M-estimation theory [32] can show that can be upper bounded by an empirical process term, whose upper bound is related to the complexity of Θn, namely how fast K grows with sample size. To ensure this bound is op(n−1/4), K must not grow too quickly.
Condition C3 assumes that the identity function can be well approximated by the series ϕk with the specified number of terms K in the sense. If Span{ϕ1, …, ϕK} does not contain the identity function for any K, then sufficiently many terms must be included to satisfy this condition; that is, this condition imposes a lower bound on the rate at which K should grow with n. Even if Span{ϕ1, …, ϕK} does contain the identity function for some finite K, this condition still requires that K not be too small.
Condition C4 is implied by the following condition in view of Lemma 3:
Condition C4s. .
This condition is similar to Condition C3. However, in general, we do not expect to be contained in Span{ϕ1, …, ϕK} for any K, and hence this condition generally imposes a lower bound on the rate of K. Note that Condition C4s is stronger than Condition C4, and there are interesting examples where C4 holds but C4s fails to hold. Indeed, if the initial ML fit converges to θ0 at a rate much faster than n−1/4, then C4 can be satisfied even if decays to zero in probability relatively slowly; that is, the fast convergence rate of the initial fit can compensate for the approximation error of . This is one way in which we can benefit from using flexible ML algorithms to estimate θ0: if the initial fit converges to θ0 at a fast rate, then we can expect to also have a fast convergence rate.
Conditions C2, C3 and C4 are not stringent provided sufficient smoothness of the derivatives of and a reasonable choice of series. For example, as noted in [4], when has a bounded p-th order derivative and a polynomial, trigonometric or spline series with degree at least p + 1 is used, then if K2/n → 0 (K3/n → 0 for polynomial series), the term in Condition C2 is ; the terms in Condition C3 and the sufficient Condition C4s are O(K−p/q). Therefore, we can select K to grow at a rate faster than nq/(4p) and slower than n1/2 (n1/3 for polynomial series). If p is large, then this allows for a wide range of rates for K. Typically (and hence ) is related only to the summary of interest Ψ and not to the true function θ0. For example, for the summary Ψ(θ) = P0(f ○ θ) at the beginning of Section 4.1, is variation independent of θ0. It is often the case that Ψ is smooth and so is , so p is often sufficiently large for this window to be wide.
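As a concrete numerical instance of this window (our own arithmetic, taking q = 1 and p = 2 purely for illustration), the lower bound is nq/(4p) = n1/8 and the upper bound is n1/2 for a trigonometric or spline series, so, for example, K ≍ n1/3 satisfies
\[
n^{q/(4p)} = n^{1/8} \;\ll\; K \asymp n^{1/3} \;\ll\; n^{1/2},
\]
and both the complexity requirement (Condition C2) and the approximation requirements (Conditions C3 and C4s) can then hold simultaneously.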
Condition C6 is usually easy to satisfy. Since is a linear combination of {ϕk : k ∈ {1, …, K}} and is an approximation of a highly smooth function , if the series ϕk is smooth, then we can expect to be Lipschitz uniformly over n, that is, Condition C6 to hold. For example, using a polynomial series, cubic splines or a trigonometric series implies that this condition holds.
Condition C7 imposes Lipschitz continuity conditions on and uniformly over n. The Lipschitz continuity of has been discussed above. As for , similarly to Condition C6, as long as the series ϕk being used is smooth, would be Lipschitz continuous uniformly over n.
C.2. Theorem 5
The conditions are similar to those in Theorem 2. However, Condition C4* can be more stringent than Condition C4. For the generalized data-adaptive series, the dimension of the argument of the series is larger. Hence, as noted in [4], C4* may require more smoothness of in order that can be well approximated by . Moreover, in general, we do not expect the smoothness of to be determined by Ψ alone, as it may also depend on components of P0, so the amount of smoothness available may be more limited in practice.
It is also worth noting that, similarly to Theorem 2, a sufficient condition for Condition C4* is the following:
Condition C4*s. .
Appendix D: Lemmas and technical proofs
D.1. Highly Adaptive Lasso (HAL)
Proof of Theorem 1. Under Conditions A2 and B1–B3, Lemma 1 and its corollary in [26] show that .
We show that the small perturbations of in certain directions are contained in Θv,M. Let be a path indexed by δ (0 ≤ δ < 1) that is a perturbation of . Note that for all δ, ϑδ is càdlàg by Condition B1 and we have that
by Condition B2. Hence ϑδ ∈ Θv,M. The same result holds for the path .
Combining this observation with the P0-Donsker property of Θv,M′ for any fixed M′ > 0 [9] and Conditions A1–A2 and B4, we have that all of the conditions of Theorem 1 in [24] are satisfied with all sieves being Θv,M. The desired asymptotic linearity result follows. The efficiency result is shown in Appendix D.3. □
Proof of Lemma 1. Recall that . Similarly to x(ℓ), let x(u) = inf{x : P0(X ≤ x) = 1}, where the inf and ≤ are entrywise. To avoid clumsy notation, in this proof we drop the subscript in θ0 and write θ instead. This should not introduce confusion because other functions (e.g., an estimator of θ0) are not involved in the statement or proof. Using the results reviewed in Section 3.1,
Since
we have for all x(ℓ) ≤ u ≤ x(u), so
□
Lemma 2 (CV-selected bound not much smaller than the bound of the true function’s variation norm). Suppose that Condition B1 holds, θ0 is càdlàg, ‖θ0‖v < ∞ and for any M, . Let Mn be a (possibly random) sequence such that . Then for any ϵ > 0, with probability tending to one, Mn ≥ ‖θ0‖v − ϵ. Therefore, for any fixed ϵ > 0, with probability tending to one, Mn + ϵ ≥ (‖θ0‖v − ϵ) + ϵ = ‖θ0‖v.
Proof of Lemma 2. We argue by contradiction. Suppose the claim is not true, that is, there exist ϵ, δ > 0 such that P(Mn < ‖θ0‖v − ϵ) ≥ δ for all , where is an infinite set. Let . Then for all , with probability at least δ,
which is a positive constant, since the function class does not contain θ0 and this term is a non-negligible bias. This contradicts the assumption that , and hence the desired result follows. □
Therefore, if for a known increasing function F, then with probability tending to one, Mn + ϵ + F(Mn + ϵ) is a valid bound on that can be used to obtain an efficient plug-in estimator. Moreover, if the bound is loose, that is, , and F is continuous, then there exists some ϵ > 0 such that , and hence with probability tending to one.
Note that this lemma only concerns learning a function-valued feature, not estimating Ψ(θ0). There are examples where depends on components of P0, say η0, other than θ0. However, if η0 can be learned via HAL, then Lemma 2 can be applied. Therefore, if it is known that for a known increasing function F, then we can use a bound on , obtained from the sequence Mn in a similar fashion to the above, to construct an efficient plug-in estimator .
Now consider obtaining Mn by k-fold CV from a set of candidate bounds. Then, under Conditions B1–B3, by (i) Lemma 1 and its corollary in [26] and (ii) the oracle inequality for k-fold CV in [29], the requirement of Lemma 2 on the sequence Mn holds, provided that (a) one candidate bound is no smaller than ‖θ0‖v and (b) the number of candidate bounds is fixed. Therefore, the above results apply to this case.
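To make the CV selection of Mn concrete, the sketch below (our own illustration; fit_hal and loss are hypothetical helpers standing in, respectively, for a HAL-type fit under a variation-norm bound M and for an average validation loss) selects a bound by k-fold CV over a fixed, finite set of candidates, matching requirements (a) and (b) above when the candidate grid includes a value at least ‖θ0‖v.

```python
import numpy as np
from sklearn.model_selection import KFold


def select_variation_bound_cv(x, z, candidate_bounds, fit_hal, loss, k=5, seed=0):
    """Select a variation-norm bound M_n by k-fold CV over a fixed set of candidates.

    fit_hal(x_train, z_train, M) -> fitted function with variation norm at most M
    (hypothetical HAL-type fitter); loss(fit, x_val, z_val) -> average validation
    loss of the fitted function (e.g., squared error).
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    cv_risk = np.zeros(len(candidate_bounds))
    for train_idx, val_idx in kf.split(x):
        for j, M in enumerate(candidate_bounds):
            fit = fit_hal(x[train_idx], z[train_idx], M)
            cv_risk[j] += loss(fit, x[val_idx], z[val_idx]) / k
    return candidate_bounds[int(np.argmin(cv_risk))]
```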
D.2. Data-adaptive series estimation
We first present and prove two lemmas that lead to Theorems 2 and 5.
Lemma 3 (Convergence rate of the sieve estimator). Under Conditions C1, C3 and C6, ‖πn(θ0)−θ0‖ = op(n−1/4). Under the additional Condition C2, .
Proof of Lemma 3. By the triangle inequality, . We bound these three terms separately.
Term 1: By Condition C1, .
Term 2: By the definition of the projection operator,
We bound the right-hand side by showing this term is close to up to an op(n−1/4) term. By the reverse triangle inequality and the triangle inequality,
which is op(n−1/4) by Condition C1. Therefore, by Condition C3,
Term 3: By the definition of projection and Condition C1,
Conclusion from the three bounds: ‖πn(θ0) − θ0‖ = op(n−1/4).
If, in addition, Condition C2 also holds, then . □
The same result holds for the generalized data-adaptive series under Conditions C1, C6*, C3* and C2 (if relevant). The proof is almost identical and is therefore omitted.
Lemma 4 (Approximation error to ). Under Condition C7, . Therefore, under Conditions C1–C4, .
Proof of Lemma 4. By the definition of the projection operator and the triangle inequality,
We bound the two terms on the right-hand side separately.
Term 1: By Condition C7, .
Term 2: This term can be bounded similarly as in Lemma 3. By the reverse triangle inequality and the triangle inequality,
Therefore, by the definition of the projection operator and Condition C7,
Conclusion from the two bounds:.
Under Conditions C1–C4, using Lemma 3, it follows that . □
Note that πn is a linear operator. Lemmas 3 and 4, along with the other conditions, essentially verify the assumptions of Corollary 2 in [24]. We can prove the asymptotic linearity result of Theorem 2 in a similar way, as follows.
Proof of Theorem 2. We note that
Let ϵn be an arbitrary sequence of positive real numbers that is o(n−1/2). We may replace with in the above equation. We first consider :
| (1) |
Take the difference between the above two equations. By the linearity of and πn, we have that
We next analyze the three lines on the right-hand side of the above equation separately.
Line 1: Under Condition A2,
We subtract and add in the first term. By the fact that πn is linear and , the display continues as
By the Cauchy–Schwarz inequality, the display continues as
By Lemmas 3–4 and the assumption that ϵn = o(n−1/2), the display continues as
Line 2: We subtract and add . By linearity of , Condition C8, and the fact that , we have that
Line 3: By Condition C8, this term is ϵnop(n−1/2).
Conclusion of the three lines: It holds that
Since is an empirical risk minimizer, the left-hand side is non-negative. Thus,
Similarly, by replacing with in (1), we derive that
Therefore, . By Conditions A3–A4 and Lemma 3,
The asymptotic linearity of follows. We prove the efficiency in Appendix D.3. □
The proof of Theorem 5 is almost identical.
Next, we present and prove a lemma that allows us to interpret Condition C2 as an upper bound on the rate of K.
Lemma 5. Under Conditions A2, C1, C3 (C3* resp.) and C6 (C6* resp.), .
Proof of Lemma 5. By definition of and Condition A2, we have
the right-hand side of which is op(n−1/2) by Lemma 3 (or its corresponding version under Conditions C6* and C3*). Therefore, and hence . □
We finally prove the efficiency of the data-adaptive series estimator with K selected by CV.
Proof of Theorem 4. By Lemma 3 and Condition A2, for the assumed deterministic K, . By the oracle inequality for CV in [29], . By Condition A2, and hence . So, with probability tending to one,
which is op(n−1/4) by Condition C1. Hence,
which is op(n−1/4) by Condition C1.
This bounds the approximation error for , a result that is similar to Lemma 4 combined with Conditions C1 and C4*s. Similarly to Theorem 2, along with other conditions, the assumptions in Corollary 2 in [24] are essentially satisfied and hence an almost identical argument shows that is an asymptotically linear estimator of Ψ(θ0). We prove the efficiency in Appendix D.3. □
D.3. Efficiency
Proof of efficiency of the proposed estimators. It suffices to show that the influence function of our proposed estimators is the canonical gradient under a nonparametric model. Let be fixed. In the rest of this proof, for all small-o and big-O notation, we let δ → 0. The proof is similar to the proof of asymptotic linearity in [24], except that the estimator of θ0 and the empirical distribution Pn are replaced by θH,δ and PH,δ, respectively.
Let δ′ satisfy Condition E2. We note that
We also note that if |δ| is sufficiently small. Then, similarly, by replacing θH,δ with in the above equation, we have that
| (2) |
Take the difference between the above two equations. By the linearity of , we have that
Since the left-hand side of the above display is nonnegative, by Condition E1, we have that
Similarly, by replacing with in (2), we show that . Therefore, and
Consequently, and hence the canonical gradient of Ψ under a nonparametric model is . Since the influence functions of our asymptotically linear estimators are equal to this canonical gradient, our proposed estimators are efficient under a nonparametric model. □
Appendix E: Simulation setting details
In all simulations, since is the conditional mean function, the loss function was chosen to be the square loss ℓ(θ) : v ↦ (z − θ(x))².
E.1. HAL
In the simulation, we generate data from the distribution defined by
The sample sizes considered are 500, 1000, 2000, 5000 and 10000. For each scenario, we run 1000 replicates. We chose M.gcv+ to be 3.1 times M.cv.
E.2. Data-adaptive series
E.2.1. Demonstration of Theorem 4
In the simulation, we generate data from the distribution defined by X ~ Unif(−1, 1), Z|X = x ~ N(θ0(x), 0.25²) where
When using the trigonometric series, we first shift and scale the range of the initial fit to [−1/2, 1/2] and then use the basis for the interval [−1, 1] (i.e., sin(jπz), cos(jπz)) in sieve estimation, in order to avoid the poor behavior of the trigonometric series near the boundary. We consider sample sizes 500, 1000, 2000, 5000, 10000 and 20000. For each sample size, we run 1000 simulations.
E.2.2. Violation of Condition C5
In the simulation, we generate data from the distribution defined by X ~ Unif(−1, 1), Z|X = x ~ N(θ0(x), 1) where θ0 : x ↦ cos(10x). The estimand is Ψ(θ0) = P0(f ○ θ0) where
We consider sample sizes 500, 1000, 2000, 5000, 10000 and 20000; for each sample size, we run 1000 simulations. Our goal is to explore the behavior of the plug-in estimator when f, instead of θ0, is rough, so we use kernel regression [17] to estimate θ0 for convenience.
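For reference, a minimal Nadaraya–Watson kernel regression [17] with a Gaussian kernel can be sketched as follows; this is our own illustrative implementation, and the bandwidth shown is an arbitrary value rather than the one used in our simulations.

```python
import numpy as np


def nadaraya_watson(x_train, z_train, x_eval, bandwidth=0.1):
    """Nadaraya-Watson kernel regression estimate with a Gaussian kernel.

    x_train, z_train: 1-d arrays of covariates and outcomes; x_eval: points at
    which to evaluate the fitted regression function.
    """
    d = (x_eval[:, None] - x_train[None, :]) / bandwidth   # pairwise scaled distances
    w = np.exp(-0.5 * d ** 2)                              # Gaussian kernel weights
    return (w @ z_train) / w.sum(axis=1)                   # locally weighted average
```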
E.3. Generalized data-adaptive series
E.3.1. Demonstration of Theorem 6
In the simulation, we generate data from the distribution defined by X ~ Unif(−1, 1), A|X = x ~ Bern(expit(−x)), Y|A = a, X = x ~ N(μ0,a(x), 0.25²) where
The series is the tensor product [4] of the univariate trigonometric series in E.2.1. The sample sizes are the same as in E.2.1.
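To make the tensor-product construction concrete, the following sketch (our own illustration, not the exact implementation) forms the tensor-product series from two univariate basis matrices by taking all pairwise products of their columns.

```python
import numpy as np


def tensor_product_basis(B1, B2):
    """Tensor-product series: all pairwise products of the columns of two
    univariate basis matrices of shapes (n, K1) and (n, K2), giving (n, K1*K2)."""
    n, K1 = B1.shape
    _, K2 = B2.shape
    return (B1[:, :, None] * B2[:, None, :]).reshape(n, K1 * K2)
```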
E.3.2. Violation of Condition C5*
In the simulation, we generate data from the distribution defined by X ~ Unif(−1, 1), A|X = x ~ Bern(g0(x)), Y|A = a, X = x ~ N(μ0,a(x), 0.25²) where μ0,a : x ↦ exp(−x² + 0.8ax + 0.5a) (a ∈ {0, 1}) and
We consider sample sizes 500, 1000, 2000, 5000, 10000 and 20000; for each sample size, we run 1000 simulations. Our goal is to explore the behavior of the plug-in estimator when g0, instead of θ0, is rough, so we use kernel regression [17] to estimate θ0 for convenience.
References
- [1]. Benkeser D and van der Laan M (2016). The Highly Adaptive Lasso Estimator. In Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on, 689–696. IEEE.
- [2]. Bickel P, Götze F and van Zwet W (1997). Resampling fewer than n observations: Gains, losses, and remedies for losses. Statistica Sinica 7.
- [3]. Bickel PJ and Ritov Y (2003). Nonparametric estimators which can be “plugged-in”. Annals of Statistics 31 1033–1053.
- [4]. Chen X (2007). Chapter 76: Large Sample Sieve Estimation of Semi-Nonparametric Models. Handbook of Econometrics 6 5549–5632.
- [5]. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C and Newey W (2017). Double/debiased/Neyman machine learning of treatment effects. American Economic Review 107 261–265.
- [6]. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W and Robins J (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21 C1–C68.
- [7]. Friedman JH (2001). Greedy Function Approximation: A Gradient Boosting Machine. Technical Report No. 5.
- [8]. Friedman JH (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis 38 367–378.
- [9]. Gill RD, van der Laan MJ and Wellner JA (1993). Inefficient estimators of the bivariate survival function for three models. Rijksuniversiteit Utrecht, Mathematisch Instituut.
- [10]. Hall P (2013). The Bootstrap and Edgeworth Expansion. Springer Science & Business Media.
- [11]. Härdle W and Stoker TM (1989). Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association 84 986–995.
- [12]. Levy J, van der Laan M, Hubbard A and Pirracchio R (2018). A Fundamental Measure of Treatment Effect Heterogeneity.
- [13]. Mallat S (2009). A Wavelet Tour of Signal Processing. Elsevier.
- [14]. Marron JS (1994). Visual understanding of higher-order kernels. Journal of Computational and Graphical Statistics 3 447–458.
- [15]. Mason L, Baxter J, Bartlett P and Frean M (1999). Boosting Algorithms as Gradient Descent in Function Space. Technical report.
- [16]. Mason L, Baxter J, Bartlett PL and Frean M (2000). Boosting Algorithms as Gradient Descent. Technical report.
- [17]. Nadaraya EA (1964). On estimating regression. Theory of Probability & Its Applications 9 141–142.
- [18]. Newey W, Hsieh F and Robins J (1998). Undersmoothing and Bias Corrected Functional Estimation. Working paper.
- [19]. Newey WK (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79 147–168.
- [20]. Newey WK, Hsieh F and Robins JM (2004). Twicing Kernels and a Small Bias Property of Semiparametric Estimators. Econometrica 72 947–962.
- [21]. Owen AB (2005). Multidimensional variation for quasi-Monte Carlo. In Contemporary Multivariate Analysis and Design of Experiments: In Celebration of Professor Kai-Tai Fang’s 65th Birthday, 49–74. World Scientific.
- [22]. Pfanzagl J (1982). Contributions to a General Asymptotic Statistical Theory. Lecture Notes in Statistics 13. Springer, New York, NY.
- [23]. Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 688–701.
- [24]. Shen X (1997). On methods of sieves and penalization. Annals of Statistics 25 2555–2591.
- [25]. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 267–288.
- [26]. van der Laan M (2017). A Generally Efficient Targeted Minimum Loss Based Estimator based on the Highly Adaptive Lasso. International Journal of Biostatistics 13.
- [27]. van der Laan M and Rubin D (2006). Targeted Maximum Likelihood Learning. U.C. Berkeley Division of Biostatistics Working Paper Series.
- [28]. van der Laan MJ, Benkeser D and Cai W (2019). Efficient Estimation of Pathwise Differentiable Target Parameters with the Undersmoothed Highly Adaptive Lasso. arXiv preprint arXiv:1908.05607v1.
- [29]. van der Laan MJ and Dudoit S (2003). Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. U.C. Berkeley Division of Biostatistics Working Paper Series.
- [30]. van der Laan MJ and Robins JM (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics. Springer, New York, NY.
- [31]. van der Laan MJ and Rose S (2018). Targeted Learning in Data Science. Springer.
- [32]. van der Vaart A and Wellner J (2000). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.
- [33]. Williamson B, Gilbert P, Simon N and Carone M (2017). Nonparametric variable importance assessment using machine learning techniques. UW Biostatistics Working Paper Series.
- [34]. Yang S and Ding P (2018). Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika 105 487–493.
- [35]. Zhang CH (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38 894–942.
