Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2018 Aug 3.
Published in final edited form as: J Am Stat Assoc. 2017 Feb 28;113(522):780–788. doi: 10.1080/01621459.2017.1285777

Parametric-rate inference for one-sided differentiable parameters

Alexander R Luedtke 1,*, Mark J van der Laan 1,
PMCID: PMC6075853  NIHMSID: NIHMS979164  PMID: 30078921

Abstract

Suppose one has a collection of parameters indexed by a (possibly infinite dimensional) set. Given data generated from some distribution, the objective is to estimate the maximal parameter in this collection evaluated at the distribution that generated the data. This estimation problem is typically non-regular when the maximizing parameter is non-unique, and as a result standard asymptotic techniques generally fail in this case. We present a technique for developing parametric-rate confidence intervals for the quantity of interest in these non-regular settings. We show that our estimator is asymptotically efficient when the maximizing parameter is unique so that regular estimation is possible. We apply our technique to a recent example from the literature in which one wishes to report the maximal absolute correlation between a prespecified outcome and one of p predictors. The simplicity of our technique enables an analysis of the previously open case where p grows with sample size. Specifically, we only require that log p grows slower than n, where n is the sample size. We show that, unlike earlier approaches, our method scales to massive data sets: the point estimate and confidence intervals can be constructed in O(np) time.

Keywords: stabilized one-step estimator, non-regular inference, variable screening

1 Introduction

Many semiparametric and nonparametric estimation problems yield estimators which achieve a parametric rate of convergence. These estimators are often asymptotically linear, in that they can be written as an empirical mean of an influence function applied to the data. Valid choices of the influence function can be derived as gradients for a functional derivative of the parameter of interest. Applying the central limit theorem then immediately yields Wald-type confidence intervals which achieve the desired parametric rate. Such problems have been studied in depth over the past several decades [Pfanzagl, 1990, van der Vaart, 1991, Bickel et al., 1993, van der Laan and Robins, 2003].

While remarkably general, these approaches rely on the key condition that the parameter of interest is sufficiently differentiable at the data generating distribution for such a gradient to exist. Statisticians are increasingly encountering problems for which parametric-rate estimation is theoretically possible but the parameter is insufficiently differentiable to yield a standard first-order expansion demanded by older techniques. For example, suppose we observe baseline covariates, a binary treatment, and an outcome occuring after treatment. We wish to learn the mean outcome under the optimal individualized treatment strategy, i.e. the treatment strategy which makes treatment decisions which are allowed to use baseline covariate information to make treatment decisions Chakraborty and Moodie [2013]. As another example, suppose we observe a vector of covariates (X1, …, Xp) and an outcome Y. We wish give a confidence interval the maximal absolute correlation between a covariate Xk and Y. The lower bound for this quantity is of particular interest since this will suffice for a variable screening procedure. Alternatively, we may only with to test the null hypothesis that the maximal absolute correlation is zero. McKeague and Qian [2015] provide a test of this null hypothesis using an adaptive resampling test (ART).

These problems belong to a larger class of problems in which one observes O1, …, On drawn independently from a P0 in some (possibly nonparametric) statistical model ℳ and wishes to estimate

Ψn(P)maxdDnΨd(P), (1)

at P = P0, where Dn is an index set that may rely on sample size and each Ψd : ℳ → ℝ is a sufficiently differentiable parameter to permit parametric-rate estimation using classical methods such as those presented in Bickel et al. [1993]. When there is no unique maximizer dDn of Ψd(P), then the inference problem is typically non-regular, in the sense that the parameter PmaxdDnΨd(P) is not sufficiently differentiable to allow the use of standard influence function based techniques for obtaining inference. Hirano and Porter [2012] showed that regular and asymptotically linear estimators fail to exist for such one-sided pathwise differentiable parameters. In univariate calculus, functions such as f(x) = max{x, 0} are one-sided differentiable at zero in that the left and right limits of [f (x + ε) − f (x)]/ε are well-defined but disagree. The same holds for the Ψn evaluated at a distribution P0, but now the one-sided differentiability is caused by the subset of Dn containing the indices which maximize the expression on the right in (1). A small fluctuation in P0 can greatly reduce the subset of maximizing indices, leading to different derivatives depending on the fluctuation taken.

In this work, we present a method which, loosely, splits the sample in such a way that the estimated index in Dn which maximizes Ψd(P0) is conditioned on so that this estimated index need not have a limit. We do this iteratively to ensure that our estimator gets the full benefit of the sample size n. When the parameter is fixed with sample size and the d maximizing Ψd(P0) is fixed, we show that our estimator is asymptotically efficient, and therefore also regular. When the maximizing index is not unique, our estimator will not typically be regular. Thus our estimator adapts to the non-regularity of the estimation problem.

Our estimator is inspired by the online estimator for pathwise differentiable parameters presented in van der Laan and Lendle [2014] and a subsequent modification of this estimator in Luedtke and van der Laan [2016] to deal with the non-regularity when estimating the mean outcome under an optimal treatment rule. Such estimators are designed to be efficient in both computational complexity and storage requirements. We show that the estimator that we present in this work inherits many of these computational efficiency properties. We apply our technique to estimate the maximal absolute correlation considered in McKeague and Qian [2015]. In this problem, we show that our estimator runs efficiently in both dimension and sample size, with a runtime of O(np). In practice, this means that the lead author can implement our estimator using only R code and screen p = 100 000 variables using n = 1 000 samples on a single core of his laptop in under a minute. Thus our estimator seems to have both the statistically efficiency that has been demanded of estimators for generations and the computational efficiency that is becoming increasingly important in this new big data era. While the method of McKeague and Qian [2015] can also be implemented in O(np) time when the number of (double) bootstrap samples remains constant, the implicit constant in this procedure implies a much longer runtime.

2 Toy Example

We first present a toy example that we will use to facilitate the presentation of our estimator. While this toy example is simple enough that one can fairly easily come up with alternative estimation strategies, we believe it provides a useful starting point for presenting our general estimation scheme. Suppose we observe an i.i.d. sample {Oj (Oj,1, Oj,2) : j = 1, …, n} of ℝ2-valued observations, drawn from some distribution P0 with ℝ2-valued mean (Ψ1(P0),Ψ2(P0))EP0[O]. For general P, similarly define (Ψ1(P), Ψ2(P)) ≡ EP[O]. Our objective is to estimate Ψ(P0)maxd{1,2}Ψd(P0). If Ψ1(P0) = Ψ2(P0), then this parameter is one-sided but not two-sided pathwise differentiable at P0 and no regular and asymptotically linear estimator exists [Hirano and Porter, 2012].

To give a reader a sense of the challenges faced by the most intuitive estimation strategy, consider the plug-in estimator that estimates Ψ1(P0) and Ψ2(P0) using the empirical means of the observations, which we denote μ^1 and μ^2, and then the estimates ψ^n=max{μ^1,μ^2}. We have that

n1/2[ψnΨ(P0)]=n1/2[μ^2Ψ2(P0)]+I{n1/2[μ^1μ^2]0}n1/2[μ^1μ^2Ψ1(P0)+Ψ2(P0)].

By the central limit theorem, n1/2[μ^dΨd(P0):d=1,2] converges to a multivariate normal Z = (Z1, Z2) with estimable covariance matrix. If Ψ1(P0) > Ψ2(P0) or Ψ2(P0) > Ψ1(P0), then the indicator above converges in probability to one or zero, respectively, and the above converges in distribution to Z1 or Z2, respectively. Both of these normal limits can be consistently estimated from the data, and if one knows that Ψ1(P0) ≠ Ψ2(P0) then the correct index for the limit is consistently estimated by I{μ^1>μ^2}. Often one may not be willing to assume that Ψ1(P0) ≠ Ψ(P0). To see the challenge that arises, note that if Ψ1(P0) = Ψ2(P0), then the right-hand side above converges to max{Z1, Z2}, which is a non-normal limit. This gives some intuition on the non-regularity of Ψ(P0) when Ψ1(P0) = Ψ2(P0): an arbitrarily small shift in Ψ1(P0) or Ψ2(P0) dramatically changes the limiting behavior of the estimator. Furthermore, it is not in general clear which limiting result one should use to approximate the distribution of ψn if the Ψ1(P0) = Ψ2(P0)+ε for some small ε > 0, so that asymptotically the limit of the estimator is Z1 but in practice max{Z1, Z2} better approximates the variability of the estimator.

To avoid this problem, we develop an estimator that naturally adapts to the (non-)regularity of the problem. For each 2 ≤ j ≤ n, let (μj,1, μj,2) represent the empirical mean of O1, …, Oj and dj ≡ argmaxdμj,d. Our estimator takes the form ψnj=2n1wjOj+1,dj for positive convex weights w2, …, wn−1 that we now define. To expedite the presentation of this toy example, suppose that we know the variance of the first and second components of O ~ P0. Denote these variances by 12 and 22, where we assume that Σ1, Σ2 ∊ (0, ∞). Let ¯n denote the harmonic mean of d2,,dn1, i.e. ¯n(1n2j=2n1djn1)1. Our convex weights are given by wj=¯ndj1.

We now aim to understand the variability of our estimator. Note that

ψnΨ(P0)=j=2n1wj[Oj+1,djΨdj(P0)]+j=2n1wj[Ψdj(P0)Ψ(P0)].

Our analysis will use that the weights wj are finite because Σ1, Σ2 ∊ (0, ∞). We first consider the second term on the right. If Ψ1(P0) = Ψ2(P0), then this term is exactly zero. Suppose Ψ1(P0) ≠ Ψ2(P0). Noting that μj,d Ψd(P0) almost surely as j → ∞, it then follows that, with probability 1, dj = argmaxd Ψd(P0) for all j large enough. It readily follows that n1/2 times the second term on the right converges to zero almost surely, and therefore also in probability. Multiplying this term by the random but finite quantity ¯n1 does not change this convergence to zero. Thus we have shown that

n1/2¯n1[ψnΨ(P0)]=n1/2¯n1j=2n1wj[Oj+1,djΨdj(P0)]+oP(1).

Finally, we note that the first term above is a martingale sum, and that each term in this sum has variance 1 thanks to the choice of weights. A standard martingale central limit theorem [e.g., Gaenssler et al., 1978] then shows that this term converges to a standard normal random variable, and standard arguments show that the interval [ψn±1.96¯nn1/2] contains Ψ(P0) with probability approaching 0.95.

3 Estimator

We will now present our technique for a general estimation problem. We now introduce the notion of pathwise differentiability, since this provides the key object needed to construct our estimator.

3.1 Pathwise differentiability

We assume that each parameter Ψd, dDn for any n, is pathwise differentiable for all distributions in our model [see, e.g., Pfanzagl, 1990, Bickel et al., 1993]. For each P ∊ ℳ, we let Dd(P) denote the canonical gradient of Ψd at P. By definition Dd(P)(O) is mean zero with finite variance under sampling from P. Typically pathwise differentiability implies that Ψd satisfies the following linear expansion for any P ∊ ℳ and dDn:

Ψd(P)Ψd(P0)=Dd(P)(o)dP0(o)+Remnd(P), (2)

where we omit the dependence of Remnd(P) on P0 in the notation and indicate its possible dependence on sample size with the subscript n. Above Remnd(P) is a second-order remainder term that is small whenever P is close to P0. We consider this condition more closely in our example, but for non-sample size dependent parameters this term can typically be made to be OP0(1/n) in a parametric model and often can be made to be oP0(1/n) in a nonparametric model. In the toy example from Section 2, the two parameters Ψd(P0) are linear, and so Dd(P)(o) = odEP[Od] and Remnd(P)=0 for all d, P. For a more thorough presentation, see Pfanzagl [1990] or Bickel et al. [1993].

Knowing the canonical gradient of a parameter enables one to implement a one-step estimator [see Section 5.7 of van der Vaart, 1998]. To ease discussion, fix d. Suppose one has an initial estimate P^n of the components of P0 needed to evaluate Ψd and Dd. Then a one-step estimate of Ψd(P0) is given by ψnd=Ψd(P^n)+1ni=1nDd(P^n)(Oi) . Under empirical process and consistency conditions on P^n, one can show that n1/2[ψndΨd(P^n)] converges in distribution to a normal random variable with estimable variance. In the next section, we present a variant of this estimator that allows for the selection of the optimizing d, even when the optimal index is non-unique.

3.2 Estimator and confidence interval

We now present a stabilized one-step estimator for problems of the type found in (1) when the required differentiability condition on Ψd holds.

Let {n} be some sequence such that nn → ∞. One possible choice is n = 0 for all n. For each j = ℓn, …, n − 1, let dnj represent an estimate of a maximizer of (1) obtained using observations (Oi : i = 1, …, j), P^nj be an estimate of P0 obtained using observations (Oi : i = 1, …, j), and P^nj D^nj equal Dd (P) evaluated at P=P^nj and d = dnj. For nonnegative weights wln,,wn1 that we will define shortly with j=lnn1wj=nln, our stabilized one-step estimate takes the form

ψn1nlnj=lnn1wj[Ψdnj(P^nj)+D^nj(Oj+1)].

Our proposed 95% confidence interval has the form

[LBn,UBn][ψn±1.96σ¯nnln],

where we will define σ¯n momentarily and one can replace 1.96 by the desired quantile of the normal distribution to modify the confidence level.

We now define the weights. Let σ¯nj2 represent an estimate of the variance of D^nj(O), O ~ P0, conditional on observations O1, …, Oj. This estimate should only rely on those j observations. Often we can let

σ^nj21ji=1j[D^nj(Oi)1ji=1jD^nj(Oi)]2,j=ln,,n1.

The standard deviation type variable in the confidence interval definition is given by σ¯n(1nlnj=lnn1σ^nj1)1, and the weights are given by wjσ¯nσ^nj1, where we have omitted the possible dependence of the weights on sample size in the notation.

Our estimator is similar to the online one-step estimator developed in van der Laan and Lendle [2014] for streaming data, but it weights each term proportionally to the estimated inverse standard deviation of D^nj(O) when O ~ P0. Our confidence interval takes a form similar to a Wald-type confidence interval, but replaces the typical standard deviation with σ¯n and has width on the order of 1/nln rather than 1/n Note of course that n = o (n) implies that 1/nln1/n converges to zero.

3.3 Validity of confidence interval

We now prove the validity of our confidence interval. Let σnj2VarP0(D^nj(O)|O1,,Oj). The validity of the lower bound of the confidence interval relies on the following conditions:

  • C1)

    There exists some M < ∞ such that 1nlnj=lnn1P0(|D^nj(O)|σ^nj<M|O0,,Oj1)1 in probability as n → ∞.

  • C2)

    1nlnj=lnn1|σnj2σ^nj21|0 in probability as n → ∞.

  • C3)

    1nlnj=lnn1σ^nj1Rem^nj0 in probability as n → ∞., where Rem^njRemdnj(P^nj).

    The validity of the upper bound requires the following additional condition:

  • C4)

    1nlnj=lnn1σ^nj1[Ψdnj(P0)Ψn(P0)] converges to zero in probability as n → ∞.

We now present our main result.

Theorem 1

(Validity of confidence interval). If C1), C2), and C3) hold, then

liminfnPr(Ψn(P0)LBn)1α/2.

If C4) also holds, then

limnPr(LBnΨn(P0)UBn)=1α.
Proof

The definition ψn combined with (2) yield that

nlnσ¯n1[ψnΨn(P0)]=1nlnj=lnn1σ^n1(D^nj(Oj+1)EP0[D^nj(O)|O1,,Oj])+1nlnj=lnn1σ^n1[Ψdnj(P0)Ψn(P0)+Rem^nj]. (3)

The second line converges to zero in probability by C3) and C4). By C1), C2), and the martingale central limit theorem for triangular arrays in Gaenssler et al. [1978], (3) converges in distribution to a standard normal random variable. A standard Wald-type confidence interval construction argument shows that the confidence interval has coverage approaching 1 − α under C1) through C4).

Now suppose C4) does not hold. By (1), j=lnn1σ^nj1[Ψdnj(P0)Ψn(P0)]0. The same argument readily shows the validity of the lower bound under only C1), C2), and C3). □

The conditions of the theorem are discussed in Appendix A.1. Appendix A.2 considers the asymptotic efficiency of our estimator when the parameter in (1) does not rely on sample size. High level conditions are provided, and we then argue that these conditions are plausible when the maximizing index in (1) is unique. Appendix A.3 discusses computationally efficient implementations of our general estimator.

4 Maximal correlation example

4.1 Problem formulation

We now present the running example of this work, namely the maximal correlation estimation problem considered by McKeague and Qian [2015]. The observed data structure is O = (X, Y), where X = (Xk : k = 1, …) is a [−1, 1] vector of predictors and Y is an outcome in [−1, 1]. For each n, we let Kn represent a subset of these predictors of size p, where throughout we assume that

βn2logpn0asn. (4)

For readability, we omit the dependence of p on n in the notation. Under a distribution P, the maximal absolute correlation of a predictor with Y is given by

Ψn(P)maxkKn|CorrP(Xk,Y)|, (5)

where CorrP(Xk, Y) is the correlation of Xk and Y under P. We wish to develop confidence intervals for Ψn(P0). When a test of H0 : Ψn(P0) = 0 against the complementary alternative is of interest, we also wish to establish the behavior of our test against local alternatives as was done in McKeague and Qian [2015].

In contrast to McKeague and Qian [2015], the procedure that we present in this work:

  1. is proven to work when p grows with sample size at any rate satisfying (4);

  2. yields confidence intervals for the maximal correlation rather than just a test of the null hypothesis that it is equal to zero;

  3. allows a non-null maximizer in (5) to be non-unique.

While McKeague and Qian argued that 3) is unlikely in practice, having two non-null maximizers be approximately equal may still have finite sample implications for their test in some settings.

We now show that this problem fits in our framework. To satisfy the pathwise differentiability condition, we let and, for each d=(k,m)Dn,

Ψd(P)mCorrP(Xk,Y).

Note that Ψn(P) now takes the form in (1), where we note that the use of m in the definition of Ψd serves to ensure that Ψn(P0) represents the correlation with the maximal absolute value.

4.2 Differentiability condition

Canonical gradients

For each k, let sP2(Xk)VarP(Xk), and likewise for sP2(Y). For ease of notation we let s02(Xk)sP02(Xk), and likewise for s02(Y) and Corr0(Xk, Y). An application of the delta method shows that Ψd has canonical gradient Dd(P)(o) given by

m×((xkEP[Xk])(yEP[Y])sP(Xk)sP(Y)12CorrP(Xk,Y)[(xkEP[Xk])2sP2(Xk)+(yEP[Y])2sP2(Y)]).

In order to ensure that Dd(P0) is uniformly bounded for all d, we assume throughout that, for some δ ∊ (0, 1],

min{s0(Y),s0(X1),s0(X2)}>δ.

Second-order remainder

Fix d=(k,m)Dn and P ∈ ℳ. Let δ(0,1] be some constant such that both sP(Xk) and sP(Y) are larger than δ. Lemma A.3 in Appendix B.1 proves that

|Remd(P)|δ1(|sP(Xk)sP(Y)s0(Xk)s0(Y)||CorrP(Xk,Y)|+(EP[Xk]EP0[Xk])2+(EP[Y]EP0[Y])2)+s02(Y)sP2(Y)[sP(Xk)s0(Xk)]2+s02(Xk)sP2(Xk)[sP(Y)s0(Y)]2). (6)

The first term above is small if sP(Xk), sP(Y), and CorrP(Xk, Y) are close to s0(Xk), s0(Y), and Corr0(Xk, Y). The middle terms are small if EP[Xk] and EP[Y] are close to EP0[Xk] and EP0[Xk]. The final terms are small if sP(Xk) and sP(Y) are close to s0(Xk) and s0(Y).

Variance of canonical gradients

For any given d, there is no elegant (and informative) expression for VarP0[Dd(Pj)(O)]. Nonetheless, we show in Lemma A.6 of Appendix B.1 that our estimates σ^nj2, taken as the sample variance of Ddnj(Pj)(O) for an index estimate dnj to be defined in the next subsection, concentrate tightly about σnj2 with high probability when the sample size is large enough. Thus, in practice, one can actually check if σnj2 is small by looking at σ^nj2. If P0 is normal, then this variance is equal to [1CorrP0(Xk,Y)2]2, and so is only zero if CorrP0(Xk,Y)=1. Though such an elegant expression does not exist for the variance of Dd(P0)(O) for general distributions, one can still show in general that the variance of Dd(P0) is equal to zero only if CorrP0(Xk,Y)=1. Here we make the slightly stronger assumption that

infn2min(k,m)Kn×{1,1}VarP0(Dd(P0)(O))γ>0. (7)

4.3 Our estimator

We will use the estimator presented in Section 3 to estimate Ψn(P0). At each index j ≥ ℓn we use the empirical distribution Pj of the observations O1, …, Oj to estimate P0. We let our optimal index estimate dnj ≡ (knj, mnj), where knjargmaxkKnCorrPj(Xk,Y) and mnjsgn[CorrPj(Xknj,Y)]. We estimate σnj2 with the variance of D^nj(O) under Pj.

In Appendix B.1, we detail conditions on n which ensure that n does not grow too slowly or quickly. For any ε ∊ (0, 2), one possible choice of n that satisfies these conditions is

ln=max{(logmax{n,p})1+ε,nexp(βn2+ε)}. (8)

We show that this choice of n ensures C1), C2), and C3) in Appendix B.1. By Theorem 1 this establishes the validity of the lower bound of our confidence interval. We can also show that this lower bound is tight up to a term of the order n−1/4βn.

Theorem 2

(Tightness of the lower bound). For any sequence tn → ∞, Ψn(P0) < LBn + tnn−1/4βn with probability approaching 1.

We note that the choice of n in (8) is only needed if the set of indices Kn changes with sample size. For fixed Kn, one could take n fixed, e.g. n = 2, and still have a valid lower bound provided the estimates of σnj are truncated from below at some δ>0 (see Lemma A.1). While choosing n according to (8) is still advisible since this is what will enable us to study the behavior of a hypothesis testing procedure under local alternatives, this invariance to n should at least reassure the user that most choices of n will perform reasonably well. In Luedtke and van der Laan [2016], we evaluated the stabilized one-step estimator on a variety of choices of n and found little sensitivity to this tuning parameter. Nonetheless, we consider the development of a data adaptive selection procedure for choosing an n satisfying C1), C2), and C3) an important area for future work. In parallel to how McKeague and Qian [2015] used the bootstrap to select their tuning parameter, one might consider using the bootstrap to select n, though it remains to determine an appropriate criterion for selecting n. Because our n-specific lower bound is defined using a normal limiting result rather than the bootstrap, such a selection procedure would avoid the use of a computationally burdensome double bootstrap.

We now consider the validity of the upper bound of our confidence interval, which holds under C4). This condition is trivially valid if Ψn(P0) = 0 for all n. Condition C4) is also valid under the following margin condition:

MC) For some sequence tn → ∞, there exists a sequence of non-empty subsets KnKn such that, for all n,

supkKn|CorrP0(Xk,Y)|infkKn|CorrP0(Xk,Y)|=o(n1/2),infkKn|CorrP0(Xk,Y)|supkKn\Kn|CorrP0(Xk,Y)|+tnn1/4βn.

If Kn=Kn, then the supremum over Kn\Kn is taken to be zero.

Theorem 3

(Validity of the upper bound). If MC) or Ψn(P0) = 0 for all n, then C4) holds so that LBn ≤ Ψn(P0) ≤ UBn with probability approaching 1 − α.

We outline the techniques used to prove these two results at the end of this subsection. Complete proofs are given in Appendix B.1.

Suppose we wish to test H0 : Ψn(P0) = 0 against H1 : Ψn(P0) > 0. Consider the test that rejects H0 if LBn > 0. We wish to explore the behavior of this test under local alternatives where Ψn(P0) converges to zero slower than n−1/4βn. Theorem 2 shows that this test has power converging to one under such local alternatives. Furthermore, as the lower bound is valid in general, this test has type I error of at most α/2 under the null. This is indeed an exciting result as it enables the study of local alternatives even when dimension grows quickly with sample size. If dimension does not grow with sample size, this shows that we can detect against any alternatives converging to zero slower than n1/2logn. We would not be surprised if the logn is unnecessary, but rather that it is simply a result of our proof techniques which give high probability bounds on the concentration of our correlation estimates at each sample size. McKeague and Qian [2015] showed that their method is consistent against a class of alternatives converging to zero slower than n−1/2 provided the optimal index is unique. Our result does not rely on this uniqueness condition. We emphasize that we only used MC) to establish the validity of the upper bound of our confidence interval. Our lower bound, and therefore our ability to reject the null of uniformly zero correlation, is valid even without this margin condition.

Theorem 3 shows that the upper bound of our confidence interval is also valid under a reasonable margin condition. The margin condition states that there may be many non-null approximate maximizers provided their absolute correlations are well-separated from the absolute correlations of the other predictors with Y. By “approximate” we mean that their absolute correlations all fall within o(n−1/2) of one another. If Kn does not depend on sample size, then this theorem shows that our two-sided confidence interval is always valid.

Sketch of proofs of Theorems 2 and 3

Our proofs of both of these theorems rely on high-probability bounds of the absolute differences between our estimates of sPj2(Xk), sPj2(Y), CorrPj(Xk,Y), and σ^nj and their population counterparts, uniformly over kKn and j. We show that, with probability at most 1 − 1/n, all of these absolute differences are upper bounded by constants (with explicit dependence on γ and δ) times j−1/2 log max{n, p}.

Condition C1) follows once we show that, with high probability, sPj2(Xk) and sPj2(Y) are bounded below by δ/2 and σ^nj2. is bounded below by γ/2 uniformly over jn for n large enough. Condition C2) and C3) are easy consequences of our concentration results. The concentration results also yield that

1nlnj=lnn1σ^nj1[Ψdnj(P0)Ψn(P0)]=OP0(n1/4βn),

which then quickly yields Theorem 2 thanks to the expression in (3).

Now suppose MC) holds. By our concentration inequalities, we select a knjKn for each jCtn1n with high probability, where C is a constant. We also correctly specify mnj to be the sign of Corr0(Xknj,Y). Because all of the absolute correlations in Kn are small, the difference between Ψdnj(P0) for dnj = (knj, mnj) and Ψn(P0) is very small. If ln<Ctn1, then we can apply our concentration inequalities to establish that these first few values of j for which j<Ctn1 are small enough so that C4) still holds, yielding Theorem 3. □

In Appendix B.2, we show that our estimator runs in O(np) time. We show that the estimate can be computed using O(p) storage when the observations O1, …, On arrive in a data stream. This result is closely related to the fact that, for a ℝp-valued sequence {ti}, the sum Sji=1jti at j = n can be computed in time O(np) using storage O(p). In particular, one can use the recursion relation Sj = tj + Sj1, thereby only storing tj and Sj1 when computing Sj. Our estimate can also be computed in O(np) time and O(n) storage when the vectors (Xjr : j = 1, …, n) ℝn arrive in a stream for r = 1, 2, …, p, where Xjr is the observation of Xr for individual j. We do not prove the O(n) storage result in the appendix due to space constraints, though the algorithm is closely related to that given in Appendix B.2.

5 Simulation study

We now consider the power and scalability of our method using the simulations similar to those described in McKeague and Qian [2015]. Let X ~ MVN(0, Σ) for Σ a p × p covariance matrix to be given shortly, and τ1, …, τp be a sequence of i.i.d. normal random variables independent of all other quantities under consideration. We will use two types of errors: the homoscedastic error τ1 and the heteroscedastic error η(X)k=1pXkτk/p. For (n, p) = (200, 200), (500, 2000), we generate data using the following distributions: (N.IE) Y = τ1, (A1.IE) Y = X1/5 + τ1, (A2.IE) Y=0.15k=15Xk0.1k=610Xkτ1, (N.DE) Y = η(X), (A1.DE) Y = X1/5 + η(X), and (A2.DE) Y=0.15k=15Xk0.1k=610Xk+η(X). For (n, p) = (2 000, 30 000), we generate data using the following distributions: (N.IE) Y = τ1, (A3.IE) Y = X1/15 + τ1, and Y=0.03k=15Xk0.015k=610Xk+τ1. We set all of the diagonal elements in the covariance matrix Σ equal 1, and the off-diagonal elements equal p, where for each simulation setting we let ρ = 0, 0.25, 0.5, 0.75. Unless otherwise specified, all simulations are run using 1 000 Monte Carlo simulations in R [R Core Team, 2014]. Code is available in the Supplementary Materials.

We conduct a 5% test of Ψ(P0) > 0 by checking if the lower bound of a 90% confidence interval for this quantity is greater than zero. We use models N.IE and N.DE to evaluate type I error and all other models evaluate power. We run our method with n as in (8), where we let ε = 0.5. For ease of implementation, we compute our method on chunks of data of size (nn)/10 (see Section 6.1 of Luedtke and van der Laan, 2016).

We compare our method to the ART of McKeague and Qian [2015]. The ART relies on a tuning parameter λn satisfying λn/n0 and λn →∞ that is selected via a double bootstrap procedure. We implemented code that we obtained from the authors (McKeague and Qian) that selects λn=alogn from a grid of a varying between 0.5 and 4. Due to computational limitations, we ran 400 outer bootstrap samples and 200 inner bootstrap samples (rather than the default of 1 000 samples for both layers of bootstrap), and also reduced the grid for a from the default (0.5, 0.55, …, 4) to (0.5, 0.6, …, 4). We also reduced the number of Monte Carlo replicates for the ART to 200 and only ran ART on the smallest sample size (n, p) = (200, 200). While we were not able to run the double bootstrap at the moderate sample size (n, p) = (500, 2000) due to computational constraints, we were able to mimic the double bootstrap procedure by selecting an oracle choice of λn=alogn. In particular, we ran ART for the fixed choices of a = 0.5, 2.25, 4, found that a = 4 appropriately controlled type I error while the other choices of a typically did not, and reported the results of ART at this fixed tuning parameter. We were unable to run even the oracle procedure at the largest sample size with due to computational constraints.

We also compared our procedure to the analogue of ART described in Section 2 of Zhang and Laber [2015], where this analogue does not require running a double bootstrap. This latter procedure is referred to as the “parametric bootstrap” in Zhang and Laber [2015], though to avoid confusion with other bootstrap procedures here we refer to their method as “ZL”. The ZL procedure assumes a locally linear model with homoscedastic errors. Note that the homoscedasticity requirement is stronger than the uncorrelated error requirement made by the ART. In fact, the errors are guaranteed to be uncorrelated with the predictors under the null of zero maximal absolute correlation, thereby ensuring the type I error control of ART. We use 500 bootstrap draws for each run of the ZL procedure. Zhang and Laber show that their method, which does not involve running a computationally burdensome double bootstrap procedure, has comparable performance to ART across sample sizes and predictor dimension, while being more computationally efficient. The ZL procedure is less computationally intensive than the ART, but still requires estimating the p × p covariance matrix Σ and simulating from a N(0,^) distribution. Due to computational constraints, we only run ZL for p ≤ 2 000 and not for p = 30 000. We also compare our method to a Bonferroni-corrected t-test.

Figures 1 displays the power of the four testing procedures for (n, p) equal to (200, 200) and (500, 2000) for the homoscedastic data generating distributions N.IE, A1.IE, and A2.IE. The ART and ZL procedures perform best in both of these settings. We can show (details omitted) that our method underperforms in this setting due to the second-order term representing the cost for estimating d0 on subsets of the data of size jn early on in the procedure. While Theorem A.11 ensures that the estimate of d0 will be asymptotically valid, there appears to be a noticeable price to pay at small sample sizes.

Figure 1.

Figure 1

Power of the various testing procedures for (n, p) equal to (200, 200) and (500, 2000) under homoscedastic errors. The ART and ZL procedure performs the best in this setting.

Figures 2 displays the power of the three testing procedures for (n, p) equal to (200, 200) and (500, 2000) for the heteroscedastic data generating distributions. The ZL procedure fails to control the type I error in this setting. This is unsurprising given that this test was developed under a local linear model with independent errors. All other methods adequately control type I error in this setting, especially at the larger sample size n = 500, while we see that the Bonferroni and ART procedures achieves slightly better power than our method for these data generating distributions.

Figure 2.

Figure 2

Power of the various testing procedures for (n, p) equal to (200, 200) and (500, 2000) under heteroscedastic errors. The ZL procedure fails to control the type I error in this setting.

Figure 3 displays the power of our method and the Bonferroni procedure for (n, p) equal to (2 000, 30 000). While (unsurprisingly) Bonferroni performs well when the correlation between the predictors in X is low, our method outperforms the Bonferroni procedure when the correlation increases. We expect that, were we able to run ART or ZL at this sample size, they would outperform all other methods under consideration as they did at the smaller sample sizes. Nonetheless, both methods quickly become computationally impractical when p gets large, whereas our procedure and the Bonferroni procedure can still be implemented at these sample sizes.

Figure 3.

Figure 3

Power of the test from the stabilized one-step and from the Bonferroni-adjusted t-test for (n, p) equal to (2 000, 30 000) under homoscedastic errors.

We also ran our method at different choices of n for (n, p) = (200, 200) and (500, 2 000) (details not shown), namely defined according to (8) with ε = 0.25, 1, 1.5, 1.75. We found little sensitivity to the choice of ε, with the exception that choosing ε = 1.75 often led to a moderate loss of power (at most 15% on an additive scale). This is not surprising given that, at ε = 1.75, n is approximately equal to n/2 for both (n, p) settings.

6 Discussion

We have presented a general method for estimating the (possibly non-unique) maximum of a family of parameter values indexed by dDn. Such an estimation problem is generally non-regular because minor fluctuations of the data generating distribution can change the subset of Dn for which the corresponding parameter is maximized. Our estimate takes the form of a sum of the terms of a martingale difference sequence, which quickly allows us to apply the relevant central limit theorem to study its asymptotics and develop Wald-type confidence intervals. The estimator adapts to the non-regularity of the problem, in the sense that we can give reasonable conditions under which it is regular and asymptotically linear when the maximizer is unique so that regularity is possible.

We have applied our approach to the example of McKeague and Qian [2015] in which one wishes to learn about the maximal absolute correlation between a prespecified outcome and a predictor belonging to some set. The sample splitting that is built into our estimator has enabled us to analyze the estimator when the dimension p of the predictor grows with sample size slowly enough so that n−1/2 log p → 0 as n goes to infinity. While McKeague and Qian focus on testing the null hypothesis that this maximal absolute correlation is zero, we have established valid confidence intervals for this quantity. The lower bound of our confidence interval is particularly interesting because it is valid under minimal conditions. When p is very large, one might expect that the null of no correlation between the outcome and any of the predictors is unlikely to be true. In these problems, having an estimate of the maximal absolute correlation, or at least a lower bound for this quantity, will likely still be interesting as a measure of the overall relationship between X and Y.

We have also studied the behavior of this null hypothesis test under local alternatives, showing that our test is consistent when the maximal absolute correlation shrinks to zero slower than n−1/2(log max{n, p})1/2. When the dimension of the predictor is fixed, the test of McKeague and Qian is consistent against alternatives shrinking to zero more slowly than n−1/2 rather than (log n)1/2n−1/2. We would not be surprised to find that this (log n)1/2 is unnecessary for p fixed and can be removed using more refined proof techniques.

McKeague and Qian do not require that Y and the coordinates of X have range in [−1, 1]. We have made this boundedness assumption out of convenience for our proofs and expect that we can replace the boundedness assumptions with appropriate moment assumptions without significantly changing the results. Our simulation results support this claim. The boundedness condition is not as restrictive as it may first seem, as unbounded X and Y can be rescaled to be to be bounded. Since the sharp null H0 : Ψn(P0) = 0 is invariant to strictly monotonic transformations of X and Y, our theoretical results yield a valid of H0 test after applying, e.g., the sigmoid transformation to X and Y.

We note that, in our simulations, ART and ZL achieve the highest power among competing methods, though for our heteroscedastic simulation setting ZL failed to control the type I error. We were not able to run either of these methods at our largest sample size due to computational constraints. The ZL procedure as currently described is computationally expensive and does not scale well to large data sets, especially when the dimension of the predictor p is large. This difficulty occurs because the procedure requires the computation of a p × p covariance matrix. The ART method presented in McKeague and Qian [2015], which achieves similar power to ZL, is in practice even more computationally burdensome due to its use of a double bootstrap. Nonetheless, from a theoretical computational complexity standpoint, the ART method can be made to scale as O(np) provided the number of bootstrap draws remains fixed. Though the number of bootstrap samples will likely be fixed in practical applications, we note that ART cannot maintain consistency against local alternatives unless the number of bootstrap samples grows with sample size, thereby yielding a slower than order-np runtime. As is to be expected from marginal screening procedures that perform an O(n) screening operation p times, our method attains an O(np) runtime. This computational efficiency, combined with the asymptotic theory supporting our method’s power against local alternatives under increasing covariate dimension and efficiency under fixed alternatives and covariate dimension, demonstrates what is achievable by marginal screening procedures. Given our simulations, we also believe that developing rigorous asymptotic theory under increasing dimension for the ART methods is an important area for future work.

The stabilized one-step estimator presented in this paper applies to many other situations not considered in this paper. In an earlier work, we showed that this estimator is useful for estimating the mean outcome under an optimal individualized treatment strategy Luedtke and van der Laan [2016], where the class Dn now indexes functions mapping from the covariate space to the set of possible treatment decisions. Thanks to the martingale structure of our estimator, the stabilized one-step estimator can be used to construct confidence intervals when the data is drawn sequentially so that the data generating distribution for observation j can depend on that of the first j − 1 observations. One interesting example along these lines is to obtain inference for the value of the optimal arm in a multi-armed bandit problem, even in the case where the optimal arm is non-unique and the reward distributions for the optimal arms have different variances. We look forward to seeing further applications of the general template for a stabilized one-step estimator that we have presented in this paper.

Supplementary Material

Appendix

Appendix A General estimator

A.1 Discussion of conditions of Theorem 1

In this section, we consider the setting where the parameter in (1) does not depend on sample size, and consequently omit the n subscript to quantities which no longer depend on sample size. We will show that C7) and the following conditions imply the conditions of Theorem 1:

  • C9)

    σ^j2σj2 converges to zero in probability as j → ∞.

  • C10)

    jRem^jjRemdj(P^j) converges to zero in probability as as j → ∞.

The validity of the upper bound requires the following additional condition:

  • C11)

    j[Ψdj(P0)Ψ(P0)] converges to zero in probability as as j → ∞.

For simplicity, we will take n = 0 in this section.

We now discuss the conditions. Condition C1) is an immediate consequence of C7) and Dd(P)(o) being uniformly bounded in P ∈ ℳ, dD, oO. This will be plausible in many situations, including the examples in this paper. A more general Lindeberg-type condition also suffices [see Condition C1 in Luedtke and van der Laan, 2016], though we omit its presentation here for brevity.

The other three conditions all rely on terms like 1nj=0n1Rj converging to zero in probability, possibly at some rate. Ideally we want a stochastic version of the fact that, for β ∊ [0, 1),

1nj=1njβ1n1njβdjnβ1βwhennislarge. (A.1)

Lemma 6 of Luedtke and van der Laan [2016] establishes this result. We restate it here for convenience.

Lemma A.1

(Lemma 6 in Luedtke and van der Laan, 2016). Suppose that Rj is some sequence of (finite) real-valued random variables such that Rj=oP0(jβ) for some β ∊ [0, 1), where we assume that each Rj is a function of {Oi : 1 ≤ i ≤ j}. Then,

1nj=0n1Rj=oP0(nβ).

Conditions C2) through C4) are now easily handled. Condition C2) is a consequence of the fact that

1nj=0n1|σj2σ^j21|γ11nj=0n1|σ^j2σj2|0inprobabilityasn,

where the inequality holds by C7) and the convergence holds by C9) Lemma A.1. Condition C9) is easily shown to hold under Glivenko-Cantelli conditions on the estimators P^j and dj [see, e.g., Theorem 7 in Luedtke and van der Laan, 2016]. Conditions C3) and C4) are an immediate consequence of C10) and C11) combined with Lemma A.1.

While sufficient conditions for C11) should be developed in each individual example, we can give intuition as to why this condition should be reasonable. For any P ∈ ℳ, let d(P) return a maximizer of (1). We are interested in ensuring that Ψdn(P0)Ψd(P0)(P0) is small, where dn is our estimate of a maximizer of (1). This can be expected to hold when the parameter P ↦ Ψd(P)(P0) has pathwise derivative zero at P = P0, where the P0 in the Ψ argument is fixed. When well-defined, the pathwise derivative will be zero because d(P) is chosen to maximize Ψd(P0) in d.

A.2 Efficiency when the maximizer in (1) is unique

We have presented a parametric-rate estimator for Ψn(P0), but thus far we have not made any claims about the efficiency of our estimator. In this section, we consider a fixed parameter in (1) that does not rely on sample size. We therefore omit the n subscript in many quantities to indicate their lack of dependence on sample size. We will give conditions under which our estimator is asymptotically efficient among all regular, asymptotically linear estimators. The efficiency bound is not typically well-defined when the maximizer is non-unique due to the non-regularity of the problem - generally in this case no regular, asymptotically linear estimator exists, so neither does an efficient member of this class [Hirano and Porter, 2012]. Thus the conditions that we give in this section will typically only hold when the maximizer d0D in (1) is unique.

We use the following additional assumptions for our efficiency result:

  • C5)

    EP0[(D^j(O)Dd0(P0)(O))2|O1,,Oj]0inprobabilityasj.

  • C6)

    There exists some M < ∞ such that P0(Dd0(P0)(O)<M) and P0(D^j(O)<M) with probability approaching 1 as j → ∞.

  • C7)

    infj1σ^j2>γ with probability 1 over draws of (Oj : j = 0, 1, …).

We discuss the conditions immediately following the theorem.

Theorem A.2

(Asymptotic efficiency). Suppose that Ψ does not depend on sample size and is pathwise differentiable with canonical gradient Dd0(P0). Further suppose that n = o(n). If C1) through C7) hold, then

σ¯n2VarP0(Dd0(P0)(O))in probability asn .

Furthermore,

ψnΨ(P0)=1ni=1nDd0(P0)(Oi)+oP0(n1/2).

Thus, ψn is asymptotically efficient among all regular, asymptotically linear estimators.

The proof is entirely analogous to the proof of Corollary 3 in Luedtke and van der Laan [2016] so is omitted. See Lemma 25.23 of van der Vaart [1998] for a proof of the fact that asymptotic linearity with the influence function given by the canonical gradient implies regularity.

The additional conditions needed for this result over Theorem 1 are mild when the maximizing index is unique. Condition C5) says that Ψ should have the same canonical gradient as Ψd0. While this should be manually checked in each example, it will be fairly typical when the maximizer is unique, since in this case an arbitrarily small fluctuation of P0 will generally not change the maximizer. This is similar to problems in introductory calculus where the derivative at the maximum is zero. Condition C5) requires that D^j(O) converge to Dd0(P0)(O) in mean-squared error, which is to be expected if P^nj begins to approximate P0 and dnj converges to the unique maximizer d0 as n, j → ∞. Condition C6) is a bounding assumption on the canonical gradient and estimates thereof that will hold in many examples of interest. Finally, Condition C7) will hold if one knows that VarP0[Dd(P)(O)] is bounded away from zero uniformly in P ∈ ℳ and dD, and uses this knowledge to truncate σ^j2 for some deterministic sequence γj → 0. For γj sufficiently small and j sufficiently large this truncation scheme will then have no effect on the variance estimates σ^j2.

A.3 Computationally efficient implementation

There are several computationally efficient ways to compute our estimate. In Section 6.1 of Luedtke and van der Laan [2016], we show that the runtime of our estimator can be dramatically improved by running the algorithm used to compute each P^j a limited number of times, say ten times. We do not detail this approach here, though we note that the theorems we have presented are general enough to apply to this case.

An alternative approach to improve runtime is to use the estimator’s online nature to compute it efficiently both in time and storage. Suppose that we have an algorithm to update the estimate P^nj of P0 to the estimate P^n(j+1) based on the first j observations by looking at Oj+1 only. This will often be feasible if the parameter of interest and the bias correction step only require estimates of certain components of P0, e.g. of a set of regression and classification functions. In these cases we can apply modern regression and classification approaches to estimate these quantities [see, e.g., Xu, 2011, Luts et al., 2014]. Often dnj· can also be obtained using online methods, and thus 1nlnj=lnn1[Ψdnj(P^nj)+D^nj(Oj+1)] can be estimated online by keeping a running sum. This quantity is not equal to ψn because it does not yet include the weights.

It will not in general be possible to compute the weights online, though their computation does not require storing O(n) observations in memory. We can estimate VarP0(D^nj(O)) consistently using the rj observations, where rj → ∞ but can grow very slowly (even log j suffices asymptotically, though such a slow growth is not recommended for finite samples). Given online estimates of these variances, it is then straightforward to compute both σ¯n and the weights and incorporate these into our estimator. In some cases, we can compute the weights, and thus the estimate, in a truly online fashion. Describing general sufficient conditions for this appears to be difficult, but we conjecture that often this will not typically hold if Dn is not of finite cardinality. The weights can be computed online in the maximal correlation example.

Appendix B McKeague and Qian [2015] example

B.1 Proofs and results

Lemma A.3

Fix δ>0 and dDn. For any P with minkKnsp2(Xk)>δ and sp2(Y)>δ, (6) holds. Proof. Straightforward but tedious calculations show that

Remd(P)=m(1sP(Xk)sP(Y)[sP(Xk)sP(Y)s0(Xk)s0(Y)][CorrP(Xk,Y)Corr0(Xk,Y)](EP[Xk]EP0[Xk])(EP[Y]EP0[Y])sP(Xk)sP(Y)CorrP(Xk,Y)2[(EP[Xk]EP0[Xk])2sP2(Xk)+(EP[Y]EP0[Y])2sP2(Y)]CorrP(Xk,Y)2sP2(Xk)sP2(Y)[sP(Xk)s0(Y)s0(Xk)sP(Y)]2). (A.2)

The result follows by taking the absolute value of both sides, applying the triangle inquality, using that ab ≤ (a2 + b2)/2 for any real a, b, CorrP(Xk, Y) ≤ 1, and the lower bound δ on the variances. □

We now establish high probability bounds on the difference between sPj2(Xk), sPj2(Y), CorrPj(Xk,Y) and σ^nj and their population counterparts, uniformly over kKn and j. We will use ≲ to denote “less than or equal to up to a universal multiplicative constant”. Let Fn denote the following class of functions mapping from O× to the real line:

{(x,y)xkrys:0r,s,4;r+s4kKn}. (A.3)

Note that |Fn|p. We will use this class to develop concentration results about our estimates the needed portions of the likelihood. This class is actually somewhat larger than is needed for most of our results, as in fact

{(x,y)xky:kKn}{(x,y)xk:kKn}{(x,y)y}{(x,y)xk2:kKn}{(x,y)y2}

suffices for concentrating our estimates of Corr0(Xk, Y), s0(Xk), and s0(Y). Nonetheless, using this larger class Fn will allow us to prove results about the concentration of σ^nj2 about σnj2 and just stating it as a single class is convenient for brevity.

For fFn and j ∈ {1, …, n}, define the empirical process as

Gnj1ji=1j[f(Oi)P0f]=j(PjP0)f,

where we use Pj denote the empirical distribution of O0, …, Oj−1 and Pf ≡ EP[f(O)] for any distribution P. Let GnjFnsupfFn|Gnj|. By Theorem 2.14.1 in van der Vaart and Wellner [1996] shows that

EGnjFnlog#Fnlogp, (A.4)

where the expectation is over the draws O1, …, Oj. We have used that our class is bounded by the constant 1.

Let

Knjj1/2logmax{n,p}. (A.5)

Define the events

Anj{maxfFn|(PjP0)f|CKnj}forallj=1,,n,Anj=1nAnj,

where C in the definition of Anj is equal the smallest universal constant satisfying (A.4) plus 1.

Lemma A.4

For any sample size n, the event An occurs with probability at least 1−n/max{n2, p} ≥ 1 − 1/n.

Proof

We first upper bound the probability of the complement of Anj for each n, j. Fix n and j ≤ n. By the bounds on X and Y, changing one Oi in (O1, …, Oj) to some other value in the support of P0 can change b by at most 1/j. Thus (O1,,Oj)GnjFn satisfies the bounded differences property with bound 1/j, and we may apply McDiarmid’s inequality [McDiarmid, 1989] to show that, with probability at most 1 − exp(−2t2), GnjFnEGnjFn+t. Choosing t=logmax{n2,p}2 and using (A.4) yields that, with probability at least 1−1/max{n2, p}, the following inequality holds for all j = 1, …, n:

GnjFnEGnjFn+logmax{n2,p}2C'logp+logmax{n,p}Clogmax{n,p},

where C′ denotes the universal constant in (A.4).

By DeMorgan’s laws and a union bound, it follows that the event AnjAnj occurs with probability at least 1 − n/ max{n2, p} ≥ 1 − 1/n. □

We have shown that An occurs with high probability. Now we show that our estimates of variances, covariances, and correlations perform well when An occurs.

Lemma A.5

Fix a sample size n ≥ 2. The occurrence of An implies that, for all j = 2, …, n:

  • 1)

    maxkKn|sPj(Xk)s0(Xk)|δ1/2Knj;

  • 2)

    maxkKn|sPj2(Xk)s02(Xk)|Knj;

  • 3)

    |sPj(Y)s0(Y)|δ1/2Knj;

  • 4)

    |sPj2(Y)s02(Y)|Knj;

  • 5)

    maxkKn|CorrPj(Xk,Y)CorrP0(Xk,Y)|δ1Knj,

where we define CorrPj(Xk,Y)=0 when either sPj(Xk) or sPj(Y)is equal to zero.

Proof

Suppose An holds and fix kKn. The triangle inequality and the bounds on Xk yield that

|sPj2(Xk)s02(Xk)|=|(EPj[Xk2]EP0[Xk2])(EPj[Xk]+EP0[Xk])(EPj[Xk]EP0[Xk])||EPj[Xk2]EP0[Xk2]|+2|EPj[Xk]EP0[Xk]|Knj.

This gives 2). For 1), note that

|sPj(Xk)s0(Xk)|=|sPj2(Xk)s02(Xk)sPj(Xk)+s0(Xk)|δ1/2Knj.

The same argument yields 3) and 4).

Again fix k. An application of the triangle inequality and the bounds on Xk and Y readily yield that |CovPj(Xk,Y)CovP0(Xk,Y)|Knj. Furthermore,

CorrPj(Xk,Y)CorrP0(Xk,Y)=CovPj(Xk,Y)CovP0(Xk,Y)s0(Xk)s0(Y)CorrPj(Xk,Y)sPj(Y)s0(Xk)s0(Y)[sPj(Xk)s0(Xk)]CorrPj(Xk,Y)s0(Y)[sPj(Y)s0(Y)].

Taking the absolute value of both sides, applying the triangle inequality, and using the lower bounds on s0(Xk) and s0(Y) and the upper bound on CorrPj(Xk,Y) yields that |CorrPj(Xk,Y)CorrP0(Xk,Y)|δ1Knj. This holds for all k, so 5) holds. □

Lemma A.6

Let C be the smallest universal constant in 2) of that Lemma A.5, and let n be any natural number satisfying n ≥ ⌈4C−2δ−2 log max{n, p}⌉ ≡ J(n, δ). Under these conditions, the occurrence of An implies that, for all j = J(n, δ), …, n,

  • 8

    minkKnsPj2(Xk)δ/2 and minkKnsPj2(Y)δ/2;

  • 9

    minkKnsP02(Xk)sPj2(Xk)δ and sP02(Y)sPj2(Y)2;

  • 10

    |σ^nj2σnj2|δ2Knj;

  • 11

    |σ^nj2VarP0(Dnj(P0)(O))|δ2Knj+(Rem^nj)2.

Proof

By Lemma A.5, 2) holds, and using that jJ(n, δ), we see that

sPj2(Xk)=sP02(Xk)+sPj2(Xk)sP02(Xk)sP02(Xk)maxkKn|sPj2(Xk)sP02(Xk)|δ/2.

The same argument works for sPj2(Y), so 8 holds. Furthermore,

sP02(Xk)sPj2(Xk)sP02(Xk)sP02(Xk)CKnj=1+CKnjsP02(Xk)CKnj1+2Cδ1Knj2,

where the final two inequalities hold by 8. This proves the first part of 9, and the bound on sP02(Y)/sPj2(Y) holds by the same argument. For the second result, note that

|σ^nj2σjn2||(PjP0)D^nj2|+|(PjD^nj)2(P0D^nj)2||(PjP0)D^nj2|+|(Pj+P0)D^nj||(PjP0)D^nj|.

Using 8, the bounds on X and Y, and the triangle inequality shows that

|(PjP0)D^nj2|+δ1|(PjP0)D^nj|δ2GnjFn,

where we have used that Fn contains all polynomials of Xk, Y of degree at most 4. By the occurrence of An, the final line is upper bounded by a constant times δ−2 Knj. This yields 10.

For 11, we will bound |σnj2VarP0(Dnj(P0)(O))| and then combine this with 10 using the triangle inequality. We have that

|σnj2VarP0(Dnj(P0)(O))||P0[D^nj2Dnj(P0)2]|+(P0D^nj)2.

Now we use that P0D^nj=CorrPj(Xk,Y)+CorrP0(Xk,Y)+Rem^nj and (a + b)2 ≤ 2 (a2 + b2) for any real a, b to see that (P0D^nj)2maxk(CorrPj(Xk,Y)CorrP0(Xk,Y)2)+(Rem^nj)2. By 5) from Lemma A.5 and the fact that jJ(n, δ), the maximum over kKn is bounded above by a constant times δ2Knj2δ2Knj. Continuing with the above,

|P0([D^nj+Dnj(P0)][D^njDnj(P0)])|+δ2Knj+(Rem^nj)2δ1P0|D^njDnj(P0)|+δ2Knj+(Rem^nj)2δ2GnjFn+δ2Knj+(Rem^nj)2δ2Knj+(Rem^nj)2,

where we used 8 for the second to last inequality. □

Lemma A.7

Suppose the conditions of Lemma A.6. Under these conditions, the occurrence of An implies that, for all j = J(n, δ), …, n.

8. |Rem^nj|δ5/2(Knj)2.

Proof

By Lemma A.6, minkKnsPj2(Xk)δ/2 and minkKnsPj2(Y)δ/2. By Lemma A.3, this yields

|Rem^nj|δ1maxkKn(|sPj(Xk)sPj(Y)s0(Xk)s0||CorrPj(Xk,Y)Corr0(Xk,Y)|+(EPj[Xk]EP0[Xk])2+(EPj[Y]EP0[Y])2+s02(Y)sPj2(Y)[sPj(Xk)s0(Xk)]2+s02(Xk)sPj2(Xk)[sPj(Y)s0(Y)]2).

By the bounds on X and Y and the triangle inequality, |sPj(Xk)sPj(Y)s0(Xk)s0(Y)||sPj(Xk)sP0(Xk)|+|sPj(Y)sP0(Y)|. Applying 9 from Lemma A.6 and the results of Lemma A.5 to the above yields the result. □

Lemma A.8

Let γ be as defined in (7). For a constant C(γ, δ) > 0 relying on γ and δ only, the occurrence of An implies that, for all j = ⌈C(γ, δ) log max{n, p}⌉, …, n.

8. σ^nj2γ/2.

Sketch of proof

Suppose An. By 11 and 8, for all jJ(n, δ)

|σ^nj2VarP0(Dnj(P0)(O))|δ2Knj+δ10(Knj)4.

It is easy to confirm that, for a universal constant C > 0, the above yields that the left-hand side is upper bounded by γ/2 for all j−1/2δ−2 max {δ−3, γ−3/2} log max{n, p} ≡ C(γ, δ)log max{n, p} ≥ J(n, δ). An application of the triangle inequality gives the result. □

The remaining results in this section are asymptotic in nature. We omit the dependence on δ and γ in these statements, as these quantities are treated as fixed as the sample size grows. Throughout we assume that

$$\frac{\log\max\{n,p\}}{\ell_n} \rightarrow 0, \tag{A.6}$$
$$\beta_n^2\,\log\frac{n}{\ell_n} \rightarrow 0, \tag{A.7}$$
$$\limsup_{n\rightarrow\infty}\frac{\ell_n}{n} < 1. \tag{A.8}$$

In view of (A.6) and (A.7), we see that, roughly, $\ell_n$ grows faster than $\log\max\{n,p\}$ if $\beta_n$ goes to zero faster than $1/\sqrt{\log n}$, and at least as fast as $n\exp\!\left(-o(\beta_n^{-2})\right)$ if $\beta_n$ goes to zero more slowly than $1/\sqrt{\log n}$. Given an ε > 0, one possible choice of $\ell_n$ that satisfies these properties is

$$\ell_n = \max\left\{\left(\log\max\{n,p\}\right)^{1+\varepsilon},\; n\exp\!\left(-\beta_n^{-(2-\varepsilon)}\right)\right\}.$$

We have the following result.

Lemma A.9

For all n large enough, $\ell_n$ is at least $J(n,\delta)$ and at least $C(\gamma,\delta)\log\max\{n,p\}$, where these quantities are defined in Lemmas A.6 and A.8, respectively.

Proof

This is an immediate consequence of (A.6) and the fact that δ and γ are fixed as the sample size grows. □

Theorem A.10

C1), C2), and C3) hold.

Proof
  • C1)

    By Lemma A.9, we can apply 8 from Lemma A.6 and Lemma A.8 provided n is large enough. In that case, $\left\|\hat D_{nj}\right\|_\infty/\hat\sigma_{nj} \lesssim \delta^{-1}\gamma^{-1/2}$ for all $j\ge\ell_n$ provided $A_n$ holds. By Lemma A.4, $A_n$ occurs with probability at least 1 − 1/n, and thus C1) holds.

  • C2)
    If An holds, then Lemmas A.8 and A.9 show that, for all n large enough,
    $$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\left|\frac{\sigma_{nj}^2}{\hat\sigma_{nj}^2}-1\right| \le \frac{2\gamma^{-1}}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\left|\hat\sigma_{nj}^2-\sigma_{nj}^2\right| \lesssim \frac{\gamma^{-1}\delta^{-2}}{n-\ell_n}\sqrt{\log\max\{n,p\}}\sum_{j=\ell_n}^{n-1}j^{-1/2}.$$
    By 10 in Lemma A.6 and the fact that $\sum_{j=a}^{b}j^{-1/2}\le\int_{a-1}^{b}j^{-1/2}\,dj$, the right-hand side has an upper bound proportional to $\gamma^{-1}\delta^{-2}\sqrt{n\log\max\{n,p\}}/(n-\ell_n)$. This bound is o(1) by (A.8) and the fact that $\beta_n\rightarrow 0$. The fact that $A_n$ occurs with probability approaching 1 (Lemma A.4) yields C2).
  • C3)

    Suppose that n is large enough so that the results of Lemma A.9 apply. Also suppose that An occurs. We have that

$$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\frac{\left|\widehat{\mathrm{Rem}}_{nj}\right|}{\hat\sigma_{nj}} \lesssim \frac{\gamma^{-1/2}\delta^{-5/2}}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\left(K_n^j\right)^2 \qquad\text{(Lemmas A.7, A.8, and A.9)}$$
$$= \frac{\gamma^{-1/2}\delta^{-5/2}}{n-\ell_n}\log\max\{n,p\}\sum_{j=\ell_n}^{n-1}j^{-1} \qquad\text{(Eq. A.5)}$$
$$\lesssim \frac{\gamma^{-1/2}\delta^{-5/2}}{n-\ell_n}\log\max\{n,p\}\,\log\frac{n}{\ell_n} \qquad\left(\textstyle\sum_{j=a}^{b}j^{-1}\le\int_{a-1}^{b}j^{-1}\,dj\right)$$
$$= o\!\left(\left[n-\ell_n\right]^{-1/2}\right). \qquad\text{(Eqs. A.7 and A.8)}$$

The fact that An occurs with probability approaching 1 (Lemma A.4) yields C3). □

Let $k_n^0$ be a possibly non-unique maximizer over $k\in\mathcal{K}_n$ of $\left|\mathrm{Corr}_{P_0}(X_k,Y)\right|$. For each r > 0, let $\mathcal{K}_n^r\subseteq\mathcal{K}_n$ denote the set of all $k\in\mathcal{K}_n$ such that $\left|\mathrm{Corr}_{P_0}(X_{k_n^0},Y)\right| - \left|\mathrm{Corr}_{P_0}(X_k,Y)\right| \le r$.

The upcoming theorem uses the following conditions to establish the validity of a hypothesis test of no effect and of the upper bound of our confidence interval, respectively:

M1) For some sequence {t_n} with $t_n\rightarrow+\infty$, there exists a sequence of non-empty subsets $\mathcal{K}_n^r\subseteq\mathcal{K}_n$ such that, for all n,

$$\inf_{k_1\in\mathcal{K}_n^r}\left|\mathrm{Corr}_{P_0}(X_{k_1},Y)\right| \ge \sup_{k_2\in\mathcal{K}_n\setminus\mathcal{K}_n^r}\left|\mathrm{Corr}_{P_0}(X_{k_2},Y)\right| + t_n n^{-1/4}\beta_n.$$

If $\mathcal{K}_n^r=\mathcal{K}_n$, then the supremum on the right-hand side is taken to be zero.

M2) The conditions of M1) hold, and also

$$\mathrm{Diam}\!\left(\mathcal{K}_n^r\right) \equiv \sup_{k_1,k_2\in\mathcal{K}_n^r}\left(\left|\mathrm{Corr}_{P_0}(X_{k_1},Y)\right| - \left|\mathrm{Corr}_{P_0}(X_{k_2},Y)\right|\right) = o\!\left(n^{-1/2}\right).$$

The first of these conditions will be used to establish the consistency of a null hypothesis significance test. The second of these conditions is similar to margin conditions used in classification, and will be used to establish the validity of our confidence interval.

Theorem A.11

$$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\hat\sigma_{nj}^{-1}\left[\Psi_{d_{nj}}(P_0)-\Psi_n(P_0)\right] = O_{P_0}\!\left(n^{-1/4}\beta_n\right). \tag{A.9}$$

If also M1), then the right-hand side of the above can be tightened to $O_{P_0}\!\left(\mathrm{Diam}(\mathcal{K}_n^r)\wedge n^{-1/4}\beta_n\right) + o_{P_0}\!\left(n^{-1/2}\right)$. If also M2), then C4) holds.

Proof

Suppose that $A_n$ holds and n is large enough so that the results of Lemma A.9 apply. For each $j\ge\ell_n$, let $k_{nj}$ denote a $k\in\mathcal{K}_n$ which maximizes $\left|\mathrm{Corr}_{P_j}(X_k,Y)\right|$. Let $m^0=\mathrm{sgn}\!\left[\mathrm{Corr}_{P_0}(X_{k_n^0},Y)\right]$ and $m_{nj}=\mathrm{sgn}\!\left[\mathrm{Corr}_{P_j}(X_{k_{nj}},Y)\right]$. Then, for a universal constant C > 0,

$$0 \ge m^0\,\mathrm{Corr}_{P_j}(X_{k_n^0},Y) - \left|\mathrm{Corr}_{P_j}(X_{k_{nj}},Y)\right| = \left[m^0\,\mathrm{Corr}_{P_0}(X_{k_n^0},Y) - m_{nj}\,\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right] + m^0\left[\mathrm{Corr}_{P_j}(X_{k_n^0},Y) - \mathrm{Corr}_{P_0}(X_{k_n^0},Y)\right] - m_{nj}\left[\mathrm{Corr}_{P_j}(X_{k_{nj}},Y) - \mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right] \ge \Psi_n(P_0) - \Psi_{d_{nj}}(P_0) - 2\max_{k\in\mathcal{K}_n}\left|\mathrm{Corr}_{P_j}(X_k,Y) - \mathrm{Corr}_{P_0}(X_k,Y)\right| \ge \Psi_n(P_0) - \Psi_{d_{nj}}(P_0) - C\delta^{-1}K_n^j, \tag{A.10}$$

where the final inequality holds by 5) of Lemma A.5. Using that $\sum_{j=\ell_n}^{n-1}j^{-1/2}\lesssim\sqrt{n}$ and (A.8),

$$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}K_n^j \lesssim \frac{\sqrt{n\log\max\{n,p\}}}{n-\ell_n} \lesssim n^{-1/4}\beta_n.$$

By Lemma A.8, this implies that the left-hand side of (A.9) is bounded in absolute value by an $O\!\left(\gamma^{-1/2}\delta^{-1}n^{-1/4}\beta_n\right)$ term under $A_n$, and so Lemma A.4 yields (A.9).

For the second result, suppose that M1) holds. Observe that, for all $j > Cnt_n^{-1}$ with C as in (A.10), $\Psi_n(P_0)-\Psi_{d_{nj}}(P_0) < t_n n^{-1/4}\beta_n$. Furthermore, $\Psi_n(P_0)-\left|\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right| \le \Psi_n(P_0)-\Psi_{d_{nj}}(P_0)$, so that $k_{nj}\in\mathcal{K}_n^r$ as defined in M1). Furthermore, $m_{nj}$ must equal $\mathrm{sgn}\!\left[\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right]$, since otherwise

$$\Psi_n(P_0)-\Psi_{d_{nj}}(P_0) \ge \left|\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right| - \Psi_{d_{nj}}(P_0) = 2\left|\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right| \ge 2\inf_{k\in\mathcal{K}_n^r}\left|\mathrm{Corr}_{P_0}(X_k,Y)\right| \ge 2t_n n^{-1/4}\beta_n,$$

contradicting the fact, established above via (A.10), that $\Psi_n(P_0)-\Psi_{d_{nj}}(P_0) < t_n n^{-1/4}\beta_n$. Because $k_{nj}\in\mathcal{K}_n^r$ and $m_{nj}=\mathrm{sgn}\!\left[\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right]$, we see that $\Psi_{d_{nj}}(P_0) \ge \inf_{k\in\mathcal{K}_n^r}\left|\mathrm{Corr}_{P_0}(X_k,Y)\right|$. Hence,

$$\frac{1}{n-\ell_n}\sum_{j=\max\{\ell_n,\,\lceil Cnt_n^{-1}\rceil\}}^{n-1}\left[\Psi_{d_{nj}}(P_0)-\Psi_n(P_0)\right] \ge -\mathrm{Diam}\!\left(\mathcal{K}_n^r\right). \tag{A.11}$$

Further, if $Cnt_n^{-1} \ge \ell_n$, (A.10) yields

$$\sum_{j=\ell_n}^{\lceil Cnt_n^{-1}\rceil}\left[\Psi_{d_{nj}}(P_0)-\Psi_n(P_0)\right] \ge -C\sum_{j=\ell_n}^{\lceil Cnt_n^{-1}\rceil}K_n^j \ge -C\sqrt{\log\max\{n,p\}}\int_{\ell_n-1}^{Cnt_n^{-1}}j^{-1/2}\,dj.$$

It follows that the left-hand side above is greater than or equal to the negative of a positive universal constant times $n^{1/2}t_n^{-1/2}$. Dividing by $n-\ell_n$ and applying (A.8) yields that the corresponding average is bounded below by a term of order $n^{-1/2}t_n^{-1/2}$. Combining this with (A.11) shows that

$$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\left[\Psi_{d_{nj}}(P_0)-\Psi_n(P_0)\right] \ge -\left[\mathrm{Diam}\!\left(\mathcal{K}_n^r\right)+O\!\left(n^{-1/2}t_n^{-1/2}\right)\right].$$

Using that $t_n^{-1/2}\rightarrow 0$, $n^{-1/2}t_n^{-1/2}=o(n^{-1/2})$. When proving the first result (A.9) we also showed that the left-hand side is bounded below by a negative universal constant times $\delta^{-1}n^{-1/4}\beta_n$. Combining with Lemma A.8 and using that $A_n$ holds with probability approaching 1 (Lemma A.4) shows that the left-hand side of (A.9) is $O_{P_0}\!\left(\mathrm{Diam}(\mathcal{K}_n^r)\wedge n^{-1/4}\beta_n\right)+o_{P_0}\!\left(n^{-1/2}\right)$. If M2) holds, then this expression is $o_{P_0}(n^{-1/2})$, and so C4) holds. □

B.2 Computationally efficient implementation of our estimator

In this section, we describe how to implement the estimator for the McKeague and Qian [2015] example in O(np) time. We show that this can be accomplished using O(p) storage when the observations O1, …, On arrive in a stream.

Fix n so that the set $\mathcal{K}_n$ of predictor indices is also fixed. For each j, let $P_j$ denote the empirical distribution of the first j observations. Recall the definition of the class $\mathcal{F}_n$ from (A.3), and note that $\mathcal{F}_n$ contains O(p) functions. It is easy to see that, at j = 2, we can compute $P_jf \equiv E_{P_j}[f(O)]$ for each $f\in\mathcal{F}_n$ using O(p) time and storage. Furthermore, for j ≥ 3 the identity $P_jf = \frac{1}{j}f(O_j) + \frac{j-1}{j}P_{j-1}f$ shows that we can compute and save $P_jf$ in O(p) time and storage if we know $O_j$ and $P_{j-1}f$. To attain this storage complexity, we remove $P_{j-2}f$, $f\in\mathcal{F}_n$, from memory for each j ≥ 4 so that $P_2f,\ldots,P_{j-2}f$ are not stored in memory.

We now have an algorithm that, at observation j, starts with $O_j$ and $P_{j-1}f$, $f\in\mathcal{F}_n$, stored in memory and, after running the steps described in the preceding paragraph, also has $P_jf$, $f\in\mathcal{F}_n$, stored in memory. Given $P_jf$, $f\in\mathcal{F}_n$, one can compute and save $\mathrm{Cov}_{P_j}(X_k,Y)=E_{P_j}[X_kY]-E_{P_j}[X_k]E_{P_j}[Y]$, $k\in\mathcal{K}_n$, and $s_{P_j}^2(Z)=E_{P_j}[Z^2]-E_{P_j}[Z]^2$ for Z equal to Y or $X_k$, $k\in\mathcal{K}_n$, in O(p) time and storage. We can then compute and save $\mathrm{Corr}_{P_j}(X_k,Y)=\mathrm{Cov}_{P_j}(X_k,Y)/\!\left[s_{P_j}(X_k)s_{P_j}(Y)\right]$, $k\in\mathcal{K}_n$, in O(p) time and storage. If the predictors or outcome take large values and have small variances, this online computation of the sample variance may lead to numerical difficulties; see Welford [1962] for a numerically stabler way to update the variance in this setting.
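To make the streaming bookkeeping concrete, the following Python sketch maintains the running means $P_jf$ for the degree-one and degree-two polynomials in the predictors and outcome and recovers the correlations from them in O(p) time per observation. It is a minimal sketch under our own naming conventions (StreamingMoments, update, and correlations are not from the paper or from any package), and it tracks only the moments needed for the correlations; the full class $\mathcal{F}_n$ also contains the third- and fourth-degree polynomials used for $\hat\sigma_{nj}$ below.

import numpy as np

class StreamingMoments:
    """Running means P_j f for the degree-1 and degree-2 polynomials, updated in O(p) per observation."""

    def __init__(self, p):
        self.j = 0
        self.mx = np.zeros(p)    # E_{P_j}[X_k]
        self.mx2 = np.zeros(p)   # E_{P_j}[X_k^2]
        self.mxy = np.zeros(p)   # E_{P_j}[X_k Y]
        self.my = 0.0            # E_{P_j}[Y]
        self.my2 = 0.0           # E_{P_j}[Y^2]

    def update(self, x, y):
        """Incorporate observation O_j = (x, y) via P_j f = (1/j) f(O_j) + ((j-1)/j) P_{j-1} f."""
        self.j += 1
        w = 1.0 / self.j
        self.mx += w * (x - self.mx)
        self.mx2 += w * (x ** 2 - self.mx2)
        self.mxy += w * (x * y - self.mxy)
        self.my += w * (y - self.my)
        self.my2 += w * (y ** 2 - self.my2)

    def correlations(self):
        """Corr_{P_j}(X_k, Y) for all k, valid once j >= 2 and the sample variances are positive."""
        cov = self.mxy - self.mx * self.my
        var_x = self.mx2 - self.mx ** 2
        var_y = self.my2 - self.my ** 2
        return cov / np.sqrt(var_x * var_y)

As noted above, when the variables take large values with small variances, a Welford-style centered update is preferable numerically to these raw-moment recursions [Welford, 1962].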

Let $H_j$ denote the collection of (i) the integer j, (ii) $P_jf$, $f\in\mathcal{F}_n$, (iii) $s_{P_j}^2(Y)$, and (iv) $\mathrm{Cov}_{P_j}(X_k,Y)$, $s_{P_j}^2(X_k)$, and $\mathrm{Corr}_{P_j}(X_k,Y)$, $k\in\mathcal{K}_n$. For j ≥ 2, let UPDATEH be a function which takes as input $(O_{j+1}, H_j)$ and outputs $H_{j+1}$. We have shown that UPDATEH$(O_{j+1}, H_j)$ can run in O(p) time for any j ≥ 2. We call a separate function INITIALIZEH on $(O_1, O_2)$ to obtain the initial value $H_2$. This function runs in O(p) time and storage.

Let MAXIMIZER be a function that takes as input $H_j$ and returns the $d_j=(k_j,m_j)$ which maximizes $m\,\mathrm{Corr}_{P_j}(X_k,Y)$ over $d=(k,m)\in\mathcal{D}_n$, thereby allowing us to compute $\hat\sigma_{nj}^2=P_jD_{d_j}(P_j)^2$. Finding $d_j$ involves finding the maximum of $|\mathcal{D}_n|=2p$ numbers, and therefore can be accomplished in O(p) time.

The function CALCD takes as input $H_j$, $O_{j+1}$, and $d_j$ and calculates $D_{d_j}(P_j)(O_{j+1})$. It is easy to see that this can be accomplished in O(1) time and O(p) storage.

For ease of notation in the following paragraph and equation we omit the dependence of $d_j = (k_j, m_j)$ on j. Since $D_d(P_j)$ is a gradient for $\Psi_d$ at $P_j$ and gradients are mean zero, $P_jD_d(P_j) = 0$. For any $d\in\mathcal{D}_n$, tedious but elementary calculations show that

$$P_jD_d(P_j)^2 = \frac{2+\mathrm{Corr}_{P_j}(X_k,Y)^2}{2\,s_{P_j}^2(X_k)\,s_{P_j}^2(Y)}\sum_{r=0}^{2}\sum_{s=0}^{2}(-1)^{r+s}\binom{2}{r}\binom{2}{s}E_{P_j}[X_k^rY^s]\,E_{P_j}[X_k]^{2-r}E_{P_j}[Y]^{2-s} + \frac{\mathrm{Corr}_{P_j}(X_k,Y)^2}{4}\sum_{r=0}^{4}(-1)^{r}\binom{4}{r}\left[\frac{E_{P_j}[X_k^r]\,E_{P_j}[X_k]^{4-r}}{s_{P_j}^4(X_k)}+\frac{E_{P_j}[Y^r]\,E_{P_j}[Y]^{4-r}}{s_{P_j}^4(Y)}\right] - \frac{\mathrm{Corr}_{P_j}(X_k,Y)}{s_{P_j}(X_k)^3\,s_{P_j}(Y)}\sum_{r=0}^{3}\sum_{s=0}^{1}(-1)^{r+s}\binom{3}{r}E_{P_j}[X_k^rY^s]\,E_{P_j}[X_k]^{3-r}E_{P_j}[Y]^{1-s} - \frac{\mathrm{Corr}_{P_j}(X_k,Y)}{s_{P_j}(X_k)\,s_{P_j}(Y)^3}\sum_{r=0}^{1}\sum_{s=0}^{3}(-1)^{r+s}\binom{3}{s}E_{P_j}[X_k^rY^s]\,E_{P_j}[X_k]^{1-r}E_{P_j}[Y]^{3-s}.$$

Observe that all expectations on the right-hand side above are expectations of some $f\in\mathcal{F}_n$ applied to the observed data structure. It follows that the above can be computed in O(1) time using a subset of the O(p) expectation, standard deviation, and correlation estimates stored in $H_j$. Let CALCSIGHAT denote the function which takes as input $H_j$ and $d_j$ and outputs $\hat\sigma_{nj}$. We have shown that CALCSIGHAT$(H_j, d_j)$ runs in O(1) time.
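As an illustration of the roles of CALCD and CALCSIGHAT, the sketch below evaluates the gradient at a new observation using the standard nonparametric influence function of the Pearson correlation, $D_d(P)(x,y) = m\left[z_xz_y - \tfrac{\rho}{2}\left(z_x^2+z_y^2\right)\right]$ with $z_x=(x-E_P[X_k])/s_P(X_k)$ and $z_y=(y-E_P[Y])/s_P(Y)$, whose second moment is consistent with the displayed expression for $P_jD_d(P_j)^2$, and it computes that second moment from centered fourth-order moments. The function names and the choice to pass the centered moments $m_{ab}=E_{P_j}[(X_k-E_{P_j}X_k)^a(Y-E_{P_j}Y)^b]$ directly (they can be assembled in O(1) from the stored raw moments by binomial expansion, as in the display) are our own illustrative conventions.

import numpy as np

def calc_d(x, y, mean_x, mean_y, sd_x, sd_y, rho, m):
    """Gradient D_d(P_j) evaluated at a new observation (x, y), with d = (k, m)."""
    zx = (x - mean_x) / sd_x
    zy = (y - mean_y) / sd_y
    return m * (zx * zy - 0.5 * rho * (zx ** 2 + zy ** 2))

def calc_sighat(rho, sd_x, sd_y, m22, m31, m13, m40, m04):
    """Square root of P_j D_d(P_j)^2, written in terms of centered moments m_ab of (X_k, Y) under P_j."""
    vx, vy = sd_x ** 2, sd_y ** 2
    pd2 = ((1.0 + 0.5 * rho ** 2) * m22 / (vx * vy)
           + 0.25 * rho ** 2 * (m40 / vx ** 2 + m04 / vy ** 2)
           - rho * (m31 / (sd_x ** 3 * sd_y) + m13 / (sd_x * sd_y ** 3)))
    return np.sqrt(max(pd2, 0.0))  # guard against tiny negative values from rounding

Because only the selected index $k_j$ is involved, both calls cost O(1) time once $H_j$ is available, matching the complexity claims above.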

The pseudocode in ESTPSI describes our estimator, with most of the work done in the recursion step described in the function RECURSION. Because each call of RECURSION runs in O(p) time, the O(n) iterations of the for loop in ESTPSI require O(np) time in total. The storage requirement of each call of RECURSION is O(p). Because the code in the for loop in ESTPSI deletes the output from the previous recursion step, the total storage requirement of ESTPSI is O(p).

Algorithm Recursion Step for Estimating Ψ(P0)

function RECURSION($O_{j+1}$, $\psi_j$, $H_j$, $\bar\sigma_j$, $\ell_n$) ▹ the current index j is stored in $H_j$
 if $j < \ell_n$ then $\psi_{j+1} = 0$ and $\bar\sigma_{j+1} = 0$
 else
  $d_j = (k_j, m_j) = $ MAXIMIZER($H_j$)
  $\hat\sigma_{nj} = $ CALCSIGHAT($H_j$, $d_j$)
  $D_{d_j}(P_j)(O_{j+1}) = $ CALCD($H_j$, $O_{j+1}$, $d_j$)
  $\psi_{j+1} = \dfrac{(j-\ell_n)\,\psi_j/\bar\sigma_j + \left[m_j\,\mathrm{Corr}_{P_j}(X_{k_j},Y) + D_{d_j}(P_j)(O_{j+1})\right]/\hat\sigma_{nj}}{(j-\ell_n)/\bar\sigma_j + 1/\hat\sigma_{nj}}$ ▹ By convention, 0/0 = 0.
  $\bar\sigma_{j+1} = \dfrac{j+1-\ell_n}{(j-\ell_n)/\bar\sigma_j + 1/\hat\sigma_{nj}}$
 $H_{j+1} = $ UPDATEH($O_{j+1}$, $H_j$)
 return $(\psi_{j+1}, \bar\sigma_{j+1}, H_{j+1})$

Algorithm Estimate Ψ(P0) Using Sample of Size n

function ESTPSI(n, $\ell_n$)
 Read $O_1$, $O_2$ from the data stream
 Base case: $\psi_2 = 0$, $\bar\sigma_2 = 0$, and $H_2 = $ INITIALIZEH($O_1$, $O_2$)
 for $j = 2, \ldots, n-1$ do
  Read $O_{j+1}$ from the data stream
  $(\psi_{j+1}, \bar\sigma_{j+1}, H_{j+1}) = $ RECURSION($O_{j+1}$, $\psi_j$, $H_j$, $\bar\sigma_j$, $\ell_n$)
  Remove $(O_{j+1}, \psi_j, H_j, \bar\sigma_j)$ from memory
 return point estimate $\psi_n$ and confidence interval $\left[\psi_n \pm 1.96\,\bar\sigma_n/\sqrt{n-\ell_n}\right]$
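To illustrate only the final combination step (the $\psi$ and $\bar\sigma$ recursions above maintain the same two summaries online), here is a small self-contained Python sketch; it takes the per-step one-step terms $\Psi_{d_j}(P_j)+D_{d_j}(P_j)(O_{j+1})$ and scale estimates $\hat\sigma_{nj}$ for $j=\ell_n,\ldots,n-1$ as inputs and returns a point estimate and interval of the form returned by ESTPSI. The function name and the batch (rather than streaming) computation are ours and purely illustrative.

import numpy as np

def stabilized_average(terms, sighats):
    """Inverse-scale-weighted combination of one-step terms with a Wald-type interval."""
    terms = np.asarray(terms, dtype=float)
    weights = 1.0 / np.asarray(sighats, dtype=float)   # weights sigma_hat_{nj}^{-1}
    psi = np.sum(weights * terms) / np.sum(weights)    # weighted point estimate psi_n
    sigma_bar = len(terms) / np.sum(weights)           # harmonic-mean scale sigma_bar_n
    half_width = 1.96 * sigma_bar / np.sqrt(len(terms))
    return psi, sigma_bar, (psi - half_width, psi + half_width)

A streaming implementation keeps only the running numerator and denominator of the weighted mean and the running sum of the weights, which is what RECURSION does with O(1) extra memory beyond $H_j$.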

References

  1. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press; Baltimore: 1993.
  2. Chakraborty B, Moodie EE. Statistical methods for dynamic treatment regimes. Springer; Berlin Heidelberg New York: 2013.
  3. Gaenssler P, Strobel J, Stute W. On central limit theorems for martingale triangular arrays. Acta Math Hungar. 1978;31(3):205–216.
  4. Hirano K, Porter JR. Impossibility results for nondifferentiable functionals. Econometrica. 2012;80(4):1769–1790.
  5. Luedtke AR, van der Laan MJ. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics. 2016;44(2):713–742. doi: 10.1214/15-AOS1384.
  6. Luts J, Broderick T, Wand MP. Real-time semiparametric regression. Journal of Computational and Graphical Statistics. 2014;23(3):589–615.
  7. McDiarmid C. On the method of bounded differences. Surveys in Combinatorics. 1989;141(1):148–188.
  8. McKeague IW, Qian M. An adaptive resampling test for detecting the presence of significant predictors. Journal of the American Statistical Association. 2015;110(512). doi: 10.1080/01621459.2015.1095099.
  9. Pfanzagl J. Estimation in semiparametric models. Springer; Berlin Heidelberg New York: 1990.
  10. R Core Team. R: a language and environment for statistical computing. 2014. URL http://www.r-project.org/
  11. van der Laan MJ, Lendle SD. Online targeted learning. Technical Report 330, Division of Biostatistics, University of California, Berkeley: 2014. Available at http://www.bepress.com/ucbbiostat.
  12. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York Berlin Heidelberg: 2003.
  13. van der Vaart AW. On differentiable functionals. Annals of Statistics. 1991;19:178–204.
  14. van der Vaart AW. Asymptotic statistics. Cambridge University Press; New York: 1998.
  15. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. Springer; Berlin Heidelberg New York: 1996.
  16. Welford BP. Note on a method for calculating corrected sums of squares and products. Technometrics. 1962;4(3):419–420.
  17. Xu W. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490. 2011.
  18. Zhang Y, Laber EB. Comment. Journal of the American Statistical Association. 2015;110(512):1451–1454.
