Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2018 Aug 3.
Published in final edited form as: J Am Stat Assoc. 2017 Feb 28;113(522):780–788. doi: 10.1080/01621459.2017.1285777

Parametric-rate inference for one-sided differentiable parameters

Alexander R Luedtke 1,*, Mark J van der Laan 1,
PMCID: PMC6075853  NIHMSID: NIHMS979164  PMID: 30078921

Abstract

Suppose one has a collection of parameters indexed by a (possibly infinite dimensional) set. Given data generated from some distribution, the objective is to estimate the maximal parameter in this collection evaluated at the distribution that generated the data. This estimation problem is typically non-regular when the maximizing parameter is non-unique, and as a result standard asymptotic techniques generally fail in this case. We present a technique for developing parametric-rate confidence intervals for the quantity of interest in these non-regular settings. We show that our estimator is asymptotically efficient when the maximizing parameter is unique so that regular estimation is possible. We apply our technique to a recent example from the literature in which one wishes to report the maximal absolute correlation between a prespecified outcome and one of p predictors. The simplicity of our technique enables an analysis of the previously open case where p grows with sample size. Specifically, we only require that log p grows slower than n, where n is the sample size. We show that, unlike earlier approaches, our method scales to massive data sets: the point estimate and confidence intervals can be constructed in O(np) time.

Keywords: stabilized one-step estimator, non-regular inference, variable screening

1 Introduction

Many semiparametric and nonparametric estimation problems yield estimators which achieve a parametric rate of convergence. These estimators are often asymptotically linear, in that they can be written as an empirical mean of an influence function applied to the data. Valid choices of the influence function can be derived as gradients for a functional derivative of the parameter of interest. Applying the central limit theorem then immediately yields Wald-type confidence intervals which achieve the desired parametric rate. Such problems have been studied in depth over the past several decades [Pfanzagl, 1990, van der Vaart, 1991, Bickel et al., 1993, van der Laan and Robins, 2003].

While remarkably general, these approaches rely on the key condition that the parameter of interest is sufficiently differentiable at the data generating distribution for such a gradient to exist. Statisticians are increasingly encountering problems for which parametric-rate estimation is theoretically possible but the parameter is insufficiently differentiable to yield a standard first-order expansion demanded by older techniques. For example, suppose we observe baseline covariates, a binary treatment, and an outcome occuring after treatment. We wish to learn the mean outcome under the optimal individualized treatment strategy, i.e. the treatment strategy which makes treatment decisions which are allowed to use baseline covariate information to make treatment decisions Chakraborty and Moodie [2013]. As another example, suppose we observe a vector of covariates (X1, …, Xp) and an outcome Y. We wish give a confidence interval the maximal absolute correlation between a covariate Xk and Y. The lower bound for this quantity is of particular interest since this will suffice for a variable screening procedure. Alternatively, we may only with to test the null hypothesis that the maximal absolute correlation is zero. McKeague and Qian [2015] provide a test of this null hypothesis using an adaptive resampling test (ART).

These problems belong to a larger class of problems in which one observes O1, …, On drawn independently from a P0 in some (possibly nonparametric) statistical model ℳ and wishes to estimate

Ψn(P)maxdDnΨd(P), (1)

at P = P0, where Dn is an index set that may rely on sample size and each Ψd : ℳ → ℝ is a sufficiently differentiable parameter to permit parametric-rate estimation using classical methods such as those presented in Bickel et al. [1993]. When there is no unique maximizer dDn of Ψd(P), then the inference problem is typically non-regular, in the sense that the parameter PmaxdDnΨd(P) is not sufficiently differentiable to allow the use of standard influence function based techniques for obtaining inference. Hirano and Porter [2012] showed that regular and asymptotically linear estimators fail to exist for such one-sided pathwise differentiable parameters. In univariate calculus, functions such as f(x) = max{x, 0} are one-sided differentiable at zero in that the left and right limits of [f (x + ε) − f (x)]/ε are well-defined but disagree. The same holds for the Ψn evaluated at a distribution P0, but now the one-sided differentiability is caused by the subset of Dn containing the indices which maximize the expression on the right in (1). A small fluctuation in P0 can greatly reduce the subset of maximizing indices, leading to different derivatives depending on the fluctuation taken.

In this work, we present a method which, loosely, splits the sample in such a way that the estimated index in Dn which maximizes Ψd(P0) is conditioned on so that this estimated index need not have a limit. We do this iteratively to ensure that our estimator gets the full benefit of the sample size n. When the parameter is fixed with sample size and the d maximizing Ψd(P0) is fixed, we show that our estimator is asymptotically efficient, and therefore also regular. When the maximizing index is not unique, our estimator will not typically be regular. Thus our estimator adapts to the non-regularity of the estimation problem.

Our estimator is inspired by the online estimator for pathwise differentiable parameters presented in van der Laan and Lendle [2014] and a subsequent modification of this estimator in Luedtke and van der Laan [2016] to deal with the non-regularity when estimating the mean outcome under an optimal treatment rule. Such estimators are designed to be efficient in both computational complexity and storage requirements. We show that the estimator that we present in this work inherits many of these computational efficiency properties. We apply our technique to estimate the maximal absolute correlation considered in McKeague and Qian [2015]. In this problem, we show that our estimator runs efficiently in both dimension and sample size, with a runtime of O(np). In practice, this means that the lead author can implement our estimator using only R code and screen p = 100 000 variables using n = 1 000 samples on a single core of his laptop in under a minute. Thus our estimator seems to have both the statistically efficiency that has been demanded of estimators for generations and the computational efficiency that is becoming increasingly important in this new big data era. While the method of McKeague and Qian [2015] can also be implemented in O(np) time when the number of (double) bootstrap samples remains constant, the implicit constant in this procedure implies a much longer runtime.

2 Toy Example

We first present a toy example that we will use to facilitate the presentation of our estimator. While this toy example is simple enough that one can fairly easily come up with alternative estimation strategies, we believe it provides a useful starting point for presenting our general estimation scheme. Suppose we observe an i.i.d. sample {Oj (Oj,1, Oj,2) : j = 1, …, n} of ℝ2-valued observations, drawn from some distribution P0 with ℝ2-valued mean (Ψ1(P0),Ψ2(P0))EP0[O]. For general P, similarly define (Ψ1(P), Ψ2(P)) ≡ EP[O]. Our objective is to estimate Ψ(P0)maxd{1,2}Ψd(P0). If Ψ1(P0) = Ψ2(P0), then this parameter is one-sided but not two-sided pathwise differentiable at P0 and no regular and asymptotically linear estimator exists [Hirano and Porter, 2012].

To give a reader a sense of the challenges faced by the most intuitive estimation strategy, consider the plug-in estimator that estimates Ψ1(P0) and Ψ2(P0) using the empirical means of the observations, which we denote μ^1 and μ^2, and then the estimates ψ^n=max{μ^1,μ^2}. We have that

n1/2[ψnΨ(P0)]=n1/2[μ^2Ψ2(P0)]+I{n1/2[μ^1μ^2]0}n1/2[μ^1μ^2Ψ1(P0)+Ψ2(P0)].

By the central limit theorem, n1/2[μ^dΨd(P0):d=1,2] converges to a multivariate normal Z = (Z1, Z2) with estimable covariance matrix. If Ψ1(P0) > Ψ2(P0) or Ψ2(P0) > Ψ1(P0), then the indicator above converges in probability to one or zero, respectively, and the above converges in distribution to Z1 or Z2, respectively. Both of these normal limits can be consistently estimated from the data, and if one knows that Ψ1(P0) ≠ Ψ2(P0) then the correct index for the limit is consistently estimated by I{μ^1>μ^2}. Often one may not be willing to assume that Ψ1(P0) ≠ Ψ(P0). To see the challenge that arises, note that if Ψ1(P0) = Ψ2(P0), then the right-hand side above converges to max{Z1, Z2}, which is a non-normal limit. This gives some intuition on the non-regularity of Ψ(P0) when Ψ1(P0) = Ψ2(P0): an arbitrarily small shift in Ψ1(P0) or Ψ2(P0) dramatically changes the limiting behavior of the estimator. Furthermore, it is not in general clear which limiting result one should use to approximate the distribution of ψn if the Ψ1(P0) = Ψ2(P0)+ε for some small ε > 0, so that asymptotically the limit of the estimator is Z1 but in practice max{Z1, Z2} better approximates the variability of the estimator.

To avoid this problem, we develop an estimator that naturally adapts to the (non-)regularity of the problem. For each 2 ≤ j ≤ n, let (μj,1, μj,2) represent the empirical mean of O1, …, Oj and dj ≡ argmaxdμj,d. Our estimator takes the form ψnj=2n1wjOj+1,dj for positive convex weights w2, …, wn−1 that we now define. To expedite the presentation of this toy example, suppose that we know the variance of the first and second components of O ~ P0. Denote these variances by 12 and 22, where we assume that Σ1, Σ2 ∊ (0, ∞). Let ¯n denote the harmonic mean of d2,,dn1, i.e. ¯n(1n2j=2n1djn1)1. Our convex weights are given by wj=¯ndj1.

We now aim to understand the variability of our estimator. Note that

ψnΨ(P0)=j=2n1wj[Oj+1,djΨdj(P0)]+j=2n1wj[Ψdj(P0)Ψ(P0)].

Our analysis will use that the weights wj are finite because Σ1, Σ2 ∊ (0, ∞). We first consider the second term on the right. If Ψ1(P0) = Ψ2(P0), then this term is exactly zero. Suppose Ψ1(P0) ≠ Ψ2(P0). Noting that μj,d Ψd(P0) almost surely as j → ∞, it then follows that, with probability 1, dj = argmaxd Ψd(P0) for all j large enough. It readily follows that n1/2 times the second term on the right converges to zero almost surely, and therefore also in probability. Multiplying this term by the random but finite quantity ¯n1 does not change this convergence to zero. Thus we have shown that

n1/2¯n1[ψnΨ(P0)]=n1/2¯n1j=2n1wj[Oj+1,djΨdj(P0)]+oP(1).

Finally, we note that the first term above is a martingale sum, and that each term in this sum has variance 1 thanks to the choice of weights. A standard martingale central limit theorem [e.g., Gaenssler et al., 1978] then shows that this term converges to a standard normal random variable, and standard arguments show that the interval [ψn±1.96¯nn1/2] contains Ψ(P0) with probability approaching 0.95.

3 Estimator

We will now present our technique for a general estimation problem. We now introduce the notion of pathwise differentiability, since this provides the key object needed to construct our estimator.

3.1 Pathwise differentiability

We assume that each parameter Ψd, dDn for any n, is pathwise differentiable for all distributions in our model [see, e.g., Pfanzagl, 1990, Bickel et al., 1993]. For each P ∊ ℳ, we let Dd(P) denote the canonical gradient of Ψd at P. By definition Dd(P)(O) is mean zero with finite variance under sampling from P. Typically pathwise differentiability implies that Ψd satisfies the following linear expansion for any P ∊ ℳ and dDn:

Ψd(P)Ψd(P0)=Dd(P)(o)dP0(o)+Remnd(P), (2)

where we omit the dependence of Remnd(P) on P0 in the notation and indicate its possible dependence on sample size with the subscript n. Above Remnd(P) is a second-order remainder term that is small whenever P is close to P0. We consider this condition more closely in our example, but for non-sample size dependent parameters this term can typically be made to be OP0(1/n) in a parametric model and often can be made to be oP0(1/n) in a nonparametric model. In the toy example from Section 2, the two parameters Ψd(P0) are linear, and so Dd(P)(o) = odEP[Od] and Remnd(P)=0 for all d, P. For a more thorough presentation, see Pfanzagl [1990] or Bickel et al. [1993].

Knowing the canonical gradient of a parameter enables one to implement a one-step estimator [see Section 5.7 of van der Vaart, 1998]. To ease discussion, fix d. Suppose one has an initial estimate P^n of the components of P0 needed to evaluate Ψd and Dd. Then a one-step estimate of Ψd(P0) is given by ψnd=Ψd(P^n)+1ni=1nDd(P^n)(Oi) . Under empirical process and consistency conditions on P^n, one can show that n1/2[ψndΨd(P^n)] converges in distribution to a normal random variable with estimable variance. In the next section, we present a variant of this estimator that allows for the selection of the optimizing d, even when the optimal index is non-unique.

3.2 Estimator and confidence interval

We now present a stabilized one-step estimator for problems of the type found in (1) when the required differentiability condition on Ψd holds.

Let {n} be some sequence such that nn → ∞. One possible choice is n = 0 for all n. For each j = ℓn, …, n − 1, let dnj represent an estimate of a maximizer of (1) obtained using observations (Oi : i = 1, …, j), P^nj be an estimate of P0 obtained using observations (Oi : i = 1, …, j), and P^nj D^nj equal Dd (P) evaluated at P=P^nj and d = dnj. For nonnegative weights wln,,wn1 that we will define shortly with j=lnn1wj=nln, our stabilized one-step estimate takes the form

ψn1nlnj=lnn1wj[Ψdnj(P^nj)+D^nj(Oj+1)].

Our proposed 95% confidence interval has the form

[LBn,UBn][ψn±1.96σ¯nnln],

where we will define σ¯n momentarily and one can replace 1.96 by the desired quantile of the normal distribution to modify the confidence level.

We now define the weights. Let σ¯nj2 represent an estimate of the variance of D^nj(O), O ~ P0, conditional on observations O1, …, Oj. This estimate should only rely on those j observations. Often we can let

σ^nj21ji=1j[D^nj(Oi)1ji=1jD^nj(Oi)]2,j=ln,,n1.

The standard deviation type variable in the confidence interval definition is given by σ¯n(1nlnj=lnn1σ^nj1)1, and the weights are given by wjσ¯nσ^nj1, where we have omitted the possible dependence of the weights on sample size in the notation.

Our estimator is similar to the online one-step estimator developed in van der Laan and Lendle [2014] for streaming data, but it weights each term proportionally to the estimated inverse standard deviation of D^nj(O) when O ~ P0. Our confidence interval takes a form similar to a Wald-type confidence interval, but replaces the typical standard deviation with σ¯n and has width on the order of 1/nln rather than 1/n Note of course that n = o (n) implies that 1/nln1/n converges to zero.

3.3 Validity of confidence interval

We now prove the validity of our confidence interval. Let σnj2VarP0(D^nj(O)|O1,,Oj). The validity of the lower bound of the confidence interval relies on the following conditions:

  • C1)

    There exists some M < ∞ such that 1nlnj=lnn1P0(|D^nj(O)|σ^nj<M|O0,,Oj1)1 in probability as n → ∞.

  • C2)

    1nlnj=lnn1|σnj2σ^nj21|0 in probability as n → ∞.

  • C3)

    1nlnj=lnn1σ^nj1Rem^nj0 in probability as n → ∞., where Rem^njRemdnj(P^nj).

    The validity of the upper bound requires the following additional condition:

  • C4)

    1nlnj=lnn1σ^nj1[Ψdnj(P0)Ψn(P0)] converges to zero in probability as n → ∞.

We now present our main result.

Theorem 1

(Validity of confidence interval). If C1), C2), and C3) hold, then

liminfnPr(Ψn(P0)LBn)1α/2.

If C4) also holds, then

limnPr(LBnΨn(P0)UBn)=1α.
Proof

The definition ψn combined with (2) yield that

nlnσ¯n1[ψnΨn(P0)]=1nlnj=lnn1σ^n1(D^nj(Oj+1)EP0[D^nj(O)|O1,,Oj])+1nlnj=lnn1σ^n1[Ψdnj(P0)Ψn(P0)+Rem^nj]. (3)

The second line converges to zero in probability by C3) and C4). By C1), C2), and the martingale central limit theorem for triangular arrays in Gaenssler et al. [1978], (3) converges in distribution to a standard normal random variable. A standard Wald-type confidence interval construction argument shows that the confidence interval has coverage approaching 1 − α under C1) through C4).

Now suppose C4) does not hold. By (1), j=lnn1σ^nj1[Ψdnj(P0)Ψn(P0)]0. The same argument readily shows the validity of the lower bound under only C1), C2), and C3). □

The conditions of the theorem are discussed in Appendix A.1. Appendix A.2 considers the asymptotic efficiency of our estimator when the parameter in (1) does not rely on sample size. High level conditions are provided, and we then argue that these conditions are plausible when the maximizing index in (1) is unique. Appendix A.3 discusses computationally efficient implementations of our general estimator.

4 Maximal correlation example

4.1 Problem formulation

We now present the running example of this work, namely the maximal correlation estimation problem considered by McKeague and Qian [2015]. The observed data structure is O = (X, Y), where X = (Xk : k = 1, …) is a [−1, 1] vector of predictors and Y is an outcome in [−1, 1]. For each n, we let Kn represent a subset of these predictors of size p, where throughout we assume that

βn2logpn0asn. (4)

For readability, we omit the dependence of p on n in the notation. Under a distribution P, the maximal absolute correlation of a predictor with Y is given by

Ψn(P)maxkKn|CorrP(Xk,Y)|, (5)

where CorrP(Xk, Y) is the correlation of Xk and Y under P. We wish to develop confidence intervals for Ψn(P0). When a test of H0 : Ψn(P0) = 0 against the complementary alternative is of interest, we also wish to establish the behavior of our test against local alternatives as was done in McKeague and Qian [2015].

In contrast to McKeague and Qian [2015], the procedure that we present in this work:

  1. is proven to work when p grows with sample size at any rate satisfying (4);

  2. yields confidence intervals for the maximal correlation rather than just a test of the null hypothesis that it is equal to zero;

  3. allows a non-null maximizer in (5) to be non-unique.

While McKeague and Qian argued that 3) is unlikely in practice, having two non-null maximizers be approximately equal may still have finite sample implications for their test in some settings.

We now show that this problem fits in our framework. To satisfy the pathwise differentiability condition, we let and, for each d=(k,m)Dn,

Ψd(P)mCorrP(Xk,Y).

Note that Ψn(P) now takes the form in (1), where we note that the use of m in the definition of Ψd serves to ensure that Ψn(P0) represents the correlation with the maximal absolute value.

4.2 Differentiability condition

Canonical gradients

For each k, let sP2(Xk)VarP(Xk), and likewise for sP2(Y). For ease of notation we let s02(Xk)sP02(Xk), and likewise for s02(Y) and Corr0(Xk, Y). An application of the delta method shows that Ψd has canonical gradient Dd(P)(o) given by

m×((xkEP[Xk])(yEP[Y])sP(Xk)sP(Y)12CorrP(Xk,Y)[(xkEP[Xk])2sP2(Xk)+(yEP[Y])2sP2(Y)]).

In order to ensure that Dd(P0) is uniformly bounded for all d, we assume throughout that, for some δ ∊ (0, 1],

min{s0(Y),s0(X1),s0(X2)}>δ.

Second-order remainder

Fix d=(k,m)Dn and P ∈ ℳ. Let δ(0,1] be some constant such that both sP(Xk) and sP(Y) are larger than δ. Lemma A.3 in Appendix B.1 proves that

|Remd(P)|δ1(|sP(Xk)sP(Y)s0(Xk)s0(Y)||CorrP(Xk,Y)|+(EP[Xk]EP0[Xk])2+(EP[Y]EP0[Y])2)+s02(Y)sP2(Y)[sP(Xk)s0(Xk)]2+s02(Xk)sP2(Xk)[sP(Y)s0(Y)]2). (6)

The first term above is small if sP(Xk), sP(Y), and CorrP(Xk, Y) are close to s0(Xk), s0(Y), and Corr0(Xk, Y). The middle terms are small if EP[Xk] and EP[Y] are close to EP0[Xk] and EP0[Xk]. The final terms are small if sP(Xk) and sP(Y) are close to s0(Xk) and s0(Y).

Variance of canonical gradients

For any given d, there is no elegant (and informative) expression for VarP0[Dd(Pj)(O)]. Nonetheless, we show in Lemma A.6 of Appendix B.1 that our estimates σ^nj2, taken as the sample variance of Ddnj(Pj)(O) for an index estimate dnj to be defined in the next subsection, concentrate tightly about σnj2 with high probability when the sample size is large enough. Thus, in practice, one can actually check if σnj2 is small by looking at σ^nj2. If P0 is normal, then this variance is equal to [1CorrP0(Xk,Y)2]2, and so is only zero if CorrP0(Xk,Y)=1. Though such an elegant expression does not exist for the variance of Dd(P0)(O) for general distributions, one can still show in general that the variance of Dd(P0) is equal to zero only if CorrP0(Xk,Y)=1. Here we make the slightly stronger assumption that

infn2min(k,m)Kn×{1,1}VarP0(Dd(P0)(O))γ>0. (7)

4.3 Our estimator

We will use the estimator presented in Section 3 to estimate Ψn(P0). At each index j ≥ ℓn we use the empirical distribution Pj of the observations O1, …, Oj to estimate P0. We let our optimal index estimate dnj ≡ (knj, mnj), where knjargmaxkKnCorrPj(Xk,Y) and mnjsgn[CorrPj(Xknj,Y)]. We estimate σnj2 with the variance of D^nj(O) under Pj.

In Appendix B.1, we detail conditions on n which ensure that n does not grow too slowly or quickly. For any ε ∊ (0, 2), one possible choice of n that satisfies these conditions is

ln=max{(logmax{n,p})1+ε,nexp(βn2+ε)}. (8)

We show that this choice of n ensures C1), C2), and C3) in Appendix B.1. By Theorem 1 this establishes the validity of the lower bound of our confidence interval. We can also show that this lower bound is tight up to a term of the order n−1/4βn.

Theorem 2

(Tightness of the lower bound). For any sequence tn → ∞, Ψn(P0) < LBn + tnn−1/4βn with probability approaching 1.

We note that the choice of n in (8) is only needed if the set of indices Kn changes with sample size. For fixed Kn, one could take n fixed, e.g. n = 2, and still have a valid lower bound provided the estimates of σnj are truncated from below at some δ>0 (see Lemma A.1). While choosing n according to (8) is still advisible since this is what will enable us to study the behavior of a hypothesis testing procedure under local alternatives, this invariance to n should at least reassure the user that most choices of n will perform reasonably well. In Luedtke and van der Laan [2016], we evaluated the stabilized one-step estimator on a variety of choices of n and found little sensitivity to this tuning parameter. Nonetheless, we consider the development of a data adaptive selection procedure for choosing an n satisfying C1), C2), and C3) an important area for future work. In parallel to how McKeague and Qian [2015] used the bootstrap to select their tuning parameter, one might consider using the bootstrap to select n, though it remains to determine an appropriate criterion for selecting n. Because our n-specific lower bound is defined using a normal limiting result rather than the bootstrap, such a selection procedure would avoid the use of a computationally burdensome double bootstrap.

We now consider the validity of the upper bound of our confidence interval, which holds under C4). This condition is trivially valid if Ψn(P0) = 0 for all n. Condition C4) is also valid under the following margin condition:

MC) For some sequence tn → ∞, there exists a sequence of non-empty subsets KnKn such that, for all n,

supkKn|CorrP0(Xk,Y)|infkKn|CorrP0(Xk,Y)|=o(n1/2),infkKn|CorrP0(Xk,Y)|supkKn\Kn|CorrP0(Xk,Y)|+tnn1/4βn.

If Kn=Kn, then the supremum over Kn\Kn is taken to be zero.

Theorem 3

(Validity of the upper bound). If MC) or Ψn(P0) = 0 for all n, then C4) holds so that LBn ≤ Ψn(P0) ≤ UBn with probability approaching 1 − α.

We outline the techniques used to prove these two results at the end of this subsection. Complete proofs are given in Appendix B.1.

Suppose we wish to test H0 : Ψn(P0) = 0 against H1 : Ψn(P0) > 0. Consider the test that rejects H0 if LBn > 0. We wish to explore the behavior of this test under local alternatives where Ψn(P0) converges to zero slower than n−1/4βn. Theorem 2 shows that this test has power converging to one under such local alternatives. Furthermore, as the lower bound is valid in general, this test has type I error of at most α/2 under the null. This is indeed an exciting result as it enables the study of local alternatives even when dimension grows quickly with sample size. If dimension does not grow with sample size, this shows that we can detect against any alternatives converging to zero slower than n1/2logn. We would not be surprised if the logn is unnecessary, but rather that it is simply a result of our proof techniques which give high probability bounds on the concentration of our correlation estimates at each sample size. McKeague and Qian [2015] showed that their method is consistent against a class of alternatives converging to zero slower than n−1/2 provided the optimal index is unique. Our result does not rely on this uniqueness condition. We emphasize that we only used MC) to establish the validity of the upper bound of our confidence interval. Our lower bound, and therefore our ability to reject the null of uniformly zero correlation, is valid even without this margin condition.

Theorem 3 shows that the upper bound of our confidence interval is also valid under a reasonable margin condition. The margin condition states that there may be many non-null approximate maximizers provided their absolute correlations are well-separated from the absolute correlations of the other predictors with Y. By “approximate” we mean that their absolute correlations all fall within o(n−1/2) of one another. If Kn does not depend on sample size, then this theorem shows that our two-sided confidence interval is always valid.

Sketch of proofs of Theorems 2 and 3

Our proofs of both of these theorems rely on high-probability bounds of the absolute differences between our estimates of sPj2(Xk), sPj2(Y), CorrPj(Xk,Y), and σ^nj and their population counterparts, uniformly over kKn and j. We show that, with probability at most 1 − 1/n, all of these absolute differences are upper bounded by constants (with explicit dependence on γ and δ) times j−1/2 log max{n, p}.

Condition C1) follows once we show that, with high probability, sPj2(Xk) and sPj2(Y) are bounded below by δ/2 and σ^nj2. is bounded below by γ/2 uniformly over jn for n large enough. Condition C2) and C3) are easy consequences of our concentration results. The concentration results also yield that

1nlnj=lnn1σ^nj1[Ψdnj(P0)Ψn(P0)]=OP0(n1/4βn),

which then quickly yields Theorem 2 thanks to the expression in (3).

Now suppose MC) holds. By our concentration inequalities, we select a knjKn for each jCtn1n with high probability, where C is a constant. We also correctly specify mnj to be the sign of Corr0(Xknj,Y). Because all of the absolute correlations in Kn are small, the difference between Ψdnj(P0) for dnj = (knj, mnj) and Ψn(P0) is very small. If ln<Ctn1, then we can apply our concentration inequalities to establish that these first few values of j for which j<Ctn1 are small enough so that C4) still holds, yielding Theorem 3. □

In Appendix B.2, we show that our estimator runs in O(np) time. We show that the estimate can be computed using O(p) storage when the observations O1, …, On arrive in a data stream. This result is closely related to the fact that, for a ℝp-valued sequence {ti}, the sum Sji=1jti at j = n can be computed in time O(np) using storage O(p). In particular, one can use the recursion relation Sj = tj + Sj1, thereby only storing tj and Sj1 when computing Sj. Our estimate can also be computed in O(np) time and O(n) storage when the vectors (Xjr : j = 1, …, n) ℝn arrive in a stream for r = 1, 2, …, p, where Xjr is the observation of Xr for individual j. We do not prove the O(n) storage result in the appendix due to space constraints, though the algorithm is closely related to that given in Appendix B.2.

5 Simulation study

We now consider the power and scalability of our method using the simulations similar to those described in McKeague and Qian [2015]. Let X ~ MVN(0, Σ) for Σ a p × p covariance matrix to be given shortly, and τ1, …, τp be a sequence of i.i.d. normal random variables independent of all other quantities under consideration. We will use two types of errors: the homoscedastic error τ1 and the heteroscedastic error η(X)k=1pXkτk/p. For (n, p) = (200, 200), (500, 2000), we generate data using the following distributions: (N.IE) Y = τ1, (A1.IE) Y = X1/5 + τ1, (A2.IE) Y=0.15k=15Xk0.1k=610Xkτ1, (N.DE) Y = η(X), (A1.DE) Y = X1/5 + η(X), and (A2.DE) Y=0.15k=15Xk0.1k=610Xk+η(X). For (n, p) = (2 000, 30 000), we generate data using the following distributions: (N.IE) Y = τ1, (A3.IE) Y = X1/15 + τ1, and Y=0.03k=15Xk0.015k=610Xk+τ1. We set all of the diagonal elements in the covariance matrix Σ equal 1, and the off-diagonal elements equal p, where for each simulation setting we let ρ = 0, 0.25, 0.5, 0.75. Unless otherwise specified, all simulations are run using 1 000 Monte Carlo simulations in R [R Core Team, 2014]. Code is available in the Supplementary Materials.

We conduct a 5% test of Ψ(P0) > 0 by checking if the lower bound of a 90% confidence interval for this quantity is greater than zero. We use models N.IE and N.DE to evaluate type I error and all other models evaluate power. We run our method with n as in (8), where we let ε = 0.5. For ease of implementation, we compute our method on chunks of data of size (nn)/10 (see Section 6.1 of Luedtke and van der Laan, 2016).

We compare our method to the ART of McKeague and Qian [2015]. The ART relies on a tuning parameter λn satisfying λn/n0 and λn →∞ that is selected via a double bootstrap procedure. We implemented code that we obtained from the authors (McKeague and Qian) that selects λn=alogn from a grid of a varying between 0.5 and 4. Due to computational limitations, we ran 400 outer bootstrap samples and 200 inner bootstrap samples (rather than the default of 1 000 samples for both layers of bootstrap), and also reduced the grid for a from the default (0.5, 0.55, …, 4) to (0.5, 0.6, …, 4). We also reduced the number of Monte Carlo replicates for the ART to 200 and only ran ART on the smallest sample size (n, p) = (200, 200). While we were not able to run the double bootstrap at the moderate sample size (n, p) = (500, 2000) due to computational constraints, we were able to mimic the double bootstrap procedure by selecting an oracle choice of λn=alogn. In particular, we ran ART for the fixed choices of a = 0.5, 2.25, 4, found that a = 4 appropriately controlled type I error while the other choices of a typically did not, and reported the results of ART at this fixed tuning parameter. We were unable to run even the oracle procedure at the largest sample size with due to computational constraints.

We also compared our procedure to the analogue of ART described in Section 2 of Zhang and Laber [2015], where this analogue does not require running a double bootstrap. This latter procedure is referred to as the “parametric bootstrap” in Zhang and Laber [2015], though to avoid confusion with other bootstrap procedures here we refer to their method as “ZL”. The ZL procedure assumes a locally linear model with homoscedastic errors. Note that the homoscedasticity requirement is stronger than the uncorrelated error requirement made by the ART. In fact, the errors are guaranteed to be uncorrelated with the predictors under the null of zero maximal absolute correlation, thereby ensuring the type I error control of ART. We use 500 bootstrap draws for each run of the ZL procedure. Zhang and Laber show that their method, which does not involve running a computationally burdensome double bootstrap procedure, has comparable performance to ART across sample sizes and predictor dimension, while being more computationally efficient. The ZL procedure is less computationally intensive than the ART, but still requires estimating the p × p covariance matrix Σ and simulating from a N(0,^) distribution. Due to computational constraints, we only run ZL for p ≤ 2 000 and not for p = 30 000. We also compare our method to a Bonferroni-corrected t-test.

Figures 1 displays the power of the four testing procedures for (n, p) equal to (200, 200) and (500, 2000) for the homoscedastic data generating distributions N.IE, A1.IE, and A2.IE. The ART and ZL procedures perform best in both of these settings. We can show (details omitted) that our method underperforms in this setting due to the second-order term representing the cost for estimating d0 on subsets of the data of size jn early on in the procedure. While Theorem A.11 ensures that the estimate of d0 will be asymptotically valid, there appears to be a noticeable price to pay at small sample sizes.

Figure 1.

Figure 1

Power of the various testing procedures for (n, p) equal to (200, 200) and (500, 2000) under homoscedastic errors. The ART and ZL procedure performs the best in this setting.

Figures 2 displays the power of the three testing procedures for (n, p) equal to (200, 200) and (500, 2000) for the heteroscedastic data generating distributions. The ZL procedure fails to control the type I error in this setting. This is unsurprising given that this test was developed under a local linear model with independent errors. All other methods adequately control type I error in this setting, especially at the larger sample size n = 500, while we see that the Bonferroni and ART procedures achieves slightly better power than our method for these data generating distributions.

Figure 2.

Figure 2

Power of the various testing procedures for (n, p) equal to (200, 200) and (500, 2000) under heteroscedastic errors. The ZL procedure fails to control the type I error in this setting.

Figure 3 displays the power of our method and the Bonferroni procedure for (n, p) equal to (2 000, 30 000). While (unsurprisingly) Bonferroni performs well when the correlation between the predictors in X is low, our method outperforms the Bonferroni procedure when the correlation increases. We expect that, were we able to run ART or ZL at this sample size, they would outperform all other methods under consideration as they did at the smaller sample sizes. Nonetheless, both methods quickly become computationally impractical when p gets large, whereas our procedure and the Bonferroni procedure can still be implemented at these sample sizes.

Figure 3.

Figure 3

Power of the test from the stabilized one-step and from the Bonferroni-adjusted t-test for (n, p) equal to (2 000, 30 000) under homoscedastic errors.

We also ran our method at different choices of n for (n, p) = (200, 200) and (500, 2 000) (details not shown), namely defined according to (8) with ε = 0.25, 1, 1.5, 1.75. We found little sensitivity to the choice of ε, with the exception that choosing ε = 1.75 often led to a moderate loss of power (at most 15% on an additive scale). This is not surprising given that, at ε = 1.75, n is approximately equal to n/2 for both (n, p) settings.

6 Discussion

We have presented a general method for estimating the (possibly non-unique) maximum of a family of parameter values indexed by dDn. Such an estimation problem is generally non-regular because minor fluctuations of the data generating distribution can change the subset of Dn for which the corresponding parameter is maximized. Our estimate takes the form of a sum of the terms of a martingale difference sequence, which quickly allows us to apply the relevant central limit theorem to study its asymptotics and develop Wald-type confidence intervals. The estimator adapts to the non-regularity of the problem, in the sense that we can give reasonable conditions under which it is regular and asymptotically linear when the maximizer is unique so that regularity is possible.

We have applied our approach to the example of McKeague and Qian [2015] in which one wishes to learn about the maximal absolute correlation between a prespecified outcome and a predictor belonging to some set. The sample splitting that is built into our estimator has enabled us to analyze the estimator when the dimension p of the predictor grows with sample size slowly enough so that n−1/2 log p → 0 as n goes to infinity. While McKeague and Qian focus on testing the null hypothesis that this maximal absolute correlation is zero, we have established valid confidence intervals for this quantity. The lower bound of our confidence interval is particularly interesting because it is valid under minimal conditions. When p is very large, one might expect that the null of no correlation between the outcome and any of the predictors is unlikely to be true. In these problems, having an estimate of the maximal absolute correlation, or at least a lower bound for this quantity, will likely still be interesting as a measure of the overall relationship between X and Y.

We have also studied the behavior of this null hypothesis test under local alternatives, showing that our test is consistent when the maximal absolute correlation shrinks to zero slower than n−1/2(log max{n, p})1/2. When the dimension of the predictor is fixed, the test of McKeague and Qian is consistent against alternatives shrinking to zero more slowly than n−1/2 rather than (log n)1/2n−1/2. We would not be surprised to find that this (log n)1/2 is unnecessary for p fixed and can be removed using more refined proof techniques.

McKeague and Qian do not require that Y and the coordinates of X have range in [−1, 1]. We have made this boundedness assumption out of convenience for our proofs and expect that we can replace the boundedness assumptions with appropriate moment assumptions without significantly changing the results. Our simulation results support this claim. The boundedness condition is not as restrictive as it may first seem, as unbounded X and Y can be rescaled to be to be bounded. Since the sharp null H0 : Ψn(P0) = 0 is invariant to strictly monotonic transformations of X and Y, our theoretical results yield a valid of H0 test after applying, e.g., the sigmoid transformation to X and Y.

We note that, in our simulations, ART and ZL achieve the highest power among competing methods, though for our heteroscedastic simulation setting ZL failed to control the type I error. We were not able to run either of these methods at our largest sample size due to computational constraints. The ZL procedure as currently described is computationally expensive and does not scale well to large data sets, especially when the dimension of the predictor p is large. This difficulty occurs because the procedure requires the computation of a p × p covariance matrix. The ART method presented in McKeague and Qian [2015], which achieves similar power to ZL, is in practice even more computationally burdensome due to its use of a double bootstrap. Nonetheless, from a theoretical computational complexity standpoint, the ART method can be made to scale as O(np) provided the number of bootstrap draws remains fixed. Though the number of bootstrap samples will likely be fixed in practical applications, we note that ART cannot maintain consistency against local alternatives unless the number of bootstrap samples grows with sample size, thereby yielding a slower than order-np runtime. As is to be expected from marginal screening procedures that perform an O(n) screening operation p times, our method attains an O(np) runtime. This computational efficiency, combined with the asymptotic theory supporting our method’s power against local alternatives under increasing covariate dimension and efficiency under fixed alternatives and covariate dimension, demonstrates what is achievable by marginal screening procedures. Given our simulations, we also believe that developing rigorous asymptotic theory under increasing dimension for the ART methods is an important area for future work.

The stabilized one-step estimator presented in this paper applies to many other situations not considered in this paper. In an earlier work, we showed that this estimator is useful for estimating the mean outcome under an optimal individualized treatment strategy Luedtke and van der Laan [2016], where the class Dn now indexes functions mapping from the covariate space to the set of possible treatment decisions. Thanks to the martingale structure of our estimator, the stabilized one-step estimator can be used to construct confidence intervals when the data is drawn sequentially so that the data generating distribution for observation j can depend on that of the first j − 1 observations. One interesting example along these lines is to obtain inference for the value of the optimal arm in a multi-armed bandit problem, even in the case where the optimal arm is non-unique and the reward distributions for the optimal arms have different variances. We look forward to seeing further applications of the general template for a stabilized one-step estimator that we have presented in this paper.

Supplementary Material

Appendix

Appendix A General estimator

A.1 Discussion of conditions of Theorem 1

In this section, we consider the setting where the parameter in (1) does not depend on sample size, and consequently omit the n subscript to quantities which no longer depend on sample size. We will show that C7) and the following conditions imply the conditions of Theorem 1:

  • C9)

    σ^j2σj2 converges to zero in probability as j → ∞.

  • C10)

    jRem^jjRemdj(P^j) converges to zero in probability as as j → ∞.

The validity of the upper bound requires the following additional condition:

  • C11)

    j[Ψdj(P0)Ψ(P0)] converges to zero in probability as as j → ∞.

For simplicity, we will take n = 0 in this section.

We now discuss the conditions. Condition C1) is an immediate consequence of C7) and Dd(P)(o) being uniformly bounded in P ∈ ℳ, dD, oO. This will be plausible in many situations, including the examples in this paper. A more general Lindeberg-type condition also suffices [see Condition C1 in Luedtke and van der Laan, 2016], though we omit its presentation here for brevity.

The other three conditions all rely on terms like 1nj=0n1Rj converging to zero in probability, possibly at some rate. Ideally we want a stochastic version of the fact that, for β ∊ [0, 1),

1nj=1njβ1n1njβdjnβ1βwhennislarge. (A.1)

Lemma 6 of Luedtke and van der Laan [2016] establishes this result. We restate it here for convenience.

Lemma A.1

(Lemma 6 in Luedtke and van der Laan, 2016). Suppose that Rj is some sequence of (finite) real-valued random variables such that Rj=oP0(jβ) for some β ∊ [0, 1), where we assume that each Rj is a function of {Oi : 1 ≤ i ≤ j}. Then,

1nj=0n1Rj=oP0(nβ).

Conditions C2) through C4) are now easily handled. Condition C2) is a consequence of the fact that

1nj=0n1|σj2σ^j21|γ11nj=0n1|σ^j2σj2|0inprobabilityasn,

where the inequality holds by C7) and the convergence holds by C9) Lemma A.1. Condition C9) is easily shown to hold under Glivenko-Cantelli conditions on the estimators P^j and dj [see, e.g., Theorem 7 in Luedtke and van der Laan, 2016]. Conditions C3) and C4) are an immediate consequence of C10) and C11) combined with Lemma A.1.

While sufficient conditions for C11) should be developed in each individual example, we can give intuition as to why this condition should be reasonable. For any P ∈ ℳ, let d(P) return a maximizer of (1). We are interested in ensuring that Ψdn(P0)Ψd(P0)(P0) is small, where dn is our estimate of a maximizer of (1). This can be expected to hold when the parameter P ↦ Ψd(P)(P0) has pathwise derivative zero at P = P0, where the P0 in the Ψ argument is fixed. When well-defined, the pathwise derivative will be zero because d(P) is chosen to maximize Ψd(P0) in d.

A.2 Efficiency when the maximizer in (1) is unique

We have presented a parametric-rate estimator for Ψn(P0), but thus far we have not made any claims about the efficiency of our estimator. In this section, we consider a fixed parameter in (1) that does not rely on sample size. We therefore omit the n subscript in many quantities to indicate their lack of dependence on sample size. We will give conditions under which our estimator is asymptotically efficient among all regular, asymptotically linear estimators. The efficiency bound is not typically well-defined when the maximizer is non-unique due to the non-regularity of the problem - generally in this case no regular, asymptotically linear estimator exists, so neither does an efficient member of this class [Hirano and Porter, 2012]. Thus the conditions that we give in this section will typically only hold when the maximizer d0D in (1) is unique.

We use the following additional assumptions for our efficiency result:

  • C5)

    EP0[(D^j(O)Dd0(P0)(O))2|O1,,Oj]0inprobabilityasj.

  • C6)

    There exists some M < ∞ such that P0(Dd0(P0)(O)<M) and P0(D^j(O)<M) with probability approaching 1 as j → ∞.

  • C7)

    infj1σ^j2>γ with probability 1 over draws of (Oj : j = 0, 1, …).

We discuss the conditions immediately following the theorem.

Theorem A.2

(Asymptotic efficiency). Suppose that Ψ does not depend on sample size and is pathwise differentiable with canonical gradient Dd0(P0). Further suppose that n = o(n). If C1) through C7) hold, then

σ¯n2VarP0(Dd0(P0)(O))in probability asn .

Furthermore,

ψnΨ(P0)=1ni=1nDd0(P0)(Oi)+oP0(n1/2).

Thus, ψn is asymptotically efficient among all regular, asymptotically linear estimators.

The proof is entirely analogous to the proof of Corollary 3 in Luedtke and van der Laan [2016] so is omitted. See Lemma 25.23 of van der Vaart [1998] for a proof of the fact that asymptotic linearity with the influence function given by the canonical gradient implies regularity.

The additional conditions needed for this result over Theorem 1 are mild when the maximizing index is unique. Condition C5) says that Ψ should have the same canonical gradient as Ψd0. While this should be manually checked in each example, it will be fairly typical when the maximizer is unique, since in this case an arbitrarily small fluctuation of P0 will generally not change the maximizer. This is similar to problems in introductory calculus where the derivative at the maximum is zero. Condition C5) requires that D^j(O) converge to Dd0(P0)(O) in mean-squared error, which is to be expected if P^nj begins to approximate P0 and dnj converges to the unique maximizer d0 as n, j → ∞. Condition C6) is a bounding assumption on the canonical gradient and estimates thereof that will hold in many examples of interest. Finally, Condition C7) will hold if one knows that VarP0[Dd(P)(O)] is bounded away from zero uniformly in P ∈ ℳ and dD, and uses this knowledge to truncate σ^j2 for some deterministic sequence γj → 0. For γj sufficiently small and j sufficiently large this truncation scheme will then have no effect on the variance estimates σ^j2.

A.3 Computationally efficient implementation

There are several computationally efficient ways to compute our estimate. In Section 6.1 of Luedtke and van der Laan [2016], we show that the runtime of our estimator can be dramatically improved by running the algorithm used to compute each P^j a limited number of times, say ten times. We do not detail this approach here, though we note that the theorems we have presented are general enough to apply to this case.

An alternative approach to improve runtime is to use the estimator’s online nature to compute it efficiently both in time and storage. Suppose that we have an algorithm to update the estimate P^nj of P0 to the estimate P^n(j+1) based on the first j observations by looking at Oj+1 only. This will often be feasible if the parameter of interest and the bias correction step only require estimates of certain components of P0, e.g. of a set of regression and classification functions. In these cases we can apply modern regression and classification approaches to estimate these quantities [see, e.g., Xu, 2011, Luts et al., 2014]. Often dnj· can also be obtained using online methods, and thus 1nlnj=lnn1[Ψdnj(P^nj)+D^nj(Oj+1)] can be estimated online by keeping a running sum. This quantity is not equal to ψn because it does not yet include the weights.

It will not in general be possible to compute the weights online, though their computation does not require storing O(n) observations in memory. We can estimate VarP0(D^nj(O)) consistently using the rj observations, where rj → ∞ but can grow very slowly (even log j suffices asymptotically, though such a slow growth is not recommended for finite samples). Given online estimates of these variances, it is then straightforward to compute both σ¯n and the weights and incorporate these into our estimator. In some cases, we can compute the weights, and thus the estimate, in a truly online fashion. Describing general sufficient conditions for this appears to be difficult, but we conjecture that often this will not typically hold if Dn is not of finite cardinality. The weights can be computed online in the maximal correlation example.

Appendix B McKeague and Qian [2015] example

B.1 Proofs and results

Lemma A.3

Fix δ>0 and dDn. For any P with minkKnsp2(Xk)>δ and sp2(Y)>δ, (6) holds. Proof. Straightforward but tedious calculations show that

Remd(P)=m(1sP(Xk)sP(Y)[sP(Xk)sP(Y)s0(Xk)s0(Y)][CorrP(Xk,Y)Corr0(Xk,Y)](EP[Xk]EP0[Xk])(EP[Y]EP0[Y])sP(Xk)sP(Y)CorrP(Xk,Y)2[(EP[Xk]EP0[Xk])2sP2(Xk)+(EP[Y]EP0[Y])2sP2(Y)]CorrP(Xk,Y)2sP2(Xk)sP2(Y)[sP(Xk)s0(Y)s0(Xk)sP(Y)]2). (A.2)

The result follows by taking the absolute value of both sides, applying the triangle inquality, using that ab ≤ (a2 + b2)/2 for any real a, b, CorrP(Xk, Y) ≤ 1, and the lower bound δ on the variances. □

We now establish high probability bounds on the difference between sPj2(Xk), sPj2(Y), CorrPj(Xk,Y) and σ^nj and their population counterparts, uniformly over kKn and j. We will use ≲ to denote “less than or equal to up to a universal multiplicative constant”. Let Fn denote the following class of functions mapping from O× to the real line:

{(x,y)xkrys:0r,s,4;r+s4kKn}. (A.3)

Note that |Fn|p. We will use this class to develop concentration results about our estimates the needed portions of the likelihood. This class is actually somewhat larger than is needed for most of our results, as in fact

{(x,y)xky:kKn}{(x,y)xk:kKn}{(x,y)y}{(x,y)xk2:kKn}{(x,y)y2}

suffices for concentrating our estimates of Corr0(Xk, Y), s0(Xk), and s0(Y). Nonetheless, using this larger class Fn will allow us to prove results about the concentration of σ^nj2 about σnj2 and just stating it as a single class is convenient for brevity.

For fFn and j ∈ {1, …, n}, define the empirical process as

Gnj1ji=1j[f(Oi)P0f]=j(PjP0)f,

where we use Pj denote the empirical distribution of O0, …, Oj−1 and Pf ≡ EP[f(O)] for any distribution P. Let GnjFnsupfFn|Gnj|. By Theorem 2.14.1 in van der Vaart and Wellner [1996] shows that

EGnjFnlog#Fnlogp, (A.4)

where the expectation is over the draws O1, …, Oj. We have used that our class is bounded by the constant 1.

Let

Knjj1/2logmax{n,p}. (A.5)

Define the events

Anj{maxfFn|(PjP0)f|CKnj}forallj=1,,n,Anj=1nAnj,

where C in the definition of Anj is equal the smallest universal constant satisfying (A.4) plus 1.

Lemma A.4

For any sample size n, the event An occurs with probability at least 1−n/max{n2, p} ≥ 1 − 1/n.

Proof

We first upper bound the probability of the complement of Anj for each n, j. Fix n and j ≤ n. By the bounds on X and Y, changing one Oi in (O1, …, Oj) to some other value in the support of P0 can change b by at most 1/j. Thus (O1,,Oj)GnjFn satisfies the bounded differences property with bound 1/j, and we may apply McDiarmid’s inequality [McDiarmid, 1989] to show that, with probability at most 1 − exp(−2t2), GnjFnEGnjFn+t. Choosing t=logmax{n2,p}2 and using (A.4) yields that, with probability at least 1−1/max{n2, p}, the following inequality holds for all j = 1, …, n:

GnjFnEGnjFn+logmax{n2,p}2C'logp+logmax{n,p}Clogmax{n,p},

where C′ denotes the universal constant in (A.4).

By DeMorgan’s laws and a union bound, it follows that the event AnjAnj occurs with probability at least 1 − n/ max{n2, p} ≥ 1 − 1/n. □

We have shown that An occurs with high probability. Now we show that our estimates of variances, covariances, and correlations perform well when An occurs.

Lemma A.5

Fix a sample size n ≥ 2. The occurrence of An implies that, for all j = 2, …, n:

  • 1)

    maxkKn|sPj(Xk)s0(Xk)|δ1/2Knj;

  • 2)

    maxkKn|sPj2(Xk)s02(Xk)|Knj;

  • 3)

    |sPj(Y)s0(Y)|δ1/2Knj;

  • 4)

    |sPj2(Y)s02(Y)|Knj;

  • 5)

    maxkKn|CorrPj(Xk,Y)CorrP0(Xk,Y)|δ1Knj,

where we define CorrPj(Xk,Y)=0 when either sPj(Xk) or sPj(Y)is equal to zero.

Proof

Suppose An holds and fix kKn. The triangle inequality and the bounds on Xk yield that

|sPj2(Xk)s02(Xk)|=|(EPj[Xk2]EP0[Xk2])(EPj[Xk]+EP0[Xk])(EPj[Xk]EP0[Xk])||EPj[Xk2]EP0[Xk2]|+2|EPj[Xk]EP0[Xk]|Knj.

This gives 2). For 1), note that

|sPj(Xk)s0(Xk)|=|sPj2(Xk)s02(Xk)sPj(Xk)+s0(Xk)|δ1/2Knj.

The same argument yields 3) and 4).

Again fix k. An application of the triangle inequality and the bounds on Xk and Y readily yield that |CovPj(Xk,Y)CovP0(Xk,Y)|Knj. Furthermore,

CorrPj(Xk,Y)CorrP0(Xk,Y)=CovPj(Xk,Y)CovP0(Xk,Y)s0(Xk)s0(Y)CorrPj(Xk,Y)sPj(Y)s0(Xk)s0(Y)[sPj(Xk)s0(Xk)]CorrPj(Xk,Y)s0(Y)[sPj(Y)s0(Y)].

Taking the absolute value of both sides, applying the triangle inequality, and using the lower bounds on s0(Xk) and s0(Y) and the upper bound on CorrPj(Xk,Y) yields that |CorrPj(Xk,Y)CorrP0(Xk,Y)|δ1Knj. This holds for all k, so 5) holds. □

Lemma A.6

Let C be the smallest universal constant in 2) of that Lemma A.5, and let n be any natural number satisfying n ≥ ⌈4C−2δ−2 log max{n, p}⌉ ≡ J(n, δ). Under these conditions, the occurrence of An implies that, for all j = J(n, δ), …, n,

  • 8

    minkKnsPj2(Xk)δ/2 and minkKnsPj2(Y)δ/2;

  • 9

    minkKnsP02(Xk)sPj2(Xk)δ and sP02(Y)sPj2(Y)2;

  • 10

    |σ^nj2σnj2|δ2Knj;

  • 11

    |σ^nj2VarP0(Dnj(P0)(O))|δ2Knj+(Rem^nj)2.

Proof

By Lemma A.5, 2) holds, and using that jJ(n, δ), we see that

sPj2(Xk)=sP02(Xk)+sPj2(Xk)sP02(Xk)sP02(Xk)maxkKn|sPj2(Xk)sP02(Xk)|δ/2.

The same argument works for sPj2(Y), so 8 holds. Furthermore,

sP02(Xk)sPj2(Xk)sP02(Xk)sP02(Xk)CKnj=1+CKnjsP02(Xk)CKnj1+2Cδ1Knj2,

where the final two inequalities hold by 8. This proves the first part of 9, and the bound on sP02(Y)/sPj2(Y) holds by the same argument. For the second result, note that

|σ^nj2σjn2||(PjP0)D^nj2|+|(PjD^nj)2(P0D^nj)2||(PjP0)D^nj2|+|(Pj+P0)D^nj||(PjP0)D^nj|.

Using 8, the bounds on X and Y, and the triangle inequality shows that

|(PjP0)D^nj2|+δ1|(PjP0)D^nj|δ2GnjFn,

where we have used that Fn contains all polynomials of Xk, Y of degree at most 4. By the occurrence of An, the final line is upper bounded by a constant times δ−2 Knj. This yields 10.

For 11, we will bound |σnj2VarP0(Dnj(P0)(O))| and then combine this with 10 using the triangle inequality. We have that

|σnj2VarP0(Dnj(P0)(O))||P0[D^nj2Dnj(P0)2]|+(P0D^nj)2.

Now we use that P0D^nj=CorrPj(Xk,Y)+CorrP0(Xk,Y)+Rem^nj and (a + b)2 ≤ 2 (a2 + b2) for any real a, b to see that (P0D^nj)2maxk(CorrPj(Xk,Y)CorrP0(Xk,Y)2)+(Rem^nj)2. By 5) from Lemma A.5 and the fact that jJ(n, δ), the maximum over kKn is bounded above by a constant times δ2Knj2δ2Knj. Continuing with the above,

|P0([D^nj+Dnj(P0)][D^njDnj(P0)])|+δ2Knj+(Rem^nj)2δ1P0|D^njDnj(P0)|+δ2Knj+(Rem^nj)2δ2GnjFn+δ2Knj+(Rem^nj)2δ2Knj+(Rem^nj)2,

where we used 8 for the second to last inequality. □

Lemma A.7

Suppose the conditions of Lemma A.6. Under these conditions, the occurrence of An implies that, for all j = J(n, δ), …, n.

8. |Rem^nj|δ5/2(Knj)2.

Proof

By Lemma A.6, minkKnsPj2(Xk)δ/2 and minkKnsPj2(Y)δ/2. By Lemma A.3, this yields

|Rem^nj|δ1maxkKn(|sPj(Xk)sPj(Y)s0(Xk)s0||CorrPj(Xk,Y)Corr0(Xk,Y)|+(EPj[Xk]EP0[Xk])2+(EPj[Y]EP0[Y])2+s02(Y)sPj2(Y)[sPj(Xk)s0(Xk)]2+s02(Xk)sPj2(Xk)[sPj(Y)s0(Y)]2).

By the bounds on X and Y and the triangle inequality, |sPj(Xk)sPj(Y)s0(Xk)s0(Y)||sPj(Xk)sP0(Xk)|+|sPj(Y)sP0(Y)|. Applying 9 from Lemma A.6 and the results of Lemma A.5 to the above yields the result. □

Lemma A.8

Let γ be as defined in (7). For a constant C(γ, δ) > 0 relying on γ and δ only, the occurrence of An implies that, for all j = ⌈C(γ, δ) log max{n, p}⌉, …, n.

8. σ^nj2γ/2.

Sketch of proof

Suppose An. By 11 and 8, for all jJ(n, δ)

|σ^nj2VarP0(Dnj(P0)(O))|δ2Knj+δ10(Knj)4.

It is easy to confirm that, for a universal constant C > 0, the above yields that the left-hand side is upper bounded by γ/2 for all j−1/2δ−2 max {δ−3, γ−3/2} log max{n, p} ≡ C(γ, δ)log max{n, p} ≥ J(n, δ). An application of the triangle inequality gives the result. □

The remaining results in this section are asymptotic in nature. We omit the dependence on δ and γ in these statements, as these quantities are treated as fixed as the sample size grows. Throughout we assume that

$$\frac{\log\max\{n,p\}}{\ell_n} \rightarrow 0, \tag{A.6}$$
$$\beta_n^2\,\log\frac{n}{\ell_n} \rightarrow 0, \tag{A.7}$$
$$\limsup_{n\rightarrow\infty}\frac{\ell_n}{n} < 1. \tag{A.8}$$

In view of (A.6) and (A.7), we see that, roughly, $\ell_n$ grows faster than $\log\max\{n,p\}$ if $\beta_n$ goes to zero faster than $1/\sqrt{\log n}$, and at least as fast as $n\exp\!\left(-o(\beta_n^{-2})\right)$ if $\beta_n$ goes to zero more slowly than $1/\sqrt{\log n}$. Given an ε > 0, one possible choice of $\ell_n$ that satisfies these properties is

$$\ell_n = \max\left\{\left(\log\max\{n,p\}\right)^{1+\varepsilon},\; n\exp\!\left(-\beta_n^{-(2-\varepsilon)}\right)\right\}.$$

We have the following result.

Lemma A.9

For all n large enough, $\ell_n$ is at least $J(n,\delta)$ and at least $C(\gamma,\delta)\log\max\{n,p\}$, where these quantities are defined in Lemmas A.6 and A.8, respectively.

Proof

This is an immediate consequence of (A.6) and the fact that δ and γ are fixed as the sample size grows. □

Theorem A.10

C1), C2), and C3) hold.

Proof
  • C1)

    By Lemma A.9, we can apply 8 from Lemma A.6 and Lemma A.8 provided n is large enough. In that case, $\left\|\hat D_{nj}\right\|_\infty/\hat\sigma_{nj} \lesssim \delta^{-1}\gamma^{-1/2}$ for all $j\ge\ell_n$ provided $A_n$ holds. By Lemma A.4, $A_n$ occurs with probability at least 1 − 1/n, and thus C1) holds.

  • C2)
    If An holds, then Lemmas A.8 and A.9 show that, for all n large enough,
    $$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\left|\frac{\sigma_{nj}^2}{\hat\sigma_{nj}^2}-1\right| \le \frac{2\gamma^{-1}}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\left|\hat\sigma_{nj}^2-\sigma_{nj}^2\right| \lesssim \frac{\gamma^{-1}\delta^{-2}}{n-\ell_n}\sqrt{\log\max\{n,p\}}\sum_{j=\ell_n}^{n-1}j^{-1/2}.$$
    By 10 in Lemma A.6 and the fact that $\sum_{j=a}^{b}j^{-1/2}\le\int_{a-1}^{b}j^{-1/2}\,dj$, the right-hand side has an upper bound proportional to $\gamma^{-1}\delta^{-2}\sqrt{n\log\max\{n,p\}}/(n-\ell_n)$. This bound is o(1) by (A.8) and the fact that $\beta_n\rightarrow 0$. The fact that $A_n$ occurs with probability approaching 1 (Lemma A.4) yields C2).
  • C3)

    Suppose that n is large enough so that the results of Lemma A.9 apply. Also suppose that An occurs. We have that

$$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\frac{\left|\widehat{\mathrm{Rem}}_{nj}\right|}{\hat\sigma_{nj}} \lesssim \frac{\gamma^{-1/2}\delta^{-5/2}}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\left(K_n^j\right)^2 \qquad\text{(Lemmas A.7, A.8, and A.9)}$$
$$= \frac{\gamma^{-1/2}\delta^{-5/2}}{n-\ell_n}\log\max\{n,p\}\sum_{j=\ell_n}^{n-1}j^{-1} \qquad\text{(Eq. A.5)}$$
$$\lesssim \frac{\gamma^{-1/2}\delta^{-5/2}}{n-\ell_n}\log\max\{n,p\}\,\log\frac{n}{\ell_n} \qquad\left(\textstyle\sum_{j=a}^{b}j^{-1}\le\int_{a-1}^{b}j^{-1}\,dj\right)$$
$$= o\!\left(\left[n-\ell_n\right]^{-1/2}\right). \qquad\text{(Eqs. A.7 and A.8)}$$

The fact that An occurs with probability approaching 1 (Lemma A.4) yields C3). □

Let $k_n^0$ be a possibly non-unique maximizer over $k\in\mathcal{K}_n$ of $\left|\mathrm{Corr}_{P_0}(X_k,Y)\right|$. For each r > 0, let $\mathcal{K}_n^r\subseteq\mathcal{K}_n$ denote the set of all $k\in\mathcal{K}_n$ such that $\left|\mathrm{Corr}_{P_0}(X_{k_n^0},Y)\right| - \left|\mathrm{Corr}_{P_0}(X_k,Y)\right| \le r$.

The upcoming theorem uses the following conditions to establish the validity of a hypothesis test of no effect and of the upper bound of our confidence interval, respectively:

M1) For some sequence {t_n} with $t_n\rightarrow+\infty$, there exists a sequence of non-empty subsets $\mathcal{K}_n^r\subseteq\mathcal{K}_n$ such that, for all n,

$$\inf_{k_1\in\mathcal{K}_n^r}\left|\mathrm{Corr}_{P_0}(X_{k_1},Y)\right| \ge \sup_{k_2\in\mathcal{K}_n\setminus\mathcal{K}_n^r}\left|\mathrm{Corr}_{P_0}(X_{k_2},Y)\right| + t_n n^{-1/4}\beta_n.$$

If $\mathcal{K}_n^r=\mathcal{K}_n$, then the supremum on the right-hand side is taken to be zero.

M2) The conditions of M1) hold, and also

$$\mathrm{Diam}\!\left(\mathcal{K}_n^r\right) \equiv \sup_{k_1,k_2\in\mathcal{K}_n^r}\left(\left|\mathrm{Corr}_{P_0}(X_{k_1},Y)\right| - \left|\mathrm{Corr}_{P_0}(X_{k_2},Y)\right|\right) = o\!\left(n^{-1/2}\right).$$

The first of these conditions will be used to establish the consistency of a null hypothesis significance test. The second of these conditions is similar to margin conditions used in classification, and will be used to establish the validity of our confidence interval.

Theorem A.11

$$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\hat\sigma_{nj}^{-1}\left[\Psi_{d_{nj}}(P_0)-\Psi_n(P_0)\right] = O_{P_0}\!\left(n^{-1/4}\beta_n\right). \tag{A.9}$$

If also M1), then the right-hand side of the above can be tightened to $O_{P_0}\!\left(\mathrm{Diam}(\mathcal{K}_n^r)\wedge n^{-1/4}\beta_n\right) + o_{P_0}\!\left(n^{-1/2}\right)$. If also M2), then C4) holds.

Proof

Suppose that $A_n$ holds and n is large enough so that the results of Lemma A.9 apply. For each $j\ge\ell_n$, let $k_{nj}$ denote a $k\in\mathcal{K}_n$ which maximizes $\left|\mathrm{Corr}_{P_j}(X_k,Y)\right|$. Let $m^0=\mathrm{sgn}\!\left[\mathrm{Corr}_{P_0}(X_{k_n^0},Y)\right]$ and $m_{nj}=\mathrm{sgn}\!\left[\mathrm{Corr}_{P_j}(X_{k_{nj}},Y)\right]$. Then, for a universal constant C > 0,

$$0 \ge m^0\,\mathrm{Corr}_{P_j}(X_{k_n^0},Y) - \left|\mathrm{Corr}_{P_j}(X_{k_{nj}},Y)\right| = \left[m^0\,\mathrm{Corr}_{P_0}(X_{k_n^0},Y) - m_{nj}\,\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right] + m^0\left[\mathrm{Corr}_{P_j}(X_{k_n^0},Y) - \mathrm{Corr}_{P_0}(X_{k_n^0},Y)\right] - m_{nj}\left[\mathrm{Corr}_{P_j}(X_{k_{nj}},Y) - \mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right] \ge \Psi_n(P_0) - \Psi_{d_{nj}}(P_0) - 2\max_{k\in\mathcal{K}_n}\left|\mathrm{Corr}_{P_j}(X_k,Y) - \mathrm{Corr}_{P_0}(X_k,Y)\right| \ge \Psi_n(P_0) - \Psi_{d_{nj}}(P_0) - C\delta^{-1}K_n^j, \tag{A.10}$$

where the final inequality holds by 5) of Lemma A.5. Using that $\sum_{j=\ell_n}^{n-1}j^{-1/2}\lesssim\sqrt{n}$ and (A.8),

$$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}K_n^j \lesssim \frac{\sqrt{n\log\max\{n,p\}}}{n-\ell_n} \lesssim n^{-1/4}\beta_n.$$

By Lemma A.8, this implies that the left-hand side of (A.9) is bounded in absolute value by an $O\!\left(\gamma^{-1/2}\delta^{-1}n^{-1/4}\beta_n\right)$ term under $A_n$, and so Lemma A.4 yields (A.9).

For the second result, suppose that M1) holds. Observe that, for all $j > Cnt_n^{-1}$ with C as in (A.10), $\Psi_n(P_0)-\Psi_{d_{nj}}(P_0) < t_n n^{-1/4}\beta_n$. Furthermore, $\Psi_n(P_0)-\left|\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right| \le \Psi_n(P_0)-\Psi_{d_{nj}}(P_0)$, so that $k_{nj}\in\mathcal{K}_n^r$ as defined in M1). Furthermore, $m_{nj}$ must equal $\mathrm{sgn}\!\left[\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right]$, since otherwise

$$\Psi_n(P_0)-\Psi_{d_{nj}}(P_0) \ge \left|\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right| - \Psi_{d_{nj}}(P_0) = 2\left|\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right| \ge 2\inf_{k\in\mathcal{K}_n^r}\left|\mathrm{Corr}_{P_0}(X_k,Y)\right| \ge 2t_n n^{-1/4}\beta_n,$$

contradicting the fact, established above via (A.10), that $\Psi_n(P_0)-\Psi_{d_{nj}}(P_0) < t_n n^{-1/4}\beta_n$. Because $k_{nj}\in\mathcal{K}_n^r$ and $m_{nj}=\mathrm{sgn}\!\left[\mathrm{Corr}_{P_0}(X_{k_{nj}},Y)\right]$, we see that $\Psi_{d_{nj}}(P_0) \ge \inf_{k\in\mathcal{K}_n^r}\left|\mathrm{Corr}_{P_0}(X_k,Y)\right|$. Hence,

$$\frac{1}{n-\ell_n}\sum_{j=\max\{\ell_n,\,\lceil Cnt_n^{-1}\rceil\}}^{n-1}\left[\Psi_{d_{nj}}(P_0)-\Psi_n(P_0)\right] \ge -\mathrm{Diam}\!\left(\mathcal{K}_n^r\right). \tag{A.11}$$

Further, if $Cnt_n^{-1} \ge \ell_n$, (A.10) yields

$$\sum_{j=\ell_n}^{\lceil Cnt_n^{-1}\rceil}\left[\Psi_{d_{nj}}(P_0)-\Psi_n(P_0)\right] \ge -C\sum_{j=\ell_n}^{\lceil Cnt_n^{-1}\rceil}K_n^j \ge -C\sqrt{\log\max\{n,p\}}\int_{\ell_n-1}^{Cnt_n^{-1}}j^{-1/2}\,dj.$$

It follows that the left-hand side above is greater than or equal to the negative of a positive universal constant times $n^{1/2}t_n^{-1/2}$. Dividing by $n-\ell_n$ and applying (A.8) yields that the corresponding average is bounded below by a term of order $n^{-1/2}t_n^{-1/2}$. Combining this with (A.11) shows that

$$\frac{1}{n-\ell_n}\sum_{j=\ell_n}^{n-1}\left[\Psi_{d_{nj}}(P_0)-\Psi_n(P_0)\right] \ge -\left[\mathrm{Diam}\!\left(\mathcal{K}_n^r\right)+O\!\left(n^{-1/2}t_n^{-1/2}\right)\right].$$

Using that $t_n^{-1/2}\rightarrow 0$, $n^{-1/2}t_n^{-1/2}=o(n^{-1/2})$. When proving the first result (A.9) we also showed that the left-hand side is bounded below by a negative universal constant times $\delta^{-1}n^{-1/4}\beta_n$. Combining with Lemma A.8 and using that $A_n$ holds with probability approaching 1 (Lemma A.4) shows that the left-hand side of (A.9) is $O_{P_0}\!\left(\mathrm{Diam}(\mathcal{K}_n^r)\wedge n^{-1/4}\beta_n\right)+o_{P_0}\!\left(n^{-1/2}\right)$. If M2) holds, then this expression is $o_{P_0}(n^{-1/2})$, and so C4) holds. □

B.2 Computationally efficient implementation of our estimator

In this section, we describe how to implement the estimator for the McKeague and Qian [2015] example in O(np) time. We show that this can be accomplished using O(p) storage when the observations O1, …, On arrive in a stream.

Fix n so that the set $\mathcal{K}_n$ of predictor indices is also fixed. For each j, let $P_j$ denote the empirical distribution of the first j observations. Recall the definition of the class $\mathcal{F}_n$ from (A.3), and note that $\mathcal{F}_n$ contains O(p) functions. It is easy to see that, at j = 2, we can compute $P_jf \equiv E_{P_j}[f(O)]$ for each $f\in\mathcal{F}_n$ using O(p) time and storage. Furthermore, for j ≥ 3 the identity $P_jf = \frac{1}{j}f(O_j) + \frac{j-1}{j}P_{j-1}f$ shows that we can compute and save $P_jf$ in O(p) time and storage if we know $O_j$ and $P_{j-1}f$. To attain this storage complexity, we remove $P_{j-2}f$, $f\in\mathcal{F}_n$, from memory for each j ≥ 4 so that $P_2f,\ldots,P_{j-2}f$ are not stored in memory.

We now have an algorithm that, at observation j, starts with $O_j$ and $P_{j-1}f$, $f\in\mathcal{F}_n$, stored in memory and, after running the steps described in the preceding paragraph, also has $P_jf$, $f\in\mathcal{F}_n$, stored in memory. Given $P_jf$, $f\in\mathcal{F}_n$, one can compute and save $\mathrm{Cov}_{P_j}(X_k,Y)=E_{P_j}[X_kY]-E_{P_j}[X_k]E_{P_j}[Y]$, $k\in\mathcal{K}_n$, and $s_{P_j}^2(Z)=E_{P_j}[Z^2]-E_{P_j}[Z]^2$ for Z equal to Y or $X_k$, $k\in\mathcal{K}_n$, in O(p) time and storage. We can then compute and save $\mathrm{Corr}_{P_j}(X_k,Y)=\mathrm{Cov}_{P_j}(X_k,Y)/\!\left[s_{P_j}(X_k)s_{P_j}(Y)\right]$, $k\in\mathcal{K}_n$, in O(p) time and storage. If the predictors or outcome take large values and have small variances, this online computation of the sample variance may lead to numerical difficulties; see Welford [1962] for a numerically stabler way to update the variance in this setting.
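To make the streaming bookkeeping concrete, the following Python sketch maintains the running means $P_jf$ for the degree-one and degree-two polynomials in the predictors and outcome and recovers the correlations from them in O(p) time per observation. It is a minimal sketch under our own naming conventions (StreamingMoments, update, and correlations are not from the paper or from any package), and it tracks only the moments needed for the correlations; the full class $\mathcal{F}_n$ also contains the third- and fourth-degree polynomials used for $\hat\sigma_{nj}$ below.

import numpy as np

class StreamingMoments:
    """Running means P_j f for the degree-1 and degree-2 polynomials, updated in O(p) per observation."""

    def __init__(self, p):
        self.j = 0
        self.mx = np.zeros(p)    # E_{P_j}[X_k]
        self.mx2 = np.zeros(p)   # E_{P_j}[X_k^2]
        self.mxy = np.zeros(p)   # E_{P_j}[X_k Y]
        self.my = 0.0            # E_{P_j}[Y]
        self.my2 = 0.0           # E_{P_j}[Y^2]

    def update(self, x, y):
        """Incorporate observation O_j = (x, y) via P_j f = (1/j) f(O_j) + ((j-1)/j) P_{j-1} f."""
        self.j += 1
        w = 1.0 / self.j
        self.mx += w * (x - self.mx)
        self.mx2 += w * (x ** 2 - self.mx2)
        self.mxy += w * (x * y - self.mxy)
        self.my += w * (y - self.my)
        self.my2 += w * (y ** 2 - self.my2)

    def correlations(self):
        """Corr_{P_j}(X_k, Y) for all k, valid once j >= 2 and the sample variances are positive."""
        cov = self.mxy - self.mx * self.my
        var_x = self.mx2 - self.mx ** 2
        var_y = self.my2 - self.my ** 2
        return cov / np.sqrt(var_x * var_y)

As noted above, when the variables take large values with small variances, a Welford-style centered update is preferable numerically to these raw-moment recursions [Welford, 1962].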

Let $H_j$ denote the collection of (i) the integer j, (ii) $P_jf$, $f\in\mathcal{F}_n$, (iii) $s_{P_j}^2(Y)$, and (iv) $\mathrm{Cov}_{P_j}(X_k,Y)$, $s_{P_j}^2(X_k)$, and $\mathrm{Corr}_{P_j}(X_k,Y)$, $k\in\mathcal{K}_n$. For j ≥ 2, let UPDATEH be a function which takes as input $(O_{j+1}, H_j)$ and outputs $H_{j+1}$. We have shown that UPDATEH$(O_{j+1}, H_j)$ can run in O(p) time for any j ≥ 2. We call a separate function INITIALIZEH on $(O_1, O_2)$ to obtain the initial value $H_2$. This function runs in O(p) time and storage.

Let MAXIMIZER be a function that takes as input $H_j$ and returns the $d_j=(k_j,m_j)$ which maximizes $m\,\mathrm{Corr}_{P_j}(X_k,Y)$ over $d=(k,m)\in\mathcal{D}_n$, thereby allowing us to compute $\hat\sigma_{nj}^2=P_jD_{d_j}(P_j)^2$. Finding $d_j$ involves finding the maximum of $|\mathcal{D}_n|=2p$ numbers, and therefore can be accomplished in O(p) time.

The function CALCD takes as input $H_j$, $O_{j+1}$, and $d_j$ and calculates $D_{d_j}(P_j)(O_{j+1})$. It is easy to see that this can be accomplished in O(1) time and O(p) storage.

For ease of notation in the following paragraph and equation we omit the dependence of $d_j = (k_j, m_j)$ on j. Since $D_d(P_j)$ is a gradient for $\Psi_d$ at $P_j$ and gradients are mean zero, $P_jD_d(P_j) = 0$. For any $d\in\mathcal{D}_n$, tedious but elementary calculations show that

$$P_jD_d(P_j)^2 = \frac{2+\mathrm{Corr}_{P_j}(X_k,Y)^2}{2\,s_{P_j}^2(X_k)\,s_{P_j}^2(Y)}\sum_{r=0}^{2}\sum_{s=0}^{2}(-1)^{r+s}\binom{2}{r}\binom{2}{s}E_{P_j}[X_k^rY^s]\,E_{P_j}[X_k]^{2-r}E_{P_j}[Y]^{2-s} + \frac{\mathrm{Corr}_{P_j}(X_k,Y)^2}{4}\sum_{r=0}^{4}(-1)^{r}\binom{4}{r}\left[\frac{E_{P_j}[X_k^r]\,E_{P_j}[X_k]^{4-r}}{s_{P_j}^4(X_k)}+\frac{E_{P_j}[Y^r]\,E_{P_j}[Y]^{4-r}}{s_{P_j}^4(Y)}\right] - \frac{\mathrm{Corr}_{P_j}(X_k,Y)}{s_{P_j}(X_k)^3\,s_{P_j}(Y)}\sum_{r=0}^{3}\sum_{s=0}^{1}(-1)^{r+s}\binom{3}{r}E_{P_j}[X_k^rY^s]\,E_{P_j}[X_k]^{3-r}E_{P_j}[Y]^{1-s} - \frac{\mathrm{Corr}_{P_j}(X_k,Y)}{s_{P_j}(X_k)\,s_{P_j}(Y)^3}\sum_{r=0}^{1}\sum_{s=0}^{3}(-1)^{r+s}\binom{3}{s}E_{P_j}[X_k^rY^s]\,E_{P_j}[X_k]^{1-r}E_{P_j}[Y]^{3-s}.$$

Observe that all expectations on the right-hand side above are expectations of some $f\in\mathcal{F}_n$ applied to the observed data structure. It follows that the above can be computed in O(1) time using a subset of the O(p) expectation, standard deviation, and correlation estimates stored in $H_j$. Let CALCSIGHAT denote the function which takes as input $H_j$ and $d_j$ and outputs $\hat\sigma_{nj}$. We have shown that CALCSIGHAT$(H_j, d_j)$ runs in O(1) time.
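As an illustration of the roles of CALCD and CALCSIGHAT, the sketch below evaluates the gradient at a new observation using the standard nonparametric influence function of the Pearson correlation, $D_d(P)(x,y) = m\left[z_xz_y - \tfrac{\rho}{2}\left(z_x^2+z_y^2\right)\right]$ with $z_x=(x-E_P[X_k])/s_P(X_k)$ and $z_y=(y-E_P[Y])/s_P(Y)$, whose second moment is consistent with the displayed expression for $P_jD_d(P_j)^2$, and it computes that second moment from centered fourth-order moments. The function names and the choice to pass the centered moments $m_{ab}=E_{P_j}[(X_k-E_{P_j}X_k)^a(Y-E_{P_j}Y)^b]$ directly (they can be assembled in O(1) from the stored raw moments by binomial expansion, as in the display) are our own illustrative conventions.

import numpy as np

def calc_d(x, y, mean_x, mean_y, sd_x, sd_y, rho, m):
    """Gradient D_d(P_j) evaluated at a new observation (x, y), with d = (k, m)."""
    zx = (x - mean_x) / sd_x
    zy = (y - mean_y) / sd_y
    return m * (zx * zy - 0.5 * rho * (zx ** 2 + zy ** 2))

def calc_sighat(rho, sd_x, sd_y, m22, m31, m13, m40, m04):
    """Square root of P_j D_d(P_j)^2, written in terms of centered moments m_ab of (X_k, Y) under P_j."""
    vx, vy = sd_x ** 2, sd_y ** 2
    pd2 = ((1.0 + 0.5 * rho ** 2) * m22 / (vx * vy)
           + 0.25 * rho ** 2 * (m40 / vx ** 2 + m04 / vy ** 2)
           - rho * (m31 / (sd_x ** 3 * sd_y) + m13 / (sd_x * sd_y ** 3)))
    return np.sqrt(max(pd2, 0.0))  # guard against tiny negative values from rounding

Because only the selected index $k_j$ is involved, both calls cost O(1) time once $H_j$ is available, matching the complexity claims above.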

The pseudocode in ESTPSI describes our estimator, with most of the work done in the recursion step described in the function RECURSION. Because each call of RECURSION runs in O(p) time, the O(n) iterations of the for loop in ESTPSI require O(np) time in total. The storage requirement of each call of RECURSION is O(p). Because the code in the for loop in ESTPSI deletes the output from the previous recursion step, the total storage requirement of ESTPSI is O(p).

Algorithm Recursion Step for Estimating Ψ(P0)

function RECURSION($O_{j+1}$, $\psi_j$, $H_j$, $\bar\sigma_j$, $\ell_n$) ▹ the current index j is stored in $H_j$
 if $j < \ell_n$ then $\psi_{j+1} = 0$ and $\bar\sigma_{j+1} = 0$
 else
  $d_j = (k_j, m_j) = $ MAXIMIZER($H_j$)
  $\hat\sigma_{nj} = $ CALCSIGHAT($H_j$, $d_j$)
  $D_{d_j}(P_j)(O_{j+1}) = $ CALCD($H_j$, $O_{j+1}$, $d_j$)
  $\psi_{j+1} = \dfrac{(j-\ell_n)\,\psi_j/\bar\sigma_j + \left[m_j\,\mathrm{Corr}_{P_j}(X_{k_j},Y) + D_{d_j}(P_j)(O_{j+1})\right]/\hat\sigma_{nj}}{(j-\ell_n)/\bar\sigma_j + 1/\hat\sigma_{nj}}$ ▹ By convention, 0/0 = 0.
  $\bar\sigma_{j+1} = \dfrac{j+1-\ell_n}{(j-\ell_n)/\bar\sigma_j + 1/\hat\sigma_{nj}}$
 $H_{j+1} = $ UPDATEH($O_{j+1}$, $H_j$)
 return $(\psi_{j+1}, \bar\sigma_{j+1}, H_{j+1})$

Algorithm Estimate Ψ(P0) Using Sample of Size n

function ESTPSI(n, $\ell_n$)
 Read $O_1$, $O_2$ from the data stream
 Base case: $\psi_2 = 0$, $\bar\sigma_2 = 0$, and $H_2 = $ INITIALIZEH($O_1$, $O_2$)
 for $j = 2, \ldots, n-1$ do
  Read $O_{j+1}$ from the data stream
  $(\psi_{j+1}, \bar\sigma_{j+1}, H_{j+1}) = $ RECURSION($O_{j+1}$, $\psi_j$, $H_j$, $\bar\sigma_j$, $\ell_n$)
  Remove $(O_{j+1}, \psi_j, H_j, \bar\sigma_j)$ from memory
 return point estimate $\psi_n$ and confidence interval $\left[\psi_n \pm 1.96\,\bar\sigma_n/\sqrt{n-\ell_n}\right]$
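To illustrate only the final combination step (the $\psi$ and $\bar\sigma$ recursions above maintain the same two summaries online), here is a small self-contained Python sketch; it takes the per-step one-step terms $\Psi_{d_j}(P_j)+D_{d_j}(P_j)(O_{j+1})$ and scale estimates $\hat\sigma_{nj}$ for $j=\ell_n,\ldots,n-1$ as inputs and returns a point estimate and interval of the form returned by ESTPSI. The function name and the batch (rather than streaming) computation are ours and purely illustrative.

import numpy as np

def stabilized_average(terms, sighats):
    """Inverse-scale-weighted combination of one-step terms with a Wald-type interval."""
    terms = np.asarray(terms, dtype=float)
    weights = 1.0 / np.asarray(sighats, dtype=float)   # weights sigma_hat_{nj}^{-1}
    psi = np.sum(weights * terms) / np.sum(weights)    # weighted point estimate psi_n
    sigma_bar = len(terms) / np.sum(weights)           # harmonic-mean scale sigma_bar_n
    half_width = 1.96 * sigma_bar / np.sqrt(len(terms))
    return psi, sigma_bar, (psi - half_width, psi + half_width)

A streaming implementation keeps only the running numerator and denominator of the weighted mean and the running sum of the weights, which is what RECURSION does with O(1) extra memory beyond $H_j$.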

References

  1. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press; Baltimore: 1993.
  2. Chakraborty B, Moodie EE. Statistical methods for dynamic treatment regimes. Springer; Berlin Heidelberg New York: 2013.
  3. Gaenssler P, Strobel J, Stute W. On central limit theorems for martingale triangular arrays. Acta Math Hungar. 1978;31(3):205–216.
  4. Hirano K, Porter JR. Impossibility results for nondifferentiable functionals. Econometrica. 2012;80(4):1769–1790.
  5. Luedtke AR, van der Laan MJ. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics. 2016;44(2):713–742. doi: 10.1214/15-AOS1384.
  6. Luts J, Broderick T, Wand MP. Real-time semiparametric regression. Journal of Computational and Graphical Statistics. 2014;23(3):589–615.
  7. McDiarmid C. On the method of bounded differences. Surveys in Combinatorics. 1989;141(1):148–188.
  8. McKeague IW, Qian M. An adaptive resampling test for detecting the presence of significant predictors. Journal of the American Statistical Association. 2015;110(512). doi: 10.1080/01621459.2015.1095099.
  9. Pfanzagl J. Estimation in semiparametric models. Springer; Berlin Heidelberg New York: 1990.
  10. R Core Team. R: a language and environment for statistical computing. 2014. URL http://www.r-project.org/
  11. van der Laan MJ, Lendle SD. Online targeted learning. Technical Report 330, Division of Biostatistics, University of California, Berkeley: 2014. Available at http://www.bepress.com/ucbbiostat.
  12. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York Berlin Heidelberg: 2003.
  13. van der Vaart AW. On differentiable functionals. Annals of Statistics. 1991;19:178–204.
  14. van der Vaart AW. Asymptotic statistics. Cambridge University Press; New York: 1998.
  15. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. Springer; Berlin Heidelberg New York: 1996.
  16. Welford BP. Note on a method for calculating corrected sums of squares and products. Technometrics. 1962;4(3):419–420.
  17. Xu W. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490. 2011.
  18. Zhang Y, Laber EB. Comment. Journal of the American Statistical Association. 2015;110(512):1451–1454.
