Abstract
Suppose one has a collection of parameters indexed by a (possibly infinite dimensional) set. Given data generated from some distribution, the objective is to estimate the maximal parameter in this collection evaluated at the distribution that generated the data. This estimation problem is typically non-regular when the maximizing parameter is non-unique, and as a result standard asymptotic techniques generally fail in this case. We present a technique for developing parametric-rate confidence intervals for the quantity of interest in these non-regular settings. We show that our estimator is asymptotically efficient when the maximizing parameter is unique, so that regular estimation is possible. We apply our technique to a recent example from the literature in which one wishes to report the maximal absolute correlation between a prespecified outcome and one of p predictors. The simplicity of our technique enables an analysis of the previously open case where p grows with sample size. Specifically, we only require that log p grows slower than n^{1/2}, where n is the sample size. We show that, unlike earlier approaches, our method scales to massive data sets: the point estimate and confidence intervals can be constructed in O(np) time.
Keywords: stabilized one-step estimator, non-regular inference, variable screening
1 Introduction
Many semiparametric and nonparametric estimation problems yield estimators which achieve a parametric rate of convergence. These estimators are often asymptotically linear, in that they can be written as an empirical mean of an influence function applied to the data. Valid choices of the influence function can be derived as gradients of the pathwise derivative of the parameter of interest. Applying the central limit theorem then immediately yields Wald-type confidence intervals which achieve the desired parametric rate. Such problems have been studied in depth over the past several decades [Pfanzagl, 1990, van der Vaart, 1991, Bickel et al., 1993, van der Laan and Robins, 2003].
While remarkably general, these approaches rely on the key condition that the parameter of interest is sufficiently differentiable at the data generating distribution for such a gradient to exist. Statisticians are increasingly encountering problems for which parametric-rate estimation is theoretically possible but the parameter is insufficiently differentiable to yield the standard first-order expansion demanded by older techniques. For example, suppose we observe baseline covariates, a binary treatment, and an outcome occurring after treatment. We wish to learn the mean outcome under the optimal individualized treatment strategy, i.e. the treatment strategy that is allowed to use baseline covariate information when making treatment decisions [Chakraborty and Moodie, 2013]. As another example, suppose we observe a vector of covariates (X1, …, Xp) and an outcome Y. We wish to give a confidence interval for the maximal absolute correlation between a covariate Xk and Y. The lower bound of this interval is of particular interest since it suffices for a variable screening procedure. Alternatively, we may only wish to test the null hypothesis that the maximal absolute correlation is zero. McKeague and Qian [2015] provide a test of this null hypothesis using an adaptive resampling test (ART).
These problems belong to a larger class of problems in which one observes O1, …, On drawn independently from a P0 in some (possibly nonparametric) statistical model ℳ and wishes to estimate
Ψn(P) ≡ sup_{d ∈ 𝒟n} Ψd(P)    (1)
at P = P0, where 𝒟n is an index set that may rely on sample size and each Ψd : ℳ → ℝ is a sufficiently differentiable parameter to permit parametric-rate estimation using classical methods such as those presented in Bickel et al. [1993]. When there is no unique maximizer of Ψd(P0) over d, the inference problem is typically non-regular, in the sense that the parameter is not sufficiently differentiable to allow the use of standard influence function based techniques for obtaining inference. Hirano and Porter [2012] showed that regular and asymptotically linear estimators fail to exist for such one-sided pathwise differentiable parameters. In univariate calculus, functions such as f(x) = max{x, 0} are one-sided differentiable at zero in that the left and right limits of [f(x + ε) − f(x)]/ε are well-defined but disagree. The same holds for Ψn evaluated at a distribution P0, but now the one-sided differentiability is caused by the subset of 𝒟n containing the indices which maximize the expression on the right in (1). A small fluctuation in P0 can greatly reduce the subset of maximizing indices, leading to different derivatives depending on the fluctuation taken.
In this work, we present a method which, loosely, splits the sample in such a way that the estimated index in 𝒟n which maximizes Ψd(P0) is conditioned on, so that this estimated index need not have a limit. We do this iteratively to ensure that our estimator gets the full benefit of the sample size n. When the parameter is fixed with sample size and the d maximizing Ψd(P0) is unique, we show that our estimator is asymptotically efficient, and therefore also regular. When the maximizing index is not unique, our estimator will not typically be regular. Thus our estimator adapts to the non-regularity of the estimation problem.
Our estimator is inspired by the online estimator for pathwise differentiable parameters presented in van der Laan and Lendle [2014] and a subsequent modification of this estimator in Luedtke and van der Laan [2016] to deal with the non-regularity when estimating the mean outcome under an optimal treatment rule. Such estimators are designed to be efficient in both computational complexity and storage requirements. We show that the estimator that we present in this work inherits many of these computational efficiency properties. We apply our technique to estimate the maximal absolute correlation considered in McKeague and Qian [2015]. In this problem, we show that our estimator runs efficiently in both dimension and sample size, with a runtime of O(np). In practice, this means that the lead author can implement our estimator using only R code and screen p = 100 000 variables using n = 1 000 samples on a single core of his laptop in under a minute. Thus our estimator seems to have both the statistical efficiency that has been demanded of estimators for generations and the computational efficiency that is becoming increasingly important in this new big data era. While the method of McKeague and Qian [2015] can also be implemented in O(np) time when the number of (double) bootstrap samples remains constant, the implicit constant in this procedure implies a much longer runtime.
2 Toy Example
We first present a toy example that we will use to facilitate the presentation of our estimator. While this toy example is simple enough that one can fairly easily come up with alternative estimation strategies, we believe it provides a useful starting point for presenting our general estimation scheme. Suppose we observe an i.i.d. sample {Oj ≡ (Oj,1, Oj,2) : j = 1, …, n} of ℝ2-valued observations, drawn from some distribution P0 with ℝ2-valued mean (Ψ1(P0), Ψ2(P0)) ≡ E_{P0}[O]. For general P, similarly define (Ψ1(P), Ψ2(P)) ≡ EP[O]. Our objective is to estimate Ψ(P0) ≡ max{Ψ1(P0), Ψ2(P0)}. If Ψ1(P0) = Ψ2(P0), then this parameter is one-sided but not two-sided pathwise differentiable at P0 and no regular and asymptotically linear estimator exists [Hirano and Porter, 2012].
To give the reader a sense of the challenges faced by the most intuitive estimation strategy, consider the plug-in estimator that estimates Ψ1(P0) and Ψ2(P0) using the empirical means of the observations, which we denote by μn,1 and μn,2, and then estimates Ψ(P0) with maxd μn,d. We have that

n^{1/2}(maxd μn,d − Ψ(P0)) = 1{μn,1 ≥ μn,2} n^{1/2}(μn,1 − Ψ(P0)) + 1{μn,1 < μn,2} n^{1/2}(μn,2 − Ψ(P0)).
By the central limit theorem, n^{1/2}(μn,1 − Ψ1(P0), μn,2 − Ψ2(P0)) converges to a multivariate normal Z = (Z1, Z2) with estimable covariance matrix. If Ψ1(P0) > Ψ2(P0) or Ψ2(P0) > Ψ1(P0), then the indicator above converges in probability to one or zero, respectively, and the above converges in distribution to Z1 or Z2, respectively. Both of these normal limits can be consistently estimated from the data, and if one knows that Ψ1(P0) ≠ Ψ2(P0) then the correct index for the limit is consistently estimated by argmaxd μn,d. Often one may not be willing to assume that Ψ1(P0) ≠ Ψ2(P0). To see the challenge that arises, note that if Ψ1(P0) = Ψ2(P0), then the right-hand side above converges to max{Z1, Z2}, which has a non-normal distribution. This gives some intuition on the non-regularity of Ψ(P0) when Ψ1(P0) = Ψ2(P0): an arbitrarily small shift in Ψ1(P0) or Ψ2(P0) dramatically changes the limiting behavior of the estimator. Furthermore, it is not in general clear which limiting result one should use to approximate the distribution of the estimator if Ψ1(P0) = Ψ2(P0) + ε for some small ε > 0, so that asymptotically the limit of the estimator is Z1 but in practice max{Z1, Z2} better approximates the variability of the estimator.
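The non-normal limit in the tied case can be seen numerically. The following sketch (ours, not from the paper; it assumes independent standard normal coordinates with Ψ1(P0) = Ψ2(P0) = 0) shows that the root-n-scaled plug-in estimate concentrates around E[max{Z1, Z2}] = 1/√π ≈ 0.56 rather than zero:

```python
import numpy as np

# Plug-in estimator max_d mu_{n,d} in the tied case Psi1(P0) = Psi2(P0) = 0.
rng = np.random.default_rng(0)
n, reps = 500, 2000
scaled = np.empty(reps)
for r in range(reps):
    O = rng.normal(size=(n, 2))        # independent standard normal coordinates
    mu = O.mean(axis=0)                # empirical means (mu_{n,1}, mu_{n,2})
    scaled[r] = np.sqrt(n) * max(mu[0], mu[1])
# E[max{Z1, Z2}] = 1/sqrt(pi) ~ 0.56 for independent standard normals,
# so the scaled plug-in estimate is centered well above zero.
print(scaled.mean())
```

A Wald interval built around a normal approximation centered at the plug-in estimate would therefore be systematically miscentered in this case.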
To avoid this problem, we develop an estimator that naturally adapts to the (non-)regularity of the problem. For each 2 ≤ j ≤ n, let (μj,1, μj,2) represent the empirical mean of O1, …, Oj and dj ≡ argmaxd μj,d. Our estimator takes the form ψn ≡ Σ_{j=2}^{n−1} wj O_{j+1,dj} for positive convex weights w2, …, wn−1 that we now define. To expedite the presentation of this toy example, suppose that we know the variances of the first and second components of O ~ P0. Denote these variances by Σ1 and Σ2, where we assume that Σ1, Σ2 ∊ (0, ∞). Let Σ̄n^{1/2} denote the harmonic mean of Σ_{d2}^{1/2}, …, Σ_{dn−1}^{1/2}, i.e. Σ̄n^{1/2} ≡ [(n − 2)^{−1} Σ_{j=2}^{n−1} Σ_{dj}^{−1/2}]^{−1}. Our convex weights are given by wj ≡ (n − 2)^{−1} Σ̄n^{1/2} Σ_{dj}^{−1/2}.
We now aim to understand the variability of our estimator. Note that

ψn − Ψ(P0) = Σ_{j=2}^{n−1} wj (O_{j+1,dj} − Ψ_{dj}(P0)) + Σ_{j=2}^{n−1} wj (Ψ_{dj}(P0) − Ψ(P0)).
Our analysis will use that the weights wj are finite because Σ1, Σ2 ∊ (0, ∞). We first consider the second term on the right. If Ψ1(P0) = Ψ2(P0), then this term is exactly zero. Suppose Ψ1(P0) ≠ Ψ2(P0). Noting that μj,d → Ψd(P0) almost surely as j → ∞, it then follows that, with probability 1, dj = argmaxd Ψd(P0) for all j large enough. It readily follows that n^{1/2} times the second term on the right converges to zero almost surely, and therefore also in probability. Multiplying this term by the random but bounded quantity Σ̄n^{−1/2} does not change this convergence to zero. Thus we have shown that

(n − 2)^{1/2} Σ̄n^{−1/2} (ψn − Ψ(P0)) = (n − 2)^{−1/2} Σ_{j=2}^{n−1} Σ_{dj}^{−1/2} (O_{j+1,dj} − Ψ_{dj}(P0)) + oP(1).
Finally, we note that the first term above is a martingale sum, and that each term in this sum has conditional variance 1 thanks to the choice of weights. A standard martingale central limit theorem [e.g., Gaenssler et al., 1978] then shows that this term converges to a standard normal random variable, and standard arguments show that the interval ψn ± 1.96 Σ̄n^{1/2} (n − 2)^{−1/2} contains Ψ(P0) with probability approaching 0.95.
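The toy estimator and its interval can be sketched in a few lines (our illustration, not the paper's code; Python, known component variances, nominal 95% level, and weights proportional to the known inverse standard deviations are the assumptions of this sketch):

```python
import numpy as np

def stabilized_toy(O, var1, var2):
    """Toy stabilized estimator: at step j, condition on the running argmax
    d_j and score the next observation's d_j-th coordinate, weighting each
    term proportionally to the known inverse standard deviation."""
    n = len(O)
    sd = np.sqrt([var1, var2])
    inv_sd_used, terms = [], []
    csum = O[0] + O[1]                     # running sum of the first j points
    for j in range(2, n):
        mu_j = csum / j
        d_j = int(mu_j[1] > mu_j[0])       # d_j = argmax_d mu_{j,d}
        inv_sd_used.append(1.0 / sd[d_j])
        terms.append(O[j, d_j] / sd[d_j])  # standardized next observation
        csum += O[j]
    inv_sd_used, terms = np.array(inv_sd_used), np.array(terms)
    psi = terms.sum() / inv_sd_used.sum()  # convex weights w_j propto 1/sd_{d_j}
    half = 1.96 * np.sqrt(n - 2) / inv_sd_used.sum()
    return psi, psi - half, psi + half

# Coverage check in the non-regular tied case Psi1(P0) = Psi2(P0) = 0.
rng = np.random.default_rng(1)
n, reps, covered = 300, 200, 0
for _ in range(reps):
    O = rng.normal(loc=[0.0, 0.0], scale=[1.0, 2.0], size=(n, 2))
    psi, lb, ub = stabilized_toy(O, 1.0, 4.0)
    covered += (lb <= 0.0 <= ub)
print(covered / reps)  # empirical coverage, which should be near 0.95
```

Note that the interval retains roughly nominal coverage even in the tied case that broke the plug-in estimator, since conditioning on the running argmax keeps each summand a martingale difference.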
3 Estimator
We now present our technique for a general estimation problem. We first introduce the notion of pathwise differentiability, since this provides the key object needed to construct our estimator.
3.1 Pathwise differentiability
We assume that each parameter Ψd, for any n, is pathwise differentiable for all distributions in our model [see, e.g., Pfanzagl, 1990, Bickel et al., 1993]. For each P ∊ ℳ, we let Dd(P) denote the canonical gradient of Ψd at P. By definition Dd(P)(O) is mean zero with finite variance under sampling from P. Typically pathwise differentiability implies that Ψd satisfies the following linear expansion for any P ∊ ℳ and :
Ψd(P0) − Ψd(P) = E_{P0}[Dd(P)(O)] + Rem_{n,d}(P),    (2)
where we omit the dependence of Rem_{n,d} on P0 in the notation and indicate its possible dependence on sample size with the subscript n. Above, Rem_{n,d}(P) is a second-order remainder term that is small whenever P is close to P0. We consider this condition more closely in our example, but for non-sample size dependent parameters this term can typically be made to be OP(n^{−1}) in a parametric model and often can be made to be oP(n^{−1/2}) in a nonparametric model. In the toy example from Section 2, the two parameters Ψd(P0) are linear, and so Dd(P)(o) = od − EP[Od] and Rem_{n,d}(P) = 0 for all d, P. For a more thorough presentation, see Pfanzagl [1990] or Bickel et al. [1993].
Knowing the canonical gradient of a parameter enables one to implement a one-step estimator [see Section 5.7 of van der Vaart, 1998]. To ease discussion, fix d. Suppose one has an initial estimate P̂ of the components of P0 needed to evaluate Ψd and Dd. Then a one-step estimate of Ψd(P0) is given by Ψd(P̂) + n^{−1} Σ_{i=1}^{n} Dd(P̂)(Oi). Under empirical process and consistency conditions on P̂, one can show that n^{1/2} times the difference between this estimate and Ψd(P0) converges in distribution to a normal random variable with estimable variance. In the next section, we present a variant of this estimator that allows for the selection of the optimizing d, even when the optimal index is non-unique.
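As a concrete (hypothetical) illustration of the one-step construction and of the second-order remainder in (2), consider the smooth parameter Ψ(P) = EP[O]², whose canonical gradient is D(P)(o) = 2 EP[O](o − EP[O]); simple algebra shows the remainder here is exactly (EP0[O] − EP[O])². The sketch below (ours, not from the paper) fits the initial estimate on the first half of the sample and corrects it on the second half:

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, n = 0.3, 2000                 # target: Psi(P0) = mu0**2 = 0.09
O = rng.normal(mu0, 1.0, size=n)

mu_hat = O[:n // 2].mean()         # initial estimate from the first half
plug_in = mu_hat ** 2              # Psi evaluated at the estimate

# One-step correction: add the empirical mean of the estimated canonical
# gradient D(P_hat)(o) = 2 * mu_hat * (o - mu_hat) over held-out points.
held_out = O[n // 2:]
one_step = plug_in + np.mean(2.0 * mu_hat * (held_out - mu_hat))

# The expansion Psi(P0) - Psi(P_hat) = E_P0[D(P_hat)(O)] + Rem holds with
# the second-order remainder Rem = (mu0 - mu_hat)**2 (exact algebra here).
rem = (mu0 ** 2 - plug_in) - 2.0 * mu_hat * (mu0 - mu_hat)
assert np.isclose(rem, (mu0 - mu_hat) ** 2)
print(plug_in, one_step)
```

The remainder is quadratic in the estimation error of the initial fit, which is what makes the one-step estimate parametric-rate even when the initial estimate converges more slowly.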
3.2 Estimator and confidence interval
We now present a stabilized one-step estimator for problems of the type found in (1) when the required differentiability condition on Ψd holds.
Let {ℓn} be some sequence such that n − ℓn → ∞. One possible choice is ℓn = 0 for all n. For each j = ℓn, …, n − 1, let dnj represent an estimate of a maximizer of (1) obtained using observations (Oi : i = 1, …, j), let P̂j be an estimate of P0 obtained using observations (Oi : i = 1, …, j), and let D̂nj equal Dd(P) evaluated at P = P̂j and d = dnj. For nonnegative weights wnj that we will define shortly with Σ_{j=ℓn}^{n−1} wnj = 1, our stabilized one-step estimate takes the form

ψn ≡ Σ_{j=ℓn}^{n−1} wnj [Ψ_{dnj}(P̂j) + D̂nj(O_{j+1})].
Our proposed 95% confidence interval has the form

[LBn, UBn] ≡ [ψn − 1.96 σ̄n (n − ℓn)^{−1/2}, ψn + 1.96 σ̄n (n − ℓn)^{−1/2}],
where we will define σ̄n momentarily and one can replace 1.96 by the desired quantile of the normal distribution to modify the confidence level.
We now define the weights. Let σ̂nj² represent an estimate of the variance of D̂nj(O), O ~ P0, conditional on observations O1, …, Oj. This estimate should only rely on those j observations. Often we can let σ̂nj² equal the sample variance of {D̂nj(Oi) : i = 1, …, j}.
The standard deviation type variable in the confidence interval definition is given by σ̄n ≡ [(n − ℓn)^{−1} Σ_{j=ℓn}^{n−1} σ̂nj^{−1}]^{−1}, and the weights are given by wnj ≡ σ̂nj^{−1} / Σ_{k=ℓn}^{n−1} σ̂nk^{−1}; to lighten the notation, we will sometimes omit the possible dependence of the weights on sample size and simply write wj.
Our estimator is similar to the online one-step estimator developed in van der Laan and Lendle [2014] for streaming data, but it weights each term proportionally to the estimated inverse standard deviation of D̂nj(O), O ~ P0. Our confidence interval takes a form similar to a Wald-type confidence interval, but replaces the typical standard deviation with σ̄n and has width on the order of (n − ℓn)^{−1/2} rather than n^{−1/2}. Note of course that ℓn = o(n) implies that (n − ℓn)^{−1/2} converges to zero at the rate n^{−1/2}.
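The construction above can be rendered as a short generic routine (our schematic sketch, not the paper's implementation; `psi_hat` and `grad` are placeholder names for user-supplied estimators of the maximizing index with its plug-in value and of the canonical gradient):

```python
import numpy as np

def stabilized_one_step(O, ell, psi_hat, grad, z=1.96):
    """Sketch of the stabilized one-step estimator.

    O       : array of n observations
    ell     : burn-in l_n
    psi_hat : psi_hat(past) -> (index d_nj, plug-in value Psi_{d_nj}(P_j))
    grad    : grad(past, d, o) -> estimated canonical gradient D_d(P_j)(o)
    """
    n = len(O)
    inv_sds, terms = [], []
    for j in range(ell, n):                   # past = first j observations
        past = O[:j]
        d, plug = psi_hat(past)
        g = np.array([grad(past, d, o) for o in past])
        sd = max(g.std(), 1e-12)              # sigma_hat_nj from those j points
        inv_sds.append(1.0 / sd)
        terms.append((plug + grad(past, d, O[j])) / sd)
    inv_sds, terms = np.array(inv_sds), np.array(terms)
    psi = terms.sum() / inv_sds.sum()         # weights w_nj propto 1/sigma_hat_nj
    half = z * np.sqrt(n - ell) / inv_sds.sum()
    return psi, psi - half, psi + half

# Toy usage: maximum of two means (Section 2), now with estimated variances.
rng = np.random.default_rng(3)
O = rng.normal(loc=[0.5, 0.0], scale=[1.0, 1.0], size=(400, 2))
psi, lb, ub = stabilized_one_step(
    O, ell=20,
    psi_hat=lambda past: (int(past.mean(axis=0).argmax()),
                          past.mean(axis=0).max()),
    grad=lambda past, d, o: o[d] - past[:, d].mean(),
)
print(round(psi, 2), lb < ub)
```

This naive rendering refits on the full past at each j and so runs in more than O(n) time; Appendix A.3 and Appendix B.2 describe how online updates recover the computationally efficient versions.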
3.3 Validity of confidence interval
We now prove the validity of our confidence interval. For each j, let σnj² denote the variance of D̂nj(O), O ~ P0, conditional on observations O1, …, Oj. The validity of the lower bound of the confidence interval relies on the following conditions:
C1) There exists some M < ∞ such that max_{ℓn ≤ j ≤ n−1} σ̂nj^{−1} sup_o |D̂nj(o)| ≤ M with probability approaching 1 as n → ∞.

C2) (n − ℓn)^{−1} Σ_{j=ℓn}^{n−1} σnj²/σ̂nj² → 1 in probability as n → ∞.

C3) (n − ℓn)^{−1/2} Σ_{j=ℓn}^{n−1} σ̂nj^{−1} Rem_{n,dnj}(P̂j) → 0 in probability as n → ∞, where Rem_{n,d} is as defined in (2).
The validity of the upper bound requires the following additional condition:
C4) (n − ℓn)^{−1/2} Σ_{j=ℓn}^{n−1} σ̂nj^{−1} [Ψn(P0) − Ψ_{dnj}(P0)] converges to zero in probability as n → ∞.
We now present our main result.
Theorem 1
(Validity of confidence interval). If C1), C2), and C3) hold, then lim inf_{n→∞} P0(LBn ≤ Ψn(P0)) ≥ 0.975.
If C4) also holds, then P0(LBn ≤ Ψn(P0) ≤ UBn) → 0.95 as n → ∞.
Proof
The definition of ψn combined with (2) yields that

(n − ℓn)^{1/2} σ̄n^{−1} (ψn − Ψn(P0)) = (n − ℓn)^{−1/2} Σ_{j=ℓn}^{n−1} σ̂nj^{−1} {D̂nj(O_{j+1}) − E_{P0}[D̂nj(O) | O1, …, Oj]}
    − (n − ℓn)^{−1/2} Σ_{j=ℓn}^{n−1} σ̂nj^{−1} Rem_{n,dnj}(P̂j) + (n − ℓn)^{−1/2} Σ_{j=ℓn}^{n−1} σ̂nj^{−1} [Ψ_{dnj}(P0) − Ψn(P0)].    (3)
The second line converges to zero in probability by C3) and C4). By C1), C2), and the martingale central limit theorem for triangular arrays in Gaenssler et al. [1978], (3) converges in distribution to a standard normal random variable. A standard Wald-type confidence interval construction argument shows that the confidence interval has coverage approaching 1 − α under C1) through C4).
Now suppose C4) does not hold. By (1), Ψ_{dnj}(P0) ≤ Ψn(P0) for every j, and so the corresponding term in (3) is nonpositive. The same argument then readily shows the validity of the lower bound under only C1), C2), and C3). □
The conditions of the theorem are discussed in Appendix A.1. Appendix A.2 considers the asymptotic efficiency of our estimator when the parameter in (1) does not rely on sample size. High level conditions are provided, and we then argue that these conditions are plausible when the maximizing index in (1) is unique. Appendix A.3 discusses computationally efficient implementations of our general estimator.
4 Maximal correlation example
4.1 Problem formulation
We now present the running example of this work, namely the maximal correlation estimation problem considered by McKeague and Qian [2015]. The observed data structure is O = (X, Y), where X = (Xk : k = 1, 2, …) is a [−1, 1]^∞-valued vector of predictors and Y is an outcome in [−1, 1]. For each n, we let 𝒦n represent a subset of these predictors of size p, where throughout we assume that
n^{−1/2} log p → 0 as n → ∞.    (4)
For readability, we omit the dependence of p on n in the notation. Under a distribution P, the maximal absolute correlation of a predictor with Y is given by
Ψn(P) ≡ max_{k ∈ 𝒦n} |CorrP(Xk, Y)|,    (5)
where CorrP(Xk, Y) is the correlation of Xk and Y under P. We wish to develop confidence intervals for Ψn(P0). When a test of H0 : Ψn(P0) = 0 against the complementary alternative is of interest, we also wish to establish the behavior of our test against local alternatives as was done in McKeague and Qian [2015].
In contrast to McKeague and Qian [2015], the procedure that we present in this work:

1) is proven to work when p grows with sample size at any rate satisfying (4);

2) yields confidence intervals for the maximal correlation rather than just a test of the null hypothesis that it is equal to zero;

3) allows a non-null maximizer in (5) to be non-unique.
While McKeague and Qian argued that 3) is unlikely in practice, having two non-null maximizers be approximately equal may still have finite sample implications for their test in some settings.
We now show that this problem fits in our framework. To satisfy the pathwise differentiability condition, we let 𝒟n ≡ 𝒦n × {−1, 1} and, for each d = (k, m) ∈ 𝒟n,

Ψd(P) ≡ m CorrP(Xk, Y).
Note that Ψn(P) now takes the form in (1), where we note that the use of m in the definition of Ψd serves to ensure that Ψn(P0) represents the correlation with the maximal absolute value.
4.2 Differentiability condition
Canonical gradients
For each k, let sP(Xk) denote the standard deviation of Xk under P, and likewise for sP(Y). For ease of notation we let s0(Xk) ≡ sP0(Xk), and likewise for s0(Y) and Corr0(Xk, Y). An application of the delta method shows that Ψd has canonical gradient Dd(P)(o) given by

Dd(P)(o) = m [ (xk − EP[Xk])(y − EP[Y]) / (sP(Xk) sP(Y)) − (CorrP(Xk, Y)/2) ( (xk − EP[Xk])²/sP(Xk)² + (y − EP[Y])²/sP(Y)² ) ].
In order to ensure that Dd(P0) is uniformly bounded for all d, we assume throughout that, for some δ ∊ (0, 1], s0(Y) ≥ δ and inf_k s0(Xk) ≥ δ.
Second-order remainder
Fix d = (k, m) ∈ 𝒟n and P ∈ ℳ. Let c > 0 be some constant such that both sP(Xk) and sP(Y) are larger than c. Lemma A.3 in Appendix B.1 proves that
| (6) |
The first term above is small if sP(Xk), sP(Y), and CorrP(Xk, Y) are close to s0(Xk), s0(Y), and Corr0(Xk, Y). The middle terms are small if EP[Xk] and EP[Y] are close to E0[Xk] and E0[Y]. The final terms are small if sP(Xk) and sP(Y) are close to s0(Xk) and s0(Y).
Variance of canonical gradients
For any given d, there is no elegant (and informative) expression for the variance of Dd(P0)(O). Nonetheless, we show in Lemma A.6 of Appendix B.1 that our estimates σ̂nj², taken as the sample variance of D_{dnj}(Pj)(O) under Pj for an index estimate dnj to be defined in the next subsection, concentrate tightly about the true variance with high probability when the sample size is large enough. Thus, in practice, one can actually check if this variance is small by looking at σ̂nj². If P0 is normal, then this variance is equal to (1 − Corr0(Xk, Y)²)², and so is only zero if |Corr0(Xk, Y)| = 1. Though such an elegant expression does not exist for the variance of Dd(P0)(O) for general distributions, one can still show in general that the variance of Dd(P0) is equal to zero only if |Corr0(Xk, Y)| = 1. Here we make the slightly stronger assumption that
Var_{P0}(Dd(P0)(O))^{1/2} ≥ γ for all d ∈ 𝒟n and all n, for some γ > 0.    (7)
4.3 Our estimator
We will use the estimator presented in Section 3 to estimate Ψn(P0). At each index j ≥ ℓn we use the empirical distribution Pj of the observations O1, …, Oj to estimate P0. We let our optimal index estimate be dnj ≡ (knj, mnj), where knj ≡ argmax_{k ∈ 𝒦n} |Corr_{Pj}(Xk, Y)| and mnj is the sign of Corr_{Pj}(X_{knj}, Y). We estimate σnj² with the variance of D_{dnj}(Pj)(O) under O ~ Pj.
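For concreteness, a sketch of how (knj, mnj) and the gradient standard deviation might be computed from the first j observations (our illustration, not the paper's code; the influence-function formula used below is the standard delta-method gradient of a correlation coefficient, an assumption of this sketch):

```python
import numpy as np

def index_and_sd(X, Y):
    """Estimate d_nj = (k_nj, m_nj) and the standard deviation of the
    estimated canonical gradient from the first j observations.

    Uses the standard correlation influence function
    D(o) = m * (zx * zy - (rho / 2) * (zx**2 + zy**2)),
    with zx, zy the standardized coordinates."""
    corrs = np.array([np.corrcoef(X[:, k], Y)[0, 1] for k in range(X.shape[1])])
    k = int(np.argmax(np.abs(corrs)))        # k_nj: maximal |correlation|
    m = 1.0 if corrs[k] >= 0 else -1.0       # m_nj: sign of that correlation
    zx = (X[:, k] - X[:, k].mean()) / X[:, k].std()
    zy = (Y - Y.mean()) / Y.std()
    D = m * (zx * zy - 0.5 * corrs[k] * (zx ** 2 + zy ** 2))
    return (k, m), D.std()

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 5))        # bounded predictors, as in the text
Y = np.clip(X[:, 0] + 0.3 * rng.normal(size=200), -1, 1)
(k, m), sd = index_and_sd(X, Y)
print(k, m, round(sd, 2))
```

Recomputing all p correlations from scratch at every j would cost more than O(np) overall; the streaming implementation discussed below instead updates running sums.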
In Appendix B.1, we detail conditions on ℓn which ensure that ℓn does not grow too slowly or quickly. For any ε ∊ (0, 2), one possible choice of ℓn that satisfies these conditions is
ℓn ≡ ⌈max{n^{ε/2}, log p}⌉.    (8)
We show that this choice of ℓn ensures C1), C2), and C3) in Appendix B.1. By Theorem 1 this establishes the validity of the lower bound of our confidence interval. We can also show that this lower bound is tight up to a term of the order n^{−1/4} βn.
Theorem 2
(Tightness of the lower bound). For any sequence tn → ∞, Ψn(P0) < LBn + tn n^{−1/4} βn with probability approaching 1.
We note that the choice of ℓn in (8) is only needed if the set of indices changes with sample size. For fixed 𝒦n, one could take ℓn fixed, e.g. ℓn = 2, and still have a valid lower bound provided the estimates σ̂nj are truncated from below at some fixed positive level (see Lemma A.1). While choosing ℓn according to (8) is still advisable since this is what will enable us to study the behavior of a hypothesis testing procedure under local alternatives, this invariance to ℓn should at least reassure the user that most choices of ℓn will perform reasonably well. In Luedtke and van der Laan [2016], we evaluated the stabilized one-step estimator at a variety of choices of ℓn and found little sensitivity to this tuning parameter. Nonetheless, we consider the development of a data adaptive procedure for choosing an ℓn satisfying C1), C2), and C3) an important area for future work. In parallel to how McKeague and Qian [2015] used the bootstrap to select their tuning parameter, one might consider using the bootstrap to select ℓn, though it remains to determine an appropriate selection criterion. Because our ℓn-specific lower bound is defined using a normal limiting result rather than the bootstrap, such a selection procedure would avoid the use of a computationally burdensome double bootstrap.
We now consider the validity of the upper bound of our confidence interval, which holds under C4). This condition is trivially valid if Ψn(P0) = 0 for all n. Condition C4) is also valid under the following margin condition:
MC) For some sequence tn → ∞, there exists a sequence of non-empty subsets 𝒦n⁰ ⊆ 𝒦n such that, for all n, the absolute correlations |Corr0(Xk, Y)|, k ∈ 𝒦n⁰, all fall within tn^{−1} n^{−1/2} of Ψn(P0), and sup_{k ∈ 𝒦n∖𝒦n⁰} |Corr0(Xk, Y)| ≤ Ψn(P0) − tn n^{−1/2} βn. If 𝒦n∖𝒦n⁰ = ∅, then the supremum over 𝒦n∖𝒦n⁰ is taken to be zero.
Theorem 3
(Validity of the upper bound). If MC) holds or Ψn(P0) = 0 for all n, then C4) holds, so that LBn ≤ Ψn(P0) ≤ UBn with probability approaching 1 − α.
We outline the techniques used to prove these two results at the end of this subsection. Complete proofs are given in Appendix B.1.
Suppose we wish to test H0 : Ψn(P0) = 0 against H1 : Ψn(P0) > 0. Consider the test that rejects H0 if LBn > 0. We wish to explore the behavior of this test under local alternatives where Ψn(P0) converges to zero slower than n^{−1/4} βn. Theorem 2 shows that this test has power converging to one under such local alternatives. Furthermore, as the lower bound is valid in general, this test has type I error of at most α/2 under the null. This is an exciting result, as it enables the study of local alternatives even when the dimension grows quickly with sample size. If the dimension does not grow with sample size, this shows that our test is consistent against any alternatives converging to zero slower than n^{−1/4} βn. We would not be surprised if the n^{−1/4} rate is unnecessary, but rather is simply a result of our proof techniques, which give high probability bounds on the concentration of our correlation estimates at each sample size. McKeague and Qian [2015] showed that their method is consistent against a class of alternatives converging to zero slower than n^{−1/2} provided the optimal index is unique. Our result does not rely on this uniqueness condition. We emphasize that we only use MC) to establish the validity of the upper bound of our confidence interval. Our lower bound, and therefore our ability to reject the null of uniformly zero correlation, is valid even without this margin condition.
Theorem 3 shows that the upper bound of our confidence interval is also valid under a reasonable margin condition. The margin condition states that there may be many non-null approximate maximizers provided their absolute correlations are well-separated from the absolute correlations of the other predictors with Y. By “approximate” we mean that their absolute correlations all fall within o(n^{−1/2}) of one another. If 𝒦n does not depend on sample size, then this theorem shows that our two-sided confidence interval is always valid.
Sketch of proofs of Theorems 2 and 3
Our proofs of both of these theorems rely on high-probability bounds on the absolute differences between our estimates of E0[Xk], E0[Y], s0(Xk), s0(Y), and Corr0(Xk, Y) and their population counterparts, uniformly over k ∈ 𝒦n and j. We show that, with probability at least 1 − 1/n, all of these absolute differences are upper bounded by constants (with explicit dependence on γ and δ) times j^{−1/2} log max{n, p}.
Condition C1) follows once we show that, with high probability, the empirical standard deviations sPj(Xk) and sPj(Y) are bounded below by δ/2 and σ̂nj is bounded below by γ/2, uniformly over j ≥ ℓn for n large enough. Conditions C2) and C3) are easy consequences of our concentration results. The concentration results also yield that

Σ_{j=ℓn}^{n−1} wnj [Ψn(P0) − Ψ_{dnj}(P0)] = OP(n^{−1/4} βn),

which then quickly yields Theorem 2 thanks to the expression in (3).
Now suppose MC) holds. By our concentration inequalities, with high probability we select an index knj ∈ 𝒦n⁰ for each j larger than some threshold depending on a constant C. We also correctly specify mnj to be the sign of Corr0(X_{knj}, Y). Because the absolute correlations over 𝒦n⁰ all fall within a small neighborhood of one another, the difference between Ψ_{dnj}(P0) for dnj = (knj, mnj) and Ψn(P0) is very small. If ℓn falls below this threshold, then we can apply our concentration inequalities to establish that the contributions of the first few values of j, for which knj may fall outside 𝒦n⁰, are small enough so that C4) still holds, yielding Theorem 3. □
In Appendix B.2, we show that our estimator runs in O(np) time. We show that the estimate can be computed using O(p) storage when the observations O1, …, On arrive in a data stream. This result is closely related to the fact that, for an ℝp-valued sequence {ti}, the running sum Sj ≡ Σ_{i=1}^{j} ti at j = n can be computed in O(np) time using O(p) storage. In particular, one can use the recursion Sj = tj + Sj−1, thereby only storing tj and Sj−1 when computing Sj. Our estimate can also be computed in O(np) time and O(n) storage when the vectors (Xjr : j = 1, …, n) ∈ ℝn arrive in a stream for r = 1, 2, …, p, where Xjr is the observation of Xr for individual j. We do not prove the O(n) storage result in the appendix due to space constraints, though the algorithm is closely related to that given in Appendix B.2.
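The running-sum strategy can be sketched as follows (our illustration, not the appendix's algorithm): a single O(np) pass maintains per-predictor sufficient statistics in O(p) storage via the recursion Sj = tj + Sj−1, from which all p correlations are recovered at the end.

```python
import numpy as np

def streaming_correlations(stream, p):
    """One pass over observations (x, y), x in R^p: maintain running sums
    (O(p) storage, O(np) total time), then recover all p correlations."""
    n = 0
    sx = np.zeros(p); sxx = np.zeros(p); sxy = np.zeros(p)
    sy = 0.0; syy = 0.0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        n += 1
        sx += x; sxx += x * x; sxy += x * y   # S_j = t_j + S_{j-1}
        sy += y; syy += y * y
    cov = sxy / n - (sx / n) * (sy / n)       # final O(p) pass
    var_x = sxx / n - (sx / n) ** 2
    var_y = syy / n - (sy / n) ** 2
    return cov / np.sqrt(var_x * var_y)

# Check against a batch computation on small simulated data.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
Y = X[:, 0] + rng.normal(size=50)
rho = streaming_correlations(zip(X, Y), p=3)
batch = np.array([np.corrcoef(X[:, k], Y)[0, 1] for k in range(3)])
assert np.allclose(rho, batch)
```

Note that the biased (divide-by-n) variance normalization cancels in the correlation ratio, so the streaming result agrees with the batch computation up to floating-point error.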
5 Simulation study
We now consider the power and scalability of our method using simulations similar to those described in McKeague and Qian [2015]. Let X ~ MVN(0, Σ) for Σ a p × p covariance matrix to be given shortly, and τ1, …, τp be a sequence of i.i.d. normal random variables independent of all other quantities under consideration. We will use two types of errors: the homoscedastic error τ1 and a heteroscedastic error η(X) whose variance depends on X. For (n, p) = (200, 200), (500, 2000), we generate data using the following distributions: (N.IE) Y = τ1, (A1.IE) Y = X1/5 + τ1, (A2.IE) , (N.DE) Y = η(X), (A1.DE) Y = X1/5 + η(X), and (A2.DE) . For (n, p) = (2 000, 30 000), we generate data using the following distributions: (N.IE) Y = τ1, (A3.IE) Y = X1/15 + τ1, and . We set all of the diagonal elements of the covariance matrix Σ equal to 1 and the off-diagonal elements equal to ρ, where for each simulation setting we let ρ = 0, 0.25, 0.5, 0.75. Unless otherwise specified, all simulations are run using 1 000 Monte Carlo replicates in R [R Core Team, 2014]. Code is available in the Supplementary Materials.
We conduct a 5% level test of H0 : Ψn(P0) = 0 against H1 : Ψn(P0) > 0 by checking whether the lower bound of a 90% confidence interval for this quantity is greater than zero. We use models N.IE and N.DE to evaluate type I error and all other models to evaluate power. We run our method with ℓn as in (8), where we let ε = 0.5. For ease of implementation, we compute our method on chunks of data of size (n − ℓn)/10 (see Section 6.1 of Luedtke and van der Laan, 2016).
We compare our method to the ART of McKeague and Qian [2015]. The ART relies on a tuning parameter λn satisfying λn = o(n^{1/2}) and λn → ∞ that is selected via a double bootstrap procedure. We ran code that we obtained from the authors that selects the tuning parameter from a grid indexed by a constant a varying between 0.5 and 4. Due to computational limitations, we ran 400 outer bootstrap samples and 200 inner bootstrap samples (rather than the default of 1 000 samples for both layers of the bootstrap), and also reduced the grid for a from the default (0.5, 0.55, …, 4) to (0.5, 0.6, …, 4). We also reduced the number of Monte Carlo replicates for the ART to 200 and only ran the ART at the smallest sample size (n, p) = (200, 200). While we were not able to run the double bootstrap at the moderate sample size (n, p) = (500, 2000) due to computational constraints, we were able to mimic the double bootstrap procedure by selecting an oracle choice of a. In particular, we ran the ART for the fixed choices a = 0.5, 2.25, 4, found that a = 4 appropriately controlled type I error while the other choices typically did not, and reported the results of the ART at this fixed tuning parameter. We were unable to run even this oracle procedure at the largest sample size due to computational constraints.
We also compared our procedure to the analogue of the ART described in Section 2 of Zhang and Laber [2015], which does not require running a double bootstrap. This procedure is referred to as the “parametric bootstrap” in Zhang and Laber [2015], though to avoid confusion with other bootstrap procedures we refer to their method as “ZL”. The ZL procedure assumes a locally linear model with homoscedastic errors. Note that the homoscedasticity requirement is stronger than the uncorrelated-error requirement made by the ART; in fact, the errors are guaranteed to be uncorrelated with the predictors under the null of zero maximal absolute correlation, thereby ensuring the type I error control of the ART. We use 500 bootstrap draws for each run of the ZL procedure. Zhang and Laber show that their method has comparable performance to the ART across sample sizes and predictor dimensions, while being more computationally efficient. The ZL procedure is less computationally intensive than the ART, but it still requires estimating the p × p covariance matrix Σ and simulating from a p-dimensional multivariate normal distribution. Due to computational constraints, we only run ZL for p ≤ 2 000 and not for p = 30 000. We also compare our method to a Bonferroni-corrected t-test.
Figure 1 displays the power of the four testing procedures for (n, p) equal to (200, 200) and (500, 2000) for the homoscedastic data generating distributions N.IE, A1.IE, and A2.IE. The ART and ZL procedures perform best in both of these settings. We can show (details omitted) that our method underperforms in this setting due to the second-order term representing the cost of estimating d0 on subsets of the data of size j ≪ n early in the procedure. While Theorem A.11 ensures that the estimate of d0 will be asymptotically valid, there appears to be a noticeable price to pay at small sample sizes.
Figure 1.

Power of the various testing procedures for (n, p) equal to (200, 200) and (500, 2000) under homoscedastic errors. The ART and ZL procedures perform the best in this setting.
Figure 2 displays the power of the testing procedures for (n, p) equal to (200, 200) and (500, 2000) for the heteroscedastic data generating distributions. The ZL procedure fails to control the type I error in this setting, which is unsurprising given that this test was developed under a local linear model with independent errors. All other methods adequately control the type I error, especially at the larger sample size n = 500, while the Bonferroni and ART procedures achieve slightly better power than our method for these data generating distributions.
Figure 2.

Power of the various testing procedures for (n, p) equal to (200, 200) and (500, 2000) under heteroscedastic errors. The ZL procedure fails to control the type I error in this setting.
Figure 3 displays the power of our method and the Bonferroni procedure for (n, p) equal to (2 000, 30 000). While (unsurprisingly) Bonferroni performs well when the correlation between the predictors in X is low, our method outperforms the Bonferroni procedure when the correlation increases. We expect that, were we able to run ART or ZL at this sample size, they would outperform all other methods under consideration as they did at the smaller sample sizes. Nonetheless, both methods quickly become computationally impractical when p gets large, whereas our procedure and the Bonferroni procedure can still be implemented at these sample sizes.
Figure 3.

Power of the test from the stabilized one-step and from the Bonferroni-adjusted t-test for (n, p) equal to (2000, 30000) under homoscedastic errors.
We also ran our method at different choices of ℓn for (n, p) = (200, 200) and (500, 2000) (details not shown), with ℓn defined according to (8) with ε = 0.25, 1, 1.5, 1.75. We found little sensitivity to the choice of ε, with the exception that choosing ε = 1.75 often led to a moderate loss of power (at most 15% on an additive scale). This is not surprising given that, at ε = 1.75, ℓn is approximately equal to n/2 for both (n, p) settings.
6 Discussion
We have presented a general method for estimating the (possibly non-unique) maximum of a family of parameter values indexed by . Such an estimation problem is generally non-regular because minor fluctuations of the data generating distribution can change the subset of for which the corresponding parameter is maximized. Our estimate takes the form of a sum of the terms of a martingale difference sequence, which allows us to readily apply the relevant central limit theorem to study its asymptotics and develop Wald-type confidence intervals. The estimator adapts to the non-regularity of the problem, in the sense that we can give reasonable conditions under which it is regular and asymptotically linear when the maximizer is unique so that regularity is possible.
We have applied our approach to the example of McKeague and Qian [2015] in which one wishes to learn about the maximal absolute correlation between a prespecified outcome and a predictor belonging to some set. The sample splitting that is built into our estimator has enabled us to analyze the estimator when the dimension p of the predictor grows with sample size slowly enough so that n−1/2 log p → 0 as n goes to infinity. While McKeague and Qian focus on testing the null hypothesis that this maximal absolute correlation is zero, we have established valid confidence intervals for this quantity. The lower bound of our confidence interval is particularly interesting because it is valid under minimal conditions. When p is very large, one might expect that the null of no correlation between the outcome and any of the predictors is unlikely to be true. In these problems, having an estimate of the maximal absolute correlation, or at least a lower bound for this quantity, will likely still be interesting as a measure of the overall relationship between X and Y.
We have also studied the behavior of this null hypothesis test under local alternatives, showing that our test is consistent when the maximal absolute correlation shrinks to zero more slowly than n−1/2(log max{n, p})1/2. When the dimension of the predictor is fixed, the test of McKeague and Qian is consistent against alternatives shrinking to zero more slowly than n−1/2, rather than (log n)1/2n−1/2. We would not be surprised to find that this (log n)1/2 factor is unnecessary for fixed p and can be removed using more refined proof techniques.
McKeague and Qian do not require that Y and the coordinates of X have range in [−1, 1]. We have made this boundedness assumption out of convenience for our proofs and expect that the boundedness assumptions can be replaced with appropriate moment assumptions without significantly changing the results. Our simulation results support this claim. The boundedness condition is not as restrictive as it may first seem, as unbounded X and Y can be rescaled to be bounded. Since the sharp null H0 : Ψn(P0) = 0 is invariant to strictly monotonic transformations of X and Y, our theoretical results yield a valid test of H0 after applying, e.g., the sigmoid transformation to X and Y.
We note that, in our simulations, ART and ZL achieve the highest power among competing methods, though for our heteroscedastic simulation setting ZL failed to control the type I error. We were not able to run either of these methods at our largest sample size due to computational constraints. The ZL procedure as currently described is computationally expensive and does not scale well to large data sets, especially when the dimension of the predictor p is large. This difficulty occurs because the procedure requires the computation of a p × p covariance matrix. The ART method presented in McKeague and Qian [2015], which achieves similar power to ZL, is in practice even more computationally burdensome due to its use of a double bootstrap. Nonetheless, from a theoretical computational complexity standpoint, the ART method can be made to scale as O(np) provided the number of bootstrap draws remains fixed. Though the number of bootstrap samples will likely be fixed in practical applications, we note that ART cannot maintain consistency against local alternatives unless the number of bootstrap samples grows with sample size, thereby yielding a slower than order-np runtime. As is to be expected from marginal screening procedures that perform an O(n) screening operation p times, our method attains an O(np) runtime. This computational efficiency, combined with the asymptotic theory supporting our method’s power against local alternatives under increasing covariate dimension and efficiency under fixed alternatives and covariate dimension, demonstrates what is achievable by marginal screening procedures. Given our simulations, we also believe that developing rigorous asymptotic theory under increasing dimension for the ART methods is an important area for future work.
The stabilized one-step estimator presented in this paper applies to many situations beyond those considered here. In earlier work, we showed that this estimator is useful for estimating the mean outcome under an optimal individualized treatment strategy [Luedtke and van der Laan, 2016], where the class now indexes functions mapping from the covariate space to the set of possible treatment decisions. Thanks to the martingale structure of our estimator, the stabilized one-step estimator can be used to construct confidence intervals when the data are drawn sequentially, so that the data generating distribution for observation j can depend on that of the first j − 1 observations. One interesting example along these lines is obtaining inference for the value of the optimal arm in a multi-armed bandit problem, even when the optimal arm is non-unique and the reward distributions for the optimal arms have different variances. We look forward to seeing further applications of the general template for a stabilized one-step estimator presented here.
Supplementary Material
Appendix A General estimator
A.1 Discussion of conditions of Theorem 1
In this section, we consider the setting where the parameter in (1) does not depend on sample size, and consequently omit the n subscript from quantities that no longer depend on it. We will show that C7) and the following conditions imply the conditions of Theorem 1:
- C9) converges to zero in probability as j → ∞.
- C10) converges to zero in probability as j → ∞.

The validity of the upper bound requires the following additional condition:

- C11) converges to zero in probability as j → ∞.
For simplicity, we will take ℓn = 0 in this section.
We now discuss the conditions. Condition C1) is an immediate consequence of C7) and Dd(P)(o) being uniformly bounded in P ∈ ℳ, , . This will be plausible in many situations, including the examples in this paper. A more general Lindeberg-type condition also suffices [see Condition C1 in Luedtke and van der Laan, 2016], though we omit its presentation here for brevity.
The other three conditions all rely on terms like converging to zero in probability, possibly at some rate. Ideally we want a stochastic version of the fact that, for β ∊ [0, 1),
| (A.1) |
Lemma 6 of Luedtke and van der Laan [2016] establishes this result. We restate it here for convenience.
Lemma A.1
(Lemma 6 in Luedtke and van der Laan, 2016). Suppose that Rj is some sequence of (finite) real-valued random variables such that for some β ∊ [0, 1), where we assume that each Rj is a function of {Oi : 1 ≤ i ≤ j}. Then,
Conditions C2) through C4) are now easily handled. Condition C2) is a consequence of the fact that
where the inequality holds by C7) and the convergence holds by C9) and Lemma A.1. Condition C9) is easily shown to hold under Glivenko-Cantelli conditions on the estimators and dj [see, e.g., Theorem 7 in Luedtke and van der Laan, 2016]. Conditions C3) and C4) are an immediate consequence of C10) and C11) combined with Lemma A.1.
While sufficient conditions for C11) should be developed in each individual example, we can give intuition as to why this condition should be reasonable. For any P ∈ ℳ, let d(P) return a maximizer of (1). We are interested in ensuring that is small, where dn is our estimate of a maximizer of (1). This can be expected to hold when the parameter P ↦ Ψd(P)(P0) has pathwise derivative zero at P = P0, where the P0 in the Ψ argument is fixed. When well-defined, the pathwise derivative will be zero because d(P) is chosen to maximize Ψd(P0) in d.
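Heuristically, this is an envelope-theorem-type calculation. Assuming, purely for this informal sketch, that d is smoothly parameterized, that the maximizer is interior, and that ε ↦ Pε is a smooth path through P0, we have

```latex
\frac{d}{d\epsilon}\Big|_{\epsilon=0}\Psi_{d(P_\epsilon)}(P_0)
  \;=\; \nabla_d \Psi_{d}(P_0)\Big|_{d=d(P_0)} \cdot \frac{d}{d\epsilon}\Big|_{\epsilon=0} d(P_\epsilon)
  \;=\; 0,
```

since the first-order condition $\nabla_d \Psi_d(P_0)|_{d=d(P_0)} = 0$ holds at an interior maximizer of $d \mapsto \Psi_d(P_0)$.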
A.2 Efficiency when the maximizer in (1) is unique
We have presented a parametric-rate estimator for Ψn(P0), but thus far we have not made any claims about the efficiency of our estimator. In this section, we consider a fixed parameter in (1) that does not rely on sample size. We therefore omit the n subscript in many quantities to indicate their lack of dependence on sample size. We will give conditions under which our estimator is asymptotically efficient among all regular, asymptotically linear estimators. The efficiency bound is not typically well-defined when the maximizer is non-unique due to the non-regularity of the problem: generally no regular, asymptotically linear estimator exists in this case, so neither does an efficient member of this class [Hirano and Porter, 2012]. Thus the conditions that we give in this section will typically only hold when the maximizer in (1) is unique.
We use the following additional assumptions for our efficiency result:
- C5)
- C6) There exists some M < ∞ such that and with probability approaching 1 as j → ∞.
- C7) with probability 1 over draws of (Oj : j = 0, 1, …).
We discuss the conditions immediately following the theorem.
Theorem A.2
(Asymptotic efficiency). Suppose that Ψ does not depend on sample size and is pathwise differentiable with canonical gradient . Further suppose that ℓn = o(n). If C1) through C7) hold, then
Furthermore,
Thus, ψn is asymptotically efficient among all regular, asymptotically linear estimators.
The proof is entirely analogous to the proof of Corollary 3 in Luedtke and van der Laan [2016] and so is omitted. See Lemma 25.23 of van der Vaart [1998] for a proof that asymptotic linearity with influence function given by the canonical gradient implies regularity.
The additional conditions needed for this result over Theorem 1 are mild when the maximizing index is unique. Condition C5) says that Ψ should have the same canonical gradient as . While this should be manually checked in each example, it will be fairly typical when the maximizer is unique, since in this case an arbitrarily small fluctuation of P0 will generally not change the maximizer. This is similar to problems in introductory calculus where the derivative at the maximum is zero. Condition C5) requires that converge to in mean-squared error, which is to be expected if begins to approximate P0 and dnj converges to the unique maximizer d0 as n, j → ∞. Condition C6) is a bounding assumption on the canonical gradient and estimates thereof that will hold in many examples of interest. Finally, Condition C7) will hold if one knows that is bounded away from zero uniformly in P ∈ ℳ and , and uses this knowledge to truncate for some deterministic sequence γj → 0. For γj sufficiently small and j sufficiently large this truncation scheme will then have no effect on the variance estimates .
A.3 Computationally efficient implementation
There are several computationally efficient ways to compute our estimate. In Section 6.1 of Luedtke and van der Laan [2016], we show that the runtime of our estimator can be dramatically improved by running the algorithm used to compute each a limited number of times, say ten times. We do not detail this approach here, though we note that the theorems we have presented are general enough to apply to this case.
An alternative approach to improving runtime is to use the estimator’s online nature to compute it efficiently in both time and storage. Suppose that we have an algorithm that updates the estimate of P0 based on the first j observations to one based on the first j + 1 observations by looking at Oj+1 only. This will often be feasible if the parameter of interest and the bias correction step only require estimates of certain components of P0, e.g. of a set of regression and classification functions. In these cases we can apply modern regression and classification approaches to estimate these quantities [see, e.g., Xu, 2011, Luts et al., 2014]. Often dnj can also be obtained using online methods, and thus can be estimated online by keeping a running sum. This quantity is not equal to ψn because it does not yet include the weights.
It will not in general be possible to compute the weights online, though their computation does not require storing O(n) observations in memory. We can estimate consistently using rj observations, where rj → ∞ but can grow very slowly (even log j suffices asymptotically, though such slow growth is not recommended for finite samples). Given online estimates of these variances, it is then straightforward to compute both and the weights and incorporate these into our estimator. In some cases, we can compute the weights, and thus the estimate, in a truly online fashion. Describing general sufficient conditions for this appears to be difficult, but we conjecture that it will typically not be possible if is not of finite cardinality. The weights can be computed online in the maximal correlation example.
Appendix B McKeague and Qian [2015] example
B.1 Proofs and results
Lemma A.3
Fix and . For any P with and , (6) holds.
Proof
Straightforward but tedious calculations show that
| (A.2) |
The result follows by taking the absolute value of both sides, applying the triangle inequality, using that ab ≤ (a2 + b2)/2 for any real a, b, that CorrP(Xk, Y) ≤ 1, and the lower bound on the variances. □
We now establish high probability bounds on the difference between , , and and their population counterparts, uniformly over and j. We will use ≲ to denote “less than or equal to up to a universal multiplicative constant”. Let denote the following class of functions mapping from to the real line:
| (A.3) |
Note that . We will use this class to develop concentration results about our estimates of the needed portions of the likelihood. This class is actually somewhat larger than is needed for most of our results, as in fact
suffices for concentrating our estimates of Corr0(Xk, Y), s0(Xk), and s0(Y). Nonetheless, using this larger class will allow us to prove results about the concentration of about , and stating it as a single class is convenient for brevity.
For and j ∈ {1, …, n}, define the empirical process as
where we use Pj to denote the empirical distribution of O0, …, Oj−1 and Pf ≡ EP[f(O)] for any distribution P. Let . Theorem 2.14.1 in van der Vaart and Wellner [1996] shows that
| (A.4) |
where the expectation is over the draws O1, …, Oj. We have used that our class is bounded by the constant 1.
Let
| (A.5) |
Define the events
where C in the definition of is equal to the smallest universal constant satisfying (A.4) plus 1.
Lemma A.4
For any sample size n, the event occurs with probability at least 1 − n/max{n², p} ≥ 1 − 1/n.
Proof
We first upper bound the probability of the complement of for each n, j. Fix n and j ≤ n. By the bounds on X and Y, changing one Oi in (O1, …, Oj) to some other value in the support of P0 can change b by at most . Thus satisfies the bounded differences property with bound , and we may apply McDiarmid’s inequality [McDiarmid, 1989] to show that, with probability at least 1 − exp(−2t²), . Choosing and using (A.4) yields that, with probability at least 1 − 1/max{n², p}, the following inequality holds for all j = 1, …, n:
where C′ denotes the universal constant in (A.4).
By De Morgan’s laws and a union bound, it follows that the event occurs with probability at least 1 − n/max{n², p} ≥ 1 − 1/n. □
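For reference, the bounded differences (McDiarmid) inequality invoked in the proof above states the following: if O1, …, Oj are independent and a function f satisfies the bounded differences property with constants c1, …, cj (changing the i-th argument changes f by at most ci), then for every t > 0,

```latex
P\bigl\{f(O_1,\dots,O_j) - \mathbb{E}\,f(O_1,\dots,O_j) \ge t\bigr\}
  \;\le\; \exp\!\left(-\frac{2t^2}{\sum_{i=1}^j c_i^2}\right).
```

In particular, when each $c_i$ is of order $1/j$, the denominator $\sum_i c_i^2$ is of order $1/j$, so deviations of order $\sqrt{\log \max\{n^2, p\}/j}$ are exponentially unlikely, which is the scaling used in the proof.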
We have shown that occurs with high probability. Now we show that our estimates of variances, covariances, and correlations perform well when occurs.
Lemma A.5
Fix a sample size n ≥ 2. The occurrence of implies that, for all j = 2, …, n:
-
1)
;
-
2)
;
-
3)
;
-
4)
;
-
5)
,
where we define when either or is equal to zero.
Proof
Suppose holds and fix . The triangle inequality and the bounds on Xk yield that
This gives 2). For 1), note that
The same argument yields 3) and 4).
Again fix k. An application of the triangle inequality and the bounds on Xk and Y readily yield that . Furthermore,
Taking the absolute value of both sides, applying the triangle inequality, and using the lower bounds on s0(Xk) and s0(Y) and the upper bound on yields that . This holds for all k, so 5) holds. □
Lemma A.6
Let C be the smallest universal constant in 2) of Lemma A.5, and let n be any natural number satisfying n ≥ ⌈4C−2δ−2 log max{n, p}⌉ ≡ J(n, δ). Under these conditions, the occurrence of implies that, for all j = J(n, δ), …, n,
- 8) and ;
- 9) and ;
- 10) ;
- 11) .
Proof
By Lemma A.5, 2) holds, and using that j ≥ J(n, δ), we see that
The same argument works for , so 8 holds. Furthermore,
where the final two inequalities hold by 8. This proves the first part of 9, and the bound on holds by the same argument. For the second result, note that
Using 8, the bounds on X and Y, and the triangle inequality shows that
where we have used that contains all polynomials of Xk, Y of degree at most 4. By the occurrence of , the final line is upper bounded by a constant times δ−2 Knj. This yields 10.
For 11, we will bound and then combine this with 10 using the triangle inequality. We have that
Now we use that and (a + b)2 ≤ 2 (a2 + b2) for any real a, b to see that . By 5) from Lemma A.5 and the fact that j ≥ J(n, δ), the maximum over is bounded above by a constant times . Continuing with the above,
where we used 8 for the second to last inequality. □
Lemma A.7
Suppose the conditions of Lemma A.6 hold. Then the occurrence of implies that, for all j = J(n, δ), …, n,
8.
Proof
By Lemma A.6, and . By Lemma A.3, this yields
By the bounds on X and Y and the triangle inequality, . Applying 9 from Lemma A.6 and the results of Lemma A.5 to the above yields the result. □
Lemma A.8
Let γ be as defined in (7). For a constant C(γ, δ) > 0 depending on γ and δ only, the occurrence of implies that, for all j = ⌈C(γ, δ) log max{n, p}⌉, …, n,
8. .
Sketch of proof
Suppose . By 11 and 8, for all j ≥ J(n, δ)
It is easy to confirm that, for a universal constant C > 0, the above yields that the left-hand side is upper bounded by γ/2 for all j ≥ Cγ−1/2δ−2 max {δ−3, γ−3/2} log max{n, p} ≡ C(γ, δ)log max{n, p} ≥ J(n, δ). An application of the triangle inequality gives the result. □
The remainder of the results in this section are asymptotic in nature. We omit the dependence on δ and γ in these statements as these quantities are treated as fixed as sample size grows. Throughout we assume that
| (A.6) |
| (A.7) |
| (A.8) |
In view of (A.6) and (A.7), we see that, roughly, ℓn grows faster than log max{n, p} if βn goes to zero faster than and at least as fast as if βn goes to zero more slowly than . Given an ε > 0, one possible choice of ℓn that satisfies these properties is
We have the following result.
Lemma A.9
For all n large enough, ℓn is at least J(n, δ) and is at least C(γ, δ) log max{n, p}, where these quantities are defined in Lemmas A.6 and A.8, respectively.
Proof
This is an immediate consequence of (A.6) and the fact that δ and γ are fixed as sample size grows. □
Theorem A.10
C1), C2), and C3) hold.
Proof
- C1) By Lemma A.9, we can apply 8 from Lemma A.6 and Lemma A.8 provided n is large enough. In that case, for all j ≥ ℓn provided holds. By Lemma A.4, this occurs with probability at least 1 − 1/n, and thus C1) holds.
- C2)
- C3) Suppose that n is large enough so that the results of Lemma A.9 apply. Also suppose that occurs. We have that
The fact that occurs with probability approaching 1 (Lemma A.4) yields C3). □
Let kn0 be a (possibly non-unique) maximizer k of . For each r > 0, let denote the set of all such that .
The upcoming theorem uses the following conditions to establish the validity of a hypothesis test of no effect and of the upper bound of our confidence interval, respectively:
M1) For some sequence {tn} with tn → +∞, there exists a sequence of non-empty subsets such that, for all n,
If , then the supremum on the right-hand side is taken to be zero.
M2) The conditions of M1) hold, and also
The first of these conditions will be used to establish the consistency of a null hypothesis significance test. The second of these conditions is similar to margin conditions used in classification, and will be used to establish the validity of our confidence interval.
Theorem A.11
| (A.9) |
If also M1), then the right-hand side of the above can be tightened to . If also M2), then C4) holds.
Proof
Suppose that holds and n is large enough so that the results of Lemma A.9 apply. For each j ≥ ℓn, let knj denote the k which maximizes . Let and . Then, for a universal constant C > 0,
| (A.10) |
where the final inequality holds by 5) of Lemma A.5. Using that and (A.8),
By Lemma A.8, this then implies that the left-hand side of (A.9) is upper bounded by an O(γ−1/2n−1/4βn) term under , and so Lemma A.4 yields (A.9).
For the second result, suppose that M1) holds. Observe that, for all for C as defined in (A.10), . Furthermore, . Thus as defined in M1). Furthermore, mnj must equal , since otherwise
contradicting the fact that per (A.10). Because , we see that . Hence,
| (A.11) |
Further, if , (A.10) yields
It follows that the left-hand side above is greater than or equal to a positive universal constant times . Dividing the left-hand side by n − ℓn and applying (A.8) yields that this same result holds with an upper bound on the order of . Combining this with (A.11) shows that
Using that . When proving the first result (A.9) we also showed that the left-hand side is upper bounded by a positive constant times −δ−1n−1/4βn. Combining with Lemma A.8 and using that holds with probability approaching 1 (Lemma A.4) shows that the left-hand side of (A.9) is . If M2) holds, then this expression is , and so C4) holds. □
B.2 Computationally efficient implementation of our estimator
In this section, we describe how to implement the estimator for the McKeague and Qian [2015] example in O(np) time. We show that this can be accomplished using O(p) storage when the observations O1, …, On arrive in a stream.
Fix n so that the set of predictor indices is also fixed. For each j, let Pj denote the empirical distribution of the first j observations. Recall the definition of the class from (A.3), and note that contains O(p) functions. It is easy to see that, at j = 2, we can compute for each using O(p) time and storage. Furthermore, for j > 3 the fact that shows that we can compute and save Pjf in O(p) time and storage if we know Oj and Pj−1 f. To attain this storage complexity, we remove Pj−2 f, , from memory for each j ≥ 4 so that P2 f, …, Pj−2 f are not stored in memory.
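The O(p) running-mean update described above can be sketched as follows. This is our own illustrative code (the name `update_means` is not from the paper), and it assumes the class of functions is represented simply by the list of their values at the new observation:

```python
def update_means(j, means, f_values):
    """O(p) running-mean update: the mean of each f over the first j
    observations equals ((j - 1) * previous_mean + f(O_j)) / j, so only
    the latest vector of means needs to be kept in memory."""
    return [((j - 1) * m + v) / j for m, v in zip(means, f_values)]
```

Applied repeatedly as observations arrive, this reproduces the empirical means Pjf while storing only O(p) numbers.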
We now have an algorithm that, at observation j, starts with Oj and Pj−1 f, , stored in memory and, after running the steps described in the preceding paragraph, also has Pjf, stored in memory. Given Pjf, , one can compute and save , , and , Z equal to Y or Xk, , in O(p) time and storage. We can now compute and save , in O(p) time and storage. If the predictors or outcome are large and their variance small, the described online computation of the sample variance may lead to numerical difficulties. See Welford [1962] for a better estimate of the variance in this setting.
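The Welford [1962] update mentioned above maintains a running mean and a running sum of squared deviations so that the variance can be computed stably in one pass. A minimal sketch (our own illustrative code, not the paper's implementation):

```python
def welford_update(count, mean, m2, x):
    """One Welford step: update the running count, mean, and sum of squared
    deviations (m2) with a new observation x, in O(1) time and storage."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)  # uses the updated mean for numerical stability
    return count, mean, m2

def welford_variance(values):
    """Population mean and variance of a stream via repeated Welford updates."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count, mean, m2 = welford_update(count, mean, m2, x)
    return mean, m2 / count
```

Dividing m2 by count − 1 instead gives the sample variance; either choice avoids the catastrophic cancellation that the naive sum-of-squares formula can suffer when the mean is large relative to the variance.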
Let Hj denote the collection of (i) the integer j, (ii) Pjf, , (iii) , and (iv) Cov(Xk, Y), and , . For j ≥ 2, let UPDATEH be a function which takes as input (Oj+1, Hj) and outputs Hj+1. We have shown that UPDATEH(Oj+1, Hj) can run in O(p) time for any j ≥ 2. We call a separate function INITIALIZEH on (O1, O2) to obtain the initial value H2. This function runs in O(p) time and storage.
Let MAXIMIZER be a function that takes as input Hj and returns the dj = (kj, mj) which maximizes in , thereby allowing us to compute . Finding dj involves finding the maximum of numbers, and therefore can be accomplished in O(p) time.
The function CALCD takes as input Hj, Oj+1, and dj and calculates . It is easy to see that this can be accomplished in O(1) time and O(p) storage.
For ease of notation in the following paragraph and equation, we omit the dependence of dj = (kj, mj) on j. Since Dd(Pj) is a gradient for Ψd at Pj and gradients have mean zero, PjDd(Pj) = 0. For any , tedious but straightforward calculations show that
Observe that all expectations on the right-hand side above are expectations over some applied to the observed data structure. It follows that the above can be computed in O(1) time using a subset of the O(p) expectation, standard deviation, and correlation estimates stored in Hj. Let CALCSIGHAT denote the function which takes as input Hj and dj and outputs . We have shown that CALCSIGHAT (Hj, dj) runs in O(1) time.
The pseudocode in ESTPSI describes our estimator, with most of the work done in the recursion step described in the function RECURSION. Because each call of RECURSION runs in O(p) time, the n − ℓn = O(n) iterations of the for loop in ESTPSI require O(np) time in total. The storage requirement of each call of RECURSION is O(p). Because the code in the for loop in ESTPSI deletes the output from the previous recursion step, the total storage requirement of ESTPSI is O(p).
Algorithm Recursion Step for Estimating Ψ(P0)
function Recursion(Oj+1, ψj, Hj, , ℓn)
    if j < ℓn then ψj+1 = 0 and
    else
        dj = Maximizer(Hj)

        ▹ By convention, 0/0 = 0.

    Hj+1 = UpdateH(Oj+1, Hj)
    return
Algorithm Estimate Ψ(P0) Using Sample of Size n
function EstPsi(n, ℓn)
    Read O1, O2 from data stream
    Base case: ψ2 = 0, , and H2 = InitializeH(O1, O2)
    for j = 2, …, n − 1 do
        Read Oj+1 from data stream

        Remove from memory
    return Point estimate ψn and confidence interval
References
- Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press; Baltimore: 1993.
- Chakraborty B, Moodie EE. Statistical Methods for Dynamic Treatment Regimes. Springer; Berlin Heidelberg New York: 2013.
- Gaenssler P, Strobel J, Stute W. On central limit theorems for martingale triangular arrays. Acta Math Hungar. 1978;31(3):205–216.
- Hirano K, Porter JR. Impossibility results for nondifferentiable functionals. Econometrica. 2012;80(4):1769–1790.
- Luedtke AR, van der Laan MJ. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics. 2016;44(2):713–742. doi: 10.1214/15-AOS1384.
- Luts J, Broderick T, Wand MP. Real-time semiparametric regression. Journal of Computational and Graphical Statistics. 2014;23(3):589–615.
- McDiarmid C. On the method of bounded differences. Surveys in Combinatorics. 1989;141(1):148–188.
- McKeague IW, Qian M. An adaptive resampling test for detecting the presence of significant predictors. Journal of the American Statistical Association. 2015;110(512). doi: 10.1080/01621459.2015.1095099.
- Pfanzagl J. Estimation in semiparametric models. Springer; Berlin Heidelberg New York: 1990.
- R Core Team. R: a language and environment for statistical computing. 2014. URL http://www.r-project.org/
- van der Laan MJ, Lendle SD. Online Targeted Learning. Technical Report 330, Division of Biostatistics, University of California, Berkeley: 2014. Available at http://www.bepress.com/ucbbiostat
- van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York Berlin Heidelberg: 2003.
- van der Vaart AW. On differentiable functionals. Annals of Statistics. 1991;19:178–204.
- van der Vaart AW. Asymptotic statistics. Cambridge University Press; New York: 1998.
- van der Vaart AW, Wellner JA. Weak convergence and empirical processes. Springer; Berlin Heidelberg New York: 1996.
- Welford BP. Note on a method for calculating corrected sums of squares and products. Technometrics. 1962;4(3):419–420.
- Xu W. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490. 2011.
- Zhang Y, Laber EB. Comment. Journal of the American Statistical Association. 2015;110(512):1451–1454.