Author manuscript; available in PMC: 2014 Aug 3.
Published in final edited form as: J Stat Res. 2012 Dec 1;46(2):113–156.

ADAPTIVE MATCHING IN RANDOMIZED TRIALS AND OBSERVATIONAL STUDIES

Mark J van der Laan 1, Laura B Balzer 2, Maya L Petersen 3
PMCID: PMC4119765  NIHMSID: NIHMS604043  PMID: 25097298

SUMMARY

In many randomized and observational studies the allocation of treatment among a sample of n independent and identically distributed units is a function of the covariates of all sampled units. As a result, the treatment labels among the units are possibly dependent, complicating estimation and posing challenges for statistical inference. For example, cluster randomized trials frequently sample communities from some target population, construct matched pairs of communities from those included in the sample based on some metric of similarity in baseline community characteristics, and then randomly allocate a treatment and a control intervention within each matched pair. In this case, the observed data can neither be represented as the realization of n independent random variables, nor, contrary to current practice, as the realization of n/2 independent random variables (treating the matched pair as the independent sampling unit). In this paper we study estimation of the average causal effect of a treatment under experimental designs in which treatment allocation potentially depends on the pre-intervention covariates of all units included in the sample. We define efficient targeted minimum loss based estimators for this general design, present a theorem that establishes the desired asymptotic normality of these estimators and allows for asymptotically valid statistical inference, and discuss implementation of these estimators. We further investigate the relative asymptotic efficiency of this design compared with a design in which unit-specific treatment assignment depends only on the units’ covariates. Our findings have practical implications for the optimal design and analysis of pair matched cluster randomized trials, as well as for observational studies in which treatment decisions may depend on characteristics of the entire sample.

Keywords and phrases: Cluster randomized trials, matching, asymptotic linearity of an estimator, causal effect, efficient influence curve, empirical process, confounding, dependent treatment allocation, G-computation formula, influence curve, loss function, adaptive randomization, semiparametric statistical model, targeted maximum likelihood estimation, targeted minimum loss based estimation (TMLE)

1 Introduction

In a typical randomized controlled trial, one randomly draws a unit from a target population, measures baseline covariates on the unit, randomly assigns a treatment from among a set of possible treatments according to a known distribution (possibly conditional on the baseline covariates of the unit), and measures the unit’s treatment-specific outcome. This experiment is repeated n times resulting in n independent and identically distributed (i.i.d.) copies, providing a firm basis for statistical estimation and inference using the central limit theorem.

In this article we consider an alternative data generating experiment in which one first randomly draws n units from a target population and measures the baseline covariates of each, then assigns n treatments from among some set according to a conditional distribution, given the n unit-specific baseline covariates, and finally measures the n treatment-specific outcomes. In such an experiment, the underlying units are independently and identically distributed draws from a common target population, so that the covariates and the underlying treatment-specific outcomes represent an i.i.d. sample. However, the treatment assigned to one unit can be a function of the covariates of other units in the sample, creating dependence between the n unit-specific observed data structures. As a result, the data generating design cannot be represented as n repetitions of an experiment, nor even as n independent experiments. The challenge posed to statistical inference by this design is highlighted by the fact that it is unclear how to implement valid bootstrap-based variance estimation: the available data constitute a single repetition of the underlying experiment.

Our study of this problem is motivated, in particular, by a common design-based approach to enforce empirical balance in baseline covariates among the treated and non-treated units in a finite sample. One way to enforce such balance is to partition the random sample of n units into n/2 pairs that are maximally similar with respect to covariate values according to some metric, and then to randomly allocate a treatment and a control intervention within each pair. A variation of this design partitions the sample into fewer than n/2 pairs, discarding those units for which the poorest matches are obtained. Such pair-matched designs are particularly common in community or cluster randomized controlled trials, motivated both by a desire to improve efficiency, and by the fact that such trials typically enroll far fewer independent units than their individual-level counterparts, and are thus less able to rely on chance alone to achieve the desired covariate balance between treatment groups (see, for example, reviews by Donner and Klar (2000), Hayes and Moulton (2009), Campbell et al. (2007)). Treatment assignment conditional on such a covariate-based partition of the sample preserves randomization while ensuring a degree of covariate balance between treatment and control arms of the trial. However, since the partition (in this case, the construction of the matched pairs) is generated as a function of all n covariate vectors, the treatment assignment of each unit in the sample is now a function of the covariate values of the entire sample.

While the fact that pair matching in randomized trials can introduce dependence between units is well-recognized, the extensive literature on the design and analysis of pair matched trials, including the literature debating the merits and perils of pair-matching, focuses on experimental designs in which the matched pair constitutes the unit of independence (Freedman et al. (1990); Campbell et al. (2007); Hayes and Moulton (2009); Imai et al. (2009); Imai (2008); Donner and Klar (2000); Murray (1998); Raudenbush et al. (2007); Klar and Donner (1997); Balzer et al. (2012)). In one well studied design, two units are sampled from a conditional distribution given a stratum of a baseline covariate, the treatment and control intervention are randomly allocated to the pair, the outcomes for the two units are observed, and the experiment is repeated multiple times at different strata. In such an experiment, the data generating distribution involves independently repeating the stratum-specific experiment of drawing the pair of units from the stratum, assigning the treatments, and measuring the outcomes, across different strata. Therefore, statistical inference can be based on a central limit theorem for sums of independent random variables. If the strata are set by design, then the data for each pair are independent across pairs (with a stratum-specific data distribution for each pair) but are not identically distributed.

A variation of this design is based on randomly sampling a unit and measuring a baseline covariate on the unit, and then sampling a second unit from a conditional distribution, given that the baseline covariate has the same value as that of the first unit. Treatment and control are allocated within the matched pair, outcomes on each unit in the pair are measured, and the experiment is repeated multiple times. In this case, the data on each pair are not only independent across the pairs but are also identically distributed. van der Laan (2008), Rose and van der Laan (2009), and Balzer et al. (2012) discuss formulation of the above two data structures in terms of matched case-control sampling, and present corresponding targeted minimum loss-based estimators.

The focus in the literature on designs in which the matched pair represents the unit of independence may be due in part to the specific studies for which much of the early theory was developed. These include randomized trials in ophthalmology in which the patient’s two eyes provide the matched pair, as well as some cluster randomized trials. For example, the Community Intervention Trial for Smoking Cessation (COMMIT) motivated important early work on the use of pair matching in cluster randomized trials (Freedman et al. (1997); Gail et al. (1992); COMMIT Research Group (1991)). This study in fact sampled (albeit not randomly) 11 matched pairs of communities from a larger population of candidate matched pairs.

In contrast, however, many current cluster randomized trials employ a fundamentally different pair matched design. Communities are first sampled, and only then are matched pairs created from among this finite set by applying some algorithm to the baseline characteristics of the communities included in the sample. We refer to this design as “adaptive matching” in order to make explicit its links to the larger literature on adaptive study designs, and specifically adaptive treatment allocation in response to characteristics of the previously observed units: Bai et al. (2002); Andersen et al. (1994); Flournoy and Rosenberger (1995); Hu and Rosenberger (2000); Rosenberger (1996); Rosenberger et al. (1997); Rosenberger and Grill (1997); Rosenberger and Shiram (1997); Tamura et al. (1994); Wei (1979); Wei and Durham (1978); Wei et al. (1990); Zelen (1969); Cheng and Shen (2005); van der Laan (2008); Chambaz and van der Laan (2010); van der Laan and Rose (2012).

Recent cluster randomized trials that have employed adaptive matching include the SPACE study of a school level intervention to improve physical activity in Denmark (Toftager et al., 2011), a cluster randomized trial of routine HIV-1 viral load monitoring in Zambia (Koethe et al., 2010), and the PRISM trial of a community-level intervention to prevent post-partum depression in Australia (Watson et al., 2004). Under adaptive matching, the matched pair no longer represents the independent sampling unit. Instead, such a design corresponds to a special case of the general experimental design in which the allocation of treatment among a sample of n independent and identically distributed units is a function of the covariates of all sampled units. This raises a number of questions with practical implications for the design and analysis of cluster (as well as individual) randomized trials. When will adaptively pair matched designs result in efficiency gains relative to their non-matched counterparts? What is the optimally efficient approach to estimating the treatment effect in such studies? How should statistical inference be carried out given the dependence between randomized units?

The results developed in this paper also apply to observational studies in which treatment decisions for each participant in a randomly sampled cohort may be influenced by the covariates of all or a subset of the other cohort members, while the participant-specific outcome is only influenced by a participant’s own covariates and assigned treatment. Consider, for example, a study that aims to evaluate the impact of enrollment in a weight loss program on participant weight loss. The study protocol might bring subsets of the sampled cohort members together to discuss the program, after which participants are allowed to decide whether or not they wish to enroll. In such a study, enrollment probabilities might differ depending on the characteristics of the subgroup to which enrollees are assigned (for example, the extent to which the subgroup includes charismatic or vocal individuals who have failed similar weight loss programs in the past). As a result, enrollment decisions within a subgroup are no longer independent.

Finally, a special case of the general experiment described in this article is one in which the treatment allocation for each unit in an i.i.d. sample from some target population can be a different function of the sample characteristics of all of the other units in the sample. Conditional on the baseline characteristics of the sample, the treatment assignment of each individual is independent; however, the individual-specific assignment mechanisms are not identical across the individuals. In the example study to evaluate the effect of a weight loss program, the entire sample might be divided up in several subgroups, allowing the subjects within a subgroup to mingle and talk among themselves, before being provided with information about the weight loss program and subsequently deciding whether to enroll. Individual enrollment decisions in such a scenario might depend on the characteristics of other attendees in that subgroup. One might be willing to assume that each individual’s enrollment decision is made independently, given what he or she has observed about the characteristics of the other attendees in the subgroup in contrast to the previous example in which decisions within subgroups were dependent. However, an individual’s enrollment decision is indexed by the subgroup he or she belongs to, so that treatment allocation is not identical across all individuals.

1.1 Organization of article

In Section 2 we define the statistical estimation problem posed by estimating the additive causal effect of treatment (or average treatment effect) under the general experimental design in which treatment allocation can depend on the characteristics of other units in the sample. Specifically, we define the data generating experiment, the observed data, the likelihood, the statistical model and the target parameter.

In Section 3 we study the fundamental mathematics of the design by determining the tangent space of the model and the canonical gradient of the pathwise derivative of the target parameter. Section 4 presents a targeted minimum loss based estimator (TMLE) of the additive causal effect of treatment and discusses its implementation. The TMLE presented is double robust. In particular, it remains consistent and asymptotically normally distributed as long as the treatment assignment mechanism for the n treatments is known or well estimated, even if the conditional mean of the outcome given the treatment and covariates is estimated inconsistently. We further present an estimator of the asymptotic variance of this TMLE. Interestingly, it appears that no double robust estimator of this variance is available, so that asymptotically consistent estimation of the variance requires a consistent estimator of the outcome regression. This stands in strong contrast with designs in which treatment is independently assigned. In Section 5 we present a theorem that provides a formal basis for the estimators introduced in Section 4, and in particular establishes the asymptotics of the TMLE and thereby the validity of the statistical inference based on a normal limit distribution.

Section 6 discusses implications of these results for the design and analysis of randomized trials with adaptive pair matching. In particular, we discuss implementation of a TMLE of the average treatment effect and corresponding statistical inference in terms of confidence intervals and testing. While consistent estimation of the variance requires consistent estimation of the outcome regression, for this special case we show that a conservative estimate of the variance is possible. We further contrast the asymptotic variance of the adaptive pair matched design with the asymptotic variance of a design in which the intervention is randomly allocated to each unit independently. This provides insight into the potential benefits of pair matching in cluster randomized trials, beyond that provided by previous literature in which the matched pair constituted the unit of independence. In addition, we contrast the approach to statistical inference presented in this article to the standard approach employed in pair matched randomized trials, in which the average treatment effect is estimated as the sample mean of paired differences, and the variance is estimated as the sample variance treating the pairs as independent. It is shown that this standard approach provides conservative inference, under an explicitly stated assumption which is generally expected to hold.

Section 7 extends these results to the common case in which some units in the initial sample are missing treatment and outcome data. Such a case would occur, for example, in a cluster randomized trial with adaptive pair matching in which treatment is allocated only among the subset of sampled units for which adequate pair matches were identified. We conclude with a summary of the practical implications of our results and identify areas for future work. Proofs of all theorems and an overview of the required empirical process theory are provided in an Appendix.

1.2 Novel contributions of this article

To the best of our knowledge, the estimation problem addressed in this article has not been formally studied. This estimation problem targets the usual average causal effect, but the dependent allocation of treatment allowed by our model makes it different from other estimation problems that the literature has covered.

Even though targeted maximum likelihood estimation is a general method that has been applied to many problems in the literature (see e.g., van der Laan and Rose (2012) for an overview and comprehensive coverage of this method), the actual construction of a targeted maximum likelihood estimator for a new estimation problem, as defined by the statistical model and target parameter, requires new research: it relies on the construction of a least favorable sub-model for fluctuating an initial estimator and a loss function so that the loss-function specific score of the least favorable sub-model at zero fluctuation spans the efficient influence curve. In particular, this requires determining the efficient influence curve (i.e., canonical gradient of pathwise derivative) for this target parameter in this new model. Indeed, the resulting TMLE as developed in this article is new and not presented anywhere else. In addition, the analysis of this TMLE relies on the state of the art methods in empirical process theory as presented in van der Vaart and Wellner (1996). Finally, the implications of our results for the analysis of cluster randomized trials and observational studies in which treatment allocation depends on the covariates of other units in the sample are new and important. In particular, our theoretical results allow us to formally compare the efficiency of different possible matched pair designs mentioned in the introduction. This work will appear in a future article.

2 Definition of Statistical Estimation Problem

2.1 Observed data

Let $X^n = (X_1, \ldots, X_n)$ be a vector consisting of $n$ i.i.d. observations of $X_i = (W_i, Y_i(0), Y_i(1))$, where $W_i$ denotes the baseline covariates, and $(Y_i(0), Y_i(1))$ denotes the treatment-specific counterfactual outcomes for subject $i$. (In words, $Y_i(a)$ denotes the outcome that would have been observed for unit $i$ if it had received treatment level $A = a$.) Let $P_{X,0}$ denote the common distribution of $X_i$. In addition, $g_0^n(A_1, \ldots, A_n \mid X^n)$ is the true conditional distribution of the treatment or intervention $A^n = (A_1, \ldots, A_n)$, conditional on $X^n$. The observed data are $O_i = (W_i, A_i, Y_i = Y_i(A_i))$, $i = 1, \ldots, n$, so that only one counterfactual outcome, corresponding to the treatment actually received, is observed for each unit. Note that $O^n = (O_1, \ldots, O_n)$ is a many-to-one function of $A^n$ and $X^n$, and is thus a missing or censored data structure in which the full data is $X^n$ and the censoring or missingness variable is $A^n$.

We assume throughout that the conditional distribution of $A^n$, given $X^n$, $g_0^n(\cdot \mid X^n)$, is only a function of $X^n$ through $W^n = (W_1, \ldots, W_n)$, which implies the coarsening at random assumption on $g_0^n$ with respect to the full data $X^n$ (Heitjan and Rubin (1991); Jacobsen and Keiding (1995); Gill et al. (1997)). This assumption allows for dependence between $A_1, \ldots, A_n$, as long as it can be explained by the covariate vector $W^n$. One important class of examples covered by such treatment mechanisms $g_0^n$ consists of studies that first partition the sample $\{1, \ldots, n\}$ into groups based on the covariate vector $W^n$ and subsequently randomly assign the treatment and control intervention within each group. For example, cluster randomized trials are commonly implemented by first partitioning the sample $\{1, \ldots, n\}$ into $n/2$ pairs based on some metric of similarity in baseline covariates $W^n$, and then randomly assigning a treatment and a control condition to the two members of each pair. More formally, $g_0^n$ in such a design can be defined as follows: given $W^n$ and thereby the disjoint pairs $C_j(W^n) = \{j_1, j_2\} \subset \{1, \ldots, n\}$ with $C_1(W^n) \cup \cdots \cup C_{n/2}(W^n) = \{1, \ldots, n\}$, within each pair $C_j(W^n)$ assign $(1,0)$ or $(0,1)$ with a flip of a fair coin (i.e., with probability 0.5).
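The pair-matched mechanism just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the matching algorithm of any particular trial: the covariate is taken to be a single number per unit, and the hypothetical helper `neighbour_pairs` simply sorts the sample and pairs adjacent units. Any deterministic rule that maps the full covariate vector to disjoint pairs would fit the definition equally well.

```python
import random

def neighbour_pairs(w):
    """Partition indices 0..n-1 into n/2 disjoint pairs.

    Assumption for illustration: w holds one scalar covariate per unit,
    and units adjacent in sorted order are paired. The resulting pairs
    depend on the covariates of the entire sample.
    """
    if len(w) % 2 != 0:
        raise ValueError("need an even number of units")
    order = sorted(range(len(w)), key=lambda i: w[i])
    return [(order[k], order[k + 1]) for k in range(0, len(order), 2)]

def assign_within_pairs(pairs, rng):
    """Assign (1, 0) or (0, 1) within each pair by a fair coin flip."""
    labels = {}
    for j1, j2 in pairs:
        coin = rng.randint(0, 1)
        labels[j1], labels[j2] = coin, 1 - coin
    return [labels[i] for i in range(len(labels))]

rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(10)]   # simulated baseline covariates
pairs = neighbour_pairs(w)                      # C_1(W^n), ..., C_{n/2}(W^n)
a = assign_within_pairs(pairs, rng)             # one draw of A^n given W^n
```

Because the pairs are computed from all covariates at once, changing the covariate of a single unit can change the partner, and hence the treatment label, of other units.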

Instead of using the Neyman-Rubin counterfactual formulation above, this observed data generating distribution can also be described in terms of a nonparametric structural equation model (NPSEM) (Pearl (1995, 2009)) as follows. Let $W_i = f_W(U_{W,i})$, where $U_{W,i}$, $i = 1, \ldots, n$, are i.i.d.; $A^n = f_{A^n}(W^n, U_{A^n})$; and $Y_i = f_Y(W_i, A_i, U_{Y,i})$, with $U_{Y,i}$, $i = 1, \ldots, n$, i.i.d. The analogue of the coarsening at random assumption in terms of this NPSEM is that $U_{A^n}$ is independent of $(U_{Y,i}: i = 1, \ldots, n)$, given $W^n$. The functions $f_Y$ and $f_W$ are unspecified, but the function $f_{A^n}$ and the distribution of $U_{A^n}$ might be known. For example, the sample might be partitioned into groups according to some known algorithm applied to the baseline characteristics of the sample, and the intervention $A$ randomly assigned within each group.

2.2 Likelihood and statistical model

Under both formulations of the data generating experiment, the observed data is $O_i = (W_i, A_i, Y_i)$, $i = 1, \ldots, n$, and the likelihood of the observed data $O^n = (O_1, \ldots, O_n)$, under distribution $P^n$, is given by

$$P^n(O_1, \ldots, O_n) = \left\{ \prod_{i=1}^n Q_W(W_i)\, Q_Y(Y_i \mid W_i, A_i) \right\} g^n(A^n \mid W^n),$$

where $Q_W = Q_W(P^n)$ and $Q_Y = Q_Y(P^n)$ denote the common marginal distribution of $W$ and the common conditional distribution of $Y$, given $A, W$, respectively. We put no constraints on the sets of possible $Q_Y$ and $Q_W$, which corresponds with putting no constraints on the common full data distribution $P_{X,0}$ (or no constraints on the NPSEM specified above beyond assumptions on the equation for $A^n$). Regarding the treatment mechanism $g_0^n$, we assume that

$$g_0^n(A^n \mid W^n) = \prod_{j=1}^J g_{0,j}\big( (A(j): j \in C_j(W^n)) \mid W^n \big), \tag{2.1}$$

where $C_1(W^n), \ldots, C_J(W^n)$ is a partitioning of the sample $\{1, \ldots, n\}$ into $J$ groups deterministically implied by $W^n$. Thus, it is assumed that, conditional on $W^n$, the treatment labels within a group are assigned independently of the treatment labels in any other group. It is assumed that $\liminf_{n \to \infty} J(n)/n > 0$ so that the asymptotics will still be driven by $n$. Let $g_i^n$ be the conditional distribution of $A_i$, given $W^n$. Although not necessary for deriving the desired asymptotics, we assume that this distribution $g_i^n$ of $A_i$, given $W^n$, is non-deterministic, $i = 1, \ldots, n$. The set of possible $g_0^n$ will be denoted with $\mathcal{G}^n$. The set of all possible data distributions $P^n$ implied by the nonparametric model on $P_{X,0}$ and the model $\mathcal{G}^n$ for $g_0^n$ represents a statistical model $\mathcal{M}^n$ for the true data distribution $P_0^n$. This general model will be referred to as $\mathcal{M}^n$. (Generalization of our results to general $J(n)$ with rates of convergence $1/\sqrt{J(n)}$ should be possible as well, but is not pursued here.)

2.2.1 Special models of interest for the treatment mechanism

A special choice for $\mathcal{G}^n$ consists of distributions satisfying $g^n(A^n \mid W^n) = \prod_i g_i(A_i \mid W^n)$. In this particular model it is assumed that, given $W^n$, $A_1, \ldots, A_n$ are independent with conditional distributions $g_i^n$ of $A_i$, given $W^n$, $i = 1, \ldots, n$. This choice, which corresponds to partitions of size 1, allows treatment to be assigned to each unit in the sample according to a distinct unit-specific mechanism that is allowed to depend on the baseline covariates of the entire sample. Such a data generating process might arise in a study, such as the weight loss example presented in the introduction, in which subgroups of individuals are allowed to interact before each assigning themselves independently to the treatment or control condition. Then the baseline covariates of the subgroups influence individual treatment decisions, but these decisions are still made independently given the baseline covariates of the cohort. We refer to this choice of $\mathcal{G}^n$ as $\mathcal{G}_1^n$ and refer to the corresponding statistical model, implied by the nonparametric model on $P_{X,0}$ and $\mathcal{G}_1^n$, as $\mathcal{M}_1^n$.

A second special choice for $\mathcal{G}^n$ consists of distributions satisfying $g^n(A^n \mid W^n) = \prod_{j=1}^{n/2} g\big( (A(j): j \in C_j(W^n)) \mid W^n \big)$, where $C_j(W^n) = \{j_1, j_2\}$, $j = 1, \ldots, n/2$, represents a partitioning of the sample $\{1, \ldots, n\}$ into $n/2$ disjoint pairs $C_j(W^n)$. This class of treatment assignment mechanisms describes randomized trials with adaptive pair matching. We refer to this choice of $\mathcal{G}^n$ as $\mathcal{G}_2^n$ and refer to the corresponding statistical model, implied by the nonparametric model on $P_{X,0}$ and $\mathcal{G}_2^n$, as $\mathcal{M}_2^n$.

2.3 Target statistical parameter

We focus on the target quantity $\Psi^F(P_X) = E\{Y(1) - Y(0)\}$, a particular parameter of the full-data distribution $P_X$ or, equivalently, a parameter of the distribution of the counterfactuals $(Y(0), Y(1))$ defined by the NPSEM. This quantity is often referred to as the average treatment effect, and corresponds to the causal quantity typically targeted by randomized trials as well as many observational studies. Under coarsening at random, $E_{Q_Y}(Y \mid A = a, W) = E_{P_X}(Y(a) \mid W)$, while the parameters $(Q_Y, Q_W)$ of $P^n$ are identifiable parameters of $P^n$. This target quantity is thus identified by the distribution of the data $P^n$ as follows:

$$\Psi^F(P_X) = \Psi(Q) = E_{Q_W}\{ E_{Q_Y}(Y \mid A = 1, W) - E_{Q_Y}(Y \mid A = 0, W) \} = E_{Q_W}\{ \bar{Q}(1, W) - \bar{Q}(0, W) \},$$

where $Q = (Q_W, \bar{Q})$ denotes the common distribution $Q_W$ of $W_i$, and the common conditional mean $\bar{Q}$ of $Y_i$, given $A_i, W_i$. Here $Q = Q(P^n)$ is a parameter of the observed data distribution $P^n$. This identifiability result now defines a target parameter $\Psi: \mathcal{M}^n \to \mathbb{R}$ of the observed data distribution, defined as $\Psi(P^n) = \Psi(Q)$ (where we abuse notation by using the same $\Psi$ for two different mappings).

The estimation problem is now defined: we want to estimate $\Psi(P^n)$ based on $O^n = (O_1, \ldots, O_n) \sim P^n \in \mathcal{M}^n$, and we also want to provide asymptotic inference in terms of confidence intervals and tests of null hypotheses.

3 The canonical gradient of the pathwise derivative of the target parameter

In order to construct efficient asymptotically linear estimators, and in particular targeted minimum loss-based estimators (van der Laan and Rubin (2006); van der Laan and Rose (2012)), of $\Psi(P^n)$, we first determine the tangent space of the model and the canonical gradient of the pathwise derivative of the target parameter.

Let

$$D^*(Q, g)(W, A, Y) = \frac{2A - 1}{g(A \mid W)} \big( Y - \bar{Q}(A, W) \big) + \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi(Q),$$

which is the efficient influence curve of $\Psi: \mathcal{M} \to \mathbb{R}$ with $\Psi(P) = \Psi(Q)$ under i.i.d. sampling from $P_{Q,g}$, where $g$ is a conditional distribution of $A$, given $W$ (van der Laan and Robins (2003); van der Laan and Rose (2012)). We will also denote $D^*(Q, g)$ with $D^*(Q, g, \Psi(Q))$ to stress its representation as an estimating function in $\psi$. We note $D^*(Q, g, \Psi(Q)) = D_Y^*(\bar{Q}, g) + D_W^*(Q)$, where

$$D_Y^*(\bar{Q}, g) = \frac{2A - 1}{g(A \mid W)} \big( Y - \bar{Q}(A, W) \big), \qquad D_W^*(Q) = \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi(Q).$$
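To make the two components concrete, the following short Python sketch evaluates $D_Y^*$ and $D_W^*$ for a single observation. The outcome regression `qbar`, the mechanism `g`, and the numeric values are hypothetical choices made only for this illustration; the function itself just transcribes the two formulas above.

```python
def d_star(w, a, y, qbar, g, psi):
    """Return (D_Y, D_W), the two components of D*(Q, g) at one observation.

    qbar(a, w): conditional mean of Y given A=a, W=w.
    g(a, w):    conditional probability of A=a given W=w.
    psi:        value of the target parameter Psi(Q).
    """
    d_y = (2 * a - 1) / g(a, w) * (y - qbar(a, w))
    d_w = qbar(1, w) - qbar(0, w) - psi
    return d_y, d_w

def qbar(a, w):
    # Hypothetical linear outcome regression, for illustration only.
    return 0.2 + 0.3 * a + 0.1 * w

def g(a, w):
    # Hypothetical fair-coin treatment mechanism.
    return 0.5

psi = 0.3  # qbar(1, w) - qbar(0, w) is 0.3 for every w under this qbar
d_y, d_w = d_star(w=1.0, a=1, y=1.0, qbar=qbar, g=g, psi=psi)
```

Here the $D_W^*$ component vanishes because the hypothetical regression has a constant treatment contrast, so all of the influence of this observation comes through the weighted residual $D_Y^*$.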

The following theorem presents the canonical gradient of the pathwise derivative of the parameter $\Psi: \mathcal{M}^n \to \mathbb{R}$. (For semiparametric efficiency theory, see e.g., Bickel et al. (1997); van der Laan and Robins (2003); van der Vaart (1998).) This canonical gradient is expressed in terms of the above function $D^*(Q, g)$.

Theorem 1

Let $O_i = (W_i, A_i, Y_i)$, $O^n = (O_1, \ldots, O_n) \sim P^n$, with

$$P^n(O_1, \ldots, O_n) = \left\{ \prod_{i=1}^n Q_W(W_i)\, Q_Y(Y_i \mid W_i, A_i) \right\} g^n(A^n \mid W^n),$$

where $Q_W$ is an unspecified marginal distribution, $Q_Y$ is an unspecified conditional distribution of $Y$, given $A, W$, and $g^n$ is a conditional distribution of $A^n = (A_1, \ldots, A_n)$, given $W^n = (W_1, \ldots, W_n)$, known to be an element of a set $\mathcal{G}^n$ consisting of distributions satisfying (2.1). Let $\mathcal{M}^n$ be the resulting statistical model for $P^n$. Let $\mathcal{M}^n(g^n)$ be the model if $g^n$ is known.

Let $\Psi: \mathcal{M}^n \to \mathbb{R}$ be defined by $\Psi(P^n) = E_{Q_W}\{ \bar{Q}(1, W) - \bar{Q}(0, W) \}$, where $\bar{Q}(A, W) = E_{Q_Y}(Y \mid A, W)$.

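The tangent space at $P^n$ in model $\mathcal{M}^n$ is given by:

$$T(P^n) = \left\{ \sum_{i=1}^n \phi(W_i): \phi \in T_W \right\} + \left\{ \sum_{i=1}^n \phi(Y_i \mid A_i, W_i): \phi \in T_Y \right\} + \sum_{j=1}^J T_{C_j}, \tag{3.1}$$

where $T_W = \{ h(W): E h(W) = 0 \}$,

$$T_Y = \{ h(Y \mid A, W): E_{Q_Y}( h(Y \mid A, W) \mid A, W ) = 0 \},$$

and

$$T_{C_j} = \{ S( (A_j: j \in C_j(W^n)) \mid W^n ): E(S \mid W^n) = 0 \}.$$

The tangent space at $P^n$ in model $\mathcal{M}^n(g^n)$ is given by

$$T(Q) = \left\{ \sum_{i=1}^n \phi(W_i): \phi \in T_W \right\} + \left\{ \sum_{i=1}^n \phi(Y_i \mid A_i, W_i): \phi \in T_Y \right\}.$$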

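The statistical parameter $\Psi$ is pathwise differentiable and its canonical gradient at $P^n$ is given by

$$D^{n,*}(P^n) = \frac{1}{n} \sum_{i=1}^n D^*(Q, \bar{g}_n)(O_i) = \frac{1}{n} \sum_{i=1}^n \left\{ D_W^*(Q)(W_i) + D_Y^*(Q, \bar{g}_n)(O_i) \right\},$$

where $g_i$ is the conditional distribution of $A_i$, given $W_i$, and

$$\bar{g}_n(a \mid W) = \frac{1}{n} \sum_{i=1}^n g_i(a \mid W).$$

We note that

$$g_i(1 \mid W_i) = \int_{(w_j: j \neq i)} g_i\big( 1 \mid (w_j: j \neq i), W_i \big) \prod_{j \neq i} dQ_W(w_j) \tag{3.2}$$

is a function of $g_i(A_i \mid W^n)$ and the common marginal distribution $Q_W$.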

Double robustness of canonical gradient

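We have

$$E_0 D^{n,*}(\bar{Q}, \bar{g}_n, \psi_0) = 0 \quad \text{if } \bar{Q} = \bar{Q}_0 \text{ or } \bar{g}_n = \bar{g}_{n,0}, \tag{3.3}$$

assuming that for all $i$, $0 < g_i(1 \mid W) < 1$ a.e. More generally, if $Q_W = Q_{W,0}$, then for any $\bar{Q}$, $\bar{g}$, we have

$$E_0 D^{n,*}(\bar{Q}, \bar{g}, \Psi(Q)) = \psi_0 - \Psi(Q) + E_0 \left( \frac{\bar{g}_0}{\bar{g}}(1 \mid W) - 1 \right) (\bar{Q}_0 - \bar{Q})(1, W) - E_0 \left( \frac{\bar{g}_0}{\bar{g}}(0 \mid W) - 1 \right) (\bar{Q}_0 - \bar{Q})(0, W).$$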

The proof is presented in the Appendix.
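As an illustrative special case of the averaged mechanism in Theorem 1, consider the fair-coin pair-matched design of Section 2.1. This worked example is added for orientation and uses only that definition. Whatever pairs the sample covariates induce, each unit receives treatment with conditional probability one half given $W^n$, so integrating over the covariates of the other units leaves

$$g_i(1 \mid W^n) = \tfrac{1}{2} \quad \Longrightarrow \quad g_i(1 \mid W_i) = \tfrac{1}{2}, \qquad \bar{g}_n(a \mid W) = \frac{1}{n} \sum_{i=1}^n \tfrac{1}{2} = \tfrac{1}{2}, \quad a \in \{0, 1\}.$$

The outcome component of the canonical gradient then reduces to

$$D_Y^*(\bar{Q}, \bar{g}_n)(O) = 2\,(2A - 1)\big( Y - \bar{Q}(A, W) \big),$$

so in this design the dependence between treatment labels does not enter the form of the gradient itself.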

4 A targeted minimum loss-based estimator (TMLE)

Derivation of the canonical gradient of the pathwise derivative of the target parameter $\Psi$ allows us to construct a targeted minimum loss based estimator (TMLE). In this section we present a TMLE for $\Psi$ for the general statistical model $\mathcal{M}^n$, in which $g_0^n(A^n \mid W^n) = \prod_{j=1}^J g_{0,j}\big( (A(j): j \in C_j(W^n)) \mid W^n \big)$ is unknown. This TMLE is thus applicable to studies, such as the example presented in the introduction, in which a cohort of individuals is partitioned into subgroups and individuals within subgroups are allowed to interact in determining their treatment assignment according to some unknown mechanism. It further covers the special cases in which the sample is partitioned into $n$ singletons and in which $g_0^n$ is known (as in an adaptively pair matched trial). Section 6 considers the latter special case in greater detail.

We recall from the literature on TMLE (van der Laan and Rose (2012)) that specification of a TMLE of a target parameter $\Psi(Q_0)$ requires specification of a loss function $L(Q)$ and a sub-model $\{Q(\varepsilon): \varepsilon\}$ through $Q$ at $\varepsilon = 0$, possibly indexed by a nuisance parameter (in our case, $\bar{g}$), so that $\frac{d}{d\varepsilon} L(Q(\varepsilon)) \big|_{\varepsilon = 0}$ spans the canonical gradient (in our case, $D^{n,*}(Q, \bar{g})$). Since $Q = (Q_W, \bar{Q})$ and the empirical distribution $Q_{W,n}$ is already a nonparametric maximum likelihood estimator of $Q_W$, the TMLE will only involve fluctuating $\bar{Q}$.

Loss function and initial estimator for $\bar{Q}_0$

Let $Y \in \{0, 1\}$ be binary or continuous in $(0, 1)$. Let $\bar{Q}_n^0$ be an initial estimator of $\bar{Q}_0$, which can be based on the loss function

$$-L_i(\bar{Q})(O_i) = Y_i \log \bar{Q}(A_i, W_i) + (1 - Y_i) \log\big( 1 - \bar{Q}(A_i, W_i) \big). \tag{4.1}$$

To understand the validity of this loss, note that

$$-E_0 L_i(\bar{Q})(O_i) = E_{Q_{W,0}, g_{0,i}} \left\{ \bar{Q}_0(A_i, W_i) \log \bar{Q}(A_i, W_i) + (1 - \bar{Q}_0)(A_i, W_i) \log\big( 1 - \bar{Q}(A_i, W_i) \big) \right\},$$

so that the risk $E_0 L_i(\bar{Q})(O_i)$ is indeed minimized at $\bar{Q} = \bar{Q}_0$. This demonstrates that $L_i(\bar{Q})$ is a valid loss function for $\bar{Q}_0$. Specifically, one could fit $\bar{Q}_0$ by minimizing $\sum_{i=1}^n L_i(\bar{Q}_\theta)(O_i)$ over a parametric or semiparametric working model $\{\bar{Q}_\theta: \theta \in \Theta\}$. Furthermore, to select among estimators such as different choices of working models or different algorithms, we can also use this loss to carry out cross-validation based estimator selection. Since conditional on $A^n, W^n$, the outcomes $Y_i$, $i = 1, \ldots, n$, are independent, a cross-validation selector that uses (4.1) as loss function and treats $i = 1, \ldots, n$ as the index of the independent units when splitting the sample into training and validation sets will satisfy an oracle type inequality analogous to the ones developed and presented in van der Laan and Dudoit (2003); van der Laan et al. (2006); van der Vaart et al. (2006). Thus an initial estimator of $\bar{Q}_0$ can be based on applying a data-adaptive loss-based learning approach such as super learning, ignoring the dependence between the treatment labels (van der Laan et al. (2007); Polley and van der Laan (2010) and Chapter 3 by Polley, Rose, van der Laan in van der Laan and Rose (2012)).
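The validity of loss (4.1) for outcomes that are continuous in $(0, 1)$ can be checked numerically in the simplest possible setting. The sketch below is an illustration under stated assumptions: the five outcome values are made up, and the working model is a single constant $q$ searched over a grid rather than a regression on $(A, W)$. For such a constant model the empirical risk should be minimized at the sample mean, mirroring the population statement that the risk is minimized at the true conditional mean.

```python
import math

def neg_log_loss(y, q):
    """Loss (4.1): -[y log q + (1 - y) log(1 - q)], for y in [0, 1], q in (0, 1)."""
    return -(y * math.log(q) + (1.0 - y) * math.log(1.0 - q))

# Hypothetical outcomes, continuous in (0, 1).
y = [0.10, 0.35, 0.40, 0.80, 0.95]

# Empirical risk of a constant working model q, evaluated on a fine grid.
grid = [k / 1000 for k in range(1, 1000)]
risk = [sum(neg_log_loss(yi, q) for yi in y) for q in grid]

q_hat = grid[risk.index(min(risk))]   # grid minimizer of the empirical risk
y_mean = sum(y) / len(y)              # sample mean of the outcomes
```

In practice the constant would be replaced by a working model such as a logistic regression on $(A, W)$, or by a cross-validated library of learners, with the same loss used to compare candidates.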

Least favorable sub-model through initial estimator

Let $\bar{g}_0 = \frac{1}{n}\sum_i g_{0,i}$, where $g_{0,i}$ is the true conditional distribution of $A_i$, given $W_i$. As sub-model for fluctuating $\bar{Q}_n^0$ we use

$$\operatorname{Logit} \bar{Q}_n^0(\varepsilon) = \operatorname{Logit} \bar{Q}_n^0 + \varepsilon H_{\bar{g}_n},$$

where $H_{\bar{g}_n}(A, W) = (2A - 1)/\bar{g}_n(A \mid W)$, and $\bar{g}_n$ is an estimator of $\bar{g}_0$. Let $Q_{W,n}$ be the empirical distribution, and $Q_n^0 = (Q_{W,n}, \bar{Q}_n^0)$. We note that

$$\frac{d}{d\varepsilon}\sum_i L_i\big(\bar{Q}_n^0(\varepsilon)\big)(O_i)\Big|_{\varepsilon = 0} = \sum_i D_Y(Q_n^0, \bar{g}_n)(O_i),$$

so that this loss function and sub-model indeed generate the crucial component of the canonical gradient of the target parameter, a requirement for the construction of an efficient TMLE. The component corresponding with $D_W$ is generated by a sub-model $Q_{W,n}(\varepsilon)$ through $Q_{W,n}$ at ε = 0 with score $D_W$, but since $Q_{W,n}$ is already an NPMLE, the estimated amount of fluctuation according to this sub-model would be zero, so that this sub-model plays no role in the TMLE.

Computing the update of initial estimator

The amount of fluctuation $\varepsilon_n$ is estimated as

$$\varepsilon_n = \arg\min_{\varepsilon} \sum_{i=1}^n L_i\big(\bar{Q}_n^0(\varepsilon)\big)(O_i).$$

This provides the update $\bar{Q}_n^* = \bar{Q}_n^0(\varepsilon_n)$. Let $Q_n^* = (Q_{W,n}, \bar{Q}_n^*)$.

TMLE of target parameter

The TMLE of ψ0 is the corresponding plug-in estimator

$$\Psi(Q_n^*) = \frac{1}{n}\sum_{i=1}^n \big\{\bar{Q}_n^*(1, W_i) - \bar{Q}_n^*(0, W_i)\big\}.$$

TMLE solves efficient influence curve equation

By construction, the TMLE solves

$$0 = D^{n,*}(\bar{Q}_n^*, \bar{g}_n, \psi_n^*) = \frac{1}{n}\sum_{i=1}^n D^*(\bar{Q}_n^*, \bar{g}_n, \psi_n^*)(O_i).$$

In other words, the TMLE solves the efficient influence curve equation for the model $\mathcal{M}$. This equation will form a crucial ingredient for establishing double robustness and asymptotic normality of the TMLE $\Psi(Q_n^*)$.
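The steps above (logistic fluctuation along the clever covariate, fit of ε under the loss (4.1), plug-in) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tmle_ate`, the Newton iteration used to fit ε, the clipping constant, and the toy data are all introduced here for illustration, and the initial predictions and the estimate of $\bar{g}_0$ are taken as given inputs.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_ate(Y, A, Q1_init, Q0_init, g1, max_iter=100, tol=1e-12):
    """Hypothetical helper: TMLE of the average treatment effect.

    Y        outcomes in [0, 1]
    A        binary treatment labels
    Q1_init  initial predictions of the outcome regression at A = 1
    Q0_init  initial predictions of the outcome regression at A = 0
    g1       estimate of the averaged treatment probability given W_i
    """
    Y, A = np.asarray(Y, float), np.asarray(A, float)
    Q1 = np.clip(np.asarray(Q1_init, float), 1e-6, 1 - 1e-6)
    Q0 = np.clip(np.asarray(Q0_init, float), 1e-6, 1 - 1e-6)
    g1 = np.asarray(g1, float) * np.ones_like(Y)
    # Clever covariate H(A, W) = (2A - 1) / g(A | W).
    H1, H0 = 1.0 / g1, -1.0 / (1.0 - g1)
    HA = np.where(A == 1, H1, H0)
    offset = logit(np.where(A == 1, Q1, Q0))
    # Fit epsilon by Newton steps on the log-likelihood loss (assumed solver).
    eps = 0.0
    for _ in range(max_iter):
        p = expit(offset + eps * HA)
        score = np.sum(HA * (Y - p))
        info = np.sum(HA ** 2 * p * (1.0 - p))
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    # Updated regression and plug-in estimator.
    Q1_star = expit(logit(Q1) + eps * H1)
    Q0_star = expit(logit(Q0) + eps * H0)
    QA_star = np.where(A == 1, Q1_star, Q0_star)
    psi = np.mean(Q1_star - Q0_star)
    # Estimated efficient influence curve values D*(O_i).
    ic = HA * (Y - QA_star) + (Q1_star - Q0_star) - psi
    return psi, eps, ic

# Toy data (illustrative only), with a deliberately crude initial fit.
rng = np.random.default_rng(0)
n = 400
W = rng.uniform(-1, 1, n)
A = rng.integers(0, 2, n)
Y = rng.binomial(1, expit(-0.5 + A + W))
psi, eps, ic = tmle_ate(Y, A, np.full(n, Y.mean()), np.full(n, Y.mean()), 0.5)
```

After the update, the empirical mean of the returned influence curve values is zero up to numerical error, which is the efficient influence curve equation stated above.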

4.1 Estimation of $\bar{g}_0$

In general $\bar{g}_0$ is not known and must be estimated. In this section we sometimes suppress the n in $\bar{g}_n$ when referring to an estimator of $\bar{g}_0 = \frac{1}{n}\sum_i g_{0,i}$.

Estimation of $\bar{g}_0$ can be based on the pooled log-likelihood $L(\bar{g})(W^n, A^n) = \sum_i \log \bar{g}(A_i \mid W_i)$, as if we observe a sample of n i.i.d. $(W_i, A_i)$. Let $\bar{g}_n$ be the resulting estimator. The above TMLE is then applied with $\bar{g}_n$ as an estimator of $\bar{g}_0$. Indeed, $L(\bar{g})$ is a valid criterion for $\bar{g}_0$ since

$$E_0 L(\bar{g})(W^n, A^n) = E_0 \sum_{i=1}^n \sum_a \log \bar{g}(a \mid W_i)\, g_{0,i}(a \mid W_i) = n\, E_{Q_{W,0},\, \bar{g}_0} \log \bar{g}(A \mid W),$$

which is maximized at $\bar{g}_0$ (equivalently, the negative log-likelihood loss is minimized at $\bar{g}_0$). Conditional on $W^n$, the groups $(A_i : i \in C_j(W^n))$ of treatment nodes are independent. Thus, $\bar{g}_0$ can be estimated using loss-based learning and cross-validation, but the cross-validation should, in contrast to estimation of $\bar{Q}_0$, treat the groups indexed by j as the independent units.

In a randomized controlled trial, $g_0(A_i \mid W^n)$ is known by design, while $g_{0,i}(A_i \mid W_i)$ and thus $\bar{g}_0(A_i \mid W_i)$ are not and must thus still be estimated. In such cases, knowledge of the true design $g_0^n$ can be used to get a more accurate estimate of $\bar{g}_0$. Specifically, we have

$$g_i(1 \mid W_i) = \sum_{(w_j : j \ne i)} g_i\big(1 \mid (w_j : j \ne i), W_i\big) \prod_{j \ne i} Q_W(w_j).$$

Thus, if $g_0^n$ is known, we can estimate $Q_W$ with the empirical distribution, giving the estimator

$$g_{i,n}(1 \mid W_i) = \sum_{(w_j : j \ne i)} g_{0,i}\big(1 \mid (w_j : j \ne i), W_i\big) \prod_{j \ne i} Q_{W,n}(w_j),$$

and corresponding estimator $\bar{g}_n = \frac{1}{n}\sum_i g_{i,n}$ of $\bar{g}_0$.

4.2 Statistical inference

In our main Theorem 2 below we assume that the design $g^n = P(A^n \mid W^n)$ is known, but, as mentioned above, this still requires estimation of $\bar{g}_0$ through estimation of $Q_{W,0}$. Theorem 2 below proves that, under appropriate conditions, the standardized TMLE $\sqrt{n}(\psi_n^* - \psi_0)$ converges to a normal distribution with variance $\sigma_W^2 + \sigma_Y^2$, where $\sigma_Y^2$ is consistently approximated with

$$\sigma_{Y,n}^2 = \frac{1}{n}\sum_{j=1}^J \big\{f_{j,n}(\bar{Q}_n^*)(O_i : i \in C_j(W^n))\big\}^2,$$

with $f_{j,n}(\bar{Q}) = \sum_{i \in C_j(W^n)} f_{i,n}^1(\bar{Q})(O_i)$, and

$$f_{i,n}^1(\bar{Q}) \equiv \frac{2A_i - 1}{\bar{g}_n(A_i \mid W_i)}\big(Y_i - \bar{Q}(A_i, W_i)\big) - \Big\{\frac{g_i(1 \mid W^n)}{\bar{g}_n(1 \mid W_i)}(\bar{Q}_0 - \bar{Q})(1, W_i) - \frac{g_i(0 \mid W^n)}{\bar{g}_n(0 \mid W_i)}(\bar{Q}_0 - \bar{Q})(0, W_i)\Big\}.$$

Estimation of the asymptotic variance $\sigma_Y^2$ appears to rely on a consistent estimator of $\bar{Q}_0$. In addition, $\sigma_W^2$ is the variance of a function $IC_W$ of W, which can thus be consistently estimated with $\frac{1}{n}\sum_{i=1}^n IC_{W,n}(W_i)^2$, if $IC_{W,n}$ is a consistent estimate of this unknown function $IC_W$. Specifically, $IC_W$ is a sum of three functions, one of them being $D_W(\bar{Q}^*, Q_{W,0}) = \bar{Q}^*(W) - Q_{W,0}\bar{Q}^*$, where $\bar{Q}^*(W) = \bar{Q}^*(1, W) - \bar{Q}^*(0, W)$, and $Q_{W,0}\bar{Q}^* = \int_w \bar{Q}^*(w)\, dQ_{W,0}(w)$. This function $D_W$ is trivially consistently estimated by plugging in $\bar{Q}_n^*$ and the empirical distribution $Q_{W,n}$. If $\bar{Q}_n^*$ converges to $\bar{Q}_0$, then the other two components of $IC_W$ are equal to zero. If the conditional distribution $g_{0,i}^n$ of $A_i$, given $W^n$, is equal to the conditional distribution $g_{0,i}$, given $W_i$, and the latter is constant in i, then these other two components of $IC_W$ are also equal to zero, even if $\bar{Q}_n^*$ is inconsistent.

In general, one of these two components of $IC_W$ is generated by the contribution of $\bar{g}_n$ as an estimator of $\bar{g}_0$, assuming a plug-in estimator is used utilizing that the distribution $g_i^n$ of $A_i$, given $W^n$, is known. The influence curve of this contribution can be straightforwardly determined and is presented in Theorem 2. The other component concerns an average of differences $g_{0,i}^n - \bar{g}_n$, indicating that $g_{0,i}(\cdot \mid W^n)$ has to converge, as n goes to infinity, to a fixed $g_{0,i}(\cdot \mid W_i)$: thus, the dependence on the covariates of the other individuals $l \ne i$ has to be asymptotically negligible. If this convergence occurs fast enough this contribution may be equal to zero, but, in general, we allow for a contribution. This “asymptotic stability of the design” condition (i.e., $g_i^n$ converging to a fixed $g_i$) is analogous to the condition on the adaptive allocation probabilities in adaptive group sequential designs to establish asymptotic normality of the TMLE, as studied in van der Laan (2008); Chambaz and van der Laan (2010). Either way, consistent estimation of $\sigma_W^2$ is possible without relying on a consistent estimator of $\bar{Q}_0$. On the other hand, if $g_{0,i}^n(1 \mid W^n) = g_{0,i}(1 \mid W_i)$, then the influence curve is easily derived and is specified in Theorem 2, and, if $g_{0,i}$ is constant in i, then the contribution equals zero.

If $g_i^n = g_i$, and $\bar{g}_n = \bar{g}_0$, and $C_j(W^n)$ are singletons, then it can be shown that the asymptotic variance is consistently estimated as

$$\sigma_{I,n}^2 = \frac{1}{n}\sum_{i=1}^n \big\{D^*(\bar{Q}_n^*, \bar{g}_n, \psi_n^*)(O_i)\big\}^2, \qquad (4.2)$$

which thus no longer relies on a consistent estimator of $\bar{Q}_0$. Note that this latter variance estimator is the estimator one would have used if one treats the sample as n independent observations, and one ignores the adaptivity of the design.
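The difference between the group-based estimator of this section and the estimator (4.2) comes down to whether contributions are summed within groups before squaring. The sketch below illustrates this; the function names `grouped_variance` and `iid_variance` are hypothetical, and the per-unit contributions are assumed to have been evaluated already (in general they involve the unknown outcome regression, so this is a sketch of the arithmetic only).

```python
import numpy as np

def grouped_variance(f, groups):
    """(1/n) * sum over groups j of (sum of f_i over units i in group j)^2.

    f       per-unit contributions, assumed already evaluated
    groups  non-negative integer group label for each unit
    """
    f = np.asarray(f, float)
    groups = np.asarray(groups)
    group_sums = np.bincount(groups, weights=f)
    return np.sum(group_sums ** 2) / f.size

def iid_variance(ic):
    """(1/n) * sum of squared influence curve values, as in (4.2)."""
    ic = np.asarray(ic, float)
    return np.mean(ic ** 2)

f = np.array([1.0, -1.0, 2.0, -2.0])
v_singletons = grouped_variance(f, np.arange(4))        # every unit its own group
v_pairs = grouped_variance(f, np.array([0, 0, 1, 1]))   # units grouped in pairs
```

With singleton groups the two functions coincide; with pairs, contributions within a pair can cancel before squaring, which is how dependence within groups enters the variance.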

In the special case of adaptive pair matching designs (and thus $\bar{g}_n = \bar{g}_0$ and $IC_W = D_W$), we prove below that, under a mild condition, this same estimator (4.2) of the asymptotic variance remains conservative if $\bar{Q}_n^*$ is inconsistent for the true $\bar{Q}_0$. It remains to be determined if this result also applies to other group sizes.

Finally, if the design $g_0^n$ is actually unknown and thus also needs to be estimated, and if we assume that this design is consistently estimated, then we conjecture that the asymptotic limit variance described above will be conservative, due to the general result that estimation of an orthogonal factor in the likelihood (i.e., the tangent space of the treatment mechanism is orthogonal to the tangent space of the relevant Q-factors) generally improves the asymptotic variance (Theorem 2.3 van der Laan and Robins (2003)).

5 Theorem establishing asymptotic normality

We have the following theorem establishing the asymptotic normality of the TMLE presented in Section 4 and thereby in particular the basis for the variance estimator presented in Section 4.2.

Theorem 2

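Let $P_{Q_0, g_0^n} f_i$ represent a conditional expectation of a function, given $W^n$, which is thus still random through $W^n$. In this theorem $g_0^n = P_0(A^n \mid W^n)$ is considered known. Let $\mathcal{F}$ be a set of multivariate real valued functions so that $\bar{Q}_n^*$ is an element of $\mathcal{F}$ with probability 1. Define

$$f_{i,n}^1(\bar{Q}) \equiv \frac{2A_i - 1}{\bar{g}_n(A_i \mid W_i)}\big(Y_i - \bar{Q}(A_i, W_i)\big) - \Big\{\frac{g_{0,i}(1 \mid W^n)}{\bar{g}_n(1 \mid W_i)}(\bar{Q}_0 - \bar{Q})(1, W_i) - \frac{g_{0,i}(0 \mid W^n)}{\bar{g}_n(0 \mid W_i)}(\bar{Q}_0 - \bar{Q})(0, W_i)\Big\}.$$

Define $X_n(\bar{Q}) = \frac{1}{\sqrt{n}}\sum_{j=1}^J \big\{\sum_{i \in C_j(W^n)} f_{i,n}^1(\bar{Q})(O_i)\big\}$, and note that, conditional on $W^n$, this is a sum of J = J(n) independent mean zero random variables. Let

$$f_{j,n} \equiv \sum_{i \in C_j(W^n)} f_{i,n}^1(O_i).$$

We can represent $X_n(\bar{Q})$ as $X_n(\bar{Q}) \equiv \frac{1}{\sqrt{n}}\sum_{j=1}^J f_{j,n}(O_i : i \in C_j(W^n))$. Let $D_W(Q_0) = \bar{Q}_0(W) - \psi_0$.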

Uniform bound

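Assume $\max_{i \in \{1, \ldots, n\}} \sup_{\bar{Q} \in \mathcal{F}} \sup_{W^n, O} |f_{i,n}(\bar{Q})(W^n, O)| < M < \infty$, where the second supremum is over a support of $(W^n, A_i, Y_i)$.

Asymptotic linearity of function of $\bar{g}_n$

Assume that for a function $IC_{W,i,\bar{g}}$ of W with mean zero and finite variance (uniformly in i)

$$Z_{W,n,\bar{g}_n} \equiv \frac{1}{\sqrt{n}}\sum_i \big\{P_{Q_0, g_{0,i}} D^*(\bar{Q}_n^*, \bar{g}_n, \psi_n^*) - P_{Q_0, g_{0,i}} D^*(\bar{Q}_n^*, \bar{g}_0, \psi_n^*)\big\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n IC_{W,i,\bar{g}}(W_i) + o_P(1).$$

Note that if $g_{0,i}^n = P_0^n(A_i \mid W^n)$ equals $g_{0,i} = P_0(A_i \mid W_i)$ and is thus known, then $\bar{g}_n = \bar{g}_0$ so that $Z_{W,n,\bar{g}_n} = 0$. In general, under the required regularity conditions, we have

$$Z_{W,n,\bar{g}_n} \approx \frac{1}{\sqrt{n}}\sum_{k=1}^n \big\{IC(W_k) - E_0 IC(W_k)\big\},$$

where

$$IC(W_k) = \int_w \Big(\frac{1}{n}\sum_{i=1}^n \sum_{l \ne i}^n g_i(1 \mid W_i = w, W_l = W_k)\Big)\frac{\bar{Q}_0 - \bar{Q}^*}{\bar{g}_0}(1, w)\, dQ_0(w) - \int_w \Big(\frac{1}{n}\sum_{i=1}^n \sum_{l \ne i}^n g_i(0 \mid W_i = w, W_l = W_k)\Big)\frac{\bar{Q}_0 - \bar{Q}^*}{\bar{g}_0}(0, w)\, dQ_0(w),$$

and $g_{i,0}(1 \mid W_i, W_l)$ is the conditional probability of $A_i = 1$, given $W_i$, $W_l$.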

Asymptotic stability of treatment mechanism as function of covariates

Let

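$$Z_{W,n,g^n} = Z_{1,g^n} + Z_{W,n},$$

where

$$Z_{1,g^n} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{g_i(1 \mid W^n) - \bar{g}_n(1 \mid W_i)}{\bar{g}_n(1 \mid W_i)}(\bar{Q}_0 - \bar{Q}_n^*)(1, W_i) - \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{g_i(0 \mid W^n) - \bar{g}_n(0 \mid W_i)}{\bar{g}_n(0 \mid W_i)}(\bar{Q}_0 - \bar{Q}_n^*)(0, W_i),$$

and

$$Z_{W,n} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big\{\bar{Q}_0(W_i) - \psi_0\big\}.$$

Assume, for a function $IC_{W,i,g^n}$ of W with mean zero and finite variance, $Z_{1,g^n} = \frac{1}{\sqrt{n}}\sum_{i=1}^n IC_{W,i,g^n}(W_i) + o_P(1)$. Note that, if $g_{0,i}^n = g_{0,i}$ and constant in i, then $Z_{1,g^n} = 0$. If $g_{0,i}^n = g_{0,i}$, so that $\bar{g}_n = \bar{g}_0$, then

$$IC_{W,i,g^n}(W_i) = \frac{g_{0,i}(1 \mid W_i) - \bar{g}_0(1 \mid W_i)}{\bar{g}_0(1 \mid W_i)}(\bar{Q}_0 - \bar{Q}^*)(1, W_i) - \frac{g_{0,i}(0 \mid W_i) - \bar{g}_0(0 \mid W_i)}{\bar{g}_0(0 \mid W_i)}(\bar{Q}_0 - \bar{Q}^*)(0, W_i).$$

Convergence of variances

Assume that for a specified $\{\Sigma_0(\bar{Q}_1, \bar{Q}_2) : \bar{Q}_1, \bar{Q}_2 \in \mathcal{F}\}$, for any $\bar{Q}_1, \bar{Q}_2 \in \mathcal{F}$, $\frac{1}{n}\sum_{j=1}^J P_{Q_0, g_0^n} f_{j,n}(\bar{Q}_1)^2 \to \Sigma_0(\bar{Q}_1, \bar{Q}_1)$ a.s. (i.e., for almost every $(W^n, n \ge 1)$), and

$$\frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n} f_{j,n}(\bar{Q}_1) f_{j,n}(\bar{Q}_2) \to \Sigma_0(\bar{Q}_1, \bar{Q}_2) \quad \text{a.s.} \qquad (5.1)$$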

For example, if Cj(Wn) are singletons, the first condition holds if

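$$\Big[\frac{1}{n}\sum_{i=1}^n \frac{g_{0,i}(0 \mid W^n)}{\bar{g}_n^2(0 \mid W_i)} E_0\big((Y - \bar{Q}(0, W_i))^2 \mid A_i = 0, W_i\big) + \frac{1}{n}\sum_{i=1}^n \frac{g_{0,i}(1 \mid W^n)}{\bar{g}_n^2(1 \mid W_i)} E_0\big((Y - \bar{Q}(1, W_i))^2 \mid A_i = 1, W_i\big)\Big] \to \Sigma_0(\bar{Q}, \bar{Q}) \quad \text{a.s.}$$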

Similarly, for the convergence of covariance. Note that this holds trivially if g0,i(1 | Wn) = g0,i(1 | Wi).

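Convergence of $\bar{Q}_n^*$ to some limit

For any $\bar{Q}_1, \bar{Q}_2 \in \mathcal{F}$, we define

$$\sigma_n^2(\bar{Q}_1 - \bar{Q}_2) = \frac{1}{n}\sum_{j=1}^J P_{Q_0, g_0^n}\big\{f_{j,n}(\bar{Q}_1) - f_{j,n}(\bar{Q}_2)\big\}^2,$$

where we note that the right-hand side indeed only depends on $\bar{Q}_1, \bar{Q}_2$ through their difference $\bar{Q}_1 - \bar{Q}_2$.

Assume that for a particular $\bar{Q}^* \in \mathcal{F}$, $\sigma_n^2(\bar{Q}_n^* - \bar{Q}^*) \to 0$ in probability as n → ∞.

Entropy condition

Let $\mathcal{F}_d = \{f_1 - f_2 : f_1, f_2 \in \mathcal{F}\}$. Let $N(\varepsilon, \sigma_n, \mathcal{F}_d)$ be the covering number of the class $\mathcal{F}_d$ w.r.t. the norm/dissimilarity $\|f\| = \sigma_n(f)$. Assume that the class $\mathcal{F}$ satisfies

$$\lim_{\delta_n \to 0} \int_0^{\delta_n} \sqrt{\log N(\varepsilon, \sigma_n, \mathcal{F}_d)}\, d\varepsilon = 0.$$

Asymptotic equicontinuity of process

Then,

$$X_n(\bar{Q}_n^*) - X_n(\bar{Q}^*) \ \text{converges to zero in probability, as } n \to \infty.$$

First order linear approximation

As a consequence,

$$\sqrt{n}(\psi_n^* - \psi_0) = X_{W,n} + X_n(\bar{Q}^*) + o_P(1),$$

where $X_{W,n} = \frac{1}{\sqrt{n}}\sum_{i=1}^n IC_{W,i}(W_i)$, and $IC_{W,i} = IC_{W,i,\bar{g}} + IC_{W,i,g^n} + D_W(Q_0)$.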

Asymptotic normality

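In addition, $X_{W,n}$ converges in distribution to $N(0, \sigma_W^2)$, where

$$\sigma_W^2 = \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n P_0\, IC_{W,i}^2,$$

and $X_n(\bar{Q}^*)$ converges to an independent $N(0, \Sigma_0(\bar{Q}^*, \bar{Q}^*))$, so that

$$\sqrt{n}(\psi_n^* - \psi_0) \ \text{converges in distribution to } N\big(0, \sigma_0^2 \equiv \sigma_W^2 + \Sigma_0(\bar{Q}^*, \bar{Q}^*)\big).$$

The asymptotic variance $\Sigma_0(\bar{Q}^*, \bar{Q}^*)$ equals the limit of

$$\Sigma_n = \frac{1}{n}\sum_{j=1}^J \big\{f_{j,n}(\bar{Q}_n^*)(O_i : i \in C_j(W^n))\big\}^2.$$

Under certain treatment allocation mechanisms $g_{0,i}^n$, one might have that the contributions captured by $X_{W,n}$ in this theorem require a more general representation $X_{W,n} = \frac{1}{\sqrt{n}}\sum_{i=1}^n f_i(W)$, where $f_i(W)$ has weak enough dependence on $W_j$ with $j \ne i$, so that such a process still converges weakly to a normal distribution. Depending on applications of interest, we can pursue such a more general representation of this theorem with little extra work.

Note that $f_{j,n}$ in $\Sigma_n$ still depends on $\bar{Q}_0$, while $\bar{Q}_n^*$ estimates the possibly misspecified limit $\bar{Q}^*$. Thus $\Sigma_n$ is not an estimator. In the special case that $C_j(W^n)$ are singletons, it follows that $\sigma_0^2$ can be consistently approximated with $\frac{1}{n}\sum_i \{IC_W(Q_n^*)(W_i) + f_{i,n}(\bar{Q}_n^*)(O_i)\}^2$, which equals

$$\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n \big\{D^*(\bar{Q}_n^*, \bar{g}_n, \psi_n^*)(O_i)\big\}^2,$$

if $g_{0,i}^n = g_{0,i}$, and $\bar{g}_n = \bar{g}_0$ so that $IC_W = D_W(\bar{Q}^*, Q_{W,0})$.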

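Some of the conditions were discussed in Section 4.2 on statistical inference. The entropy condition corresponds with assuming that $\mathcal{F}$ is a Donsker class and is thus a natural condition that puts (minimal) restrictions on the size of the class $\mathcal{F}$. For example, one can define $\mathcal{F}$ as the class of multivariate real valued functions that have a uniform sectional variation norm bounded by a universal M < ∞ (van der Laan (1996); Gill et al. (1995)).

In order to demonstrate the condition (5.1), we consider the special case that $C_j(W^n)$ are singletons, j = 1, …, n, and that $g_i(A_i \mid W^n) = g_i(A_i \mid W_i)$. We wish to show that $\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2 \to \sigma^2$, where

$$\begin{aligned}
\sigma_i^2 &= P_{Q_0, g_i^n}\Big\{\frac{2A_i - 1}{g_i(A_i \mid W_i)}\big(Y_i - \bar{Q}(A_i, W_i)\big)\Big\}^2 - \big(\bar{Q}(W_i) - \bar{Q}_0(W_i)\big)^2 \\
&= P_{Q_0, g_i^n}\frac{1}{g_i^2(A_i \mid W_i)}\big(Y_i - \bar{Q}(A_i, W_i)\big)^2 - \big(\bar{Q}(W_i) - \bar{Q}_0(W_i)\big)^2 \\
&= \sum_{a, y}\frac{1}{g_i^2(a \mid W_i)}\big(y - \bar{Q}(a, W_i)\big)^2 g_i(a \mid W^n)\, Q_{Y,0}(y \mid a, W_i) - \big(\bar{Q}(W_i) - \bar{Q}_0(W_i)\big)^2 \\
&= \sum_{a, y}\frac{1}{g_i(a \mid W_i)}\big(y - \bar{Q}(a, W_i)\big)^2 Q_{Y,0}(y \mid a, W_i) - \big(\bar{Q}(W_i) - \bar{Q}_0(W_i)\big)^2 \\
&= \frac{1}{g_i(0 \mid W_i)} E_0\big((Y - \bar{Q}(0, W_i))^2 \mid A_i = 0, W_i\big) + \frac{1}{g_i(1 \mid W_i)} E_0\big((Y - \bar{Q}(1, W_i))^2 \mid A_i = 1, W_i\big) - \big(\bar{Q}(W_i) - \bar{Q}_0(W_i)\big)^2.
\end{aligned}$$

So we need that

$$\frac{1}{n}\sum_{i=1}^n \frac{1}{g_i(0 \mid W_i)} E_0\big((Y - \bar{Q}(0, W_i))^2 \mid A_i = 0, W_i\big) + \frac{1}{n}\sum_{i=1}^n \frac{1}{g_i(1 \mid W_i)} E_0\big((Y - \bar{Q}(1, W_i))^2 \mid A_i = 1, W_i\big) - \frac{1}{n}\sum_{i=1}^n \big(\bar{Q}(W_i) - \bar{Q}_0(W_i)\big)^2 \to \sigma^2 \quad \text{a.s.}$$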

By the law of large numbers, both empirical means converge.

6 Randomized trials with adaptive pair matching

Theorem 2 and the corresponding TMLE of the average treatment effect and variance estimator have important implications for the design and analysis of individual and cluster randomized trials with adaptive pair matching. In particular, previous literature on pair matched trials considered the pair as the unit of independence (Freedman et al. (1990); Campbell et al. (2007); Hayes and Moulton (2009); Imai et al. (2009); Imai (2008); Donner and Klar (2000); Murray (1998); Raudenbush et al. (2007); Klar and Donner (1997); Balzer et al. (2012)). This leaves open a number of key questions regarding the design and analysis of trials in which matched pairs are constructed by applying some algorithm to the baseline characteristics of the entire sample (adaptive pair matching). What is the most efficient estimator of the intervention’s effect in such studies? How should the variance of this estimator be estimated, given the dependence induced between units? And finally, under what conditions will adaptive pair matching provide a more efficient estimator than that provided by a non-matched design? In this section we consider the implications of Theorem 2 for each of these questions in turn.

6.1 Estimation of the average treatment effect

In a randomized trial with adaptive pair matching, an unadjusted difference of the mean outcome in the treated and untreated units will provide an unbiased estimate of the average treatment effect. However, adjustment for baseline covariates W that predict the outcome Y will result in efficiency gains. This raises the issue of how best to accomplish such adjustment in adaptively pair matched designs, with the dual goals of minimizing the variance of the resulting estimator and ensuring that it remains unbiased. Our Theorem 2 and the corresponding TMLE presented in Section 4 establish such an efficient and unbiased estimator of the average treatment effect.

Specifically, in such trials, $g_{0,i}(A_i \mid W^n) = g_{0,i}(A_i \mid W_i) = \bar{g}_0(A_i \mid W_i)$ is known to be equal to a constant (typically 0.5 for a trial with two arms) and need not be estimated (although estimation of $\bar{g}_0$ may improve efficiency). The clever covariate in the TMLE thus reduces to a simple $(2A_i - 1)/0.5$. Thus, a TMLE for the average treatment effect in such a trial can be implemented by treating the data as if they were a sample of n i.i.d. units. Implementation of this estimator is described in detail in van der Laan and Rose (2012). Further, because A is randomized, as long as the initial estimator of $\bar{Q}_0$ (the conditional mean of the unit-specific outcome given unit-specific covariates and treatment) is a least squares regression or logistic maximum likelihood regression that includes an intercept and the treatment A as a main term (still allowing for additional interaction terms between A and covariates W), no further update step is needed (the initial estimator is already a TMLE) (Rosenblum and van der Laan (2010); van der Laan and Rose (2012)).
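The last claim can be checked numerically for the least squares case. The following sketch uses toy data with simple (not pair matched) randomization, which is an assumption made only to keep the example short; the variable names and the data generating process are illustrative. Because the clever covariate $(2A - 1)/0.5$ is a linear combination of the intercept and A, the least squares residuals are orthogonal to it, so the corresponding score equation is already solved by the initial fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
W = rng.normal(size=n)
A = rng.integers(0, 2, n).astype(float)
Y = 1.0 + 0.5 * A + W + rng.normal(size=n)   # toy outcome, illustrative only

# Least squares fit with an intercept, A as a main term, W, and an A*W interaction.
X = np.column_stack([np.ones(n), A, W, A * W])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
residual = Y - X @ beta

# Clever covariate for a known treatment probability of 0.5.
H = (2 * A - 1) / 0.5
score = np.mean(H * residual)   # zero up to numerical error

# Plug-in estimate of the average treatment effect from the same fit.
X1 = np.column_stack([np.ones(n), np.ones(n), W, W])
X0 = np.column_stack([np.ones(n), np.zeros(n), W, np.zeros(n)])
psi = np.mean(X1 @ beta - X0 @ beta)
```

Dropping the intercept or the main term for A from the design matrix would in general leave a nonzero score, in which case the update step would be needed.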

6.2 Statistical inference

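Again, we note that in a randomized trial with adaptive pair matching, we have $\bar{g}_n = \bar{g}_0 = g_0$. Our Theorem 2 shows that, under regularity conditions, the standardized TMLE with $\bar{Q}_n^*$ converging to $\bar{Q}^*$ is consistent and asymptotically normally distributed, $\sqrt{n}(\psi_n^* - \psi_0) \to_d N(0, \sigma^2)$, where $\sigma^2 = \sigma_W^2 + \sigma_Y^2$, $\sigma_W^2 = E(\bar{Q}_0(W) - \psi_0)^2$ and $\sigma_Y^2$ is the limit of

$$\frac{1}{n}\sum_j P_{Q_0, g_0^n}\big\{f_{j,n}(\bar{Q}^*, \bar{Q}_0)\big\}^2 = \frac{1}{n}\sum_j P_{Q_0, g_0^n}\Big(\sum_{i \in C_j(W^n)} H_{\bar{g}_0}(A_i, W_i)\big(Y_i - \bar{Q}^*(A_i, W_i)\big)\Big)^2 - \frac{1}{n}\sum_j \Big(\sum_{i \in C_j(W^n)} (\bar{Q}_0 - \bar{Q}^*)(W_i)\Big)^2.$$

Given a consistent estimator $\sigma_n^2$ of $\sigma^2$, $\psi_n^* \pm 1.96\, \sigma_n/\sqrt{n}$ is an asymptotic 0.95-confidence interval. A test of a null hypothesis $H_0$ that fixes the value of $\psi_0$ can be based on the test statistic $T_n = \sqrt{n}(\psi_n^* - \psi_0)/\sigma_n$, evaluated at the hypothesized value of $\psi_0$, which is approximately standard normal under $H_0$.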
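The interval and test statistic amount to the following arithmetic. In this sketch the function name `wald_inference`, the default null value of zero, and the example inputs are assumptions made for illustration.

```python
import math

def wald_inference(psi_n, sigma2_n, n, psi_null=0.0):
    """Hypothetical helper: 0.95 confidence interval and test statistic.

    psi_n     point estimate of the average treatment effect
    sigma2_n  estimate of the asymptotic variance of sqrt(n) * (psi_n - psi_0)
    n         number of units
    psi_null  hypothesized value of psi_0 (assumed 0 by default)
    """
    se = math.sqrt(sigma2_n / n)
    ci = (psi_n - 1.96 * se, psi_n + 1.96 * se)
    t_stat = (psi_n - psi_null) / se
    return ci, t_stat

ci, t_stat = wald_inference(psi_n=0.1, sigma2_n=0.81, n=100)
```

Here the standard error is 0.09, so the interval has half-width 1.96 × 0.09.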

In order to implement an estimator of $\sigma^2$, $\sigma_W^2$ can be estimated as $\frac{1}{n}\sum_{i=1}^n \{\bar{Q}_n^*(W_i) - \psi_n^*\}^2$. Estimation of $\sigma_Y^2$ can be based on Theorem 2, which shows that $\sigma_Y^2$ is consistently approximated by $\frac{1}{n}\sum_j \{f_{j,n}(\bar{Q}_n^*, \bar{Q}_0)\}^2$. In implementing a substitution estimator of $\sigma_Y^2$, we naturally replace $\bar{Q}^*$ by $\bar{Q}_n^*$ (the updated fit of $\bar{Q}_0$ on which the TMLE substitution estimator of ψ0 is based). However, $f_{j,n}(\bar{Q}_n^*, \bar{Q}_0)$ still depends on $\bar{Q}_0$. When implementing a consistent estimator of the asymptotic variance $\sigma^2$, one may need to estimate $\bar{Q}_0$ with a super learner in order to make it maximally unbiased (van der Laan et al. (2007)). In particular, even if a simple parametric regression based estimator was used as initial estimator of $\bar{Q}_0$ when implementing the TMLE for ψ0, a more flexible approach to estimating $\bar{Q}_0$ is warranted when estimating $\sigma^2$ in order to minimize bias in the resulting variance estimator.

A substitution estimator of this asymptotic variance $\sigma^2$ is now given by

$$\begin{aligned}
\sigma_n^2 &= \sigma_{W,n}^2 + \sigma_{Y,n}^2, \\
\sigma_{W,n}^2 &= \frac{1}{n}\sum_{i=1}^n \big\{\bar{Q}_n^*(W_i) - \psi_n^*\big\}^2, \\
\sigma_{Y,n}^2 &= \frac{1}{n}\sum_j \Big(\sum_{i \in C_j(W^n)} H_{\bar{g}_0}(A_i, W_i)\big(Y_i - \bar{Q}_n^*(A_i, W_i)\big) - \sum_{i \in C_j(W^n)} (\bar{Q}_{0,n} - \bar{Q}_n^*)(W_i)\Big)^2,
\end{aligned}$$

where $\bar{Q}_{0,n}$ denotes the separate, flexible estimator of $\bar{Q}_0$ described above.

Interestingly, even though we constructed an estimator of ψ0 that is guaranteed consistent and asymptotically normally distributed at misspecified $\bar{Q}_n^*$, as long as $g_{0,i}^n$ is known or well estimated, it seems that in general no such robust estimator of the asymptotic variance $\sigma^2$ is available. Fortunately, below we will construct conservative variance estimators that do not rely on a consistent estimator of $\bar{Q}_0$.

6.3 Robust conservative estimation of the variance

In the special case of an adaptively pair matched design, we have the following theorem, which establishes a simpler but conservative estimate of the asymptotic variance $\sigma^2$. The proof is analogous to that of Theorem 4 below.

Theorem 3

Suppose $g_0^n \in \mathcal{G}_2^n$, in which case $C_j(W^n)$ are pairs. As above, assume that $g_{0,i}(A_i \mid W^n) = g_0(A_i \mid W_i)$ is known, so that $\bar{g}_0 = g_0$ is known as well. The asymptotic variance $\sigma^2$ of the TMLE for $\bar{Q}_n^*$ converging to $\bar{Q}^*$, i.e., $\sigma^2 = \sigma_W^2 + \sigma_Y^2$ in Theorem 2, can be represented as follows:

$$\sigma^2 = P_{Q_0, g_0}\big\{D^*(\bar{Q}^*, g_0, \psi_0)\big\}^2 - C,$$

where

$$\begin{aligned}
C \equiv{} & E_0 \frac{1}{J}\sum_{j=1}^{n/2} (\bar{Q}_0 - \bar{Q}^*)(1, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(0, W_{2j}) + E_0 \frac{1}{J}\sum_{j=1}^{n/2} (\bar{Q}_0 - \bar{Q}^*)(0, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(1, W_{2j}) \\
& + E_0 \frac{1}{J}\sum_{j=1}^{n/2} (\bar{Q}_0 - \bar{Q}^*)(1, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(1, W_{2j}) + E_0 \frac{1}{J}\sum_{j=1}^{n/2} (\bar{Q}_0 - \bar{Q}^*)(0, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(0, W_{2j}). \qquad (6.1)
\end{aligned}$$

Under the assumption that the covariance-term C is positive, a conservative estimate of $\sigma^2$ is thus given by:

$$\sigma_{I,n}^2 = \frac{1}{n}\sum_{i=1}^n \big\{D^*(\bar{Q}_n^*, g_0, \psi_n^*)(O_i)\big\}^2.$$

This theorem, together with Theorem 2, implies that randomized trials with adaptive pair matching can be analyzed ignoring the matching process, both in order to generate an efficient and unbiased point estimator of the treatment effect and for inference on this estimator. Under the assumption that the covariance-term C is positive, as would be expected if units were effectively matched on predictors of the outcome, a variance estimator that treats the data as if they were i.i.d. will be conservative if $\bar{Q}_n^*$ is inconsistent for $\bar{Q}_0$, while it remains asymptotically consistent if $\bar{Q}_n^*$ is consistent.

In general, one can aim to construct a target $C_l$ for which it is known that $C_l \le C$, and estimate the variance with $\sigma_{I,n}^2 - C_{l,n}$, where $C_{l,n}$ is a consistent estimator of $C_l$. In this case, one aims to find such a $C_l$ that can be consistently estimated without relying on a consistent estimator of $\bar{Q}_0$. This will be carried out in the next two subsections, resulting in a possibly much less conservative variance estimator.

6.4 Comparison of the “naive” variance estimator with the true variance

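Let us consider the case that we use the unadjusted regression of Y on A as initial estimator so that $\bar{Q}_n^*(a, W) = \bar{Q}_n^*(a)$, a ∈ {0, 1}, does not depend on W, and the TMLE $\psi_n^* = \bar{Q}_n^*(1) - \bar{Q}_n^*(0)$ equals the mean over the n/2 pairs $C_j(W^n)$ of $\sum_{i \in C_j(W^n)} \{A_i Y_i - (1 - A_i) Y_i\}$, j = 1, …, n/2:

$$\psi_n^* = \frac{1}{n/2}\sum_{j=1}^{n/2} \sum_{i \in C_j(W^n)} \big(A_i Y_i - (1 - A_i) Y_i\big).$$

In Theorem 3 we presented an expression for the true asymptotic variance of the standardized TMLE, $\sqrt{n}(\psi_n^* - \psi_0)$, which can be applied to this simple sample average $\psi_n^*$. This expression $\sigma^2$ was represented as the asymptotic variance $\sigma_I^2$ of this TMLE under i.i.d. sampling from $P_{Q_0, g_0}$, i.e. $\sigma_I^2 = P_{Q_0, g_0} D^*(\bar{Q}^*, g_0, \psi_0)^2$, minus the sum (6.1) of four terms defined as

$$\begin{aligned}
C \equiv{} & E_0 \frac{1}{J}\sum_{j=1}^{n/2} (\bar{Q}_0 - \bar{Q}^*)(1, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(0, W_{2j}) + E_0 \frac{1}{J}\sum_{j=1}^{n/2} (\bar{Q}_0 - \bar{Q}^*)(0, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(1, W_{2j}) \\
& + E_0 \frac{1}{J}\sum_{j=1}^{n/2} (\bar{Q}_0 - \bar{Q}^*)(1, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(1, W_{2j}) + E_0 \frac{1}{J}\sum_{j=1}^{n/2} (\bar{Q}_0 - \bar{Q}^*)(0, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(0, W_{2j}).
\end{aligned}$$

Let $(W_1, A_1, Y_1)$, $(W_2, A_2, Y_2)$ denote the observations on the two units in a pair, where either $(A_1, A_2) = (1, 0)$ or $(A_1, A_2) = (0, 1)$. For notational convenience, we will denote this term C as

$$\begin{aligned}
C \equiv{} & E_0 (\bar{Q}_0 - \bar{Q}^*)(1, W_1)(\bar{Q}_0 - \bar{Q}^*)(0, W_2) + E_0 (\bar{Q}_0 - \bar{Q}^*)(0, W_1)(\bar{Q}_0 - \bar{Q}^*)(1, W_2) \\
& + E_0 (\bar{Q}_0 - \bar{Q}^*)(1, W_1)(\bar{Q}_0 - \bar{Q}^*)(1, W_2) + E_0 (\bar{Q}_0 - \bar{Q}^*)(0, W_1)(\bar{Q}_0 - \bar{Q}^*)(0, W_2).
\end{aligned}$$

Since we use the unadjusted estimator so that $D^*(\bar{Q}^*, g_0, \psi_0)(W, A, Y) = (2A - 1)/g_0(A)\,(Y - \bar{Q}^*(A))$, and $g_0(A) = 0.5$, we have

$$P_{Q_0, g_0} D^*(\bar{Q}^*, g_0, \psi_0)^2 = 2\{\sigma_1^2 + \sigma_0^2\},$$

where $\sigma_1^2 = E_0(Y(1) - \psi_0(1))^2$ and $\sigma_0^2 = E_0(Y(0) - \psi_0(0))^2$. We conclude that the true asymptotic variance of $\sqrt{n}(\psi_n^* - \psi_0)$ is given by

$$\sigma^2 = 2\{\sigma_1^2 + \sigma_0^2\} - C.$$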
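The factor 2 in the i.i.d. term can be traced in one short calculation. The sketch below assumes, as in this subsection, that $g_0(a) = 0.5$, that $\bar{Q}^*(a) = \psi_0(a)$ is the mean of the counterfactual outcome $Y(a)$ (so that the component of the influence curve that depends only on W vanishes), and that randomization gives $E_0((Y - \psi_0(a))^2 \mid A = a) = \sigma_a^2$:

```latex
\begin{aligned}
P_{Q_0,g_0} D^*(\bar{Q}^*, g_0, \psi_0)^2
  &= \sum_{a \in \{0,1\}} g_0(a)\,\frac{1}{g_0(a)^2}\,
     E_0\big((Y - \psi_0(a))^2 \mid A = a\big) \\
  &= \sum_{a \in \{0,1\}} \frac{1}{0.5}\,\sigma_a^2
   \;=\; 2\{\sigma_1^2 + \sigma_0^2\}.
\end{aligned}
```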

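Let us compare this true asymptotic variance $\sigma^2/n$ of the unadjusted estimator with the variance estimate used in current practice, which we will refer to as the “naive” variance estimator. Current practice assumes that the n/2 pairs are i.i.d. and estimates the asymptotic variance of $\sqrt{n/2}(\psi_n^* - \psi_0)$ with the sample variance of the differences across the pairs:

$$0.5\, \sigma_{n,naive}^2 = \frac{1}{n/2}\sum_{j=1}^{n/2} \big(Y_{1j} A_{1j} + Y_{2j} A_{2j} - Y_{1j}(1 - A_{1j}) - Y_{2j}(1 - A_{2j}) - \psi_n^*\big)^2.$$

This converges for n → ∞ to

$$0.5\, \sigma_{naive}^2 = \sigma_0^2 + \sigma_1^2 - (\rho_1 + \rho_2),$$

where

$$\begin{aligned}
\rho_1 &= E_0 \big(\bar{Q}_0(1, W_1) - \psi_0(1)\big)\big(\bar{Q}_0(0, W_2) - \psi_0(0)\big), \\
\rho_2 &= E_0 \big(\bar{Q}_0(0, W_1) - \psi_0(0)\big)\big(\bar{Q}_0(1, W_2) - \psi_0(1)\big).
\end{aligned}$$

The true asymptotic variance and the naive asymptotic variance are given by $\sigma^2/n$ and $(0.5\, \sigma_{naive}^2)/(n/2) = \sigma_{naive}^2/n$, respectively. As a consequence, the relevant comparison is the comparison of $\sigma^2$ with $\sigma_{naive}^2$, where

$$\begin{aligned}
\sigma^2 &= 2\{\sigma_1^2 + \sigma_0^2\} - C, \\
\sigma_{naive}^2 &= 2\{\sigma_1^2 + \sigma_0^2\} - 2(\rho_1 + \rho_2).
\end{aligned}$$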
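The unadjusted estimator and the naive pair-based variance estimator defined earlier in this subsection can be sketched as follows. The function name `unadjusted_pair_analysis` and the four-pair toy data set are assumptions made for illustration; the inputs are the outcomes and treatment labels of the first and second unit of each pair.

```python
import numpy as np

def unadjusted_pair_analysis(Y1, A1, Y2, A2):
    """Hypothetical helper for the unadjusted pair-matched analysis.

    Returns the estimate psi_n, the naive estimate of the asymptotic
    variance of sqrt(n) * (psi_n - psi_0), and the implied naive
    variance of psi_n itself, where n = 2 * (number of pairs).
    """
    Y1, A1, Y2, A2 = (np.asarray(x, float) for x in (Y1, A1, Y2, A2))
    # Within-pair difference: treated outcome minus control outcome.
    d = Y1 * A1 + Y2 * A2 - Y1 * (1 - A1) - Y2 * (1 - A2)
    psi_n = d.mean()
    # 0.5 * sigma^2_naive is the sample variance of the pair differences.
    sigma2_naive = 2.0 * np.mean((d - psi_n) ** 2)
    n = 2 * d.size
    return psi_n, sigma2_naive, sigma2_naive / n

psi_n, sigma2_naive, var_psi = unadjusted_pair_analysis(
    Y1=[1, 1, 0, 0], A1=[1, 0, 1, 0],
    Y2=[0, 1, 1, 0], A2=[0, 1, 0, 1])
```

In this toy data set the pair differences are 1, 0, −1 and 0, so the estimate is 0 and the naive estimate of the asymptotic variance is 1.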

To show that the naive variance estimator represents a conservative variance estimator we would need to show that

$$2(\rho_1 + \rho_2) \le C.$$

Notice that $C = \rho_1 + \rho_2 + C_1$, where

$$C_1 = E_0 (\bar{Q}_0 - \bar{Q}^*)(1, W_1)(\bar{Q}_0 - \bar{Q}^*)(1, W_2) + E_0 (\bar{Q}_0 - \bar{Q}^*)(0, W_1)(\bar{Q}_0 - \bar{Q}^*)(0, W_2).$$

Thus, the naive variance estimator would be conservative if $\rho_1 + \rho_2 \le C_1$. Note that we can also represent this as:

$$\begin{aligned}
C_1 - \rho_1 - \rho_2 &= Cov\big(\tilde{Q}_0(1, W_1), \tilde{Q}_0(1, W_2)\big) + Cov\big(\tilde{Q}_0(0, W_1), \tilde{Q}_0(0, W_2)\big) - Cov\big(\tilde{Q}_0(1, W_1), \tilde{Q}_0(0, W_2)\big) - Cov\big(\tilde{Q}_0(0, W_1), \tilde{Q}_0(1, W_2)\big) \\
&= Cov\big(\tilde{Q}_0(W_1), \tilde{Q}_0(W_2)\big),
\end{aligned}$$

where Cov(X, Y) = E(XY) denotes the standard covariance between two mean zero random variables X and Y, and we introduced the notation $\tilde{Q}_0(W) = \tilde{Q}_0(1, W) - \tilde{Q}_0(0, W)$ and $\tilde{Q}_0(a, W) = (\bar{Q}_0 - \bar{Q}^*)(a, W)$. Thus, if the latter covariance-term $Cov(\tilde{Q}_0(W_1), \tilde{Q}_0(W_2))$ is non-negative, then the naive variance estimator is conservative. This is a very reasonable condition that is expected to hold when units are matched on predictors of the outcome. Thus, we can conclude that in great generality the naive variance estimator is a conservative estimator. We also note that if in truth there is no treatment effect, conditional on covariates, then this covariance term equals zero, so that the naive variance estimator is unbiased.

6.5 A general conservative estimator of the asymptotic variance of TMLE

Above we presented the naive variance estimator of the unadjusted estimator and showed that it is conservative in great generality. In this subsection we propose a generalization of this estimator to obtain a conservative estimator of the asymptotic variance of the general TMLE (using a general initial estimator).

Recall that $C = (\rho_1 + \rho_2) + C_1$, and note that $\bar{\rho} = \rho_1 + \rho_2$ can be consistently estimated with $\bar{\rho}_n = \frac{2}{J}\sum_{j=1}^J \big(Y_{1j} - \bar{Q}_n^*(A_{1j}, W_{1j})\big)\big(Y_{2j} - \bar{Q}_n^*(A_{2j}, W_{2j})\big)$. Above we showed that we can obtain a conservative bound for C by replacing $C_1$ by $\bar{\rho}$. Thus, we can conservatively estimate C by $2\bar{\rho}_n$. Thus, a general conservative estimator of the asymptotic variance $\sigma^2$ of $\sqrt{n}(\psi_n^* - \psi_0)$ is given by

$$\sigma_n^2 = \sigma_{I,n}^2 - 2\bar{\rho}_n,$$

where

$$\sigma_{I,n}^2 = \frac{1}{n}\sum_{i=1}^n D^*(\bar{Q}_n^*, g_0, \psi_n^*)(O_i)^2.$$

This estimator can be viewed as the generalization of the “naive” variance estimator for the unadjusted estimator of ψ0, analyzed in the previous subsection.
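A minimal sketch of this estimator follows. It assumes that the estimated influence curve values and the within-pair residuals (outcome minus fitted value, for the first and second unit of each pair) have already been computed; the function name `conservative_variance` and the numbers in the example are illustrative only.

```python
import numpy as np

def conservative_variance(ic, res1, res2):
    """Hypothetical helper: sigma^2_{I,n} - 2 * rho_bar_n.

    ic    estimated influence curve values, one per unit (length n = 2J)
    res1  residuals of the first unit in each pair (length J)
    res2  residuals of the second unit in each pair (length J)
    """
    ic, res1, res2 = (np.asarray(x, float) for x in (ic, res1, res2))
    sigma2_I = np.mean(ic ** 2)
    # rho_bar_n = (2 / J) * sum over pairs of the product of residuals.
    rho_bar = 2.0 * np.mean(res1 * res2)
    return sigma2_I - 2.0 * rho_bar, sigma2_I, rho_bar

sigma2_n, sigma2_I, rho_bar = conservative_variance(
    ic=[1.0, -1.0, 1.0, -1.0], res1=[0.2, 0.4], res2=[0.5, 0.25])
```

Positively correlated residuals within pairs reduce the estimate relative to the i.i.d. estimator, while uncorrelated residuals leave it unchanged.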

6.6 A simulation confirming the variance formula for the unadjusted estimator

To confirm our conclusions regarding the asymptotic variance of the unadjusted estimator, consider the following simple simulations. For n units, the baseline covariates $W_1$ and $W_2$ were independently drawn from $N(0, 0.2^2)$ and U(−1, 1), respectively. Then the following adaptive matching algorithm was employed. First, units were classified into a matching category M, representing the 16 quartile combinations of $W_1$ and $W_2$. Within each stratum of M, units were randomly paired. If there was an odd number of units in a given stratum, the remaining unit was set aside. The leftovers were then ordered according to M and paired. Next, the treatment was randomized within the n/2 matched pairs. Finally, the binary outcome Y was drawn independently for each unit with probability

$$p = \operatorname{expit}\big[\beta_0 + \beta_1 A + \beta_2 W_1 + \beta_3 W_1 A + \beta_4 W_2^2\big] \qquad (6.2)$$

where expit is the inverse logistic function and the coefficients were set as β0 = −1, β1 = −0.5, β2 = 3, β3 = −2 and β4 = 2. The target causal parameter is the average treatment effect. It had a true value of ψ0 = −0.11 in this data generating experiment (“Scenario 1”). The coefficients were then also varied to examine the asymptotic variance of the unadjusted estimator in different data generating experiments. In Scenario 2 there is no treatment effect: β1 = β3 = 0. In Scenario 3 the baseline covariates (used for matching) have no effect on the outcome. Specifically, β2, β3 and β4 were set to zero to yield an average treatment effect of −0.08.
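The data generating experiment can be sketched as follows. This is an illustrative reimplementation in Python, not the authors' code (which was written in R): the function name `simulate_pair_matched_trial`, the reading of the first covariate's distribution as having standard deviation 0.2, the use of sample quartiles to define M, and the pairing of neighbouring leftovers are assumptions made here, and n is assumed to be even.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_pair_matched_trial(n, beta, rng):
    """Hypothetical helper; returns W1, W2, A, Y and the (n/2, 2) pair index array."""
    W1 = rng.normal(0.0, 0.2, n)      # assumed standard deviation 0.2
    W2 = rng.uniform(-1.0, 1.0, n)
    # Matching category M: one of the 16 quartile combinations of W1 and W2.
    q1 = np.searchsorted(np.quantile(W1, [0.25, 0.5, 0.75]), W1)
    q2 = np.searchsorted(np.quantile(W2, [0.25, 0.5, 0.75]), W2)
    M = 4 * q1 + q2
    pairs, leftovers = [], []
    for m in range(16):
        # Random pairing within the stratum; an odd unit is set aside.
        idx = rng.permutation(np.flatnonzero(M == m))
        if idx.size % 2 == 1:
            leftovers.append(idx[-1])
            idx = idx[:-1]
        pairs.extend(zip(idx[0::2], idx[1::2]))
    # Leftovers are collected in order of M; pair neighbouring leftovers.
    pairs.extend(zip(leftovers[0::2], leftovers[1::2]))
    pairs = np.array(pairs, dtype=int)
    # Randomize treatment within each pair: exactly one unit is treated.
    J = pairs.shape[0]
    treated_slot = rng.integers(0, 2, J)
    A = np.zeros(n)
    A[pairs[np.arange(J), treated_slot]] = 1.0
    b0, b1, b2, b3, b4 = beta
    p = expit(b0 + b1 * A + b2 * W1 + b3 * W1 * A + b4 * W2 ** 2)
    Y = rng.binomial(1, p)
    return W1, W2, A, Y, pairs

rng = np.random.default_rng(2012)
W1, W2, A, Y, pairs = simulate_pair_matched_trial(
    500, (-1.0, -0.5, 3.0, -2.0, 2.0), rng)
```

Repeating such a draw many times and recording the unadjusted estimate each time gives a Monte Carlo approximation of its finite sample variance.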

For each scenario, the true finite sample variance $Var(\psi_n^*)$ was estimated as the variance of the unadjusted estimator over R = 10,000 trials, each of sample size n = 500 units, and the corresponding asymptotic variance estimate $n\,Var(\psi_n^*)$ was reported. Table 1 compares this estimate of the true asymptotic variance with the claimed asymptotic variance $\sigma^2 = \sigma_I^2 - C$ calculated according to Theorem 3, and with the asymptotic naive variance $\sigma_{naive}^2$ treating the pairs as independent. The asymptotic variances $\sigma^2$ and $\sigma_{naive}^2$, as well as C and $\bar{\rho}$, were computed with Monte Carlo simulation of 50,000 units. All statistical computing was done in R version 2.15.1. In addition, recall our claim that $C - 2\bar{\rho} \ge 0$ implies that $\sigma_{naive}^2 = \sigma_I^2 - 2\bar{\rho}$ is conservative.

Table 1.

Comparing the true finite sample variance of the unadjusted estimator scaled by n, $n\,Var(\psi_n^*)$, the asymptotic variance $\sigma^2$ according to Theorem 3, and the naive asymptotic variance $\sigma_{naive}^2$ treating the pairs as independent. Scenario 1 corresponds to the setting β0 = −1, β1 = −0.5, β2 = 3, β3 = −2 and β4 = 2 in Eq. 6.2. Scenario 2 corresponds to setting β1 and β3 to zero in order to examine the asymptotic variance if the intervention has no effect on the outcome. Scenario 3 corresponds to setting β2, β3 and β4 to zero in order to examine the asymptotic variance if the baseline covariates (used for matching) have no effect on the outcome. For each scenario, the correction factor C and $2\bar{\rho}$ are also given.

| | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| $n\,Var(\psi_n^*)$ | 0.8408 | 0.8708 | 0.6833 |
| $\sigma^2$ | 0.8523 | 0.8729 | 0.6915 |
| $\sigma_{naive}^2$ | 0.8591 | 0.8729 | 0.6915 |
| C | 0.0712 | 0.1060 | 0.0000 |
| $2\bar{\rho}$ | 0.0643 | 0.1060 | −0.0000 |

In all scenarios, the simulated asymptotic variance of the TMLE and our claimed true asymptotic variance are in agreement. The simulation for Scenario 1 also confirms that $\sigma_{naive}^2 = \sigma_I^2 - 2\bar{\rho}$ is indeed conservative, but close to the true asymptotic variance. In Scenario 2, where there is no treatment effect, the correction factors C and $2\bar{\rho}$ are equal, $C = 2\bar{\rho}$, and in Scenario 3 we have $C = 2\bar{\rho} = 0$. Indeed, in both of these scenarios we see perfect agreement between $\sigma_{naive}^2$ and the true asymptotic variance $\sigma^2$.

6.7 Efficiency gains due to adaptive pair matching

In this section we compare two design choices regarding $g_0^n$. In the first, we simply assume that $g_0^n(A^n \mid X^n) = \prod_{i=1}^n g_0(A_i \mid W_i)$ for a common $g_0$. In this case, $(W_i, A_i, Y_i)$, i = 1, …, n, are i.i.d. This design includes classic non-matched randomized trials in which treatment is randomly assigned with some known probability, possibly conditional on unit-specific covariates.

We compare this design to a design employing adaptive pair matching. In other words, in the second design we assume $g_0^n \in \mathcal{G}_2^n$ with $g_{0,i}^n = g_0$, so that $g_0^n(A^n \mid X^n) = \prod_{j=1}^{n/2} g_0((A_i : i \in C_j(W^n)) \mid W^n)$ and the marginal $P(A_i = a \mid W^n) = g_0(a \mid W_i)$, i = 1, …, n.

We compare the asymptotic variance of the TMLEs under these two designs when $\bar{Q}_n^0$ converges to a possibly misspecified $\bar{Q}^*$. This provides insight into the efficiency gains made possible by adaptive pair matching. We assume that $g_0$ is known, so that $\bar{g}_n = g_0$, as would be the case in both a non-matched and an adaptively matched randomized trial.

Theorem 4

Under the i.i.d. design, the TMLE is asymptotically linear with influence curve $D^*(\bar{Q}^*, g_0, \psi_0)$, so that its asymptotic variance is given by $\sigma_I^2(\bar{Q}^*) = P_0\{D^*(\bar{Q}^*, g_0, \psi_0)\}^2$. This variance can be represented as

$$\sigma_I^2(\bar{Q}^*) = E_0\{\bar{Q}_0(W) - \psi_0\}^2 + E_0 E_0\big(H_{g_0}^2(A, W)(Y - \bar{Q}^*(A, W))^2 \mid W\big) - E_0\{\bar{Q}_0(W) - \bar{Q}^*(W)\}^2.$$

For the adaptive pair matching design the asymptotic variance $\sigma^2(\bar{Q}^*)$ of the TMLE is given by the limit of

$$E_0\{\bar{Q}_0(W) - \psi_0\}^2 + E_0 \frac{1}{n}\sum_{j=1}^{n/2} P_{Q_0, g^n}\Big\{\sum_{i \in C_j(W^n)} H_{g_0}(A_i, W_i)\big(Y_i - \bar{Q}^*(A_i, W_i)\big)\Big\}^2 - E_0 \frac{1}{n}\sum_{j=1}^{n/2} \Big\{\sum_{i \in C_j(W^n)} \{\bar{Q}_0(W_i) - \bar{Q}^*(W_i)\}\Big\}^2.$$

This can be represented as:

$$\sigma^2(\bar{Q}^*) = E_0\{\bar{Q}_0(W) - \psi_0\}^2 + E_0 E_0\big(H_{g_0}^2(A, W)(Y - \bar{Q}^*(A, W))^2 \mid W\big) - E_0\{\bar{Q}_0(W) - \bar{Q}^*(W)\}^2 - C,$$

where $C = (\rho_1 + \rho_2) + C_1$ was defined above as a sum of four terms, with

$$C_1 = E_0 \frac{1}{J}\sum_{j=1}^J (\bar{Q}_0 - \bar{Q}^*)(1, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(1, W_{2j}) + E_0 \frac{1}{J}\sum_{j=1}^J (\bar{Q}_0 - \bar{Q}^*)(0, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(0, W_{2j}).$$

The difference between the two asymptotic variances is thus given by:

$$\sigma_I^2(\bar{Q}^*) - \sigma^2(\bar{Q}^*) = C.$$

If $\bar{Q}^* = \bar{Q}_0$, the two asymptotic variances are equal. If $\bar{Q}^*(A, W) = E_0(Y \mid A)$, then the difference is the sum C of the four covariances.

This theorem teaches us that, while the information bound for the two designs is the same, the TMLE under adaptive pair matching at misspecified $\bar{Q}^*$ will outperform the TMLE under i.i.d. sampling, as long as C > 0. This theorem further suggests that pair matching will result in efficiency gains over the i.i.d. design to the extent that there are baseline covariates W that are predictive of Y which cannot be adjusted for in the outcome regression. Such a scenario might occur in finite samples due to lack of support in the data. For example, in a cluster randomized trial of an HIV prevention intervention, the sample of communities might include only two communities in proximity to a major trucking route, a community characteristic known to predict higher HIV transmission levels. If by chance in the i.i.d. design both of these communities were assigned to the treatment arm of the trial, lack of data support would preclude adjustment for this community-level covariate and thus pair matching on this covariate would result in efficiency gains.

7 Augmenting the data structure with missingness

Consider the following data generating experiment. Firstly, we sample n i.i.d. $(W_1, Y_1(0), Y_1(1)), \ldots, (W_n, Y_n(0), Y_n(1))$, giving us the vector $X^n$ and vector of baseline covariates $W^n$. Based on $W^n$, we run a partitioning algorithm generating pairs $C_j(W^n)$, j = 1, …, J. However, suppose that the designer does not want to accept pairs that are not similar enough with respect to some metric. Therefore, one applies an algorithm that involves assigning an indicator $\Delta_i(W^n)$, i = 1, …, n, and applying the partitioning algorithm among the units $\{i : \Delta_i(W^n) = 1\}$, resulting in $C_j(W^n)$, j = 1, …, J. Thus $\cup_j C_j(W^n) = \{i : \Delta_i(W^n) = 1\}$. We also note that $\Delta_i(W^n)$ is a deterministic function of $W^n$. Let $n_1$ be the number of observations with $\Delta_i(W^n) = 1$. Given $W^n$, the $\Delta_i(W^n)$ and the pairs $C_j(W^n)$, we draw $A^{n_1}$ from a conditional distribution

$$g_0^n(A^{n_1} \mid X^n) = g_0^n(A^{n_1} \mid W^n) = \prod_{j=1}^J g_0^n\big((A_i : i \in C_j(W^n)) \mid W^n\big).$$

We now collect the data $O_i = (W_i, \Delta_i(W^n), \Delta_i(W^n) A_i, \Delta_i(W^n) Y_i(A_i))$, i = 1, …, n, giving the observed data $O^n = (O_1, \ldots, O_n)$.

The target quantity of interest remains the average treatment effect $\Psi^F(P_{X,0}) = E_0 Y(1) - E_0 Y(0)$. We have $\psi_0^F = E_W\{\bar{Q}_0(1, W) - \bar{Q}_0(0, W)\}$, where $\bar{Q}_0(a, w) = E(Y(a) \mid W = w) = E_0(Y \mid A = a, W = w)$. We note that $Y_i$, given $W^n$, $A^{n_1}$, is independent across i = 1, …, n, and this conditional distribution equals the conditional distribution of $Y_i$, given $W_i$, $A_i$. Therefore,

$$\begin{aligned}
E(Y_i \mid A_i, W_i, \Delta_i(W^n) = 1) &= E\big(E(Y_i \mid A_i, W_i, W^n) \mid A_i, W_i, \Delta_i(W^n) = 1\big) \\
&= E\big(E(Y_i \mid A_i, W_i) \mid A_i, W_i, \Delta_i(W^n) = 1\big) \\
&= E(Y_i \mid A_i, W_i).
\end{aligned}$$

This proves that $\bar{Q}_0(a, w) = E(Y_i \mid A_i = a, W_i = w, \Delta_i(W^n) = 1)$ and is thus identifiable from the distribution of $O^n$. This proves the desired identifiability of $\psi_0^F$:

$$\Psi^F(P_{X,0}) = E_{W,0}\{\bar{Q}_0(1, W) - \bar{Q}_0(0, W)\} = \Psi(P_0^n).$$

The average is with respect to the marginal distribution of W (not conditional on $\Delta_i(W^n) = 1$), so that the observations with $\Delta_i(W^n) = 0$ are also used to identify this target quantity.

This also demonstrates that $E_0 \sum_{i=1}^n I(\Delta_i(W^n) = 1)\big(Y_i - \bar{Q}(A_i, W_i)\big)^2$ is minimized over $\bar{Q}$ by $\bar{Q}_0$, and thus represents a valid loss function for loss-based learning of $\bar{Q}_0$ based on $O^n$. Similarly, we can use a log-likelihood loss $\sum_{i=1}^n I(\Delta_i(W^n) = 1) L(\bar{Q})(W_i, A_i, Y_i)$, where $-L(\bar{Q})(W, A, Y) = Y \log \bar{Q}(A, W) + (1 - Y) \log(1 - \bar{Q}(A, W))$.

In order to present a TMLE we first need to derive the canonical gradient, which is presented in the following theorem.

Theorem 5

Consider the data generating experiment described above.

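Let $O_i = (W_i, \Delta_i(W^n), \Delta_i(W^n)A_i, \Delta_i(W^n)Y_i)$; the observed data is $O^n = (O_1, \ldots, O_n) \sim P^n$ with

$$P^n(O^n) = \left[\prod_{i=1}^n Q_W(W_i)\{Q_Y(Y_i \mid W_i, A_i)\}^{\Delta_i(W^n)}\right] g^n\big((A_i : \Delta_i(W^n) = 1) \mid W^n\big),$$

where $Q_W$ is an unspecified marginal distribution, $Q_Y$ is an unspecified conditional distribution of Y, given A, W, and $g^n$ is a conditional distribution of $A^{n_1} = (A_i : \Delta_i(W^n) = 1)$, given $W^n = (W_1, \ldots, W_n)$, known to be an element of a set $\mathcal{G}^n$ consisting of distributions satisfying (2.1). Let $\mathcal{M}$ be the resulting statistical model for $P^n$. Let $\mathcal{M}(g^n)$ be the model if $g^n$ is known.

Let $\Psi : \mathcal{M} \to \mathbb{R}$ be defined by $\Psi(P^n) = E_{Q_W}\{\bar{Q}(1,W) - \bar{Q}(0,W)\}$, where $\bar{Q}(A,W) = E_{Q_Y}(Y \mid A, W)$.

The tangent space at $P^n$ in model $\mathcal{M}$ is given by:

$$T(P^n) = \Big\{\sum_{i=1}^n \phi(W_i) : \phi \in T_W\Big\} + \Big\{\sum_{i=1}^n \Delta_i(W^n)\phi(Y_i \mid A_i, W_i) : \phi \in T_Y\Big\} + \sum_{j=1}^J T_{C_j}, \qquad (7.1)$$

where $T_W = \{h(W) : E h(W) = 0\}$,

$$T_Y = \{h(Y \mid A,W) : E_{Q_Y}(h(Y \mid A,W) \mid A,W) = 0\}, \quad \text{and} \quad T_{C_j} = \{S((A_i : i \in C_j(W^n)) \mid W^n) : E(S \mid W^n) = 0\}.$$

The tangent space at $P^n$ in model $\mathcal{M}(g^n)$ is given by

$$T(Q) = \Big\{\sum_{i=1}^n \phi(W_i) : \phi \in T_W\Big\} + \Big\{\sum_{i=1}^n \phi(Y_i \mid A_i, W_i) : \phi \in T_Y\Big\}.$$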

Let

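$$D^*(\bar{Q}, g, \psi)(W, \Delta, \Delta A, \Delta Y) = D_W(\bar{Q},\psi)(W) + \Delta\frac{2A - 1}{g(A, 1 \mid W)}(Y - \bar{Q}(A,W)),$$

where $g$ denotes a distribution with $g(a, 1 \mid W) = P(A = a, \Delta = 1 \mid W)$. The statistical parameter $\Psi$ is pathwise differentiable and its canonical gradient at $P^n$ is given by

$$D^{n,*}(P^n) = \frac{1}{n}\sum_{i=1}^n D^*(\bar{Q}, \bar{g}_n, \Psi(Q))(O_i),$$

where $g_i(a, 1 \mid W_i) = \Pi_i(1 \mid W_i) g_i(a \mid W_i)$ is the conditional probability that $A_i = a$, $\Delta_i(W^n) = 1$, given $W_i$, which can be factored into $\Pi_i(1 \mid W_i) = P(\Delta_i(W^n) = 1 \mid W_i)$ and $g_i(a \mid W_i) = P(A_i = a \mid W_i, \Delta_i(W^n) = 1)$, and

$$\bar{g}_n(a, 1 \mid W) = \frac{1}{n}\sum_{i=1}^n g_i(a, 1 \mid W).$$

We note that

$$g_i(a, 1 \mid W_i) = \int_{(w_j : j \neq i)} \Delta_i((w_j : j \neq i), W_i)\, g_i(a \mid (w_j : j \neq i), W_i) \prod_{j \neq i} Q_W(w_j) \qquad (7.2)$$

is a function of $g_i(A_i \mid W^n)$ and the common marginal distribution $Q_W$. We have

$$E_0 D^{n,*}(\bar{Q}, \bar{g}_n, \psi_0) = 0 \quad \text{if } \bar{Q} = \bar{Q}_0 \text{ or } \bar{g}_n = \bar{g}_{n,0}, \qquad (7.3)$$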

assuming that for all i, 0 < gi(1, 1 | Wi) < 1 a.e.

The TMLE of $\bar{Q}_0$ is analogous to the TMLE presented in Section 4, with three modifications: the clever covariate is now given by $(2A_i - 1) I(\Delta_i(W^n) = 1)/\bar{g}_n(A_i, 1 \mid W_i)$; only the complete observations are used for fitting $\bar{Q}_0$; and the empirical distribution over all $W_1, \ldots, W_n$ is plugged into the target parameter mapping. The same asymptotics can be applied and the formulas for the asymptotic variance are the same as presented earlier, with the only modification that $g_i(a \mid W_i)$ is now replaced by $g_i(a, 1 \mid W_i)$.
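As an illustration of the three modifications just described, the following Python sketch performs one targeting step and the plug-in evaluation. It rests on assumptions that are not taken from the paper: a continuous outcome, a linear fluctuation with squared-error loss (Section 4 may use a different submodel and loss), and user-supplied functions `Q_bar(a, w)` for the initial outcome regression and `g_bar(a, w)` for the averaged probability of A = a and Δ = 1 given W = w. All names are hypothetical.

```python
def tmle_with_missingness(O, Q_bar, g_bar):
    """One-step TMLE sketch for data O_i = (W_i, Delta_i, Delta_i*A_i, Delta_i*Y_i).

    Q_bar(a, w): initial estimate of E(Y | A=a, W=w), fitted on complete cases.
    g_bar(a, w): estimate of the averaged P(A=a, Delta=1 | W=w).
    Assumption: linear fluctuation Q_bar + eps * H with squared-error loss.
    """
    num = 0.0
    den = 0.0
    for w, d, a, y in O:
        if d == 1:  # only complete observations enter the targeting step
            h = (2 * a - 1) / g_bar(a, w)  # clever covariate
            num += h * (y - Q_bar(a, w))
            den += h * h
    eps = num / den if den > 0 else 0.0

    def Q_star(a, w):
        # Updated outcome regression along the fluctuation.
        return Q_bar(a, w) + eps * (2 * a - 1) / g_bar(a, w)

    # Plug-in over ALL W_1..W_n, including units with Delta_i = 0.
    psi = sum(Q_star(1, w) - Q_star(0, w) for w, _, _, _ in O) / len(O)
    return psi, eps
```

With this choice of fluctuation, the updated fit solves the score equation $\sum_i \Delta_i H_i (Y_i - \bar{Q}^*(A_i, W_i)) = 0$ in a single step, where $H_i$ denotes the clever covariate.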

8 Summary

This article has investigated efficient estimation and inference for the additive causal effect $E_0\{Y(1) - Y(0)\}$ of treatment on the outcome under a class of designs based on sampling n i.i.d. $(W_i, Y_i(0), Y_i(1)) \sim P_{X,0}$, sampling $A^n$, given $W^n$, and collecting $(W_i, A_i, Y_i)$, $i = 1, \ldots, n$. We considered a general class of dependent treatment assignment mechanisms $g^n$ satisfying the assumption that $(A_i : i \in C_j(W^n))$, $j = 1, \ldots, J$, are independent across j, conditionally on $W^n$, where $C_j(W^n)$, $j = 1, \ldots, J$, is a partitioning of the sample $\{1, \ldots, n\}$ into groups implied by $W^n$. The number of partitions J was assumed to be proportional to n.

We computed the efficient influence curve of the target parameter for the statistical model implied by this design without making additional assumptions about the common full-data distribution $P_{X,0}$. We defined a corresponding TMLE that is consistent and asymptotically normally distributed under correct specification of $g_0^n$, and is also efficient if the outcome regression $\bar{Q}_0$ is consistently estimated. This TMLE can be implemented by ignoring the dependency created by the treatment allocation process, with the exception that if cross validation is used to estimate the average $\bar{g}_n$ of $g_{0,i}(A_i \mid W_i)$ across $i = 1, \ldots, n$, the group rather than the unit should be used when partitioning the data into training and validation sets. Thus, construction of training and validation sets for data adaptive estimation of $\bar{Q}_0$ can be based on the sampling unit. We further suggested an alternative plug-in approach to estimating the unit specific treatment mechanism $g_{0,i}$ that makes use of design based knowledge of $g_0^n$, thus potentially improving estimator robustness and efficiency.
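The group-level splitting recommended for cross-validated estimation of the treatment mechanism can be sketched as follows. The function `group_folds` and its arguments are hypothetical; the only idea taken from the text is that every member of a group Cj(Wn) lands in the same fold, so no pair is split between training and validation sets.

```python
import random

def group_folds(groups, n_folds, seed=0):
    """Map each unit index to a fold, keeping every group C_j intact.

    groups: list of tuples of unit indices, e.g. matched pairs [(0, 1), (2, 3)].
    Returns a dict {unit index: fold number in 0..n_folds-1}.
    """
    rng = random.Random(seed)
    order = list(range(len(groups)))
    rng.shuffle(order)  # randomize which groups share a fold
    fold_of_unit = {}
    for position, j in enumerate(order):
        for i in groups[j]:
            fold_of_unit[i] = position % n_folds
    return fold_of_unit
```

For the outcome regression, by contrast, the text indicates that ordinary unit-level folds remain appropriate.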

Due to the dependency introduced by the treatment allocation process, no asymptotically consistent bootstrap method appears to be available for the general class of dependent $g^n$-designs presented in this paper. Further, when groups are size 2 or larger, the asymptotic variance of the TMLE under the dependent sampling relies on a consistent estimator of $\bar{Q}_0$ even when $g_0^n$ is known. In contrast, the asymptotic variance of the TMLE under i.i.d. sampling is fully robust to misspecification of $\bar{Q}_n$ in randomized controlled trials.

We further considered adaptively pair matched trials as an important special case of the general dependent treatment allocation design. We formally compared the asymptotic variance of the TMLE under this design with that of the TMLE under i.i.d. sampling. While the information bound for the adaptively pair matched design with $g_i^n = g_i = g_0$ equals the information bound for i.i.d. sampling of $(W_i, A_i, Y_i)$ with $P(A = a \mid W) = g_0(a \mid W)$, we showed that the TMLE under adaptive pair matching and misspecified $\bar{Q}^*$ will outperform the TMLE under i.i.d. sampling as long as the functions $(\bar{Q}_0 - \bar{Q}^*)(1, \cdot)$ and $(\bar{Q}_0 - \bar{Q}^*)(0, \cdot)$ of the baseline covariates within the groups $C_j(W^n)$ are positively correlated. We also showed that under the pair matched design and the positive correlation condition, an estimate of the variance that treats the n observations as i.i.d. is conservative if $\bar{Q}_n$ is inconsistent for $\bar{Q}_0$ and is asymptotically consistent if $\bar{Q}_n$ is consistent. We also presented a less conservative variance estimator that relies on an additional reasonable assumption (similar to the above positive correlation assumption). We demonstrated that the estimator of the variance for the unadjusted estimator, as currently used by practitioners in the analysis of pair matched trials, is valid as well, and that our less conservative variance estimator is a generalization of this estimator.
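The contrast between a variance estimate that treats the n observations as i.i.d. and one that respects the pairs can be illustrated numerically. The sketch below is an assumption-laden illustration and is not claimed to be the estimator defined earlier in the paper: it takes estimated influence-curve values `D[i]` as given, uses their uncentered second moment for the i.i.d.-style estimate, and averages the two values within each pair before squaring for the pair-level estimate. All names are hypothetical.

```python
def variance_estimates(D, pairs):
    """Two variance estimates for an estimator with influence-curve values D.

    D: list with one estimated influence-curve value per unit.
    pairs: list of (i, l) index pairs covering all n = 2J units.
    Returns (iid_var, pair_var), both on the scale of Var(psi_n).
    """
    n = len(D)
    J = len(pairs)
    # Treats the n observations as i.i.d. (ignores within-pair dependence).
    iid_var = sum(d * d for d in D) / n ** 2
    # Treats the J pairs as the units: average within a pair, then square.
    pair_var = sum(((D[i] + D[l]) / 2.0) ** 2 for i, l in pairs) / J ** 2
    return iid_var, pair_var
```

Expanding the square shows that the two estimates differ exactly by the within-pair cross products, so the pair-level estimate is the smaller of the two whenever those cross products are negative on average.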

Taken together, these findings teach us that the use of an adaptively pair matched design will generally result in a more efficient estimator of the treatment effect, while one can still obtain robust conservative variance estimators. However, understanding the complications resulting from the adaptive pair matching requires advanced empirical process theory, and even makes the analysis of the unadjusted estimator a serious challenge.

A Appendix: Proof of Theorem 1

Firstly, we note that

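$$
\begin{aligned}
E_0 D^*(\bar{Q}, \bar{g}_n, \psi_0)(O^n) &= E_0 \frac{1}{n}\sum_{i=1}^n \{\bar{Q}(W_i) - \psi_0\} \\
&\quad + E_0 \frac{1}{n}\sum_{i=1}^n \left\{\frac{g_{i,0}(1 \mid W_i)}{\bar{g}_n(1 \mid W_i)}(\bar{Q}_0 - \bar{Q})(1, W_i) - \frac{g_{i,0}(0 \mid W_i)}{\bar{g}_n(0 \mid W_i)}(\bar{Q}_0 - \bar{Q})(0, W_i)\right\} \\
&= \Psi(Q) - \psi_0 + \frac{1}{n}\sum_i \int_w Q_{W,0}(w)\frac{g_{i,0}(1 \mid w)}{\bar{g}_n(1 \mid w)}(\bar{Q}_0 - \bar{Q})(1, w) \\
&\quad - \frac{1}{n}\sum_i \int_w Q_{W,0}(w)\frac{g_{i,0}(0 \mid w)}{\bar{g}_n(0 \mid w)}(\bar{Q}_0 - \bar{Q})(0, w) \\
&= \Psi(Q) - \psi_0 + E_0 \frac{\bar{g}_{n,0}(1 \mid W)}{\bar{g}_n(1 \mid W)}(\bar{Q}_0 - \bar{Q})(1, W) - E_0 \frac{\bar{g}_{n,0}(0 \mid W)}{\bar{g}_n(0 \mid W)}(\bar{Q}_0 - \bar{Q})(0, W).
\end{aligned}
$$

Thus, if $\bar{g}_n = \bar{g}_{n,0}$, then this equals $\Psi(Q) - \psi_0 + \psi_0 - \Psi(Q) = 0$. If $Q_0 = Q$, then we also obtain 0. This proves (3.3). We also note that $D^{n,*}(Q_0, \bar{g}_0)$ is an element of the tangent space $T_Q$. In addition, for each $Q$, $D^{n,*}(Q, \bar{g}_0, \psi_0)$ is a gradient in the model $\mathcal{M}(g_0^n)$ with $g_0^n$ known, which shows that $D^{n,*}(Q_0, \bar{g}_0)$ is the canonical gradient of $\Psi : \mathcal{M}(g^n) \to \mathbb{R}$ at $P_0^n$. By factorization of the likelihood, it is also the canonical gradient for any model $\mathcal{M}$ that instead assumes that $g_0^n \in \mathcal{G}^n$ for a model $\mathcal{G}^n$.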

B Appendix: Proof of Theorem 2

Recall the notation Pf = EPf. We have

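$$P_0^n D^{n,*}(\bar{Q}_n, \bar{g}_0, \psi_n) \equiv \frac{1}{n}\sum_{i=1}^n P_{Q_0, g_{0,i}} D^*(\bar{Q}_n, \bar{g}_0, \psi_n) = \psi_0 - \psi_n.$$

Here we remind the reader that $\bar{g}_0 = \frac{1}{n}\sum_i g_{0,i}$ and $g_{0,i}(a \mid w) = P_0(A_i = a \mid W_i = w)$. We also have $D^{n,*}(\bar{Q}_n, \bar{g}_n, \psi_n) = 0$.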

Thus,

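$$
\begin{aligned}
\psi_n - \psi_0 &= \frac{1}{n}\sum_i \{D^*(\bar{Q}_n, \bar{g}_n, \psi_n)(O_i) - P_{Q_0, g_{0,i}} D^*(\bar{Q}_n, \bar{g}_n, \psi_n)\} \\
&\quad + \frac{1}{n}\sum_i P_{Q_0, g_{0,i}}\{D^*(\bar{Q}_n, \bar{g}_n, \psi_n) - P_{Q_0, g_{0,i}} D^*(\bar{Q}_n, \bar{g}_0, \psi_n)\} \\
&\equiv \frac{1}{n}\sum_i \{D^*(\bar{Q}_n, \bar{g}_n, \psi_n)(O_i) - P_{Q_0, g_{0,i}} D^*(\bar{Q}_n, \bar{g}_n, \psi_n)\} + \frac{1}{\sqrt{n}} Z_{W, \bar{g}_n, n}.
\end{aligned}
$$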

We note that, using some straightforward algebra,

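$$
\begin{aligned}
Z_{W,\bar{g}_n,n} &= \sqrt{n}\int_w \frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_n}(1 \mid w)(\bar{Q}_0 - \bar{Q})(1, w)\,dQ_{W,0}(w) - \sqrt{n}\int_w \frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_n}(0 \mid w)(\bar{Q}_0 - \bar{Q})(0, w)\,dQ_{W,0}(w) \\
&= \sqrt{n}\int_w \frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_0}(1 \mid w)(\bar{Q}_0 - \bar{Q})(1, w)\,dQ_{W,0}(w) - \sqrt{n}\int_w \frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_0}(0 \mid w)(\bar{Q}_0 - \bar{Q})(0, w)\,dQ_{W,0}(w) + R(\bar{g}_n, \bar{g}_0),
\end{aligned}
$$

where

$$R(\bar{g}_n, \bar{g}_0) = \sqrt{n}\int \frac{(\bar{g}_0 - \bar{g}_n)^2}{\bar{g}_n \bar{g}_0}(1 \mid w)(\bar{Q}_0 - \bar{Q})(1, w)\,dQ_0(w) - \sqrt{n}\int \frac{(\bar{g}_0 - \bar{g}_n)^2}{\bar{g}_n \bar{g}_0}(0 \mid w)(\bar{Q}_0 - \bar{Q})(0, w)\,dQ_0(w).$$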

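We assume that the latter is $o_P(1)$. Thus to establish the asymptotic linearity of $Z_{W,\bar{g}_n,n}$ we need to study terms of the form $\sqrt{n}\int f(w)(\bar{g}_n - \bar{g}_0)(1 \mid w)\,dQ(w)$. We now note that

$$
\begin{aligned}
(\bar{g}_n - \bar{g})(1 \mid w) &= \frac{1}{n}\sum_{i=1}^n (g_{i,n} - g_i)(1 \mid w) \\
&\equiv \frac{1}{n}\sum_{i=1}^n \int g_i(1 \mid (W_j : j \neq i), W_i = w)\Big(\prod_{j \neq i} Q_{W,n}(w_j) - \prod_{j \neq i} Q_W(w_j)\Big) \\
&= \frac{1}{n}\sum_{i=1}^n \int g_i(1 \mid W_{-i}, W_i = w)\sum_{l=1, l \neq i}^n (Q_{W,n}(w_l) - Q_W(w_l))\prod_{m=1, m \neq i}^{l-1} Q_{W,n}(w_m)\prod_{m=l+1, m \neq i}^{n} Q_W(w_m) \\
&\approx \frac{1}{n}\sum_{i=1}^n \sum_{l \neq i}\int g_i(1 \mid W_i = w, W_l = w_l)(Q_{W,n}(w_l) - Q_W(w_l)) \\
&= \frac{1}{n}\sum_{k=1}^n \frac{1}{n}\sum_{i=1}^n \sum_{l \neq i}^n \{g_i(1 \mid W_i = w, W_l = W_k) - g_i(1 \mid W_i = w)\},
\end{aligned}
$$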

where we suppressed the second order term a formal analysis would have to take into account. Therefore, we can write

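$$
\begin{aligned}
\sqrt{n}\int (\bar{g}_n - \bar{g}_0)(1 \mid w) f(w)\,dQ(w) &= \frac{1}{\sqrt{n}}\sum_{k=1}^n \frac{1}{n}\sum_{i=1}^n \sum_{l \neq i}^n \int_w \{g_i(1 \mid W_i = w, W_l = W_k) - E_{0,W_k} g_i(1 \mid W_i = w, W_l = W_k)\} f(w)\,dQ(w) \\
&\equiv \frac{1}{\sqrt{n}}\sum_{k=1}^n \{\Phi(W_k) - E_0\Phi(W_k)\},
\end{aligned}
$$

where we defined

$$\Phi(W_k) = \frac{1}{n}\sum_{i=1}^n \sum_{l \neq i}^n \int_w g_i(1 \mid W_i = w, W_l = W_k) f(w)\,dQ(w).$$

Thus such integrals are standardized sums of independent random variables $\Phi(W_k) - E_0\Phi(W_k)$ with mean zero. Such terms will converge to a normal distribution if the variance of $\Phi(W_k)$ is bounded (uniformly in n, since $\Phi$ is really indexed by n as well). This demonstrates that the sum $\sum_{l \neq i}$ should essentially contribute only a finite number of terms.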

To conclude, under regularity conditions, we might have

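$$Z_{W,\bar{g}_n,n} \approx \frac{1}{\sqrt{n}}\sum_{k=1}^n \{IC(W_k) - E_0 IC(W_k)\},$$

where

$$
\begin{aligned}
IC(W_k) &= \int_w \Big(\frac{1}{n}\sum_{i=1}^n \sum_{l \neq i}^n g_i(1 \mid W_i = w, W_l = W_k)\Big)\frac{\bar{Q}_0 - \bar{Q}}{\bar{g}_0}(1, w)\,dQ_0(w) \\
&\quad - \int_w \Big(\frac{1}{n}\sum_{i=1}^n \sum_{l \neq i}^n g_i(0 \mid W_i = w, W_l = W_k)\Big)\frac{\bar{Q}_0 - \bar{Q}}{\bar{g}_0}(0, w)\,dQ_0(w).
\end{aligned}
$$

A crucial assumption we made in the theorem is that the variance of $IC(W_k)$ is finite. We will now show that under a reasonable typical assumption we will, in fact, have that $IC(W_k) - E_0 IC(W_k) = 0$. For $i \in C_j(W^n)$, in a typical design one will have that $g_i(a \mid W_i = w_i, W_{-i})$ only depends on $W_i = w_i$. Thus, in that case, for $i \in C_j(W^n)$ we have $g_i(a \mid W^n) = g_i(a \mid W_i)$ for some conditional density $g_i(a \mid w)$. This provides us with the following representation:

$$g_i(a \mid W^n) = \sum_{j=1}^J I(i \in C_j(W^n))\, g_i(a \mid W_i).$$

This yields the following derivation of $g_i(a \mid W_i, W_l)$:

$$
\begin{aligned}
g_i(a \mid W_i, W_l) &= \int g_i(a \mid W_i, W_l, W(-i,-l))\, P(W(-i,-l)) \\
&= \int \sum_{j=1}^J I(i \in C_j(W_i, W_l, W(-i,-l)))\, g_i(a \mid W_i)\, P(W(-i,-l)) \\
&= \sum_{j=1}^J g_i(a \mid W_i)\int I(i \in C_j(W_i, W_l, W(-i,-l)))\, P(W(-i,-l)) \\
&= g_i(a \mid W_i)\sum_{j=1}^J P(i \in C_j(W^n) \mid W_i, W_l) \\
&= g_i(a \mid W_i).
\end{aligned}
$$

Thus, in this case, we have that $IC(W_k)$ is constant in $W_k$ so that $IC(W_k) - E_0 IC(W_k) = 0$.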

We now proceed as follows:

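$$
\begin{aligned}
\psi_n - \psi_0 &= \frac{1}{n}\sum_{i=1}^n \{D^*(\bar{Q}_n, \bar{g}_n, \psi_n)(O_i) - P_{Q_0, g_i} D^*(\bar{Q}_n, \bar{g}_n, \psi_n)\} + \frac{1}{\sqrt{n}} Z_{W,\bar{g}_n,n} \\
&= \frac{1}{n}\sum_{i=1}^n \{D^*(\bar{Q}_n, \bar{g}_n, \psi_n)(O_i) - P_{Q_0, g_i^n} D^*(\bar{Q}_n, \bar{g}_n, \psi_n)\} + \frac{1}{n}\sum_{i=1}^n \{(P_{Q_0, g_i^n} - P_{Q_0, g_i}) D^*(\bar{Q}_n, \bar{g}_n, \psi_n)\} + \frac{1}{\sqrt{n}} Z_{W,\bar{g}_n,n} \\
&= \frac{1}{n}\sum_{i=1}^n \{D_Y(\bar{Q}_n, \bar{g}_n)(O_i) - P_{Q_0, g_i^n} D_Y(\bar{Q}_n, \bar{g}_n)\} + \frac{1}{n}\sum_{i=1}^n \{(P_{Q_0, g_i^n} - P_{Q_0, g_i}) D^*(\bar{Q}_n, \bar{g}_n, \psi_n)\} + \frac{1}{\sqrt{n}} Z_{W,\bar{g}_n,n} \\
&\equiv \frac{1}{\sqrt{n}} X_n(\bar{Q}_n) + \frac{1}{\sqrt{n}} Z_{W,n,g^n} + \frac{1}{\sqrt{n}} Z_{W,\bar{g}_n,n}.
\end{aligned}
$$

Here we used at the third equality that $P_{Q_0, g_{0,i}^n}$ is a conditional expectation, given $W^n$, so that the empirical process of $D_W$ cancels out in the first term. We defined the process only as a function of $\bar{Q}_n$, not as a function of $\bar{g}_n$, because $\bar{g}_n$ is only a function of $W^n$. Note that

$$
\begin{aligned}
D_Y(\bar{Q}, \bar{g}_n)(O_i) - P_{Q_0, g_{0,i}^n} D_Y(\bar{Q}, \bar{g}_n) &= \frac{2A_i - 1}{\bar{g}_n(A_i \mid W_i)}(Y_i - \bar{Q}(A_i, W_i)) \\
&\quad - \left\{\frac{g_{0,i}^n(1 \mid W^n)}{\bar{g}_n(1 \mid W_i)}(\bar{Q}_0 - \bar{Q})(1, W_i) - \frac{g_{0,i}^n(0 \mid W^n)}{\bar{g}_n(0 \mid W_i)}(\bar{Q}_0 - \bar{Q})(0, W_i)\right\} \\
&\equiv f_{i,n}^1(\bar{Q})(O_i).
\end{aligned}
$$

Note that $f_{i,n}^1(\bar{Q})$ is a random function of $O_i$ through $W^n$, while, given $W^n$, it is a fixed function of $O_i$. In the special case that $g_{0,i}^n = g_{0,i}$ is constant in i, we have $f_{i,n}^1(\bar{Q})(O_i) = D_Y(\bar{Q}, g_{0,i})(O_i) - \{\bar{Q}_0 - \bar{Q}\}(W_i)$. We can represent $X_n(\bar{Q})$ as $X_n(\bar{Q}) = \frac{1}{\sqrt{n}}\sum_{i=1}^n f_{i,n}^1(\bar{Q})(O_i)$, where $P_{Q_0, g_{0,i}^n} f_{i,n}(\bar{Q}) = 0$.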

Let’s now determine the form of ZW,n,gn. We have

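$$
\begin{aligned}
\frac{1}{\sqrt{n}}\sum_i P_{Q_0, g_{0,i}^n} D_Y(\bar{Q}_n, \bar{g}_n) &= \frac{1}{\sqrt{n}}\sum_i \left\{\frac{g_{0,i}^n(1 \mid W^n)}{\bar{g}_n(1 \mid W_i)}(\bar{Q}_0 - \bar{Q}_n)(1, W_i) - \frac{g_{0,i}^n(0 \mid W^n)}{\bar{g}_n(0 \mid W_i)}(\bar{Q}_0 - \bar{Q}_n)(0, W_i)\right\} \\
&= \frac{1}{\sqrt{n}}\sum_i \left\{\Big(\frac{g_{0,i}^n(1 \mid W^n)}{\bar{g}_n(1 \mid W_i)} - 1\Big)(\bar{Q}_0 - \bar{Q}_n)(1, W_i) - \Big(\frac{g_{0,i}^n(0 \mid W^n)}{\bar{g}_n(0 \mid W_i)} - 1\Big)(\bar{Q}_0 - \bar{Q}_n)(0, W_i)\right\} \\
&\quad + \frac{1}{\sqrt{n}}\sum_i \{\bar{Q}_0(W_i) - \bar{Q}_n(W_i)\}, \\
\frac{1}{\sqrt{n}}\sum_i P_{Q_0, g_{0,i}} D_Y(\bar{Q}_n, \bar{g}_n) &= \sqrt{n}\int_w (\bar{Q}_0 - \bar{Q}_n)(w)\,dQ_{W,0}(w), \\
\frac{1}{\sqrt{n}}\sum_i (P_{Q_0, g_i^n} - P_{Q_0, g_i}) D_W(\bar{Q}_n, \psi_n) &= \frac{1}{\sqrt{n}}\sum_i \{\bar{Q}_n(W_i) - P_0\bar{Q}_n\}.
\end{aligned}
$$

Thus,

$$
\begin{aligned}
\frac{1}{\sqrt{n}}\sum_i (P_{Q_0, g_{0,i}^n} - P_{Q_0, g_{0,i}})(D_Y + D_W)(\bar{Q}_n, \bar{g}_n, \psi_n) &= \frac{1}{\sqrt{n}}\sum_i \left\{\Big(\frac{g_{0,i}^n(1 \mid W^n)}{\bar{g}_n(1 \mid W_i)} - 1\Big)(\bar{Q}_0 - \bar{Q}_n)(1, W_i) - \Big(\frac{g_{0,i}^n(0 \mid W^n)}{\bar{g}_n(0 \mid W_i)} - 1\Big)(\bar{Q}_0 - \bar{Q}_n)(0, W_i)\right\} \\
&\quad + \frac{1}{\sqrt{n}}\sum_i \{\bar{Q}_0(W_i) - \bar{Q}_n(W_i)\} - \sqrt{n}\int_w (\bar{Q}_0 - \bar{Q}_n)(w)\,dQ_{W,0}(w) + \frac{1}{\sqrt{n}}\sum_i \{\bar{Q}_n(W_i) - P_0\bar{Q}_n\} \\
&= \frac{1}{\sqrt{n}}\sum_i \left\{\Big(\frac{g_{0,i}^n(1 \mid W^n)}{\bar{g}_n(1 \mid W_i)} - 1\Big)(\bar{Q}_0 - \bar{Q}_n)(1, W_i) - \Big(\frac{g_{0,i}^n(0 \mid W^n)}{\bar{g}_n(0 \mid W_i)} - 1\Big)(\bar{Q}_0 - \bar{Q}_n)(0, W_i)\right\} \\
&\quad + \frac{1}{\sqrt{n}}\sum_i \{\bar{Q}_0(W_i) - \psi_0\}.
\end{aligned}
$$

Thus,

$$Z_{W,n,g^n} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \{\bar{Q}_0(W_i) - \psi_0\} + \frac{1}{\sqrt{n}}\sum_{i=1}^n \left\{\frac{g_{0,i}^n(1 \mid W^n) - \bar{g}_n(1 \mid W_i)}{\bar{g}_n(1 \mid W_i)}(\bar{Q}_0 - \bar{Q}_n)(1, W_i) - \frac{g_{0,i}^n(0 \mid W^n) - \bar{g}_n(0 \mid W_i)}{\bar{g}_n(0 \mid W_i)}(\bar{Q}_0 - \bar{Q}_n)(0, W_i)\right\}.$$

In the special case that $g_{0,i}^n = g_{0,i}$ and constant in i, we have that

$$Z_{W,n,g^n} = Z_{W,n} \equiv \frac{1}{\sqrt{n}}\sum_{i=1}^n \{\bar{Q}_0(W_i) - \psi_0\}.$$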

In the general case, one can decompose

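$$Z_{W,n,g^n} = Z_{1,g^n} + Z_{W,n},$$

where

$$Z_{1,g^n} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{g_{0,i}^n(1 \mid W^n) - \bar{g}_n(1 \mid W_i)}{\bar{g}_n(1 \mid W_i)}(\bar{Q}_0 - \bar{Q}_n)(1, W_i) - \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{g_{0,i}^n(0 \mid W^n) - \bar{g}_n(0 \mid W_i)}{\bar{g}_n(0 \mid W_i)}(\bar{Q}_0 - \bar{Q}_n)(0, W_i).$$

Suppose now that $g_{0,i}^n = g_{0,i}$. Then $\bar{g}_n = \bar{g}_0$. Notice that for a function f, we have

$$
\begin{aligned}
\frac{1}{\sqrt{n}}\sum_{i=1}^n E_0\Big(\frac{g_{0,i}(1 \mid W_i)}{\bar{g}_0(1 \mid W_i)} - 1\Big) f(W_i) &= \frac{1}{\sqrt{n}}\sum_{i=1}^n \int_w \Big(\frac{g_{0,i}(1 \mid w)}{\bar{g}_0(1 \mid w)} - 1\Big) f(w)\,dQ_{W,0}(w) \\
&= \sqrt{n}\int \Big(\frac{\bar{g}_0}{\bar{g}_0}(1 \mid w) - 1\Big) f(w)\,dQ_{W,0}(w) = 0.
\end{aligned}
$$

This proves that $Z_{1,g^n}(\bar{Q})$, defined as the process above with $\bar{Q}_n$ replaced by $\bar{Q}$, is a standard empirical process $Z_{1,g^n}(\bar{Q}) = \frac{1}{\sqrt{n}}\sum_i f_i(\bar{Q})(W_i)$ of mean zero and independent random variables

$$f_i(\bar{Q})(W_i) = \frac{g_{0,i}(1 \mid W_i) - \bar{g}_0(1 \mid W_i)}{\bar{g}_0(1 \mid W_i)}(\bar{Q}_0 - \bar{Q})(1, W_i) - \frac{g_{0,i}(0 \mid W_i) - \bar{g}_0(0 \mid W_i)}{\bar{g}_0(0 \mid W_i)}(\bar{Q}_0 - \bar{Q})(0, W_i).$$

Such a process can be analyzed with methods we use below, showing that $Z_{1,g^n}(\bar{Q}_n) = Z_{1,g^n}(\bar{Q}) + o_P(1)$, and $Z_{1,g^n}(\bar{Q}) = \frac{1}{\sqrt{n}}\sum_i IC_{1,g^n,i}(W_i) + o_P(1)$, where $IC_{1,g^n,i} = f_i(Q^*)$. We conclude that

$$\sqrt{n}(\psi_n - \psi_0) = X_n(\bar{Q}_n) + Z_{W,n} + Z_{1,g^n} + Z_{W,\bar{g}_n,n},$$

where our assumptions guarantee that $Z_{W,n} + Z_{1,g^n} + Z_{W,\bar{g}_n,n} = \frac{1}{\sqrt{n}}\sum_i IC_{W,i}(W_i) + o_P(1)$. So we showed that $\sqrt{n}(\psi_n - \psi_0) = X_{W,n} + X_n(\bar{Q}_n)$, where $X_{W,n} = \frac{1}{\sqrt{n}}\sum_i IC_{W,i}(W_i) + o_P(1)$ for some influence curve $IC_{W,i}$. Thus $X_{W,n}$ is understood and converges to a normal distribution with mean zero and variance $\sigma_W^2 = \lim_{n \to \infty}\frac{1}{n}\sum_{i=1}^n P_0 IC_{W,i}^2$, if the variance of $IC_{W,i}$ is bounded uniformly in i.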

Below we establish that, conditional on $W^n$, $X_n(\bar{Q}_n)$ converges in distribution to a Gaussian random variable. The separate weak convergence of $X_{W,n}$ and $X_n(\bar{Q}_n)$ implies the desired weak convergence of $X_{W,n}$ and $X_n(\bar{Q}_n)$ jointly as follows. For notational convenience, let $X_n$ denote $X_n(\bar{Q}_n)$ and let $X$ denote its limit in distribution. Let $W = (W_n : n = 1, 2, \ldots)$. Note that $P(X_{W,n} \in A, X_n \in B) = E_W I(X_{W,n} \in A) P(X_n \in B \mid W)$. Since $P(X_n \in B \mid W)$ converges to $P(X \in B)$ for almost every $W$, we obtain

$$P(X \in B)\, E_W I(X_{W,n} \in A) \to P(X \in B)\, P(X_W \in A)$$

plus a term $E_W I(X_{W,n} \in A)(P(X_n \in B \mid W) - P(X \in B))$. The latter term converges to zero by the dominated convergence theorem. The joint convergence implies the weak convergence of the sum $X_{W,n} + X_n(\bar{Q}_n)$ to $X_W + X$.

So it remains to study $X_n(\bar{Q}_n)$. By application of a CLT for sums of independent random variables, under the stated conditions, one can show that, conditional on $W^n$, $(X_n(\bar{Q}_j) : j)$ for fixed $\bar{Q}_j \in \mathcal{F}$ converges to a multivariate normal distribution with covariance matrix defined by $(\bar{Q}_1, \bar{Q}_2) \to \Sigma_0(\bar{Q}_1, \bar{Q}_2)$. Weak convergence of $X_n(\bar{Q})$ for a fixed or finite collection of $\bar{Q}$'s is not enough for establishing the desired asymptotic linearity. In order to understand terms such as $X_n(\bar{Q}_n) - X_n(\bar{Q})$ (and that our proposed variance estimator is consistent) we need to understand the process $(X_n(\bar{Q}) : \bar{Q} \in \mathcal{F})$ with respect to supremum norm over a set $\mathcal{F}$ that contains $\bar{Q}_n$ with probability tending to 1. Again, we will study this process conditional on $(W_n : n \geq 1)$.

Let $d_n^2(\bar{Q}_1, \bar{Q}_2) = \frac{1}{n}\sum_j P_{Q_0, g^n}\{f_{j,n}(\bar{Q}_1) - f_{j,n}(\bar{Q}_2)\}^2$. We note that $X_n(\bar{Q}_1) - X_n(\bar{Q}_2) = X_n(\bar{Q}_1 - \bar{Q}_2)$ for a slightly different process $X_n$. Thus, $d_n^2(\bar{Q}_1, \bar{Q}_2) = \frac{1}{n}\sum_j P_{Q_0, g^n}\{f_{j,n}(\bar{Q}_1 - \bar{Q}_2)\}^2$ for a specified $f_{j,n}(\bar{Q}_1 - \bar{Q}_2) = \sum_{i \in C_j}\{f_{i,n}(\bar{Q}_1) - f_{i,n}(\bar{Q}_2)\}(O_i)$. Note that $d_n^2(\bar{Q}_1, \bar{Q}_2)$ is the conditional variance of $X_n(\bar{Q}_1) - X_n(\bar{Q}_2)$, conditional on $W^n$, or equivalently, it is the conditional variance of $X_n(\bar{Q}_1 - \bar{Q}_2)$. We will denote this conditional variance also with $\sigma_n^2(\bar{Q}_1 - \bar{Q}_2) = d_n^2(\bar{Q}_1, \bar{Q}_2)$.

Recall that $\mathcal{F}_d = \{f_1 - f_2 : f_1, f_2 \in \mathcal{F}\}$. Given the entropy condition on $\mathcal{F}_d$, we will prove asymptotic equicontinuity of $(X_n(\bar{Q}) : \bar{Q} \in \mathcal{F})$ with respect to this semi-metric $d_n$: for each $\varepsilon > 0$ and sequence $\delta_n \to 0$,

$$P\Big(\sup_{d_n(f,g) \leq \delta_n} |X_n(f) - X_n(g)| > \varepsilon\Big) \to 0 \quad \text{as } n \to \infty.$$

This is equivalent to establishing the following asymptotic equicontinuity of $(X_n(f) : f \in \mathcal{F}_d)$ with respect to the semi-metric $\sigma_n$: for each $\varepsilon > 0$ and sequence $\delta_n \to 0$,

$$P\Big(\sup_{\sigma_n(f) \leq \delta_n} |X_n(f)| > \varepsilon\Big) \to 0 \quad \text{as } n \to \infty.$$

If $d_n(\bar{Q}_n, \bar{Q}) \to 0$ in probability, and $\bar{Q}_n - \bar{Q} \in \mathcal{F}_d$ with probability tending to 1, then this asymptotic equicontinuity proves that $X_n(\bar{Q}_n) - X_n(\bar{Q}) = X_n(\bar{Q}_n - \bar{Q})$ converges to zero in probability, as $n \to \infty$.

To establish the asymptotic equicontinuity result, we use a number of fundamental building blocks. Note that $X_n(f)/\sigma_n(f)$ is a sum of J independent mean zero bounded random variables and the variance of this sum equals 1. Bernstein's inequality states that $P(|\sum_j Y_j| > x) \leq 2\exp\big(-\frac{1}{2}\frac{x^2}{v + Mx/3}\big)$, where $v \geq \mathrm{VAR} \sum_j Y_j$ and M bounds the $|Y_j|$. Thus, by Bernstein's inequality, conditional on $W^n$, we have

$$P\Big(\frac{|X_n(f)|}{\sigma_n(f)} > x\Big) \leq 2\exp\Big(-\frac{1}{2}\frac{x^2}{1 + Mx/3}\Big) \leq K\exp(-Cx^2),$$

for a universal K and C.
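Bernstein's inequality can be checked numerically in a simple case. The sketch below is an illustration chosen for this purpose and is not part of the proof: it takes J independent random signs scaled by $1/\sqrt{J}$, so that the sum has variance 1 and each summand is bounded by $M = 1/\sqrt{J}$, computes the exact two-sided tail from the binomial distribution, and compares it with the bound. The function names are hypothetical.

```python
import math

def exact_tail(J, x):
    """Exact P(|S| > x) for S = sum of J independent signs, each +-1/sqrt(J)."""
    threshold = x * math.sqrt(J)
    # S = (2B - J)/sqrt(J) with B ~ Binomial(J, 1/2).
    hits = sum(math.comb(J, b) for b in range(J + 1) if abs(2 * b - J) > threshold)
    return hits / 2 ** J

def bernstein_bound(x, v, M):
    """Bernstein bound 2 exp(-x^2 / (2 (v + M x / 3))), v >= Var(S), |Y_j| <= M."""
    return 2.0 * math.exp(-0.5 * x * x / (v + M * x / 3.0))

J = 100
for x in (0.5, 1.0, 2.0, 3.0):
    print(x, exact_tail(J, x), bernstein_bound(x, v=1.0, M=1.0 / math.sqrt(J)))
```

In each row the exact tail probability lies below the bound, and the gap narrows on a logarithmic scale as M shrinks, which is the sub-Gaussian regime used in the argument above.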

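As stated in our review section, this implies $\|X_n(f)/\sigma_n(f)\|_{\psi_2} \leq ((1 + K)/C)^{0.5}$, where for a given convex function $\psi$ with $\psi(0) = 0$, $\|X\|_\psi \equiv \inf\{C > 0 : E\psi(|X|/C) \leq 1\}$ is the so-called Orlicz norm, and $\psi_2(x) = \exp(x^2) - 1$. Thus $\|X_n(f)\|_{\psi_2} \leq C_1\sigma_n(f)$ for $f \in \mathcal{F}_d$. This result allows us to apply Theorem 2.2.4 in van der Vaart and Wellner (1996) (this theorem is copied below in the appendix): for each $\delta > 0$ and $\eta > 0$, we now have

$$\Big\|\sup_{\sigma_n(f) \leq \delta} |X_n(f)|\Big\|_{\psi_2} \leq K\Big\{\int_0^\eta \psi_2^{-1}(N(\varepsilon, \sigma_n, \mathcal{F}_d))\,d\varepsilon + \delta\,\psi_2^{-1}(N^2(\eta, \sigma_n, \mathcal{F}_d))\Big\}. \qquad (B.1)$$

Convergence to zero with respect to the $\psi_2$-Orlicz norm implies convergence in expectation to zero and thereby convergence to zero in probability. Let $\delta_n$ be a sequence converging to zero, and let $\eta_n$ also converge to zero but slowly enough so that the term $\delta_n\psi_2^{-1}(N^2(\eta_n, \sigma_n, \mathcal{F}_d))$ converges to zero as $n \to \infty$. By assumption, $\int_0^{\delta_n}\psi_2^{-1}(N(\varepsilon, \sigma_n, \mathcal{F}_d))\,d\varepsilon$ converges to zero. Thus,

$$\lim_{\delta_n \to 0}\Big\{\int_0^{\delta_n}\psi_2^{-1}(N(\varepsilon, \sigma_n, \mathcal{F}_d))\,d\varepsilon + \delta_n\psi_2^{-1}(N^2(\eta_n, \sigma_n, \mathcal{F}_d))\Big\} = 0.$$

This proves that

$$E\Big(\sup_{\sigma_n(f) \leq \delta_n} |X_n(f)|\Big) \to 0,$$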

and thereby the asymptotic equicontinuity of Xn.

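We now prove the convergence to the limit variance: If $\sigma_n(\bar{Q}_n - \bar{Q}) \to 0$ in probability, then

$$\frac{1}{n}\sum_{j=1}^J \{f_{j,n}(\bar{Q}_n)(\bar{O}_j)\}^2 - \frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n}(f_{j,n}(\bar{Q}))^2 \to 0 \quad \text{in probability}.$$

We can write this difference as a sum of the following two differences:

$$\frac{1}{n}\sum_{j=1}^J \{f_{j,n}(\bar{Q}_n)(\bar{O}_j)\}^2 - \frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n} f_{j,n}(\bar{Q}_n)^2$$

and

$$
\begin{aligned}
\frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n} f_{j,n}(\bar{Q}_n)^2 - \frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n} f_{j,n}(\bar{Q})^2 &= \frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n}\{f_{j,n}(\bar{Q}_n)^2 - f_{j,n}(\bar{Q})^2\} \\
&= \frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n}\{f_{j,n}(\bar{Q}_n - \bar{Q})\}\{f_{j,n}(\bar{Q}_n) + f_{j,n}(\bar{Q})\} \\
&\leq \Big(\frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n}\{f_{j,n}(\bar{Q}_n - \bar{Q})\}^2\Big)^{0.5}\Big(\frac{1}{n}\sum_{j=1}^J P_{Q_0, g^n}\{f_{j,n}(\bar{Q}_n) + f_{j,n}(\bar{Q})\}^2\Big)^{0.5},
\end{aligned}
$$

where we used the Cauchy-Schwarz inequality at the last inequality. The last term can thus be bounded by $M d_n(\bar{Q}_n, \bar{Q})$, so that it converges to zero in probability, since $d_n(\bar{Q}_n, \bar{Q})$ converges to zero in probability.

We now consider the first term, which can be represented as

$$\frac{1}{n}\sum_{j=1}^J h_{j,n}(\bar{Q}_n),$$

where

$$h_{j,n}(\bar{Q}) \equiv f_{j,n}^2(\bar{Q})(\bar{O}_j) - P_{Q_0, g^n} f_{j,n}(\bar{Q})^2.$$

Define the process $Y_n(\bar{Q}) = \frac{1}{n}\sum_j h_{j,n}(\bar{Q})$. Note that $h_{j,n}(\bar{Q})$ has conditional mean zero given $W^n$. Thus, conditional on $W^n$, $Y_n(\bar{Q})$ is a sum of independent mean zero random variables. The process $\sqrt{n}\,Y_n(\bar{Q})$ has exactly the same structure as the process $X_n(\bar{Q})$ we analyzed above. Therefore, under our conditions, we have $\sup_{\bar{Q} \in \mathcal{F}} |Y_n(\bar{Q})| = O_P(1/\sqrt{n})$. This implies, in particular, that the first term converges to zero in probability. This proves the convergence to the desired limit.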

C Appendix: Proof of Theorem 4

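We can decompose $D^*(\bar{Q}^*, g_0, \psi_0)$ orthogonally into a function of W and a function of Y, A, W, which has conditional mean zero, given W, as follows:

$$D^*(\bar{Q}^*, g_0, \psi_0) = \bar{Q}_0(W) - \psi_0 + H_{g_0}(A,W)(Y - \bar{Q}^*(A,W)) - \{\bar{Q}_0(W) - \bar{Q}^*(W)\}.$$

Thus, the variance is given by:

$$P_0\{D^*(\bar{Q}^*, g_0, \psi_0)\}^2 = E_0\{\bar{Q}_0(W) - \psi_0\}^2 + E_0 E_0\big(H_{g_0}^2(A,W)(Y - \bar{Q}^*(A,W))^2 \mid W\big) - E_0\{\bar{Q}_0(W) - \bar{Q}^*(W)\}^2.$$

Note that

$$
\begin{aligned}
E_0 E_0\big(H_{g_0}^2(A,W)(Y - \bar{Q}^*(A,W))^2 \mid W\big) &= E_0 E_0\Big(\frac{1}{g_0^2(A \mid W)} E_0\big((Y - \bar{Q}^*(A,W))^2 \mid A, W\big) \,\Big|\, W\Big) \\
&= E_0\sum_a \frac{1}{g_0(a \mid W)}\sigma_0^2(\bar{Q}^*)(a, W),
\end{aligned}
$$

where $\sigma_0^2(\bar{Q}^*)(a, W) \equiv E_0((Y - \bar{Q}^*(A,W))^2 \mid A = a, W)$. Thus, we have obtained the following expression:

$$\sigma_I^2(\bar{Q}^*) = E_0\{\bar{Q}_0(W) - \psi_0\}^2 + E_0\sum_a \frac{1}{g_0(a \mid W)}\sigma_0^2(\bar{Q}^*)(a, W) - E_0\{\bar{Q}_0(W) - \bar{Q}^*(W)\}^2.$$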

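For the pair matched design the asymptotic variance $\sigma^2$ of the TMLE is given by the limit of

$$E_0\{\bar{Q}_0(W) - \psi_0\}^2 + E_0\frac{1}{n}\sum_{j=1}^{n/2} P_{Q_0, g^n}\Big\{\sum_{i \in C_j(W^n)} H_{g_0}(A_i, W_i)(Y_i - \bar{Q}^*(A_i, W_i))\Big\}^2 - E_0\frac{1}{n}\sum_{j=1}^{n/2}\Big\{\sum_{i \in C_j(W^n)}\{\bar{Q}_0(W_i) - \bar{Q}^*(W_i)\}\Big\}^2.$$

Each $\sum_{i \in C_j(W^n)}$ is a sum over two terms. We use that $(a + b)^2 = a^2 + b^2 + 2ab$. The contribution $a^2 + b^2$ from the square terms yields:

$$E_0\frac{1}{n}\sum_{i=1}^n \Big\{P_{Q_0, g^n}\{H_{g_0}(A_i, W_i)(Y_i - \bar{Q}^*(A_i, W_i))\}^2 - \{\bar{Q}_0(W_i) - \bar{Q}^*(W_i)\}^2\Big\} = E_0\sum_a \frac{1}{g_0(a \mid W)}\sigma_0^2(\bar{Q}^*)(a, W) - E_0\{\bar{Q}_0(W) - \bar{Q}^*(W)\}^2.$$

This equals the corresponding expression we have for $\sigma_I^2(\bar{Q}^*)$. The contribution $2ab$ from the cross-terms yields:

$$
\begin{aligned}
&2E_0\frac{1}{n}\sum_{j=1}^{n/2} P_{Q_0, g^n} H_{g_0,1j}(Y_{1j} - \bar{Q}^*(A_{1j}, W_{1j})) H_{g_0,2j}(Y_{2j} - \bar{Q}^*(A_{2j}, W_{2j})) - 2E_0\frac{1}{n}\sum_{j=1}^{n/2}\{\bar{Q}_0(W_{1j}) - \bar{Q}^*(W_{1j})\}\{\bar{Q}_0(W_{2j}) - \bar{Q}^*(W_{2j})\} \\
&\quad = -4E_0\frac{1}{n}\sum_{j=1}^{n/2}\{(Y_{1j}(1) - \bar{Q}^*(1, W_{1j}))(Y_{2j}(0) - \bar{Q}^*(0, W_{2j}))\} - 4E_0\frac{1}{n}\sum_{j=1}^{n/2}\{(Y_{1j}(0) - \bar{Q}^*(0, W_{1j}))(Y_{2j}(1) - \bar{Q}^*(1, W_{2j}))\} \\
&\qquad - 2E_0\frac{1}{n}\sum_{j=1}^{n/2}\{\bar{Q}_0(W_{1j}) - \bar{Q}^*(W_{1j})\}\{\bar{Q}_0(W_{2j}) - \bar{Q}^*(W_{2j})\}.
\end{aligned}
$$

To conclude, the asymptotic variance under the pair matched design is given by:

$$
\begin{aligned}
\sigma^2(\bar{Q}^*) &= E_0\{\bar{Q}_0(W) - \psi_0\}^2 + E_0\sum_a \frac{1}{g_0(a \mid W)}\sigma_0^2(\bar{Q}^*)(a, W) - E_0\{\bar{Q}_0(W) - \bar{Q}^*(W)\}^2 \\
&\quad - 4E_0\frac{1}{n}\sum_{j=1}^{n/2}\{(Y_{1j}(1) - \bar{Q}^*(1, W_{1j}))(Y_{2j}(0) - \bar{Q}^*(0, W_{2j}))\} - 4E_0\frac{1}{n}\sum_{j=1}^{n/2}\{(Y_{1j}(0) - \bar{Q}^*(0, W_{1j}))(Y_{2j}(1) - \bar{Q}^*(1, W_{2j}))\} \\
&\quad - 2E_0\frac{1}{n}\sum_{j=1}^{n/2}\{\bar{Q}_0(W_{1j}) - \bar{Q}^*(W_{1j})\}\{\bar{Q}_0(W_{2j}) - \bar{Q}^*(W_{2j})\}.
\end{aligned}
$$

Thus, the difference between the two asymptotic variances is given by:

$$
\begin{aligned}
\sigma_I^2 - \sigma^2 &= 4E_0\frac{1}{n}\sum_{j=1}^{n/2}\{(Y_{1j}(1) - \bar{Q}^*(1, W_{1j}))(Y_{2j}(0) - \bar{Q}^*(0, W_{2j}))\} + 4E_0\frac{1}{n}\sum_{j=1}^{n/2}\{(Y_{1j}(0) - \bar{Q}^*(0, W_{1j}))(Y_{2j}(1) - \bar{Q}^*(1, W_{2j}))\} \\
&\quad + 2E_0\frac{1}{n}\sum_{j=1}^{n/2}\{\bar{Q}_0(W_{1j}) - \bar{Q}^*(W_{1j})\}\{\bar{Q}_0(W_{2j}) - \bar{Q}^*(W_{2j})\} \\
&= 2E_0\frac{1}{J}\sum_{j=1}^{n/2}\{(\bar{Q}_0(1, W_{1j}) - \bar{Q}^*(1, W_{1j}))(\bar{Q}_0(0, W_{2j}) - \bar{Q}^*(0, W_{2j}))\} + 2E_0\frac{1}{J}\sum_{j=1}^{n/2}\{(\bar{Q}_0(0, W_{1j}) - \bar{Q}^*(0, W_{1j}))(\bar{Q}_0(1, W_{2j}) - \bar{Q}^*(1, W_{2j}))\} \\
&\quad + E_0\frac{1}{J}\sum_{j=1}^{n/2}\{\bar{Q}_0(W_{1j}) - \bar{Q}^*(W_{1j})\}\{\bar{Q}_0(W_{2j}) - \bar{Q}^*(W_{2j})\} \\
&= E_0\frac{1}{J}\sum_{j=1}^{n/2}(\bar{Q}_0 - \bar{Q}^*)(1, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(0, W_{2j}) + E_0\frac{1}{J}\sum_{j=1}^{n/2}(\bar{Q}_0 - \bar{Q}^*)(0, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(1, W_{2j}) \\
&\quad + E_0\frac{1}{J}\sum_{j=1}^{n/2}(\bar{Q}_0 - \bar{Q}^*)(1, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(1, W_{2j}) + E_0\frac{1}{J}\sum_{j=1}^{n/2}(\bar{Q}_0 - \bar{Q}^*)(0, W_{1j})(\bar{Q}_0 - \bar{Q}^*)(0, W_{2j}) \equiv C,
\end{aligned}
$$

and $\sigma^2 = \sigma_I^2 - C$.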

D Appendix: Proof of Theorem 5

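The proof is analogous to the proof of Theorem 1. Therefore, it suffices to prove (7.3). Firstly, we note that

$$
\begin{aligned}
E_0 D^{n,*}(\bar{Q}, \bar{g}, \bar{\Pi}, \psi_0)(O^n) &= E_0\frac{1}{n}\sum_{i=1}^n \{\bar{Q}(W_i) - \psi_0\} \\
&\quad + E_0\frac{1}{n}\sum_{i=1}^n \left\{\frac{g_{i,0}(1, 1 \mid W_i)}{\bar{g}_n(1, 1 \mid W_i)}(\bar{Q}_0 - \bar{Q})(1, W_i) - \frac{g_{i,0}(0, 1 \mid W_i)}{\bar{g}_n(0, 1 \mid W_i)}(\bar{Q}_0 - \bar{Q})(0, W_i)\right\} \\
&= \Psi(Q) - \psi_0 + \frac{1}{n}\sum_i \int_w Q_{W,0}(w)\frac{g_{i,0}(1, 1 \mid w)}{\bar{g}_n(1, 1 \mid w)}(\bar{Q}_0 - \bar{Q})(1, w) \\
&\quad - \frac{1}{n}\sum_i \int_w Q_{W,0}(w)\frac{g_{i,0}(0, 1 \mid w)}{\bar{g}_n(0, 1 \mid w)}(\bar{Q}_0 - \bar{Q})(0, w) \\
&= \Psi(Q) - \psi_0 + E_0\frac{\bar{g}_{n,0}(1, 1 \mid W)}{\bar{g}_n(1, 1 \mid W)}(\bar{Q}_0 - \bar{Q})(1, W) - E_0\frac{\bar{g}_{n,0}(0, 1 \mid W)}{\bar{g}_n(0, 1 \mid W)}(\bar{Q}_0 - \bar{Q})(0, W).
\end{aligned}
$$

Thus, if $\bar{g}_n = \bar{g}_{n,0}$, then this equals $\Psi(Q) - \psi_0 + \psi_0 - \Psi(Q) = 0$. If $Q_0 = Q$, then we also obtain 0. This proves (7.3).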

E Appendix: Review of relevant empirical process/weak convergence theory

We refer to van der Vaart and Wellner (1996), Section 2.2, on maximal inequalities and covering numbers. For a real valued random variable X and convex function $\psi$ with $\psi(0) = 0$, the Orlicz norm is defined as $\|X\|_\psi \equiv \inf\{C > 0 : E\psi(|X|/C) \leq 1\}$. Setting $\psi(x) = x^p$ gives the $L_p$-norms $\|X\|_p = (E|X|^p)^{1/p}$, $p \geq 1$. Another important choice for empirical processes is $\psi_p(x) = \exp(x^p) - 1$. Sums of independent bounded random variables and Gaussian random variables have bounded $\psi_2$-norm. There is an important relation between the Orlicz norm and a bound on the tail probability of the random variable. In particular, we have (page 96 in van der Vaart and Wellner (1996))

$$P(|X| > x) \leq \frac{1}{\psi(x/\|X\|_\psi)}.$$

For $\psi_p(x)$ this leads to tail estimates $\exp(-Cx^p)$ for any random variable with a finite $\psi_p$ norm. Conversely, an exponential tail bound of this type shows that $\|X\|_{\psi_p}$ is finite: Lemma 2.2.1 states that if $P(|X| > x) \leq K\exp(-Cx^p)$ for every x, for constants K and C, and for $p \geq 1$, then its Orlicz norm satisfies $\|X\|_{\psi_p} \leq ((1 + K)/C)^{1/p}$. So if we have an exponential tail probability for $X_n(f)$, then we can translate this into a bound on the $\psi_p$-Orlicz norm.
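A worked case may help fix the definitions. For a standard normal X, the expectation $E\psi_2(|X|/c)$ has the closed form $(1 - 2/c^2)^{-1/2} - 1$ for $c^2 > 2$, so the $\psi_2$-Orlicz norm can be found by solving for the smallest c at which this expectation is at most 1. The sketch below does this by bisection and compares the result with the bound of Lemma 2.2.1, using the standard Gaussian tail bound $P(|X| > x) \leq 2\exp(-x^2/2)$, that is K = 2 and C = 1/2. The standard normal example and the function names are illustrative choices, not taken from the paper.

```python
import math

def psi2_moment(c):
    """E psi_2(|X|/c) for X ~ N(0, 1), psi_2(x) = exp(x^2) - 1; needs c^2 > 2."""
    return 1.0 / math.sqrt(1.0 - 2.0 / (c * c)) - 1.0

def orlicz_psi2_norm_gaussian(tol=1e-12):
    """Smallest c with E psi_2(|X|/c) <= 1, located by bisection."""
    lo, hi = math.sqrt(2.0) + 1e-9, 10.0  # psi2_moment is decreasing in c
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi2_moment(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

norm = orlicz_psi2_norm_gaussian()
lemma_bound = math.sqrt((1.0 + 2.0) / 0.5)  # ((1 + K)/C)^(1/2) with K = 2, C = 1/2
print(norm, lemma_bound)
```

Solving $(1 - 2/c^2)^{-1/2} = 2$ by hand gives $c = \sqrt{8/3}$, which the bisection reproduces and which lies below the Lemma 2.2.1 bound $\sqrt{6}$.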

Given a sequence of random variables Xi, we have (page 96)

$$\Big\|\max_{i \leq m} X_i\Big\|_\psi \leq K\psi^{-1}(m)\max_i \|X_i\|_\psi.$$

Thus, if we can bound the Orlicz norm of $X_n(f)$ in terms of a norm on f, then this result allows us to bound the Orlicz norm of a maximum over m functions. This bound combined with chaining gives the typical entropy type bounds. As we will see, one of the main things we will need is a bound on $\|X_n(f)\|_\psi$ in terms of $d(f, f)$ for a semi-metric d on $\mathcal{F}$.

Bounding Orlicz norm

Let (T, d) be an arbitrary semi-metric space. The covering number N(ε, d) is the minimal number of balls of radius ε needed to cover T. Call a collection of points ε-separated if the distance between each pair of points is strictly larger than ε. The packing number D(ε, d) is the maximum number of ε-separated points in T. Entropy numbers are the logarithms of the covering or packing number. Since N(ε, d) ≤ D(ε, d) ≤ N(0.5ε, d), bounds in packing number map into a bound in covering number and vice versa.
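The relation between covering and packing numbers can be checked in the simplest case, T = [0, 1] with the usual distance. The closed forms used below are elementary facts about the unit interval stated here as an illustration (they assume closed balls and are not taken from the paper): the covering number is $\lceil 1/(2\varepsilon) \rceil$ and the packing number, with strictly ε-separated points, is $\lceil 1/\varepsilon \rceil$. Exact rational arithmetic avoids rounding problems at the boundaries.

```python
import math
from fractions import Fraction

def covering_number_unit_interval(eps):
    """Minimal number of closed balls of radius eps needed to cover [0, 1]."""
    return math.ceil(Fraction(1) / (2 * eps))

def packing_number_unit_interval(eps):
    """Maximal number of points in [0, 1] with pairwise distance > eps."""
    return math.ceil(Fraction(1) / eps)

for eps in (Fraction(1, 2), Fraction(1, 3), Fraction(2, 5), Fraction(1, 10)):
    N = covering_number_unit_interval(eps)
    D = packing_number_unit_interval(eps)
    N_half = covering_number_unit_interval(eps / 2)
    print(eps, N, D, N_half)  # N <= D <= N_half in every row
```

On the unit interval the right-hand inequality holds with equality, since both D(ε, d) and N(0.5ε, d) equal $\lceil 1/\varepsilon \rceil$.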

For our purpose, we will need Theorem 2.2.4 in van der Vaart and Wellner (1996), which is stated here for completeness.

Theorem 6

(Theorem 2.2.4, van der Vaart and Wellner, 1996) Let $\psi$ be a convex non-decreasing non-zero function with $\psi(0) = 0$ and $\limsup_{x,y \to \infty}\psi(x)\psi(y)/\psi(cxy) < \infty$ for some constant c. Let $(X_t : t \in T)$ be a separable stochastic process (that is, $\sup_{d(s,t) < \delta}|X_s - X_t|$ remains almost surely the same if the index set T is replaced by a suitable countable subset) with

$$\|X_s - X_t\|_\psi \leq C d(s,t) \quad \text{for every } s, t,$$

for some semimetric d on T and a constant C. Then, for any $\eta, \delta > 0$,

$$\Big\|\sup_{d(s,t) \leq \delta}|X_s - X_t|\Big\|_\psi \leq K\Big\{\int_0^\eta \psi^{-1}(D(\varepsilon, d))\,d\varepsilon + \delta\,\psi^{-1}(D^2(\eta, d))\Big\}$$

for a constant K depending on $\psi$ and C only. In particular, the constant K can be chosen so that

$$\Big\|\sup_{s,t}|X_s - X_t|\Big\|_\psi \leq K\int_0^{\mathrm{diam}\,T}\psi^{-1}(D(\varepsilon, d))\,d\varepsilon,$$

where diam(T) is the diameter of T. This result also gives

$$\Big\|\sup_t |X_t|\Big\|_\psi \leq \|X_{t_0}\|_\psi + \int_0^{\mathrm{diam}(T)}\psi^{-1}(D(\varepsilon, d))\,d\varepsilon.$$

The bound shows that the sample paths of X are uniformly continuous in $\psi$-norm, whenever the covering integral $\int_0^\eta \psi^{-1}(D(\varepsilon, d))\,d\varepsilon$ is finite for some $\eta > 0$. In order to have that this integral is bounded for classes T with covering numbers that behave as $\varepsilon^p$, one will need to use an Orlicz norm with $\psi(x) = x^p$, and if one wants the integral to be bounded for any p, then one needs $\psi(x) = \exp(x^q) - 1$ for some q.

If one can prove that $\|X_n(s) - X_n(t)\|_\psi \leq C d(s,t)$ for a constant C independent of n, and each $X_n$ is a separable stochastic process, then this theorem teaches us that for any sequence $\delta_n$ and any $\eta > 0$, there exists a constant K depending on $\psi$ and C only (not dependent on n) so that

$$\Big\|\sup_{d(s,t) \leq \delta_n}|X_n(s) - X_n(t)|\Big\|_\psi \leq K\Big\{\int_0^\eta \psi^{-1}(D(\varepsilon, d))\,d\varepsilon + \delta_n\psi^{-1}(D^2(\eta, d))\Big\}.$$

We can now apply this inequality for a sequence $\delta_n \to 0$ as $n \to \infty$. Since $\eta$ can be chosen arbitrarily small, it follows that, if $\int_0^\eta \psi^{-1}(D(\varepsilon, d))\,d\varepsilon < \infty$ for some $\eta > 0$, then

$$\Big\|\sup_{d(s,t) \leq \delta_n}|X_n(s) - X_n(t)|\Big\|_\psi \to 0 \quad \text{as } n \to \infty.$$

So we can state the following useful corollary:

Corollary E.1

Suppose there exists an $\eta > 0$ so that $\int_0^\eta \psi^{-1}(D(\varepsilon, d))\,d\varepsilon < \infty$. In addition, assume

$$\|X_n(s) - X_n(t)\|_\psi \leq C d(s,t)$$

for a constant C independent of n, and each $X_n$ is a separable stochastic process with respect to d. Then for any sequence $\delta_n \to 0$, we have

$$\Big\|\sup_{d(s,t) \leq \delta_n}|X_n(s) - X_n(t)|\Big\|_\psi \to 0 \quad \text{as } n \to \infty.$$

This corollary provides us with conditions under which $X_n$ is asymptotically uniformly d-equicontinuous in probability. Theorem 1.5.7 in van der Vaart and Wellner (1996) now states that $X_n$ is asymptotically tight in $\ell^\infty(T)$ if $X_n(t)$ is asymptotically tight for every t, (T, d) is totally bounded, and $X_n$ is asymptotically d-equicontinuous in probability. In addition, Theorem 1.5.4 states that if $X_n$ is asymptotically tight and its marginals converge weakly to the marginals $X(t_1), \ldots, X(t_k)$ of a stochastic process X, then there is a version of X with uniformly bounded sample paths and $X_n$ converges weakly to X. Thus, we can state the following result:

Lemma E.1

Let $\psi$ be one of the following functions: $\psi(x) = x^p$ for some p, or $\psi(x) = \exp(x) - 1$, $\psi(x) = \exp(x^2) - 1$. Let d be a semi-metric on T so that (T, d) is totally bounded, and there exists an $\eta > 0$ so that $\int_0^\eta \psi^{-1}(D(\varepsilon, d))\,d\varepsilon < \infty$. In addition, assume

$$\|X_n(s) - X_n(t)\|_\psi \leq C d(s,t)$$

for a constant C independent of n, and each $X_n$ is a separable stochastic process with respect to d. Then for any sequence $\delta_n \to 0$, we have for each $x > 0$

$$\Pr\Big(\sup_{d(s,t) \leq \delta_n}|X_n(s) - X_n(t)| > x\Big) \to 0 \quad \text{as } n \to \infty, \qquad (E.1)$$

and $X_n$ is asymptotically tight.

If $(X_n(t_1), \ldots, X_n(t_k))$ converges weakly to $(X(t_1), \ldots, X(t_k))$, then there exists a version X with uniformly bounded sample paths and $X_n \Rightarrow_d X$.

If X is a Gaussian process in $\ell^\infty(T)$, and $d(s,t) = \rho_p(s,t) \equiv \|X(s) - X(t)\|_p$, then there exists a version of X which is a tight Borel measurable map into $\ell^\infty(T)$.

Actually (page 41), if X is Gaussian, then $X_n$ converges weakly to X in $\ell^\infty(T)$ if and only if for some p (and then for all p) (i) the marginals of $X_n$ converge to the corresponding marginals of X, (ii) $X_n$ is asymptotically equicontinuous in probability with respect to

$$d(s,t) = \rho_p(s,t) \equiv \|X(s) - X(t)\|_p,$$

as defined in (E.1), and (iii) T is totally bounded for $d = \rho_p$.

Contributor Information

Mark J. van der Laan, Email: laan@berkeley.edu, Division of Biostatistics, University of California, Berkeley

Laura B. Balzer, Email: lbbalzer@gmail.com, Division of Biostatistics, University of California, Berkeley

Maya L. Petersen, Email: mayaliv@berkeley.edu, Division of Biostatistics, University of California, Berkeley

References

  1. Andersen J, Faries D, Tamura R. A randomized play-the-winner design for multiarm clinical trials. Communication in Statistical Theory. 1994;23:309–323. [Google Scholar]
  2. Bai Z, Hu F, Rosenberger W. Asymptotic properties of adaptive designs for clinical trials with delayed response. Annals of Statistics. 2002;30(1):122–139. [Google Scholar]
  3. Balzer L, Petersen M, van der Laan M. Technical Report. Vol. 294. Division of Biostatistics, University of California; Berkeley: 2012. Why match in individually and cluster randomized trials. [Google Scholar]
  4. Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag; 1997. [Google Scholar]
  5. Campbell M, Donner A, Klar N. Developments in cluster randomized trials and statistics in medicine. Statistics in Medicine. 2007;26:2–19. doi: 10.1002/sim.2731. [DOI] [PubMed] [Google Scholar]
  6. Chambaz A, van der Laan M. Technical Report. Vol. 258. Division of Biostatistics, University of California; Berkeley: 2010. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate. [Google Scholar]
  7. Cheng Y, Shen Y. Bayesian adaptive designs for clinical trials. Biometrika. 2005;92(3):633–646. [Google Scholar]
  8. Donner A, Klar N. Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold; 2000. [Google Scholar]
  9. Flournoy N, Rosenberger W. Adaptive Designs. Hayward, Institute of Mathematical Statistics; 1995. [Google Scholar]
  10. Freedman L, Gail M, Green S, Corle D. The efficiency of the matched-pairs design of the community intervention trial for smoking cessation (commit) Controlled clinical trials. 1997;18(2):131–139. doi: 10.1016/S0197-2456(96)00115-8. [DOI] [PubMed] [Google Scholar]
  11. Freedman L, Green S, Byar D. Assessing the gain in efficiency due to matching in a community intervention study. Statistics in Medicine. 1990;9:943–952. doi: 10.1002/sim.4780090810. [DOI] [PubMed] [Google Scholar]
  12. Gail M, Byar D, Pechacek T, Corle D. Aspects of statistical design for the community intervention trial for smoking cessation. Controlled clinical trials. 1992;13:6–21. doi: 10.1016/0197-2456(92)90026-v. [DOI] [PubMed] [Google Scholar]
  13. Gill R, van der Laan M, Robins J. In: Lin D, Fleming T, editors. Coarsening at random: characterizations, conjectures and counter-examples; Proceedings of the First Seattle Symposium in Biostatistics; New York. Springer Verlag; 1997. pp. 255–94. [Google Scholar]
  14. Gill R, van der Laan M, Wellner J. Inefficient estimators of the bivariate survival function for three models. Annales de l’Institut Henri Poincaré. 1995;31:545–597. [Google Scholar]
  15. COMMIT Research Group. Summary of design and intervention. Journal of National Cancer Institute. 1991;83(22):1620–1628. doi: 10.1093/jnci/83.22.1620. [DOI] [PubMed] [Google Scholar]
  16. Hayes R, Moulton L. Cluster Randomized Trials. Boca Raton: Chapman & Hall/CRC; 2009. [Google Scholar]
  17. Heitjan D, Rubin D. Ignorability and coarse data. Annals of statistics. 1991 Dec;19(4):2244–2253. [Google Scholar]
  18. Hu F, Rosenberger W. Analysis of time trends in adaptive designs withi appliation to a neurophysiology experiment. Statistics in Medicine. 2000;19:2067–2075. doi: 10.1002/1097-0258(20000815)19:15<2067::aid-sim508>3.0.co;2-l. [DOI] [PubMed] [Google Scholar]
  19. Imai K. Variance identification and efficiency analysis in randomized experiments under the matched-pair design. Statistics in Medicine. 2008;27(24):4857–4873. doi: 10.1002/sim.3337.
  20. Imai K, King G, Nall C. The essential role of pair matching in cluster randomized experiments with application to the Mexican universal health insurance evaluation. Statistical Science. 2009;24(1):29–53.
  21. Jacobsen M, Keiding N. Coarsening at random in general sample spaces and random censoring in continuous time. Annals of Statistics. 1995;23:774–86.
  22. Klar N, Donner A. The merits of matching in community intervention trials: a cautionary tale. Statistics in Medicine. 1997;16(15):1753–1764. doi: 10.1002/(sici)1097-0258(19970815)16:15<1753::aid-sim597>3.0.co;2-e.
  23. Koethe J, Westfall A, Luhanga D, Clark G, Goldman J, Mulenga P, Cantrell R, Chi B, Zulu I, Saag M, Stringer JS. A cluster randomized trial of routine HIV-1 viral load monitoring in Zambia. PLoS One. 2010;5(3):e9680. doi: 10.1371/journal.pone.0009680.
  24. Murray D. Design and Analysis of Community Randomized Trials. Oxford: Oxford University Press; 1998.
  25. Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82:669–710.
  26. Pearl J. Causality: Models, Reasoning, and Inference. 2nd ed. New York: Cambridge; 2009.
  27. Polley E, van der Laan M. Super learner in prediction. Technical Report 266. Division of Biostatistics, University of California, Berkeley; 2010.
  28. Raudenbush S, Martinez A, Spybrook J. Developments in cluster randomized trials and statistics in medicine. Educational Evaluation and Policy Analysis. 2007;29:5–29.
  29. Rose S, van der Laan M. Why match? Investigating matched case-control study designs with causal effect estimation. The International Journal of Biostatistics. 2009. doi: 10.2202/1557-4679.1127. http://www.bepress.com/ijb/vol5/iss1/1/
  30. Rosenberger W. New directions in adaptive designs. Statistical Science. 1996;11:137–149.
  31. Rosenberger W, Flournoy N, Durham S. Asymptotic normality of maximum likelihood estimators from multiparameter response driven designs. Journal of Statistical Planning and Inference. 1997:69–76.
  32. Rosenberger W, Grill S. A sequential design for psychophysical experiments: An application to estimating timing of sensory events. Statistics in Medicine. 1997;16:2245–2260. doi: 10.1002/(sici)1097-0258(19971015)16:19<2245::aid-sim656>3.0.co;2-s.
  33. Rosenberger W, Sriram T. Estimation for an adaptive allocation design. Journal of Statistical Planning and Inference. 1997;59:309–319.
  34. Rosenblum M, van der Laan M. Targeted maximum likelihood estimation of the parameter of a marginal structural model. The International Journal of Biostatistics. 2010;6(2):Article 19. doi: 10.2202/1557-4679.1238.
  35. Tamura R, Faries D, Andersen J, Heiligenstein J. A case study of an adaptive clinical trial in the treatment of out-patients with depressive disorder. Journal of the American Statistical Association. 1994;89:768–776.
  36. Toftager M, Christiansen L, Kristensen P, Troelsen J. Space for physical activity: a multicomponent intervention study: study design and baseline findings from a cluster randomized controlled trial. BMC Public Health. 2011;11:777. doi: 10.1186/1471-2458-11-777.
  37. van der Laan M. Efficient and Inefficient Estimation in Semiparametric Models. CWI Tract 114. Amsterdam: Centre of Computer Science and Mathematics; 1996.
  38. van der Laan M. The construction and analysis of adaptive group sequential designs. Technical Report 232. Division of Biostatistics, University of California, Berkeley; 2008. http://www.bepress.com/ucbbiostat/paper234.
  39. van der Laan M, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical Report 130. Division of Biostatistics, University of California, Berkeley; 2003.
  40. van der Laan M, Dudoit S, van der Vaart A. The cross-validated adaptive epsilon-net estimator. Statistics and Decisions. 2006;24(3):373–395.
  41. van der Laan M, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(25):1–21. doi: 10.2202/1544-6115.1309.
  42. van der Laan M, Robins J. Unified Methods for Censored Longitudinal Data and Causality. Berlin Heidelberg New York: Springer; 2003.
  43. van der Laan M, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer; 2012.
  44. van der Laan M, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1). doi: 10.2202/1557-4679.1211.
  45. van der Vaart A. Asymptotic Statistics. Cambridge University Press; 1998.
  46. van der Vaart A, Dudoit S, van der Laan M. Oracle inequalities for multi-fold cross-validation. Statistics and Decisions. 2006;24(3):351–371.
  47. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. New York: Springer-Verlag; 1996.
  48. Watson L, Small R, Brown S, Dawson W, Lumley J. Mounting a community-randomized trial: sample size, matching, selection, and randomization issues in PRISM. Journal of Controlled Clinical Trials. 2004;3:235–250. doi: 10.1016/j.cct.2003.12.002.
  49. Wei L. The generalized Polya’s urn design for sequential medical trials. Annals of Statistics. 1979;7:291–296.
  50. Wei L, Durham S. The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association. 1978;73:840–843.
  51. Wei L, Smythe R, Lin D, Park T. Statistical inference with data-dependent treatment allocation rules. Journal of the American Statistical Association. 1990;85:156–162.
  52. Zelen M. Play the winner rule and the controlled clinical trial. Journal of the American Statistical Association. 1969;64:131–146.
