ADAPTIVE MATCHING IN RANDOMIZED TRIALS AND OBSERVATIONAL STUDIES

Mark J van der Laan; Laura B Balzer; Maya L Petersen

Author manuscript; available in PMC: 2014 Aug 3.

Published in final edited form as: J Stat Res. 2012 Dec 1;46(2):113–156.

PMCID: PMC4119765 NIHMSID: NIHMS604043 PMID: 25097298

SUMMARY

In many randomized and observational studies the allocation of treatment among a sample of n independent and identically distributed units is a function of the covariates of all sampled units. As a result, the treatment labels among the units are possibly dependent, complicating estimation and posing challenges for statistical inference. For example, cluster randomized trials frequently sample communities from some target population, construct matched pairs of communities from those included in the sample based on some metric of similarity in baseline community characteristics, and then randomly allocate a treatment and a control intervention within each matched pair. In this case, the observed data can neither be represented as the realization of n independent random variables, nor, contrary to current practice, as the realization of n/2 independent random variables (treating the matched pair as the independent sampling unit). In this paper we study estimation of the average causal effect of a treatment under experimental designs in which treatment allocation potentially depends on the pre-intervention covariates of all units included in the sample. We define efficient targeted minimum loss based estimators for this general design, present a theorem that establishes the desired asymptotic normality of these estimators and allows for asymptotically valid statistical inference, and discuss implementation of these estimators. We further investigate the relative asymptotic efficiency of this design compared with a design in which unit-specific treatment assignment depends only on the units’ covariates. Our findings have practical implications for the optimal design and analysis of pair matched cluster randomized trials, as well as for observational studies in which treatment decisions may depend on characteristics of the entire sample.

Keywords and phrases: Cluster randomized trials, matching, asymptotic linearity of an estimator, causal effect, efficient influence curve, empirical process, confounding, dependent treatment allocation, G-computation formula, influence curve, loss function, adaptive randomization, semiparametric statistical model, targeted maximum likelihood estimation, targeted minimum loss based estimation (TMLE)

1 Introduction

In a typical randomized controlled trial, one randomly draws a unit from a target population, measures baseline covariates on the unit, randomly assigns a treatment from among a set of possible treatments according to a known distribution (possibly conditional on the baseline covariates of the unit), and measures the unit’s treatment-specific outcome. This experiment is repeated n times resulting in n independent and identically distributed (i.i.d.) copies, providing a firm basis for statistical estimation and inference using the central limit theorem.

In this article we consider an alternative data generating experiment in which one first randomly draws n units from a target population and measures the baseline covariates of each, then assigns n treatments from among some set according to a conditional distribution, given the n unit-specific baseline covariates, and finally measures the n treatment-specific outcomes. In such an experiment, the underlying units are independently and identically distributed draws from a common target population, so that the covariates and the underlying treatment-specific outcomes represent an i.i.d. sample. However, the treatment assigned to one unit can be a function of the covariates of other units in the sample, creating dependence between the n unit-specific observed data structures. As a result, the data generating design cannot be represented as n repetitions of an experiment, and not even as n independent experiments. The challenge posed to statistical inference by this design is highlighted by the fact that it is unclear how to implement valid bootstrap-based variance estimation: the available data constitute a single repetition of the underlying experiment.

Our study of this problem is motivated, in particular, by a common design-based approach to enforce empirical balance in baseline covariates among the treated and non-treated units in a finite sample. One way to enforce such balance is to partition the random sample of n units into n/2 pairs that are maximally similar with respect to covariate values according to some metric, and then to randomly allocate a treatment and a control intervention within each pair. A variation of this design partitions the sample into fewer than n/2 pairs, discarding those units for which the poorest matches are obtained. Such pair-matched designs are particularly common in community or cluster randomized controlled trials, motivated both by a desire to improve efficiency, and by the fact that such trials typically enroll far fewer independent units than their individual-level counterparts, and are thus less able to rely on chance alone to achieve the desired covariate balance between treatment groups (see, for example, reviews by Donner and Klar (2000), Hayes and Moulton (2009), Campbell et al. (2007)). Treatment assignment conditional on such a covariate-based partition of the sample preserves randomization while ensuring a degree of covariate balance between treatment and control arms of the trial. However, since the partition (in this case, the construction of the matched pairs) is generated as a function of all n covariate vectors, the treatment assignment of each unit in the sample is now a function of the covariate values of the entire sample.

While the fact that pair matching in randomized trials can introduce dependence between units is well-recognized, the extensive literature on the design and analysis of pair matched trials, including the literature debating the merits and perils of pair-matching, focuses on experimental designs in which the matched pair constitutes the unit of independence (Freedman et al. (1990); Campbell et al. (2007); Hayes and Moulton (2009); Imai et al. (2009); Imai (2008); Donner and Klar (2000); Murray (1998); Raudenbush et al. (2007); Klar and Donner (1997); Balzer et al. (2012)). In one well studied design, two units are sampled from a conditional distribution given a stratum of a baseline covariate, the treatment and control intervention are randomly allocated to the pair, the outcomes for the two units are observed, and the experiment is repeated multiple times at different strata. In such an experiment, the data generating distribution involves independently repeating the stratum-specific experiment of drawing the pair of units from the stratum, assigning the treatments, and measuring the outcomes, across different strata. Therefore, statistical inference can be based on a central limit theorem for sums of independent random variables. If the strata are set by design, then the data for each pair are independent across pairs (with a stratum-specific data distribution for each pair) but are not identically distributed.

A variation of this design is based on randomly sampling a unit and measuring a baseline covariate on the unit, and then sampling a second unit from a conditional distribution, given that the baseline covariate has the same value as the first unit. Treatment and control are allocated within the matched pair, outcomes on each unit in the pair are measured, and the experiment is repeated multiple times. In this case, the data on each pair are not only independent across the pairs but are also identically distributed. van der Laan (2008), Rose and van der Laan (2009), Balzer et al. (2012) discuss formulation of the above two data structures in terms of matched case-control sampling, and present corresponding targeted minimum loss-based estimators.

The focus in the literature on designs in which the matched pair represents the unit of independence may be due in part to the specific studies for which much of the early theory was developed. These include randomized trials in ophthalmology in which the patient’s two eyes provide the matched pair, as well as some cluster randomized trials. For example, the Community Intervention Trial for Smoking Cessation (COMMIT) motivated important early work on the use of pair matching in cluster randomized trials (Freedman et al. (1997); Gail et al. (1992); COMMIT Research Group (1991)). This study in fact sampled (albeit not randomly) 11 matched pairs of communities from a larger population of candidate matched pairs.

In contrast, however, many current cluster randomized trials employ a fundamentally different pair matched design. Communities are first sampled, and only then are matched pairs created from among this finite set by applying some algorithm to the baseline characteristics of the communities included in the sample. We refer to this design as “adaptive matching” in order to make explicit its links to the larger literature on adaptive study designs, and specifically adaptive treatment allocation in response to characteristics of the previously observed units: Bai et al. (2002); Andersen et al. (1994); Flournoy and Rosenberger (1995); Hu and Rosenberger (2000); Rosenberger (1996); Rosenberger et al. (1997); Rosenberger and Grill (1997); Rosenberger and Shiram (1997); Tamura et al. (1994); Wei (1979); Wei and Durham (1978); Wei et al. (1990); Zelen (1969); Cheng and Shen (2005); van der Laan (2008); Chambaz and van der Laan (2010); van der Laan and Rose (2012).

Recent cluster randomized trials that have employed adaptive matching include the SPACE study of a school level intervention to improve physical activity in Denmark (Toftager et al., 2011), a cluster randomized trial of routine HIV-1 viral load monitoring in Zambia (Koethe et al., 2010), and the PRISM trial of a community-level intervention to prevent post-partum depression in Australia (Watson et al., 2004). Under adaptive matching, the matched pair no longer represents the independent sampling unit. Instead, such a design corresponds to a special case of the general experimental design in which the allocation of treatment among a sample of n independent and identically distributed units is a function of the covariates of all sampled units. This raises a number of questions with practical implications for the design and analysis of cluster (as well as individual) randomized trials. When will adaptively pair matched designs result in efficiency gains relative to their non-matched counterparts? What is the optimally efficient approach to estimating the treatment effect in such studies? How should statistical inference be carried out given the dependence between randomized units?

The results developed in this paper also apply to observational studies in which treatment decisions for each participant in a randomly sampled cohort may be influenced by the covariates of all or a subset of the other cohort members, while the participant-specific outcome is only influenced by a participant’s own covariates and assigned treatment. Consider, for example, a study that aims to evaluate the impact of enrollment in a weight loss program on participant weight loss. The study protocol might bring subsets of the sampled cohort members together to discuss the program, after which participants are allowed to decide whether or not they wish to enroll. In such a study, enrollment probabilities might differ depending on the characteristics of the subgroup to which enrollees are assigned (for example, the extent to which the subgroup includes charismatic or vocal individuals who have failed similar weight loss programs in the past). As a result, enrollment decisions within a subgroup are no longer independent.

Finally, a special case of the general experiment described in this article is one in which the treatment allocation for each unit in an i.i.d. sample from some target population can be a different function of the sample characteristics of all of the other units in the sample. Conditional on the baseline characteristics of the sample, the treatment assignment of each individual is independent; however, the individual-specific assignment mechanisms are not identical across the individuals. In the example study to evaluate the effect of a weight loss program, the entire sample might be divided up in several subgroups, allowing the subjects within a subgroup to mingle and talk among themselves, before being provided with information about the weight loss program and subsequently deciding whether to enroll. Individual enrollment decisions in such a scenario might depend on the characteristics of other attendees in that subgroup. One might be willing to assume that each individual’s enrollment decision is made independently, given what he or she has observed about the characteristics of the other attendees in the subgroup in contrast to the previous example in which decisions within subgroups were dependent. However, an individual’s enrollment decision is indexed by the subgroup he or she belongs to, so that treatment allocation is not identical across all individuals.

1.1 Organization of article

In Section 2 we define the statistical estimation problem posed by estimating the additive causal effect of treatment (or average treatment effect) under the general experimental design in which treatment allocation can depend on the characteristics of other units in the sample. Specifically, we define the data generating experiment, the observed data, the likelihood, the statistical model and the target parameter.

In Section 3 we study the fundamental mathematics of the design by determining the tangent space of the model and the canonical gradient of the pathwise derivative of the target parameter. Section 4 presents a targeted minimum loss based estimator (TMLE) of the additive causal effect of treatment and discusses its implementation. The TMLE presented is double robust. In particular, it remains consistent and asymptotically normally distributed as long as the treatment assignment mechanism for the n treatments is known or well estimated, even if the conditional mean of the outcome given the treatment and covariates is estimated inconsistently. We further present an estimator of the asymptotic variance of this TMLE. Interestingly, it appears that no double robust estimator of this variance is available, so that asymptotically consistent estimation of the variance requires a consistent estimator of the outcome regression. This demonstrates a strong contrast with designs in which treatment is independently assigned. In Section 5 we present a theorem that provides a formal basis for the estimators introduced in Section 4, and in particular establishes the asymptotics of the TMLE and thereby the validity of the statistical inference based on a normal limit distribution.

Section 6 discusses implications of these results for the design and analysis of randomized trials with adaptive pair matching. In particular, we discuss implementation of a TMLE of the average treatment effect and corresponding statistical inference in terms of confidence intervals and testing. While consistent estimation of the variance requires consistent estimation of the outcome regression, for this special case we show that a conservative estimate of the variance is possible. We further contrast the asymptotic variance of the adaptive pair matched design with the asymptotic variance of a design in which the intervention is randomly allocated to each unit independently. This provides insight into the potential benefits of pair matching in cluster randomized trials, beyond that provided by previous literature in which the matched pair constituted the unit of independence. In addition, we contrast the approach to statistical inference presented in this article to the standard approach employed in pair matched randomized trials, in which the average treatment effect is estimated as the sample mean of paired differences, and the variance is estimated as the sample variance treating the pairs as independent. It is shown that this standard approach provides conservative inference, under an explicitly stated assumption which is generally expected to hold.

Section 7 extends these results to the common case in which some units in the initial sample are missing treatment and outcome data. Such a case would occur, for example, in a cluster randomized trial with adaptive pair matching in which treatment were only allocated among the subset of sampled units for which adequate pair matches were identified. We conclude with a summary of the practical implications of our results and identify areas for future work. Proofs of all theorems and an overview of the required empirical process theory are provided in an Appendix.

1.2 Novel contributions of this article

To the best of our knowledge, the estimation problem addressed in this article has not been formally studied. This estimation problem targets the usual average causal effect, but the dependent allocation of treatment allowed by our model makes it different from other estimation problems that the literature has covered.

Even though targeted maximum likelihood estimation is a general method that has been applied to many problems in the literature (see e.g., van der Laan and Rose (2012) for an overview and comprehensive coverage of this method), the actual construction of a targeted maximum likelihood estimator for a new estimation problem, as defined by the statistical model and target parameter, requires new research: it relies on the construction of a least favorable sub-model for fluctuating an initial estimator and a loss function so that the loss-function specific score of the least favorable sub-model at zero fluctuation spans the efficient influence curve. In particular, this requires determining the efficient influence curve (i.e., canonical gradient of pathwise derivative) for this target parameter in this new model. Indeed, the resulting TMLE as developed in this article is new and not presented anywhere else. In addition, the analysis of this TMLE relies on the state of the art methods in empirical process theory as presented in van der Vaart and Wellner (1996). Finally, the implications of our results for the analysis of cluster randomized trials and observational studies in which treatment allocation depends on the covariates of other units in the sample are new and important. In particular, our theoretical results allow us to formally compare the efficiency of different possible matched pair designs mentioned in the introduction. This work will appear in a future article.

2 Definition of Statistical Estimation Problem

2.1 Observed data

Let Xⁿ = (X₁, …, X_n) be a vector consisting of n i.i.d. observations of X_i = (W_i, Y_i(0), Y_i(1)), where W_i denotes the baseline covariates, and (Y_i(0), Y_i(1)) denotes the treatment-specific counterfactual outcomes for subject i. (In words, Y_i(a) denotes the outcome that would have been observed for unit i if it had received treatment level A = a.) Let P_{X,0} denote the common distribution of X_i. In addition, $g_{0}^{n} (A_{1}, \dots, A_{n} ∣ X^{n})$ is the true conditional distribution of the treatment or intervention Aⁿ = (A₁, …, A_n), conditional on Xⁿ. The observed data are O_i = (W_i, A_i, Y_i = Y_i(A_i)), i = 1, …, n, so that only one counterfactual outcome, corresponding to the treatment actually received, is observed for each unit. Note that Oⁿ = (O₁, …, O_n) is a many-to-one function of Aⁿ and Xⁿ, and is thus a missing or censored data structure in which the full data is Xⁿ and the censoring or missingness variable is Aⁿ.

We assume throughout that the conditional distribution of Aⁿ, given Xⁿ, $g_{0}^{n} (\cdot ∣ X^{n})$, is only a function of Xⁿ through Wⁿ = (W₁, …, W_n), which implies the coarsening at random assumption on $g_{0}^{n}$ with respect to the full data Xⁿ (Heitjan and Rubin (1991); Jacobsen and Keiding (1995); Gill et al. (1997)). This assumption allows for dependence between A₁, …, A_n, as long as it can be explained by the covariate vector Wⁿ. One important class of examples covered by such treatment mechanisms $g_{0}^{n}$ consists of studies that first partition the sample {1, …, n} into groups based on the covariate vector Wⁿ and subsequently randomly assign the treatment and control intervention within each group. For example, cluster randomized trials are commonly implemented by first partitioning the sample {1, …, n} into n/2 pairs based on some metric of similarity in baseline covariates Wⁿ, and then randomly assigning a treatment and a control condition to the two members of each pair. More formally, $g_{0}^{n}$ in such a design can be defined as follows: given Wⁿ and thereby the disjoint pairs C_j(Wⁿ) = {j₁, j₂} ⊂ {1, …, n} with C₁(Wⁿ) ∪ … ∪ C_{n/2}(Wⁿ) = {1, …, n}, within each pair C_j(Wⁿ) assign (1,0) or (0,1) with a flip of a fair coin (i.e. with probability 0.5).
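The pair-matched treatment mechanism described above can be sketched in a few lines of Python. This is a minimal illustration only: the text leaves the similarity metric and the matching algorithm unspecified, so the greedy nearest-neighbour rule on Euclidean distance and the function name `adaptive_pair_match` are assumptions made here for concreteness.

```python
import numpy as np

def adaptive_pair_match(W, rng):
    """Partition units into pairs using W, then flip a fair coin within each pair.

    W   : (n, d) array of baseline covariates, with n even.
    rng : numpy.random.Generator supplying the coin flips.
    Returns (pairs, A): a list of index pairs and the 0/1 treatment vector.
    The greedy Euclidean matching is an illustrative assumption, not the
    matching algorithm of any particular trial.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    if n % 2 != 0:
        raise ValueError("n must be even to form n/2 pairs")
    # Pairwise Euclidean distances between covariate vectors.
    dist = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
    unmatched = list(range(n))
    pairs = []
    while unmatched:
        i = unmatched.pop(0)
        j = min(unmatched, key=lambda k: dist[i, k])  # closest remaining unit
        unmatched.remove(j)
        pairs.append((i, j))
    # Within each pair, assign (1, 0) or (0, 1) with probability 0.5 each.
    A = np.zeros(n, dtype=int)
    for i, j in pairs:
        if rng.random() < 0.5:
            A[i] = 1
        else:
            A[j] = 1
    return pairs, A

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 2))
pairs, A = adaptive_pair_match(W, rng)
```

Because the pairs are a function of all n covariate vectors, the treatment of each unit depends on the covariates of the whole sample, while each unit is still treated with marginal probability 0.5.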

Instead of using the Neyman-Rubin counterfactual formulation above, this observed data generating distribution can also be described in terms of a non-parametric structural equation model (NPSEM) (Pearl (1995, 2009)) as follows. Let W_i = f_W(U_{W,i}), where U_{W,i}, i = 1, …, n, are i.i.d.; Aⁿ = f_{Aⁿ}(Wⁿ, U_{Aⁿ}); and Y_i = f_Y(W_i, A_i, U_{Y,i}), with U_{Y,i}, i = 1, …, n, i.i.d. The analogue of the coarsening at random assumption in terms of this NPSEM is that U_{Aⁿ} is independent of (U_{Y,i}: i = 1, …, n), given Wⁿ. The functions f_Y and f_W are unspecified, but the function f_{Aⁿ} and the distribution of U_{Aⁿ} might be known. For example, the sample might be partitioned into groups according to some known algorithm applied to the baseline characteristics of the sample, and the intervention A randomly assigned within each group.

2.2 Likelihood and statistical model

Under both formulations of the data generating experiment, the observed data is O_i = (W_i, A_i, Y_i), i = 1, …, n, and the likelihood of the observed data Oⁿ = (O₁, …, O_n), under distribution Pⁿ, is given by

P^{n} (O_{1}, \dots, O_{n}) = \left\{ \prod_{i = 1}^{n} Q_{W} (W_{i}) Q_{Y} (Y_{i} ∣ W_{i}, A_{i}) \right\} g^{n} (A^{n} ∣ W^{n}),

where Q_W = Q_W (Pⁿ) and Q_Y = Q_Y (Pⁿ) denote the common marginal distribution of W and the common conditional distribution of Y, given A, W, respectively. We put no constraints on the sets of possible Q_Y and Q_W, which corresponds with putting no constraints on the common full data distribution P_{X,0} (or no constraints on the NPSEM specified above beyond assumptions on the equation for Aⁿ). Regarding the treatment mechanism $g_{0}^{n}$, we assume that

g_{0}^{n} (A^{n} ∣ W^{n}) = \prod_{j = 1}^{J} g_{0, j} (A (j) : j \in C_{j} (W^{n}) ∣ W^{n}),

(2.1)

where C₁(Wⁿ), …, C_J(Wⁿ) is a partitioning of the sample {1, …, n} into J groups deterministically implied by Wⁿ. Thus, it is assumed that, conditional on Wⁿ, the treatment labels within a group are assigned independently of the treatment labels in any other group. It is assumed that lim inf_{n→∞} J(n)/n > 0 so that the asymptotics will still be driven by n. Let $g_{i}^{n}$ be the conditional distribution of A_i, given Wⁿ. Although not necessary for deriving the desired asymptotics, we assume that this distribution $g_{i}^{n}$ of A_i, given Wⁿ, is non-deterministic, i = 1, …, n. The set of possible $g_{0}^{n}$ will be denoted with $\mathcal{G}^{n}$. The set of all possible data distributions Pⁿ implied by the nonparametric model on P_{X,0} and the model $\mathcal{G}^{n}$ for $g_{0}^{n}$ represents a statistical model for the true data distribution $P_{0}^{n}$. This general model will be referred to as $\mathcal{M}^{n}$. (Generalization of our results to general J(n) with rates of convergence $1 / \sqrt{J (n)}$ should be possible as well, but is not pursued here.)

2.2.1 Special models of interest for the treatment mechanism

A special choice for $\mathcal{G}^{n}$ consists of distributions satisfying gⁿ(Aⁿ | Wⁿ) = ∏_i g_i(A_i | Wⁿ). In this particular model it is assumed that, given Wⁿ, A₁, …, A_n are independent with conditional distributions $g_{i}^{n}$ of A_i, given Wⁿ, i = 1, …, n. This choice, which corresponds to partitions of size 1, allows treatment to be assigned to each unit in the sample according to a distinct unit-specific mechanism that is allowed to depend on the baseline covariates of the entire sample. Such a data generating process might arise in a study, such as the weight loss example presented in the introduction, in which subgroups of individuals are allowed to interact before each assigning themselves independently to the treatment or control condition. Then the baseline covariates of the subgroups influence individual treatment decisions but these decisions are still made independently given the baseline covariates of the cohort. We refer to this choice of $\mathcal{G}^{n}$ as $\mathcal{G}_{1}^{n}$ and refer to the corresponding statistical model, implied by the nonparametric model on P_{X,0} and $\mathcal{G}_{1}^{n}$, as $\mathcal{M}_{1}^{n}$.

A second special choice for $\mathcal{G}^{n}$ consists of distributions satisfying $g^{n} (A^{n} ∣ W^{n}) = \prod_{j = 1}^{n / 2} g (A (j) : j \in C_{j} (W^{n}) ∣ W^{n})$, where C_j(Wⁿ) = {j₁, j₂}, j = 1, …, n/2, represents a partitioning of the sample {1, …, n} into n/2 disjoint pairs C_j(Wⁿ). This class of treatment assignment mechanisms describes randomized trials with adaptive pair matching. We refer to this choice of $\mathcal{G}^{n}$ as $\mathcal{G}_{2}^{n}$ and refer to the corresponding statistical model, implied by the nonparametric model on P_{X,0} and $\mathcal{G}_{2}^{n}$, as $\mathcal{M}_{2}^{n}$.

2.3 Target statistical parameter

We focus on the target quantity Ψ^F(P_X) = E{Y (1) − Y (0)}, a particular parameter of the full-data distribution P_X or, equivalently, a parameter of the distribution of the counterfactuals (Y(0), Y(1)) defined by the NPSEM. This quantity is often referred to as the average treatment effect, and corresponds to the causal quantity typically targeted by randomized trials as well as many observational studies. Under coarsening at random, E_{Q_Y} (Y|A = a, W) = E_{P_X} (Y (a)|W), while the parameters (Q_Y, Q_W) of Pⁿ are identifiable parameters of Pⁿ. This target quantity is thus identified by the distribution of the data Pⁿ as follows:

\begin{array}{l} Ψ^{F} (P_{X}) = Ψ (Q) = E_{Q_{W}} \{E_{Q_{Y}} (Y ∣ A = 1, W) - E_{Q_{Y}} (Y ∣ A = 0, W)\} \\ = E_{Q_{W}} \{\bar{Q} (1, W) - \bar{Q} (0, W)\}, \end{array}

where Q = (Q_W, Q̄) denotes the common distribution Q_W of W_i, and common conditional mean Q̄ of Y_i, given A_i, W_i. Here Q = Q(Pⁿ) is a parameter of the observed data distribution Pⁿ. This identifiability result now defines a target parameter $Ψ : \mathcal{M}^{n} \to \mathbb{R}$ of the observed data distribution, defined as Ψ(Pⁿ) = Ψ(Q) (where we abuse notation by using the same Ψ for two different mappings).
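As a small illustration of the identification formula, the plug-in value Ψ(Q) can be computed by averaging Q̄(1, W) − Q̄(0, W) over a distribution of W. The sketch below assumes Q_W is replaced by the empirical distribution of a sample and uses a made-up outcome regression `Qbar_example`; both the function names and the numbers are hypothetical.

```python
import numpy as np

def plug_in_ate(Qbar, W):
    """Psi(Q) with Q_W taken to be the empirical distribution of W.

    Qbar : callable (a, W) -> E(Y | A = a, W), vectorized over W.
    W    : array of baseline covariate values.
    """
    W = np.asarray(W, dtype=float)
    return float(np.mean(Qbar(1, W) - Qbar(0, W)))

def Qbar_example(a, W):
    # Hypothetical outcome regression with a treatment-covariate interaction.
    return 0.2 + 0.3 * a + 0.1 * W + 0.2 * a * W

W = np.array([0.0, 1.0, 1.0, 0.0])
psi = plug_in_ate(Qbar_example, W)  # averages 0.3 + 0.2 * W over the sample
```

Here the unit-level contrast is 0.3 + 0.2 W and the sample mean of W is 0.5, so the plug-in value is 0.4 up to floating-point error.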

The estimation problem is now defined: we want to estimate Ψ(Pⁿ) based on Oⁿ = (O₁, …, O_n) ~ Pⁿ ∈ $\mathcal{M}^{n}$, and we also want to provide asymptotic inference in terms of confidence intervals and tests of null hypotheses.

3 The canonical gradient of the pathwise derivative of the target parameter

In order to construct efficient asymptotically linear estimators, and in particular targeted minimum loss-based estimators (van der Laan and Rubin (2006); van der Laan and Rose (2012)), of Ψ(Pⁿ), we first determine the tangent space of the model and the canonical gradient of the pathwise derivative of the target parameter.

Let

D^{*} (Q, g) (W, A, Y) = \frac{2 A - 1}{g (A ∣ W)} (Y - \bar{Q} (A, W)) + \bar{Q} (1, W) - \bar{Q} (0, W) - Ψ (Q),

which is the efficient influence curve of the real-valued parameter Ψ, with Ψ(P) = Ψ(Q), under i.i.d. sampling from P_{Q,g}, where g is a conditional distribution of A, given W (van der Laan and Robins (2003); van der Laan and Rose (2012)). We will also denote D^*(Q, g) with D^*(Q, g, Ψ(Q)) to stress its representation as an estimating function in ψ. We note $D^{*} (Q, g, Ψ (Q)) = D_{Y}^{*} (\bar{Q}, g) + D_{W}^{*} (Q)$, where

\begin{array}{l} D_{Y}^{*} (\bar{Q}, g) = \frac{2 A - 1}{g (A ∣ W)} (Y - \bar{Q} (A, W)) \\ D_{W}^{*} (Q) = \bar{Q} (1, W) - \bar{Q} (0, W) - Ψ (Q) . \end{array}
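The two components of D^* and its use as an estimating function in ψ can be sketched numerically. The code below is an illustration under stated assumptions: the helper names, the toy data, the outcome regression and the constant treatment probability 0.5 are all hypothetical, and solving the estimating equation directly is shown only to make the structure of D^* concrete. It is not the TMLE.

```python
import numpy as np

def eic_components(W, A, Y, Qbar, g1, psi):
    """Evaluate D_Y^* and D_W^* at each observation.

    Qbar : callable (a, W) -> E(Y | A = a, W), vectorized over arrays.
    g1   : callable W -> P(A = 1 | W).
    psi  : value plugged in for Psi(Q).
    """
    W = np.asarray(W, dtype=float)
    A = np.asarray(A, dtype=int)
    Y = np.asarray(Y, dtype=float)
    p1 = g1(W)
    gA = np.where(A == 1, p1, 1.0 - p1)          # g(A | W) at the observed A
    D_Y = (2 * A - 1) / gA * (Y - Qbar(A, W))
    D_W = Qbar(1, W) - Qbar(0, W) - psi
    return D_Y, D_W

def solve_estimating_equation(W, A, Y, Qbar, g1):
    """Value of psi at which the empirical mean of D^*(Q, g, psi) is zero."""
    D_Y, D_W_at_zero = eic_components(W, A, Y, Qbar, g1, psi=0.0)
    return float(np.mean(D_Y + D_W_at_zero))

# Toy inputs (hypothetical).
W = np.array([0.0, 1.0, 0.0, 1.0])
A = np.array([1, 0, 0, 1])
Y = np.array([1.0, 0.0, 0.0, 1.0])
Qbar = lambda a, W: 0.3 + 0.4 * a + 0.0 * W
g1 = lambda W: np.full(W.shape, 0.5)

psi_hat = solve_estimating_equation(W, A, Y, Qbar, g1)
D_Y, D_W = eic_components(W, A, Y, Qbar, g1, psi_hat)
mean_D = float(np.mean(D_Y + D_W))
```

Since D^* is linear in ψ, the root of its empirical mean is available in closed form, which is what `solve_estimating_equation` returns.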

The following theorem presents the canonical gradient of the pathwise derivative of the parameter $Ψ : \mathcal{M}^{n} \to \mathbb{R}$. (For semiparametric efficiency theory, see e.g., Bickel et al. (1997); van der Laan and Robins (2003); van der Vaart (1998).) This canonical gradient is expressed in terms of the above function D^*(Q, g).

Theorem 1

Let O_i = (W_i, A_i, Y_i), Oⁿ = (O₁, …, O_n) ~ Pⁿ, with

P^{n} (O_{1}, \dots, O_{n}) = \left\{ \prod_{i = 1}^{n} Q_{W} (W_{i}) Q_{Y} (Y_{i} ∣ W_{i}, A_{i}) \right\} g^{n} (A^{n} ∣ W^{n}),

where Q_W is an unspecified marginal distribution, Q_Y is an unspecified conditional distribution of Y, given A, W, and gⁿ is a conditional distribution of Aⁿ = (A₁, …, A_n), given Wⁿ = (W₁, …, W_n), known to be an element of a set $\mathcal{G}^{n}$ consisting of distributions satisfying (2.1). Let $\mathcal{M}^{n}$ be the resulting statistical model for Pⁿ. Let $\mathcal{M}^{n}(g^{n})$ be the model if gⁿ is known.

Let $Ψ : \mathcal{M}^{n} \to \mathbb{R}$ be defined by Ψ(Pⁿ) = E_{Q_W}{Q̄(1, W) − Q̄(0, W)}, where Q̄(A, W) = E_{Q_Y} (Y|A, W).

The tangent space at Pⁿ in model $\mathcal{M}^{n}$ is given by:

T (P^{n}) = \{\sum_{i = 1}^{n} ϕ (W_{i}) : ϕ \in T_{W}\} + \{\sum_{i = 1}^{n} ϕ (Y_{i} ∣ A_{i}, W_{i}) : ϕ \in T_{Y}\} + \sum_{j = 1}^{J} T_{C_{j}},

(3.1)

where T_W = {h(W): Eh(W) = 0},

T_{Y} = \{h (Y ∣ A, W) : E_{Q_{Y}} (h (Y ∣ A, W) ∣ A, W) = 0\},

and

T_{C_{j}} = \{S ((A_{j} : j \in C_{j} (W^{n})) ∣ W^{n}) : E (S ∣ W^{n}) = 0\} .

The tangent space at Pⁿ in model $\mathcal{M}^{n}(g^{n})$ is given by

T (Q) = \{\sum_{i = 1}^{n} ϕ (W_{i}) : ϕ \in T_{W}\} + \{\sum_{i = 1}^{n} ϕ (Y_{i} ∣ A_{i}, W_{i}) : ϕ \in T_{Y}\} .

The statistical parameter Ψ is pathwise differentiable and its canonical gradient at Pⁿ is given by

D^{n, *} (P^{n}) = \frac{1}{n} \sum_{i = 1}^{n} D^{*} (Q, {\bar{g}}_{n}) (O_{i}) = \frac{1}{n} \sum_{i = 1}^{n} \{D_{W}^{*} (Q) (W_{i}) + D_{Y}^{*} (\bar{Q}, {\bar{g}}_{n}) (O_{i})\},

where g_i is the conditional distribution of A_i, given W_i, and

{\bar{g}}_{n} (a ∣ W) = \frac{1}{n} \sum_{i = 1}^{n} g_{i} (a ∣ W) .

We note that

g_{i} (1 ∣ W_{i}) = \sum_{(w_{j} : j \neq i)} g_{i} (1 ∣ (w_{j} : j \neq i), W_{i}) \prod_{j \neq i} Q_{W} (w_{j})

(3.2)

is a function of g_i(A_i |Wⁿ) and the common marginal distribution Q_W.
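As a worked special case of (3.2), consider a design in which a fair coin is flipped within each matched pair, so that $g_{i}(1 ∣ W^{n}) = 1/2$ for every unit and every $W^{n}$. The following short derivation, written in the notation of the theorem, shows what the averaged mechanism and the outcome component of the gradient reduce to under that assumption.

```latex
% Assumption: g_i(1 | W^n) = 1/2 for all i and all W^n (fair coin within pairs).
g_{i}(1 \mid W_{i})
  = \sum_{(w_{j} : j \neq i)} \tfrac{1}{2} \prod_{j \neq i} Q_{W}(w_{j})
  = \tfrac{1}{2},
\qquad
\bar{g}_{n}(1 \mid W) = \frac{1}{n} \sum_{i = 1}^{n} g_{i}(1 \mid W) = \tfrac{1}{2},
\\
D_{Y}^{*}(\bar{Q}, \bar{g}_{n})(O)
  = \frac{2A - 1}{1/2} \, (Y - \bar{Q}(A, W))
  = 2 \, (2A - 1) \, (Y - \bar{Q}(A, W)).
```

In this case the averaged mechanism does not depend on Q_W or on the matching algorithm, even though the joint distribution of Aⁿ given Wⁿ does.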

Double robustness of canonical gradient

We have

E_{0} D^{n, *} (\bar{Q}, {\bar{g}}_{n}, ψ_{0}) = 0 \quad \text{if } \bar{Q} = {\bar{Q}}_{0} \text{ or } {\bar{g}}_{n} = {\bar{g}}_{n, 0},

(3.3)

assuming that for all i, 0 < g_i(1|W) < 1 a.e. More generally, if Q_W = Q_{W,0}, then for any Q̄, ḡ, we have

E_{0} D^{n, *} (\bar{Q}, \bar{g}, Ψ (Q)) = ψ_{0} - Ψ (Q) + E_{0} (\frac{{\bar{g}}_{0}}{\bar{g}} (1 ∣ W) - 1) ({\bar{Q}}_{0} - \bar{Q}) (1, W) - E_{0} (\frac{{\bar{g}}_{0}}{\bar{g}} (0 ∣ W) - 1) ({\bar{Q}}_{0} - \bar{Q}) (0, W) .

The proof is presented in the Appendix.
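The double robustness identity can be checked exactly on a small discrete example. The sketch below is an illustration under simplifying assumptions: W is binary, a single treatment mechanism g plays the role of ḡ, Q_W is taken to be the true marginal, and all probabilities are made-up numbers. It computes the expectation of D^* at ψ₀ in closed form rather than by simulation.

```python
import numpy as np

# Hypothetical true data-generating quantities for binary W.
q_w = np.array([0.6, 0.4])                 # P(W = 0), P(W = 1)
g0 = np.array([0.3, 0.6])                  # true P(A = 1 | W = w)
Q0 = np.array([[0.2, 0.5],                 # Q0[a, w] = E(Y | A = a, W = w)
               [0.4, 0.9]])
psi0 = float(np.sum(q_w * (Q0[1] - Q0[0])))

def mean_eic(Q, g):
    """Exact E_0 D^*(Q, g, psi0) under the true (q_w, g0, Q0)."""
    term1 = g0 / g * (Q0[1] - Q[1])                # contribution of A = 1
    term0 = (1 - g0) / (1 - g) * (Q0[0] - Q[0])    # contribution of A = 0
    return float(np.sum(q_w * (term1 - term0 + Q[1] - Q[0])) - psi0)

Q_wrong = np.full((2, 2), 0.5)             # misspecified outcome regression
g_wrong = np.array([0.5, 0.5])             # misspecified treatment mechanism

only_g_right = mean_eic(Q_wrong, g0)       # zero up to rounding
only_Q_right = mean_eic(Q0, g_wrong)       # zero up to rounding
both_wrong = mean_eic(Q_wrong, g_wrong)    # nonzero in this example
```

The expectation vanishes when either nuisance parameter is correct and is nonzero here when both are misspecified, matching the product form of the remainder in the display above.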

4 A targeted minimum loss-based estimator (TMLE)

Derivation of the canonical gradient of the pathwise derivative of the target parameter Ψ allows us to construct a targeted minimum loss based estimator (TMLE). In this section we present a TMLE for Ψ for the general statistical model $\mathcal{M}^{n}$, in which $g_{0}^{n} (A^{n} ∣ W^{n}) = \prod_{j = 1}^{J} g_{0, j} (A (j) : j \in C_{j} (W^{n}) ∣ W^{n})$ is unknown. This TMLE is thus applicable to studies, such as the example presented in the introduction, in which a cohort of individuals is partitioned into subgroups and individuals within subgroups are allowed to interact in determining their treatment assignment according to some unknown mechanism. It further covers the special cases in which the sample is partitioned into n singletons and in which $g_{0}^{n}$ is known (as in an adaptively pair matched trial). Section 6 considers the latter special case in greater detail.

We recall from the literature on TMLE (van der Laan and Rose (2012)) that specification of a TMLE of a target parameter Ψ(Q₀) requires specification of a loss function L(Q) and a sub-model {Q(ε): ε} through Q at ε = 0, possibly indexed by a nuisance parameter (in our case, ḡ), so that $\frac{d}{d ε} L (Q (ε)) ∣_{ε = 0}$ spans the canonical gradient (in our case, $D^{n, *} (Q, \bar{g})$). Since Q = (Q_W, Q̄) and $Q_{W, n}$ is already a nonparametric maximum likelihood estimator, the TMLE will only involve fluctuating Q̄.

Loss function and initial estimator for Q̄₀

Let Y ∈ {0, 1} be binary or continuous in (0, 1). Let ${\bar{Q}}_{n}^{0}$ be an initial estimator of Q̄₀, which can be based on the loss-function

- L_{i} (\bar{Q}) (O_{i}) = \left\{ Y_{i} \log \bar{Q} (A_{i}, W_{i}) + (1 - Y_{i}) \log (1 - \bar{Q} (A_{i}, W_{i})) \right\} .

(4.1)

To understand the validity of this loss, note that

- E_{0} L_{i} (\bar{Q}) (O_{i}) = E_{Q_{W, 0}, g_{0, i}} \left\{ {\bar{Q}}_{0} (A_{i}, W_{i}) \log \bar{Q} (A_{i}, W_{i}) + (1 - {\bar{Q}}_{0}) (A_{i}, W_{i}) \log (1 - \bar{Q} (A_{i}, W_{i})) \right\},

so that E₀L_i(Q̄)(O_i) is indeed minimized at Q̄ = Q̄₀. This demonstrates that L_i(Q̄) is a valid loss function for Q̄₀. Specifically, one could fit Q̄₀ by minimizing $\sum_{i = 1}^{n} L_{i} ({\bar{Q}}_{θ}) (O_{i})$ over a parametric or semiparametric working model {Q̄_θ: θ ∈ Θ}. Furthermore, to select among candidate estimators, such as different choices of working models or different algorithms, we can also use this loss to carry out cross-validation based estimator selection. Since, conditional on Aⁿ, Wⁿ, the outcomes Y_i, i = 1, …, n, are independent, a cross-validation selector that uses (4.1) as loss function and treats i = 1, …, n as the index of the independent units when splitting the sample into training and validation sets will satisfy an oracle type inequality analogous to the ones developed and presented in van der Laan and Dudoit (2003); van der Laan et al. (2006); van der Vaart et al. (2006). Thus an initial estimator of Q̄₀ can be based on applying a data-adaptive loss-based learning approach such as super learning, ignoring the dependence between the treatment labels (van der Laan et al. (2007); Polley and van der Laan (2010) and Chapter 3 by Polley, Rose, van der Laan in van der Laan and Rose (2012)).

Least favorable sub-model through initial estimator

Let ${\bar{g}}_{0} = \frac{1}{n} \sum_{i} g_{0, i}$, where $g_{0, i}$ is the true conditional distribution of A_i, given W_i. As sub-model for fluctuating ${\bar{Q}}_{n}^{0}$ we use

Logit {\bar{Q}}_{n}^{0} (ε) = Logit {\bar{Q}}_{n}^{0} + ε H_{{\bar{g}}_{n}}^{*},

where $H_{{\bar{g}}_{n}}^{*} (A, W) = (2 A - 1) / {\bar{g}}_{n} (A ∣ W)$ , and ḡ_n is an estimator of ḡ₀. Let $Q_{W, n}$ be the empirical distribution, and $Q_{n}^{0} = (Q_{W, n}, {\bar{Q}}_{n}^{0})$ . We note that

\frac{d}{d ε} \sum_{i} {L_{i} ({\bar{Q}}_{n}^{0} (ε)) (O_{i}) |}_{ε = 0} = \sum_{i} D_{Y}^{*} (Q_{n}^{0}, {\bar{g}}_{n}) (O_{i})

so that this loss function and sub-model indeed generate the crucial component of the canonical gradient of the target parameter, a requirement for the construction of an efficient TMLE. The component corresponding with $D_{W}^{*}$ is generated by a sub-model $Q_{W, n} (ε)$ through $Q_{W, n}$ at ε = 0 with score $D_{W}^{*}$ , but since $Q_{W, n}$ is already an NPMLE, the estimated amount of fluctuation according to this sub-model would be zero, so that this sub-model plays no role in the TMLE.

Computing the update of initial estimator

The amount of fluctuation ε_n is estimated as

ε_{n} = arg min_{ε} \sum_{i = 1}^{n} L_{i} ({\bar{Q}}_{n}^{0} (ε)) (O_{i}) .

This provides the update ${\bar{Q}}_{n}^{*} = {\bar{Q}}_{n}^{0} (ε_{n})$ . Let $Q_{n}^{*} = (Q_{W, n}, {\bar{Q}}_{n}^{*})$ .

TMLE of target parameter

The TMLE of ψ₀ is the corresponding plug-in estimator

Ψ (Q_{n}^{*}) = \frac{1}{n} \sum_{i = 1}^{n} {{\bar{Q}}_{n}^{*} (1, W_{i}) - {\bar{Q}}_{n}^{*} (0, W_{i})} .
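The fluctuation, update, and plug-in steps above can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function name `tmle_update`, its arguments, and the use of a one-dimensional Newton iteration to minimize the loss (4.1) along the logistic sub-model are all assumptions made for the example. The inputs are the initial fit evaluated at the observed treatment and at A = 1 and A = 0, and an estimate `g1` of ḡ(1 | W_i).

```python
import numpy as np

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_update(y, a, q_init_obs, q_init_1, q_init_0, g1,
                max_iter=100, tol=1e-12):
    """Hypothetical helper: one-step TMLE update for the average treatment effect.

    y          : outcomes in [0, 1]
    a          : treatment indicators (0/1)
    q_init_obs : initial fit at the observed (A_i, W_i), values in (0, 1)
    q_init_1/0 : initial fit at (1, W_i) and (0, W_i)
    g1         : estimate of gbar(1 | W_i), values in (0, 1)
    """
    # Clever covariate H*(A, W) = (2A - 1) / gbar(A | W) at the observed A
    g_obs = np.where(a == 1, g1, 1.0 - g1)
    h_obs = (2 * a - 1) / g_obs
    offset = logit(q_init_obs)

    # Newton iterations for epsilon in logit Q(eps) = logit Q0 + eps * H
    eps = 0.0
    for _ in range(max_iter):
        p = expit(offset + eps * h_obs)
        score = np.sum(h_obs * (y - p))
        info = np.sum(h_obs ** 2 * p * (1.0 - p))
        step = score / info
        eps += step
        if abs(step) < tol:
            break

    # Updated fit at the observed treatment and under A = 1 and A = 0
    q_star_obs = expit(offset + eps * h_obs)
    q_star_1 = expit(logit(q_init_1) + eps / g1)
    q_star_0 = expit(logit(q_init_0) - eps / (1.0 - g1))

    # Plug-in estimator: empirical mean of Q*(1, W) - Q*(0, W)
    psi = np.mean(q_star_1 - q_star_0)
    return psi, eps, q_star_obs, q_star_1, q_star_0
```

At convergence the empirical mean of H*(A_i, W_i)(Y_i − Q̄*(A_i, W_i)) is numerically zero, which is the D_Y component of the efficient influence curve equation.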

TMLE solves efficient influence curve equation

By construction, the TMLE solves

0 = D^{n, *} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) = \frac{1}{n} \sum_{i = 1}^{n} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) (O_{i}) .

In other words, the TMLE solves the efficient influence curve equation for this statistical model. This equation will form a crucial ingredient for establishing double robustness and asymptotic normality of the TMLE $Ψ (Q_{n}^{*})$ .

4.1 Estimation of ḡ₀

In general ḡ₀ is not known and must be estimated. In this section we sometimes suppress the n in ḡ_n when referring to an estimator of ${\bar{g}}_{0} = \frac{1}{n} \sum_{i} g_{0, i}$.

Estimation of ḡ can be based on the pooled log-likelihood L(ḡ)(Wⁿ, Aⁿ) = Σ_i log ḡ(A_i|W_i), as if we observe a sample of n i.i.d. (W_i, A_i). Let ḡ_n be the resulting estimator. The above TMLE is then applied with ḡ_n as an estimator of ḡ₀. Indeed, the negative log-likelihood −L(ḡ) is a valid loss function for ḡ₀ since

E_{0} L (\bar{g}) (W^{n}, A^{n}) = E_{0} \sum_{i = 1}^{n} \sum_{a} \log \bar{g} (a ∣ W_{i}) g_{0, i} (a ∣ W_{i}) = n E_{Q_{W, 0}, {\bar{g}}_{0}} \log \bar{g} (A ∣ W),

which is maximized at ḡ₀. Conditional on Wⁿ, the groups (A_i: i ∈ C_j(Wⁿ)) of treatment nodes are independent. Thus, ḡ can be estimated using loss-based learning and cross-validation, but the cross-validation should, in contrast to estimation of Q̄₀, treat the groups indexed by j as the independent units.
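To illustrate what it means to treat the groups indexed by j as the independent units, the following sketch assigns cross-validation folds at the group level, so that all units in the same C_j(Wⁿ) land in the same fold. The function name `group_folds` and its arguments are hypothetical, and the round-robin assignment after a random shuffle is only one of several reasonable choices.

```python
import numpy as np

def group_folds(group_id, n_folds, seed=0):
    """Hypothetical helper: assign each unit to a fold so that units
    sharing a group label always share a fold."""
    rng = np.random.default_rng(seed)
    groups = np.unique(group_id)
    shuffled = rng.permutation(groups)
    # Round-robin assignment of shuffled groups to folds
    fold_of_group = {g: k % n_folds for k, g in enumerate(shuffled)}
    return np.array([fold_of_group[g] for g in group_id])
```

For estimation of Q̄₀, by contrast, the text above indicates that folds may be assigned unit by unit.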

In a randomized controlled trial, g₀(A_i | Wⁿ) is known by design, while $g_{0, i} (A_{i} ∣ W_{i})$ and thus ḡ₀(A_i | W_i) are not and must therefore still be estimated. In such cases, knowledge of the true design $g_{0}^{n}$ can be used to get a more accurate estimate of ḡ₀. Specifically, we have

g_{i} (1 ∣ W_{i}) = \sum_{(w_{j} : j \neq i)} g_{i} (1 ∣ (w_{j} : j \neq i), W_{i}) \prod_{j \neq i} Q_{W} (w_{j}) .

Thus, if $g_{0}^{n}$ is known, we can estimate Q_W with the empirical distribution, giving the estimator

g_{i, n} (1 ∣ W_{i}) = \sum_{w_{j} : j \neq i} g_{0, i} (1 ∣ (w_{j} : j \neq i), W_{i}) \prod_{j \neq i} Q_{W, n} (w_{j}),

and the corresponding estimator ${\bar{g}}_{n} = \frac{1}{n} \sum_{i} g_{i, n}$ of ḡ₀.
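The sum over all configurations of the other units' covariates is an expectation under the empirical distribution, so it can be approximated by Monte Carlo. The sketch below is an illustration under stated assumptions, not a procedure from the paper: `design_prob` is a made-up known design (treatment probability 0.7 when a unit's covariate exceeds the sample median, 0.3 otherwise), and `g_i_plugin` resamples the other units' covariates with replacement from the observed sample.

```python
import numpy as np

def design_prob(i, w_all):
    """Hypothetical known design P(A_i = 1 | W^n): depends on the whole
    sample through the sample median."""
    return 0.7 if w_all[i] > np.median(w_all) else 0.3

def g_i_plugin(i, w, design_prob, n_draws=2000, rng=None):
    """Monte Carlo approximation of g_{i,n}(1 | W_i): average the known
    design probability over draws of (w_j : j != i) from the empirical
    distribution Q_{W,n}, holding unit i's own covariate fixed."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(w)
    total = 0.0
    for _ in range(n_draws):
        w_star = rng.choice(w, size=n, replace=True)  # draws from Q_{W,n}
        w_star[i] = w[i]                              # keep W_i as observed
        total += design_prob(i, w_star)
    return total / n_draws
```

If the design treats units exchangeably, so that g_{0,i} is the same function for every i, then ḡ_n(1 | W_i) coincides with g_{i,n}(1 | W_i) and this single approximation suffices; otherwise the average over i must be formed explicitly.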

4.2 Statistical inference

In our main Theorem 2 below we assume that the design gⁿ = P(Aⁿ|Wⁿ) is known, but, as mentioned above, this still requires estimation of ḡ₀ through estimation of $Q_{W, 0}$. Theorem 2 shows that, under appropriate conditions, the standardized TMLE $\sqrt{n} (ψ_{n}^{*} - ψ_{0})$ converges in distribution to a mean zero normal distribution with variance $σ_{W}^{2} + σ_{Y}^{2}$ , where $σ_{Y}^{2}$ is consistently approximated with

σ_{Y, n}^{2} = \frac{1}{n} \sum_{j = 1}^{J} {f_{j, n} ({\bar{Q}}_{n}^{*}) (O_{i} : i \in C_{j} (W^{n}))}^{2},

with $f_{j, n} (\bar{Q}) = \sum_{i \in C_{j} (W^{n})} f_{i, n}^{1} (\bar{Q}) (O_{i})$ , and

f_{i, n}^{1} (\bar{Q}) \equiv \frac{2 A_{i} - 1}{{\bar{g}}_{n} (A_{i} ∣ W_{i})} (Y_{i} - \bar{Q} (A_{i}, W_{i})) - {\frac{g_{i} (1 ∣ W^{n})}{{\bar{g}}_{n} (1 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (1, W_{i}) - \frac{g_{i} (0 ∣ W^{n})}{{\bar{g}}_{n} (0 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (0, W_{i})} .

Estimation of the asymptotic variance $σ_{Y}^{2}$ appears to rely on a consistent estimator of Q̄₀. In addition, $σ_{W}^{2}$ is the variance of a function IC_W of W, which can thus be consistently estimated with $\frac{1}{n} \sum_{i = 1}^{n} {I C}_{W, n} {(W_{i})}^{2}$ if ${I C}_{W, n}$ is a consistent estimate of this unknown function IC_W. Specifically, IC_W is a sum of three functions, one of them being $D_{W}^{*} ({\bar{Q}}^{*}, Q_{W, 0}) = {\bar{Q}}^{*} (W) - Q_{W, 0} {\bar{Q}}^{*}$ , where Q̄(W) = Q̄(1, W) − Q̄(0, W) and $Q_{W, 0} {\bar{Q}}^{*} = \int_{w} {\bar{Q}}^{*} (w) d Q_{W, 0} (w)$ . This function $D_{W}^{*}$ is trivially consistently estimated by plugging in ${\bar{Q}}_{n}^{*}$ and the empirical distribution $Q_{W, n}$. If ${\bar{Q}}_{n}^{*}$ converges to Q̄₀, then the other two components of IC_W are equal to zero. If the conditional distribution $g_{0, i}^{n}$ of A_i, given Wⁿ, is equal to the conditional distribution $g_{0, i}$, given W_i, and the latter is constant in i, then these other two components of IC_W are also equal to zero, even if ${\bar{Q}}_{n}^{*}$ is inconsistent.

In general, one of these two components of IC_W is generated by the contribution of ḡ_n as an estimator of ḡ₀, assuming a plug-in estimator is used that exploits the fact that the distribution $g_{i}^{n}$ of A_i, given Wⁿ, is known. The influence curve of this contribution can be straightforwardly determined and is presented in Theorem 2. The other component concerns an average of differences $g_{0, i}^{n} - {\bar{g}}_{n}$ , indicating that $g_{0, i} (\cdot ∣ W^{n})$ has to converge, as n goes to infinity, to a fixed $g_{0, i} (\cdot ∣ W_{i})$ : thus, the dependence on the covariates of the other individuals l ≠ i has to be asymptotically negligible. If this convergence occurs fast enough this contribution may be equal to zero, but, in general, we allow for a contribution. This “asymptotic stability of the design” condition (i.e., $g_{i}^{n}$ converging to a fixed g_i) is analogous to the condition on the adaptive allocation probabilities in adaptive group sequential designs used to establish asymptotic normality of the TMLE, as studied in van der Laan (2008); Chambaz and van der Laan (2010). Either way, consistent estimation of $σ_{W}^{2}$ is possible without relying on a consistent estimator of Q̄₀. On the other hand, if $g_{0, i}^{n} (1 ∣ W^{n}) = g_{0, i} (1 ∣ W_{i})$ , then the influence curve is easily derived and is specified in Theorem 2, and, if $g_{0, i}$ is constant in i, then the contribution equals zero.

If $g_{i}^{n} = g_{i}$ , and ḡ_n = ḡ₀, and C_j(Wⁿ) are singletons, then it can be shown that the asymptotic variance is consistently estimated as

σ_{I, n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) (O_{i})}^{2},

(4.2)

which thus no longer relies on a consistent estimator of Q̄₀. Note that this latter variance estimator is the estimator one would have used if one treated the sample as n independent observations and ignored the adaptivity of the design.
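As a small numerical sketch of the estimator (4.2), the function below evaluates D*(Q̄, ḡ, ψ)(O_i) = H*(A_i, W_i)(Y_i − Q̄(A_i, W_i)) + Q̄(1, W_i) − Q̄(0, W_i) − ψ for each unit and averages its square. The function name `ic_variance` and its argument layout are assumptions made for illustration.

```python
import numpy as np

def ic_variance(y, a, q_obs, q1, q0, g1, psi):
    """Hypothetical helper: sample mean of the squared efficient
    influence curve, treating the n units as independent.

    q_obs  : fitted Qbar at the observed (A_i, W_i)
    q1, q0 : fitted Qbar at (1, W_i) and (0, W_i)
    g1     : gbar(1 | W_i)
    psi    : the plug-in estimate of the average treatment effect
    """
    g_obs = np.where(a == 1, g1, 1.0 - g1)
    d_y = (2 * a - 1) / g_obs * (y - q_obs)   # D_Y component
    d_w = (q1 - q0) - psi                     # D_W component
    return np.mean((d_y + d_w) ** 2)
```

The returned value estimates the variance of $\sqrt{n} (ψ_{n}^{*} - ψ_{0})$, so the standard error of ψ*_n is the square root of this value divided by n.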

In the special case of adaptive pair matching designs (and thus ḡ_n = ḡ₀ and ${I C}_{W} = D_{W}^{*}$ ), we prove below that, under a mild condition, this same estimator (4.2) of the asymptotic variance remains conservative if ${\bar{Q}}_{n}^{*}$ is inconsistent for the true Q̄₀. It remains to be determined whether this result also applies to other group sizes.

Finally, if the design $g_{0}^{n}$ is actually unknown and thus also needs to be estimated, and if we assume that this design is consistently estimated, then we conjecture that the asymptotic variance described above will be conservative, due to the general result that estimation of an orthogonal factor in the likelihood (i.e., the tangent space of the treatment mechanism is orthogonal to the tangent space of the relevant Q-factors) generally improves the asymptotic variance (Theorem 2.3 in van der Laan and Robins (2003)).

5 Theorem establishing asymptotic normality

We have the following theorem establishing the asymptotic normality of the TMLE presented in Section 4 and thereby in particular the basis for the variance estimator presented in Section 4.2.

Theorem 2

Let $P_{Q_{0}, g_{n}^{0}} f_{i}$ represent a conditional expectation of a function, given Wⁿ, which is thus still random through Wⁿ. In this theorem $g_{n}^{0} = P_{0} (A^{n} ∣ W^{n})$ is considered known. Let $\mathcal{F}$ be a set of multivariate real valued functions so that ${\bar{Q}}_{n}^{*}$ is an element of $\mathcal{F}$ with probability 1. Define

f_{i, n}^{1} (\bar{Q}) \equiv \frac{2 A_{i} - 1}{{\bar{g}}_{n} (A_{i} ∣ W_{i})} (Y_{i} - \bar{Q} (A_{i}, W_{i})) - {\frac{g_{0, i} (1 ∣ W^{n})}{{\bar{g}}_{n} (1 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (1, W_{i}) - \frac{g_{0, i} (0 ∣ W^{n})}{{\bar{g}}_{n} (0 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (0, W_{i})} .

Define $X_{n} (\bar{Q}) = 1 / \sqrt{n} \sum_{j = 1}^{J} {\sum_{i \in C_{j} (W^{n})} f_{i, n}^{1} (\bar{Q}) (O_{i})}$ , and note that, conditional on Wⁿ, this is a sum of J = J(n) independent mean zero random variables. Let

f_{j, n} \equiv \sum_{i \in C_{j} (W^{n})} f_{i, n}^{1} (O_{i}) .

We can represent X_n(Q̄) as $X_{n} (\bar{Q}) \equiv 1 / \sqrt{n} \sum_{j = 1}^{J} f_{j, n} (O_{i} : i \in C_{j} (W^{n}))$ . Let $D_{W}^{*} (Q_{0}) = {\bar{Q}}_{0} (W) - ψ_{0}$ .

Uniform bound

Assume $\max_{i \in \{1, \ldots, n\}} \sup_{\bar{Q} \in \mathcal{F}} \sup_{W^{n}, O} | f_{i, n} (\bar{Q}) (W^{n}, O) | < M < \infty$ , where the second supremum is over a support of (Wⁿ, A_i, Y_i).

Asymptotic linearity of function of ḡ_n

Assume that for a function ${I C}_{W, i, \bar{g}}$ of W with mean zero and finite variance (uniformly in i)

\begin{array}{l} Z_{W, n, {\bar{g}}_{n}} \equiv \frac{1}{\sqrt{n}} \sum_{i} P_{Q_{0}, g_{0, i}} {D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) - P_{Q_{0}, g_{0}, i} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{0}, ψ_{n}^{*})} \\ = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {I C}_{W, i, \bar{g}} (W_{i}) + o_{P} (1) . \end{array}

Note that if $g_{0, i}^{n} = P_{0}^{n} (A_{i} ∣ W^{n})$ equals $g_{0, i} = P_{0} (A_{i} ∣ W_{i})$ and is thus known, then ḡ_n = ḡ₀ so that $Z_{W, n, {\bar{g}}_{n}} = 0$. In general, under the required regularity conditions, we have

Z_{W, n, {\bar{g}}_{n}} \approx \frac{1}{\sqrt{n}} \sum_{k = 1}^{n} \left\{ I C (W_{k}) - E_{0} I C (W_{k}) \right\},

where

\begin{array}{l} I C (W_{k}) = \int_{w} (\frac{1}{n} \sum_{i = 1}^{n} \sum_{l \neq i}^{n} g_{i} (1 ∣ W_{i} = w, W_{l} = W_{k})) \frac{{\bar{Q}}_{0} - \bar{Q}}{{\bar{g}}_{0}} (1, w) {d Q}_{0} (w) \\ - \int_{w} (\frac{1}{n} \sum_{i = 1}^{n} \sum_{l \neq i}^{n} g_{i} (0 ∣ W_{i} = w, W_{l} = W_{k})) \frac{{\bar{Q}}_{0} - \bar{Q}}{{\bar{g}}_{0}} (0, w) {d Q}_{0} (w) . \end{array}

and $g_{i} (1 ∣ W_{i}, W_{l})$ is the conditional probability of A_i = 1, given W_i, W_l.

Asymptotic stability of treatment mechanism as function of covariates

Let

Z_{W, n, g^{n}} = Z_{1, g^{n}} + Z_{W, n},

where

Z_{1, g^{n}} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \frac{g_{i} (1 ∣ W^{n}) - {\bar{g}}_{n} (1 ∣ W_{i})}{{\bar{g}}_{n} (1 ∣ W_{i})} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (1, W_{i}) - \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \frac{g_{i} (0 ∣ W^{n}) - {\bar{g}}_{n} (0 ∣ W_{i})}{{\bar{g}}_{n} (0 ∣ W_{i})} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (0, W_{i}),

and

Z_{W, n} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {{\bar{Q}}_{0} (W_{i}) - ψ_{0}} .

Assume, for a function ${I C}_{W, i, g^{n}}$ of W with mean zero and finite variance, $Z_{1, g^{n}} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {I C}_{W, i, g^{n}} (W_{i}) + o_{P} (1)$ . Note that, if $g_{0, i}^{n} = g_{0, i}$ and $g_{0, i}$ is constant in i, then $Z_{1, g^{n}} = 0$. If $g_{0, i}^{n} = g_{0, i}$ , so that ḡ_n = ḡ₀, then

{I C}_{W, i, g^{n}} (W_{i}) = \frac{g_{0, i} (1 ∣ W_{i}) - {\bar{g}}_{0} (1 ∣ W_{i})}{{\bar{g}}_{0} (1 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (1, W_{i}) - \frac{g_{0, i} (0 ∣ W_{i}) - {\bar{g}}_{0} (0 ∣ W_{i})}{{\bar{g}}_{0} (0 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (0, W_{i}) .

Convergence of variances

Assume that for a specified {Σ₀(Q̄₁, Q̄₂): Q̄₁, Q̄₂ ∈ $\mathcal{F}$}, for any Q̄₁, Q̄₂ ∈ $\mathcal{F}$, $\frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g_{0}^{n}} f_{j, n} {({\bar{Q}}_{1})}^{2} \to \Sigma_{0} ({\bar{Q}}_{1}, {\bar{Q}}_{1})$ a.s. (i.e., for almost every (Wⁿ, n ≥ 1)), and

\frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g_{0}^{n}} f_{j, n} ({\bar{Q}}_{1}) f_{j, n} ({\bar{Q}}_{2}) \to \Sigma_{0} ({\bar{Q}}_{1}, {\bar{Q}}_{2}) \text{ a.s.}

(5.1)

For example, if C_j(Wⁿ) are singletons, the first condition holds if

\left[ \frac{1}{n} \sum_{i = 1}^{n} \frac{g_{0, i} (0 ∣ W^{n})}{{\bar{g}}_{n}^{2} (0 ∣ W_{i})} E_{0} ({(Y - \bar{Q} (0, W_{i}))}^{2} ∣ A_{i} = 0, W_{i}) + \frac{1}{n} \sum_{i = 1}^{n} \frac{g_{0, i} (1 ∣ W^{n})}{{\bar{g}}_{n}^{2} (1 ∣ W_{i})} E_{0} ({(Y - \bar{Q} (1, W_{i}))}^{2} ∣ A_{i} = 1, W_{i}) \right] \to \Sigma_{0} (\bar{Q}, \bar{Q}) \text{ a.s.}

Similarly for the convergence of the covariance. Note that this holds trivially if $g_{0, i} (1 ∣ W^{n}) = g_{0, i} (1 ∣ W_{i})$.

Convergence of ${\bar{Q}}_{n}^{*}$ to some limit

For any Q̄₁, Q̄₂ ∈ $\mathcal{F}$, we define

σ_{n}^{2} ({\bar{Q}}_{1} - {\bar{Q}}_{2}) = \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g_{0}^{n}} {f_{j, n} ({\bar{Q}}_{1}) - f_{j, n} ({\bar{Q}}_{2})}^{2},

where we note that the right-hand side indeed only depends on Q̄₁, Q̄₂ through its difference Q̄₁ − Q̄₂.

Assume that for a particular $\bar{Q}^{*} \in \mathcal{F}$, $σ_{n}^{2} ({\bar{Q}}_{n}^{*} - {\bar{Q}}^{*}) \to 0$ in probability as n → ∞.

Entropy condition

Let $\mathcal{F}^{d}$ = {f₁ − f₂ : f₁, f₂ ∈ $\mathcal{F}$}. Let $N (ε, σ_{n}, \mathcal{F}^{d})$ be the covering number of the class $\mathcal{F}^{d}$ w.r.t. the norm/dissimilarity ||f|| = σ_n(f). Assume that the class $\mathcal{F}^{d}$ satisfies

\lim_{δ_{n} \to 0} \int_{0}^{δ_{n}} \sqrt{\log N (ε, σ_{n}, \mathcal{F}^{d})} d ε = 0 .

Asymptotic equicontinuity of process

Then,

X_{n} ({\bar{Q}}_{n}^{*}) - X_{n} ({\bar{Q}}^{*}) \text{ converges to zero in probability, as } n \to \infty .

First order linear approximation

As a consequence,

\sqrt{n} (ψ_{n}^{*} - ψ_{0}) = X_{W, n} + X_{n} ({\bar{Q}}^{*}) + o_{P} (1),

where $X_{W, n} = 1 / \sqrt{n} \sum_{i = 1}^{n} {I C}_{W, i} (W_{i})$ , and ${I C}_{W, i} = {I C}_{W, i, \bar{g}} + {I C}_{W, i, g^{n}} + D_{W}^{*} (Q_{0})$ .

Asymptotic normality

In addition, $X_{W, n}$ converges in distribution to $N (0, σ_{W}^{2})$ , where

σ_{W}^{2} = \lim_{n \to \infty} \frac{1}{n} \sum_{i = 1}^{n} P_{0} {I C}_{W, i}^{2},

and $X_{n} ({\bar{Q}}^{*})$ converges to an independent $N (0, \Sigma_{0} ({\bar{Q}}^{*}, {\bar{Q}}^{*}))$ , so that

\sqrt{n} (ψ_{n}^{*} - ψ_{0}) \text{ converges in distribution to } N (0, σ_{0}^{2} \equiv σ_{W}^{2} + \Sigma_{0} ({\bar{Q}}^{*}, {\bar{Q}}^{*})) .

The asymptotic variance Σ₀(Q̄^*, Q̄^*) equals the limit of

\Sigma_{n} = \frac{1}{n} \sum_{j = 1}^{J} {f_{j, n} ({\bar{Q}}_{n}^{*}) (O_{i} : i \in C_{j} (W^{n}))}^{2} .

Under certain treatment allocation mechanisms $g_{0, i}^{n}$ , one might have that the contributions captured by $X_{W, n}$ in this theorem require a more general representation $X_{W, n} = 1 / \sqrt{n} \sum_{i = 1}^{n} f_{i} (W)$ , where f_i(W) has weak enough dependence on W_j with j ≠ i, so that such a process still converges weakly to a normal distribution. Depending on the applications of interest, such a more general representation of this theorem can be pursued with little extra work.

Note that $f_{j, n}$ in Σ_n still depends on Q̄₀, while ${\bar{Q}}_{n}^{*}$ estimates the possibly misspecified limit $\bar{Q}^{*}$. Thus Σ_n is not an estimator. In the special case that C_j(Wⁿ) are singletons, it follows that $σ_{0}^{2}$ can be consistently approximated with $1 / n \sum_{i} {\left\{ {I C}_{W} (Q_{n}^{*}) (W_{i}) + f_{i, n} ({\bar{Q}}_{n}^{*}) (O_{i}) \right\}}^{2}$ which equals

σ_{n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) (O_{i})}^{2},

if $g_{0, i}^{n} = g_{0, i}$ , and ḡ_n = ḡ₀ so that ${I C}_{W} = D_{W}^{*} ({\bar{Q}}^{*}, Q_{W, 0})$ .

Some of the conditions were discussed in Section 4.2 on statistical inference. The entropy condition corresponds with assuming that $\mathcal{F}$ is a Donsker class and is thus a natural condition that puts (minimal) restrictions on the size of the class $\mathcal{F}$. For example, one can define $\mathcal{F}$ as the class of multivariate real valued functions that have a uniform sectional variation norm bounded by a universal M < ∞ (van der Laan (1996); Gill et al. (1995)).

In order to demonstrate the condition (5.1), we consider the special case that C_j(Wⁿ) are singletons j = 1, …, n, and that g_i(A_i | Wⁿ) = g_i(A_i | W_i). We wish to show that $σ_{n}^{2} = 1 / n \sum_{i = 1}^{n} σ_{i}^{2} \to σ^{2}$ , where

\begin{array}{l} σ_{i}^{2} = P_{Q_{0}, g_{i}^{n}} {\left\{ (2 A_{i} - 1) / g_{i} (A_{i} ∣ W_{i}) (Y_{i} - \bar{Q} (A_{i}, W_{i})) \right\}}^{2} - {(\bar{Q} (W_{i}) - {\bar{Q}}_{0} (W_{i}))}^{2} \\ = P_{Q_{0}, g_{i}^{n}} 1 / g_{i}^{2} (A_{i} ∣ W_{i}) {(Y_{i} - \bar{Q} (A_{i}, W_{i}))}^{2} - {(\bar{Q} (W_{i}) - {\bar{Q}}_{0} (W_{i}))}^{2} \\ = \int_{a, y} \frac{1}{g_{i}^{2} (a ∣ W_{i})} {(y - \bar{Q} (a, W_{i}))}^{2} g_{i} (a ∣ W^{n}) Q_{Y, 0} (y ∣ a, W_{i}) - {(\bar{Q} (W_{i}) - {\bar{Q}}_{0} (W_{i}))}^{2} \\ = \int_{a, y} \frac{1}{g_{i} (a ∣ W_{i})} {(y - \bar{Q} (a, W_{i}))}^{2} Q_{Y, 0} (y ∣ a, W_{i}) - {(\bar{Q} (W_{i}) - {\bar{Q}}_{0} (W_{i}))}^{2} \\ = \frac{1}{g_{i} (0 ∣ W_{i})} E_{0} ({(Y - \bar{Q} (0, W_{i}))}^{2} ∣ A_{i} = 0, W_{i}) + \frac{1}{g_{i} (1 ∣ W_{i})} E_{0} ({(Y - \bar{Q} (1, W_{i}))}^{2} ∣ A_{i} = 1, W_{i}) - {(\bar{Q} (W_{i}) - {\bar{Q}}_{0} (W_{i}))}^{2} . \end{array}

So we need that

\frac{1}{n} \sum_{i = 1}^{n} \frac{1}{g_{i} (0 ∣ W_{i})} E_{0} ({(Y - \bar{Q} (0, W_{i}))}^{2} ∣ A_{i} = 0, W_{i}) + \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{g_{i} (1 ∣ W_{i})} E_{0} ({(Y - \bar{Q} (1, W_{i}))}^{2} ∣ A_{i} = 1, W_{i}) - \frac{1}{n} \sum_{i = 1}^{n} {(\bar{Q} (W_{i}) - {\bar{Q}}_{0} (W_{i}))}^{2} \to σ^{2} a . s .

By the law of large numbers, each of these empirical means converges.

6 Randomized trials with adaptive pair matching

Theorem 2 and the corresponding TMLE of the average treatment effect and variance estimator have important implications for the design and analysis of individual and cluster randomized trials with adaptive pair matching. In particular, previous literature on pair matched trials considered the pair as the unit of independence (Freedman et al. (1990); Campbell et al. (2007); Hayes and Moulton (2009); Imai et al. (2009); Imai (2008); Donner and Klar (2000); Murray (1998); Raudenbush et al. (2007); Klar and Donner (1997); Balzer et al. (2012)). This leaves open a number of key questions regarding the design and analysis of trials in which matched pairs are constructed by applying some algorithm to the baseline characteristics of the entire sample (adaptive pair matching). What is the most efficient estimator of the intervention’s effect in such studies? How should the variance of this estimator be estimated, given the dependence induced between units? And finally, under what conditions will adaptive pair matching provide a more efficient estimator than that provided by a non-matched design? In this section we consider the implications of Theorem 2 for each of these questions in turn.

6.1 Estimation of the average treatment effect

In a randomized trial with adaptive pair matching, an unadjusted difference of the mean outcome in the treated and untreated units will provide an unbiased estimate of the average treatment effect. However, adjustment for baseline covariates W that predict the outcome Y will result in efficiency gains. This raises the issue of how best to accomplish such adjustment in adaptively pair matched designs, with the dual goals of minimizing the variance of the resulting estimator and ensuring that it remains unbiased. Our Theorem 2 and the corresponding TMLE presented in Section 4 establish such an efficient and unbiased estimator of the average treatment effect.

Specifically, in such trials, $g_{0, i} (A_{i} ∣ W^{n}) = g_{0, i} (A_{i} ∣ W_{i}) = {\bar{g}}_{0} (A_{i} ∣ W_{i})$ is known to be equal to a constant (typically 0.5 for a trial with two arms) and need not be estimated (although estimation of ḡ₀ may improve efficiency). The clever covariate in the TMLE thus reduces to (2A_i − 1)/0.5. Thus, a TMLE for the average treatment effect in such a trial can be implemented by treating the data as if they were a sample of n i.i.d. units. Implementation of this estimator is described in detail in van der Laan and Rose (2012). Further, because A is randomized, as long as the initial estimator of Q̄₀ (the conditional mean of the unit-specific outcome given unit-specific covariates and treatment) is a least squares regression or logistic maximum likelihood regression that includes an intercept and the treatment A as a main term (still allowing for additional interaction terms between A and covariates W), no further update step is needed (the initial estimator is already a TMLE) (Rosenblum and van der Laan (2010); van der Laan and Rose (2012)).
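The claim that no update step is needed can be checked numerically for the least squares case. In the sketch below (an illustration with hypothetical names, not code from the paper), `fit_ols_outcome` fits a linear working model with an intercept, a main term for A, the covariate W, and an A·W interaction. Because the normal equations force the residuals to be orthogonal to the intercept and to A, the clever covariate (2A − 1)/0.5 is also orthogonal to the residuals, so the D_Y component of the efficient influence curve equation is already solved.

```python
import numpy as np

def fit_ols_outcome(y, a, w):
    """Hypothetical helper: least squares fit of Y on (1, A, W, A*W).
    Returns a function mapping a treatment vector to fitted values
    at the observed covariates w."""
    def design(a_vec):
        return np.column_stack([np.ones(len(w)), a_vec, w, a_vec * w])

    beta, *_ = np.linalg.lstsq(design(a), y, rcond=None)
    return lambda a_vec: design(a_vec) @ beta
```

With `q = fit_ols_outcome(y, a, w)`, the plug-in estimate is `np.mean(q(np.ones(n)) - q(np.zeros(n)))`, and `np.mean((2 * a - 1) / 0.5 * (y - q(a)))` is zero up to floating point error.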

6.2 Statistical inference

Again, we note that in a randomized trial with adaptive pair matching, we have ḡ_n = ḡ₀ = g₀. Our Theorem 2 shows that, under regularity conditions, the TMLE with ${\bar{Q}}_{n}^{*}$ converging to $\bar{Q}^{*}$ is consistent and asymptotically normally distributed, $\sqrt{n} (ψ_{n}^{*} - ψ_{0}) \Rightarrow_{d} N (0, σ^{2})$ , where $σ^{2} = σ_{W}^{2} + σ_{Y}^{2}, σ_{W}^{2} = E {({\bar{Q}}_{0} (W) - ψ_{0})}^{2}$ and $σ_{Y}^{2}$ is the limit of

\frac{1}{n} \sum_{j} P_{Q_{0}, g_{0}^{n}} {f_{j, n} ({\bar{Q}}^{*}, {\bar{Q}}_{0})}^{2} = \frac{1}{n} \sum_{j} P_{Q_{0}, g_{0}^{n}} {(\sum_{i \in C_{j} (W^{n})} H_{{\bar{g}}_{0}} (A_{i}, W_{i}) (Y_{i} - {\bar{Q}}^{*}))}^{2} - \frac{1}{n} \sum_{j} {(\sum_{i \in C_{j} (W^{n})} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (W_{i}))}^{2} .

Given a consistent estimator $σ_{n}^{2}$ of σ², $ψ_{n}^{*} \pm 1.96 σ_{n} / \sqrt{n}$ is an asymptotic 0.95-confidence interval. A test of the null hypothesis H₀ : ψ₀ = ψ⁰ can be based on the test statistic $T_{n} = \sqrt{n} (ψ_{n}^{*} - ψ^{0}) / σ_{n}$ which is approximately standard normal under H₀.
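The interval and test statistic just described amount to a few lines of arithmetic. The sketch below uses hypothetical names (`wald_inference`, `psi_null`); the two-sided p-value computed from the standard normal distribution is an addition for convenience and is not stated in the text.

```python
import math

def wald_inference(psi, sigma2, n, psi_null=0.0):
    """Hypothetical helper: 95% confidence interval and test of
    H0: psi_0 = psi_null, given an estimate sigma2 of the asymptotic
    variance of sqrt(n) * (psi_n - psi_0)."""
    se = math.sqrt(sigma2 / n)
    ci = (psi - 1.96 * se, psi + 1.96 * se)
    t_stat = (psi - psi_null) / se           # equals sqrt(n) * (psi - psi_null) / sigma
    # Two-sided p-value under the standard normal approximation
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return ci, t_stat, p_value
```

For instance, with `psi = 0.1`, `sigma2 = 4.0` and `n = 400` the standard error is 0.1, so the interval is 0.1 ± 0.196 and the test statistic for a null value of zero is 1.0.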

In order to implement an estimator of σ², $σ_{W}^{2}$ can be estimated as $\frac{1}{n} \sum_{i = 1}^{n} {({\bar{Q}}_{n} (W_{i}) - ψ_{n}^{*})}^{2}$ . Estimation of $σ_{Y}^{2}$ can be based on Theorem 2, which shows that $σ_{Y}^{2}$ is consistently approximated by $1 / n \sum_{j} {f_{j, n} ({\bar{Q}}_{n}^{*}, {\bar{Q}}_{0})}^{2}$ . In implementing a substitution estimator of $σ_{Y}^{2}$ , we naturally replace $\bar{Q}^{*}$ by ${\bar{Q}}_{n}^{*}$ (the updated fit of Q̄₀ on which the TMLE substitution estimator of ψ₀ is based). However, $f_{j, n} ({\bar{Q}}_{n}^{*}, {\bar{Q}}_{0})$ still depends on Q̄₀. When implementing a consistent estimator of the asymptotic variance σ², one may need to estimate Q̄₀ with a super learner in order to minimize its bias (van der Laan et al. (2007)). In particular, even if a simple parametric regression based estimator was used as initial estimator of Q̄₀ when implementing the TMLE for ψ₀, a more flexible approach to estimating Q̄₀ is warranted when estimating σ² in order to minimize bias in the resulting variance estimator.

A substitution estimator of this asymptotic variance σ² is now given by

\begin{array}{l} σ_{n}^{2} = σ_{W, n}^{2} + σ_{Y, n}^{2} \\ σ_{W, n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {{\bar{Q}}_{n} (W_{i}) - ψ_{n}^{*}}^{2} \\ σ_{Y, n}^{2} = \frac{1}{n} \sum_{j} {(\sum_{i \in C_{j} (W^{n})} H_{{\bar{g}}_{0}} (A_{i}, W_{i}) (Y_{i} - {\bar{Q}}_{n}^{*} (A_{i}, W_{i})) - \sum_{i \in C_{j} (W^{n})} ({\bar{Q}}_{n} - {\bar{Q}}_{n}^{*}) (W_{i}))}^{2} \end{array}
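A sketch of how this substitution estimator could be computed is given below. The function name `paired_variance` and its arguments are assumptions for illustration: `qstar_obs` is the updated fit at the observed (A_i, W_i), `qstar_diff` and `qflex_diff` are the differences Q̄(1, W_i) − Q̄(0, W_i) under the updated fit and under a separate flexible estimator of Q̄₀, and `pair_id` labels the groups C_j(Wⁿ).

```python
import numpy as np

def paired_variance(y, a, pair_id, qstar_obs, qstar_diff, qflex_diff,
                    psi, g1=0.5):
    """Hypothetical helper: substitution estimator
    sigma2_n = sigma2_W_n + sigma2_Y_n with group-level sums."""
    n = len(y)
    g_obs = np.where(a == 1, g1, 1.0 - g1)
    h = (2 * a - 1) / g_obs

    # sigma^2_{W,n}: variance of the flexible fit of Qbar_0(W) around psi
    sigma2_w = np.mean((qflex_diff - psi) ** 2)

    # Unit-level contribution, summed within each group before squaring
    unit_term = h * (y - qstar_obs) - (qflex_diff - qstar_diff)
    _, group = np.unique(pair_id, return_inverse=True)
    group_sums = np.bincount(group, weights=unit_term)
    sigma2_y = np.sum(group_sums ** 2) / n

    return sigma2_w + sigma2_y, sigma2_w, sigma2_y
```

Summing within groups before squaring is what distinguishes this estimator from one that treats the n units as independent.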

Interestingly, even though we constructed an estimator of ψ₀ that is guaranteed consistent and asymptotically normally distributed at misspecified ${\bar{Q}}_{n}^{*}$ , as long as $g_{0, i}^{n}$ is known or well estimated, it seems that in general no such robust estimator of the asymptotic variance σ² is available. Fortunately, below we will construct conservative variance estimators that do not rely on a consistent estimator of Q̄₀.

6.3 Robust conservative estimation of the variance

In the special case of an adaptively pair matched design, we have the following theorem, which establishes a simpler but conservative estimate of the asymptotic variance σ². The proof is analogous to that of Theorem 4 below.

Theorem 3

Suppose $g_{n}^{0} \in G_{2}^{n}$ , in which case C_j(Wⁿ) are pairs. As above, assume that $g_{0, i} (A_{i} ∣ W^{n}) = g_{0} (A_{i} ∣ W_{i})$ is known, so that ḡ₀ = g₀ is known as well. The asymptotic variance σ² of the TMLE for ${\bar{Q}}_{n}^{*}$ converging to $\bar{Q}^{*}$, i.e., $σ^{2} = σ_{W}^{2} + σ_{Y}^{2}$ in Theorem 2, can be represented as follows:

σ^{2} = P_{Q_{0}, g_{0}} {D^{*} ({\bar{Q}}^{*}, g_{0}, ψ_{0})}^{2} - C,

where

\begin{array}{l} C \equiv E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2 j}) . \end{array}

(6.1)

Under the assumption that the covariance-term C is positive, a conservative estimate of σ² is thus given by:

σ_{I, n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {D^{*} ({\bar{Q}}_{n}^{*}, g_{0}, ψ_{n}^{*}) (O_{i})}^{2} .

This theorem, together with Theorem 2, implies that randomized trials with adaptive pair matching can be analyzed ignoring the matching process, both in order to generate an efficient and unbiased point estimator of the treatment effect and for inference on this estimator. Under the assumption that the last covariance-term is positive, as would be expected if units were effectively matched on predictors of the outcome, a variance estimator that treats the data as if they were i.i.d. will be conservative if ${\bar{Q}}_{n}^{*}$ is inconsistent for Q̄₀, while it remains asymptotically consistent if ${\bar{Q}}_{n}^{*}$ is consistent.

In general, one can aim to construct a target C_l for which it is known that C_l ≤ C, and estimate the variance with $σ_{I, n}^{2} - C_{l, n}$ , where $C_{l, n}$ is a consistent estimator of C_l. In this case, one aims to find such a C_l that can be consistently estimated without relying on a consistent estimator of Q̄₀. This will be carried out in the next two subsections, resulting in a possibly much less conservative variance estimator.

6.4 Comparison of the “naive” variance estimator with the true variance

Let us consider the case that we use the unadjusted regression of Y on A as initial estimator so that ${\bar{Q}}_{n}^{*} (a, W) = {\bar{Q}}_{n}^{*} (a)$ , a ∈ {0, 1}, does not depend on W, and the TMLE $ψ_{n}^{*} = {\bar{Q}}_{n}^{*} (1) - {\bar{Q}}_{n}^{*} (0)$ equals the mean outcome over the n/2 pairs C_j(Wⁿ) of Σ_{i∈C_j(Wⁿ)}A_iY_i − (1 − A_i)Y_i, j = 1, …, n/2:

ψ_{n}^{*} = \frac{1}{n / 2} \sum_{j = 1}^{n / 2} \sum_{i \in C_{j} (W^{n})} (A_{i} Y_{i} - (1 - A_{i}) Y_{i}) .

Above we presented an expression (6.1) for the true asymptotic variance of the standardized TMLE, $\sqrt{n} (ψ_{n}^{*} - ψ_{0})$ , which can be applied to this simple sample average $ψ_{n}^{*}$ . This expression σ² was represented as the asymptotic variance $σ_{I}^{2}$ of this TMLE under i.i.d. sampling from $P_{Q_{0}, g_{0}}$, i.e. $σ_{I}^{2} = P_{Q_{0}, g_{0}} D^{*} {({\bar{Q}}^{*}, g_{0}, ψ_{0})}^{2}$ , minus a sum of four terms defined as

\begin{array}{l} C \equiv E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2 j}) . \end{array}

Let (W₁, A₁, Y₁), (W₂, A₂, Y₂) denote the observations in the two units in a pair, where either (A₁, A₂) = (1, 0) or (A₁, A₂) = (0, 1). For notational convenience, we will denote this term C as

\begin{array}{l} C \equiv E_{0} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2}) \\ + E_{0} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2}) \\ + E_{0} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2}) \\ + E_{0} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2}) . \end{array}

Since we use the unadjusted estimator so that D^*(Q̄^*, g₀, ψ₀)(W, A, Y) = (2A − 1)/g₀(A)(Y − Q̄^*(A)), and g₀(A) = 0.5, we have

P_{Q_{0}, g_{0}} D^{*} {({\bar{Q}}^{*}, g_{0}, ψ_{0})}^{2} = 2 {σ_{1}^{2} + σ_{0}^{2}},

where $σ_{1}^{2} = E_{0} {(Y (1) - ψ_{0} (1))}^{2}$ and $σ_{0}^{2} = E_{0} {(Y (0) - ψ_{0} (0))}^{2}$ . We conclude that the true asymptotic variance of $\sqrt{n} (ψ_{n}^{*} - ψ_{0})$ is given by

σ^{2} = 2 {σ_{1}^{2} + σ_{0}^{2}} - C .

Let us compare this true asymptotic variance σ²/n of the unadjusted estimator with the variance estimate used in current practice, which we will refer to as the “naive” variance estimator. Current practice assumes that the n/2 pairs are i.i.d. and estimates the asymptotic variance of $\sqrt{n / 2} (ψ_{n}^{*} - ψ_{0})$ with the sample variance of the average of the difference across the pairs:

0.5 σ_{n, naive}^{2} = \frac{1}{n / 2} \sum_{j = 1}^{n / 2} {(Y_{1 j} A_{1 j} + Y_{2 j} A_{2 j} - Y_{1 j} (1 - A_{1 j}) - Y_{2 j} (1 - A_{2 j}) - ψ_{n}^{*})}^{2} .

This converges for n → ∞ to

0.5 σ_{naive}^{2} = σ_{0}^{2} + σ_{1}^{2} - (ρ_{1} + ρ_{2}),

where

\begin{array}{l} ρ_{1} = E_{0} ({\bar{Q}}_{0} (1, W_{1}) - ψ_{0} (1)) ({\bar{Q}}_{0} (0, W_{2}) - ψ_{0} (0)) \\ ρ_{2} = E_{0} ({\bar{Q}}_{0} (0, W_{1}) - ψ_{0} (0)) ({\bar{Q}}_{0} (1, W_{2}) - ψ_{0} (1)) . \end{array}
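The limit above can be motivated by a short second-moment calculation. The sketch below assumes that exactly one unit per pair is treated, each with probability 0.5 and independently of the potential outcomes given Wⁿ, and that outcomes are conditionally independent across units given Wⁿ. The shorthand Y_t and Y_c for the treated and control outcomes within a pair is introduced here only for illustration.

```latex
\begin{aligned}
E_0 \{ (Y_t - \psi_0(1)) - (Y_c - \psi_0(0)) \}^2
  &= \sigma_1^2 + \sigma_0^2 - 2 E_0 (Y_t - \psi_0(1))(Y_c - \psi_0(0)), \\
E_0 (Y_t - \psi_0(1))(Y_c - \psi_0(0))
  &= \tfrac{1}{2} \rho_1 + \tfrac{1}{2} \rho_2 , \\
\text{so that}\quad 0.5\, \sigma_{naive}^2
  &= \sigma_0^2 + \sigma_1^2 - (\rho_1 + \rho_2).
\end{aligned}
```

The factor 1/2 in the second line arises because (A₁, A₂) = (1, 0) yields the cross term ρ₁ and (A₁, A₂) = (0, 1) yields ρ₂.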

The true asymptotic variance and the naive asymptotic variance are given by σ²/n and $(0.5 σ_{naive}^{2}) / (n / 2) = σ_{naive}^{2} / n$ , respectively. As a consequence, the relevant comparison is the comparison of σ² with $σ_{naive}^{2}$ , where

\begin{array}{l} σ^{2} = 2 \{σ_{1}^{2} + σ_{0}^{2}\} - C \\ σ_{naive}^{2} = 2 \{σ_{1}^{2} + σ_{0}^{2}\} - 2 (ρ_{1} + ρ_{2}) . \end{array}

To show that the naive variance estimator is conservative, we would need to show that

2 (ρ_{1} + ρ_{2}) \leq C .

Notice that C = ρ₁ + ρ₂ + C₁, where

C_{1} = E_{0} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2}) + E_{0} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2}) .

Thus, the naive variance estimator would be conservative if ρ₁ + ρ₂ ≤ C₁. Note that we can also represent this as:

\begin{array}{l} C_{1} - ρ_{1} - ρ_{2} = Cov ({\tilde{Q}}_{0} (1, W_{1}), {\tilde{Q}}_{0} (1, W_{2})) + Cov ({\tilde{Q}}_{0} (0, W_{1}), {\tilde{Q}}_{0} (0, W_{2})) - Cov ({\tilde{Q}}_{0} (1, W_{1}), {\tilde{Q}}_{0} (0, W_{2})) - Cov ({\tilde{Q}}_{0} (0, W_{1}), {\tilde{Q}}_{0} (1, W_{2})) \\ = Cov ({\tilde{Q}}_{0} (W_{1}), {\tilde{Q}}_{0} (W_{2})), \end{array}

where Cov(X, Y) = E(XY) denotes the standard covariance between two mean zero random variables X and Y, and we introduced the notation Q̃₀(W) = (Q̃₀(1, W) − Q̃₀(0, W)) and Q̃₀(a, W) = (Q̄₀ − Q̄^*)(a, W). Thus, if the latter covariance-term Cov(Q̃₀(W₁), Q̃₀(W₂)) is non-negative, then the naive variance estimator is conservative. This is a very reasonable condition certainly expected to hold. Thus, we can conclude that in great generality the naive variance estimator is a conservative estimator. We also note that if in truth there is no treatment effect, conditional on covariates, then this covariance term equals zero, so that the naive variance estimator is unbiased.
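To see why the non-negativity condition is plausible, it may help to look at an idealized limiting case that is not part of the argument above: suppose matching were exact, so that W₁ = W₂ = W within every pair. Under that assumption the covariance reduces to a variance:

```latex
\begin{aligned}
Cov(\tilde{Q}_0(W_1), \tilde{Q}_0(W_2))
  &= E_0 \tilde{Q}_0(W)^2 \\
  &= Var\left( \bar{Q}_0(1, W) - \bar{Q}_0(0, W) \right) \;\geq\; 0 ,
\end{aligned}
```

with equality exactly when the conditional treatment effect Q̄₀(1, W) − Q̄₀(0, W) is constant in W. Imperfect matching would be expected to shrink this quantity towards zero rather than make it negative, although that is a heuristic and not something this sketch proves.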

6.5 A general conservative estimator of the asymptotic variance of TMLE

Above we presented the naive variance estimator of the unadjusted estimator and showed that it is conservative in great generality. In this subsection we propose a generalization of this estimator to obtain a conservative estimator of the asymptotic variance of the general TMLE (using a general initial estimator).

Recall that C = (ρ₁ + ρ₂) + C₁, and note that ρ̄ = ρ₁ + ρ₂ can be consistently estimated with ${\bar{ρ}}_{n} = 2 / J \sum_{j = 1}^{J} (Y_{1 j} - {\bar{Q}}_{n}^{*} (A_{1 j}, W_{1 j})) (Y_{2 j} - {\bar{Q}}_{n}^{*} (A_{2 j}, W_{2 j}))$ . Above we showed that we obtain a conservative bound for C by replacing C₁ by ρ̄, so that C can be conservatively estimated by 2ρ̄_n. Thus, a general conservative estimator of the asymptotic variance σ² of $\sqrt{n} (ψ_{n}^{*} - ψ_{0})$ is given by

σ_{n}^{2} = σ_{I, n}^{2} - 2 {\bar{ρ}}_{n},

where

σ_{I, n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} D^{*} ({\bar{Q}}_{n}^{*}, g_{0}, ψ_{n}^{*}) {(O_{i})}^{2} .

This estimator can be viewed as the generalization of the “naive” variance estimator for the unadjusted estimator of ψ₀, analyzed in the previous subsection.
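For the unadjusted estimator the two formulas can be compared numerically. The following Python sketch assumes g₀ = 0.5 and takes Q̄ₙ^*(a, w) to be the arm-specific sample mean, so that the W-component of D^* is zero; the function names and the `pairs` array layout (one row of two unit indices per pair) are illustrative choices, not notation from this paper.

```python
import numpy as np


def unadjusted_effect(y, a):
    """Difference in arm-specific sample means, psi_n."""
    return y[a == 1].mean() - y[a == 0].mean()


def naive_pair_variance(y, a, pairs):
    """sigma^2_{n,naive}: twice the mean squared deviation of the
    treated-minus-control pair differences from psi_n."""
    psi = unadjusted_effect(y, a)
    y_pair, a_pair = y[pairs], a[pairs]            # both of shape (J, 2)
    treated = (y_pair * a_pair).sum(axis=1)
    control = (y_pair * (1 - a_pair)).sum(axis=1)
    return 2.0 * np.mean((treated - control - psi) ** 2)


def general_conservative_variance(y, a, pairs):
    """sigma^2_n = sigma^2_{I,n} - 2 * rho_bar_n for the unadjusted fit."""
    arm_mean = np.where(a == 1, y[a == 1].mean(), y[a == 0].mean())
    resid = y - arm_mean
    # Influence-curve values; the W-component vanishes for arm means.
    d_star = (2 * a - 1) / 0.5 * resid
    sigma_i2 = np.mean(d_star ** 2)
    # rho_bar_n = (2 / J) * sum_j resid_{1j} * resid_{2j}
    rho_bar = 2.0 * np.mean(resid[pairs[:, 0]] * resid[pairs[:, 1]])
    return sigma_i2 - 2.0 * rho_bar
```

Expanding the square in the naive estimator shows that, under these assumptions, the two functions return the same number for any data set with exactly one treated unit per pair, which is the sense in which σₙ² generalizes the naive estimator.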

6.6 A simulation confirming the variance formula for the unadjusted estimator

To confirm our conclusions regarding the asymptotic variance of the unadjusted estimator, consider the following simple simulations. For n units, the baseline covariates W1 and W2 were independently drawn from N(0, 0.2²) and U(−1, 1), respectively. Then the following adaptive matching algorithm was employed. First, units were classified into a matching category M, representing the 16 quartile combinations of W1 and W2. Within each stratum of M, units were randomly paired. If there was an odd number of units in a given stratum, the remaining unit was set aside. The leftovers were then ordered according to M and paired. Next, the treatment was randomized within the n/2 matched pairs. Finally, the binary outcome Y was drawn independently for each unit with probability

p = expit [β_{0} + β_{1} A + β_{2} W1 + β_{3} W1 \cdot A + β_{4} W2^{2}]

(6.2)

where expit is the inverse logit function and the coefficients were set as β₀ = −1, β₁ = −0.5, β₂ = 3, β₃ = −2 and β₄ = 2. The target causal parameter is the average treatment effect. It had a true value of ψ₀ = −0.11 in this data generating experiment (“Scenario 1”). The coefficients were then also varied to examine the asymptotic variance of the unadjusted estimator in different data generating experiments. In Scenario 2 there is no treatment effect: β₁ = β₃ = 0. In Scenario 3 the baseline covariates (used for matching) have no effect on the outcome. Specifically, β₂, β₃ and β₄ were set to zero to yield an average treatment effect of −0.08.
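The data generating experiment just described can be sketched in Python as follows. The original computations were done in R, so this is a re-expression under stated assumptions rather than the authors' code: quartiles are taken to be sample quartiles, leftovers are paired consecutively after ordering by M, and the names `adaptive_pairs` and `simulate_trial` are hypothetical.

```python
import numpy as np

# Scenario 1 coefficients (beta_0, ..., beta_4) of Eq. 6.2.
BETA_SCENARIO_1 = (-1.0, -0.5, 3.0, -2.0, 2.0)


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))


def adaptive_pairs(w1, w2, rng):
    """Pair units at random within the 16 quartile-by-quartile strata."""
    cuts = [0.25, 0.5, 0.75]
    q1 = np.searchsorted(np.quantile(w1, cuts), w1)   # quartile index 0..3
    q2 = np.searchsorted(np.quantile(w2, cuts), w2)
    m = 4 * q1 + q2                                   # matching category M
    pairs, leftovers = [], []
    for stratum in range(16):
        idx = rng.permutation(np.flatnonzero(m == stratum))
        if idx.size % 2 == 1:                         # set one unit aside
            leftovers.append(idx[-1])
            idx = idx[:-1]
        pairs.extend(idx.reshape(-1, 2).tolist())
    # Leftovers are already ordered by M; pair them consecutively.
    pairs.extend(np.array(leftovers, dtype=int).reshape(-1, 2).tolist())
    return np.array(pairs, dtype=int).reshape(-1, 2)


def simulate_trial(n, beta, rng):
    """Draw (W1, W2), match, randomize within pairs, and draw Y."""
    if n % 2:
        raise ValueError("n must be even so that every unit is paired")
    w1 = rng.normal(0.0, 0.2, size=n)
    w2 = rng.uniform(-1.0, 1.0, size=n)
    pairs = adaptive_pairs(w1, w2, rng)
    a = np.zeros(n, dtype=int)
    treated_slot = rng.integers(0, 2, size=len(pairs))  # coin flip per pair
    a[pairs[np.arange(len(pairs)), treated_slot]] = 1
    b0, b1, b2, b3, b4 = beta
    p = expit(b0 + b1 * a + b2 * w1 + b3 * w1 * a + b4 * w2 ** 2)
    y = rng.binomial(1, p)
    return w1, w2, a, y, pairs
```

Repeating `simulate_trial` many times and taking n times the empirical variance of the difference in arm means gives a Monte Carlo estimate of the kind reported as nVar(ψₙ^*).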

For each scenario, the true finite sample variance $Var (ψ_{n}^{*})$ was estimated as the variance of the unadjusted estimator over R = 10,000 trials, each of sample size n = 500 units, and the corresponding asymptotic variance estimate $n Var (ψ_{n}^{*})$ was reported. Table 1 compares this estimate of the true asymptotic variance with the claimed asymptotic variance $σ^{2} = σ_{I}^{2} - C$ calculated according to Theorem 3, and with the asymptotic naive variance treating the pairs as independent $σ_{naive}^{2}$ . The asymptotic variances σ² and $σ_{naive}^{2}$ , as well as C and ρ̄, were computed with Monte Carlo simulation of 50,000 units. All statistical computing was done in R version 2.15.1. In addition, recall our claim that C − 2ρ̄ ≥ 0, which implies that $σ_{naive}^{2} = σ_{I}^{2} - 2 \bar{ρ}$ is conservative.

Table 1.

Comparing the true finite sample variance of the unadjusted estimator scaled by n, $nVar (ψ_{n}^{*})$ , the asymptotic variance σ² according to Theorem 3, and the naive asymptotic variance treating the pairs as independent, $σ_{naive}^{2}$ . Scenario 1 corresponds to the setting β₀ = −1, β₁ = −0.5, β₂ = 3, β₃ = −2 and β₄ = 2 in Eq. 6.2. Scenario 2 corresponds to setting β₁ and β₃ to zero in order to examine the asymptotic variance if the intervention has no effect on the outcome. Scenario 3 corresponds to setting β₂, β₃ and β₄ to zero in order to examine the asymptotic variance if the baseline covariates (used for matching) have no effect on the outcome. For each scenario, the correction factor C and 2ρ̄ are also given.

| | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| $nVar (ψ_{n}^{*})$ | 0.8408 | 0.8708 | 0.6833 |
| σ² | 0.8523 | 0.8729 | 0.6915 |
| $σ_{naive}^{2}$ | 0.8591 | 0.8729 | 0.6915 |
| C | 0.0712 | 0.1060 | 0.0000 |
| 2ρ̄ | 0.0643 | 0.1060 | −0.0000 |

In all scenarios, the simulated variance $nVar (ψ_{n}^{*})$ of the unadjusted estimator and the asymptotic variance σ² claimed by Theorem 3 are in close agreement. The simulation for Scenario 1 also confirms that $σ_{naive}^{2} = σ_{I}^{2} - 2 \bar{ρ}$ is indeed conservative, but close to the true asymptotic variance. In Scenario 2, where there is no treatment effect, the correction factors coincide: C = 2ρ̄. In Scenario 3 we have C = 2ρ̄ = 0. Indeed, in both of these scenarios we see perfect agreement between $σ_{naive}^{2}$ and the true asymptotic variance σ².
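Because σ² = σ_I² − C and σ²_naive = σ_I² − 2ρ̄, the reported values should satisfy σ²_naive − σ² = C − 2ρ̄ up to rounding. A small Python check of the Table 1 entries, written here only as a reader-side sanity check (the dictionary keys are arbitrary labels):

```python
# Table 1 entries: (sigma^2, sigma^2_naive, C, 2 * rho_bar) per scenario.
table_1 = {
    "Scenario 1": (0.8523, 0.8591, 0.0712, 0.0643),
    "Scenario 2": (0.8729, 0.8729, 0.1060, 0.1060),
    "Scenario 3": (0.6915, 0.6915, 0.0000, 0.0000),
}

# Gap between the naive and true asymptotic variances, computed two ways.
gaps = {
    name: (naive - sigma2, c - two_rho)
    for name, (sigma2, naive, c, two_rho) in table_1.items()
}

for name, (variance_gap, correction_gap) in gaps.items():
    print(f"{name}: naive - true = {variance_gap:.4f}, "
          f"C - 2*rho_bar = {correction_gap:.4f}")
```

The two gaps agree to within the four-decimal rounding of the table in every scenario.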

6.7 Efficiency gains due to adaptive pair matching

In this section we compare two design choices regarding $g_{0}^{n}$ . In the first, we simply assume that $g_{0}^{n} (A^{n} ∣ X^{n}) = \prod_{i = 1}^{n} g_{0} (A_{i} ∣ W_{i})$ for a common g₀. In this case, (W_i, A_i, Y_i), i = 1, …, n, are i.i.d. This design includes classic non-matched randomized trials in which treatment is randomly assigned with some known probability, possibly conditional on unit-specific covariates.

We compare this design to a design employing adaptive pair matching. In other words, in the second design we assume $g_{0}^{n} \in G_{2}^{n}$ with $g_{0, i}^{n} = g_{0}$ , so that $g_{0}^{n} (A^{n} ∣ X^{n}) = \prod_{j = 1}^{n / 2} g_{0} (A_{i} : i \in C_{j} (W^{n}) ∣ W^{n})$ and the marginal P (A_i = a | Wⁿ) = g₀(a | W_i), i = 1, …, n.

We compare the asymptotic variance of the TMLEs under these two designs when ${\bar{Q}}_{n}^{0}$ converges to a possibly misspecified Q̄^*. This provides insight into the efficiency gains made possible by adaptive pair matching. We assume that g₀ is known, so that ḡ_n = g₀, as would be the case in both a non-matched and an adaptively matched randomized trial.

Theorem 4

Under the i.i.d. design, the TMLE is asymptotically linear with influence curve D^*(Q̄^*, g₀, ψ₀), so that its asymptotic variance is given by $σ_{I}^{2} ({\bar{Q}}^{*}) = P_{0} {D^{*} ({\bar{Q}}^{*}, g_{0}, ψ_{0})}^{2}$ . This variance can be represented as

σ_{I}^{2} ({\bar{Q}}^{*}) = E_{0} \{{\bar{Q}}_{0} (W) - ψ_{0}\}^{2} + E_{0} E_{0} (H_{g_{0}}^{2} (A, W) {(Y - {\bar{Q}}^{*} (A, W))}^{2} ∣ W) - E_{0} \{{\bar{Q}}_{0} (W) - {\bar{Q}}^{*} (W)\}^{2} .

For the adaptive paired matching design the asymptotic variance σ²(Q̄^*) of the TMLE is given by the limit of

E_{0} \{{\bar{Q}}_{0} (W) - ψ_{0}\}^{2} + E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} P_{Q_{0}, g^{n}} \left\{\sum_{i \in C_{j} (W^{n})} H_{g_{0}} (A_{i}, W_{i}) (Y_{i} - {\bar{Q}}^{*} (A_{i}, W_{i}))\right\}^{2} - E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \left\{\sum_{i \in C_{j} (W^{n})} \{{\bar{Q}}_{0} (W_{i}) - {\bar{Q}}^{*} (W_{i})\}\right\}^{2}

This can be represented as:

σ^{2} ({\bar{Q}}^{*}) = E_{0} \{{\bar{Q}}_{0} (W) - ψ_{0}\}^{2} + E_{0} E_{0} (H_{g_{0}}^{2} (A, W) {(Y - {\bar{Q}}^{*} (A, W))}^{2} ∣ W) - E_{0} \{{\bar{Q}}_{0} (W) - {\bar{Q}}^{*} (W)\}^{2} - C,

where C = (ρ₁ + ρ₂) + C₁ was defined above as the sum of four terms, with

C_{1} = E_{0} \frac{1}{J} \sum_{j = 1}^{J} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2 j}) + E_{0} \frac{1}{J} \sum_{j = 1}^{J} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2 j}) .

The difference between the two asymptotic variances is thus given by:

σ_{I}^{2} ({\bar{Q}}^{*}) - σ^{2} ({\bar{Q}}^{*}) = C .

If Q̄^* = Q̄₀, the two asymptotic variances are equal. If Q̄^*(A, W) = E₀(Y | A), then the difference is the sum C of the four covariances.

This theorem teaches us that, while the information bound for the two designs is the same, the TMLE under adaptive pair matching at misspecified Q̄^* will outperform the TMLE under i.i.d. sampling, as long as C > 0. This theorem further suggests that pair matching will result in efficiency gains over the i.i.d. design to the extent that there are baseline covariates W that are predictive of Y which cannot be adjusted for in the outcome regression. Such a scenario might occur in finite samples due to lack of support in the data. For example, in a cluster randomized trial of an HIV prevention intervention, the sample of communities might include only two communities in proximity to a major trucking route, a community characteristic known to predict higher HIV transmission levels. If by chance in the i.i.d. design both of these communities were assigned to the treatment arm of the trial, lack of data support would preclude adjustment for this community-level covariate and thus pair matching on this covariate would result in efficiency gains.
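The sign of C can be made concrete in an idealized special case that is not claimed by the theorem: assume matching is exact, so that W₁ = W₂ = W in every pair, and write D(a, W) = (Q̄₀ − Q̄^*)(a, W) as a shorthand introduced only for this sketch. The four terms of C then combine into a single square:

```latex
\begin{aligned}
C &= 2\, E_0 D(1, W) D(0, W) + E_0 D(1, W)^2 + E_0 D(0, W)^2 \\
  &= E_0 \{ D(1, W) + D(0, W) \}^2 \;\geq\; 0 .
\end{aligned}
```

In this limiting case C is zero only if the misspecification errors in the two arms cancel for almost every W, and in particular when Q̄^* = Q̄₀.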

7 Augmenting the data structure with missingness

Consider the following data generating experiment. First, we sample n i.i.d. (W₁, Y₁(0), Y₁(1)), …, (W_n, Y_n(0), Y_n(1)), giving us the vector Xⁿ and vector of baseline covariates Wⁿ. Based on Wⁿ, we run a partitioning algorithm generating pairs C_j(Wⁿ), j = 1, …, J. However, suppose that the designer does not want to accept pairs that are not similar enough with respect to some metric. Therefore, one applies an algorithm that assigns an indicator Δ_i(Wⁿ), i = 1, …, n, and applies the partitioning algorithm among the units {i : Δ_i(Wⁿ) = 1}, resulting in C_j(Wⁿ), j = 1, …, J. Thus ∪_j C_j(Wⁿ) = {i : Δ_i(Wⁿ) = 1}. We also note that Δ_i(Wⁿ) is a deterministic function of Wⁿ. Let n₁ be the number of observations with Δ_i(Wⁿ) = 1. Given Wⁿ, the Δ_i(Wⁿ) and the pairs C_j(Wⁿ), we draw A^{n₁} from the conditional distribution

g_{0}^{n} (A^{n_{1}} ∣ X^{n}) = g_{0}^{n} (A^{n_{1}} ∣ W^{n}) = \prod_{j = 1}^{J} g_{0}^{n} ((A_{i} : i \in C_{j} (W^{n})) ∣ W^{n}) .

We now collect the data O_i = (W_i, Δ_i(Wⁿ), Δ_i(Wⁿ)A_i, Δ_i(Wⁿ)Y_i(A_i)), i = 1, …, n, giving the observed data Oⁿ = (O₁, …, O_n).

The target quantity of interest remains the average treatment effect Ψ^F(P_{X,0}) = E₀Y(1) − E₀Y(0). We have $ψ_{0}^{F} = E_{W} \{{\bar{Q}}_{0} (1, W) - {\bar{Q}}_{0} (0, W)\}$ , where Q̄₀(a, w) = E(Y(a) | W = w) = E₀(Y | A = a, W = w). We note that Y_i, given Wⁿ, A^{n₁}, is independent across i = 1, …, n, and this conditional distribution equals the conditional distribution of Y_i, given W_i, A_i. Therefore,

\begin{array}{l} E (Y_{i} ∣ A_{i}, W_{i}, Δ_{i} (W^{n}) = 1) = E (E (Y_{i} ∣ A_{i}, W_{i}, W^{n}) ∣ A_{i}, W_{i}, Δ_{i} (W^{n}) = 1) \\ = E (E (Y_{i} ∣ A_{i}, W_{i}) ∣ A_{i}, W_{i}, Δ_{i} (W^{n}) = 1) \\ = E (Y_{i} ∣ A_{i}, W_{i}) . \end{array}

This proves that Q̄₀(a, w) = E(Y_i | A_i = a, W_i = w, Δ_i(Wⁿ) = 1) and is thus identifiable from the distribution of Oⁿ. This proves the desired identifiability of $ψ_{0}^{F}$ :

Ψ^{F} (P_{X, 0}) = E_{W, 0} \{{\bar{Q}}_{0} (1, W) - {\bar{Q}}_{0} (0, W)\} = Ψ (P_{0}^{n}) .

The average is with respect to the marginal distribution of W (not conditional on Δ_i(Wⁿ) = 1), so that also the observations with Δ_i(Wⁿ) = 0 are used to identify this target quantity.

This also demonstrates that $E_{0} \sum_{i = 1}^{n} I (Δ_{i} (W^{n}) = 1) {(Y_{i} - \bar{Q} (A_{i}, W_{i}))}^{2}$ is minimized over Q̄ by Q̄₀, and thus represents a valid loss function for loss-based learning of Q̄₀ based on Oⁿ. Similarly, we can use a log-likelihood loss $\sum_{i = 1}^{n} I (Δ_{i} (W^{n}) = 1) L (\bar{Q}) (W_{i}, A_{i}, Y_{i})$ , where −L(Q̄)(W, A, Y) = Y log Q̄ (A, W) + (1 − Y) log(1 − Q̄(A, W)).
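The two points above (fit Q̄ on complete cases only, but average over all W) can be illustrated with a toy plug-in estimator in Python. The sketch assumes a discrete covariate W and uses cell means as the outcome regression, which is the minimizer of the complete-case squared-error loss within that saturated model; the function names are hypothetical, and every (arm, level) cell is assumed to contain at least one complete case.

```python
import numpy as np


def stratified_outcome_regression(w, delta, a, y):
    """Cell means of Y over complete cases (delta == 1), by arm and W level."""
    q = {}
    for level in np.unique(w):
        for arm in (0, 1):
            keep = (delta == 1) & (w == level) & (a == arm)
            q[(arm, int(level))] = y[keep].mean()
    return q


def plug_in_effect(w, q):
    """Average of Q(1, W_i) - Q(0, W_i) over the supplied covariate values."""
    diffs = np.array([q[(1, int(wi))] - q[(0, int(wi))] for wi in w])
    return diffs.mean()


# Toy data: units 2, 3, 8 and 9 were not matched (delta == 0), so their
# treatment and outcome entries are placeholders that are never used.
w = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
delta = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
a = np.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0])
y = np.array([1.0, 0.0, 0.0, 0.0, 3.0, 1.0, 5.0, 1.0, 0.0, 0.0])

q_fit = stratified_outcome_regression(w, delta, a, y)
psi_all_w = plug_in_effect(w, q_fit)                    # uses all n covariates
psi_complete_only = plug_in_effect(w[delta == 1], q_fit)
```

In this toy data set the two averages differ because the unmatched units change the empirical distribution of W; the identifiability result above corresponds to `psi_all_w`.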

In order to present a TMLE we first need to derive the canonical gradient, which is presented in the following theorem.

Theorem 5

Consider the data generating experiment described above.

Let O_i = (W_i, Δ_i(Wⁿ), Δ_i(Wⁿ)A_i, Δ_i(Wⁿ)Y_i). The observed data are Oⁿ = (O₁, …, O_n) ~ Pⁿ with

P^{n} (O^{n}) = \prod_{i = 1}^{n} Q_{W} (W_{i}) {Q_{Y} (Y_{i} ∣ W_{i}, A_{i})}^{Δ_{i} (W^{n})} g^{n} ((A_{i} : Δ_{i} (W^{n}) = 1) ∣ W^{n}),

where Q_W is an unspecified marginal distribution, Q_Y is an unspecified conditional distribution of Y, given A, W, and gⁿ is a conditional distribution of A^{n₁} = (A_i : Δ_i(Wⁿ) = 1), given Wⁿ = (W₁, …, W_n), known to be an element of a set $\mathcal{G}^{n}$ consisting of distributions satisfying (2.1). Let $\mathcal{M}$ be the resulting statistical model for Pⁿ. Let $\mathcal{M} (g^{n})$ be the model if gⁿ is known.

Let Ψ : $\mathcal{M}$ → ℝ be defined by Ψ(Pⁿ) = E_{Q_W} {Q̄(1, W) − Q̄(0, W)}, where Q̄ (A, W) = E_{Q_Y} (Y| A, W).

The tangent space at Pⁿ in model $\mathcal{M}$ is given by:

T (P^{n}) = \left\{\sum_{i = 1}^{n} ϕ (W_{i}) : ϕ \in T_{W}\right\} + \left\{\sum_{i = 1}^{n} Δ_{i} (W^{n}) ϕ (Y_{i} ∣ A_{i}, W_{i}) : ϕ \in T_{Y}\right\} + \sum_{j = 1}^{J} T_{C_{j}},

(7.1)

where T_W = {h(W) : Eh(W) = 0},

\begin{array}{l} T_{Y} = \{h (Y ∣ A, W) : E_{Q_{Y}} (h (Y ∣ A, W) ∣ A, W) = 0\}, \text{ and} \\ T_{C_{j}} = \{S ((A_{i} : i \in C_{j} (W^{n})) ∣ W^{n}) : E (S ∣ W^{n}) = 0\} . \end{array}

The tangent space at Pⁿ in model $\mathcal{M} (g^{n})$ is given by

T (Q) = \left\{\sum_{i = 1}^{n} ϕ (W_{i}) : ϕ \in T_{W}\right\} + \left\{\sum_{i = 1}^{n} ϕ (Y_{i} ∣ A_{i}, W_{i}) : ϕ \in T_{Y}\right\} .

Let

D^{*} (\bar{Q}, g, ψ) (W, Δ, Δ A, Δ Y) = D_{W}^{*} (\bar{Q}, ψ) (W) + \frac{Δ (2 A - 1)}{g (A, 1 ∣ W)} (Y - \bar{Q} (A, W)),

where g denotes a distribution of g(a, 1 | W) = P(A = a, Δ = 1 | W). The statistical parameter Ψ is pathwise differentiable and its canonical gradient at Pⁿ is given by

D^{n, *} (P^{n}) = \frac{1}{n} \sum_{i = 1}^{n} D^{*} (\bar{Q}, {\bar{g}}_{n}, Ψ (Q)) (O_{i}),

{\bar{g}}_{n} (a, 1 ∣ W) = \frac{1}{n} \sum_{i = 1}^{n} g_{i} (a, 1 ∣ W) .

We note that

g_{i} (a, 1 ∣ W_{i}) = \sum_{(w_{j} : j \neq i)} Δ_{i} ((w_{j} : j \neq i), W_{i}) g_{i} (a ∣ (w_{j} : j \neq i), W_{i}) \prod_{j \neq i} Q_{W} (w_{j})

(7.2)

is a function of g_i(A_i | Wⁿ) and the common marginal distribution Q_W. We have

E_{0} D^{n, *} (\bar{Q}, {\bar{g}}_{n}, ψ_{0}) = 0 \text{ if } \bar{Q} = {\bar{Q}}_{0} \text{ or } {\bar{g}}_{n} = {\bar{g}}_{n, 0},

(7.3)

assuming that for all i, 0 < g_i(1, 1 | W_i) < 1 a.e.

The TMLE of Q̄₀ is analogous to the TMLE presented in Section 4, with the following modifications: the clever covariate is now given by (2A_i − 1)I(Δ_i(Wⁿ) = 1)/ḡ_n(A_i, 1 | W_i), and only the complete observations are used for fitting Q̄₀, but the empirical distribution over all W₁, …, W_n is plugged into the target parameter mapping. The same asymptotics can be applied and the formulas for the asymptotic variance are the same as presented earlier, with the only modification that g_i(a | W_i) is now replaced by g_i(a, 1 | W_i).
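As a small illustration of the modified clever covariate, the Python sketch below evaluates (2A_i − 1)I(Δ_i = 1)/ḡ_n(A_i, 1 | W_i) for each unit. It assumes the values ḡ_n(1, 1 | W_i) and ḡ_n(0, 1 | W_i) have already been estimated and are passed in as arrays; the function and argument names are hypothetical.

```python
import numpy as np


def clever_covariate(a, delta, g_bar_treated, g_bar_control):
    """Clever covariate under missingness.

    g_bar_treated[i] stands for g_bar_n(1, 1 | W_i) and g_bar_control[i]
    for g_bar_n(0, 1 | W_i); both are assumed to lie strictly in (0, 1).
    Units with delta == 0 receive the value 0.
    """
    a = np.asarray(a)
    delta = np.asarray(delta)
    g_observed = np.where(a == 1, g_bar_treated, g_bar_control)
    return np.where(delta == 1, (2 * a - 1) / g_observed, 0.0)


# Three units: treated and matched, control and matched, unmatched.
h = clever_covariate(
    a=[1, 0, 1],
    delta=[1, 1, 0],
    g_bar_treated=np.full(3, 0.4),
    g_bar_control=np.full(3, 0.4),
)
```

Unmatched units therefore drop out of the fluctuation step, while still contributing their W_i to the empirical distribution used in the plug-in.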

8 Summary

This article has investigated efficient estimation and inference for the additive causal effect E₀{Y (1) − Y (0)} of treatment on the outcome under a class of designs based on sampling n i.i.d. (W_i, Y_i(0), Y_i(1)) ~ P_{X,0}, sampling Aⁿ, given Wⁿ, and collecting (W_i, A_i, Y_i), i = 1, …, n. We considered a general class of dependent treatment assignment mechanisms gⁿ satisfying the assumption that (A_i : i ∈ C_j(Wⁿ)), j = 1, …, J, are independent across j, conditionally on Wⁿ, where C_j(Wⁿ), j = 1, …, J, is a partitioning of the sample {1, …, n} into groups implied by Wⁿ. The number of partitions J was assumed to be proportional to n.

We computed the efficient influence curve of the target parameter for the statistical model implied by this design without making additional assumptions about the common full-data distribution P_{X,0}. We defined a corresponding TMLE that is consistent and asymptotically normally distributed under correct specification of $g_{0}^{n}$ , and is also efficient if the outcome regression Q̄₀ is consistently estimated. This TMLE can be implemented by ignoring the dependency created by the treatment allocation process, with the exception that if cross validation is used to estimate the average ḡ_n of g_{0,i}(A_i | W_i) across i = 1, …, n, the group rather than the unit should be used when partitioning the data into training and validation sets. In contrast, construction of training and validation sets for data adaptive estimation of Q̄₀ can be based on the sampling unit. We further suggested an alternative plug-in approach to estimating the unit specific treatment mechanism g_{0,i} that makes use of design based knowledge of $g_{0}^{n}$ , thus potentially improving estimator robustness and efficiency.
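Group-level sample splitting of the kind recommended for estimating ḡ_n can be sketched as follows in Python. The sketch assumes the groups C_j(Wⁿ) are supplied as index arrays that partition {0, …, n − 1}; the function name `group_fold_labels` is hypothetical.

```python
import numpy as np


def group_fold_labels(groups, n_folds, rng):
    """Assign a cross-validation fold to every unit, keeping groups intact.

    `groups` is a sequence of index arrays, one per group C_j, assumed to
    partition 0..n-1. All members of a group receive the same fold label.
    """
    n = sum(len(g) for g in groups)
    labels = np.full(n, -1)
    order = rng.permutation(len(groups))       # shuffle groups, not units
    for position, j in enumerate(order):
        labels[np.asarray(groups[j])] = position % n_folds
    return labels


# Ten matched pairs split into five folds of two pairs each.
pairs = np.arange(20).reshape(-1, 2)
folds = group_fold_labels(pairs, n_folds=5, rng=np.random.default_rng(1))
```

A unit-level split for the outcome regression would instead permute the n unit indices directly.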

Due to the dependency introduced by the treatment allocation process, no asymptotically consistent bootstrap method appears to be available for the general class of dependent gⁿ-designs presented in this paper. Further, when groups are size 2 or larger, the asymptotic variance of the TMLE under the dependent sampling relies on a consistent estimator of Q̄₀ even when $g_{0}^{n}$ is known. In contrast, the asymptotic variance of the TMLE under i.i.d. sampling is fully robust to misspecification of ${\bar{Q}}_{n}^{*}$ in randomized controlled trials.

We further considered adaptively pair matched trials as an important special case of the general dependent treatment allocation design. We formally compared the asymptotic variance of the TMLE under this design with that of the TMLE under i.i.d. sampling. While the information bound for the adaptively pair matched design with $g_{i}^{n} = g_{i} = g_{0}$ equals the information bound for i.i.d. sampling of (W_i, A_i, Y_i) with P (A = a | W ) = g₀(a | W ), we showed that the TMLE under adaptive pair matching and misspecified Q̄^* will outperform the TMLE under i.i.d. sampling as long as the (Q̄₀ − Q̄^*)(1, ·) and (Q̄₀ − Q̄^*) (0, ·) of the baseline covariates within the groups C_j(Wⁿ) are positively correlated. We also showed that under the paired matching design and the positive correlation condition, an estimate of the variance that treats the n observations as i.i.d. is conservative if ${\bar{Q}}_{n}^{*}$ is inconsistent for Q̄₀ and is asymptotically consistent if ${\bar{Q}}_{n}^{*}$ is consistent. We also presented a less conservative variance estimator that relies on an additional reasonable assumption (similar to the above positive correlation assumption). We demonstrated that the estimator of the variance for the unadjusted estimator as currently used by practitioners in the analysis of paired matched trials is valid as well, and our above mentioned less conservative variance estimator is just a generalization of this estimator.

Taken together, these findings teach us that the use of an adaptively pair matched design will generally result in a more efficient estimator of the treatment effect, while one can still obtain robust conservative variance estimators. However, understanding the complications resulting from the adaptive pair matching requires advanced empirical process theory, and even makes the analysis of the unadjusted estimator a serious challenge.

A Appendix: Proof of Theorem 1

Firstly, we note that

\begin{array}{l} E_{0} D^{*} (\bar{Q}, {\bar{g}}_{n}, ψ_{0}) (O^{n}) = E_{0} \frac{1}{n} \sum_{i = 1}^{n} \{\bar{Q} (W_{i}) - ψ_{0}\} \\ + E_{0} \frac{1}{n} \sum_{i = 1}^{n} \left\{\frac{g_{i, 0} (1 ∣ W_{i})}{{\bar{g}}_{n} (1 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (1, W_{i}) - \frac{g_{i, 0} (0 ∣ W_{i})}{{\bar{g}}_{n} (0 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (0, W_{i})\right\} \\ = Ψ (Q) - ψ_{0} + \frac{1}{n} \sum_{i} \int_{w} Q_{W, 0} (w) \frac{g_{i, 0} (1 ∣ w)}{{\bar{g}}_{n} (1 ∣ w)} ({\bar{Q}}_{0} - \bar{Q}) (1, w) - \frac{1}{n} \sum_{i} \int_{w} Q_{W, 0} (w) \frac{g_{i, 0} (0 ∣ w)}{{\bar{g}}_{n} (0 ∣ w)} ({\bar{Q}}_{0} - \bar{Q}) (0, w) \\ = Ψ (Q) - ψ_{0} + E_{0} \frac{{\bar{g}}_{n, 0} (1 ∣ W)}{{\bar{g}}_{n} (1 ∣ W)} ({\bar{Q}}_{0} - \bar{Q}) (1, W) - E_{0} \frac{{\bar{g}}_{n, 0} (0 ∣ W)}{{\bar{g}}_{n} (0 ∣ W)} ({\bar{Q}}_{0} - \bar{Q}) (0, W) . \end{array}

Thus, if ḡ_n = ḡ_{n,0}, then this equals Ψ(Q) − ψ₀ + ψ₀ − Ψ(Q) = 0. If Q₀ = Q, then we also obtain 0. This proves (3.3). We also note that D^{n,*}(Q₀, ḡ₀) is an element of the tangent space T_Q. In addition, for each Q, D^{n,*}(Q, ḡ₀, ψ₀) is a gradient in the model $\mathcal{M} (g_{0}^{n})$ with $g_{0}^{n}$ known, which shows that D^{n,*}(Q₀, ḡ₀) is the canonical gradient of Ψ : $\mathcal{M} (g^{n})$ → ℝ at $P_{0}^{n}$ . By factorization of the likelihood, it is also the canonical gradient for any model that instead assumes that $g_{0}^{n} \in \mathcal{G}^{n}$ for a model $\mathcal{G}^{n}$ .

B Appendix: Proof of Theorem 2

Recall the notation Pf = E_Pf. We have

P_{0}^{n} D^{n, *} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{0}, ψ_{n}^{*}) \equiv \frac{1}{n} \sum_{i = 1}^{n} P_{Q_{0}, g_{0, i}} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{0}, ψ_{n}^{*}) = ψ_{0} - ψ_{n}^{*} .

Here we remind the reader that ḡ₀ = 1/n Σ_i g_{0,i} and g_{0,i}(a | w) = P₀(A_i = a | W_i = w). We also have $D^{n, *} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) = 0$ .

Thus,

\begin{array}{l} ψ_{n}^{*} - ψ_{0} = \frac{1}{n} \sum_{i} \{D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) (O_{i}) - P_{Q_{0}, g_{0, i}} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*})\} \\ + \frac{1}{n} \sum_{i} \{P_{Q_{0}, g_{0, i}} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) - P_{Q_{0}, g_{0, i}} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{0}, ψ_{n}^{*})\} \\ \equiv \frac{1}{n} \sum_{i} \{D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) (O_{i}) - P_{Q_{0}, g_{0, i}} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*})\} + \frac{1}{\sqrt{n}} Z_{W, {\bar{g}}_{n}, n} . \end{array}

We note that, using some straightforward algebra,

\begin{array}{l} Z_{W, {\bar{g}}_{n}, n} = \sqrt{n} \int_{w} \frac{{\bar{g}}_{0} - {\bar{g}}_{n}}{{\bar{g}}_{n}} (1 ∣ w) ({\bar{Q}}_{0} - \bar{Q}) (1, w) {d Q}_{W, 0} (w) \\ - \sqrt{n} \int_{w} \frac{{\bar{g}}_{0} - {\bar{g}}_{n}}{{\bar{g}}_{n}} (0 ∣ w) ({\bar{Q}}_{0} - \bar{Q}) (0, w) {d Q}_{W, 0} (w) \\ = \sqrt{n} \int_{w} \frac{{\bar{g}}_{0} - {\bar{g}}_{n}}{{\bar{g}}_{0}} (1 ∣ w) ({\bar{Q}}_{0} - \bar{Q}) (1, w) {d Q}_{W, 0} (w) \\ - \sqrt{n} \int_{w} \frac{{\bar{g}}_{0} - {\bar{g}}_{n}}{{\bar{g}}_{0}} (0 ∣ w) ({\bar{Q}}_{0} - \bar{Q}) (0, w) {d Q}_{W, 0} (w) + R ({\bar{g}}_{n}, {\bar{g}}_{0}) \end{array}

where

R ({\bar{g}}_{n}, {\bar{g}}_{0}) = \sqrt{n} \int \frac{{({\bar{g}}_{0} - {\bar{g}}_{n})}^{2}}{{\bar{g}}_{n} {\bar{g}}_{0}} (1 ∣ w) ({\bar{Q}}_{0} - \bar{Q}) (1, w) {d Q}_{0} (w) - \sqrt{n} \int \frac{{({\bar{g}}_{0} - {\bar{g}}_{n})}^{2}}{{\bar{g}}_{n} {\bar{g}}_{0}} (0 ∣ w) ({\bar{Q}}_{0} - \bar{Q}) (0, w) {d Q}_{0} (w) .
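The "straightforward algebra" behind this decomposition is a pointwise identity for the ratio, applied separately at a = 1 and a = 0. Written out, with the arguments (a ∣ w) suppressed:

```latex
\begin{aligned}
\frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_n}
  &= \frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_0}
   + (\bar{g}_0 - \bar{g}_n) \left( \frac{1}{\bar{g}_n} - \frac{1}{\bar{g}_0} \right) \\
  &= \frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_0}
   + \frac{(\bar{g}_0 - \bar{g}_n)^2}{\bar{g}_n \bar{g}_0} .
\end{aligned}
```

The first term gives the linear part of Z_{W,ḡ_n,n} and the second, quadratic in ḡ₀ − ḡ_n, is collected in the remainder R(ḡ_n, ḡ₀).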

We assume that the remainder R(ḡ_n, ḡ₀) is o_P (1). Thus to establish the asymptotic linearity of Z_{W,ḡ_n,n} we need to study terms of the form $\sqrt{n} \int f (w) ({\bar{g}}_{n} - {\bar{g}}_{0}) (1 ∣ w) d Q (w)$ . We now note that

\begin{array}{l} ({\bar{g}}_{n} - \bar{g}) (1 ∣ w) = \frac{1}{n} \sum_{i = 1}^{n} (g_{i, n} - g_{i}) (1 ∣ w) \\ = \frac{1}{n} \sum_{i = 1}^{n} \int g_{i} (1 ∣ (W_{j} : j \neq i), W_{i} = w) (\prod_{j \neq i} Q_{W, n} (w_{j}) - \prod_{j \neq i} Q_{W} (w_{j})) \\ = \frac{1}{n} \sum_{i = 1}^{n} \int g_{i} (1 ∣ W_{- i}, W_{i} = w) \sum_{l = 1, l \neq i}^{n} (Q_{W, n} (w_{l}) - Q_{W} (w_{l})) \prod_{m = 1, m \neq i}^{l - 1} Q_{W, n} (w_{m}) \prod_{m = l + 1, m \neq i}^{n} Q_{W} (w_{m}) \\ \approx \frac{1}{n} \sum_{i = 1}^{n} \sum_{l \neq i} \int g_{i} (1 ∣ W_{i} = w, W_{l} = w_{l}) (Q_{W, n} (w_{l}) - Q_{W} (w_{l})) \\ = \frac{1}{n} \sum_{k = 1}^{n} \frac{1}{n} \sum_{i = 1}^{n} \sum_{l \neq i}^{n} \{g_{i} (1 ∣ W_{i} = w, W_{l} = W_{k}) - g_{i} (1 ∣ W_{i} = w)\}, \end{array}

where we suppressed the second order term a formal analysis would have to take into account. Therefore, we can write

\begin{array}{l} \sqrt{n} \int ({\bar{g}}_{n} - {\bar{g}}_{0}) (1 ∣ w) f (w) d Q (w) = \frac{1}{\sqrt{n}} \sum_{k = 1}^{n} \frac{1}{n} \sum_{i = 1}^{n} \sum_{l \neq i}^{n} \int_{w} \{g_{i} (1 ∣ W_{i} = w, W_{l} = W_{k}) - E_{0, W_{k}} g_{i} (1 ∣ W_{i} = w, W_{l} = W_{k})\} f (w) d Q (w) \\ \equiv \frac{1}{\sqrt{n}} \sum_{k = 1}^{n} \{Φ (W_{k}) - E_{0} Φ (W_{k})\}, \end{array}

where we defined

Φ (W_{k}) = \frac{1}{n} \sum_{i = 1}^{n} \sum_{l \neq i}^{n} \int_{w} g_{i} (1 ∣ W_{i} = w, W_{l} = W_{k}) f (w) d Q (w) .

Thus such integrals are standardized sums of independent random variables Φ(W_k) − E₀Φ(W_k) with mean zero. Such terms will converge to a normal distribution if the variance of Φ(W_k) is bounded (uniformly in n, since Φ is really indexed by n as well). This demonstrates that one will need the sum Σ_{l≠i} to contribute essentially only a finite number of terms.

To conclude, under regularity conditions, we might have

Z_{W, {\bar{g}}_{n}, n} \approx \frac{1}{\sqrt{n}} \sum_{k = 1}^{n} \{I C (W_{k}) - E_{0} I C (W_{k})\},

where

\begin{array}{l} I C (W_{k}) = \int_{w} (\frac{1}{n} \sum_{i = 1}^{n} \sum_{l \neq i}^{n} g_{i} (1 ∣ W_{i} = w, W_{l} = W_{k})) \frac{{\bar{Q}}_{0} - \bar{Q}}{{\bar{g}}_{0}} (1, w) {d Q}_{0} (w) \\ - \int_{w} (\frac{1}{n} \sum_{i = 1}^{n} \sum_{l \neq i}^{n} g_{i} (0 ∣ W_{i} = w, W_{l} = W_{k})) \frac{{\bar{Q}}_{0} - \bar{Q}}{{\bar{g}}_{0}} (0, w) {d Q}_{0} (w) . \end{array}

A crucial assumption we made in the theorem is that the variance of IC(W_k) is finite. We will now show that under a reasonable typical assumption we will, in fact, have that IC(W_k) − E₀IC(W_k) = 0. For i ∈ C_j(Wⁿ), in a typical design one will have that g_i(a | W_i = w_i, W_{−i}) only depends on W_i = w_i. Thus, in that case, for i ∈ C_j(Wⁿ) we have g_i(a | Wⁿ) = g_i(a | W_i) for some conditional density g_i(a | w). This provides us with the following representation:

g_{i} (a ∣ W^{n}) = \sum_{j = 1}^{J} I (i \in C_{j} (W^{n})) g_{i} (a ∣ W_{i})

This yields the following derivation of g_i(a | W_i, W_l):

\begin{array}{l} g_{i} (a ∣ W_{i}, W_{l}) = \int g_{i} (a ∣ W_{i}, W_{l}, W (- i, - l)) P (W (- i, - l)) \\ = \int \sum_{j = 1}^{J} I (i \in C_{j} (W_{i}, W_{l}, W (- i, - l))) g_{i} (a ∣ W_{i}) P (W (- i, - l)) \\ = \sum_{j = 1}^{J} g_{i} (a ∣ W_{i}) \int I (i \in C_{j} (W_{i}, W_{l}, W (- i, - l))) P (W (- i, - l)) \\ = g_{i} (a ∣ W_{i}) \sum_{j = 1}^{J} P (i \in C_{j} (W^{n}) ∣ W_{i}, W_{l}) \\ = g_{i} (a ∣ W_{i}) . \end{array}

Thus, in this case, we have IC(W_k) is constant in W_k so that IC(W_k) − E₀IC(W_k) = 0.

We now proceed as follows:

\begin{array}{l} ψ_{n}^{*} - ψ_{0} = \frac{1}{n} \sum_{i = 1}^{n} \{D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) (O_{i}) - P_{Q_{0}, g_{i}} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*})\} + \frac{1}{\sqrt{n}} Z_{W, {\bar{g}}_{n}, n} \\ = \frac{1}{n} \sum_{i = 1}^{n} \{D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) (O_{i}) - P_{Q_{0}, g_{i}^{n}} D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*})\} + \frac{1}{n} \sum_{i = 1}^{n} \{(P_{Q_{0}, g_{i}^{n}} - P_{Q_{0}, g_{i}}) D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*})\} + \frac{1}{\sqrt{n}} Z_{W, {\bar{g}}_{n}, n} \\ = \frac{1}{n} \sum_{i = 1}^{n} \{D_{Y}^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}) (O_{i}) - P_{Q_{0}, g_{i}^{n}} D_{Y}^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n})\} + \frac{1}{n} \sum_{i = 1}^{n} \{(P_{Q_{0}, g_{i}^{n}} - P_{Q_{0}, g_{i}}) D^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*})\} + \frac{1}{\sqrt{n}} Z_{W, {\bar{g}}_{n}, n} \\ \equiv \frac{1}{\sqrt{n}} X_{n} ({\bar{Q}}_{n}^{*}) + \frac{1}{\sqrt{n}} Z_{W, n, g^{n}} + \frac{1}{\sqrt{n}} Z_{W, {\bar{g}}_{n}, n} . \end{array}

Here we used at the third equality that $P_{Q_{0}, g_{0, i}^{n}}$ is a conditional expectation, given Wⁿ, so that the empirical process of $D_{W}^{*}$ cancels out in the first term. We defined the process only as a function of ${\bar{Q}}_{n}^{*}$ , not as a function of ḡ_n, because ḡ_n is only a function of Wⁿ. Note that

\begin{array}{l} D_{Y}^{*} (\bar{Q}, {\bar{g}}_{n}) (O_{i}) - P_{Q_{0}, g_{0, i}^{n}} D_{Y}^{*} (\bar{Q}, {\bar{g}}_{n}) = \frac{2 A_{i} - 1}{{\bar{g}}_{n} (A_{i} ∣ W_{i})} (Y_{i} - \bar{Q} (A_{i}, W_{i})) \\ - \left\{\frac{g_{0, i}^{n} (1 ∣ W^{n})}{{\bar{g}}_{n} (1 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (1, W_{i}) - \frac{g_{0, i}^{n} (0 ∣ W^{n})}{{\bar{g}}_{n} (0 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (0, W_{i})\right\} \\ \equiv f_{i, n}^{1} (\bar{Q}) (O_{i}) . \end{array}

Note that $f_{i, n}^{1} (\bar{Q})$ is a random function of O_i through Wⁿ, while, given Wⁿ, it is a fixed function of O_i. In the special case that $g_{0, i}^{n} = g_{0, i}$ is constant in i, we have $f_{i, n}^{1} (\bar{Q}) (O_{i}) = D_{Y}^{*} (\bar{Q}, g_{0, i}) (O_{i}) - \{{\bar{Q}}_{0} - \bar{Q}\} (W_{i})$ . We can represent X_n(Q̄) as $X_{n} (\bar{Q}) = 1 / \sqrt{n} \sum_{i = 1}^{n} f_{i, n}^{1} (\bar{Q}) (O_{i})$ , where $P_{Q_{0}, g_{0, i}^{n}} f_{i, n}^{1} (\bar{Q}) = 0$ .

Let’s now determine the form of Z_{W,n,gⁿ}. We have

\begin{array}{l} 1 / n \sum_{i} P_{Q_{0}, g_{0, i}^{n}} D_{Y}^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}) \\ = 1 / n \sum_{i} \left\{\frac{g_{0, i}^{n} (1 ∣ W^{n})}{{\bar{g}}_{n} (1 ∣ W_{i})} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (1, W_{i}) - \frac{g_{0, i}^{n} (0 ∣ W^{n})}{{\bar{g}}_{n} (0 ∣ W_{i})} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (0, W_{i})\right\} \\ = 1 / n \sum_{i} \left\{(\frac{g_{0, i}^{n} (1 ∣ W^{n})}{{\bar{g}}_{n} (1 ∣ W_{i})} - 1) ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (1, W_{i}) - (\frac{g_{0, i}^{n} (0 ∣ W^{n})}{{\bar{g}}_{n} (0 ∣ W_{i})} - 1) ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (0, W_{i})\right\} \\ + 1 / n \sum_{i} \{{\bar{Q}}_{0} (W_{i}) - {\bar{Q}}_{n}^{*} (W_{i})\}, \\ 1 / n \sum_{i} P_{Q_{0}, g_{0, i}} D_{Y}^{*} ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}) = \int_{w} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (w) Q_{W, 0} (w), \\ 1 / n \sum_{i} (P_{Q_{0}, g_{i}^{n}} - P_{Q_{0}, g_{i}}) D_{W}^{*} ({\bar{Q}}_{n}^{*}, ψ_{n}^{*}) = 1 / n \sum_{i} {\bar{Q}}_{n}^{*} (W_{i}) - P_{0} {\bar{Q}}_{n}^{*} . \end{array}

Thus,

\begin{array}{l} 1 / n \sum_{i} (P_{Q_{0}, g_{0, i}^{n}} - P_{Q_{0}, g_{0, i}}) (D_{Y}^{*} + D_{W}^{*}) ({\bar{Q}}_{n}^{*}, {\bar{g}}_{n}, ψ_{n}^{*}) \\ = 1 / n \sum_{i} \left\{(\frac{g_{0, i}^{n} (1 ∣ W^{n})}{{\bar{g}}_{n} (1 ∣ W_{i})} - 1) ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (1, W_{i}) - (\frac{g_{0, i}^{n} (0 ∣ W^{n})}{{\bar{g}}_{n} (0 ∣ W_{i})} - 1) ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (0, W_{i})\right\} \\ + 1 / n \sum_{i} \{{\bar{Q}}_{0} (W_{i}) - {\bar{Q}}_{n}^{*} (W_{i})\} - \int_{w} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (w) Q_{W, 0} (w) \\ + 1 / n \sum_{i} {\bar{Q}}_{n}^{*} (W_{i}) - P_{0} {\bar{Q}}_{n}^{*} \\ = 1 / n \sum_{i} \left\{(\frac{g_{0, i}^{n} (1 ∣ W^{n})}{{\bar{g}}_{n} (1 ∣ W_{i})} - 1) ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (1, W_{i}) - (\frac{g_{0, i}^{n} (0 ∣ W^{n})}{{\bar{g}}_{n} (0 ∣ W_{i})} - 1) ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (0, W_{i})\right\} \\ + 1 / n \sum_{i} \{{\bar{Q}}_{0} (W_{i}) - ψ_{0}\} . \end{array}

Thus,

Z_{W, n, g^{n}} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \{ {\bar{Q}}_{0} (W_{i}) - ψ_{0} \} + \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \left\{ \frac{g_{0, i}^{n} (1 ∣ W^{n}) - {\bar{g}}_{n} (1 ∣ W_{i})}{{\bar{g}}_{n} (1 ∣ W_{i})} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (1, W_{i}) - \frac{g_{0, i}^{n} (0 ∣ W^{n}) - {\bar{g}}_{n} (0 ∣ W_{i})}{{\bar{g}}_{n} (0 ∣ W_{i})} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (0, W_{i}) \right\} .

In the special case that $g_{0, i}^{n} = g_{0, i}$ is constant in i, we have that

Z_{W, n, g^{n}} = Z_{W, n} \equiv \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \{ {\bar{Q}}_{0} (W_{i}) - ψ_{0} \} .

In the general case, one can decompose

Z_{W, n, g^{n}} = Z_{1, g^{n}} + Z_{W, n},

where

Z_{1, g^{n}} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \frac{g_{0, i}^{n} (1 ∣ W^{n}) - {\bar{g}}_{n} (1 ∣ W_{i})}{{\bar{g}}_{n} (1 ∣ W_{i})} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (1, W_{i}) - \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \frac{g_{0, i}^{n} (0 ∣ W^{n}) - {\bar{g}}_{n} (0 ∣ W_{i})}{{\bar{g}}_{n} (0 ∣ W_{i})} ({\bar{Q}}_{0} - {\bar{Q}}_{n}^{*}) (0, W_{i}) .

Suppose now that $g_{0, i}^{n} = g_{0, i}$ . Then ḡ_n = ḡ₀. Notice that for a function f, we have

\begin{array}{l} \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} E_{0} \left( \frac{g_{0, i} (1 ∣ W_{i})}{{\bar{g}}_{0} (1 ∣ W_{i})} - 1 \right) f (W_{i}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \int_{w} \left( \frac{g_{0, i} (1 ∣ w)}{{\bar{g}}_{0} (1 ∣ w)} - 1 \right) f (w) {d Q}_{W, 0} (w) \\ = \sqrt{n} \int_{w} \left( \frac{{\bar{g}}_{0} (1 ∣ w)}{{\bar{g}}_{0} (1 ∣ w)} - 1 \right) f (w) {d Q}_{W, 0} (w) \\ = 0. \end{array}

This proves that $Z_{1, g^{n}} (\bar{Q})$, defined as the process above with ${\bar{Q}}_{n}^{*}$ replaced by $\bar{Q}$, is a standard empirical process $Z_{1, g^{n}} (\bar{Q}) = 1 / \sqrt{n} \sum_{i} f_{i} (\bar{Q}) (W_{i})$ of independent mean-zero random variables

f_{i} (\bar{Q}) (W_{i}) = \frac{g_{0, i} (1 ∣ W_{i}) - {\bar{g}}_{0} (1 ∣ W_{i})}{{\bar{g}}_{0} (1 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (1, W_{i}) - \frac{g_{0, i} (0 ∣ W_{i}) - {\bar{g}}_{0} (0 ∣ W_{i})}{{\bar{g}}_{0} (0 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (0, W_{i}) .
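The centering identity used above can be checked numerically. The following is a minimal sketch, assuming hypothetical unit-specific mechanisms $g_{0, i} (1 ∣ w)$ (linear in w, with slopes drawn at random purely for illustration) and taking ${\bar{g}}_{0}$ to be their average over i. It verifies that the weights $g_{0, i} (1 ∣ w) / {\bar{g}}_{0} (1 ∣ w) - 1$ sum to zero over i at each fixed w, which is why the integral against any f vanishes.

```python
import random

def centered_weights(g_list, w):
    """Return g_{0,i}(1|w) / gbar_0(1|w) - 1 for each unit i at covariate value w."""
    probs = [g(w) for g in g_list]
    gbar = sum(probs) / len(probs)  # average treatment mechanism at w
    return [p / gbar - 1.0 for p in probs]

# Hypothetical unit-specific treatment mechanisms g_{0,i}(1 | w) (illustration only).
random.seed(0)
slopes = [random.uniform(-0.3, 0.3) for _ in range(50)]
g_list = [lambda w, s=s: 0.5 + s * (w - 0.5) for s in slopes]

# Sum of the centered weights over i, evaluated at a few covariate values.
totals = {w: sum(centered_weights(g_list, w)) for w in (0.1, 0.5, 0.9)}
print(totals)
```

The sums are zero up to floating-point error for every w, independently of the particular mechanisms chosen.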

Such a process can be analyzed with the methods we use below, showing that $Z_{1, g^{n}} ({\bar{Q}}_{n}^{*}) = Z_{1, g^{n}} ({\bar{Q}}^{*}) + o_{P} (1)$ , and $Z_{1, g^{n}} ({\bar{Q}}^{*}) = 1 / \sqrt{n} \sum_{i} {I C}_{1, g^{n}, i} (W_{i}) + o_{P} (1)$ , where ${I C}_{1, g^{n}, i} = f_{i} ({\bar{Q}}^{*})$. We conclude that

\sqrt{n} (ψ_{n}^{*} - ψ_{0}) = X_{n} ({\bar{Q}}_{n}^{*}) + Z_{W, n} + Z_{1, g^{n}} + Z_{W, {\bar{g}}_{n}, n},

where our assumptions guarantee that $Z_{W, n} + Z_{1, g^{n}} + Z_{W, {\bar{g}}_{n}, n} = 1 / \sqrt{n} \sum_{i} {I C}_{W, i} (W_{i}) + o_{P} (1)$ . We have thus shown that $\sqrt{n} (ψ_{n}^{*} - ψ_{0}) = X_{W, n} + X_{n} ({\bar{Q}}_{n}^{*})$ , where $X_{W, n} = 1 / \sqrt{n} \sum_{i} {I C}_{W, i} (W_{i}) + o_{P} (1)$ for some influence curve ${I C}_{W, i}$. Thus $X_{W, n}$ is understood and converges to a normal distribution with mean zero and variance $σ_{W}^{2} = \lim_{n} \frac{1}{n} \sum_{i = 1}^{n} P_{0} {I C}_{W, i}^{2}$ , provided the variance of ${I C}_{W, i}$ is bounded uniformly in i.

Below we establish that, conditional on $W^{n}$, $X_{n} ({\bar{Q}}_{n}^{*})$ converges in distribution to a Gaussian random variable. The separate weak convergence of $X_{W, n}$ and $X_{n} ({\bar{Q}}_{n}^{*})$ implies the desired joint weak convergence of $X_{W, n}$ and $X_{n} ({\bar{Q}}_{n}^{*})$, as follows. For notational convenience, let $X_{n}$ denote $X_{n} ({\bar{Q}}_{n}^{*})$ and let X denote its limit in distribution. Let $W^{\infty} = (W^{n} : n = 1, 2, \ldots)$. Note that $P (X_{W, n} \in A, X_{n} \in B) = E_{W^{\infty}} I (X_{W, n} \in A) P (X_{n} \in B ∣ W^{\infty})$. Since $P (X_{n} \in B ∣ W^{\infty})$ converges to $P (X \in B)$ for almost every $W^{\infty}$, we obtain

P (X \in B) E_{W^{\infty}} I (X_{W, n} \in A) \to P (X \in B) P (X_{W} \in A)

plus a term $E_{W^{\infty}} I (X_{W, n} \in A) (P (X_{n} \in B ∣ W^{\infty}) - P (X \in B))$. The latter term converges to zero by the dominated convergence theorem. The joint convergence implies the weak convergence of the sum $X_{W, n} + X_{n} ({\bar{Q}}_{n}^{*})$ to $X_{W} + X$.

So it remains to study $X_{n} ({\bar{Q}}_{n}^{*})$ . By application of a CLT for sums of independent random variables, under the stated conditions, one can show that, conditional on $W^{n}$, $(X_{n} ({\bar{Q}}_{j}) : j)$ for fixed ${\bar{Q}}_{j} \in F$ converges to a multivariate normal distribution with covariance matrix defined by $({\bar{Q}}_{1}, {\bar{Q}}_{2}) \to Σ_{0} ({\bar{Q}}_{1}, {\bar{Q}}_{2})$. Weak convergence of $X_{n} (\bar{Q})$ for a fixed $\bar{Q}$ or a finite collection of $\bar{Q}$'s is not enough to establish the desired asymptotic linearity. In order to understand terms such as $X_{n} ({\bar{Q}}_{n}^{*}) - X_{n} (\bar{Q})$ (and to show that our proposed variance estimator is consistent) we need to understand the process $(X_{n} (\bar{Q}) : \bar{Q} \in F)$ with respect to the supremum norm over a set that contains ${\bar{Q}}_{n}^{*}$ with probability tending to 1. Again, we will study this process conditional on $(W^{n} : n \geq 1)$.

Let $d_{n}^{2} ({\bar{Q}}_{1}, {\bar{Q}}_{2}) = 1 / n \sum_{j} P_{Q_{0}, g^{n}} \{f_{j, n} ({\bar{Q}}_{1}) - f_{j, n} ({\bar{Q}}_{2})\}^{2}$ . We note that $X_{n} ({\bar{Q}}_{1}) - X_{n} ({\bar{Q}}_{2}) = X_{n}^{'} ({\bar{Q}}_{1} - {\bar{Q}}_{2})$ for a slightly different process $X_{n}^{'}$ . Thus, $d_{n}^{2} ({\bar{Q}}_{1}, {\bar{Q}}_{2}) = 1 / n \sum_{j} P_{Q_{0}, g^{n}} \{f_{j, n}^{'} ({\bar{Q}}_{1} - {\bar{Q}}_{2})\}^{2}$ for a specified $f_{j, n}^{'} ({\bar{Q}}_{1} - {\bar{Q}}_{2}) = \sum_{i \in C_{j}} \{f_{i, n} ({\bar{Q}}_{1}) - f_{i, n} ({\bar{Q}}_{2})\} (O_{i})$ . Note that $d_{n}^{2} ({\bar{Q}}_{1}, {\bar{Q}}_{2})$ is the conditional variance of $X_{n} ({\bar{Q}}_{1}) - X_{n} ({\bar{Q}}_{2})$, conditional on $W^{n}$, or equivalently, it is the conditional variance of $X_{n}^{'} ({\bar{Q}}_{1} - {\bar{Q}}_{2})$ . We will also denote this conditional variance by $σ_{n}^{2} ({\bar{Q}}_{1} - {\bar{Q}}_{2}) = d_{n}^{2} ({\bar{Q}}_{1}, {\bar{Q}}_{2})$ .

Recall that $F^{d} = \{f_{1} - f_{2} : f_{1}, f_{2} \in F\}$. Given the entropy condition on $F^{d}$, we will prove asymptotic equicontinuity of $(X_{n} (\bar{Q}) : \bar{Q} \in F)$ with respect to the semi-metric $d_{n}$: for each ε > 0 and sequence δ_n → 0,

P (\sup_{d_{n} (f, g) \leq δ_{n}} ∣ X_{n} (f) - X_{n} (g) ∣ > ε) \to 0 \text{ as } n \to \infty .

This is equivalent to establishing the following asymptotic equicontinuity of ( $X_{n}^{'} (f) : f \in F^{d}$ ) with respect to the semi-metric σ_n: for each ε > 0 and sequence δ_n → 0,

P (\sup_{σ_{n} (f) \leq δ_{n}} ∣ X_{n}^{'} (f) ∣ > ε) \to 0 \text{ as } n \to \infty .

If $d_{n} ({\bar{Q}}_{n}^{*}, \bar{Q}) \to 0$ in probability, and ${\bar{Q}}_{n}^{*} - \bar{Q} \in F^{d}$ with probability tending to 1, then this asymptotic equicontinuity proves that $X_{n} ({\bar{Q}}_{n}^{*}) - X_{n} (\bar{Q}) = X_{n}^{'} ({\bar{Q}}_{n}^{*} - \bar{Q})$ converges to zero in probability, as n → ∞.

To establish the asymptotic equicontinuity result, we use a number of fundamental building blocks. Note that $X_{n}^{'} (f) / σ_{n} (f)$ is a sum of J independent mean-zero bounded random variables and the variance of this sum equals 1. Bernstein’s inequality states that, for independent mean-zero random variables $Y_{j}$ with $∣ Y_{j} ∣ \leq M$, $P (∣ \sum_{j} Y_{j} ∣ > x) \leq 2 \exp (- \frac{1}{2} \frac{x^{2}}{v + M x / 3})$ , where $v \geq \mathrm{Var} (\sum_{j} Y_{j})$. Thus, by Bernstein’s inequality, conditional on $W^{n}$, we have

P (\frac{∣ X_{n}^{'} (f) ∣}{σ_{n} (f)} > x) \leq 2 exp (- \frac{1}{2} \frac{x^{2}}{1 + M x / 3}) \leq K exp (- {C x}^{2}),

for a universal K and C.
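As an illustration of the first inequality above, the following sketch compares the Bernstein bound with a Monte Carlo tail estimate. It assumes a toy setting that is not taken from the paper: J independent summands $Y_{j} = \pm 1 / \sqrt{J}$ with equal probability, so that each $∣ Y_{j} ∣ \leq M = 1 / \sqrt{J}$ and the sum has variance v = 1. The function names are hypothetical.

```python
import math
import random

def bernstein_bound(x, v, M):
    """Bernstein bound 2*exp(-0.5*x^2 / (v + M*x/3)) on P(|sum_j Y_j| > x)."""
    return 2.0 * math.exp(-0.5 * x * x / (v + M * x / 3.0))

def simulate_sums(J, reps, rng):
    """Draw `reps` copies of sum_j Y_j with Y_j = +-1/sqrt(J), signs equally likely."""
    scale = 1.0 / math.sqrt(J)
    return [scale * sum(rng.choice((-1.0, 1.0)) for _ in range(J)) for _ in range(reps)]

rng = random.Random(1)   # fixed seed for reproducibility
J = 50
M = 1.0 / math.sqrt(J)   # bound on each |Y_j|
v = 1.0                  # Var(sum_j Y_j) = J * (1/J) = 1
sums = simulate_sums(J, 2000, rng)

results = []
for x in (1.0, 2.0, 3.0):
    empirical = sum(abs(s) > x for s in sums) / len(sums)
    results.append((x, empirical, bernstein_bound(x, v, M)))
    print(results[-1])
```

In this toy setting the empirical tail frequencies sit below the bound at each x; the bound is loose for small x and becomes informative further out in the tail.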

As stated in our review section, this implies ${‖ X_{n}^{'} (f) / σ_{n} (f) ‖}_{ψ_{2}} \leq {((1 + K) / C)}^{0.5}$ , where for a given convex function ψ with ψ(0) = 0, $‖ X ‖_{ψ} \equiv \inf \{C > 0 : E ψ (∣ X ∣ / C) \leq 1\}$ is the so-called Orlicz norm, and ψ₂(x) = exp(x²) − 1. Thus ${‖ X_{n}^{'} (f) ‖}_{ψ_{2}} \leq C_{1} σ_{n} (f)$ for $f \in F^{d}$ . This result allows us to apply Theorem 2.2.4 in van der Vaart and Wellner (1996) (restated as Theorem 6 in Appendix E): for each δ > 0 and η > 0, we now have

{‖ \sup_{σ_{n} (f) \leq δ} ∣ X_{n}^{'} (f) ∣ ‖}_{ψ_{2}} \leq K \left\{ \int_{0}^{η} ψ_{2}^{- 1} (N (ε, σ_{n}, F^{d})) d ε + δ ψ_{2}^{- 1} (N^{2} (η, σ_{n}, F^{d})) \right\} ,

(B.1)

Convergence to zero with respect to the ψ₂-Orlicz norm implies convergence in expectation to zero and thereby convergence to zero in probability. Let δ_n be a sequence converging to zero, and let η_n also converge to zero but slowly enough so that the term $δ_{n} ψ_{2}^{- 1} (N^{2} (η_{n}, σ_{n}, F^{d}))$ converges to zero as n → ∞. Since η_n → 0, by assumption $\int_{0}^{η_{n}} ψ_{2}^{- 1} (N (ε, σ_{n}, F^{d})) d ε$ converges to zero. Thus,

\lim_{n \to \infty} \left\{ \int_{0}^{η_{n}} ψ_{2}^{- 1} (N (ε, σ_{n}, F^{d})) d ε + δ_{n} ψ_{2}^{- 1} (N^{2} (η_{n}, σ_{n}, F^{d})) \right\} = 0.

This proves that

E (sup_{σ_{n} (f) \leq δ_{n}} ∣ X_{n}^{'} (f) ∣) \to 0,

and thereby the asymptotic equicontinuity of $X_{n}^{'}$ .

We now prove the convergence to the limit variance: If $σ_{n} ({\bar{Q}}_{n}^{*} - \bar{Q}) \to 0$ in probability, then

\frac{1}{n} \sum_{j = 1}^{J} \{f_{j, n} ({\bar{Q}}_{n}^{*}) ({\bar{O}}_{j})\}^{2} - \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g^{n}} \{f_{j, n} (\bar{Q})\}^{2} \to 0 \text{ in probability} .

We can write this difference as a sum of the following two differences:

\begin{array}{l} \frac{1}{n} \sum_{j = 1}^{J} \{f_{j, n} ({\bar{Q}}_{n}^{*}) ({\bar{O}}_{j})\}^{2} - \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g^{n}} f_{j, n} {({\bar{Q}}_{n}^{*})}^{2} , \\ \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g^{n}} f_{j, n} {({\bar{Q}}_{n}^{*})}^{2} - \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g^{n}} f_{j, n} {(\bar{Q})}^{2} \\ = \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g^{n}} \{f_{j, n} {({\bar{Q}}_{n}^{*})}^{2} - f_{j, n} {(\bar{Q})}^{2}\} \\ = \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g^{n}} \{f_{j, n}^{'} ({\bar{Q}}_{n}^{*} - \bar{Q})\} \{f_{j, n} ({\bar{Q}}_{n}^{*}) + f_{j, n} (\bar{Q})\} \\ \leq {\left( \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g^{n}} \{f_{j, n}^{'} ({\bar{Q}}_{n}^{*} - \bar{Q})\}^{2} \right)}^{0.5} {\left( \frac{1}{n} \sum_{j = 1}^{J} P_{Q_{0}, g^{n}} \{f_{j, n} ({\bar{Q}}_{n}^{*}) + f_{j, n} (\bar{Q})\}^{2} \right)}^{0.5}, \end{array}

where we used the Cauchy-Schwarz inequality in the last step. The last term can thus be bounded by $M d_{n} ({\bar{Q}}_{n}^{*}, \bar{Q})$ , so that it converges to zero in probability, since $d_{n} ({\bar{Q}}_{n}^{*}, \bar{Q})$ converges to zero in probability.

We now consider the first term, which can be represented as

\frac{1}{n} \sum_{j = 1}^{J} h_{j, n} ({\bar{Q}}_{n}^{*}),

where

h_{j, n} (\bar{Q}) \equiv f_{j, n}^{2} (\bar{Q}) ({\bar{O}}_{j}) - P_{Q_{0}, g^{n}} f_{j, n} {(\bar{Q})}^{2} .

Define the process $Y_{n} (\bar{Q}) = 1 / n \sum_{j} h_{j, n} (\bar{Q})$. Note that $h_{j, n} (\bar{Q})$ has conditional mean zero given $W^{n}$. Thus, conditional on $W^{n}$, $Y_{n} (\bar{Q})$ is a sum of independent mean-zero random variables. The process $\sqrt{n} Y_{n} (\bar{Q})$ has exactly the same structure as the process $X_{n} (\bar{Q})$ we analyzed above. Therefore, under our conditions, we have $\sup_{\bar{Q} \in F} ∣ Y_{n} (\bar{Q}) ∣ = O_{P} (1 / \sqrt{n})$ . This implies, in particular, that the first term converges to zero in probability. This proves the convergence to the desired limit.

C Appendix: Proof of Theorem 4

We can decompose D^* (Q̄^*, g₀, ψ₀) orthogonally in a function of W and a function of Y, A, W, which has conditional mean zero, given W, as follows:

D^{*} ({\bar{Q}}^{*}, g_{0}, ψ_{0}) = {\bar{Q}}_{0} (W) - ψ_{0} + H_{g_{0}} (A, W) (Y - {\bar{Q}}^{*} (A, W)) - \{{\bar{Q}}_{0} (W) - {\bar{Q}}^{*} (W)\} .

Thus, the variance is given by:

P_{0} \{D^{*} ({\bar{Q}}^{*}, g_{0}, ψ_{0})\}^{2} = E_{0} \{{\bar{Q}}_{0} (W) - ψ_{0}\}^{2} + E_{0} E_{0} (H_{g_{0}}^{2} (A, W) {(Y - {\bar{Q}}^{*} (A, W))}^{2} ∣ W) - E_{0} \{{\bar{Q}}_{0} (W) - {\bar{Q}}^{*} (W)\}^{2} .

Note that

\begin{array}{l} E_{0} E_{0} (H_{g_{0}}^{2} (A, W) {(Y - {\bar{Q}}^{*} (A, W))}^{2} ∣ W) = E_{0} E_{0} (\frac{1}{g_{0}^{2} (A ∣ W)} E_{0} ({(Y - {\bar{Q}}^{*} (A, W))}^{2} ∣ A, W) ∣ W) \\ = E_{0} \sum_{a} \frac{1}{g_{0} (a ∣ W)} σ_{0}^{2} ({\bar{Q}}^{*}) (a, W), \end{array}

where $σ_{0}^{2} ({\bar{Q}}^{*}) (a, W) \equiv E_{0} ({(Y - {\bar{Q}}^{*} (A, W))}^{2} ∣ A = a, W)$ . Thus, we have obtained the following expression:

σ_{I}^{2} ({\bar{Q}}^{*}) = E_{0} \{{\bar{Q}}_{0} (W) - ψ_{0}\}^{2} + E_{0} \sum_{a} \frac{1}{g_{0} (a ∣ W)} σ_{0}^{2} ({\bar{Q}}^{*}) (a, W) - E_{0} \{{\bar{Q}}_{0} (W) - {\bar{Q}}^{*} (W)\}^{2} .

For the paired matching design the asymptotic variance σ² of the TMLE is given by the limit of

E_{0} \{{\bar{Q}}_{0} (W) - ψ_{0}\}^{2} + E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} P_{Q_{0}, g^{n}} \left\{ \sum_{i \in C_{j} (W^{n})} H_{g_{0}} (A_{i}, W_{i}) (Y_{i} - {\bar{Q}}^{*} (A_{i}, W_{i})) \right\}^{2} - E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \left\{ \sum_{i \in C_{j} (W^{n})} \{{\bar{Q}}_{0} (W_{i}) - {\bar{Q}}^{*} (W_{i})\} \right\}^{2} .

Each Σ_{i∈C_j(Wⁿ)} is a sum over two terms. We use that (a + b)² = a² + b² + 2ab. The contribution a² + b² from the square terms yields:

\begin{array}{l} E_{0} \frac{1}{n} \sum_{i = 1}^{n} \left[ P_{Q_{0}, g^{n}} \{H_{g_{0}} (A_{i}, W_{i}) (Y_{i} - {\bar{Q}}^{*} (A_{i}, W_{i}))\}^{2} - \{{\bar{Q}}_{0} (W_{i}) - {\bar{Q}}^{*} (W_{i})\}^{2} \right] \\ = E_{0} \sum_{a} \frac{1}{g_{0} (a ∣ W)} σ_{0}^{2} ({\bar{Q}}^{*}) (a, W) - E_{0} \{{\bar{Q}}_{0} (W) - {\bar{Q}}^{*} (W)\}^{2} . \end{array}

This equals the corresponding expression we have for $σ_{I}^{2} ({\bar{Q}}^{*})$ . The contribution 2ab from the cross-terms yields:

\begin{array}{l} 2 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} P_{Q_{0}, g^{n}} H_{g_{0}, 1 j} (Y_{1 j} - {\bar{Q}}^{*} (A_{1 j}, W_{1 j})) H_{g_{0}, 2 j} (Y_{2 j} - {\bar{Q}}^{*} (A_{2 j}, W_{2 j})) \\ - 2 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{{\bar{Q}}_{0} (W_{1 j}) - {\bar{Q}}^{*} (W_{1 j})\} \{{\bar{Q}}_{0} (W_{2 j}) - {\bar{Q}}^{*} (W_{2 j})\} \\ = - 4 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{(Y_{1 j} (1) - {\bar{Q}}^{*} (1, W_{1 j})) (Y_{2 j} (0) - {\bar{Q}}^{*} (0, W_{2 j}))\} \\ - 4 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{(Y_{1 j} (0) - {\bar{Q}}^{*} (0, W_{1 j})) (Y_{2 j} (1) - {\bar{Q}}^{*} (1, W_{2 j}))\} \\ - 2 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{{\bar{Q}}_{0} (W_{1 j}) - {\bar{Q}}^{*} (W_{1 j})\} \{{\bar{Q}}_{0} (W_{2 j}) - {\bar{Q}}^{*} (W_{2 j})\} . \end{array}
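The factor −4 in the cross-terms can be traced as follows. This is a sketch under two assumptions that are not spelled out at this point of the proof: within each pair exactly one unit is treated, with each of the two orderings having probability 1/2 (so that $g_{0} (a ∣ W) = 1 / 2$), and $H_{g_{0}} (A, W) = (2 A - 1) / g_{0} (A ∣ W)$, so that $H_{g_{0}}$ takes the values ±2.

```latex
\begin{array}{l}
H_{g_{0}, 1 j} H_{g_{0}, 2 j} = - 4 \quad \text{for both allocations } (A_{1 j}, A_{2 j}) \in \{(1, 0), (0, 1)\}, \\
2 P_{Q_{0}, g^{n}} H_{g_{0}, 1 j} (Y_{1 j} - {\bar{Q}}^{*} (A_{1 j}, W_{1 j})) H_{g_{0}, 2 j} (Y_{2 j} - {\bar{Q}}^{*} (A_{2 j}, W_{2 j})) \\
= 2 \cdot \tfrac{1}{2} \cdot (- 4) E_{0} \{(Y_{1 j} (1) - {\bar{Q}}^{*} (1, W_{1 j})) (Y_{2 j} (0) - {\bar{Q}}^{*} (0, W_{2 j})) ∣ W^{n}\} \\
+ 2 \cdot \tfrac{1}{2} \cdot (- 4) E_{0} \{(Y_{1 j} (0) - {\bar{Q}}^{*} (0, W_{1 j})) (Y_{2 j} (1) - {\bar{Q}}^{*} (1, W_{2 j})) ∣ W^{n}\} .
\end{array}
```

Each allocation contributes with weight 1/2, which turns the leading factor 2 into the two −4 terms of the display.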

To conclude, the asymptotic variance under the paired matching design is given by:

\begin{array}{l} σ^{2} ({\bar{Q}}^{*}) = E_{0} \{{\bar{Q}}_{0} (W) - ψ_{0}\}^{2} + E_{0} \sum_{a} \frac{1}{g_{0} (a ∣ W)} σ_{0}^{2} ({\bar{Q}}^{*}) (a, W) - E_{0} \{{\bar{Q}}_{0} (W) - {\bar{Q}}^{*} (W)\}^{2} \\ - 4 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{(Y_{1 j} (1) - {\bar{Q}}^{*} (1, W_{1 j})) (Y_{2 j} (0) - {\bar{Q}}^{*} (0, W_{2 j}))\} \\ - 4 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{(Y_{1 j} (0) - {\bar{Q}}^{*} (0, W_{1 j})) (Y_{2 j} (1) - {\bar{Q}}^{*} (1, W_{2 j}))\} \\ - 2 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{{\bar{Q}}_{0} (W_{1 j}) - {\bar{Q}}^{*} (W_{1 j})\} \{{\bar{Q}}_{0} (W_{2 j}) - {\bar{Q}}^{*} (W_{2 j})\} . \end{array}

Thus, the difference between the two asymptotic variances is given by:

\begin{array}{l} σ_{I}^{2} - σ^{2} = 4 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{(Y_{1 j} (1) - {\bar{Q}}^{*} (1, W_{1 j})) (Y_{2 j} (0) - {\bar{Q}}^{*} (0, W_{2 j}))\} \\ + 4 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{(Y_{1 j} (0) - {\bar{Q}}^{*} (0, W_{1 j})) (Y_{2 j} (1) - {\bar{Q}}^{*} (1, W_{2 j}))\} \\ + 2 E_{0} \frac{1}{n} \sum_{j = 1}^{n / 2} \{{\bar{Q}}_{0} (W_{1 j}) - {\bar{Q}}^{*} (W_{1 j})\} \{{\bar{Q}}_{0} (W_{2 j}) - {\bar{Q}}^{*} (W_{2 j})\} \\ = 2 E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} \{({\bar{Q}}_{0} (1, W_{1 j}) - {\bar{Q}}^{*} (1, W_{1 j})) ({\bar{Q}}_{0} (0, W_{2 j}) - {\bar{Q}}^{*} (0, W_{2 j}))\} \\ + 2 E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} \{({\bar{Q}}_{0} (0, W_{1 j}) - {\bar{Q}}^{*} (0, W_{1 j})) ({\bar{Q}}_{0} (1, W_{2 j}) - {\bar{Q}}^{*} (1, W_{2 j}))\} \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} \{{\bar{Q}}_{0} (W_{1 j}) - {\bar{Q}}^{*} (W_{1 j})\} \{{\bar{Q}}_{0} (W_{2 j}) - {\bar{Q}}^{*} (W_{2 j})\} \\ = E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (1, W_{2 j}) \\ + E_{0} \frac{1}{J} \sum_{j = 1}^{n / 2} ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{1 j}) ({\bar{Q}}_{0} - {\bar{Q}}^{*}) (0, W_{2 j}) \\ \equiv C, \end{array}

and $σ^{2} = σ_{I}^{2} - C$ .
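To make the correction term C concrete, the following sketch evaluates its inner average for one fixed set of matched pairs (the outer expectation $E_{0}$ over $W^{n}$ is not simulated). The regression functions `Q0` and `Qstar`, the covariate pairs, and the function name `correction_term` are hypothetical and chosen purely for illustration. The code also checks that the four cross products collapse to a single product of within-unit sums, since $a_{1} b_{0} + a_{0} b_{1} + a_{1} b_{1} + a_{0} b_{0} = (a_{1} + a_{0}) (b_{1} + b_{0})$.

```python
def correction_term(pairs, Q0, Qstar):
    """Average over matched pairs of the four cross products defining C.

    Returns the four-term form and the algebraically equivalent factored form.
    """
    def r(a, w):
        # residual (Q0 - Q*)(a, w)
        return Q0(a, w) - Qstar(a, w)

    four_terms = 0.0
    factored = 0.0
    for w1, w2 in pairs:
        four_terms += (r(1, w1) * r(0, w2) + r(0, w1) * r(1, w2)
                       + r(1, w1) * r(1, w2) + r(0, w1) * r(0, w2))
        factored += (r(1, w1) + r(0, w1)) * (r(1, w2) + r(0, w2))
    J = len(pairs)  # number of matched pairs
    return four_terms / J, factored / J

# Hypothetical true regression and a misspecified limit (illustration only).
Q0 = lambda a, w: a + w
Qstar = lambda a, w: 0.5 * a + 0.2

# Hypothetical pairs matched closely on a scalar covariate w.
pairs = [(0.10, 0.12), (0.50, 0.52), (0.90, 0.88)]

C_four, C_fact = correction_term(pairs, Q0, Qstar)
C_zero, _ = correction_term(pairs, Q0, Q0)  # correctly specified Q*: no correction
print(C_four, C_fact, C_zero)
```

In this toy example the residual sums have the same sign within each pair, so the term is positive, which corresponds to a variance reduction through $σ^{2} = σ_{I}^{2} - C$; when `Qstar` equals `Q0` the term vanishes.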

D Appendix: Proof of Theorem 5

The proof is analogous to the proof of Theorem 1. Therefore, it suffices to prove (7.3). First, we note that

\begin{array}{l} E_{0} D^{n, *} (\bar{Q}, \bar{g}, \bar{Π}, ψ_{0}) (O^{n}) = E_{0} \frac{1}{n} \sum_{i = 1}^{n} \{\bar{Q} (W_{i}) - ψ_{0}\} \\ + E_{0} \frac{1}{n} \sum_{i = 1}^{n} \left\{ \frac{g_{i, 0} (1, 1 ∣ W_{i})}{{\bar{g}}_{n} (1, 1 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (1, W_{i}) - \frac{g_{i, 0} (0, 1 ∣ W_{i})}{{\bar{g}}_{n} (0, 1 ∣ W_{i})} ({\bar{Q}}_{0} - \bar{Q}) (0, W_{i}) \right\} \\ = Ψ (Q) - ψ_{0} + \frac{1}{n} \sum_{i} \int_{w} Q_{W, 0} (w) \frac{g_{i, 0} (1, 1 ∣ w)}{{\bar{g}}_{n} (1, 1 ∣ w)} ({\bar{Q}}_{0} - \bar{Q}) (1, w) \\ - \frac{1}{n} \sum_{i} \int_{w} Q_{W, 0} (w) \frac{g_{i, 0} (0, 1 ∣ w)}{{\bar{g}}_{n} (0, 1 ∣ w)} ({\bar{Q}}_{0} - \bar{Q}) (0, w) \\ = Ψ (Q) - ψ_{0} + E_{0} \frac{{\bar{g}}_{n, 0} (1, 1 ∣ W)}{{\bar{g}}_{n} (1, 1 ∣ W)} ({\bar{Q}}_{0} - \bar{Q}) (1, W) - E_{0} \frac{{\bar{g}}_{n, 0} (0, 1 ∣ W)}{{\bar{g}}_{n} (0, 1 ∣ W)} ({\bar{Q}}_{0} - \bar{Q}) (0, W) . \end{array}

Thus, if ${\bar{g}}_{n, 0} = {\bar{g}}_{0}$, then this equals Ψ(Q) − ψ₀ + ψ₀ − Ψ(Q) = 0. If Q₀ = Q, then we also obtain 0. This proves (7.3).

E Appendix: Review of relevant empirical process/weak convergence theory

We refer to van der Vaart and Wellner (1996), Section 2.2, on maximal inequalities and covering numbers. For a real-valued random variable X and a convex function ψ with ψ(0) = 0, the Orlicz norm is defined as $‖ X ‖_{ψ} \equiv \inf \{C > 0 : E ψ (∣ X ∣ / C) \leq 1\}$. Setting ψ(x) = x^p gives the L_p-norms $‖ X ‖_{p} = {(E ∣ X ∣^{p})}^{1 / p}$, p ≥ 1. Another important choice for empirical processes is ψ_p(x) = exp(x^p) − 1. Sums of independent bounded random variables and Gaussian random variables have bounded ψ₂-norm. There is an important relation between the Orlicz norm and a bound on the tail probability of the random variable. In particular, we have (page 96 in van der Vaart and Wellner (1996))

P (∣ X ∣ > x) \leq \frac{1}{ψ (x / {‖ X ‖}_{ψ})} .

For ψ_p(x) this leads to tail estimates exp(−Cx^p) for any random variable with a finite ψ_p-norm. Conversely, an exponential tail bound of this type shows that $‖ X ‖_{ψ_{p}}$ is finite: Lemma 2.2.1 states that if P(| X | > x) ≤ K exp(−Cx^p) for every x, for constants K and C, and for p ≥ 1, then its Orlicz norm satisfies $‖ X ‖_{ψ_{p}} \leq {((1 + K) / C)}^{1 / p}$. So if we have an exponential tail probability for X_n(f), then we can translate this into a bound on the ψ_p-Orlicz norm.
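The definition of the norm $‖ X ‖_{ψ_{p}}$ can be made concrete for a finitely supported random variable. The following is a minimal sketch (the function names are hypothetical) that finds the smallest C with $E ψ_{p} (∣ X ∣ / C) \leq 1$ by bisection. For a Rademacher variable (X = ±1 with probability 1/2 each) and p = 2, the constraint reads exp(1/C²) − 1 ≤ 1, so the norm is $1 / \sqrt{\log 2}$, which the code reproduces.

```python
import math

def psi_p(x, p=2.0):
    """psi_p(x) = exp(x^p) - 1, returned as +inf where exp would overflow."""
    e = x ** p
    return math.inf if e > 700.0 else math.exp(e) - 1.0

def orlicz_norm(values, probs, p=2.0, iters=200):
    """inf{C > 0 : E psi_p(|X|/C) <= 1} for X with P(X = values[k]) = probs[k] > 0.

    Assumes at least one value is nonzero.
    """
    def excess(c):
        return sum(q * psi_p(abs(v) / c, p) for v, q in zip(values, probs)) - 1.0

    hi = max(abs(v) for v in values)
    while excess(hi) > 0.0:      # grow until the constraint holds
        hi *= 2.0
    lo = hi
    while excess(lo) <= 0.0:     # shrink until the constraint fails
        lo /= 2.0
    for _ in range(iters):       # bisection: excess(c) is decreasing in c
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return hi

# Rademacher variable: X = +-1 with probability 1/2 each.
norm = orlicz_norm([-1.0, 1.0], [0.5, 0.5])
closed_form = 1.0 / math.sqrt(math.log(2.0))
print(norm, closed_form)
```

Bisection is used here because $E ψ_{p} (∣ X ∣ / C)$ is monotone decreasing in C; any other one-dimensional root finder would serve equally well.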

Given a sequence of random variables X_i, we have (page 96)

{‖ max_{i \leq m} X_{i} ‖}_{ψ} \leq K ψ^{- 1} (m) max_{i} {‖ X_{i} ‖}_{ψ} .

Thus, if we can bound the Orlicz norm of X_n(f) in terms of a norm on f, then this result allows us to bound the Orlicz norm of a maximum over m functions. This bound combined with chaining gives the typical entropy-type bounds. As we will see, one of the main things we will need is a bound on || X_n(f) − X_n(g) ||_ψ in terms of d(f, g) for a semi-metric d on $F$ .
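The role of the factor $ψ^{- 1} (m)$ in the maximal inequality can be illustrated by simulation. The sketch below assumes independent standard normal variables, a choice made here only for illustration, and estimates $E \max_{i \leq m} X_{i}$ by Monte Carlo. For ψ₂ we have $ψ_{2}^{- 1} (m) = \sqrt{\log (1 + m)}$, and the estimates are compared with the classical bound $\sqrt{2 \log m}$ on the expected maximum of m standard normals.

```python
import math
import random

def mean_max_gaussian(m, reps, rng):
    """Monte Carlo estimate of E max_{i<=m} X_i for independent N(0,1) X_i."""
    total = 0.0
    for _ in range(reps):
        total += max(rng.gauss(0.0, 1.0) for _ in range(m))
    return total / reps

rng = random.Random(2)   # fixed seed for reproducibility
rows = []
for m in (10, 100, 1000):
    est = mean_max_gaussian(m, 500, rng)
    psi2_inv = math.sqrt(math.log(1.0 + m))   # psi_2^{-1}(m)
    upper = math.sqrt(2.0 * math.log(m))      # classical bound on E max
    rows.append((m, est, psi2_inv, upper))
    print(rows[-1])
```

The expected maximum grows only logarithmically in m, at the rate $\sqrt{\log m}$ carried by $ψ_{2}^{- 1} (m)$, which is what makes the ψ₂-norm well suited to chaining over large finite collections.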

Bounding the Orlicz norm

Let (T, d) be an arbitrary semi-metric space. The covering number N(ε, d) is the minimal number of balls of radius ε needed to cover T. Call a collection of points ε-separated if the distance between each pair of points is strictly larger than ε. The packing number D(ε, d) is the maximum number of ε-separated points in T. Entropy numbers are the logarithms of the covering or packing numbers. Since N(ε, d) ≤ D(ε, d) ≤ N(0.5ε, d), bounds on packing numbers translate into bounds on covering numbers and vice versa.
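The inequality N(ε, d) ≤ D(ε, d) rests on the fact that an ε-separated set that cannot be enlarged is automatically an ε-cover. The following sketch illustrates this on a finite grid in [0, 1] with the absolute-value distance; the grid and the helper names are assumptions made for the example. Note that the greedy construction yields a maximal separated set, which need not attain the packing number itself.

```python
def greedy_separated(points, eps, dist):
    """Greedily build a maximal eps-separated subset (pairwise distances > eps)."""
    chosen = []
    for t in points:
        if all(dist(t, c) > eps for c in chosen):
            chosen.append(t)
    return chosen

def is_cover(points, centers, eps, dist):
    """True if every point lies within distance eps of some center."""
    return all(any(dist(t, c) <= eps for c in centers) for t in points)

T = [k / 200 for k in range(201)]   # finite grid on [0, 1]
d = lambda s, t: abs(s - t)

checks = []
for eps in (0.3, 0.1, 0.03):
    sep = greedy_separated(T, eps, d)
    checks.append((eps, len(sep), is_cover(T, sep, eps, d)))
    print(checks[-1])
```

Every point rejected by the greedy pass lies within ε of an already chosen point, so the balls of radius ε around the chosen points cover T. The size of the separated set is therefore at least N(ε, d) and at most D(ε, d).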

For our purpose, we will need Theorem 2.2.4 in van der Vaart and Wellner (1996), which is stated here for completeness.

Theorem 6

(Theorem 2.2.4, van der Vaart and Wellner, 1996) Let ψ be a convex, non-decreasing, non-zero function with ψ(0) = 0 and $\limsup_{x, y \to \infty} ψ (x) ψ (y) / ψ (c x y) < \infty$ for some constant c. Let (X_t : t ∈ T) be a separable stochastic process (that is, $\sup_{d (s, t) < δ} ∣ X_{s} - X_{t} ∣$ remains almost surely the same if the index set T is replaced by a suitable countable subset) with

{‖ X_{s} - X_{t} ‖}_{ψ} \leq C d (s, t) for every s, t,

for some semimetric d on T and a constant C. Then, for any η, δ > 0,

{‖ \sup_{d (s, t) \leq δ} ∣ X_{s} - X_{t} ∣ ‖}_{ψ} \leq K \left\{ \int_{0}^{η} ψ^{- 1} (D (ε, d)) d ε + δ ψ^{- 1} (D^{2} (η, d)) \right\}

for a constant K depending on ψ and C only. In particular, the constant K can be chosen so that

{‖ \sup_{s, t} ∣ X_{s} - X_{t} ∣ ‖}_{ψ} \leq K \int_{0}^{\mathrm{diam} (T)} ψ^{- 1} (D (ε, d)) d ε,

where diam(T) is the diameter of T. This result also gives, for any fixed t₀ ∈ T,

{‖ \sup_{t} ∣ X_{t} ∣ ‖}_{ψ} \leq {‖ X_{t_{0}} ‖}_{ψ} + K \int_{0}^{\mathrm{diam} (T)} ψ^{- 1} (D (ε, d)) d ε .

The bound shows that the sample paths of X are uniformly continuous in ψ-norm whenever the covering integral $\int_{0}^{η} ψ^{- 1} (D (ε, d)) d ε$ is finite for some η > 0. For this integral to be finite for classes T with covering numbers that behave as $ε^{- p}$, one needs an Orlicz norm with ψ(x) = x^q for some q > p, and if one wants the integral to be finite for any p, then one needs ψ(x) = exp(x^q) − 1 for some q.
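The finiteness of the covering integral can be checked numerically. The sketch below assumes a hypothetical packing number $D (ε, d) = ε^{- p}$ with p = 2 and approximates $\int_{0}^{1} ψ^{- 1} (D (ε, d)) d ε$ with a midpoint rule, for ψ₂(x) = exp(x²) − 1 (inverse $\sqrt{\log (1 + y)}$) and for ψ(x) = x⁴ (inverse $y^{1 / 4}$, so the integrand is $ε^{- 1 / 2}$ and the exact value is 2). For ψ(x) = x^q the integrand is $ε^{- p / q}$, which is integrable at zero only when q > p.

```python
import math

def entropy_integral(psi_inv, packing, eta, n_grid=200000):
    """Midpoint-rule approximation of int_0^eta psi^{-1}(D(eps)) d eps."""
    h = eta / n_grid
    return h * sum(psi_inv(packing((k + 0.5) * h)) for k in range(n_grid))

p = 2.0
packing = lambda eps: eps ** (-p)                   # hypothetical D(eps, d) = eps^{-p}
psi2_inv = lambda y: math.sqrt(math.log(1.0 + y))   # inverse of exp(x^2) - 1
power4_inv = lambda y: y ** 0.25                    # inverse of x^4

val_psi2 = entropy_integral(psi2_inv, packing, 1.0)
val_q4 = entropy_integral(power4_inv, packing, 1.0)  # exact value is 2
print(val_psi2, val_q4)
```

The midpoint rule avoids evaluating the integrand at ε = 0, where it is unbounded; because of that singularity the approximation converges slowly, so only a few digits should be trusted.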

If one can prove that || X_n(s) − X_n(t) ||_ψ ≤ Cd(s, t) for a constant C independent of n, and each X_n is a separable stochastic process, then this theorem teaches us that for any sequence δ_n and any η > 0, there exists a constant K depending on ψ and C only (not dependent on n!) so that

{‖ \sup_{d (s, t) \leq δ_{n}} ∣ X_{n} (s) - X_{n} (t) ∣ ‖}_{ψ} \leq K \left\{ \int_{0}^{η} ψ^{- 1} (D (ε, d)) d ε + δ_{n} ψ^{- 1} (D^{2} (η, d)) \right\} .

We can now apply this inequality for a sequence δ_n → 0 as n → ∞. Since η can be chosen arbitrarily small, it follows that, if $\int_{0}^{η} ψ^{- 1} (D (ε, d)) d ε < \infty$ for some η > 0, then

{‖ sup_{d (s, t) \leq δ_{n}} ∣ X_{n} (s) - X_{n} (t) ∣ ‖}_{ψ} \to 0 as n \to \infty .

So we can state the following useful corollary:

Corollary E.1

Suppose there exists an η > 0 so that $\int_{0}^{η} ψ^{- 1} (D (ε, d)) d ε < \infty$ . In addition, assume

{‖ X_{n} (s) - X_{n} (t) ‖}_{ψ} \leq C d (s, t)

for a constant C independent of n, and each X_n is a separable stochastic process with respect to d. Then for any sequence δ_n → 0, we have

{‖ sup_{d (s, t) \leq δ_{n}} ∣ X_{n} (s) - X_{n} (t) ∣ ‖}_{ψ} \to 0 as n \to \infty .

This corollary provides us with conditions under which X_n is asymptotically uniformly d-equicontinuous in probability. Theorem 1.5.7. in van der Vaart and Wellner (1996) now states that X_n is asymptotically tight in ℓ^∞(T) if X_n(t) is asymptotically tight for every t, (T, d) is totally bounded, and X_n is asymptotically d-equicontinuous in probability. In addition, Theorem 1.5.4 states that if X_n is asymptotically tight and its marginals converge weakly to the marginals X(t₁), …, X(t_k) of a stochastic process X, then there is a version of X with uniformly bounded sample paths and X_n converges weakly to X. Thus, we can state the following result:

Lemma E.1

Let ψ be one of the following functions: ψ(x) = x^p for some p, or ψ(x) = exp(x¹) − 1, ψ(x) = exp(x²) − 1. Let d be a semi-metric on T so that (T, d) is totally bounded, and there exists an η > 0 so that $\int_{0}^{η} ψ^{- 1} (D (ε, d)) d ε < \infty$ . In addition, assume

{‖ X_{n} (s) - X_{n} (t) ‖}_{ψ} \leq C d (s, t)

for a constant C independent of n, and each X_n is a separable stochastic process with respect to d. Then for any sequence δ_n → 0, we have for each x > 0

\Pr (\sup_{d (s, t) \leq δ_{n}} ∣ X_{n} (s) - X_{n} (t) ∣ > x) \to 0 \text{ as } n \to \infty,

(E.1)

and X_n is asymptotically tight.

If (X_n(t₁), …, X_n(t_k)) converges weakly to (X(t₁), …, X(t_k)), then there exists a version of X with uniformly bounded sample paths and X_n ⇒_d X.

If X is a Gaussian process in ℓ^∞(T), and d(s, t) = ρ_p(s, t) ≡ || X(s) − X(t) ||_p, then there exists a version of X which is a tight Borel measurable map into ℓ^∞(T).

Actually (page 41), if X is Gaussian, then X_n converges weakly to X in ℓ^∞ (T) if and only if for some p (and then for all p) (i) the marginals of X_n converge to the corresponding marginals of X, (ii) X_n is asymptotically equicontinuous in probability with respect to

d (s, t) = ρ_{p} (s, t) \equiv {‖ X (s) - X (t) ‖}_{p},

as defined in (E.1), and (iii) T is totally bounded for d = ρ_p.

Contributor Information

Mark J. van der Laan, Email: laan@berkeley.edu, Division of Biostatistics, University of California, Berkeley

Laura B. Balzer, Email: lbbalzer@gmail.com, Division of Biostatistics, University of California, Berkeley

Maya L. Petersen, Email: mayaliv@berkeley.edu, Division of Biostatistics, University of California, Berkeley

References

Andersen J, Faries D, Tamura R. A randomized play-the-winner design for multiarm clinical trials. Communication in Statistical Theory. 1994;23:309–323. [Google Scholar]
Bai Z, Hu F, Rosenberger W. Asymptotic properties of adaptive designs for clinical trials with delayed response. Annals of Statistics. 2002;30(1):122–139. [Google Scholar]
Balzer L, Petersen M, van der Laan M. Technical Report. Vol. 294. Division of Biostatistics, University of California; Berkeley: 2012. Why match in individually and cluster randomized trials. [Google Scholar]
Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag; 1997. [Google Scholar]
Campbell M, Donner A, Klar N. Developments in cluster randomized trials and statistics in medicine. Statistics in Medicine. 2007;26:2–19. doi: 10.1002/sim.2731. [DOI] [PubMed] [Google Scholar]
Chambaz A, van der Laan M. Technical Report. Vol. 258. Division of Biostatistics, University of California; Berkeley: 2010. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate. [Google Scholar]
Cheng Y, Shen Y. Bayesian adaptive designs for clinical trials. Biometrika. 2005;92(3):633–646. [Google Scholar]
Donner A, Klar N. Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold; 2000. [Google Scholar]
Flournoy N, Rosenberger W. Adaptive Designs. Hayward, Institute of Mathematical Statistics; 1995. [Google Scholar]
Freedman L, Gail M, Green S, Corle D. The efficiency of the matched-pairs design of the community intervention trial for smoking cessation (commit) Controlled clinical trials. 1997;18(2):131–139. doi: 10.1016/S0197-2456(96)00115-8. [DOI] [PubMed] [Google Scholar]
Freedman L, Green S, Byar D. Assessing the gain in efficiency due to matching in a community intervention study. Statistics in Medicine. 1990;9:943–952. doi: 10.1002/sim.4780090810. [DOI] [PubMed] [Google Scholar]
Gail M, Byar D, Pechacek T, Corle D. Aspects of statistical design for the community intervention trial for smoking cessation. Controlled clinical trials. 1992;13:6–21. doi: 10.1016/0197-2456(92)90026-v. [DOI] [PubMed] [Google Scholar]
Gill R, van der Laan M, Robins J. In: Lin D, Fleming T, editors. Coarsening at random: characterizations, conjectures and counter-examples; Proceedings of the First Seattle Symposium in Biostatistics; New York. Springer Verlag; 1997. pp. 255–94. [Google Scholar]
Gill R, van der Laan M, Wellner J. Inefficient estimators of the bivariate survival function for three models. Annales de l’Institut Henri Poincaré. 1995;31:545–597. [Google Scholar]
COMMIT Research Group. Summary of design and intervention. Journal of National Cancer Institute. 1991;83(22):1620–1628. doi: 10.1093/jnci/83.22.1620. [DOI] [PubMed] [Google Scholar]
Hayes R, Moulton L. Cluster Randomized Trials. Boca Raton: Chapman & Hall/CRC; 2009. [Google Scholar]
Heitjan D, Rubin D. Ignorability and coarse data. Annals of statistics. 1991 Dec;19(4):2244–2253. [Google Scholar]
Hu F, Rosenberger W. Analysis of time trends in adaptive designs withi appliation to a neurophysiology experiment. Statistics in Medicine. 2000;19:2067–2075. doi: 10.1002/1097-0258(20000815)19:15<2067::aid-sim508>3.0.co;2-l. [DOI] [PubMed] [Google Scholar]
Imai K. Variance identification and efficiency analysis in randomized experiments under the matched-pair design. Statistics in Medicine. 2008;27(24):4857–4873. doi: 10.1002/sim.3337. [DOI] [PubMed] [Google Scholar]
Imai K, King G, Nall C. The essential role of pair matching in cluster randomized experiments with applicationto the mexican universal health insurance evaluation. Statistical Science. 2009;24(1):29–53. [Google Scholar]
Jacobsen M, Keiding N. Coarsening at random in general sample spaces and random censoring in continuous time. Annals of Statistics. 1995;23:774–86. [Google Scholar]
Klar N, Donner A. The merits of matching in community intervention trials: a cautionary tale. Statistics in Medicine. 1997;16(15):1753–1764. doi: 10.1002/(sici)1097-0258(19970815)16:15<1753::aid-sim597>3.0.co;2-e. [DOI] [PubMed] [Google Scholar]
Koethe J, Westfall A, Luhanga D, Clark G, Goldman J, Mulenga P, Cantrell R, Chi B, Zulu I, Saag M, Stringer JS. A cluster randomized trial of routine hiv-1 viral load monitoring in zambia. PlLoS One. 2010;5,5(3):e9680. doi: 10.1371/journal.pone.0009680. [DOI] [PMC free article] [PubMed] [Google Scholar]
Murray D. Design and Analysis of Community Randomized Trials. Oxford: Oxford University Press; 1998. [Google Scholar]
Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82:669–710. [Google Scholar]
Pearl J. Causality: models, reasoning, and inference. 2. New York: Cambridge; 2009. [Google Scholar]
Polley E, van der Laan M. Technical Report. Vol. 266. Division of Biostatistics, University of California; Berkeley: 2010. Super learner in prediction. [Google Scholar]
Raudenbush S, Martinez A, Spybrook J. Developments in cluster randomized trials and statistics in medicine. Educational Evaluation and Policy Analysis. 2007;29:5–29. [Google Scholar]
Rose S, van der Laan M. Why match? investigating matched case-control study designs with causal effect estimation. The International Journal of Biostatistics. 2009 doi: 10.2202/1557-4679.1127. http://www.bepress.com/ijb/vol5/iss1/1/ [DOI] [PMC free article] [PubMed]
Rosenberger W. New directions in adaptive designs. Statistical Science. 1996;11:137–149. [Google Scholar]
Rosenberger W, Flournoy N, Durham S. Asymptotic normality of maximum likelihood estimators from multiparameter response driven designs. Journal of Statistical Planning and Inference. 1997:69–76. [Google Scholar]
Rosenberger W, Grill S. A sequential design for psychophysical experiments: An application to estimating timing of sensory events. Statistics in Medicine. 1997;16:2245–2260. doi: 10.1002/(sici)1097-0258(19971015)16:19<2245::aid-sim656>3.0.co;2-s. [DOI] [PubMed] [Google Scholar]
Rosenberger W, Shiram T. Estimation for an adaptive allocation design. Journal of Statistical Planning and Inference. 1997;59:309–319. [Google Scholar]
Rosenblum M, van der Laan M. Targeted maximum likelihood estimation of the parameter of a marginal structural model. Int J Biostat. 2010;6(2):Article 19. doi: 10.2202/1557-4679.1238. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tamura R, Faries D, Andersen J, Heiligenstein J. A case study of an adaptive clinical trial in the treatment of out-patients with depressive disorder. Journal of the American Statistical Association. 1994;89:768–776. [Google Scholar]
Toftager M, Christiansen L, Kristensen P, Troelsen J. Space for physical activity-a multicomponent intervention study: study design and baseline findings from a cluster randomized controlled trial. BMC Public Health. 2011;10:711–777. doi: 10.1186/1471-2458-11-777. [DOI] [PMC free article] [PubMed] [Google Scholar]
van der Laan M. CWI tract. 114. Amsterdam: Centre of Computer Science and Mathematics; 1996. Efficient and Inefficient Estimation in Semiparametric Models. [Google Scholar]
van der Laan M. Technical Report. Vol. 232. Division of Biostatistics, University of California; Berkeley: 2008. The construction and analysis of adaptive group sequential designs. http://www.bepress.com/ucbbiostat/paper234. [Google Scholar]
van der Laan M, Dudoit S. Technical Report. Vol. 130. Division of Biostatistics, University of California; Berkeley: 2003. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. [Google Scholar]
van der Laan M, Dudoit S, van der Vaart A. The cross-validated adaptive epsilon-net estimator. Statistics and Decisions. 2006;24(3):373–395. [Google Scholar]
van der Laan M, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(25):1–21. doi: 10.2202/1544-6115.1309. [DOI] [PubMed] [Google Scholar]
van der Laan M, Robins J. Unified methods for censored longitudinal data and causality. Berlin Heidelberg New York: Springer; 2003. [Google Scholar]
van der Laan M, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer; 2012. [Google Scholar]
van der Laan M, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1211. [DOI] [PMC free article] [PubMed] [Google Scholar]
van der Vaart A. Asymptotic Statistics. Cambridge University Press; 1998. [Google Scholar]
van der Vaart A, Dudoit S, van der Laan M. Oracle inequalities for multi-fold cross-validation. Statistics and Decisions. 2006;24(3):351–371. [Google Scholar]
van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996. [Google Scholar]
Watson L, Small R, Brown S, Dawson W, Lumley J. Mounting a community-randomized trial: sample size, matching, selection, and randomization issues in PRISM. Journal of Controlled Clinical Trials. 2004;3:235–250. doi: 10.1016/j.cct.2003.12.002. [DOI] [PubMed] [Google Scholar]
Wei L. The generalized Polya’s urn design for sequential medical trials. Annals of Statistics. 1979;7:291–296. [Google Scholar]
Wei L, Durham S. The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association. 1978;73:840–843. [Google Scholar]
Wei L, Smythe R, Lin D, Park T. Statistical inference with data-dependent treatment allocation rules. Journal of the American Statistical Association. 1990;85:156–162. [Google Scholar]
Zelen M. Play the winner rule and the controlled clinical trial. Journal of the American Statistical Association. 1969;64:131–146. [Google Scholar]

