Author manuscript; available in PMC 2018 Aug 9. Published in final edited form as: Scand Stat Theory Appl. 2013 May 7;41(1):104–140. doi: 10.1111/sjos.12013

Estimation and testing in targeted group sequential covariate-adjusted randomized clinical trials

A Chambaz 1,*, M J van der Laan 2
PMCID: PMC6084807  NIHMSID: NIHMS979199  PMID: 30100663

Abstract

This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials analyzed through the prism of the semiparametric methodology of targeted maximum likelihood estimation (TMLE). We show how to build, as the data accrue group-sequentially, a sampling design which targets a user-supplied optimal design. We also show how to carry out sound TMLE statistical inference based on such an adaptive sampling scheme (thereby extending results so far known only in the i.i.d setting), and how group-sequential testing applies on top of it. The procedure is robust (i.e., consistent even if the working model is misspecified). A simulation study confirms the theoretical results, and validates the conjecture that the procedure may also be efficient.

Keywords: Adaptive design, asymptotic normality, canonical distribution, clinical trial, contiguity, group-sequential testing, robustness, targeted maximum likelihood methodology

1. Introduction

This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials (RCTs) analyzed through the prism of the targeted maximum likelihood estimation (TMLE) methodology. The observed data structure writes as O = (W, A, Y), where W is a vector of baseline covariates, A is the treatment assignment and Y is the primary outcome of interest. We assume that A is binary and that Y is one-dimensional. Typical parameters of scientific interest are Ψ+ = E{E(Y|A = 1, W) − E(Y|A = 0, W)} (additive scale, which we consider hereafter) or Ψ× = log E{E(Y|A = 1, W)} − log E{E(Y|A = 0, W)} (multiplicative scale). Such parameters can be interpreted causally whenever one is willing to assume the existence of a full data structure X = (W, Y(0), Y(1)) containing the vector of baseline covariates and the two counterfactual outcomes under the two possible treatments, such that Y = Y(A) and A is conditionally independent of X given W. If so, Ψ+ = EY(1) − EY(0) and Ψ× = log EY(1) − log EY(0). Let us now explain what we mean by adaptive group sequential covariate-adjusted RCT.

Adaptive group sequential covariate-adjusted RCT.

By adaptive covariate-adjusted design, we mean in the setting of this article a clinical trial design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial. Moreover, the patient's baseline covariates may be taken into account for the random treatment assignment. This definition is slightly adapted from (Golub, 2006). In particular, we assume, with a view to the definition of pre-specified sampling plans given in (Emerson, 2006), that prior to collection of the data the trial protocol specifies the parameter of scientific interest, the inferential method and the confidence level to be used when constructing a confidence interval for that parameter.

Furthermore, we assume that the investigator specifies beforehand in the trial protocol a criterion of special interest which yields a notion of optimal randomization scheme that we therefore wish to target. For instance, the criterion could translate the necessity to minimize the number of patients assigned to their corresponding inferior treatment arm, subject to level and power constraints. Or the criterion could translate the necessity that a result be available as quickly as possible, subject to level and power constraints. The sole restriction on the criterion is that it must yield an optimal randomization scheme which can be approximated from the data accrued so far. The two examples above comply with this restriction.

We focus in this article on targeted maximum likelihood estimation, as introduced by van der Laan and Rubin (2006) in the independent identically distributed (i.i.d) setting. The extension to the setting of adaptive RCTs was first considered in (van der Laan, 2008), upon which this article relies. In addition, we choose to consider specifically the second criterion cited above. Consequently, the optimal randomization scheme is the so-called Neyman allocation, which minimizes the asymptotic variance of the TMLE of the parameter of interest. We emphasize that there is nothing special about targeting the Neyman allocation, the whole methodology applying equally well to a large class of optimal randomization schemes derived from a variety of valid criteria.

By adaptive group sequential design, we refer to the possibility of adjusting the randomization scheme only by blocks of c patients, where c ≥ 1 is a pre-specified integer (the case c = 1 corresponds to a fully sequential adaptive design). The expression also refers to the fact that group sequential testing methods can be equally well applied on top of adaptive designs, an extension that we do consider here. Although all our results (and their proofs) hold for any c ≥ 1, we consider the case c = 1 in the theoretical part of the article for simplicity's sake; the case c > 1 is considered in the simulation study.

Short bibliography.

The literature on adaptive designs is vast and our review is not comprehensive. Quite misleadingly, the expression "adaptive design" has also been used in the literature for sequential testing and, in general, for designs that allow data-adaptive stopping times for the whole study (or for certain treatment arms) which achieve the desired type I and type II error requirements when testing a null hypothesis against its alternative.

Of course, data-adaptive randomization schemes have a long history that goes back to the 1930s, and we refer to Section 1.2 in (Hu and Rosenberger, 2006), Section 17.4 in (Jennison and Turnbull, 2000) and (Rosenberger, 1996) for a comprehensive historical perspective.

Many articles are devoted to the study of "response adaptive designs", an expression implicitly suggesting that those designs only depend on past responses of previous patients and not on the corresponding covariates. We refer to (Hu and Rosenberger, 2006; Chambaz and van der Laan, 2011) for a bibliography on that topic. By contrast, covariate-adjusted response adaptive (CARA) randomizations tackle the so-called issue of heterogeneity (i.e., the use of covariates in adaptive designs) by dynamically calculating the allocation probabilities on the basis of previous responses and current and past values of certain covariates. In this view, this article studies a new type of CARA procedure. The interest in CARA procedures is more recent, and there is a steadily growing number of articles dedicated to their study, starting with (Rosenberger et al., 2001; Bandyopadhyay and Biswas, 2001), then (Atkinson and Biswas, 2005; Zhang et al., 2007; Shao et al., 2010) among others. The latter articles are typically concerned with the convergence (almost sure and in law) of the allocation probabilities vector and of the estimator of the parameter in a correctly specified parametric model ((Shao et al., 2010) is devoted to the testing issue).

By contrast, the consistency and asymptotic normality results that we obtain in this article are robust to model misspecification. Thus, they contribute significantly to answering the question raised by the Food & Drug Administration (2006):

When is it valid to modify randomization based on results, for example, in a combined phase 2/3 cancer trial?

Finally, this article mainly relies on (Chambaz and van der Laan, 2009, 2011; van der Laan, 2008), the latter technical report paving the way to robust and more efficient estimation based on adaptive RCTs in a variety of other settings (including the case that the outcome Y is a possibly censored time-to-event).

Organization.

We first set the statistical framework in Section 2. The rationale of our adaptive covariate-adjusted designs is presented in Section 3, and we complete its formal definition in Section 4 where we also detail how the TMLE methodology operates. The asymptotic study is carried out in Section 5 (estimation) and in Section 6 (group sequential testing). The simulation study is developed in Section 7 (estimation) and in Section 8 (group sequential testing). The proofs are relegated to the Appendix.

2. Statistical framework

We tackle the asymptotic study of adaptive group sequential designs in the case of RCTs with covariate, binary treatment and one-dimensional primary outcome of interest. Thus, the observed data structure writes as O = (W, A, Y), where $W \in \mathcal{W}$ consists of some baseline covariates, $A \in \mathcal{A} = \{0, 1\}$ denotes the assigned binary treatment, and $Y \in \mathcal{Y}$ is the primary outcome of interest. For example, Y can indicate whether the treatment has been successful or not ($\mathcal{Y} = \{0, 1\}$); or Y can count the number of times an event of interest has occurred under the assigned treatment during a period of follow-up ($\mathcal{Y} = \mathbb{N}$); or Y can measure a quantity of interest after a given time has elapsed ($\mathcal{Y} = \mathbb{R}$). Although we will focus on the last case in this article, the methodology applies equally well to each example cited above.

Let us denote by $P_0$ the true distribution of the observed data structure O in the population of interest. We see $P_0$ as a specific element of the non-parametric set $\mathcal{M}$ of all possible observed data distributions. Note that, in order to avoid some technicalities, we assume (or rather: impose) that all elements of $\mathcal{M}$ are dominated by a common measure. The parameter of scientific interest is the marginal effect of treatment a = 1 relative to treatment a = 0 on the additive scale, or risk difference: $\psi_0 = E_{P_0}\{E_{P_0}[Y|A=1,W] - E_{P_0}[Y|A=0,W]\}$. Of course, other choices such as the log-relative risk (the counterpart of the risk difference on the multiplicative scale) could be considered, and dealt with along the same lines. The risk difference can be interpreted causally, for instance in the counterfactual framework.

For all $P \in \mathcal{M}$, let us introduce the shorthand notation $Q_W(P)(W) = P[W]$, $g(P)(A|W) = P[A|W]$, $Q_{Y|A,W}(P)(O) = P[Y|A,W]$. We use the alternative notation $P = P_{Q,g}$ with $Q = Q(P) \equiv (Q_W(P), Q_{Y|A,W}(P))$ and $g = g(P)$. Equivalently, $P_{Q,g}$ is the data generating distribution such that $Q(P_{Q,g}) = Q$ and $g(P_{Q,g}) = g$. In particular, we denote $Q_0 = Q(P_0) = (Q_W(P_0), Q_{Y|A,W}(P_0))$. We also introduce the notation $\mathcal{Q} = \{Q(P) : P \in \mathcal{M}\}$ for the non-parametric set of all possible values of Q, and $\mathcal{G} = \{g(P) : P \in \mathcal{M}\}$ for the non-parametric set of all possible values of g. Setting $\bar{Q}(P)(A,W) = E_P[Y|A,W]$ and $\bar{Q}_0 = \bar{Q}(P_0)$ (with a slight abuse, we also sometimes write $\bar{Q}(Q)$ instead of $\bar{Q}(P_{Q,g})$), we define in greater generality

$$\Psi(P) = E_P\{\bar{Q}(P)(1,W) - \bar{Q}(P)(0,W)\} \quad (1)$$

over the whole set $\mathcal{M}$, so that $\psi_0$ equivalently writes as $\psi_0 = \Psi(P_0)$. This notation also emphasizes the fact that $\Psi(P)$ only depends on P through $\bar{Q}(P)$ and $Q_W(P)$, justifying the alternative notation $\Psi(P_{Q,g}) = \Psi(Q)$. The following proposition summarizes the most fundamental properties enjoyed by $\Psi$.

Proposition 1 (efficient influence curve). The functional $\Psi$ is pathwise differentiable at every $P \in \mathcal{M}$ relative to the maximal tangent space. The efficient influence curve of $\Psi$ at $P_{Q,g} \in \mathcal{M}$ is characterized by

$$D(P_{Q,g})(O) = D_1(Q)(W) + D_2(P_{Q,g})(O), \quad (2)$$

where

$$D_1(Q)(W) = \bar{Q}(1,W) - \bar{Q}(0,W) - \Psi(Q), \quad \text{and} \quad D_2(P_{Q,g})(O) = \frac{2A-1}{g(A|W)}\,(Y - \bar{Q}(A,W)).$$

The variance $\operatorname{Var}_P D(P)(O)$ is the lower bound of the asymptotic variance of any regular estimator of $\Psi(P)$ in the i.i.d setting. Furthermore, even if $Q \neq Q_0$,

$$E_{P_0} D(P_{Q,g})(O) = 0 \quad \text{implies} \quad \Psi(Q) = \Psi(Q_0) \quad (3)$$

when g = g(P0).

The implication (3) is the key to the robustness of the targeted maximum likelihood estimator introduced and studied in this article. It is another justification of our interest in the pathwise differentiability of the functional $\Psi$ and its efficient influence curve.

3. Data generating mechanism for adaptive design

The purpose of adaptive group sequential design as we consider it in this article is to adjust the randomization scheme as the data accrue. We first formally describe in Section 3.1 the data generating mechanism through the expression of the likelihood function and also in terms of causal graphs. Then we discuss the general issue of choosing an optimal design to target in Section 3.2, before specifically describing an optimal design of interest in Section 3.3 and showing how to target it in Section 3.4.

3.1. Data generating mechanism and related likelihood

In order to formally describe the data generating mechanism, we need to state a starting assumption: during the course of the clinical trial, it is possible to recruit patients independently from a stationary population. In the counterfactual framework, this is equivalent to supposing that it is possible to sample as many independent copies of the full data structure as required. Let us denote by $O_i = (W_i, A_i, Y_i)$ the ith observed data structure. We also find it convenient to introduce $\mathbf{O}_n = (O_1, \ldots, O_n)$ and, for every $i = 0, \ldots, n$, $\mathbf{O}_n(i) = (O_1, \ldots, O_i)$ (with the convention $\mathbf{O}_n(0) = \emptyset$). We denote by $\mathcal{O}$ the set where the observed data structure O takes its values.

By adjusting the randomization scheme as the data accrue, we mean that the nth treatment assignment $A_n$ is drawn from $g_n(\cdot|W_n)$, where $g_n(\cdot|W)$ is a conditional distribution (or treatment mechanism) given the covariate W which additionally depends on past observations $\mathbf{O}_{n-1}$. Since the sequence of treatment mechanisms cannot reasonably grow in complexity as the sample size increases, we will only consider data-adaptive treatment mechanisms such that $g_n(\cdot|W)$ depends on $\mathbf{O}_{n-1}$ only through a finite-dimensional summary measure $Z_n = \phi_n(\mathbf{O}_{n-1})$, where the measurable function $\phi_n$ maps $\mathcal{O}^{n-1}$ onto $\mathbb{R}^d$ for some fixed $d \ge 0$ ($d = 0$ corresponds to the case that $g_n(\cdot|W)$ actually does not adapt). For instance, $Z_{n+1} = \phi_{n+1}(\mathbf{O}_n) \equiv \left( n^{-1}\sum_{i=1}^n Y_i \mathbf{1}\{A_i = 0\},\ n^{-1}\sum_{i=1}^n Y_i \mathbf{1}\{A_i = 1\} \right)$ characterizes a proper summary measure of the past, which keeps track of the mean outcome in each treatment arm. Another sequence of mappings $\phi_n$ will be at the core of the adaptive methodology that we study in depth in this article, see (9).
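For concreteness, here is a minimal Python sketch of the running-example summary measure $\phi_{n+1}$ above; the function name and the toy data are ours, not the article's:

```python
import numpy as np

def phi_next(A, Y):
    """Running-example summary measure Z_{n+1} = phi_{n+1}(O_n):
    the pair (n^{-1} sum_i Y_i 1{A_i = 0}, n^{-1} sum_i Y_i 1{A_i = 1})."""
    n = len(Y)
    return (np.sum(Y * (A == 0)) / n, np.sum(Y * (A == 1)) / n)

# toy usage on a simulated past O_n
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=100)
Y = rng.normal(size=100)
Z_next = phi_next(A, Y)
```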

Formally, the data generating mechanism is specified by the following factorization of the likelihood of On:

$$\prod_{i=1}^n \left\{ Q_W(P_0)(W_i) \times Q_{Y|A,W}(P_0)(O_i) \right\} \times \prod_{i=1}^n g_i(A_i|W_i),$$

which suggests the introduction of $\mathbf{g}_n = (g_1, \ldots, g_n)$, referred to as the design of the study, and the expression "$\mathbf{O}_n$ is drawn from $(Q_0, \mathbf{g}_n)$". Likewise, the likelihood of $\mathbf{O}_n$ under $(Q, \mathbf{g}_n)$ (where $Q = (Q_W, Q_{Y|A,W}) \in \mathcal{Q}$ is a candidate value for $Q_0$) is

$$\prod_{i=1}^n \left\{ Q_W(W_i) \times Q_{Y|A,W}(O_i) \right\} \times \prod_{i=1}^n g_i(A_i|W_i), \quad (4)$$

where we emphasize that the second factor is known. Thus we will refer, with a slight abuse of terminology, to $\sum_{i=1}^n \log Q_W(W_i) + \log Q_{Y|A,W}(O_i)$ as the log-likelihood of $\mathbf{O}_n$ under $(Q, \mathbf{g}_n)$. Furthermore, given $\mathbf{g}_n$, we introduce the notation $P_{Q_0,g_i} f \equiv E_{P_{Q_0,g_i}}[f(O_i)|\mathbf{O}_n(i-1)]$ for any possibly vector-valued measurable f defined on $\mathcal{O}$.

Another equivalent characterization of the data generating mechanism involves the causal graph of Fig. 1. It shows, firstly, that $W_n$ is drawn independently from the past $\mathbf{O}_{n-1}$; secondly, that $A_n$ is a deterministic function of $W_n$, the summary measure $Z_n$ (which depends on $\mathbf{O}_{n-1}$), and a new independent source of randomness (in other words, it is drawn conditionally on $(W_n, Z_n)$ and conditionally independently of the past $\mathbf{O}_{n-1}$); thirdly, that $Y_n$ is a deterministic function of $(A_n, W_n)$ and a new independent source of randomness (in other words, it is drawn conditionally on $(A_n, W_n)$ and conditionally independently of the past $\mathbf{O}_{n-1}$); then that the next summary measure $Z_{n+1}$ is obtained as a function of $\mathbf{O}_{n-1}$ and $O_n = (W_n, A_n, Y_n)$ (i.e., as a function of $\mathbf{O}_n$; here, the causal graph grants access to a new independent source of randomness, but it is useless in our setting), and so on.

Figure 1: A causal graph describing the data generating mechanism for adaptive group sequential design. An arrow from nodes $\Lambda_1, \ldots, \Lambda_r$ to node $\Upsilon$ means that there exists a deterministic function $f$ and an independent source of randomness $U$ such that $\Upsilon = f(\Lambda_1, \ldots, \Lambda_r, U)$.

Finally, it is interesting in practice to adapt the design group sequentially. This can be simply formalized. For a given pre-specified integer $c \ge 1$ ($c = 1$ corresponds to a fully sequential adaptive design), proceeding c-group sequentially simply amounts to imposing $\phi_{(r-1)c+1}(\mathbf{O}_{(r-1)c}) = \cdots = \phi_{rc}(\mathbf{O}_{rc-1})$ for all $r \ge 1$. Then the c treatment assignments $A_{(r-1)c+1}, \ldots, A_{rc}$ in the rth c-group are all drawn from the same conditional distribution $g_{(r-1)c}(\cdot|W)$. Although all our results (and their proofs) hold for any $c \ge 1$, we prefer to consider in the rest of this section and in Sections 4 and 5 the case c = 1 for simplicity's sake. By contrast, the simulation study carried out in Section 7 involves some c > 1.

3.2. On the user-supplied design to target

One of the most important features of the adaptive group sequential design methodology is that it targets a user-supplied design of special interest. This specific design is generally an optimal design with respect to a criterion which translates what the investigator cares about most. Specifically, one could care most about the well-being of the target population, wishing that a result be available as quickly as possible and aspiring therefore to the highest efficiency (i.e., the ability to reach a conclusion as quickly as possible subject to level and power constraints). Or one could care most about the well-being of the subjects participating in the clinical trial (therefore trying to minimize the number of patients assigned to their corresponding inferior treatment arms, subject to level and power constraints). Obviously, these are only two important examples from a large class of potentially interesting criteria. The sole purpose of the criterion is to generate a random element in $\mathcal{G}$ of the form $g_n = g_{Z_n}$, where $Z_n = \phi_n(\mathbf{O}_{n-1})$ is a finite-dimensional summary measure of $\mathbf{O}_{n-1}$.

We decide to focus in this article on the first example, but it must be clear that the methodology applies to a variety of other criteria (see (van der Laan, 2008) for other examples).

3.3. From the Neyman allocation to the optimal design

So, the objective is now clearly identified: we wish to adapt the design as the data accrue, in order to learn and then mimic a specific design which guarantees the highest efficiency, our (user-supplied) optimal design.

By Proposition 1, the asymptotic variance of any regular estimator of the risk difference $\Psi(Q_0)$ has lower bound $\operatorname{Var}_{P_{Q_0,g}} D(P_{Q_0,g})(O)$ if the estimator relies on data sampled independently from $P_{Q_0,g}$. Now,

$$\operatorname{Var}_{P_{Q_0,g}} D(P_{Q_0,g})(O) = E_{Q_0}\left( \bar{Q}_0(1,W) - \bar{Q}_0(0,W) - \Psi(Q_0) \right)^2 + E_{Q_0}\left( \frac{\sigma^2(Q_0)(1,W)}{g(1|W)} + \frac{\sigma^2(Q_0)(0,W)}{g(0|W)} \right),$$

where σ2(Q0)(A, W) denotes the conditional variance of Y given (A, W) under Q0. We use the notation EQ0 above (for the expectation with respect to the marginal distribution of W under P0) in order to emphasize the fact that the treatment mechanism g only appears in the second term of the right-hand side sum. Furthermore, it holds P0-almost surely that

$$\frac{\sigma^2(Q_0)(1,W)}{g(1|W)} + \frac{\sigma^2(Q_0)(0,W)}{g(0|W)} \ge \left( \sigma(Q_0)(1,W) + \sigma(Q_0)(0,W) \right)^2,$$

with equality if and only if

$$g(1|W) = \frac{\sigma(Q_0)(1,W)}{\sigma(Q_0)(1,W) + \sigma(Q_0)(0,W)} \quad (5)$$

$P_0$-almost surely. Therefore, the following lower bound holds for all $g \in \mathcal{G}$:

$$\operatorname{Var}_{P_{Q_0,g}} D(P_{Q_0,g})(O) \ge E_{Q_0}\left( \bar{Q}_0(1,W) - \bar{Q}_0(0,W) - \Psi(Q_0) \right)^2 + E_{Q_0}\left( \sigma(Q_0)(1,W) + \sigma(Q_0)(0,W) \right)^2,$$

with equality if and only if $g \in \mathcal{G}$ is characterized by (5). This optimal design is known in the literature as the Neyman allocation (see (Hu and Rosenberger, 2006), page 13). This result notably makes clear that the most efficient treatment mechanism assigns with higher probability a patient with covariate vector W to the treatment arm such that the variance of the outcome Y in this arm is the largest, regardless of the mean of the outcome (i.e., whether the arm is inferior or superior).
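As a simple illustration, the following Python sketch computes the Neyman allocation probability (5) from the two conditional standard deviations; the function name and the numbers are ours:

```python
def neyman_allocation(sigma1, sigma0):
    """Neyman allocation (5): probability of assigning treatment arm 1,
    given the conditional standard deviations of Y in each arm."""
    return sigma1 / (sigma1 + sigma0)

# the noisier arm is assigned with the higher probability,
# regardless of which arm has the better mean outcome
print(neyman_allocation(sigma1=2.0, sigma0=1.0))  # 2/3
```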

For logistical reasons, it might be preferable to consider only treatment mechanisms that assign treatment in response to a subvector V of the baseline covariate vector W. In addition, if W is complex, targeting the optimal Neyman allocation might be too ambitious. Therefore, we will consider the important case where V is a discrete covariate with finitely many values in the set $\mathcal{V} = \{1, \ldots, \nu\}$. The covariate V indicates subgroup membership for a collection of $\nu$ subgroups of interest. We decide to restrict the search of an optimal design to the set $\mathcal{G}_1 \subset \mathcal{G}$ of those treatment mechanisms which only depend on W through V. The same calculations as above yield straightforwardly that, for all $g \in \mathcal{G}_1$,

$$\operatorname{Var}_{P_{Q_0,g}} D(P_{Q_0,g})(O) \ge E_{Q_0}\left( \bar{Q}_0(1,W) - \bar{Q}_0(0,W) - \Psi(Q_0) \right)^2 + E_{Q_0}\left( \bar{\sigma}(Q_0)(1,V) + \bar{\sigma}(Q_0)(0,V) \right)^2,$$

where $\bar{\sigma}^2(Q_0)(a,V) = E_{Q_0}[\sigma^2(Q_0)(a,W)|V]$ for $a \in \mathcal{A}$, with equality if and only if g coincides with $g^*(Q_0)$, characterized by

$$g^*(Q_0)(1|V) = \frac{\bar{\sigma}(Q_0)(1,V)}{\bar{\sigma}(Q_0)(1,V) + \bar{\sigma}(Q_0)(0,V)} \quad (6)$$

$P_0$-almost surely. Hereafter, we refer to $g^*(Q_0)$ as the optimal design.

3.4. Targeting the optimal design

Because $g^*(Q_0)$ is characterized as the minimizer over $g \in \mathcal{G}_1$ of the variance under $P_{Q_0,g}$ of the efficient influence curve at $P_{Q_0,g}$, we propose to construct $g_{n+1} \in \mathcal{G}_1$ as the minimizer over $g \in \mathcal{G}_1$ of an estimator of the latter variance based on past observations $\mathbf{O}_n$.

We proceed by recursion. We first set $g_1 = g^b$, the so-called balanced treatment mechanism such that $g^b(1|W) = \frac{1}{2}$ for all $W \in \mathcal{W}$, and assume that $\mathbf{O}_n$ has already been sampled from $(Q_0, \mathbf{g}_n)$ as described in Section 3.1, the sample size being large enough to guarantee $\sum_{i=1}^n \mathbf{1}\{V_i = v\} > 0$ for all $v \in \mathcal{V}$ (if $n_0$ is the smallest sample size such that the previous condition is met, then we set $g_1 = \cdots = g_{n_0} = g^b$).

The issue is now to construct $g_{n+1}$. Let us assume for the time being that we already know how to construct an estimator $Q_n$ of $Q_0$ based on $\mathbf{O}_n$ (hence the estimators $\bar{Q}_n = \bar{Q}(Q_n)$ of $\bar{Q}_0$ and $\Psi(Q_n)$ of $\Psi(Q_0) = \psi_0$). Then, for all $g \in \mathcal{G}_1$,

$$\begin{aligned} S_n(g) ={}& \frac{1}{n}\sum_{i=1}^n \left\{ D_1(Q_n)(W_i)^2 + 2 D_1(Q_n)(W_i)\, D_2(P_{Q_n,g})(O_i)\, \frac{g(A_i|V_i)}{g_i(A_i|V_i)} + D_2(P_{Q_n,g})(O_i)^2\, \frac{g(A_i|V_i)}{g_i(A_i|V_i)} \right\} \\ & - \left( \frac{1}{n}\sum_{i=1}^n D_1(Q_n)(W_i) + D_2(P_{Q_n,g})(O_i)\, \frac{g(A_i|V_i)}{g_i(A_i|V_i)} \right)^2 \\ ={}& \frac{1}{n}\sum_{i=1}^n \frac{(Y_i - \bar{Q}_n(A_i,W_i))^2}{g(A_i|V_i)\, g_i(A_i|V_i)} \\ & + \left\{ \frac{1}{n}\sum_{i=1}^n D_1(Q_n)(W_i)^2 + 2 D_1(Q_n)(W_i)\, D_2(P_{Q_n,g_i})(O_i) - \left( \frac{1}{n}\sum_{i=1}^n D_1(Q_n)(W_i) + D_2(P_{Q_n,g_i})(O_i) \right)^2 \right\} \end{aligned}$$

estimates $\operatorname{Var}_{P_{Q_0,g}} D(O; P_{Q_0,g})$ (the weighting provides the adequate tilt of the empirical distribution; it is not necessary to weight the terms corresponding to $D_1$ because they do not depend on the treatment mechanism). Now, only the first term in the rightmost expression still depends on g. The same calculations as above straightforwardly yield that $S_n(g)$ is minimized at $g_{n+1} \in \mathcal{G}_1$ characterized by

$$g_{n+1}(1|v) = \frac{s_{v,n}(1)}{s_{v,n}(1) + s_{v,n}(0)}, \quad (7)$$

for all $v \in \mathcal{V}$, where for each $(v,a) \in \mathcal{V} \times \mathcal{A}$,

$$s_{v,n}^2(a) = \frac{\frac{1}{n}\sum_{i=1}^n \frac{(Y_i - \bar{Q}_n(A_i,W_i))^2}{g_i(A_i|V_i)}\, \mathbf{1}\{(V_i,A_i) = (v,a)\}}{\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{V_i = v\}}.$$

Yet, instead of considering the above characterization, we find it more convenient to define

$$g_{n+1}(1|v) = \frac{\sigma_{v,n}(1)}{\sigma_{v,n}(1) + \sigma_{v,n}(0)}, \quad (8)$$

for all $v \in \mathcal{V}$, where for each $(v,a) \in \mathcal{V} \times \mathcal{A}$,

$$\sigma_{v,n}^2(a) = \frac{\frac{1}{n}\sum_{i=1}^n \frac{(Y_i - \bar{Q}_n(A_i,W_i))^2}{g_i(A_i|V_i)}\, \mathbf{1}\{(V_i,A_i) = (v,a)\}}{\frac{1}{n}\sum_{i=1}^n \frac{\mathbf{1}\{(V_i,A_i) = (v,a)\}}{g_i(A_i|V_i)}}.$$

Note that $s_{v,n}^2(a)$ and $\sigma_{v,n}^2(a)$ share the same numerator, and that the different denominators converge to the same limit. Substituting $\sigma_{v,n}^2(a)$ for $s_{v,n}^2(a)$ is convenient because one naturally interprets the former as an estimator of the conditional variance of Y given $(A,V) = (a,v)$ based on $\mathbf{O}_n$, a fact that we use in Section 4.2. Finally, we emphasize that $g_{n+1} = g_{Z_{n+1}}$ for the summary measure of the past $\mathbf{O}_n$

$$Z_{n+1} = \phi_{n+1}(\mathbf{O}_n) \equiv \left( (\sigma_{v,n}^2(0), \sigma_{v,n}^2(1)) : v \in \mathcal{V} \right). \quad (9)$$

The rigorous definition of the design $\mathbf{g}_n = (g_1, \ldots, g_n)$ follows by recursion, but it is still subject to knowledge about how to construct an estimator $Q_n$ of $Q_0$ based on $\mathbf{O}_n$. Because this last missing piece of the formal definition of the adaptive group sequential design data generating mechanism is also the core of the TMLE procedure, we address it in Section 4.
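To fix ideas, here is a minimal Python sketch of one adaptation step, implementing (8) and the summary measure (9); the array names, and the availability of a fitted conditional mean evaluated at the observed $(A_i, W_i)$, are our assumptions:

```python
import numpy as np

def update_design(Y, A, V, Qbar_n, g_past, strata=(1, 2, 3)):
    """One adaptation step: estimate sigma^2_{v,n}(a), the conditional
    variance of Y given (A, V) = (a, v), with inverse weights g_i(A_i|V_i),
    then set g_{n+1}(1|v) by Neyman allocation as in (8).
    Qbar_n: fitted E[Y|A,W] evaluated at the observed (A_i, W_i);
    g_past: the probabilities g_i(A_i|V_i) actually used so far.
    Assumes every (A, V)-stratum is non-empty (cf. the n_0 condition above)."""
    res2 = (Y - Qbar_n) ** 2
    g_next, Z_next = {}, {}
    for v in strata:
        sigma = {}
        for a in (0, 1):
            idx = (V == v) & (A == a)
            num = np.sum(res2[idx] / g_past[idx])   # shared numerator of (7)/(8)
            den = np.sum(1.0 / g_past[idx])         # denominator of sigma^2_{v,n}(a)
            sigma[a] = np.sqrt(num / den)
        g_next[v] = sigma[1] / (sigma[1] + sigma[0])
        Z_next[v] = (sigma[0] ** 2, sigma[1] ** 2)  # summary measure (9)
    return g_next, Z_next
```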

4. TMLE in adaptive covariate-adjusted RCTs

We assume hereafter that On has already been sampled from the (Q0,gn)-adaptive sampling scheme. In this section, we construct an estimator Qn (actually denoted by Qn*) of Q0, therefore yielding the characterization of gn+1, and completing the formal definition of the adaptive design gn that we initiated in Section 3. In particular, the next observed data structure On+1 can be drawn from (Q0,gn+1), and it makes sense to undertake the asymptotic study of the properties of the TMLE methodology based on adaptive group sequential sampling.

As in the i.i.d framework, the TMLE procedure maps an initial substitution estimator Ψ(Qn0) of ψ0 into an update ψn*=Ψ(Qn*) by fluctuating the initial estimate Qn0 of Q0. The construction of Qn0 is presented and studied in Section 4.1. In Section 4.2, a straightforward application of the main result of Section 4.1 shows that the adaptive design converges. How to fluctuate Qn0 and stretch it optimally to Qn* is presented and studied in Section 4.3.

4.1. Initial maximum likelihood based substitution estimator

Working model.

In order to construct the initial estimate $Q_n^0$ of $Q_0$, we consider a working model $\mathcal{Q}_n^w$. With a slight abuse of notation, the elements of $\mathcal{Q}_n^w$ are denoted by $(Q_W(P_n), Q_{Y|A,W}(\theta))$ for some parameter $\theta \in \Theta$, where $Q_W(P_n)$ is the empirical marginal distribution of W. Specifically, the working model $\mathcal{Q}_n^w$ is chosen in such a way that

$$Q_{Y|A,W}(\theta)(O) = \frac{1}{\sqrt{2\pi\sigma_V^2(A)}} \exp\left\{ -\frac{(Y - m(A,W;\beta_V))^2}{2\sigma_V^2(A)} \right\}.$$

This notably implies that for any $P_\theta \in \mathcal{M}$ such that $Q_{Y|A,W}(P_\theta) = Q_{Y|A,W}(\theta)$, the conditional mean $\bar{Q}(P_\theta)(A,W)$, which we also denote by $\bar{Q}(\theta)$, satisfies $\bar{Q}(\theta)(A,W) = m(A,W;\beta_V)$, the right-hand side expression being a linear combination of variables extracted from (A, W) and indexed by the regression vector $\beta_V$ (of dimension b). Defining

$$\theta(v) = (\beta_v, \sigma_v^2(0), \sigma_v^2(1)) \in \Theta_v \subset \mathbb{R}^b \times \mathbb{R}_+^* \times \mathbb{R}_+^* \quad (10)$$

for each $v \in \mathcal{V}$, the complete parameter is given by $\theta = (\theta(1), \ldots, \theta(\nu)) \in \Theta$, where $\Theta = \prod_{v=1}^{\nu} \Theta_v$. We impose the following condition on the parameterization:

PARAM. The parameter set $\Theta$ is compact. Furthermore, the linear parameterization is identifiable: for all $v \in \mathcal{V}$, if $m(a,w;\beta_v) = m(a,w;\beta_v')$ for all $a \in \mathcal{A}$ and $w \in \mathcal{W}$ (compatible with v), then necessarily $\beta_v = \beta_v'$.

Characterizing Qn0.

Let us set a reference fixed design $g^r \in \mathcal{G}_1$. We now characterize $Q_n^0$ by letting

$$Q_n^0 = (Q_W(P_n), Q_{Y|A,W}(\theta_n)), \quad (11)$$

where

$$\theta_n = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log Q_{Y|A,W}(\theta)(O_i)\, \frac{g^r(A_i|V_i)}{g_i(A_i|V_i)} \quad (12)$$

is a weighted maximum likelihood estimator with respect to the working model. Thus, the vth component θn(v) of θn satisfies

$$\theta_n(v) = \arg\min_{\theta(v) \in \Theta_v} \sum_{i=1}^n \left( \log \sigma_v^2(A_i) + \frac{(Y_i - m(A_i,W_i;\beta_v))^2}{\sigma_v^2(A_i)} \right) \frac{g^r(A_i|V_i)}{g_i(A_i|V_i)}\, \mathbf{1}\{V_i = v\}$$

for every $v \in \mathcal{V}$. Note that this initial estimate $Q_n^0$ of $Q_0$ yields the initial maximum likelihood based substitution estimator $\Psi(Q_n^0)$ of $\psi_0$:

$$\Psi(Q_n^0) = \frac{1}{n}\sum_{i=1}^n \bar{Q}(\theta_n)(1,W_i) - \bar{Q}(\theta_n)(0,W_i).$$
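Computationally, the stratum-wise criterion above can be solved by alternating two closed-form updates: weighted least squares for $\beta_v$ given the per-arm variances, and weighted residual variances given $\beta_v$. A minimal Python sketch, with function and argument names of our own choosing, follows:

```python
import numpy as np

def weighted_mle_stratum(X, Y, A, w, n_iter=25):
    """Weighted MLE (12) within one V-stratum. X: design matrix of
    m(A, W; beta_v); A: 0/1 integer treatment array; w: weights
    g^r(A_i|V_i) / g_i(A_i|V_i). Alternates weighted least squares for
    beta_v (given the per-arm variances) and closed-form variance updates."""
    sigma2 = np.ones(2)
    for _ in range(n_iter):
        ww = w / sigma2[A]                       # combined WLS weights
        beta = np.linalg.solve(X.T @ (ww[:, None] * X), X.T @ (ww * Y))
        res2 = (Y - X @ beta) ** 2
        for a in (0, 1):
            idx = A == a
            sigma2[a] = np.sum(w[idx] * res2[idx]) / np.sum(w[idx])
    return beta, sigma2
```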

Studying Qn0 through θn.

For simplicity, let us introduce, for all θ ∈ Θ, the additional notations

$$\ell_{\theta,0} = \log Q_{Y|A,W}(\theta), \qquad \dot{\ell}_{\theta,0} = \frac{\partial}{\partial\theta}\,\ell_{\theta,0}, \qquad \ddot{\ell}_{\theta,0} = \frac{\partial^2}{\partial\theta^2}\,\ell_{\theta,0}.$$

The first asymptotic property of θn that we derive concerns its consistency (see Theorem 5 in (van der Laan, 2008)).

Proposition 2 (consistency of θn). Assume that:

A1. There exists a unique interior point $\theta_0 \in \Theta$ such that $\theta_0 = \arg\max_{\theta \in \Theta} P_{Q_0,g^r}\, \ell_{\theta,0}$.

A2. The matrix $-P_{Q_0,g^r}\, \ddot{\ell}_{\theta_0,0}$ is positive definite.

Provided that $\mathcal{O}$ is a bounded set, $\theta_n$ consistently estimates $\theta_0$.

The proof of Proposition 2 is given in Section A.1.

The limit in probability of $\theta_n$ has a nice interpretation in terms of projection of $Q_{Y|A,W}(P_0)$ onto $\{Q_{Y|A,W}(\theta) : \theta \in \Theta\}$. Preferring to discuss this issue in terms of data generating distributions rather than conditional distributions, let us set $Q_{\theta_0} = (Q_W(P_0), Q_{Y|A,W}(\theta_0))$ and assume that $P_{Q_0,g^r} \log Q_{Y|A,W}(P_0)$ is well defined (this weak assumption concerns $Q_0$, not $g^r$, and holds for instance when $|\log Q_{Y|A,W}(P_0)|$ is bounded). Then A1 is equivalent to $P_{Q_{\theta_0},g^r}$ being the unique Kullback-Leibler projection of $P_{Q_0,g^r}$ onto the set

$$\left\{ P \in \mathcal{M} : \exists\, \theta \in \Theta \text{ s.t. } Q_{Y|A,W}(P) = Q_{Y|A,W}(\theta),\ Q_W(P) = Q_W(P_0),\ g(P) = g^r \right\}.$$

In addition to being consistent, $\theta_n$ actually satisfies a central limit theorem if supplementary mild conditions are met. The latter central limit theorem is embedded in a more general result that we state in Section 5.2, see Proposition 5.

Furthermore, maximizing a weighted version of the log-likelihood is a technical twist that makes the theoretical study of the properties of θn easier. Indeed, the unweighted maximum likelihood estimator

$$t_n = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log Q_{Y|A,W}(\theta)(O_i)$$

targets the parameter

$$T_{\bar{g}_n}(Q_0) = \arg\max_{\theta \in \Theta} \sum_{i=1}^n P_{Q_0,g_i} \log Q_{Y|A,W}(\theta) = \arg\max_{\theta \in \Theta} P_{Q_0,\bar{g}_n}\, \ell_{\theta,0},$$

where $\bar{g}_n = n^{-1}\sum_{i=1}^n g_i$ and $P_{Q_0,\bar{g}_n} f \equiv E_{P_{Q_0,\bar{g}_n}}[f(O_n)|\mathbf{O}_n(n-1)]$ for any measurable f defined on $\mathcal{O}$. Therefore, $t_n$ asymptotically targets the limit, if it exists, of $T_{\bar{g}_n}(Q_0)$. Assuming that $\bar{g}_n$ converges itself to a fixed design $g \in \mathcal{G}$, then $t_n$ asymptotically targets the parameter $T_g(Q_0)$. The latter parameter is very difficult to interpret and to analyze, as it depends directly and indirectly (through g) on $Q_0$.

4.2. Convergence of the adaptive design

Consider the mapping G from $\Theta$ to $\mathcal{G}_1$ (respectively equipped with the Euclidean distance and, for instance, the distance $d(g,g') = \sum_{v \in \mathcal{V}} |g(1|v) - g'(1|v)|$) such that, for any $\theta \in \Theta$ and any $(a,v) \in \mathcal{A} \times \mathcal{V}$,

$$G(\theta)(a|v) = \frac{\sigma_v(a)}{\sigma_v(1) + \sigma_v(0)}. \quad (13)$$

Equation (13) characterizes G, which is obviously continuous. Since the design is adapted in such a way that $g_{n+1} = G(\theta_n)$ (see (8)), Proposition 2 and the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) straightforwardly imply the following result (the convergence in probability yields the convergence in $L^1$ because $g_n$ is uniformly bounded).

Proposition 3 (convergence of gn). Under the assumptions of Proposition 2, the adaptive design gn converges in probability and in L1 to the limit design G(θ0).

The convergence of the adaptive design $g_n$ is a crucial result. It is noteworthy that the limit design $G(\theta_0)$ equals the optimal design $g^*(Q_0)$ if the working model is correctly specified (which never happens in real-life applications), but not necessarily otherwise. Furthermore, the relationship $g_{n+1} = G(\theta_n)$ also entails the possibility to derive the convergence in distribution of $\sqrt{n}\,(g_{n+1} - G(\theta_0))$ to a centered Gaussian distribution with known variance by application of the delta-method (G is differentiable) from a central limit theorem on $\theta_n$ (see Proposition 5 and Theorem 3.1 in (van der Vaart, 1998)).

4.3. TMLE

Fluctuating Qn0.

The second step of the TMLE procedure stretches the initial estimate $\Psi(Q_n^0)$ in the direction of the parameter of interest, through a maximum likelihood step over a well-chosen fluctuation of $Q_n^0$. The latter fluctuation of $Q_n^0$ is just a one-dimensional parametric model $\{Q_n^0(\varepsilon) : \varepsilon \in E\} \subset \mathcal{Q}$ indexed by the parameter $\varepsilon \in E$, E being a bounded interval which contains a neighborhood of the origin. Specifically, we set for all $\varepsilon \in E$:

$$Q_n^0(\varepsilon) = (Q_W(P_n), Q_{Y|A,W}(\theta_n, \varepsilon)), \quad (14)$$

where for any θ ∈ Θ,

$$Q_{Y|A,W}(\theta,\varepsilon)(O) = \frac{1}{\sqrt{2\pi\sigma_V^2(A)}} \exp\left\{ -\frac{(Y - \bar{Q}(\theta)(A,W) - \varepsilon H^*(\theta)(A,W))^2}{2\sigma_V^2(A)} \right\}, \quad (15)$$

with

$$H^*(\theta)(A,W) = \frac{2A-1}{G(\theta)(A|V)}\, \sigma_V^2(A).$$

In particular, the fluctuation goes through $Q_n^0$ at $\varepsilon = 0$ (i.e., $Q_n^0(0) = Q_n^0$). Let $P_n^0(\varepsilon) \in \mathcal{M}$ be a data generating distribution such that $Q_{Y|A,W}(P_n^0(\varepsilon)) = Q_{Y|A,W}(\theta_n,\varepsilon)$. The conditional mean $\bar{Q}(P_n^0(\varepsilon))$, which we also denote by $\bar{Q}(\theta_n,\varepsilon)$, is

$$\bar{Q}(\theta_n,\varepsilon)(A,W) = \bar{Q}(\theta_n)(A,W) + \varepsilon H^*(\theta_n)(A,W).$$

Furthermore, the score at ε = 0 of Pn0(ε) equals

$$\left. \frac{\partial}{\partial\varepsilon} \log P_n^0(\varepsilon)[O] \right|_{\varepsilon=0} = \frac{2A-1}{G(\theta_n)(A|V)}\, (Y - \bar{Q}(\theta_n)(A,W)) = D_2(P_{Q_n^0,G(\theta_n)})(O),$$

the second component of the efficient influence curve of $\Psi$ at $P_{Q_n^0,G(\theta_n)} = P_{Q_n^0,g_{n+1}}$, see (2) in Proposition 1 (recall that $g_{n+1} = G(\theta_n)$).

Characterizing Qn* yields the TMLE.

We characterize the update $Q_n^*$ of $Q_n^0$ in the fluctuation $\{Q_n^0(\varepsilon) : \varepsilon \in E\}$ by

$$Q_n^* = Q_n^0(\varepsilon_n),$$

where

$$\varepsilon_n = \arg\max_{\varepsilon \in E} \sum_{i=1}^n \log Q_{Y|A,W}(\theta_n,\varepsilon)(O_i)\, \frac{g_{n+1}(A_i|V_i)}{g_i(A_i|V_i)} \quad (16)$$

is a weighted maximum likelihood estimator with respect to the fluctuation. It is worth noting that $\varepsilon_n$ is known in closed form (we assume, without serious loss of generality, that E is large enough for the maximum to be achieved in its interior). Denoting the vth component $\theta_n(v)$ of $\theta_n$ by $(\beta_{v,n}, \sigma_{v,n}^2(0), \sigma_{v,n}^2(1))$, it holds that

$$\varepsilon_n = \frac{\sum_{i=1}^n (Y_i - \bar{Q}(\theta_n)(A_i,W_i))\, \frac{2A_i - 1}{g_i(A_i|V_i)}}{\sum_{i=1}^n \frac{\sigma_{V_i,n}^2(A_i)}{g_{n+1}(A_i|V_i)\, g_i(A_i|V_i)}}.$$

The notation $Q_n^*$ for this first update of $Q_n^0$ is a reference to the fact that the TMLE procedure, which is in greater generality an iterative procedure, converges here in one single step. Indeed, say that one fluctuates $Q_n^*$ as we fluctuated $Q_n^0$, i.e., by introducing

$$Q_n^1(\varepsilon) = (Q_W(P_n), Q_{Y|A,W}(\theta_n, \varepsilon_n, \varepsilon))$$

with $Q_{Y|A,W}(\theta, \varepsilon', \varepsilon)$ equal to the right-hand side of (15) where one substitutes $\bar{Q}(\theta,\varepsilon')$ for $\bar{Q}(\theta)$. Say that one then defines the weighted maximum likelihood estimator $\varepsilon_n'$ as the right-hand side of (16) where one substitutes $Q_{Y|A,W}(\theta_n, \varepsilon_n, \varepsilon)$ for $Q_{Y|A,W}(\theta_n, \varepsilon)$. Then it follows that $\varepsilon_n' = 0$, so that the "updated" $Q_n^1(\varepsilon_n') = Q_n^*$. The updated estimator $Q_n^*$ of $Q_0$ maps into the TMLE $\psi_n^* = \Psi(Q_n^*)$ of the risk difference $\psi_0 = \Psi(Q_0)$:

$$\psi_n^* = \frac{1}{n}\sum_{i=1}^n \bar{Q}(\theta_n,\varepsilon_n)(1,W_i) - \bar{Q}(\theta_n,\varepsilon_n)(0,W_i). \quad (17)$$
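The whole targeting step fits in a few lines. Below is a minimal Python sketch of the closed-form $\varepsilon_n$ and the TMLE (17) under the Gaussian working model; all array names, and the convention that fitted quantities are pre-evaluated at the observations, are ours:

```python
import numpy as np

def tmle_risk_difference(Y, A, Qbar1, Qbar0, sig2_1, sig2_0, g1_next, g_past):
    """Targeting step and TMLE (17). Qbar1/Qbar0: fitted means
    Qbar(theta_n)(a, W_i); sig2_1/sig2_0: fitted variances sigma^2_{V_i,n}(a);
    g1_next: g_{n+1}(1|V_i); g_past: g_i(A_i|V_i). All arrays of length n."""
    Qbar_obs = np.where(A == 1, Qbar1, Qbar0)
    sig2_obs = np.where(A == 1, sig2_1, sig2_0)
    g_next_obs = np.where(A == 1, g1_next, 1.0 - g1_next)
    # closed-form epsilon_n, the weighted MLE over the fluctuation (16)
    eps = (np.sum((Y - Qbar_obs) * (2 * A - 1) / g_past)
           / np.sum(sig2_obs / (g_next_obs * g_past)))
    # clever covariate H*(theta_n)(a, W) = (2a - 1) sigma_V^2(a) / g_{n+1}(a|V)
    H1 = sig2_1 / g1_next
    H0 = -sig2_0 / (1.0 - g1_next)
    # substitution estimator based on the fluctuated conditional means
    return np.mean((Qbar1 + eps * H1) - (Qbar0 + eps * H0))
```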

The asymptotic study of $\psi_n^*$ relies on a central limit theorem for $(\theta_n, \varepsilon_n)$, which we discuss in Section 5.2.

5. Asymptotics

5.1. Studying $Q_n^*$ through $(\theta_n, \varepsilon_n)$: consistency

We now state and comment on a consistency result for the stacked estimator $(\theta_n, \varepsilon_n)$ which complements Proposition 2 (see Theorem 8 in (van der Laan, 2008)). For simplicity, let us generalize the notation $\ell_{\theta,0}$ introduced in Section 4.1 by setting, for all $(\theta,\varepsilon) \in \Theta \times E$,

$$\ell_{\theta,\varepsilon} = \log Q_{Y|A,W}(\theta,\varepsilon).$$

Moreover, let us set, for all $(\theta,\varepsilon) \in \Theta \times E$,

$$Q_{\theta,\varepsilon} = (Q_W(P_0), Q_{Y|A,W}(\theta,\varepsilon)). \quad (18)$$

Proposition 4 (consistency of $(\theta_n,\varepsilon_n)$). Suppose that A1 and A2 from Proposition 2 hold. In addition, assume that:

A3. There exists a unique interior point $\varepsilon_0 \in E$ such that $\varepsilon_0 = \arg\max_{\varepsilon \in E} P_{Q_0,G(\theta_0)}\, \ell_{\theta_0,\varepsilon}$.

  • (i)

    It holds that $\Psi(Q_{\theta_0,\varepsilon_0}) = \Psi(Q_0)$.

  • (ii)

    Provided that $\mathcal{O}$ is a bounded set, $(\theta_n,\varepsilon_n)$ consistently estimates $(\theta_0,\varepsilon_0)$.

The proof of Proposition 4 is given in Section A.1.

We already discussed the interpretation of the limit in probability of $\theta_n$ in terms of Kullback-Leibler projection. Likewise, the limit in probability $\varepsilon_0$ of $\varepsilon_n$ enjoys such an interpretation. Let us assume that $P_{Q_0,G(\theta_0)} \log Q_{Y|A,W}(P_0)$ is well defined (this weak assumption concerns $Q_0$, not $G(\theta_0)$, and holds for instance when $|\log Q_{Y|A,W}(P_0)|$ is bounded). Then A3 is equivalent to $P_{Q_{\theta_0,\varepsilon_0},G(\theta_0)}$ being the unique Kullback-Leibler projection of $P_{Q_0,G(\theta_0)}$ onto the set

$$\left\{ P \in \mathcal{M} : \exists\, \varepsilon \in E \text{ s.t. } Q(P) = Q_{\theta_0,\varepsilon} \text{ and } g(P) = G(\theta_0) \right\}.$$

Of course, the most striking property that $\varepsilon_0$ enjoys is (i): even if $\bar{Q}_0 \notin \{\bar{Q}(\theta,\varepsilon) : (\theta,\varepsilon) \in \Theta \times E\}$, it holds that $\Psi(Q_{\theta_0,\varepsilon_0}) = \Psi(Q_0)$. This remarkable equality and the convergence of $(\theta_n,\varepsilon_n)$ to $(\theta_0,\varepsilon_0)$ are evidently the keys to the consistency of $\psi_n^* = \Psi(Q_n^*)$. We investigate in Section 5.3 how the consistency result stated in Proposition 4 translates into the consistency of the TMLE.

5.2. Studying $Q_n^*$ through $(\theta_n, \varepsilon_n)$: central limit theorem

We now state and comment on a central limit theorem for the stacked estimator (θn, εn) (see also Theorem 9 in (van der Laan, 2008)).

Let us introduce, for all $(\theta,\varepsilon) \in \Theta \times E$,

$$D(\theta,\varepsilon)(O,Z) = \frac{1}{g_Z(A|V)} \left( \dot{\ell}_{\theta,0}^\top(O)\, g^r(A|V),\ \frac{\partial}{\partial\varepsilon}\ell_{\theta,\varepsilon}(O)\, G(\theta)(A|V) \right), \quad (19)$$

and $\tilde{D}(\theta,\varepsilon)(O) = g_Z(A|V)\, D(\theta,\varepsilon)(O,Z)$.

Proposition 5 (central limit theorem for $(\theta_n,\varepsilon_n)$). Suppose that A1, A2 and A3 from Propositions 2 and 4 hold. In addition, assume that:

A4. If a deterministic function F is such that $F(O) = 0$ $P_{Q_0,G(\theta_0)}$-almost surely, then $F = 0$.

Then the following asymptotic linear expansion holds:

$$\sqrt{n}\left( (\theta_n,\varepsilon_n) - (\theta_0,\varepsilon_0) \right) = S_0^{-1}\, \frac{1}{\sqrt{n}} \sum_{i=1}^n D(\theta_0,\varepsilon_0)(O_i, Z_i) + o_P(1), \quad (20)$$

where

$$S_0 = E_{Q_0,G(\theta_0)} \begin{pmatrix} \ddot{\ell}_{\theta_0,0}(O)\, \dfrac{g^r(A|V)}{G(\theta_0)(A|V)} & 0 \\[8pt] \left[ \dfrac{\partial^2}{\partial\theta\,\partial\varepsilon}\, \ell_{\theta,\varepsilon}(O)\, G(\theta)(A|V) \right]_{(\theta,\varepsilon)=(\theta_0,\varepsilon_0)} \dfrac{1}{G(\theta_0)(A|V)} & \left. \dfrac{\partial^2}{\partial\varepsilon^2}\, \ell_{\theta_0,\varepsilon}(O) \right|_{\varepsilon=\varepsilon_0} \end{pmatrix} \quad (21)$$

is an invertible matrix. Furthermore, (20) entails that $\sqrt{n}\left( (\theta_n,\varepsilon_n) - (\theta_0,\varepsilon_0) \right)$ converges in distribution to the centered Gaussian distribution with covariance matrix $S_0^{-1} \Sigma_0 (S_0^{-1})^\top$, where

$$\Sigma_0 = E_{Q_0,G(\theta_0)} \left( \frac{\tilde{D}(\theta_0,\varepsilon_0)\, \tilde{D}(\theta_0,\varepsilon_0)^\top(O)}{G(\theta_0)(A|V)^2} \right) \quad (22)$$

is a positive definite symmetric matrix. Moreover, S0 is consistently estimated by

$$S_n = \frac{1}{n}\sum_{i=1}^n \begin{pmatrix} \ddot{\ell}_{\theta_n,0}(O_i)\, \dfrac{g^r(A_i|V_i)}{g_{Z_i}(A_i|V_i)} & 0 \\[8pt] \left[ \dfrac{\partial^2}{\partial\theta\,\partial\varepsilon}\, \ell_{\theta,\varepsilon}(O_i)\, G(\theta)(A_i|V_i) \right]_{(\theta,\varepsilon)=(\theta_n,\varepsilon_n)} \dfrac{1}{g_{Z_i}(A_i|V_i)} & \left. \dfrac{\partial^2}{\partial\varepsilon^2}\, \ell_{\theta_n,\varepsilon}(O_i) \right|_{\varepsilon=\varepsilon_n} \dfrac{1}{g_{Z_i}(A_i|V_i)} \end{pmatrix}$$

and $\Sigma_0$ is consistently estimated by

$$\Sigma_n = \frac{1}{n}\sum_{i=1}^n D(\theta_n,\varepsilon_n)\, D(\theta_n,\varepsilon_n)^\top (O_i, Z_i). \quad (23)$$

The proof of Proposition 5 is given in Section A.2. We investigate in Section 5.3 how the above central limit theorem translates into a central limit theorem for the TMLE.

5.3. Consistency and asymptotic normality of the TMLE

In this section, we finally state and comment on the asymptotic properties of the TMLE ψn*.

TMLE is consistent and asymptotically Gaussian.

In the first place, the TMLE ψn* is robust: it is a consistent estimator even when the working model is misspecified (which is always the case in real-life applications).

Proposition 6 (consistency of ψn*). Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Then the TMLE ψn* consistently estimates the risk difference ψ0.

If the design of the clinical trial were fixed (and consequently, the first n observations were i.i.d), then the TMLE would be a robust estimator of $\psi_0$: even if the working model is misspecified, the TMLE still consistently estimates $\psi_0$ because the treatment mechanism is known (or can be consistently estimated, if one wants to gain in efficiency). Thus, the robustness of the TMLE stated in Proposition 6 is the expected counterpart of the TMLE's robustness in the i.i.d setting: expected because the TMLE solves a martingale estimating function that is unbiased for $\psi_0$ at misspecified Q and correctly specified $g_i$, i = 1, …, n.

In the second place, the TMLE $\psi_n^*$ is asymptotically linear, and therefore satisfies a central limit theorem. To see this, let us introduce the real-valued function $\phi$ on $\Theta \times E$ such that $\phi(\theta,\varepsilon) = \Psi(Q_{\theta,\varepsilon})$ (see (18) for the definition of $Q_{\theta,\varepsilon}$). The function $\phi$ is differentiable on the interior of $\Theta \times E$, and we denote by $\nabla\phi_{\theta,\varepsilon}$ its gradient at $(\theta,\varepsilon)$. The latter gradient satisfies

$$\nabla\phi_{\theta,\varepsilon} = E_{Q_{\theta,\varepsilon},G(\theta)} \left\{ D(P_{Q_{\theta,\varepsilon},G(\theta)})(O) \left( \frac{\partial}{\partial\theta}\ell_{\theta,\varepsilon}(O),\ \frac{\partial}{\partial\varepsilon}\ell_{\theta,\varepsilon}(O) \right) \right\}. \quad (24)$$

Note that the right-hand side expression cannot be computed explicitly because the marginal distribution $Q_W(P_0)$ is unknown. By the law of large numbers (independent case), we can build an estimator $\nabla\phi_n$ of $\nabla\phi_{\theta_0,\varepsilon_0}$ as follows. For B a large number (say $B = 10^4$), simulate B independent copies $\tilde{O}_b$ of O from the data generating distribution $P_{Q_n^*,G(\theta_n)}$, then compute

$$\nabla\phi_n = \frac{1}{B} \sum_{b=1}^B D(P_{Q_n^*,G(\theta_n)})(\tilde{O}_b) \left( \left. \frac{\partial}{\partial\theta}\ell_{\theta,\varepsilon_n}(\tilde{O}_b) \right|_{\theta=\theta_n},\ \left. \frac{\partial}{\partial\varepsilon}\ell_{\theta_n,\varepsilon}(\tilde{O}_b) \right|_{\varepsilon=\varepsilon_n} \right). \quad (25)$$
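Equation (25) is a plain Monte Carlo average, as the following Python sketch makes explicit; the four callables are assumptions standing in for a simulator of $P_{Q_n^*,G(\theta_n)}$ and the fitted influence-curve and score functions:

```python
import numpy as np

def mc_gradient(simulate, D_fn, score_theta, score_eps, B=10**4):
    """Monte Carlo estimator (25) of the gradient of phi.
    simulate(B): draws B copies of O from P_{Q_n^*, G(theta_n)};
    D_fn(O): efficient influence curve values, shape (B,);
    score_theta(O): score in theta at theta_n, shape (B, dim theta);
    score_eps(O): score in epsilon at epsilon_n, shape (B,)."""
    O = simulate(B)
    scores = np.column_stack([score_theta(O), score_eps(O)])
    return np.mean(D_fn(O)[:, None] * scores, axis=0)
```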

Proposition 7 (central limit theorem for ψn*). Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Then the following asymptotic linear expansion holds:

$$\sqrt{n}\,(\psi_n^* - \psi_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \mathrm{IC}(O_i, Z_i) + o_P(1), \quad (26)$$

where

$$\mathrm{IC}(O,Z) = D_1(Q_{\theta_0,\varepsilon_0})(W) + \nabla\phi_{\theta_0,\varepsilon_0}\, S_0^{-1}\, D(\theta_0,\varepsilon_0)(O,Z). \quad (27)$$

Furthermore, (26) entails that $\sqrt{n}\,(\psi_n^* - \psi_0)$ converges in distribution to the centered Gaussian distribution with a variance consistently estimated by

$$s_n^2 = \frac{1}{n}\sum_{i=1}^n D_1(Q_n^*)(W_i)^2 + \frac{2}{n}\sum_{i=1}^n D_1(Q_n^*)(W_i)\, \nabla\phi_n\, S_n^{-1}\, D(\theta_n,\varepsilon_n)(O_i,Z_i) + (\nabla\phi_n S_n^{-1})\, \Sigma_n\, (\nabla\phi_n S_n^{-1})^\top.$$

Proposition 7 is the backbone of the statistical analysis of adaptive group sequential RCTs as constructed in Section 4. In particular, denoting the (1 − α)-quantile of the standard normal distribution by $\xi_{1-\alpha}$, the proposition guarantees that the asymptotic level of the confidence interval

$$\left[ \psi_n^* \pm \frac{s_n}{\sqrt{n}}\, \xi_{1-\alpha/2} \right] \quad (28)$$

for the risk difference $\psi_0$ is $(1 - \alpha)$. The proofs of Propositions 6 and 7 are given in Section A.3.

Extensions.

We conjecture that the influence function IC computed at (O, Z), see (27), is equal to

$$D_1(Q_{\theta_0,\varepsilon_0})(W) + D_2(P_{Q_{\theta_0,\varepsilon_0},G(\theta_0)})(O)\, \frac{G(\theta_0)(A|V)}{g_Z(A|V)}.$$

This conjecture is backed by the simulations that we carry out and present in Section 7. We will tackle the proof of the conjecture in future work. Let us assume for the moment that the conjecture is true. Then the asymptotic linear expansion (26) implies that the asymptotic variance of $\sqrt{n}\,(\psi_n^* - \psi_0)$ can be consistently estimated by

$$s_n^{*2} = \frac{1}{n}\sum_{i=1}^n \left( D_1(Q_n^*)(W_i) + D_2(P_{Q_n^*,G(\theta_n)})(O_i)\, \frac{G(\theta_n)(A_i|V_i)}{g_{Z_i}(A_i|V_i)} \right)^2,$$

another independent argument also showing that $s_n^{*2}$ converges toward

$$\operatorname{Var}_{Q_0,G(\theta_0)} D(P_{Q_{\theta_0,\varepsilon_0},G(\theta_0)})(O),$$

i.e., the variance under the fixed design $P_{Q_0,G(\theta_0)}$ of the efficient influence curve at $P_{Q_{\theta_0,\varepsilon_0},G(\theta_0)}$.
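Both the conjectured variance estimator and the confidence interval (28) are straightforward to compute once the influence-curve components are available; a minimal Python sketch (array and function names ours) follows:

```python
import numpy as np
from scipy.stats import norm

def conjectured_variance(D1_vals, D2_vals, g_ratio):
    """Conjectured estimator s_n^{*2}. D1_vals = D_1(Q_n^*)(W_i);
    D2_vals = D_2(P_{Q_n^*, G(theta_n)})(O_i);
    g_ratio = G(theta_n)(A_i|V_i) / g_{Z_i}(A_i|V_i)."""
    IC = D1_vals + D2_vals * g_ratio
    return np.mean(IC ** 2)

def confidence_interval(psi_star, s2, n, alpha=0.05):
    """Confidence interval (28), of asymptotic level 1 - alpha."""
    half = np.sqrt(s2 / n) * norm.ppf(1 - alpha / 2)
    return psi_star - half, psi_star + half
```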

Furthermore, the most essential characteristic of the joint methodologies of design adaptation and targeted maximum likelihood estimation is certainly the utmost importance of the role played by the likelihood. In this view, the targeted maximized log-likelihood of the data

$$\sum_{i=1}^n \left\{ \log Q_W(P_n)(W_i) + \log g_i(A_i|W_i) + \log Q_{Y|A,W}(\theta_n,\varepsilon_n)(O_i) \right\}$$

provides us with a quantitative measure of the quality of the fit (targeted toward the parameter of interest). It is therefore possible, for example, to use that quantity to select among different working models for $Q_0$. As with TMLE for i.i.d data, we can use likelihood-based cross-validation to select among more general initial estimators indexed by fine-tuning parameters. The validity of such TMLEs for group sequential adaptive designs as studied here is outside the scope of this article.

6. Application to group sequential testing

We derive in this section a group sequential testing procedure, that is, a testing procedure which repeatedly tries to make a decision at intervals, rather than only once all data are collected or after every new observation is obtained (the latter would be a fully sequential testing procedure). We refer to (Jennison and Turnbull, 2000; Proschan et al., 2006) for a general presentation of group sequential testing procedures. The TMLE group sequential testing procedure is formally described in Section 6.1, and some arguments justifying why it should work well (as validated by the simulation study carried out in Section 8) are presented in Section 6.2.

6.1. Description of the TMLE group sequential testing procedure

The problem at stake is to test the null “Ψ(Q) = ψ0” against “Ψ(Q) > ψ0” with asymptotic type I error α and asymptotic type II error β at some ψ1 > ψ0. We want to proceed group sequentially with K ≥ 2 steps, based on the multi-dimensional t-statistic

$$(T_1^*, \ldots, T_K^*) = \left( \sqrt{N_k}\, \frac{\psi_{N_k}^* - \psi_0}{s_{N_k}} \right)_{k \le K}. \quad (29)$$

Here, $N_1, \ldots, N_K$ are random sample sizes whose realizations on a specific trajectory depend on how fast the information accrues as the data are collected. In order to quantify this notion of information, we decide to consider the inverse $n/s_n^2$ of the estimated variance $s_n^2/n$ of the TMLE $\psi_n^*$ based on the first n observations $\mathbf{O}_n$ (as a proxy to its true, finite sample, inverse variance). Given a reference maximum committed information $I_{\max}$ and K increasingly ordered proportions $0 < p_1 < \cdots < p_K = 1$, we set for every $k \le K$

$$N_k = \inf\left\{ n \ge 1 : n/s_n^2 \ge p_k I_{\max} \right\}.$$

The characterization of $I_{\max}$ depends on how we wish to "spend" the type I and type II errors at each step of the group sequential procedure, and on how demanding the power requirement is (i.e., how close $\psi_1$ is to $\psi_0$). Say that our spending strategies are summarized by the K-tuples of positive numbers $(\alpha_1, \ldots, \alpha_K)$ and $(\beta_1, \ldots, \beta_K)$ such that $\sum_{k=1}^K \alpha_k = \alpha$ and $\sum_{k=1}^K \beta_k = \beta$.

Now, let $(Z_1, \ldots, Z_K)$ be distributed from the centered Gaussian distribution with covariance matrix $C = \left( \sqrt{p_{k \wedge l}/p_{k \vee l}} \right)_{k,l \le K}$. We assume that there exists a unique value $I > 0$, our $I_{\max}$, such that there exist a rejection boundary $(a_1, \ldots, a_K)$ and a futility boundary $(b_1, \ldots, b_K)$ satisfying $a_K = b_K$, $P(Z_1 \ge a_1) = \alpha_1$, $P(Z_1 + (\psi_1 - \psi_0)\sqrt{p_1 I} \le b_1) = \beta_1$, and for every $1 \le k < K$,

$$P\left( \forall j \le k,\ b_j < Z_j < a_j \ \text{ and } \ Z_{k+1} \ge a_{k+1} \right) = \alpha_{k+1},$$
$$P\left( \forall j \le k,\ b_j < Z_j + (\psi_1 - \psi_0)\sqrt{p_j I} < a_j \ \text{ and } \ Z_{k+1} + (\psi_1 - \psi_0)\sqrt{p_{k+1} I} \le b_{k+1} \right) = \beta_{k+1}.$$

Note that the closer $\psi_1$ is to $\psi_0$, the larger $I_{\max}$ is (actually, $(\psi_1 - \psi_0)\sqrt{I_{\max}}$ is both upper bounded and bounded away from zero). Heuristically, the closer $\psi_1$ is to $\psi_0$, the more difficult it is to decide between the null and its alternative while preserving the required type II error at $\psi_1$, and the more information is needed to proceed.

The targeted maximum likelihood group sequential testing procedure finally goes as follows: starting from k = 1,

  • if $T_k^* \ge a_k$ then reject the null and stop accruing data,

  • if $T_k^* \le b_k$ then fail to reject the null and stop accruing data,

  • if $b_k < T_k^* < a_k$ then set $k \leftarrow k + 1$ and repeat.

If (T1*,,TK*) had the same distribution as (Z1,…,ZK), then the latter rule would yield a testing procedure with the required type I error and type II error at the specified alternative parameter.
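The decision rule itself is elementary; a minimal Python sketch (with the boundaries assumed computed beforehand) reads:

```python
def group_sequential_test(T_stats, a_bounds, b_bounds):
    """Walk through the interim t-statistics T*_k with rejection boundary
    a_k and futility boundary b_k; a_K = b_K forces a decision at step K."""
    for k, (T, a, b) in enumerate(zip(T_stats, a_bounds, b_bounds), start=1):
        if T >= a:
            return k, "reject the null"
        if T <= b:
            return k, "fail to reject the null"
    return len(T_stats), "no decision"  # unreachable when a_K = b_K
```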

Clearly, our decision to target the optimal design $G(\theta_0)$, which reduces as much as possible (over $\mathcal{G}_1$ and for our choice of working model) the asymptotic variance of the TMLE $\psi_n^*$, guarantees, at least informally, that each $N_k$ is stochastically smaller under $(Q_0, \mathbf{g}_n)$-adaptive sampling than it would have been had another fixed design been used (or targeted). Thus, resorting to $(Q_0, \mathbf{g}_n)$-adaptive sampling is likely to result in an earlier conclusion than another fixed (or targeted) sampling scheme would have yielded.

6.2. Rationale of the TMLE group sequential testing procedure

The next proposition partially justifies the characterization of the TMLE group sequential testing procedure of Section 6.1. First, let us define $n_k = \lceil n p_k \rceil$ (the smallest integer not smaller than $n p_k$) for every $k \le K$, then

$$(T_1, \ldots, T_K) = \left( \sqrt{n_k}\, \frac{\psi_{n_k}^* - \psi_0}{s_{n_k}} \right)_{k \le K}.$$

This other multi-dimensional t-statistic is a substitute for $(T_1^*, \ldots, T_K^*)$ whose asymptotic study is easier to carry out. In particular, it is possible to derive the limit distribution of $(T_1, \ldots, T_K)$ under the null and under a sequence of contiguous alternatives. The limit distribution is called the canonical distribution (see Theorems 11 and 12 in (van der Laan, 2008), Theorem 3 in (Chambaz and van der Laan, 2011) and Theorem 2.1 in (Zhu and Hu, 2010), where a similar result is obtained through a different approach based on the study of the limit distribution of a stochastic process defined over (0, 1]).

Proposition 8. Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Consider $f \in L^2(P_{Q_0,G(\theta_0)})$, $f \neq 0$, with $P_{Q_0,G(\theta_0)} f = 0$, and a fluctuation $\{Q_0(h) : h \in H\} \subset \mathcal{Q}$ of $Q_0$ with score f. Set $Q_{f/\sqrt{n}} = Q_0(1/\sqrt{n})$ for all $n \ge n_0$ ($n_0$ is such that $1/\sqrt{n_0} \in H$). Assume also that:

A5. The score f is bounded, it is not proportional to $D(P_{Q_0,G(\theta_0)})$, $E_{P_{Q_0,G(\theta_0)}}[f(O)|A,W] = 0$, $P_{Q_0,G(\theta_0)}\, D(P_{Q_0,G(\theta_0)})\, f > 0$, and $P_{Q_0,G(\theta_0)}\, \mathrm{IC}\, f > 0$.

The sequence $(Q_{f/\sqrt{n}})_{n \ge n_0}$ defines a sequence $(\psi_n)_{n \ge n_0}$ of contiguous parameters ("from direction f", which only fluctuates the conditional distribution of Y given (A, W)), with $\psi_n = \Psi(Q_{f/\sqrt{n}}) > \Psi(Q_0)$ for n large enough. Introduce $\mu(f) = (\sqrt{p_1}, \ldots, \sqrt{p_K})\, \frac{P_{Q_0,G(\theta_0)}\, \mathrm{IC}\, f}{\sqrt{P_{Q_0,G(\theta_0)}\, \mathrm{IC}^2}}$.

  • (i)

    Under (Q0, gn)-adaptive sampling, (T1,…,TK) converges in distribution to the centered Gaussian distribution with covariance matrix C.

  • (ii)

    Under $(Q_{f/\sqrt{n}}, \mathbf{g}_n)$-adaptive sampling, $(T_1, \ldots, T_K)$ converges in distribution to the Gaussian distribution with mean $\mu(f)$ and covariance matrix C.

The proof of Proposition 8 is given in Section A.4.

The rationale follows. Say that one concludes from Proposition 8 (i) that $(T_1^*, \ldots, T_K^*)$ is also approximately distributed as $(Z_1, \ldots, Z_K)$ under $(Q_0, \mathbf{g}_n)$-adaptive sampling. Likewise, say that $(\sqrt{N_k}(\psi_{N_k}^* - \psi_1)/s_{N_k})_{k \le K}$ is also approximately distributed as $(Z_1, \ldots, Z_K)$ under $(Q_1, \mathbf{g}_n)$-adaptive sampling, with $\Psi(Q_1) = \psi_1 > \psi_0$. Say that one is willing to substitute $p_k I_{\max}$ for $N_k/s_{N_k}^2$ for each $k \le K$. Then (ii) $(T_1^*, \ldots, T_K^*)$ is approximately distributed as $\sqrt{I_{\max}}\,(\psi_1 - \psi_0)(\sqrt{p_1}, \ldots, \sqrt{p_K}) + (Z_1, \ldots, Z_K)$ under $(Q_1, \mathbf{g}_n)$-adaptive sampling. It appears that (i) and (ii) suffice to guarantee the desired asymptotic control of the type I and type II errors. The rationale is validated by the simulation study undertaken in Section 8.

7. Simulation study of the performances of TMLE in adaptive covariate-adjusted RCTs

In this section, we carry out a simulation study of the performances of TMLE in adaptive group sequential RCTs as exposed in the previous sections. We present the simulation scheme in Section 7.1. The working model upon which the TMLE methodology relies is described in Section 7.2. How the TMLE-based confidence intervals behave is considered in Section 7.3 (empirical coverage) and in Section 7.4 (empirical width). An illustrating example is finally presented in Section 7.5.

7.1. Simulation scheme

We characterize the component Q0 = Q(P0) of the true distribution P0 of the observed data structure O = (W, A, Y) as follows:

  • the baseline covariate W = (U, V), where U is uniformly distributed over the unit interval [0, 1] and the subgroup membership covariate $V \in \mathcal{V} = \{1, 2, 3\}$ (hence ν = 3) satisfies
    $$P_0(V=1) = \tfrac{1}{2}, \quad P_0(V=2) = \tfrac{1}{3}, \quad P_0(V=3) = \tfrac{1}{6};$$
  • the conditional distribution of Y given (A, W) is the Gamma distribution characterized by the conditional mean
    $$\bar{Q}_0(A,W) = 2U^2 - 2U + 1 + \rho\left( AV + \frac{1-A}{1+V} \right) \quad (30)$$
    with ρ = 1 (we will set ρ to another value in Section 8) and the conditional variance
    $$\operatorname{Var}_{P_0}(Y|A,W) = \left( U + A(1+V) + \frac{1-A}{1+V} \right)^2$$
    (a sampling sketch in code follows this list).
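Here is a minimal Python sketch of this simulation scheme under an arbitrary design $g(1|v)$, using the shape/scale parameterization of the Gamma distribution (shape = mean²/variance, scale = variance/mean); the function name and defaults are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_from_Q0(n, g1=lambda v: np.full(v.shape, 0.5), rho=1.0):
    """Draw n copies of O = (W, A, Y) from Q_0 under the design g1(v) = g(1|v)
    (balanced by default), with the Gamma conditional distribution above."""
    U = rng.uniform(size=n)
    V = rng.choice([1, 2, 3], size=n, p=[1/2, 1/3, 1/6])
    A = rng.binomial(1, g1(V))
    mean = 2 * U**2 - 2 * U + 1 + rho * (A * V + (1 - A) / (1 + V))
    var = (U + A * (1 + V) + (1 - A) / (1 + V)) ** 2
    Y = rng.gamma(shape=mean**2 / var, scale=var / mean)
    return U, V, A, Y
```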

In particular, the risk difference $\psi_0 = \Psi(Q_0)$, our parameter of interest, is known in closed form:

$$\psi_0 = \frac{91}{72} \approx 1.264,$$

and so is the variance

$$v^b(Q_0) = \operatorname{Var}_{Q_0,g^b} D(O; P_{Q_0,g^b})$$

of the efficient influence curve under balanced sampling. The numerical value of vb(Q0) is reported in Table 1.

Table 1:

Numerical values of the allocation probabilities and variance of the efficient influence curve under either the balanced $g^b$ or the targeted optimal $g^*(Q_0)$ sampling scheme. The ratio of the variances of the efficient influence curve under the targeted optimal and balanced sampling schemes satisfies $R(Q_0) = v^*(Q_0)/v^b(Q_0) \approx 0.762$.

Sampling scheme $(Q_0, g)$ | $g(1|v=1)$ | $g(1|v=2)$ | $g(1|v=3)$ | Variance $\operatorname{Var}_{Q_0,g} D^*(O; P_{Q_0,g})$
$(Q_0, g^b)$-balanced | 1/2 | 1/2 | 1/2 | 23.864
$(Q_0, g^*(Q_0))$-optimal | 0.707 | 0.799 | 0.849 | 18.181

We target the design which (a) depends on the baseline covariate W = (U, V) only through V (i.e., belongs to $\mathcal{G}_1$) and (b) minimizes the variance of the efficient influence curve of the parameter of interest $\Psi$. The latter treatment mechanism $g^*(Q_0)$ and the optimal efficient asymptotic variance

$$v^*(Q_0) = \operatorname{Var}_{Q_0,g^*(Q_0)} D(O; P_{Q_0,g^*(Q_0)})$$

are also known in closed form, and numerical values are reported in Table 1.

Let $(n_1, \ldots, n_7) = (100, 250, 500, 750, 1000, 2500, 5000)$ be a sequence of sample sizes. We estimate M = 1000 times the risk difference $\psi_0 = \Psi(Q_0)$ based on $\mathbf{O}_{n_7}^m(n_i)$, m = 1, …, M, i = 1, …, 7, under

  • i.i.d (Q0, gb)-balanced sampling,

  • i.i.d (Q0, g(Q0))-optimal sampling,

  • (Q0, gn7)-adaptive sampling.

Finally, we emphasize that the observed data structure O = (W, A, Y) is not bounded, whereas $\mathcal{O}$ is assumed bounded in Propositions 2, 3, 4, 5, 6 and 7.

7.2. Specifying the working model, reference design and a few twists

On the working model.

For each $v \in \mathcal{V}$, let us denote $\theta(v) = (\beta_v, \sigma_v^2(0), \sigma_v^2(1)) \in \Theta_v$, $\Theta_v \subset \mathbb{R}^3 \times \mathbb{R}_+^* \times \mathbb{R}_+^*$ being compact, where the regression vector $\beta_v = (\beta_{v,1}, \beta_{v,2}, \beta_{v,3})$ (b = 3 in (10)); then $\theta = (\theta(1), \theta(2), \theta(3)) \in \Theta = \Theta_1 \times \Theta_2 \times \Theta_3$.

Following the description in Section 4.1, the working model $\mathcal{Q}_n^w$ that the TMLE methodology relies on is characterized by the conditional likelihood of Y given (A, W):

$$Q_{Y|A,W}(\theta)(O) = \frac{1}{\sqrt{2\pi\sigma_V^2(A)}} \exp\left\{ -\frac{(Y - m(A,W;\beta_V))^2}{2\sigma_V^2(A)} \right\},$$

with the specific choice of conditional mean $\bar{Q}(\theta)(A,W)$ of Y given (A, W):

$$\bar{Q}(\theta)(a,w) = m(a,w;\beta_v) = \beta_{v,1} + \beta_{v,2}\, u + \beta_{v,3}\, a$$

for all $a \in \mathcal{A}$ and $w = (u,v) \in \mathcal{W} = [0,1] \times \mathcal{V}$. As required, condition PARAM is met. Obviously, the working model is heavily misspecified:

  • a Gaussian conditional likelihood is used instead of a Gamma conditional likelihood,

  • the parametric forms of the conditional expectation and variance are wrong too.

On the reference design.

Regarding the choice of a reference fixed design $g^r \in \mathcal{G}_1$ (see Section 4.1), we select $g^r = g^b$ (the balanced design). The parameter $\theta_0$ only depends on $Q_0$ and the working model, but its estimator $\theta_n$ depends on $g^r$, which may negatively affect its performance. Therefore, we propose to dilute the impact of the choice of $g^r$ as an initial reference design as follows.

For a given sample size n, we first compute a first estimate $\theta_n^1$ of $\theta_0$ as in (12) but with $\lceil n/4 \rceil$ (the smallest integer not smaller than n/4) substituted for n in the sum. Then $\theta_n$ is computed as in (12), but this time with $G(\theta_n^1)(A_i|V_i)$ substituted for $g^r(A_i|V_i)$.

The proofs can be adapted in order to incorporate this modification of the procedure. We refer the interested reader to Section 8.5 in (van der Laan, 2008).

On additional details.

We decide arbitrarily to update the design each time c = 25 new observations are sampled. In addition, the first update only occurs when there are at least five completed observations in each treatment arm and for all V-strata. Thus, the minimal sample size at first update is 30. It can be shown that, under initialization with the balanced design, the expected value of the first sample size at which there are at least five observations in each such stratum equals 75. Finally, as a precautionary measure, we systematically apply a thresholding to the updated treatment mechanism: using the notation of Section 4, $\max\{\delta, \min\{1-\delta, g_i(A_i|V_i)\}\}$ is substituted for $g_i(A_i|V_i)$ in all computations. We arbitrarily choose δ = 0.01.

7.3. Empirical coverage of the TMLE confidence intervals

We now invoke the central limit theorem stated in Proposition 7 to construct confidence intervals for the risk difference. Let us introduce, for all types of sampling and each sample size $n_i$, the confidence intervals

$$I_{n_i,m} = \left[ \psi_{n_i}^*(\mathbf{O}_{n_7}^m(n_i)) \pm \sqrt{\frac{v_{n_i}(\mathbf{O}_{n_7}^m(n_i))}{n_i}}\, \xi_{1-\alpha/2} \right], \quad m = 1, \ldots, M,$$

where the definition of the variance estimator $v_n(\mathbf{O}_{n_7}^m(n))$ based on the first n observations $\mathbf{O}_{n_7}^m(n)$ depends on the sampling scheme:

  • under i.i.d $(Q_0, g^b)$-balanced sampling, $v_n(\mathbf{O}_{n_7}^m(n))$ is the estimator of the asymptotic variance of the TMLE $\Psi(Q_{n,\mathrm{iid}}^*)$:

$$v_n(\mathbf{O}_{n_7}^m(n)) = \frac{1}{n}\sum_{i=1}^n D(P_{Q_{n,\mathrm{iid}}^*, g^b})(O_i^m)^2, \quad (31)$$

  • under i.i.d $(Q_0, g^*(Q_0))$-optimal sampling, $v_n(\mathbf{O}_{n_7}^m(n))$ is defined as in (31), replacing $g^b$ with $g^*(Q_0)$,

  • under $(Q_0, \mathbf{g}_n)$-adaptive sampling, $v_n(\mathbf{O}_{n_7}^m(n)) = s_n^{*2}(\mathbf{O}_{n_7}^m(n))$, the estimator of the conjectured asymptotic variance of $\sqrt{n}\,(\psi_n^*(\mathbf{O}_{n_7}^m(n)) - \psi_0)$ computed on the first n observations $\mathbf{O}_{n_7}^m(n)$.

We are interested in the empirical coverage (reported in Table 2, top rows)

$$c_{n_i} = \frac{1}{M}\sum_{m=1}^M \mathbf{1}\{\psi_0 \in I_{n_i,m}\}$$

guaranteed for each sampling scheme and every i = 1, …, 7 by $\{I_{n_i,m} : m = 1, \ldots, M\}$. The rescaled empirical coverage proportions $M c_{n_i}$ should have a Binomial distribution with parameter (M, 1 − a) with a = α for every i = 1, …, 7. This property can be tested with a standard binomial test, the alternative stating that a > α. This results in a collection of 7 p-values for each sampling scheme, as reported in Table 2 (bottom rows).
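This check is a one-liner with scipy; a sketch (the helper name and the boolean coverage array are ours):

```python
import numpy as np
from scipy.stats import binomtest

def coverage_pvalue(covered, alpha=0.05):
    """Test whether the empirical coverage count over M replications is
    Binomial(M, 1 - alpha), against the under-coverage alternative a > alpha.
    `covered` is a boolean array: covered[m] iff psi_0 lies in I_{n_i, m}."""
    M = covered.size
    return binomtest(int(covered.sum()), M, p=1 - alpha,
                     alternative="less").pvalue
```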

Table 2:

Checking the adequateness of the coverage guaranteed by our simulated confidence intervals. We test whether the rescaled empirical coverage Binomial random variables $M c_{n_i}$ have parameter (M, 1 − α), the alternative stating that they have parameter (M, 1 − a) with a > α. We report the values $c_{n_i}$ (top row) and the corresponding p-values (bottom row, between parentheses) of the Binomial test for each sample size and each sampling scheme.

Sampling scheme Sample size
n1 n2 n3 n4 n5 n6 n7
i.i.d (Q0, gb)-balanced 0.913 (p < 0.001) 0.925 (p < 0.001) 0.939 (0.067) 0.934 (0.015) 0.945 (0.253) 0.940 (0.087) 0.946 (0.300)
i.i.d (Q0, g(Q0))-optimal 0.894 (p < 0.001) 0.941 (0.111) 0.940 (0.087) 0.953 (0.688) 0.954 (0.739) 0.947 (0.351) 0.947 (0.351)
(Q0, gn7)-adaptive 0.934 (0.015) 0.939 (0.067) 0.956 (0.827) 0.945 (0.253) 0.943 (0.172) 0.933 (0.011) 0.952 (0.634)

Considering each sampling scheme (i.e., each row of Table 2) separately, we conclude that the (1 − α)-coverage cannot be declared defective under

  • i.i.d $(Q_0, g^b)$-balanced sampling for any sample size $n_i \ge n_3 = 500$,

  • i.i.d $(Q_0, g^*(Q_0))$-optimal sampling for any sample size $n_i \ge n_2 = 250$,

  • $(Q_0, \mathbf{g}_{n_7})$-adaptive sampling for any sample size $n_i \ge n_1 = 100$,

adjusting for multiple testing in terms of the Benjamini and Yekutieli procedure for controlling the False Discovery Rate at level 5%.

This is a remarkable result that not only validates the theory but also provides us with insight into the finite sample properties of the TMLE procedure based on adaptive sampling. The fact that the TMLE procedure behaves better under the adaptive sampling scheme than under the balanced i.i.d sampling scheme at sample size n1 = 100 may not be due to mere chance. Although the TMLE procedure based on an adaptive sampling scheme is initiated under the balanced sampling scheme (so that each stratum consists at the beginning of comparable numbers of patients assigned to each treatment arm, allowing to estimate, at least roughly, the required parameters), it starts deviating from it (as soon as every (A, V)-stratum counts 5 patients) each time 25 new observations are accrued. The poor performance of the TMLE procedure based on the optimal i.i.d sampling scheme at sample size n1 is certainly due to the fact that, by starting directly from the optimal sampling scheme (a choice we would not recommend in practice), too few patients from stratum V = 3 are assigned to treatment arm A = 0 among the n1 first subjects. At larger sample sizes, the TMLE procedure performs equally well under the adaptive sampling scheme and under both i.i.d schemes in terms of coverage.

7.4. Empirical width of the TMLE confidence intervals

Now that we know that the TMLE-based confidence intervals based on $(Q_0, \mathbf{g}_n)$-adaptive sampling are valid confidence regions, it is of interest to compare their widths with those of their counterparts obtained under the i.i.d $(Q_0, g^b)$-balanced or $(Q_0, g^*(Q_0))$-optimal sampling schemes.

For this purpose, we compare on one hand, for each sample size ni, the empirical distribution of {υn(Onτm(ni)):m=1,...,M} as in (31) (i.e., the empirical distribution of width of the TMLE-based confidence intervals at sample size ni obtained under i.i.d (Q0, gb)-balanced sampling, up to the factor 2ξ1α/2/ni) to the empirical distribution of {sn(Onτm(ni)):m=1,...,M} (i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size ni obtained under (Q0,gn) adaptive sampling, up to the factor 2ξ1α/2/ni in terms of the two-sample Kolmogorov-Smirnov test, where the alternative states that the confidence intervals obtained under adaptive sampling are stochastically smaller than their counterparts under i.i.d balanced sampling. This results in 7 p-values, all equal to zero, which we nonetheless report in Table 3 (bottom row). In order to get a sense of how much narrower the confidence intervals obtained under adaptive sampling are, we also compute and report in Table 3 (top row) the ratios of empirical average widths

$$\frac{M^{-1}\sum_{m=1}^{M} s_n(O^{\tau_m}(n_i))}{M^{-1}\sum_{m=1}^{M} v_n(O^{\tau_m}(n_i))} \qquad (32)$$

for each sample size ni. Informally, this shows a reduction in width of roughly 12%. A minimal sketch of the one-sided Kolmogorov-Smirnov comparison follows.
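In the sketch below, the two arrays of placeholder data stand in for the M = 1000 simulated interval widths under each scheme (the actual widths are not reproduced here). Note that, in scipy's convention for ks_2samp, alternative="greater" states that the empirical CDF of the first sample lies above that of the second, i.e., that the first sample is stochastically smaller.

```python
# Sketch: one-sided two-sample Kolmogorov-Smirnov comparison of interval widths.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
adaptive_widths = rng.normal(0.88, 0.05, size=1000)   # placeholder values
balanced_widths = rng.normal(1.00, 0.05, size=1000)   # placeholder values

# Alternative: adaptive widths are stochastically smaller than balanced widths.
result = ks_2samp(adaptive_widths, balanced_widths, alternative="greater")
print(result.pvalue)
```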

Table 3:

Comparing the widths of our confidence intervals. On one hand, we test, for each sample size ni, whether the TMLE-based confidence intervals obtained under (Q0, gn)-adaptive sampling are stochastically narrower than those obtained under i.i.d (Q0, gb)-balanced sampling, in terms of the two-sample Kolmogorov-Smirnov test. On the other hand, we test, for each sample size ni, whether the TMLE-based confidence intervals obtained under (Q0, gn)-adaptive sampling are stochastically wider than those obtained under i.i.d (Q0, g*(Q0))-optimal sampling, in terms of the two-sample Kolmogorov-Smirnov test. For each sample size ni, we report the ratio of average widths as defined in (32) and, between parentheses, the corresponding p-value.

Comparison Sample size
n1 n2 n3 n4 n5 n6 n7
(Q0, gn7) vs (Q0, gb) 0.856 (0) 0.871 (0) 0.879 (0) 0.880 (0) 0.878 (0) 0.877 (0) 0.876 (0)
(Q0, gn7) vs (Q0, g*(Q0)) 0.962 (0.144) 0.977 (0.236) 0.992 (0.100) 0.995 (0.060) 0.997 (0.407) 1.000 (0.236) 1.000 (0.144)

On the other hand, we also compare, for each sample size $n_i$, the empirical distribution of $\{v_n(O^{\tau_m}(n_i)) : m = 1,\ldots,M\}$ as in (31) but with gb replaced by g*(Q0) (i.e., the empirical distribution of the widths, up to the factor $2\xi_{1-\alpha/2}/\sqrt{n_i}$, of the TMLE-based confidence intervals at sample size $n_i$ obtained under i.i.d (Q0, g*(Q0))-optimal sampling) to the empirical distribution of $\{s_n(O^{\tau_m}(n_i)) : m = 1,\ldots,M\}$ (i.e., the empirical distribution of the widths, up to the same factor, of the TMLE-based confidence intervals at sample size $n_i$ obtained under (Q0, gn)-adaptive sampling), in terms of the two-sample Kolmogorov-Smirnov test, where the alternative states that the confidence intervals obtained under adaptive sampling are stochastically larger than their counterparts under i.i.d optimal sampling. This results in 7 p-values that we report between parentheses in Table 3 (second row). In order to get a sense of how similar the confidence intervals obtained under the adaptive and i.i.d optimal sampling schemes are, we also compute and report for each sample size $n_i$ in Table 3 (second row) the ratios of empirical average widths as in (32), again with gb replaced by g*(Q0) in the definition (31) of $v_n(O^{\tau_m}(n))$. Informally, this shows that the confidence intervals obtained under adaptive sampling are on average even slightly narrower than their counterparts obtained under i.i.d optimal sampling.

7.5. Illustrating example

So far we have been concerned with distributional results, answering the questions: Does the confidence interval $[\psi_n^* \pm s_n \xi_{1-\alpha/2}/\sqrt{n}]$ provide the wished coverage? (yes, even for moderate sample sizes: see Section 7.3) How does its width compare with the width of the confidence interval obtained under either i.i.d sampling scheme? (well: see Section 7.4). In this section, we focus on a particular simulated trajectory (we arbitrarily select the first one, associated with $O^1_{n_7}$) for the sake of illustration.

Some interesting features of the selected simulated trajectory are apparent in Fig. 2 and Table 4.

Table 4:

Illustrating the TMLE procedure under the (Q0, gn)-adaptive sampling scheme. This table illustrates how the TMLE procedure behaves (on a simulated trajectory) as the sample size increases. We report, at each sample size ni, the updated adapted design $g_n = G(\theta_n(O^1_{n_7}(n)))$ (columns two to four), the current TMLE $\psi_n^*(O^1_{n_7}(n))$ (column five) and the 95% confidence interval $I_{n,1}$ as in (28), with $s_n(O^1_{n_7}(n))$ substituted for $s_n$. The true risk difference $\psi_0 \simeq 1.264$ belongs to all the confidence intervals. See also Fig. 2.

Sample size Allocation probabilities TMLE Confidence interval
n gn(1|1) gn(1|2) gn(1|3) ψn* $[\psi_n^* \pm s_n\,\xi_{1-0.05/2}/\sqrt{n}]$
n1 0.589 0.764 0.766 1.252 [0.722;1.783]
n2 0.624 0.775 0.707 1.388 [0.974;1.802]
n3 0.679 0.767 0.795 1.361 [1.037;1.685]
n4 0.677 0.757 0.813 1.341 [1.068;1.615]
n5 0.670 0.760 0.806 1.250 [1.012;1.488]
n6 0.677 0.788 0.835 1.288 [1.126;1.451]
n7 0.694 0.793 0.834 1.273 [1.157;1.389]

For instance, we can follow the convergence of the TMLE $\psi_n^*$ toward the true risk difference $\psi_0$ in the top plot of Fig. 2 and in the fifth column of Table 4. Similarly, the middle plot of Fig. 2 and the second to fourth columns of Table 4 illustrate the convergence of $g_n$ toward $G^*(\theta_0)$, as stated in Proposition 3. What these plots and columns also teach us is that, in spite of the misspecified working model, the learned design $G^*(\theta_0)$ seems very close to the optimal treatment mechanism g*(Q0) for the simulation scheme and working model used in our simulation study. Moreover, the last column of Table 4 illustrates how the confidence intervals $[\psi_n^* \pm s_n\,\xi_{1-0.05/2}/\sqrt{n}]$ shrink around the true risk difference $\psi_0$ as the sample size increases.

Yet, the bottom plot of Fig. 2 may be the most interesting of the three. It illustrates the convergence of $s_n^2$ toward $\mathrm{Var}_{Q_0,G^*(\theta_0)}\, D^*(P_{Q_{\theta_0,\varepsilon_0},G^*(\theta_0)})(O)$, i.e., toward the variance, under the fixed design $P_{Q_0,G^*(\theta_0)}$, of the efficient influence curve at $P_{Q_{\theta_0,\varepsilon_0},G^*(\theta_0)}$. Hence, it also teaches us that the latter limit seems very close to the optimal asymptotic variance v*(Q0) for the simulation scheme and working model used in our simulation study. More importantly, $s_n^2$ strikingly converges to v*(Q0) from below. This finite sample characteristic may reflect the fact that the true finite sample variance of $\sqrt{n}(\psi_n^* - \psi_0)$ might be lower than v*(Q0). Studying this issue in depth is certainly very delicate, and goes beyond the scope of this article.

8. Simulation study of the performance of the TMLE group sequential testing procedure

In this section, we resume the simulation study undertaken in Section 7 in order to investigate the performance of the TMLE group sequential testing procedure presented in Section 6. We describe the simulation scheme in Section 8.1, then evaluate how the testing procedure behaves in terms of empirical type I and type II errors in Section 8.2, and in terms of the empirical distribution of the sample size at decision in Section 8.3.

8.1. The simulation scheme, continued

We wish to test the null “Ψ(P) = ψ0” against its alternative “Ψ(P) > ψ0” (with $\psi_0 = 91/72 \simeq 1.264$), with prescribed type I error α = 5% and type II error β = 10% at the alternative ψ1 = ψ0 + 0.4. Depending on whether we want to investigate the empirical behavior of the TMLE group sequential testing procedure with respect to (i) the type I or (ii) the type II error, we test M = 1000 times the above hypotheses based on 3 × M independent datasets obtained under

  • (i)
    empirical type I error study:
    • i.i.d (Q0, gb)-balanced sampling,
    • i.i.d (Q0, g*(Q0))-optimal sampling,
    • (Q0,gn)-adaptive sampling;
  • (ii)
    empirical type II error study:
    • i.i.d (Q1, gb)-balanced sampling,
    • i.i.d (Q1, g*(Q0))-optimal sampling,
    • (Q1,gn)-adaptive sampling,

where Q1 is defined as Q0 in Section 7.1 with ρ = ψ1/ψ0 (so that Ψ(Q1) = ψ1).

We decide to proceed in K = 4 steps, at proportions (p1, p2, p3, p4) = (0.25, 0.50, 0.75, 1). We choose the α- and β-spending strategies characterized by the equalities $\sum_{l=1}^{k}\alpha_l = p_k^2\,\alpha$ and $\sum_{l=1}^{k}\beta_l = p_k^2\,\beta$ for k = 1, …, K. This set of conditions characterizes the whole group sequential testing procedure; see Table 5 for the resulting numerical values, and the sketch following the table for how the spending columns are obtained.

Table 5:

Specifics of the TMLE group sequential testing procedure

ψ0 ψ1 Imax
1.264 1.664 57.927
Step k 1 2 3 4
Rejection boundary 2.734 2.305 2.005 1.715
α-spending 0.3125% 0.9375% 1.5625% 2.1875%
Step k 1 2 3 4
Futility boundary −0.976 0.132 0.961 1.715
β-spending 0.625% 1.875% 3.125% 4.375%
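The spending rows of Table 5 follow from the quadratic spending functions above by simple differencing. A minimal sketch, assuming only the proportions and the prescribed α and β (the rejection and futility boundaries themselves require the group sequential machinery of Section 6 and are not reproduced here):

```python
# Per-step alpha- and beta-spending implied by the cumulative spending
# sum_{l<=k} alpha_l = p_k^2 * alpha and sum_{l<=k} beta_l = p_k^2 * beta.
alpha, beta = 0.05, 0.10
p = [0.25, 0.50, 0.75, 1.0]

def increments(total, proportions):
    """Differences of the cumulative spending p_k^2 * total."""
    cum = [pk ** 2 * total for pk in proportions]
    return [c - c0 for c, c0 in zip(cum, [0.0] + cum[:-1])]

print([f"{x:.4%}" for x in increments(alpha, p)])
# ['0.3125%', '0.9375%', '1.5625%', '2.1875%']  -- alpha-spending row of Table 5
print([f"{x:.4%}" for x in increments(beta, p)])
# ['0.6250%', '1.8750%', '3.1250%', '4.3750%']  -- beta-spending row of Table 5
```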

8.2. Empirical type I and type II errors of the TMLE group sequential testing procedure

We report in Table 6 the empirical type I and type II errors obtained during the course of the simulation study. Within each column, the values are strikingly close to each other. Even better, the empirical type II error obtained under (Q, gn)-adaptive sampling is the lowest of the three, and the corresponding empirical type I error is tied for the lowest.

Table 6:

Checking the adequacy of the type I and type II error controls on our simulated TMLE group sequential testing procedures. We test whether the binomial numbers of times the null is falsely rejected have parameter (M, α), the alternative stating that they have parameter (M, a) with a > α. We also test whether the binomial numbers of times the null is falsely not rejected have parameter (M, 12%), the alternative stating that they have parameter (M, b) with b > 12%. We report the empirical type I and type II errors and the corresponding p-values.

Sampling scheme Type I error (Q = Q0) Empirical error (p-value) Type II error (Q = Q1) Empirical error (p-value)
i.i.d (Q, gb)-balanced 0.040 (0.919) 0.132 (0.113)
i.i.d (Q, g*(Q))-optimal 0.043 (0.827) 0.128 (0.203)
adaptive (Q,gn) 0.040 (0.919) 0.126 (0.261)

Since the number of times the null was falsely rejected in favor of its alternative is binomial with parameter (M, a), it is possible to test rigorously whether a = α (as it should be) against a > α. This yields three p-values, which we report in Table 6. Because there are fewer than αM = 50 wrong decisions under every sampling scheme, we naturally get large p-values, confirming (if necessary) that the type I error is under control.

Similarly, the number of times the null was falsely not rejected in favor of its alternative is also binomial, with parameter (M, b). Thus, it is possible to test rigorously whether b = β (as it should be) against b > β. This yields three small p-values (p < 0.001 for the i.i.d balanced sampling scheme, 0.002 for the i.i.d optimal sampling scheme and 0.003 for the adaptive sampling scheme), confirming the impression that the numbers of wrong decisions are all significantly larger than what random deviations from the reference distribution are likely to allow. Put bluntly, the type II error is not under control. However, there is no real need to worry: if one rather tests b = 12% against b > 12%, then the three p-values (reported in Table 6) are large; a sketch of both tests follows.
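A minimal sketch of these one-sided exact binomial tests, assuming M = 1000 and the counts implied by the empirical errors of Table 6 (e.g., 40 false rejections and 132 false non-rejections under the i.i.d balanced scheme); the printed p-values should approximately reproduce the reported ones.

```python
# One-sided exact binomial tests for the type I and type II error controls.
from scipy.stats import binomtest

M = 1000
# Type I error control: test a = 5% against a > 5% (40 false rejections).
print(binomtest(40, M, p=0.05, alternative="greater").pvalue)   # large
# Type II error control: test b = 10% against b > 10% (132 false
# non-rejections), then the relaxed b = 12% against b > 12%.
print(binomtest(132, M, p=0.10, alternative="greater").pvalue)  # small
print(binomtest(132, M, p=0.12, alternative="greater").pvalue)  # large
```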

In summary, for the considered scenario, the TMLE group sequential testing procedure described in Section 6.1 performs at least as well under the (Q, gn)-adaptive sampling scheme as under both the i.i.d (Q, gb)-balanced and (Q, g*(Q))-optimal sampling schemes. The type I error control meets the requirements. The type II error control is defective, but only slightly so.

8.3. Empirical distribution of sample size at decision of the TMLE group sequential testing procedure

Now that we know that the TMLE group sequential testing procedure performs at least as well under the adaptive sampling scheme as under both i.i.d sampling schemes, let us compare the performance of the three schemes in terms of sample size at decision.

For this purpose, we simply report the average sample sizes at decision for each sampling scheme when evaluating the type I and type II error controls, see Table 7. The gain in average sample size at decision obtained by resorting to the (Q, gn)-adaptive sampling scheme instead of the i.i.d (Q, gb)-balanced sampling scheme is dramatic: on average, one needs approximately 16% fewer observations in order to reach a conclusion under the (Q, gn)-adaptive sampling scheme than under the i.i.d (Q, gb)-balanced sampling scheme. Furthermore, it appears that it is even more efficient to resort to the i.i.d (Q, g*(Q))-optimal sampling scheme: on average, one needs approximately 6% more observations in order to reach a conclusion under the (Q, gn)-adaptive sampling scheme than under the i.i.d (Q, g*(Q))-optimal sampling scheme; the arithmetic behind both percentages is spelled out after Table 7.

Table 7:

Comparing the average sample sizes at decision for each sampling scheme when evaluating the type I and type II error controls.

Sampling scheme Type I error (Q = Q0) Average sample size Type II error (Q = Q1) Average sample size
i.i.d (Q, gb)-balanced 786.63 895.64
i.i.d (Q, g*(Q))-optimal 620.71 713.47
adaptive (Q,gn) 661.14 746.72
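As a quick check, these percentages can be recovered from the Table 7 averages with a few lines of arithmetic:

```python
# Recover the ~16% gain (vs balanced) and ~6% loss (vs optimal) quoted above.
balanced = {"type I": 786.63, "type II": 895.64}
optimal = {"type I": 620.71, "type II": 713.47}
adaptive = {"type I": 661.14, "type II": 746.72}

for study in ("type I", "type II"):
    fewer_than_balanced = 1 - adaptive[study] / balanced[study]
    more_than_optimal = adaptive[study] / optimal[study] - 1
    print(f"{study}: {fewer_than_balanced:.1%} fewer observations than "
          f"balanced, {more_than_optimal:.1%} more than optimal")
# type I: 16.0% fewer observations than balanced, 6.5% more than optimal
# type II: 16.6% fewer observations than balanced, 4.7% more than optimal
```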

In summary, for the considered scenario, the TMLE group sequential testing procedure described in Section 6.1 reaches a decision more quickly under the (Q,gn)-adaptive sampling scheme than under the i.i.d (Q, gb)-balanced sampling scheme, with a gain in average sample size at decision of approximately 16%. Furthermore, the TMLE group sequential testing procedure reaches a decision more slowly under the (Q,gn)-adaptive sampling scheme than under the i.i.d (Q, g*(Q))-optimal sampling scheme, but the loss in average sample size at decision reduces to approximately 6%.

A. Proofs

Let us introduce for convenience the notations $\theta^\varepsilon = (\theta, \varepsilon)$, $\theta_0^\varepsilon = (\theta_0, \varepsilon_0)$ and $\theta_n^\varepsilon = (\theta_n, \varepsilon_n)$. Moreover, let us recall that if μ is a probability measure and f a measurable function, then μf denotes the integral ∫f dμ.

A.1. Proof of Propositions 2 and 4, and a useful remark on Proposition 3

Proof of Proposition 2.

The cornerstone of the proof is to interpret $\theta_n$ as the solution in θ of the martingale estimating equation $\sum_{i=1}^{n} D_1(\theta)(O_i,Z_i) = 0$, where $Z_i$ (defined in (9)) is the finite dimensional summary measure of the past observations $O_n(i-1)$ such that $g_i$ depends on $O_n(i-1)$ only through $Z_i$ (hence the notation $g_i = g_{Z_i}$), and $D_1(\theta)(O,Z) = \dot{\ell}_{\theta,0}(O)\, g^r(A|V)/g_Z(A|V)$.

Note first that, for all i ≤ n,

$$P_{Q_0,g_i^*}\, D_1(\theta_0) = E_{P_{Q_0,g^r}}\bigl[\dot{\ell}_{\theta_0,0}(O_i)\,\big|\,O_n(i-1)\bigr] = 0$$

by A1 (changing the order of differentiation and integration is permitted by the dominated convergence theorem, because O is bounded and Θ is compact—see PARAM). Observe then that, by definition of θn, we have:

$$\left\lVert \frac{1}{n}\sum_{i=1}^{n} P_{Q_0,g_i^*} D_1(\theta_n) \right\rVert = \left\lVert \frac{1}{n}\sum_{i=1}^{n}\bigl[D_1(\theta_n)(O_i,Z_i) - P_{Q_0,g_i^*} D_1(\theta_n)\bigr] \right\rVert \le \sup_{\theta\in\Theta}\left\lVert \frac{1}{n}\sum_{i=1}^{n}\bigl[D_1(\theta)(O_i,Z_i) - P_{Q_0,g_i^*} D_1(\theta)\bigr] \right\rVert \equiv M_n.$$

Now, since (a) $\sup_{\theta\in\Theta}\lVert D_1(\theta)\rVert_\infty < \infty$ (because O is bounded and Θ is compact) and (b) the standard entropy of $\mathcal{F} = \{D_1(\theta) : \theta\in\Theta\}$ for the supremum norm satisfies $H(\mathcal{F},\lVert\cdot\rVert_\infty,\varepsilon) = O(\log 1/\varepsilon)$ (see (van der Vaart, 1998), Example 19.7), we can apply (componentwise) the Kolmogorov strong law of large numbers for martingales as in Theorem 8 of (Chambaz and van der Laan, 2009). This yields the almost sure convergence of $M_n$, hence of $n^{-1}\sum_{i=1}^{n} P_{Q_0,g_i^*} D_1(\theta_n)$, to 0.

By a Taylor expansion of $\theta\mapsto P_{Q_0,g^r}\dot{\ell}_{\theta,0}$ at $\theta_0$ (changing the order of differentiation and integration is permitted here too, for the same reasons as above), it holds that, for all i ≤ n (recall that $P_{Q_0,g_i^*} D_1(\theta_0) = 0$),

$$P_{Q_0,g_i^*}\, D_1(\theta_n) = \bigl(P_{Q_0,g^r}\ddot{\ell}_{\theta_0,0}\bigr)(\theta_n - \theta_0) + o_P(\lVert\theta_n - \theta_0\rVert).$$

From this we deduce that $(P_{Q_0,g^r}\ddot{\ell}_{\theta_0,0})(\theta_n - \theta_0)$ converges in probability to 0. Because $-P_{Q_0,g^r}\ddot{\ell}_{\theta_0,0}$ is positive definite by A2, this implies the result.

Proof of Proposition 4.

The proof of (i) is very simple and typical of robust statistical studies. Indeed,

$$0 = P_{Q_0,G^*(\theta_0)}\,\frac{\partial}{\partial\varepsilon}\,\ell_{\theta_0,\varepsilon}\Big|_{\varepsilon=\varepsilon_0},$$

the latter expression also writing as

$$E_{Q_0,G^*(\theta_0)}\left\{\frac{2A-1}{G^*(\theta_0)(A|V)}\,\bigl(Y - \bar{Q}(\theta_0^\varepsilon)(A,W)\bigr)\right\} = E_{Q_0,G^*(\theta_0)}\bigl\{\bigl(\bar{Q}_0(1,W) - \bar{Q}(\theta_0^\varepsilon)(1,W)\bigr) - \bigl(\bar{Q}_0(0,W) - \bar{Q}(\theta_0^\varepsilon)(0,W)\bigr)\bigr\} = \psi(Q_0) - \psi(Q_{\theta_0^\varepsilon}),$$

hence the result.

The proof of (ii) fundamentally relies on the fact that $\theta_n^\varepsilon$ solves the martingale estimating equation $\sum_{i=1}^{n} D(\theta^\varepsilon)(O_i,Z_i) = 0$, where $D(\theta^\varepsilon)$ is the extension of $D_1(\theta)$ defined in (19). Here too, $P_{Q_0,g_i^*} D(\theta_0^\varepsilon) = 0$ for all i ≤ n, and applying the Kolmogorov strong law of large numbers for martingales (see Theorem 8 in (Chambaz and van der Laan, 2009)) yields that $n^{-1}\sum_{i=1}^{n} P_{Q_0,g_i^*} D(\theta_n^\varepsilon)$ converges to zero almost surely. This entails the convergence in probability of $\theta_n^\varepsilon$ to $\theta_0^\varepsilon$ by a Taylor expansion of $\theta^\varepsilon \mapsto \bigl(P_{Q_0,g^r}\dot{\ell}_{\theta,0}^{\top},\, P_{Q_0,G^*(\theta)}\frac{\partial}{\partial\varepsilon}\ell_{\theta,\varepsilon}\bigr)$ at $\theta_0^\varepsilon$. Note that A3 is a clear counterpart to A1 from Proposition 2, but that there is no counterpart to A2 from Proposition 2 in Proposition 4. Indeed, it automatically holds in the framework of the proposition that $-P_{Q_0,G^*(\theta_0)}\frac{\partial^2}{\partial\varepsilon^2}\ell_{\theta_0,\varepsilon}\big|_{\varepsilon=\varepsilon_0} > 0$, and the proof only requires that the latter quantity be different from zero.

A useful remark on Proposition 3.

We deduce in Proposition 3 the convergence in probability and in L1 of the adaptive design $g_n$ to the limit design $G^*(\theta_0)$ from the convergence in probability of $\theta_n$ to $\theta_0$ obtained in Proposition 2. It is crucial for us that $n^{-1}\sum_{i=1}^{n} g_i^*$ and $n^{-1}\sum_{i=1}^{n} (g_i^*)^{-1}$ also converge to $G^*(\theta_0)$ and $1/G^*(\theta_0)$ (with an obvious notational shortcut) in probability and in L1. Fortunately, this is true because (a) the positive random variables $g_i^*$ are uniformly bounded away from 0 and (b) if a sequence $X_n$ converges in L1 to 0 then $n^{-1}\sum_{i=1}^{n} X_i$ also converges in L1 to 0 (by convexity of the L1-norm). Let us put this in a lemma:

Lemma 1. For all $(a,v)\in\mathcal{A}\times\mathcal{V}$, $n^{-1}\sum_{i=1}^{n} g_i^*(a|v)$ and $n^{-1}\sum_{i=1}^{n}\bigl(g_i^*(a|v)\bigr)^{-1}$ converge in probability and in L1 to $G^*(\theta_0)(a|v)$ and $G^*(\theta_0)(a|v)^{-1}$.
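For completeness, the L1 part of the statement (point (b) above) is nothing more than the triangle inequality combined with the convergence of Cesàro means:

$$E\left|\frac{1}{n}\sum_{i=1}^{n}\bigl(g_i^*(a|v) - G^*(\theta_0)(a|v)\bigr)\right| \le \frac{1}{n}\sum_{i=1}^{n} E\bigl|g_i^*(a|v) - G^*(\theta_0)(a|v)\bigr| \longrightarrow 0,$$

since each summand converges to 0 (by the L1 convergence recalled above) and Cesàro averages of a sequence converging to 0 converge to 0 themselves; the statement for the inverses follows in the same way, because the $g_i^*(a|v)$ are uniformly bounded away from 0.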

A.2. Proof of Proposition 5

The proof of Proposition 5 still relies on the facts that (a) $\theta_n^\varepsilon$ solves the martingale estimating equation $\sum_{i=1}^{n} D(\theta^\varepsilon)(O_i,Z_i) = 0$ and (b) $P_{Q_0,g_i^*} D(\theta_0^\varepsilon) = 0$ for all i ≤ n.

Consider the following equality:

$$\mathrm{LHT} \equiv \frac{1}{n}\sum_{i=1}^{n}\bigl[D(\theta_n^\varepsilon)(O_i,Z_i) - D(\theta_0^\varepsilon)(O_i,Z_i)\bigr] = -\frac{1}{n}\sum_{i=1}^{n}\bigl[D(\theta_0^\varepsilon)(O_i,Z_i) - P_{Q_0,g_i^*} D(\theta_0^\varepsilon)\bigr]. \qquad (33)$$

A Taylor expansion of LHT at θ0ε first yields that

$$\mathrm{LHT} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta^\varepsilon} D(\theta^\varepsilon)(O_i,Z_i)\Big|_{\theta^\varepsilon=\theta_0^\varepsilon}\,(\theta_n^\varepsilon - \theta_0^\varepsilon) + o_P(\lVert\theta_n^\varepsilon - \theta_0^\varepsilon\rVert) = A_n(\theta_n^\varepsilon - \theta_0^\varepsilon) + B_n(\theta_n^\varepsilon - \theta_0^\varepsilon) + o_P(\lVert\theta_n^\varepsilon - \theta_0^\varepsilon\rVert),$$

where $A_n = E\bigl[n^{-1}\sum_{i=1}^{n}\frac{\partial}{\partial\theta^\varepsilon} D(\theta^\varepsilon)(O_i,Z_i)\big|_{\theta^\varepsilon=\theta_0^\varepsilon}\bigr]$ and $B_n = n^{-1}\sum_{i=1}^{n}\frac{\partial}{\partial\theta^\varepsilon} D(\theta^\varepsilon)(O_i,Z_i)\big|_{\theta^\varepsilon=\theta_0^\varepsilon} - A_n$.

The matrix $A_n$ satisfies $A_n = n^{-1} E\bigl[\sum_{i=1}^{n} P_{Q_0,g_i^*}\frac{\partial}{\partial\theta^\varepsilon} D(\theta^\varepsilon)\big|_{\theta^\varepsilon=\theta_0^\varepsilon}\bigr]$. Now, for each i ≤ n,

$$P_{Q_0,g_i^*}\,\frac{\partial}{\partial\theta^\varepsilon} D(\theta^\varepsilon)\Big|_{\theta^\varepsilon=\theta_0^\varepsilon} = -S_0, \qquad (34)$$

where the latter matrix, defined in (21) and independent of i, is deterministic (this is due to the weighting). Thus, $A_n = -S_0$. Moreover, $A_n = -S_0$ is invertible because its determinant equals the product of $P_{Q_0,G^*(\theta_0)}\frac{\partial^2}{\partial\varepsilon^2}\ell_{\theta_0,\varepsilon}\big|_{\varepsilon=\varepsilon_0} < 0$ and

$$\det E_{Q_0,G^*(\theta_0)}\left[\ddot{\ell}_{\theta_0,0}(O)\,\frac{g^r(A|V)}{G^*(\theta_0)(A|V)}\right] = \det P_{Q_0,g^r}\,\ddot{\ell}_{\theta_0,0},$$

which is negative by A2. Furthermore, (34) and the fact that $A_n = -S_0$ is deterministic also imply that $B_n$ can be rewritten as

$$B_n = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\partial}{\partial\theta^\varepsilon} D(\theta^\varepsilon)(O_i,Z_i)\Big|_{\theta^\varepsilon=\theta_0^\varepsilon} - P_{Q_0,g_i^*}\,\frac{\partial}{\partial\theta^\varepsilon} D(\theta^\varepsilon)\Big|_{\theta^\varepsilon=\theta_0^\varepsilon}\right]$$

and applying (componentwise) a standard law of large numbers for martingales (see for instance Theorem 2.4.2 in (Sen and Singer, 1993)) yields that $B_n$ converges to 0 almost surely (note that $\sup_{\theta^\varepsilon\in\Theta\times\mathcal{E}}\lVert\frac{\partial}{\partial\theta^\varepsilon} D(\theta^\varepsilon)\rVert_\infty < \infty$). We emphasize that this proves the convergence in probability of $S_n$ to $S_0$, as stated in the proposition. Consequently, $\mathrm{LHT} = -S_0(\theta_n^\varepsilon - \theta_0^\varepsilon) + o_P(\lVert\theta_n^\varepsilon - \theta_0^\varepsilon\rVert)$, and (33) entails

$$\sqrt{n}\,(\theta_n^\varepsilon - \theta_0^\varepsilon) = S_0^{-1} M_n + o_P\bigl(\sqrt{n}\,\lVert\theta_n^\varepsilon - \theta_0^\varepsilon\rVert\bigr) \qquad (35)$$

with $M_n = n^{-1/2}\sum_{i=1}^{n}\bigl[D(\theta_0^\varepsilon)(O_i,Z_i) - P_{Q_0,g_i^*} D(\theta_0^\varepsilon)\bigr]$. It mainly remains to show that $M_n$ satisfies a central limit theorem.

For this purpose, we apply Theorem 10 in (Chambaz and van der Laan, 2009). Define $C_n = \sum_{i=1}^{n} P_{Q_0,g_i^*}\bigl[D(\theta_0^\varepsilon) D(\theta_0^\varepsilon)^{\top}\bigr]$. It holds that

$$\frac{1}{n}\,C_n = \frac{1}{n}\sum_{i=1}^{n} E_{Q_0,G^*(\theta_0)}\left[\frac{\tilde{D}(\theta_0^\varepsilon)\tilde{D}(\theta_0^\varepsilon)^{\top}(O)}{g_i^*(A|V)\,G^*(\theta_0)(A|V)}\,\Bigg|\, O_n(i-1)\right] = \sum_{(a,v)\in\mathcal{A}\times\mathcal{V}} E_{Q_0,G^*(\theta_0)}\left[\frac{\tilde{D}(\theta_0^\varepsilon)\tilde{D}(\theta_0^\varepsilon)^{\top}(O)}{G^*(\theta_0)(A|V)}\,\mathbf{1}\{(A,V)=(a,v)\}\right]\frac{1}{n}\sum_{i=1}^{n}\frac{1}{g_i^*(a|v)}. \qquad (36)$$

Now, by Lemma 1, (36) implies that $n^{-1} E C_n = \Sigma_0(1 + o(1))$, where $\Sigma_0$ (defined in (22)) is positive definite when A4 is met. Indeed, assume on the contrary that there exists a vector u ≠ 0 such that $u^{\top}\Sigma_0\, u = 0$. Then necessarily $\tilde{D}(\theta_0^\varepsilon)^{\top} u = 0$, $P_{Q_0,G^*(\theta_0)}$-almost surely, which contradicts A4. Using (36) again, we also see that

$$\frac{1}{n}\,(C_n - EC_n) = \sum_{(a,v)\in\mathcal{A}\times\mathcal{V}} E_{Q_0,G^*(\theta_0)}\left[\frac{\tilde{D}(\theta_0^\varepsilon)\tilde{D}(\theta_0^\varepsilon)^{\top}(O)}{G^*(\theta_0)(A|V)}\,\mathbf{1}\{(A,V)=(a,v)\}\right] \times \left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{g_i^*(a|v)} - E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{1}{g_i^*(a|v)}\right]\right)$$

from which we deduce that $n^{-1}(C_n - EC_n)$ converges to 0 in probability by Lemma 1. Consequently, $M_n$ converges in distribution to the centered Gaussian law with covariance matrix $\Sigma_0$. In addition, $n^{-1}\sum_{i=1}^{n} D(\theta_0^\varepsilon) D(\theta_0^\varepsilon)^{\top}(O_i,Z_i)$ converges in probability to $\Sigma_0$. Since (a) $\theta^\varepsilon\mapsto D(\theta^\varepsilon) D(\theta^\varepsilon)^{\top}$ is differentiable at $\theta_0^\varepsilon$, (b) its derivative is bounded, and (c) we already know that $\theta_n^\varepsilon - \theta_0^\varepsilon = o_P(1)$, yet another Taylor expansion allows us to derive that $\Sigma_n$ (defined in (23)) equals $\Sigma_0(1 + o_P(1))$.

Let us go back to (35), knowing now that $M_n$ satisfies a central limit theorem. We derive from it that $\sqrt{n}\,\lVert\theta_n^\varepsilon - \theta_0^\varepsilon\rVert = O_P(1) + o_P(\sqrt{n}\,\lVert\theta_n^\varepsilon - \theta_0^\varepsilon\rVert)$, hence $\sqrt{n}\,\lVert\theta_n^\varepsilon - \theta_0^\varepsilon\rVert = O_P(1)$. Thus, (35) does entail (20), since $P_{Q_0,g_i^*} D(\theta_0^\varepsilon) = 0$ for all i ≤ n. The stated central limit theorem on $\sqrt{n}\,(\theta_n^\varepsilon - \theta_0^\varepsilon)$ readily follows.

A.3. Proof of Propositions 6 and 7

Proof of Proposition 6.

The proof of Proposition 6 is twofold and relies on the decomposition

$$\psi_n^* - \psi_0 = \bigl(\psi_n^* - \psi(\tilde{Q}_n)\bigr) + \bigl(\psi(\tilde{Q}_n) - \psi_0\bigr), \qquad (37)$$

where

$$\tilde{Q}_n = \bigl(Q_W(P_0),\, Q_{Y|A,W}(\theta_n^\varepsilon)\bigr). \qquad (38)$$

First, a continuity argument and the convergence in probability of the stacked estimator $\theta_n^\varepsilon$ to $\theta_0^\varepsilon$ entail the convergence in probability of $\psi(\tilde{Q}_n)$ to $\psi(Q_{\theta_0^\varepsilon}) = \psi_0$ (the equality holds by (i) in Proposition 4). The conclusion follows because $\psi_n^* - \psi(\tilde{Q}_n)$ converges almost surely to zero. Indeed,

$$\psi_n^* - \psi(\tilde{Q}_n) = (P_{W,n} - P_{W,0})\, q_{\theta_n^\varepsilon}, \qquad (39)$$

where, for all $\theta^\varepsilon\in\Theta\times\mathcal{E}$, $q_{\theta^\varepsilon} = \bar{Q}(\theta^\varepsilon)(1,\cdot) - \bar{Q}(\theta^\varepsilon)(0,\cdot)$ and, with a slight abuse of notation, $P_{W,n}$ (respectively, $P_{W,0}$) denotes the empirical (respectively, true) marginal distribution of W. Thus,

$$\bigl|\psi_n^* - \psi(\tilde{Q}_n)\bigr| \le \sup_{\theta^\varepsilon\in\Theta\times\mathcal{E}}\bigl|(P_{W,n} - P_{W,0})\, q_{\theta^\varepsilon}\bigr|, \qquad (40)$$

where (a) $\sup_{\theta^\varepsilon\in\Theta\times\mathcal{E}}\lVert q_{\theta^\varepsilon}\rVert_\infty < \infty$ and (b) the standard entropy of $\mathcal{F} = \{q_{\theta^\varepsilon} : \theta^\varepsilon\in\Theta\times\mathcal{E}\}$ for the supremum norm satisfies $H(\mathcal{F},\lVert\cdot\rVert_\infty,\varepsilon) = O(\log 1/\varepsilon)$ (see (van der Vaart, 1998), Example 19.7). Therefore, the class $\mathcal{F}$ is $P_{W,0}$-Glivenko-Cantelli and the right-hand side of (40) converges to 0 $P_{W,0}$-almost surely by the Glivenko-Cantelli theorem (i.i.d framework; see for instance Theorem 19.4 in (van der Vaart, 1998)).

Proof of Proposition 7.

The proof of (26) relies again on (37). It is easy to derive the asymptotic linear expansion of the first term. Indeed, we can rewrite (39) as

$$\sqrt{n}\bigl(\psi_n^* - \psi(\tilde{Q}_n)\bigr) = \sqrt{n}\,(P_{W,n} - P_{W,0})\, q_{\theta_0^\varepsilon} + \sqrt{n}\,(P_{W,n} - P_{W,0})\bigl(q_{\theta_n^\varepsilon} - q_{\theta_0^\varepsilon}\bigr) = \sqrt{n}\, P_{W,n}\bigl(q_{\theta_0^\varepsilon} - P_{W,0}\, q_{\theta_0^\varepsilon}\bigr) + o_P(1) = \sqrt{n}\, P_{W,n}\, D_1^*(Q_{\theta_0^\varepsilon}) + o_P(1), \qquad (41)$$

which holds subject to checking that $\sqrt{n}\,(P_{W,n} - P_{W,0})(q_{\theta_n^\varepsilon} - q_{\theta_0^\varepsilon}) = o_P(1)$. Since the class $\mathcal{F}$ introduced previously for the proof of Proposition 6 is also $P_{W,0}$-Donsker, this is a consequence of Lemma 19.24 in (van der Vaart, 1998), provided that $X_n = P_{W,0}\bigl(q_{\theta_n^\varepsilon} - q_{\theta_0^\varepsilon}\bigr)^2$ converges in probability to 0. Now, $\theta_n^\varepsilon$ converges to $\theta_0^\varepsilon$ in probability, and the function $(w,\theta^\varepsilon)\mapsto q_{\theta^\varepsilon}(w)$ continuously maps $\mathcal{W}\times\Theta\times\mathcal{E}$ into $\mathbb{R}$ (thus it is uniformly bounded). Consequently, $X_n$ (which is obviously non-negative) is upper bounded, so that it is equivalent to show that $X_n$ converges to 0 in L1. By the Fubini theorem,

$$E X_n = \int E\bigl[\bigl(q_{\theta_n^\varepsilon}(w) - q_{\theta_0^\varepsilon}(w)\bigr)^2\bigr]\, dP_{W,0}(w).$$

By the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)), $\bigl(q_{\theta_n^\varepsilon}(w) - q_{\theta_0^\varepsilon}(w)\bigr)^2$ converges to 0 in probability (hence in L1, the variables being bounded) for all $w\in\mathcal{W}$. Therefore, the integrand of the right-hand side integral in the previous display converges pointwise to 0. Since it is also bounded, we conclude by the dominated convergence theorem that $EX_n$ converges to 0. This completes the study of the first term of the decomposition (37).

Regarding the second term of the latter decomposition, we derive its asymptotic linear expansion by the delta-method (see Theorem 3.1 in (van der Vaart, 1998)) from that of $\sqrt{n}\,(\theta_n^\varepsilon - \theta_0^\varepsilon)$, see (20). Specifically,

$$\sqrt{n}\bigl(\psi(\tilde{Q}_n) - \psi_0\bigr) = \sqrt{n}\bigl(\phi(\theta_n^\varepsilon) - \phi(\theta_0^\varepsilon)\bigr) = \phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n} D(\theta_0^\varepsilon)(O_i,Z_i) + o_P(1). \qquad (42)$$

Combining (41) and (42) yields the wished asymptotic linear expansion (26) and the closed-form expression of the related influence function IC as given in (27).

A central limit theorem for real-valued martingales (see Theorem 9 in (Chambaz and van der Laan, 2009)) applied to (26) yields the stated convergence and validates the use of $s_n^2$ as an estimator of the asymptotic variance. To see this, note that $P_{Q_0,g_i^*}\mathrm{IC} = 0$ for all i ≤ n and define $c_n = \sum_{i=1}^{n} P_{Q_0,g_i^*}\mathrm{IC}^2$. Now we emphasize that, for every i ≤ n, firstly

$$P_{Q_0,g_i^*}\, D_1^*(Q_{\theta_0^\varepsilon})^2 = P_{Q_0,G^*(\theta_0)}\, D_1^*(Q_{\theta_0^\varepsilon})^2,$$

secondly

$$2\, P_{Q_0,g_i^*}\bigl[D_1^*(Q_{\theta_0^\varepsilon})\,\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1} D(\theta_0^\varepsilon)\bigr] = 2\,\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\, P_{Q_0,g_i^*}\bigl[D_1^*(Q_{\theta_0^\varepsilon})\, D(\theta_0^\varepsilon)\bigr] = 2\,\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\, P_{Q_0,G^*(\theta_0)}\left[\frac{D_1^*(Q_{\theta_0^\varepsilon})\,\tilde{D}(\theta_0^\varepsilon)}{G^*(\theta_0)}\right]$$

and thirdly

$$P_{Q_0,g_i^*}\bigl(\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1} D(\theta_0^\varepsilon)\bigr)^2 = \phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\,\bigl(P_{Q_0,g_i^*}\bigl[D(\theta_0^\varepsilon) D(\theta_0^\varepsilon)^{\top}\bigr]\bigr)\,\bigl(\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\bigr)^{\top}.$$

By combining the last three equalities, we thus obtain that

$$\frac{1}{n}\,c_n = P_{Q_0,G^*(\theta_0)}\, D_1^*(Q_{\theta_0^\varepsilon})^2 + 2\,\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\, P_{Q_0,G^*(\theta_0)}\left[\frac{D_1^*(Q_{\theta_0^\varepsilon})\,\tilde{D}(\theta_0^\varepsilon)}{G^*(\theta_0)}\right] + \phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\,\frac{1}{n}\,C_n\,\bigl(\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\bigr)^{\top}, \qquad (43)$$

where $C_n$ was introduced in the proof of Proposition 5. Since we already showed in the latter proof that $n^{-1} E C_n = \Sigma_0(1 + o_P(1))$, (43) notably yields the convergence of $n^{-1} E c_n$ toward

$$P_{Q_0,G^*(\theta_0)}\, D_1^*(Q_{\theta_0^\varepsilon})^2 + 2\,\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\, P_{Q_0,G^*(\theta_0)}\left[\frac{D_1^*(Q_{\theta_0^\varepsilon})\,\tilde{D}(\theta_0^\varepsilon)}{G^*(\theta_0)}\right] + \phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\,\Sigma_0\,\bigl(\phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\bigr)^{\top} = P_{Q_0,G^*(\theta_0)}\left(D_1^*(Q_{\theta_0^\varepsilon}) + \phi_{\theta_0^\varepsilon}'^{\top} S_0^{-1}\,\frac{\tilde{D}(\theta_0^\varepsilon)}{G^*(\theta_0)}\right)^2 \equiv s^2,$$

which is positive when A4 is met. In addition, (43) also entails that $n^{-1}(c_n - Ec_n)$ converges to 0 in probability, therefore implying the convergence in distribution of $\sqrt{n}\,(\psi_n^* - \psi_0)$ to the centered Gaussian distribution with variance $s^2$, as well as the convergence in probability of $n^{-1}\sum_{i=1}^{n}\mathrm{IC}(O_i,Z_i)^2$ to $s^2$. Since (a) $\theta^\varepsilon\mapsto D_1^*(Q_{\theta^\varepsilon})$ and $\theta^\varepsilon\mapsto D(\theta^\varepsilon)$ are differentiable at $\theta_0^\varepsilon$, (b) both with uniformly bounded derivatives, (c) we already know that $\theta_n^\varepsilon - \theta_0^\varepsilon = o_P(1)$, and (d) $\phi_{\theta_0^\varepsilon}'$ and $S_0^{-1}$ are consistently estimated by $\phi_n'$ and $S_n^{-1}$, yet another Taylor expansion (and the continuous mapping theorem, see for instance Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) finally yields the stated convergence in probability of $s_n^2$ to $s^2$, therefore completing the proof.

A.4. Proof of Proposition 8

The fact that $\Psi(Q_{f/\sqrt{n}}) > \Psi(Q_0)$ for n large enough is a consequence of the expansion $\Psi(Q_{f/\sqrt{n}}) = \Psi(Q_0) + P_{Q_0,G^*(\theta_0)}\bigl[D^*(P_{Q_0,G^*(\theta_0)})\, f\bigr]/\sqrt{n} + o(1/\sqrt{n})$, which holds because Ψ is pathwise differentiable at $P_{Q_0,g}$ (for any g) relative to the maximal tangent space, with efficient influence curve $D^*(Q_0,g)$. The rest of Proposition 8 is a corollary of the following lemma.

Lemma 2. Let $\Lambda_n$ be the log-likelihood ratio of the $(Q_{f/\sqrt{n}}, \mathbf{g}_n)$ experiment relative to the $(Q_0, \mathbf{g}_n)$ experiment. Under $(Q_0, \mathbf{g}_n)$, the vector $\bigl(\sqrt{n_1}\,(\psi_{n_1}^* - \Psi(Q_0)), \ldots, \sqrt{n_K}\,(\psi_{n_K}^* - \Psi(Q_0)), \Lambda_n\bigr)$ converges in law to the Gaussian distribution with mean $(0,\ldots,0,-\frac{1}{2} P_{Q_0,G^*(\theta_0)} f^2)$ and covariance matrix

$$\begin{pmatrix} P_{Q_0,G^*(\theta_0)}\mathrm{IC}^2 \times (p_{k\wedge l})_{k,l\le K} & P_{Q_0,G^*(\theta_0)}[\mathrm{IC}\, f] \times (p_1,\ldots,p_K)^{\top} \\ P_{Q_0,G^*(\theta_0)}[\mathrm{IC}\, f] \times (p_1,\ldots,p_K) & P_{Q_0,G^*(\theta_0)} f^2 \end{pmatrix}.$$

In particular, the $(Q_{f/\sqrt{n}}, \mathbf{g}_n)$ and $(Q_0, \mathbf{g}_n)$ experiments are mutually contiguous.

The limiting distribution of (T1, …, TK) under $(Q_0, \mathbf{g}_n)$ is easily derived from Lemma 2, and Le Cam's third lemma (see Example 6.7 in (van der Vaart, 1998)) solves the problem of obtaining the limiting distribution of (T1, …, TK) under $(Q_{f/\sqrt{n}}, \mathbf{g}_n)$ from that under $(Q_0, \mathbf{g}_n)$: only the asymptotic means differ. We do not repeat the proof here, as it is exactly the same (up to some minor variations in notation) as the proof of Lemma 5 in (Chambaz and van der Laan, 2011).

Proof of Lemma 2.

Since f is bounded and $E_{P_{Q_0,G^*(\theta_0)}}[f(O)\,|\,A,W] = 0$ (here, one could equivalently use the notation $E_{Q_0}$), we can assume without loss of generality that the fluctuation $\{Q_0(h) : h\in H\}$ is characterized by

$$\frac{dQ_0(h)}{dQ_0} = 1 + h\, f(O)$$

for all $h\in H = (-1/\lVert f\rVert_\infty,\, 1/\lVert f\rVert_\infty)$. Let us denote by $q_n$ and $q_0$ the conditional densities of Y given (A, W) under $Q_{f/\sqrt{n}}$ and $Q_0$.

The log-likelihood ratio writes as $\Lambda_n = \sum_{i=1}^{n}\log (q_n/q_0)(O_i)$, by conditional independence of $O_1,\ldots,O_n$ given $((A_1,W_1),\ldots,(A_n,W_n))$ and because the marginal distribution of W is the same under $Q_{f/\sqrt{n}}$ as under $Q_0$ (see (4), and note that the $g_i(A_i|W_i)$ factors cancel out). The fluctuation is chosen in such a way that $q_n/q_0 = 1 + f/\sqrt{n}$, hence $\Lambda_n = n^{-1/2}\sum_{i=1}^{n} f(O_i) - \frac{1}{2}\, n^{-1}\sum_{i=1}^{n} f^2(O_i) + n^{-1}\sum_{i=1}^{n} f^2(O_i)\, R\bigl(f(O_i)/\sqrt{n}\bigr)$, where the function R is characterized by $\log(1+x) = x - \frac{1}{2}x^2 + x^2 R(x)$ (all x > −1). In particular, R is increasing and R(x) → R(0) = 0 as x → 0. Since f is bounded, the expansion is valid for n large enough. Moreover, the last term satisfies

$$\left|\frac{1}{n}\sum_{i=1}^{n} f^2(O_i)\, R\bigl(f(O_i)/\sqrt{n}\bigr)\right| \le \frac{1}{n}\sum_{i=1}^{n} f^2(O_i) \times R\bigl(\lVert f\rVert_\infty/\sqrt{n}\bigr),$$

so that if $n^{-1}\sum_{i=1}^{n} f^2(O_i) = O_P(1)$ then it is $o_P(1)$. Furthermore, the Kolmogorov law of large numbers for martingales implies that $n^{-1}\sum_{i=1}^{n} f^2(O_i) = n^{-1}\sum_{i=1}^{n} P_{Q_0,g_i^*} f^2 + o_P(1) = P_{Q_0,\bar{g}_n^*} f^2 + o_P(1)$, where $\bar{g}_n^* = n^{-1}\sum_{i=1}^{n} g_i^*$. Denote $\mathrm{Var}(f)(A,W) = E_{Q_0}[f^2(O)\,|\,A,W]$. We have that

$$P_{Q_0,\bar{g}_n^*} f^2 = E_{Q_0,\bar{g}_n^*}\bigl[f^2(O_n)\,\big|\,O_n(n-1)\bigr] = E_{Q_0,\bar{g}_n^*}\bigl[\mathrm{Var}(f)(A,W)\,\big|\,O_n(n-1)\bigr] = E_{Q_0}\bigl[\bar{g}_n^*(1|V)\,\mathrm{Var}(f)(1,W) + \bar{g}_n^*(0|V)\,\mathrm{Var}(f)(0,W)\,\big|\,O_n(n-1)\bigr]$$

for V the relevant subvector of W upon which each $g_i^*$ depends. By Lemma 1, $\bar{g}_n^*$ converges in probability to $G^*(\theta_0)$. Consequently, the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) yields that $P_{Q_0,\bar{g}_n^*} f^2 = P_{Q_0,G^*(\theta_0)} f^2 + o_P(1) = O_P(1)$. In summary, we obtain the following asymptotic linear expansion:

$$\Lambda_n = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} f(O_i) - \frac{1}{2}\, P_{Q_0,G^*(\theta_0)} f^2 + o_P(1). \qquad (44)$$

Introduce now $\bar{Z}_i = (\mathbf{1}\{i\le n_1\},\ldots,\mathbf{1}\{i\le n_K\})$ and the bounded (and measurable) function F such that $F(O_i,Z_i,\bar{Z}_i) = (\bar{Z}_i\,\mathrm{IC}(O_i,Z_i),\, f(O_i))$. We show that $M_n = n^{-1}\sum_{i=1}^{n}\bigl[F(O_i,Z_i,\bar{Z}_i) - P_{Q_0,g_i^*} F\bigr] = n^{-1}\sum_{i=1}^{n} F(O_i,Z_i,\bar{Z}_i)$ (since $P_{Q_0,g_i^*} F = P_{Q_0,g_i^*} f = 0$ for all i ≤ n) satisfies a central limit theorem. The proof is very similar to the corresponding part of the proof of Lemma 5 in (Chambaz and van der Laan, 2011), hence we only give an outline, focusing on the differences. First, it holds for every 1 ≤ k, l ≤ K that $n^{-1}\sum_{i=1}^{n_k} P_{Q_0,g_i^*}[\mathrm{IC}\, f] = p_k\, P_{Q_0,G^*(\theta_0)}[\mathrm{IC}\, f] + o_P(1)$, $n^{-1}\sum_{i=1}^{n_k\wedge n_l} P_{Q_0,g_i^*}\mathrm{IC}^2 = p_{k\wedge l}\, P_{Q_0,G^*(\theta_0)}\mathrm{IC}^2 + o_P(1)$, and $n^{-1}\sum_{i=1}^{n} P_{Q_0,g_i^*} f^2 = P_{Q_0,G^*(\theta_0)} f^2 + o_P(1)$. This entails that the matrix $E\bigl[n^{-1}\sum_{i=1}^{n} P_{Q_0,g_i^*} F F^{\top}\bigr]$ converges to

$$\begin{pmatrix} P_{Q_0,G^*(\theta_0)}\mathrm{IC}^2 \times (p_{k\wedge l})_{k,l\le K} & P_{Q_0,G^*(\theta_0)}[\mathrm{IC}\, f] \times (p_1,\ldots,p_K)^{\top} \\ P_{Q_0,G^*(\theta_0)}[\mathrm{IC}\, f] \times (p_1,\ldots,p_K) & P_{Q_0,G^*(\theta_0)} f^2 \end{pmatrix},$$

which is positive definite if and only if its determinant is positive. The latter determinant equals a positive constant times $P_{Q_0,G^*(\theta_0)} f^2 - \bigl(P_{Q_0,G^*(\theta_0)}[\mathrm{IC}\, f]\bigr)^2 / P_{Q_0,G^*(\theta_0)}\mathrm{IC}^2$, which is positive too by the Cauchy-Schwarz inequality (f is not proportional to $D^*(P_{Q_0,G^*(\theta_0)})$). Therefore, Theorem 8 in (Chambaz and van der Laan, 2011) applies, and $\sqrt{n}\, M_n$ converges in law to the centered Gaussian distribution with the covariance matrix given in the above display. The conclusion readily follows.

Figure 2:

Illustrating the TMLE procedure under the (Q0, gn)-adaptive sampling scheme. These three plots illustrate how the TMLE procedure behaves (on a simulated trajectory) as the sample size (on the x-axis, logarithmic scale; the vertical grey lines indicate the sample sizes ni, i = 1,…,7) increases. Top plot: we represent the sequence $\psi_n^*(O^1_{n_7}(n))$ at intermediate sample sizes n. The horizontal grey line indicates the true value of the risk difference $\psi_0$. Middle plot: we represent the three sequences $g_n(1|1) = G(\theta_n(O^1_{n_7}(n)))(1|1)$ (bottom curve), $g_n(1|2) = G(\theta_n(O^1_{n_7}(n)))(1|2)$ (middle curve) and $g_n(1|3) = G(\theta_n(O^1_{n_7}(n)))(1|3)$ (top curve) at intermediate sample sizes n. The three horizontal grey lines indicate the optimal allocation probabilities g*(Q0)(1|1) (bottom line), g*(Q0)(1|2) (middle line) and g*(Q0)(1|3) (top line). Bottom plot: we represent the sequence $s_n^2$ of estimated asymptotic variances of $\sqrt{n}(\psi_n^* - \psi_0)$ at intermediate sample sizes n. The horizontal grey line indicates the value of the optimal variance v*(Q0). See also Table 4.

Footnotes

1 The reasoning is not circular by virtue of the chronological ordering, as summarized in Fig. 1 for instance.

References

  1. Atkinson AC and Biswas A. Adaptive biased-coin designs for skewing the allocation proportion in clinical trials with normal responses. Stat. Med., 24(16):2477–2492, 2005.
  2. Bandyopadhyay U and Biswas A. Adaptive designs for normal responses with prognostic factors. Biometrika, 88(2):409–419, 2001.
  3. Chambaz A and van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate. Technical report, Division of Biostatistics, University of California, Berkeley, 2009.
  4. Chambaz A and van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study. Int. J. Biostat., 7(1), 2011.
  5. Emerson SS. Issues in the use of adaptive clinical trial designs. Stat. Med., 25:3270–3296, 2006.
  6. The Food and Drug Administration. Critical path opportunities list. Technical report, U.S. Department of Health and Human Services, Food and Drug Administration, 2006.
  7. Golub HL. The need for more efficient trial designs. Stat. Med., 25:3231–3235, 2006.
  8. Hu F and Rosenberger WF. The Theory of Response-Adaptive Randomization in Clinical Trials. Wiley, New York, 2006.
  9. Jennison C and Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC, Boca Raton, FL, 2000.
  10. Proschan MA, Lan GKK, and Wittes JT. Statistical Monitoring of Clinical Trials: A Unified Approach. Statistics for Biology and Health. Springer, New York, 2006.
  11. Rosenberger WF, Vidyashankar AN, and Agarwal DK. Covariate-adjusted response-adaptive designs for binary response. J. Biopharm. Statist., 11:227–236, 2001.
  12. Rosenberger WF. New directions in adaptive designs. Statist. Sci., 11:137–149, 1996.
  13. Sen PK and Singer JM. Large Sample Methods in Statistics: An Introduction with Applications. Chapman & Hall, New York, 1993.
  14. Shao J, Yu X, and Zhong B. A theory for testing hypotheses under covariate-adaptive randomization. Biometrika, 2010.
  15. van der Laan MJ. The construction and analysis of adaptive group sequential designs. Technical report 232, Division of Biostatistics, University of California, Berkeley, March 2008.
  16. van der Laan MJ and Rubin D. Targeted maximum likelihood learning. Int. J. Biostat., 2(1), 2006.
  17. van der Vaart AW. Asymptotic Statistics. Cambridge University Press, 1998.
  18. van der Vaart AW and Wellner JA. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
  19. Zhang L-X, Hu F, Cheung SH, and Chan WS. Asymptotic properties of covariate-adjusted response-adaptive designs. Ann. Statist., 35(3):1166–1182, 2007.
  20. Zhu H and Hu F. Sequential monitoring of response-adaptive randomized clinical trials. Ann. Statist., 38(4):2218–2241, 2010.
