Abstract
This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials analyzed through the prism of the semiparametric methodology of targeted maximum likelihood estimation (TMLE). We show how to build, as the data accrue group-sequentially, a sampling design which targets a user-supplied optimal design. We also show how to carry out a sound TMLE statistical inference based on such an adaptive sampling scheme (thereby extending results previously known only in the i.i.d. setting), and how group-sequential testing applies on top of it. The procedure is robust (i.e., consistent even if the working model is misspecified). A simulation study confirms the theoretical results, and validates the conjecture that the procedure may also be efficient.
Keywords: Adaptive design, asymptotic normality, canonical distribution, clinical trial, contiguity, group-sequential testing, robustness, targeted maximum likelihood methodology
1. Introduction
This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials (RCTs) analyzed through the prism of the targeted maximum likelihood estimation (TMLE) methodology. The considered observed data structure writes as O = (W, A, Y), where W is a vector of baseline covariates, A is the treatment assignment and Y the primary outcome of interest. We assume that A is binary and that Y is one-dimensional. Typical parameters of scientific interest are Ψ+ = E{E(Y|A = 1, W) − E(Y|A = 0, W)} (additive scale, which we consider hereafter) or Ψ× = log E{E(Y|A = 1, W)} − log E{E(Y|A = 0, W)} (multiplicative scale). Such parameters can be interpreted causally whenever one is willing to assume the existence of a full data structure X = (W, Y(0), Y(1)) containing the vector of baseline covariates and the two counterfactual outcomes under the two possible treatments, and such that Y = Y(A) and A is conditionally independent of X given W. If so, Ψ+ = EY(1) − EY(0) and Ψ× = log EY(1) − log EY(0). Let us now explain what we mean by adaptive group sequential covariate-adjusted RCT.
Adaptive group sequential covariate-adjusted RCT.
By adaptive covariate-adjusted design, we mean in the setting of this article a clinical trial design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial. Moreover, the patient's baseline covariates may be taken into account for the random treatment assignment. This definition is slightly adapted from (Golub, 2006). In particular, we assume with a view to the definition of pre-specified sampling plans given in (Emerson, 2006) that, prior to collection of the data, the trial protocol specifies the parameter of scientific interest, the inferential method and the confidence level to be used when constructing a confidence interval for the latter parameter.
Furthermore, we assume that the investigator specifies beforehand in the trial protocol a criterion of special interest which yields a notion of optimal randomization scheme that we therefore wish to target. For instance, the criterion could translate the necessity to minimize the number of patients assigned to their corresponding inferior treatment arm, subject to level and power constraints. Or the criterion could translate the necessity that a result be available as quickly as possible, subject to level and power constraints. The sole restriction on the criterion is that it must yield an optimal randomization scheme which can be approximated from the data accrued so far. The two examples above comply with this restriction.
We focus in this article on targeted maximum likelihood estimation, as introduced by van der Laan and Rubin (2006) in the independent identically distributed (i.i.d) setting. The extension to the setting of adaptive RCTs was first considered in (van der Laan, 2008), upon which this article relies. In addition, we choose to consider specifically the second criterion cited above. Consequently, the optimal randomization scheme is the so-called Neyman allocation, which minimizes the asymptotic variance of the TMLE of the parameter of interest. We emphasize that there is nothing special about targeting the Neyman allocation, the whole methodology applying equally well to a large class of optimal randomization schemes derived from a variety of valid criteria.
By adaptive group sequential design, we refer to the possibility to adjust the randomization scheme only by blocks of c patients, where c ≥ 1 is a pre-specified integer (the case that c = 1 corresponds to a fully sequential adaptive design). The expression also refers to the fact that group sequential testing methods can be equally well applied on top of adaptive designs, an extension that we also consider. Although all our results (and their proofs) still hold for any c ≥ 1, we consider the case that c = 1 in the theoretical part of the article for simplicity's sake; the case that c > 1 is considered in the simulation study.
Short bibliography.
The literature on adaptive designs is vast and our review is not comprehensive. Quite misleadingly, the expression “adaptive design” has also been used in the literature for sequential testing and, in general, for designs that allow data-adaptive stopping times for the whole study (or for certain treatment arms) which achieve the wished type I and type II errors requirements when testing a null hypothesis against its alternative.
Of course, data-adaptive randomization schemes have a long history that goes back to the 1930s, and we refer to Section 1.2 in (Hu and Rosenberger, 2006), Section 17.4 in (Jennison and Turnbull, 2000) and (Rosenberger, 1996) for a comprehensive historical perspective.
Many articles are devoted to the study of “response adaptive designs”, an expression implicitly suggesting that those designs only depend on past responses of previous patients and not on the corresponding covariates. We refer to (Hu and Rosenberger, 2006; Chambaz and van der Laan, 2011) for a bibliography on that topic. On the contrary, covariate-adjusted response adaptive (CARA) randomizations tackle the so-called issue of heterogeneity (i.e., the use of covariates in adaptive designs), by dynamically calculating the allocation probabilities on the basis of previous responses and current and past values of certain covariates. In this view, this article studies a new type of CARA procedure. The interest in CARA procedures is more recent, and there is a steadily growing number of articles dedicated to their study, starting with (Rosenberger et al., 2001; Bandyopadhyay and Biswas, 2001), then (Atkinson and Biswas, 2005; Zhang et al., 2007; Shao et al., 2010) among others. The latter articles are typically concerned with the convergence (almost sure and in law) of the allocation probabilities vector and of the estimator of the parameter in a correctly specified parametric model ((Shao et al., 2010) is devoted to the testing issue).
By contrast, the consistency and asymptotic normality results that we obtain in this article are robust to model misspecification. Thus, they notably contribute significantly to solving the question raised by the Food & Drug Administration (2006):
When is it valid to modify randomization based on results, for example, in a combined phase 2/3 cancer trial?
Finally, this article mainly relies on (Chambaz and van der Laan, 2009, 2011; van der Laan, 2008), the latter technical report paving the way to robust and more efficient estimation based on adaptive RCTs in a variety of other settings (including the case that the outcome Y is a possibly censored time-to-event).
Organization.
We first set the statistical framework in Section 2. The rationale of our adaptive covariate-adjusted designs is presented in Section 3, and we complete its formal definition in Section 4 where we also detail how the TMLE methodology operates. The asymptotic study is carried out in Section 5 (estimation) and in Section 6 (group sequential testing). The simulation study is developed in Section 7 (estimation) and in Section 8 (group sequential testing). The proofs are relegated to the Appendix.
2. Statistical framework
We tackle the asymptotic study of adaptive group sequential designs in the case of RCTs with covariate, binary treatment and one-dimensional primary outcome of interest. Thus, the observed data structure writes as O = (W, A, Y), where W consists of some baseline covariates, A denotes the assigned binary treatment, and Y is the primary outcome of interest. For example, Y can indicate whether the treatment has been successful or not; or Y can count the number of times an event of interest has occurred under the assigned treatment during a period of follow-up; or Y can measure a quantity of interest after a given time has elapsed. Although we will focus on the last case in this article, the methodology applies equally well to each example cited above.
Let us denote by P0 the true distribution of the observed data structure O in the population of interest. We see P0 as a specific element of the non-parametric set of all possible observed data distributions. Note that, in order to avoid some technicalities, we assume (or rather: impose) that all elements of this set are dominated by a common measure. The parameter of scientific interest is the marginal effect of treatment a = 1 relative to treatment a = 0 on the additive scale, or risk difference: ψ0 = EP0{EP0(Y|A = 1, W) − EP0(Y|A = 0, W)}. Of course, other choices such as the log-relative risk (the counterpart of the risk difference on the multiplicative scale) could be considered, and dealt with along the same lines. The risk difference can be interpreted causally for instance in the counterfactual framework.
For every observed data distribution P, let us introduce the shorthand notation QW(P)(W) = P[W], g(P)(A|W) = P[A|W], QY|A,W(P)(O) = P[Y|A, W]. We use the alternative notation P = PQ,g with Q = Q(P) ≡ (QW(P), QY|A,W(P)) and g = g(P). Equivalently, PQ,g is the data generating distribution such that Q(PQ,g) = Q and g(PQ,g) = g. In particular, we denote Q0 = Q(P0) = (QW(P0), QY|A,W(P0)). We also introduce the notation 𝒬 for the non-parametric set of all possible values of Q, and 𝒢 for the non-parametric set of all possible values of g. Setting Q̄(P)(A, W) = EP(Y|A, W) (with a slight abuse, we also sometimes write Q̄(Q) instead of Q̄(P)), we define in greater generality
Ψ(P) = EP{EP(Y|A = 1, W) − EP(Y|A = 0, W)} | (1) |
over the whole set of observed data distributions, so that ψ0 equivalently writes as ψ0 = Ψ(P0). This notation also emphasizes the fact that Ψ(P) only depends on P through the conditional mean EP(Y|A, W) and the marginal distribution QW(P), justifying the alternative notation Ψ(PQ,g) = Ψ(Q). The following proposition summarizes the most fundamental properties enjoyed by Ψ.
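To fix ideas, the substitution principle underlying (1) can be sketched numerically: given any conditional mean function in place of EP(Y|A, W) and a sample of baseline covariates, the parameter is approximated by an empirical average. The function Qbar below is a purely illustrative stand-in, not one derived from this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conditional mean Qbar(a, w) standing in for E_P(Y | A = a, W = w);
# its particular form is illustrative only.
def Qbar(a, w):
    return 0.2 + 0.5 * a + 0.1 * w + 0.2 * a * w

# Plug-in evaluation of (1): average Qbar(1, W) - Qbar(0, W) over a large
# sample of baseline covariates W.
W = rng.uniform(0.0, 1.0, size=10_000)
psi = np.mean(Qbar(1, W) - Qbar(0, W))
```

Here Qbar(1, w) − Qbar(0, w) = 0.5 + 0.2w, so psi approximates 0.5 + 0.2 EW = 0.6.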
Proposition 1 (efficient influence curve). The functional Ψ is pathwise differentiable at every observed data distribution P relative to the maximal tangent space. The efficient influence curve D★(P) of Ψ at P is characterized by
D★(P)(O) = D★1(P)(O) + D★2(P)(O) | (2) |
where, setting Q̄(P)(A, W) = EP(Y|A, W),
D★1(P)(O) = Q̄(P)(1, W) − Q̄(P)(0, W) − Ψ(P) and D★2(P)(O) = (2A − 1)/g(P)(A|W) × {Y − Q̄(P)(A, W)}.
The variance VarP D★(P)(O) is the lower bound of the asymptotic variance of any regular estimator of Ψ(P) in the i.i.d. setting. Furthermore, even if Q ≠ Q0,
EP0 D★(PQ,g)(O) = 0 implies Ψ(Q) = Ψ(Q0) | (3) |
when g = g(P0).
The implication (3) is the key to the robustness of the targeted maximum likelihood estimator introduced and studied in this article. It is another justification of our interest in the pathwise differentiability of the functional Ψ and its efficient influence curve.
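The robustness property (3) can be illustrated numerically: plugging a deliberately misspecified outcome regression into the efficient influence curve (2), while using the true (known) treatment mechanism, still yields an unbiased estimating function for ψ0. The toy data generating distribution below is chosen for the sketch only, not taken from this article.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy data generating distribution with known balanced design g0(1|W) = 1/2.
W = rng.uniform(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = 0.3 + 0.4 * A + 0.2 * W + rng.normal(scale=0.1, size=n)
psi0 = 0.4                                    # true risk difference here

# Deliberately misspecified outcome regression (ignores W, wrong coefficients).
def Qbar_mis(a, w):
    return 0.1 + 0.1 * a

# Estimator built from the efficient influence curve (2): it stays unbiased
# for psi0 despite the misspecified Qbar, because g is correctly specified.
g1 = 0.5
gA = np.where(A == 1, g1, 1 - g1)             # g0(A|W), constant here
psi_aug = np.mean(Qbar_mis(1, W) - Qbar_mis(0, W)
                  + (2 * A - 1) / gA * (Y - Qbar_mis(A, W)))
```

Despite the wrong outcome model, psi_aug concentrates around psi0 = 0.4, in line with (3).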
3. Data generating mechanism for adaptive design
The purpose of adaptive group sequential design as we consider it in this article is to adjust the randomization scheme as the data accrue. We first formally describe in Section 3.1 the data generating mechanism through the expression of the likelihood function and also in terms of causal graphs. Then we discuss the general issue of choosing an optimal design to target in Section 3.2, before specifically describing an optimal design of interest in Section 3.3 and showing how to target it in Section 3.4.
3.1. Data generating mechanism and related likelihood
In order to formally describe the data generating mechanism, we need to state a starting assumption: during the course of the clinical trial, it is possible to recruit the patients independently from a stationary population. In the counterfactual framework, this is equivalent to supposing that it is possible to sample as many independent copies of the full data structure as required. Let us denote by Oi = (Wi, Ai, Yi) the ith observed data structure. We also find it convenient to introduce On = (O1, …, On) and, for every i = 0, …, n, On(i) = (O1, …, Oi) (with the convention On(0) = ∅). We denote by 𝒪 the set where the observed data structure O takes its values.
By adjusting the randomization scheme as the data accrue, we mean that the nth treatment assignment An is drawn from gn(·|Wn), where gn(·|W) is a conditional distribution (or treatment mechanism) given the covariate W which additionally depends on past observations On−1. Since the sequence of treatment mechanisms cannot reasonably grow in complexity as the sample size increases, we will only consider data-adaptive treatment mechanisms such that gn(·|W) depends on On−1 only through a finite-dimensional summary measure Zn = ϕn(On−1), where the measurable function ϕn maps On−1 to ℝd for some fixed d ≥ 0 (d = 0 corresponds to the case that gn(·|W) actually does not adapt). For instance, taking Zn equal to the vector of empirical mean outcomes in the two treatment arms characterizes a proper summary measure of the past, which keeps track of the mean outcome in each treatment arm. Another sequence of mappings ϕn will be at the core of the adaptive methodology that we study in depth in this article, see (9).
Formally, the data generating mechanism is specified by the following factorization of the likelihood of On:
∏i=1n QW(P0)(Wi) × gi(Ai|Wi) × QY|A,W(P0)(Oi),
which suggests the introduction of gn = (g1, …, gn), referred to as the design of the study, and the expression “On is drawn from (Q0, gn)”. Likewise, the likelihood of On under (Q, gn) (where Q is a candidate value for Q0) is
∏i=1n QW(Wi) QY|A,W(Oi) × ∏i=1n gi(Ai|Wi), | (4) |
where we emphasize that the second factor is known. Thus we will refer, with a slight abuse of terminology, to the logarithm of the first factor as the log-likelihood of On under (Q, gn). Furthermore, given a fixed design g, we introduce the notation PQ,g f = EPQ,g f(O) for any possibly vector-valued measurable f defined on the set where O takes its values.
Another equivalent characterization of the data generating mechanism involves the causal graph of Fig. 1. It shows, firstly, that Wn is drawn independently from the past On−1; secondly, that An is a deterministic function of Wn, the summary measure Zn (which depends on On−1), and a new independent source of randomness (in other words, it is drawn conditionally on (Wn, Zn) and conditionally independently of the past On−1); thirdly, that Yn is a deterministic function of (An, Wn) and a new independent source of randomness (in other words, it is drawn conditionally on (An, Wn) and conditionally independently of the past On−1); then that the next summary measure Zn+1 is obtained as a function of On−1 and On = (Wn, An, Yn) (i.e., as a function of On; here, the causal graph grants access to a new independent source of randomness, but it is useless in our setting), and so on.
Figure 1: A causal graph describing the data generating mechanism for adaptive group sequential design.

An arrow from nodes Λ1, …, Λr to node ϒ means that there exists a deterministic function f and an independent source of randomness U such that ϒ = f(Λ1, …, Λr, U).
Finally, it is interesting in practice to adapt the design group sequentially. This can be simply formalized. For a given pre-specified integer c ≥ 1 (c = 1 corresponds to a fully sequential adaptive design), proceeding c-group sequentially simply amounts to imposing ϕ(r−1)c+1(O(r−1)c) = … = ϕrc(Orc−1) for all r ≥ 1. Then the c treatment assignments A(r−1)c+1, …, Arc in the rth c-group are all drawn from the same conditional distribution g(r−1)c(·|W). Yet, although all our results (and their proofs) still hold for any c ≥ 1, we prefer to consider in the rest of this section and in Sections 4 and 5 the case that c = 1 for simplicity's sake. In contrast, the simulation study carried out in Section 7 involves some c > 1.
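A c-group sequential scheme can be sketched as follows: the allocation probability is recomputed only at the beginning of each block of c patients and frozen within the block. The particular update rule below (a truncated ratio of empirical standard deviations) is merely a placeholder for any measurable function of the past observations.

```python
import numpy as np

rng = np.random.default_rng(2)
c = 25            # pre-specified group size (c = 1 would be fully sequential)
n = 200
g_balanced = 0.5

# Placeholder update rule: a truncated ratio of empirical standard deviations,
# standing in for any measurable function of the past observations.
def update_rule(A_past, Y_past):
    if len(A_past) == 0 or min((A_past == 1).sum(), (A_past == 0).sum()) < 2:
        return g_balanced
    s1 = np.std(Y_past[A_past == 1], ddof=1)
    s0 = np.std(Y_past[A_past == 0], ddof=1)
    return float(np.clip(s1 / (s1 + s0), 0.1, 0.9))

A = np.zeros(n, dtype=int)
Y = np.zeros(n)
g_used = np.zeros(n)
g_current = g_balanced
for i in range(n):
    if i % c == 0:                    # the design is frozen within each c-group
        g_current = update_rule(A[:i], Y[:i])
    g_used[i] = g_current
    A[i] = rng.binomial(1, g_current)
    Y[i] = rng.normal(loc=A[i], scale=1.0 + A[i])   # arm 1 is more variable
```

Within every c-group, all assignments share one allocation probability; here the rule gradually favors the more variable arm.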
3.2. On the user-supplied design to target
One of the most important features of the adaptive group sequential design methodology is that it targets a user-supplied specific design of special interest. This specific design is generally an optimal design with respect to a criterion which translates what the investigator cares for the most. Specifically, one could care the most for the well-being of the target population, wishing that a result be available as quickly as possible and aspiring therefore to the highest efficiency (i.e., the ability to reach a conclusion as quickly as possible subject to level and power constraints). Or one could care the most for the well-being of the subjects participating in the clinical trial (therefore trying to minimize the number of patients assigned to their corresponding inferior treatment arms, subject to level and power constraints). Obviously, these are only two important examples from a large class of potentially interesting criteria. The sole purpose of the criterion is to generate a random treatment mechanism of the form gn = gZn, where Zn = ϕn(On−1) is a finite-dimensional summary measure of On−1.
We decide to focus in this article on the first example, but it must be clear that the methodology applies to a variety of other criteria (see (van der Laan, 2008) for other examples).
3.3. From the Neyman allocation to the optimal design
So, the objective is now clearly identified: we wish to adapt the design as the data accrue, in order to learn from the data and then mimic a specific design which guarantees the highest efficiency, our (user-supplied) optimal design.
By Proposition 1, the asymptotic variance of any regular estimator of the risk difference Ψ(Q0) has lower bound VarPQ0,g D★(PQ0,g)(O) if the estimator relies on data sampled independently from PQ0,g. Now, setting Q̄0(A, W) = EQ0(Y|A, W),
VarPQ0,g D★(PQ0,g)(O) = VarQ0{Q̄0(1, W) − Q̄0(0, W)} + EQ0{σ2(Q0)(1, W)/g(1|W) + σ2(Q0)(0, W)/g(0|W)},
where σ2(Q0)(A, W) denotes the conditional variance of Y given (A, W) under Q0. We use the notation EQ0 above (for the expectation with respect to the marginal distribution of W under P0) in order to emphasize the fact that the treatment mechanism g only appears in the second term of the right-hand side sum. Furthermore, it holds P0-almost surely that
σ2(Q0)(1, W)/g(1|W) + σ2(Q0)(0, W)/g(0|W) ≥ {σ(Q0)(1, W) + σ(Q0)(0, W)}2,
with equality if and only if
g(1|W) = σ(Q0)(1, W)/{σ(Q0)(1, W) + σ(Q0)(0, W)} | (5) |
P0-almost surely. Therefore, the following lower bound holds for all treatment mechanisms g (where Q̄0(A, W) = EQ0(Y|A, W)):
VarPQ0,g D★(PQ0,g)(O) ≥ VarQ0{Q̄0(1, W) − Q̄0(0, W)} + EQ0[{σ(Q0)(1, W) + σ(Q0)(0, W)}2],
with equality if and only if g is characterized by (5). This optimal design is known in the literature as the Neyman allocation (see page 13 of (Hu and Rosenberger, 2006)). This result notably makes clear that the most efficient treatment mechanism assigns a patient with covariate vector W with higher probability to the treatment arm in which the variance of the outcome Y is the largest, regardless of the mean of the outcome (i.e., of whether the arm is inferior or superior).
For logistical reasons, it might be preferable to consider only treatment mechanisms that assign treatment in response to a subvector V of the baseline covariate vector W. In addition, if W is complex, targeting the optimal Neyman allocation might be too ambitious. Therefore, we will consider the important case where V is a discrete covariate with finitely many values. The covariate V indicates subgroup membership for a collection of ν subgroups of interest. We decide to restrict the search of an optimal design to the set of those treatment mechanisms which only depend on W through V. The same calculations as above straightforwardly yield that, for all g in this set,
where σ2(Q0)(a, v) denotes the conditional variance of Y given (A, V) = (a, v) under Q0, with equality if and only if g coincides with g★(Q0), characterized by
g★(Q0)(1|V) = σ(Q0)(1, V)/{σ(Q0)(1, V) + σ(Q0)(0, V)} | (6) |
P0-almost surely. Hereafter, we refer to g★(Q0) as the optimal design.
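Given the conditional standard deviations, computing the optimal design (6) is immediate; the following sketch uses arbitrary illustrative values of σ(Q0)(a, v) for ν = 3 subgroups.

```python
import numpy as np

# Illustrative conditional standard deviations sigma(a, v) of Y given
# (A, V) = (a, v), for nu = 3 subgroups (the values are arbitrary).
sigma = np.array([[1.0, 2.0, 0.5],     # arm a = 0, subgroups v = 1, 2, 3
                  [2.0, 2.0, 1.5]])    # arm a = 1

# Optimal design (6): g*(1|v) = sigma(1, v) / (sigma(1, v) + sigma(0, v)).
g_star_1 = sigma[1] / (sigma[1] + sigma[0])
```

Subgroup 2, where both arms are equally variable, keeps a balanced randomization, while the more variable arm is favored elsewhere.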
3.4. Targeting the optimal design
Because g★(Q0) is characterized as the minimizer, over the set of treatment mechanisms which depend on W only through V, of the variance under PQ0,g of the efficient influence curve at PQ0,g, we propose to construct gn+1 as the minimizer over the same set of an estimator of the latter variance based on the past observations On.
We proceed by recursion. We first set g1 = gb, the so-called balanced treatment mechanism such that gb(1|W) = 1/2 for all W, and assume that On has already been sampled from (Q0, gn) as described in Section 3.1, the sample size being large enough to guarantee that each treatment arm has been assigned at least once within each subgroup (if n0 is the smallest sample size such that the previous condition is met, then we set g1 = … = gn0 = gb).
The issue is now to construct gn+1. Let us assume for the time being that we already know how to construct an estimator Qn of Q0 based on On (hence the substitution estimator Ψ(Qn) of Ψ(Q0) = ψ0). Then, for all treatment mechanisms g which depend on W only through V,
estimates the variance of the efficient influence curve under the design g (the weighting provides the adequate tilt of the empirical distribution; it is not necessary to weight the terms which do not depend on the treatment mechanism). Now, only the first term in the rightmost expression still depends on g. The same calculations as above straightforwardly yield that Sn(g) is minimized at the treatment mechanism characterized by
| (7) |
for all v, where for each (a, v),
Yet, instead of considering the above characterization, we find it more convenient to define
| (8) |
for all v, where for each (a, v),
Note that the estimators in (7) and (8) share the same numerators, and that the different denominators converge to the same limit. Substituting (8) for (7) is convenient because one naturally interprets the former as an estimator of the conditional variance of Y given (A, V) = (a, v) based on On, a fact that we use in Section 4.2. Finally, we emphasize that gn+1 depends on the past On only through the summary measure
| (9) |
The rigorous definition of the design follows by recursion, but it is still subject to knowledge about how to construct an estimator Qn of Q0 based on On. Because this last missing piece of the formal definition of the adaptive group sequential data generating mechanism is also the core of the TMLE procedure, we address it in Section 4.
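In practice, the conditional variances appearing in (8) are replaced by empirical counterparts computed on the accrued data. The sketch below, with a toy data generating distribution and two subgroups, illustrates the resulting update; the cell-wise empirical variance is one natural choice of estimator, used here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000

# Toy accrued observations (V_i, A_i, Y_i) with V in {0, 1} coding two subgroups.
V = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, 0.5, size=n)
Y = rng.normal(loc=0.0, scale=1.0 + A * (1.0 + V), size=n)   # sd 1, 2 or 3

# Empirical analogue of (8): estimate the conditional standard deviation of Y
# given (A, V) = (a, v) by the empirical one in that cell, then set
# g_{n+1}(1|v) = s_n(1, v) / (s_n(0, v) + s_n(1, v)).
def next_design(V, A, Y):
    g = {}
    for v in (0, 1):
        s = [np.std(Y[(A == a) & (V == v)], ddof=1) for a in (0, 1)]
        g[v] = s[1] / (s[0] + s[1])
    return g

g_next = next_design(V, A, Y)
```

With the standard deviations above, g_next approaches the optimal allocations 2/3 (subgroup 0) and 3/4 (subgroup 1).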
4. TMLE in adaptive covariate-adjusted RCTs
We assume hereafter that On has already been sampled from the adaptive sampling scheme described above. In this section, we construct an estimator Qn of Q0, therefore yielding the characterization of gn+1, and completing the formal definition of the adaptive design that we initiated in Section 3. In particular, the next observed data structure On+1 can be drawn from (Q0, gn+1), and it makes sense to undertake the asymptotic study of the properties of the TMLE methodology based on adaptive group sequential sampling.
As in the i.i.d. framework, the TMLE procedure maps an initial substitution estimator of ψ0 into an update by fluctuating the initial estimate of Q0. The construction of the initial estimate is presented and studied in Section 4.1. In Section 4.2, a straightforward application of the main result of Section 4.1 shows that the adaptive design converges. How to fluctuate the initial estimate and stretch it optimally toward the parameter of interest is presented and studied in Section 4.3.
4.1. Initial maximum likelihood based substitution estimator

Working model.
In order to construct the initial estimate of Q0, we consider a working model. With a slight abuse of notation, the elements of the working model are denoted by (QW(Pn), QY|A,W(θ)) for some parameter θ ∈ Θ, where QW(Pn) is the empirical marginal distribution of W. Specifically, the working model is chosen in such a way that
This notably implies that, for any θ ∈ Θ such that QY|A,W(Pθ) = QY|A,W(θ), the conditional mean of Y given (A, W) under Pθ, which we also denote by Q̄(θ), is a linear combination of variables extracted from (A, W) and indexed by the regression vector βV (of dimension b). Defining
| (10) |
for each v, the complete parameter is given by θ = (θ(1)⊤, …, θ(ν)⊤)⊤ ∈ Θ. We impose the following condition on the parameterization:
PARAM. The parameter set Θ is compact. Furthermore, the linear parameterization is identifiable: for all v, if two regression vectors βv and β′v yield the same linear combination for all (A, W) compatible with v, then necessarily βv = β′v.
Characterizing the initial estimate.
Let us set a reference fixed design gr. We now characterize the initial estimate of Q0 by letting
| (11) |
where
| (12) |
is a weighted maximum likelihood estimator with respect to the working model. Thus, the vth component θn(v) of θn satisfies
for every v. Note that this initial estimate of Q0 yields the initial maximum likelihood based substitution estimator of ψ0:
Studying the initial estimate through θn.
For simplicity, let us introduce, for all θ ∈ Θ, the additional notations
The first asymptotic property of θn that we derive concerns its consistency (see Theorem 5 in (van der Laan, 2008)).
Proposition 2 (consistency of θn). Assume that:
A1. There exists a unique interior point θ0 ∈ Θ such that θ0 = arg maxθ∈Θ PQ0,grℓθ,0.
A2. The matrix of second derivatives −∂2/∂θ∂θ⊤ PQ0,gr ℓθ,0, evaluated at θ = θ0, is positive definite.
Provided that the observed data structure O takes its values in a bounded set, θn consistently estimates θ0.
The proof of Proposition 2 is given in Section A.1.
The limit in probability of θn has a nice interpretation in terms of projection of QY|A,W(P0) onto {QY|A,W(θ) : θ ∈ Θ}. Preferring to discuss this issue in terms of data generating distributions rather than conditional distributions, let us associate with each θ ∈ Θ the distribution Pθ,gr obtained by substituting QY|A,W(θ) for QY|A,W(P0) in PQ0,gr, and assume that PQ0,gr log QY|A,W(P0) is well defined (this weak assumption concerns Q0, not gr, and holds for instance when |log QY|A,W(P0)| is bounded). Then A1 is equivalent to Pθ0,gr being the unique Kullback-Leibler projection of PQ0,gr onto the set
In addition to being consistent, θn actually satisfies a central limit theorem if supplementary mild conditions are met. The latter central limit theorem is embedded in a more general result that we state in Section 4.3, see Proposition 5.
Furthermore, maximizing a weighted version of the log-likelihood is a technical twist that makes the theoretical study of the properties of θn easier. Indeed, the unweighted maximum likelihood estimator
targets the parameter
where the expectation is now taken under the adaptive design actually used. Therefore, tn asymptotically targets the limit, if it exists, of the corresponding sequence of parameters. Assuming that the adaptive design converges itself to a fixed design g∞, tn asymptotically targets a parameter defined like θ0 but with g∞ substituted for gr. The latter parameter is very difficult to interpret and to analyze, as it depends directly and indirectly (through g∞) on Q0.
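For a Gaussian working model, the weighted maximum likelihood estimator (12) reduces to weighted least squares. The sketch below assumes weights of the form gr(Ai|Wi)/gi(Ai|Wi), tilting the sample back toward the reference design gr; this particular form of the weights is an assumption made for illustration, as is the toy data generating distribution.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Simulated trial data whose design drifted away from the balanced
# reference design gr(1|W) = 1/2.
W = rng.uniform(size=n)
g_i = np.full(n, 0.7)                 # allocation probabilities actually used
A = rng.binomial(1, g_i)
Y = 1.0 + 2.0 * A + 0.5 * W + rng.normal(scale=0.3, size=n)

# Assumed weights gr(A_i|W_i) / g_i(A_i|W_i); with a Gaussian working model,
# the weighted maximum likelihood estimator is the weighted least squares fit.
gr = 0.5
w = np.where(A == 1, gr / g_i, (1 - gr) / (1 - g_i))
X = np.column_stack([np.ones(n), A, W])
beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * Y))
```

The fitted coefficients approach (1, 2, 0.5), the parameters of the simulated regression, despite the unbalanced design.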
4.2. Convergence of the adaptive design
Consider the mapping G★ from Θ to the set of treatment mechanisms (respectively equipped with the Euclidean distance and, for instance, the supremum distance), such that, for any θ ∈ Θ and any v,
| (13) |
Equation (13) characterizes G★, which is obviously continuous. Since the design is adapted in such a way that gn+1 = G★(θn) (see (8)), Proposition 2 and the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) straightforwardly imply the following result (the convergence in probability yields the convergence in L1 because the design is uniformly bounded).
Proposition 3 (convergence of the adaptive design). Under the assumptions of Proposition 2, the adaptive design gn converges in probability and in L1 to the limit design G★(θ0).
The convergence of the adaptive design is a crucial result. It is noteworthy that the limit design G★(θ0) equals the optimal design g★(Q0) if the working model is correctly specified (which never happens in real-life applications), but not necessarily otherwise. Furthermore, the relationship gn+1 = G★(θn) also entails the possibility to derive the convergence in distribution of the centered and √n-rescaled design to a centered Gaussian distribution with known variance, by application of the delta-method (G★ is differentiable) from a central limit theorem on θn (see Proposition 5 and Theorem 3.1 in (van der Vaart, 1998)).
4.3. TMLE
Fluctuating the initial estimate.
The second step of the TMLE procedure stretches the initial estimate in the direction of the parameter of interest, through a maximum likelihood step over a well-chosen fluctuation of the initial estimate. The latter fluctuation is just a one-dimensional parametric model indexed by a parameter ε which ranges over a bounded interval containing a neighborhood of the origin. Specifically, we set for all ε in this interval:
| (14) |
where for any θ ∈ Θ,
| (15) |
with
In particular, the fluctuation goes through the initial estimate at ε = 0 (i.e., QY|A,W(θ, 0) = QY|A,W(θ)). Let Pθ,ε be a data generating distribution such that QY|A,W(Pθ,ε) = QY|A,W(θ, ε). The conditional mean of Y given (A, W) under Pθ,ε, which we also denote by Q̄(θ, ε), is
Furthermore, the score at ε = 0 of the fluctuation equals the second component of the efficient influence curve of Ψ at the fluctuated distribution, see (2) in Proposition 1.
Characterizing the update yields the TMLE.
We characterize the update of the initial estimate in the fluctuation by
where
| (16) |
is a weighted maximum likelihood estimator with respect to the fluctuation. It is worth noting that εn is known in closed form (we assume, without serious loss of generality, that the interval over which ε ranges is large enough for the maximum to be achieved in its interior). In terms of the components θn(v) of θn, it holds that
The notation for this first update reflects the fact that the TMLE procedure, which is in greater generality an iterative procedure, converges here in one single step. Indeed, say that one fluctuates the updated estimate as we fluctuated the initial one, i.e., by introducing a second fluctuation QY|A,W(θ, ε′, ε), equal to the right-hand side of (15) where one substitutes the updated conditional distribution for the initial one. Say that one then defines a second weighted maximum likelihood estimator as the right-hand side of (16) where one substitutes QY|A,W(θn, εn, ε) for QY|A,W(θn, ε). Then it follows that the new maximizer is ε = 0, so that the “updated” estimate coincides with the previous one. The updated estimator of Q0 maps into the TMLE of the risk difference ψ0 = Ψ(Q0):
| (17) |
The asymptotic study of the TMLE relies on a central limit theorem for (θn, εn), which we develop in Section 5.
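The one-step nature of the update can be illustrated with a linear fluctuation of the conditional mean along the “clever covariate” (2A − 1)/g(A|W) taken from D★2 in (2), a common variant for a continuous outcome; the actual fluctuation (15) may differ in form, so the sketch below is indicative only, and its data generating distribution is a toy example. Starting from a deliberately misspecified initial fit, the single fluctuation step recovers the true risk difference.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# Toy trial: known randomization probability g1, true risk difference 1.0.
W = rng.uniform(size=n)
g1 = 0.6
A = rng.binomial(1, g1, size=n)
Y = 0.2 + 1.0 * A + 0.5 * W + rng.normal(scale=0.2, size=n)

# Deliberately misspecified initial fit (ignores W, wrong coefficients).
def Qbar_init(a, w):
    return 0.4 + 0.8 * a

# Linear fluctuation Qbar_eps = Qbar_init + eps * H along the "clever
# covariate" H(A, W) = (2A - 1)/g(A|W); the maximizer eps is then available
# in closed form, as the text notes for eps_n.
H = (2 * A - 1) / np.where(A == 1, g1, 1 - g1)
eps_n = np.sum(H * (Y - Qbar_init(A, W))) / np.sum(H * H)

# Updated fit and TMLE (17) of the risk difference:
H1, H0 = 1.0 / g1, -1.0 / (1 - g1)    # clever covariate at A = 1 and A = 0
psi_tmle = np.mean((Qbar_init(1, W) + eps_n * H1)
                   - (Qbar_init(0, W) + eps_n * H0))
```

Although the initial fit has risk difference 0.8, the fluctuated substitution estimator concentrates around the true value 1.0.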
5. Asymptotics
5.1. Studying the TMLE through (θn, εn): consistency
We now state and comment on a consistency result for the stacked estimator (θn, εn) which complements Proposition 2 (see Theorem 8 in (van der Laan, 2008)). For simplicity, let us generalize the notation ℓθ,0 introduced in Section 4.1 by setting, for all (θ, ε),
Moreover, let us set, for all (θ, ε),
| (18) |
Proposition 4 (consistency of (θn,εn)). Suppose that A1 and A2 from Proposition 2 hold. In addition, assume that:
A3. There exists a unique interior point ε0 of the fluctuation parameter interval such that ε0 = arg maxε PQ0,G★(θ0) ℓθ0,ε.
Then:

(i) It holds that Ψ(Qθ0,ε0) = ψ0.

(ii) Provided that the observed data structure O takes its values in a bounded set, (θn, εn) consistently estimates (θ0, ε0).
The proof of Proposition 4 is given in Section A.1.
We already discussed the interpretation of the limit in probability of θn in terms of Kullback-Leibler projection. Likewise, the limit in probability ε0 of εn enjoys such an interpretation. Let us assume that log QY|A,W(P0) is well defined (this weak assumption concerns Q0, not G★(θ0), and holds for instance when |log QY|A,W(P0)| is bounded). Then A3 is equivalent to the distribution indexed by (θ0, ε0) being the unique Kullback-Leibler projection of PQ0,G★(θ0) onto the set of distributions spanned by the fluctuation.
Of course, the most striking property that ε0 enjoys is (i): even if the working model is misspecified, it holds that Ψ(Qθ0,ε0) = ψ0. This remarkable equality and the convergence of (θn, εn) to (θ0, ε0) are evidently the keys to the consistency of the TMLE. We investigate in Section 5.3 how the consistency result stated in Proposition 4 translates into the consistency of the TMLE.
5.2. Studying the TMLE through (θn, εn): central limit theorem
We now state and comment on a central limit theorem for the stacked estimator (θn, εn) (see also Theorem 9 in (van der Laan, 2008)).
Let us introduce, for all (θ, ε),
| (19) |
and its value at (θ0, ε0).
Proposition 5 (central limit theorem for (θn, εn)). Suppose that A1, A2 and A3 from Propositions 2 and 4 hold. In addition, assume that:

A4. If a deterministic function F is such that F(O) = 0 holds almost surely (under the relevant limit distribution), then F = 0.

Then the following asymptotic linear expansion holds:
| (20) |
where
| (21) |
is an invertible matrix. Furthermore, (20) entails that the centered and √n-rescaled (θn, εn) converges in distribution to the centered Gaussian distribution with covariance matrix S0−1Σ0(S0−1)⊤, where
| (22) |
is a positive definite symmetric matrix. Moreover, S0 is consistently estimated by
and Σ0 is consistently estimated by
| (23) |
The proof of Proposition 5 is given in Section A.2. We investigate in Section 5.3 how the above central limit theorem translates into a central limit theorem for the TMLE.
5.3. Consistency and asymptotic normality of the TMLE
In this section, we finally state and comment on the asymptotic properties of the TMLE.
TMLE is consistent and asymptotically Gaussian.
In the first place, the TMLE is robust: it is a consistent estimator even when the working model is misspecified (which is always the case in real-life applications).
Proposition 6 (consistency of the TMLE). Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Then the TMLE consistently estimates the risk difference ψ0.
If the design of the clinical trial were fixed (and consequently, the first n observations were i.i.d.), then the TMLE would be a robust estimator of ψ0: even if the working model is misspecified, the TMLE still consistently estimates ψ0 because the treatment mechanism is known (or can be consistently estimated, if one wants to gain in efficiency). Thus, the robustness of the TMLE stated in Proposition 6 is the expected counterpart of the TMLE's robustness in the latter i.i.d. setting: expected because the TMLE solves a martingale estimating function that is unbiased for ψ0 at misspecified Q and correctly specified gi, i = 1, …, n.
In the second place, the TMLE is asymptotically linear, and therefore satisfies a central limit theorem. To see this, let us introduce the real-valued function ϕ such that ϕ(θ, ε) = Ψ(Qθ,ε) (see (18) for the definition of Qθ,ε). The function ϕ is differentiable on the interior of its domain, and we denote its gradient at (θ, ε). The latter gradient satisfies
| (24) |
Note that the right-hand side expression cannot be computed explicitly because the marginal distribution QW(P0) is unknown. By the law of large numbers (independent case), we can build an estimator of as follows. For B a large number (say B = 104), simulate B independent copies Õb of O from the data generating distribution , then compute
| (25) |
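The Monte Carlo step in (25) is simple to implement. A minimal sketch follows, in which `draw_w` and `h` are hypothetical stand-ins for, respectively, sampling from the estimated marginal distribution of W and evaluating the integrand of (24) (neither is reproduced from the paper); only the averaging scheme is illustrated:

```python
import random

def draw_w(rng):
    # Hypothetical stand-in for sampling W from the estimated marginal law of W.
    return rng.uniform(0.0, 1.0)

def h(w):
    # Hypothetical stand-in for the integrand appearing in (24).
    return w * w

def monte_carlo_estimate(B=10_000, seed=1):
    # Average h over B independent simulated copies, as in (25).
    rng = random.Random(seed)
    return sum(h(draw_w(rng)) for _ in range(B)) / B
```

With these stand-ins, the law of large numbers gives an estimate close to E[W²] = 1/3; in the paper, the same averaging is applied to B ≃ 10⁴ copies Õb simulated from the estimated data-generating distribution.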
Proposition 7 (central limit theorem for ). Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Then the following asymptotic linear expansion holds:
| (26) |
where
| (27) |
Furthermore, (26) entails that converges in distribution to the centered Gaussian distribution with a variance consistently estimated by
Proposition 7 is the backbone of the statistical analysis of adaptive group sequential RCTs as constructed in Section 4. In particular, denoting the (1 − α)-quantile of the standard normal distribution by ξ1−α, the proposition guarantees that the asymptotic level of the confidence interval
| (28) |
for the risk difference ψ0 is (1 − 2α). The proofs of Propositions 6 and 7 are given in Section A.3.
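Since the display (28) is not reproduced above, here is a sketch of the standard Wald-type construction consistent with the surrounding text (a two-sided interval built from the (1 − α)-quantile ξ1−α of the standard normal distribution); `psi_n` and `s_n` are assumed computed beforehand:

```python
from math import sqrt
from statistics import NormalDist

def tmle_confidence_interval(psi_n, s_n, n, alpha=0.05):
    """Wald-type interval psi_n +/- xi_{1-alpha} * s_n / sqrt(n).

    psi_n: TMLE point estimate; s_n: estimated standard deviation of the
    influence curve. Both are assumed computed beforehand.
    """
    xi = NormalDist().inv_cdf(1.0 - alpha)  # (1 - alpha)-quantile of N(0, 1)
    half_width = xi * s_n / sqrt(n)
    return (psi_n - half_width, psi_n + half_width)
```

For instance, `tmle_confidence_interval(1.264, 4.8, 1000)` (hypothetical inputs) returns an interval centered at 1.264.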
Extensions.
We conjecture that the influence function IC computed at (O, Z), see (27), is equal to
This conjecture is backed by the simulations that we carry out and present in Section 7. We will tackle the proof of the conjecture in future work. Let us assume for the moment that the conjecture is true. Then the asymptotic linear expansion (26) now implies that the asymptotic variance of can be consistently estimated by
another independent argument also showing that converges toward
i.e., the variance under the fixed design of the efficient influence curve at .
Furthermore, the most essential characteristic of the joint methodologies of design adaptation and targeted maximum likelihood estimation is certainly the utmost importance of the role played by the likelihood. In this view, the targeted maximized log-likelihood of the data
provides us with a quantitative measure of the quality of the fit (targeted toward the parameter of interest). It is therefore possible, for example, to use that quantity to select among different working models for Q0. As with TMLE for i.i.d data, we can use likelihood-based cross-validation to select among more general initial estimators indexed by fine-tuning parameters. The validity of such TMLEs for the group sequential adaptive designs studied here is outside the scope of this article.
6. Application to group sequential testing
We derive in this section a group sequential testing procedure, that is, a testing procedure which repeatedly tries to make a decision at intervals, rather than once all the data are collected or after every new observation is obtained (the latter would be said to be fully sequential). We refer to (Jennison and Turnbull, 2000; Proschan et al., 2006) for a general presentation of group sequential testing procedures. The TMLE group sequential testing procedure is formally described in Section 6.1, and some arguments justifying why it should work well (as validated by the simulation study carried out in Section 8) are presented in Section 6.2.
6.1. Description of the TMLE group sequential testing procedure
The problem at stake is to test the null “Ψ(Q) = ψ0” against “Ψ(Q) > ψ0” with asymptotic type I error α and asymptotic type II error β at some ψ1 > ψ0. We want to proceed group sequentially with K ≥ 2 steps, based on the multi-dimensional t-statistic
| (29) |
Here, N1,…,NK are random sample sizes whose realizations on a specific trajectory depend on how fast information accrues as the data are collected. In order to quantify this notion of information, we consider the inverse of the estimated variance of the TMLE based on the first n observations On (as a proxy for its true, finite sample, inverse variance). Given a reference maximum committed information Imax and K increasing proportions 0 < p1 < … < pK = 1, we set for every k ≤ K
The characterization of Imax depends on how we wish to “spend” the type I and type II errors at each step of the group sequential procedure, and on how demanding the power requirement is (i.e., how close ψ1 is to ψ0). Say that our spending strategies are summarized by the K-tuples of positive numbers (α1,…,αK) and (β1,…,βK) such that and .
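Assuming, consistently with the text above, that Nk is defined as the first sample size n at which the estimated information 1/Var(ψn) reaches pk Imax, the rule can be sketched as follows (over a grid of candidate sample sizes at which the information estimate has been computed):

```python
def interim_sample_sizes(inv_var_estimates, proportions, i_max):
    """Sketch of the information-based rule: N_k is the first sample size n at
    which the estimated information (inverse variance of the TMLE based on the
    first n observations) reaches p_k * I_max.

    inv_var_estimates: dict mapping candidate sample size n -> estimated
    information at n (assumed computed beforehand).
    """
    sizes = {}
    for k, p in enumerate(proportions, start=1):
        target = p * i_max
        reached = [n for n, info in sorted(inv_var_estimates.items()) if info >= target]
        sizes[k] = reached[0] if reached else None  # None: target not reached yet
    return sizes
```

For example, with a hypothetical information curve `{100: 10.0, 200: 20.0, 300: 31.0, 400: 42.0}`, proportions (0.25, 0.5, 0.75, 1) and Imax = 40, the interim analyses occur at sample sizes 100, 200, 300 and 400.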
Now, let (Z1,…,ZK) be distributed from the centered Gaussian distribution with covariance matrix We assume that there exists a unique value I > 0, our Imax, such that there exist a rejection boundary (a1,…,aK) and a futility boundary (b1,…,bK) satisfying aK = bK, P(Z1 ≥ a1) = α1, , and for every 1 ≤ k < K,
Note that the closer ψ1 is to ψ0, the larger Imax is (actually, is both upper bounded and bounded away from zero). Heuristically, the closer ψ1 is to ψ0, the more difficult it is to decide between the null and its alternative while preserving the required type II error at ψ1, and the more information is needed to do so.
The targeted maximum likelihood group sequential testing procedure finally goes as follows: starting from k = 1,
if then reject the null and stop accruing data,
if then fail rejecting the null and stop accruing data,
if then set k ← k + 1 and repeat.
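The three-case rule above can be sketched as follows (the t-statistics and the boundaries (a1, …, aK), (b1, …, bK) are assumed computed beforehand, e.g. as in Table 5):

```python
def group_sequential_decision(t_stats, rejection, futility):
    """Sketch of the stopping rule of Section 6.1.

    t_stats:   t-statistics T_{N_1}, ..., T_{N_K} as they become available
    rejection: rejection boundary (a_1, ..., a_K), with a_K == b_K
    futility:  futility boundary (b_1, ..., b_K)
    Returns (decision, step at which sampling stopped).
    """
    K = len(t_stats)
    for k in range(K):
        if t_stats[k] >= rejection[k]:
            return "reject", k + 1           # crossed the rejection boundary: stop
        if t_stats[k] <= futility[k]:
            return "fail to reject", k + 1   # crossed the futility boundary: stop
        # otherwise, accrue data until the next interim analysis
    return "fail to reject", K  # unreachable when rejection[-1] == futility[-1]
```

Because aK = bK, a decision is necessarily reached at step K at the latest.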
If () had the same distribution as (Z1,…,ZK), then the latter rule would yield a testing procedure with the required type I error and type II error at the specified alternative parameter.
Clearly, our decision to target the optimal design G★(θ0), which reduces as much as possible (over and for our choice of working model) the asymptotic variance of the TMLE, guarantees, at least informally, that each Nk is stochastically smaller under (Q0, )-adaptive sampling than it would have been had another fixed design been used (or targeted). Thus, resorting to (Q0, )-adaptive sampling is likely to result in an earlier conclusion than another fixed (or targeted) sampling scheme would have yielded.
6.2. Rationale of the TMLE group sequential testing procedure
The next proposition partially justifies the characterization of the TMLE group sequential testing procedure of Section 6.1. First, let us define (the smallest integer not smaller than npk) for every k ≤ K, then
This other multi-dimensional t-statistic is a substitute for , whose asymptotic study is easier to carry out. In particular, it is possible to derive the limit distribution of (T1,…,TK) under the null and under a sequence of contiguous alternatives. The limit distribution is called the canonical distribution (see Theorems 11 and 12 in (van der Laan, 2008), Theorem 3 in (Chambaz and van der Laan, 2011) and Theorem 2.1 in (Zhu and Hu, 2010), where a similar result is obtained through a different approach, based on the study of the limit distribution of a stochastic process defined over (0, 1]).
Proposition 8. Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Consider , f ≠ 0, with and a fluctuation of Q0 with score f. Set for all n ≥ n0 (n0 is such that ). Assume also that A5. The score f is bounded, it is not proportional to , , and .
The sequence defines a sequence (ψn)n≥n0 of contiguous parameters (“from direction f”, which only fluctuates the conditional distribution of Y given (A, W)), with for n large enough. Introduce
(i) Under (Q0, gn)-adaptive sampling, (T1,…,TK) converges in distribution to the centered Gaussian distribution with covariance matrix .

(ii) Under -adaptive sampling, (T1,…,TK) converges in distribution to the Gaussian distribution with mean μ(f) and covariance matrix .
The proof of Proposition 8 is given in Section A.4.
The rationale follows. Say that one concludes from Proposition 8 that (i) () is also approximately distributed as (Z1,…,ZK) under -adaptive sampling. Likewise, say that is also approximately distributed as (Z1,…,ZK) under -adaptive sampling, with Ψ(Q1) = ψ1 > ψ0. Say that one is willing to substitute pkImax for for each k ≤ K. Then (ii) () is approximately distributed as under -adaptive sampling. It appears that (i) and (ii) suffice to guarantee the wished asymptotic control of the type I and type II errors. The rationale is validated by the simulation study undertaken in Section 8.
7. Simulation study of the performances of TMLE in adaptive covariate-adjusted RCTs
In this section, we carry out a simulation study of the performances of TMLE in adaptive group sequential RCTs as exposed in the previous sections. We present the simulation scheme in Section 7.1. The working model upon which the TMLE methodology relies is described in Section 7.2. How the TMLE-based confidence intervals behave is considered in Section 7.3 (empirical coverage) and in Section 7.4 (empirical width). An illustrating example is finally presented in Section 7.5.
7.1. Simulation scheme
We characterize the component Q0 = Q(P0) of the true distribution P0 of the observed data structure O = (W, A, Y) as follows:
- the baseline covariate W = (U, V), where U is uniformly distributed over the unit interval [0, 1] and the subgroup membership covariate V ∈ 𝒱 = {1, 2, 3} (hence ν = 3) satisfies
- the conditional distribution of Y given (A, W) is the Gamma distribution characterized by the conditional mean
with ρ = 1 (we will set ρ to another value in Section 8) and the conditional variance given in (30).
In particular, the risk difference ψ0 = ψ(Q0), our parameter of interest, is known in closed form:
and so is the variance
of the efficient influence curve under balanced sampling. The numerical value of vb(Q0) is reported in Table 1.
Table 1:
Numerical values of the allocation probabilities and variance of the efficient influence curve under either the balanced gb or the targeted optimal g★(Q0) sampling schemes. The ratio of the variances of the efficient influence curve under targeted optimal and balanced sampling schemes satisfies R(Q0) = v*(Q0)/vb(Q0) ≃ 0.762.
| Sampling scheme (Q0, g) | g(1|v = 1) | g(1|v = 2) | g(1|v = 3) | Variance |
|---|---|---|---|---|
| (Q0, gb)-balanced | 0.5 | 0.5 | 0.5 | 23.864 |
| (Q0, g★(Q0))-optimal | 0.849 | | | 18.181 |
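The sampling schemes compared in this study can be sketched as follows. The exact conditional mean and variance formulas of Section 7.1 are not reproduced above, so `cond_mean` and `cond_var` below are hypothetical placeholders; the sketch only shows how one draws O = (W, A, Y) with Y Gamma-distributed with prescribed conditional mean m and variance v (shape m²/v, scale v/m), the design entering through g(1|v):

```python
import random

def sample_observation(rng, g, cond_mean, cond_var):
    # One draw O = (W, A, Y) under (Q, g)-sampling; g(v) = P(A = 1 | V = v).
    u = rng.uniform(0.0, 1.0)
    v = rng.choices((1, 2, 3), weights=(1, 1, 1))[0]  # hypothetical law of V
    w = (u, v)
    a = 1 if rng.random() < g(v) else 0
    m, var = cond_mean(a, w), cond_var(a, w)  # placeholders, not the paper's formulas
    shape, scale = m * m / var, var / m       # Gamma(shape, scale): mean m, variance var
    return w, a, rng.gammavariate(shape, scale)
```

Under the balanced design gb, one would call, e.g., `sample_observation(random.Random(7), lambda v: 0.5, lambda a, w: 2.0, lambda a, w: 4.0)` (the two lambdas being purely illustrative).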
We target the design which (a) depends on the baseline covariate W = (U, V) only through V (i.e., belongs to ) and (b) minimizes the variance of the efficient influence curve of the parameter of interest ψ. The latter treatment mechanism g★(Q0) and optimal efficient asymptotic variance
are also known in closed form, and numerical values are reported in Table 1.
Let (n1, …, n7) = (100, 250, 500, 750, 1000, 2500, 5000) be a sequence of sample sizes. We estimate M = 1000 times the risk difference ψ0 = ψ(Q0) based on , m = 1,..., M, i = 1,..., 7, under
i.i.d (Q0, gb)-balanced sampling,
i.i.d (Q0, g★(Q0))-optimal sampling,
-adaptive sampling.
Finally, we emphasize that the observed data structure O = (W, A, Y) is not bounded, whereas O is assumed bounded in Propositions 2, 3, 4, 5, 6 and 7.
7.2. Specifying the working model, reference design and a few twists
On the working model.
For each v ∈ 𝒱, let us denote , with Θv compact, where the regression vector βv = (βv,1, βv,2, βv,3) (b = 3 in (10)); then θ = (θ1, θ2, θ3) ∈ Θ = Θ1 × Θ2 × Θ3.
Following the description in Section 4.1, the working model that the TMLE methodology relies on is characterized by the conditional likelihood of Y given (A, W):
with the specific choice of conditional mean of Y given (A, W):
for all and . As required, condition PARAM is met. Obviously, the working model is heavily misspecified:
a Gaussian conditional likelihood is used instead of a Gamma conditional likelihood,
the parametric forms of the conditional expectation and variance are wrong too.
On the reference design.
Regarding the choice of a reference fixed design gr ∈ (see Section 4.1), we select gr = gb (the balanced design). The parameter θ0 only depends on Q0 and the working model, but its estimator θn depends on gr, which may negatively affect its performance. Therefore, we propose to dilute the impact of the choice of gr as an initial reference design as follows.
For a given sample size n, we first compute a preliminary estimate of θ0 as in (12), but with (the smallest integer not smaller than n/4) substituted for n in the sum. Then θn is computed as in (12), but this time with substituted for gr(Ai|Vi).
The proofs can be adapted in order to incorporate this modification of the procedure. We refer the interested reader to Section 8.5 in (van der Laan, 2008).
On additional details.
We decide arbitrarily to update the design each time c = 25 new observations are sampled. In addition, the first update only occurs once there are at least five completed observations in each treatment arm and each V-stratum. Thus, the minimal sample size at the first update is 30. It can be shown that, under initialization with the balanced design, the expected value of the first sample size at which there are at least 5 observations in each arm equals 75. Finally, as a precautionary measure, we systematically apply a thresholding to the updated treatment mechanism: Using the notation of Section 4, is substituted for in all computations. We arbitrarily choose δ = 0.01.
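The two precautions above (update cadence and thresholding) can be sketched as follows; the allocation probabilities are represented as a dict keyed by stratum, an illustrative convention not taken from the paper:

```python
def threshold_design(g, delta=0.01):
    """Precautionary thresholding: clip each allocation probability g(1|v)
    into [delta, 1 - delta] (delta = 0.01 in the simulation study)."""
    return {v: min(max(p, delta), 1.0 - delta) for v, p in g.items()}

def design_update_times(first_update, total_n, c=25):
    """Sample sizes at which the design is updated: a first update once every
    stratum is sufficiently filled (sample size `first_update`, at least 30),
    then one update each time c = 25 new observations are sampled."""
    return list(range(first_update, total_n + 1, c))
```

For instance, with a first update at sample size 30 and 105 observations in total, the design is updated at sample sizes 30, 55, 80 and 105.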
7.3. Empirical coverage of the TMLE confidence intervals
We now invoke the central limit theorem stated in Proposition 7 to construct confidence intervals for the risk difference. Let us introduce, for all types of sampling and each sample size ni, the confidence intervals
where the definition of the variance estimator based on the first n observations depends on the sampling scheme:
under i.i.d (Q0, gb)-balanced sampling, is the estimator of the asymptotic variance of the TMLE :
| (31) |
under i.i.d (Q0, g★(Q0))-optimal sampling, is defined as in (31), replacing gb with g★(Q0),
under -adaptive sampling, , the estimator of the conjectured asymptotic variance of computed on the first n observations .
We are interested in the empirical coverage (reported in Table 2, top rows)
guaranteed for each sampling scheme and every i = 1,..., 7 by . The rescaled empirical coverage proportions should follow a Binomial distribution with parameter (M, 1 − a), with a = α, for every i = 1,..., 7. This property can be tested with a standard binomial test, the alternative stating that a > α. This results in a collection of 7 p-values for each sampling scheme, as reported in Table 2 (bottom rows).
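The binomial test can be sketched as follows: under the null, the number of intervals covering ψ0 among the M replications is Binomial(M, 1 − α), and small counts are evidence for the alternative a > α, so the p-value is the lower tail probability at the observed count:

```python
from math import comb

def coverage_pvalue(hits, M=1000, level=0.95):
    """Exact one-sided binomial test: returns P(Bin(M, level) <= hits), the
    p-value against the alternative that the true coverage is below `level`.

    hits: number of the M simulated confidence intervals that contain psi_0.
    """
    return sum(comb(M, j) * level**j * (1.0 - level) ** (M - j)
               for j in range(hits + 1))
```

For example, a coverage count of 952 out of M = 1000 at α = 5% yields a p-value of roughly 0.6, of the same order as the adaptive entry of Table 2 at n7.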
Table 2:
Checking the adequateness of the coverage guaranteed by our simulated confidence intervals. We test if the rescaled empirical coverage Binomial random variables have parameter (M, 1 – α), the alternative stating that they have parameter (M, 1 – a) with a > α. We report the values (top row) and corresponding p-values (bottom row, between parentheses) of the Binomial test for each sample size and each sampling scheme.
| Sampling scheme | |
|---|---|
| n7 | |
| adaptive | 0.952 (0.634) |
Considering each sampling scheme (i.e., each row of Table 2) separately, we conclude that the (1 – α)-coverage cannot be declared defective under
i.i.d (Q0, gb)-balanced sampling for any sample size ni ≥ n3 = 500,
i.i.d (Q0, g★(Q0))-optimal sampling for any sample size ni ≥ n2 = 250,
-adaptive sampling for any sample size ni ≥ n1 = 100,
when adjusting for multiple testing with the Benjamini and Yekutieli procedure for controlling the False Discovery Rate at the 5% level.
This is a remarkable result that not only validates the theory but also provides us with insight into the finite sample properties of the TMLE procedure based on adaptive sampling. The fact that the TMLE procedure behaves better under the adaptive sampling scheme than under the balanced i.i.d sampling scheme at sample size n1 = 100 may not be due to mere chance. Although the TMLE procedure based on an adaptive sampling scheme is initiated under the balanced sampling scheme (so that each stratum initially consists of comparable numbers of patients assigned to each treatment arm, allowing one to estimate, at least roughly, the required parameters), it starts deviating from it (as soon as every (A, V)-stratum counts 5 patients) each time 25 new observations are accrued. The poor performance of the TMLE procedure based on the optimal i.i.d sampling scheme at sample size n1 is certainly due to the fact that, by starting directly from the optimal sampling scheme (a choice we would not recommend in practice), too few patients from stratum V = 3 are assigned to treatment arm A = 0 among the first n1 subjects. At larger sample sizes, the TMLE procedure performs equally well in terms of coverage under the adaptive sampling scheme and under both i.i.d schemes.
7.4. Empirical width of the TMLE confidence intervals
Now that we know that the TMLE-based confidence intervals based on -adaptive sampling are valid confidence regions, it is of interest to compare their widths with those of their counterparts obtained under the i.i.d (Q0, gb)-balanced or (Q0, g*(Q0))-optimal sampling schemes.
For this purpose, we compare, for each sample size ni, the empirical distribution of as in (31) (i.e., the empirical distribution of the widths of the TMLE-based confidence intervals at sample size ni obtained under i.i.d (Q0, gb)-balanced sampling, up to the factor ) to the empirical distribution of (i.e., the empirical distribution of the widths of the TMLE-based confidence intervals at sample size ni obtained under adaptive sampling, up to the factor ). We rely on the two-sample Kolmogorov-Smirnov test, the alternative stating that the confidence intervals obtained under adaptive sampling are stochastically smaller than their counterparts under i.i.d balanced sampling. This results in 7 p-values, all equal to zero, which we nonetheless report in Table 3 (bottom row). In order to get a sense of how much narrower the confidence intervals obtained under adaptive sampling are, we also compute and report in Table 3 (top row) the ratios of empirical average widths
| (32) |
for each sample size ni. Informally, this shows a 12% gain in width.
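The one-sided two-sample Kolmogorov-Smirnov statistic underlying this comparison can be sketched as follows (the p-value computation is omitted); D+ is large when the first sample (here, widths under adaptive sampling) is stochastically smaller than the second (widths under i.i.d balanced sampling):

```python
def one_sided_ks_statistic(x, y):
    """One-sided two-sample Kolmogorov-Smirnov statistic
    D+ = sup_t (F_x(t) - F_y(t)), where F_x and F_y are the empirical
    distribution functions of the samples x and y."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))

    def ecdf(sample, t):
        return sum(1 for s in sample if s <= t) / len(sample)

    return max(ecdf(xs, t) - ecdf(ys, t) for t in points)
```

For instance, with the toy samples [1, 2, 3] (shifted down) and [2, 3, 4], the statistic equals 1/3, whereas swapping the two samples gives 0.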
Table 3:
Comparing the width of our confidence intervals. On one hand, we test for each sample size ni if the TMLE-based confidence intervals obtained under -adaptive sampling are narrower stochastically than the TMLE-based confidence intervals obtained under i.i.d (Q0, gb)-balanced sampling in terms of the two-sample Kolmogorov-Smirnov test. On the other hand, we test for each sample size ni if the TMLE-based confidence intervals obtained under -adaptive sampling are wider stochastically than the TMLE-based confidence intervals obtained under i.i.d (Q0,g*(Q0))-optimal sampling in terms of the two-sample Kolmogorov-Smirnov test. We report the p-values (bottom rows, between parentheses). In addition, we report for each sample size ni the ratios of average widths as defined in (32).
| n7 | |
|---|---|
| 1.000 (0.144) |
On the other hand, we also compare, for each sample size ni, the empirical distribution of as in (31) but with gb replaced by g*(Q0) (i.e., the empirical distribution of the widths of the TMLE-based confidence intervals at sample size ni obtained under i.i.d (Q0, g*(Q0))-optimal sampling, up to the factor ) to the empirical distribution of (i.e., the empirical distribution of the widths of the TMLE-based confidence intervals at sample size ni obtained under -adaptive sampling, up to the factor ). We rely again on the two-sample Kolmogorov-Smirnov test, the alternative now stating that the confidence intervals obtained under adaptive sampling are stochastically larger than their counterparts under i.i.d optimal sampling. This results in 7 p-values that we report in Table 3 (bottom row). In order to get a sense of how similar the confidence intervals obtained under adaptive and i.i.d optimal sampling schemes are, we also compute and report for each sample size ni in Table 3 (top row) the ratios of empirical average widths, as in (32), again replacing gb by g*(Q0) in the definition (31) of . Informally, this shows that the confidence intervals obtained under adaptive sampling are on average even slightly narrower than their counterparts obtained under i.i.d optimal sampling.
7.5. Illustrating example
So far we have been concerned with distributional results, answering the questions Does the confidence interval provide the wished coverage? (yes, even for moderate sample sizes: see Section 7.3) and How does its width compare with the width of the confidence interval obtained under either i.i.d sampling scheme? (well: see Section 7.4). In this section, we focus on a particular simulated trajectory (we arbitrarily select the first one, associated with ) for the sake of illustration.
Some interesting features of the selected simulated trajectory are apparent in Fig. 7.5 and Table 4.
Table 4:
Illustrating the TMLE procedure under -adaptive sampling scheme. This table illustrates how the TMLE procedure behaves (on a simulated trajectory) as the sample size increases. We report, at each sample size ni, the updated adapted design (columns two to four), current TMLE (column five) and 95% confidence interval In,1 as in (28), with substituted for sn. The true risk difference ψ0 ≃ 1.264 belongs to all confidence intervals. See also Fig. 7.5.
| n |
|---|
For instance, we can follow the convergence of the TMLE toward the true risk difference 𝜓0 in the top plot of Fig. 7.5 and in the fifth column of Table 4. Similarly, the middle plot of Fig. 7.5 and the second to fourth columns of Table 4 illustrate the convergence of toward G*(θ0), as stated in Proposition 3. What these plots and columns also teach us is that, in spite of the misspecified working model, the learned design G*(θ0) seems very close to the optimal treatment mechanism g*(Q0) for the chosen simulation scheme and working model used in our simulation study. Moreover, the last column of Table 4 illustrates how the confidence intervals shrink around the true risk difference 𝜓0 as the sample size increases.
Yet, the bottom plot of Fig. 7.5 may be the most interesting of the three. It obviously illustrates the convergence of toward , i.e., toward the variance under the fixed design of the efficient influence curve at . Hence, it also teaches us that the latter limit seems very close to the optimal asymptotic variance v*(Q0) for the chosen simulation scheme and working model used in our simulation study. More importantly, strikingly converges to v*(Q0) from below. This finite sample characteristic may reflect the fact that the true finite sample variance of might be lower than v*(Q0). Studying this issue further is certainly very delicate, and goes beyond the scope of this article.
8. Simulation study of the performances of the TMLE group sequential testing procedure
In this section, we resume the simulation study undertaken in Section 7 in order to investigate the performances of the TMLE group sequential testing procedure presented in Section 6. We describe the simulation scheme in Section 8.1, then evaluate how the testing procedure behaves in terms of empirical type I and type II errors in Section 8.2 and how it behaves in terms of empirical distribution of sample size at decision in Section 8.3.
8.1. The simulation scheme, continued
We wish to test the null “Ψ(P) = ψ0” against its alternative “Ψ(P) > ψ0” with prescribed type I error α = 5% and type II error β = 10% at the alternative ψ1 = ψ0 +0.4. Depending on whether we want to investigate the empirical behaviors of the TMLE group sequential testing procedure with respect to (i) type I or (ii) type II errors, we test M = 1000 times the above hypotheses based on 3 × M independent datasets obtained under
(i) empirical type I error study:
- i.i.d (Q0, gb)-balanced sampling,
- i.i.d (Q0, g*(Q0))-optimal sampling,
- -adaptive sampling;
(ii) empirical type II error study:
- i.i.d (Q1, gb)-balanced sampling,
- i.i.d (Q1, g*(Q0))-optimal sampling,
- -adaptive sampling,
where Q1 is defined as Q0 in Section 7.1 with ρ = ψ1/ψ0 (so that Ψ(Q1) = ψ1).
We decide to proceed in K = 4 steps, at proportions (p1, p2, p3, p4) = (0.25, 0.50, 0.75, 1). We choose the α- and β-spending strategies characterized by the equalities and . This set of conditions characterizes the whole group sequential testing procedure; see Table 5 for the resulting numerical values.
Table 5:
Specifics of the TMLE group sequential testing procedure
| Ψ0 | Ψ1 | Imax |
| 1.264 | 1.664 | 57.927 |
| Step k | 1 | 2 | 3 | 4 |
| Rejection boundary | 2.734 | 2.305 | 2.005 | 1.715 |
| α-spending | 0.3125% | 0.9375% | 1.5625% | 2.1875% |
| Step k | 1 | 2 | 3 | 4 |
| Futility boundary | −0.976 | 0.132 | 0.961 | 1.715 |
| β-spending | 0.625% | 1.875% | 3.125% | 4.375% |
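The spending values reported in Table 5 are consistent with cumulative spending that is quadratic in the information fraction, i.e., α p² and β p² at fraction p (an observation about the reported numbers, since the exact equalities characterizing the strategies are not reproduced above):

```python
def quadratic_spending_increments(total, proportions):
    """Error spent at each interim analysis if the cumulative spending at
    information fraction p equals total * p**2."""
    cumulative = [total * p * p for p in proportions]
    return [b - a for a, b in zip([0.0] + cumulative[:-1], cumulative)]
```

`quadratic_spending_increments(0.05, (0.25, 0.5, 0.75, 1.0))` reproduces the α-spending row of Table 5 (0.3125%, 0.9375%, 1.5625%, 2.1875%), and the β-spending row follows with total 0.10.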
8.2. Empirical type I and type II errors of the TMLE group sequential testing procedure
We report in Table 6 the empirical type I and type II errors obtained during the course of the simulation study. The values are strikingly close to each other in both cases. Even better, the empirical type I and type II errors obtained under -adaptive sampling are the lowest in both cases.
Table 6:
Checking the adequateness of the type I and type II error controls on our simulated TMLE group sequential testing procedures. We test if the Binomial random numbers of times the null is falsely rejected have parameter (M, α), the alternative stating that they have parameter (M, a) with a > α. We also test if the Binomial random numbers of times the null is falsely not rejected have parameter (M, 12%), the alternative stating that they have parameter (M, b) with b > 12%. We report the empirical type I and type II errors and corresponding p-values
| Sampling scheme | Type I error (Q = Q0) Empirical error (p-value) | Type II error (Q = Q1) Empirical error (p-value) |
|---|---|---|
| i.i.d (Q, gb)-balanced | 0.040 (0.919) | 0.132 (0.113) |
| i.i.d (Q, g★(Q))-optimal | 0.043 (0.827) | 0.128 (0.203) |
| adaptive | 0.040 (0.919) | 0.126 (0.261) |
Since the number of times the null was falsely rejected is Binomial with parameter (M, a), it is possible to test rigorously whether a = α (as it should be) against a > α. This yields three p-values that we report in Table 6. Because there are fewer than 50 wrong decisions under each sampling scheme, we naturally get large p-values, confirming (if necessary) that the type I error is under control.
Similarly, the number of times the null was falsely not rejected is also Binomial, with some parameter (M, b). Thus, it is possible to test rigorously whether b = β (as it should be) against b > β. This yields three small p-values (p < 0.001 for the i.i.d balanced sampling scheme, 0.002 for the i.i.d optimal sampling scheme and 0.003 for the adaptive sampling scheme). This confirms the impression that the numbers of wrong decisions are all significantly larger than what random deviations from the reference distribution are likely to allow. Put bluntly, the type II error is not under control. However, there is no real need to worry. Indeed, if one rather tests b = 12% against b > 12%, then the three p-values (reported in Table 6) are large.
In summary, for the considered scenario, the TMLE group sequential testing procedure described in Section 6.1 performs at least as well under the -adaptive sampling scheme as under both i.i.d (Q, gb)-balanced and (Q, g*(Q))-optimal sampling schemes. The type I error control meets the requirements. The type II error control is defective, but only slightly defective.
8.3. Empirical distribution of sample size at decision of the TMLE group sequential testing procedure
Now that we know that the TMLE group sequential testing procedure performs at least as well under adaptive sampling scheme as under both i.i.d sampling schemes, let us compare the performances in terms of sample size at decision.
For this purpose, we simply report the average sample sizes at decision for each sampling scheme when evaluating the type I and type II error controls, see Table 7. The gain in average sample size at decision obtained by resorting to the -adaptive sampling scheme instead of the i.i.d (Q, gb)-balanced sampling scheme is dramatic: On average, one needs approximately 16% fewer observations in order to reach a conclusion under the -adaptive sampling scheme than under the i.i.d (Q, gb)-balanced sampling scheme. Furthermore, it appears that it is even more efficient to resort to the i.i.d (Q, g*(Q))-optimal sampling scheme: On average, one needs approximately 6% more observations in order to reach a conclusion under the -adaptive sampling scheme than under the i.i.d (Q, g*(Q))-optimal sampling scheme.
Table 7:
Comparing the average sample sizes at decision for each sampling scheme when evaluating the type I and type II errors control.
| Sampling scheme | Type I error (Q = Q0) Average sample size | Type II error (Q = Q1) Average sample size |
|---|---|---|
| i.i.d (Q, gb)-balanced | 786.63 | 895.64 |
| i.i.d (Q, g★(Q))-optimal | 620.71 | 713.47 |
| adaptive | 661.14 | 746.72 |
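A quick arithmetic check of the percentages quoted in this section against the Table 7 averages (type I error study, Q = Q0):

```python
# Average sample sizes at decision from Table 7 (type I error study).
balanced, optimal, adaptive = 786.63, 620.71, 661.14

# Adaptive vs i.i.d balanced: about 16% fewer observations on average.
saving_vs_balanced = 1.0 - adaptive / balanced

# Adaptive vs i.i.d optimal: about 6% more observations on average.
excess_vs_optimal = adaptive / optimal - 1.0
```

The same computation on the type II error column (895.64, 713.47, 746.72) gives figures of the same order.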
In summary, for the considered scenario, the TMLE group sequential testing procedure described in Section 6.1 reaches a decision more quickly under the -adaptive sampling scheme than under the i.i.d (Q, gb)-balanced sampling scheme, with a gain in average sample size at decision of approximately 16%. Furthermore, the TMLE group sequential testing procedure reaches a decision more slowly under the -adaptive sampling scheme than under the i.i.d (Q, g*(Q))-optimal sampling scheme, but the loss in average sample size at decision is only approximately 6%.
A. Proofs
Let us introduce for convenience the notations θε = (θ, ε), and . Moreover, let us recall that if μ is a probability measure and f a measurable function, then μf denotes the integral ∫ fdμ.
A.1. Proof of Propositions 2 and 4, and a useful remark on Proposition 3 Proof of Proposition 2.
The cornerstone of the proof is to interpret θn as the solution in θ of the martingale estimating equation , where Zi (defined in (9)) is the finite dimensional summary measure of the past observations On(i − 1) such that depends on On(i − 1) only through Zi (hence the notation ) and .
Note first that, for all i ≤ n,
by A1 (changing the order of differentiation and integration is permitted by the dominated convergence theorem, because is bounded and Θ is compact—see PARAM). Observe then that, by definition of θn, we have:
Now, since (a) supθ∈Θ ∥D1(θ)∥∞ < ∞ (because is bounded and Θ is compact) and (b) the standard entropy of for the supremum norm satisfies (see (van der Vaart, 1998), Example 19.7), we can apply (componentwise) the Kolmogorov strong law of large numbers for martingales, as in Theorem 8 of (Chambaz and van der Laan, 2009). This yields the almost sure convergence of Mn, hence of n−1 , to 0.
By a Taylor expansion of at θ0 (changing the order of differentiation and integration is permitted here too for the same reasons as above), it holds that, for all i ≤ n (recall that ),
From this we deduce that converges in probability to 0. Because is positive definite by A2, this implies the result.
Proof of Proposition 4.
The proof of (i) is very simple and typical of robust statistical studies. Indeed,
the latter expression also writing as
hence the result.
The proof of (ii) fundamentally relies on the fact that solves the martingale estimating equation , where D(θε) is the extension of D1(θ) defined in (19). Here too, for all i ≤ n, and applying the Kolmogorov strong law of large numbers for martingales (see Theorem 8 in (Chambaz and van der Laan, 2009)) yields that converges to zero almost surely. This entails the convergence in probability of to by a Taylor expansion of at . Note that A3 is a clear counterpart to A1 from Proposition 2 but that there is no counterpart to A2 from Proposition 2 in Proposition 4. Indeed, it automatically holds in the framework of the proposition that , the proof requiring that the latter quantity be different from zero.
A useful remark on Proposition 3.
We deduce in Proposition 3 the convergence in probability and in L1 of the adaptive design to the limit design G*(θ0) from the convergence in probability of θn to θ0 as obtained in Proposition 2. It is crucial for us that and also converge to G*(θ0) and 1/G*(θ0) (with an obvious notational shortcut) in probability and in L1. Fortunately, this is true because (a) the positive random variables are uniformly bounded away from 0 and (b) if a sequence Xn converges in L1 to 0 then also converges in L1 to 0 (by convexity of the L1-norm). Let us put this in a lemma:
Lemma 1. For all (a, v), and converge in probability and in L1 to G*(θ0)(a|v) and G*(θ0)(a|v)−1, respectively.
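The elementary inequality behind Lemma 1 can be sketched as follows, writing δ > 0 for the uniform lower bound of point (a) (the symbol δ is our notation):

```latex
% Inverses inherit the L1-convergence from the designs themselves,
% because both quantities are bounded away from 0 by delta:
\left| g_i(a \mid v)^{-1} - G^*(\theta_0)(a \mid v)^{-1} \right|
  \;=\; \frac{\left| g_i(a \mid v) - G^*(\theta_0)(a \mid v) \right|}
             {g_i(a \mid v)\, G^*(\theta_0)(a \mid v)}
  \;\le\; \delta^{-2}\, \left| g_i(a \mid v) - G^*(\theta_0)(a \mid v) \right|,
```

so the convergence in probability and in L1 of the inverses follows directly from that of the designs.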
A.2. Proof of Proposition 5
The proof of Proposition 5 still relies on the facts that (a) solves the martingale estimating equation and (b) for all i ≤ n.
Consider the following equality:
(33)
A Taylor expansion of the left-hand term of (33) first yields that
where and .
The matrix An satisfies . Now, for each i ≤ n,
(34)
where the latter matrix, defined in (21) and independent of i, is deterministic (this is due to the weighting). Thus, An = S0. Moreover, An = S0 is invertible because its determinant equals the product of and
which is negative by A2. Furthermore, (34) and An = S0 (deterministic) also imply that Bn can be rewritten as
and applying (componentwise) a standard law of large numbers for martingales (see for instance Theorem 2.4.2 in (Sen and Singer, 1993)) yields that Bn converges to 0 almost surely (note that ). We emphasize that this proves the convergence in probability of Sn to S0, as stated in the proposition. Consequently, (33) entails
(35)
with . It mainly remains to show that Mn satisfies a central limit theorem.
For this purpose, we apply Theorem 10 in (Chambaz and van der Laan, 2009). Define . It holds that
(36)
Now, by Lemma 1, (36) implies that , where Σ0 (defined in (22)) is positive definite when A4 is met. Indeed, assume on the contrary that there exists a vector u ≠ 0 such that . Then necessarily -almost surely, which contradicts A4. Using (36) again, we also see that
from which we deduce that n−1(Cn – ECn) converges to 0 in probability by Lemma 1. Consequently, Mn converges in distribution to the centered Gaussian law with covariance matrix Σ0. In addition, converges in probability to Σ0. Since (a) is differentiable at , (b) its derivative is bounded, and (c) we already know that , yet another Taylor expansion allows us to derive that Σn (defined in (23)) equals Σ0(1 + oP(1)).
Let us go back to (35), knowing now that Mn satisfies a central limit theorem. We derive from it that hence . Thus, (35) does entail (20) since for all i ≤ n. The stated central limit theorem on readily follows.
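The content of the central limit theorem just established can likewise be illustrated numerically. In the self-contained toy adaptive trial below (our own construction, not the paper's design), the estimator solves a martingale estimating equation with inverse design weights, and its standardized version is approximately standard normal across replications despite the data-adaptive allocation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy adaptive trial (our own construction, mimicking only the structure of
# the result): theta_n solves a martingale estimating equation, and the
# standardized estimator should be approximately N(0, 1) over replications.
theta0, n, reps = 0.7, 400, 1000
z = np.empty(reps)

for r in range(reps):
    num = den = 0.0
    theta_hat = 0.5
    g_hist, A_hist, Y_hist = [], [], []
    for i in range(n):
        g_i = min(max(theta_hat, 0.1), 0.9)    # design adapted to the past
        A = rng.random() < g_i
        Y = rng.random() < (theta0 if A else 0.3)
        if A:
            num += Y / g_i
            den += 1.0 / g_i
        if den > 0:
            theta_hat = num / den
        g_hist.append(g_i); A_hist.append(A); Y_hist.append(Y)
    g_a = np.asarray(g_hist)
    A_a = np.asarray(A_hist, dtype=float)
    Y_a = np.asarray(Y_hist, dtype=float)
    # Plug-in estimate of the variance of the weighted score, plus the
    # denominator of the (ratio-form) estimator theta_n.
    v_n = np.mean((A_a / g_a ** 2) * (Y_a - theta_hat) ** 2)
    scale = np.mean(A_a / g_a)
    z[r] = np.sqrt(n) * (theta_hat - theta0) * scale / np.sqrt(v_n)

print(round(z.mean(), 2), round(z.var(), 2))
```

Across replications the empirical mean and variance of z are close to 0 and 1, mirroring the role played by the (deterministic) limiting variance in the proof.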
A.3. Proof of Propositions 6 and 7
Proof of Proposition 6.
The proof of Proposition 6 is twofold and relies on the decomposition
(37)
where
(38)
First, a continuity argument and the convergence in probability of the stacked estimator to entail the convergence in probability of to (the equality holds by (i) in Proposition 4). The conclusion follows because converges almost surely to zero. Indeed,
(39)
where, for all and, with a slight abuse of notation, PW,n (respectively, PW,0) denotes the empirical (respectively, true) marginal distribution of W. Thus,
(40)
where (a) and (b) the standard entropy of for the supremum norm satisfies (see Example 19.7 in (van der Vaart, 1998)). Therefore, the class is PW,0-Glivenko-Cantelli, and the right-hand side term of (40) converges to 0 PW,0-almost surely by the Glivenko-Cantelli theorem (i.i.d. framework; see for instance Theorem 19.4 in (van der Vaart, 1998)).
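As a side illustration, the i.i.d. Glivenko-Cantelli phenomenon invoked here for PW,n can be checked numerically in the simplest case of Uniform(0,1) draws (a toy computation of ours, unrelated to the paper's specific class of functions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sup_deviation(n):
    """Kolmogorov-Smirnov distance between the empirical CDF of n
    Uniform(0,1) draws and the true CDF F(t) = t."""
    x = np.sort(rng.random(n))
    # The supremum over t is attained at the jump points of the ECDF.
    upper = np.max(np.arange(1, n + 1) / n - x)   # right limits of the ECDF
    lower = np.max(x - np.arange(0, n) / n)       # left limits of the ECDF
    return max(upper, lower)

devs = [sup_deviation(10 ** k) for k in (2, 3, 4, 5)]
print([round(d, 4) for d in devs])
```

The supremum deviation shrinks as the sample size grows, which is exactly the uniform convergence that drives the right-hand side of (40) to 0.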
Proof of Proposition 7.
The proof of (26) relies again on (37). It is easy to derive the asymptotic linear expansion of the first term. Indeed, we can rewrite (39) as
(41)
which holds subject to checking that . Since the class introduced previously for the proof of Proposition 6 is also PW,0-Donsker, this is a consequence of Lemma 19.24 in (van der Vaart, 1998), provided that converges in probability to 0. Now, converges to in probability, and the function continuously maps onto (thus it is uniformly bounded). Consequently, Xn (which is obviously non-negative) is upper bounded, so that it is equivalent to show that Xn converges to 0 in L1. By the Fubini theorem,
By the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)), converges to 0 in probability (hence in L1, the variables being bounded) for all . Therefore, the integrand of the right-hand side integral in the previous display converges pointwise to 0. Since it is also bounded, we conclude by the dominated convergence theorem that EXn converges to 0. This completes the study of the first term of the decomposition (37).
Regarding the second term of the latter decomposition, we derive its asymptotic linear expansion by the delta-method (see Theorem 3.1 in (van der Vaart, 1998)) from that of , see (20). Specifically,
(42)
Combining (41) and (42) yields the desired asymptotic linear expansion (26) and the closed-form expression of the related influence function IC as given in (27).
A central limit theorem for real-valued martingales (see Theorem 9 in (Chambaz and van der Laan, 2009)) applied to (26) yields the stated convergence and validates the use of as an estimator of the asymptotic variance. To see this, note that for all i ≤ n and define . Now we emphasize that, for every i ≤ n, firstly
secondly
and thirdly
By combining the last three equalities, we thus obtain that
(43)
where Cn was introduced in the proof of Proposition 5. Since we already showed in the latter proof that n−1ECn = Σ0(1 + oP(1)), (43) notably yields the convergence of n−1Ecn toward
which is positive when A4 is met. In addition, (43) also entails that n−1(cn − Ecn) converges to 0 in probability, therefore implying the convergence in distribution of to the centered Gaussian distribution with variance s2 as well as the convergence in probability of to s2. Since (a) and θε ↦ D(θε) are differentiable at , (b) both with uniformly bounded derivatives, (c) we already know that , and (d) and are consistently estimated by and , yet another Taylor expansion (and the continuous mapping theorem, see for instance Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) finally yields the stated convergence in probability of to s2, therefore completing the proof.
A.4. Proof of Proposition 8
The fact that for n large enough is a consequence of the expansion , which holds because Ψ is pathwise differentiable at PQ0,g (for any g) relative to the maximal tangent space, with efficient influence curve D*(Q0, g). The rest of Proposition 8 is a corollary of the following lemma.
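For concreteness, recall the well-known closed form of the efficient influence curve of the additive parameter Ψ+ = E{E(Y|A = 1, W) − E(Y|A = 0, W)} considered in the introduction (a standard fact of the TMLE literature, displayed here in our own notation with Q̄(a, w) = EQ(Y|A = a, W = w)):

```latex
% Efficient influence curve of the additive effect parameter at (Q, g):
D^*(Q, g)(O) \;=\; \frac{2A - 1}{g(A \mid W)}\,\bigl(Y - \bar{Q}(A, W)\bigr)
\;+\; \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi^{+}(Q).
```

Pathwise differentiability with this influence curve is what licenses the first-order expansion of Ψ used in the proposition.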
Lemma 2. Let Λn be the log-likelihood ratio of the experiment relative to the experiment. Under , the vector converges in law to the Gaussian distribution with mean and covariance matrix
In particular, the and experiments are mutually contiguous.
The limiting distribution of (T1,…,TK) under is easily derived from Lemma 2. Le Cam’s third lemma (see Example 6.7 in (van der Vaart, 1998)) solves the problem of obtaining the limiting distribution of (T1,…,TK) under from that under . Only the asymptotic means are different. We do not repeat the proof here, as it is exactly the same (up to some minor variations in notation) as the proof of Lemma 5 in (Chambaz and van der Laan, 2011).
Proof of Lemma 2.
Since f is bounded and (here, one could equivalently use the notation EQ0), we can assume without loss of generality that the fluctuation is characterized by
for all . Let us denote by qn and q0 the conditional densities of Y given (A, W) under and Q0.
The log-likelihood ratio , by conditional independence of O1,…, On given ((A1, W1),…, (An,Wn)) and because the marginal distribution of W is the same under as under Q0, (see (4) and note that the factors cancel out). The fluctuation is chosen in such a way that , hence , where the function R is characterized by . In particular, R is increasing and R(x) → R(0) = 0 when x → 0. Since f is bounded, the expansion is valid for n large enough. Moreover, the last term satisfies
so that if then it is oP(1). Furthermore, the Kolmogorov law of large numbers for martingales implies that . Denote Var(f)(A, W) = EQ0[f2(O)|A, W]. We have that
for V the relevant subvector of W upon which each depends. By Lemma 1, converges in probability to G*(θ0). Consequently, the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) yields that . In summary, we obtain the following asymptotic linear expansion:
(44)
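Expansion (44) is of the classical local asymptotic normality type. For comparison, in the simplest i.i.d. setting with a bounded score f satisfying EQ0 f(O) = 0 and fluctuation dQn = (1 + n^(−1/2) f) dQ0, a textbook second-order expansion gives (this display is a generic sketch of ours, not the paper's exact fluctuation):

```latex
\Lambda_n \;=\; \sum_{i=1}^{n} \log\bigl(1 + n^{-1/2} f(O_i)\bigr)
\;=\; n^{-1/2} \sum_{i=1}^{n} f(O_i)
\;-\; \frac{1}{2n} \sum_{i=1}^{n} f^2(O_i) \;+\; o_P(1)
\;\rightsquigarrow\; \mathcal{N}\Bigl(-\frac{\sigma^2}{2},\, \sigma^2\Bigr),
\qquad \sigma^2 = E_{Q_0} f^2(O),
```

and the characteristic relation "asymptotic mean equals minus half the asymptotic variance" is precisely what yields mutual contiguity of the two sequences of experiments (Le Cam's first lemma).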
Introduce now and the bounded (and measurable) function F such that . We show that satisfies a central limit theorem. The proof is very similar to the corresponding part of proof of Lemma 5 in (Chambaz and van der Laan, 2011), hence we only give an outline, focusing on the differences. First, it holds for every 1 ≤ k, l ≤ K that . This entails that the matrix converges to
which is positive definite if and only if its determinant is positive. The latter determinant equals a positive constant times , which is positive too by the Cauchy-Schwarz inequality (f is not proportional to ). Therefore, Theorem 8 in (Chambaz and van der Laan, 2011) applies and converges in law to the centered Gaussian distribution with the covariance matrix given in the above display. The conclusion readily follows.
Figure 2: Illustrating the TMLE procedure under -adaptive sampling scheme.

These three plots illustrate how the TMLE procedure behaves (on a simulated trajectory) as the sample size (on the x-axis, logarithmic scale; the vertical grey lines indicate the sample sizes ni, i = 1, ..., 7) increases. Top plot: we represent the sequence at intermediate sample sizes n; the horizontal grey line indicates the true value of the risk difference 𝜓0. Middle plot: we represent the three sequences (bottom, middle and top curves) at intermediate sample sizes n; the three horizontal grey lines indicate the optimal allocation probabilities g*(Q0)(1|1) (bottom line), g*(Q0)(1|2) (middle line) and g*(Q0)(1|3) (top line). Bottom plot: we represent the sequence of estimated asymptotic variances of at intermediate sample sizes n; the horizontal grey line indicates the value of the optimal variance v*(Q0). See also Table 4.
Footnotes
1 The reasoning is not circular, by virtue of the chronological ordering summarized in Fig. 1, for instance.
References
- Atkinson AC and Biswas A. Adaptive biased-coin designs for skewing the allocation proportion in clinical trials with normal responses. Stat. Med., 24(16):2477–2492, 2005.
- Bandyopadhyay U and Biswas A. Adaptive designs for normal responses with prognostic factors. Biometrika, 88(2):409–419, 2001.
- Chambaz A and van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate. Technical report, Division of Biostatistics, University of California, Berkeley, 2009.
- Chambaz A and van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: Theoretical study. Int. J. Biostat., 7(1), 2011.
- Emerson SS. Issues in the use of adaptive clinical trial designs. Stat. Med., 25:3270–3296, 2006.
- The Food and Drug Administration. Critical path opportunities list. Technical report, U.S. Department of Health and Human Services, Food and Drug Administration, 2006.
- Golub HL. The need for more efficient trial designs. Stat. Med., 25:3231–3235, 2006.
- Hu F and Rosenberger WF. The Theory of Response-Adaptive Randomization in Clinical Trials. Wiley, New York, 2006.
- Jennison C and Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC, Boca Raton, FL, 2000.
- Proschan MA, Lan GKK, and Wittes JT. Statistical Monitoring of Clinical Trials: A Unified Approach. Statistics for Biology and Health. Springer, New York, 2006.
- Rosenberger WF, Vidyashankar AN, and Agarwal DK. Covariate-adjusted response-adaptive designs for binary response. J. Biopharm. Statist., 11:227–236, 2001.
- Rosenberger WF. New directions in adaptive designs. Statist. Sci., 11:137–149, 1996.
- Sen PK and Singer JM. Large Sample Methods in Statistics: An Introduction with Applications. Chapman & Hall, New York, 1993.
- Shao J, Yu X, and Zhong B. A theory for testing hypotheses under covariate-adaptive randomization. Biometrika, 2010.
- van der Laan MJ. The construction and analysis of adaptive group sequential designs. Technical report 232, Division of Biostatistics, University of California, Berkeley, March 2008.
- van der Laan MJ and Rubin D. Targeted maximum likelihood learning. Int. J. Biostat., 2(1), 2006.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press, 1998.
- van der Vaart AW and Wellner JA. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
- Zhang L-X, Hu F, Cheung SH, and Chan WS. Asymptotic properties of covariate-adjusted response-adaptive designs. Ann. Statist., 35(3):1166–1182, 2007.
- Zhu H and Hu F. Sequential monitoring of response-adaptive randomized clinical trials. Ann. Statist., 38(4):2218–2241, 2010.
