Abstract
This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials analyzed through the prism of the semiparametric methodology of targeted maximum likelihood estimation (TMLE). We show how to build, as the data accrue group-sequentially, a sampling design which targets a user-supplied optimal design. We also show how to carry out a sound TMLE statistical inference based on such an adaptive sampling scheme (thereby extending results previously known only in the i.i.d. setting), and how group-sequential testing applies on top of it. The procedure is robust (i.e., consistent even if the working model is misspecified). A simulation study confirms the theoretical results, and validates the conjecture that the procedure may also be efficient.
Keywords: Adaptive design, asymptotic normality, canonical distribution, clinical trial, contiguity, group-sequential testing, robustness, targeted maximum likelihood methodology
1. Introduction
This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials (RCTs) analyzed through the prism of the targeted maximum likelihood estimation (TMLE) methodology. The considered observed data structure writes as O = (W, A, Y), where W is a vector of baseline covariates, A is the treatment assignment and Y the primary outcome of interest. We assume that A is binary and that Y is one-dimensional. Typical parameters of scientific interest are Ψ+ = E{E(Y|A = 1, W) − E(Y|A = 0, W)} (additive scale, which we consider hereafter) or Ψ× = log E{E(Y|A = 1, W)} − log E{E(Y|A = 0, W)} (multiplicative scale). Such parameters can be interpreted causally whenever one is willing to assume the existence of a full data structure X = (W, Y(0), Y(1)) containing the vector of baseline covariates and the two counterfactual outcomes under the two possible treatments, and such that Y = Y(A) and A is conditionally independent of X given W. If so, Ψ+ = EY(1) − EY(0) and Ψ× = log EY(1) − log EY(0). Let us now explain what we mean by adaptive group sequential covariate-adjusted RCT.
Adaptive group sequential covariate-adjusted RCT.
By adaptive covariate-adjusted design, we mean in the setting of this article a clinical trial design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial. Moreover, the patient's baseline covariates may be taken into account for the random treatment assignment. This definition is slightly adapted from (Golub, 2006). In particular, we assume with a view to the definition of pre-specified sampling plans given in (Emerson, 2006) that, prior to collection of the data, the trial protocol specifies the parameter of scientific interest, the inferential method and the confidence level to be used when constructing a confidence interval for the latter parameter.
Furthermore, we assume that the investigator specifies beforehand in the trial protocol a criterion of special interest which yields a notion of optimal randomization scheme that we therefore wish to target. For instance, the criterion could translate the necessity to minimize the number of patients assigned to their corresponding inferior treatment arm, subject to level and power constraints. Or the criterion could translate the necessity that a result be available as quickly as possible, subject to level and power constraints. The sole restriction on the criterion is that it must yield an optimal randomization scheme which can be approximated from the data accrued so far. The two examples above comply with this restriction.
We focus in this article on targeted maximum likelihood estimation, as introduced by van der Laan and Rubin (2006) in the independent identically distributed (i.i.d) setting. The extension to the setting of adaptive RCTs was first considered in (van der Laan, 2008), upon which this article relies. In addition, we choose to consider specifically the second criterion cited above. Consequently, the optimal randomization scheme is the so-called Neyman allocation, which minimizes the asymptotic variance of the TMLE of the parameter of interest. We emphasize that there is nothing special about targeting the Neyman allocation, the whole methodology applying equally well to a large class of optimal randomization schemes derived from a variety of valid criteria.
By adaptive group sequential design, we refer to the possibility to adjust the randomization scheme only by blocks of c patients, where c ≥ 1 is a pre-specified integer (the case that c = 1 corresponds to a fully sequential adaptive design). The expression also refers to the fact that group sequential testing methods can be equally well applied on top of adaptive designs, an extension that we also consider. Although all our results (and their proofs) still hold for any c ≥ 1, we consider the case that c = 1 in the theoretical part of the article for simplicity's sake; the case that c > 1 is considered in the simulation study.
Short bibliography.
The literature on adaptive designs is vast and our review is not comprehensive. Quite misleadingly, the expression “adaptive design” has also been used in the literature for sequential testing and, in general, for designs that allow data-adaptive stopping times for the whole study (or for certain treatment arms) which achieve the wished type I and type II errors requirements when testing a null hypothesis against its alternative.
Of course, data-adaptive randomization schemes have a long history that goes back to the 1930s, and we refer to Section 1.2 in (Hu and Rosenberger, 2006), Section 17.4 in (Jennison and Turnbull, 2000) and (Rosenberger, 1996) for a comprehensive historical perspective.
Many articles are devoted to the study of “response adaptive designs”, an expression implicitly suggesting that those designs only depend on past responses of previous patients and not on the corresponding covariates. We refer to (Hu and Rosenberger, 2006; Chambaz and van der Laan, 2011) for a bibliography on that topic. On the contrary, covariate-adjusted response adaptive (CARA) randomizations tackle the so-called issue of heterogeneity (i.e., the use of covariates in adaptive designs), by dynamically calculating the allocation probabilities on the basis of previous responses and current and past values of certain covariates. In this view, this article studies a new type of CARA procedure. The interest in CARA procedures is more recent, and there is a steadily growing number of articles dedicated to their study, starting with (Rosenberger et al., 2001; Bandyopadhyay and Biswas, 2001), then (Atkinson and Biswas, 2005; Zhang et al., 2007; Shao et al., 2010) among others. The latter articles are typically concerned with the convergence (almost sure and in law) of the allocation probabilities vector and of the estimator of the parameter in a correctly specified parametric model ((Shao et al., 2010) is devoted to the testing issue).
By contrast, the consistency and asymptotic normality results that we obtain in this article are robust to model misspecification. Thus, they notably contribute significantly to solving the question raised by the Food & Drug Administration (2006):
When is it valid to modify randomization based on results, for example, in a combined phase 2/3 cancer trial?
Finally, this article mainly relies on (Chambaz and van der Laan, 2009, 2011; van der Laan, 2008), the latter technical report paving the way to robust and more efficient estimation based on adaptive RCTs in a variety of other settings (including the case that the outcome Y is a possibly censored time-to-event).
Organization.
We first set the statistical framework in Section 2. The rationale of our adaptive covariate-adjusted designs is presented in Section 3, and we complete its formal definition in Section 4 where we also detail how the TMLE methodology operates. The asymptotic study is carried out in Section 5 (estimation) and in Section 6 (group sequential testing). The simulation study is developed in Section 7 (estimation) and in Section 8 (group sequential testing). The proofs are relegated to the Appendix.
2. Statistical framework
We tackle the asymptotic study of adaptive group sequential designs in the case of RCTs with covariate, binary treatment and one-dimensional primary outcome of interest. Thus, the observed data structure writes as O = (W, A, Y), where W consists of some baseline covariates, A denotes the assigned binary treatment, and Y is the primary outcome of interest. For example, Y can indicate whether the treatment has been successful or not; or Y can count the number of times an event of interest has occurred under the assigned treatment during a period of follow-up; or Y can measure a quantity of interest after a given time has elapsed. Although we will focus on the last case in this article, the methodology applies equally well to each example cited above.
Let us denote by P0 the true distribution of the observed data structure O in the population of interest. We see P0 as a specific element of the non-parametric set of all possible observed data distributions. Note that, in order to avoid some technicalities, we assume (or rather: impose) that all elements of this set are dominated by a common measure. The parameter of scientific interest is the marginal effect of treatment a = 1 relative to treatment a = 0 on the additive scale, or risk difference: ψ0 = EP0{EP0(Y|A = 1, W) − EP0(Y|A = 0, W)}. Of course, other choices such as the log-relative risk (the counterpart of the risk difference on the multiplicative scale) could be considered, and dealt with along the same lines. The risk difference can be interpreted causally for instance in the counterfactual framework.
For every observed data distribution P, let us introduce the shorthand notation QW(P)(W) = P[W], g(P)(A|W) = P[A|W], QY|A,W(P)(O) = P[Y|A, W]. We use the alternative notation P = PQ,g with Q = Q(P) ≡ (QW(P), QY|A,W(P)) and g = g(P). Equivalently, PQ,g is the data generating distribution such that Q(PQ,g) = Q and g(PQ,g) = g. In particular, we denote Q0 = Q(P0) = (QW(P0), QY|A,W(P0)). We also introduce the notation 𝒬 for the non-parametric set of all possible values of Q, and 𝒢 for the non-parametric set of all possible values of g. Setting Q̄(P)(A, W) = EP(Y|A, W) (with a slight abuse, we also sometimes write Q̄(Q) instead of Q̄(P)), we define in greater generality
Ψ(P) = EP{EP(Y|A = 1, W) − EP(Y|A = 0, W)} | (1) |
over the whole set of observed data distributions, so that ψ0 equivalently writes as ψ0 = Ψ(P0). This notation also emphasizes the fact that Ψ(P) only depends on P through the conditional mean EP(Y|A, W) and the marginal distribution QW(P), justifying the alternative notation Ψ(PQ,g) = Ψ(Q). The following proposition summarizes the most fundamental properties enjoyed by Ψ.
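To fix ideas, the substitution principle underlying (1) can be sketched numerically: given any conditional mean function in place of EP(Y|A, W) and a sample of baseline covariates, the parameter is approximated by an empirical average. The function Qbar below is a purely illustrative stand-in, not one derived from this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conditional mean Qbar(a, w) standing in for E_P(Y | A = a, W = w);
# its particular form is illustrative only.
def Qbar(a, w):
    return 0.2 + 0.5 * a + 0.1 * w + 0.2 * a * w

# Plug-in evaluation of (1): average Qbar(1, W) - Qbar(0, W) over a large
# sample of baseline covariates W.
W = rng.uniform(0.0, 1.0, size=10_000)
psi = np.mean(Qbar(1, W) - Qbar(0, W))
```

Here Qbar(1, w) − Qbar(0, w) = 0.5 + 0.2w, so psi approximates 0.5 + 0.2 EW = 0.6.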
Proposition 1 (efficient influence curve). The functional Ψ is pathwise differentiable at every observed data distribution P relative to the maximal tangent space. The efficient influence curve D★(P) of Ψ at P is characterized by
D★(P)(O) = D★1(P)(O) + D★2(P)(O) | (2) |
where, setting Q̄(P)(A, W) = EP(Y|A, W),
D★1(P)(O) = Q̄(P)(1, W) − Q̄(P)(0, W) − Ψ(P) and D★2(P)(O) = (2A − 1)/g(P)(A|W) × {Y − Q̄(P)(A, W)}.
The variance VarP D★(P)(O) is the lower bound of the asymptotic variance of any regular estimator of Ψ(P) in the i.i.d. setting. Furthermore, even if Q ≠ Q0,
EP0 D★(PQ,g)(O) = 0 implies Ψ(Q) = Ψ(Q0) | (3) |
when g = g(P0).
The implication (3) is the key to the robustness of the targeted maximum likelihood estimator introduced and studied in this article. It is another justification of our interest in the pathwise differentiability of the functional Ψ and its efficient influence curve.
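The robustness property (3) can be illustrated numerically: plugging a deliberately misspecified outcome regression into the efficient influence curve (2), while using the true (known) treatment mechanism, still yields an unbiased estimating function for ψ0. The toy data generating distribution below is chosen for the sketch only, not taken from this article.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy data generating distribution with known balanced design g0(1|W) = 1/2.
W = rng.uniform(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = 0.3 + 0.4 * A + 0.2 * W + rng.normal(scale=0.1, size=n)
psi0 = 0.4                                    # true risk difference here

# Deliberately misspecified outcome regression (ignores W, wrong coefficients).
def Qbar_mis(a, w):
    return 0.1 + 0.1 * a

# Estimator built from the efficient influence curve (2): it stays unbiased
# for psi0 despite the misspecified Qbar, because g is correctly specified.
g1 = 0.5
gA = np.where(A == 1, g1, 1 - g1)             # g0(A|W), constant here
psi_aug = np.mean(Qbar_mis(1, W) - Qbar_mis(0, W)
                  + (2 * A - 1) / gA * (Y - Qbar_mis(A, W)))
```

Despite the wrong outcome model, psi_aug concentrates around psi0 = 0.4, in line with (3).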
3. Data generating mechanism for adaptive design
The purpose of adaptive group sequential design as we consider it in this article is to adjust the randomization scheme as the data accrue. We first formally describe in Section 3.1 the data generating mechanism through the expression of the likelihood function and also in terms of causal graphs. Then we discuss the general issue of choosing an optimal design to target in Section 3.2, before specifically describing an optimal design of interest in Section 3.3 and showing how to target it in Section 3.4.
3.1. Data generating mechanism and related likelihood
In order to formally describe the data generating mechanism, we need to state a starting assumption: during the course of the clinical trial, it is possible to recruit the patients independently from a stationary population. In the counterfactual framework, this is equivalent to supposing that it is possible to sample as many independent copies of the full data structure as required. Let us denote by Oi = (Wi, Ai, Yi) the ith observed data structure. We also find it convenient to introduce On = (O1, …, On) and, for every i = 0, …, n, On(i) = (O1, …, Oi) (with the convention On(0) = ∅). We denote by 𝒪 the set where the observed data structure O takes its values.
By adjusting the randomization scheme as the data accrue, we mean that the nth treatment assignment An is drawn from gn(·|Wn), where gn(·|W) is a conditional distribution (or treatment mechanism) given the covariate W which additionally depends on past observations On−1. Since the sequence of treatment mechanisms cannot reasonably grow in complexity as the sample size increases, we will only consider data-adaptive treatment mechanisms such that gn(·|W) depends on On−1 only through a finite-dimensional summary measure Zn = ϕn(On−1), where the measurable function ϕn maps On−1 to ℝd for some fixed d ≥ 0 (d = 0 corresponds to the case that gn(·|W) actually does not adapt). For instance, taking Zn equal to the vector of empirical mean outcomes in the two treatment arms characterizes a proper summary measure of the past, which keeps track of the mean outcome in each treatment arm. Another sequence of mappings ϕn will be at the core of the adaptive methodology that we study in depth in this article, see (9).
Formally, the data generating mechanism is specified by the following factorization of the likelihood of On:
∏i=1n QW(P0)(Wi) × gi(Ai|Wi) × QY|A,W(P0)(Oi),
which suggests the introduction of gn = (g1, …, gn), referred to as the design of the study, and the expression “On is drawn from (Q0, gn)”. Likewise, the likelihood of On under (Q, gn) (where Q is a candidate value for Q0) is
∏i=1n QW(Wi) QY|A,W(Oi) × ∏i=1n gi(Ai|Wi), | (4) |
where we emphasize that the second factor is known. Thus we will refer, with a slight abuse of terminology, to the logarithm of the first factor as the log-likelihood of On under (Q, gn). Furthermore, given a fixed design g, we introduce the notation PQ,g f = EPQ,g f(O) for any possibly vector-valued measurable f defined on the set where O takes its values.
Another equivalent characterization of the data generating mechanism involves the causal graph of Fig. 1. It shows, firstly, that Wn is drawn independently from the past On−1; secondly, that An is a deterministic function of Wn, the summary measure Zn (which depends on On−1), and a new independent source of randomness (in other words, it is drawn conditionally on (Wn, Zn) and conditionally independently of the past On−1); thirdly, that Yn is a deterministic function of (An, Wn) and a new independent source of randomness (in other words, it is drawn conditionally on (An, Wn) and conditionally independently of the past On−1); then that the next summary measure Zn+1 is obtained as a function of On−1 and On = (Wn, An, Yn) (i.e., as a function of On; here, the causal graph grants access to a new independent source of randomness, but it is useless in our setting), and so on.
Figure 1: A causal graph describing the data generating mechanism for adaptive group sequential design.

An arrow from nodes Λ1, …, Λr to node ϒ means that there exists a deterministic function f and an independent source of randomness U such that ϒ = f(Λ1, …, Λr, U).
Finally, it is interesting in practice to adapt the design group sequentially. This can be simply formalized. For a given pre-specified integer c ≥ 1 (c = 1 corresponds to a fully sequential adaptive design), proceeding c-group sequentially simply amounts to imposing ϕ(r−1)c+1(O(r−1)c) = … = ϕrc(Orc−1) for all r ≥ 1. Then the c treatment assignments A(r−1)c+1, …, Arc in the rth c-group are all drawn from the same conditional distribution g(r−1)c(·|W). Yet, although all our results (and their proofs) still hold for any c ≥ 1, we prefer to consider in the rest of this section and in Sections 4 and 5 the case that c = 1 for simplicity's sake. In contrast, the simulation study carried out in Section 7 involves some c > 1.
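A c-group sequential scheme can be sketched as follows: the allocation probability is recomputed only at the beginning of each block of c patients and frozen within the block. The particular update rule below (a truncated ratio of empirical standard deviations) is merely a placeholder for any measurable function of the past observations.

```python
import numpy as np

rng = np.random.default_rng(2)
c = 25            # pre-specified group size (c = 1 would be fully sequential)
n = 200
g_balanced = 0.5

# Placeholder update rule: a truncated ratio of empirical standard deviations,
# standing in for any measurable function of the past observations.
def update_rule(A_past, Y_past):
    if len(A_past) == 0 or min((A_past == 1).sum(), (A_past == 0).sum()) < 2:
        return g_balanced
    s1 = np.std(Y_past[A_past == 1], ddof=1)
    s0 = np.std(Y_past[A_past == 0], ddof=1)
    return float(np.clip(s1 / (s1 + s0), 0.1, 0.9))

A = np.zeros(n, dtype=int)
Y = np.zeros(n)
g_used = np.zeros(n)
g_current = g_balanced
for i in range(n):
    if i % c == 0:                    # the design is frozen within each c-group
        g_current = update_rule(A[:i], Y[:i])
    g_used[i] = g_current
    A[i] = rng.binomial(1, g_current)
    Y[i] = rng.normal(loc=A[i], scale=1.0 + A[i])   # arm 1 is more variable
```

Within every c-group, all assignments share one allocation probability; here the rule gradually favors the more variable arm.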
3.2. On the user-supplied design to target
One of the most important features of the adaptive group sequential design methodology is that it targets a user-supplied specific design of special interest. This specific design is generally an optimal design with respect to a criterion which translates what the investigator cares for the most. Specifically, one could care the most for the well-being of the target population, wishing that a result be available as quickly as possible and aspiring therefore to the highest efficiency (i.e., the ability to reach a conclusion as quickly as possible subject to level and power constraints). Or one could care the most for the well-being of the subjects participating in the clinical trial (therefore trying to minimize the number of patients assigned to their corresponding inferior treatment arms, subject to level and power constraints). Obviously, these are only two important examples from a large class of potentially interesting criteria. The sole purpose of the criterion is to generate a random treatment mechanism of the form gn = gZn, where Zn = ϕn(On−1) is a finite-dimensional summary measure of On−1.
We decide to focus in this article on the first example, but it must be clear that the methodology applies to a variety of other criteria (see (van der Laan, 2008) for other examples).
3.3. From the Neyman allocation to the optimal design
So, the objective is now clearly identified: we wish to adapt the design as the data accrue, in order to learn from the data and then mimic a specific design which guarantees the highest efficiency, our (user-supplied) optimal design.
By Proposition 1, the asymptotic variance of any regular estimator of the risk difference Ψ(Q0) has lower bound VarPQ0,g D★(PQ0,g)(O) if the estimator relies on data sampled independently from PQ0,g. Now, setting Q̄0(A, W) = EQ0(Y|A, W),
VarPQ0,g D★(PQ0,g)(O) = VarQ0{Q̄0(1, W) − Q̄0(0, W)} + EQ0{σ2(Q0)(1, W)/g(1|W) + σ2(Q0)(0, W)/g(0|W)},
where σ2(Q0)(A, W) denotes the conditional variance of Y given (A, W) under Q0. We use the notation EQ0 above (for the expectation with respect to the marginal distribution of W under P0) in order to emphasize the fact that the treatment mechanism g only appears in the second term of the right-hand side sum. Furthermore, it holds P0-almost surely that
σ2(Q0)(1, W)/g(1|W) + σ2(Q0)(0, W)/g(0|W) ≥ {σ(Q0)(1, W) + σ(Q0)(0, W)}2,
with equality if and only if
g(1|W) = σ(Q0)(1, W)/{σ(Q0)(1, W) + σ(Q0)(0, W)} | (5) |
P0-almost surely. Therefore, the following lower bound holds for all treatment mechanisms g (where Q̄0(A, W) = EQ0(Y|A, W)):
VarPQ0,g D★(PQ0,g)(O) ≥ VarQ0{Q̄0(1, W) − Q̄0(0, W)} + EQ0[{σ(Q0)(1, W) + σ(Q0)(0, W)}2],
with equality if and only if g is characterized by (5). This optimal design is known in the literature as the Neyman allocation (see page 13 of (Hu and Rosenberger, 2006)). This result notably makes clear that the most efficient treatment mechanism assigns a patient with covariate vector W with higher probability to the treatment arm in which the variance of the outcome Y is the largest, regardless of the mean of the outcome (i.e., of whether the arm is inferior or superior).
For logistical reasons, it might be preferable to consider only treatment mechanisms that assign treatment in response to a subvector V of the baseline covariate vector W. In addition, if W is complex, targeting the optimal Neyman allocation might be too ambitious. Therefore, we will consider the important case where V is a discrete covariate with finitely many values. The covariate V indicates subgroup membership for a collection of ν subgroups of interest. We decide to restrict the search of an optimal design to the set of those treatment mechanisms which only depend on W through V. The same calculations as above straightforwardly yield that, for all g in this set,
where σ2(Q0)(a, v) denotes the conditional variance of Y given (A, V) = (a, v) under Q0, with equality if and only if g coincides with g★(Q0), characterized by
g★(Q0)(1|V) = σ(Q0)(1, V)/{σ(Q0)(1, V) + σ(Q0)(0, V)} | (6) |
P0-almost surely. Hereafter, we refer to g★(Q0) as the optimal design.
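Given the conditional standard deviations, computing the optimal design (6) is immediate; the following sketch uses arbitrary illustrative values of σ(Q0)(a, v) for ν = 3 subgroups.

```python
import numpy as np

# Illustrative conditional standard deviations sigma(a, v) of Y given
# (A, V) = (a, v), for nu = 3 subgroups (the values are arbitrary).
sigma = np.array([[1.0, 2.0, 0.5],     # arm a = 0, subgroups v = 1, 2, 3
                  [2.0, 2.0, 1.5]])    # arm a = 1

# Optimal design (6): g*(1|v) = sigma(1, v) / (sigma(1, v) + sigma(0, v)).
g_star_1 = sigma[1] / (sigma[1] + sigma[0])
```

Subgroup 2, where both arms are equally variable, keeps a balanced randomization, while the more variable arm is favored elsewhere.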
3.4. Targeting the optimal design
Because g★(Q0) is characterized as the minimizer, over the set of treatment mechanisms which depend on W only through V, of the variance under PQ0,g of the efficient influence curve at PQ0,g, we propose to construct gn+1 as the minimizer over the same set of an estimator of the latter variance based on the past observations On.
We proceed by recursion. We first set g1 = gb, the so-called balanced treatment mechanism such that gb(1|W) = 1/2 for all W, and assume that On has already been sampled from (Q0, gn) as described in Section 3.1, the sample size being large enough to guarantee that each treatment arm has been assigned at least once within each subgroup (if n0 is the smallest sample size such that the previous condition is met, then we set g1 = … = gn0 = gb).
The issue is now to construct gn+1. Let us assume for the time being that we already know how to construct an estimator Qn of Q0 based on On (hence the substitution estimator Ψ(Qn) of Ψ(Q0) = ψ0). Then, for all treatment mechanisms g which depend on W only through V,
estimates the variance of the efficient influence curve under the design g (the weighting provides the adequate tilt of the empirical distribution; it is not necessary to weight the terms which do not depend on the treatment mechanism). Now, only the first term in the rightmost expression still depends on g. The same calculations as above straightforwardly yield that Sn(g) is minimized at the treatment mechanism characterized by
| (7) |
for all v, where for each (a, v),
Yet, instead of considering the above characterization, we find it more convenient to define
| (8) |
for all v, where for each (a, v),
Note that the estimators in (7) and (8) share the same numerators, and that the different denominators converge to the same limit. Substituting (8) for (7) is convenient because one naturally interprets the former as an estimator of the conditional variance of Y given (A, V) = (a, v) based on On, a fact that we use in Section 4.2. Finally, we emphasize that gn+1 depends on the past On only through the summary measure
| (9) |
The rigorous definition of the design follows by recursion, but it is still subject to knowledge about how to construct an estimator Qn of Q0 based on On. Because this last missing piece of the formal definition of the adaptive group sequential data generating mechanism is also the core of the TMLE procedure, we address it in Section 4.
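In practice, the conditional variances appearing in (8) are replaced by empirical counterparts computed on the accrued data. The sketch below, with a toy data generating distribution and two subgroups, illustrates the resulting update; the cell-wise empirical variance is one natural choice of estimator, used here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000

# Toy accrued observations (V_i, A_i, Y_i) with V in {0, 1} coding two subgroups.
V = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, 0.5, size=n)
Y = rng.normal(loc=0.0, scale=1.0 + A * (1.0 + V), size=n)   # sd 1, 2 or 3

# Empirical analogue of (8): estimate the conditional standard deviation of Y
# given (A, V) = (a, v) by the empirical one in that cell, then set
# g_{n+1}(1|v) = s_n(1, v) / (s_n(0, v) + s_n(1, v)).
def next_design(V, A, Y):
    g = {}
    for v in (0, 1):
        s = [np.std(Y[(A == a) & (V == v)], ddof=1) for a in (0, 1)]
        g[v] = s[1] / (s[0] + s[1])
    return g

g_next = next_design(V, A, Y)
```

With the standard deviations above, g_next approaches the optimal allocations 2/3 (subgroup 0) and 3/4 (subgroup 1).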
4. TMLE in adaptive covariate-adjusted RCTs
We assume hereafter that On has already been sampled from the adaptive sampling scheme described above. In this section, we construct an estimator Qn of Q0, therefore yielding the characterization of gn+1, and completing the formal definition of the adaptive design that we initiated in Section 3. In particular, the next observed data structure On+1 can be drawn from (Q0, gn+1), and it makes sense to undertake the asymptotic study of the properties of the TMLE methodology based on adaptive group sequential sampling.
As in the i.i.d. framework, the TMLE procedure maps an initial substitution estimator of ψ0 into an update by fluctuating the initial estimate of Q0. The construction of the initial estimate is presented and studied in Section 4.1. In Section 4.2, a straightforward application of the main result of Section 4.1 shows that the adaptive design converges. How to fluctuate the initial estimate and stretch it optimally toward the parameter of interest is presented and studied in Section 4.3.
4.1. Initial maximum likelihood based substitution estimator

Working model.
In order to construct the initial estimate of Q0, we consider a working model. With a slight abuse of notation, the elements of the working model are denoted by (QW(Pn), QY|A,W(θ)) for some parameter θ ∈ Θ, where QW(Pn) is the empirical marginal distribution of W. Specifically, the working model is chosen in such a way that
This notably implies that, for any θ ∈ Θ such that QY|A,W(Pθ) = QY|A,W(θ), the conditional mean of Y given (A, W) under Pθ, which we also denote by Q̄(θ), is a linear combination of variables extracted from (A, W) and indexed by the regression vector βV (of dimension b). Defining
| (10) |
for each v, the complete parameter is given by θ = (θ(1)⊤, …, θ(ν)⊤)⊤ ∈ Θ. We impose the following condition on the parameterization:
PARAM. The parameter set Θ is compact. Furthermore, the linear parameterization is identifiable: for all v, if two regression vectors βv and β′v yield the same linear combination for all (A, W) compatible with v, then necessarily βv = β′v.
Characterizing the initial estimate.
Let us set a reference fixed design gr. We now characterize the initial estimate of Q0 by letting
| (11) |
where
| (12) |
is a weighted maximum likelihood estimator with respect to the working model. Thus, the vth component θn(v) of θn satisfies
for every v. Note that this initial estimate of Q0 yields the initial maximum likelihood based substitution estimator of ψ0:
Studying the initial estimate through θn.
For simplicity, let us introduce, for all θ ∈ Θ, the additional notations
The first asymptotic property of θn that we derive concerns its consistency (see Theorem 5 in (van der Laan, 2008)).
Proposition 2 (consistency of θn). Assume that:
A1. There exists a unique interior point θ0 ∈ Θ such that θ0 = arg maxθ∈Θ PQ0,grℓθ,0.
A2. The matrix of second derivatives −∂2/∂θ∂θ⊤ PQ0,gr ℓθ,0, evaluated at θ = θ0, is positive definite.
Provided that the observed data structure O takes its values in a bounded set, θn consistently estimates θ0.
The proof of Proposition 2 is given in Section A.1.
The limit in probability of θn has a nice interpretation in terms of projection of QY|A,W(P0) onto {QY|A,W(θ) : θ ∈ Θ}. Preferring to discuss this issue in terms of data generating distributions rather than conditional distributions, let us associate with each θ ∈ Θ the distribution Pθ,gr obtained by substituting QY|A,W(θ) for QY|A,W(P0) in PQ0,gr, and assume that PQ0,gr log QY|A,W(P0) is well defined (this weak assumption concerns Q0, not gr, and holds for instance when |log QY|A,W(P0)| is bounded). Then A1 is equivalent to Pθ0,gr being the unique Kullback-Leibler projection of PQ0,gr onto the set
In addition to being consistent, θn actually satisfies a central limit theorem if supplementary mild conditions are met. The latter central limit theorem is embedded in a more general result that we state in Section 4.3, see Proposition 5.
Furthermore, maximizing a weighted version of the log-likelihood is a technical twist that makes the theoretical study of the properties of θn easier. Indeed, the unweighted maximum likelihood estimator
targets the parameter
where the expectation is now taken under the adaptive design actually used. Therefore, tn asymptotically targets the limit, if it exists, of the corresponding sequence of parameters. Assuming that the adaptive design converges itself to a fixed design g∞, tn asymptotically targets a parameter defined like θ0 but with g∞ substituted for gr. The latter parameter is very difficult to interpret and to analyze, as it depends directly and indirectly (through g∞) on Q0.
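For a Gaussian working model, the weighted maximum likelihood estimator (12) reduces to weighted least squares. The sketch below assumes weights of the form gr(Ai|Wi)/gi(Ai|Wi), tilting the sample back toward the reference design gr; this particular form of the weights is an assumption made for illustration, as is the toy data generating distribution.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Simulated trial data whose design drifted away from the balanced
# reference design gr(1|W) = 1/2.
W = rng.uniform(size=n)
g_i = np.full(n, 0.7)                 # allocation probabilities actually used
A = rng.binomial(1, g_i)
Y = 1.0 + 2.0 * A + 0.5 * W + rng.normal(scale=0.3, size=n)

# Assumed weights gr(A_i|W_i) / g_i(A_i|W_i); with a Gaussian working model,
# the weighted maximum likelihood estimator is the weighted least squares fit.
gr = 0.5
w = np.where(A == 1, gr / g_i, (1 - gr) / (1 - g_i))
X = np.column_stack([np.ones(n), A, W])
beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * Y))
```

The fitted coefficients approach (1, 2, 0.5), the parameters of the simulated regression, despite the unbalanced design.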
4.2. Convergence of the adaptive design
Consider the mapping G★ from Θ to the set of treatment mechanisms (respectively equipped with the Euclidean distance and, for instance, the supremum distance), such that, for any θ ∈ Θ and any v,
| (13) |
Equation (13) characterizes G★, which is obviously continuous. Since the design is adapted in such a way that gn+1 = G★(θn) (see (8)), Proposition 2 and the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) straightforwardly imply the following result (the convergence in probability yields the convergence in L1 because the design is uniformly bounded).
Proposition 3 (convergence of the adaptive design). Under the assumptions of Proposition 2, the adaptive design gn converges in probability and in L1 to the limit design G★(θ0).
The convergence of the adaptive design is a crucial result. It is noteworthy that the limit design G★(θ0) equals the optimal design g★(Q0) if the working model is correctly specified (which never happens in real-life applications), but not necessarily otherwise. Furthermore, the relationship gn+1 = G★(θn) also entails the possibility to derive the convergence in distribution of the centered and √n-rescaled design to a centered Gaussian distribution with known variance, by application of the delta-method (G★ is differentiable) from a central limit theorem on θn (see Proposition 5 and Theorem 3.1 in (van der Vaart, 1998)).
4.3. TMLE
Fluctuating the initial estimate.
The second step of the TMLE procedure stretches the initial estimate in the direction of the parameter of interest, through a maximum likelihood step over a well-chosen fluctuation of the initial estimate. The latter fluctuation is just a one-dimensional parametric model indexed by a parameter ε which ranges over a bounded interval containing a neighborhood of the origin. Specifically, we set for all ε in this interval:
| (14) |
where for any θ ∈ Θ,
| (15) |
with
In particular, the fluctuation goes through the initial estimate at ε = 0 (i.e., QY|A,W(θ, 0) = QY|A,W(θ)). Let Pθ,ε be a data generating distribution such that QY|A,W(Pθ,ε) = QY|A,W(θ, ε). The conditional mean of Y given (A, W) under Pθ,ε, which we also denote by Q̄(θ, ε), is
Furthermore, the score at ε = 0 of the fluctuation equals the second component of the efficient influence curve of Ψ at the fluctuated distribution, see (2) in Proposition 1.
Characterizing the update yields the TMLE.
We characterize the update of the initial estimate in the fluctuation by
where
| (16) |
is a weighted maximum likelihood estimator with respect to the fluctuation. It is worth noting that εn is known in closed form (we assume, without serious loss of generality, that the interval over which ε ranges is large enough for the maximum to be achieved in its interior). In terms of the components θn(v) of θn, it holds that
The notation for this first update reflects the fact that the TMLE procedure, which is in greater generality an iterative procedure, converges here in one single step. Indeed, say that one fluctuates the updated estimate as we fluctuated the initial one, i.e., by introducing a second fluctuation QY|A,W(θ, ε′, ε), equal to the right-hand side of (15) where one substitutes the updated conditional distribution for the initial one. Say that one then defines a second weighted maximum likelihood estimator as the right-hand side of (16) where one substitutes QY|A,W(θn, εn, ε) for QY|A,W(θn, ε). Then it follows that the new maximizer is ε = 0, so that the “updated” estimate coincides with the previous one. The updated estimator of Q0 maps into the TMLE of the risk difference ψ0 = Ψ(Q0):
| (17) |
The asymptotic study of the TMLE relies on a central limit theorem for (θn, εn), which we develop in Section 5.
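The one-step nature of the update can be illustrated with a linear fluctuation of the conditional mean along the “clever covariate” (2A − 1)/g(A|W) taken from D★2 in (2), a common variant for a continuous outcome; the actual fluctuation (15) may differ in form, so the sketch below is indicative only, and its data generating distribution is a toy example. Starting from a deliberately misspecified initial fit, the single fluctuation step recovers the true risk difference.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# Toy trial: known randomization probability g1, true risk difference 1.0.
W = rng.uniform(size=n)
g1 = 0.6
A = rng.binomial(1, g1, size=n)
Y = 0.2 + 1.0 * A + 0.5 * W + rng.normal(scale=0.2, size=n)

# Deliberately misspecified initial fit (ignores W, wrong coefficients).
def Qbar_init(a, w):
    return 0.4 + 0.8 * a

# Linear fluctuation Qbar_eps = Qbar_init + eps * H along the "clever
# covariate" H(A, W) = (2A - 1)/g(A|W); the maximizer eps is then available
# in closed form, as the text notes for eps_n.
H = (2 * A - 1) / np.where(A == 1, g1, 1 - g1)
eps_n = np.sum(H * (Y - Qbar_init(A, W))) / np.sum(H * H)

# Updated fit and TMLE (17) of the risk difference:
H1, H0 = 1.0 / g1, -1.0 / (1 - g1)    # clever covariate at A = 1 and A = 0
psi_tmle = np.mean((Qbar_init(1, W) + eps_n * H1)
                   - (Qbar_init(0, W) + eps_n * H0))
```

Although the initial fit has risk difference 0.8, the fluctuated substitution estimator concentrates around the true value 1.0.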
5. Asymptotics
5.1. Studying the TMLE through (θn, εn): consistency
We now state and comment on a consistency result for the stacked estimator (θn, εn) which complements Proposition 2 (see Theorem 8 in (van der Laan, 2008)). For simplicity, let us generalize the notation ℓθ,0 introduced in Section 4.1 by setting, for all (θ, ε),
Moreover, let us set, for all (θ, ε),
| (18) |
Proposition 4 (consistency of (θn,εn)). Suppose that A1 and A2 from Proposition 2 hold. In addition, assume that:
A3. There exists a unique interior point ε0 of the fluctuation parameter interval such that ε0 = arg maxε PQ0,G★(θ0) ℓθ0,ε.
Then:

(i) It holds that Ψ(Qθ0,ε0) = ψ0.

(ii) Provided that the observed data structure O takes its values in a bounded set, (θn, εn) consistently estimates (θ0, ε0).
The proof of Proposition 4 is given in Section A.1.
We already discussed the interpretation of the limit in probability of θn in terms of Kullback-Leibler projection. Likewise, the limit in probability ε0 of εn enjoys such an interpretation. Let us assume that log QY|A,W(P0) is well defined (this weak assumption concerns Q0, not G★(θ0), and holds for instance when |log QY|A,W(P0)| is bounded). Then A3 is equivalent to the distribution indexed by (θ0, ε0) being the unique Kullback-Leibler projection of PQ0,G★(θ0) onto the set of distributions spanned by the fluctuation.
Of course, the most striking property that ε0 enjoys is (i): even if the working model is misspecified, it holds that Ψ(Qθ0,ε0) = ψ0. This remarkable equality and the convergence of (θn, εn) to (θ0, ε0) are evidently the keys to the consistency of the TMLE. We investigate in Section 5.3 how the consistency result stated in Proposition 4 translates into the consistency of the TMLE.
5.2. Studying the TMLE through (θn, εn): central limit theorem
We now state and comment on a central limit theorem for the stacked estimator (θn, εn) (see also Theorem 9 in (van der Laan, 2008)).
Let us introduce, for all (θ, ε),
| (19) |
and its value at (θ0, ε0).
Proposition 5 (central limit theorem for (θn, εn)). Suppose that A1, A2 and A3 from Propositions 2 and 4 hold. In addition, assume that:

A4. If a deterministic function F is such that F(O) = 0 holds almost surely (under the relevant limit distribution), then F = 0.

Then the following asymptotic linear expansion holds:
| (20) |
where
| (21) |
is an invertible matrix. Furthermore, (20) entails that the centered and √n-rescaled (θn, εn) converges in distribution to the centered Gaussian distribution with covariance matrix S0−1Σ0(S0−1)⊤, where
| (22) |
is a positive definite symmetric matrix. Moreover, S0 is consistently estimated by
and Σ0 is consistently estimated by
| (23) |
The proof of Proposition 5 is given in Section A.2. We investigate in Section 5.3 how the above central limit theorem translates into a central limit theorem for the TMLE.
5.3. Consistency and asymptotic normality of the TMLE
In this section, we finally state and comment on the asymptotic properties of the TMLE.
TMLE is consistent and asymptotically Gaussian.
In the first place, the TMLE is robust: it is a consistent estimator even when the working model is misspecified (which is always the case in real-life applications).
Proposition 6 (consistency of the TMLE). Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Then the TMLE consistently estimates the risk difference ψ0.
If the design of the clinical trial were fixed (and consequently, the first n observations were i.i.d.), then the TMLE would be a robust estimator of ψ0: even if the working model is misspecified, the TMLE still consistently estimates ψ0 because the treatment mechanism is known (or can be consistently estimated, if one wants to gain in efficiency). Thus, the robustness of the TMLE stated in Proposition 6 is the expected counterpart of the TMLE's robustness in the latter i.i.d. setting: expected because the TMLE solves a martingale estimating function that is unbiased for ψ0 at misspecified Q and correctly specified gi, i = 1, …, n.
In the second place, the TMLE is asymptotically linear, and therefore satisfies a central limit theorem. To see this, let us introduce the real-valued function ϕ such that ϕ(θ, ε) = Ψ(Qθ,ε) (see (18) for the definition of Qθ,ε). The function ϕ is differentiable on the interior of its domain, and we denote its gradient at (θ, ε). The latter gradient satisfies
| (24) |
Note that the right-hand side expression cannot be computed explicitly because the marginal distribution QW(P0) is unknown. By the law of large numbers (independent case), we can build an estimator of as follows. For B a large number (say B = 104), simulate B independent copies Õb of O from the data generating distribution , then compute
| (25) |
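The Monte Carlo step in (25) is simple to implement. A minimal sketch follows, in which `draw_w` and `h` are hypothetical stand-ins for, respectively, sampling from the estimated marginal distribution of W and evaluating the integrand of (24) (neither is reproduced from the paper); only the averaging scheme is illustrated:

```python
import random

def draw_w(rng):
    # Hypothetical stand-in for sampling W from the estimated marginal law of W.
    return rng.uniform(0.0, 1.0)

def h(w):
    # Hypothetical stand-in for the integrand appearing in (24).
    return w * w

def monte_carlo_estimate(B=10_000, seed=1):
    # Average h over B independent simulated copies, as in (25).
    rng = random.Random(seed)
    return sum(h(draw_w(rng)) for _ in range(B)) / B
```

With these stand-ins, the law of large numbers gives an estimate close to E[W²] = 1/3; in the paper, the same averaging is applied to B ≃ 10⁴ copies Õb simulated from the estimated data-generating distribution.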
Proposition 7 (central limit theorem for ). Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Then the following asymptotic linear expansion holds:
| (26) |
where
| (27) |
Furthermore, (26) entails that converges in distribution to the centered Gaussian distribution with a variance consistently estimated by
Proposition 7 is the backbone of the statistical analysis of adaptive group sequential RCTs as constructed in Section 4. In particular, denoting the (1 − α)-quantile of the standard normal distribution by ξ1−α, the proposition guarantees that the asymptotic level of the confidence interval
| (28) |
for the risk difference ψ0 is (1 − 2α). The proofs of Propositions 6 and 7 are given in Section A.3.
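Since the display (28) is not reproduced above, here is a sketch of the standard Wald-type construction consistent with the surrounding text (a two-sided interval built from the (1 − α)-quantile ξ1−α of the standard normal distribution); `psi_n` and `s_n` are assumed computed beforehand:

```python
from math import sqrt
from statistics import NormalDist

def tmle_confidence_interval(psi_n, s_n, n, alpha=0.05):
    """Wald-type interval psi_n +/- xi_{1-alpha} * s_n / sqrt(n).

    psi_n: TMLE point estimate; s_n: estimated standard deviation of the
    influence curve. Both are assumed computed beforehand.
    """
    xi = NormalDist().inv_cdf(1.0 - alpha)  # (1 - alpha)-quantile of N(0, 1)
    half_width = xi * s_n / sqrt(n)
    return (psi_n - half_width, psi_n + half_width)
```

For instance, `tmle_confidence_interval(1.264, 4.8, 1000)` (hypothetical inputs) returns an interval centered at 1.264.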
Extensions.
We conjecture that the influence function IC computed at (O, Z), see (27), is equal to
This conjecture is backed by the simulations that we carry out and present in Section 7. We will tackle the proof of the conjecture in future work. Let us assume for the moment that the conjecture is true. Then the asymptotic linear expansion (26) now implies that the asymptotic variance of can be consistently estimated by
another independent argument also showing that converges toward
i.e., the variance under the fixed design of the efficient influence curve at .
Furthermore, the most essential characteristic of the joint methodologies of design adaptation and targeted maximum likelihood estimation is certainly the utmost importance of the role played by the likelihood. In this view, the targeted maximized log-likelihood of the data
provides us with a quantitative measure of the quality of the fit (targeted toward the parameter of interest). It is therefore possible, for example, to use that quantity to select among different working models for Q0. As with TMLE for i.i.d data, we can use likelihood-based cross-validation to select among more general initial estimators indexed by fine-tuning parameters. The validity of such TMLEs for the group sequential adaptive designs studied here is outside the scope of this article.
6. Application to group sequential testing
We derive in this section a group sequential testing procedure, that is, a testing procedure which repeatedly tries to make a decision at intervals, rather than once all the data are collected or after every new observation is obtained (the latter would be said to be fully sequential). We refer to (Jennison and Turnbull, 2000; Proschan et al., 2006) for a general presentation of group sequential testing procedures. The TMLE group sequential testing procedure is formally described in Section 6.1, and some arguments justifying why it should work well (as validated by the simulation study carried out in Section 8) are presented in Section 6.2.
6.1. Description of the TMLE group sequential testing procedure
The problem at stake is to test the null “Ψ(Q) = ψ0” against “Ψ(Q) > ψ0” with asymptotic type I error α and asymptotic type II error β at some ψ1 > ψ0. We want to proceed group sequentially with K ≥ 2 steps, based on the multi-dimensional t-statistic
| (29) |
Here, N1,…,NK are random sample sizes whose realizations on a specific trajectory depend on how fast information accrues as the data are collected. In order to quantify this notion of information, we consider the inverse of the estimated variance of the TMLE based on the first n observations On (as a proxy for its true, finite sample, inverse variance). Given a reference maximum committed information Imax and K increasing proportions 0 < p1 < … < pK = 1, we set for every k ≤ K
The characterization of Imax depends on how we wish to “spend” the type I and type II errors at each step of the group sequential procedure, and on how demanding the power requirement is (i.e., how close ψ1 is to ψ0). Say that our spending strategies are summarized by the K-tuples of positive numbers (α1,…,αK) and (β1,…,βK) such that and .
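Assuming, consistently with the text above, that Nk is defined as the first sample size n at which the estimated information 1/Var(ψn) reaches pk Imax, the rule can be sketched as follows (over a grid of candidate sample sizes at which the information estimate has been computed):

```python
def interim_sample_sizes(inv_var_estimates, proportions, i_max):
    """Sketch of the information-based rule: N_k is the first sample size n at
    which the estimated information (inverse variance of the TMLE based on the
    first n observations) reaches p_k * I_max.

    inv_var_estimates: dict mapping candidate sample size n -> estimated
    information at n (assumed computed beforehand).
    """
    sizes = {}
    for k, p in enumerate(proportions, start=1):
        target = p * i_max
        reached = [n for n, info in sorted(inv_var_estimates.items()) if info >= target]
        sizes[k] = reached[0] if reached else None  # None: target not reached yet
    return sizes
```

For example, with a hypothetical information curve `{100: 10.0, 200: 20.0, 300: 31.0, 400: 42.0}`, proportions (0.25, 0.5, 0.75, 1) and Imax = 40, the interim analyses occur at sample sizes 100, 200, 300 and 400.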
Now, let (Z1,…,ZK) be distributed from the centered Gaussian distribution with covariance matrix We assume that there exists a unique value I > 0, our Imax, such that there exist a rejection boundary (a1,…,aK) and a futility boundary (b1,…,bK) satisfying aK = bK, P(Z1 ≥ a1) = α1, , and for every 1 ≤ k < K,
Note that the closer ψ1 is to ψ0, the larger Imax is (actually, is both upper bounded and bounded away from zero). Heuristically, the closer ψ1 is to ψ0, the more difficult it is to decide between the null and its alternative while preserving the required type II error at ψ1, and the more information is needed to do so.
The targeted maximum likelihood group sequential testing procedure finally goes as follows: starting from k = 1,
if then reject the null and stop accruing data,
if then fail rejecting the null and stop accruing data,
if then set k ← k + 1 and repeat.
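The three-case rule above can be sketched as follows (the t-statistics and the boundaries (a1, …, aK), (b1, …, bK) are assumed computed beforehand, e.g. as in Table 5):

```python
def group_sequential_decision(t_stats, rejection, futility):
    """Sketch of the stopping rule of Section 6.1.

    t_stats:   t-statistics T_{N_1}, ..., T_{N_K} as they become available
    rejection: rejection boundary (a_1, ..., a_K), with a_K == b_K
    futility:  futility boundary (b_1, ..., b_K)
    Returns (decision, step at which sampling stopped).
    """
    K = len(t_stats)
    for k in range(K):
        if t_stats[k] >= rejection[k]:
            return "reject", k + 1           # crossed the rejection boundary: stop
        if t_stats[k] <= futility[k]:
            return "fail to reject", k + 1   # crossed the futility boundary: stop
        # otherwise, accrue data until the next interim analysis
    return "fail to reject", K  # unreachable when rejection[-1] == futility[-1]
```

Because aK = bK, a decision is necessarily reached at step K at the latest.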
If () had the same distribution as (Z1,…,ZK), then the latter rule would yield a testing procedure with the required type I error and type II error at the specified alternative parameter.
Clearly, our decision to target the optimal design G★(θ0), which reduces as much as possible (over and for our choice of working model) the asymptotic variance of the TMLE, guarantees, at least informally, that each Nk is stochastically smaller under (Q0, )-adaptive sampling than it would have been had another fixed design been used (or targeted). Thus, resorting to (Q0, )-adaptive sampling is likely to result in an earlier conclusion than another fixed (or targeted) sampling scheme would have yielded.
6.2. Rationale of the TMLE group sequential testing procedure
The next proposition partially justifies the characterization of the TMLE group sequential testing procedure of Section 6.1. First, let us define (the smallest integer not smaller than npk) for every k ≤ K, then
This other multi-dimensional t-statistic is a substitute for , whose asymptotic study is easier to carry out. In particular, it is possible to derive the limit distribution of (T1,…,TK) under the null and under a sequence of contiguous alternatives. The limit distribution is called the canonical distribution (see Theorems 11 and 12 in (van der Laan, 2008), Theorem 3 in (Chambaz and van der Laan, 2011) and Theorem 2.1 in (Zhu and Hu, 2010), where a similar result is obtained through a different approach, based on the study of the limit distribution of a stochastic process defined over (0, 1]).
Proposition 8. Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Consider , f ≠ 0, with and a fluctuation of Q0 with score f. Set for all n ≥ n0 (n0 is such that ). Assume also that A5. The score f is bounded, it is not proportional to , , and .
The sequence defines a sequence (ψn)n≥n0 of contiguous parameters (“from direction f”, which only fluctuates the conditional distribution of Y given (A, W)), with for n large enough. Introduce
(i) Under (Q0, gn)-adaptive sampling, (T1,…,TK) converges in distribution to the centered Gaussian distribution with covariance matrix .

(ii) Under -adaptive sampling, (T1,…,TK) converges in distribution to the Gaussian distribution with mean μ(f) and covariance matrix .
The proof of Proposition 8 is given in Section A.4.
The rationale follows. Say that one concludes from Proposition 8 that (i) () is also approximately distributed as (Z1,…,ZK) under -adaptive sampling. Likewise, say that is also approximately distributed as (Z1,…,ZK) under -adaptive sampling, with Ψ(Q1) = ψ1 > ψ0. Say that one is willing to substitute pkImax for for each k ≤ K. Then (ii) () is approximately distributed as under -adaptive sampling. It appears that (i) and (ii) suffice to guarantee the wished asymptotic control of the type I and type II errors. The rationale is validated by the simulation study undertaken in Section 8.
7. Simulation study of the performances of TMLE in adaptive covariate-adjusted RCTs
In this section, we carry out a simulation study of the performances of TMLE in adaptive group sequential RCTs as exposed in the previous sections. We present the simulation scheme in Section 7.1. The working model upon which the TMLE methodology relies is described in Section 7.2. How the TMLE-based confidence intervals behave is considered in Section 7.3 (empirical coverage) and in Section 7.4 (empirical width). An illustrating example is finally presented in Section 7.5.
7.1. Simulation scheme
We characterize the component Q0 = Q(P0) of the true distribution P0 of the observed data structure O = (W, A, Y) as follows:
- the baseline covariate W = (U, V), where U is uniformly distributed over the unit interval [0, 1] and the subgroup membership covariate V ∈ 𝒱 = {1, 2, 3} (hence ν = 3) satisfies
- the conditional distribution of Y given (A, W) is the Gamma distribution characterized by the conditional mean
with ρ = 1 (we will set ρ to another value in Section 8) and the conditional variance given in (30).
In particular, the risk difference ψ0 = ψ(Q0), our parameter of interest, is known in closed form:
and so is the variance
of the efficient influence curve under balanced sampling. The numerical value of vb(Q0) is reported in Table 1.
Table 1:
Numerical values of the allocation probabilities and variance of the efficient influence curve under either the balanced gb or the targeted optimal g★(Q0) sampling schemes. The ratio of the variances of the efficient influence curve under targeted optimal and balanced sampling schemes satisfies R(Q0) = v*(Q0)/vb(Q0) ≃ 0.762.
| Sampling scheme (Q0, g) | g(1|v = 1) | g(1|v = 2) | g(1|v = 3) | Variance |
|---|---|---|---|---|
| (Q0, gb)-balanced | 0.5 | 0.5 | 0.5 | 23.864 |
| (Q0, g★(Q0))-optimal | 0.849 | | | 18.181 |
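The sampling schemes compared in this study can be sketched as follows. The exact conditional mean and variance formulas of Section 7.1 are not reproduced above, so `cond_mean` and `cond_var` below are hypothetical placeholders; the sketch only shows how one draws O = (W, A, Y) with Y Gamma-distributed with prescribed conditional mean m and variance v (shape m²/v, scale v/m), the design entering through g(1|v):

```python
import random

def sample_observation(rng, g, cond_mean, cond_var):
    # One draw O = (W, A, Y) under (Q, g)-sampling; g(v) = P(A = 1 | V = v).
    u = rng.uniform(0.0, 1.0)
    v = rng.choices((1, 2, 3), weights=(1, 1, 1))[0]  # hypothetical law of V
    w = (u, v)
    a = 1 if rng.random() < g(v) else 0
    m, var = cond_mean(a, w), cond_var(a, w)  # placeholders, not the paper's formulas
    shape, scale = m * m / var, var / m       # Gamma(shape, scale): mean m, variance var
    return w, a, rng.gammavariate(shape, scale)
```

Under the balanced design gb, one would call, e.g., `sample_observation(random.Random(7), lambda v: 0.5, lambda a, w: 2.0, lambda a, w: 4.0)` (the two lambdas being purely illustrative).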
We target the design which (a) depends on the baseline covariate W = (U, V) only through V (i.e., belongs to ) and (b) minimizes the variance of the efficient influence curve of the parameter of interest ψ. The latter treatment mechanism g★(Q0) and optimal efficient asymptotic variance
are also known in closed form, and numerical values are reported in Table 1.
Let (n1, …, n7) = (100, 250, 500, 750, 1000, 2500, 5000) be a sequence of sample sizes. We estimate M = 1000 times the risk difference ψ0 = ψ(Q0) based on , m = 1,..., M, i = 1,..., 7, under
i.i.d (Q0, gb)-balanced sampling,
i.i.d (Q0, g★(Q0))-optimal sampling,
-adaptive sampling.
Finally, we emphasize that the observed data structure O = (W, A, Y) is not bounded, whereas O is assumed bounded in Propositions 2, 3, 4, 5, 6 and 7.
7.2. Specifying the working model, reference design and a few twists
On the working model.
For each v ∈ 𝒱, let us denote , with Θv compact, where the regression vector βv = (βv,1, βv,2, βv,3) (b = 3 in (10)); then θ = (θ1, θ2, θ3) ∈ Θ = Θ1 × Θ2 × Θ3.
Following the description in Section 4.1, the working model that the TMLE methodology relies on is characterized by the conditional likelihood of Y given (A, W):
with the specific choice of conditional mean of Y given (A, W):
for all and . As required, condition PARAM is met. Obviously, the working model is heavily misspecified:
a Gaussian conditional likelihood is used instead of a Gamma conditional likelihood,
the parametric forms of the conditional expectation and variance are wrong too.
On the reference design.
Regarding the choice of a reference fixed design gr ∈ (see Section 4.1), we select gr = gb (the balanced design). The parameter θ0 only depends on Q0 and the working model, but its estimator θn depends on gr, which may negatively affect its performance. Therefore, we propose to dilute the impact of the choice of gr as an initial reference design as follows.
For a given sample size n, we first compute a preliminary estimate of θ0 as in (12), but with (the smallest integer not smaller than n/4) substituted for n in the sum. Then θn is computed as in (12), but this time with substituted for gr(Ai|Vi).
The proofs can be adapted in order to incorporate this modification of the procedure. We refer the interested reader to Section 8.5 in (van der Laan, 2008).
On additional details.
We decide arbitrarily to update the design each time c = 25 new observations are sampled. In addition, the first update only occurs once there are at least five completed observations in each treatment arm and each V-stratum. Thus, the minimal sample size at the first update is 30. It can be shown that, under initialization with the balanced design, the expected value of the first sample size at which there are at least 5 observations in each arm equals 75. Finally, as a precautionary measure, we systematically apply a thresholding to the updated treatment mechanism: Using the notation of Section 4, is substituted for in all computations. We arbitrarily choose δ = 0.01.
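The two precautions above (update cadence and thresholding) can be sketched as follows; the allocation probabilities are represented as a dict keyed by stratum, an illustrative convention not taken from the paper:

```python
def threshold_design(g, delta=0.01):
    """Precautionary thresholding: clip each allocation probability g(1|v)
    into [delta, 1 - delta] (delta = 0.01 in the simulation study)."""
    return {v: min(max(p, delta), 1.0 - delta) for v, p in g.items()}

def design_update_times(first_update, total_n, c=25):
    """Sample sizes at which the design is updated: a first update once every
    stratum is sufficiently filled (sample size `first_update`, at least 30),
    then one update each time c = 25 new observations are sampled."""
    return list(range(first_update, total_n + 1, c))
```

For instance, with a first update at sample size 30 and 105 observations in total, the design is updated at sample sizes 30, 55, 80 and 105.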
7.3. Empirical coverage of the TMLE confidence intervals
We now invoke the central limit theorem stated in Proposition 7 to construct confidence intervals for the risk difference. Let us introduce, for all types of sampling and each sample size ni, the confidence intervals
where the definition of the variance estimator based on the first n observations depends on the sampling scheme:
under i.i.d (Q0, gb)-balanced sampling, is the estimator of the asymptotic variance of the TMLE :
| (31) |
under i.i.d (Q0, g★(Q0))-optimal sampling, is defined as in (31), replacing gb with g★(Q0),
under -adaptive sampling, , the estimator of the conjectured asymptotic variance of computed on the first n observations .
We are interested in the empirical coverage (reported in Table 2, top rows)
guaranteed for each sampling scheme and every i = 1,..., 7 by . The rescaled empirical coverage proportions should follow a Binomial distribution with parameter (M, 1 − a), with a = α, for every i = 1,..., 7. This property can be tested with a standard binomial test, the alternative stating that a > α. This results in a collection of 7 p-values for each sampling scheme, as reported in Table 2 (bottom rows).
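The binomial test can be sketched as follows: under the null, the number of intervals covering ψ0 among the M replications is Binomial(M, 1 − α), and small counts are evidence for the alternative a > α, so the p-value is the lower tail probability at the observed count:

```python
from math import comb

def coverage_pvalue(hits, M=1000, level=0.95):
    """Exact one-sided binomial test: returns P(Bin(M, level) <= hits), the
    p-value against the alternative that the true coverage is below `level`.

    hits: number of the M simulated confidence intervals that contain psi_0.
    """
    return sum(comb(M, j) * level**j * (1.0 - level) ** (M - j)
               for j in range(hits + 1))
```

For example, a coverage count of 952 out of M = 1000 at α = 5% yields a p-value of roughly 0.6, of the same order as the adaptive entry of Table 2 at n7.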
Table 2:
Checking the adequateness of the coverage guaranteed by our simulated confidence intervals. We test if the rescaled empirical coverage Binomial random variables have parameter (M, 1 – α), the alternative stating that they have parameter (M, 1 – a) with a > α. We report the values (top row) and corresponding p-values (bottom row, between parentheses) of the Binomial test for each sample size and each sampling scheme.
| Sampling scheme | |
|---|---|
| n7 | |
| adaptive | 0.952 (0.634) |
Considering each sampling scheme (i.e., each row of Table 2) separately, we conclude that the (1 – α)-coverage cannot be declared defective under
i.i.d (Q0, gb)-balanced sampling for any sample size ni ≥ n3 = 500,
i.i.d (Q0, g★(Q0))-optimal sampling for any sample size ni ≥ n2 = 250,
-adaptive sampling for any sample size ni ≥ n1 = 100,
when adjusting for multiple testing with the Benjamini and Yekutieli procedure for controlling the False Discovery Rate at the 5% level.
This is a remarkable result that not only validates the theory but also provides us with insight into the finite sample properties of the TMLE procedure based on adaptive sampling. The fact that the TMLE procedure behaves better under the adaptive sampling scheme than under the balanced i.i.d sampling scheme at sample size n1 = 100 may not be due to mere chance. Although the TMLE procedure based on an adaptive sampling scheme is initiated under the balanced sampling scheme (so that each stratum initially consists of comparable numbers of patients assigned to each treatment arm, allowing one to estimate, at least roughly, the required parameters), it starts deviating from it (as soon as every (A, V)-stratum counts 5 patients) each time 25 new observations are accrued. The poor performance of the TMLE procedure based on the optimal i.i.d sampling scheme at sample size n1 is certainly due to the fact that, by starting directly from the optimal sampling scheme (a choice we would not recommend in practice), too few patients from stratum V = 3 are assigned to treatment arm A = 0 among the first n1 subjects. At larger sample sizes, the TMLE procedure performs equally well in terms of coverage under the adaptive sampling scheme and under both i.i.d schemes.
7.4. Empirical width of the TMLE confidence intervals
Now that we know that the TMLE-based confidence intervals based on -adaptive sampling are valid confidence regions, it is of interest to compare their widths with those of their counterparts obtained under the i.i.d (Q0, gb)-balanced or (Q0, g*(Q0))-optimal sampling schemes.
For this purpose, we compare, for each sample size ni, the empirical distribution of as in (31) (i.e., the empirical distribution of the widths of the TMLE-based confidence intervals at sample size ni obtained under i.i.d (Q0, gb)-balanced sampling, up to the factor ) to the empirical distribution of (i.e., the empirical distribution of the widths of the TMLE-based confidence intervals at sample size ni obtained under adaptive sampling, up to the factor ). We rely on the two-sample Kolmogorov-Smirnov test, the alternative stating that the confidence intervals obtained under adaptive sampling are stochastically smaller than their counterparts under i.i.d balanced sampling. This results in 7 p-values, all equal to zero, which we nonetheless report in Table 3 (bottom row). In order to get a sense of how much narrower the confidence intervals obtained under adaptive sampling are, we also compute and report in Table 3 (top row) the ratios of empirical average widths
| (32) |
for each sample size ni. Informally, this shows a 12% gain in width.
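The one-sided two-sample Kolmogorov-Smirnov statistic underlying this comparison can be sketched as follows (the p-value computation is omitted); D+ is large when the first sample (here, widths under adaptive sampling) is stochastically smaller than the second (widths under i.i.d balanced sampling):

```python
def one_sided_ks_statistic(x, y):
    """One-sided two-sample Kolmogorov-Smirnov statistic
    D+ = sup_t (F_x(t) - F_y(t)), where F_x and F_y are the empirical
    distribution functions of the samples x and y."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))

    def ecdf(sample, t):
        return sum(1 for s in sample if s <= t) / len(sample)

    return max(ecdf(xs, t) - ecdf(ys, t) for t in points)
```

For instance, with the toy samples [1, 2, 3] (shifted down) and [2, 3, 4], the statistic equals 1/3, whereas swapping the two samples gives 0.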
Table 3:
Comparing the width of our confidence intervals. On one hand, we test for each sample size ni if the TMLE-based confidence intervals obtained under -adaptive sampling are narrower stochastically than the TMLE-based confidence intervals obtained under i.i.d (Q0, gb)-balanced sampling in terms of the two-sample Kolmogorov-Smirnov test. On the other hand, we test for each sample size ni if the TMLE-based confidence intervals obtained under -adaptive sampling are wider stochastically than the TMLE-based confidence intervals obtained under i.i.d (Q0,g*(Q0))-optimal sampling in terms of the two-sample Kolmogorov-Smirnov test. We report the p-values (bottom rows, between parentheses). In addition, we report for each sample size ni the ratios of average widths as defined in (32).
| n7 | |
|---|---|
| 1.000 (0.144) |
On the other hand, we also compare, for each sample size ni, the empirical distribution of as in (31) but with gb replaced by g*(Q0) (i.e., the empirical distribution of the widths of the TMLE-based confidence intervals at sample size ni obtained under i.i.d (Q0, g*(Q0))-optimal sampling, up to the factor ) to the empirical distribution of (i.e., the empirical distribution of the widths of the TMLE-based confidence intervals at sample size ni obtained under -adaptive sampling, up to the factor ). We rely again on the two-sample Kolmogorov-Smirnov test, the alternative now stating that the confidence intervals obtained under adaptive sampling are stochastically larger than their counterparts under i.i.d optimal sampling. This results in 7 p-values that we report in Table 3 (bottom row). In order to get a sense of how similar the confidence intervals obtained under adaptive and i.i.d optimal sampling schemes are, we also compute and report for each sample size ni in Table 3 (top row) the ratios of empirical average widths, as in (32), again replacing gb by g*(Q0) in the definition (31) of . Informally, this shows that the confidence intervals obtained under adaptive sampling are on average even slightly narrower than their counterparts obtained under i.i.d optimal sampling.
7.5. Illustrating example
So far we have been concerned with distributional results, answering the questions Does the confidence interval provide the wished coverage? (yes, even for moderate sample sizes: see Section 7.3) and How does its width compare with the width of the confidence interval obtained under either i.i.d sampling scheme? (well: see Section 7.4). In this section, we focus on a particular simulated trajectory (we arbitrarily select the first one, associated with ) for the sake of illustration.
Some interesting features of the selected simulated trajectory are apparent in Fig. 7.5 and Table 4.
Table 4:
Illustrating the TMLE procedure under -adaptive sampling scheme. This table illustrates how the TMLE procedure behaves (on a simulated trajectory) as the sample size increases. We report, at each sample size ni, the updated adapted design (columns two to four), current TMLE (column five) and 95% confidence interval In,1 as in (28), with substituted for sn. The true risk difference ψ0 ≃ 1.264 belongs to all confidence intervals. See also Fig. 7.5.
| n |
|---|
For instance, we can follow the convergence of the TMLE toward the true risk difference 𝜓0 in the top plot of Fig. 7.5 and in the fifth column of Table 4. Similarly, the middle plot of Fig. 7.5 and the second to fourth columns of Table 4 illustrate the convergence of toward G*(θ0), as stated in Proposition 3. What these plots and columns also teach us is that, in spite of the misspecified working model, the learned design G*(θ0) seems very close to the optimal treatment mechanism g*(Q0) for the chosen simulation scheme and working model used in our simulation study. Moreover, the last column of Table 4 illustrates how the confidence intervals shrink around the true risk difference 𝜓0 as the sample size increases.
Yet, the bottom plot of Fig. 7.5 may be the most interesting of the three. It obviously illustrates the convergence of toward , i.e., toward the variance under the fixed design of the efficient influence curve at . Hence, it also teaches us that the latter limit seems very close to the optimal asymptotic variance v*(Q0) for the chosen simulation scheme and working model used in our simulation study. More importantly, strikingly converges to v*(Q0) from below. This finite sample characteristic may reflect the fact that the true finite sample variance of might be lower than v*(Q0). Studying this issue further is certainly very delicate, and goes beyond the scope of this article.
8. Simulation study of the performances of the TMLE group sequential testing procedure
In this section, we resume the simulation study undertaken in Section 7 in order to investigate the performances of the TMLE group sequential testing procedure presented in Section 6. We describe the simulation scheme in Section 8.1, then evaluate how the testing procedure behaves in terms of empirical type I and type II errors in Section 8.2 and how it behaves in terms of empirical distribution of sample size at decision in Section 8.3.
8.1. The simulation scheme, continued
We wish to test the null “Ψ(P) = ψ0” against its alternative “Ψ(P) > ψ0” with prescribed type I error α = 5% and type II error β = 10% at the alternative ψ1 = ψ0 +0.4. Depending on whether we want to investigate the empirical behaviors of the TMLE group sequential testing procedure with respect to (i) type I or (ii) type II errors, we test M = 1000 times the above hypotheses based on 3 × M independent datasets obtained under
(i) empirical type I error study:
- i.i.d (Q0, gb)-balanced sampling,
- i.i.d (Q0, g*(Q0))-optimal sampling,
- -adaptive sampling;
(ii) empirical type II error study:
- i.i.d (Q1, gb)-balanced sampling,
- i.i.d (Q1, g*(Q0))-optimal sampling,
- -adaptive sampling,
where Q1 is defined as Q0 in Section 7.1 with ρ = ψ1/ψ0 (so that Ψ(Q1) = ψ1).
We decide to proceed in K = 4 steps, at proportions (p1, p2, p3, p4) = (0.25, 0.50, 0.75, 1). We choose the α- and β-spending strategies characterized by the equalities and . This set of conditions characterizes the whole group sequential testing procedure; see Table 5 for the resulting numerical values.
Table 5:
Specifics of the TMLE group sequential testing procedure
| Ψ0 | Ψ1 | Imax |
| 1.264 | 1.664 | 57.927 |
| Step k | 1 | 2 | 3 | 4 |
| Rejection boundary | 2.734 | 2.305 | 2.005 | 1.715 |
| α-spending | 0.3125% | 0.9375% | 1.5625% | 2.1875% |
| Step k | 1 | 2 | 3 | 4 |
| Futility boundary | −0.976 | 0.132 | 0.961 | 1.715 |
| β-spending | 0.625% | 1.875% | 3.125% | 4.375% |
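The spending values reported in Table 5 are consistent with cumulative spending that is quadratic in the information fraction, i.e., α p² and β p² at fraction p (an observation about the reported numbers, since the exact equalities characterizing the strategies are not reproduced above):

```python
def quadratic_spending_increments(total, proportions):
    """Error spent at each interim analysis if the cumulative spending at
    information fraction p equals total * p**2."""
    cumulative = [total * p * p for p in proportions]
    return [b - a for a, b in zip([0.0] + cumulative[:-1], cumulative)]
```

`quadratic_spending_increments(0.05, (0.25, 0.5, 0.75, 1.0))` reproduces the α-spending row of Table 5 (0.3125%, 0.9375%, 1.5625%, 2.1875%), and the β-spending row follows with total 0.10.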
8.2. Empirical type I and type II errors of the TMLE group sequential testing procedure
We report in Table 6 the empirical type I and type II errors obtained during the course of the simulation study. The values are strikingly close to each other in both cases. Even better, the empirical type I and type II errors obtained under -adaptive sampling are the lowest in both cases.
Table 6:
Checking the adequateness of the type I and type II error controls on our simulated TMLE group sequential testing procedures. We test if the Binomial random numbers of times the null is falsely rejected have parameter (M, α), the alternative stating that they have parameter (M, a) with a > α. We also test if the Binomial random numbers of times the null is falsely not rejected have parameter (M, 12%), the alternative stating that they have parameter (M, b) with b > 12%. We report the empirical type I and type II errors and corresponding p-values
| Sampling scheme | Type I error (Q = Q0) Empirical error (p-value) | Type II error (Q = Q1) Empirical error (p-value) |
|---|---|---|
| i.i.d (Q, gb)-balanced | 0.040 (0.919) | 0.132 (0.113) |
| i.i.d (Q, g★(Q))-optimal | 0.043 (0.827) | 0.128 (0.203) |
| adaptive | 0.040 (0.919) | 0.126 (0.261) |
Since the number of times the null was falsely rejected is Binomial with parameter (M, a), it is possible to test rigorously whether a = α (as it should be) against a > α. This yields three p-values that we report in Table 6. Because there are fewer than 50 wrong decisions under each sampling scheme, we naturally get large p-values, confirming (if necessary) that the type I error is under control.
Similarly, the number of times the null was falsely not rejected is also Binomial, with some parameter (M, b). Thus, it is possible to test rigorously whether b = β (as it should be) against b > β. This yields three small p-values (p < 0.001 for the i.i.d balanced sampling scheme, 0.002 for the i.i.d optimal sampling scheme and 0.003 for the adaptive sampling scheme). This confirms the impression that the numbers of wrong decisions are all significantly larger than what random deviations from the reference distribution are likely to allow. Put bluntly, the type II error is not under control. However, there is no real need to worry. Indeed, if one rather tests b = 12% against b > 12%, then the three p-values (reported in Table 6) are large.
In summary, for the considered scenario, the TMLE group sequential testing procedure described in Section 6.1 performs at least as well under the -adaptive sampling scheme as under both i.i.d (Q, gb)-balanced and (Q, g*(Q))-optimal sampling schemes. The type I error control meets the requirements. The type II error control is defective, but only slightly defective.
8.3. Empirical distribution of sample size at decision of the TMLE group sequential testing procedure
Now that we know that the TMLE group sequential testing procedure performs at least as well under adaptive sampling scheme as under both i.i.d sampling schemes, let us compare the performances in terms of sample size at decision.
For this purpose, we simply report the average sample sizes at decision for each sampling scheme when evaluating the type I and type II error controls, see Table 7. The gain in average sample size at decision obtained by resorting to the -adaptive sampling scheme instead of the i.i.d (Q, gb)-balanced sampling scheme is dramatic: On average, one needs approximately 16% fewer observations in order to reach a conclusion under the -adaptive sampling scheme than under the i.i.d (Q, gb)-balanced sampling scheme. Furthermore, it appears that it is even more efficient to resort to the i.i.d (Q, g*(Q))-optimal sampling scheme: On average, one needs approximately 6% more observations in order to reach a conclusion under the -adaptive sampling scheme than under the i.i.d (Q, g*(Q))-optimal sampling scheme.
Table 7:
Comparing the average sample sizes at decision for each sampling scheme when evaluating the type I and type II errors control.
| Sampling scheme | Type I error (Q = Q0) Average sample size | Type II error (Q = Q1) Average sample size |
|---|---|---|
| i.i.d (Q, gb)-balanced | 786.63 | 895.64 |
| i.i.d (Q, g★(Q))-optimal | 620.71 | 713.47 |
| adaptive | 661.14 | 746.72 |
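A quick arithmetic check of the percentages quoted in this section against the Table 7 averages (type I error study, Q = Q0):

```python
# Average sample sizes at decision from Table 7 (type I error study).
balanced, optimal, adaptive = 786.63, 620.71, 661.14

# Adaptive vs i.i.d balanced: about 16% fewer observations on average.
saving_vs_balanced = 1.0 - adaptive / balanced

# Adaptive vs i.i.d optimal: about 6% more observations on average.
excess_vs_optimal = adaptive / optimal - 1.0
```

The same computation on the type II error column (895.64, 713.47, 746.72) gives figures of the same order.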
In summary, for the considered scenario, the TMLE group sequential testing procedure described in Section 6.1 reaches a decision more quickly under the -adaptive sampling scheme than under the i.i.d (Q, gb)-balanced sampling scheme, with a gain in average sample size at decision of approximately 16%. Furthermore, the TMLE group sequential testing procedure reaches a decision more slowly under the -adaptive sampling scheme than under the i.i.d (Q, g*(Q))-optimal sampling scheme, but the loss in average sample size at decision is only approximately 6%.
A. Proofs
Let us introduce for convenience the notations θε = (θ, ε), and . Moreover, let us recall that if μ is a probability measure and f a measurable function, then μf denotes the integral ∫ fdμ.
A.1. Proof of Propositions 2 and 4, and a useful remark on Proposition 3 Proof of Proposition 2.
The cornerstone of the proof is to interpret θn as the solution in θ of the martingale estimating equation , where Zi (defined in (9)) is the finite dimensional summary measure of the past observations On(i − 1) such that depends on On(i − 1) only through Zi (hence the notation ) and .
Note first that, for all i ≤ n,
by A1 (changing the order of differentiation and integration is permitted by the dominated convergence theorem, because is bounded and Θ is compact—see PARAM). Observe then that, by definition of θn, we have:
Now, since (a) supθ∈Θ ∥D1(θ)∥∞ < ∞ (because is bounded and Θ is compact) and (b) the standard entropy of for the supremum norm satisfies (see (van der Vaart, 1998), Example 19.7), we can apply (componentwise) the Kolmogorov strong law of large numbers for martingales, as in Theorem 8 of (Chambaz and van der Laan, 2009). This yields the almost sure convergence of Mn, hence of n−1 , to 0.
By a Taylor expansion of at θ0 (changing the order of differentiation and integration is permitted here too for the same reasons as above), it holds that, for all i ≤ n (recall that ),
From this we deduce that converges in probability to 0. Because is positive definite by A2, this implies the result.
Proof of Proposition 4.
The proof of (i) is very simple and typical of robust statistical studies. Indeed,
the latter expression also writing as
hence the result.
The proof of (ii) fundamentally relies on the fact that solves the martingale estimating equation , where D(θε) is the extension of D1(θ) defined in (19). Here too, for all i ≤ n, and applying the Kolmogorov strong law of large numbers for martingales (see Theorem 8 in (Chambaz and van der Laan, 2009)) yields that converges to zero almost surely. This entails the convergence in probability of to by a Taylor expansion of at . Note that A3 is a clear counterpart to A1 from Proposition 2 but that there is no counterpart to A2 from Proposition 2 in Proposition 4. Indeed, it automatically holds in the framework of the proposition that , the proof requiring that the latter quantity be different from zero.
A useful remark on Proposition 3.
We deduce in Proposition 3 the convergence in probability and in L1 of the adaptive design to the limit design G*(θ0) from the convergence in probability of θn to θ0 as obtained in Proposition 2. It is crucial for us that and also converge to G*(θ0) and 1/G*(θ0) (with an obvious notational shortcut) in probability and in L1. Fortunately, this is true because (a) the positive random variables are uniformly bounded away from 0 and (b) if a sequence Xn converges in L1 to 0 then also converges in L1 to 0 (by convexity of the L1-norm). Let us put this in a lemma:
Lemma 1. For all (a, v), and converge in probability and in L1 to G*(θ0)(a|v) and G*(θ0)(a|v)−1, respectively.
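The elementary inequality behind Lemma 1 can be sketched as follows, writing δ > 0 for the uniform lower bound of point (a) (the symbol δ is our notation):

```latex
% Inverses inherit the L1-convergence from the designs themselves,
% because both quantities are bounded away from 0 by delta:
\left| g_i(a \mid v)^{-1} - G^*(\theta_0)(a \mid v)^{-1} \right|
  \;=\; \frac{\left| g_i(a \mid v) - G^*(\theta_0)(a \mid v) \right|}
             {g_i(a \mid v)\, G^*(\theta_0)(a \mid v)}
  \;\le\; \delta^{-2}\, \left| g_i(a \mid v) - G^*(\theta_0)(a \mid v) \right|,
```

so the convergence in probability and in L1 of the inverses follows directly from that of the designs.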
A.2. Proof of Proposition 5
The proof of Proposition 5 still relies on the facts that (a) solves the martingale estimating equation and (b) for all i ≤ n.
Consider the following equality:
(33)
A Taylor expansion of the left-hand term of (33) first yields that
where and .
The matrix An satisfies . Now, for each i ≤ n,
(34)
where the latter matrix, defined in (21) and independent of i, is deterministic (this is due to the weighting). Thus, An = S0. Moreover, An = S0 is invertible because its determinant equals the product of and
which is negative by A2. Furthermore, (34) and An = S0 (deterministic) also imply that Bn can be rewritten as
and applying (componentwise) a standard law of large numbers for martingales (see for instance Theorem 2.4.2 in (Sen and Singer, 1993)) yields that Bn converges to 0 almost surely (note that ). We emphasize that this proves the convergence in probability of Sn to S0, as stated in the proposition. Consequently, (33) entails
(35)
with . It mainly remains to show that Mn satisfies a central limit theorem.
For this purpose, we apply Theorem 10 in (Chambaz and van der Laan, 2009). Define . It holds that
(36)
Now, by Lemma 1, (36) implies that , where Σ0 (defined in (22)) is positive definite when A4 is met. Indeed, assume on the contrary that there exists a vector u ≠ 0 such that . Then necessarily -almost surely, which contradicts A4. Using (36) again, we also see that
from which we deduce that n−1(Cn – ECn) converges to 0 in probability by Lemma 1. Consequently, Mn converges in distribution to the centered Gaussian law with covariance matrix Σ0. In addition, converges in probability to Σ0. Since (a) is differentiable at , (b) its derivative is bounded, and (c) we already know that , yet another Taylor expansion allows us to derive that Σn (defined in (23)) equals Σ0(1 + oP(1)).
Let us go back to (35), knowing now that Mn satisfies a central limit theorem. We derive from it that hence . Thus, (35) does entail (20) since for all i ≤ n. The stated central limit theorem on readily follows.
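The content of the central limit theorem just established can likewise be illustrated numerically. In the self-contained toy adaptive trial below (our own construction, not the paper's design), the estimator solves a martingale estimating equation with inverse design weights, and its standardized version is approximately standard normal across replications despite the data-adaptive allocation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy adaptive trial (our own construction, mimicking only the structure of
# the result): theta_n solves a martingale estimating equation, and the
# standardized estimator should be approximately N(0, 1) over replications.
theta0, n, reps = 0.7, 400, 1000
z = np.empty(reps)

for r in range(reps):
    num = den = 0.0
    theta_hat = 0.5
    g_hist, A_hist, Y_hist = [], [], []
    for i in range(n):
        g_i = min(max(theta_hat, 0.1), 0.9)    # design adapted to the past
        A = rng.random() < g_i
        Y = rng.random() < (theta0 if A else 0.3)
        if A:
            num += Y / g_i
            den += 1.0 / g_i
        if den > 0:
            theta_hat = num / den
        g_hist.append(g_i); A_hist.append(A); Y_hist.append(Y)
    g_a = np.asarray(g_hist)
    A_a = np.asarray(A_hist, dtype=float)
    Y_a = np.asarray(Y_hist, dtype=float)
    # Plug-in estimate of the variance of the weighted score, plus the
    # denominator of the (ratio-form) estimator theta_n.
    v_n = np.mean((A_a / g_a ** 2) * (Y_a - theta_hat) ** 2)
    scale = np.mean(A_a / g_a)
    z[r] = np.sqrt(n) * (theta_hat - theta0) * scale / np.sqrt(v_n)

print(round(z.mean(), 2), round(z.var(), 2))
```

Across replications the empirical mean and variance of z are close to 0 and 1, mirroring the role played by the (deterministic) limiting variance in the proof.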
A.3. Proof of Propositions 6 and 7
Proof of Proposition 6.
The proof of Proposition 6 is twofold and relies on the decomposition
(37)
where
(38)
First, a continuity argument and the convergence in probability of the stacked estimator to entail the convergence in probability of to (the equality holds by (i) in Proposition 4). The conclusion follows because converges almost surely to zero. Indeed,
(39)
where, for all and, with a slight abuse of notation, PW,n (respectively, PW,0) denotes the empirical (respectively, true) marginal distribution of W. Thus,
(40)
where (a) and (b) the standard entropy of for the supremum norm satisfies (see Example 19.7 in (van der Vaart, 1998)). Therefore, the class is PW,0-Glivenko-Cantelli, and the right-hand side term of (40) converges to 0 PW,0-almost surely by the Glivenko-Cantelli theorem (i.i.d. framework; see for instance Theorem 19.4 in (van der Vaart, 1998)).
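As a side illustration, the i.i.d. Glivenko-Cantelli phenomenon invoked here for PW,n can be checked numerically in the simplest case of Uniform(0,1) draws (a toy computation of ours, unrelated to the paper's specific class of functions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sup_deviation(n):
    """Kolmogorov-Smirnov distance between the empirical CDF of n
    Uniform(0,1) draws and the true CDF F(t) = t."""
    x = np.sort(rng.random(n))
    # The supremum over t is attained at the jump points of the ECDF.
    upper = np.max(np.arange(1, n + 1) / n - x)   # right limits of the ECDF
    lower = np.max(x - np.arange(0, n) / n)       # left limits of the ECDF
    return max(upper, lower)

devs = [sup_deviation(10 ** k) for k in (2, 3, 4, 5)]
print([round(d, 4) for d in devs])
```

The supremum deviation shrinks as the sample size grows, which is exactly the uniform convergence that drives the right-hand side of (40) to 0.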
Proof of Proposition 7.
The proof of (26) relies again on (37). It is easy to derive the asymptotic linear expansion of the first term. Indeed, we can rewrite (39) as
(41)
which holds subject to checking that . Since the class introduced previously for the proof of Proposition 6 is also PW,0-Donsker, this is a consequence of Lemma 19.24 in (van der Vaart, 1998), provided that converges in probability to 0. Now, converges to in probability, and the function continuously maps onto (thus it is uniformly bounded). Consequently, Xn (which is obviously non-negative) is upper bounded, so that it is equivalent to show that Xn converges to 0 in L1. By the Fubini theorem,
By the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)), converges to 0 in probability (hence in L1, the variables being bounded) for all . Therefore, the integrand of the right-hand side integral in the previous display converges pointwise to 0. Since it is also bounded, we conclude by the dominated convergence theorem that EXn converges to 0. This completes the study of the first term of the decomposition (37).
Regarding the second term of the latter decomposition, we derive its asymptotic linear expansion by the delta-method (see Theorem 3.1 in (van der Vaart, 1998)) from that of , see (20). Specifically,
(42)
Combining (41) and (42) yields the desired asymptotic linear expansion (26) and the closed-form expression of the related influence function IC as given in (27).
A central limit theorem for real-valued martingales (see Theorem 9 in (Chambaz and van der Laan, 2009)) applied to (26) yields the stated convergence and validates the use of as an estimator of the asymptotic variance. To see this, note that for all i ≤ n and define . Now we emphasize that, for every i ≤ n, firstly
secondly
and thirdly
By combining the last three equalities, we thus obtain that
(43)
where Cn was introduced in the proof of Proposition 5. Since we already showed in the latter proof that n−1ECn = Σ0(1 + oP(1)), (43) notably yields the convergence of n−1Ecn toward
which is positive when A4 is met. In addition, (43) also entails that n−1(cn − Ecn) converges to 0 in probability, therefore implying the convergence in distribution of to the centered Gaussian distribution with variance s2 as well as the convergence in probability of to s2. Since (a) and θε ↦ D(θε) are differentiable at , (b) both with uniformly bounded derivatives, (c) we already know that , and (d) and are consistently estimated by and , yet another Taylor expansion (and the continuous mapping theorem, see for instance Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) finally yields the stated convergence in probability of to s2, therefore completing the proof.
A.4. Proof of Proposition 8
The fact that for n large enough is a consequence of the expansion , which holds because Ψ is pathwise differentiable at PQ0,g (for any g) relative to the maximal tangent space, with efficient influence curve D*(Q0, g). The rest of Proposition 8 is a corollary of the following lemma.
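For concreteness, recall the well-known closed form of the efficient influence curve of the additive parameter Ψ+ = E{E(Y|A = 1, W) − E(Y|A = 0, W)} considered in the introduction (a standard fact of the TMLE literature, displayed here in our own notation with Q̄(a, w) = EQ(Y|A = a, W = w)):

```latex
% Efficient influence curve of the additive effect parameter at (Q, g):
D^*(Q, g)(O) \;=\; \frac{2A - 1}{g(A \mid W)}\,\bigl(Y - \bar{Q}(A, W)\bigr)
\;+\; \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi^{+}(Q).
```

Pathwise differentiability with this influence curve is what licenses the first-order expansion of Ψ used in the proposition.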
Lemma 2. Let Λn be the log-likelihood ratio of the experiment relative to the experiment. Under , the vector converges in law to the Gaussian distribution with mean and covariance matrix
In particular, the and experiments are mutually contiguous.
The limiting distribution of (T1,…,TK) under is easily derived from Lemma 2. Le Cam’s third lemma (see Example 6.7 in (van der Vaart, 1998)) solves the problem of obtaining the limiting distribution of (T1,…,TK) under from that under . Only the asymptotic means are different. We do not repeat the proof here, as it is exactly the same (up to some minor variations in notation) as the proof of Lemma 5 in (Chambaz and van der Laan, 2011).
Proof of Lemma 2.
Since f is bounded and (here, one could equivalently use the notation EQ0), we can assume without loss of generality that the fluctuation is characterized by
for all . Let us denote by qn and q0 the conditional densities of Y given (A, W) under and Q0.
The log-likelihood ratio , by conditional independence of O1,…, On given ((A1, W1),…, (An,Wn)) and because the marginal distribution of W is the same under as under Q0, (see (4) and note that the factors cancel out). The fluctuation is chosen in such a way that , hence , where the function R is characterized by . In particular, R is increasing and R(x) → R(0) = 0 when x → 0. Since f is bounded, the expansion is valid for n large enough. Moreover, the last term satisfies
so that if then it is oP(1). Furthermore, the Kolmogorov law of large numbers for martingales implies that . Denote Var(f)(A, W) = EQ0[f2(O)|A, W]. We have that
for V the relevant subvector of W upon which each depends. By Lemma 1, converges in probability to G*(θ0). Consequently, the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) yields that . In summary, we obtain the following asymptotic linear expansion:
(44)
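Expansion (44) is of the classical local asymptotic normality type. For comparison, in the simplest i.i.d. setting with a bounded score f satisfying EQ0 f(O) = 0 and fluctuation dQn = (1 + n^(−1/2) f) dQ0, a textbook second-order expansion gives (this display is a generic sketch of ours, not the paper's exact fluctuation):

```latex
\Lambda_n \;=\; \sum_{i=1}^{n} \log\bigl(1 + n^{-1/2} f(O_i)\bigr)
\;=\; n^{-1/2} \sum_{i=1}^{n} f(O_i)
\;-\; \frac{1}{2n} \sum_{i=1}^{n} f^2(O_i) \;+\; o_P(1)
\;\rightsquigarrow\; \mathcal{N}\Bigl(-\frac{\sigma^2}{2},\, \sigma^2\Bigr),
\qquad \sigma^2 = E_{Q_0} f^2(O),
```

and the characteristic relation "asymptotic mean equals minus half the asymptotic variance" is precisely what yields mutual contiguity of the two sequences of experiments (Le Cam's first lemma).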
Introduce now and the bounded (and measurable) function F such that . We show that satisfies a central limit theorem. The proof is very similar to the corresponding part of proof of Lemma 5 in (Chambaz and van der Laan, 2011), hence we only give an outline, focusing on the differences. First, it holds for every 1 ≤ k, l ≤ K that . This entails that the matrix converges to
which is positive definite if and only if its determinant is positive. The latter determinant equals a positive constant times , which is positive too by the Cauchy-Schwarz inequality (f is not proportional to ). Therefore, Theorem 8 in (Chambaz and van der Laan, 2011) applies and converges in law to the centered Gaussian distribution with the covariance matrix given in the above display. The conclusion readily follows.
Figure 2: Illustrating the TMLE procedure under -adaptive sampling scheme.

These three plots illustrate how the TMLE procedure behaves (on a simulated trajectory) as the sample size (on the x-axis, logarithmic scale; the vertical grey lines indicate the sample sizes ni, i = 1, ..., 7) increases. Top plot: we represent the sequence at intermediate sample sizes n; the horizontal grey line indicates the true value of the risk difference 𝜓0. Middle plot: we represent the three sequences (bottom, middle and top curves) at intermediate sample sizes n; the three horizontal grey lines indicate the optimal allocation probabilities g*(Q0)(1|1) (bottom line), g*(Q0)(1|2) (middle line) and g*(Q0)(1|3) (top line). Bottom plot: we represent the sequence of estimated asymptotic variances of at intermediate sample sizes n; the horizontal grey line indicates the value of the optimal variance v*(Q0). See also Table 4.
Footnotes
1 The reasoning is not circular, by virtue of the chronological ordering summarized in Fig. 1, for instance.
References
- Atkinson AC and Biswas A. Adaptive biased-coin designs for skewing the allocation proportion in clinical trials with normal responses. Stat. Med., 24(16):2477–2492, 2005.
- Bandyopadhyay U and Biswas A. Adaptive designs for normal responses with prognostic factors. Biometrika, 88(2):409–419, 2001.
- Chambaz A and van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate. Technical report, Division of Biostatistics, University of California, Berkeley, 2009.
- Chambaz A and van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: Theoretical study. Int. J. Biostat., 7(1), 2011.
- Emerson SS. Issues in the use of adaptive clinical trial designs. Stat. Med., 25:3270–3296, 2006.
- The Food and Drug Administration. Critical path opportunities list. Technical report, U.S. Department of Health and Human Services, Food and Drug Administration, 2006.
- Golub HL. The need for more efficient trial designs. Stat. Med., 25:3231–3235, 2006.
- Hu F and Rosenberger WF. The Theory of Response-Adaptive Randomization in Clinical Trials. Wiley, New York, 2006.
- Jennison C and Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC, Boca Raton, FL, 2000.
- Proschan MA, Lan GKK, and Wittes JT. Statistical Monitoring of Clinical Trials: A Unified Approach. Statistics for Biology and Health. Springer, New York, 2006.
- Rosenberger WF, Vidyashankar AN, and Agarwal DK. Covariate-adjusted response-adaptive designs for binary response. J. Biopharm. Statist., 11:227–236, 2001.
- Rosenberger WF. New directions in adaptive designs. Statist. Sci., 11:137–149, 1996.
- Sen PK and Singer JM. Large Sample Methods in Statistics: An Introduction with Applications. Chapman & Hall, New York, 1993.
- Shao J, Yu X, and Zhong B. A theory for testing hypotheses under covariate-adaptive randomization. Biometrika, 2010.
- van der Laan MJ. The construction and analysis of adaptive group sequential designs. Technical report 232, Division of Biostatistics, University of California, Berkeley, March 2008.
- van der Laan MJ and Rubin D. Targeted maximum likelihood learning. Int. J. Biostat., 2(1), 2006.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press, 1998.
- van der Vaart AW and Wellner JA. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
- Zhang L-X, Hu F, Cheung SH, and Chan WS. Asymptotic properties of covariate-adjusted response-adaptive designs. Ann. Statist., 35(3):1166–1182, 2007.
- Zhu H and Hu F. Sequential monitoring of response-adaptive randomized clinical trials. Ann. Statist., 38(4):2218–2241, 2010.
