Author manuscript; available in PMC: 2009 Dec 16.
Published in final edited form as: J Am Stat Assoc. 2001 Dec 1;96(456):1410–1423. doi: 10.1198/016214501753382327

Marginal Mean Models for Dynamic Regimes

S A Murphy, M J van der Laan, J M Robins; CPPRG1
PMCID: PMC2794446  NIHMSID: NIHMS148105  PMID: 20019887

Abstract

A dynamic treatment regime is a list of rules for how the level of treatment will be tailored through time to an individual’s changing severity. In general, individuals who receive the highest level of treatment are the individuals with the greatest severity and need for treatment. Thus there is planned selection of the treatment dose. In addition to the planned selection mandated by the treatment rules, the use of staff judgment results in unplanned selection of the treatment level. Given observational longitudinal data or data in which there is unplanned selection of the treatment level, the methodology proposed here allows the estimation of a mean response to a dynamic treatment regime under the assumption of sequential randomization.

Keywords: dynamic treatment regimes, nondynamic treatment regimes, causal inference, confounding

1. INTRODUCTION

Dynamic treatment regimes are individually tailored treatments that provide treatment to individuals only when and if they need the treatment and adjust the level of treatment to the individual’s need. In a dynamic regime rules for how the treatment level and type should vary with time are specified prior to the beginning of treatment; these rules are based on time varying measurements of subject-specific need for treatment. Dynamic treatment regimes are potentially attractive to public policy makers because they treat only subjects who show need for treatment, freeing public and private funds for more intensive treatment of the needy.

The goal of this paper is to provide methodology for estimation of the marginal mean response to a dynamic treatment regime when the data is observational. We employ a structural model in that it is a model for the marginal mean of counterfactual or potential responses.

This work is motivated by the Fast Track prevention program. This is an ongoing randomized trial of a complex preventive intervention versus control. The intervention was designed to prevent the emergence of and reduce the level of conduct disorders and drug use in children at risk due to elevated behavior problems (Bierman et al., 1996; CPPRG, 1999a,b; McMahon et al., 1996). Part of the intervention involved implementation of a dynamic treatment regime designed to improve family functioning. At the end of each semester, beginning with the spring semester of first grade, the family counselor filled out a 6 item “home visiting process measure,” describing the quality of parenting and family functioning. Based on the home visiting process measure total score, home visiting assignments for the next semester were made. A simplified version is as follows. A score of 17 or greater corresponded to 4 home visits, a score from 9 to 16 corresponded to 8 home visits and a score of 8 or below corresponded to 16 home visits during the following semester. Thus the rule for assigning dose was

$$d_t(S_{t-1}) = 16\, I\{S_{t-1} \le 8\} + 8\, I\{9 \le S_{t-1} \le 16\} + 4\, I\{17 \le S_{t-1}\}, \qquad t = 1, 2, 3, 4$$

where St is an assessment of severity, the home visiting process measure score, at the end of the tth semester, with low values indicating greater severity. Staff were told that in exceptional cases they might need to deviate from the rule for assigning home visiting level; in practice, staff assignments deviated from the rule for approximately 50% of the intervention children. In Section 6, our goal is to estimate what the marginal effect of the above dynamic treatment regime would have been, had staff followed the rules exactly. We demonstrate that by identifying the intervention group (with home visiting assignments as implemented) as an observational study, one can use the methods developed here to achieve this goal.
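The simplified rule described above (a score of 8 or below gives 16 visits, 9 to 16 gives 8 visits, 17 or above gives 4 visits) can be written as a short function. This is only an illustrative sketch; the names `home_visits` and `follows_rule` are hypothetical and do not come from the Fast Track study.

```python
def home_visits(score: int) -> int:
    """Simplified rule d_t: map the prior-semester score S_{t-1} to a visit count."""
    if score <= 8:
        return 16
    if score <= 16:
        return 8
    return 4


def follows_rule(scores, assigned):
    """Check adherence over all semesters.

    scores   = [S_0, ..., S_{K-1}]  (hypothetical layout)
    assigned = [A_1, ..., A_K]
    Returns True when A_t == d_t(S_{t-1}) for every t.
    """
    return all(home_visits(s) == a for s, a in zip(scores, assigned))
```

A child whose assignments match `home_visits` in every semester is one whose staff followed the rule exactly; any mismatch corresponds to a staff deviation.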

Dynamic treatment regimes are routinely assigned to subjects participating in randomized medical trials when there is danger of serious side-effects necessitating cessation of treatment or when the optimal dosage at a given time should depend on the subject’s clinical status or laboratory values. For example, in the active treatment arm of the Systolic Hypertension in the Elderly Program randomized trial (Cooperative Research Group, 1988; Borhani et al., 1991), subjects were assigned to a dynamic treatment regime that was intended to minimize the amount of medication required to maintain a subject’s systolic blood pressure (SBP) at or below a pre-determined goal. The treatment schedule was quite complex. Initially, all treated subjects were assigned the same dosage of anti-hypertensive medication; at eight weeks, subjects whose SBP exceeded the goal had their dosage doubled; at sixteen weeks, subjects whose SBP still exceeded the goal had a second anti-hypertensive added. At twenty-four weeks, a subject whose SBP continued to exceed the goal had their dose of the second anti-hypertensive doubled. The treatment schedule specified that subjects with evidence of severe clinical or laboratory toxicity would first have their dose of anti-hypertensive medications reduced; if the toxicity persisted, they would be switched to an alternate medication. As in the Fast Track program, this dynamic treatment is used to tailor the dosage to the subject. Furthermore, the use of rules for reducing dosage or switching medications due to side effects is a useful aid in reducing unexplained noncompliance. Indeed, subjects whose medication is switched in accordance with the rules are compliers.

In this paper we consider methodology for estimation of the mean response to both nonrandom and random dynamic treatment regimes when the available data is observational. In a nonrandom dynamic treatment regime, the rules specifying the treatment level, as outputs of present need, are nonrandom. Both the Fast Track program and the Systolic Hypertension in the Elderly Program use nonrandom dynamic treatment regimes. In a random dynamic treatment regime, the time-specific treatment level is drawn from a conditional probability distribution depending only on the present need. In Section 3 we provide quantitative definitions of these two types of dynamic treatment regimes. The modeling of responses to dynamic treatment regimes based on observational data has received very little statistical attention outside of a series of papers by Robins (1986, 1989, 1993, 1997). In these papers Robins considers the use of nested structural models to estimate a variety of conditional treatment effect parameters in the distribution of the potential response. Also Robins (1993) discusses estimating parameters in the response to a nonrandom dynamic treatment regime by censoring the subject at the first time the subject’s treatment differs from the treatment as specified by the dynamic regime.

In Section 2 we review the underlying causal theory based on counterfactual or potential outcomes. Next in Section 3 we discuss the assumption of sequential randomization along with the definitions of nonrandom and random dynamic treatment regimes. Section 4 uses the work in the previous sections to precisely specify the estimand in terms of the observational data distribution. We provide an estimating function in Section 5 and then in the last section we return to the Fast Track example.

2. POTENTIAL OUTCOMES

We use counterfactual or potential outcome models to quantify the desired treatment effect and to state assumptions. Neyman (1935) introduced counterfactual outcomes to analyze the causal effect of time-independent treatments in randomized studies. Rubin (1978) explicated Neyman’s ideas and extended Neyman’s work to the analysis of causal effects of time-independent treatments from observational data. Robins (1986, 1987) proposed a formal theory of causal inference that extended both Neyman’s and Rubin’s work to assess the direct and indirect effects of time varying treatments from experimental and observational longitudinal studies. We use these works to first specify our observations in a unified way regardless of the manner in which treatment is selected/assigned.

Suppose that the treatment lasts for K intervals; during this time period intermediate outcomes may be measured and at the end of the K intervals a response is measured. We denote the treatment regime/vector across the K intervals by āK = (a1, a2, …, aK); at is the level of treatment in the tth interval. In general we use a bar over a variable to denote that variable and all past values of the same variable, so āt = (a1, …,at). Let 𝒜K be the collection of all possible treatment vectors. Corresponding to each fixed value of the treatment vector, āK, we conceptualize a potential (counterfactual) response denoted by Y(āK). This statement implicitly assumes that there is no need to index Y by others’ treatment or by the mechanism by which the treatment was selected; this is the Stable Unit Treatment Value Assumption (SUTVA, see Rubin, 1986). That is, first we assume that the treatment is defined so that the treatment selection mechanism does not alter the subject’s potential responses; for example, it does not matter why the subject was treated (e.g., whether the subject is randomly assigned the treatment or whether the subject’s parent solicited the treatment), the entirety of the subject’s intermediate potential outcomes and the potential responses are the same. In the time varying treatment setting this is the consistency assumption of Robins (1997). And second we assume that treatment of a subject does not influence any of the outcomes of any other subject (Cox, 1958); indeed we will make the stronger assumption that the observations on the sample of subjects are independent draws from one distribution. These assumptions allow us to conceptualize a single potential outcome/response corresponding to each possible āK ∈ 𝒜K and thus permit a well-defined, simple notation for an intermediate outcome and final response (see Robins, 1986 and for time-independent treatment, Rubin, 1986; Angrist, Imbens and Rubin, 1996).

SUTVA may well be violated in the Fast Track data. In addition to the home visiting, the Fast Track intervention included “friendship groups” in which small numbers of the intervention children were brought together in order to improve social skills. Thus in close friendships, the effect of home visiting on one child may affect the friend’s response. Alternately whether or not a dominant child is receiving home visits may alter the dynamics of the friendship groups and thus alter other children’s response to treatment. We do not deal with these complications here as an appropriate generalization of the proposed methodology would distract from the main points of this paper.

Capital letters are used to denote random variables. Treatment selection/assignment is allowed to be stochastic and is denoted by the random vector ĀK. So ĀK takes values in 𝒜K. Denote the time t (t > 0) variables by Lt = (St, Vt) where St is the severity vector and Vt is a vector of auxiliary information. In general both Vt and St may be summaries of information that are available prior to and at time t. (St, Vt) is an intermediate outcome of treatment. Denote the initial, time 0, variables by L0 = (Z, S0, V0) where Z consists of variables defining interesting subpopulations and S0, V0 have the same interpretation as above. Corresponding to each treatment vector, āK, SUTVA implies that the intermediate outcome at time t can be written unambiguously as (St(āt), Vt(āt)) and the potential response as Y(āK).

In both randomized and observational studies, the observed data is the pretreatment information plus the potential outcomes corresponding to the treatment pattern ĀK. Assuming SUTVA we may write the data in a unified way regardless of the manner in which treatment is selected/assigned. A subject’s potential outcomes, that is, the “complete” severity and response data, is Osr = {Z, S0, S1(a1), …, SK−1(āK−1), Y(āK); āK ∈ 𝒜K} and the complete auxiliary data is Oaux = {V0, V1(a1), …, VK−1(āK−1); āK ∈ 𝒜K}. In a given experiment or observational study we observe only a subset of the subject’s outcomes plus the treatment pattern, X = {Z, S0, V0, A1, S1(A1), V1(A1), A2, …, SK−1(ĀK−1), VK−1(ĀK−1), AK, Y(ĀK)}. In randomized studies we also have, in addition to the observed data X, data on the treatment regime to which a study subject was assigned. In the following we use either Y or Y(ĀK) to denote the observable response, either St or St(Āt) to denote the observable severity at time t and either Vt or Vt(Āt) to denote the observable auxiliary variables at time t.

3. DYNAMIC REGIMES AND SEQUENTIAL RANDOMIZATION

In a dynamic treatment regime, the rule for treatment assignment in each of the K intervals may be random or nonrandom; in either case the rule depends only on prior severity. If the rules are nonrandom, denote the regime by the K vector, d̄K = (d1, …, dK); at time t, treatment assignment is At = dt(St−1). (The Fast Track example uses nonrandom rules.) If the rules are random (stochastic), the treatment regime is characterized by a K vector of stochastic rules or, equivalently, treatment assignment probabilities, say p̄K = (p1, …, pK); at time t, treatment assignment is drawn from the conditional distribution pt(·|St−1). We shall denote by Pp̄K the distribution of (Osr, Oaux, ĀK) had the entire study population been subjected to the dynamic regime p̄K. The associated expectation is denoted by Ep̄K. Note that a nonrandom dynamic treatment regime, d̄K, is just a special case of a random dynamic treatment regime with pt(a|St−1) degenerate at the point a = dt(St−1). Henceforth, we do not, except where necessary, distinguish in our notation between random and nonrandom dynamic treatment regimes.
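The remark that a nonrandom rule is a degenerate random rule can be illustrated with a small sketch. The helper names (`fast_track_rule`, `degenerate_rule`, `draw_treatment`) are hypothetical, and the rule used is the simplified Fast Track rule from the Introduction; treatment levels are assumed to form a finite set.

```python
import random


def fast_track_rule(s):
    """Simplified nonrandom rule d_t(s): visits as a function of prior severity."""
    return 16 if s <= 8 else (8 if s <= 16 else 4)


def degenerate_rule(d):
    """Turn a nonrandom rule d(s) into assignment probabilities p(a | s)."""
    def p(a, s):
        return 1.0 if a == d(s) else 0.0
    return p


def draw_treatment(p, s, levels, rng):
    """Draw A_t from p(. | s) over the finite list `levels`."""
    weights = [p(a, s) for a in levels]
    return rng.choices(levels, weights=weights, k=1)[0]
```

With a degenerate `p`, `draw_treatment` always returns `d(s)`, so code written for stochastic regimes covers nonrandom regimes without a separate case.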

Dynamic treatment regimes should not be confused with nondynamic treatment regimes. In a nondynamic treatment regime, the variation in treatment level across time and subjects is specified prior to treatment onset. In other words, a nondynamic treatment regime is a special case of a dynamic treatment regime in which the treatment assignments do not vary by post-treatment observations (i.e. the rule, dt is not a function of post-treatment information in St−1 and is at most a function of pretreatment variables, L0). Concrete examples of nondynamic treatment regimes are “assign 16 home visits in every semester following kindergarten,” or “assign 16 home visits in the first semester following kindergarten and 8 home visits in each semester thereafter.” Regression analyses which model the mean of Y(āK) as āK varies are regression analyses comparing different nondynamic treatment regimes, each corresponding to a different treatment pattern, āK. For an example of this type of analysis see Robins et al. (1998) and Hernán et al. (1998).

Denote the distribution of (Osr, Oaux, ĀK) in our observational study by Pobs and the expectation with respect to this distribution by Eobs. Note that the marginal distribution of the complete data (Osr, Oaux) is the same for both distributions, Pp̄K and Pobs. Only the marginal/conditional distribution of the treatment (ĀK) varies between our observational distribution and the distribution Pp̄K. Thus we denote the marginal distribution of (Osr, Oaux) by P without a subscript and expectation with respect to this marginal distribution by E without a subscript.

If the study population follows the dynamic treatment regime, p̄K, then At is conditionally independent of the potential outcomes, (Osr, Oaux), conditional on St−1. However in our observational study, At may not be conditionally independent of the potential outcomes, (Osr, Oaux), conditional on St−1, because there may be unmeasured “confounders” that determine treatment and are associated with the potential outcomes. In general, assumptions, in addition to SUTVA, about this distributional relationship (based on substantive knowledge) must be used to identify causal effects and thus permit causal inference (for discussion, see Section 11 of Robins, 1997). In this paper we assume:

Sequential Randomization

For each t = 1, …, K, At is independent of Osr given {L0, A1, L1, A2, …, Lt−1}.

That is, the auxiliary data, the V’s, are sufficiently rich so the observational distribution satisfies the above sequential randomization. This means that within levels of L0, A1, L1, A2, …, Lt−1, we assume that the subjects with different levels of treatment do not vary systematically by the potential severity measures or responses. See Robins (1997) for further discussion of sequential randomization.

We have chosen to assume sequential randomization of the treatment level, ĀK with respect to Osr only and not with respect to both (Osr, Oaux). There are two reasons for this. First there are practical, real-life examples in which sequential randomization of ĀK with respect to Osr may be plausible but sequential randomization of ĀK with respect to Oaux is clearly untrue (see example 1 of Robins, 1987). The second reason is more philosophical. If a time varying correlate of both treatment level selection and response becomes available and the above sequential randomization assumption is more plausible when the V’s include this time varying covariate, then we would like to include this covariate in the V’s without having to make the additional assumption that the treatment levels are sequentially randomized with respect to this covariate as well.

4. THE ESTIMAND

The sequential randomization assumption is used only to identify the parameter of interest (e.g., a treatment regime effect) in terms of the observational distribution. The identification is achieved by the following lemma. Suppose we are given a vector of treatment rules, p̄K. We wish to estimate parameters of the Pp̄K distribution using data from the observational distribution, Pobs. Let πt(·|āt−1, l̄t−1) represent the conditional probability mass function of At given Āt−1 = āt−1, L̄t−1 = l̄t−1 in the observational distribution. Define

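$$B(o_{sr}, \bar a_K) = P_{obs}\left[\prod_{t=1}^{K} \pi_t\big(a_t \mid \bar a_{t-1}, \bar L_{t-1}(\bar a_{t-1})\big) > 0 \,\Big|\, O_{sr} = o_{sr}\right]$$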

and

$$W_{\bar p_j}(\bar a_j, \bar l_{j-1}) = \prod_{t=1}^{j} \frac{p_t(a_t \mid s_{t-1})}{\pi_t(a_t \mid \bar a_{t-1}, \bar l_{t-1})}$$

for any j ≤ K.
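As a rough illustration, the weight W can be computed for one subject as a running product of ratios of regime probabilities to observational selection probabilities. The function name, the argument layout and the callables `p` and `pi` are assumptions made for this sketch. Returning 0 as soon as the regime probability is 0 is a convention of the sketch; it avoids dividing by a selection probability that is irrelevant for a treatment pattern the regime cannot produce.

```python
def regime_weight(a_hist, s_hist, l_hist, p, pi):
    """Product over t = 1..j of p_t(a_t | s_{t-1}) / pi_t(a_t | a-history, l-history).

    a_hist = [a_1, ..., a_j]
    s_hist = [s_0, ..., s_{j-1}]
    l_hist = [l_0, ..., l_{j-1}]
    p(t, a, s)                 -> regime probability (hypothetical signature)
    pi(t, a, a_past, l_past)   -> observational selection probability (hypothetical)
    """
    w = 1.0
    for t in range(1, len(a_hist) + 1):
        a_t = a_hist[t - 1]
        num = p(t, a_t, s_hist[t - 1])
        if num == 0.0:
            return 0.0  # treatment pattern incompatible with the regime
        den = pi(t, a_t, a_hist[: t - 1], l_hist[:t])
        w *= num / den
    return w
```

For a nonrandom regime the weight is zero for every subject whose observed treatment pattern departs from the rule, and the inverse of the probability of the observed pattern otherwise.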

Lemma 4.1

Assume that the observational distribution, Pobs satisfies the sequential randomization assumption. If

$$P_{\bar p_K}\left[B(O_{sr}, \bar A_K) = 1\right] = 1 \qquad (4.1)$$

then the distribution of (Y, S̄K−1, ĀK, Z) under Pp̄K is absolutely continuous with respect to the distribution of (Y, S̄K−1, ĀK, Z) under Pobs and a version of the Radon-Nikodym derivative is

$$E_{obs}\left[W_{\bar p_K}(\bar A_K, \bar L_{K-1}) \mid Y = y, \bar S_{K-1} = \bar s_{K-1}, \bar A_K = \bar a_K, Z = z\right]. \qquad (4.2)$$

Proof

Below we repeatedly use the notation, Y == Y(ĀK), Lt == (St, Vt) and Lt == Lt(Āt). Let U be a Borel set. We have that

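$$
\begin{aligned}
&E_{obs}\Big[I\{(Y, \bar S_{K-1}, \bar A_K, Z) \in U\}\, E_{obs}\big[W_{\bar p_K}(\bar A_K, \bar L_{K-1}) \mid Y, \bar S_{K-1}, \bar A_K, Z\big]\Big] \\
&\quad = E_{obs}\Big[I\{(Y, \bar S_{K-1}, \bar A_K, Z) \in U\}\, W_{\bar p_K}(\bar A_K, \bar L_{K-1})\Big] \\
&\quad = \sum_{\bar a_K} E_{obs}\Big[I\{(Y, \bar S_{K-1}, \bar a_K, Z) \in U\}\, W_{\bar p_K}(\bar a_K, \bar L_{K-1})\, I\{\bar A_K = \bar a_K\}\Big] \\
&\quad = \sum_{\bar a_K} E_{obs}\Big[I\{(Y(\bar a_K), \bar S_{K-1}, \bar a_K, Z) \in U\}\, W_{\bar p_{K-1}}(\bar a_{K-1}, \bar L_{K-2})\, p_K(a_K \mid S_{K-1}) \\
&\qquad\qquad \times P_{obs}\big[\pi_K(a_K \mid \bar a_{K-1}, \bar L_{K-1}) > 0 \mid O_{sr}, \bar L_{K-2}, \bar A_{K-1} = \bar a_{K-1}\big]\, I\{\bar A_{K-1} = \bar a_{K-1}\}\Big].
\end{aligned}
$$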

Continuing in this fashion, eliminating the indicator variables concerning Aj and accumulating the conditional expectations concerning the πj(aj | L̄j−1(āj−1), āj−1), we arrive at

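$$\sum_{\bar a_K} E_{obs}\left[I\{(Y(\bar a_K), \bar S_{K-1}(\bar a_{K-1}), \bar a_K, Z) \in U\}\, B_1(O_{sr}, \bar a_K) \prod_{t=1}^{K} p_t\big(a_t \mid S_{t-1}(\bar a_{t-1})\big)\right] \qquad (4.3)$$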

where B1 is formed from repeated conditional expectations and can be defined iteratively beginning with,

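$$B_K(O_{sr}, \bar L_{K-2}, \bar a_K) = P_{obs}\left[\prod_{t=1}^{K} \pi_t\big(a_t \mid \bar a_{t-1}, \bar L_{t-1}(\bar a_{t-1})\big) > 0 \,\Big|\, O_{sr}, \bar L_{K-2}, \bar A_{K-1} = \bar a_{K-1}\right]$$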

and then for each j ≤ K − 1

$$B_j(O_{sr}, \bar L_{j-2}, \bar a_K) = E_{obs}\left[B_{j+1}(O_{sr}, \bar L_{j-1}, \bar a_K) \mid O_{sr}, \bar L_{j-2}, \bar A_{j-1} = \bar a_{j-1}\right]$$

and for j = 1 as,

$$B_1(O_{sr}, \bar a_K) = E_{obs}\left[B_2(O_{sr}, L_0, \bar a_K) \mid O_{sr}\right].$$

Since the distribution of Osr is the same under both the Pp̄K and Pobs distributions, (4.3) can also be written with Ep̄K in place of Eobs. But then the sum over āK can be taken inside the expectation to yield

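$$E_{obs}\Big[I\{(Y, \bar S_{K-1}, \bar A_K, Z) \in U\}\, E_{obs}\big[W_{\bar p_K}(\bar A_K, \bar L_{K-1}) \mid Y, \bar S_{K-1}, \bar A_K, Z\big]\Big] = E_{\bar p_K}\Big[B_1(O_{sr}, \bar A_K)\, I\{(Y, \bar S_{K-1}, \bar A_K, Z) \in U\}\Big].$$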

Since B1 can be expressed as repeated expectations of a probability (BK), which is necessarily bounded above by one, (4.1) implies that B1(Osr, ĀK) = 1 a.e. Thus we have absolute continuity and (4.2) is the Radon-Nikodym derivative of Pp̄K with respect to Pobs.

An intuitive interpretation of the lemma’s assumption (4.1) is that, “Any treatment pattern that can result in the implementation of the dynamic treatment regime, p̄K, must also have a positive probability of occurring in the observational study.” This assumption is eminently sensible since if a particular treatment pattern cannot occur in the observational study, then inference involving the response to this treatment pattern requires further knowledge/assumptions.

From a statistical perspective there are two important consequences of the fact that (4.2) is a Radon-Nikodym derivative. Recall that the goal of this paper is to estimate, using observational data, the mean response that would have been observed had, contrary to fact, the entire study population followed a particular dynamic regime. Thus the estimand is

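$$
\begin{aligned}
E_{\bar p_K}[Y \mid Z] &= E_{obs}\Big[E_{obs}\big[W_{\bar p_K}(\bar A_K, \bar L_{K-1}) \mid Y, \bar S_{K-1}, \bar A_K, Z\big]\, Y \,\Big|\, Z\Big] \\
&= E_{obs}\big[W_{\bar p_K}(\bar A_K, \bar L_{K-1})\, Y \mid Z\big]
\end{aligned}
$$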

and this estimand can be expressed in terms of the observational distribution. Or writing the right hand side in terms of conditional densities, Ep̄K[Y|Z = z] =

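$$\sum_{\bar a_K \in \mathcal{A}_K} \int_{\bar l_{K-1}} \int_{y} y\, f_Y(y \mid \bar a_K, \bar v_{K-1}, \bar s_{K-1}) \prod_{t=1}^{K} p_t(a_t \mid s_{t-1}) \prod_{t=1}^{K-1} f_t(l_t \mid \bar a_t, \bar l_{t-1})\, f_0(l_0 \mid z)\, dy\, d\bar l_{K-1} \qquad (4.4)$$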

where fY, fK−1, …, f0 denote conditional densities for the observational distribution. The conditional density of Y given ĀK, L̄K−1 is fY and the conditional density of Lt given Āt, L̄t−1 is ft. Thus the conditional mean is a function of the observational distribution via the conditional densities fY and fK−1, …, f0. This is Robins’ (1986) G-computation formula.
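A toy check may help make the two expressions for the estimand concrete. The sketch below takes K = 1, no Z, and L0 = S0 binary, and uses made-up numbers for f0, π1 and E[Y | a, s]; none of these values come from the paper. It evaluates the G-computation sum and the weighted observational mean by exact enumeration and confirms that they coincide.

```python
# Hypothetical population quantities for a one-interval (K = 1) illustration.
f0 = {0: 0.6, 1: 0.4}                              # f_0(s_0)
pi = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}    # pi[s][a] = pi_1(a | s)


def mean_y(a, s):
    """Hypothetical conditional mean E_obs[Y | A_1 = a, S_0 = s]."""
    return 1.0 + 2.0 * a + 3.0 * s - a * s


def p(a, s):
    """Nonrandom regime: treat (a = 1) exactly when s = 1."""
    return 1.0 if a == s else 0.0


def g_computation():
    """Sum over s and a of mean_y(a, s) * p(a | s) * f_0(s), as in (4.4) with K = 1."""
    return sum(f0[s] * p(a, s) * mean_y(a, s) for s in f0 for a in (0, 1))


def weighted_mean():
    """E_obs[W * Y] with W = p(A | S) / pi(A | S), by enumeration."""
    total = 0.0
    for s in f0:
        for a in (0, 1):
            w = p(a, s) / pi[s][a]
            total += f0[s] * pi[s][a] * w * mean_y(a, s)
    return total
```

Both functions return 0.6 · 1 + 0.4 · 5 = 2.6 up to floating-point rounding, because the selection probability in the observational expectation cancels against the denominator of the weight.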

The second consequence of the fact that (4.2) is a Radon-Nikodym derivative is that given any Pp̄K-unbiased estimating equation for parameters in the distribution Pp̄K, we can weight this estimating equation by the Radon-Nikodym derivative and thereby produce an unbiased estimating equation for the same parameters, but now using data from the observational study, the Pobs distribution. Suppose for the moment that π̄K is known. Let μ(β, Z) be a parameterization of Ep̄K[Y|Z] where β is an unknown vector parameter and μ is a known function. Given data from Pp̄K, a simple estimating function for β might be based on μ̇(β, Z)(Y − μ(β, Z)) where μ̇(β, Z) is the derivative of μ with respect to β. Given data from the Pobs distribution, we weight to produce the unbiased estimating function,

$$\mathbb{P}_n\left[W_{\bar p_K}(\bar A_K, \bar L_{K-1})\, \dot\mu(\beta, Z)\,(Y - \mu(\beta, Z))\right] \qquad (4.5)$$

where for a function g, ℙng(X) is defined as the average of g over the observational sample ($\mathbb{P}_n g(X) = \frac{1}{n}\sum_{i=1}^{n} g(X_i)$).

Note that the denominator of Wp̄K is informally “the probability of having the treatment pattern one did indeed have” in the observational study. Robins (1998) introduced the idea of weighting estimating functions by the inverse of this probability in order to estimate parameters of marginal structural models and direct effect structural models for time varying treatments. Robins refers to such estimators as inverse probability of treatment weighted estimators.
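In the simplest case, where μ(β, Z) = β is a single unconditional mean (so μ̇ = 1), setting (4.5) to zero gives ℙn[W(Y − β)] = 0, whose root is the weighted average ℙn[WY] / ℙn[W]. A minimal sketch, assuming the weights have already been computed and using the hypothetical name `iptw_mean`:

```python
def iptw_mean(weights, responses):
    """Root of the weighted estimating equation for an intercept-only mean model.

    weights   : per-subject values of W (zero for patterns the regime excludes)
    responses : per-subject observed responses Y
    """
    num = sum(w * y for w, y in zip(weights, responses))
    den = sum(weights)
    if den == 0:
        raise ValueError("no subject has a treatment pattern compatible with the regime")
    return num / den
```

For example, `iptw_mean([2.0, 0.0, 2.0], [1.0, 10.0, 3.0])` ignores the second subject and averages the other two responses.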

In the special case in which p̄K represents a nonrandom dynamic treatment regime, an alternate strategy is to use the inverse probability of censoring weighted estimators as considered by Robins and Rotnitzky (1992) and Rotnitzky and Robins (1995). Instead of modeling the treatment selection given past information, they model the adherence/non-adherence to the treatment regime given past information. Subjects are censored when their treatment first deviates from the dynamic treatment regime; these subjects receive a weight of zero. Subjects whose treatment patterns match the dynamic treatment regime receive a weight which is the product of the modelled adherence probabilities. Note that in this paper, for each t, we could separate the selection probability, πt(at|āt−1, l̄t−1), into the product of the probability of adherence/non-adherence to the treatment regime given Āt−1 = āt−1, L̄t−1 = l̄t−1 times the probability of At = at given adherence or non-adherence to the treatment regime and Āt−1 = āt−1, L̄t−1 = l̄t−1.

We may view μ(β, Z) as a parameterization of (4.4). If the observational distribution satisfies sequential randomization then μ(β, Z) = Ep̄K[Y|Z] and β can be interpreted as a parameter from the Pp̄K distribution. If sequential randomization does not hold, then we only have that the estimation procedure will, under general conditions, result in a valid estimator of (4.4). The sequential randomization assumption allows us to interpret this estimand as the mean response to the regime p̄K.

5. ESTIMATION

Our goal is to conduct inference concerning the mean response to a dynamic treatment regime, possibly conditional on pretreatment variables. Assuming SUTVA, sequential randomization and (4.1), (4.4) can be interpreted as the mean response to a dynamic treatment regime, conditional on Z. We separate this interpretation from the estimation. In this section we consider a parameterization of (4.4) denoted by μ(β, Z). In order to estimate β we no longer need SUTVA, sequential randomization or (4.1); indeed we no longer need to assume the existence of potential outcomes. However we assume regularity conditions for the asymptotic theory and that

$$E_{obs}\, W_{\bar p_K}(\bar A_K, \bar L_{K-1}) = 1. \qquad (5.1)$$

This is easily shown to be a consequence of (4.1); however, (5.1) makes no reference to potential outcomes.
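For the one-interval case K = 1, the argument can be sketched as follows (a sketch only; the general case iterates the same conditioning step backward from t = K). Conditioning on L0 and summing over the treatment levels that have positive observational probability,

```latex
\begin{aligned}
E_{obs}\big[W_{\bar p_1}(A_1, L_0)\big]
  &= E_{obs}\Big[\sum_{a_1:\, \pi_1(a_1 \mid L_0) > 0} \pi_1(a_1 \mid L_0)\,
       \frac{p_1(a_1 \mid S_0)}{\pi_1(a_1 \mid L_0)}\Big] \\
  &= E_{obs}\Big[\sum_{a_1:\, \pi_1(a_1 \mid L_0) > 0} p_1(a_1 \mid S_0)\Big] = 1 .
\end{aligned}
```

The last equality holds when p1(·|S0) places all of its mass on treatment levels with π1(a1|L0) > 0, which is what (4.1) requires when K = 1.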

Note that μ(β, Z) is a function of the conditional densities fY and fK−1, …, f0. On the other hand, the likelihood of the observational data can be written as a product of conditional densities:

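$$\left(f_Y(y \mid \bar a_K, \bar l_{K-1}) \prod_{t=1}^{K-1} f_t(l_t \mid \bar a_t, \bar l_{t-1})\, f_0(l_0 \mid z)\right) \times \prod_{t=1}^{K} \pi_t(a_t \mid \bar a_{t-1}, \bar l_{t-1}), \qquad (5.2)$$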

where πt(·|āt−1, l̄t−1) represents the conditional probability mass function of At given Āt−1 = āt−1, L̄t−1 = l̄t−1 in the observational distribution. Thus the estimand, (4.4), depends only on the first term in the likelihood. The second term in the likelihood concerns only the treatment selection probabilities (π̄K). An optimal estimating function (one resulting in an efficient estimator of β) will be orthogonal to the score functions for the treatment selection probabilities (Bickel et al., 1993). This means we have the potential to improve estimators based on (4.5) by replacing (4.5) by itself minus its projection on the score functions for the treatment selection probabilities (Robins, 1999). Recall the observed data for one person is denoted by X. Given an estimating function, say U(X, β), the projection of U on the score functions of the treatment selection probabilities is given by $\sum_{t=1}^{K} \big\{E_{obs}[U(X,\beta) \mid \bar L_{t-1}, \bar A_t] - E_{obs}[U(X,\beta) \mid \bar L_{t-1}, \bar A_{t-1}]\big\}$ (Robins, 1999).

Setting U(X, β) equal to Wp̄K(ĀK, L̄K−1)μ̇(β, Z)(Y − μ(β, Z)), we get after algebraic simplification that

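$$
\begin{aligned}
&U(X,\beta) - \sum_{t=1}^{K} \Big\{E_{obs}\big[U(X,\beta) \mid \bar L_{t-1}, \bar A_t\big] - E_{obs}\big[U(X,\beta) \mid \bar L_{t-1}, \bar A_{t-1}\big]\Big\} \\
&\quad = W_{\bar p_K}(\bar A_K, \bar L_{K-1})\, \dot\mu(\beta,Z)\,(Y - \mu(\beta,Z)) \\
&\qquad - \sum_{t=1}^{K} W_{\bar p_t}(\bar A_t, \bar L_{t-1})\, \dot\mu(\beta,Z)\,\big(g_t(\bar A_t, \bar L_{t-1}) - \mu(\beta,Z)\big) \\
&\qquad + \sum_{t=1}^{K} \sum_{a_t} \pi_t(a_t \mid \bar A_{t-1}, \bar L_{t-1})\, W_{\bar p_t}(a_t, \bar A_{t-1}, \bar L_{t-1})\, \dot\mu(\beta,Z)\,\big(g_t(a_t, \bar A_{t-1}, \bar L_{t-1}) - \mu(\beta,Z)\big)
\end{aligned}
\qquad (5.3)
$$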

where Wp̄t(āt, l̄t−1) represents $\prod_{u=1}^{t} p_u(a_u \mid s_{u-1}) / \pi_u(a_u \mid \bar a_{u-1}, \bar l_{u-1})$ and

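$$
\begin{aligned}
g_K(\bar A_K, \bar L_{K-1}) &= E_{obs}\big(Y \mid \bar A_K, \bar L_{K-1}\big) \\
g_t(\bar A_t, \bar L_{t-1}) &= E_{obs}\Big(\sum_{a_{t+1}} p_{t+1}(a_{t+1} \mid S_t)\, g_{t+1}(a_{t+1}, \bar A_t, \bar L_t) \,\Big|\, \bar A_t, \bar L_{t-1}\Big)
\end{aligned}
\qquad (5.4)
$$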

for t = 1,…, K − 1. If the parameterization of μ(β, Z) is saturated then the right hand side of (5.3) is the efficient influence function for β. See the appendix for proofs.

Note that whereas the treatment selection probabilities for Pp̄K are known (they are p̄K), in our observational study the treatment selection probabilities, π̄K, are unknown and thus must be estimated from the observational data. Additionally we will need to estimate {gt(āt, l̄t−1), t = 1,…,K}. Given a parametrization of the treatment selection probabilities, say πt(at|āt−1, l̄t−1; αw), t = 1,…,K with parameter αw, we can use maximum likelihood based on (5.2) to estimate αw. We maximize,

$$\mathbb{P}_n \sum_{t=1}^{K} \log \pi_t(A_t \mid \bar A_{t-1}, \bar L_{t-1}; \alpha_w) \qquad (5.5)$$

to get α̂w and then by substitution the vector {π̂t, t = 1,…,K}. Next we parameterize {gt(āt, l̄t−1), t = 1,…,K} with the vector parameter αg and denote the parameterization by {gt(āt, l̄t−1; αg), t = 1,…,K}. A variety of methods may be used to estimate αg; we use the following estimating equation, i.e., solve,

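$$
\begin{aligned}
0 = {} & \mathbb{P}_n \big(Y - g_K(\bar A_K, \bar L_{K-1}; \alpha_g)\big)\, \frac{\partial}{\partial \alpha_g} g_K(\bar A_K, \bar L_{K-1}; \alpha_g) \\
& + \sum_{t=1}^{K-1} \mathbb{P}_n \Big(\sum_{a_{t+1}} p_{t+1}(a_{t+1} \mid S_t)\, g_{t+1}(a_{t+1}, \bar A_t, \bar L_t; \alpha_g) - g_t(\bar A_t, \bar L_{t-1}; \alpha_g)\Big)\, \frac{\partial}{\partial \alpha_g} g_t(\bar A_t, \bar L_{t-1}; \alpha_g)
\end{aligned}
\qquad (5.6)
$$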

to get α̂g.
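To illustrate the first of these fitting steps, consider a saturated model in which there is one free probability for every (history, treatment) cell. This is a stand-in chosen for the sketch, not the parametric model used in the paper. In that case the maximizer of the multinomial log-likelihood (5.5) is the empirical frequency of each treatment level within each observed history. The function name and data layout below are hypothetical, and histories are assumed to be discrete and hashable.

```python
from collections import Counter


def fit_selection_probs(histories, treatments):
    """Saturated maximum likelihood estimate of pi(a | h).

    histories  : one hashable summary of (a-history, l-history) per subject-interval
    treatments : the treatment level A_t observed for that subject-interval
    Returns pi_hat(a, h), the empirical frequency of a among records with history h.
    """
    joint = Counter(zip(histories, treatments))
    marginal = Counter(histories)

    def pi_hat(a, h):
        # Histories never observed get probability 0 by convention of this sketch.
        return joint[(h, a)] / marginal[h] if marginal[h] else 0.0

    return pi_hat
```

With many or continuous covariates the cells become sparse, which is one reason a lower-dimensional parametric model indexed by αw is used in practice.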

To use (5.3) for estimation of β, we substitute estimators of the parameters, αw and αg in the weights and in the gt’s and set,

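$$
\begin{aligned}
& \mathbb{P}_n\, W_{\bar p_K}(\bar A_K, \bar L_{K-1}; \hat\alpha_w)\, \dot\mu(\beta,Z)\,(Y - \mu(\beta,Z)) \\
& \quad - \mathbb{P}_n \sum_{t=1}^{K} W_{\bar p_t}(\bar A_t, \bar L_{t-1}; \hat\alpha_w)\, \dot\mu(\beta,Z)\,\big(g_t(\bar A_t, \bar L_{t-1}; \hat\alpha_g) - \mu(\beta,Z)\big) \\
& \quad + \mathbb{P}_n \sum_{t=1}^{K} \sum_{a_t} \pi_t(a_t \mid \bar A_{t-1}, \bar L_{t-1}; \hat\alpha_w)\, W_{\bar p_t}(a_t, \bar A_{t-1}, \bar L_{t-1}; \hat\alpha_w)\, \dot\mu(\beta,Z)\,\big(g_t(a_t, \bar A_{t-1}, \bar L_{t-1}; \hat\alpha_g) - \mu(\beta,Z)\big),
\end{aligned}
\qquad (5.7)
$$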

to 0 and solve for β.

Under regularity conditions, the following two properties hold.

  1. First, assuming the parameterization of μ is correct, and if the parameterization of π̄K OR {gt(āt, l̄t−1), t = 1,…,K} is correct, the resulting estimator of β is consistent and asymptotically normal.

  2. The asymptotic variance of β̂ can be estimated by a sandwich estimator. Define
    $$\hat\Sigma = \mathbb{P}_n\left(S_\beta(X; \hat\alpha_w, \hat\alpha_g, \hat\beta) + \hat S_{\beta,w}\, \hat I_w^{-1} S_w(X; \hat\alpha_w) + \hat S_{\beta,g}\, \hat I_g^{-1} S_g(X; \hat\alpha_g)\right)^{\otimes 2}$$
    where for a vector V, V⊗2 = VVT, Sβ(X; α̂w, α̂g, β) is the estimating function for β (formula (5.7) is ℙnSβ), Sw(X; αw) is the score function for αw (the derivative of (5.5) with respect to αw is ℙnSw(X; αw)), Îw is the observed information matrix for αw (minus the Hessian of (5.5)), Sg(X; αg) is the estimating function for αg (formula (5.6) is ℙnSg), Îg is minus the derivative of (5.6) with respect to αg, $\hat S_{\beta,w} = \mathbb{P}_n \frac{\partial}{\partial \alpha_w} S_\beta(X; \hat\alpha_w, \hat\alpha_g, \hat\beta)$ and $\hat S_{\beta,g} = \mathbb{P}_n \frac{\partial}{\partial \alpha_g} S_\beta(X; \hat\alpha_w, \hat\alpha_g, \hat\beta)$. Also define Γ̂ = ℙnμ̇(β, Z)μ̇(β, Z)T. Then a consistent estimator of the asymptotic variance of β̂ is given by Γ̂−1Σ̂Γ̂−1.

See the appendix for proofs of these results.

Consistent Parameterization of g1,…, gK

The relationships among the g1,…, gK are highly constrained. Thus it is difficult to formulate parameterizations of g1,…,gK which are consistent with one another. For example, it must be possible for the parameterization of gK and gK−1 to satisfy

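$$g_{K-1}(\bar a_{K-1}, \bar l_{K-2}) = \int_{l_{K-1}} \sum_{a_K} p_K(a_K \mid s_{K-1})\, g_K(\bar a_K, \bar l_{K-1})\, f_{K-1}(l_{K-1} \mid \bar a_{K-1}, \bar l_{K-2})\, dl_{K-1}$$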

where pK is specified by the treatment regime and fK−1 is the unspecified conditional density of LK−1 given (ĀK−1, L̄K−2). Or equivalently, given a parametrization of gK and gK−1, there must be at least some conditional density fK−1 for which the above display is true. See the data example for a consistent parametrization. Also see Robins and Wasserman (1997) and Scharfstein et al. (1999) for similar problems with consistent parametrization.

Double Robustness

Instead of using (5.7) to estimate β, we could use (4.5) with estimated weights. Consistency of the resulting β̂ would require that π̄K is correctly parameterized. Alternately we could base estimation of β on the formula in (4.4). Note that (4.4) can be written in terms of the gt’s as

$$\mu(\beta, Z) = \sum_{a_1} E\big[g_1(a_1, L_0)\, p_1(a_1 \mid S_0) \mid Z\big]. \qquad (5.8)$$

As noted previously, this is the G-computation formula of Robins (1986). Thus to directly estimate β using (4.4) we may estimate the gt’s in order to arrive at an estimate of g1 and then solve

$$0 = \mathbb{P}_n\, \dot\mu(\beta, Z)\Big(\sum_{a_1} g_1(a_1, L_0; \hat\alpha_g)\, p_1(a_1 \mid S_0) - \mu(\beta, Z)\Big). \qquad (5.9)$$

We can only expect the resulting β̂ to be a consistent estimator of β if we have correctly parameterized the gt’s.

The primary advantage of (5.7) over (5.9) and over the use of (4.5) is that the use of (5.7) to estimate μ leads to a consistent estimator if either the gt’s or πt’s are parameterized correctly (see (1) above). This is the “double robustness” property; see Scharfstein et al. (1999) and Robins et al. (2000) for further discussion. Furthermore note that the first term in (5.7) is (4.5) with estimated weights and the t = 1 summand in the last term is (5.9). Thus (5.7) combines both the direct estimation method and the weighted estimation method (4.5) so as to achieve consistency even when only the gt’s or only the πt’s are parameterized correctly.
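A one-interval special case may make the double robustness property concrete. Taking K = 1 and μ(β, Z) = β (so μ̇ = 1), and noting that π1 times Wp̄1 equals p1, setting (5.7) to zero gives β̂ = ℙn[W(Y − g1(A1, L0)) + Σa p1(a|S0) g1(a, L0)]. The sketch below implements this special case only; the function name, record layout and callables are hypothetical, and positivity (π̂ > 0 wherever p > 0) is assumed.

```python
def dr_mean(records, p, pi_hat, g_hat, levels):
    """Doubly robust estimate of the regime mean for K = 1 and an intercept-only model.

    records : iterable of (l0, s0, a1, y) tuples (hypothetical layout)
    p       : p(a, s0), the regime's assignment probability
    pi_hat  : pi_hat(a, l0), fitted observational selection probability
    g_hat   : g_hat(a, l0), fitted model for E_obs[Y | A_1 = a, L_0 = l0]
    levels  : the finite set of treatment levels
    """
    total, n = 0.0, 0
    for l0, s0, a1, y in records:
        num = p(a1, s0)
        w = num / pi_hat(a1, l0) if num > 0 else 0.0
        # Direct (G-computation) part: regime-averaged prediction for this subject.
        direct = sum(p(a, s0) * g_hat(a, l0) for a in levels)
        # Weighted residual corrects the direct part when g_hat is wrong.
        total += w * (y - g_hat(a1, l0)) + direct
        n += 1
    return total / n
```

If `g_hat` is correct, the weighted residual averages to zero whatever `pi_hat` is; if `pi_hat` is correct, the weighted prediction cancels the direct part on average and the weighted response remains. Either way the estimate targets the same mean.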

6. EXAMPLE

The Fast Track trial is a longitudinal, multi-site, multi-cohort trial of a preventive intervention versus a control. For the purposes of this illustration we use a subset of the children for whom data was available at the time of the analysis; this is 579 high risk children of which 202 are in the intervention group. This is not a representative subset; furthermore the variables used in the analysis were selected for illustrative purposes only. A more complete analysis is forthcoming. We consider two endpoints, specifically chosen to illustrate the diversity of results that can follow from the proposed analyses. Both endpoints are from end of grade 3 teacher evaluations. The first endpoint is a teacher rating of school behavior problems. The second endpoint is a teacher rating of behavior change over the course of grade 3. This second rating assesses improvement across the year in social and emotional adjustment.

Beginning in the fall of grade 2, a nonrandom dynamic treatment regime for home visiting was planned as part of the intervention. The home visits were designed to improve parental functioning. The time intervals are fall and spring of grade 2 and 3 (K = 4). The severity for semester t is St, the total score on the home visiting process measure; St is a measure, taking values in the interval [1,24], evaluating the quality of family functioning. High values indicate better functioning. The treatment rule is: if assigned to control, there is no assignment of home visiting by staff; if assigned to treatment the staff should follow the treatment rule,

$$d_t(S_{t-1}) = 16\, I\{S_{t-1} \le 8\} + 8\, I\{9 \le S_{t-1} \le 16\} + 4\, I\{17 \le S_{t-1}\}, \quad t = 1, 2, 3, 4$$

in assigning home visiting level. Thus, if a child is in the intervention group and the total on the home visiting process measure from the end of grade 1, S0, is 8 or less, staff were to assign 16 home visits for the fall of grade 2, and so on. This means that At ∈ {4, 8, 16}.
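As a concrete reading of the rule, the following Python sketch maps the previous semester's process measure score to the planned number of home visits. The function name `planned_home_visits` is hypothetical, and the sketch assumes integer-valued scores in [1, 24], since the rule as written does not cover values strictly between 8 and 9 or between 16 and 17.

```python
def planned_home_visits(prev_score: int) -> int:
    """Return d_t(S_{t-1}): the planned number of home visits for semester t.

    Assumes an integer total score on the home visiting process measure
    in [1, 24]; low scores indicate poorer family functioning.
    """
    if not 1 <= prev_score <= 24:
        raise ValueError("score is assumed to lie in [1, 24]")
    if prev_score <= 8:
        return 16  # greatest need: most intensive home visiting
    if prev_score <= 16:
        return 8
    return 4  # scores of 17 or more: least intensive level


# Planned levels at the boundaries of each score band.
levels = {s: planned_home_visits(s) for s in (1, 8, 9, 16, 17, 24)}
```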

Our primary goal is an evaluation of the average effect of the nonrandom dynamic treatment regime on the two endpoints. This regime corresponds to degenerate treatment selection probabilities, i.e.,

$$p_t(a \mid s_{t-1}) = 1 \text{ if } a = d_t(s_{t-1}) \text{ and } 0 \text{ otherwise}$$

for t = 1, 2, 3, 4.

Staff were told that in exceptional cases they might need to deviate from the rule for recommending home visiting level; in practice, staff recommendations deviated from the rule on 22% of the 202 × 4 = 808 occasions. The staff recommended more home visits than the rule on 15% of the occasions and fewer home visits on 7% of the occasions. In all, 104 intervention children (52%) have treatment recommendation patterns that coincide with the rule for all 4 time periods. The staff could not deviate from the rule for the control children, as no home visiting assignments were made for the control children. Because the staff deviated from the rule (for the intervention children), we treat the group of intervention children as an observational sample in which the home visiting recommendations may have been influenced by severity (St−1) in ways varying from the rule; indeed, the recommendations may have been influenced by measures other than severity. The Fast Track research team collected a vast amount of information relating to why staff recommended intervention levels. For the purposes of this example, we will use past recommended intervention level, the home visiting process measure and site and cohort indicators to model the distribution of home visiting recommendations (these will be included in the L’s). Also included in the L’s are time-varying predictors of the two endpoints (recall that these variables are chosen for illustrative purposes; they do not encompass the full set of variables that might be used in a substantive treatment). See Table 1 for a description of L.

Table 1. Variables used in Example

| Variable | Collected During |
|---|---|
| Suspected Abuse¹ | Kindergarten |
| Race¹ | Kindergarten |
| School Behavior Problems¹ | End of Grade 1 |
| School Behavior Problems² | End of Grade 3 |
| School Behavior Change² | End of Grade 3 |
| Home Visiting Process Measure³ | End of Every Semester |
| Home Visiting Assignment⁴ | Beginning of Every Semester |
| Academic Performance¹ | End of Grades 1 and 2 |
| Social Skills¹ | Average for Grades 1 and 2 |
| Site (4 Sites)¹ | Kindergarten |
| Cohort (2 Cohorts)¹ | Kindergarten |

¹ Auxiliary Covariates (V0 or Vt).
² Response (Y).
³ Severity (St).
⁴ Treatment Level (At).

We assume that sequential randomization holds; that is, we assume that within any cross-stratification of {L0, A1, L1, A2, …, Lt−1} the staff treatment recommendation, At, is not predictive of the array of potential outcomes, Osr. For example, this means that (within a cross-stratification) children recommended 16 home visits do not have a higher chance of high values of {Y(a1, a2, a3, a4), at = 4, 8, 16, t = 1, 2, 3, 4} than children recommended 4 home visits. It also means that (within a cross-stratification) children recommended 16 home visits do not have a higher chance of high values of Y(16, 16, 16, 16) – Y(4, 4, 4, 4) than children recommended 4 home visits.

To implement (5.7) we first estimate the weights, W4. Since the staff were not to and did not assign any intervention to control children, the weights for the control children are equal to one. Note that the numerator of W4 is one for an intervention child whose treatment recommendations coincide with the treatment rule and zero otherwise. This means that of the 202 intervention children, 98 (202–104) will have W4 = 0 and thus the first term in (5.7) will be zero for these children. This makes sense because without further (smoothness) assumptions the responses from the 98 children whose treatment recommendations are inconsistent with the regime are not informative of responses to the treatment regime. This issue is discussed in further detail below. To estimate the denominator of the weight, we fit a proportional odds model to the staff home visiting recommendations. For each time t, we fit, via maximum likelihood,
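The weight for one intervention child can be sketched as follows, assuming (as described above) that the numerator is the indicator of following the rule in every semester and the denominator is the product of the fitted probabilities of the recommendations actually made. The function names and the probabilities in the example are hypothetical and are not Fast Track values; control children, whose weights equal one, are not handled here.

```python
from typing import Sequence


def rule(prev_score: int) -> int:
    """Planned home visiting level d_t(S_{t-1}) from the treatment rule."""
    if prev_score <= 8:
        return 16
    if prev_score <= 16:
        return 8
    return 4


def regime_weight(
    recommended: Sequence[int],
    prev_scores: Sequence[int],
    fitted_probs: Sequence[float],
) -> float:
    """Weight for one intervention child under the nonrandom regime.

    recommended[t]  : staff recommendation A_t (4, 8, or 16)
    prev_scores[t]  : severity S_{t-1} that the rule acts on
    fitted_probs[t] : fitted pi_t(A_t | past), assumed strictly positive
    """
    weight = 1.0
    for a, s, pi in zip(recommended, prev_scores, fitted_probs):
        if a != rule(s):
            return 0.0  # a single deviation from the rule zeroes the numerator
        weight /= pi
    return weight


# Hypothetical child whose recommendations follow the rule in all 4 semesters.
w_follow = regime_weight([16, 8, 8, 4], [6, 12, 15, 20], [0.5, 0.5, 0.5, 0.5])
# Hypothetical child with one deviation (16 visits recommended where the rule says 8).
w_deviate = regime_weight([16, 16, 8, 4], [6, 12, 15, 20], [0.5, 0.25, 0.5, 0.5])
```

A child who deviates even once contributes weight zero, which is why the 98 children mentioned above drop out of the first term of (5.7).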

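$$\begin{aligned}
\operatorname{logit}\left( \pi_t(16 \mid \bar L_{t-1}, \bar A_{t-1}; \alpha_w) \right) &= \alpha_{w,t01} + \alpha_{w,1}\, ZZ(\bar L_{t-1}, \bar A_{t-1}) \\
\operatorname{logit}\left( 1 - \pi_t(4 \mid \bar L_{t-1}, \bar A_{t-1}; \alpha_w) \right) &= \alpha_{w,t02} + \alpha_{w,1}\, ZZ(\bar L_{t-1}, \bar A_{t-1})
\end{aligned}$$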

where αw = {αw,t01, αw,t02, αw,1, t = 1, 2, 3, 4} and ZZ(L̄t−1, Āt−1) is a vector summary of past information. See Table 2 for the estimates of αw. Note that in all four time periods, a low home visiting process measure score is predictive of a higher staff home visiting recommendation; also in the last three time periods, a past high staff home visiting recommendation is predictive of a higher staff home visiting recommendation. Additionally there are some minor differences in home visiting recommendations by cohort and site. Since there were only 4 intervals and the sample size was not low, we chose to fit separate models for the home visiting recommendations in each interval; alternately we could have fit one model with possibly time varying intercepts and regression coefficients to all 4 intervals simultaneously (see Hernán et al., 1998a, for an example).
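The two logit equations determine all three recommendation probabilities. The sketch below shows the conversion; the function names are hypothetical, and the example uses the fall of grade 2 intercepts reported in Table 2a with every covariate set to zero, which is an illustrative covariate pattern rather than any particular child.

```python
import math


def expit(x: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def recommendation_probs(intercept_1: float, intercept_2: float, linear_term: float) -> dict:
    """Probabilities of recommending 16, 8, or 4 home visits.

    intercept_1 + linear_term is logit P(A_t = 16 | past);
    intercept_2 + linear_term is logit(1 - P(A_t = 4 | past)) = logit P(A_t >= 8 | past).
    Requires intercept_2 >= intercept_1 so that P(A_t = 8) is nonnegative.
    """
    p16 = expit(intercept_1 + linear_term)
    p_at_least_8 = expit(intercept_2 + linear_term)
    return {16: p16, 8: p_at_least_8 - p16, 4: 1.0 - p_at_least_8}


# Fall of grade 2 intercepts from Table 2a, with all covariates at zero.
baseline = recommendation_probs(-1.45, 1.40, 0.0)
# Same pattern but with HVPM16 = 1 (process measure <= 8), coefficient 2.22.
low_score = recommendation_probs(-1.45, 1.40, 2.22)
```

Because both equations share the same covariate coefficients, a covariate with a positive coefficient shifts probability toward the higher home visiting levels at both cut points.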

Table 2. Model for Home Visiting Assignment Probabilities

Table 2a: Fall Grade 2

| Covariate | Estimate | Stderr |
|---|---|---|
| Intercept 1 | −1.45 | 0.36 |
| Intercept 2 | 1.40 | 0.35 |
| Cohort 1 | −0.58 | 0.29 |
| Site 1 | −0.58 | 0.42 |
| Site 2 | −0.24 | 0.43 |
| Site 3 | 0.42 | 0.36 |
| HVPM16¹ | 2.22 | 0.75 |
| HVPM4¹ | −2.58 | 0.31 |

Table 2b: Spring Grade 2

| Covariate | Estimate | Stderr |
|---|---|---|
| Intercept 1 | −2.61 | 0.47 |
| Intercept 2 | 1.10 | 0.40 |
| Cohort 1 | −0.51 | 0.33 |
| Site 1 | 1.64 | 0.49 |
| Site 2 | 0.01 | 0.56 |
| Site 3 | 1.76 | 0.45 |
| HVPM16¹ | 3.04 | 0.96 |
| HVPM4¹ | −2.36 | 0.38 |
| Past High Home Visiting Assignment² | 2.18 | 0.60 |
| Past Low Home Visiting Assignment² | −1.55 | 0.39 |

Table 2c: Fall Grade 3

| Covariate | Estimate | Stderr |
|---|---|---|
| Intercept 1 | −3.19 | 0.69 |
| Intercept 2 | 3.00 | 0.60 |
| Cohort 1 | −0.58 | 0.38 |
| Site 1 | −1.26 | 0.55 |
| Site 2 | −0.75 | 0.62 |
| Site 3 | 1.23 | 0.48 |
| HVPM16¹ | 0.19 | 1.08 |
| HVPM4¹ | −3.65 | 0.68 |
| Past High Home Visiting Assignment² | 2.98 | 0.68 |
| Past Low Home Visiting Assignment² | −1.31 | 0.41 |

Table 2d: Spring Grade 3

| Covariate | Estimate | Stderr |
|---|---|---|
| Intercept 1 | −3.95 | 1.09 |
| Intercept 2 | 3.35 | 0.61 |
| Cohort 1 | −0.89 | 0.43 |
| Site 1 | −0.82 | 0.55 |
| Site 2 | −0.56 | 0.63 |
| Site 3 | 1.18 | 0.54 |
| HVPM16¹ | 4.67 | 1.61 |
| HVPM4¹ | −2.75 | 0.47 |
| Past High Home Visiting Assignment² | 5.94 | 1.61 |
| Past Low Home Visiting Assignment² | −2.53 | 0.43 |

¹ HVPM16 is 1 only if the home visiting process measure is ≤ 8; otherwise it is 0. HVPM4 is 1 only if the home visiting process measure is ≥ 17; otherwise it is 0.
² Past High Home Visiting Assignment is 1 if the home visiting assignment in the preceding semester (fall of grade 2 for Table 2b, spring of grade 2 for Table 2c, fall of grade 3 for Table 2d) is 16; otherwise it is 0. Past Low Home Visiting Assignment is 1 if the home visiting assignment in that same preceding semester is 4; otherwise it is 0.

To implement (5.7) we must also estimate αg. We model g1 through g4 by linear models:

$$g_t(\bar A_t, \bar L_{t-1}; \alpha_g) = \alpha_{g0} + \alpha_{g1}\, XX(\bar L_{t-1})$$

where XX(L̄t−1) is a summary vector of past information, and we estimate αg = {αg0, αg1} by solving (5.6). Note that we assume that the gt are functionally independent of the past assigned home visiting levels (Āt). It is easy to check that this assumption plus the linearity of gt imply that the parameterizations of g1,…, gK will be consistent with one another. In this example the assumption that the gt are functionally independent of the past assigned home visiting levels is plausible. Preliminary analyses (not shown) found that given past measures of L, both the most recent assigned home visiting levels and the average level of past home visiting were not predictive in the gt’s. This may not be surprising since the composition of children with a high past level of home visiting is strongly skewed toward children with higher severity whereas the composition of the children with a lower past level of home visiting is skewed toward children with low severity. Indeed if home visiting is effective, one could expect that within levels of past L, higher severity children will be similar to lower severity children after both groups receive their respective treatments. The estimates of αg are given in Table 3.

Table 3. Model for g1 through g4

Table 3a: School Behavior Problems

| Covariate | Estimate¹ | Stderr |
|---|---|---|
| Intercept | 0.30 | 0.21 |
| Cohort 1 | −0.28 | 0.10 |
| Site 1 | 0.43 | 0.14 |
| Site 2 | 0.48 | 0.15 |
| Site 3 | 0.45 | 0.12 |
| Suspected Abuse | 0.20 | 0.06 |
| School Behavior Problems-Grade 1 | 0.26 | 0.06 |
| Most Recent Academic Performance Score | 0.09 | 0.03 |
| Most Recent Social Skills Score | 0.16 | 0.05 |

Table 3b: School Behavior Change

| Covariate | Estimate¹ | Stderr |
|---|---|---|
| Intercept | −0.93 | 0.34 |
| Site 1 | −0.62 | 0.19 |
| Site 2 | −0.44 | 0.18 |
| Site 3 | −0.15 | 0.19 |
| Suspected Abuse | −0.06 | 0.10 |
| School Behavior Problems-Grade 1 | 0.18 | 0.10 |
| Most Recent Academic Performance Score | 0.05 | 0.03 |
| Most Recent Social Skills Score | −0.15 | 0.07 |

¹ All variables are coded so that high values indicate poorer functioning.

For each response, we model the mean response to the treatment regime, p̄4, by

$$\mu(\beta, Z, \text{treatment}) = \beta_0 + \beta_1\, SBP_1 + \beta_2\, \text{treatment}$$

where treatment is an indicator with the value one for the intervention children. When the grade 3 teacher rating of school behavior problems is the response, the grade 1 teacher rating of school behavior problems, SBP1, is included in the regression; when the teacher rating of behavior change over the course of grade 3 is the response, SBP1 is not included. See Table 4, h = 1, for estimates of β2. As a comparison we include the results of an “intent to treat” (ITT) analysis; this is a simple linear regression of the response on treatment (and possibly pretreatment covariates). The ITT analysis does not adjust for the level of staff deviation from the rules of the treatment regime. In contrast the analysis of the dynamic treatment regime yields an estimate of the mean treatment effect in the setting in which staff follow the treatment regime rules in assigning home visiting.

In Table 4a, we see that for the response, school behavior problems, both the ITT treatment effect and the treatment effect for the treatment regime (h = 1) are highly significant. But the estimated treatment effect in the scenario in which the staff follow the rules (−0.40) is nearly twice as large as the ITT treatment effect (−0.23). Recall that (see Table 2) in addition to the home visiting process measure, past treatment recommendation, site and cohort were predictive of the treatment recommendation probabilities (π̄4) in the observational data. The dynamic regime p̄4 forces uniformity across past treatment recommendation, site and cohort in making present treatment recommendations. It is unclear whether the increased treatment effect is due to this forced uniformity and/or due to using the home visiting measure only as prescribed by the regime rule. This is discussed further below.

In contrast, in Table 4b, we see that while there is a significant ITT treatment effect on behavior change, the estimated treatment effect for the treatment regime (h = 1) is not significant. In this case it appears that staff judgment in deviating from the rules has enhanced the treatment effect. As before, it is unclear whether the decreased treatment effect is due to forced uniformity across site and cohort in assigning treatment and/or due to ignoring past level of assigned treatment in making future recommendations and/or due to using the home visiting measure only as prescribed by the regime rule.

Table 4. Estimated Treatment Effects

Table 4a: Estimated Treatment Effect on School Behavior Problems¹

| h | Estimated Treatment Effect² | Stderr |
|---|---|---|
| ITT³ | −0.23 | 0.07 |
| 0 | −0.21 | 0.08 |
| 0.2 | −0.24 | 0.08 |
| 0.4 | −0.28 | 0.08 |
| 0.6 | −0.31 | 0.08 |
| 0.8 | −0.35 | 0.08 |
| 1 | −0.40 | 0.09 |

¹ The estimated intercepts and the regression coefficients of grade 1 school behavior problems score are omitted.
² High values on the school behavior problems response indicate behavior problems at school.
³ Results of a simple linear regression of response on treatment and grade 1 social health profile; all weights are equal to 1.

Table 4b: Estimated Treatment Effect on School Behavior Change¹

| h | Estimated Treatment Effect² | Stderr |
|---|---|---|
| ITT³ | −0.33 | 0.09 |
| 0 | −0.39 | 0.10 |
| 0.2 | −0.38 | 0.10 |
| 0.4 | −0.36 | 0.10 |
| 0.6 | −0.32 | 0.10 |
| 0.8 | −0.27 | 0.11 |
| 1 | −0.19 | 0.13 |

¹ The estimated intercepts are omitted.
² Low values on the School Behavior Change response indicate improvement across the year in social and emotional adjustment.
³ Results of a simple linear regression of response on treatment; all weights are equal to 1.

Recall that 98 of the 202 intervention children have W4 = 0. These children’s responses enter (5.7) only through the estimated gt’s. In order to better utilize these responses and also obtain a better understanding of the reasons for the difference in treatment effects between the regime and the ITT analyses, we examine a small variety of random dynamic treatment regimes, denoted by p̄4,h, h ∈ {0, .2, .4,…, 1}. Here pt,h(a|st−1) is the probability of recommending home visiting level a at time t when the home visiting measure is equal to st−1. For each t ∈ {1, 2, 3, 4}, set pt,h equal to a mixture of the indicator I{a = dt(st−1)} and qt(a|st−1):

$$p_{t,h}(a \mid s_{t-1}; \gamma) = h\, I\{a = d_t(s_{t-1})\} + (1 - h)\, q_t(a \mid s_{t-1}; \gamma)$$

where qt is a conditional probability mass function for treatment selection given severity, st−1. We choose qt to approximate the observational distribution of treatment recommendation at time t given only the home visiting process measure. None of these random dynamic regimes allows site, cohort or past home visiting recommendations to influence the present home visiting recommendation. Note that h = 1 corresponds to the originally planned treatment regime. The random dynamic treatment regime {p1,h, p2,h, p3,h, p4,h} corresponds to drawing a Bernoulli random variable with success probability equal to h at each time t; if the Bernoulli outcome is one then At is set equal to dt(St−1), otherwise At is drawn from the probability mass function qt(a|St−1;γ).
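The mixture and the two-stage drawing scheme can be sketched in a few lines of Python. The function names and the probability mass function `q` in the example are hypothetical; only the mixture formula and the Bernoulli description come from the text.

```python
import random


def mixture_pmf(h: float, rule_level: int, q: dict) -> dict:
    """p_{t,h}(a | s) = h * I{a = d_t(s)} + (1 - h) * q_t(a | s) for a in {4, 8, 16}."""
    return {
        a: h * (1.0 if a == rule_level else 0.0) + (1.0 - h) * q[a]
        for a in (4, 8, 16)
    }


def draw_level(h: float, rule_level: int, q: dict, rng: random.Random) -> int:
    """Draw A_t by the two-stage scheme: Bernoulli(h), then the rule or q_t."""
    if rng.random() < h:  # Bernoulli success: follow the rule
        return rule_level
    levels = (4, 8, 16)
    return rng.choices(levels, weights=[q[a] for a in levels])[0]


# Hypothetical q_t(. | s) for a severity value at which the rule prescribes 8 visits.
q = {4: 0.2, 8: 0.5, 16: 0.3}
pmf_half = mixture_pmf(0.5, 8, q)
pmf_rule = mixture_pmf(1.0, 8, q)   # degenerate at the rule level
pmf_obs = mixture_pmf(0.0, 8, q)    # equals q itself
a_draw = draw_level(0.5, 8, q, random.Random(0))
```

Setting h = 1 recovers the degenerate probabilities of the planned regime, and h = 0 recovers qt.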

Of course, the observational distribution of treatment recommendations given only the home visiting process measure is unknown and must be approximated. We estimate this approximation. To construct qt, t = 1, 2, 3, 4, at each time we fit, via maximum likelihood, the following proportional odds model to the data:

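$$\begin{aligned}
\operatorname{logit}\left( q_t(16 \mid S_{t-1}; \gamma) \right) &= \gamma_{t01} + \gamma_{t1}\, I\{d(S_{t-1}) = 16\} + \gamma_{t2}\, I\{d(S_{t-1}) = 4\} \\
\operatorname{logit}\left( 1 - q_t(4 \mid S_{t-1}; \gamma) \right) &= \gamma_{t02} + \gamma_{t1}\, I\{d(S_{t-1}) = 16\} + \gamma_{t2}\, I\{d(S_{t-1}) = 4\}
\end{aligned}$$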

where γ = {γt01, γt02, γt1, γt2, t = 1, 2, 3, 4}. The estimated values of γ are given in Table 5. We use qt(·|St−1; γ̂) to form the pt,h(·|St−1; γ̂) when h < 1.

Table 5. Model for Numerator of the Weights

Table 5a: Fall Grade 2 and Spring Grade 2

| Covariate | Fall Grade 2 Estimate | Fall Grade 2 Stderr¹ | Spring Grade 2 Estimate | Spring Grade 2 Stderr¹ |
|---|---|---|---|---|
| Intercept 1 | −1.69 | 0.23 | −1.55 | 0.25 |
| Intercept 2 | 1.06 | 0.20 | 1.27 | 0.24 |
| HVPM16² | 2.43 | 1.03 | 2.67 | 0.82 |
| HVPM4² | −2.61 | 0.34 | −2.69 | 0.34 |

Table 5b: Fall Grade 3 and Spring Grade 3

| Covariate | Fall Grade 3 Estimate | Fall Grade 3 Stderr¹ | Spring Grade 3 Estimate | Spring Grade 3 Stderr¹ |
|---|---|---|---|---|
| Intercept 1 | −2.57 | 0.42 | −2.87 | 0.45 |
| Intercept 2 | 1.71 | 0.32 | 1.28 | 0.26 |
| HVPM16² | 1.95 | 1.22 | 3.79 | 0.93 |
| HVPM4² | −3.00 | 0.36 | −2.43 | 0.34 |

¹ This model is used to approximate the distribution of treatment assignment given the home visiting process measure, thus robust standard errors are provided.
² HVPM16 is 1 only if the home visiting process measure is ≤ 8; otherwise it is 0. HVPM4 is 1 only if the home visiting process measure is ≥ 17; otherwise it is 0.

Note that because the proportional odds model for qt is not a saturated model, the models for the qt and πt will not be consistent with one another. However we only view qt as an approximation to the observational distribution of treatment recommendation given severity. Also note that the pt,h(·|St−1; γ̂) are used to form the numerator of the weights and, as before, the πt(·|L̄t−1, Āt−1; α̂w) are used to form the denominator of the weights. For each value of h and each of the two responses, we use (5.6) to construct estimators of αg, but for h < 1, pt is replaced by pt,h(·|St−1; γ̂). The resulting estimates of αg are omitted. The model for the mean treatment response is as before; Table 4 contains the results. The formula for the estimated standard errors changes since the numerator of the weights is now estimated. See the appendix for the appropriate formula.

Consider the results for h = 0 in Table 4. These results are essentially equivalent to the ITT results (−0.21 versus −0.23 for school behavior problems and −0.39 versus −0.33 for behavior change). Recall that for h = 0, the dynamic treatment regime corresponds to drawing At from the probability mass function qt(a|St−1; γ). Thus even though there was systematic variation from the treatment assignment rule, systematic variation that can be accounted for by past home visiting recommendation and site and cohort differences, this systematic variation does not appear to alter the mean response. Rather we see that as h approaches 1, so that the random dynamic treatment regime approaches the original dynamic treatment regime, the mean effect changes. This indicates that the change in mean effect from the ITT analysis to the original dynamic treatment regime analysis is due to only using the home visiting measure as prescribed by the regime rule.
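To see how a random regime lets children who deviated from the rule contribute, the weight for one child can be sketched as the product over semesters of the mixture numerator divided by the fitted observational probability. The function name and all probabilities below are hypothetical; in the analysis the q and π values would come from the fitted models.

```python
from typing import Sequence


def random_regime_weight(
    h: float,
    recommended: Sequence[int],
    rule_levels: Sequence[int],
    q_probs: Sequence[float],
    pi_probs: Sequence[float],
) -> float:
    """Weight for one child under the random regime p_{t,h}.

    recommended[t] : staff recommendation A_t
    rule_levels[t] : d_t(S_{t-1}), the level the rule prescribes
    q_probs[t]     : q_t(A_t | S_{t-1}) at the estimated gamma
    pi_probs[t]    : pi_t(A_t | past) at the estimated alpha_w, assumed positive
    """
    weight = 1.0
    for a, d, q, pi in zip(recommended, rule_levels, q_probs, pi_probs):
        numerator = h * (1.0 if a == d else 0.0) + (1.0 - h) * q
        weight *= numerator / pi
    return weight


# Hypothetical child who deviates from the rule in the second semester only.
args = ([16, 16, 8, 4], [16, 8, 8, 4], [0.5, 0.25, 0.5, 0.5], [0.5, 0.25, 0.5, 0.5])
w_planned = random_regime_weight(1.0, *args)   # zero: the planned regime excludes this child
w_mixed = random_regime_weight(0.5, *args)     # positive: the child now contributes
w_observed = random_regime_weight(0.0, *args)  # one here, since q equals pi in this example
```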

Clearly a more complete analysis of the Fast Track data is needed to investigate these substantive issues. As stated previously, the two responses were chosen specifically to illustrate the diversity of thought-provoking results that can be obtained by these types of analyses. Certainly one should give careful thought to which responses are of primary interest; variations in the dynamic treatment regime may differentially impact different responses.

7. DISCUSSION

This methodology provides a way to evaluate the effects of dynamic treatment regimes using observational data. It should be extended in a variety of ways. First, the interpretation of the results depends heavily on the assumption of sequential randomization. It is difficult to believe that this assumption really holds in the observational setting. Thus an extension of the sensitivity analyses of Robins et al. (1999a) and Scharfstein et al. (1999) to this setting is needed. A second needed extension is to allow for missing severity measures. There are two ways to allow for missing severity measures. First, if one can expect severity measures to be occasionally missing in practice, then it makes sense to extend the dynamic regime rules to include rules for assignment when severity is missing. On the other hand, if one wishes to make inference for a rule that does not allow for missing severity measures, yet one’s observational data include individuals with missing severity measures, then the methodology provided here must be adapted. One possibility is by weighting as described in Hernán et al. (1998). A third extension would be to grouped data. Many intervention/prevention studies take place in a school based setting and thus grouped data dominate. Lastly, more systematic work is needed on how to formulate/evaluate dynamic treatment rules, and in particular on the evaluation and detection of which posttreatment conditions (such as the development of side effects or the occurrence of family crises) may negate or enhance the effect of treatment.

Acknowledgments

The research of the first author is partially supported by NIDA grant A P50 DA 10075, DMS 9802885, and SBR 9811983. The research of the second author is partially supported by NIAID grant 1RO1AI46182. The research of the third author is partially supported by NIAID grant RO1AI32475.

APPENDIX

Throughout we assume (5.1). Denote the true value of β by β0. In the following it is useful to note two alternate ways to rewrite (5.3). The first emphasizes the correct specification of π̄K:

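$$\begin{aligned}
S_\beta(X; \alpha_w, \alpha_g, \beta) = {} & W_{\bar p_K}(\bar A_K, \bar L_{K-1})\, \dot\mu(\beta, Z)\, (Y - \mu(\beta, Z)) \\
& - \sum_{t=1}^{K} \sum_{a_t} \left( I\{A_t = a_t\} - \pi_t(a_t \mid \bar A_{t-1}, \bar L_{t-1}) \right) W_{\bar p_t}(a_t, \bar A_{t-1}, \bar L_{t-1})\, \dot\mu(\beta, Z) \left( g_t(a_t, \bar A_{t-1}, \bar L_{t-1}) - \mu(\beta, Z) \right).
\end{aligned} \tag{8.1}$$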

Note that if π̄K is correctly specified then the second term above has mean zero even if the gt’s are incorrectly specified and/or β ≠ β0. The second way to rewrite (5.3) emphasizes the correct estimation of the gt’s:

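$$\begin{aligned}
S_\beta(X; \alpha_w, \alpha_g, \beta) = {} & W_{\bar p_K}(\bar A_K, \bar L_{K-1})\, \dot\mu(\beta, Z) \left( Y - g_K(\bar A_K, \bar L_{K-1}) \right) \\
& - \sum_{t=1}^{K-1} W_{\bar p_t}(\bar A_t, \bar L_{t-1})\, \dot\mu(\beta, Z) \left( g_t(\bar A_t, \bar L_{t-1}) - \sum_{a_{t+1}} p_{t+1}(a_{t+1} \mid S_t)\, g_{t+1}(a_{t+1}, \bar A_t, \bar L_t) \right) \\
& + \dot\mu(\beta, Z) \left( \sum_{a_1} g_1(a_1, L_0)\, p_1(a_1 \mid S_0) - \mu(\beta, Z) \right).
\end{aligned} \tag{8.2}$$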

Note that if the gt’s are correctly specified then the first two terms have expectation zero even if π̄K is incorrectly specified and/or β ≠ β0.

Lemma 8.1

Formula (5.3) is orthogonal to, or equivalently, uncorrelated with the score functions for the treatment assignment probabilities, π̄K; thus (5.3) is $W_{\bar p_K}(\bar A_K, \bar L_{K-1})\, \dot\mu(\beta, Z)\, (Y - \mu(\beta, Z))$ minus its projection on the score functions for the treatment assignment probabilities.

Proof

First note that all score functions for the π̄K belong to the linear span of

$$B_t(a, \bar A_t, \bar L_{t-1}) = U_t(\bar A_{t-1}, \bar L_{t-1}) \left( I\{A_t = a\} - \pi_t(a \mid \bar A_{t-1}, \bar L_{t-1}) \right)$$

for each possible treatment level a and t ∈ {1,…, K}, with Ut(Āt−1, L̄t−1), t ∈ {1,…,K}, arbitrary bounded functions. Thus we need only show that (5.3), or equivalently (8.1), is orthogonal to each of these functions. Note that Bj(a, Āj, L̄j−1) is orthogonal to the terms with t ≠ j in the sum in (8.1). The covariance between Bj(a, Āj, L̄j−1) and (8.1) is thus

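$$E_{obs}\left[ W_{\bar p_K}(\bar A_K, \bar L_{K-1})\, \dot\mu(\beta, Z)\, (Y - \mu(\beta, Z))\, B_j(a, \bar A_j, \bar L_{j-1}) \right] \tag{8.3}$$

$$- E_{obs}\left[ \left( \sum_{a_j} \left( I\{A_j = a_j\} - \pi_j(a_j \mid \bar A_{j-1}, \bar L_{j-1}) \right) W_{\bar p_j}(a_j, \bar A_{j-1}, \bar L_{j-1})\, \dot\mu(\beta, Z) \left( g_j(a_j, \bar A_{j-1}, \bar L_{j-1}) - \mu(\beta, Z) \right) \right) B_j(a, \bar A_j, \bar L_{j-1}) \right]. \tag{8.4}$$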

Since, given Āj−1 and L̄j−1, Bj(a, Āj, L̄j−1) has mean zero, the second term (8.4) reduces to

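$$- E_{obs}\left[ \sum_{a_j} I\{A_j = a_j\}\, W_{\bar p_j}(a_j, \bar A_{j-1}, \bar L_{j-1})\, \dot\mu(\beta, Z) \left( g_j(a_j, \bar A_{j-1}, \bar L_{j-1}) - \mu(\beta, Z) \right) B_j(a, \bar A_j, \bar L_{j-1}) \right].$$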

But this is the negative of the first term (8.3) (repeatedly take conditional expectations and use (5.1)).

Lemma 8.2

If the parameterization of μ(β, Z) is saturated and g1, …,gK are nonparametric then (5.3) is the efficient influence function for β.

Proof

Because of the factorization (5.2) of the likelihood and the fact that the functional μ(β, Z) does not depend on the factor $\prod_{t=1}^{K} \pi_t(a_t \mid \bar a_{t-1}, \bar l_{t-1})$, the efficient influence function is the same in the model for which π̄K is completely unknown as in the model where π̄K is a known function up to a parameter, αw (Robins et al., 1994; Robins and Ritov, 1997). However when μ(β, Z) is saturated, the model with π̄K completely unknown is a nonparametric model for the observed data. Bickel et al. (1993) show that there is only one influence function for any nonparametric model, which therefore must necessarily be the efficient influence function. Thus if we find an influence function of the model in which π̄K is completely unknown, this will be the efficient influence function for this nonparametric model and also the efficient influence function for the model in which π̄K is parameterized. Now when π̄K is completely known, $U(X, \beta, \bar\pi_K) = W_{\bar p_K}(\bar A_K, \bar L_{K-1})\, \dot\mu(\beta, Z)\, (Y - \mu(\beta, Z))$ is an unbiased estimating function for β regardless of the distribution of Y − μ. It follows that U(X, β, π̄K) minus its projection on all scores for π̄K (as given in (5.3)) is orthogonal to the nuisance tangent space for the nonparametric models. Furthermore (5.3) is an influence function (the expectation of its derivative with respect to β is −1). Thus (5.3) is the efficient influence function when π̄K is parameterized.

Assume:

  1. There exists a finite vector αw* such that Pobs(Sw(X; αw*)) = 0, where ℙnSw(X; αw) is the derivative with respect to αw of (5.5), and there exists a finite vector αg* such that Pobs(Sg(X; αg*)) = 0, where ℙnSg(X; αg) is given in (5.6). μ(β0, Z) is equal to (4.4); that is, the parameterization of (4.4) in terms of β is correct. If π̄K is correctly parameterized then put αw* = αw0, the true parameter in the observational distribution. Similarly if {g1,…,gK} are correctly parameterized then αg* = αg0, the true parameter in the observational distribution.

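  2. Let Θ be a neighborhood of (αw*, αg*, β0). The class of functions
$$\left\{ \tfrac{\partial}{\partial \alpha_g} S_\beta(X; \alpha_w, \alpha_g, \beta),\ \tfrac{\partial}{\partial \alpha_w} S_\beta(X; \alpha_w, \alpha_g, \beta),\ \tfrac{\partial}{\partial \beta} S_\beta(X; \alpha_w, \alpha_g, \beta),\ \tfrac{\partial}{\partial \alpha_w} S_w(X; \alpha_w),\ \tfrac{\partial}{\partial \alpha_g} S_g(X; \alpha_g),\ S_w(X; \alpha_w)^{\otimes 2},\ S_g(X; \alpha_g)^{\otimes 2},\ S_\beta(X; \alpha_w, \alpha_g, \beta)^{\otimes 2} \ \text{for}\ (\alpha_w, \alpha_g, \beta) \in \Theta \right\}$$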
    is a Glivenko-Cantelli class. (For vector V, V⊗2 = VVT.)
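  3. Assume that $I_w = -E_{obs} \frac{\partial}{\partial \alpha_w} S_w(X; \alpha_w)$ is invertible at αw = αw*, that $I_g = -E_{obs} \frac{\partial}{\partial \alpha_g} S_g(X; \alpha_g)$ is invertible at αg = αg*, and that B = Eobs μ̇(β0, Z) μ̇(β0, Z)T is invertible.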

Lemma 8.3

Suppose that either π̄K or the gt’s are correctly parameterized. Assuming 1) through 3), there exists a sequence of solutions β̂ to (5.7) for which √n(β̂ − β0) converges in distribution to a multivariate normal with mean zero and variance-covariance matrix B−1ΣB−1, where

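$$\Sigma = E_{obs}\left( S_\beta(X; \alpha_w^*, \alpha_g^*, \beta_0) + S_{\beta,\alpha_w} I_w^{-1} S_w(X; \alpha_w^*) + S_{\beta,\alpha_g} I_g^{-1} S_g(X; \alpha_g^*) \right)^{\otimes 2}$$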

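where $S_{\beta,\alpha_w} = E_{obs} \frac{\partial}{\partial \alpha_w} S_\beta(X; \alpha_w, \alpha_g^*, \beta_0)\big|_{\alpha_w = \alpha_w^*}$ and $S_{\beta,\alpha_g} = E_{obs} \frac{\partial}{\partial \alpha_g} S_\beta(X; \alpha_w^*, \alpha_g, \beta_0)\big|_{\alpha_g = \alpha_g^*}$. Furthermore B̂−1Σ̂B̂−1 is a consistent estimator of B−1ΣB−1.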

Proof

Suppose that αw*=αw0. Then as noted in the discussion concerning (8.1) we have that

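$$E_{obs}\, S_\beta(X, \alpha_{w0}, \alpha_g^*, \beta_0) = E_{obs}\, W_{\bar p_K}(\bar A_K, \bar L_{K-1}; \alpha_{w0})\, \dot\mu(\beta_0, Z)\, (Y - \mu(\beta_0, Z)).$$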

This is zero since μ(β0, Z) is equal to (4.4). Alternately suppose that αg*=αg0. Then as noted in the discussion concerning (8.2) we have that

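$$E_{obs}\, S_\beta(X, \alpha_w^*, \alpha_{g0}, \beta_0) = E_{obs}\, \dot\mu(\beta_0, Z) \left( \sum_{a_1} g_1(a_1, L_0; \alpha_{g0})\, p_1(a_1 \mid S_0) - \mu(\beta_0, Z) \right).$$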

As before this is zero since μ(β0, Z) is equal to (4.4) (recall that (4.4) can be written as in (5.8)).

Proofs that there exists a consistent sequence of solutions, α̂w →P αw*, to ℙn(Sw(X; αw)) = 0 and a consistent sequence of solutions, α̂g →P αg*, to ℙn(Sg(X; αg)) = 0 are very similar and use assumptions 1), 2) and 3) along with lemma 2 of Aitchison and Silvey (1958). Next, using the existence of consistent α̂w and α̂g and again using Aitchison and Silvey’s lemma 2 along with assumptions 1), 2), 3) and 5), one can show there exists a consistent solution, β̂ →P β0. These proofs are omitted.

To prove asymptotic normality use Taylor series arguments, along with the Glivenko-Cantelli property in 2) and the invertibility in 3), to deduce that

$$\sqrt{n}\,(\hat\alpha_w - \alpha_w^*) = I_w^{-1} \sqrt{n}\, \mathbb{P}_n\left( S_w(X; \alpha_w^*) \right) + o_P(1)$$

and that

$$\sqrt{n}\,(\hat\alpha_g - \alpha_g^*) = I_g^{-1} \sqrt{n}\, \mathbb{P}_n\left( S_g(X; \alpha_g^*) \right) + o_P(1).$$

Next a Taylor series yields,

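$$\begin{aligned}
0 &= \mathbb{P}_n S_\beta(X, \hat\alpha_w, \hat\alpha_g, \hat\beta) \\
&= \mathbb{P}_n S_\beta(X, \alpha_w^*, \alpha_g^*, \beta_0) \\
&\quad + \mathbb{P}_n \frac{\partial}{\partial \alpha_w} S_\beta(X; \alpha_w, \hat\alpha_g, \hat\beta)\Big|_{\alpha_w = \alpha_w^{**}} (\hat\alpha_w - \alpha_w^*) \\
&\quad + \mathbb{P}_n \frac{\partial}{\partial \alpha_g} S_\beta(X; \alpha_w^*, \alpha_g, \hat\beta)\Big|_{\alpha_g = \alpha_g^{**}} (\hat\alpha_g - \alpha_g^*) \\
&\quad + \mathbb{P}_n \frac{\partial}{\partial \beta} S_\beta(X; \alpha_w^*, \alpha_g^*, \beta^{**})\, (\hat\beta - \beta_0),
\end{aligned}$$

where αw** is a vector of values intermediate between α̂w and αw*, αg** is a vector of values intermediate between α̂g and αg*, and β** is a vector of values intermediate between β̂ and β0.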

Note from the previous discussion that the first term in the above expansion has mean zero. Next apply the Glivenko-Cantelli property in 2), along with the invertibility in 3), to yield

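$$\sqrt{n}\,(\hat\beta - \beta_0) = B^{-1} \sqrt{n}\, \mathbb{P}_n \left( S_\beta(X; \alpha_w^*, \alpha_g^*, \beta_0) + S_{\beta,\alpha_w} I_w^{-1} S_w(X; \alpha_w^*) + S_{\beta,\alpha_g} I_g^{-1} S_g(X; \alpha_g^*) \right) + o_P(1).$$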

To show that B̂−1Σ̂B̂−1 converges in probability to B−1ΣB−1, we may use the asymptotic normality result along with 2) and 3).

Given a parametrization of p̄K in terms of the parameter γ, define

$$S_{num}(X; \gamma) = \frac{\partial}{\partial \gamma} \log p_t(A_t \mid S_{t-1}; \gamma).$$

Define

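$$W_{\bar p_t}(\bar A_t, \bar L_{t-1}; \gamma, \alpha_w) = \prod_{j=1}^{t} \frac{p_j(A_j \mid S_{j-1}; \gamma)}{\pi_j(A_j \mid \bar L_{j-1}, \bar A_{j-1}; \alpha_w)}$$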

and

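$$\begin{aligned}
S_g(X; \gamma, \alpha_g) = {} & \left( Y - g_K(\bar A_K, \bar L_{K-1}; \alpha_g) \right) \frac{\partial}{\partial \alpha_g} g_K(\bar A_K, \bar L_{K-1}; \alpha_g) \\
& + \sum_{t=1}^{K-1} \left( \sum_{a_{t+1}} p_{t+1}(a_{t+1} \mid S_t; \gamma)\, g_{t+1}(a_{t+1}, \bar A_t, \bar L_t; \alpha_g) - g_t(\bar A_t, \bar L_{t-1}; \alpha_g) \right) \frac{\partial}{\partial \alpha_g} g_t(\bar A_t, \bar L_{t-1}; \alpha_g).
\end{aligned}$$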

And define,

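$$\begin{aligned}
S_\beta(X; \alpha_w, \alpha_g, \gamma, \beta) = {} & W_{\bar p_K}(\bar A_K, \bar L_{K-1}; \gamma, \alpha_w)\, \dot\mu(\beta, Z)\, (Y - \mu(\beta, Z)) \\
& - \sum_{t=1}^{K} W_{\bar p_t}(\bar A_t, \bar L_{t-1}; \gamma, \alpha_w)\, \dot\mu(\beta, Z) \left( g_t(\bar A_t, \bar L_{t-1}; \alpha_g) - \mu(\beta, Z) \right) \\
& + \sum_{t=1}^{K} \sum_{a_t} \pi_t(a_t \mid \bar A_{t-1}, \bar L_{t-1}; \alpha_w)\, W_{\bar p_t}(a_t, \bar A_{t-1}, \bar L_{t-1}; \gamma, \alpha_w)\, \dot\mu(\beta, Z) \left( g_t(a_t, \bar A_{t-1}, \bar L_{t-1}; \alpha_g) - \mu(\beta, Z) \right).
\end{aligned}$$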

Define

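$$\begin{aligned}
U(X) = {} & S_\beta(X; \gamma^*, \alpha_w^*, \alpha_g^*, \beta_0^*) + S_{\beta,w} I_w^{-1} S_w(X; \alpha_w^*) \\
& + S_{\beta,g} I_g^{-1} \left( S_g(X; \gamma^*, \alpha_g^*) + S_{g,num} I_{num}^{-1} S_{num}(X; \gamma^*) \right) \\
& + S_{\beta,num} I_{num}^{-1} S_{num}(X; \gamma^*),
\end{aligned}$$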

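where $S_{\beta,w} = E_{obs} \frac{\partial}{\partial \alpha_w} S_\beta(X; \gamma^*, \alpha_w, \alpha_g^*, \beta_0^*)\big|_{\alpha_w = \alpha_w^*}$, $S_{\beta,g} = E_{obs} \frac{\partial}{\partial \alpha_g} S_\beta(X; \gamma^*, \alpha_w^*, \alpha_g, \beta_0^*)\big|_{\alpha_g = \alpha_g^*}$, and $S_{g,num} = E_{obs} \frac{\partial}{\partial \gamma} S_g(X; \gamma, \alpha_g^*)\big|_{\gamma = \gamma^*}$. Similarly let Σ̂* = ℙnÛ(X)⊗2 where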

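$$\begin{aligned}
\hat U(X) = {} & S_\beta(X; \hat\gamma, \hat\alpha_w, \hat\alpha_g, \hat\beta) + \hat S_{\beta,w} \hat I_w^{-1} S_w(X; \hat\alpha_w) \\
& + \hat S_{\beta,g} \hat I_g^{-1} \left( S_g(X; \hat\gamma, \hat\alpha_g) + \hat S_{g,num} \hat I_{num}^{-1} S_{num}(X; \hat\gamma) \right) \\
& + \hat S_{\beta,num} \hat I_{num}^{-1} S_{num}(X; \hat\gamma)
\end{aligned}$$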

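and $\hat S_{\beta,w} = \mathbb{P}_n \frac{\partial}{\partial \alpha_w} S_\beta(X; \hat\gamma, \alpha_w, \hat\alpha_g, \hat\beta)\big|_{\alpha_w = \hat\alpha_w}$, $\hat S_{\beta,g} = \mathbb{P}_n \frac{\partial}{\partial \alpha_g} S_\beta(X; \hat\gamma, \hat\alpha_w, \alpha_g, \hat\beta)\big|_{\alpha_g = \hat\alpha_g}$, and $\hat S_{g,num} = \mathbb{P}_n \frac{\partial}{\partial \gamma} S_g(X; \gamma, \hat\alpha_g)\big|_{\gamma = \hat\gamma}$.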

  • (1′)
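    There exists a finite vector αw* such that Pobs(Sw(X; αw*)) = 0 and there exists a finite vector αg* such that Pobs(Sg(X; αg*)) = 0. There exists a finite vector γ* such that Pobs(Snum(X; γ*)) = 0 and also a finite vector β0* for which
$$\mu(\beta_0^*, Z) = \sum_{\bar a_K \in \mathcal{A}} \int_{\bar l_{K-1}} \int_y y\, f_Y(y \mid \bar a_K, \bar v_{K-1}, \bar s_{K-1}) \prod_{t=1}^{K} p_t(a_t \mid s_{t-1}; \gamma^*) \prod_{t=1}^{K-1} f_t(l_t \mid \bar a_t, \bar l_{t-1})\, f_0(l_0 \mid z)\, dy\, d\bar l_{K-1}.$$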

    Suppose there exists a finite vector, αg* such that Pobs(Sg(X;γ*,αg*))=0. If π̄K is correctly parameterized then put αw*=αw0, the true parameter in the observational distribution. Similarly if {g1,…,gK} are correctly parameterized then αg*=αg0, the true parameter in the observational distribution.

The parameter αw is estimated as before. Calculate γ̂ by solving 0 = ℙnSnum (X;γ). Calculate α̂g by solving 0 = ℙnSg(X;γ̂,αg). Lastly estimate β* by solving 0 = ℙnSβ(X;α̂w,α̂g,γ̂,β) for β̂.

  • (2′)
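    Let Θ be a neighborhood of (γ*, αw*, αg*, β0*). The class of functions
$$\left\{ \tfrac{\partial}{\partial \alpha_g} S_\beta(X; \gamma, \alpha_w, \alpha_g, \beta),\ \tfrac{\partial}{\partial \alpha_w} S_\beta(X; \gamma, \alpha_w, \alpha_g, \beta),\ \tfrac{\partial}{\partial \beta} S_\beta(X; \gamma, \alpha_w, \alpha_g, \beta),\ \tfrac{\partial}{\partial \gamma} S_\beta(X; \gamma, \alpha_w, \alpha_g, \beta),\ \tfrac{\partial}{\partial \alpha_g} S_g(X; \gamma, \alpha_g),\ \tfrac{\partial}{\partial \gamma} S_g(X; \gamma, \alpha_g),\ \tfrac{\partial}{\partial \gamma} S_{num}(X; \gamma),\ \tfrac{\partial}{\partial \alpha_w} S_w(X; \alpha_w),\ S_w(X; \alpha_w)^{\otimes 2},\ S_g(X; \gamma, \alpha_g)^{\otimes 2},\ S_\beta(X; \gamma, \alpha_w, \alpha_g, \beta_0)^{\otimes 2},\ S_{num}(X; \gamma)^{\otimes 2} \ \text{for}\ (\gamma, \alpha_w, \alpha_g, \beta) \in \Theta \right\}$$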
    is a Glivenko-Cantelli class. (For vector V, V⊗2 = VVT.)
  • (3′)

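    Assume that $I_w = -E_{obs} \frac{\partial}{\partial \alpha_w} S_w(X; \alpha_w)$ is invertible at αw = αw*, that $I_g = -E_{obs} \frac{\partial}{\partial \alpha_g} S_g(X; \alpha_g)$ is invertible at αg = αg*, that $I_{num} = -E_{obs} \frac{\partial}{\partial \gamma} S_{num}(X; \gamma)$ is invertible at γ = γ*, and that $B = E_{obs}\, \dot\mu(\beta_0^*, Z)\, \dot\mu(\beta_0^*, Z)^T$ is invertible.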

Lemma 8.4

Suppose that either π̄K or the gt’s are correctly parameterized. Assuming 1′) through 3′), there exists a sequence of solutions β̂ to (5.7) for which √n(β̂ − β0*) converges in distribution to a multivariate normal with mean zero and variance-covariance matrix B−1Σ*B−1, where Σ* = Eobs(U(X)⊗2).

Furthermore B̂−1Σ̂*B̂−1 is a consistent estimator of B−1Σ*B−1.

Proof

The proof is similar to that of the previous lemma and is omitted.

