The International Journal of Biostatistics
2010 Feb 22;6(2):2. doi: 10.2202/1557-4679.1211

Targeted Maximum Likelihood Based Causal Inference: Part I

Mark J van der Laan 1
PMCID: PMC3126670  PMID: 21969992

Abstract

Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The obtained G-computation formula represents the counterfactual distribution the data would have had, had this intervention been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.

In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.

Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.

Keywords: causal effect, causal graph, censored data, cross-validation, collaborative double robust, double robust, dynamic treatment regimens, efficient influence curve, estimating function, estimator selection, locally efficient, loss function, marginal structural models for dynamic treatments, maximum likelihood estimation, model selection, pathwise derivative, randomized controlled trials, sieve, super-learning, targeted maximum likelihood estimation

1. Introduction

The data structure on the experimental unit can often be viewed as a time-series in discrete time, possibly on a fine scale. At many time points nothing might be observed and at possibly irregularly spaced time points events occur and are measured, where some of these events may occur at the same time. A specified ordering of all measured variables which respects this time-ordering and possibly additional knowledge about the ordering in which variables were realized, implies a graph in the sense that for each observed variable we can identify a set of parent nodes of that observed variable, defined as the set of variables occurring before the observed variable in the ordering. The likelihood of this unit specific data structure can be factorized accordingly in terms of the conditional distribution of a node in the graph, given the parents of that node, across all nodes. This particular factorization of the likelihood puts no restriction on the possible set of data generating distributions, but the ordering affects the so called G-computation formula for counterfactual distributions of the data under certain interventions implied by this ordering. Beyond the factorization of the likelihood in terms of a product of conditional distributions, the G-computation formula involves specifying a set of nodes in the time-series/graph as the variables to intervene upon, and specifying the intervention for these nodes. These interventions could be rules that assign the value for the intervention node (possibly) in response to the observed data on the (observed) parents of the intervention node. The G-computation formula is now defined as the product, across all nodes, excluding the intervention nodes, of the conditional distribution of a node, given the parent nodes, with the intervention nodes in the parent set following their assigned values.

If it is known that the conditional distribution of a node only depends on a subset of the parents that were implied by the ordering, then that knowledge should be incorporated by reducing the parent set to its correct set. This kind of knowledge does reduce the size of the model for the data generating distribution (and such assumptions can indeed be tested from the data).

The G-computation formula provides a probability distribution of the intervention specific data structure. Under certain conditions on a causal graph on an augmented set of nodes which includes unobserved nodes beyond the observed nodes (Pearl (2000)), such as no unblocked back-door path from each intervention node to future/downstream nodes, this G-computation formula equals the counterfactual distribution of the data structure one would have observed had one enforced the specified intervention on the system described by the causal graph.

We remind the reader that a causal graph on a set of nodes states that each node is a deterministic function of its parents. It typically represents a set of so called causal assumptions that cannot be learned from the data. Given a declared causal graph on a set of nodes, one can formally state what assumptions on this causal graph are needed in order to claim that a specified G-computation formula for the observed nodes corresponds with the G-computation formula for the causal graph on the full set of nodes (that includes the unobserved nodes), where the latter G-computation formula is then viewed as the gold-standard representing the causal effect of interest (Pearl (2000)).

Either way, the time-ordering, and possibly additional known ordering, does provide a statistical graph for the data as explained above, and a corresponding G-computation formula.

In this article we are concerned with (semi-parametric) efficient estimation of the “causal” effects viewed as parameters of this G-computation formula based on observing n independent and identically distributed observations O1, . . . , On of O. Specifically, we are concerned with estimation of parameters of the G-computation formula implied by a particular statistical graph on the observed data structure O, in the semiparametric model that makes no assumptions about each node-specific conditional distribution in the graph, given its parents.

Formally, the density of O is modeled as

p0(O) = ∏_j P(N(j) | Pa(N(j))),

where N(j) denote the nodes in the graph representing the observed variables, Pa(N(j)) denote the parents of N(j), and we make no assumptions on each conditional distribution of N(j), beyond that N(j) only depends on Pa(N(j)). Note, however, as remarked above, if the parent sets induce more structure than the parent sets implied by an ordering of all observed variables, then this statistical graph of p0 might imply a real (i.e., not just nonparametric) semiparametric model on p0, corresponding with a variety of conditional independence assumptions.
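As a minimal illustration of this factorization (a hypothetical sketch, with toy conditional probabilities, not an estimator from this article), the density p0(O) can be evaluated as a product of node-specific conditional probabilities, with each node's parent set taken as all nodes preceding it in the ordering:

```python
# Hypothetical sketch: evaluate p0(O) = prod_j P(N(j) | Pa(N(j))) for a discrete
# data structure, with parent sets implied by the ordering of the nodes.
def factorized_density(o, nodes, cond_probs):
    """o: dict mapping node name -> observed value.
    nodes: list of node names in their time-ordering.
    cond_probs: dict mapping node name -> function(value, parents_dict) -> probability."""
    p = 1.0
    for j, name in enumerate(nodes):
        parents = {k: o[k] for k in nodes[:j]}  # parent set implied by the ordering
        p *= cond_probs[name](o[name], parents)
    return p

# Toy example with two binary nodes, where N1 depends on N0:
cond = {
    "N0": lambda v, pa: 0.6 if v == 1 else 0.4,
    "N1": lambda v, pa: (0.8 if v == 1 else 0.2) if pa["N0"] == 1 else (0.3 if v == 1 else 0.7),
}
p = factorized_density({"N0": 1, "N1": 1}, ["N0", "N1"], cond)  # 0.6 * 0.8 = 0.48
```

Reducing a node's parent set to a known subset, as discussed above, corresponds to having `cond_probs[name]` ignore the irrelevant entries of `parents`.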

Even if the (non-testable) causal assumptions required to interpret the G-computation formula as a counterfactual distribution on a system fail to hold, assuming that the ordering of the likelihood respects the ordering w.r.t. the intervention nodes (i.e., it correctly states what variables are pre or post intervention for each intervention node), the target parameters often still represent effects of interest aiming to get as close to a causal effect as the data allows. In particular, one can simply interpret the G-computation parameters for what they are, namely well defined effects of interventions on the distribution of the data: see van der Laan (2006) for more discussion of the role of causal parameters in variable importance analysis.

It is important to note that the probability density p0 of the observed data structure O, factored by the statistical graph, can be represented as a product of two factors, the first factor Q0 that identifies the G-computation formulas for interventions, and the second factor g0 representing the product, over the intervention nodes, of the conditional distributions of these nodes: p0 = Q0g0. We often refer to the second factor as the censoring and/or treatment mechanism in case the intervention nodes correspond with censoring variables and/or treatment assignments. We will denote the true probability distribution of the data-structure on the experimental unit with P0, and its probability density with p0.

A variety of estimators of causal effects of multiple time-point interventions, including handling censored data (by enforcing no-censoring as part of the intervention), have been proposed: Inverse Probability of Censoring Weighted (IPCW) estimators, Augmented IPCW-estimators (which are double robust), maximum likelihood based estimators, and targeted Maximum Likelihood Estimators (which are double robust). The IPCW and augmented-IPCW estimators fall in the category of estimating equation methodology (van der Laan and Robins (2003)). The augmented-IPCW estimator is defined as a solution of an estimating equation in the target parameter implied by the so called efficient influence curve. Maximum likelihood based estimators involve estimation of the distribution of the data and subsequent evaluation of the target parameter. Traditional maximum likelihood estimators are not targeted towards the target parameter, and are thereby, in particular, not double robust.
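The contrast between the IPCW and augmented IPCW estimators can be sketched for a single time point treatment-specific mean E[Y_1]; the data-generating mechanism below, with the treatment mechanism g0(1 | W) treated as known, is purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Simulated single time point data with confounding by W; g0(1|W) is known.
W = rng.binomial(1, 0.5, n)
g1 = 0.2 + 0.5 * W                                 # g0(1 | W) = P(A = 1 | W)
A = rng.binomial(1, g1)
Y = rng.binomial(1, 0.1 + 0.2 * A * (1 + 2 * W))   # true E[Y_1] = 0.5

# Naive treated-arm mean: biased under confounding.
naive = Y[A == 1].mean()

# IPCW estimator: reweight treated subjects by 1/g0(1|W).
ipcw = np.mean(A * Y / g1)

# Augmented IPCW: add an augmentation term based on an outcome-regression
# estimate Qbar(1, W); here the stratified sample means, as an illustration.
Qbar1 = np.where(W == 1,
                 Y[(A == 1) & (W == 1)].mean(),
                 Y[(A == 1) & (W == 0)].mean())
aipw = np.mean(A / g1 * (Y - Qbar1) + Qbar1)
```

With g0 known, both weighted estimators are consistent, while the augmented version additionally exploits the outcome regression; the naive treated-arm mean remains biased.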

Targeted maximum likelihood estimators (T-MLE) are two stage estimators, the first stage applies regularized maximum likelihood based estimation, where we advocate the use of loss-based super-learning to maximize adaptivity to the true distribution/G-computation formula of the data (van der Laan et al. (2007)), and the second stage targets the obtained fit from the first stage towards the target parameter of interest through a targeted maximum likelihood step. This targeted maximum likelihood step removes bias for the target parameter if the censoring/treatment mechanism used in the targeted MLE step is estimated consistently. In this targeted maximum likelihood step the initial (first stage) estimator is treated as an off-set, and it involves the application of a fluctuation function to the offset, where the set of possible fluctuations represents a parametric model consisting of fluctuated versions of the offset. This parametric model is a so called least favorable parametric model in the sense that its maximum likelihood estimator listens as much to the data w.r.t. fitting the target parameter as a semiparametric model efficient estimator. Formally, it is the parametric submodel through the first stage estimator with the worst Cramer-Rao lower bound for estimation of the target parameter (at zero fluctuation), among all parametric submodels. (This worst case Cramer-Rao lower bound as achieved by this least favorable model is actually the semiparametric information bound defined as the variance of the efficient influence curve.) Given this least-favorable submodel, maximum likelihood estimation is used to fit the finite dimensional fluctuation parameter.
Due to this parametric targeted maximum likelihood step the targeted maximum likelihood estimator is also double robust: the estimator is consistent if the initial first-stage estimator of the G-computation factor of the likelihood is consistent, or if the conditional distributions of the intervention nodes (i.e., censoring/treatment mechanism) are estimated consistently (as required to identify the fluctuation function used in targeted maximum likelihood step). In addition, under regularity conditions, the targeted MLE is (semiparametric) efficient if the initial estimator is consistent, and consistent and asymptotically linear if either the initial estimator or the treatment/censoring mechanism estimator is consistent.
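A minimal sketch of this two stage procedure, for the single time point treatment-specific mean E[Y_1]: the simulated data, the deliberately poor (constant) initial estimator standing in for a super-learner fit, and the known treatment mechanism are all illustrative assumptions; the fluctuation uses the familiar logistic submodel with clever covariate H(A, W) = I(A = 1)/g0(1 | W):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Simulated observational data; g0(1|W) is treated as known/correctly estimated.
W = rng.binomial(1, 0.5, n)
g1 = 0.2 + 0.5 * W                                 # g0(1 | W)
A = rng.binomial(1, g1)
Y = rng.binomial(1, 0.1 + 0.2 * A * (1 + 2 * W))   # true E[Y_1] = 0.5

def expit(x): return 1 / (1 + np.exp(-x))
def logit(p): return np.log(p / (1 - p))

# Stage 1: deliberately misspecified (constant) initial estimator of E[Y | A, W].
Qbar = np.full(n, Y.mean())    # Qbar_n(A_i, W_i)
Qbar1 = np.full(n, Y.mean())   # Qbar_n(1, W_i)

# Stage 2: targeted MLE step. Fluctuate the offset logit(Qbar) along the clever
# covariate H and fit the one-dimensional eps by MLE (Newton steps on the
# binomial log-likelihood).
H = A / g1        # H(A_i, W_i)
H1 = 1 / g1       # H(1, W_i)
eps = 0.0
for _ in range(20):
    q = expit(logit(Qbar) + eps * H)
    score = np.sum(H * (Y - q))
    info = np.sum(H ** 2 * q * (1 - q))
    eps += score / info

# Substitution estimator: plug the targeted fit into the parameter mapping.
psi = expit(logit(Qbar1) + eps * H1).mean()
```

Despite the misspecified initial estimator, the correctly specified fluctuation (via the known g0) removes the bias, illustrating the double robustness described above.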

Even though the augmented IPCW-estimator is also tailored to be double robust and locally efficient, targeted maximum likelihood estimation has the following important advantages relative to estimating equation methods such as the augmented-IPCW estimator: 1) the T-MLE is a substitution estimator and thereby, contrary to the augmented IPCW-estimator, respects global constraints of the model such as that one might be estimating a probability in [0, 1], 2) since, given an initial estimator, the targeted MLE step involves maximizing the likelihood along a smooth parametric submodel, contrary to the augmented IPCW-estimator, it does not suffer from multiple solutions of a (possibly non-smooth in the parameter) estimating equation, 3) contrary to the augmented IPCW-estimator, the T-MLE does not require that the efficient influence curve can be represented as an estimating function in the target parameter, and thereby applies to all path-wise differentiable parameters, 4) it can use the cross-validated log-likelihood (of the targeted maximum likelihood estimator), or any other cross-validated risk of an appropriate loss function for the relevant factor Q0 of the density (i.e., the G-computation formula) of the data, as the principal criterion to select among different targeted maximum likelihood estimators indexed by different initial estimators or different choices of fluctuation models. The latter allows fine tuning of the initial estimator of Q0 as well as the fine tuning of the estimation of the unknowns (e.g., censoring/treatment mechanism g0) of the fluctuation function applied in the targeted maximum likelihood step, thereby utilizing the excellent theoretical and practical properties of the loss-function specific cross-validation selector. On the other hand, the augmented-IPCW estimator cannot be evaluated based on a loss function for Q0 alone, but also requires a choice of loss function for g0.
The latter point 4) also allows the targeted MLE to be generalized to loss-based estimation of infinite dimensional parameters that can be approximated by pathwise differentiable parameters.

These important theoretical advantages have a substantial practical impact, by allowing one to construct estimators in a wider variety of applications, and with better finite sample and asymptotic mean squared error w.r.t. the target. This inspired us to implement targeted maximum likelihood estimation of causal effects of single time point treatment in a variety of data analyses, allowing for right-censoring of the time-till-event clinical outcome, and missingness of the clinical outcome. Even though we discussed the overall targeted maximum likelihood estimator for causal effect estimation of multiple time point interventions in technical reports (see van der Laan (2008)), in this article we aim to dive deeper into this challenge. In particular, our goal is to present templates that can be implemented with standard statistical software, and aim to understand the choices to be made. In future papers we will be implementing these methods on real and simulated data sets and use this paper as guidance.

The organization of this paper is as follows. Firstly, in Section 2 we start out with presenting the targeted MLE for sequentially randomized controlled trials. A specific targeted loss function is proposed to select among different targeted MLE indexed by different initial estimators, which results in maximally asymptotically efficient targeted MLE’s (Rubin and van der Laan (2008)). Due to the double robustness of the targeted MLE this estimator is guaranteed to estimate the causal effect of interest consistently, so that confidence intervals and type-I error control are asymptotically valid. In addition, the T-MLE utilizes all the data (including time-dependent biomarkers) and thereby has great potential for large efficiency gains and bias reductions in these sequentially randomized controlled trials.

In Section 3 we develop and present a general targeted MLE for any time-series data structure, applicable to sequentially randomized controlled trials with censoring and missingness, as well as longitudinal observational studies. The integration of loss-based (super) learning to build and select among targeted MLE’s is made explicit again, and targeted loss functions are proposed for that purpose. In addition, a variety of interesting observations are made about the targeted MLE, relevant to the practical implementation of this estimator in complex longitudinal observational studies and randomized controlled trials. We end with a discussion in Section 4. Our companion Part II article will present demonstrations of the targeted MLE.

An overview of relevant literature

The construction of efficient estimators of path-wise differentiable parameters in semi-parametric models requires utilizing the so called efficient influence curve defined as the canonical gradient of the path-wise derivative of the parameter. This is no surprise since a fundamental result of the efficiency theory is that a regular estimator is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve. We refer to Bickel et al. (1997), and Andersen et al. (1993). There are two distinct approaches for construction of efficient (or locally efficient) estimators: the estimating equation approach that uses the efficient influence curve as an estimating equation (e.g., one-step estimators based on the Newton-Raphson algorithm in Bickel et al. (1997)), and the targeted MLE that uses the efficient influence curve to define a targeted fluctuation function of an initial estimator, and maximizes the likelihood in that targeted direction.

The construction of locally efficient estimators in censored data models in which the censoring mechanism satisfies the so called coarsening at random assumption (Heitjan and Rubin (1991), Jacobsen and Keiding (1995), Gill et al. (1997)) has been a particular focus area. This includes also the theory for locally efficient estimation of causal effects, since the causal inference data structure can be viewed as a missing data structure on the intervention-specific counterfactuals, and the sequential randomization assumption (SRA) implies the coarsening at random assumption on the missingness mechanism, while SRA still does not imply any restriction on the data generating distribution. A particular construction of counterfactuals from the observed data structure, so that the observed data structure augmented with the counterfactuals satisfies the consistency (missing data structure) and sequential randomization assumption, is provided in Yu and van der Laan (2002), providing an alternative to the implicit construction presented earlier in Gill and Robins (2001), thereby showing that, without loss of generality, one can view causal inference as a missing data structure estimation problem: the importance of the causal graph is that it makes explicit the definition of the counterfactuals of interest (i.e., full data in the censored data model).

The theory for inverse probability of censoring weighted estimation and the augmented locally efficient IPCW estimator based on estimating functions defined in terms of the orthogonal complement of the nuisance tangent space in CAR-censored data models (including the optimal estimating function implied by efficient influence curve) was originally developed in Robins (1993), Robins and Rotnitzky (1992). Many papers have been building on this framework (see van der Laan and Robins (2003) for a unified treatment of this estimating equation methodology and references). In particular, double robust locally efficient augmented IPCW-estimators have been developed (Robins and Rotnitzky (2001b), Robins and Rotnitzky (2001a), Robins et al. (2000b), Robins (2000a), van der Laan and Robins (2003), Neugebauer and van der Laan (2005), Yu and van der Laan (2003)).

Causal inference for multiple time-point interventions under sequential randomization started out with papers by Robins in the eighties: e.g. Robins (1986), Robins (1989). The popular propensity score methods to assess causal effects of single time point interventions (e.g., Rosenbaum and Rubin (1983), Sekhon (2008), Rubin (2006)) are not double robust (i.e., rely on correct specification of propensity score), have no natural generalization to multiple time-point interventions, and are also inefficient estimators for single time point interventions (Abadie and Imbens (2006)), relative to the locally efficient double robust estimators such as the augmented IPCW estimator, and the targeted MLE.

Structural nested models and marginal structural models for static treatments were proposed by Robins as well: Robins (1997b), Robins (1997a), Robins (2000b). Many application papers on marginal structural models exist, involving the application of estimating equation methodology (IPCW and DR-IPCW): e.g., Hernan et al. (2000), Robins et al. (2000a), Bryan et al. (2003), Yu and van der Laan (2003). In van der Laan et al. (2005) history adjusted marginal structural models were proposed as a natural extension of marginal structural models, and it was shown that the latter also imply an individualized treatment rule of interest (a so called history adjusted statically optimal treatment regimen): see Petersen et al. (2005) for an application to the when to switch question in HIV research.

Murphy et al. (2001) present a nonparametric estimator for a mean under a dynamic treatment in an observational study. Structural nested models for modeling and estimating an optimal dynamic treatment were proposed by Murphy (2003), Robins (2003), Robins (2005a), Robins (2005b). Marginal structural models for a user supplied set of dynamic treatment regimens were developed and proposed in van der Laan (2006), van der Laan and Petersen (2007) and, simultaneously and independently, in Robins et al. (2008). van der Laan and Petersen (2007) also includes a data analysis application of these models to assess the mean outcome under a rule that switches treatment when CD4 count drops below a cut-off, and the optimal cut-off is estimated as well. Another practical illustration in sequentially randomized trials of these marginal structural models for realistic individualized treatment rules is presented in Bembom and van der Laan (2007).

Unified loss-based learning based on cross-validation was developed in van der Laan and Dudoit (2003), including construction of adaptive minimax estimators for infinite dimensional parameters of the full data distribution in CAR-censored data and causal inference models: see also van der Laan et al. (2006), van der Vaart et al. (2006), van der Laan et al. (2004), Dudoit and van der Laan (2005), Keleş et al. (2002), Sinisi and van der Laan (2004). This research establishes, in particular, finite sample oracle inequalities, which state that the expectation of the loss-function specific dissimilarity between the cross-validated selected estimator among the library of candidate estimators (trained on training samples) and the truth is less than or equal to the expectation of the loss-function specific dissimilarity between the best possible selected estimator and the truth plus a term that is bounded by a constant times the logarithm of the number of candidate estimators in the library divided by the sample size. The only assumption this oracle inequality relies upon is that the loss function is uniformly bounded. These oracle results for the cross-validation selector inspired a unified super-learning methodology. This methodology first constructs a set of candidate estimators, proposes a family of weighted combinations of these candidate estimators indexed by a weight vector, and uses cross-validation to determine a weighted combination with optimal cross-validated risk. Under the assumption that the loss function is uniformly bounded, and the number of estimators is polynomial in sample size, the resulting estimator (super learner) is either asymptotically equivalent with the oracle selected estimator among the library of weighted combinations of the estimators, or it achieves the optimal parametric rate of convergence (i.e., one of the estimators corresponds with a correctly specified parametric model) up to a (worst case) log-n factor.
We refer to van der Laan et al. (2007), Polley and van der Laan (2009).
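The cross-validation selector over a family of weighted combinations can be sketched as follows; the toy library of two candidate regression estimators, the grid over convex weights, and the simulation are all illustrative assumptions, not the full super-learning methodology:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)   # truth is linear in X

# Library of candidate estimators: each maps training data to a prediction function.
def fit_mean(Xtr, Ytr):
    m = Ytr.mean()
    return lambda x: np.full_like(x, m)

def fit_linear(Xtr, Ytr):
    slope, intercept = np.polyfit(Xtr, Ytr, 1)
    return lambda x: intercept + slope * x

candidates = [fit_mean, fit_linear]

# V-fold cross-validated predictions for each candidate.
V = 5
folds = np.array_split(rng.permutation(n), V)
cv_pred = np.zeros((len(candidates), n))
for test in folds:
    train = np.setdiff1d(np.arange(n), test)
    for k, fit in enumerate(candidates):
        cv_pred[k, test] = fit(X[train], Y[train])(X[test])

# Select the convex weight minimizing the cross-validated squared-error risk
# over a grid; combination = (1 - alpha) * mean-fit + alpha * linear-fit.
alphas = np.linspace(0, 1, 21)
risks = [np.mean((Y - ((1 - a) * cv_pred[0] + a * cv_pred[1])) ** 2) for a in alphas]
alpha = alphas[int(np.argmin(risks))]
```

With a strongly linear truth, the cross-validation selector puts essentially all weight on the linear candidate, as the oracle inequality predicts.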

The super-learning methodology applied to a loss function for the G-computation formula factor, Q0, of the observed data distribution, provides substitution estimators of ψ0. However, although these super learners of Q0 are optimal w.r.t. the dissimilarity with Q0 implied by the loss function, the corresponding substitution estimators will be overly biased for a smooth parameter mapping Ψ. This is due to the fact that cross-validation makes optimal choices w.r.t. the (global) loss-function specific dissimilarity, but the variance of Ψ(Qn) is of smaller order than the variance of Qn itself.

van der Laan and Rubin (2006) integrates the loss-based learning of Q0 into the locally efficient estimation of pathwise differentiable parameters, by enforcing the restriction in the loss-based learning that each candidate estimator of Q0 needs to be a targeted maximum likelihood estimator (thereby, in particular, enforcing each candidate estimator of Q0 to solve the efficient influence curve estimating equation). Another way to think about this is that each loss function L(Q) for Q0 has a corresponding targeted loss function L(Q*), Q* representing the targeted maximum likelihood estimator applied to initial Q, and we apply the loss-based learning to the latter targeted version of the loss function L(Q). Rubin and van der Laan (2008) propose the square of the efficient influence curve as a valid and sensible loss function L(Q) for selection and estimation of Q0 in models in which g0 can be estimated consistently, such as in randomized controlled trials.
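The squared efficient influence curve loss can be sketched for the single time point mean E[Y_1] in an RCT-style setting with g0(1 | W) = 0.5 known; the data-generating mechanism and the two candidate fits below are illustrative assumptions (and in the targeted loss-based learning described above the candidates would themselves be targeted MLEs):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# RCT-style data: g0(1|W) = 0.5 is known, so the squared efficient influence
# curve is a valid loss for Qbar0 (Rubin and van der Laan (2008)).
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.5, n)
Y = rng.binomial(1, 0.3 + A * (0.2 + 0.2 * W))   # E[Y | A=1, W] = 0.5 + 0.2 W

# Efficient influence curve for psi = E[Y_1], evaluated at a candidate Qbar1(W),
# with the candidate's own plug-in value of psi.
def eic(Qbar1_vals):
    psi = Qbar1_vals.mean()
    return A / 0.5 * (Y - Qbar1_vals) + Qbar1_vals - psi

risk_correct = np.mean(eic(0.5 + 0.2 * W) ** 2)   # correctly specified candidate
risk_const = np.mean(eic(np.full(n, 0.5)) ** 2)   # misspecified constant candidate
```

The empirical squared-EIC risk is smaller for the correctly specified candidate, consistent with this loss being minimized at Q0 when g0 is known.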

The implications of this targeted loss-based learning are that Q0 is estimated optimally (maximally adaptive to the true Q0) w.r.t. the targeted loss function L(Q*) using the super-learning methodology, and due to the targeted MLE step the resulting substitution estimator of ψ0 is now asymptotically linear as well if the targeted fluctuation function is estimated at a good enough rate (and only requiring adjustment by confounders not yet accounted for by initial estimator: see collaborative targeted MLE): either way, asymptotic bias reduction for the target parameter will occur as long as the censoring/treatment mechanism is estimated consistently. In addition, since the targeted maximum likelihood step involves additional maximum likelihood fitting, generally speaking, no increase in bias will occur, even if the fluctuation function is heavily misspecified.

Targeted MLE have been applied in a variety of estimation problems: Bembom et al. (2008), Bembom et al. (2009) (physical activity), Tuglus and van der Laan (2008) (biomarker analysis), Rosenblum et al. (2009) (AIDS), van der Laan (2008) (case control studies), Rose and van der Laan (2008) (case control studies), Rose and van der Laan (2009) (matched case control studies), Moore and van der Laan (2009) (causal effect on time till event, allowing for right-censoring), van der Laan (2008) (adaptive designs, and multiple time point interventions), Moore and van der Laan (2007) (randomized trials with binary outcome). We refer to van der Laan et al. (September, 2009) for collective readings on targeted maximum likelihood estimation.

In van der Laan and Gruber (2009) we use the loss-based cross-validation to not only select among different initial estimators for the targeted maximum likelihood estimators, but it is also used to select the fit of the fluctuation function applied to the initial estimator (and thus the fit of the censoring and treatment mechanism). This results in a so called collaborative double robust targeted maximum likelihood estimator, which utilizes the collaborative double robustness of the efficient influence curve, which is a stronger robustness result than the regular double robustness of the efficient influence curve the double robust estimators rely upon. These collaborative double robust estimators select confounders for the censoring and treatment mechanism in response to the outcome and the initial estimator of Q0, thereby allowing for more effective bias reductions by the resulting fluctuation functions (as predicted by theory and observed in practice). Simulations and data analysis illustrating the excellent performance of the collaborative double robust T-MLE are presented in van der Laan and Gruber (2009).

2. The T-MLE in multi-stage sequentially randomized controlled trials

Consider a sequentially randomized trial in which one randomly samples a patient from a population, one collects at baseline covariates L(0), and one randomizes the patient to a first line treatment A(0). Subsequently, one collects an intermediate biomarker L(1), and based on this intermediate clinical response one randomizes the patient to a second line treatment A(1). Finally, one collects the clinical outcome Y of interest at a fixed point in time. This experiment is carried out for n patients.

We first discuss two such sequentially randomized cancer trials.

Anderson Cancer Center Prostate Cancer Two Stage Trial: Thall et al. (2000) present an analysis of the first clinical trial in oncology that makes use of sequential randomization. During this trial, prostate cancer patients who were found to be responding poorly to their initially randomly assigned regimen (among four treatments) were re-randomized to the remaining three candidate regimens. The clinical outcomes of interest were response to treatment at a particular point in time and time till death. In contrast to conventional trials based on a single randomization, this design allows the investigator to study adaptive treatment strategies that adjust a patient's treatment in response to the observed course of the illness. Such adaptive strategies, also referred to as dynamic or individualized treatment rules, form the basis of common medical practice in cancer chemotherapy, with physicians typically facing the following questions: Which regimen should be used to initially treat a patient? Which regimen should the patient be switched to if the front-line regimen fails? Given an observed intermediate outcome such as a change in tumor size or PSA level, what threshold should be used to decide that the current regimen is failing? In recent years, sequentially randomized trials have been recognized as being uniquely suited to the study of these exciting questions (Thall et al. (2000), Lavori and Dawson (2000), Lavori and Dawson (2004), Murphy (2005)), with researchers in other clinical areas also beginning to implement this design (Rush et al. (2003), Schneider et al. (2006), Swartz et al. (2007)). The original results of Thall et al. (2000) focus on fitting logistic regression models for the different stage-specific factors of the likelihood. We can apply the T-MLE to estimate the mean outcomes under the 12 dynamic treatment rules indexed by first line therapy and the second line switching therapy, and also incorporate the handling of the right-censoring.

E4494 Eastern Oncology Trial: Another example is the cancer trial E4494, a lymphoma study of rituximab therapy that had both induction and maintenance rituximab randomizations, where the second randomization of maintenance versus observation was based on intermediate response to the initial treatment. The clinical outcome of interest was time till death.

Let’s denote the observed data structure on a randomly sampled patient from the target population with O = (L(0), A(0), L(1), A(1), Y = L(2)). For simplicity and the sake of presentation, we will assume that A(0), L(1) and A(1) are binary.

The likelihood can be factorized as

$$ p(O) = \prod_{j=0}^{2} P(L(j) \mid \bar{L}(j-1), \bar{A}(j-1)) \prod_{j=0}^{1} P(A(j) \mid \bar{A}(j-1), \bar{L}(j)), $$

where the first factors will be denoted with QL(j), j = 0, 1, 2, and the latter factors denote the treatment mechanism and are denoted with gA(j), j = 0, 1. We make the convention that for j = 0, Ā(j − 1) and (j − 1) are empty. In a sequentially randomized controlled trial, the treatment assignment mechanisms gA(j), j = 0, 1, are known.

Suppose our parameter of interest is the treatment specific mean EYd for a certain treatment rule d that assigns treatment d0(L(0)) at time 0 and treatment d1(L̄(1), A(0)) at time 1. For example, d0(L(0)) = 1 is a static treatment assignment, and d1(L̄(1), A(0)) = I(L(1) = 1)1 + I(L(1) = 0)0 assigns treatment 1 if the patient responds well to the first line treatment 1, and treatment 0 if the patient does not respond well to the first line treatment 1. We note that any treatment rule can be viewed as a function of L̄ = (L(0), L(1)) only, and therefore we will use the shorter notation d(L̄) = (d0(L(0)), d1(L̄)) for the two rules at times 0 and 1.

Note that EYd = Ψ(Q) for a well defined mapping Ψ. Specifically, we have Ψ(Q) = EPdY, where the intervened distribution Pd of (L(0), L(1), L(2)) is defined by the G-computation formula:

$$ P^{d}(\bar{L}) = \prod_{j=0}^{2} Q_{L(j),d}(\bar{L}(j)), $$

where, for notational convenience, especially in view of our representation for general data structures, we used the notation QL(j),d(L̄(j)) = QL(j)(L(j) | L̄(j − 1), Ā(j − 1) = d(L̄(j − 1))).

Instead of computing an analytic mean under Pd, this mean can also be approximated by simulating a large number of observations from this distribution and taking the mean of its last component L(2). Note that Pd corresponds with simulating sequentially from the conditional distributions QL(0),d, QL(1),d, QL(2),d, for L(0), L(1), L(2), respectively. Alternatively, this mean is calculated analytically as follows:
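This sequential simulation can be sketched in a few lines. All ingredients below — the rule (d0, d1), the conditional probabilities p_L0, p_L1, p_Y, and the simulation size — are hypothetical illustrations, not quantities from either trial:

```python
import numpy as np

# Sketch of the Monte Carlo evaluation of E Y_d: draw sequentially from the
# intervened factors Q_{L(0)}, Q_{L(1),d}, Q_{L(2),d} and average the final
# outcome. All conditional probabilities and the rule d are illustrative.

rng = np.random.default_rng(42)

def d0(l0): return 1                  # first-line rule: always treat
def d1(l1): return 1 if l1 == 1 else 0  # keep treatment 1 iff responding

def p_L0(): return 0.5                          # P(L(0) = 1)
def p_L1(l0, a0): return 0.3 + 0.2 * l0 + 0.1 * a0
def p_Y(l0, l1, a0, a1): return 0.2 + 0.15 * l1 + 0.2 * a0 + 0.1 * a1

def simulate_EYd(n_sim=200_000):
    l0 = rng.binomial(1, p_L0(), n_sim)
    a0 = np.vectorize(d0)(l0)                   # set A(0) by the rule d0
    l1 = rng.binomial(1, p_L1(l0, a0))          # draw from Q_{L(1),d}
    a1 = np.vectorize(d1)(l1)                   # set A(1) by the rule d1
    y = rng.binomial(1, p_Y(l0, l1, a0, a1))    # draw from Q_{L(2),d}
    return y.mean()

print(simulate_EYd())
```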

$$ \Psi(Q) = \sum_{l(0), l(1), y} y \, P^{d}(l(0), l(1), y) = \sum_{y} y \sum_{l(0), l(1)} P^{d}(l(0), l(1), y) = \sum_{y} y \sum_{l(0), l(1)} Q_{L(0)}(l(0)) Q_{L(1),d}(l(0), l(1)) Q_{Y,d}(l(0), l(1), y). $$

If QL(0) is replaced by the empirical distribution, then this reduces to

$$ \Psi(Q) = \frac{1}{n} \sum_{i=1}^{n} \sum_{y} y \sum_{l(1)} Q_{L(1),d}(L_i(0), l(1)) Q_{Y,d}(L_i(0), l(1), y). $$

From this analytic expression it also follows that, even if Y is continuous, Ψ(Q) only depends on the conditional distribution of Y through its mean. In this case of a 2-stage sequentially randomized controlled trial, the analytic evaluation of Ψ(Q) seems preferable since it will be very fast to compute.
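The analytic evaluation for the binary two-stage case is a double sum over (l(1), y), averaged over the observed Li(0). The sketch below assumes hypothetical fitted conditional probabilities q_L1 and q_Y and a hypothetical rule (d0, d1); it only illustrates the plug-in computation of Ψ(Q):

```python
import itertools

# Sketch of the analytic plug-in evaluation of Psi(Q) for the two-stage binary
# example. The conditional-probability functions q_L1, q_Y and the rule (d0, d1)
# are illustrative placeholders, not estimates from data.

def q_L1(l1, l0, a0):
    """Q_{L(1)}: P(L(1)=l1 | L(0)=l0, A(0)=a0) -- illustrative numbers."""
    p1 = 0.3 + 0.2 * l0 + 0.1 * a0
    return p1 if l1 == 1 else 1 - p1

def q_Y(y, l0, l1, a0, a1):
    """Q_Y: P(Y=y | L(0)=l0, A(0)=a0, L(1)=l1, A(1)=a1)."""
    p1 = 0.2 + 0.15 * l1 + 0.2 * a0 + 0.1 * a1
    return p1 if y == 1 else 1 - p1

def d0(l0):
    return 1                       # first-line rule: always treat

def d1(l1):
    return 1 if l1 == 1 else 0     # keep treatment 1 iff responding

def psi_plug_in(L0_sample):
    """E Y_d = (1/n) sum_i sum_{l1, y} y * Q_{L(1),d} * Q_{Y,d} at L_i(0)."""
    total = 0.0
    for l0 in L0_sample:
        a0 = d0(l0)
        for l1, y in itertools.product([0, 1], repeat=2):
            total += y * q_L1(l1, l0, a0) * q_Y(y, l0, l1, a0, d1(l1))
    return total / len(L0_sample)

print(psi_plug_in([0, 1, 1, 0, 1]))
```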

With this precise definition of the parameter as a mapping from the conditional distributions QL(j), j = 0, 1, 2, to the real line, given an estimator Qn, we obtain a substitution estimator Ψ(Qn) of ψ.

The targeted maximum likelihood estimator involves first defining an initial estimator of Q, and then a subsequent targeted maximum likelihood step according to a fluctuation function applied to this initial estimator, where this step is tailored to remove bias from the initial estimator for the purpose of estimating the parameter of interest ψ. The fluctuation function equals the least favorable parametric model through Q, defined as the parametric submodel through Q that makes estimation of Ψ(Q) hardest, in the sense that the parametric Cramer-Rao lower bound for the variance of an unbiased estimator is maximal among all parametric submodels. Intuitively, this is the parametric submodel for which the maximum likelihood estimator listens as much to the data w.r.t. fitting the target parameter as an efficient estimator in the large semiparametric model, and can thereby be expected to provide important bias reduction. This definition of least favorable model implies that a least favorable parametric model is a model that has a score at zero fluctuation equal to the efficient influence curve/canonical gradient of the pathwise derivative of the target parameter Ψ.

We use the following Theorem, which provides the representation of the efficient influence curve needed to define the fluctuation function. This Theorem also provides the formula for the efficient influence curve for other parameters and for RCTs with more than two stages.

Theorem 1 The efficient influence curve for ψ = EYd at the true distribution P0 of O can be represented as

$$ D^{*} = \Pi(D_{IPCW} \mid T_Q), $$

where

$$ D_{IPCW}(O) = \frac{I(\bar{A} = d(\bar{L}))}{g(d(\bar{L}) \mid X)} Y - \psi, $$

TQ is the tangent space of Q in the nonparametric model, and Π denotes the projection operator onto TQ in the Hilbert space L²₀(P₀) of square P₀-integrable functions of O, endowed with inner product 〈h1, h2〉 = EP₀h1(O)h2(O).

We have that this subspace

$$ T_Q = \bigoplus_{j=0}^{2} T_{Q_{L(j)}} $$

is the orthogonal sum of the tangent spaces TQL(j) of the QL(j)-factors, which consists of functions of L(j), Pa(L(j)) with conditional mean zero, given the parents Pa(L(j)) of L(j), j = 0, 1, 2. Recall that we also denote L(2) with Y. Let

$$ D_j^{*} = \Pi(D^{*} \mid T_{Q_{L(j)}}), \quad j = 0, 1, 2. $$

We have

$$ D_0^{*}(O) = E(Y_d \mid L(0)) - \psi $$
$$ D_1^{*}(O) = \frac{I(A(0) = d_0(L(0)))}{g(d_0(L(0)) \mid X)} \times \{ C_{L(1)}(Q_0)(1) - C_{L(1)}(Q_0)(0) \} \{ L(1) - E(L(1) \mid L(0), A(0)) \} $$
$$ D_2^{*}(O) = \frac{I(\bar{A} = d(\bar{L}))}{g(\bar{A} \mid X)} \{ L(2) - E(L(2) \mid \bar{L}(1), \bar{A}(1)) \}, $$

where, for δ ∈ {0, 1},

$$ C_{L(1)}(Q_0)(\delta) = E(Y_d \mid L(0), A(0), L(1) = \delta). $$

We note that

$$ E(Y_d \mid L(0), A(0) = d_0(L(0)), L(1)) = E(Y \mid \bar{L}(1), \bar{A} = d(\bar{L})). $$

For the general data structure O = (L(0), A(0), . . . , L(K), A(K), Y = L(K + 1)), and DIPCW = D1(O)/g(Ā | X) for some D1, we have

$$ \Pi(D_{IPCW} \mid T_{Q_{L(j)}}) = \frac{1}{g(\bar{A}(j-1) \mid X)} \times \Big\{ E\Big( \sum_{\bar{a}(j,K)} D_1(\bar{a}(j,K)) \mid \bar{L}(j), \bar{A}(j-1) \Big) - E\Big( \sum_{\bar{a}(j,K)} D_1(\bar{a}(j,K)) \mid \bar{L}(j-1), \bar{A}(j-1) \Big) \Big\} = \frac{1}{g(\bar{A}(j-1) \mid X)} C_{L(j)}(\bar{L}(j-1), \bar{A}(j-1)) \big( L(j) - E(L(j) \mid \bar{L}(j-1), \bar{A}(j-1)) \big), $$

where

$$ C_{L(j)} = E\Big( \sum_{\bar{a}(j,K)} D_1(\bar{a}(j,K)) \mid L(j) = 1, \bar{L}(j-1), \bar{A}(j-1) \Big) - E\Big( \sum_{\bar{a}(j,K)} D_1(\bar{a}(j,K)) \mid L(j) = 0, \bar{L}(j-1), \bar{A}(j-1) \Big). $$

Here we use the short-hand notation ā(j, K) ≡ (a(j), . . . , a(K)) and D1(ā(j, K)) = D1(Ā(j − 1), ā(j, K), Lā(j,K)(K + 1)).

This Theorem allows us now to specify the targeted maximum likelihood estimator.

The targeted maximum likelihood estimator: Consider now an initial estimator QL(j)n of each QL(j), j = 0, 1, 2. We will estimate the first marginal probability distribution QL(0) of L(0) with the empirical distribution of Li(0), i = 1, . . . , n. We can estimate the conditional distribution of the binary L(1) and the conditional mean of Y = L(2) with machine learning algorithms (using a logistic link for QL(1), and, if Y is binary, also for QY), such as the super learner: a weighted combination, determined data adaptively based on cross-validation, of a user-supplied set of candidate machine learning algorithms estimating the particular conditional probability. We will now define fluctuations of the initial estimators QL(1)n = QL(1)n(Pn) and QL(2)n = QL(2)n(Pn), which are particular functions of the empirical probability distribution Pn. We will use the notation Qn = (QL(1)n, QL(2)n). Firstly, let

$$ \operatorname{Logit} Q_{L(1)n}(\varepsilon) = \operatorname{Logit} Q_{L(1)n} + \varepsilon C_{L(1)}(Q_n, g_n) $$

be the fluctuation function of QL(1)n with fluctuation parameter ε, where we added the covariate CL(1)(Q, g) defined as

$$ \frac{I(A(0) = d_0(L(0)))}{g_{A(0)}(d_0(L(0)) \mid X)} \{ C_{L(1)}(Q)(1) - C_{L(1)}(Q)(0) \}, $$

where CL(1)(Q)(δ) = EQ(Yd | L(0), A(0), L(1) = δ). We refer to these covariate choices as clever covariates, since they represent a covariate choice that identifies a least favorable fluctuation model, thereby providing the desired targeted bias reduction. Similarly, if Y = L(2) is binary, then let

$$ \operatorname{Logit} Q_{L(2)n}(\varepsilon) = \operatorname{Logit} Q_{L(2)n} + \varepsilon C_{L(2)}(Q_n, g_n), $$

where the added clever covariate is

$$ C_{L(2)}(Q, g)(\bar{L}(1), \bar{A}(1)) = \frac{I(\bar{A} = d(\bar{L}))}{g(\bar{A} \mid X)}. $$

If Y is continuous, then we use as fluctuation model the normal densities with mean EQn(Y | Pa(Y)) + εCL(2)(Qn, gn) and constant variance σ², so that the MLE of ε is the linear least squares estimator, and the score of ε at ε = 0 is CL(2)(Y − EQ(Y | Pa(Y))), as required. We note that the above fluctuation function indeed satisfies that the score of ε at ε = 0 equals the efficient influence curve D*(Qn, gn) as presented in the Theorem above.

One now estimates ε with the MLE.

$$ \varepsilon_n = \arg\max_{\varepsilon} \sum_{j=1}^{2} \sum_{i=1}^{n} \log Q_{L(j)n}(\varepsilon)(O_i). $$

One could also obtain a separate MLE of ε for each factor j = 1, 2. This process is now iterated till convergence, which defines the targeted MLE (Qn*, gn) starting at the initial estimator (Qn, gn); it does not involve updating of gn.

We note that the εn for each factor separately can be estimated with standard logistic regression or linear regression software, using as offset the logit of the initial estimator and a single clever covariate CL(j)(Q, g), j = 1, 2. If Y is also binary, the single/common εn defined above requires a single logistic regression applied to a repeated measures data set with one line of data for each of the two factors, with a clever covariate column that alternates the clever covariates CL(1) and CL(2), and with the corresponding offsets. So in both cases (separate or common ε), the update step can be carried out with a simple univariate logistic regression maximum likelihood estimator. Computing a common ε in the case that we use linear regression for Y and logistic regression for L(1) requires some programming.
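As a concrete illustration of the ε-update for one factor, the sketch below fits the univariate logistic fluctuation Logit Q(ε) = offset + ε·clever by Newton-Raphson on simulated data; the arrays offset, clever, and y are stand-ins for the logit of an initial fit, a clever covariate, and a binary node:

```python
import numpy as np

# Minimal sketch of the epsilon-update: univariate logistic regression with
# the logit of the initial fit as offset and the clever covariate as the only
# regressor. All data below are simulated for illustration.

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_epsilon(y, offset, clever, n_iter=25):
    """MLE of eps in Logit Q(eps) = offset + eps * clever (Newton steps)."""
    eps = 0.0
    for _ in range(n_iter):
        p = expit(offset + eps * clever)
        score = np.sum(clever * (y - p))           # d loglik / d eps
        info = np.sum(clever**2 * p * (1 - p))     # Fisher information
        eps += score / info
    return eps

rng = np.random.default_rng(0)
n = 500
clever = rng.uniform(0.5, 2.0, n)                  # stands in for C_{L(j)}(Q, g)
offset = rng.normal(0.0, 1.0, n)                   # logit of the initial Q-fit
y = rng.binomial(1, expit(offset + 0.7 * clever))  # truth: eps = 0.7
print(fit_epsilon(y, offset, clever))
```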

We note that the clever covariate changes at each update step since the estimator of Q is updated at each step and the clever covariate is defined by the current Q-fit. Let QL(j)n*, j = 1, 2, and Qn* denote the final update (at convergence of the MLE of ε to zero) of QL(j)n, j = 1, 2, and Qn, respectively. The T-MLE of ψ is now given by Ψ(Qn*).

A one-step T-MLE: Interestingly, if we use a separate εL(j) for j = 1, 2, first carry out the T-MLE update for QL(2)n, and use this updated QL(2)n* in the targeted MLE update for QL(1)n, then we obtain a targeted MLE-algorithm that converges in two simple steps, representing a single step update of Qn. Below, we will generalize this one-step targeted MLE algorithm for updating an initial Qn for general longitudinal data structures.

Statistical inference for T-MLE: Let D*(Q, g) be the efficient influence curve at pQ,g = Q · g, as defined in the above Theorem. Under regularity conditions, the T-MLE is consistent and asymptotically linear with influence curve D*(Q*, g0), where Q* denotes the limit of Qn*, and g0 is the true treatment mechanism. As a consequence, for construction of confidence intervals and testing one can use as working model ψn* ∼ N(ψ0, Σ0/n), where Σ0 = ED*(Q*, g0)²(O) is the variance of the efficient influence curve at (Q*, g0). Here Σ0 can be estimated with the empirical covariance matrix of D*(Qn*, g0)(Oi), i = 1, . . . , n.
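Given evaluations of the efficient influence curve at each observation, the Wald-type interval is immediate. In this sketch the influence-curve values and the point estimate are simulated placeholders, not output of an actual T-MLE fit:

```python
import numpy as np

# Sketch of Wald-type inference for the T-MLE from evaluations of the
# efficient influence curve D*(Q*, g0) at each observation. The IC values
# and the point estimate below are illustrative stand-ins.

rng = np.random.default_rng(1)
n = 400
ic = rng.normal(0.0, 2.0, n)     # D*(Q_n*, g0)(O_i), i = 1..n (mean approx 0)
psi_hat = 0.53                   # the T-MLE Psi(Q_n*) (hypothetical value)

sigma2 = np.var(ic)              # empirical variance of the influence curve
se = np.sqrt(sigma2 / n)         # standard error of psi_hat
ci = (psi_hat - 1.96 * se, psi_hat + 1.96 * se)
print(se, ci)
```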

Targeted Loss-based selection among T-MLE’s indexed by different initial estimators: Sequentially randomized trials allow us to select a targeted loss function for selection among different targeted maximum likelihood estimators indexed by different initial estimators. For the sake of illustration, we assume ψ0 is one-dimensional. Suppose a collection of initial estimators is available for Q0. Let Qkn*=Qk*(Pn) be the corresponding targeted maximum likelihood estimators, k = 1, . . . , K. One of these initial estimators might correspond with a super learner based on the log-likelihood loss function. We can select among these targeted maximum likelihood estimators based on cross-validated risk of the loss function

$$ L(Q) \equiv D^{*}(Q, g_0)^2, $$

which is indeed a valid loss function since it satisfies Q0 = arg minQ E0L(Q)(O) among all Q with Ψ(Q) = ψ0. The latter loss function is now a loss function for the whole Q and is very targeted towards ψ0 since it corresponds exactly with the asymptotic variance of the targeted MLE. Thus, we would select k with the cross-validation selector:

$$ k_n = \hat{k}(P_n) = \arg\min_{k} E_{B_n} P_{n,B_n}^{1} D^{*}(Q_k^{*}(P_{n,B_n}^{0}), g_0)^2, $$

where Bn ∈ {0, 1}ⁿ denotes a random vector of binaries indicating a split into a training sample {i : Bn(i) = 0} and a validation sample {i : Bn(i) = 1}, and P⁰n,Bn, P¹n,Bn are the corresponding empirical distributions of the training and validation samples. Here we used the notation Pf ≡ ∫ f(o)dP(o). The selected targeted maximum likelihood estimator is then Qn* ≡ Qkn*(Pn), and ψ0 is now estimated with the substitution estimator Ψ(Qn*).

Assuming a uniformly bounded loss function (i.e., a uniform bound on the efficient influence curve), due to oracle results of the cross-validation selector, the resulting targeted maximum likelihood estimator Ψ(Qn*) will be at least as efficient as any of the candidate targeted maximum likelihood estimators Ψ(Qkn*), k = 1, . . . , K.
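A minimal sketch of the cross-validation selector: it compares candidates by the validation-sample mean of D*(Qk*, g0)², averaged over splits. For brevity it evaluates fixed candidate influence-curve values rather than refitting Qk* on each training sample, which the actual selector requires:

```python
import numpy as np

# Sketch of the cross-validation selector k_n: choose the candidate whose
# validation-sample mean of D*(Q_k*, g0)^2 is smallest, averaged over splits.
# `ic_values[k]` holds hypothetical influence-curve evaluations for candidate
# k at each of the n observations. Refitting on training samples is omitted.

def cv_selector(ic_values, n_splits=5, seed=0):
    n = ic_values[0].shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_splits       # random V-fold split B_n
    risks = []
    for ic in ic_values:
        # average over splits of the validation-sample mean of D*^2
        risks.append(np.mean([np.mean(ic[folds == v] ** 2)
                              for v in range(n_splits)]))
    return int(np.argmin(risks)), risks

sel, risks = cv_selector([np.full(100, 2.0), np.full(100, 1.0)])
print(sel, risks)
```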

Construction of Targeted Initial estimators: Above we showed that the projection of the efficient influence curve on the tangent space of the conditional distribution of L(1) can be written as CL(1)(L(1) − QL(1)), and for Y = L(2), as CL(2)(L(2) − QL(2)), where we use short-hand notation. For the purpose of constructing an initial estimator of QL(1), we can use loss-based learning based on the weighted squared-error loss function L1(QL(1)) = C²L(1)(L(1) − QL(1))², and, similarly, for the purpose of constructing an initial estimator of QL(2), we can use loss-based learning based on the weighted squared-error loss function L2(QL(2)) = C²L(2)(Y − QL(2))². These are targeted loss functions since they correspond with the components of the variance of the efficient influence curve. Since the clever covariate CL(2) only depends on g0, the required weights C²L(2) for loss-based learning of QL(2) are completely known. Therefore, we first apply the loss-based learning of the true QL(2)0. Let QL(2)n be the resulting estimator. Now, we plug such an estimator into the weight-function CL(1), and we use the resulting weights C²L(1) to apply loss-based learning of QL(1). In this way, using this backwards sequential loss-based learning, we can generate initial candidate estimators of QL(1), QL(2) that are themselves already targeted by being based on these weighted squared-error loss functions (e.g., using different regression algorithms but using the weight option). We can now select among the targeted MLEs indexed by these different targeted initial estimators, by using cross-validation with the above mentioned loss function L(Q) = D*(Q, g0)².
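Because the weights C²L(2) are known in an S-RCT, loss-based learning of QL(2) with a linear working model reduces to weighted least squares. The data, weights, and working model below are illustrative stand-ins:

```python
import numpy as np

# Sketch of loss-based learning with the targeted weighted squared-error loss
# L2(Q_Y) = C_{L(2)}^2 (Y - Q_Y)^2: with the weights known (they depend only
# on g0), minimizing the empirical risk over a linear working model is just
# weighted least squares. Data, weights, and model are illustrative.

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)      # simulated outcome
w = rng.uniform(1.0, 4.0, n)                           # C_{L(2)}^2, known

# weighted least squares: beta = (X' W X)^{-1} X' W y
WX = X * w[:, None]
beta = np.linalg.solve(X.T @ WX, WX.T @ y)
print(beta)
```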

As might already be apparent, and certainly becomes apparent in the next section, this powerful approach combining loss-based learning with targeted MLE for the analysis of the simple two-stage sequentially randomized controlled trial generalizes to all sequentially randomized controlled trials for any target parameter, any number of stages, and higher dimensional intermediate time-dependent covariates.

We remark that the above targeted maximum likelihood estimator can also be applied to the data structure L(0), A(0), L(1), Δ, L(2) = ΔY, where A(0) is a treatment assigned at baseline (e.g., an RCT), L(1) represents the data collected between baseline and the time point at which the outcome Y is measured, and Δ is a missingness indicator for Y. One simply applies the above data structure with A(1) = Δ. Of course, if L(1) is not binary, then the above estimator needs to be generalized as carried out in the next section, and the missingness mechanism might need to be estimated from the data.

3. Targeted MLE of parameters of the G-computation formula

We will now present the general approach to obtain a targeted maximum likelihood estimator, including the selection among different targeted maximum likelihood estimators indexed by different initial estimators. The choice of loss function we will use for the latter will depend on whether one is willing to assume that the treatment/censoring mechanism is correctly estimated (or known, as in an S-RCT), or wishes to rely on double robustness, and we will provide appropriate loss functions for both purposes. This will generalize the above targeted maximum likelihood estimator for a two-stage sequentially randomized controlled trial to arbitrary sequentially randomized controlled trials, including S-RCTs that are subject to right-censoring or missingness and for which one is willing to assume that censoring/missingness is well understood. In addition, it will present the double robust T-MLE for observational studies.

Organization: Firstly, we will present the likelihood using binary coding of the data structure O. Second, we will present a representation of the efficient influence curve based on this binary factorization of the likelihood. Third, we present the fluctuation/least favorable model of the initial estimate and the corresponding targeted maximum likelihood estimator. Fourth, we present a closed form one-step version of this targeted maximum likelihood estimator that applies if one is willing to fit a separate fluctuation parameter for each factor of the G-computation formula factor of the likelihood. Fifth, we present a targeted loss function that can be used to select among different targeted maximum likelihood estimators indexed by different initial estimators. We also present a particular type of targeted maximum likelihood estimator that uses a degenerate initial estimator for the intermediate factors of the G-computation formula, so that the targeted MLE algorithm only requires updating the final outcome conditional distribution. Finally, we make some observations regarding the pursuit of targeted dimension reductions simplifying the G-computation formula, which can form an important ingredient for generating different candidate targeted MLE’s, and control complexity.

3.1. A factorization of likelihood of data in terms of binary variables

Suppose the data structure O = (L(0), A(0), . . . , L(K), A(K), L(K + 1)) for one unit involves collection of treatment and censoring actions coded with A(t) at times t = 0, . . . , K, and time-dependent covariate and outcome data at times t = 0, . . . , K + 1. We note that L(t) can become degenerate after censoring and/or after a terminal event like death, so that this data structure O also allows for longitudinal data structures that are truncated by the minimum of right-censoring and death. By choosing a fine enough discretization in time this data structure also approximates treatment and censoring processes A(t) that evolve in continuous time.

For the sake of presentation, we will assume that A(t) and L(t) are discrete valued for all t so that the likelihood of O can be expressed in terms of probabilities, thereby avoiding technical difficulties regarding choice of dominating measure, without affecting the realm of practical applications.

The time ordering implies a graph with observed nodes L(t), t = 0, . . . , K+ 1, and A(t), t = 0, . . . , K, and a corresponding factorization of the observed data likelihood of O, given by

$$ p_0 = \prod_{t=0}^{K+1} Q_{L(t)} \prod_{t=0}^{K} g_{A(t)}, $$

where QL(t) and gA(t) denote the conditional probability distributions of L(t), given parents Pa(L(t)), and A(t), given parents Pa(A(t)), respectively. The parent sets could be known to be subsets of the parent set implied by the time ordering of data structure, as discussed in introduction.

This factorized likelihood can be subjected to static and dynamic interventions on the A(·) process, mapping the probability distribution of O into probability distributions of Od corresponding with a static or dynamic intervention d; the resulting formula is often referred to as the G-computation formula. These interventions could involve all A-nodes as well as a subset of these nodes. The corresponding probability distributions of Od are obtained by removing the gA(t)'s corresponding with the A(t) nodes on which an intervention is carried out under rule d, and by substituting the corresponding intervened values for A(t) in the conditioning events (i.e., parents) of the QL(l)-factors with l > t.

In many applications A(t) = (A1(t), A2(t)) involves two types of actions A1(t) and A2(t), both relevant for defining the parameter of interest of the probability distribution of O. For example, A1(t) might be the treatment assigned at time t, A2(t) might be an indicator of being right-censored at time t, and the scientific parameter of interest, Ψ(P0), might be defined as a parameter of the distribution of O under the intervention on A defined by no-censoring at any time point, and a certain treatment intervention. In many cases, one defines the scientific parameter of interest in terms of changes of the latter distribution under different treatment regimens, and always no censoring: for example, marginal structural models for static or realistic dynamic treatment regimens provide such parameters, as we demonstrate in Section 5.

We will consider the case that, for each node, the model for the conditional distribution of the node, given its parents, is nonparametric. Let Ψ be the parameter mapping so that ψ0 = Ψ(P0) denotes the parameter of interest.

Without loss of generality, we assume that, for each t ∈ {1, . . . , K + 1}, L(t) can be coded in terms of n(t) binary variables {L(t, j) : j = 1, . . . , n(t)}, so that QL(t) can be further factorized as

$$ Q_{L(t)} = \prod_{j=1}^{n(t)} Q_{L(t,j)}, $$

where we define QL(t,j) as the conditional distribution of L(t, j), given its parents Pa(L(t, j)), defined as the parents of L(t) augmented with the first j − 1 variables L(t, 1), . . . , L(t, j − 1). Note that this factorization depends on a user-supplied ordering of the binary variables. For example, this particular coding and ordering might be implied by what is considered natural. The choice of coding and ordering does not affect the theoretical properties of the resulting targeted MLE, but it does imply the binary predictors QL(t,j) one will need to estimate from the data.
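The binary coding and the implied factorization can be checked numerically: a four-valued node is coded by two binaries, and the joint pmf equals the product of the two binary conditionals. The pmf below is illustrative:

```python
# Sketch of the binary coding of a discrete node: a variable taking values
# 0..3 is coded by two binaries (L(t,1), L(t,2)), and its distribution
# factorizes as Q_{L(t,1)} * Q_{L(t,2)}, with L(t,1) added to the parent set
# of L(t,2). The pmf `probs` is illustrative.

def to_bits(value):
    """Code value in {0,1,2,3} as (L(t,1), L(t,2))."""
    return (value >> 1) & 1, value & 1

def q_joint(value, probs):
    """P(L(t) = value) for a given pmf over {0,1,2,3}."""
    return probs[value]

def q_bit1(b1, probs):
    """Q_{L(t,1)}: P(L(t,1) = b1)."""
    p1 = probs[2] + probs[3]
    return p1 if b1 == 1 else 1 - p1

def q_bit2(b2, b1, probs):
    """Q_{L(t,2)}: P(L(t,2) = b2 | L(t,1) = b1)."""
    if b1 == 1:
        p1 = probs[3] / (probs[2] + probs[3])
    else:
        p1 = probs[1] / (probs[0] + probs[1])
    return p1 if b2 == 1 else 1 - p1

probs = [0.1, 0.2, 0.3, 0.4]
for v in range(4):
    b1, b2 = to_bits(v)
    # joint pmf must equal the product of the binary factors
    assert abs(q_joint(v, probs) - q_bit1(b1, probs) * q_bit2(b2, b1, probs)) < 1e-12
print("factorization checks out")
```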

This now provides the following likelihood factorization for the probability distribution of O:

$$ p_0 = Q_{L(0)} \prod_{t=1}^{K+1} \prod_{j=1}^{n(t)} Q_{L(t,j)} \prod_{t=0}^{K} g_{A(t)}. \quad (2) $$

where QL(0) denotes the marginal distribution of the baseline covariates L(0).

3.2. General representation of efficient influence curve of target parameter

We will now work out a general representation of the efficient influence curve we can apply to implement the targeted maximum likelihood estimator for general longitudinal data structures. These results provide us with a template for implementing the targeted maximum likelihood estimator for nonparametric models and essentially any type of longitudinal data structure that includes time dependent treatments and censoring actions that are realized in response to previously collected data.

Recall that in our model for P0, for each node in the statistical graph, the conditional distribution is unspecified. Let Ψ be the parameter mapping so that ψ0 = Ψ(P0) denotes the parameter of interest. If Ψ(P0) = ΨF (Q0) is only a parameter of the Q0, then we can present the efficient influence curve of Ψ as the projection of any influence curve (i.e., gradient of path-wise derivative) in the model in which g is known onto the tangent space of Q (van der Laan and Robins (2003)):

$$ D^{*} = \Pi(D \mid T_Q) \quad \text{for a certain gradient } D. $$

Such an estimating function D is often called an IPCW-estimating function (van der Laan and Robins (2003)). We will now be concerned with finding a representation of this efficient influence curve in terms of an orthogonal sum of scores of certain fluctuations Q(ε) of Q at ε = 0, thereby implying a corresponding implementation of the targeted MLE.

The factorization (1) of the distribution P0 implies an orthogonal decomposition of the tangent space at P0 in our model, where this tangent space is a subspace of the Hilbert space L²₀(P₀) endowed with inner product 〈h1, h2〉 = EP₀h1(O)h2(O). This orthogonal decomposition of the tangent space T(P0) ⊂ L²₀(P₀) is given by

$$ T(P_0) = T_{L(0)} + \sum_{t=1}^{K+1} \sum_{j=1}^{n(t)} T_{L(t,j)} + T_{CAR}, $$

where TL(0) is the tangent space of QL(0) consisting of the functions of L(0) with mean zero, TL(t,j) is the tangent space of the conditional probability distribution QL(t,j),

$$ T_{L(t,j)} = \{ V(L(t,j) \mid Pa(L(t,j))) - E_{Q_{L(t,j)}} V : V \} = \{ \{ V(1 \mid Pa(L(t,j))) - V(0 \mid Pa(L(t,j))) \} (L(t,j) - Q_{L(t,j)}(1)) : V \}, $$

and TCAR is the tangent space of g. TCAR can also be orthogonally decomposed as ⊕ᵗ₌₀ᴷ TA(t), with TA(t) the tangent space of gA(t). Here we used the notation EQL(t,j)V = E(V | Pa(L(t, j))) for the conditional expectation w.r.t. QL(t,j). If the parent sets are all implied by a specified ordering of all measured variables, then the model for P0 is actually the nonparametric model, so that the tangent space is saturated: T(P0) = L²₀(P₀).

In the case that Ψ(P0) is a parameter of both Q0 and g0, the efficient influence curve D* will also have components in TCAR. An example of such a target parameter is E(Y (1) − Y (0) | A = 1), the effect among the treated, based on observed data (W, A, Y) and the causal graph implied by time ordering W, A, Y. In that case the targeted MLE will also need to fluctuate an initial fit of g0 with a fluctuation having a score that coincides with the efficient influence curve. For that purpose, let’s also code A(t) in terms of binary variables. Let A(t) be coded in terms of binary variables {A(t, j) : j = 1, . . . , m(t)}, and consider the factorization

$$ g_{A(t)} = \prod_{j=1}^{m(t)} g_{A(t,j)}, $$

where an ordering needs to be specified so that the parents of A(t, j) are given by the parents of A(t) augmented with A(t, 1), . . . , A(t, j − 1).

The corresponding orthogonal decomposition of the tangent space of g is given by

$$ T_{CAR} = \sum_{t=0}^{K} \sum_{j=1}^{m(t)} T_{A(t,j)}, $$

where

$$ T_{A(t,j)} = \{ V(A(t,j) \mid Pa(A(t,j))) - E(V \mid Pa(A(t,j))) : V \} = \{ \bar{V}(Pa(A(t,j))) (A(t,j) - g_{A(t,j)}(1 \mid Pa(A(t,j)))) : V \}, $$

where we used the notation V̄(Pa(A(t, j))) = V(1 | Pa(A(t, j))) − V(0 | Pa(A(t, j))), which can thereby represent any function of Pa(A(t, j)).

This factorization p(O) = ∏tj QL(t,j)tj gA(t,j) yields the orthogonal decomposition of the tangent space T(P0) given by

$$ T(P_0) = T_{L(0)} + \sum_{t=1}^{K+1} \sum_{j=1}^{n(t)} T_{L(t,j)} + \sum_{t=0}^{K} \sum_{j=1}^{m(t)} T_{A(t,j)}. $$

We can now state the corresponding Theorem for both a representation of a given efficient influence curve D* as well as a projection of a function D, (e.g.) representing an inefficient influence curve for a parameter Ψ(P) = ΨF (Q) in a model with g known, onto the tangent space TQ of Q.

Theorem 2 Consider the Hilbert space L²₀(P₀) and the factorization (1) of P0. A function D ∈ L²₀(P₀) which is also an element of the tangent space T(P0) can be represented as

$$ D = D_{L(0)} + \sum_{t=1}^{K+1} \sum_{j=1}^{n(t)} D_{L(t,j)} + \sum_{t=0}^{K} \sum_{j=1}^{m(t)} D_{A(t,j)}, $$

where

$$ D_{L(0)} = E(D \mid L(0)) - ED, \qquad D_{L(t,j)} = \Pi(D \mid T_{L(t,j)}) = C_{L(t,j)} \{ L(t,j) - Q_{L(t,j)}(1) \}, $$

where

$$ C_{L(t,j)} = E(D \mid L(t,j) = 1, Pa(L(t,j))) - E(D \mid L(t,j) = 0, Pa(L(t,j))), $$

for t = 1, . . . , K + 1, and, for each t, j = 1, . . . , n(t). In addition,

$$ D_{A(t,j)} = \Pi(D \mid T_{A(t,j)}) = C_{A(t,j)} \{ A(t,j) - g_{A(t,j)}(1) \}, $$

where

$$ C_{A(t,j)} = E(D \mid A(t,j) = 1, Pa(A(t,j))) - E(D \mid A(t,j) = 0, Pa(A(t,j))). $$

In particular, the projection of D onto the tangent space TQ of Q can be represented as

$$ \Pi(D \mid T_Q) = D_{L(0)} + \sum_{t=1}^{K+1} \sum_{j=1}^{n(t)} D_{L(t,j)}. $$

If we represent D as D(O) = D1(A, L(A))/g(Ā | X) for some D1, with X = (L(a) : a) and La(t) = Lā(t−1)(t), and assume that Pa(L(t, j)) = (Ā(t − 1), PaĀ(t−1)(L(t, j))) includes Ā(t − 1), then the above representation of Π(D | TQ) applies with

$$ C_{L(t,j)} = \frac{1}{g(\bar{A}(t-1) \mid Pa(L(t)))} \times \{ C_{L(t,j)}(Q)(1) - C_{L(t,j)}(Q)(0) \}, $$

where, for δ ∈ {0, 1},

$$ C_{L(t,j)}(Q)(\delta) = E_Q\Big( \sum_{\bar{a}(t,K)} D_1 \mid L_{\bar{a}(t-1)}(t,j) = \delta, Pa_{\bar{a}(t-1)}(L(t,j)) \Big) \Big|_{\bar{a}(t-1) = \bar{A}(t-1)}. $$

Above, we used D1 as short-hand notation for D1(Ā(t − 1), ā(t, K), LĀ(t−1),ā(t,K)(K + 1)), and ā(t, K) = (a(t), . . . , a(K)). Here g(ā(t − 1) | Pa(L(t))) denotes the conditional probability of Ā(t − 1) = ā(t − 1), given Paā(t−1)(L(t)), and it also equals the conditional probability of Ā(t − 1) = ā(t − 1), given La(t, j), Paa(L(t, j)).

Proof of Theorem. We only need to prove the latter representation of CL(t,j).

We have D(A, L(A)) = D1(A, L(A))/g(Ā | X), and we consider the case that Pa(L(t, j)) includes Ā(t − 1). For the sake of this proof we exclude the treatment nodes Ā(t − 1) from Pa(L(t, j)). Setting Ā(t − 1) = ā(t − 1) gives us the following conditional expectation to consider:

$$ E\big( D_1(\bar{A})/g(\bar{A} \mid X) \mid L(t,j), Pa(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1) \big). $$

We first condition on X and Ā(t − 1). This corresponds with taking an expectation w.r.t. ∏ₛ₌ₜᴷ g(A(s) | Pa(A(s))). This gives us

$$ E\Big( \sum_{\bar{a}(t,K)} D_1(\bar{a})/g(\bar{a}(t-1) \mid X) \mid L_{\bar{a}}(t,j), Pa_{\bar{a}}(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1) \Big). $$

This conditional expectation for each ā(t, K)-specific term is a sum over La compatible with La(t, j), Paa(L(t, j)). Specifically,

$$ \begin{aligned} & \sum_{L_a} \frac{D_1(\bar{a})}{g(\bar{a}(t-1) \mid X)} P(L_a \mid L_a(t,j), Pa_a(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1)) \\ &= \sum_{L_a} \frac{D_1(\bar{a})}{g(\bar{a}(t-1) \mid X)} \frac{P(L_a, L_a(t,j), Pa_a(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1))}{P(L_a(t,j), Pa_a(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1))} \\ &= \sum_{\{L_a : L_a(t,j), Pa_a(L(t,j))\}} D_1(\bar{a}) \frac{P(L_a)}{P(L_a(t,j), Pa_a(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1))} \\ &= \frac{1}{g(\bar{a}(t-1) \mid L_a(t,j), Pa_a(L(t,j)))} \sum_{\{L_a : L_a(t,j), Pa_a(L(t,j))\}} D_1(\bar{a}) \frac{P(L_a)}{P(L_a(t,j), Pa_a(L(t,j)))} \\ &= \frac{1}{g(\bar{a}(t-1) \mid L_a(t,j), Pa_a(L(t,j)))} E_Q(D_1(\bar{a}) \mid L_a(t,j), Pa_a(L(t,j))). \end{aligned} $$

We will now prove that, by conditional independence assumptions of the statistical graph, g(ā(t − 1) | La(t, j), Paa(L(t, j))) = g(ā(t − 1) | Paa(L(t))). To see this we first note that g(ā(t − 1) | La(t, j), Paa(L(t, j))) equals

$$ \sum_{\bar{L}_a(t-1)} g(\bar{a}(t-1) \mid \bar{L}_a(t-1)) P(\bar{L}_a(t-1) \mid L_a(t,j), Pa_a(L(t,j))). $$

Since Paa(L(t, j)) are the parents of La(t, j), we have P(L̄a(t − 1) | La(t, j), Paa(L(t, j))) = P(L̄a(t − 1) | Paa(L(t, j))). Thus, this proves

$$ g(\bar{a}(t-1) \mid L_a(t,j), Pa_a(L(t,j))) = g(\bar{a}(t-1) \mid Pa_a(L(t,j))). $$

More generally, recall La(t, j), Paa(L(t, j)) = La(t, 1), . . . , La(t, j), Paa(L(t)), and note that Paa(L(t)) is included in L̄a(t − 1) (recall that we excluded Ā(t − 1) from Pa(L(t, j)) in this proof). We have that La(t, 1), . . . , La(t, j) is independent of L̄a(t − 1), given Paa(L(t)). So we obtain

$$ P(\bar{L}_a(t-1) \mid L_a(t,j), Pa_a(L(t,j))) = P(\bar{L}_a(t-1) \mid L_a(t,1), \ldots, L_a(t,j), Pa_a(L(t))) = P(\bar{L}_a(t-1) \mid Pa_a(L(t))). $$

This shows

$$ g(\bar{a}(t-1) \mid L_a(t,j), Pa_a(L(t,j))) = g(\bar{a}(t-1) \mid Pa_a(L(t))). $$

To conclude, we have shown

$$ E\Big( \sum_{\bar{a}(t,K)} \frac{D_1(\bar{a})}{g(\bar{a}(t-1) \mid X)} \mid L_{\bar{a}}(t,j) = 1, Pa_{\bar{a}}(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1) \Big) = \frac{1}{g(\bar{a}(t-1) \mid Pa_a(L(t)))} E\Big( \sum_{\bar{a}(t,K)} D_1(\bar{a}) \mid L_{\bar{a}(t-1)}(t,j) = 1, Pa_{\bar{a}(t-1)}(L(t,j)) \Big), $$

which completes the proof.

Remark about double robustness of efficient influence curve for general statistical graph: The efficient influence curve D* at P depends on the Q-factor as well as on a g representing conditional distributions of A(t) nodes, possibly conditioning on subsets of the actual parents of A(t). It is immediate that P0D*(Q0, g) = 0 at a possibly misspecified g. To understand the possible additional robustness P0D*(Q, g0) for Q with Ψ(Q) = Ψ(Q0) and correctly specified g0, and thereby the so called double robustness of the efficient influence curve (van der Laan and Robins (2003)), we make the following observation. By our latter representation in the above theorem, we have

$$ D^{*}(Q, g) = \sum_{t} \frac{1}{g(\bar{A}(t-1) \mid (Pa_a(L(t)) : a))} \Big\{ E_Q\Big( \sum_{\bar{a}(t,K)} D_1(\bar{a}(t-1), \bar{a}(t,K)) \mid L_a(t) = L(t), Pa_{\bar{a}(t-1)}(L(t)) = Pa(L(t)) \Big) - E_Q\Big( \sum_{\bar{a}(t,K)} D_1(\bar{a}(t-1), \bar{a}(t,K)) \mid Pa_{\bar{a}(t-1)}(L(t)) \Big) \Big\} \Big|_{\bar{a}(t-1) = \bar{A}(t-1)}, $$

where we also have that g(Ā(t − 1) | (Paa(L(t)) : a)) = g(Ā(t − 1) | (La(t), Paa(L(t)) : a)), as we showed in the proof above. If we now take the conditional mean, given (La(t), Paa(L(t)) : a), within the Σt-summation, then this corresponds with integration over g0(Ā(t − 1) | (Paa(L(t)) : a)). Thus at a correctly specified g0, we obtain that P0D*(Q, g0) equals

$$ E_{Q_0} \sum_{t} \sum_{\bar{a}} \{ E_Q(D_1(\bar{a}) \mid L_a(t), Pa_a(L(t))) - E_Q(D_1(\bar{a}) \mid Pa_a(L(t))) \}, $$

thereby giving us an expression that depends only on the Q0-factor of the distribution of the data (thus having nothing to do anymore with the conditional treatment probabilities). Some additional structure is now needed on the statistical graph to have that the latter equals zero at misspecified Q. In particular, if Pa(L(t)) = (Ā(t − 1), L̄(t − 1)) represents the history according to the time-ordered sequence representing the longitudinal data structure O, it follows, through cancelation of terms, that the latter equals EQ0 Σā D1(ā), thereby giving the wished-for result (corresponding with the double robustness results in van der Laan and Robins (2003) for nonparametric full data models).

Using normal error regression to model and fluctuate conditional final outcome distribution. Consider the case that Ψ(Q0) depends on the conditional distribution of a final outcome Y = L(K + 1), given its parents Pa(Y), only through its conditional mean, and that the projection of the efficient influence curve (or any other gradient in the model with g known) onto the tangent space of this conditional distribution QY can be written as CY(Y − EQ(Y | Pa(Y))) for some function CY of its parents Pa(Y). Then it follows that there is no need to factorize the conditional distribution of Y in binary conditional distributions; one could instead model the conditional distribution of Y with a normal error mean regression and fluctuate the mean by adding the clever-covariate extension εCY. This was explicitly illustrated in Section 2 for the targeted MLE of EYd.

3.3. The targeted MLE based on the binary representation of density

In this subsection we will define the targeted MLE based on the representation (1) of the density of O in terms of the binary predictors QL(t,j), and, for the sake of presentation, we assume that our target parameter is only a parameter of Q0. Consider an initial estimator QL(t,j)n of each QL(t,j), t = 1, . . . , K +1, j = 1, . . . , n(t). We will estimate the first marginal probability distribution QL(0) of L(0) with the empirical distribution of Li(0), i = 1, . . . , n. Let Qn denote the combined set of QL(t,j)n across all nodes L(t, j).

The conditional distributions of L(t, j) are binary distributions, which we can estimate with machine learning algorithms (using a logistic link), such as the super learner: a data-adaptively (based on cross-validation) determined weighted combination of a user-supplied library of machine learning algorithms estimating the particular conditional probability. These estimates could be obtained separately for each (t, j), or smoothing across time points t and/or j could be employed if appropriate, by applying such machine learning algorithms to an appropriately constructed repeated measures data set. In particular, candidate estimators could be based on (guessed) subsets of Pa(L(t, j)).
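As an illustration of cross-validation-based selection for one binary conditional probability QL(t,j), the sketch below compares two hypothetical candidate estimators (a marginal fit and a fit stratified on a binary parent) by validation log-likelihood; an actual super learner would also consider weighted combinations of a richer library. All data-generating choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
pa = rng.integers(0, 2, size=n)                    # a binary parent of L(t, j)
l = rng.binomial(1, np.where(pa == 1, 0.8, 0.2))   # the binary node L(t, j)

def fit_marginal(pa_tr, l_tr):                     # candidate 1: ignores the parent
    p = l_tr.mean()
    return lambda pa_new: np.full(len(pa_new), p)

def fit_stratified(pa_tr, l_tr):                   # candidate 2: stratifies on the parent
    p = np.array([l_tr[pa_tr == k].mean() for k in (0, 1)])
    return lambda pa_new: p[pa_new]

def cv_loglik(fitter, folds=5):
    # cross-validated log-likelihood of the fitted conditional probability
    idx = np.arange(n)
    ll = 0.0
    for f in np.array_split(idx, folds):
        tr = np.setdiff1d(idx, f)
        p = np.clip(fitter(pa[tr], l[tr])(pa[f]), 1e-6, 1 - 1e-6)
        ll += np.sum(l[f] * np.log(p) + (1 - l[f]) * np.log(1 - p))
    return ll

best = max([fit_marginal, fit_stratified], key=cv_loglik)
```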

In addition, let gn be an estimator of g0. We will now define the following fluctuations of the initial estimator QL(t,j)n = QL(t,j)(Pn) of QL(t,j):

LogitQL(t,j)n(ε)=LogitQL(t,j)n+εCL(t,j)(Qn,gn),

where we added the clever covariate CL(t,j)(Qn, gn), obtained by substituting the initial estimators Qn and gn for the true Q0 and g0.

One can now estimate ε with the MLE.

εn = arg maxε ∏t=1K+1 ∏j=1n(t) ∏i=1n QL(t,j)n(ε)(Oi).

One could also obtain a separate MLE of ε for each factor indexed by (t, j):

εL(t,j)n = arg maxε ∏i=1n QL(t,j)n(ε)(Oi).

One can now set Qn1 = Qn(εn) to update Qn. This updating process Qnm = Qnm−1(εnm), m = 1, 2, . . ., is iterated till convergence, which defines the targeted MLE Qn* starting at the initial estimator (Qn, gn).

We note that εL(t,j)n for each factor separately can be estimated with standard logistic regression software, using as offset the logit of the initial estimator and a single clever covariate CL(t,j)(Qn, gn). The single εn (uniform across t, j) defined above requires a single univariate logistic regression applied to a repeated measures data set with one line of data for each factor indexed by (t, j), created by stacking the clever covariates (CL(t,j) : t, j) into one covariate column for each unit, and using the corresponding offset covariate logit QL(t,j)n. So in both cases, the update step can be carried out with a simple univariate logistic regression maximum likelihood estimator using the offset option (applied to a possibly repeated measures data set).
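The offset-logistic update can also be carried out directly. The following sketch, on simulated hypothetical data, computes the one-parameter MLE εn by Newton-Raphson for the model logit Q(ε) = offset + εC, which is the same likelihood that the offset facility of standard logistic regression software maximizes:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fluctuate(l, offset, C, steps=25):
    """MLE of eps in logit Q(eps) = offset + eps * C for binary outcomes l."""
    eps = 0.0
    for _ in range(steps):
        p = expit(offset + eps * C)
        score = np.sum(C * (l - p))          # d/d(eps) of the log-likelihood
        info = np.sum(C ** 2 * p * (1 - p))  # observed information
        if info == 0:
            break
        eps += score / info                  # Newton-Raphson step
    return eps

rng = np.random.default_rng(2)
n = 1000
C = rng.normal(size=n)                       # stand-in clever covariate C_{L(t,j)}(Qn, gn)
offset = rng.normal(size=n)                  # logit of the initial fit Q_{L(t,j)n}
l = rng.binomial(1, expit(offset + 0.7 * C)) # binary node; true fluctuation eps = 0.7
eps_n = fluctuate(l, offset, C)
```

At convergence the score equation ∑i C(li − pi) = 0 is solved, which is exactly the component of the efficient influence curve equation targeted by this factor.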

We note that the clever covariate changes at each update step, since the estimator of Q is updated at each step and the clever covariate is defined by the current Q-fit in the iterative algorithm. Let QL(t,j)n* and Qn* denote the final updates (at convergence, when the MLE of ε equals zero) of QL(t,j)n and Qn. The targeted MLE of ψ0 is now given by Ψ(Qn*).

3.4. A targeted MLE based on the binary predictor representation of density that converges in one step

In this section we will define a fast targeted MLE based on the representation (1) of the density of O in terms of the binary predictors QL(t,j).

Consider an initial estimator QL(t,j)n of each QL(t,j), t = 1, . . . , K + 1, j = 1, . . . , n(t). We will estimate the first marginal probability distribution QL(0) of L(0) with the empirical distribution of Li(0), i = 1, . . . , n. Let Qn denote the combined set of QL(t,j)n across t, j.

In addition, let gn be an estimator of g0. As above, we define the following fluctuations of the initial estimator QL(t,j)n of QL(t,j):

LogitQL(t,j)n(ε)=LogitQL(t,j)n+εCL(t,j)(Qn,gn),

where we added the clever covariate CL(t,j)(Qn, gn), obtained by substituting the initial estimators Qn and gn for the true Q0 and g0.

Monotone dependence on Q-property of the clever covariates: Consider the clever covariate representations of CL(t,j) presented in the above Theorem 2 for QL(t,j) for the case that D = D1/g with D1 not indexed by Q, g. Then the conditional expectations in the definition of the clever covariate CL(t,j) only depend on Q through {QL(s,l) : s > t, l} ∪ {QL(t,l) : l > j}.

Let’s enumerate all terms QL(t,j) for t ≥ 1 by moving row-wise: thus Q1 = Q11, Q2 = Q12, . . ., Qn(1) = Q1n(1), Qn(1)+1 = Q21, and so on till QN = QK+1,n(K+1), where N = ∑t=1K+1 n(t). Here we temporarily used the notation Q12 = QL(1,2), and so on. Recall that QL(0), the marginal distribution of L(0), does not need to be fluctuated, and is thus not considered here: we will always estimate QL(0) with the empirical distribution, so that no fluctuation is needed. Under this ordering, the k-th clever covariate Ck only depends on Q through Qk+1, . . . , QN, k = 1, . . . , N. In particular, CN does not depend on Q at all, CN−1 depends on QN only, CN−2 depends on QN−1 and QN, and so on. We refer to this property of the clever covariates as the monotone dependence (on Q) property, which has an immediate implication for the corresponding iterative T-MLE algorithm.

We denote this monotonicity property with Ck(Q) = Ck(Qk+1, . . . , QN), where we suppress the dependence on g since in the targeted MLE algorithm presented below g will not be updated.

We obtain a separate MLE of ε for each factor, but we start with the last factor, use the updated last factor in the clever covariate of the (N − 1)-th factor, carry out the update of the (N − 1)-th factor, use its update in the clever covariate of the (N − 2)-th factor, and so on, till we update the first factor based on the first clever covariate, incorporating all previously obtained updates. One could now start over, since Qn got updated during this round of updating steps, apply the same round of updating steps to the update of Qn, and iterate till convergence. The Theorem below states that this is not necessary, since the algorithm has converged after one round.
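A schematic sketch of this backward-sequential updating on a simulated toy with three binary factors (all data-generating choices and the form of the clever covariates are hypothetical): each factor's clever covariate is built only from the current fits of later factors, each update is incorporated immediately, and a second round of updates returns ε ≈ 0 for every factor, illustrating the one-step convergence stated in the Theorem below:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def mle_eps(l, off, C):
    # one-parameter Newton-Raphson MLE of eps in logit Q(eps) = off + eps * C
    eps = 0.0
    for _ in range(30):
        p = expit(off + eps * C)
        info = np.sum(C ** 2 * p * (1 - p))
        if info == 0:
            break
        eps += np.sum(C * (l - p)) / info
    return eps

rng = np.random.default_rng(3)
n, N = 500, 3
L = [rng.integers(0, 2, n) for _ in range(N)]       # toy binary nodes Q_1, ..., Q_N
off = [0.5 * rng.normal(size=n) for _ in range(N)]  # logits of the initial fits

def clever(k, off):
    # monotone dependence: C_k uses only the current fits of factors > k
    later = [expit(off[m]) for m in range(k + 1, N)]
    return 1.0 + sum(later) if later else np.ones(n)

def one_round(off):
    eps_list = []
    for k in reversed(range(N)):        # backward: last factor first
        C = clever(k, off)
        e = mle_eps(L[k], off[k], C)
        off[k] = off[k] + e * C         # incorporate the update immediately
        eps_list.append(e)
    return eps_list

round1 = one_round(off)
round2 = one_round(off)                 # all ~0: converged in one round
```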

We state here the one-step convergence of this targeted MLE algorithm.

Theorem 3 Consider the targeted MLE algorithm above applied to an initial estimator (Qn, gn), using a separate εL(t,j)n for each factor QL(t,j)n, t ≥ 1, carrying out the updating steps one at a time, starting with the final factor in the likelihood and going backwards till the first factor, always incorporating the latest updates of Qn, where Q0n is the empirical distribution of Li(0), i = 1, . . . , n. We refer to one round of updating, starting at the final factor and ending at the first factor, as one step. This process can be iterated, thereby defining an iterative algorithm.

Suppose that for each t ≥ 1 and j, the clever covariate in this algorithm, CL(t,j)(Q), only depends on Q through Qsl = QL(s,l) for s > t, and for s = t, l > j. In that case, the above iterative targeted MLE algorithm converges in one step/round, and thus in exactly N = ∑t=1K+1 n(t) updating steps.

We recall from the previous Theorem that, if D(O) = D1(O)/g(Ā | X), and the probability distribution of O is factorized in binary predictors as in (1), then D* = Π(D | TQ) = D0 + ∑t,j Dtj, where Dtj = CL(t,j)(L(t, j) − QL(t,j)(1)), and

CL(t,j)=1g(A¯(t1)|Pa(L(t)))×{CL(t,j)(Q)(1)CL(t,j)(Q)(0)}

where, for δ ∈ {0, 1},

CL(t,j)(Q)(δ)=EQ(a¯(t,K)D1|La¯(t1)(t,j)=δ,Paa¯(t1)(L(t,j)))|a¯(t1)=A¯(t1).

This monotonicity property of the clever covariate holds if D1 does not depend on Q itself. More generally, it holds if

D(Q) = (D1 + C1(Q))/g, where C1(Q) is only a function of O through (L(0), Ā(K))

(so that C1(Q) will cancel out in the representation of CL(t,j)) and D1 does not depend on Q (it can depend on g).

This Theorem allows us to define closed-form targeted MLE algorithms for a large class of parameters in our semiparametric model defined by no constraints on any of the conditional node-specific distributions, given their specified parent nodes. To utilize this closed-form one-step targeted MLE, one is forced to carry out a separate update step for each factor (only once), but one can still use smoothing across many factors for the initial estimator.

3.5. Targeted loss-based selection among targeted MLE

The basic idea is as follows. All our candidate estimators of Q0 are targeted maximum likelihood estimators, indexed by different initial estimators of Q0, and using the same estimator gn of g0. Due to the fact that these targeted MLEs solve the efficient influence curve equation, it follows that the bias for ψ0 involves a product of the differences Qn* − Q0 and gn − g0: see the asymptotic linearity Theorems in van der Laan and Robins (2003) and van der Laan and Gruber (2009). The goal is clearly to estimate Q0 as accurately as possible, which will maximize efficiency and minimize bias for ψ0. Therefore, we want to use cross-validation to select among different targeted maximum likelihood estimators, using a loss function whose risk is minimized at Q0. However, many choices for the loss-based dissimilarity E0{L(Q) − L(Q0)} between a candidate Q and Q0 are possible, and one will be more targeted towards ψ0 than another. For example, we can use the log-likelihood loss function, a penalized log-likelihood loss function presented in van der Laan and Gruber (2009), or other loss functions inspired by the efficient influence curve of ψ0, as presented here (see also van der Laan and Gruber (2009)).

Here we present two loss functions for Q0 that are identified by the efficient influence curve of Ψ. Firstly, if g0 is known, then we can use

L1(Q)={D*(Q,g0)}2.

If D*(Q, g0) = D(Q, g0) − Ψ(Q), then it follows that indeed Q0 = arg minQ E0L1(Q), since the variance under P0 of D*(Q, g0, ψ0) is minimized at Q = Q0 (van der Laan and Robins (2003)). For more general efficient influence curves, the latter property has to be explicitly verified: at a minimum, if D*(Q, g) = D(Q, g, Ψ(Q)), then one can replace Ψ(Q) by a consistent estimator ψn of ψ0, and use the loss function D(Q, g0, ψn)². By the argument above, the loss function is still valid if one is willing to assume a consistent and good estimator of g0 (an estimator that converges faster to the true g0 than the estimators of Q0 converge to Q0).

To explain the rationale behind this loss function, first consider the case that g0 is known. If g0 is known, a targeted MLE for which Qn* converges to a Q with Ψ(Q) = ψ0 is asymptotically linear with influence curve D*(Q, g0) (van der Laan and Rubin (2006)), and it is well known that the variance of D*(Q, g0) over Q with Ψ(Q) = ψ0 is minimal at Q = Q0, which then corresponds with the semiparametric information bound. Thus, E0L1(Q) equals the asymptotic variance of the influence curve of the targeted MLE. Under the assumption that L1(Q) is uniformly bounded in all candidate Q's, we can apply the Theorems on the cross-validation selector (e.g., van der Laan and Dudoit (2003)), which prove that either the cross-validation selector is asymptotically equivalent with the oracle selector, or it achieves the parametric rate of convergence. As a consequence, loss-function based cross-validation based on this loss function will, for large enough sample size, select the targeted maximum likelihood estimator with the smallest asymptotic variance of its resulting substitution estimator of ψ0 (excluding the case that there are ties). If g0 is unknown, but estimated at a fast rate relative to the rate at which one estimates Q0, then the above argument for the cross-validation selector still applies to first order: see van der Laan and Dudoit (2003) for oracle results for the cross-validation selector based on loss functions with nuisance parameters. If g0 is estimated, and Q ≠ Q0, then the influence curve of the targeted MLE involves another contribution, reducing the variance relative to the variance of the influence curve for g0 known. In this case, L1(Q) is not exactly the asymptotic variance of the targeted MLE, but it is still minimized at the optimal Q0, and it represents a large component of the true asymptotic variance of the targeted MLE.
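In the simplest point-treatment example with target parameter EY(1) and g0 known, the efficient influence curve is D*(Q, g0)(O) = A/g0(W) (Y − Q̄(W)) + Q̄(W) − Ψ(Q), so the L1 risk of a candidate Q̄ can be estimated by the cross-validated mean of D*². The sketch below (data-generating mechanism and candidate fits are all hypothetical) selects between a misspecified and a correctly specified candidate by this criterion:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
W = rng.integers(0, 2, n)                 # binary baseline covariate
g0 = np.where(W == 1, 0.7, 0.4)           # known treatment mechanism P(A=1 | W)
A = rng.binomial(1, g0)
Y = A + 2.0 * W + rng.normal(size=n)      # true Qbar0(W) = E(Y | A=1, W) = 1 + 2W

def fit_const(tr):                        # candidate 1: ignores W (misspecified)
    m = Y[tr][A[tr] == 1].mean()
    return lambda w: np.full(len(w), m)

def fit_strat(tr):                        # candidate 2: stratifies on W
    q = np.array([Y[tr][(A[tr] == 1) & (W[tr] == k)].mean() for k in (0, 1)])
    return lambda w: q[w]

def cv_l1_risk(fitter, folds=5):
    idx = np.arange(n)
    risk = 0.0
    for f in np.array_split(idx, folds):
        tr = np.setdiff1d(idx, f)
        qbar_v = fitter(tr)(W[f])
        psi = fitter(tr)(W[tr]).mean()    # training-sample plug-in of EY(1)
        dstar = A[f] / g0[f] * (Y[f] - qbar_v) + qbar_v - psi
        risk += np.sum(dstar ** 2)        # L1(Q) = {D*(Q, g0)}^2
    return risk / n

best = min([fit_const, fit_strat], key=cv_l1_risk)
```

The misspecified constant fit inflates the variance of D*, so its cross-validated L1 risk exceeds that of the stratified fit.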

Consider now the case that we are not willing to assume that estimation of g0 is easy relative to estimation of Q0. In that case, the above loss function is not appropriate. Recall the representation of the efficient influence curve D* = D0 + ∑t,j DL(t,j) with DL(t,j) = CL(t,j)(L(t, j) − QL(t,j)(1)). We make the following observation (using short-hand notation):

VAR(D*(Q0, g0)) = E0D0² + ∑t,j E0DL(t,j)² = E0D0² + ∑t,j E0CL(t,j)²(L(t, j) − QL(t,j)(1))².

This suggests using as loss function for QL(t,j), t ≥ 1, the weighted squared-error loss function:

L2(Q) = ∑t,j CL(t,j)²(L(t, j) − QL(t,j)(1))²,

which is indexed by the weight functions CL(t,j)². One would need to obtain an initial estimator of these weights, which depend on both Q0 and g0. However, note that even if we estimate these weights inconsistently, L2(Q) remains a valid loss function for Q0, because the weights are functions of the parents of L(t, j) only, so that the risk is still minimized at the true conditional probabilities; this preserves the double robustness of the resulting targeted maximum likelihood estimator of Q0.
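The insensitivity of the minimizer to misestimated weights can be checked numerically: since the weight is constant within each stratum of the parents, the weighted stratum mean coincides with the plain stratum mean. A toy check, with a hypothetical binary parent, node, and deliberately wrong weights:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
pa = rng.integers(0, 2, n)                        # the parent of the binary node
l = rng.binomial(1, np.where(pa == 1, 0.7, 0.2))  # the node L(t, j)
w = np.where(pa == 1, 5.0, 0.1)                   # badly (mis)estimated weights C^2;
                                                  # any positive function of the parents

# minimizer of the weighted squared-error risk within each parent stratum:
q_weighted = [np.average(l[pa == k], weights=w[pa == k]) for k in (0, 1)]
q_plain = [l[pa == k].mean() for k in (0, 1)]     # estimate of the true conditional mean
```

The two stratum-specific solutions agree, so inconsistent weights change only the relative emphasis across strata, not the limit of the fit.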

In van der Laan and Gruber (2009) other loss functions implied by the efficient influence curve are proposed, including the variance of efficient influence curve at a collaborative estimator of g0.

3.6. The targeted MLE at a degenerate initial estimator for intermediate time-dependent covariate factors

Consider the likelihood factorization as used to define the G-computation formula, and assume that the IPCW estimating function is of the form stated in the above Theorem 3. If one of the node-specific conditional distributions is estimated with a degenerate conditional distribution, given the data generated by the previous node-specific conditional distributions, then Theorem 3 implies that the projection of a function of O onto the tangent space generated by that factor equals zero.

For example, suppose the likelihood is factored according to the ordering L(0), A(0), L(1), A(1), Y. The projection of a function D(O) onto the tangent space of QL(1) is zero at a degenerate QL(1), even if the true conditional distribution of L(1) is not degenerate.

This insight suggests a simple-to-compute version of the targeted MLE. Suppose we obtain an initial estimator Qn0 that is degenerate for all factors except the last one, and we use the empirical distribution for the marginal distribution of the baseline covariates. In that case, only the last factor, the conditional distribution QY of Y = L(K+1), needs to be updated in the targeted MLE algorithm. As a consequence, the targeted MLE requires only one update, and thus converges in a single updating step.

Thus, in this case one estimates most of the system with a deterministic system, and only the last factor is estimated with a non-degenerate conditional probability distribution that is updated with a clever covariate depending on the treatment mechanism. Due to the double robustness of the targeted MLE, the resulting targeted MLE will be consistent and asymptotically linear if the treatment mechanism is correctly specified, and will still gain efficiency if the degenerate distributions are doing a reasonable job; since the degenerate distributions will be misspecified, it is no longer reasonable to rely on correct specification of the initial estimator Qn0 of Q0 for consistency. Note also that this simplified targeted ML estimator still allows us to apply the collaborative targeted MLE approach for selecting among different treatment mechanism estimators, based on the log-likelihood of the targeted estimator Qn1 indexed by the treatment mechanism estimator: see van der Laan and Gruber (2009).

We can view this particular simple targeted maximum likelihood estimator as one particular candidate among a set of candidate targeted maximum likelihood estimators, and use loss-based cross-validation to select among these candidates. One would use one of our proposed (efficient-influence-curve based) loss functions, such as the weighted squared-error loss function, since the log-likelihood loss function becomes undefined at a degenerate distribution.

3.7. Dimension reduction for time-dependent covariates

One could use a loss function on the Q-factor of the binary-coded complete data structure, and use loss-function based cross-validation to select among different fits, thereby implicitly carrying out a dimension reduction. For example, not including a node of the graph in the parent sets of the other nodes is equivalent to removing that node from the data structure, and such moves can be scored with the loss function. In this manner one might build an initial estimator Qn whose G-computation formula for the parameter of interest is only affected by the conditional distributions of a subset of all nodes, thereby also simplifying the targeted MLE update.

Here we wish to investigate alternative targeted dimension reductions that would, in particular, reduce the computational complexity of the targeted maximum likelihood estimator which is driven by the number of binary variables coding the data structure. This reduced complexity/dimension can also imply that the loss function for the Q0 of the reduced data structure implies a more targeted dissimilarity for the purpose of fitting Ψ(Q0).

If a multivariate L(t) is reduced to a one-dimensional time-dependent covariate, then the targeted maximum likelihood estimator is simpler; but if this reduction means that A(t) depends on measured variables beyond the one-dimensional reduced covariate, then the reduction will also have caused bias. In addition, much information might have been lost, thereby increasing variance. So care is needed.

Let’s revisit the two-stage sequentially randomized controlled trial with data structure O = (L(0), A(0), L(1), A(1), Y = L(2)), but let’s now consider the more general case that L(1) can be a multivariate vector with discrete and/or continuous components. Suppose that we wish to estimate EY(1, 1). Inspection of the efficient influence curve of EY(1, 1) shows that it only depends on the conditional distribution of Y through its mean E(Y | A(0) = 1, A(1) = 1, L(1)). This suggests that LQ(1) = E(Y | A(0), A(1) = 1, L(1)) defines a targeted dimension reduction: below we provide a general approach which implies this precise dimension reduction. In addition, let Lg(1) be defined as the propensity score P(A(1) = 1 | L(0), A(0), L(1)). A targeted dimension reduction of the observed O is now given by (L(0), A(0), LQ(1), Lg(1), A(1), Y). We can fit both LQ(1) and Lg(1) from the data using super-learning, thereby obtaining an estimated dimension reduction Or. A targeted MLE for this (estimated) reduced data structure now involves fitting QLQ(1), QLg(1), and QY, where only the conditional mean of Y is needed. However, by definition of LQ(1), the conditional mean of QY at A(1) = 1 equals LQ(1), suggesting that we can exclude Lg(1) from the parent set of Y without meaningful loss of information. Then, the conditional distribution QLg(1) does not affect the G-computation formula of the distribution of Y(1, 1) or, more generally, the joint distribution of Y(1, 1) and L(0). As a consequence, in this case we do not even need to fit QLg(1).
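A sketch of this reduction on a toy two-stage trial in which L(1) has two binary components, so that stratum-specific sample means can stand in for the super-learner fits of LQ(1) and Lg(1). All data-generating choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
L0 = rng.integers(0, 2, n)
A0 = rng.binomial(1, 0.5, n)
L1 = np.c_[rng.integers(0, 2, n), rng.integers(0, 2, n)]  # multivariate L(1)
A1 = rng.binomial(1, 0.3 + 0.4 * L1[:, 0])
Y = A0 + A1 + L1.sum(axis=1) + rng.normal(size=n)

# LQ(1) = E(Y | A(0), A(1)=1, L(1)): stratum means over (A(0), L(1))
keyQ = A0 * 4 + L1[:, 0] * 2 + L1[:, 1]
# Lg(1) = P(A(1)=1 | L(0), A(0), L(1)): stratum means over (L(0), A(0), L(1))
keyg = L0 * 8 + keyQ
LQ1 = np.empty(n)
Lg1 = np.empty(n)
for k in np.unique(keyQ):
    s = keyQ == k
    LQ1[s] = Y[s & (A1 == 1)].mean()
for k in np.unique(keyg):
    s = keyg == k
    Lg1[s] = A1[s].mean()

# reduced data structure O^r = (L(0), A(0), LQ(1), Lg(1), A(1), Y)
O_r = np.column_stack([L0, A0, LQ1, Lg1, A1, Y])
```

The multivariate L(1) enters the reduced structure only through the two univariate columns LQ1 and Lg1; the targeted MLE would then be applied to O_r.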

To summarize, in this manner we have dramatically reduced the complexity of a targeted MLE by replacing the fitting of a conditional distribution of a multivariate random variable L(1) with the fitting of a univariate conditional distribution of LQ(1).

Let’s now generalize this type of targeted dimension reduction procedure. Consider a general longitudinal data structure (L(0), A(0), . . . , L(K), A(K), Y = L(K + 1)), and consider the case that A(j) is binary, j = 0, . . . , K. The dimension reduction can be guided by the actual form of the efficient influence curve for the target parameter. To demonstrate this, we first note that the efficient influence curve can often be represented as DIPCW(g0, ψ0)(L(0), Ā, Y) − ∑j=0K {E(DIPCW | A(j), Pa(A(j))) − E(DIPCW | Pa(A(j)))} for some IPCW-estimating function DIPCW (see van der Laan and Robins (2003)). Each of these differences of two conditional expectations can also be written as C(j)(A(j) − P(A(j) = 1 | Pa(A(j)))), where

C(j)=E(DIPCW|A(j)=1,Pa(A(j)))E(DIPCW|A(j)=0,Pa(A(j))).

For example, if ψ0 = EY (1), then DIPCW(O) = {I(Ā = 1)/g(Ā | X)}Y − ψ0. As we did before, we can factorize this difference of conditional expectations in terms of a factor only depending on Q0 and a factor only depending on g0. We can define LQ(j) as the Q0-factor only, thereby preserving double robustness of the resulting targeted MLE. In addition, we define

Lg(j) = P(A(j) = 1 | Pa(A(j))).

If the target parameter is EY (a) for a static regimen a, it follows that the efficient influence curve depends on O through the reduction

Or = (L(0), A(0), LQ(1), Lg(1), A(1), . . . , LQ(K), Lg(K), A(K), Y).

If the target parameter is EY(d) for a dynamic treatment rule d, then one also needs to include the time-dependent covariates the rule d uses to assign treatment. To summarize, inspection of the efficient influence curve of the target parameter defines a reduction Or in terms of two time-dependent covariate processes, one representing the treatment assignment probabilities as functions of the past, and one based on the Q0-functions making up the efficient influence curve. These time-dependent covariates depend on the unknown Q0 and g0. We estimate these time-dependent covariates by estimating the treatment mechanism and the required LQ(j). We can now apply the targeted MLE to this reduced data structure.

As in our previous example, suppose that for each of the conditional distributions of Y and LQ(j), j = 1, . . . , K, we do not include any of the Lg(j) in the parent sets. We suggest that this comes at little cost in efficiency. Under this condition, the conditional distributions QLg(j) do not affect the G-computation formula of the distribution of Y(d) or, more generally, the joint distribution of Y(d) and L(0). As a consequence, in that case we do not even need to fit QLg(j), j = 1, . . . , K. To summarize, in this manner we can dramatically reduce the complexity of a targeted MLE by replacing the fitting of a conditional distribution of a multivariate random variable L(j) with fitting only the univariate conditional distributions of LQ(j), and possibly the conditional distribution of another time-dependent covariate used to define the target parameter (e.g., a treatment rule based on a time-dependent biomarker). Note that we will still fit the treatment mechanism of A(j), conditional on its parents (under Or) including Lg(j), and can thus just fit P(A(j) = 1 | Pa(A(j))) with Lg(j) itself.

This dimension reduction still allows for the construction of a collaborative estimator gn of g0, given an estimator Qn of Q0r, representing the conditional distributions of LQ(j), Lg(j) and Y . This just requires applying the C-T-MLE algorithm as presented in van der Laan and Gruber (2009) to the log-likelihood for Q0r, thereby scoring a fit of g0 with the log-likelihood (or other loss function) of the targeted MLE of Q0r corresponding with the fluctuation function implied by the candidate g0-fit.

By using as loss function the variance of the influence curve of the targeted MLE, we can still select among different targeted maximum likelihood estimators indexed by the different dimension reductions of the type presented above, assuming that each of them puts maximal effort into obtaining an unbiased estimator.

4. Discussion

Targeted maximum likelihood estimation, combined with loss-based super learning fully utilizing the power of cross-validation for bounded loss functions, provides an exceptionally powerful framework for assessing causal effects, with distinct advantages relative to other proposed methodology. The current paper lays the groundwork for the implementation of targeted maximum likelihood estimators that also incorporate time-dependent covariate and outcome processes. Time-dependent covariates and outcome processes allow reasonably accurate imputations of the clinical outcome based on recent history, thereby allowing for significant potential gains in both efficiency and bias (see the formula for the efficient influence curve/clever covariates). Even though almost all current clinical trials collect time-dependent data, these important sources of information have been ignored in the assessment of the causal effect of a drug or treatment strategy. Clinical trials provide just one important application of the targeted maximum likelihood estimator. Other important applications are the assessment of causal effects of treatment rules in sequentially randomized controlled trials, and in observational studies.

References

  1. Abadie A, Imbens GW. Large sample properties of matching estimators for average treatment effects. Econometrica. 2006;74:235–67. doi: 10.1111/j.1468-0262.2006.00655.x. [DOI] [Google Scholar]
  2. Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Models Based on Counting Processes. Springer-Verlag; New York: 1993. [Google Scholar]
  3. Bembom O, Petersen ML, Rhee S-Y, Fessel WJ, Sinisi SE, Shafer RW, van der Laan MJ. Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant hiv infection. Statistics in Medicine. page http://www3.interscience.wiley.com/journal/121422393/abstract, 2008. [DOI] [PMC free article] [PubMed]
  4. Bembom O, van der Laan MJ, Haight T, Tager IB. Lifetime and current leisure time physical activity and all-cause mortality in an elderly cohort. Epidemiology. 2009 doi: 10.1097/EDE.0b013e31819e3f28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bembom Oliver, van der Laan Mark. Statistical methods for analyzing sequentially randomized trials, commentary on jnci article adaptive therapy for androgen independent prostate cancer: A randomized selection trial including four regimens, by peter f. thall, c. logothetis, c. pagliaro, s. wen, m.a. brown, d. williams, r. millikan 2007. Journal of the National Cancer Institute. 2007;99(21):1577–1582. doi: 10.1093/jnci/djm185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag; 1997. [Google Scholar]
  7. Bryan J, Yu Z, van der Laan MJ. Analysis of longitudinal marginal structural models. Biostatistics. 2003;5(3):361–380. doi: 10.1093/biostatistics/kxg041. [DOI] [PubMed] [Google Scholar]
  8. Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology. 2005;2(2):131–154. doi: 10.1016/j.stamet.2005.02.003. [DOI] [Google Scholar]
  9. Gill R, Robins JM. Causal inference in complex longitudinal studies: continuous case. Ann Stat. 2001;29(6) [Google Scholar]
  10. Gill RD, van der Laan MJ, Robins JM. Coarsening at random: characterizations, conjectures and counter-examples. In: Lin DY, Fleming TR, editors. Proceedings of the First Seattle Symposium in Biostatistics. New York: Springer Verlag; 1997. pp. 255–94. [Google Scholar]
  11. Heitjan DF, Rubin DB. Ignorability and coarse data. Annals of statistics. 1991 Dec;19(4):2244–2253. doi: 10.1214/aos/1176348396. [DOI] [Google Scholar]
  12. Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561–570. doi: 10.1097/00001648-200009000-00012. [DOI] [PubMed] [Google Scholar]
  13. Jacobsen M, Keiding N. Coarsening at random in general sample spaces and random censoring in continuous time. Annals of Statistics. 1995;23:774–86. doi: 10.1214/aos/1176324622. [DOI] [Google Scholar]
  14. Keleş S, van der Laan M, Dudoit S. Asymptotically optimal model selection method for regression on censored outcomes. Technical Report, Division of Biostatistics, UC Berkeley. 2002 [Google Scholar]
  15. Lavori P, Dawson R. A design for testing clinical strategies: biased adaptive within-subject randomization. Journal of the Royal Statistical Society, Series A. 2000;163:2938. [Google Scholar]
  16. Lavori P, Dawson R. Dynamic treatment regimes: practical design considerations. Clinical trials. 2004;1:920. doi: 10.1191/1740774S04cn002oa. [DOI] [PubMed] [Google Scholar]
  17. Moore KL, van der Laan MJ. Covariate adjustment in randomized trials with binary outcomes. Technical report 215, Division of Biostatistics, University of California, Berkeley, April 2007. [Google Scholar]
  18. Moore KL, van der Laan MJ. Application of time-to-event methods in the assessment of safety in clinical trials. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009. [Google Scholar]
  19. Murphy S. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine. 2005;24:1455–1481. doi: 10.1002/sim.2022. [DOI] [PubMed] [Google Scholar]
  20. Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B. 2003;65(2) doi: 10.1093/jrsssc/qlad037. ? [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Murphy SA, van der Laan MJ, Robins JM. Marginal mean models for dynamic treatment regimens. Journal of the American Statistical Association. 2001;96:1410–1424. doi: 10.1198/016214501753382327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Neugebauer R, van der Laan MJ. Why prefer double robust estimates. Journal of Statistical Planning and Inference. 2005;129(1–2):405–426. doi: 10.1016/j.jspi.2004.06.060. [DOI] [Google Scholar]
  23. Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Press; Cambridge: 2000. [Google Scholar]
  24. Petersen Maya L, Deeks Steven G, Martin Jeffrey N, van der Laan Mark J. History-adjusted marginal structural models: Time-varying effect modification and dynamic treatment regimens. Technical report 199, Division of Biostatistics, University of California, Berkeley, December 2005.
  25. Polley EC, van der Laan MJ. Predicting optimal treatment assignment based on prognostic factors in cancer patients. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009. [Google Scholar]
  26. Robins J, Orallana L, Rotnitzky A. Estimaton and extrapolation of optimal treatment and testing strategies. Statistics in Medicine. 2008;27(23):4678–4721. doi: 10.1002/sim.3301. [DOI] [PubMed] [Google Scholar]
  27. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer”. Statistica Sinica. 2001a;11(4):920–936. [Google Scholar]
  28. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000a;11(5):550–560. doi: 10.1097/00001648-200009000-00011. [DOI] [PubMed] [Google Scholar]
  29. Robins JM, Rotnitzky A, van der Laan MJ. Comment on “On Profile Likelihood” by S.A. Murphy and A.W. van der Vaart. Journal of the American Statistical Association – Theory and Methods. 2000b;450:431–435. doi: 10.2307/2669381. [DOI] [Google Scholar]
  30. Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association. 2000a.
  31. Robins JM. Discussion of “Optimal dynamic treatment regimes” by Susan A. Murphy. Journal of the Royal Statistical Society: Series B. 2003;65(2):355–366. [Google Scholar]
  32. Robins JM. Optimal structural nested models for optimal sequential decisions. Heagerty PJ, Lin DY, editors. Proceedings of the 2nd Seattle symposium in biostatistics. 2005a;179:189–326. [Google Scholar]
  33. Robins JM. Optimal structural nested models for optimal sequential decisions. Technical report, Department of Biostatistics, Havard University, 2005b.
  34. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. doi: 10.1016/0270-0255(86)90088-6. [DOI] [Google Scholar]
  35. Robins JM. The analysis of randomized and non-randomized AIDS treatment trials using a new approach in causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, editors. Health Service Methodology: A Focus on AIDS. U.S. Public Health Service, National Center for Health Services Research; Washington D.C.: 1989. pp. 113–159. [Google Scholar]
  36. Robins JM. Proceedings of the Biopharmaceutical Section. American Statistical Association; 1993. Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers; pp. 24–33. [Google Scholar]
  37. Robins JM. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent Variable Modeling and Applications to Causality. Springer-Verlag; New York: 1997a. pp. 69–117. [Google Scholar]
  38. Robins JM. Structural nested failure time models. In: Armitage P, Colton T, Andersen PK, Keiding N, editors. The Encyclopedia of Biostatistics. John Wiley and Sons; Chichester, UK: 1997b. [Google Scholar]
  39. Robins JM. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) Springer; New York: 2000b. Marginal structural models versus structural nested models as tools for causal inference; pp. 95–133. [Google Scholar]
  40. Robins JM, Rotnitzky A. Comment on Inference for semiparametric models: some questions and an answer, by Bickel, P.J. and Kwon, J. Statistica Sinica. 2001b;11:920–935. [Google Scholar]
  41. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS Epidemiology: Methodological Issues. Birkhäuser; 1992.
  42. Rose S, van der Laan MJ. Simple optimal weighting of cases and controls in case-control studies. The International Journal of Biostatistics. 2008. http://www.bepress.com/ijb/vol4/iss1/19/ [DOI] [PMC free article] [PubMed]
  43. Rose S, van der Laan MJ. Why match? Investigating matched case-control study designs with causal effect estimation. The International Journal of Biostatistics. 2009. http://www.bepress.com/ijb/vol5/iss1/1/ [DOI] [PMC free article] [PubMed]
  44. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. doi: 10.1093/biomet/70.1.41. [DOI] [Google Scholar]
  45. Rosenblum M, Deeks SG, van der Laan MJ, Bangsberg DR. The risk of virologic failure decreases with duration of HIV suppression, at greater than 50% adherence to antiretroviral therapy. PLoS ONE. 2009;4(9):e7196. doi: 10.1371/journal.pone.0007196. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Rubin DB. Matched Sampling for Causal Effects. Cambridge University Press; Cambridge, MA: 2006. [Google Scholar]
  47. Rubin DB, van der Laan MJ. Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics. 2008;4(1) doi: 10.2202/1557-4679.1084. Article 5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Rush AJ, Trivedi M, Fava M. Depression. IV. American Journal of Psychiatry. 2003;160(2):237. doi: 10.1176/appi.ajp.160.2.237. [DOI] [PubMed] [Google Scholar]
  49. Schneider L, Tariot PN, Dagerman K, Davis S, Hsiao J, Ismail S, Lebowitz B, Lyketsos C, Ryan J, Stroup T, Sultzer D, Weintraub D, Lieberman J. Effectiveness of atypical antipsychotic drugs in patients with Alzheimer's disease. New England Journal of Medicine. 2006;355(15):1525–1538. doi: 10.1056/NEJMoa061240. [DOI] [PubMed] [Google Scholar]
  50. Sekhon JS. Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, Forthcoming. 2008.
  51. Sinisi S, van der Laan MJ. The deletion/substitution/addition algorithm in loss function based estimation: Applications in genomics. Journal of Statistical Methods in Molecular Biology. 2004;3(1) doi: 10.2202/1544-6115.1069. [DOI] [PubMed] [Google Scholar]
  52. Swartz M, Perkins D, Stroup T, Davis S, Capuano G, Rosenheck R, Reimherr F, McGee M, Keefe R, McEvoy J, Hsiao J, Lieberman J. Effects of antipsychotic medications on psychosocial functioning in patients with chronic schizophrenia: Findings from the NIMH CATIE study. American Journal of Psychiatry. 2007;164:428–436. doi: 10.1176/appi.ajp.164.3.428. [DOI] [PubMed] [Google Scholar]
  53. Thall P, Millikan R, Sung H-G. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine. 2000;19:1011–1028. doi: 10.1002/(SICI)1097-0258(20000430)19:8<1011::AID-SIM414>3.0.CO;2-M. [DOI] [PubMed] [Google Scholar]
  54. Tuglus C, van der Laan MJ. Targeted methods for biomarker discovery, the search for a standard. UC Berkeley Working Paper Series. 2008. http://www.bepress.com/ucbbiostat/paper233/
  55. van der Laan MJ. Causal effect models for intention to treat and realistic individualized treatment rules. 2006. Technical report 203, Division of Biostatistics, University of California, Berkeley. [DOI] [PMC free article] [PubMed]
  56. van der Laan MJ. Estimation based on case-control designs with known prevalence probability. The International Journal of Biostatistics. 2008. http://www.bepress.com/ijb/vol4/iss1/17/ [DOI] [PubMed]
  57. van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Technical report, Division of Biostatistics, University of California, Berkeley, November 2003. [Google Scholar]
  58. van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. The International Journal of Biostatistics. 2009. [DOI] [PMC free article] [PubMed]
  59. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics. 2007;3(1) doi: 10.2202/1557-4679.1022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003. [Google Scholar]
  61. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology. 2004;3 doi: 10.2202/1544-6115.1036. [DOI] [PubMed] [Google Scholar]
  63. van der Laan MJ, Petersen ML, Joffe MM. History-adjusted marginal structural models and statically-optimal dynamic treatment regimens. The International Journal of Biostatistics. 2005;1(1):10–20. doi: 10.2202/1557-4679.1003. [DOI] [Google Scholar]
  64. van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Statistics and Decisions. 2006;24(3):373–395. doi: 10.1524/stnd.2006.24.3.373. [DOI] [Google Scholar]
  65. van der Laan MJ, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(25) doi: 10.2202/1544-6115.1309. [DOI] [PubMed] [Google Scholar]
  66. van der Laan MJ, Rose S, Gruber S. Readings on targeted maximum likelihood estimation. Technical report, UC Berkeley Working Paper Series, September 2009. http://www.bepress.com/ucbbiostat/paper254/
  67. van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Statistics and Decisions. 2006;24(3):351–371. doi: 10.1524/stnd.2006.24.3.351. [DOI] [Google Scholar]
  68. Yu Z, van der Laan MJ. Construction of counterfactuals and the G-computation formula. 2002. Technical report, Division of Biostatistics, University of California, Berkeley.
  69. Yu Z, van der Laan MJ. Double robust estimation in longitudinal marginal structural models. 2003. Technical report, Division of Biostatistics, University of California, Berkeley.
