The International Journal of Biostatistics
2010 Feb 22;6(2):2. doi: 10.2202/1557-4679.1211

Targeted Maximum Likelihood Based Causal Inference: Part I

Mark J van der Laan 1
PMCID: PMC3126670  PMID: 21969992

Abstract

Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The obtained G-computation formula represents the counterfactual distribution the data would have had, had this intervention been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.

In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.

Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.

Keywords: causal effect, causal graph, censored data, cross-validation, collaborative double robust, double robust, dynamic treatment regimens, efficient influence curve, estimating function, estimator selection, locally efficient, loss function, marginal structural models for dynamic treatments, maximum likelihood estimation, model selection, pathwise derivative, randomized controlled trials, sieve, super-learning, targeted maximum likelihood estimation

1. Introduction

The data structure on the experimental unit can often be viewed as a time-series in discrete time, possibly on a fine scale. At many time points nothing might be observed and at possibly irregularly spaced time points events occur and are measured, where some of these events may occur at the same time. A specified ordering of all measured variables which respects this time-ordering and possibly additional knowledge about the ordering in which variables were realized, implies a graph in the sense that for each observed variable we can identify a set of parent nodes of that observed variable, defined as the set of variables occurring before the observed variable in the ordering. The likelihood of this unit specific data structure can be factorized accordingly in terms of the conditional distribution of a node in the graph, given the parents of that node, across all nodes. This particular factorization of the likelihood puts no restriction on the possible set of data generating distributions, but the ordering affects the so called G-computation formula for counterfactual distributions of the data under certain interventions implied by this ordering. Beyond the factorization of the likelihood in terms of a product of conditional distributions, the G-computation formula involves specifying a set of nodes in the time-series/graph as the variables to intervene upon, and specifying the intervention for these nodes. These interventions could be rules that assign the value for the intervention node (possibly) in response to the observed data on the (observed) parents of the intervention node. The G-computation formula is now defined as the product, across all nodes, excluding the intervention nodes, of the conditional distribution of a node, given the parent nodes, with the intervention nodes in the parent set following their assigned values.

If it is known that the conditional distribution of a node only depends on a subset of the parents that were implied by the ordering, then that knowledge should be incorporated by reducing the parent set to its correct set. This kind of knowledge does reduce the size of the model for the data generating distribution (and such assumptions can indeed be tested from the data).

The G-computation formula provides a probability distribution of the intervention specific data structure. Under certain conditions on a causal graph on an augmented set of nodes which includes unobserved nodes beyond the observed nodes (Pearl (2000)), such as no unblocked back-door path from each intervention node to future/downstream nodes, this G-computation formula equals the counterfactual distribution of the data structure one would have observed had one enforced the specified intervention on the system described by the causal graph.

We remind the reader that a causal graph on a set of nodes states that each node is a deterministic function of its parents. It typically represents a set of so called causal assumptions that cannot be learned from the data. Given a declared causal graph on a set of nodes, one can formally state what assumptions on this causal graph are needed in order to claim that a specified G-computation formula for the observed nodes corresponds with the G-computation formula for the causal graph on the full set of nodes (that includes the unobserved nodes), where the latter G-computation formula is then viewed as the gold-standard representing the causal effect of interest (Pearl (2000)).

Either way, the time-ordering, and possibly additional known ordering, does provide a statistical graph for the data as explained above, and a corresponding G-computation formula.

In this article we are concerned with (semi-parametric) efficient estimation of the “causal” effects viewed as parameters of this G-computation formula based on observing n independent and identically distributed observations O1, . . . , On of O. Specifically, we are concerned with estimation of parameters of the G-computation formula implied by a particular statistical graph on the observed data structure O, in the semiparametric model that makes no assumptions about each node-specific conditional distribution in the graph, given its parents.

Formally, the density of O is modeled as

p0(O) = ∏_j P(N(j) | Pa(N(j))),

where N(j) denote the nodes in the graph representing the observed variables, Pa(N(j)) denote the parents of N(j), and we make no assumptions on each conditional distribution of N(j), beyond that N(j) only depends on Pa(N(j)). Note, however, as remarked above, if the parent sets induce more structure than the parent sets implied by an ordering of all observed variables, then this statistical graph of p0 might imply a real (i.e., not just nonparametric) semiparametric model on p0, corresponding with a variety of conditional independence assumptions.
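As a minimal illustration of this factorization (a hypothetical sketch, with toy conditional probabilities, not an estimator from this article), the density p0(O) can be evaluated as a product of node-specific conditional probabilities, with each node's parent set taken as all nodes preceding it in the ordering:

```python
# Hypothetical sketch: evaluate p0(O) = prod_j P(N(j) | Pa(N(j))) for a discrete
# data structure, with parent sets implied by the ordering of the nodes.
def factorized_density(o, nodes, cond_probs):
    """o: dict mapping node name -> observed value.
    nodes: list of node names in their time-ordering.
    cond_probs: dict mapping node name -> function(value, parents_dict) -> probability."""
    p = 1.0
    for j, name in enumerate(nodes):
        parents = {k: o[k] for k in nodes[:j]}  # parent set implied by the ordering
        p *= cond_probs[name](o[name], parents)
    return p

# Toy example with two binary nodes, where N1 depends on N0:
cond = {
    "N0": lambda v, pa: 0.6 if v == 1 else 0.4,
    "N1": lambda v, pa: (0.8 if v == 1 else 0.2) if pa["N0"] == 1 else (0.3 if v == 1 else 0.7),
}
p = factorized_density({"N0": 1, "N1": 1}, ["N0", "N1"], cond)  # 0.6 * 0.8 = 0.48
```

Reducing a node's parent set to a known subset, as discussed above, corresponds to having `cond_probs[name]` ignore the irrelevant entries of `parents`.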

Even if the (non-testable) causal assumptions required to interpret the G-computation formula as a counterfactual distribution on a system fail to hold, assuming that the ordering of the likelihood respects the ordering w.r.t. the intervention nodes (i.e., it correctly states what variables are pre or post intervention for each intervention node), the target parameters often still represent effects of interest aiming to get as close to a causal effect as the data allows. In particular, one can simply interpret the G-computation parameters for what they are, namely well defined effects of interventions on the distribution of the data: see van der Laan (2006) for more discussion of the role of causal parameters in variable importance analysis.

It is important to note that the probability density p0 of the observed data structure O, factored by the statistical graph, can be represented as a product of two factors, the first factor Q0 that identifies the G-computation formulas for interventions, and the second factor g0 representing the product, over the intervention nodes, of the conditional distributions of these nodes: p0 = Q0g0. We often refer to the second factor as the censoring and/or treatment mechanism in case the intervention nodes correspond with censoring variables and/or treatment assignments. We will denote the true probability distribution of the data-structure on the experimental unit with P0, and its probability density with p0.

A variety of estimators of causal effects of multiple time-point interventions, including handling censored data (by enforcing no-censoring as part of the intervention), have been proposed: Inverse Probability of Censoring Weighted (IPCW) estimators, Augmented IPCW-estimators (which are double robust), maximum likelihood based estimators, and targeted Maximum Likelihood Estimators (which are double robust). The IPCW and augmented-IPCW estimators fall in the category of estimating equation methodology (van der Laan and Robins (2003)). The augmented-IPCW estimator is defined as a solution of an estimating equation in the target parameter implied by the so called efficient influence curve. Maximum likelihood based estimators involve estimation of the distribution of the data and subsequent evaluation of the target parameter. Traditional maximum likelihood estimators are not targeted towards the target parameter, and are thereby, in particular, not double robust.
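The contrast between the IPCW and augmented IPCW estimators can be sketched for a single time point treatment-specific mean E[Y_1]; the data-generating mechanism below, with the treatment mechanism g0(1 | W) treated as known, is purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Simulated single time point data with confounding by W; g0(1|W) is known.
W = rng.binomial(1, 0.5, n)
g1 = 0.2 + 0.5 * W                                 # g0(1 | W) = P(A = 1 | W)
A = rng.binomial(1, g1)
Y = rng.binomial(1, 0.1 + 0.2 * A * (1 + 2 * W))   # true E[Y_1] = 0.5

# Naive treated-arm mean: biased under confounding.
naive = Y[A == 1].mean()

# IPCW estimator: reweight treated subjects by 1/g0(1|W).
ipcw = np.mean(A * Y / g1)

# Augmented IPCW: add an augmentation term based on an outcome-regression
# estimate Qbar(1, W); here the stratified sample means, as an illustration.
Qbar1 = np.where(W == 1,
                 Y[(A == 1) & (W == 1)].mean(),
                 Y[(A == 1) & (W == 0)].mean())
aipw = np.mean(A / g1 * (Y - Qbar1) + Qbar1)
```

With g0 known, both weighted estimators are consistent, while the augmented version additionally exploits the outcome regression; the naive treated-arm mean remains biased.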

Targeted maximum likelihood estimators (T-MLE) are two stage estimators, the first stage applies regularized maximum likelihood based estimation, where we advocate the use of loss-based super-learning to maximize adaptivity to the true distribution/G-computation formula of the data (van der Laan et al. (2007)), and the second stage targets the obtained fit from the first stage towards the target parameter of interest through a targeted maximum likelihood step. This targeted maximum likelihood step removes bias for the target parameter if the censoring/treatment mechanism used in the targeted MLE step is estimated consistently. In this targeted maximum likelihood step the initial (first stage) estimator is treated as an off-set, and it involves the application of a fluctuation function to the offset, where the set of possible fluctuations represents a parametric model consisting of fluctuated versions of the offset. This parametric model is a so called least favorable parametric model in the sense that its maximum likelihood estimator listens as much to the data w.r.t. fitting the target parameter as a semiparametric model efficient estimator. Formally, it is the parametric submodel through the first stage estimator with the worst Cramer-Rao lower bound for estimation of the target parameter (at zero fluctuation), among all parametric submodels. (This worst case Cramer-Rao lower bound as achieved by this least favorable model is actually the semiparametric information bound defined as the variance of the efficient influence curve.) Given this least-favorable submodel, maximum likelihood estimation is used to fit the finite dimensional fluctuation parameter.
Due to this parametric targeted maximum likelihood step the targeted maximum likelihood estimator is also double robust: the estimator is consistent if the initial first-stage estimator of the G-computation factor of the likelihood is consistent, or if the conditional distributions of the intervention nodes (i.e., censoring/treatment mechanism) are estimated consistently (as required to identify the fluctuation function used in targeted maximum likelihood step). In addition, under regularity conditions, the targeted MLE is (semiparametric) efficient if the initial estimator is consistent, and consistent and asymptotically linear if either the initial estimator or the treatment/censoring mechanism estimator is consistent.
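A minimal sketch of this two stage procedure, for the single time point treatment-specific mean E[Y_1]: the simulated data, the deliberately poor (constant) initial estimator standing in for a super-learner fit, and the known treatment mechanism are all illustrative assumptions; the fluctuation uses the familiar logistic submodel with clever covariate H(A, W) = I(A = 1)/g0(1 | W):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Simulated observational data; g0(1|W) is treated as known/correctly estimated.
W = rng.binomial(1, 0.5, n)
g1 = 0.2 + 0.5 * W                                 # g0(1 | W)
A = rng.binomial(1, g1)
Y = rng.binomial(1, 0.1 + 0.2 * A * (1 + 2 * W))   # true E[Y_1] = 0.5

def expit(x): return 1 / (1 + np.exp(-x))
def logit(p): return np.log(p / (1 - p))

# Stage 1: deliberately misspecified (constant) initial estimator of E[Y | A, W].
Qbar = np.full(n, Y.mean())    # Qbar_n(A_i, W_i)
Qbar1 = np.full(n, Y.mean())   # Qbar_n(1, W_i)

# Stage 2: targeted MLE step. Fluctuate the offset logit(Qbar) along the clever
# covariate H and fit the one-dimensional eps by MLE (Newton steps on the
# binomial log-likelihood).
H = A / g1        # H(A_i, W_i)
H1 = 1 / g1       # H(1, W_i)
eps = 0.0
for _ in range(20):
    q = expit(logit(Qbar) + eps * H)
    score = np.sum(H * (Y - q))
    info = np.sum(H ** 2 * q * (1 - q))
    eps += score / info

# Substitution estimator: plug the targeted fit into the parameter mapping.
psi = expit(logit(Qbar1) + eps * H1).mean()
```

Despite the misspecified initial estimator, the correctly specified fluctuation (via the known g0) removes the bias, illustrating the double robustness described above.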

Even though the augmented IPCW-estimator is also tailored to be double robust and locally efficient, targeted maximum likelihood estimation has the following important advantages relative to estimating equation methods such as the augmented-IPCW estimator: 1) the T-MLE is a substitution estimator and thereby, contrary to the augmented IPCW-estimator, respects global constraints of the model such as that one might be estimating a probability in [0, 1], 2) since, given an initial estimator, the targeted MLE step involves maximizing the likelihood along a smooth parametric submodel, contrary to the augmented IPCW-estimator, it does not suffer from multiple solutions of a (possibly non-smooth in the parameter) estimating equation, 3) contrary to the augmented IPCW-estimator, the T-MLE does not require that the efficient influence curve can be represented as an estimating function in the target parameter, and thereby applies to all path-wise differentiable parameters, 4) it can use the cross-validated log-likelihood (of the targeted maximum likelihood estimator), or any other cross-validated risk of an appropriate loss function for the relevant factor Q0 of the density (i.e., the G-computation formula) of the data, as the principal criterion to select among different targeted maximum likelihood estimators indexed by different initial estimators or different choices of fluctuation models. The latter allows fine tuning of the initial estimator of Q0 as well as the fine tuning of the estimation of the unknowns (e.g., censoring/treatment mechanism g0) of the fluctuation function applied in the targeted maximum likelihood step, thereby utilizing the excellent theoretical and practical properties of the loss-function specific cross-validation selector. On the other hand, the augmented-IPCW estimator cannot be evaluated based on a loss function for Q0 alone, but also requires a choice of loss function for g0.
The latter point 4) also allows the targeted MLE to be generalized to loss-based estimation of infinite dimensional parameters that can be approximated by pathwise differentiable parameters.

These important theoretical advantages have a substantial practical impact, by allowing one to construct estimators in a wider variety of applications, and with better finite sample and asymptotic mean squared error w.r.t. the target. This inspired us to implement targeted maximum likelihood estimation of causal effects of single time point treatment in a variety of data analyses, allowing for right-censoring of the time-till-event clinical outcome, and missingness of the clinical outcome. Even though we discussed the overall targeted maximum likelihood estimator for causal effect estimation of multiple time point interventions in technical reports (see van der Laan (2008)), in this article we aim to dive deeper into this challenge. In particular, our goal is to present templates that can be implemented with standard statistical software, and aim to understand the choices to be made. In future papers we will be implementing these methods on real and simulated data sets and use this paper as guidance.

The organization of this paper is as follows. Firstly, in Section 2 we start out with presenting the targeted MLE for sequentially randomized controlled trials. A specific targeted loss function is proposed to select among different targeted MLE indexed by different initial estimators, which results in maximally asymptotically efficient targeted MLE’s (Rubin and van der Laan (2008)). Due to the double robustness of the targeted MLE this estimator is guaranteed to estimate the causal effect of interest consistently, so that confidence intervals and type-I error control are asymptotically valid. In addition, the T-MLE utilizes all the data (including time-dependent biomarkers) and thereby has great potential for large efficiency gains and bias reductions in these sequentially randomized controlled trials.

In Section 3 we develop and present a general targeted MLE for any time-series data structure, applicable to sequentially randomized controlled trials with censoring and missingness, as well as longitudinal observational studies. The integration of loss-based (super) learning to build and select among targeted MLE’s is made explicit again, and targeted loss functions are proposed for that purpose. In addition, a variety of interesting observations are made about the targeted MLE, relevant to the practical implementation of this estimator in complex longitudinal observational studies and randomized controlled trials. We end with a discussion in Section 4. Our companion Part II article will present demonstrations of the targeted MLE.

An overview of relevant literature

The construction of efficient estimators of path-wise differentiable parameters in semi-parametric models requires utilizing the so called efficient influence curve defined as the canonical gradient of the path-wise derivative of the parameter. This is no surprise since a fundamental result of the efficiency theory is that a regular estimator is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve. We refer to Bickel et al. (1997), and Andersen et al. (1993). There are two distinct approaches for construction of efficient (or locally efficient) estimators: the estimating equation approach that uses the efficient influence curve as an estimating equation (e.g., one-step estimators based on the Newton-Raphson algorithm in Bickel et al. (1997)), and the targeted MLE that uses the efficient influence curve to define a targeted fluctuation function of an initial estimator, and maximizes the likelihood in that targeted direction.

The construction of locally efficient estimators in censored data models in which the censoring mechanism satisfies the so called coarsening at random assumption (Heitjan and Rubin (1991), Jacobsen and Keiding (1995), Gill et al. (1997)) has been a particular focus area. This includes also the theory for locally efficient estimation of causal effects, since the causal inference data structure can be viewed as a missing data structure on the intervention-specific counterfactuals, and the sequential randomization assumption (SRA) implies the coarsening at random assumption on the missingness mechanism, while SRA still does not imply any restriction on the data generating distribution. A particular construction of counterfactuals from the observed data structure, so that the observed data structure augmented with the counterfactuals satisfies the consistency (missing data structure) and sequential randomization assumption, is provided in Yu and van der Laan (2002), providing an alternative to the implicit construction presented earlier in Gill and Robins (2001), thereby showing that, without loss of generality, one can view causal inference as a missing data structure estimation problem: the importance of the causal graph is that it makes explicit the definition of the counterfactuals of interest (i.e., full data in the censored data model).

The theory for inverse probability of censoring weighted estimation and the augmented locally efficient IPCW estimator based on estimating functions defined in terms of the orthogonal complement of the nuisance tangent space in CAR-censored data models (including the optimal estimating function implied by efficient influence curve) was originally developed in Robins (1993), Robins and Rotnitzky (1992). Many papers have been building on this framework (see van der Laan and Robins (2003) for a unified treatment of this estimating equation methodology and references). In particular, double robust locally efficient augmented IPCW-estimators have been developed (Robins and Rotnitzky (2001b), Robins and Rotnitzky (2001a), Robins et al. (2000b), Robins (2000a), van der Laan and Robins (2003), Neugebauer and van der Laan (2005), Yu and van der Laan (2003)).

Causal inference for multiple time-point interventions under sequential randomization started out with papers by Robins in the eighties: e.g. Robins (1986), Robins (1989). The popular propensity score methods to assess causal effects of single time point interventions (e.g., Rosenbaum and Rubin (1983), Sekhon (2008), Rubin (2006)) are not double robust (i.e., rely on correct specification of propensity score), have no natural generalization to multiple time-point interventions, and are also inefficient estimators for single time point interventions (Abadie and Imbens (2006)), relative to the locally efficient double robust estimators such as the augmented IPCW estimator, and the targeted MLE.

Structural nested models and marginal structural models for static treatments were proposed by Robins as well: Robins (1997b), Robins (1997a), Robins (2000b). Many application papers on marginal structural models exist, involving the application of estimating equation methodology (IPCW and DR-IPCW): e.g., Hernan et al. (2000), Robins et al. (2000a), Bryan et al. (2003), Yu and van der Laan (2003). In van der Laan et al. (2005) history adjusted marginal structural models were proposed as a natural extension of marginal structural models, and it was shown that the latter also imply an individualized treatment rule of interest (a so called history adjusted statically optimal treatment regimen): see Petersen et al. (2005) for an application to the when to switch question in HIV research.

Murphy et al. (2001) present a nonparametric estimator for a mean under a dynamic treatment in an observational study. Structural nested models for modeling and estimating an optimal dynamic treatment were proposed by Murphy (2003), Robins (2003), Robins (2005a), Robins (2005b). Marginal structural models for a user supplied set of dynamic treatment regimens were developed and proposed in van der Laan (2006), van der Laan and Petersen (2007) and, simultaneously and independently, in Robins et al. (2008). van der Laan and Petersen (2007) also includes a data analysis application of these models to assess the mean outcome under a rule that switches treatment when CD4 count drops below a cut-off, and the optimal cut-off is estimated as well. Another practical illustration in sequentially randomized trials of these marginal structural models for realistic individualized treatment rules is presented in Bembom and van der Laan (2007).

Unified loss-based learning based on cross-validation was developed in van der Laan and Dudoit (2003), including construction of adaptive minimax estimators for infinite dimensional parameters of the full data distribution in CAR-censored data and causal inference models: see also van der Laan et al. (2006), van der Vaart et al. (2006), van der Laan et al. (2004), Dudoit and van der Laan (2005), Keleş et al. (2002), Sinisi and van der Laan (2004). This research establishes, in particular, finite sample oracle inequalities, which state that the expectation of the loss-function specific dissimilarity between the cross-validated selected estimator among the library of candidate estimators (trained on training samples) and the truth is less than or equal to the expectation of the loss-function specific dissimilarity between the best possible selected estimator and the truth plus a term that is bounded by a constant times the logarithm of the number of candidate estimators in the library divided by the sample size. The only assumption this oracle inequality relies upon is that the loss function is uniformly bounded. These oracle results for the cross-validation selector inspired a unified super-learning methodology. This methodology first constructs a set of candidate estimators, proposes a family of weighted combinations of these candidate estimators indexed by a weight vector, and uses cross-validation to determine a weighted combination with optimal cross-validated risk. Under the assumption that the loss function is uniformly bounded, and the number of estimators is polynomial in sample size, the resulting estimator (super learner) is either asymptotically equivalent with the oracle selected estimator among the library of weighted combinations of the estimators, or it achieves the optimal parametric rate of convergence (i.e., one of the estimators corresponds with a correctly specified parametric model) up to a (worst case) log-n factor.
We refer to van der Laan et al. (2007), Polley and van der Laan (2009).
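The cross-validation selector over a family of weighted combinations can be sketched as follows; the toy library of two candidate regression estimators, the grid over convex weights, and the simulation are all illustrative assumptions, not the full super-learning methodology:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)   # truth is linear in X

# Library of candidate estimators: each maps training data to a prediction function.
def fit_mean(Xtr, Ytr):
    m = Ytr.mean()
    return lambda x: np.full_like(x, m)

def fit_linear(Xtr, Ytr):
    slope, intercept = np.polyfit(Xtr, Ytr, 1)
    return lambda x: intercept + slope * x

candidates = [fit_mean, fit_linear]

# V-fold cross-validated predictions for each candidate.
V = 5
folds = np.array_split(rng.permutation(n), V)
cv_pred = np.zeros((len(candidates), n))
for test in folds:
    train = np.setdiff1d(np.arange(n), test)
    for k, fit in enumerate(candidates):
        cv_pred[k, test] = fit(X[train], Y[train])(X[test])

# Select the convex weight minimizing the cross-validated squared-error risk
# over a grid; combination = (1 - alpha) * mean-fit + alpha * linear-fit.
alphas = np.linspace(0, 1, 21)
risks = [np.mean((Y - ((1 - a) * cv_pred[0] + a * cv_pred[1])) ** 2) for a in alphas]
alpha = alphas[int(np.argmin(risks))]
```

With a strongly linear truth, the cross-validation selector puts essentially all weight on the linear candidate, as the oracle inequality predicts.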

The super-learning methodology applied to a loss function for the G-computation formula factor, Q0, of the observed data distribution, provides substitution estimators of ψ0. However, although these super learners of Q0 are optimal w.r.t. the dissimilarity with Q0 implied by the loss function, the corresponding substitution estimators will be overly biased for a smooth parameter mapping Ψ. This is due to the fact that cross-validation makes optimal choices w.r.t. the (global) loss-function specific dissimilarity, but the variance of Ψ(Qn) is of smaller order than the variance of Qn itself.

van der Laan and Rubin (2006) integrates the loss-based learning of Q0 into the locally efficient estimation of pathwise differentiable parameters, by enforcing the restriction in the loss-based learning that each candidate estimator of Q0 needs to be a targeted maximum likelihood estimator (thereby, in particular, enforcing each candidate estimator of Q0 to solve the efficient influence curve estimating equation). Another way to think about this is that each loss function L(Q) for Q0 has a corresponding targeted loss function L(Q*), Q* representing the targeted maximum likelihood estimator applied to initial Q, and we apply the loss-based learning to the latter targeted version of the loss function L(Q). Rubin and van der Laan (2008) propose the square of the efficient influence curve as a valid and sensible loss function L(Q) for selection and estimation of Q0 in models in which g0 can be estimated consistently, such as in randomized controlled trials.
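The squared efficient influence curve loss can be sketched for the single time point mean E[Y_1] in an RCT-style setting with g0(1 | W) = 0.5 known; the data-generating mechanism and the two candidate fits below are illustrative assumptions (and in the targeted loss-based learning described above the candidates would themselves be targeted MLEs):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# RCT-style data: g0(1|W) = 0.5 is known, so the squared efficient influence
# curve is a valid loss for Qbar0 (Rubin and van der Laan (2008)).
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.5, n)
Y = rng.binomial(1, 0.3 + A * (0.2 + 0.2 * W))   # E[Y | A=1, W] = 0.5 + 0.2 W

# Efficient influence curve for psi = E[Y_1], evaluated at a candidate Qbar1(W),
# with the candidate's own plug-in value of psi.
def eic(Qbar1_vals):
    psi = Qbar1_vals.mean()
    return A / 0.5 * (Y - Qbar1_vals) + Qbar1_vals - psi

risk_correct = np.mean(eic(0.5 + 0.2 * W) ** 2)   # correctly specified candidate
risk_const = np.mean(eic(np.full(n, 0.5)) ** 2)   # misspecified constant candidate
```

The empirical squared-EIC risk is smaller for the correctly specified candidate, consistent with this loss being minimized at Q0 when g0 is known.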

The implications of this targeted loss-based learning are that Q0 is estimated optimally (maximally adaptive to the true Q0) w.r.t. the targeted loss function L(Q*) using the super-learning methodology, and due to the targeted MLE step the resulting substitution estimator of ψ0 is now asymptotically linear as well if the targeted fluctuation function is estimated at a good enough rate (and only requiring adjustment by confounders not yet accounted for by initial estimator: see collaborative targeted MLE): either way, asymptotic bias reduction for the target parameter will occur as long as the censoring/treatment mechanism is estimated consistently. In addition, since the targeted maximum likelihood step involves additional maximum likelihood fitting, generally speaking, no increase in bias will occur, even if the fluctuation function is heavily misspecified.

Targeted MLE have been applied in a variety of estimation problems: Bembom et al. (2008), Bembom et al. (2009) (physical activity), Tuglus and van der Laan (2008) (biomarker analysis), Rosenblum et al. (2009) (AIDS), van der Laan (2008) (case control studies), Rose and van der Laan (2008) (case control studies), Rose and van der Laan (2009) (matched case control studies), Moore and van der Laan (2009) (causal effect on time till event, allowing for right-censoring), van der Laan (2008) (adaptive designs, and multiple time point interventions), Moore and van der Laan (2007) (randomized trials with binary outcome). We refer to van der Laan et al. (September, 2009) for collective readings on targeted maximum likelihood estimation.

In van der Laan and Gruber (2009) we use the loss-based cross-validation to not only select among different initial estimators for the targeted maximum likelihood estimators, but it is also used to select the fit of the fluctuation function applied to the initial estimator (and thus the fit of the censoring and treatment mechanism). This results in a so called collaborative double robust targeted maximum likelihood estimator, which utilizes the collaborative double robustness of the efficient influence curve, which is a stronger robustness result than the regular double robustness of the efficient influence curve the double robust estimators rely upon. These collaborative double robust estimators select confounders for the censoring and treatment mechanism in response to the outcome and the initial estimator of Q0, thereby allowing for more effective bias reductions by the resulting fluctuation functions (as predicted by theory and observed in practice). Simulations and data analysis illustrating the excellent performance of the collaborative double robust T-MLE are presented in van der Laan and Gruber (2009).

2. The T-MLE in multi-stage sequentially randomized controlled trials

Consider a sequentially randomized trial in which one randomly samples a patient from a population, one collects at baseline covariates L(0), and one randomizes the patient to a first line treatment A(0). Subsequently, one collects an intermediate biomarker L(1), and based on this intermediate clinical response one randomizes the patient to a second line treatment A(1). Finally, one collects the clinical outcome Y of interest at a fixed point in time. This experiment is carried out for n patients.

We first discuss two such sequentially randomized cancer trials.

Anderson Cancer Center Prostate Cancer Two Stage Trial: Thall et al. (2000) present an analysis of the first clinical trial in oncology that makes use of sequential randomization. During this trial, prostate cancer patients who were found to be responding poorly to their initially randomly assigned regimen (among four treatments) were re-randomized to the remaining three candidate regimens. The clinical outcomes of interest were response to treatment at a particular point in time and time till death. In contrast to conventional trials based on a single randomization, this design allows the investigator to study adaptive treatment strategies that adjust a patient's treatment in response to the observed course of the illness. Such adaptive strategies, also referred to as dynamic or individualized treatment rules, form the basis of common medical practice in cancer chemotherapy, with physicians typically facing the following questions: Which regimen should be used to initially treat a patient? Which regimen should the patient be switched to if the front-line regimen fails? Given an observed intermediate outcome such as a change in tumor size or PSA level, what threshold should be used to decide that the current regimen is failing? In recent years, sequentially randomized trials have been recognized as being uniquely suited to the study of these exciting questions (Thall et al. (2000), Lavori and Dawson (2000), Lavori and Dawson (2004), Murphy (2005)), with researchers in other clinical areas also beginning to implement this design (Rush et al. (2003), Schneider et al. (2006), Swartz et al. (2007)). The original results of Thall et al. (2000) focus on fitting logistic regression models for the different stage-specific factors of the likelihood. We can apply the T-MLE to estimate the mean outcomes under the 12 dynamic treatment rules indexed by first line therapy and the second line switching therapy, and also incorporate the handling of the right-censoring.

E4494 Eastern Oncology Trial: Another example is the cancer trial E4494, a lymphoma study of rituximab therapy that had both induction and maintenance rituximab randomizations, where the second randomization of maintenance versus observation was based on intermediate response to the initial treatment. The clinical outcome of interest was time till death.

Let’s denote the observed data structure on a randomly sampled patient from the target population with O = (L(0), A(0), L(1), A(1), Y = L(2)). For simplicity and the sake of presentation, we will assume that A(0), L(1) and A(1) are binary.

The likelihood can be factorized as

$$ p(O) = \prod_{j=0}^{2} P(L(j) \mid \bar{L}(j-1), \bar{A}(j-1)) \prod_{j=0}^{1} P(A(j) \mid \bar{A}(j-1), \bar{L}(j)), $$

where the first factors will be denoted with QL(j), j = 0, 1, 2, and the latter factors denote the treatment mechanism and are denoted with gA(j), j = 0, 1. We make the convention that for j = 0, Ā(j − 1) and (j − 1) are empty. In a sequentially randomized controlled trial, the treatment assignment mechanisms gA(j), j = 0, 1, are known.

Suppose our parameter of interest is the treatment specific mean EYd for a certain treatment rule d that assigns treatment d0(L(0)) at time 0 and treatment d1(L̄(1), A(0)) at time 1. For example, d0(L(0)) = 1 is a static treatment assignment, and d1(L̄(1), A(0)) = I(L(1) = 1)1 + I(L(1) = 0)0 assigns treatment 1 if the patient responds well to the first line treatment 1, and treatment 0 if the patient does not respond well to the first line treatment 1. We note that any treatment rule can be viewed as a function of L̄ = (L(0), L(1)) only, and therefore we will use the shorter notation d(L̄) = (d0(L(0)), d1(L̄)) for the two rules at times 0 and 1.

Note that EYd = Ψ(Q) for a well defined mapping Ψ. Specifically, we have Ψ(Q) = EPdY, where the intervened distribution Pd of (L(0), L(1), L(2)) is defined by the G-computation formula:

$$ P^{d}(\bar{L}) = \prod_{j=0}^{2} Q_{L(j),d}(\bar{L}(j)), $$

where, for notational convenience, especially in view of our representation for general data structures, we used the notation QL(j),d(L̄(j)) = QL(j)(L(j) | L̄(j − 1), Ā(j − 1) = d(L̄(j − 1))).

Instead of computing an analytic mean under Pd, this mean can also be approximated by simulating a large number of observations from this distribution and taking the mean of its last component L(2). Note that Pd corresponds with simulating sequentially from the conditional distributions QL(0),d, QL(1),d, QL(2),d, for L(0), L(1), L(2), respectively. Alternatively, this mean is calculated analytically as follows:
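This sequential simulation can be sketched in a few lines. All ingredients below — the rule (d0, d1), the conditional probabilities p_L0, p_L1, p_Y, and the simulation size — are hypothetical illustrations, not quantities from either trial:

```python
import numpy as np

# Sketch of the Monte Carlo evaluation of E Y_d: draw sequentially from the
# intervened factors Q_{L(0)}, Q_{L(1),d}, Q_{L(2),d} and average the final
# outcome. All conditional probabilities and the rule d are illustrative.

rng = np.random.default_rng(42)

def d0(l0): return 1                  # first-line rule: always treat
def d1(l1): return 1 if l1 == 1 else 0  # keep treatment 1 iff responding

def p_L0(): return 0.5                          # P(L(0) = 1)
def p_L1(l0, a0): return 0.3 + 0.2 * l0 + 0.1 * a0
def p_Y(l0, l1, a0, a1): return 0.2 + 0.15 * l1 + 0.2 * a0 + 0.1 * a1

def simulate_EYd(n_sim=200_000):
    l0 = rng.binomial(1, p_L0(), n_sim)
    a0 = np.vectorize(d0)(l0)                   # set A(0) by the rule d0
    l1 = rng.binomial(1, p_L1(l0, a0))          # draw from Q_{L(1),d}
    a1 = np.vectorize(d1)(l1)                   # set A(1) by the rule d1
    y = rng.binomial(1, p_Y(l0, l1, a0, a1))    # draw from Q_{L(2),d}
    return y.mean()

print(simulate_EYd())
```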

$$ \Psi(Q) = \sum_{l(0), l(1), y} y \, P^{d}(l(0), l(1), y) = \sum_{y} y \sum_{l(0), l(1)} P^{d}(l(0), l(1), y) = \sum_{y} y \sum_{l(0), l(1)} Q_{L(0)}(l(0)) Q_{L(1),d}(l(0), l(1)) Q_{Y,d}(l(0), l(1), y). $$

If QL(0) is replaced by the empirical distribution, then this reduces to

$$ \Psi(Q) = \frac{1}{n} \sum_{i=1}^{n} \sum_{y} y \sum_{l(1)} Q_{L(1),d}(L_i(0), l(1)) Q_{Y,d}(L_i(0), l(1), y). $$

From this analytic expression it also follows that, even if Y is continuous, Ψ(Q) only depends on the conditional distribution of Y through its mean. In this case of a 2-stage sequentially randomized controlled trial, the analytic evaluation of Ψ(Q) seems preferable since it will be very fast to compute.
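The analytic evaluation for the binary two-stage case is a double sum over (l(1), y), averaged over the observed Li(0). The sketch below assumes hypothetical fitted conditional probabilities q_L1 and q_Y and a hypothetical rule (d0, d1); it only illustrates the plug-in computation of Ψ(Q):

```python
import itertools

# Sketch of the analytic plug-in evaluation of Psi(Q) for the two-stage binary
# example. The conditional-probability functions q_L1, q_Y and the rule (d0, d1)
# are illustrative placeholders, not estimates from data.

def q_L1(l1, l0, a0):
    """Q_{L(1)}: P(L(1)=l1 | L(0)=l0, A(0)=a0) -- illustrative numbers."""
    p1 = 0.3 + 0.2 * l0 + 0.1 * a0
    return p1 if l1 == 1 else 1 - p1

def q_Y(y, l0, l1, a0, a1):
    """Q_Y: P(Y=y | L(0)=l0, A(0)=a0, L(1)=l1, A(1)=a1)."""
    p1 = 0.2 + 0.15 * l1 + 0.2 * a0 + 0.1 * a1
    return p1 if y == 1 else 1 - p1

def d0(l0):
    return 1                       # first-line rule: always treat

def d1(l1):
    return 1 if l1 == 1 else 0     # keep treatment 1 iff responding

def psi_plug_in(L0_sample):
    """E Y_d = (1/n) sum_i sum_{l1, y} y * Q_{L(1),d} * Q_{Y,d} at L_i(0)."""
    total = 0.0
    for l0 in L0_sample:
        a0 = d0(l0)
        for l1, y in itertools.product([0, 1], repeat=2):
            total += y * q_L1(l1, l0, a0) * q_Y(y, l0, l1, a0, d1(l1))
    return total / len(L0_sample)

print(psi_plug_in([0, 1, 1, 0, 1]))
```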

With this precise definition of the parameter as a mapping from the conditional distributions QL(j), j = 0, 1, 2, to the real line, given an estimator Qn, we obtain a substitution estimator Ψ(Qn) of ψ.

The targeted maximum likelihood estimator involves first defining an initial estimator of Q, and then a subsequent targeted maximum likelihood step according to a fluctuation function applied to this initial estimator, where this step is tailored to remove bias from the initial estimator for the purpose of estimating the parameter of interest ψ. The fluctuation function equals the least favorable parametric model through Q, defined as the parametric submodel through Q that makes estimation of Ψ(Q) hardest, in the sense that the parametric Cramer-Rao lower bound for the variance of an unbiased estimator is maximal among all parametric submodels. Intuitively, this is the parametric submodel for which the maximum likelihood estimator listens as much to the data w.r.t. fitting the target parameter as an efficient estimator in the large semiparametric model, and can thereby be expected to provide important bias reduction. This definition of least favorable model implies that a least favorable parametric model is a model that has a score at zero fluctuation equal to the efficient influence curve/canonical gradient of the pathwise derivative of the target parameter Ψ.

We use the following Theorem, which provides the representation of the efficient influence curve needed to define the fluctuation function. This Theorem also provides the formula for the efficient influence curve for other parameters and for RCTs with more than two stages.

Theorem 1 The efficient influence curve for ψ = EYd at the true distribution P0 of O can be represented as

$$ D^{*} = \Pi(D_{IPCW} \mid T_Q), $$

where

$$ D_{IPCW}(O) = \frac{I(\bar{A} = d(\bar{L}))}{g(d(\bar{L}) \mid X)} Y - \psi, $$

TQ is the tangent space of Q in the nonparametric model, and Π denotes the projection operator onto TQ in the Hilbert space L²₀(P₀) of square P₀-integrable functions of O, endowed with inner product 〈h1, h2〉 = EP₀h1(O)h2(O).

We have that this subspace

$$ T_Q = \bigoplus_{j=0}^{2} T_{Q_{L(j)}} $$

is the orthogonal sum of the tangent spaces TQL(j) of the QL(j)-factors, which consists of functions of L(j), Pa(L(j)) with conditional mean zero, given the parents Pa(L(j)) of L(j), j = 0, 1, 2. Recall that we also denote L(2) with Y. Let

$$ D_j^{*} = \Pi(D^{*} \mid T_{Q_{L(j)}}), \quad j = 0, 1, 2. $$

We have

$$ D_0^{*}(O) = E(Y_d \mid L(0)) - \psi $$
$$ D_1^{*}(O) = \frac{I(A(0) = d_0(L(0)))}{g(d_0(L(0)) \mid X)} \times \{ C_{L(1)}(Q_0)(1) - C_{L(1)}(Q_0)(0) \} \{ L(1) - E(L(1) \mid L(0), A(0)) \} $$
$$ D_2^{*}(O) = \frac{I(\bar{A} = d(\bar{L}))}{g(\bar{A} \mid X)} \{ L(2) - E(L(2) \mid \bar{L}(1), \bar{A}(1)) \}, $$

where, for δ ∈ {0, 1},

$$ C_{L(1)}(Q_0)(\delta) = E(Y_d \mid L(0), A(0), L(1) = \delta). $$

We note that

$$ E(Y_d \mid L(0), A(0) = d_0(L(0)), L(1)) = E(Y \mid \bar{L}(1), \bar{A} = d(\bar{L})). $$

For the general data structure O = (L(0), A(0), . . . , L(K), A(K), Y = L(K + 1)), and DIPCW = D1(O)/g(Ā | X) for some D1, we have

$$ \Pi(D_{IPCW} \mid T_{Q_{L(j)}}) = \frac{1}{g(\bar{A}(j-1) \mid X)} \times \Big\{ E\Big( \sum_{\bar{a}(j,K)} D_1(\bar{a}(j,K)) \mid \bar{L}(j), \bar{A}(j-1) \Big) - E\Big( \sum_{\bar{a}(j,K)} D_1(\bar{a}(j,K)) \mid \bar{L}(j-1), \bar{A}(j-1) \Big) \Big\} = \frac{1}{g(\bar{A}(j-1) \mid X)} C_{L(j)}(\bar{L}(j-1), \bar{A}(j-1)) \big( L(j) - E(L(j) \mid \bar{L}(j-1), \bar{A}(j-1)) \big), $$

where

$$ C_{L(j)} = E\Big( \sum_{\bar{a}(j,K)} D_1(\bar{a}(j,K)) \mid L(j) = 1, \bar{L}(j-1), \bar{A}(j-1) \Big) - E\Big( \sum_{\bar{a}(j,K)} D_1(\bar{a}(j,K)) \mid L(j) = 0, \bar{L}(j-1), \bar{A}(j-1) \Big). $$

Here we use the short-hand notation ā(j, K) ≡ (a(j), . . . , a(K)) and D1(ā(j, K)) = D1(Ā(j − 1), ā(j, K), Lā(j,K)(K + 1)).

This Theorem allows us now to specify the targeted maximum likelihood estimator.

The targeted maximum likelihood estimator: Consider now an initial estimator QL(j)n of each QL(j), j = 0, 1, 2. We will estimate the first marginal probability distribution QL(0) of L(0) with the empirical distribution of Li(0), i = 1, . . . , n. We can estimate the conditional distribution of the binary L(1) and the conditional mean of Y = L(2) with machine learning algorithms (using a logistic link for QL(1), and, if Y is binary, also for QY), such as the super learner: a weighted combination, determined data adaptively based on cross-validation, of a user-supplied set of candidate machine learning algorithms estimating the particular conditional probability. We will now define fluctuations of the initial estimators QL(1)n = QL(1)n(Pn) and QL(2)n = QL(2)n(Pn), which are particular functions of the empirical probability distribution Pn. We will use the notation Qn = (QL(1)n, QL(2)n). Firstly, let

$$ \operatorname{Logit} Q_{L(1)n}(\varepsilon) = \operatorname{Logit} Q_{L(1)n} + \varepsilon C_{L(1)}(Q_n, g_n) $$

be the fluctuation function of QL(1)n with fluctuation parameter ε, where we added the covariate CL(1)(Q, g) defined as

$$ \frac{I(A(0) = d_0(L(0)))}{g_{A(0)}(d_0(L(0)) \mid X)} \{ C_{L(1)}(Q)(1) - C_{L(1)}(Q)(0) \}, $$

where CL(1)(Q)(δ) = EQ(Yd | L(0), A(0), L(1) = δ). We refer to these covariate choices as clever covariates, since they represent a covariate choice that identifies a least favorable fluctuation model, thereby providing the desired targeted bias reduction. Similarly, if Y = L(2) is binary, then let

$$ \operatorname{Logit} Q_{L(2)n}(\varepsilon) = \operatorname{Logit} Q_{L(2)n} + \varepsilon C_{L(2)}(Q_n, g_n), $$

where the added clever covariate is

$$ C_{L(2)}(Q, g)(\bar{L}(1), \bar{A}(1)) = \frac{I(\bar{A} = d(\bar{L}))}{g(\bar{A} \mid X)}. $$

If Y is continuous, then we use as fluctuation model the normal densities with mean EQn(Y | Pa(Y)) + εCL(2)(Qn, gn) and constant variance σ², so that the MLE of ε is the linear least squares estimator, and the score of ε at ε = 0 is CL(2)(Y − EQ(Y | Pa(Y))), as required. We note that the above fluctuation function indeed satisfies that the score of ε at ε = 0 equals the efficient influence curve D*(Qn, gn) as presented in the Theorem above.

One now estimates ε with the MLE.

$$ \varepsilon_n = \arg\max_{\varepsilon} \sum_{j=1}^{2} \sum_{i=1}^{n} \log Q_{L(j)n}(\varepsilon)(O_i). $$

One could also obtain a separate MLE of ε for each factor j = 1, 2. This process is now iterated till convergence, which defines the targeted MLE (Qn*, gn) starting at the initial estimator (Qn, gn); it does not involve updating of gn.

We note that the εn for each factor separately can be estimated with standard logistic regression or linear regression software, using as offset the logit of the initial estimator and a single clever covariate CL(j)(Q, g), j = 1, 2. If Y is also binary, the single/common εn defined above requires a single logistic regression applied to a repeated measures data set with one line of data for each of the two factors, with a clever covariate column that alternates the clever covariates CL(1) and CL(2), and with the corresponding offsets. So in both cases (separate or common ε), the update step can be carried out with a simple univariate logistic regression maximum likelihood estimator. Computing a common ε in the case that we use linear regression for Y and logistic regression for L(1) requires some programming.
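As a concrete illustration of the ε-update for one factor, the sketch below fits the univariate logistic fluctuation Logit Q(ε) = offset + ε·clever by Newton-Raphson on simulated data; the arrays offset, clever, and y are stand-ins for the logit of an initial fit, a clever covariate, and a binary node:

```python
import numpy as np

# Minimal sketch of the epsilon-update: univariate logistic regression with
# the logit of the initial fit as offset and the clever covariate as the only
# regressor. All data below are simulated for illustration.

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_epsilon(y, offset, clever, n_iter=25):
    """MLE of eps in Logit Q(eps) = offset + eps * clever (Newton steps)."""
    eps = 0.0
    for _ in range(n_iter):
        p = expit(offset + eps * clever)
        score = np.sum(clever * (y - p))           # d loglik / d eps
        info = np.sum(clever**2 * p * (1 - p))     # Fisher information
        eps += score / info
    return eps

rng = np.random.default_rng(0)
n = 500
clever = rng.uniform(0.5, 2.0, n)                  # stands in for C_{L(j)}(Q, g)
offset = rng.normal(0.0, 1.0, n)                   # logit of the initial Q-fit
y = rng.binomial(1, expit(offset + 0.7 * clever))  # truth: eps = 0.7
print(fit_epsilon(y, offset, clever))
```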

We note that the clever covariate changes at each update step since the estimator of Q is updated at each step and the clever covariate is defined by the current Q-fit. Let QL(j)n*, j = 1, 2, and Qn* denote the final update (at convergence of the MLE of ε to zero) of QL(j)n, j = 1, 2, and Qn, respectively. The T-MLE of ψ is now given by Ψ(Qn*).

A one-step T-MLE: Interestingly, if we use a separate εL(j) for j = 1, 2, first carry out the T-MLE update for QL(2)n, and use this updated QL(2)n* in the targeted MLE update for QL(1)n, then we obtain a targeted MLE-algorithm that converges in two simple steps, representing a single step update of Qn. Below, we will generalize this one-step targeted MLE algorithm for updating an initial Qn for general longitudinal data structures.

Statistical inference for T-MLE: Let D*(Q, g) be the efficient influence curve at pQ,g = Q · g, as defined in the above Theorem. Under regularity conditions, the T-MLE is consistent and asymptotically linear with influence curve D*(Q*, g0), where Q* denotes the limit of Qn*, and g0 is the true treatment mechanism. As a consequence, for construction of confidence intervals and testing one can use as working model ψn* ∼ N(ψ0, Σ0/n), where Σ0 = ED*(Q*, g0)²(O) is the variance of the efficient influence curve at (Q*, g0). Here Σ0 can be estimated with the empirical covariance matrix of D*(Qn*, g0)(Oi), i = 1, . . . , n.
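Given evaluations of the efficient influence curve at each observation, the Wald-type interval is immediate. In this sketch the influence-curve values and the point estimate are simulated placeholders, not output of an actual T-MLE fit:

```python
import numpy as np

# Sketch of Wald-type inference for the T-MLE from evaluations of the
# efficient influence curve D*(Q*, g0) at each observation. The IC values
# and the point estimate below are illustrative stand-ins.

rng = np.random.default_rng(1)
n = 400
ic = rng.normal(0.0, 2.0, n)     # D*(Q_n*, g0)(O_i), i = 1..n (mean approx 0)
psi_hat = 0.53                   # the T-MLE Psi(Q_n*) (hypothetical value)

sigma2 = np.var(ic)              # empirical variance of the influence curve
se = np.sqrt(sigma2 / n)         # standard error of psi_hat
ci = (psi_hat - 1.96 * se, psi_hat + 1.96 * se)
print(se, ci)
```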

Targeted Loss-based selection among T-MLE’s indexed by different initial estimators: Sequentially randomized trials allow us to select a targeted loss function for selection among different targeted maximum likelihood estimators indexed by different initial estimators. For the sake of illustration, we assume ψ0 is one-dimensional. Suppose a collection of initial estimators is available for Q0. Let Qkn*=Qk*(Pn) be the corresponding targeted maximum likelihood estimators, k = 1, . . . , K. One of these initial estimators might correspond with a super learner based on the log-likelihood loss function. We can select among these targeted maximum likelihood estimators based on cross-validated risk of the loss function

$$ L(Q) \equiv D^{*}(Q, g_0)^2, $$

which is indeed a valid loss function since it satisfies Q0 = arg minQ E0L(Q)(O) among all Q with Ψ(Q) = ψ0. The latter loss function is now a loss function for the whole Q and is very targeted towards ψ0 since it corresponds exactly with the asymptotic variance of the targeted MLE. Thus, we would select k with the cross-validation selector:

$$ k_n = \hat{k}(P_n) = \arg\min_{k} E_{B_n} P_{n,B_n}^{1} D^{*}(Q_k^{*}(P_{n,B_n}^{0}), g_0)^2, $$

where Bn ∈ {0, 1}ⁿ denotes a random vector of binaries indicating a split into a training sample {i : Bn(i) = 0} and a validation sample {i : Bn(i) = 1}, and P⁰n,Bn, P¹n,Bn are the corresponding empirical distributions of the training and validation samples. Here we used the notation Pf ≡ ∫ f(o)dP(o). The selected targeted maximum likelihood estimator is then Qn* ≡ Qkn*(Pn), and ψ0 is now estimated with the substitution estimator Ψ(Qn*).

Assuming a uniformly bounded loss function (i.e., a uniform bound on the efficient influence curve), due to oracle results of the cross-validation selector, the resulting targeted maximum likelihood estimator Ψ(Qn*) will be at least as efficient as any of the candidate targeted maximum likelihood estimators Ψ(Qkn*), k = 1, . . . , K.
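A minimal sketch of the cross-validation selector: it compares candidates by the validation-sample mean of D*(Qk*, g0)², averaged over splits. For brevity it evaluates fixed candidate influence-curve values rather than refitting Qk* on each training sample, which the actual selector requires:

```python
import numpy as np

# Sketch of the cross-validation selector k_n: choose the candidate whose
# validation-sample mean of D*(Q_k*, g0)^2 is smallest, averaged over splits.
# `ic_values[k]` holds hypothetical influence-curve evaluations for candidate
# k at each of the n observations. Refitting on training samples is omitted.

def cv_selector(ic_values, n_splits=5, seed=0):
    n = ic_values[0].shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_splits       # random V-fold split B_n
    risks = []
    for ic in ic_values:
        # average over splits of the validation-sample mean of D*^2
        risks.append(np.mean([np.mean(ic[folds == v] ** 2)
                              for v in range(n_splits)]))
    return int(np.argmin(risks)), risks

sel, risks = cv_selector([np.full(100, 2.0), np.full(100, 1.0)])
print(sel, risks)
```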

Construction of Targeted Initial estimators: Above we showed that the projection of the efficient influence curve on the tangent space of the conditional distribution of L(1) can be written as CL(1)(L(1) − QL(1)), and for Y = L(2), as CL(2)(L(2) − QL(2)), where we use short-hand notation. For the purpose of constructing an initial estimator of QL(1), we can use loss-based learning based on the weighted squared-error loss function L1(QL(1)) = C²L(1)(L(1) − QL(1))², and, similarly, for the purpose of constructing an initial estimator of QL(2), we can use loss-based learning based on the weighted squared-error loss function L2(QL(2)) = C²L(2)(Y − QL(2))². These are targeted loss functions since they correspond with the components of the variance of the efficient influence curve. Since the clever covariate CL(2) only depends on g0, the required weights C²L(2) for loss-based learning of QL(2) are completely known. Therefore, we first apply the loss-based learning of the true QL(2)0. Let QL(2)n be the resulting estimator. Now, we plug such an estimator into the weight-function CL(1), and we use the resulting weights C²L(1) to apply loss-based learning of QL(1). In this way, using this backwards sequential loss-based learning, we can generate initial candidate estimators of QL(1), QL(2) that are themselves already targeted by being based on these weighted squared-error loss functions (e.g., using different regression algorithms but using the weight option). We can now select among the targeted MLEs indexed by these different targeted initial estimators, by using cross-validation with the above mentioned loss function L(Q) = D*(Q, g0)².
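Because the weights C²L(2) are known in an S-RCT, loss-based learning of QL(2) with a linear working model reduces to weighted least squares. The data, weights, and working model below are illustrative stand-ins:

```python
import numpy as np

# Sketch of loss-based learning with the targeted weighted squared-error loss
# L2(Q_Y) = C_{L(2)}^2 (Y - Q_Y)^2: with the weights known (they depend only
# on g0), minimizing the empirical risk over a linear working model is just
# weighted least squares. Data, weights, and model are illustrative.

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)      # simulated outcome
w = rng.uniform(1.0, 4.0, n)                           # C_{L(2)}^2, known

# weighted least squares: beta = (X' W X)^{-1} X' W y
WX = X * w[:, None]
beta = np.linalg.solve(X.T @ WX, WX.T @ y)
print(beta)
```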

As might already be apparent, and certainly becomes apparent in the next section, this powerful approach combining loss-based learning with targeted MLE for the analysis of the simple two-stage sequentially randomized controlled trial generalizes to all sequentially randomized controlled trials for any target parameter, any number of stages, and higher dimensional intermediate time-dependent covariates.

We remark that the above targeted maximum likelihood estimator can also be applied to the data structure L(0), A(0), L(1), Δ, L(2) = ΔY, where A(0) is a treatment assigned at baseline (e.g., an RCT), L(1) represents the data collected between baseline and the time point at which the outcome Y is measured, and Δ is a missingness indicator for Y. One simply applies the above data structure with A(1) = Δ. Of course, if L(1) is not binary, then the above estimator needs to be generalized as carried out in the next section, and the missingness mechanism might need to be estimated from the data.

3. Targeted MLE of parameters of the G-computation formula

We will now present the general approach to obtain a targeted maximum likelihood estimator, including the selection among different targeted maximum likelihood estimators indexed by different initial estimators. The choice of loss function we will use for the latter will depend on whether one is willing to assume that the treatment/censoring mechanism is correctly estimated (or known, as in an S-RCT), or wishes to rely on double robustness, and we will provide appropriate loss functions for both purposes. This will generalize the above targeted maximum likelihood estimator for a two-stage sequentially randomized controlled trial to arbitrary sequentially randomized controlled trials, including S-RCTs that are subject to right-censoring or missingness and for which one is willing to assume that censoring/missingness is well understood. In addition, it will present the double robust T-MLE for observational studies.

Organization: Firstly, we will present the likelihood using binary coding of the data structure O. Second, we will present a representation of the efficient influence curve based on this binary factorization of the likelihood. Third, we present the fluctuation/least favorable model of the initial estimate and the corresponding targeted maximum likelihood estimator. Fourth, we present a closed form one-step version of this targeted maximum likelihood estimator that applies if one is willing to fit a separate fluctuation parameter for each factor of the G-computation formula factor of the likelihood. Fifth, we present a targeted loss function that can be used to select among different targeted maximum likelihood estimators indexed by different initial estimators. We also present a particular type of targeted maximum likelihood estimator that uses a degenerate initial estimator for the intermediate factors of the G-computation formula, so that the targeted MLE algorithm only requires updating the final outcome conditional distribution. Finally, we make some observations regarding the pursuit of targeted dimension reductions simplifying the G-computation formula, which can form an important ingredient for generating different candidate targeted MLE’s, and control complexity.

3.1. A factorization of likelihood of data in terms of binary variables

Suppose the data structure O = (L(0), A(0), . . . , L(K), A(K), L(K + 1)) for one unit involves collection of treatment and censoring actions coded with A(t) at times t = 0, . . . , K, and time-dependent covariate and outcome data at times t = 0, . . . , K + 1. We note that L(t) can become degenerate after censoring and/or after a terminal event like death, so that this data structure O also allows for longitudinal data structures that are truncated by the minimum of right-censoring and death. By choosing a fine enough discretization in time this data structure also approximates treatment and censoring processes A(t) that evolve in continuous time.

For the sake of presentation, we will assume that A(t) and L(t) are discrete valued for all t so that the likelihood of O can be expressed in terms of probabilities, thereby avoiding technical difficulties regarding choice of dominating measure, without affecting the realm of practical applications.

The time ordering implies a graph with observed nodes L(t), t = 0, . . . , K+ 1, and A(t), t = 0, . . . , K, and a corresponding factorization of the observed data likelihood of O, given by

$$ p_0 = \prod_{t=0}^{K+1} Q_{L(t)} \prod_{t=0}^{K} g_{A(t)}, $$

where QL(t) and gA(t) denote the conditional probability distributions of L(t), given parents Pa(L(t)), and A(t), given parents Pa(A(t)), respectively. The parent sets could be known to be subsets of the parent set implied by the time ordering of data structure, as discussed in introduction.

This factorized likelihood can be subjected to static and dynamic interventions on the A(·) process, mapping the probability distribution of O into probability distributions of Od corresponding with a static or dynamic intervention d; the resulting formula is often referred to as the G-computation formula. These interventions could involve all A-nodes as well as a subset of these nodes. The corresponding probability distributions of Od are obtained by removing the gA(t)'s corresponding with the A(t) nodes on which an intervention is carried out under rule d, and by substituting the corresponding intervened values for A(t) in the conditioning events (i.e., parents) of the QL(l)-factors with l > t.

In many applications A(t) = (A1(t), A2(t)) involves two types of actions A1(t) and A2(t), both relevant for defining the parameter of interest of the probability distribution of O. For example, A1(t) might be the treatment assigned at time t, A2(t) might be an indicator of being right-censored at time t, and the scientific parameter of interest, Ψ(P0), might be defined as a parameter of the distribution of O under the intervention on A defined by no-censoring at any time point, and a certain treatment intervention. In many cases, one defines the scientific parameter of interest in terms of changes of the latter distribution under different treatment regimens, and always no censoring: for example, marginal structural models for static or realistic dynamic treatment regimens provide such parameters, as we demonstrate in Section 5.

We will consider the case that, for each node, the model for the conditional distribution of the node, given its parents, is nonparametric. Let Ψ be the parameter mapping so that ψ0 = Ψ(P0) denotes the parameter of interest.

Without loss of generality, we assume that, for each t ∈ {1, . . . , K + 1}, L(t) can be coded in terms of n(t) binary variables {L(t, j) : j = 1, . . . , n(t)}, so that QL(t) can be further factorized as

$$ Q_{L(t)} = \prod_{j=1}^{n(t)} Q_{L(t,j)}, $$

where we define QL(t,j) as the conditional distribution of L(t, j), given its parents Pa(L(t, j)), defined as the parents of L(t) augmented with the first j − 1 variables L(t, 1), . . . , L(t, j − 1). Note that this factorization depends on a user-supplied ordering of the binary variables. For example, this particular coding and ordering might be implied by what is considered natural. The choice of coding and ordering does not affect the theoretical properties of the resulting targeted MLE, but it does imply the binary predictors QL(t,j) one will need to estimate from the data.
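The binary coding and the implied factorization can be checked numerically: a four-valued node is coded by two binaries, and the joint pmf equals the product of the two binary conditionals. The pmf below is illustrative:

```python
# Sketch of the binary coding of a discrete node: a variable taking values
# 0..3 is coded by two binaries (L(t,1), L(t,2)), and its distribution
# factorizes as Q_{L(t,1)} * Q_{L(t,2)}, with L(t,1) added to the parent set
# of L(t,2). The pmf `probs` is illustrative.

def to_bits(value):
    """Code value in {0,1,2,3} as (L(t,1), L(t,2))."""
    return (value >> 1) & 1, value & 1

def q_joint(value, probs):
    """P(L(t) = value) for a given pmf over {0,1,2,3}."""
    return probs[value]

def q_bit1(b1, probs):
    """Q_{L(t,1)}: P(L(t,1) = b1)."""
    p1 = probs[2] + probs[3]
    return p1 if b1 == 1 else 1 - p1

def q_bit2(b2, b1, probs):
    """Q_{L(t,2)}: P(L(t,2) = b2 | L(t,1) = b1)."""
    if b1 == 1:
        p1 = probs[3] / (probs[2] + probs[3])
    else:
        p1 = probs[1] / (probs[0] + probs[1])
    return p1 if b2 == 1 else 1 - p1

probs = [0.1, 0.2, 0.3, 0.4]
for v in range(4):
    b1, b2 = to_bits(v)
    # joint pmf must equal the product of the binary factors
    assert abs(q_joint(v, probs) - q_bit1(b1, probs) * q_bit2(b2, b1, probs)) < 1e-12
print("factorization checks out")
```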

This now provides the following likelihood factorization for the probability distribution of O:

$$ p_0 = Q_{L(0)} \prod_{t=1}^{K+1} \prod_{j=1}^{n(t)} Q_{L(t,j)} \prod_{t=0}^{K} g_{A(t)}. \quad (2) $$

where QL(0) denotes the marginal distribution of the baseline covariates L(0).

3.2. General representation of efficient influence curve of target parameter

We will now work out a general representation of the efficient influence curve we can apply to implement the targeted maximum likelihood estimator for general longitudinal data structures. These results provide us with a template for implementing the targeted maximum likelihood estimator for nonparametric models and essentially any type of longitudinal data structure that includes time dependent treatments and censoring actions that are realized in response to previously collected data.

Recall that in our model for P0, for each node in the statistical graph, the conditional distribution is unspecified. Let Ψ be the parameter mapping so that ψ0 = Ψ(P0) denotes the parameter of interest. If Ψ(P0) = ΨF (Q0) is only a parameter of the Q0, then we can present the efficient influence curve of Ψ as the projection of any influence curve (i.e., gradient of path-wise derivative) in the model in which g is known onto the tangent space of Q (van der Laan and Robins (2003)):

$$ D^{*} = \Pi(D \mid T_Q) \quad \text{for a certain gradient } D. $$

Such an estimating function D is often called an IPCW-estimating function (van der Laan and Robins (2003)). We will now be concerned with finding a representation of this efficient influence curve in terms of an orthogonal sum of scores of certain fluctuations Q(ε) of Q at ε = 0, thereby implying a corresponding implementation of the targeted MLE.

The factorization (1) of the distribution P0 implies an orthogonal decomposition of the tangent space at P0 in our model, where this tangent space is a subspace of the Hilbert space L²₀(P₀) endowed with inner product 〈h1, h2〉 = EP₀h1(O)h2(O). This orthogonal decomposition of the tangent space T(P0) ⊂ L²₀(P₀) is given by

$$ T(P_0) = T_{L(0)} + \sum_{t=1}^{K+1} \sum_{j=1}^{n(t)} T_{L(t,j)} + T_{CAR}, $$

where TL(0) is the tangent space of QL(0) consisting of the functions of L(0) with mean zero, TL(t,j) is the tangent space of the conditional probability distribution QL(t,j),

$$ T_{L(t,j)} = \{ V(L(t,j) \mid Pa(L(t,j))) - E_{Q_{L(t,j)}} V : V \} = \{ \{ V(1 \mid Pa(L(t,j))) - V(0 \mid Pa(L(t,j))) \} (L(t,j) - Q_{L(t,j)}(1)) : V \}, $$

and TCAR is the tangent space of g. TCAR can also be orthogonally decomposed as ⊕ᵗ₌₀ᴷ TA(t), with TA(t) the tangent space of gA(t). Here we used the notation EQL(t,j)V = E(V | Pa(L(t, j))) for the conditional expectation w.r.t. QL(t,j). If the parent sets are all implied by a specified ordering of all measured variables, then the model for P0 is actually the nonparametric model, so that the tangent space is saturated: T(P0) = L²₀(P₀).

In the case that Ψ(P0) is a parameter of both Q0 and g0, the efficient influence curve D* will also have components in TCAR. An example of such a target parameter is E(Y (1) − Y (0) | A = 1), the effect among the treated, based on observed data (W, A, Y) and the causal graph implied by time ordering W, A, Y. In that case the targeted MLE will also need to fluctuate an initial fit of g0 with a fluctuation having a score that coincides with the efficient influence curve. For that purpose, let’s also code A(t) in terms of binary variables. Let A(t) be coded in terms of binary variables {A(t, j) : j = 1, . . . , m(t)}, and consider the factorization

$$ g_{A(t)} = \prod_{j=1}^{m(t)} g_{A(t,j)}, $$

where an ordering needs to be specified so that the parents of A(t, j) are given by the parents of A(t) augmented with A(t, 1), . . . , A(t, j − 1).

The corresponding orthogonal decomposition of the tangent space of g is given by

$$ T_{CAR} = \sum_{t=0}^{K} \sum_{j=1}^{m(t)} T_{A(t,j)}, $$

where

$$ T_{A(t,j)} = \{ V(A(t,j) \mid Pa(A(t,j))) - E(V \mid Pa(A(t,j))) : V \} = \{ \bar{V}(Pa(A(t,j))) (A(t,j) - g_{A(t,j)}(1 \mid Pa(A(t,j)))) : V \}, $$

where we used the notation V̄(Pa(A(t, j))) = V(1 | Pa(A(t, j))) − V(0 | Pa(A(t, j))), which can thereby represent any function of Pa(A(t, j)).

This factorization p(O) = ∏tj QL(t,j)tj gA(t,j) yields the orthogonal decomposition of the tangent space T(P0) given by

$$ T(P_0) = T_{L(0)} + \sum_{t=1}^{K+1} \sum_{j=1}^{n(t)} T_{L(t,j)} + \sum_{t=0}^{K} \sum_{j=1}^{m(t)} T_{A(t,j)}. $$

We can now state the corresponding Theorem for both a representation of a given efficient influence curve D* as well as a projection of a function D, (e.g.) representing an inefficient influence curve for a parameter Ψ(P) = ΨF (Q) in a model with g known, onto the tangent space TQ of Q.

Theorem 2 Consider the Hilbert space L²₀(P₀) and the factorization (1) of P0. A function D ∈ L²₀(P₀) which is also an element of the tangent space T(P0) can be represented as

$$ D = D_{L(0)} + \sum_{t=1}^{K+1} \sum_{j=1}^{n(t)} D_{L(t,j)} + \sum_{t=0}^{K} \sum_{j=1}^{m(t)} D_{A(t,j)}, $$

where

$$ D_{L(0)} = E(D \mid L(0)) - ED, \qquad D_{L(t,j)} = \Pi(D \mid T_{L(t,j)}) = C_{L(t,j)} \{ L(t,j) - Q_{L(t,j)}(1) \}, $$

where

$$ C_{L(t,j)} = E(D \mid L(t,j) = 1, Pa(L(t,j))) - E(D \mid L(t,j) = 0, Pa(L(t,j))), $$

for t = 1, . . . , K + 1, and, for each t, j = 1, . . . , n(t). In addition,

$$ D_{A(t,j)} = \Pi(D \mid T_{A(t,j)}) = C_{A(t,j)} \{ A(t,j) - g_{A(t,j)}(1) \}, $$

where

$$ C_{A(t,j)} = E(D \mid A(t,j) = 1, Pa(A(t,j))) - E(D \mid A(t,j) = 0, Pa(A(t,j))). $$

In particular, the projection of D onto the tangent space TQ of Q can be represented as

$$ \Pi(D \mid T_Q) = D_{L(0)} + \sum_{t=1}^{K+1} \sum_{j=1}^{n(t)} D_{L(t,j)}. $$

If we represent D as D(O) = D1(A, L(A))/g(Ā | X) for some D1, with X = (L(a) : a) and La(t) = Lā(t−1)(t), and assume that Pa(L(t, j)) = (Ā(t − 1), PaĀ(t−1)(L(t, j))) includes Ā(t − 1), then the above representation of Π(D | TQ) applies with

$$ C_{L(t,j)} = \frac{1}{g(\bar{A}(t-1) \mid Pa(L(t)))} \times \{ C_{L(t,j)}(Q)(1) - C_{L(t,j)}(Q)(0) \}, $$

where, for δ ∈ {0, 1},

$$ C_{L(t,j)}(Q)(\delta) = E_Q\Big( \sum_{\bar{a}(t,K)} D_1 \mid L_{\bar{a}(t-1)}(t,j) = \delta, Pa_{\bar{a}(t-1)}(L(t,j)) \Big) \Big|_{\bar{a}(t-1) = \bar{A}(t-1)}. $$

Above, we used D1 as short-hand notation for D1(Ā(t − 1), ā(t, K), LĀ(t−1),ā(t,K)(K + 1)), and ā(t, K) = (a(t), . . . , a(K)). Here g(ā(t − 1) | Pa(L(t))) denotes the conditional probability of Ā(t − 1) = ā(t − 1), given Paā(t−1)(L(t)), and it also equals the conditional probability of Ā(t − 1) = ā(t − 1), given La(t, j), Paa(L(t, j)).

Proof of Theorem. We only need to prove the latter representation of CL(t,j).

We have D(A, L(A)) = D1(A, L(A))/g(Ā | X), and we consider the case that Pa(L(t, j)) includes Ā(t − 1). For the sake of this proof we exclude the treatment nodes Ā(t − 1) from Pa(L(t, j)). Setting Ā(t − 1) = ā(t − 1) gives us the following conditional expectation to consider:

$$ E\big( D_1(\bar{A})/g(\bar{A} \mid X) \mid L(t,j), Pa(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1) \big). $$

We first condition on X and Ā(t − 1). This corresponds with taking an expectation w.r.t. ∏ₛ₌ₜᴷ g(A(s) | Pa(A(s))). This gives us

$$ E\Big( \sum_{\bar{a}(t,K)} D_1(\bar{a})/g(\bar{a}(t-1) \mid X) \mid L_{\bar{a}}(t,j), Pa_{\bar{a}}(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1) \Big). $$

This conditional expectation for each ā(t, K)-specific term is a sum over La compatible with La(t, j), Paa(L(t, j)). Specifically,

$$ \begin{aligned} & \sum_{L_a} \frac{D_1(\bar{a})}{g(\bar{a}(t-1) \mid X)} P(L_a \mid L_a(t,j), Pa_a(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1)) \\ &= \sum_{L_a} \frac{D_1(\bar{a})}{g(\bar{a}(t-1) \mid X)} \frac{P(L_a, L_a(t,j), Pa_a(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1))}{P(L_a(t,j), Pa_a(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1))} \\ &= \sum_{\{L_a : L_a(t,j), Pa_a(L(t,j))\}} D_1(\bar{a}) \frac{P(L_a)}{P(L_a(t,j), Pa_a(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1))} \\ &= \frac{1}{g(\bar{a}(t-1) \mid L_a(t,j), Pa_a(L(t,j)))} \sum_{\{L_a : L_a(t,j), Pa_a(L(t,j))\}} D_1(\bar{a}) \frac{P(L_a)}{P(L_a(t,j), Pa_a(L(t,j)))} \\ &= \frac{1}{g(\bar{a}(t-1) \mid L_a(t,j), Pa_a(L(t,j)))} E_Q(D_1(\bar{a}) \mid L_a(t,j), Pa_a(L(t,j))). \end{aligned} $$

We will now prove that, by conditional independence assumptions of the statistical graph, g(ā(t − 1) | La(t, j), Paa(L(t, j))) = g(ā(t − 1) | Paa(L(t))). To see this we first note that g(ā(t − 1) | La(t, j), Paa(L(t, j))) equals

$$ \sum_{\bar{L}_a(t-1)} g(\bar{a}(t-1) \mid \bar{L}_a(t-1)) P(\bar{L}_a(t-1) \mid L_a(t,j), Pa_a(L(t,j))). $$

Since Paa(L(t, j)) are the parents of La(t, j), we have P(L̄a(t − 1) | La(t, j), Paa(L(t, j))) = P(L̄a(t − 1) | Paa(L(t, j))). Thus, this proves

$$ g(\bar{a}(t-1) \mid L_a(t,j), Pa_a(L(t,j))) = g(\bar{a}(t-1) \mid Pa_a(L(t,j))). $$

More generally, recall La(t, j), Paa(L(t, j)) = La(t, 1), . . . , La(t, j), Paa(L(t)), and note that Paa(L(t)) is included in L̄a(t − 1) (recall that we excluded Ā(t − 1) from Pa(L(t, j)) in this proof). We have that La(t, 1), . . . , La(t, j) is independent of L̄a(t − 1), given Paa(L(t)). So we obtain

$$ P(\bar{L}_a(t-1) \mid L_a(t,j), Pa_a(L(t,j))) = P(\bar{L}_a(t-1) \mid L_a(t,1), \ldots, L_a(t,j), Pa_a(L(t))) = P(\bar{L}_a(t-1) \mid Pa_a(L(t))). $$

This shows

$$ g(\bar{a}(t-1) \mid L_a(t,j), Pa_a(L(t,j))) = g(\bar{a}(t-1) \mid Pa_a(L(t))). $$

To conclude, we have shown

$$ E\Big( \sum_{\bar{a}(t,K)} \frac{D_1(\bar{a})}{g(\bar{a}(t-1) \mid X)} \mid L_{\bar{a}}(t,j) = 1, Pa_{\bar{a}}(L(t,j)), \bar{A}(t-1) = \bar{a}(t-1) \Big) = \frac{1}{g(\bar{a}(t-1) \mid Pa_a(L(t)))} E\Big( \sum_{\bar{a}(t,K)} D_1(\bar{a}) \mid L_{\bar{a}(t-1)}(t,j) = 1, Pa_{\bar{a}(t-1)}(L(t,j)) \Big), $$

which completes the proof.

Remark about double robustness of efficient influence curve for general statistical graph: The efficient influence curve D* at P depends on the Q-factor as well as on a g representing conditional distributions of A(t) nodes, possibly conditioning on subsets of the actual parents of A(t). It is immediate that P0D*(Q0, g) = 0 at a possibly misspecified g. To understand the possible additional robustness P0D*(Q, g0) for Q with Ψ(Q) = Ψ(Q0) and correctly specified g0, and thereby the so called double robustness of the efficient influence curve (van der Laan and Robins (2003)), we make the following observation. By our latter representation in the above theorem, we have

$$ D^{*}(Q, g) = \sum_{t} \frac{1}{g(\bar{A}(t-1) \mid (Pa_a(L(t)) : a))} \Big\{ E_Q\Big( \sum_{\bar{a}(t,K)} D_1(\bar{a}(t-1), \bar{a}(t,K)) \mid L_a(t) = L(t), Pa_{\bar{a}(t-1)}(L(t)) = Pa(L(t)) \Big) - E_Q\Big( \sum_{\bar{a}(t,K)} D_1(\bar{a}(t-1), \bar{a}(t,K)) \mid Pa_{\bar{a}(t-1)}(L(t)) \Big) \Big\} \Big|_{\bar{a}(t-1) = \bar{A}(t-1)}, $$

where we also have that g(Ā(t − 1) | (Paa(L(t)) : a)) = g(Ā(t − 1) | (La(t), Paa(L(t)) : a)), as we showed in the proof above. If we now take the conditional mean, given (La(t), Paa(L(t)) : a), within the Σt-summation, then this corresponds with integration over g0(Ā(t − 1) | (Paa(L(t)) : a)). Thus at a correctly specified g0, we obtain that P0D*(Q, g0) equals

$$ E_{Q_0} \sum_{t} \sum_{\bar{a}} \{ E_Q(D_1(\bar{a}) \mid L_a(t), Pa_a(L(t))) - E_Q(D_1(\bar{a}) \mid Pa_a(L(t))) \}, $$

thereby giving us an expression that depends only on the Q0-factor of the distribution of the data (thus having nothing to do anymore with the conditional treatment probabilities). Some additional structure is now needed on the statistical graph to have that the latter equals zero at misspecified Q. In particular, if Pa(L(t)) = (Ā(t − 1), L̄(t − 1)) represents the history according to the time-ordered sequence representing the longitudinal data structure O, it follows, through cancelation of terms, that the latter equals EQ0 Σā D1(ā), thereby giving the wished-for result (corresponding with the double robustness results in van der Laan and Robins (2003) for nonparametric full data models).

Using normal error regression to model and fluctuate conditional final outcome distribution. Consider the case that Ψ(Q0) depends on the conditional distribution of a final outcome Y = L(K + 1), given its parents Pa(Y), only through its conditional mean, and that the projection of the efficient influence curve (or any other gradient in the model with g known) onto the tangent space of this conditional distribution QY can be written as CY(Y − EQ(Y | Pa(Y))) for some function CY of its parents Pa(Y). Then it follows that there is no need to factorize the conditional distribution of Y in binary conditional distributions; one could instead model the conditional distribution of Y with a normal error mean regression and fluctuate the mean by adding the clever-covariate extension εCY. This was explicitly illustrated in Section 2 for the targeted MLE of EYd.

3.3. The targeted MLE based on the binary representation of density

In this subsection we will define the targeted MLE based on the representation (1) of the density of O in terms of the binary predictors QL(t,j), and, for the sake of presentation, we assume that our target parameter is only a parameter of Q0. Consider an initial estimator QL(t,j)n of each QL(t,j), t = 1, . . . , K +1, j = 1, . . . , n(t). We will estimate the first marginal probability distribution QL(0) of L(0) with the empirical distribution of Li(0), i = 1, . . . , n. Let Qn denote the combined set of QL(t,j)n across all nodes L(t, j).

The conditional distributions of L(t, j) are binary distributions, which we can estimate with machine learning algorithms (using a logistic link), such as the super learner: a data-adaptively (based on cross-validation) determined weighted combination of a user-supplied library of machine learning algorithms estimating the particular conditional probability. These estimates could be obtained separately for each (t, j), or smoothing across time points t and/or j could be employed if appropriate, by applying such machine learning algorithms to an appropriately constructed repeated measures data set. In particular, candidate estimators could be based on (guessed) subsets of Pa(L(t, j)).
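As an illustration of cross-validation-based selection for one binary conditional probability QL(t,j), the sketch below compares two hypothetical candidate estimators (a marginal fit and a fit stratified on a binary parent) by validation log-likelihood; an actual super learner would also consider weighted combinations of a richer library. All data-generating choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
pa = rng.integers(0, 2, size=n)                    # a binary parent of L(t, j)
l = rng.binomial(1, np.where(pa == 1, 0.8, 0.2))   # the binary node L(t, j)

def fit_marginal(pa_tr, l_tr):                     # candidate 1: ignores the parent
    p = l_tr.mean()
    return lambda pa_new: np.full(len(pa_new), p)

def fit_stratified(pa_tr, l_tr):                   # candidate 2: stratifies on the parent
    p = np.array([l_tr[pa_tr == k].mean() for k in (0, 1)])
    return lambda pa_new: p[pa_new]

def cv_loglik(fitter, folds=5):
    # cross-validated log-likelihood of the fitted conditional probability
    idx = np.arange(n)
    ll = 0.0
    for f in np.array_split(idx, folds):
        tr = np.setdiff1d(idx, f)
        p = np.clip(fitter(pa[tr], l[tr])(pa[f]), 1e-6, 1 - 1e-6)
        ll += np.sum(l[f] * np.log(p) + (1 - l[f]) * np.log(1 - p))
    return ll

best = max([fit_marginal, fit_stratified], key=cv_loglik)
```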

In addition, let gn be an estimator of g0. We will now define the following fluctuations of the initial estimator QL(t,j)n = QL(t,j)(Pn) of QL(t,j):

LogitQL(t,j)n(ε)=LogitQL(t,j)n+εCL(t,j)(Qn,gn),

where we added the clever covariate CL(t,j)(Qn, gn), obtained by substituting the initial estimators Qn and gn for the true Q0 and g0.

One can now estimate ε with the MLE.

εn = arg maxε ∏t=1K+1 ∏j=1n(t) ∏i=1n QL(t,j)n(ε)(Oi).

One could also obtain a separate MLE of ε for each factor indexed by (t, j):

εL(t,j)n = arg maxε ∏i=1n QL(t,j)n(ε)(Oi).

One can now set Qn1 = Qn(εn) to update Qn. This updating process Qnm = Qnm−1(εnm), m = 1, 2, . . ., is iterated till convergence, which defines the targeted MLE Qn* starting at the initial estimator (Qn, gn).

We note that εL(t,j)n for each factor separately can be estimated with standard logistic regression software, using as offset the logit of the initial estimator and a single clever covariate CL(t,j)(Qn, gn). The single εn (uniform across t, j) defined above requires a single univariate logistic regression applied to a repeated measures data set with one line of data for each factor indexed by (t, j), created by stacking the clever covariates (CL(t,j) : t, j) into one covariate column for each unit, and using the corresponding offset covariate logit QL(t,j)n. So in both cases, the update step can be carried out with a simple univariate logistic regression maximum likelihood estimator using the offset option (applied to a possibly repeated measures data set).
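The offset-logistic update can also be carried out directly. The following sketch, on simulated hypothetical data, computes the one-parameter MLE εn by Newton-Raphson for the model logit Q(ε) = offset + εC, which is the same likelihood that the offset facility of standard logistic regression software maximizes:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fluctuate(l, offset, C, steps=25):
    """MLE of eps in logit Q(eps) = offset + eps * C for binary outcomes l."""
    eps = 0.0
    for _ in range(steps):
        p = expit(offset + eps * C)
        score = np.sum(C * (l - p))          # d/d(eps) of the log-likelihood
        info = np.sum(C ** 2 * p * (1 - p))  # observed information
        if info == 0:
            break
        eps += score / info                  # Newton-Raphson step
    return eps

rng = np.random.default_rng(2)
n = 1000
C = rng.normal(size=n)                       # stand-in clever covariate C_{L(t,j)}(Qn, gn)
offset = rng.normal(size=n)                  # logit of the initial fit Q_{L(t,j)n}
l = rng.binomial(1, expit(offset + 0.7 * C)) # binary node; true fluctuation eps = 0.7
eps_n = fluctuate(l, offset, C)
```

At convergence the score equation ∑i C(li − pi) = 0 is solved, which is exactly the component of the efficient influence curve equation targeted by this factor.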

We note that the clever covariate changes at each update step, since the estimator of Q is updated at each step and the clever covariate is defined by the current Q-fit in the iterative algorithm. Let QL(t,j)n* and Qn* denote the final updates (at convergence, when the MLE of ε equals zero) of QL(t,j)n and Qn. The targeted MLE of ψ0 is now given by Ψ(Qn*).

3.4. A targeted MLE based on the binary predictor representation of density that converges in one step

In this section we will define a fast targeted MLE based on the representation (1) of the density of O in terms of the binary predictors QL(t,j).

Consider an initial estimator QL(t,j)n of each QL(t,j), t = 1, . . . , K + 1, j = 1, . . . , n(t). We will estimate the first marginal probability distribution QL(0) of L(0) with the empirical distribution of Li(0), i = 1, . . . , n. Let Qn denote the combined set of QL(t,j)n across t, j.

In addition, let gn be an estimator of g0. As above, we define the following fluctuations of the initial estimator QL(t,j)n of QL(t,j):

LogitQL(t,j)n(ε)=LogitQL(t,j)n+εCL(t,j)(Qn,gn),

where we added the clever covariate CL(t,j)(Qn, gn), obtained by substituting the initial estimators Qn and gn for the true Q0 and g0.

Monotone dependence on Q-property of the clever covariates: Consider the clever covariate representations of CL(t,j) presented in the above Theorem 2 for QL(t,j) for the case that D = D1/g with D1 not indexed by Q, g. Then the conditional expectations in the definition of the clever covariate CL(t,j) only depend on Q through {QL(s,l) : s > t, l} ∪ {QL(t,l) : l > j}.

Let’s enumerate all terms QL(t,j) for t ≥ 1 by moving row-wise: thus Q1 = Q11, Q2 = Q12, . . ., Qn(1) = Q1n(1), Qn(1)+1 = Q21, and so on till QN = QK+1,n(K+1), where N = ∑t=1K+1 n(t). Here we temporarily used the notation Q12 = QL(1,2), and so on. Recall that QL(0), the marginal distribution of L(0), does not need to be fluctuated, and is thus not considered here: we will always estimate QL(0) with the empirical distribution, so that no fluctuation is needed. Under this ordering, the k-th clever covariate Ck only depends on Q through Qk+1, . . . , QN, k = 1, . . . , N. In particular, CN does not depend on Q at all, CN−1 depends on QN only, CN−2 depends on QN−1 and QN, and so on. We refer to this property of the clever covariates as the monotone dependence (on Q) property, which has an immediate implication for the corresponding iterative T-MLE algorithm.

We denote this monotonicity property with Ck(Q) = Ck(Qk+1, . . . , QN), where we suppress the dependence on g since in the targeted MLE algorithm presented below g will not be updated.

We obtain a separate MLE of ε for each factor, but we start with the last factor, use the updated last factor in the clever covariate of the (N − 1)-th factor, carry out the update of the (N − 1)-th factor, use its update in the clever covariate of the (N − 2)-th factor, and so on, till we update the first factor based on the first clever covariate, incorporating all previously obtained updates. One could now start over, since Qn got updated during this round of updating steps, apply the same round of updating steps to the update of Qn, and iterate till convergence. The Theorem below states that this is not necessary, since the algorithm has converged after one round.
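A schematic sketch of this backward-sequential updating on a simulated toy with three binary factors (all data-generating choices and the form of the clever covariates are hypothetical): each factor's clever covariate is built only from the current fits of later factors, each update is incorporated immediately, and a second round of updates returns ε ≈ 0 for every factor, illustrating the one-step convergence stated in the Theorem below:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def mle_eps(l, off, C):
    # one-parameter Newton-Raphson MLE of eps in logit Q(eps) = off + eps * C
    eps = 0.0
    for _ in range(30):
        p = expit(off + eps * C)
        info = np.sum(C ** 2 * p * (1 - p))
        if info == 0:
            break
        eps += np.sum(C * (l - p)) / info
    return eps

rng = np.random.default_rng(3)
n, N = 500, 3
L = [rng.integers(0, 2, n) for _ in range(N)]       # toy binary nodes Q_1, ..., Q_N
off = [0.5 * rng.normal(size=n) for _ in range(N)]  # logits of the initial fits

def clever(k, off):
    # monotone dependence: C_k uses only the current fits of factors > k
    later = [expit(off[m]) for m in range(k + 1, N)]
    return 1.0 + sum(later) if later else np.ones(n)

def one_round(off):
    eps_list = []
    for k in reversed(range(N)):        # backward: last factor first
        C = clever(k, off)
        e = mle_eps(L[k], off[k], C)
        off[k] = off[k] + e * C         # incorporate the update immediately
        eps_list.append(e)
    return eps_list

round1 = one_round(off)
round2 = one_round(off)                 # all ~0: converged in one round
```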

We state here the one-step convergence of this targeted MLE algorithm.

Theorem 3 Consider the targeted MLE algorithm above applied to an initial estimator (Qn, gn), using a separate εL(t,j)n for each factor QL(t,j)n, t ≥ 1, carrying out the updating steps one at a time, starting with the final factor in the likelihood and going backwards till the first factor, always incorporating the latest updates of Qn, where Q0n is the empirical distribution of Li(0), i = 1, . . . , n. We refer to one round of updating, starting at the final factor and ending at the first factor, as one step. This process can be iterated, thereby defining an iterative algorithm.

Suppose that for each t ≥ 1 and j, the clever covariate in this algorithm, CL(t,j)(Q), only depends on Q through Qsl = QL(s,l) for s > t, and for s = t, l > j. In that case, the above iterative targeted MLE algorithm converges in one step/round, and thus in exactly N = ∑t=1K+1 n(t) updating steps.

We recall from the previous Theorem that, if D(O) = D1(O)/g(Ā | X), and the probability distribution of O is factorized in binary predictors as in (1), then D* = Π(D | TQ) = D0 + ∑t,j Dtj, where Dtj = CL(t,j)(L(t, j) − QL(t,j)(1)), and

CL(t,j)=1g(A¯(t1)|Pa(L(t)))×{CL(t,j)(Q)(1)CL(t,j)(Q)(0)}

where, for δ ∈ {0, 1},

CL(t,j)(Q)(δ)=EQ(a¯(t,K)D1|La¯(t1)(t,j)=δ,Paa¯(t1)(L(t,j)))|a¯(t1)=A¯(t1).

This monotonicity property of the clever covariate holds if D1 does not depend on Q itself. More generally, it holds if

D(Q) = (D1 + C1(Q))/g, where C1(Q) is only a function of O through (L(0), Ā(K))

(so that C1(Q) will cancel out in the representation of CL(t,j)) and D1 does not depend on Q (it can depend on g).

This Theorem allows us to define closed-form targeted MLE algorithms for a large class of parameters in our semiparametric model defined by no constraints on any of the conditional node-specific distributions, given their specified parent nodes. To utilize this closed-form one-step targeted MLE, one is forced to carry out a separate update step for each factor (only once), but one can still use smoothing across many factors for the initial estimator.

3.5. Targeted loss-based selection among targeted MLE

The basic idea is as follows. All our candidate estimators of Q0 are targeted maximum likelihood estimators, indexed by different initial estimators of Q0, and using the same estimator gn of g0. Due to the fact that these targeted MLEs solve the efficient influence curve equation, it follows that the bias for ψ0 involves a product of the differences Qn* − Q0 and gn − g0: see the asymptotic linearity Theorems in van der Laan and Robins (2003) and van der Laan and Gruber (2009). The goal is clearly to estimate Q0 as accurately as possible, which will maximize efficiency and minimize bias for ψ0. Therefore, we want to use cross-validation to select among different targeted maximum likelihood estimators, using a loss function whose risk is minimized at Q0. However, many choices for the loss-based dissimilarity E0{L(Q) − L(Q0)} between a candidate Q and Q0 are possible, and one will be more targeted towards ψ0 than another. For example, we can use the log-likelihood loss function, a penalized log-likelihood loss function presented in van der Laan and Gruber (2009), or other loss functions inspired by the efficient influence curve of ψ0, as presented here (see also van der Laan and Gruber (2009)).

Here we present two loss functions for Q0 that are identified by the efficient influence curve of Ψ. Firstly, if g0 is known, then we can use

L1(Q)={D*(Q,g0)}2.

If D*(Q, g0) = D(Q, g0) − Ψ(Q), then it follows that indeed Q0 = arg minQ E0L1(Q), since the variance under P0 of D*(Q, g0, ψ0) is minimized at Q = Q0 (van der Laan and Robins (2003)). For more general efficient influence curves, the latter property has to be explicitly verified: at a minimum, if D*(Q, g) = D(Q, g, Ψ(Q)), then one can replace Ψ(Q) by a consistent estimator ψn of ψ0, and use the loss function D(Q, g0, ψn)². By the argument above, the loss function is still valid if one is willing to assume a consistent and good estimator of g0 (an estimator that converges faster to the true g0 than the estimators of Q0 converge to Q0).

To explain the rationale behind this loss function, first consider the case that g0 is known. If g0 is known, a targeted MLE for which Qn* converges to a Q with Ψ(Q) = ψ0 is asymptotically linear with influence curve D*(Q, g0) (van der Laan and Rubin (2006)), and it is well known that the variance of D*(Q, g0) over Q with Ψ(Q) = ψ0 is minimal at Q = Q0, which then corresponds with the semiparametric information bound. Thus, E0L1(Q) equals the asymptotic variance of the influence curve of the targeted MLE. Under the assumption that L1(Q) is uniformly bounded in all candidate Q's, we can apply the Theorems on the cross-validation selector (e.g., van der Laan and Dudoit (2003)), which prove that either the cross-validation selector is asymptotically equivalent with the oracle selector, or it achieves the parametric rate of convergence. As a consequence, loss-function based cross-validation based on this loss function will, for large enough sample size, select the targeted maximum likelihood estimator with the smallest asymptotic variance of its resulting substitution estimator of ψ0 (excluding the case that there are ties). If g0 is unknown, but estimated at a fast rate relative to the rate at which one estimates Q0, then the above argument for the cross-validation selector still applies to first order: see van der Laan and Dudoit (2003) for oracle results for the cross-validation selector based on loss functions with nuisance parameters. If g0 is estimated, and Q ≠ Q0, then the influence curve of the targeted MLE involves another contribution, reducing the variance relative to the variance of the influence curve for g0 known. In this case, L1(Q) is not exactly the asymptotic variance of the targeted MLE, but it is still minimized at the optimal Q0, and it represents a large component of the true asymptotic variance of the targeted MLE.
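In the simplest point-treatment example with target parameter EY(1) and g0 known, the efficient influence curve is D*(Q, g0)(O) = A/g0(W) (Y − Q̄(W)) + Q̄(W) − Ψ(Q), so the L1 risk of a candidate Q̄ can be estimated by the cross-validated mean of D*². The sketch below (data-generating mechanism and candidate fits are all hypothetical) selects between a misspecified and a correctly specified candidate by this criterion:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
W = rng.integers(0, 2, n)                 # binary baseline covariate
g0 = np.where(W == 1, 0.7, 0.4)           # known treatment mechanism P(A=1 | W)
A = rng.binomial(1, g0)
Y = A + 2.0 * W + rng.normal(size=n)      # true Qbar0(W) = E(Y | A=1, W) = 1 + 2W

def fit_const(tr):                        # candidate 1: ignores W (misspecified)
    m = Y[tr][A[tr] == 1].mean()
    return lambda w: np.full(len(w), m)

def fit_strat(tr):                        # candidate 2: stratifies on W
    q = np.array([Y[tr][(A[tr] == 1) & (W[tr] == k)].mean() for k in (0, 1)])
    return lambda w: q[w]

def cv_l1_risk(fitter, folds=5):
    idx = np.arange(n)
    risk = 0.0
    for f in np.array_split(idx, folds):
        tr = np.setdiff1d(idx, f)
        qbar_v = fitter(tr)(W[f])
        psi = fitter(tr)(W[tr]).mean()    # training-sample plug-in of EY(1)
        dstar = A[f] / g0[f] * (Y[f] - qbar_v) + qbar_v - psi
        risk += np.sum(dstar ** 2)        # L1(Q) = {D*(Q, g0)}^2
    return risk / n

best = min([fit_const, fit_strat], key=cv_l1_risk)
```

The misspecified constant fit inflates the variance of D*, so its cross-validated L1 risk exceeds that of the stratified fit.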

Consider now the case that we are not willing to assume that estimation of g0 is easy relative to estimation of Q0. In that case, the above loss function is not appropriate. Recall the representation of the efficient influence curve D* = D0 + ∑t,j DL(t,j) with DL(t,j) = CL(t,j)(L(t, j) − QL(t,j)(1)). We make the following observation (using short-hand notation):

VAR(D*(Q0, g0)) = E0D0² + ∑t,j E0DL(t,j)² = E0D0² + ∑t,j E0CL(t,j)²(L(t, j) − QL(t,j)(1))².

This suggests using as loss function for QL(t,j), t ≥ 1, the weighted squared-error loss function:

L2(Q) = ∑t,j CL(t,j)²(L(t, j) − QL(t,j)(1))²,

which is indexed by the weight functions CL(t,j)². One would need to obtain an initial estimator of these weights, which depend on both Q0 and g0. However, note that even if we estimate these weights inconsistently, L2(Q) remains a valid loss function for Q0, because the weights are functions of the parents of L(t, j) only, so that the risk is still minimized at the true conditional probabilities; this preserves the double robustness of the resulting targeted maximum likelihood estimator of Q0.
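The insensitivity of the minimizer to misestimated weights can be checked numerically: since the weight is constant within each stratum of the parents, the weighted stratum mean coincides with the plain stratum mean. A toy check, with a hypothetical binary parent, node, and deliberately wrong weights:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
pa = rng.integers(0, 2, n)                        # the parent of the binary node
l = rng.binomial(1, np.where(pa == 1, 0.7, 0.2))  # the node L(t, j)
w = np.where(pa == 1, 5.0, 0.1)                   # badly (mis)estimated weights C^2;
                                                  # any positive function of the parents

# minimizer of the weighted squared-error risk within each parent stratum:
q_weighted = [np.average(l[pa == k], weights=w[pa == k]) for k in (0, 1)]
q_plain = [l[pa == k].mean() for k in (0, 1)]     # estimate of the true conditional mean
```

The two stratum-specific solutions agree, so inconsistent weights change only the relative emphasis across strata, not the limit of the fit.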

In van der Laan and Gruber (2009) other loss functions implied by the efficient influence curve are proposed, including the variance of efficient influence curve at a collaborative estimator of g0.

3.6. The targeted MLE at a degenerate initial estimator for intermediate time-dependent covariate factors

Consider the likelihood factorization as used to define the G-computation formula, and assume that the IPCW estimating function is of the form stated in the above Theorem 3. If one of the node-specific conditional distributions is estimated with a degenerate conditional distribution, given the data generated by the previous node-specific conditional distributions, then Theorem 3 implies that the projection of a function of O onto the tangent space generated by that factor equals zero.

For example, suppose the likelihood is factored according to the ordering L(0), A(0), L(1), A(1), Y. The projection of a function D(O) onto the tangent space of QL(1) is zero at a degenerate QL(1), even if the true conditional distribution of L(1) is not degenerate.

This insight suggests a simple-to-compute version of the targeted MLE. Suppose we obtain an initial estimator Qn0 that is degenerate for all factors except the last one, and we use the empirical distribution for the marginal distribution of the baseline covariates. In that case, only the last factor, the conditional distribution QY of Y = L(K+1), needs to be updated in the targeted MLE algorithm. As a consequence, the targeted MLE requires only one update, and thus converges in a single updating step.

Thus, in this case one estimates most of the system with a deterministic system, and only the last factor is estimated with a non-degenerate conditional probability distribution that is updated with a clever covariate depending on the treatment mechanism. Due to the double robustness of the targeted MLE, the resulting targeted MLE will be consistent and asymptotically linear if the treatment mechanism is correctly specified, and will still gain efficiency if the degenerate distributions are doing a reasonable job; since the degenerate distributions will be misspecified, it is no longer reasonable to rely on correct specification of the initial estimator Qn0 of Q0 for consistency. Note also that this simplified targeted ML estimator still allows us to apply the collaborative targeted MLE approach for selecting among different treatment mechanism estimators, based on the log-likelihood of the targeted estimator Qn1 indexed by the treatment mechanism estimator: see van der Laan and Gruber (2009).

We can view this particular simple targeted maximum likelihood estimator as one particular candidate among a set of candidate targeted maximum likelihood estimators, and use loss-based cross-validation to select among these candidates. One would use one of our proposed (efficient-influence-curve based) loss functions, such as the weighted squared-error loss function, since the log-likelihood loss function becomes undefined at a degenerate distribution.

3.7. Dimension reduction for time-dependent covariates

One could use a loss function on the Q-factor of the binary-coded complete data structure, and use loss-function based cross-validation to select among different fits, thereby implicitly carrying out a dimension reduction. For example, not including a node of the graph in the parent sets of the other nodes is equivalent to removing that node from the data structure, and such moves can be scored with the loss function. In this manner one might build an initial estimator Qn whose G-computation formula for the parameter of interest is only affected by the conditional distributions of a subset of all nodes, thereby also simplifying the targeted MLE update.

Here we wish to investigate alternative targeted dimension reductions that would, in particular, reduce the computational complexity of the targeted maximum likelihood estimator which is driven by the number of binary variables coding the data structure. This reduced complexity/dimension can also imply that the loss function for the Q0 of the reduced data structure implies a more targeted dissimilarity for the purpose of fitting Ψ(Q0).

If a multivariate L(t) is reduced to a one-dimensional time-dependent covariate, then the targeted maximum likelihood estimator is simpler; but if this reduction means that A(t) depends on measured variables beyond the one-dimensional reduced covariate, then the reduction will also have caused bias. In addition, much information might have been lost, thereby increasing variance. So care is needed.

Let’s revisit the two-stage sequentially randomized controlled trial with data structure O = (L(0), A(0), L(1), A(1), Y = L(2)), but let’s now consider the more general case that L(1) can be a multivariate vector with discrete and/or continuous components. Suppose that we wish to estimate EY(1, 1). Inspection of the efficient influence curve of EY(1, 1) shows that it only depends on the conditional distribution of Y through its mean E(Y | A(0) = 1, A(1) = 1, L(1)). This suggests that LQ(1) = E(Y | A(0), A(1) = 1, L(1)) defines a targeted dimension reduction: below we provide a general approach which implies this precise dimension reduction. In addition, let Lg(1) be defined as the propensity score P(A(1) = 1 | L(0), A(0), L(1)). A targeted dimension reduction of the observed O is now given by (L(0), A(0), LQ(1), Lg(1), A(1), Y). We can fit both LQ(1) and Lg(1) from the data using super-learning, thereby obtaining an estimated dimension reduction Or. A targeted MLE for this (estimated) reduced data structure now involves fitting QLQ(1), QLg(1), and QY, where only the conditional mean of Y is needed. However, by definition of LQ(1), the conditional mean of QY at A(1) = 1 equals LQ(1), suggesting that we can exclude Lg(1) from the parent set of Y without meaningful loss of information. Then, the conditional distribution QLg(1) does not affect the G-computation formula of the distribution of Y(1, 1) or, more generally, the joint distribution of Y(1, 1) and L(0). As a consequence, in this case we do not even need to fit QLg(1).
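A sketch of this reduction on a toy two-stage trial in which L(1) has two binary components, so that stratum-specific sample means can stand in for the super-learner fits of LQ(1) and Lg(1). All data-generating choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
L0 = rng.integers(0, 2, n)
A0 = rng.binomial(1, 0.5, n)
L1 = np.c_[rng.integers(0, 2, n), rng.integers(0, 2, n)]  # multivariate L(1)
A1 = rng.binomial(1, 0.3 + 0.4 * L1[:, 0])
Y = A0 + A1 + L1.sum(axis=1) + rng.normal(size=n)

# LQ(1) = E(Y | A(0), A(1)=1, L(1)): stratum means over (A(0), L(1))
keyQ = A0 * 4 + L1[:, 0] * 2 + L1[:, 1]
# Lg(1) = P(A(1)=1 | L(0), A(0), L(1)): stratum means over (L(0), A(0), L(1))
keyg = L0 * 8 + keyQ
LQ1 = np.empty(n)
Lg1 = np.empty(n)
for k in np.unique(keyQ):
    s = keyQ == k
    LQ1[s] = Y[s & (A1 == 1)].mean()
for k in np.unique(keyg):
    s = keyg == k
    Lg1[s] = A1[s].mean()

# reduced data structure O^r = (L(0), A(0), LQ(1), Lg(1), A(1), Y)
O_r = np.column_stack([L0, A0, LQ1, Lg1, A1, Y])
```

The multivariate L(1) enters the reduced structure only through the two univariate columns LQ1 and Lg1; the targeted MLE would then be applied to O_r.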

To summarize, in this manner we have dramatically reduced the complexity of a targeted MLE by replacing the fitting of a conditional distribution of a multivariate random variable L(1) with the fitting of a univariate conditional distribution of LQ(1).

Let’s now generalize this type of targeted dimension reduction procedure. Consider a general longitudinal data structure (L(0), A(0), . . . , L(K), A(K), Y = L(K + 1)), and consider the case that A(j) is binary, j = 0, . . . , K. The dimension reduction can be guided by the actual form of the efficient influence curve for the target parameter. To demonstrate this, we first note that the efficient influence curve can often be represented as DIPCW(g0, ψ0)(L(0), Ā, Y) − ∑j=0K {E(DIPCW | A(j), Pa(A(j))) − E(DIPCW | Pa(A(j)))} for some IPCW-estimating function DIPCW (see van der Laan and Robins (2003)). Each of these differences of two conditional expectations can also be written as C(j)(A(j) − P(A(j) = 1 | Pa(A(j)))), where

C(j)=E(DIPCW|A(j)=1,Pa(A(j)))E(DIPCW|A(j)=0,Pa(A(j))).

For example, if ψ0 = EY (1), then DIPCW(O) = {I(Ā = 1)/g(Ā | X)}Y − ψ0. As we did before, we can factorize this difference of conditional expectations in terms of a factor only depending on Q0 and a factor only depending on g0. We can define LQ(j) as the Q0-factor only, thereby preserving double robustness of the resulting targeted MLE. In addition, we define

Lg(j) = P(A(j) = 1 | Pa(A(j))).

If the target parameter is EY (a) for a static regimen a, it follows that the efficient influence curve depends on O through the reduction

Or = (L(0), A(0), LQ(1), Lg(1), A(1), . . . , LQ(K), Lg(K), A(K), Y).

If the target parameter is EY(d) for a dynamic treatment rule d, then one also needs to include the time-dependent covariates the rule d uses to assign treatment. To summarize, inspection of the efficient influence curve of the target parameter defines a reduction Or in terms of two time-dependent covariate processes, one representing the treatment assignment probabilities as functions of the past, and one based on the Q0-functions making up the efficient influence curve. These time-dependent covariates depend on the unknown Q0 and g0. We estimate these time-dependent covariates by estimating the treatment mechanism and the required LQ(j). We can now apply the targeted MLE to this reduced data structure.

As in our previous example, suppose that for each of the conditional distributions of Y and LQ(j), j = 1, . . . , K, we do not include any of the Lg(j) in the parent sets. We suggest that this comes at little cost in efficiency. Under this condition, the conditional distributions QLg(j) do not affect the G-computation formula of the distribution of Y(d) or, more generally, the joint distribution of Y(d) and L(0). As a consequence, in that case we do not even need to fit QLg(j), j = 1, . . . , K. To summarize, in this manner we can dramatically reduce the complexity of a targeted MLE by replacing the fitting of a conditional distribution of a multivariate random variable L(j) with fitting only the univariate conditional distributions of LQ(j), and possibly the conditional distribution of another time-dependent covariate used to define the target parameter (e.g., a treatment rule based on a time-dependent biomarker). Note that we will still fit the treatment mechanism of A(j), conditional on its parents (under Or) including Lg(j), and can thus just fit P(A(j) = 1 | Pa(A(j))) with Lg(j) itself.

This dimension reduction still allows for the construction of a collaborative estimator gn of g0, given an estimator Qn of Q0r, representing the conditional distributions of LQ(j), Lg(j) and Y . This just requires applying the C-T-MLE algorithm as presented in van der Laan and Gruber (2009) to the log-likelihood for Q0r, thereby scoring a fit of g0 with the log-likelihood (or other loss function) of the targeted MLE of Q0r corresponding with the fluctuation function implied by the candidate g0-fit.

By using as loss function the variance of the influence curve of the targeted MLE, we can still select among different targeted maximum likelihood estimators indexed by the different dimension reductions of the type presented above, assuming that each of them puts maximal effort into obtaining an unbiased estimator.

4. Discussion

Targeted maximum likelihood estimation, combined with loss-based super learning fully utilizing the power of cross-validation for bounded loss functions, provides an exceptionally powerful framework for assessing causal effects, with distinct advantages relative to other proposed methodology. The current paper lays the groundwork for the implementation of targeted maximum likelihood estimators that also incorporate time-dependent covariate and outcome processes. Time-dependent covariates and outcome processes allow reasonably accurate imputations of the clinical outcome based on recent history, thereby allowing for significant potential gains in both efficiency and bias (see the formula for the efficient influence curve/clever covariates). Even though almost all current clinical trials collect time-dependent data, these important sources of information have been ignored in the assessment of the causal effect of a drug or treatment strategy. Clinical trials provide just one important application of the targeted maximum likelihood estimator. Other important applications are the assessment of causal effects of treatment rules in sequentially randomized controlled trials, and in observational studies.

References

  1. Abadie A, Imbens GW. Large sample properties of matching estimators for average treatment effects. Econometrica. 2006;74:235–67. doi: 10.1111/j.1468-0262.2006.00655.x. [DOI] [Google Scholar]
  2. Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Models Based on Counting Processes. Springer-Verlag; New York: 1993. [Google Scholar]
  3. Bembom O, Petersen ML, Rhee S-Y, Fessel WJ, Sinisi SE, Shafer RW, van der Laan MJ. Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant hiv infection. Statistics in Medicine. page http://www3.interscience.wiley.com/journal/121422393/abstract, 2008. [DOI] [PMC free article] [PubMed]
  4. Bembom O, van der Laan MJ, Haight T, Tager IB. Lifetime and current leisure time physical activity and all-cause mortality in an elderly cohort. Epidemiology. 2009 doi: 10.1097/EDE.0b013e31819e3f28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bembom Oliver, van der Laan Mark. Statistical methods for analyzing sequentially randomized trials, commentary on jnci article adaptive therapy for androgen independent prostate cancer: A randomized selection trial including four regimens, by peter f. thall, c. logothetis, c. pagliaro, s. wen, m.a. brown, d. williams, r. millikan 2007. Journal of the National Cancer Institute. 2007;99(21):1577–1582. doi: 10.1093/jnci/djm185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag; 1997. [Google Scholar]
  7. Bryan J, Yu Z, van der Laan MJ. Analysis of longitudinal marginal structural models. Biostatistics. 2003;5(3):361–380. doi: 10.1093/biostatistics/kxg041. [DOI] [PubMed] [Google Scholar]
  8. Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology. 2005;2(2):131–154. doi: 10.1016/j.stamet.2005.02.003. [DOI] [Google Scholar]
  9. Gill R, Robins JM. Causal inference in complex longitudinal studies: continuous case. Ann Stat. 2001;29(6) [Google Scholar]
  10. Gill RD, van der Laan MJ, Robins JM. Coarsening at random: characterizations, conjectures and counter-examples. In: Lin DY, Fleming TR, editors. Proceedings of the First Seattle Symposium in Biostatistics. New York: Springer Verlag; 1997. pp. 255–94. [Google Scholar]
  11. Heitjan DF, Rubin DB. Ignorability and coarse data. Annals of statistics. 1991 Dec;19(4):2244–2253. doi: 10.1214/aos/1176348396. [DOI] [Google Scholar]
  12. Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561–570. doi: 10.1097/00001648-200009000-00012. [DOI] [PubMed] [Google Scholar]
  13. Jacobsen M, Keiding N. Coarsening at random in general sample spaces and random censoring in continuous time. Annals of Statistics. 1995;23:774–86. doi: 10.1214/aos/1176324622. [DOI] [Google Scholar]
  14. Keleş S, van der Laan M, Dudoit S. Asymptotically optimal model selection method for regression on censored outcomes. Technical Report, Division of Biostatistics, UC Berkeley. 2002 [Google Scholar]
  15. Lavori P, Dawson R. A design for testing clinical strategies: biased adaptive within-subject randomization. Journal of the Royal Statistical Society, Series A. 2000;163:2938. [Google Scholar]
  16. Lavori P, Dawson R. Dynamic treatment regimes: practical design considerations. Clinical trials. 2004;1:920. doi: 10.1191/1740774S04cn002oa. [DOI] [PubMed] [Google Scholar]
  17. Moore KL, van der Laan MJ. Covariate adjustment in randomized trials with binary outcomes. Technical report 215, Division of Biostatistics, University of California, Berkeley, April 2007. [Google Scholar]
  18. Moore KL, van der Laan MJ. Application of time-to-event methods in the assessment of safety in clinical trials. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009. [Google Scholar]
  19. Murphy S. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine. 2005;24:1455–1481. doi: 10.1002/sim.2022. [DOI] [PubMed] [Google Scholar]
  20. Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B. 2003;65(2) doi: 10.1093/jrsssc/qlad037. ? [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Murphy SA, van der Laan MJ, Robins JM. Marginal mean models for dynamic treatment regimens. Journal of the American Statistical Association. 2001;96:1410–1424. doi: 10.1198/016214501753382327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Neugebauer R, van der Laan MJ. Why prefer double robust estimates. Journal of Statistical Planning and Inference. 2005;129(1–2):405–426. doi: 10.1016/j.jspi.2004.06.060. [DOI] [Google Scholar]
  23. Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Press; Cambridge: 2000. [Google Scholar]
  24. Petersen Maya L, Deeks Steven G, Martin Jeffrey N, van der Laan Mark J. History-adjusted marginal structural models: Time-varying effect modification and dynamic treatment regimens. Technical report 199, Division of Biostatistics, University of California, Berkeley, December 2005.
  25. Polley EC, van der Laan MJ. Predicting optimal treatment assignment based on prognostic factors in cancer patients. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009. [Google Scholar]
  26. Robins J, Orallana L, Rotnitzky A. Estimaton and extrapolation of optimal treatment and testing strategies. Statistics in Medicine. 2008;27(23):4678–4721. doi: 10.1002/sim.3301. [DOI] [PubMed] [Google Scholar]
  27. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer”. Statistica Sinica. 2001a;11(4):920–936. [Google Scholar]
  28. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000a;11(5):550–560. doi: 10.1097/00001648-200009000-00011. [DOI] [PubMed] [Google Scholar]
  29. Robins JM, Rotnitzky A, van der Laan MJ. Comment on “On Profile Likelihood” by S.A. Murphy and A.W. van der Vaart. Journal of the American Statistical Association – Theory and Methods. 2000b;450:431–435. doi: 10.2307/2669381. [DOI] [Google Scholar]
  30. Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association. 2000a.
  31. Robins JM. Discussion of “Optimal dynamic treatment regimes” by Susan A. Murphy. Journal of the Royal Statistical Society: Series B. 2003;65(2):355–366. [Google Scholar]
  32. Robins JM. Optimal structural nested models for optimal sequential decisions. Heagerty PJ, Lin DY, editors. Proceedings of the 2nd Seattle symposium in biostatistics. 2005a;179:189–326. [Google Scholar]
  33. Robins JM. Optimal structural nested models for optimal sequential decisions. Technical report, Department of Biostatistics, Havard University, 2005b.
  34. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. doi: 10.1016/0270-0255(86)90088-6. [DOI] [Google Scholar]
  35. Robins JM. The analysis of randomized and non-randomized AIDS treatment trials using a new approach in causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, editors. Health Service Methodology: A Focus on AIDS. U.S. Public Health Service, National Center for Health Services Research; Washington D.C.: 1989. pp. 113–159. [Google Scholar]
  36. Robins JM. Proceedings of the Biopharmaceutical Section. American Statistical Association; 1993. Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers; pp. 24–33. [Google Scholar]
  37. Robins JM. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent Variable Modeling and Applications to Causality. Springer-Verlag; New York: 1997a. pp. 69–117. [Google Scholar]
  38. Robins JM. Structural nested failure time models. In: Armitage P, Colton T, Andersen PK, Keiding N, editors. The Encyclopedia of Biostatistics. John Wiley and Sons; Chichester, UK: 1997b. [Google Scholar]
  39. Robins JM. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) Springer; New York: 2000b. Marginal structural models versus structural nested models as tools for causal inference; pp. 95–133. [Google Scholar]
  40. Robins JM, Rotnitzky A. Comment on Inference for semiparametric models: some questions and an answer, by Bickel, P.J. and Kwon, J. Statistica Sinica. 2001b;11:920–935. [Google Scholar]
  41. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS Epidemiology: Methodological Issues. Birkhäuser; 1992.
  42. Rose S, van der Laan MJ. Simple optimal weighting of cases and controls in case-control studies. The International Journal of Biostatistics. 2008. http://www.bepress.com/ijb/vol4/iss1/19/ [DOI] [PMC free article] [PubMed]
  43. Rose S, van der Laan MJ. Why match? Investigating matched case-control study designs with causal effect estimation. The International Journal of Biostatistics. 2009. http://www.bepress.com/ijb/vol5/iss1/1/ [DOI] [PMC free article] [PubMed]
  44. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. doi: 10.1093/biomet/70.1.41. [DOI] [Google Scholar]
  45. Rosenblum M, Deeks SG, van der Laan MJ, Bangsberg DR. The risk of virologic failure decreases with duration of HIV suppression, at greater than 50% adherence to antiretroviral therapy. PLoS ONE. 2009;4(9):e7196. doi: 10.1371/journal.pone.0007196. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Rubin DB. Matched Sampling for Causal Effects. Cambridge University Press; Cambridge, MA: 2006. [Google Scholar]
  47. Rubin DB, van der Laan MJ. Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics. 2008;4(1) doi: 10.2202/1557-4679.1084. Article 5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Rush AJ, Trivedi M, Fava M. Depression. IV. American Journal of Psychiatry. 2003;160(2):237. doi: 10.1176/appi.ajp.160.2.237. [DOI] [PubMed] [Google Scholar]
  49. Schneider L, Tariot PN, Dagerman K, Davis S, Hsiao J, Ismail S, Lebowitz B, Lyketsos C, Ryan J, Stroup T, Sultzer D, Weintraub D, Lieberman J. Effectiveness of atypical antipsychotic drugs in patients with Alzheimer's disease. New England Journal of Medicine. 2006;355(15):1525–1538. doi: 10.1056/NEJMoa061240. [DOI] [PubMed] [Google Scholar]
  50. Sekhon JS. Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, Forthcoming. 2008.
  51. Sinisi S, van der Laan MJ. The deletion/substitution/addition algorithm in loss function based estimation: Applications in genomics. Journal of Statistical Methods in Molecular Biology. 2004;3(1) doi: 10.2202/1544-6115.1069. [DOI] [PubMed] [Google Scholar]
  52. Swartz M, Perkins D, Stroup T, Davis S, Capuano G, Rosenheck R, Reimherr F, McGee M, Keefe R, McEvoy J, Hsiao J, Lieberman J. Effects of antipsychotic medications on psychosocial functioning in patients with chronic schizophrenia: Findings from the NIMH CATIE study. American Journal of Psychiatry. 2007;164:428–436. doi: 10.1176/appi.ajp.164.3.428. [DOI] [PubMed] [Google Scholar]
  53. Thall P, Millikan R, Sung H-G. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine. 2000;19:1011–1028. doi: 10.1002/(SICI)1097-0258(20000430)19:8<1011::AID-SIM414>3.0.CO;2-M. [DOI] [PubMed] [Google Scholar]
  54. Tuglus C, van der Laan MJ. Targeted methods for biomarker discovery, the search for a standard. UC Berkeley Working Paper Series. 2008. http://www.bepress.com/ucbbiostat/paper233/
  55. van der Laan MJ. Causal effect models for intention to treat and realistic individualized treatment rules. 2006. Technical report 203, Division of Biostatistics, University of California, Berkeley. [DOI] [PMC free article] [PubMed]
  56. van der Laan MJ. Estimation based on case-control designs with known prevalence probability. The International Journal of Biostatistics. 2008. http://www.bepress.com/ijb/vol4/iss1/17/ [DOI] [PubMed]
  57. van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Technical report, Division of Biostatistics, University of California, Berkeley, November 2003. [Google Scholar]
  58. van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. The International Journal of Biostatistics. 2009. [DOI] [PMC free article] [PubMed]
  59. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics. 2007;3(1) doi: 10.2202/1557-4679.1022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003. [Google Scholar]
  61. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology. 2004;3 doi: 10.2202/1544-6115.1036. [DOI] [PubMed] [Google Scholar]
  63. van der Laan MJ, Petersen ML, Joffe MM. History-adjusted marginal structural models and statically-optimal dynamic treatment regimens. The International Journal of Biostatistics. 2005;1(1):10–20. doi: 10.2202/1557-4679.1003. [DOI] [Google Scholar]
  64. van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Statistics and Decisions. 2006;24(3):373–395. doi: 10.1524/stnd.2006.24.3.373. [DOI] [Google Scholar]
  65. van der Laan MJ, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(25) doi: 10.2202/1544-6115.1309. [DOI] [PubMed] [Google Scholar]
  66. van der Laan MJ, Rose S, Gruber S. Readings on targeted maximum likelihood estimation. Technical report, UC Berkeley Working Paper Series, September 2009. http://www.bepress.com/ucbbiostat/paper254/
  67. van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Statistics and Decisions. 2006;24(3):351–371. doi: 10.1524/stnd.2006.24.3.351. [DOI] [Google Scholar]
  68. Yu Z, van der Laan MJ. Construction of counterfactuals and the G-computation formula. 2002. Technical report, Division of Biostatistics, University of California, Berkeley.
  69. Yu Z, van der Laan MJ. Double robust estimation in longitudinal marginal structural models. 2003. Technical report, Division of Biostatistics, University of California, Berkeley.
