Skip to main content
The International Journal of Biostatistics logoLink to The International Journal of Biostatistics
. 2010 May 17;6(1):17. doi: 10.2202/1557-4679.1181

Collaborative Double Robust Targeted Maximum Likelihood Estimation*

Mark J van der Laan *, Susan Gruber
PMCID: PMC2898626  PMID: 20628637

Abstract

Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan, Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.

In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.

We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, providing that g solves a specified score equation implied by the difference between the Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.

We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.

This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.

Keywords: asymptotic linearity, coarsening at random, causal effect, censored data, crossvalidation, collaborative double robust, double robust, efficient influence curve, estimating function, estimator selection, influence curve, G-computation, locally efficient, loss-function, marginal structural model, maximum likelihood estimation, model selection, pathwise derivative, semiparametric model, sieve, super efficiency, super-learning, targeted maximum likelihood estimation, targeted nuisance parameter estimator selection, variable importance

1. Introduction

Researchers acknowledge that questions about our infinite-dimensional, semi-parametric world are not well-addressed by parametric models. More sophisticated tools are needed to wrest meaning from data. We can and should develop and utilize methods specifically designed to estimate a relatively small-dimensional precisely specified parameter within such a semiparametric model that is identifiable from the data. The ideal method would be entirely a priori specified, have desirable statistical properties, avoid reliance on ad hoc or arbitrary specifications, and be computationally feasible.

Suppose one observes a sample of independent and identically distributed observations from a particular data generating distribution P0 in a semiparametric model, and that one is concerned with estimation of a particular pathwise differentiable parameter of the data generating distribution. A parameter should be viewed as a mapping from the semiparametric model to the parameter space (e.g., real line). A parameter mapping is pathwise differentiable at P0 if it is differentiable along all smooth parametric sub-models through P0, and its derivative is uniformly bounded as a linear mapping on the Hilbert space of all scores of these parametric submodels. Intuitively, a pathwise differentiable parameter is a parameter which has a finite generalized Cramer-Rao information lower bound, so that in principle, under enough regularity conditions, it is possible to construct an estimator which behaves like a sample mean of i.i.d. random variables. Due to the curse of dimensionality implied by the infinite dimension of semi-parametric models, standard (nonparametric) maximum likelihood estimation is often ill defined or breaks down due to overfitting, while, on the other hand, regularized sieve-based maximum likelihood estimation results in overly biased plug-in estimators of the target parameter of interest.

The latter is due to the fact that such likelihood based estimators seek and achieve a bias-variance trade-off that is optimal for the density of the distribution of the data itself. Since the variance of an optimally smoothed density estimator is typically much larger than the variance of a smooth (pathwise-differentiable) parameter of the density estimator, the substitution estimators are often too biased relative to their variance. That is, substitution estimators based on density estimators involving optimal (e.g., likelihood-based) bias-variance trade-off (for the whole density) are not targeted towards the parameter of interest.

Motivated by this problem with the bias-variance trade-off of maximum likelihood estimation in semiparametric models, while still wanting to preserve the log-likelihood as the principle criterion in estimation, in van der Laan and Rubin (2006) we introduced and developed a targeted maximum likelihood estimator of the parameter of interest.

The targeted maximum likelihood estimator of the distribution of the data is obtained by fluctuating an initial estimator of the relevant part of the data generating distribution with a parametric fluctuation model whose score at the initial estimator (i.e., at zero fluctuation) equals or includes the efficient influence curve of the parameter of interest, and estimating the fluctuation parameter (i.e., amount of fluctuation) with standard parametric maximum likelihood, treating the initial estimator as offset. Iteration of this targeted maximum likelihood modification step results in a so called k-th step targeted maximum likelihood estimator, and its limit in k solves the actual efficient influence curve equation defined by setting the empirical mean of the efficient influence curve equal to zero. The latter estimator we called the targeted maximum likelihood estimator, which also results in a corresponding plug-in targeted maximum likelihood estimator of the parameter of interest by applying the parameter mapping to the targeted maximum likelihood estimator.

This targeted maximum likelihood step using the fluctuation model removes bias of the initial estimator with respect to (w.r.t.) the target parameter, while increasing the variance of the estimator till the level of the semi-parametric information bound, thereby resulting in a consistent, asymptotically linear, and semi-parametric (locally) efficient estimator.

Although in a variety of applications the fluctuation model is known, e.g., randomized controlled trials with known treatment assignment and missingness mechanism, the fluctuation model typically depends on an unknown nuisance parameter, which then needs to be estimated as well. In censored data models satisfying the so called coarsening at random (CAR) assumption this nuisance parameter typically represents the censoring mechanism, and the density of the data factors in the relevant part of the density and the censoring mechanism density (e.g., Heitjan and Rubin (1991), Jacobsen and Keiding (1995), Gill et al. (1997)).

In this case, the bias reduction obtained at the targeted maximum likelihood step depends on how and how well we estimate the nuisance parameter. Specifically, the targeted maximum likelihood estimator is a so called double robust locally efficient estimator in censored data models (including causal inference models with the full data representing a collection of treatment regimen specific counterfactuals) in which the censoring mechanism satisfies the coarsening at random assumption. This means that, under regularity conditions, it is consistent and asymptotically linear if either the initial estimator is consistent or the nuisance parameter is consistent, and it is efficient in the semiparametric model if the initial estimator is consistent. Another approach for double robust locally efficient estimation is the estimating equation methodology (see, van der Laan and Robins (2003), and review below).

An outstanding open problem that obstructs the robust practical application of double robust estimators (in particular, in nonparametric censored data or causal inference models) is the selection of a sensible model or estimator of the nuisance parameter: this is particularly true when the efficient influence curve estimating equation involves inverse probability of censoring or treatment weighting, due to the enormous sensitivity of the estimator of the parameter of interest to the estimator of the nuisance parameter. A relevant recent discussion of these issues is found in Kang and Schafer (2007a), Ridgeway and McCaffrey (2007), Robins et al. (2007), Tan (2007), Tsiatis and Davidian (2007), Kang and Schafer (2007b).

Given an initial estimator, we are concerned with constructing an estimator of the nuisance parameter, that results in a better bias-variance trade-off (i.e. better MSE) for the resulting targeted maximum likelihood estimator of the target parameter than current practice. In this article we introduce a new strategy for nuisance parameter estimator selection for targeted maximum likelihood estimators that addresses this challenge by using the log-likelihood of the targeted maximum likelihood estimator (of the relevant density) indexed by the nuisance parameter estimator as the principal selection criterion. The nuisance parameter estimators needed for the targeting step are selected based on the relevant log-likelihood loss function of the resulting targeted maximum likelihood estimator, not on a loss function for the nuisance parameter itself. This approach takes into account the established fit of the initial estimator, and that the resulting estimator of the target parameter is indeed based on the relevant part of the likelihood.

Recognizing that the selected estimator of the nuisance parameter is very much a function of the goodness of fit of the initial estimator led to the development of a new theory of collaborative double robust estimation. The asymptotic linearity theory presented below involves characterizing a true minimal nuisance parameter indexed by the initial estimator limit, g0(Q), that results in an efficient influence curve that is unbiased for the target parameter. This defines the collaborative double robustness of the efficient influence curve. Given a nested sequence of increasingly non-parametric estimators, gδ, there is a gδmin corresponding to g0(Q) which makes the efficient influence curve unbiased for the target parameter. In addition, all estimates of g in the sequence that are more nonparametric than the estimator indexed by δmin, i.e. δ > δmin, also make the efficient influence curve unbiased for the target parameter. These results allow us to establish asymptotic linearity of the collaborative double robust targeted maximum likelihood estimator, under appropriate regularity conditions.

The theory is fascinating, and results in potentially super efficient estimators, the main intuition, in the context of CAR-censored data models, is that the covariates that enter the treatment mechanism and censoring mechanism estimator (i.e., nuisance parameter estimator) used to define the fluctuation model in the targeted maximum likelihood step should explain the difference between the initial estimator, Qn, and the true relevant density, Q0.

Such collaborative double robust estimators involve a variety of choices, including the choice of initial estimator and the choice of collaborative nuisance parameter estimator, but all solve the efficient influence curve equation and all rely on the collaborative nuisance parameter estimator being correctly specified so that the wished unbiasedness of the efficient influence curve is achieved in the limit. We propose using cross-validation w.r.t. a targeted loss function to select among these different collaborative targeted maximum likelihood estimators of the relevant density. In addition, we suggest the square of their influence curve or square of the efficient influence curve as a particularly suitable loss function, corresponding with selection of the estimator with minimal asymptotic variance.

An overview of relevant literature

The construction of efficient estimators of pathwise differentiable parameters in semi-parametric models requires utilizing the so called efficient influence curve, defined as the canonical gradient of the pathwise derivative of the parameter. A fundamental result of the efficiency theory is that a regular estimator is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve. We refer to Bickel et al. (1997), and Andersen et al. (1993). There are two distinct approaches for construction of efficient (or locally efficient) estimators: the estimating equation approach that uses the efficient influence curve as an estimating equation (e.g., one-step estimators based on the Newton-Raphson algorithm in Bickel et al. (1997)), and the targeted MLE that uses the efficient influence curve to define a targeted fluctuation function, and maximizes the likelihood in that targeted direction.

The construction of locally efficient estimators in censored data models in which the censoring mechanism satisfies the so called coarsening at random assumption (Heitjan and Rubin (1991), Jacobsen and Keiding (1995), Gill et al. (1997)) has been a particular focus area. This also includes the theory for locally efficient estimation of causal effects under the sequential randomization assumption (SRA), since the causal inference data structure can be viewed as a missing data structure on the intervention-specific counterfactuals, and SRA implies the coarsening at random assumption on the missingness mechanism, while not implying any restriction on the data generating distribution.

Gill and Robins (2001) present an implicit construction of counterfactuals as a mapping from the observed data distribution, such that the observed data structure augmented with the counterfactuals satisfies the consistency assumption and the SRA. Yu and van der Laan (2002) provide a particular explicit construction of counterfactuals from the observed data structure in terms of quantile-quantile functions, satisfying the consistency assumption and SRA. These results show that, without loss of generality, one can view causal inference as a missing data structure estimation problem. Causal graphs make explicit the real assumptions needed to claim that these counterfactuals are actually the counterfactuals of interest.

Inverse probability of censoring weighted (IPCW) estimators were originally developed to correct for confounding-induced bias in causal effect estimation. Theory for IPCW estimation and augmented locally efficient IPCW-estimator based on estimating functions defined in terms of the orthogonal complement of the nuisance tangent space in CAR-censored data models (including the optimal estimating function implied by efficient influence curve) was originally developed in Robins (1993), Robins and Rotnitzky (1992). Many papers build on this framework (see van der Laan and Robins (2003) for a unified treatment of this estimating equation methodology, and references). In particular, double robust locally efficient augmented IPCW-estimators have been developed (Robins and Rotnitzky (2001b), Robins and Rotnitzky (2001a), Robins et al. (2000b), Robins (2000a), van der Laan and Robins (2003), Neugebauer and van der Laan (2005), Yu and van der Laan (2003)).

Causal inference for multiple time-point interventions under sequential randomization was first addressed by Robins in the eighties: e.g. Robins (1986), Robins (1989).

The popular propensity score methods to assess causal effects of single time point interventions (e.g., Rosenbaum and Rubin (1983), Sekhon (2008), Rubin (2006)) have no natural generalization to multiple time-point interventions and may be inefficient (and less robust) estimators for single time point interventions, relative to the locally efficient double robust estimators such as the augmented IPCW and the targeted MLE. One crucial ingredient of these proposed methods is propensity score estimation in the absence of any knowledge of the outcomes.

Structural nested models and marginal structural models for single and multiple time point static treatment regimens were proposed by Robins as well: Robins (1997b), Robins (1997a), Robins (2000b). Many application papers on marginal structural models exist, involving the application of estimating equation methodology (IPCW and DR-IPCW): e.g., Hernan et al. (2000), Robins et al. (2000a), Bryan et al. (2003), Yu and van der Laan (2003). In van der Laan et al. (2005) history adjusted marginal structural models were proposed as a natural extension of marginal structural models, and it was shown that the latter also imply an individualized treatment rule of interest (a so called history adjusted statically optimal treatment regimen): see Petersen et al. (2005) for an application to the “when to switch” question in HIV research.

Murphy et al. (2001) present a nonparametric estimator for a mean under a dynamic treatment in an observational study. Structural nested models for modeling and estimating an optimal dynamic treatment were proposed by Murphy (2003), Robins (2003), Robins (2005a), Robins (2005b). Marginal structural models for user supplied set of dynamic treatment regimens were developed and proposed in van der Laan (2006), van der Laan and Petersen (2007) and, simultaneously and independently, in a technical report authored by Rotnizky and co-workers (2006), and Robins et al. (2008). van der Laan and Petersen (2007) also includes a data analysis application of these models to assess the mean outcome under a rule that switches treatment when CD4-count drops below a cut-off, and the optimal cut-off is estimated as well. Another practical illustration in sequentially randomized trials of these marginal structural models for realistic individualized treatment rules is presented in Bembom and van der Laan (2007).

Unified loss based learning based on cross-validation was developed in- van der Laan and Dudoit (2003), including construction of adaptive minimax estimators for infinite dimensional parameters of the full data distribution in CAR-censored data and causal inference models: see also van der Laan et al. (2006), van der Vaart et al. (2006), van der Laan et al. (2004), Dudoit and van der Laan (2005), Keleş et al. (2002), Sinisi and van der Laan (2004).

The oracle results for the cross-validation selector inspired a unified super learning methodology mapping a library of candidate estimators into a weighted combination with optimal cross-validated risk, thereby resulting in an estimator which either achieves the best possible parametric model rate of convergence up till a log-n-factor, or it is asymptotically equivalent with the oracle selected estimator that selects the best set of weights for the given data set. These results rely on the assumption that the loss function is uniformly bounded and that the number of candidates in the library is polynomial in sample size (van der Laan et al. (2007), Polley and van der Laan (2009)).

The super learning methodology applied to a loss function for the G-computation formula factor, Q0, in causal inference, or the full-data distribution factor, Q0, of the observed data distribution in CAR-censored data models, provides substitution estimators of the target parameter ψ0. However, although these super learners of Q0 are optimal w.r.t. the dissimilarity with Q0 implied by the loss function, the corresponding substitution estimators will be overly biased for a smooth parameter mapping Ψ. This is due to the fact that cross-validation makes optimal choices w.r.t. the (global) loss-function specific dissimilarity, but the variance of Ψ() is of smaller order than the variance of itself.

van der Laan and Rubin (2006) integrates the loss-based learning of Q0 into the locally efficient estimation of pathwise differentiable parameters, by enforcing the restriction in the loss-based learning that each candidate estimator of Q0 needs to be a targeted maximum likelihood estimator (thereby, in particular, enforcing each candidate estimator of Q0 to solve the efficient influence curve estimating equation). Another way to think about this is that each loss function L(Q) for Q0 has a corresponding targeted loss function L(Q*), with Q* the targeted MLE algorithm applied to initial Q, and we apply the loss-based learning to the latter targeted version of the loss function L(Q). Rubin and van der Laan (2008) propose the square of efficient influence curve as a valid and sensible loss function L(Q) for selection and estimation of Q0 in models in which g0 can be estimated consistently, such as in randomized controlled trials.

The implications of this targeted loss based learning are that Q0 is estimated optimally (maximally adaptive to the true Q0) w.r.t. the targeted loss function L(Q*) using the super learning methodology, and due to the targeted MLE step the resulting substitution estimator of ψ0 is now asymptotically linear as well if the targeted fluctuation function is estimated at a good enough rate (and only requiring adjustment by confounders not yet accounted for by initial estimator: see collaborative targeted MLE): either way, bias reduction will occur as long as the censoring/treatment mechanism is estimated consistently. Targeted MLE have been applied in a variety of estimation problems: Bembom et al. (2008), Bembom et al. (2009) (physical activity), Tuglus and van der Laan (2008) (biomarker analysis), Rosenblum et al. (2009) (AIDS), van der Laan (2008a) (case control studies), Rose and van der Laan (2008) (case control studies), Rose and van der Laan (2009) (matched case control studies), Moore and van der Laan (2009) (causal effect on time till event, allowing for right-censoring), van der Laan (2008b) (adaptive designs, and multiple time point interventions), Moore and van der Laan (2007) (randomized trials with binary outcome). We refer to van der Laan et al. (September, 2009) for collective readings on targeted maximum likelihood estimation.

1.1. Advantages of TMLE relative to augmented IPCW estimating function methodology

Even though the augmented IPCW-estimator is also double robust, targeted maximum likelihood estimation has the following important advantages relative to estimating equation methods such as the augmented-IPCW estimator: 1) the TMLE is a substitution estimator and thereby respects global constraints of model such as that one might be estimating a probability in [0, 1] or a (monotone) survival function at a finite set of points, 2) since, given an initial estimator, the targeted MLE step involves maximizing the likelihood along a smooth parametric targeted fluctuation model, it does not suffer from multiple solutions of a (possibly non-smooth in the parameter) estimating equation, 3) the TMLE does not require that the efficient influence curve can be represented as an estimating function in the target parameter, and thereby applies to all pathwise differentiable parameters 4) it can use the cross-validated log-likelihood (of the targeted maximum likelihood estimator), or any other cross-validated risk of an appropriate loss function for the relevant factor Q0 of the density of the data, as principle criterion to select among different targeted maximum likelihood estimators indexed by different initial estimators or targeted maximum likelihood steps.

The latter allows fine tuning of initial estimator of Q0 as well as the fine tuning of the estimation of the unknowns (e.g., censoring/treatment mechanism g0) of the fluctuation function applied in the targeted maximum likelihood step, thereby utilizing the excellent theoretical and practical properties of the loss-function specific cross-validation selector. In particular, this property results in a collaborative double robust, and possibly super efficient, TMLE, as introduced and studied in this article, thereby adding theoretical and practical properties that go beyond the double robustness and efficiency. In contrast, the augmented-IPCW estimator cannot be evaluated based on a loss function for Q0 alone: the augmented-IPCW estimator is not a substitution estimator Ψ(Qn*) for some Qn* of Q0, as is the TMLE. Instead the augmented-IPCW estimator ψn is a certain function of an initial Qn and gn, where the performance of gn is scored based on the orthogonal loglikelihood of g0, for which a good fit can result in bad fit of ψ0. In trying to address these shortcomings of the augmented IPCW-estimators we converged to the targeted MLE and, subsequent refinement, the collaborative targeted MLE.

1.2. Organization of article

In Section 2 we present a description of the two stage collaborative targeted maximum likelihood methodology, the first stage representing the initial estimator, and the second stage representing the construction of a sequence of targeted maximum likelihood estimators indexed by increasingly nonparametric nuisance parameter estimators, and log-likelihood based cross-validation to select among the TMLEs and thereby select the nuisance parameter estimator. The second stage of the C-DR-TMLE can be viewed as a mapping from an initial estimator of the relevant density into a particular estimator of the nuisance parameter needed in the fluctuation function, and corresponding targeted maximum likelihood estimator using this nuisance parameter estimator in the targeted maximum likelihood step. We also provide the rational for the consistency of this C-DR-TML estimator under the collaborative double robustness assumption, relying on the earlier established oracle property of the log-likelihood-based cross-validation selector, which itself relies on the assumption that the log-likelihood loss function is uniformly bounded.

In Section 3 we define and study collaborative double robustness of the efficient influence curve. In particular, we define true nuisance parameters depending on a choice of relevant density (i.e., limit of initial estimator), which make the efficient influence curve an unbiased function for the target parameter. A collaborative targeted maximum likelihood estimator solves the efficient influence curve equation and relies on the nuisance parameter estimator to consistently estimate this true initial estimator-specific nuisance parameter or more nonparametric nuisance parameter. We also discuss alternative collaborative nuisance parameter estimators that can be used in the targeted MLE or in estimating equation methodology.

In Section 4 we prove an asymptotic linearity theorem for such collaborative double robust estimators, such as the collaborative double robust targeted maximum likelihood estimator, and discuss the conditions and implications of this theorem. In particular, this theorem provides us with influence curve based confidence intervals and tests of null hypotheses. A study of the influence curve teaches us that the C-DR-TMLE can be super efficient.

In Section 5 we consider targeted loss functions that can be used to select among different C-DR-TMLEs indexed by different initial estimators and choices of nuisance parameter estimator. These targeted loss functions can also be used to build the candidate nuisance parameter estimators within a C-DR-TMLE estimator, and thereby to construct the sequence of corresponding candidate targeted maximum likelihood estimators in the collaborative targeted maximum likelihood algorithm. Even though we enforce the use of a log-likelihood-based cross-validation selector to select among these candidate targeted maximum likelihood estimators in the C-DR-TMLE algorithm, we propose a penalized log-likelihood loss function that is more targeted towards the target parameter in the case the target parameter is borderline identifiable. This penalty is particularly important to robustify the estimation procedure in situations in which the variance of the efficient influence curve easily blows up to infinity for certain realization of the nuisance parameter estimator (e.g., close to zero inverse weights).

In section 6 we consider estimation of a causal effect in a marginal structural model, and define the collaborative double robust targeted penalized maximum likelihood estimator of the unknown parameters of the marginal structural model. In the companion paper we present a simulation study and data analysis for the C-DR-TMLE of the causal effect EY (1) − Y (0) of a binary treatment A, adjusting for baseline confounders W, based on observing n i.i.d. copies of a time-ordered data structure (W, A, Y = Y (A)).

We end this article with a discussion in Section 7, which provides a global overview.

1.3. An example to keep in mind

Although the methodology is completely general, throughout the paper we ground the discussion by referring to the following example, estimation of the additive causal effect of a binary treatment on an outcome. This example is rich enough to illustrate the ideas and methods, and has been used intensively in the causal inference literature. In this subsection we provide the notation, and objects required to define the C-DR-TMLE.

Let O = (W, A, Y = Y (A)) ∼ P0 be an observed missing data structure on full data structure X = (W, Y (0), Y (1)) with missingness binary variable A ∈ {0, 1}. For concreteness, we consider the case that Y is binary. Suppose the model for P0 is nonparametric, that the missingness mechanism g0(1 | X) = P0(A = 1 | X) = P0(A = 1 | W) satisfies the coarsening at random assumption, and that our target parameter is the causal additive risk

Ψ(P0)=ΨF(Q0)=E0Y(1)Y(0)                        =E0{E0(Y|A=1,W)E0(Y|A=0,W)},

where Q0 = (Q01, Q02) denotes the marginal distribution of W and conditional distribution of Y, given A, W, respectively. For notational convenience, we will suppress the F from “Full Data Parameter” in ΨF. We note that dP0(O) = Q0(O)g0(A | X) = Q01(W)Q02(Y | A, W)g0(A | X).

The efficient influence curve of Ψ at dP0 = Q0g0 is given by

D*(Q0,g0)=hg0(A,W)(YQ0(A,W))+Q0(1,W)Q0(0,W)Ψ(Q0),

where Q0(A, W) = EQ0 (Y | A, W), hg0(A, W) = A/g0(1 | W) – (1 – A)/g0(0 | X). We note that hg0 also plays the role of the clever covariate in the targeted maximum likelihood fluctuation of the conditional distribution of Y, given A, W: log Q(ε)/(1 – Q(ε)) = log Q/(1 – Q) + εhg0.

We also note that an alternative representation of the efficient influence curve is given by the augmented IPCW-representation:

D*(Q0,g0)=(Ag0(1|X)1Ag0(0|X))YΨ(Q0)                       (Ag0(1|X)1)Q0(1,W)+(1Ag0(0|X)1)Q0(0,W)                   =DIPCW(g0,ψ0)DCAR(Q0,g0),

where DIPCW(g0, ψ0) = (A/g0(1) – (1 – A)/g0(0))Y – Ψ(Q0) is the IPCW-estimating function, and DCAR(Q0, g0) is its projection onto TCAR defined as the sub-Hilbert space of L02(P0) consisting of all functions of (A, W) with conditional mean zero, given W. Here L02(P0) is the Hilbert space of functions of O endowed with inner product 〈h1, h2P0 = EP0h1(O)h2(O).

2. Collaborative double robust targeted maximum likelihood estimators

We will describe the proposed collaborative double robust targeted maximum likelihood estimators in the context of censored data models, but the generalization to general semi-parametric models is immediate. We first review targeted maximum likelihood estimation and loss-based cross-validation in order to provide a foundation for the explanation of C-DR-TMLE.

2.1. Targeted MLE in CAR-censored data model

Let O = Φ(C, X) be a censored data structure on a full data random variable X, where C denotes the censoring variable. We assume coarsening at random so that the observed data structure OP0 has a probability distribution whose density w.r.t an appropriate dominating measure factors as dP0(O) = Q0(O)g0(O | X), where Q0 is the part of the distribution of X that is identifiable, and g0 denotes the conditional probability distribution of O, given X, which we often refer to as the censoring mechanism. By CAR, we have g0(O | X) = h(O) for some measurable function h. If C is observed itself, then g0 denotes the conditional distribution of C, given X.

A semiparametric model M for the probability distribution P0 of the observed data structure O is implied by a model Q for the full-data distribution factor Q0, and a model G for the censoring mechanism g0. The conditional distribution of O, given X, is identified by the conditional distribution of C, given X. For notational convenience, we will denote both with g0. Let O1, . . . , On be n independent and identically distributed (i.i.d.) observations of the experimental unit O with probability distribution P0M. Let Pn be the empirical probability distribution of O1, . . . , On which puts mass 1/n on each of the n observations.

Let Ψ : M → IRd be a d-dimensional parameter that is path-wise differentiable at each PM (w.r.t. a class of finite dimensional paths through P) with canonical gradient D*(P): i.e., for a rich class of parametric submodels {P(δ) : δ} ⊂ M through P at δ = 0 with score SL02(P), L02(P) being the Hilbert space of mean zero functions of O endowed with inner product 〈h1, h2P = Eh1h2(O) (i.e., the covariance operator), we have

ddδΨ(P(δ))|δ=0=EPD*(P)S.

Because D*(P) is an element of the Hilbert space in L02(P) generated by all scores S of these parametric submodels (the so called tangent space), it is the canonical gradient D*(P), also called the efficient influence curve at P. Any D(P) such that EPD*(P)S = EPD(P)S for all scores S in the tangent space is called a gradient of the path-wise derivative. Thus the canonical gradient is the unique gradient that is an element of the tangent space. For the sake of illustration, it is assumed that Ψ(PQ,g) = ΨF (Q) for some ΨF : i..e, the parameter of interest is a parameter of the full data distribution of X. The efficient influence curve D*(P) at P with dP = Qg will also be denoted with D*(Q, g).

The Targeted Maximum Likelihood estimator indexed by initial (Q, g): Given any PM with dP = Qg, let {P(ε) : ε} ⊂ M be a submodel with finite dimensional parameter ε, dominated by P, through P at ε = 0, and whose scores at ε = 0 span a finite dimensional space within L02(P) that includes the (components of the) efficient influence curve D*(P) = D*(Q, g). Because our parameter of interest is a parameter of Q0 and the factorization dP0 = Q0g0, it follows that such a fluctuation model can be chosen to only fluctuate Q with a submodel Qg(ε) ⊂ Q, where this fluctuation model will be indexed by g. Let dP(ε) = Qg(ε)g be such a fluctuation model with fluctuation parameter ε. In van der Laan and Rubin (2006) we also consider fluctuation models that vary both Q and g.

At a given (Q, g), one can now define a k-th step targeted maximum likelihood version Qgk(Pn) of Q0 as follows. Let L(Q) = − log Q be the log-likelihood loss. Firstly, let Qg1(Pn)=Qg(εn1), where

εn1=argminεPnL(Qg(ε)).

Here we use the notation Pf = ∫ f(o)dP(o). In general, Qgnk=Qgk(Pn)=Qgk1(Pn)(εnk), where

εnk=argminεPnL(Qgk1(Pn)(ε)),k=1,.

One iterates this updating till εnk equals zero within a user supplied precision. The final update is refered to as the (iterative) targeted maximum likelihood estimator Qgn*=Qg*(Pn), indexed by the initial starting point (Q, g).

The Targeted Maximum Likelihood estimator indexed by initial estimator and estimator of nuisance parameter: The above procedure, applied to an initial estimator Qn0, and an estimator gn of g0, defines the k-th step targeted maximum likelihood estimator and its limit in k, Qn*, as introduced and analyzed in van der Laan and Rubin (2006). By definition, the targeted maximum likelihood estimator ( Qn*, gn) solves the efficient influence curve equation:

0=PnD*(Qn*,gn).

Remark: Cross-validated initial estimator in the targeted MLE. If the initial estimator is an over-fit, then the bias reduction of the targeted MLE algorithm is not as effective. To protect against such cases one can use a cross-validated initial estimator. Specifically, let Bn ∈ {0, 1}n be a random variable that splits the sample in a training sample {i : Bn(i) = 0} and validation sample {i : Bn(i) = 1}, and, let Pn,Bn0, Pn,Bn1, denote the empirical distribution of the training and validation sample, respectively. The above targeted MLE iterative algorithm is now given by: Qgnk=Qgk(Pn)=Qgk1(Pn)(εnk), where

εnk=argminεEBnPn,Bn1L(Qgk1(PnBn0)(ε)),k=1,.

2.2. Loss-based cross-validation to select among (collaborative) targeted maximum likelihood estimators

Consider a loss function L*(Q) for Q0 that satisfies

Q0=argminQP0L*(Q).

Or, more precisely, we only require that Ψ (arg minQ P0L*(Q)) = Ψ(Q0). An example of such a loss function is the the log-likelihood L*(Q)(O) = L(Q) = − log Q(O). Each loss function has a corresponding dissimilarity d(Q, Q0) = P0{L*(Q) – L*(Q0)}.

Given different targeted maximum likelihood estimators, PnQ^k*(Pn), of Q0, for example, indexed by different initial estimators, we can use a preferred loss-function based cross-validation to select among them. Specifically, let Bn ∈ {0, 1}n be a random variable that splits the sample in a training sample {i : Bn(i) = 0} and validation sample {i : Bn(i) = 1}, and, let Pn,Bn0, Pn,Bn1, denote the empirical distribution of the training and validation sample, respectively. The loss-function based cross-validation selector is now defined by

k^(Pn)=argminkEBnPn,Bn1L*(Q^k*(Pn,Bn0)).

The resulting targeted maximum likelihood estimator is then given by

Q^n*=Q^k^(Pn)*(Pn).

Cross-validation selector: Consider a preferred loss function that satisfies

supQVARP0{L*(Q)L*(Q0)}P0{L*(Q)L*(Q0)}M2, (1)

and that is uniformly bounded

supO,Q|L*(Q)L*(Q0)|(O)<M1<,

where the supremum is over the support of P0, and over all possible candidate estimators of Q0 that will ever be considered. The first property (1) applies to the log-likelihood loss function and any weighted squared residual loss function, among others. The property (1) is essentially equivalent with the assumption that the loss-function based dissimilarity d(Q, Q0) = P0L*(Q) – L*(Q0) is quadratic in a distance between Q and Q0. The property (1) has been proven for log-likelihood loss functions and weighted L2-loss functions, and is in essence equivalent with stating that the loss function implies a quadratic dissimilarity d(Q, Q0) (see van der Laan and Dudoit (2003)). If this property does not hold for the loss function, the rates 1/n for second order terms in the below stated oracle inequality reduce to the rate 1/n.

For such loss functions, the cross-validation selector satisfies the following (so called) oracle inequality: for any δ > 0,

EBn{P0L(Q^k^(Pn,Bn0)L(Q0)}(1+2δ)EBnminkP0{L(Q^k(Pn,Bn0))L(Q0)}                                                 +2C(M1,M2,δ)1+logK(n)np,

where the constant C(M1, M2, δ) = 2(1 + δ)2(M1/3 + M2/δ) (see page 25 of van der Laan and Dudoit (2003)). This result proves (see van der Laan and Dudoit (2003) for the precise statement of these implications) that, if the number of candidates K(n) is polynomial in sample size, then the cross-validation selector is either asymptotically equivalent with the oracle selector (based on sample of size of training samples, as defined on right-hand side of above inequality), or it achieves the parametric rate log n/n for convergence w.r.t. d(Q, Q0) ≡ P0{L(Q) – L(Q0)}. So in most realistic scenarios, in which none of the candidate estimators achieve the rate of convergence one would have with an a priori correctly specified parametric model, the cross-validated selected estimator selector performs asymptotically exactly as well (up till constant!) as the oracle selected estimator. These oracle results are generalized for estimated loss functions Ln*(Q) that approximate a fixed loss function L*(Q). If arg minQ P0Ln*(Q)Q0, then the oracle inequality also presents second order terms due to the estimation of the loss function.

This preferred loss function based cross-validation can now be used to select among different candidate targeted maximum likelihood estimators indexed by different initial estimators, and possibly different censoring mechanism estimators. Specifically, we will use a preferred targeted loss function to select among different collaborative targeted maximum likelihood estimators, which are just special targeted maximum likelihood estimators in the sense that gn is estimated in collaboration with the initial Qn.

For a given loss function L(Q), and an estimator (Pn), we will refer to PnL((Pn)) as the entropy of the fit (Pn). Similarly, for a loss function L1(g) of g0, and an estimator ĝ(Pn), we will refer to PnL1(ĝ(Pn)) as the entropy of ĝ(Pn). Both the preferred loss function for Q0, as well as this loss function L1 for g0 represent important choices. For example, one likes to select the loss function L1 so that the dissimilarity P0{L1(g) – L1(g0)} measures strongly how well g approaches the optimal fluctuation function implied by g0. In other words, we need to keep in mind how g is used, namely that it is used to fit the wished fluctuation function implied by g0. For example, if the clever covariate defining the fluctuation function is given by AE(A/σ2(A, W) | W)/E(12(A, W) | W), as in the semiparametric regression model E(Y | A, W) – E(Y | A = 0, W) = βA, one might want to define as loss function L(θ1(g), θ2(g)) = w1(A/σ2(A, W) – θ1(W))2 + w2(1/σ2(A, W) – θ2(W))2, for weight-functions w1, w2 (functions of W), and θ1, θ2 representing the numerator and denominator of the conditional expectations in the clever covariate. Similarly, the preferred loss function for Q0 can be tuned to represent a dissimilarity d(Q, Q0) that measures strongly how well Ψ(Q) approximates Ψ(Q0). We discuss such choices in more detail in a later section.

2.3. Building a collaborative estimator of censoring mechanism/nuisance parameter

A C-TMLE estimator is constructed by building a family of candidate estimators, then choosing the best among them, using cross-validation to drive the choice to Q0. However, we also rely upon a loss function when building each candidate nuisance parameter (e.g. censoring mechanism) estimator, and it is not necessary that these two loss functions be the same. In fact, as part of building a collaborative nuisance parameter estimator in the collaborative T-MLE procedure, we couple an increase in the log-likelihood entropy of the targeted maximum likelihood estimator with an increase in the g0-loss function specific entropy of the corresponding nuisance parameter estimator. In this manner, we arrange that, for increasing sample size, the cross-validation selector will be driven towards the selection of targeted maximum likelihood estimator with an initial estimator closer to Q0 and simultaneously a more and more nonparametric estimator of g0 (thereby achieving the full wished bias reduction in the limit).

That is, given a collection of candidate estimators of g0, ordered by empirical fit w.r.t. a loss function for g0 such as the log-likelihood, we will build a sequence of targeted maximum likelihood estimators of Q0 ordered by log-likelihood entropy and indexed by increasingly nonparametric estimators of g0, where the extend of being nonparamatric is measured by the L1-entropy. Subsequently, we use the cross-validated log-likelihood for Q0 to choose among these candidate targeted maximum likelihood estimators.

There are many possible approaches that construct such an ordered sequence of targeted maximum likelihood estimators in which a next element in the sequence has both a higher entropy for the Q0-loss as well as a higher g0-loss entropy for its corresponding censoring estimator. Of course, the strict ordering is not what drives the properties of the resulting estimator, but the sequence should represent an approximately monotone function in the log-likelihood entropy of Q0 and L1-entropy of g0.

This procedure represents one particular approach for constructing a targeted maximum likelihood estimator that uses a collaboratively estimated nuisance parameter. We refer to any algorithm that maps into a targeted maximum likelihood estimator that uses a collaborative nuisance parameter estimator (relative to the Q-estimator), as a collaborative targeted maximum likelihood estimator.

2.4. A template for collaborative targeted MLEs

We present the following template providing a class of collaborative targeted maximum likelihood estimators.

Initial estimator of Q0: Build an estimator Qn of Q0, such as a super learner based on the log-likelihood loss function L(Q), or any other loss function.

Preferred loss function for Q0: Let L*(Q) be a (targeted) loss function for Q0. We note that the loss function can also be data dependent, and, in particular, the choice of loss function can depend on an initial estimator Qn of Q0, and corresponding collaborative estimator gn (see DR-IPCW loss functions in van der Laan and Dudoit (2003), and our section on targeted loss functions).

Loss function for g0: Let L1(g) be a loss function for g0.

Candidate estimators of censoring mechanism/nuisance parameter: For each δ in an index set, let g be a candidate estimator of g0. Let d(δ) = PnL1(g) denote the entropy of g, thereby measuring how data adaptive g is, and for a maximal value d(δ) or for d(δ) approximating a maximum value we have that g is actually a consistent estimator of g0.

Select ordered sequence of entropies for censoring mechanism (nuisance parameter) estimator

Select a sequence d0 > d1 > . . . > dK.

Select initial targeted maximum likelihood estimator: We start out with a gn0 with entropy larger than d0 and a corresponding targeted maximum likelihood estimator Qn*0=Qngn0* applied to initial estimator Qn. We refer to the pair ( gn0, Qn*0) as the initial targeted maximum likelihood estimator in the sequence of targeted maximum likelihood estimators that will be constructed below.

Construct next targeted maximum likelihood estimator in sequence

We are given an current initial estimator Qnk, a current targeted maximum likelihood estimator ( gnk, Qnk*) in our sequence of targeted maximum likelihood estimators, with Qnk* being the targeted maximum likelihood estimator applied to current initial estimator Qnk and nuisance parameter estimator gnk. The current nuisance parameter estimator gnk has entropy larger than dk. We are also given k, and thereby two corresponding entropy values dk > dk+1. (we note that the initial estimator does not get updated at each step k, but it corresponds with one of the elements in current sequence of targeted maximum likelihood estimators)

Consider an algorithm that searches among a specified set of candidate estimators g with {δ : dk > d(δ) > dk+1)} with the goal of minimizing the preferred loss L*-fit of the targeted maximum likelihood estimator, applied to initial Qnk:

δPnL*(Qnδk*). (2)

Recall that Qnδk* denotes the targeted maximum likelihood estimator that uses the optimal fluctuation model identified by censoring mechanism gnδ applied to initial estimator Qnk. Let gn be the selected estimator. If either the fit is improved relative to current T-MLE Qnk*,

PnL*(Qnδnk*)<PnL*(Qnk*),

or the above holds for the log-likelihood loss function L(Q) = − log Q on which the targeted maximum likelihood algorithm operates, then we accept δn, and thereby the next targeted maximum likelihood estimator, gnk+1=gnδn, Qnk+1*=Qnδnk*, in the sequence we are constructing. The algorithm now delivered its next k + 1-th targeted maximum likelihood estimator. We set k = k + 1, keep the initial estimator Qnk unchanged, and the current targeted maximum likelihood estimator ( gnk, Qnk*) is now updated.

If this monotonicity condition fails to hold for both the log-likelihood fit as well as the preferred loss function fit, then we reject this δn, and update the initial estimator Qnk by setting it equal to the current targeted maximum likelihood estimator Qnk*. We now, rerun the above procedure with initial Qnk=Qnk*, and same dk > dk+1. This time the resulting δn will always be accepted since the log-likelihood fit of a targeted maximum likelihood estimator (a maximum likelihood fluctuation of an initial estimator) is larger than the log-likelihood of initial estimator. So the algorithm now delivers the next k + 1-th targeted maximum likelihood estimator gnk+1=gnδn, Qnk+1*=Qnδnk* in its sequence. We set k = k + 1, the initial estimator is still set at Qnk, and the current targeted maximum likelihood estimator ( gnk, Qn*k) is now updated (the last one in sequence so far).

k-th step collaborative targeted maximum likelihood estimator:

The above algorithm maps a running current initial estimator, a current targeted MLE ( gnk, Qn*k) (the lastly constructed in current sequence), into a new targeted MLE ( gnk+1, Qn*k+1), and possible updated current initial estimator. We start this algorithm with k = 0, and iterate it. This now defines the k-th step collaborative targeted maximum likelihood estimator ( gnk, Qn*k), k = 0, 1, 2, . . . , K.

We are guaranteed that the fit of Qn*k is either increasing w.r.t. the preferred loss function (most likely, since that is the loss we minimize at each step), or it is increasing w.r.t the log-likelihood loss used to define the targeted maximum likelihood step, relative to previous targeted maximum likelihood estimator Qn*k1. In addition, the corresponding gnk has a L1-fit that is larger than the L1-fit of gnk1. At every step in which the initial estimator is updated, we also know that the log-likelihood fit is increasing.

Cross-validation to select number of iterations k in k-th step C-TMLE:

Given this sequence of k-th step collaborative targeted maximum likelihood estimators Pn(Qnk*=)Q^k*(Pn), using estimator gnk, it remains to select k, k = 0, 1, . . . , K.

We select k based on the cross-validated log-likelihood:

kn=argmaxkEBnPn,Bn1L(Q^k*(Pn,Bn0)),

where the random vector Bn ∈ {0, 1}n denotes a cross-validation scheme such as V-fold cross-validation, and Pn,Bn0, Pn,Bn1 are the empirical probability distributions of the training sample {i : Bn(i) = 0} and validation sample {i : Bn(i) = 1}, respectively, as identified by the split vector Bn.

This finalizes the mapping from the initial estimator Qn, and the data, into a collaborative estimator of the censoring mechanism, gn=gnkn. We refer to Qn*=Qnkn*, paired with collaborative estimator gn, as the collaborative targeted maximum likelihood estimator of Q0.

The Collaborative (Double Robust) Targeted Maximum Likelihood Estimator: The corresponding targeted maximum likelihood estimator of ψ0 = ΨF (Q0) is given by the substitution estimator

Ψ(Qn*)=Ψ(Qnkn*)=Ψ(Q^kn*(Pn)).

We refer to this estimator as the collaborative (double robust) targeted maximum likelihood estimator (C-DR-TMLE or C-TMLE) of ψ0, and we recall that it is paired with a collaborative estimator gn.

C-TMLE solves an efficient influence curve equation: Since the C-TMLE is a targeted maximum likelihood estimator Qnkn*, applying the fluctuation function with censoring mechanism estimator gn=gnkn to the estimator Qnkn, it solves the efficient influence curve equation:

0=PnD*(Qn*,gn).

This is a fundamental property of the collaborative targeted MLEs driving the targeted bias reduction w.r.t. the target parameter of interest,ψ0.

Selection among candidate C-TMLEs: The collaborative targeted maximum likelihood estimator depends on a choice of initial estimator Qn0, and choices that concern the second stage. As a consequence, one might have a set of collaborative targeted maximum likelihood estimators ( Qnj*, gnj) indexed by such choices, j = 1, . . . , J. We can now select among these estimators Qnj* based on loss-based cross-validation using the preferred loss function L* for Q0.

Selection based on empirical efficiency maximization: Since, under regularity conditions of our asymptotic linearity theorem, each j-specific C-TMLE is asymptotically linear with influence curve ICj( Qj*, g0j) (equal to D*( Qj*, g0j) plus a contribution from gnj), we can select j as the minimizer of a (cross-validated) estimate of the variance of ICj( Qj*, g0j), or, if ψ0 has dimension larger than 1, then we can minimize an estimate of the variance of a function of ψ0. One could here ignore the contribution from gnj and thus use the cross-validated or empirical variance of the efficient influence curve at the collaborative targeted maximum likelihood estimator:

jn=argminjEBnPn,Bn1{D*(Q^j*(Pn,Bn0),g^j(Pn,Bn0))}2.

Generalization. The above C-TMLE can also be called the collaborative minimum loss estimator. The loss function L(Q) needs to satisfy that the derivative of εL(Qg(ε)) at ε = 0 for a suitably constructed path {Qg(ε) : ε} equals the efficient influence curve D*(Q, g), where the efficient influence curve at P only depends on Q(P) and g(P), while the target parameter Ψ(P) = ΨF (Q(P)) depends on P only through Q(P). No further structure is needed for the above template (such as dP = Q*g, or CAR-censored data structure).

2.5. The rationale of the consistency of the collaborative-TMLE

The C-TMLE procedure starts with an initial estimator Qn of Q0. Suppose that the sequence constructed in the C-TMLE template consists of a finite number K of targeted maximum likelihood estimators Qnk*. By construction, the last targeted maximum likelihood estimator in this sequence uses a censoring mechanism estimator that is nonparametric (maximal g0-entropy): i.e, the nuisance parameter estimator gnK as selected by the K-th step C-TMLE converges to the true g0. We also know that gnk is increasingly nonparametric in k, k = 1, . . . , K.

For simplicity, we also assume that the k-th targeted maximum likelihood estimator in the sequence is obtained by applying the targeted maximum likelihood algorithm to the previous targeted maximum likelihood estimator in sequence. This is not necessary, since we can apply the argument to the subsequence for which that is true (the elements in the sequence at which the targeted maximum likelihood update is actually carried out), but it simplifies the presentation.

Consider the limits Qkgk* of the targeted maximum likelihood estimators Qn*k in our sequence, where gk is the limit of gnk, k = 1, . . . , K, and thus gK = g0. We also know that PnlogQnkgnk* is increasing in k, by the fact that each element in the sequence is a targeted maximum likelihood estimator applied to previous element in sequence (as initial estimator in the T-MLE algorithm). Therefore, P0logQkgk* is non-decreasing in k. As discussed in introduction, if the log Q is uniformly bounded in all its candidates Q, then the cross-validation selector of k is asymptotically equivalent with the oracle selector k˜n=argmaxP0logQkgnk*. For n large enough, this oracle selector behaves as k˜=argmaxkP0logQkgk*, where this maximum might be non-unique. One maximum is obtained at k = K, giving P0logQKg0* and, we know that Ψ(QKg0*)=ψ0. So if = K, then the c-tmle will be consistent for ψ0. Suppose that is actually smaller than K. Then we have, suppressing the g’s in the notation,

P0logQk˜*=P0logQk˜+1*==P0logQK*.

We know that Q*k+1 is a T-MLE with Q*k as initial. So the above equalities are only possible if Q*k+1 = Q*k for k = , . . . , K – 1. Thus QK*=Qk˜*. Since QK* is a targeted MLE at nuisance parameter g0, it follows that εP0logQK,g0*(ε) is maximized at ε = 0: compare with PnlogQnK*(ε) is maximized at ε = 0 by definition of the T-MLE algorithm. Since we just showed that QK*=Qk˜*, it also follows now

εP0logQk˜,g0*(ε)

is maximized at ε = 0. In particular, this means that the derivative at ε = 0 equals zero, giving us:

0=P0D*(Qk˜*,g0).

However, the efficient influence curve typically satisfies that P0D*(Q, g0) = 0 implies Ψ(Q) = ψ0, which then implies Ψ(Qk˜*)=ψ0. Thus Ψ(Qnk˜*) is consistent, and thereby Ψ(Qnkn*) is consistent.

Figure 1 illustrates the collaborative nature of the construction of a sequence of increasingly data-adaptive nuisance parameter estimators, {gn1,,gnK}, and its relation to the performance of the initial estimator. We generated 5000 observations of O = (W, A, Y) from data generating distribution dP0 = Q0g0 defined as:

logit(g0(A|W))=.15W1+.1W2+W3W4          Q0(A,W)=A+3W16W2+4W35W5+3W4

where W1 through W5 are independent random variables ∼ N(0, 1), Y = Q0(A, W) + ε, εN(0, 1), and g0 is the conditional density of A given confounding variables W = {W1, W2, W3, W4, W5}. We applied the C-TMLE to estimate the effect of binary treatment A on outcome Y, adjusting for W, defined as ψ0 = EW (E(Y | A = 1, W) – E(Y | A = 0, W)).

Figure 1:

Figure 1:

Construction of a sequence of nuisance parameter estimators based on a poor initial fit of the density (top) and a good initial fit for the density (bottom). Kernel estimates of true densities Q0 and g0 are shown in gray.

A kernel density estimator was applied to Y and to the predicted values of two initial estimators of Q0 = E0(Y | A, W), which we denote with Q^n,poor0, and Q^n,good0, respectively. These estimators were obtained with the D/S/A algorithm (Sinisi and van der Laan, 2004), a data-adaptive machine learning approach to model selection that was set to search over all second degree polynomials of size six. The kernel density estimates are displayed in plots on the left hand side of the figure.

In addition, we plotted the kernel density estimates of the predicted values of each set of the collaboratively-constructed candidate ĝ estimators, and we can compare them with the density of the true predictions g0(1 | W) = P0(A = 1|W). These are plotted on the right hand side of the figure, overlaid with the density estimator applied to the true values g0(1 | W). When the initial fit of Q0 is poor, the nuisance parameter estimator gnk converges quickly to g0 in k, and the selected candidate estimator closely approximates g0. Plots in the bottom half of the figure shows the behavior of the C-TMLE procedure when Qn0 is a good estimate of Q0. When the initial fit of Q0 is good, the nuisance parameter estimator grows slowly towards g0, and a candidate estimator that estimates a true treatment mechanism that adjusts for fewer covariates than the true treatment mechanism g0 that was used to generate the data.

2.6. Revisiting the additive causal effect example

Recall that the targeted maximum likelihood estimator applied to an estimator, Qn, of P(Y = 1 | A, W) is obtained by running a univariate logistic regression of Y , with offset the initial estimator, on an estimate of the univariate clever covariate hg0(A, W) = A/g0(1 | W) – (1 – A)/g0(0 | W), implied by the treatment mechanism estimator, using an estimator gn of treatment mechanism P(A = 1 | W).

The collaborative targeted maximum likelihood estimation procedure starts with computing Qn0, an initial estimator of P(Y = 1 | A, W) using super learning, and then collaboratively generating a sequence of targeted maximum likelihood estimators. These use increasingly nonparametric estimators of g0, applied to subsequent targeted maximum likelihood updates of the initial estimator (as needed to guarantee the monotonicity in fit). In this way the sequence of constructed targeted maximum likelihood estimators has increasing log-likelihood fit. The selection of the sequence of increasingly nonparametric treatment mechanism estimators was based on maximizing the fit of the corresponding targeted maximum likelihood estimators of P(Y = 1 | A, W), as outlined in our template, thus very much driven by the outcome data. Likelihood based cross-validation selects the wished targeted maximum likelihood estimator, with its paired treatment mechanism estimator, from this sequence. It is assumed that the resulting selection of the estimator of g0 is nonparametric enough so that the collaborative double robustness of the efficient influence curve as presented in next section is utilized, and, thereby, that our asymptotic linearity theorem in later section can indeed be applied.

A collaborative targeted maximum likelihood estimator constructed in this manner has made every effort to make the estimator of the additive causal effect as unbiased as possible. If we now construct a set of such collaborative targeted maximum likelihood estimators, possibly indexed by different initial estimators, and different ways of constructing the sequence of targeted maximum likelihood estimators, we can then select among these estimators the estimator with minimal estimated variance (based on the influence curve). To obtain an honest estimate of the variance of the resulting estimator, just as one obtained honest cross-validated risk of an estimator that internally uses cross-validation, one uses the honest cross-validated variance of the influence curve of the complete estimator, including cross-validating this final selection step that involves minimizing the variance.

3. Collaborative double robustness of estimating functions in CAR censored data models

In this section we establish a new kind of collaborative robustness of the class of estimating functions in CAR-censored data models, where, as in van der Laan and Robins (2003), the class of estimating functions is implied by the orthogonal complement of the nuisance tangent space of the target parameter Ψ : M → IRd. This orthogonal complement of the nuisance tangent space equals the space spanned by the gradients of the pathwise derivative of Ψ, and thus includes the canonical gradient/efficient influence curve. The collaborative robustness result teaches us that the censoring mechanism required to obtain an unbiased estimating function at a mis-specified Q for the parameter of interest need not always condition on the whole full data structure. In fact, it teaches us that the better Q approximates Q0 the less of an adjustment by full data random variables is necessary for the censoring mechanism to still obtain an unbiased estimating function for the parameter of interest. The precise collaborative property of (Q, g0(Q)) such that P0D(ψ0, g0(Q), Q) = 0 will be explicitly specified, where D represents the estimating function, such as the one implied by the canonical gradient.

3.1. The formal collaborative robustness result

The new form of double robustness we wish to establish is understood as follows. Consider an estimating function D(Ψ(Q), G, Q) for the parameter of interest ψ0 that is indexed by nuisance parameters (G0, Q0), and which is already known to satisfy the classical double robustness property: for any G under which ψ0 is identifiable from PQ0,G, we have E0D(ψ0, G, Q) = 0 if either Q = Q0 or G = G0 (van der Laan and Robins (2003)). Given a Q, we are interested in the question under what conditional distribution G0δ of censoring variable C, given a reduction X(δ) of X, will we still have P0D(ψ0, G0δ, Q) = 0 and thereby that D is an unbiased estimating function for ψ at this mis-specified Q.

Firstly, we note that P0D(ψ0, G, Q) = P0{D(ψ0, G, Q) – D(ψ0, G, Q0)} + P0D(ψ0, G, Q0), and the latter term is zero under any G that allows identifiability of ψ0. Thus, it remains to determine for what G0δ we will have P0{D(ψ0, G0δ, Q) – D(ψ0, G0δ, Q0)} = 0. This choice of G0δ (e.g., it includes G0 itself) is not unique but will be dependent on a difference Q – Q0 in the sense that X(δ) has to be rich enough so that it contains a difference Q–Q0.

By the general representation theorem for estimating functions that are orthogonal to the nuisance tangent space of the target parameter (Theorem 1.6, van der Laan and Robins (2003)), one can typically represent an estimating function D(ψ0, G, Q) as an Inverse Probability of Censoring Weighted Estimating function DIPCW(G, ψ0) plus a function DCAR(Q, G) in the tangent space TCAR(G) of the censoring mechanism at G. The function DCAR(Q, G) is defined as the projection of – DIPCW(G, ψ0) on the tangent space TCAR(G) = {h(O) : EG(h(O) | X) = 0} of the censoring mechanism when only assuming coarsening at random, where this projection is carried out in the Hilbert space of all functions of O with mean zero and finite variance endowed with inner product the covariance operator 〈f1, f2〉 = EQ,gf1(O)f2(O). In other instances, the DIPCW might depend on Q0 through another parameter beyond ψ0, in which case it will need to be assumed that this parameter is correctly specified.

This teaches us that P0{D(ψ0, G, Q) – D(ψ0, G, Q0)} = P0{DCAR(Q, G) – DCAR(Q0, G)}, since the IPCW-difference equals zero. This representation theorem also teaches us that for all Q we have that DCAR(Q, G) has conditional mean zero under G, given X. In addition, this same theorem also shows that QDCAR(Q, G) is linear in Q. Therefore, it remains to show that P0DCAR(QQ0, G) = 0. Now, inspection of the proof that the conditional mean of DCAR(Q′, G) under G equals zero for a Q′ involves typically conditioning on a rich enough reduction of X so that a particular function indexed by Q′ is fixed under the conditioning. Thus, the censoring mechanism only needs to condition on a particular function of Q – Q0.

This is best illustrated with a concrete censored data structure. For example, consider the right censored data structures O = (C, (C)), where X(t) is a time dependent process, X = (X(t) : t) represents the full data structure, and (t) = {X(s) : st} represents the sample path up till time t. For this censored data structure, one can represent the projection of DIPCW onto TCAR as DCAR(Q, G) = ∫ HQ,G(u, (u–))dMG(u), where

HQ,G(u,X¯(u))=EQ,G(DIPCW,G|C=u,X¯(u))E(DIPCW,G|Cu,X¯(u))            dMG(u)=I(C=u)I(Cu)dΛC|X(u|X),

and ΛC|X is the cumulative hazard of C, given X. For details, we refer to chapter 3 in van der Laan and Robins (2003). Here dMG(u) is a Martingale satisfying E(dMG(u) | (u), Cu) = 0. Due to the linearity of the conditional expectation operator, we have DCAR(QQ0, G) = ∫ HQQ0,G(u, (u))dMG(u). By conditioning on HQ–Q0,G(u, (u)) within the integral, and using E(dMG(u) | (u), Cu) = 0, it follows that DCAR(Q – Q0, G) also has mean zero under a censoring mechanism s.t. λC(u | X) only depends on (u) (it only depends on (u) by CAR) through HQQ0,G(u, (u)). One can factorize HQQ0,G = H1(G)H2(Q – Q0), so that adjustment in λC(u | X) by the time-dependent covariate H2(Q – Q0) (u, (u)) suffices. Alternatively, it also suffices if the censoring mechanism uses a self-iterated adjustment by HQQ0,G as described later in this section. If Q approximates Q0, this function HQQ0,G(u, (u)) will be shrunk to zero, so that less conditioning becomes necessary.

The following much simpler (but in essence making the same point) example helps to further illustrate the general collaborative double robustness property of the efficient influence curve. Suppose the observed censored data structure is O = (W, Δ, ΔY) and X = (W, Y) is the full data random variable, where Δ is the censoring variable. Suppose one wishes to estimate ψ0 = E0Y. The efficient influence curve is given by

D(ψ0,Π0,Q0)=DIPCW(ψ0,Π0)+DCAR(Q0,Π0),

where

DIPCW(ψ0,Π0)=YΔΠ0(W)ψ0  DCAR(Q0,Π0)=E(Y|Δ=1,W)(ΔΠ0(W)1),

0(W) = P0(Δ = 1 | W) and Q0(W) = E0(Y | W, Δ = 1). Consider a Q. We are interested in the question under what conditional distribution ∏0δ of Δ, given a reduction W(δ) of W, will we still have P0D(ψ0, ∏0δ, Q) = 0 and thereby that D is an unbiased estimating function for ψ at this mis-specified Q. Firstly, we note that P0D(ψ0, ∏, Q) = P0{D(ψ0, ∏, Q) – D(ψ0, ∏, Q0)} + P0D(ψ0, ∏, Q0), and the latter term is zero under any ∏ for which P0(∏(W) > 0) = 1. Thus, it remains to determine for what ∏0δ P0{D(ψ0, ∏0δ, Q) – D(ψ0, ∏0δ, Q0)} = 0.

This teaches us that P0{D(ψ0, ∏, Q) – D(ψ0, ∏, Q0)} = P0{DCAR(Q, ∏) – DCAR(Q0, ∏)}, since the IPCW-difference equals zero:

P0{D(ψ0,,Q)D(ψ0,,Q0)}=(QQ0)(W)0(W)(Δ0(W)).

Note that we used here that QDCAR(Q, ∏) is linear in Q. Therefore, it remains to show that P0DCAR(Q – Q0, ∏) = 0. This can be represented as H(Q – Q0, ∏0)(W)(Δ – ∏0(W)) as above, with H(Q – Q0, ∏0)(W) = (Q – Q0)(W)/0(W).

The proof that the conditional mean of DCAR(QQ0, ∏) under ∏ equals zero involves conditioning on a rich enough reduction of W so that QQ0 is captured by the conditioning: if (QQ0)(W) only depends on W through W(δ), then

E(QQ0)(W)0(W(δ))(Δ0(W(δ)))=0.

In particular, we have that the conditional mean of DCAR(QQ0, ∏0), given (QQ0)(W), equals zero if ∏0(W) = P(Δ = 1 | QQ0(W)). This shows that if, for example, (QQ0)(W) only depends on one component W1, then P0D(ψ0, ∏0, Q) = 0 for ∏0(W1) = P0(Δ = 1 | W1), and, more general, for ∏0(W′) with W1W′ . That is, the better job Q does in approximating Q0 the less inverse probability of missingness weighting is required to still obtain an unbiased estimating function for ψ0.

Summary: Consider the efficient influence curve D(Q, G). Suppose we already know that for Q with Ψ(Q) = ψ0 P0D(Q0, G) = 0 for all G. Given a Q with Ψ(Q) = ψ0, characterize the set of G0(Q)s for which P0D(Q0, G0(Q)) – D(Q, G0(Q)) = 0. For such Q and corresponding G0(Q)’s we have P0D(Q, G0(Q)) = 0. Given the representation theorem for estimating functions derived from the orthogonal complement of the nuisance tangent space, it appears that we need to determine the conditional distributions G0δ of C, given a reduction X(δ) of X, for which E0DCAR(QQ0, G0δ(Q)) = 0. Thus we need to determine the conditional distributions G0(Q) of C that solves the score equation E0DCAR(QQ0, G0) = 0 of score DCAR(QQ0, G0). In particular, if G0(Q) is a MLE of a finite dimensional parameter (e.g., same dimension as ψ0), whose score spans DCAR(QQ0, G0), then E0DCAR(QQ0, G0) = 0. More generally, if G0 is a limit of an efficient (e.g. NPMLE) estimator in a model for G0 that has a tangent space at G0 that contains DCAR(QQ0, G0), then this G0 also satisfies E0DCAR(QQ0, G0) = 0. In addition, a self-iterated iterative MLE, starting with arbitrary offset G, for a parameter with score DCAR(QQ0, G), at G, can be employed as well, as presented below, resulting in an updated G0(Q) of G so that E0DCAR(QQ0, G0(Q)) = 0.

We will now present the general result which can be applied to any CAR-censored data model as defined and studied in van der Laan and Robins (2003).

Theorem 1 (Collaborative Double Robustness of Efficient Influence Curve/Estimating Functions)

CAR-censored data model: Let O = Φ(C, X) ∼ P0 be a censored data structure with full data random variable X ∼ PX0, and censoring variable C with conditional probability distribution G0 of C, given X. Assume G0 satisfies the coarsening at random assumption. Let g0(C | X) = dG0(C | X) a probability density of G0 w.r.t. an appropriate dominating measure that satisfies coarsening at random itself. Let M denote the observed data model for P0. Due to CAR, we have w.r.t. an appropriate dominating measure dP0(O) = Q0(O)g0(O | X), where g0(O | X) is only a function of O (by CAR), and Q0 denotes the identifiable part of the full data distribution PX0 (Gill et al. (1997)). (Here we abused notation to indicate that the conditional density of O, given X, is a deterministic function of the conditional density of C, given X, and, in fact, represents the identifiable part of the censoring mechanism G0.) Let Q and G be models for Q0 and G0 which imply a model M= {dP = Qg : Q ∈ Q, G ∈ G} for P0.

Parameter of interest: Let Ψ: M IRd be pathwise differentiable parameter of interest and it is assumed that Ψ(P0) = ΨF (Q0) is only a function of Q0. Let D*(Q, G) be the efficient influence curve/canonical gradient of Ψ at dP = Qg.

We make the following assumptions:

Augmented “PCW”-representation of efficient influence curve:

(PCW stands for Probability of Censoring Weighted) For each QQ, GG,

D*(G,Q)=Dh(G,Q)(G,Q)=Dh(G,Q),PCW(G,Γ(Q))+Dh(G,Q),CAR(G,Q),

for mappings (G, Q) → h(G, Q), (h, G, Q) → Dh,PCW(G, Γ(Q)), (h, G, Q′) → Dh,CAR(G, Q′(Q, G)), both defined on H × G × Q, a parameter mapping Γ on Q, and (G, Q) → Q′(G, Q).

(We refer to Theorem 1.3 in van der Laan and Robins (2003) for such a general representation of the efficient influence curve and, more generally, the orthogonal complement of the nuisance tangent space, where the CAR-components are elements of the tangent space TCAR of G consisting of all functions of O with conditional mean zero, given X, under G. Under that representation, we have that E0Dh,PCW(G0, γ0) = 0 and Dh,CAR(G0, Q′) has conditional mean zero, given X, for all Q′.)

Linearity of CAR-component: Q′ → Dh,CAR(G, Q′) is linear on a set Q¯ containing {Q′(G, Q) : G, Q} in the sense that for all h ∈ H, and all Q1, Q2Q¯

Dh,CAR(G,Q1)Dh,CAR(G,Q2)=Dh,CAR(G,Q1Q2).

Robustness for mis-specified censoring mechanism: For all Q0Q and G ∈ G(Q0) G, where (e.g.,) G(Q0) is defined as all censoring mechanisms G for which ψ0 can be identified from dP = dQ0g, we have

E0Dh(G,Q0)=0forallhH.

Robustness of CAR-component: For a reduction X(δ) of X (i.e., X(δ) = f(X, δ) for some function f), let G0δ be the conditional distribution of C, given X(δ).

Let Qδ¯ be a set within Q¯ for which for each Q¯Qδ¯

E0Dh,CAR(G0δ,Q¯)=0.

(Typically, one can select Qδ¯ as all functions in Q¯ that are only functions of X through X(δ).)

Let Γ(Q) = Γ(Q0) (typically implying Ψ(Q) = ψ0), G0δ G(Q0), and assume QQ0Qδ¯, where Q′ = Q′(G0δ, Q) and Q′0 = Q′(G0δ, Q0). Then

E0D*(G0δ,Q)=0.

We also have for all G ∈ G(Q0)

E0D*(G,Q0)=0.

Proof. Suppose Γ(Q) = Γ(Q0) and QQ0Qδ¯. Let G0*=G0δ be the conditional distribution of C, given X(δ), and assume it is an element of G(Q0).

By the “Augmented ‘PCW’-representation of efficient influence curve” assumption, we have

E0D*(G0*,Q)=E0Dh(G0*,Q)

for some hH. Thus,

E0D*(G0*,Q)=E0Dh(G0*,Q)=E0{Dh(G0*,Q)Dh(G0*,Q0)}+E0Dh(G0*,Q0).

By the assumption that G0*G(Q0), it follows that the last term E0Dh(G0*,Q0)=0.

By the “PCW-representation” assumption we have

E0{Dh(G0*,Q)Dh(G0*,Q0)}=E0{Dh,PCW(G0*,Γ(Q))Dh,PCW(G0*,Γ(Q0))}+E0{Dh,CAR(G0*,Q(Q,G0*))Dh,CAR(G0*,Q(Q0,G0*))}.

By the assumption that Γ(Q) = Γ(Q0), the first term equals zero. By the “linearity of CAR-component”-assumption we have that the last term equals:

E0{Dh,CAR(G0*,Q)Dh,CAR(G0*,Q0)}=E0Dh,CAR(G0*,QQ0),

where Q=Q(G0*,Q) and Q0=Q(G0*,Q0).

We assumed that QQ0Qδ¯. Thus, by the “Robustness of CAR-component”-assumption we have that

E0Dh,CAR(G0*,QQ0)=0.

This proves E0D*(G0*,Q)=0. □

3.2. Examples illustrating the collaborative double robustness in censored data models

For the sake of illustration, we will now explicitly establish the collaborative double robustness of the efficient influence curve estimating function in two additional examples. These results are also corollaries of the above general Theorem 1.

3.2.1. Example I: Marginal additive causal effect in nonparametric model

We have the following double robustness result for our additive causal effect example.

Theorem 2 Let dP0 = Q0dG0 be the distribution of O = (W, A, Y) and let the model for P0 be nonparametric.

Let Ψ(Q0) = EQ01{EQ02(Y | A = 1, W) – EQ02(Y | A = 0, W)} be the parameter on this model, where it is assumed that it is identifiable from P0. Here Q01 denotes marginal distribution of W and Q02 the conditional distribution of Y, given A, W. The efficient influence curve of Ψ at P = (Q, G) is given by

D*(Q,G)(O)=h(G)(A,W)(YQ2(A,W))+Q2(1,W)Q2(0,W)Ψ(Q),

where Q2(A, W) = EQ(Y | A, W) denotes the conditional mean of Y, given A, W, under Q = (Q1, Q2).

Assume

(Q02Q2)(A,W)=EQ0(YQ2(A,W)|A,W)=f0(A,W(Q))

is only a function of A, W(Q) for a W(Q) = Φ(Q2, W) for some mapping Φ: i.e., W(Q) denotes a reduction or subset of the full vector random variable W indexed by Q.

Let dG0(Q) be the conditional distribution of A, given W(Q). If Ψ(Q) = Ψ(Q0), then

EP0D*(Q,G0(Q))=0.

Or, equivalently, if we represent D*(Q, G) as D*(Ψ(Q), Q, G), then

EP0D*(ψ0,Q,G0(Q))=0.

We also have: If Pr(PG(A = 0 | W) * PG(A = 1 | W) > 0) = 1, then

EP0D*(Q0,G)=0,

or equivalently,

EP0D*(ψ0,Q0,G)=0.

Proof. The last statement is easy and well known (e.g., van der Laan and Robins (2003)). The first statement needs to be proved, or can be derived as a corollary of Theorem 1. Note, if Ψ(Q) = ψ0, then

E0D*(Q,G0(Q))=E0h(G0)(A,W(Q))(YQ(A,W))+Q(1,W)Q(0,W)ψ0.

If E0(YQ(A, W) | A, W) = f0(A, W(Q)) is only a function of A, W(Q), then it follows by first taking the conditional mean, given A, W, and then taking the mean of A, given W(Q),

E0D*(Q,G0(Q))=E0h(G0)(A,W(Q))f0(A,W(Q))+Q(1,W)Q(0,W)ψ0=E0f0(1,W(Q))f0(0,W(Q))+Q(1,W)Q(0,W)ψ0.

Now, note that f0(A, W(Q)) = Q0(A, W) – Q(A, W), which proves that the latter quantity equals zero.

The implication of this result is that, given an estimate Q of Q0, we only need to estimate G0(Q), conditioning on W(Q), or any conditional distribution that conditions on more than W(Q). Thus, if Q already succeeds in explaining most of the true regression E0(Y | A, W), then only little inverse weighting with G0(Q) = P(A = · | W(Q)) remains to be done. That is, the amount and manner of inverse weighting required to obtain a consistent estimator of the causal effect ψ0 can be adapted to the approximation error of Q relative to the true regression.

3.2.2. Example II: Semiparametric regression

Let O = (W, A, Y) ∼ P0. Assume the model E0(Y | A, W) – E0(Y | A = 0, W) = 0V for some VW. If the variance of Y, given A, W, only depends on W, then the efficient score of β0 at P0 can be represented as

D*(0,θ0,β0)(O)=(A0(W))(YAβ0Vθ0(W)),

where ∏0(W) = E0(A | W), and θ0(W) = E0(Y | A = 0, W). For the sake of illustration we will use this simpler representation, but the same double robustness applies to the general efficient influence curve representation as (e.g.) presented in van der Laan and Robins (2003).

Theorem 3 Suppose E0(Y – Aβ0V – θ(W) | A, W) = f0(W(θ)) for some function f0 of W(θ) where W(θ) = Φ(W, θ) is function of W and θ. Note that this states that θ0(W) – θ(W) = f0(W(θ)) is only a function of a reduction W(θ) of W. Let0(θ)(W) = E0(A | W(θ)). Then

E0D*(0(θ),θ,β0)=0

We also have

E0D*(,θ0,β0)=0

Proof. Only the first robustness result needs to be proved. First take the conditional mean, given A, W, which results in the term E0(A–∏0(θ)(W(θ))) f0(W(θ)). Subsequently, we take the conditional mean, given W(θ), which proves it equals zero. □

3.3. Construction of collaborative double robust estimators

By using a collaborative estimator gn(Q) of a g0(Q) in the set of conditional distributions that conditions on the required function of QQ0 (and g0 itself), one can construct collaborative double robust estimators. For example, one could use the targeted maximum likelihood estimator applied to initial estimator Qn and using the resulting collaborative estimator gn(Qn). One can also use estimating equation methodology, solving for ψn in 0 = PnD*(ψ, Qn, gn(Qn)). The formal asymptotic linearity (and thereby asymptotic normality) of such estimators is studied in the next section. Our proposed collaborative targeted maximum likelihood procedure is one particular collaborative double robust targeted maximum likelihood estimator, which also involves updating the initial estimator Qn beyond the construction of an appropriate gn(Qn). However, we could also simply have taken gn(Qn) from our proposed collaborative targeted maximum likelihood procedure, and still use the targeted maximum likelihood estimator with initial estimator Qn. In addition, we could also have used our proposed collaborative estimator gn(Qn) to solve an estimating equation 0 = PnD*(Qn, gn(Qn), ψ) = 0 in ψ.

Other methods for construction of collaborative estimators gn(Qn) are of interest as well. For example, one could consider a collection of one-dimensional fluctuations of Qn and use maximum likelihood to test these fluctuations. In this manner one can select a dimension reduction involving the X-components that still significantly increase the log-likelihood (or other loss function) beyond the initial fit Qn. One could then fit gn by running a machine learning algorithm that only conditions on the selected components. This procedure only uses the initial estimator to obtain a dimension reduction, but from then on it uses an external procedure based on the loss function for g0.

Given an initial estimator Qn, another idea of interest for construction of a collaborative estimator gn(Qn) is the following. One first constructs a sequence of increasingly nonparametric estimators ĝj of g0, j = 1, . . . , J. These estimators could already be based on a dimension reduction based on offset by initial estimator. Given an initial estimator , we select the following estimator of g0:

jn=argminjEBnPn,Bn1D*(Q^(Pn,Bn0),g^j(Pn))2,

where Bn denotes a random variable in {0, 1}n defining a random split in training sample {i : Bn(i) = 0} and validation sample {i : Bn(i) = 1}, and Pn,Bn0, Pn,Bn1 denote the empirical probability distributions of the training and validation sample, respectively. Thus, one selects the estimator that minimizes the Euclidean norm of the cross-validated mean of efficient influence curve at the estimator . If j is too small, then P0D*(, gj) will be non zero, so that jn will always select a large enough j for n tending to infinity. If, on the other hand, j is large enough so that P0D*(, gj) = 0, then the expectation of {PnD*(, gj)}2 will be equal to its variance which will be increasing in j, so that smaller j’s, but larger than the critical one, will be preferred. One now defines as collaborative estimator the estimator gn(Qn) ≡ ĝjn(Pn) indexed by this choice jn.

3.4. Estimating the sufficient minimal adjustment covariate from the data

Let H(g0, QQ0) be the component that needs to be adjusted for in g0 = g0(Q). One could estimate this component from the data using appropriate methodology. If, given an arbitrary initial fit g, one would add H(g, QQ0) as main term in a fluctuation model of g, and the fluctuation function is chosen so that the score of the coefficient of this main term at zero equals DCAR(QQ0, g0), then the MLE-update of g will solve the wished score equation P0DCAR(QQ0, g0) = 0. We can refer to H(g0, QQ0) as the minimal adjustment covariate, needed to obtain the wished collaborative robustness.

The sufficient covariate H(g, QQ0), that is needed to update g, depends on g itself as well, so that, even given an estimate of QQ0, the above maximum likelihood update of an initial g does not work. There are two approaches that can be used to deal with this self-dependence of the minimal adjustment covariate for g.

Firstly, one can extract the few components of only QQ0, and enforce nonparametric adjustment by these covariates in a fit of g0(Q) of the censoring mechanism. In this manner, the resulting censoring mechanism estimator will estimate a true conditional distribution that conditions on covariates that imply the value of H(g0(Q), QQ0). Secondly, one can also only adjust for H(g0, QQ0) as a main term, given an estimate g0, and iterate this updating process of g till convergence. In the latter case, as mentioned above, it is assumed that the score of the fluctuation of g implied by this main term extension H(g, QQ0) at zero equals DCAR(g, QQ0). Let’s illustrate these two approaches with an example.

For example, in the additive causal effect example with O = (W, A, Y), ψ0 = EY (1)–Y (0) and Y continuous, we have H(g0,QQ0)=1g0(1|W)E(YQ|A=1,W)+1g0(0|W)E(YQ|A=0,W), and the score of ε, at ε = 0, of logistic regression model gε(1 | W) = 1/(1 + exp(–C0(W) – εH(g, QQ0)), using an offset C0(W), is given by DCAR(g, QQ0) = H(g, QQ0)(Ag(1 | W)).

One can estimate E(YQ|A = 1, W) and E(YQ|A = 0, W) with a machine learning algorithm, treating an initial estimate Q as offset (possibly cross-validated to make the offset independent of Yi). If Y is binary, we would use Q as offset and one could estimate these QQ0-components by running a logistic regression with Q as offset. Given this estimate of the two QQ0-components that span H(g0, QQ0), one could now force nonparametric adjustment by these two estimated covariates in the estimate of g0.

Alternatively, given an initial estimator gn0, one could obtain an estimate Hn0 by plugging in an estimate of Q0Q, and this initial gn0. One could now force in this Hn0 as a main term in gn0, resulting in an updated gn1. This process is iterated till convergence. In the limit we have that gn solves 0=PnH(gn,QQ0^)(Agn(1|W)). Here gn0 would already be a collaborative estimator of g0(Q), such as the one proposed in our collaborative targeted maximum likelihood estimator, so that the collaborative double robustness is preserved (we do not want to only rely on correct specification of QQ0^, and thereby of this sufficient minimal covariate)

We do not advise starting the above iterative algorithm at a purposely misspecified estimator gn0. Instead we want to apply the above iterative algorithm at a collaborative estimator gn0, such as the one presented in our template of the C-TMLE in Section 2. For example, after having run the C-TMLE in Section 2, we would carry out a subsequent update of the resulting collaborative estimator gn by applying the above iterative updating algorithm, starting at gn0=gn, and using an estimator of Q0Q. If we would only include this estimate of the sufficient covariate H(g0, QQ0) in gn, then the consistency of the estimator ψn fully relies on correct estimation of QQ0, and thereby on correct estimation of Q0, and therefore would not utilize the collaborative double robustness of the efficient influence curve. Instead of carrying out a subsequent update of a collaborative estimator gn using the iterative algorithm, we could incorporate an estimate Hn (or its (QQ0)-components) in our proposed template for the collaborative targeted MLE by forcing it in our candidate censoring mechanism estimators.

In the above additive causal risk example, we estimate H(g, QQ0) by plugging in an initial estimate g0 and QQ0, and the iterative adjustment succeeds in its goal as long as the estimate of QQ0 is correct, even if (the initial) g is misspecified. One could also use a representation such as H(g0, QQ0) = Eg0,Q0(DIPCW(g0, Q) | A = 1, W) – E0(DIPCW(g0, Q) | A = 0, W), and estimate the two regressions by regressing an IPCW-function indexed by g0, Q on A, W, and evaluate it at A = 1 and A = 0, respectively. Here DIPCW = (YQ){A/g0(A | W) – (1 – A)/g0(A | W)}. One could now apply the above iterative updating algorithm to this (non-substitution based) manner of estimating H(g0, QQ0). As shown in (e.g.) van der Laan and Robins (2003) for monotone censored data structures and causal inference data structures, involving censoring and treatment actions over time, DCAR(QQ0, g0) does allow such a representation Σj Hj(g0, QQ0)(A(j)–g0j(1 | F(j)), where F(j) represents the history before censoring or treatment A(j), and Hj(g0, QQ0) = E0(DIPCW | A(j) = 1, F(j)) – E0(DIPCW | A(j) = 0, F(j)) for some IPCW-function. The disadvantage of this approach is that it relies on g0 representing a true conditional distribution, while in the iterative substitution based approach the main term adjustment at a possibly misspecified g still yields the wished collaborative double robustness.

4. Asymptotic linearity of collaborative double robust TMLE

The collaborative targeted maximum likelihood estimator Qn* equals a kn-th step collaborative targeted maximum likelihood estimator, and thereby equals a targeted maximum likelihood estimator with a starting estimator Qnk (e.g., the kn – 1-th collaborative targeted maximum likelihood estimator), and the censoring mechanism estimator gn = gn as selected in the kn-step, given the collection of candidate estimators g indexed by δ ranging over an index set.

Thus, just like the targeted maximum likelihood estimator, the collaborative targeted maximum likelihood estimator ψn=Ψ(Qn*) of ψ0 solves the efficient influence curve estimating equation

0=PnD*(Qn*,gn,ψn).

For simplicity, we will make the assumption that the efficient influence curve at a PQ,g can be represented as an estimating function in ψ: i.e., the efficient influence curve at P can be represented as D*(Q(P), g(P), ψ(Q(P))) for some mapping (Q, g, ψ) → D*(Q, g, ψ). However, the theorem in this section can be generalized to any efficient influence curve D*(Q, g) at a data generating distribution PQ,g.

It is a reasonable assumption that Qn* converges to some element Q* in the model for Q0, where Q* is not necessarily equal to the true Q0. In addition, let’s assume that, for each δ, the δ-specific censoring mechanism estimator g converges to some g0δ. For example, if δ indicates an adjustment set, then it might be assumed that g converges to the true conditional distribution, given this δ-specific adjustment set.

For a given Q, we define δ(Q) as the index δ with entropy d(δ) minimal and so that

P0D*(Q,g0δ(Q),ψ0)=0.

In other words, given the family of adjustments indexed by δ, δ (Q) represents the minimal adjustment necessary in the censoring mechanism to obtain the collaborative double robustness/unbiased estimating function for ψ0. It is then a natural assumption that

P0D*(Q,g0δ,ψ0)=0foreachδwithd(δ)d(δ(Q)).

In other words, if one uses a more nonparametric estimator of the censoring mechanism than needed (i.e.., than δ(Q)), then one certainly obtains the wished unbiasedness.

We will assume that, as n converges to infinity, then the selected censoring mechanism estimator gn = gn converges to a fixed g0δ0 representing the limit of a g0 , not necessarily equal to the conditional distribution, given the full X. For notational convenience, we will also denote this limit with g0.

It is assumed that d(δ0) ≥ d(Q*) so that

0=P0D*(Q*,g0,ψ0),

which will be the fundamental assumption for asymptotic normality of the CTMLE. In other words, it is assumed that our collaborative C-TMLE procedure selects a nonparametric enough estimator gn for the censoring mechanism (in collaboration with Qn*) so that the required unbiasedness of the efficient influence curve estimating function is achieved.

To derive the influence curve of Ψ(Qn*), the asymptotic linearity theorem below assumes also that the limit of the selected censoring mechanism estimator satisfies

P0D*(Qn*,g0,ψ0)=oP(1/n). (3)

As a consequence of this assumption (3), the influence curve does not involve a contribution requiring the analysis of a function of Qn*. This important simplification of the influence curve allows straightforward calculation of standard errors for the C-TMLE. The assumption (3) requires the limit g0 to be nonparametric enough w.r.t. the actual estimator Qn* so that enough orthogonality is achieved to make the contribution P0D*(Qn*,g0,ψ0) second order.

Why assumption (3) holds for C-TMLE: We now explain why this assumption is reasonable for the C-TMLE.

Define g0(Qn*) as g0δn* with δn*=min{δ:d(δ)d(δ0),P0D*(Qn*,g0δ,ψ0)=0}. In other words, g0(Qn*) corresponds with the limit of the least nonparametric estimator (among all estimators more nonparametric than the one identified by δ0) that still yields the wished unbiasedness of the estimating function at Qn*, and it as close as possible to g0 = g0δ0.

We note that

P0D*(Qn*,g0,ψ0)D*(Q*,g0,ψ0)=P0D*(Qn*,g0(Qn*),ψ0)D*(Q*,g0(Qn*),ψ0)+Rn,

where Rn is a second order term (like Rn1 below) involving the difference Qn*Q* and g0(Qn*)g0. By definition of g0(Qn*) and the fact that Qn* converges to Q*, it is reasonable to assume g0(Qn*)g0 as n → ∞. So Rn is a second order term, so that it is reasonable to assume Rn=oP(1/n).

By definition of g0(Qn*), we do not only have

P0D*(Qn*,g0(Qn*),ψ0)=0,

but also that g0(Qn*) is equally or more nonparametric than g0(Q*) so that

P0D*(Q*,g0(Qn*),ψ0)=0.

This implies now that indeed

P0D*(Qn*,g0,ψ0)=oP(1/n).

Finally, we note that the next theorem can be applied to any collaborative double robust estimator, as discussed in previous section, not only the collaborative double robust targeted maximum likelihood estimator.

Theorem 4 Let (Q, g, ψ) → D*(Q, g, ψ) be a well defined function that maps any possible (Q, g, Ψ(Q)) into a function of O. Let O1, . . . , OnP0 be i.i.d, and let Pn be the empirical probability distribution. Let Q → Ψ(Q) be a d-dimensional parameter, where ψ0 = Ψ(Q0) is the parameter value of interest. In the following template for proving asymptotic linearity of Ψ(Qn*) as an estimator of Ψ(Q0), Qn* represents the collaborative targeted maximum likelihood estimator, but it can be any estimator.

Let Q* denote the limit of Qn*. Let gn be an estimator and g0 denote its limit.

Assume

Efficient Influence Curve Estimating Equation: 0=PnD*(Qn*,gn,ψn), where ψn=Ψ(Qn*).

Censoring Mechanism Estimator is Nonparametric Enough:

P0D*(Q*,g0,ψ0)=0.
P0D*(Qn*,g0,ψ0)=oP(1/n).

(Above we show why the latter is indeed a second order term for the C-TMLE.)

Consistency:

P0(D*(Qn*,gn,ψn)D*(Q*,g0,ψ0))20inprobability,

as n → ∞. And the same is assumed if one or two of the triplets (Qn*,gn,ψn) is replaced by its limit (Q*, g0, ψ0).

Identifiability/Invertibility: c0 = –d/dψ0P0D*(Q*, g0, ψ0) exists and is invertible.

Donsker Class: {D*(Q, g, Ψ(Q)) : Q, g} is P0-Donsker, where (Q, g) vary over sets that contain ( Qn*, gn), (Q*, gn), ( Qn*, g) with probability tending to 1.

Contribution due to Censoring Mechanism Estimation: Define the mapping g → Φ(g) ≡ P0D*(Q*, g, ψ0). Assume Φ(gn)Φ(g0)=(PnP0)ICg0+oP(1/n) for some mean zero function ICg0L02(P0).

Second order terms: Define second order term

Rn1=P0{D*(Qn*,gn,ψn)D*(Q*,gn,ψn)}P0{D*(Qn*,g0,ψ0)D*(Q*,g0,ψ0)},

and assume Rn1=oP(1/n). Note Rn1 is a second order term involving difference between Qn*Q and gng0.

Define second order term

Rn2=P0{D*(Q*,gn,ψn)D*(Q*,g0,ψn)}P0{D*(Q*,gn,ψ0)D*(Q*,g0,ψ0)},

and assume Rn2=oP(1/n). Note Rn2 is a second order term involving differences gn – g0 and ψn – ψ0.

Then, ψn is asymptotically linear estimator of ψ0 at P0 with influence curve

IC(P0)=c01{D*(Q*,g0,ψ0)+ICg0}.

That is,

ψnψ0=(PnP0)IC(P0)+oP(1/n).

In particular, n(ψnψ0) converges in distribution to a multivariate normal distribution with mean zero and covariance matrix Σ0 = E0IC(P0)IC(P0).

Proof: The principal equations are 0=PnD*(Qn*,gn,ψn) and P0D*(Q*, g0, ψ0) = 0. So, we have

P0D*(Q*,g0,ψn)D*(Q*,g0,ψ0)={PnD*(Qn*,gn,ψn)P0D*(Q*,g0,ψn)}.

Let c0=ddψ0P0D*(Q*,g0,ψ0). Then,

c0(ψnψ0)+o(|ψnψ0|)=(PnP0)D*(Q*,g0,ψn)+Pn{D*(Qn*,gn,ψn)D*(Q*,gn,ψn)}+Pn{D*(Q*,gn,ψn)D*(Q*,g0,ψn)}.

We denote the three terms on the right with I,II and III, and deal with them separately below.

I: By the Donsker condition, and consistency condition, we have

(PnP0){D*(Q*,g0,ψn)D*(Q*,g0,ψ0)}=oP(1/n).

Thus, we obtain (PnP0)D*(Q*,g0,ψ0)+oP(1/n) as first term approximation. We refer to van der Vaart and Wellner (1996) for this empirical process theorem.

II: We have

Pn{D*(Qn*,gn,ψn)D*(Q*,gn,ψn)}=PnP0){D*(Qn*,gn,ψn)D*(Q*,gn,ψn)}+P0{D*(Qn*,gn,ψn)D*(Q*,gn,ψn)}.

The first term is op(1/n) by our Donsker class condition, and consistency condition at Qn*,gn,ψn. We also have

P0{D*(Qn*,gn,ψn)D*(Q*,gn,ψn)}=P0{D*(Qn*,gn,ψ0)D*(Q*,g0,ψ0)+Rn1,

where

Rn1=P0{D*(Qn*,gn,ψn)D*(Q*,gn,ψn)D*(Qn*,g0,ψ0)D*(Q*,g0,ψ0)}=oP(1/n),

by assumption.

Rn1 is a second order term involving Qn*Q* and (gn, ψn) – (g0, ψ0). It remains to consider the term P0{D*(Qn*,g0,ψ0)D*(Q*,g0,ψ0)}, which is oP(1/n) by “Censoring Mechanism is Nonparametric Enough”-assumption.

III: We have

Pn{D*(Q*,gn,ψn)D*(Q*,g0,ψn)=(PnP0){D*(Q*,gn,ψn)D*(Q*,g0,ψn)}+P0{D*(Q*,gn,ψn)D*(Q*,g0,ψn)}.

The first term is oP(1/n) by Donsker class condition, and consistency condition at Qn*, gn, ψn. We also have

P0{D*(Q*,gn,ψn)D*(Q*,g0,ψn)}=P0{D*(Q*,gn,ψ0)D*(Q*,g0,ψ0)}+Rn2,

where

Rn2=P0{D*(Q*,gn,ψn)D*(Q*,g0,ψn)D*(Q*,gn,ψ0)D*(Q*,g0,ψ0)}=oP(1/n),

by assumption. Thus the third term equals P0D*(Q*, gn, ψ0)–D*(Q*, g0, ψ0), which, by definition, equals Φ(gn)–Φ(g0). We assumed that Φ(gn)Φ(g0)=(PnP0)ICg0+oP(1/n). Thus, the third term equals (PnP0)ICg0+oP(1/n).

We can thus conclude that

ψnψ0=(PnP0)c01{D*(Q*,g0,ψ0)+ICg0}+oP(|ψnψ0|)+oP(1/n).

This implies |ψnψ0|=OP(1/n), and thereby the stated asymptotic linearity. □

4.1. Statistical Inference

If Q* = Q0, then ICg0 = 0, so that the influence curve reduces to the efficient influence curve D*(Q0, g0, ψ0) at a possibly weakly adjusted g0. If gn converges to the fully adjusted conditional distribution, given X, then we know that ICg0 equals minus the projection of D*(Q*, g0, ψ0) onto the tangent space of the model used by gn (van der Laan and Robins (2003), Section 2.3.7). We suggest that, even if g0 is not the fully adjusted censoring mechanism, we will typically still have that D*(Q*, g0, ψ0) is a conservative influence curve. In other words, if Qn starts approximating the true Q0, then the ICg0 contribution gets smaller and smaller, while if Qn stays away from Q0, then gn starts approximating the fully adjusted g0, in which case, inference based on D* is conservative. This might explain why we see good coverage in our simulations based on “influence curve” D*(Qn*,gn,ψn*). If gn corresponds with a parametric MLE estimator (for a data adaptively selected parametric model), then we propose to use the parametric delta-method to compute the analytic formula for the influence curve ICg0 in order to obtain an accurate influence curve.

One can estimate the covariance matrix Σ = E0ICIC of the influence curve with the empirical covariance matrix n=1/ni=1nIC^(Oi)IC^(Oi), and statistical inference can be based on the corresponding mean zero multivariate normal distribution, as usual.

4.2. Selection among difference collaborative targeted maximum likelihood estimators

Suppose that we have a set of candidate collaborative targeted maximum likelihood estimators (Q^k*(Pn),g^k(Pn)), k = 1, . . . , K. Suppose that each of these estimators satisfy the conditions of the theorem. For example, these might be collaborative targeted maximum likelihood estimators as defined in our template, using different initial estimators indexed by k, but the same collaborative estimator for the censoring mechanism as a function of the data and the initial estimator (thus still resulting in different realizations if the initial estimators are different). Then Ψ(Q^k*(Pn)) is asymptotically linear with influence curve D*(Qk*,g0k,ψ0), k = 1, . . . , K. We can now select among these candidate C-DR-TMLEs by maximizing the estimated efficiency, as in Rubin and van der Laan (2008).

Specifically, let Ψ be a one-dimensional parameter. We now select the k that minimizes the cross-validated variance of the influence curve:

kn=argminkEBnPn,Bn1D*2(Q^k*(Pn,Bn0),g^k(Pn,Bn),ψn).

Thus, we would use the estimator ψn=Ψ(Q^kn*(Pn)). If Ψ is multidimensional, then one needs to agree on a real valued criterion applied to the covariance matrix of the influence curve, such as the sum of the variances along the diagonal, and minimize over k the criterion of the cross-validated covariance matrix of the k-specific influence curve.

4.3. Irregular C-TMLE and super efficiency

If gn converges to the fully adjusted g0(· | X) (fully adjusting for X, under CAR) and Qn* converges to Q0, then it follows that ψn is asymptotically linear with influence curve equal to the efficient influence curve D*(Q0, g0, ψ0). So in that case, ψn is an asymptotically efficient estimator and thereby also a regular estimator.

Due to the particular way gn is constructed in response to Qn, it is easily argued that the collaborative targeted MLE can be an irregular estimator and can be super efficient by achieving an asymptotic variance that is smaller than the variance of the efficient influence curve. In particular, our previous arguments showed that if the initial estimator is a maximum likelihood estimator according to a correctly specified parametric model, then gn will avoid nonparametric fits, thereby staying away from estimating the fully adjusted g0 that would result in an efficient estimator in first order. In this case, by the above theorem, the influence curve of ψn will be equal to D*(Q0, g0, ψ0), using a non-fully adjusted g0, so that the variance of the influence curve will be smaller than the variance of the efficient influence curve that involves a fully adjusted g0.

The super efficiency may have very attractive features in practice. For example, there might be a covariate that is very predictive of censoring/treatment, but have no relation to the outcome. The C-TMLE will now decide to not adjust for this covariate at all in the selected censoring mechanism, and as a consequence, it might achieve the efficiency bound for the data structure excluding this covariate, but still assuming CAR, so that the C-TMLE will have smaller asymptotic variance than the efficiency bound. The resulting super efficient estimator not only shows improved precision, but also yields more reliable confidence intervals, by avoiding heavily non-robust (and harmful) operations. In most practical scenarios, such a covariate will still have a weak link with the outcome. In this case, for very large sample sizes, the C-TMLE will adjust for this covariate and thereby only be asymptotically efficient, but it will still behave as a super efficient estimator for practical sample sizes, by not adjusting for this covariate. That is, it invests in effective bias reduction focussing on covariates that are still predictive of the outcome, taking into account the already included initial estimator. This behavior is completely compatible with an estimator that aims to minimize mean squared error of the estimator of the target parameter, and certainly avoids steps that both increase bias as well as variance.

Finally, we remark that in simulations in which Qn converges fast to the true Q0, gn seems to have a temptation to converge to a random choice g0 that is beyond the required minimal censoring mechanism with probability 1. That is, likelihood based cross-validation might over-select the adjustment in the censoring mechanism relative to the minimal adjustment, and the amount of over-selection remains random (but small) for large sample sizes (this is a known property of cross-validation). This naturally results in an irregularity of the estimator. Simulations have not shown practical problems for statistical inference, but this remains an area of study.

5. Targeted loss functions implied by efficient influence curve

The template of the collaborative targeted maximum likelihood estimators is based on 1) a log-likelihood loss function (i.e., same loss function that is maximized at targeted maximum likelihood step) to select among candidate targeted maximum likelihood estimators indexed by increasingly nonparametric estimators of censoring mechanism, and 2) a preferred loss function to compare targeted maximum likelihood estimators using different censoring mechanism estimators, in order to build these candidate censoring mechanism estimators. One can also use a preferred loss function to select among different candidate collaborative targeted maximum likelihood estimators (e.g., indexed by different initial estimators).

In this section we propose targeted loss functions implied by the efficient influence curve of Ψ. Firstly, the log-likelihood can be replaced by a penalized log-likelihood that is sensitive to sparse data bias w.r.t. target parameter, as defined in next subsection This penalized log-likelihood can also play the role of the preferred loss function. In the second subsection we propose as preferred loss function the cross-validated variance of the efficient influence curve, relying on an overall collaborative estimator of censoring mechanism w.r.t. an initial estimator, or a candidate specific collaborative estimator. In the last subsection, we utilize the mean of the efficient influence curve as a criterion to generate a targeted loss function for Q that incorporates a sequence of increasingly nonparametric estimators of the censoring mechanism.

5.1. The MSE-penalized cross-validated log-likelihood

In the C-DR-TMLE we applied loglikelihood based cross-validation to select among different targeted maximum likelihood estimators, indexed by different censoring mechanism estimators. We propose here a penalized log-likelihood criterion that results in robust estimators in the context of sparse data w.r.t. the parameter of interest.

Consider candidate (e.g., collaborative) targeted maximum likelihood estimators PnQ^δ*(Pn) of the true Q0M, targeting a parameter ψ0 = Ψ(Q0), indexed by δ. Our proposed criterion for selecting δ is

δn=argmaxδEBnPn,Bn1logQ^δ*(Pn,Bn0)MSE(Pn)(δ),

where the first term is the cross-validated log-likelihood for the candidate estimator Q^δ*(Pn), and MSE(Pn)(δ) is an estimator of the mean squared error (variance plus bias-squared) of the substitution estimator Ψ^(Q^δ*(Pn)) as an estimator of its δ-specific limit (thus ignoring asymptotic bias). The MSE(Pn)(δ) is possibly appropriately scaled relative to the log-likelihood term. The sole motivation for the proposed additional penalty term is to make the criterion more targeted towards ψ0, while still preserving the log-likelihood as the dominant term in regular situations: i..e, asymptotically, the penalty is negligible (in regular situations the MSE behaves as 1/n).

5.1.1. Variance of targeted maximum likelihood estimator relative to its δ-limit

If the target parameter cannot be reasonably identified from the data the log-likelihood of the targeted maximum likelihood estimator is not sensitive enough to such a singularity: in fact, on many occasions this just means that the targeted maximum likelihood algorithm will be ineffective (i.e., the maximum likelihood fluctuations get too noisy) so that in essence the log-likelihood of the initial estimator drives the selection.

Therefore it is crucial that the log-likelihood terms are penalized by a term that blows up (in the negative direction) for δ-values for which the variance (or bias, addressed in next subsection) of the targeted maximum likelihood estimator Ψ(Q^δ*(Pn)) relative to its limit ψ0(δ)=Ψ(Q^δ*(P0)) gets large. Since we can derive the influence curve of the targeted maximum likelihood estimator Ψ(Q^δ*(Pn)) as an estimator of ψ0(δ), this variance can be estimated with the variance of this influence curve at this targeted maximum likelihood estimator Q^δ*(Pn). As follows from the study of TMLE in van der Laan and Rubin (2006) one can often use as approximate influence curve the efficient influence curve D*( Qδ*, gδ) at the limit of the targeted maximum likelihood estimator ( Q^δ*, ĝδ), which simplifies the penalty while it remains equally effective.

We first define the cross-validated covariance matrix

Σ(Pn)(δ)n=1nEBnPn,Bn1{D*(P^δ*(Pn,Bn0))D*(P^δ*(Pn,Bn0))}.

For example, if the target parameter is 1-dimensional (i.e., d = 1), then we have

σ2(Pn)(δ)n=1nEBnPn,Bn1{D*(P^δ*(Pn,Bn0))}2.

For example, one can define the variance term of the MSE as

σ2(Pn)(δ)=aΣ(Pn)(δ)a,

for a user supplied vector a, so that σ2(Pn)/n represents the variance estimate of the estimator of aψ0(δ).

Our proposal presented in the MSE-subsection below will actually have the form

σ2(Pn)(δ)=j=1dajΣ(Pn)(δ)aj, (4)

where aj are the row vectors of the square root of a user supplied matrix such as the inverse of the the correlation matrix of Σ(Pn)(δ).

5.1.2. Bias of targeted maximum likelihood estimator relative to its δ-limit

We might also wish to estimate the bias of the targeted maximum likelihood estimator Ψ(Q^δ*(Pn)) relative to its limit ψ0(δ) (even though in most applications the variance appears to drive the MSE term). For example, this could be done with the bootstrap:

EPn{Ψ(Q^δ*(Pn#))Ψ(Q^δ*(Pn))},

where Pn# represents the empirical distribution of a bootstrap sample O1#,,On# from the empirical distribution Pn. However, this would be much too computer intensive in many applications in which the targeted maximum likelihood estimator involves data adaptive model or algorithm selection. By noting that a bootstrap sample corresponds on average with 2/3 of the n observations, the following analogue bias estimate can be viewed as an approximation of this bootstrap bias that only requires 3 times applying the targeted maximum likelihood estimator to a sample of size n * 2/3:

B(Pn)(δ)=EBn3{Ψ(Q^δ*(Pn,Bn0))Ψ(Q^δ*(Pn)},

where Bn3 denotes the 3-fold cross-validation scheme.

If d = 1, then we will add to the variance term in the previous section the squared bias B(Pn)2 to create a MSE-term. If d > 1, then in our proposal below we will construct an appropriate function of B(Pn) representing the analogue of the variance term (4):

b(Pn)2(δ)j(ajB(Pn)(δ))2.

Additional rationale behind bias term: To provide further understanding of this kind of bias estimate B(Pn), we note the following. Let Ψ̂(Pn) be an estimator of its target Ψ̂(P0), where it plays the role of the δ-specific targeted maximum likelihood estimator Ψ(Q^δ*(Pn)). The fundamental assumption allowing statistical inference for Ψ̂(P0) is the assumption of asymptotic linearity:

Ψ^(Pn)Ψ^(P0)=(PnP0)D(P0)+R(Pn), (5)

where D(P0) is the influence curve of the estimator, and R(Pn) is the remainder. The asymptotic linearity assumption now assumes that R(Pn)=oP(1/n).

The representation (5) of the mapping Pn → Ψ̂(Pn) implies for any cross-validation scheme Bn

B(Pn)=EBnΨ^(PnBn0)Ψ^(Pn)=EBn{Ψ^(PnBn0)Ψ^(P0)}{Ψ^(Pn)Ψ^(P0)}=EBn{(PnBn0P0)D(P0)+R(Pn,Bn0)}{(PnP0)D(P0)+R(Pn)}=EBnR(Pn,Bn0)R(Pn),

where we use that EBnPn,Bn0D(P0)=PnD(P0). Thus, our proposed bias estimate B(Pn) equals, for any cross-validation scheme, an average difference of the remainder applied to a subsample of size n(1–p) and the full sample of size n. Therefore, one can conclude that this term will be very sensitive to a large remainder (e.g., second order terms) in the asymptotic linearity expansion (5).

5.1.3. MSE of targeted maximum likelihood estimator relative to its δ-limit

If d = 1, then we define the MSE term as

MSE(Pn)(δ)=σ2(Pn)(δ)n+B(Pn)2.

If d > 1, then we assume that we are provided with a user-specified d × d symmetric positive definite matrix ρ, so that the square root of this matrix ρ1/2 exists. Our MSE term will represent the expectation of the Euclidean norm of ρ1/2(Ψ̂ – ψ), or equivalently, the expectation of (Ψ̂ – ψ)ρ(Ψ̂ – ψ). One concrete proposal is to set ρ1/2 equal to the square root of the inverse of an estimate of the correlation matrix of the asymptotic covariance matrix of n(Ψ^ψ), so that the linearly transformed vector has uncorrelated components.

Let aj be the j-th row of the matrix ρ1/2, j = 1, . . . , d. The wished MSE term is now the sum of the MSEs of the linear combination ajΨ^. Therefore, the MSE term is represented as

MSE(Pn)(δ)=1njajΣ(Pn)(δ)aj+n{ajB(Pn)(δ)}.

This is equivalent to defining a variance term

σ2(Pn)(δ)n=1njajΣ(Pn)(δ)aj,

a bias term

b(Pn)(δ)=j{ajB(Pn)(δ)},

and defining

MSE(Pn)(δ)=σ2(Pn)(δ)n+{b(Pn)(δ)}2.

5.2. Targeted loss functions relying on a collaborative estimate of censoring mechanism

Let D*(Q0, g0, ψ0) be the efficient influence curve at dP0 = Q0g0 for the parameter Ψ : M → IR. Consider a set of estimators k(Pn) that are all more nonparametric than an initial estimator 0(Pn). Let ĝ0 be a collaborative estimator, relative to the initial estimator 0, so that P0D*(Q0*, g0, ψ0) = 0. Since, k is more nonparametric than 0, it is reasonable to assume that ĝ0 is also a collaborative estimator for k, so that P0D*(Qg0k*,g0,ψ0)=0, where now Qg0k* denotes the limit of the targeted maximum likelihood estimator of k using ĝ0.

We can use as criterion for selection among k, the cross-validated estimated variance of D*(Qg0k*,g0,Ψ(Qg0k*)), where Qg0k* denotes the limit of the targeted maximum likelihood estimator of k using ĝ0. Of course, this is also a selection among the collaborative targeted maximum likelihood estimators Q^g^0k*.

By our asymptotic linearity theorem, this selection among k corresponds with minimizing the variance of the influence curve of the (collaborative) targeted maximum likelihood estimators Ψ(Q^g^0k*) based on initial estimator k using the collaborative estimator ĝ0. The crucial assumption is that ĝ0 is indeed a collaborative estimator estimating a true g0 that involves enough adjustment, so that P0D*(Qg0k*,g0,ψ0)=0. However, this assumption is needed for construction of estimators of ψ0, and is already relying on weaker assumption than double robustness.

This criterion can now be used as the preferred loss function in our template for the collaborative targeted maximum likelihood estimator to build the candidate censoring mechanism estimators. It will require obtaining a single collaborative estimator ĝ0 after having obtained the initial estimator: for example, one could carry out a dimension reduction based on improved fits relative to initial estimator, and estimate it with a super learner only adjusting for the selected variables.

We already discussed that the same criterion can be used to select among any set of candidate collaborative targeted maximum likelihood estimators ( Q^g^kk*, ĝk), relying on collaborative estimators ĝk, so that P0D*(k*, gk, ψ0) = 0:

kn=argminkEBnPn,Bn1D*2(Q^k*(Pn,Bn0),g^k(Pn,Bn0),Ψ(Q^k*(Pn)).

However, the above only relies on a single collaborative estimator ĝ0, and is therefore particularly suitable for building the second stage within the collaborative targeted MLE template.

5.3. Targeted loss functions incorporating a sequence of increasingly nonparametric estimators of censoring mechanism

Consider an initial estimator . We wish to develop a criterion to select among candidate estimators of Q0 that are more nonparametric than , such as estimators using as offset. A special application of the loss function presented in this subsection is that is empty. Consider a sequence of increasingly nonparametric estimators ĝj, j = 1, . . . , J, of g0, which could be based on : For example, they might be based on a dimension reduction which has as goal to estimate Q0. Consider the following hypothetical criterion for a candidate Q:

Qj=1Jw(j)P0D*(Q,g^j(P0))2,

for a list of weights/scalars (w(j) : j = 1, . . . , J), where ‖ · ‖ denotes a possibly weighted Euclidean norm applied to the vector of the form P0D*(Q, g).

Firstly, if Q = Q0, then the criterion is minimized. In addition, let j0(Q) be such that P0D*(Q, ĝj(P0)) = 0 for all jj0. Since this mean zero property holds if ĝj solves a score implied by QQ0, it follows that the closer Q is to Q0 the more j-specific terms within this sum are close to zero. Finally, by the fact that D* is a gradient of the path-wise derivative of the target parameter Ψ, for jj0, we have that P0D*(Q, ĝj(P0)) ≈ Ψ(Q) – ψ0 (either exact, or in first order), which shows that this criterion also targets ψ0 (see van der Laan and Robins (2003), Section 1.4).

The cross-validated analogue of this criterion is given by

Dn(Q^k)EBnj=1Jw(j)Pn,Bn1D*(Q^k(Pn,Bn0),g^j(Pn))2.

If ψ0 is one dimensional, a sensible choice of weights are given by the inverse of the variance of the empirical means, so that it downweights the noisy j-specific signals:

w(j)1=Pn,Bn1{D*(Q^k(Pn,Bn0),g^j(Pn))}2,

and makes the criterion Dn() unit free. Analogues of such weighting can be obtained for multi-dimensional ψ0 as well, and are recommended.

One can add this criterion to a valid loss function such as the log-likelihood criterion L(k), giving a more targeted loss function

L*(Q^k)L(Q^k)+Dn(Q^k).

The additional term Dn preserves the validity of the loss function L (i.e., its minimum still identifies Q0), while it makes the selection targeted towards ψ0. One can add both Dn as well as the above presented MSE-penalty, where the latter is asymptotically negligible but important in sparse data (w.r.t. ψ0) situations.

6. Example: Targeted maximum likelihood estimation of the marginal structural model

Suppose we observe O = (W, A, Y = Y (A)), where W are baseline covariates, A is a discrete treatment, and Y is a subsequently measured outcome. It is assumed that A is realized in response to the realization of W, and Y is realized in response to both W and A. The full data structure on the experimental unit is X = (W, (Y (a) : a)), so that A represents the missingness variable for the missing data structure O on X. We assume the randomization assumption: g0(a | X) = P0(A = a | X) = P0(A = a | W).

Consider a marginal structural model for the full data distribution

E0(Y(a)|V)=m(a,V|β0)

that models the causal effect of a treatment intervention A = a on the outcome Y. For example, one might assume a simple linear model m(a, V | β0) = β0(a, V, aV).

Since it is often unreasonable to assume such a parametric form, but such parametric forms can still provide very meaningful projections of the true causal curve, we consider the nonparametric extensions of the parameter β0:

Ψh(P0)=argminβEP0ah(a,V)(Q0(a,W)m(a,V|β))2,

where Q0(a, w) = E0(Y | A = a, W). We have that Ψh(P0) represents a projection of E0(Y (a) | V) onto the working model m(| β0). Specifically,

Ψh(P0)=argminβEP0ah(a,V)(E(Y(a)|V)m(a,V|β))2.

In particular, if E0(Y (a) | V) = m(a, V | β0), then for each h we have Ψh(P0) = β0. Without the randomization assumption and consistency assumption, we can interpret Ψh(P0) as the same projection, but E(Y (a) | V) = E(E(Y | A = a, W) | V) now represents a (non-causal) dose response curve of the effect of A on Y that controls for the measured confounders W.

We note that this nonparametric extension only depends on P0 through the conditional mean of Y, given A, W, and the marginal distribution of W. For simplicity, we will also use the notation Ψh(Q0), where Q0 now denotes both the marginal distribution of W and the conditional distribution of Y , given A, W.

The efficient estimating function for this nonparametric extension Ψh of β0 is given by:

Dh(P0)(O)=h1(A,V)g0(A|W)(YQ0(A,W))+ah1(a,V)(Q0(a,W)m(a,V|Ψh(P0)),

where h1(a,V)=h(a,V)ddψm(a,V|ψ). We will assume that h1(A, V) = d/dψ0m(A, V | ψ0)h(A, V) is chosen so that h1 does not depend on ψ0, which is easily arranged for the case that m is linear in ψ and that m is logistic linear. For example, if we use the linear form (1, a, V, aV), then, if m is linear, then we can choose h1 = d/dψ0m(A, V | ψ0)g(A | V) = ((1, A, V, AV)g(A | V), and, if m is logistic linear, then we can choose h1 = d/dψ0m(A, V | ψ0)/(m(1 – m)(A, V | ψ0))g(A | V), which equals (1, A, V, AV)g(A | V), again.

Let Dh*(P0)=c01Dh(P0) be the corresponding efficient influence curve obtained by standardizing the efficient estimating function by the negative of the inverse of the derivative matrix c0 = d/dψ0E0Dh(P0) (noting that Dh(P0) can indeed be viewed as a function in ψ0).

If Y is continuous and we use a normal error regression model as a working model, then a targeted maximum likelihood estimator of ψh0 can be obtained by adding to an initial estimator Q0(A, W) of E0(Y | A, W) the d-dimensional ε-extension εCh(g)(A, W), where

Ch(g)(A,W)=h1(A,V)g(A|W),

for some fit g of g0, and fitting ε with maximum likelihood (i.e., least squares estimation) using Q0 as offset. The resulting update Q1(A, W) is now a first step targeted maximum likelihood estimator. It is also the actual targeted maximum likelihood estimator, since iteration is not resulting in further updates (the clever covariate does not change at a targeted MLE update). One estimates the distribution of W with the empirical distribution. The estimate Q1 and the empirical distribution of W now yields a substitution estimate of the target parameter ψh0. If Y is binary, the same εCh(g)(A, W) is added on the logit scale, and ε is fitted with maximum likelihood estimation. Again, the targeted maximum likelihood estimator converges in one step.

6.1. Penalized log-likelihood for candidate treatment mechanism fits

Let (Pn) be an initial regression estimator of Q0 = E0(Y | A, W). For a given Pnĝ(Pn), let Q^g^*(Pn) be the targeted maximum likelihood estimator corresponding with the covariate Ch(ĝ(Pn)). Let Bn be a cross-validation scheme, and let Pn,Bn1 and Pn,Bn0 be the empirical distributions of the validation and training sample, respectively, as identified by Bn ∈ {0, 1}n. Let

Σ^CV(Pn)(g^)=EBnPn,Bn1Dh*(Q^g^*(Pn,Bn0),g^(Pn,Bn0))2

be the cross-validated estimate of the covariance matrix of the efficient influence curve at the estimator and ĝ. We also consider the empirical estimate of this covariance matrix

Σ^(Pn)(g^)=PnDh*(Q^g^*(Pn),g^(Pn))2.

Let

B^(Pn)(g^)=EBnΨ^g^(Pn,Bn0)Ψ^g^(Pn)

be the bias estimator for the targeted maximum likelihood estimator Ψ^g^(Pn)=Ψh(Q^g^*(Pn)) obtained by plugging in Q^g^*(Pn) in the parameter mapping Ψ. Here we can use three-fold cross-validation as choice for Bn.

We will penalize the log-likelihood with the an estimate of the following average mean squared error

1ni=1nE(m(ai,vi|Ψ^g^(Pn)))m(ai,vi|Ψ^g^(P0)))2.

This mean squared error can be decomposed as 1/ni=1nVar(m(ai,vi|Ψ^g^(Pn))) and 1/ni=1nBias2(m(ai,vi|Ψ^g^(Pn))). The variance terms of this mean squared error can be estimated by

σi2(g^)nz(ai,vi)Σ^(Pn)(g^)z(ai,vi)n,

where

z(ai,vi)=ddβm(ai,vi|β)|β=Ψ^g^(Pn).

We keep open the option that one uses either the cross-validated covariance matrix Σ̂CV (Pn) or the empirical covariance matrix Σ̂(Pn).

The bias terms of this mean squared error can be estimated as

Bi(g^)EBnm(ai,vi|Ψ^g^(Pn,Bn0))m(ai,vi|Ψ^g^(Pn)).

If m is linear in β, then this reduces to

Bi(g^)=m(ai,vi|B(Pn)).

Thus, we obtain the following mean squared error estimate for the targeted maximum likelihood estimator Ψ̂ĝ(Pn) for a given g-estimator:

MSE^(Pn)(g^)1ni=1n{σi2(g^)n+Bi(g^)2}.

We suggest that the penalized log-likelihood could also only be penalized by the empirical variance component of the MSE. Therefore, we also define

σ2(Pn)(g^)1ni=1nσi2(g^)n.

Consider now the following two penalized log-likelihood criterions for ĝ, given the initial estimator 0:

L(g^|Q^0)=1ni=1n(YiQ^g^*(Pn)(Wi,Ai))2+MSE^(Pn)(g^),

or

L(g^|Q^0)=1ni=1n(YiQ^g^*(Pn)(Wi,Ai)))2+σ2(Pn)(g^).

For Y binary, the RSS is replaced by the log-likelihood of Y , given A, W.

We will use the penalized log-likelihood as loss function to build the candidate treatment mechanism estimators, according to the template for the collaborative targeted MLE in Section 2.

6.2. Algorithm for estimating the treatment mechanism based on penalized log-likelihood

Given any candidate adjustment set W*W, let an estimator ĝ(Pn)(W*) of g0(A | W*) be specified. This allows us to define a criterion in candidate adjustment sets W*, given the current estimator :

L(W*|Q^)L(g^(Pn)(W*)|Q^).

Thus, one can evaluate/score any given adjustment set W* with L(W* | ).

Given , one can now use this empirical criterion in adjustment sets to construct an estimator of g0() with a greedy type algorithm maximizing over a set of candidate adjustment sets. One starts with the empty adjustment set and selects the best addition move among a set of candidate addition moves based on the criterion. One iterates this process until there does not exist an addition move that improves the criterion. More aggressive greedy algorithms can be implemented as well, as with any machine learning algorithm that is based on iterative local maximization of an empirical criterion. One could apply this algorithm to candidate adjustments sets that have a certain size or entropy for the corresponding ĝ(Pn)(W*).

Alternatively, one creates a sequence of nested (increasing in size) adjustment sets Wj*, j = 1, . . . , J, for each Wj* one obtains a particular estimator ĝj(Pn) of g0(A|Wj*) (e.g., using super learning), and maximizes the penalized log-likelihood criterion over all these J adjustment sets.

In our algorithm in the next subsection defining the sequence of C-TMLEs we apply this greedy algorithm to candidate estimators that are more nonparametric than the selected estimator of g0 in the previous step.

6.3. Iteration to obtain sequence of targeted maximum likelihood estimators indexed by increasingly non-parametric estimators of treatment mechanism

Given an initial estimator of E(Y | A, W) and a corresponding estimator ĝ() defined above, sometimes denoted with ĝ, we define a resulting targeted maximum likelihood estimator

Q^g*(Pn)=Q^(Pn)+εnh(g^(Q^)(Pn)),

where εn is the least squares estimator of the regression coefficient ε treating Q̂(Pn) as offset and h(ĝ()(Pn)) as covariate. We can define this as a first step targeted maximum likelihood estimator based on an initial (Pn), and corresponding censoring mechanism estimator ĝ()). Let’s denote this operation as:

Q^1(Pn)=Q^(Pn)+εh1h(g^(Q^)(Pn)).

This process can now be iterated by replacing (Pn) by this update 1(Pn):

Q^2(Pn)=Q^1(Pn)+εh2h(g^(Q^1)(Pn)),

where we require that the next censoring mechanism estimator ĝ(1)(Pn) is obtained with the same algorithm as presented in above subsection, but now maximizing over candidate estimators that are more nonparametric than the previously obtained ĝ()(Pn).

In general, we define the k-th step of this targeted maximum likelihood estimator as

Q^k(Pn)=Q^k1(Pn)+εhkh(g^(Q^k1)(Pn)),

where ĝ(k−1)(Pn) involves maximizing over more nonparametric candidate estimators than ĝ(k–2)(Pn).

This algorithm results in a sequence of k-th step collaborative targeted maximum likelihood estimators Ψ(k(Pn)) of ψ0, and corresponding increasingly nonparametric censoring mechanism estimators ĝk(Pn) (i.e., ĝ(k–1)(Pn) in above notation), k = 1, . . . , K.

We could also have defined candidate targeted maximum likelihood estimators using a forward selection algorithm each time finding the best next term to add in the treatment mechanism, so that k denotes the number of terms included in ĝk. In that case, k corresponds with the size of the model for ĝk, and the targeted MLE step would be carried out when needed in order to guarantee an increase in either the penalized or non-penalized log-likelihood fit of k, as described in our general template for collaborative targeted maximum likelihood estimation in Section 2.

6.4. Collaborative TMLEs

If the initial estimator is indexed by a choice δ1 and the choice of algorithm ĝ() is indexed by a δ2, then, for each δ1, δ2, this results in candidate k-th step collaborative targeted maximum likelihood estimators PnQ^δ1,δ2k(Pn), corresponding treatment mechanism estimators Png^δ2k(Pn), and corresponding PnΨ(Q^δ1,δ2k(Pn)) targeted maximum likelihood estimators of ψ0, indexed by k.

For each δ1, δ2, in order to select among these candidate targeted maximum likelihood estimators indexed by k, we use the cross-validated penalized log-likelihood defined as:

L(k,δ1,δ2)=EBnPn,Bn1(YQ^δ1,δ2k(Pn,Bn0)(W,A))2+MSE^CV(Pn)(Q^δ1,δ2,kg^δ2k).

This results now in candidate (δ1, δ2)-specific collaborative targeted maximum likelihood estimators Q^δ1,δ2*, with corresponding initial estimator δ1 and collaborative treatment mechanism estimator ĝδ1,δ2 (note the choice of k is now a function of δ1, δ2 so that also the collaborative estimator ĝ is indexed by these choices).

6.5. Selection among candidate collaborative targeted maximum likelihood estimators

We could select δ1, δ2 by minimizing the same cross-validated penalized log-likelihood, e.g., by simply simultaneously minimizing the above criterion over the triplets (k, δ1, δ2). Alternatively, we could employ empirical efficiency maximization for all these candidate collaborative targeted maximum likelihood estimators that are assumed to be asymptotically linear with influence curve the efficient influence curve plus a contribution from the collaborative estimator of g0, as stated in our asymptotic linearity theorem. Thus, by ignoring this latter contribution to the influence curve, we could also select δ1, δ2 as the minimizer of the sum of the variances of the components of the efficient influence curve of ψ0: (with δ = (δ1, δ2))

δn=argminδj=1dPnDj*(Q^δ*,g^δ)2.

Other criteria based on the vector-efficient influence curve could be considered as well.

6.6. Statistical inference based on CLT

The resulting collaborative targeted maximum likelihood estimator Qn = *(Pn) and corresponding gn = ĝ(Pn) solve the efficient influence curve equation 0 = PnD*(Ψ(Qn), gn, Qn), so that ψn = Ψ(Qn) can be analyzed with our asymptotics theorem, and inference can be based on the influence curve. So we could estimate the covariance matrix as

Σn=EBnPn,Bn1D*(Q^*(Pn,Bn0),g^*(Pn,Bn0))2,

where one should include the gn-component to obtain more accurate inference. Statistical inference would be based on the normal working model ψnN(ψ0, Σn) to construct confidence intervals, confidence bands, p-values, and possible multiple testing adjusted p-values.

7. Discussion

For most data sets little to no knowledge is available about the data generating distribution. Consequently, the true model is a large infinite dimensional semi-parametric model. In such models many data adaptive approaches can be considered for fitting the true distribution of the data, based on different approximation function spaces, different searching strategies for maximizing an empirical criterion (such as the empirical log-likelihood) over these spaces, and different methods for selecting the fine tuning parameters indexing the function spaces and search strategies. Depending on the true data generating distribution, these algorithms will have very different levels of performance in approximating the true data generating distribution. As a consequence, cross-validation based super learning should be employed to find the best weighted combination among a large user supplied set of candidate estimators of the true data generating distribution. The user has an option to select the wished loss function. The oracle property of the cross-validation selector (van der Vaart et al. (2006), van der Laan et al. (2006)) teaches us that the super learner will asymptotically perform exactly as well, w.r.t. the loss-based (e.g., Kullback-Leibler) dissimilarity measure, as the best weighted combination of the candidate algorithms optimized for each data set. A crucial assumption is that the loss function used in this super learning methodology is uniformly bounded in all the candidate estimators. This could be viewed as a semi-parametric model assumption, and it is essential for robustness of the resulting estimator against outliers. It can be arranged by bounding the candidate estimators so that the loss function remains bounded.

The super learner fit represents a best fit for the purpose of estimation of the whole distribution of the data w.r.t. the loss-function specific dissimilarity, so that the bias-variance trade-off is not targeted (enough) w.r.t. the parameter of interest. Overall, the resulting estimate it overly biased for the smooth target parameter.

Therefore, our methodology involves a second targeted modification of the first stage super learner fit that aims to reduce the bias w.r.t the target parameter, while simultaneously increasing the likelihood fit. A single fluctuation function that would yield asymptotic optimal bias reduction as defined by the efficient influence curve of the target parameter is determined. This fluctuation function needs to have a score-vector at zero fluctuation whose linear span includes the efficient influence curve of the target parameter. This fluctuation function depends on an unknown nuisance parameter of the data generating distribution, such as a censoring mechanism.

In one particular embodiment of our C-TMLE, we define an iterative sequence of subsequent fluctuations, starting with the initial super learner fit, where the subsequent fluctuation functions are estimated with increasingly nonparametric estimates of the nuisance parameter, including a final fully non-parametrically estimated fluctuation function. In addition, by construction, we make sure that for each fluctuation function the nuisance parameter estimator that results in maximal increase in likelihood (or preferred loss function) fit is selected, among the candidate nuisance parameter estimators that are more nonparametric than the one selected at previous fluctuation function. In this way, we arrange that most of the targeted bias reduction occurs in the first few fluctuations. The actual number of times we carry out the subsequent update is selected with likelihood based cross-validation.

Essentially, we try to move towards the asymptotically optimal bias reduction (identified by the true nuisance parameter/censoring mechanism used in the data generating distribution) along a sequence of targeted bias reduction steps, but we stop moving towards this asymptotically optimal bias reduction when it results in a loss of likelihood fit as measured by the cross-validated log-likelihood. In general, our template for C-TMLE allow a fine sequence of nested targeted bias reduction steps (i.e., a fine sequence of candidate second stage estimators indexed by increasingly nonparametric nuisance parameter estimators) whose fits contain this set of candidate-fits as a subsequence, thereby potentially providing an additional improvement in practical performance of the resulting C-TMLE.

Theoretical results teach us that this push towards the asymptotically optimal bias reduction takes into account how well the initial estimator approximates the true distribution, by giving preference to targeted bias reduction steps that improve the log-likelihood fit using the initial estimator as off-set. As a consequence, the C-TMLE is able to avoid selecting irrelevant or harmful (w.r.t. relevant factor of density) fits of the nuisance parameter, even though such fits might improve the overall fit of the nuisance parameter. That is, the fit of the nuisance parameter is targeted towards our primary goal, the parameter of interest.

Specifically, we establish a general asymptotic linearity theorem for collaborative targeted maximum likelihood estimators, which provides us with the influence curve of our estimator, and thereby statistical inference. Fortunately, the influence curve happens to be equal to the efficient influence curve at the collaborative nuisance parameter estimator (its limit) plus a contribution of the nuisance parameter estimator as an estimator of its limit, providing immediate variance estimates. Gains in efficiency, resulting in possible super efficiency, are obtained by estimating the nuisance parameter collaboratively, in relation to the initial estimator. An inspection of the efficient influence curve allows us to precisely define the sufficient and minimal term H(g0, QQ0) one needs to adjust for in g0 to achieve the wished full bias reduction. We propose to estimate this term and force adjustment by this term in the candidate censoring mechanism estimators within the C-TMLE procedure, without relying on its correct specification, thereby preserving and only enhancing the double robustness of the C-TMLE.

The targeted maximum likelihood estimator relies itself on maximizing an empirical mean of a loss function over a fluctuation function, and the derivative of this empirical criterion at zero fluctuation needs to be the empirical mean of the efficient influence curve at the estimator to be fluctuated. We refer to this loss as the log-likelihood loss, but it needs to be understood that this choice can already be targeted (e.g,. van der Laan (2008b)) in the sense that it is typically implied by the efficient influence curve of the target parameter (e.g, one would not use a likelihood based on factors that are not needed to evaluate the target parameter). In addition, we propose to replace this log-likelihood by a penalized log-likelihood, where the penalty is scaled appropriately, has negligible contribution for nicely identifiable target parameters, but blows up for fits that result in extremely variable or biased estimators of the parameter of interest. Even though the penalty’s effect on the Kullback-Leibler dissimilarity is asymptotically negligible for identifiable parameters, for parameters that are borderline identifiable, this penalty can yield dramatic additional finite sample improvements to the C-TMLE. In essence, it builds in a sensible robustness of the resulting C-TMLE as an estimator of the target parameter.

Given the initial estimator, the candidate censoring mechanism estimators, and the corresponding sequence of targeted MLE indexed by these increasingly nonparametric estimators of the censoring mechanism, the log-likelihood or penalized log-likelihood is used for cross-validation selection of the depth of the bias reduction (i.e, for selection among these candidate targeted MLEs) in the C-TMLE algorithm. However, a preferred loss function can be used to build the initial estimator, to build the candidate censoring mechanism estimators, and to select among different collaborative C-TMLEs. In particular, one could use the penalized log-likelihood as preferred loss function.

We propose as a possibly more targeted selection criterion the cross-validated variance of the square of the efficient influence curve (e.g., if target parameter is one-dimensional), where a collaborative estimator is used for the nuisance parameter (censoring mechanism) in the efficient influence curve: i.e. we propose as loss function for a candidate Q the square of the efficient influence curve at its targeted version, using a collaborative estimator of g0. Indeed, E0D*2(Qg0*,g0,Ψ(Qg0*)=E0D*2(Qg0*,g0,ψ0) is a variance of a collaborative double robust estimator of ψ0, indexed by initial estimator Q, using known g0, and is thereby a valid and sensible loss function. In order to also take into account that g0 is estimated, one could also minimize the variance of the actual influence curve of the collaborative double robust targeted MLE using Q as initial and using gn as collaborative estimator. This would imply the same square of influence curve, but now the influence curve equals D* plus an contribution from estimation of g0.

Finally, we are also able to select a targeted loss function for g0 in the CTMLE template, by making its loss-based dissimilarity a measure of goodness of fit of the g0-specific fluctuation function in which the estimator of g0 is used.

The collaborative double robust maximum likelihood estimator utilizes 1) loss-based super learning to obtain the best initial estimator of Q0 (grounded by theory for cross-validation selector), 2) loss-based super learning to generate best estimators of candidate censoring mechanisms adjusting for increasingly large adjustment sets (grounded by theory for cross-validation selector), 3) specification of loss functions that target the needed Q0, g0 for identification of the efficient influence curve, 4) targeted maximum likelihood estimation along a fluctuation function using such a censoring mechanism estimator to remove bias for target parameter (grounded by theory for targeted maximum likelihood estimation), 4) Q0-(penalized)-likelihood based cross-validation to select among such candidate censoring mechanism estimators, and thereby obtain the collaborative estimator of the censoring mechanism for the fluctuation function that removes the bias, while avoids unnecessary loss in variance (grounded by oracle results for cross-validation, collaborative double robustness, and our asymptotic linearity theorem), 5) efficient influence curve based dimension reductions (the minimal sufficient covariate) to be included in the censoring mechanism estimators that allow for effective bias reduction in the T-MLE, if correctly specified, (and will not harm the collaborative double robustness if not) (grounded by theory on efficient influence curve representation and our collaborative double robustness theorem), 6) efficient influence curve based loss function for Q0 to build more targeted candidate censoring mechanism estimators 7) efficient influence curve based loss functions for Q0 to make the initial estimator more targeted, to make the selection among candidate C-TMLE more targeted, resulting in smaller asymptotic variance of the resulting C-TML estimator of the target parameter (grounded by empirical efficiency maximization theory, and our asymptotic linearity theorem for C-TMLE).

To summarize, this article provides a template for loss-based adaptive (super) efficient estimators in semiparametric models that are targeted towards a particular target feature of the distribution of the data, and for which statistical inference is available.

Footnotes

*

We like to thank the referees for their helpful comments. This research was supported by NIH grant R01 A1074345-03.

References

  1. Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Models Based on Counting Processes. Springer-Verlag; New York: 1993. [Google Scholar]
  2. Bembom O, Petersen ML, Rhee S-Y, Fessel WJ, Sinisi SE, Shafer RW, van der Laan MJ.Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant hiv infection Statistics in Medicine, page http://www3.interscience.wiley.com/journal/121422393/abstract, 2008 [DOI] [PMC free article] [PubMed]
  3. Bembom O, van der Laan MJ, Haight T, Tager IB. Lifetime and current leisure time physical activity and all-cause mortality in an elderly cohort. Epidemiology. 2009 doi: 10.1097/EDE.0b013e31819e3f28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bembom Oliver, van der Laan Mark. Statistical methods for analyzing sequentially randomized trials, commentary on JNCI article Adaptive therapy for androgen independent prostate cancer: A randomized selection trial including four regimens, by Peter F. Thall, C. Logothetis, C. Pagliaro, S. Wen, M.A. Brown, D. Williams, R. Millikan. Journal of the National Cancer Institute. 2007;2007;99(21):1577–1582. doi: 10.1093/jnci/djm185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag; 1997. [Google Scholar]
  6. Bryan J, Yu Z, van der Laan MJ. Analysis of longitudinal marginal structural models. Biostatistics. 2003;5(3):361–380. doi: 10.1093/biostatistics/kxg041. [DOI] [PubMed] [Google Scholar]
  7. Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology. 2005;2(2):131–154. doi: 10.1016/j.stamet.2005.02.003. [DOI] [Google Scholar]
  8. Gill R, Robins JM. Causal inference in complex longitudinal studies: continuous case. Ann Stat. 2001;29(6) [Google Scholar]
  9. Gill RD, van der Laan MJ, Robins JM. Coarsening at random: characterizations, conjectures and counter-examples. In: Lin DY, Fleming TR, editors. Proceedings of the First Seattle Symposium in Biostatistics. New York: Springer Verlag; 1997. pp. 255–94. [Google Scholar]
  10. Heitjan DF, Rubin DB. Ignorability and coarse data. Annals of statistics. 1991 Dec;19(4):2244–2253. doi: 10.1214/aos/1176348396. [DOI] [Google Scholar]
  11. Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561–570. doi: 10.1097/00001648-200009000-00012. [DOI] [PubMed] [Google Scholar]
  12. Jacobsen M, Keiding N. Coarsening at random in general sample spaces and random censoring in continuous time. Annals of Statistics. 1995;23:774–86. doi: 10.1214/aos/1176324622. [DOI] [Google Scholar]
  13. Kang J, Schafer J. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007a;22:523–39. doi: 10.1214/07-STS227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Kang J, Schafer J. Rejoinder: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science. 2007b;22:574–80. doi: 10.1214/07-STS227REJ. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Keleş S, van der Laan M, Dudoit S. Asymptotically optimal model selection method for regression on censored outcomes. Technical Report, Division of Biostatistics, UC Berkeley. 2002 [Google Scholar]
  16. Moore KL, van der Laan MJ.Covariate adjustment in randomized trials with binary outcomes Technical report 215, Division of Biostatistics, University of California; Berkeley: April2007 [Google Scholar]
  17. Moore KL, van der Laan MJ. Application of time-to-event methods in the assessment of safety in clinical trials. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009. [Google Scholar]
  18. Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B. 2003;65(2) doi: 10.1093/jrsssc/qlad037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Murphy SA, van der Laan MJ, Robins JM. Marginal mean models for dynamic treatment regimens. Journal of the American Statistical Association. 2001;96:1410–1424. doi: 10.1198/016214501753382327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Neugebauer R, van der Laan MJ. Why prefer double robust estimates. Journal of Statistical Planning and Inference. 2005;129(1–2):405–426. doi: 10.1016/j.jspi.2004.06.060. [DOI] [Google Scholar]
  21. Petersen Maya L, Deeks Steven G, Martin Jeffrey N, van der Laan Mark J.History-adjusted marginal structural models: Time-varying effect modification and dynamic treatment regimens Technical report 199, Division of Biostatistics, University of California; Berkeley: December2005 [Google Scholar]
  22. Polley EC, van der Laan MJ. Predicting optimal treatment assignment based on prognostic factors in cancer patients. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009. [Google Scholar]
  23. Ridgeway G, McCaffrey D. Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007;22:540–43. doi: 10.1214/07-STS227C. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Robins J, Orallana L, Rotnitzky A. Estimaton and extrapolation of optimal treatment and testing strategies. Statistics in Medicine. 2008;27(23):4678–4721. doi: 10.1002/sim.3301. [DOI] [PubMed] [Google Scholar]
  25. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer”. Statistica Sinica. 2001a;11(4):920–936. [Google Scholar]
  26. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000a;11(5):550–560. doi: 10.1097/00001648-200009000-00011. [DOI] [PubMed] [Google Scholar]
  27. Robins JM, Rotnitzky A, van der Laan MJ. Comment on “On Profile Likelihood” by S.A. Murphy and A.W. van der Vaart. Journal of the American Statistical Association – Theory and Methods. 2000b;450:431–435. doi: 10.2307/2669381. [DOI] [Google Scholar]
  28. Robins JM, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statistical Science. 2007;22:544–559. doi: 10.1214/07-STS227D. [DOI] [Google Scholar]
  29. Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association; 2000a. [Google Scholar]
  30. Robins JM. Discussion of “optimal dynamic treatment regimes” by susan a. murphy. Journal of the Royal Statistical Society: Series B. 2003;65(2):355–366. [Google Scholar]
  31. Robins JM. Optimal structural nested models for optimal sequential decisions. Heagerty PJ, Lin DY, editors. Proceedings of the 2nd Seattle symposium in biostatistics. 2005a;179:189–326. [Google Scholar]
  32. Robins JM.Optimal structural nested models for optimal sequential decisions Technical report, Department of Biostatistics, Havard University; 2005b [Google Scholar]
  33. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. doi: 10.1016/0270-0255(86)90088-6. [DOI] [Google Scholar]
  34. Robins JM. The analysis of randomized and non-randomized aids treatment trials using a new approach in causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, editors. Health Service Methodology: A Focus on AIDS. U.S. Public Health Service, National Center for Health SErvices Research; Washington D.C.: 1989. pp. 113–159. [Google Scholar]
  35. Robins JM. Proceeding of the Biopharmaceutical section. American Statistical Association; 1993. Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers; pp. 24–33. [Google Scholar]
  36. Robins JM. Causal inference from complex longitudinal data. In: Berkane Editor M., editor. Latent Variable Modeling and Applications to Causality. Springer Verlag; New York: 1997a. pp. 69–117. [Google Scholar]
  37. Robins JM. Structural nested failure time models. In: Armitage P, Colton T, Andersen PK, Keiding N, editors. The Encyclopedia of Biostatistics. John Wiley and Sons; Chichester, UK: 1997b. [Google Scholar]
  38. Robins JM. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) Springer; New York: 2000b. Marginal structural models versus structural nested models as tools for causal inference; pp. 95–133. [Google Scholar]
  39. Robins JM, Rotnitzky A. Comment on Inference for semiparametric models: some questions and an answer, by Bickel, P.J. and Kwon, J. Statistica Sinica. 2001b;11:920–935. [Google Scholar]
  40. Robins JM, Rotnitzky A.Recovery of information and adjustment for dependent censoring using surrogate markers AIDS Epidemiology, Methodological issues. Bikhäuser1992
  41. Rose S, van der Laan MJ.Simple optimal weighting of cases and controls in case-control studies The International Journal of Biostatistics, page http://www.bepress.com/ijb/vol4/iss1/19/, 2008 10.2202/1557-4679.1115 [DOI] [PMC free article] [PubMed]
  42. Rose S, van der Laan MJ.Why match? investigating matched case-control study designs with causal effect estimation The International Journal of Biostatistics, page http://www.bepress.com/ijb/vol5/iss1/1/, 2009 10.2202/1557-4679.1127 [DOI] [PMC free article] [PubMed]
  43. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. doi: 10.1093/biomet/70.1.41. [DOI] [Google Scholar]
  44. Rosenblum M, Deeks SG, van der Laan MJ, Bangsberg DR.The risk of virologic failure decreases with duration of hiv suppression, at greater than 50% adherence to antiretroviral therapy PLoS ONE 49e7196doi:10.1371/journal.pone.0007196, 2009 10.1371/journal.pone.0007196 [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Rubin DB. Matched Sampling for Causal Effects. Cambridge University Press; Cambridge, MA: 2006. [Google Scholar]
  46. Rubin DB, van der Laan MJ.Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis The International Journal of Biostatistics 41, Article 5, 2008 10.2202/1557-4679.1084 [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Sekhon JS. Multivariate and propensity score matching software with automated balance optimization: The matching package for R. Journal of Statistical Sotware, Forthcoming. 2008 [Google Scholar]
  48. Sinisi S, van der Laan MJ. The deletion/substitution/addition algorithm in loss function based estimation: Applications in genomics. Journal of Statistical Methods in Molecular Biology. 2004;3(1) doi: 10.2202/1544-6115.1069. [DOI] [PubMed] [Google Scholar]
  49. Tan Z. Comment: Understanding or, ps and dr. Statistical Science. 2007;22:560–568. doi: 10.1214/07-STS227A. [DOI] [Google Scholar]
  50. Tsiatis A, Davidian M. Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007;22:569–73. doi: 10.1214/07-STS227B. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Tuglus C, van der Laan MJ.Targeted methods for biomarker discovery, the search for a standard UC Berkeley Working Paper Series, page http://www.bepress.com/ucbbiostat/paper233/, 2008
  52. van der Laan MJ.Causal effect models for intention to treat and realistic individualized treatment rules Technical report 203, Division of Biostatistics, University of California; Berkeley: 2006 [Google Scholar]
  53. van der Laan MJ.Estimation based on case-control designs with known prevalance probability The International Journal of Biostatistics, page http://www.bepress.com/ijb/vol4/iss1/17/, 2008a 10.2202/1557-4679.1114 [DOI] [PubMed]
  54. van der Laan MJ.The construction and analysis of adaptive group sequential designs Technical report 232, Division of Biostatistics, University of California; Berkeley: March2008b [Google Scholar]
  55. van der Laan MJ, Dudoit S.Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples Technical report, Division of Biostatistics, University of California; Berkeley: November2003 [Google Scholar]
  56. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics. 2007;3(1) doi: 10.2202/1557-4679.1022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003. [Google Scholar]
  58. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology. 2004;3 doi: 10.2202/1544-6115.1036. [DOI] [PubMed] [Google Scholar]
  60. van der Laan MJ, Petersen ML, Joffe MM. History-adjusted marginal structural models and statically-optimal dynamic treatment regimens. The International Journal of Biostatistics. 2005;1(1):10–20. doi: 10.2202/1557-4679.1003. [DOI] [Google Scholar]
  61. van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Statistics and Decisions. 2006;24(3):373–395. doi: 10.1524/stnd.2006.24.3.373. [DOI] [Google Scholar]
  62. van der Laan MJ, Polley E, Hubbard A.Super learner Statistical Applications in Genetics and Molecular Biology 6252007. ISSN 1. 10.2202/1544-6115.1309 [DOI] [PubMed] [Google Scholar]
  63. van der Laan MJ, Rose S, Gruber S.Readings on targeted maximum likelihood estimation Technical report, working paper series http://www.bepress.com/ucbbiostat/paper254/, September2009
  64. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996. [Google Scholar]
  65. van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Statistics and Decisions. 2006;24(3):351–371. doi: 10.1524/stnd.2006.24.3.351. [DOI] [Google Scholar]
  66. Yu Z, van der Laan MJ.Construction of counterfactuals and the G-computation formula Technical report, Division of Biostatistics, University of California; Berkeley: 2002 [Google Scholar]
  67. Yu Z, van der Laan MJ.Double robust estimation in longitudinal marginal structural models Technical report, Division of Biostatistics, University of California; Berkeley: 2003 [Google Scholar]

Articles from The International Journal of Biostatistics are provided here courtesy of Berkeley Electronic Press

RESOURCES