Abstract
Suppose we have a binary treatment used to influence an outcome. Given data from an observational or controlled study, we wish to determine whether or not there exists some subset of observed covariates in which the treatment is more effective than the standard practice of no treatment. Furthermore, we wish to quantify the improvement in population mean outcome that will be seen if this subgroup receives treatment and the rest of the population remains untreated. We show that this problem is surprisingly challenging given how often it is an (at least implicit) study objective. Blindly applying standard techniques fails to yield any apparent asymptotic results, while using existing techniques to confront the non-regularity does not necessarily help at distributions where there is no treatment effect. Here, we describe an approach to estimate the impact of treating the subgroup which benefits from treatment that is valid in a nonparametric model and is able to deal with the case where there is no treatment effect. The approach is a slight modification of an approach that recently appeared in the individualized medicine literature.
Keywords: Individualized treatment, non-regular inference, stabilized one-step estimator, subgroup analyses
1. Introduction
Traditionally, statisticians have evaluated the efficacy of a new treatment using an average treatment effect which compares the population mean outcomes when everyone versus no one is treated. While analyses of the marginal effect often successfully identify whether or not introducing a treatment into the population is beneficial, these analyses underestimate the overall benefit of introducing treatment into the population when treatment is on average harmful in some strata of covariates. A treatment need not have adverse physiological side effects for the treatment effect to be negative: it will be negative if the administration of an inferior treatment under study precludes the administration of a superior treatment. To avoid this problem, researchers often perform subgroup analyses to see if the treatment effect varies between different strata of covariates.1,2 Many investigators consider subgroups defined by a single covariate at a time,3,4 though there is a growing trend toward defining these subgroups using multiple baseline covariates.5,6
Subgroup analyses have led to much disagreement between clinicians and statisticians. As has been highlighted elsewhere,2 A R Feinstein eloquently described this controversy as follows:
The essence of tragedy has been described as the destructive collision of two sets of protagonists, both of whom are correct. The statisticians are right in denouncing subgroups that are formed post hoc from exercises in pure data dredging. The clinicians are also right, however, in insisting that a subgroup is respectable and worthwhile when established a priori from pathophysiological principles.7
While learning about subgroup specific effects is clearly important when they exist, the concerns of statisticians are understandable. When the analysis is not prespecified, statistical significance procedures tend not to be reliable.2,3,8 To exemplify the concerns on this issue, P Sleight shows that the strong marginal effect of aspirin for preventing myocardial infarction changes to a negative effect in two subgroups when these subgroups are defined by astrological sign.9 While it is clearly unlikely in practice that astrological sign yields heterogeneous subgroups, one would hope that a statistical procedure would be sufficiently robust to inform the user that astrological sign is not in fact associated with efficacy. Some have argued that sample splitting methods would help robustify a procedure to “data dredging” by either humans or overfitting by an algorithm.10–12
In a related literature, optimal individualized treatment strategies have been developed to formalize, in a rigorous manner, the process of allowing treatment decisions to depend on baseline covariates.13 An individualized treatment strategy is a treatment strategy that makes a treatment decision based on a patient’s covariates. Often the objective for such strategies is to optimize the population mean outcome under the given treatment strategy.14,15 For binary treatment decisions in a single time point setting, an optimal individualized treatment strategy is any individualized treatment strategy which treats all individuals for whom the average treatment effect is positive in their stratum of covariates and treats no one for whom the average treatment effect is negative in their stratum of covariates. Estimating the population mean outcome under the optimal individualized treatment rule has been shown to be non-regular when the optimal treatment strategy is not unique.16,17 This non-regularity causes standard semiparametric estimation approaches to fail. Despite the complexity of this estimation problem, Chakraborty et al.18 show that one can develop a slower than root-n rate confidence interval for the mean outcome under the estimated optimal individualized treatment rule using a bootstrap procedure, and Luedtke and van der Laan17 show how to obtain a root-n rate confidence interval for the mean outcome under the actual optimal individualized treatment strategy. Often one can use the same confidence interval for these two estimation problems because one can estimate the optimal treatment strategy consistently (in terms of the strategy’s mean outcome) at a faster than root-n rate.17
As the reader may have noticed, the two literatures are analogous – if one defines the optimal subgroup as the subgroup of covariate strata in which the treatment effect is positive, then the optimal subgroup is (up to covariate strata for which there is no treatment effect) equal to the group of individuals for which an optimal treatment rule suggests treatment. However, the subgroup literature has not confronted the problem of developing high-powered inference when an arbitrary algorithm is used to develop the subgroup – while the sample splitting procedure described in Malani et al.12 is valid for a single sample split provided the optimal subgroup is not empty, subsequently averaging across sample splits will not yield valid inference (see the discussion of the use of cross-validation for individualized treatment rules in van der Laan and Luedtke19). Thus there is a significant loss of statistical power in such a procedure.
In this work, we aim to satisfy the desires of both statisticians and clinicians – we seek a statistically valid subgroup analysis procedure which allows the incorporation of both the subject matter knowledge of physicians and the agnostic flexibility of modern statistical learning techniques. Our subgroup analysis procedure will return an estimate of the population level effect of treating everyone in a stratum of covariates with positive treatment effect versus treating no one. This succinctly characterizes the effect of optimally introducing a given treatment into a population. To estimate this quantity, we modify an estimator from the individualized treatment literature which overcomes a statistical challenge that typically arises when trying to estimate quantities involving individualized treatment rules.17 We will show that an additional statistical challenge arises when trying to use a variant of this estimator in the subgroup setting. We will then show how to overcome this challenge.
2. Statistical formulation
Suppose we observe baseline covariates W, an indicator of binary treatment A, and an outcome Y occurring after treatment and covariates. Let P0 be some distribution for O ≡ (W, A, Y) in a nonparametric statistical model that at most places restrictions on the probability of treatment given covariates. We observe n independent individuals O1, …, On drawn from P0. Define

b0(W) ≡ E0[Y | A = 1, W] − E0[Y | A = 0, W].

Under causal assumptions not elaborated here, b0(W) can be identified with the additive effect of treatment on outcome if everyone versus no one in a stratum of covariates W receives treatment.20 We use sg to denote any (measurable) subset of the support of W. Define

Ψsg(P0) ≡ E0[1{W ∈ sg} b0(W)].
Under causal assumptions, Ψsg(P0) is identified with the difference (i)–(ii) between (i) the average outcome if the only individuals receiving treatment in the population are those whose covariates fall in sg and (ii) the average outcome if no one in the population receives treatment. Drawing parallels to optimal individualized treatment strategies, Ψsg(P0) is maximized at sg if and only if sg includes precisely those individuals with covariate w such that b0(w) > 0 and does not include those with b0(w) < 0.14,15 The maximizer is non-unique at so-called “exceptional laws”, i.e. distributions for which b0(W) = 0 with positive P0 probability.15 Define Ψ(P0) ≡ maxsg Ψsg(P0). Throughout we define bP, Ψsg(P), and Ψ(P) for arbitrary P analogously to b0, Ψsg(P0), and Ψ(P0).
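Since Ψsg(P0) = E0[1{W ∈ sg} b0(W)], the maximizing subgroup collects exactly the strata with positive blip, so Ψ(P0) = E0[(b0(W))+]. As a concrete illustration, the following minimal R sketch (all names illustrative, not from our implementation) evaluates this quantity by Monte Carlo under a hypothetical outcome regression logit E[Y | A, W] = AW1 with W1 standard normal:

```r
# Monte Carlo evaluation of Psi(P0) = E0[(b0(W))+] under the hypothetical
# model logit E[Y | A, W] = A * W1, with W1 ~ N(0, 1).
set.seed(1)
W1 <- rnorm(1e6)
b0 <- plogis(W1) - plogis(0)  # b0(W) = E[Y | A = 1, W] - E[Y | A = 0, W]
mean(pmax(b0, 0))             # about 0.09; cf. row A3 of Table 1 in Section 5
```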
3. Breakdown of “standard” estimators
We now describe the way in which the standard semiparametric estimation roadmap suggests we estimate Ψ(P0). A function known as the efficient influence function (EIF) often plays a key role in this estimation procedure. While we avoid a formal presentation of the derivation of EIFs here, the key result about EIFs is that they typically yield the expansion

Ψ(P̂) − Ψ(P0) = −E0[D(P̂)(O)] + Rem(P̂, P0),   (1)
where P̂ is an estimate of P0, D(P̂) is the EIF of Ψ at P̂, and Rem(P̂, P0) is a remainder that plausibly converges to zero faster than n−1/2. We will present an explicit expression for D(P̂) at the end of this section, but for now we state that the corresponding remainder term is given by

Rem(P̂, P0) = E0[1{W ∈ sgn} Σa (2a − 1){(g0(a | W) − gP̂(a | W))/gP̂(a | W)}{Q̄0(a, W) − Q̄P̂(a, W)}] + E0[(1{W ∈ sgn} − 1{b0(W) > 0}) b0(W)],

where sgn is the optimal subgroup under P̂, g(a | w) ≡ P(A = a | W = w) denotes the treatment mechanism, and Q̄(a, w) ≡ E[Y | A = a, W = w] denotes the outcome regression, with subscripts indicating the distribution under which each is evaluated. The first term on the right is a double robust term21 that shrinks to zero faster than n−1/2 if the outcome regression and treatment mechanism are estimated well. The second term requires that the optimal subgroup can be estimated well and is plausible if the stratum specific treatment effect function does not concentrate too much mass near zero (mass at zero is not problematic since any subgroup decision for these strata is optimal). See Theorem 8 of Luedtke and van der Laan17 for precise conditions under which this term is small. In principle, sgn need not be an optimal subgroup under P̂, i.e. one can replace Ψ(P̂) on the left-hand side of (1) with Ψsgn(P̂) without changing the expansion. We ignore such considerations here for brevity, though the discussion in a closely related problem is given in van der Laan and Luedtke.19
A one-step estimator of the form ψn ≡ Ψ(P̂n) + (1/n) Σi D(P̂n)(Oi) aims to correct the bias −E0[D(P̂n)(O)] on the right-hand side of (1) by adding an estimate of that expectation, yielding

n1/2[ψn − Ψ(P0)] ≈ n−1/2 Σi {D(P̂n)(Oi) − E0[D(P̂n)(O)]},   (2)

where the above approximation is valid provided n1/2 Rem(P̂n, P0) converges to zero in probability. For a general parameter Ψ, targeted minimum loss-based estimators (TMLEs) can be seen to follow the above prescribed formula, with the estimate P̂n* carefully chosen so that the empirical mean of D(P̂n*) is zero, and thus the final estimator is the plug-in estimator Ψ(P̂n*).22 A detailed exposition of efficiency theory is given in Bickel et al.23
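To fix ideas, the following minimal R sketch implements a one-step estimator of this form in a randomized trial with known g(1 | W) = 1/2; Qbar_hat is a user-supplied estimate of the outcome regression (vectorized in its first argument), and all names are illustrative rather than drawn from our implementation:

```r
# One-step estimator of Psi(P0) per (1)-(2), assuming a known randomization
# probability g1 and an externally fitted outcome regression Qbar_hat(a, W).
one_step <- function(W, A, Y, Qbar_hat, g1 = 0.5) {
  b_hat  <- Qbar_hat(1, W) - Qbar_hat(0, W)   # estimated blip b_Phat(W)
  sg     <- as.numeric(b_hat > 0)             # estimated optimal subgroup
  g      <- ifelse(A == 1, g1, 1 - g1)
  plugin <- mean(sg * b_hat)                  # plug-in estimate Psi(P_hat)
  # EIF D(P_hat) evaluated at each observation
  eif    <- sg * ((2 * A - 1) / g * (Y - Qbar_hat(A, W)) + b_hat) - plugin
  psi    <- plugin + mean(eif)                # bias-corrected one-step estimate
  c(est = psi, se = sd(eif) / sqrt(length(Y)))
}
```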
We cannot apply the central limit theorem to the right-hand side of equation (2) without further conditions because the right-hand side is a scaled empirical mean of functions which depend on the data. We now give sufficient conditions. Suppose that D(P̂n) has a limit IF∞ in the sense that

E0[{D(P̂n)(O) − IF∞(O)}²] → 0 in probability.   (Lim)

Typically IF∞ = D(P0). If P̂n is not allowed to heavily overfit the data, the instances of D(P̂n) on the right-hand side of equation (2) can be replaced with IF∞. The conditions on P̂n which prevent overfitting are given by the empirical process conditions presented in Part 2 of van der Vaart and Wellner.24 If

σ0² ≡ VarP0(IF∞(O)) > 0,   (V+)

then the central limit theorem can be used to see that n1/2[ψn − Ψ(P0)] converges to a normal distribution with mean zero and variance σ0². Under these conditions, Ψ(P0) falls in [ψn ± 1.96σn/n1/2] with probability approaching 0.95, where σn² is the empirical variance of D(P̂n) applied to the data. If σ0² is zero, then n1/2[ψn − Ψ(P0)] converges to zero in probability, but there is no guarantee that [ψn ± 1.96σn/n1/2] contains Ψ(P0) with probability approaching 0.95: both n1/2[ψn − Ψ(P0)] and σn are converging to zero, but the coverage depends on the relative rate of convergence of the two quantities.
We now argue that, at distributions with no treatment effect in some or all strata, it is unlikely that both (Lim) and (V+) hold. When P0 is non-exceptional, the EIF of Ψ at P0 is given by

D(P0)(o) = 1{b0(w) > 0}[(2a − 1)/g0(a | w) · {y − Q̄0(a, w)} + b0(w)] − Ψ(P0),

with D(P) defined analogously for arbitrary P. If P0 is exceptional, the above definition of D(P0) can still be used and the same central limit theorem result about equation (2) holds under (Lim) and (V+), though in truth Ψ is not smooth enough at P0 for an efficient influence function to be well defined.16,17 In light of the above expression, the validity of (Lim) will typically require the indicators 1{bP̂n(·) > 0} to have a mean-square limit. Suppose the data are drawn from an exceptional law where the treatment effect is zero on some set S0. In that case, we do not expect 1{bP̂n(·) > 0} to converge to anything on S0 since bP̂n(w) likely does not converge to 0 strictly from above or below at any given w for which b0(w) = 0.
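To see the difficulty concretely, suppose that at some w with b0(w) = 0 the estimation error of bP̂n(w) is centered at zero and shrinks at a root-n rate; the sign of the estimate is then essentially a fair coin flip at every sample size, so the indicator never settles down. A toy R illustration under this hypothetical error model:

```r
# At a point w with b0(w) = 0, simulate b_hat(w) ~ N(0, 1/n) across sample
# sizes: the plug-in indicator 1{b_hat(w) > 0} keeps flipping as n grows,
# so it has no limit and (Lim) is implausible. Purely illustrative.
set.seed(2)
n_grid <- 10^(2:7)
sapply(n_grid, function(n) as.integer(rnorm(1, 0, 1 / sqrt(n)) > 0))
```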
Now consider the case where treatment is always harmful, i.e. b0(W) < 0 with probability 1. In this case b0(W) ≥ 0 with probability 0, and so we would expect the indicator that bP̂n(W) is positive to converge to zero if P̂n is a good estimate of P0. But in this case, the subgroup that should be treated is empty, so that if the limit IF∞ exists, then it is zero almost surely and (V+) does not hold.
Finally, consider the intermediate case where there is no additive treatment effect within any stratum of covariates, i.e. b0(w) = 0 for all w. If bP̂n(w) converges to zero from both above and below on a set of w values with positive probability, then we do not expect (Lim) to hold. If bP̂n(w) converges to zero from below for all w, then we expect (Lim) to hold with IF∞ equal to the constant function zero and (V+) not to hold. If, for each w, bP̂n(w) converges to zero from either strictly above or strictly below, and the set of covariates for which the convergence occurs from above has positive probability, then we expect both (Lim) and (V+) to hold.
4. Avoiding the need for (Lim) and (V+)
In this section, we present an estimator which avoids the need for both (Lim) and (V+). We first present an estimator which does not require (Lim), and then argue that a simple extension of this estimator also does not require (V+).
This estimation strategy is similar to the one-step estimator presented in the previous section, but is designed to estimate the parameter in an online fashion, which eliminates the need for the influence function estimates to converge in the sense of (Lim). The online one-step estimator was originally presented in van der Laan and Lendle,25 and was refined in Luedtke and van der Laan17 to deal with cases where the convergence of the sort required by (Lim) fails to hold. The method will be presented in full generality in a forthcoming paper.
Let P̂i represent an estimate of P0 based on observations O1, …, Oi. The stabilized online one-step estimator for Ψ(P0) is given by

ψn ≡ [Σi=ℓn,…,n 1/σ̂i−1]−1 Σi=ℓn,…,n (1/σ̂i−1)[Ψ(P̂i−1) + D(P̂i−1)(Oi)],   (3)

where ℓn is some user-defined quantity that may or may not grow to infinity but must satisfy n − ℓn → ∞; σ̂i² represents an estimate of the variance of D(P̂i)(O) based on observations O1, …, Oi; and σ̄n is the harmonic mean of σ̂ℓn−1, …, σ̂n−1. We now wish to apply the martingale central limit theorem26 to understand the behavior of ψn. As each term in the sum defining ψn has conditional variance converging to 1 due to the stabilization by 1/σ̂i−1, the validity of our central limit theorem argument does not rely on an analogue of (Lim). It does, however, rely on an analogue of (V+). The primary condition we would use to establish the validity of the central limit theorem argument is that the subgroup estimated under P̂i is consistent for the optimal subgroup as i gets large and that there exists some δ > 0 such that

lim infi→∞ VarP0(D(P̂i)(O) | P̂i) ≥ δ with probability 1.   (V+′)
The former condition holds under a Glivenko–Cantelli condition which is discussed in Theorem 7 of Luedtke and van der Laan.17 Under these conditions, Section 7 of Luedtke and van der Laan17 (especially Lemma 6) shows that

(n − ℓn + 1)1/2 σ̄n−1[ψn − Ψ(P0)] ≈ (n − ℓn + 1)−1/2 Σi=ℓn,…,n (1/σ̂i−1){D(P̂i−1)(Oi) − E0[D(P̂i−1)(O) | P̂i−1]},

provided the same conditions needed for equation (2) hold. The above approximation is accurate up to a term that goes to zero in probability. The martingale central limit theorem can now be applied to establish the validity of the 95% confidence interval CIst ≡ [ψn ± 1.96σ̄n/(n − ℓn + 1)1/2]. We refer the reader to Theorem 2 in Luedtke and van der Laan17 for a sense of the formal conditions needed to prove this result. If the treatment effect is negative for all strata of covariates, then (V+′) will not hold if P̂i is a reasonable estimate of P0 in the sense that the estimated optimal subgroup converges to the empty set. Similarly, we have no guarantee that (V+′) will hold if b0(W) is zero almost surely.
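A minimal R sketch of this estimator and of CIst follows. The helpers fit_P (returns an estimate of P0 from past observations), plugin (returns Ψ under that estimate), and eif (evaluates D at given observations) are hypothetical placeholders for the user's own nuisance estimators:

```r
# Stabilized online one-step estimator (3) and CI_st; requires l_n >= 2.
stabilized_onestep <- function(O, fit_P, plugin, eif, l_n) {
  n   <- nrow(O)
  idx <- l_n:n
  terms <- sig_hat <- numeric(length(idx))
  for (k in seq_along(idx)) {
    i          <- idx[k]
    P_hat      <- fit_P(O[1:(i - 1), , drop = FALSE])  # P_hat_{i-1}
    sig_hat[k] <- sd(eif(P_hat, O[1:(i - 1), , drop = FALSE]))  # sigma_hat_{i-1}
    terms[k]   <- plugin(P_hat) + eif(P_hat, O[i, , drop = FALSE])
  }
  psi       <- sum(terms / sig_hat) / sum(1 / sig_hat)  # weighted mean in (3)
  sigma_bar <- length(sig_hat) / sum(1 / sig_hat)       # harmonic mean
  half      <- 1.96 * sigma_bar / sqrt(length(idx))
  list(est = psi, ci = c(psi - half, psi + half),
       sig_hat = sig_hat, terms = terms)
}
```

Refitting the nuisance estimators at every i is costly; the chunking scheme described in Section 5 (refitting only on increasing chunks of data) is the practical remedy.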
Suppose that we are not willing to assume that (V+′) holds. A natural approach is to redefine the inverse weights to equal 1/σ̂i(δ), where σ̂i(δ) ≡ max(σ̂i, δ) for some fixed δ > 0. We then define σ̄n(δ) to be equal to the harmonic mean of σ̂ℓn−1(δ), …, σ̂n−1(δ), and ψn(δ) as in equation (3) but with σ̂i−1 and σ̄n replaced by σ̂i−1(δ) and σ̄n(δ). Clearly σ̄n(δ) ≥ σ̄n, and thus the confidence interval CIst(δ) ≡ [ψn(δ) ± 1.96σ̄n(δ)/(n − ℓn + 1)1/2] is wider than CIst, though it may have a different midpoint. We conjecture that this confidence interval is conservative when the truncation scheme is active so that σ̄n(δ) > σ̄n, though proving this result has proven challenging.
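In code, the truncation is a one-line change to the previous sketch, with sig_hat denoting the returned vector of scale estimates σ̂i−1 (δ and all names illustrative):

```r
# delta-truncation: sigma_hat_{i-1}(delta) = max(sigma_hat_{i-1}, delta).
delta       <- 0.001
sig_delta   <- pmax(sig_hat, delta)
sigma_bar_d <- length(sig_delta) / sum(1 / sig_delta)  # >= untruncated harmonic mean
half_d      <- 1.96 * sigma_bar_d / sqrt(length(sig_delta))  # CI_st(delta) half-width
```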
To give the reader a sense of why we have hope that this conjecture will hold, we show in Appendix 1 that adding normal noise (with random variance depending on the sample) to ψn(δ) can yield a valid 95% confidence interval for Ψ(P0). It then seems reasonable that removing the noise can only improve coverage. Readers whose primary concern is the theoretical soundness of an inferential procedure can apply the estimator in Appendix 1 and rest assured that their confidence interval will be valid provided the optimal subgroup is estimated sufficiently well and a double robust term is small. Nonetheless, the type I error gains from not using this noised estimator, which materialize provided the conjecture holds, suggest that our original unnoised confidence interval performs better than the noised interval.
5. Simulation study
5.1. Methods
We now present a simulation study conducted in R.27 Our simulation uses a four-dimensional covariate W drawn from a mean zero normal distribution with identity covariance matrix. Treatment A is drawn according to a Bernoulli random variable with probability of success 1/2, independent of baseline covariates. The outcome Y is Bernoulli, and the outcome regressions considered in our primary analysis are displayed in Table 1.
Table 1.

| Simulation | logit E[Y \| A, W] | Ψ(P0) | P0(b0(W) = 0) | P0(b0(W) < 0) |
|---|---|---|---|---|
| N1 | W1 + W2 | 0 | 1 | 0 |
| N2 | −0.2A[(W1 − z0.8)+]² | 0 | 0.80 | 0.20 |
| N3 | −0.25A | 0 | 0 | 1 |
| A1 | 0.8A | 0.19 | 0 | 0 |
| A2 | | 0.06 | 0.25 | 0.38 |
| A3 | AW1 | 0.09 | 0 | 0.50 |

Note: Decimals rounded to the nearest hundredth. For N2, z0.8 ≈ 0.84 is the 80th percentile of a standard normal distribution. We use x+ to denote the positive part of a real number x.
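A sketch of this data-generating mechanism for scenario A3 of Table 1 (function name illustrative):

```r
# Draw n observations with four standard normal covariates, randomized
# treatment, and a Bernoulli outcome with logit E[Y | A, W] = A * W1 (row A3).
gen_data <- function(n) {
  W <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(NULL, paste0("W", 1:4)))
  A <- rbinom(n, 1, 0.5)
  Y <- rbinom(n, 1, plogis(A * W[, 1]))
  data.frame(W, A = A, Y = Y)
}
```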
We compare two estimators of Ψ(P0). The first is the stabilized one-step estimator. We truncate the inverse weights at 0.1, 0.001, and 10−20. The results were essentially identical for truncations of 10−3 and 10−20, and thus we only display the results for the truncation of 10−3. We set ℓn = n/10, and to speed up the computation time, we estimated the subgroup and outcome regression using observations O1, …, Ok(i)n/10 for all i ≥ ℓn, where k(i) is the largest integer such that k(i)n/10 < i (see Section 6.1 of Luedtke and van der Laan17 for more details). The second estimator is a 10-fold cross-validated TMLE (CV-TMLE). This estimator is analogous to the CV-TMLE for the mean outcome under an optimal treatment rule as presented in van der Laan and Luedtke,19 but is modified to account for the fact that Ψ(P0) is equal to this quantity minus the mean outcome when no one in the population is treated. We truncate the variance estimates for this estimator at the same values as considered for the stabilized one-step estimator. Following the theoretical results in van der Laan and Luedtke,28 we can formally show that this estimator is asymptotically valid when cross-validated analogues of (Lim) and (V+) hold.
We estimate the blip function b0 using the super-learner methodology as described in Luedtke and van der Laan.29 Super-learner is an ensemble algorithm with an oracle guarantee ensuring that the resulting blip function estimate will perform at least as well as the best candidate in the library, up to a small error term. We use a squared error loss to estimate the blip function, and use as candidate algorithms SL.gam, SL.glm, SL.glm.interaction, SL.mean, and SL.rpart in the R package SuperLearner.30 The outcome regression E[Y|A, W] is estimated using this same super-learner library but with the log-likelihood loss to respect the bounds on the outcome. The probability of treatment given covariates was treated as known, and the known value was used by all of the estimators.
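The sketch below shows one way to fit the blip with this library, assuming the known randomization probability of 1/2: with known g, the pseudo-outcome Z ≡ (2A − 1)Y/g(A | W) satisfies E[Z | W] = b0(W), so a squared-error super-learner regression of Z on W targets the blip. This illustrates the approach rather than reproducing our exact implementation (SL.gam and SL.rpart additionally require the gam and rpart packages):

```r
library(SuperLearner)
d   <- gen_data(1000)                       # from the earlier sketch
Z   <- (2 * d$A - 1) * d$Y / 0.5            # pseudo-outcome: E[Z | W] = b0(W)
lib <- c("SL.gam", "SL.glm", "SL.glm.interaction", "SL.mean", "SL.rpart")
fit <- SuperLearner(Y = Z, X = d[, paste0("W", 1:4)],
                    family = gaussian(), SL.library = lib)
b_hat  <- as.numeric(predict(fit, newdata = d[, paste0("W", 1:4)])$pred)
sg_hat <- b_hat > 0                         # estimated optimal subgroup
```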
We also compare our estimators to the oracle estimator which a priori knows the optimal subgroup sg0 ≡ {w : b0(w) > 0}. In particular, we use a CV-TMLE for Ψsg0(P0) in which we treat sg0 as known. The estimation problem is regular in this case, so we expect the corresponding confidence intervals to have proper coverage at exceptional laws. We truncate the variance estimate at the same values as for the other methods.
5.2. Results
Figure 1 displays the coverage of the confidence interval lower bounds of the various estimation strategies. All methods appear to achieve proper 97.5% lower bound coverage, with the stabilized one-step and the non-oracle CV-TMLE estimators generally being conservative. This conservative behavior is to be expected given that both of these sample splitting procedures need to estimate the optimal subgroup, and thus in any finite sample are expected to be negatively biased due to the suboptimality of the estimated subgroup. The oracle CV-TMLE attains the nominal coverage rate for alternative distributions, and is also conservative for non-alternatives.
We now verify the tightness of the lower bounds for the alternative distributions A1, A2, and A3. Figure 2 shows the power for the test of H0: Ψ(P0) = 0 against H1: Ψ(P0) > 0, where the test was conducted using the duality between hypothesis tests and confidence intervals. We see that the stabilized one-step is slightly less powerful than the non-oracle CV-TMLE. This is likely due to the online nature of the stabilized one-step estimator relative to the CV-TMLE. Nonetheless, the power loss is not large, and the fact that we have actual theoretical results for this estimator even at exceptional laws should make up for this slight loss of power.
Figure 3 displays the two-sided coverage of the 95% confidence intervals. One could argue that upper bound coverage is not interesting for the null distributions, given that any failure of the upper bound of the confidence interval to cover Ψ(P0) = 0 requires this upper bound to be negative. Hence, we can always obtain proper upper bound coverage at null distributions by ensuring that the upper bound of our confidence interval respects the parameter space of Ψ, i.e. is non-negative. Nonetheless, the coverage of the uncorrected two-sided confidence intervals (upper bound may be negative) is useful for detecting a lack of asymptotic normality of the estimator sequence. While the stabilized one-step has two-sided coverage above 0.95 for all distributions at all sample sizes of at least 500, the coverage for the unadjusted non-oracle CV-TMLE confidence interval falls at or below 0.90 for N1 and N2 at all sample sizes. This is in line with our lack of asymptotic results for the non-oracle CV-TMLE at exceptional laws.
We now consider the two-sided coverage for the alternative distributions A1, A2, and A3. The stabilized one-step confidence intervals have coverage that improves with sample size, though the improvement appears slow. The coverage is near nominal at a sample size of 4000 for all three simulations. In light of Figure 1, essentially all of the coverage deficiency is a result of a failure of the upper bound. This makes sense given that our estimator relies on a second-order term measuring a linear combination of the difference in impact of treating the estimated subgroups (estimated on increasing chunks of data) versus treating the optimal subgroup. While this term often reasonably shrinks to zero faster than n−1/2, in finite samples this term can hurt the upper bound coverage. The non-oracle CV-TMLE confidence intervals, on the other hand, attain near nominal coverage at large sample sizes. This is to be expected at the non-exceptional laws A1 and A3 given that we can prove asymptotic normality in this case. We do not have an asymptotic result supporting the method’s proper coverage for exceptional law A2, though this is an interesting area for future work.
6. Discussion
We have studied the statistical challenges associated with estimating the additive effect of treating the subgroup of individuals with positive stratum-specific additive treatment effect. We showed that these challenges are similar to those arising when estimating the mean outcome under an optimal individualized treatment strategy. Indeed, the individuals treated by the strategy which maximizes the population mean outcome are the same individuals who belong to the optimal subgroup. An additional challenge arises when one wishes to consider the relative measure giving the additive effect of treating only those individuals in the optimal subgroup versus treating no one in the population. In this case, the parameter of interest is estimable at a faster than root-n rate for some data generating distributions. Procedures which yield root-n rate confidence intervals tend to fail in this setting due to the need to estimate both the (in truth empty) optimal subgroup and the variance of the estimate of the impact of treating this subgroup: generally the subgroup estimate will converge to the empty set and the variance estimate will converge to zero, but there is no guarantee that the relative rate of convergence of the two will yield valid inference.
Despite this added inferential challenge, we argue that obtaining a confidence interval for the impact of treating the optimal subgroup requires only minor modification to the confidence interval for the mean outcome when only the optimal subgroup is treated. In particular, we propose truncating the estimated variance in the martingale sum used in Luedtke and van der Laan17 at some constant δ > 0. If the truncation is not active, which will typically be true for alternative distributions under which there exists a subgroup for which the treatment effect is positive and is arguably true for many null distributions as well, then, under standard regularity conditions, we obtain root-n rate inference with coverage approaching 0.95. If the data are generated according to an alternative distribution for which there is a non-null (positive or negative) treatment effect within all strata of covariates, then our estimator is asymptotically efficient and, provided the truncation is not active, our confidence interval is asymptotically equivalent to a standard Wald-type confidence interval (see Corollary 3 in Luedtke and van der Laan17). We expect our confidence interval to be conservative when the truncation is active, though we leave this as a conjecture. We have instead shown that adding noise to our estimator yields a confidence interval with proper 95% coverage, though we suggest using the unnoised estimator in practice.
One could imagine several alternative solutions to the described inferential challenge. One such solution is to ensure that the variance of our estimator minus the truth, scaled by root-n, is positive as the sample size grows. This can be accomplished by changing the definition of the optimal subgroup to ensure that this subgroup is not too small, e.g. it contains at least 10% of the population. One can show via a change of variables that estimating the mean outcome under such a constrained subgroup is equivalent to estimating the mean outcome under an optimal rule which can treat at most 90% of the population, see Luedtke and van der Laan.31 Estimating this alternative constrained parameter is still difficult when the optimal subgroup is non-unique, though there is little risk of degenerate first-order behavior in this case. To construct confidence intervals despite the non-uniqueness of the optimal subgroup, one can combine the results in Luedtke and van der Laan31 with the stabilized one-step estimator presented in Luedtke and van der Laan.17
A cross-validated TMLE, closely related to that presented in van der Laan and Luedtke,28 outperformed the method proposed in this paper in many simulation settings. Nonetheless, we do not have any asymptotic results about the CV-TMLE at exceptional laws, in contrast to the estimator presented in this paper for which we do have such results. This estimator’s lack of asymptotic normality at such laws was evident in our simulation. We view a careful study of this estimator’s behavior at exceptional laws to be an important area for future research. In a forthcoming work, we will present a stabilized TMLE that has the same desirable asymptotic properties of the stabilized one-step estimator but, like the CV-TMLE, is a substitution estimator (thereby forcing the estimate to respect the parameter space).
One could imagine considering parameters relating to the optimal subgroup other than the one we have presented in this paper. For example, investigators may be interested in estimating the impact of treating everyone in the optimal subgroup on some secondary outcome. Each such parameter yields a new estimation problem and, in our experience, many of these problems still face at least one of the two primary challenges that we faced in this paper. In particular, these problems are often non-regular when the optimal subgroup is non-unique, and may have degenerate first-order behavior when the optimal subgroup is empty.
Acknowledgement
The authors thank Tyler VanderWeele for the valuable discussions.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the NIH grant R01 AI074345-06. Alex Luedtke was supported by a Berkeley Fellowship.
Appendix 1
Let {Zi : i = 1, …, ∞} be a sequence of i.i.d. standard normal random variables independent of all other sources of randomness under consideration. Let

ψn†(δ) ≡ σ̄n(δ)(n − ℓn + 1)−1 Σi=ℓn,…,n (1/σ̂i−1(δ))[Ψ(P̂i−1) + D(P̂i−1)(Oi) + {(σ̂i−1(δ)² − σ̂i−1²)+}1/2 Zi],

where x+ is the positive part of a real number x. Observe that each term in the sum on the right-hand side above has variance of approximately 1 (approximately because σ̂i−1 is only an estimate of the standard deviation of D(P̂i−1)(O)). It will then follow that, under regularity conditions, we can apply the martingale central limit theorem appearing in Brown26 to show that [ψn†(δ) ± 1.96σ̄n(δ)/(n − ℓn + 1)1/2] has coverage approaching 95% for Ψ(P0). These regularity conditions are the same as those used to establish the validity of CIst, except that they do not require (V+′) to hold. Note the similarity to CIst(δ), though the above confidence interval is centered about ψn†(δ) rather than ψn(δ).
We now relate the noised ψn†(δ) to the unnoised ψn(δ). Conditional on the data, ψn†(δ) is equal in distribution to ψn(δ) + τnZ, where Z is a standard normal random variable and

τn² ≡ σ̄n(δ)²(n − ℓn + 1)−2 Σi=ℓn,…,n (σ̂i−1(δ)² − σ̂i−1²)+/σ̂i−1(δ)².

That is, ψn†(δ) is equal to ψn(δ) plus normal noise, where the variance of the normal noise depends on the data. If the truncation is not active, then the variance of this noise is zero. Otherwise, the variance is positive, but the sign of the noise is independent of the data. Thus, it seems reasonable to expect that the unnoised ψn(δ) provides a better estimate of Ψ(P0) than ψn†(δ). It is for this reason that we expect the unnoised confidence interval CIst(δ) to have a coverage of at least 0.95 in large samples.
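A brief R sketch of the noised estimator, continuing the earlier illustrations: psi_terms denotes the vector of Ψ(P̂i−1) + D(P̂i−1)(Oi) values (the terms output of the stabilized one-step sketch), with sig_hat, sig_delta, and sigma_bar_d as defined above; all names are illustrative.

```r
# Noised estimator of Appendix 1: each delta-truncated term in (3) receives
# independent normal noise scaled so its conditional variance is roughly 1.
noise_sd   <- sqrt(pmax(sig_delta^2 - sig_hat^2, 0))  # {(s(delta)^2 - s^2)_+}^(1/2)
Z          <- rnorm(length(psi_terms))
psi_dagger <- sigma_bar_d * mean((psi_terms + noise_sd * Z) / sig_delta)
ci_dagger  <- psi_dagger + c(-1, 1) * 1.96 * sigma_bar_d / sqrt(length(psi_terms))
```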
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
1. Assmann SF, Pocock SJ, Enos LE, et al. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet 2000; 355: 1064–1069.
2. Rothwell PM. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. Lancet 2005; 365: 176–186.
3. Yusuf S, Wittes J, Probstfield J, et al. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA 1991; 266: 93–98.
4. VanderWeele TJ. On the distinction between interaction and effect modification. Epidemiology 2009; 20: 863–871.
5. Kent DM and Hayward RA. Limitations of applying summary results of clinical trials to individual patients: the need for risk stratification. JAMA 2007; 298: 1209–1212.
6. Abadie A, Chingos MM and West MR. Endogenous stratification in randomized experiments. National Bureau of Economic Research, 2013. NBER Working Paper No. 19742.
7. Feinstein AR. The problem of cogent subgroups: a clinicostatistical tragedy. J Clin Epidemiol 1998; 51: 297–299.
8. Lagakos SW. The challenge of subgroup analyses – reporting without distorting. N Engl J Med 2006; 354: 1667.
9. Sleight P. Debate: subgroup analyses in clinical trials: fun to look at – but don’t believe them. Curr Control Trials Cardiovasc Med 2000; 1: 25–27.
10. Lipkovich I, Dmitrienko A, Denne J, et al. Subgroup identification based on differential effect search: a recursive partitioning method for establishing response to treatment in patient subpopulations. Stat Med 2011; 30: 2601–2621.
11. Dmitrienko A, Muysers C, Fritsch A, et al. General guidance on exploratory and confirmatory subgroup analysis in late-stage clinical trials. J Biopharm Stat 2015; 26: 71–98.
12. Malani A, Bembom O and van der Laan MJ. Accounting for differences among patients in the FDA approval process. U Chicago Law & Econ Olin Working Paper No. 488, 2009.
13. Chakraborty B and Moodie EE. Statistical methods for dynamic treatment regimes. New York, NY: Springer, 2013.
14. Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B 2003; 65: 331–336.
15. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium on Biostatistics, Seattle, WA, 20–21 November 2000, 2004, pp. 189–326.
16. Robins JM and Rotnitzky A. Discussion of “Dynamic treatment regimes: technical challenges and applications”. Electron J Stat 2014; 8: 1273–1289.
17. Luedtke AR and van der Laan MJ. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann Stat 2016; 44: 713–742.
18. Chakraborty B, Laber EB and Zhao YQ. Inference about the expected performance of a data-driven dynamic treatment regime. Clin Trials 2014; 11: 408–417.
19. van der Laan MJ and Luedtke AR. Targeted learning of the mean outcome under an optimal dynamic treatment rule. J Causal Infer 2014; 3: 61–95.
20. Robins JM. A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect. Math Model 1986; 7: 1393–1512.
21. van der Laan MJ and Robins JM. Unified methods for censored longitudinal data and causality. New York, NY: Springer, 2003.
22. van der Laan MJ and Rose S. Targeted learning: causal inference for observational and experimental data. New York, NY: Springer, 2011.
23. Bickel PJ, Klaassen CAJ, Ritov Y, et al. Efficient and adaptive estimation for semiparametric models. Baltimore, MD: Johns Hopkins University Press, 1993.
24. van der Vaart AW and Wellner JA. Weak convergence and empirical processes. New York, NY: Springer, 1996.
25. van der Laan MJ and Lendle SD. Online targeted learning. Division of Biostatistics Working Paper 330, University of California, Berkeley, 2014, http://www.bepress.com/ucbbiostat/
26. Brown BM. Martingale central limit theorems. Ann Math Stat 1971; 42: 59–66.
27. R Core Team. R: a language and environment for statistical computing. Vienna, Austria, 2014, http://www.r-project.org/
28. van der Laan MJ and Luedtke AR. Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. Division of Biostatistics Working Paper 329, University of California, Berkeley, 2014, http://www.bepress.com/ucbbiostat/
29. Luedtke AR and van der Laan MJ. Super-learning of an optimal dynamic treatment rule. Division of Biostatistics Working Paper 326, University of California, Berkeley, 2014, http://www.bepress.com/ucbbiostat/
30. Polley E and van der Laan MJ. SuperLearner: super learner prediction, 2013, http://cran.r-project.org/package=SuperLearner
31. Luedtke AR and van der Laan MJ. Optimal individualized treatments in resource-limited settings. Int J Biostat 2016; 12: 283–303.