Abstract
A personalized treatment strategy formalizes evidence-based treatment selection by mapping patient information to a recommended treatment. Personalized treatment strategies can produce better patient outcomes while reducing cost and treatment burden. Thus, among clinical and intervention scientists, there is a growing interest in conducting randomized clinical trials when one of the primary aims is estimation of a personalized treatment strategy. However, at present, there are no appropriate sample size formulae to assist in the design of such a trial. Furthermore, because the sampling distribution of the estimated outcome under an estimated optimal treatment strategy can be highly sensitive to small perturbations in the underlying generative model, sample size calculations based on standard (uncorrected) asymptotic approximations or computer simulations may not be reliable. We offer a simple and robust method for powering a single stage, two-armed randomized clinical trial when the primary aim is estimating the optimal single stage personalized treatment strategy. The proposed method is based on inverting a plugin projection confidence interval and is thereby regular and robust to small perturbations of the underlying generative model. The proposed method requires elicitation of two clinically-meaningful parameters from clinical scientists and uses data from a small pilot study to estimate nuisance parameters which are not easily elicited. The method performs well in simulated experiments and is illustrated using data from a pilot study of time to conception and fertility awareness.
Keywords: Personalized medicine, sample size calculation, treatment regimes, nonregular asymptotics, projection confidence region
1. Introduction
A personalized treatment strategy formalizes clinical decision making as a function that maps individual patient information to a recommended treatment. By selecting the right treatment for the right patients, personalized treatment strategies can lead to better patient outcomes while reducing overmedication, cost, and patient burden [1]. There is a growing interest in conducting randomized clinical trials when one of the primary aims is the estimation of an optimal personalized treatment strategy [2]. However, at present, there are no appropriate sample size formulae to inform the design of such a trial as existing methods treat estimation of a personalized treatment strategy as a secondary aim [3, 4, 5]. Furthermore, the expected outcome of the optimal personalized treatment strategy is nonregular (in the sense described in [6]); consequently, sample size formulae cannot be based on conventional asymptotic approximations of the sampling distribution of the estimated expected value of the optimal personalized treatment strategy [7, 8]. We propose a method for sizing a trial when the primary aim is the estimation of an optimal personalized treatment strategy that is based on inverting a plugin projection confidence interval [9, 10]. Because this interval is regular and robust to small perturbations of the underlying generative model so too is our sample size procedure. The method relies on auxiliary data in the form of a small pilot study or observational data and two simple clinician-elicited quantities. Simulations suggest that a pilot study with as few as 20 subjects may be sufficient to obtain reliable sample size estimators.
There is a large literature on estimating optimal treatment regimes using randomized or observational studies (see [11] and references therein). We consider a single treatment decision wherein baseline patient information is used to inform treatment choice; baseline information can include measures of disease stage severity, medical history, genetic and genomic information, as well as demographics, patient preferences and cost. The single decision problem is a special case of the more general problem of estimating a sequence of treatment rules, one per stage of clinical information, that map accrued patient information to a recommended treatment. Many existing estimation algorithms include single- and multi-stage variants. Regression-based methods, e.g., Q-learning [12, 13, 14, 15, 16], are a common and convenient tool for estimating interpretable treatment strategies from observational or randomized study data. Furthermore, extensions of these methods exist to accommodate a number of practical issues including: missing data [17], censored data [18], multiple outcomes [19, 20], binary outcomes [11], and high-dimensional data [21, 22]. Thus, we use Q-learning as the basis for constructing our sample size procedure. However, the proposed methodology could be extended to deal with other estimation procedures including direct value maximization [23, 24, 25, 26, 27], and A-learning [28, 29, 10, 16].
This work is partially motivated by our involvement with a clinical trial of fertility awareness and time to pregnancy (NICHD, K23 HD0147901, PI Joseph B. Stanford) and the planning of a follow-up trial. The original trial enrolled n = 123 subjects and demonstrated a promising but statistically insignificant benefit of a personalized treatment strategy. Furthermore, existing theory suggests that treatment response varies across subjects according to number of past successful pregnancies, length of sexual relationship, and other factors (see Section 5); thus, there is also scientific justification for a personalized treatment strategy in this context. Clinical investigators are therefore interested in running a follow-up trial with more subjects to estimate a high-quality personalized treatment strategy. The proposed methodology offers a principled method for sizing such a follow-up trial.
2. Q-learning
Q-learning is a simple regression-based method for estimating an optimal personalized treatment strategy. This is the assumed estimation method in the sample size calculations given in Section 3. However, we do not assume that the posited regression model used in the Q-learning algorithm is correctly specified. Let capital letters like X and Y denote random variables and lowercase letters like x and y denote instances of these random variables. To estimate a personalized treatment strategy we will collect n independent, identically distributed triples (X, A, Y), one for each subject, where: X ∈ ℝp denotes baseline, i.e., pre-treatment, subject information; A ∈ {−1, 1} denotes the treatment received; and Y ∈ ℝ denotes the observed outcome, for which higher values correspond to better clinical outcomes. We assume the simplest setting in which A is independent of X and assigned randomly with P (A = 1) = P (A = −1) = 1/2. The more general setting, in which the distribution of A depends on X, can be handled at the expense of a more complex expression for the marginal mean outcome under a pre-specified policy, e.g., see [6, 24]; the extension of the asymptotic arguments to this setting is straightforward (see Remark 2).
A personalized treatment strategy is a map π : ℝp → {−1, 1} from the space of patient characteristics into the space of available treatments. Under π, a patient with X = x is recommended treatment π(x). To define the optimal treatment strategy we use the framework of potential outcomes [30, 31]. Let Y *(a) denote the potential outcome under treatment a ∈ {−1, 1}. For any treatment strategy, π, define Y*(π) =Σa∈{−1,1} Y*(a)1π(X)=a to be the potential outcome under π. Define V (π) ≜ 𝔼Y*(π), so that V (π) denotes the expected outcome if all subjects in the population are assigned treatment under π. An optimal treatment strategy, say πopt, satisfies V (πopt) ≥ V (π) for all π. To connect the potential outcomes with the observed data we assume: (i) Y = Y*(A) (consistency); (ii) there exists c > 0 so that c < P (A = 1|X) < 1 − c with probability one (positivity); and (iii) {Y*(−1), Y*(1)} ⫫ A |X (no unmeasured confounders). Note that (ii) and (iii) are automatically satisfied by the randomized design considered here. However, if the pilot data (described below) are observational, then the validity of these assumptions must be considered carefully. Define Q(x, a) ≜ 𝔼 {Y*(a)|X = x, A = a}, then, under the foregoing assumptions, it can be shown that Q(x, a) = 𝔼(Y |X = x, A = a) and V (π) = 𝔼Q {X, π(X)}. Thus, it can be seen that πopt(x) = arg maxa∈{−1,1} Q(x, a).
In practice, the Q-function, Q(x, a), is not known. In Q-learning, an estimator of πopt is formed by estimating Q(x, a) from the observed data and then using a plug-in estimator. We will assume a linear working model of the form Q(x, a; θ) = x1⊤α + ax2⊤β, where θ = (α⊤, β⊤)⊤ and x1, x2 are known features derived from x. Note that x1 and x2 may contain polynomial terms or other nonlinear basis expansions. Let ℙn denote the empirical measure, i.e., given i.i.d. random variables Z1, …, Zn and a function f defined on the support of Z, ℙn f(Z) ≜ n−1 Σi=1n f(Zi). Define the least squares estimator θ̂n ≜ arg minθ ℙn {Y − Q(X, A; θ)}2. Subsequently, π̂n(x) ≜ sign(x2⊤β̂n) is the estimated optimal treatment strategy.
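As a concrete, hypothetical illustration of this estimation step, the sketch below fits the linear working model by least squares and returns the plug-in rule π̂n. The synthetic generative model and the choice x1 = x2 = (1, x⊤)⊤ are assumptions made for the example only.

```python
import numpy as np

def fit_q_learning(X, A, Y):
    """Fit Q(x, a; theta) = x1'alpha + a x2'beta by least squares and
    return (alpha_hat, beta_hat, rule), where rule is the plug-in
    strategy pi_hat(x) = sign(x2'beta_hat).

    For this sketch we take x1 = x2 = (1, x'): an intercept plus the
    raw covariates."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])            # main-effect features
    design = np.column_stack([X1, A[:, None] * X1])  # [x1, a * x2]
    theta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    alpha, beta = theta[: p + 1], theta[p + 1:]
    rule = lambda x: np.sign(np.concatenate([[1.0], x]) @ beta)
    return alpha, beta, rule

# Synthetic randomized-trial data (purely illustrative).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
A = rng.choice([-1, 1], size=n)                      # P(A = 1) = 1/2
Y = X @ np.ones(p) + A * (X[:, 0] - 0.5) + rng.normal(size=n)
alpha_hat, beta_hat, pi_hat = fit_q_learning(X, A, Y)
```

Any richer basis expansion of x can be substituted for the feature maps without changing the structure of the fit.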
Define θ* ≜ arg minθ 𝔼 {Y − Q(X, A; θ)}2 and subsequently π*(x) ≜ sign(x2⊤β*). Thus, θ* is the population analog of θ̂n. However, π* need not equal πopt if the working model for the Q-function is not correctly specified (see, for example, [14, 16, 32]). In the context of a sample size calculation V (π*) is more relevant than V (πopt) because the (possibly misspecified) postulated model for the Q-function will be used in the analysis of the collected data. Our sample size procedure is robust in that it provides sample size estimates that are correctly powered when this postulated model is incorrect.
2.1. Estimation of V (π*) and nonregularity
For any γ ∈ ℝdim(x2) let πγ(x) ≜ sign(x2⊤γ), and write V (γ) to denote V (πγ). Applying a change of measure it follows that V (γ) = 2𝔼Y 1AX⊤γ≥0 [26, 24, 25]. Thus, V (β*) is a nonsmooth (nondifferentiable) functional of the underlying generative distribution [33, 34]. Consequently, there exists neither a regular nor an asymptotically unbiased estimator of V (β*) [35, 8]. To see how this impacts sample size calculation, define
V̂𝓓(n)(γ) ≜ 2ℙnY 1AX⊤γ≥0 − ℙn (2 · 1AX⊤γ≥0 − 1) Q{X, πγ(X); θ̂n};  (1)
then V̂𝓓(n)(β*) is unbiased for V (β*) and V̂𝓓(n)(β̂n) is an augmented inverse probability weighted estimator of V (β*) [6]. Note that the first term in (1), 2ℙnY 1AX⊤γ≥0, is a consistent estimator of V (γ) and is nonparametric in the sense that it does not rely on a model for the Q-function. The second term in (1) has mean zero and thus does not affect the consistency of V̂𝓓(n)(γ); instead this term is included to reduce the variability of the estimator. See [24] for a detailed derivation of this estimator in a more general setting. A standard approach to sample size calculation would depend on the limiting distribution of 𝔾n ≜ √n {V̂𝓓(n)(β̂n) − V (β*)}. However, as stated above, because V (β*) is nonsmooth, 𝔾n is nonregular and its sampling distribution is difficult to approximate in finite samples (see, for example, [36] and references therein). In fact, V̂𝓓(n)(β̂n) need not even be consistent for V (β*) (see [33]). Thus, sample size calculations should not be based directly on asymptotic approximations to the sampling distribution of 𝔾n.
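For intuition, the first (nonparametric) term of the estimator, 2ℙnY 1AX⊤γ≥0, is simple to compute. The sketch below assumes P(A = 1) = 1/2 and a feature vector containing an intercept, and omits the mean-zero augmentation term.

```python
import numpy as np

def value_ipw(X, A, Y, gamma):
    """Inverse-probability-weighted estimate 2 * Pn[ Y 1{A x'gamma >= 0} ]
    of V(gamma) under P(A = 1) = 1/2 (the factor 2 is 1 / P(A = a))."""
    X2 = np.column_stack([np.ones(len(Y)), X])   # prepend an intercept
    agree = (A * (X2 @ gamma)) >= 0              # received treatment matches pi_gamma
    return 2.0 * np.mean(Y * agree)

# Sanity check: with gamma = (1, 0), pi_gamma recommends A = 1 for everyone,
# so only subjects with A = 1 contribute and the factor 2 re-weights them.
X = np.zeros((4, 1))
A = np.array([1, -1, 1, -1])
Y = np.ones(4)
v = value_ipw(X, A, Y, np.array([1.0, 0.0]))     # 2 * (2/4) = 1.0
```

The augmentation term in (1) would be subtracted from this quantity; it changes the variance but not the limit.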
3. Projection sample size estimator
Let V0 denote an elicited expected outcome under standard care, and let δ > 0 be an elicited value so that δV0 denotes a clinically meaningful improvement in the expected outcome. For simplicity we treat V0 as fixed; however, this is not essential: an alternative would be to replace V0 with the mean outcome under a fixed treatment regime π(x) ≡ a for all x. Suppose that we observe data 𝓓p, comprising np subjects, from a preliminary study; this preliminary study could be a standalone pilot study, an internal pilot, or an observational study. For simplicity, we assume that 𝓓p is collected in a randomized pilot study of exactly the same form as the follow-up study we are trying to size, but with far fewer patients (see Remark 3 below). Our goal is to determine a sample size n* = n*(𝓓p) so that:
(P1) The power to reject the hypothesis V (β*) = V0 in favor of V (β*) > V0 is at least 1 − ρ when V (β*) ≥ (1 + δ)V0;

(P2) P{|V̂𝓓(n*)(β̂n*) − V (β*)| ≤ ε} ≥ 1 − ρ;
where 𝓓(n*) is a follow-up study of size n*, and ρ, ε > 0 are user-chosen values. Conceptually, it is possible to have different values of ρ in (P1) and (P2), say ρ1 and ρ2; however, as we will show, because our method relies on inverting a confidence interval for V (β*), it would be necessary to choose the level of this confidence interval to be {1 − min(ρ1, ρ2)} × 100% to ensure that (P1) and (P2) hold simultaneously, so there is no loss of generality in setting ρ = ρ1 = ρ2. Condition (P1) ensures that, with high probability, the follow-up study will produce significant evidence that the optimal personalized treatment strategy leads to better patient outcomes on average than standard care when indeed the optimal personalized treatment strategy yields clinically significantly better outcomes. Thus, (P1) regards the viability of a data-driven personalized treatment strategy. Condition (P2) ensures that the estimated optimal treatment strategy using data from the follow-up study will have an empirical performance near that of the optimal strategy. Thus, (P2) regards the quality of the estimated optimal treatment strategy. An alternative to (P2) is to try to control P(|V (β̂n) − V (β*)| > ε); however, there is no guarantee that a confidence interval for V (β*) actually contains V (β̂n). Because V (β̂n) is a data-dependent parameter, constructing a confidence interval for this quantity is not well studied (see [33, 34]).
Our sample size procedure works by constructing an estimator of the diameter of a confidence region for V (β*) as a function of the sample size and then solving for a sample size sufficiently large to ensure (P1) and (P2) hold with high probability. Constructing a sample size based on confidence region diameter is sometimes categorized as precision-based or accuracy-based sample size calculation [37, 38]. Suppose that based on n observations we can construct a (1 − ρ) × 100% confidence interval for V (β*), say ζn,1−ρ. Let d(n) denote the diameter of ζn,1−ρ. If d(n*) ≤ δV0, then V (β*) and V0 cannot simultaneously belong to ζn*,1−ρ provided V (β*) ≥ (1 + δ)V0. Because P {V (β*) ∈ ζn*,1−ρ} ≥ 1 − ρ, we can distinguish V (β*) from V0 with probability at least 1 − ρ when V (β*) ≥ (1 + δ)V0 by rejecting the hypothesis V (β*) = V0 when V0 falls below the lower endpoint of ζn*,1−ρ. Thus, (P1) is satisfied. Furthermore, if d(n*) ≤ ε and V̂𝓓(n*)(β̂n*) ∈ ζn*,1−ρ, then P{|V̂𝓓(n*)(β̂n*) − V (β*)| ≤ ε} ≥ 1 − ρ, and (P2) is satisfied. Figure 1 shows a schematic illustrating the foregoing relationships between a confidence interval for V (β*) and the criteria (P1) and (P2). To estimate n* we first construct a valid (asymptotic) confidence interval for V (β*) and an estimator d̂𝓓p(n) of the diameter function d(n). Our estimator of n* is given by n̂* ≜ inf{n : d̂𝓓p(n) ≤ min(δV0, ε)}.
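The last step above amounts to a one-dimensional search. The sketch below solves n̂* = inf{n : d̂(n) ≤ min(δV0, ε)} for a hypothetical diameter estimate of the form c/√n; in the actual procedure d̂𝓓p(n) is the projection-interval diameter of Section 3.

```python
import math

def smallest_n(diameter, target, n_max=10**7):
    """Return inf{n : diameter(n) <= target}, assuming diameter is
    nonincreasing in n, by scanning upward."""
    for n in range(1, n_max + 1):
        if diameter(n) <= target:
            return n
    raise ValueError("no n <= n_max achieves the target diameter")

# Hypothetical pilot-based diameter estimate d_hat(n) = c_hat / sqrt(n).
c_hat = 4.0
delta, V0, eps = 1.0, 1.0, 1.0
n_star = smallest_n(lambda n: c_hat / math.sqrt(n), min(delta * V0, eps))
# c_hat / sqrt(n) <= 1 holds first at n = 16
```

Because the diameter is monotone in n, a bisection search would serve equally well when the scan is expensive.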
Figure 1.
Schematic showing how a confidence interval for V (β*) can be used to ensure (P1) and (P2) hold. If the diameter of the confidence interval is less than δV0, then the interval cannot simultaneously contain V0 and V (β*) when V (β*) ≥ (1 + δ)V0. Furthermore, if the diameter of the confidence interval is less than ε and V̂𝓓(n*)(β̂n*) belongs to the interval, then when the interval covers V (β*) it follows that |V̂𝓓(n*)(β̂n*) − V (β*)| ≤ ε.
To construct a confidence interval for V (β*) we treat β* as a nuisance parameter and construct a projection interval [9, 10]. For any fixed γ ∈ ℝdim(x2) define σ2(γ) to be the asymptotic variance of √n {V̂𝓓(n)(γ) − V (γ)}, and subsequently define the plug-in estimator σ̂2𝓓p(γ) computed from the pilot data. For any fixed γ ∈ ℝdim(x2), under standard regularity conditions, it follows from the central limit theorem that √np {V̂𝓓p(γ) − V (γ)} converges to a normal random variable with mean zero and variance σ2(γ). Therefore, an asymptotic (1 − μ) × 100% confidence interval for V (γ) based on pilot data 𝓓p is
𝕀γ(𝓓p, np) ≜ V̂𝓓p(γ) ± z1−μ/2 σ̂𝓓p(γ)/√np,  (2)
where z1−μ/2 is the (1 − μ/2) × 100 percentile of a standard normal random variable. Let B ≜ (X1⊤, AX2⊤)⊤, define Ω = (𝔼BB⊤)−1𝔼BB⊤ {Y − Q(X, A; θ*)}2 (𝔼BB⊤)−1 to be the asymptotic covariance of √n (θ̂n − θ*), and define Σ to be the submatrix of Ω corresponding to the asymptotic covariance of √n (β̂n − β*). Let Σ̂𝓓p denote the plug-in estimator of Σ based on the pilot data 𝓓p. Define 𝕋(𝓓p, np) ≜ {γ : np (γ − β̂np)⊤Σ̂𝓓p−1(γ − β̂np) ≤ χ2dim(x2),1−ξ}, where χ2dim(x2),1−ξ is the (1 − ξ) × 100 percentile of a χ2-distribution with dim(x2) degrees of freedom. Thus, 𝕋(𝓓p, np) is an asymptotic (1 − ξ) × 100% confidence region for β*. An asymptotic (1 − ξ − μ) × 100% projection confidence interval for V (β*) is therefore given by ⋃γ∈𝕋(𝓓p,np) 𝕀γ (𝓓p, np). This interval has diameter d(np) = supγ∈𝕋(𝓓p,np) 𝕀γ (𝓓p, np) − infγ∈𝕋(𝓓p,np) 𝕀γ (𝓓p, np). For an arbitrary n ≥ np an estimator of d(n) is given by d̂𝓓p (n) ≜ supγ∈𝕋(𝓓p,n) 𝕀γ (𝓓p, n) − infγ∈𝕋(𝓓p,n) 𝕀γ (𝓓p, n), and subsequently n̂* = inf{n : d̂𝓓p(n) ≤ min(δV0, ε)}.
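The union over 𝕋(𝓓p, np) has no closed form, but it can be approximated by sampling γ from the confidence ellipsoid, much as in the simulations of Section 4. The sketch below assumes user-supplied functions value(γ) and sd(γ) estimating V (γ) and σ(γ); these, and the sampling scheme, are illustrative stand-ins rather than the paper's exact implementation.

```python
import numpy as np
from scipy import stats

def projection_interval(beta_hat, Sigma_hat, n, value, sd,
                        xi=0.01, mu=0.19, draws=10_000, seed=0):
    """Approximate U_{gamma in T} I_gamma by sampling gamma from the
    ellipsoid {g : n (g - beta_hat)' Sigma^{-1} (g - beta_hat) <= chi2_{d,1-xi}}
    and taking the union of the per-gamma (1 - mu) normal intervals."""
    rng = np.random.default_rng(seed)
    d = len(beta_hat)
    chi2 = stats.chi2.ppf(1 - xi, df=d)
    L = np.linalg.cholesky(Sigma_hat)
    # Uniform draws from the unit ball, mapped onto the ellipsoid.
    z = rng.normal(size=(draws, d))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    r = rng.uniform(size=(draws, 1)) ** (1.0 / d)
    gammas = beta_hat + (z * r) @ L.T * np.sqrt(chi2 / n)
    zc = stats.norm.ppf(1 - mu / 2)
    lo = min(value(g) - zc * sd(g) / np.sqrt(n) for g in gammas)
    hi = max(value(g) + zc * sd(g) / np.sqrt(n) for g in gammas)
    return lo, hi
```

With constant value and sd the union collapses to a single normal interval, which gives a quick correctness check.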
Remark 1
Our sample size procedure is based on choosing a sufficient number of samples to ensure that the width of a confidence interval is sufficiently small. It may be helpful to note that the proposed procedure is analogous to the following familiar example. Suppose we desire a confidence interval for the mean of a random variable Z of width no more than ω. Let Z̄n denote the sample mean of a random sample Z1, …, Zn, and let σ2 denote the variance of Z. The natural (1 − ρ) × 100% confidence interval is Z̄n ± z1−ρ/2 σ/√n, which has diameter d(n) = 2z1−ρ/2 σ/√n. Using pilot data 𝓓p we construct an estimator σ̂𝓓p of σ and subsequently the estimator d̂𝓓p(n) = 2z1−ρ/2 σ̂𝓓p/√n of d(n); solving for n̂* = inf{n : d̂𝓓p (n) ≤ ω} gives the familiar expression n̂* = ⌈(2z1−ρ/2 σ̂𝓓p/ω)2⌉.
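The arithmetic in this remark can be written out directly; NormalDist is in the Python standard library.

```python
import math
from statistics import NormalDist

def precision_sample_size(sigma_hat, omega, rho=0.05):
    """Smallest n for which the (1 - rho) interval Zbar +/- z sigma/sqrt(n)
    has diameter 2 z_{1-rho/2} sigma_hat / sqrt(n) <= omega, i.e.,
    n_hat = ceil( (2 z_{1-rho/2} sigma_hat / omega)^2 )."""
    z = NormalDist().inv_cdf(1 - rho / 2)
    return math.ceil((2 * z * sigma_hat / omega) ** 2)

# e.g., sigma_hat = 1, omega = 0.5, rho = 0.05 gives n_hat = 62
```

Halving the target width ω roughly quadruples the required n, the same scaling visible in Figure 2 for the projection-based procedure.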
Remark 2
If the preliminary data are observational then the proposed sample size estimator can be applied with slight modification. To distinguish between the generative model for the pilot data and the data to be collected in the follow-up trial being sized, write (Xp, Ap, Yp) for a generic observation from the pilot study. We assume that for any y, x, a, P (Yp ≤ y|Xp = x, Ap = a) = P (Y ≤ y|X = x, A = a), and P (Xp ≤ x) = P (X ≤ x). Furthermore, we require that the pilot study satisfy the consistency, positivity, and no unmeasured confounders assumptions stated in Section 2. Under these assumptions, the only difference between the population in the pilot study and the population in the randomized trial is the treatment assignment mechanism; note, weaker assumptions are possible. Let p(a|x) denote the treatment assignment mechanism P (Ap = a|Xp = x). Then, V (γ) = 𝔼 {Yp1ApXp⊤γ≥0/p(Ap|Xp)}, and Ω = {𝔼p(Ap|Xp)BpBp⊤}−1 𝔼p(Ap|Xp)BpBp⊤{Yp − Q(Xp, Ap)}2 {𝔼p(Ap|Xp)BpBp⊤}−1, and (α*⊤, β*⊤)⊤ = {𝔼p(Ap|Xp)BpBp⊤}−1 𝔼p(Ap|Xp)BpYp. Thus, with knowledge or an estimator of p(a|x) it is possible to construct plug-in estimators of the quantities needed in the sample size formula from observational data.
Remark 3
The proposed sample size calculation does not require that the posited Q-function be correctly specified. However, if one is willing to assume that the Q-function is correctly specified then: (i) V (β*) is still nonregular; and (ii) the proposed sample size procedure can be applied with a slight modification. Assume that Q(x, a) = x1⊤α* + ax2⊤β* so that the posited linear model is correctly specified. We assume X1 has been centered so that 𝔼X1 = 0. Then V (γ) = 𝔼X1⊤α* + 𝔼 {sign(X2⊤γ)X2⊤β*}, which suggests the plug-in estimator Ṽ𝓓p(γ) ≜ ℙnp X1⊤α̂np + ℙnp sign(X2⊤γ)X2⊤β̂np. Under standard regularity conditions, √np {Ṽ𝓓p(γ) − V (γ)} is asymptotically normal with mean zero and some variance σ̃2(γ). Given a plug-in estimator of σ̃2(γ) based on 𝓓p, define 𝕀̃γ(𝓓p, np) as the analog of the interval in (2) with Ṽ𝓓p(γ) and this variance estimator in place of V̂𝓓p(γ) and σ̂2𝓓p(γ).
Letting 𝕋(𝓓p, np) be defined as above, ⋃γ∈𝕋(𝓓p, np) 𝕀̃γ (𝓓p, np) is a valid (asymptotic) (1 − ξ − μ) × 100% confidence interval for V (β*). This interval can be inverted to construct a sample size. A potential drawback of this approach is that it depends heavily on the correctness of the postulated analysis model.
4. Simulation experiments
In this section we examine the finite sample performance of the proposed sample size procedure using simulated experiments. We use data from the following generative model:
where Pβ* denotes the projection matrix onto the linear span of β*. Thus, under this class of generative models ν × 100% of the subjects will have the same expected outcome regardless of which treatment they receive, i.e., the treatment effect for these patients is zero. We use a working model of the form Q(x, a; θ) = x⊤α + ax⊤β so that the working model is only correctly specified when ν = 0. In our experiments we fix p = 5 and α* ≡ 1. We include intercept terms in our working models so the dimension of θ is 2p + 2 = 12. Furthermore, for any fixed value of ν ∈ [0, 1] we choose β* ≡ c where c is chosen to be the smallest positive constant so that V (β*) equals a pre-specified value. In our experiments we consider ν = 0, 0.05, 0.10, 0.25, 0.50, 0.75 and V (β*) = 2, 2.25. We fix δ = ε = V0 = 1. We set ρ = 0.20 and, following recommended best practice in [9], obtain an 80% confidence interval for V (β*) by setting ξ = 0.01, thereby constructing a 99% confidence region for β*, and setting μ = 0.19, thereby constructing an 81% confidence interval for V (γ) for each fixed γ in the confidence region for β*. For illustration, we also implement a sample size formula that is a plug-in estimator derived from a normal approximation to the sampling distribution of √n {V̂𝓓(n)(β̂n) − V (β*)} discussed in Section 3. However, as discussed previously, normal approximations are not valid in this context without strong (unverifiable) assumptions (e.g., [35, 8]) and even when these conditions hold the small sample performance of these approximations may be poor [36, 7, 39, 33, 34]. A confidence interval using the normal approximation with data 𝓓 is V̂𝓓(β̂n) ± z1−ρ/2 σ̂𝓓(β̂n)/√n.
We repeat the following procedure 1000 times. First, we generate pilot data 𝓓p of size np = 20. Second, we use the sample size procedure described in the preceding section to obtain n̂*(𝓓p); because np is small, least squares estimation of all 2p + 2 parameters in our working model is unstable, hence in these experiments we use ridge regression to estimate θ = (α⊤, β⊤)⊤ with the amount of regularization chosen by BIC. The confidence regions 𝕋(𝓓p, np) are adjusted accordingly (details given in the appendix). Finally, we generate a follow-up study, say 𝓓* = 𝓓 {n̂*(𝓓p)}, and subsequently the projection interval ⋃γ∈𝕋{𝓓*, n̂*(𝓓p)} 𝕀γ {𝓓*, n̂*(𝓓p)} for V (β*). To approximate the union over all γ ∈ 𝕋{𝓓*, n̂*(𝓓p)} we randomly sample 100,000 points from 𝕋{𝓓*, n̂*(𝓓p)} using rejection sampling. Table 1 shows the estimated coverage, mean, and standard deviation of the proposed sample size procedure, as well as that of the normal-approximation-based estimator. The estimated coverage of V (β*) is not significantly different at the 0.01 level from the target 0.80 in all but one case for the proposed estimator. Thus, empirically, objective (P2) was attained. The resultant interval covered V0 significantly less often than the 1 − 0.80 = 0.20 target, and thus the power to detect a difference between the adaptive strategy and standard care was significantly higher than the 0.80 target. Thus, empirically, objective (P1) was attained. As expected, both the mean and standard deviation of n̂*(𝓓p) generally increase as the proportion of individuals with zero treatment effect (ν) increases. Not surprisingly, the normal-approximation-based confidence interval generally produced smaller sample sizes than the proposed projection-based procedure. However, the normal-approximation sample size covered V (β*) significantly less often than the 0.80 target in 7 of the 12 generative models; furthermore, the normal-based interval also included V0 more frequently, indicating less power.
Table 1.
Estimated coverage, mean, and standard deviation for the proposed sample size procedure based on 1000 Monte Carlo replications. Target coverage of V (β*) is 0.80; bolded numbers indicate observed coverage is not significantly different from the target coverage at the 0.05 level.
| V* | V0 | δ | ν | Cov V (β*) (Proj) | Cov V0 (Proj) | 𝔼n̂*(𝓓p) (Proj) | SD n̂*(𝓓p) (Proj) | Cov V (β*) (Norm) | Cov V0 (Norm) | 𝔼n̂*(𝓓p) (Norm) | SD n̂*(𝓓p) (Norm) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | 0 | 0.77 | 0.08 | 156 | 151 | 0.77 | 0.14 | 111 | 62 |
| 2 | 1 | 1 | 0.05 | 0.77 | 0.06 | 144 | 132 | 0.77 | 0.13 | 107 | 62 |
| 2 | 1 | 1 | 0.1 | 0.80 | 0.07 | 144 | 127 | 0.76 | 0.14 | 100 | 60 |
| 2 | 1 | 1 | 0.25 | 0.80 | 0.08 | 149 | 123 | 0.76 | 0.16 | 98 | 65 |
| 2 | 1 | 1 | 0.5 | 0.79 | 0.09 | 191 | 204 | 0.72 | 0.19 | 85 | 65 |
| 2 | 1 | 1 | 0.75 | 0.79 | 0.09 | 191 | 204 | 0.70 | 0.23 | 101 | 104 |
| 2.25 | 1 | 1 | 0 | 0.74 | 0.02 | 145 | 138 | 0.78 | 0.073 | 120 | 68 |
| 2.25 | 1 | 1 | 0.05 | 0.77 | 0.02 | 147 | 143 | 0.78 | 0.07 | 112 | 65 |
| 2.25 | 1 | 1 | 0.1 | 0.77 | 0.03 | 142 | 118 | 0.77 | 0.067 | 114 | 71 |
| 2.25 | 1 | 1 | 0.25 | 0.78 | 0.04 | 166 | 157 | 0.75 | 0.09 | 112 | 75 |
| 2.25 | 1 | 1 | 0.5 | 0.79 | 0.03 | 228 | 216 | 0.74 | 0.11 | 105 | 84 |
| 2.25 | 1 | 1 | 0.75 | 0.77 | 0.06 | 347 | 390 | 0.69 | 0.17 | 129 | 135 |
4.1. Midstream sample size adjustment
A common strategy to increase precision when sample size estimates are based on estimated nuisance parameters is to use an interim sample size adjustment based on updated estimates of the nuisance parameters [40]. We illustrate a simple interim update for our sample size procedure. First, we generate pilot data 𝓓p of size np. Second, we use our sample size procedure to estimate n̂*(𝓓p). Third, we generate data 𝓓1 in a follow-up study with n1 = ⌈n̂*(𝓓p)/2⌉ subjects. Fourth, treating 𝓓1 as a new pilot study, we re-estimate the required sample size and collect data, say 𝓓2, on the additional max{n̂*(𝓓1) − n1, 0} subjects. Let 𝓓* ≜ 𝓓1 ∪ 𝓓2. Finally, we construct a confidence interval for V (β*) based on the subjects in 𝓓*. Because our approach uses a projection interval, uncertainty about parameter values translates into larger diameters and hence larger estimated sample sizes. Consequently, the midstream update typically results in smaller sample sizes.
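The steps above can be sketched schematically. Both the choice to enroll half of the initial estimate in the first stage and the reestimate callback are illustrative assumptions, not prescriptions from the procedure.

```python
import math

def midstream_total(n_hat_pilot, reestimate, first_fraction=0.5):
    """Two-stage enrollment: enroll n1 = ceil(first_fraction * n_hat_pilot)
    subjects, re-run the sample size procedure on that interim data
    (reestimate: n1 -> updated n*), then enroll the remainder."""
    n1 = math.ceil(first_fraction * n_hat_pilot)
    n_updated = reestimate(n1)
    n2 = max(n_updated - n1, 0)   # already-enrolled subjects are kept
    return n1 + n2

# If the interim re-estimate is smaller than the pilot-based estimate,
# the total sample size shrinks accordingly.
total = midstream_total(200, lambda n1: 150)   # n1 = 100, n2 = 50 -> 150
```

When the interim re-estimate falls below the number already enrolled, no further subjects are recruited and the total is simply n1.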
Table 2 shows the estimated coverage, mean, and standard deviation of the proposed sample size procedure with a midstream update; to make the results comparable the table is constructed from the same 1000 Monte Carlo pilot data sets of size np = 20 used to construct Table 1. The mean and standard deviation of the total sample size n̂*(𝓓p, 𝓓1) are both generally smaller when using the midstream update yet the coverages are comparable. This suggests that the midstream update may be a viable strategy for reducing the number of subjects required to estimate an optimal personalized treatment strategy.
Table 2.
Estimated coverage, mean, and standard deviation for the proposed sample size procedure with a midstream adjustment based on 1000 Monte Carlo replications. Target coverage of V (β*) is 0.80; bolded numbers indicate observed coverage is not significantly different from the target coverage at the 0.05 level.
| V* | V0 | δ | ν | Coverage V (β*) | Coverage V0 | 𝔼 (first-stage size) | 𝔼n̂*(𝓓p, 𝓓1) | SD (n̂*(𝓓p, 𝓓1)) |
|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | 0 | 0.80 | 0.06 | 76 | 109 | 72 |
| 2 | 1 | 1 | 0.05 | 0.76 | 0.06 | 72 | 110 | 78 |
| 2 | 1 | 1 | 0.1 | 0.79 | 0.07 | 72 | 112 | 92 |
| 2 | 1 | 1 | 0.25 | 0.80 | 0.06 | 75 | 115 | 83 |
| 2 | 1 | 1 | 0.5 | 0.79 | 0.05 | 91 | 155 | 116 |
| 2 | 1 | 1 | 0.75 | 0.76 | 0.08 | 91 | 235 | 201 |
| 2.25 | 1 | 1 | 0 | 0.77 | 0.02 | 73 | 108 | 101 |
| 2.25 | 1 | 1 | 0.05 | 0.74 | 0.03 | 74 | 109 | 79 |
| 2.25 | 1 | 1 | 0.1 | 0.77 | 0.02 | 71 | 108 | 95 |
| 2.25 | 1 | 1 | 0.25 | 0.78 | 0.02 | 83 | 131 | 136 |
| 2.25 | 1 | 1 | 0.5 | 0.78 | 0.02 | 114 | 169 | 138 |
| 2.25 | 1 | 1 | 0.75 | 0.73 | 0.03 | 174 | 246 | 197 |
5. Time to pregnancy in normal fertility
A motivation for this work is our involvement with a randomized controlled trial of the effect of fertility awareness on time to pregnancy in normal fertility introduced in Section 1. We use data from this trial as pilot data to size a follow-up study. In this trial subjects were randomized either to receive standard care or to follow the Creighton Model Fertility Care™ (CrM) System. The CrM involves patient education and daily patient self-monitoring to predict periods of fertility. Thus, there is an interest in identifying subjects for whom the burden of CrM is expected to be worthwhile. A personalized treatment strategy in this setting is a map from baseline patient characteristics to a recommendation of CrM or standard care.
For simplicity, in our analysis we consider patients with complete information (n = 109 of 123 enrolled subjects). Eligibility criteria required participants to be of proven fertility (full term pregnancy in the past 8 years) and to be planning to conceive in the near future. For additional study details, see [41]. The outcome of interest is the minimum of eight cycles and the number of cycles before conception occurred. To be consistent with our development of finding strategies which maximize an expected outcome, we use eight minus the outcome of interest as our response. The two possible treatments are standard care (coded −1) and CrM (coded 1). Treatment was randomly assigned using a block-randomized design with blocking by age (blocks 18–29 and 30–35); thus, the no unmeasured confounders assumption (iii) of Section 2 is automatically satisfied. Variables thought to be predictive of benefit from CrM include number of previous successful pregnancies (Prg), subject age (years, Age), length of sexual relationship (years, SexLen), and education level (indicator of post-secondary, Edu). Let X denote these predictors, A the binary treatment, and Y the outcome. We use a linear Q-function of the form Q(x, a) = x⊤α + ax⊤β. Residual diagnostics for this model suggested that SexLen and Age should be transformed to the log scale and that Prg should be transformed to 1/Prg. Standard regression diagnostics are provided in the appendix. Table 3 provides the estimated coefficients for the Q-function. There is a significant interaction between CrM and the reciprocal of Prg, suggesting that the benefit of CrM is greatest for subjects with a single prior successful pregnancy. The estimated optimal decision rule is π̂n(x) = sign(x⊤β̂n).
This rule has an estimated value of 5.82 (which corresponds to an average of 2.18 months to conception); in contrast the estimated value of assigning standard care to all patients is 5.52 (2.48 months to conception) whereas the estimated value of assigning CrM to all patients is 4.92 (3.08 months to conception). Thus, while there is some indication that a personalized treatment strategy might be effective, the evidence is not significant. For example, an 80% confidence interval for the value of the optimal decision rule, V (β*), is [5.50, 6.13], which has a diameter of 0.63. It is of interest to know how this diameter would change in a larger, follow-up study. Figure 2 shows n̂* as a function of the target diameter for an 80% interval. To see how this figure might be used, suppose that, based on the pilot study, one postulated that V (β*) = 5.90, and V0 = 5.52, the mean outcome treating all subjects. Then δ = V (β*)/V0 − 1 = 0.0688 so that a diameter of δV0 = 0.38 is required for 80% power to detect superiority of V (β*) to V0. According to Figure 2, the estimated required size for a follow-up study is n̂* = 280. If instead it was postulated that V (β*) = 6.12, the required diameter is 0.5 and the required size for a follow-up study is n̂* = 165. In addition to informing the size of a follow-up study, the slope of the line in Figure 2 can be viewed as a rough estimate of the incremental gain in the precision of V̂𝓓(n)(β̂n) per subject.
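The arithmetic behind this use of Figure 2 can be checked directly:

```python
V0 = 5.52                     # estimated mean outcome under standard care
V_star = 5.90                 # postulated value of the optimal rule
delta = V_star / V0 - 1       # elicited relative improvement, ~0.0688
target_diameter = delta * V0  # = V_star - V0 ~ 0.38
```

The target diameter δV0 is simply the postulated absolute gap V (β*) − V0; the relative form δ is what the procedure asks clinicians to elicit.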
Table 3.
Estimated regression coefficients for the linear Q-function. The outcome is coded so that higher values correspond to shorter time to conception. There is a significant interaction between CrM and 1/Prg suggesting that the benefit of CrM is greatest for patients with only one successful prior pregnancy (which was the minimum required for study eligibility). Indeed, all subjects assigned CrM by the estimated optimal treatment strategy had one successful prior pregnancy.
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| Intercept | 22.10 | 6.11 | 3.61 | 0.00 |
| CrM | −1.80 | 6.11 | −0.294 | 0.770 |
| log Age | −5.72 | 1.99 | −2.88 | 0.005 |
| 1/Prg | −0.586 | 0.642 | −0.906 | 0.367 |
| Edu | 0.316 | 0.194 | 1.63 | 0.106 |
| log SexLen | 0.844 | 0.550 | 1.535 | 0.128 |
| CrM:log Age | −0.0840 | 1.99 | −0.042 | 0.966 |
| CrM:1/Prg | 1.47 | 0.647 | 2.26 | 0.025 |
| CrM:Edu | −0.0920 | 0.194 | −0.474 | 0.636 |
| CrM:log SexLen | 0.699 | 0.550 | 1.27 | 0.206 |
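The fitted interaction model in Table 3 induces a treatment rule directly: assign CrM exactly when the CrM portion of the fitted Q-function is positive. A minimal sketch using the point estimates from Table 3; the covariate codings (log transforms, the reciprocal of prior pregnancies) are assumptions inferred from the row labels, and the function names are ours:

```python
import math

# CrM main effect and interaction coefficients from Table 3.
COEF = {"CrM": -1.80, "CrM:logAge": -0.0840, "CrM:invPrg": 1.47,
        "CrM:Edu": -0.0920, "CrM:logSexLen": 0.699}

def crm_contrast(age, prg, edu, sexlen):
    """Estimated difference in mean outcome, CrM minus standard care."""
    return (COEF["CrM"]
            + COEF["CrM:logAge"] * math.log(age)
            + COEF["CrM:invPrg"] / prg
            + COEF["CrM:Edu"] * edu
            + COEF["CrM:logSexLen"] * math.log(sexlen))

def recommend_crm(age, prg, edu, sexlen):
    """Estimated optimal rule: assign CrM iff the contrast is positive."""
    return crm_contrast(age, prg, edu, sexlen) > 0
```

Because the CrM:1/Prg coefficient is positive, the contrast is largest at prg = 1, consistent with the observation that all subjects assigned CrM had one prior pregnancy.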
Figure 2.
Estimated sample size as a function of 80% confidence interval diameter. Reducing the interval width by a factor of two would require nearly a four-fold increase in sample size over the initial pilot study.
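The rough four-fold relationship noted in the caption follows from the usual root-n rate: the interval diameter shrinks like c/√n, so reaching a target diameter d from a pilot with diameter d0 at size n0 takes roughly n0(d0/d)² subjects. A back-of-the-envelope sketch with the pilot's observed diameter of 0.63 and a hypothetical pilot size n0:

```python
def required_n(n0, d0, d_target):
    """Approximate size needed to shrink a CI diameter from d0 (at size n0)
    to d_target, assuming the diameter scales like 1/sqrt(n)."""
    return n0 * (d0 / d_target) ** 2

n0 = 100   # hypothetical pilot size, for illustration only
d0 = 0.63  # 80% CI diameter observed in the pilot
n_half = required_n(n0, d0, d0 / 2)  # halving the diameter -> 4x the pilot size
```

The actual n̂* in Figure 2 comes from the projection-interval procedure, so this scaling is only a heuristic for reading the figure.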
6. Discussion
We proposed a procedure for estimating the required sample size for a two-arm randomized clinical trial when the primary aim is the estimation of a personalized treatment strategy. The proposed method is based on inverting a projection confidence interval for the expected outcome under the optimal treatment strategy. Empirically, the proposed method appeared to make efficient use of limited pilot data and enjoyed increased precision when assisted by a midstream interim update. Our procedure relied on standard assumptions for causal inference for treatment strategies, i.e., consistency, positivity, and no unmeasured confounders. What happens when there are unmeasured confounders in the pilot is an interesting question, but one that is difficult to answer in generality. The reason is that even in simpler settings, e.g., estimating an optimal regime from observational data, the impact of unmeasured confounding depends on the nature and severity of the confounding. Thus, characterizing the impact of unmeasured confounding on the sample size calculation would require first cataloging its effects on estimation of the nuisance parameters across the many possible ways in which confounding can occur. Investigating the effects of confounding on the sample size procedure would be interesting future work.
We focused on Q-learning as our method of estimation; however, there is growing interest in direct value maximization methods [42, 23, 24, 25, 26, 27, 43]. Distribution theory for parameters estimated via direct value maximization methods is lacking at present; once this theory is developed, the proposed sizing procedure should be directly applicable. Furthermore, direct value maximization procedures often rely on a lower-dimensional parameter than Q-learning, so it is possible that the proposed sample size procedure will be less conservative when used with such methods.
Another important extension of this work is to sequences of treatment rules, also known as dynamic treatment regimes [28, 10]. Sample size procedures for sequential multiple assignment randomized trials [4] with the primary aim of constructing an optimal sequence of treatment rules are of increasing interest. The extension to multiple time points is complicated by the fact that in this setting the parameters indexing the Q-functions are nonregular [7, 32]. We are currently pursuing this extension.
7. Appendix
7.1. Wald type confidence region for ridge regression
For fixed nonnegative ridge parameter λ, let θ̂n,λ = (𝔹n⊤𝔹n + λI)⁻¹𝔹n⊤𝕐n denote the ridge regression estimator, where 𝔹n is the design matrix with rows Bi⊤ and 𝕐n is the vector of observed outcomes. Then, θ̂n,λ minimizes the penalized least squares criterion ‖𝕐n − 𝔹nθ‖² + λ‖θ‖². Recall θ = (a⊤, β⊤)⊤, let U denote the constant matrix so that Uθ = β, and define β̂n,λ = Uθ̂n,λ, so that β̂n,λ is the ridge estimator of β*. It can be seen that √n(β̂n,λ − β*) converges in distribution to a normal random variable with mean zero and variance-covariance matrix

Σλ = UΣ⁻¹ΩΣ⁻¹U⊤, where Σ = E(BB⊤) and Ω = E{BB⊤(Y − B⊤θ*)²},

which has plug-in estimator

Σ̂n,λ = nU(𝔹n⊤𝔹n + λI)⁻¹𝔹n⊤ diag(ε̂₁², …, ε̂n²) 𝔹n(𝔹n⊤𝔹n + λI)⁻¹U⊤, where ε̂i = Yi − Bi⊤θ̂n,λ.

A Wald type confidence region for β* based on β̂n,λ and Σ̂n,λ is given by

{β : n(β̂n,λ − β)⊤Σ̂n,λ⁻(β̂n,λ − β) ≤ χ²dim(β),1−α},

where u⁻ denotes the generalized inverse of u and χ²d,1−α denotes the (1 − α) quantile of a chi-squared distribution with d degrees of freedom.
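The ingredients of such a ridge-based Wald-type region can be sketched numerically. The following is an illustrative implementation (function and variable names are ours; the heteroscedasticity-robust sandwich covariance is an assumption about the plug-in form, and the critical value, e.g. a chi-squared quantile, is passed in rather than computed):

```python
import numpy as np

def ridge_wald_region(B, Y, U, lam, crit):
    """Return the ridge estimate of beta, a plug-in sandwich covariance for
    sqrt(n)*(beta_hat - beta*), and a membership test for the Wald region.
    crit: critical value, e.g. the (1 - alpha) chi-squared quantile with
    dim(beta) degrees of freedom."""
    n, p = B.shape
    M_inv = np.linalg.inv(B.T @ B + lam * np.eye(p))
    theta_hat = M_inv @ B.T @ Y               # ridge regression estimator
    resid = Y - B @ theta_hat
    meat = B.T @ (B * resid[:, None] ** 2)    # sandwich "meat" matrix
    Sigma = n * U @ M_inv @ meat @ M_inv @ U.T
    beta_hat = U @ theta_hat
    Sigma_ginv = np.linalg.pinv(Sigma)        # generalized inverse

    def contains(beta):
        d = beta_hat - beta
        return float(n * d @ Sigma_ginv @ d) <= crit

    return beta_hat, Sigma, contains
```

Here U simply picks out the coefficients of interest; for an 80% region with a two-dimensional β, crit would be the chi-squared quantile χ²₂,0.80 ≈ 3.22.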
7.2. Regression diagnostics for Q-function in fertility study
Figure 3 displays standard regression diagnostic plots using data from the fertility study.
Figure 3.
Regression diagnostics for the fitted Q-function used in the fertility study. The discreteness of the response manifests itself as parallel lines in the plot of residuals vs. fitted values. Otherwise, there do not appear to be major departures from standard linear modeling assumptions.
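Such diagnostics are straightforward to reproduce for any linear Q-function fit. A minimal sketch on synthetic stand-in data, assuming numpy and matplotlib are available:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; writes the figure to disk
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # toy design matrix
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares Q-function fit
fitted = X @ beta
resid = y - fitted

plt.scatter(fitted, resid)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.savefig("qfit_diagnostics.png")
```

With a discrete response, as in the fertility study, the points fall along parallel lines; with continuous synthetic y they scatter evenly about zero.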
References
- 1. Personalized Medicine Coalition. The case for personalized medicine. Technical report, Personalized Medicine Coalition; 2011.
- 2. Hamburg MA, Collins FS. The path to personalized medicine. New England Journal of Medicine. 2010;363(4):301–304. doi:10.1056/NEJMp1006304.
- 3. Oetting A, Levy J, Weiss R, Murphy S. Statistical methodology for a SMART design in the development of adaptive treatment strategies. In: Shrout P, Keyes K, Ornstein K, editors. Causality and Psychopathology. New York: Oxford University Press; 2010. pp. 179–205.
- 4. Murphy SA. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine. 2005;24(10):1455–1481. doi:10.1002/sim.2022.
- 5. Li Z, Murphy SA. Sample size formulae for two-stage randomized trials with survival outcomes. Biometrika. 2011;98(3):503–518. doi:10.1093/biomet/asr019.
- 6. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
- 7. Laber EB, Lizotte DJ, Qian M, Pelham WE, Murphy SA. Dynamic treatment regimes: technical challenges and applications. Electronic Journal of Statistics. 2014;8:1225–1272. doi:10.1214/14-EJS920.
- 8. Hirano K, Porter JR. Impossibility results for nondifferentiable functionals. Econometrica. 2012;80(4):1769–1790.
- 9. Berger RL, Boos DD. P values maximized over a confidence set for the nuisance parameter. Journal of the American Statistical Association. 1994;89(427):1012–1016.
- 10. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium in Biostatistics. New York: Springer; 2004. pp. 189–326.
- 11. Chakraborty B, Moodie EEM. Statistical Methods for Dynamic Treatment Regimes. New York: Springer; 2013.
- 12. Watkins CJCH, Dayan P. Q-learning. Machine Learning. 1992;8(3):279–292.
- 13. Murphy SA. A generalization error for Q-learning. Journal of Machine Learning Research. 2005;6:1073–1097.
- 14. Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Annals of Statistics. 2011;39(2):1180–1210. doi:10.1214/10-AOS864.
- 15. Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, Waxmonsky JG, Yu J, Murphy SA. Q-learning: a data analysis method for constructing adaptive interventions. Psychological Methods. 2012;17(4):478. doi:10.1037/a0029373.
- 16. Schulte PJ, Tsiatis AA, Laber EB, Davidian M. Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science. 2014;29(4):640–661. doi:10.1214/13-STS450.
- 17. Shortreed SM, Laber E, Stroup TS, Pineau J, Murphy SA. A multiple imputation strategy for sequential multiple assignment randomized trials. Statistics in Medicine. 2014;33(24):4202–4214. doi:10.1002/sim.6223.
- 18. Goldberg Y, Kosorok MR. Q-learning with censored data. Annals of Statistics. 2012;40(1):529. doi:10.1214/12-AOS968.
- 19. Laber EB, Lizotte DJ, Ferguson B. Set-valued dynamic treatment regimes for competing outcomes. Biometrics. 2014;70(1):53–61. doi:10.1111/biom.12132.
- 20. Linn K, Laber E, Stefanski L. Estimation of dynamic treatment regimes for complex outcomes: balancing benefits and risks. In: Kosorok M, Moodie E, editors. Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine. New York: CRC Press; 2015.
- 21. Lu W, Zhang HH, Zeng D. Variable selection for optimal treatment decision. Statistical Methods in Medical Research. 2011. doi:10.1177/0962280211428383.
- 22. Geng Y, Zhang HH, Lu W. On optimal treatment regimes selection for mean survival time. Statistics in Medicine. 2014. doi:10.1002/sim.6397.
- 23. Orellana L, Rotnitzky A, Robins JM. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, Part I: main content. International Journal of Biostatistics. 2010;6(2).
- 24. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68(4):1010–1018. doi:10.1111/j.1541-0420.2012.01763.x.
- 25. Zhang B, Tsiatis AA, Davidian M, Zhang M, Laber E. Estimating optimal treatment regimes from a classification perspective. Stat. 2012;1(1):103–114. doi:10.1002/sta.411.
- 26. Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association. 2012;107(499):1106–1118. doi:10.1080/01621459.2012.695674.
- 27. Zhao Y, Zeng D, Laber E, Kosorok MR. New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association. 2015;110(510):583–598. doi:10.1080/01621459.2014.937488.
- 28. Murphy SA. Optimal dynamic treatment regimes (with discussion). Journal of the Royal Statistical Society: Series B. 2003;65(2):331–366.
- 29. Blatt D, Murphy SA, Zhu J. A-learning for approximate planning. Technical Report 04-63, The Methodology Center, Pennsylvania State University; 2004.
- 30. Rubin DB. Bayesian inference for causal effects: the role of randomization. The Annals of Statistics. 1978;6(1):34–58.
- 31. Splawa-Neyman J, Dabrowska DM, Speed TP. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science. 1990;5(4):465–472.
- 32. Laber E, Linn L, Stefanski L. Interactive model building for Q-learning. Biometrika. 2014;101(4):831–847. doi:10.1093/biomet/asu043.
- 33. Laber EB, Murphy SA. Adaptive confidence intervals for the test error in classification. Journal of the American Statistical Association. 2011;106(495):904–913. doi:10.1198/jasa.2010.tm10053.
- 34. Chakraborty B, Laber E, Zhao Y. Inference about the expected performance of a data-driven treatment regime. Clinical Trials. 2014. doi:10.1177/1740774514537727.
- 35. van der Vaart AW. On differentiable functionals. Annals of Statistics. 1991;19(1):178–204.
- 36. Leeb H, Pötscher BM. The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Econometric Theory. 2003;19(1):100–142.
- 37. Kelley K, Maxwell SE, Rausch JR. Obtaining power or obtaining precision: delineating methods of sample-size planning. Evaluation & the Health Professions. 2003;26(3):258–287. doi:10.1177/0163278703255242.
- 38. Maxwell SE, Kelley K, Rausch JR. Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology. 2008;59:537–563. doi:10.1146/annurev.psych.59.103006.093735.
- 39. Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research. 2010;19(3):317–343. doi:10.1177/0962280209105013.
- 40. Wittes J, Brittain E. The role of internal pilot studies in increasing the efficiency of clinical trials. Statistics in Medicine. 1990;9(1–2):65–72. doi:10.1002/sim.4780090113.
- 41. Stanford JB, Smith KR. Impact of instruction in the Creighton Model FertilityCare System on time to pregnancy in couples of proven fecundity: results of a randomised trial. Paediatric and Perinatal Epidemiology. 2014;29(5). doi:10.1111/ppe.12141.
- 42. Robins J, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine. 2008;27(23):4678–4721. doi:10.1002/sim.3301.
- 43. Zhao YQ, Zeng D, Laber EB, Song R, Yuan M, Kosorok MR. Doubly robust learning for estimating individualized treatment with censored data. Biometrika. 2015;102(1):151–168. doi:10.1093/biomet/asu050.



