Abstract
Randomized controlled trials (RCTs) emphasize the average or overall effect of a treatment (ATE) on the primary endpoint. Even though the ATE provides the best summary of treatment efficacy, it is of critical importance to know whether the treatment is similarly efficacious in important, predefined subgroups. This is why RCTs, in addition to the ATE, also present the results of subgroup analyses for preestablished subgroups. Typically, these are marginal subgroup analyses in the sense that treatment effects are estimated in mutually exclusive subgroups defined by only one baseline characteristic at a time (e.g., men versus women, young versus old). The forest plot is a popular graphical approach for displaying the results of subgroup analyses. These plots were originally used in meta-analysis for displaying the treatment effects from independent studies. Treatment effect estimates of different marginal subgroups are, however, not independent. Correlation between the subgrouping variables should be addressed for proper interpretation of forest plots, especially in large effectiveness trials where one of the goals is to address concerns about the generalizability of findings to various populations. Failure to account for the correlation between the subgrouping variables can result in misleading (confounded) interpretations of subgroup effects. Here we present an approach called standardization, a commonly used technique in epidemiology, that allows for valid comparison of subgroup effects depicted in a forest plot. We present simulation results and a subgroup analysis from parallel-group, placebo-controlled randomized trials of antibiotics for acute otitis media.
Keywords: Confounding, Consistency of treatment effect, Forest plot, Heterogeneity of treatment effects, Interaction, Marginal structural model
1. INTRODUCTION
Randomized controlled trials (RCTs) emphasize the overall or average effect of a treatment (ATE) on the primary endpoint. Even though the ATE provides the best summary of treatment efficacy, it is of critical importance to know whether the treatment is similarly efficacious in important, predefined subgroups. For example, if a treatment is efficacious on average, we may want to know whether it has a similar degree of efficacy among men and women, or young and old. The Food and Drug Administration (FDA) Guidance for Industry (FDA, 1998) states:
Frequently, large trials have relatively broad entry criteria and the study populations may be diverse with regard to important covariates such as concomitant or prior therapy, disease stage, age, gender or race. Analysis of the results of such trials for consistency across key patient subsets addresses concerns about generalizability of findings to various populations in a manner that may not be possible with smaller trials or trials with more narrow entry criteria.
The FDA guidance does not define what is meant by consistency of findings in key patient subsets. We provide a working definition of consistency of treatment effect as the absence of any statistically significant interaction between the treatment and a subgrouping variable using an appropriate model for interaction (at a significance level α = 0.05/K, where K is the number of preestablished subgrouping variables considered in the analysis). In order to examine consistency of treatment effects across key patient subgroups, RCTs, in addition to the ATE, also present the results of subgroup analyses for preestablished baseline subgrouping factors. Typically, these are marginal subgroup analyses in the sense that treatment effects are estimated in subgroups defined by only one baseline characteristic at a time (e.g., men versus women, young versus old). This should be contrasted with a joint subgroup analysis, where treatment effects would be estimated in mutually exclusive subgroups defined by multiple baseline characteristics (e.g., young–men, young–women, old–men, old–women). The forest plot is a popular graphical approach for displaying the results of marginal subgroup analyses in clinical trials. These plots were originally used in meta-analysis for displaying the treatment effects from independent studies. Treatment effect estimates of different subgroups from a marginal analysis are, however, not independent. Correlation between the subgrouping variables should be addressed for proper interpretation of forest plots. This is especially important in large effectiveness trials where one of the goals of subgroup analyses is to address concerns about the generalizability of findings to various populations. Failure to account for the correlation between the subgrouping variables can result in misleading interpretations of subgroup-specific treatment effects, and consequently treatment may be targeted to the wrong subgroups.
For example, suppose that gender and age are correlated such that men are likely to be younger than women in the trial. Suppose the truth is that the effectiveness of the treatment declines with age independently of gender. A naive comparison of treatment effect in men and women would reveal that the treatment is only effective in men, which could result in wrongfully withholding the treatment from women. We should compare the effects of the treatment in men and in women in such a manner that the distribution of age is the same in men and women. This naturally leads to a consideration of the technique of standardization routinely employed by epidemiologists to control for confounding (Sato and Matsuyama, 2003). Epidemiologists calculate, for example, quantities such as standardized mortality/morbidity ratio by stratifying on a set of confounding factors. This technique is also known as inverse probability weighting (Robins et al., 2000). While the concept of standardization (or inverse probability weighting) is well known in epidemiology, our proposal for using it in the context of subgroup analysis in randomized controlled trials is novel.
2. MARGINAL SUBGROUP EFFECTS
Consider a randomized experiment where a binary treatment T is randomly assigned to subjects. Let Y be the outcome and let X = {X1, …, XK} be a vector of categorical baseline covariates. Our interest lies in understanding how the effect of T on Y varies according to baseline characteristics. We can define the treatment effect within any given stratum x = {x1, …, xK} as follows:
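$$\theta(\mathbf{x}) = g\{E[Y \mid T = 1, \mathbf{X} = \mathbf{x}]\} - g\{E[Y \mid T = 0, \mathbf{X} = \mathbf{x}]\} \tag{1}$$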
where g(·) is a function denoting the scale in which the treatment effect is quantified. Commonly used treatment effect scales are identity (g(x) = x), log (g(x) = log(x)), and logit (g(x) = log(x) − log(1 − x)). The identity scale is used when the outcome is continuous. For binary outcomes, all three scales may be used with the identity scale denoting risk difference, log scale denoting the risk ratio, and the logit scale denoting the odds ratio. The log scale is also used when the outcome is the number of occurrences (counts) or the time duration for an event to occur.
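As a small illustration of the three scales for a binary outcome, the following sketch evaluates g(E[Y | T = 1]) − g(E[Y | T = 0]) in a single stratum. The two risks (0.30 and 0.20) and the helper name `treatment_effect` are hypothetical and chosen only for illustration.

```python
import math

# Link functions g(.) for the three treatment-effect scales named in the text.
SCALES = {
    "identity": lambda p: p,
    "log": math.log,
    "logit": lambda p: math.log(p) - math.log(1.0 - p),
}

def treatment_effect(mean_treated, mean_control, scale):
    """Return g(E[Y | T=1]) - g(E[Y | T=0]) on the requested scale."""
    g = SCALES[scale]
    return g(mean_treated) - g(mean_control)

# Hypothetical event risks in one stratum (illustrative values only).
risk_treated, risk_control = 0.30, 0.20

# Identity scale gives the risk difference directly; exponentiating the
# log-scale and logit-scale effects gives the risk ratio and odds ratio.
risk_difference = treatment_effect(risk_treated, risk_control, "identity")
risk_ratio = math.exp(treatment_effect(risk_treated, risk_control, "log"))
odds_ratio = math.exp(treatment_effect(risk_treated, risk_control, "logit"))
print(risk_difference, risk_ratio, odds_ratio)
```

The same pair of risks therefore yields three different effect measures, which is why the scale must be fixed before subgroup effects are compared.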
Suppose, for example, that we have two binary baseline covariates: sex (men/women) and age (young/old) defined with men (X1 = 1), women (X1 = 2), young (X2 = 1), and old (X2 = 2). There will be four treatment effects, possibly different, defined in the four strata: young men, young women, old men, and old women. The number of strata can become large when there are several baseline covariates. Treatment effect estimates in such high-dimensional problems will tend to be highly variable due to small sample size. Therefore, it is more common to focus on treatment effects in low-dimensional subgroups defined by fewer baseline covariates, the simplest of which is subgroups defined by a single baseline characteristic, for example, men (X1 = 1) and women (X1 = 2), or young (X2 = 1) and old (X2 = 2). We call these “marginal” subgroup-specific treatment effects, which are defined for the subgroups defined by the kth baseline variable:
$$\eta(x_k) = g\{E[Y \mid T = 1, X_k = x_k]\} - g\{E[Y \mid T = 0, X_k = x_k]\} \tag{2}$$
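Using the fact that
$$E[Y \mid T = t, X_k = x_k] = \int E[Y \mid T = t, X_k = x_k, X_{-k} = x_{-k}]\, f(x_{-k} \mid T = t, X_k = x_k)\, dx_{-k}$$
and that T is independent of X, we obtain:
$$E[Y \mid T = t, X_k = x_k] = \int E[Y \mid T = t, \mathbf{X} = \mathbf{x}]\, f(x_{-k} \mid X_k = x_k)\, dx_{-k} \tag{3}$$
Therefore,
$$E[Y \mid T = 1, X_k = x_k] - E[Y \mid T = 0, X_k = x_k] = \int \{E[Y \mid T = 1, \mathbf{X} = \mathbf{x}] - E[Y \mid T = 0, \mathbf{X} = \mathbf{x}]\}\, f(x_{-k} \mid X_k = x_k)\, dx_{-k} \tag{4}$$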
Thus, for linear models we can obtain the marginal effects for the k-th subgrouping variable Xk, equation (2), by marginalizing the fully stratified treatment effect θ(x) over the conditional distribution of the other baseline covariates, f(X−k | Xk = xk):
$$\eta(x_k) = \int \theta(\mathbf{x})\, f(x_{-k} \mid X_k = x_k)\, dx_{-k} \tag{5}$$
Equation (4) shows that when g(x) = x in equation (2), the treatment effect is collapsible in the sense that the marginal effects can be obtained by integrating the fully stratified effects. Such collapsibility is not generally present for other, nonlinear link functions:
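$$g\{E[Y \mid T = 1, X_k = x_k]\} - g\{E[Y \mid T = 0, X_k = x_k]\} \neq \int \left[ g\{E[Y \mid T = 1, \mathbf{X} = \mathbf{x}]\} - g\{E[Y \mid T = 0, \mathbf{X} = \mathbf{x}]\} \right] f(x_{-k} \mid X_k = x_k)\, dx_{-k} \tag{6}$$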
Therefore, when collapsibility is not present, we cannot exactly represent the marginal subgroup effects by integrating the fully stratified subgroup effects over appropriate conditional distributions. Noncollapsibility notwithstanding, such a representation is reasonably accurate even for subgroups of moderately large size.
The theoretical development presented is applicable to both continuous and categorical covariates X. For categorical variables, all the integrals in the preceding equations reduce to weighted summations across strata given by {X−k, Xk = xk}, where the weights are the conditional probabilities P(X−k = x−k | Xk = xk).
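The contrast between the collapsible identity scale and a noncollapsible scale can be checked numerically. The sketch below uses hypothetical risks within one level of X1, stratified by a second binary covariate X2, and compares the marginal subgroup effect with the average of the stratum-specific effects over f(x2 | x1). All numbers and function names are assumptions made for illustration.

```python
import math

def identity(p):
    return p

def logit(p):
    return math.log(p) - math.log(1.0 - p)

# Hypothetical risks P(Y = 1 | T = t, X1 = x1, X2 = x2) within one level x1,
# keyed by (t, x2); illustrative values only.
risk = {(1, 0): 0.50, (0, 0): 0.30,
        (1, 1): 0.90, (0, 1): 0.80}
# Hypothetical f(x2 | x1); identical in both arms because T is randomized.
p_x2 = {0: 0.5, 1: 0.5}

def marginal_effect(g):
    """g(E[Y | T=1, x1]) - g(E[Y | T=0, x1]): the marginal subgroup effect."""
    mean = {t: sum(risk[(t, x2)] * p_x2[x2] for x2 in p_x2) for t in (0, 1)}
    return g(mean[1]) - g(mean[0])

def averaged_stratum_effect(g):
    """Stratum-specific effects averaged over f(x2 | x1)."""
    return sum((g(risk[(1, x2)]) - g(risk[(0, x2)])) * p_x2[x2] for x2 in p_x2)

# The gap is zero (up to rounding) on the identity scale, nonzero on the logit scale.
gap_identity = marginal_effect(identity) - averaged_stratum_effect(identity)
gap_logit = marginal_effect(logit) - averaged_stratum_effect(logit)
print(gap_identity, gap_logit)
```

With these particular risks the marginal log odds ratio is smaller than the averaged stratum-specific log odds ratio, even though the covariate distribution is identical in the two arms.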
3. STANDARDIZED MARGINAL INTERACTION MODELS
The marginal effect in equation (2) can be estimated nonparametrically by replacing the expectations with sample means of response Y in the appropriate subgroups. For binary (or categorical) subgrouping variables, the estimand in equation (2) can, equivalently, be estimated from the following saturated regression model:
$$g\{E[Y \mid T, X_k]\} = \beta_{0,k} + \beta_{t,k}\, T + \beta_{x,k}\, X_k + \gamma_k\, T X_k \tag{7}$$
We call this the marginal interaction model (MIM), since it only pertains to a single covariate Xk. When Xk is binary, this model is equivalent to a stratified subgroup analysis for Xk = 0 and Xk = 1, since it is a saturated model in T and Xk. The treatment effects in the subgroups Xk = 0 and Xk = 1 denoted by βt,k and βt,k + γk, respectively, are the same as η(Xk = 0) and η(Xk = 1), respectively, from equation (2). An advantage of the marginal interaction model over marginal subgroup analysis is that we can directly assess the consistency of subgroup treatment effects, that is, we can test the presence of an interaction between T and Xk, since it is equivalent to testing whether the treatment effects differ between Xk = 0 and Xk = 1. We denote the subgroup-specific treatment effects estimated using this model as naive subgroup-specific treatment effects.
When Xk is not independent of X−k, the inferences from the marginal model, equation (7), regarding consistency of subgroup-specific treatment effects can be misleading. Suppose, for example, that X1 and X2 are positively correlated in the sample, but that only X1 is a true predictor of treatment response. The marginal subgroup analysis based on equation (7) could lead us to conclude that both X1 and X2 are predictors of treatment response. When X1 and X2 are negatively correlated and are both treatment effect modifiers, we could conclude from the marginal subgroup analysis that neither modifies the effect of the treatment. In other words, the interaction between treatment and X2 is confounded by X1. Conversely, the interaction between treatment and X1 can also be confounded by X2. Groenwold (2009) discusses an example of this phenomenon. We need to account for the correlation between baseline subgrouping variables X in order to draw proper inferences.
This leads to the consideration of the following marginal treatment effect that is not affected by correlation between Xk and X−k:
$$\eta^{\dagger}(x_k) = \int \theta(\mathbf{x})\, f(x_{-k})\, dx_{-k} \tag{8}$$
where f(X−k) is the joint distribution of K − 1 baseline covariates excluding Xk. Using the fact that f(A) = f(A | B)f(B)/f(B | A), we can rewrite the preceding integral as:
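$$\eta^{\dagger}(x_k) = \int \theta(\mathbf{x})\, f(x_{-k} \mid X_k = x_k)\, \frac{f(x_k)}{f(x_k \mid x_{-k})}\, dx_{-k} \tag{9}$$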
Now, a comparison of equation (5) with equation (9) suggests that we can recover the unconfounded effects from the naive effects by using the weights f(xk)/f(xk | x−k). This is the basis for our standardization approach to removing the confounding in the naive marginal treatment effects. When X−k is independent of Xk we have η†(xk) = η(xk), and the weights equal 1. The obvious and natural choice for the standardizing distribution, f(X−k), is the actual sample distribution of baseline covariates in the trial.
We illustrate standardization with a simple example. Consider a 2 × 2 table of men/women versus young/old; see Table 1. Men are six times as likely as women to be young; hence, a naive comparison of treatment effect between men and women is likely to be misleading if age is a predictor of treatment response. The probability of being young among men and women is 2/3 and 1/9, respectively, whereas it is 1/3 in the overall sample.
Table 1.
A 2 × 2 table before standardization
| | Young | Old | Total |
|---|---|---|---|
| Men | 400 | 200 | 600 |
| Women | 100 | 800 | 900 |
| Total | 500 | 1000 | 1500 |
We can standardize the age distributions of men and women to match that of the overall sample, by weighting them differently. We assign a weight of 1/2 to each young man, which is equal to (1/3)/(2/3), the overall probability of being young divided by the probability of being young among men, and a weight of 2 to each old man, which is equal to (2/3)/(1/3). Similarly, we also assign a weight of 3 to each young woman and a weight of 3/4 to each old woman. After applying the standardization weights, we get the results shown in Table 2.
Table 2.
The 2 × 2 table after standardization
| | Young | Old | Total |
|---|---|---|---|
| Men | 200 | 400 | 600 |
| Women | 300 | 600 | 900 |
| Total | 500 | 1000 | 1500 |
In the standardized Table 2, the age distributions in men and women are identical; the gender distributions among young and old are identical, as well.
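The arithmetic behind Tables 1 and 2 can be reproduced directly from the cell counts. The sketch below computes each weight as the overall probability of an age group divided by its probability within the sex subgroup, using exact fractions; the variable names are ours.

```python
from fractions import Fraction

# Cell counts from Table 1 (sex by age).
counts = {("men", "young"): 400, ("men", "old"): 200,
          ("women", "young"): 100, ("women", "old"): 800}
sexes, ages = ["men", "women"], ["young", "old"]

total = sum(counts.values())
n_sex = {s: sum(counts[(s, a)] for a in ages) for s in sexes}
n_age = {a: sum(counts[(s, a)] for s in sexes) for a in ages}

# Weight for a person of sex s and age a: P(age = a) / P(age = a | sex = s).
weights = {
    (s, a): Fraction(n_age[a], total) / Fraction(counts[(s, a)], n_sex[s])
    for s in sexes for a in ages
}

# Weighted cell counts: these are the entries of the standardized table.
standardized = {cell: counts[cell] * w for cell, w in weights.items()}

for s in sexes:
    print(s, [(a, str(weights[(s, a)]), int(standardized[(s, a)])) for a in ages])
```

After weighting, one third of each sex is young, which matches the overall sample, and the row totals (600 men, 900 women) are unchanged.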
Standardization is readily extended to multiple binary variables. Suppose that we have K binary subgrouping variables: X = {X1, X2, …, XK}. To standardize X1 such that the joint distribution of {X2, …, XK} is the same in X1 = 0 and X1 = 1, we apply the following weights:
$$w_1(\mathbf{x}) = \frac{P(X_1 = x_1)}{P(X_1 = x_1 \mid X_2 = x_2, \ldots, X_K = x_K)} \tag{10}$$
where xk ∈ {0, 1}, ∀k = 1, 2, …, K. When K is even moderately large, the denominator of equation (10) can become very small, and consequently, these nonparametric weights can become extremely large. The resulting standardized subgroup-specific treatment effects would become unstable with a large variance. Therefore, it is preferable to estimate the weights using parametric models. Let the observed data be {X1i, X2i, …, XKi, Ti, Yi}, i = 1, …, n, where T is the binary treatment indicator and Y is the outcome. To obtain the weights for standardizing the subgroups X1 = 0 and X1 = 1, we fit the logistic regression model:
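$$\operatorname{logit} P(X_{1i} = 1 \mid X_{2i}, \ldots, X_{Ki}) = \alpha_0 + \sum_{k=2}^{K} \alpha_k X_{ki} \tag{11}$$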
This is about the simplest model for calculating the weights. It is also possible to fit more complex models, such as, for example, adding pairwise or higher order interactions in equation (11). It is generally not advisable to fit more than pairwise interactions due to potential instability in weights. Weights for the individuals are calculated by plugging in the model predicted probabilities from equation (11) into the denominator of equation (10). Standardization can also be readily applied to categorical factors with more than two levels. Multinomial models would be used instead of binomial models for calculating standardization weights.
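A minimal sketch of the parametric weight calculation follows, assuming a main-effects logistic model for X1 given X2 and X3. To stay self-contained it fits the model by Newton-Raphson on grouped data rather than calling a statistics package; the cell counts, the helper names `solve` and `fit_logistic`, and the choice of three covariates are all hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_logistic(design, y, freq, iterations=25):
    """Newton-Raphson fit of a logistic model to grouped data (freq = cell counts)."""
    p = len(design[0])
    alpha = [0.0] * p
    for _ in range(iterations):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, yi, n in zip(design, y, freq):
            mu = 1.0 / (1.0 + math.exp(-sum(a * v for a, v in zip(alpha, x))))
            for j in range(p):
                grad[j] += n * (yi - mu) * x[j]
                for k in range(p):
                    hess[j][k] += n * mu * (1.0 - mu) * x[j] * x[k]
        alpha = [a + s for a, s in zip(alpha, solve(hess, grad))]
    return alpha

# Hypothetical cell counts for (x1, x2, x3); not data from any trial.
cells = {(0, 0, 0): 60, (0, 0, 1): 140, (0, 1, 0): 100, (0, 1, 1): 300,
         (1, 0, 0): 150, (1, 0, 1): 150, (1, 1, 0): 50, (1, 1, 1): 50}
n_total = sum(cells.values())
p_x1 = sum(n for (x1, _, _), n in cells.items() if x1 == 1) / n_total

# Design matrix: intercept, X2, X3; response: X1.
design = [(1.0, float(x2), float(x3)) for (_, x2, x3) in cells]
y = [x1 for (x1, _, _) in cells]
alpha = fit_logistic(design, y, list(cells.values()))

def fitted_prob(x2, x3):
    """Model-based P(X1 = 1 | X2 = x2, X3 = x3) from the main-effects model."""
    eta = alpha[0] + alpha[1] * x2 + alpha[2] * x3
    return 1.0 / (1.0 + math.exp(-eta))

# Weight = marginal probability of the observed X1 level divided by its
# model-based conditional probability given the other covariates.
weights = {}
for (x1, x2, x3) in cells:
    cond = fitted_prob(x2, x3) if x1 == 1 else 1.0 - fitted_prob(x2, x3)
    marg = p_x1 if x1 == 1 else 1.0 - p_x1
    weights[(x1, x2, x3)] = marg / cond
print(weights)
```

Because the fitted probabilities come from a smoothed model rather than from raw cell proportions, sparse cells do not by themselves produce extreme weights, which is the motivation for the parametric approach.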
When we incorporate weights, wk,i, for individuals in equation (7), we obtain the standardized marginal interaction model (SMIM):
$$g\{E[Y \mid T, X_k]\} = \beta^{\dagger}_{0,k} + \beta^{\dagger}_{t,k}\, T + \beta^{\dagger}_{x,k}\, X_k + \gamma^{\dagger}_k\, T X_k \quad \text{(individual } i \text{ weighted by } w_{k,i}\text{)} \tag{12}$$
where the standardization weights, wk, are given by:
$$w_{k,i} = \frac{P(X_k = x_{k,i})}{P(X_k = x_{k,i} \mid X_{-k} = x_{-k,i})} \tag{13}$$
In this model, the treatment effects in the standardized subgroups Xk = 0 and Xk = 1 are denoted by β†t,k and β†t,k + γ†k, respectively. The variances for the parameter estimates can be computed using Huber–White type estimation. A Wald test of interaction can be performed by testing the significance of the interaction coefficient estimate γ̂k† with the Huber–White standard error. It should be noted that when Xk is independent of X−k, the weights wk,i equal 1, and the SMIM reduces to the naive MIM, equation (7).
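For the identity link and a binary Xk, a weighted saturated model in T and Xk can be fitted by weighted least squares, whose fitted values are simply weighted cell means. The sketch below uses that shortcut and a basic (HC0-type) sandwich variance that treats the weights as fixed, ignoring the fact that they are estimated. The records and the function name `smim_identity` are hypothetical.

```python
import math

def smim_identity(records):
    """Weighted saturated model in T and Xk on the identity scale.

    records: iterable of (t, xk, y, w) with t, xk in {0, 1}.
    Returns subgroup effects, the interaction, and its robust standard error.
    """
    cells = {}
    for t, xk, y, w in records:
        cells.setdefault((t, xk), []).append((y, w))
    mean, var = {}, {}
    for key, obs in cells.items():
        sw = sum(w for _, w in obs)
        mean[key] = sum(w * y for y, w in obs) / sw
        # Sandwich variance of a weighted cell mean, weights treated as known.
        var[key] = sum((w * (y - mean[key])) ** 2 for y, w in obs) / sw ** 2
    effect = {xk: mean[(1, xk)] - mean[(0, xk)] for xk in (0, 1)}
    gamma = effect[1] - effect[0]          # interaction: difference of effects
    se_gamma = math.sqrt(sum(var.values()))  # the four cells are disjoint
    return effect, gamma, se_gamma

# Tiny hypothetical data set: (treatment, subgroup, outcome, weight).
records = [
    (1, 0, 2.0, 1.0), (1, 0, 4.0, 1.0), (0, 0, 1.0, 1.0), (0, 0, 1.0, 1.0),
    (1, 1, 5.0, 0.5), (1, 1, 3.0, 1.5), (0, 1, 2.0, 2.0), (0, 1, 0.0, 2.0),
]
effect, gamma, se_gamma = smim_identity(records)
wald_z = gamma / se_gamma  # compare with a standard normal for the Wald test
print(effect, gamma, se_gamma, wald_z)
```

Setting every weight to 1 in `records` recovers the naive subgroup analysis, mirroring the reduction of the SMIM to the MIM when Xk is independent of the other covariates.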
4. SIMULATION EXPERIMENTS
The goal of the simulation experiments was twofold: (1) to evaluate the bias, standard error, and mean-squared error of two estimation procedures for the subgroup-specific treatment effects, namely the naive (usual) approach based on the marginal interaction model, equation (7), and the standardized approach based on equations (12) and (13); and (2) to compare the type I error of the naive and standardization methods for testing interaction when there is no interaction between the treatment and the subgrouping variable. The outcome Y was assumed to be normally distributed with mean given by the following model:
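$$E[Y \mid T, \mathbf{X}] = \beta_0 + \beta_t T + \sum_{k=1}^{3} \beta_k X_k + \sum_{k=1}^{3} \gamma_k T X_k \tag{14}$$
and with variance σ² = 1.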
Under this data generating model, we can obtain the true stratified (joint) and marginal subgroup-specific treatment effects, respectively, as follows:
$$\theta(\mathbf{x}) = \beta_t + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 x_3 \tag{15}$$
and
$$\eta^{\dagger}(x_k) = \beta_t + \gamma_k x_k + \sum_{j \neq k} \gamma_j P(X_j = 1) \tag{16}$$
The naive marginal subgroup-specific treatment effects for each scenario can also be calculated:
$$\eta(x_k) = \beta_t + \gamma_k x_k + \sum_{j \neq k} \gamma_j P(X_j = 1 \mid X_k = x_k) \tag{17}$$
We considered three binary subgrouping variables, X1, X2, and X3, which are correlated. The correlation matrix is: .
The marginal probabilities were: P(X1 = 1) = 0.4; P(X2 = 1) = 0.5; P(X3 = 1) = 0.6. Let us denote the parameter vector as: θ = (β0, βt, β1, β2, β3, γ1, γ2, γ3). In all the simulations, we consider θ = (−1.0, 0.7, −0.5, −0.5, −0.5, γ1, γ2, γ3), where we only varied one or more of the interaction parameters, γ1, γ2, γ3 between different scenarios. We consider four different scenarios: (1) The treatment effect is homogeneous, (2) only X1 is an effect modifier, (3) X2 and X3 are positive effect modifiers, and (4) X2 interaction is positive and X3 interaction is negative. The parameter vectors for these 4 scenarios are:
- Scenario 1: θ = (−1.0, 0.7, −0.5, −0.5, −0.5, 0.0, 0.0, 0.0).
- Scenario 2: θ = (−1.0, 0.7, −0.5, −0.5, −0.5, 0.5, 0.0, 0.0).
- Scenario 3: θ = (−1.0, 0.7, −0.5, −0.5, −0.5, 0.0, 0.5, 0.5).
- Scenario 4: θ = (−1.0, 0.7, −0.5, −0.5, −0.5, 0.0, 0.5, −0.5).
We considered two different sample sizes, N = 200 and N = 2500, representing small efficacy trials and large effectiveness trials, respectively. We conducted 100,000 simulations for N = 200 and 10,000 simulations for N = 2500. Table 3 presents the simulation results for the small sample size and Table 4 presents the results for the large sample size.
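The target values of the subgroup-specific treatment effects follow from the linear data-generating model and the marginal probabilities alone. The sketch below assumes that, for subgroup Xk = xk, the unconfounded marginal effect is βt + γk·xk plus the sum of γj·P(Xj = 1) over the other covariates; the function and variable names are ours.

```python
# Marginal probabilities P(Xj = 1) and the treatment main effect from the text.
marginals = {1: 0.4, 2: 0.5, 3: 0.6}
beta_t = 0.7

# Interaction parameters (gamma1, gamma2, gamma3) for the four scenarios.
scenarios = {
    1: (0.0, 0.0, 0.0),
    2: (0.5, 0.0, 0.0),
    3: (0.0, 0.5, 0.5),
    4: (0.0, 0.5, -0.5),
}

def standardized_effect(gamma, k, xk):
    """beta_t + gamma_k * xk + sum over j != k of gamma_j * P(Xj = 1)."""
    others = sum(gamma[j - 1] * marginals[j] for j in marginals if j != k)
    return beta_t + gamma[k - 1] * xk + others

def all_effects(gamma):
    """Effects in the order X1=0, X1=1, X2=0, X2=1, X3=0, X3=1."""
    return [standardized_effect(gamma, k, xk) for k in (1, 2, 3) for xk in (0, 1)]

for label, gamma in scenarios.items():
    print(label, [round(v, 2) for v in all_effects(gamma)])
```

The printed values can be compared with the six marginal effects quoted for each scenario in the discussion of the simulation results.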
Table 3.
Results for small sample size: N = 200
Columns θ̂ give the subgroup-specific treatment effect, with its standard error (SE) and RMSEᵈ on the two rows beneath; columns γ = 0 give the type I error of the interaction testᶜ.
| Scenario | θ̂X1=0 | θ̂X1=1 | θ̂X2=0 | θ̂X2=1 | θ̂X3=0 | θ̂X3=1 | γ1 = 0 | γ2 = 0 | γ3 = 0 |
|---|---|---|---|---|---|---|---|---|---|
| 1. γ1 = 0, γ2 = 0, γ3 = 0 | | | | | | | | | |
| Naiveᵃ | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.049 | 0.051 | 0.051 |
| SE | (0.20) | (0.24) | (0.21) | (0.21) | (0.23) | (0.19) | | | |
| RMSE | (0.20) | (0.24) | (0.21) | (0.21) | (0.23) | (0.19) | | | |
| Standardizedᵇ | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.050 | 0.048 | 0.050 |
| SE | (0.23) | (0.30) | (0.25) | (0.28) | (0.33) | (0.22) | | | |
| RMSE | (0.23) | (0.30) | (0.25) | (0.28) | (0.33) | (0.22) | | | |
| 2. γ1 = 0.5, γ2 = 0, γ3 = 0 | | | | | | | | | |
| Naive | 0.70 | 1.20 | 1.00 | 0.80 | 1.02 | 0.82 | — | 0.102 | 0.100 |
| SE | (0.20) | (0.24) | (0.21) | (0.21) | (0.23) | (0.19) | | | |
| RMSE | (0.20) | (0.24) | (0.23) | (0.23) | (0.26) | (0.20) | | | |
| Standardized | 0.70 | 1.20 | 0.90 | 0.90 | 0.90 | 0.90 | — | 0.061 | 0.064 |
| SE | (0.22) | (0.30) | (0.25) | (0.28) | (0.32) | (0.22) | | | |
| RMSE | (0.22) | (0.30) | (0.25) | (0.28) | (0.32) | (0.22) | | | |
| 3. γ1 = 0.0, γ2 = 0.5, γ3 = 0.5 | | | | | | | | | |
| Naive | 1.41 | 1.01 | 0.88 | 1.62 | 0.80 | 1.55 | 0.260 | — | — |
| SE | (0.19) | (0.23) | (0.21) | (0.21) | (0.23) | (0.19) | | | |
| RMSE | (0.25) | (0.33) | (0.24) | (0.24) | (0.28) | (0.21) | | | |
| Standardized | 1.25 | 1.25 | 1.00 | 1.50 | 0.94 | 1.45 | 0.052 | — | — |
| SE | (0.21) | (0.29) | (0.25) | (0.28) | (0.32) | (0.22) | | | |
| RMSE | (0.21) | (0.29) | (0.25) | (0.28) | (0.32) | (0.22) | | | |
| 4. γ1 = 0.0, γ2 = 0.5, γ3 = −0.5 | | | | | | | | | |
| Naive | 0.65 | 0.65 | 0.52 | 0.78 | 0.80 | 0.55 | 0.049 | — | — |
| SE | (0.20) | (0.25) | (0.22) | (0.21) | (0.23) | (0.19) | | | |
| RMSE | (0.20) | (0.25) | (0.24) | (0.24) | (0.28) | (0.22) | | | |
| Standardized | 0.65 | 0.65 | 0.40 | 0.90 | 0.95 | 0.45 | 0.065 | — | — |
| SE | (0.23) | (0.30) | (0.25) | (0.29) | (0.32) | (0.22) | | | |
| RMSE | (0.23) | (0.30) | (0.25) | (0.29) | (0.32) | (0.22) | | | |

ᵃ Naive approach using equation (7).
ᵇ Standardization approach using equation (12).
ᶜ Rejecting the hypothesis of no interaction at α = 0.05.
ᵈ Root mean-squared error, √(bias² + SE²).
Table 4.
Results for large sample size: N = 2500
Columns θ̂ give the subgroup-specific treatment effect, with its standard error (SE) and RMSEᵈ on the two rows beneath; columns γ = 0 give the type I error of the interaction testᶜ.
| Scenario | θ̂X1=0 | θ̂X1=1 | θ̂X2=0 | θ̂X2=1 | θ̂X3=0 | θ̂X3=1 | γ1 = 0 | γ2 = 0 | γ3 = 0 |
|---|---|---|---|---|---|---|---|---|---|
| 1. γ1 = 0, γ2 = 0, γ3 = 0 | | | | | | | | | |
| Naiveᵃ | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.049 | 0.051 | 0.045 |
| SE | (0.056) | (0.068) | (0.060) | (0.059) | (0.066) | (0.054) | | | |
| RMSE | (0.06) | (0.07) | (0.06) | (0.06) | (0.07) | (0.05) | | | |
| Standardizedᵇ | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.050 | 0.051 | 0.048 |
| SE | (0.063) | (0.084) | (0.069) | (0.076) | (0.088) | (0.061) | | | |
| RMSE | (0.06) | (0.08) | (0.07) | (0.08) | (0.09) | (0.06) | | | |
| 2. γ1 = 0.5, γ2 = 0, γ3 = 0 | | | | | | | | | |
| Naive | 0.70 | 1.20 | 1.00 | 0.80 | 1.02 | 0.82 | — | 0.664 | 0.657 |
| SE | (0.055) | (0.068) | (0.058) | (0.059) | (0.065) | (0.054) | | | |
| RMSE | (0.06) | (0.07) | (0.11) | (0.11) | (0.14) | (0.10) | | | |
| Standardized | 0.70 | 1.20 | 0.90 | 0.90 | 0.90 | 0.90 | — | 0.051 | 0.048 |
| SE | (0.062) | (0.083) | (0.068) | (0.076) | (0.087) | (0.062) | | | |
| RMSE | (0.06) | (0.08) | (0.07) | (0.08) | (0.09) | (0.06) | | | |
| 3. γ1 = 0.0, γ2 = 0.5, γ3 = 0.5 | | | | | | | | | |
| Naive | 1.41 | 1.01 | 0.88 | 1.62 | 0.80 | 1.55 | 0.997 | — | — |
| SE | (0.053) | (0.065) | (0.058) | (0.058) | (0.065) | (0.053) | | | |
| RMSE | (0.17) | (0.25) | (0.14) | (0.14) | (0.17) | (0.12) | | | |
| Standardized | 1.25 | 1.25 | 1.00 | 1.50 | 0.95 | 1.45 | 0.042 | — | — |
| SE | (0.058) | (0.077) | (0.068) | (0.075) | (0.087) | (0.061) | | | |
| RMSE | (0.06) | (0.08) | (0.07) | (0.08) | (0.09) | (0.06) | | | |
| 4. γ1 = 0.0, γ2 = 0.5, γ3 = −0.5 | | | | | | | | | |
| Naive | 0.65 | 0.65 | 0.52 | 0.78 | 0.80 | 0.55 | 0.049 | — | — |
| SE | (0.055) | (0.070) | (0.061) | (0.060) | (0.066) | (0.053) | | | |
| RMSE | (0.06) | (0.07) | (0.14) | (0.14) | (0.17) | (0.11) | | | |
| Standardized | 0.65 | 0.65 | 0.40 | 0.90 | 0.95 | 0.45 | 0.0532 | — | — |
| SE | (0.062) | (0.084) | (0.069) | (0.078) | (0.088) | (0.061) | | | |
| RMSE | (0.06) | (0.08) | (0.07) | (0.08) | (0.09) | (0.06) | | | |

ᵃ Naive approach using equation (7).
ᵇ Standardization approach using equation (12).
ᶜ Rejecting the hypothesis of no interaction at α = 0.05.
ᵈ Root mean-squared error, √(bias² + SE²).
Scenario 1
In this scenario, the treatment effect was constant across all subgroups. Both the naive and the standardized approaches yield unbiased estimates of the true effect, 0.7 (from equations (16) and (17)), for small (Table 3) and large (Table 4) sample sizes. Because there is no bias to remove in this scenario, standardization only increases the variance of the estimate. Thus, the root mean squared error (RMSE), defined as the square root of the sum of the squared bias and the variance, is larger for standardization. The rejection rates for the naive and standardization approaches are consistent with a two-sided nominal level of 0.05.
Scenario 2
In this scenario, the treatment effect varied only across subgroups defined by X1. The marginal subgroup-specific treatment effects calculated using equation (16) for X1 = 0, X1 = 1, X2 = 0, X2 = 1, X3 = 0, and X3 = 1 are, respectively, 0.7, 1.2, 0.9, 0.9, 0.9, and 0.9. When X1 is the only effect modifier, the naive estimates of the subgroup-specific treatment effects are unbiased only for X1 = 0 and X1 = 1. The subgroup-specific treatment effect estimates for the non-effect modifiers, X2 and X3, are biased, with an upward bias for X2 = 0 and X3 = 0, and a downward bias for X2 = 1 and X3 = 1. This results in negatively biased estimates of the null interactions between T and X2 and between T and X3. As expected, for the X2 and X3 subgroups the RMSE is larger for standardization than for the naive approach in small samples, but it is smaller in large samples.
Also, the type I error probabilities of the naive approach for testing the null interaction between T and X2 and X3 are greater than the nominal rate of .05. The type I errors are 0.102 and 0.100 for the small sample size (Table 3) and 0.664 and 0.657 for large sample size (Table 4). Standardization has a type I error slightly larger than the nominal rate for small sample size.
Scenario 3
Here the treatment effect varied (in the same direction) across subgroups defined by X2 and X3. The marginal subgroup-specific treatment effects calculated using equation (16) for X1 = 0, X1 = 1, X2 = 0, X2 = 1, X3 = 0, and X3 = 1 are, respectively, 1.25, 1.25, 1.00, 1.50, 0.95, and 1.45. The naive approach has a substantial bias for all subgroups. It has an upward bias for X1 = 0, X2 = 1, and X3 = 1, and a downward bias for X1 = 1, X2 = 0, and X3 = 0. Therefore, the magnitude of all three interaction effects is overestimated. The RMSE of the naive approach is much larger than that of the standardization approach for large samples; for small samples it is larger for the X1 subgroups but slightly smaller for the X2 and X3 subgroups. The type I error probabilities of the naive approach for testing the null interaction between T and X1 were much larger than the nominal rate in both small sample (0.26) and large sample (0.997) simulations.
Scenario 4
As in Scenario 3, the treatment effect varied across subgroups defined by X2 and X3, with the only difference being that the treatment effects varied in opposite directions. The marginal subgroup-specific treatment effects calculated using equation (16) for X1 = 0, X1 = 1, X2 = 0, X2 = 1, X3 = 0, and X3 = 1 are, respectively, 0.65, 0.65, 0.40, 0.90, 0.95, and 0.45. Interestingly, the naive approach is unbiased for the X1 subgroups, where it was biased in Scenario 3. This is because the interactions between T and X2 and between T and X3 have the same magnitude and opposite signs, while X2 and X3 are each negatively correlated with X1 to a similar degree, so the two biases counteract each other. Also, with the naive approach there is an upward bias for X2 = 0 and X3 = 1 and a downward bias for X2 = 1 and X3 = 0. Therefore, the magnitudes of the interaction effects T * X2 and T * X3 are underestimated. Once again, as expected, the RMSE is smaller for standardization than for the naive approach in large samples, but not always in small samples. The type I error probabilities for testing the null interaction between T and X1 were consistent with the nominal rate for both approaches in both small (Table 3) and large (Table 4) sample simulations.
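The dependence among the subgrouping variables can be read back from the naive values reported in Tables 3 and 4. The sketch below assumes that the naive effect for subgroup Xk = xk equals βt + γk·xk plus the sum of γj·P(Xj = 1 | Xk = xk) over the other covariates, solves for the conditional probabilities, and converts them to phi correlations. Because the tabulated effects are rounded to two decimals, the results are approximate.

```python
import math

beta_t = 0.7
p = {1: 0.4, 2: 0.5, 3: 0.6}  # marginal probabilities P(Xj = 1)

def phi(p_a, p_b, p_a_given_b1):
    """Phi correlation of binary A and B from P(A=1), P(B=1), P(A=1 | B=1)."""
    joint = p_a_given_b1 * p_b
    return (joint - p_a * p_b) / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))

# Scenario 2 (only gamma1 = 0.5): the naive effect at Xk = 1 is
# beta_t + 0.5 * P(X1 = 1 | Xk = 1); tabulated values are 0.80 (X2) and 0.82 (X3).
p_x1_given_x2 = (0.80 - beta_t) / 0.5
p_x1_given_x3 = (0.82 - beta_t) / 0.5

# Scenario 3 (gamma2 = gamma3 = 0.5): the naive effect at X2 = 1 is
# beta_t + 0.5 + 0.5 * P(X3 = 1 | X2 = 1); the tabulated value is 1.62.
p_x3_given_x2 = (1.62 - beta_t - 0.5) / 0.5

phi_12 = phi(p[1], p[2], p_x1_given_x2)
phi_13 = phi(p[1], p[3], p_x1_given_x3)
phi_23 = phi(p[3], p[2], p_x3_given_x2)
print(round(phi_12, 2), round(phi_13, 2), round(phi_23, 2))
```

Under these assumptions X1 is negatively associated with both X2 and X3, while X2 and X3 are positively associated with each other, which is the pattern needed for the biases in the X1 subgroups to add up in Scenario 3 and cancel in Scenario 4.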
5. AN APPLICATION: ANTIBIOTICS FOR ACUTE OTITIS MEDIA IN CHILDREN
We demonstrate how standardization works in a typical study using the individual patient data (IPD) meta-analysis of six randomized controlled trials for evaluating the efficacy of antibiotics for acute otitis media (AOM) in children (Rovers, 2006). Table 5 depicts the joint distribution of the two binary subgrouping variables examined in the RCTs, age <2/age ≥2 and unilateral AOM/bilateral AOM. The table shows that age at least 2 years and bilaterality are negatively correlated; that is, children at least 2 years old are less likely to be bilateral than children less than 2 years old (ρ̂ = −0.29, p-value < .0001).
Table 5.
A 2 × 2 table of subgrouping variables in the meta-analysis of antibiotics for acute otitis media
| | Unilateral AOM | Bilateral AOM | Total |
|---|---|---|---|
| Age < 2 yrs | 261 | 273 | 534 |
| Age ≥ 2 yrs | 611 | 183 | 794 |
| Total | 872 | 456 | 1328 |
Note. Correlation between row and column variables ρ = −0.29 (p-value <.0001).
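The reported correlation and the nonparametric standardization weights for the age subgroups can be computed from the counts in Table 5. In the sketch below each child's weight is the overall probability of the child's laterality divided by its probability within the child's age group; the dictionary keys and variable names are ours.

```python
import math

# Cell counts from Table 5 (age group by laterality).
counts = {("age<2", "unilateral"): 261, ("age<2", "bilateral"): 273,
          ("age>=2", "unilateral"): 611, ("age>=2", "bilateral"): 183}
ages, sides = ["age<2", "age>=2"], ["unilateral", "bilateral"]

total = sum(counts.values())
n_age = {a: sum(counts[(a, s)] for s in sides) for a in ages}
n_side = {s: sum(counts[(a, s)] for a in ages) for s in sides}

# Phi coefficient between the indicators "age >= 2" and "bilateral".
num = (counts[("age>=2", "bilateral")] * counts[("age<2", "unilateral")]
       - counts[("age>=2", "unilateral")] * counts[("age<2", "bilateral")])
den = math.sqrt(n_age["age<2"] * n_age["age>=2"]
                * n_side["unilateral"] * n_side["bilateral"])
phi = num / den

# Weights that give both age groups the overall laterality distribution.
weights = {(a, s): (n_side[s] / total) / (counts[(a, s)] / n_age[a])
           for a in ages for s in sides}
standardized = {cell: counts[cell] * w for cell, w in weights.items()}

print(round(phi, 2))
print({cell: round(w, 3) for cell, w in weights.items()})
```

The weighted counts keep the age-group totals of 534 and 794 while giving both age groups the same proportion of bilateral cases.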
Naive and standardized marginal effects of antibiotics are shown in Table 6 for the four subgroups: age <2 years, age ≥2 years, unilateral AOM, and bilateral AOM. There is a significant treatment-by-laterality interaction. This interaction effect is exaggerated in the naive approach (p = 0.0065), but standardization provides a more conservative inference (p = 0.014). The treatment effect appears to be consistent between children less than 2 years old and children at least 2 years old. The nominal p-values for treatment-by-age interaction are 0.61 and 0.78, respectively, based on the naive and standardization approaches. As per our working definition of consistency of treatment effect, the treatment effect is not consistent across all subgroups since the treatment efficacy is significantly different by laterality.
Table 6.
Results of acute otitis media (AOM) example. Interaction effects of antibiotics by age and by laterality
| Subgroup | Naive: OR (95% CI) | Naive: γ̂ | Naive: interaction p-value | Standardized: OR (95% CI) | Standardized: γ̂ | Standardized: interaction p-value |
|---|---|---|---|---|---|---|
| Age | | | | | | |
| <2 years | 0.56 (0.39, 0.81) | | | 0.66 (0.45, 0.98) | | |
| ≥2 years | 0.65 (0.45, 0.93) | 0.14 | 0.61 | 0.61 (0.42, 0.90) | −0.08 | 0.78 |
| Laterality of AOM | | | | | | |
| Unilateral | 0.81 (0.58, 1.12) | | | 0.83 (0.60, 1.16) | | |
| Bilateral | 0.39 (0.26, 0.59) | −0.73 | 0.0065 | 0.41 (0.26, 0.64) | −0.71 | 0.014 |
From the simulation studies, we learnt that when only one subgrouping variable (e.g., laterality) interacts with treatment and the other (e.g., age) does not, the estimated laterality-specific treatment effects should be similar between the naive approach and the standardization approach, but the estimated age-specific treatment effects can be overestimated if the correlation between age and laterality is positive, whereas they can be underestimated with a negative correlation. Indeed, we observe this phenomenon in this case example. That is, the estimated treatment effects and their 95% confidence intervals expressed as odds ratios are similar for the two methods in the unilateral subgroup, 0.81 (0.58, 1.12) with the naive approach and 0.83 (0.60, 1.16) with the standardization approach, and also in the bilateral subgroup, 0.39 (0.26, 0.59) with the naive approach and 0.41 (0.26, 0.64) with the standardization approach. However, the naive and standardized estimates differ in the age subgroups. In children less than 2 years old, the estimated odds ratios and their 95% confidence intervals are 0.56 (0.39, 0.81) with the naive approach and 0.66 (0.45, 0.98) with the standardization approach, so the naive odds ratio is farther from 1. In children at least 2 years old, they are 0.65 (0.45, 0.93) with the naive approach and 0.61 (0.42, 0.90) with the standardization approach, so the naive odds ratio is closer to 1.
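The interaction estimates in Table 6 are on the log odds ratio scale, so they can be approximated from the reported subgroup odds ratios, and the working definition of consistency can be applied with K = 2 subgrouping variables. The sketch below does both; small discrepancies from the tabulated γ̂ values are expected because the odds ratios are rounded. The dictionary layout and names are ours.

```python
import math

K = 2                      # preestablished subgrouping variables: age, laterality
alpha = 0.05 / K           # significance level from the working definition

# (odds ratio at reference level, odds ratio at other level, interaction p-value)
table6 = {
    ("age", "naive"): (0.56, 0.65, 0.61),
    ("age", "standardized"): (0.66, 0.61, 0.78),
    ("laterality", "naive"): (0.81, 0.39, 0.0065),
    ("laterality", "standardized"): (0.83, 0.41, 0.014),
}

# Interaction on the logit scale: difference of subgroup log odds ratios.
gamma_hat = {key: math.log(or_1 / or_0) for key, (or_0, or_1, _) in table6.items()}

# A subgrouping variable shows inconsistency if its interaction p-value < alpha.
inconsistent = {key: p < alpha for key, (_, _, p) in table6.items()}

for key in table6:
    print(key, round(gamma_hat[key], 2), inconsistent[key])
```

With α = 0.025, the laterality interaction is flagged under both approaches and the age interaction under neither, in line with the conclusion drawn from Table 6.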
6. DISCUSSION
Treatment effects in subgroups defined by baseline covariates and the interaction effects (i.e., difference in treatment effect between levels of a baseline covariate) can be confounded even in a randomized controlled trial due to the correlation between covariates. Randomization guarantees that the distribution of covariates will be the same in the two arms of a reasonably large trial, but it has no impact on the correlation between the baseline covariates in the sample. It is important to account for this correlation when interpreting the differences in treatment effect across subgroups. We have shown that standardization provides a proper (unconfounded) comparison of treatment effects among strata of a baseline subgrouping variable, when the distribution of other important baseline variables is different in those strata. We have demonstrated using simulations of linear regression models that standardization can remove the bias in the estimation of subgroup-specific treatment effects, as well as in the estimation of the interaction effect between a subgrouping variable and treatment. However, standardization increases the variance of the subgroup-specific treatment effects. Thus, the bias–variance trade-off between the naive approach and the standardization approach might depend on the specific problem setting. In general, standardization should be preferred. However, in smaller efficacy trials, the choice of the method would depend upon the relative importance placed on bias elimination versus variance reduction.
Standardization provides the correct type I error under the null hypothesis of no interaction, whereas the naive analysis can inflate the type I error. These inferential aspects were also reinforced by a real example involving the RCTs evaluating the efficacy of antibiotics for acute otitis media in children (section 5). In this example, the two binary subgrouping variables, age less than 2 years (yes/no) and unilateral otitis media (yes/no), were negatively correlated (ρ̂ = −0.29). Although standardization did not alter the final conclusions regarding subgroup differences, the example did illustrate the quantitative impact of standardization.
Calculation of standardization weights is straightforward. For a binary subgrouping variable, logistic regression can be used to calculate the weights; for a categorical variable with more than two levels, multinomial regression can be used. Standardization can also be applied to continuous baseline covariates such as age or serum albumin levels. We did not consider continuous subgrouping covariates here because they fall outside the category of subgroup analysis. Standardization is applicable to all types of outcome regression models: linear, binary, count, time-to-event, or other types. Standardization has a negligible impact when the subgrouping variables are weakly correlated (e.g., correlation coefficient of magnitude less than 0.2), or when the magnitude of interaction is very small.
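The weight calculation for a binary subgrouping variable can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes weights of the inverse-probability form P(X = x) / P(X = x | Z = z), as is common for marginal structural models; the weight definition in the article's methods section takes precedence. The function names, the small Newton-Raphson logistic fit, and the noise-free toy data set are all hypothetical and introduced only for this example.

```python
import numpy as np


def fit_logistic(design, x, n_iter=25):
    """Newton-Raphson fit of P(x = 1 | design); design includes an intercept column."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (x - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def standardization_weights(x, covariates):
    """Assumed weight form: P(X = x_i) / P(X = x_i | Z_i) for each patient i."""
    design = np.column_stack([np.ones(len(x)), covariates])
    beta = fit_logistic(design, x)
    p1 = 1.0 / (1.0 + np.exp(-design @ beta))          # fitted P(X = 1 | Z)
    p_x_given_z = np.where(x == 1, p1, 1.0 - p1)
    p_x = np.where(x == 1, x.mean(), 1.0 - x.mean())   # marginal P(X = x_i)
    return p_x / p_x_given_z


def subgroup_effect(y, t, x, level, w=None):
    """(Weighted) difference in mean outcome, treated minus control, within X == level."""
    if w is None:
        w = np.ones(len(y))
    treated = (x == level) & (t == 1)
    control = (x == level) & (t == 0)
    return (np.average(y[treated], weights=w[treated])
            - np.average(y[control], weights=w[control]))


def make_cell(x_value, z_value, n_per_arm):
    """Rows of (x, z, t) for one covariate cell with equal-sized arms."""
    t = np.repeat([0, 1], n_per_arm)
    n = 2 * n_per_arm
    return np.column_stack([np.full(n, x_value), np.full(n, z_value), t])


# Hypothetical toy trial: X and Z are correlated, and the treatment effect
# depends on Z only (2 when Z = 0, 1 when Z = 1), never on X.
cells = np.vstack([
    make_cell(1, 1, 70), make_cell(1, 0, 30),
    make_cell(0, 1, 30), make_cell(0, 0, 70),
])
x, z, t = cells.T
y = t * (2.0 - z)  # noise-free outcome, chosen so the arithmetic is exact

w = standardization_weights(x, z)
naive_x1 = subgroup_effect(y, t, x, 1)
naive_x0 = subgroup_effect(y, t, x, 0)
std_x1 = subgroup_effect(y, t, x, 1, w)
std_x0 = subgroup_effect(y, t, x, 0, w)

print("naive effects (X=1, X=0):       ", round(naive_x1, 3), round(naive_x0, 3))
print("standardized effects (X=1, X=0):", round(std_x1, 3), round(std_x0, 3))
```

In this toy data set the naive subgroup effects differ (about 1.3 versus 1.7) purely because Z is distributed differently in the two levels of X, while the weighted effects agree (about 1.5 in both). In practice the logistic fit would typically come from a standard statistical package, and the weighted subgroup effects would be estimated from a weighted outcome regression with robust standard errors.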
Which reference distribution of covariates should be used to standardize? The distribution of baseline subgrouping variables in the entire trial sample is a natural and obvious choice. However, it is also possible to use a different reference distribution for standardization, for example, the covariate distribution in a particular target population.
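One way to express the choice of reference distribution, again in notation assumed here rather than taken from the article, is to write the weight for a patient with X = x and covariate value Z = z as the ratio of the reference distribution P* to the covariate distribution observed within that subgroup:

\[
w(x, z) = \frac{P^{*}(Z = z)}{P(Z = z \mid X = x)}.
\]

When P* is the covariate distribution in the entire trial sample, Bayes' rule gives the equivalent form

\[
w(x, z) = \frac{P(Z = z)}{P(Z = z \mid X = x)} = \frac{P(X = x)}{P(X = x \mid Z = z)},
\]

in which the denominator is the quantity a logistic or multinomial regression of X on Z would estimate. Substituting the covariate distribution of an external target population for P* standardizes the subgroup effects to that population instead.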
VanderWeele and Knol (2011) discussed when confounding is an issue in subgroup analyses of randomized controlled trials. They argued that control for confounding is not needed if the goal is to target subpopulations for intervention; however, if the goal is to intervene on factors defined by subgroups to increase the treatment effect, then control for confounding is required. We believe that this distinction is not necessary and that control for confounding is useful for both targeting and intervening. Consider the example of correlation between sex and age in a randomized trial where men are on average younger than women. Suppose the truth is that the efficacy of the treatment wanes with age. We might incorrectly conclude that the treatment is less efficacious in women, and possibly withhold treatment from them, even the younger ones! Therefore, control for confounding is potentially useful regardless of the goal. In conclusion, we recommend that RCTs report the standardized subgroup-specific treatment effects and the corresponding forest plots when the established clinically important baseline covariates are strongly correlated.
Acknowledgments
FUNDING
The analyses upon which this publication is based were performed under contract HHSF2232010000072C, entitled “Partnership in Applied Comparative Effectiveness Science,” sponsored by the Food and Drug Administration, Department of Health and Human Services. This research was also supported by the RSR grant 08-48 awarded by the Center for Drug Evaluation and Research, US Food and Drug Administration. Ravi Varadhan is a Brookdale Leadership in Aging Fellow at the Johns Hopkins University.
REFERENCES
- Food and Drug Administration. Providing clinical evidence of effectiveness of human drug and biological products. Guidance for Industry. Washington, DC: US Department of Health and Human Services; 1998.
- Groenwold RHH, Donders ART, van der Heijden GJMG, Hoes AW, Rovers MM. Confounding of subgroup analyses in randomized data. Archives of Internal Medicine. 2009;169:1532–1533. doi: 10.1001/archinternmed.2009.250.
- Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560. doi: 10.1097/00001648-200009000-00011.
- Rovers MM, Glasziou P, Appelman CL, Burke P, McCormick DP, Damoiseaux RA, et al. Antibiotics for acute otitis media: A meta-analysis with individual patient data. Lancet. 2006;368:1429–1435. doi: 10.1016/S0140-6736(06)69606-2.
- Sato T, Matsuyama Y. Marginal structural models as a tool for standardization. Epidemiology. 2003;14:680–686. doi: 10.1097/01.EDE.0000081989.82616.7d.
- VanderWeele TJ, Knol MJ. Interpretation of subgroup analyses in randomized trials: Heterogeneity versus secondary interventions. Annals of Internal Medicine. 2011;154:680–683. doi: 10.7326/0003-4819-154-10-201105170-00008.
