Author manuscript; available in PMC: 2021 Mar 18.
Published in final edited form as: Stat Biosci. 2019 Oct 23;12(3):399–415. doi: 10.1007/s12561-019-09259-x

A simulation study of statistical approaches to data analysis in the stepped wedge design

Yuqi Ren 1, James P Hughes 2, Patrick J Heagerty 3
PMCID: PMC7971509  NIHMSID: NIHMS1679390  PMID: 33747241

Abstract

This paper studies model-based and design-based approaches to the analysis of data arising from a stepped wedge randomized design. Specifically, for a range of scenarios we compare robustness, efficiency, Type I error rate under the null hypothesis, and power under the alternative hypothesis for the leading analytical options, including generalized estimating equation (GEE) and linear mixed model (LMM) based approaches. We find that GEE models with an exchangeable correlation structure are more efficient than GEE models with an independent correlation structure under all scenarios considered. The model-based GEE Type I error rate can be inflated when the number of clusters is small, but this problem can be solved using a design-based approach. As expected, correct model specification is more important for LMM than for GEE, since the model is assumed correct when standard errors are calculated. However, in contrast to the model-based results, design-based tests for LMM models under scenarios with a random treatment effect show Type I error inflation even though the fitted models perfectly match the corresponding data-generating scenarios. Therefore, greater robustness can be realized by combining GEE and permutation testing strategies.

Keywords: Stepped wedge design, GEE, LMM, Permutation test, Simulation

1. Introduction

A stepped wedge cluster randomized trial design is a type of one-way crossover design in which each cluster starts under a reference or control condition and then crosses over to a treatment condition at a randomly determined time point (Hussey and Hughes, 2007). Eventually, all clusters receive treatment during the final study time period. The unique control-to-treatment crossover patterns are referred to as “sequences” (e.g. the stepped wedge design in Figure 1 has 4 sequences). In contrast, in a parallel cluster randomized design, half of the clusters are (usually) randomly assigned to the intervention and half to the control at the beginning of the trial, with no planned crossover. A stepped wedge design also differs from a cluster randomized crossover design, in which each cluster is randomly assigned to cross over from control to treatment or treatment to control (possibly more than once). In both crossover and stepped wedge trials, a washout period may be included between intervention and control periods to ensure that one condition does not affect the other, or to allow individuals enrolled under one condition to complete their intervention before their cluster changes conditions. Figure 1 illustrates the settings for a traditional crossover design, a parallel design and a stepped wedge design (Hussey and Hughes, 2007).

Figure 1. Settings for traditional crossover and stepped wedge designs. “X” represents a treatment; “O” represents control

Stepped wedge cluster randomized trials have become increasingly popular in recent years for a number of reasons (Mdege et al., 2011). For example, in the field of HIV prevention and treatment, as governments and public health agencies have begun to focus on effective implementation of proven interventions, stepped wedge designed studies are often used during program roll-out to assess real-world effectiveness. In Killiam et al. (2010), for instance, a stepped wedge design was used to evaluate whether integrating antiretroviral therapy (ART) into antenatal care clinics increased the proportion of HIV-infected pregnant women initiating ART during pregnancy, compared to the standard approach of referral for ART.

Another reason to consider a stepped wedge design is that it may be logistically or financially impossible to provide the intervention to all participants at once due to resource limitations or geographical constraints (Brown and Lilford, 2006; Mdege et al., 2011; Ji et al., 2017). In this case, the stepped wedge design is feasible because only a small fraction of the clusters are required to initiate the intervention at each time point. Also, the stepped wedge design is useful when it is not ethical or practical to withhold or withdraw treatment, but logistical constraints prevent immediate provision of the intervention, since all participants are able to receive the intervention eventually (Rhoda et al., 2011). For example, in the field of sexually transmitted infections (STI) prevention, Golden et al. (2015) used a stepped wedge design to assess the impact of an intervention to reduce STI burden as the program was implemented across Washington state. Finally, the longitudinal nature of the stepped wedge design allows one to study changes in the effectiveness of the intervention over time by modeling the effects of time (Woertman et al., 2013, Hughes et al., 2015).

Despite the increased adoption of stepped wedge designs there are a number of important analytical issues that need additional careful study in order to provide practical recommendations. Key issues include the impact of a small number of clusters, and robustness to model assumptions such as additional sources of variation due to heterogeneity in time effects or in treatment effects. We are interested in the performance of marginal and random effect models for evaluating the treatment effect in the stepped wedge design from both a model-based (inference based on distributional assumptions) and design-based (inference based on reference to the permutation distribution implied by the study design) perspective (see section 2.3 for further discussion). Ji et al. (2017) found that model-based inference on the treatment effect in stepped wedge designs using linear mixed models (LMM) is sensitive to model mis-specification, such as failing to account for cluster-by-time interactions in the data. Therefore, there is a real practical risk that simple model-based inference may provide inaccurate standard errors and invalid Type I error rates. Ji et al. (2017) also considered permutation tests and found that the permutation test provided tight control of Type I error rates under the scenarios they investigated. Thompson et al. (2018) compared cluster-level parametric and non-parametric within-period estimates of treatment effects to the standard mixed-effect model-based inference. Ultimately the parametric within-period model was not recommended due to its below nominal coverage levels under some scenarios. The non-parametric within-period estimator was less efficient than the mixed effect model approach when period effects were common to all clusters or the number of clusters varied. Furthermore, the estimate of the treatment effect from cluster-level methods was consistently larger than that from the mixed-effect model. 
Therefore, important gaps exist in terms of what conditions are required for the validity and efficiency of common alternative analysis methods.

In the current study, we conduct both model-based (asymptotic) and design-based analyses at the individual level using linear mixed models (LMM) and generalized estimating equations (GEE) under both null and alternative conditions for a variety of data-generating scenarios. We include scenarios with random treatment effects, where the impact of the treatment depends on the specific cluster to which it is applied, a situation that was not investigated by Ji et al. (2017). We also consider a varying number of clusters. We specifically seek to characterize treatment effect estimation bias, standard error accuracy, and Type I error rates under null conditions, and power under alternative conditions, for the different analysis strategies. In Section 2, we describe the simulation scenarios and models we use for our study. In Section 3, we present results comparing efficiency, robustness and power among the analysis methods. In Section 4, we summarize our findings and discuss future steps.

2. Methods

2.1. Data Generating Model

We generated normally distributed data with an identity link corresponding to a balanced, complete, cross-sectional stepped wedge design with 5 time points (T = 5) and either twenty (I = 20) or forty (I = 40) clusters. The design structure is shown in the third panel of Figure 1. One hundred observations (n = 100) were generated for each cluster at each time point for a total sample size of N = 10,000 (I = 20) or 20,000 (I = 40). We envision common public health studies or clinical delivery investigations within health care systems where a moderate number of clusters (e.g. villages, hospitals) is available but relatively large populations are under study. Let Yijt be the response for individual j in cluster i at time t (i = 1, …, I; j = 1, …, 100; t = 1, …, 5). We generate data from the model

Yijt = μ + ai + βt + Xit(θ + ci) + eijt    (1)

where μ is the overall mean, ai is a random effect for cluster i where ai ~ N(0, τ2), βt is the (categorical) fixed effect of time point t, Xit is the treatment indicator (0 = control; 1 = treatment) for cluster i at time t, θ is the fixed or average treatment effect, ci is a random cluster-specific treatment effect where ci ~ N(0, ν2), and eijt is a random error where eijt ~ N(0, σ2). We assume that Corr(ai, ci) = ρ (possibly 0) and eijt is independent of ai and ci.
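The data-generating process above can be sketched in code. The fragment below is our own illustration in Python/NumPy (the study itself was run in R); the function name and its defaults are ours, with the variance components of Section 2.2 supplied as default values:

```python
import numpy as np

def simulate_sw(I=20, T=5, n=100, mu=0.0, beta=(0.0, 0.2, 0.3, 0.4, 0.5),
                theta=0.0, tau2=4.0, nu2=0.0, rho=0.0, sigma2=1.0, seed=1):
    """Draw one realization of model (1) for a balanced, complete,
    cross-sectional stepped wedge design.

    Clusters are split evenly over the T - 1 = 4 sequences; sequence s
    crosses over to treatment at time s + 1, so every cluster starts under
    control and all clusters are treated in the final period.
    """
    rng = np.random.default_rng(seed)
    # jointly draw the random cluster effect a_i and treatment effect c_i
    cov = np.array([[tau2, rho * np.sqrt(tau2 * nu2)],
                    [rho * np.sqrt(tau2 * nu2), nu2]])
    a, c = rng.multivariate_normal([0.0, 0.0], cov, size=I).T
    seq = np.repeat(np.arange(T - 1), I // (T - 1))  # sequence of each cluster
    rows = []
    for i in range(I):
        for t in range(T):
            x = 1.0 if t > seq[i] else 0.0           # treatment indicator X_it
            e = rng.normal(0.0, np.sqrt(sigma2), size=n)
            y = mu + a[i] + beta[t] + x * (theta + c[i]) + e
            rows.append((i, t, x, y))                # n outcomes per cluster-period
    return rows
```

For example, `simulate_sw(I=20, nu2=4.0, rho=0.3)` draws a null-condition dataset corresponding to scenario S3 below.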

2.2. Simulation scenarios

Table 1 shows the nine data-generating scenarios used for our simulation studies. We investigated scenarios with different numbers of clusters (20 vs. 40) under the null condition to understand the effect of the number of clusters on the Type I error rate (Leyrat et al., 2018). All scenarios contain a fixed treatment effect (θ = 0 under the null condition; under alternative conditions the value of θ varied by scenario to achieve power between 10-90%), a time effect (β1 = 0, β2 = 0.2, β3 = 0.3, β4 = 0.4, β5 = 0.5 for t = 1, …, 5 under all conditions) and a random cluster effect (τ2 = 4). The error variance (σ2) is equal to 1 in all simulations. For these variance components the intraclass correlation coefficient (ICC), defined as τ2/(σ2 + τ2), is equal to 0.8. A random treatment effect (ν2 = 4) is also included in some scenarios. When a random treatment effect is included, we allow it to be uncorrelated (corr = 0) or correlated (corr = 0.3) with the random cluster effect.
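As a quick check of the ICC implied by these variance components:

```python
# ICC implied by the simulation variance components (tau^2 = 4, sigma^2 = 1)
tau2, sigma2 = 4.0, 1.0
icc = tau2 / (sigma2 + tau2)
print(icc)  # 0.8
```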

Table 1.

Scenarios for simulation. All scenarios represent variations of the mixed effects model shown in equation (1) and include a fixed time and treatment effect. Checks indicate key characteristics of each model

Data Generating   Hypothesis   No.        Random cluster   Random treatment   Correlation between cluster
Scenarios                      Clusters   effect (ai)      effect (ci)        and treatment effect
S1                Null         20         ✓                                   NA
S2                Null         20         ✓                ✓                  0
S3                Null         20         ✓                ✓                  0.3
S4                Null         40         ✓                                   NA
S5                Null         40         ✓                ✓                  0
S6                Null         40         ✓                ✓                  0.3
S7                Alt          40         ✓                                   NA
S8                Alt          40         ✓                ✓                  0
S9                Alt          40         ✓                ✓                  0.3

Note: Null: null condition (θ = 0)

Alt: alternative condition (θ > 0; the value varies by scenario and analysis, see Tables 3-8)

NA = not applicable

We generate 500 realizations under each scenario, allowing Type I error rate estimates to be accurate to within ±0.02 due to Monte Carlo variation.
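The ±0.02 figure follows from the binomial standard error of a rejection proportion estimated from 500 independent replicates:

```python
import math

reps = 500
p = 0.05                               # nominal Type I error rate
se = math.sqrt(p * (1 - p) / reps)     # binomial standard error of the estimate
half_width = 1.96 * se                 # 95% Monte Carlo margin of error
print(round(half_width, 3))            # 0.019, i.e. about +/- 0.02
```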

2.3. Approaches to analysis

We fit each simulated dataset in R using two GEE models (R package gee) and four LMM models (R package lme4), applying both standard model-based inference and design-based inference. All models were fit to individual-level data. All inferences using GEE are based on robust (sandwich) variances (Diggle et al., 2002). We estimate bias, variance, and Type I error rate under the null hypothesis, and power under alternative hypotheses. Power was only investigated in the 40-cluster cases to focus on scenarios where the Type I error rates were (generally) close to nominal levels.

2.3.1. GEE approaches

We investigate models with independent (G1) and exchangeable (G2) working correlation structures. The exchangeable working correlation structure, in which the correlation between observations within a cluster is assumed constant (Leyrat et al., 2017), is often chosen in the analysis of stepped wedge trials since it captures a common source of correlation. Nonetheless, GEE is asymptotically robust to mis-specification of the working correlation structure because it uses robust (sandwich) variance estimates, which are valid provided there is a large number of clusters (Diggle et al., 2002).
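The robustness of the sandwich variance can be made concrete in the linear case: with an identity link and an independence working correlation, the GEE point estimate is ordinary least squares, and the cluster-robust sandwich variance is assembled from per-cluster score contributions. The fragment below is a sketch of that estimator (our own illustration, not the internals of the gee package):

```python
import numpy as np

def ols_cluster_robust(X, y, cluster):
    """OLS point estimates (GEE with an independence working correlation for
    a linear model) with the cluster-robust sandwich variance estimator."""
    XtX_inv = np.linalg.inv(X.T @ X)           # "bread" of the sandwich
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        Xg, rg = X[cluster == g], resid[cluster == g]
        s = Xg.T @ rg                          # score contribution of cluster g
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv               # sandwich variance of beta
    return beta, V
```

For clustered data the robust variance of the intercept is much larger than the naive iid variance, which is precisely the correction the sandwich provides.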

We conduct both model-based and design-based tests using GEE. For the model-based tests we compare the robust z-score (estimate divided by its robust standard error) of the intervention effect from the GEE analysis to the standard normal distribution. GEE tends to inflate type I error rates when the number of clusters is small (Sharples and Breslow, 1992), so we expect that performance may be better for 40 clusters compared to 20 clusters.

For the design-based analyses we permute the stepped wedge sequences among clusters and investigate the use of both the estimated intervention effect and the robust z-score as a test statistic. We reject the null when the test statistic from the observed dataset is smaller than the 2.5th percentile or larger than the 97.5th percentile of the permutation distribution.
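The design-based procedure can be sketched generically. In the fragment below (our illustration; the statistic `mean_diff`, a simple treated-minus-control contrast of cluster-period means, stands in for the GEE estimate or robust z-score), sequence labels are permuted among clusters and the observed statistic is referred to the 2.5th and 97.5th percentiles of the permutation distribution:

```python
import numpy as np

def permutation_test(ybar, seqs, stat, n_perm=1000, seed=0):
    """Design-based two-sided test at the 5% level.

    ybar : (I, T) array of cluster-period means
    seqs : length-I array of sequence labels (crossover time per cluster)
    stat : function (ybar, seqs) -> scalar test statistic
    """
    rng = np.random.default_rng(seed)
    seqs = np.asarray(seqs)
    observed = stat(ybar, seqs)
    # re-randomize the sequence assignment among clusters
    perm = np.array([stat(ybar, rng.permutation(seqs)) for _ in range(n_perm)])
    lo, hi = np.percentile(perm, [2.5, 97.5])
    return bool(observed < lo or observed > hi)

def mean_diff(ybar, seqs):
    # treated minus control mean over cluster-periods (illustrative statistic)
    T = ybar.shape[1]
    x = (np.arange(T)[None, :] > seqs[:, None]).astype(float)
    return ybar[x == 1].mean() - ybar[x == 0].mean()
```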

2.3.2. LMM approaches

Table 2 shows four LMM models fit to simulation scenarios S1 to S9. Tests are again conducted using both model-based and design-based tests. We expect that LMM will be less robust to mis-specification of the variance structure than GEE since standard errors are computed under the assumed covariance for the outcomes. If the random effect model structure is mis-specified, then the model-based variance in LMM will be invalid and an inflated Type I error rate may result.

Table 2.

Characteristics of linear mixed models (LMM) fit to scenarios S1-S9, checks indicate key characteristics of each LMM model

LMM Models   Random cluster effect   Random treatment effect   Cluster effect correlated with treatment effect
L1
L2           ✓
L3           ✓                       ✓                         ✓
L4           ✓                       ✓

For the design-based tests, similar to the approach outlined for GEE, we investigate both unstandardized and standardized intervention effect estimates as test statistics and reject the null when the observed test statistic is smaller than the 2.5th percentile or larger than the 97.5th percentile of the permutation distribution.

3. Results

3.1. GEE Asymptotic Inference

In the scenarios that we studied, the GEE estimators with different correlation structures (G1 and G2) both give unbiased estimates of the treatment effect (Table 3). GEE with an exchangeable correlation structure (G2) leads to a smaller empirical variance compared to G1, indicating higher efficiency due to the fact that the exchangeable structure corresponds more closely to the true correlation structure and therefore provides an optimal weighted estimator. The efficiency advantage remained true even in scenarios with a random treatment effect (e.g. S2, S3, S5, S6), which does not correspond to a simple exchangeable correlation structure. For 20 clusters, the type I error rate is inflated to approximately 0.10. As we change the number of clusters from 20 to 40, the estimated sandwich variance more closely approximates the true sampling variance, so the Type I error rate approaches 0.05. The presence of correlation between the random cluster and random treatment effects (e.g. S2 vs S3 or S5 vs S6) does not meaningfully affect the results.

Table 3.

GEE results based on 500 simulations showing model-based results for θ, the treatment effect. Results are presented as: Mean; Sd across simulations (standard error of the parameter value); Pr(reject H0)

Fitting models

Scenario    Hypothesis  No. clusters  C  R  Corr  GEE independent (G1)  GEE exchangeable (G2)
S1          Null        20            ✓     NA    0.00; 1.05; 0.12      −0.01; 0.47; 0.09
S2          Null        20            ✓  ✓  0     0.00; 1.05; 0.12      −0.01; 0.47; 0.09
S3          Null        20            ✓  ✓  0.3   0.001; 1.14; 0.12     −0.01; 0.45; 0.09
S4          Null        40            ✓     NA    0.002; 0.57; 0.07     0.000; 0.02; 0.05
S5          Null        40            ✓  ✓  0     0.04; 0.68; 0.07      −0.003; 0.32; 0.06
S6          Null        40            ✓  ✓  0.3   0.04; 0.77; 0.07      −0.003; 0.30; 0.06
S7 (0.08)   Alt         40            ✓     NA    0.08; 0.57; 0.08      0.08; 0.02; 0.89
S8 (1)      Alt         40            ✓  ✓  0     1.04; 0.68; 0.33      1.00; 0.32; 0.87
S9 (1)      Alt         40            ✓  ✓  0.3   1.04; 0.77; 0.31      1.00; 0.30; 0.90

Notes: Null: null condition (Ho: θ = 0)

Alt: alternative condition (Ha: θ = 0.08 or 1)

R: random treatment effect (ν2 = 4)

C: random cluster effect (τ2 = 4)

Corr = NA: the correlation between random cluster and treatment effects is not applicable

Corr = 0.3: the correlation between random cluster and treatment effects is 0.3

Under alternative conditions (S7-S9), we choose an effect size of 0.08 for S7 and an effect size of 1.00 for S8 & S9 to investigate power. The GEE model with an exchangeable correlation structure (G2) is much more efficient than the GEE model with independent correlation structure (G1) due to the large ICC (see section 2.2) in these data.

3.2. LMM Asymptotic Inference

Table 4 shows results from fitting LMM models L1 – L4 to the nine scenarios. All models give unbiased treatment effect estimates under all scenarios, and the Type I error rates for the models with a random treatment effect (L3, L4) are all close to the nominal level of 0.05. Not surprisingly, the Type I error rate is substantially inflated for the model that assumes independent data (L1) under all scenarios. For the analysis model without a random treatment effect (L2), the Type I error rates are also far above the nominal level under simulation scenarios that include a random treatment effect (e.g. S2, S3, S5, S6). Interestingly, whether the random cluster effect and random treatment effect are modeled as correlated or not does not have a meaningful effect on the Type I error rates (compare L3 vs L4). In addition, the cross-simulation variance of the intervention effect estimate is not noticeably different between models L2 – L4, suggesting that it is preferable to over-fit than to under-fit a model. Based on the results in Tables 3 and 4, our simulations validate theoretical predictions that correct model specification is more important for LMM than for GEE.

Table 4.

LMM results based on 500 simulations showing model-based results for θ, the treatment effect. Results are presented as: Mean; Sd across simulations (standard error of the parameter value); Pr(reject Ho).

Fitting models

Scenario    Hypothesis  No. clusters  C  R  Corr  Indep. (L1)         No random treatment (L2)  Random treatment and cluster correlated (L3)  Random treatment and cluster not correlated (L4)
S1          Null        20            ✓     NA    −0.01; 0.84; 0.86   0.001; 0.04; 0.04         0.001; 0.04; 0.04                             0.001; 0.04; 0.03
S2          Null        20            ✓  ✓  0     0.00; 1.05; 0.88    −0.01; 0.47; 0.85         −0.006; 0.46; 0.06                            −0.006; 0.46; 0.06
S3          Null        20            ✓  ✓  0.3   0.001; 1.15; 0.89   −0.01; 0.45; 0.85         −0.006; 0.44; 0.06                            −0.006; 0.44; 0.06
S4          Null        40            ✓     NA    0.002; 0.57; 0.89   0.000; 0.02; 0.04         0.000; 0.02; 0.04                             0.000; 0.02; 0.04
S5          Null        40            ✓  ✓  0     0.04; 0.68; 0.88    −0.003; 0.32; 0.84        −0.005; 0.31; 0.05                            −0.005; 0.31; 0.05
S6          Null        40            ✓  ✓  0.3   0.04; 0.77; 0.90    −0.003; 0.30; 0.83        −0.004; 0.30; 0.05                            −0.004; 0.30; 0.05
S7 (0.08)   Alt         40            ✓     NA    0.08; 0.57; 0.88    0.08; 0.02; 0.88          0.08; 0.02; 0.88                              0.08; 0.02; 0.88
S8 (0.8)    Alt         40            ✓  ✓  0     0.84; 0.68; 0.94    0.80; 0.32; 0.99          0.80; 0.31; 0.71                              0.80; 0.31; 0.71
S9 (0.8)    Alt         40            ✓  ✓  0.3   0.84; 0.77; 0.93    0.80; 0.30; 0.99          0.80; 0.30; 0.75                              0.80; 0.30; 0.75

Notes: Null: null condition (Ho: θ = 0)

Alt: alternative condition (Ha: θ = 0.08 or 0.8)

20 vs. 40: 20 clusters vs. 40 clusters

R: random treatment effect (ν2 = 4)

C: random cluster effect (τ2 = 4)

Corr = NA: the correlation between random cluster and treatment effects is not applicable

Corr = 0.3: the correlation between random cluster and treatment effects is 0.3

Under alternative conditions, we choose an effect size of 0.08 for S7 and an effect size of 0.80 for S8 & S9 to evaluate power. The power for L3 and L4 is similar; therefore, there is not much difference in their efficiency.

3.3. GEE Permutation Test

In addition to the GEE model-based analysis shown in Table 3, we also conducted GEE design-based analyses. We provide results based on both the permutation distribution of the estimated treatment effect parameter (Table 5) as well as the permutation distribution of the robust z-statistic (Table 6).

Table 5.

GEE results based on 500 datasets (1000 permutations / dataset), showing design-based results where the test statistic is the estimated intervention effect, θ. Results are presented as: Pr(reject Ho)

Fitting models

Scenario    Hypothesis  No. clusters  C  R  Corr  GEE independent (G1)  GEE exchangeable (G2)
S1          Null        20            ✓     NA    0.07                  0.04
S2          Null        20            ✓  ✓  0     0.09                  0.18
S3          Null        20            ✓  ✓  0.3   0.07                  0.18
S4          Null        40            ✓     NA    0.04                  0.07
S5          Null        40            ✓  ✓  0     0.06                  0.17
S6          Null        40            ✓  ✓  0.3   0.06                  0.16
S7 (0.21)   Alt         40            ✓     NA    0.07                  0.73
S8 (0.8)    Alt         40            ✓  ✓  0     0.26                  0.84
S9 (0.8)    Alt         40            ✓  ✓  0.3   0.26                  0.84

Notes: Null: null condition (Ho: θ = 0)

Alt: alternative condition (Ha: θ = 0.21 or 0.8)

R: random treatment effect (ν2 = 4)

C: random cluster effect (τ2 = 4)

Corr = NA: the correlation between random cluster and treatment effects is not applicable

Corr = 0.3: the correlation between random cluster and treatment effects is 0.3

Table 6.

GEE results based on 500 datasets (1000 permutations / dataset), showing design-based results where the test statistic is the robust z-score for the intervention effect. Results are presented as: Pr(reject Ho)

Fitting model

Scenario    Hypothesis  No. clusters  C  R  Corr  GEE independent (G1)  GEE exchangeable (G2)
S1          Null        20            ✓     NA    0.05                  0.05
S2          Null        20            ✓  ✓  0     0.07                  0.08
S3          Null        20            ✓  ✓  0.3   0.06                  0.08
S4          Null        40            ✓     NA    0.05                  0.07
S5          Null        40            ✓  ✓  0     0.06                  0.05
S6          Null        40            ✓  ✓  0.3   0.06                  0.06
S7 (0.21)   Alt         40            ✓     NA    0.07                  0.71
S8 (0.8)    Alt         40            ✓  ✓  0     0.21                  0.69
S9 (0.8)    Alt         40            ✓  ✓  0.3   0.21                  0.69

Notes: Null: null condition (Ho: θ = 0)

Alt: alternative condition (Ha: θ = 0.21 or 0.8)

R: random treatment effect (ν2 = 4)

C: random cluster effect (τ2 = 4)

Corr = NA: the correlation between random cluster and treatment effects is not applicable

Corr = 0.3: the correlation between random cluster and treatment effects is 0.3

Table 5 illustrates several interesting findings. In scenarios with no random treatment effect (S1, S4) both G1 and G2 maintain the nominal type I error rate and do not show the type I error inflation with smaller numbers of clusters that was observed in the model-based analysis. Table 5 shows some evidence of a small type I error inflation for G1 under scenarios that include random treatment effects (S2, S3, S5 and S6). However, the type I error rate for the GEE model with exchangeable correlation structure (G2) is significantly inflated under scenarios with a random treatment effect.

Interestingly, when the permutation test is based on the robust z-statistic (Table 6), the type I error rate inflation largely disappears. The Type I error rates for G1 are all close to 0.05. Model G2 now shows only slight type I error inflation under scenarios that include a random treatment effect.

Under the alternative condition, we generated data with θ = 0.21 for S7 and θ = 0.80 for S8 & S9. Similar to the asymptotic results (section 3.1), the model with exchangeable correlation structure (G2) has more power than the model with independent correlation structure (G1), although in table 5 this is partly due to the inflated type I error rate previously noted for G2.

3.4. LMM Permutation Test

We also conducted design-based tests using the permutation distribution of the LMM-based estimated treatment effects and z-statistics (Tables 7 and 8, respectively). We note that the treatment effect permutation distributions (but not the z-statistic permutation distributions) for models L1 and L2 are identical to those for models G1 and G2, respectively, and so the results in these simulations are similar.

Table 7.

LMM results based on 500 datasets (1000 permutations / dataset), showing design-based results where the test statistic is the estimated intervention effect, θ. Results are presented as: Pr(reject Ho)

Fitting model

Scenario    Hypothesis  No. clusters  C  R  Corr  Indep. (L1)  No random treatment (L2)  Random treatment and cluster correlated (L3)  Random treatment and cluster not correlated (L4)
S1          Null        20            ✓     NA    0.07         0.04                      0.04                                          0.04
S2          Null        20            ✓  ✓  0     0.08         0.18                      0.20                                          0.20
S3          Null        20            ✓  ✓  0.3   0.07         0.18                      0.19                                          0.19
S4          Null        40            ✓     NA    0.04         0.07                      0.08                                          0.09
S5          Null        40            ✓  ✓  0     0.06         0.16                      0.19                                          0.19
S6          Null        40            ✓  ✓  0.3   0.06         0.17                      0.18                                          0.18
S7 (0.21)   Alt         40            ✓     NA    0.07         0.73                      0.73                                          0.72
S8 (0.7)    Alt         40            ✓  ✓  0     0.21         0.78                      0.82                                          0.82
S9 (0.7)    Alt         40            ✓  ✓  0.3   0.19         0.78                      0.81                                          0.81

Notes: Null: null condition (Ho: θ = 0)

Alt: alternative condition (Ha: θ = 0.21 or 0.7)

R: random treatment effect (ν2 = 4)

C: random cluster effect (τ2 = 4)

Corr = NA: the correlation between random cluster and treatment effects is not applicable

Corr = 0.3: the correlation between random cluster and treatment effects is 0.3

Table 8.

LMM results based on 500 datasets (1000 permutations / dataset), showing design-based results where the test statistic is the z-score for the intervention effect. Results are presented as: Pr(reject Ho)

Fitting models

Scenario    Hypothesis  No. clusters  C  R  Corr  Indep. (L1)  No random treatment (L2)  Random treatment and cluster correlated (L3)  Random treatment and cluster not correlated (L4)
S1          Null        20            ✓     NA    0.05         0.04                      0.04                                          0.04
S2          Null        20            ✓  ✓  0     0.07         0.18                      0.10                                          0.10
S3          Null        20            ✓  ✓  0.3   0.08         0.18                      0.10                                          0.10
S4          Null        40            ✓     NA    0.05         0.08                      0.07                                          0.07
S5          Null        40            ✓  ✓  0     0.06         0.16                      0.07                                          0.07
S6          Null        40            ✓  ✓  0.3   0.07         0.16                      0.07                                          0.07
S7 (0.21)   Alt         40            ✓     NA    0.08         0.73                      0.73                                          0.73
S8 (0.7)    Alt         40            ✓  ✓  0     0.21         0.77                      0.66                                          0.66
S9 (0.7)    Alt         40            ✓  ✓  0.3   0.19         0.78                      0.66                                          0.66

Notes: Null: null condition (Ho: θ = 0)

Alt: alternative condition (Ha: θ = 0.21 or 0.7)

R: random treatment effect (ν2 = 4)

C: random cluster effect (τ2 = 4)

Corr = NA: the correlation between random cluster and treatment effects is not applicable

Corr = 0.3: the correlation between random cluster and treatment effects is 0.3

Table 7 shows permutation results using the treatment effect coefficients. Interestingly, the model that assumes independence (L1) shows little to no Type I error inflation, while all the non-independence models (L2 – L4) show substantial Type I error inflation under scenarios with random treatment effects. For this reason, power comparisons are difficult, except under scenario S7 (no random treatment effect), where we find that models that include a correlation structure similar to the data-generating mechanism have higher power than the independence model.

In Table 8, we see that design-based tests using z-statistics give results similar to the design-based tests based on coefficients (Table 7) for L1 – L2 over all scenarios. In contrast, the Type I error inflation noted for models L3 and L4 under scenarios with a random treatment effect (S2, S3, S5, S6) in Table 7 is reduced, but not eliminated, by the design-based test using z-statistics (Table 8). In addition, unlike Table 7, there is some suggestion in Table 8 that increasing the number of clusters from 20 to 40 reduces the Type I error inflation for L2 – L4 under scenarios with a random treatment effect. It is notable that, in contrast to the model-based LMM results, design-based tests using LMM models L3 and L4 give inflated Type I error rates under scenarios with a random treatment effect (S2, S3, S5, S6) even though the models perfectly match the corresponding scenarios.

Since L1 – L4 all attain the nominal Type I error rate only under scenarios without a random treatment effect, we compare power only under scenario S7. The power for L2, L3 and L4 is similar under S7, while the power for L1 is much lower.

4. Discussion

We conducted both model-based and design-based analyses of data from stepped wedge study designs to compare robustness, efficiency, and Type I error rate under null conditions and power under alternative conditions among GEE and LMM models for each of nine data-generating scenarios. In general, for model-based analyses correct model specification is more important for LMM than for GEE, and over-specification of LMM models performs better than under-fitting. Specifically, the model-based results show that LMM models with random cluster and treatment effects produce similar levels of bias, efficiency and Type I error rate as the correctly specified LMM model, even if there is no random treatment effect in the data-generating model. In contrast, if a random treatment effect truly exists, the model-based results for an LMM without a random treatment effect show an inflated Type I error rate.

In model-based analyses, the number of clusters has a greater effect on Type I error rates in GEE than in LMM. As we increase the number of clusters from 20 to 40, the model-based GEE simulations provide Type I error rates closer to the nominal level. In contrast, the Type I error rates for model-based LMM simulations are close to nominal levels even for 20 clusters when the analysis model matches the data-generating scenario. Westgate et al. (2013), Leyrat et al. (2017) and Li and Redden (2015) have investigated the effect of various corrections to GEE when the number of clusters is small in the context of parallel-design cluster randomized trials. The application of these methods to stepped wedge trials has not been investigated, although Taljaard et al. (2016) noted some of the risks associated with too few clusters in stepped wedge trials. Additional research on finite sample size corrections, the effect of the number of clusters, and the effect of (possibly variable) cluster size in the context of stepped wedge trials (with both linear and nonlinear links) is needed.

Using permutation tests, the primary quantities of interest are the Type I error rate under null conditions and the power under alternative conditions. Permutation tests do not naturally provide estimates of the treatment effect or confidence intervals (although Hughes et al (2019) describe a design-based procedure for stepped wedge models that gives estimates, confidence intervals and valid tests). Using a permutation procedure, GEE models show similar Type I error rates under all the scenarios investigated when the permutation test is based on robust z-scores. In addition, type I error rates are not sensitive to the number of clusters when the permutation distribution is used for testing. GEE models with an exchangeable working correlation structure show greater power than models with an independence working correlation structure. However, in scenarios with a random treatment effect, permutation tests from a GEE model with exchangeable correlation matrix show significantly inflated type I error rates when based on the estimated treatment effect coefficient but only minor inflation when based on robust z-statistics.

Design-based tests using LMM models produced some surprising findings. When a random treatment effect is included in the data generation, the design-based tests using LMM models show inflated type I error rates even if the underlying model is correctly specified. This is likely due to the fact that the inclusion of a random treatment effect leads to a different covariance matrix for each sequence (Hughes et al, 2019) and thereby violates the assumption of exchangeability under the null hypothesis, which is required for permutation tests (Good, 2005). The magnitude of the effect of this violation on the type I error rate will depend on the relative magnitude of the variance components.
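This can be made concrete from model (1): the covariance between the cluster-level random components ai + Xit·ci at two periods t and t′ is τ2 + ν2·Xit·Xit′ + ρτν(Xit + Xit′), which depends on the cluster's treatment path whenever ν2 > 0. The following small check (our illustration, using the variance components from our simulations) shows that two stepped wedge sequences imply different covariance matrices exactly when a random treatment effect is present:

```python
import numpy as np

def cluster_cov(x, tau2=4.0, nu2=4.0, rho=0.0):
    """Covariance over periods of the cluster-level terms a_i + x_t * c_i
    implied by model (1), for a treatment path x (length T, coded 0/1)."""
    x = np.asarray(x, float)
    cross = rho * np.sqrt(tau2 * nu2)          # Cov(a_i, c_i)
    return (tau2
            + nu2 * np.outer(x, x)             # shared random treatment effect
            + cross * (x[:, None] + x[None, :]))

# two stepped wedge sequences: cross over at time 2 vs. time 4
early = cluster_cov([0, 1, 1, 1, 1])
late = cluster_cov([0, 0, 0, 1, 1])
assert not np.allclose(early, late)            # nu2 > 0: covariances differ
# with nu2 = 0 the sequences coincide and exchangeability holds
assert np.allclose(cluster_cov([0, 1, 1, 1, 1], nu2=0.0),
                   cluster_cov([0, 0, 0, 1, 1], nu2=0.0))
```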

In this study we used large random effect variances relative to the error variance (e.g. an ICC of 0.8). This choice allowed us to identify scenarios that lead to Type I error inflation with a relatively limited number of simulations. However, it also suggests that in applying our results to settings with smaller (relative) random effect variances, researchers should be most concerned about the scenarios where we find moderate to large Type I error inflation. In addition, all the simulations presented here used a linear link and normal errors, and all are based on individual-level analyses of the data. Further research is needed using models with non-linear links (e.g. for binary data) or cluster-level methods, as noted by Thompson et al. (2018).

We have shown areas of strength and weakness of model-based and design-based analyses of stepped wedge designs. We believe these results will help guide practitioners in choosing approaches to the analysis of data from stepped wedge designs.

Acknowledgments

Funding

This research was supported by the National Institute of Allergy and Infectious Diseases grant AI29168 and PCORI contract ME-1507-31750.

Appendix – R, Stata, and SAS code

Here we present basic R, Stata, and SAS code for fitting common models for stepped wedge designs with cross-sectional data collection at each time point. See Hussey and Hughes (2007), Hooper et al. (2016), and Hughes et al. (2015).

I. Linear mixed models

1) Random cluster effect: Yijt = μ + ai + βt + Xitθ + eijt

2) Random cluster and cluster*time effect: Yijt = μ + ai + βt + bit + Xitθ + eijt

3) Random cluster, cluster*time and treatment effect (corr(ai, ci) = 0): Yijt = μ + ai + βt + bit + Xit(θ + ci) + eijt

4) Random cluster, cluster*time and treatment effect (corr(ai, ci) = ρ): Yijt = μ + ai + βt + bit + Xit(θ + ci) + eijt

where

ai ~ N(0, τ2)

bit ~ N(0, γ2)

ci ~ N(0, ν2)

eijt ~ N(0, σ2)
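As a concreteness check, the data-generating process for Model 1 can be sketched in Python; the values of μ, βt, θ, and the variance components below are illustrative, not those used in the simulation study:

```python
import random

rng = random.Random(1)
n_clusters, n_times, n_per = 6, 4, 5   # clusters, periods, subjects per cluster-period
mu, theta = 10.0, 0.5                  # intercept and treatment effect
beta = [0.0, 0.2, 0.4, 0.6]            # fixed period effects beta_t
tau, sigma = 1.0, 1.0                  # sd of a_i and of e_ijt
sequences = [1, 1, 2, 2, 3, 3]         # crossover period of each cluster

rows = []
for i in range(n_clusters):
    a_i = rng.gauss(0, tau)            # random cluster effect a_i ~ N(0, tau^2)
    for t in range(n_times):
        x_it = 1 if t >= sequences[i] else 0   # treatment indicator X_it
        for j in range(n_per):
            e_ijt = rng.gauss(0, sigma)        # residual error
            y = mu + a_i + beta[t] + x_it * theta + e_ijt
            rows.append((i, j, t, x_it, y))

print(len(rows))  # 6 clusters x 4 periods x 5 subjects = 120
```

Models 2-4 extend this loop by adding bit (drawn once per cluster-period) and ci (drawn once per cluster, correlated with ai in Model 4) to the linear predictor.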

R:

library("lme4")

fcluster = factor(cluster)

ftime = factor(time)

ftreat = factor(treat)

clustime = interaction(fcluster,ftime)

# Model 1

lmer(y ~ ftime + ftreat + (1 | fcluster))

# Model 2

lmer(y ~ ftime + ftreat + (1 | fcluster) + (1 | clustime))

# Model 3

lmer(y ~ ftime + ftreat + (1 | fcluster) + (0 + ftreat | fcluster) + (1 | clustime))

# Model 4

lmer(y ~ ftime + ftreat + (ftreat | fcluster) + (1 | clustime))

Stata:

egen clustime = group(cluster time)

* Model 1

mixed y i.time i.treat || cluster:, var reml

* Model 2

mixed y i.time i.treat || cluster: || clustime:, var reml

* Model 3

mixed y i.time i.treat || cluster: treat, cov(ind) || clustime:, var reml

* Model 4

mixed y i.time i.treat || cluster: treat, cov(uns) || clustime:, var reml

SAS:

** Assume throughout that treat is coded as 0 = No, 1 = Yes ;

** Model 1 ;

proc mixed data=mydata;

class time cluster ;

model y = treat time / solution;

random cluster;

run;

** Model 2 ;

proc mixed data=mydata;

class time cluster ;

model y = treat time / solution;

random cluster cluster*time;

run;

** Model 3 ;

proc mixed data=mydata;

class time cluster ;

model y = treat time / solution;

random intercept treat / sub=cluster type=vc ;

random cluster*time ;

run;

** Model 4 ;

proc mixed data=mydata;

class time cluster ;

model y = treat time / solution;

random intercept treat / sub=cluster type=un ;

random cluster*time;

run;

II. Generalized estimating equation models

5) Independent working correlation

6) Exchangeable working correlation

R:

library("gee")

fcluster = factor(cluster)

ftime = factor(time)

ftreat = factor(treat)

# Data are assumed to be sorted so that all observations on a

# cluster are contiguous rows

# Model 5

gee(y ~ ftreat + ftime, id=fcluster, corstr="independence")

# Model 6

gee(y ~ ftreat + ftime, id=fcluster, corstr="exchangeable")

Stata:

xtset cluster

* Model 5

xtgee y i.time i.treat, corr(ind) vce(robust)

* Model 6

xtgee y i.time i.treat, corr(exc) vce(robust)

SAS:

** Assume throughout that treat is coded as 0 = No, 1 = Yes ;

** Model 5 ;

proc genmod data=mydata;

class time cluster ;

model y = treat time ;

repeated subject = cluster / type = ind ;

run;

** Model 6 ;

proc genmod data=mydata;

class time cluster ;

model y = treat time ;

repeated subject = cluster / type = exch ;

run ;

Contributor Information

Yuqi Ren, University of Washington, 4550 11th Ave NE Apt W209, Seattle, WA 98105, USA.

James P. Hughes, University of Washington, H655F, Health Sciences Building, University of Washington, 1705 NE Pacific St, Seattle, WA 98195, USA

Patrick J. Heagerty, University of Washington, Box 357232, 1959 NE Pacific Street, Seattle, WA 98195, USA

References

1. Brown CA and Lilford RJ (2006). The stepped wedge trial design: A systematic review. BMC Med. Res. Methodol. 6: 54.
2. Diggle P, Heagerty P, Liang K, Zeger S (2002). Analysis of Longitudinal Data, 2nd ed. Oxford University Press.
3. Golden MR, Kerani RP, Stenger M, Hughes JP, Aubin M, Malinski C, Holmes KK (2015). Uptake and population-level impact of expedited partner therapy (EPT) on Chlamydia trachomatis and Neisseria gonorrhoeae: The Washington State community-level randomized trial of EPT. PLoS Medicine 12(1): e1001777.
4. Good P (2005). Permutation, Parametric and Bootstrap Tests of Hypotheses, 3rd ed. Springer.
5. Hooper R, Teerenstra S, de Hoop E, Eldridge S (2016). Sample size calculation for stepped wedge and other longitudinal cluster randomized trials. Statistics in Medicine 35: 4718–4728.
6. Hussey MA and Hughes JP (2007). Design and analysis of stepped wedge cluster randomized trials. Contemp. Clin. Trials 28: 182–191.
7. Hughes JP, Granston TS, Heagerty PJ (2015). Current issues in the design and analysis of stepped wedge trials. Contemp. Clin. Trials 45: 55–60.
8. Hughes JP, Heagerty PJ, Xia F, Ren Y (2019). Robust inference in the stepped wedge design. Biometrics, in press.
9. Ji X, Fink G, Robyn PJ, Small DS (2017). Randomization inference for stepped-wedge cluster-randomized trials: An application to community-based health insurance. Ann. Appl. Stat. 11: 1–20.
10. Killam WP, Tambatamba BC, Chintu N, Rouse D, Stringer E, Bweupe M, Yu Y, Stringer JSA (2010). Antiretroviral therapy in antenatal care to increase treatment initiation in HIV-infected pregnant women: a stepped-wedge evaluation. AIDS 24: 85–91.
11. Leyrat C, Morgan KE, Leurent B, Kahan BC (2018). Cluster randomized trials with a small number of clusters: which analyses should be used? International Journal of Epidemiology 47: 321–331.
12. Li P and Redden DT (2015). Comparing denominator degrees of freedom approximations for the generalized linear mixed model in analyzing binary outcome in small sample cluster-randomized trials. BMC Medical Research Methodology 15: 38.
13. Mdege ND, Man MS, Taylor CA and Torgerson DJ (2011). Systematic review of stepped wedge cluster randomized trials shows that design is particularly used to evaluate interventions during routine implementation. J. Clin. Epidemiol. 64: 936–948.
14. Rhoda DA, Murray DM, Andridge RR, Pennell ML and Hade EM (2011). Studies with staggered starts: Multiple baseline designs and group-randomized trials. Am. J. Publ. Health 101: 2164–2169.
15. Sharples K, Breslow N (1992). Regression analysis of correlated binary data: some small sample results for the estimating equation approach. J. Stat. Comput. Simul. 42: 1–20.
16. Taljaard M, Teerenstra S, Ivers NM, Fergusson DA (2016). Substantial risks associated with few clusters in cluster randomized and stepped wedge designs. Clinical Trials 13: 459–463.
17. Thompson JA, Davey C, Fielding K, Hargreaves JR, Hayes RJ (2018). Robust analysis of stepped wedge trials using cluster-level summaries within periods. Statistics in Medicine 37: 2487–2500.
18. Westgate PM (2013). On small-sample inference in group randomized trials with binary outcomes and cluster-level covariates. Biometrical Journal 55: 789–806.
19. Woertman W, de Hoop E, Moerbeek M, Zuidema SU, Gerritsen DL and Teerenstra S (2013). Stepped wedge designs could reduce the required sample size in cluster randomized trials. J. Clin. Epidemiol. 66: 752–758.