Author manuscript; available in PMC: 2023 Oct 3.
Published in final edited form as: J Clin Epidemiol. 2023 Mar 15;157:134–145. doi: 10.1016/j.jclinepi.2023.03.010

A scoping review described diversity in methods of randomization and reporting of baseline balance in stepped-wedge cluster randomized trials

Pascale Nevins a, Kendra Davis-Plourde b,c, Jules Antoine Pereira Macedo d, Yongdong Ouyang a,e, Mary Ryan b, Guangyu Tong b,f, Xueqi Wang b,g, Can Meng c, Luis Ortiz-Reyes a, Fan Li b,f, Agnès Caille d,h, Monica Taljaard a,e,*
PMCID: PMC10546924  NIHMSID: NIHMS1932124  PMID: 36931478

Abstract

Objectives:

In stepped-wedge cluster randomized trials (SW-CRTs), clusters are randomized not to treatment and control arms but to sequences dictating the times of crossing from control to intervention conditions. Randomization is an essential feature of this design but application of standard methods to promote and report on balance at baseline is not straightforward. We aimed to describe current methods of randomization and reporting of balance at baseline in SW-CRTs.

Study Design and Setting:

We used electronic searches to identify primary reports of SW-CRTs published between 2016 and 2022.

Results:

Across 160 identified trials, the median number of clusters randomized was 11 (Q1-Q3: 8–18). Sixty-three (39%) used restricted randomization, most often stratification based on a single cluster-level covariate; 12 (19%) of these adjusted for the covariate(s) in the primary analysis. Overall, 50 (31%) and 134 (84%) reported on balance at baseline on cluster- and individual-level characteristics, respectively. Balance on individual-level characteristics was most often reported by condition in cross-sectional designs and by sequence in cohort designs. Authors reported baseline imbalances in 72 (45%) trials.

Conclusion:

SW-CRTs often randomize a small number of clusters using unrestricted allocation. Investigators need guidance on appropriate methods of randomization and assessment and reporting of balance at baseline.

Keywords: Baseline balance, Covariate adjustment, Stratification, Reporting guidelines, Random allocation, Covariate constrained randomization

1. Introduction

Cluster randomized trials (CRTs), in which intact groups such as medical practices, hospitals, or communities—rather than individuals—are randomized, are commonly used when interventions need to be delivered at the level of the cluster or when there is a substantial risk of contamination within clusters [1]. The most common type of CRT is the parallel arm design, in which clusters are randomly allocated to either intervention or control arms. The stepped-wedge cluster randomized trial (SW-CRT) is a relatively new and increasingly popular design [2], where clusters are instead randomized to a particular “sequence,” which determines the time at which they will transition to the intervention condition [3]. Thus, all clusters typically begin in the control condition but, at regular intervals, one or more clusters cross over to the intervention condition until all clusters receive the intervention (see Fig. 1, modified from Hemming et al. [4]). Observations may be taken on the same participants in regular periods (a “cohort” design) or different participants (a “cross-sectional” design). In closed cohort designs, all participants are identified at baseline and followed until the end of the trial; in open cohort designs, some participants may leave, and new participants are allowed to join the cohort during the trial [5].

Fig. 1.

Diagram of a stepped-wedge cluster randomized trial with 8 clusters, 4 sequences and 5 periods.
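The layout described in the caption can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `stepped_wedge_design` is hypothetical, and it assumes equal allocation of clusters to sequences, an all-control first period, and one sequence crossing over per period. It shows the design layout only and does not perform any randomization.

```python
def stepped_wedge_design(n_clusters, n_sequences):
    """Return {cluster: [0/1 per period]} for a standard stepped-wedge layout.

    Assumes equal allocation and one crossover step per period after an
    all-control first period, so n_periods = n_sequences + 1.
    """
    if n_clusters % n_sequences != 0:
        raise ValueError("this sketch assumes equal allocation")
    per_sequence = n_clusters // n_sequences
    n_periods = n_sequences + 1
    design = {}
    for cluster in range(n_clusters):
        sequence = cluster // per_sequence  # 0-based sequence index
        # Sequence s is in control for periods 0..s and in intervention after.
        design[cluster] = [1 if period > sequence else 0
                           for period in range(n_periods)]
    return design


# 8 clusters, 4 sequences, 5 periods, as in the figure caption.
design = stepped_wedge_design(n_clusters=8, n_sequences=4)
for cluster, row in design.items():
    print(cluster, row)
```

Each row lists the condition (0 = control, 1 = intervention) of one cluster in each period; in a real trial, the assignment of clusters to rows would be determined by randomization.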

Previous reviews have revealed a variety of justifications for the choice of the SW-CRT design in practice, including the belief that each cluster “serves as its own control,” thus reducing the risk of confounding due to between-cluster differences [6], and its potentially greater statistical efficiency over the parallel arm CRT design [7]. Not surprisingly, SW-CRTs often include a small number of clusters [6,8]. Problems associated with small numbers of clusters in CRTs have been previously highlighted, and include an increased risk of chance imbalances, insufficient power to detect intervention effects, and potentially inflated type I error rates [9,10]. In parallel arm CRTs, restricted randomization methods (such as stratification, matching, covariate-constrained randomization, or minimization) are recommended to promote balance at baseline [11], but their application to SW-CRTs is less straightforward, especially when there are more than two sequences, when there is only one cluster per sequence, or when the number of clusters is not a multiple of the number of sequences. Indeed, how to define “balance” in SW-CRTs needs clarification. Additional considerations arise with unequal allocation because each sequence does not make an equal contribution to study power [12], and thus, the method of allocation may be determined not only by the need to minimize chance imbalances but also by the desire to maximize statistical efficiency. Furthermore, to obtain correct inferences, characteristics constrained in the randomization should be adjusted for in the statistical analysis [13–15]; however, statistical analysis of SW-CRTs is complicated by the need to account for time as a confounder as well as the complex intracluster correlation structure [16]. Adjustment for cluster-level covariates may be challenging when available degrees of freedom are limited, and computational challenges may arise.

The transparent reporting of the distributions of cluster and participant characteristics between trial arms is essential and recommended by the Consolidated Standards of Reporting Trials (CONSORT) extension for CRTs [17]. The presence of imbalances at baseline can indicate problems in the randomization process and signify potential identification and recruitment biases [18]. However, reporting of such information is not straightforward in SW-CRTs. First, the definition of “baseline” can be different in cross-sectional and cohort designs [19]. In a cohort design, cluster- and individual-level characteristics may be measured before randomization or before any clusters or participants are exposed to the intervention; in a cross-sectional design, individual-level characteristics may be measured in every period as new participants enter the study (before exposure to the intervention). Second, summaries of baseline characteristics may be presented by condition, by sequence/cluster, or by a combination of these (see Fig. 2). Summaries may also be presented by period or by sequence and period. Third, summarizing information meaningfully for cluster-level characteristics can be challenging, especially when the number of clusters is small.

Fig. 2.

Methods of presenting balance at baseline.

There is a lack of guidance on how to implement existing randomization methods in SW-CRTs with different types of designs and how to meaningfully present information about covariate balance at baseline. We reviewed SW-CRTs published over the past 7 years with two main objectives: (1) to describe current methods used to promote baseline covariate balance in the allocation and (2) to describe reporting practices for assessing covariate balance at baseline in SW-CRTs. Our ultimate aim is to inform the development of more explicit guidance to improve the design, analysis, and reporting of SW-CRTs.

2. Methods

2.1. Eligibility criteria

This study was conducted according to a prespecified protocol [20] and guided by methodological principles for scoping reviews in the JBI manual [21]. The results are reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist (Appendix A). We included primary reports of completed SW-CRTs in humans, published in English between 1 January 2016 and 4 March 2022 (the search date). The date range was chosen to yield at least 160 trials, which is adequate to limit the margin of error around an estimated proportion to, at worst, ±0.08. Eligible trials had a minimum of two sequences and three periods, evaluated a single intervention, and randomized a minimum of five independent clusters (as in our previous review) [22]. We excluded protocols, pilot or feasibility studies, nonprimary reports (e.g., secondary analyses), trials not considered health research (e.g., medical education trials), and nonrandomized designs (although trials including a small number of nonrandomized clusters in the analysis were eligible). Eligibility criteria had been pilot-tested and refined in our previous review [22].
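The stated worst-case margin of error can be checked with a short calculation. The sketch below assumes a 95% normal-approximation (Wald) interval evaluated at a proportion of 0.5, where the variance of a proportion is largest; the text does not state which formula was used, so this is an assumption, and the function name `worst_case_margin` is hypothetical.

```python
import math


def worst_case_margin(n, z=1.96):
    """Half-width of a normal-approximation interval for a proportion at p = 0.5."""
    return z * math.sqrt(0.5 * 0.5 / n)


margin = worst_case_margin(160)
print(round(margin, 2))
```

Under these assumptions, 160 trials give a half-width of just under 0.08, in line with the figure quoted above.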

2.2. Search strategy

We used three sources to identify eligible trials: (1) all trials included in a recently published review of implementation challenges in SW-CRTs by Caille et al. [22] covering January 2019 to September 2020; (2) an updated PubMed search using the same terms as Caille et al., covering September 2020 to 4 March 2022 (the search date); and (3) a search of a previously established database of 4336 primary reports of pragmatic trials to April 2019 [23]. We searched the database of pragmatic trials instead of implementing a new search in PubMed as the database eliminated the time-intensive screening for primary trial reports, and the search terms used to establish this database [24] overlapped with those used by Caille et al. [22].

2.3. Screening

The trials sourced from the review by Caille et al. required no additional screening, as similar inclusion and exclusion criteria were used. The updated search was implemented in PubMed, and records were uploaded to Covidence [25]. Two reviewers (PN and MT) independently screened all titles and abstracts and reached agreement on potential eligibility through discussion. The full texts of trials passing the title/abstract screening were then screened independently by two reviewers (PN and YO). If agreement could not be reached, decisions were made through discussion and consultation with MT. Finally, the previously established pragmatic trials database was searched by two reviewers (PN and YO) to identify eligible SW-CRTs.

2.4. Data elements

An extraction form (see Appendix B) was developed to standardize the capture of data elements of interest. To characterize the sample of trials included in our review, we extracted the region of recruitment and type of cluster. We extracted the type of SW-CRT design (cross-sectional, closed cohort, or open cohort), the number of clusters randomized, number of sequences, and sample size in the primary analysis of the primary outcome. For extraction, we used the primary outcome defined by the trial authors. If more than one primary outcome was defined or if the authors did not clearly identify a primary outcome, we selected the outcome driving the sample size calculation; if no sample size calculation was presented, we selected the first outcome listed in the section describing the outcomes of interest or the outcome presented more prominently. If the primary analysis was not clearly identified, reviewers were instructed to choose the analysis corresponding to the main result reported in the abstract or otherwise the first analysis presented for the primary outcome. As a post hoc addition, we also extracted whether equal numbers of clusters were allocated to each sequence.

For objective 1, we extracted whether any restricted randomization was used and if so, what method was used; the type and number of variables used in restricting the randomization, distinguishing between inherently cluster-level (e.g., cluster size or location) and inherently individual-level characteristics (e.g., patient age or sex); and whether any rationale was provided for restricting randomization by the chosen variable(s). We also extracted whether characteristics used in restricting randomization were adjusted in the primary analysis for the primary outcome.

For objective 2, we extracted how baseline characteristics (either cluster- or individual-level) were presented. A visual guide was created for classifying different methods of presenting balance at baseline (see Fig. 2), with categories: by condition, by sequence/cluster, by period, by condition-sequence, by condition-period, and by sequence-period. We also extracted whether balance on baseline values of the primary outcome was presented; whether significance testing of baseline balance was performed (and if so, how); and whether any imbalances at baseline were noted (regardless of significance testing).
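To make the difference between two of these categories concrete, the following sketch summarizes one individual-level characteristic "by condition" and "by sequence" for a toy cross-sectional SW-CRT. All records and values are invented for illustration, the helper names `condition` and `summarize` are hypothetical, and the crossover rule (sequence s switches to intervention at the start of period s + 1) is an assumption of this example rather than something taken from any included trial.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical participant records from a cross-sectional SW-CRT with
# 2 sequences and 3 periods: (sequence, period, age). Values are made up.
records = [
    (1, 1, 60), (1, 2, 64), (1, 3, 62),
    (2, 1, 70), (2, 2, 72), (2, 3, 68),
]


def condition(sequence, period):
    # Assumed rule: sequence s crosses to intervention at period s + 1.
    return "intervention" if period > sequence else "control"


def summarize(records, key):
    """Mean age within each group defined by key(sequence, period)."""
    groups = defaultdict(list)
    for sequence, period, age in records:
        groups[key(sequence, period)].append(age)
    return {group: mean(ages) for group, ages in groups.items()}


by_condition = summarize(records, condition)
by_sequence = summarize(records, lambda sequence, period: sequence)
print(by_condition)
print(by_sequence)
```

In this toy example the two presentations answer different questions: the by-sequence summary compares the groups formed by randomization, whereas the by-condition summary pools observations across sequences and periods according to exposure.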

2.5. Data extraction

Basic trial characteristics were extracted independently by either one or two trained reviewers (PN and/or LOR). The full extraction form was pilot tested on eight trials. Eleven statisticians (PN, KDP, JPM, YO, CM, MR, GT, XW, AC, FL, and MT) completed the pilot test as part of calibration and training and participated in consensus discussions to review discrepancies and refine the form. The form was finalized, and trials were then randomly allocated to rotating pairs of reviewers in batches of four trials per week. Extractions were conducted independently, and discrepancies resolved through discussion within each pair. FL, AC, and/or MT were consulted in cases where consensus between reviewers could not be reached. All data were captured in Airtable [26].

2.6. Analysis

Descriptive statistics were used to describe categorical variables as counts and frequencies. Continuous variables are presented as range, mean and standard deviation, and/or median and interquartile range. Reporting of balance at baseline was stratified by the type of trial design. Use of restricted randomization methods was stratified by total number of clusters (above vs. below the median) and by number of clusters per sequence (one vs. more than one per sequence). Presence of baseline imbalances was stratified by the total number of clusters (above vs. below the median) and use of restricted randomization methods.

3. Results

3.1. Screening and inclusion

A flow diagram depicting the screening of trials is presented in Fig. 3. Fifty-five trials were obtained from the review by Caille et al. [22]. The updated PubMed search yielded 561 records, of which 117 passed the title/abstract screening. After reviewing the full texts, 65 trials met inclusion criteria and were included in our review. A search of the database of pragmatic trials identified 83 potentially eligible primary trial reports, of which 46 met the inclusion criteria. During extraction, a further 6 trials were found not to meet all inclusion criteria and were excluded from the analysis, resulting in a final sample size of 160 SW-CRTs.

Fig. 3.

Flow diagram for identification of 160 stepped-wedge cluster randomized trials.

3.2. Descriptive characteristics

Descriptive characteristics of the 160 SW-CRT publications are presented in Table 1. Most trials were conducted in Europe (51, 31.9%) or North America (40, 25.0%) and hospitals or hospital wards (61, 38.1%) or primary care clinics (42, 26.3%) were the most common clusters. Most trials were cross-sectional (122, 76.3%); 23 (14.4%) were closed cohorts, and 15 (9.4%) were open cohorts. The median number of clusters randomized was 11 (Q1-Q3: 8–18) and median number of sequences was 5 (Q1-Q3: 4–7). The majority (112, 70.0%) allocated an equal number of clusters per sequence and among these, 58 (51.8%) had only one cluster per sequence. The median sample size used in the analysis was 2724 (Q1-Q3: 643–14734).

Table 1.

Characteristics of included stepped-wedge cluster randomized trials (SW-CRTs) (N = 160)

Characteristic Frequency (%)

Publication Year
 2016 12 (7.5)
 2017 9 (5.6)
 2018 23 (14.4)
 2019 30 (18.8)
 2020 34 (21.3)
 2021 40 (25.0)
 2022 12 (7.5)
Country or region of study recruitment (a)
 North America 40 (25.0)
 South or Central America 5 (3.1)
 Europe 51 (31.9)
 Asia 15 (9.4)
 Australia or New Zealand 23 (14.4)
 Middle East 3 (1.9)
 Africa 26 (16.3)
Type of cluster randomized
 Hospitals (38) or hospital wards (23) 61 (38.1)
 Primary care practices or clinics 42 (26.3)
 Nursing homes 5 (3.1)
 Communities or geographical areas 20 (12.5)
 Schools or classrooms 3 (1.9)
 Other (e.g., specialty clinics, groups of the above, families, etc.) 29 (18.1)
Type of design
 Cross-sectional 122 (76.3)
 Open cohort 15 (9.4)
 Closed cohort 23 (14.4)
Number of clusters randomized
 Median (Q1, Q3) 11 (8, 18)
 Min, Max 5, 291
 Not reported 1
Number of sequences
 Median (Q1, Q3) 5 (4, 7)
 Min, Max 2, 81
 Not reported 2
Allocation ratio
 Equal number of clusters per sequence 112 (70.0)
 Unequal number of clusters per sequence 44 (27.5)
 Unclear 4 (2.5)
Number of clusters per sequence (N = 112 with equal allocation of clusters per sequence)
 1 58 (51.8)
 2 20 (17.9)
 3 12 (10.7)
 4 or more 22 (19.6)
Sample size (b)
 Median (Q1, Q3) 2724 (643, 14,733.5)
 Min, Max 44, 4,801,573
 Not reported 5
(a) Multiple selections possible.

(b) Defined as number of participants or participant-visits in a cross-sectional design, number of participants in an open or closed cohort design, or the offset or person-time in a design with a rate or time-to-event outcome.

3.3. Randomization methods

Details about randomization methods are provided in Table 2. Ninety-six (60%) used simple randomization, 63 (39.4%) used restricted randomization, while one provided no details of the randomization. Among those using restricted randomization, methods were stratification (37, 58.7%), matching (10, 15.9%), covariate-constrained randomization (4, 6.3%), minimization (2, 3.2%), or another method, mixture or unclear method (10, 15.9%). The majority of trials (43, 68.3%) restricted randomization on a single, inherently cluster-level variable, while 16 (25.4%) used more than one cluster-level variable. Typical cluster-level characteristics balanced during allocation were cluster size and location. Almost no trials balanced on inherently individual-level characteristics. Of the 63 trials using restricted randomization, a rationale was provided for the use of the chosen variables in 27 (42.9%), and only 12 (19.0%) adjusted for some or all variable(s) used in the randomization as covariates; 10 (15.9%) adjusted for identical variables as used in the randomization.

Table 2.

Randomization methods used in the included trials (N = 160)

Characteristic Frequency (%)

Allocation strategy
 Unrestricted randomization 96 (60.0)
 Some restricted randomization method 63 (39.4)
  Stratification 37 (58.7)
  Matching 10 (15.9)
  Minimization 2 (3.2)
  Covariate-constrained randomization 4 (6.3)
  Other, mixture, or precise details not stated 10 (15.9)
 No details of randomization 1 (0.6)
Number of inherently cluster-level characteristics used in randomization (N = 63 with restricted randomization)
 0 2 (3.2)
 1 43 (68.3)
 >1 16 (25.4)
 Unclear 2 (3.2)
 Min, Max 0, 7
Type of cluster-level characteristics balanced during allocation (a) (N = 59 with at least one cluster-level characteristic)
 Cluster size 31 (49.2)
 Cluster location 25 (39.7)
 Other (e.g., practice type) 22 (34.9)
Number of inherently individual-level characteristics used in randomization (N = 63)
 0 59 (93.7)
 1 2 (3.2)
 Unclear 2 (3.2)
Justification provided for restricting randomization by chosen variable(s) (N = 63)
 Yes, for at least one 27 (42.9)
 No 36 (57.1)
Covariates used in restricted randomization adjusted in the primary analysis? (N = 63 with restricted randomization)
 Yes, some or all 12 (19.0)
 All identical in scale and level 10 (15.9)
 Change in at least one in terms of scale and/or level 1 (1.6)
 Unclear if same scale and/or level 1 (1.6)
 No 49 (77.8)
 Unclear 2 (3.2)
(a) Not mutually exclusive categories, multiple selections possible.

When considering trials above the median number of clusters, 40 (51.9%) used restricted randomization compared to 22 (26.8%) among those at or below the median. Among trials with equal allocation of clusters to sequences, the prevalence of restricted randomization was 10 (17.2%) in trials with one cluster per sequence vs. 29 (53.7%) in trials with more than one cluster per sequence.

3.4. Balance at baseline

Full details of how balance at baseline was reported, overall and by SW-CRT design, are presented in Table 3. A diverse pattern of presenting balance at baseline was observed across all trials. Almost all trials (142, 88.8%) presented descriptive results to assess balance at baseline on cluster- and/or individual-level characteristics, but only 42 (26.3%) presented information on both. The majority of trials (134, 83.8%) presented balance on individual-level characteristics, most often by condition in cross-sectional designs and by sequence in closed cohort designs. Fifty (31.2%) presented balance on cluster-level characteristics, most often by sequence for both cross-sectional and cohort designs. Forty-six (28.8%) reported balance on baseline values of the primary outcome. Across all designs, significance testing of baseline balance was performed in 59 (36.9%) trials, the majority of which used a simple test, not adjusted for clustering (33, 55.9%). Baseline imbalances were identified or reported, i.e., via a significant P-value or a statement by the authors, in 72 (45.0%) trials.

Table 3.

Reporting of balance at baseline, overall and by type of stepped-wedge design (Frequency, %)

Characteristic Overall (N = 160) Cross-sectional (N = 122) Open cohort (N = 15) Closed cohort (N = 23)

Balance at baseline reported?
 Yes 142 (88.8) 113 (92.6) 12 (80.0) 17 (73.9)
 At cluster-level only 8 (5.0) 7 (5.7) 1 (6.7) 0
 At individual-level only 92 (57.5) 75 (61.5) 7 (46.7) 10 (43.5)
 Both cluster- and individual-level 42 (26.3) 31 (25.4) 4 (26.7) 7 (30.4)
 No 18 (11.2) 9 (7.4) 3 (20.0) 6 (26.1)
Balance on individual-level characteristics (a)
 Not reported 26 (16.2) 16 (13.1) 4 (26.7) 6 (26.1)
 Reported 134 (83.8) 106 (86.9) 11 (73.3) 17 (73.9)
  By condition 105 (78.3) 97 (91.5) 4 (36.4) 4 (23.5)
  By sequence or cluster 27 (20.1) 9 (8.5) 4 (36.4) 14 (82.4)
  By period 2 (1.5) 2 (1.9) 0 0
  By condition and sequence/cluster 4 (2.9) 4 (3.8) 0 0
  By condition and period 3 (2.2) 1 (0.9) 2 (18.2) 0
  By sequence/cluster and period 1 (0.7) 0 0 1 (5.9)
  Other or unclear 8 (6.0) 4 (3.8) 4 (36.4) 0
Balance on cluster-level characteristics (a)
 Not reported 110 (68.8) 84 (68.9) 10 (66.7) 16 (69.6)
 Reported 50 (31.2) 38 (31.1) 5 (33.3) 7 (30.4)
  By condition 18 (36.0) 16 (42.1) 1 (20.0) 1 (14.3)
  By sequence or cluster 30 (60.0) 20 (52.6) 3 (60.0) 7 (100.0)
  By period 0 0 0 0
  By condition and sequence/cluster 2 (4.0) 2 (5.3) 0 0
  By condition and period 2 (4.0) 1 (2.6) 0 0
  By sequence/cluster and period 1 (2.0) 1 (2.6) 0 0
  Other or unclear 3 (6.0) 2 (5.3) 1 (20.0) 0
Balance on baseline values of primary outcome?
 No or not applicable (b) 114 (71.2) 93 (76.2) 11 (73.3) 10 (43.5)
 Yes 46 (28.8) 29 (23.8) 4 (26.7) 13 (56.5)
Significance testing of baseline balance?
 Yes 59 (36.9) 48 (39.3) 3 (20.0) 8 (34.8)
  Simple test 33 (55.9) 25 (52.1) 2 (66.7) 6 (75.0)
  Other 7 (11.9) 7 (14.6) 0 0
  Not specified 19 (32.2) 16 (33.3) 1 (33.3) 2 (25.0)
 No 101 (63.1) 74 (60.7) 12 (80.0) 15 (65.2)
Baseline imbalances reported?
 Yes 72 (45.0) 57 (46.7) 5 (33.3) 10 (43.5)
 No 88 (55.0) 65 (53.3) 10 (66.7) 13 (56.5)
(a) Multiple selections possible.

(b) Not applicable applies to “terminal” outcomes such as death or remission that cannot be measured repeatedly.

Among trials with more than the median number of clusters, 30 (38.9%) reported imbalances, while among those at or below the median, 41 (50.0%) reported imbalances at baseline. Among trials using restricted randomization, 29 (46.0%) reported imbalances, while among those with unrestricted or unclear randomization, 43 (44.3%) reported imbalances.

4. Discussion

4.1. Summary of main findings

In this scoping review of 160 published SW-CRTs, we found that many trials randomized a limited number of clusters (a median of only 11); nevertheless, the majority did not use restricted randomization to promote balance at baseline. When restricted randomization was performed, it was typically stratification using a single cluster-level variable. However, the covariates used in the allocation procedure were commonly not adjusted for in the primary analysis. Balance at baseline was assessed in most studies but reported in a variety of ways; most commonly based on individual-level characteristics and presented by sequence for cohort designs and by condition for cross-sectional designs. Information to assess balance on both cluster- and individual-level characteristics was seldom reported. Baseline imbalances were noted in nearly half of the trials. Significance testing of baseline imbalances was performed in over one-third of trials.

4.2. Strengths and limitations

Our study has several strengths. To our knowledge, this is the largest review of SW-CRTs to date. Our data extraction form was well-tested, and information from each trial was extracted independently by two expert statisticians who then reached agreement. Nevertheless, we note some limitations. Our search is comprehensive but may not be complete. We used trials identified in a previously published review with similar eligibility criteria and updated that search; however, to efficiently extend the date range of the review to an earlier period, we identified SW-CRTs through a previously published database of pragmatic trials. We tested the sensitivity of the search used to create that database and found that it identified 54 (98%) of the 55 published SW-CRTs in the Caille et al. [22] review, which was considered adequate. Finally, the number of trials with open- and closed-cohort designs was relatively small, precluding robust comparisons between the different trial designs.

4.3. Comparison to previous studies

A review of 298 (mostly individually) randomized trials published in high-impact journals in 2009 and 2014 by Ciolino et al. found that 81% used covariates in the allocation [27]. When considering CRTs in particular, a previous review of 300 (mostly parallel arm) CRTs published between 2000 and 2008 across a broad range of journals by Wright et al. found that 58% used some form of restricted randomization [28]. Given the relative absence of published methodology addressing randomization methods in SW-CRTs, our finding that only 39% of SW-CRTs used restricted randomization is not surprising and is similar to that in an earlier review of 62 SW-CRTs published before 2014, in which 41% used restricted randomization [8]. It is notable that the median number of clusters randomized in the sample of 300 (mostly parallel arm) CRTs was 21 [28], which is substantially higher than in our review (11). The small number of clusters may be one explanation for the lower prevalence of restricted randomization in our review; another may be the relatively high prevalence of SW-CRTs with only one cluster per sequence: investigators may not be aware that methods such as covariate-constrained allocation can be used to promote balance across conditions even with a small number of clusters and as few as one cluster per sequence. A small number of clusters may also be a potential barrier to adjustment for cluster-level covariates in SW-CRTs, although the percentage of SW-CRTs adjusting for the covariates used in randomization (19%) is comparable to that in Wright et al. [28], where 17% of (mostly parallel arm) CRTs adjusted for covariates used in the restricted randomization in the analysis.

Among the 300 (mostly parallel arm) CRTs reviewed by Wright et al., 80% reported on balance at baseline and when reported, balance was primarily assessed on individual-level covariates. Our results demonstrate a similar pattern, with 89% of SW-CRTs reporting on balance at baseline, mostly on individual-level covariates. However, due to the multiple ways in which balance at baseline can be defined and assessed in SW-CRTs, we found diverse methods of reporting on balance at baseline. The most common approach tended to mimic how baseline balance is reported in parallel arm CRTs (i.e., by condition), although this approach may not be meaningful for cluster-level covariates or for cohort designs. Exactly what is meant by “balance” in SW-CRTs may require further elucidation.

4.4. Implications

Our review has highlighted key gaps and shortcomings in the practice of design and analysis of SW-CRTs. Failure to use restricted randomization methods may, in the first place, reflect a misunderstanding on the part of trialists that clusters serve as their own controls and that imbalances are therefore a lesser concern. However, the standard analytical approach for SW-CRTs includes both within-cluster and between-cluster comparisons and randomization serves an important role in balancing trends across clusters randomized to different sequences [29]. Failure to use restricted randomization may also be due to the relative absence of clear guidance specific to the SW-CRT. Practical guidance for choosing between matching, stratification, and simple randomization has been provided [11,30], but this is not specific to SW-CRTs. Although stratification or matching has been recommended for SW-CRTs [5], implementing these techniques is not always straightforward, for example, when the total number of clusters is small or there is only one cluster per sequence or when the number of clusters varies across strata and is not a multiple of the number of sequences. Two trials in our review reported the use of minimization. This might be considered unusual as it requires sequential randomization of clusters: in one trial, it might be inferred that randomization occurred at several time points whereas in the other, the Minim randomization software was cited [31], but precise details of the implementation were not provided. The technique of covariate constrained randomization [32] is attractive for SW-CRTs as all clusters are usually allocated at the same time, and because it allows multiple variables to be balanced in the allocation, even when there is only one cluster per sequence. 
Moulton discussed adaptation of this method to an SW-CRT case study [33], and a SAS macro to implement this technique is available [34], although consensus on the best method to implement covariate constrained randomization for SW-CRTs is lacking [35–37]. Due to the multitude of ways in which baseline balance can be assessed in SW-CRTs, trialists should reflect on what they wish to achieve when carrying out restricted randomization, and apply a method which will promote balance in the direction(s) of greatest importance. Guidance on incorporating variables used in the randomization into the analysis, especially when the number of clusters is small, may also be useful.
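As a rough illustration of the general idea behind covariate constrained randomization with one cluster per sequence, the Python sketch below enumerates all possible orderings of a small set of clusters, scores each ordering with an imbalance metric, retains the best-balanced fraction, and draws one ordering at random from that constrained set. It is not the method of Moulton, the cited SAS macro, or any included trial. The imbalance metric (squared difference in covariate means between intervention and control clusters, summed over periods in which both conditions are present), the 10% retention fraction, the cluster sizes, and the function names are all assumptions made for this example; as noted above, there is no consensus on the best choice.

```python
import itertools
import random


def imbalance(order, covariate):
    """Sum over periods of the squared difference in covariate means between
    clusters already in intervention and clusters still in control.

    order[i] is the cluster crossing over at step i (one cluster per sequence).
    This metric is an illustrative assumption, not a consensus choice.
    """
    total = 0.0
    for step in range(1, len(order)):  # periods where both conditions exist
        treated = [covariate[c] for c in order[:step]]
        control = [covariate[c] for c in order[step:]]
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        total += diff ** 2
    return total


def constrained_allocation(covariate, keep_fraction=0.1, seed=None):
    """Randomly pick one ordering from the best-balanced fraction."""
    clusters = sorted(covariate)
    candidates = list(itertools.permutations(clusters))
    candidates.sort(key=lambda order: imbalance(order, covariate))
    n_keep = max(1, int(len(candidates) * keep_fraction))
    constrained_set = candidates[:n_keep]
    rng = random.Random(seed)
    return rng.choice(constrained_set), constrained_set


# Hypothetical cluster sizes for six clusters (invented values).
sizes = {"A": 120, "B": 340, "C": 95, "D": 410, "E": 200, "F": 180}
chosen, constrained_set = constrained_allocation(sizes, seed=2023)
print(chosen, round(imbalance(chosen, sizes), 1))
```

Full enumeration is feasible only for a small number of clusters (six clusters give 720 orderings); with more clusters, a random sample of candidate allocations would be needed. The metric here targets balance across conditions within periods, which is only one of the directions in which balance could be promoted.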

Imbalances in baseline characteristics have implications for the internal and external validity of the study and should be transparently reported. Clear guidance on optimal presentation of baseline balance in SW-CRTs to permit readers to assess internal and external validity is required. At a minimum, cross-sectional designs should present a comparison of cluster-level characteristics by sequence and individual-level characteristics by condition; open- and closed-cohort studies should present comparisons by sequence for both cluster- and individual-level characteristics. Whether additional comparisons are informative might vary from trial to trial: for example, presenting information by period may facilitate assessment of changes in the types of participants identified over time, especially in trials with a long duration. Clearly presenting baseline trends in the primary outcome as tables or figures could allow assessment of secular trends.

5. Conclusions

In practice, randomization in SW-CRTs often appears suboptimal. Baseline covariate imbalances in SW-CRTs reduce face validity and credibility of the trial results, reduce power and precision, and complicate interpretation. When trials do not include an adequate number of clusters, even restricted randomization may not successfully balance all pertinent characteristics. Although there is no consensus on the minimum required number of clusters in an SW-CRT, trialists should be encouraged to randomize a sufficiently large number of clusters (preferably ten or more) [9] and adopt restricted randomization methods, as simple randomization is inadequate to achieve balance in most SW-CRTs.

Supplementary Material

Supplement A
Supplement B

What is new?

Key findings

  • Many stepped-wedge cluster randomized trials (SW-CRTs) involve a small number of clusters but restricted randomization techniques, which can help promote balance in the allocation, are often not implemented; when restricted randomization is used, variables restricted in the randomization are often not adjusted for in the analysis.

  • Diverse methods for assessing and reporting balance at baseline in SW-CRTs are being used; baseline imbalances are reported in nearly half of trials.

What this adds to what was known

  • The importance of randomizing an adequate number of clusters and using techniques to promote balanced allocation in SW-CRTs may not be fully appreciated by investigators.

What is the implication and what should change now

  • To support generating high-quality evidence from SW-CRTs, trialists should be discouraged from conducting trials with small numbers of clusters.

  • Investigators need more explicit guidance on randomization techniques to promote balance in SW-CRTs and on assessing and reporting balance at baseline in both cross-sectional and cohort SW-CRTs and for both cluster- and individual-level characteristics.

Acknowledgments

We are grateful to summer student Eric Tran at the Ottawa Hospital Research Institute for his contributions to database management.

Funding:

This work was supported by the Canadian Institutes of Health Research through the Project Grant competition (competitive, peer-reviewed), award number PJT-153045; and by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA Imbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (NIA IMPACT Collaboratory). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funder played no role in the study. KDP was funded by the Yale Clinical and Translational Science Award (UL1 TR001863).

Footnotes

Competing interests: The authors declare that they have no conflicts of interest related to this article.

Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jclinepi.2023.03.010.

References

  • [1] Donner A, Klar N. Design and analysis of cluster randomization trials in health research. London: Arnold Publishers Limited; 2000.
  • [2] Hemming K, Haines TP, Chilton PJ, Girling AJ, Lilford RJ. The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting. BMJ 2015;350:h391.
  • [3] Hussey MA, Hughes JP. Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials 2007;28:182–91.
  • [4] Hemming K, Taljaard M, Grimshaw J. Introducing the new CONSORT extension for stepped-wedge cluster randomised trials. Trials 2019;20(1):68.
  • [5] Copas AJ, Lewis JJ, Thompson JA, Davey C, Baio G, Hargreaves JR. Designing a stepped wedge trial: three main designs, carry-over effects and randomisation approaches. Trials 2015;16:352.
  • [6] Beard E, Lewis JJ, Copas A, Davey C, Osrin D, Baio G, et al. Stepped wedge randomised controlled trials: systematic review of studies published between 2010 and 2014. Trials 2015;16:353.
  • [7] Taljaard M, Hemming K, Shah L, Giraudeau B, Grimshaw JM, Weijer C. Inadequacy of ethical conduct and reporting of stepped wedge cluster randomized trials: results from a systematic review. Clin Trials 2017;14:333–41.
  • [8] Martin J, Taljaard M, Girling A, Hemming K. Systematic review finds major deficiencies in sample size methodology and reporting for stepped-wedge cluster randomised trials. BMJ Open 2016;6(2):e010166.
  • [9] Taljaard M, Teerenstra S, Ivers NM, Fergusson DA. Substantial risks associated with few clusters in cluster randomized and stepped wedge designs. Clin Trials 2016;13:459–63.
  • [10] Leyrat C, Morgan KE, Leurent B, Kahan BC. Cluster randomized trials with a small number of clusters: which analyses should be used? Int J Epidemiol 2018;47:321–31.
  • [11] Ivers NM, Halperin IJ, Barnsley J, Grimshaw JM, Shah BR, Tu K, et al. Allocation techniques for balance at baseline in cluster randomized trials: a methodological review. Trials 2012;13:120.
  • [12] Kasza J, Forbes AB. Information content of cluster-period cells in stepped wedge trials. Biometrics 2019;75:144–52.
  • [13] Li F, Lokhnygina Y, Murray DM, Heagerty PJ, DeLong ER. An evaluation of constrained randomization for the design and analysis of group-randomized trials. Stat Med 2016;35:1565–79.
  • [14] Li F, Turner EL, Heagerty PJ, Murray DM, Vollmer WM, DeLong ER. An evaluation of constrained randomization for the design and analysis of group-randomized trials with binary outcomes. Stat Med 2017;36:3791–806.
  • [15] Zhou Y, Turner EL, Simmons RA, Li F. Constrained randomization and statistical inference for multi-arm parallel cluster randomized controlled trials. Stat Med 2022;41:1862–83.
  • [16] Li F, Hughes JP, Hemming K, Taljaard M, Melnick ER, Heagerty PJ. Mixed-effects models for the design and analysis of stepped wedge cluster randomized trials: an overview. Stat Methods Med Res 2021;30(2):612–39.
  • [17] Campbell MK, Piaggio G, Elbourne DR, Altman DG, CONSORT Group. Consort 2010 statement: extension to cluster randomised trials. BMJ 2012;345:e5661.
  • [18] Eldridge S, Kerry S, Torgerson DJ. Bias in identifying and recruiting participants in cluster randomised trials: what can be done? BMJ 2009;339:b4006.
  • [19] Hemming K, Taljaard M, McKenzie JE, Hooper R, Copas A, Thompson JA, et al. Reporting of stepped wedge cluster randomised trials: extension of the CONSORT 2010 statement with explanation and elaboration. BMJ 2018;363:k1614.
  • [20] Nevins P, Davis-Plourde K, Ouyang Y, Ryan M, Tong G, Wang X, et al. Handling of covariates in stepped-wedge cluster randomized trials: Protocol for a methodological review. uO Research 2022. Available at: https://ruor.uottawa.ca/handle/10393/43901. Accessed April 4, 2023.
  • [21] Peters MDJ, Godfrey C, McInerney P, Munn Z, Tricco AC, Khalil H. Chapter 11: Scoping Reviews (2020 version). In: Aromataris E, Munn Z, editors. JBI Manual for Evidence Synthesis. JBI; 2020. Available at: https://synthesismanual.jbi.global.
  • [22] Caille A, Taljaard M, Le Vilain-Abraham F, Le Moigne A, Copas AJ, Tubach F, et al. Recruitment and implementation challenges were common in stepped-wedge cluster randomized trials: results from a methodological review. J Clin Epidemiol 2022;148:93–103.
  • [23] Nicholls SG, Carroll K, Hey SP, Zwarenstein M, Zhang JZ, Nix HP, et al. A review of pragmatic trials found a high degree of diversity in design and scope, deficiencies in reporting and trial registry data, and poor indexing. J Clin Epidemiol 2021;137:45–57.
  • [24] Taljaard M, McDonald S, Nicholls SG, Carroll K, Hey SP, Grimshaw JM, et al. A search filter to identify pragmatic trials in MEDLINE was highly specific but lacked sensitivity. J Clin Epidemiol 2020;124:75–84.
  • [25] Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia. Available at: www.covidence.org. Accessed April 4, 2023.
  • [26] Airtable, Formagrid, Inc: CA, USA. Available at: https://support.airtable.com/v1/en. Accessed April 4, 2023.
  • [27] Ciolino JD, Palac HL, Yang A, Vaca M, Belli HM. Ideal vs. real: a systematic review on handling covariates in randomized controlled trials. BMC Med Res Methodol 2019;19:136.
  • [28] Wright N, Ivers N, Eldridge S, Taljaard M, Bremner S. A review of the use of covariates in cluster randomized trials uncovers marked discrepancies between guidance and practice. J Clin Epidemiol 2015;68:603–9.
  • [29] Hargreaves JR, Prost A, Fielding KL, Copas AJ. How important is randomisation in a stepped wedge trial? Trials 2015;16:359.
  • [30] Chondros P, Ukoumunne OC, Gunn JM, Carlin JB. When should matching be used in the design of cluster randomized trials? Stat Med 2021;40:5765–78.
  • [31] Evans S, Royston P, Day S. Minim: allocation by minimisation in clinical trials. Available at: https://www-users.york.ac.uk/~mb55/guide/minim.htm. Accessed April 4, 2023.
  • [32] Moulton LH. Covariate-based constrained randomization of group-randomized trials. Clin Trials 2004;1:297–305.
  • [33] Moulton LH, Golub JE, Durovni B, Cavalcante SC, Pacheco AG, Saraceni V, et al. Statistical design of THRio: a phased implementation clinic-randomized study of a tuberculosis preventive therapy intervention. Clin Trials 2007;4:190–9.
  • [34] Green E. ccrsw-sas: A SAS macro for constrained covariate randomization of cluster-randomized and unstratified stepped-wedge designs. 2017. Available at: https://github.com/ejgreene. Accessed April 4, 2023.
  • [35] Chaussee EL, Dickinson LM, Fairclough DL. Evaluation of a covariate-constrained randomization procedure in stepped wedge cluster randomized trials. Contemp Clin Trials 2021;105:106409.
  • [36] Lew RA, Miller CJ, Kim B, Wu H, Stolzmann K, Bauer MS. A method to reduce imbalance for site-level randomized stepped wedge implementation trial designs. Implement Sci 2019;14:46.
  • [37] Kristunas CA. Improving the feasibility of stepped-wedge cluster randomised trials: a mixed methods enquiry [doctoral dissertation]. United Kingdom: University of Leicester; 2021:238.
