Abstract
A practical limitation of cluster randomized controlled trials (cRCTs) is that the number of available clusters may be small, resulting in an increased risk of baseline imbalance under simple randomization. Constrained randomization overcomes this issue by restricting the allocation to a subset of randomization schemes where sufficient overall covariate balance across comparison arms is achieved. However, for multi-arm cRCTs, several design and analysis issues pertaining to constrained randomization have not been fully investigated. Motivated by an ongoing multi-arm cRCT, we elaborate the method of constrained randomization and provide a comprehensive evaluation of the statistical properties of model-based and randomization-based tests under both simple and constrained randomization designs in multi-arm cRCTs, with varying combinations of design and analysis-based covariate adjustment strategies. In particular, as randomization-based tests have not been extensively studied in multi-arm cRCTs, we additionally develop most-powerful randomization tests under the linear mixed model framework for our comparisons. Our results indicate that under constrained randomization, both model-based and randomization-based analyses could gain power while preserving nominal type I error rate, given proper analysis-based adjustment for the baseline covariates. Randomization-based analyses, however, are more robust against violations of distributional assumptions. The choice of balance metrics and candidate set sizes and their implications on the testing of the pairwise and global hypotheses are also discussed. Finally, we caution against the design and analysis of multi-arm cRCTs with an extremely small number of clusters, due to insufficient degrees of freedom and the tendency to obtain an overly restricted randomization space.
Keywords: Cluster randomized trials, covariate adjustment, linear mixed models, multi-arm trial, most-powerful randomization test, restricted randomization
1 |. INTRODUCTION
Cluster randomized controlled trials (cRCTs) are designed to evaluate the effect of an intervention that is delivered to clusters of individuals.1,2,3,4 Examples of such clusters include schools, communities, factories, hospitals, and medical practices. This design is often used in practice when the intervention by its nature needs to be applied to an entire group of individuals or when treatment contamination might arise from the interaction between individuals in the same cluster. In addition, cRCTs can be used to capture the population-level direct and indirect effects of an intervention (for example, an intervention designed to reduce infectious disease transmission).5 There are various types of cRCT designs, including parallel-arm and stepped wedge designs.6 In this article, we focus on multi-arm parallel cRCTs, motivated by the design of TESTsmART, an ongoing trial evaluating interventions to improve antimalarial stewardship in the retail sector.7 In a multi-arm cRCT, each distinct arm can be defined by different combinations of interventions, or different “doses” of a single type of intervention,8 or it could be of interest to simultaneously examine the effects of multiple types of interventions.
A frequent practical limitation of cRCTs is the difficulty in recruiting a large number of randomization units (i.e., clusters) as compared to individually randomized trials.9 With only a limited number of often heterogeneous clusters, simple randomization may fail to adequately balance important baseline prognostic covariates across arms.10 The lack of balance with respect to baseline covariates can lead to decreased statistical power and precision and may threaten the internal validity of the trial.11,6 In multi-arm cRCTs, the risk of chance imbalance in baseline covariates increases as the number of arms increases and the number of randomization units per arm decreases. Therefore, statistical methods controlling for baseline prognostic characteristics are particularly important for the appropriate design and analysis of such trials.
In parallel cRCTs, it has been well-known that design-based adjustment for baseline balance enhances comparability between clusters in different arms.9 To minimize the risk of chance imbalance in small cRCTs with two arms, a variety of restricted randomization methods, such as stratification and matching, have been proposed. However, these methods have inherent limitations. When there are multiple important baseline covariates available at the design phase, stratification may create too many sparse strata with incomplete fillings (overstratification) and the optimal number of strata is context dependent,12 while matching introduces additional complexity to the estimation of the intracluster correlation coefficient (ICC), which hampers the reporting of ICC and planning of future trials, and limits the choice of statistical inference methods.13 Covariate-based constrained randomization14 has been developed as a promising design strategy that overcomes potential limitations of stratification and pair-matching.14 In brief, constrained randomization involves (i) specifying important baseline cluster-level covariates to be balanced; (ii) enumerating all randomization schemes or simulating a large number of possible randomization schemes (duplicates should be removed if the schemes are randomly simulated); (iii) retaining a constrained randomization space with a subset of schemes where sufficient balance across baseline covariates is achieved according to some pre-specified balance metric; (iv) randomly selecting a scheme from the constrained space for implementation.15,16 Despite this generic framework, implementation of constrained randomization strategies can differ with respect to the choice of balance metric as well as the size of the constrained space. Li et al.17,18 examined the impact of such design parameters under constrained randomization in two-arm parallel cRCTs. 
While the choice between the l1 and l2 balance metrics does not lead to substantial differences for statistical inference, they found that a smaller randomization space (such as 10% of the simple randomization space with the smallest balance score) could improve the power of the model-based and randomization-based tests. Ciolino et al.19 and Watson et al.8 extended the balance metrics to multi-arm cRCTs. However, the consideration of alternative balance metrics as well as the effect of constrained subspace sizes on the validity of analytical methods were not fully articulated for multi-arm cRCTs in these prior studies (see Web Table 1 for comparison).
Parallel to design-based adjustment, analysis-based adjustment for prognostic covariates often leads to improved precision of the estimated treatment effect; thus increasing the statistical power of the trial.20 Both the choice of covariates to be adjusted for and the choice of a proper primary analytical method are of crucial importance. Li et al.17,18 recommended that covariates used in design-based adjustment should also be accounted for in the subsequent analysis of a two-arm cRCT. The same recommendation has been enforced by Watson et al.8 in three-arm cRCTs for model-based tests to maintain a valid type I error rate as well as adequate power. While mixed-model regression has been routinely used to test for an intervention effect in cRCTs,21 the randomization test provides a flexible alternative that may be particularly attractive under the constrained randomization design. This is because the randomization test leads to exact inference with the nominal type I error rate and dispenses with any asymptotic approximation as in model-based inference.22 This unique feature helps alleviate concerns on potential small-sample biases from model-based inference. While model-based inference accompanied by small-sample degrees-of-freedom corrections have been studied in multi-arm cRCT designs by Watson et al.,8 the randomization test has not been extended to allow for valid inference under constrained randomization with multi-arm cRCTs. Furthermore, the relative performance of the model-based test and randomization test remains unexplored, and recommendations are needed to guide practice (see Web Table 1 for comparison).
To address these knowledge gaps, we evaluate several statistical issues concerning the use of constrained randomization and the downstream statistical inference in the context of multi-arm cRCTs. Specifically, our contributions are three-fold: (i) we provide alternative balance metrics that could be used for constrained randomization with multi-arm cRCTs; (ii) we propose new randomization test statistics for efficient randomization-based inference with global and pairwise hypotheses of interest to multi-arm cRCTs; (iii) we clarify the relative performance between the model-based test and the randomization test under constrained randomization in multi-arm cRCTs, and detail key analytical considerations for each test (such as whether to adjust for a cluster-level aggregate or the corresponding individual-level covariate). The remainder of the paper is organized into five sections. In Section 2, we use an ongoing study to motivate and illustrate the constrained randomization design. In Section 3, we describe the details of statistical approaches for testing hypotheses in multi-arm cRCTs. Our simulation study and results are presented in Sections 4 and 5. Section 6 gives concluding remarks.
2 |. CONSTRAINED RANDOMIZATION IN MULTI-ARM CLUSTER RANDOMIZED CONTROLLED TRIALS
2.1 |. Motivating example: the TESTsmART study
The Malaria Diagnostic Testing and Conditional Subsidies to Target Artemisinin-Based Combination Therapies in the Retail Sector (TESTsmART) study is an ongoing cRCT to evaluate strategies to increase appropriate treatment of malaria cases in the retail sector.7 In response to the over-consumption of artemisinin-based combination therapies (ACTs) in malaria-endemic countries, the study aims to target subsidized ACTs to those who receive a confirmatory diagnosis in private sector retail outlets. The study was originally planned as two four-arm cRCTs among registered retail outlets (clusters) conducted separately in two distinct study sites: western Kenya and Lagos, Nigeria. For the purpose of illustration, we focus on three of the four arms and the Nigerian study site in the current article.
In this motivating example, forty-eight randomly selected retail outlets in Lagos will be allocated evenly across three intervention arms as described in Woolsey et al.7 Given the highly heterogeneous nature of the retail outlets, it is important to ensure balance on several baseline cluster-level factors of interest at the design stage. Three cluster-level variables can be obtained from the pre-randomization survey conducted by the study team, including daily patient volume of an outlet (continuous), which of the two geographical regions (A & B) within the city of Lagos the outlet is located in (binary), and whether the outlet has malaria rapid diagnostic tests (mRDT) in stock prior to the intervention (binary). The actual randomization in the study considered only the second factor, but we will consider all three variables in the decision making process in this motivating example for demonstration purposes.
2.2 |. Implementation of constrained randomization
Covariate data for each cluster available before randomization can be used to improve overall balance through constrained randomization. This is achieved by restricting the randomization to treatment allocations that satisfy certain pre-specified criteria on overall balance. One allocation scheme will then be selected randomly from this constrained subspace. Whereas this general idea carries over from two-arm cRCTs to multi-arm cRCTs, the choice of balance metrics requires additional considerations because overall balance is now defined based on all treatment arms. With two arms, Raab and Butcher11 introduced the l2 balance metric, and Li et al.17,18 developed the corresponding l1 balance metric. Little difference was found between the l1 and l2 balance metrics in the study by Li et al.17,18, hence both may be used interchangeably. With more than two arms, Ciolino et al.19 proposed a class of p-value based balance metrics, and concluded that the Kruskal–Wallis test p-value with a threshold (p > 0.3) leads to acceptable balance. Watson et al.8 extended the l2 balance metric in Raab and Butcher11 for multi-arm cRCTs as the sum of the cluster-level standardized mean differences across all arms, and followed Li et al.18 to choose the lowest 10% of the randomization space (in terms of the balance score) for constrained randomization. In the current study, we wish to control the maximum degree of the between-arm imbalance and therefore propose an alternative extension of the l2 metric of Raab and Butcher.11 Specifically, our maximum pairwise l2 metric is given by
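$$B_{(l2)} = \max_{i \neq i'} \sum_{l=1}^{L} \omega_l \left(\bar{x}_{il} - \bar{x}_{i'l}\right)^2 \qquad (1)$$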
where ωl ≥ 0 is the weight for the lth variable considered for balance, l ∈ {1, …, L}, and $\bar{x}_{il}$ and $\bar{x}_{i'l}$ denote the averages of the lth cluster-level covariate in the ith and i′th arms, with i ≠ i′ and i, i′ ∈ {1, …, c}, where c denotes the number of treatment arms. Both continuous and categorical variables are easily accommodated by the l2 balance score (1). For categorical variables, a set of dummy variables can be used in the balance metric. Note that cluster-level data available for constrained randomization may also be aggregated from individual-level data. In practice, however, it is not always possible to obtain individual-level data at the design phase and cluster-level summaries from historical individual-level data may be used instead.15 The weights ωl represent the relative importance of each covariate to the balance score B(l2). One can choose to up-weight or down-weight certain covariates to reflect different priorities when prior knowledge of the strength of their associations with the outcome is available. Otherwise, a simple choice of the weights would be the inverse of the variance of the lth covariate (i.e., ωl = 1/var(xl)) and this would be equivalent to standardizing the covariates such that their variances are unity. Finally, we notice that the l2 metric developed in Watson et al.8 defined balance in terms of the total imbalance of each arm with respect to the population mean, whereas our l2 balance metric achieves a similar purpose by bounding the maximum imbalance over all possible two-arm comparisons. Both definitions are also connected with the balance metrics developed in the causal inference literature with multiple treatments. For example, the metric of Watson et al.8 resembles the so-called population standardized difference,23 whereas our metric resembles the so-called pairwise standardized differences.24 These two types of balance metrics often perform similarly in empirical studies.24
The l2 balance metric (1) can be further extended to account for the covariances between the cluster-level factors. To this end, we additionally consider extending the Mahalanobis distance metric discussed in Morgan and Rubin25 to multi-arm cRCTs. Specifically, we define the maximum pairwise Mahalanobis distance metric as
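$$B_{(M)} = \max_{i \neq i'} \left(\bar{\mathbf{x}}_{i} - \bar{\mathbf{x}}_{i'}\right)' S^{-1} \left(\bar{\mathbf{x}}_{i} - \bar{\mathbf{x}}_{i'}\right) \qquad (2)$$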
where $\bar{\mathbf{x}}_{i}$ is the L × 1 vector of the means of the covariates in the ith arm, and S is the L × L estimated sample covariance matrix of X, the matrix of the covariates to balance on. It can be seen that B(l2) with inverse-variance weights is a special case of B(M) when the off-diagonal covariance components of S are replaced by zeros.
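As a minimal sketch of how the two balance scores could be computed for a single candidate allocation, the Python functions below take a cluster-by-covariate matrix and a vector of arm labels. The function names, the NumPy dependency, the default inverse-variance weights, and the toy data are illustrative assumptions rather than the authors' implementation, and the Mahalanobis version assumes a plain pairwise quadratic form with no additional scaling constant.

```python
import itertools
import numpy as np

def _arm_means(X, arms):
    """Mean covariate vector within each arm (one entry per distinct arm label)."""
    return [X[arms == a].mean(axis=0) for a in np.unique(arms)]

def max_pairwise_l2(X, arms, weights=None):
    """Largest weighted squared difference in covariate means over all arm pairs."""
    X = np.asarray(X, dtype=float)
    arms = np.asarray(arms)
    if weights is None:
        # Assumed default: inverse sample variance of each covariate.
        weights = 1.0 / X.var(axis=0, ddof=1)
    return max(float(np.sum(weights * (a - b) ** 2))
               for a, b in itertools.combinations(_arm_means(X, arms), 2))

def max_pairwise_mahalanobis(X, arms):
    """Largest pairwise quadratic form using the inverse sample covariance of X."""
    X = np.asarray(X, dtype=float)
    arms = np.asarray(arms)
    S_inv = np.linalg.inv(np.atleast_2d(np.cov(X, rowvar=False)))
    return max(float((a - b) @ S_inv @ (a - b))
               for a, b in itertools.combinations(_arm_means(X, arms), 2))

# Hypothetical toy data: 6 clusters, 2 covariates with zero sample covariance.
X_toy = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1], [0, 0], [0, 0]], dtype=float)
arms_toy = np.array([0, 0, 1, 1, 2, 2])
score_l2 = max_pairwise_l2(X_toy, arms_toy)
score_m = max_pairwise_mahalanobis(X_toy, arms_toy)
```

Because the toy covariates are uncorrelated in the sample, the two scores coincide here, which mirrors the special-case relationship between the two metrics.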
Once a balance metric is specified, one can either enumerate all possible randomization schemes or randomly simulate a large number of randomization schemes within the simple randomization space (duplicates are removed if the schemes are randomly simulated) and calculate the corresponding balance score for each scheme. The next important aspect of constrained randomization is the cutoff value, by which we create the constrained randomization subspace. Let q ∈ (0, 1] denote the cutoff value and FB denote the empirical cumulative distribution function of the balance scores calculated using any pre-specified balance metric. The cutoff value could be defined as the percentile such that the constrained space contains schemes with balance scores no larger than $F_B^{-1}(q)$, where $F_B^{-1}$ is the inverse empirical distribution function of the balance scores. The intuition is that the cutoff value approximately measures the proportion of schemes achieving sufficient balance on the covariates of interest. When q = 1, there is no constraint and simple randomization is implemented. When 0 < q < 1, a subset of schemes with sufficient balance will be created and constrained randomization is implemented by selecting an allocation within the subset of schemes.
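The procedure described above can be sketched as follows: simulate schemes, drop duplicates, score each scheme, keep those at or below the q-th empirical quantile, and draw one at random. This is a hedged illustration only. The function names, the equal-allocation assumption, the inverse-variance weights, the number of simulated schemes, and the simulated covariates are all hypothetical choices made for the example.

```python
import math
import numpy as np

def max_pairwise_l2(X, arms, weights):
    """Largest weighted squared mean difference over all pairs of arms."""
    means = [X[arms == a].mean(axis=0) for a in np.unique(arms)]
    return max(float(np.sum(weights * (means[i] - means[k]) ** 2))
               for i in range(len(means)) for k in range(i + 1, len(means)))

def constrained_space(X, n_arms, q=0.1, n_sim=2000, seed=1):
    """Return (retained schemes, cutoff score, number of unique simulated schemes)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Assumption: the number of clusters is divisible by n_arms (equal allocation).
    labels = np.repeat(np.arange(n_arms), X.shape[0] // n_arms)
    weights = 1.0 / X.var(axis=0, ddof=1)
    # A set of tuples removes duplicated simulated schemes.
    unique = {tuple(int(a) for a in rng.permutation(labels)) for _ in range(n_sim)}
    schemes = np.array(sorted(unique))
    scores = np.array([max_pairwise_l2(X, s, weights) for s in schemes])
    # Inverse empirical CDF at q: smallest score with at least a fraction q at or below it.
    cutoff = np.sort(scores)[math.ceil(q * len(scores)) - 1]
    return schemes[scores <= cutoff], float(cutoff), len(schemes)

# Hypothetical covariates for 12 clusters and 3 arms.
rng = np.random.default_rng(2024)
X_demo = rng.normal(size=(12, 3))
retained, cutoff, n_unique = constrained_space(X_demo, n_arms=3, q=0.1)
chosen = retained[rng.integers(len(retained))]  # scheme drawn for implementation
```

Setting `q=1.0` retains every simulated scheme, which corresponds to simple randomization over the simulated space.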
2.3 |. Illustration based on the TESTsmART study
With a mix of binary and continuous variables available before randomization in the TESTsmART study, we consider the l2 balance metric, B(l2), and the Mahalanobis distance metric, B(M), to assess covariate balance and obtain the distributions of the metrics in Web Figure 1. We chose q = 0.1 as the cutoff value for the constrained space. Table 1 compares the cluster-level variable means from three schemes, one selected by simple randomization (SR) and two selected by constrained randomization (CR). Unlike the realized schemes from CR, the scheme from SR resulted in a highly imbalanced proportion of outlets from Region B across arms. At the same time, larger mean differences across arms with respect to the continuous cluster-level variable (patient volume) are also seen in the realized scheme under SR. Although this is only a single realization, it demonstrates the potential for observing covariate imbalance under SR, as well as the advantage of CR in protecting against a scheme with large imbalance.
TABLE 1.
Average value of each covariate by treatment arm from three randomization schemes independently selected from (1) simple randomization (SR); (2) constrained randomization (CR) with B(l2) and candidate set size q = 0.1; (3) CR with B(M) and candidate set size q = 0.1.
| Randomization design | Cluster-level covariates | Arm 1 (N1 = 16) | Arm 2 (N2 = 16) | Arm 3 (N3 = 16) |
|---|---|---|---|---|
| SR | # in Region B | 6 (37.5%) | 2 (12.5%) | 8 (50%) |
| SR | Patient volume | 17.75 | 19.31 | 26.12 |
| SR | # have mRDT in stock | 1 (6.2%) | 3 (18.8%) | 0 (0.0%) |
| CR with B(l2) metric | # in Region B | 6 (37.5%) | 5 (31.2%) | 5 (31.2%) |
| CR with B(l2) metric | Patient volume | 19.81 | 22.75 | 20.62 |
| CR with B(l2) metric | # have mRDT in stock | 1 (6.2%) | 2 (12.5%) | 1 (6.2%) |
| CR with B(M) metric | # in Region B | 6 (37.5%) | 4 (25.0%) | 6 (37.5%) |
| CR with B(M) metric | Patient volume | 22.56 | 21.50 | 19.12 |
| CR with B(M) metric | # have mRDT in stock | 1 (6.2%) | 2 (12.5%) | 1 (6.2%) |
Abbreviations: SR, simple randomization; CR, constrained randomization; mRDT, malaria rapid diagnostic test.
3 |. STATISTICAL INFERENCE UNDER CONSTRAINED RANDOMIZATION IN MULTI-ARM CLUSTER RANDOMIZED CONTROLLED TRIALS
While we illustrated the application of constrained randomization to the TESTsmART study, appropriate statistical inference strategies under constrained randomization in multi-arm cRCTs need to be further explored and could eventually inform the analysis of the TESTsmART study. Watson et al.8 have studied the performance of the mixed-model based test in their simulations, and we will review this approach in Section 3.1. In addition, because there is little prior discussion on how to carry out randomization-based inference in multi-arm cRCTs, and given that randomization-based inference could be naturally coupled with constrained randomization,17,18 we develop most-powerful randomization tests in Section 3.2, extending the approach of Braun and Feng26 to multiple arms and additional null hypotheses. Our evaluation assumes a cross-sectional design with only a single post-treatment outcome observation for each individual in each cluster, which resembles the TESTsmART study and represents the scenario where constrained randomization provides the maximum benefit for covariate balance.8 We do not evaluate repeated cross-sectional or cohort designs, but note that the horizontal before-after comparisons in these alternative designs already offer some protection against between-cluster imbalance.
To proceed, we assume a single continuous outcome Yjk for each individual k (k = 1, …, mj) nested within each cluster j (j = 1, …, G), where $G = \sum_{i=1}^{c} g_i$ with gi denoting the number of clusters in treatment arm i (i = 1, …, c). Let Tij denote the dichotomous treatment indicator for the treatment i (i = 1, …, c − 1) where Tij = 1 if cluster j is assigned to arm i and −1 otherwise. We deviate from the standard notation of Tij ∈ {0, 1} because the re-parameterization, Tij ∈ {−1, 1}, is useful for elaborating the details of randomization-based inference in Section 3.2. Write xjk as the p-dimensional vector of cluster-level or individual-level covariates. Of note, this vector may contain p1 cluster-level and p2 individual-level covariates. Further, if this vector only includes cluster-level covariates, we could simply replace xjk by xj and the rest follows.
3.1 |. Model-based inference
Linear mixed model (LMM) regression is routinely used in the analyses of cRCTs. In this approach, estimates of both treatment and covariate effects at either the cluster or individual level, variance components, and the induced ICC can be obtained simultaneously. For multi-arm cRCTs with continuous outcomes, LMM with a single post-treatment outcome adjusting for baseline covariates can be expressed by
$$Y_{jk} = \lambda + \sum_{i=1}^{c-1} \delta_i T_{ij} + \mathbf{x}_{jk}'\boldsymbol{\beta} + \gamma_j + \epsilon_{jk} \qquad (3)$$
where λ is the overall intercept parameter, β is the parameter vector for covariates, δ = (δ1, …, δc−1)′ are the parameters associated with the effect of each treatment relative to the reference level. The within-cluster correlation is accounted for by a cluster-specific normally-distributed random effect $\gamma_j \sim N(0, \sigma_\gamma^2)$, and the individual-level error variance is $\sigma_\epsilon^2$; independence is often assumed between γj and ϵjk. The choice of reference treatment arm depends on the context, and a natural reference arm in many multi-arm cRCTs is the control or usual care arm. Conditional on estimates of random effect variance-covariance parameters, adjusted treatment effects are estimated based on maximum likelihood or restricted maximum likelihood.27 Using these estimated adjusted treatment effects, a model-based hypothesis test of the treatment effects often uses the Wald statistic. In multi-arm cRCTs where not only the effect of each single treatment but also the joint comparison among all treatment effects may be of interest, the Wald test can be flexible enough to accommodate both purposes with appropriate considerations on the degrees of freedom.
Specifically, the global null hypothesis of no treatment effect takes the form of $H_0: \delta_1 = \cdots = \delta_{c-1} = 0$, and proceeds with the Wald statistic, $W = \hat{\boldsymbol{\delta}}'\hat{\Sigma}_{\delta}^{-1}\hat{\boldsymbol{\delta}}$, where $\hat{\Sigma}_{\delta}$ is the variance-covariance matrix of the treatment effect estimators. This statistic follows an asymptotic χ2-distribution with degrees of freedom c − 1 under the global null. On the other hand, an alternative common practice in multi-arm trials is to investigate whether there is an effect of each active intervention relative to the reference treatment in a pairwise fashion, and consider multiplicity adjustment to control for the family-wise error rate (FWER). In this case, each separate null for pairwise comparison i is given by $H_0^{(i)}: \delta_i = 0$, for i = 1, …, c − 1, and can be tested with a Wald z-statistic, $z_i = \hat{\delta}_i / \hat{\sigma}_{\delta_i}$, where $\hat{\sigma}_{\delta_i}^2$ is the ith diagonal element of $\hat{\Sigma}_{\delta}$. However, because cRCTs often only include a limited number of clusters (e.g. fewer than 20 clusters per arm), the actual sampling distribution of the test statistic may deviate from the Chi-squared or normal distribution. This renders the asymptotic approximation inaccurate and results in an inflated type I error rate.28
As a potential remedy to the inflated type I error rate, we consider the F-test as the alternative for the χ2-test when studying the global null, and t-test as an alternative for the z-test when studying the pairwise null. In addition, several small-sample corrections have been proposed to determine the appropriate denominator degrees of freedom (DoF) of the F-test.27,29,30,31 In this article, we consider the “between-within” correction, which sets the number of denominator DoF to the number of DoF at the cluster level.27,31 It is critical to determine whether the fixed effects in model (3) are at the cluster or individual level. For an individual-level fixed effect that changes within any cluster, within-cluster degrees of freedom should be assigned; otherwise, as for tests of intervention effects, the between-cluster degrees of freedom should be assigned. That is, when a joint test for the global null is performed, the F-statistic should be referenced to a (central) F-distribution with degrees of freedom equal to (c − 1, G − c − # of cluster-level covariates). Similarly, for the separate null of each pairwise comparison (δi = 0), we use the t-test with the between-within degrees of freedom = G − c − # of cluster-level covariates.8 In two-arm cRCTs conducted under constrained randomization, Li et al.18 demonstrated that the t-test with the between-within degrees of freedom preserves the type I error rate, and similar findings (albeit in the absence of multiplicity adjustment) were observed in Watson et al.8 for three-arm cRCTs conducted under constrained randomization.
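To make the between-within rule concrete, the short Python sketch below computes the numerator and denominator degrees of freedom as (c − 1, G − c − number of cluster-level covariates). The function name is hypothetical. The first call uses the 48 clusters, three arms, and three cluster-level covariates of the motivating illustration; the second call assumes, purely for illustration, nine clusters with two cluster-level covariates.

```python
def between_within_dof(n_clusters, n_arms, n_cluster_covariates):
    """Degrees of freedom for tests of cluster-level treatment effects.

    Returns (numerator, denominator); the denominator is also the
    degrees of freedom of the pairwise t-test.
    """
    numerator = n_arms - 1
    denominator = n_clusters - n_arms - n_cluster_covariates
    if denominator < 1:
        raise ValueError("no between-cluster degrees of freedom remain")
    return numerator, denominator

dof_large = between_within_dof(48, 3, 3)  # (2, 42)
dof_small = between_within_dof(9, 3, 2)   # (2, 4)
```

The second case shows how quickly the denominator degrees of freedom shrink when only a handful of clusters are available per arm.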
3.2 |. Randomization-based inference
In two-arm parallel cRCTs, Li et al.17,18 have demonstrated that the randomization test provides a flexible framework for inference under constrained randomization. First introduced by Fisher,32 the randomization test can often be more robust than the model-based methods in the analyses of cRCTs, because it is exact and does not rely on asymptotic approximation of the reference distribution. In the literature of cRCTs, the terminologies “randomization test” and “permutation test” have been used interchangeably. However, they represent two distinct probability models such that subjects are randomly assigned to treatment in randomization tests while subjects are randomly sampled in permutation tests.33 To acknowledge the fact that the test of interest in this article re-randomizes the treatment assignment, we henceforth refer to such tests as randomization tests. To carry out a randomization test in cRCTs, one first defines a relevant test statistic. Then the outcome data are first analyzed based on the actual observed allocation of clusters to obtain an observed test statistic, which will be referenced against the randomization distribution calculated from shuffling or permuting the treatment labels according to the randomization space.34,35 The two-sided p-value is defined as the proportion of the test statistics obtained by permutation that are at least as extreme (in absolute values) as the observed one. While any sensible statistic may be considered for valid randomization-based inference, the power of the randomization test can depend on the choice of test statistics. Under two-arm cRCTs with simple randomization, Braun and Feng26 developed a uniformly most-powerful randomization (UMPR) test statistic in the sense of Lehmann and Stein36 with a continuous outcome. The UMPR test statistic is derived from the joint marginal likelihood induced from a LMM and is invariant to the magnitude of effect size under the alternative. 
Their UMPR test can be favorable for three reasons: (i) it can achieve the highest statistical power when the LMM is correctly specified; (ii) it still maintains the nominal type I error rate when the LMM is incorrectly specified; and (iii) it is computationally efficient as a score-type test in which only one LMM is fit to estimate nuisance parameters; this is in sharp contrast to a computationally intensive Wald-type randomization test where the same LMM needs to be fit repeatedly under permutation. Despite these advantages, randomization tests for cRCTs have so far only been considered when there are two arms, and we provide an extension to multi-arm cRCTs below.
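The generic mechanics of a randomization test can be sketched in a few lines of Python: compute the statistic under the observed allocation, recompute it for every scheme in the randomization space, and report the proportion at least as extreme in absolute value. The example below is a hedged two-arm illustration with made-up cluster means and a simple unadjusted mean-difference statistic; it is not the UMPR statistic, and all names are hypothetical.

```python
from itertools import combinations

def randomization_p_value(statistic, observed, space):
    """Two-sided p-value: share of schemes whose |statistic| is at least the observed one."""
    obs = abs(statistic(observed))
    # Small tolerance guards against floating-point ties.
    n_extreme = sum(abs(statistic(s)) >= obs - 1e-12 for s in space)
    return n_extreme / len(space)

# Hypothetical cluster-level outcome means for six clusters.
cluster_means = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

def mean_difference(treated):
    """Difference in average cluster mean, treated minus control."""
    treated = set(treated)
    t = [y for j, y in enumerate(cluster_means) if j in treated]
    c = [y for j, y in enumerate(cluster_means) if j not in treated]
    return sum(t) / len(t) - sum(c) / len(c)

# Full randomization space: every way of assigning 3 of the 6 clusters to treatment.
space = list(combinations(range(6), 3))
p_value = randomization_p_value(mean_difference, (3, 4, 5), space)  # 2 / 20 = 0.1
```

Even with the most extreme observed allocation, the two-sided p-value cannot fall below 2/20 here, which illustrates how the size of the randomization space limits the attainable significance level.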
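We extend randomization-based inference to multi-arm cRCTs under the specific case of normal alternatives where the random effect $\gamma_j \sim N(0, \sigma_\gamma^2)$, and the outcome $Y_{jk} \mid \gamma_j \sim N(\lambda + \sum_{i=1}^{c-1}\delta_i T_{ij} + \mathbf{x}_{jk}'\boldsymbol{\beta} + \gamma_j,\ \sigma_\epsilon^2)$. Again, we parameterize the treatment indicators in model (3) such that Tij = 1 if cluster j is assigned to arm i and −1 otherwise. Under this model, we can write the joint marginal likelihood of the observed data as the product of the conditional distribution of Yjk integrated over the normally-distributed random effect, namely $\mathcal{L} = \prod_{j=1}^{G} \mathcal{L}_j$, where the jth cluster’s contribution to the marginal likelihood is
$$\mathcal{L}_j = \int \prod_{k=1}^{m_j} f(Y_{jk} \mid \gamma_j)\, f(\gamma_j)\, d\gamma_j \qquad (4)$$
In the above expression, f(Yjk|γj) is the density of $N(\lambda + \sum_{i=1}^{c-1}\delta_i T_{ij} + \mathbf{x}_{jk}'\boldsymbol{\beta} + \gamma_j,\ \sigma_\epsilon^2)$, and f(γj) is the density of $N(0, \sigma_\gamma^2)$. The most-powerful randomization test statistic is therefore the joint likelihood under a specified alternative.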
For testing the separate null for each pairwise comparison, $H_0^{(i)}: \delta_i = 0$, we show in Web Appendix C that $\mathcal{L}$ depends on a simple kernel26
$$S_i = \sum_{j=1}^{G} T_{ij}\, \frac{\sum_{k=1}^{m_j} e_{jk}}{\sigma_\epsilon^2 + m_j \sigma_\gamma^2} \qquad (5)$$
regardless of the alternative $\delta_i = \Delta_i$. In (5), $e_{jk} = Y_{jk} - \lambda - \sum_{i' \neq i} \delta_{i'} T_{i'j}^{obs} - \mathbf{x}_{jk}'\boldsymbol{\beta}$, $T_{i'j}^{obs}$ is the observed treatment indicator for arm i′ ≠ i, and Tij ∈ {1, −1} is the treatment indicator dictating the randomization distribution of Si. Because the form of the kernel is independent of Δi, the test statistic Si is uniformly most-powerful for testing $H_0^{(i)}$. Evidently, Si is a weighted sum of the cluster total errors, and depends on the nuisance parameters, $\sigma_\gamma^2$, $\sigma_\epsilon^2$, λ, β and δi′ (i′ ≠ i). These nuisance model parameters are estimated by fitting the LMM (3) on the observed data by setting δi = 0 (i.e., removing the treatment indicator Tij from LMM (3)), and fixed across treatment permutations. To implement the pairwise test in multi-arm cRCTs, there is one important caveat for obtaining the randomization distribution of Si. That is, the permutation of Tij should be a conditional permutation in which only the clusters in the arms evaluated in $H_0^{(i)}$ (i.e., the reference arm and the arm receiving treatment i) are permuted while the clusters receiving other treatments are fixed. Operationally, we could first subset the (simple or constrained) randomization space such that $T_{i'j} = T_{i'j}^{obs}$ for all i′ ≠ i. Then we obtain the randomization distribution of Si only within this randomization subspace, where Tij is allowed to vary conditional on fixed treatment indicators for all other arms. This idea of conditional permutation was also explained in Wang et al.37 for multi-arm individual randomized trials, and applies to multi-arm cRCTs. Finally, because the randomization test requires a minimum of 20 allocation schemes to provide a 0.05 level test, an overly constrained randomization space may lead to too few allocation schemes once we condition on $T_{i'j}^{obs}$ (i′ ≠ i) and fail to support a 0.05 level test for $H_0^{(i)}$. Randomization-based confidence intervals for the treatment effect can be obtained based on the duality between interval estimation and hypothesis testing. 
For example, one can perform a grid search over candidate values Δ1,0, …, Δc−1,0 by carrying out a series of randomization tests of H0: δ1 = Δ1,0, …, δc−1 = Δc−1,0. The confidence interval for δi is formed by the set of values for Δi,0 not rejected by the randomization test. In addition, inverting randomization tests to obtain confidence intervals does not necessarily require a grid search. Faster algorithms to obtain randomization-based confidence intervals have been proposed for randomized controlled trials in general and studied for cRCTs.38,39,40
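The conditional permutation step for a pairwise test can be sketched as follows. The sketch assumes that the nuisance parameters have already been estimated, so that each cluster contributes one precomputed value: its total residual divided by (error variance + cluster size × random-effect variance), a weighting that follows from the compound-symmetric covariance of model (3). The function name, the arm labels (0 for the reference arm), and the toy inputs are hypothetical and are not taken from the study.

```python
import itertools
import numpy as np

def pairwise_randomization_test(weighted_resid, space, observed, arm, ref=0):
    """Conditional randomization test comparing `arm` with the reference arm.

    weighted_resid: one precomputed weighted residual total per cluster.
    space: all schemes in the (simple or constrained) randomization space.
    Returns (two-sided p-value, size of the conditional subspace).
    """
    w = np.asarray(weighted_resid, dtype=float)
    space = np.asarray(space)
    observed = np.asarray(observed)
    # Clusters observed in any other arm must keep their observed assignment.
    fixed = ~np.isin(observed, [ref, arm])
    conditional = space[(space[:, fixed] == observed[fixed]).all(axis=1)]

    def s_i(scheme):
        t = np.where(scheme == arm, 1.0, -1.0)  # +1 in arm i, -1 otherwise
        return float(t @ w)

    obs = abs(s_i(observed))
    n_extreme = sum(abs(s_i(s)) >= obs - 1e-12 for s in conditional)
    return n_extreme / len(conditional), len(conditional)

# Toy example: 6 clusters, 3 arms of 2 clusters, full simple randomization space.
space = sorted(set(itertools.permutations([0, 0, 1, 1, 2, 2])))
observed = (0, 0, 1, 1, 2, 2)
w = [-2.0, -1.0, 3.0, 4.0, 0.5, -0.5]  # hypothetical weighted residual totals
p_value, n_conditional = pairwise_randomization_test(w, space, observed, arm=1)
```

In this toy case only 6 of the 90 schemes remain after conditioning, so the smallest attainable p-value is far above 0.05, which illustrates why a heavily restricted space may not support a 0.05 level pairwise test.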
To test the global null $H_0: \delta_1 = \cdots = \delta_{c-1} = 0$, the likelihood-based statistic can still be used. However, this statistic does not simplify to a kernel as in (5) and will not lead to a UMPR test because the statistic depends on the alternative δ = Δ. Therefore, we instead develop a locally most-powerful randomization (LMPR) test based on the efficient score statistic,41 which is analytically tractable under the LMM marginal likelihood. Recall the marginal likelihood (4) is
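$$\mathcal{L}_j = (2\pi)^{-m_j/2}\, |V_j|^{-1/2} \exp\left\{-\tfrac{1}{2}\left(\mathbf{Y}_j - \mathbf{T}_j\boldsymbol{\delta} - Z_j\boldsymbol{\eta}\right)' V_j^{-1} \left(\mathbf{Y}_j - \mathbf{T}_j\boldsymbol{\delta} - Z_j\boldsymbol{\eta}\right)\right\}, \qquad V_j = \sigma_\epsilon^2 I_{m_j} + \sigma_\gamma^2 J_{m_j},$$
where $\mathbf{T}_j$ is the mj × (c − 1) matrix of the treatment indicators for cluster j, Zj is the mj × (1 + p) design matrix including the column vector of intercepts and the p-dimensional covariates Xj, δ and η are the corresponding parameter vectors, $I_{m_j}$ is the mj × mj identity matrix, and $J_{m_j}$ is the mj × mj matrix of ones. The full score function is given by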
with the expected information matrix given by
Since we need to estimate the nuisance parameter η, we summarize S using the efficient score statistic and define the locally most-powerful randomization (LMPR) test statistic as
| (6) |
where each term is evaluated under the global null. Specifically, we show in Web Appendix C that the efficient score is
where is the analytical inverse of the compound symmetric matrix.42 Furthermore, the upper left component of the information matrix is derived as
with the diagonal components qii = 1, and off-diagonal components , and πi is the allocation proportion to treatment arm i. The lower right component of the information matrix is derived as , and the off-diagonal component is given by
The detailed derivations are provided in Web Appendix C. Similar to the UMPR test for the pairwise null, the nuisance parameters in the LMPR test are estimated by fitting the LMM (3) once to the observed data (setting δ1 = … = δc−1 = 0) and are held fixed across treatment permutations. Finally, unlike the UMPR test statistic Si, the randomization distribution of Q is dictated by the joint distribution of (T1j, …, Tc−1,j)′, and therefore the randomization distribution of Q can be calculated according to the randomization space for the entire treatment vector. To facilitate the operation of the above randomization tests, we provide detailed execution steps in Algorithm 1.
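The two randomization spaces contrasted above (the space for the entire treatment vector used for Q, and the conditional subspace used for each Si) can be sketched as follows. This is an illustration under simple randomization with a balanced three-arm design, not a reproduction of Algorithm 1; the arm labels (0 for the reference arm, 1 and 2 for the active arms) and the function names `full_space` and `conditional_subspace` are hypothetical.

```python
from itertools import permutations

def full_space(g, n_arms=3):
    """All distinct balanced allocations of n_arms * g clusters to arms."""
    labels = [a for a in range(n_arms) for _ in range(g)]
    # Permuting a multiset of labels yields duplicates; a set removes them.
    return sorted(set(permutations(labels)))

def conditional_subspace(space, observed, arm_i, reference=0):
    """Schemes in which every cluster outside arm_i and the reference arm
    keeps its observed assignment (conditional permutation)."""
    fixed = [j for j, a in enumerate(observed) if a not in (arm_i, reference)]
    return [s for s in space if all(s[j] == observed[j] for j in fixed)]

space = full_space(g=3)                 # space used for the global statistic
observed = space[0]                     # stand-in for the realized allocation
sub = conditional_subspace(space, observed, arm_i=1)  # space for arm 1 vs reference
print(len(space), len(sub))             # 1680 20
```

Under constrained randomization, the same filter would be applied to the constrained space instead of the full space, which is why an overly constrained space can leave very few schemes for a pairwise test.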
4 |. METHODS FOR THE SIMULATION STUDIES
We conduct a simulation study to assess the impact of the choice of the candidate set sizes and the balance metrics for constrained randomization, the choice of adjusted versus unadjusted analysis, as well as the use of model-based versus randomization-based inference in a multi-arm parallel cRCT setting. Wherever applicable, we followed and extended the simulation design described in Li et al.17
Overall, we designed and reported our simulation study according to the “ADEMP” structure of key steps and decisions in simulation studies described by Morris et al.43 We described the Aim in the previous paragraph, Data generation process in Section 4.1, Estimand (we interpret this in our case as the target hypothesis because the goal here is for testing instead of estimation) in Section 4.2, Methods in Section 4.3, and chose Type I error rate and power as the main Performance measures reported in Section 5. We conducted a series of simulations based on a parallel multi-arm cRCT with cross-sectional design with a single post-treatment continuous outcome Yjk for each individual k (k = 1, …, m) nested within each cluster j (j = 1, …, c × g) where c is the number of treatment arms and g is the number of clusters nested within each treatment arm. We therefore considered balanced designs with equal number of clusters (g) in each arm and equal number of individuals (m) in each cluster. Our choice of the number of clusters (g) and individuals (m) was motivated by the TESTsmART trial. We varied g using values of 3, 5, and 10 to evaluate the performance of the randomization and analysis methods with different (but small) numbers of clusters. We considered a small number of clusters because a recent systematic review9 suggested that the median number of clusters in published cRCTs is only 21 (IQR: 12 – 52). Each cluster was assumed to contain m = 150 individuals, resembling the TESTsmART trial. This number was not varied since it is well-known that the effective sample size of a cRCT is largely driven by the number of clusters rather than the cluster size.2 We assumed a three-arm design (c = 3), with one arm receiving ‘standard of care’ and serving as the control arm. 
Findings from three-arm simulation studies can be informative and readily applied to multi-arm parallel cRCT settings with more arms; we therefore selected a three-arm design for computational efficiency. To ensure stable estimates of the type I error rate and power, we ran 10,000 Monte Carlo iterations for each combination of the parameters. With this number of iterations, the acceptable bounds for a 5% type I error rate, accounting for Monte Carlo error, are (4.57%, 5.43%).
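The stated bounds are consistent with a normal-approximation interval for a binomial proportion, 0.05 ± 1.96 × sqrt(0.05 × 0.95 / 10,000). The text does not state which formula was used, so the short check below is an assumption that happens to reproduce the reported values; the function name `monte_carlo_bounds` is hypothetical.

```python
import math

def monte_carlo_bounds(p=0.05, n_sim=10_000, z=1.96):
    """Normal-approximation bounds for an empirical rejection rate."""
    half_width = z * math.sqrt(p * (1 - p) / n_sim)
    return p - half_width, p + half_width

lower, upper = monte_carlo_bounds()
print(f"({lower:.4f}, {upper:.4f})")  # (0.0457, 0.0543)
```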

4.1 |. Data generation process (DGP)
Let Yjk be the outcome for each subject k (k = 1, …, m), nested within each cluster j (j = 1, …, 3g). We generated the outcome data with two cluster-level binary covariates and two individual-level continuous covariates from the following linear mixed model:
| (7) |
In this model, zjk is the 2 × 1 vector of individual-level continuous covariates; xj is the 2 × 1 vector of cluster-level binary covariates. Each of the two individual-level covariates was independently generated from . The cluster-specific means μj were randomly generated from a uniform distribution with support (−2, 2). Each of the two cluster-level covariates was independently simulated from a Bernoulli distribution with probability 0.3 (a modest probability of being either 1 or 0). The strength of the association of each covariate and the outcome was fixed at a value of 1, so that βz = βx = 12×1. For a three-arm design, we need two dichotomous treatment indicators T1j and T2j, which take the values of −1 and 1, to contrast the two treatment conditions to the control condition. The error term ϵjk was independently generated from , where and the cluster-specific random effect γj was generated from , where μγ = 1, and ρ is the intraclass correlation coefficient (ICC). Three ICC values were considered for each level of g: 0.01, 0.05, and 0.10. The intervention effects 2δ1 and 2δ2 were fixed at zero for studying type I error and were specified such that the standardized effect size is approximately 0.5 or 0.75 for studying power.
To further inform the comparison, we considered two alternative data generation processes. First, we reduced the effect sizes to 75%, 50%, and 25% of the original magnitude for the case of g = 10 to avoid the ceiling effect on power. Second, we generated non-normal data to evaluate the robustness of the analytical methods under violations of distributional assumptions. Following Small et al.,44 we specified the random cluster effect γj and error term ϵjk to follow the standard Cauchy distribution, which has heavier tails than the normal distribution. Note that test statistics (5) and (6) are no longer the UMPR and LMPR test statistics under a non-normal data generation process because they are derived from the full likelihood of the LMM. However, it is still of interest to clarify the performance of statistics (5) and (6) relative to model-based inference.
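The normal data generation process described in this section can be sketched as below. Several quantities are not recoverable from the text and are therefore assumptions, flagged in the comments: a zero intercept, unit residual variance, unit variance for the individual-level covariates, a separate cluster-specific mean for each individual-level covariate, and a between-cluster variance chosen to give the target ICC. The function name `simulate_trial` and the record layout are hypothetical.

```python
import random

def simulate_trial(g=3, m=150, icc=0.05, delta=(0.0, 0.0),
                   sigma2_eps=1.0, sigma2_z=1.0, seed=None):
    """Simulate one three-arm cRCT; returns a list of individual-level records."""
    rng = random.Random(seed)
    # Assumption: ICC = sigma2_gamma / (sigma2_gamma + sigma2_eps).
    sigma2_gamma = icc / (1.0 - icc) * sigma2_eps
    arms = [a for a in range(3) for _ in range(g)]  # 0 = control, 1 and 2 = active
    rng.shuffle(arms)                               # simple randomization
    records = []
    for j, arm in enumerate(arms):
        t1 = 1 if arm == 1 else -1                  # treatment indicators coded -1/1
        t2 = 1 if arm == 2 else -1
        x = [int(rng.random() < 0.3) for _ in range(2)]   # cluster-level Bernoulli(0.3)
        # Assumption: one Uniform(-2, 2) mean per individual-level covariate.
        mu = [rng.uniform(-2.0, 2.0) for _ in range(2)]
        gamma = rng.gauss(0.0, sigma2_gamma ** 0.5)       # cluster random effect
        for k in range(m):
            z = [rng.gauss(mu_l, sigma2_z ** 0.5) for mu_l in mu]
            eps = rng.gauss(0.0, sigma2_eps ** 0.5)
            # All covariate coefficients equal 1; the intercept is assumed to be 0.
            y = delta[0] * t1 + delta[1] * t2 + sum(x) + sum(z) + gamma + eps
            records.append({"cluster": j, "arm": arm, "t1": t1, "t2": t2,
                            "x": x, "z": z, "y": y})
    return records

data = simulate_trial(g=3, seed=2024)   # null scenario: delta = (0, 0)
print(len(data))                        # 9 clusters x 150 individuals = 1350
```

Because the indicators are coded −1 and 1, the contrast between an active arm and the control arm in this sketch equals 2δ1 or 2δ2, which matches how the intervention effects are expressed in the text.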
4.2 |. Null hypotheses and multiplicity adjustment
As noted in Section 3, different types of hypotheses could be of interest in multi-arm parallel cRCTs.45 We specified a set of three hypotheses (see Table 2) based on model (7) where there is one control condition and two treatment conditions. We adopted a hierarchical approach and specified a global hypothesis comparing all three arms at once as the first step. Then we compared the two arms receiving active treatments with the control arm separately. To adjust for multiple testing, we considered two scenarios and two adjustment strategies accordingly. In the first scenario, we were interested in each of the treatment comparisons individually and the two active treatments were evaluated distinctly. In this case, adjustment of family-wise error rate (FWER) is not needed46,47 and we therefore fixed the significance level at 5% for each pairwise hypothesis. In the second scenario, we considered a setting whereby the two active treatment arms consisted of the same treatment given at different doses and if either one of the treatment doses showed a statistically significant effect relative to ‘standard of care’, we would conclude that there was evidence of treatment effect. It would be recommended to control for FWER when interventions are related and findings are summarized into one single conclusion.46,47,48 Therefore, we performed a conservative Bonferroni adjustment49 for the two pairwise hypotheses of the treatment effects (i.e., alpha level = 2.5%). Given that the adjustment of FWER is context dependent and there is no consensus in current literature, the purpose of performing such adjustment in our simulation studies is to evaluate whether the analytical methods carry the appropriate error rate under multiplicity corrections. Providing theoretical guidance on when to perform multiplicity adjustment is beyond the scope of this article.
TABLE 2.
Null and alternative hypotheses tested in the simulation studies. For the pairwise hypothesis, we consider two cases: one where no multiplicity corrections are performed and the other where multiplicity is controlled. Abbreviation: FWER, family-wise error rate.
| Comparison | Null hypothesis | Alternative hypothesis | Alpha level | Alpha level controlling for FWER |
|---|---|---|---|---|
| Global hypothesis | δ1 = δ2 = 0 | δ1 ≠ 0 or δ2 ≠ 0 | 5% | Not Applicable |
| Pairwise hypothesis | δ1 = 0 | δ1 ≠ 0 | 5% | 2.5% |
| Pairwise hypothesis | δ2 = 0 | δ2 ≠ 0 | 5% | 2.5% |
4.3 |. Design-based versus analysis-based adjustment
Design-based adjustment was implemented through constrained randomization in contrast to simple randomization. For simple randomization, the final allocation scheme was selected from the full randomization space, regardless of the covariate balance. For constrained randomization, the final allocation scheme was selected from the randomization subspace where sufficient balance on the covariates of interest among all treatment arms was achieved with respect to the balance metrics described in Section 2.2. To compare the choice of covariates in the design stage, we considered different combinations of the covariates to balance in the design phase via constrained randomization: (i) only the two cluster-level binary covariates (xj) were adjusted for in the design; (ii) only the two individual-level covariates aggregated at the cluster level were adjusted for in the design; (iii) all four covariates were adjusted for in the design. Precisely, the use of individual-level covariates aggregated at cluster level in the design phase implies that the individuals in each cluster should be already recruited at the time of randomization, which may not always be the case in cRCTs. However, we consider this scenario to mimic the practice when there are high-quality estimates on the cluster-level means of the individual-level covariates from historical data available prior to randomization. To implement constrained randomization, we first enumerated all 1680 possible random assignments when g = 3, and, for g = 5 or 10, we randomly sampled 20,000 assignments and removed duplicates as an approximation of the complete randomization space since full enumeration can quickly become computationally intractable with larger numbers of clusters. Then we calculated the balance scores for all the randomization schemes we enumerated (g = 3) or randomly sampled (g = 5 or 10). 
The constrained space was defined as the subspace in which the corresponding balance scores of the randomization schemes were lower than a cutoff value. Since the absolute magnitude of the balance scores has no intuitive interpretation, we chose the cutoff such that the candidate set contained exactly 100 schemes, 10%, or 50% of the full randomization space.
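The enumerate, score, and threshold procedure described above can be sketched for g = 3 as follows. The balance metrics of Section 2.2 are not reproduced here; `balance_score` is a stand-in (the largest pairwise sum of inverse-variance-weighted squared differences in arm means), and the covariate values are hypothetical. The sketch also keeps the best-scoring schemes by rank, which matches a cutoff rule only when there are no ties at the boundary.

```python
import random
from itertools import combinations, permutations
from statistics import mean, pvariance

def balance_score(alloc, covs):
    """Stand-in l2-type score: worst pairwise imbalance in arm-level covariate means."""
    arms = sorted(set(alloc))
    n_cov = len(covs[0])
    weights = [1.0 / pvariance([row[k] for row in covs]) for k in range(n_cov)]
    arm_means = {
        a: [mean(covs[j][k] for j in range(len(alloc)) if alloc[j] == a)
            for k in range(n_cov)]
        for a in arms
    }
    return max(
        sum(w * (arm_means[a][k] - arm_means[b][k]) ** 2
            for k, w in enumerate(weights))
        for a, b in combinations(arms, 2)
    )

def constrained_space(space, covs, fraction=None, size=None):
    """Keep the best-balanced schemes: a fixed number or a fraction of the space."""
    scored = sorted(space, key=lambda s: balance_score(s, covs))
    n_keep = size if size is not None else max(1, round(fraction * len(scored)))
    return scored[:n_keep]

# Hypothetical cluster-level data: one binary and one continuous covariate.
covs = [(0, 1.2), (1, -0.3), (0, 0.8), (0, -1.1), (1, 0.4),
        (0, 1.9), (1, -0.7), (0, 0.1), (0, -1.5)]
space = sorted(set(permutations([0, 0, 0, 1, 1, 1, 2, 2, 2])))  # 1680 schemes
cr10 = constrained_space(space, covs, fraction=0.10)   # 10% candidate set
cr100 = constrained_space(space, covs, size=100)       # exactly 100 schemes
final = random.Random(1).choice(cr10)                  # allocation actually used
print(len(space), len(cr10), len(cr100))               # 1680 168 100
```

For g = 5 or 10, the full enumeration in this sketch would be replaced by a random sample of allocations with duplicates removed, as described in the text.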
In the analysis phase, we compared the model-based and the randomization-based tests in Section 3 for both simple and constrained randomization. For each test, we compared three types of covariate adjustment strategies (analysis-based adjustment): (i) no adjustment; (ii) adjustment for the covariates used in the design stage; and (iii) full adjustment for all four covariates. Whenever applicable, we also compared adjustment for the actual individual-level covariates (zjk) versus adjustment for their cluster-level aggregates; the latter strategy perfectly conforms to the design, where the cluster-level aggregates of individual-level covariates are used under constrained randomization. Under simple randomization, we examined all possible covariate adjustment strategies considered for constrained randomization to compare the net benefit of design-based adjustment in the presence of analysis-based adjustment. To further elucidate our simulation design, Table 3 provides a summary of the design-based and analysis-based adjustment strategies we considered (with acronyms defined and explained). In Table 3, the comparison of each column within a specific row reveals the benefit of analysis-based adjustment, whereas the comparison of each row within a specific column reveals the benefit of design-based adjustment. All analyses were conducted in R version 3.6.0,50 with randomization programmed using the package randomizr51 and linear mixed model analysis performed using the package nlme.52
TABLE 3.
Combinations of design-based and analysis-based adjustment strategies (indicated by ‘✔’) evaluated in the simulation studies. Unadj: unadjusted; Adj-C: analysis adjusted for cluster-level covariates/aggregates; Adj-I, analysis adjusted for individual-level covariates; Fully Adj-C, analysis adjusted for all covariates with cluster-level aggregates; Fully Adj-I, analysis adjusted for all covariates with individual-level covariates (when available).
| Analysis-based adjustment | |||||||
|---|---|---|---|---|---|---|---|
| Unadj | Adj-C | Adj-C | Adj-I | Fully Adj-C | Fully Adj-I | ||
| Design-based adjustment | ∅ | x j | z jk | {xj, zjk} | |||
| Simple randomization | ∅ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Constrained randomization | x j | ✔ | ✔ | ✔ | ✔ | ||
| Constrained randomization | ✔ | ✔ | ✔ | ✔ | ✔ | ||
| Constrained randomization | ✔ | ✔ | ✔ | ||||
5 |. RESULTS FROM THE SIMULATION STUDIES
In Table 4 and Figure 1, we summarize the Monte Carlo type I error rates for the global hypothesis comparing all three arms under simple randomization (SR) and constrained randomization (CR); in Table 5 and Figure 2, we summarize the corresponding results for power. The results for the pairwise hypotheses (H0: δ1 = 0 and H0: δ2 = 0) are presented in Web Appendix D. To simplify the presentation, we held the ICC fixed at 0.05 throughout and compared results for g = 3, 5, and 10. Results with other ICC values are qualitatively similar and are presented in Web Appendix E. The tables and figures focus on the comparisons with a slightly different emphasis. In the tables, we focus on comparing the model-based and randomization-based tests under SR and CR (using B(l2)) with the candidate set size ranging from 50% to 10% of the full randomization space, as well as with a candidate set of exactly 100 schemes. That is, we study the consequence of constrained randomization with an increasing level of constraint. All CR scenarios considered all four covariates in the balance metric; the adjusted tests therefore controlled for all four covariates accordingly, with either the actual individual-level covariates or their cluster-level aggregates. In the figures, we fixed the candidate set size at 10% and compared the tests under SR and CR (using both B(l2) and B(M)) across different design-based and analysis-based adjustment strategies. For design-based adjustment using CR, we controlled for three combinations of the covariates, as in Table 3. For analysis-based adjustment, we chose to adjust for no covariates, the covariates used in the design phase, and all covariates, again as outlined in Table 3. Additional results and discussions on multiplicity adjustment are presented in Web Appendix F, results under reduced effect sizes in Web Appendix G, and results under the non-normal DGP in Web Appendix H.
TABLE 4.
Type I error rates for the global hypothesis under simple randomization (SR) versus constrained randomization (CR) with candidate set sizes of 50% and 10% of the randomization space and of exactly 100 schemes. All covariates were used in constrained randomization and the adjusted tests; constrained randomization was implemented using the l2 metric; ICC = 0.05. The nominal type I error rate is 0.05, and the acceptable range for the type I error rate with 10,000 replicates is (0.0457, 0.0543).
| # of clusters per arm | Analysis-based adjustment | χ2-test: SR | χ2-test: CR (50%) | χ2-test: CR (10%) | χ2-test: CR (100) | F-test: SR | F-test: CR (50%) | F-test: CR (10%) | F-test: CR (100) | Randomization test: SR | Randomization test: CR (50%) | Randomization test: CR (10%) | Randomization test: CR (100) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g = 10 | Unadj | 0.070 | 0.009 | 0.000 | 0.000 | 0.053 | 0.005 | 0.000 | 0.000 | 0.049 | 0.050 | 0.051 | 0.042 |
| g = 10 | Adj-C | 0.068 | 0.066 | 0.072 | 0.067 | 0.050 | 0.047 | 0.049 | 0.048 | 0.047 | 0.048 | 0.050 | 0.038 |
| g = 10 | Adj-I | 0.066 | 0.068 | 0.069 | 0.065 | 0.050 | 0.049 | 0.050 | 0.048 | 0.049 | 0.050 | 0.050 | 0.039 |
| g = 5 | Unadj | 0.091 | 0.022 | 0.003 | 0.000 | 0.052 | 0.008 | 0.000 | 0.000 | 0.051 | 0.052 | 0.053 | 0.040 |
| g = 5 | Adj-C | 0.102 | 0.096 | 0.104 | 0.103 | 0.043 | 0.044 | 0.044 | 0.047 | 0.048 | 0.049 | 0.048 | 0.042 |
| g = 5 | Adj-I | 0.090 | 0.096 | 0.093 | 0.095 | 0.049 | 0.050 | 0.049 | 0.052 | 0.047 | 0.048 | 0.049 | 0.042 |
| g = 3 | Unadj | 0.132 | 0.050 | 0.012 | 0.006 | 0.054 | 0.014 | 0.003 | 0.001 | 0.050 | 0.049 | 0.045 | 0.040 |
| g = 3 | Adj-C | 0.164 | 0.174 | 0.164 | 0.167 | 0.001 | 0.000 | 0.001 | 0.001 | 0.048 | 0.049 | 0.046 | 0.043 |
| g = 3 | Adj-I | 0.159 | 0.165 | 0.157 | 0.158 | 0.047 | 0.049 | 0.046 | 0.049 | 0.052 | 0.048 | 0.050 | 0.040 |
FIGURE 1.

Type I error rates for the global hypothesis under simple randomization (SR) versus constrained randomization (CR) with 2 balance metrics B(l2) and B(M). CR implemented using covariates indicated on the horizontal axis; candidate set size = 10% under CR; ICC = 0.05; alpha level = 5%; R-test: randomization test; CL-level: cluster-level covariates, xj; Ind-level: individual-level covariates, zjk; Unadj: unadjusted test; Adj-C: test adjusted for the covariates on the horizontal axis (with individual-level covariates aggregated at the cluster level); Adj-I: test adjusted for the covariates on the horizontal axis (with actual individual-level covariates); Fully Adj-I: test adjusted for all four covariates (with actual individual-level covariates).
TABLE 5.
Power for the global hypothesis under simple randomization (SR) versus constrained randomization (CR) with candidate set sizes of 50% and 10% of the randomization space and of exactly 100 schemes. All covariates were used in constrained randomization and the adjusted tests; constrained randomization was implemented using the l2 metric; ICC = 0.05; alpha level = 5%; power values corresponding to non-nominal type I errors are shaded out.
| # of clusters per arm | Analysis-based adjustment | χ2-test: SR | χ2-test: CR (50%) | χ2-test: CR (10%) | χ2-test: CR (100) | F-test: SR | F-test: CR (50%) | F-test: CR (10%) | F-test: CR (100) | Randomization test: SR | Randomization test: CR (50%) | Randomization test: CR (10%) | Randomization test: CR (100) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g = 10 | Unadj | 0.525 | 0.489 | 0.433 | 0.385 | 0.464 | 0.425 | 0.355 | 0.293 | 0.469 | 0.642 | 0.826 | 0.960 |
| g = 10 | Adj-C | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| g = 10 | Adj-I | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| g = 5 | Unadj | 0.322 | 0.240 | 0.159 | 0.076 | 0.224 | 0.152 | 0.084 | 0.028 | 0.221 | 0.314 | 0.429 | 0.551 |
| g = 5 | Adj-C | 0.982 | 0.994 | 0.997 | 0.998 | 0.951 | 0.975 | 0.989 | 0.993 | 0.850 | 0.923 | 0.970 | 0.956 |
| g = 5 | Adj-I | 0.996 | 0.998 | 0.999 | 0.999 | 0.989 | 0.995 | 0.996 | 0.997 | 0.958 | 0.984 | 0.992 | 0.967 |
| g = 3 | Unadj | 0.266 | 0.186 | 0.102 | 0.091 | 0.128 | 0.074 | 0.030 | 0.027 | 0.121 | 0.149 | 0.177 | 0.147 |
| g = 3 | Adj-C | 0.761 | 0.815 | 0.872 | 0.882 | 0.180 | 0.226 | 0.282 | 0.294 | 0.267 | 0.273 | 0.339 | 0.284 |
| g = 3 | Adj-I | 0.908 | 0.933 | 0.947 | 0.950 | 0.653 | 0.703 | 0.742 | 0.739 | 0.538 | 0.560 | 0.566 | 0.453 |
FIGURE 2.

Power for the global hypothesis under simple randomization (SR) versus constrained randomization (CR) with 2 balance metrics B(l2) and B(M). CR implemented using covariates indicated on the horizontal axis; candidate set size = 10% under CR; ICC = 0.05; alpha level = 5%; R-test: randomization test; CL-level: cluster-level covariates, xj; Ind-level: individual-level covariates, zjk; Unadj: unadjusted test; Adj-C: test adjusted for the covariates on the horizontal axis (with individual-level covariates aggregated at the cluster level); Adj-I: test adjusted for the covariates on the horizontal axis (with actual individual-level covariates); Fully Adj-I: test adjusted for all four covariates (with actual individual-level covariates).
5.1 |. Type I error rate
With regard to the type I error rate of the model-based tests, three distinct patterns were observed. First, for the global hypothesis, the model-based tests are conservative under CR if no analysis-based adjustment is performed in accordance with the design-based adjustment. The type I error rates of the unadjusted model-based tests decrease as the candidate set size decreases (Table 4). In addition, this conservative performance was observed regardless of the balance metric used and the combination of covariates constrained on (Figure 1). Second, appropriate analysis-based adjustment under CR brings the type I error rates of the test with appropriate degrees of freedom (i.e., the F-test for the global hypothesis) back to the nominal level. The performance of the adjusted F-test is similar to that under simple randomization when g = 10 or 5 (Table 4). Similar patterns were observed for the pairwise hypotheses (Web Appendix D). Third, the χ2-test for the global hypothesis is anti-conservative, with type I error rates higher than 0.0543 (the acceptable upper bound accounting for Monte Carlo error) even with g = 10, and these rates are inflated to over 15% when g = 3 (Table 4). For this reason, the z-test, which is the counterpart of the χ2-test for the pairwise hypotheses (H0: δ1 = 0 and H0: δ2 = 0), was not considered further in our simulations. On the other hand, the F-test and t-test provide adequate small-sample correction and maintain the correct size of type I error rates after analysis-based adjustment except for g = 3, as shown in Table 4 and Web Table 2, as well as Figure 1 and Web Figures 2–3. This is because the small-sample correction is achieved by modifying the (denominator) degrees of freedom and each cluster-level covariate counts towards those degrees of freedom.
As a result, the F-test and t-test adjusted for all covariates aggregated at the cluster level fail to achieve the correct size of type I error rate due to the lack of (denominator) degrees of freedom when the number of clusters is very small (e.g., g = 3).
Different patterns were observed for the randomization test. For most scenarios, the proposed randomization test maintains the nominal type I error rate for the global hypothesis under both SR and CR, regardless of analysis-based adjustments. Throughout this article, we refer to analysis-based adjustment as the explicit adjustment of covariates in LMM (3) for either model-based or randomization-based inference. Note that the covariates used in the design phase are already implicitly adjusted for in the randomization tests through the constrained randomization space used to create the empirical distribution of the test statistics. The unadjusted randomization test has similar performance to that of the adjusted versions, indicating the validity of a randomization test under CR does not rely on analysis-based adjustment (Table 4 and Figure 1). Similar results were observed for the pairwise hypotheses (Web Appendix D). This is in sharp contrast to the model-based tests, which heavily depend on adequate analysis-based adjustment for validity. However, two other important factors have an impact on the performance of the randomization test, both of which are related to the number of randomization schemes used to construct the exact randomization distribution of the test statistics. First, the size of the constrained subspace should not be extremely small. When the candidate set size is only 100, the type I error rates for testing the global hypothesis become lower than 0.05. The pairwise hypothesis was observed to be even more sensitive to the candidate set size. For example, when g = 5, the type I error rates for the pairwise hypothesis start to drop substantially below the nominal level even when the candidate set is 10% of the full randomization space (Web Table 2). 
This is because the randomization distribution for the pairwise hypothesis depends on the clusters in the control arm and only one of the active treatment arms, resulting in a limited number of possible schemes if the constrained randomization space is already small. Second, the number of clusters per arm is a key determinant of the size of the full randomization space and the constrained randomization space, and therefore can compromise the rejection rate of the randomization tests with a very small number of clusters. An extreme example arises from the pairwise hypothesis when g = 3. The full randomization space for a test comparing two of the three arms contains only 6!/(3! × 3!) = 20 schemes. With such a limited randomization space, the rejection of any null hypothesis at the 5% significance level is not feasible, let alone further reduction of the number of acceptable randomization schemes by constrained randomization.
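The sizes of the randomization spaces discussed here follow from simple counting, as in the sketch below for balanced designs under simple randomization. The function names `full_space_size` and `pairwise_space_size` are hypothetical. The smallest p-value a randomization test can produce is at best 1 divided by the number of schemes, and it is larger when several schemes share the most extreme value of the statistic (as can happen for a two-sided statistic under mirror-image allocations).

```python
from math import comb, factorial

def full_space_size(g, c=3):
    """Number of ways to allocate c * g clusters to c arms of g clusters each."""
    return factorial(c * g) // factorial(g) ** c

def pairwise_space_size(g):
    """Number of schemes when only two arms (2 * g clusters) are permuted."""
    return comb(2 * g, g)

for g in (3, 5, 10):
    n_full = full_space_size(g)
    n_pair = pairwise_space_size(g)
    # 1 / n_pair is a lower bound on the attainable p-value of the pairwise test.
    print(f"g={g}: full={n_full}, pairwise={n_pair}, min p >= {1 / n_pair:.6f}")
```

For g = 3 this gives 1680 schemes for the full space and 20 for the pairwise comparison, so the pairwise test has essentially no room to reject at the 5% level even before any constraint is applied.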
The combinations of the analysis-based and design-based adjustment strategies listed in Table 3 were compared with respect to type I error rate. When constrained randomization is used in the design phase, adjustment for the corresponding covariates in the LMM is required for the validity of model-based tests, but not for randomization-based tests. Further adjustment for covariates beyond those used in CR does not affect the type I error rate. Adjustment for covariates using individual-level data (i.e., Adj-I and Fully Adj-I) or cluster-level aggregates (i.e., Adj-C and Fully Adj-C) has little impact on the type I error rate, except for g = 3, where there is an insufficient number of clusters to support the between-within denominator degrees of freedom in the model-based tests. For this reason, the fully adjusted tests using cluster-level aggregates were excluded from the figures. Finally, the two balance metrics have little impact on both the model-based and randomization-based tests under constrained randomization (see the CR panels of Figure 1 and Web Figures 2–3).
5.2 |. Power
In this section, we describe the performance of the methods under comparison in terms of power, with a particular interest in methods that maintain nominal type I error rates. With regard to power for the global hypothesis, the performance of model-based and randomization-based tests differs in many of the scenarios considered. First, power for the unadjusted model-based tests (i.e., χ2-test and F-test) decreases as the candidate set size decreases under CR (see Table 5), indicating that design-based adjustment alone does not achieve optimal power for cRCTs if results are analyzed using the model-based tests. Moreover, these tests have already been shown to be overly conservative in Section 5.1. Second, after appropriate analysis-based adjustment for the covariates in accordance with the design-based adjustment, the adjusted model-based tests can achieve higher power than the adjusted randomization-based tests when the number of clusters per arm is fairly limited (say g = 3 or 5). With more clusters per arm (g = 10), the adjusted model-based tests and the randomization-based tests demonstrate similar levels of power, confirming the properties of the UMPR and LMPR tests derived in Section 3.2. Moreover, given that the power for the global hypothesis under g = 10 reaches the maximum, we present power under g = 10 with reduced effect sizes in Web Appendix G to further support this argument. Relatedly, adjustment for prognostic covariates in the model-based tests can lead to considerably increased power even under SR (see Table 5 and the left panel of Figure 2). In this case, constrained randomization can mildly improve power over simple randomization for the adjusted model-based tests. The results for the pairwise hypotheses (H0: δ1 = 0 and H0: δ2 = 0) are similar (see Web Appendix D).
In contrast to the model-based tests, increased power is observed for randomization tests under CR versus SR regardless of analysis-based adjustment, as long as there is a sufficient number of randomization schemes to ensure a valid randomization test. For example, a large improvement in power is seen in the unadjusted randomization test, for which the power under CR increases as the candidate set size decreases in most scenarios, and can be about twice that under SR in the case where g = 10 (0.960 versus 0.469; see top right of Table 5). This implies that design-based adjustment alone could achieve adequate power with randomization-based inference. However, this power gain for the randomization test from design-based adjustment does not reach the level attained by analysis-based adjustment. Despite the improvement in power under CR, our results demonstrate the importance of the validity of the randomization test, which depends heavily on whether an adequate number of randomization schemes is available. This is jointly determined by the number of clusters and the size of the constrained randomization space. In particular, power for the randomization tests does not always increase monotonically as the candidate set size decreases, especially when g ≤ 10. This is because overly constraining can reduce the number of acceptable randomization schemes available to construct the exact distribution of the randomization test statistic. Moreover, when the number of clusters is small, for example g = 3, the power for randomization tests becomes unacceptably low under either SR or CR, as there are insufficient clusters to enumerate a randomization space that is large enough to make valid inference. Importantly, the pairwise hypothesis is particularly sensitive to the number of clusters and the size of the constrained randomization space, compared with the global hypothesis (see Web Appendix D).
Besides the differences between model-based and randomization-based methods, similar patterns for the design-based and analysis-based adjustment strategies in Table 3 were found across all tests. In general, constrained randomization improves power, given proper analysis-based adjustment for the covariates used in constrained randomization. In addition, when CR was performed, further adjustment in the analysis phase for covariates associated with the outcome beyond those used in constrained randomization (i.e., Fully Adj-I vs Adj-I, Fully Adj-C vs Adj-C) could gain additional power. Notably, the adjusted tests are more powerful than their unadjusted versions under both SR and CR, and power improves as more covariates known to be predictive of the outcome are adjusted for. Moreover, among the analysis-based adjustment strategies, adjusting for individual-level continuous covariates using their aggregated versions (i.e., Adj-C and Fully Adj-C) is less powerful than keeping them at the individual level (i.e., Adj-I and Fully Adj-I). For this reason, the fully adjusted tests using cluster-level aggregates were excluded from the figures to simplify presentation. Finally, the balance metrics have little impact on power, as shown in Figure 2 and Web Figures 4–5.
5.3 |. Results under different ICCs and multiplicity adjustment
Results under ICCs of 0.10 and 0.01 are presented in Web Appendix E. We focused on the comparison of the performance of the design-based and analysis-based adjustment strategies within each level of ICC. Results under multiplicity adjustment (alpha = 2.5%) for the tests of the two pairwise hypotheses (H0: δ1 = 0 and H0: δ2 = 0) are summarized in Web Appendix F. Results in these settings were similar to those in Sections 5.1 and 5.2. That is, when constrained randomization is performed, model-based tests depend on adjustment in the models for the covariates used in constrained randomization to maintain the nominal type I error rate, while the randomization tests do not require this adjustment. Constrained randomization, as compared to simple randomization, improves power for properly adjusted model-based tests and for randomization tests regardless of analysis-based adjustment. Further adjustment for prognostic covariates in the analysis phase improves power for both model-based and randomization tests. Analysis-based adjustment with the individual-level data, whenever available, is preferred over using the cluster-level aggregates, even though the latter perfectly matches the design-based adjustment through constrained randomization. The two balance metrics have little impact on the results.
5.4 |. Results under alternative DGP
Results under non-normal DGP were presented in Web Appendix H. In the case where either the random cluster effect γj or error term ϵjk follows the standard Cauchy distribution instead of the normal distribution, the randomization test has superior performance over the model-based test. In particular, when the LMM assumption fails to hold, the model-based test fails to maintain the nominal type I error rate, regardless of adjustment strategies under CR, and yields consistently smaller power than the randomization test. These results hold for both the UMPR and LMPR, depending on the hypothesis of interest, demonstrating that the randomization test is in general more robust against violations of the LMM assumptions, compared to the model-based tests.
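As a rough illustration of the heavy-tailed setting described above, the sketch below draws clustered outcomes in which the cluster effect γj or the error term ϵjk can be switched from normal to standard Cauchy. The function names, the unit total variance under normality, and the omission of covariates and treatment effects are simplifying assumptions made for illustration; this is not the data-generating process used in the simulation study.

```python
import math
import random

def standard_cauchy(rng):
    """Draw from the standard Cauchy distribution by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def simulate_outcomes(n_clusters, cluster_size, icc, rng,
                      cauchy_cluster=False, cauchy_error=False):
    """Return a list of clusters, each a list of outcomes y_jk = gamma_j + eps_jk.

    Under normality the variances are icc and 1 - icc (an assumed scaling).
    Treatment effects and covariates are deliberately left out.
    """
    sd_gamma, sd_eps = math.sqrt(icc), math.sqrt(1.0 - icc)
    data = []
    for _ in range(n_clusters):
        gamma = standard_cauchy(rng) if cauchy_cluster else rng.gauss(0.0, sd_gamma)
        cluster = []
        for _ in range(cluster_size):
            eps = standard_cauchy(rng) if cauchy_error else rng.gauss(0.0, sd_eps)
            cluster.append(gamma + eps)
        data.append(cluster)
    return data

rng = random.Random(2022)
normal_data = simulate_outcomes(15, 20, icc=0.10, rng=rng)
heavy_data = simulate_outcomes(15, 20, icc=0.10, rng=rng, cauchy_cluster=True)
```

Because the Cauchy distribution has no finite mean or variance, the ICC argument only governs the normal components; it has no variance-ratio interpretation once either component is heavy-tailed.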
6 |. DISCUSSION
Constrained randomization is a useful tool in cRCTs to balance multiple baseline covariates, thus protecting the internal validity of a trial. Previous simulation studies have investigated the considerations on constrained randomization and subsequent statistical analysis in two-arm parallel cRCT settings.17,18 While extensions of constrained randomization to three-arm parallel cRCTs were recently pursued,19,8 investigations on the choice of design parameters are currently limited, and the statistical tests previously considered were restricted to linear mixed models.8 Motivated by a recent multi-arm cRCT,7 we investigated the application of constrained randomization and provided a comprehensive evaluation of its statistical properties. For the design of multi-arm cRCTs, we proposed two alternative balance metrics (the maximum pairwise l2 metric and the maximum Mahalanobis distance metric) for the implementation of constrained randomization, and provided a detailed discussion of the impact of these alternative balance metrics as well as different sizes of the constrained space. For statistical inference under multi-arm cRCTs, we extended the theory of optimal randomization tests by developing new randomization test statistics and articulating the different randomization procedures and spaces required for testing the global and pairwise hypotheses. A comparison between the model-based and randomization-based tests was carried out to elucidate the performance of each statistical analysis approach under constrained randomization and to generate practical recommendations.
Our simulation study demonstrated that when the baseline covariates are balanced through design-based adjustment via constrained randomization, both the model-based and randomization-based analyses could potentially gain power and maintain the nominal type I error rate. However, this desirable property can depend on proper adjustment in the analyses for the baseline covariates being constrained upon, especially for the model-based tests. For the model-based analyses, constrained randomization without corresponding analysis-based adjustment leads to conservative inference with no improvement in power. With appropriate adjustment of the covariates used in constrained randomization, model-based analyses were able to achieve greater power compared to simple randomization, while maintaining the nominal type I error rate. This finding is consistent with that in previous cRCT simulations17,18,8 as well as that in individually randomized trials.53,12,54 Among the model-based analyses, the Wald F-test (for the global hypothesis) and t-test (for the pairwise hypothesis) are preferred over the χ2-test and z-test for their ability to carry the correct test size, especially in multi-arm cRCTs with a limited number of clusters per arm. However, the model-based test fails to maintain the nominal type I error rate when the normality assumption does not hold. For randomization-based analysis (randomization test), we provide compelling evidence that substantial power could be obtained and the nominal type I error rate can be well maintained under constrained randomization. Notably, the power of the randomization test given sufficient randomization space is similar to that of the model-based test coupled with the correct modeling assumptions. In contrast to the model-based test, the randomization-based test does not depend on correct distributional assumptions and analysis-based adjustment in the LMM to maintain valid inference in terms of type I error rate.
Although analysis-based adjustment is not necessary for the randomization test under constrained randomization, the statistical power could be further improved if the covariates being constrained on are adjusted in the analyses. Moreover, when covariates are adjusted for in the analysis phase, the improvement in power as a result of constrained randomization becomes relatively modest. In addition, further improvement of power could be achieved through the analysis-based adjustment of additional covariates that are not used in the design but predict the outcome. Finally, for both model-based and randomization-based analyses, we have advocated for analysis-based adjustment of individual-level covariates instead of their cluster-level aggregates, whenever applicable, to achieve higher power. This is particularly appealing in small cRCTs, because the adjustment of cluster-level surrogates further consumes the already highly limited cluster-level degrees of freedom, which may lead to non-nominal error rate and reduced power. However, individual-level adjustment requires the exclusion or imputation of missing covariate data, while cluster-level adjustment may be less prone to this issue.55 In general, analysis-based adjustment of strong prognostic covariates is recommended, even after design-based adjustment via constrained randomization. However, we also recognize that it may be challenging to pre-specify the “correct” set of prognostic covariates to include in the analysis. 
In individually randomized trials analyzed by analysis of covariance models, previous work has shown that adjusting for baseline covariates generally does not lead to asymptotic efficiency loss for estimating the average treatment effect.56,57,58 In two-arm cRCTs analyzed by linear mixed models, however, the efficiency implications of covariate adjustment may depend on the correct specification of the covariance structure and are generally more complicated.59 Additional research and guidance are also needed on optimal ways to select baseline covariates for adjustment in multi-arm cRCTs.
Compared to two-arm cRCTs, multi-arm cRCTs may embody a more diverse combination of hypotheses for treatment effects, including but not limited to the global hypothesis comparing all treatments and the pairwise hypothesis evaluating the effects of each treatment arm compared to the usual care. For model-based analyses, these hypotheses can be tested with the readily available output of the linear mixed model fit with mainstream software. In contrast, the randomization test can proceed differently according to the type of hypothesis being tested. Specifically, the test for the global hypothesis should be referenced against the randomization space where all treatment arms are permuted, whereas the test for the pairwise hypothesis should be carried out against the subspace where only the treatments under comparison are permuted (i.e., a conditional permutation). Therefore, as demonstrated by our simulation results, the randomization test for the pairwise hypothesis is more sensitive to small cRCTs and a tight constrained randomization space, because of the limited allocation schemes after conditioning on the observed treatment assignments for clusters receiving other treatments. On the other hand, with a sufficient number of clusters such as g = 10 and as long as the constrained randomization space is not too tight, the randomization test we developed carries similar power to the model-based test. This highlights one caveat for using randomization tests in small multi-arm cRCTs with CR: even though they have better control of type I error rates, they may fail to provide a powerful test due to an insufficient number of permutations supported by the constrained randomization design.
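The distinction between the two reference sets can be sketched in a few lines. The example below assumes three arms with three clusters each, uses the full (unconstrained) randomization space, and works with cluster-level mean outcomes. The test statistics (a between-arm sum of squares and an absolute difference in arm means) are simple stand-ins chosen for illustration, not the UMPR or LMPR statistics; all function names and the toy outcome values are hypothetical.

```python
import itertools

def full_space(n_arms, g):
    """All allocations of n_arms * g clusters to labelled arms of size g."""
    labels = [arm for arm in range(n_arms) for _ in range(g)]
    return sorted(set(itertools.permutations(labels)))

def pairwise_reference(space, observed, arm_a, arm_b):
    """Keep allocations that leave clusters in all other arms as observed,
    so that only the two arms under comparison are permuted."""
    fixed = [c for c, arm in enumerate(observed) if arm not in (arm_a, arm_b)]
    return [al for al in space if all(al[c] == observed[c] for c in fixed)]

def arm_mean(alloc, y, arm):
    values = [y[c] for c, a in enumerate(alloc) if a == arm]
    return sum(values) / len(values)

def pairwise_stat(alloc, y, arm_a, arm_b):
    # Illustrative statistic: absolute difference in arm means.
    return abs(arm_mean(alloc, y, arm_a) - arm_mean(alloc, y, arm_b))

def global_stat(alloc, y, n_arms):
    # Illustrative statistic: between-arm sum of squared deviations.
    grand = sum(y) / len(y)
    return sum((arm_mean(alloc, y, a) - grand) ** 2 for a in range(n_arms))

def p_value(stat, reference, observed):
    """Share of the reference set with a statistic at least as extreme."""
    obs = stat(observed)
    return sum(stat(al) >= obs - 1e-12 for al in reference) / len(reference)

observed = (0, 0, 0, 1, 1, 1, 2, 2, 2)              # arm label per cluster
y = [0.1, 0.3, 0.2, 1.1, 1.4, 1.2, 2.0, 2.3, 2.1]  # toy cluster means

space = full_space(3, 3)
ref_01 = pairwise_reference(space, observed, 0, 1)
p_global = p_value(lambda al: global_stat(al, y, 3), space, observed)
p_pair = p_value(lambda al: pairwise_stat(al, y, 0, 1), ref_01, observed)
print(len(space), len(ref_01), p_global, p_pair)
```

Under constrained randomization, `space` would be replaced by the retained subset of allocations, which shrinks the conditional reference set for the pairwise test further.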
These findings led us to believe that appropriate statistical inference can be challenging in (multi-arm) cRCTs with a very small number of clusters per arm, even after constrained randomization that balances the baseline covariates. Such trials are not uncommon, as 8 studies with at most 3 clusters per condition were identified in the systematic review by Varnell et al.60 and 8 studies with at most 5 clusters per condition were identified in a recent review of cRCTs with binary outcomes by Turner et al.61 We caution against the design and analysis of such trials for the following reasons. First, randomization-based inference may not be feasible at the 0.05 significance level, or may not be sufficiently powerful, because there are too few possible allocation schemes to construct the randomization distribution. For example, when g = 3, the type I error rate and power for the pairwise comparison are exactly zero, because rejecting any null hypothesis at the 5% significance level based on p-value < 0.05 is infeasible with a full randomization space of size C(6, 3) = 20, for which the smallest attainable p-value is 1/20 = 0.05. Although the adjusted F-test and t-test provided acceptable power and carried the desired type I error rate when g = 3, this evidence may not be generalizable to other small cRCTs, as the data generation process in our simulation studies fully satisfies the distributional assumptions of linear mixed models. Importantly, small sample corrections for the degrees of freedom can be used to ensure a reasonable type I error rate, and the randomization test can be considered as a flexible randomization-based alternative. However, theoretical guidance on the choice of small sample correction methods varies with multiple factors and the current methods may not always achieve the appropriate test size,8 while the randomization test may be invalid when the number of clusters is extremely small.
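The counting argument behind this feasibility limit can be checked directly. The sketch below assumes a three-arm trial with g clusters per arm, a pairwise test that permutes only the 2g clusters in the two arms being compared, and a full (unconstrained) randomization space; the function names are illustrative.

```python
from math import comb, factorial

def global_space_size(n_arms, g):
    """Ways to allocate n_arms * g clusters to labelled arms of size g."""
    return factorial(n_arms * g) // factorial(g) ** n_arms

def pairwise_space_size(g):
    """Ways to split the 2g clusters of the two compared arms into two arms."""
    return comb(2 * g, g)

def min_p_value(space_size):
    """Smallest attainable p-value: only the observed allocation is as extreme."""
    return 1 / space_size

for g in (3, 5, 10):
    n_pair = pairwise_space_size(g)
    rejectable = min_p_value(n_pair) < 0.05
    print(g, global_space_size(3, g), n_pair, rejectable)
```

A two-sided statistic that is symmetric in the two arm labels doubles the smallest attainable p-value, and constrained randomization reduces the reference set further, so the bound computed here is the most favourable case.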
Furthermore, a small number of clusters limits the ability to adjust for cluster-level covariates in the analyses. The between-within denominator degrees of freedom of the F-test and the degrees of freedom of the t-test are calculated as the difference between the total number of clusters and the total number of cluster-level parameters, including the treatment arms and cluster-level covariates. In such situations, any model-based test adjusting for an excessive number of covariates is expected to be invalid due to insufficient degrees of freedom. With insufficient analysis-based adjustment, the model-based test becomes overly conservative. For these reasons, even though constrained randomization could achieve balance in very small cRCTs, we caution against the use of multi-arm cRCTs with a very limited number of clusters.
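To make the degrees-of-freedom constraint concrete, the following sketch applies the stated rule (clusters minus cluster-level parameters). It assumes one parameter per treatment arm (an intercept plus one contrast for each remaining arm) and one parameter per cluster-level covariate; that counting convention and the function name are assumptions for illustration.

```python
def between_within_df(n_arms, clusters_per_arm, n_cluster_covariates):
    """Clusters minus cluster-level fixed-effect parameters.

    Assumes one parameter per arm (intercept plus n_arms - 1 contrasts)
    and one per cluster-level covariate.
    """
    total_clusters = n_arms * clusters_per_arm
    n_params = n_arms + n_cluster_covariates
    return total_clusters - n_params

for g in (3, 5, 10):
    print(g, [between_within_df(3, g, p) for p in (0, 2, 4, 6)])
```

Under these assumptions, a three-arm trial with three clusters per arm starts with only six degrees of freedom, and each additional cluster-level covariate removes one of them.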
The two balance metrics considered in our study show little difference in terms of type I error and power. This similarity is anticipated since the two balance metrics under comparison only differ in whether the correlation among the baseline covariates is incorporated in the balance score. Compared to the choice of a balance metric, the size of the constrained randomization subspace has a more crucial impact on the analysis. Typically, power gain is achieved with a smaller constrained subspace, but this does not suggest a monotone inverse relationship. In fact, we observed in our study that overly constrained randomization can decrease the power of the tests. In addition, this impact manifests more in the pairwise hypothesis compared with the global hypothesis. In our simulations with g = 5 and g = 10, a constrained randomization space of q = 0.1 works fairly well in terms of type I error and power for the global hypothesis. For the pairwise hypothesis, however, a larger q = 0.5 is necessary when g = 5 to provide sufficient allocation schemes, especially when it comes to the application of the randomization test. For conducting multi-arm cRCTs, it would be useful to develop an algorithm to check the validity of the constrained randomization and to avoid an overly constrained randomization space, perhaps along the lines of Moulton.14
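The interplay between the balance score, the candidate set size q, and the number of allocations left for a pairwise test can be explored with a small sketch. The code below assumes three arms with three clusters each and two cluster-level covariates, and uses an unweighted squared-distance score on standardized covariate means as one plausible reading of a maximum pairwise l2 metric; the exact metric, the toy covariates, and all function names are assumptions. The final check is a crude count of conditional reference sets, not Moulton's procedure.

```python
import itertools
import random
from statistics import mean, pstdev

def enumerate_allocations(n_clusters, n_arms):
    """Yield every allocation to equally sized labelled arms as a label tuple."""
    g = n_clusters // n_arms
    def assign(remaining, arm):
        if arm == n_arms - 1:
            yield {c: arm for c in remaining}
            return
        for chosen in itertools.combinations(remaining, g):
            rest = [c for c in remaining if c not in chosen]
            for tail in assign(rest, arm + 1):
                yield {**{c: arm for c in chosen}, **tail}
    for mapping in assign(list(range(n_clusters)), 0):
        yield tuple(mapping[c] for c in range(n_clusters))

def standardize(x):
    """Scale each covariate by its SD across clusters (assumes SD > 0)."""
    n_cov = len(x[0])
    mus = [mean([row[k] for row in x]) for k in range(n_cov)]
    sds = [pstdev([row[k] for row in x]) for k in range(n_cov)]
    return [[(row[k] - mus[k]) / sds[k] for k in range(n_cov)] for row in x]

def max_pairwise_l2(alloc, z, n_arms):
    """Largest squared distance between standardized arm means over arm pairs."""
    n_cov = len(z[0])
    means = [[mean([z[c][k] for c, a in enumerate(alloc) if a == arm])
              for k in range(n_cov)] for arm in range(n_arms)]
    return max(sum((means[a][k] - means[b][k]) ** 2 for k in range(n_cov))
               for a, b in itertools.combinations(range(n_arms), 2))

def constrained_space(x, n_arms, q):
    """Retain the fraction q of allocations with the best balance scores."""
    z = standardize(x)
    scored = sorted(enumerate_allocations(len(x), n_arms),
                    key=lambda al: max_pairwise_l2(al, z, n_arms))
    return scored[:max(1, int(q * len(scored)))]

def smallest_pairwise_reference(space, n_arms):
    """Smallest conditional reference set over retained allocations and arm pairs."""
    sizes = []
    for observed in space:
        for a, b in itertools.combinations(range(n_arms), 2):
            fixed = [c for c, arm in enumerate(observed) if arm not in (a, b)]
            sizes.append(sum(all(al[c] == observed[c] for c in fixed)
                             for al in space))
    return min(sizes)

rng = random.Random(7)
x = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(9)]  # toy covariates
space = constrained_space(x, n_arms=3, q=0.1)
chosen = rng.choice(space)  # the allocation actually implemented
print(len(space), chosen, smallest_pairwise_reference(space, 3))
```

Rerunning the last lines with different values of q shows how quickly the smallest conditional reference set shrinks, which is the quantity that limits the pairwise randomization test.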
There are a few possible limitations of our study. First, our simulations are based on a balanced design. We assumed the same number of clusters per arm and the same number of individuals per cluster. Design balance at the cluster level is common with cRCTs and is essential to ensure the validity of the randomization test.26,62 In addition, the mixed-model methods are generally robust to unbalanced design at the cluster level.1 On the other hand, variable cluster size can be easily incorporated into our randomization tests without affecting their validity, whereas the appropriate choice of model-based tests and small-sample corrections may depend on the degree of cluster size variability in multi-arm cRCTs, which is an open question for further research. Second, we did not incorporate heterogeneous ICC structures across participating clusters. In some cases, the intervention effect may not be constant and will likely lead to higher ICC values for groups receiving intervention, giving rise to a random intervention model.1,63 Third, we only considered the F- and t-tests with cluster-level degrees of freedom as the small sample correction method in our simulation studies. Other correction methods have been thoroughly studied in Watson et al.8 As a complement, we demonstrated that the randomization test can also be a flexible alternative for controlling the type I error rate because it does not depend on further analysis-based adjustment or on any degrees-of-freedom correction. Finally, we limited the simulation studies to continuous outcomes. However, the relative performance of different design and analytical methods, the covariate adjustment strategies, and other practical suggestions should be largely generalizable to multi-arm cRCTs with binary outcomes.
More research is needed to evaluate constrained randomization and the subsequent analytical issues in multi-arm cRCTs with more complex ICC structures and different types of outcomes, in order to better guide statistical practice where balance is actively sought in the design phase.
ACKNOWLEDGEMENTS
The research presented in the manuscript was partially funded by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (NIH) under Award Number R01 AI141444 (PI: Dr. Wendy Prudhomme O’Meara). YZ and RAS were supported in part by the National Center For Advancing Translational Sciences of the NIH under Award Number UL1 TR002553. FL was supported in part by the National Center For Advancing Translational Sciences of the NIH under Award Number UL1 TR001863. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We thank Dr. Wendy Prudhomme O’Meara and Theodoor Visser for sharing the data set from the TESTsmART trial. We thank Xueqi Wang and John A. Gallis for helpful discussions and computational assistance. The authors also thank the Editor, Associate Editor and two anonymous reviewers for constructive comments and suggestions, which have greatly improved the paper.
Funding Information
National Institute of Allergy and Infectious Diseases, Grant/Award Number: R01 AI141444; National Center for Advancing Translational Sciences, Grant/Award Numbers: UL1 TR002553 & UL1 TR001863
Footnotes
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.
DATA AVAILABILITY STATEMENT
The data that motivate and inform the simulation studies in this article are available upon request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
References
- 1. Murray DM. Design and analysis of group-randomized trials. New York, NY: Oxford University Press. 1998.
- 2. Donner A, Klar N. Design and analysis of cluster randomization trials in health research. London: Arnold. 2000.
- 3. Eldridge S, Kerry S. A practical guide to cluster randomised trials in health services research. Chichester, West Sussex: John Wiley & Sons. 2012.
- 4. Campbell MJ, Walters SJ. How to design, analyse and report cluster randomised trials in medicine and health related research. Chichester, West Sussex: John Wiley & Sons. 2014.
- 5. Hayes RJ, Moulton LH. Cluster randomised trials. CRC Press. 2017.
- 6. Turner EL, Li F, Gallis JA, Prague M, Murray DM. Review of recent methodological developments in group-randomized trials: part 1—design. American Journal of Public Health 2017; 107(6): 907–915.
- 7. Woolsey AM, Simmons RA, Woldeghebriel M, et al. Incentivizing appropriate malaria case management in the private sector: a study protocol for two linked cluster randomized controlled trials to evaluate provider-and client-focused interventions in western Kenya and Lagos, Nigeria. Implementation Science 2021; 16: 14. doi: 10.1186/s13012-020-01077-w
- 8. Watson SI, Girling A, Hemming K. Design and analysis of three-arm parallel cluster randomized trials with small numbers of clusters. Statistics in Medicine 2020. doi: 10.1002/sim.8828
- 9. Ivers NM, Halperin IJ, Barnsley J, et al. Allocation techniques for balance at baseline in cluster randomized trials: a methodological review. Trials 2012; 13(1): 1–9.
- 10. Taljaard M, Teerenstra S, Ivers NM, Fergusson DA. Substantial risks associated with few clusters in cluster randomized and stepped wedge designs. Clinical Trials 2016; 13(4): 459–463.
- 11. Raab GM, Butcher I. Balance in cluster randomized trials. Statistics in Medicine 2001; 20(3): 351–365.
- 12. Kernan WN, Viscoli CM, Makuch RW, Brass LM, Horwitz RI. Stratified randomization for clinical trials. Journal of Clinical Epidemiology 1999; 52(1): 19–26.
- 13. Donner A, Klar N. Pitfalls of and controversies in cluster randomization trials. American Journal of Public Health 2004; 94(3): 416–422.
- 14. Moulton LH. Covariate-based constrained randomization of group-randomized trials. Clinical Trials 2004; 1(3): 297–305.
- 15. Gallis JA, Li F, Yu H, Turner EL. cvcrand and cptest: Commands for efficient design and analysis of cluster randomized trials using constrained randomization and permutation tests. The Stata Journal 2018; 18(2): 357–378.
- 16. Yu H, Li F, Gallis JA, Turner EL. cvcrand: A Package for Covariate-constrained Randomization and the Clustered Permutation Test for Cluster Randomized Trials. R Journal 2019; 9(2).
- 17. Li F, Lokhnygina Y, Murray DM, Heagerty PJ, DeLong ER. An evaluation of constrained randomization for the design and analysis of group-randomized trials. Statistics in Medicine 2016; 35(10): 1565–1579.
- 18. Li F, Turner EL, Heagerty PJ, Murray DM, Vollmer WM, DeLong ER. An evaluation of constrained randomization for the design and analysis of group-randomized trials with binary outcomes. Statistics in Medicine 2017; 36(24): 3791–3806.
- 19. Ciolino JD, Diebold A, Jensen JK, Rouleau GW, Koloms KK, Tandon D. Choosing an imbalance metric for covariate-constrained randomization in multiple-arm cluster-randomized trials. Trials 2019; 20(1): 293.
- 20. Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Statistics in Medicine 2002; 21(19): 2917–2930.
- 21. Turner EL, Prague M, Gallis JA, Li F, Murray DM. Review of recent methodological developments in group-randomized trials: part 2—analysis. American Journal of Public Health 2017; 107(7): 1078–1086.
- 22. Murray DM, Hannan PJ, Pals SP, McCowen RG, Baker WL, Blitstein JL. A comparison of permutation and mixed-model regression methods for the analysis of simulated data in the context of a group-randomized trial. Statistics in Medicine 2006; 25(3): 375–388.
- 23. McCaffrey DF, Griffin BA, Almirall D, Slaughter ME, Ramchand R, Burgette LF. A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Statistics in Medicine 2013; 32(19): 3388–3414.
- 24. Li F, Li F. Propensity score weighting for causal inference with multiple treatments. Annals of Applied Statistics 2019; 13(4): 2389–2415.
- 25. Morgan KL, Rubin DB. Rerandomization to improve covariate balance in experiments. The Annals of Statistics 2012; 40(2): 1263–1282.
- 26. Braun TM, Feng Z. Optimal permutation tests for the analysis of group randomized trials. Journal of the American Statistical Association 2001; 96(456): 1424–1432.
- 27. Pinheiro JC, Bates DM. Mixed-effects models in S and S-PLUS. New York: Springer. 2000.
- 28. Kahan BC, Forbes G, Ali Y, et al. Increased risk of type I errors in cluster randomised trials with small or medium numbers of clusters: a review, reanalysis, and simulation study. Trials 2016; 17(1): 438.
- 29. Kenward MG, Roger JH. Small sample inference for fixed effects from restricted maximum likelihood. Biometrics 1997: 983–997.
- 30. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometrics Bulletin 1946; 2(6): 110–114.
- 31. Murray DM, Blitstein JL, Hannan PJ, Baker WL, Lytle LA. Sizing a trial to alter the trajectory of health behaviours: methods, parameter estimates, and their application. Statistics in Medicine 2007; 26(11): 2297–2316.
- 32. Fisher RA. The Design of Experiments. Edinburgh, Oliver and Boyd. 1935.
- 33. Ernst MD. Permutation methods: a basis for exact inference. Statistical Science 2004: 676–685.
- 34. Good PI. Permutation tests: a practical guide to resampling methods for testing hypotheses. New York: Springer. 2000.
- 35. Edgington ES. Randomization tests. New York: M. Dekker. 1995.
- 36. Lehmann EL, Stein C, et al. On the theory of some non-parametric hypotheses. The Annals of Mathematical Statistics 1949; 20(1): 28–45.
- 37. Wang Y, Rosenberger WF, Uschner D. Randomization tests for multiarmed randomized clinical trials. Statistics in Medicine 2020; 39(4): 494–509.
- 38. Garthwaite PH. Confidence intervals from randomization tests. Biometrics 1996: 1387–1393.
- 39. Garthwaite PH, Jones M. A stochastic approximation method and its application to confidence intervals. Journal of Computational and Graphical Statistics 2009; 18(1): 184–200.
- 40. Rabideau DJ, Wang R. Randomization-based confidence intervals for cluster randomized trials. Biostatistics 2021; 22(4): 913–927.
- 41. Cox DR, Hinkley DV. Theoretical statistics. CRC Press. 1974.
- 42. Li F, Turner EL, Preisser JS. Sample size determination for GEE analyses of stepped wedge cluster randomized trials. Biometrics 2018; 74(4): 1450–1458.
- 43. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine 2019; 38(11): 2074–2102.
- 44. Small DS, Ten Have TR, Rosenbaum PR. Randomization inference in a group–randomized trial of treatments for depression: covariate adjustment, noncompliance, and quantile effects. Journal of the American Statistical Association 2008; 103(481): 271–279.
- 45. Juszczak E, Altman DG, Hopewell S, Schulz K. Reporting of multi-arm parallel-group randomized trials: extension of the CONSORT 2010 statement. JAMA 2019; 321(16): 1610–1620.
- 46. Li G, Taljaard M, Van den Heuvel ER, et al. An introduction to multiplicity issues in clinical trials: the what, why, when and how. International Journal of Epidemiology 2017; 46(2): 746–755.
- 47. Parker RA, Weir CJ. Non-adjustment for multiple testing in multi-arm trials of distinct treatments: Rationale and justification. Clinical Trials 2020; 17(5): 562–566.
- 48. Khan MS, Khan MS, Ansari ZN, et al. Prevalence of Multiplicity and Appropriate Adjustments Among Cardiovascular Randomized Clinical Trials Published in Major Medical Journals. JAMA Network Open 2020; 3(4): e203082–e203082.
- 49. Aickin M, Gensler H. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. American Journal of Public Health 1996; 86(5): 726–728.
- 50. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2019.
- 51. Coppock A, Cooper J, Fultz N. randomizr: Easy-to-Use Tools for Common Forms of Random Assignment and Sampling. R package version 0.20.0 2019.
- 52. Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team. nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1–151 2020.
- 53. Lewis JA. Statistical principles for clinical trials (ICH E9): an introductory note on an international guideline. Statistics in Medicine 1999; 18(15): 1903–1942.
- 54. Kahan BC, Morris TP. Reporting and analysis of trials using stratified randomisation in leading medical journals: review and reanalysis. BMJ 2012; 345.
- 55. Hooper R, Forbes A, Hemming K, Takeda A, Beresford L. Analysis of cluster randomised trials with an assessment of outcome at baseline. BMJ 2018; 360.
- 56. Zeng S, Li F, Wang R, Li F. Propensity score weighting for covariate adjustment in randomized clinical trials. Statistics in Medicine 2021; 40(4): 842–858.
- 57. Yang L, Tsiatis AA. Efficiency study of estimators for a treatment effect in a pretest–posttest trial. The American Statistician 2001; 55(4): 314–321.
- 58. Wang B, Ogburn EL, Rosenblum M. Analysis of covariance in randomized trials: More precision and valid confidence intervals, without model assumptions. Biometrics 2019; 75(4): 1391–1400.
- 59. Wang B, Harhay MO, Small DS, Morris TP, Li F. On the robustness and precision of mixed-model analysis of covariance in cluster-randomized trials. arXiv preprint arXiv:2112.00832 2021.
- 60. Varnell SP, Murray DM, Janega JB, Blitstein JL. Design and analysis of group-randomized trials: a review of recent practices. American Journal of Public Health 2004; 94(3): 393–399.
- 61. Turner EL, Platt AC, Gallis JA, et al. Completeness of reporting and risks of overstating impact in cluster randomised trials: a systematic review. The Lancet Global Health 2021; 9(8): e1163–e1168.
- 62. Gail MH, Mark SD, Carroll RJ, Green SB, Pee D. On design considerations and randomization-based inference for community intervention trials. Statistics in Medicine 1996; 15(11): 1069–1092.
- 63. Hemming K, Taljaard M, Forbes A. Modeling clustering and treatment effect heterogeneity in parallel and stepped-wedge cluster randomized trials. Statistics in Medicine 2018; 37(6): 883–898.
