Author manuscript; available in PMC 2022 Apr 1.
Published in final edited form as: J Pain Symptom Manage. 2020 Nov 25;61(4):858–863. doi: 10.1016/j.jpainsymman.2020.11.019

Two Questions about the Design of Cluster Randomized Trials: A Tutorial

Gregory P Samsa 1,2, Joseph G Winger 3, Christopher E Cox 4, Maren K Olsen 1,5
PMCID: PMC8009809  NIHMSID: NIHMS1649520  PMID: 33246075

Abstract

This is a short tutorial on two key questions that pertain to cluster randomized trials (CRTs): (1) Should I perform a CRT? and (2) If so, how do I derive the sample size? In summary, a CRT is the best option when you “must” (e.g., the intervention can only be administered to a group) or you “should” (e.g., because of issues such as feasibility and contamination). CRTs are less statistically efficient and usually more logistically complex than individually randomized trials, and so reviewing the rationale for their use is critical. The most straightforward approach to the sample size calculation is to first perform the calculation as if the design were randomized at the level of the patient, and then to inflate this sample size by multiplying by the “design effect”, which quantifies the degree to which responses within a cluster are similar to one another. Although trials with large numbers of small clusters are more statistically efficient than those with a few large clusters, trials with large clusters can be more feasible. Also, if results are to be compared across individual sites, then sufficient sample size will be required to attain adequate precision within each site. Sample size calculations should include sensitivity analyses, as inputs from the literature can lack precision. Collaborating with a statistician is essential. To illustrate these points, we describe an ongoing CRT testing a mobile-based app to systematically engage families of intensive care unit patients and help intensive care unit clinicians deliver needs-targeted palliative care.

Keywords: Statistics, study design, cluster randomized trials, sample size calculation

Introduction

This paper is a short tutorial on two key questions that pertain to cluster randomized trials (CRTs): (1) Should I perform a CRT? and (2) If so, how do I derive the sample size? Our goal is to summarize existing knowledge with an emphasis on palliative care. This tutorial is not intended to replace collaboration with a statistician – which, ideally, should occur as early in the design process as possible – but rather as preparation for such a collaboration. This tutorial also does not comprehensively describe cluster trial designs or analytic considerations, as a recent two-part series by Turner et al offers a detailed review of both topics.1,2 To illustrate our points, we describe an ongoing CRT testing a mobile-based app to engage families of intensive care unit (ICU) patients and help ICU clinicians deliver needs-targeted palliative care.

What is a cluster randomized trial?

In a CRT, randomization takes place at the level of the group rather than the individual.3 The following are example CRTs from the palliative care literature:

  • In a study of patients with advanced cancer, Bernacki et al compared usual care with a multi-component structured communication intervention. Strata corresponded to disease centers (e.g., breast, gastrointestinal), and clusters were operationally defined as an organizational unit of clinicians within a disease center, typically with one nurse practitioner and two to three physicians. Within strata, the clusters were randomized to intervention or usual care.4

  • In a study of patients with advanced dementia, Agar et al randomized nursing homes to either usual care or facilitated family case conferencing.5

  • Ringdal et al stratified community health care districts according to pairs of rural/urban status and the number of inhabitants above 60 years of age, and within pairs the districts were randomized to usual care or a family-centric palliative care intervention.6

In each of these examples, the randomization was performed not at the level of the individual patient but, instead, among groups of patients (e.g., clinics). Randomization at the level of the group is the defining characteristic of a CRT.3

CRTs can be subdivided into parallel group designs without a baseline period, parallel group designs with a baseline period, and stepped wedge designs.1 In a parallel group design without a baseline period, clusters are randomized to the intervention or usual care. (More generally, clusters are randomized into study groups. For simplicity, we will assume that there are two groups being studied: intervention and usual care.) In a parallel group design with a baseline period, each cluster is observed under usual care during a baseline period and is then randomly assigned to intervention or usual care. In a stepped wedge design, each cluster is observed under usual care during a baseline period and then transitions to the intervention at a randomly selected time. Hemming et al provide a generally accessible introduction to stepped wedge designs, including the pros and cons of these designs.7

Usually (but not always), in a CRT the inference is performed at the level of the individual patient. For example, a pain coping intervention implemented at the level of the clinic is intended to reduce pain and improve quality of life among individual patients. We do not discuss the analysis of CRTs here; suffice it to say that what is required is a statistical approach that uses the patient as the unit of analysis while appropriately accounting for clustering. Readers interested in specific analytic considerations should see Turner et al.2 Additionally, an excellent introduction for statisticians to the statistical issues pertaining to CRTs is the review by Eldridge et al.8

When should I perform a cluster randomized trial?

Pladevall et al summarize possible reasons for performing a CRT.9 The following are examples specific to the palliative care setting:

  • A palliative care intervention is implemented at a higher organizational level than the individual (e.g., addition of a specialist into a practice), and thus it is natural to use this same organizational level as the basis for defining the clusters.

  • It may not be appropriate or possible to randomize individuals. For example, if a family conferencing intervention is introduced under patient-level randomization it might be considered unethical to deny usual care patients within that practice access to the intervention. Alternatively, patients might be unwilling to consent to being randomized to usual care.

  • The quantity being studied may be the “standard of care” at the participating practices, which is inherently a group-level construct.

  • Logistical considerations might suggest that locating and randomizing individuals is prohibitively costly.

  • Randomizing at the level of the individual induces a risk of significant contamination. For example, providers who see both intervention and usual care patients might transfer some elements of the intervention to all their patients. Patients talk among themselves, as do providers. Contamination can be a point of emphasis among grant reviewers, so a CRT might realistically be the path of least resistance. From a purely statistical perspective, contamination will tend to bias efficacy estimates toward the null,10 and this reduces statistical power in comparison with a study design without contamination.

In summary, a CRT is the best option when you “must” (e.g., the intervention can only be administered to a group) or you “should” (e.g., most typically, because of issues such as feasibility and contamination). CRTs are less statistically efficient and, usually, more logistically complex than individually randomized trials, and so reviewing the rationale for their use is critical.

Methods

How do I derive the sample size?

We focus on the most straightforward possible cluster design – one in which clusters are randomized to either intervention or control (usual care) – with a single level of clustering. For this type of CRT, the most straightforward approach to the sample size calculation is to first perform the calculation as if the design were randomized at the level of the patient, and then to inflate this sample size by multiplying by the “design effect”, the formula for which is 1 + [(Nc-1)ICC]. Nc is the number of participants per cluster and ICC is the intra-class correlation coefficient, which quantifies the degree to which responses within a cluster are similar to one another. (Technically, the ICC equals the Pearson correlation between pairs of observations within a cluster.) The formula assumes an identical number of patients per cluster, but a straightforward extension can be used if this condition does not hold.

To illustrate the approach, suppose that the outcome variable is continuously scaled, and the investigators are powering the trial to detect a moderate effect size of 0.50 with 80% power. (In this context, the effect size is the difference between the two group means, expressed in standard deviation units. For example, an effect size of 0.50 corresponds to a mean of 50 in one group and a mean of 45 in the other, with a within-group standard deviation of 10.) With simple randomization, approximately 63 participants with complete data on the outcome variable are required per group, for a total of 126 patients before considering attrition. With an expected attrition rate of 25%, the target sample size would be 168 (i.e., 126/0.75). In practice, this number might be padded somewhat to have sufficient power to analyze secondary outcomes or to accommodate a higher attrition rate.

Now suppose that the ICC=0.05 and there are 21 patients per cluster (i.e., 21 patients per cluster x 8 clusters = 168 total patients). The design effect becomes 1 + 20*0.05 = 2, and thus the sample size would be doubled from 168 to 336.
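The chain of calculations above can be sketched in a few lines of Python. This is our own illustrative sketch, not the authors' software: the function names are ours, and the retention fraction of 0.75 is backed out from the paper's figures (126 completers inflated to a target of 168).

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sample comparison of means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

def design_effect(n_per_cluster, icc):
    """Variance inflation for cluster randomization: 1 + (Nc - 1) * ICC."""
    return 1 + (n_per_cluster - 1) * icc

n = n_per_group(0.50)                    # 63 completers per group
total = 2 * n                            # 126 completers overall
target = ceil(total / 0.75)              # 168 enrolled, allowing for attrition
de = design_effect(21, 0.05)             # 1 + 20 * 0.05 = 2.0
clustered_target = ceil(target * de)     # 336 under cluster randomization

print(n, target, de, clustered_target)
```

Note that the inflation is applied to the individually randomized target, so any padding for secondary outcomes or extra attrition is inflated as well.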

From the perspective of statistical efficiency, for a fixed sample size it is better to have many small clusters than a few large ones. For example, with 340 patients and an ICC=0.05 the design effect for 34 clusters of 10 patients each is 1 + 9*0.05 = 1.45, whereas the design effect for 10 clusters with 34 patients each is 1 + 33*0.05 = 2.65.

This calculation is particularly sensitive to the value of the ICC: for example, if the ICC is 0.01 the design effects become 1 + 9*0.01 = 1.09 and 1 + 33*0.01 = 1.33, respectively, whereas if the ICC is 0.10 they become 1 + 9*0.10 = 1.90 and 1 + 33*0.10 = 4.30, respectively. Very large values of the ICC can call the CRT design into question.
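The trade-off between many small clusters and a few large ones, and the sensitivity to the ICC, can be tabulated directly. This sketch assumes equal cluster sizes, as the formula does:

```python
def design_effect(n_per_cluster, icc):
    """Design effect 1 + (Nc - 1) * ICC, assuming equal cluster sizes."""
    return 1 + (n_per_cluster - 1) * icc

# ~340 patients split as many small clusters vs. few large clusters
configs = {"34 clusters of 10": 10, "10 clusters of 34": 34}
for icc in (0.01, 0.05, 0.10):
    row = {label: round(design_effect(nc, icc), 2)
           for label, nc in configs.items()}
    print(f"ICC={icc}: {row}")
```

The 10-clusters-of-34 column grows much faster with the ICC, which is the numerical face of the statistical-efficiency argument in the text.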

In practice, the statistical benefits of large numbers of small clusters must be traded off against other considerations. Trials with small numbers of sites tend to be simpler to manage than trials with large numbers of sites, especially when the latter design has so few patients per site that it is difficult to maintain the fidelity of the intervention. (Readers interested in administrative considerations should refer to Eldridge & Kerry’s practical guidebook to CRTs.11) Yet another consideration is analytical: if results are to be compared across individual sites, then sufficient sample size will be required to attain adequate precision within each site (i.e., in addition to adequate precision overall). A critical question to ask during the design stage is whether the logic of the study favors a few large clusters or many small ones.

What value do I use for the intra-class correlation coefficient?

The sample size calculation requires an estimate for the ICC. There is a modestly-sized literature describing the distribution of ICCs across studies.12–14 Although most of the studies in this literature took place outside palliative care, it nevertheless is a good place to start. One conclusion from this literature is that ICCs for process variables tend to be larger than ICCs for outcome variables, such as pain intensity, quality of life, and others which are relevant to palliative care. This should be intuitively reasonable: process variables might be more directly related to the nature of the intervention, whereas outcome variables will reflect the heterogeneity of patient experience associated with serious illness. ICCs for smaller groups (e.g., families) tend to be larger than ICCs for larger groups (e.g., clinics or communities).

Greatly complicating sample size calculations, ICCs are usually reported in the literature as point estimates rather than confidence intervals or plausible ranges; moreover, when confidence intervals are reported, they tend to be quite wide.

At the risk of a one-size-fits-all recommendation, as a practical matter for CRTs in palliative care that use outcome rather than process measures we recommend the following:

  • Based on considerations such as logistics and budget, determine the maximum number of clusters that the trial can realistically support, along with the average number of patients that can be recruited per cluster.

  • Determine the sample size under individual randomization; this typically requires specifying an intervention effect size which is small, moderate, or somewhere in between.

  • Plug ICCs of 0.01, 0.05, and 0.10 into the formula for the design effect, and determine the final sample size. Base your best guess on the literature; for example, it is often reasonable to make that best guess 0.05.

  • When reporting trial results, consider including ICC values, as this will help other investigators with their sample size calculations.

Results

Illustrative example: ICUconnect

For a CRT, sample size calculations take place within a broader context of study design. Of course, the first design decision is whether a CRT is indicated. If so, additional design decisions include how to address contamination, how to address clustering in the statistical analysis, how to select the number of clusters and the number of units, how to perform randomization, and how to address underperforming clusters. Turner et al cover these decisions in greater detail.1,2 Here, as a case study, we discuss various design decisions pertaining to the ICUconnect study (ClinicalTrials.gov Identifier: NCT03506438), with an emphasis on the decision to perform a CRT and the choice of the sample size. The discussion of additional design decisions, sometimes in abbreviated form, is intended to provide the reader with an introduction to the level of nuance which often accompanies the design of CRTs.

Currently, the quality of ICU-based palliative care is highly variable. For example, family members report poor quality communication and decision making, and clinicians struggle to identify needs and connect with families in a shiftwork environment. Barriers to improving the quality of palliative care for this high-risk population include difficulty identifying high-risk patients and families with unmet palliative care needs, coordinating care by ICU teams, and engaging diverse family decision makers as partners in the care process. To address these barriers, Dr. Cox and colleagues developed a mobile-based app (entitled ICUconnect) which systematically engages families and helps ICU clinicians deliver needs-targeted palliative care.

Choice of a cluster randomized trial, treatment of contamination

The impact of ICUconnect (relative to usual care) upon unmet palliative care needs is being evaluated via a CRT, performed at five ICUs within a single medical center. The unit of randomization is the ICU attending physician. Multiple patients are nested within each physician. The primary rationale for randomizing physicians (e.g., rather than patients, and thus producing a CRT) was the potential for contamination. Multiple contemporaneous participants could be cared for by a single ICU physician during any given week, and insights gained from treating intervention patients could potentially be applied to control patients as well. One of the implications of this decision is the need for the two study groups to be as balanced as possible on physician characteristics (discussed subsequently).

Considerations in the number of clusters and the number of units

For ICUconnect, the study team had multiple constraints limiting the possible number of physicians and families per physician. The ICU is a challenging environment in which to conduct a CRT because of the workflow of the attending physicians and the quickly changing care needs of the patients. Across the five ICUs, there are around 60–70 ICU attending physicians. Many of these physicians work sporadic shifts of one to two weeks in length, totaling 8 to 12 weeks of on-service time per year. The study team originally considered enrolling and randomizing about half (n=30) of the physicians, with a fixed number of families for each physician. However, the team quickly realized that, given the physicians’ sporadic shift schedules, it would be difficult to enroll a large enough number of patients per physician. Additionally, ICUconnect has several sub-aims examining racial differences between the intervention and control groups; the study therefore plans to enroll an equal number of African American and White patients, both overall and per physician. For these reasons, it was more feasible to enroll more physicians with fewer families per physician.

Impact of non-performing clusters

Having more physicians and fewer patients per physician also allowed us to address the impact of non-performing clusters. Within an ICU, we had a “margin for error” in that no single physician was crucial to the study. Thus, a physician who reconsidered their participation (or had no patients) could easily be replaced. Having entire ICUs drop out would have been significantly more problematic – this was addressed by the study team being proactive and transparent with ICU directors about expectations, commitment, and integration of the app with standard ICU care.

Randomization

In ICUconnect, there were five physician-level variables (each with two levels) that we sought to balance across the two groups. Given the large number of variables relative to the projected number of physicians, stratified randomization was not an appropriate option (i.e., 32 strata for 40 physicians). We therefore implemented a method of minimization to equally allocate the physicians to the two treatment groups.15,16 As discussed in a review by Scott et al, this type of dynamic allocation examines the covariate levels and treatment assignments of prior enrollees to dynamically balance the treatment groups with respect to those levels.17
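The logic of minimization can be sketched as follows. This is a simplified Taves-style rule under assumptions of our own: the five binary variables carry hypothetical names, ties are broken at random, and overall group size is included as a balancing term; the study's actual implementation may differ.

```python
import random
from collections import defaultdict

def minimize_assign(covariates, counts, totals, rng):
    """Assign one enrollee to arm 0 or 1, choosing the arm that keeps the
    marginal counts of each covariate level (plus overall group sizes)
    most balanced; ties are broken at random."""
    scores = []
    for arm in (0, 1):
        # Imbalance if this enrollee joined `arm`: existing enrollees in that
        # arm sharing each covariate level, plus the arm's current size.
        score = totals[arm] + sum(counts[(var, level, arm)]
                                  for var, level in covariates.items())
        scores.append(score)
    arm = rng.randint(0, 1) if scores[0] == scores[1] else scores.index(min(scores))
    for var, level in covariates.items():
        counts[(var, level, arm)] += 1
    totals[arm] += 1
    return arm

rng = random.Random(7)                        # seeded for reproducibility
variables = ["v1", "v2", "v3", "v4", "v5"]    # five binary physician-level factors (hypothetical)
counts = defaultdict(int)
totals = [0, 0]
arms = [minimize_assign({v: rng.randint(0, 1) for v in variables},
                        counts, totals, rng)
        for _ in range(40)]                   # 40 attending physicians
print("group sizes:", totals)
```

Unlike stratified randomization, which would require 32 strata here, minimization balances each variable's margins sequentially as physicians enroll.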

Sample size

A first step in determining the sample size was to state the primary hypothesis, which in turn depends on the choice of primary outcome variable. Here, the primary outcome variable is the unmet palliative care needs identified by a primary family member of the ICU patient. This primary outcome is assessed via the Needs at the End-of-Life Screening Tool (NEST)18 survey (treated as continuously scaled) at the time of enrollment, 3 days later, 7 days later, and at 3 months post enrollment. The NEST scale has 13 items, each scored from 0 to 10 (total scores range from 0 to 130). A change of at least 5–6 points on this scale has been considered a minimum clinically relevant difference.

The primary hypothesis is that, as compared to usual care, the ICUconnect intervention will improve unmet needs from enrollment to 3 days post enrollment.

In general, statistical power and sample size considerations proceed in one of two ways. After specifying some standard parameters (e.g., a type-I error rate of 0.05), there are three key variables: sample size, effect size, and power. Specifying any two of these allows the third to be derived. In practice, power is often specified to be 80%, and so we either ask: (1) What is the sample size needed to detect a pre-specified clinically important difference (i.e., effect size) at a certain power?; or, (2) Given a fixed sample size and power, what is the minimum detectable difference? For ICUconnect we took the latter approach, primarily because the logistical constraints effectively fixed the sample size. More precisely, because with a CRT there are two different sample sizes to take into consideration (i.e., number of clusters and participants within each cluster), we set these to plausible values, set the power to 80%, and asked whether the effect size that could be identified was of the same order of magnitude as the minimum clinically important difference for the NEST. The values we selected were 40 ICU physicians (i.e., 20 per group) and 4 patients per ICU physician; the latter number was based on an initial feasibility study.

In ICUconnect, the NEST score is a longitudinal outcome, so our analytic model will include multiple random effects to account for the correlation between repeated time points and the correlation within physicians. Rather than accounting for multiple levels of correlation, the longitudinal outcome was simplified as a change score (i.e., improvement from baseline to 3 days) for the sample size calculations. That is, the calculations were based on the mean difference score using tests for two means in a cluster randomized design. As inputs, we then need to provide only a measure of variability of the change in NEST score, in addition to the sample size and the type I and type II error rates. Based on preliminary studies, the standard deviation of the change in NEST score was estimated as 12 points. For all calculations, the type I error is 5% and power is 80%.

As previously discussed, a unique component of sample size estimation in CRTs is incorporating the ICC. For ICUconnect, we used a range of reasonable ICCs, 0.01 to 0.10. Table 1 shows the range of detectable differences for this range of ICCs. With a sample size of 160 families (4 per physician), we will be able to detect between-group differences in improvement of 5.4 to 6.1 points. Because the number of families per physician is relatively small, the ICC values do not dramatically impact the NEST score detectable differences.

Table 1.

Intra-class correlation coefficients and detectable mean differences

Number of ICU physicians: 40 (20 in each treatment arm)
Number of families: 160 (4 per physician)

  ICC     Detectable mean NEST score difference
  0.01    5.4
  0.05    5.7
  0.10    6.1

ICC = intra-class correlation coefficient. NEST = Needs at the End-of-Life Screening Tool.

The conclusion of our sample size calculations was that with 40 physicians and 4 patients per physician, we had adequate power to detect intervention effects which were near the minimum clinically important difference of the NEST, and that the results are robust to the specification of the ICC. Applying simplifying assumptions which tended to be conservative (e.g., using less sophisticated statistical methods than will be applied to the actual data analysis) lends additional confidence to this conclusion.
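The detectable differences in Table 1 can be approximated with a closed-form normal-approximation formula for two means under clustering. The function name and the closed form are our own sketch, not the study's software, but they reproduce the tabulated values:

```python
from math import sqrt
from statistics import NormalDist

def detectable_difference(sd, n_per_group, n_per_cluster, icc,
                          alpha=0.05, power=0.80):
    """Minimum detectable mean difference for a two-group comparison
    under cluster randomization (normal approximation)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)  # ~2.80
    de = 1 + (n_per_cluster - 1) * icc                     # design effect
    return z_total * sqrt(2 * sd ** 2 * de / n_per_group)

# 40 physicians x 4 families = 80 families per arm; SD of NEST change = 12
for icc in (0.01, 0.05, 0.10):
    print(icc, round(detectable_difference(12, 80, 4, icc), 1))
```

With only 4 families per cluster, the design effect ranges from 1.03 to 1.30 across this ICC range, which is why the detectable differences in Table 1 move so little.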

Summary

In a CRT randomization is performed at the level of the group rather than the individual. CRTs are often used when an intervention can only be administered to a group, or they are suggested by issues of feasibility or contamination. CRTs are less statistically efficient and (usually) more logistically complex than individually randomized trials. Sample size calculations are based on the formula for the design effect, and illustrate the trade-off between the statistically more efficient choice of many small clusters and the (usually) logistically preferable choice of few large clusters. Collaborating with a statistician is essential.

The Palliative Care Research Cooperative Group is an excellent resource for facilitating multi-site studies with large numbers of sites (https://palliativecareresearch.org/). Some additional references include the modification of the CONSORT statement as applied to CRTs,19 Garrison et al which discusses CRTs from the perspective of quality improvement research,20 a commentary on pitfalls and controversies in CRTs,21 and the NIH Collaboratory website (https://rethinkingclinicaltrials.org/chapters/design/experimental-designs-randomization-schemes-top/cluster-randomized-trials/) for up-to-date references and tool kits.

Key Message:

This article is a tutorial on cluster randomized trials focused on two key decision points: (1) when to conduct a cluster trial and (2) how to calculate the needed sample size. Collaborating with a statistician early and often will help investigators avoid common pitfalls.

Disclosures and Acknowledgements

The authors have no conflicts of interest to disclose. Dr. Maren K. Olsen and Dr. Christopher E. Cox were supported by NIH funding (U54 MD 12530-01). Dr. Maren K. Olsen was supported by the Durham Center of Innovation to Accelerate Discovery and Practice Transformation (ADAPT), (CIN 13-410) at the Durham VA Health Care System. Dr. Joseph G. Winger was supported by a Kornfeld Scholars Program Award from the National Palliative Care Research Center.

This project was supported by the Palliative Care Research Cooperative Group funded by the National Institute of Nursing Research U2CNR014637.


References

  • 1. Turner EL, Li F, Gallis JA, Prague M, Murray DM. Review of recent methodological developments in group-randomized trials: part 1—design. Am J Public Health 2017;107:907–915.
  • 2. Turner EL, Prague M, Gallis JA, Li F, Murray DM. Review of recent methodological developments in group-randomized trials: part 2—analysis. Am J Public Health 2017;107:1078–1086.
  • 3. Murray DM. Design and analysis of group-randomized trials. New York: Oxford University Press, 1998.
  • 4. Bernacki R, Hutchings M, Vick J, et al. Development of the serious illness care program: a randomised controlled trial of a palliative care communication intervention. BMJ Open 2015;5:e009032.
  • 5. Agar M, Beattie E, Luckett T, et al. Pragmatic cluster randomised controlled trial of facilitated family case conferencing compared with usual care for improving end of life care and outcomes in nursing home residents with advanced dementia and their families: the IDEAL study protocol. BMC Palliat Care 2015;14:1–11.
  • 6. Ringdal GI, Jordhøy MS, Kaasa S. Family satisfaction with end-of-life care for cancer patients in a cluster randomized trial. J Pain Symptom Manage 2002;24:53–63.
  • 7. Hemming K, Haines TP, Chilton PJ, Girling AJ, Lilford RJ. The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting. BMJ 2015;350:h391.
  • 8. Eldridge SM, Ukoumunne OC, Carlin JB. The intra-cluster correlation coefficient in cluster randomized trials: a review of definitions. Int Stat Rev 2009;77:378–394.
  • 9. Pladevall M, Simpkins J, Donner A, Nerenz DR. Designing multi-center cluster randomized trials: an introductory toolkit. 2014; https://rethinkingclinicaltrials.org/chapters/design/experimental-designs-randomization-schemes-top/cluster-randomized-trials/. Accessed July 15, 2020.
  • 10. Hudgens MG, Halloran ME. Toward causal inference with interference. J Am Stat Assoc 2008;103:832–842.
  • 11. Eldridge S, Kerry S. A practical guide to cluster randomised trials in health services research. United Kingdom: John Wiley & Sons, Ltd., 2012.
  • 12. Kul S, Vanhaecht K, Panella M. Intraclass correlation coefficients for cluster randomized trials in care pathways and usual care: hospital treatment for heart failure. BMC Health Serv Res 2014;14:84.
  • 13. Singh J, Liddy C, Hogg W, Taljaard M. Intracluster correlation coefficients for sample size calculations related to cardiovascular disease prevention and management in primary care practices. BMC Res Notes 2015;8:89.
  • 14. Thompson DM, Fernald DH, Mold JW. Intraclass correlation coefficients typical of cluster-randomized studies: estimates from the Robert Wood Johnson Prescription for Health projects. Ann Fam Med 2012;10:235–240.
  • 15. Taves DR. Minimization: a new method of assigning patients to treatment and control groups. Clin Pharmacol Ther 1974;15:443–453.
  • 16. Pocock SJ, Simon R. Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics 1975:103–115.
  • 17. Scott NW, McPherson GC, Ramsay CR, Campbell MK. The method of minimization for allocation to clinical trials: a review. Control Clin Trials 2002;23:662–674.
  • 18. Emanuel LL, Alpert HR, Emanuel EE. Concise screening questions for clinical assessments of terminal care: the needs near the end-of-life care screening tool. J Palliat Med 2001;4:465–474.
  • 19. Campbell MK, Piaggio G, Elbourne DR, Altman DG. Consort 2010 statement: extension to cluster randomised trials. BMJ 2012;345:e5661.
  • 20. Garrison MM, Mangione-Smith R. Cluster randomized trials for health care quality improvement research. Acad Pediatr 2013;13:S31–S37.
  • 21. Donner A, Klar N. Pitfalls of and controversies in cluster randomization trials. Am J Public Health 2004;94:416–422.
