Am J Public Health. 2008 Aug;98(8):1418–1424. doi: 10.2105/AJPH.2007.127027

Individually Randomized Group Treatment Trials: A Critical Appraisal of Frequently Used Design and Analytic Approaches

Sherri L Pals, David M Murray, Catherine M Alfano, William R Shadish, Peter J Hannan, William L Baker
PMCID: PMC2446464  PMID: 18556603


Objectives. We reviewed published individually randomized group treatment (IRGT) trials to assess researchers’ awareness of within-group correlation and determine whether appropriate design and analytic methods were used to test for treatment effectiveness.

Methods. We assessed sample size and analytic methods in IRGT trials published in 6 public health and behavioral health journals between 2002 and 2006.

Results. Our review included 34 articles; in 32 (94.1%) of these articles, inappropriate analytic methods were used. In only 1 article did the researchers claim that expected intraclass correlations (ICCs) were taken into account in sample size estimation; in most articles, sample size was not mentioned or ICCs were ignored in the reported calculations.

Conclusions. Trials in which individuals are randomly assigned to study conditions and treatments administered in groups may induce within-group correlation, violating the assumption of independence underlying commonly used statistical methods. Methods that take expected ICCs into account should be used in reexamining past studies and planning future studies to ensure that interventions are not judged effective solely on the basis of statistical artifacts. We strongly encourage investigators to report ICCs from IRGT trials and describe study characteristics clearly to aid these efforts.

Randomized trials evaluating the effectiveness of interventions delivered to groups of participants, rather than individuals, are common in public health. These interventions may be preferred because they are often less expensive than an intervention delivered to individuals and because the group environment may enhance the effectiveness of the intervention. Trials designed to evaluate group interventions may assign intact groups (e.g., schools or worksites) to study conditions; these studies are examples of group-randomized trials (GRTs; also called cluster-randomized trials).

Alternatively, such trials may assign individuals to study conditions, with interventions then delivered to groups; we propose labeling these trials individually randomized group treatment (IRGT) trials. In the past 30 years, a great deal of attention has been paid to the unique design and analytic methods needed for GRTs,1–4 but comparatively little attention has been devoted to IRGT trials.

As do GRTs, IRGT trials involve the problem of potential correlation among observations within treatment conditions. In GRTs, the correlation is present at the beginning of the study because intact groups are randomly assigned to study conditions. Observations within these groups are correlated because people select themselves into groups, share a history, and interact with each other. In IRGT trials, correlation may develop over time as group members share the treatment environment and interact with each other.

Regardless of how it develops, any correlation within groups violates one of the major assumptions of the statistical methods used to analyze randomized clinical trials and, often erroneously, GRTs and IRGT trials. These methods assume that observations are independent within conditions; violations of this assumption can inflate type I error rates.5–7 Correlation within groups gives rise to an additional between-group component of variance, and standard analytic methods developed for randomized clinical trials ignore this extra variation, underestimating the error term.


The defining characteristic of an IRGT trial is randomization of individuals followed by treatment in groups in at least 1 study condition; however, study designs vary considerably. As one example, consider a weight loss trial in which participants are randomized to an intervention or usual care. Those in the intervention are asked to come to group treatment sessions at 1 of several time slots available, whereas the usual care condition involves no group meetings. Delivery of the intervention in small groups transforms what would otherwise be a standard randomized clinical trial into an IRGT trial, with the potential for correlation to develop in the intervention condition.

As a second example, consider a smoking cessation trial in which 3 group interventions are compared: one with 2 group sessions, one with 4 group sessions, and one with 10 group sessions. Here correlation would be expected to develop over time in all 3 conditions, probably in proportion to the number of group sessions. Correlation may also be increased if, after being randomly assigned to a study condition, participants select which group session they attend on the basis of factors such as timing and geographic location. Furthermore, if the group leader or therapist varies across groups in a way that cannot be controlled statistically (e.g., a different leader for each group), this may be an additional source of group variation.

Correlation among observations within groups is indexed by the intraclass correlation coefficient (ICC), often interpreted as the proportion of variance in an outcome that is attributable to groups. In a GRT, the variance of the difference between 2 means in the presence of a positive ICC will be inflated by a factor of

\[ 1 + (m - 1)\rho \tag{1} \]

(where m is the average number of members per group and ρ is the ICC) relative to a trial with independent observations. This quantity, termed the variance inflation factor2 or design effect,8 increases as ρ or m increases. In the context of GRTs, it is well established that ignoring this additional variation can lead to inflated type I error rates.1,2,9,10

In an IRGT trial, the form of the variance inflation factor may be more complicated. In an IRGT trial with 1 group-based treatment and 1 individually administered treatment, the variance inflation factor at a postintervention follow-up would be 1 + (m − 1)ρ in the group-based treatment condition and 1 in the individually administered condition (given the expectation of no ICC in that condition). In an IRGT trial with 2 group-based study conditions, the variance inflation factor would be

\[ 1 + (m_1 - 1)\rho_1 \tag{2} \]

in the first condition and

\[ 1 + (m_2 - 1)\rho_2 \tag{3} \]

in the second condition, allowing for different ICCs and a different number of members per group.
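As a minimal sketch of the variance inflation factors described above, the Python function below evaluates 1 + (m − 1)ρ separately for each study condition. The function and argument names are illustrative assumptions rather than anything defined in the article, and restricting ρ to the range 0 to 1 is a simplification. An individually administered condition is represented by ρ = 0, which gives a factor of 1.

```python
def variance_inflation_factor(m: float, rho: float) -> float:
    """Return 1 + (m - 1) * rho for one study condition.

    m   : average number of members per treatment group (hypothetical input)
    rho : intraclass correlation expected in that condition (0 if individuals
          are treated one at a time)
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho is assumed to lie in [0, 1] in this sketch")
    return 1.0 + (m - 1.0) * rho


# One group-based condition (8 members per group, ICC 0.05) and one
# individually administered condition (no within-group correlation expected).
vif_group = variance_inflation_factor(m=8, rho=0.05)
vif_individual = variance_inflation_factor(m=1, rho=0.0)

# Two group-based conditions with different group sizes and ICCs.
vif_first = variance_inflation_factor(m=6, rho=0.10)
vif_second = variance_inflation_factor(m=12, rho=0.02)
```

Keeping the calculation per condition, rather than pooling, mirrors the point that the two conditions of an IRGT trial need not share the same ICC or group size.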

These variance inflation factor formulas are an important part of sample size estimation in IRGT trials, given that sample sizes must be increased to account for extra variation. Table 1 presents variance formulas for a GRT and 2 different IRGT trials with pretest–posttest designs. Each formula includes σ̂2g, the additional variance component that results from treatment of groups of participants, along with the usual σ̂2m that results from variation between individuals. As shown by Murray,3 these formulas can be used in combination with the detectable difference formula, also presented in Table 1, to determine the smallest difference between conditions that would be considered significant given the α and β levels.


TABLE 1. Formulas for the Variance of the Difference Between 2 Study Conditions: Pretest–Posttest GRT and IRGT Trial Study Designs

Study design; Repeated Measures Analysis of Variance; Analysis of Covariance With Baseline as Covariate
    GRT: 2 conditions; [formula image not recovered]; [formula image not recovered]
    IRGT trial: 2 group treatment conditions; [formula image not recovered]; [formula image not recovered]
    IRGT trial: 1 group treatment condition and 1 individual treatment condition; [formula image not recovered]; [formula image not recovered]
Detectable differencea (all study designs); [formula image not recovered]

Note. GRT = group-randomized trial; IRGT = individually randomized group treatment; σ̂2m = member component of variance; r̂yy(m) = over-time correlation at the member level; m = average number of members per group (nested within study condition); σ̂2g = group component of variance; N1 = number of participants in individual treatment condition; N2 = number of participants in group treatment condition; Δ̂ = detectable difference (between study conditions); σ̂2Δ = variance of the difference between study conditions.

aSmallest difference between study conditions that would be considered significant.
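To illustrate why sample sizes must be increased, the sketch below inflates a per-condition sample size that was computed under independence by the variance inflation factor 1 + (m − 1)ρ. This is a simplified approximation, not the Table 1 formulas: it ignores the pretest–posttest structure and the loss of degrees of freedom, and the function name and inputs are hypothetical.

```python
import math


def inflate_sample_size(n_independent: int, m: float, rho: float) -> int:
    """Scale an independence-based sample size by the variance inflation factor.

    n_independent : participants per condition required if observations
                    were independent (assumed to come from a standard formula)
    m             : expected number of members per treatment group
    rho           : expected intraclass correlation
    """
    vif = 1.0 + (m - 1.0) * rho
    # Round before taking the ceiling so floating-point noise such as
    # 190.00000000000003 does not add a spurious extra participant.
    return math.ceil(round(n_independent * vif, 9))


# Hypothetical planning values: 64 per condition under independence,
# groups of 10, ICC of 0.10.
n_needed = inflate_sample_size(64, m=10, rho=0.10)
groups_needed = math.ceil(n_needed / 10)
```

Because the degrees-of-freedom penalty is left out, a figure obtained this way is best read as a lower bound on the required sample size.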

Sample size estimation for an IRGT trial, as in a GRT, involves anticipating the number of members per treatment group and the ICC or group component of variance expected in 1 or more study conditions. In the GRT literature, articles reporting ICCs for a variety of variables are relatively common.11–15 In GRTs, ICCs have been shown to vary according to group type, group size, study design characteristics, analytic approach, and type of primary outcome variable used. ICCs for physiological variables are generally smaller than those for behavioral variables, and in turn, ICCs for behavioral variables are generally smaller than those for knowledge, attitudinal, or belief variables.16 This sort of variation, along with variation due to length and duration of group treatment, is likely to exist among ICCs in IRGT trials as well.

Unfortunately, very few published estimates of ICCs for IRGT trials exist in the public health literature. Creamer et al.17 reported ICCs from a nonrandomized longitudinal cohort study of a 12-week posttraumatic stress disorder intervention targeting veterans; groups ranged in size from 6 to 8 participants. In that study, ICCs for a variety of psychosocial measures ranged from 0.04 to 0.13. Herzog et al.18 reported ICCs for a group-based smoking cessation intervention in which participants were assigned to conditions in groups according to the time of their entrance into the study. Three group-based treatments were compared; ICCs were 0.44 for intervention attendance and 0.32 for smoking. Neither of these studies can be labeled IRGT trials because individuals were not randomly assigned to conditions; however, the ICCs observed in the studies may be useful for planning given that so few estimates are available.

Roberts and Roberts19 applied several different analytic models to data derived from an IRGT trial in which patients with schizophrenia were randomly assigned to receive 16 sessions of cognitive–behavioral group therapy or individual treatment as usual. They reported ICCs ranging from approximately 0.20 to approximately 0.46 for scales measuring schizophrenia symptoms.

There is better awareness of the implications of intraclass correlation in IRGT trials conducted outside the area of public health. This is particularly true for social psychology studies20,21 and group psychotherapy trials, in which individuals are often randomly assigned to conditions but treatments are administered in groups.22,23 As early as 1978, investigators conducting psychotherapy studies were warned of the consequences of treating nested within-therapist observations as independent,24 and such cautions have appeared regularly since then.5,6,23 In public health, epidemiology, and biostatistics, authors have more recently begun to discuss such issues.

Donner and Klar25 noted that within-group correlation may arise in studies involving individual randomization and group treatment and indicated that the methods they outlined for cluster-randomized trials could also apply to such studies. Hoover26 discussed the implications of group treatment in clinical trials and showed that when there is a positive ICC, the denominator of the standard 2-sample t test is too small. In addition, he found that the denominator of the t test needs to be inflated to take into account between-group heterogeneity and that degrees of freedom should be based on the number of groups rather than the number of group members.26 Hoover also suggested that mixed models or generalized estimating equations (GEE)27 could be used to incorporate covariates.
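The two corrections described here, a larger denominator and degrees of freedom based on groups, can be sketched as follows. The sketch assumes both conditions are group based with a common group size and ICC, so that the denominator grows by the square root of 1 + (m − 1)ρ. The function name and the example numbers are hypothetical, and this is not presented as Hoover's exact procedure.

```python
import math


def adjust_t_for_groups(t_naive: float, m: float, rho: float,
                        n_groups_total: int, n_conditions: int = 2):
    """Deflate a naive t statistic and base df on the number of groups.

    t_naive        : t statistic from an analysis that ignored groups
    m, rho         : members per group and ICC, assumed equal across conditions
    n_groups_total : total number of treatment groups across all conditions
    """
    vif = 1.0 + (m - 1.0) * rho
    t_adjusted = t_naive / math.sqrt(vif)   # denominator inflated by sqrt(VIF)
    df = n_groups_total - n_conditions      # groups, not members, supply the df
    return t_adjusted, df


# Hypothetical trial: naive t = 2.20, 12 groups of 10, ICC = 0.10.
t_adj, df = adjust_t_for_groups(2.20, m=10, rho=0.10, n_groups_total=12)
```

In this hypothetical case the naive statistic would be judged against roughly 118 degrees of freedom, whereas the adjusted statistic is smaller and is judged against only 10.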

Roberts and Roberts19 presented analytic options for studies comparing 2 group treatments and studies comparing 1 group treatment and a treatment administered individually (or a no-treatment control). For the latter, they proposed several analytic methods, including mixed-model approaches, and evaluated them in a small simulation study varying the ICC with 8 groups per condition and 6 members per group in the group treatment condition and 48 participants in the control condition. A Satterthwaite t test maintained a type I error rate closer to the nominal rate than any of the other candidate models in these simulations, and this test is easy to perform in many statistical packages. Roberts and Roberts19 also discussed the impact on power of differing group sizes and provided recommendations for optimal allocation of participants to study conditions. Lee and Thompson addressed correlation in individually randomized trials and proposed several Bayesian random-effects models.28
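One way such a Satterthwaite-type test could be set up for a trial with 1 group-based condition and 1 individually treated condition is sketched below. It treats group means as the units of analysis in the group-based condition and individuals as the units in the other condition, then applies the Welch–Satterthwaite approximation for degrees of freedom. This is an assumption-laden illustration with equal group sizes and hypothetical names, and it may differ in detail from the test evaluated by Roberts and Roberts.

```python
import math
import statistics


def satterthwaite_t(group_arm, individual_arm):
    """Return (t, df) for a group-treated arm versus an individually treated arm.

    group_arm      : list of lists, one inner list of outcomes per treatment
                     group (equal group sizes assumed)
    individual_arm : flat list of outcomes from individually treated participants
    """
    group_means = [statistics.fmean(g) for g in group_arm]
    k, n = len(group_means), len(individual_arm)
    if k < 2 or n < 2:
        raise ValueError("need at least 2 groups and 2 individuals")

    # Variance of each arm's mean: between-group-mean variance over k groups,
    # and ordinary sample variance over n individuals.
    v_group = statistics.variance(group_means) / k
    v_indiv = statistics.variance(individual_arm) / n

    diff = statistics.fmean(group_means) - statistics.fmean(individual_arm)
    t = diff / math.sqrt(v_group + v_indiv)
    df = (v_group + v_indiv) ** 2 / (
        v_group ** 2 / (k - 1) + v_indiv ** 2 / (n - 1)
    )
    return t, df
```

The resulting degrees of freedom are driven largely by the number of groups when the group-based arm contributes most of the variance, which is the behavior the design calls for.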

Given the comparatively limited attention paid to IRGT trial design and analytic issues, it may be useful to refer to publications on GRTs for recommendations on design and analysis. A design approach that has been strongly discouraged in GRTs is the 1 group (e.g., a single community) per condition design.10 This design has been criticized in the context of GRTs because when there is only 1 group per condition, between-group variation cannot be separated from between-condition variation, and a valid analysis is not possible. This criticism is equally applicable to IRGT trials.

An analytic strategy that has been used in GRTs but discredited is treating the group as a fixed effect rather than a random effect, as is done in mixed models. Zucker7 demonstrated that this strategy can yield a type I error rate inflated beyond what would be obtained if the group were ignored altogether, and Martindale24 cautioned against its use in psychotherapy studies. A recent review showed that this approach is rarely used in GRTs,29 and it should be avoided in analyses of IRGT trials as well.

Another discredited analytic approach that continues to be seen in GRTs is that of using GEE in a GRT with fewer than 20 groups per condition without a small sample correction.4 A GEE approach is attractive in the case of nonnormally distributed data with correlated observations because it is asymptotically robust to misspecifications of the form of the covariance matrix. However, studies have shown that if there are fewer than 20 groups per condition, standard errors can be biased and type I errors inflated.9,30 In GRTs there are often fewer than 20 groups per condition, and this may be true of many IRGT trials as well.

Because discussions of the implications of within-group correlation in IRGT trials are rare in public health and health behavior journals, we suspected that investigators publishing articles that report the findings of IRGT trials would be largely unaware of the design and analytic issues they present. We reviewed published IRGT trials in an effort to assess investigators’ awareness of the need to use design and analytic methods that account for within-group correlation.


We manually searched all issues of the following journals for the years 2002 through 2006: the American Journal of Public Health, Preventive Medicine, Health Psychology, Obesity Research, Addictive Behaviors, and AIDS and Behavior. We selected these journals because of their broad coverage of public health research or because they were known to publish articles reporting results of IRGT trials. Articles published in these journals were included in our review if they reported results from an intervention trial in which individuals were randomly assigned to conditions but treatment was delivered in groups. Articles were excluded if there was no clear statement that participants were randomly assigned to conditions or if an evaluation of the intervention was not included.

Each article was reviewed by the first 3 authors (S.L.P., D.M.M., C.M.A.). Study characteristics such as number of study conditions and numbers of groups and group members per condition were recorded. Each reviewer also determined whether the article mentioned sample size calculations and whether these calculations took intraclass correlation into account properly, the types of analytic methods reported in the article, and whether the reported analytic methods took intraclass correlation into account properly. Any disagreements between reviewers were resolved through discussion. S.L.P. was a coauthor on an article that was eligible for the review; that article was reviewed by D.M.M. and C.M.A. without input from the study coauthor.


Table 2 presents the characteristics of articles included in the review. We identified 34 eligible articles31–64 in the 6 journals, with the largest number (12) published in 2006. Most of the articles reported the results of trials with 2 study conditions, and most had 1 or 2 group treatment conditions. About 70% of the articles reported a baseline sample size of fewer than 200 participants. Only 6 articles reported number of members per group (between 6 and 12 for all 6), and only 3 reported number of groups per condition.


TABLE 2. Characteristics of 34 Studies in a Review of 6 Public Health Journals: 2002–2006

Articles, No. (%)
    American Journal of Public Health 4 (11.8)
    Preventive Medicine 6 (17.6)
    Health Psychology 8 (23.5)
    Obesity 7 (20.6)
    Addictive Behaviors 7 (20.6)
    AIDS and Behavior 2 (5.9)
Year of publication
    2002 5 (14.7)
    2003 6 (17.6)
    2004 6 (17.6)
    2005 5 (14.7)
    2006 12 (35.3)
No. of study conditionsa
    2 23 (67.6)
    3 8 (23.5)
    4 3 (8.8)
No. of group treatment conditionsb
    1 11 (32.3)
    2 17 (50.0)
    3 4 (11.8)
    4 2 (5.9)
Baseline sample size
    < 100 15 (44.1)
    100–199 9 (26.5)
    200–299 4 (11.8)
    ≥ 300 6 (17.6)
Target population
    Adults or adolescents with mental health issues 3 (8.8)
    Overweight or obese children 2 (5.9)
    Overweight or obese adults 9 (26.5)
    Adults with cardiovascular risk factors other than weight 3 (8.8)
    Cancer patients 2 (5.9)
    College or university students 2 (5.9)
    HIV-positive adults 3 (8.8)
    Smokers or substance abusers 7 (20.6)
    Other 3 (8.8)
Primary outcome variablesc
    Weight, BMI, body fat percentage, or dietary variables 13 (38.2)
    Physical activity/physical fitness variables 5 (14.7)
    Smoking or substance use variables 7 (20.6)
    Mental health variables 6 (17.6)
    Sexual behavior variables 6 (17.6)
    Treatment retention 2 (5.9)
    Medication adherence 2 (5.9)
    Other 7 (20.6)

Note. BMI = body mass index.

aStudy condition refers to the treatment to which a participant is randomly assigned (e.g., intervention or control).

bGroup treatment condition refers to the number of study conditions that involve treatments administered to participants in groups rather than to individuals.

cNumber of articles sums to more than 34 because several articles reported more than one primary outcome variable.

A variety of target populations were represented among the studies; the largest percentage of studies targeted overweight or obese adults or children, and the next largest percentage targeted smokers or substance abusers. Most articles reported trials with multiple primary outcome variables. A majority (n = 18) reported inclusion of variables related to diet, body composition, or physical activity and fitness. In 13 studies, smoking, substance use, or mental health variables served as primary outcome variables.

Table 3 presents the results of the review of sample size calculations and analytic methods. Most of the articles included no mention of sample size calculations. Of the 9 articles that mentioned sample size, 6 reported calculations performed at the individual level, 1 reported that power calculations were performed but provided no detail, and 1 reported that the sample size was inflated to account for a positive ICC but provided no supporting data, such as the ICC or variance inflation factor. One additional article reported that power calculations were not performed because the trial was funded as a replication of an earlier study.


TABLE 3. Sample Size Calculations and Analytic Methods Used in 34 Studies Reviewed in 6 Public Health Journals: 2002–2006

Articles, No. (%)
Sample size calculations
    Sample size calculations reported at individual level 6 (17.6)
    Power calculations reported but no details provided 1 (2.9)
    No mention of sample size calculations 25 (73.5)
    Sample size reported to account for ICC but no details provided 1 (2.9)
    Other 1 (2.9)
Significant results reported
    Yes 27 (79.4)
    No 7 (20.6)
Analytic approachesa
    Analysis at individual level, ignoring group entirely 32 (94.1)
    Mixed-model approach with baseline as covariate 2 (5.9)
    Structural equation modeling 1 (2.9)
Appropriateness of analytic methods
    All analytic methods appropriate 1 (2.9)
    No analytic methods appropriate 32 (94.1)
    Not enough information provided 1 (2.9)

Note. ICC = intraclass correlation coefficient.

aNumber of articles sums to more than 34 because one article reported both a mixed-model approach and a structural equation modeling approach.

Almost all of the articles reviewed (n = 32; 94.1%) reported analyses at the individual level, ignoring the group entirely. Two articles did report analyses in which the ICC and associated degrees of freedom were taken into account properly; in both articles, a mixed-model approach was used with the baseline value of the primary outcome variable as a covariate. One of these articles also reported the results of a structural equation modeling approach testing for intervention effects with bootstrap standard errors. Bootstrap standard errors are correct in the context of within-group correlation only when bootstrapping is done at the group level65; in the article reporting use of the bootstrap method, it was not clear at what level bootstrapping was performed. Only 1 article reported results based entirely on appropriate analytic methods.
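The distinction between bootstrapping at the group level and at the individual level can be sketched as follows. The function resamples whole treatment groups with replacement, keeping each group's members together, and returns a bootstrap standard error for a condition mean. The data layout, function name, and defaults are hypothetical, and the sketch covers only a single condition mean rather than a structural equation model.

```python
import random
import statistics


def group_bootstrap_se(groups, n_boot=2000, seed=0):
    """Bootstrap SE of a condition mean, resampling intact groups.

    groups : list of lists, one inner list of outcomes per treatment group
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    k = len(groups)
    estimates = []
    for _ in range(n_boot):
        # Draw k groups with replacement; members stay with their group,
        # so within-group correlation is preserved in every replicate.
        resampled = [rng.choice(groups) for _ in range(k)]
        pooled = [y for g in resampled for y in g]
        estimates.append(statistics.fmean(pooled))
    return statistics.stdev(estimates)


# Strongly clustered toy data: outcomes are identical within each group.
toy_groups = [[0.0] * 3, [10.0] * 3, [0.0] * 3, [10.0] * 3]
se_group_level = group_bootstrap_se(toy_groups)
```

For these toy data, resampling the 12 individuals independently would give a standard error near 1.4, whereas resampling the 4 groups gives a value near 2.5, which shows how individual-level bootstrapping can understate uncertainty.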


The results of our review of published IRGT trials demonstrate that these studies are relatively common in public health and health behavior journals. Few reported analyses that appropriately took into account the intraclass correlation that can develop in group interventions or the resulting limitations on the degrees of freedom available for tests of the intervention effect. This could mean that the researchers in many of these studies overestimated the significance of their findings and that some of the findings may have even prompted inappropriate dissemination of interventions found to be effective only as a result of statistical artifacts. Absent estimates of ICCs and clear reporting of numbers of groups and group members in these studies, it is impossible to know whether their results are misleading; however, the potential is certainly there and should be taken seriously.

The majority of the articles included in our review also did not report sample size calculations or reported calculations ignoring ICCs. It is possible that the authors performed sample size calculations that took ICCs into account but did not report them. However, because 15 articles reported sample sizes below 100 and all involved at least 2 study arms, it is unlikely that power calculations taking the IRGT trial design into account were performed, particularly when so few authors acknowledged the IRGT trial design in the article.

Similar to GRTs, IRGT trials carry the potential for intraclass correlation and limited degrees of freedom; sample size calculations that ignore these penalties can lead to studies that lack sufficient power for a proper analysis. Investigators planning IRGT trials should find ICC estimates relevant to the planned study (i.e., from a study as similar as possible to the planned study) and conduct a sensitivity analysis to examine the impact of ICCs of varying magnitudes. Unfortunately, there are currently very few published estimates of ICCs from IRGT trials, making this task quite difficult. These estimates are sorely needed, and we strongly encourage investigators with IRGT trial data to publish their ICCs so that others can use them in planning new studies.
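A sensitivity analysis of this kind could be sketched as below: for a planned sample size and group size, the code tabulates the variance inflation factor and the corresponding effective sample size (planned size divided by the inflation factor) over a range of candidate ICCs. The planning values are hypothetical, and the effective sample size is a common rule of thumb rather than a quantity defined in this article.

```python
def icc_sensitivity(n_per_condition: int, m: float, iccs):
    """Return (rho, VIF, effective n) for each candidate ICC.

    n_per_condition : planned participants per group-based condition
    m               : planned members per treatment group
    iccs            : iterable of candidate intraclass correlations
    """
    rows = []
    for rho in iccs:
        vif = 1.0 + (m - 1.0) * rho
        rows.append((rho, vif, n_per_condition / vif))
    return rows


# Hypothetical plan: 100 participants per condition in groups of 8.
table = icc_sensitivity(100, m=8, iccs=[0.0, 0.01, 0.05, 0.10, 0.20])
for rho, vif, n_eff in table:
    print(f"ICC={rho:.2f}  VIF={vif:.2f}  effective n={n_eff:.1f}")
```

Reading down such a table shows how quickly the information in a fixed sample erodes as the assumed ICC grows, which is why planning on a single optimistic ICC is risky.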

There were limitations to our review, particularly with respect to the number and types of journals included. Nevertheless, at least in these 6 journals, very few of the articles even mentioned the potential for intraclass correlation in IRGT trials, and only 1 reported entirely appropriate analytic methods.

Another limitation was that we were unable to provide much information on number of members per group, which is essential for computing the variance inflation factor and determining the extent of variance underestimation in an analysis ignoring ICCs. Most of the articles failed to report important characteristics such as numbers of groups and group members included in each condition and number and duration of sessions in the group treatment. The Consolidated Standards of Reporting Trials (CONSORT) statement,66 which provides guidelines for reporting the results of clinical trials, has been extended for cluster-randomized trials67; however, additional guidelines may be required for IRGT trials to ensure the reporting of characteristics that are specific to these studies and essential in evaluating their design and analytic methods.

Our review suggests that many investigators conducting IRGT trials are using statistical methods that incorrectly assume independence of observations within study conditions. In light of our findings and those of others who have discussed such design and analytic issues, it will be important to increase awareness of these issues among investigators planning future IRGT trials. It will be important to reexamine earlier studies as well; the most urgent question is whether the results of previous studies indicating that group interventions were effective were misleading. It may be argued that the variance inflation factor for most of these studies is too small to change the study conclusions, given the small groups often found in IRGT trials. However, to make that determination, one would need to know whether the combination of the ICC and number of members per group is large enough to yield substantial variance inflation.

Even ICCs and group sizes that seem modest can result in a level of variance inflation that could change the outcome of a study. For example, in an IRGT trial with 2 group-based treatment conditions and 10 people per group, an ICC of 0.10 yields a variance inflation factor of 1 + [(10 − 1) × 0.10], or 1.9. An analysis that ignored within-group correlation would underestimate the variance by almost half. With so few estimates of ICCs available from IRGT trials, there is insufficient evidence to support any claim of a negligible ICC, so the possibility of a spurious result must be taken seriously.
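The arithmetic in this example can be written out step by step. The ratios of naive to correct variance and standard error follow directly from the variance inflation factor, under the simplifying assumption that the ICC and group size are the same in both conditions.

```latex
\[
\mathrm{VIF} = 1 + (m - 1)\rho = 1 + (10 - 1)(0.10) = 1.9
\]
\[
\frac{\text{variance ignoring groups}}{\text{correct variance}}
  = \frac{1}{1.9} \approx 0.53,
\qquad
\frac{\text{standard error ignoring groups}}{\text{correct standard error}}
  = \frac{1}{\sqrt{1.9}} \approx 0.73
\]
```

Under these assumptions, a test statistic computed without regard to groups is overstated by a factor of about 1.38 (the square root of 1.9), before any reduction in degrees of freedom is considered.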

To address the possibility of misleading study results in psychotherapy trials, Baldwin et al.22 examined 33 previously published psychotherapy studies under different assumptions about ICCs and found that the results of between 6 and 19 of the studies were no longer significant after a post hoc correction for ICCs and the appropriate degrees of freedom. It would be enlightening to conduct a similar assessment of studies conducted in the area of public health to determine whether significant differences remain significant after post hoc application of valid methods.

Investigators planning future trials must be aware that the methods commonly used in analyses of randomized clinical trials that assume independence of observations are not appropriate in IRGT trials. Current books on GRTs3,25 are a good source of reviews of analytic issues as well as recommendations and formulas for sample size calculations.

Grant reviewers and journal reviewers and editors must also be aware of the necessity of taking ICCs into account in the design and analysis of IRGT trials, and they must be critical of grants and manuscripts that ignore this point. In addition, they should be critical of articles that do not describe analytic models, sample size calculations, or intervention characteristics sufficiently to allow a determination of whether the design and analysis were appropriate. Group interventions may be appropriate for a variety of reasons, but they should be considered effective only if they have been evaluated using analytic methods that take study design into account properly.


We acknowledge the helpful suggestions of 3 anonymous reviewers.

Human Participant Protection …No protocol approval was needed for this study.

Peer Reviewed

Note. The findings and conclusions in this article are those of the authors and do not necessarily represent the views of the Centers for Disease Control and Prevention.

Contributors…S. L. Pals originated the study, reviewed all of the articles included in the study, and wrote the article. D. M. Murray originated the study, reviewed all of the articles included in the study, and provided substantial comments on the article. C. M. Alfano reviewed all of the articles included in the study and provided substantial comments on the article. W. R. Shadish originated the study and provided substantial comments on the article. P. J. Hannan provided substantial comments on the article. W. L. Baker wrote simulation code for an earlier version of the article and reviewed the article.


