Editor's Note: The online version of this article contains an explanation of how group learning affects sample size calculations and statistical power.
Members of our medical education scholarship group, including a biostatistician, were discussing an analysis when an important question came up: “Does learning on a team depend on who is on the team?” The clinician-educators replied in unison, “Of course it does.” Our statistician replied, “Well, then we have a problem.”
Medical education research emphasizes the reliability and validity of a measurement tool,1,2 and a robust approach to the analysis of data is essential. Residents and medical students learn in groups, and gains in knowledge and skills often are assessed at multiple points over time. While methods are available to address correlated or clustered data (groups of learners or multiple measurements; box 1), these methods are not emphasized or widely used in graduate medical education research. Are these methods necessary? Does it really matter? In this perspective, we aim to help medical education researchers identify clusters, understand the importance of accounting for correlated data, and feel empowered to discuss the effect of group learning and repeated measurements with statisticians.
box 1 Glossary
Independent Data: Individual data points are not related to each other. Most statistical tests, such as t tests and regression, require data to be independent for valid results.
Nonindependent Data: Individual data points are related to each other in some way. Examples include repeated measures from a single individual, and measures from residents working together on a learning team.
Repeated Measures: Measuring the same information from a single individual multiple times. A knowledge test given to the same learner at multiple time points is an example of repeated measures.
Cluster: A set of data that is not independent. A cluster may be several learners within a learning group, or may be several measurements within a single person.
Mixed Effects Model: A type of regression model that allows the statistician to specify different clusters within a data set, such as repeated measures for a single resident and groups of residents learning on teams.
Description of the Problem
Two common study design challenges in medical education that must be acknowledged and addressed methodologically in the evaluation of educational interventions are (1) multiple assessments, and (2) learning in groups or teams.
Multiple (ie, repeated) assessments of learners (box 1): Residents and medical students are often assessed more than once during educational studies. Often, researchers will assess knowledge at the beginning of an intervention or program, and again at the end. Other times, learners are measured frequently, such as daily class attendance, learning assessment at each admission, or surveys of attitudes over time. Due to differences in individual test-taking abilities and underlying individual rating tendencies, some students will be high performers/raters and others will be low performers/raters. Study design and analyses must account for this within-person correlation.
Learning in groups or teams: Learners often work in small groups, and residents work in teams on clinical services and in other clinical settings. Yet, evaluation of educational attainment is usually conducted at the level of the individual learner. When comparing educational attainment of individuals who are members of groups, researchers must account for group learning. Inevitably, there will be some groups where the learning is better, and other groups where the learning may be negatively affected. This is attributable to the composition of the groups and interactions among the members. In addition, educational resources may differ among groups. For example, ward team education corresponds with the random and nonrandom (ie, service-related) diagnoses of patients admitted during the rotation. To ignore the factors associated with group-based learning (ie, clusters of learners) disregards the nature of education, and it also violates the methodological assumptions of independent subjects inherent to most evaluation techniques (box 1).
Importantly, statistical experts often have not trained in, and may be unacquainted with, the medical education system. Statisticians may not realize that residents work in ward teams and that the teams change regularly. Therefore, medical education researchers bear the responsibility of discussing the intricacies of the study design, the study context, and the structure of the data with the statistician. Ideally, this conversation should occur during the study planning phase, as these issues affect the power of the study and, consequently, the number of learners who must be enrolled (provided as online supplemental material).3
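The effect of clustering on sample size is often summarized with the design effect, DE = 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intraclass correlation (the share of variance attributable to clusters). The sketch below is a hypothetical illustration, not drawn from any study, of how a sample size computed under independence is inflated for clustering:

```python
# Hypothetical illustration: clustering inflates the required sample size.
# DE = 1 + (m - 1) * ICC scales the sample size computed for independent
# observations (m = cluster size, ICC = intraclass correlation).
import math

def adjusted_n(n_independent, cluster_size, icc):
    """Inflate an independent-data sample size to account for clustering."""
    design_effect = 1 + (cluster_size - 1) * icc
    return math.ceil(n_independent * design_effect)

# Illustrative values: 100 learners needed under independence,
# teams of 5, and an assumed ICC of 0.10:
print(adjusted_n(100, 5, 0.10))  # -> 140
```

A modest ICC of 0.10 in teams of 5 already raises the required enrollment by 40%, which is why this conversation belongs in the planning phase.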
These issues can arise in many types of education studies, from pre-post knowledge assessments to randomized controlled trials. Medical education researchers need to identify these scenarios and partner with their statistics experts to modify study design and analyses appropriately. Failure to account for the study design in the analysis can lead to incorrect conclusions about the effect of the educational process.
Educational Scenarios
For illustrative purposes, we present 3 scenarios in which analytic challenges exist (box 2). The data in these scenarios are for illustrative purposes only and do not reflect real study data. Our outcome is a knowledge measure. In these hypothetical scenarios, our knowledge measure has broad evidence for validity, including reliability. Values of the measure range from 0 to 100, where 0 indicates no knowledge and 100 indicates full knowledge.
box 2 Scenarios and Analysis Approach

Scenario 1—Repeated Measures
For a 6-week course on neurophysiology, 100 residents are randomly assigned to either in-classroom large group learning or online individual learning. Residents undergo both a precourse knowledge assessment and a postcourse knowledge assessment.
Methodological Solutions for Repeated Measures
The simplest form of repeated measures in education is the pre-post learning assessment to examine change in knowledge, as in scenario 1. In this scenario, a 2-sample t test of postcourse knowledge is appropriate, as the 2 groups should have similar precourse knowledge if randomization worked.4 However, if the researchers in this scenario wanted to measure learning with weekly assessments of knowledge, the analysis must go beyond the t test. Instead, researchers must account for the fact that some residents will be particularly adept at neurophysiology and perform well on all weekly assessments, while other residents will be low performers across all assessments. One way to address this methodologically is to treat each resident as a cluster (the weekly assessment data are correlated within each individual resident). Application of the techniques for clustered data/groups discussed below is appropriate. If no adjustment is made for the clustering of observations (ie, multiple assessments from each individual), the researchers may draw incorrect conclusions regarding the effect of the educational intervention.
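A small simulation can make the within-person correlation concrete. In this sketch (all numbers are invented), each resident has a stable "ability" that shifts every weekly score up or down, so scores from the same resident resemble each other more than scores from different residents:

```python
# Hypothetical simulation: why weekly scores from the same resident are
# correlated. Each resident's stable ability shifts all of his or her
# weekly assessments; week-to-week noise is comparatively small.
import random
import statistics

random.seed(1)
WEEKS = 6
scores = {}  # resident id -> list of weekly scores
for resident in range(30):
    ability = random.gauss(70, 8)  # stable between-resident component
    scores[resident] = [min(100, max(0, ability + random.gauss(0, 4)))
                        for _ in range(WEEKS)]

# Rough variance decomposition: between-resident variance dominates
# within-resident variance, so the intraclass correlation within
# residents is high (this is a crude estimate, not a formal ICC).
between = statistics.variance([statistics.mean(s) for s in scores.values()])
within = statistics.mean([statistics.variance(s) for s in scores.values()])
icc = between / (between + within)
print(f"estimated ICC within residents: {icc:.2f}")
```

A naive t test treats all 180 scores as independent; treating each resident as a cluster acknowledges that they are not.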
Scenario 2—Group Learning
Twenty residents are randomly assigned to 1 of 4 inpatient cardiology ward teams made up of 5 residents per team. In the control group, 2 teams have patient care–based education, which is learning from patients admitted to the service during the month. In the intervention group, 2 teams have a series of lectures in addition to the patient care–based education. Residents in both groups undergo a knowledge assessment at the end of the month.
Methodological Solutions to Account for Group Learning
In scenario 2, researchers must account for the study design of 2 ward teams in each arm of the study. The methods used to account for the clusters (ward teams) will change the width of the confidence interval.5
In the figure, we present results of analyses for scenario 2. In panel A, researchers directly compare the learning between the intervention and control groups using a t test (without accounting for the group learning in ward teams). The researchers report a statistically significant 3.3-point improvement in learning in the intervention group versus the control group (P = .02). However, this analysis is incorrect, as it failed to account for the learning that occurred within the teams. When examining the same data, as illustrated in panel B, the knowledge in team B is higher than the knowledge on all of the other teams. This difference may be due to the composition of team B, or because of differences in educational exposure related to the care of different patients during the rotation. Regardless, when these same data are appropriately analyzed, accounting for potential nonindependence (ie, analyzed using methods to adjust for data clustered in ward teams), researchers conclude that the difference between the groups is 3.3 points (same as panel A when they did not account for ward team learning). However, using the correct analytic technique, the confidence interval is wider, and the researchers conclude there is no statistically significant difference in learning between the intervention and control groups (P = .27).
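The widening of the confidence interval in panel B can be sketched with the design-effect idea: the point estimate is unchanged, but the standard error is inflated by the square root of the design effect. The naive standard error and ICC below are purely illustrative assumptions, not the data behind the figure:

```python
# Hedged sketch of the cluster adjustment: the difference estimate is
# unchanged, but the standard error is multiplied by sqrt(design effect),
# widening the confidence interval. All numbers are illustrative.
import math

def cluster_adjusted_ci(diff, naive_se, cluster_size, icc, z=1.96):
    design_effect = 1 + (cluster_size - 1) * icc
    adj_se = naive_se * math.sqrt(design_effect)
    return (diff - z * adj_se, diff + z * adj_se)

# Illustrative values: 3.3-point difference, assumed naive SE of 1.4,
# teams of 5, assumed ICC of 0.25:
naive_ci = (3.3 - 1.96 * 1.4, 3.3 + 1.96 * 1.4)  # excludes 0: "significant"
adj_ci = cluster_adjusted_ci(3.3, 1.4, 5, 0.25)  # includes 0: not significant
print(naive_ci, adj_ci)
```

The same 3.3-point difference can thus be "significant" under the naive analysis and not significant once team clustering is respected, mirroring panels A and B.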
FIGURE.
Accounting for Clusters in the Analysis of Simulated Data
All 3 panels contain the same data. In panel A, the data are ungrouped and dashed lines represent the mean in the intervention and control groups. In panel B, the data are grouped in teams. The dashed lines remain unchanged. Solid lines represent the team mean. In panel C, each individual subject is delineated within the team. The dashed and solid lines remain unchanged. Asterisks represent the individual's mean.
Scenario 3—Repeated Measures and Group Learning
Twenty residents are randomly assigned to teams that are identical to those in scenario 2, except that researchers test knowledge on a weekly basis.
Methodological Solutions for Repeated Measures of Learners in Groups
In this scenario, medical educators want to assess repeated measures among learners who are clustered into groups. Residents are on ward teams, and the composition of the ward team will likely affect their learning. In addition, some residents will be high or low performers. The researcher must account for correlation both within residents (individual-level cluster) and among residents in teams (group-level cluster). One approach is to use a mixed effects model (box 1).6 In panel C of the figure, researchers examine the same data as in panels A and B, but each individual subject is delineated within each team. Some subjects are generally high performers (eg, resident 10) and others are low performers (eg, resident 20). Using a linear mixed model to account for repeated observations within individuals and across the teams, researchers estimate that the difference related to the intervention is 3.0 points, and conclude that there were no significant differences between the intervention and control groups (P = .34).
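As a hedged sketch of the data structure in this scenario, the simulation below generates weekly scores carrying a team-level shift, a resident-level shift, and week-to-week noise. The 3.0-point effect and all variance components are illustrative assumptions; in practice, records like these would be passed to a mixed effects routine (eg, a linear mixed model with random intercepts for team and resident):

```python
# Hypothetical data-generating sketch for scenario 3: each weekly score
# combines an intervention effect, a team-level random shift, a
# resident-level random shift, and week-to-week noise.
import random

random.seed(7)
INTERVENTION_EFFECT = 3.0  # illustrative assumption
records = []
for team in range(4):
    treated = team < 2                       # teams 0-1 get the intervention
    team_shift = random.gauss(0, 3)          # team-level random effect
    for resident in range(5):
        resident_shift = random.gauss(0, 5)  # resident-level random effect
        for week in range(4):
            score = (70 + (INTERVENTION_EFFECT if treated else 0)
                     + team_shift + resident_shift + random.gauss(0, 2))
            records.append({"team": team, "resident": (team, resident),
                            "week": week, "score": score})

print(len(records))  # 4 teams x 5 residents x 4 weeks = 80 rows
```

A mixed effects model fit to such records estimates the intervention effect while apportioning the remaining variance among teams, residents, and weeks, which is exactly what the naive t test cannot do.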
Additional Considerations: Measuring Team Effects, Randomization, and Dichotomous Outcomes
Measuring Team Effects
While we have highlighted some common issues to consider in the evaluation of medical education initiatives, we have not covered all of the methodological issues and options within these methods. If a researcher wants to estimate how much learning is related to the teams themselves, then hierarchical regression techniques are necessary.
Randomization
Ideally, intervention trials are randomized. Effective randomization can ensure equal distribution of learners in intervention and control groups (ie, equal numbers of high performers and low performers). However, randomization does not mitigate the need to use appropriate methodological techniques to handle clustered data. For example, in scenario 2, the residents were randomly assigned to resident teams; however, some teams were higher performing than others. Furthermore, randomization can be challenging in medical education scenarios. Often, the number of participants is too small for effective randomization. Learner schedules (eg, resident team assignments) are often set prior to the decision to perform a test of change, prohibiting randomization.
Dichotomous Outcomes
All of our scenarios use a continuous knowledge outcome. When the outcome is dichotomous (eg, “Did a resident attend a teaching conference?” with possible answers of yes or no), a different model, such as a logistic regression model, must be specified. Given the complexity of these considerations and calculations, statisticians are essential members of the medical education research team.
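As a minimal sketch with invented counts, a dichotomous outcome such as conference attendance is compared on the odds scale; a mixed effects logistic model would additionally carry the cluster structure discussed above:

```python
# Hedged sketch: comparing a yes/no outcome (conference attendance)
# between two groups as an odds ratio. The counts are invented.
def odds_ratio(attended_a, total_a, attended_b, total_b):
    """Odds of attendance in group A relative to group B."""
    odds_a = attended_a / (total_a - attended_a)
    odds_b = attended_b / (total_b - attended_b)
    return odds_a / odds_b

# Illustrative counts: 40 of 50 attended in group A, 25 of 50 in group B:
print(round(odds_ratio(40, 50, 25, 50), 2))  # -> 4.0
```

A logistic model estimates this quantity while adjusting for covariates, and a mixed effects logistic model does so while accounting for residents clustered in teams.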
Summary
Medical education interventions often pose methodological challenges due to repeated measures and learning that occurs in groups. These challenges can be addressed by applying the relevant statistical techniques. Without proper evaluation methods, an educator might conclude that a new curriculum or intervention is effective when it is not, or vice versa. Early consultation with a statistician is ideal, as many of these issues affect the sample size needed to detect meaningful differences and should be incorporated into the statistical analysis plan. This primer, along with examples, may be a useful place to start the discussion with the statistician.
Supplementary Material
Footnotes
Katherine A. Auger, MD, MSc, is Assistant Professor, Division of Hospital Medicine, Department of Pediatrics, Cincinnati Children's Hospital Medical Center; Karen E. Jerardi, MD, MEd, is Assistant Professor, Division of Hospital Medicine, Department of Pediatrics, Cincinnati Children's Hospital Medical Center; Jeffrey M. Simmons, MD, MSc, is Associate Professor, Division of Hospital Medicine, Department of Pediatrics, Cincinnati Children's Hospital Medical Center; Matthew M. Davis, MD, MAPP, is Professor of Pediatrics, Professor of Internal Medicine, and Professor of Public Policy, Gerald R. Ford School of Public Policy, and Co-Director, Robert Wood Johnson Clinical Scholars Program, University of Michigan; Jennifer O'Toole, MD, MEd, is Associate Professor, Division of Hospital Medicine, Department of Pediatrics, Cincinnati Children's Hospital Medical Center; and Heidi J. Sucharew, PhD, is Assistant Professor, Division of Biostatistics and Epidemiology, Department of Pediatrics, Cincinnati Children's Hospital Medical Center.
References
- 1.Artino AR Jr, Durning SJ, Creel AH. AM last page: reliability and validity in educational measurement. Acad Med. 2010;85(9):1545. doi:10.1097/ACM.0b013e3181edface.
- 2.Sullivan GM. A primer on the validity of assessment instruments. J Grad Med Educ. 2011;3(2):119–120. doi:10.4300/JGME-D-11-00075.1.
- 3.Bland M. An Introduction to Medical Statistics. 3rd ed. Oxford, UK: Oxford University Press; 2000.
- 4.Norman GR, Streiner DL. Two repeated observations: the paired t-test and alternatives. In: Biostatistics: The Bare Essentials. 3rd ed. Shelton, CT: People's Medical Publishing House; 2008:101–106.
- 5.Merlo J, Chaix B, Yang M, Lynch J, Råstam L. A brief conceptual tutorial of multilevel analysis in social epidemiology: linking the statistical concept of clustering to the idea of contextual phenomenon. J Epidemiol Community Health. 2005;59(6):443–449. doi:10.1136/jech.2004.023473.
- 6.Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data. 2nd ed. Oxford, UK: Oxford University Press; 2002.