Proceedings of the National Academy of Sciences of the United States of America
Letter. 2025 Feb 18;122(8):e2425536122. doi: 10.1073/pnas.2425536122

Measuring population heterogeneity requires heterogeneous populations

Antonia Krefeld-Schwalb a,1,2, Xuwen Hua b, Eric J Johnson b,1
PMCID: PMC11874009  PMID: 39964726

Any judgment about population heterogeneity depends on the definition of the sampling frame (1). In a recent paper, Holzmeister et al. (2) (HJBK hereafter) compare different sources of heterogeneity to population heterogeneity. They find that population heterogeneity is much smaller than design and analytic heterogeneity as a source of variation in effect sizes. This is important because, if true, it presents an optimistic picture for the generalization of results from one sample to another. However, this claim is puzzling given calls to increase attention to heterogeneity in social science (3). A closer examination of their data and related work (4) (KSJ hereafter) suggests a modification to their conclusion.

Table 1 lists the studies included in HJBK and suggests a predominance of similar, somewhat narrow sample frames. University students are the most common respondents. More generally, McShane et al. (5) argue that large replication studies of the kind used in HJBK might provide limited evidence for the investigation of heterogeneity.

Table 1.

Description of samples analyzed for population heterogeneity by HJBK

| Study | Number of effects examined | Total number of samples analyzed | Sample description | Median heterogeneity factor across effects (H) |
|---|---|---|---|---|
| 1-RRR1 | 1 | 31 | Undergraduate students across 10 countries. | 1.000 |
| 1-RRR2 | 1 | 22 | Undergraduate students across 8 countries. | 1.000 |
| 2 | 1 | 21 | Undergraduate students across 12 countries. | 1.014 |
| 3 | 5 | 16 | Undergraduate students across 5 countries. | 1.000 |
| 4 | 3 | 11 | Undergraduate students across the U.S. and Canada. | 1.000 |
| 5 | 2 | 23 | Undergraduate students across 10 countries. | 1.195 |
| 6 | 2 | 22 | University students across 12 countries. | 1.103 |
| 7 | 1 | 23 | Undergraduate students across 13 countries. | 1.101 |
| 8 | 1 | 19 | University students across 11 countries. | 1.000 |
| 9 | 1 | 17 | University students across 8 countries. | 1.000 |
| 10 | 10 | 21 | Undergraduate students (95.2% of the samples) across the U.S. and Canada and participants from one online panel (U.S. MTurk). | 1.000 |
| 11 | 16 | 36 | University students (~75%) and participants from university and other online panels (~25%) across 12 countries. | 1.130 |
| 12 | 25 | 125 | University students (at least 63.2%) and participants recruited through Universities and other online panel (36.8%) across 36 countries | 1.166 |
| 13 | 1 | 7 | University students in the U.S. | 1.000 |
| Summary | 70 | 394 | University students (at least 85.8%), participants from university and other online panels (14.2%) across 41 countries. | 1.081 |

Note: The table presents studies in the same order as listed in the HJBK supplementary information table S1. Sample descriptions were based on published manuscripts, supplementary materials, and study OSF repositories.

Panel A of Fig. 1 shows the estimates of population, design, and analysis heterogeneity in HJBK. HJBK found a median H of 1.08, far below “the 2.0 threshold indicative of large heterogeneity,” and only 4 of the 70 effects exceeded that threshold.

Fig. 1.

Comparing heterogeneity estimates between HJBK and KSJ. (A) Boxplots of the distribution of the heterogeneity factor (H) in the analyses performed by HJBK, separated by the source of heterogeneity. The vertical dashed reference lines indicate benchmark levels of small, medium, and large heterogeneity based on I² values of 25% (H = 1.15), 50% (H = 1.41), and 75% (H = 2), respectively. (B) Eleven estimates of the heterogeneity factor across three studies and five paradigms using KSJ’s data, separated by study and the source of heterogeneity. Consistent with HJBK, H statistics were calculated using the metafor package (version 4.6.0; ref. 6) in R. Study 1 participants (n = 6,438) were recruited from 10 different online platforms and a student population. Participants in both Study 2 (n = 968) and Study 3 (n = 1,196) were recruited from MTurk (n = 467 in Study 2; n = 598 in Study 3) and Prolific. Study 3 systematically varied the times at which the surveys were launched. Six sampling periods were defined, and data were collected from both panels in each period. On both Sunday and Wednesday, the survey sampling was initiated at 1:00 am PST (4:00 am EST), 8:00 am PST (12:00 pm EST), and 5:00 pm PST (8:00 pm EST).
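The benchmark values of H quoted in the caption follow from the standard relation between the two statistics, I² = (H² − 1)/H² (9). The short Python sketch below is an illustration only and is not part of the original analysis, which used metafor in R; the function names are hypothetical.

```python
import math


def h_from_i2(i2: float) -> float:
    """Convert I^2 (a proportion in [0, 1)) to the heterogeneity factor H.

    Uses I^2 = (H^2 - 1) / H^2, which rearranges to H = 1 / sqrt(1 - I^2).
    """
    if not 0.0 <= i2 < 1.0:
        raise ValueError("I^2 must lie in [0, 1)")
    return 1.0 / math.sqrt(1.0 - i2)


def i2_from_h(h: float) -> float:
    """Convert the heterogeneity factor H (>= 1) back to I^2 as a proportion."""
    if h < 1.0:
        raise ValueError("H must be at least 1")
    return 1.0 - 1.0 / h**2


# Benchmarks for small, medium, and large heterogeneity, rounded to two decimals.
benchmarks = {
    label: round(h_from_i2(i2), 2)
    for label, i2 in [("small", 0.25), ("medium", 0.50), ("large", 0.75)]
}
print(benchmarks)
```

Read in the other direction, the same relation suggests that a median H of 1.08 corresponds to an I² of roughly 14%, whereas an H of 3.19 corresponds to an I² of roughly 90%.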

KSJ employed purposive variation of the sampling frame, using 11 different panels, including frequently used panels like MTurk, Prolific, and a student sample. They kept design and analysis variance minimal. Panel B reports the population heterogeneity in KSJ. As a benchmark, we also included heterogeneity across time of day and day of the week, both of which were varied in Study 3 of KSJ. Some researchers have hypothesized that these factors change effect sizes (7).

KSJ’s estimates of population heterogeneity are markedly larger than those provided by HJBK and are about as large as HJBK’s estimates for analytic and design heterogeneity. The median H across the nine estimates of population heterogeneity was 3.19, ranging from 1.72 to 9.58. All but one exceed the benchmark for large heterogeneity (H = 2). Cochran’s Q-test rejects the null of homogeneity at the 5% level for all nine estimates. The effects of day of the week and time of day are much smaller and not significant.
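To make the quantities above concrete, the following Python sketch computes Cochran’s Q and the Q-based heterogeneity factor H = sqrt(Q/(k − 1)) described by Higgins and Thompson (9) from a set of effect sizes and their sampling variances. It is a minimal illustration under stated assumptions: the function name and the example numbers are hypothetical, H is truncated at 1 by assumption, and the values reported by metafor may differ depending on the model and estimator chosen.

```python
import math


def cochran_q_and_h(effects, variances):
    """Return (Q, H, I^2) for k effect sizes with known sampling variances.

    Q is the inverse-variance-weighted sum of squared deviations from the
    pooled (fixed-effect) estimate. H = sqrt(Q / (k - 1)), truncated at 1,
    and I^2 = (Q - (k - 1)) / Q, truncated at 0.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two effects with matching variances")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    h = max(1.0, math.sqrt(q / df))
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, h, i2


# Hypothetical data: three samples with equal precision.
homogeneous = cochran_q_and_h([0.2, 0.2, 0.2], [0.01, 0.01, 0.01])
heterogeneous = cochran_q_and_h([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
print(homogeneous)
print(heterogeneous)
```

Under the null of homogeneity, Q is compared with a chi-square distribution with k − 1 degrees of freedom; that test is omitted here to keep the sketch free of third-party dependencies.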

Based on HJBK, one might be tempted to conclude that population heterogeneity is relatively unimportant and that research results generalize easily. However, HJBK’s characterization of population heterogeneity is limited by the sample frames analyzed. Narrow sample frames restrict the variation in moderators, the variables that affect the outcome, whereas these variables vary more across different designs and analyses. KSJ’s modest use of purposive sampling (limited to anglophone participants in three highly developed countries) uncovers a greater challenge to generalizability across populations.

The generalization of social science effects, even in modestly diverse samples like those in KSJ, is an important challenge to the field’s relevance. The problem is probably larger than our analysis indicates, since most studies are dominated by Western, Educated, Industrialized, Rich, and Democratic samples (8). More research is needed, including more studies (9) with more diverse populations. We worry that not acknowledging population heterogeneity will diminish the impact of social science research.

Acknowledgments

Author contributions

A.K.-S. and E.J.J. designed research; A.K.-S. and E.J.J. performed research; X.H. analyzed data; and A.K.-S. and E.J.J. wrote the paper.

Competing interests

The authors declare no competing interest.

References

  • 1. Jessen R. J., Statistical Survey Techniques (Wiley, 1978).
  • 2. Holzmeister F., et al., Heterogeneity in effect size estimates. Proc. Natl. Acad. Sci. U.S.A. 121, e2403490121 (2024).
  • 3. Kenny D. A., Judd C. M., The unappreciated heterogeneity of effect sizes: Implications for power, precision, planning of research, and replication. Psychol. Methods 24, 578–589 (2019).
  • 4. Krefeld-Schwalb A., Sugerman E., Johnson E. J., Exposing omitted moderators: Explaining why effect sizes differ in the social sciences. Proc. Natl. Acad. Sci. U.S.A. 121, e2306281121 (2024).
  • 5. McShane B. B., Tackett J. L., Böckenholt U., Gelman A., Large-scale replication projects in contemporary psychological research. Am. Stat. 73, 99–105 (2019).
  • 6. Viechtbauer W., Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36, 1–48 (2010).
  • 7. Arechar A. A., Kraft-Todd G., Rand D., Turking overtime: How participant characteristics and behavior vary over time and day on Amazon Mechanical Turk. J. Econ. Sci. Assoc. 3, 1–11 (2018).
  • 8. Henrich J., Heine S. J., Norenzayan A., The weirdest people in the world? Behav. Brain Sci. 33, 61–83 (2010).
  • 9. Higgins J. P. T., Thompson S. G., Quantifying heterogeneity in a meta-analysis. Stat. Med. 21, 1539–1558 (2002).
