Measuring population heterogeneity requires heterogeneous populations

Antonia Krefeld-Schwalb; Xuwen Hua; Eric J Johnson

doi:10.1073/pnas.2425536122

letter

2025 Feb 18;122(8):e2425536122. doi: 10.1073/pnas.2425536122

PMCID: PMC11874009 PMID: 39964726

Any judgment about population heterogeneity depends on the definition of the sampling frame (1). In a recent paper, Holzmeister et al. (2) (HJBK hereafter) compare population heterogeneity with other sources of heterogeneity. They find that population heterogeneity contributes much less variation in effect sizes than design and analytic heterogeneity. This is important because, if true, it presents an optimistic picture for the generalization of results from one sample to another. However, this claim is puzzling given calls to pay more attention to heterogeneity in social science (3). A closer examination of their data and of related work (4) (KSJ hereafter) suggests a modification to their conclusion.

Table 1 lists the studies included in HJBK and suggests a predominance of similar, somewhat narrow sampling frames. University students are the most common respondents. More generally, McShane et al. (5) argue that large replication studies such as those used in HJBK might provide limited evidence for the investigation of heterogeneity.

Table 1.

Description of samples analyzed for population heterogeneity by HJBK

Study	Number of effects examined	Total number of samples analyzed	Sample description	Median heterogeneity factor across effects (H)
1-RRR1	1	31	Undergraduate students across 10 countries.	1.000
1-RRR2	1	22	Undergraduate students across 8 countries.	1.000
2	1	21	Undergraduate students across 12 countries.	1.014
3	5	16	Undergraduate students across 5 countries.	1.000
4	3	11	Undergraduate students across the U.S. and Canada.	1.000
5	2	23	Undergraduate students across 10 countries.	1.195
6	2	22	University students across 12 countries.	1.103
7	1	23	Undergraduate students across 13 countries.	1.101
8	1	19	University students across 11 countries.	1.000
9	1	17	University students across 8 countries.	1.000
10	10	21	Undergraduate students (95.2% of the samples) across the U.S. and Canada and participants from one online panel (U.S. MTurk).	1.000
11	16	36	University students (~75%) and participants from university and other online panels (~25%) across 12 countries.	1.130
12	25	125	University students (at least 63.2%) and participants recruited through universities and other online panels (36.8%) across 36 countries.	1.166
13	1	7	University students in the U.S.	1.000
Summary	70	394	University students (at least 85.8%), participants from university and other online panels (14.2%) across 41 countries.	1.081

Note: The table presents studies in the same order as listed in the HJBK supplementary information table S1. Sample descriptions were based on published manuscripts, supplementary materials, and study OSF repositories.

Panel A of Fig. 1 shows the estimates of population, design, and analytic heterogeneity in HJBK. HJBK found a median H of 1.08, far below “the 2.0 threshold indicative of large heterogeneity,” and only 4 of the 70 effects exceeded that threshold.

KSJ employed purposive variation of the sampling frame, using 11 different panels, including frequently used ones such as MTurk and Prolific as well as a student sample. They kept design and analytic variance minimal. Panel B of Fig. 1 reports the population heterogeneity in KSJ. As a benchmark, we also included heterogeneity across time of day and day of the week, which was varied in Study 3 of KSJ. Some researchers have hypothesized that these factors change effect sizes (7).

KSJ’s estimates of population heterogeneity are markedly larger than those provided by HJBK and are about as large as HJBK’s estimates for analytic and design heterogeneity. The median H across the nine estimates of population heterogeneity was 3.19, ranging from 1.72 to 9.58. All but one exceed the 2.0 threshold for large heterogeneity. Cochran’s Q-test rejects the null of homogeneity at the 5% level for all nine estimates. The effects of day of the week and time of day are much smaller and not significant.
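To make the two statistics above concrete, the following is a minimal sketch of how Cochran's Q and a heterogeneity factor H could be computed from per-sample effect estimates. It assumes the Higgins and Thompson (9) form H = sqrt(Q/(k − 1)) with a floor at 1; the exact estimators used by HJBK and KSJ may differ. The function names and the two-panel input values are hypothetical illustrations, not data from either paper.

```python
import math

def cochran_q(effects, variances):
    """Return (Q, df) for Cochran's homogeneity test with inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return q, len(effects) - 1

def heterogeneity_factor(q, df):
    """H = sqrt(Q / df), floored at 1 (assumption: truncate when Q < df)."""
    return math.sqrt(max(1.0, q / df))

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the recurrence for the regularized upper incomplete gamma function,
    starting from the closed forms for df = 1 (erfc) and df = 2 (exp).
    """
    z = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail

# Hypothetical effect sizes and sampling variances from two panels.
effects = [0.10, 0.50]
variances = [0.01, 0.01]

q, df = cochran_q(effects, variances)
h = heterogeneity_factor(q, df)
p = chi2_sf(q, df)
print(f"Q = {q:.2f}, H = {h:.2f}, p = {p:.4f}")
# prints: Q = 8.00, H = 2.83, p = 0.0047
```

In this made-up case H exceeds the 2.0 threshold and the Q-test rejects homogeneity at the 5% level, whereas identical effects across samples would give Q = 0 and H = 1.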

Based on HJBK, one might be tempted to conclude that population heterogeneity is relatively unimportant and that research results generalize easily. However, HJBK’s characterization of population heterogeneity is limited by the sampling frames analyzed. Narrow frames restrict the variation in moderators, the variables that affect the outcome, whereas these variables vary more across different designs and analyses. KSJ’s modest use of purposive sampling (limited to anglophone participants in three highly developed countries) uncovers a greater challenge to generalizability across populations.

The generalization of social science effects, even in modestly diverse samples like those in KSJ, is an important challenge to the field’s relevance. The problem is probably larger than our analysis indicates, since most studies are dominated by Western, Educated, Industrialized, Rich, and Democratic samples (8). More research is needed, including more studies (9) with more diverse populations. We worry that not acknowledging population heterogeneity will diminish the impact of social science research.

Acknowledgments

Author contributions

A.K.-S. and E.J.J. designed research; A.K.-S. and E.J.J. performed research; X.H. analyzed data; and A.K.-S. and E.J.J. wrote the paper.

Competing interests

The authors declare no competing interest.

References

1. Jessen R. J., Statistical Survey Techniques (Wiley, 1978).
2. Holzmeister F., et al., Heterogeneity in effect size estimates. Proc. Natl. Acad. Sci. U.S.A. 121, e2403490121 (2024).
3. Kenny D. A., Judd C. M., The unappreciated heterogeneity of effect sizes: Implications for power, precision, planning of research, and replication. Psychol. Methods 24, 578–589 (2019).
4. Krefeld-Schwalb A., Sugerman E., Johnson E. J., Exposing omitted moderators: Explaining why effect sizes differ in the social sciences. Proc. Natl. Acad. Sci. U.S.A. 121, e2306281121 (2024).
5. McShane B. B., Tackett J. L., Böckenholt U., Gelman A., Large-scale replication projects in contemporary psychological research. Am. Stat. 73, 99–105 (2019).
6. Viechtbauer W., Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36, 1–48 (2010).
7. Arechar A. A., Kraft-Todd G., Rand D., Turking overtime: How participant characteristics and behavior vary over time and day on Amazon Mechanical Turk. J. Econ. Sci. Assoc. 3, 1–11 (2018).
8. Henrich J., Heine S. J., Norenzayan A., The weirdest people in the world? Behav. Brain Sci. 33, 61–83 (2010).
9. Higgins J. P. T., Thompson S. G., Quantifying heterogeneity in a meta-analysis. Stat. Med. 21, 1539–1558 (2002).
