American Journal of Public Health
Editorial. 2023 Jul;113(7):721–723. doi: 10.2105/AJPH.2023.307317

Probability Samples Provide a Means of Benchmarking and Adjusting for Data Collected From Nonprobability Samples

Michael R. Elliott

I want to thank Keith et al. (p. 768) for an important exploration of the much-neglected problem of obtaining accurate real-time estimates of the spread of the COVID-19 pandemic. The authors compared a standard probability sample with a convenience sample (each with a sampling fraction of about 1% of the finite population) and administrative records to estimate the seroprevalence of COVID-19 in Jefferson County, Kentucky. Jefferson County is essentially Louisville and its immediate suburbs, with a population of approximately 800 000.1 They found little difference between the probability sample and the convenience sample with respect to either the distribution of covariates or the prevalence estimates after raking both samples to known sex, race, and geographic region distributions, but substantial differences between the prevalence estimates from the two sampling methods and the administrative record estimates.
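
Because raking drives the comparison above, a brief sketch may help readers unfamiliar with the procedure. The following code is a minimal illustration, not the authors' actual weighting code: it rakes an initial set of weights to known population margins by iterative proportional fitting, with the sample, the margins, and the serology results all hypothetical.

```python
import numpy as np
import pandas as pd

def rake(df, weights, margins, max_iter=50, tol=1e-8):
    """Adjust weights so that weighted margins match known population shares.

    df      : DataFrame of categorical raking variables (e.g., sex, race, region)
    weights : initial design or uniform weights, one per row of df
    margins : dict mapping column name -> {category: population proportion}
    """
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(max_iter):
        w_prev = w.copy()
        for col, target in margins.items():
            for category, pop_share in target.items():
                mask = (df[col] == category).to_numpy()
                current_share = w[mask].sum() / w.sum()
                if current_share > 0:
                    w[mask] *= pop_share / current_share
        if np.max(np.abs(w - w_prev)) < tol:
            break
    return w

# Hypothetical sample and population margins, for illustration only.
sample = pd.DataFrame({
    "sex":    ["F", "M", "F", "F", "M", "M"],
    "region": ["urban", "urban", "suburban", "urban", "suburban", "suburban"],
})
margins = {"sex": {"F": 0.52, "M": 0.48},
           "region": {"urban": 0.60, "suburban": 0.40}}
weights = rake(sample, np.ones(len(sample)), margins)

# A raked prevalence estimate is then a weighted mean of the serology results.
positives = np.array([1, 0, 0, 1, 0, 0])  # hypothetical test outcomes
prevalence = np.average(positives, weights=weights)
```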

The results provide important findings that in some ways match prior expectations and in other ways defy them. First, the sample-based estimates of prevalence are higher than the prevalences obtained from the administrative records, and the authors assume that the sample-based estimates are more accurate. The authors do not state exactly why they make this assumption, but it is presumably because administrative records require a positive test reported to Louisville Metro Public Health and Wellness and thus exclude nearly all cases that were asymptomatic, as well as cases that did not result in a visit to a physician or testing site. Assuming that the sample estimates are the “gold standard,” this implies underreporting by a factor of two or more, perhaps even greater among minorities. This is not a novel finding,2,3 but it confirms previous literature in the area and provides a rough estimate of the magnitude by which administratively reported cases (which constitute the vast bulk of prevalence data both in the United States and around the world) should be multiplied to obtain an estimate of the true number of cases, at least during the 2020–2021 period.
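
A back-of-the-envelope sketch makes the adjustment-factor logic concrete: the implied multiplier is simply the survey-implied infection count divided by the administratively reported count. Apart from the approximate county population cited above, every number below is hypothetical.

```python
# Hypothetical illustration of the underreporting factor implied by comparing
# a survey-based prevalence estimate with administratively reported cases.
population = 800_000          # approximate Jefferson County population
survey_prevalence = 0.04      # hypothetical survey-based seroprevalence
reported_cases = 16_000       # hypothetical cumulative reported cases

implied_infections = survey_prevalence * population          # 32,000 infections
underreporting_factor = implied_infections / reported_cases  # factor of 2.0
```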

The second finding—that the probability sample and the convenience sample prevalence estimates correspond—contradicts a nontrivial body of literature suggesting that even low-response-rate probability samples can yield more accurate results than convenience samples. Kennedy et al.4 found major quality concerns in online panel convenience samples with respect to the measurement of US political attitudes and recreational interests, especially for Black and Hispanic samples, although there was substantial variation in quality among vendors. In a major study comparing both random-digit-dial (RDD) and address-based sampling-web (ABS-web) probability samples with six nonprobability samples on a variety of demographic, health, economic, and transportation measures, MacInnis et al.5 found that the probability samples performed substantially better when benchmarked against high-quality federal data from the Current Population Survey, the Consumer Expenditure Survey, or National Center for Health Statistics surveys such as the National Health Interview Survey and the National Health and Nutrition Examination Survey. Furthermore, Groves and Peytcheva6 found in a large meta-analysis that nonresponse rates were only weakly linked to nonresponse bias. Further work by Tourangeau7 and by Brick and Tourangeau8 explained this finding by arguing that most nonresponse is due to “missing completely at random”9 factors (i.e., factors completely independent of any data being collected from participants), such as the happenstance of contact time, study-level design features unrelated to sampled members’ characteristics, or participant-level characteristics unrelated to survey variables. Thus, it is somewhat surprising that even a low-response-rate survey did not differ to some degree from a volunteer sample. Although the authors cite 30% as a “safe” cutoff for response rates,10 MacInnis and colleagues’ RDD and ABS-web probability samples had response rates of 15% and 2%, respectively, yet still dominated their convenience sample competitors with respect to bias. It may be that, with their response rates of 2% to 5% (depending on region of the county), Keith et al. have finally descended into volunteer territory, especially given that the data collection required the respondent to make an in-person visit to a separate clinic.
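
The “missing completely at random” argument is easy to illustrate with a small simulation of my own (not drawn from the cited studies): when response propensity is unrelated to the outcome, even a 2% response rate yields an essentially unbiased prevalence estimate, whereas a modest association between propensity and serostatus produces a large bias at the same overall response rate.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
y = rng.binomial(1, 0.05, N)   # true infection indicator, 5% prevalence

# Scenario 1: response propensity unrelated to y (missing completely at
# random) -- a 2% response rate still gives a nearly unbiased estimate.
respond = rng.random(N) < 0.02
print("MCAR estimate:  ", y[respond].mean())   # close to 0.05

# Scenario 2: propensity associated with y (e.g., infected people are more
# likely to volunteer) -- the same ~2% response rate is now badly biased.
p_biased = np.where(y == 1, 0.06, 0.018)
respond = rng.random(N) < p_biased
print("Biased estimate:", y[respond].mean())   # well above 0.05
```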

As a survey statistician, I became enormously frustrated that nearly a century of learning how to obtain accurate prevalence estimates from a population appeared to be all but forgotten in this public health crisis. Systems such as the Behavioral Risk Factor Surveillance System, put into place long ago to provide flexible, real-time data collection on “emerging public health problems,”11 were not up to the task, given the speed of infectious disease spread. As noted by Keith et al., there have been a few attempts to use traditional methods. For example, in work somewhat similar to that of the authors, Menachemi et al.12 conducted a study in Indiana using 68 statewide testing facilities, obtaining a considerably higher 24% response rate. Their resulting estimate of population prevalence early in the pandemic (2.8% at the end of April 2020) was far higher than the rate obtained from the number of confirmed cases at the Indiana State Department of Health (0.3% of the population); despite the higher response rate, racial minorities were severely undersampled (8% non-White vs 23% in the population, and 2% Hispanic vs 8% in the population; estimates were poststratified to age, gender, and race distributions). Although there are no “gold standards” against which to assess prevalence measures, the resulting derived infection-fatality rate of 0.58% was consistent with the 0.66% infection-fatality rate for China estimated at that time after careful adjustment for censoring and ascertainment bias.13 On the other hand, major nonprobability samples such as Delphi-Facebook14 were shown to perform poorly when estimating COVID-19 vaccine uptake compared with the probability sample obtained from the Ipsos online KnowledgePanel,15 even though the latter had only a 10.5% response rate.16 Although this is hardly a complete literature review of the still-evolving “autopsy” of the public health and medical community’s failure to grapple with this aspect of the COVID-19 pandemic response, it does suggest that a more nimble, survey science–informed response might have been helpful. Survey researchers could perhaps have been more creative in suggesting alternatives to standard methods (e.g., use of at-door drop boxes rather than requiring travel to remote sites), although undoubtedly many such suggestions would have foundered on blanket shutdowns of survey-related data collection.
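
For readers who want to see where a derived fatality rate of this kind comes from, the arithmetic is simply cumulative deaths divided by survey-implied infections. The sketch below uses the 2.8% prevalence reported by Menachemi et al. together with an approximate state population and a hypothetical death count; it lands near the 0.58% figure but is not the study’s actual calculation.

```python
# Back-of-the-envelope derivation of an infection-fatality rate from a
# survey-based prevalence estimate (illustrative inputs, not study data).
state_population = 6_700_000       # approximate Indiana population
survey_prevalence = 0.028          # prevalence reported by Menachemi et al.
cumulative_deaths = 1_100          # hypothetical cumulative COVID-19 deaths

estimated_infections = survey_prevalence * state_population         # ~188,000
infection_fatality_rate = cumulative_deaths / estimated_infections  # ~0.006
```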

In sum, although Keith et al. provide an example in which probability sampling and convenience sampling gave similar results, I believe a broader overview still suggests the need for probability samples to provide a means of benchmarking and adjusting for data collected from nonprobability samples.17,18 For an excellent example of this approach applied to prevalence estimation, leveraging the previously mentioned Indiana study in combination with an Indianapolis-only probability sample, state-level Delphi-Facebook reports of symptoms, and administrative COVID-19 death data, see Dempsey.19 Dempsey develops a procedure that combines administrative case-count data, data from nonprobability samples, and data from random samples over time to estimate selection propensities based on key covariate information. These selection propensities are then combined with epidemiological forecast models to construct a doubly robust estimator that accounts for both measurement error and selection bias when estimating population seropositivity.
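
Dempsey’s estimator is considerably more elaborate than can be shown here, but the doubly robust idea can be sketched generically. The simulation below is my own illustration (not Dempsey’s method or code): it combines an estimated selection-propensity model with an outcome model in an augmented inverse-propensity-weighted estimator of population seropositivity, which is consistent if either model is correctly specified.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
N = 50_000

# Simulated "population": one covariate drives both selection into a
# convenience sample and seropositivity, so the naive sample mean is biased.
x = rng.normal(size=N)
y = rng.binomial(1, 1 / (1 + np.exp(-(-3.0 + 0.8 * x))))               # outcome
selected = rng.binomial(1, 1 / (1 + np.exp(-(-4.0 + 1.2 * x)))).astype(bool)

# Selection-propensity model (in practice estimated by combining the
# nonprobability sample with a reference probability sample).
X = x.reshape(-1, 1)
pi = LogisticRegression().fit(X, selected).predict_proba(X)[:, 1]

# Outcome model fit on the selected units only.
m = LogisticRegression().fit(X[selected], y[selected]).predict_proba(X)[:, 1]

# Doubly robust (augmented inverse-propensity-weighted) estimate: consistent
# if either the propensity model or the outcome model is correct.
dr_estimate = np.mean(m + selected * (y - m) / pi)

print("naive convenience-sample mean:", y[selected].mean())
print("doubly robust estimate:      ", dr_estimate)
print("true population prevalence:  ", y.mean())
```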

CONFLICTS OF INTEREST

The author has no conflicts of interest to disclose.

See also Keith et al., p. 768.

REFERENCES

1. US Census Bureau. 2023. https://www.census.gov/quickfacts/jeffersoncountykentucky
2. Albani V, Loria J, Massad E, Zubelli J. COVID-19 underreporting and its impact on vaccination strategies. BMC Infect Dis. 2021;21(1):1111. doi:10.1186/s12879-021-06780-7
3. Schwab N, Nienhold R, Henkel M, et al. COVID-19 autopsies reveal underreporting of SARS-CoV-2 infection and scarcity of co-infections. Front Med (Lausanne). 2022;9:868954. doi:10.3389/fmed.2022.868954
4. Kennedy C, Mercer A, Keeter S, et al. Evaluating Online Nonprobability Surveys. Washington, DC: Pew Research Center; 2016.
5. MacInnis B, Krosnick JA, Ho AS, Cho MJ. The accuracy of measurements with probability and nonprobability survey samples: replication and extension. Public Opin Q. 2018;82(4):707–744. doi:10.1093/poq/nfy038
6. Groves RM, Peytcheva E. The impact of nonresponse rates on nonresponse bias: a meta-analysis. Public Opin Q. 2008;72(2):167–189. doi:10.1093/poq/nfn011
7. Tourangeau R. Presidential address: paradoxes of nonresponse. Public Opin Q. 2017;81(3):803–814. doi:10.1093/poq/nfx031
8. Brick J, Tourangeau R. Responsive survey designs for reducing nonresponse bias. J Off Stat. 2017;33(3):735–752. doi:10.1515/jos-2017-0034
9. Little RJA, Rubin DB. Statistical Analysis With Missing Data. 3rd ed. Hoboken, NJ: John Wiley & Sons; 2020.
10. Hedlin D. Is there a “safe area” where the nonresponse rate has only a modest effect on bias despite non-ignorable nonresponse? Int Stat Rev. 2020;88(3):642–657. doi:10.1111/insr.12359
11. Remington PL, Smith MY, Williamson DF, et al. Design, characteristics, and usefulness of state-based behavioral risk factor surveillance: 1981–87. Public Health Rep. 1988;103(4):366–375.
12. Menachemi N, Yiannoutsos CT, Dixon BE, et al. Population point prevalence of SARS-CoV-2 infection based on a statewide random sample—Indiana, April 25–29, 2020. MMWR Morb Mortal Wkly Rep. 2020;69(29):960–964. doi:10.15585/mmwr.mm6929e1
13. Verity R, Okell LC, Dorigatti I, et al. Estimates of the severity of coronavirus disease 2019: a model-based analysis. Lancet Infect Dis. 2020;20(6):669–677. doi:10.1016/S1473-3099(20)30243-7
14. Salomon JA, Reinhart A, Bilinski A, et al. The US COVID-19 Trends and Impact Survey: continuous real-time measurement of COVID-19 symptoms, risks, protective behaviors, testing, and vaccination. Proc Natl Acad Sci U S A. 2021;118(51):e2111454118. doi:10.1073/pnas.2111454118
15. Bradley VC, Kuriwaki S, Isakov M, et al. Unrepresentative big surveys significantly overestimated US vaccine uptake. Nature. 2021;600(7890):695–700. doi:10.1038/s41586-021-04198-4
16. Consumer and Community Research Section of the Federal Reserve Board’s Division of Consumer and Community Affairs. https://www.federalreserve.gov/publications/2021-economic-well-being-of-us-households-in-2020-acknowledgments.htm
17. Elliott MR, Valliant R. Inference for nonprobability samples. Stat Sci. 2017;32(2):249–264. doi:10.1214/16-STS598
18. Wu C. Statistical inference with non-probability survey samples. Surv Methodol. 2022;48(2):283–311.
19. Dempsey W. Addressing selection bias and measurement error in COVID-19 case count data using auxiliary information. Ann Appl Stat. doi:10.1214/23-aoas1744
