American Journal of Public Health
Editorial
2017 Aug;107(8):1214–1215. doi: 10.2105/AJPH.2017.303895

Oversampling in Health Surveys: Why, When, and How?

Roger Vaughan
PMCID: PMC5508166  PMID: 28657770

Professional survey and polling firms often “oversample”1 certain groups to better estimate attributes of that group and then use sampling weights in analyses to avoid unintended biases associated with oversampling.

WHY OVERSAMPLE?

How is this fact relevant to learning about the LGBT (lesbian, gay, bisexual, or transgender) population? Say that we wanted to do a survey of adults in America. That is our population of interest. Further say that we wanted to know the rates of hypertension among those who identify as “straight” and those who identify as “LGBT.”

We do not have the time or money to assess all straight and LGBT people in the population, so we take a sample from the population (just as we would in any poll); for purposes of illustration and example, say we had the time and money to collect information about hypertension for 100 people. But if you simply took a random sample of 100 people, you might expect something like 96 people in that sample to identify as straight and about four to identify as LGBT.2 If you were trying to describe the health characteristics of straight people, you would probably be fairly confident of your estimate of hypertension rates based on 96 people. You would probably feel much less comfortable characterizing the hypertension rates of LGBT people on the basis of answers from only four people.

So what to do? You could decide that you will not report information about LGBT individuals because only four people identified as such, or you could decide that obtaining information about LGBT individuals is important and sample from the population in a different way to ensure that you surveyed more people identifying as LGBT. This intentional sampling process, designed to incorporate more (typically low-prevalence) members of a certain community into your sample, is called oversampling.

HOW TO OVERSAMPLE?

To learn more about this (relatively) small group of the population, one would intentionally include more of its members in the sample. Say that other surveillance data show a higher prevalence of LGBT individuals in certain cities, zip code areas, or metropolitan statistical areas. We might then decide to sample in those areas first until we had selected, for example, 17 people who identified as LGBT. We would then choose the remaining 83 people randomly from the population (assuming that population proportions would result in about 80 people who say that they are straight and about three who say that they are LGBT2) to keep our sample size at 100. We are now much more confident about characterizing the hypertension rates of LGBT individuals on the basis of our sample of 20 people (17 + 3) as opposed to four.
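The gain in confidence from a larger subgroup can be made concrete with a rough calculation. The sketch below compares the standard error of an estimated hypertension rate for groups of 4, 20, and 96 people. The 30% hypertension rate is a hypothetical value chosen only for illustration, not a figure from this editorial, and the formula assumes simple random sampling within each group. The normal approximation behind the "95% margin" is also poor for a group as small as four, so the numbers are indicative only.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical hypertension rate, assumed only for this illustration.
assumed_hypertension_rate = 0.30

# Group sizes from the text: 4 (no oversampling), 20 (oversampled), 96 (straight).
standard_errors = {
    n: proportion_standard_error(assumed_hypertension_rate, n)
    for n in (4, 20, 96)
}

for n, se in standard_errors.items():
    # An approximate 95% margin of error is 1.96 standard errors.
    print(f"n = {n:>2}: standard error = {se:.3f}, approx. 95% margin = ±{1.96 * se:.3f}")
```

Because the standard error shrinks with the square root of the group size, moving from 4 to 20 respondents cuts it by a factor of about 2.2 under these assumptions.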

What we would not do is say that the prevalence of LGBT individuals in the population is 20% (20/100), because we purposefully sampled 20 such individuals to better describe their hypertension rates. When doing prevalence analyses, we would statistically “downweight” those 20 observations to equal four, so the prevalence would not change (i.e., the true prevalence would still be four per 100, or 4%). But now we have used oversampling to learn something about a perhaps hard-to-reach or low-prevalence group.

Table 1 illustrates this process numerically; the first data row provides the estimated population prevalence for the two groups, and the second row shows the percentage of each group in our sample after oversampling (note that the “amount” of oversampling would be determined by the research team). The “weights” are calculated by taking the ratio of the population prevalence to the sample percentage, and one can see that when those weights are “applied” to the data, the rates return to the correct population proportions. Clearly, this example is simplified; the process of oversampling and calculation and application of weights is complex and a discipline unto itself, but the principle is the same.

TABLE 1. Hypothetical Population and Sampling Percentages, and Creation and Application of Weights

Variable             Straight          LGBT
Population, %        96                4
Sample, %            80                20
Weight               1.2 (96/80)       0.2 (4/20)
Weight × sample n    96 (1.2 × 80)     4 (0.2 × 20)

Note. LGBT = lesbian, gay, bisexual, or transgender.
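The arithmetic in Table 1 can be reproduced in a few lines. The sketch below computes each weight as the ratio of population percentage to sample percentage and applies it to the sample counts. The hypertension case counts at the end are hypothetical numbers added only to show how weighting changes an overall estimate; they do not come from this editorial. Real survey weighting involves further steps (for example, nonresponse adjustment) that are not shown here.

```python
# Percentages and counts taken from Table 1 (sample size of 100).
population_pct = {"Straight": 96, "LGBT": 4}
sample_n = {"Straight": 80, "LGBT": 20}

total_n = sum(sample_n.values())
sample_pct = {group: 100 * n / total_n for group, n in sample_n.items()}

# Weight = population percentage / sample percentage.
weights = {group: population_pct[group] / sample_pct[group] for group in sample_n}

# Applying the weights restores the population proportions.
weighted_n = {group: weights[group] * sample_n[group] for group in sample_n}
weighted_lgbt_share = weighted_n["LGBT"] / sum(weighted_n.values())

# Hypothetical hypertension cases per group (assumed for illustration only):
# 24 of 80 straight respondents (30%) and 8 of 20 LGBT respondents (40%).
hypertension_cases = {"Straight": 24, "LGBT": 8}

# The unweighted overall rate over-represents the oversampled group.
unweighted_rate = sum(hypertension_cases.values()) / total_n

# The weighted overall rate counts each respondent according to their weight.
weighted_rate = (
    sum(weights[g] * hypertension_cases[g] for g in sample_n)
    / sum(weighted_n.values())
)

for group in sample_n:
    print(f"{group}: weight = {weights[group]:.1f}, weighted n = {weighted_n[group]:.0f}")
print(f"Weighted LGBT share of sample: {weighted_lgbt_share:.0%}")
print(f"Overall hypertension rate: unweighted {unweighted_rate:.1%}, weighted {weighted_rate:.1%}")
```

With these assumed counts, the group-specific rates (30% and 40%) are unaffected by weighting; only estimates that pool the two groups change.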

WHEN TO OVERSAMPLE?

There are readily available sampling and statistical tools that can help one learn more about lower-prevalence populations without inducing bias in calculating prevalence rates. Therefore, the decision of whether to oversample in an LGBT health survey depends on the answer to a simple question: “Is learning about the health of LGBT individuals important or not?”

REFERENCES


Articles from American Journal of Public Health are provided here courtesy of American Public Health Association
