Significance. 2020 Jul 29;17(4):6–7. doi: 10.1111/1740-9713.01413

The Spectre of Berkson's Paradox: Collider Bias in Covid-19 Research

Annie Herbert 1, Gareth Griffith 2, Gibran Hemani 3, Luisa Zuccolo 4
PMCID: PMC10016947  PMID: 37250182

Abstract

When non-random sampling collides with our understanding of Covid-19 risk, we must be careful not to draw incorrect conclusions about cause and effect. By Annie Herbert, Gareth Griffith, Gibran Hemani and Luisa Zuccolo


There have been many surprising things written and said about the coronavirus pandemic, but perhaps none more so than the claim that smoking might protect against Covid-19 infection (bit.ly/2YLudbR).

The claim originated early in the pandemic and was greeted with disbelief. But as New Scientist explained in a 19 May article (bit.ly/3fMDZQB), “data emerging from the countries first hit by coronavirus gave doctors pause: the proportion of smokers among those being hospitalised for covid-19 was lower than in the general population. In China, for example, about 8 per cent of people in hospital with covid-19 were smokers, while 26 per cent of the general population smoke. The equivalent figures for Italy are 8 and 19 per cent respectively.”

Observational evidence such as this can and must inform the prioritisation of avenues for research. However, while attempting to understand such observations, researchers must be aware of non-causal as well as causal explanations for observed associations. Specifically, a non-causal but credible explanation for the association between smoking and Covid-19 is that it is due to a form of bias known as “Berkson's paradox”, after the US statistician Joseph Berkson (1899–1982). For reasons we explain, it is now often called “collider bias”.

Understanding collider bias

To explain collider bias, consider the following example. In middle age, obesity generally increases with age. However, if we evaluate the cross-sectional relationship between obesity and age among middle-aged individuals tested for Covid-19, you might be tempted to believe that there is a negative correlation: that is, you have a slimmer waistline to look forward to as you get older.

But you would be wrong to conclude this. It is an inferential error arising from the fact that being older and being more obese both influence your likelihood of being tested for Covid-19, because testing in the UK has largely been confined to those presenting at hospital with severe symptoms of the disease. This can be seen in UK Biobank participants (aged 40–69) in Figure 1. Within the subsample of those tested for Covid-19, those who are more obese tend to be younger, and vice versa, which induces a negative correlation.

FIGURE 1. The cross-sectional relationship between z-scores of age and obesity score in the general population (red, r = 0.02, N = 17,613) and in the subsample of individuals tested for Covid-19 (blue, r = –0.12, N = 5,871). Data from the UK Biobank.
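The mechanism behind Figure 1 can be reproduced with a small simulation. The sketch below is illustrative only and does not use UK Biobank data: it draws age and obesity z-scores that are independent by construction, then keeps a "tested" subsample using a hypothetical logistic selection rule (the function `simulate_testing` and its coefficients are assumptions made for this example). Comparing the correlation in the whole simulated population with the correlation among those selected shows how selection alone can induce a negative association.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_testing(n=50_000, seed=1):
    """Draw independent age and obesity z-scores, then select a 'tested'
    subsample whose selection probability rises with both factors."""
    rng = random.Random(seed)
    population, tested = [], []
    for _ in range(n):
        age = rng.gauss(0, 1)
        obesity = rng.gauss(0, 1)  # independent of age by construction
        population.append((age, obesity))
        # Hypothetical selection rule: both factors raise the chance of testing.
        p_test = 1 / (1 + math.exp(-2 * (age + obesity)))
        if rng.random() < p_test:
            tested.append((age, obesity))
    return population, tested


population, tested = simulate_testing()
r_population = pearson(*zip(*population))
r_tested = pearson(*zip(*tested))
print(f"population r = {r_population:.3f}, tested r = {r_tested:.3f}")
```

Under these assumptions the population correlation stays close to zero while the correlation among the tested is clearly negative, even though no causal link between the two variables was simulated. The size of the induced correlation depends entirely on the assumed selection rule, so the numbers should not be compared with those in Figure 1.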

In general, if two factors influence being selected into a sample, we say that they “collide on selection” (see Figure 2a). Hence the name “collider bias”. The impact of this could range from introducing an association when the two factors were in fact independent, to reducing, exaggerating or even reversing an existing association in ways which are hard to anticipate.

FIGURE 2. Arrows represent causal effects, and dotted lines represent induced relationships between risk factor (yellow) and outcome (blue) when conditioning on sample selection (red). (a) The association in Figure 1 can be induced if both age and obesity influence the probability of being tested. (b) Sample selection in hospitals is influenced by different factors than those in (a), which means that associations may not be consistent between different study designs.

In a recent paper,1 we summarise how collider bias might play a role in Covid-19 studies, and identify hundreds of demographic, genetic and health-related factors all influencing the probability of a person being selected for testing for Covid-19 among UK Biobank participants. But simply having the symptoms of Covid-19 infection also increases the chances of being selected for testing. So all these health-related factors and infection will collide on selection, increasing the risk of associations being distorted by Berkson's paradox.

Like other threats to causal inference (e.g. a third factor explaining a link between the exposure and outcome), once you become aware of the collider bias mechanism, you see it lurking everywhere. This includes studies of hospitalised patients, for example. While hospitalised patients are an obvious source of Covid-19 data, they are not a random sample of the general population, but instead represent individuals who are effectively selected from that population on the grounds of being older, frail with health issues, smokers, and – of course – those suffering from Covid-19.

To see how this could lead to a misleading link between smoking and protection against Covid-19, imagine – for simplicity – that patients are admitted to hospital for just two reasons: smoking-related illness or Covid-19 (Figure 2b). Covid-19 tests on these hospitalised individuals are likely to show lower infection rates among smokers than among non-smokers, because many of the smokers are in hospital for smoking-related illness rather than Covid-19. This could explain the reports from several studies claiming that smoking might protect against Covid-19 infection.
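The two-reason thought experiment can be worked through with a few lines of arithmetic. The sketch below assumes that infection is equally common among smokers and non-smokers, so smoking has no effect on Covid-19 by construction. The function `hospital_infection_rate` and every number in it (prevalence, admission probabilities, and the optional `p_other` background admission rate) are hypothetical values chosen for illustration, not estimates from any study.

```python
def hospital_infection_rate(smoker, prevalence=0.05, p_smoking_illness=0.10,
                            p_covid_admission=0.20, p_other=0.0):
    """Share of hospitalised people in one smoking group who are infected.

    Admission happens if any independent cause fires: smoking-related illness
    (smokers only), Covid-19 (infected only), or an optional background cause.
    All probabilities are hypothetical illustration values.
    """
    def p_admit(infected):
        p_stay_out = 1.0 - p_other
        if smoker:
            p_stay_out *= 1.0 - p_smoking_illness
        if infected:
            p_stay_out *= 1.0 - p_covid_admission
        return 1.0 - p_stay_out

    admitted_infected = prevalence * p_admit(True)
    admitted_uninfected = (1.0 - prevalence) * p_admit(False)
    return admitted_infected / (admitted_infected + admitted_uninfected)


smokers_rate = hospital_infection_rate(smoker=True)
nonsmokers_rate = hospital_infection_rate(smoker=False)
print(f"infected among hospitalised smokers:     {smokers_rate:.1%}")
print(f"infected among hospitalised non-smokers: {nonsmokers_rate:.1%}")
```

With exactly two admission reasons, every hospitalised non-smoker must have Covid-19, whereas most hospitalised smokers are there for smoking-related illness. Smokers therefore appear "protected" in the hospital sample although the assumed infection rate is identical in both groups. Setting `p_other` above zero softens the contrast but, under these assumptions, does not remove it.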

Protecting against collider bias

To avoid drawing incorrect conclusions from studies that might be affected by collider bias, there are a few things to watch out for.

When reading about the selection or sampling of participants in a study of Covid-19 risk factors, we should look out for the terms “Berkson's”, “collider”, “selection” and “random”. Ideally, investigators would randomly sample from the target population, as this aims to ensure no variables can influence selection and introduce collider bias. Failing this, if the study attempts to recruit individuals who are representative of the target population, or at least uses a sample matching recruitment design, we should not need to worry too much about collider bias. However, selection processes are often not clear from the outset of a study, so even when sampling is representative, it is good practice for studies to report sensitivity analyses exploring selection bias (not only in specific studies, but also in subsequent systematic reviews). This is particularly important where causal inference is the focus of the work.2 Sensitivity analyses may include probability weighting of the samples, or an evaluation of the extent to which collider bias could explain a study's findings.1 But if a scientific paper simply states “collider bias” as a potential limitation, or does not mention it at all, be very cautious.
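As a rough illustration of what probability weighting does, the sketch below weights each selected individual by the inverse of their selection probability. It is a toy example under strong assumptions: the selection probabilities are known exactly because we simulate them (in a real study they would have to be estimated), and the selection rule `selection_probability` and helper `weighted_pearson` are hypothetical names introduced here, not methods taken from the cited papers.

```python
import math
import random


def weighted_pearson(xs, ys, ws):
    """Pearson correlation with one non-negative weight per observation."""
    total = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / total
    my = sum(w * y for w, y in zip(ws, ys)) / total
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    syy = sum(w * (y - my) ** 2 for w, y in zip(ws, ys))
    return sxy / math.sqrt(sxx * syy)


def selection_probability(age, obesity):
    """Hypothetical selection rule: between 10% and 90%, rising with both."""
    return 0.1 + 0.8 / (1 + math.exp(-2 * (age + obesity)))


rng = random.Random(2)
ages, obesities, weights = [], [], []
for _ in range(50_000):
    age, obesity = rng.gauss(0, 1), rng.gauss(0, 1)  # independent by design
    p = selection_probability(age, obesity)
    if rng.random() < p:  # keep only the selected ("tested") individuals
        ages.append(age)
        obesities.append(obesity)
        weights.append(1 / p)  # inverse probability weight

r_naive = weighted_pearson(ages, obesities, [1.0] * len(ages))
r_weighted = weighted_pearson(ages, obesities, weights)
print(f"unweighted r = {r_naive:.3f}, weighted r = {r_weighted:.3f}")
```

In this simulation the unweighted correlation in the selected sample is negative, while the weighted correlation moves back towards zero, the value built into the simulated population. The correction works here only because the selection model is known and every individual has a reasonable chance of being selected; with a misspecified model or very small selection probabilities, weighting can be unstable.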

There is clearly a place for responsive, time-sensitive analyses of convenience samples – like those of people tested for, or hospitalised with, Covid-19. However, we fail to do justice to those data collection efforts if we do not investigate biases that could lead to faulty conclusions.

Contributor Information

Annie Herbert is a senior research associate at the MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol.

Gareth Griffith is an ESRC postdoctoral fellow at the MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol.

Gibran Hemani is a senior research fellow at the MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol.

Luisa Zuccolo is a senior lecturer in epidemiology at the MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol.

References

  • 1. Griffith, G., Morris, T. M., Tudball, M. et al. (2020) Collider bias undermines our understanding of COVID-19 disease risk and severity. Epidemiology. doi: 10.1101/2020.05.04.20090506.
  • 2. Greenland, S. (2003) Quantifying biases in causal models: Classical confounding vs collider-stratification bias. Epidemiology, 14, 300–306.

Articles from Significance (Oxford, England) are provided here courtesy of Oxford University Press
