For the purpose of establishing cause and effect, randomized controlled trials (RCTs) are the gold standard. Compared with observational studies, the major advantages of RCTs (assuming the trial is adequately powered and group allocation is truly random) include the elimination of confounding and reverse causation, and blinding to exposure status, which minimizes investigator and participant biases (Figure 1, left panel). A less often acknowledged ‘strength’ is the declaration of the hypotheses being tested and the definition of primary exposure and outcome measures prior to analysis: integral parts of trial registration.
Figure 1.
Comparison of randomized controlled trial (gold standard) with conventional observational study (cohort, case-control, cross-sectional). Research standards are higher for trials
The fact that a priori hypotheses and pre-defined measures are not required in observational studies (Figure 1, right panel) provides authors with opportunities to ‘recalibrate’ their research questions and exposure definitions during and after the analysis, and in response to feedback from reviewers. Such possibilities, combined with the fact that observational studies typically capture an array of exposure measures, undoubtedly promote scientific exploration and hypothesis generation. For hypothesis testing, however, this flexibility confers less protection against post hoc or ad hoc decision-making.1
What is the problem?
Publication bias is a form of post hoc decision-making, and the availability of alternative exposure definitions opens the door to further post hoc choices. Publication bias refers to the tendency for findings that are statistically significant at conventional levels, and that support the hypothesis under investigation, to be the most likely to be published.2,3 One example is the classic analysis of almost 500 projects approved by the Oxford Research Ethics Committee in the 1980s. Projects that revealed statistically significant results were 2.3 times (95% CI: 1.3–4.3) more likely to be published than those that found no association; they were also more likely to appear in journals with a high impact factor.2
Behind this lie post hoc decisions: by researchers, not to submit papers showing no statistically significant associations; and, when such papers are submitted, by editors, not to publish them. It is easy to understand why if we imagine four hazard ratios for a given putative risk factor: 1.45 (95% confidence interval 1.1–1.9) in Study A, 1.19 (0.7–1.9) in Study B, 1.88 (0.9–3.9) in Study C and 0.90 (0.5–1.6) in Study D. Assume the studies are methodologically equally strong: results from Study A are certainly more tempting for a researcher, or indeed a handling editor, to put forward than those from Studies B, C and D. In Study A the conclusion is unambiguous, whereas in Studies B, C and D it is debatable whether there is evidence of an association between the risk factor and the outcome in question, in part because it is generally harder to demonstrate a null than a positive association.
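To make the contrast concrete, a 95% confidence interval implies an approximate p-value. The minimal sketch below (assuming, for illustration, Wald-type intervals symmetric on the log scale, which the hypothetical studies do not state) recovers each standard error from the interval and converts the four results to two-sided p-values; only Study A crosses the conventional 0.05 threshold.

```python
from math import log
from statistics import NormalDist

def p_from_hr_ci(hr, lo, hi):
    """Two-sided p-value for a hazard ratio, recovering the standard
    error from a 95% CI assumed symmetric on the log scale."""
    se = (log(hi) - log(lo)) / (2 * 1.96)   # CI width in log units
    z = log(hr) / se                        # Wald statistic
    return 2 * (1 - NormalDist().cdf(abs(z)))

studies = {"A": (1.45, 1.1, 1.9), "B": (1.19, 0.7, 1.9),
           "C": (1.88, 0.9, 3.9), "D": (0.90, 0.5, 1.6)}
for name, (hr, lo, hi) in studies.items():
    print(f"Study {name}: HR={hr:.2f}, p≈{p_from_hr_ci(hr, lo, hi):.3f}")
# Only Study A yields p < 0.05 (≈0.008); B, C and D give ≈0.49, 0.09 and 0.72.
```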
Because trials require registration, the denominator of both published and unpublished results can be tracked, providing an opportunity to illustrate publication bias. A striking recent example involves antidepressant medication. In 2008, Turner and colleagues analysed 74 RCTs approved by the US Food and Drug Administration (FDA): 37 of the 38 RCTs regarded by the agency as having positive results were published, whereas of the 36 RCTs viewed by the FDA as having negative or questionable results, all but three were either unpublished or published in a way that spuriously conveyed a positive outcome (Figure 2, Panel A).4 A 2010 review of four meta-analyses of the efficacy of antidepressants examined in RCTs submitted to the FDA reached the same conclusion.5
Figure 2.
Illustrating publication bias in antidepressant efficacy trials (A) and cohort studies on job strain and coronary heart disease (B). RCTs with positive findings were more likely to be published than those with negative findings (questionable studies not shown). Published cohort studies reported, on average, greater excess risk than unpublished studies
Without study registration, a corresponding test is not possible in observational research. However, Egger’s group has developed an indirect test of publication bias, building on the assumption that such bias is likely if larger effects are reported in smaller studies.6 This is because the smaller the study, the larger the effect needed to achieve statistical significance. Egger’s test is routinely used in meta-analyses of observational studies, but it identifies only some cases of publication bias and does not allow estimation of the unbiased effect. It is therefore not a complete solution to the problem of publication bias in observational studies.
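Egger’s test itself is simple to compute: regress the standardized effect (log effect divided by its standard error) on precision (the reciprocal of the standard error), and test whether the intercept differs from zero; a non-zero intercept indicates funnel-plot asymmetry consistent with small-study effects.6 A minimal sketch follows; the input data are invented purely for illustration.

```python
import numpy as np
from scipy import stats

def egger_test(log_effects, ses):
    """Egger's regression test for small-study effects: regress the
    standardized effect (effect/SE) on precision (1/SE) and test
    whether the intercept differs from zero."""
    y = np.asarray(log_effects) / np.asarray(ses)    # standardized effects
    x = 1.0 / np.asarray(ses)                        # precisions
    X = np.column_stack([np.ones_like(x), x])        # design: intercept + slope
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    n, k = X.shape
    resid = y - X @ beta
    cov = (resid @ resid / (n - k)) * np.linalg.inv(X.T @ X)
    t = beta[0] / np.sqrt(cov[0, 0])                 # t-statistic for intercept
    p = 2 * stats.t.sf(abs(t), df=n - k)
    return beta[0], p

# Hypothetical meta-analysis: log hazard ratios and their standard errors.
log_hr = np.log([1.45, 1.19, 1.88, 0.90, 1.30])
se = np.array([0.14, 0.25, 0.37, 0.30, 0.10])
intercept, p_value = egger_test(log_hr, se)
print(f"Egger intercept = {intercept:.2f}, p = {p_value:.2f}")
```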
We recently established the IPD-Work consortium, pooling individual-level data from 200 000 participants in 13 studies with data on job strain and coronary heart disease (CHD). Because some of the studies included in the consortium had not yet published their main findings, it was possible to evaluate the effect of publication bias.7 Previous meta-analyses, which captured published papers from the 1980s to 2011, found an approximately 40% higher risk of CHD among individuals exposed to job strain, with Egger’s test revealing no evidence of publication bias.8,9 Nevertheless, direct comparison of published and unpublished data in the studies in the IPD-Work consortium revealed such a bias. In the three studies that had been published, the hazard ratio for CHD in persons reporting job strain relative to those who did not was 1.43 (95% confidence interval: 1.2–1.8) (Figure 2, Panel B),7 almost identical to findings from previous meta-analyses of published data.8,9 However, in the 10 studies that had not published, the excess risk was more than halved (hazard ratio 1.16; 95% confidence interval: 1.0–1.3). The hazard ratio of 1.23 (95% confidence interval: 1.1–1.4) when all 13 studies were combined suggests that the risk of CHD among individuals with job strain is appreciably smaller than previously thought.7
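The combined estimate can be reproduced, approximately, from the two subgroup summaries alone. The sketch below pools the published and unpublished estimates with fixed-effect inverse-variance weights, recovering standard errors from the reported confidence intervals on the log scale. This is a back-of-envelope check (the consortium itself worked from individual-participant data7), yet it lands close to the reported 1.23 (1.1–1.4).

```python
from math import exp, log, sqrt

def log_hr_and_se(hr, lo, hi):
    """Log hazard ratio and SE recovered from a 95% CI
    assumed symmetric on the log scale."""
    return log(hr), (log(hi) - log(lo)) / (2 * 1.96)

subgroups = [(1.43, 1.2, 1.8),   # published studies (Figure 2, Panel B)
             (1.16, 1.0, 1.3)]   # unpublished studies
estimates = [log_hr_and_se(*s) for s in subgroups]
weights = [1 / se**2 for _, se in estimates]          # inverse-variance weights
pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum(weights)
se_pooled = 1 / sqrt(sum(weights))
lo, hi = exp(pooled - 1.96 * se_pooled), exp(pooled + 1.96 * se_pooled)
print(f"Pooled HR ≈ {exp(pooled):.2f} (95% CI {lo:.2f}-{hi:.2f})")
# Prints roughly 1.23 (1.10-1.38), matching the reported 1.23 (1.1-1.4).
```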
The availability of alternative ways of categorizing exposures can also encourage multiple testing and in this way further contribute to selective reporting and publication bias. Fransson and colleagues, for example, found 11 different sets of questions that had been used to measure high job demands and low job control (i.e. the components of job strain).10 Landsbergis identified multiple alternative ways of defining job strain even when using identical item content: the quotient, the quadrant term, the quadrant term using national means, and linear term formulations.11 Having a range of alternative measures for a given exposure is not unique to psychosocial exposures in observational studies. For many behavioural, social and psychological exposures, including diet, physical activity, socioeconomic and occupational circumstances, social isolation and personality disposition, multiple exposure measures are available.12–14
The lack of standard measures complicates the interpretation of results. A null finding is not necessarily seen as adding to scientific knowledge, as it may be interpreted as a ‘false negative’ arising from non-optimal measures or non-optimal categorization of measures. A positive finding, on the other hand, may equally be considered not to offer definite proof of an association if similar caveats apply. In the context of psychosocial stress, we believe that the job strain hypothesis, which has been a source of controversy for several decades,15 could have been confirmed or refuted much sooner had an agreement on standard definitions been reached at an earlier stage. This is also likely to be the case for several other hypotheses based on multiple operationalizations of putative risk factors.
Is there a solution?
Decisions regarding exposure definitions and intention to submit for publication that occur after, rather than before, knowledge of the results tend to lead to inflated effect estimates in hypothesis testing. In his insightful paper ‘Why most published research findings are false’,16 Ioannidis suggested a number of remedies. The first is better powered studies.
Larger sample size
The work of Peto and colleagues17 and, more recently, of Danesh and colleagues,18 both based on pooling of individual-level data from multiple cohort studies, illustrates the benefits of such an approach for examining associations between established physiological risk factors and vascular disease. IPD-Work is a similar attempt to pool raw data in psychosocial epidemiology.19 A major advantage is that with large data sets it is possible to show (and publish) the absence of associations convincingly. For example, body mass index appears not to improve CHD prediction when information is also available on systolic blood pressure, history of diabetes and lipids;20 fasting glucose is only modestly associated with risk of vascular disease in persons without diabetes;18 and job strain is not, as had sometimes been believed, linked to cancer risk.21 Refuting a hypothesis enables progress in understanding the role of risk factors for a given outcome.
In genetics, accelerated scientific progress resulting from the use of better-powered studies is increasingly clear. Before the era of genetic mega-studies, it was not uncommon for each publication of a genetic variant-disease link to be followed rapidly by publications that failed to replicate the original association. This sorry state of affairs was largely resolved by the pooling of data from genetic studies, an advance stipulated by the funders of these studies. Indeed, it could be said that the use of collaborative mega-studies in the field of genetics represents one of the most successful recent strategic advances in biomedical research.
Pre-registration
Adopting from RCTs the principle of pre-registration of exposure definitions when testing hypotheses is likely to be useful. Some observational studies, including the CALIBER project,22 have, exceptionally, taken this approach by registering protocols on www.clinicaltrials.gov. In IPD-Work, we elected to publish a description of our exposure definitions before any linkage with outcome data.10 This two-stage data extraction strategy is meant to ensure that associations with outcomes do not affect the way exposures are operationalized.
Distinction between hypothesis testing vs generation
There are two types of study in observational epidemiology: hypothesis testing (the majority) and hypothesis generating (the minority), the latter including the development of new hypotheses and ‘hypothesis-free’ testing. A clearer separation between these two types of study might be useful. If the emphasis is on testing a hypothesis, then pre-defined exposure and outcome measures, as well as efforts to publish both positive and null findings, are critically important. This is not the case in studies in which the aim is to generate hypotheses. Such studies may explore a wide range of exposures to identify a novel risk factor for the disease under investigation. Although the rules of pre-defined exposures and avoidance of post hoc decisions do not apply here, taking account of the effects of multiple testing is crucial: at the conventional 5% significance level, roughly 1 in 20 truly null associations will appear statistically significant by chance. In this context, corrections for multiple testing ought to apply, a procedure successfully followed in high-quality genetic research, as the simulation below illustrates.
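A small simulation makes the arithmetic concrete: testing 20 truly null exposures at α = 0.05 produces at least one spurious ‘discovery’ in roughly two-thirds of analyses, whereas a Bonferroni threshold of α/m keeps the family-wise error rate near 5%. This is a generic illustration, not a reanalysis of any study discussed here.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
m, alpha, n_sim = 20, 0.05, 2000        # 20 null exposures per analysis
false_any_raw = false_any_bonf = 0
for _ in range(n_sim):
    # Under the null hypothesis, two-sided p-values are uniform on (0, 1).
    p = rng.uniform(size=m)
    false_any_raw += (p < alpha).any()       # naive per-test threshold
    false_any_bonf += (p < alpha / m).any()  # Bonferroni-corrected threshold
print(f"P(≥1 false positive), uncorrected: {false_any_raw / n_sim:.2f}")
# ≈ 1 - 0.95**20 ≈ 0.64
print(f"P(≥1 false positive), Bonferroni:  {false_any_bonf / n_sim:.2f}")
# ≈ 0.05
```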
What could be done?
It is not our intention in this commentary to blame other researchers or editors for making post hoc decisions, nor indeed to exonerate ourselves. Instead, we hope to encourage further discussion of these perennial difficulties. In particular, we want to raise the question of whether procedures similar to those used in RCTs and population genetics should be applied more systematically by researchers, reviewers and editors in the field of observational epidemiology. If this is seen as important, then changes in journal and funding policies are needed.
Funding
M.K. is supported by the Medical Research Council (K013351), the Academy of Finland, the US National Institutes of Health (R01HL036310, R01AG034454) and a professorial fellowship from the Economic and Social Research Council. G.B.D. is a member of the University of Edinburgh Centre for Cognitive Ageing and Cognitive Epidemiology, part of the cross-council Lifelong Health and Wellbeing Initiative (G0700704/84698), and a Wellcome Trust Fellow during the preparation of this manuscript. A.S-M. is supported by a ‘European Young Investigator Award’ from the European Science Foundation and the National Institute on Aging, NIH (R01AG013196, R01AG034454).
Conflict of interest: None declared.
References
- 1. Kerr NL. HARKing: hypothesizing after the results are known. Pers Soc Psychol Rev. 1998;2:196–217. doi: 10.1207/s15327957pspr0203_4.
- 2. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet. 1991;337:867–72. doi: 10.1016/0140-6736(91)90201-y.
- 3. Scargle JD. Publication bias: the “file-drawer problem” in scientific inference. J Sci Explor. 2000;14:94–106.
- 4. Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med. 2008;358:252–60. doi: 10.1056/NEJMsa065779.
- 5. Pigott HE, Leventhal AM, Alter GS, Boren JJ. Efficacy and effectiveness of antidepressants: current status of research. Psychother Psychosom. 2010;79:267–79. doi: 10.1159/000318293.
- 6. Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315:629–34. doi: 10.1136/bmj.315.7109.629.
- 7. Kivimaki M, Nyberg ST, Batty GD, et al. Job strain as a risk factor for coronary heart disease: a collaborative meta-analysis of individual participant data. Lancet. 2012;380:1491–97. doi: 10.1016/S0140-6736(12)60994-5.
- 8. Kivimaki M, Virtanen M, Elovainio M, Kouvonen A, Vaananen A, Vahtera J. Work stress in the etiology of coronary heart disease – a meta-analysis. Scand J Work Environ Health. 2006;32:431–42. doi: 10.5271/sjweh.1049.
- 9. Steptoe A, Kivimaki M. Stress and cardiovascular disease. Nat Rev Cardiol. 2012;9:360–70. doi: 10.1038/nrcardio.2012.45.
- 10. Fransson EI, Nyberg ST, Heikkila K, et al. Comparison of alternative versions of the job demand-control scales in 17 European cohort studies: the IPD-Work consortium. BMC Public Health. 2012;12:62. doi: 10.1186/1471-2458-12-62.
- 11. Landsbergis PA, Schnall PL, Warren K, Pickering TG, Schwartz JE. Association between ambulatory blood pressure and alternative formulations of job strain. Scand J Work Environ Health. 1994;20:349–63. doi: 10.5271/sjweh.1386.
- 12. Bauman AE, Reis RS, Sallis JF, Wells JC, Loos RJ, Martin BW. Correlates of physical activity: why are some people physically active and others not? Lancet. 2012;380:258–71. doi: 10.1016/S0140-6736(12)60735-1.
- 13. Burrows T, Golley RK, Khambalia A, et al. The quality of dietary intake methodology and reporting in child and adolescent obesity intervention trials: a systematic review. Obes Rev. 2012;13:1125–38. doi: 10.1111/j.1467-789X.2012.01022.x.
- 14. Holt-Lunstad J, Smith TB, Layton JB. Social relationships and mortality risk: a meta-analytic review. PLoS Med. 2010;7:e1000316. doi: 10.1371/journal.pmed.1000316.
- 15. Karasek R, Baker D, Marxer F, Ahlbom A, Theorell T. Job decision latitude, job demands, and cardiovascular disease: a prospective study of Swedish men. Am J Public Health. 1981;71:694–705. doi: 10.2105/ajph.71.7.694.
- 16. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2:e124. doi: 10.1371/journal.pmed.0020124.
- 17. Lewington S, Clarke R, Qizilbash N, Peto R, Collins R. Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. Lancet. 2002;360:1903–13. doi: 10.1016/s0140-6736(02)11911-8.
- 18. Sarwar N, Gao P, Seshasai SR, et al. Diabetes mellitus, fasting blood glucose concentration, and risk of vascular disease: a collaborative meta-analysis of 102 prospective studies. Lancet. 2010;375:2215–22. doi: 10.1016/S0140-6736(10)60484-9.
- 19. Kivimaki M, Kawachi I. Need for more individual-level meta-analyses in social epidemiology: example of job strain and coronary heart disease. Am J Epidemiol. 2013;177:1–2. doi: 10.1093/aje/kws407.
- 20. Wormser D, Kaptoge S, Di Angelantonio E, et al. Separate and combined associations of body-mass index and abdominal adiposity with cardiovascular disease: collaborative analysis of 58 prospective studies. Lancet. 2011;377:1085–95. doi: 10.1016/S0140-6736(11)60105-0.
- 21. Heikkilä K, Nyberg ST, Batty GD, et al. Work stress and cancer risk: a meta-analysis of 5700 incident cancer events in 116 000 European men and women. BMJ. 2013;346:f165. doi: 10.1136/bmj.f165.
- 22. Denaxas SC, George J, Herrett E, et al. Data Resource Profile: cardiovascular disease research using linked bespoke studies and electronic health records (CALIBER). Int J Epidemiol. 2012;41:1625–38. doi: 10.1093/ije/dys188.


