The remarkable growth in the popularity of mendelian randomization (MR) studies, which are analyses using genetic polymorphisms as instrumental variables (IVs), reflects both the importance of the causal inference problems addressed by MR and the potential power of the approach.1 Although IV analysis is a well-established econometric tool, applications in health research using genetic information as IVs present novel challenges and opportunities. Papers such as that of Au Yeung et al., providing evidence on the characteristics of a proposed genetic IV for a specific phenotype-outcome association,2 are important because they help establish a foundation of evidence relevant for many future MR studies. In this commentary, we discuss the value of integrating evidence from external sources, such as the Au Yeung et al. results, in MR studies, with specific focus on using separate-sample IV (SSIV, sometimes called split-sample IV) in data from case-control studies.
Evaluating the validity of proposed genetic IVs entails first demonstrating that the genotype predicts the phenotype. A more serious challenge in evaluating an IV is to provide evidence for the assumption that the genotype has no causal links with the outcome of interest except via the phenotype of interest.3,4 Although this cannot be conclusively proven, statistical, theoretical or biological evidence may bolster this assumption if such evidence suggests direct pathways are unlikely. External evidence such as that provided by Au Yeung et al. can be used to quantify the genotype-phenotype association; to support the assumption of no other causal pathways between the genotype and the outcome; and, via SSIV, to derive IV effect estimates.
IV analyses typically use two types of information: information on the association between the genotype and the phenotype, and information on the association between the genotype and the outcome. These two types of association can sometimes be evaluated using separate data sources. The SSIV method, described by Angrist et al. in the context of the two-stage least squares (2SLS) IV estimator,5 is valid assuming that the associations observed in the data set used to estimate the genotype-phenotype association would also hold in the data set used to estimate the genotype-outcome association and vice versa.
The first stage of the 2SLS IV estimator involves regressing the phenotype of interest (X) on the genetic IVs; for example, using three independent polymorphisms as IVs (Z1-Z3):
$$X = \alpha_0 + \alpha_1 Z_1 + \alpha_2 Z_2 + \alpha_3 Z_3 + \varepsilon \tag{1}$$
The predicted value of X is used as the independent variable in a second stage regression with the outcome (Y) as the dependent variable:
$$Y = \beta_0 + \beta_1 \hat{X} + \nu \tag{2}$$
When external information about the association between proposed genetic instruments and the phenotype is available, the predicted value of X used in Equation 2 can be estimated based on this information. If the polymorphisms are thought to have independent, additive effects, the predicted value of X is simply the sum of the number of minor alleles times the estimated effect (as reported in the external study) of each minor allele on the phenotype. The coefficient for this weighted polygenic score predicting the outcome corresponds directly with the IV effect estimate. This approach has been implemented using beta weights from genome-wide association studies for body mass index,6 and, given the huge investments in estimating genotype-phenotype associations, it could be widely applicable.
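The weighted score described above can be sketched in a few lines. This is a minimal illustration only: the per-allele effects, genotypes and outcome values are hypothetical toy numbers (the outcome is constructed to be exactly linear in the score so that the result is easy to check), and the function names `polygenic_score` and `slope` are our own.

```python
# Hypothetical per-allele effects on the phenotype for Z1-Z3, as if reported
# by an external study (illustrative values only, not real estimates).
EXTERNAL_BETAS = [0.30, 0.10, 0.20]


def polygenic_score(allele_counts, betas=EXTERNAL_BETAS):
    """Predicted phenotype (up to an intercept): minor-allele count times the
    external per-allele effect, summed over polymorphisms. Assumes the
    polymorphisms have independent, additive effects."""
    return sum(count * beta for count, beta in zip(allele_counts, betas))


def slope(x, y):
    """Least-squares slope of y on a single predictor x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Toy outcome sample: minor-allele counts (0, 1 or 2) at Z1-Z3 and outcome Y.
# Y was constructed as 1 + 2 * score, so the toy "true" effect is 2.
genotypes = [(0, 0, 0), (1, 0, 0), (0, 2, 1), (2, 1, 0), (1, 1, 2), (2, 2, 2)]
outcome = [1.0, 1.6, 1.8, 2.4, 2.6, 3.4]

scores = [polygenic_score(g) for g in genotypes]
iv_estimate = slope(scores, outcome)  # second-stage coefficient on the score
```

Because the score is expressed in units of the phenotype, the second-stage coefficient is read as the effect of a one-unit difference in the phenotype on the outcome.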
SSIV can be used if the phenotype of interest was either not measured or was measured with substantial error in the data set in which the outcome was assessed. SSIV also has the advantage of remediating weak-instruments bias, so there is less concern about using multiple polymorphisms as IVs, even if many of these polymorphisms are only weakly associated with the phenotype. To see this, note that if a set of genetic polymorphisms has no causal effect on a phenotype, a regression model using these polymorphisms to predict the phenotype will nonetheless typically be weakly predictive within any finite sample. The prediction will improve as more non-causal polymorphisms are used in the regression model, just as adding any random number as a predictor in a regression model always improves the prediction R² within the sample. The predicted phenotype value from such a model will be correlated with unmeasured causes of the phenotype within that sample, including those confounding factors that influence both the phenotype and the outcome. When this predicted phenotype value is used in the second stage of the 2SLS, it will also predict the outcome, due to its correlation with the confounders. As a result, the standard 2SLS IV estimate is biased towards the confounded conventional effect estimate when there are multiple IVs that have very small or no effects on the phenotype. This phenomenon is eliminated when the two stages are estimated in separate samples, because the association is not structural: the linear combination of the genotypes that by chance best predicts the phenotype in one sample is unrelated to the confounders of the phenotype-outcome association in the second sample.
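The weak-instrument argument above can be checked in a small simulation. The following is a minimal sketch under settings assumed purely for illustration: ten polymorphisms with no effect on the phenotype, samples of 100, a single unmeasured confounder U, and a true phenotype-outcome effect of zero. None of these values come from the studies cited, and the helper names are hypothetical.

```python
import random


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y], reduced by Gauss-Jordan elimination.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for i in range(k):
        pivot = a[i][i]
        a[i] = [v / pivot for v in a[i]]
        for j in range(k):
            if j != i:
                factor = a[j][i]
                a[j] = [vj - factor * vi for vj, vi in zip(a[j], a[i])]
    return [row[k] for row in a]


def predict(coefs, columns):
    """Fitted values from intercept-first coefficients."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r))
            for r in zip(*columns)]


def slope(x, y):
    """Coefficient of y on a single predictor x."""
    return ols([x], y)[1]


def simulate(rng, n, k):
    """k null polymorphisms; U confounds X and Y; X has no effect on Y."""
    z = [[float(rng.randint(0, 2)) for _ in range(n)] for _ in range(k)]
    u = [rng.gauss(0, 1) for _ in range(n)]
    x = [ui + rng.gauss(0, 1) for ui in u]
    y = [ui + rng.gauss(0, 1) for ui in u]
    return z, x, y


def mean(values):
    return sum(values) / len(values)


rng = random.Random(1)
n, k, reps = 100, 10, 200
conventional, one_sample, split_sample = [], [], []
for _ in range(reps):
    z1, x1, y1 = simulate(rng, n, k)    # sample used for the first stage
    z2, _x2, y2 = simulate(rng, n, k)   # independent outcome sample
    first_stage = ols(z1, x1)
    conventional.append(slope(x1, y1))                        # confounded
    one_sample.append(slope(predict(first_stage, z1), y1))    # standard 2SLS
    split_sample.append(slope(predict(first_stage, z2), y2))  # SSIV

print(mean(conventional), mean(one_sample), mean(split_sample))
```

Under these assumptions, the average one-sample 2SLS estimate would be expected to lie close to the confounded conventional estimate, whereas the average split-sample estimate would be expected to lie close to the true value of zero.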
In health research, data are often drawn from case-control studies rather than representative samples and SSIV may be especially valuable in this context. We compare below some possible approaches to deriving IV estimates in case-control data when (i) the outcome is rare in the population and (ii) the phenotype of interest is continuous.
If case-control status defines the outcome in an MR analysis, standard 2SLS estimation as described above will generally not consistently estimate the phenotype-outcome causal effect. A notable exception occurs under the null hypothesis of no causal phenotype-outcome effect, in which case 2SLS rejects the null at the nominal type 1 error rate. An alternative IV analysis potentially applicable when the binary outcome is rare in the target population is to replace the second stage with a logistic regression.7 If data are drawn from a random sample, the resulting estimator is consistent for the multiplicative causal association between the phenotype and the outcome, provided: (i) the causal effect is constant across levels of unmeasured confounders, i.e. there is no effect modification by unmeasured confounders, and (ii) the residuals for the first stage regression are homoscedastic.4,8 Palmer et al. compare several IV estimators appropriate for the situation of a random sample with a continuous phenotype and binary outcome, noting the target parameter (marginal or conditional causal odds ratio or risk ratio) and the assumptions under which each estimator is consistent for the target parameter.9
Under case-control sampling, however, this approach is not necessarily correct. It is approximately correct if the first stage regression is performed in controls only, because the results are then similar to those of the first stage regression in the population.10 This strategy has the disadvantage of losing information because cases are not used in the first stage estimates. Some information may be recovered in the first stage by pooling cases and controls and including case-control status as a covariate in the linear regression model,11 but the approach is sensitive to the rare disease assumption.12 To implement this approach, the fitted values fed into the second stage must be computed setting the case-control variable to its reference value (i.e. as if each observation were a control) for all subjects in the sample. In the event of heterogeneity in the IV-phenotype association by case-control status, the approach can be generalized by including an interaction between the IV and case-control status in the first stage regression. As before, the fitted values fed into the second stage should be computed for all subjects upon setting their case-control status to the reference level.
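The pooled first stage can be illustrated with a short sketch. The data below are hypothetical toy values, constructed so that the phenotype equals 1 + 2Z + 3D plus residuals that average zero within each genotype-by-status cell; the helper `ols` is our own and is not taken from the cited papers. The point of the sketch is only the last step: fitted values are computed for every subject, cases included, with case status set to its reference value.

```python
def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y], reduced by Gauss-Jordan elimination.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for i in range(k):
        pivot = a[i][i]
        a[i] = [v / pivot for v in a[i]]
        for j in range(k):
            if j != i:
                factor = a[j][i]
                a[j] = [vj - factor * vi for vj, vi in zip(a[j], a[i])]
    return [row[k] for row in a]


# Toy pooled case-control sample (hypothetical values).
z = [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]   # minor-allele count
d = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # 1 = case, 0 = control (reference)
residual = [0.2, -0.2] * 6
x = [1.0 + 2.0 * zi + 3.0 * di + ri for zi, di, ri in zip(z, d, residual)]

# First stage in the pooled sample, adjusting for case-control status.
a0, a1, a2 = ols([z, d], x)

# Fitted values for ALL subjects with status set to the reference level
# (d = 0), i.e. as if every observation were a control.
fitted_at_reference = [a0 + a1 * zi + a2 * 0 for zi in z]
```

These fitted values, rather than the ordinary fitted values that include the case-status term, are what would be carried into the second-stage logistic regression.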
Alternatively, if the outcome is not rare in the population, nested case-control data can be analysed by inverse-probability weighting both stages of 2SLS by the probability of selection into the sample.10 Selection probabilities are typically known by design or can be estimated in nested case-control studies. For example, if all cases in a large cohort are selected into the nested case-control study, along with 10% of non-cases in the cohort, the 2SLS could be implemented by assigning each case a weight of 1 and each non-case a weight of 10. However, this approach can be inefficient compared with the conditional approach described above, although some efficiency may be recovered by augmenting inverse probability estimating equations.13
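The weighting in this example can be sketched as follows, assuming a single genetic instrument so that weighted 2SLS reduces to a ratio of weighted covariances. The data are hypothetical toy values and the function names are our own; the selection probabilities (1 for cases, 0.10 for non-cases) are those of the example above.

```python
# Selection probabilities by outcome status, as in the example in the text:
# all cases sampled, 10% of non-cases sampled.
SELECTION_PROBABILITY = {1: 1.0, 0: 0.10}


def weighted_cov(a, b, w):
    """Weighted covariance of a and b with weights w."""
    total = sum(w)
    mean_a = sum(wi * ai for wi, ai in zip(w, a)) / total
    mean_b = sum(wi * bi for wi, bi in zip(w, b)) / total
    return sum(wi * (ai - mean_a) * (bi - mean_b)
               for wi, ai, bi in zip(w, a, b)) / total


def ipw_2sls(z, x, y, w):
    """Weighted 2SLS with one instrument: weighting both stages by w gives
    the ratio of weighted covariances cov_w(Z, Y) / cov_w(Z, X)."""
    return weighted_cov(z, y, w) / weighted_cov(z, x, w)


# Toy nested case-control sample (hypothetical values): four cases, then
# four non-cases. z = minor-allele count, x = phenotype, y = case status.
z = [1, 2, 2, 1, 0, 1, 0, 2]
x = [2.5, 3.9, 4.2, 2.8, 1.0, 2.1, 1.2, 2.9]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Inverse-probability-of-selection weights: 1 for cases, 10 for non-cases.
weights = [1.0 / SELECTION_PROBABILITY[yi] for yi in y]
estimate = ipw_2sls(z, x, y, weights)
```

With these weights each sampled non-case stands in for ten cohort members, which is why the weighted analysis targets the association in the full cohort rather than in the selected sample.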
SSIV provides a third alternative for implementing MR in case-control data sets. The first stage estimates of the association between the genotype and phenotype are obtained from other data sources and used to construct the expected value of the phenotype for every observation in the case-control data set. A logistic regression estimated with the expected phenotype value predicting the outcome then estimates the causal odds ratio (approximating the causal risk ratio) assuming the outcome is rare and the genotype-phenotype association in the external source matches the association in the target population from which the case-control data were drawn. Additional IV techniques for binary outcomes and case-control studies, including structural mean models and generalized method of moments, can be found elsewhere.9,10
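A minimal sketch of this SSIV approach is given below, assuming a single polymorphism, a hypothetical external per-allele effect of 0.5 phenotype units, and a toy case-control table. The logistic regression is fitted with a small hand-written Newton-Raphson routine (`fit_logistic`, our own helper) so that the example has no dependencies; in practice any standard logistic regression routine would serve.

```python
import math

# Hypothetical per-allele effect on the phenotype from an external study.
EXTERNAL_BETA = 0.5


def fit_logistic(x, y, iterations=25):
    """Logistic regression of y on one predictor x (with intercept),
    fitted by Newton-Raphson. Returns (intercept, slope)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Information matrix entries.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy case-control data (hypothetical): 30 cases and 60 controls with no
# minor allele; 40 cases and 40 controls with one minor allele.
allele_counts = [0] * 90 + [1] * 80
case_status = [1] * 30 + [0] * 60 + [1] * 40 + [0] * 40

# Expected phenotype (up to an intercept) built from the external estimate.
expected_phenotype = [EXTERNAL_BETA * g for g in allele_counts]

_, log_odds_ratio = fit_logistic(expected_phenotype, case_status)
causal_odds_ratio = math.exp(log_odds_ratio)  # per one-unit higher phenotype
```

In this toy table the crude odds ratio per allele is 2 and each allele is assumed to raise the phenotype by 0.5 units, so the fitted odds ratio per one-unit higher phenotype is 2^(1/0.5) = 4. Its interpretation as a causal odds ratio rests on the assumptions stated above, including a rare outcome and an external genotype-phenotype association that matches the target population.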
Simple modifications of the approaches above are applicable in secondary analyses of case-control data, when the outcome of interest is not the outcome used to define ‘case’ status. The two-stage approach applying logistic regression in the second stage may be used by including an indicator for case-status in both analysis stages. However, as before, predicted values of the phenotype estimated from the first stage regression must again be evaluated with case-control status set to its reference value. This two-stage approach recovers the phenotype-outcome multiplicative association provided (i) neither case-control status (i.e. cases as used to select the data) nor unmeasured confounders interact with the phenotype on a multiplicative scale, and (ii) case-control status does not interact with the IV in the first stage regression (on a linear scale). The SSIV, using external information to estimate the first stage regression, is also appropriate when the outcome of interest is not the condition originally used to define cases, but again the second stage must be estimated including an indicator for case-status.
In all two-stage analyses, the same covariates must be included in both stages. This presents a difficulty in SSIV models because the same covariates may not be available in both data sets. In studies that aim to provide evidence on the validity of a proposed genetic IV, it is therefore valuable to present evidence on the genotype-phenotype association with several alternative covariate sets. Finally, as noted above, many approaches rely on ‘no-interaction’ assumptions, so evidence on the plausibility of such assumptions, i.e. whether the genotype-phenotype associations are consistent across a range of characteristics, will be valuable.
Randomized controlled trials will remain the sine qua non of causal inference in health studies, but MR and related approaches offer valuable techniques for assessing causal hypotheses. Establishing a base of evidence regarding proposed genetic IVs, as in the Au Yeung et al. article, is a critical step forward to facilitate evaluation of IV validity and IV effect estimation in other studies.
Funding
The authors gratefully acknowledge financial support from the National Institute of Mental Health (MH092707-01) and the National Institute of Environmental Health Sciences (1R21ES019712-01) at the National Institutes of Health.
Conflict of interest: None declared.
References
- 1. Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol. 2003;32:1–22. doi: 10.1093/ije/dyg070.
- 2. Au Yeung S, Jiang C, Cheng K, et al. Is aldehyde dehydrogenase 2 a credible genetic instrument for alcohol use in Mendelian randomization analysis in Southern Chinese men? Int J Epidemiol. 2013;42:318–28. doi: 10.1093/ije/dys221.
- 3. Glymour MM, Tchetgen Tchetgen E, Robins JM. Credible Mendelian Randomization studies: approaches for evaluating the instrumental variable assumptions. Am J Epidemiol. 2012;175:332–39. doi: 10.1093/aje/kwr323.
- 4. Didelez V, Meng S, Sheehan NA. Assumptions of IV methods for observational Epidemiology. Stat Sci. 2010;25:22–40.
- 5. Angrist JD, Krueger AB. Split Sample Instrumental Variables. National Bureau of Economic Research Technical Paper. University of Chicago, January 1994, p. 150.
- 6. Jokela M, Elovainio M, Keltikangas-Järvinen L, et al. Body mass index and depressive symptoms: Instrumental-variables regression with genetic risk score. Genes Brain Behav. 2012;11:942–48. doi: 10.1111/j.1601-183X.2012.00846.x.
- 7. Palmer T, Thompson JR, Tobin MD, Sheehan NA, Burton PR. Adjusting for bias and unmeasured confounding in Mendelian randomization studies with binary responses. Int J Epidemiol. 2008;37:1161–68. doi: 10.1093/ije/dyn080.
- 8. Mullahy J. Instrumental-variable estimation of count data models: Applications to models of cigarette smoking behavior. Rev Econ Stat. 1997;79:586–93.
- 9. Palmer TM, Sterne JAC, Harbord RM, et al. Instrumental variable estimation of causal risk ratios and causal odds ratios in mendelian randomization analyses. Am J Epidemiol. 2011;173:1392–403. doi: 10.1093/aje/kwr026.
- 10. Bowden J, Vansteelandt S. Mendelian randomization analysis of case-control data using structural mean models. Stat Med. 2011;30:678–94. doi: 10.1002/sim.4138.
- 11. Lin D, Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet Epidemiol. 2009;33:256–65. doi: 10.1002/gepi.20377.
- 12. Tchetgen Tchetgen E. A General Regression Framework for a Secondary Outcome in Case-control Studies. Harvard University Biostatistics Working Paper Series, working paper 155. 2013. Available from: http://biostats.bepress.com/harvardbiostat/paper155 (15 February 2013, date last accessed).
- 13. Robins J, Rotnitzky A, Zhao L. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–66.