Abstract
In this article, we assess the impact of case-control sampling on mendelian randomization analyses with a dichotomous disease outcome and a continuous exposure. The 2-stage instrumental variables (2SIV) method uses the prediction of the exposure given genotypes in the logistic regression for the outcome and provides a valid test and an approximation of the causal effect. Under case-control sampling, however, the first stage of the 2SIV procedure becomes a secondary trait association, which requires proper adjustment for the biased sampling. Through theoretical development and simulations, we compare the naïve estimator, the inverse probability weighted estimator, and the maximum likelihood estimator for the first-stage association and, more importantly, the resulting 2SIV estimates of the causal effect. We also include in our comparison the causal odds ratio estimate derived from structural mean models by double-logistic regression. Our results suggest that the naïve estimator is substantially biased under the alternative, yet it remains unbiased under the null hypothesis of no causal effect; the maximum likelihood estimator yields smaller variance and mean squared error than other estimators; and the structural mean models estimator delivers the smallest bias, though generally incurring a larger variance and sometimes having issues in algorithm stability and convergence.
Keywords: biased sampling, causal inference, instrumental variable, secondary trait association, 2-stage instrumental variables method
In observational studies, the causal effect of an exposure on a disease outcome is often masked by unmeasured confounding. Mendelian randomization studies, which use inherited genetic variants as instrumental variables to facilitate causal inference (1, 2), are increasingly popular. The key concept is that genetic predisposition is assigned randomly at meiosis and is therefore largely independent of nongenetic confounding variables. If the genetic variant is associated, preferably strongly, with the exposure and there is no alternative pathway from the genetic variant to the disease outcome, then a test of the causal effect of the exposure on the disease can be obtained from the test of association between the genetic variant and the disease outcome (1, 3). For continuous outcomes and under additional model assumptions, the 2-stage least-squares method yields a consistent and asymptotically normal estimator of the causal effect (4, 5).
Compared with the classic instrumental variable regression in econometrics, mendelian randomization analyses in genetic epidemiology have the unique feature that disease outcomes are predominantly case-control status. Recently, substantial work has been done to estimate the causal odds ratio for dichotomous outcomes, ranging from the 2-stage instrumental variables (2SIV) approximation estimator that replaces the second-stage least-squares regression in 2-stage least squares by logistic regression (6, 7), to the related control function estimator (6), to the consistent estimator using double-logistic structural mean models (SMMs) under the potential outcomes framework (8, 9). Among these methods, the 2SIV estimator is often the de facto inference method, because it is straightforward to implement and approximates the causal effect that is defined in structural equation models (6, 7). The double-logistic estimator, though consistent in large samples, can be computationally difficult and sometimes suffers from poor identification, or a lack of identification, in finite samples (10). The other drawback of the double-logistic SMMs is that the postulated association model may be uncongenial with the causal model, especially when there are adjusting covariates (9). Alternatively for SMMs, a selection-bias function can be parameterized to avoid the uncongeniality issue (11), although this approach is computationally demanding in the presence of covariates. For a review, refer, for example, to Clarke and Windmeijer (12).
In epidemiologic studies, case-control sampling has been widely used for its cost effectiveness and statistical efficiency. A notable example of mendelian randomization using the case-control sampling design is the study of causal relationship between sex hormone–binding globulin (SHBG) and type 2 diabetes (13). Women with newly diagnosed type 2 diabetes and their 1:1 matched controls were sampled from the Women's Health Study for measuring genetic polymorphisms in the sex hormone-binding globulin gene, SHBG, and the plasma SHBG level. These analyses (13) used the naïve 2SIV method and ignored the case-control sampling scheme. Unlike the classic case-control analyses in which case-control sampling can be conveniently ignored while still preserving statistical efficiency (14), the validity of such naïve analyses in case-control mendelian randomization analyses needs to be carefully evaluated.
The difficulty that case-control sampling poses for the 2SIV procedure arises in the first stage, where the genetic risk score for the exposure is computed from a linear model. Because the genotypes and the plasma SHBG level were ascertained by case-control sampling and because the plasma SHBG level is strongly associated with the diabetes status, the first stage of the 2SIV procedure is effectively a secondary-trait association, which has recently been investigated outside of the realm of mendelian randomization (15–18). Except in a few special situations, a naïve secondary-trait association without adjustment for case-control sampling can be severely biased, even under the null hypothesis of no secondary-trait association. We note that, once the case-control sampling is accounted for in the first stage of 2SIV, the second stage presents no additional problem because the case-control sampling can be ignored for estimating slope parameters in a logistic regression (14).
Recently, there have been some discussions on this topic. For double-logistic SMMs, case-control sampling can be accounted for by the generalized method of moments (GMM) without much difficulty (19). For 2SIV estimators, it was noted that the naïve 2SIV method may not work under case-control sampling, and adjustment for case-control sampling bias can be performed by the inverse probability weighted (IPW) estimator, as suggested previously (20, 21).
However, the IPW estimator has poor statistical efficiency in estimating secondary-trait association, because it does not leverage the information relating the disease outcome, the secondary trait, and the genotype. Use of the maximum likelihood (ML) estimator in the first-stage secondary-trait association, such as those developed previously (16, 17), would improve the efficiency of the resulting 2SIV estimator, although the ML estimator is sensitive to model misspecification. For mendelian randomization studies, the efficiency-robustness consideration is quite different from that in secondary-trait association in a genome-wide association study. First, genetic instrumental variables are often chosen from known susceptibility loci that were identified and replicated a priori for the intermediate phenotype (13, 22), so a false-positive genotype-exposure association is less of a concern. Second, the instrumental variable regression requires some stringent assumptions to start with. For example, there is no direct effect from the gene to the outcome, nor is there interaction between gene and biomarker on the outcome. There are extensive discussions on limitations of these assumptions (7, 9, 23) that are not the intended topic of this work. If investigators make these assumptions and conduct mendelian randomization analyses, model misspecification may be less of a concern for the ML estimator of secondary-trait association. In this case, the balance may tip toward the potential efficiency gain, because mendelian randomization analyses are often plagued by weak genetic instruments and lack of power.
In this article, we give a rigorous treatment of the impact of case-control sampling on the testing and estimation of the causal effect of a continuous exposure, typically a biomarker, on a rare disease outcome. This scenario is common in mendelian randomization studies (13, 22, 24), encompassing a majority of cardiovascular diseases and cancers. We first define the causal effect in structural equation models. We show that the naïve 2SIV estimator provides a valid test for the causal effect under random sampling and, surprisingly, under case-control sampling. We then develop asymptotic distributions for various 2SIV estimators in a unified fashion for the first-stage naïve, IPW, and ML estimators. For completeness, we also include in the comparison the double-logistic SMM because of its easy implementation by the GMM function in R (R Foundation for Statistical Computing, Vienna, Austria). Such numerical comparison between 2SIV and SMM is rarely seen in the literature. The assessment of bias and efficiency for various estimators under case-control sampling is conducted by extensive simulations. We close with a discussion of the merits and limitations of the aforementioned estimators.
METHODS
Consider a mendelian randomization study of the potential causal effect of a continuous intermediate phenotype X on a dichotomous disease outcome Y. Denote by G a set of genotypes used as instrumental variables for inferring the causality. Denote by U the collection of confounding variables that mask the causal effect of X on Y. Suppose U is indeed measured; then, under a random sampling scheme, a cohort of N subjects with independent and identically distributed random variables (Yi, Xi, Gi, Ui) is generated from the following models sequentially:
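| $X_i = \alpha_0 + \alpha_1 G_i + U_i + \varepsilon_i$ | (1) |
| $\operatorname{logit}\Pr(Y_i = 1 \mid X_i, U_i) = \theta_0 + \theta_1 X_i + U_i$ | (2) |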
where εi is an identically distributed random error that is independent of Gi and Ui. By construction, both Ui and εi have mean 0. Models 1 and 2 implicitly assume that Y ⊥ G|(X, U). The models also imply that there is no interaction between G and U in generating X; there is no interaction between X and U in generating Y; and the error εi is homoscedastic. Often in genetic studies, there are adjusting covariates in models 1 and 2 conditional upon which the validity of instrumental variables holds. The results presented in this paper can be extended to more realistic settings with covariates. For ease of exposition, we omit adjusting covariates hereafter.
The causal effect of X on Y is defined as the log odds ratio θ1 in model 2, conditional on all confounding variables U. The challenge of directly estimating θ1 is that U is not measured. Further complicating the estimation of the causal effect, a case-control sampling mechanism takes place so that only a subset of subjects have (G, X) measured. Let Ri denote the indicator of whether a subject is included in the case-control subset. The observed data, therefore, contain (Yi, Ri, XiRi, GiRi), i = 1, … N. We assume that Pr(Ri = 1|Yi = k, Xi, Gi) = Pr(Ri = 1|Yi = k) = πi, so that G and X are missing at random in the sense described elsewhere (25). Following the Bernoulli sampling (26), the participants are first inspected for Yi sequentially, and the ith subject is selected for observation (Gi, Xi) with prespecified positive probabilities πi. Let n denote the number of subjects who are in the case-control samples.
Instrumental variables and the 2-stage estimator
If the sampling scheme is simply random and the outcome Y is continuous, the difficulty of estimating the causal effect in the presence of unmeasured confounding can be overcome by exploiting instrumental variables. Because G ⊥ U, α1 ≠ 0, and Y ⊥ G|(X, U), G qualifies as an instrumental variable. The classic 2-stage least-squares estimator proceeds by first regressing X on G and then regressing Y on the predicted value from the first stage (4, 5). When there is only 1 instrumental variable, the 2-stage least-squares estimate of θ1 is merely the ratio of the instrumental variable effect on Y to the instrumental variable effect on X.
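The ratio form of the estimator with a single instrument can be illustrated with a minimal simulation sketch in Python. The linear data-generating model, the parameter values (a causal effect of 0.5 and an instrument effect of 0.8), and the function names below are assumptions made for illustration and are not taken from this article.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def simulate_cohort(n, theta1, alpha1, seed):
    """Hypothetical linear setting with an unmeasured confounder U."""
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        gi = 1 if rng.random() < 0.4 else 0       # instrument
        ui = rng.gauss(0.0, 1.0)                  # unmeasured confounder
        xi = alpha1 * gi + ui + rng.gauss(0.0, 1.0)   # exposure
        yi = theta1 * xi + ui + rng.gauss(0.0, 1.0)   # continuous outcome
        g.append(gi)
        x.append(xi)
        y.append(yi)
    return g, x, y


g, x, y = simulate_cohort(n=20000, theta1=0.5, alpha1=0.8, seed=1)

# Ratio of the instrument effect on Y to the instrument effect on X
iv_estimate = covariance(g, y) / covariance(g, x)
# Ordinary least squares of Y on X, which is confounded by U
ols_estimate = covariance(x, y) / covariance(x, x)

print(f"ratio (IV) estimate: {iv_estimate:.3f}, OLS estimate: {ols_estimate:.3f}")
```

In this hypothetical setting the ordinary least-squares slope absorbs the confounding through U, whereas the ratio estimate stays close to the assumed causal effect.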
For a dichotomous Y, the 2SIV estimator can be similarly obtained by fitting the 2-stage models sequentially,
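| $E(X_i \mid G_i) = \alpha_0 + \alpha_1 G_i$ | (3) |
| $\operatorname{logit}\Pr(Y_i = 1 \mid G_i) = \beta_0 + \beta_1 \hat{X}_i, \quad \hat{X}_i = \hat{\alpha}_0 + \hat{\alpha}_1 G_i$ | (4) |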
The 2SIV estimator based on models 3 and 4 converges to the solution of the expected estimating equation derived from working model 4, with the fitted exposure replaced by its population counterpart α0 + α1G. This limit generally differs from θ1, so the estimator is inconsistent for θ1. Nonetheless, we show in the following corollary that stronger statements can be made concerning the properties of this limit.
Corollary 1
Suppose that models 1 and 2 are data-generating and that the two working models, 3 and 4, are fitted. If G qualifies as an instrumental variable, then the limit of the 2SIV estimator equals 0 if and only if θ1 = 0, and this limit has the same sign as θ1.
The proof can be found in the Appendix. This result suggests that, although the naïve 2SIV estimator for a causal odds ratio is generally not consistent, the corresponding testing procedure is valid and consistent for testing whether the causal effect θ1 = 0. This result holds regardless of whether the disease is rare or not and thus establishes the general utility of 2SIV in mendelian randomization analyses.
Mendelian randomization under case-control sampling
The difficulty that case-control sampling poses for 2SIV estimation lies in the first stage, which is essentially a secondary-trait association previously studied (15–17). Let α = (α0, α1) represent the parameters in model 3. If the intermediate phenotype X and the case-control status Y are correlated, the distribution of X in the ascertained case-control sample is likely to be different from that in the general population because of oversampling of cases. So the naïve estimation of α ignoring the sampling scheme may be biased. In what follows, we first establish, in a unified fashion for rare disease outcomes, the asymptotic distribution of the 2SIV estimator based on an approximately consistent estimator of α; second, we discuss the properties of the estimator of α in the naïve 2SIV estimator and in the IPW- or ML-based 2SIV estimators that account for selection bias; then, we include in our discussion the double-logistic SMM estimators.
Suppose Y is indeed rare, so that the logistic regression in model 4 can be approximated by log-linear regression. For the naïve, IPW, or ML approach, the 2SIV method proceeds by first computing an estimate of α and then plugging it into the log-linear model
| $\log \Pr(Y_i = 1 \mid G_i) = \beta_0 + \beta_1(\hat{\alpha}_0 + \hat{\alpha}_1 G_i)$ | (5) |
in the case-control sample. The resulting estimate of the slope β1 is the 2SIV estimator of θ1.
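With a single dichotomous instrument, fitting the logistic second stage (model 4) on the predicted exposure is a reparameterization of regressing Y on G, so the 2SIV slope reduces to the G-Y log odds ratio divided by the first-stage slope estimate. The sketch below shows that calculation in Python; the 2 × 2 counts, the first-stage slope of −0.405, and the function names are hypothetical values chosen for illustration, not data from any study.

```python
import math


def log_odds_ratio(cases_g1, cases_g0, controls_g1, controls_g0):
    """Log odds ratio of disease comparing G = 1 with G = 0 in a 2 x 2 table."""
    return math.log((cases_g1 * controls_g0) / (cases_g0 * controls_g1))


def two_stage_slope(log_or_gy, alpha1_hat):
    """2SIV slope for one dichotomous instrument: G-Y log odds ratio over the G-X slope."""
    if alpha1_hat == 0:
        raise ValueError("the first-stage slope must be nonzero")
    return log_or_gy / alpha1_hat


# Hypothetical case-control counts by genotype
gamma1_hat = log_odds_ratio(cases_g1=80, cases_g0=145, controls_g1=90, controls_g0=135)
# Hypothetical first-stage slope, assumed already adjusted for case-control sampling
theta1_hat = two_stage_slope(gamma1_hat, alpha1_hat=-0.405)

print(f"G-Y log odds ratio: {gamma1_hat:.4f}, 2SIV slope: {theta1_hat:.4f}")
```

The second-stage numerator is unaffected by case-control sampling, so in this special case the choice among the naïve, IPW, and ML approaches matters only through the denominator.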
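Define the log-linear regression model of Y and the instrumental variables G as
| $\log \Pr(Y_i = 1 \mid G_i) = \gamma_0 + \gamma_1 G_i$ | (6) |
It follows that γ1 = θ1α1 and γ0 = θ0 + θ1α0 + c, because, under the rare disease approximation,
$$\log \Pr(Y_i = 1 \mid G_i) \approx \log E\{\exp(\theta_0 + \theta_1 X_i + U_i) \mid G_i\} = \theta_0 + \theta_1\alpha_0 + \theta_1\alpha_1 G_i + c,$$
where c = log E[exp{(1 + θ1)Ui + θ1εi}] does not depend on Gi. This result does not require εi to be normally distributed as long as the error is homoscedastic (7). In matrix notation, denote by β = (β0, β1) the intercept and slope of model 5, γ = (γ0, γ1), and M the 2 × 2 matrix with rows (1, α0) and (0, α1); then γ = Mβ. The slope component of the resulting estimator of β is approximately consistent for θ1, and the joint distribution is asymptotically normal with the limiting distribution expressed in the following corollary.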
Corollary 2
The 2SIV estimator admits an asymptotic linear expansion in terms of the estimators of α and γ; in this expansion, A2 denotes the information matrix for γ in model 6.
The proof can be found in the Appendix. Corollary 2 provides a unified way to compute the variance of the 2SIV estimator based on the joint asymptotic distribution of the estimators of α and γ, whether the estimator of α is derived from the naïve, IPW, or ML approach. Note that the 2SIV estimator is a member of the so-called minimum distance estimators that have been studied in the econometrics literature on limited dependent variable models (27). In the Appendix, we provide the derivation for the joint limiting distribution of the estimators of α and γ.
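As a rough complement to Corollary 2, the variance of the slope can be sketched with the delta method for the special case of a single instrument, in which the 2SIV slope is the ratio of the estimated slope of model 6 to the estimated first-stage slope. The hats, the symbol for the 2SIV slope, and the covariance term are notation assumed here for illustration; this is a standard first-order approximation and not necessarily the expression derived in the Appendix.

```latex
\hat{\theta}_1 = \frac{\hat{\gamma}_1}{\hat{\alpha}_1},
\qquad
\operatorname{Var}(\hat{\theta}_1) \approx
\frac{\operatorname{Var}(\hat{\gamma}_1)}{\alpha_1^{2}}
+ \frac{\gamma_1^{2}\,\operatorname{Var}(\hat{\alpha}_1)}{\alpha_1^{4}}
- \frac{2\,\gamma_1\,\operatorname{Cov}(\hat{\gamma}_1, \hat{\alpha}_1)}{\alpha_1^{3}}
```

Under the causal null, γ1 = 0, so the last 2 terms vanish and, to first order, the precision of the first-stage estimator does not enter the variance of the slope.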
Naïve estimator
The naïve estimator for α is obtained by solving
$$\sum_{i=1}^{N} R_i\,(1, G_i)^{\mathsf{T}}(X_i - \alpha_0 - \alpha_1 G_i) = 0,$$
where G = (1, G) is the design matrix of model 3, and Ri is the indicator of observing G. The naïve estimator is generally biased for estimating α (15–17). However, one interesting observation from our simulation study is that, when there is no causal effect (θ1 = 0), the naïve estimator of α1 is actually consistent. This is surprising in that, even if θ1 = 0, Y and X can still be correlated because of unmeasured confounding U. Observe that
| (7) |
where pk is the probability of Yi = k, and qk is the sampling probability in the stratum Yi = k, for k = 0, 1. Because under the null hypothesis θ1 = 0 we have Y ⊥ G|U, and because G ⊥ U for an instrumental variable, it follows that Y ⊥ G and, by the Bayes rule, U ⊥ G|Y. Moreover, under the causal null hypothesis, ε and Y are independent. Therefore model 7 becomes
$$E(X_i \mid G_i, R_i = 1) = \alpha_0 + \alpha_1 G_i + \sum_{k=0}^{1} E(U_i \mid Y_i = k)\,\frac{p_k q_k}{p_0 q_0 + p_1 q_1}.$$
This suggests that the solution of the naïve estimating function is (α0*, α1), where α0* equals α0 plus the sum in the display above, which does not depend on G. The naïve estimator for α1 is consistent under the null hypothesis for the causal effect, even if the case-control sampling is ignored. Note that this result is derived under several assumptions in the data-generating model 1, namely, that there is no interaction between G and U and that ε is a random homoscedastic error. These assumptions are not needed for the test of G-Y association to act as a valid test of causal effect, as originally proposed by Katan (1) and Didelez and Sheehan (3). Because the G-Y association is assessed in logistic regression, such a test is not affected by case-control sampling either.
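The behavior of the naïve first-stage slope under the null and under the alternative can be checked with a small simulation sketch in Python. The design mirrors the article's stated simulation models (G ∼ Bernoulli(0.4), X = −ln(1.5)G + U with U ∼ N(0, 1), all cases and an equal number of controls), but the cohort size of 300,000, the seeds, and the function name are assumptions chosen so that a single run is reasonably precise.

```python
import math
import random

ALPHA1 = -math.log(1.5)  # first-stage slope in the assumed simulation design


def naive_first_stage_slope(theta1, intercept, n_cohort, seed):
    """Naive slope of X on G in a 1:1 case-control sample, ignoring the sampling."""
    rng = random.Random(seed)
    cases, controls = [], []
    for _ in range(n_cohort):
        g = 1 if rng.random() < 0.4 else 0
        u = rng.gauss(0.0, 1.0)                 # unmeasured confounder
        x = ALPHA1 * g + u                      # exposure without an extra error term
        risk = 1.0 / (1.0 + math.exp(-(intercept + theta1 * x + u)))
        if rng.random() < risk:
            cases.append((g, x))
        else:
            controls.append((g, x))
    # Keep all cases and an equal number of randomly chosen controls
    sample = cases + rng.sample(controls, len(cases))
    # With a dichotomous G, the least-squares slope is a difference in means
    x1 = [x for g, x in sample if g == 1]
    x0 = [x for g, x in sample if g == 0]
    return sum(x1) / len(x1) - sum(x0) / len(x0)


bias_null = naive_first_stage_slope(0.0, -3.5, 300_000, seed=11) - ALPHA1
bias_alt = naive_first_stage_slope(math.log(2.0), -4.1, 300_000, seed=12) - ALPHA1
print(f"naive bias, theta1 = 0: {bias_null:+.3f}; theta1 = ln 2: {bias_alt:+.3f}")
```

In this sketch the slope under the null should stay close to −ln(1.5) even though cases are heavily oversampled, whereas under θ1 = ln 2 it should drift away from the true value.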
IPW estimator
The simple IPW estimator solves
| $\sum_{i=1}^{N} \frac{R_i}{\pi_i}\,(1, G_i)^{\mathsf{T}}(X_i - \alpha_0 - \alpha_1 G_i) = 0$ | (8) |
In the context of secondary-trait association, it was noted that the IPW estimator is robust against model misspecification (15, 17). For example, even if the fitted model is wrong, the IPW estimator converges to the solution of E{(1, G)ᵀ(X − α0 − α1G)} = 0, which is a well-defined population quantity, as long as the sampling probability π is correctly specified. Compared with the ML estimator we describe below, the consistency of IPW does not depend on correctly specifying the distribution of X or the primary trait regression model. However, it is known to be inefficient for secondary-trait association (15, 17, 18), because it does not exploit the information in X and G that is related to the primary outcome Y. When there is a parametric model regressing Y on X and G, the ML estimator for secondary-trait association is more efficient. If the regression of Y on X and G is left nonparametric, ML has no efficiency gain over IPW (17). Furthermore, when there are additional data in the cohort beyond case-control status, which is often true, Tapsoba et al. (17) discussed ways to improve the efficiency of the IPW estimator, for example, by using the estimated weight and adding an augmentation term to estimating equation 8.
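A minimal Python sketch of inverse probability weighting for the first-stage slope is given below. It assumes the article's simulation design with θ1 = ln 2 and an intercept of −4.1, a dichotomous G so that the weighted least-squares slope is a difference in weighted means, and a control sampling fraction taken as the realized ratio of cases to cohort controls; the cohort size, seed, and function name are illustrative choices.

```python
import math
import random

ALPHA1 = -math.log(1.5)  # true first-stage slope in the assumed design


def weighted_slope(sample):
    """Weighted least-squares slope of X on a dichotomous G."""
    totals = {0: [0.0, 0.0], 1: [0.0, 0.0]}     # g -> [sum of w * x, sum of w]
    for g, x, w in sample:
        totals[g][0] += w * x
        totals[g][1] += w
    return totals[1][0] / totals[1][1] - totals[0][0] / totals[0][1]


rng = random.Random(2024)
cases, controls = [], []
for _ in range(200_000):
    g = 1 if rng.random() < 0.4 else 0
    u = rng.gauss(0.0, 1.0)
    x = ALPHA1 * g + u
    risk = 1.0 / (1.0 + math.exp(-(-4.1 + math.log(2.0) * x + u)))
    (cases if rng.random() < risk else controls).append((g, x))

pi_case = 1.0                                   # every case is sampled
pi_control = len(cases) / len(controls)         # realized control sampling fraction
sampled_controls = rng.sample(controls, len(cases))

# Unweighted (naive) fit versus the fit weighted by 1 / sampling probability
naive = weighted_slope([(g, x, 1.0) for g, x in cases + sampled_controls])
ipw = weighted_slope(
    [(g, x, 1.0 / pi_case) for g, x in cases]
    + [(g, x, 1.0 / pi_control) for g, x in sampled_controls]
)
print(f"true: {ALPHA1:.3f}, naive: {naive:.3f}, IPW: {ipw:.3f}")
```

Because each sampled control carries a weight of roughly 20 in this design, the weighted fit is driven largely by the controls, which gives some intuition for the loss of efficiency relative to a likelihood-based approach.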
ML estimator
The likelihood approach to secondary trait association is efficient (15, 16), provided the model specification is correct. The likelihood for regression parameters in the 2-phase sampling plan, given the observed data (Y, RX, RG), is proportional to
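| $\prod_{i=1}^{N} \{\Pr_\eta(Y_i \mid X_i, G_i)\,\Pr_\alpha(X_i \mid G_i)\,f(G_i)\}^{R_i} \left\{\iint \Pr_\eta(Y_i \mid x, g)\,\Pr_\alpha(x \mid g)\,f(g)\,dx\,dg\right\}^{1 - R_i}$ | (9) |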
where α indexes the secondary-trait model 3, η = (η0, η1, η2) indexes the nuisance primary trait working model
| $\operatorname{logit}\Pr_\eta(Y_i = 1 \mid X_i, G_i) = \eta_0 + \eta_1 X_i + \eta_2 G_i$ | (10) |
and f describes the distribution of G. For example, refer to the likelihood development for 2-phase sampling reported by Lawless et al. (26). The sampling process does not depend on the regression parameters and therefore is omitted from the likelihood. Working model 10 contains the association of G due to unmeasured confounding U, which can be expressed as a function of X, G, and ε based on the data-generating model 1. As emphasized elsewhere (17), the consistency of the ML estimator depends on the correct specification of both Prα(X|G) and Prη(Y|X, G). In the context of mendelian randomization analyses, because of the assumptions and data-generating models one has to make a priori to infer causality, model misspecification may be less of an issue for the ML approach. For example, omitting an interaction term between G and X in the primary case-control regression model when there is indeed an interaction can be devastating to the error control of secondary-trait association (17). If model 2 is the true data-generating model, we can express U as a function of X, G, and ε. Under the rare disease assumption and by integrating out ε, model 10 is the correct model and the interaction term is not needed. On the other hand, the distribution Prα(X|G) needs to be integrated in the likelihood. When X follows a gaussian distribution, the integration can be well approximated by Gauss-Hermite quadrature. For nongaussian distributions, we found in simulations that the quadrature integration method is robust and yields little bias.
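The integration over X for subjects outside the case-control sample can be sketched in Python with Gauss-Hermite quadrature. The block uses a 3-point rule only to keep the nodes and weights in closed form; a practical implementation would use more nodes. The parameter values, the residual standard deviation of 1, and the function names are hypothetical.

```python
import math

# Three-point Gauss-Hermite rule for the weight function exp(-t^2)
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (
    math.sqrt(math.pi) / 6.0,
    2.0 * math.sqrt(math.pi) / 3.0,
    math.sqrt(math.pi) / 6.0,
)


def normal_expectation(f, mean, sd):
    """Approximate E[f(X)] for X ~ N(mean, sd^2) by Gauss-Hermite quadrature."""
    total = sum(
        w * f(mean + math.sqrt(2.0) * sd * t) for t, w in zip(GH_NODES, GH_WEIGHTS)
    )
    return total / math.sqrt(math.pi)


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def marginal_risk(g, alpha, eta, sd):
    """Pr(Y = 1 | G = g): the working outcome model averaged over X given G."""
    alpha0, alpha1 = alpha
    eta0, eta1, eta2 = eta
    return normal_expectation(
        lambda x: expit(eta0 + eta1 * x + eta2 * g), alpha0 + alpha1 * g, sd
    )


# Hypothetical parameter values, for illustration only
risk_g0 = marginal_risk(0, alpha=(0.0, -0.405), eta=(-3.5, 1.0, 0.4), sd=1.0)
second_moment = normal_expectation(lambda x: x * x, 0.0, 1.0)
fourth_moment = normal_expectation(lambda x: x ** 4, 0.0, 1.0)
print(f"Pr(Y = 1 | G = 0) is approximately {risk_g0:.4f}")
```

A 3-point rule integrates polynomials up to degree 5 exactly against the gaussian weight, which is why the second and fourth moments of a standard normal variable are reproduced; the logistic integrand is only approximated.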
Double-logistic SMM estimator
Aside from structural equation models, the causal effect can be alternatively defined by the potential outcomes approach (28). Let Y(x) denote the potential outcome of Y when X is experimentally altered to an arbitrary value x within the set of all attainable values. Two assumptions are commonly made: the “consistency assumption” that Y(x) = Y with probability 1 when X = x, and the “stable unit treatment value assumption,” known as SUTVA, that potential outcomes of any subject are not related to other subjects' potential outcomes. The causal odds ratio can be defined (8, 11) as the odds of Y = 1 among subjects with observed exposure X = x and genotype G, divided by the odds of Y(0) = 1 among the same subjects,
where Y(0) is the potential outcome for a subject when his or her X value is controlled at 0. Vansteelandt and Goetghebeur (8) proposed a double-logistic structural mean model, essentially the generalized method of moments, for estimating the causal odds ratio in randomized trials. For estimation, a logistic model for the observed data similar to model 10 is needed in addition to the moment conditions generated by properties of instrumental variables. For mendelian randomization studies with case-control sampling, the same estimation principle applies, and little change is needed in estimation (19). In particular, when the instrumental variable is dichotomous and there are no covariates, as in our simulation model later, no congeniality problem arises. Define a set of estimating functions as follows:
The first 3 estimating equations are derived from the observational model, and the last is based on the property that G shall be orthogonal to the odds of Y(0), with proper adjustment for case-control sampling, including 1/πi and log(q1/q0) for differential sampling probabilities in cases and controls. The estimation can be immediately accomplished by the R function GMM. The optimization method chosen in the GMM function is “BFGS,” a gradient-based searching algorithm with the maximum iteration number 3,000. In simulations, we found that though the GMM function is easy to use, the estimation sometimes does not converge and can be unstable, for example, producing a huge variance estimate.
NUMERICAL STUDY
We conducted simulation studies to compare the finite-sample behavior of the naïve, IPW, ML, and GMM estimators. A random cohort with sample size of either 5,000 or 20,000 was first generated with independent and identically distributed data (G, U, X, Y) following the data-generating models: G ∼ Bernoulli(0.4), X = −ln(1.5)G + U, and E(Y) = expit(−θ0 + θ1X + U). We varied the distribution of U to be either N(0, 1) or a standardized χ2(5) in order to assess the impact of X deviating from a normal distribution. The causal effect θ1 was chosen to be 0, ln 1.5, or ln 2. A case-control sample that contained all cases and the same number of controls was then selected from the cohort. The intercept was picked such that the proportion of cases in the cohort was approximately 4.5%, so approximately 225 (or 900) cases and 225 (or 900) controls were sampled from the cohort of size 5,000 (or 20,000). Specifically, the intercept −θ0 was set to (−3.5, −3.8, −4.1) and (−3.6, −4, −4.4) for normal and χ2 confounding, respectively. Under each parameter setting, 1,000 independent data sets were simulated. Operating characteristics for the estimators of α1 were assessed by using all data sets; those for the estimators of θ1 were based on the data sets for which the estimation algorithm converged and the variance estimate appeared regular (i.e., <50). The proportion of data sets retained in the summary is reported as the convergence rate.
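The cohort-then-sample design described above can be sketched in Python as follows. The pairing of the 3 intercepts with the 3 values of θ1 in the order listed is an assumption, as are the seeds and the function name; the sketch covers only the normal-confounding setting and a single data set per configuration.

```python
import math
import random


def simulate_case_control(n_cohort, theta1, intercept, seed):
    """Generate one cohort, then keep all cases and an equal number of controls."""
    rng = random.Random(seed)
    cases, controls = [], []
    for _ in range(n_cohort):
        g = 1 if rng.random() < 0.4 else 0          # G ~ Bernoulli(0.4)
        u = rng.gauss(0.0, 1.0)                     # U ~ N(0, 1)
        x = -math.log(1.5) * g + u                  # X = -ln(1.5) G + U
        risk = 1.0 / (1.0 + math.exp(-(intercept + theta1 * x + u)))
        y = 1 if rng.random() < risk else 0
        (cases if y else controls).append((g, x, y))
    sampled_controls = rng.sample(controls, len(cases))
    prevalence = len(cases) / n_cohort
    return cases + sampled_controls, prevalence


# (theta1, intercept) pairs, assumed to correspond in the order listed in the text
settings = [(0.0, -3.5), (math.log(1.5), -3.8), (math.log(2.0), -4.1)]
results = [
    simulate_case_control(20_000, theta1, intercept, seed=k)
    for k, (theta1, intercept) in enumerate(settings)
]
for (theta1, _), (sample, prevalence) in zip(settings, results):
    print(f"theta1 = {theta1:.3f}: prevalence {prevalence:.3f}, n = {len(sample)}")
```

Lowering the intercept as θ1 grows keeps the cohort prevalence roughly constant, so that the case-control sample size is comparable across settings.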
Our simulation settings are similar to those reported by Bowden and Vansteelandt (19): We omitted the random error term in the data-generating model 1 when simulating data sets. The purpose is that, through reparameterization, the causal effect θ1 in the structural equation models is also the causal odds ratio in SMM, setting the stage for a side-by-side comparison of 2SIV and double-logistic SMM estimators. Observe that U = X + ln(1.5)G, so odds(Y = 1) = exp{−θ0 + (θ1 + 1)X + ln(1.5)G}. In SMM notation, odds(Y(0) = 1) = exp{−θ0 + X + ln(1.5)G} = exp(−θ0 + U), and hence G ⊥ Y(0). Therefore, by construction, θ1 is also the causal effect under SMM.
Tables 1 and 2 show the performance of various estimators when X indeed follows a gaussian distribution. From Table 1, it appears that, under the null hypothesis of no causal effect, the estimates of α1 from all 3 methods (naïve, IPW, and ML) are virtually unbiased, the means of the estimated variance are close to the sample variances, and the coverage probability of the 95% confidence interval is close to the nominal 0.95. That the naïve estimator is unbiased under the null of no causal effect is predicted by the theoretical development but is nonetheless counterintuitive, because under the null X is still correlated with Y when the unmeasured confounding U is not conditioned upon. As the effect size increases from 0 to ln(1.5) or to ln(2), the naïve estimator becomes biased, and the coverage probabilities of its 95% confidence intervals fall below 95%. Both IPW and ML remain virtually unbiased under the null and under the alternative, and at N = 20,000 the bias of ML is smaller than that of IPW. Furthermore, the ML estimator is more efficient than the IPW estimator, with a nearly 50% reduction of variance when N = 5,000.
Table 1.
Performance of the Naïve, IPW, and ML Estimators of α1 in 1,000 Simulated Data Setsa
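| Estimator | Bias (N = 5,000/n ≈ 450) | Sample Variance (N = 5,000) | Mean Estimated Variance (N = 5,000) | 95% CP (N = 5,000) | MSE (N = 5,000) | Bias (N = 20,000/n ≈ 1,800) | Sample Variance (N = 20,000) | Mean Estimated Variance (N = 20,000) | 95% CP (N = 20,000) | MSE (N = 20,000) |
|---|---|---|---|---|---|---|---|---|---|---|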
| θ1 = 0 | ||||||||||
| Naïve | −0.003 | 0.011 | 0.011 | 0.949 | 0.011 | 0 | 0.003 | 0.003 | 0.946 | 0.003 |
| IPW | −0.001 | 0.016 | 0.017 | 0.957 | 0.016 | −0.002 | 0.004 | 0.004 | 0.947 | 0.004 |
| ML | −0.003 | 0.009 | 0.009 | 0.953 | 0.009 | 0 | 0.002 | 0.002 | 0.947 | 0.002 |
| θ1 = ln(1.5) | ||||||||||
| Naïve | −0.038 | 0.013 | 0.012 | 0.924 | 0.014 | −0.029 | 0.003 | 0.003 | 0.907 | 0.004 |
| IPW | −0.002 | 0.016 | 0.016 | 0.943 | 0.016 | −0.002 | 0.004 | 0.004 | 0.938 | 0.004 |
| ML | −0.005 | 0.009 | 0.008 | 0.940 | 0.009 | 0 | 0.002 | 0.002 | 0.954 | 0.002 |
| θ1 = ln(2) | ||||||||||
| Naïve | −0.063 | 0.013 | 0.013 | 0.911 | 0.017 | −0.056 | 0.003 | 0.003 | 0.834 | 0.006 |
| IPW | 0 | 0.016 | 0.016 | 0.949 | 0.016 | −0.001 | 0.004 | 0.004 | 0.949 | 0.004 |
| ML | −0.003 | 0.008 | 0.008 | 0.953 | 0.008 | 0 | 0.002 | 0.002 | 0.952 | 0.002 |
Abbreviations: CP, coverage probability; GMM, generalized method of moments; IPW, inverse probability weighted; ML, maximum likelihood; MSE, mean squared error.
a Data-generating models are X = −ln(1.5)G + U, Pr(Y = 1) = expit(−θ0 + θ1X + U), where U ∼ N(0, 1).
Table 2.
Performance of the Naïve, IPW, ML, and GMM Estimators of θ1 in 1,000 Simulated Data Setsa
| Estimator | Convergence Rate | Bias | Sample Variance | Mean Estimated Variance | 95% CP | Type 1 Error/Power | MSE |
|---|---|---|---|---|---|---|---|
| N = 5,000/n ≈ 450 | |||||||
| θ1 = 0 | |||||||
| Naïve | 0.998 | −0.061 | 0.297 | 0.388 | 0.964 | 0.036 | 0.301 |
| IPW | 0.998 | −0.020 | 0.341 | 0.533 | 0.985 | 0.015 | 0.342 |
| ML | 1 | −0.012 | 0.279 | 0.323 | 0.977 | 0.023 | 0.279 |
| GMM | 0.937 | −0.049 | 0.313 | 0.459 | 0.982 | 0.018 | 0.315 |
| θ1 = ln(1.5) | |||||||
| Naïve | 1 | −0.106 | 0.215 | 0.266 | 0.978 | 0.178 | 0.226 |
| IPW | 0.995 | −0.003 | 0.322 | 0.483 | 0.982 | 0.065 | 0.322 |
| ML | 1 | −0.021 | 0.276 | 0.304 | 0.978 | 0.091 | 0.276 |
| GMM | 0.923 | −0.039 | 0.285 | 0.745 | 0.991 | 0.087 | 0.286 |
| θ1 = ln(2) | |||||||
| Naïve | 1 | −0.204 | 0.162 | 0.188 | 0.976 | 0.356 | 0.203 |
| IPW | 1 | −0.008 | 0.434 | 0.665 | 0.957 | 0.133 | 0.434 |
| ML | 1 | −0.070 | 0.278 | 0.305 | 0.960 | 0.214 | 0.283 |
| GMM | 0.908 | −0.022 | 0.320 | 1.273 | 0.981 | 0.193 | 0.321 |
| N = 20,000/n ≈ 1,800 | |||||||
| θ1 = 0 | |||||||
| Naïve | 1 | −0.016 | 0.063 | 0.063 | 0.958 | 0.042 | 0.063 |
| IPW | 1 | −0.003 | 0.063 | 0.063 | 0.962 | 0.038 | 0.063 |
| ML | 1 | −0.003 | 0.061 | 0.061 | 0.952 | 0.048 | 0.061 |
| GMM | 0.999 | −0.014 | 0.077 | 0.081 | 0.969 | 0.031 | 0.077 |
| θ1 = ln(1.5) | |||||||
| Naïve | 1 | −0.105 | 0.048 | 0.046 | 0.956 | 0.349 | 0.059 |
| IPW | 1 | −0.068 | 0.065 | 0.065 | 0.947 | 0.279 | 0.069 |
| ML | 1 | −0.070 | 0.062 | 0.060 | 0.939 | 0.286 | 0.066 |
| GMM | 0.995 | −0.037 | 0.088 | 0.092 | 0.966 | 0.258 | 0.090 |
| θ1 = ln(2) | |||||||
| Naïve | 1 | −0.205 | 0.038 | 0.036 | 0.839 | 0.736 | 0.080 |
| IPW | 1 | −0.115 | 0.075 | 0.071 | 0.904 | 0.626 | 0.088 |
| ML | 1 | −0.122 | 0.065 | 0.061 | 0.900 | 0.649 | 0.080 |
| GMM | 0.995 | −0.017 | 0.103 | 0.143 | 0.972 | 0.606 | 0.103 |
Abbreviations: CP, coverage probability; GMM, generalized method of moments; IPW, inverse probability weighted; ML, maximum likelihood; MSE, mean squared error.
a Data-generating models are X = −ln(1.5)G + U, Pr(Y = 1) = expit(−θ0 + θ1X + U), where U ∼ N(0, 1).
Table 2 shows the performance of 4 estimators of the causal effect θ1: naïve, IPW, ML, and GMM. When θ1 = 0, all 4 estimators are virtually unbiased when N = 20,000, yet the naïve estimator and the GMM estimator seem to have small-sample bias when N = 5,000. The variance of the ML estimator is the smallest among the 4 estimators, as is its mean squared error. Under the alternative, the naïve estimator shows bias of much larger magnitude than the IPW and ML estimators. The ML estimator generally has lower variance than IPW, but the improvement diminishes when the sample size reaches 20,000. This is because the variance of the estimated α1 is much reduced in magnitude when N = 20,000 (Table 1), so small that halving a variance in the third decimal place does not meaningfully influence the variance of the estimated θ1.
The GMM estimator appears nearly unbiased under all settings, but its variance is larger than that of the ML estimator in every setting and larger than that of the IPW estimator when N = 20,000, resulting in a bigger mean squared error than that of the ML estimator. The consistency of GMM hinges on the moment condition that exploits the orthogonality of instrumental variables and unmeasured confounding. Its estimation is more complex, involving parameters in the working model for the observed outcome in addition to the causal effect θ1, thereby incurring a bigger variance. It is also notable that the convergence of GMM and the stability of variance estimation can be problematic when the sample size is small and the causal effect increases. As many as 10% of the simulated data sets could yield GMM estimators that did not converge. Among the 3 estimators that account for case-control sampling (IPW, ML, and GMM), the ML estimator has consistently the smallest mean squared error across all scenarios in Table 2; the naïve estimator attains a smaller mean squared error in some settings under the alternative, but only through its small variance and at the cost of substantial bias.
In Tables 3 and 4, we assess the performance of various estimators for α1 and θ1, respectively, when the confounder U, and hence X given G, follows a standardized χ2(5) distribution having mean 0 and variance 1. The nonnormality of X will affect the performance of the ML estimator because the gaussian quadrature integration used in ML estimation is not ideal for nongaussian distributions. Surprisingly, in Table 3, the ML estimator for α1 did not show much bias even though the numerical integration technique is not optimal. The variance of the ML estimator, however, is no longer superior to that of the IPW estimator. As a result, the ML estimator for θ1 does not outperform the IPW estimator, as shown in Table 4. The GMM estimator remains unbiased in all settings in Table 4, but its variance is substantially bigger than those of the ML and the IPW estimators. Under the smaller sample size (N = 5,000), the estimated variance for GMM can be quite unstable, with as many as 15% of the simulations failing to converge.
Table 3.
Performance of the Naïve, IPW, and ML Estimators of α1 in 1,000 Simulated Data Setsa
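| Estimator | Bias (N = 5,000/n ≈ 450) | Sample Variance (N = 5,000) | Mean Estimated Variance (N = 5,000) | 95% CP (N = 5,000) | MSE (N = 5,000) | Bias (N = 20,000/n ≈ 1,800) | Sample Variance (N = 20,000) | Mean Estimated Variance (N = 20,000) | 95% CP (N = 20,000) | MSE (N = 20,000) |
|---|---|---|---|---|---|---|---|---|---|---|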
| θ1 = 0 | ||||||||||
| Naïve | 0.007 | 0.020 | 0.019 | 0.943 | 0.020 | 0.003 | 0.005 | 0.005 | 0.958 | 0.005 |
| IPW | 0.009 | 0.014 | 0.015 | 0.955 | 0.014 | −0.002 | 0.004 | 0.004 | 0.953 | 0.004 |
| ML | 0.007 | 0.017 | 0.016 | 0.947 | 0.017 | 0.002 | 0.004 | 0.004 | 0.951 | 0.004 |
| θ1 = ln(1.5) | ||||||||||
| Naïve | −0.033 | 0.022 | 0.022 | 0.943 | 0.023 | −0.021 | 0.006 | 0.005 | 0.932 | 0.006 |
| IPW | −0.006 | 0.014 | 0.013 | 0.949 | 0.014 | 0.002 | 0.003 | 0.003 | 0.950 | 0.003 |
| ML | −0.004 | 0.016 | 0.016 | 0.943 | 0.016 | 0.006 | 0.004 | 0.004 | 0.952 | 0.004 |
| θ1 = ln(2) | ||||||||||
| Naïve | −0.044 | 0.023 | 0.024 | 0.944 | 0.025 | −0.039 | 0.006 | 0.006 | 0.935 | 0.007 |
| IPW | 0.003 | 0.012 | 0.013 | 0.961 | 0.012 | 0.001 | 0.004 | 0.003 | 0.947 | 0.004 |
| ML | 0.007 | 0.016 | 0.016 | 0.957 | 0.016 | 0.009 | 0.004 | 0.004 | 0.949 | 0.004 |
Abbreviations: CP, coverage probability; IPW, inverse probability weighted; ML, maximum likelihood; MSE, mean squared error.
a Data-generating models are X = −ln(1.5)G + U, Pr(Y = 1) = expit(−θ0 + θ1X + U), where U ∼ standardized χ2(5).
Table 4.
Performance of the Naïve, IPW, ML, and GMM Estimators of θ1 in 1,000 Simulated Data Sets^a

| Estimator | Convergence Rate | Bias | Variance | Estimated Variance | 95% CP | Type 1 Error/Power | MSE |
|---|---|---|---|---|---|---|---|
| N = 5,000/n ≈ 450 | | | | | | | |
| θ1 = 0 | | | | | | | |
| Naïve | 0.986 | −0.067 | 0.389 | 0.753 | 0.956 | 0.044 | 0.394 |
| IPW | 0.996 | −0.011 | 0.333 | 0.513 | 0.982 | 0.018 | 0.333 |
| ML | 0.995 | −0.042 | 0.381 | 0.564 | 0.976 | 0.024 | 0.382 |
| GMM | 0.867 | −0.020 | 0.412 | 0.713 | 0.977 | 0.023 | 0.412 |
| θ1 = ln(1.5) | | | | | | | |
| Naïve | 0.995 | −0.179 | 0.254 | 0.552 | 0.985 | 0.177 | 0.286 |
| IPW | 0.999 | −0.076 | 0.292 | 0.387 | 0.972 | 0.068 | 0.297 |
| ML | 0.996 | −0.108 | 0.269 | 0.429 | 0.990 | 0.118 | 0.281 |
| GMM | 0.845 | −0.072 | 0.370 | 0.605 | 0.980 | 0.137 | 0.375 |
| θ1 = ln(2) | | | | | | | |
| Naïve | 0.997 | −0.338 | 0.221 | 0.468 | 0.992 | 0.310 | 0.335 |
| IPW | 1 | −0.185 | 0.331 | 0.484 | 0.941 | 0.120 | 0.365 |
| ML | 1 | −0.236 | 0.312 | 0.482 | 0.972 | 0.212 | 0.367 |
| GMM | 0.845 | −0.126 | 0.490 | 1.781 | 0.966 | 0.188 | 0.506 |
| N = 20,000/n ≈ 1,800 | | | | | | | |
| θ1 = 0 | | | | | | | |
| Naïve | 1 | −0.032 | 0.060 | 0.070 | 0.973 | 0.027 | 0.061 |
| IPW | 1 | −0.016 | 0.055 | 0.062 | 0.971 | 0.029 | 0.055 |
| ML | 1 | −0.020 | 0.057 | 0.065 | 0.972 | 0.028 | 0.058 |
| GMM | 0.983 | −0.042 | 0.095 | 0.124 | 0.970 | 0.030 | 0.097 |
| θ1 = ln(1.5) | | | | | | | |
| Naïve | 1 | −0.143 | 0.047 | 0.049 | 0.970 | 0.321 | 0.068 |
| IPW | 1 | −0.107 | 0.061 | 0.062 | 0.930 | 0.226 | 0.072 |
| ML | 1 | −0.113 | 0.059 | 0.060 | 0.960 | 0.263 | 0.072 |
| GMM | 0.978 | −0.029 | 0.111 | 0.130 | 0.960 | 0.268 | 0.112 |
| θ1 = ln(2) | | | | | | | |
| Naïve | 1 | −0.306 | 0.036 | 0.038 | 0.670 | 0.594 | 0.130 |
| IPW | 1 | −0.244 | 0.065 | 0.064 | 0.794 | 0.465 | 0.125 |
| ML | 1 | −0.246 | 0.056 | 0.057 | 0.829 | 0.533 | 0.116 |
| GMM | 0.988 | −0.054 | 0.139 | 0.273 | 0.944 | 0.465 | 0.142 |

Abbreviations: CP, coverage probability; GMM, generalized method of moments; IPW, inverse probability weighted; ML, maximum likelihood; MSE, mean squared error.
^a Data-generating models are X = −ln(1.5)G + U, Pr(Y = 1) = expit(−θ0 + θ1X + U), where U ∼ standardized χ2(5).
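As a reading aid for Tables 3 and 4, the reported mean squared errors can be checked against the usual bias-variance decomposition. The check below takes the value that immediately follows Bias in each row to be the empirical variance across simulated data sets; this interpretation is inferred from the numbers and is an assumption rather than something the tables state explicitly.

```latex
\mathrm{MSE}(\hat{\theta}_1) = \mathrm{Bias}(\hat{\theta}_1)^2 + \mathrm{Var}(\hat{\theta}_1)

% Table 4, naive estimator, \theta_1 = 0, N = 5{,}000:
(-0.067)^2 + 0.389 \approx 0.0045 + 0.389 \approx 0.394

% Table 4, GMM estimator, \theta_1 = \ln(2), N = 20{,}000:
(-0.054)^2 + 0.139 \approx 0.0029 + 0.139 \approx 0.142
```

Both values match the reported MSE entries, which illustrates how a nearly unbiased estimator such as GMM can still have the largest mean squared error when its variance dominates.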
DISCUSSION
Several new observations arise in this paper. First, the commonly used naïve 2SIV estimator, which ignores the biased sampling, as in a previous report (13), yields a biased estimate of the causal effect, as one would predict. Yet, surprisingly, the test for whether there is a causal effect remains valid. In other words, biased sampling does not affect the validity of the 2SIV test, nor does it affect the validity of testing the G-Y association as a test of causal effect. This result is of interest because mendelian randomization is plagued by concerns about violations of its assumptions (23). Focusing on testing the causal effect, rather than estimating it, may alleviate these concerns (23). Second, for unbiased estimation, the case-control sampling can be accounted for by using IPW or ML in the first stage of 2SIV, the latter of which is more efficient in small-to-moderate sample sizes. Finally, the weighted double-logistic SMM estimator generally yields a smaller bias but a larger variance, and therefore a larger mean squared error, than does the 2SIV estimator.
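The contrast between the naïve and IPW first stages can be sketched in a small simulation. The code below is a minimal illustration, not the simulation code used for the tables: it follows the data-generating models in the table footnotes, but the genotype distribution (additive coding with allele frequency 0.3), the intercept θ0 = 3.5, the 1:1 sampling of controls, and the unweighted second-stage logistic regression are all assumptions made for the example, as are the function names.

```python
import numpy as np


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_case_control(N, theta1, theta0=3.5, maf=0.3, seed=0):
    """Simulate a cohort, then keep all cases and an equal number of controls."""
    rng = np.random.default_rng(seed)
    G = rng.binomial(2, maf, size=N).astype(float)          # assumed genotype model
    U = (rng.chisquare(5, size=N) - 5.0) / np.sqrt(10.0)    # standardized chi-square(5)
    X = -np.log(1.5) * G + U
    Y = rng.binomial(1, expit(-theta0 + theta1 * X + U))
    cases = np.flatnonzero(Y == 1)
    noncases = np.flatnonzero(Y == 0)
    controls = rng.choice(noncases, size=cases.size, replace=False)
    keep = np.concatenate([cases, controls])
    # Known sampling probabilities: 1 for cases, n_cases / n_noncases for controls.
    prob = np.where(Y[keep] == 1, 1.0, cases.size / noncases.size)
    return G[keep], X[keep], Y[keep].astype(float), 1.0 / prob


def weighted_ls(G, X, w):
    """First stage: weighted least squares of X on (1, G)."""
    D = np.column_stack([np.ones_like(G), G])
    WD = D * w[:, None]
    return np.linalg.solve(D.T @ WD, WD.T @ X)


def logistic_fit(D, y, n_iter=50, tol=1e-10):
    """Second stage: logistic regression by Newton-Raphson."""
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = expit(D @ beta)
        hessian = (D * (p * (1.0 - p))[:, None]).T @ D
        step = np.linalg.solve(hessian, D.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def two_stage_iv(G, X, Y, w):
    """Return (alpha1_hat, theta1_hat) for first-stage weights w."""
    alpha = weighted_ls(G, X, w)
    x_hat = alpha[0] + alpha[1] * G
    theta = logistic_fit(np.column_stack([np.ones_like(x_hat), x_hat]), Y)
    return alpha[1], theta[1]


def compare_naive_ipw(N, theta1, seed=0):
    G, X, Y, w = simulate_case_control(N, theta1, seed=seed)
    a_naive, t_naive = two_stage_iv(G, X, Y, np.ones_like(w))  # ignores sampling
    a_ipw, t_ipw = two_stage_iv(G, X, Y, w)                    # inverse probability weights
    return {"alpha1_naive": a_naive, "theta1_naive": t_naive,
            "alpha1_ipw": a_ipw, "theta1_ipw": t_ipw}


if __name__ == "__main__":
    for theta1 in (0.0, np.log(2.0)):
        print(round(float(theta1), 3), compare_naive_ipw(20_000, theta1, seed=1))
```

In this sketch both second stages fit the same G-Y association, so the product of the first-stage slope and the second-stage slope is identical for the two estimators; they differ only through the first-stage estimate of α1. This is one way to see why the test of no causal effect is unaffected by the choice of first-stage estimator, whereas the magnitude of the estimate is not.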
Beyond IPW, there are a number of other methods that account for sampling bias in secondary association analyses. We present the ML method merely as one example of improving the efficiency of 2SIV by improving its first-stage regression. When there are additional adjusting covariates in models 1 and 2, ML can be much harder to develop, because one has to model the distribution of such covariates. A profile likelihood, similar to that reported by Lin and Zeng (16), can be developed to treat the covariate distribution nonparametrically. Furthermore, the distributional assumption on the secondary trait X can be relaxed by using recently proposed estimating equation approaches (18, 29).
We have considered a simple unmatched case-control sampling scheme nested within a cohort, where both G and X are missing at random. For matched case-control studies, the methods being discussed can be used by adding matching variables into regression models. The IPW method and the weighted double-logistic SMM method require sampling probabilities to be known, which is feasible for nested case-control samples in a cohort. On the other hand, the ML method can be formulated for general case-control studies without knowing sampling fractions (16). Sometimes X is observed for everyone in the cohort, whereas G is almost always ascertained by case-control sampling. In such a setting, the ML estimator of α1 is much more advantageous relative to the IPW estimator, because it will leverage the complete information for X. The estimation is quite simple because numerical integration is no longer needed. Refer to the report by Tapsoba et al. (17) for more discussion.
Alternative instrumental variable estimators to the 2SIV procedure include the control functions approach, which adds the first-stage residual as an additional regressor in model 5. In doing so, the bias of the instrumental variable estimator can be attenuated (6). The idea of using ML in the first stage of 2SIV to improve the efficiency of causal effect estimates is also applicable to the control functions approach. In future work, we plan to explore the asymptotic distribution of the resulting causal estimates using the control functions approach, as well as its comparison with other approaches.
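The control functions idea can be written schematically as follows. The notation is an assumption made for illustration (the coefficient names β0, β1, β2 and the residual symbol R are hypothetical and are not taken from model 5); the first-stage fit is any of the estimators of α discussed above.

```latex
\hat{X}_i = \hat{\alpha}_0 + \hat{\alpha}_1 G_i, \qquad
\hat{R}_i = X_i - \hat{X}_i

\operatorname{logit} \Pr(Y_i = 1 \mid G_i, X_i)
  = \beta_0 + \beta_1 \hat{X}_i + \beta_2 \hat{R}_i

% Because X_i = \hat{X}_i + \hat{R}_i, the same linear predictor can be written as
\beta_0 + \beta_1 X_i + (\beta_2 - \beta_1)\hat{R}_i
```

In this form β1 plays the role of the causal effect estimate, and the residual term is intended to absorb part of the unmeasured confounding that the plain second-stage regression leaves in the error.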
ACKNOWLEDGMENTS
Author affiliations: Public Health Science Division and Vaccine and Infectious Diseases Division, Fred Hutchinson Cancer Research Center (James Y. Dai, Xinyi Cindy Zhang); and Department of Biostatistics, University of Washington, Seattle, Washington (James Y. Dai).
This work was supported by National Institutes of Health grants P01 CA53996 and R01 HL114901.
The authors thank Jean de Dieu Tapsoba for discussion and help with the simulation code.
Conflict of interest: none declared.
APPENDIX
Proof of Corollary 1
Suppose θ1 = 0; then from model 2 we have Y ⊥ G|U. Because G ⊥ U, this leads to Y ⊥ G. Therefore, fitting model 4 leads to θ* = 0. Conversely, if θ1 ≠ 0, then because α1 ≠ 0, we have θ1α1 ≠ 0. Without loss of generality, suppose G = 0 is the baseline genotype group, and suppose θ1α1 > 0, so that, for every value of U and of the error term in the exposure model, the conditional probability of disease is higher for G = g than for G = 0. Because G ⊥ U and the error term is an independent error, Pr(Y|G = g) > Pr(Y|G = 0), so θ* ≠ 0. A similar argument holds for the setting where θ1α1 < 0. Therefore, θ1 = 0 if and only if θ* = 0; θ1 and θ* have the same sign.
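Derivation for the joint limiting distribution of $\hat{\alpha}$ and $\hat{\gamma}$
The estimators of both α and γ can be represented as asymptotically linear estimators (30, 31). For many approaches to estimating α, including the IPW and ML approaches we detail below, $\sqrt{n}(\hat{\alpha} - \alpha)$ can be written as $n^{-1/2}\sum_{i=1}^{n} B_{1i} + o_p(1)$, where B1 is often referred to as the influence function of α, ℰ(B1i) = 0, and $\mathcal{E}(B_{1i}B_{1i}^{T})$ is finite. To estimate γ in model 6, $\hat{\gamma}$ can be represented in terms of an influence function B2 without much difficulty. Then, by the central limit theorem, Slutsky's theorem, and the Cramér-Wold device, the joint limiting distribution of $\hat{\alpha}$ and $\hat{\gamma}$ is

$$\sqrt{n}\begin{pmatrix}\hat{\alpha}-\alpha\\ \hat{\gamma}-\gamma\end{pmatrix} \xrightarrow{d} N(0, \Sigma), \qquad \Sigma = \begin{pmatrix}\mathcal{E}(B_{1}B_{1}^{T}) & \mathcal{E}(B_{1}B_{2}^{T})\\ \mathcal{E}(B_{2}B_{1}^{T}) & \mathcal{E}(B_{2}B_{2}^{T})\end{pmatrix}.$$

In what follows, we discuss in detail each of the 3 estimators of α we consider, how to compute the covariance matrix, and the implications for estimating θ1.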
Proof of Corollary 2
To obtain , one first computes the predicted value in model 3 and then solves the estimating equation in model 4. Because γ = Aθ, where A is a matrix composed of elements in α, the 2-stage least-squares estimating function can be equivalently presented as , where S2i is the estimating function for model 6. Taylor series expansion with respect to both α and θ yields
On the other hand, the estimating function for also equals , because is consistent for γ. It follows that
REFERENCES
1. Katan M. Apolipoprotein E isoforms, serum cholesterol, and cancer. Lancet. 1986;327(8479):507–508. doi: 10.1016/s0140-6736(86)92972-7.
2. Smith GD, Ebrahim S. ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol. 2003;32(1):1–22. doi: 10.1093/ije/dyg070.
3. Didelez V, Sheehan N. Mendelian randomization as an instrumental variable approach to causal inference. Stat Methods Med Res. 2007;16(4):309–330. doi: 10.1177/0962280206077743.
4. White H. Instrumental variables regression with independent observations. Econometrica. 1982;50(2):483–500.
5. Davidson R, MacKinnon J. Estimation and Inference in Econometrics. New York, NY: Oxford University Press; 1993.
6. Palmer TM, Thompson JR, Tobin MD, et al. Adjusting for bias and unmeasured confounding in Mendelian randomization studies with binary responses. Int J Epidemiol. 2008;37(5):1161–1168. doi: 10.1093/ije/dyn080.
7. Didelez V, Meng S, Sheehan NA. Assumptions of IV methods for observational epidemiology. Stat Sci. 2010;25(1):22–40.
8. Vansteelandt S, Goetghebeur E. Causal inference with generalized structural mean models. J R Stat Soc Series B Stat Methodol. 2003;65(4):817–835.
9. Vansteelandt S, Bowden J, Babanezhad M, et al. On instrumental variables estimation of causal odds ratios. Stat Sci. 2011;26(3):403–422.
10. Burgess S, Granell R, Palmer TM, et al. Lack of identification in semiparametric instrumental variable models with binary outcomes. Am J Epidemiol. 2014;180(1):111–119. doi: 10.1093/aje/kwu107.
11. Robins J, Rotnitzky A. Estimation of treatment effects in randomised trials with non-compliance and a dichotomous outcome using structural mean models. Biometrika. 2004;91(4):763–783.
12. Clarke PS, Windmeijer F. Instrumental variable estimators for binary outcomes. J Am Stat Assoc. 2012;107:1638–1652.
13. Ding EL, Song Y, Manson JE, et al. Sex hormone-binding globulin and risk of type 2 diabetes in women and men. N Engl J Med. 2009;361(12):1152–1163. doi: 10.1056/NEJMoa0804381.
14. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66(3):403–411.
15. Jiang Y, Scott AJ, Wild CJ. Secondary analysis of case-control data. Stat Med. 2006;25(8):1323–1339. doi: 10.1002/sim.2283.
16. Lin DY, Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet Epidemiol. 2009;33(3):256–265. doi: 10.1002/gepi.20377.
17. Tapsoba JD, Kooperberg C, Reiner A, et al. Robust estimation for secondary trait association in case-control genetic studies. Am J Epidemiol. 2014;179(10):1264–1272. doi: 10.1093/aje/kwu039.
18. Tchetgen Tchetgen EJ. A general regression framework for a secondary outcome in case-control studies. Biostatistics. 2014;15(1):117–128. doi: 10.1093/biostatistics/kxt041.
19. Bowden J, Vansteelandt S. Mendelian randomization analysis of case-control data using structural mean models. Stat Med. 2011;30(6):678–694. doi: 10.1002/sim.4138.
20. Tchetgen Tchetgen EJ, Walter S, Glymour MM. Commentary: Building an evidence base for Mendelian randomization studies: assessing the validity and strength of proposed genetic instrumental variables. Int J Epidemiol. 2013;42(1):328–331. doi: 10.1093/ije/dyt023.
21. Uddin MJ, Groenwold RH, de Boer A, et al. Performance of instrumental variable methods in cohort and nested case-control studies: a simulation study. Pharmacoepidemiol Drug Saf. 2014;23(2):165–177. doi: 10.1002/pds.3555.
22. Voight BF, Peloso GM, Orho-Melander M, et al. Plasma HDL cholesterol and risk of myocardial infarction: a Mendelian randomisation study. Lancet. 2012;380(9841):572–580. doi: 10.1016/S0140-6736(12)60312-2.
23. VanderWeele TJ, Tchetgen Tchetgen EJ, Cornelis M, et al. Methodological challenges in Mendelian randomization. Epidemiology. 2014;25(3):427–435. doi: 10.1097/EDE.0000000000000081.
24. Cohen JC, Boerwinkle E, Mosley TH Jr, et al. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N Engl J Med. 2006;354(12):1264–1272. doi: 10.1056/NEJMoa054013.
25. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–592.
26. Lawless JF, Kalbfleisch JD, Wild CJ. Semiparametric methods for response-selective and missing data problems in regression. J R Stat Soc Series B Stat Methodol. 1999;61(2):413–438.
27. Newey WK. Semiparametric estimation of limited dependent variable models with endogenous explanatory variables. Ann INSEE. 1985;59/60:219–237.
28. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701.
29. Wei J, Carroll RJ, Müller UU, et al. Robust estimation for homoscedastic regression in the secondary analysis of case-control data. J R Stat Soc Series B Stat Methodol. 2013;75(1):186–206. doi: 10.1111/j.1467-9868.2012.01052.x.
30. Newey WK, Powell JL. Efficient estimation of linear and type I censored regression models under conditional quantile restrictions. Econometric Theory. 1990;6(3):295–317.
31. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89(427):846–866.
