Abstract
For analysis of case-control genetic association studies, it has recently been shown that gene-environment independence in the population can be leveraged to increase efficiency for estimating gene-environment interaction effects in comparison with the standard prospective analysis. However, for the special case in which data on the binary phenotype and genetic and environmental risk factors can be summarized in a 2 × 2 × 2 table, the authors show here that there is no efficiency gain for estimating interaction effects, nor is there an efficiency gain for estimating the genetic and environmental main effects. This contrasts with the well-known result assuming that rare phenotype prevalence and gene-environment independence in the control population for the same data can lead to efficiency gain. This discrepancy is counterintuitive, since the 2 likelihoods are also approximately equal when the phenotype is rare. An explanation for the paradox based on a theoretical analysis is provided. Implications of these results for data analyses are also examined, and practical guidance on analyzing such case-control studies is offered.
Keywords: case-control studies, epidemiologic methods, logistic models, odds ratio, prospective sampling, rare diseases, retrospective sampling
To estimate odds ratio parameters for genetic main effects and gene-environment (G-E) interactions using data from case-control genetic association studies, the traditional prospective likelihood method (1–6) has been the standard approach to analysis. Recently, it has been found that the G-E independence assumption can be leveraged to increase the efficiency of estimation (7–10). These methods were essentially based on maximizing the retrospective likelihood where the G-E independence was appropriately reflected, while the joint distribution of G and E was nonparametric in the standard approach. Some of these methods (7, 8) were based on log-linear models with the G-E independence assumption imposed on the control population, and some (9, 10) were based on logistic regression models with the G-E independence assumption imposed on the general study population. When an additional assumption that the binary phenotype is rare is imposed, the model of Chatterjee and Carroll (9) can be approximated by the model of Umbach and Weinberg (8). In the latter case, the need to estimate the intercept parameter in the logistic regression model is eliminated. Without the rare phenotype assumption and when both the gene and the environment have effects on the phenotype, Chatterjee and Carroll (9) showed an interesting result: The intercept parameter in the logistic model became identifiable under the G-E independence assumption in the general population. This contrasts with the widely known result that the baseline risk for the phenotype is unidentifiable from case-control data using prospective logistic regression analysis.
It turns out that the need to estimate the additional identifiable parameter compromises the efficiency gain from incorporating the G-E independence, and we illustrate this point here. We show that the magnitude of the efficiency gain under the G-E independence assumption depends on the structure of the data on genetic and environmental risk factors. In the simplest case, where data on phenotype and genetic and environmental risk factors can be summarized in a 2 × 2 × 2 table, we show that the retrospective method of Chatterjee and Carroll (9) under the assumption of G-E independence in the general population and the standard prospective analysis yield identical estimates. This result seemingly contradicts that obtained by Umbach and Weinberg (8), where meaningful efficiency gain is achieved under the assumption of G-E independence in the control population in a 2 × 2 × 2 table. We offer an explanation that is related to the identifiability of the intercept parameter in Chatterjee and Carroll's (9) approach: Estimating the intercept parameter used up the additional information for estimating odds ratio association parameters coded in the G-E independence assumption in the population. For the approach of Umbach and Weinberg (8), we show that any efficiency gain is due to the fact that the marginal G-E distribution under a case-control design encoded information on the odds ratio parameters after the approximation based on the rare phenotype assumption. We also study the implications of the results for applications in practice.
MATERIALS AND METHODS
Let D denote the binary phenotype, G denote the genetic factor, and E denote the environmental factor (or the other genetic factor in studies of gene-gene interaction). Let S = 1 indicate that a subject in the general study population is selected into the case-control sample. In the case-control design, the sampling probability depends only on the disease status, that is, P(S = 1|D, G, E) = P(S = 1|D). Assume that G and E contribute to the risk of phenotype D following a logistic regression model
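$$P(D = 1 \mid G, E) = \frac{\exp(\beta_0 + \beta_1 G + \beta_2 E + \beta_{12} G E)}{1 + \exp(\beta_0 + \beta_1 G + \beta_2 E + \beta_{12} G E)}, \qquad (1)$$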
and that in the study population, the gene G and the environmental factor E are independently distributed, that is,
$$p(G, E) = p(G)\, p(E). \qquad (2)$$
The contribution to the retrospective likelihood by a single subject, p(G, E|D), can be expressed as
$$p(G, E \mid D) = \frac{\eta\{D; (G, E)\}\, p(G)\, p(E)\, \{1 + e^{m(G, E)}\}^{-1}}{\sum_{G', E'} \eta\{D; (G', E')\}\, p(G')\, p(E')\, \{1 + e^{m(G', E')}\}^{-1}},$$
where η{D; (G, E)} = exp(β1DG + β2DE + β12DGE) and m(G, E) = β0 + β1G + β2E + β12GE. In the case-control sample, the proportions of cases and controls, P(D|S = 1) = n_D/n, are fixed by the sampling design, where n0 and n1 are the respective numbers of controls (D = 0) and cases (D = 1), and n = n0 + n1. Let β0* = β0 + log{P(S = 1|D = 1)/P(S = 1|D = 0)} and define p*(D) = exp(Dβ0*)/{1 + exp(β0*)}, p*(G) = P(G|E = 0, D = 0), and p*(E) = P(E|G = 0, D = 0). By combining P(G, E|D, S = 1) with P(D|S = 1), the data (D, G, E) for the case-control sample are distributed as follows:
![]() |
(3) |
where
Chen (11) studied the problem of parameter identifiability and estimation for general biased sampling designs that included the case-control design as a special case. Let θ be parameters in p(D, G, E|S = 1) other than p*(D) and p*(E). Lemma 1 below follows from the results presented by Chen (11).
Lemma 1. For the case-control design, if both p*(D) and p*(E) are unconstrained and variation-independent of each other and of θ, the profile likelihood for parameters in p(G, E|D) other than p*(E) from the conditional likelihood based on p(D, G|E, S = 1) is the same as the profile likelihood for θ from the retrospective likelihood based on p(G, E|D).
In a typical case-control design, P(S = 1|D = 1)/P(S = 1|D = 0) is unknown. This means that p*(D) is unconstrained and is variation-independent of θ = (β1, β2, β12, p*(G)) and p*(E). Under equations 1 and 2 and unconstrained p(E), we can conclude from lemma 1 that the θ estimator from maximizing the joint conditional likelihood $\prod_{i=1}^{n} P(D_i, G_i \mid E_i, S_i = 1)$ is efficient if the θ estimator from maximizing the retrospective likelihood $\prod_{i=1}^{n} P(G_i, E_i \mid D_i)$ is efficient. The traditional method of analyzing case-control data maximizes the prospective likelihood $\prod_{i=1}^{n} P(D_i \mid G_i, E_i, S_i = 1)$. The joint conditional likelihood equals the prospective likelihood multiplied by $\prod_{i=1}^{n} P(G_i \mid E_i, S_i = 1)$. Lemma 1 implies that any additional information for estimating (β1, β2, β12) resulting from the incorporation of the population-level G-E independence in the retrospective likelihood is contained in $\prod_{i=1}^{n} P(G_i \mid E_i, S_i = 1)$. Below, we analyze this term to characterize the relation between the magnitude of efficiency gain and the structure of the data on (G, E). In the simplest case of a 2 × 2 × 2 table, which is frequently used to summarize case-control data, we show that $\prod_{i=1}^{n} P(G_i \mid E_i, S_i = 1)$ contains no information for estimating (β1, β2, β12). Thus, the only difference between the traditional prospective analysis and the method of Chatterjee and Carroll (9) in this case is that β0 is identifiable and can be estimated from the observed data when conditions on the identifiability of β0 hold.
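As a rough numerical illustration of the prospective analysis described above, the sketch below computes the exact logistic parameters of P(D|G, E, S = 1) for binary G and E. It assumes equations 1 and 2, a hypothetical p(E = 1) = 0.5, and equal numbers of cases and controls; the function name `prospective_parameters` is ours, not the authors'. It is meant only to show that the odds ratio parameters are preserved under case-control sampling while the intercept is shifted to β0*.

```python
import math
from itertools import product


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def prospective_parameters(beta0, beta1, beta2, beta12,
                           p_g, p_e, n_cases, n_controls):
    """Exact logistic parameters of P(D | G, E, S = 1) for binary G and E."""
    cells = list(product((0, 1), repeat=2))
    # Population model: equation 1 for the risk, equation 2 for p(G, E).
    p_ge = {(g, e): (p_g if g else 1 - p_g) * (p_e if e else 1 - p_e)
            for g, e in cells}
    risk = {(g, e): expit(beta0 + beta1 * g + beta2 * e + beta12 * g * e)
            for g, e in cells}
    prevalence = sum(p_ge[k] * risk[k] for k in cells)
    # Retrospective distributions p(G, E | D) by Bayes' rule.
    p_case = {k: p_ge[k] * risk[k] / prevalence for k in cells}
    p_ctrl = {k: p_ge[k] * (1 - risk[k]) / (1 - prevalence) for k in cells}
    # Log odds of being a case within the case-control sample.
    logit = {k: math.log(n_cases * p_case[k] / (n_controls * p_ctrl[k]))
             for k in cells}
    intercept = logit[(0, 0)]
    b1 = logit[(1, 0)] - intercept
    b2 = logit[(0, 1)] - intercept
    b12 = logit[(1, 1)] - logit[(1, 0)] - logit[(0, 1)] + intercept
    return intercept, b1, b2, b12, prevalence


intercept, b1, b2, b12, prevalence = prospective_parameters(
    beta0=-3.0, beta1=1.0, beta2=1.0, beta12=0.5,
    p_g=0.1, p_e=0.5, n_cases=25000, n_controls=25000)
# With n1 = n0, log{P(S=1|D=1)/P(S=1|D=0)} = log{P(D=0)/P(D=1)}.
shift = math.log((1 - prevalence) / prevalence)
print(b1, b2, b12)               # odds ratio parameters are unchanged
print(intercept - (-3.0), shift)  # the intercept is shifted by the log sampling ratio
```

The intercept recovered here is β0*, not β0, which is why the prospective likelihood alone cannot identify the baseline risk.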
RESULTS
G-E independence in the general population
Theorem 1. Assume the case-control sampling probabilities are unknown and p(E) is unconstrained. When the case-control data on (D, G, E) can be summarized in a 2 × 2 × 2 table, the maximum likelihood estimators for (β1, β2, β12) based on p(G, E|D) without or with the G-E independence assumption, that is, those obtained from maximizing $\prod_{i=1}^{n} P(D_i \mid G_i, E_i, S_i = 1)$ and $\prod_{i=1}^{n} P(D_i, G_i \mid E_i, S_i = 1)$, are identical.
The proof of theorem 1 is given in the Appendix. The result in theorem 1 can be understood as follows. Note that p(G|E, S = 1) involves the parameters β0 and p(G), which do not appear in p(D|G, E, S = 1). For a 2 × 2 × 2 table, the likelihood $\prod_{i=1}^{n} P(G_i \mid E_i, S_i = 1)$ allows for the estimation of 2 independent parameters, p(G = 1|E = 1, S = 1) and p(G = 1|E = 0, S = 1), to which the parameters β0 and p(G = 1) can be mapped. This equivalent parameterization means that there is no information left for estimating (β1, β2, β12) in equation A4 (Appendix) after β0 and p(G = 1) are estimated. By analogy, for slightly richer data summarized in a 2 × 3 × 2 table, if 1 parameter is used to model the G-E association in the general population (12) and no additional constraint is imposed on p(G), we would expect no efficiency gain even if the model for p(G|E) is not saturated.
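To make the role of β0 in p(G|E, S = 1) more concrete, the following sketch computes the G-E odds ratio in the pooled case-control sample for binary G and E under equations 1 and 2. The marginal probabilities, the case fraction, and the function name `sample_ge_odds_ratio` are illustrative assumptions. Under these assumptions the sample G-E odds ratio changes with β0 when G and E both affect D, and stays at 1 when G has no effect.

```python
import math
from itertools import product


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_ge_odds_ratio(beta0, beta1, beta2, beta12,
                         p_g=0.1, p_e=0.5, case_fraction=0.5):
    """G-E odds ratio in the pooled case-control sample (binary G and E)."""
    cells = list(product((0, 1), repeat=2))
    # Population: G and E independent (equation 2), risk from equation 1.
    p_ge = {(g, e): (p_g if g else 1 - p_g) * (p_e if e else 1 - p_e)
            for g, e in cells}
    risk = {(g, e): expit(beta0 + beta1 * g + beta2 * e + beta12 * g * e)
            for g, e in cells}
    prevalence = sum(p_ge[k] * risk[k] for k in cells)
    # P(G, E | S = 1) mixes p(G, E | D = 1) and p(G, E | D = 0)
    # with weights n1/n and n0/n fixed by the design.
    mix = {k: case_fraction * p_ge[k] * risk[k] / prevalence
              + (1 - case_fraction) * p_ge[k] * (1 - risk[k]) / (1 - prevalence)
           for k in cells}
    return mix[(1, 1)] * mix[(0, 0)] / (mix[(1, 0)] * mix[(0, 1)])


for beta0 in (-7.0, -3.0, -1.0):
    print(beta0, round(sample_ge_odds_ratio(beta0, 1.0, 1.0, 0.5), 3))
# With no G effect (beta1 = beta12 = 0) the sample odds ratio stays at 1.
print(sample_ge_odds_ratio(-3.0, 0.0, 1.0, 0.0))
```

In a 2 × 2 × 2 table this single odds ratio, together with one marginal proportion, is all that the G-given-E part of the likelihood can estimate, which is consistent with the parameter-counting argument above.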
Theorem 1 has direct implications for detecting interaction effects between a single nucleotide polymorphism (SNP) and binary environmental exposures. A SNP with alleles B and b has 3 possible genotypes, BB, Bb, and bb. Let G be the numerical coding for the genotype. The classical models for the SNP effect include the dominant model (G = 1 for BB and Bb and G = 0 for bb), the recessive model (G = 1 for BB and G = 0 for Bb and bb), and the additive model (G = 0 for bb, G = 1 for Bb, and G = 2 for BB). With the dominant or recessive model for the main effect of G and a dichotomous environmental factor E, theorem 1 implies that incorporating the G-E independence in the general population cannot improve statistical power for detecting G-E interaction effects. For the additive case on a logarithmic scale, some efficiency gain in estimating the odds ratio parameters can be achieved, but it has very limited magnitude.
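The three genotype codings just described can be written down directly. The helper below is a minimal sketch; the function name `code_genotype` and the string representation of genotypes are our own conventions, not part of the original analysis.

```python
def code_genotype(genotype, model):
    """Numeric coding G for a SNP genotype with alleles B and b."""
    n_b_upper = genotype.count("B")  # copies of allele B: 0, 1, or 2
    if model == "dominant":
        return 1 if n_b_upper >= 1 else 0
    if model == "recessive":
        return 1 if n_b_upper == 2 else 0
    if model == "additive":
        return n_b_upper
    raise ValueError(f"unknown model: {model}")


for model in ("dominant", "recessive", "additive"):
    print(model, [code_genotype(gt, model) for gt in ("BB", "Bb", "bb")])
```

Under the dominant and recessive codings G is binary, so with a dichotomous E the data reduce to a 2 × 2 × 2 table and theorem 1 applies; the additive coding gives a 3-level G.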
It is important to note that when the phenotype prevalence rate, p(D = 1), is known, theorem 1 does not apply. That is, the efficiency of the (β1, β2, β12) estimator from maximizing the retrospective likelihood may be higher than that from maximizing the prospective likelihood, even in the case of a 2 × 2 × 2 table. On the other hand, it is recognized that knowing the disease prevalence rate does not help in estimating (β1, β2, β12) if p(G, E) is unconstrained, because the additional constraints imposed by fixed p(D = 1) happen to be equivalent to the constraints imposed by fixing the numbers of cases and controls for the estimation of odds ratio parameters (13). The key reason that theorem 1 does not hold is that conditions for applying lemma 1 are no longer satisfied. When the phenotype prevalence is known, β0* is fully determined by β0:
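$$\beta_0^* = \beta_0 + \log\left[\frac{n_1\{1 - p(D = 1)\}}{n_0\, p(D = 1)}\right],$$
where n1 and n0 denote the numbers of cases (D = 1) and controls (D = 0); this follows from Bayes' rule applied to P(S = 1|D).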
Thus, p*(D) is no longer variation-independent of θ under the G-E independence. These results suggest that if we would like to exploit the G-E independence for efficiency gain in estimating the odds ratio parameter, one strategy would be to incorporate knowledge of the disease prevalence rate.
Chatterjee and Carroll (9) noticed that it is challenging to estimate the intercept parameter β0 based on their profile likelihood method, despite the fact that β0 is theoretically identifiable when β12 ≠ 0, or when both β1 ≠ 0 and β2 ≠ 0. (See a related problem discussed in the article by Neuhaus et al. (14).) The difficulty in estimating β0 with data summarized in a 2 × 2 × 2 table may be explained as follows. Given (β1, β2, β12), the part of the likelihood useful for estimating β0 is
$$g(\beta_0) = \frac{(1 + e^{\beta_0 + \beta_1})(1 + e^{\beta_0 + \beta_2})}{(1 + e^{\beta_0})(1 + e^{\beta_0 + \beta_1 + \beta_2 + \beta_{12}})}.$$
It can be seen that $\lim_{\beta_0 \to +\infty} g(\beta_0) = \exp(-\beta_{12})$ and $\lim_{\beta_0 \to -\infty} g(\beta_0) = 1$. Thus, g(β0) is a bounded function. Figure 1 shows 2 typical shapes of the function with a positive or negative interaction effect and 2 typical shapes of the function in the absence of an interaction effect.
Figure 1.
Plots of the function $g(\beta_0) = \frac{(1 + e^{\beta_0 + \beta_1})(1 + e^{\beta_0 + \beta_2})}{(1 + e^{\beta_0})(1 + e^{\beta_0 + \beta_1 + \beta_2 + \beta_{12}})}$ with different parameter values. Upper left panel: g(β0) with main effect (β1, β2) = (1, 1) and positive interaction (β12 = 0.5); upper right panel: g(β0) with main effect (β1, β2) = (−1, −1) and negative interaction (β12 = −0.5); lower left panel: g(β0) with main effect (β1, β2) = (1, 1) and no interaction (β12 = 0); lower right panel: g(β0) with main effect (β1, β2) = (1, −1) and no interaction (β12 = 0).
Since the second part of the right-hand side of equation A9 (i.e., $n_{11}n_{00}/(n_{10}n_{01})$) can be any number because of random variation, the solution to equation A9 may not exist for a data set with a finite sample size. However, this difficulty is less likely to arise with extremely large sample sizes.
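The boundedness of g(β0) can be checked numerically. The sketch below evaluates the function given in the Figure 1 caption on a grid for the upper-left-panel parameter values; the grid range and the variable names are illustrative choices.

```python
import math


def g(beta0, beta1, beta2, beta12):
    """The function g(beta0) plotted in Figure 1."""
    num = (1 + math.exp(beta0 + beta1)) * (1 + math.exp(beta0 + beta2))
    den = (1 + math.exp(beta0)) * (1 + math.exp(beta0 + beta1 + beta2 + beta12))
    return num / den


beta1, beta2, beta12 = 1.0, 1.0, 0.5       # upper left panel of Figure 1
grid = [x / 10 for x in range(-400, 401)]  # beta0 from -40 to 40
values = [g(b0, beta1, beta2, beta12) for b0 in grid]

print(values[0], values[-1], math.exp(-beta12))  # the two limits of g
print(min(values), max(values))                  # g stays in a bounded range
```

For these parameter values g never exceeds 1, so a target value above 1 cannot be matched by any β0. This is one way to see why a finite-sample solution for β0 may fail to exist.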
Rare phenotype assumption and G-E independence in the control population
Assuming a log-linear model for analyzing case-control data (D, G, E), Umbach and Weinberg (8) showed that efficiency for estimating the odds ratio parameters when G-E independence is imposed on the control population is higher than that in the traditional prospective analysis, even for data summarized by a 2 × 2 × 2 table. However, it is well known that a log-linear model approximates the logistic regression model adopted by Chatterjee and Carroll (9) when the phenotype is rare, that is, p(D = 1) ≈ 0, and G-E independence holds in the general population. We explain these seemingly contradictory results as follows. Note that the assumptions in the article by Umbach and Weinberg (8) are equivalent to equation 1 and p(G, E|D = 0) = p(G|D = 0) p(E|D = 0). By the odds ratio formulation of Chen (11), the joint distribution of (D, G, E) under the case-control design can be rewritten as
![]() |
(4) |
In this case, parameters in p(G, E|D) other than p(E|D = 0) are (β1, β2, β12, p(G|D = 0)), which do not include β0. If the distribution p(E|D = 0) is not constrained, lemma 1 can be applied to obtain the efficient parameter estimator for (β1, β2, β12, p(G|D = 0)) by profiling p(E|D = 0) out from the unconstrained joint likelihood based on p(D, G, E|S = 1). The profile likelihood is equivalent to the conditional likelihood
![]() |
(5) |
The likelihood equation 5 is approximately the same as equation A2 under the rare phenotype assumption, that is, exp{m(G, E)} ≈ 0. Note, however, that the property of likelihood equation 5 is very different from that of likelihood equation A2. First, theorem 1 is no longer applicable to equation 5 and the estimator of (β1, β2, β12) based on likelihood equation 5 can be more efficient than that based on the traditional prospective likelihood. Although equation 5 can be factored into equation A3 and
![]() |
(6) |
there is only 1 parameter to be estimated from equation 6, aside from p*(D), and p*(D) needs to be estimated jointly with equation A3. Efficiency gain in estimating β0* leads to more efficient estimation of (β1, β2, β12) than with the traditional prospective likelihood estimator. Second, when p(D = 1) is known, p*(D) is determined by β0, which is not contained in p(G, E|D) under the G-E independence in the control population. This means that lemma 1 is still applicable to likelihood equation 4 because the variation-independence condition in lemma 1 is satisfied. As a result, knowing p(D = 1) does not help increase the efficiency for estimating (β1, β2, β12, p(G|D = 0)). This is very different from the case where G-E independence is assumed for the general population. The apparent contradiction noted earlier rests on the mistaken intuition that the maximum likelihood estimators should have approximately equal efficiency if the 2 likelihoods are approximately equal.
SIMULATION STUDIES
We performed 2 simulation studies to assess the magnitude of efficiency gain under the G-E independence assumption in the general population, and we performed 1 simulation study to assess bias when this G-E independence assumption is violated. In the first study, we compared the traditional prospective logistic regression (PLR) analysis based on p(D|G, E, S = 1) with the analysis based on p(D, G|E, S = 1), which corresponds to the retrospective maximum likelihood (RML) approach of Chatterjee and Carroll (9). Data were simulated from equations 1 and 2. Because G and E are in symmetric positions, we assumed in the first simulation study that G takes the values 0 and 1 only and E either has 3 different categories or is a continuous variable. Since the relative efficiency is of primary interest and the intercept parameter is difficult to estimate, we chose a very large sample size in the first simulation study: 50,000 subjects, with 25,000 cases and 25,000 controls. In this simulation, P(G = 1) = 0.1, and E is uniformly distributed on {0, 1, 2} or on [0, 2]. The log odds ratio parameters were (β1, β2, β12) = (1, 1, 0.5), and the intercept was β0 = −3. The relatively large G-E effects were chosen primarily to make the computational method more stable. Results based on 1,000 replicates are shown in Table 1.
Table 1.
Simulation Results for Odds Ratio Estimation Under Gene-Environment (G-E) Independence in the General Population With an Unknown Case Prevalence Rate^a
| Effect | Truth | Traditional Prospective Logistic Regression Approach | | | Retrospective Maximum Likelihood Approach | | |
| | | Bias (Estimate − Truth) | SD of the Parameter Estimate | Average of the SE Estimates^b | Bias (Estimate − Truth) | SD of the Parameter Estimate | Average of the SE Estimates^b |
| G Has 2 Categories and E Has 3 Categories | | | | | | | |
| G | 1.0 | 0.003 | 0.0511 | 0.0525 | 0.003 | 0.0510 | 0.0525 |
| E | 1.0 | 0.001 | 0.0141 | 0.0139 | 0.001 | 0.0141 | 0.0139 |
| G × E | 0.5 | 0.000 | 0.0404 | 0.0434 | 0.004 | 0.0404 | 0.0433 |
| Intercept | −3.0 | −0.003 | 0.1032 | 0.1013 | | | |
| G Has 2 Categories and E Is Continuous | | | | | | | |
| G | 1.0 | 0.004 | 0.0636 | 0.0618 | 0.004 | 0.0635 | 0.0617 |
| E | 1.0 | 0.000 | 0.0186 | 0.0184 | 0.000 | 0.0186 | 0.0184 |
| G × E | 0.5 | −0.001 | 0.0578 | 0.0558 | −0.001 | 0.0575 | 0.0556 |
| Intercept | −3.0 | −0.002 | 0.1344 | 0.1324 | | | |
Abbreviations: SD, standard deviation; SE, standard error.
^a The simulation results are based on 1,000 replicates in a sample of 50,000 with 25,000 cases and 25,000 controls.
^b Average of the SE estimates of the parameter estimator.
As Table 1 shows, when the environmental factor takes 3 values, the efficiency of the RML estimator is only slightly improved over that of the PLR estimator. When the environmental factor is a continuous variable, the efficiency improvement is also very limited in the simulation. The efficiency gain in our simulation appears to be smaller than that in the article by Chatterjee and Carroll (9) when the phenotype prevalence rate is unknown in the general population. To investigate the reason, we tested our computer program using the simulation setting for rare disease in the article by Chatterjee and Carroll (9). Using a sample size of 10,000 with 5,000 cases and 5,000 controls and 1,000 repetitions, we obtained the average estimates for (β1, β2, β12) as (0.253, 0.100, 0.304) from the likelihood approach based on p(D, G|E, S = 1), where the true values are (0.26, 0.1, 0.3). The empirical standard error estimates were (0.100, 0.011, 0.041). Notice that our sample size is 10 times that used by Chatterjee and Carroll (9). After adjusting for the sample size difference, we obtain √10 × (0.100, 0.011, 0.041) = (0.32, 0.035, 0.130), which are close to the (0.32, 0.037, 0.128) obtained by Chatterjee and Carroll (9). Indeed, the efficiency gain for the interaction parameter in this case is much larger than that in our simulation setting. This may suggest that the efficiency gain depends on the underlying covariate distribution.
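The sample-size adjustment above uses the usual root-n scaling of standard errors. The short check below reproduces the arithmetic from the values reported in the text; the variable names are illustrative.

```python
import math

# Empirical standard errors for (beta1, beta2, beta12) with n = 10,000.
se_large_sample = (0.100, 0.011, 0.041)
# Values reported by Chatterjee and Carroll (9) at one tenth of that sample size.
se_reported = (0.32, 0.037, 0.128)

# Standard errors scale with 1/sqrt(n), so a 10-fold smaller sample
# inflates them by sqrt(10).
scale = math.sqrt(10)
se_adjusted = tuple(scale * se for se in se_large_sample)

for adjusted, reported in zip(se_adjusted, se_reported):
    print(round(adjusted, 3), reported)
```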
In the second simulation study, we compared the traditional prospective logistic analysis with the approach proposed by Chatterjee and Carroll (9) assuming a known correct or incorrect phenotype prevalence rate, and with the approach of Umbach and Weinberg (8) assuming G-E independence in the control population. We generated independent binary G and E with p(G = 1) = 0.1 and p(E = 1) = 0.5. The log odds ratio association parameters in equation 1 were more realistically set to (0.2, 0.1, 0.5), and we varied the intercept parameter in the logistic regression to obtain phenotype prevalence rates of 25%, 10%, and 0.1%, corresponding to β0 = −1.2, −2.3, and −7.0, respectively. Five estimators of the odds ratio parameters were computed: the traditional PLR estimator; Chatterjee and Carroll's approach assuming that the true phenotype prevalence rate is known (pRML); Chatterjee and Carroll's approach assuming an incorrect phenotype prevalence rate, either twice the true rate (UpRML) or half of the true rate (LpRML); and Umbach and Weinberg's approach (UW), which effectively assumes an incorrect phenotype prevalence rate of 0. The simulation results, shown in Table 2, are based on 1,000 repetitions of a sample size of 5,000 with 2,500 cases and 2,500 controls.
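The correspondence between the intercept values and the phenotype prevalence rates in this setting can be verified by averaging the logistic risk over the independent distribution of (G, E). The sketch below does this with the stated parameter values; the function name `prevalence` is ours.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def prevalence(beta0, beta=(0.2, 0.1, 0.5), p_g=0.1, p_e=0.5):
    """P(D = 1) for independent binary G and E under the logistic model."""
    b1, b2, b12 = beta
    total = 0.0
    for g in (0, 1):
        for e in (0, 1):
            weight = (p_g if g else 1 - p_g) * (p_e if e else 1 - p_e)
            total += weight * expit(beta0 + b1 * g + b2 * e + b12 * g * e)
    return total


for beta0 in (-1.2, -2.3, -7.0):
    print(beta0, round(prevalence(beta0), 4))
```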
Table 2.
Simulation Results for Odds Ratio Estimation Under Gene-Environment (G-E) Independence in the General Population With a Known Correct or Incorrect Case Prevalence Rate^a
| Method | G (True Value = 0.2) | | | E (True Value = 0.1) | | | G × E (True Value = 0.5) | | |
| | Bias (Estimate − Truth) | SD of the Parameter Estimate | Average of the SE Estimates^b | Bias (Estimate − Truth) | SD of the Parameter Estimate | Average of the SE Estimates^b | Bias (Estimate − Truth) | SD of the Parameter Estimate | Average of the SE Estimates^b |
| Phenotype Prevalence Rate = 25% | | | | | | | | | |
| PLR^c | 0.002 | 0.130 | 0.133 | 0.000 | 0.060 | 0.060 | 0.000 | 0.184 | 0.185 |
| UpRML^d | −0.001 | 0.130 | 0.133 | 0.000 | 0.060 | 0.060 | 0.002 | 0.184 | 0.185 |
| pRML^e | −0.001 | 0.123 | 0.125 | 0.000 | 0.059 | 0.059 | 0.002 | 0.161 | 0.164 |
| LpRML^f | 0.022 | 0.118 | 0.120 | 0.004 | 0.059 | 0.059 | −0.045 | 0.142 | 0.142 |
| UW^g | 0.047 | 0.114 | 0.115 | 0.009 | 0.059 | 0.059 | −0.098 | 0.122 | 0.122 |
| Phenotype Prevalence Rate = 10% | | | | | | | | | |
| PLR | −0.001 | 0.133 | 0.132 | −0.003 | 0.061 | 0.060 | 0.010 | 0.183 | 0.181 |
| UpRML | −0.002 | 0.124 | 0.121 | −0.003 | 0.060 | 0.059 | 0.008 | 0.155 | 0.152 |
| pRML | −0.002 | 0.120 | 0.117 | −0.003 | 0.060 | 0.059 | 0.008 | 0.139 | 0.136 |
| LpRML | 0.009 | 0.118 | 0.115 | −0.001 | 0.059 | 0.059 | −0.024 | 0.130 | 0.127 |
| UW | 0.019 | 0.117 | 0.114 | 0.002 | 0.059 | 0.059 | −0.046 | 0.122 | 0.120 |
| Phenotype Prevalence Rate = 0.1% | | | | | | | | | |
| PLR | −0.003 | 0.130 | 0.131 | 0.001 | 0.060 | 0.060 | 0.006 | 0.176 | 0.179 |
| UpRML | −0.003 | 0.111 | 0.113 | 0.001 | 0.058 | 0.059 | 0.002 | 0.114 | 0.119 |
| pRML | −0.003 | 0.111 | 0.113 | 0.001 | 0.058 | 0.059 | 0.002 | 0.114 | 0.118 |
| LpRML | −0.003 | 0.111 | 0.113 | 0.001 | 0.058 | 0.059 | 0.002 | 0.114 | 0.118 |
| UW | −0.003 | 0.111 | 0.113 | 0.001 | 0.058 | 0.059 | 0.001 | 0.114 | 0.118 |
Abbreviations: PLR, prospective logistic regression; RML, retrospective maximum likelihood; SD, standard deviation; SE, standard error; UW, Umbach and Weinberg.
^a The simulation results are based on 1,000 replicates in a sample of 5,000 with 2,500 cases and 2,500 controls.
^b Average of the SE estimates of the parameter estimator.
^c Traditional PLR approach.
^d RML approach assuming a known but incorrect disease prevalence rate equal to twice the true rate.
^e RML approach assuming a known true disease prevalence rate.
^f RML approach assuming a known but incorrect disease prevalence rate equal to half the true rate.
^g Umbach and Weinberg's (8) approach, corresponding to the RML approach with the disease prevalence rate incorrectly assumed to be 0.
The simulation results show that pRML had a negligible bias and a substantial reduction in variance compared with PLR. Both UpRML and LpRML had smaller bias than Umbach and Weinberg's estimator. However, the simulation also shows that the variance of the estimator decreases as the assumed incorrect phenotypic prevalence rate decreases. In particular, the UpRML estimator had a larger variance than the pRML estimator, which may suggest that an upward bias in the assumed phenotypic prevalence rate can result in loss of efficiency. Umbach and Weinberg's estimator had the smallest variance among all of the estimators, but it can have relatively large bias when the phenotypic prevalence rate is large. For a very small prevalence rate, all of the estimators with a known prevalence rate appear to have the same performance. This suggests that the additional efficiency gain in Umbach and Weinberg's approach compared with Chatterjee and Carroll's approach without knowledge of the prevalence rate is due to the known prevalence rate of approximately 0 in the former. Overall, the results suggest that when we do not have exact information on the prevalence rate, it is still useful to use an inaccurate prevalence rate for estimation and inference.
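One way to quantify the variance reduction is the ratio of empirical variances of the PLR and pRML estimators of the G × E parameter. The sketch below computes this ratio from the SD columns of Table 2; treating the squared SD ratio as a relative efficiency is a conventional summary that we add here, not a quantity reported in the table.

```python
# Empirical SDs of the G x E estimate from Table 2, by prevalence rate.
sd_interaction = {
    "25%": {"PLR": 0.184, "pRML": 0.161},
    "10%": {"PLR": 0.183, "pRML": 0.139},
    "0.1%": {"PLR": 0.176, "pRML": 0.114},
}

# Relative efficiency of pRML versus PLR: ratio of empirical variances.
relative_efficiency = {
    rate: (sd["PLR"] / sd["pRML"]) ** 2
    for rate, sd in sd_interaction.items()
}

for rate, value in relative_efficiency.items():
    print(rate, round(value, 2))
```

The ratio grows as the prevalence rate falls, in line with the pattern described in the text.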
In the third simulation study, correlated binary G and E were simulated with P(G = 1, E = 1) = pq, P(G = 1, E = 0) = p(1 − q), P(G = 0, E = 1) = (1 − p)(1 − q), and P(G = 0, E = 0) = (1 − p)q, where p = 0.1 and q = 0.55. The association between G and E has an odds ratio of approximately 1.49. All other model parameters are the same as in the second simulation setting. In the analysis, G and E are modeled as independent, and the same methods of analysis as in the second simulation are used. Simulation results are shown in Table 3. These results show that, with moderate deviation from the G-E independence assumption, large bias occurs with all of the estimators except for the PLR. The results suggest that approaches exploiting the G-E independence assumption for efficiency gain are all very sensitive, in terms of bias, to deviation from the independence assumption. It is very important to check the validity of the assumption when applying these approaches.
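The stated G-E odds ratio follows directly from the joint probabilities. The short check below builds the 2 × 2 joint distribution from p and q as defined above and computes the cross-product ratio; the dictionary layout is an illustrative choice.

```python
p, q = 0.1, 0.55

# Joint distribution of (G, E) in the third simulation study.
joint = {
    (1, 1): p * q,
    (1, 0): p * (1 - q),
    (0, 1): (1 - p) * (1 - q),
    (0, 0): (1 - p) * q,
}

# Cross-product (odds) ratio for the G-E association; p cancels,
# leaving q^2 / (1 - q)^2.
odds_ratio = joint[(1, 1)] * joint[(0, 0)] / (joint[(1, 0)] * joint[(0, 1)])

print(sum(joint.values()))   # the four cells form a probability distribution
print(round(odds_ratio, 2))
```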
Table 3.
Simulation Results for Odds Ratio Estimation With Incorrectly Assumed Gene-Environment (G-E) Independence in the General Population^a
| Method | G (True Value = 0.2) | | | E (True Value = 0.1) | | | G × E (True Value = 0.5) | | |
| | Bias (Estimate − Truth) | SD of the Parameter Estimate | Average of the SE Estimates^b | Bias (Estimate − Truth) | SD of the Parameter Estimate | Average of the SE Estimates^b | Bias (Estimate − Truth) | SD of the Parameter Estimate | Average of the SE Estimates^b |
| Phenotype Prevalence Rate = 25%, OR(G, E) = 1.49 | | | | | | | | | |
| PLR^c | −0.004 | 0.139 | 0.138 | −0.003 | 0.060 | 0.060 | 0.002 | 0.186 | 0.187 |
| UpRML^d | −0.184 | 0.138 | 0.138 | −0.038 | 0.060 | 0.060 | 0.340 | 0.185 | 0.186 |
| pRML^e | −0.184 | 0.128 | 0.128 | −0.034 | 0.059 | 0.060 | 0.340 | 0.165 | 0.165 |
| LpRML^f | −0.165 | 0.123 | 0.122 | −0.034 | 0.059 | 0.059 | 0.321 | 0.144 | 0.144 |
| UW^g | −0.101 | 0.120 | 0.119 | −0.023 | 0.059 | 0.060 | 0.224 | 0.122 | 0.121 |
| Phenotype Prevalence Rate = 10%, OR(G, E) = 1.49 | | | | | | | | | |
| PLR | 0.003 | 0.140 | 0.138 | −0.005 | 0.060 | 0.061 | −0.004 | 0.178 | 0.182 |
| UpRML | −0.212 | 0.122 | 0.124 | −0.047 | 0.058 | 0.060 | 0.413 | 0.147 | 0.153 |
| pRML | −0.212 | 0.118 | 0.120 | −0.047 | 0.058 | 0.059 | 0.413 | 0.131 | 0.136 |
| LpRML | −0.189 | 0.117 | 0.118 | −0.042 | 0.058 | 0.059 | 0.377 | 0.123 | 0.127 |
| UW | −0.162 | 0.116 | 0.118 | −0.036 | 0.058 | 0.059 | 0.331 | 0.115 | 0.119 |
| Phenotype Prevalence Rate = 0.1%, OR(G, E) = 1.49 | | | | | | | | | |
| PLR | −0.005 | 0.134 | 0.137 | 0.000 | 0.062 | 0.061 | 0.000 | 0.179 | 0.178 |
| UpRML | −0.206 | 0.112 | 0.116 | −0.040 | 0.060 | 0.059 | 0.399 | 0.113 | 0.118 |
| pRML | −0.206 | 0.112 | 0.116 | −0.040 | 0.060 | 0.059 | 0.399 | 0.113 | 0.118 |
| LpRML | −0.206 | 0.112 | 0.116 | −0.040 | 0.060 | 0.059 | 0.399 | 0.113 | 0.118 |
| UW | −0.206 | 0.112 | 0.116 | −0.040 | 0.060 | 0.059 | 0.398 | 0.113 | 0.117 |
Abbreviations: OR, odds ratio; PLR, prospective logistic regression; RML, retrospective maximum likelihood; SD, standard deviation; SE, standard error; UW, Umbach and Weinberg.
^a The simulation results are based on 1,000 replicates in a sample of 5,000 with 2,500 cases and 2,500 controls.
^b Average of the SE estimates of the parameter estimator.
^c Traditional PLR approach.
^d RML approach assuming a known but incorrect disease prevalence rate equal to twice the true rate.
^e RML approach assuming a known true disease prevalence rate.
^f RML approach assuming a known but incorrect disease prevalence rate equal to half the true rate.
^g Umbach and Weinberg's (8) approach, corresponding to the RML approach with the disease prevalence rate incorrectly assumed to be 0.
DISCUSSION
Here we showed that estimation of the identifiable intercept parameter in the logistic regression model under certain constraints on the covariate distribution could compromise efficiency for estimating odds ratio parameters. Thus, the information coded in the G-E independence assumption alone for assessing joint effects of G-E variables can become limited with a binary genetic variable, particularly when the environmental variable takes only a small set of unique values. The amount of information increases as the data structure on (G, E) becomes more complex. In this sense, haplotype-based association analyses may be preferable to single-SNP-based analyses. In the same sense, one would probably need to be cautious in deciding whether to dichotomize a continuous environmental exposure. Of course, any analysis would have to be guided by scientific inquiry, rather than statistical efficiency. In addition, implementing the approach of Chatterjee and Carroll (9) without assuming any knowledge of the phenotypic prevalence rate is inherently computationally challenging.
Additional knowledge of the prevalence rate can be leveraged to increase the information coded in G-E independence, and it largely removes the difficulty in parameter estimation. The approach of Umbach and Weinberg (8) does not require us to know the exact phenotypic prevalence rate. However, Chatterjee and Carroll (9) noticed the subtlety that not only should the overall phenotype prevalence be small but it should be small in every population subgroup defined by covariates. When the phenotypic prevalence rate is not very low, the approach of Umbach and Weinberg (8) can have substantial bias. The approach of Chatterjee and Carroll (9) assuming a known phenotypic prevalence rate can have a substantial efficiency gain over the traditional prospective analysis. When the phenotypic prevalence rate is very low and is known, the methods of Umbach and Weinberg (8) and Chatterjee and Carroll (9) have comparable performance.
When the phenotypic prevalence rate is not known, attempting to estimate the intercept from the case-control data under the G-E independence assumption may not be a good choice. We see from these simulation results that, even if we do not know the exact phenotypic prevalence rate, assuming an incorrect phenotypic prevalence rate can still lead to a substantial reduction in variance with some bias in the parameter estimator. The exact cutoff in the bias and variance tradeoff requires further research. Finally, caution should be exercised in using the approaches for efficiency gain because of their sensitivity to deviation from the independence assumption.
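As one possible way to look at the bias-variance tradeoff mentioned above, the sketch below combines the reported bias and SD of the G × E estimate into a mean squared error (bias² + SD²) for the 25% prevalence setting of Tables 2 and 3. The use of mean squared error as the summary is our assumption for illustration; the tables themselves report only bias and SD.

```python
def mse(bias, sd):
    """Mean squared error from bias and standard deviation."""
    return bias ** 2 + sd ** 2


# (bias, SD) of the G x E estimate at 25% prevalence.
independence_holds = {      # Table 2
    "PLR": (0.000, 0.184), "UpRML": (0.002, 0.184), "pRML": (0.002, 0.161),
    "LpRML": (-0.045, 0.142), "UW": (-0.098, 0.122),
}
independence_violated = {   # Table 3, OR(G, E) = 1.49
    "PLR": (0.002, 0.186), "UpRML": (0.340, 0.185), "pRML": (0.340, 0.165),
    "LpRML": (0.321, 0.144), "UW": (0.224, 0.122),
}

mse_holds = {m: mse(*v) for m, v in independence_holds.items()}
mse_violated = {m: mse(*v) for m, v in independence_violated.items()}

for method in independence_holds:
    print(method, round(mse_holds[method], 4), round(mse_violated[method], 4))
```

On this summary, the methods that exploit independence do well when the assumption holds, while PLR has by far the smallest error when it is violated, which matches the caution expressed in the text.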
Acknowledgments
Author affiliations: Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois at Chicago, Chicago, Illinois (Hua Yun Chen); and Department of Biostatistics and Epidemiology, School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania (Jinbo Chen).
This work was partially supported by the Division of Mathematical Science of the National Science Foundation (grant DMS 1007726 to H. Y. C.) and the National Institutes of Health (grant R01 ES016626 to J. C.).
Conflict of interest: none declared.
Glossary
Abbreviations
- G-E: gene-environment
- PLR: prospective logistic regression
- RML: retrospective maximum likelihood
- SNP: single nucleotide polymorphism
APPENDIX
Proof of Theorem 1
Lemma 1 implies that the maximum likelihood estimator of θ can be obtained by maximizing the conditional likelihood
[Equation A1]
which can be simplified as
[Equation A2]
This likelihood can be factored into 2 parts. The first part is based on P(D|G, E, S = 1), which is the likelihood employed by the traditional prospective analysis and equals
[Equation A3]
The second part is based on P(G|E, S = 1) and equals
[Equation A4]
We note that the parameters β0 and p(G) are contained only in the second part, equation A4, so that the maximum likelihood estimator for β0 and p(G) can be obtained by maximizing equation A4. Let $A_{GE} = \sum_{D} \eta\{D;(G,E)\}\, p^{*}(D)\, \{1 + e^{m(G,E)}\}^{-1}$, $p_G = p(G)$, and $n_{kj}$ be the observed number of subjects with G = k and E = j, for k = 0, 1 and j = 0, 1. When data on (D, G, E) can be summarized in a 2 × 2 × 2 table, likelihood equation A4 can be rewritten as
[Equation A5]
The possible maximum value that equation A5 can achieve is
[Equation A6]
This requires that at least 1 solution exist for (p0, p1) and β0 for the following 2 equations:
[Equation A7]
[Equation A8]
where the second equation can be rewritten as
[Equation A9]
Since (n11n00)/(n01n10) converges to [expression not shown] asymptotically, there exists a β0 satisfying equation A9 for any given (β0*, β1, β2, β12) close to the true values. This further implies that (p0, p1) values satisfying equation A7 exist. It follows that likelihood equation A5 asymptotically achieves the maximum of equation A6, which is independent of (β1, β2, β12). Whenever solutions to equations A7 and A8 exist, the profile likelihood for (β1, β2, β12) based on equation A2 is the same as that based on equation A3. This completes the proof.
References
1. Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59(1):19–35.
2. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66(3):403–411.
3. Weinberg CR, Wacholder S. Prospective analysis of case-control data under general multiplicative-intercept risk models. Biometrika. 1993;80(2):461–465.
4. Rabinowitz D. A note on efficient estimation from case-control data. Biometrika. 1997;84(2):486–488.
5. Scott AJ, Wild CJ. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84(1):57–71.
6. Chen HY. A semiparametric odds ratio model for measuring association. Biometrics. 2007;63(2):413–421. doi: 10.1111/j.1541-0420.2006.00701.x.
7. Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat Med. 1994;13(2):153–162. doi: 10.1002/sim.4780130206.
8. Umbach DM, Weinberg CR. Designing and analysing case-control studies to exploit independence of genotype and exposure. Stat Med. 1997;16(15):1731–1743. doi: 10.1002/(sici)1097-0258(19970815)16:15<1731::aid-sim595>3.0.co;2-s.
9. Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation exploiting G-E independence in case-control studies. Biometrika. 2005;92(2):399–418.
10. Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies. J Am Stat Assoc. 2006;101(473):89–104.
11. Chen HY. A unified framework for studying parameter identifiability and estimation in biased sampling designs. Biometrika. 2011;98(1):163–175.
12. Chen YH, Chatterjee N, Carroll RJ. Retrospective analysis of haplotype-based case control studies under a flexible model for gene environment association. Biostatistics. 2008;9(1):81–99. doi: 10.1093/biostatistics/kxm011.
13. Roeder K, Carroll RJ, Lindsay BG. A semiparametric mixture approach to case-control studies with errors in covariables. J Am Stat Assoc. 1996;91(434):722–732.
14. Neuhaus J, Scott AJ, Wild CJ. The analysis of retrospective family studies. Biometrika. 2002;89(1):23–37.