Abstract
It is important to study the interaction between two risk factors in molecular epidemiology studies. To improve the power for the detection of interaction, some statistical testing procedures have been proposed in the literature by incorporating certain assumptions on the underlying joint distribution of the two risk factors. For example, the well-known case-only test used in genetic epidemiology studies is derived under the assumption of independence between the two considered risk factors. However, those testing procedures could have detrimental effects on both false positive and false negative rates when the assumptions are not met. We propose to use a parametric copula function to model the joint distribution while leaving the marginal distributions for the two risk factors totally unspecified. A unified approach is proposed to estimate and test the interaction effect. This approach is very flexible and can be applied to study the interaction between two risk factors that are continuous or discrete. A simulation study demonstrates that the proposed approach is generally more powerful than the traditional robust test derived under the standard logistic regression without specifying the relationship between the two risk factors. The performance of the proposed approach is comparable with the case-only test when the two risk factors are indeed independent in the control population. Unlike the case-only test, the proposed test can still maintain the correct type I error rate when the independence assumption is not valid. The application of the proposed procedure is demonstrated through two cancer epidemiology studies.
Keywords: Case-only design, gene-environment interaction, gene-gene interaction, pseudo likelihood
1. Introduction
In epidemiology studies, it is usually of interest to evaluate whether there is any interaction between two risk factors for the disease of interest. For instance, in genetic epidemiology studies, it is important to study gene-gene and gene-environment interactions in order to better understand the etiology underlying the disease development. The case-control design is widely used in epidemiology studies. Under such a retrospective design, information on risk factors and other covariates is collected from fixed numbers of cases and controls. It is well known that the prospective likelihood based on a logistic regression model can be used to estimate the log odds ratio parameters due to the equivalence of the prospective and retrospective maximum likelihood estimates (Cornfield 1956; Prentice and Pyke 1979). For the same reason, the prospective likelihood model can also be used to study the interaction by evaluating the coefficient of the product term of two risk factors. The standard logistic regression model is optimal for the assessment of the interaction when the joint distribution of two risk factors is fully nonparametric. However, this standard logistic regression method ignores the relationship between the risk factors and thus could lose power for detecting interaction if such information is available. Piegorsch et al. (1994) proposed a case-only method for detecting interaction, which is valid when the disease is rare enough and the two risk factors are independent in the general population, but this method does not allow for the adjustment of additional covariates. Umbach and Weinberg (1997) extended the case-only method to account for categorical covariates. Chatterjee and Carroll (2005) developed a general semiparametric method to detect gene-environment or gene-gene interaction by incorporating the independence assumption for the risk factors.
However, it has been shown that these methods derived under the independence assumption can lead to serious inflation in type I error or loss of power if that assumption is not met in the application. Therefore, they should be used with great caution when the independence assumption is in doubt. To relax the independence assumption, Mukherjee and Chatterjee (2008) proposed an empirical Bayes-type shrinkage estimator by combining the estimate from the standard logistic regression model with the one derived under the independence assumption.
In this paper, we will develop a novel approach for detecting interaction by modeling the joint distribution of two risk factors in the control population (or equivalently, the general population if the disease is rare) through a copula model (Nelsen 1999) while leaving the marginal distributions of the risk factors totally unspecified. Some authors modeled the relationship between the risk factors in the general population (Chatterjee and Carroll, 2005; Lin and Zeng, 2009), but the parameters could be nearly unidentifiable as mentioned in their papers, while our model avoids the identifiability problem. The proposed model is very general and covers the scenario when the two risk factors are independent in the control population. The theory underlying our approach is due to Sklar (1959), which states that there exists a unique copula function characterizing the joint distribution of any two continuous random variables. For discrete-continuous and discrete-discrete scenarios, we will assume that the discrete risk factors are ordinal and can be derived through (unobservable) continuous random variables. Therefore, our approach can be used to study the interaction of two risk factors that are either continuous or discrete.
The rest of this paper is organized as follows. In Section 2, the statistical models (the logistic regression model and the copula model) are described for three scenarios (continuous-continuous, discrete-continuous, and discrete-discrete). In Section 3, a two-stage approach is developed for estimating unknown parameters under each of the three scenarios, where the first stage is to estimate the marginal distributions of the risk factors and the second stage is to maximize the pseudo likelihood function with the marginal distributions being fixed as the ones estimated in the first stage. In Section 4, some large sample properties of the pseudo maximum likelihood estimator are established and a bootstrap based procedure is proposed for estimating the variance-covariance matrix of the pseudo maximum likelihood estimators, which can be used to construct a confidence interval and a Wald test statistic for the interaction. In Sections 5 and 6, the proposed approach is illustrated via a simulation study and two real studies. Some final conclusions and remarks are given in Section 7 and the proofs are relegated to the appendices.
2. Model assumption
We consider two risk factors for the disease of interest. The risk factors could be either continuous or discrete. Let X and Y denote the risk factors in the continuous-continuous scenario, X∗ and Y in the discrete-continuous scenario, and X∗ and Y ∗ in the discrete-discrete scenario. In the following, we first focus on the continuous-continuous scenario, and then generalize the arguments to the other two scenarios.
A logistic regression model relating disease status D and the two risk factors X and Y is
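$$\Pr(D = 1 \mid X = x, Y = y) = \frac{\exp(\alpha^* + \beta x + \gamma y + \xi xy)}{1 + \exp(\alpha^* + \beta x + \gamma y + \xi xy)}, \tag{2.1}$$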
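where α∗ is the intercept, β and γ are the main effects, and ξ is the interaction effect. Let f0(x, y) denote the joint density function of (X, Y) in the control population. Then the joint density function f1(x, y) of (X, Y) in the case population can be written as an exponential tilting model (Qin and Zhang 1997):
$$f_1(x, y) = \exp(\alpha + \beta x + \gamma y + \xi xy)\, f_0(x, y), \tag{2.2}$$
where α = α∗ + log{Pr(D = 0)/Pr(D = 1)}. Because f1(x, y) is a density function satisfying $\iint f_1(x, y)\,dx\,dy = 1$, we have $\alpha = -\log \iint \exp(\beta x + \gamma y + \xi xy)\, f_0(x, y)\,dx\,dy$, so that
$$f_1(x, y) = \frac{\exp(\beta x + \gamma y + \xi xy)\, f_0(x, y)}{\iint \exp(\beta s + \gamma t + \xi st)\, f_0(s, t)\,ds\,dt}. \tag{2.3}$$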
Based on (2.3), the distribution of (X, Y ) in cases can be treated as a re-weighted version of their distribution in controls, with the weight given by exp(α + βx + γy + ξxy).
In the literature, it is often assumed that X and Y are independent in the general population or, equivalently for a rare disease, in the control population. Then, under the null hypothesis of no interaction effect (ξ = 0), X is also independent of Y in the case population. Therefore, to test for an interaction effect, we can test the independence of X and Y using the case data alone, for example with the Pearson chi-square test. This is the so-called case-only method.
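The rationale of the case-only method can be checked numerically for two binary risk factors. The following is a minimal sketch rather than the authors' code: the function names `tilt` and `odds_ratio` and all numerical values are illustrative assumptions. It reweights a control distribution by exp(βx + γy + ξxy), as in the exponential tilting model, and shows that the odds ratio between X and Y among cases equals the odds ratio among controls multiplied by exp(ξ).

```python
import math

def tilt(f0, beta, gamma, xi):
    """Reweight a control pmf {(x, y): prob} by exp(beta*x + gamma*y + xi*x*y)."""
    w = {(x, y): p * math.exp(beta * x + gamma * y + xi * x * y)
         for (x, y), p in f0.items()}
    total = sum(w.values())  # normalizing constant, equal to exp(-alpha)
    return {xy: v / total for xy, v in w.items()}

def odds_ratio(f):
    """Odds ratio between two binary factors with pmf f on {0, 1} x {0, 1}."""
    return f[(1, 1)] * f[(0, 0)] / (f[(1, 0)] * f[(0, 1)])

# Illustrative control distribution: X and Y independent,
# with Pr(X = 1) = 0.3 and Pr(Y = 1) = 0.4 (assumed values).
px, py = 0.3, 0.4
f0 = {(x, y): (px if x else 1 - px) * (py if y else 1 - py)
      for x in (0, 1) for y in (0, 1)}

f1_null = tilt(f0, beta=0.5, gamma=0.5, xi=0.0)   # no interaction
f1_alt = tilt(f0, beta=0.5, gamma=0.5, xi=0.25)   # interaction present

or_controls = odds_ratio(f0)
or_cases_null = odds_ratio(f1_null)
or_cases_alt = odds_ratio(f1_alt)
print(or_controls, or_cases_null, or_cases_alt)
```

With independent controls the case odds ratio departs from 1 only when ξ ≠ 0, which is what the case-only test exploits. If the controls are themselves dependent, the same identity shows that the case odds ratio differs from 1 even when ξ = 0, which is the source of the inflated type I error.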
If the independence assumption is in doubt, then a copula model is a natural choice for modeling the dependence. In the following, we will specify the joint density function f0(·,·) through a copula function.
In the continuous-continuous scenario, we assume that X and Y are two continuous random variables with the marginal cumulative distribution functions in the control population being FX(x) and FY (y), respectively. By Sklar’s theorem (Sklar 1959), we can assume that the joint cumulative distribution function of (X, Y ) is
$$F_0(x, y) = C(F_X(x), F_Y(y); \theta), \tag{2.4}$$
where C(u, v; θ) is a copula function known up to a parameter vector θ of finite dimension. The joint density function of (X, Y ) in the control population is
$$f_0(x, y) = c(F_X(x), F_Y(y); \theta)\, f_X(x)\, f_Y(y), \tag{2.5}$$
where c(u, v; θ) = ∂²C(u, v; θ)/∂u∂v, and fX(x) = ∂FX(x)/∂x and fY(y) = ∂FY(y)/∂y are respectively the marginal density functions of X and Y in the control population.
In the discrete-continuous scenario, we can adopt the approach of de Leon and Wu (2011) by assuming that there is a continuous random variable underlying the discrete risk factor. In detail, we assume that the discrete risk factor X∗ is an ordinal random variable taking a finite number of values, say, 0, 1, …, K, and Y is a continuous random variable with marginal cumulative distribution function FY(y); furthermore, we assume that there exist an underlying continuous random variable X with cumulative distribution function FX(x) and thresholds $-\infty = c_0 < c_1 < \cdots < c_K < c_{K+1} = \infty$ such that
$$X^* = k \quad \text{if } c_k < X \le c_{k+1}, \qquad k = 0, 1, \ldots, K. \tag{2.6}$$
Let the joint cumulative distribution function of (X, Y) be C(FX(x), FY(y); θ). Then the joint density function of (X∗, Y) in the control population is
$$f_0(k, y) = \{C_2(F_X(c_{k+1}), F_Y(y); \theta) - C_2(F_X(c_k), F_Y(y); \theta)\}\, f_Y(y) \tag{2.7}$$
for k = 0, 1, …, K and −∞ < y < ∞, where C2(u, v; θ) = ∂C(u, v; θ)/∂v and fY(y) = ∂FY(y)/∂y. The joint density function of (X∗, Y) in the case population is
$$f_1(k, y) = \frac{\exp(\beta k + \gamma y + \xi k y)\, f_0(k, y)}{\sum_{j=0}^{K} \int \exp(\beta j + \gamma t + \xi j t)\, f_0(j, t)\,dt}. \tag{2.8}$$
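In the discrete-discrete scenario, in addition to the above assumption on X∗, we further assume that Y∗ is an ordinal random variable taking a finite number of values, say, 0, 1, …, L, and that there exist an underlying continuous random variable Y with cumulative distribution function FY(y) and thresholds $-\infty = d_0 < d_1 < \cdots < d_L < d_{L+1} = \infty$ such that
$$Y^* = l \quad \text{if } d_l < Y \le d_{l+1}, \qquad l = 0, 1, \ldots, L. \tag{2.9}$$
Let the joint cumulative distribution function of (X, Y) be C(FX(x), FY(y); θ). Then the joint density function of (X∗, Y∗) in the control population is
$$f_0(k, l) = C(F_X(c_{k+1}), F_Y(d_{l+1}); \theta) - C(F_X(c_k), F_Y(d_{l+1}); \theta) - C(F_X(c_{k+1}), F_Y(d_l); \theta) + C(F_X(c_k), F_Y(d_l); \theta) \tag{2.10}$$
for (x∗, y∗) = (k, l), k = 0, 1, …, K and l = 0, 1, …, L. The joint density function of (X∗, Y∗) in the case population is
$$f_1(k, l) = \frac{\exp(\beta k + \gamma l + \xi k l)\, f_0(k, l)}{\sum_{j=0}^{K} \sum_{m=0}^{L} \exp(\beta j + \gamma m + \xi j m)\, f_0(j, m)}. \tag{2.11}$$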
Remark 1.
In the discrete-continuous scenario, the joint density functions $f_0(k, y)$ and $f_1(k, y)$ depend on the threshold values {c1, …, cK} only through {FX(c1), …, FX(cK)}, which can be estimated as described in the next section. Therefore, we do not need to estimate these threshold values. The same is true for the discrete-discrete scenario, because the joint density functions $f_0(k, l)$ and $f_1(k, l)$ depend on {c1, …, cK; d1, …, dL} only through the estimable probabilities {FX(c1), …, FX(cK); FY(d1), …, FY(dL)}.
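The point of Remark 1 can be illustrated with a small computation: the control-population cell probabilities of two ordinal factors are obtained from the copula evaluated at the marginal probabilities FX(ck) and FY(dl), without knowing the thresholds themselves. The sketch below rests on assumptions that are not part of the paper: it uses the Clayton copula only because it has a closed form, it assumes categories coded 0, …, K and 0, …, L, and the names `clayton_cdf` and `cell_probs` are hypothetical.

```python
def clayton_cdf(u, v, theta):
    """Clayton copula C(u, v; theta) for theta > 0, with C = 0 on the lower boundary."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def cell_probs(Fx_cuts, Fy_cuts, copula, theta):
    """Pr(X* = k, Y* = l) in controls from F_X(c_1..c_K) and F_Y(d_1..d_L).

    Each cell is the copula mass of a rectangle (inclusion-exclusion formula).
    """
    a = [0.0] + list(Fx_cuts) + [1.0]   # F_X at c_0, ..., c_{K+1}
    b = [0.0] + list(Fy_cuts) + [1.0]   # F_Y at d_0, ..., d_{L+1}
    return [[copula(a[k + 1], b[l + 1], theta) - copula(a[k], b[l + 1], theta)
             - copula(a[k + 1], b[l], theta) + copula(a[k], b[l], theta)
             for l in range(len(b) - 1)]
            for k in range(len(a) - 1)]

# Four equally likely categories for each factor (illustrative values).
probs = cell_probs([0.25, 0.5, 0.75], [0.25, 0.5, 0.75], clayton_cdf, theta=0.5)
total = sum(sum(row) for row in probs)
row_margins = [sum(row) for row in probs]
print(total, row_margins)
```

Only the marginal probabilities enter `cell_probs`, so any thresholds that produce the same values of FX(ck) and FY(dl) give the same likelihood contribution.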
3. Parameter estimation
The estimation of the regression coefficients β, γ, and ξ is complicated by the presence of the high-dimensional nuisance parameters FX and FY. We adopt a pseudo-likelihood based approach, with FX and FY being estimated in the first stage. The pseudo likelihood method has been well developed for some widely used models, for instance, parametric models with nuisance parameters (Gong and Samaniego 1981), the copula model for multivariate data under a cross-sectional design (Genest et al. 1995), and the bivariate survival model (Shih and Louis 1995).
The detailed parameter estimation procedures for the continuous-continuous scenario, the discrete-continuous scenario, and the discrete-discrete scenario are presented in the following three subsections, respectively.
3.1. Continuous-continuous scenario
Let the observed risk factors for cases and controls be respectively {(x1i, y1i), i = 1, ⋯ , n1}, and {(x0i, y0i), i = 1, ⋯ , n0}, and the pooled data be {(xi, yi), i = 1, ⋯ , n(= n1 + n0)}. The log likelihood function is
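$$\ell(\beta, \gamma, \xi, \theta, F_X, F_Y) = \sum_{i=1}^{n_0} \log f_0(x_{0i}, y_{0i}) + \sum_{i=1}^{n_1} \log f_1(x_{1i}, y_{1i}), \tag{3.1}$$
where f0(·,·) and f1(·,·) are defined in (2.5) and (2.3), respectively. It is difficult to maximize the above log likelihood function with respect to all the unknown parameters. Instead, we adopt a two-stage algorithm. The first stage is to estimate FX and FY without any constraint on the joint distribution of X and Y. Let the resulting estimators of FX and FY be $\hat F_X$ and $\hat F_Y$, respectively. The second stage is to maximize the log pseudo likelihood function $\ell(\beta, \gamma, \xi, \theta, \hat F_X, \hat F_Y)$ with respect to (β, γ, ξ, θ). Let the resulting pseudo-MLE be $(\hat\beta, \hat\gamma, \hat\xi, \hat\theta)$.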
We consider two types of estimates for FX and FY. One method is to estimate FX and FY with the (nonparametric) empirical distribution functions for control samples, which have some nice large sample properties but do not use the information observed in the case data. The other method is a semiparametric one, which utilizes both case data and control data and is intuitively more efficient. The idea is to estimate the marginal cumulative distribution functions based on empirical likelihood estimate at each observation. The detailed algorithm for the semiparametric method is described as follows.
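Step 1. Estimate the regression parameters (α, β, γ, ξ) without any constraint on the joint distribution of X and Y. That is, we maximize the following log profile likelihood function (Qin and Zhang 1997):
$$\ell_p(\alpha, \beta, \gamma, \xi) = \sum_{i=1}^{n_1} (\alpha + \beta x_{1i} + \gamma y_{1i} + \xi x_{1i} y_{1i}) - \sum_{i=1}^{n} \log\{1 + \rho \exp(\alpha + \beta x_i + \gamma y_i + \xi x_i y_i)\},$$
where ρ = n1/n0. Let the resulting estimator be $(\tilde\alpha, \tilde\beta, \tilde\gamma, \tilde\xi)$.
Step 2. Obtain an empirical maximum likelihood estimator of pi = Pr(xi, yi|control): $\hat p_i = n_0^{-1}\{1 + \rho \exp(\tilde\alpha + \tilde\beta x_i + \tilde\gamma y_i + \tilde\xi x_i y_i)\}^{-1}$.
Step 3. Estimate FX and FY with the weighted empirical distribution functions $\hat F_X(x) = \sum_{i=1}^{n} \hat p_i I(x_i \le x)$ and $\hat F_Y(y) = \sum_{i=1}^{n} \hat p_i I(y_i \le y)$, respectively.
Qin and Zhang (1997) showed that $\hat F_0(x, y) = \sum_{i=1}^{n} \hat p_i I(x_i \le x, y_i \le y)$ is consistent for F0(x, y). Therefore, $\hat F_X(\cdot)$ and $\hat F_Y(\cdot)$ are consistent for FX(·) and FY(·), respectively, and the resulting log pseudo likelihood function can be used for estimating and testing the interaction effect.
The MLE $(\tilde\alpha, \tilde\beta, \tilde\gamma, \tilde\xi)$ and the pseudo-MLE $(\hat\beta, \hat\gamma, \hat\xi, \hat\theta)$ can be easily obtained by existing algorithms, such as the Newton-Raphson algorithm and general nonlinear optimization algorithms.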
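The first-stage semiparametric estimate of the marginal distributions can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the unconstrained logistic fit is assumed to be supplied by the caller, the weight formula is the empirical likelihood form of Qin and Zhang (1997) with ρ = n1/n0, and the names `control_weights` and `weighted_ecdf` as well as the toy data are hypothetical.

```python
import math

def control_weights(pooled, coef, n0, n1):
    """Empirical-likelihood weights p_i estimating Pr(x_i, y_i | control).

    pooled: list of (x, y) for all n = n0 + n1 subjects.
    coef: fitted (alpha, beta, gamma, xi) from the unconstrained logistic
          fit, assumed to be computed elsewhere.
    """
    alpha, beta, gamma, xi = coef
    rho = n1 / n0
    return [1.0 / (n0 * (1.0 + rho * math.exp(alpha + beta * x + gamma * y + xi * x * y)))
            for x, y in pooled]

def weighted_ecdf(values, weights):
    """Return the step function F(t) = sum_i weights[i] * 1{values[i] <= t}."""
    def F(t):
        return sum(w for v, w in zip(values, weights) if v <= t)
    return F

# Toy pooled sample (illustrative numbers only): 3 controls and 3 cases.
pooled = [(0.1, 1.2), (-0.4, 0.3), (0.9, -0.5), (1.5, 0.8), (0.2, 2.0), (1.1, 1.4)]
n0, n1 = 3, 3
# With all coefficients zero and n0 = n1, every weight reduces to 1/n.
p = control_weights(pooled, coef=(0.0, 0.0, 0.0, 0.0), n0=n0, n1=n1)
F_X = weighted_ecdf([x for x, _ in pooled], p)
F_Y = weighted_ecdf([y for _, y in pooled], p)
print(sum(p), F_X(0.5), F_Y(1.0))
```

At the actual maximizer of the profile likelihood the weights sum to one; for arbitrary coefficients they need not, so a caller may wish to check the sum before using the resulting distribution functions.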
3.2. Discrete-continuous scenario
Let the observed risk factors for cases and controls be $\{(x^*_{1i}, y_{1i}), i = 1, \ldots, n_1\}$ and $\{(x^*_{0i}, y_{0i}), i = 1, \ldots, n_0\}$, respectively, and the pooled data be $\{(x^*_i, y_i), i = 1, \ldots, n\}$. The log likelihood function is
$$\ell(\beta, \gamma, \xi, \theta, F_X, F_Y) = \sum_{i=1}^{n_0} \log f_0(x^*_{0i}, y_{0i}) + \sum_{i=1}^{n_1} \log f_1(x^*_{1i}, y_{1i}), \tag{3.2}$$
where f0(·,·) is defined in (2.7) and f1(·,·) in (2.8). The log likelihood function (3.2) depends on FX only through the K values FX(c1), …, FX(cK), but it also depends on the high-dimensional parameter FY, making it difficult to maximize directly. One may adopt a two-stage approach: in the first stage one estimates FY with $\hat F_Y$ defined in the last subsection, and in the second stage one maximizes $\ell(\beta, \gamma, \xi, \theta, F_X, \hat F_Y)$ with respect to (β, γ, ξ, θ, FX). However, our preliminary numerical study shows that this two-stage approach is numerically unstable because FX (at c1, …, cK) is estimated in the second stage. Instead, in the second stage we fix FX at the estimator $\hat F_X$ (evaluated at c1, …, cK) obtained in the first stage, as in the continuous-continuous scenario. That is, in the second stage we maximize the log pseudo likelihood function $\ell(\beta, \gamma, \xi, \theta, \hat F_X, \hat F_Y)$ with respect to (β, γ, ξ, θ), and let the resulting pseudo-MLE be $(\hat\beta, \hat\gamma, \hat\xi, \hat\theta)$. Here, the estimate $\hat F_X(c_k)$ is given by the value of $\hat F_{X^*}$ at the category boundary corresponding to ck, and the latter is obtained as in the continuous-continuous scenario by treating FX∗ as a continuous distribution function, which depends on k instead of the threshold value ck.
3.3. Discrete-discrete scenario
Let the observed risk factors for cases and controls be respectively $\{(x^*_{1i}, y^*_{1i}), i = 1, \ldots, n_1\}$ and $\{(x^*_{0i}, y^*_{0i}), i = 1, \ldots, n_0\}$, and the pooled data be $\{(x^*_i, y^*_i), i = 1, \ldots, n\}$. The log likelihood function is
$$\ell(\beta, \gamma, \xi, \theta, F_X, F_Y) = \sum_{i=1}^{n_0} \log f_0(x^*_{0i}, y^*_{0i}) + \sum_{i=1}^{n_1} \log f_1(x^*_{1i}, y^*_{1i}), \tag{3.3}$$
where f0(·,·) is defined in (2.10) and f1(·,·) in (2.11). The log likelihood function (3.3) depends on FX only through the K values FX(c1), …, FX(cK), and on FY only through the L values FY(d1), …, FY(dL). Therefore, in principle one can maximize (3.3) with respect to all unknown parameters directly. Because this maximization can be difficult in practice, we again adopt a two-stage approach. That is, in the first stage we obtain the estimators $\hat F_X$ (at c1, …, cK) and $\hat F_Y$ (at d1, …, dL) of FX and FY, respectively, based on the prospective likelihood function as in the continuous-continuous scenario, and in the second stage we maximize the log pseudo likelihood function $\ell(\beta, \gamma, \xi, \theta, \hat F_X, \hat F_Y)$ with respect to (β, γ, ξ, θ). Let the resulting pseudo-MLE be $(\hat\beta, \hat\gamma, \hat\xi, \hat\theta)$.
4. Large sample properties and testing procedure
In this section, we focus on the continuous-continuous scenario since the other two scenarios can be addressed similarly.
In Supplementary Material S1, under some regularity conditions on F0X, F0Y, and C(x, y; θ), we show that i) there exists a local maximizer of the pseudo likelihood function that is consistent for the true value of (β, γ, ξ, θ) and ii) this pseudo-MLE, centered at the true value and scaled by √n, is asymptotically normally distributed with mean 0 and a variance-covariance matrix given in the appendix. By virtue of the asymptotic normality, we can construct a confidence interval for ξ and a Wald statistic for the significance test of H0 : ξ = 0, provided that an estimate of the variance-covariance matrix of the pseudo-MLE can be obtained. The (1 − α) × 100% confidence limits of ξ are $\hat\xi \pm z_{1-\alpha/2}\,\widehat{\mathrm{se}}(\hat\xi)$ and the Wald test statistic takes the form $\hat\xi / \widehat{\mathrm{se}}(\hat\xi)$, where α here denotes the significance level rather than the intercept in (2.2), $z_{1-\alpha/2}$ is the upper α/2-quantile of the standard normal distribution, and $\widehat{\mathrm{se}}(\hat\xi)$ is an estimated standard error of $\hat\xi$.
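The Wald interval and test described above amount to a few lines of arithmetic. The sketch below is illustrative only: the function name `wald_inference` is an assumption, and the inputs are the rounded estimate and standard error of ξ reported later for the lung cancer example (0.238 and 0.088), so the resulting statistic differs slightly from the one computed by the authors from unrounded values.

```python
from statistics import NormalDist

def wald_inference(estimate, se, level=0.95):
    """Wald z statistic, two-sided p-value, and confidence interval."""
    std_normal = NormalDist()
    # Upper (1 - level)/2 quantile of the standard normal distribution.
    z_crit = std_normal.inv_cdf(1.0 - (1.0 - level) / 2.0)
    z = estimate / se
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return z, p_value, (estimate - z_crit * se, estimate + z_crit * se)

z, p_value, ci = wald_inference(0.238, 0.088)
print(f"z = {z:.3f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The null hypothesis ξ = 0 is rejected at level 0.05 exactly when the 95% interval excludes zero, so the test and the interval always agree.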
From Supplementary Material S1 we can see that the estimation of the variance-covariance matrix of the pseudo-MLE is quite complicated, and the matrix can be even more complicated if FX and FY are estimated by the algorithm in Subsection 3.1. Here, we consider a computationally simpler bootstrap strategy instead. We consider two versions of bootstrap methods. One is the nonparametric bootstrap, which separately resamples case data and control data with replacement. The other is the semiparametric bootstrap, which is based on the empirical distributions for the case group and the control group estimated from the retrospective likelihood (Qin and Zhang 1997). In the semiparametric bootstrap method, each subject in the pooled sample can be sampled for both cases and controls. Therefore, this method fully uses all samples, and it is more suitable than the nonparametric bootstrap method when n0 or n1 is relatively small. The detailed algorithm of the semiparametric bootstrap method is as follows.
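Step 1. From (xi, yi), i = 1, …, n, randomly generate n0 risk factors $(x^{b}_{0i}, y^{b}_{0i})$, i = 1, …, n0, for controls, with weight $\hat p_i$ for (xi, yi), and n1 risk factors $(x^{b}_{1i}, y^{b}_{1i})$, i = 1, …, n1, for cases, with weight $\hat p_i \exp(\tilde\alpha + \tilde\beta x_i + \tilde\gamma y_i + \tilde\xi x_i y_i)$ for (xi, yi). Here $(\tilde\alpha, \tilde\beta, \tilde\gamma, \tilde\xi)$ is the maximum likelihood estimator of (α, β, γ, ξ) without constraint and $\hat p_i$ is the corresponding estimate of Pr(xi, yi|control), both as defined in Subsection 3.1.
Step 2. Obtain the pseudo maximum likelihood estimator using the resampled data $\{(x^{b}_{0i}, y^{b}_{0i}), i = 1, \ldots, n_0\}$ and $\{(x^{b}_{1i}, y^{b}_{1i}), i = 1, \ldots, n_1\}$.
Step 3. Repeat Steps 1 and 2 B times to obtain B copies of the pseudo maximum likelihood estimator, say $(\hat\beta^{(b)}, \hat\gamma^{(b)}, \hat\xi^{(b)}, \hat\theta^{(b)})$, b = 1, …, B.
Step 4. Estimate the variance-covariance matrix of the pseudo maximum likelihood estimator using the sample variance-covariance matrix of $(\hat\beta^{(b)}, \hat\gamma^{(b)}, \hat\xi^{(b)}, \hat\theta^{(b)})$, b = 1, …, B.
Because $\hat\xi$ is asymptotically normal, a moderate number of resamplings can generate a good approximation of the variance of $\hat\xi$, as will be shown in the subsequent simulation study with B = 200.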
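The resampling scheme can be sketched as below. This is a hedged illustration, not the authors' code: the control weights and unconstrained coefficients are assumed to come from the first-stage fit, the estimator passed to `bootstrap_se` is a simple placeholder statistic standing in for the pseudo-MLE of ξ, and all function names and toy inputs are hypothetical.

```python
import math
import random
import statistics

def semiparametric_resample(pooled, p_hat, coef, n0, n1, rng):
    """Draw bootstrap controls and cases from the pooled sample.

    Controls are drawn with weights p_hat[i]; cases with weights
    p_hat[i] * exp(alpha + beta*x + gamma*y + xi*x*y).
    """
    alpha, beta, gamma, xi = coef
    case_w = [p * math.exp(alpha + beta * x + gamma * y + xi * x * y)
              for (x, y), p in zip(pooled, p_hat)]
    controls = rng.choices(pooled, weights=p_hat, k=n0)
    cases = rng.choices(pooled, weights=case_w, k=n1)
    return controls, cases

def bootstrap_se(pooled, p_hat, coef, n0, n1, estimator, B=200, seed=0):
    """Standard deviation of estimator(controls, cases) over B resamples."""
    rng = random.Random(seed)
    reps = [estimator(*semiparametric_resample(pooled, p_hat, coef, n0, n1, rng))
            for _ in range(B)]
    return statistics.stdev(reps)

def mean_x_difference(controls, cases):
    """Placeholder statistic; a real analysis would refit the pseudo-MLE here."""
    return (statistics.fmean([x for x, _ in cases])
            - statistics.fmean([x for x, _ in controls]))

# Toy inputs (assumed values, not from any fitted model).
demo_rng = random.Random(1)
pooled = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(100)]
p_hat = [1.0 / len(pooled)] * len(pooled)
coef = (0.0, 0.5, 0.5, 0.0)
se = bootstrap_se(pooled, p_hat, coef, n0=50, n1=50, estimator=mean_x_difference)
print(se)
```

Because every pooled subject can be drawn into either group, the scheme uses the full sample for both cases and controls, which is the property the text highlights for small n0 or n1.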
5. A simulation study
In the simulation study, we considered the following Gaussian copula model (Li 2000):
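$$C(u, v; \theta) = \Phi_\theta(\Phi^{-1}(u), \Phi^{-1}(v)), \tag{5.1}$$
where Φ⁻¹(·) is the inverse of the standard normal distribution function Φ(·) and Φθ(x, y) is the joint distribution function of the bivariate normal distribution with means 0, variances 1, and correlation coefficient θ. We have
$$c(u, v; \theta) = \frac{\phi_\theta(\Phi^{-1}(u), \Phi^{-1}(v))}{\phi(\Phi^{-1}(u))\, \phi(\Phi^{-1}(v))}, \qquad C_2(u, v; \theta) = \Phi\left(\frac{\Phi^{-1}(u) - \theta\, \Phi^{-1}(v)}{\sqrt{1 - \theta^2}}\right), \tag{5.2}$$
where ϕ(x) is the standard normal density function and ϕθ(x, y) = ∂²Φθ(x, y)/∂x∂y. The first equation in (5.2) immediately follows from the definition of the joint density function and the derivation of the second equation is given in Supplementary Material S2.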
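The two Gaussian copula quantities needed by the pseudo likelihood, the copula density c(u, v; θ) and the partial derivative C2(u, v; θ) = ∂C(u, v; θ)/∂v, can be evaluated with the Python standard library alone. The following sketch uses the standard closed forms for the bivariate normal distribution; the function names are illustrative assumptions.

```python
import math
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution

def gaussian_copula_density(u, v, theta):
    """c(u, v; theta): bivariate normal density over the product of marginals."""
    x, y = _N.inv_cdf(u), _N.inv_cdf(v)
    det = 1.0 - theta * theta
    # Bivariate normal density with zero means, unit variances, correlation theta.
    phi2 = (math.exp(-(x * x - 2.0 * theta * x * y + y * y) / (2.0 * det))
            / (2.0 * math.pi * math.sqrt(det)))
    return phi2 / (_N.pdf(x) * _N.pdf(y))

def gaussian_copula_C2(u, v, theta):
    """C2(u, v; theta) = dC/dv, the conditional probability Pr(U <= u | V = v)."""
    x, y = _N.inv_cdf(u), _N.inv_cdf(v)
    return _N.cdf((x - theta * y) / math.sqrt(1.0 - theta * theta))

# Consistency check: differentiating C2 with respect to u recovers the density.
u, v, theta, h = 0.3, 0.6, 0.2, 1e-5
numeric = (gaussian_copula_C2(u + h, v, theta) - gaussian_copula_C2(u - h, v, theta)) / (2 * h)
exact = gaussian_copula_density(u, v, theta)
print(numeric, exact)
```

When θ = 0 the density is identically 1 and C2(u, v; 0) = u, which corresponds to independent risk factors in the control population.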
The risk factors (X, Y) in the control population were generated from the bivariate normal distribution with joint density function ϕθ(x, y) so that the copula function for (X, Y) was C(x, y; θ). For the discrete-continuous and discrete-discrete scenarios, the probability function of X∗ was Pr(X∗ = k) = 1/4, k = 0, 1, …, 3, and for the discrete-discrete scenario, the probability function of Y∗ was the same as that of X∗.
We considered three values for the correlation parameter θ, namely 0, 0.2, and −0.2. The main effects were fixed at β = γ = 0.5. The interaction effect ξ was either 0 (for all three scenarios) or 0.25 (for the continuous-continuous scenario) or 0.5 (for the discrete-continuous and discrete-discrete scenarios). For each combination of θ and ξ, we generated 10⁶ risk factors {(Xi, Yi), i = 1, …, 10⁶} for controls from the joint density function ϕθ(x, y). For the discrete-continuous and discrete-discrete scenarios, we generated copies of X∗ according to formula (2.6); for the discrete-discrete scenario, we further generated copies of Y∗ according to formula (2.9). We used a biased sampling technique (Cochran 1977; Nair and Wang 1989) to generate the case data. For instance, in the discrete-continuous scenario, we generated risk factors for cases from the following distribution:
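$$\Pr\{(X^*, Y) = (X^*_i, Y_i)\} = \frac{\exp(\beta X^*_i + \gamma Y_i + \xi X^*_i Y_i)}{\sum_{j=1}^{10^6} \exp(\beta X^*_j + \gamma Y_j + \xi X^*_j Y_j)}, \qquad i = 1, \ldots, 10^6. \tag{5.3}$$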
From the 10⁶ observations generated above, we randomly sampled 200 observations for controls and randomly sampled 200 observations for cases with their weights given by (5.3). We then applied the proposed approach, the standard logistic regression method, and the case-only method (Pearson chi-square test for testing independence using the case data) to each simulated dataset. The simulation results were based on 2000 generated datasets. In the proposed approach, we estimated the standard errors of the pseudo-MLEs based on 200 semiparametric bootstrap samples.
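The data-generating scheme for the continuous-continuous scenario can be sketched as follows. This is an illustration under assumptions that differ from the paper in labeled ways: the pool is much smaller than 10⁶ so that the example runs quickly, cases are drawn with replacement, and the function name `simulate_case_control` is hypothetical.

```python
import math
import random

def simulate_case_control(theta, beta, gamma, xi,
                          pool_size=20000, n0=200, n1=200, seed=0):
    """Generate one case-control dataset by biased sampling from a control pool."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - theta * theta)
    pool = []
    for _ in range(pool_size):
        # Bivariate normal pair with unit variances and correlation theta.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pool.append((z1, theta * z1 + s * z2))
    # Case sampling weights proportional to exp(beta*x + gamma*y + xi*x*y).
    weights = [math.exp(beta * x + gamma * y + xi * x * y) for x, y in pool]
    controls = rng.sample(pool, n0)                   # simple random sample
    cases = rng.choices(pool, weights=weights, k=n1)  # weighted, with replacement
    return controls, cases

controls, cases = simulate_case_control(theta=0.2, beta=0.5, gamma=0.5, xi=0.25,
                                        n0=2000, n1=2000)
mean_x_controls = sum(x for x, _ in controls) / len(controls)
mean_x_cases = sum(x for x, _ in cases) / len(cases)
print(mean_x_controls, mean_x_cases)
```

With positive main effects the case sample is shifted toward larger values of both risk factors, which is visible in the difference between the two printed means.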
For the proposed approach, the average value of the pseudo-MLE $\hat\xi$ minus the true ξ (Bias), the empirical standard error (SE) of $\hat\xi$, the mean estimated standard error (SEE) of $\hat\xi$, and the coverage probability (CP) of the 95% confidence interval of ξ are reported in Tables 1, 2, and 3 for the continuous-continuous scenario, the discrete-continuous scenario, and the discrete-discrete scenario, respectively. The type I error rates and powers at the 0.05 level for the proposed approach (Wald1), the standard logistic regression method (Wald2), and the case-only method (Pearson) are also reported in Tables 1–3.
Table 1.
Interaction estimate/test results in continuous-continuous scenario
| ξ | θ | Proposed: Bias | Proposed: SE | Proposed: SEE | Proposed: CP | Proposed: Wald1 | Logistic: Wald2 | Case-only: Pearson |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.004 | 0.104 | 0.102 | 0.938 | 0.064 | 0.050 | 0.057 |
| 0 | 0.2 | 0.032 | 0.105 | 0.106 | 0.943 | 0.056 | 0.048 | 0.826 |
| 0 | −0.2 | −0.018 | 0.103 | 0.103 | 0.947 | 0.061 | 0.054 | 0.816 |
| 0.25 | 0 | 0.027 | 0.109 | 0.108 | 0.958 | 0.754 | 0.578 | 0.945 |
| 0.25 | 0.2 | 0.070 | 0.111 | 0.116 | 0.892 | 0.835 | 0.556 | 1.000 |
| 0.25 | −0.2 | −0.006 | 0.107 | 0.105 | 0.937 | 0.636 | 0.571 | 0.092 |
ξ, the interaction effect; θ, the correlation coefficient between the two risk factors; Bias, mean estimated ξ minus the true ξ; SE, empirical standard error of estimated ξ; SEE, mean estimated standard error of estimated ξ; CP, coverage probability of the 95% confidence interval; Wald1 and Wald2, Wald tests; Pearson, Pearson chi-square test.
Table 2.
Interaction estimate/test results in discrete-continuous scenario
| ξ | θ | Proposed: Bias | Proposed: SE | Proposed: SEE | Proposed: CP | Proposed: Wald1 | Logistic: Wald2 | Case-only: Pearson |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.005 | 0.096 | 0.096 | 0.949 | 0.051 | 0.054 | 0.048 |
| 0 | 0.2 | 0.010 | 0.100 | 0.100 | 0.948 | 0.055 | 0.049 | 0.645 |
| 0 | −0.2 | −0.006 | 0.097 | 0.097 | 0.962 | 0.052 | 0.052 | 0.698 |
| 0.5 | 0 | 0.049 | 0.164 | 0.156 | 0.937 | 0.936 | 0.802 | 0.997 |
| 0.5 | 0.2 | 0.063 | 0.162 | 0.164 | 0.938 | 0.925 | 0.722 | 1.000 |
| 0.5 | −0.2 | 0.009 | 0.146 | 0.140 | 0.942 | 0.961 | 0.883 | 0.951 |
ξ, the interaction effect; θ, the correlation coefficient between the underlying risk factors; Bias, mean estimated ξ minus the true ξ; SE, empirical standard error of estimated ξ; SEE, mean estimated standard error of estimated ξ; CP, coverage probability of the 95% confidence interval; Wald1 and Wald2, Wald tests; Pearson, Pearson chi-square test.
Table 3.
Interaction estimate/test results in discrete-discrete scenario
| ξ | θ | Proposed: Bias | Proposed: SE | Proposed: SEE | Proposed: CP | Proposed: Wald1 | Logistic: Wald2 | Case-only: Pearson |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | −0.002 | 0.093 | 0.092 | 0.942 | 0.050 | 0.048 | 0.046 |
| 0 | 0.2 | −0.005 | 0.093 | 0.094 | 0.965 | 0.048 | 0.056 | 0.479 |
| 0 | −0.2 | −0.005 | 0.089 | 0.091 | 0.953 | 0.050 | 0.050 | 0.550 |
| 0.5 | 0 | −0.053 | 0.254 | 0.241 | 0.902 | 0.469 | 0.437 | 0.543 |
| 0.5 | 0.2 | −0.045 | 0.276 | 0.256 | 0.884 | 0.441 | 0.414 | 0.995 |
| 0.5 | −0.2 | −0.038 | 0.217 | 0.223 | 0.921 | 0.549 | 0.511 | 0.920 |
ξ, the interaction effect; θ, the correlation coefficient between the underlying risk factors; Bias, mean estimated ξ minus the true ξ; SE, empirical standard error of estimated ξ; SEE, mean estimated standard error of estimated ξ; CP, coverage probability of the 95% confidence interval; Wald1 and Wald2, Wald tests; Pearson, Pearson chi-square test.
In most situations, the biases are small, the estimated standard errors are close to the empirical standard errors, and the coverage probabilities are close to the nominal level of 95%. Only in a few situations are the absolute biases greater than 0.05. Both the proposed approach and the logistic regression method have well-controlled type I error rates (ξ = 0). The type I error rates of the case-only method are also under control when the risk factors X and Y are indeed independent (θ = 0), but they are dramatically inflated when X and Y are correlated (θ ≠ 0). Under the alternative hypothesis (ξ ≠ 0), the proposed test is uniformly more powerful than the logistic regression method, with relative power gains ranging from 11% to 50% in the continuous-continuous scenario, from 9% to 28% in the discrete-continuous scenario, and from 6.5% to 7.4% in the discrete-discrete scenario. When the independence assumption is met, the power of the proposed approach is lower than that of the case-only method, which is the most powerful test under the independence assumption. However, when the independence assumption is not met, the case-only method could lose power substantially. For instance, in the continuous-continuous scenario with ξ = 0.25 and θ = −0.2, the power of the case-only method is 0.092, which is dramatically lower than the 0.636 of the proposed approach.
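The relative power gains quoted above can be recomputed directly from the Wald1 and Wald2 columns of Tables 1–3 for the rows with ξ ≠ 0. The short script below is a convenience sketch; the dictionary layout and the function name `relative_gains` are illustrative choices, while the numbers are copied from the tables.

```python
# (Wald1 power, Wald2 power) for theta = 0, 0.2, -0.2 under the alternative.
power = {
    "continuous-continuous": [(0.754, 0.578), (0.835, 0.556), (0.636, 0.571)],
    "discrete-continuous": [(0.936, 0.802), (0.925, 0.722), (0.961, 0.883)],
    "discrete-discrete": [(0.469, 0.437), (0.441, 0.414), (0.549, 0.511)],
}

def relative_gains(pairs):
    """Percentage gain in power of the proposed test over logistic regression."""
    return [100.0 * (proposed / logistic - 1.0) for proposed, logistic in pairs]

gain_ranges = {name: (min(relative_gains(pairs)), max(relative_gains(pairs)))
               for name, pairs in power.items()}
for name, (low, high) in gain_ranges.items():
    print(f"{name}: {low:.1f}% to {high:.1f}%")
```

Expressing the gain relative to the logistic regression power makes the three scenarios comparable even though their absolute power levels differ considerably.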
We also considered different marginal distributions for the two risk factors, with one being normal and the other uniform. The results were similar to those in Tables 1–3 and are not presented here.
Finally, we conducted a sensitivity analysis by misspecifying the copula function in the continuous-continuous scenario. We considered three true copula functions, namely, the Clayton, Frank, and t (with 10 degrees of freedom) copula functions. The parameters characterizing the copula functions were chosen such that the correlation coefficients were around 0.24, and the marginal distributions of the risk factors were again standard normal. The other settings were the same as those for Table 1. The corresponding results are presented in Table 4. When the true copula function is Clayton or Frank, the proposed approach produces minor bias in the estimates and well-controlled type I error rates and coverage probabilities. When the true copula function is t, the proposed approach can produce relatively larger biases, inflated type I error rates, and poorer coverage probabilities.
Table 4.
Interaction estimate/test results for non-Gaussian copula functions
| Copula | θ | ξ | Proposed: Bias | Proposed: SE | Proposed: SEE | Proposed: CP | Proposed: Wald1 | Logistic: Wald2 | Case-only: Pearson |
|---|---|---|---|---|---|---|---|---|---|
| Clayton | 0.23 | 0 | −0.011 | 0.103 | 0.104 | 0.947 | 0.053 | 0.052 | 0.073 |
| Clayton | 0.23 | 0.25 | 0.002 | 0.111 | 0.106 | 0.933 | 0.670 | 0.583 | 0.976 |
| Frank | 0.24 | 0 | −0.001 | 0.104 | 0.102 | 0.931 | 0.069 | 0.055 | 0.088 |
| Frank | 0.24 | 0.25 | 0.012 | 0.103 | 0.106 | 0.950 | 0.711 | 0.566 | 0.982 |
| t (10 df) | 0.24 | 0 | 0.084 | 0.115 | 0.113 | 0.889 | 0.112 | 0.048 | 0.344 |
| t (10 df) | 0.24 | 0.25 | 0.239 | 0.134 | 0.115 | 0.533 | 0.982 | 0.628 | 1.000 |
Copula, true copula function; Proposed, the proposed approach with the copula function specified to be Gaussian; ξ, the interaction effect; θ, the correlation coefficient between the two risk factors; Bias, mean estimated ξ minus the true ξ; SE, empirical standard error of estimated ξ; SEE, mean estimated standard error of estimated ξ; CP, coverage probability of the 95% confidence interval; Wald1 and Wald2, Wald tests; Pearson, Pearson chi-square test.
6. Real data applications
6.1. Prostate cancer example
The first dataset was from a nested case-control study (Ahn et al. 2008) within the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO), where cases and controls were frequency matched by age at cohort entry, time since initial screening, and calendar year of cohort entry. In this study, the effects of several risk factors on prostate cancer were examined. We considered two continuous risk factors: the vitamin D level [25(OH)D concentration] and body mass index (BMI). After removing individuals with extreme 25(OH)D concentrations, 749 case patients and 781 control subjects were kept. The vitamin D measure determined by 25(OH)D concentrations (nmol/L) strongly depended on the season when the blood was drawn, so we removed this seasonal variation pattern using locally weighted scatterplot smoothing (Cleveland et al. 1991). Let B and V denote the normalized values of BMI and the adjusted 25(OH)D concentrations with the seasonal pattern removed, respectively. We modeled the relationship between the disease status D and (B, V) by the following logistic regression model:
$$\Pr(D = 1 \mid B, V) = \frac{\exp(\alpha^* + \beta B + \gamma V + \xi BV)}{1 + \exp(\alpha^* + \beta B + \gamma V + \xi BV)}.$$
To test the null hypothesis of no interaction, H0 : ξ = 0, we applied the standard logistic regression method, the case-only method, and the proposed copula-model based approach to the data. In the proposed approach, we assumed that the copula characterizing the joint distribution was the Gaussian copula (5.1) with correlation coefficient parameter θ, and the standard errors of the pseudo-MLEs were obtained with 1000 semiparametric bootstrap samples. The pseudo-MLE of θ was −0.207 (p-value = 1.8 × 10⁻¹⁰), showing a very significant negative correlation between BMI and vitamin D level. The resulting estimates of ξ from the proposed approach and the standard logistic regression method were 5.8 × 10⁻³ (p-value = 0.916) and 4.2 × 10⁻² (p-value = 0.443), respectively, both of which indicated the absence of an interaction. On the other hand, the case-only method gave a p-value of 4.2 × 10⁻⁸, indicating a highly statistically significant but most likely false positive finding of interaction, as the independence assumption was clearly violated. When the age at cohort entry, the time since initial screening, and the calendar year of cohort entry were further adjusted for in the standard logistic regression, we obtained a very similar result, with the interaction effect being 4.4 × 10⁻² (p-value = 0.422).
6.2. Lung cancer example
In recent genome-wide association studies (GWAS), a few chromosome regions (e.g., chromosomes 15q25, 5p15, and 6p21) have been identified as associated with lung cancer (Hung et al. 2008; Amos et al. 2008; Thorgeirsson et al. 2008; McKay et al. 2008; Wang et al. 2008; Rafnar et al. 2009; and Landi et al. 2009). Among these chromosome regions, the chromosome 15q25 region was shown to be associated with both lung cancer and smoking behavior. Therefore, it would be of great interest to test whether there is any interaction between the genetic variants in the 15q25 region and smoking on the risk of lung cancer. We used the data from the Environment and Genetics in Lung Cancer Etiology Study (EAGLE; Landi et al. 2009), and focused on the genotypes of 39 relatively common single-nucleotide polymorphisms (SNPs) within the 15q25 region and smoking intensity, measured by the average number of packs of cigarettes per day (CPD). The numbers of individuals were 460 for CPD < 0.5, 965 for 0.5 ≤ CPD < 1, 1393 for 1 ≤ CPD < 2, and 256 for CPD ≥ 2, respectively. We evaluated the interaction between CPD and each of the 39 SNPs.
We modeled the relationship between the lung cancer status D and any of the genetic variants (coded by G, the number of minor alleles) and the CPD measure C by the following logistic regression model:
$$\Pr(D = 1 \mid G, C) = \frac{\exp(\alpha^* + \beta G + \gamma C + \xi GC)}{1 + \exp(\alpha^* + \beta G + \gamma C + \xi GC)}.$$
In the proposed approach, we assumed a Gaussian copula (5.1) for the joint distribution of CPD and the continuous variable underlying the SNP genotype.
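To make the Gaussian copula assumption concrete, the following is a minimal sketch of the standard bivariate Gaussian copula density with correlation parameter θ. The function name `gaussian_copula_density` and its argument names are illustrative assumptions, and the parameterization is the textbook one, which may differ in notation from (5.1).

```python
from math import exp, sqrt
from statistics import NormalDist


def gaussian_copula_density(u, v, theta):
    """Density of the bivariate Gaussian copula at (u, v).

    u, v  : marginal CDF values, each strictly between 0 and 1
    theta : correlation parameter, strictly between -1 and 1
    (Names are hypothetical; the formula is the textbook Gaussian copula.)
    """
    if not (0.0 < u < 1.0 and 0.0 < v < 1.0):
        raise ValueError("u and v must lie strictly between 0 and 1")
    if not -1.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between -1 and 1")

    std_normal = NormalDist()      # standard normal, mean 0 and sd 1
    a = std_normal.inv_cdf(u)      # normal score of u
    b = std_normal.inv_cdf(v)      # normal score of v

    one_minus_t2 = 1.0 - theta ** 2
    # Ratio of the bivariate normal density to the product of its margins.
    exponent = -(theta ** 2 * (a * a + b * b) - 2.0 * theta * a * b) / (2.0 * one_minus_t2)
    return exp(exponent) / sqrt(one_minus_t2)


if __name__ == "__main__":
    # With theta = 0 the copula reduces to independence (density 1 everywhere).
    print(gaussian_copula_density(0.3, 0.8, 0.0))
    # A negative theta down-weights concordant pairs such as (0.9, 0.9).
    print(gaussian_copula_density(0.9, 0.9, -0.207))
```

Setting θ = 0 recovers the independence assumption behind the case-only test, so the case-only setting can be viewed as a special case of this copula family.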
For our analysis, we only considered those subjects with a smoking history, and focused on the SNP rs12913946, which had the most significant interaction effect with CPD from the standard logistic regression analysis (p-value = 0.042) and the case-only method (p-value = 0.011). This left 1738 lung cancer cases and 1336 controls. We applied the proposed approach to study this interaction, with standard errors of the pseudo-MLEs obtained from 1000 semiparametric bootstrap samples. The pseudo-MLE of the interaction effect ξ was 0.238 (standard error = 0.088) with a two-sided p-value of 6.5 × 10⁻³, which was smaller than the one obtained by the standard logistic regression. More detailed results are summarized in Table 5. Clearly, further investigations are needed to validate this interaction.
Table 5.
Analysis results for SNP rs12913946 in lung cancer data set
| Parameter | Estimate | SE | Z-value | P-value |
|---|---|---|---|---|
| θ | −0.036 | 0.024 | −1.503 | 0.137 |
| β | −0.239 | 0.112 | −2.143 | 0.033 |
| γ | 1.183 | 0.094 | 12.557 | 3.65 × 10⁻³⁶ |
| ξ | 0.238 | 0.088 | 2.722 | 6.5 × 10⁻³ |
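The Z-values and two-sided p-values in Table 5 appear to follow the usual Wald construction (estimate divided by its standard error, referred to a standard normal distribution). The sketch below illustrates that calculation under this assumption; the function name `wald_test` is hypothetical and not taken from the paper.

```python
from math import erfc, sqrt


def wald_test(estimate, se):
    """Return (z, two-sided p-value) for a Wald test of 'parameter = 0'.

    Assumes the estimator is approximately normal with standard error `se`.
    """
    if se <= 0:
        raise ValueError("standard error must be positive")
    z = estimate / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Rounded interaction estimate and standard error for xi from Table 5.
    z, p = wald_test(0.238, 0.088)
    print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

Applied to the rounded table entries, this gives a Z-value close to, but not exactly equal to, the tabulated 2.722; the small difference is presumably due to rounding of the reported estimate and standard error.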
7. Discussion
The majority of common diseases result from a complex interplay of genetic and environmental risk factors. It is important to study gene-gene and gene-environment interactions in order to better understand the mechanism underlying disease development. We develop a copula-model based semiparametric test for interaction detection. Our proposed approach strikes a good balance between robustness and efficiency by modeling the correlation between the two risk factors while keeping their marginal distributions fully unspecified.
Under a case-control design, the logistic regression model is the most widely used model for relating multiple risk factors to disease status. The interaction test based on this standard logistic regression does not use any information on the relationship between the two risk factors. If some auxiliary information on the risk factors is available, it is possible to improve the power of the interaction test by taking this information into account. If the two risk factors under study are correlated, then the case-only method and its extensions are not valid because the independence assumption does not hold. In such situations, statistical methods that allow dependence between the two risk factors are desired. By using a copula function to characterize the joint distribution of the risk factors, our proposed approach relaxes the independence assumption, and is more powerful than the standard logistic regression method if the assumed copula model is appropriate for the joint distribution of the two risk factors considered. Our simulation results also show that the copula approach still provides valid results even if the underlying copula model is mildly misspecified. As a precaution, in practical applications one has to make sure that the copula model is not severely misspecified. Balancing efficiency and robustness is a classic problem in statistical inference.
Although we only consider two risk factors in the current manuscript, the proposed approach may be extended to more than two risk factors, since a copula function can be used to relate an arbitrary finite number of risk factors. Furthermore, the proposed approach can be extended to adjust for covariates. For example, when a covariate Z is involved in the case-control study, we can model the joint density function of (X, Y, Z) in the control population as follows:
where FX(x|z; ηX) and FY(y|z; ηY) are the conditional cumulative distribution functions of X and Y, respectively, in the control population (given Z = z), fX(x|z; ηX) and fY(y|z; ηY) are the corresponding density functions, and fZ(z) is the marginal density function of Z in the control population. The estimation procedure for all unknown parameters and the corresponding inference procedure can be derived similarly, but are slightly more complicated.
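For reference, a factorization consistent with the quantities defined above follows from applying Sklar's theorem conditionally on Z = z, assuming X and Y are continuous given Z. The symbol c(·, ·; θ) for the copula density and the name f(x, y, z) for the joint density are notational assumptions made here for illustration:

```latex
f(x, y, z) \;=\;
  c\bigl\{ F_X(x \mid z; \eta_X),\; F_Y(y \mid z; \eta_Y);\; \theta \bigr\}\,
  f_X(x \mid z; \eta_X)\, f_Y(y \mid z; \eta_Y)\, f_Z(z)
```

In this form the copula density carries all of the dependence between X and Y given Z, while the conditional margins and the distribution of Z enter as separate factors.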
We have found that the efficiency gain of the proposed approach in the continuous-continuous scenario is much higher than in the discrete-discrete scenario. This phenomenon is related to the degrees of freedom of the tests. In fact, for a fully nonparametric bivariate distribution, the nonparametric MLE has rs − 1 parameters if the first and second components have r and s possible values, respectively. On the other hand, if one imposes the constraint that the two components are independent, then the nonparametric MLE has (r − 1) + (s − 1) parameters. The difference in the number of parameters is Δ = rs − 1 − (r − 1) − (s − 1) = (r − 1)(s − 1). If r = 2, s = 2 (corresponding to the discrete-discrete scenario), then Δ = 1. On the other hand, if r = s = 100 (corresponding to the continuous-continuous scenario with sample size 100), then Δ = (100 − 1)² = 9801, and the number of unknown parameters is reduced dramatically under the independence assumption. For the copula model, the number of reduced parameters is Δ minus the dimension of θ (the parameter vector in the copula function). As a result, in the continuous-continuous scenario, the restricted models (with either the independence assumption on the two components or the copula model assumption) have a greater efficiency gain than in the discrete-discrete scenario.
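The parameter counting above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names `parameter_counts` and `copula_reduction` are hypothetical, and a one-dimensional θ (as in the Gaussian copula) is assumed as the default.

```python
def parameter_counts(r, s):
    """Free parameters for an r-by-s discrete bivariate distribution.

    Returns (full, independent, delta), where
      full        = r*s - 1            (fully nonparametric joint distribution)
      independent = (r - 1) + (s - 1)  (two free margins under independence)
      delta       = full - independent = (r - 1) * (s - 1)
    """
    if r < 2 or s < 2:
        raise ValueError("each component needs at least two possible values")
    full = r * s - 1
    independent = (r - 1) + (s - 1)
    return full, independent, full - independent


def copula_reduction(r, s, dim_theta=1):
    """Parameters saved by a copula model: delta minus the dimension of theta."""
    _, _, delta = parameter_counts(r, s)
    return delta - dim_theta


if __name__ == "__main__":
    print(parameter_counts(2, 2))      # discrete-discrete scenario
    print(parameter_counts(100, 100))  # continuous-continuous, sample size 100
    print(copula_reduction(100, 100))  # reduction with a one-dimensional theta
```

With r = s = 2 and a one-dimensional θ the copula model saves no parameters at all, which agrees with the small efficiency gain observed in the discrete-discrete scenario.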
Acknowledgment
We are grateful to two referees, the associate editor, and the joint editor for their insightful comments. This research was supported by the State Key Development Program for Basic Research of China (Grant No. 2012CB316500) (HZ) and the Intramural Research Program of the Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health (HZ, ML, NC, KY).
Footnotes
Supplementary Material
Supplementary material is provided for the consistency and the asymptotic normality of the pseudo-MLE and a proof of the second equation in (5.2).
References
- Ahn J, Peters U, Albanes D, Purdue MP, Abnet CC, Chatterjee N, Horst RL, Hollis BW, Huang WY, Shikany JM, Hayes RB and Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial Project Team (2008). Serum vitamin D concentration and prostate cancer risk: a nested case-control study. J. Nat. Cancer Inst. 100, 796–804.
- Amos CI, Wu X, Broderick P, Gorlov IP, Gu J, Eisen T, Dong Q, Zhang Q, Gu X, and Vijayakrishnan J (2008). Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nat. Genet. 40, 616–622.
- Cochran WG (1977). Sampling Techniques, 3rd edition. John Wiley & Sons, New York.
- Cornfield J (1956). A statistical problem arising from retrospective studies. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol 4 (Edited by Neyman J), 135–148, University of California Press, Berkeley, CA.
- Chatterjee N and Carroll RJ (2005). Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92, 399–418.
- Cleveland WS, Grosse E, and Shyu WM (1991). Local regression models. Chapter 8 of Statistical Models in S (Edited by Chambers JM and Hastie TJ), 309–376, Chapman & Hall, London.
- de Leon AR and Wu B (2011). Copula-based regression models for a bivariate mixed discrete and continuous outcome. Statist. Med. 30, 175–185.
- Genest C, Ghoudi K, and Rivest LP (1995). A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 82, 543–552.
- Gong G and Samaniego FJ (1981). Pseudo maximum likelihood estimation: theory and applications. Ann. Statist. 9, 861–869.
- Hung RJ, McKay JD, Gaborieau V, Boffetta P, Hashibe M, Zaridze D, Mukeria A, Szeszenia-Dabrowska N, Lissowska J, and Rudnai P (2008). A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature 452, 633–637.
- Landi MT, Chatterjee N, Yu K, Goldin LR, Goldstein AM, Rotunno M, Mirabello L, Jacobs K, Wheeler W, Yeager M, Bergen AW, Li Q, Consonni D, Pesatori AC, Wacholder S, Thun M, Diver R, Oken M, Virtamo J, Albanes D, Wang Z, Burdette L, Doheny KF, Pugh EW, Laurie C, Brennan P, Hung R, Gaborieau V, McKay JD, Lathrop M, McLaughlin J, Wang Y, Tsao MS, Spitz MR, Krokan H, Vatten L, Skorpen F, Arnesen E, Benhamou S, Bouchard C, Metsapalu A, Vooder T, Nelis M, Valk K, Field JK, Chen C, Goodman G, Sulem P, Thorleifsson G, Rafnar T, Eisen T, Sauter W, Rosenberger A, Bickeboller H, Risch A, Chang-Claude J, Wichmann HE, Stefansson K, Houlston R, Amos CI, Fraumeni JF, Savage SA, Bertazzi PA, Tucker MA, Chanock S, and Caporaso NE (2009). A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am. J. Hum. Genet. 85, 679–691.
- Li DX (2000). On default correlation: a copula function approach. J. Fixed Income 9, 43–54.
- Lin D and Zeng D (2009). Proper analysis of secondary phenotype data in case-control association studies. Genet. Epidemiol. 33, 256–265.
- McKay JD, Hung RJ, Gaborieau V, Boffetta P, Chabrier A, Byrnes G, Zaridze D, Mukeria A, Szeszenia-Dabrowska N, and Lissowska J (2008). Lung cancer susceptibility locus at 5p15.33. Nat. Genet. 40, 1404–1406.
- Mukherjee B and Chatterjee N (2008). Exploiting gene-environment independence for analysis of case-control studies: an empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics 64, 685–694.
- Nair VN and Wang PCC (1989). Maximum likelihood estimation under a successive sampling discovery model. Technometrics 31, 423–436.
- Nelsen RB (1999). An Introduction to Copulas. Springer, New York.
- Owen AB (2001). Empirical Likelihood. Chapman & Hall/CRC, New York.
- Piegorsch WW, Weinberg CR, and Taylor JA (1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statist. Med. 13, 153–162.
- Prentice RL and Pyke R (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403–411.
- Qin J and Zhang B (1997). A goodness-of-fit test for logistic regression models based on case-control data. Biometrika 84, 609–618.
- Rafnar T, Sulem P, Stacey SN, Geller F, Gudmundsson J, Sigurdsson A, Jakobsdottir M, Helgadottir H, Thorlacius S, and Aben KK (2009). Sequence variants at the TERT-CLPTM1L locus associate with many cancer types. Nat. Genet. 41, 221–227.
- Sklar A (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8, 229–231.
- Shih JH and Louis TA (1995). Association parameter in copula models for bivariate survival data. Biometrics 51, 1384–1399.
- Thorgeirsson TE, Geller F, Sulem P, Rafnar T, Wiste A, Magnusson KP, Manolescu A, Thorleifsson G, Stefansson H, and Ingason A (2008). A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature 452, 638–642.
- Umbach DM and Weinberg CR (1997). Designing and analyzing case-control studies to exploit independence of genotype and exposure. Statist. Med. 16, 1731–1743.
- Wang Y, Broderick P, Webb E, Wu X, Vijayakrishnan J, Matakidou A, Qureshi M, Dong Q, Gu X, and Chen WV (2008). Common 5p15.33 and 6p21.33 variants influence lung cancer risk. Nat. Genet. 40, 1407–1409.
