Abstract
Background and Aims:
There is considerable interest in epidemiology to estimate an additive interaction effect between two risk factors in case-control studies. An additive interaction is defined as the differential reduction in absolute risk associated with one factor between different levels of the other factor. A stratified two-phase case-control design is commonly used in epidemiology to reduce the cost of assembling covariates. It is crucial to obtain valid estimates of the model parameters by accounting for the underlying stratification scheme to obtain accurate and precise estimates of additive interaction effects. The aim of this paper is to examine the properties of different methods for estimating model parameters and additive interaction effects under a stratified two-phase case-control design.
Methods:
Using simulations, we investigate the properties of three existing methods, namely stratum-specific offset, inverse-probability weighting, and multiple imputation for estimating model parameters and additive interaction effects. We also illustrate these properties using data from two published epidemiology studies.
Results:
Simulation studies show that the multiple imputation method performs well when both the true and analysis models are additive (i.e., do not include multiplicative interaction terms) but does not provide a discernible advantage over the offset method when the analysis model is non-additive (i.e., includes multiplicative interaction terms). The offset method exhibits the best overall properties when the analysis model contains multiplicative interaction effects.
Conclusion:
When estimating additive interaction between risk factors in stratified two-phase case-control studies, we recommend estimating model parameters using multiple imputation when the analysis model is additive, and recommend the offset method when the analysis model is non-additive.
Keywords: Additive interaction, Inverse-probability weighting, Multiple imputation, Offset, Stratified two-phase case-control design
1. Introduction
The stratified two-phase case-control design is a popular sampling strategy for measuring new risk factors in a cost-efficient manner in epidemiological studies [1–3]. A parent cohort is first established and certain risk factors of interest are ascertained from all subjects. The parent cohort may be assembled in a prospective or retrospective manner. Next, affected cases and unaffected controls are sub-sampled from within strata defined by one or more risk factors measured in the parent cohort, and new risk factors are measured only for individuals in the sub-sample [4, 5]. This paper investigates the properties of three methods for estimating an interaction between two risk factors on the additive scale of the outcome under a stratified two-phase case-control design.
Epidemiology studies use statistical models to investigate interactions between two (or more) risk factors - for example, interaction between genetic and environmental factors [6], genetic and demographic factors [7] or multiple genetic [8] or non-genetic [9] factors. There is considerable enthusiasm in studying interactions because it is anticipated that interactions can shed light on biological mechanisms underlying disease etiology and provide insights into the benefits of changes or behavioral modifications to one risk factor in subgroups of individuals defined by another risk factor [10]. In statistical models, interaction between two risk factors is defined as departure from additivity of the effects of the risk factors i.e., the effect of one risk factor on the outcome varies across the levels of another risk factor [11]. An interaction is represented by a product term of the risk factors under a canonical link function and is commonly referred to as multiplicative interaction [12]. For binary outcomes, the canonical link function is the logistic link. A multiplicative interaction between two risk factors is the ratio of two odds ratios, where the numerator is the odds ratio measuring the association between two risk factors in affected individuals (equivalently, individuals having value 1 for the outcome) and the denominator a similar odds ratio in unaffected individuals (equivalently, individuals having value 0 for the outcome) [13].
Epidemiology studies increasingly examine additive interaction between risk factors, which is defined as differential reduction in absolute risk associated with changing the level of one risk factor across the levels of another risk factor [14–18]. Additive interactions are anticipated to be more useful than multiplicative interactions since they can be directly interpreted in terms of risk reduction [12] and conceptual models for biologic interactions between risk factors translate to interactions on an additive scale [8, 19].
There is a large and growing body of literature on statistical methods for estimating the additive effects of risk factors and their multiplicative interaction effects in logistic regression models under stratified two-phase designs with retrospective sampling of cases and controls in the first phase. These methods include maximum likelihood estimation with an offset and the weighted likelihood approach [3, 20–27]. To our knowledge, limited attention has been given to statistical evaluations of the properties of these methods for estimating additive interaction effects between risk factors under stratified two-phase sampling. This gap exists despite growing interest in the evaluation of additive interactions in epidemiology studies and despite the increasing availability of large epidemiology cohorts that can be used for investigating novel additive interactions between risk factors in a cost-efficient manner.
A multiple imputation approach has been proposed to estimate the parameters of a logistic regression model under two-phase designs [28–30]. It is a popular approach for obtaining complete data sets in the presence of missing variables in medical studies and is easy to implement [30]. Under this approach, individuals who are not sampled into phase two of the study are treated as having missing values for the new variable measured in the second phase. The missing values are replaced with plausible values of the new variable. The resulting data set is then treated as a complete data set and a standard logistic regression model is applied to estimate the model parameters from the complete data set. Multiple imputation methods applied to stratified two-phase case-control studies have shown that this approach provides parameter estimates with less bias and better precision than methods based on conditional maximum likelihood estimation [30]. These investigations have focused on estimating the parameters of an additive logistic regression model (i.e., a model without multiplicative interaction terms) and have not examined the properties of estimating additive interactions between risk factors.
In this paper, we examine the properties of three methods, namely, offset, weighted likelihood, and multiple imputation, for estimating additive interaction effects between risk factors in a stratified two-phase design. We provide an empirical illustration of these methods using data from two published epidemiology studies and use simulations to examine the properties of the methods.
2. Motivating Examples
Our paper is motivated by the following epidemiology studies.
The Study of Nevi in Children (SONIC) is a population-based prospective study in a US cohort of children to examine risk factors associated with nevus phenotypes [31]. Several putative risk factors including demographic factors and sun exposure were measured on the initial parent cohort of N = 443 subjects during 2004. A main outcome of interest is high-risk nevus phenotype (also referred to as mole-prone phenotype), a binary outcome defined as nevus counts exceeding a certain value. Following prior work in SONIC that defined high-risk phenotype based on quintiles of nevus counts observed in the study subjects [32], in this paper we consider high-risk nevus phenotype as our binary outcome. The parent cohort has since been expanded with prospective ascertainment of more study subjects from different age groups. This prospective cohort is now amenable to future studies of novel risk factors in a cost-efficient manner via a two-phase stratified sampling approach.
The Epidemiology of Endometrial Cancer Consortium (E2C2) is a large population-based case-control study investigating risk factors associated with the etiology of endometrial cancer [7]. Data on age, body mass index and genotype at marker rs727479, a single nucleotide polymorphism in the CYP19A1 gene, are available for the parent cohort of 4261 endometrial cancer cases and 7099 unaffected controls from tables published by Setiawan et al. [7]. This cohort is also amenable to future cost-efficient studies of novel risk factors via a two-phase stratified sampling approach.
We will use data from SONIC and E2C2 to illustrate statistical methods for estimating additive interactions under a two-phase stratified case-control design. We will estimate additive interaction between sun sensitivity index (a demographic factor) and sun exposure in relation to high risk nevus phenotype in SONIC, and additive interaction between body mass index and rs727479 in relation to the risk of endometrial cancer in E2C2. The availability of complete data from the parent cohorts in these studies allows us to illustrate the properties of analysis methods for estimating additive interactions by presenting an analysis of the full cohorts along with a side-by-side illustration of analysis based on stratified two-phase sampling applied to these studies.
3. Notations
We use the following notations throughout this paper:
Let (D, G, E, X) be a vector of observations, where D is a binary outcome denoting a case (i.e., affected status, D = 1) or a control (i.e., unaffected status, D = 0), G and E are risk factors of interest, and X is a vector of other covariates including confounders. In this paper we take G to be a continuous or categorical risk factor and E to be a categorical risk factor. When discussing the methodological concepts, we focus on E having two categories (E = 0 or 1). Note that G and E can be any two risk factors of interest and need not be restricted to genetic and environmental factors.
Denote p = pr(D = 1|G, E, X) as disease risk given G, E, and X. We assume that the association between D and (G, E, X) is described by the logistic regression model [33]:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 G E + \beta_4^{T} X, \tag{1}$$

where β0 is the intercept, β1 and β2 are referred to as the additive effects of G and E, respectively, β3 is their non-additive effect, and β4 is a vector of additive effects for X. Denote $\beta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4^{T})^{T}$ as the vector of model parameters.
Consider a parent data set of N independent individuals. We assume that D and G are measured on these N subjects. In practice, some components of X may also be available for these N subjects. However, in this paper we assume only D and G are available for the entire parent data set. We further assume that there are N1 individuals with D = 1 (cases) and N0 = N − N1 individuals with D = 0 (controls) in the parent data set.
Divide the parent data set into H strata using G. Let Nh denote the sample size of stratum h = 1,···, H. Clearly, $N = \sum_{h=1}^{H} N_h$.
Subsample nh individuals from stratum h. Let $n = \sum_{h=1}^{H} n_h$ denote the size of the subsampled data set. Measure E and X on these n individuals.
Denote (Dhi, Ghi, Ehi, Xhi, Shi) as the observations corresponding to person i (= 1,···, Nh) in stratum h (= 1,···, H). Here Shi is a sampling indicator with Shi = 1 if the person is in the subsample and, hence, has a measured Ehi and Xhi; otherwise, Shi = 0 and Ehi and Xhi are unmeasured.
We use the hat notation to indicate estimated values, e.g., $\hat{\beta}$ is an estimate of β.
4. Interaction
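Denoting f(.) as an arbitrary function (or scale) of disease risk, and writing p(g, e, x) = pr(D = 1|G = g, E = e, X = x), we define the effect of E when X and G take values X = x and G = g as the contrast:

$$f\{p(g, 1, x)\} - f\{p(g, 0, x)\}.$$

The statistical interaction between G and E, evaluated at X = x, is a contrast defined as the difference between the effects of E under two values of G, namely G = g1 and G = g2. In notations, the interaction effect is a contrast [34], denoted A(x, g1, g2), given by:

$$A(x, g_1, g_2) = \left[f\{p(g_1, 1, x)\} - f\{p(g_1, 0, x)\}\right] - \left[f\{p(g_2, 1, x)\} - f\{p(g_2, 0, x)\}\right]. \tag{2}$$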
4.1. Multiplicative Interaction
A multiplicative interaction between G and E is obtained by defining f(.) on the logistic scale as f(p) = log{p/(1 − p)}. It then follows from equations (1) and (2) that the multiplicative interaction effect is the non-additive effect β3 in equation (1).
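The step from equations (1) and (2) to this result can be written out explicitly. The derivation below is a sketch that assumes equation (1) has the form logit(p) = β0 + β1G + β2E + β3GE + β4ᵀX, as inferred from the parameter descriptions in Section 3, and uses the shorthand p(g, e, x) = pr(D = 1|G = g, E = e, X = x).

$$
\begin{aligned}
f\{p(g,1,x)\} - f\{p(g,0,x)\} &= (\beta_0 + \beta_1 g + \beta_2 + \beta_3 g + \beta_4^{T}x) - (\beta_0 + \beta_1 g + \beta_4^{T}x) = \beta_2 + \beta_3 g, \\
A(x, g_1, g_2) &= (\beta_2 + \beta_3 g_1) - (\beta_2 + \beta_3 g_2) = \beta_3 (g_1 - g_2).
\end{aligned}
$$

Under this assumed form, the contrast reduces to β3 for a one-unit difference in G (for example, a binary G with g1 = 1 and g2 = 0), and it does not depend on x.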
4.2. Additive Interaction
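An additive interaction between G and E is obtained on the risk scale by setting f(p) = p and is given by

$$
\begin{aligned}
A(x, g_1, g_2) = {} & \{\mathrm{pr}(D = 1 \mid G = g_1, E = 1, X = x) - \mathrm{pr}(D = 1 \mid G = g_1, E = 0, X = x)\} \\
& - \{\mathrm{pr}(D = 1 \mid G = g_2, E = 1, X = x) - \mathrm{pr}(D = 1 \mid G = g_2, E = 0, X = x)\}.
\end{aligned} \tag{3}
$$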
The components are disease risks, which can be estimated from equation (1). We can estimate the additive interaction by plugging in the estimated parameters of equation (1) to calculate disease risk, and derive its variance using the delta method [33].
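The plug-in and delta-method steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the model in equation (1) without the covariate vector X, approximates the gradient numerically instead of analytically, and uses made-up values for the parameter estimates and their covariance matrix.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def additive_interaction(beta, g1, g2):
    """Equation (3) for logit(p) = b0 + b1*G + b2*E + b3*G*E (assumed form, no X)."""
    b0, b1, b2, b3 = beta
    def risk(g, e):
        return expit(b0 + b1 * g + b2 * e + b3 * g * e)
    return (risk(g1, 1) - risk(g1, 0)) - (risk(g2, 1) - risk(g2, 0))

def numerical_gradient(func, beta, eps=1e-6):
    """Central-difference gradient of a scalar function of beta."""
    grad = []
    for j in range(len(beta)):
        up, down = list(beta), list(beta)
        up[j] += eps
        down[j] -= eps
        grad.append((func(up) - func(down)) / (2.0 * eps))
    return grad

def delta_method_se(func, beta, cov):
    """Standard error sqrt(g' cov g), where g is the gradient of func at beta."""
    g = numerical_gradient(func, beta)
    k = len(beta)
    var = sum(g[a] * cov[a][b] * g[b] for a in range(k) for b in range(k))
    return math.sqrt(var)

# Hypothetical estimates and covariance matrix, for illustration only.
beta_hat = [-2.0, 0.5, 0.4, 0.3]
cov_hat = [[0.04, 0.0, 0.0, 0.0],
           [0.0, 0.01, 0.0, 0.0],
           [0.0, 0.0, 0.02, 0.0],
           [0.0, 0.0, 0.0, 0.03]]
ai = additive_interaction(beta_hat, g1=1.0, g2=0.0)
se = delta_method_se(lambda b: additive_interaction(b, 1.0, 0.0), beta_hat, cov_hat)
print("additive interaction:", round(ai, 4), "delta-method SE:", round(se, 4))
```

In practice the covariance matrix would come from whichever estimation method produced the parameter estimates, so the same function can be reused across methods.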
The components of the right-hand side of equation (3) can be obtained directly from the estimated model parameters in prospective samples such as the SONIC study. However, it is not straightforward to estimate disease risks in this manner from retrospective samples such as the E2C2 study since the intercept β0 cannot be interpreted in terms of baseline risk in the population. Therefore, it is common to report the relative excess risk due to interaction (RERI) as a measure of additive interaction in analysis of retrospective samples [18, 19], which is defined as the ratio

$$\mathrm{RERI} = \frac{A(x, g_1, g_2)}{\mathrm{pr}(D = 1 \mid G = g_2, E = 0, X = x)}.$$

For rare diseases, RERI can be approximated as

$$\mathrm{RERI} \approx \exp\{\beta_1 (g_1 - g_2) + \beta_2 + \beta_3 g_1\} - \exp\{\beta_1 (g_1 - g_2)\} - \exp(\beta_2 + \beta_3 g_2) + 1,$$

which does not depend upon the baseline parameter β0 and the effect β4 of X.
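The relationship between RERI and its rare-disease approximation can be checked numerically. The sketch below makes several assumptions that are not stated in the text: the model in equation (1) without X, a reference group defined by G = g2 and E = 0, and hypothetical coefficient values. It compares the ratio form of RERI with the odds-ratio approximation at a rare and a common baseline risk.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def risk(beta, g, e):
    """Disease risk under logit(p) = b0 + b1*G + b2*E + b3*G*E (assumed form, no X)."""
    b0, b1, b2, b3 = beta
    return expit(b0 + b1 * g + b2 * e + b3 * g * e)

def reri_exact(beta, g1, g2):
    """Additive interaction divided by the risk in the reference group (G = g2, E = 0)."""
    p11, p10 = risk(beta, g1, 1), risk(beta, g1, 0)
    p01, p00 = risk(beta, g2, 1), risk(beta, g2, 0)
    return (p11 - p10 - p01 + p00) / p00

def reri_rare(beta, g1, g2):
    """Rare-disease approximation: risk ratios replaced by odds ratios, so b0 cancels."""
    _, b1, b2, b3 = beta
    or11 = math.exp(b1 * (g1 - g2) + b2 + b3 * g1)
    or10 = math.exp(b1 * (g1 - g2))
    or01 = math.exp(b2 + b3 * g2)
    return or11 - or10 - or01 + 1.0

# Hypothetical coefficients; only the intercept differs between the two scenarios.
for b0 in (-12.0, -1.0):
    beta = [b0, 0.5, 0.4, 0.3]
    print("b0 =", b0,
          "exact:", round(reri_exact(beta, 1.0, 0.0), 4),
          "rare-disease approx:", round(reri_rare(beta, 1.0, 0.0), 4))
```

The two quantities should agree closely only when the baseline risk is small, which is why the approximation is not appropriate for common outcomes.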
5. Stratified design
Denote Nh1 and Nh0 as the known number of cases and controls, respectively, in stratum h (= 1,···, H). Subsample nh1 cases and nh0 controls from stratum h using sampling probabilities πh1 and πh0, respectively, where πhd = nhd/Nhd for d = 0, 1 [2]. Without loss of generality, we assume that we select all the cases and an equal number of controls from each stratum. Therefore, nh1 = Nh1 and nh0 = min(Nh1, Nh0), so that πh1 = 1 and πh0 = nh0/Nh0. Note that when there are more cases than controls in a given stratum, all cases and controls are selected. The total number of individuals subsampled from stratum h is nh = nh1 + nh0. The total numbers of cases and controls in the stratified case-control subset are $n_1 = \sum_{h=1}^{H} n_{h1}$ and $n_0 = \sum_{h=1}^{H} n_{h0}$, respectively. Additional risk factors such as E are measured in these n = n1 + n0 individuals.
Our goal is to estimate the additive interaction between G and E given by equation (3) and its standard error using data from the stratified subsample. In the following section we summarize three existing methods for obtaining consistent estimates of β under a stratified sampling design, and use these estimates to calculate the additive interaction and its standard error.
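The sampling scheme of this section (all cases plus, where feasible, an equal number of controls per stratum) can be sketched as follows. The function name, the record layout and the toy data are hypothetical. As a simplification, the sketch raises an error for strata without cases or without controls instead of merging them with an adjacent stratum.

```python
import random
from collections import defaultdict

def stratified_case_control_sample(records, seed=0):
    """records: list of (stratum, d) pairs, d = 1 for cases and 0 for controls.
    Returns the selected row indices and {stratum: (pi_h1, pi_h0)}."""
    rng = random.Random(seed)
    cases, controls = defaultdict(list), defaultdict(list)
    for idx, (h, d) in enumerate(records):
        (cases if d == 1 else controls)[h].append(idx)
    selected, probs = [], {}
    for h in sorted(set(cases) | set(controls)):
        if not cases[h] or not controls[h]:
            raise ValueError("merge strata that lack cases or controls before sampling")
        # Keep every case; draw as many controls as there are cases, capped at N_h0.
        n_h0 = min(len(cases[h]), len(controls[h]))
        selected += cases[h] + rng.sample(controls[h], n_h0)
        probs[h] = (1.0, n_h0 / len(controls[h]))
    return sorted(selected), probs

# Toy parent data: stratum A has 2 cases and 5 controls, stratum B has 3 cases and 2 controls.
records = [("A", 1)] * 2 + [("A", 0)] * 5 + [("B", 1)] * 3 + [("B", 0)] * 2
selected, probs = stratified_case_control_sample(records)
print(len(selected), probs)
```

The returned sampling probabilities are the quantities needed later as offsets or inverse-probability weights.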
6. Estimation Methods
6.1. Stratum-specific offset
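The disease risk of person i selected into stratum h is conditioned on the sampling indicator Shi = 1. Hence, the disease risk for person i is given by

$$p^{*}_{hi} = \mathrm{pr}(D_{hi} = 1 \mid G_{hi}, E_{hi}, X_{hi}, S_{hi} = 1).$$

It follows from Bayes theorem that

$$p^{*}_{hi} = \frac{\pi_{h1}\exp(\eta_{hi})}{\pi_{h0} + \pi_{h1}\exp(\eta_{hi})} = \frac{\exp\{\log(\pi_{h1}/\pi_{h0}) + \eta_{hi}\}}{1 + \exp\{\log(\pi_{h1}/\pi_{h0}) + \eta_{hi}\}}, \tag{4}$$

where $\eta_{hi} = \beta_0 + \beta_1 G_{hi} + \beta_2 E_{hi} + \beta_3 G_{hi} E_{hi} + \beta_4^{T} X_{hi}$ is the linear predictor from a logistic model based on a prospective sample.
Therefore, the log-likelihood function of the stratified case-control subsample is

$$\ell(\beta) = \sum_{h=1}^{H} \sum_{i:\, S_{hi} = 1} \left\{ D_{hi} \log p^{*}_{hi} + (1 - D_{hi}) \log (1 - p^{*}_{hi}) \right\}, \tag{5}$$

where the risks on the right hand side are given by (4). By including log(πh1/πh0) as an offset for stratum h, we can obtain consistent estimates $\hat{\beta}$ of β by maximizing (5). Since all cases are selected (πh1 = 1), the offset simplifies to −log(πh0). The variance of $\hat{\beta}$ is provided by [35] and [36] and is summarized in the Appendix.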
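The equivalence between the Bayes-theorem form of the risk conditional on selection and a logistic model with a stratum-specific offset can be verified numerically. The sketch below assumes the standard form of this relationship, in which the conditional risk is πh1·p/{πh1·p + πh0·(1 − p)}. The function names and the sampling probabilities are illustrative.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def sampled_risk_bayes(eta, pi_h1, pi_h0):
    """Risk given selection into the subsample, written via Bayes' theorem."""
    p = expit(eta)  # population risk from the prospective logistic model
    return pi_h1 * p / (pi_h1 * p + pi_h0 * (1.0 - p))

def sampled_risk_offset(eta, pi_h1, pi_h0):
    """The same risk written as a logistic model with offset log(pi_h1 / pi_h0)."""
    return expit(eta + math.log(pi_h1 / pi_h0))

# All cases sampled (pi_h1 = 1) and a quarter of the controls (pi_h0 = 0.25).
for eta in (-3.0, -1.0, 0.0, 2.0):
    a = sampled_risk_bayes(eta, 1.0, 0.25)
    b = sampled_risk_offset(eta, 1.0, 0.25)
    print(eta, round(a, 6), round(b, 6))
```

Because the two expressions coincide, fitting a standard logistic regression to the subsample with the offset held fixed targets the same β as the prospective model.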
6.2. Inverse-probability weighting
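For the stratified case-control sample, denote

$$U_{hdi}(\beta) = Z_{hdi}\,\{d - p_{hdi}(\beta)\} \tag{6}$$

as the logistic score function corresponding to person i (= 1,···, nhd) with disease status d (= 0, 1) in stratum h (= 1,···, H), where $Z_{hdi} = (1, G_{hdi}, E_{hdi}, G_{hdi}E_{hdi}, X_{hdi}^{T})^{T}$ is the covariate vector of that person and $p_{hdi}(\beta) = \exp(Z_{hdi}^{T}\beta)/\{1 + \exp(Z_{hdi}^{T}\beta)\}$. We can obtain the inverse-probability weighted (IPW) estimate $\hat{\beta}$ as a solution to

$$\sum_{h=1}^{H} \sum_{d=0}^{1} \frac{1}{\pi_{hd}} \sum_{i=1}^{n_{hd}} U_{hdi}(\beta) = 0. \tag{7}$$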
It can be shown that the IPW estimate of β is consistent [37]. Its variance can be estimated as given by Breslow and Chatterjee [3]
| (8) |
where the notation $a^{\otimes 2}$ denotes $aa^{T}$.
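To make the weighted estimating equation concrete, the sketch below solves the inverse-probability weighted logistic score equations by Newton-Raphson for a deliberately small model, logit(p) = b0 + b1·e with a single binary covariate. It is a toy illustration, not the authors' code: the data are constructed by hand, there is a single stratum, and the variance estimate is not computed.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def ipw_logistic(rows, n_iter=25):
    """Solve sum_i w_i * z_i * (d_i - p_i) = 0 for logit(p) = b0 + b1 * e.
    rows: list of (d, e, w), with w the inverse of the sampling probability."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for d, e, w in rows:
            p = expit(b0 + b1 * e)
            r = w * (d - p)          # weighted score residual
            v = w * p * (1.0 - p)    # weighted information contribution
            s0 += r
            s1 += r * e
            h00 += v
            h01 += v * e
            h11 += v * e * e
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} s, using the closed-form 2x2 inverse.
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1

# Toy cohort: e = 0 has 10 cases / 90 controls, e = 1 has 20 cases / 80 controls.
# Subsample: all 30 cases (weight 1) and 34 of the 170 controls (weight 170 / 34 = 5).
rows = ([(1, 0, 1.0)] * 10 + [(1, 1, 1.0)] * 20 +
        [(0, 0, 5.0)] * 18 + [(0, 1, 5.0)] * 16)
b0_ipw, b1_ipw = ipw_logistic(rows)
b0_naive, b1_naive = ipw_logistic([(d, e, 1.0) for d, e, _ in rows])
print("IPW:", round(b0_ipw, 4), round(b1_ipw, 4))
print("Unweighted:", round(b0_naive, 4), round(b1_naive, 4))
```

In this toy setting the weighted subsample reproduces the cohort counts exactly, so the IPW solution coincides with the full-cohort estimate, whereas the unweighted fit distorts the intercept.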
6.3. Multiple Imputation (MI)
Since E and X are measured only in the stratified case-control subsample of n individuals, their values in the N − n unselected subjects can be deemed missing. The probability of a subject being selected into the subsample depends only on D and G. Therefore, the missingness mechanism for E and X can be classified as missing at random [38]. Thus, one can use multiple imputation (MI) to fill in the missing values in the N −n unselected subjects and use the full cohort of N individuals to estimate the model parameters β via the standard logistic regression, instead of restricting our analysis to the case-control subsample of n individuals as done under the offset and IPW methods.
Several methods for MI are currently available in the literature [39–41]. In this paper we use the chained equation method [42], a flexible approach that assumes the existence of the conditional distribution of the missing variable being imputed given all other variables including the outcome in an imputation model [43]. Briefly, the chained equation method starts with some initial values for missing data and imputes missing variables one at a time in a cyclical fashion by sampling from their conditional distributions given all other variables in the imputation model. This procedure is iterated until convergence [42]. In this method, the conditional distribution of categorical variables being imputed given all other variables in the imputation model is assumed to be binomial or multinomial; the conditional distribution of continuous variables being imputed given all other variables in the imputation model is assumed to be normal. One can choose any suitable imputation method to impute individual missing variables. An important requirement for the imputation model is that it has to be at least as saturated (in terms of the number of parameters) as the analysis model to be fitted to the imputed data [41, 44].
The imputed values are used to estimate the model parameters and the additive interaction of interest. This entire iterative procedure is carried out m times (typically m = 5 to 10) to obtain multiple imputed data sets and the corresponding additive interaction estimates. The final estimate of additive interaction is then obtained as the empirical average of these estimates. Its variance is estimated by Rubin’s rule as W + (1 + 1/m)B, where W is the within-imputation variance and B is the between-imputation variance. Readers are referred to [38] for details. In all our illustrative examples and simulations we used m = 10 imputed data sets.
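The pooling step can be sketched as below. The function applies Rubin's rule, total variance = W + (1 + 1/m)B, to a list of per-imputation estimates and their estimated variances. The function name and the input numbers are hypothetical, and the imputation itself is assumed to have been done elsewhere.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their variances with Rubin's rule."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance W
    b = variance(estimates)        # between-imputation variance B (m - 1 denominator)
    total = w + (1.0 + 1.0 / m) * b
    return q_bar, total

# Hypothetical additive interaction estimates from m = 3 imputed data sets.
pooled, total_var = pool_rubin([0.07, 0.08, 0.09], [0.008, 0.008, 0.008])
print(round(pooled, 4), round(total_var ** 0.5, 4))
```

The square root of the total variance gives the standard error reported alongside the pooled estimate.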
6.4. Software for Multiple Imputation (MI)
The chained equation method is implemented in the mice (multiple-imputation by chained equation) function in the R programming language and in the FCS (fully conditional specification) statement in the PROC MI procedure in the SAS software.
7. Illustrative Examples
In this section we provide empirical illustration of estimating additive interactions and their standard errors via the above parameter estimation methods using data from two published cancer epidemiology studies.
7.1. Study of Nevi in Children
Nevi are among the important known risk factors for melanoma and are anticipated to be associated with magnitude of sun sensitivity and sun exposure even in children. The study of nevi in children (SONIC) is the first known population-based prospective study of nevi in a US cohort of children to evaluate risk factors for nevi [31]. The parent data set includes N = 443 prospectively ascertained children with the following measurements: the total number of nevi on the back (obtained using imaging techniques), sex and race/ethnicity (obtained from school records), sun sensitivity index (SSI, a continuous score obtained by SONIC investigators by combining numerical values assigned to skin color, hair color and Fitzpatrick tendency to burn, which were recorded by a school nurse), and sun exposure (obtained from a questionnaire).
As described in Section 2, the binary outcome is high-risk nevus phenotype, defined as presence or absence of nevus counts in the upper quintile of the observed counts [32]. A total of 91 individuals had high-risk nevus phenotype (cases). The covariate vector X consists of two binary variables: sex (male or female) and race/ethnicity (non-Hispanic White or other). The continuous risk factor G is sun sensitivity index (SSI), which takes values between 0 (high sun sensitivity) and 3 (low sun sensitivity). The risk factor E is sun exposure, an ordinal variable denoting low (≤ 2 hours per day, on average, spent outdoors during summer), medium (3 to 4 hours spent outdoors) and high (> 4 hours spent outdoors) exposure. Table 1 summarizes the characteristics of all 443 individuals in the parent SONIC cohort.
Table 1:
Participant characteristics of the SONIC study
| Participant characteristics (n = 443) | |
|---|---|
| Sex (n (%) male) | 193 (43.6) |
| Race (n (%) white) | 203 (45.8) |
| Sun exposure (n (%)) | |
| Low | 114 (25.7) |
| Medium | 129 (29.1) |
| High | 77 (17.4) |
| SSI (mean (sd)) | 1.66 (0.78) |
We stratified the parent data set on SSI using steps of 0.2 units. When a stratum had no case, we combined it with an adjacent stratum. We obtained H = 14 strata as shown in Table A1 in the Appendix. We obtained a stratified case-control subsample by sampling all the cases and, where feasible, an equal number of controls from each stratum, yielding n1 = 91 cases and n0 = 87 controls (n = 178). In our set up, D and SSI are measured on all 443 individuals in the parent data set, and E and X are measured only in the subsample of 178 children.
We fitted a logistic regression model for the binary outcome with respect to SSI, sun exposure, race/ethnicity and sex. We included two indicator variables Ehigh and Emed for sun exposure, representing high and medium exposures, respectively (taking low exposure as the reference category). Prior works have shown a quadratic effect of SSI and no interaction between SSI and sun exposure in relation to nevus counts [32]. Building on this work, we fitted the following logistic regression model to the stratified case-control subsample:
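$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 E_{med} + \beta_2 E_{high} + \beta_3\,\mathrm{SSI} + \beta_4\,\mathrm{SSI}^2 + \beta_5\,\mathrm{Race} + \beta_6\,\mathrm{Sex} \tag{9}$$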
Table 2 shows the results. Columns 3, 4 and 5 show the parameter estimates based on MI, offset and IPW methods using the 178 stratified case-control subsamples. As a benchmark for comparison, Column 2 shows the analysis of the full parent data set of all 443 individuals. As another benchmark, Column 6 shows the results of a naive logistic regression approach that analyzes the second phase data using standard logistic regression, ignoring the stratified sampling scheme. The offset, IPW and naive methods give considerably inflated estimates of the (linear and quadratic) effects of the stratification variable SSI. Overall, the MI method gives estimates that are closer to those based on the full cohort.
Table 2:
Estimated regression coefficients and standard errors (in parentheses) from SONIC study data
| | Full cohort | MI | Offset | IPW | Naive |
|---|---|---|---|---|---|
| Intercept | −3.95 (0.73) | −3.41 (0.78) | −5.16 (0.70) | −5.21 (0.83) | −3.51 (0.83) |
| Sun exposure (med vs low) | 0.55 (0.34) | 0.50 (0.53) | 0.50 (0.49) | 0.31 (0.52) | 0.41 (0.46) |
| Sun exposure (high vs low) | 0.34 (0.37) | 0.35 (0.67) | 0.27 (0.50) | 0.18 (0.57) | 0.20 (0.50) |
| SSI* | 1.32 (0.72) | 1.44 (0.74) | 5.60 (0.65) | 5.33 (0.96) | 3.25 (0.91) |
| SSI² | −0.79 (0.31) | −0.83 (0.31) | −2.62 (0.29) | −2.48 (0.41) | −1.33 (0.39) |
| Race (white vs non-white) | 1.70 (0.56) | 1.21 (0.68) | 1.52 (0.66) | 1.66 (0.67) | 1.43 (0.68) |
| Sex (male vs female) | 1.25 (0.30) | 1.12 (0.60) | 1.21 (0.38) | 1.07 (0.40) | 1.24 (0.38) |
Values in the brackets are standard errors.
Full cohort: analysis on the full cohort.
MI: multiple imputation on the stratified case-control sample.
Offset: Offset method on the stratified case-control sample.
IPW: Inverse-probability weighting method on the stratified case-control sample.
Naive: naive logistic regression on the stratified case-control sample.
* Stratifying variable.
Nevi are not rare events. Hence, RERI cannot be approximated using the rare disease assumption as shown in Section 4.2. Further, SONIC is a prospective study. Therefore, we estimated additive interaction using equation (3) instead of estimating RERI. We used the parameter estimates obtained from the above logistic regression model and estimated the additive interaction between SSI and sun exposure among non-Hispanic White male children with SSI = 1 (=g1) and 2.5 (=g2) and medium (E = 1) and low (E = 0) sun exposure levels. The results, shown in Table 3, illustrate the effect of the different parameter estimation methods on estimating the additive interaction effect. The MI method has risk differences and additive interaction estimates close to those obtained under the full cohort. Additive interaction estimates under the naive method were considerably smaller than those of the full cohort and had the opposite sign. The offset method had additive interaction estimates closer to the full cohort than the MI method, although its risk differences at SSI = 2.5 were further from the full-cohort values than those of the MI method. In addition, the standard error of the additive interaction between SSI and sun exposure was slightly larger under the offset method than the MI method for the medium versus low comparison. These results suggest that the offset and MI methods have similar performances in terms of estimating additive interaction, although the MI method had better performance for estimating the individual model parameters.
Table 3:
Estimated additive interactions between SSI and sun exposure and their standard errors in SONIC
| | Full Cohort | MI | Offset | IPW | Naive |
|---|---|---|---|---|---|
| Medium versus Low Sun Exposure | | | | | |
| Risk difference in SSI=1 | 0.137 | 0.106 | 0.107 | 0.071 | 0.069 |
| Risk difference in SSI=2.5 | 0.043 | 0.03 | 0.005 | 0.003 | 0.087 |
| Additive interaction (SSI 1 vs. 2.5) | 0.093 | 0.076 | 0.102 | 0.068 | −0.018 |
| Standard error of additive interaction | 0.062 | 0.09 | 0.1 | 0.12 | 0.026 |
| High versus Low Sun Exposure | | | | | |
| Risk difference in SSI=1 | 0.27 | 0.213 | 0.191 | 0.136 | 0.123 |
| Risk difference in SSI=2.5 | 0.11 | 0.082 | 0.014 | 0.008 | 0.185 |
| Additive interaction (SSI 1 vs. 2.5) | 0.16 | 0.13 | 0.177 | 0.128 | −0.062 |
| Standard error of additive interaction | 0.091 | 0.15 | 0.15 | 0.21 | 0.09 |
Full cohort: analysis on the full cohort.
MI: multiple imputation on the stratified case-control sample.
Offset: Offset method on the stratified case-control sample.
IPW: Inverse-probability weighting method on the stratified case-control sample.
Naive: naive logistic regression on the stratified case-control sample.
Under the MI approach, the risk differences between low and medium sun exposure for non-Hispanic White children were 10.6% and 3.0%, respectively, at SSI = 1 and 2.5. This suggests that greater risk reduction can be obtained by reducing sun exposure from medium to low levels for children with high (SSI = 1) than low (SSI = 2.5) sun sensitivity. The estimated additive interaction was 7.6% (standard error = 9.0%), suggesting that non-Hispanic White children with high sun sensitivity can gain 7.6% more risk reduction than those with low sun sensitivity by reducing their sun exposure from medium to low level. We obtained similar results for additive interactions obtained by comparing high and low sun exposure groups.
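As a check on the arithmetic, the MI estimate of additive interaction for medium versus low exposure is the difference between the two MI risk differences reported in Table 3, following the contrast in equation (3) with g1 = 1 and g2 = 2.5:

$$\hat{A} = 0.106 - 0.030 = 0.076.$$

The same subtraction applied to the high versus low rows gives 0.213 − 0.082 = 0.131, which agrees with the reported value of 0.13 up to rounding.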
7.2. Endometrial Cancer Study
The Epidemiology of Endometrial Cancer Consortium [45] is a population-based retrospective case-control study that examines risk factors associated with the etiology of endometrial cancer. Setiawan and colleagues [7] reported the following data from 4261 endometrial cancer cases and 7099 unaffected controls ascertained in a retrospective manner: age (binary: < and ≥ 55 years), body mass index (BMI; 3 categories: < 25 kg/m2 or normal, 25 to 30 kg/m2 or overweight and ≥30 kg/m2 or obese) and genotypes at marker rs727479, a single nucleotide polymorphism (SNP), in the gene CYP19A1 (3 genotype categories: AA, AC and CC). This is our parent data set of N = 11360 (4261+7099) individuals. This data set is given in Table 4 of the paper by Setiawan et al., [7] and is, hence, not tabulated again in this paper. We stratified the parent data on BMI, which has H=3 strata, and drew case-control subsamples from the three strata by sampling all the cases and, where feasible, an equal number of controls. This gave us 4261 cases and 4111 controls in the stratified case-control subsample. For illustrating the properties of additive interaction estimated via the offset, IPW and MI methods under a stratified two-phase design, we assumed that BMI and age were measured on all subjects in the parent data set, and the genotypes at marker rs727479 were measured only in the subsample.
Table 4:
Estimated regression coefficients and standard errors (in parentheses) from the endometrial cancer study data. Additive interactions and their standard errors and RERIs and their standard errors are shown in the last four rows of each table.
| (a) Age ≥ 55 | Full cohort | MI | Offset | IPW | Naive |
|---|---|---|---|---|---|
| Intercept | −0.830 (0.100) | −0.829 (0.105) | −0.813 (0.116) | −0.813 (0.118) | 0.080 (0.121) |
| SNPAC | −0.096 (0.114) | −0.094 (0.121) | −0.111 (0.136) | −0.111 (0.133) | −0.111 (0.136) |
| SNPAA | −0.049 (0.113) | −0.054 (0.119) | −0.072 (0.136) | −0.072 (0.132) | −0.072 (0.136) |
| BMI*overweight | 0.022 (0.169) | 0.063 (0.184) | 0.037 (0.191) | 0.037 (0.193) | −0.318 (0.194) |
| BMI*obese | 0.492 (0.184) | 0.490 (0.187) | 0.475 (0.195) | 0.475 (0.192) | −0.418 (0.196) |
| SNPAC* BMIoverweight | 0.341 (0.188) | 0.300 (0.203) | 0.329 (0.215) | 0.329 (0.215) | 0.329 (0.215) |
| SNPAA* BMIoverweight | 0.396 (0.187) | 0.346 (0.207) | 0.375 (0.214) | 0.375 (0.214) | 0.375 (0.214) |
| SNPAC* BMIobese | 0.498 (0.204) | 0.496 (0.208) | 0.512 (0.218) | 0.512 (0.214) | 0.512 (0.218) |
| SNPAA* BMIobese | 0.619 (0.203) | 0.624 (0.207) | 0.642 (0.217) | 0.642 (0.212) | 0.642 (0.217) |
| Additive interaction | 0.152 | 0.153 | 0.157 | 0.157 | 0.160 |
| SE of additive interaction | 0.048 | 0.048 | 0.050 | 0.049 | 0.053 |
| RERI | 1.305 | 1.307 | 1.305 | 1.305 | 0.525 |
| SE of RERI | 0.327 | 0.327 | 0.322 | 0.324 | 0.108 |
| (b) Age < 55 | Full cohort | MI | Offset | IPW | Naive |
|---|---|---|---|---|---|
| Intercept | −0.899 (0.175) | −0.928 (0.183) | −0.926 (0.192) | −0.926 (0.194) | −0.214 (0.198) |
| SNPAC | 0.221 (0.192) | 0.270 (0.210) | 0.267 (0.219) | 0.267 (0.214) | 0.267 (0.219) |
| SNPAA | 0.194 (0.191) | 0.214 (0.197) | 0.211 (0.217) | 0.211 (0.212) | 0.211 (0.217) |
| BMI* overweight | 0.850 (0.282) | 0.874 (0.305) | 0.822 (0.302) | 0.822 (0.304) | 0.348 (0.305) |
| BMI* obese | 0.899 (0.305) | 0.928 (0.310) | 0.926 (0.318) | 0.926 (0.316) | 0.214 (0.319) |
| SNPAC* BMIoverweight | −0.532 (0.315) | −0.580 (0.349) | −0.526 (0.341) | −0.526 (0.340) | −0.526 (0.341) |
| SNPAA* BMIoverweight | −0.326 (0.309) | −0.335 (0.331) | −0.274 (0.335) | −0.274 (0.334) | −0.274 (0.335) |
| SNPAC* BMIobese | −0.014 (0.341) | −0.063 (0.351) | −0.060 (0.357) | −0.060 (0.353) | −0.060 (0.357) |
| SNPAA* BMIobese | 0.024 (0.336) | 0.005 (0.34) | 0.008 (0.352) | 0.008 (0.348) | 0.008 (0.352) |
| Additive interaction | 0.013 | 0.010 | 0.010 | 0.010 | 0.002 |
| SE of additive interaction | 0.080 | 0.080 | 0.082 | 0.081 | 0.087 |
| RERI | 0.384 | 0.381 | 0.383 | 0.383 | 0.068 |
| SE of RERI | 0.727 | 0.750 | 0.752 | 0.752 | 0.419 |
Full cohort: analysis on full cohort.
MI: analysis on stratified case-control sample using multiple-imputation.
Offset: analysis on stratified case-control sample using offset method.
IPW: analysis on stratified case-control sample using inverse-probability weighting method.
Naive: naive analysis on stratified case-control sample ignoring the design.
BMIoverweight: 25 kg/m2 ≤ BMI <30 kg/m2; BMIobese: ≥ 30 kg/m2.
The additive interactions are the differences of the risk difference of obese vs. normal BMI between genotype AA and CC.
* Stratifying variable.
Setiawan and colleagues [7] fitted separate models to the < 55 and ≥ 55 age groups and included an interaction between BMI and the SNP in each model. Following their work, we fitted the following logistic regression model, separately to the two age groups:
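$$\operatorname{logit}(p) = \beta_0 + \sum_{g \in \{AC,\,AA\}} \beta_g\,\mathrm{SNP}_g + \sum_{b \in \{\text{overweight},\,\text{obese}\}} \gamma_b\,\mathrm{BMI}_b + \sum_{g}\sum_{b} \theta_{gb}\,\mathrm{SNP}_g \times \mathrm{BMI}_b \qquad (10)$$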
where p denotes the risk of endometrial cancer given BMI and genotypes of the SNP rs727479, SNPg is an indicator variable taking value 1 if the genotype is g (= AA or AC) and taking value 0 otherwise. Similarly, BMIb is an indicator variable taking value 1 if BMI is b (= overweight or obese) and taking value 0 otherwise. Thus, the CC genotype category and the normal BMI category are reference groups. Since the imputation model needs to be as saturated (in terms of the number of parameters) as the analysis model (10) [41, 44], we allowed interaction between genotype and BMI in the multiple imputation model by performing separate MI within each BMI level.
Tables 4a and 4b show the parameter estimates under the various estimation methods for age ≥ 55 and < 55 groups, respectively. The results based on the full parent data set and a naive analysis of the subsample are also shown as benchmarks for comparison. As before, the naive approach performs poorly in estimating the intercept and the effects of BMI, the stratification factor. The MI, offset and IPW methods give similar parameter estimates that are also close to estimates obtained under the full cohort approach. These three methods have similar performances possibly because of the large sample size.
We calculated the additive interaction between BMI (obese versus normal groups) and the SNP rs727479 (AA versus CC genotypes) in the age ≥ 55 and < 55 groups separately. These estimates and their standard errors, obtained under the various methods, are shown at the bottom of Tables 4a and 4b. For the age < 55 group, all three methods (offset, IPW and MI) had similar additive interaction estimates (0.010 units) that were also close to that obtained under the full cohort method (0.013 units). The additive interaction estimated under the naive approach (0.002 units) was considerably smaller than that under the full cohort approach. For the age ≥ 55 group, the estimated additive interaction of the MI method (0.153 units) was similar to that of the full cohort method (0.152 units). The offset and IPW methods had similar additive interaction estimates (0.157 units) that were only slightly larger than that under the MI method. The naive approach had the largest additive interaction estimate of all the methods considered (0.160 units). The standard errors of the additive interaction estimates under the offset, IPW and MI methods were close to that under the full cohort analysis for both the age < 55 and ≥ 55 groups.
It should be noted that the parent cohort of this study was ascertained in a retrospective manner. Therefore, the estimation of additive interaction only serves as an illustration of implementing equation (3) and should not be interpreted as an estimation of the true additive interaction in the population. Since the model intercept cannot be interpreted in terms of population-level risk of endometrial cancer and endometrial cancer is rare in the population, we also estimated RERI as a measure of additive interaction for this study. For the age < 55 group, the offset and IPW methods had similar RERI estimates (0.383 units) that were closer to that of the full cohort approach (0.384 units) and slightly larger than the RERI estimated under the MI approach (0.381 units). The naive approach had a considerably smaller RERI (0.068 units). For the age ≥ 55 group, the estimated RERI of the offset and IPW methods (1.305 units) was closer to that of the full cohort method (1.305 units) than that under the MI method (1.307 units). The naive approach had the smallest RERI (0.525 units) of all the methods considered. The standard error of RERI under the MI method was closer to that under the full cohort analysis for both the age < 55 and ≥ 55 groups.
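As an illustrative check, both interaction measures can be recomputed from the fitted coefficients. The following is a minimal Python sketch (not the authors' R code) using the full cohort estimates for the age < 55 group. It assumes that the additive interaction is the difference in risk differences described in the table footnote, and that RERI takes its usual form OR11 − OR10 − OR01 + 1 with odds ratios standing in for risk ratios; the function and variable names are ours.

```python
import math

def expit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Full cohort estimates, age < 55 group.
# The CC genotype and normal BMI are the reference levels.
b0 = -0.899          # intercept
b_aa = 0.194         # SNP AA main effect
b_obese = 0.899      # BMI obese main effect
b_aa_obese = 0.024   # SNP AA x BMI obese product term

def additive_interaction(b0, b_g, b_e, b_ge):
    """Difference in risk differences (obese vs. normal) between AA and CC."""
    rd_aa = expit(b0 + b_g + b_e + b_ge) - expit(b0 + b_g)
    rd_cc = expit(b0 + b_e) - expit(b0)
    return rd_aa - rd_cc

def reri(b_g, b_e, b_ge):
    """Relative excess risk due to interaction on the odds-ratio scale."""
    return math.exp(b_g + b_e + b_ge) - math.exp(b_g) - math.exp(b_e) + 1.0

ai = additive_interaction(b0, b_aa, b_obese, b_aa_obese)
r = reri(b_aa, b_obese, b_aa_obese)
print(round(ai, 3), round(r, 3))
```

Under these assumptions the sketch returns values that agree, to three decimals, with the full cohort additive interaction (0.013) and RERI (0.384) reported in the table. Note that RERI does not involve the intercept, which is why it remains interpretable when the intercept does not reflect population-level risk.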
8. Numerical studies
Using the SONIC data set as a prototype, we conducted simulation studies to evaluate the bias, standard errors and bias-variance trade-off of additive interaction effects estimated under a stratified two-phase design where the parent cohort has binary outcomes and is ascertained in a prospective manner.
We examined the properties where the parameters of the stratified design are estimated via the offset, IPW and MI methods. Since the naive approach had the worst performance in our empirical illustrations, we excluded this approach from further consideration in our simulation studies. In the remainder of this Section, we use the term “additive model” to refer to a logistic regression model containing the additive effects of two risk factors of interest, but not their multiplicative interaction. We use the term “non-additive model” to refer to a logistic regression model containing the additive effects of the risk factors of interest as well as their multiplicative interaction.
In our simulations, the parent data set consisted of N individuals. We considered three variables, denoted G, E and X, and simulated a binary outcome D for each individual in the parent data set from the following logistic regression model:

$$\operatorname{logit}\{P(D = 1 \mid G, E, X)\} = \beta_0 + \beta_1 G + \beta_2 E + \beta_3\, G \times E + \delta X,$$

where β2 is the effect of E and β3 is the multiplicative interaction between G and E. Throughout, we considered E to be a binary risk factor of interest with prevalence P (E = 1) = 0.40 and X to be a binary covariate of interest with prevalence P (X = 1) = 0.50. We first conducted simulations treating G as a continuous risk factor, uniformly distributed on the range [0, 3]. We generated G, E and X for each individual from a three-dimensional zero-mean multivariate normal distribution with unit variance and pair-wise correlation of 0.50. We derived G from the scaled cumulative distribution function of the first component of the multivariate normal distribution, and obtained E and X by dichotomizing the second and third components, respectively, at cutpoints corresponding to the desired prevalence. We also conducted additional simulations treating G as a categorical variable with two categories and prevalence P (G = 1) = 0.50. We simulated a binary G by dichotomizing the first component of the multivariate normal distribution at the median.
We set N = 500, 2000, 3500 and 5000, and β0 = −2.75 to obtain an expected marginal disease probability of 0.20 (as in the SONIC study illustrated above). We used two values of β2 to denote small and large effects of E, and two values of β3 to denote the absence and presence of a multiplicative interaction between G and E. When G was continuous, we set its true effect as β1 = 0.20. For binary G, we set β1 = 0.50. In total, this gave 32 parametric combinations for our simulations. The effect of X, δ, was set to one in all scenarios.
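The data-generating steps described above can be sketched as follows. This is a minimal Python illustration for continuous G, not the authors' R code. It builds the equicorrelated normal vector from a shared factor, which is one of several equivalent constructions. The names `beta2` and `beta3` are our labels for the effect of E and the coefficient of the G × E product, and the values passed for them in the usage line are hypothetical placeholders, not the values used in the paper.

```python
import math
import random
from statistics import NormalDist

def simulate_cohort(n, beta0, beta1, beta2, beta3, delta, rho=0.5, seed=None):
    """Simulate (G, E, X, D) for n individuals with continuous G on [0, 3]."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    cut_e = std_normal.inv_cdf(1 - 0.40)  # gives P(E = 1) = 0.40
    cut_x = std_normal.inv_cdf(1 - 0.50)  # gives P(X = 1) = 0.50
    rows = []
    for _ in range(n):
        # Shared factor plus independent noise: unit variances and
        # pairwise correlation rho among the three components.
        shared = rng.gauss(0.0, 1.0)
        z = [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(3)]
        g = 3.0 * std_normal.cdf(z[0])  # scaled CDF, uniform on [0, 3]
        e = int(z[1] > cut_e)
        x = int(z[2] > cut_x)
        eta = beta0 + beta1 * g + beta2 * e + beta3 * g * e + delta * x
        p = 1.0 / (1.0 + math.exp(-eta))
        d = int(rng.random() < p)  # Bernoulli draw of the binary outcome
        rows.append((g, e, x, d))
    return rows

# beta2 = 0.5 and beta3 = 0.0 are hypothetical values chosen for illustration.
cohort = simulate_cohort(5000, beta0=-2.75, beta1=0.20, beta2=0.5,
                         beta3=0.0, delta=1.0, seed=1)
```

A binary G could be obtained in the same sketch by replacing the scaled CDF with `int(z[0] > 0)`, which dichotomizes the first component at its median.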
We generated 500 data sets under each parametric configuration and analyzed each data set as follows: (i) a full cohort design consisting of all N individuals with their observed values of G, E and X; and (ii) a stratified two-phase design using G to define the strata and selecting all the cases and an equal number of controls from each stratum. In the stratified design, we assumed that E and X were measured only on individuals ascertained into the subsample. Throughout, we assumed that G was available for all N individuals. When G was a continuous variable, we defined strata by steps of 0.2 units of G. When G was a categorical variable, the categories were the strata. Under each design, we analyzed the data set by fitting additive and non-additive models and estimated the additive interaction between G and E using equation (3) by setting x = 1, g1 = 2 and g2 = 3 for continuous G and g1 = 0 and g2 = 1 for categorical G. Under the stratified design, we estimated model parameters via the offset, IPW and MI methods.
When using MI to estimate parameters from the non-additive model, we imputed E and X in each stratum separately to ensure that the imputation model was compatible with the analysis model. For continuous G, the value of G varied within a given stratum. Hence, we included it in the stratum-specific imputation model. However, for categorical G, we did not include G in the stratum-specific imputation model since it was constant within each stratum.
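The phase-two selection for continuous G can be sketched as below. This is a minimal Python illustration, not the authors' implementation. It assumes that when a stratum has fewer controls than cases, all available controls are taken, and it returns inverse-probability weights computed as the reciprocal of each individual's sampling fraction. The function names are hypothetical.

```python
import random
from collections import defaultdict

def stratum_of(g, width=0.2, g_max=3.0):
    """Index of the stratum containing g; g_max falls in the last stratum."""
    n_strata = int(round(g_max / width))
    # The small constant guards against floating-point error at stratum edges.
    return min(int(g / width + 1e-9), n_strata - 1)

def two_phase_sample(rows, seed=None):
    """rows: list of (g, d) pairs. Returns sampled indices and IPW weights."""
    rng = random.Random(seed)
    cases, controls = defaultdict(list), defaultdict(list)
    for i, (g, d) in enumerate(rows):
        (cases if d == 1 else controls)[stratum_of(g)].append(i)
    sampled, weights = [], {}
    for h in sorted(set(cases) | set(controls)):
        # Assumption: cap the control sample at the number available.
        n_ctrl = min(len(cases[h]), len(controls[h]))
        chosen = rng.sample(controls[h], n_ctrl)
        for i in cases[h]:
            sampled.append(i)
            weights[i] = 1.0  # every case is selected
        for i in chosen:
            sampled.append(i)
            weights[i] = len(controls[h]) / n_ctrl  # 1 / sampling fraction
    return sampled, weights
```

Only the individuals whose indices are returned would have E and X treated as observed; for everyone else these two variables are missing by design, which is the structure that the MI method exploits.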
We examined the performance of the different methods for estimating the model parameters and additive interaction effects using various summary measures. Our primary measures were: (i) 95% coverage, defined as the proportion of simulated data sets, out of 500, where the true model parameter or the true additive interaction effect was included within the 95% confidence interval; and (ii) mean squared error (MSE), defined as the squared difference between the true and estimated values, averaged over 500 data sets under a given parametric configuration. Figures corresponding to these primary measures are included in the main body of this paper.
We also examined the following secondary summary measures: (i) bias, defined as the difference between the estimated and true values; (ii) empirical standard error (SEE), defined as the standard deviation of the estimates from the 500 data sets; and (iii) percent bias in standard error, defined as the difference between the estimated standard error and SEE, divided by SEE. Figures corresponding to these secondary measures are included as Supplementary Material.
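The primary and secondary summary measures defined above can be computed along the following lines. This is a minimal Python sketch under two assumptions of ours: the 95% confidence interval is a Wald interval (estimate ± 1.96 standard errors), and the percent bias in standard error compares the average estimated standard error with the SEE. The function name is hypothetical.

```python
from statistics import mean, stdev

def summarize(estimates, std_errors, truth, z=1.959964):
    """Summary measures across simulated data sets for one method."""
    # Indicator of whether each Wald interval contains the true value.
    covered = [abs(est - truth) <= z * se
               for est, se in zip(estimates, std_errors)]
    see = stdev(estimates)  # empirical standard error
    return {
        "coverage": sum(covered) / len(covered),
        "mse": mean([(est - truth) ** 2 for est in estimates]),
        "bias": mean(estimates) - truth,
        "see": see,
        "pct_bias_se": 100.0 * (mean(std_errors) - see) / see,
    }

# Toy input: four estimates of a parameter whose true value is 1.0.
result = summarize([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.05], truth=1.0)
```

Because MSE equals the squared bias plus the variance of the estimates, it summarizes the bias-variance trade-off in a single number, which is why it serves as a primary measure alongside coverage.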
8.1. Results
In this section we summarize the simulation results on the estimation of additive interaction. The results on the estimation of other model parameters can be found in the Supplementary Material.
8.1.1. True model does not have a multiplicative interaction
Figure 1 shows the results of simulations where both the true and analysis models were additive, i.e., neither model had a G × E product term for multiplicative interaction. Thus, the analysis model was the same as the true model generating the data. In general, the offset, IPW, and MI methods all had coverage probability close to that of the full cohort design for estimating additive interaction. All the methods had small MSE (< 0.001). As expected, the stratified two-phase designs had higher MSE than the full cohort approach and the MSEs of all the methods decreased towards zero as the sample size increased. When the sample size was small, the offset method had slightly smaller MSE than the IPW and MI methods under most settings; however, when G was continuous and E had a large effect, the MI method had slightly smaller MSE than the offset and IPW methods. Thus, the offset and MI methods had good overall performance for estimating additive interaction compared to the IPW method.
Figure 1:
Simulation results showing the performance of the Offset, inverse-probability weighting (IPW) and multiple imputation (MI) methods on additive interaction estimation when both the true and analysis models were additive (i.e., neither contained G × E term). Row 1: G was a binary variable and E had large effect; Row 2: G was a binary variable and E had small effect; Row 3: G was a continuous variable and E had large effect; Row 4: G was a continuous variable and E had small effect. The two columns summarize the 95% empirical coverage (left column) and mean squared error (right column).
Figure 2 shows the results of simulations where the true model was additive, but a multiplicative interaction term was included in the analysis model (i.e., over-fitting). The offset method had coverage close to that of the full cohort. The MI method did not have good coverage when E had a large effect under continuous G. In general, the offset, IPW, and MI methods had similar MSEs. However, for continuous G with E having a large effect, the MI and offset methods had smaller MSEs than the IPW method when the sample size was small.
Figure 2:
Simulation results showing the performance of the Offset, inverse-probability weighting (IPW) and multiple imputation (MI) methods on additive interaction estimation when the true model was additive but the analysis model had G × E term. Row 1: G was a binary variable and E had large effect; Row 2: G was a binary variable and E had small effect; Row 3: G was a continuous variable and E had large effect; Row 4: G was a continuous variable and E had small effect. The two columns summarize the 95% empirical coverage (left column) and mean squared error (right column).
8.1.2. True model includes a multiplicative interaction
Figure 3 shows the results of simulations where the true model included a multiplicative interaction term but the analysis model was additive (i.e., under-fitting). All the methods, including the full cohort approach, had poor coverage probabilities. This aligns with the large bias noted for all the methods (see Supplementary Material). As before, all the methods had MSEs of small magnitude. The MI method had smaller MSE than the offset and IPW methods, except when G was binary and E had a small effect.
Figure 3:
Simulation results showing the performance of the Offset, inverse-probability weighting (IPW) and multiple imputation (MI) methods on additive interaction estimation when the true model had G × E term but the analysis model was additive. Row 1: G was a binary variable and E had large effect; Row 2: G was a binary variable and E had small effect; Row 3: G was a continuous variable and E had large effect; Row 4: G was a continuous variable and E had small effect. The two columns summarize the 95% empirical coverage (left column) and mean squared error (right column).
Figure 4 shows the results of simulations where both the true and analysis models included a multiplicative interaction term. The offset method had coverage close to that of the full cohort analysis. The IPW method performed similarly to the offset method, while the MI method had poor coverage. Once again, the magnitude of MSE was small for all the methods. The offset and MI methods had smaller MSE than the IPW method. While the MSEs of the offset, IPW and MI methods were larger than that of the full cohort analysis for small sample sizes, the MSEs of all the methods decreased towards 0 as the sample size increased.
Figure 4:
Simulation results showing the performance of the Offset, inverse-probability weighting (IPW) and multiple imputation (MI) methods on additive interaction estimation when both the true model and analysis model had G × E term. Row 1: G was a binary variable and E had large effect; Row 2: G was a binary variable and E had small effect; Row 3: G was a continuous variable and E had large effect; Row 4: G was a continuous variable and E had small effect. The two columns summarize the 95% empirical coverage (left column) and mean squared error (right column).
The bias in the estimated additive interaction effect and the SEE decreased with increasing sample size for all methods (Figures S1–S4 in Supplementary Material). There was substantial bias in all methods when the analysis model under-fitted the data, i.e., when the true model included a multiplicative interaction term but the analysis model was additive (Figure S3). When the analysis model contained a multiplicative interaction term, the MI method had larger bias than the other methods (Figures S2 and S4), reflecting the challenges in imputation when the underlying model is not additive.
8.1.3. Summary of Results
Overall, our simulations suggest that the offset method performs well. The MI method performs well when both the true and analysis models are additive. The MI method does not provide a discernible advantage over the offset method when the analysis model includes multiplicative interaction terms. Thus, the MI method may not be the best choice for estimating additive interaction effect when the analysis model includes multiplicative interaction terms; the offset method has the best bias-variance trade-off (as measured by MSE) for estimating additive interaction in this setting.
9. Discussion
In this paper we have investigated the properties of three different methods for estimating additive interactions and their standard errors in stratified two-phase case-control studies. Our simulations have focused on stratified two-phase sampling where the parent cohort is ascertained in a prospective manner. The key to the estimation of additive interactions is to consistently estimate the intercept and the effect of the stratification variable. The offset and IPW methods are among the popular approaches for estimating model parameters under a stratified case-control design [46]. Multiple imputation tackles the estimation problem from the perspective of missing data [47].
Multiple imputation has several advantages. It is easy to use and is implemented in most standard statistical packages. The standard error estimates under multiple imputation can be easily obtained using Rubin’s rule [38]. Thus, no additional steps are needed to account for the extra information as well as variability carried in the estimated offsets or weights under a stratified two-phase case-control design. Another potential advantage of MI is that, when there are missing data in other covariates, it is straightforward to impute these variables as part of the multiple imputation strategy, thereby improving the efficiency of the estimates relative to analyses based on removing individuals with missing values in these covariates. In contrast, the offset and IPW methods would either need an extra step to impute the missing covariates or use only individuals without any missing values in the analyses, resulting in efficiency loss. Lastly, when some covariates are available for the parent data set, MI can naturally incorporate the additional information in the estimation process, resulting in potentially more efficient estimates than those from the offset and IPW methods.
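For a scalar quantity such as the additive interaction, the pooling step under Rubin's rules can be sketched as follows. This is a minimal Python illustration, not the authors' code. It assumes that each of the M imputed data sets has already been analyzed to give a point estimate and a standard error; the function name is hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool M completed-data estimates of a scalar with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # within-imputation variance
    between = variance(estimates)                  # between-imputation variance
    total = within + (1 + 1 / m) * between         # total variance
    return q_bar, total ** 0.5

# Toy input: three imputations of the same quantity.
pooled_est, pooled_se = pool_rubin([1.0, 1.2, 1.1], [0.2, 0.2, 0.2])
```

The between-imputation term is what carries the extra uncertainty due to the values of E and X that are missing by design, so no separate adjustment for estimated offsets or weights is required.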
Multiple imputation also has its limitations. A critical requirement for the imputation model is that it has to be compatible with the analysis model [41, 44]. This requirement becomes difficult to meet when the analysis model contains nonlinear terms of the covariates such as multiplicative interactions or polynomials. A systematic treatment of this topic is beyond the scope of this paper. In our simulation studies and the endometrial cancer data analysis, we accommodated G × E multiplicative interaction in the analysis model by performing MI separately for each stratum of G, thereby allowing the relationship between E and D to vary across G. Our simulation results suggest that this strategy is not completely satisfactory, as evidenced by the bias in the estimated additive interaction and suboptimal coverage of the MI method when the analysis model contained a multiplicative interaction effect. Based on the findings from the simulation studies and considering the strengths and limitations of the multiple imputation method, we recommend using MI to estimate additive interactions under stratified case-control samples when there is no multiplicative interaction term in the analysis model. In practice, we suggest a two-step process. First, use a naive logistic regression to determine whether a multiplicative interaction is needed. Second, when there is no evidence of multiplicative interaction, estimate the additive interaction using the MI method; otherwise use the offset method.
The stratified two-phase case-control design considered in this paper can lend itself to any large epidemiological cohort study. One example is the Atherosclerosis Risk in Communities (ARIC) study [48], where a cohort of 15,792 subjects were sampled from four U.S. communities and were followed for 10 years for the development of various heart diseases. Numerous variables were collected and DNA specimen is available for all subjects to obtain genetic information upon request. Since the specimen is in limited quantity, it would not be feasible to request DNA specimen for the entire cohort to conduct a gene-environment interaction study. Case-control or other similar designs are ideal choices in this situation. A stratified case-cohort study has been conducted on ARIC data to identify risk factors of incident coronary heart disease [49]. It would be interesting to study additive gene-environment interaction effect on risks of various heart diseases using ARIC data with a two-phase stratified case-control design. The estimation methods investigated in this paper can be applied to such studies.
The stratification used in the design ensures balanced numbers of cases and controls in each stratum. As a result, it avoids potential efficiency loss due to highly unbalanced distribution of the stratification variable between sampled cases and controls. This design is similar to the exposure enriched sampling designs [50, 51] in that phase two sampling is stratified by one of the risk factors of interest (e.g. exposure factor) with stratum-specific sampling probabilities. However, they differ in that the design in this paper aims to oversample cases and achieve balance between case and control in each stratum whereas the latter design aims to oversample the rare exposure strata. It would be interesting to study how an outcome and exposure enriched sampling design would affect the efficiency for estimating additive interactions.
Some existing large epidemiology studies have ascertained risk factors using a population-based stratified two-phase case-control sampling approach. An example is the U.S. Kidney Cancer Study, a population-based stratified case-control study of the association between hypertension and renal cell carcinoma in black and white men and women [52]. The design considered in this paper is different from the population-based stratified case-control (PBSCC) design [53]. Among many differences between them, two have important implications for additive interaction estimation. First, the marginal disease probability that is needed to estimate the additive interaction can be easily estimated in the two-phase design when a prospectively sampled parent cohort is available in phase one. In PBSCC, the marginal disease probability is more difficult to estimate since the controls are selected directly from the population, whose size is typically unknown and therefore needs to be estimated under a possibly complex sampling scheme. Potential sampling frame deficiencies can also lead to a biased estimate of the population size. Second, stratified simple random sampling is used to select controls from the parent cohort in the two-phase design, whereas clustered sampling could be used in PBSCC to choose controls from the population, which complicates the variance estimation of the additive interaction due to the intra-cluster correlation among subjects. The methods studied in this paper can potentially be extended to the PBSCC design provided that the stratum-specific marginal disease probability can be reliably estimated, and the intra-cluster correlation is properly accounted for in the variance estimation.
Several extensions of our work are possible. In this paper, we focused on estimating additive interaction effects and their standard errors. We have not examined hypothesis tests, power and type I errors. These properties will be examined in future work and compared with other existing inferential procedures for additive interactions. There is a strong line of research on efficient estimation of gene-environment interaction in case-control studies exploiting the assumption of gene-environment independence [54–60]. These studies have focused on population-based case-control designs and/or assumed a rare disease. It would be interesting to extend these methods to the estimation of additive interaction under a stratified two-phase case-control design for non-rare diseases. Another extension is to consider time-to-event outcomes. Under the survival analysis framework, the analogous design is the nested case-control design [61] where all individuals with events are sampled and, at each event time, a person from the risk set is randomly sampled and may be matched to the corresponding event by genetic or environmental factors. There are two potential modeling strategies for the nested case-control design: the proportional hazards model and the additive hazards model [62]. In either of these models, one needs to estimate the main effect of the matching variables and the baseline hazard function to obtain an estimate of the additive interaction. We will investigate these methods in future work.
The R code implementing the estimation methods investigated in this paper is available from the authors upon request.
Supplementary Material
Acknowledgment
We thank Dr. Allan Halpern for providing us access to data from the study of nevi in children. This work was supported through grants P30 CA008748, R01 CA137420 and R01 CA197402 from the National Cancer Institute, USA, and grant UL1RR024996 from the Clinical and Translational Science Center at Weill Cornell Medicine, New York, USA. The content is solely the responsibility of the authors and does not represent the official views of the National Institutes of Health.
The majority of the work in this paper was done when Ai Ni was at the Department of Epidemiology and Biostatistics of Memorial Sloan Kettering Cancer Center, New York, USA.
Appendix
Derivation of variance estimator in the offset method
In essence, including estimated offsets in the likelihood function (4) when estimating regression coefficients is equivalent to jointly estimating the distribution parameters of some functions of the offsets and the regression coefficients. Define ηh1 as the true probability of a case being in stratum h and define ηh0 similarly for controls. Thus ηhd (d = 0, 1) is the parameter vector of a multinomial distribution. The offsets can be written as Then the log-likelihood function (5) can be re-parameterized for parameters η and β and written as Let be the true values and maximum likelihood estimates of respectively. By Taylor expansion of the first partial derivative of the log-likelihood at the true parameters we get
Using the fact that we have that
| (11) |
By the central limit theorem, converges to a normal distribution with mean zero and covariance matrix Let with d = 0 for controls and d = 1 for cases. Then, converges to a normal distribution with mean zero and covariance matrix (qij) where qij = ηid(1 − ηid) if i = j and qij = −ηidηjd if i ≠ j. Plug in the estimates of these covariance matrices to (11) and with some algebra we arrive at
where Z is the design matrix, V is an n × n diagonal matrix with element with being the estimated disease probability for subject i in stratum h,
where with eh being an n×1 vector with ones in locations corresponding to the hth stratum and zeros elsewhere, and
Table A1:
Numbers of cases and controls in the fourteen SSI strata in the SONIC study
| Strata | N case | N control | Total |
|---|---|---|---|
| 1, SSI ∈ [0, 0.2) | 4 | 22 | 26 |
| 2, SSI ∈ [0.2, 0.4) | 6 | 14 | 20 |
| 3, SSI ∈ [0.4, 0.6) | 12 | 17 | 29 |
| 4, SSI ∈ [0.6, 0.8) | 14 | 10 | 24 |
| 5, SSI ∈ [0.8, 1.0) | 3 | 4 | 7 |
| 6, SSI ∈ [1.0, 1.2) | 5 | 15 | 20 |
| 7, SSI ∈ [1.2, 1.4) | 7 | 24 | 31 |
| 8, SSI ∈ [1.4, 1.6) | 14 | 54 | 68 |
| 9, SSI ∈ [1.6, 1.8) | 17 | 59 | 76 |
| 10, SSI ∈ [1.8, 2.0) | 5 | 20 | 25 |
| 11, SSI ∈ [2.0, 2.2) | 1 | 8 | 9 |
| 12, SSI ∈ [2.2, 2.4) | 1 | 30 | 31 |
| 13, SSI ∈ [2.4, 2.6) | 1 | 24 | 25 |
| 14, SSI ∈ [2.6, 3.0] | 1 | 51 | 52 |
| Total | 91 | 352 | 443 |
Footnotes
Statement of Ethics
The two epidemiology studies used in this paper were approved by the corresponding institutional review boards and informed consent was obtained from all participants.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- [1].Neyman J, “Contribution to the theory of sampling human populations,” Journal of the American Statistical Association, vol. 33, no. 201, pp. 101–116, 1938. [Google Scholar]
- [2].Scott AJ and Wild CJ, “Fitting regression models to case-control data by maximum likelihood,” Biometrika, vol. 84, no. 1, pp. 57–71, 1997. [Google Scholar]
- [3].Breslow NE and Chatterjee N, “Design and analysis of two-phase studies with binary outcome applied to wilms tumour prognosis,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 48, no. 4, pp. 457–468, 1999. [Google Scholar]
- [4].White JE, “A two stage design for the study of the relationship between a rare exposure and a rare disease,” American Journal of Epidemiology, vol. 115, no. 1, pp. 119–128, 1982. [DOI] [PubMed] [Google Scholar]
- [5].Legg JC and Fuller WA, “Two-phase sampling,” in Handbook of statistics, vol. 29, pp. 55–70, Elsevier, 2009. [Google Scholar]
- [6].Moslehi R, Chatterjee N, Church TR, Chen J, Yeager M, Weissfeld J, Hein DW, and Hayes RB, “Cigarette smoking, n-acetyltransferase genes and the risk of advanced colorectal adenoma,” Pharmacogenomics, vol. 7, pp. 819–829, 2006. [DOI] [PubMed] [Google Scholar]
- [7].Setiawan VW, Doherty JA, Shu X.-o., Akbari MR, Chen C, De Vivo I, DeMichele A, Garcia-Closas M, Goodman MT, Haiman CA, et al. , “Two estrogen-related variants in cyp19a1 and endometrial cancer risk: a pooled analysis in the epidemiology of endometrial cancer consortium,” Cancer Epidemiology and Prevention Biomarkers, vol. 18, no. 1, pp. 242–247, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Cordell HJ, “Detecting gene-gene interactions that underlie human diseases,” Nature Reviews Genetics, vol. 10, pp. 392–404, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Izumi S, Sakata R, Yamada M, and Cologne J, “Interaction between a single exposure and age in cohort-based hazard rate models impacted the statistical distribution of age at onset,” Journal of clinical epidemiology, vol. 71, pp. 43–50, 2016. [DOI] [PubMed] [Google Scholar]
- [10].Thomas D, “Methods for investigating gene-environment interactions in candidate pathway and genome-wide association studies,” Annual review of public health, vol. 31, pp. 21–36, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Elston RC, “On additivity in the analysis of variance,” Biometrics, vol. 17, no. 2, pp. 209–219, 1961. [Google Scholar]
- [12].Rothman KJ, Greenland S, and Lash TL, Modern Epidemiology. Lippincott, Williams and Wilkins: Philadelphia, PA, 2008. [Google Scholar]
- [13].Begg CB and Zhang Z, “Statistical analysis of molecular epidemiology studies employing case-series,” Cancer Epidemiology and Prevention Biomarkers, vol. 3, no. 2, pp. 173–175, 1994.
- [14].Kalilani L and Atashili J, “Measuring additive interaction using odds ratios,” Epidemiologic Perspectives & Innovations, vol. 3, no. 1, p. 5, 2006.
- [15].Karlson EW, Chang S-C, Cui J, Chibnik LB, Fraser PA, De Vivo I, and Costenbader KH, “Gene–environment interaction between HLA-DRB1 shared epitope and heavy cigarette smoking in predicting incident rheumatoid arthritis,” Annals of the Rheumatic Diseases, vol. 69, no. 1, pp. 54–60, 2010.
- [16].Kiyohara C and Washio M, “The role of gene–environment interaction in the etiology of SLE,” in Epidemiological Studies of Specified Rare and Intractable Disease, pp. 147–162, Springer, 2019.
- [17].Meidtner K, Podmore C, Kröger J, Van Der Schouw YT, Bendinelli B, Agnoli C, Arriola L, Barricarte A, Boeing H, Cross AJ, et al., “Interaction of dietary and genetic factors influencing body iron status and risk of type 2 diabetes within the EPIC-InterAct study,” Diabetes Care, vol. 41, no. 2, pp. 277–285, 2018.
- [18].Zou GY, “On the estimation of additive interaction by use of the four-by-two table and beyond,” American Journal of Epidemiology, vol. 168, pp. 212–224, 2008.
- [19].Han SS, Rosenberg PS, Garcia-Closas M, Figueroa JD, Silverman D, Chanock SJ, Rothman N, and Chatterjee N, “Likelihood ratio test for detecting gene (G)-environment (E) interactions under an additive risk model exploiting G-E independence for case-control data,” American Journal of Epidemiology, vol. 176, no. 11, pp. 1060–1067, 2012.
- [20].Flanders WD and Greenland S, “Analytic methods for two-stage case-control studies and other stratified designs,” Statistics in Medicine, vol. 10, no. 5, pp. 739–747, 1991.
- [21].Korn E and Graubard B, “Analysis of large health surveys: accounting for the sampling design,” Journal of the Royal Statistical Society. Series A. Statistics in Society, vol. 158, no. 2, pp. 263–295, 1995.
- [22].Breslow NE and Holubkov R, “Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 59, no. 2, pp. 447–461, 1997.
- [23].Chatterjee N, Chen Y-H, and Breslow NE, “A pseudoscore estimator for regression problems with two-phase sampling,” Journal of the American Statistical Association, vol. 98, no. 461, pp. 158–168, 2003.
- [24].Lee A, Scott A, and Wild C, “Efficient estimation in multi-phase case-control studies,” Biometrika, vol. 97, no. 2, pp. 361–374, 2010.
- [25].Tao R, Zeng D, and Lin D-Y, “Efficient semiparametric inference under two-phase sampling, with applications to genetic association studies,” Journal of the American Statistical Association, vol. 112, no. 520, pp. 1468–1476, 2017.
- [26].Amorim G, Scott AJ, and Wild CJ, “Multi-phase sampling,” Handbook of Statistical Methods for Case-Control Studies, 2018.
- [27].Espin-Garcia O, Craiu RV, and Bull SB, “Two-phase designs for joint quantitative-trait-dependent and genotype-dependent sampling in post-GWAS regional sequencing,” Genetic Epidemiology, vol. 42, no. 1, pp. 104–116, 2018.
- [28].Shen C, “Application of multiple imputation to data from two-phase sampling: Estimation of the incidence rate of cognitive impairment,” Journal of Data Science, vol. 5, no. 4, pp. 503–518, 2007.
- [29].Marti H and Chavance M, “Multiple imputation analysis of case–cohort studies,” Statistics in Medicine, vol. 30, no. 13, pp. 1595–1607, 2011.
- [30].Enders D, Kollhorst B, Engel S, Linder R, and Pigeot I, “Comparison of multiple imputation and two-phase logistic regression to analyse two-phase case–control studies with rich phase 1: a simulation study,” Journal of Statistical Computation and Simulation, vol. 88, no. 11, pp. 2201–2214, 2018.
- [31].Oliveria SA, Satagopan JM, Geller AC, Dusza SW, Weinstock MA, Berwick M, Bishop M, Heneghan MK, and Halpern AC, “Study of nevi in children (SONIC): baseline findings and predictors of nevus count,” American Journal of Epidemiology, vol. 169, no. 1, pp. 41–53, 2009.
- [32].Xu H, Marchetti MA, Dusza SW, Chung E, Fonseca M, Scope A, Geller AC, Bishop M, Marghoob AA, and Halpern AC, “Factors in early adolescence associated with a mole-prone phenotype in late adolescence,” JAMA Dermatology, vol. 153, no. 10, pp. 990–998, 2017.
- [33].Agresti A, Categorical Data Analysis, vol. 482, John Wiley & Sons, 2003.
- [34].Wang X, Elston RC, and Zhu X, “The meaning of interaction,” Human Heredity, vol. 70, pp. 269–277, 2010.
- [35].Breslow N and Cain K, “Logistic regression for two-stage case-control data,” Biometrika, vol. 75, no. 1, pp. 11–20, 1988.
- [36].Breslow N, Zhao L, Fears TR, and Brown CC, “Logistic regression for stratified case-control studies,” Biometrics, vol. 44, no. 3, pp. 891–899, 1988.
- [37].Hsieh DA, Manski CF, and McFadden D, “Estimation of response probabilities from augmented retrospective observations,” Journal of the American Statistical Association, vol. 80, no. 391, pp. 651–662, 1985.
- [38].Rubin DB, Multiple Imputation for Nonresponse in Surveys, vol. 81, John Wiley & Sons, 1987.
- [39].Schenker N and Taylor JM, “Partially parametric techniques for multiple imputation,” Computational Statistics & Data Analysis, vol. 22, no. 4, pp. 425–446, 1996.
- [40].Lavori PW, Dawson R, and Shera D, “A multiple imputation strategy for clinical trials with truncation of patient data,” Statistics in Medicine, vol. 14, no. 17, pp. 1913–1925, 1995.
- [41].Schafer JL, Analysis of Incomplete Multivariate Data. CRC Press, 1997.
- [42].van Buuren S and Groothuis-Oudshoorn K, “mice: Multivariate imputation by chained equations in R,” Journal of Statistical Software, vol. 45, no. 3, pp. 1–67, 2011.
- [43].Moons KG, Donders RA, Stijnen T, and Harrell FE, “Using the outcome for imputation of missing predictor values was preferred,” Journal of Clinical Epidemiology, vol. 59, no. 10, pp. 1092–1101, 2006.
- [44].Rubin DB, “Multiple imputation after 18+ years,” Journal of the American Statistical Association, vol. 91, no. 434, pp. 473–489, 1996.
- [45].Olson SH, Chen C, De Vivo I, Doherty JA, Hartmuller V, Horn-Ross PL, Lacey JV, Lynch SM, Sansbury L, Setiawan VW, et al., “Maximizing resources to study an uncommon cancer: E2C2 - Epidemiology of Endometrial Cancer Consortium,” Cancer Causes & Control, vol. 20, no. 4, pp. 491–496, 2009.
- [46].Whittemore AS, “Multistage sampling designs and estimating equations,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 59, no. 3, pp. 589–602, 1997.
- [47].Little RJ and Rubin DB, Statistical Analysis with Missing Data, vol. 333, John Wiley & Sons, 2014.
- [48].ARIC Investigators, “The Atherosclerosis Risk in Communities (ARIC) study: Design and objectives,” American Journal of Epidemiology, vol. 129, no. 4, pp. 687–702, 1989.
- [49].Ballantyne CM, Hoogeveen RC, Bang H, Coresh J, Folsom AR, Heiss G, and Sharrett AR, “Lipoprotein-associated phospholipase A2, high-sensitivity C-reactive protein, and risk for incident coronary heart disease in middle-aged men and women in the Atherosclerosis Risk in Communities (ARIC) study,” Circulation, vol. 109, no. 7, pp. 837–842, 2004.
- [50].Stenzel SL, Ahn J, Boonstra PS, Gruber SB, and Mukherjee B, “The impact of exposure-biased sampling designs on detection of gene–environment interactions in case–control studies with potential exposure misclassification,” European Journal of Epidemiology, vol. 30, no. 5, pp. 413–423, 2015.
- [51].Sun Z, Mukherjee B, Estes JP, Vokonas PS, and Park SK, “Exposure enriched outcome dependent designs for longitudinal studies of gene–environment interaction,” Statistics in Medicine, vol. 36, no. 18, pp. 2947–2960, 2017.
- [52].Colt JS, Schwartz K, Graubard BI, Davis F, Ruterbusch J, DiGaetano R, Purdue M, Rothman N, Wacholder S, and Chow W-H, “Hypertension and risk of renal cell carcinoma among white and black Americans,” Epidemiology (Cambridge, Mass.), vol. 22, no. 6, p. 797, 2011.
- [53].Hancock DB and Scott WK, “Population-based case-control association studies,” Current Protocols in Human Genetics, vol. 74, no. 1, pp. 1–17, 2012.
- [54].Piegorsch WW, Weinberg CR, and Taylor JA, “Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies,” Statistics in Medicine, vol. 13, pp. 153–162, 1994.
- [55].Umbach DM and Weinberg CR, “Designing and analysing case-control studies to exploit independence of genotype and exposure,” Statistics in Medicine, vol. 16, no. 15, pp. 1731–1743, 1997.
- [56].Chatterjee N and Carroll RJ, “Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies,” Biometrika, vol. 92, no. 2, pp. 399–418, 2005.
- [57].Mukherjee B and Chatterjee N, “Exploiting gene-environment independence for analysis of case–control studies: an empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency,” Biometrics, vol. 64, no. 3, pp. 685–694, 2008.
- [58].Han SS, Rosenberg PS, Garcia-Closas M, Figueroa JD, Silverman D, Chanock SJ, Rothman N, and Chatterjee N, “Likelihood ratio test for detecting gene (G)-environment (E) interactions under an additive risk model exploiting G-E independence for case-control data,” American Journal of Epidemiology, vol. 176, no. 11, pp. 1060–1067, 2012.
- [59].Chen J, Kang G, VanderWeele T, Zhang C, and Mukherjee B, “Efficient designs of gene–environment interaction studies: implications of Hardy–Weinberg equilibrium and gene–environment independence,” Statistics in Medicine, vol. 31, no. 22, pp. 2516–2530, 2012.
- [60].Liu G, Mukherjee B, Lee S, Lee AW, Wu AH, Bandera EV, Jensen A, Rossing MA, Moysich KB, Chang-Claude J, et al., “Robust tests for additive gene-environment interaction in case-control studies using gene-environment independence,” American Journal of Epidemiology, vol. 187, no. 2, pp. 366–377, 2017.
- [61].Thomas D, “Addendum to: methods of cohort analysis: appraisal by application to asbestos mining,” Journal of the Royal Statistical Society. Series A (General), pp. 469–491, 1977.
- [62].Lin D and Ying Z, “Semiparametric analysis of the additive risk model,” Biometrika, vol. 81, no. 1, pp. 61–71, 1994.