Summary
Case-control studies are prone to low power for testing gene-environment interactions (GXE) given the need for a sufficient number of individuals in each stratum of disease, gene, and environment. We propose a new study design to increase power by strategically pooling biospecimens. Pooling biospecimens allows us to increase the number of subjects significantly, thereby providing a substantial increase in power. We focus on a special, though realistic, case where disease and environmental statuses are binary, and gene status is ordinal with each individual having 0, 1, or 2 minor alleles. Through pooling, we obtain an allele frequency for each level of disease and environmental status. Using the allele frequencies, we develop new methodology for estimating and testing GXE that is comparable to the situation when we have complete data on gene status for each individual. We also explore the measurement process and its effect on the GXE estimator. We illustrate the effectiveness of pooling with an epidemiologic study that tests an interaction between fiber and PON1 on anovulation. Through simulation, we show that taking 12 pooled measurements from 1000 individuals achieves more power than individually genotyping 500 individuals. Our findings suggest that strategic pooling should be considered when an investigator designs a pilot study to test for a GXE.
Keywords: Allele frequency measurements, case-control study, gene-environment interaction, pooling, power
1. Introduction
In studying complex diseases such as infertility, cancer, heart disease, and obesity, gene-environment interactions are critical hypotheses [1-3]. One important implication of unmasking gene-environment interactions is the identification of highly susceptible populations, so that modifiable exposures that cause disease can be prevented. For example, the Clean Air Act mandates that the U.S. Environmental Protection Agency set standards to protect children, the elderly, those with chronic disease, and genetically susceptible individuals [3]. However, genetically susceptible individuals cannot be identified unless there exists a better understanding of how genes interact with individuals' environments.
The case-control study design is traditionally used to estimate gene-environment interactions because it is convenient both in terms of cost and study duration, particularly in studying rare diseases [4]. However, the traditional case-control interaction estimator is known to require a large sample size to obtain reasonable statistical power [5, 6]. This is due to the large number of individuals necessary to obtain sufficient information on each level of disease, environment, and genotype. Unfortunately, genotyping a large number of individuals can be expensive and time consuming. While in recent years the cost of measuring a candidate gene for individuals has become less of a burden, the power to detect an interaction will always benefit when measurement costs decrease and the savings are redirected to recruitment. Moreover, in certain situations the investigator might not have access to enough biospecimen from each individual to measure the genotype with current technology. Hence, cost and biospecimen availability often leave many interesting gene-environment interaction hypotheses unexplored.
As a cost and biospecimen saving alternative, multiple authors have proposed pooling specimens to increase statistical power [7-10]. The first attempt to pool DNA in a case-control study was reported by [11], and since then pooling has become popular in genome-wide studies for estimating allele frequencies within a pool [10, 12, 13].
If an investigator has access to a large number of stored biospecimen samples, then pooling methods would be very useful in preliminary assessments of gene-environment interaction hypotheses. Additionally, with new genotyping technology constantly emerging, an investigator might choose to pool biospecimens and use an expensive and accurate method rather than a less expensive and less accurate alternative.
In this paper we explore a novel pooling strategy, where we pool within disease and environmental status to increase statistical power to test a gene-environment interaction hypothesis. First we propose a strategic pooling method that provides a gene-environment interaction estimator. Next we discuss the pooling laboratory process and provide an example of how pooled measurements are performed. Then we discuss the effect of the pooling process on the gene-environment interaction estimator and its efficiency. Subsequently, we explore the consequences of pooling in the BioCycle study of the human serum paraoxonase (PON1) polymorphism and its interaction with dietary fiber consumption on anovulation. Then we perform a simulation study to evaluate the power and coverage probability of the pooled gene-environment interaction estimator under different measurement errors and model assumptions. Lastly, we provide a discussion and a small simulation study for estimating the gene-environment interaction adjusted for categorical covariates.
2. Pooled Gene-Environment Interaction Estimator
While these methods can be extended, for illustration, we focus on a special, though realistic case-control study, where the environmental variable of interest is binary (e.g. smoking status) and the data for genotype is ordinal. Let D represent disease status, where D = 1 represents individuals with disease presence (case), and D = 0 represents without (control). Let E represent environmental exposure status, where E = 1 represents with an exposure, and E = 0 represents without. We assume that we are exploring a biallelic single nucleotide polymorphism (SNP) at a given SNP locus. Therefore the SNP genotype consists of two alleles, one from each chromosome at a known location, where one allele is of interest. Therefore, for each individual, G = 0, 1, or 2 represents none, one, or two copies of the allele of interest respectively. Let the number of individuals with disease and environmental status D and E be rDE, and the number of individuals with disease, environmental, and genotype status D, E and G be nDEG.
Defining the genotype status as ordinal implies that we are assuming the multiplicative or additive model as the underlying genetic model as opposed to a dominant or recessive model. When the genetic model is assumed to be a dominant or recessive model, little methodological development is necessary, as will be discussed later. Moreover, we define the genotype status of an individual using a biallelic SNP. Therefore, we assume that the genotype comprises a pair of alleles at a single known locus, and for each individual there are two possible copies of the SNP location, one per chromosome. We assume that the pair of alleles are independent, and the probability of receiving a SNP copy from the father is identical to the probability of receiving it from the mother.
Assuming that the disease and environmental statuses are known before genotyping, which is typically the situation in a case control study, we propose pooling the individuals' biospecimens within all four levels of D and E and then performing R replicate measurements of the frequency within each of the four levels. Therefore we obtain R replicate measurements at each level of D and E (4R overall), rather than obtaining rDE individual genotypes. The pooled measurement will yield an estimate of the minor allele frequency of interest which we denote by fDE. By pooling, we no longer observe nDEG, rather we observe an estimate of fDE, and we know rDE (the pool size). Using estimates of the pooled information, fDE and rDE, we are able to estimate the gene-environment interaction as if we had observed nDEG individual genotypes.
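Let the allele status (A) be binary where A = 1 represents the allele of interest, and A = 0 otherwise. Assume that the disease is related to the allele and environmental statuses by
$$\operatorname{logit}\,\Pr(D = 1 \mid A, E) = \alpha + \beta_E E + \beta_A A + \beta_{A \times E}\, A E. \qquad (1)$$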
Therefore βAxE is the allele-environment interaction parameter. While it is immediately clear that the maximum likelihood estimator for βAxE has a closed form expression which only depends on the values of fDE, it is not immediately clear how an allele-environment interaction relates back to a gene-environment interaction. For example an allele-environment interaction might be interpreted as having a specified allele and environmental status which increases the risk for a disease; however since an individual has a pair of alleles, one on each chromosome, the allele-environment interaction does not directly relate back to the individual. We need to formulate how we can relate an allele-environment interaction to a gene-environment interaction.
Define Ai (i = 1, 2) as the ith allele for an individual where Ai = 1 if the ith allele is the allele of interest, and Ai = 0 otherwise. Under the assumed additive model, for each individual G is the sum of two alleles G = A1 + A2. If A1 and A2 are independent within disease and environmental status, and the probability of having the disease with A1 is equivalent to having it with A2, then we can use the set based logistic model of [7], where each individual is a set of two alleles. The assumption of independence between A1 and A2, and the assumption that they are equally probable are common assumptions in genetic epidemiology. The assumptions imply that the parents' genotypes are independent, and the child is equally likely to inherit the gene from the mother as they are from the father. Using the set based logistic model (derived in section A of the supplementary material), we relate disease to genotype and environmental status by
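$$\operatorname{logit}\,\Pr(D = 1 \mid G, E) = \tilde{\alpha} + \tilde{\beta}_E E + \beta_G G + \beta_{G \times E}\, G E. \qquad (2)$$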
The parameters α̃ and β̃E are defined in section A of the supplementary material. The gene-environment interaction parameter of interest, βGxE, is equivalent to the allele-environment interaction (βGxE = βAxE). Therefore, estimators of βAxE using allele information will be equivalent to estimators of βGxE obtained using full information on individual genotypes. Hence, inference of the allele-environment interaction (βAxE) is exactly the same as the inference of the gene-environment interaction (βGxE).
It is important to note that using (1) to estimate the gene-environment interaction is not equivalent to using the set based model as proposed by [7]. In their paper they developed a method which assumes that the pooled exposure measurement is the average of the individuals' exposure. This method can be adapted to handle allele frequencies (fDE), which are half of the average genotype status (G) within a pool. However, if one adapted this method, the estimator for the gene-environment interaction would not be in closed form as it is in (1), and more pools would be required for estimation. If the dominant or recessive genotype models were assumed, the genotype status (G) would be binary and thus the interaction estimator in (2) would be in closed form and only depend on estimates of Pr(G = 1|D,E). If one were to assume Hardy-Weinberg equilibrium so that Pr(A = 1|D,E) = fDE, then Pr(G = 1|D,E) = 1 − (1 − fDE)² for the dominant model, and Pr(G = 1|D,E) = fDE² for the recessive model. Misspecification of the genotype model will bias our pooled gene-environment interaction estimator in the same way that misspecification would bias the traditional gene-environment interaction estimator.
Through the pooling process described in the next section, we have data on the alleles and therefore we can use (1), and obtain the closed form maximum likelihood estimate of βAxE,
$$\hat{\beta}_{A \times E} = \tilde{\gamma}_{11} - \tilde{\gamma}_{10} - \tilde{\gamma}_{01} + \tilde{\gamma}_{00}, \qquad (3)$$
where γ̃DE = logit(f̃DE) = log{f̃DE / (1 − f̃DE)}, and f̃DE is the allele frequency in a sample of individuals with disease status D and environmental status E. However, due to measurement error inherent in the pooling process, we do not observe f̃DE; rather, we observe f̂DE(k), which is an estimate of the allele frequency in the pool with disease status D and environmental status E measured on replicate k, where k = 1, 2, 3, …, R, and R is the number of replicate measurements for each pool. We construct an estimate of the gene-environment interaction by averaging replicate allele frequencies on the logit scale. The intuition for taking the average over replicates is that a single estimate of each allele frequency within strata of disease and environment can be used for asymptotically unbiased estimation of (3); however, replication is necessary for estimation of the variance of the estimator. Therefore, we define an estimate of γ̃DE to be γ̂DE = (1/R) Σk logit(f̂DE(k)), where the sum is over k = 1, …, R. Using γ̂DE we define the estimator by,
$$\hat{\beta}_{G \times E} = \hat{\gamma}_{11} - \hat{\gamma}_{10} - \hat{\gamma}_{01} + \hat{\gamma}_{00}. \qquad (4)$$
Further explanation of the estimation process for the allele frequencies of the four pools (one pool on each level of disease and environmental status) is provided in the following sections.
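As a rough illustration of the closed-form calculation, the sketch below computes the allele frequency in each disease-by-environment stratum from genotype counts and then takes the contrast of logit allele frequencies across the four strata. The counts are those reported in Table 3; the function and variable names are illustrative choices, not notation from the paper, and no measurement error is involved at this stage.

```python
import math

# Genotype counts (G = 0, 1, 2) per (D, E) stratum, copied from Table 3.
counts = {
    (0, 0): (270, 312, 93),
    (0, 1): (85, 80, 19),
    (1, 0): (21, 41, 19),
    (1, 1): (27, 33, 0),
}


def allele_frequency(n0, n1, n2):
    """Share of the allele of interest among the 2 * r alleles in a stratum."""
    return (n1 + 2 * n2) / (2 * (n0 + n1 + n2))


def logit(p):
    return math.log(p / (1.0 - p))


# Allele frequency and its logit in each stratum.
freq = {de: allele_frequency(*n) for de, n in counts.items()}
gamma = {de: logit(f) for de, f in freq.items()}

# Interaction contrast on the logit scale: (1,1) - (1,0) - (0,1) + (0,0).
beta_axe = gamma[(1, 1)] - gamma[(1, 0)] - gamma[(0, 1)] + gamma[(0, 0)]

for de in sorted(freq):
    print(de, round(freq[de], 4))
print("allele-environment interaction:", round(beta_axe, 4))
```

The result is close to, but not identical with, the estimate from a logistic regression on individual genotypes, because it uses only the stratum-level allele frequencies.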
3. The Pooling Process
A number of methods have been proposed for measuring SNP allele frequencies of pools [12-18]. There are three main stages in obtaining a pooled measurement of the gene frequency. The first stage is called pool formation. In this stage, equal quantities of individuals' DNA must be combined to form the pool. The second stage is called PCR (polymerase chain reaction). In this stage, a targeted sequence is amplified. Lastly, there is the estimation stage. In this stage, using a methodology of choice, the minor allele frequency is estimated [17].
There are a number of measurement procedures available for estimating the frequency of a SNP [19-23]. For illustration, we describe the method used in [12]. It is straightforward to extend our estimation procedure to other methods, including those in [10, 18]. In the method described by [12], the allele frequency is determined by comparing the number of cycles of PCR amplifications (each PCR amplification cycle doubles the frequency) required before a reaction crosses a predetermined frequency threshold. For two alleles, two PCR amplifications are required, and the numbers of PCR amplification cycles are compared (Figure 1 illustrates this process). If one allele has a one cycle delay relative to the other, then the minor allele ratio is 1:2. If one allele has a two cycle delay relative to the other, then the minor allele ratio is 1:4. If we define ΔDE as the cycle delay for the pool on the level DE, then in general the allele ratio is 1:2^ΔDE. Therefore, the minor allele frequency in the population of individuals with disease status D and environmental status E can be estimated as
$$f_{DE} = \frac{1}{1 + 2^{\Delta_{DE}}}, \qquad (5)$$
where the number of cycles, and hence ΔDE, can be fractional.
Figure 1. Illustration of obtaining minor allele frequencies using PCR.
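The relationship between cycle delay and allele frequency can be sketched in a few lines. This assumes, as in the examples above, that a delay of one cycle corresponds to a minor-to-major allele ratio of 1:2 and a delay of two cycles to 1:4, so that the minor allele frequency is 1 / (1 + 2^delay). The function names and the argument name `delay` are illustrative only.

```python
import math


def freq_from_delay(delay):
    """Minor allele frequency implied by a (possibly fractional) cycle delay."""
    return 1.0 / (1.0 + 2.0 ** delay)


def delay_from_freq(freq):
    """Cycle delay implied by a minor allele frequency strictly between 0 and 1."""
    return math.log2((1.0 - freq) / freq)


# A one-cycle delay gives a 1:2 ratio, i.e. one minor allele in three.
print(freq_from_delay(1.0))
# A two-cycle delay gives a 1:4 ratio, i.e. one minor allele in five.
print(freq_from_delay(2.0))
# A delay of zero means both alleles are equally frequent.
print(freq_from_delay(0.0))
```

A negative delay simply means the allele of interest is the more frequent one in that pool, which is why simulated delays can fall below zero.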
4. The Pooled Gene-Environment Interaction Estimator with Measurement Error
After forming the four pools on each level of D and E, we obtain an estimate of the cycle difference, ΔDE, for each replicate measurement. Note that we define a replicate as repeating the entire measurement process, including forming the pools with equal quantities of specimen, PCR amplification, and cycle delay measurement. Let Δ̂DE(k) be the estimated cycle difference for the kth replicate of the pool with disease status D and environmental status E. Define,
$$\hat{\Delta}_{DE}(k) = \Delta_{DE} + \varepsilon_{DE}(k), \qquad (6)$$
where εDE(k) is the measurement error for the kth replicate. We assume εDE(k) ~ iid N(0, σ²DE) for k = 1, 2, 3, …, R. Using the pooled measurements, Δ̂DE(k), and their replicate means, Δ̄DE = (1/R) Σk Δ̂DE(k), we can estimate the interaction,
$$\hat{\beta}_{G \times E} = -\log(2)\,\left(\bar{\Delta}_{11} - \bar{\Delta}_{10} - \bar{\Delta}_{01} + \bar{\Delta}_{00}\right), \qquad (7)$$
and its variance by,
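$$\widehat{\operatorname{Var}}(\hat{\beta}_{G \times E}) = \sum_{D=0}^{1} \sum_{E=0}^{1} \left[ \frac{1}{2\, r_{DE}\, \hat{f}_{DE} (1 - \hat{f}_{DE})} + \frac{\{\log(2)\}^{2}\, s^{2}_{DE}}{R} \right], \qquad (8)$$
where s²DE is the sample variance of the pooled measurements Δ̂DE(k), and f̂DE = 1 / (1 + 2^Δ̄DE), with Δ̄DE the mean of the R replicate measurements for the pool. The derivations of this asymptotically unbiased estimator of the gene-environment interaction and of the estimator of its variance are given in sections B and C of the supplementary material. The expression for the variance depends on the frequency of each of the pools and the variance due to pooling. The variance estimator is shown to be positively biased in section C of the supplementary material.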
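One way to see why averaging on the logit scale reduces to averaging cycle delays is the short derivation below. It is a sketch under the assumption that the allele frequency in a pool is 1 / (1 + 2^Δ) for a cycle delay Δ; the symbols Δ, Δ̄ and σ² are our notation for the delay, its replicate mean, and the measurement error variance.

```latex
\begin{align*}
\operatorname{logit}(f_{DE})
  &= \log\frac{f_{DE}}{1 - f_{DE}}
   = \log\frac{1/(1 + 2^{\Delta_{DE}})}{2^{\Delta_{DE}}/(1 + 2^{\Delta_{DE}})}
   = -\Delta_{DE}\,\log 2, \\
\frac{1}{R}\sum_{k=1}^{R} \operatorname{logit}\bigl(\hat{f}_{DE}(k)\bigr)
  &= -\log(2)\,\bar{\Delta}_{DE}, \\
\operatorname{Var}\bigl(-\log(2)\,\bar{\Delta}_{DE} \mid \Delta_{DE}\bigr)
  &= \frac{\{\log 2\}^{2}\,\sigma^{2}_{DE}}{R}.
\end{align*}
```

Because the logit is linear in the cycle delay, normally distributed error on the delay scale stays normal on the logit scale, and its contribution to the variance shrinks at rate 1/R as replicates are added.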
5. Example
The BioCycle study is a prospective cohort study of 259 regularly menstruating women ages 18 to 44, who were followed for up to two menstrual cycles. Details of the study are described elsewhere [24]. Within the BioCycle study, Gaskins et al. found that a higher consumption of dietary fiber was significantly associated with lower concentrations of reproductive hormones and an increased risk of incident anovulation [25]. Also, the human serum paraoxonase (PON1) gene has been identified as a prime example of a gene/polymorphism that plays an important role in ‘environmental susceptibility’ to multifactorial diseases and disorders [26]. Therefore, a natural hypothesis is that there is an interaction effect of fiber and PON1 on anovulation. Table 2 shows the data, where all individuals in the study are genotyped and fiber intake is dichotomized as above or below the mean (there are two missing values).
Table 2. The BioCycle Data.
| Disease status | Fiber intake | Stratum size | G = 0 | G = 1 | G = 2 |
|---|---|---|---|---|---|
| D = 0 (No anovulatory cycles) | E = 0 (A fiber intake below the mean) | r00 = 173 | n000 = 66 | n001 = 81 | n002 = 26 |
| D = 0 (No anovulatory cycles) | E = 1 (A fiber intake above the mean) | r01 = 50 | n010 = 23 | n011 = 22 | n012 = 5 |
| D = 1 (At least one anovulatory cycle) | E = 0 (A fiber intake below the mean) | r10 = 21 | n100 = 7 | n101 = 10 | n102 = 4 |
| D = 1 (At least one anovulatory cycle) | E = 1 (A fiber intake above the mean) | r11 = 13 | n110 = 6 | n111 = 7 | n112 = 0 |
Using the data in Table 2, the gene-environment interaction estimate is -0.4459 (S.E. = 0.6082), with a p-value of 0.4634. Thus, the gene-environment interaction estimate is not significant. However, this study has very low power to detect an interaction if one were to exist. For illustration, we took a random sample of 1000 individuals with replacement from the BioCycle data. While the investigators were not able to observe 1000 individuals due to cost constraints, we took a random sample with replacement to show that the power of the interaction estimator could be improved if less funding were allocated to genotyping and more were spent on collecting a larger sample. Table 3 shows an example of one random sample of 1000 individuals with replacement from the BioCycle data.
Table 3. A Random Sample with Replacement of the BioCycle Data.
| Disease status | Fiber intake | Stratum size | G = 0 | G = 1 | G = 2 |
|---|---|---|---|---|---|
| D = 0 (No anovulatory cycles) | E = 0 (A fiber intake below the mean) | r00 = 675 | n000 = 270 | n001 = 312 | n002 = 93 |
| D = 0 (No anovulatory cycles) | E = 1 (A fiber intake above the mean) | r01 = 184 | n010 = 85 | n011 = 80 | n012 = 19 |
| D = 1 (At least one anovulatory cycle) | E = 0 (A fiber intake below the mean) | r10 = 81 | n100 = 21 | n101 = 41 | n102 = 19 |
| D = 1 (At least one anovulatory cycle) | E = 1 (A fiber intake above the mean) | r11 = 60 | n110 = 27 | n111 = 33 | n112 = 0 |
The data in Table 3 yield a gene-environment interaction estimate of -0.7236 (S.E. = 0.296), with a significant p-value of 0.014. This data is an illustration, and due to the combined cost of genotyping and collecting the larger sample, the investigator might not be able to collect a sample with 1000 individuals with individual measurements of genotype (G). On the other hand, the investigator might have enough funding to sample 1000 individuals if the cost due to genotyping was reduced significantly to 12 measurements, 3 repeated measurements of the allele frequency on each level of D and E, rather than 1000 measurements of G.
Using the random sample given in Table 3, we observe that f̃11 = 0.275, f̃10 = 0.4877, f̃01 = 0.3207, f̃00 = 0.3689. Using (5) we can solve for the cycle delay ΔDE at each level of D and E. Recall that Δ̂DE(k) = ΔDE + εDE(k), where εDE(k) is the measurement error for the kth replicate, k = 1, 2, 3. For illustration, we set σ²DE = 0.05, which is consistent with the measurement error seen in the literature [17]. By simulating a value of εDE(k) ~ iid N(0, 0.05) for each level of D, E, and k, we obtain the values presented in Table 4.
Table 4. Simulated Values of Δ̂DE(k) for the Random Sample with Replacement.
| Replicate | Pool D = 1, E = 1 | Pool D = 1, E = 0 | Pool D = 0, E = 1 | Pool D = 0, E = 0 |
|---|---|---|---|---|
| 1 | 1.2105 | 0.0869 | 0.9776 | 0.8056 |
| 2 | 1.7081 | 0.4539 | 0.9411 | 1.0492 |
| 3 | 1.1178 | -0.0636 | 1.0192 | 0.5954 |
Using the data in Table 4, we can solve (7) and (8) to estimate the gene-environment interaction and its standard error. We find that the gene-environment interaction estimate is -0.7097 (S.E. 0.343), with a significant p-value of 0.0387.
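The calculation for this example can be sketched as follows, using the pool sizes from Table 3 and the replicate measurements from Table 4. Two assumptions are made and should be read as such: the measurements are treated as cycle delays whose replicate mean converts to a logit allele frequency through a factor of -log(2), and the variance is taken as a binomial sampling term per pool plus a measurement error term equal to (log 2)² times the sample variance of the replicates divided by the number of replicates. The function name `pooled_interaction` is illustrative.

```python
import math
from statistics import mean, variance

LOG2 = math.log(2.0)

# Pool sizes r_DE (Table 3) and replicate measurements (Table 4), keyed by (D, E).
pool_size = {(1, 1): 60, (1, 0): 81, (0, 1): 184, (0, 0): 675}
delays = {
    (1, 1): [1.2105, 1.7081, 1.1178],
    (1, 0): [0.0869, 0.4539, -0.0636],
    (0, 1): [0.9776, 0.9411, 1.0192],
    (0, 0): [0.8056, 1.0492, 0.5954],
}
# Signs of the interaction contrast: (1,1) - (1,0) - (0,1) + (0,0).
SIGN = {(1, 1): 1, (1, 0): -1, (0, 1): -1, (0, 0): 1}


def pooled_interaction(delays, pool_size):
    """Return (estimate, standard error) from replicate pooled measurements."""
    beta, var = 0.0, 0.0
    for de, reps in delays.items():
        d_bar = mean(reps)
        f_hat = 1.0 / (1.0 + 2.0 ** d_bar)  # allele frequency implied by the mean delay
        beta += SIGN[de] * (-LOG2 * d_bar)  # logit of the frequency is -log(2) * delay
        # Assumed sampling variance of a logit frequency based on 2 * r alleles.
        var += 1.0 / (2 * pool_size[de] * f_hat * (1.0 - f_hat))
        # Assumed measurement error contribution from the replicate spread.
        var += LOG2 ** 2 * variance(reps) / len(reps)
    return beta, math.sqrt(var)


beta_hat, se_hat = pooled_interaction(delays, pool_size)
print(round(beta_hat, 4), round(se_hat, 3))
```

Under these assumptions the point estimate matches the reported value to four decimals, and the standard error agrees with the reported value to within about 0.001.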
This illustration shows that in a large sample from a population similar to the one that generated the data in the BioCycle Study, pooling can be implemented to increase the power of an estimator without increasing the cost of the study. We emphasize that this approach is only used to illustrate the performance of our design, and cannot be used to make statistical inference about the interaction of PON1 and fiber on anovulation. However, our illustration shows that if an investigator is interested in a particular gene-environment interaction, they could consider increasing their funding for recruiting individuals by decreasing their cost due to genotyping. If one adopts a pooling strategy such that the cost due to genotyping is reduced, the investigator is often able to allocate more funding to collecting a larger sample, and thus can substantially increase the power to test the gene-environment interaction hypothesis.
6. Simulations
We performed simulations to compare the statistical properties of the pooled gene-environment interaction estimator to the estimator we would obtain if we had data on individual genotypes. In the BioCycle anovulation example explored in the previous section, the allele frequency of the sample is 0.37, and the proportion of individuals with the environmental exposure is 0.25. Using these parameters, we simulate the alleles with Pr(A = 1) = 0.37, and then randomly pair the alleles for each genotype observation. The number of minor alleles (A=1) for each observation is defined as G (therefore G is 0, 1, or 2). Subsequently, we generate the environmental exposure with Pr(E = 1) = 0.25. Lastly, we set the following logistic regression equation to generate D with a dependence on E and G,
$$\operatorname{logit}\,\Pr(D = 1 \mid E, G) = \alpha + \beta_E E + \beta_G G + \beta_{G \times E}\, G E,$$
where α = -2.26, βE = 1.06, and βG = 0.18. These beta coefficients are obtained by fitting the above model to the BioCycle anovulation example. To explore the properties of our estimate of the gene-environment interaction, we vary βGxE between -1 and 1.
Once the cohort (disease, gene, and environmental status) was generated, we took a case-control sample (500 cases and 500 controls). Next, we pooled our sample on disease and environmental status to find the minor allele frequency (f̃DE) for each of the four pools (four levels of disease and environmental status). Once we had the simulated frequencies for our case-control sample, using (5) we found the value of ΔDE for each level of D and E. Lastly, we added measurement error using Δ̂DE(k) = ΔDE + εDE(k), where εDE(k) is the measurement error of the kth replicate (εDE(k) ~ N(0, σ²DE), for k = 1, 2, 3, …, R). In our simulation we defined R to be 3 and σ²DE = 0.05 for all four levels of disease and environmental status. This variance is consistent with the measurement error seen in the literature [17].
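The data-generating steps described above can be sketched as a single function. This is a minimal illustration under stated assumptions rather than the simulation code used for the paper: the disease model is taken to be a logistic regression on E, G and their product, the cohort size of 20000 and the seed are arbitrary choices, the measurement is assumed to be a cycle delay equal to log2 of the major-to-minor allele ratio, and the function name is hypothetical.

```python
import math
import random


def simulate_pooled_delays(beta_gxe, n_cohort=20000, n_cases=500, n_controls=500,
                           p_allele=0.37, p_exposed=0.25, alpha=-2.26,
                           beta_e=1.06, beta_g=0.18, sigma2=0.05, n_rep=3, seed=1):
    """Simulate replicate pooled measurements for the four (D, E) pools."""
    rng = random.Random(seed)
    cases, controls = [], []
    for _ in range(n_cohort):
        # Two independent alleles are paired to form the genotype G in {0, 1, 2}.
        g = int(rng.random() < p_allele) + int(rng.random() < p_allele)
        e = int(rng.random() < p_exposed)
        eta = alpha + beta_e * e + beta_g * g + beta_gxe * g * e
        diseased = rng.random() < 1.0 / (1.0 + math.exp(-eta))
        (cases if diseased else controls).append((e, g))

    # Case-control sampling from the simulated cohort.
    sample = {1: rng.sample(cases, n_cases), 0: rng.sample(controls, n_controls)}

    delays, sizes = {}, {}
    for d, people in sample.items():
        for e in (0, 1):
            genotypes = [g for (exposure, g) in people if exposure == e]
            freq = sum(genotypes) / (2 * len(genotypes))  # pool allele frequency
            # Assumes 0 < freq < 1, which holds for pools of realistic size.
            true_delay = math.log2((1.0 - freq) / freq)
            delays[(d, e)] = [true_delay + rng.gauss(0.0, math.sqrt(sigma2))
                              for _ in range(n_rep)]
            sizes[(d, e)] = len(genotypes)
    return delays, sizes


delays, sizes = simulate_pooled_delays(beta_gxe=0.0)
for pool in sorted(delays):
    print(pool, sizes[pool], [round(x, 3) for x in delays[pool]])
```

Repeating this over many seeds and values of `beta_gxe`, and applying the pooled estimator to each simulated data set, is one way to build Monte Carlo summaries of bias, variance, and coverage.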
Under this simulation scenario, Table 5 shows that the gene-environment interaction estimator is unbiased for the true gene-environment interaction. The variance estimators are very close to the empirical variances of the gene-environment interaction estimators in the Monte Carlo simulation. However there is a trend for the variance estimators to be slightly positively biased. This slight positive bias is in accordance with our theoretical results, and yields coverage probabilities which are slightly smaller than 0.95.
Table 5. The Mean of the Sample Variances.
| βGxE | Mean β̂GxE (95% CI) | Empirical Variance of β̂GxE | Mean V̂ar(β̂GxE) (95% CI) | Coverage Probability (level = .05) (95% CI) |
|---|---|---|---|---|
| -.5 | -0.5017 (-0.5094, -0.4940) | 0.0764 | 0.0759 (0.0755, 0.0764) | 0.9416 (0.9351, 0.9481) |
| -.3 | -0.2996 (-0.3072, -0.2920) | 0.0757 | 0.0757 (0.0752, 0.0761) | 0.9432 (0.9367, 0.9496) |
| 0 | 0.0023 (-0.0052, 0.0098) | 0.0732 | 0.0756 (0.0751, 0.0760) | 0.9478 (0.9413, 0.9540) |
| .3 | 0.2926 (0.2851, 0.3002) | 0.0744 | 0.0768 (0.0764, 0.0773) | 0.9508 (0.9443, 0.9568) |
| .5 | 0.4915 (0.4838, 0.4992) | 0.0773 | 0.0785 (0.0781, 0.0790) | 0.9470 (0.9405, 0.9532) |
From left to right the columns are: The true simulated gene-environment interaction. The mean of the Monte Carlo simulated gene-environment interaction estimators. The empirical variance of the gene-environment interaction estimators in the Monte Carlo simulation. The mean of the Monte Carlo simulated variance estimators. The empirical coverage probability at level .05. The number of Monte Carlo replicates is 5000.
Once we had estimated the gene-environment interaction and the variance of the estimator, we were able to compute the power function using a two-sided Wald test (i.e. a test based on a standard normal assumption for β̂GxE / √V̂ar(β̂GxE) under the null hypothesis). We have shown that the pooled estimator of the gene-environment interaction is unbiased. Therefore, compared with the traditional estimator, which is also unbiased, the power of the pooled estimator will depend on the distribution of the estimated variance of the estimator, V̂ar(β̂GxE). As the number of replicate measurements, R, increases, the extra term in the variance estimator decreases towards zero. Therefore, for a large number of replicates, the power of the pooled estimator is nearly identical to the power of the traditional estimator. However, the purpose of pooling is to allow the number of measurements, R, to be as small as possible such that we achieve reasonable power for an alpha level test. At minimum, we require two replicate measurements to estimate the measurement error variance, σ²DE. However, if the number of replicate measurements does not allow accurate estimation of the measurement error variance, then we risk underestimating the variance of the estimator, and thus inflating our type one error. We explore the relationship between the power of the estimator and the number of replicate measurements in Figure 2.
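A two-sided Wald test of this kind is simple to compute, and a rough analytic power curve follows from the same normal approximation. The sketch below uses only the Python standard library; the function names are illustrative, and the power formula treats the standard error as fixed and known, which is a simplification relative to the Monte Carlo power functions reported in the paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: interaction = 0."""
    z = abs(estimate / std_error)
    return 2.0 * (1.0 - STD_NORMAL.cdf(z))


def approximate_power(beta, std_error, alpha=0.05):
    """Approximate power of the two-sided Wald test when the true effect is beta."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(beta) / std_error
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# p-values for the estimates and standard errors reported in the example section.
print(round(wald_p_value(-0.4459, 0.6082), 4))
print(round(wald_p_value(-0.7236, 0.296), 4))
print(round(wald_p_value(-0.7097, 0.343), 4))

# Power falls back to the nominal level when the true interaction is zero.
print(round(approximate_power(0.0, 0.3), 4))
```

Because the pooled and traditional estimators are both approximately unbiased, differences in power under this approximation come entirely from differences in the standard error.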
Figure 2.
Power functions of five different scenarios. All solid lines are for pooled estimators with 1000 individuals.
The solid line with the least power uses 2 repeated measurements of pools on each level of disease and environment. The solid line above that uses 3 repeated measurements, and the solid line with the most power uses 1000 repeated measurements. The dotted line with the least power uses 500 individual measurements, and the dotted line with the most power uses 1000 individual measurements.
To compare our pooling method to traditional methods, Figure 2 shows the power functions of five different scenarios. The first scenario is the same scenario described for Table 5. In this scenario the investigator uses our pooling strategy to estimate the gene-environment interaction with a case-control sample of 1000 individuals. The investigator has 3 replicate measurements on each level of disease and environment, yielding 12 total measurements, and we assume that the measurement error variance is 0.05 for all levels of D and E. The second scenario decreases the number of replicate measurements to 2, yielding 8 total measurements. To show that as the number of replicate measurements increases, the power approaches the power of the traditional estimator, the third scenario increases the number of replicate measurements to 100, yielding 400 total measurements. The fourth and fifth scenarios use the traditional estimator of the gene-environment interaction using a case-control sample of 1000 and 500 individuals, respectively. These scenarios require that the investigator genotypes 1000 or 500 individuals, respectively.
In Figure 2 we see that the test based on the gene-environment interaction estimator from our pooling strategy with 3 replicate measurements (12 total measurements) achieves slightly more power than the test based on the traditional gene-environment interaction estimator with 500 individual measurements. For a higher number of replicates, the power of the pooled test becomes nearly identical to the power of the traditional test. If the number of replicates is too small, the type one error is generally inflated.
Figure 2 shows a dramatic decrease in the number of measurements required (12 pooled measurements compared to 500 individual measurements) to obtain approximately the same amount of power. However, as evident from Figure 2, care needs to be taken to accurately estimate the measurement error variance with replicates to obtain an alpha level test. Equation (5) suggests that measurement error near extreme true frequencies has less of an impact than measurement error near a true allele frequency of 0.5. Changing the population allele frequency, Pr(A = 1), will affect the allele frequencies within strata of disease and environmental status, which are measured in pools. Therefore, we explored different population allele frequencies with different magnitudes of measurement error to find the appropriate number of replicates to obtain an alpha level test. Table 6 shows the type I error under the same simulation scenario described for Table 5, except with different population allele frequencies, measurement error variances, and numbers of replicate measurements. Extreme frequencies are less impacted by measurement error, and thus require fewer replicate measurements to obtain a test at alpha level. When the measurement error variance is 0.05 [17] more than 3 replicates are required to obtain a test at alpha level.
Table 6. Type 1 error.
| Measurement error variance | R | Pr(A = 1) = 0.1 | Pr(A = 1) = 0.3 | Pr(A = 1) = 0.5 | Pr(A = 1) = 0.7 | Pr(A = 1) = 0.9 |
|---|---|---|---|---|---|---|
| 0.05 | 2 | 0.0492* | 0.0668* | 0.0686* | 0.0642* | 0.0556* |
| | 3 | 0.0482* | 0.0510* | 0.0550* | 0.0518* | 0.0498* |
| | 5 | 0.0550* | 0.0552* | 0.0520* | 0.0464* | 0.0478* |
| | 10 | 0.0508* | 0.0460* | 0.0494* | 0.0500* | 0.0480* |
| 0.1 | 2 | 0.0576* | 0.0816* | 0.0834* | 0.0738* | 0.0596* |
| | 3 | 0.0476* | 0.0572* | 0.0664* | 0.0648* | 0.0506* |
| | 5 | 0.0468* | 0.0498* | 0.0510* | 0.0558* | 0.0460* |
| | 10 | 0.0502* | 0.0446* | 0.0496* | 0.0578* | 0.0476* |
| 0.5 | 2 | 0.0956* | 0.1080* | 0.1068* | 0.1082* | 0.0826* |
| | 3 | 0.0586* | 0.0782* | 0.0824* | 0.0748* | 0.0648* |
| | 5 | 0.0506* | 0.0528* | 0.0650* | 0.0570* | 0.0504* |
| | 10 | 0.0444* | 0.0492* | 0.0516* | 0.0442* | 0.0480* |
* 0.05 is less than the upper limit of the 95% confidence interval, based on 5000 Monte Carlo replicates.
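To judge whether an empirical rejection rate is compatible with the nominal 0.05 level, one can attach a Monte Carlo confidence interval to it. The sketch below assumes a normal-approximation binomial interval with 5000 replicates; the paper does not state which interval was used, so this choice and the function name are assumptions.

```python
import math


def rejection_rate_ci(rate, n_sim=5000, z=1.96):
    """Normal-approximation 95% interval for a Monte Carlo rejection rate."""
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_sim)
    return rate - half_width, rate + half_width


# Rejection rates taken from the table: variance 0.05 with R = 3 and R = 2,
# both at Pr(A = 1) = 0.5 and 0.1 respectively for illustration.
for rate in (0.0482, 0.0686):
    low, high = rejection_rate_ci(rate)
    print(rate, round(low, 4), round(high, 4), low <= 0.05 <= high)
```

With 5000 replicates the half-width is roughly 0.006 to 0.007 for rates near 0.05, so an observed rate of about 0.057 or more indicates an inflated type I error under this approximation.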
7. Incorporating Covariates
Until now we have only discussed a simple gene-environment interaction estimator which is not adjusted for covariates. However, incorporating a reasonable number of categorical covariates is straightforward. In this situation we suggest incorporating categorical covariates by stratifying the pools by disease, environment, and covariate statuses. Once pools are formed for each stratum, frequencies can be measured in replicate and gene-environment interactions can be estimated using the methods we have already discussed. Assuming that the gene-environment interactions are the same within each stratum, one can estimate the gene-environment interaction of interest using the weighted average,
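$$\hat{\beta}_{G \times E} = \frac{\sum_{c} w_c\, \hat{\beta}_{G \times E}(c)}{\sum_{c} w_c}, \qquad (9)$$
and its variance by,
$$\widehat{\operatorname{Var}}(\hat{\beta}_{G \times E}) = \frac{1}{\sum_{c} w_c}, \qquad (10)$$
where β̂GxE(c) is the estimated gene-environment interaction within covariate stratum c. Also, we define the weights to be the inverse variance, wc = 1 / V̂ar(β̂GxE(c)).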
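The inverse-variance weighted average can be sketched in a few lines. The function name and the example inputs are hypothetical; the stratum-specific estimates and variances would come from applying the pooled estimator within each covariate stratum.

```python
def combine_strata(estimates, variances):
    """Inverse-variance weighted average of stratum estimates and its variance."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight


# Hypothetical stratum-specific interaction estimates and their variances.
estimate, var = combine_strata([-0.6, -0.8, -0.7], [0.30, 0.15, 0.30])
print(round(estimate, 4), round(var, 4))
```

Strata with more precise estimates receive more weight, and the combined variance is never larger than the smallest stratum variance, which is why stratification costs little power when the interaction is truly common across strata.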
To explore the power of our pooled estimator adjusted for covariates, we considered a simple simulation scenario where we adjusted for a covariate with three levels, 0, 1, or 2, with the following probabilities, Pr(C=0) = 0.25 and Pr(C=1) = 0.5. We used the following logistic regression equation to generate E, with dependence on the covariate C,
$$\operatorname{logit}\,\Pr(E = 1 \mid C) = \gamma_0 + \gamma_c C,$$
where γ0 = 0.2. We varied γc to explore the effect of the relationship between E and C on our pooled estimator. We used the following logistic regression equation to generate D with a dependence on E, G, and C,
$$\operatorname{logit}\,\Pr(D = 1 \mid E, G, C) = \alpha + \beta_E E + \beta_G G + \beta_{G \times E}\, G E + \beta_C C,$$
where we set α = -2.26, βE = 1.06, βG = 0.18, βC = 5, and varied βGxE between -1 and 1. The number of replicate measurements, R, was set to 3, and the measurement error variance, σ²DE, was 0.05. For our pooled estimates we took a case-control sample of 1000 individuals to be measured in 12 pools (4 levels of D and E and 3 levels of C), and we compared the power to that of an individually measured case-control sample of size 500. Figure 3 shows that the test based on the adjusted pooled estimator, which requires only 36 measurements (12 pools each measured with 3 replicates), maintains the size at a nominal level while having more power than the test based on the traditional estimator with 500 individual genotype measurements.
Figure 3.
Power functions for four different scenarios. Solid lines are for the pooled estimators with 1000 individuals, and dotted lines are for the traditional estimators with 500 individual measurements. The solid and dotted lines with the most power use γc = −1, and the solid and dotted lines with the least power use γc = 1.
8. Discussion
In this paper we have proposed an effective method for obtaining a gene-environment interaction estimator using pooled biospecimens. Pooling can greatly reduce the number of measurements, which in turn greatly reduces the cost of genotyping. Pooling can also conserve valuable biospecimens because a pooled measurement requires less biospecimen from each individual. These savings provide a feasible way to investigate previously unexplored gene-environment interaction hypotheses with sufficient statistical power.
We have shown that when disease and environmental statuses are binary, the pooled frequency at each level of disease and environmental status is a sufficient statistic for the gene-environment interaction under some assumptions. Through simulation, we have demonstrated that with 3 replicate measurements on each level of disease and environment, and with a reasonable measurement error variance of 0.05, our pooled gene-environment interaction estimator and the estimate of its variance are nearly unbiased. The simulation results indicate that a test based on 1000 individuals that uses our pooling strategy (12 genotyping measurements) has approximately the same power as a test based on 500 individuals who are all individually genotyped.
While we have focused on the case where disease and environmental statuses are binary, these methods can be extended to categorical environmental statuses. However, because pools must be formed within each combination of disease and environmental status, a large number of environmental categories requires a large number of pools. In this situation, pooling might not dramatically decrease the number of assay measurements, and thus might not be recommended. Our pooling methods require that the disease and environmental statuses, and any possible confounders, be known before pooling is performed. This is a slightly stronger requirement than in standard case-control studies. The requirement that the environmental status be known before pooling implies that there is no straightforward extension of our method to the estimation of gene-gene interactions where both genes are measured using pooling. Further work is necessary to explore cheap, powerful estimation methodology for gene-gene interactions. The set-based logistic model assumes that A1 and A2 (i.e., the alleles inherited from the mother and the father) are independent within disease and environmental status, and that the probability of receiving the allele from the mother is the same as that from the father. These assumptions may not be realistic for certain genes, and methods need to be developed to deal with departures from these assumptions. Lastly, while these methods can adjust for categorical covariates, the number of pools increases with the number of covariates. Because the purpose of this method is to dramatically decrease the number of measurements required, it might not be appropriate when an investigator would like to adjust for a large number of covariates. In that case, adjustment might instead be done using the pooled method described by [7].
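The growth in the number of pools can be made concrete with a short arithmetic sketch. It assumes one pool per combination of binary disease status, environmental level, and covariate level, each measured with a fixed number of replicates; the function names are hypothetical.

```python
from math import prod


def n_pools(env_levels=2, covariate_levels=()):
    """Pools needed: 2 disease levels x environment levels x covariate cells."""
    return 2 * env_levels * prod(covariate_levels)


def n_measurements(env_levels=2, covariate_levels=(), replicates=3):
    """Total assay measurements when every pool is measured in replicate."""
    return n_pools(env_levels, covariate_levels) * replicates


print(n_measurements())                            # 12: binary E, no covariates
print(n_measurements(covariate_levels=(3,)))       # 36: one 3-level covariate
print(n_measurements(covariate_levels=(3, 3, 3)))  # 324: three 3-level covariates
```

With three 3-level covariates the measurement count already approaches that of individually genotyping a few hundred subjects, which is the situation in which pooling loses much of its advantage.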
Although our method has limitations in its assumptions, it may be very useful in a pilot study. For example, if an investigator had access to a large number of stored biospecimen samples, then our pooling method would be very useful in preliminary assessments of gene-environment interaction hypotheses. Additionally, our method might be very useful in high-dimensional SNP studies, as a direct extension of [10], which relates SNP allele frequencies to disease status, whereas our method relates SNP allele frequencies to both disease and environmental statuses.
In conclusion, we proposed a powerful method to estimate a gene-environment interaction by pooling biospecimens. We have derived a pooled gene-environment interaction estimator, and the estimated variance of that estimator. Our findings show that strategic pooling should be considered when an investigator designs a pilot study to test for a gene-environment interaction.
Supplementary Material
Table 1. The Setup for the Original Non Pooled Data.
| D = 0 | | | | | | D = 1 | | | | | |
| E = 0 | | | E = 1 | | | E = 0 | | | E = 1 | | |
| r00 | | | r01 | | | r10 | | | r11 | | |
| G = 0 | G = 1 | G = 2 | G = 0 | G = 1 | G = 2 | G = 0 | G = 1 | G = 2 | G = 0 | G = 1 | G = 2 |
| n000 | n001 | n002 | n010 | n011 | n012 | n100 | n101 | n102 | n110 | n111 | n112 |
Acknowledgments
This research was supported in part by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, and by the Long-Range Research Initiative of the American Chemistry Council.
References
- 1. Amato R, Pinelli M, D'Andrea D, Miele G, Nicodemi M, Raiconi G, Cocozza S. A novel approach to simulate gene-environment interactions in complex diseases. BMC Bioinformatics. 2010;11:8. doi: 10.1186/1471-2105-11-8.
- 2. Marchand LL, Wilkens LR. Design considerations for genomic association studies: importance of gene-environment interactions. Cancer Epidemiology Biomarkers and Prevention. 2008;17:263–267. doi: 10.1158/1055-9965.EPI-07-0402.
- 3. Thomas D. Methods for investigating gene-environment interactions in candidate pathway and genome-wide association studies. Annual Review of Public Health. 2010;31:21–36. doi: 10.1146/annurev.publhealth.012809.103619.
- 4. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411. doi: 10.1093/biomet/66.3.403.
- 5. Foppa I, Spiegelman D. Power and sample size calculations for case-control studies of gene-environment interactions with a polytomous exposure variable. American Journal of Epidemiology. 1997;146:596–604. doi: 10.1093/oxfordjournals.aje.a009320.
- 6. Hwang SJ, Beaty TH, Liang KY, Coresh J, Khoury MJ. Minimum sample size estimation to detect gene-environment interaction in case-control designs. American Journal of Epidemiology. 1994;140(11):1029–1037. doi: 10.1093/oxfordjournals.aje.a117193.
- 7. Weinberg CR, Umbach DM. Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics. 1999;55:718–726. doi: 10.1111/j.0006-341x.1999.00718.x.
- 8. Faraggi D, Reiser B, Schisterman EF. ROC curve analysis for biomarkers based on pooled assessments. Statistics in Medicine. 2003;22:2515–2527. doi: 10.1002/sim.1418.
- 9. Schisterman EF, Vexler A, Mumford SL, Perkins NJ. Hybrid pooled-unpooled design for cost-efficient measurement of biomarkers. Statistics in Medicine. 2010;29:597–613. doi: 10.1002/sim.382.
- 10. Prentice RL, Qi L. Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation. Biostatistics. 2006;7:339–354. doi: 10.1093/biostatistics/kxj020.
- 11. Arnheim N, Strange C, Erlich H. Use of pooled DNA samples to detect linkage disequilibrium of polymorphic restriction fragments and human disease: studies of the HLA class II loci. Proceedings of the National Academy of Sciences of the United States of America. 1985;82:6970–6974. doi: 10.1073/pnas.82.20.6970.
- 12. Germer S, Holland MJ, Higuchi R. High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Research. 2000;10:258–266. doi: 10.1101/gr.10.2.258.
- 13. Werner M, Synch M, Herbon N, Illig T, König IR, Wjst M. Large-scale determination of SNP allele frequencies in DNA pools using MALDI-TOF mass spectrometry. Human Mutation. 2002;20:57–64. doi: 10.1002/humu.10094.
- 14. Pacek P, Sajantila A, Syvanen AC. Determination of allele frequencies at loci with length polymorphism by quantitative analysis of DNA amplified from pooled samples. PCR Methods Application. 1993;2:313–317. doi: 10.1101/gr.2.4.313.
- 15. Sham P, Bader JS, Craig I, O'Donovan M, Owen M. DNA Pooling: a tool for large-scale association studies. Nature Review Genetics. 2002;3:862–871. doi: 10.1038/nrg930.
- 16. Shaw SH, Carrasquillo MM, Kashuk C, Puffenberger EG, Chakravarti A. Allele frequency distributions in pooled DNA samples: applications to mapping complex disease genes. Genome Research. 1998;8:111–123. doi: 10.1101/gr.8.2.111.
- 17. Barratt BJ, Payne F, Rance HE, Nutland S, Todd JA, Clayton DG. Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Annals of Human Genetics. 2002;66:393–405. doi: 10.1046/j.1469-1809.2002.00125.x.
- 18. Hoogendoorn B, Norton N, Kirov G, Williams N, Hamshere ML, Spurlock G, Austin J, Stephens MK, Buckland PR, Owen MJ, O'Donovan MC. Cheap, accurate and rapid allele frequency estimation of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Human Genetics. 2000;107:488–493. doi: 10.1007/s004390000397.
- 19. Breen G, Harold D, Ralston S, Shaw D, St Clair D. Determining SNP allele frequencies in DNA pools. Biotechniques. 2000;28:464–466. doi: 10.2144/00283st03.
- 20. Giordano M, Mellai M, Hoogendoorn B, Momigliano-Richiardi P. Determination of SNP allele frequencies in pooled DNAs by primer extension genotyping and denaturing high-performance liquid chromatography. Journal of Biochemical and Biophysical Methods. 2001;47:101–110. doi: 10.1016/S0165-022X(00)00156-1.
- 21. Kosaki K, Yoshihashi H, Ohashi Y, Kosaki R, Suzuki T, Matsuo N. Fluorescence-based DHPLC for allelic quantification by single-nucleotide primer extension. Journal of Biochemical and Biophysical Methods. 2001;47:111–119. doi: 10.1016/S0165-022X(00)00157-3.
- 22. Ross P, Hall L, Haff LA. Quantitative approach to single-nucleotide polymorphism analysis using MALDI-TOF mass spectrometry. Biotechniques. 2000;29:620–629. doi: 10.2144/00293rr05.
- 23. Sasaki T, Tahira T, Suzuki A, Higasa K, Kukita Y, Baba S, Hayashi K. Precise estimation of allele frequencies of single-nucleotide polymorphisms by a quantitative SSCP analysis of pooled DNA. American Journal of Human Genetics. 2001;68:214–218. doi: 10.1086/316928.
- 24. Wactawski-Wende J, Schisterman EF, Hovey KM, Howards PP, Browne RW, Hediger M, Liu A, Trevisan M, BioCycle Study Group. BioCycle study: design of the longitudinal study of the oxidative stress and hormone variation during the menstrual cycle. Paediatric and Perinatal Epidemiology. 2009;23:171–184. doi: 10.1111/j.1365-3016.2008.00985.x.
- 25. Gaskins AJ, Mumford SL, Zhang C, Wactawski-Wende J, Hovey KM, Whitcomb BW, Howards PP, Perkins NJ, Yeung E, Schisterman EF, BioCycle Study Group. Effect of daily fiber intake on reproductive function: the BioCycle Study. American Journal of Clinical Nutrition. 2009;90:1061–1069. doi: 10.3945/ajcn.2009.27990.
- 26. Costa LG, Richter RJ, Li WF, Cole T, Guizzetti M, Furlong CE. Paraoxonase (PON 1) as a biomarker of susceptibility for organophosphate toxicity. Biomarkers. 2003;8:1–12. doi: 10.1080/1354750021014831.