Abstract
Genetic interactions have been recognized as a potentially important contributor to the heritability of complex diseases. Nevertheless, due to small effect sizes and stringent multiple-testing correction, identifying genetic interactions in complex diseases is particularly challenging. To address these challenges, many genomic research initiatives collaborate to form large-scale consortia and develop open access to enable sharing of genome-wide association study (GWAS) data. Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, such as privacy concerns about individual genomic information and heterogeneous data sources from distributed GWAS databases. In the context of large consortia, we demonstrate that the heterogeneously appearing marginal effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. In this paper, we develop a novel two-stage testing procedure, named phylogenY-based Effect-size Tests for Interactions using first 2 moments (YETI2), to detect genetic interactions through both pooled marginal effects, obtained by averaging site-specific marginal effects, and heterogeneity in marginal effects across sites, using a meta-analytic framework. YETI2 can be applied to large consortia without sharing personal information, and it can also leverage underlying heterogeneity in marginal effects to prioritize potential genetic interactions. We investigate the performance of YETI2 through simulation studies and apply it to a bladder cancer data set from dbGaP.
Keywords: Genetic interaction, Heterogeneity in marginal effects, Meta-analysis, Privacy-preserving algorithm, Two-stage testing procedure
Introduction
Genetic interactions (or gene-gene interactions) have been recognized as a potentially important contributor to the missing heritability of complex diseases (Moore & Williams, 2009; Manolio et al., 2009; Eichler et al., 2010). The identification of gene-gene interactions is more challenging than the analyses of main genetic effects alone (Moore et al., 2010), particularly due to computational issues. Additionally, it requires larger sample sizes than those needed to detect main genetic effects alone (Thomas, 2010). A naive strategy to search for gene-gene interactions is to test all possible pairwise combinations of single nucleotide polymorphisms (SNPs) across the genome using standard regression-based models; however, such a strategy faces the computational challenge due to the need of an exhaustive search. To tackle this problem, several two-stage testing methods (Marchini et al., 2005; Evans et al., 2006; Kooperberg & LeBlanc, 2008; Murcray et al., 2009; Lewinger et al., 2013) have been proposed, in which candidate SNPs that meet some pre-specified significance threshold in the first stage are subsequently followed up for testing of pairwise interactions in the second stage. Intuitively, this kind of approach yields higher statistical power than that of an exhaustive search because the multiple-testing burden has been substantially alleviated. On the other hand, the chance of losing genuine interactions remains high when the marginal effects of the causal SNPs are very subtle, or even absent (Ritchie et al., 2001; Culverhouse et al., 2002). This may lead to limited power in detection of gene-gene interactions.
To date, many individual genome-wide association study (GWAS) research groups have participated in large-scale sharing of data through collaborative research networks by forming a number of high-profile consortia, for example, the Genetic Investigation of Anthropometric Traits (GIANT) (Willer et al., 2009) and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) (Psaty et al., 2009). Such data sharing can not only achieve larger sample sizes, but also reduce bias caused by non-random selection (Wolfson et al., 2010). Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, including privacy concerns on individual genomic information and heterogeneous data sources from distributed GWAS databases. A number of methods have been developed to address the challenge of protecting patient-level information in multiple GWAS/genome databases. These include logistic regression-based models (Wu et al., 2012; Jiang et al., 2013; Wang et al., 2013; Ji et al., 2014; Wu et al., 2015), splitting regression analyses (Wolfson et al., 2010), cryptographic techniques for participant-level genomic data (Kamm et al., 2013), and genetic association meta-analyses (Xie et al., 2014), among others.
In the context of large consortia, we demonstrate that heterogeneously appearing main effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. The term “heterogeneity” has a variety of contextual meanings. For example genetic heterogeneity, or more specifically either locus or allelic heterogeneity, refers to different patterns of genetic variant associations within different subject groups (Thornton-Wells et al., 2004; Urbanowicz et al., 2013). In the present study we use the term “heterogeneity” to refer to inconsistent or non-replicated main effects across sites (Greene et al., 2009a). While these inconsistencies may include examples of genetic heterogeneity, they may also reflect sampling differences or other forms of chance variation. Here, we propose that heterogeneity between sites may be suggestive of possible gene-gene interactions because some unmeasured differential exposures of interacting SNPs could induce changes in estimated site-specific marginal effects. Such variability of site-specific marginal effects across sites is referred to as heterogeneity in marginal effects. This heterogeneity is induced by potential interactions that affect the outcome, and therefore we can leverage heterogeneity in marginal effects to prioritize potential genetic interactions.
It has been suggested that failure to replicate a genetic association across sites may provide important clues on genetic architecture (Greene et al., 2009a). In this paper, we intend to fill this methodological gap by proposing a rigorous framework for identifying genetic interactions through leveraging between-site heterogeneity in marginal genetic effects. Specifically, we propose a two-stage testing procedure, named phylogenY-based Effect-size Tests for Interactions using first 2 moments (YETI2). It relies on both pooled marginal effects, obtained by averaging site-specific marginal effects, and heterogeneity in marginal effects across sites, using a meta-analytic framework. This framework is attractive in three respects. Firstly, compared with the traditional meta-analysis method, the proposed framework not only estimates the overall effect sizes based on summary data, but also leverages heterogeneity in marginal effects across sites to prioritize potential genetic interactions. Secondly, it avoids iterative procedures, whereas conventional distributed analysis methods often impose a heavy computational burden on large-scale GWAS databases (Wu et al., 2012). Lastly, as with other meta-analyses, it can be easily applied to distributed GWAS databases without sharing individual genomic information.
The contribution of this paper is twofold. Firstly, we propose a novel two-stage testing procedure, YETI2, for distributed GWAS databases to overcome data sharing barriers. YETI2 can also alleviate limitations of the method proposed by Kooperberg & LeBlanc (2008), particularly for the situation with small or absent marginal effects. Secondly, we show that the overall type I error rate can be controlled by YETI2, and establish the theoretical filtering property for YETI2, that is, the test statistic used in the first stage is asymptotically independent of that in the second stage. Such a property can reduce the search space by filtering out irrelevant SNPs or markers, and thus can substantially improve statistical power over the naive strategy of testing all possible pairwise combinations of SNPs.
The rest of this paper is organized as follows. In Section 2, we describe and formalize the YETI2 procedure. In Section 3, we show that YETI2 possesses the independent filtering property (Dai et al., 2012) and provide a formal justification for this property. To evaluate the performance of the proposed method and the impact of different search strategies, we conduct simulation studies under three different scenarios and summarize the simulation results in Section 4. We apply the proposed method to a bladder cancer data set from the Database of Genotypes and Phenotypes (dbGaP) and interpret the results in Section 5. Finally, we discuss the strengths and limitations of the proposed method in Section 6.
Methods
Model framework
Consider an association study for gene-gene (G×G) interactions involving K sites from a large consortium. Denote by $Y_i^{(k)}$ the disease status (e.g., binary outcome: diseased and non-diseased) of the ith subject in the kth site, k = 1,…, K, and by $(G_{i1}^{(k)}, \dots, G_{iL}^{(k)})$ some genetic coding (e.g., representing dominant, additive, or recessive models) at L loci, which are genotyped at all K sites, for the ith subject. Let $X_i^{(k)}$ denote a p-dimensional vector of covariates on the ith subject in the kth site, for $i = 1, \dots, n^{(k)}$. Here $X_i^{(k)}$ may include non-genetic variables, such as age, smoking status, and ancestry variables. We suppose that among L loci, the lth and jth SNPs are possibly interacting together in modifying the risk of disease, i.e.,
$$\operatorname{logit} P\left(Y_i^{(k)} = 1 \mid G_{il}^{(k)}, G_{ij}^{(k)}, X_i^{(k)}\right) = \beta_0^{(k)} + \beta_1^{(k)} G_{il}^{(k)} + \beta_2^{(k)} G_{ij}^{(k)} + \beta_3^{(k)} G_{il}^{(k)} G_{ij}^{(k)} + \gamma^{(k)\top} X_i^{(k)}. \tag{1}$$
Under a dominant model coding, for example, $\beta_0^{(k)}$ is the prevalence of disease (in logit scale) among noncarriers at SNPs l and j in the kth site, $\beta_1^{(k)}$ quantifies the effect of SNP l among noncarriers at SNP j, $\beta_2^{(k)}$ quantifies the effect of SNP j among noncarriers at SNP l, $\beta_3^{(k)}$ is an interaction effect between SNPs l and j, and $\gamma^{(k)}$ is a p-dimensional vector of unknown site-specific regression parameters. Under a recessive or an additive model coding, the parameters have similar interpretations.
Standard procedure
A naive procedure for detection of G×G interactions using all K sites is to test for interactions of all possible pairs of SNPs. Specifically, for the SNP pair (l, j), denote by $\beta_3$ the overall interaction effect in the meta-analysis setting (DerSimonian & Laird, 1986). One can test for
$$H_0: \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_3 \neq 0 \tag{2}$$
for all pairs of SNPs (l, j) with l, j ∈ {1, 2,…, L} and l ≠ j. Such a test can be conducted using a Wald test with $\hat\beta_3 = \sum_{k=1}^{K} w_k \hat\beta_3^{(k)} / \sum_{k=1}^{K} w_k$ and its variance being $1 / \sum_{k=1}^{K} w_k$, where $w_k = 1/\hat v_k$ under a fixed-effect meta-analysis model, and $w_k = 1/(\hat v_k + \hat\tau^2)$ under a random-effects meta-analysis model, with $\hat v_k$ being the estimated variance of $\hat\beta_3^{(k)}$ and $\hat\tau^2$ being an estimate of between-site heterogeneity (DerSimonian & Laird, 1986). Here, between-site heterogeneity refers to variation in outcomes across sites. To control the overall type I error at level α, the significance level based on a Wald test can be set at $\alpha / \{L(L-1)/2\}$ using the Bonferroni correction for multiple tests. However, there are two major limitations of this procedure. Firstly, it has a huge computational burden because of the large number of possible pairs of SNPs to be tested. Secondly, it has a high multiple-testing burden. Depending on the assumptions about main effects, marginal effects, or heterogeneity in marginal effects across sites, the multiple-testing burden may be reduced. In this paper, we offer a prior selection procedure to alleviate such limitations.
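As a rough illustration of the pooling step described above, the following Python sketch combines site-specific interaction estimates by inverse-variance weighting under either a fixed-effect or a random-effects model and returns a two-sided Wald p-value. The function names (`pooled_wald`, `bonferroni_all_pairs`) and the input values are hypothetical, and the moment estimator of the between-site variance is the standard DerSimonian-Laird form, assumed here rather than taken from the authors' R code.

```python
import math

def pooled_wald(estimates, variances, random_effects=False):
    """Inverse-variance pooling of site-specific estimates with a Wald test.

    Returns (pooled estimate, its variance, tau^2, two-sided p-value).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]  # fixed-effect weights
    fixed_mean = sum(wi * b for wi, b in zip(w, estimates)) / sum(w)
    tau2 = 0.0
    if random_effects and k > 1:
        # DerSimonian-Laird moment estimator of the between-site variance
        q = sum(wi * (b - fixed_mean) ** 2 for wi, b in zip(w, estimates))
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c)
        w = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * b for wi, b in zip(w, estimates)) / sum(w)
    var = 1.0 / sum(w)
    z = pooled / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return pooled, var, tau2, p_value

def bonferroni_all_pairs(alpha, n_snps):
    """Per-test level when all unordered SNP pairs are tested."""
    return alpha / (n_snps * (n_snps - 1) / 2)

# Hypothetical site-specific interaction estimates and their variances
fixed = pooled_wald([0.35, 0.20], [0.010, 0.015])
random_fx = pooled_wald([0.35, 0.20], [0.010, 0.015], random_effects=True)
threshold = bonferroni_all_pairs(0.05, 1000)
```

With L = 1000 SNPs the exhaustive search involves 499,500 pairs, which is why the per-test level becomes so stringent.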
Previously proposed methods
(i). KL-meta:
Liu et al. (2016) have previously proposed the KL-meta procedure, which extends the method of Kooperberg & LeBlanc (2008) to the context of large consortia by using a meta-analytic framework. Briefly, the KL-meta is a two-stage testing procedure. In the first stage, one can test for the pooled marginal effect, in terms of averaging site-specific marginal effects, for each SNP. In the second stage, one can test for pairwise interactions among those SNPs with significant pooled marginal effects. For detailed model formulations, we refer to Liu et al. (2016).
(ii). phylogenY-based Effect-size Tests for Interactions (YETI):
The rationale of YETI is that heterogeneity in marginal effects at an SNP across studies may be caused by interactions with other genetic loci that exhibit varying exposures among studies. The YETI procedure consists of two steps. First, rather than using pooled marginal effects, one screens for SNPs with significant heterogeneity among effect size estimates. Second, one tests for interactions between the SNPs that passed the first step and all other SNPs, adjusting for multiple hypothesis testing with a Bonferroni correction (Liu et al., 2016).
New proposed method: phylogenY-based Effect-size Tests for Interactions using the first 2 moments (YETI2)
We note that KL-meta and YETI differ both in the criterion used to select SNPs in the first step and in the search strategy used in the second step. In fact, KL-meta and YETI are found to be complementary to each other, each with its own advantages and limitations. This motivates us to propose a new two-stage approach that combines the advantages of both methods. In particular, in the first stage we propose to screen the candidate SNPs by both pooled marginal effects and heterogeneity in marginal effects across sites, and we then test for interactions between those significant SNPs and all other SNPs. As we show later, such a procedure can avoid the non-monotonic power function of interaction effects. The intuition is that substantial heterogeneity in marginal effects at an SNP across sites may suggest interactions between this SNP and other SNPs that explain the differences among populations. We term this procedure phylogenY-based Effect-size Tests for Interactions using first 2 moments, or YETI2. The detailed procedure of YETI2 is described below, and the corresponding pseudo code is provided in Algorithm 1. The R code for YETI2 is available at https://www.penncil.org/software
- Step 1(a): Subgroup analysis
  For the kth site, the following logistic regression model can be fitted at the lth SNP:
  $$\operatorname{logit} P\left(Y_i^{(k)} = 1 \mid G_{il}^{(k)}, X_i^{(k)}\right) = \eta_0^{(k)} + \eta_l^{(k)} G_{il}^{(k)} + \theta^{(k)\top} X_i^{(k)}, \tag{3}$$
  where $\eta_l^{(k)}$ is the marginal effect of SNP l in the kth site, for l = 1,…, L and k = 1,…, K.
- Step 1(b): Screening SNPs by testing both pooled marginal effects and heterogeneity in marginal effects across sites
  For the lth SNP, we can use either one of the following models to combine summary data from distributed sites:
  - The ANOVA-type K-df test: the null hypothesis of no effect, that is, $H_0: \eta_l^{(1)} = \dots = \eta_l^{(K)} = 0$, at significance level α1 with 0 < α1 < 1, can be tested by the Wald test statistic $T_l = \sum_{k=1}^{K} (\hat\eta_l^{(k)})^2 / \widehat{\operatorname{var}}(\hat\eta_l^{(k)})$, which has an asymptotic chi-squared distribution with K degrees of freedom. For the ANOVA-type K-df test, similar to the fixed-effects one-way analysis of variance (ANOVA) test, the heterogeneity among marginal effect sizes can be investigated by testing the equality of marginal effects across sites.
  - The Mean-Variance component test: the null hypothesis specifies both $\mu_l = 0$ and $\tau_l^2 = 0$, i.e., $H_0: \mu_l = \tau_l^2 = 0$, at significance level α1 with 0 < α1 < 1, for K ≥ 3, where $\mu_l$ and $\tau_l^2$ denote the mean and the variance of the site-specific marginal effects. It can be tested by the likelihood ratio test, where the distribution of the test statistic can be approximated by a mixture of chi-squared distributions under $H_0$ (Self & Liang, 1987). For the Mean-Variance component test, similar to a random-effects ANOVA, where a similar decomposition of the heterogeneity in effect sizes holds, we propose to test that both the mean and the variance are zero.
- Step 2: Test for interactions
  Suppose that L** SNPs passed Step 1(b). We then test for interactions between each SNP that passed Step 1(b) and every other SNP, with Bonferroni correction for L** × (L − 1) tests. In particular, for an SNP that passed Step 1(b), the interaction effect between this SNP and any other SNP in the kth site is estimated from equation (1), and a test of the pooled interaction effect, $H_0: \beta_3 = 0$, is conducted based on $\hat\beta_3$, with a significance threshold of α/{L** × (L − 1)}, where α is the overall type I error rate.
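The ANOVA-type K-df screening in Step 1(b) can be sketched in a few lines of Python. This is a minimal illustration, assuming independent sites so that the Wald statistic is the sum of squared site-specific z-scores; the helper names (`chi2_sf`, `kdf_screen`) and the example inputs are hypothetical. The sketch also shows the ANOVA-like split of the statistic into a 1-df pooled-effect part and a (K − 1)-df heterogeneity part (Cochran's Q), which is an algebraic identity under inverse-variance weights.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_i (x/2)^(i-1/2) / Gamma(i+1/2)
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        total += half ** (i - 0.5) / math.gamma(i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def kdf_screen(estimates, std_errors, alpha1=0.01):
    """K-df Wald screen on site-specific marginal effects of one SNP."""
    w = [1.0 / se ** 2 for se in std_errors]
    stat = sum(wi * e ** 2 for wi, e in zip(w, estimates))  # sum of z^2
    pooled = sum(wi * e for wi, e in zip(w, estimates)) / sum(w)
    mean_part = sum(w) * pooled ** 2  # 1-df pooled-effect component
    het_part = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, estimates))
    p_value = chi2_sf(stat, len(estimates))
    return {"stat": stat, "mean_part": mean_part, "het_part": het_part,
            "p_value": p_value, "passes": p_value < alpha1}

# Hypothetical SNP with opposite marginal effects at K = 2 sites:
# the pooled effect is zero, yet the SNP passes through its heterogeneity.
result = kdf_screen([0.3, -0.3], [0.1, 0.1])
```

A screen based on the pooled effect alone would discard the SNP in this example, whereas the K-df statistic retains it.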
Here we highlight the differences between main effects and marginal effects for the above-mentioned YETI2 procedure, as shown in equations (1) and (3), respectively. Specifically, the main effects, which are determined by the coefficients $\beta_1^{(k)}$ and $\beta_2^{(k)}$ of the two SNPs in equation (1), are in fact conditional effects, because the effect of one causal SNP, $G_l$, is adjusted by the effect of another causal SNP, $G_j$. On the other hand, the marginal effects are determined by the coefficients $\eta_l^{(k)}$ in equation (3), and such low-order terms depend upon whether the high-order term (i.e., the interaction effect, $\beta_3^{(k)}$) is present in equation (1).
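To make the distinction concrete, the short Python sketch below computes the population-level marginal log odds ratio of one SNP implied by the conditional model when the second SNP is left out. It is a simplified illustration under assumptions that are not taken from the paper: dominant coding, Hardy-Weinberg proportions, no covariates, and statistically independent SNPs. All function names are hypothetical.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def carrier_prob(maf):
    """Dominant coding: P(at least one minor allele), assuming HWE."""
    return 1.0 - (1.0 - maf) ** 2

def marginal_log_or(beta0, beta1, beta2, beta3, maf_j):
    """Marginal log odds ratio of SNP l after averaging over SNP j."""
    pj = carrier_prob(maf_j)

    def risk(gl):
        # average the conditional risk over the carrier status of SNP j
        return sum(prob * expit(beta0 + beta1 * gl + beta2 * gj + beta3 * gl * gj)
                   for gj, prob in ((0, 1.0 - pj), (1, pj)))

    return logit(risk(1)) - logit(risk(0))

# Zero main effects, interaction 0.3, intercept -2; the partner SNP has
# minor allele frequency 0.1 at one site and 0.4 at the other.
eta_site1 = marginal_log_or(-2.0, 0.0, 0.0, 0.3, maf_j=0.1)
eta_site2 = marginal_log_or(-2.0, 0.0, 0.0, 0.3, maf_j=0.4)
```

Although both main effects are zero, the marginal effect is non-zero at both sites and is larger where the interacting SNP is more common, which is the kind of between-site heterogeneity that the screening stage exploits.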

YETI2 properties
One important feature of the two-stage testing procedure is to exploit outcome-dependent information twice from the same dataset; in particular, we set two separate values, α1 and α, as significance thresholds for the first (or filtering) and second (or testing) stages of YETI2. Typically, we have to employ a multiple-testing correction for both stages simultaneously to control the overall type I error rate. However, due to the independent filtering property of YETI2, we only need a multiple-testing correction for the second (or testing) stage. Such a property for detecting interactions has recently been discussed by Dai et al. (2012) in a simpler setting with one study site. In the following subsections, we prove the independent filtering property for YETI2 and show that YETI2 procedure can properly control the overall type I error rate. Moreover, we show that YETI2 should gain more power, compared to the standard one-step Bonferroni correction for testing all possible pairs of SNPs.
Independent filtering by pooled marginal effects and heterogeneity in marginal effects
In this subsection, we provide a formal justification of the independent filtering property using logistic regression models under the meta-analysis setting. Specifically, we show the asymptotic independence between the pairs of estimators ($\hat\beta_3^{(k)}$, $\hat\eta_l^{(k)}$) as described in equations (1) and (3) for l, j = 1,…, L, l ≠ j, and k = 1,…, K. The two test statistics at the filtering and testing stages are asymptotically independent, and this result is formally presented in the following proposition.
Proposition 1 Consider two nested logistic regression models for SNP l under the ANOVA-type K-df test, namely the marginal model in equation (3) and the interaction model in equation (1).
Let $T_l$ be a Wald test statistic for testing $H_0: \boldsymbol\eta_l = \mathbf{0}$, where $\boldsymbol\eta_l = (\eta_l^{(1)}, \dots, \eta_l^{(K)})^\top$. Under the null hypothesis $H_0$, the asymptotic distribution of $T_l$ is a $\chi^2$ distribution with K degrees of freedom. On the other hand, let $Z_{lj}$ be a Wald test statistic for testing $H_0: \beta_3 = 0$, where $\beta_3$ is the pooled interaction effect between SNPs l and j. Then, $T_l$ is asymptotically independent of $Z_{lj}$.
The proof of Proposition 1 is given in Section A of the supplementary materials. The derivation of the asymptotic independence result can also be applied to the Mean-Variance component test by using the likelihood ratio test for the filtering stage instead of the Wald test. We provide a sketch of the proof in Section A of the supplementary materials. In the next theorem, we show that the asymptotic independence between the pairs of estimators $\hat\beta_3^{(k)}$ and $\hat\eta_l^{(k)}$, with possible weak dependence among other SNPs, is sufficient to control the overall type I error rate.
Theorem 1 Define the rejection region of the filtering stage as , where is quantile function of the distribution. Let L denote the total number of SNPs, and L** denote the number of SNPs passing the filtering stage. The rejection region of the testing stage is defined as , where Φ−1 (·) is the quantile function of the standard normal distribution. Under some regularity conditions, as |l − j| → ∞. For any nonempty set and under the null hypothesis of , we have
as n(k) → ∞, L → ∞.
A sketch of the proof is given in Section B of the supplementary material.
Power consideration and tuning parameter α1 selection
YETI2 can be shown to potentially gain statistical power from a less stringent significance rule in the filtering stage, compared to the standard one-stage Bonferroni correction for testing all possible pairs of SNPs. Intuitively, such gain in power is affected by the tuning parameter α1. As suggested in Dai et al. (2012), a small α1 would allow fewer SNPs to pass the filtering stage, leading to a sizable decrease in power. On the other hand, as α1 increases, the SNPs would have an increased chance of passing the filtering stage, but the penalty of multiple-testing correction for those SNPs that passed the filtering stage would also increase. The optimal selection of α1 depends on the relative power of the filtering and testing stages, the number of SNPs, and the extent of interactions due to population structures or other factors (Lewinger et al., 2013). Here, using the law of total probability, we obtained a closed-form expression that characterizes how the tuning parameter α1 impacts the power. An approximated power of identifying any interactions based on YETI2 is given by
The details of power derivation are provided in Section C of the supplementary materials.
Results
Simulation Settings
To evaluate the empirical performance of the proposed method YETI2, compared with YETI and KL-meta, we devised simulation studies under three different scenarios. Our basic strategy is to simulate genotypes given site-specific structures, and for simplicity, we simulate the disease status under a given genetic dominant model and prevalence to obtain our data. The details of the simulation settings are described below.
We mimic population structures in real GWAS datasets by generating K = 2 sites with equal numbers of cases and controls, involving 5,000 cases and 5,000 controls. We consider allele frequencies of 0.1 and 0.4 for the two separate populations, together with varying interaction effects. Because we simulate different allele frequencies and disease prevalences in the two populations, the simulated data could potentially contain population stratification. The difference in allele frequency between the two populations induces heterogeneity in marginal effects across populations. In contrast, when allele frequencies are more homogeneous across populations, no heterogeneity in marginal effects across populations is expected. However, SNPs with large variation in allele frequency may be susceptible to population stratification, especially when subjects are sampled from admixed populations or from poorly matched cases and controls (Pritchard & Rosenberg, 1999; Pritchard et al., 2000; Novembre et al., 2008; Kang et al., 2010; Price et al., 2010). Although the first stage of YETI2 could presumably be affected by such confounding due to population stratification, the procedure remains robust to between-study population stratification at the second stage. This is because YETI2 deliberately uses the between-study population structure to narrow down the search space for possible genetic interactions. In the data analysis section, we show its robustness to between-study population stratification by extending the regression model at the second stage. For the sake of simplicity, we do not include any covariates in equation (1) for simulation models. In the real data analysis, we include the principal components in the second-stage regression to account for potential population stratification. We set the total number of SNPs, L, at 1000.
We consider the following three simulation schemes for the causal SNPs. In the first scheme, we assume two causal SNPs with main effects of zero, that is, we set $\beta_1^{(k)} = \beta_2^{(k)} = 0$, with $\beta_3^{(k)}$ varying from 0 to 1 for k = 1, 2. In the second scheme, the main effects are set to be −0.25, and the corresponding marginal effects are approximately zero at $\beta_3^{(k)} = 0.5$ or $0.6$ for k = 1, 2, respectively. In the third scheme, we simulate two negatively correlated causal SNPs with a Pearson correlation coefficient of −0.2. For each of these three simulation schemes, the type I error rate and power are estimated based on 1000 replications with overall significance level α = 0.05. Moreover, we specify α1 = 0.01 as the tuning parameter in the first stage, and we provide the rationale for this choice in Section D of the supplementary materials. For all settings, β0, which corresponds to the intercept in equation (1), is set to be −2. We note that the disease prevalence, which is simply the probability of disease given a particular combination of genotypes and the other factors, is not the same for these two populations.
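A simplified data-generating sketch for the first scheme is given below in Python. It is not the authors' simulation code: it draws a cohort sample rather than a fixed number of cases and controls, uses dominant coding with Hardy-Weinberg carrier probabilities, treats SNPs as independent, and assigns one minor allele frequency per site. The function name `simulate_site`, the choice of the first two SNPs as causal, and the small sample sizes are illustrative assumptions.

```python
import math
import random

def simulate_site(n, n_snps, maf, beta0=-2.0, beta1=0.0, beta2=0.0,
                  beta3=0.5, seed=0):
    """Simulate dominant-coded genotypes and disease status for one site.

    SNPs 0 and 1 are taken as the causal pair (an arbitrary choice).
    """
    if n_snps < 2:
        raise ValueError("need at least two SNPs for an interacting pair")
    rng = random.Random(seed)
    p_carrier = 1.0 - (1.0 - maf) ** 2  # P(at least one minor allele)
    genotypes, status = [], []
    for _ in range(n):
        g = [1 if rng.random() < p_carrier else 0 for _ in range(n_snps)]
        # linear predictor of the interaction model without covariates
        lin = beta0 + beta1 * g[0] + beta2 * g[1] + beta3 * g[0] * g[1]
        y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-lin)) else 0
        genotypes.append(g)
        status.append(y)
    return genotypes, status

# Two sites that differ in allele frequency (0.1 versus 0.4)
site1 = simulate_site(n=500, n_snps=20, maf=0.1, seed=1)
site2 = simulate_site(n=500, n_snps=20, maf=0.4, seed=2)
```

Matching the paper's design more closely would require sampling until the target numbers of cases and controls are reached at each site.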
In addition to the aforementioned KL-meta and YETI2, we also explore the impact of alternative search strategies for finding G×G interactions at the second stage under the frameworks of KL-meta and YETI2. These include (i) KL-meta-Y procedure: differs from KL-meta in the second stage in that we instead test for all possible interactions between those SNPs that passed stage one and any other SNPs at the second stage; (ii) YETI2-K procedure: differs from YETI2 in the second stage in that we instead test for all possible interactions among those SNPs that passed stage one.
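The multiple-testing burden implied by each search strategy can be compared with a small Python sketch. The one-versus-all count L** × (L − 1) follows the description of Step 2; the pairwise count L**(L** − 1)/2 among passing SNPs and the exhaustive count L(L − 1)/2 are assumed here as the natural numbers of unordered pairs. The function names are hypothetical.

```python
def n_tests(strategy, n_snps, n_passed):
    """Number of second-stage interaction tests under a search strategy."""
    if strategy == "exhaustive":        # all unordered pairs, no filtering
        return n_snps * (n_snps - 1) // 2
    if strategy == "pairwise":          # pairs among SNPs that passed stage one
        return n_passed * (n_passed - 1) // 2
    if strategy == "one_versus_all":    # each passing SNP against all others
        return n_passed * (n_snps - 1)
    raise ValueError(f"unknown strategy: {strategy}")

def per_test_threshold(alpha, strategy, n_snps, n_passed):
    """Bonferroni per-test level, or None when there is nothing to test."""
    m = n_tests(strategy, n_snps, n_passed)
    return alpha / m if m > 0 else None

# Example: L = 1000 SNPs, of which 10 pass the first stage
counts = {s: n_tests(s, 1000, 10)
          for s in ("exhaustive", "pairwise", "one_versus_all")}
```

The pairwise strategy pays the smallest correction but can only find pairs in which both SNPs pass the filter, whereas one-versus-all needs only one SNP of the pair to pass.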
Simulation results
Table 1 shows the empirical type I error rates for KL-meta, KL-meta-Y, YETI2, YETI2-K, and YETI procedures at the overall type I error rate of α = 0.05, with a tuning parameter α1 = 0.01 for L = 1000. As we can see, these procedures provide nearly accurate control of the overall type I error rate for all three schemes considered.
Table 1:
Empirical type I error rates (%) of the KL-meta, KL-meta-Y, YETI, YETI2-K and YETI2 procedures with the nominal significance level 5% based on 1000 replications.
| Scheme | KL-meta (KL-meta framework) | KL-meta-Y (KL-meta framework) | YETI (YETI framework) | YETI2-K (YETI framework) | YETI2 (YETI framework) |
|---|---|---|---|---|---|
| 1 | 4.8 | 5.3 | 5.2 | 5.5 | 5.2 |
| 2 | 5.2 | 5.6 | 5.2 | 5.6 | 5.1 |
| 3 | 5.4 | 5.5 | 6.2 | 5.0 | 5.7 |
Figure 1(a) displays the power functions of these five procedures for detecting G×G interactions in the first simulation scheme, with increasing interaction effect β3 and the corresponding marginal effects (i.e., η1 and η2) for two causal SNPs. Not surprisingly, KL-meta always yields the highest power among all procedures. This is because, as the non-zero interaction effect β3 increases, both causal SNPs have increasingly large marginal effects and are therefore more likely to pass the first stage. Interestingly, YETI2-K performs almost as well as KL-meta. We note that both KL-meta and YETI2-K use the same pairwise search strategy for finding G×G interactions at the second stage, where the term pairwise refers to testing all possible interactions among those SNPs that passed the first stage. On the other hand, YETI2, although not the best, remains a close second. Moreover, we observe that the power of KL-meta-Y is very similar to that of YETI2, since both procedures use the same one-versus-all search strategy for interactions. The term one-versus-all refers to testing each SNP that passed the first stage against all other SNPs at the second stage. These results in the first scheme demonstrate that a procedure based on a pairwise search strategy is likely more powerful than one based on a one-versus-all search strategy because it requires less multiple-testing correction. In this scheme, we also find that YETI does not gain as much statistical power as the other procedures. As mentioned above, the marginal effects for both causal SNPs depend on the magnitude of the interaction effect, even though the main effects are set to zero in this scheme. The range of marginal effects in this scheme is between 0 and 0.52, which corresponds to odds ratios ranging from 1 to 1.68.
For example, when the interaction effect is fixed at 0.3, the marginal effects for the two causal SNPs are both 0.12 (odds ratio = 1.13), and KL-meta and YETI2-K have relatively large power to detect the interaction effect compared to the other three methods. With larger interaction effects, all procedures gain power to detect interactions because the marginal effects increase.
Figure 1:
Power curves for identifying an interaction: (a) left panel: first simulation scheme (main effects of zero), for various interaction effects and corresponding empirical marginal effects between two uncorrelated causal SNPs; (b) middle panel: second simulation scheme (main effects of −0.25), for various interaction effects and corresponding empirical marginal effects between two uncorrelated causal SNPs; (c) right panel: third simulation scheme, for various interaction effects and corresponding empirical marginal effects between two negatively correlated causal SNPs. The tuning parameter α1 = 0.01 and the number of total SNPs L = 1000.
Figure 1(b) shows how the power functions of these five procedures change in the second scheme, in which the marginal effects for both causal SNPs at both sites are nearly zero when the corresponding interaction effect is 0.5 or 0.6, i.e., at $\beta_3^{(k)} = 0.5$ or 0.6, for k = 1, 2. We also empirically investigate the magnitude of marginal effects for different interaction effects under this scenario, in which the marginal effects range from −0.35 to 0.28 (i.e., the corresponding odds ratio is between 0.70 and 1.32). As a result, both KL-meta and KL-meta-Y lead to non-monotonic power curves. Specifically, when the interaction effect β3 deviates from 0.5, there may exist negative or positive marginal effects, and as this deviation increases, there is an increase in power for both KL-meta and KL-meta-Y. Furthermore, KL-meta-Y is noticeably more powerful than KL-meta. This is likely because the one-versus-all search strategy gains some extra power in this scheme, in which heterogeneity in marginal effects across sites exists. On the other hand, the remaining three procedures, which are based on heterogeneity in marginal effects, always have monotonic power curves. Among them, YETI2 and YETI perform quite well compared to the other procedures; YETI is more powerful than YETI2, particularly for moderate interaction effects. This is because YETI2, which is based on the ANOVA-type K-df test, has a penalty on the power due to an extra degree of freedom for modeling interactions.
Power curves of these five procedures under the third scheme, in which two causal SNPs are negatively correlated, are shown in Figure 1(c). In this scheme, the power loss of KL-meta is substantial, closely followed by that of YETI2-K. The reason is that, because the two causal SNPs are negatively correlated, one of them may have a subtle or no marginal effect and thus fail to pass the first stage of KL-meta. Such weak or absent marginal effects can arise when one of the two site-specific marginal effects for that SNP is in the opposite direction, so that its pooled marginal effect is diminished or even vanishes. This implies that KL-meta and YETI2-K have limited power for detecting G × G interactions of two causal SNPs in the second stage. In contrast, the other three procedures, namely YETI2, KL-meta-Y, and YETI, are more powerful in this scheme. Furthermore, YETI2 is generally the best performing of the three. This is because, as the interaction effect increases, it can not only leverage heterogeneity in marginal effects across sites but also capture the signal contained in pooled marginal effects.
In summary, simulation results show that the search strategy has an impact on power. When the marginal effects at both causal SNPs are relatively large, the procedure based on a pairwise strategy clearly performs better than the alternative search strategy. In contrast, when heterogeneity in marginal effects is large relative to the pooled marginal effects, the one-versus-all search strategy performs best. Moreover, both KL-meta and KL-meta-Y can lead to non-monotonic power curves, while YETI, YETI2-K, and YETI2 always have monotonic power functions because they do not rely solely on screening of pooled marginal effects.
Analysis of bladder cancer data
Bladder cancer is the fourth most common cancer among males, but it is less common among females in the United States (American Cancer Society, 2017). In 2017, an estimated 79,030 new cases of bladder cancer were expected to be diagnosed in the United States, and an estimated 16,870 people were expected to die of this disease (American Cancer Society, 2017). In the past decade, there have been extensive efforts to identify genetic susceptibility loci for bladder cancer through GWAS (Kiemeney et al., 2008; Wu et al., 2009; Kiemeney et al., 2010; Rafnar et al., 2009; Rothman et al., 2010). However, our understanding of the molecular mechanism of bladder cancer remains limited. In this section, we evaluate the performance of YETI2, YETI, and KL-meta on a bladder cancer data set that is available from the database of Genotypes and Phenotypes (dbGaP) (Tryka et al., 2013) under accession phs000346.v1.p1 (Rothman et al., 2010). We considered a collaborative consortium consisting of the Spain cohort (genotyped on a 1M chip) and the United States/Finland cohort (genotyped on a 610K chip) from this data set. The detailed genotyping information is provided in Section E of the supplementary materials.
Prior to the interaction analyses, we applied a number of quality control (QC) filters through PLINK (Purcell et al., 2007). Specifically, for each cohort, we removed subjects with more than 5% missing genotypes; we considered only the SNPs genotyped in both cohorts; and we removed SNPs with a minor allele frequency less than 1% and those with more than 5% missing genotypes. After this QC, we had a total of 588,682 SNPs from 6,978 subjects (3,624 cases and 3,354 controls), where the Spain cohort included 1,145 cases and 1,081 controls, and the United States/Finland cohort included 2,479 cases and 2,273 controls. Additionally, in order to reduce the influence of strong linkage disequilibrium (LD) on the assessment of interaction effects, we used PLINK to prune SNPs with a pairwise r² greater than 0.2 within sliding windows of 50 SNPs with a step size of 5, which reduced the number of SNPs analyzed to 95,094. In addition to the above quality control procedures, we conducted principal component analysis (PCA) to account for potential between-study population stratification by using the R package SNPRelate (Zheng, 2012); in particular, the effect of population stratification was adjusted for at the second stage by fitting the first few eigenvectors (or PCs) from the PCA of the SNP genotypes.
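The missingness and allele-frequency thresholds above can be illustrated on a toy genotype matrix. The following Python sketch is not the PLINK workflow used in the analysis; it only mirrors the stated thresholds, assuming genotypes are coded as minor-allele counts (0, 1, 2) with `None` for a missing call. The function names and the toy data are hypothetical.

```python
def missing_rate(values):
    """Fraction of missing (None) genotype calls; an empty list counts as all missing."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def minor_allele_frequency(calls):
    """MAF computed from 0/1/2 allele counts, ignoring missing calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def qc_filter(genotypes, max_missing=0.05, min_maf=0.01):
    """Return (kept_subject_indices, kept_snp_indices).

    genotypes: list of rows (subjects), each a list of 0/1/2/None per SNP.
    Subjects are filtered first, then SNPs are filtered on the kept subjects.
    """
    keep_subj = [i for i, row in enumerate(genotypes)
                 if missing_rate(row) <= max_missing]
    keep_snps = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in keep_subj]
        if (missing_rate(column) <= max_missing
                and minor_allele_frequency(column) >= min_maf):
            keep_snps.append(j)
    return keep_subj, keep_snps


# Toy data: 4 subjects x 3 SNPs.
toy = [
    [0, 1, 0],
    [1, 0, 0],
    [None, None, 0],  # subject with 2 of 3 calls missing
    [2, 1, 0],
]
kept_subjects, kept_snps = qc_filter(toy)
print(kept_subjects, kept_snps)
```

In this toy example the third subject is dropped for missingness and the third SNP is dropped because it is monomorphic among the remaining subjects.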
In this analysis of the bladder cancer data, we adopted the dominant model. We applied the YETI2, YETI, and KL-meta procedures to the data set, setting the tuning parameter α1 to 0.002 in the first stage and using 0.05 as the nominal significance level in the second stage. Figure 2 displays the total number of SNPs that passed the first stage based on a fixed-effects model and the number of overlapping SNPs; of 95,094 SNPs, 1,033 were identified by KL-meta, 486 by YETI, and 1,067 by YETI2. Since KL-meta and YETI pick up orthogonal signals, these procedures do not overlap much; only 10 SNPs were identified by both KL-meta and YETI, and they were detected by YETI2 as well. On the other hand, YETI2 had substantial overlap with both KL-meta and YETI. There were 444 SNPs identified by KL-meta only. These SNPs were not captured by YETI2, either because they are potential false-positive findings or because of the impact of the one extra degree of freedom. More interestingly, 223 SNPs were identified by YETI2 only. This is because neither the signal of the pooled marginal effects alone nor that of the heterogeneity in marginal effects alone was strong enough to be captured by KL-meta or YETI.
Figure 2:
The Venn diagram summarizes the total number of SNPs passing the filtering stage based on KL-meta, YETI, and YETI2. The tuning parameter is set to α1 = 0.002. The sizes of the regions are not proportional to the numbers of SNPs they contain.
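The limited overlap between the screening signals can be illustrated with a small numerical sketch for two sites. The code below is an illustrative stand-in, not the exact first-stage statistics of KL-meta, YETI, or YETI2: it assumes an inverse-variance fixed-effects pooled estimate tested with a 1-df Wald chi-square, Cochran's Q for heterogeneity (1 df with two sites), and their sum as a 2-df statistic. The function names and the toy effect sizes are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 df."""
    return math.exp(-x / 2.0)


def first_stage_pvalues(beta, se):
    """Screening p-values for one SNP from two site-specific marginal effects.

    beta, se: length-2 sequences of log-odds-ratio estimates and standard errors.
    Returns p-values for the pooled effect (1 df), the heterogeneity
    (1 df for two sites), and their sum (2 df).
    """
    if len(beta) != 2 or len(se) != 2:
        raise ValueError("this sketch handles exactly two sites")
    w = [1.0 / s ** 2 for s in se]  # inverse-variance weights
    pooled = sum(wk * bk for wk, bk in zip(w, beta)) / sum(w)
    stat_pooled = pooled ** 2 * sum(w)
    stat_het = sum(wk * (bk - pooled) ** 2 for wk, bk in zip(w, beta))
    stat_both = stat_pooled + stat_het  # equals the sum of (beta_k / se_k) ** 2
    return {
        "pooled": chi2_sf_1df(stat_pooled),
        "heterogeneity": chi2_sf_1df(stat_het),
        "combined": chi2_sf_2df(stat_both),
    }


ALPHA1 = 0.002  # first-stage tuning parameter used in the analysis above

# Toy SNP whose two site-specific effects point in opposite directions.
p = first_stage_pvalues(beta=[0.30, -0.30], se=[0.08, 0.08])
passed = {name: value < ALPHA1 for name, value in p.items()}
print(passed)
```

In this toy case the pooled effect is exactly zero, so a screen on the pooled effect alone would discard the SNP, while the heterogeneity and combined statistics retain it.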
Figure 3 presents Bland-Altman plots of variability against the average of the two site-specific marginal effects for the Spanish and the Finnish/USA populations. As shown in the left panel of Figure 3, KL-meta identified SNPs with large pooled marginal effects across the Spanish and Finnish/USA populations, while YETI picked up SNPs with large variability between the two populations, as shown in the middle panel of Figure 3. The strength of YETI2 lies in combining variability and pooled marginal effects, so that YETI2 can identify additional SNPs that are not captured by either KL-meta or YETI.
Figure 3:
The Bland-Altman plots show the distribution of the two site-specific marginal effects for those SNPs that passed the filtering stage based on KL-meta (left panel), YETI (middle panel), and YETI2 (right panel).
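For readers unfamiliar with Bland-Altman plots, the coordinates are simple to compute from two site-specific estimates per SNP. The following Python sketch uses the between-site difference as the measure of variability, which is the usual Bland-Altman convention and an assumption about how Figure 3 was drawn; the function name and the toy values are hypothetical.

```python
def bland_altman_points(effects_site1, effects_site2):
    """Return one (average, difference) pair per SNP."""
    if len(effects_site1) != len(effects_site2):
        raise ValueError("both sites must report the same SNPs")
    return [((b1 + b2) / 2.0, b1 - b2)
            for b1, b2 in zip(effects_site1, effects_site2)]


# Toy marginal effects (log odds ratios) for three SNPs at two sites.
site1 = [0.5, 0.25, -0.5]
site2 = [0.5, -0.25, 0.0]
points = bland_altman_points(site1, site2)
print(points)
```

A SNP far from zero on the horizontal axis carries a pooled signal, while a SNP far from zero on the vertical axis carries a heterogeneity signal.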
After filtering out the majority of irrelevant SNPs, we tested for interactions in the second stage. As a result, we identified three interactions based on the proposed methods, and the results are summarized in Table 2. The number of cases and controls in each genotype group for each pair of interacting SNPs is provided in Tables S1 and S2 of the supplementary materials. The resulting logistic regression coefficients for the interaction analysis are provided in Table S3 of the supplementary materials. Although KL-meta and YETI essentially picked up orthogonal signals, they both identified one interaction between SNP rs1402938 (2kb 3′ of the ANAPC1 gene) and SNP rs6502446 (in the TEKT3 gene), with p-value = 2.86 × 10⁻¹¹. This interaction yielded a considerable signal in both the pooled marginal effect and the heterogeneity in marginal effects, and was confirmed by YETI2 as well. In addition, YETI and YETI2 identified two more interactions: one between SNP rs1402938 (2kb 3′ of the ANAPC1 gene) and SNP rs2141788 (54kb 5′ of the TMEM45B gene), with p-value = 7.41 × 10⁻¹³, and another between SNP rs1402938 (2kb 3′ of the ANAPC1 gene) and SNP rs10941856 (in the RP11–774D14.1 gene), with p-value = 1.08 × 10⁻³¹. The two SNPs rs2141788 and rs10941856 did not pass the filtering stage in any of the three procedures. They have close to zero marginal effects, which led to their failure to pass the filtering stage, so that KL-meta was not able to identify the two interactions. On the other hand, even though these two SNPs did not pass the first stage, YETI2 and YETI could still pick them up for interaction testing. This is because both procedures require only one of the two SNPs in a pair to pass the first stage.
Table 2:
Interaction analysis results based on the KL-meta, YETI, and YETI2 procedures using the bladder cancer data set from dbGaP.
| Procedures | Gene1 | Gene2 | SNP1 | SNP2 | P-value |
|---|---|---|---|---|---|
| KL-meta | 2kb 3′ of ANAPC1 | TEKT3 | (Chr.2) rs1402938 | (Chr.17) rs6502446 | 2.86 × 10⁻¹¹ |
| YETI | 2kb 3′ of ANAPC1 | TEKT3 | (Chr.2) rs1402938 | (Chr.17) rs6502446 | 2.86 × 10⁻¹¹ |
| | 2kb 3′ of ANAPC1 | 54kb 5′ of TMEM45B | (Chr.2) rs1402938 | (Chr.11) rs2141788 | 7.41 × 10⁻¹³ |
| | 2kb 3′ of ANAPC1 | RP11-774D14.1 | (Chr.2) rs1402938 | (Chr.5) rs10941856 | 1.08 × 10⁻³¹ |
| YETI2 | 2kb 3′ of ANAPC1 | TEKT3 | (Chr.2) rs1402938 | (Chr.17) rs6502446 | 2.86 × 10⁻¹¹ |
| | 2kb 3′ of ANAPC1 | 54kb 5′ of TMEM45B | (Chr.2) rs1402938 | (Chr.11) rs2141788 | 7.41 × 10⁻¹³ |
| | 2kb 3′ of ANAPC1 | RP11-774D14.1 | (Chr.2) rs1402938 | (Chr.5) rs10941856 | 1.08 × 10⁻³¹ |
We next sought to evaluate the biological plausibility of these associations using both gene expression and network-based resources. We queried HaploReg v4.1 (Ward & Kellis, 2011) for evidence of enhancer activity or transcription factor binding in cell lines at SNP rs1402938 (2kb 3′ of the ANAPC1 gene) and at SNPs in high LD with it. No SNPs were reported in HaploReg with r² > 0.8 to this SNP (in Europeans); the SNP itself was reported to affect the binding motif of NANOG, a transcription factor that might be connected with some aspects of bladder cancer (Gawlik-Rzemieniewska et al., 2016; Mohd et al., 2017). Furthermore, we queried GTEx (GTEx Consortium, 2015), a database of gene expression in normal human tissues, and found that ANAPC1 was expressed relatively ubiquitously, including in the normal human bladder (see Figure S1 in the supplementary materials). ANAPC1 was found to be highly expressed in EBV-transformed lymphocytes, which is of interest given the findings of Abe et al. (2008). The identified SNP rs2141788 lies 54kb 5′ of the TMEM45B gene, and various SNPs in high LD with it (HaploReg v4.1) disrupt a variety of regulatory motifs. We note that, among the motifs disrupted by rs2141788 itself, there was again that for NANOG discussed above. In contrast with ANAPC1’s ubiquitous expression, TMEM45B was expressed in a small number of tissues, one of which was the bladder (see Figure S2 in the supplementary materials). To evaluate potential biological relationships between these genes, we queried the GIANT resource (Greene et al., 2015). GIANT provides tissue-specific networks of shared pathway relationships, built from a Bayesian integration of available genome-wide data, for 144 human tissues and cell lineages.
We queried the network for TEKT3, ANAPC1, and TMEM45B to compare the level of network support for the YETI-identified interactions with the level of support for the interaction discovered by all three methods. We were unable to include RP11–774D14.1 because GIANT does not include lncRNAs. We found a similar level of network support for both interactions: neither is predicted to have direct edges, but all genes are connected through intermediary nodes, particularly C12orf44 and RAB3D (see Figure S3 in the supplementary materials), which are the two genes most connected to the three query genes. Examining the sources of evidence behind each edge suggested that co-expression was the primary driver of the relationships observed in this module. Taken together, these resources reveal that the genes participating in these statistical interactions influencing susceptibility to bladder cancer are expressed in the bladder and linked through short paths in the bladder network. We note that the identified SNPs could be false positives; however, they are not imputed genotypes, as described in Section E of the supplementary materials. In addition to genotyping quality control, we conducted a number of quality control checks to filter out SNPs and individuals before further analysis. The impact of population structure is unknown, but we attempted to reduce the impact of population stratification by including principal components as covariates in the second-stage regression model. We believe that these findings still need to be validated by further investigations.
Discussion
We have proposed YETI2 to identify G×G interactions in distributed GWAS databases by embracing heterogeneity in marginal effects across sites. YETI2 is based on a first-two-moments test that uses both pooled marginal effects and heterogeneity in marginal effects at the first stage, followed by testing for interactions at the follow-up stage. Most importantly, the key rationale behind YETI2 is that such a two-stage testing method can search for SNPs sequentially by first identifying low-order marginal effects, thereby reducing the search space before looking for higher-order interactions. In this work, we used both simulated and real data from dbGaP to evaluate the performance of YETI2 compared with KL-meta and YETI. Essentially, KL-meta is based on pooled marginal effects, while YETI is based on heterogeneity in marginal effects alone; these two pieces of information are approximately orthogonal to each other. By borrowing their strengths, YETI2 is devised as a compromise between KL-meta and YETI. As our simulation results indicated, for at least modest-sized marginal effects, KL-meta performed better than YETI and YETI2; YETI2, although it may not be the best, generally has robust power. It is worth noting that when the main effects of the two causal SNPs are in the opposite direction to a nonzero interaction effect, the estimated marginal effects cancel out and the SNPs do not pass the filtering stage, leading to the non-monotonic power curve for KL-meta. In contrast, the power function of YETI2 is always monotonic because YETI2 screens the SNPs not only by the pooled marginal effects but also by heterogeneity in marginal effects across sites. Additionally, the simulation results have shown the impact of search strategies on power in different schemes. When substantial heterogeneity exists, the one-versus-all search strategy can help improve the power of identifying interactions. 
In contrast, the pairwise search strategy could be optimal for the situation with strong pooled marginal effects.
There are four limitations of YETI2. (1) At the first stage, which is based on the ANOVA-type K-df test, YETI2 uses a test with two degrees of freedom rather than one. YETI2 could therefore have less power if the signal is contained purely in the pooled marginal effects; that is, when the interacting SNPs have large marginal effects, it sacrifices some power because of the extra degree of freedom. In such a case, the extra degree of freedom is unnecessary and makes it harder for the interacting SNPs to pass to the testing stage. In contrast, when the information is contained in both pooled marginal effects and heterogeneity in marginal effects, YETI2 is more powerful than the others. (2) Since YETI2 is devised to require at least one of the two interacting SNPs to have significant marginal effects, pure interactions in the absence of main effects, or with small main effects, could be missed (Urbanowicz et al., 2012). (3) The framework of YETI2 focuses on identifying G×G interactions among SNPs with evidence of heterogeneity in marginal effects across sites. However, we note that heterogeneity in marginal effects could result from phenomena other than G×G interactions, including differences in LD patterns and non-genetic confounders (McCarthy et al., 2008). YETI2 could be susceptible to such potential causes of heterogeneity in marginal effects. With heterogeneity due to differential LD, YETI2 is expected to lose substantial power in certain situations, either for differential LD between the two causal SNPs or for LD between the tag SNP and the causal SNP. For more discussion, please refer to the simulation results in Liu et al. (2016). Another source of heterogeneity may be non-genetic factors, e.g., environmental factors. If known, these environmental factors can be adjusted for through a meta-regression in the first stage of YETI2. 
If they are unknown, the first stage could be vulnerable to differential environmental exposures among sites, but the second stage of YETI2 protects against this by testing for the proposed interactions within each site. In this regard, putative interacting SNPs that pass the first stage of YETI2 but fail to replicate within sites could be explained by the presence of differential environmental exposures or other factors. Although YETI2 is targeted at G×G interactions, it is also applicable to studying G×E interactions. In future work, a framework similar to YETI2 could be adapted to specifically detect such G×E interactions. (4) YETI2 requires each study to conduct the interaction analyses for all L × (L − 1) pairs of SNPs and report the estimated interaction effects in the second step. In practice, this could be challenging, especially in large-scale research consortia. Alternatively, study sites could choose to share their individual-level data so that a central analysis group could carry out this analysis.
Furthermore, future work should address three potential issues that have not been addressed in this study: covariate adjustment, the genetic coding model, and quantitative traits. Regarding covariate adjustment, each site can make appropriate covariate adjustments for its own effect size estimates, so that the local population substructure and site-specific covariates can be accounted for. For example, the SNPs of interest may interact with environmental or other risk factors whose prevalences vary across sites, leading to variation in effect sizes. Such adjustments are not considered in this work, and hence further investigation is needed. Regarding genetic coding models, we only considered the identification of genetic interactions under a dominant model in this work. Commonly, the genotype is encoded according to the assumed disease model. Using different coding models may affect the ability to find the underlying patterns of the main and interaction effects (Schaid et al., 2005). The robustness of our method with respect to the genetic coding model should be investigated in future work. Another interesting yet open question is how to formulate the various procedures for combining mean signals and heterogeneity in marginal effects across sites as an optimization problem with the goal of maximizing statistical power in identifying G×G interactions. Lastly, the current method is applicable to studying binary phenotypes only. With the availability and utility of data from biobanks (e.g., UK Biobank), studies of the genetic variation underlying common disorders will focus particularly on quantitative traits, such as blood pressure for cardiovascular disease (Warren et al., 2017), mood for major depressive disorder (Boomsma et al., 2008), and body mass index for obesity (Akiyama et al., 2017). 
Within the proposed framework, the proposed procedure for binary outcome could be extended to other types of outcomes, such as categorical and time-to-event outcomes, by replacing the regression in stages 1 and 2 by appropriate regression models.
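The coding models mentioned above differ only in how a genotype, counted as the number of copies of the allele of interest (0, 1, or 2), is mapped to a covariate. The following Python sketch shows the dominant coding used in this work alongside the common additive and recessive alternatives, together with the product term that a pairwise interaction model would use. The function names and the 0/1/2 input convention are illustrative assumptions.

```python
def encode_genotype(n_alleles, model="dominant"):
    """Map an allele count (0, 1, 2) to a covariate under a coding model."""
    if n_alleles not in (0, 1, 2):
        raise ValueError("allele count must be 0, 1, or 2")
    if model == "dominant":   # at least one copy carries the effect
        return int(n_alleles >= 1)
    if model == "recessive":  # two copies are needed
        return int(n_alleles == 2)
    if model == "additive":   # effect scales with the number of copies
        return n_alleles
    raise ValueError(f"unknown coding model: {model}")


def interaction_term(g1, g2, model="dominant"):
    """Product of the two encoded genotypes, as used in a pairwise G x G term."""
    return encode_genotype(g1, model) * encode_genotype(g2, model)


for model in ("dominant", "additive", "recessive"):
    print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```

Under the dominant coding the interaction term is nonzero whenever both SNPs carry at least one copy, whereas under the recessive coding both SNPs must be homozygous, so the same data can show quite different interaction patterns.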
In conclusion, there are numerous methods to detect interactions in genetic data, for example, feature selection methods (Greene et al., 2008), multifactor-dimensionality reduction for feature construction (Ritchie et al., 2001; Gui et al., 2013), or other machine learning methods (Greene et al., 2009b; Urbanowicz et al., 2013), among others. Capturing and replicating statistical interactions is the first step to identifying the biological mechanism underlying the interactions. We believe YETI2 will be useful to genetic researchers and will help to uncover susceptibility variants that may otherwise be missed by the conventional methods.
Supplementary Material
Acknowledgments
This work was supported by NIH grants R01 AI130460 (for Y.C.), R01 LM012607 (for J.M. and Y.C.), R01 AI116794 (for J.M. and Y.C.), R01 LM0010098 (for J.M.), P30 ES013508 (for J.M.), P30 CA016672 (for P.S.) and NSF grant IIS-1718798 (for K.C.). The authors thank the referees, the associate editor and the editor for their constructive comments that substantially improved the presentation of this work.
Footnotes
Data availability statement
The data that support the findings of this study (Genome-wide association study for bladder cancer risk; [data set] Rothman N, Garcia-Closas M, Chatterjee N, Malats N, Wu X, Figueroa JD and others 2010) are available through dbGaP (phs000346.v1.p1). Genetic and expression data are under study number phs000346.v1.p1 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000346.v1.p1).
References
- Abe Takashige, Shinohara Nobuo, Tada Mitsuhiro, Harabayashi Toru, Sazawa Ataru, Maruyama Satoru, Moriuchi Tetsuya, Takada Kenzo, & Nonomura Katsuya. 2008. Infiltration of Epstein-Barr virus-harboring lymphocytes occurs in a large subset of bladder cancers. International journal of urology, 15(5), 429–434. [DOI] [PubMed] [Google Scholar]
- Akiyama Masato, Okada Yukinori, Kanai Masahiro, Takahashi Atsushi, Momozawa Yukihide, Ikeda Masashi, Iwata Nakao, Ikegawa Shiro, Hirata Makoto, Matsuda Koichi, et al. 2017. Genome-wide association study identifies 112 new loci for body mass index in the Japanese population. Nature genetics, 49(10), 1458. [DOI] [PubMed] [Google Scholar]
- American Cancer Society. 2017. URL:http://www.cancer.org/cancer/bladdercancer/detailedguide/bladder-cancer-key-statistics Accessed: February 1st, 2017.
- Boomsma Dorret I, Willemsen Gonneke, Sullivan Patrick F, Heutink Peter, Meijer Piet, Sondervan David, Kluft Cornelis, Smit Guus, Nolen Willem A, Zitman Frans G, et al. 2008. Genome-wide association of major depression: description of samples for the GAIN Major Depressive Disorder Study: NTR and NESDA biobank projects. European Journal of Human Genetics, 16(3), 335. [DOI] [PubMed] [Google Scholar]
- Culverhouse Robert, Suarez Brian K, Lin Jennifer, & Reich Theodore. 2002. A perspective on epistasis: limits of models displaying no main effect. The American Journal of Human Genetics, 70(2), 461–471. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dai James Y, Kooperberg Charles, Leblanc Michael, & Prentice Ross L. 2012. Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika, 99(4), 929–944. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DerSimonian Rebecca, & Laird Nan. 1986. Meta-analysis in clinical trials. Controlled clinical trials, 7(3), 177–188. [DOI] [PubMed] [Google Scholar]
- Eichler Evan E, Flint Jonathan, Gibson Greg, Kong Augustine, Leal Suzanne M, Moore Jason H, & Nadeau Joseph H. 2010. Missing heritability and strategies for finding the underlying causes of complex disease. Nature Reviews Genetics, 11(6), 446–450. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Evans David M, Marchini Jonathan, Morris Andrew P, & Cardon Lon R. 2006. Two-stage two-locus models in genome-wide association. PLoS Genetics, 2(9), e157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gawlik-Rzemieniewska Natalia, Galilejczyk Anna, Krawczyk Michał, & Bednarek Ilona. 2016. Silencing expression of the NANOG gene and changes in migration and metastasis of urinary bladder cancer cells. Archives of medical science: AMS, 12(4), 889. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Greene Casey S, White Bill C, & Moore Jason H. 2008. Ant colony optimization for genome-wide genetic analysis. Lecture Notes in Computer Science, 5217, 37–47. [Google Scholar]
- Greene Casey S, Penrod Nadia M, Williams Scott M, & Moore Jason H. 2009a. Failure to replicate a genetic association may provide important clues about genetic architecture. PloS one, 4(6), e5639. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Greene Casey S, Penrod Nadia M, Kiralis Jeff, & Moore Jason H. 2009b. Spatially Uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions. BioData Mining, 2(1), 5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Greene Casey S., Krishnan Arjun, Wong Aaron K., Ricciotti Emanuela, Zelaya Rene A., Himmelstein Daniel S., Zhang Ran, Hartmann Boris M., Zaslavsky Elena, Sealfon Stuart C., Chasman Daniel I., FitzGerald Garret A., Dolinski Kara, Grosser Tilo, & Troyanskaya Olga G. 2015. Understanding multicellular function and disease with human tissue-specific networks. Nature Genetics, 47(6), 569–576. [DOI] [PMC free article] [PubMed] [Google Scholar]
- GTEx Consortium. 2015. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648–660. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gui Jiang, Moore Jason H, Williams Scott M, Andrews Peter, Hillege Hans L, van der Harst Pim, Navis Gerjan, Van Gilst Wiek H, Asselbergs Folkert W, & Gilbert-Diamond Diane. 2013. A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits. PLoS One, 8(6), e66545. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ji Zhanglong, Jiang Xiaoqian, Wang Shuang, Xiong Li, & Ohno-Machado Lucila. 2014. Differentially private distributed logistic regression using private and public data. BMC medical genomics, 7(Suppl 1), S14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang Wenchao, Li Pinghao, Wang Shuang, Wu Yuan, Xue Meng, Ohno-Machado Lucila, & Jiang Xiaoqian. 2013. WebGLORE: a web service for Grid LOgistic REgression. Bioinformatics, 29(24), 3238–3240. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kamm Liina, Bogdanov Dan, Laur Sven, & Vilo Jaak. 2013. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics, 29(7), 886–893. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kang Hyun Min, Sul Jae Hoon, Service Susan K, Zaitlen Noah A, Kong Sit-yee, Freimer Nelson B, Sabatti Chiara, Eskin Eleazar, et al. 2010. Variance component model to account for sample structure in genome-wide association studies. Nature genetics, 42(4), 348. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kiemeney Lambertus A, Thorlacius Steinunn, Sulem Patrick, Geller Frank, Aben Katja KH, Stacey Simon N, Gudmundsson Julius, Jakobsdottir Margret, Bergthorsson on T, Sigurdsson Asgeir, et al. 2008. Sequence variant on 8q24 confers susceptibility to urinary bladder cancer. Nature genetics, 40(11), 1307–1312. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kiemeney Lambertus A, Sulem Patrick, Besenbacher Soren, Vermeulen Sita H, Sigurdsson Asgeir, Thorleifsson Gudmar, Gudbjartsson Daniel F, Stacey Simon N, Gudmundsson Julius Zanon Carlo, et al. 2010. A sequence variant at 4p16. 3 confers susceptibility to urinary bladder cancer. Nature genetics, 42(5), 415–419. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kooperberg Charles, & LeBlanc Michael. 2008. Increasing the power of identifying gene× gene interactions in genome-wide association studies. Genetic epidemiology, 32(3), 255–263. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lewinger Juan Pablo, Morrison John L, Thomas Duncan C, Murcray Cassandra E, Conti David V, Li Dalin, & Gauderman W James. 2013. Efficient Two-Step Testing of Gene-Gene Interactions in Genome-Wide Association Studies. Genetic epidemiology, 37(5), 440–451. [DOI] [PubMed] [Google Scholar]
- Liu Yulun, Chen Yong, & Scheet Paul. 2016. A meta-analytic framework for detection of genetic interactions. Genetic Epidemiology, 40(7), 534–543. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Manolio Teri A, Collins Francis S, Cox Nancy J, Goldstein David B, Hindorff Lucia A, Hunter David J, McCarthy Mark I, Ramos Erin M, Cardon Lon R, Chakravarti Aravinda, et al. 2009. Finding the missing heritability of complex diseases. Nature, 461(7265), 747. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Marchini Jonathan, Donnelly Peter, & Cardon Lon R. 2005. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature genetics, 37(4), 413–417. [DOI] [PubMed] [Google Scholar]
- McCarthy Mark I, Abecasis Gonçalo R, Cardon Lon R, Goldstein David B, Little Julian, Ioannidis John PA, & Hirschhorn Joel N. 2008. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews genetics, 9(5), 356. [DOI] [PubMed] [Google Scholar]
- Mohd Khairul, Huzlinda Hussin, Abhimanyu VEERAKUMARASIVAM, Chan Soon, CHOY Maizaton, Atmadini ABDULLAH, Fauzah Abdul, & Ghani. 2017. Immunohistochemical expression of NANOG in urothelial carcinoma of the bladder. The Malaysian journal of pathology, 39(3), 227–234. [PubMed] [Google Scholar]
- Moore Jason H, & Williams Scott M. 2009. Epistasis and its implications for personal genetics. The American Journal of Human Genetics, 85(3), 309–320. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moore Jason H, Asselbergs Folkert W, & Williams Scott M. 2010. Bioinformatics challenges for genomewide association studies. Bioinformatics, 26(4), 445–455. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Murcray Cassandra E, Lewinger Juan Pablo, & Gauderman W James. 2009. Gene-environment interaction in genome-wide association studies. American journal of epidemiology, 169(2), 219–226. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Novembre John, Johnson Toby, Bryc Katarzyna, Kutalik Zoltán, Boyko Adam R, Auton Adam, Indap Amit, King Karen S, Bergmann Sven, Nelson Matthew R, et al. 2008. Genes mirror geography within Europe. Nature, 456(7218), 98. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Price Alkes L, Zaitlen Noah A, Reich David, & Patterson Nick. 2010. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics, 11(7), 459. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pritchard Jonathan K, & Rosenberg Noah A. 1999. Use of unlinked genetic markers to detect population stratification in association studies. The American Journal of Human Genetics, 65(1), 220–228. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pritchard Jonathan K, Stephens Matthew, Rosenberg Noah A, & Donnelly Peter. 2000. Association mapping in structured populations. The American Journal of Human Genetics, 67(1), 170–181. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Psaty Bruce M, O’Donnell Christopher J, Gudnason Vilmundur, Lunetta Kathryn L, Folsom Aaron R, Rotter Jerome I, Uitterlinden André G, Harris Tamara B, Witteman Jacqueline CM, Boerwinkle Eric, et al. 2009. Cohorts for heart and aging research in genomic epidemiology (CHARGE) consortium design of prospective meta-analyses of genome-wide association studies from 5 cohorts. Circulation: Cardiovascular Genetics, 2(1), 73–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Purcell Shaun, Neale Benjamin, Todd-Brown Kathe, Thomas Lori, Ferreira Manuel AR, Bender David, Maller Julian, Sklar Pamela, De Bakker Paul IW, Daly Mark J, et al. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3), 559–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rafnar Thorunn, Sulem Patrick, Stacey Simon N, Geller Frank, Gudmundsson Julius, Sigurdsson Asgeir, Jakobsdottir Margret, Helgadottir Hafdis, Thorlacius Steinunn, Aben Katja KH, et al. 2009. Sequence variants at the TERT-CLPTM1L locus associate with many cancer types. Nature genetics, 41(2), 221–227.
- Ritchie Marylyn D, Hahn Lance W, Roodi Nady, Bailey L Renee, Dupont William D, Parl Fritz F, & Moore Jason H. 2001. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. The American Journal of Human Genetics, 69(1), 138–147.
- Rothman Nathaniel, Garcia-Closas Montserrat, Chatterjee Nilanjan, Malats Nuria, Wu Xifeng, Figueroa Jonine D, Real Francisco X, Van Den Berg David, Matullo Giuseppe, Baris Dalsu, et al. 2010. A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nature genetics, 42(11), 978–984.
- Schaid Daniel J, McDonnell Shannon K, Hebbring Scott J, Cunningham Julie M, & Thibodeau Stephen N. 2005. Nonparametric tests of association of multiple genes with human disease. The American Journal of Human Genetics, 76(5), 780–793.
- Self Steven G, & Liang Kung-Yee. 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82(398), 605–610.
- Thomas Duncan. 2010. Gene–environment-wide association studies: emerging approaches. Nature Reviews Genetics, 11(4), 259–272.
- Thornton-Wells Tricia A, Moore Jason H, & Haines Jonathan L. 2004. Genetics, statistics and human disease: analytical retooling for complexity. TRENDS in Genetics, 20(12), 640–647.
- Tryka Kimberly A, Hao Luning, Sturcke Anne, Jin Yumi, Wang Zhen Y, Ziyabari Lora, Lee Moira, Popova Natalia, Sharopova Nataliya, Kimura Masato, et al. 2013. NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucleic acids research, 42(D1), D975–D979.
- Urbanowicz Ryan J, Kiralis Jeff, Sinnott-Armstrong Nicholas A, Heberling Tamra, Fisher Jonathan M, & Moore Jason H. 2012. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData mining, 5(1), 16.
- Urbanowicz Ryan John, Andrew Angeline S, Karagas Margaret Rita, & Moore Jason H. 2013. Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome: a learning classifier system approach. Journal of the American Medical Informatics Association, 20(4), 603–612.
- Wang Shuang, Jiang Xiaoqian, Wu Yuan, Cui Lijuan, Cheng Samuel, & Ohno-Machado Lucila. 2013. EXpectation Propagation LOgistic REgRession (EXPLORER): distributed privacy-preserving online model learning. Journal of biomedical informatics, 46(3), 480–496.
- Ward Lucas D, & Kellis Manolis. 2011. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic acids research, 40(D1), D930–D934.
- Warren Helen R, Evangelou Evangelos, Cabrera Claudia P, Gao He, Ren Meixia, Mifsud Borbala, Ntalla Ioanna, Surendran Praveen, Liu Chunyu, Cook James P, et al. 2017. Genome-wide association analysis identifies novel blood pressure loci and offers biological insights into cardiovascular risk. Nature genetics, 49(3), 403.
- Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, Heid IM, Berndt SI, Elliott AL, Jackson AU, Lamina C, et al.; Genetic Investigation of ANthropometric Traits Consortium. 2009. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nature genetics, 41, 25–34.
- Wolfson Michael, Wallace Susan E, Masca Nicholas, Rowe Geoff, Sheehan Nuala A, Ferretti Vincent, LaFlamme Philippe, Tobin Martin D, Macleod John, Little Julian, et al. 2010. DataSHIELD: resolving a conflict in contemporary bioscience-performing a pooled analysis of individual-level data without sharing the data. International journal of epidemiology, 39(5), 1372–1382.
- Wu Xifeng, Ye Yuanqing, Kiemeney Lambertus A, Sulem Patrick, Rafnar Thorunn, Matullo Giuseppe, Seminara Daniela, Yoshida Teruhiko, Saeki Norihisa, Andrew Angeline S, et al. 2009. Genetic variation in the prostate stem cell antigen gene PSCA confers susceptibility to urinary bladder cancer. Nature genetics, 41(9), 991–995.
- Wu Yuan, Jiang Xiaoqian, Kim Jihoon, & Ohno-Machado Lucila. 2012. Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. Journal of the American Medical Informatics Association, 19(5), 758–764.
- Wu Yuan, Jiang Xiaoqian, Wang Shuang, Jiang Wenchao, Li Pinghao, & Ohno-Machado Lucila. 2015. Grid multi-category response logistic models. BMC medical informatics and decision making, 15(1), 1.
- Xie Wei, Kantarcioglu Murat, Bush William S, Crawford Dana, Denny Joshua C, Heatherly Raymond, & Malin Bradley A. 2014. SecureMA: protecting participant privacy in genetic association meta-analysis. Bioinformatics, 30(23), 3334–3341.
- Zheng X. 2012. SNPRelate: parallel computing toolset for genome-wide association studies. R package version, 95.