Abstract
An important challenge in the analysis of single nucleotide polymorphism (SNP) data is the identification of SNPs that interact in a nonlinear fashion in their association with disease. Such epistatic interactions among genetic variants at multiple loci likely underlie the inheritance of common diseases. We have developed a novel method called the Bayesian combinatorial method (BCM) for detecting combinations of genetic variants that are predictive of disease. When compared with multifactor dimensionality reduction (MDR), a widely used combinatorial method, BCM has significantly greater power to detect interactions and is computationally more efficient.
Introduction
Common diseases, like hypertension and Alzheimer’s disease, are believed to be caused by genetic variants at multiple loci, with each locus conferring modest risk of developing disease. In addition, interactions among genetic variants at multiple loci and between genetic variants and environmental factors likely play an important role in such diseases. This is in contrast to Mendelian diseases, like cystic fibrosis and thalassemia, that are caused by variation at a single genetic locus. The commonest type of genetic variation is the single nucleotide polymorphism (SNP) that results when a single nucleotide is replaced by another in the genome sequence. The development of high-throughput genotyping technologies that simultaneously assay many thousands of SNPs has led to a flurry of studies with the aim of discovering SNPs that either singly or in combination are associated with disease.
An important challenge in the analysis of SNP data is the identification of epistatic loci that interact in a nonlinear fashion in their association with disease. Biologically, epistasis refers to gene-gene interaction when the action of one gene is modified by one or several other genes. Statistically, epistasis refers to interaction between genetic variants at multiple loci in which the net effect on disease from the combination of genotypes at the different loci is not accurately predicted by a simple linear combination of the individual genotype effects. The detection of statistical epistasis has the potential to identify interacting genetic loci that likely underlie the inheritance of common diseases.
In this paper, we develop and evaluate a new Bayesian method to detect interactions. We compare its power to detect interactions among SNPs and its computational efficiency with those of a widely used combinatorial method called multifactor dimensionality reduction (MDR).
Genetic interaction models
Ritchie et al. [1] have developed genetic models of epistatic interactions that are valuable for generating synthetic SNP data for testing algorithms designed to detect epistasis. These models define non-linear interactions between two genetic loci in terms of penetrance, heritability and minor allele frequency. Penetrance is the probability that an individual will develop a disease given that the individual has a genotype or a combination of genotypes that allows the disease to be exhibited. For example, in the presence of a completely penetrant genetic variant (allele) at a locus, the disease is always expressed and the penetrance is 1.0, while in the presence of an incompletely penetrant allele the penetrance is less than 1.0. Heritability is defined as the proportion of disease variation that is attributable to genetic factors. Higher heritability values indicate that the disease has a larger genetic component. The minor allele frequency is the frequency of the less frequent allele at the locus in a given population.
An illustrative example of a simple epistatic genetic model is shown in Figure 1, which is based on the 2-locus interaction model M170 described in [2]. For a SNP, typically two alleles are present in the population: the major or wild-type allele (say A) and the minor or mutant allele (say a). Thus, the genotype at a SNP locus can have one of three states: AA, Aa or aa. In Figure 1, the two illustrative SNPs have the alleles A and a, and B and b, respectively. The population frequencies of the genotypes are given in parentheses: for genotypes AA, aa, BB and bb the population frequency is 0.25, while for the genotypes Aa and Bb the population frequency is 0.50. The individual cells contain the penetrance values for the corresponding combination of genotypes. This model represents a probabilistic version of the exclusive OR function. The inherited risk of disease, in this example, is dependent on the particular combination of genotypes and hence epistasis is present. Those who possess Aa at the first SNP or Bb at the second SNP, but not both, have a higher risk of disease. The marginal penetrance of each genotype at either SNP is 0.05; it is computed by summing the products of the genotype frequencies and the penetrance values along the corresponding row or column. This implies that there is no increase in disease risk when each SNP is examined individually. It is only when the two SNPs are examined in combination that the higher risk genotypes are revealed to be Aa or Bb, but not both. In this model, both alleles at each SNP have a frequency of 0.5 and the heritability is calculated to be 0.053 using the equations given in [3]. This is an extreme model where the SNPs exhibit no univariate association with disease.
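The marginal penetrance and heritability quoted for this model can be checked with a few lines of arithmetic. The sketch below is illustrative only: the cell values (0.1 in the high-risk cells, 0 elsewhere) are an assumption, chosen because they reproduce the marginal penetrance of 0.05 and the heritability of 0.053 stated above, and the heritability formula (variance of the penetrance divided by K(1 − K), with K the population prevalence) is assumed to be the one intended by [3].

```python
# Genotype frequencies for AA/Aa/aa (and BB/Bb/bb), as quoted in the text.
freq = [0.25, 0.50, 0.25]

# ASSUMPTION: penetrance of 0.1 in the high-risk cells and 0 elsewhere.
# Rows index the first SNP (AA, Aa, aa); columns index the second (BB, Bb, bb).
penetrance = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

# Marginal penetrance of each genotype: weight the row (or column) by the
# genotype frequencies of the other SNP and sum.
row_marginals = [sum(freq[j] * penetrance[i][j] for j in range(3)) for i in range(3)]
col_marginals = [sum(freq[i] * penetrance[i][j] for i in range(3)) for j in range(3)]

# Population prevalence K: penetrance averaged over all nine genotype combinations.
prevalence = sum(
    freq[i] * freq[j] * penetrance[i][j] for i in range(3) for j in range(3)
)

# ASSUMPTION: heritability = variance of penetrance / (K * (1 - K)).
variance = sum(
    freq[i] * freq[j] * (penetrance[i][j] - prevalence) ** 2
    for i in range(3)
    for j in range(3)
)
heritability = variance / (prevalence * (1 - prevalence))

print(row_marginals, col_marginals, round(heritability, 3))
```

Every row and column marginal comes out equal to the prevalence, which is why neither SNP shows a univariate association with disease under these assumed values.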
Figure 1.
A 2-locus epistatic model based on the M170 genetic model from [2]. Genotype frequencies are in parentheses and the penetrance values are in the cells.
Methods to detect interactions
Both parametric and non-parametric statistical methods have been described in the literature for modeling epistasis. Common parametric models, such as linear and logistic regression, have limited ability to deal with interactions involving many loci since the number of interaction terms grows exponentially with the inclusion of each additional genetic locus in the model. Hence, more recently, several non-parametric methods based on machine-learning and data mining techniques have been developed to detect interactions. Such methods include combinatorial methods, set association analysis, genetic programming, neural networks and random forests and have been summarized in a recent review [4].
Combinatorial methods search over all possible combinations of loci to find combinations that are predictive of the disease. A widely used combinatorial method is the multifactor dimensionality reduction (MDR) that has been successfully applied in identifying epistatic interactions in diseases such as sporadic breast cancer, essential hypertension, and type II diabetes. MDR was designed to detect associations between multiple SNPs and disease by examining higher order interactions among SNPs in a case-control setting.
Bayesian combinatorial method (BCM)
We have developed a novel combinatorial method called the Bayesian combinatorial method (BCM) designed to detect epistatic interactions. We next describe the combinatorial model, the Bayesian score and the algorithm in detail.
Combinatorial model
We define a combinatorial model M as a set of probability distributions over the target variable Z given X where X is a set of m SNP variables X1, X2, X3, …, Xm. For example, a 2-SNP model associated with disease is given by the set of probability distributions:
$$P(Z \mid X_1, X_2) \tag{1}$$
where Z is the disease variable and X1 and X2 are two SNP variables. Figure 2 gives an example of a table of counts for a 2-SNP model derived from a dataset of 200 samples. From the column of counts for a genotype combination the parameter estimates of a binomial (multinomial, if Z has more than two states) distribution can be derived. For example, the counts (14, 4) for the genotype combination X1 = AA and X2 = BB can be used to derive the maximum likelihood parameter estimate of 14/18 = 0.78. For a pair of SNPs, there are nine possible genotypic combinations and each genotypic combination is associated with a probability distribution over the disease variable in the model. In contrast to the typically used maximum likelihood method for parameter estimation as given in the above example, the Bayesian paradigm computes the posterior estimate for a parameter by applying Bayes' theorem, which combines the prior estimate for the parameter with the likelihood obtained from the data. In addition, the Bayesian paradigm allows the computation of the posterior probability of a model M represented by a table of counts from the prior probability of M and the data represented by the counts. This posterior probability of M represents the Bayesian score that assesses the goodness of the model M.
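As a rough illustration of how such a table of counts yields parameter estimates, the sketch below tallies counts per genotype combination and computes both the maximum likelihood estimate and a Bayesian posterior mean. The function name `count_table`, the toy data, and the reading of the first count as the number of diseased samples are assumptions made for the example; the posterior mean uses a uniform Dirichlet prior with both hyperparameters equal to 1.

```python
def count_table(genotypes, disease):
    """Map each genotype combination to [n_diseased, n_healthy]."""
    table = {}
    for combo, z in zip(genotypes, disease):
        cell = table.setdefault(tuple(combo), [0, 0])
        cell[0 if z == 1 else 1] += 1
    return table

# Toy data reproducing only the (AA, BB) column with counts (14, 4).
# ASSUMPTION: the first count refers to diseased samples.
genotypes = [("AA", "BB")] * 18
disease = [1] * 14 + [0] * 4

table = count_table(genotypes, disease)
n_diseased, n_healthy = table[("AA", "BB")]

# Maximum likelihood estimate: the observed proportion, 14/18.
ml_estimate = n_diseased / (n_diseased + n_healthy)

# Posterior mean under a Dirichlet(1, 1) prior: (14 + 1) / (18 + 2).
posterior_mean = (n_diseased + 1) / (n_diseased + n_healthy + 2)

print(round(ml_estimate, 2), posterior_mean)
```

The prior pulls the estimate slightly toward 0.5, and the pull shrinks as the cell accumulates more samples.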
Figure 2.
An example of a table of counts for a 2-SNP combinatorial model for a dataset of 200 samples.
Bayesian score
We now derive the Bayesian score for scoring a model like the one shown in Figure 2. Given a dataset D that contains values for SNPs and the disease/healthy state for N cases, and a combinatorial model M, the Bayesian score is proportional to the posterior probability P(M | D) that is given by Bayes' theorem as follows:
$$P(M \mid D) = \frac{P(M)\,P(D \mid M)}{P(D)} \tag{2}$$
where P(M) is the prior probability of model M, P(D | M) is the marginal likelihood and P(D) is the probability of the data. Intuitively, the Bayesian score is a measure that combines the prior probability of the model (which encompasses prior knowledge about the model) with the marginal likelihood (which is a measure of how well the model fits the data). Only the numerator on the right hand side of Equation 2 has to be evaluated for the Bayesian score since P(D) is a constant for all models. Thus, the Bayesian score is given by:
$$\text{Bayesian score}(M) = P(M)\,P(D \mid M) \tag{3}$$
The marginal likelihood in Equation 3 is given by the following equation:
$$P(D \mid M) = \int P(D \mid \theta_M, M)\,P(\theta_M \mid M)\,d\theta_M \tag{4}$$
where, θM are the parameters of the probability distributions of the disease variable Z given the SNP variables X, namely P(Z | X). Equation 4 has a closed-form solution under the following assumptions: (1) the values of the disease variable are generated according to i.i.d. sampling from P(Z | X), which is modeled as a multinomial distribution, (2) prior belief about the distribution P(Z | X = xi) is independent of prior belief about the distribution P(Z | X = xj) for all values xi and xj of X, such that i ≠ j, and (3) for all values xj of X, prior belief about the distribution P(Z | X = xj) is modeled using a Dirichlet distribution with hyperparameters αi and αij. The closed-form solution to the marginal likelihood is given by the following expression [5]:
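$$P(D \mid M) = \prod_{i=1}^{I} \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n_i)} \prod_{j=1}^{J} \frac{\Gamma(\alpha_{ij} + n_{ij})}{\Gamma(\alpha_{ij})} \tag{5}$$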
where I is the number of genotype combinations (e.g., nine for the model in Figure 2), J is the number of disease states (e.g., two for the model in Figure 2), Γ(•) is the gamma function, ni is the number of cases in the dataset where the SNPs X have the genotypes given by xi, nij is the number of cases that belong to the disease state j where the SNPs X have the genotypes given by xi, αij are the hyperparameters in a Dirichlet distribution which define the prior probability over the θM parameters, and αi = ∑j αij. The hyperparameters can be viewed as prior counts of cases from a previous hypothetical study (which is parallel to the current one for which D has been collected) that belong to the disease state j where the SNPs X have the genotypes given by xi. When αij and nij are positive integers, the gamma function can be expressed as a factorial and Equation 5 can be written as:
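$$P(D \mid M) = \prod_{i=1}^{I} \frac{(\alpha_i - 1)!}{(\alpha_i + n_i - 1)!} \prod_{j=1}^{J} \frac{(\alpha_{ij} + n_{ij} - 1)!}{(\alpha_{ij} - 1)!} \tag{6}$$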
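The closed-form marginal likelihood is usually evaluated in log space, because the factorials overflow quickly for realistic sample sizes. The following is a minimal sketch of the Dirichlet-multinomial closed form cited from [5], not the authors' Java implementation; the function name `log_marginal_likelihood` and the use of a single shared hyperparameter value `alpha` for all αij are assumptions made for the example.

```python
from math import exp, lgamma

def log_marginal_likelihood(counts, alpha=1.0):
    """Log of P(D | M) for a table of counts.

    counts: iterable of cells, one per genotype combination; each cell holds
            the number of samples in each disease state (n_i1, ..., n_iJ).
    alpha:  value shared by every hyperparameter alpha_ij (ASSUMPTION).
    """
    total = 0.0
    for cell in counts:
        alpha_i = alpha * len(cell)   # alpha_i = sum over j of alpha_ij
        n_i = sum(cell)               # n_i = sum over j of n_ij
        total += lgamma(alpha_i) - lgamma(alpha_i + n_i)
        for n_ij in cell:
            total += lgamma(alpha + n_ij) - lgamma(alpha)
    return total

# A single genotype combination with counts (14, 4):
# 1! * 14! * 4! / 19! = 1 / 58140.
single_cell = log_marginal_likelihood([(14, 4)])
print(exp(single_cell))

# A genotype combination with no samples contributes a factor of 1 (log 0).
print(log_marginal_likelihood([(0, 0)]))
```

Because unobserved genotype combinations contribute nothing to the score, a table only needs to store the combinations that actually occur in the data.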
Table 2.
Mean powers of BCM and MDR for 2-SNP models on sample size of 200. The last column gives the Z statistic and the corresponding two-tailed p-value obtained from applying the Wilcoxon test on results from datasets generated from 35 genetic models. Negative Z values indicate better performance by BCM, and p values less than 0.05 are in bold.
| # SNPs | BCM | MDR | Z value (p value) |
|---|---|---|---|
| 20 | 66.63 | 64.86 | −1.930 (0.054) |
| 40 | 61.46 | 59.20 | −2.082 (**0.037**) |
| 80 | 56.91 | 54.34 | −3.182 (**0.001**) |
| 160 | 53.31 | 50.03 | −3.543 (**<0.001**) |
| 320 | 49.91 | 47.14 | −3.712 (**<0.001**) |
In our experiments, we model the prior probability P(M) in Equation 3 as non-informative, i.e., a priori we consider all models to be equally plausible, so P(M) is a constant for all models M. We also set all the hyperparameters αij to 1, which implies that we assume all possible distributions of Z given X = xi to be equally likely. The model space consists of all n-SNP models for n = 1, 2, ..., L, where L is a user-specified limit.
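With these settings the score takes a particularly simple form. As a worked step, assuming two disease states (J = 2) and every αij = 1, so that αi = 2, the Dirichlet-multinomial closed form for the marginal likelihood reduces to a product of one term per genotype combination:

$$P(D \mid M) = \prod_{i=1}^{I} \frac{\Gamma(2)}{\Gamma(n_i + 2)} \cdot \frac{\Gamma(n_{i1} + 1)\,\Gamma(n_{i2} + 1)}{\Gamma(1)\,\Gamma(1)} = \prod_{i=1}^{I} \frac{n_{i1}!\; n_{i2}!}{(n_i + 1)!}$$

For instance, a genotype combination with counts (14, 4) contributes the factor 14! 4! / 19! = 1/58140. Since P(M) is constant, ranking models by this product is equivalent to ranking them by the Bayesian score.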
Bayesian combinatorial method
BCM, like MDR, evaluates all possible n-SNP combinations up to a user-specified limit on n. MDR evaluates an n-SNP combination by its ability to classify and accurately predict the disease variable by multifold cross-validation, while BCM evaluates an n-SNP combination with the Bayesian score derived in the previous section.
Given a dataset with m SNPs and N cases, the BCM algorithm is summarized as follows:
1. Select n SNPs from the set of m SNPs.
2. Construct a table of counts for all combinations of values of these n SNPs like the one shown in Figure 2.
3. Compute the Bayesian score for this table as given by Equation 6.
4. Repeat steps 1–3 for all possible combinations of n SNPs.
5. Select the model with the highest Bayesian score as the best n-SNP model.
6. Repeat steps 1–5 for n = 1, 2, ..., L, where L is a user-specified limit.
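The steps above can be sketched as an exhaustive search over SNP subsets. This is a minimal illustration under stated assumptions rather than the authors' Java implementation: the names `log_score` and `best_model` are hypothetical, genotypes are coded 0, 1, 2, the disease variable is coded 0/1, all hyperparameters are set to 1, and the toy dataset is a deterministic exclusive-OR pattern constructed only to show the search recovering the interacting pair.

```python
from itertools import combinations, product
from math import lgamma

def log_score(cells, alpha=1.0):
    """Log marginal likelihood of a table of counts (uniform model prior)."""
    total = 0.0
    for cell in cells:
        total += lgamma(alpha * len(cell)) - lgamma(alpha * len(cell) + sum(cell))
        total += sum(lgamma(alpha + n) - lgamma(alpha) for n in cell)
    return total

def best_model(genotypes, disease, n):
    """Steps 1-5: score every n-SNP combination and keep the best one."""
    best = None
    for snps in combinations(range(len(genotypes[0])), n):
        table = {}  # genotype combination -> [count healthy, count diseased]
        for row, z in zip(genotypes, disease):
            key = tuple(row[s] for s in snps)
            table.setdefault(key, [0, 0])[z] += 1
        score = log_score(table.values())
        if best is None or score > best[0]:
            best = (score, snps)
    return best

# Toy data (ASSUMPTION): every genotype combination of 4 SNPs occurs once, and
# disease follows an exclusive-OR of heterozygosity at SNP 0 and SNP 1.
genotypes = list(product(range(3), repeat=4))
disease = [int((g[0] == 1) != (g[1] == 1)) for g in genotypes]

# Step 6: repeat for n = 1, ..., L (here L = 2).
best_by_size = {n: best_model(genotypes, disease, n) for n in (1, 2)}
print(best_by_size[2][1])
```

Genotype combinations that never occur are simply absent from the table, which is harmless because an empty cell contributes a factor of 1 to the marginal likelihood.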
Experimental methods
We evaluated BCM and MDR on two sets of synthetic datasets as well as on a Crohn’s disease dataset. One set of synthetic datasets examined the effect of sample size and a second set of synthetic datasets examined the effect of dimensionality (i.e., the number of SNPs).
For generating synthetic data, we used 35 2-SNP epistatic models with different penetrance functions that are described in Velez et al. [6] and have been used for evaluating MDR. The models have a minor allele frequency of 0.2 and seven heritabilities ranging from 0.01 to 0.40 (0.01, 0.025, 0.05, 0.10, 0.20, 0.30, and 0.40) that might be expected for a common disease in which not all susceptibility factors are accounted for. To study the effect of sample size, from a given model, 100 datasets were generated for each of four sample sizes (200, 400, 800 and 1600), where each dataset contains an equal number of disease and healthy samples. To each generated pair of epistatic SNP values, a set of 18 SNPs that were assigned random values was appended to simulate SNPs that are non-informative with respect to the disease status. To study the effect of the number of non-informative SNPs, from a given model, 100 datasets were generated for each of 20, 40, 80, 160 and 320 SNPs, where each dataset had a sample size of 200 with an equal number of cases and controls and all SNPs except for the first two were assigned random values irrespective of the disease status.
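One plausible way to generate such a balanced dataset from a penetrance table is sketched below. The details are assumptions, since the text does not specify the sampling scheme: genotype frequencies follow Hardy-Weinberg proportions for the given minor allele frequency, cases are drawn with probability proportional to frequency × penetrance and controls proportional to frequency × (1 − penetrance), noise SNPs are drawn from the same genotype frequencies, and the penetrance table is a hypothetical exclusive-OR pattern rather than one of the 35 models in [6].

```python
import random

def hardy_weinberg(maf):
    """Genotype frequencies for 0, 1 and 2 copies of the minor allele."""
    major = 1.0 - maf
    return [major * major, 2 * major * maf, maf * maf]

def simulate_dataset(penetrance, maf, n_cases, n_controls, n_noise, rng):
    """Return (rows, labels); each row is two causal SNPs plus noise SNPs."""
    freq = hardy_weinberg(maf)
    cells = [(i, j) for i in range(3) for j in range(3)]
    # P(genotypes | diseased) is proportional to frequency * penetrance;
    # P(genotypes | healthy) is proportional to frequency * (1 - penetrance).
    case_weights = [freq[i] * freq[j] * penetrance[i][j] for i, j in cells]
    control_weights = [freq[i] * freq[j] * (1 - penetrance[i][j]) for i, j in cells]
    rows, labels = [], []
    for weights, label, count in (
        (case_weights, 1, n_cases),
        (control_weights, 0, n_controls),
    ):
        for first, second in rng.choices(cells, weights=weights, k=count):
            # Noise SNPs are sampled independently of disease status.
            noise = rng.choices([0, 1, 2], weights=freq, k=n_noise)
            rows.append([first, second] + noise)
            labels.append(label)
    return rows, labels

# Hypothetical penetrance table (NOT one of the published models).
penetrance = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0]]
rows, labels = simulate_dataset(
    penetrance, maf=0.2, n_cases=100, n_controls=100, n_noise=18,
    rng=random.Random(0),
)
print(len(rows), sum(labels), len(rows[0]))
```

Sampling cases and controls separately guarantees the balanced design described above, whatever the prevalence implied by the penetrance table.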
The Crohn’s disease dataset consists of 103 SNPs spanning a 616-kb region of human chromosome 5 that is known to contain a group of genes that has been linked to Crohn’s disease. The dataset contains 144 individuals with Crohn’s disease and 243 healthy individuals and was analyzed by Rioux et al. for individual SNPs associated with disease [7]. For analysis with BCM and MDR, we treated missing values for a SNP as a distinct state.
We evaluated the performance of BCM and MDR on power (accuracy) and computational speed. For a set of 100 datasets generated from a model, power is a number from 0 to 100 and refers to the number of datasets on which an algorithm correctly selected the model containing both the interacting SNPs.
We implemented BCM in Java and used MDR v 1.25 obtained from www.epistasis.org. All experiments were run on a PC running Windows XP with a 2.8 GHz processor and 3 GB of RAM.
Results
The results of the Wilcoxon two-sample paired signed rank test comparing the power of BCM and MDR on different sample sizes for 2-SNP, 3-SNP and 4-SNP models are given in Table 1a. The statistic in each cell is obtained by applying the Wilcoxon test on the powers obtained from datasets simulated from 35 models. For example, the powers obtained by BCM and MDR for 2-SNP models at sample size 200 are given in Table 1b. The results in Table 1a show that BCM achieves significantly greater power than MDR at the 0.05 level in every comparison except the 2-SNP models at a sample size of 1600. There is also a trend for BCM to achieve greater power gains at lower sample sizes.
Table 1.
a. Results of the Wilcoxon two-sample paired signed rank test comparing the power (accuracy) of BCM with that of MDR. Each entry gives the Z statistic and the corresponding two-tailed p-value obtained from applying the Wilcoxon test on results from datasets generated from 35 genetic models. Negative Z values indicate better performance by BCM, and p values less than 0.05 are in bold. b. Power of MDR and BCM for 2-SNP models on sample size of 200 for all 35 genetic models.
a.

| Sample size | 2-SNP models | 3-SNP models | 4-SNP models |
|---|---|---|---|
| 200 | −4.020 (**<0.001**) | −4.334 (**<0.001**) | −4.706 (**<0.001**) |
| 400 | −2.483 (**0.013**) | −3.922 (**<0.001**) | −4.109 (**<0.001**) |
| 800 | −2.646 (**0.008**) | −3.297 (**0.001**) | −3.726 (**<0.001**) |
| 1600 | −1.483 (0.138) | −2.366 (**0.018**) | −2.666 (**0.008**) |

b.

| BCM | MDR | BCM | MDR | BCM | MDR | BCM | MDR | BCM | MDR |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 100 | 100 | 99 | 83 | 69 | 100 | 100 | 58 | 38 |
| 100 | 100 | 100 | 100 | 40 | 33 | 100 | 100 | 16 | 14 |
| 100 | 100 | 100 | 100 | 32 | 27 | 100 | 100 | 5 | 3 |
| 100 | 100 | 71 | 60 | 67 | 40 | 100 | 100 | 4 | 3 |
| 100 | 100 | 87 | 78 | 65 | 44 | 7 | 3 | 5 | 3 |
| 100 | 100 | 79 | 74 | 46 | 38 | 14 | 12 | 7 | 3 |
| 100 | 99 | 94 | 88 | 100 | 100 | 51 | 40 | 3 | 3 |
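The paired comparison can be reproduced from the 35 pairs of powers listed in Table 1b. The sketch below is a pure-Python version of the Wilcoxon signed rank test with the normal approximation; several details are assumptions, since the text does not state them: zero differences are dropped, tied absolute differences receive average ranks with the usual tie correction to the variance, and the Z statistic is computed from the rank sum of pairs in which MDR beats BCM, so that negative values favor BCM as in the table.

```python
from math import erfc, sqrt

# Powers from Table 1b, read row by row (2-SNP models, sample size 200).
bcm = [100, 100, 83, 100, 58,  100, 100, 40, 100, 16,  100, 100, 32, 100, 5,
       100, 71, 67, 100, 4,    100, 87, 65, 7, 5,      100, 79, 46, 14, 7,
       100, 94, 100, 51, 3]
mdr = [100, 99, 69, 100, 38,   100, 100, 33, 100, 14,  100, 100, 27, 100, 3,
       100, 60, 40, 100, 3,    100, 78, 44, 3, 3,      100, 74, 38, 12, 3,
       99, 88, 100, 40, 3]

def wilcoxon_signed_rank(x, y):
    """Return (z, two-tailed p) using the normal approximation."""
    diffs = sorted((a - b for a, b in zip(x, y) if a != b), key=abs)
    n = len(diffs)
    ranks = [0.0] * n
    tie_term = 0
    start = 0
    while start < n:
        end = start
        while end < n and abs(diffs[end]) == abs(diffs[start]):
            end += 1
        # Positions start..end-1 share the average of ranks start+1..end.
        for k in range(start, end):
            ranks[k] = (start + 1 + end) / 2
        tied = end - start
        tie_term += tied ** 3 - tied
        start = end
    # Rank sum of the pairs where the first method is worse than the second.
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_minus - mean) / sqrt(variance)
    return z, erfc(abs(z) / sqrt(2))

z, p = wilcoxon_signed_rank(bcm, mdr)
print(round(z, 3), p)
```

Under these assumptions the statistic comes out at about −4.02 with p below 0.001, which agrees with the entry for 2-SNP models at sample size 200 in Table 1a. BCM is never worse than MDR in any of the 35 pairs, so the rank sum against BCM is zero.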
The results of the Wilcoxon signed rank test comparing the power of BCM and MDR on datasets with increasing dimensionality for 2-SNP models at sample size of 200 are given in Table 2. The results show that BCM achieves significantly greater power than MDR as the dimensionality increases. The running times of BCM and MDR are given in Table 3. Based on the mean running times in Table 3, BCM runs roughly 50 to 140 times faster than MDR for 2-SNP models and roughly 28 to 55 times faster for 3-SNP and 4-SNP models.
Table 3.
Mean running times in seconds of BCM and MDR. Each entry gives the mean running time obtained by averaging the running times over datasets generated from 35 genetic models. All the datasets had 20 SNPs.
| Sample size | 2-SNP BCM | 2-SNP MDR | 3-SNP BCM | 3-SNP MDR | 4-SNP BCM | 4-SNP MDR |
|---|---|---|---|---|---|---|
| 200 | 0.85 | 119.81 | 4.50 | 249.42 | 24.44 | 779.91 |
| 400 | 1.45 | 146.64 | 8.86 | 353.57 | 43.28 | 1306.35 |
| 800 | 2.65 | 207.98 | 16.94 | 544.29 | 79.44 | 2262.09 |
| 1600 | 4.96 | 241.74 | 32.56 | 949.55 | 150.5 | 4349.42 |
The original analysis by Rioux et al. [7] of the Crohn’s disease dataset identified 11 SNPs with alleles that were associated with risk of Crohn’s disease. Of these 11 SNPs, nine were present in the dataset we analyzed. Both BCM and MDR identified the same eight of the nine significant SNPs in the top 20 single-SNP models. Among the 20 top scoring 2-SNP models, BCM and MDR agreed on 7 of them. Since no biologically validated epistatic interactions are known for this dataset, the detected 2-SNP interactions need further validation.
Discussion
Identifying interactions among multiple genetic variants and environmental factors is an important challenge in elucidating the etiology of common diseases. We have developed a new combinatorial method called BCM for identifying genetic interactions that has significantly greater power and is substantially faster than the widely used MDR. The improved performance of BCM stems from its use of the entire dataset to compute the model score, which is simpler and faster than the multifold cross-validation used by MDR, although this advantage is mitigated to some extent as the sample size increases. Also, the BCM score presents a coherent way to combine knowledge with data, which has the potential to enhance the analysis of high dimensional datasets like those collected in genome-wide studies. Biological knowledge or results from analyses of earlier studies can be encoded as a prior distribution over the models that can then be used in Equation 3. Use of informative priors is becoming common in the analysis of microarray expression studies, and a similar strategy can be employed for genomic data.
There are several limitations to our method. Though BCM is more effective than MDR, it remains a combinatorial method and hence is unlikely to scale to datasets with a very large number of SNPs without additional search heuristics. The evaluation of BCM presented in this paper is mainly on synthetic data, and further evaluation on real data with known interactions is needed.
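The scaling concern can be made concrete by counting how many models an exhaustive search must score. The short sketch below is illustrative; the helper name `models_to_score` is hypothetical, and the figure of 500,000 SNPs is only an assumed order of magnitude for a genome-wide study, not a number taken from this paper.

```python
from math import comb

def models_to_score(m, limit):
    """Number of n-SNP models for n = 1, ..., limit among m SNPs."""
    return sum(comb(m, n) for n in range(1, limit + 1))

# Dataset sizes used in the experiments: 20 to 320 SNPs.
for m in (20, 320):
    print(m, comb(m, 2), comb(m, 3), comb(m, 4))

# ASSUMPTION: a hypothetical genome-wide panel of 500,000 SNPs.
print(comb(500_000, 2))
```

With 20 SNPs there are only 190 pairs, but the number of pairs grows quadratically with the number of SNPs, and higher-order combinations grow faster still, which is why search heuristics become necessary.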
In future work, we plan to study the performance of BCM in additional genetic models, evaluate the effect of different values of the hyperparameters αij, develop methods to encode knowledge as model priors, and evaluate it on real data.
References
- 1. Ritchie MD, Motsinger AA, Bush WS, Coffey CS, Moore JH. Genetic programming neural networks: A powerful bioinformatics tool for human genetics. Applied Soft Computing. 2007;7(1):471–479. doi: 10.1016/j.asoc.2006.01.013.
- 2. Li W, Reich J. A complete enumeration and classification of two-locus disease models. Human Heredity. 2000;50(6):334–49. doi: 10.1159/000022939.
- 3. Culverhouse R, Suarez BK, Lin J, Reich T. A perspective on epistasis: Limits of models displaying no main effect. The American Journal of Human Genetics. 2002;70(2):461–71. doi: 10.1086/338759.
- 4. Heidema AG, Boer JM, Nagelkerke N, Mariman EC, van der AD, Feskens EJ. The challenge for genetic epidemiologists: How to analyze large numbers of SNPs in relation to complex diseases. BMC Genetics. 2006;7:23. doi: 10.1186/1471-2156-7-23.
- 5. Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Machine Learning. 1992;9(4):309–347.
- 6. Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, et al. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology. 2007;31(4):306–15. doi: 10.1002/gepi.20211.
- 7. Rioux JD, Daly MJ, Silverberg MS, Lindblad K, Steinhart H, Cohen Z, et al. Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nature Genetics. 2001;29(2):223–8. doi: 10.1038/ng1001-223.


