Statistical Applications in Genetics and Molecular Biology. 2011 Aug 22;10(1):38. doi: 10.2202/1544-6115.1719

Entropy Based Genetic Association Tests and Gene-Gene Interaction Tests

Mariza de Andrade 1, Xin Wang 2
PMCID: PMC3176139  PMID: 23089811

Abstract

In the past few years, several entropy-based tests have been proposed for testing either single SNP association or gene-gene interaction. These tests are mainly based on Shannon entropy and have higher statistical power when compared to standard χ2 tests. In this paper, we extend some of these tests using a more generalized entropy definition, Rényi entropy, of which Shannon entropy is the special case of order 1. The order λ (>0) of Rényi entropy weights the events (genotypes/haplotypes) according to their probabilities (frequencies). Higher λ places more emphasis on higher probability events, while smaller λ (close to 0) tends to assign weights more equally. Thus, by properly choosing λ, one can potentially increase the power of the tests or the level of significance of the p-values. We conducted simulation as well as real data analyses to assess the impact of the order λ and the performance of these generalized tests. The results showed that for the dominant model the order 2 test was more powerful, and for the multiplicative model the order 1 and 2 tests had similar power. The analyses indicate that the choice of λ depends on the underlying genetic model, and Shannon entropy is not necessarily the most powerful entropy measure for constructing genetic association or interaction tests.

Keywords: Rényi entropy, genetic association, gene-gene interaction

1. Introduction

The strategy of using a single locus to test for association with a particular phenotype has not been as successful as one would expect [Manolio et al. (2009)]. This may be due to different reasons such as the predominance of common variants in the genome-wide platforms, and the synergy between environment and genetic risk factors as well as between different genetic risk factors [Kraft et al. (2007), Thomas (2010)]. However, complex human genetic diseases are typically caused not only by marginal effects of genes or gene-environment interactions, but also by the interactions of multiple genes [Cordell (2009)]. Recently, gene-gene interaction, or epistasis, has been a hot topic in molecular and quantitative genetics.

If the effect at one genetic locus is altered or masked by the effects at another locus, single-locus tests or marginal tests may not be able to detect the association. By allowing for epistatic interactions among potential disease loci, we may succeed in identifying genetic variants that might otherwise remain undetected.

Several statistical techniques have been applied or developed to detect statistical epistasis or gene-gene interaction [Cordell (2009)]. Among those techniques, the most widely applied are logistic regression models and χ2-tests of independence, owing to their easy access in well-known statistical packages. However, little attention has been given to entropy methods. Entropy methods are best known for their application in information theory, with the seminal work by Kullback and Leibler [Kullback and Leibler (1951)]. Shannon’s entropy is one of the most well known entropy measures, and it is the one that has been applied in single-locus and gene-gene interaction analyses [Zhao et al. (2005), Dong et al. (2008), Kang et al. (2008)]. However, Shannon’s entropy is a particular case of a more generalized type of entropy, the Rényi entropy [Rényi (1960)].

The goal of this study is to extend the application of Shannon entropy to Rényi entropy in single-locus association as well as gene-gene interaction analyses. Since entropy measures are nonlinear transformations of the variable distribution, an entropy measure of allele frequencies can amplify the allele difference between groups of interest (e.g. case/control). Furthermore, the extension to Rényi entropy introduces more flexibility in such transformations.

Thus, in this paper, we propose several Rényi entropy based tests and compare the performance of these novel tests to some traditional methods. We first introduce a single-locus association test under a two-group design, then a one-group gene-gene interaction test under the linkage equilibrium (LE) assumption. The power of these tests was compared through simulations. We demonstrate that by properly choosing the Rényi entropy order λ, we can increase the power of the association test. We also discuss possible ways to construct a two-group interaction test and how to check whether an interaction effect is due to the disease under a case-control design. All the methods introduced in this paper were applied in analyzing real venous thromboembolism (VTE) case-control data.

Throughout this paper, we use the terminology “statistical gene-gene interaction test” or “statistical epistasis test” for the interaction test. The word “entropy” refers to Rényi entropy unless otherwise specified. The simulation data sets were generated by GWAsimulator [Li et al. (2007)], and the analyses were performed using R functions written by the authors.

2. Methods

2.1. Rényi Entropy

In information theory, entropy is a measure of the uncertainty associated with a random variable. One of the most common entropies is the Shannon entropy introduced by Shannon (1948), which is a special case of a more generalized type of entropy introduced by Rényi (1960).

The so-called Rényi entropy is a family of functionals for quantifying the diversity, uncertainty, or randomness of a system. The Rényi entropy of order λ, λ ≥ 0, is defined as

H_\lambda(X) = \frac{1}{1-\lambda} \log\left( \sum_{i=1}^{n} p_i^{\lambda} \right),

where X is a discrete random variable taking n values with positive probabilities p_i and \sum_{i=1}^{n} p_i = 1. Rényi entropy with higher values of λ is more dependent on higher probability events, while lower values of λ weight all possible events more equally.

Some Rényi entropy measures have quite natural interpretations: H_0(·) is the logarithm of the number of values with non-zero probability; H_2(·), often called collision entropy, is the negative logarithm of the probability that two independent draws from the same distribution take the same value; and H_∞(·), called min-entropy, is a function of the highest probability only.

The most well known Rényi entropy is the one with λ = 1. By applying L’Hôpital’s rule, one can show that the formula of Rényi entropy reduces to the form of Shannon entropy:

H_1(X) = -\sum_{i=1}^{n} p_i \log p_i.
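
As an illustration, the definition above is straightforward to compute. The following R sketch (our own helper, not the authors' released code) returns the Rényi entropy of a probability vector and falls back to the Shannon form as λ → 1:

    renyi_entropy <- function(p, lambda) {
      p <- as.numeric(p)
      p <- p[p > 0]                    # zero-probability events contribute nothing
      if (abs(lambda - 1) < 1e-8) {
        -sum(p * log(p))               # Shannon limit (lambda -> 1)
      } else {
        log(sum(p^lambda)) / (1 - lambda)
      }
    }

    renyi_entropy(c(0.7, 0.25, 0.05), 2)   # collision entropy of a skewed distribution

For λ = 0 this returns the logarithm of the number of positive-probability values, and for large λ it approaches the min-entropy −log max_i p_i, matching the interpretations above.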

2.2. Association Tests

In this section, we derive a model free association test based on Rényi entropy. Under a two-group design, such as a case-control design, one may test whether a set of SNPs or a single SNP is associated with the disease of interest by comparing the entropy of the first group (case group) to the second group (control group).

Let us assume a locus with k genotypes G_1, G_2, ..., G_k. For the disease population, let P^D = [p_1^D, ..., p_k^D] be the distribution of the genotypes, where p_i^D is the probability of a case having genotype G_i at the locus of interest. Similarly, denote the genotype distribution of the normal population by P^N = [p_1^N, ..., p_k^N], where p_i^N is the probability of a control having genotype G_i at the locus of interest. Under the null hypothesis of no association, P^D and P^N are identical.

For a given observed case-control data set, let P̂^D and P̂^N be the estimated distributions of genotypes of cases and controls, respectively. The Rényi entropy of order λ is calculated as

H_\lambda(\hat{P}^D) = \frac{1}{1-\lambda} \log\left( \sum_{i=1}^{k} (\hat{p}_i^D)^{\lambda} \right). (1)

Similarly, one calculates H_λ(P̂^N). The difference between the two entropy statistics

S_\lambda^A = H_\lambda(\hat{P}^D) - H_\lambda(\hat{P}^N) (2)

is then considered the association test statistic, with the superscript A standing for “association”.

In the appendix, we show that the entropy statistic (2) asymptotically follows a normal distribution. Therefore, a test of the difference between the two groups can be constructed, where a significant difference indicates a possible association between the SNP and the disease.

For multiple loci, it is worth noting that this test may include or exclude the effect of interaction depending on the way P^D and P^N are estimated. To allow for interaction, the genotype distributions should be jointly estimated. To test for marginal effects only, one could estimate P̂^D and P̂^N as the product of the marginal probability estimates.

When λ = 1, the Rényi entropy reduces to the Shannon entropy, i.e.,

\lim_{\lambda \to 1} H_\lambda(\hat{P}^D) = -\sum_{i=1}^{k} \hat{p}_i^D \log(\hat{p}_i^D). (3)

Thus the statistic of the association test, S_1^A, is a summation of terms of the form p_i log(p_i) − q_i log(q_i), where i is the index over all genotypes, with p and q representing the corresponding distributions of the case group and the control group. By studying each component of the statistic, one can tell which genotype has the most impact on it and, consequently, choose an appropriate λ for the association test. To achieve more power to observe a difference between genotype frequencies in cases and controls, the choice of λ depends on where the main difference lies, whether in the higher or the lower genotype frequencies. One should favor a larger λ value for the former and a smaller λ value for the latter.

The R code for the entropy test is available upon request. We tested the computing time of the association test using a PC with an Intel(R) Core(TM) 2 Duo CPU P7750 @ 2.26GHz. The test data set contains 1000 cases and 1000 controls. It took about 0.4 sec to obtain the association test results of one SNP with 20 different λ values. Since most of the computational time is devoted to the calculation of genotype frequencies, we recommend applying the association test with multiple values of λ simultaneously.
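
To make the test concrete, the following R sketch implements (2) with asymptotic p-values based on the delta-method variance derived in the Appendix, pooling the genotype frequencies across groups under the null. The function names and the input format (vectors of genotype codes such as 0/1/2) are our own assumptions, not the authors' released code:

    renyi_var <- function(p, n, lambda) {
      p <- as.numeric(p)
      Sigma_P <- (diag(p) - outer(p, p)) / n          # multinomial covariance of p-hat
      if (abs(lambda - 1) < 1e-8) {
        d <- 1 + log(p)                               # Shannon case (see Appendix)
        sum(diag(d) %*% Sigma_P %*% diag(d))
      } else {
        d <- lambda * p^(lambda - 1)                  # general Renyi case
        sum(diag(d) %*% Sigma_P %*% diag(d)) / ((1 - lambda)^2 * sum(p^lambda)^2)
      }
    }

    assoc_test <- function(geno_case, geno_ctrl, lambdas = c(0.9, 1, 2)) {
      lev <- union(unique(geno_case), unique(geno_ctrl))
      pD <- prop.table(table(factor(geno_case, levels = lev)))
      pN <- prop.table(table(factor(geno_ctrl, levels = lev)))
      # pooled genotype frequencies under the null, used for the variance
      p0 <- prop.table(table(factor(c(geno_case, geno_ctrl), levels = lev)))
      sapply(lambdas, function(l) {
        S  <- renyi_entropy(pD, l) - renyi_entropy(pN, l)
        se <- sqrt(renyi_var(p0, length(geno_case), l) +
                   renyi_var(p0, length(geno_ctrl), l))
        2 * pnorm(-abs(S / se))                       # two-sided normal p-value
      })
    }

Computing the statistic for several λ values at once costs little extra, since the genotype frequencies are estimated only once.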

2.3. Interaction Test

2.3.1. One-group analysis (case-only or control-only)

In this section we describe the Rényi entropy based interaction test of two loci, L1 and L2, in detail. A generalization of the test to three or more loci is straightforward. Assume the two loci are in linkage equilibrium, with the first locus having two alleles, A and a, and the second locus two alleles, B and b. Let p_0^{L1}, p_1^{L1}, p_2^{L1} denote the probabilities of the three genotypes aa, aA and AA, respectively, at the first locus. Similarly, at the second locus, let p_0^{L2}, p_1^{L2}, p_2^{L2} denote the corresponding probabilities of the three genotypes bb, bB and BB. Then the joint probability of the nine genotype combinations is represented by p_{ij}, i, j = 0, 1, 2, with i and j being the indices of the genotypes at the first and second locus, respectively.

Define q_{ij} = p_i^{L1} p_j^{L2} as the product of the two marginal probabilities. Under the null hypothesis of no interaction effect, the two loci are independent and the entropy calculated based on the true joint probability p_{ij} and on the induced q_{ij} should be identical.

We first estimate the joint and the marginal probabilities as the observed frequencies p̂_{ij}, p̂_i^{L1} and p̂_j^{L2}. The induced joint probability is then the product of the observed marginal frequencies, q̂_{ij} = p̂_i^{L1} p̂_j^{L2}. The entropy (1) can be estimated using either the observed frequencies P̂ = [p̂_{ij}] or the induced frequencies Q̂ = [q̂_{ij}]. The proposed interaction (epistasis) test statistic, denoted S_λ^E, is calculated as the difference between the two entropy estimates,

S_\lambda^E = H_\lambda(\hat{Q}) - H_\lambda(\hat{P}) = H_\lambda(\hat{P}_1) + H_\lambda(\hat{P}_2) - H_\lambda(\hat{P}), (4)

where P̂_1 = [p̂_i^{L1}] and P̂_2 = [p̂_j^{L2}] are the observed marginal genotype distributions of the first and second locus, respectively; the second equality uses the additivity of Rényi entropy for the product distribution Q̂. A statistically significant difference indicates an interaction between the two loci.

For a case-only study with λ = 1 and case-only sample size n, the statistic 2n S_1^E is the interaction test statistic proposed by Kang et al. (2008), which is asymptotically χ2 distributed with 4 degrees of freedom. For a more general λ, the asymptotic distribution of (4) under the null is unknown, so Monte Carlo methods are needed to determine its distribution and p-values. For a given pair of SNPs, permute the genotypes of one SNP among subjects to break any joint structure of the pair, then calculate the test statistic S_λ^E on the permuted data. Generating N permutation samples and calculating the statistic for each yields an estimate of the null distribution of (4).

We tested the computing time of the interaction test using a PC with an Intel(R) Core(TM) 2 Duo CPU P7750 @ 2.26GHz. The data set contains 1000 samples. The calculation of the interaction tests is based on 1000 shuffles, and it took about 10.5 sec to obtain the one-group test results for one pair of SNPs. Note that most of the computational time is attributable to the calculation of genotype frequencies, so it makes almost no difference whether we test with a single value of λ or with multiple values. The R code is available upon request.
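
A minimal R sketch of this one-group test, reusing the renyi_entropy helper from Section 2.1; the permutation scheme follows the description above, and the function names are ours:

    interaction_stat <- function(g1, g2, lambda) {
      p_joint <- prop.table(table(g1, g2))            # observed joint frequencies
      p1 <- prop.table(table(g1))                     # marginal frequencies, locus 1
      p2 <- prop.table(table(g2))                     # marginal frequencies, locus 2
      renyi_entropy(p1, lambda) + renyi_entropy(p2, lambda) -
        renyi_entropy(p_joint, lambda)
    }

    interaction_test <- function(g1, g2, lambda, n_perm = 1000) {
      s_obs  <- interaction_stat(g1, g2, lambda)
      s_perm <- replicate(n_perm,
        interaction_stat(sample(g1), g2, lambda))     # shuffling one SNP breaks the joint structure
      mean(s_perm >= s_obs)                           # one-sided empirical p-value
    }

Because permuting one SNP leaves the marginal frequencies unchanged, only the joint entropy varies across shuffles.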

2.3.2. Two-group analysis

A question one may ask is: for a given significant p-value of an interaction test, how does one know whether the interaction is truly due to the disease or to some unknown cause? Under a case-control design, we can apply the one-group interaction test to the case and control groups separately. If the interaction effect is not due to the disease, we would expect the case group and control group to behave similarly. The question then becomes how to compare the test results between the two groups.

First, we compared the test statistics of the two groups using their ratio. Let S_λ^E(Case) and S_λ^E(Ctrl) be the corresponding test statistics of the two groups; then S_λ^E(Case)/S_λ^E(Ctrl) should be close to 1 under the null. If the ratio is significantly different from 1, the case group and control group are not equivalent in terms of interaction, and the difference may therefore be associated with the disease. The null distribution of the ratio statistic can be estimated using the already generated permutation samples of each one-group analysis; thus, the computing time is just the sum of the computing times of the two one-group analyses. Our simulation results (data not shown) showed that the power to detect a true difference is weak, especially when the marginal effect is strong. A larger sample size is needed to obtain reliable test results; however, the exact sample size is not easily determined. It depends on the disease model, the strength of the interaction and the marginal effects.

In the case where a significant p-value for the case group and an insignificant p-value for the control group are observed, a further permutation test can be performed to investigate whether these two groups are truly different in terms of p-value significance. Notice that here we compare the p-values of interaction tests, so only the difference in interaction effect is studied. The comparison can be done using a two-step procedure, sketched in code below. In the first step the case-control indicator is shuffled to create new case and control groups, and the interaction test is recalculated for these two groups. The second step compares the group p-value difference (defined as the p-value of the control group minus the p-value of the case group) of the observed data to the group p-value difference of the shuffled sample. Repeat these two steps n times (n to be determined by the investigators) to obtain the proportion of shuffled-sample group p-value differences exceeding the observed group p-value difference, which is the empirical p-value of the permutation test. A significant result of the permutation test indicates that the case group has a more significant interaction than the control group, which means the interaction is associated with the disease. Since this procedure requires many permutations, it is computationally intensive and may only be feasible for a small set of genes or SNPs. Using a PC with an Intel(R) Core(TM) 2 Duo CPU P7750 @ 2.26GHz, it takes n × 21 sec to compare the p-values of one pair of SNPs for a data set of 1000 cases and 1000 controls, where n is the number of permutations.
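
A hedged R sketch of this two-step comparison, built on the interaction_test function above; the nested permutation loop is what makes the procedure expensive (roughly n one-group analyses of each shuffled group), and all names are illustrative:

    compare_groups <- function(g1_case, g2_case, g1_ctrl, g2_ctrl,
                               lambda = 1, n_outer = 100, n_perm = 1000) {
      # observed group p-value difference: control minus case
      d_obs <- interaction_test(g1_ctrl, g2_ctrl, lambda, n_perm) -
               interaction_test(g1_case, g2_case, lambda, n_perm)
      g1 <- c(g1_case, g1_ctrl)
      g2 <- c(g2_case, g2_ctrl)
      n_case <- length(g1_case)
      d_perm <- replicate(n_outer, {
        idx <- sample(length(g1), n_case)             # shuffled case-control indicator
        interaction_test(g1[-idx], g2[-idx], lambda, n_perm) -
          interaction_test(g1[idx],  g2[idx],  lambda, n_perm)
      })
      mean(d_perm >= d_obs)                           # empirical p-value of the comparison
    }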

3. Simulation

In this study we performed Monte Carlo simulations to investigate the performance of the entropy-based tests for several λ values. We also compared our results with two other methods: the χ2-test for contingency tables and the likelihood ratio (LR) test for logistic regression. Data were simulated using GWAsimulator Version 2.0 [Li et al. (2007)].

3.1. Simulation 1: Comparison between association tests

We studied the performance of the entropy-based association tests with parameter λ = 0.9, 1, 2 and compared them to the logistic regression method. LR tests were used to test for the significance of the allele effect in the regression model.

Data were simulated using logistic models [Li et al. (2007)]. Four different marginal effect models were considered: weak dominant, strong dominant, weak multiplicative and strong multiplicative. The dominant (threshold) marginal effect is not affected by the number of copies of the risk allele as long as at least one copy is present. The multiplicative marginal effect assumes that the relative risk (compared to the risk with zero copies of the risk allele) increases multiplicatively as the number of copies of the risk allele increases. Given a disease locus, let R_1 be the relative risk with one copy of the risk allele and R_2 the relative risk with two copies; then the dominant marginal satisfies R_1 = R_2 and the multiplicative marginal satisfies R_2 = R_1^2.

Let g_i = 0, 1, 2 be the number of copies of the risk allele at SNP i, and define f(g_i) = Pr(affected | g_i) as the penetrance for genotype g_i. Then, the disease models can be described by the following formula:

\mathrm{logit}[f(g_i)] = \beta_0 + \beta_1 I_{\{g_i=1\}} + \beta_2 I_{\{g_i=2\}}, (5)

where βj is the marginal effect coefficient of the disease locus with j copies of risk allele. These parameters are calculated approximately as the natural log of the corresponding relative risk.

For our simulation we fixed the risk allele frequency at 0.15. The relative risk R_1 was chosen as 1.25 (weak) and 1.5 (strong). For each disease model, 1000 data sets were simulated. The coefficients in (5) for each model are shown in Table 1:

Table 1:

Parameters for the four disease models: Weak Dominant (WD), Strong Dominant (SD), Weak Multiplicative (WM), Strong Multiplicative (SM)

Model   β0      β1     β2
1: WD   −1.844  0.265  0.265
2: SD   −1.968  0.484  0.484
3: WM   −1.864  0.265  0.542
4: SM   −1.990  0.483  1.017
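
As a sanity check of these settings, single-SNP case-control data can be generated directly from model (5). The R sketch below assumes Hardy-Weinberg genotype frequencies and uses Bayes' rule to obtain the genotype distributions within cases and controls; the defaults correspond to our reading of the strong dominant (SD) row of Table 1:

    simulate_cc <- function(n_case, n_ctrl, maf = 0.15,
                            b0 = -1.968, b1 = 0.484, b2 = 0.484) {  # SD model
      gfreq <- c((1 - maf)^2, 2 * maf * (1 - maf), maf^2)   # HWE freqs of 0/1/2 copies
      pen   <- plogis(b0 + c(0, b1, b2))                    # penetrance f(g) from (5)
      p_case <- gfreq * pen / sum(gfreq * pen)              # P(g | affected)
      p_ctrl <- gfreq * (1 - pen) / sum(gfreq * (1 - pen))  # P(g | unaffected)
      list(case = sample(0:2, n_case, TRUE, p_case),
           ctrl = sample(0:2, n_ctrl, TRUE, p_ctrl))
    }

    dat <- simulate_cc(500, 500)
    assoc_test(dat$case, dat$ctrl)    # p-values for lambda = 0.9, 1, 2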

Each simulated data set was analyzed by the Rényi entropy association tests with λ values of 0.9, 1, and 2. A logistic regression model assuming an additive genetic effect was fitted to each data set, and the likelihood ratio test was used to test the association of a single SNP with the disease. For each test, power was calculated as the percentage of the 1,000 simulations with a p-value less than 0.05. The power of the four tests under the four disease models is summarized in Figure 1. We also evaluated the false positive (type I error) rates for the LR and entropy-based tests under different sample sizes. We observed that for a sample size less than or equal to 300, the LR test had the lowest type I error rate (below 0.05), followed by the entropy test with λ = 2. For a sample size greater than 300, the entropy test with λ = 2.0 had the lowest type I error (Figure 2). Both the LR test and the entropy test with λ = 2 had good control of type I error with small sample sizes, and all the tests had type I error close to the target 0.05 as the sample size increased.

Figure 1: Power and sample size for the likelihood ratio (LR) test and entropy-based test for λ values of 0.9, 1.0, and 2.0 under different marginal effects.

Figure 2: Type I error rate of the likelihood ratio (LR) test and entropy-based test with λ values of 0.9, 1.0 and 2.0.

As shown in Figure 1, for the dominant marginal effect models, the entropy-based test with λ = 2 was the most powerful among the four tests, and the power of the entropy-based tests with λ values of 0.9 and 1 was similar to the power of the LR test. For the multiplicative marginal effect models, all four methods performed similarly. We applied a two-tailed matched-pairs t-test (matched by sample size) to compare the power curves of the methods. There were significant (p < 0.05) differences between entropy tests with different λ. Entropy tests with λ = 2 were significantly different from the LR test for all the models; entropy tests with λ = 1 and 0.9 were significantly different from the LR test for the dominant models.

Figure 3 depicts the power of the entropy-based test with different λ values for sample sizes 100, 300 and 500 under strong dominant and strong multiplicative marginal effects. The power of the test with λ between 1 and 2 was about the same for the multiplicative model, while for the dominant model the λ = 1 test had much lower power than the λ = 2 test. The curves under the multiplicative marginal effect were relatively flat compared to those under the dominant marginal effect. The figure illustrates that the multiplicative model was not very sensitive to the choice of λ, while a properly chosen λ greatly improved the power to detect the dominant marginal effect.

Figure 3: Power of the entropy-based test with sample sizes 100, 300 and 500 for different λ values under strong dominant and multiplicative marginal effects.

3.2. Simulation 2: Comparison between Interaction Tests

We studied the performance of the one-group and two-group entropy-based interaction tests with parameter λ = 0.9, 1, 2 and compared the one-group test to the standard χ2-test of association between genotypes at the two loci for a case-only analysis. We considered two-disease-locus models with three types of marginal effects (no marginal, dominant marginal and multiplicative marginal) and two types of interaction effects (threshold interaction and multiplicative interaction).

The dominant interaction, also called threshold interaction, assumes the same interaction effect for all genotypes with at least one copy of the risk allele at both disease loci. The multiplicative interaction effect increases multiplicatively as the number of copies of the disease allele increases. Let r_{ij} be the relative risk with i copies of the risk allele at disease locus 1 and j copies at disease locus 2 (compared to the case with i copies at locus 1 and j copies at locus 2 but without interaction effect). For threshold interaction, r_{11} = r_{12} = r_{21} = r_{22} holds. For multiplicative interaction, r_{12} = r_{21} = r_{11}^2 and r_{22} = r_{11}^4.

Let g_i = 0, 1, 2 be the number of copies of the risk allele at locus/gene/SNP i (i = 1, 2), and f(g_1, g_2) = Pr(affected | g_1, g_2) the penetrance for the genotypes (g_1, g_2). The disease models can then be described by the following formula:

\mathrm{logit}[f(g_1,g_2)] = \beta_0 + \beta_{11} I_{\{g_1=1\}} + \beta_{12} I_{\{g_1=2\}} + \beta_{21} I_{\{g_2=1\}} + \beta_{22} I_{\{g_2=2\}} + \gamma_{11} I_{\{g_1=1,\,g_2=1\}} + \gamma_{12} I_{\{g_1=1,\,g_2=2\}} + \gamma_{21} I_{\{g_1=2,\,g_2=1\}} + \gamma_{22} I_{\{g_1=2,\,g_2=2\}}, (6)

where β_{ij} is the marginal effect coefficient of disease locus i with j copies of the risk allele, and γ_{ij} is the interaction effect coefficient with i copies of the risk allele at disease locus 1 and j copies of the risk allele at disease locus 2. These parameters are calculated approximately as the natural log of the corresponding relative risk.

We simulated 1,000 data sets from each of the disease models specified above. We set the risk allele frequencies to 0.15 for locus 1 and 0.075 for locus 2, respectively, with a relative risk of r_{11} = 4. The coefficients in (6) are specified in Table 2.

Table 2:

Coefficients of the six models (-/-): first letter labels the marginal effect and second letter labels the interaction effect. “N” for null, “D” for dominant, “M” for multiplicative.

Model    β0      β11    β12    β21    β22    γ11    γ12    γ21    γ22
1: N/D   −1.816  0      0      0      0      1.386  1.386  1.386  1.386
2: D/D   −2.074  0.484  0.484  0.490  0.490  1.386  1.386  1.386  1.386
3: M/D   −2.096  0.483  1.017  0.490  1.037  1.386  1.386  1.386  1.386
4: N/M   −1.775  0      0      0      0      0.693  1.386  1.386  2.773
5: D/M   −2.023  0.484  0.484  0.490  0.490  0.693  1.386  1.386  2.773
6: M/M   −2.045  0.483  1.017  0.490  1.037  0.693  1.386  1.386  2.773
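
Since the coefficients are approximately the natural logs of the corresponding relative risks, the interaction entries of Table 2 can be checked directly in R, assuming r_{11} = 4 for the threshold models and r_{11} = 2 for the multiplicative models (our reading of the table):

    log(4)                # 1.386 = gamma_ij for all threshold interaction terms
    log(c(2, 4, 4, 16))   # 0.693, 1.386, 1.386, 2.773 = gamma_11, gamma_12,
                          # gamma_21, gamma_22 with r12 = r21 = r11^2, r22 = r11^4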

Under the one-group design (for instance, case-only), we evaluated the performance of the entropy-based test for three λ values, 0.9, 1.0, and 2.0, against the χ2-test by comparing their power. The simulation results (Figure 4) showed that for all six models, the entropy tests S_λ^E with λ ≥ 1 had higher power than the χ2-test, while the tests with λ < 1 had lower power. As shown in the figure, λ = 2 had the highest power, especially for the models with multiplicative interaction. We applied a two-tailed matched-pairs t-test (matched by sample size) to compare the power curves of the methods. There were significant (p < 0.05) differences between entropy tests with different λ. Entropy tests with λ = 2 and 0.9 were significantly different from the χ2-tests for all the models; entropy tests with λ = 1 were significantly different from the χ2-test for the models without marginal effect and for the model with dominant (threshold) marginal and interaction effects. By properly choosing the parameter λ, one can potentially increase the power.

Figure 4: Six models with different marginal and interaction effects under case-only design. Power of the χ2-test and entropy-based tests with λ = 0.9, 1 and 2 plotted against sample size.

Under the two-group design, we also evaluated the performance of the ratio statistic to detect interaction for the case-control design. Our simulation showed that λ = 1 usually had better performance. We compared the test with the LR test under certain scenarios. The power of the ratio test was low, especially when the marginal effect was strong (data not shown). As stated in Section 2.3.2, this test is only recommended for large sample sizes.

4. Real Data Application

4.1. Venous Thromboembolism (VTE) Data Set Description

The data set consists of 1270 VTE subjects and 1302 unrelated controls recruited for a candidate-gene study in which 12,296 SNPs located in 764 genes were genotyped. More details about this study can be found in Heit et al. (2011). Because a genomic region on chromosome 1q24.2 contains a cluster of 5 genes highly associated with VTE, we decided to investigate this region for potential associations and SNP-SNP interactions using our proposed entropy-based tests. A total of 102 SNPs were analyzed.

4.2. Association

We applied the entropy-based test with λ = 0.9, 1.0 and 2.0 to test the association of each single SNP. The LR test was performed using a logistic regression model with an additive genetic effect. Twenty-one SNPs were identified as significant (p < 0.05) by each of the three entropy-based tests. The p-values of those 21 SNPs are listed in Table 3.

Table 3:

The p-values (from the likelihood ratio test and the entropy-based tests with λ = 0.9, 1 and 2) of the 21 most significant SNPs, sorted by the entropy-based association test with λ = 1.

SNP Gene LR λ = 0.9 λ = 1 λ = 2
rs2420371 F5 2.22E-16 4.22E-15 1.33E-15 6.66E-16
rs16861990 NME7 2.11E-13 3.03E-12 1.14E-12 2.65E-13
rs1208134 SLC19A2 4.81E-13 9.34E-12 3.10E-12 3.43E-13
rs2038024 SLC19A2 1.09E-10 2.62E-09 1.27E-09 1.97E-10
rs3766031 ATP1B1 2.55E-05 1.73E-05 1.59E-05 5.55E-05
rs6656822 SLC19A2 2.35E-05 0.000259 0.000234 7.47E-05
rs4524 F5 0.001123 0.000403 0.000386 0.000558
rs10158595 F5 0.001018 0.000403 0.000392 0.000911
rs9332627 F5 0.001181 0.000415 0.000399 0.000604
rs2239851 F5 0.001189 0.000419 0.000403 0.000592
rs4525 F5 0.001262 0.000426 0.00041 0.000614
rs970741 F5 0.002286 0.001043 0.000997 0.001292
rs723751 SLC19A2/F5 0.00203 0.003416 0.003036 0.001767
rs6030 F5 0.000572 0.004033 0.003842 0.002098
rs3820059 C1orf114 6.92E-05 0.004635 0.004988 0.011502
rs2176473 NME7 0.000528 0.006444 0.007138 0.021481
rs4656687 F5 0.001071 0.007846 0.007588 0.004945
rs1040503 ATP1B1 0.001453 0.011241 0.012167 0.029942
rs10800456 F5 0.004685 0.01346 0.012734 0.007133
rs3766077 NME7 0.026623 0.034665 0.035445 0.045988
rs16828170 NME7 0.070794 0.03606 0.035897 0.035187

As previously stated, the Rényi entropy reduces to the Shannon entropy when λ = 1; the entropy-based association test statistic is therefore a summation of terms of the form p_i log(p_i) − q_i log(q_i), where i indexes the genotypes of the SNPs in the test and p and q refer to the distributions of the case group and the control group, respectively. Each component of the statistic follows a normal distribution, and its standard deviation can be estimated by the delta method (see Appendix). Thus, to illustrate the effect of the Rényi entropy parameter λ, we decomposed the test statistic at λ = 1 for three typical SNPs. One (rs2038024) was very significantly associated with VTE in all three tests, and the other two (rs9332627 and rs2176473) showed a moderately significant association with VTE. SNP names, genotypes, genotype frequencies within the case group, genotype frequencies within the control group, the component statistics, the standard deviation estimate of each component, and the p-values are listed in Table 4. The analysis showed that the most significant component of SNP rs2038024 was genotype 0, followed by genotype 1 and genotype 2. It is worth noting that genotype 0 had the highest frequency, followed by genotype 1, with genotype 2 having the lowest frequency. For this SNP, the main difference between the case and control groups came from the high-frequency genotypes. To emphasize the difference on high-frequency genotypes, one may increase the λ value. As shown in the top panel of Figure 5, the p-value declined as λ increased. For SNP rs2176473, the genotype with lower frequency was more significant; in this case, the p-value decreased as λ moved toward 0 (Figure 5, bottom panel). For SNP rs9332627, there was no monotone relationship between the genotype frequency and the significance of the components, and the minimum p-value was not achieved at the limits, but rather around λ = 1.2.

Table 4:

Decomposition of the Shannon entropy statistics of three SNPs

Genotype  Case freq  Ctrl freq  Stat comp  SD      p-value

rs2038024
0         0.6144     0.7281     0.0683     0.0112  9.36E-10
1         0.3368     0.2488     0.0204     0.0041  7.85E-07
2         0.0489     0.0230     0.0607     0.0171  3.78E-04

rs9332627
0         0.5923     0.5438     −0.0211    0.0085  0.0130
1         0.3644     0.3840     0.0003     0.0003  0.3158
2         0.0434     0.0722     −0.0537    0.0170  0.0016

rs2176473
0         0.3462     0.3940     0.0003     0.0001  0.0516
1         0.4740     0.4731     −0.0002    0.0050  0.9653
2         0.1798     0.1329     0.0403     0.0123  0.0010
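
A hedged R sketch of this decomposition, using per-genotype delta-method variances; for simplicity it ignores the covariances between genotype frequencies, so the standard errors are approximate, and the function name is ours:

    decompose_shannon <- function(geno_case, geno_ctrl) {
      lev <- union(unique(geno_case), unique(geno_ctrl))
      p <- as.numeric(prop.table(table(factor(geno_case, levels = lev))))
      q <- as.numeric(prop.table(table(factor(geno_ctrl, levels = lev))))
      # assumes every genotype is observed in both groups (else log(0) appears)
      comp <- p * log(p) - q * log(q)             # component p_i log p_i - q_i log q_i
      # delta method: Var(x log x) is approx (1 + log x)^2 * x (1 - x) / n
      v <- (1 + log(p))^2 * p * (1 - p) / length(geno_case) +
           (1 + log(q))^2 * q * (1 - q) / length(geno_ctrl)
      data.frame(genotype = lev, component = comp, sd = sqrt(v),
                 p.value = 2 * pnorm(-abs(comp / sqrt(v))))
    }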

Figure 5: The changing pattern of entropy-based association test p-values for SNPs rs2038024, rs9332627 and rs2176473.

To investigate whether two or more SNPs are jointly associated with a phenotype, one can apply the entropy test to the joint frequency of the SNPs of interest. We checked the pairwise joint association for the VTE data set. As one would expect, pairs with one or both SNPs having a strong marginal effect were strongly associated with the disease. Figure 6, upper panel, depicts the histogram of p-values (entropy-based association test with λ = 2) for all possible pairs; thirty percent (1571 out of 5151) of pairs had p-values less than 0.05. We also investigated the joint effect of SNPs without strong marginal effects. SNPs other than the 21 identified by the previous single-SNP association test were considered SNPs with moderate or no marginal effect. The histogram in the lower panel of Figure 6 is based on the pairs in which both SNPs have moderate or no marginal effect; six percent (204 out of 3240) of pairs had p-values less than 0.05, and the shape of the histogram is slightly skewed to the left. The joint association test seems to have lower power when the marginal effect is weak or absent.

Figure 6: Distribution of p-values of entropy-based association tests for SNP pairs. Upper panel: SNP pairs of strong, moderate or no marginal effect. Lower panel: SNP pairs of moderate or no marginal effect.

4.3. Interaction

Entropy-based interaction tests were applied to the VTE data set. We first applied the tests to the case group and control group separately. For an interaction associated with the disease, one would expect the test result of the case group to be significant and that of the control group to be insignificant. To check whether the case group and the control group differ in terms of interaction effect, we applied a further permutation test to compare p-values between the case and control groups. The case-control indicator was shuffled to create new case and control groups, and tests were performed using the shuffled data. The details of the two-step procedure are described in the last paragraph of Section 2.3.2.

Due to the heavy computational burden, we only considered λ = 1 as an example. We set the thresholds of significance and insignificance at 0.05 and 0.2, respectively, and only considered SNP pairs with a case-group p-value less than 0.05 and a control-group p-value greater than 0.2. There were 182 pairs of SNPs that met the criteria. We applied the p-value comparison procedure to those 182 SNP pairs and calculated the p-values of the comparison (for a given SNP pair, this tests whether the interaction effect in the case group is more significant than that in the control group). Among those comparison p-values, 82 were less than 0.05. Accordingly, those 82 pairs, with a case-group p-value smaller than the control-group p-value, may have interaction associated with the disease and deserve further analysis. These pairs are listed in Table 5.

Table 5:

Significance of the control-case p-value difference.

SNP1 SNP2 P SNP1 SNP2 P
rs16828170 rs12120904 0.000 rs1200082 rs2420371 0.016
rs16828170 rs9332618 0.000 rs10800456 rs6678795 0.017
rs1208134 rs6427202 0.000 rs1208134 rs4525 0.017
rs3766031 rs6703463 0.000 rs9332618 rs6663533 0.017*
rs16861990 rs6427202 0.000 rs16861990 rs9332627 0.017
rs3766031 rs2285211 0.000 rs6027 rs12755775 0.017*
rs16861990 rs4656687 0.000 rs1894691 rs1894702 0.018*
rs3766031 rs16828170 0.000 rs3766031 rs3766117 0.019
rs16861990 rs6678795 0.000 rs3766031 rs7545236 0.019
rs2213865 rs2420371 0.000 rs1208134 rs4524 0.020
rs16861990 rs6030 0.000 rs16828170 rs12120605 0.021
rs1892094 rs2420371 0.000 rs17516734 rs9332684 0.022*
rs12728466 rs2420371 0.000 rs9783117 rs2420371 0.022
rs3766031 rs1208370 0.000 rs16861990 rs4525 0.023
rs1208134 rs9332627 0.001 rs10158595 rs6678795 0.024
rs1208134 rs10800456 0.001 rs10158595 rs2239854 0.024
rs16861990 rs10800456 0.001 rs17518769 rs6027 0.027*
rs3753292 rs2420371 0.001 rs3766031 rs1894701 0.028
rs3766031 rs10753786 0.001 rs6018 rs9332684 0.028*
rs1200160 rs2420371 0.001 rs9783117 rs6022 0.030
rs1208134 rs970741 0.002 rs16862153 rs2420371 0.031
rs1208134 rs4656687 0.003 rs4524 rs2239854 0.032
rs723751 rs2420371 0.003 rs1894691 rs2239854 0.034*
rs2420371 rs6035 0.003 rs4525 rs2239854 0.034
rs16861990 rs12120605 0.003 rs9783117 rs1894701 0.034*
rs3766031 rs9287095 0.004 rs9783117 rs9332653 0.035*
rs1208134 rs2239851 0.005 rs9332624 rs6663533 0.035*
rs16861990 rs970741 0.005 rs3766031 rs12758208 0.036
rs1208134 rs6030 0.006 rs4524 rs3766119 0.039
rs1208134 rs6678795 0.006 rs2239851 rs3766119 0.039
rs12753710 rs2038024 0.006 rs1200138 rs3766077 0.040
rs1200131 rs1208134 0.006 rs2420371 rs9332628 0.040
rs6027 rs9332684 0.008* rs6663533 rs12755775 0.040*
rs1208134 rs2213865 0.008 rs1200131 rs12758208 0.041*
rs2420371 rs12755775 0.008 rs1320964 rs2420371 0.042
rs3766031 rs6022 0.012 rs4525 rs3766119 0.045
rs1894691 rs3766119 0.012* rs2239851 rs2239854 0.045
rs12120904 rs6663533 0.013* rs9332653 rs6663533 0.046*
rs16861990 rs2239851 0.013 rs12074013 rs3766117 0.046*
rs1200131 rs16861990 0.013 rs9783117 rs7545236 0.047*
rs16861990 rs4524 0.015 rs9783117 rs3766117 0.049*
*: Both SNPs have no significant main effect.

5. Discussion

5.1. Choice of Rényi Parameter λ

We observed in our study that higher power can be achieved by properly choosing the entropy parameter λ. The optimal λ should be the one that amplifies the true difference between the two populations; thus the choice of λ depends on the true population allele frequencies and the source of the difference. Although such information is usually not available, the family of entropy-based tests allows us to test association and/or interaction with different emphases.

Since most of the computational time is devoted to estimating the allele frequencies of the permuted samples, once the frequencies become available we recommend performing a series of Rényi entropy tests with multiple λs. A p-value vs. λ plot or a summary of the multiple tests is usually recommended.

If one wants to make an interpretation based on tests with a fixed λ without prior knowledge of the optimal λ, we suggest using 1 ≤ λ ≤ 3. In our experience, the power of the entropy test is usually higher with λ in that range. Also, the interaction test with λ < 1 may sometimes be misleading due to poor estimation of the distribution of the test statistic; a large number of permutations is usually required to achieve a reliable p-value.

5.2. Deviation from Uniform

In probability theory and information theory, the Rényi divergence measures the difference between two probability distributions. For probability distributions P and Q of discrete random variables, the Rényi divergence of order λ is defined as

D_\lambda(P \,\|\, Q) = \frac{1}{\lambda-1} \log \sum_i \frac{p_i^{\lambda}}{q_i^{\lambda-1}}.

The Rényi entropy and Rényi divergence are related by H_λ(P) = H_λ(U) − D_λ(P ‖ U), where U is the finite discrete uniform distribution, which assigns equal probability to every possible value. The uniform distribution is special in that it has maximum entropy and is thus the most unpredictable. We can rewrite the association test statistic as:

S_\lambda^A = H_\lambda(\hat{P}^D) - H_\lambda(\hat{P}^N) = D_\lambda(\hat{P}^N \,\|\, U) - D_\lambda(\hat{P}^D \,\|\, U).

The statistic can then be interpreted as the difference between the two distributions’ deviation from uniform. The interaction test can be represented as

S_\lambda^E = H_\lambda(\hat{Q}) - H_\lambda(\hat{P}) = D_\lambda(\hat{P} \,\|\, U \times U) - D_\lambda(\hat{P}_1 \,\|\, U) - D_\lambda(\hat{P}_2 \,\|\, U).

Here U × U is the uniform distribution over the nine genotype combinations of the two loci. The interaction test statistic compares the deviation from uniform of the joint distribution with the summed deviations from uniform of the marginal distributions.
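
As a quick numeric check of the identity H_λ(P) = H_λ(U) − D_λ(P ‖ U), the R sketch below transcribes the divergence formula directly (renyi_div is our own helper; renyi_entropy is the one from Section 2.1):

    renyi_div <- function(p, q, lambda) {
      log(sum(p^lambda / q^(lambda - 1))) / (lambda - 1)
    }

    p <- c(0.7, 0.2, 0.1)
    u <- rep(1/3, 3)                            # uniform reference
    renyi_entropy(p, 2)                         # 0.616
    renyi_entropy(u, 2) - renyi_div(p, u, 2)    # same value: log(3) - 0.482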

It is easy to show that D_1(P ‖ Q) = H_1(Q) − H_1(P) for the reference distributions used here (Q uniform, or Q the product of the marginals of P); however, this equation does not hold for general λ. If we replace the reference distribution U in the test statistics by some other distribution V, the equivalence between the entropy difference and the divergence difference does not hold except for λ = 1.0. Thus the extension from Shannon's case, where λ = 1.0, to Rényi's case with general λ allows us to introduce other, more reasonable reference distributions under various genetic settings. For example, V can be the population allele frequencies or the theoretical allele frequencies under a certain model. Tests based on Rényi divergence with different reference distributions require further study and are an interesting direction for future research.

6. Appendix

Assume the loci of interest have k genotypes G_1, G_2, ..., G_k, and let p = [p_1, p_2, ..., p_k], with \sum_{i=1}^{k} p_i = 1, be the distribution of those genotypes in the population. Let X = [X_1, X_2, ..., X_k] be the number of observations of each genotype in the sample and n the sample size; then X has a multinomial distribution Mn(n, p). Note that for sufficiently large n, the multinomial distribution is approximately multivariate normal with mean E(X_i) = n p_i and variance-covariance structure given by Var(X_i) = n p_i (1 − p_i) and Cov(X_i, X_j) = −n p_i p_j (i ≠ j).

For P̂ = X/n = [X_1, X_2, ..., X_k]/n = [p̂_1, p̂_2, ..., p̂_k], we have E(P̂) = [p_1, p_2, ..., p_k] = p, and the variance-covariance matrix of P̂ is

\mathrm{Var}(\hat{P}) = \Sigma_P = \frac{1}{n} \begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\ -p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_k \\ \vdots & \vdots & \ddots & \vdots \\ -p_k p_1 & -p_k p_2 & \cdots & p_k(1-p_k) \end{pmatrix}.

First consider Shannon's entropy, H_1(\hat{P}) = -\sum_{i=1}^{k} \hat{p}_i \log \hat{p}_i, and define the function

h(\hat{P}) = [h(\hat{p}_1), h(\hat{p}_2), \ldots, h(\hat{p}_k)] = [-\hat{p}_1 \log \hat{p}_1, -\hat{p}_2 \log \hat{p}_2, \ldots, -\hat{p}_k \log \hat{p}_k].

The variance of h(P̂) can be approximated by the delta method as

V(n, p) \approx [h'(p)]^{T} \Sigma_P \, h'(p) = \mathrm{diag}(1+\log p_1, \ldots, 1+\log p_k) \, \Sigma_P \, \mathrm{diag}(1+\log p_1, \ldots, 1+\log p_k).

Thus Var(H_1(P̂)) is the sum over all elements of V(n, p).

For the general Rényi entropy, H_\lambda(\hat{P}) = \frac{1}{1-\lambda} \log(\sum_{i=1}^{k} \hat{p}_i^{\lambda}), define the function h(\hat{P}) = [h(\hat{p}_1), h(\hat{p}_2), \ldots, h(\hat{p}_k)] = [\hat{p}_1^{\lambda}, \hat{p}_2^{\lambda}, \ldots, \hat{p}_k^{\lambda}], let Z = \sum_{i=1}^{k} \hat{p}_i^{\lambda}, and let g(Z) = \frac{1}{1-\lambda} \log Z. After applying the delta method multiple times, we obtain

\mathrm{Var}(H_\lambda(\hat{P})) = \frac{\Sigma_Z}{(1-\lambda)^2 z^2},

where z = \sum_{i=1}^{k} p_i^{\lambda} and \Sigma_Z is the sum over all elements of the matrix V(n, p), given by

V(n, p) = \mathrm{diag}[\lambda p_1^{\lambda-1}, \ldots, \lambda p_k^{\lambda-1}] \, \Sigma_P \, \mathrm{diag}[\lambda p_1^{\lambda-1}, \ldots, \lambda p_k^{\lambda-1}].

Let n_1 and n_2 be the sample sizes of the case group and the control group, respectively. Under the null hypothesis of no association, the genotypic distributions of the disease population and the normal population are identical. Let the overall population genotype distribution be p = [p_1, p_2, ..., p_k], and let X_{1i} be the number of cases with genotype G_i; then X_1 = [X_{11}, X_{12}, ..., X_{1k}] follows a multinomial distribution Mn(n_1, p). Similarly, let X_2 = [X_{21}, X_{22}, ..., X_{2k}] be the counts of controls over all genotypes; X_2 follows a multinomial distribution Mn(n_2, p). When the case group and the control group are independent samples, the variance of the test statistic is simply the sum of the variances of the two entropies, given by

\mathrm{Var}(S_\lambda^A) = \mathrm{Var}(H_\lambda(\hat{P}^D)) + \mathrm{Var}(H_\lambda(\hat{P}^N)).

If there is no additional information about the distribution of genotypes in the overall population, p is usually estimated by (X_1 + X_2)/(n_1 + n_2).

Contributor Information

Mariza de Andrade, Mayo Clinic.

Xin Wang, Mayo Clinic.

References

1. Cordell HJ. Detecting gene-gene interactions that underlie human disease. Nature Reviews Genetics. 2009;10:392–404. doi: 10.1038/nrg2579.
2. Dong CZ, Chu X, Wang Y, Wang Yi, Jin L, Shi TL, Huang W, Li YX. Exploration of gene-gene interaction effects using entropy-based methods. European Journal of Human Genetics. 2008;16:229–235. doi: 10.1038/sj.ejhg.5201921.
3. Heit JA, Cunningham JM, Petterson TM, Armasu SM, Rider DN, de Andrade M. Genetic variation within the anticoagulant, procoagulant, fibrinolytic and innate immunity pathways as risk factors for venous thromboembolism. Journal of Thrombosis and Haemostasis. 2011;9(6):1133–1142. doi: 10.1111/j.1538-7836.2011.04272.x.
4. Kang GL, Yue WH, Zhang JF, Cui YH, Zuo YJ, Zhang D. An entropy-based approach for testing genetic epistasis underlying complex diseases. Journal of Theoretical Biology. 2008;250:362–374. doi: 10.1016/j.jtbi.2007.10.001.
5. Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Human Heredity. 2007;63:111–119. doi: 10.1159/000099183.
6. Kullback S, Leibler RA. On information and sufficiency. Annals of Mathematical Statistics. 1951;22(1):79–86. doi: 10.1214/aoms/1177729694.
7. Li C, Li MY. GWAsimulator: a rapid whole-genome simulation program. Bioinformatics. 2007;24(1):140–142. doi: 10.1093/bioinformatics/btm549.
8. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Feinberg AP, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark A, Eichler EE, Gibson G, Haines JL, Mackay TFC, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
9. Rényi A. On measures of information and entropy. Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability; 1960. pp. 547–561.
10. Shannon CE. A mathematical theory of communication. Bell System Technical Journal. 1948;27:379–423.
11. Thomas DC. Gene-environment-wide association studies: emerging approaches. Nature Reviews Genetics. 2010;11:259–272. doi: 10.1038/nrg2764.
12. Zhao JY, Boerwinkle E, Xiong MM. An entropy-based statistic for genomewide association studies. American Journal of Human Genetics. 2005;77:27–40. doi: 10.1086/431243.
