Detecting Genetic Interactions for Quantitative Traits Using m-Spacing Entropy Measure

Jaeyong Yee; Min-Seok Kwon; Seohoon Jin; Taesung Park; Mira Park

doi:10.1155/2015/523641

. 2015 Aug 3;2015:523641. doi: 10.1155/2015/523641

Detecting Genetic Interactions for Quantitative Traits Using m-Spacing Entropy Measure

Jaeyong Yee ¹, Min-Seok Kwon ², Seohoon Jin ³, Taesung Park ⁴, Mira Park ^5,^*

PMCID: PMC4538333 PMID: 26339620

Abstract

A number of statistical methods for detecting gene-gene interactions have been developed in genetic association studies with binary traits. However, many phenotype measures are intrinsically quantitative and categorizing continuous traits may not always be straightforward and meaningful. Association of gene-gene interactions with an observed distribution of such phenotypes needs to be investigated directly without categorization. Information gain based on entropy measure has previously been successful in identifying genetic associations with binary traits. We extend the usefulness of this information gain by proposing a nonparametric evaluation method of conditional entropy of a quantitative phenotype associated with a given genotype. Hence, the information gain can be obtained for any phenotype distribution. Because any functional form, such as Gaussian, is not assumed for the entire distribution of a trait or a given genotype, this method is expected to be robust enough to be applied to any phenotypic association data. Here, we show its use to successfully identify the main effect, as well as the genetic interactions, associated with a quantitative trait.

1. Introduction

Recent advances in high-throughput genotyping techniques have produced massive volumes of genetic data. Although it is common to analyze single SNP effects extensively, such approaches cannot adequately explain the intricate genetic contributions to complex diseases such as hypertension, diabetes, and certain psychiatric disorders. Consequently there are still large amounts of genetic components that remain unexplained. Gene-gene interaction analysis may be one method to adequately address this missing heritability problem [1].

For case-control studies, which formulate the measures for a binary trait, a number of statistical methods for detecting gene-gene interactions have been proposed. One of the most popular methods is multifactor dimensionality reduction (MDR) [2] that converts a high-dimensional contingency table to a one-dimensional model without raising the issue of sparse cells. Several variants of MDR have been recently developed [3–8], while another approach was developed [9–11] from information theory [12, 13]. More recently, an entropy-based approach which utilizes the relative gain of information, as well as its standardized measure, has also been proposed [14].

However, for quantitative traits such as the blood pressure, body mass index, and patient survival times, relatively few attempts have been made to analyze the genetic interactions. Because many phenotype measures are intrinsically quantitative, and categorizing a continuous trait may not always be straightforward and meaningful, association of gene-gene interactions with an observed distribution of such phenotypes needs to be investigated directly without categorization. To that end, introducing a new statistic is one way to tackle the problem [15]. Extending the MDR algorithm to continuous traits, as in the ways of the generalized MDR (GMDR) and the model-based MDR (MB-MDR), has been proposed [3, 6]. More recently a quantitative MDR (QMDR) was proposed to replace the balanced accuracy metric with a t-test statistic [16]. However, these MDR-based approaches may oversimplify the original data to some degree, through classification of phenotypes. An entropy-based approach may well be an alternative model. Entropy is commonly used in information theory to measure the uncertainty of random variables [12, 13], and information gain or mutual information has been shown useful to represent association strengths [17–19]. Although the usefulness of such information theoretical methods is well known, the statistical methods based on this approach for analyzing gene-gene interactions of the quantitative traits are rarely found, with the exception of one specific case [20]. However, the application may also be limited by assuming a normal distribution.

Here, we extend the usefulness of the information concept to quantitative traits by considering nonparametric estimates based on sample-spacing or m-spacing [22–25] for the conditional entropy of a quantitative phenotype, based on a given genotype. The challenge, therefore, is to couple a nonparametric entropy estimator to correct and stable information gains. We thus developed the useful information gain standardized (IGS) approach and applied it to datasets composed of several genotypes and the quantitative trait. This approach could be considered an extension of previous work on categorical traits [14] to the quantitative phenotypes. The proposed method, however, does not attempt in any way to classify quantitative phenotypes like other methods, such as variants of MDR but instead handles them directly, providing an intrinsic advantage of removing the chance of misclassification. While previous entropy-based methods of analyzing quantitative traits assumed the shape of its distribution to be normal [20], our method does not need to specify the distribution to estimate the association. Any regular or irregular distribution would not cause any difficulties. Although this is also an advantage of GMDR or QMDR, we propose a method that takes the advantageous characteristics from both of those methods. We also performed extensive simulation studies to compare the powers of the proposed method to QMDR and GMDR, demonstrating its advantage in detection power.

In the following sections, after a brief review of nonparametric entropy estimation, we describe a new method for modeling genetic interactions. A nonparametric entropy estimator is shown to successfully couple with genetic datasets through our modifying work in the Materials and Methods. Application of this information gain standardized (IGS) approach is evaluated for both simulation and real datasets in the Results and Discussions.

2. Materials and Methods

2.1. Estimation of the Entropy for a Continuous Variable

If X is a random vector with probability density function, f(x), its differential entropy is defined by

\begin{matrix} H (f) = - \int f (x) \ln (f (x)) d x . \end{matrix}

(1)

A well-known approach for estimating a solution to this equation is to use plug-in estimates. In this approach, f(x) is first estimated using a standard density estimation method such as a histogram or kernel density estimator, and the entropy is then computed. Integral, resubstitution, splitting data, and cross-validation estimates are among the usual plug-in estimates [22]. Another approach is based on sample-spacing. Let {X _k} be a set of independent and identically distributed real valued random variables, with corresponding order statistics of {X _n,k}. Here, n represents the total number of measured samples. For the arbitrary integers i and m satisfying the condition of 1 ≤ i < i + m ≤ n, a spacing of order m or m-spacing is defined as X _n,i+m − X _n,i. A density estimate, based on sample-spacing, m, is then constructed as

\begin{matrix} f_{n} (x) = \frac{m}{n} \frac{1}{X_{n, i m} - X_{n, (i - 1) m}}, \end{matrix}

(2)

where x ∈ [X _{n,(i − 1)m}, X _n,im [14]. This density estimate is consistent if, as n → ∞, m → ∞ and m/n → 0 [22]. Several variations of an entropy estimator with minor differences have been proposed, all based on the above density estimates [23, 24]. Among them, the following were reported to approximate with lowered variance [25]:

\begin{matrix} H_{m, n} = \frac{1}{n - m} \sum_{k = 1}^{n - m} \ln (\frac{n}{m} (X_{n, k + m} - X_{n, k})) . \end{matrix}

(3)

Asymptotic bias of this estimator can be corrected by adding additional terms, including the digamma function [22, 28]:

\begin{matrix} H_{m, n} = \frac{1}{n - m} \sum_{k = 1}^{n - m} \ln (\frac{n}{m} (X_{n, k + m} - X_{n, k})) - \frac{Γ^{'} (m)}{Γ (m)} + \ln m . \end{matrix}

(4)

As m increases, the correctional terms become negligible and the two estimators coincide. Our evaluation of the entropy of a phenotype, H(P), of a quantitative trait is based on this estimator.

2.2. Modification of the m-Spacing Based Entropy Estimator

The estimator in (4) has both n and m as parameters. In genetic association studies, the number of samples, n, of several hundreds is common. However, when the conditional entropy is estimated, there may be a minor allele that could have a much smaller number of samples corresponding to that allele. Moreover, the choice of the sample-spacing, m, should affect the resulting estimation of an entropy value. Therefore, it is required to have an entropy estimation scheme independent of the number of samples, without the need of choosing a particular value of the sample-spacing. To illustrate such a requirement, an ensemble of 3,000 sets of the random deviation from N(0, 1²) was generated for each data point in Figure 1, where the mean and standard deviation of the estimates are plotted for each ensemble. On the left panel of Figure 1, m is fixed to 10 and 20 while n is varied. The analytic formula of the entropy for a normal distribution can be obtained as follows [20], where e is Euler's number:

\begin{matrix} H = \ln (σ \sqrt{2 π e}) . \end{matrix}

(5)

The calculated value of (5) is pointed on the vertical axis with a horizontal arrow with the corresponding σ above it. The obvious n-dependence of the estimator can be seen in this plot, where the estimation approaches the analytic value, as n increases with $\sqrt{n}$ -consistency, as expected [24]. In Figure 1(b), n is fixed to 400, while m is varied. In this plot, the estimated entropy again changes in value throughout the possible range of m. It is shown that the estimated value is always smaller than the analytically calculated value. Therefore, assigning a particular value to m such as $\sqrt{n}$ , the typical choice [25], would not be appropriate in this sampling range. Because of these n- and m-dependences, the estimator in (4) may need to be modified. Therefore, we modify the entropy estimator in (4) as follows:

\begin{matrix} H_{〈m〉, n} = \frac{1}{n - 1} \sum_{m = 1}^{n - 1} (\frac{1}{n - m} \sum_{k = 1}^{n - m} \ln (\frac{n}{m} (X_{n, k + m} - X_{n, k})) \\ - \frac{Γ^{' (m)}}{Γ (m)} + \ln m) . \end{matrix}

(6)

In this modification, an entropy estimator is averaged over the possible m values for each n, which is denoted by 〈m〉. This estimator is used to plot the entropy versus number of samples in Figure 2. Over a wide range of n, this entropy estimator yields very stable values, in contrast to Figure 1(a). An increase in the extremely small n range should be within the tolerable error in an application of genome-wide association, as the contribution to the conditional entropy by such a minor allele would be suppressed by the weighting factor of the marginal probability that should be proportional to the number of corresponding samples. Analytically obtained entropy values for N(0, σ ²), with three different σ's, are marked on the vertical axis on the right-hand side. Regardless of the value of σ, the differences between the analytically obtained value and the values given by the estimator stay essentially the same. Considering that the association study measures the difference between the entropy and the corresponding conditional entropy, the stability should be a more critical issue than the absolute value of the estimates. Therefore compensation of this Δ would not be necessary as long as it is stable. Furthermore, the underestimation of the entropy shown in the plot should have little effect on the association strength. Hence, an entropy estimator has been set up that should satisfy the practical n-independence without the need to find a proper sample-spacing.

The n-dependence (a) and m-dependence (b) of the entropy estimator H _m,n. An ensemble of 3,000 sets of random sampling from N(0, 1²) was constructed and used for each point in the plot. The sample-spacing, m, was fixed while varying the number of samples, n, (a) to evaluate the n-dependence of the entropy estimator. In (b), n was fixed and m was varied to show the m-dependence. Analytically obtained true values are represented by the arrowed horizontal lines.

The n-independence and constant offset from the true value of the estimates averaged over all possible m values for each n. Each symbol represents a result of samplings from N(0, σ ²). While varying n, the number of samples, the estimated entropy values were averaged over all the possible m, sample-spacing values. 〈m〉 denotes this averaging, which should not depend on weighting due to the virtually same standard deviations shown in Figure 1(b). Over a wide range of n, the estimated entropy stays effectively the same, showing n-independence in the range of practical number of sampling. Moreover, the almost flat line connecting each symbol shifts up or down following exactly the change of the true value indicated by the horizontal arrows. The rise in the extremely small n range should be within the tolerable error of any specific application, because the contribution to conditional entropy by such a case would be suppressed by weighting, based on the marginal probability that should be proportional to n.

2.3. Evaluation of a Conditional Entropy

Now let G be a categorical variable assigned to each sample measurement X _k. G may be a genotype given by a measured SNP or a combination of SNPs, while X _k represents the measured value of a phenotype. For detecting the main effect of a single SNP, G consists of three categories of G = 0, G = 1, and G = 2. For detecting the interaction between SNP_i and SNP_j, G consists of 9 categories, such that G = 0 = (SNP_i = 0, SNP_j = 0), G = 1 = (SNP_i = 0, SNP_j = 1), G = 2 = (SNP_i = 0, SNP_j = 2), G = 3 = (SNP_i = 1, SNP_j = 0),…, and G = 8 = (SNP_i = 2, SNP_j = 2). Detection of the higher order interaction can be performed in the same way with expansion of the categories of G. Then an estimator for each specific component of the conditional entropy, H(P∣G = g), can be constructed using the genotype-selected subset measurements {X _{n_g,k}}, along with an individual sample-spacing of m _g. Extending (6), while applying the above argument, should now readily produce the estimators for the entropy of a phenotype and the conditional entropy. Here d denotes the order of a gene-gene interaction:

\begin{matrix} H (P) = \frac{1}{n - 1} \sum_{m = 1}^{n - 1} (\frac{1}{n - m} \sum_{k = 1}^{n - m} \ln (\frac{n}{m} (X_{n, k + m} - X_{n, k})) \\ - \frac{Γ^{' (m)}}{Γ (m)} + \ln m), \\ H (P ∣ G) \\ = \sum_{g = 0}^{3^{d} - 1} \frac{n_{g}}{n} (\frac{1}{n_{g} - 1} \sum_{m_{g} = 1}^{n_{g} - 1} (\frac{1}{n_{g} - m_{g}} \\ \cdot \sum_{k = 1}^{n_{g} - m_{g}} \ln (\frac{n_{g}}{m_{g}} (X_{n_{g}, k + m_{g}} - X_{n_{g}, k})) \\ - \frac{Γ^{'} (m_{g})}{Γ (m_{g})} + \ln m_{g})) \\ = \sum_{g = 0}^{3^{d} - 1} \frac{n_{g}}{n} H (P ∣ G = g) . \end{matrix}

(7)

2.4. Standardized Measure of an Association Strength

Since the differential entropy values are scale-dependent, when the above estimators are calculated with {X _i} and {cX _i} (where c is a constant scale factor), the difference would be ln⁡c:

\begin{matrix} H_{\{c X_{i}\}} = \ln c + H_{\{X_{i}\}} . \end{matrix}

(8)

For example, if the phenotype is height it may be measured in meters or centimeters. In this case, the scale factor is 100. Nevertheless, the association strength should also be the same. Also note that a negative value is perfectly legitimate for a differential entropy. Information gain, IG, as in the form defined with discrete entropies [14], should satisfy scale independence, while correctly representing an association strength without being affected by negative values. Therefore, it should retain its usefulness as a measure of an association strength:

\begin{matrix} IG = H (P) - H (P ∣ G) . \end{matrix}

(9)

IG would be readily estimated with the above estimator (7). IG standardized (IGS) is set up with the means and standard deviations of IGs obtained from repeated shuffling of the phenotypes while all genotypes remained fixed [14]. Let IG_i ⁽¹⁾ denote the maximum IG of the ith permuted dataset. Then, the mean and standard deviation of IG₁ ⁽¹⁾, IG₂ ⁽¹⁾,…, IG_n ⁽¹⁾ can be computed as follows:

\begin{matrix} {\bar{IG}}_{p} = \frac{\sum_{i = 1}^{n} {IG}_{i}^{(1)}}{n}, S_{p} = \sqrt{\frac{\sum_{i = 1}^{n} {({\bar{IG}}_{i}^{(1)} - {\bar{IG}}_{p})}^{2}}{n - 1}}, \end{matrix}

(10)

where n is the number of permuted datasets. Now IGS is defined as follows:

\begin{matrix} IGS = \frac{IG - {\bar{IG}}_{p}}{S_{p}} . \end{matrix}

(11)

3. Results and Discussions

3.1. Demonstration of the m-Spacing Method

To show the plausibility of the proposed m-spacing method, a representative result is shown in Figure 3, using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair simulated, as described in the next section. The sample size of the dataset was 400, with 20 SNPs. In panel (a), the association strengths, obtained by m-spacing and GMDR, are plotted as horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles are for the 2nd order interactions. Both methods identify the same single SNP pair having a prominent interaction plotted in the upper right corner. One of the SNPs was found to produce the main effect, in contrast to others. Again, the result is agreed by both methods. P values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with a plus sign. Because no 3rd order interaction is simulated into the dataset, the combinations of SNPs made by adding a single SNP to the causal pair are expected to have high association values. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison was made using the result from m-spacing and QMDR. Both comparisons show consistent results between the proposed m-spacing method and GMDR or QMDR. Note that IGS instead of IG was used. The distribution of the IG values from a dataset would shift to a higher direction, with increased order of interactions. Thus, the more conditions applied, the less entropy may be left to find. In other words, as the order of interaction increases, the conditional entropy H(P∣G) tends to decrease, while H(P) remains the same. Therefore IGS is vital if one needs to compare the association strengths between genotypes from different orders of interactions. Figure 3 shows that the simulated causal pair has the largest IGS value among all points, from different orders of interactions.

Comparison of the QMDR, GMDR, and m-spacing methods. Association strengths obtained by GMDR versus m-spacing (a) and by QMDR versus m-spacing (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair.

3.2. Generation of the Simulation Data

To examine the performance of the m-spacing method, an extensive set of simulation data was necessarily generated. First, three types of quantitative trait distributions were considered. Two of them were normal and gamma distributions, and another one was a mixture of those two types. With single causal pair designed, 70 different penetrance models, based on [21], were incorporated. For the case of a normal distribution, a phenotype value, y, associated with two interacting SNPs was selected from a normal distribution, as defined by the penetrance values tabled for possible combinations of genotypes associated as follows:

\begin{matrix} y ∣ ({SNP}_{1} = i, {SNP}_{2} = j) ~ N (f_{i j}, σ^{2}) . \end{matrix}

(12)

Here f _ij represents the penetrance values tabled for every model simulated and can be found in [21]. It is tabulated for each possible pair of genotypes, (i, j). In 70 different penetrance models, 14 different combinations of two different minor allele frequencies (MAFs) and seven different heritability values were considered. Specifically, we considered the cases when the MAFs were 0.2 and 0.4 and when the heritability was 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4. Three different values (0.8, 1.0, and 1.2) of the variance, σ, were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated event was set such that the averaged y of the high-risk group should be larger than or equal to the overall average. The averaged y of the low-risk group should be less than the overall average. In Figure 4, 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples and both have a variance of 1.0. For gamma distributions, phenotype values follow the rule below:

\begin{matrix} y ∣ ({SNP}_{1} = i, {SNP}_{2} = j) ~ Γ (k, θ) . \end{matrix}

(13)

The shape and scale parameters, k and θ, were determined by f _ij and σ, using the relationship f _ij = kθ and σ = kθ ². Penetrance models were classified by 7 heritability values: 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, and 0.4, resulting in 10 models for each heritability. The generated data files had a sample size of 400, with 20 SNPs. In all, 3 × 70 × 9 = 1,890 different conditions were set up, with 100 simulated data files generated for each condition.

Demonstration of the simulation scheme. Phenotype distributions were plotted to associate with the genotypes by two interacting SNPs, as denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2 or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. On the bottom of each plot, the penetrance value for this particular model is given, which is taken from [21]. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines are for the mean values of the high- and low-risk groups. By constraint, the line on the left is for the low-risk group.

3.3. Comparison of the Detection Probability and Type I Error

The “hit ratio,” or detection power, of the IGS was evaluated and compared. Simulated data files described in the previous subsection were used. All of them had a single causal pair to identify. In addition to our proposed m-spacing method, QMDR and GMDR were used to compare the results. Figure 5 shows the comparison. Panels (a), (b), and (c) are for the quantitative trait of normal, gamma, and mixed distributions, respectively. Seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the m-spacing's performance was in between those of QMDR and GMDR for higher values of penetrance. However, in the range of penetrance less than 0.2, the m-spacing performs best. Note that the QMDR shows higher detection probability than the GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the QMDR's performance drops rapidly, as the heritability decreases when the hit ratios of m-spacing, as well as the GMDR, stay better than that of QMDR and are comparable to each other. Note the switch of the GMDR and QMDR's performance ranks with the change of the phenotype distribution. What QMDR does is essentially the dichotomization of the observed values of the quantitative phenotypes. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., gamma distribution). The proposed m-spacing method is expected to be effective regardless of the shape of the phenotype distribution, because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparameterization is again confirmed in Figure 5(c), showing that m-spacing outperforms the QMDR and the GMDR, throughout the whole range of heritability, in the case of the mixed form of phenotype distribution. Among the three methods examined, m-spacing was the most robust, performing consistently within the range of conditions for the simulation.

Comparison of the hit ratios or the detection probabilities among the proposed m-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were, in turn, classified into 7 distinct values of heritability. For each model, the phenotype values are simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9,000 simulated files being examined for each point in the plot.

To estimate the type I error rate, the null datasets were generated under the same scheme as used for the detection power analysis except that there was no causal pair intended. Now there are 20 SNPs that none of the pairs are expected to have an association. Permutation P values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level α as 0.05 to get the ratio of the permutation P values smaller than or equal to α. We report this ratio as the type I error rate in Table 1, whose accuracy to one decimal place when expressed in percent was ensured by the number of the permutation. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5% with maximum and minimum of 5.4% and 4.3%, respectively. Moreover there exists no sign of the dependence on the trait shape, heritability, and MAF. Therefore our proposed method preserved the type I error rates on these conditions.

Table 1.

Type I error estimation with the significance level α of 0.05.

Type I error rate (%)		Normal	Gamma	Mixed
MAF	0.2	5.0	5.0	5.1
MAF	0.4	5.1	5.0	5.1

Heritability	0.01	5.3	5.0	4.8
	0.02	4.9	5.4	5.2
	0.05	5.3	4.3	5.3
	0.1	5.0	5.3	5.1
	0.2	5.0	5.3	5.1
	0.3	4.8	4.9	4.8
	0.4	5.1	4.7	5.3

Overall		5.0	5.0	5.1

Open in a new tab

3.4. Application to Real Data

A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the m-spacing method. Among the available phenotypes, “height” was chosen with a sample size of 8,842 from the population-based cohort. The total number of SNPs was 327,872, spanning over 22 chromosomes. The “height” phenotype showed to be close to a normal distribution such that the m-spacing method may not take advantage of the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs, selected by the m-spacing method (IGS), that had the strongest main effects. Out of 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27], although two more matched SNPs, rs11989122 and rs1344672, could be found as results of our analysis using the same tool as in [26], but using the newly imputed dataset. P values were estimated by permutation of the phenotype values to make null distributions. Permutations were iterated 100,000 and 10,000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not so obvious, compared to the case of the main effect.

Table 2.

Application of the m-spacing method to a full set of KARE samples with the phenotype “height;” main effect.

Main effect
rs ID	Chromosome	IGS	P value	Previous report
rs11989122	8	11.3892	1 × 10⁻⁵	^*5.89 × 10⁻⁶
rs7316119	12	8.7531	1 × 10⁻⁵	—
rs936634	18	8.6125	2 × 10⁻⁵	—
rs7632381	3	7.8235	1 × 10⁻⁵	—
rs2079795	17	7.6542	1 × 10⁻⁵	2.92 × 10⁻⁶ Ref. [26]
rs1344672	3	7.6177	1 × 10⁻⁵	^*5.21 × 10⁻⁷
rs2523865	6	7.6044	4 × 10⁻⁵	—
rs3790199	20	7.5362	2 × 10⁻⁵	—
rs6440003	3	7.5231	1 × 10⁻⁵	3.87 × 10⁻⁷ Ref. [27]
rs17628655	19	7.5117	6 × 10⁻⁵	—

Open in a new tab

^*Identified using the same method as [26] but with imputed data, which is the same one we analyzed.

Table 3.

Application of the m-spacing method to a full set of KARE samples with the phenotype “height;” 2nd order interaction.

2nd order interaction
rs ID	Chromosome	rs ID	Chromosome	IGS	P value
rs6499786	16	rs1788421	21	4.6197	1 × 10⁻⁴
rs2529232	7	rs1788421	21	4.3869	1 × 10⁻⁴
rs2241704	19	rs1788421	21	4.3855	1 × 10⁻⁴

Open in a new tab

4. Conclusion

In this paper, we present a modified m-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, while the original m-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce the datasets with different shapes of phenotype distributions, while varying the penetrance functions and adjusting the heritability as well. Causal pair detection probability was unaffected the most by the compared methods, based on the distribution shape and heritability, while GMDR and QMDR showed more dependency. The proposed m-spacing method is proven to outperform the others regardless of the shape of the trait distribution and also the range of lower heritability. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This would lead to versatile applicability of our nonparametric method for quantitative traits, with various characteristics. We applied this method to successfully identify the main effect and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of them overlapped with a previous report, new interactions were also found. Because “height” is presumed to be a trait with a normal distribution having a higher heritability, our method may be said to have performed successfully with no advantage over other methods. More extensive study is needed for quantitative traits, having various characteristics, to further demonstrate the expected robustness of our modified m-spacing method.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

1.Zuk O., Hechter E., Sunyaev S. R., Lander E. S. The mystery of missing heritability: genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences of the United States of America. 2012;109(4):1193–1198. doi: 10.1073/pnas.1119675109. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Ritchie M. D., Hahn L. W., Roodi N., et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics. 2001;69(1):138–147. doi: 10.1086/321276. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Lou X.-Y., Chen G.-B., Yan L., et al. A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence. American Journal of Human Genetics. 2007;80(6):1125–1137. doi: 10.1086/518312. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Chung Y., Lee S. Y., Elston R. C., Park T. Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. Bioinformatics. 2007;23(1):71–76. doi: 10.1093/bioinformatics/btl557. [DOI] [PubMed] [Google Scholar]
5.Yeoun Lee S., Chung Y., Elston R. C., Kim Y., Park T. Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions. Bioinformatics. 2007;23(19):2589–2595. doi: 10.1093/bioinformatics/btm396. [DOI] [PubMed] [Google Scholar]
6.Calle M. L., Urrea V., Vellalta G., Malats N., Steen K. V. Improving strategies for detecting genetic patterns of disease susceptibility in association studies. Statistics in Medicine. 2008;27(30):6532–6546. doi: 10.1002/sim.3431. [DOI] [PubMed] [Google Scholar]
7.Bush W. S., Edwards T. L., Dudek S. M., McKinney B. A., Ritchie M. D. Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinformatics. 2008;9, article 238 doi: 10.1186/1471-2105-9-238. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Kim K., Kwon M.-S., Oh S., Park T. Identification of multiple gene-gene interactions for ordinal phenotypes. BMC Medical Genomics. 2013;6, supplement 2, article S9 doi: 10.1186/1755-8794-6-S2-S9. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Kang G., Yue W., Zhang J., Cui Y., Zuo Y., Zhang D. An entropy-based approach for testing genetic epistasis underlying complex diseases. Journal of Theoretical Biology. 2008;250(2):362–374. doi: 10.1016/j.jtbi.2007.10.001. [DOI] [PubMed] [Google Scholar]
10.Dong C., Chu X., Wang Y., et al. Exploration of gene-gene interaction effects using entropy-based methods. European Journal of Human Genetics. 2008;16(2):229–235. doi: 10.1038/sj.ejhg.5201921. [DOI] [PubMed] [Google Scholar]
11.Wu C., Li S., Cui Y. Genetic association studies: an information content perspective. Current Genomics. 2012;13(7):566–573. doi: 10.2174/138920212803251382. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Shannon C. E. A mathematical theory of communication. The Bell System Technical Journal. 1948;27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
13.Gray R. M. Entropy and Information Theory. 2nd. New York, NY, USA: Springer; 2011. [DOI] [Google Scholar]
14.Yee J., Kwon M.-S., Park T., Park M. A modified entropy-based approach for identifying gene-gene interactions in case-control study. PLoS ONE. 2013;8(7) doi: 10.1371/journal.pone.0069321.e69321 [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Li M., Ye C., Fu W., Elston R. C., Lu Q. Detecting genetic interactions for quantitative traits with U-statistics. Genetic Epidemiology. 2011;35(6):457–468. doi: 10.1002/gepi.20594. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Gui J., Moore J. H., Williams S. M., et al. A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits. PLoS ONE. 2013;8(6) doi: 10.1371/journal.pone.0066545.e66545 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Paninski L. Estimation of entropy and mutual information. Neural Computation. 2003;15(6):1191–1253. doi: 10.1162/089976603321780272. [DOI] [Google Scholar]
18.Schraudolph N. N. Grandient-based manipulation of non-parametric entropy estimates. IEEE Transactions on Neural Networks. 2004;14(2):1–10. doi: 10.1109/TNN.2004.828766. [DOI] [PubMed] [Google Scholar]
19.Torkkola K. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research. 2003;3(7-8):1415–1438. doi: 10.1162/153244303322753742. [DOI] [Google Scholar]
20.Chanda P., Sucheston L., Liu S., Zhang A., Ramanathan M. Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits. BMC Genomics. 2009;10, article 1471:509–530. doi: 10.1186/1471-2164-10-509. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Velez D. R., White B. C., Motsinger A. A., et al. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology. 2007;31(4):306–315. doi: 10.1002/gepi.20211. [DOI] [PubMed] [Google Scholar]
22.Beirlant J., Dudewicz E. J., Gyorfi L., van der Meulen E. C. Nonparametric entropy estimation: an overview. International Journal of Mathematical and Statistical Sciences. 1997;6(1):17–39. [Google Scholar]
23.Tsybakov A. B., van der Meulen E. C. Root-n consistent estimators of entropy for densities with unbound support. Scandinavian Journal of Statistics. 1992;23:75–83. [Google Scholar]
24.El Haje Hussein F., Golubev Y. On entropy estimation by m-spacing method. Jounal of Mathematical Sciences. 2009;163:290–309. doi: 10.1007/s10958-009-9674-x. [DOI] [Google Scholar]
25.Learned-Miller E. G., Fisher J. W., III ICA using spacings estimates of entropy. Journal of Machine Learning Research. 2004;4:1271–1295. [Google Scholar]
26.Cho Y. S., Go M. J., Kim Y. J., et al. A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nature Genetics. 2009;41(5):527–534. doi: 10.1038/ng.357. [DOI] [PubMed] [Google Scholar]
27.Weedon M. N., Lango H., Lindgren C. M., et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nature Genetics. 2008;40(5):575–583. doi: 10.1038/ng.121. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Singh H., Misra N., Hnizdo V., Fedorowicz A., Demchuk E. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences. 2003;23(3-4):301–321. doi: 10.1080/01966324.2003.10737616. [DOI] [Google Scholar]

[B1] 1.Zuk O., Hechter E., Sunyaev S. R., Lander E. S. The mystery of missing heritability: genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences of the United States of America. 2012;109(4):1193–1198. doi: 10.1073/pnas.1119675109. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B2] 2.Ritchie M. D., Hahn L. W., Roodi N., et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics. 2001;69(1):138–147. doi: 10.1086/321276. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B3] 3.Lou X.-Y., Chen G.-B., Yan L., et al. A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence. American Journal of Human Genetics. 2007;80(6):1125–1137. doi: 10.1086/518312. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] 4.Chung Y., Lee S. Y., Elston R. C., Park T. Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. Bioinformatics. 2007;23(1):71–76. doi: 10.1093/bioinformatics/btl557. [DOI] [PubMed] [Google Scholar]

[B5] 5.Yeoun Lee S., Chung Y., Elston R. C., Kim Y., Park T. Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions. Bioinformatics. 2007;23(19):2589–2595. doi: 10.1093/bioinformatics/btm396. [DOI] [PubMed] [Google Scholar]

[B6] 6.Calle M. L., Urrea V., Vellalta G., Malats N., Steen K. V. Improving strategies for detecting genetic patterns of disease susceptibility in association studies. Statistics in Medicine. 2008;27(30):6532–6546. doi: 10.1002/sim.3431. [DOI] [PubMed] [Google Scholar]

[B7] 7.Bush W. S., Edwards T. L., Dudek S. M., McKinney B. A., Ritchie M. D. Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinformatics. 2008;9, article 238 doi: 10.1186/1471-2105-9-238. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] 8.Kim K., Kwon M.-S., Oh S., Park T. Identification of multiple gene-gene interactions for ordinal phenotypes. BMC Medical Genomics. 2013;6, supplement 2, article S9 doi: 10.1186/1755-8794-6-S2-S9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] 9.Kang G., Yue W., Zhang J., Cui Y., Zuo Y., Zhang D. An entropy-based approach for testing genetic epistasis underlying complex diseases. Journal of Theoretical Biology. 2008;250(2):362–374. doi: 10.1016/j.jtbi.2007.10.001. [DOI] [PubMed] [Google Scholar]

[B10] 10.Dong C., Chu X., Wang Y., et al. Exploration of gene-gene interaction effects using entropy-based methods. European Journal of Human Genetics. 2008;16(2):229–235. doi: 10.1038/sj.ejhg.5201921. [DOI] [PubMed] [Google Scholar]

[B11] 11.Wu C., Li S., Cui Y. Genetic association studies: an information content perspective. Current Genomics. 2012;13(7):566–573. doi: 10.2174/138920212803251382. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] 12.Shannon C. E. A mathematical theory of communication. The Bell System Technical Journal. 1948;27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]

[B13] 13.Gray R. M. Entropy and Information Theory. 2nd. New York, NY, USA: Springer; 2011. [DOI] [Google Scholar]

[B14] 14.Yee J., Kwon M.-S., Park T., Park M. A modified entropy-based approach for identifying gene-gene interactions in case-control study. PLoS ONE. 2013;8(7) doi: 10.1371/journal.pone.0069321.e69321 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15.Li M., Ye C., Fu W., Elston R. C., Lu Q. Detecting genetic interactions for quantitative traits with U-statistics. Genetic Epidemiology. 2011;35(6):457–468. doi: 10.1002/gepi.20594. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] 16.Gui J., Moore J. H., Williams S. M., et al. A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits. PLoS ONE. 2013;8(6) doi: 10.1371/journal.pone.0066545.e66545 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] 17.Paninski L. Estimation of entropy and mutual information. Neural Computation. 2003;15(6):1191–1253. doi: 10.1162/089976603321780272. [DOI] [Google Scholar]

[B18] 18.Schraudolph N. N. Grandient-based manipulation of non-parametric entropy estimates. IEEE Transactions on Neural Networks. 2004;14(2):1–10. doi: 10.1109/TNN.2004.828766. [DOI] [PubMed] [Google Scholar]

[B19] 19.Torkkola K. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research. 2003;3(7-8):1415–1438. doi: 10.1162/153244303322753742. [DOI] [Google Scholar]

[B20] 20.Chanda P., Sucheston L., Liu S., Zhang A., Ramanathan M. Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits. BMC Genomics. 2009;10, article 1471:509–530. doi: 10.1186/1471-2164-10-509. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] 21.Velez D. R., White B. C., Motsinger A. A., et al. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology. 2007;31(4):306–315. doi: 10.1002/gepi.20211. [DOI] [PubMed] [Google Scholar]

[B21] 22.Beirlant J., Dudewicz E. J., Gyorfi L., van der Meulen E. C. Nonparametric entropy estimation: an overview. International Journal of Mathematical and Statistical Sciences. 1997;6(1):17–39. [Google Scholar]

[B22] 23.Tsybakov A. B., van der Meulen E. C. Root-n consistent estimators of entropy for densities with unbound support. Scandinavian Journal of Statistics. 1992;23:75–83. [Google Scholar]

[B23] 24.El Haje Hussein F., Golubev Y. On entropy estimation by m-spacing method. Jounal of Mathematical Sciences. 2009;163:290–309. doi: 10.1007/s10958-009-9674-x. [DOI] [Google Scholar]

[B24] 25.Learned-Miller E. G., Fisher J. W., III ICA using spacings estimates of entropy. Journal of Machine Learning Research. 2004;4:1271–1295. [Google Scholar]

[B27] 26.Cho Y. S., Go M. J., Kim Y. J., et al. A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nature Genetics. 2009;41(5):527–534. doi: 10.1038/ng.357. [DOI] [PubMed] [Google Scholar]

[B28] 27.Weedon M. N., Lango H., Lindgren C. M., et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nature Genetics. 2008;40(5):575–583. doi: 10.1038/ng.121. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] 28.Singh H., Misra N., Hnizdo V., Fedorowicz A., Demchuk E. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences. 2003;23(3-4):301–321. doi: 10.1080/01966324.2003.10737616. [DOI] [Google Scholar]

PERMALINK

Detecting Genetic Interactions for Quantitative Traits Using m-Spacing Entropy Measure

Jaeyong Yee

Min-Seok Kwon

Seohoon Jin

Taesung Park

Mira Park

Abstract

1. Introduction

2. Materials and Methods

2.1. Estimation of the Entropy for a Continuous Variable