Statistical Applications in Genetics and Molecular Biology. 2011 Feb 23;10(1):12. doi: 10.2202/1544-6115.1569

Information Metrics in Genetic Epidemiology*

David L Tritchler, Lara Sucheston, Pritam Chanda, Murali Ramanathan
PMCID: PMC3058413  PMID: 21381437

Abstract

Information-theoretic metrics have been proposed for studying gene-gene and gene-environment interactions in genetic epidemiology. Although these metrics have proven very promising, they are typically interpreted in the context of communications and information transmission, diminishing their tangibility for epidemiologists and statisticians.

In this paper, we clarify the interpretation of information-theoretic metrics. In particular, we develop the methods so that their relation to the global properties of probability models is made clear and contrast them with log-linear models for multinomial data. Hopefully, a better understanding of their properties and probabilistic implications will promote their acceptance and correct usage in genetic epidemiology. Our novel development also suggests new approaches to model search and computation.

Keywords: gene-environment interaction, gene-gene interactions, K-way Interaction Index, information theory

1. Introduction

Information-theoretic metrics have been proposed for studying gene-gene and gene-environment interactions in genetic epidemiology (Sucheston et al., 2010; Chanda et al., 2007, 2008; Bush et al., 2008; Moore et al., 2006; Dong et al., 2008; Kang et al., 2008). The metrics considered in this paper have been shown to perform well in comparison to other methods in simulation experiments and case studies (Sucheston et al., 2010; Chanda et al., 2007, 2008). Although these metrics have proven very promising, they are typically interpreted in the context of communications and information transmission, diminishing their tangibility for epidemiologists and statisticians. Information concepts are typically defined as sums of entropy terms; such characterizations may be heuristically meaningful, but do not reveal their relevance to data and to global properties of probability models.

Cordell (2009) points out that the relationship of these approaches to more standard statistical methods is unclear. Clayton (2009) questions the usefulness of these methods compared to the logistic model for testing interaction and alternative methods which are based on conditional entropy. In this paper we clarify the interpretation of information-theoretic metrics, which have proven useful in the study of gene-gene and gene-environment interaction in genetic epidemiology. In particular, we develop the methods so that their relation to the global properties of probability models is made clear, and contrast them with log-linear models for multinomial data for further insight. We conclude that information-theoretic measures of statistical interaction have strong connections to more familiar approaches. However, they also differ in ways that may offer advantages and that encourage their further study. Hopefully, a better understanding of their properties and probabilistic implications will promote their acceptance as a promising approach for interaction detection in genetic epidemiology.

Section 2 of the paper, Background and Preliminaries, reviews fundamental concepts in information theory. The basic unit of information, the entropy of a distribution, is introduced. Kullback-Leibler divergence, Information concepts related to statistical independence, and key combinatorial tools are presented. Section 3 introduces and interprets information metrics specifically developed for detecting association in statistical genetic studies. Phenotype Associated Information (PAI) measures the association of a phenotype with a set of predictors, e.g. single nucleotide polymorphisms (SNPs). We describe it in statistical terms and give its asymptotic distribution and corresponding permutation test. Section 4 introduces the information metric of primary interest for the epidemiologic study of interaction, the K-way Interaction Index (KWII), and describes its properties in depth. Examples comparing KWII modeling with log-linear models for multinomial data give insight into the interpretation of KWII. The last of these examples demonstrates that KWII is a more parsimonious description of interaction. The relationship between KWII and PAI is made explicit, and a decomposition of entropy in terms of KWII analogous to components of variance is given. The section closes with a discussion of some alternative entropy-based methods. In section 5 we address model search algorithms, briefly describe previous work on the AMBIENCE algorithm (Chanda et al., 2008) for mining interactions using information metrics, and introduce a new algorithm. A discussion section concludes the paper.

2. Background and Preliminaries

2.1. Entropy

Entropies (Cover and Thomas, 2006) are the basic building blocks for the information metrics to be discussed. The Shannon entropy of a probability distribution with density f is defined to be H(f) = −E[log f(x)] and is described as representing "uncertainty" in x or the "information gained" by observing a random variable x from the distribution f(x). Entropy plays a role analogous to that of the variance, in that it quantifies the unpredictability of a random variable. For a random variable X with density f we will often substitute the notation H(X) for H(f). More generally, for a true distribution f, the entropy of a different density g is −E_f[log g(x)] = H(g), where the expectation is over the true distribution.

For a discrete distribution the information gained by observing an outcome x is defined to be –log p(x). Then the information expected from an experiment is:

H(X) = −Σ_x p(x) log p(x) = E[−log p(x)]

The capitalization of the argument of H(X) emphasizes that H depends on the distribution of the random variable X rather than an observed value, since it is an expectation. We adopt the convention that H(φ) = 0.

For a discrete Bernoulli p(x), the case p = 1 entails no uncertainty (and zero variance), so entropy is zero. In contrast, when outcomes are equiprobable, i.e., p = 0.5, the entropy is maximal because the uncertainty regarding the outcome is maximal. Note that H = 0 for both p = 0 and p = 1, so H is not a one-to-one transformation and it is a condensation of the detail in the probabilities.
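
For illustration (a minimal sketch of our own, using natural logarithms), the Bernoulli entropy can be computed directly from the definition:

import math

def bernoulli_entropy(p):
    """Shannon entropy (nats) of a Bernoulli(p) distribution, with 0*log 0 taken as 0."""
    terms = [q * math.log(q) for q in (p, 1.0 - p) if q > 0.0]
    return -sum(terms)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(bernoulli_entropy(p), 4))
# H is 0 at p = 0 and p = 1, and maximal (log 2 = 0.6931 nats) at p = 0.5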

2.2. Conditional entropy

Conditional entropy (Cover and Thomas, 2006) is defined in the natural way as:

H_X(Y) = E_x E_{y|x}[−log f(y|x)]

where Ey|x [−log f (y|x)], the expectation over the conditional distribution, is entropy calculated within the slice of data with a specific x value. Note that

H_X(Y) = E[−log( f(x,y) / f(x) )] = H(X,Y) − H(X)    (1)

This can be thought of as the entropy of y remaining after removing the entropy due to x, i.e., it is residual entropy. The joint entropy is the conditional entropy plus the entropy of the conditioning variable. Conditional entropy is distinguished from entropy of a conditional distribution H(Y|x) = Ey|x [−log f (y |x)], which is a function of x.

2.3. Kullback-Leibler divergence

Kullback-Leibler divergence (Anderson, 2008) is defined to be

I(f;q) = E_f[ log( f(x) / q(x) ) ]

where f and q are probability density functions and the expectation is with respect to the true density f. We will use f and q to represent densities throughout this paper. Note that I(f ;q) = H(q) − H(f) so it can be viewed as comparing the entropies (with respect to f ) of the two density functions. Important properties of I(f ;q) include positive definiteness; i.e., I(f ;q) ≥ 0 and I(f ;q) = 0 if and only if f = q. Thus the unique density that minimizes the divergence is the true density function.

In our applications of K-L divergence q(x) will be a restriction of f (x) representing a statistical hypothesis. The K-L divergence therefore is the expected log-likelihood ratio.
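
As a small sketch (with illustrative distributions of our own choosing), the divergence of a restricted distribution q from the true distribution f can be computed directly; it is zero only when q = f:

import math

def kl_divergence(f, q):
    """K-L divergence I(f; q) = E_f[log(f/q)] for discrete distributions given as lists."""
    return sum(fi * math.log(fi / qi) for fi, qi in zip(f, q) if fi > 0.0)

f = [0.5, 0.3, 0.2]          # "true" distribution
q = [1/3, 1/3, 1/3]          # restricted (uniform) hypothesis
print(kl_divergence(f, f))   # 0.0
print(kl_divergence(f, q))   # > 0, the expected log-likelihood ratio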

2.4. Mutual information and statistical independence

The mutual information Inf (XY) (Cover and Thomas, 2006) describing the information in one random vector x about another random vector y is:

Inf(XY) = I( f(x,y) ; f(x) f(y) )

This provides a measure of association or correlation between X and Y. In work with an information theory perspective, this correlation measure is denoted as “interaction”, a potential source of confusion for statistical applications; we will restrict the application of the term interaction to relations like nonadditive joint effects. If X and Y are independent, f (x, y) = f (x) f (y), so Inf (XY) = 0. Conversely, by the positive definiteness of divergence Inf (XY) = 0 only when X and Y are independent. Note that:

Inf(XY) = H(Y) + H(X) − H(X,Y) = H(Y) − H_X(Y)    (2)

by (1). Thus, if the uncertainty in Y remaining after accounting for X is large (small) as measured by HX (Y) then the information X possesses about Y is small (large). Since Inf (XY) ≥ 0, H(Y) ≥ HX (Y) and controlling for X can never increase entropy. The role of X here is analogous to that of a blocking variable in ANOVA or a covariate in regression analysis.
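
A small numeric sketch (with made-up joint probabilities) checking identities (1) and (2) on a 2 x 2 joint distribution:

import math

def entropy(probs):
    """Shannon entropy (nats) of a collection of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# joint distribution of (X, Y); rows index X, columns index Y (illustrative values)
joint = [[0.30, 0.20],
         [0.10, 0.40]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
h_xy = entropy([p for row in joint for p in row])
mi = entropy(px) + entropy(py) - h_xy      # Inf(XY) = H(X) + H(Y) - H(X,Y)
cond_h = h_xy - entropy(px)                # H_X(Y) = H(X,Y) - H(X), equation (1)
print(mi, entropy(py) - cond_h)            # the two expressions for Inf(XY) agree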

Conditional mutual information (Cover and Thomas, 2006) is defined in the natural way:

Inf(XY|Z) = I( f(x,y|z) ; f(x|z) f(y|z) ) = E[ log( f(x,y,z) f(z) / ( f(x,z) f(y,z) ) ) ]

The expectation is over all the variables, so that Inf (XY |Z ) is not a function of Z.

Total correlation information (Chanda et al., 2007) is defined as

TCI(X_1, X_2, …, X_K) = Σ_{i=1}^{K} H(X_i) − H(X_1, X_2, …, X_K)    (3)

For example,

TCI(X_1, X_2, X_3) = H(X_1) + H(X_2) + H(X_3) − H(X_1, X_2, X_3)

This is easy to relate to divergence. Direct substitution of the definitions of the entropies gives:

TCI(X_1, X_2, …, X_K) = I( f(X_1, X_2, …, X_K) ; Π_{i=1}^{K} f(X_i) ) = E[ log( f(X_1, X_2, …, X_K) / Π_{i=1}^{K} f(X_i) ) ]

For example,

TCI(X, Y, Z) = I( f_{x,y,z} ; f_x f_y f_z )

We see that TCI is the information against complete independence.
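
As a sketch with simulated probabilities (of no substantive meaning), TCI computed from the entropy sum in (3) agrees with the divergence from the product of the margins:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# illustrative joint distribution of three binary variables X1, X2, X3
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2)) + 0.05
p /= p.sum()

margins = [p.sum(axis=tuple(j for j in range(3) if j != i)) for i in range(3)]
tci = sum(entropy(m) for m in margins) - entropy(p)        # equation (3)

prod = margins[0][:, None, None] * margins[1][None, :, None] * margins[2][None, None, :]
kl = float(np.sum(p * np.log(p / prod)))                   # divergence from complete independence
print(tci, kl)                                             # identical up to rounding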

2.5. The Möbius transform

The Möbius transform is a key combinatorial tool in the development to follow. We will use it in section 4.1 to relate our interaction metrics to a factorization of the joint density. It will also be central to the algorithm given in section 5.2.

Let g and G be functions defined on the set of all subsets of a finite set Ω. Then the two statements i) g(x) = Σ_{z⊆x} G(z) and ii) G(z) = Σ_{w⊆z} (−1)^{|z\w|} g(w) are equivalent, where |·| is the cardinality of the argument. Thus a function defined by either expression can be inverted via the other equation of the pair. There is a Fast Möbius Transform (FMT, Thoma (1991)) for fast computation of the functions for all subsets of Ω in O(N·2^N) time, where N = |Ω|:

To compute the Möbius transform using the FMT let initially

g0(S)=G(S)

for all subsets S. Then iterate for j ranging over the elements of Ω as follows:

g_j(S) = g_{j-1}(S)                          if j ∉ S
g_j(S) = g_{j-1}(S \ {j}) + g_{j-1}(S)       if j ∈ S

The inverse transform is computed by initializing

G0(S)=g(S)

for all subsets S. Then iterate for j ranging over the elements of Ω as follows:

G_j(S) = G_{j-1}(S)                          if j ∉ S
G_j(S) = −G_{j-1}(S \ {j}) + G_{j-1}(S)      if j ∈ S
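
As a concrete sketch (our own bitmask encoding of subsets, not from the paper), the two recursions can be implemented and checked against each other:

def fmt_subset_sum(G):
    """Given G indexed by bitmask subsets, return g(x) = sum over z subset of x of G(z)."""
    g = list(G)
    n = len(g).bit_length() - 1
    for j in range(n):
        for S in range(len(g)):
            if S & (1 << j):
                g[S] += g[S ^ (1 << j)]
    return g

def fmt_inverse(g):
    """Given g, return G(z) = sum over w subset of z of (-1)^|z\\w| g(w) (Moebius inversion)."""
    G = list(g)
    n = len(G).bit_length() - 1
    for j in range(n):
        for S in range(len(G)):
            if S & (1 << j):
                G[S] -= G[S ^ (1 << j)]
    return G

G = [0, 1, -2, 5, 3, 0, 1, -1]              # arbitrary integer values on subsets of {0, 1, 2}
assert fmt_inverse(fmt_subset_sum(G)) == G  # the two transforms are mutual inverses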

3. A measure of association: Phenotype Associated Information (PAI)

Since epidemiologic studies usually distinguish a response variable P (phenotype in our case) from the vector of predictors x, a concept PAI(X, P), the Phenotype Associated Information has been introduced (Chanda et al., 2007):

PAI(X, P) = TCI(P, X_1, X_2, …, X_K) − TCI(X_1, X_2, …, X_K)

This can be seen to be I(fx,P; fx fP) = Inf (XP). That is, PAI measures the association of P with the multivariate vector x. Note that PAI(X, P) = PAI(P, X) since Inf (XP) = Inf (PX). Using (1), PAI can be characterized as:

PAI(X, P) = H(X) + H(P) − H(X, P) = H(P) − H_X(P)    (4)

the reduction in the entropy of P due to X, i.e., the difference between the entropy of P and the residual entropy.

PAI(X, P) = 0 if and only if x and P are independent, so we can use properties of probabilistic independence to explicate the meaning of a null PAI. For example, by the Reduction Theorem (Whittaker, 1990), PAI(X, P) = 0 ⇔ P ⊥ X ⇒ P ⊥ X_1, P ⊥ X_2, . . ., P ⊥ X_K, so each x_i alone contributes no information about P, i.e., no marginal effect. By the Block Independence Theorem (Whittaker, 1990), when the joint density of P and X is positive, P ⊥ X ⇔ both P ⊥ X_A | X_B and P ⊥ X_B | X_A are true, where A, B is any partition of the variables in x. Thus the unique information regarding P possessed by any subset of the x's is zero. For example, PAI(X, P) = 0 means that P ⊥ X_i | X \ X_i, i.e., P is independent of each X_i given the remaining predictors. This is analogous to every regression coefficient being zero in a multiple prediction model regressing P on the combined set of predictors.

3.1. Statistical distribution of PAI

For multinomial data the sample estimate of PAI is obtained by substituting maximum likelihood estimates of the cell probabilities into the definition. The implication of this for statistical inference about PAI is given by Proposition 7.4.2 of Whittaker (1990):

Proposition: let p̂(x) be the unrestricted maximum likelihood estimate of the probability of cell x, i.e. the cell count n(x) divided by the number of observations N. A model M is defined to be a set of restrictions on the probabilities p. Let p̂_M(x) be the maximum likelihood estimate under M. The deviance of M is

dev(M) = 2N I(p̂, p̂_M) = 2 Σ_x n(x) log[ n(x) / (N p̂_M(x)) ],

which is asymptotically chi-squared with degrees of freedom equal to the number of parameters set to zero.

The model PAI(X, P) = 0 implies p̂_M = p̂_X p̂_P, where p̂_X and p̂_P are the unrestricted maximum likelihood estimates in the marginal tables of X and P. Since I(f_{X,P}; f_X f_P) = Inf(XP) = PAI(X, P), the empirical value of 2N·PAI(X, P) has an asymptotic chi-squared distribution. The degrees of freedom is (r_P − 1)(Π r_{X_i} − 1), where r_A is the number of values the random variable A can take. The chi-squared distribution is only an approximation to the exact sampling distribution and is strictly valid only in large samples. Since the cell frequencies decrease as interaction order increases, larger sample sizes are required for testing interactions.

Since PAI(X, P)= Inf (XP), when the sample size is insufficient to permit the reliable use of asymptotic methods a permutation method can be used. Randomly permuting the phenotype assignments will produce observations from the null distribution with X and P independent, yielding an empirical null distribution of PAI(X, P).
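
A sketch of both tests on simulated data (the simulation and helper names are ours and purely illustrative): 2N times the plug-in PAI is referred to a chi-squared distribution, and the permutation null is built by shuffling the phenotype labels.

import numpy as np
from scipy.stats import chi2

def plugin_entropy(labels):
    """Plug-in entropy (nats) of a discrete sample given as a 1-d array of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def pai(x, pheno):
    """Plug-in PAI(X, P) = H(X) + H(P) - H(X, P) for a single predictor column x."""
    joint = np.array([f"{a}|{b}" for a, b in zip(x, pheno)])
    return plugin_entropy(x) + plugin_entropy(pheno) - plugin_entropy(joint)

rng = np.random.default_rng(1)
N = 500
x = rng.integers(0, 3, N)                    # a SNP coded 0/1/2 (simulated for illustration)
pheno = (x + rng.integers(0, 2, N)) % 2      # binary phenotype weakly related to x

obs = pai(x, pheno)
stat, df = 2 * N * obs, (2 - 1) * (3 - 1)    # df = (r_P - 1)(product of r_Xi - 1)
print("asymptotic p-value:", chi2.sf(stat, df))

null = [pai(x, rng.permutation(pheno)) for _ in range(999)]
print("permutation p-value:", (1 + sum(v >= obs for v in null)) / (1 + len(null)))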

4. An information metric for interaction: K-Way Interaction Index (KWII)

Sucheston et al. (2010) and Chanda et al. (2007, 2008) have explored in detail the KWII, which is interpreted as an interaction of K variables. The KWII provides a consistent definition of statistical interaction which is of the same form for any choice of joint distribution of the variables.

The KWII of a set of variables X is defined as

KWII(X) = −Σ_{Z⊆X} (−1)^{|X\Z|} H(Z)

where H(φ) = 0.

It is clear from the definition that KWII is symmetric in its variables. The KWII of a single variable is negative entropy. For two variables it is mutual information (i.e. a measure of association/correlation). For three or more variables we will see that it captures the notion of statistical interaction. Note that KWII(X) is an expectation over the set of random variables X, not a function of an observed value x. A characterization of KWII is given by the following recursion:
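
A direct, exponential-time sketch of this definition for multinomial data (our own plug-in estimator; data columns are passed as lists):

from itertools import combinations
from math import log
from collections import Counter

def entropy_of(columns):
    """Plug-in entropy (nats) of the joint distribution of the given list of data columns."""
    if not columns:
        return 0.0                      # H(empty set) = 0 by convention
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log(c / n) for c in counts.values())

def kwii(data, names):
    """KWII(X) = -sum over Z subset of X of (-1)^{|X\\Z|} H(Z), estimated from data."""
    k = len(names)
    total = 0.0
    for r in range(k + 1):
        for sub in combinations(names, r):
            total += (-1) ** (k - r) * entropy_of([data[v] for v in sub])
    return -total

# tiny illustration: X2 = X1, so the KWII of the pair equals the (shared) entropy
data = {"X1": [0, 0, 1, 1, 1, 0], "X2": [0, 0, 1, 1, 1, 0]}
print(kwii(data, ["X1"]))          # negative entropy of X1
print(kwii(data, ["X1", "X2"]))    # mutual information, here equal to H(X1) since X2 = X1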

Proposition: Define KWII_Y(X) = −Σ_{Z⊆X} (−1)^{|X\Z|} H_Y(Z). Then

KWII(X, Y) = KWII_Y(X) − KWII(X),    (5)

where Y is a single variable, i.e. |Y| = 1, and X is a set of random variables not including Y.

Proof:

KWII(X, Y) = −Σ_{Z⊆X∪Y} (−1)^{|X∪Y|−|Z|} H(Z), by definition
= −Σ_{Z⊆X} (−1)^{|X|+|Y|−|Z|} H(Z) − Σ_{Z⊆X} (−1)^{|X|+1−|Z∪Y|} H(Z, Y), decomposing the summation into sets excluding Y and sets including Y
= Σ_{Z⊆X} (−1)^{|X|−|Z|} H(Z) − Σ_{Z⊆X} (−1)^{|X|+1−|Z∪Y|} H(Z, Y), since |Y| = 1
= −KWII(X) − Σ_{Z⊆X} (−1)^{|X|+1−|Z∪Y|} ( H_Y(Z) + H(Y) ), by (1) and the definition of KWII
= −KWII(X) − Σ_{Z⊆X} (−1)^{|X|−|Z|} H_Y(Z) − H(Y) Σ_{Z⊆X} (−1)^{|X|−|Z|}, since |Y| = 1
= −KWII(X) + KWII_Y(X) − H(Y) Σ_{Z⊆X} (−1)^{|X|−|Z|}

Note that

Σ_{Z⊆X} (−1)^{|X|−|Z|} = Σ_{i=0}^{|X|} Σ_{Z⊆X, |Z|=i} 1^{|Z|} (−1)^{|X|−|Z|} = Σ_{i=0}^{|X|} ( |X| choose i ) 1^i (−1)^{|X|−i}.

Thus the summation in the last term above is equal to (1 − 1)^{|X|} = 0, since it is of the form of the binomial theorem, Σ_{i=0}^{n} (n choose i) a^i b^{n−i} = (a + b)^n.

Note that (5) holds no matter which variable we condition on, i.e., KWII(X) = KWIIXi(X \ Xi) − KWII(X \ Xi) for any i, where A \ B is the set difference, i.e. A with the elements shared with B removed.

The following lemma gives a method for computing KWIIY (X).

Lemma: Let KWII(X|Y = y) be the value of KWII(X) calculated for the subset of observations taking value y for the random variable Y. Then KWIIY (X) = EyKWII(X|Y = y)

Proof:

KWII_Y(X) = −Σ_{Z⊆X} (−1)^{|X\Z|} H_Y(Z) = −Σ_{Z⊆X} (−1)^{|X\Z|} E_y E_{z|y}[−log f(z|y)] = E_y[ −Σ_{Z⊆X} (−1)^{|X\Z|} E_{z|y}[−log f(z|y)] ] = E_y KWII(X | Y = y).

Thus if KWII(X) is computed within each subset of observations with a fixed y value, then KWIIY (X) is obtained as the expectation of those KWII over the values of y.

Example 3.1. Consider the two 2 x 2 tables of counts below, shown with their row and column totals:

Left table:            Right table:
  2  2 | 4               2  2 | 4
  1  2 | 3               2  1 | 3
  ------                 ------
  3  4 | 7               4  3 | 7

We consider the two tables to be conditional on a third binary variable. For example, the row and column variables Gr and Gc might represent binary indicator variables corresponding to the dominance coding of the genotypes at two different loci, and the tables might be stratified according to a binary phenotype P. For the left table (corresponding to P = 0), H(Gr, Gc) = −(2/7)log(2/7) − (2/7)log(2/7) − (2/7)log(2/7) − (1/7)log(1/7) = 1.35178; H(Gr) = −(4/7)log(4/7) − (3/7)log(3/7) = 0.682908; H(Gc) = −(3/7)log(3/7) − (4/7)log(4/7) = 0.682908; H(φ) = 0. KWII(Gr, Gc|P = 0) for that table is −H(Gr, Gc) + H(Gr) + H(Gc) = 0.0140322. For the right table (P = 1), H(Gr, Gc) = −(2/7)log(2/7) − (2/7)log(2/7) − (1/7)log(1/7) − (2/7)log(2/7); H(Gr) = −(4/7)log(4/7) − (3/7)log(3/7); H(Gc) = −(4/7)log(4/7) − (3/7)log(3/7); H(φ) = 0. Since these entropies are the same as those of the left table, KWII(Gr, Gc|P = 1) = KWII(Gr, Gc|P = 0) = 0.0140322.

The pooled table (representing the marginal distribution) is

  4  4 |  8
  3  3 |  6
  --------
  7  7 | 14

In the margin, H(Gr, Gc) = −(4/14)log(4/14) − (4/14)log(4/14) − (3/14)log(3/14) − (3/14)log(3/14) = 1.37606; H(Gr) = −(8/14)log(8/14) − (6/14)log(6/14) = 0.682908; H(Gc) = −(7/14)log(7/14) − (7/14)log(7/14) = 0.693147; H(φ) = 0. Thus KWII(Gr, Gc) for the marginal table is −H(Gr, Gc) + H(Gr) + H(Gc) ≈ 0; in fact the pooled table satisfies independence exactly, since 4/14 = (8/14)(7/14) and 3/14 = (6/14)(7/14). By the Lemma, KWII_P(Gr, Gc) = (1/2)KWII(Gr, Gc|P = 0) + (1/2)KWII(Gr, Gc|P = 1) = 0.0140322. By the Proposition, KWII(Gr, Gc, P) = KWII_P(Gr, Gc) − KWII(Gr, Gc) = 0.0140322 − 0 = 0.0140322 ≠ 0, indicating KWII interaction.

In contrast, the odds ratio is 2 for the left table and 1/2 for the right table. Interaction on the odds ratio scale is indicated by the differing odds ratios. This example highlights the fact that KWII interaction is not a difference in the information across strata, but the change in information as a consequence of stratification.
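
The arithmetic of this example can be checked directly (a short script of our own; entropies in nats):

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mi(table):
    """Mutual information (= two-way KWII) of a 2 x 2 table of counts."""
    p = np.asarray(table, dtype=float)
    p /= p.sum()
    return H(p.sum(axis=1)) + H(p.sum(axis=0)) - H(p)

left = np.array([[2, 2], [1, 2]])       # stratum P = 0
right = np.array([[2, 2], [2, 1]])      # stratum P = 1
pooled = left + right                   # marginal table

kwii_given_p = 0.5 * mi(left) + 0.5 * mi(right)   # KWII_P(Gr, Gc), strata of equal size
print(mi(left), mi(right))                        # 0.014032 in each stratum
print(mi(pooled))                                 # ~0: the pooled table is exactly independent
print(kwii_given_p - mi(pooled))                  # KWII(Gr, Gc, P) = 0.014032, nonzero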

4.1. Global properties of KWII

In general, factorizations of the density (equivalently, additive decompositions of the log density) determine the dependence relationships among the variables. A simple example is the independence of x and y, where the joint density factors into separate functions of x and y:

f(xy)=f(x)f(y)

Another example is no interaction of the x, y association with z:

f(xyz)=f(xy|z)f(z)=f(xy)f(z)

so the x, y association is constant over levels of z. Here the absence of the 3-way interaction is reflected by the fact that no factor is a function of more than two variables.

4.1.1. KWII and factorization of joint density: example

A dependence model is of the form

f(x) = Π_A g_A(x_A)    (6)

where A ranges over some subsets of indices, and g is a general function (i.e., not necessarily a density function). We assume that each g cannot be further factored. Such a factorization completely determines the dependencies among the variables.

We will show that a particular factorization of the density function leads to the KWII. Considering the case of three variables x, y, z demonstrates the general idea. Assuming strictly positive densities, we can write the identity:

f(x,y,z) = f(x) f(y) f(z) · [ f(x,y) / (f(x) f(y)) ] · [ f(x,z) / (f(x) f(z)) ] · [ f(y,z) / (f(y) f(z)) ] · [ f(x,y,z) f(x) f(y) f(z) / (f(x,y) f(x,z) f(y,z)) ]
= g_x(x) g_y(y) g_z(z) g_{x,y}(x,y) g_{x,z}(x,z) g_{y,z}(y,z) g_{x,y,z}(x,y,z)    (7)

Then:

log f(x,y,z) = log g_x(x) + log g_y(y) + log g_z(z) + log g_{x,y}(x,y) + log g_{x,z}(x,z) + log g_{y,z}(y,z) + log g_{x,y,z}(x,y,z)    (8)

From (7), (8), and the definition of entropy and KWII we see that:

E[log f(x,y,z)] = KWII(x) + KWII(y) + KWII(z) + KWII(x,y) + KWII(x,z) + KWII(y,z) + KWII(x,y,z).    (9)

For example, referring to (7):

E[log g_{x,y,z}(x,y,z)] = E[ log( f(x,y,z) f(x) f(y) f(z) / (f(x,y) f(x,z) f(y,z)) ) ] = −H(x,y,z) + H(x,y) + H(x,z) + H(y,z) − H(x) − H(y) − H(z) = KWII(x,y,z)

In this way, we can pair each factor in (7) with a corresponding KWII. KWII(X_A) corresponds to the expectation of log g_A(x_A), which will be zero when g_A(x_A) ≡ 1, i.e., there is no factor involving all the indices in A and thus no interaction of that order. Otherwise, all variables in A must be considered simultaneously to describe the probability distribution and no simplification is possible. More generally, KWII(X_A) = 0 when log g_A(X_A) has expectation zero; KWII thus measures interaction on the expected log-likelihood scale.

4.1.2. KWII and factorization of joint density: general result

The derivation of the general result corresponding to (9) relies on the Möbius transform defined in section 2.5. For positive densities define

Γ(z) = Σ_{w⊆z} (−1)^{|z\w|} log f(w)    (10)

so that:

E[Γ(z)] = Σ_{w⊆z} (−1)^{|z\w|} E[log f(w)] = −Σ_{w⊆z} (−1)^{|z\w|} H(w) = KWII(z)

We note that (10) is a Möbius transform, thus the inverse transform gives log f(x) = Σ_{z⊆x} Γ(z), so the desired factorization of the joint density is f(x) = Π_{z⊆x} e^{Γ(z)}. Finally,

H(X) = −E[log f(x)] = −Σ_{z⊆x} E[Γ(z)] = −Σ_{z⊆x} KWII(z).    (11)

The above equation can be viewed as substituting the factorization of e^{−H(X)} for the factorization of the density function.

4.2. Dependency structure

The sum of all non-singleton KWII accounts for the total correlation information, since by (3) and (11) TCI(X) = Σ_{Z⊆X, |Z|≥2} KWII(Z) and the KWII of a singleton is the negative entropy. Sets of null KWII reflect global dependency statements. For example, note that if g_{x,y,z}(x, y, z) ≡ 1 and g_{x,z}(x, z) ≡ 1 in (7), then the density has a factorization h(x, y)h(y, z), so that X and Z are conditionally independent given Y. This implies KWII(X, Y, Z) = KWII(X, Z) = 0. Likewise, if Y and Z are conditionally independent given X, then KWII(X, Y, Z) = KWII(Y, Z) = 0. When the three variables are independent, KWII(X, Y, Z) = KWII(X, Y) = KWII(Y, Z) = KWII(X, Z) = 0.

4.3. Relation to log-linear statistical models

A decomposition of the form (6) but different from (7) is the log-linear model for contingency tables, e.g.,

log f(x) = u_1(x_1) + u_2(x_2) + u_3(x_3) + u_{1,2}(x_1, x_2) + …

The interpretation of the log-linear model relies on the implied factorization of the density in the same way as (7). To gain insight into KWII we will compare it with log-linear modeling in three examples.

Example 4.1: Two binary variables

The joint and marginal probabilities are presented in the following table.

p_11   p_10   p_1·
p_01   p_00   p_0·
p_·1   p_·0

The log-linear model factors the density such that for (x1, x2) in {0,1}2:

log f(x_1, x_2) = μ_0 + μ_1 x_1 + μ_2 x_2 + μ_12 x_1 x_2    (12)

where μ_0 = log p_00, μ_1 = log(p_10/p_00), μ_2 = log(p_01/p_00) and μ_12 = log( p_11 p_00 / (p_01 p_10) ). μ_12 is the log cross-product ratio, where the cross-product ratio of a 2 x 2 table is defined as cpr = p_11 p_00 / (p_10 p_01). The parameter μ_12 is a measure of the association between x_1 and x_2, which is KWII interaction for the simple case of the 2 x 2 table.

With information metrics, to get a factorization like (7) we write:

log f(x_1, x_2) = log f(x_1) + log f(x_2) + log[ f(x_1,x_2) / (f(x_1) f(x_2)) ]
= log( p_1·^{x_1} p_0·^{1−x_1} ) + log( p_·1^{x_2} p_·0^{1−x_2} ) + log[ f(x_1,x_2) / (f(x_1) f(x_2)) ]
= x_1 log p_1· + (1−x_1) log p_0· + x_2 log p_·1 + (1−x_2) log p_·0 + log[ f(x_1,x_2) / (f(x_1) f(x_2)) ],

so that

E[log f(x_1, x_2)] = −H(x_1) − H(x_2) + Inf(x_1 x_2)

KWII interaction (i.e. association/correlation in this case) is measured by Inf (x1x2), which equals 0 iff x1 and x2 are independent. An alternate description of Inf (x1x2) = 0 which better generalizes to interaction among more than two variables is H(X1) = HX2(X1) by (2).

It is informative to inspect the last term above, Inf(x_1 x_2) = KWII(x_1, x_2) = E[ log( f(x_1,x_2) / (f(x_1) f(x_2)) ) ], in detail:

E[ log( f(x_1,x_2) / (f(x_1) f(x_2)) ) ] = p_11 log( p_11 / (p_1· p_·1) ) + p_10 log( p_10 / (p_1· p_·0) ) + p_00 log( p_00 / (p_0· p_·0) ) + p_01 log( p_01 / (p_0· p_·1) )

Each term corresponds to a cell of the 2 x 2 table going clockwise from the upper left, and is a measure of independence in that cell weighted by probability of that cell. The sum is thus a weighted average of independence metrics. This is reminiscent of the chi-squared test for independence in a 2 x 2 table, which combines within-cell measures of deviation from independence.

To compare the log-linear interaction parameter μ12 and KWII(x1, x2) let us assume that the log-linear model holds so that (12) implies that KWII(x1, x2) is the expectation of

log[ f(x_1,x_2) / (f(x_1) f(x_2)) ] = log f(x_1,x_2) − log f(x_1) − log f(x_2)
= μ_0 + μ_1 x_1 + μ_2 x_2 + μ_12 x_1 x_2 − x_1 log p_1· − (1−x_1) log p_0· − x_2 log p_·1 − (1−x_2) log p_·0
= μ_0 − log p_0· − log p_·0 + x_1( μ_1 − log p_1· + log p_0· ) + x_2( μ_2 − log p_·1 + log p_·0 ) + x_1 x_2 μ_12
= x_1 x_2 μ_12 + log( p_00 / (p_0· p_·0) ) + x_1 log( (p_10/p_00) / (p_1·/p_0·) ) + x_2 log( (p_01/p_00) / (p_·1/p_·0) )

We see that KWII includes the log-linear parameter μ_12 = log( p_11 p_00 / (p_01 p_10) ) as a component, plus three additional components which also seem to reflect the notion of interaction. For example, a nonzero log( p_00 / (p_0· p_·0) ) is a deviation from the statistical independence of x_1 and x_2, while (p_10/p_00) / (p_1·/p_0·) contrasts the odds of x_1 when x_2 = 0 to the odds of x_1 ignoring x_2, thereby measuring the heterogeneity of the x_1 effect with respect to x_2. A similar interpretation holds for (p_01/p_00) / (p_·1/p_·0).

When row and column sums are fixed, μ12 is unaffected by marginal probabilities and is known to capture all of the conditional information about the x1, x2 association, ie. it is the sufficient statistic with respect to the conditional distribution. The KWII is affected by the marginal probabilities and is not conditional on the margins. This is further discussed in section 4.7.

Example 4.2: Three binary variables

The log-linear model factors the density such that:

log f(x_1, x_2, x_3) = μ_0 + μ_1 x_1 + μ_2 x_2 + μ_3 x_3 + μ_12 x_1 x_2 + μ_13 x_1 x_3 + μ_23 x_2 x_3 + μ_123 x_1 x_2 x_3

The 2-way μ-terms are calculated from 2 x 2 marginal tables as in the previous example. The 3-way μ-term is log[ cpr(x_1, x_2 | x_3 = 1) / cpr(x_1, x_2 | x_3 = 0) ], where cpr is the cross-product ratio of the corresponding 2 x 2 table. It measures the effect of x_3 on the x_1, x_2 association.

With information metrics we write the expected log density as:

E[ log( f(x_1) f(x_2) f(x_3) · [ f(x_1,x_2) / (f(x_1) f(x_2)) ] · [ f(x_1,x_3) / (f(x_1) f(x_3)) ] · [ f(x_2,x_3) / (f(x_2) f(x_3)) ] · [ f(x_1,x_2,x_3) f(x_1) f(x_2) f(x_3) / (f(x_1,x_2) f(x_1,x_3) f(x_2,x_3)) ] ) ]

The 3-way interaction derives from the last factor:

KWII(x_1,x_2,x_3) = E[ log( f(x_1,x_2,x_3) f(x_1) f(x_2) f(x_3) / (f(x_1,x_2) f(x_1,x_3) f(x_2,x_3)) ) ] = E[ log( [ f(x_1,x_2,x_3) f(x_3) / (f(x_1,x_3) f(x_2,x_3)) ] · [ f(x_1) f(x_2) / f(x_1,x_2) ] ) ] = Inf(x_1 x_2 | x_3) − Inf(x_1 x_2)

Similarly to the log-linear model, the 3-way term represents the effect of x_3 on the x_1, x_2 association.

Example 4.3: Two trinomial genotype variables

Here x_1 and x_2 take values in {0, 1, 2} representing genotype. In this case the log-linear model uses four parameters to model the x_1, x_2 interaction, μ_12(1, 1), μ_12(1, 2), μ_12(2, 1), μ_12(2, 2). In contrast, only the single metric KWII(x_1, x_2) is required. This reduction of degrees of freedom is different from that resulting from recoding genotypes to model restricted genetic models such as dominant or recessive, as it makes no genetic assumptions and relies only on the distribution of genotypes. In general, for a kth-order interaction of trinomial genotype variables the log-linear model requires 2^k parameters, as opposed to a single KWII. So the information-based model is more parsimonious, especially for high order interactions. There is a loss of detail, but since the number of parameters poses a huge challenge when screening genome-wide data for interactions, the principled reduction of parameters inherent in the information approach is advantageous. For example, if we are analyzing haplotypes with r variants, a kth order interaction is expressed by (r − 1)^k interaction parameters using the log-linear parameterization, but only a single KWII.

4.4. Relation of KWII and PAI

By (4) and (11),

PAI(X, P) = H(X) + H(P) − H(X, P) = −Σ_{Z⊆X} KWII(Z) + H(P) + Σ_{Z⊆X∪{P}} KWII(Z) = Σ_{Z⊆X} KWII(Z, P) + H(P)    (13)

PAI is an imperfect but useful tool for screening potential interactions. In the AMBIENCE algorithm to be described later an interaction term KWII(Z) is considered whenever PAI(Z) is large. However a KWII can be either positive or negative, so non-zero terms in (13) could conceivably cancel each other to make PAI small when there are interactions present. However Han (1980) shows that KWII is nonnegative when the distribution is sufficiently close to complete independence so cancelation may be considered unlikely, supporting AMBIENCE’s use of PAI as a screening tool for interactions.

There is a converse to (13). Since by (13) PAI(X, P) − H(P) = Σ_{Z⊆X} KWII(Z, P), Möbius inversion gives

KWII(X, P) = Σ_{Z⊆X} (−1)^{|X\Z|} ( PAI(Z, P) − H(P) ) = Σ_{Z⊆X} (−1)^{|X\Z|} PAI(Z, P)

since Σ_{Z⊆X} (−1)^{|X\Z|} H(P) = H(P) Σ_{Z⊆X} (−1)^{|X\Z|} = 0.

4.5. Similarity to components of variance

We can write a decomposition of the entropy of the phenotype P. By (4) and (13)

H(P) = Σ_{φ≠Z⊆X} KWII(Z, P) + H_X(P)

Since the last term above is a residual, the expression is of a form similar to components of variance, with the distinction that these components can be positive or negative.

4.6. Statistical distribution of KWII

As was the case for PAI, for multinomial data the sample estimate of KWII is obtained by substituting maximum likelihood estimates of the cell probabilities into the definition. Han (1980) shows that if a condition called semi-independence holds, then 2N·KWII(X) converges in distribution to a chi-squared statistic with degrees of freedom Π_i (r_{X_i} − 1), where r_A is the number of values the random variable A can take. Further, if X ⊥ Y, 2N·KWII(X) and 2N·KWII(Y) asymptotically approach independent chi-squared distributions.

Han’s results are based on a Taylor series expansion around the distribution for which the variables are completely independent, and so are accurate for cases close to that condition. Since we wish to employ our methods when strong correlations exist the practical usefulness of the asymptotic results needs to be established. Also, the asymptotic relation of semi-independence conditions to the factorization of the density needs to be investigated. Further study is required to justify the use of these results in applications.

4.7. Related information approaches

Clayton (2009) gives a stimulating critique of KWII, which he terms "synergy". He advocates an alternative information metric for interaction based on the uncertainty remaining if all lower order relationships were known. Good (1963) derived the associated metric to be the maximum entropy over all possible distributions constrained by fixing the lower order margins, and showed that this corresponds to interaction in a logistic regression model. Our discussion of example 4.1 supports Clayton's observation that KWII is influenced by the marginal probabilities. In the simple two-variable case the argument for the specification of fixed margins in the analysis of 2 x 2 tables is not conclusive. In that case the odds ratio with fixed margins (a logistic model parameter) and unconditional inference (e.g. via a chi-squared test) are both acceptable. This suggests the comparison of KWII and Good's maximum entropy may be analogous to the contrast between conditional and unconditional approaches to statistical inference for contingency tables. Agresti (2002) discusses the relative merits of those two approaches. To summarize, there is information about association in the table margins (Agresti, 2002; Zhu and Reid, 1994), but it is a small amount. Future work will consider this issue for genetic applications and general dimensionality. Conditional inference methods for multinomial data tend to be computationally complex, and this may be an important consideration.

Clayton also shows that the value of KWII is not invariant under case-control sampling. The seriousness of this concern will be addressed in future work. The important question is whether inferences are rendered invalid. KWII is not really an interpretable parameter in the way that odds ratio contrasts are; it is used for screening rather than for interpreting the strength of effects. So this criticism, while valid for the magnitude of the metric, may not discourage the use of KWII for case-control studies. This is an important question which needs further study.

Reconstructability Analysis (RA) (Zwick, 2004; Shervais et al., 2010) is an information-based approach related to maximum entropy. This approach reduces the dimension from the saturated model by projecting the joint distribution onto a set of margins. The margins specify the reduced model. The marginal distributions are then composed to form a more parsimonious estimate of the joint distribution, whose fit to the data is tested using Good's maximum entropy method to evaluate the model. Thus the comments of the previous paragraph also apply to RA.

Maximum entropy and RA are closely related to log-linear and logistic modeling. Importantly, they require iterative computational methods, unlike KWII. Thus KWII may be the only feasible choice for high-dimensional model search.

5. Model search

The non-iterative computation required by the information metrics of this paper makes them candidates for the large-scale screening of genetic interactions, when many other methods (e.g. logistic regression, reconstructability analysis) are infeasible. However, large problems are still daunting. Using the definition to compute KWII for all possible interactions (i.e. all possible combinations of variables) is of complexity O(3^N), where N is the number of variables. Thus efficient algorithms and heuristics must be used. The AMBIENCE algorithm is a heuristic search strategy for interaction detection.

5.1. The AMBIENCE algorithm

The AMBIENCE algorithm (Chanda et al., 2008) screens to find sets of variables with significant PAI of order ≤ k, where the significance is determined by the chi-squared test or the permutation method given in section 3.1. It does this greedily, building candidate ith-order PAI by adding a new variable to a significant (i − 1)th-order PAI. Finally, for each PAI selected the KWII with the same arguments is tested. Note that a significant PAI could be due to any of the KWII given by (13), not just the highest order one. AMBIENCE accounts for this by its greedy choice of PAI, which nests the interaction orders. An ith-order PAI is not tested unless it contains an (i − 1)th-order PAI, which contains a significant (i − 2)th-order PAI, and so on. For example, this means that a two-way interaction between SNPs is not tested unless there is significant marginal association between at least one of the SNPs and the disease phenotype. Testing the associated nested KWII can be viewed as a heuristic approximation to testing all the KWII implied by (13) for a significant PAI.
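
A rough structural sketch of the greedy step follows (not the published AMBIENCE implementation; the significance test is abstracted as a user-supplied function standing in for the chi-squared or permutation test of PAI from section 3.1):

def ambience_like_search(variables, is_significant, max_order):
    """Greedy screen: keep subsets whose PAI with the phenotype is significant, and
    extend each kept (i-1)th-order subset by one variable to form ith-order candidates."""
    kept = [frozenset([v]) for v in variables if is_significant(frozenset([v]))]
    selected = list(kept)
    for order in range(2, max_order + 1):
        candidates = {s | {v} for s in kept for v in variables if v not in s}
        kept = [s for s in candidates if is_significant(s)]
        selected.extend(kept)
    return selected

# toy illustration: pretend only subsets of {"snp1", "snp2"} are associated with the phenotype
toy_truth = lambda s: s <= {"snp1", "snp2"}
result = ambience_like_search(["snp1", "snp2", "snp3"], toy_truth, max_order=2)
print([sorted(s) for s in result])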

5.2. The Möby Quick algorithm

Thoma (1991) gave an efficient algorithm, the Fast Möbius Transform (FMT), for computing KWII(X) for all subsets X. As presented, the FMT calculates all KWII up to and including order N in O(N·2^N) time, only a factor of N greater than the Nth order interaction alone. As it stands, the FMT is not of use to us, since it is infeasible to consider very high interaction orders and we limit the order to k ≪ N. Happily, the basic FMT is easily modified to efficiently calculate KWII only up to order k. The modified algorithm, of complexity O(kN^{k+1}), is given in the Appendix.

We will usually wish to restrict the interactions so that every one includes the same fixed set of c variables. For example, in our application every interaction of interest involves the phenotype P, so c = 1. We may also wish to fix other variables, e.g., gender, so that every interaction involves both phenotype and gender, so c = 2. The new algorithm enables this to be done, and the complexity is reduced to O(kcN^k + k(N − c)^{k+1}). Although the complexity is polynomial in N of order k + 1, N can be huge. It will be necessary to reduce N, possibly by selecting those variables found to play a role in significant PAI.

In this paper we propose an algorithm, Möby Quick, which is complementary to AMBIENCE, based on the modification of the FMT given in the Appendix. After significant PAI are identified by AMBIENCE it is usually the case that a small subset Θ of variables are seen to be participating in the interactions. Rather than computing KWII individually as in AMBIENCE, Möby Quick computes all possible interactions up to order k involving variables in Θ in O(k|Θ|^{k+1}) time. The algorithm casts a wider net than AMBIENCE, and if |Θ| is small it is computationally feasible.

6. Discussion

In this paper we presented a likelihood factorization which identifies quantities which capture the notion of statistical interaction. The information theoretic interaction metrics arise from the expectation of the log-likelihood and sample estimates of all the metrics we have considered arise from maximum likelihood estimates of the entropies. An advantage of the information approach is that the meaning of the metrics and their connection to the statistical distribution of the data is clear, as opposed to many machine learning algorithms for interaction detection.

For categorical variables with more than two levels the expected log-likelihood reduces the number of parameters compared to log-linear models and logistic regression. This gives a much more parsimonious description of the interactions among the variables. Some detail is lost, but parsimony may be a more important consideration when searching for interactions in high dimensional data. In addition, the computations required are non-iterative, which greatly extends their feasibility compared with methods requiring numerical optimization such as logistic regression and reconstructability analysis.

The calculus of the information metrics facilitates a variety of re-expressions of the metrics, which suggest different approaches to model search and computation. The potential algorithms suggested in this paper will be explored in future work.

The goal of this paper was to connect information-theoretic metrics to probability models so that they can be better understood in the context of data analysis. Along the way we also uncovered some differences. For genetic markers with many levels there are many fewer parameters produced in the information approach. This may be critical in the case of haplotype analysis, where for haplotypes with r variants, a kth order interaction is expressed by (r − 1)^k interaction parameters using the log-linear parameterization, but only a single KWII. The way the metrics are defined also makes transformations among them easy using powerful algorithmic tools.

We have also highlighted areas in need of further development. The distribution of KWII needs further study. Clayton (2009) has shown that the influence of the sampling plan and of the marginal probabilities will be key questions in future comparisons of interaction detection methods. Also, we will explore improved modeling strategies using these metrics. In most of the algorithms, quantities are not refined based on previous modeling steps. That is, a KWII at the end of AMBIENCE is the same number as at the beginning; the AMBIENCE algorithm just tells you that you should look at it. Our future work will explore approaches where previous decisions do simplify the computation of KWII, essentially using fewer parameters.

APPENDIX: Modified FMT

The following algorithm calculates KWII for all variable subsets of order ≤ k, each including all of the variables in a fixed set C. Let Ω be all size ≤ k subsets of the complete set of N variables. For a fixed set C of the variables, let Ω_C represent all the variable sets in Ω which contain the variables in C. Initially let

f_0(S) = H(S)

for all S ∈ Ω. Iterate for j ranging over the variables in C, for S ∈ Ω, as follows:

f_j(S) = f_{j-1}(S)                          if j ∉ S
f_j(S) = −f_{j-1}(S \ {j}) + f_{j-1}(S)      if j ∈ S        (14)

Then iterate (14) for j ranging over the variables not in C, for S ∈ Ω_C. Finally, after step N when all variables have been iterated over, set KWII(S) = −f_N(S), S ∈ Ω_C.

Proof: If k = N and C = φ, the algorithm is the original FMT. If k < N and C = φ, an iteration of (14) for a set S depends only on subsets of S and thus is unaffected. If C ≠ φ, the initial iteration over C is the same as the basic FMT. For the second stage, iterating over Ω_C, a step of (14) for a set S containing C depends only on lower-order subsets containing C, which have already been computed.
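
A bitmask sketch of this modified transform (our own rendering; the sign placement follows the KWII definition of section 4, and the brute-force check is only for illustration):

from itertools import combinations
import random

def kwii_up_to_order_k(H, n, k, C=0):
    """Modified FMT sketch: H maps subset bitmasks (over n variables) to entropies; returns
    KWII(S) = -sum_{Z subset of S} (-1)^{|S\\Z|} H(Z) for every subset S with |S| <= k
    that contains the fixed bitmask C (C = 0 places no restriction)."""
    popcount = lambda S: bin(S).count("1")
    omega = [S for S in range(1 << n) if popcount(S) <= k]
    f = {S: H[S] for S in omega}
    in_C = [j for j in range(n) if (C >> j) & 1]
    rest = [j for j in range(n) if not ((C >> j) & 1)]
    for j in in_C:                              # first stage: variables in C, all S in omega
        for S in omega:
            if (S >> j) & 1:
                f[S] = f[S] - f[S ^ (1 << j)]
    omega_C = [S for S in omega if S & C == C]
    for j in rest:                              # second stage: only sets containing C are needed
        for S in omega_C:
            if (S >> j) & 1:
                f[S] = f[S] - f[S ^ (1 << j)]
    return {S: -f[S] for S in omega_C}

def kwii_direct(H, S, n):
    """Direct evaluation of the KWII definition, for checking the transform."""
    bits = [j for j in range(n) if (S >> j) & 1]
    total = 0.0
    for r in range(len(bits) + 1):
        for sub in combinations(bits, r):
            Z = sum(1 << j for j in sub)
            total += (-1) ** (len(bits) - r) * H[Z]
    return -total

random.seed(0)
n, k = 4, 3
H = {S: random.random() for S in range(1 << n)}
H[0] = 0.0                                      # H(empty set) = 0 by convention
out = kwii_up_to_order_k(H, n, k, C=1)          # every returned subset contains variable 0
assert all(abs(out[S] - kwii_direct(H, S, n)) < 1e-9 for S in out)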

Footnotes

*

Thanks to Jerry Lawless for helpful discussions and key insights. This work was partly supported by NSF, NIH, NSERC and MITACS. Support from the National Multiple Sclerosis Society (RG3743 and a Pediatric MS Center of Excellence Center Grant) and the Department of Defense Multiple Sclerosis Program (MS090122) for the Ramanathan laboratory is gratefully acknowledged. Our work has benefited greatly from the informative input of reviewers.

References

  1. Agresti A. Categorical Data Analysis. Wiley; New York: 2002.
  2. Anderson D. Model Based Inference in the Life Sciences. Springer; New York: 2008.
  3. Bush W, Edwards T, Dudek S, McKinney B, Ritchie M. Alternative contingency measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinformatics. 2008;9:238–255. doi: 10.1186/1471-2105-9-238.
  4. Chanda P, Zhang A, Brazeau D, Sucheston L, Freudenheim JL, Ambrosone C, Ramanathan M. Information-theoretic metrics for visualizing gene-environment interactions. Am J Hum Genet. 2007;81:939–963. doi: 10.1086/521878.
  5. Chanda P, Sucheston L, Zhang A, Brazeau D, Freudenheim JL, Ambrosone C, Ramanathan M. AMBIENCE: A novel approach and efficient algorithm for identifying informative genetic and environmental associations with complex phenotypes. Genetics. 2008;180:1191–1210. doi: 10.1534/genetics.108.088542.
  6. Clayton D. Prediction and interaction in complex disease genetics: Experience in type 1 diabetes. PLoS Genetics. 2009;5:e1000540. doi: 10.1371/journal.pgen.1000540.
  7. Cordell H. Detecting gene-gene interactions that underlie human diseases. Nature Reviews Genetics. 2009;10:392–404. doi: 10.1038/nrg2579.
  8. Cover T, Thomas J. Elements of Information Theory. Wiley; New York: 2006.
  9. Dong C, Chu X, Wang Y, Jin L, Shi T, Huang W, Li Y. Exploration of gene-gene interaction effects using entropy-based methods. Eur J Hum Genet. 2008;16:229–235. doi: 10.1038/sj.ejhg.5201921.
  10. Good I. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics. 1963;34:911–934. doi: 10.1214/aoms/1177704014.
  11. Han TS. Multiple mutual information and multiple interactions in frequency data. Information and Control. 1980;46:26–45. doi: 10.1016/S0019-9958(80)90478-7.
  12. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241:252–261. doi: 10.1016/j.jtbi.2005.11.036.
  13. Kang G, et al. An entropy-based approach for testing genetic epistasis underlying complex diseases. Journal of Theoretical Biology. 2008;250:362–374. doi: 10.1016/j.jtbi.2007.10.001.
  14. Shervais S, Kramer P, Westaway S, Cox N, Zwick M. Reconstructability analysis as a tool for identifying gene-gene interactions in studies of human diseases. Statistical Applications in Genetics and Molecular Biology. 2010;9(1):Article 18. doi: 10.2202/1544-6115.1516.
  15. Sucheston L, Chanda P, Zhang A, Tritchler D, Ramanathan M. Comparison of information-theoretic to statistical methods for gene-gene interactions in the presence of genetic heterogeneity. BMC Genomics. 2010;11:487. doi: 10.1186/1471-2164-11-487.
  16. Thoma HM. Belief function computations. In: Goodman IR, Gupta MM, Nguyen HT, Rogers GS, editors. Conditional Logic in Expert Systems. Amsterdam: North-Holland; 1991. pp. 269–308.
  17. Whittaker J. Graphical Models in Applied Multivariate Analysis. Wiley; New York: 1990.
  18. Zhu Y, Reid N. Information, ancillarity, and sufficiency in the presence of nuisance parameters. Canadian Journal of Statistics. 1994;22:111–123. doi: 10.2307/3315827.
  19. Zwick M. An overview of reconstructability analysis. Kybernetes. 2004;33:877–905. doi: 10.1108/03684920410533958.
