Entropy. 2020 Aug 31;22(9):974. doi: 10.3390/e22090974

Analysis of Information-Based Nonparametric Variable Selection Criteria

Małgorzata Łazęcka 1,2, Jan Mielniczuk 1,2,*
PMCID: PMC7597280  PMID: 33286743

Abstract

We consider a nonparametric Generative Tree Model and discuss the problem of selecting active predictors for the response in such a scenario. We investigate two popular information-based selection criteria: Conditional Infomax Feature Extraction (CIFE) and Joint Mutual Information (JMI), which are both derived as approximations of the Conditional Mutual Information (CMI) criterion. We show that both CIFE and JMI may exhibit behavior different from that of CMI, resulting in different orders in which predictors are chosen in the variable selection process. Explicit formulae for CMI and its two approximations in the generative tree model are obtained. As a byproduct, we establish expressions for the entropy of a multivariate gaussian mixture and its mutual information with the mixing distribution.

Keywords: conditional mutual information, CMI, information measures, nonparametric variable selection criteria, gaussian mixture, conditional infomax feature extraction, CIFE, joint mutual information criterion, JMI, generative tree model, Markov blanket

1. Introduction

In the paper, we consider theoretical properties of Conditional Mutual Information (CMI) and its approximations in a certain dependence model called the Generative Tree Model (GTM). CMI and its modifications are used in many problems of machine learning, including feature selection, variable importance ranking, causal discovery, and structure learning of dependence networks (see, e.g., References [1,2]). They are the cornerstone of nonparametric methods to solve such problems, meaning that no parametric assumptions on the dependence structure are imposed. However, formal properties of these criteria remain largely unknown. This is mainly due to two problems: firstly, theoretical values of CMI and related quantities are hard to calculate explicitly, especially when the conditioning set has a large dimension; moreover, there are only a few established facts about the behavior of their sample counterparts. Such a situation, however, has important consequences. In particular, the relevant question whether certain information-based criteria, such as Conditional Infomax Feature Extraction (CIFE) and Joint Mutual Information (JMI), obtained as approximations of CMI, e.g., by truncation of its Möbius expansion, are approximations in an analytic sense (i.e., whether the difference of the two quantities is negligible) remains unanswered. In the paper, we try to fill this gap. The considered GTM is a model in which the marginal distributions of predictors are mixtures of gaussians. Exact values of CMI, as well as those of CIFE and JMI, are calculated for this model, which makes it feasible to study their behavior when the parameters of the model and the number of predictors change. In particular, it is shown that CIFE and JMI exhibit different behavior than CMI and that they may also significantly differ between themselves. In particular, we show that, depending on the values of the model parameters, each of the considered criteria, JMI and CIFE, can incorporate inactive variables before active ones into the set of chosen predictors. This, of course, does not mean that important performance criteria, such as the False Detection Rate (FDR), cannot be controlled for CIFE and JMI, but it should serve as a cautionary note that their similarity to CMI, despite their derivation, is not necessarily ensured. As a byproduct, we establish expressions for the entropy of a multivariate gaussian mixture and its mutual information with the mixing distribution, which are of independent interest.

We stress that our approach is intrinsically nonparametric and focuses on using nonparametric measures of conditional dependence for feature selection. By studying their theoretical behavior for this task we also learn an average behavior of their empirical counterparts for large sample sizes.

The Generative Tree Model appears, e.g., in Reference [3]; a non-parametric tree-structured model is also considered, e.g., in References [4,5]. Together with the autoregressive model, it is one of the two most common types of generative models. Besides its easily explainable dependence structure, the distributions of predictors in the considered model are gaussian mixtures, and this facilitates the calculation of the explicit form of information-based selection criteria.

The paper is structured as follows. Section 2 contains information-theoretic preliminaries, some necessary facts on information-based feature selection, and the derivation of the CIFE and JMI criteria as approximations of CMI. Section 3 contains the derivation of the entropy and mutual information for gaussian mixtures. In Section 4, the behavior of CMI, CIFE, and JMI is studied in the GTM. Section 5 concludes.

2. Preliminaries

We denote by p(x), x ∈ R^d, a probability density function corresponding to a continuous variable X on R^d. The joint density of X and a variable Y will be denoted by p(x,y). In the following, Y will denote a discrete random response to be predicted using the multivariate vector X.

Below, we discuss some information-theoretic preliminaries, which lead, at the end of Section 2.1, to the Möbius decomposition of mutual information. This decomposition is used in Section 2.3 to construct the CIFE approximation of CMI. In addition, properties of mutual information discussed in Section 2.1 are used in Section 2.3 to justify the JMI criterion.

2.1. Information-Theoretic Measures of Dependence

The (differential) entropy of a continuous random variable X is defined as

H(X) = -\int_{\mathbb{R}^d} p(x) \log p(x)\, dx (1)

and quantifies the uncertainty of observing random values of X. Note that the definition above is valid regardless of the dimensionality d of the range of X. For discrete X, we replace the integral in (1) by a sum and the density p(x) by the probability mass function. In the following, we will frequently consider subvectors of X = (X_1,…,X_p), which is the vector of all potential predictors of the discrete response Y. The conditional entropy of X given discrete Y is written as

H(X|Y) = \sum_{y \in \mathcal{Y}} p(y) H(X|Y=y). (2)

When Z is continuous, the conditional entropy H(X|Z) is defined as E_Z H(X|Z=z), i.e.,

H(X|Z) = -\int p(z) \int \frac{p(x,z)}{p(z)} \log \frac{p(x,z)}{p(z)}\, dx\, dz = -\int\!\!\int p(x,z) \log \frac{p(x,z)}{p(z)}\, dx\, dz, (3)

where p(x,z) and p(z) denote the joint density of (X,Z) and the density of Z, respectively. The mutual information (MI) between X and Y is

I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). (4)

This can be interpreted as the amount of uncertainty in X (respectively, Y) which is removed when Y (respectively, X) is known, which is consistent with the intuitive meaning of mutual information as the amount of information that one variable provides about another. It determines how similar the joint distribution is to the product of the marginal distributions when the Kullback-Leibler divergence is used as a similarity measure (cf. Reference [6], Equation (8.49)). Thus, I(X,Y) may be viewed as a nonparametric measure of dependence. Note that, as I(X,Y) is symmetric, it only shows the strength of the dependence but not its direction. In contrast to the correlation coefficient, MI is able to discover non-linear relationships, as it equals zero if and only if X and Y are independent. It is easily seen that I(X,Y) = H(X) + H(Y) - H(X,Y). A natural extension of MI is the conditional mutual information (CMI), defined as

I(X,Y|Z) = H(X|Z) - H(X|Y,Z) = \int p(z) \int\!\!\int p(x,y|z) \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)}\, dx\, dy\, dz, (5)

which measures the conditional dependence between X and Y given Z. When Z is a discrete random variable, the first integral is replaced by a sum. Note that the conditional mutual information is the mutual information of X and Y given Z=z averaged over the values z of Z, and it equals zero if and only if X and Y are conditionally independent given Z. An important property of MI is the chain rule, which connects I((X_1,X_2),Y) with I(X_1,Y):

I((X_1,X_2),Y) = I(X_1,Y) + I(X_2,Y|X_1). (6)

For more properties of the basic measures described above, we refer to References [6,7]. We now define interaction information (II) [8], which is a useful tool for decomposing the mutual information between a multivariate random variable X_S and Y (see Formula (13) below). The 3-way interaction information is defined as

II(X_1,X_2,Y) = I((X_1,X_2),Y) - I(X_1,Y) - I(X_2,Y). (7)

This is frequently interpreted as the part of I((X_1,X_2),Y) which remains after subtraction of the individual informations between Y and X_1 and between Y and X_2. The definition indicates in particular that II(X_1,X_2,Y) is symmetric. Note that it follows from (6) that

II(X_1,X_2,Y) = I(X_1,Y|X_2) - I(X_1,Y) = I(X_2,Y|X_1) - I(X_2,Y), (8)

which is consistent with the intuitive meaning of an interaction as a situation in which the effect of one variable on the class variable Y depends on the value of another variable. By expanding all mutual informations on the RHS of (7), we obtain

II(X_1,X_2,Y) = -H(X_1) - H(X_2) - H(Y) + H(X_1,Y) + H(X_2,Y) + H(X_1,X_2) - H(X_1,X_2,Y). (9)

The 3-way II can be extended to the general case of p variables. The p-way interaction information [9,10] is

II(X_1,\ldots,X_p) = -\sum_{T \subseteq \{1,\ldots,p\}} (-1)^{p-|T|} H(X_T). (10)

For p=2, (10) reduces to mutual information, whereas, for p=3, it reduces to (9).
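To make the sign and meaning of interaction information concrete, the following small numerical check (ours, not part of the paper) computes II(X_1,X_2,Y) from Formula (9) for the classical XOR example, in which X_1 and X_2 are individually independent of Y but jointly determine it; the resulting II equals log 2 > 0, a purely synergistic interaction.

```python
import itertools
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability table given as an array."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Joint pmf of (X1, X2, Y) with X1, X2 iid Bern(1/2) and Y = X1 XOR X2.
joint = np.zeros((2, 2, 2))
for x1, x2 in itertools.product([0, 1], repeat=2):
    joint[x1, x2, x1 ^ x2] = 0.25

def H(axes):
    """Entropy of the marginal over the given axes (0 = X1, 1 = X2, 2 = Y)."""
    drop = tuple(i for i in range(3) if i not in axes)
    return entropy(joint.sum(axis=drop).ravel())

# 3-way interaction information, Formula (9)
ii = (-H((0,)) - H((1,)) - H((2,))
      + H((0, 2)) + H((1, 2)) + H((0, 1)) - H((0, 1, 2)))
print(ii, np.log(2))  # both equal log 2: X1 and X2 are individually useless for Y but jointly determine it
```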

We now present two useful properties of the introduced measures. We start with the 3-way interaction information and note that it inherits the chain-rule property from MI, namely

II(X_1,(X_2,X_3),Y) = II(X_1,X_3,Y) + II(X_1,X_2,Y|X_3), (11)

where II(X_1,X_2,Y|X_3) is defined analogously to (7) by replacing the mutual informations on the RHS by conditional mutual informations given X_3. This is easily proved by writing, in view of (6):

II(X_1,(X_2,X_3),Y) = I(X_1,(X_2,X_3)|Y) - I(X_1,(X_2,X_3)) = I(X_1,X_3|Y) + I(X_1,X_2|Y,X_3) - I(X_1,X_3) - I(X_1,X_2|X_3) (12)

and using (8) in the above equalities. Namely, joining the first and the third terms together (and the second and the fourth as well), we obtain that the RHS equals II(X_1,X_3,Y) + II(X_1,X_2,Y|X_3).

We also state the Möbius representation of mutual information, which plays an important role in the following development. For S ⊆ {1,2,…,p}, let X_S be the random vector whose coordinates have indices in S. The Möbius representation [10,11,12] states that I(X_S,Y) can be recovered from interaction informations:

I(X_S,Y) = \sum_{k=1}^{|S|} \sum_{\{t_1,\ldots,t_k\} \subseteq S} II(X_{t_1},\ldots,X_{t_k},Y), (13)

where |S| denotes the number of elements of the set S.

2.2. Information-Based Feature Selection

We consider a discrete class variable Y and p features X_1,…,X_p. We do not impose any assumptions on the dependence between Y and X_1,…,X_p, i.e., we view its distributional structure in a nonparametric way. Let X_S denote a subset of features indexed by a set S ⊆ {1,…,p}. As I(X_S,Y) does not decrease when S is replaced by its superset S' ⊇ S, the problem of finding arg max_S I(X_S,Y) has the trivial solution S_full = {1,2,…,p}. Thus, one usually tries to optimize the mutual information between X_S and Y under some constraints on the size |S| of S. The most intuitive approach is an analogue of k-best subset selection in regression, which tries to identify a feature subset of a fixed size 1 ≤ k ≤ p that maximizes the joint mutual information with the class variable Y. However, this is infeasible for large k because the search space grows exponentially with the number of features. As a result, various greedy algorithms have been developed, including forward selection, backward elimination, and genetic algorithms. They are based on the observation that

\arg\max_{j \in S^c} \left[ I(X_{S \cup \{j\}},Y) - I(X_S,Y) \right] = \arg\max_{j \in S^c} I(X_j,Y|X_S), (14)

where S^c = {1,…,p} \ S is the complement of S. The equality in (14) follows from (6). In each step, the most promising candidate is added. In the case of ties in (14), the variable satisfying it with the smallest index is chosen.
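The greedy forward scheme based on (14) can be written down generically; the sketch below (ours, not from the paper) takes an arbitrary score function — an estimate of I(X_j,Y|X_S), or one of the CIFE/JMI surrogates of Section 2.3 — and applies the tie-breaking convention described above. The names score, n_features, and k are illustrative, not taken from any library.

```python
from typing import Callable, List, Set

def greedy_forward_selection(
    score: Callable[[int, List[int]], float],  # score(j, S): criterion value of candidate j given selected S
    n_features: int,
    k: int,
) -> List[int]:
    """Greedily select k feature indices; ties are broken in favour of the smallest index."""
    selected: List[int] = []
    remaining: Set[int] = set(range(n_features))
    for _ in range(k):
        # iterating candidates in increasing order makes max() return the smallest index among ties
        best = max(sorted(remaining), key=lambda j: score(j, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```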

2.3. Approximations of CMI: CIFE and JMI Criteria

Observe that it follows from (13) that

I(X_{S \cup \{j\}},Y) - I(X_S,Y) = I(X_j,Y|X_S) = \sum_{k=0}^{|S|} \sum_{\{t_1,\ldots,t_k\} \subseteq S} II(X_{t_1},\ldots,X_{t_k},X_j,Y). (15)

Direct application of the above formula to find the maximizer in (14) is infeasible, as estimation of a specific interaction information of order k requires O(C^k) observations. The above formula allows us, however, to obtain various natural approximations of CMI. The first order approximation does not take interactions between features into account, which is why the second order approximation, obtained by taking the first two terms in (15), is usually considered. The corresponding score for a candidate feature X_j is

CIFE(X_j,Y|X_S) = I(X_j,Y) + \sum_{i \in S} II(X_i,X_j,Y) = I(X_j,Y) + \sum_{i \in S} \left[ I(X_i,X_j|Y) - I(X_i,X_j) \right]. (16)

The acronym CIFE stands for Conditional Infomax Feature Extraction, and the measure has been introduced in Reference [13]. Observe that, if interactions of order 3 and higher between predictors and Y are 0, i.e., II(X_{t_1},…,X_{t_k},X_j,Y) = 0 for k ≥ 2, then CIFE coincides with CMI. In Reference [2], it is shown that CMI also coincides with CIFE if certain dependence assumptions on the vector (X,Y) are satisfied. In view of the discussion above, CIFE can be viewed as a natural approximation to CMI.

Observe that, in (16), we take into account not only the relevance of the candidate feature but also the possible interactions between the already selected features and the candidate feature. Empirical evaluation indicates that (16) is among the most successful MI-based methods; see Reference [2] for an extensive comparison of several MI-based feature selection approaches. We mention in this context Reference [14], in which stopping rules for CIFE-based methods are considered.

Some additional assumptions lead to other score functions. We now show the reasoning leading to the Joint Mutual Information criterion JMI (cf. Reference [12], on which the derivation below is based). Namely, if we write S = {j_1,…,j_{|S|}}, we have, for i ∈ S,

I(X_j,X_S) = I(X_j,X_i) + I(X_j,X_{S \setminus \{i\}}|X_i).

Summing these equalities over all i ∈ S and dividing by |S|, we obtain

I(X_j,X_S) = \frac{1}{|S|} \sum_{i \in S} I(X_j,X_i) + \frac{1}{|S|} \sum_{i \in S} I(X_j,X_{S \setminus \{i\}}|X_i)

and analogously

I(X_j,X_S|Y) = \frac{1}{|S|} \sum_{i \in S} I(X_j,X_i|Y) + \frac{1}{|S|} \sum_{i \in S} I(X_j,X_{S \setminus \{i\}}|X_i,Y).

Subtracting the last two equations and using (8), we obtain

I(X_j,Y|X_S) = I(X_j,Y) + \frac{1}{|S|} \sum_{i \in S} II(X_j,X_i,Y) + \frac{1}{|S|} \sum_{i \in S} II(X_j,X_{S \setminus \{i\}},Y|X_i). (17)

Moreover, it follows from (8) that, when X_j is independent of X_{S\{i}} given X_i and these variables are also conditionally independent given (X_i,Y), the last sum is 0 and we obtain the equality

JMI(X_j,Y|X_S) = I(X_j,Y) + \frac{1}{|S|} \sum_{i \in S} II(X_j,X_i,Y) = I(X_j,Y) + \frac{1}{|S|} \sum_{i \in S} \left[ I(X_j,X_i|Y) - I(X_j,X_i) \right]. (18)

This is the Joint Mutual Information criterion (JMI) introduced in Reference [15]. Note that (18) together with (8) imply another useful representation:

JMI(X_j,Y|X_S) = I(X_j,Y) + \frac{1}{|S|} \sum_{i \in S} \left[ I(X_j,Y|X_i) - I(X_j,Y) \right] = \frac{1}{|S|} \sum_{i \in S} I(X_j,Y|X_i). (19)

JMI can be viewed as an approximation of CMI when the independence assumptions on which the above derivation was based are satisfied only approximately. Observe that JMI(X_j,Y|X_S) differs from CIFE(X_j,Y|X_S) in that the influence of the sum of interaction informations II(X_j,X_i,Y) is down-weighted by the factor |S|^{-1} instead of 1. This is sometimes interpreted as coping with the 'redundancy over-scaled' problem (cf. Reference [2]). When the terms I(X_j,X_i|Y) are omitted from the sum above, the minimal redundancy maximal relevance (mRMR) criterion is obtained [16]. We note that approximations of CMI, such as CIFE or JMI, can be used in place of CMI in (14). As the derivation in both cases is quite intuitive, it is natural to ask how the approximations compare when used for selection. This is the primary aim of the present paper. The theoretical behavior of such methods will be investigated in the following sections. Note that we do not consider empirical counterparts of the above selection rules; we investigate how they would behave provided their values were known exactly.
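For concreteness, the sketch below (ours) evaluates the two surrogates exactly as written in (16) and (19), given user-supplied routines mi_y(j) ≈ I(X_j,Y) and cmi_y(j,i) ≈ I(X_j,Y|X_i) (these may be plug-in estimates or, as in Section 4, exact values); it uses the identity II(X_i,X_j,Y) = I(X_j,Y|X_i) − I(X_j,Y) from (8).

```python
from typing import Callable, List

def cife_score(j: int, S: List[int],
               mi_y: Callable[[int], float],        # mi_y(j)    ~ I(X_j, Y)
               cmi_y: Callable[[int, int], float],  # cmi_y(j,i) ~ I(X_j, Y | X_i)
               ) -> float:
    """CIFE criterion, Formula (16), using II(X_i, X_j, Y) = I(X_j, Y | X_i) - I(X_j, Y)."""
    return mi_y(j) + sum(cmi_y(j, i) - mi_y(j) for i in S)

def jmi_score(j: int, S: List[int],
              mi_y: Callable[[int], float],
              cmi_y: Callable[[int, int], float],
              ) -> float:
    """JMI criterion in the form (19): the average of I(X_j, Y | X_i) over the selected i."""
    if not S:
        return mi_y(j)  # with no features selected yet, both criteria reduce to I(X_j, Y)
    return sum(cmi_y(j, i) for i in S) / len(S)
```

Either function can be plugged into a greedy search such as the one sketched at the end of Section 2.2.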

3. Auxiliary Results: Information Measures for Gaussian Mixtures

In this section, we prove some results on information-theoretic properties of gaussian mixtures which are necessary to analyze the behavior of CMI, CIFE, and JMI in the Generative Tree Model defined below.

In the next section, we will consider a gaussian Generative Tree Model in which the main components have marginal distributions that are mixtures of normal distributions. Namely, if Y has the Bernoulli distribution Y ∼ Bern(1/2) (i.e., it admits values 0 and 1 with probability 1/2 each) and the conditional distribution of X given Y is N(μY,Σ), then X is a mixture of two normal distributions, N(0,Σ) and N(μ,Σ), with equal weights. Thus, in this section, we state auxiliary results on the entropy of such a random variable and its mutual information with its mixing distribution. The result for the entropy of a multivariate gaussian mixture is, to the best of our knowledge, new; for the univariate case, it was derived in Reference [17]. Bounds and approximations of the entropy of a gaussian mixture are used, e.g., in signal processing; see, e.g., References [18,19]. Consider the d-dimensional gaussian mixture X defined as

X \sim \tfrac{1}{2} N(0,I_d) + \tfrac{1}{2} N(\mu,I_d), (20)

where ‘∼’ signifies ‘distributed as’.

Theorem 1.

Differential entropy of X in (20) equals

H(X) = h(\|\mu\|) + \frac{d-1}{2} \log(2\pi e),

where h(a) is the differential entropy of the one-dimensional gaussian mixture 2^{-1}{N(0,1) + N(a,1)} for a > 0:

h(a) = -\int_{\mathbb{R}} \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \right) \log\left[ \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \right) \right] dx. (21)

Proof. 

In order to avoid burdensome notation, we prove the theorem for d=2 only. By the definition of differential entropy, we have

H(X) = -\int_{\mathbb{R}^2} \frac{1}{2} \left( f_0(x_1,x_2) + f_\mu(x_1,x_2) \right) \log\left[ \frac{1}{2} \left( f_0(x_1,x_2) + f_\mu(x_1,x_2) \right) \right] dx_1\, dx_2,

where X is defined in (20) for d=2, and f_μ denotes the density of the normal distribution with mean μ and covariance matrix I_2.

We calculate the integral above by changing the variables according to the following rotation:

\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \frac{\mu_1}{\|\mu\|} & \frac{\mu_2}{\|\mu\|} \\ -\frac{\mu_2}{\|\mu\|} & \frac{\mu_1}{\|\mu\|} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.

The transformed densities f_0 and f_μ equal

f_0(y_1,y_2) = \frac{1}{2\pi} \exp\left( -\frac{y_1^2 + y_2^2}{2} \right)

and

f_\mu(y_1,y_2) = \frac{1}{2\pi} \exp\left( -\frac{(y_1 - \|\mu\|)^2 + y_2^2}{2} \right).

Applying the above transformation, we can decompose H(X) into a sum of two integrals as follows:

H(X) = -\int_{\mathbb{R}} \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{1}{2} y_1^2} + e^{-\frac{1}{2}(y_1 - \|\mu\|)^2} \right) \log\left[ \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{1}{2} y_1^2} + e^{-\frac{1}{2}(y_1 - \|\mu\|)^2} \right) \right] dy_1 - \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} y_2^2} \log\left( \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} y_2^2} \right) dy_2 = h(\|\mu\|) + \frac{1}{2} \log(2\pi e),

where in the last equality the value H(Z)=log(2πe)/2 for N(0,1) variable Z is used. This ends the proof. □

The result above is now generalized to the case of arbitrary covariance matrix Σ. The general case will follow from Theorem 1 and the scaling property of differential entropy under linear transformations.

Theorem 2.

Differential entropy of

X \sim \tfrac{1}{2} N(0,\Sigma) + \tfrac{1}{2} N(\mu,\Sigma)

equals

H(X) = h\left( \|\Sigma^{-1/2}\mu\| \right) + \frac{d-1}{2} \log(2\pi e) + \frac{1}{2} \log \det \Sigma.

Proof. 

We apply Theorem 1 to the multivariate random variable Y = Σ^{-1/2}X. We obtain

H(Y) = h\left( \|\Sigma^{-1/2}\mu\| \right) + \frac{d-1}{2} \log(2\pi e).

Using the scaling property of differential entropy [6], we have

H(X) = H(Y) + \frac{1}{2} \log(\det \Sigma),

which completes the proof. □

Similarly, we obtain the formula for the mutual information of a gaussian mixture and its mixing distribution. We use the shorthand X|Y=y to denote the random variable whose distribution coincides with the conditional distribution P(X|Y=y).

Theorem 3.

Mutual information of X and Y, where Y ∼ Bern(1/2) and X|Y=y ∼ N(yμ,Σ), equals

I(X,Y) = h\left( \|\Sigma^{-1/2}\mu\| \right) - \frac{1}{2} \log(2\pi e). (22)

Proof. 

We will use here the fact that the entropy of the multidimensional normal distribution Z ∼ N(μ_Z,Σ) equals (cf. Reference [6], Theorem 8.4.1)

H(Z) = \frac{d}{2} \log(2\pi e) + \frac{1}{2} \log(\det \Sigma).

Therefore, we have

I(X,Y) = H(X) - H(X|Y) = h\left( \|\Sigma^{-1/2}\mu\| \right) - \frac{1}{2} \log(2\pi e), (23)

as

H(X|Y) = \frac{1}{2} H(X|Y=0) + \frac{1}{2} H(X|Y=1), (24)

where H(X|Y=i) stands for the entropy of X on the stratum Y=i. We notice that H(X|Y=i)=H(Z), as the distribution of X on stratum Y=i is normal with covariance matrix Σ, and its entropy does not depend on the mean. □

We note that, in Reference [17], the entropy of the one-dimensional Gaussian mixture 2^{-1}(N(-a,1)+N(a,1)) is calculated as h_e(a), where h_e(a) is given in an integral form. As the entropy is invariant with respect to translation, the function h(a) defined above equals h_e(a/2). The behavior of h and its first two derivatives is shown in Figure 1. It indicates that the function h is strictly increasing, and this fact is also stated in Reference [17] without proof. It is proved formally below. Strict monotonicity of h plays a crucial role in determining the order in which variables are included in the set of active variables. Note that h(0) = log(2πe)/2, which is the entropy of the standard normal N(0,1) variable. Values of h need to be calculated numerically.
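A simple quadrature is enough for this purpose; the short sketch below (ours, using scipy) evaluates (21) and checks that h(0) = ½ log(2πe) and that h(1) − ½ log(2πe) reproduces the value 0.1114 reported for I(X_1,Y) in Table 1.

```python
import numpy as np
from scipy.integrate import quad

def h(a: float) -> float:
    """Differential entropy (in nats) of the mixture 0.5*N(0,1) + 0.5*N(a,1), Formula (21)."""
    def neg_f_log_f(x):
        f = (np.exp(-x**2 / 2) + np.exp(-(x - a)**2 / 2)) / (2 * np.sqrt(2 * np.pi))
        return -f * np.log(f)
    # the mixture density is negligible more than ~10 standard deviations away from both centres
    lo, hi = min(0.0, a) - 10.0, max(0.0, a) + 10.0
    value, _ = quad(neg_f_log_f, lo, hi)
    return value

half_log_2pie = 0.5 * np.log(2 * np.pi * np.e)
print(h(0.0) - half_log_2pie)  # ~0: at a = 0 the mixture is just N(0,1)
print(h(1.0) - half_log_2pie)  # ~0.1114: the mutual information for unit separation, cf. I(X_1,Y) in Table 1
```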

Figure 1. Behavior of the function h and its first two derivatives. Horizontal lines in the left chart correspond to bounds of h and equal \frac{1}{2}\log(2\pi e) and \frac{1}{2}\log(2\pi e) + \log 2, respectively.

Lemma 1.

Differential entropy h(a) of the gaussian mixture defined in Theorem 1 is a strictly increasing function of a.

Proof. 

It is easy to see that h is differentiable and that, for the calculation of its derivative, the integration in (21) and taking the derivative can be interchanged. We show that the derivative of h is positive. By standard manipulations, using for the second equality below the fact that x exp(-x^2/2) is an odd function, we have

h'(a) = -\frac{1}{2\sqrt{2\pi}} \int_{\mathbb{R}} \left[ (x-a) e^{-\frac{(x-a)^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \Big( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \Big) \right) + (x-a) e^{-\frac{(x-a)^2}{2}} \right] dx
= -\frac{1}{2\sqrt{2\pi}} \int_{\mathbb{R}} (x-a) e^{-\frac{(x-a)^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \Big( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \Big) \right) dx
= -\frac{1}{2\sqrt{2\pi}} \int_{\mathbb{R}} x e^{-\frac{x^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \Big( e^{-\frac{x^2}{2}} + e^{-\frac{(x+a)^2}{2}} \Big) \right) dx
= -\frac{1}{2\sqrt{2\pi}} \int_{0}^{\infty} x e^{-\frac{x^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \Big( e^{-\frac{x^2}{2}} + e^{-\frac{(x+a)^2}{2}} \Big) \right) dx - \frac{1}{2\sqrt{2\pi}} \int_{-\infty}^{0} x e^{-\frac{x^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \Big( e^{-\frac{x^2}{2}} + e^{-\frac{(x+a)^2}{2}} \Big) \right) dx
= \frac{1}{2\sqrt{2\pi}} \int_{0}^{\infty} x e^{-\frac{x^2}{2}} \left[ \log\left( \frac{1}{2\sqrt{2\pi}} \Big( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \Big) \right) - \log\left( \frac{1}{2\sqrt{2\pi}} \Big( e^{-\frac{x^2}{2}} + e^{-\frac{(x+a)^2}{2}} \Big) \right) \right] dx.

We have used a change of variables for the third and the fifth equality above. It follows from the last expression that h'(a) > 0, as (x-a)^2 < (x+a)^2 for x > 0 and a > 0, and, therefore, h is increasing. □

Remark 1.

Note that Theorems 2 and 3, in conjunction with Lemma 1, show that the entropy of a mixture of two gaussians with the same covariance matrix, as well as its mutual information with the mixing distribution, is a strictly increasing function of the norm ‖Σ^{-1/2}μ‖. In particular, for Σ = I, the entropy increases as the distance between the centers of the two gaussians increases. In addition, it follows from (22) and I(X,Y) ≥ 0 that h(s) ≥ log(2πe)/2 for any s ∈ R.

Remark 2.

We call a random variable X ∈ R^d a generalized mixture when there exist diffeomorphisms f_i: R → R such that (f_1(X_1),…,f_d(X_d)) ∼ 2^{-1}(N(0,I_d) + N(μ,I_d)). Then, it follows from Theorem 2, analogously to Reference [20], that the total correlation of X (cf. Reference [21]), defined as T(X) = \sum_{i=1}^d H(X_i) - H(X), equals for a generalized mixture X

TC(X) = \sum_{i=1}^{d} h(|\mu_i|) - h(\|\mu\|) + (1-d) \log(2\pi e)/2,

where μ = (μ_1,…,μ_d)^T.

4. Main Results: Behavior of Information-Based Criteria in Generative Tree Model

In the following, we define a special gaussian Generative Tree Model and investigate how the greedy procedure based on (14), as well as its analogues when CMI is replaced by JMI and CIFE, behaves in this model. Theorem 3, proved in the previous section, will yield explicit formulae for CMIs in this model, whereas strict monotonicity of the function h(·), proved in Lemma 1, will be essential to compare the values of I(X_j,Y|X_S) for different candidates X_j.

4.1. Generative Tree Model

We will consider the Generative Tree Model with the tree structure illustrated in Figure 2. The data generating process described by this model yields the distribution of the random vector (Y,X_1,…,X_{k+1},X_1^{(1)}) such that:

Y \sim \mathrm{Bern}(1/2), \quad X_i|Y \sim N(\gamma^{i-1} Y, 1) \text{ for } i \in \{1,2,\ldots,k+1\}, \quad X_1^{(1)}|X_1 \sim N(X_1,1), (25)

where 0 < γ ≤ 1 is a parameter. Thus, first the value Y = 0, 1 is generated, with both values 0 and 1 having the same probability 1/2; then, X_1,…,X_{k+1} are generated as normal variables with variance 1 and means γ^{i-1}Y. Finally, once the value of X_1 is obtained, X_1^{(1)} is generated from a normal distribution with variance 1 and mean equal to X_1. Thus, in the sense specified above, X_1,…,X_{k+1} are the children of Y and X_1^{(1)} is the child of X_1. The parameter γ controls how difficult the problem of feature selection is. Namely, the smaller the parameter γ is, the less information X_i holds about Y for i ∈ {2,…,k+1}. We will refer to the model defined above as M_{k,γ}. Slightly abusing the notation, we denote by p(y,x_i), p(x_1,x_1^{(1)}) the bivariate densities and by p(y), p(x_i), p(x_1^{(1)}) the marginal densities. With this notation, the joint density p(y,x_1,…,x_{k+1},x_1^{(1)}) equals

p(y) \prod_{i=1}^{k+1} \frac{p(y,x_i)}{p(y)} \cdot \frac{p(x_1,x_1^{(1)})}{p(x_1)} = \frac{p(x_1,x_1^{(1)})}{p(x_1)\,p(x_1^{(1)})} \prod_{i=1}^{k+1} \frac{p(y,x_i)}{p(y)\,p(x_i)} \prod_{i=1}^{k+1} p(x_i) \, p(y) \, p(x_1^{(1)}),

which can be more succinctly written as

\prod_{(i,j) \in E} \frac{p(z_i,z_j)}{p(z_i)\,p(z_j)} \prod_{i \in V} p(z_i),

after renaming the variables to z_i, i = 1,…,k+3, with E and V standing for the edges and vertices of the graph shown in Figure 2 (cf. Formula (4.1) in Reference [4]).

Figure 2. Generative Tree Model under consideration.

The above model generalizes the model discussed in Reference [3], but some branches which are irrelevant for our considerations are omitted. The values of the conditional mutual information I(X_{k+1},Y|X_S) in the model, where S = {1,2,…,k}, for different γ and as a function of k, are shown in Figure 3. We prove in the following that I(X_{k+1},Y|X_S) > 0; thus, X_{k+1} carries non-null predictive information about Y even when the variables X_1,…,X_k are already chosen as predictors. We note that I(X_1^{(1)},Y|X_S) = 0 for every γ ∈ (0,1] and every X_S containing X_1. Thus, {X_1,…,X_{k+1}} is the Markov Blanket (cf., e.g., Reference [22]) of Y among the predictors {X_1,…,X_{k+1},X_1^{(1)}}, and {X_1,…,X_{k+1}} is sufficient for Y (cf. Reference [23]). A more general model may be considered which incorporates children of every vertex X_1,…,X_{k+1} and several levels of progeny. Here, we show how one variable, X_1^{(1)}, which does not belong to the Markov Blanket of Y, is treated differently by the considered selection rules.

Figure 3. Behavior of the conditional mutual information I(X_{k+1},Y|X_1,X_2,…,X_k) as a function of k for different γ values.

Intuitively, for 0 < γ < 1 and l < n, X_l carries more information about Y than X_n and, moreover, X_1^{(1)} is redundant once X_1 has been chosen. Thus, predictors should be chosen in the order X_1, X_2, …, X_{k+1}. For γ = 1, the order of selection of the X_i is also X_1,…,X_{k+1}, in concordance with our convention of breaking ties, but X_1^{(1)} should not be chosen. We show in the following that CMI chooses variables in this order; however, the order with respect to its approximations, CIFE and JMI, may be different. We also note that an alternative way of representing the predictors is

X_i = \gamma^{i-1} Y + \varepsilon_i, \quad X_1^{(1)} = X_1 + \varepsilon_{k+2}, (26)

for i = 1,…,k+1, where ε_1,…,ε_{k+2} are i.i.d. N(0,1). Thus, in particular,

a_k Y = \sum_{i=1}^{k+1} X_i - \sum_{i=1}^{k+1} \varepsilon_i,

with a_k = (1-γ^{k+1})/(1-γ). Moreover, it is seen that EX_i = γ^{i-1} EY = γ^{i-1}/2.

It is shown in Reference [2] that maximization of I(X_j,Y|X_S) is equivalent to maximization of CIFE(X_j,Y|X_S) provided that the selected features in X_S are independent and class-conditionally independent given the unselected feature X_j. It is easily seen that these properties do not hold in the considered GTM for S = {1,…,l} and j = l+1 with l ≤ k. It can also be seen by a direct calculation that CMI differs from CIFE in the GTM. Take S = {1,2} and X_j = X_1^{(1)}. Then, note that the difference between these quantities equals

I(X_j,Y|X_S) - I(X_j,Y) - \sum_{i \in S} II(X_i,X_j,Y). (27)

Moreover, using conditional independence, we have

II(X_1,X_1^{(1)},Y) = I(X_1^{(1)},Y|X_1) - I(X_1^{(1)},Y) = -I(X_1^{(1)},Y)

and

II(X_2,X_1^{(1)},Y) = I(X_1^{(1)},X_2|Y) - I(X_1^{(1)},X_2) = -I(X_1^{(1)},X_2);

thus, plugging the above equalities into (27) and using I(X_1^{(1)},Y|X_1,X_2) = 0, we obtain that the expression there equals I(X_1^{(1)},X_2), which is strictly positive in the considered GTM.

Similar considerations concerning the conditions stated above (18) show that maximization of JMI is not equivalent to maximization of CMI in the GTM. Namely, if S = {1,2} and j ∈ {3,…,k+1}, then it is easily seen that I(X_j,X_{S\{i}}|X_i) > 0 and I(X_j,X_{S\{i}}|X_i,Y) = 0 for i = 1,2; thus, the last term in (17) is negative.

In order to support this numerically in a specific case, consider γ = 2/3. In the first column of Table 1a, the MI values I(X_i,Y), i = 1,…,3, and I(X_1^{(1)},Y) are shown for this value of γ. They were calculated in Reference [3] using simulations, while here they are based on (23) and numerical evaluation of h(‖Σ^{-1/2}μ‖). Additionally, in Table 1, the CMI values from subsequent steps and the JMI and CIFE values in such a model are shown. As a foretaste of the analysis which follows, note that, in view of panel (b) of the table, JMI erroneously chooses X_1^{(1)} in the third step instead of X_3, in contrast to CIFE (cf. panel (c) of the table), which chooses X_1, X_2, X_3 in the right order. Note also that, in this case, X_1^{(1)} has the second largest mutual information with Y; thus, when a filter based solely on this information is considered, X_1^{(1)} is chosen at the second step (after X_1).

Table 1.

The criteria (Conditional Mutual Information (CMI), Joint Mutual Information (JMI), Conditional Infomax Feature Extraction (CIFE)) values for k=2 and γ=2/3. A value of the chosen variable in each step and for each criterion is in bold.

(a) X_{S_1} = {X_1}, X_{S_2} = {X_1,X_2}, X_{S_3} = {X_1,X_2,X_3}
          I(·,Y)    I(·,Y|X_{S_1})    I(·,Y|X_{S_2})    I(·,Y|X_{S_3})
X_1       0.1114
X_2       0.0527    0.0422
X_3       0.0241    0.0192    0.0176
X_1^{(1)} 0.0589    0.0000    0.0000    0.0000
(b) X_{S_1} = {X_1}, X_{S_2} = {X_1,X_2}, X_{S_3} = {X_1,X_2,X_1^{(1)}}
          JMI(·)    JMI(·|X_{S_1})    JMI(·|X_{S_2})    JMI(·|X_{S_3})
X_1       0.1114
X_2       0.0527    0.0422
X_3       0.0241    0.0192    0.0205    0.0208
X_1^{(1)} 0.0589    0.0000    0.0266
(c) X_{S_1} = {X_1}, X_{S_2} = {X_1,X_2}, X_{S_3} = {X_1,X_2,X_3}
          CIFE(·)   CIFE(·|X_{S_1})   CIFE(·|X_{S_2})   CIFE(·|X_{S_3})
X_1       0.1114
X_2       0.0527    0.0422
X_3       0.0241    0.0192    0.0169
X_1^{(1)} 0.0589    0.0000    0.0057    −0.0083

We note that an analysis of the behavior of CMI and its approximations, including CIFE and JMI, has been given in Reference [24], Section 6, for a simple model containing 4 predictors. We analyze here the behavior of these measures of conditional dependence for the general model M_{k,γ}, which involves an arbitrary number of predictors having varying dependence with Y.

4.2. Behavior of CMI

First of all, we show that the criterion based on conditional mutual information (CMI), without any modifications, chooses the correct variables in the right order. It has been previously noticed that I(X_1^{(1)},Y|X_S) = 0 for S = {1,…,k}. Now, we show that I(X_{k+1},Y|X_S) > 0 for every k. Namely, applying Theorem 3 and the chain rule for mutual information

I(X_{S \cup \{k+1\}},Y) = I(X_S,Y) + I(X_{k+1},Y|X_S),

we obtain

I(X_{k+1},Y|X_S) = h\left( \sqrt{\sum_{i=0}^{k} \gamma^{2i}} \right) - h\left( \sqrt{\sum_{i=0}^{k-1} \gamma^{2i}} \right) > 0, (28)

where the inequality follows as h is a strictly increasing function. Thus, we have proved that I(X_1^{(1)},Y|X_S) = 0 < I(X_{k+1},Y|X_S) for S = {1,…,k} for every k. Whence we have, for S = {1,…,l} and l < k, that

\arg\max_{Z \in \{X_j : j \in S^c\}} I(Z,Y|X_S) = X_{l+1},

thus CMI chooses predictors in the correct order. Figure 3 shows the behavior of g(k,γ) = I(X_{k+1},Y|X_1,…,X_k) as a function of k for various γ. Note that it follows from Figure 3 that g(·,γ) is decreasing. This means that the additional information on Y obtained when X_{k+1} is incorporated gets smaller with k. Now, we study the order in which predictors are chosen with respect to JMI and CIFE.

4.3. Behavior of JMI

The main objective of this section is to examine the performance of the JMI criterion in the Generative Tree Model for different values of the parameter γ. We will show that:

  • For γ = 1, the active predictors X_1,…,X_{k+1} ∈ MB(Y) are chosen in the right order and X_1^{(1)} is not chosen before them;

  • For 0 < γ < 1, the variable X_1^{(1)} ∉ MB(Y) is chosen at a certain step before all of X_1,…,X_{k+1} are chosen, and we evaluate the moment when this situation occurs.

Consider the model above and assume that the set of indices of the currently chosen variables equals S = {1,2,…,k}. For i ∈ {1,2,…,k}, we apply the chain rule (6) and Theorem 3 with the following covariance matrices and mean vectors for I((X_i,Z),Y) (cf. (26)):

\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \ \mu = \begin{pmatrix} \gamma^{i-1} \\ \gamma^{k} \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \ \mu = \begin{pmatrix} \gamma^{i-1} \\ 1 \end{pmatrix}, (29)

respectively, for Z = X_{k+1} and Z = X_1^{(1)}. Then, we have

I(X_{k+1},Y|X_i) = h\left( \sqrt{\gamma^{2k} + \gamma^{2(i-1)}} \right) - h\left( \gamma^{i-1} \right), (30)
I(X_1^{(1)},Y|X_i) = h\left( \sqrt{\gamma^{2(i-1)} + \tfrac{1}{2}} \right) - h\left( \gamma^{i-1} \right) \quad \text{for } i \neq 1, (31)
I(X_1^{(1)},Y|X_1) = 0. (32)

The last equation follows from the fact that X_1^{(1)} and Y are conditionally independent given X_1.

From the definition of JMI(X,Y|X_S), abbreviated from now on to JMI(X|X_S) to simplify notation, we obtain

k \, JMI(X_{k+1}|X_S) = \sum_{i=1}^{k} \left[ h\left( \sqrt{\gamma^{2k} + \gamma^{2(i-1)}} \right) - h\left( \gamma^{i-1} \right) \right], (33)
k \, JMI(X_1^{(1)}|X_S) = \begin{cases} 0 & \text{if } k = 1, \\ \sum_{i=2}^{k} \left[ h\left( \sqrt{\gamma^{2(i-1)} + \tfrac{1}{2}} \right) - h\left( \gamma^{i-1} \right) \right] & \text{if } k > 1. \end{cases} (34)

We observe that the variables X_1, X_2, … are chosen in order according to JMI, as for S = {1,…,l} and l < m < n we have JMI(X_m|X_S) > JMI(X_n|X_S). For γ = 1, the right-hand sides of the last two expressions equal k[h(\sqrt{2}) - h(1)] and (k-1)[h(\sqrt{3/2}) - h(1)], respectively. Thus, for γ = 1, we have JMI(X_{k+1}|X_S) > JMI(X_1^{(1)}|X_S), which means that the variables are chosen in the order X_1,…,X_{k+1} and X_1^{(1)} is not chosen before them when the JMI criterion is used. Although, for γ = 1, the JMI criterion does not select this redundant feature, we note that, for k → ∞, S = {1,…,k}, and γ = 1,

JMI(X_1^{(1)}|X_S) \to h\left( \sqrt{\tfrac{3}{2}} \right) - h(1) > 0,

which differs from I(X_1^{(1)},Y|X_S) = 0 for all k ≥ 1. We note also that, in this case, JMI(X_{k+1}|X_S) does not depend on k, in contrast to I(X_{k+1},Y|X_S).

Now, we consider the case 0 < γ < 1. We want to show that, for sufficiently large k and S = {1,…,k}, the JMI criterion chooses X_1^{(1)}, since

JMI(X_{k+1}|X_S) < JMI(X_1^{(1)}|X_S).

The last inequality is equivalent to

\sum_{i=2}^{k} \left[ h\left( \sqrt{\gamma^{2(i-1)} + \tfrac{1}{2}} \right) - h\left( \sqrt{\gamma^{2k} + \gamma^{2(i-1)}} \right) \right] > h\left( \sqrt{1 + \gamma^{2k}} \right) - h(1). (35)

The right-hand side tends to 0 when k → ∞. For the left-hand side, note that, for k > \log_{\gamma^2}(1/2), we have γ^{2k} < 1/2, and all summands of the sum above are positive, as h is an increasing function. Thus, bounding the sum from below by its first term, we have

\sum_{i=2}^{k} \left[ h\left( \sqrt{\gamma^{2(i-1)} + \tfrac{1}{2}} \right) - h\left( \sqrt{\gamma^{2k} + \gamma^{2(i-1)}} \right) \right] \geq h\left( \sqrt{\gamma^{2} + \tfrac{1}{2}} \right) - h\left( \sqrt{\gamma^{2k} + \gamma^{2}} \right) > 0.

As this lower bound increases with k while the right-hand side of (35) tends to 0, inequality (35) holds for all sufficiently large k.

The minimal k for which the JMI criterion incorrectly chooses X_1^{(1)}, i.e., the first k for which (35) holds, is shown in Figure 4. The values of the JMI criterion for the variables X_{k+1} and X_1^{(1)} are shown in Figure 5. Figure 4 indicates that X_1^{(1)} is chosen early; for γ ≤ 0.8, it happens in the third step at the latest.
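The switch point can also be computed directly from (33)–(35); the sketch below (ours) scans k for a given γ and returns the first k at which JMI prefers X_1^{(1)} over X_{k+1}. For γ = 2/3 it returns k = 2, i.e., the third selection step, in agreement with Table 1b.

```python
import numpy as np
from scipy.integrate import quad

def h(a: float) -> float:
    """Entropy (nats) of 0.5*N(0,1) + 0.5*N(a,1); same quadrature as in the Section 3 sketch."""
    f = lambda x: (np.exp(-x**2 / 2) + np.exp(-(x - a)**2 / 2)) / (2 * np.sqrt(2 * np.pi))
    return quad(lambda x: -f(x) * np.log(f(x)), min(0.0, a) - 10, max(0.0, a) + 10)[0]

def jmi_pair(k: int, gamma: float):
    """(JMI(X_{k+1}|X_S), JMI(X_1^(1)|X_S)) for S = {1,...,k}, Formulas (33)-(34)."""
    active = sum(h(np.sqrt(gamma**(2 * k) + gamma**(2 * (i - 1)))) - h(gamma**(i - 1))
                 for i in range(1, k + 1))
    child = sum(h(np.sqrt(gamma**(2 * (i - 1)) + 0.5)) - h(gamma**(i - 1))
                for i in range(2, k + 1))
    return active / k, child / k

def minimal_k(gamma: float, k_max: int = 60):
    """Smallest k for which (35) holds, i.e., JMI ranks X_1^(1) above X_{k+1}."""
    for k in range(1, k_max + 1):
        jmi_active, jmi_child = jmi_pair(k, gamma)
        if jmi_child > jmi_active:
            return k
    return None

print(minimal_k(2/3))  # -> 2: JMI picks X_1^(1) at the third step for gamma = 2/3, cf. Table 1b
```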

Figure 4. Minimal k for which JMI(X_{k+1}|X_S) < JMI(X_1^{(1)}|X_S), 0 < γ < 1.

Figure 5. The behavior of JMI in the generative tree model: JMI(X_{k+1}|X_S) and JMI(X_1^{(1)}|X_S).

4.4. Behavior of CIFE and Its Comparison with JMI

The aim of this section is to show that, although both the JMI and CIFE criteria are developed as approximations to conditional mutual information, their behavior in the generative tree model differs. We will show that:

  • For γ = 1, CIFE incorrectly chooses X_1^{(1)} at some point;

  • For 0 < γ < 1, CIFE selects the variables X_1,…,X_{k+1} in the right order.

Thus, CIFE behaves very differently from JMI in the Generative Tree Model.

Analogously to the formulae for JMI, we have the following formulae for CIFE (S = {1,…,k}):

CIFE(X_{k+1}|X_S) = (1-k)\left[ h(\gamma^{k}) - \tfrac{1}{2}\log(2\pi e) \right] + \sum_{i=1}^{k} \left[ h\left( \sqrt{\gamma^{2k} + \gamma^{2(i-1)}} \right) - h\left( \gamma^{i-1} \right) \right],
CIFE(X_1^{(1)}|X_S) = \begin{cases} 0 & \text{if } k = 1, \\ (1-k)\left[ h(1) - \tfrac{1}{2}\log(2\pi e) \right] + \sum_{i=2}^{k} \left[ h\left( \sqrt{\gamma^{2(i-1)} + \tfrac{1}{2}} \right) - h\left( \gamma^{i-1} \right) \right] & \text{if } k > 1. \end{cases}

For γ = 1, we have

CIFE(X_{k+1}|X_S) = (1-k)\left[ h(1) - \tfrac{1}{2}\log(2\pi e) \right] + \sum_{i=1}^{k} \left[ h(\sqrt{2}) - h(1) \right] = h(1) - \tfrac{1}{2}\log(2\pi e) - k\left[ 2h(1) - h(\sqrt{2}) - \tfrac{1}{2}\log(2\pi e) \right],
CIFE(X_1^{(1)}|X_S) = (1-k)\left[ 2h(1) - \tfrac{1}{2}\log(2\pi e) - h\left( \sqrt{\tfrac{3}{2}} \right) \right].

Note that both expressions above are linear functions of k. Comparison of their slopes, in view of h(\sqrt{3/2}) < h(\sqrt{2}) as h is an increasing function, yields that, for sufficiently large k, we obtain CIFE(X_{k+1}|X_S) < CIFE(X_1^{(1)}|X_S). The behavior of CIFE for 0 < γ < 1 in the case of X_{k+1} and X_1^{(1)} is shown in Figure 6, and the difference between CIFE(X_{k+1}|X_S) and CIFE(X_1^{(1)}|X_S) in Figure 7. The values below 0 in the last plot occur for γ = 1 only; thus, for 0 < γ < 1, we have CIFE(X_{k+1}|X_S) > CIFE(X_1^{(1)}|X_S) for any k.

Figure 6. The behavior of CIFE in the generative tree model: CIFE(X_{k+1}|X_S) and CIFE(X_1^{(1)}|X_S).

Figure 7. Difference between the values of JMI for X_{k+1} and X_1^{(1)} (left panel) and the analogous difference for CIFE (right panel). Values below 0 mean that the variable X_1^{(1)} is chosen.

Furthermore, as 2h(1) - \tfrac{1}{2}\log(2\pi e) - h(\sqrt{3/2}) \approx 0.0642 > 0, we have, for γ = 1,

CIFE(X_1^{(1)}|X_S) \to -\infty \quad \text{as } k \to \infty,

and, as 2h(1) - h(\sqrt{2}) - \tfrac{1}{2}\log(2\pi e) \approx 0.0215 > 0, we have

CIFE(X_{k+1}|X_S) \to -\infty \quad \text{as } k \to \infty.

In order to understand the consequences of this property, let us momentarily assume that one introduces an intuitive stopping rule which says that the candidate X_{j_0} such that j_0 = \arg\max_{j \in S^c} CIFE(X_j,Y|X_S) is appended only when CIFE(X_{j_0},Y|X_S) > 0. Then, the Positive Selection Rate (PSR) of such a selection procedure may become arbitrarily small in the model M_{k,γ} for fixed γ and sufficiently large k. PSR is defined as |t̂ ∩ t|/|t|, where t = {1,…,k+1} is the set of indices of the Markov Blanket of Y and t̂ is the set of indices of the chosen variables.

5. Conclusions

We have considered M_{k,γ}, a special case of the Generative Tree Model, and investigated the behavior of CMI and the related criteria JMI and CIFE in this model. We have shown that, despite the fact that both of these criteria are derived as approximations of CMI under certain dependence conditions, their behavior may greatly differ from that of CMI, in the sense that they may switch the order of variable importance and treat inactive variables as more relevant than active ones. In particular, this occurs for JMI when γ < 1 and for CIFE when γ = 1. We have also shown a drawback of the CIFE procedure which consists in disregarding a significant part of the active variables, so that PSR may become arbitrarily small in the model M_{k,γ} for large k. As a byproduct, we have obtained formulae for the entropy of a multivariate gaussian mixture and its mutual information with the mixing variable. We have also shown that the entropy of the gaussian mixture is a strictly increasing function of the Euclidean distance between the centers of its two components. Note that, in this paper, we investigated the behavior of theoretical CMI and its approximations in the GTM; for their empirical versions, we may expect an exacerbation of the effects described here.

Acknowledgments

Comments of two referees which helped to improve presentation of the original version of the manuscript are gratefully acknowledged.

Author Contributions

Conceptualization, M.Ł.; Formal analysis, J.M. and M.Ł.; Methodology, J.M. and M.Ł.; Supervision, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1.Guyon I., Elisseeff A. Feature Extraction, Foundations and Applications. Volume 207. Springer; Berlin/Heidelberg, Germany: 2006. An introduction to feature selection; pp. 1–25. [Google Scholar]
  • 2.Brown G., Pocock A., Zhao M., Luján M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012;13:27–66. [Google Scholar]
  • 3.Gao S., Ver Steeg G., Galstyan A. Advances in Neural Information Processing Systems. MIT Press; Cambridge, MA, USA: 2016. Variational Information Maximization for Feature Selection; pp. 487–495. [Google Scholar]
  • 4.Lafferty J., Liu H., Wasserman L. Sparse nonparametric graphical models. Stat. Sci. 2012;27:519–537. doi: 10.1214/12-STS391. [DOI] [Google Scholar]
  • 5.Liu H., Xu M., Gu H., Gupta A., Lafferty J., Wasserman L. Forest density estimation. J. Mach. Learn. Res. 2011;12:907–951. [Google Scholar]
  • 6.Cover T.M., Thomas J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing) Wiley-VCH; Hoboken, NJ, USA: 2006. [Google Scholar]
  • 7.Yeung R.W. A First Course in Information Theory. Kluwer; South Holland, The Netherlands: 2002. [Google Scholar]
  • 8.McGill W.J. Multivariate information transmission. Psychometrika. 1954;19:97–116. doi: 10.1007/BF02289159. [DOI] [Google Scholar]
  • 9.Ting H.K. On the Amount of Information. Theory Probab. Appl. 1960;7:439–447. doi: 10.1137/1107041. [DOI] [Google Scholar]
  • 10.Han T.S. Multiple mutual informations and multiple interactions in frequency data. Inform. Control. 1980;46:26–45. doi: 10.1016/S0019-9958(80)90478-7. [DOI] [Google Scholar]
  • 11.Meyer P., Schretter C., Bontempi G. Information-theoretic feature selection in microarray data using variable complementarity. IEEE J. Sel. Top. Signal Process. 2008;2:261–274. doi: 10.1109/JSTSP.2008.923858. [DOI] [Google Scholar]
  • 12.Vergara J.R., Estévez P.A. A review of feature selection methods based on mutual information. Neural. Comput. Appl. 2014;24:175–186. doi: 10.1007/s00521-013-1368-0. [DOI] [Google Scholar]
  • 13.Lin D., Tang X. European Conference on Computer Vision 2006 May 7. Springer; Berlin/Heidelberg, Germany: 2006. Conditional infomax learning: An integrated framework for feature extraction and fusion; pp. 68–82. [Google Scholar]
  • 14.Mielniczuk J., Teisseyre P. Stopping rules for information-based feature selection. Neurocomputing. 2019;358:255–274. doi: 10.1016/j.neucom.2019.05.048. [DOI] [Google Scholar]
  • 15.Yang H.H., Moody J. Data visualization and feature selection: New algorithms for nongaussian data. Adv. Neural. Inf. Process Syst. 1999;12:687–693. [Google Scholar]
  • 16.Peng H., Long F., Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005;27:1226–1238. doi: 10.1109/TPAMI.2005.159. [DOI] [PubMed] [Google Scholar]
  • 17.Michalowicz J., Nichols J.M., Bucholtz F. Calculation of differential entropy for a mixed gaussian distribution. Entropy. 2008;10:200–206. doi: 10.3390/entropy-e10030200. [DOI] [Google Scholar]
  • 18.Moshkar K., Khandani A. Arbitrarily tight bound on differential entropy of gaussian mixtures. IEEE Trans. Inf. Theory. 2016;62:3340–3354. doi: 10.1109/TIT.2016.2553147. [DOI] [Google Scholar]
  • 19.Huber M., Bailey T., Durrant-Whyte H., Hanebeck U. On entropy approximation for gaussian mixture random vectors; Proceedings of the 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems; Seoul, Korea. 20–22 August 2008; Piscataway, NJ, USA: IEEE; 2008. pp. 181–189. [Google Scholar]
  • 20.Singh S., Póczos B. Nonparanormal information estimation. arXiv. 2017 arXiv:1702.07803. [Google Scholar]
  • 21.Watanabe S. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev. 1960;4:66–82. [Google Scholar]
  • 22.Pena J.M., Nilsson R., Bjoerkegren J., Tegner J. Towards scalable and data efficient learning of Markov boundaries. Int. J. Approx. Reason. 2007;45:211–232. doi: 10.1016/j.ijar.2006.06.008. [DOI] [Google Scholar]
  • 23.Achille A., Soatto S. Emergence of invariance and disentanglements in deep representations. J. Mach. Learn. Res. 2018;19:1948–1980. [Google Scholar]
  • 24.Macedo F., Oliveira M., Pacecho A., Valadas R. Theoretical foundations of forward feature selection based on mutual information. Neurocomputing. 2019;325:67–89. doi: 10.1016/j.neucom.2018.09.077. [DOI] [Google Scholar]
