Abstract
We reformulate and reframe a series of increasingly complex parametric statistical topics into a framework of response-vs.-covariate (Re-Co) dynamics that is described without any explicit functional structures. Then we resolve these topics' data analysis tasks by discovering major factors underlying such Re-Co dynamics by making use of data's categorical nature only. The major factor selection protocol at the heart of the Categorical Exploratory Data Analysis (CEDA) paradigm is illustrated and carried out by employing Shannon's conditional entropy (CE) and mutual information (MI) as the two key Information Theoretical measurements. Through the process of evaluating these two entropy-based measurements and resolving statistical tasks, we acquire several computational guidelines for carrying out the major factor selection protocol in a do-and-learn fashion. Specifically, practical guidelines are established for evaluating CE and MI in accordance with the criterion called [C1: confirmable]. Following the [C1: confirmable] criterion, we make no attempt to acquire consistent estimations of these theoretical information measurements. All evaluations are carried out on a contingency table platform, upon which the practical guidelines also provide ways of lessening the effects of the curse of dimensionality. We explicitly carry out six examples of Re-Co dynamics, within each of which several widely extended scenarios are also explored and discussed.
Keywords: Categorical Exploratory Data Analysis, curse of dimensionality, Hierarchical clustering, interacting effects, K-means, LASSO
1. Introduction
The majority of scientific fields, such as biology [1], neuroscience [2], medicine, sociology and psychology [3] and many others [4], involve dynamics of complex systems [5,6]. Scientists and experts in such fields typically can only imagine, or at best briefly outline, various potential response-vs.-covariate (Re-Co) relationships in an attempt to characterize the dynamics of their complex systems of interest [7]. Even though no explicit functional form of such Re-Co relationships is available, such scientists still go ahead and collect structured data sets by investing great effort in choosing which features should play the role of response variables and which should play the role of covariate variables. Such choices of features are indeed critical for the sciences because their successes rely entirely on whether such structured data sets embrace the essence of the targeted Re-Co dynamics or not.
After scientists achieve their scientific quests by generating structured data sets upon the complex systems of interest, it becomes not only natural but also important to ask the following specific question: when such structured data sets are in the data analysts' hands, what is the most essential common goal of data analysis? This goal is certainly not aimed at an explicit system of equations, nor at a complete set of functional descriptions of the targeted Re-Co dynamics. Instead, this goal can and shall be oriented toward decoding the scientists' authentic knowledge and intelligence about the complex systems of interest, and then going one step further beyond the current state of understanding.
In sharp contrast, nearly all statistical model-based data analyses on structured data sets pertaining to a wide range of Re-Co dynamics assume an explicit functional structure linking the response variables to covariate variables, including hypothesis testing [8], analysis of variance (ANOVA) and the many variants of regression analysis [9,10], including generalized linear models and log-linear models [11,12]. By framing rather complex Re-Co dynamics with rather simplistic explicit functional structures, statistical model-based data analysis surely runs the danger of hijacking the data's authentic information content. With such dangers in mind, it is natural to ask the reverse question: what if we can reformulate all fundamental statistical tasks to fit under a framework of response-vs.-covariate (Re-Co) dynamics without explicit functional forms and still extract the data's authentic information content?
As the theme of this paper, we demonstrate a positive answer to the above fundamental question. The chief merits of such demonstrations are that we not only can do nearly all data analysis without statistical modeling, but, more importantly, we can reveal the data's authentic information content to foster true understanding of the complex systems of interest. Our computational developments are illustrated through a series of six well-known statistical topics of increasing complexity. All successfully revealed information content is visible and interpretable.
The positive answer resides in the paradigm called Categorical Exploratory Data Analysis (CEDA), with its heart anchored at a major factor selection protocol, which has been under development in a series of published works [13,14,15,16] and a recently completed work [17]. For demonstrating the positive answer, this paper establishes practical guidelines for evaluating Theoretical Information Measurements, in particular Shannon's conditional entropy (CE) and the mutual information (MI) between the response variables and covariate variables [18], which are the basis of CEDA and the major factor selection protocol.
Along the process of establishing such computational guidelines, we characterize four theme-components in CEDA and the major factor selection protocol:
- TC-1. Our practical guidelines are established here for evaluating CE and MI without requiring consistent estimations of their theoretical population-version measurements.
- TC-2. All entropy-related evaluations are carried out on a contingency table platform, so the learned practical guidelines also provide ways of relieving the effects of the curse of dimensionality and ascertaining the [C1: confirmable] criterion, which is a kind of relative reliability.
- TC-3. CEDA is free of man-made assumptions and structures, so consequently its inferences are carried out with natural reliability.
- TC-4. CEDA only employs data's categorical nature, so the confirmed collection of major factors indeed reveals the data's authentic information content regardless of data types.
The theme-component [TC-1] allows us to avoid many technical and difficult issues encountered in estimating theoretical information measurements [19,20]. [TC-1] and [TC-2] together make CEDA's major factor selection protocol very distinct from model-based feature selection based on mutual information evaluations [21,22,23,24], while [TC-3] makes CEDA's inferences realistic, and [TC-4] makes CEDA provide authentic information content with very wide applicability.
For specifically illustrating these four theme-components, we consider a structured data set consisting of data points that are measured and collected in an (L+K)-dimensional vector format. The first L components are the designated response (Re) features' measurements or categories, denoted as Y = (Y_1, ..., Y_L), and the remaining K components are the K covariate (Co) features' measurements or categories, denoted as X = (X_1, ..., X_K). It is essential to note that some or even all covariate features could be categorical. Thus, the data analysts' task is prescribed precisely as extracting the authentic associative relations between Y and X based on a structured data set.
For extracting authentic associations between response and covariate features, various Theoretical Information Measurements are employed under the structured data setting in [13,14,15,16,17]. In particular, the Re-Co directional associations developed in CEDA and its major factor selection protocol rely on evaluations of Shannon's conditional entropy (CE) and mutual information (MI) that are all carried out upon the contingency table platform. This platform is indeed very flexible and adaptable to the numbers of features on the row- and column-axes as well as to the total number of data points. Such a key characteristic makes CEDA very versatile in applicability. We explain in more detail as follows.
On the response side, a collection of categories of response features (pertaining to Y) is determined with respect to their categorical nature and the sample size. Likewise, on the covariate side, a collection of categories for each 1D covariate feature (pertaining to X_k for k = 1, ..., K) is chosen accordingly. It is noted that a continuous feature is categorized with respect to its histogram [25]. If L is small, then the entire collection of response categories will consist of all non-empty cells or hypercubes of an L-dimensional (LD) contingency table. However, when L is large, the total number of LD hypercubes could be too large for a finite data set in the sense that many hypercubes are occupied by very few data points. This is known as the effect of the curse of dimensionality. To avoid such an effect, clustering algorithms, such as Hierarchical clustering or K-means algorithms, can also be performed for fusing the L response features (upon their original continuous measurement scales, or their contingency tables when categorical ones are involved) into one single categorical response variable. The number of categories can be pre-determined for the K-means algorithm or determined by cutting a Hierarchical clustering tree in such a fashion that there is only one tree branch per category. The essential idea behind such feature-fusing operations is to retain the structural dependency among these L response features, while at the same time reducing the detrimental effect of the curse of dimensionality.
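As a concrete illustration of this feature-fusing step, the sketch below (in Python; the array name Y_resp, the cluster count and the simulated data are our own illustrative assumptions, not part of the original protocol) fuses a block of L response measurements into one categorical response variable via K-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def fuse_response(Y_resp, n_categories=32, seed=0):
    """Fuse an (N, L) block of response measurements into a single
    categorical response variable by clustering the L-dimensional points.
    Clustering retains the dependency structure among the L response
    features while keeping the number of response categories small,
    which lessens the curse-of-dimensionality effect of an LD table."""
    km = KMeans(n_clusters=n_categories, n_init=10, random_state=seed)
    return km.fit_predict(Y_resp)          # one integer category per data point

# Hypothetical usage: 3 correlated response features, 5000 data points.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.6], [0.3, 0.6, 1.0]])
Y_resp = rng.multivariate_normal(np.zeros(3), cov, size=5000)
y_cat = fuse_response(Y_resp, n_categories=32)
```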
In contrast, on the covariate side, singleton and joint (or interacting) effects of all possible subsets of {X_1, ..., X_K} are theoretically potential. However, it is practically known that the order of interacting effects that needs to be considered is to a great extent determined by the sample size. That is, a covariate-vs.-response contingency table platform can vary greatly in dimensions: large or small. When viewing a contingency table as a high-dimensional histogram, which is a naive form of density estimation, the curse of dimensionality, or so-called finite sample phenomenon, is supposed to affect our conditional entropy evaluations whenever this table's dimension is large relative to the data's sample size. We use the notation C[A; Y] (rows-vs.-columns) for a contingency table of a covariate variable subset A and the response variable Y. As a convention, the categories of Y are arranged along the column-axis, while the categories of A are arranged along the row-axis. This row-axis expands with respect to the memberships of A.
In CEDA, the associative patterns between any A and Y are discovered and evaluated using the contingency table C[A; Y]. It is necessary to reiterate that C[A; Y] can be viewed as a "joint histogram" or "density estimation" of all features contained in A and Y. From this perspective, when the dimension of C[A; Y] increasingly expands as A includes more variables, it is expected that its dimensionality would affect the comparability and reliability of conditional entropy evaluations. Consequently, for comparability purposes, the criterion [C1: confirmable] arises in CEDA. This criterion is based on a so-called data mimicking operation developed in [14], as described in the following paragraphs.
Let A* denote one mimicry of A in the ideal sense of having the same deterministic and stochastic structures. In other words, A* is generated to have the same empirical categorical distribution as A; see [14] for construction details. More practically speaking, if the empirical categorical distribution of A is represented by a contingency table, then, given the observed vector of row-sums, C[A*; Y] would be another contingency table that has the same lattice dimension and all of whose row-vectors are generated from Multinomial distributions with parameters specified by the corresponding row-sum and the corresponding vector of observed proportions in A's contingency table. It is noted that A* is constructed independently of Y, that is, A* is stochastically independent of Y [14].
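A minimal sketch of this mimicking operation, under our reading that the row-sums of C[A; Y] are kept fixed and each row is redistributed over the response categories by a Multinomial draw with the pooled column proportions (so that A* is independent of Y); the function name is ours.

```python
import numpy as np

def mimic_table(table, rng):
    """Generate the contingency table C[A*; Y] of one mimicry A* of A.
    `table` is a count matrix with categories of A on rows and categories
    of Y on columns.  Each row-sum is preserved and redistributed over the
    columns by a Multinomial draw whose probabilities are the pooled
    column proportions, making A* stochastically independent of Y while
    sharing A's empirical (row) distribution."""
    col_props = table.sum(axis=0) / table.sum()
    row_sums = table.sum(axis=1)
    return np.vstack([rng.multinomial(n, col_props) for n in row_sums])

# Example: a small 3 x 4 table of counts.
rng = np.random.default_rng(0)
observed = np.array([[30, 10, 5, 5],
                     [8, 25, 12, 5],
                     [2, 10, 20, 18]])
print(mimic_table(observed, rng))
```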
Denote the mutual information of Y and A, evaluated on C[A; Y], as I[Y; A], and likewise I[Y; A*] as evaluated on C[A*; Y]. The [C1: confirmable] criterion used in CEDA refers to the degree of certainty that I[Y; A] is far beyond the upper limit of the confidence region based on the empirical distribution of I[Y; A*]. This [C1: confirmable] criterion is indeed in accordance with CEDA's theme components [TC-2] and [TC-3], regarding the merits of a contingency table platform in dealing with the curse of dimensionality and facilitating reliability. It is critical to note that we are not estimating the theoretical mutual information of Y and A here; we just want to computationally make sure that I[Y; A] is significantly above zero with great reliability under the reality of having only a finite amount of data points at hand.
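A sketch of how this evaluation could be organized in practice, reusing the same Multinomial mimicry; taking the 97.5th percentile of the mimicked ensemble as the upper CR limit is our assumption, matching the two-tailed CR convention described in Section 2.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(table):
    """I[Y; A] = H[Y] - H[Y | A] from a count table with categories of A
    on rows and categories of Y on columns."""
    N = table.sum()
    h_y = entropy(table.sum(axis=0) / N)
    h_y_given_a = sum(row.sum() / N * entropy(row / row.sum())
                      for row in table if row.sum() > 0)
    return h_y - h_y_given_a

def c1_confirmable(table, n_mimic=100, upper_pct=97.5, seed=0):
    """Return the observed I[Y; A] and the upper CR limit of I[Y; A*]
    over an ensemble of mimicked tables; [C1: confirmable] holds when
    the former is clearly above the latter."""
    rng = np.random.default_rng(seed)
    col_props = table.sum(axis=0) / table.sum()
    row_sums = table.sum(axis=1)
    null_mi = [mutual_information(np.vstack([rng.multinomial(n, col_props)
                                             for n in row_sums]))
               for _ in range(n_mimic)]
    return mutual_information(table), float(np.percentile(null_mi, upper_pct))
```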
Henceforth, it is a critical fact in all applications of CEDA that a covariate feature set A is confirmed as having effects on Y only when the [C1: confirmable] criterion of I[Y; A] is established. This concept makes [TC-1] possible by doing without nonparametric estimations of the Shannon entropy of a continuous distribution function as well as of the mutual information of two sets of continuous variables, which have been long-standing problems in physics and neural computing (see theoretical details in [19] and computational protocols based on the digamma function in [20]).
Here, we do not take the view of a contingency table as a setup of Grenander's Method of Sieves (MoS) [26]. Though MoS can be a choice for practical reasons and for computing issues involving many-dimensional features or variables, we are not primarily concerned with estimating the population-versions of CE and MI per se, nor with the induced sieve biases. Rather, the dimensions of contingency tables are made adaptable to the necessity of accommodating multiple covariate feature-members in A. In such cases, the collection of categories of A might be built based on hierarchical or K-means clustering algorithms. From this perspective, computations of the theoretical conditional entropy and mutual information between a multi-dimensional covariate A and a possibly multi-dimensional Y are neither realistically nor practically possible, due to the limited size of the available data sets. Since this kind of sieve is data dependent, the computations of sieve biases can be much more complicated than those covered in [19].
In this paper, we illustrate and carry out CEDA coupled with its major factor selection protocol through a series of six classic statistical topic examples, within each of which various scenarios are also considered. By building contingency tables across various dimensions with respect to different sample sizes, we attempt to reveal the robustness of CEDA resolutions to statistical topic issues. On one hand, we learn practical guidelines for evaluating conditional (Shannon) entropy and mutual information along this illustrative process. On the other hand, we demonstrate that very distinct CEDA resolutions to these classic statistical topics can be achieved by coherently extracting the data's authentic information content, which is the intrinsic goal of any proper data analysis. That being said, if modeling is indeed a necessary step within a scientific quest, then the data's authentic information content surely will serve that purpose better, because confirmed structures provide the starting point for a new kind of data-driven modeling.
At the end of this section, we briefly project the applicability of our CEDA approach for data analysis related to complex systems. One critical application is in case-control studies, since such studies likely involve multiple features of any data types, as often conducted in medical, pharmaceutical, and epidemiological research. Another critical application of CEDA is to serve as an alternative approach to all kinds of regression analysis techniques based on linear, logistic, log-linear, or generalized linear regression models. Such modeling-based analyses are often required and conducted in biological, social, and economic sciences, among many other scientific fields. Furthermore, in our ongoing research, we look into the issue of how well CEDA can deal with causality issues. Additionally, with such a wide spectrum of applicability, we project that CEDA will become an essential topic of data analysis education in the fields of statistics, physics, and beyond in the foreseeable future.
2. Estimations of Mutual Information between One Categorical and One Quantitative Variable
In this section, we demonstrate how to resolve classic statistical tasks by discovering major factors based on entropy evaluations. First, we frame each classic statistical task into precisely stated Re-Co dynamics. Secondly, we compute and discover major factors underlying this Re-Co dynamics. Inferences are then performed under [C1:confirmable] criterion across a spectrum of contingency tables with varying designed dimensions. Thirdly, we look beyond the setting of the discussed examples to much wider related statistical topics.
Throughout this paper, all confidence ranges (CR) are calculated as the region between the 2.5% percentile on the lower tail and the 97.5% percentile on the upper tail of any simulated distribution. This CR, reflecting both tail behaviors, is considered informative: even though the upper tail is the only quantity of interest in this paper, reporting both tails keeps the classic one-sided confidence interval visible.
2.1. [Example-1]: From 1D Two-Sample Problem to One-Way and Two-Way ANOVA
Consider a data set consisting of quantitative observations of a 1D response feature Y derived from two populations labeled by 0 and 1, respectively. Let the population-0 and population-1 observations be distributed according to F_0 and F_1. Testing the distributional equality hypothesis H_0: F_0 = F_1 is the most fundamental topic in statistics. Under this setting, the only covariate is the categorical population-ID taking values in {0, 1}. This hypothesis testing problem and its subsequent ones can be turned into an equivalent problem: is the population-ID a major factor underlying the Re-Co dynamics of Y? If the population-ID is not a major factor, then H_0 is accepted. If H_0 is indeed rejected by confirming the population-ID as a major factor, then we would further want to discover where the two distributions are different.
For illustrative simplicity, let the two populations be two Normal distributions sampled in equal proportions, so that Y is an equal-weight mixture of F_0 and F_1. From a theoretical information measurement perspective, the theoretical entropy H[Y] of the mixture and the conditional entropy H[Y | ID] given the population-ID can both be calculated, so the mutual information shared by Y and the population-ID is denoted and calculated as I[Y; ID] = H[Y] − H[Y | ID]. By the population-ID being a major factor of Y, we mean that the population-ID is not replaceable by another covariate variable that is stochastically independent of Y, such as a fair-coin-tossing random variable Z. That is, we theoretically establish this fact by knowing I[Y; ID] > 0 = I[Y; Z].
In the real world, the two population-specific distributions F_0 and F_1 are often unknown. To accommodate this realistic setting, we build a histogram based on the pooled observed data set. With a chosen version of this histogram having K bins, we can build a 2-row contingency table: its two rows correspond to the two population-IDs, and the bins with their column-sums are arranged along the column-axis. That is, this table keeps the records of population-IDs for all members within each bin of the histogram, and it enables us to estimate the mutual information I[Y; ID] = H[Y] − H[Y | ID] directly from its cell counts.
All estimates of I[Y; ID] are compared with estimates of I[Y; ID*] from mimicked contingency tables generated as follows: the kth column is obtained by splitting the kth observed column-sum according to a Binomial random variable whose success probability is the overall observed proportion of population-1 labels. This comparison of I[Y; ID] with I[Y; ID*] is a way of testing whether a major factor candidate satisfies the criterion [C1: confirmable] in [15]. Precisely, this testing is performed by comparing the observed estimate of I[Y; ID] with respect to the simulated distribution of I[Y; ID*].
To make our focal issue concrete and meaningful, we undertake the following simulation study, in which the reliability issue of estimating I[Y; ID] is addressed and, at the same time, [C1: confirmable] is tested. Recall the two Normal populations specified above. We consider two cases of N = 2000 and 20,000. For practical considerations with respect to the infinite range of Normality, we choose K + 2 bins for building a histogram in a "1 + K + 1" fashion: the observed quantile range is divided into K equal-size bins, while the first bin and the last bin absorb the two open tails. We use 5 choices of K: 10, 20, 30, 100 and 1000. For each K value, we report the estimated Shannon entropy H[Y] and conditional entropy H[Y | ID]. Also, a confidence range (CR) of I[Y; ID*] is simulated and reported based on an ensemble of mimicked tables, where ID* is a Bernoulli (fair-coin-tossing) random variable.
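A compact sketch of this simulation; since the exact population parameters are not restated above, we assume two unit-variance Normal populations, N(0, 1) and N(1, 1), with equal sample sizes, and a 1%–99% quantile range for the interior bins.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi_from_table(tab):
    N = tab.sum()
    h_y = entropy(tab.sum(axis=0) / N)
    return h_y - sum(r.sum() / N * entropy(r / r.sum()) for r in tab if r.sum() > 0)

rng = np.random.default_rng(1)
n = 1000                                                  # per-population size (assumed)
y = np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, n)])
ids = np.repeat([0, 1], n)

for K in (10, 20, 30, 100):
    lo, hi = np.quantile(y, [0.01, 0.99])                 # interior range (assumed choice)
    edges = np.r_[-np.inf, np.linspace(lo, hi, K + 1), np.inf]
    cols = np.digitize(y, edges) - 1                      # "1 + K + 1" bin indices
    tab = np.zeros((2, K + 2), dtype=int)
    np.add.at(tab, (ids, cols), 1)
    # Null CR: split each column-sum with a fair-coin Binomial, as for Table 1.
    null = []
    for _ in range(100):
        top = rng.binomial(tab.sum(axis=0), 0.5)
        null.append(mi_from_table(np.vstack([top, tab.sum(axis=0) - top])))
    print(K, round(mi_from_table(tab), 4), round(float(np.percentile(null, 97.5)), 5))
```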
As reported in Table 1, it is evident that the mutual information estimates are very close to the theoretical value, as if they are nearly scale-free, when K ≤ 30 with N = 2000 and K ≤ 100 with N = 20,000. The rule of thumb in this 1D setting seems to be: the mutual information estimations are rather robust when the averaged cell count is over 30. When the average cell count is around 10, we begin to see the effects of the finite sample phenomenon. Nonetheless, we still have estimates of I[Y; ID] being far above the upper limits of the confidence range of I[Y; ID*] when K = 100 with N = 2000 and even when K = 1000 with N = 20,000. This simulation indeed points to an observation that a conclusion based on I[Y; ID] tends to be rather reliable in view of the [C1: confirmable] criterion.
Table 1.
Point estimations of the mutual information I[Y; ID] and the null confidence range (CR) of I[Y; ID*], with ID* being the Binomial random variable under the null hypothesis.
| N | Bin Size | H[Y] | H[Y given ID] | I[Y; ID] | CR of I[Y; ID*] |
|---|---|---|---|---|---|
| 2000 | 1 + 10 + 1 | 2.3993 | 2.2824 | 0.1168 | [0.00254, 0.00298] |
| 1 + 20 + 1 | 3.0149 | 2.8951 | 0.1199 | [0.00489, 0.00551] | |
| 1 + 30 + 1 | 3.3782 | 3.2571 | 0.1211 | [0.00757, 0.00836] | |
| 1 + 100 + 1 | 4.4424 | 4.3043 | 0.1382 | [0.02548, 0.02704] | |
| 1 + 1000 + 1 | 6.2609 | 5.9149 | 0.3461 | [0.26435, 0.26768] | |
| 20,000 | 1 + 10 + 1 | 2.4135 | 2.3011 | 0.1124 | [0.00025, 0.00030] |
| 1 + 20 + 1 | 3.0350 | 2.9215 | 0.1135 | [0.00050, 0.00057] | |
| 1 + 30 + 1 | 3.3995 | 3.2856 | 0.1139 | [0.00074, 0.00082] | |
| 1 + 100 + 1 | 4.4807 | 4.3649 | 0.1157 | [0.00243, 0.00258] | |
| 1 + 1000 + 1 | 6.5310 | 6.3933 | 0.1377 | [0.02591, 0.02637] |
In summary, Table 1 indicates that the estimate of the mutual information I[Y; ID] is far above the confidence range under the null hypothesis within each of the 5 choices of K under both cases of N. Nine out of the ten cases have almost 0 p-values, except the case with K = 1000 and N = 2000. These facts indicate one common observation: when all bins contain at least 20 data points, the estimate of I[Y; ID] is reasonably stable and practically valid. That is, we only need a stable and valid estimate of I[Y; ID] for the purpose of confirming a major factor candidacy.
In fact, it is surprising to see that, even when K = 1000 in the case of N = 2000, I[Y; ID] still satisfies the [C1: confirmable] criterion by going beyond the upper limit of the confidence range of I[Y; ID*]. This fact implies that the correct decision is still retained because the population-ID is confirmed as a major factor. These observations become crucial when estimations of mutual information face the effects of the curse of dimensionality, also called the finite sample phenomenon.
With the population-ID determined to be a major factor underlying the dynamics of Y and the hypothesis H_0 rejected, we can then check which bins' observed entropies fall inside or outside bin-specific entropy confidence ranges built from simulated Binomial counts across all bins. By doing so, we discover where F_0 and F_1 are different locally.
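A sketch of this bin-by-bin check, using the 2 x (1+K+1) table layout from the sketch above; for each bin, the observed two-cell entropy of the population-ID composition is compared with the range produced by fair-coin Binomial splits of that bin's count (the 2.5th percentile as the lower limit is our assumption).

```python
import numpy as np

def two_cell_entropy(a, b):
    p = np.array([a, b], dtype=float)
    p = p[p > 0] / (a + b)
    return -np.sum(p * np.log(p))

def locate_local_differences(tab, n_sim=1000, seed=2):
    """Flag bins whose observed population-ID composition is unusually
    uneven (low entropy) relative to fair-coin splits of the bin count;
    such bins indicate where the two distributions differ locally."""
    rng = np.random.default_rng(seed)
    flags = []
    for k in range(tab.shape[1]):
        n_k = int(tab[:, k].sum())
        if n_k == 0:
            flags.append(False)
            continue
        observed = two_cell_entropy(tab[0, k], tab[1, k])
        sims = [two_cell_entropy(c, n_k - c) for c in rng.binomial(n_k, 0.5, n_sim)]
        flags.append(observed < np.percentile(sims, 2.5))
    return np.array(flags)
```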
Next, one very interesting observation is found and reported in Table 1: the values of H[Y] and H[Y | ID] vary with respect to K, but I[Y; ID] is nearly scale-free (w.r.t. K). We explain how this observation occurs. Let f(·) be the hypothetical density function of the random variable Y underlying the observed values. For each K, the observed quantile range, of width R, is divided into K equal-width bins of width Δ_K = R/K. By the Mean Value Theorem of calculus, the probability of the kth bin is approximately f(ỹ_k)Δ_K for some inter-middle value ỹ_k inside that bin, so the theoretical Shannon entropy of the K-bin categorized version of Y is approximated as
$$ H_K[Y] \approx -\sum_{k=1}^{K} f(\tilde{y}_k)\Delta_K \log\big(f(\tilde{y}_k)\Delta_K\big) = -\sum_{k=1}^{K} f(\tilde{y}_k)\log\big(f(\tilde{y}_k)\big)\Delta_K + \log K - \log R, $$
where the ỹ_k's denote the inter-middle values in the Mean Value Theorem of calculus and the first sum approximates the differential entropy of Y. Therefore, for two bin numbers K and K', we have the approximating relation
$$ H_{K'}[Y] - H_{K}[Y] \approx \log\frac{K'}{K}, $$
with K and K' ranging over 10, 20, 30 and 100. After such subtractions, the successive differences of the entropies are close to log(K'/K), which matches the numbers shown in the 3rd column of Table 1. For the same reason, these relations hold for the estimated conditional entropies as well. That is, for all Ks,
$$ H_{K'}[Y \mid ID] - H_{K}[Y \mid ID] \approx \log\frac{K'}{K} $$
when all involved bins have 30 or so data points, as seen in the 4th column of Table 1. This is the reason why we see the estimated values of I[Y; ID] = H_K[Y] − H_K[Y | ID] being nearly constant (w.r.t. K) when K ≤ 30 with N = 2000 and K ≤ 100 with N = 20,000. This is a critical fact allowing us to employ mutual information estimates with reliability. Thus, we use the notation I[Y; ID] from here on, without reference to the bin number K.
Here we further remark that the two-sample hypothesis testing setting can be extended into the so-called multiple-sample problem with L populations. Correspondingly, the categorical variable of population-IDs is equipped with L categories. The hypothesis testing of H_0: F_1 = F_2 = ... = F_L retains the same equivalent formulation: is the population-ID a major factor underlying the dynamics of Y? This multiple-sample problem is also known as one-way ANOVA, which is one fundamental topic in Analysis of Variance.
Another fundamental topic in Analysis of Variance is represented by two-way ANOVA, which involves two categorical covariate features X_1 and X_2. Let these two covariate features have I and J categories, respectively. Within the population with X_1 = i and X_2 = j, measurements of Y are distributed with respect to F_{ij}, with i = 1, ..., I and j = 1, ..., J.
The classic two-way ANOVA setting is specified by assuming Normality and satisfying the following linear structure:
$$ Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \qquad \varepsilon_{ijk} \sim N(0, \sigma^2), $$
with μ as the overall effect, the α_i's as the effects of X_1, the β_j's as the effects of X_2, and the γ_{ij}'s as the interacting effects of X_1 and X_2. These effect parameters are required to satisfy the following linear constraints:
$$ \sum_{i=1}^{I}\alpha_i = 0, \qquad \sum_{j=1}^{J}\beta_j = 0, \qquad \sum_{i=1}^{I}\gamma_{ij} = \sum_{j=1}^{J}\gamma_{ij} = 0. $$
It is evident that this classic two-way ANOVA formulation is rather limited in the sense of excluding the possibility that Y does not have an informative mean, such as a non-normal distribution with heavy tails or more than one mode, or even lacks the concept of a mean altogether, as is the case for a categorical variable.
A much more widely extended two-way version is given as follows:
$$ Y = F\big(f_1(X_1),\; f_2(X_2),\; f_{12}(X_1, X_2),\; \varepsilon\big), $$
where F is an unknown global function consisting of the following unknown component-wise mechanisms: the unknown component mechanism f_1 having X_1 as its order-1 major factor; another unknown component mechanism f_2 having X_2 as its order-1 major factor; and the unknown interacting component mechanism f_12 with (X_1, X_2) as its order-2 major factor. Our goal of data analysis under this extended version is again reframed as computationally determining whether these order-1 and order-2 major factors are present or not underlying the Re-Co dynamics of Y against the covariate features X_1 and X_2. If the two covariate features X_1 and X_2 are independent or only slightly dependent on each other, the pertinent major factor selection protocol can be found in [15]. However, if they are heavily associated, a modified major factor selection protocol can be found in [17].
We conclude this Example-1 with a summarizing statement: a large class of statistical topics can be rephrased and reframed into a major factor selection problem, and this problem is then commonly resolved by evaluating mutual information estimations that are not required to be precisely close to their unknown theoretical values.
2.2. [Example-2]: From Dealing with to Lessening the Effects of the Curse of Dimensionality
It is noted here that mutual information has another representation as the Kullback–Leibler divergence between the joint distribution and the product of the marginals:
$$ I[Y; X] = \int\!\!\int p(y, x)\,\log\frac{p(y, x)}{p(y)\,p(x)}\;dy\,dx. $$
This representation is valid even for a categorical variable X, with the corresponding integral replaced by a sum. Based on this representation, we can clearly see the scale-free property of mutual information with respect to various choices of histograms. Nonetheless, we refrain from using this definition for estimating I[Y; X], since the definition-based estimation involves estimating the joint distribution of (Y, X), which is a harder problem due to its dimensionality. This so-called curse of dimensionality will become self-evident later on in our developments when the response variable and its covariate features are both multi-dimensional. The task of estimating a multi-dimensional density is neither practical nor reliable given an ensemble of finitely many data points.
In this subsection, we demonstrate how to effectively deal with the effects of the curse of dimensionality. We consider again a two-sample problem, but with multi-dimensional data points rather than the one-dimensional ones of Example-1. Again, we denote the two populations with IDs 0 and 1, and their data points are m-dimensional vectors. Let Y denote the m-dimensional response variable. To resolve the same task of testing whether these two populations are equal when the m component features are possibly highly associated, what would be the best way of building up the contingency table for the purpose of estimating I[Y; ID] for testing this hypothesis?
We expect that the equal-bin-size and equal-bin-area approaches for component-wise histograms are neither ideal nor practical due to the curse of dimensionality. On the other hand, we know that clusters of m-dimensional data points can naturally retain the dependency structures. Hence, it is intuitive to employ results of clustering algorithms to differentiate patterns of structural dependency within the two populations. This intuition leads to the important merit of cluster-based contingency tables as a way of lessening the effects of the curse of dimensionality. We illustrate these ideas through two samples of simulated multivariate Normal-distributed data described as follows.
Let m = 4 and consider two mean-zero Normal distributions with distinct covariance matrices.
The Shannon entropies of these two 4D Normal distributions, computed via the closed-form formula (with m = 4)
$$ H = \frac{1}{2}\log\big((2\pi e)^{m}\det(\Sigma)\big), $$
are calculated as 5.0942 and 4.4355, respectively. The conditional entropy of Y given the population-ID follows directly from these two values. As for the entropy of the mixture of the two 4D Normal distributions, its calculation is not straightforward and is even troublesome. Through an extra experiment using 100 million data points, we ended up with a negative estimate of the mutual information. This failed attempt in fact further provides a vivid clue of the effect of the curse of dimensionality. In other words, we need to resolve such an effect by staying away from the rigid 4D hypercubes.
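The closed-form Normal entropy used above is straightforward to evaluate; a small sketch with two hypothetical 4D covariance matrices (not the ones used in this experiment) is given below.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of an m-dimensional Normal distribution:
    H = 0.5 * log((2 * pi * e)^m * det(cov))."""
    m = cov.shape[0]
    return 0.5 * np.log(((2 * np.pi * np.e) ** m) * np.linalg.det(cov))

# Hypothetical covariance matrices, for illustration only.
cov_a = np.eye(4)
cov_b = 0.7 * np.eye(4) + 0.3 * np.ones((4, 4))
print(gaussian_entropy(cov_a), gaussian_entropy(cov_b))
```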
In contrast, we demonstrate that cluster-based approaches are potentially reasonable choices to mend this effect of the curse of dimensionality. Consider two commonly used clustering algorithms: Hierarchical clustering (HC) and the K-means algorithm. It is known that the HC algorithm is computationally more costly than the K-means algorithm; since the HC algorithm heavily relies on a distance matrix, it has difficulties in handling a data set with a very large sample size. Recently, very effective computing packages have been developed for the K-means algorithm, so K-means can be applied efficiently. On top of computing-efficiency differences, there exists a critical difference between the two algorithms: K-means provides much more even cluster sizes than the HC algorithm does, as illustrated in Figure 1; see also Figure 2. For these reasons, we employ the K-means clustering algorithm, not Hierarchical clustering (HC), in the following series of cases.
Figure 1.
Comparing Hierarchical clustering and K-means via distributions of cluster sizes in Example 2: (A) ; (B) 200.
Figure 2.
K-means clusters in the 2D setting: (A) N = 2000; (B) N = 20,000.
In this experiment, we take the two mean-zero Normal populations in 2D, 3D and 4D versions under two sample-size settings (N = 2000 and 20,000, or N = 1000 and 10,000 in the 4D case). It is noted that the differences in covariance matrices imply differences in distribution shapes. The series of clustering compositions is constructed as follows. We apply the K-means algorithm to the pooled data to derive a series of clustering compositions with 12, 22, 32 and 102 clusters. Correspondingly, we build a series of contingency tables of the formats (1) 2 × 12; (2) 2 × 22; (3) 2 × 32 and (4) 2 × 102. With respect to this series of clustering compositions, we compute H[Y], H[Y | ID] and I[Y; ID]. Here, ID is again the categorical variable of population-IDs.
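A sketch of this cluster-based construction, assuming two hypothetical mean-zero Normal samples whose covariance matrices differ in shape (our own illustrative parameters); the entropy and MI helpers mirror the earlier sketches.

```python
import numpy as np
from sklearn.cluster import KMeans

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi_from_table(tab):
    N = tab.sum()
    return entropy(tab.sum(axis=0) / N) - sum(
        r.sum() / N * entropy(r / r.sum()) for r in tab if r.sum() > 0)

rng = np.random.default_rng(3)
n, m = 10_000, 4
cov0 = np.eye(m)                                  # hypothetical population-0 shape
cov1 = 0.5 * np.eye(m) + 0.5 * np.ones((m, m))    # hypothetical population-1 shape
Y = np.vstack([rng.multivariate_normal(np.zeros(m), cov0, n),
               rng.multivariate_normal(np.zeros(m), cov1, n)])
ID = np.repeat([0, 1], n)

for k in (12, 22, 32, 102):
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
    tab = np.zeros((2, k), dtype=int)
    np.add.at(tab, (ID, clusters), 1)
    print(k, round(mi_from_table(tab), 4))
```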
The messages derived from Example-1 are also observed in Example-2 across the 2D to 4D settings in Table 2, Table 3 and Table 4. These results clearly indicate that distribution-shape differences can be effectively and reliably picked up by entropy-based evaluations of the mutual information between Y and the categorical label variable ID. These results also imply that we can widely extend the one-way and two-way ANOVA settings to accommodate high-dimensional data points, as we have argued in Example-1.
Table 2.
Entropies of Example-2 calculated from contingency tables built based on K-means clustering compositions on the 2D data setting.
| n | Bin Size | H[Y] | H[Y given ID] | I[Y; ID] | CR of I[Y; ID*] |
|---|---|---|---|---|---|
| 2000 | 12 | 2.3962 | 2.3866 | 0.0096 | [0.00248, 0.00299] |
| 22 | 2.9722 | 2.9530 | 0.0192 | [0.00487, 0.00544] | |
| 32 | 3.3354 | 3.3123 | 0.0232 | [0.00731, 0.00799] | |
| 102 | 4.5430 | 4.4995 | 0.0434 | [0.02576, 0.02711] | |
| 1002 | 6.7989 | 6.4311 | 0.3678 | [0.33761, 0.34149] | |
| 20,000 | 12 | 2.4208 | 2.4148 | 0.0060 | [0.00024, 0.00029] |
| 22 | 2.9916 | 2.9816 | 0.0100 | [0.00049, 0.00056] | |
| 32 | 3.3500 | 3.3377 | 0.0123 | [0.00074, 0.00081] | |
| 102 | 4.5076 | 4.4899 | 0.0177 | [0.00244, 0.00258] | |
| 1002 | 6.8662 | 6.8236 | 0.0425 | [0.02570, 0.02608] |
Table 3.
Entropies of Example-2 calculated from contingency tables built based on K-means clustering compositions on the 3D data setting.
| n | Bin Size | H[Y] | H[Y given ID] | I[Y; ID] | CR of I[Y; ID*] |
|---|---|---|---|---|---|
| 2000 | 12 | 2.4411 | 2.4310 | 0.0101 | [0.00260, 0.00303] |
| 22 | 3.0166 | 3.0028 | 0.0138 | [0.00476, 0.00537] | |
| 32 | 3.3706 | 3.3482 | 0.0224 | [0.00732, 0.00812] | |
| 102 | 4.5297 | 4.4771 | 0.0526 | [0.02563, 0.02712] | |
| 1002 | 6.8065 | 6.4558 | 0.3507 | [0.33899, 0.34254] | |
| 20,000 | 12 | 2.4642 | 2.4620 | 0.0023 | [0.00025, 0.00030] |
| 22 | 3.0425 | 3.0337 | 0.0088 | [0.00047, 0.00053] | |
| 32 | 3.4064 | 3.3958 | 0.0106 | [0.00075, 0.00083] | |
| 102 | 4.5307 | 4.5067 | 0.0241 | [0.00246, 0.00258] | |
| 1002 | 6.8551 | 6.7988 | 0.0563 | [0.02582, 0.02632] |
Table 4.
Entropies of Example-2 calculated from contingency tables built based on K-means clustering compositions on the 4D data setting.
| n | Bin Size | H[Y] | H[Y given ID] | I[Y; ID] | CR of I[Y; ID*] |
|---|---|---|---|---|---|
| 1000 | 12 | 2.4599 | 2.4556 | 0.0043 | [0.00249, 0.00299] |
| 22 | 3.0612 | 3.0518 | 0.0094 | [0.00477, 0.00536] | |
| 32 | 3.4115 | 3.3911 | 0.0204 | [0.00753, 0.00838] | |
| 102 | 4.5065 | 4.4508 | 0.0557 | [0.02565, 0.02717] | |
| 1002 | 6.8162 | 6.4627 | 0.3535 | [0.33696, 0.34110] | |
| 10,000 | 12 | 2.4756 | 2.4728 | 0.0029 | [0.00026, 0.00032] |
| 22 | 3.0772 | 3.0736 | 0.0036 | [0.00049, 0.00056] | |
| 32 | 3.4456 | 3.4377 | 0.0079 | [0.00073, 0.00081] | |
| 102 | 4.5590 | 4.5347 | 0.0243 | [0.00244, 0.00257] | |
| 1002 | 6.8328 | 6.7697 | 0.0631 | [0.02556, 0.02607] |
In order to better understand the limits of such an entropy-based approach, we twist the 2D setting of Example-2 a little bit. This more complicated version of Example-2 consists of one 2D normal mixture and one 2D normal. These two 2D distributions are further made to have equal mean vectors and covariance matrices. Furthermore, two kinds of mixture settings are designed and used. The first setting is designed for a mixture of two relatively close 2D normal components, while the second setting is designed for a normal mixture whose two component mean vectors are relatively far apart. The two settings' pairwise scatter-plots are given in Figure 3. It is obvious that we can visually separate the two 2D distributions in the second mixture setting, but cannot do equally well in the first mixture setting.
Figure 3.
Two sets of pairwise scatter-plots of one simulated 2D normal mixture against a 2D normal with equal mean vector and covariance matrix. (A) The first set is for the mixture of two close normal components; (B) the second is for the relatively far-apart normal mixture.
The mutual information estimates and confidence ranges under the null hypothesis are calculated and reported in Table 5. In the first mixture setting, it is apparent that the population-ID fails to be a major factor by failing to satisfy the criterion [C1: confirmable] across all choices of cluster number. This result is coherent with our visualization through the upper panel of Figure 3. As for the second mixture setting, the population-ID is claimed as a major factor by satisfying the [C1: confirmable] criterion across all choices of cluster number. This result is also coherent with our visualization through the lower panel of Figure 3. Further, we observe that the relative positions of the I[Y; ID] estimates against the upper and lower limits of the null confidence ranges are rather stable when the sizes of clusters are not too small. This observation indeed provides us with a practical guideline for varying the choices of cluster number according to different sample sizes when we employ mutual information to perform inferences under Re-Co dynamics.
Table 5.
Entropies of the two settings of the modified Example-2, calculated from contingency tables built based on K-means clustering compositions with N = 20,000.
| Data | Bin Size | H[Y] | H[Y given ID] | I[Y; ID] | CR of I[Y; ID*] |
|---|---|---|---|---|---|
| 1st mixture | 12 | 2.4246 | 2.4233 | 0.0012 | [0.00258, 0.00309] |
| 22 | 2.9958 | 2.9910 | 0.0048 | [0.00506, 0.00575] | |
| 32 | 3.3805 | 3.3725 | 0.0080 | [0.00786, 0.00855] | |
| 102 | 4.5481 | 4.5214 | 0.0267 | [0.02622, 0.02747] | |
| 1002 | 6.7953 | 6.4700 | 0.3252 | [0.33811, 0.34153] | |
| 2nd mixture | 12 | 2.4434 | 2.4375 | 0.0059 | [0.00226, 0.00272] |
| 22 | 2.9943 | 2.9795 | 0.0147 | [0.00529, 0.00602] | |
| 32 | 3.3678 | 3.3518 | 0.0159 | [0.00745, 0.00817] | |
| 102 | 4.5485 | 4.5143 | 0.0342 | [0.02542, 0.02690] | |
| 1002 | 6.7975 | 6.4573 | 0.3403 | [0.33702, 0.34059] |
We conclude Example-2 and its modified version with a summarizing statement: though theoretical evaluations of mutual information in the presence of high dimensionality are practically impossible, clustering algorithms provide practical means of building contingency tables and evaluating mutual information for inferential purposes, thereby lessening the effects of the curse of dimensionality.
2.3. [Example-3]: From Linear to Highly Nonlinear Associations
We then turn to consider the simplest one-sample problem involving dependent 2D data points. The framework of Re-Co dynamics is self-evident. In this example, we examine the validity and performance of inferences based on the estimated mutual information between two 1D continuous random variables Y and X via contingency tables of various dimensions. For simplicity, in the first scenario of Example-3 we consider a bivariate normal with unit marginal variances and covariance matrix determined by the correlation coefficient ρ:
$$ \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}. $$
Here, the correlation coefficient ρ is taken to be 0 (independence) and a strictly positive value (dependence), respectively, in this experiment, with N = 2000 or 20,000. The contingency tables are derived from the K-means algorithm being applied to X and Y separately, with a series of pre-determined numbers of clusters: 12, 22, 32 and 102.
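A sketch of this first scenario, categorizing X and Y separately by K-means and comparing the observed mutual information with a null range obtained by permuting X's categories (a permutation-based stand-in for the mimicry construction); the value rho = 0.5 is purely illustrative and not necessarily the dependent value used in Table 7.

```python
import numpy as np
from sklearn.cluster import KMeans

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi_from_table(tab):
    N = tab.sum()
    return entropy(tab.sum(axis=0) / N) - sum(
        r.sum() / N * entropy(r / r.sum()) for r in tab if r.sum() > 0)

def categorize(v, k):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(v.reshape(-1, 1))

rng = np.random.default_rng(4)
N, rho = 20_000, 0.5                                   # illustrative correlation
x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], N).T

for ky in (12, 22):
    for kx in (12, 22):
        cy, cx = categorize(y, ky), categorize(x, kx)
        tab = np.zeros((kx, ky), dtype=int)
        np.add.at(tab, (cx, cy), 1)
        null = []
        for _ in range(50):                            # permutation null for I[Y; X*]
            t = np.zeros((kx, ky), dtype=int)
            np.add.at(t, (rng.permutation(cx), cy), 1)
            null.append(mi_from_table(t))
        print(ky, kx, round(mi_from_table(tab), 4),
              round(float(np.percentile(null, 97.5)), 4))
```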
For the independent setting (ρ = 0), we report the calculated I[Y; X] and the confidence range of I[Y; X*] in Table 6 across the 16 dimensions of contingency tables. The smallest contingency table has 12 × 12 = 144 cells; its average cell-count is less than 14 for N = 2000. The largest contingency table is 102 × 102, which has more than 10,000 cells; its average cell-count is less than 2 even for N = 20,000.
Table 6.
Entropies, conditional entropies and mutual information under the independent setting (ρ = 0), with N = 2000 (upper half) and 20,000 (lower half).
| Y Bin Size | X Bin Size | H[Y] | H[Y given X] | I[Y; X] | CR of I[Y; X*] |
|---|---|---|---|---|---|
| Y = 12 | X = 12 | 2.4135 | 2.3435 | 0.0700 | [0.0637, 0.0669] |
| X = 22 | 2.4135 | 2.2861 | 0.1273 | [0.1231, 0.1274] | |
| X = 32 | 2.4135 | 2.2194 | 0.1940 | [0.1863, 0.1916] | |
| X = 102 | 2.4135 | 1.7971 | 0.6164 | [0.5714, 0.5787] | |
| Y = 22 | X = 12 | 3.0168 | 2.9014 | 0.1154 | [0.1249, 0.1294] |
| X = 22 | 3.0168 | 2.7650 | 0.2517 | [0.2393, 0.2450] | |
| X = 32 | 3.0168 | 2.6319 | 0.3848 | [0.3613, 0.3681] | |
| X = 102 | 3.0168 | 2.0360 | 0.9808 | [0.9365, 0.9439] | |
| Y = 32 | X = 12 | 3.3910 | 3.1952 | 0.1958 | [0.1899, 0.1951] |
| X = 22 | 3.3910 | 3.0196 | 0.3714 | [0.3587, 0.3656] | |
| X = 32 | 3.3910 | 2.8494 | 0.5416 | [0.5143, 0.5209] | |
| X = 102 | 3.3910 | 2.1175 | 1.2736 | [1.2040, 1.2106] | |
| Y = 102 | X = 12 | 4.5236 | 3.9131 | 0.6105 | [0.5657, 0.5728] |
| X = 22 | 4.5236 | 3.5339 | 0.9897 | [0.9516, 0.9585] | |
| X = 32 | 4.5236 | 3.2717 | 1.2519 | [1.2193, 1.2261] | |
| X = 102 | 4.5236 | 2.2962 | 2.2274 | [2.1571, 2.1643] | |
| Y = 12 | X = 12 | 2.3392 | 2.3332 | 0.0059 | [0.0060, 0.0063] |
| X = 22 | 2.3392 | 2.3275 | 0.0116 | [0.0115, 0.0119] | |
| X = 32 | 2.3392 | 2.3216 | 0.0175 | [0.0172, 0.0177] | |
| X = 102 | 2.3392 | 2.2799 | 0.0592 | [0.0578, 0.0588] | |
| Y = 22 | X = 12 | 2.9424 | 2.9311 | 0.0113 | [0.0116, 0.0120] |
| X = 22 | 2.9424 | 2.9215 | 0.0210 | [0.0223, 0.0228] | |
| X = 32 | 2.9424 | 2.9109 | 0.0316 | [0.0335, 0.0342] | |
| X = 102 | 2.9424 | 2.8334 | 0.1090 | [0.1122, 0.1135] | |
| Y = 32 | X = 12 | 3.3311 | 3.3155 | 0.0157 | [0.0174, 0.0179] |
| X = 22 | 3.3311 | 3.2978 | 0.0333 | [0.0334, 0.0341] | |
| X = 32 | 3.3311 | 3.2843 | 0.0468 | [0.0496, 0.0505] | |
| X = 102 | 3.3311 | 3.1634 | 0.1677 | [0.1675, 0.1690] | |
| Y = 102 | X = 12 | 4.5504 | 4.4933 | 0.0571 | [0.0582, 0.0592] |
| X = 22 | 4.5504 | 4.4401 | 0.1103 | [0.1116, 0.1128] | |
| X = 32 | 4.5504 | 4.3836 | 0.1668 | [0.1684, 0.1698] | |
| X = 102 | 4.5504 | 3.9991 | 0.5513 | [0.5475, 0.5497] |
From the upper half of Table 6 for N = 2000, all estimates of I[Y; X] are beyond the upper limit of the confidence range of I[Y; X*]. That is, the hypothesis of Y and X being independent is falsely rejected. In contrast, from the lower half of Table 6 for N = 20,000, all estimates of I[Y; X] are either below the lower limit of the confidence range of I[Y; X*] or within the confidence range, except the results based on the largest contingency table. That is, the same independence hypothesis would not be falsely rejected except in the case of the largest contingency table. Such a contrasting comparison between the upper and lower halves of Table 6 clearly indicates that the validity of mutual information evaluations heavily relies on the degree of volatility of cell counts, especially when testing independence. We further explicitly express such volatility below.
A simple reasoning for the above results goes as follows. For this independent setting of Y and X, and for expositional simplicity, let all cells in a contingency table have equal probability. In the smallest 12 × 12 contingency table, the cell probability is 1/144. The cell-count is then a random variable with mean N/144 and variance very close to N/144 as well, so the cell-count falls within three standard deviations of its mean with high probability. With N = 2000, this range is close to [2.7, 25.1], while with N = 20,000 the range is close to [103.5, 174.2]. Based on these two intervals, we can see that the Shannon entropy along each row of the contingency table can be volatile with N = 2000, while this is not the case with N = 20,000. In fact, with N = 20,000, a 12 × 12 contingency table indeed provides much more stable evaluations of mutual information.
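The cell-count volatility argument can be verified with two lines of arithmetic (Poisson-style mean and variance for a 12 x 12 table with equal cell probabilities).

```python
import numpy as np

cells = 12 * 12
for N in (2000, 20_000):
    mean = N / cells
    sd = np.sqrt(mean)      # variance is close to the mean for small cell probabilities
    print(N, round(mean, 1), [round(mean - 3 * sd, 1), round(mean + 3 * sd, 1)])
# N = 2000  -> mean ~ 13.9, 3-sd range ~ [2.7, 25.1]
# N = 20000 -> mean ~ 138.9, 3-sd range ~ [103.5, 174.2]
```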
For the dependent setting (ρ > 0), we report the calculated I[Y; X] and the confidence range of I[Y; X*] in Table 7 across the 16 dimensions of contingency tables with N = 20,000. We observe that the calculated I[Y; X] is far above the upper limit of the confidence range of I[Y; X*] even in the largest contingency table, of dimension 102 × 102. The reason is that the number of effectively occupied cells is much smaller due to the dependency; that is, many cells supposed to be empty are indeed empty. With many empty cells coupled with many occupied cells having relatively large cell counts, the Shannon entropy is evaluated with great stability. The results from these independent and dependent experimental cases together constitute practical guidelines for evaluating mutual information.
Table 7.
Entropies, conditional entropies and mutual information under the dependent setting (ρ > 0) with N = 20,000.
| Y Bin Size | X Bin Size | H[Y] | H[Y given X] | I[Y; X] | 95% CR of I[Y; X*] |
|---|---|---|---|---|---|
| Y = 12 | X = 12 | 2.3317 | 2.1839 | 0.1478 | [0.0058, 0.0062] |
| X = 22 | 2.3317 | 2.1758 | 0.1559 | [0.0114, 0.0119] | |
| X = 32 | 2.3317 | 2.1709 | 0.1609 | [0.0175, 0.0180] | |
| X = 102 | 2.3317 | 2.1270 | 0.2048 | [0.0578, 0.0588] | |
| Y = 22 | X = 12 | 2.9543 | 2.7995 | 0.1548 | [0.0116, 0.0120] |
| X = 22 | 2.9543 | 2.7852 | 0.1692 | [0.0224, 0.0230] | |
| X = 32 | 2.9543 | 2.7750 | 0.1793 | [0.0336, 0.0344] | |
| X = 102 | 2.9543 | 2.7018 | 0.2525 | [0.1125, 0.1139] | |
| Y = 32 | X = 12 | 3.3654 | 3.2043 | 0.1611 | [0.0172, 0.0178] |
| X = 22 | 3.3654 | 3.1864 | 0.1790 | [0.0332, 0.0339] | |
| X = 32 | 3.3654 | 3.1708 | 0.1945 | [0.0492, 0.0501] | |
| X = 102 | 3.3654 | 3.0555 | 0.3099 | [0.1672, 0.1688] | |
| Y = 102 | X = 12 | 4.5415 | 4.3416 | 0.1999 | [0.0583, 0.0590] |
| X = 22 | 4.5415 | 4.2849 | 0.2565 | [0.1117, 0.1131] | |
| X = 32 | 4.5415 | 4.2344 | 0.3070 | [0.1654, 0.1668] | |
| X = 102 | 4.5415 | 3.8806 | 0.6609 | [0.5488, 0.5513] |
The second scenario of Example-3 is about whether the calculated mutual information can reveal the existence of non-linear association between Y and X. We generate two simulated data sets based on two non-linear associations: (1) half-sine function; (2) full-sine function, as shown in the two panels of Figure 4. Within both cases of non-linear associations, it is noted that the correlations of Y and X are basically equal to zero.
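A sketch of the half-sine case (our own noise level; sampling X uniformly on [0, pi] makes the linear correlation essentially zero while Y and X remain strongly associated); the MI evaluation follows the same K-means recipe as before.

```python
import numpy as np
from sklearn.cluster import KMeans

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi_from_table(tab):
    N = tab.sum()
    return entropy(tab.sum(axis=0) / N) - sum(
        r.sum() / N * entropy(r / r.sum()) for r in tab if r.sum() > 0)

rng = np.random.default_rng(5)
N = 20_000
x = rng.uniform(0, np.pi, N)
y = np.sin(x) + rng.normal(0, 0.2, N)                 # half-sine relation, Corr(X, Y) ~ 0

kx = ky = 12
cx = KMeans(n_clusters=kx, n_init=10, random_state=0).fit_predict(x.reshape(-1, 1))
cy = KMeans(n_clusters=ky, n_init=10, random_state=0).fit_predict(y.reshape(-1, 1))
tab = np.zeros((kx, ky), dtype=int)
np.add.at(tab, (cx, cy), 1)
print(round(float(np.corrcoef(x, y)[0, 1]), 3),       # near-zero linear correlation
      round(mi_from_table(tab), 4))                   # clearly positive mutual information
```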
Figure 4.
Scatter-plots of two simulated data sets in sine functional shapes: (A) half-sine function; (B) full-sine function.
In the setting of the half-sine functional relation, we report the calculated I[Y; X] and the confidence range of I[Y; X*] in Table 8 across the 16 dimensions of contingency tables with N = 20,000. Across all 16 dimensions of contingency tables, the calculated values of I[Y; X] are far beyond the upper limits of the confidence ranges of I[Y; X*]. As far as p-values are concerned, they are all essentially zero. The same results are observed in the setting of the full-sine functional relation, as reported in Table 9. These two settings of the non-linear association scenario together demonstrate that the calculated I[Y; X] can reveal the existence of significant association between Y and X. This demonstration is important in the sense that it requires no knowledge of the functional form of their association.
Table 8.
Evaluations of entropy, conditional entropy and mutual information under the half-sine simulation study.
| Y Bin Size | X Bin Size | H[Y] | H[Y given X] | I[Y; X] | CR of I[Y; X*] |
|---|---|---|---|---|---|
| Y = 12 | X = 12 | 2.4840 | 1.7450 | 0.7391 | [0.00600, 0.00632] |
| X = 22 | 2.4840 | 1.7326 | 0.7514 | [0.01137, 0.01183] | |
| X = 32 | 2.4840 | 1.7237 | 0.7603 | [0.01690, 0.01742] | |
| X = 102 | 2.4840 | 1.6977 | 0.7863 | [0.05704, 0.05804] | |
| Y = 22 | X = 12 | 3.0881 | 2.3131 | 0.7749 | [0.01151, 0.01189] |
| X = 22 | 3.0881 | 2.2975 | 0.7906 | [0.02205, 0.02264] | |
| X = 32 | 3.0881 | 2.2853 | 0.8028 | [0.03305, 0.03387] | |
| X = 102 | 3.0881 | 2.2369 | 0.8512 | [0.11254, 0.11387] | |
| Y = 32 | X = 12 | 3.4499 | 2.6679 | 0.7819 | [0.01689, 0.01743] |
| X = 22 | 3.4499 | 2.6466 | 0.8033 | [0.03272, 0.03343] | |
| X = 32 | 3.4499 | 2.6335 | 0.8163 | [0.04928, 0.05009] | |
| X = 102 | 3.4499 | 2.5559 | 0.8940 | [0.17143, 0.17303] | |
| Y = 102 | X = 12 | 4.6133 | 3.7972 | 0.8161 | [0.05663, 0.05757] |
| X = 22 | 4.6133 | 3.7550 | 0.8583 | [0.11237, 0.11375] | |
| X = 32 | 4.6133 | 3.7194 | 0.8939 | [0.17072, 0.17235] | |
| X = 102 | 4.6133 | 3.5164 | 1.0969 | [0.56831, 0.57063] |
Table 9.
Evaluations of entropy, conditional entropy and mutual information under the full-sine simulation study.
| Y Bin Size | X Bin Size | H[Y] | H[Y given X] | I[Y; X] | CR of I[Y; X*] |
|---|---|---|---|---|---|
| Y = 12 | X = 12 | 2.4807 | 2.1916 | 0.2890 | [0.0061, 0.0064] |
| X = 22 | 2.4807 | 2.1822 | 0.2984 | [0.0115, 0.0120] | |
| X = 32 | 2.4807 | 2.1757 | 0.3050 | [0.0171, 0.0176] | |
| X = 102 | 2.4807 | 2.1310 | 0.3497 | [0.0567, 0.0577] | |
| Y = 22 | X = 12 | 3.0692 | 2.7651 | 0.3042 | [0.0114, 0.0118] |
| X = 22 | 3.0692 | 2.7517 | 0.3175 | [0.0223, 0.0229] | |
| X = 32 | 3.0692 | 2.7426 | 0.3266 | [0.0333, 0.0341] | |
| X = 102 | 3.0692 | 2.6671 | 0.4022 | [0.1133, 0.1147] | |
| Y = 32 | X = 12 | 3.4398 | 3.1293 | 0.3105 | [0.0170, 0.0175] |
| X = 22 | 3.4398 | 3.1094 | 0.3303 | [0.0331, 0.0338] | |
| X = 32 | 3.4398 | 3.0980 | 0.3417 | [0.0493, 0.0502] | |
| X = 102 | 3.4398 | 2.9917 | 0.4481 | [0.1717, 0.1735] | |
| Y = 102 | X = 12 | 4.5698 | 4.2185 | 0.3513 | [0.0577, 0.0587] |
| X = 22 | 4.5698 | 4.1752 | 0.3946 | [0.1118, 0.1130] | |
| X = 32 | 4.5698 | 4.1233 | 0.4466 | [0.1679, 0.1694] | |
| X = 102 | 4.5698 | 3.7851 | 0.7848 | [0.5577, 0.5602] |
We now summarize the practical guidelines learned from Example-1 through Example-3 in this section. The most apparent fact is that the calculated values of mutual information vary with respect to the dimensions of the contingency tables. However, the good news is that the amounts of variation are relatively small, and even very minute, when the cell-counts in the contingency table are not too low. Moreover, the calculated mutual information is very capable of revealing the presence and absence of associations underlying the Re-Co dynamics of a response variable Y and a covariate variable X across the three examples and scenarios considered in this section, and varying the contingency tables' dimensions while seeking consistent inferential decisions is a reliable practice. This capability can be made very efficient if we choose the dimensions of the contingency tables to suitably reflect the total sample size of the data set; that is, we achieve such efficiency by varying the dimensions of contingency tables from small to reasonably large. A further guideline is that comparability between two mutual information evaluations rests on their more or less identical computational platforms, that is, on their contingency tables being more or less the same in dimensions. On the other hand, when the averaged cell counts are relatively large, mutual information evaluations are rather robust to some degree of difference in the contingency tables' dimensions. These practical guidelines ensure that mutual information evaluations are always coupled with reliability. Finally, the data types of Y and X are entirely unrestricted because we rely on their categorical nature only.
3. Examples with Complex Re-Co Dynamics
Next, we consider two examples with Re-Co dynamics which are more complex than the three examples discussed in the previous section. Through these two examples, which have independent covariate features, we further illustrate the necessity of following the practical guidelines motivated and learned in the previous section.
3.1. [Example-4]: From Complex Interaction to Further Beyond
After going through three relatively simple examples in the previous section, we now turn to examples with more complex Re-Co dynamics. Consider a functional relation linking Y to the covariate features X_1, X_2, X_3 and X_4 together with an unobservable noise term ε, in which Y is driven by X_1 and by a non-linear interacting effect of (X_2, X_3), with the noise terms being i.i.d. and N = 10,000. That is, X_4 plays the role of an observable noise random variable, while ε is unobservable noise. Our goal is to discover the order-1 major factor X_1 and the order-2 major factor (X_2, X_3). It is worth noting that this order-2 major factor cannot be discovered via linear regression analysis, even when a product-type interaction term is included in the model.
The response variable Y is categorized into 12 bins, and so is each of the 4 covariate features. We calculate the mutual information between Y and every possible covariate feature subset A. If A contains k features, we build the contingency table C[A; Y] for calculating H[Y | A] and evaluating I[Y; A]. Here, A also stands for a fused categorical variable in the sense that the categories of A are all of the occupied kD hypercubes of its feature-members.
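A sketch of this exhaustive subset evaluation under a hypothetical data-generating mechanism of the same flavor (Y driven by X1 plus a non-linear interaction of X2 and X3, with X4 as pure noise); the exact functional form used in the paper's experiment is not restated here, so the mechanism below is illustrative only.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
N, K_BINS = 10_000, 12
X = rng.normal(size=(N, 4))
# Hypothetical mechanism: X1 effect + non-linear (X2, X3) interaction; X4 is noise.
Y = X[:, 0] + np.sin(3 * X[:, 1] * X[:, 2]) + 0.5 * rng.normal(size=N)

def to_bins(v, k=K_BINS):
    """Equal-count (quantile-based) categorization of a 1D feature into k bins."""
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(edges, v)

y_cat = to_bins(Y)
x_cat = np.column_stack([to_bins(X[:, j]) for j in range(4)])

def cond_entropy(y_codes, a_codes):
    """H[Y | A] from the table of fused covariate hypercubes (rows) vs Y bins (columns)."""
    rows = np.unique(a_codes, axis=0, return_inverse=True)[1].reshape(-1)
    tab = np.zeros((rows.max() + 1, y_codes.max() + 1))
    np.add.at(tab, (rows, y_codes), 1)
    N_tot, h = tab.sum(), 0.0
    for r in tab:
        n_r = r.sum()
        if n_r > 0:
            p = r[r > 0] / n_r
            h += (n_r / N_tot) * -np.sum(p * np.log(p))
    return h

for k in (1, 2, 3):
    for A in combinations(range(4), k):
        print(A, round(cond_entropy(y_cat, x_cat[:, A]), 4))
```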
We compute and report the conditional entropies (CEs) for all possible As and arrange them with respect to the sizes of A in Table 10. We also report a term called the successive CE-drop (SCE-drop), defined via the following CE difference:
$$ SCE[A \cup \{X_j\}] = H[Y \mid A] - H[Y \mid A \cup \{X_j\}]. $$
This SCE term is designed to evaluate the extra CE-drop obtained by including one extra feature-member. The above formula is precise in theory. However, reflecting the aforementioned last practical guideline of the previous section, it is essential to note that this difference involves two different settings, A and A ∪ {X_j}, which correspondingly involve two different dimensions of contingency tables: one of C[A; Y] and the other of C[A ∪ {X_j}; Y]. Therefore, based on what we have learned from the previous section, these settings render different scales of conditional entropy and mutual information computations. That is, these different scales will certainly make mutual information evaluations not completely comparable, especially when cell-counts in the contingency tables are overall too small. For instance, in Table 10 the SCE-drop of the pair {X_1, X_2} is 0.0644, which is more than 10 times the individual CE-drop of X_2 (0.0057). It would be a mistake to claim that X_1 and X_2 are conditionally dependent given Y, since the scale in evaluating H[Y | {X_1, X_2}] is different from the scale in evaluating H[Y | X_2]. Nevertheless, since X_4 plays the role of random noise in this example, the information contents of {X_1, X_4} and {X_1} are supposed to be very close from the perspective of their contingency tables. Theoretically, we have H[Y | {X_1, X_4}] = H[Y | X_1]. That is, the evaluated H[Y | {X_1, X_4}] should represent the information content of X_1 upon the setting of a 2-feature contingency table. Along this line of argument, we should refine the SCE-drop of adding X_2 to X_1 as follows:
$$ SCE^{*}[\{X_1, X_2\}] = H[Y \mid \{X_1, X_4\}] - H[Y \mid \{X_1, X_2\}] = 2.1685 - 2.1671 = 0.0014. $$
Using the same argument, this refined SCE-drop should be compared with X_2's individual CE-drop of 0.0057, which is several times larger than 0.0014. Hence, it is obvious that X_1 and X_2 do not have joint interacting effects. In fact, it would be an even more precise evaluation if we used H[Y | {X_1, Z}] as the baseline, with Z being another irrelevant independent random variable. However, according to the guidelines learned from Example-1 and Example-2, H[Y | {X_1, Z}] and H[Y | {X_1, X_4}] should be relatively close because of the sample size of 10,000.
Table 10.
Experiment with the Re-Co dynamics of Example-4 and N = 10,000. Each categorized 1-feature has 12 bins, so a k-feature subset has 12^k kD hypercubes.
| 1-Feature | CE | SCE-Drop | 2-Feature | CE | SCE-Drop | 3-Feature | CE | SCE-Drop | 4-Feature | CE | SCE-Drop |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 2.2315 | 0.2322 | X1_X2 | 2.1671 | 0.0644 | X1_X2_X3 | 0.8362 | 0.8431 | X1_X2_X3_X4 | 0.1762 | 0.6599 |
| X2 | 2.4579 | 0.0057 | X1_X3 | 2.1647 | 0.0667 | X1_X2_X4 | 1.4451 | 0.7219 | |||
| X3 | 2.4575 | 0.0062 | X1_X4 | 2.1685 | 0.0630 | X1_X3_X4 | 1.4531 | 0.7115 | |||
| X4 | 2.4557 | 0.0079 | X2_X3 | 1.6793 | 0.7781 | X2_X3_X4 | 1.2263 | 0.4530 | |||
| X2_X4 | 2.3780 | 0.0777 | |||||||||
| X3_X4 | 2.3831 | 0.0726 |
This line of argument ultimately converges to the following practical guideline on evaluating Information Theoretical measurements via the contingency table platform: "these CE and mutual information measurements are comparable only when they are evaluated under the same dimensions of contingency tables". This guideline is indeed coherent with the statistical concept of conditioning with respect to the observed row-sum vector.
Before summarizing our findings from Table 10, where we reported the calculated CEs and SCE-drops, we need to prepare baseline evaluations to make sure that all CE comparisons are sensible. Here, we recall that C[A; Y] denotes the contingency table with the categories of Y on the column-axis and the categories of the covariate feature subset A on the row-axis.
- 1-feature setting: With C[X_1; Y] having its vector of row-sum proportions, we build an ensemble of mimicked tables C[X_1*; Y] by distributing each column-sum with respect to those proportions. The average of the CEs of the C[X_1*; Y] tables is designed to be comparable with H[Y | X_1]; their difference is a proper and valid measurement of the CE-drop of X_1. Likewise for the remaining covariate features.
- 2-feature setting: With C[A; Y] for a feature-pair A, we likewise build an ensemble of mimicked tables C[A*; Y] and take the average of their CEs; the joint CE-drop of A is calculated as this average minus H[Y | A]. We also need such 2-feature baselines in order to compare CE evaluations of different feature-pairs on contingency tables of the same dimensions and thereby figure out the amount of CE-drop attributable to an added feature.
As for the pair {X_2, X_3}, in comparison with the SCE-drops of {X_2, X_4} and {X_3, X_4}, its SCE-drop of 0.7781 is about 10 times those of {X_2, X_4} (0.0777) and {X_3, X_4} (0.0726), and far more than 10 times the individual CE-drops of X_2 and X_3. This is a very strong indication of the interacting effect of (X_2, X_3) due to the evident presence of their conditional dependency given Y. This fact establishes the feature-pair (X_2, X_3) as an order-2 major factor.
- 3-feature setting: In Table 10, the SCE-drop of the feature-triplet {X_1, X_2, X_3} from the feature-pair {X_2, X_3} is 0.8431, which is about 3.5 times the CE-drop of X_1 (0.2322). This observation could seemingly point to the potential presence of conditional dependency between X_1 and (X_2, X_3). However, we can more precisely calculate the effect of adding X_1 to {X_2, X_3} as H[Y | {X_2, X_3, X_4}] − H[Y | {X_1, X_2, X_3}] = 1.2263 − 0.8362 = 0.3901, and compare it with the corresponding baseline H[Y | {X_2, X_3, Z}] − H[Y | {X_1, X_2, X_3}] with Z being an independent random variable; this quantity is expected to be larger than X_1's 1-feature CE-drop of 0.2322, but clearly smaller than the naive SCE-drop of 0.8431. Therefore, we can only confirm that an ecological effect does exist between X_1 and (X_2, X_3); that is, they can be order-1 and order-2 major factors of Y, respectively. But they certainly do not form a conditional dependency underlying Y; see details of the major factor selection protocol in [15].
3.2. [Example-5]: From High-Order Interaction to Complexity
In order to see the effect of a higher-order major factor, we change the functional form of Y slightly so that the interacting component involves the triplet (X_2, X_3, X_4) instead of the pair (X_2, X_3).
With sample size N = 10,000, our computational results are reported in Table 11. Likewise, we can confirm X_1 as an order-1 major factor and the triplet (X_2, X_3, X_4) as an order-3 major factor. In sharp contrast, the evidence for the order-3 major factor seems to disappear when the sample size is much smaller, as shown in Table 12. This is an exact demonstration of the effect of the finite sample phenomenon, or curse of dimensionality. Do these two contrasting results, presence and absence of the order-3 major factor under the large and small sample sizes, respectively, mean that we should give up looking for high-order major factors in small data sets?
Table 11.
Experiment with the Re-Co dynamics of Example-5 and N = 10,000. Each categorized 1-feature has 12 bins, so a k-feature subset has 12^k kD hypercubes.
| 1-Feature | CE | CE-Drop | 2-Feature | CE | SCE-Drop | 3-Feature | CE | SCE-Drop | 4-Feature | CE | SCE-Drop |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 2.2299 | 0.2295 | X1_X2 | 2.1636 | 0.0662 | X1_X2_X3 | 1.4444 | 0.7191 | X1_X2_X3_X4 | 0.1945 | 1.0367 |
| X2 | 2.4539 | 0.0055 | X1_X3 | 2.1671 | 0.0627 | X1_X2_X4 | 1.4576 | 0.7059 | |||
| X3 | 2.4550 | 0.0044 | X1_X4 | 2.1645 | 0.0653 | X1_X3_X4 | 1.4473 | 0.7171 | |||
| X4 | 2.4529 | 0.0065 | X2_X3 | 2.3800 | 0.0739 | X2_X3_X4 | 1.2313 | 1.1455 | |||
| | | | X2_X4 | 2.3800 | 0.0728 | | | | | | |
| | | | X3_X4 | 2.3768 | 0.0760 | | | | | | |
Table 12.
Experiment with the same functional form of Y and a much smaller sample size. Each categorized 1-feature has 12 bins, so a k-feature setting has 12^k kD hypercubes.
| 1-Feature | CE | CE-Drop | 2-Feature | CE | SCE-Drop | 3-Feature | CE | SCE-Drop | 4-Feature | CE | SCE-Drop |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 2.1873 | 0.2572 | X1_X2 | 1.5863 | 0.6010 | X1_X2_X3 | 0.3657 | 1.2022 | X1_X2_X3_X4 | 0.0207 | 0.2947 |
| X2 | 2.3945 | 0.0500 | X1_X3 | 1.5679 | 0.6193 | X1_X2_X4 | 0.3155 | 1.2601 | |||
| X3 | 2.3789 | 0.0655 | X1_X4 | 1.5757 | 0.6116 | X1_X3_X4 | 0.3258 | 1.2421 | |||
| X4 | 2.3819 | 0.0625 | X2_X3 | 1.6502 | 0.7286 | X2_X3_X4 | 0.3553 | 1.2718 | |||
| | | | X2_X4 | 1.6272 | 0.7547 | | | | | | |
| | | | X3_X4 | 1.6387 | 0.7402 | | | | | | |
The answer to the above question is negative; that is, we can somehow escape from the curse of dimensionality in our pursuit of high-order major factors. Here we demonstrate one way of escaping. We perform K-means clustering on the 3D data points of (X2, X3, X4) with 12, 36, 72 and 144 clusters, and use the resulting cluster labels to build a new condensed covariate feature. The CEs of this new feature with respect to the four corresponding contingency tables are reported in Table 13, with Y being categorized into 12 and 32 categories (clusters) via K-means. In the case of 12 clusters on Y, we see that the observed CE of the new feature falls increasingly far below the mean CE of its simulated counterparts, from about 20 to about 60 standard deviations (sd), as the number of clusters of the new feature increases from 12 to 144. We observe similar evidence in the case of 32 categories on Y.
Table 13.
Exploring the presence of (X2, X3, X4) as an order-3 major factor of Y under the smaller sample size of Table 12, with respect to 2 and 4 choices of the numbers of clusters of Y and of the condensed covariate feature, respectively. The confidence intervals are calculated based on 100 simulations.
| Y’s Size | Condensed Feature’s Size | Observed CE | Mean of Simulated CEs | CR of Simulated CEs |
|---|---|---|---|---|
| 12 | 12 | 2.345 | 2.394 | [2.393, 2.396] |
| | 36 | 2.039 | 2.195 | [2.192, 2.198] |
| | 72 | 1.783 | 1.981 | [1.978, 1.984] |
| | 144 | 1.409 | 1.652 | [1.648, 1.655] |
| 32 | 12 | 3.141 | 3.192 | [3.190, 3.194] |
| | 36 | 2.651 | 2.790 | [2.787, 2.794] |
| | 72 | 2.180 | 2.385 | [2.382, 2.388] |
| | 144 | 1.720 | 1.888 | [1.885, 1.892] |
We can then confirm this new K-means-based feature as a new order-1 major factor, which is a condensed version of (X2, X3, X4). Therefore, we should also claim that (X2, X3, X4) is indeed an order-3 major factor. This is an important and significant demonstration that we can be sure about the presence of high-order major factors even when the sample size is relatively small; that is, the curse of dimensionality is escapable.
Further, by contrasting Table 13 with Table 12, we see that the biases of mutual information estimates can indeed be managed by reducing the large number of bins, cells or hypercubes on the covariate side. That is, a small number of clusters can be derived via a clustering approach of choice.
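The clustering-based escape route just described can be sketched as follows. This is a minimal illustration using scikit-learn's KMeans; the function names, column indices and cluster counts in the usage comments are our own demonstration choices rather than the authors' code. A block of covariates is condensed into one categorical feature, and the resulting contingency table against the categorized response can then be evaluated with the helpers from the earlier sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def condense_by_kmeans(X_block, n_clusters, seed=0):
    """Replace an (N, k) block of covariate columns by one categorical feature:
    the K-means cluster label of each row."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(X_block)            # labels in {0, ..., n_clusters - 1}

def contingency(y_cat, x_cat):
    """Contingency table with covariate categories on the rows and Y on the columns."""
    table = np.zeros((x_cat.max() + 1, y_cat.max() + 1), dtype=int)
    np.add.at(table, (x_cat, y_cat), 1)
    return table

# Hypothetical usage, with `data` an (N, 4) array of (X1, X2, X3, X4) and `y_cat`
# the K-means-categorized response: condense (X2, X3, X4) into 72 clusters and
# compare the observed CE with the simulated baseline from the earlier sketch.
# x_new = condense_by_kmeans(data[:, 1:4], n_clusters=72)
# tab = contingency(y_cat, x_new)
# ce_obs = conditional_entropy(tab)
# ce_mean, ce_sd = baseline_ce(tab)
# z = (ce_mean - ce_obs) / ce_sd   # how many sd the observed CE sits below the baseline
```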
4. Examples with Complex Re-Co Dynamics with Dependent Covariate Features
In this section, we conduct one experimental Re-Co dynamics defined by linear structures with slightly dependent covariate features, as specified below. That is, this experiment sits in the classic linear regression domain. However, there are two twists included in this experiment. The first twist is that there exist two nearly collinear 3D hyperplanes pertaining to two triplets of covariate features. The second twist is that, when a continuous measurement data type is altered into a categorical one, we understand that we discard very fine-scale information of the measurements, often together with some degree of ordinal relational information. Nevertheless, this act of investment, sacrificing some information in the data, is necessary for carrying out our CE computations in the quest for the critical authentic information content contained in the data. On the other hand, it is natural to ask the following question: when linear regression analysis is applied to such a categorized data set, should we expect its conclusions to be close to the true linear structure?
In this section, we investigate the aforementioned two twists in order to understand the general effects of dependence on conditional entropy evaluations, and we also address the above question. Particular focus is placed on issues linked to the validity of Information Theoretical measurements and their reliability evaluations. We would like to demonstrate the comparisons between classical statistics and CEDA's major factor selection in the quest into Re-Co dynamics.
4.1. [Example-6]: From Dependency Induced Complications to Reality
Consider a Re-Co dynamics defined by linear structures with slightly dependent covariate-features:
where the covariate features are generated with a prescribed covariance matrix (not including the noise term). Features X7 through X10 play the roles of unrelated, but dependent, noise features. The design of this Example-6 is to have a seemingly dominant order-1 major factor candidate: feature X6. We want to explore whether or not we could discover the true structure underlying the Re-Co dynamics, which is a collection of 3 order-1 major factors: (X1, X2, X3). We would also like to see what realistic computational issues are generated by the dependency among all covariate features.
One million 11-dimensional data points are simulated and collected as the data set. We apply our CE computations after categorizing all 1D covariate features and the response feature into 22 bins via the same scheme used in the previous section. CEs are calculated for all possible feature-sets via the contingency table platform. For expositional purposes, we only report the CE values of 10 key characteristic feature-sets in each of the 1-feature to 6-feature settings in Table 14. The summary of our findings based on major factor selection is reported below.
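Concretely, the k-feature contingency tables behind Table 14 can be assembled from categorized columns along the following lines. The equal-count binning and the helper names below are illustrative assumptions (the paper's own categorization scheme is the one referenced above), and the contingency() and conditional_entropy() helpers are those defined in the earlier sketches.

```python
import numpy as np

def categorize(col, n_bins=22):
    """Equal-count binning of one column into indices 0, ..., n_bins - 1
    (an assumed scheme; the paper's own binning is referenced above)."""
    ranks = np.argsort(np.argsort(col, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(col)

def hypercube_labels(X_cat, features):
    """Encode a k-feature subset of categorized columns as one label per row,
    i.e., the index of the occupied k-dimensional hypercube."""
    labels = np.zeros(X_cat.shape[0], dtype=np.int64)
    for j in features:
        labels = labels * (X_cat[:, j].max() + 1) + X_cat[:, j]
    _, labels = np.unique(labels, return_inverse=True)   # keep only occupied hypercubes
    return labels

# Hypothetical usage with `X` (N x 10) and `y` as the simulated arrays:
# Xc = np.column_stack([categorize(X[:, j]) for j in range(X.shape[1])])
# y_cat = categorize(y)
# tab = contingency(y_cat, hypercube_labels(Xc, [0, 1, 2]))   # feature-set (X1, X2, X3)
# ce = conditional_entropy(tab)
```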
Table 14.
Example-6 with one million simulated data points. Each categorized 1-feature has 22 bins, so a k-feature setting has 22^k kD hypercubes.
| 1-Feature | CE | 2-Feature | CE | 3-Feature | CE | 4-Feature | CE | 5-Feature | CE | 6-Feature | CE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X6 | 2.3351 | X4_X6 | 2.1321 | X1_X2_X3 | 0.7543 | X1_X2_X3_X7 | 0.5602 | X1_X2_X3_X7_X8 | 0.1020 | X1_X2_X3_X7_X8_X9 | 0.0065 |
| X3 | 2.7295 | X1_X6 | 2.2439 | X4_X5_X6 | 1.0746 | X1_X2_X3_X6 | 0.6201 | X1_X2_X3_X6_X9 | 0.1723 | X1_X2_X3_X6_X7_X8 | 0.0132 |
| X1 | 2.7308 | X1_X2 | 2.3184 | X1_X2_X6 | 2.0049 | X4_X5_X6_X8 | 0.8789 | X1_X7_X8_X9_X10 | 0.2255 | X1_X4_X5_X7_X8_X9 | 0.0150 |
| X2 | 2.7310 | X6_X7 | 2.3309 | X1_X4_X6 | 2.0239 | X1_X4_X5_X6 | 0.8965 | X4_X5_X6_X8_X9 | 0.2355 | X1_X2_X3_X5_X6_X8 | 0.0202 |
| X9 | 2.9879 | X3_X7 | 2.7010 | X4_X6_X7 | 2.0771 | X2_X3_X5_X7 | 1.4054 | X1_X4_X5_X6_X8 | 0.2681 | X2_X3_X6_X7_X8_X9 | 0.0211 |
| X8 | 2.9880 | X3_X4 | 2.7012 | X3_X6_X9 | 2.1765 | X4_X6_X7_X9 | 1.4468 | X5_X6_X7_X8_X9 | 0.2719 | X4_X5_X6_X7_X8_X9 | 0.0240 |
| X7 | 2.9882 | X7_X8 | 2.9516 | X1_X2_X7 | 2.2328 | X6_X7_X8_X9 | 1.4605 | X2_X5_X6_X8_X9 | 0.3022 | X1_X4_X5_X6_X7_X8 | 0.0280 |
| X4 | 2.9882 | X5_X7 | 2.9520 | X6_X7_X8 | 2.2572 | X1_X6_X8_X9 | 1.4752 | X1_X4_X6_X7_X8 | 0.3035 | X1_X2_X5_X6_X8_X9 | 0.0280 |
| X5 | 2.9883 | X4_X5 | 2.9522 | X1_X7_X9 | 2.5849 | X1_X7_X8_X9 | 1.5458 | X1_X2_X4_X5_X6 | 0.3236 | X1_X2_X4_X5_X6_X8 | 0.0329 |
| X10 | 2.9883 | X4_X7 | 2.9523 | X7_X8_X9 | 2.8139 | X7_X8_X9_X10 | 1.6278 | X1_X2_X5_X6_X9 | 0.3427 | X1_X2_X3_X4_X5_X6 | 0.0584 |
- 1. In the 1-feature setting, X6 has the lowest CE and the members of (X1, X2, X3) are in the second tier with the middle range of CEs, while the rest of the covariate features are in the third tier with the highest CEs. Therefore, X6 and each member of (X1, X2, X3) are potential order-1 major factor candidates. It is noted that, even though the marginal entropy of Y is available in the 0-feature setting, it is more proper to use the simulated baseline CEs of the 1-feature setting, due to the contingency tables' dimension-change from the 0-feature setting to the 1-feature setting, as we have argued in the previous two sections.
- 2. In the 2-feature setting, we take the pair (X4, X6) and compare its CE-drop with the individual CE-drops of X4 and X6. Since the individual CE-drop of X4 is basically zero while the pair's drop clearly exceeds X6's individual drop, we know that X4 and X6 are potentially conditionally dependent given Y; so are X5 and X6. Likewise, we calculate the CE-drops of (X1, X6), X1 and X6, and find that the CE-drop of (X1, X6) is smaller than the sum of the CE-drops of X1 and X6. This is the first piece of evidence that X6 and any individual member of (X1, X2, X3) cannot be order-1 major factors simultaneously.
In contrast, the CE-drop of (X1, X2) is only slightly larger than the sum of the individual CE-drops of X1 and X2. This evidence of the so-called ecological effect indicates that X1 and X2 are not significantly conditionally dependent given Y, but they can be order-1 major factors simultaneously. Likewise for (X1, X3) and for (X2, X3).
- 3. In the 3-feature setting, we take the triplets (X1, X2, X3) and (X4, X5, X6) and calculate their CE-drops. Though these two CE-drops are more than 3 times the sums of the individual CE-drops of their member features, we do not claim that the two triplets are potential candidates of order-3 major factors, since no conditional-dependency claims were confirmed among the members of these triplets in the 2-feature setting. However, we claim that (X1, X2, X3) is the chief collection of 3 order-1 major factors, while (X4, X5, X6) is an alternative collection of 3 order-1 major factors.
- 4. In the 4-feature setting, we see, for instance, that the CE of (X1, X2, X3, X7) in Table 14 is significantly smaller than that of (X1, X2, X3, X6), even though X7 is a noise feature and X6 is not. As expected, this is evidence of the effect of the curse of dimensionality, since the average cell count is less than 1 in this setting (see the short check after this list). Therefore, we cannot make any structural claims here. (It is also reasonable to expect that, if the number of bins were reduced to 10, the 4-feature setting might yield stable evaluations of mutual information.)
- 5. In the 5-feature and 6-feature settings, no credible claims can be made due to the curse of dimensionality.
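As a quick back-of-the-envelope check of the cell-count concern raised in items 4 and 5 above, the average count per cell of a Y-by-k-feature contingency table can be computed directly; the snippet below uses the 22-bin, one-million-point setting of this example.

```python
# Average cell count of the contingency table with Y (22 bins) on the columns
# and a k-feature covariate set (22 bins per feature, 22**k hypercubes) on the rows.
N = 1_000_000
bins = 22
for k in range(1, 7):
    cells = bins * bins ** k
    print(f"k = {k}: {cells:>13,} cells, average count = {N / cells:.4f}")
```

With 22 bins, the average count already falls below 1 at the 4-feature setting, in line with the warnings in items 4 and 5.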
Our conclusion in the 3-feature setting, namely the chief collection of order-1 major factors (X1, X2, X3) together with one secondary alternative collection (X4, X5, X6), is an unusual, but precise, statement. This statement is in sharp contrast with classic regression analysis. For instance, for comparison purposes, we perform LASSO regression, which can be written in the following standard Lagrangian form with penalty $\lambda \ge 0$: $\hat{\beta}(\lambda) = \arg\min_{\beta} \left\{ \tfrac{1}{2N} \lVert Y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\}$.
As shown in Figure 5, the joint presence of X1, X2, X3 and X6 is seen over the whole considered range of the penalty. Specifically, the observed pattern is that the parameters of the members of (X1, X2, X3) decrease linearly from 1, while the parameter of X6 increases linearly from 0. Such linearity is primarily due to the L1 penalty term. None of these trajectories is correct for the Re-Co dynamics except when the penalty is at (or near) zero, and even then the result only reports (X1, X2, X3), but not (X4, X5, X6).
Figure 5.
Results of the parameters in Example-6 via LASSO with respect to a spectrum of penalty values. The three curves of X1, X2 and X3 completely overlap with each other.
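To give a sense of how such coefficient trajectories can be produced, the sketch below traces LASSO solutions over a grid of penalty values with scikit-learn's lasso_path. The simulated design is a hypothetical stand-in (a response driven by X1, X2 and X3, with X6 constructed to be nearly collinear with the other covariates); it is not the exact generating equation of Example-6.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
N = 10_000

# Hypothetical stand-in design: X6 is built to be nearly collinear with the
# other covariates, and the remaining columns act as unrelated noise features.
X = rng.normal(size=(N, 10))
X[:, 5] = X[:, :5].sum(axis=1) + 0.01 * rng.normal(size=N)
y = X[:, 0] + X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=N)

# Coefficient trajectories over a decreasing grid of penalty values.
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
for j in (0, 1, 2, 5):                      # columns for X1, X2, X3, X6
    print(f"X{j + 1}: coef = {coefs[j, 0]:+.3f} at the largest penalty, "
          f"{coefs[j, -1]:+.3f} at the smallest")
```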
We conclude that, though the LASSO's man-made penalty constraint is seemingly coupled with some desirable interpretations, its optimization protocol clearly cannot handle a landscape having two equally probable “deep wells”. In sharp contrast, our major factor selection protocol has no problem at all in identifying and confirming the two collections of three order-1 major factors, even though these two collections cannot co-exist. This result is reiterated in the next subsection as well. This capability is the chief merit of employing Information Theoretical measurements in major factor selection.
Further, we conduct the least squares estimation based on all categorized data and report the results in Table 15. We can see that the resulting estimates give rise to a mixed-up and wrong linear structure. That is, the categorizing scheme, which heterogeneously alters the locations and scales of the original data, has indeed destroyed the data's intrinsic metric characteristics. From this perspective, we understand that the categorical nature of data is suitable for Information Theoretical measurements, but not for linear regression models and their variants.
Table 15.
Results of parameters in linear regression with categorized data.
| | Estimate | Std. Error | t Value | p-Value |
|---|---|---|---|---|
| (intercept) | −0.776 | 0.013 | −59.57 | 0.000 |
| X1 | 0.334 | 0.004 | 819.68 | 0.000 |
| X2 | 0.334 | 0.004 | 820.21 | 0.000 |
| X3 | 0.334 | 0.004 | 820.05 | 0.000 |
| X4 | −0.232 | 0.004 | −568.08 | 0.000 |
| X5 | −0.231 | 0.004 | −566.27 | 0.000 |
| X6 | 0.528 | 0.008 | 624.12 | 0.000 |
| X7 | −0.0002 | 0.001 | −0.94 | 0.3462 |
| X8 | 0.0001 | 0.001 | 0.56 | 0.5735 |
| X9 | −0.0002 | 0.001 | −1.40 | 0.1622 |
| X10 | −0.0002 | 0.001 | −1.13 | 0.2567 |
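For completeness, a least-squares fit on categorized (bin-index) data of the kind reported in Table 15 can be set up as below. The simulated design and the equal-count binning are hypothetical illustrative choices, so the printed coefficients are only a template for this sort of check, not a reproduction of Table 15.

```python
import numpy as np

def categorize(col, n_bins=22):
    """Equal-count binning into indices 0, ..., n_bins - 1 (an assumed scheme)."""
    ranks = np.argsort(np.argsort(col, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(col)

rng = np.random.default_rng(0)
N = 100_000
X = rng.normal(size=(N, 10))                                  # hypothetical covariates
X[:, 5] = X[:, :5].sum(axis=1) + 0.01 * rng.normal(size=N)    # near-collinear X6
y = X[:, 0] + X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=N)    # hypothetical response

# Ordinary least squares on the bin-index (categorized) versions of the data.
Xc = np.column_stack([categorize(X[:, j]) for j in range(10)]).astype(float)
yc = categorize(y).astype(float)
design = np.column_stack([np.ones(N), Xc])                    # prepend an intercept
beta, *_ = np.linalg.lstsq(design, yc, rcond=None)
print(np.round(beta, 3))        # intercept followed by the ten slope estimates
```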
4.2. Escaping from the Curse of Dimensionality
In Example-6's 6-feature setting, the feature-set (X1, X2, X3, X4, X5, X6) achieves the largest CE among the reported 6-feature sets, at least 7 times the CE of (X1, X2, X3, X7, X8, X9), even though the latter contains three noise features. Such comparisons are invalid due to the finite-sample phenomenon, or the curse of dimensionality, since there are more than two billion (22^7 ≈ 2.5 × 10^9) 7D hypercubes for just one million data points. How can we escape from the potential effects of the curse of dimensionality on the estimations of the CEs of (X1, X2, X3) and (X4, X5, X6)?
Again, we apply the simple approach of the K-means clustering algorithm. We first apply K-means to obtain 22 clusters based on the one million 3D data points of (X1, X2, X3), of (X4, X5, X6) and of a triplet of the noise features, respectively, and denote the three resulting condensed categorical variables accordingly. Upon these three new covariate variables, we calculate the CEs (of Y) under the 1-feature and 2-feature settings; see Table 16. We consistently confirm that the condensed versions of (X1, X2, X3) and (X4, X5, X6) are not conditionally dependent given Y. Therefore, the two feature-triplets (X1, X2, X3) and (X4, X5, X6) are two separate chief and alternative collections of three order-1 major factors.
Table 16.
Escaping from the curse of dimensionality in Example-6.
| Experiments | 1-Feature | CE | 2-Feature | CE |
|---|---|---|---|---|
| | condensed (X1, X2, X3) | 1.9317 | | 1.8206 |
| | condensed (X4, X5, X6) | 2.4734 | | 1.9195 |
| | condensed noise triplet | 2.9450 | | 2.4555 |
5. Conclusions
The most fundamental concept underlying all the practical guidelines we have learned from the series of increasingly complex examples in this paper is this: the comparability of evaluations of conditional entropy and mutual information critically rests on the equality of the dimensions of the contingency tables on which these evaluations are carried out. Based on this comparability concept, the focal goal of data analysis is then rephrased in terms of the [C1: confirmable] criterion regarding the presence and absence of major factors underlying a designated Re-Co dynamics. In other words, it is essential to note that there is no need for precise theoretical information measurements in real data analysis. The [C1: confirmable] criterion pertaining to the discovery of major factors subsequently centers all practical guidelines on the task of confirming or debunking an existential collection of major factors of various orders. Since the presence and absence of such an existential collection of major factors indeed manifest the data's authentic information content, from this information-content perspective the task of data analysis as a whole is translated into the single issue of major factor selection.
Furthermore, all practical guidelines on evaluating mutual information for our major factor selection protocol are largely aimed at ascertaining the [C1: confirmable] criterion against the effects of the curse of dimensionality, or the finite-sample phenomenon. Practically, we learn to be keenly aware of the dangers posed by low cell-counts in potentially occupied cells when evaluating entropy measures. We also develop clustering-based approaches to lessen the effect of the curse of dimensionality. Having learned all these practical guidelines, we are confident in applying our major factor selection protocol and related Categorical Exploratory Data Analysis (CEDA) techniques to real-world structured data sets.
In many scientific fields, such as biology, medicine, psychology and the social sciences, measurements are not always precisely metric. Even within a metric system, a continuous measurement is often grouped and converted into a discrete or even ordinal data format. That is, the very fine-scale details of a data point are likely given up because they are either too costly to measure, cannot be measured at all, or need to be discarded for practical computational considerations. Therefore, any structured data set likely consists of some features having incomparable measurement scales and some features having no scales at all. How to analyze such a data set in a coherent fashion is not at all a simple task. CEDA is a data analysis paradigm designed to be coherent with all features' measurement scales. So, CEDA and its major factor selection protocol are developed to embrace the following ideal: each single feature must be allowed to contribute its own authentic information locally, and then to congregate and weave patterns that reveal heterogeneity at the global, medium and fine-scale levels.
To facilitate and carry out such a fundamental concept of data analysis, CEDA rests exclusively on one simple fact: all data types embed a categorical nature. Hence, all pieces of local information derived from all categorical or categorized features are comparable, and all these information pieces can then be woven together to reveal multiscale heterogeneity. By doing so, no man-made assumptions or structures are needed in CEDA, so the information brought out by CEDA is authentic. That is, we can be free from the danger of generating misinformation via data analysis involving unrealistic assumptions or structures.
To achieve the aforementioned goals of CEDA via our major factor selection protocol, we definitely need stable and credible evaluations of conditional entropy and mutual information underlying any targeted Re-Co dynamics of interest. That is why the practical guidelines learned in this paper are essential and significant. On the other hand, these practical guidelines also reveal the flexibility and capability of CEDA and its major factor selection in helping scientists extract intelligence from their own data sets.
As a final remark, we clearly demonstrate in this paper that, by reframing many key statistical topics within one Re-Co dynamics framework, CEDA and its major factor selection protocol not only can resolve the original data analysis tasks, but also, more importantly, can shed authentic light on issues related to widely expanded frameworks containing the original statistical topics. This manifests the capacity of CEDA and its major factor selection protocol to truly accommodate and resolve real-world scientific problems.
Finally, we conclude that the learned practical guidelines for evaluating CE and mutual information would allow scientists to effectively carry out CEDA and its major factor selection protocol to extract data's visible and authentic information content, which is taken as the ultimate goal of data analysis.
Author Contributions
Conceptualization, T.-L.C. and H.F.; methodology, T.-L.C. and H.F.; software, E.P.C.; validation, T.-L.C.; formal analysis, E.P.C.; investigation, T.-L.C. and H.F.; resources, H.F.; data curation, E.P.C.; writing—original draft preparation, H.F.; writing—review and editing, T.-L.C., H.F. and E.P.C.; visualization, E.P.C.; supervision, H.F.; project administration, H.F.; funding acquisition, H.F. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflict of interest.
Funding Statement
This research received no external funding.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Faes L., Porta A. Conditional Entropy-Based Evaluation of Information Dynamics in Physiological Systems. In: Wibral M., Vicente R., Lizier J., editors. Directed Information Measures in Neuroscience. Springer; Berlin/Heidelberg, Germany: 2014. Understanding Complex Systems.
- 2. Wibral M., Vicente R., Lizier J. Directed Information Measures in Neuroscience. Springer; Berlin/Heidelberg, Germany: 2014. Understanding Complex Systems.
- 3. Child D. The Essentials of Factor Analysis. 3rd ed. Continuum International Publishing Group; New York, NY, USA: Bloomsbury Academic Press; London, UK: 2006.
- 4. Contreras-Reyes J.E., Hernandez-Santoro C. Assessing Granger-Causality in the Southern Humboldt Current Ecosystem Using Cross-Spectral Methods. Entropy. 2020;22:1071. doi: 10.3390/e22101071.
- 5. Gell-Mann M. What is complexity? Complexity. 1995;1:16–19. doi: 10.1002/cplx.6130010105.
- 6. Adami C. What is Complexity? BioEssays. 2002;24:1085–1094. doi: 10.1002/bies.10192.
- 7. Anderson P.W. More is different. Science. 1972;177:393–396. doi: 10.1126/science.177.4047.393.
- 8. Lehmann E.L., Romano J.P. Testing Statistical Hypotheses. 3rd ed. Springer; New York, NY, USA: 2005.
- 9. Fisher R.A. Statistical Methods for Research Workers. Oliver and Boyd; Edinburgh, UK: 1925.
- 10. Scheffé H. The Analysis of Variance. Wiley; New York, NY, USA: 1959.
- 11. McCullagh P., Nelder J. Generalized Linear Models. 2nd ed. Chapman and Hall; Boca Raton, FL, USA: 1989.
- 12. Christensen R. Log-Linear Models and Logistic Regression. 2nd ed. Springer; New York, NY, USA: 1997.
- 13. Fushing H., Chou E.P. Categorical Exploratory Data Analysis: From Multiclass Classification and Response Manifold Analytics perspectives of baseball pitching dynamics. Entropy. 2021;23:792. doi: 10.3390/e23070792.
- 14. Fushing H., Chou E.P., Chen T.-L. Mimicking complexity of structured data matrix’s information content: Categorical Exploratory Data Analysis. Entropy. 2021;23:594. doi: 10.3390/e23050594.
- 15. Chen T.-L., Chou E.P., Fushing H. Categorical Nature of Major Factor Selection via Information Theoretic Measurements. Entropy. 2022;23:1684. doi: 10.3390/e23121684.
- 16. Chou E.P., Chen T.-L., Fushing H. Unraveling Hidden Major Factors by Breaking Heterogeneity into Homogeneous Parts within Many-System Problems. Entropy. 2022;24:170. doi: 10.3390/e24020170.
- 17. Fushing H., Chou E.P., Chen T.-L. Multiscale major factor selections for complex system data with structural dependency and heterogeneity. arXiv. 2022; arXiv:2209.02623.
- 18. Cover T.M., Thomas J.A. Elements of Information Theory. Wiley; New York, NY, USA: 1991.
- 19. Paninski L. Estimation of Entropy and Mutual Information. Neural Comput. 2003;15:1191–1253. doi: 10.1162/089976603321780272.
- 20. Kraskov A., Stögbauer H., Grassberger P. Estimating mutual information. Phys. Rev. E. 2004;69:066138. doi: 10.1103/PhysRevE.69.066138.
- 21. Brown G., Pocock A., Zhao M., Lujan M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012;13:27–66.
- 22. Vergara J., Estevez P. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014;24:175–186. doi: 10.1007/s00521-013-1368-0.
- 23. Bennasar M., Hicks Y., Setchi R. Feature selection using Joint Mutual Information Maximisation. Expert Syst. Appl. 2015;42:8520–8532. doi: 10.1016/j.eswa.2015.07.007.
- 24. Zhao X., Shang P., Huang J. Mutual-information matrix analysis for nonlinear interactions of multivariate time series. Nonlinear Dyn. 2017;88:477–487. doi: 10.1007/s11071-016-3254-7.
- 25. Fushing H., Roy T. Complexity of Possibly-gapped Histogram and Analysis of Histogram (ANOHT). R. Soc. Open Sci. 2018;5:171026. doi: 10.1098/rsos.171026.
- 26. Grenander U. Abstract Inference. Wiley; New York, NY, USA: 1981.