Entropy. 2022 Jan 17;24(1):135. doi: 10.3390/e24010135

An Information Theoretic Interpretation to Deep Neural Networks

Xiangxiang Xu 1, Shao-Lun Huang 1,*, Lizhong Zheng 2, Gregory W Wornell 2
Editor: Raúl Alcaraz
PMCID: PMC8774347  PMID: 35052161

Abstract

With the unprecedented performance achieved by deep learning, it is commonly believed that deep neural networks (DNNs) attempt to extract informative features for learning tasks. To formalize this intuition, we apply the local information geometric analysis and establish an information-theoretic framework for feature selection, which demonstrates the information-theoretic optimality of DNN features. Moreover, we conduct a quantitative analysis to characterize the impact of network structure on the feature extraction process of DNNs. Our investigation naturally leads to a performance metric for evaluating the effectiveness of extracted features, called the H-score, which illustrates the connection between the practical training process of DNNs and the information-theoretic framework. Finally, we validate our theoretical results by experimental designs on synthesized data and the ImageNet dataset.

Keywords: deep neural network, information theory, local information geometry, feature extraction

1. Introduction

Due to the striking performance of deep learning in various application fields, deep neural networks (DNNs) have gained great attention in modern computer science. While it is commonly understood that the features extracted from the hidden layers of a DNN are “informative” for learning tasks, the mathematical meaning of informative features in DNNs is generally not clear. From the practical perspective, DNN models have obtained unprecedented performance in various tasks, such as image recognition [1], language processing [2,3], and games [4,5]. However, the understanding of the feature extraction behind these models is relatively lacking, which poses challenges for their application in security-sensitive tasks, such as autonomous driving.

To address this problem, there have been numerous research efforts, including both experimental and theoretical studies [6]. The experimental studies usually focus on some empirical properties of the feature extracted by DNNs, by visualizing the feature [7] or testing its performance on specific training settings [8] or learning tasks [9]. Though such empirical methods have provided some intuitive interpretations, the performance can highly depend on the data and network architecture used. For example, while the feature visualization works well on convolutional neural networks, its application to other networks is typically less effective [10].

In contrast, theoretical studies focus on the analytical properties of the extracted feature or the learning process in DNNs. Due to the complicated structure of DNNs, existing studies were often restricted to networks of specific structures, e.g., networks with infinite width [11] or two-layer networks [12,13], to characterize the theoretical behaviors. However, the interpretation of the optimal feature remains unclear, which limits their further applications. To obtain better interpretability, tools and measures from information theory [14] have recently been applied to connect DNNs with general information processing problems [15]. For instance, the information bottleneck [16,17] employs the mutual information as the metric to quantify the informativeness of features in DNNs, and other information metrics, such as the Kullback–Leibler (KL) divergence [18] and the Wasserstein distance [19], are also used in different problems. However, there is still a disconnection between these information metrics and the performance objectives of the inference tasks that DNNs want to solve [20]. Therefore, it is, in general, difficult to match DNN learning with the optimization of a particular information metric.

This paper aims to provide an information-theoretic interpretation to the feature extraction process in DNNs, to bridge the gap between the practical deep learning implementations and information-theoretic characterizations. To this end, we first propose an information-theoretic feature selection framework, which establishes an information metric to measure the performance of each given feature in inference tasks. In addition, we demonstrate that the optimal features extracted by DNNs coincide with the solutions of the information-theoretic feature selection problem, which share the same performance metric. Therefore, our results give an explicit interpretation of the learning goal of the back-propagation (BackProp) and stochastic gradient descent (SGD) operations in deep learning [21], which also lead to a performance metric for evaluating the effectiveness of the extracted features. Finally, we validate our theoretic characterizations using numerical experiments on both synthesized data and the ImageNet [22] dataset for image classification.

2. Preliminaries and Methods

2.1. Methodological Background

The main method used in our development is local information geometry [23,24], which characterizes the local geometric properties of the probability distribution space. The local information geometric method is closely related to the conventional Hirschfeld–Gebelein–Rényi (HGR) maximal correlation [25,26,27] problem, which has attracted increasing interest in the information theory community [28,29,30,31,32,33], and has also been applied in data analysis [34] and privacy studies [35].

Specifically, we use the local information geometric method to construct and investigate an information-theoretic feature selection problem in Section 3.1, which leads to an information metric of features and also demonstrates an SVD (singular value decomposition) structure of the feature selection process. Following the same analysis framework, we characterize the optimal feature extracted by DNNs in Section 3.2, and demonstrate that the same SVD structure is shared by DNNs. Based on the established connection, we then propose an effectiveness measure for DNNs, with details presented in Section 3.3.

2.2. Notations

Throughout this paper, we use $X$, $\mathcal{X}$, $P_X$, and $x$ to represent a discrete random variable, its range, its probability distribution, and a value of $X$, respectively. In addition, for any function $s(X) \in \mathbb{R}^k$ of $X$, we use $\mu_s$ to denote the mean of $s(X)$, and “˜” to denote the centered variable with the mean subtracted, e.g., $\tilde{s}(X) \triangleq s(X) - \mu_s$. Moreover, we use $\|\cdot\|$ and $\|\cdot\|_{\mathrm{F}}$ to denote the $\ell_2$-norm and the Frobenius norm, respectively. All logarithms in our analyses are base $e$, i.e., natural.

2.3. Local Information Geometry

The following concepts from local information geometry will be useful in our development.

Definition 1

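($\epsilon$-Neighborhood). Let $\mathcal{P}^{\mathcal{X}}$ denote the space of distributions on some finite alphabet $\mathcal{X}$, and let $\mathrm{relint}(\mathcal{P}^{\mathcal{X}})$ denote the subset of strictly positive distributions. For a given $\epsilon > 0$, the $\epsilon$-neighborhood of a distribution $P_X \in \mathrm{relint}(\mathcal{P}^{\mathcal{X}})$ is defined by the $\chi^2$-divergence as

$$\mathcal{N}_\epsilon^{\mathcal{X}}(P_X) \triangleq \left\{ P \in \mathcal{P}^{\mathcal{X}} \colon \sum_{x \in \mathcal{X}} \frac{\bigl(P(x) - P_X(x)\bigr)^2}{P_X(x)} \leq \epsilon^2 \right\}.$$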

Definition 2

($\epsilon$-Dependence). The random variables $X, Y$ are called $\epsilon$-dependent if $P_{XY} \in \mathcal{N}_\epsilon^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y)$.

Definition 3

($\epsilon$-Attribute). A random variable $U$ is called an $\epsilon$-attribute of $X$ if $P_{X|U}(\cdot|u) \in \mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$, for all $u \in \mathcal{U}$.

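We will focus on the small $\epsilon$ regime, which we refer to as the local analysis regime. In addition, for any $P \in \mathcal{P}^{\mathcal{X}}$, we define the information vector $\phi$ and the feature function $L(x)$ corresponding to $P$, with respect to a reference distribution $P_X \in \mathrm{relint}(\mathcal{P}^{\mathcal{X}})$, as

$$\phi(x) \triangleq \frac{P(x) - P_X(x)}{\sqrt{P_X(x)}}, \qquad L(x) \triangleq \frac{\phi(x)}{\sqrt{P_X(x)}}. \tag{1}$$

This gives a three-way correspondence $P \leftrightarrow \phi \leftrightarrow L$ for all distributions in $\mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$, which will be useful in our derivations.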
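As a small numerical illustration of the correspondence in (1), the following sketch builds a distribution close to a reference distribution and checks that the $\chi^2$-divergence used in Definition 1 equals the squared norm of the information vector. The alphabet size, the random seed, and the variable names (`p_ref`, `phi`, `feat`) are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # illustrative alphabet size (assumption)

# Reference distribution P_X, mixed with the uniform law so every entry is >= 0.1.
p_ref = 0.5 * rng.dirichlet(np.ones(n)) + 0.5 / n

# A nearby distribution P: a small perturbation that sums to zero.
delta = rng.normal(size=n)
delta -= delta.mean()
p = p_ref + 1e-3 * delta

# Information vector and feature function from Equation (1).
phi = (p - p_ref) / np.sqrt(p_ref)
feat = phi / np.sqrt(p_ref)

# The chi^2-divergence of P from P_X is the squared Euclidean norm of phi.
chi2 = np.sum((p - p_ref) ** 2 / p_ref)
print(chi2, phi @ phi)
```

Under these assumptions, $P$ lies in the $\epsilon$-neighborhood of $P_X$ exactly when $\|\phi\| \leq \epsilon$, and $P$ can be recovered from $\phi$ as $P_X + \sqrt{P_X}\,\phi$.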

2.4. Modal Decomposition

Given a pair of discrete random variables $X, Y$ with the joint distribution $P_{XY}(x, y)$, the $|\mathcal{Y}| \times |\mathcal{X}|$ matrix $\tilde{B}$ is defined as

$$\tilde{B}(y, x) \triangleq \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x) P_Y(y)}}, \tag{2}$$

where $\tilde{B}(y, x)$ is the $(y, x)$th entry of $\tilde{B}$. The matrix $\tilde{B}$ is referred to as the canonical dependence matrix (CDM) [24]. The SVD of $\tilde{B}$ is referred to as the modal decomposition [24] of the joint distribution $P_{XY}$, which has the following property [18].

Lemma 1.

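The SVD of $\tilde{B}$ can be written as $\tilde{B} = \sum_{i=1}^{K} \sigma_i \psi_i^Y \bigl(\psi_i^X\bigr)^{\mathrm{T}}$, where $K \triangleq \min\{|\mathcal{X}|, |\mathcal{Y}|\}$, $\sigma_i$ denotes the $i$th singular value with the ordering $1 \geq \sigma_1 \geq \cdots \geq \sigma_K = 0$, and $\psi_i^Y$ and $\psi_i^X$ are the corresponding left and right singular vectors with $\psi_K^X(x) = \sqrt{P_X(x)}$ and $\psi_K^Y(y) = \sqrt{P_Y(y)}$.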
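The properties in Lemma 1 can be checked numerically. The sketch below is an illustration only: the alphabet sizes and the randomly generated joint distribution are assumptions, and `np.linalg.svd` stands in for the modal decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 8, 6  # illustrative alphabet sizes (assumption)

# Random strictly positive joint distribution; p_xy[x, y] = P_XY(x, y).
p_xy = rng.random((nx, ny)) + 0.1
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# CDM from Equation (2): a |Y| x |X| matrix indexed as B[y, x].
B = (p_xy.T - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

# Modal decomposition: B = U @ diag(sigma) @ Vt, singular values in descending order.
U, sigma, Vt = np.linalg.svd(B, full_matrices=False)

print("singular values:", np.round(sigma, 4))
# sqrt(P_X) and sqrt(P_Y) are annihilated by B, which gives sigma_K = 0.
print(np.abs(B @ np.sqrt(p_x)).max(), np.abs(np.sqrt(p_y) @ B).max())
```

Since $|\mathcal{Y}| < |\mathcal{X}|$ here, the null direction on the $Y$ side is generically unique, so the last left singular vector matches $\sqrt{P_Y}$ up to sign. The right null space has more than one dimension, so $\sqrt{P_X}$ is one valid choice for $\psi_K^X$ rather than necessarily the vector returned by the solver.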

This SVD decomposes the feature spaces of X,Y into maximally correlated features. To see that, consider the generalized canonical correlation analysis (CCA) problem:

$$\max_{\substack{\mathbb{E}[f_i(X)] = \mathbb{E}[g_i(Y)] = 0 \\ \mathbb{E}[f_i(X) f_j(X)] = \mathbb{E}[g_i(Y) g_j(Y)] = \delta_{ij}}} \ \sum_{i=1}^{k} \mathbb{E}\bigl[f_i(X)\, g_i(Y)\bigr], \tag{3}$$

where $\delta_{ij}$ denotes the Kronecker delta function. It can be shown that for any $1 \leq k \leq K - 1$, the optimal features are $f_i(x) = \psi_i^X(x) / \sqrt{P_X(x)}$ and $g_i(y) = \psi_i^Y(y) / \sqrt{P_Y(y)}$, for $i = 1, \dots, k$, where $\psi_i^X(x)$ and $\psi_i^Y(y)$ are the $x$th and $y$th entries of $\psi_i^X$ and $\psi_i^Y$, respectively [18]. The special case $k = 1$ corresponds to the HGR maximal correlation [25,26,27], and the optimal features can be computed from the ACE (Alternating Conditional Expectation) algorithm [36].
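To illustrate the solution of (3), the following sketch constructs the features from the top singular vectors of the CDM and checks the zero-mean and orthonormality constraints, as well as the fact that the correlation attained by the $i$th feature pair equals $\sigma_i$. The joint distribution, alphabet sizes, and the choice $k = 2$ are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, k = 8, 6, 2  # illustrative sizes (assumption)

p_xy = rng.random((nx, ny)) + 0.1
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy.T - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))
U, sigma, Vt = np.linalg.svd(B, full_matrices=False)

# Features built from the k leading singular vectors.
f = Vt[:k] / np.sqrt(p_x)        # f[i, x] = psi_i^X(x) / sqrt(P_X(x))
g = U[:, :k].T / np.sqrt(p_y)    # g[i, y] = psi_i^Y(y) / sqrt(P_Y(y))

mean_f = f @ p_x                 # E[f_i(X)]
cov_f = (f * p_x) @ f.T          # E[f_i(X) f_j(X)]
corr = np.einsum("ix,xy,iy->i", f, p_xy, g)  # E[f_i(X) g_i(Y)]

print(np.round(corr, 6), np.round(sigma[:k], 6))
```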

2.5. Deep Neural Networks

The architecture of deep neural networks (under log-loss) is depicted in Figure 1, where $X$ is the input data, e.g., images, audio, or natural language. Moreover, $Y$ is the objective to predict, which can represent a discrete label in classification tasks, or the target language in machine translation [37]. Specifically, for given data $x$, the network applies a (trainable) feature mapping to generate the $k$-dimensional feature $s(x) = (s_1, \dots, s_k)^{\mathrm{T}}$. In practice, the feature mapping block (depicted as the gray block in Figure 1) is typically composed of hundreds or thousands of functional components (e.g., residual blocks [1]) with different types of layers, and may contain recurrent structures, e.g., LSTM (Long Short-Term Memory) [38]. In general, the internal structure of the feature mapping can take many different designs, depending on the learning task.

Figure 1.

A deep neural network that uses data $X$ to predict $Y$. All hidden layers together map the input data $X$ to the $k$-dimensional feature $s(x) = (s_1, \dots, s_k)^{\mathrm{T}}$. Then, the probabilistic prediction $\tilde{P}_{Y|X}$ of $Y$ is computed from $s(x)$, $v(y)$, and $b(y)$, where $v$ and $b$ are the weights and bias in the last layer.

After obtaining the feature $s(X)$, $Y$ is then predicted by the probability distribution $\tilde{P}_{Y|X}^{(s,v,b)}$ of the form

$$\tilde{P}_{Y|X}^{(s,v,b)}(y|x) \triangleq \frac{e^{v^{\mathrm{T}}(y) s(x) + b(y)}}{\sum_{y' \in \mathcal{Y}} e^{v^{\mathrm{T}}(y') s(x) + b(y')}}, \tag{4}$$

which is obtained by applying the softmax function [39] to $v^{\mathrm{T}}(y) s(x) + b(y)$, where $v(\cdot)$ and $b(\cdot)$ are the weights and biases in the last layer, respectively. (This is equivalent to the common practice of denoting the weights and biases by the matrix $[v(1), \dots, v(|\mathcal{Y}|)]^{\mathrm{T}}$ and the vector $[b(1), \dots, b(|\mathcal{Y}|)]^{\mathrm{T}}$, respectively. However, as we will show later, expressing the weights $v$ and biases $b$ as mappings of $y$ better illustrates their roles in feature selection.) We will use $\tilde{P}_{Y|X}$ to refer to $\tilde{P}_{Y|X}^{(s,v,b)}$ when there is no ambiguity.

Then, for a given training set of labeled samples $(x_i, y_i)$, for $i = 1, \dots, N$, all the parameters in the network, including $v$, $b$, as well as those in the feature mapping block, are chosen to maximize the log-likelihood function (or, equivalently, minimize the log-loss)

$$\frac{1}{N} \sum_{i=1}^{N} \log \tilde{P}_{Y|X}(y_i | x_i). \tag{5}$$

The procedure of choosing such parameters is called the training of the network, which can be performed by stochastic gradient descent (SGD) or its variants [21]. With a trained network, the label $\hat{y}$ for a new data sample $x$ can be predicted by maximum a posteriori (MAP) estimation, i.e., $\hat{y} = \arg\max_{y \in \mathcal{Y}} \tilde{P}_{Y|X}(y|x)$. Specifically, when we make predictions for samples in a test dataset, the proportion of samples with correct predictions (i.e., $\hat{y} = y$) over all samples is called the test accuracy.
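The output layer (4), the objective (5), and the MAP prediction rule can be sketched in a few lines. The code below is a minimal illustration with randomly generated features, weights, and labels; the function names `softmax_predict` and `avg_log_likelihood` and all sizes are hypothetical choices for this example rather than part of the original experiments.

```python
import numpy as np

def softmax_predict(s, v, b):
    """Equation (4). s: (N, k) features, v: (|Y|, k) weights, b: (|Y|,) biases."""
    logits = s @ v.T + b
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def avg_log_likelihood(probs, y):
    """Equation (5): average log-probability assigned to the observed labels."""
    return np.mean(np.log(probs[np.arange(len(y)), y]))

rng = np.random.default_rng(0)
N, k, ny = 200, 3, 4                      # illustrative sizes (assumption)
s = rng.normal(size=(N, k))               # stand-in for extracted features s(x_i)
v = rng.normal(size=(ny, k))
b = rng.normal(size=ny)
y = rng.integers(0, ny, size=N)           # stand-in labels

probs = softmax_predict(s, v, b)
ll = avg_log_likelihood(probs, y)         # quantity maximised in training
y_hat = probs.argmax(axis=1)              # MAP prediction
accuracy = np.mean(y_hat == y)
print(ll, accuracy)
```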

3. Results

3.1. Information-Theoretic Feature Selection

Suppose that, given random variables $X, Y$ with joint distribution $P_{XY}$, we want to infer an attribute $V$ of $Y$ from observed i.i.d. samples $x_1, \dots, x_n$ of $X$. When the statistical model $P_{X|V}$ is known, the optimal decision rule is the log-likelihood ratio test, where the log-likelihood function can be viewed as the optimal feature for inference. However, in many practical situations [18], it is hard to identify the model of the targeted attribute, and it is necessary to select low-dimensional informative features of $X$ for inference tasks before knowing the model. An information-theoretic formulation of such a feature selection problem is the universal feature selection problem [24], which we formalize as follows.

To begin, for an attribute $V$, we refer to $\mathcal{C}^Y = \bigl\{\mathcal{V}, \{P_V(v), v \in \mathcal{V}\}, \{\phi_v^{Y|V}, v \in \mathcal{V}\}\bigr\}$ as the configuration of $V$, where $\phi_v^{Y|V} \leftrightarrow P_{Y|V}(\cdot|v)$ is the information vector specifying the corresponding conditional distribution $P_{Y|V}(\cdot|v)$. The configuration of $V$ models the statistical correlation between $V$ and $Y$. In the sequel, we focus on the local analysis regime, for which we assume that all the attributes $V$ of interest are $\epsilon$-attributes of $Y$. As a result, the corresponding configuration satisfies $\|\phi_v^{Y|V}\| \leq \epsilon$, for all $v \in \mathcal{V}$. We refer to such configurations as $\epsilon$-configurations. The configuration of $V$ is unknown in advance but assumed to be generated from a rotationally invariant ensemble (RIE).

Definition 4

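(RIE). Two configurations $\mathcal{C}^Y$ and $\tilde{\mathcal{C}}^Y$ defined as

$$\mathcal{C}^Y \triangleq \bigl\{\mathcal{V}, \{P_V(v), v \in \mathcal{V}\}, \{\phi_v^{Y|V}, v \in \mathcal{V}\}\bigr\},$$
$$\tilde{\mathcal{C}}^Y \triangleq \bigl\{\mathcal{V}, \{P_V(v), v \in \mathcal{V}\}, \{\tilde{\phi}_v^{Y|V}, v \in \mathcal{V}\}\bigr\}$$

are called rotationally equivalent if there exists a unitary matrix $Q$ such that $\tilde{\phi}_v^{Y|V} = Q \phi_v^{Y|V}$, for all $v \in \mathcal{V}$. Moreover, a probability measure defined on a set of configurations is called an RIE if all rotationally equivalent configurations have the same measure.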

The RIE can be interpreted as assigning a uniform measure to the attributes with the same level of distinguishability. To infer the attribute $V$, we construct a $k$-dimensional feature vector $h^k = (h_1, \dots, h_k)$, for some $1 \leq k \leq K - 1$, of the form

$$h_i = \frac{1}{n} \sum_{l=1}^{n} f_i(x_l), \quad i = 1, \dots, k, \tag{6}$$

for some choices of feature functions $f_i$. Our goal is to determine the $f_i$ such that the optimal decision rule based on $h^k$ achieves the smallest possible error probability, where the performance is averaged over the possible $\mathcal{C}^Y$ generated from an RIE. In turn, we denote by $\xi_i^X \leftrightarrow f_i$ the corresponding information vector, and define the matrix $\Xi^X \triangleq [\xi_1^X \ \cdots \ \xi_k^X]$.

Theorem 1

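(Universal Feature Selection). For $v, v' \in \mathcal{V}$, let $E_{h^k}(v, v')$ be the error exponent associated with the pairwise error probability of distinguishing $v$ and $v'$ based on $h^k$. Then, the expected error exponent over a given RIE defined on the set of $\epsilon$-configurations is given by

$$\mathbb{E}\bigl[E_{h^k}(v, v')\bigr] = \frac{C_0}{2} \cdot \Bigl\| \tilde{B}\, \Xi^X \bigl( (\Xi^X)^{\mathrm{T}} \Xi^X \bigr)^{-\frac{1}{2}} \Bigr\|_{\mathrm{F}}^2 + o(\epsilon^2), \tag{7}$$

where $C_0 \triangleq \frac{1}{4 |\mathcal{Y}|} \cdot \mathbb{E}\Bigl[ \bigl\| \phi_v^{Y|V} - \phi_{v'}^{Y|V} \bigr\|^2 \Bigr]$ is independent of the choices of the $f_i$’s, and the expectations $\mathbb{E}[\cdot]$ are taken over this RIE.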

Proof. 

See Appendix A. □

As a result of (7), designing the $\xi_i^X$ as the singular vectors $\psi_i^X$ of $\tilde{B}$, for $i = 1, \dots, k$, optimizes (7) for all RIEs, pairs of $(v, v')$, and $\epsilon$-configurations. Thus, the feature functions corresponding to $\psi_i^X$ are universally optimal for inferring the unknown attribute $V$. Moreover, (7) naturally leads to an information metric $\bigl\| \tilde{B}\, \Xi^X \bigl( (\Xi^X)^{\mathrm{T}} \Xi^X \bigr)^{-\frac{1}{2}} \bigr\|_{\mathrm{F}}^2$ for any feature $\Xi^X$ of $X$, measured by projecting the normalized $\Xi^X$ through a linear projection $\tilde{B}$. This information metric quantifies how informative a feature of $X$ is when solving inference problems with respect to $Y$, and it is optimized when the features are designed as singular vectors of $\tilde{B}$. Thus, we can interpret universal feature selection as finding the most informative features for data inference via the SVD of $\tilde{B}$, which also coincides with the maximally correlated features in (3). Later, we will show that the feature selection in DNNs shares the same information metric as universal feature selection in the local analysis regime.

3.2. Feature Extraction in Deep Neural Networks

3.2.1. Network with Ideal Expressive Power

For convenience of analysis, we first consider the ideal case where the neural network can express any feature mapping s(·) as desired. While this assumption can be rather strong, the existence of such ideal networks is guaranteed by the universal approximation theorem [40]. In addition, one goal of practical network designs is to approximate the ideal networks and obtain sufficient expressive power. For such networks, we will show that when X,Y are ϵ-dependent, the extracted feature s(x) and weights v(y) coincide with the solutions of the universal feature selection.

To begin, we use $P_{XY}$ to denote the joint empirical distribution of the labeled samples $(x_i, y_i)$, $i = 1, \dots, N$, and $P_X, P_Y$ to denote the corresponding marginal distributions. Then, the objective function of (5) is the empirical average of the log-likelihood function

$$\frac{1}{N} \sum_{i=1}^{N} \log \tilde{P}_{Y|X}(y_i | x_i) = \mathbb{E}_{P_{XY}}\bigl[ \log \tilde{P}_{Y|X}(Y|X) \bigr].$$

Therefore, maximizing this empirical average is equivalent to minimizing the KL divergence:

$$(s^*, v^*, b^*) = \arg\min_{(s, v, b)} D\bigl( P_{XY} \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \bigr). \tag{8}$$

This can be interpreted as finding the best fit to the empirical joint distribution $P_{XY}$ among distributions of the form $P_X \tilde{P}_{Y|X}^{(s,v,b)}$. In our development, it is more convenient to denote the bias by $d(y) = b(y) - \log P_Y(y)$, for $y \in \mathcal{Y}$. Then, the following lemma illustrates the explicit constraint on the problem (8) in the local analysis regime.

Lemma 2.

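If $X, Y$ are $\epsilon$-dependent, then the optimal $v, d$ for (8) satisfy

$$\bigl| \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) \bigr| = O(\epsilon), \quad \text{for all } x \in \mathcal{X},\ y \in \mathcal{Y}. \tag{9}$$

Proof.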

See Appendix B. □

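In turn, we take (9) as the constraint for solving the problem (8) in the local analysis regime. Moreover, we define the information vectors for the zero-mean vectors $\tilde{s}$, $\tilde{v}$ as $\xi^X(x) = \sqrt{P_X(x)}\, \tilde{s}(x)$, $\xi^Y(y) = \sqrt{P_Y(y)}\, \tilde{v}(y)$, and define the matrices

$$\Xi^Y \triangleq \bigl[ \xi^Y(1) \ \cdots \ \xi^Y(|\mathcal{Y}|) \bigr]^{\mathrm{T}}, \qquad \Xi^X \triangleq \bigl[ \xi^X(1) \ \cdots \ \xi^X(|\mathcal{X}|) \bigr]^{\mathrm{T}}.$$

Lemma 3.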

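The KL divergence (8) in the local analysis regime (9) can be expressed as

$$D\bigl( P_{XY} \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \bigr) = \frac{1}{2} \bigl\| \tilde{B} - \Xi^Y (\Xi^X)^{\mathrm{T}} \bigr\|_{\mathrm{F}}^2 + \frac{1}{2} \eta^{(v,b)}(s) + o(\epsilon^2), \tag{10}$$

where $\eta^{(v,b)}(s) \triangleq \mathbb{E}_{P_Y}\Bigl[ \bigl( \mu_s^{\mathrm{T}} \tilde{v}(Y) + \tilde{d}(Y) \bigr)^2 \Bigr]$.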

Proof. 

See Appendix C. □

Lemma 3 reveals key insights for feature selection in neural networks. To see this, we consider the following two learning problems: learning the optimal weight v for given s and learning the optimal feature s for given v.

For the case that s is fixed, we can optimize (10) with ΞX fixed and obtain the following optimal weights:

Theorem 2.

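For fixed $\Xi^X$ and $\mu_s$, the optimal $\Xi^{Y*}$ to minimize (10) is given by

$$\Xi^{Y*} = \tilde{B}\, \Xi^X \bigl( (\Xi^X)^{\mathrm{T}} \Xi^X \bigr)^{-1}, \tag{11}$$

and the optimal weights $\tilde{v}^*$ and bias $\tilde{d}^*$ are

$$\tilde{v}^*(y) = \mathbb{E}_{P_{X|Y}}\bigl[ \Lambda_{\tilde{s}(X)}^{-1} \tilde{s}(X) \,\big|\, Y = y \bigr], \qquad \tilde{d}^*(y) = -\mu_s^{\mathrm{T}} \tilde{v}(y), \tag{12}$$

where $\Lambda_{\tilde{s}(X)}$ denotes the covariance matrix of $\tilde{s}(X)$.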

Proof. 

See Appendix D. □

Specifically, when s(x)=x, Theorem 2 gives the optimal weights for softmax regression. Note that Equation (11) can be viewed as a projection of the input feature s˜(x), to a feature v(y) computable from the value of y, which is the most correlated feature to s˜(x). The solution is given by the operation that left multiplies B˜ matrix, which we refer to as forward feature projection.
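The forward feature projection can be checked numerically: the matrix form (11) and the conditional-expectation form in (12) should produce the same weights. The sketch below uses a randomly generated joint distribution and an arbitrary feature $s(x)$, both of which are assumptions for illustration; it also assumes the covariance of $\tilde{s}(X)$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny, k = 8, 6, 2  # illustrative sizes (assumption)

p_xy = rng.random((nx, ny)) + 0.1
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy.T - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

s = rng.normal(size=(nx, k))            # arbitrary fixed feature s(x)
mu_s = p_x @ s
s_c = s - mu_s                          # centred feature
Xi_X = np.sqrt(p_x)[:, None] * s_c      # row x holds xi^X(x)^T
cov_s = Xi_X.T @ Xi_X                   # covariance of the centred feature

# Matrix form, Equation (11), converted back to weights v(y).
Xi_Y = B @ Xi_X @ np.linalg.inv(cov_s)
v_matrix = Xi_Y / np.sqrt(p_y)[:, None]

# Conditional-expectation form, Equation (12).
cond_mean = (p_xy / p_y).T @ s_c        # row y holds E[s~(X) | Y = y]
v_cond = cond_mean @ np.linalg.inv(cov_s)
d_opt = -v_cond @ mu_s                  # bias that makes eta in (10) vanish

def loss(M):
    """First term of (10), up to the factor 1/2."""
    return np.linalg.norm(B - M @ Xi_X.T) ** 2

print(np.abs(v_matrix - v_cond).max())
```

Because the first term of (10) is a least-squares objective in $\Xi^Y$ for fixed $\Xi^X$, any perturbation of the solution increases `loss`.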

Remark 1.

While we assume the continuous input s(x) is a function of a discrete variable X, we only need the labeled samples between s and Y to compute the weights and bias from the conditional expectation (12), and the correlation between X and s is irrelevant. Thus, our analysis for weights and bias can be applied to continuous input networks by just ignoring X and taking s as the real input to network.

We then consider the “backward feature projection” problem, which attempts to find informative feature s*(X) to minimize the loss (10) with given weights and bias. In particular, we can show that the solution of this backward feature projection is precisely symmetric to the forward one.

Theorem 3.

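For fixed $\Xi^Y$ and $\tilde{d}$, the optimal $\Xi^{X*}$ to minimize (10) is given by

$$\Xi^{X*} = \tilde{B}^{\mathrm{T}}\, \Xi^Y \bigl( (\Xi^Y)^{\mathrm{T}} \Xi^Y \bigr)^{-1}, \tag{13}$$

and the optimal feature function $s^*$, which is decomposed into $\tilde{s}^*$ and $\mu_{s^*}$, is given by

$$\tilde{s}^*(x) = \mathbb{E}_{P_{Y|X}}\bigl[ \Lambda_{\tilde{v}(Y)}^{-1} \tilde{v}(Y) \,\big|\, X = x \bigr], \qquad \mu_{s^*} = -\Lambda_{\tilde{v}(Y)}^{-1}\, \mathbb{E}_{P_Y}\bigl[ \tilde{v}(Y)\, \tilde{d}(Y) \bigr], \tag{14}$$

where $\Lambda_{\tilde{v}(Y)}$ denotes the covariance matrix of $\tilde{v}(Y)$.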

Proof. 

See Appendix D. □

Finally, when both $s$ and $(v, b)$ (and hence $\Xi^X, \Xi^Y, d$) can be designed, the optimal $(\Xi^Y, \Xi^X)$ corresponds to the low-rank factorization of $\tilde{B}$, and the solutions coincide with the universal feature selection.

Theorem 4.

The optimal solutions for weights and bias to minimize (10) are given by $\tilde{d}(y) = -\mu_s^{\mathrm{T}} \tilde{v}(y)$, and $(\Xi^Y, \Xi^X)^*$ chosen as the left and right singular vectors of $\tilde{B}$ associated with its $k$ largest singular values.

Proof. 

See Appendix E. □

Therefore, we conclude that the learning of neural networks, when both s and (v,b) are designable, is to extract the most correlated aspects of the input data X and the label Y that are informative features for data inferences from universal feature selection.

In the practical learning process of DNNs, BackProp updates the weights of the softmax layer and those of the previous layer(s) in an iterative manner. As illustrated by Lemma 3, such iterative updates converge to the same solution as alternating between the forward feature projection (11) and the backward feature projection (13), which is precisely the power method for computing the SVD of $\tilde{B}$ [41], also known as the Alternating Conditional Expectation (ACE) algorithm [36].
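The alternation between (11) and (13) can be sketched for $k = 1$ as follows. This is an illustration under assumptions (random joint distribution, random initial feature, a fixed number of iterations, and a non-degenerate top singular value), not the training procedure used in the experiments; it only shows that the alternating projections recover the leading mode of the CDM.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 8, 6  # illustrative alphabet sizes (assumption)

p_xy = rng.random((nx, ny)) + 0.1
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy.T - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

# Alternate the two projections for a one-dimensional feature (k = 1).
xi_x = rng.normal(size=nx)                 # random initial information vector
for _ in range(2000):
    xi_y = B @ xi_x / (xi_x @ xi_x)        # forward projection, Equation (11)
    xi_x = B.T @ xi_y / (xi_y @ xi_y)      # backward projection, Equation (13)

# Compare with the leading term of the modal decomposition.
U, sigma, Vt = np.linalg.svd(B)
rank_one = sigma[0] * np.outer(U[:, 0], Vt[0])
print(np.abs(np.outer(xi_y, xi_x) - rank_one).max())
```

The individual vectors are only determined up to a reciprocal scaling, so the comparison is made on the product $\xi^Y (\xi^X)^{\mathrm{T}}$, which is the quantity that enters (10).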

Remark 2.

From Theorem 4, for a neural network with sufficient expressive power, the trained feature depends only on the distribution of input data rather than the training process. It is worth mentioning that this result does not contradict the practice that trained weights in hidden layers can be different during each training run. In fact, due to the over-parameterized nature of practical network designs, there exist multiple choices of weights in hidden layers to express the same optimal feature s(x).

3.2.2. Network with Restricted Expressive Power

The analysis of the previous section has considered neural networks with ideal expressive power, where the feature $s(X)$ can be selected as any desired function. In general, however, the form of feature functions that can be generated is often limited by the network structure. In the following, we consider networks with restricted expressive power to characterize the impact of network structure on the extracted feature.

For illustration, we consider a neural network with a hidden layer of $k$ nodes, and a zero-mean continuous input $t = [t_1 \ \cdots \ t_m]^{\mathrm{T}} \in \mathbb{R}^m$ to this hidden layer, where $t$ is assumed to be a function $t(x)$ of some discrete variable $X$. Our goal is to analyze the weights and bias in this layer with labeled samples $(t(x_i), y_i)$. Assume the activation function of the hidden layer is a generally smooth function $\sigma(\cdot)$; then the output $s_z(X)$ of the $z$-th hidden node is

$$s_z(x) = \sigma\bigl( w^{\mathrm{T}}(z)\, t(x) + c(z) \bigr), \quad \text{for } z = 1, \dots, k,\ x \in \mathcal{X}, \tag{15}$$

where $w(z) \in \mathbb{R}^m$ and $c(z) \in \mathbb{R}$ are the weights and bias from the input layer to the hidden layer, as shown in Figure 2. We denote by $s = [s_1 \ \cdots \ s_k]^{\mathrm{T}}$ the input vector to the output classification layer.

Figure 2.

A multi-layer neural network, where the expressive power of the feature mapping s(·) is restricted by the hidden representation t. All hidden layers previous to t are fixed, represented by the “pre-processing” module.

To interpret the feature selection in hidden layers, we fix (v(y),b(y)) at the output layer and consider the problem of designing (w(z),c(z)) to minimize the loss function (8) at the output layer. Ideally, we should have picked w(z) and c(z) to generate s(x) to match s*(x) from (14), which minimizes the loss. However, here we have the constraint that s(x) must take the form of (15) and, intuitively, the network should select w(z),c(z) so that s(x) is close to s*(x). Our goal is to quantify the notion of such closeness.

To develop insights on feature selection in hidden layers, we again focus on the local analysis regime, where the weights and bias are assumed to satisfy the local constraint

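$$\tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) = O(\epsilon), \qquad w^{\mathrm{T}}(z)\, \tilde{t}(x) = O(\epsilon), \qquad \forall\, x, y, z. \tag{16}$$

Then, since $t$ is zero-mean, we can express (15) as

$$s_z(x) = \sigma\bigl( w^{\mathrm{T}}(z)\, t(x) + c(z) \bigr) = w^{\mathrm{T}}(z)\, \tilde{t}(x) \cdot \sigma'\bigl(c(z)\bigr) + \sigma\bigl(c(z)\bigr) + o(\epsilon). \tag{17}$$

Moreover, we define a matrix $\tilde{B}_1$ with the $(z, x)$th entry $\tilde{B}_1(z, x) = \frac{\sqrt{P_X(x)}}{\sigma'(c(z))}\, \tilde{s}_z^*(x)$, which can be interpreted as a generalized CDM for the hidden layer. Furthermore, we denote by $\xi_1^X(x) = \sqrt{P_X(x)}\, \tilde{t}(x)$ the information vector of $\tilde{t}(x)$, with the matrix $\Xi_1^X$ defined as $\Xi_1^X \triangleq \bigl[ \xi_1^X(1) \ \cdots \ \xi_1^X(|\mathcal{X}|) \bigr]^{\mathrm{T}}$, and we also define

$$W \triangleq \bigl[ w(1) \ \cdots \ w(k) \bigr]^{\mathrm{T}}, \tag{18}$$
$$J \triangleq \mathrm{diag}\bigl\{ \sigma'(c(1)), \sigma'(c(2)), \dots, \sigma'(c(k)) \bigr\}. \tag{19}$$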

The following theorem characterizes the loss (8).

Theorem 5.

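Given the weights and bias $(v, b)$ at the output layer, and for any input feature $s$, we denote by $\mathcal{L}(s)$ the loss (8) evaluated with respect to $(v, b)$ and $s$. Then, with the constraints (16),

$$\mathcal{L}(s) - \mathcal{L}(s^*) = \frac{1}{2} \bigl\| \Theta \tilde{B}_1 - \Theta W (\Xi_1^X)^{\mathrm{T}} \bigr\|_{\mathrm{F}}^2 + \frac{1}{2} \kappa^{(v,b)}(s, s^*) + o(\epsilon^2), \tag{20}$$

where $\Theta \triangleq \bigl( (\Xi^Y)^{\mathrm{T}} \Xi^Y \bigr)^{1/2} J$, and the term $\kappa^{(v,b)}(s, s^*) = (\mu_s - \mu_{s^*})^{\mathrm{T}} \Lambda_{\tilde{v}(Y)} (\mu_s - \mu_{s^*})$.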

Proof. 

See Appendix F. □

Equation (20) quantifies the closeness between $s$ and $s^*$ in terms of the loss (8). Then, our goal is to minimize (20), which can be separated into two optimization problems:

$$W^* = \arg\min_{W} \bigl\| \Theta \tilde{B}_1 - \Theta W (\Xi_1^X)^{\mathrm{T}} \bigr\|_{\mathrm{F}}^2, \tag{21}$$
$$\mu_s^* = \arg\min_{\mu_s} \kappa^{(v,b)}(s, s^*). \tag{22}$$

Note that the optimization problem (21) is similar to the one that appeared in Lemma 3, and the optimal solution is given by $W^* = \tilde{B}_1\, \Xi_1^X \bigl( (\Xi_1^X)^{\mathrm{T}} \Xi_1^X \bigr)^{-1}$. Therefore, solving for the optimal weights in the hidden layer can be interpreted as projecting $\tilde{s}^*(x)$ onto the subspace of feature functions spanned by $t(x)$ to find the closest expressible function. In addition, the problem (22) is to choose $\mu_s$ (and hence the bias $c(z)$) to minimize a quadratic term similar to $\eta^{(v,b)}(s)$ in (10). Similar to the analyses of the parameters in the last layer, we can obtain analytical solutions for the hidden layer parameters, e.g., $\mu_s^*$ and $w^*$, with detailed discussions provided in Appendix G.
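The closed-form minimizer of (21) can be checked against a generic least-squares solver. In the sketch below, $\tilde{B}_1$ and $\Theta$ are random stand-ins (assumptions for illustration, with $\Theta$ constructed to be invertible), since the point is only the structure of the projection; the sizes $m = 4$ and $k = 3$ mirror the network used later in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, m, k = 8, 4, 3  # |X|, input width, hidden width (illustrative)

p_x = 0.5 * rng.dirichlet(np.ones(nx)) + 0.5 / nx
t = rng.normal(size=(nx, m))
t_c = t - p_x @ t                          # zero-mean hidden input t~(x)
Xi1 = np.sqrt(p_x)[:, None] * t_c          # row x holds xi_1^X(x)^T

B1 = rng.normal(size=(k, nx))              # stand-in for the generalized CDM
A = rng.normal(size=(k, k))
Theta = A @ A.T + np.eye(k)                # stand-in, positive definite

def loss(W):
    """Objective of (21)."""
    return np.linalg.norm(Theta @ B1 - Theta @ W @ Xi1.T) ** 2

# Closed-form solution stated after (22).
W_star = B1 @ Xi1 @ np.linalg.inv(Xi1.T @ Xi1)

# Same projection via least squares: Xi1 @ W^T ~ B1^T.
W_ls = np.linalg.lstsq(Xi1, B1.T, rcond=None)[0].T

print(np.abs(W_star - W_ls).max(), loss(W_star))
```

When $\Theta$ is invertible, the minimizer does not depend on $\Theta$, which is why an ordinary least-squares fit of $\tilde{B}_1$ onto the rows of $(\Xi_1^X)^{\mathrm{T}}$ gives the same weights.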

Overall, we observe the correspondence between (11), (14), and (21), (22), and interpret both operations as feature projections. Our argument can be generalized to any intermediate layer in a multi-layer network, with all the previous layers viewed as the fixed pre-processing that specifies t(x), and all the layers after determining s*. Then, the iterative procedure in back-propagation can be viewed as alternating projection finding the fixed-point solution over the entire network. This final fixed-point solution, even under the local assumption, might not be the SVD solution as in Theorem 4. This is because the limited expressive power of the network often makes it impossible to generate the desired feature function. In such cases, the concept of feature projection can be used to quantify this gap, and thus to measure the quality of the selected features.

3.3. Scoring Neural Networks

Given a learning problem, it is useful to tell whether or not some extracted features are informative [42]. Our previous development naturally gives rise to a performance metric.

Definition 5.

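Given a feature $s(x) \in \mathbb{R}^k$ and weight $v(y) \in \mathbb{R}^k$ with the corresponding information matrices $\Xi^X$ and $\Xi^Y$, the H-score $H(s, v)$ is defined as

$$H(s, v) \triangleq \frac{1}{2} \bigl\| \tilde{B} \bigr\|_{\mathrm{F}}^2 - \frac{1}{2} \bigl\| \tilde{B} - \Xi^Y (\Xi^X)^{\mathrm{T}} \bigr\|_{\mathrm{F}}^2 = \mathbb{E}_{P_{XY}}\bigl[ \tilde{s}^{\mathrm{T}}(X)\, \tilde{v}(Y) \bigr] - \frac{1}{2} \mathrm{tr}\bigl( \Lambda_{\tilde{s}(X)} \Lambda_{\tilde{v}(Y)} \bigr). \tag{23}$$

In addition, for given $s(x)$, we define the single-sided H-score $H(s)$ as

$$H(s) \triangleq \max_{v} H(s, v) \tag{24}$$
$$= \frac{1}{2} \bigl\| \tilde{B} \bigr\|_{\mathrm{F}}^2 - \frac{1}{2} \Bigl\| \tilde{B} - \tilde{B}\, \Xi^X \bigl( (\Xi^X)^{\mathrm{T}} \Xi^X \bigr)^{-1} (\Xi^X)^{\mathrm{T}} \Bigr\|_{\mathrm{F}}^2 \tag{25}$$
$$= \frac{1}{2} \Bigl\| \tilde{B}\, \Xi^X \bigl( (\Xi^X)^{\mathrm{T}} \Xi^X \bigr)^{-\frac{1}{2}} \Bigr\|_{\mathrm{F}}^2 = \frac{1}{2} \mathbb{E}_{P_Y}\Bigl[ \bigl\| \mathbb{E}_{P_{X|Y}}\bigl[ \Lambda_{\tilde{s}(X)}^{-1/2}\, \tilde{s}(X) \,\big|\, Y \bigr] \bigr\|^2 \Bigr]. \tag{26}$$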

The H-score can be used to measure the quality of features generated at any intermediate layer of the network. It is related to (20) when the optimal bias is chosen and $\Theta$ is the identity matrix. This can be understood as taking the output $s(x)$ of this layer and directly feeding it to a softmax output layer with $v(y)$ used as the weights, and $H(s, v)$ measures the resulting performance. Note that $v(y)$ here can be an arbitrary function of $Y$, not necessarily the weights on the next layer computed by the network. When the optimal $v^*(y)$ as defined in (12) is used, the resulting performance becomes the single-sided H-score $H(s)$, which measures the quality of $s(x)$. In addition, by comparing (26) with (7), the performance measure $H(s)$ also coincides with the information metric in (7), up to a scale factor.

Specifically, for a given dataset and a feature extractor that generates $s(\cdot)$, the H-score $H(s)$ can be efficiently computed from the second equality in (26). In addition, when we use the H-score to compare the performance of different feature extractors (models), the model complexity has to be taken into account to reduce overfitting. To this end, we adopt the Akaike information criterion (AIC) and define the AIC-corrected H-score

$$H_{\mathrm{AIC}}(s) \triangleq H(s) - \frac{n_p}{n_s} \tag{27}$$

for comparing different models, where $n_p$ and $n_s$ represent the number of parameters in the model and the training sample size, respectively.
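A sample-based estimate of $H(s)$ following the second equality in (26), together with the correction (27), might look as follows. This is a minimal sketch rather than the released implementation: the function names `h_score` and `h_score_aic` are hypothetical, the pseudo-inverse is an assumption added to tolerate a singular covariance, and the synthetic data are for illustration only.

```python
import numpy as np

def h_score(features, labels):
    """Estimate H(s) from samples: 0.5 * E_Y[ || Lambda^{-1/2} E[s~ | Y] ||^2 ]."""
    s = features - features.mean(axis=0)           # centred feature s~(x)
    cov = s.T @ s / len(s)                         # empirical covariance of s~
    classes, counts = np.unique(labels, return_counts=True)
    cond_means = np.stack([s[labels == c].mean(axis=0) for c in classes])
    # m^T Lambda^{-1} m for each class-conditional mean m (pinv is an assumption).
    quad = np.einsum("yi,ij,yj->y", cond_means, np.linalg.pinv(cov), cond_means)
    return 0.5 * np.sum(counts / len(s) * quad)

def h_score_aic(features, labels, n_params):
    """AIC-corrected H-score, Equation (27)."""
    return h_score(features, labels) - n_params / len(labels)

# Synthetic illustration: class-dependent features versus pure noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=3000)
means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
informative = means[labels] + rng.normal(size=(3000, 2))
noise = rng.normal(size=(3000, 2))

h_inf, h_noise = h_score(informative, labels), h_score(noise, labels)
print(h_inf, h_noise, h_score_aic(informative, labels, n_params=300))
```

Features that carry label information receive a markedly higher score than noise features of the same dimension, and both scores stay below $k/2$.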

In current practice, the cross-entropy $-\mathbb{E}_{P_{XY}}\bigl[ \log \tilde{P}_{Y|X}^{(s,v,b)} \bigr]$ is often used as the performance metric. One can, in principle, also use the log-loss to measure the effectiveness of the selected feature at the output of an intermediate layer [42]. However, one problem with this metric is that, for a given problem, it is not clear what value of log-loss one should expect, as the log-loss is generally unbounded. In contrast, the H-score can be directly computed from the data samples and has a clear upper bound. Indeed, it follows from Lemma 1 that, for a $k$-dimensional feature $s$ and weights $v$, we have the sequence of inequalities

$$H(s, v) \leq H(s) \leq \frac{1}{2} \sum_{i=1}^{k} \sigma_i^2 \leq \frac{k}{2}, \tag{28}$$

where $\sigma_i$ indicates the $i$th singular value of $\tilde{B}$.

In particular, the first “≤” follows from the definition (24), and the gap between $H(s, v)$ and $H(s)$ measures the optimality of the weights $v$; the second “≤” follows from the first equality of (26), and the gap between the two sides characterizes the difference between the chosen feature and the optimal solution, which is a useful measure of how restrictive the network structure is (i.e., its lack of expressive power); the last “≤” follows from the fact that $\sigma_i \leq 1$ (cf. Lemma 1), and the gap measures the dependency between the data variable and the label for the given dataset. In Section 3.4.3, we validate this metric on real data.
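The chain of inequalities in (28), and the agreement of the two expressions in (23), can be verified on a small discrete example. The joint distribution, the feature $s$, and the weights $v$ below are random choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, k = 8, 6, 2  # illustrative sizes (assumption)

p_xy = rng.random((nx, ny)) + 0.1
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy.T - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))
sigma = np.linalg.svd(B, compute_uv=False)

# Arbitrary feature s(x) and weights v(y), centred under P_X and P_Y.
s_c = rng.normal(size=(nx, k)); s_c -= p_x @ s_c
v_c = rng.normal(size=(ny, k)); v_c -= p_y @ v_c
Xi_X = np.sqrt(p_x)[:, None] * s_c
Xi_Y = np.sqrt(p_y)[:, None] * v_c

def fro2(M):
    """Squared Frobenius norm."""
    return np.sum(M ** 2)

# Both sides of Equation (23).
h_sv = 0.5 * fro2(B) - 0.5 * fro2(B - Xi_Y @ Xi_X.T)
h_sv_alt = (np.einsum("xy,xi,yi->", p_xy, s_c, v_c)
            - 0.5 * np.trace(Xi_X.T @ Xi_X @ Xi_Y.T @ Xi_Y))

# Single-sided H-score via (26): project B onto the column space of Xi_X.
proj = Xi_X @ np.linalg.inv(Xi_X.T @ Xi_X) @ Xi_X.T
h_s = 0.5 * fro2(B @ proj)

bound = 0.5 * np.sum(sigma[:k] ** 2)
print(h_sv, h_s, bound, k / 2)
```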

3.4. Experiments

This section presents experiments for validating our theoretical characterizations, with corresponding code available at https://github.com/XiangxiangXu/dnn (accessed on 7 December 2021). Specifically, all DNN models used in Section 3.4.3 are available at https://keras.io/applications/ (accessed on 7 December 2021).

3.4.1. Experimental Validation of Theorem 4

We first validate Theorem 4, which characterizes the optimal feature extracted by a network with ideal expressive power. Here, we consider discrete data with alphabet sizes $|\mathcal{X}| = 8$ and $|\mathcal{Y}| = 6$, and construct the network as shown in Figure 3. Specifically, the network input is the one-hot encoding of $X$, i.e., $[\mathbb{1}_X(1), \dots, \mathbb{1}_X(|\mathcal{X}|)]^{\mathrm{T}}$, where $\mathbb{1}_X(x)$ takes the value one if and only if $X = x$, and zero otherwise. Then, the feature $s(X)$ is generated by a linear layer, with the sigmoid function used as the activation function. For ease of comparison and presentation, we set the feature dimension to $k = 1$, since otherwise the optimal feature (cf. Theorem 4) lies in a subspace and is non-unique. It can be verified that this network has ideal expressive power, i.e., with proper weights in the first layer, $s(X)$ can express any desired function up to scaling and shifting.

Figure 3.

A simple neural network with ideal expressive power, which can generate any k=1 dimensional feature s of X by tuning the weights in the first layer.

To compare the result trained by the neural network and that in Theorem 4, we first randomly generate a distribution PXY, and then draw independently n= 100,000 pairs of (X,Y) samples. We then train the network using batch gradient descent, where we have applied Nesterov momentum [43] with the momentum hyperparameter being 0.9. In addition, we set the learning rate to 4 with a decay factor of 0.01 and clip gradients with norm exceeding 0.5. After training, the learned values of s(x),v(y) and b(y) are shown in Figure 4 and compared with theoretical results. From the figure, we can observe that the training results match our theoretical analyses.

Figure 4.

The trained feature s, weights v, and bias b of the network in Figure 3, which are compared with the corresponding theoretical results to show their coincidences.

3.4.2. Experimental Validation of Theorem 5

In addition, we validate Theorem 5 with the neural network depicted in Figure 5, using the same settings of $X, Y$. Specifically, the numbers of neurons in the hidden layers are set to $m = 4$ and $k = 3$, where $t(X)$ is randomly generated from $X$, and we have chosen the sigmoid function as the activation function $\sigma(\cdot)$ to generate $s(x)$. We then fix the weights and bias at the output layer and train the weights $w(1), w(2), w(3)$ and bias $c$ in the hidden layer to optimize the log-loss. Specifically, we use batch gradient descent with the Nesterov momentum hyperparameter being 0.9. In addition, we set the learning rate to 4 with a decay factor of $10^{-6}$ and clip gradients with norm exceeding 0.1. After training, Figure 6 shows the matching between the learned results and the corresponding theoretical values.

Figure 5.

The designed network for validating the impact of network structure on feature extraction, with $m = 4$ and $k = 3$ neurons in the two hidden layers. Our goal is to compare the learned weights $w(1), w(2), w(3)$ and bias $c$ in the hidden layer with our theoretical characterizations in Section 3.2.2.

Figure 6.

The trained weights $w$ and bias $c$ of the network in Figure 5, compared with the corresponding theoretical results to show their agreement.

3.4.3. Experimental Validation of H-Score

To validate the H-score as a performance measure for extracted features, we compare the H-score and classification accuracy of DNNs on image classification tasks. Specifically, we use the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset [22] and extract features using several deep neural networks with representative architecture designs [44,45,46,47,48,49]. After training the feature extractors on the ILSVRC2012 training set, we compute the H-score of the feature in the last hidden layer, as well as the classification accuracy on the ILSVRC2012 validation set (we use the validation set for testing because the labels of the ILSVRC2012 test set have not been publicly released). The results are summarized in Table 1, where $H_{\mathrm{AIC}}(s)$ is the AIC-corrected H-score as defined in (27), with $n_p$ being the number of model parameters and $n_s = 1{,}300{,}000$ corresponding to the number of training samples in ImageNet. The AIC-corrected H-score is largely consistent with the classification accuracy (the ordering differs only for the near-tie between Xception and InceptionV3), which supports the effectiveness of the H-score as a performance measure for neural networks.

Table 1.

Classification accuracy and H-score for different DNN models on the ImageNet dataset, where “Params” indicates the number of parameters (in millions) in the model and $H_{\mathrm{AIC}}$ represents the AIC-corrected H-score.

| DNN Model | Params [$\times 10^6$] | $H(s)$ | $H_{\mathrm{AIC}}(s)$ | Accuracy [%] |
|---|---|---|---|---|
| VGG16 [44] | 138.4 | 148.3 | 41.9 | 64.2 |
| VGG19 [44] | 143.7 | 152.7 | 42.2 | 64.7 |
| MobileNet [45] | 4.3 | 45.9 | 42.6 | 68.4 |
| DenseNet121 [46] | 8.1 | 59.5 | 53.3 | 71.4 |
| DenseNet169 [46] | 14.3 | 81.2 | 70.2 | 73.6 |
| DenseNet201 [46] | 20.2 | 89.1 | 73.5 | 74.4 |
| Xception [47] | 22.9 | 179.8 | 162.2 | 77.5 |
| InceptionV3 [48] | 23.9 | 181.2 | 162.9 | 76.3 |
| InceptionResNetV2 [49] | 55.9 | 241.1 | 198.1 | 79.1 |
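As a sanity check on Table 1, the AIC-corrected column can be recomputed from the other two. The sketch below assumes that the correction in (27) has the form $H_{\mathrm{AIC}}(s) = H(s) - n_p / n_s$. This form is an assumption here, since (27) is not restated in this section, but it reproduces every tabulated value to within rounding of the published figures.

```python
# (model, parameters in millions, H(s), tabulated H_AIC(s)) copied from Table 1
rows = [
    ("VGG16", 138.4, 148.3, 41.9),
    ("VGG19", 143.7, 152.7, 42.2),
    ("MobileNet", 4.3, 45.9, 42.6),
    ("DenseNet121", 8.1, 59.5, 53.3),
    ("DenseNet169", 14.3, 81.2, 70.2),
    ("DenseNet201", 20.2, 89.1, 73.5),
    ("Xception", 22.9, 179.8, 162.2),
    ("InceptionV3", 23.9, 181.2, 162.9),
    ("InceptionResNetV2", 55.9, 241.1, 198.1),
]
N_S_MILLIONS = 1.3  # n_s = 1,300,000 training samples

def aic_corrected(h_score, params_millions, n_s_millions=N_S_MILLIONS):
    """Assumed form of the correction: H(s) - n_p / n_s."""
    return h_score - params_millions / n_s_millions

# Largest deviation between the recomputed and the tabulated column.
max_gap = max(abs(aic_corrected(h, p) - h_aic) for _, p, h, h_aic in rows)
```

The penalty $n_p / n_s$ is what separates the VGG models (large $H(s)$, but more than 100 parameters per training sample) from the much smaller DenseNet models.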

4. Discussion

Our characterization gives an information-theoretic interpretation of the feature extraction process in DNNs, and also provides a practical performance measure for scoring neural networks. Unlike empirical studies focusing on specific datasets [7], our development is based on the probability distribution space, which is more general and can also provide theoretical insights. Moreover, compared with optimization-based theoretical characterizations, e.g., [11,13], the information-theoretic framework gives the solutions a direct operational meaning and a clearer interpretation.

As a first step in establishing a rigorous framework for DNN analysis, the present work can be extended in both theoretical and practical aspects. From the theoretical perspective, one extension is to investigate the analytical properties of general DNNs, using the theoretical insights obtained from the local analysis regime. For example, it was shown in [50] that the symmetry between feature and weights in DNNs established in the local analysis regime (cf. Section 3.2.1) also holds for general probability distributions. Another extension is to apply the framework to investigate the optimal feature for structured data or networks, e.g., data with sparsity structure [51].

From the practical perspective, in addition to the demonstrated example of evaluating existing DNN models (cf. Section 3.4.3), the H-score can also be used as an objective function in designing learning algorithms. In particular, such usages have been illustrated in multi-modal learning [52] and transfer learning [53] tasks.

5. Conclusions

In this paper, we apply the local information geometric analysis and provide an information-theoretic interpretation to the feature extraction scheme in DNNs. We first establish an information metric for features in inference tasks by formalizing the information-theoretic feature selection problem. In addition, we demonstrate that the features extracted by DNNs coincide with the information-theoretically optimal feature, with the same metric measuring the performance of features, called H-score. Furthermore, we discuss the usage of the H-score for measuring the effectiveness of DNNs. Our framework demonstrates a connection between the practical deep learning implementations and information-theoretic characterizations, which can provide theoretical insights for DNN analysis and learning algorithm designs.

Appendix A. Proof of Theorem 1

We commence with the characterization of the error exponent.

Lemma A1.

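Given a reference distribution $P_X \in \mathrm{relint}(\mathcal{P}^{\mathcal{X}})$, a constant $\epsilon > 0$, and integers $n$ and $k$, let $x_1, \dots, x_n$ denote i.i.d. samples from one of $P_1$ or $P_2$, where $P_1, P_2 \in \mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$. To decide whether $P_1$ or $P_2$ is the generating distribution, a sequence of $k$-dimensional statistics $h^k = (h_1, \dots, h_k)$ is constructed as

$$h_i = \frac{1}{n} \sum_{l=1}^{n} f_i(x_l), \quad i = 1, \dots, k, \tag{A1}$$

where $(f_1(X), \dots, f_k(X))$ are zero-mean, unit-variance, and uncorrelated with respect to $P_X$, i.e.,

$$\mathbb{E}_{P_X}[f_i(X)] = 0, \quad i \in \{1, \dots, k\}, \tag{A2}$$
$$\mathbb{E}_{P_X}[f_i(X) f_j(X)] = \delta_{ij}, \quad i, j \in \{1, \dots, k\}. \tag{A3}$$

Then, the error probability $p_e$ of the decision based on $h^k$ decays exponentially in $n$ as $n \to \infty$, with (Chernoff) exponent

$$\lim_{n \to \infty} \frac{-\log p_e}{n} \triangleq E_{h^k} = \sum_{i=1}^{k} E_{h_i}, \tag{A4}$$

where

$$E_{h_i} = \frac{1}{8} \langle \phi_1 - \phi_2, \xi_i \rangle^2 + o(\epsilon^2), \tag{A5}$$

and $\phi_1 \leftrightarrow P_1$, $\phi_2 \leftrightarrow P_2$, $\xi_i \leftrightarrow f_i(X)$, $i \in \{1, \dots, k\}$, are the corresponding information vectors.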

Proof of Lemma A1.

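Since the rule is to decide based on comparing the projection

$$\sum_{i=1}^{k} h_i \left( \mathbb{E}_{P_1}[f_i(X)] - \mathbb{E}_{P_2}[f_i(X)] \right)$$

to a threshold, via Cramér’s theorem [54], the error exponent under $P_j$ ($j = 1, 2$) is

$$E_j(\lambda) = \min_{P \in \mathcal{S}(\lambda)} D(P \| P_j), \tag{A6}$$

where

$$\mathcal{S}(\lambda) \triangleq \left\{ P \in \mathcal{P}^{\mathcal{X}} : \mathbb{E}_{P}[f^k(X)] = \lambda\, \mathbb{E}_{P_1}[f^k(X)] + (1 - \lambda)\, \mathbb{E}_{P_2}[f^k(X)] \right\}. \tag{A7}$$

Now, since (A2) holds, we obtain

$$\begin{aligned}
\mathbb{E}_{P_j}[f_i(X)] &= \sum_{x \in \mathcal{X}} P_j(x) f_i(x) \\
&= \sum_{x \in \mathcal{X}} P_X(x) f_i(x) + \sum_{x \in \mathcal{X}} \left( P_j(x) - P_X(x) \right) f_i(x) \\
&= \mathbb{E}_{P_X}[f_i(X)] + \sum_{x \in \mathcal{X}} \sqrt{P_X(x)}\, \phi_j(x) \cdot \frac{\xi_i(x)}{\sqrt{P_X(x)}} \\
&= \sum_{x \in \mathcal{X}} \phi_j(x) \xi_i(x) = \langle \phi_j, \xi_i \rangle, \quad j = 1, 2 \ \text{and} \ i = 1, \dots, k,
\end{aligned} \tag{A8}$$

which we express compactly as

$$\mathbb{E}_{P_j}[f^k(X)] = \langle \phi_j, \xi^k \rangle, \quad j = 1, 2,$$

with $\xi^k \triangleq (\xi_1, \dots, \xi_k)$.

Hence, the constraint (A7) is expressed in information vectors as

$$\langle \phi, \xi_i \rangle = \langle \lambda \phi_1 + (1 - \lambda) \phi_2, \xi_i \rangle, \quad i = 1, \dots, k,$$

i.e.,

$$\langle \phi, \xi^k \rangle = \langle \lambda \phi_1 + (1 - \lambda) \phi_2, \xi^k \rangle. \tag{A9}$$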

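In turn, the optimal $P$ in (A6), which we denote by $P^*$, lies in the exponential family through $P_j$ with natural statistic $f^k(x)$, i.e., the $k$-dimensional family whose members are of the form

$$\log \tilde{P}_{\theta^k}(x) = \sum_{i=1}^{k} \theta_i f_i(x) + \log P_j(x) - \alpha(\theta^k),$$

for which the associated information vector is

$$\tilde{\phi}_{\theta^k}(x) = \sum_{i=1}^{k} \theta_i \xi_i(x) + \phi_j(x) - \alpha(\theta^k) \sqrt{P_X(x)} + o(\epsilon), \tag{A10}$$

where we have used the fact that

$$\begin{aligned}
\log Q_X(x) &= \log P_X(x) + \log \frac{Q_X(x)}{P_X(x)} \\
&= \log P_X(x) + \log \left( 1 + \frac{1}{\sqrt{P_X(x)}} \phi(x) \right) \\
&= \log P_X(x) + \frac{1}{\sqrt{P_X(x)}} \phi(x) + o(\epsilon)
\end{aligned}$$

for all $Q_X \in \mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$ with the information vector $\phi \leftrightarrow Q_X$. As a result,

$$\langle \tilde{\phi}_{\theta^k}, \xi_i \rangle = \theta_i + \langle \phi_j, \xi_i \rangle + o(\epsilon),$$

where we have used (A3). Hence, via (A9), we obtain that the intersection with the linear family (A7) is at $P^* = P_{\theta^k_*}$ with

$$\theta_i^* = \langle \lambda \phi_1 + (1 - \lambda) \phi_2 - \phi_j, \xi_i \rangle + o(\epsilon)$$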

and thus

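$$\begin{aligned}
E_j(\lambda) &= D(P^* \| P_j) \\
&= \frac{1}{2} \left\| \tilde{\phi}_{\theta^k_*} - \phi_j \right\|^2 + o(\epsilon^2) && \text{(A11)} \\
&= \frac{1}{2} \left\| \sum_{i=1}^{k} \theta_i^* \xi_i \right\|^2 + \frac{1}{2} \alpha(\theta^k_*)^2 + o(\epsilon^2) && \text{(A12)} \\
&= \frac{1}{2} \sum_{i=1}^{k} (\theta_i^*)^2 + \frac{1}{2} \alpha(\theta^k_*)^2 + o(\epsilon^2) && \text{(A13)} \\
&= \frac{1}{2} \sum_{i=1}^{k} \langle \lambda \phi_1 + (1 - \lambda) \phi_2 - \phi_j, \xi_i \rangle^2 + o(\epsilon^2), && \text{(A14)}
\end{aligned}$$

where to obtain (A11) we have exploited the local approximation of KL divergence [18], to obtain (A12) we have exploited (A10), to obtain (A13) we have again exploited (A3), and to obtain (A14) we have used that

$$\alpha(\theta^k_*) = o(\epsilon^2)$$

since $\theta^k_* = O(\epsilon)$ and

$$\alpha(0) = 0, \quad \text{and} \quad \nabla \alpha(0) = \mathbb{E}_{P_j}[f^k(X)] = \langle \phi_j, \xi^k \rangle = O(\epsilon).$$

Finally, $E_1(\lambda) = E_2(\lambda)$ when $\lambda = 1/2$, so the overall error probability has exponent (A5). □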
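The local approximation of KL divergence used in (A11) can be checked numerically. The sketch below is a minimal illustration and not part of the original proof. It assumes the information-vector convention $\phi(x) = (P(x) - P_X(x)) / \sqrt{P_X(x)}$ implied by the expansion of $\log Q_X(x)$ above. The reference distribution, the seed, and the helper names `neighbor` and `kl` are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)      # illustrative seed (assumption)
p_ref = rng.random(8) + 0.5         # reference distribution P_X, bounded away from 0
p_ref /= p_ref.sum()
root = np.sqrt(p_ref)

def neighbor(eps):
    """Draw a distribution whose information vector w.r.t. p_ref has norm eps."""
    phi = rng.standard_normal(p_ref.size)
    phi -= (phi @ root) * root      # enforce sum_x sqrt(P_X(x)) * phi(x) = 0
    phi *= eps / np.linalg.norm(phi)
    return p_ref + root * phi, phi

def kl(p, q):
    """KL divergence D(p || q) in nats."""
    return float(np.sum(p * np.log(p / q)))

eps = 1e-3
p1, phi1 = neighbor(eps)
p2, phi2 = neighbor(eps)
exact = kl(p1, p2)
local = 0.5 * float(np.sum((phi1 - phi2) ** 2))
ratio = exact / local               # approaches 1 as eps -> 0
```

Repeating the computation with a larger `eps` shows the ratio drifting away from one, which is the $o(\epsilon^2)$ term in (A11).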

Then, the following lemma demonstrates a property of information vectors in a Markov chain.

Lemma A2.

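Given the Markov relation $X \leftrightarrow Y \leftrightarrow V$ and any $v \in \mathcal{V}$, let $\phi_v^{X|V}$ and $\phi_v^{Y|V}$ denote the associated information vectors for $P_{X|V}(\cdot|v)$ and $P_{Y|V}(\cdot|v)$; then we have

$$\phi_v^{X|V} = \tilde{B}^{\mathrm{T}} \phi_v^{Y|V}. \tag{A15}$$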

Proof of Lemma A2.

From the Markov relation we have

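$$P_X(x) = \sum_{y \in \mathcal{Y}} P_{X|Y}(x|y) P_Y(y)$$

and

$$P_{X|V}(x|v) = \sum_{y \in \mathcal{Y}} P_{X|Y,V}(x|y,v) P_{Y|V}(y|v) = \sum_{y \in \mathcal{Y}} P_{X|Y}(x|y) P_{Y|V}(y|v).$$

As a result,

$$P_{X|V}(x|v) - P_X(x) = \sum_{y \in \mathcal{Y}} P_{X|Y}(x|y) \left[ P_{Y|V}(y|v) - P_Y(y) \right],$$

from which we obtain the corresponding information vector

$$\begin{aligned}
\phi_v^{X|V}(x) &= \frac{1}{\sqrt{P_X(x)}} \sum_{y \in \mathcal{Y}} P_{X|Y}(x|y) \sqrt{P_Y(y)}\, \phi_v^{Y|V}(y) \\
&= \sum_{y \in \mathcal{Y}} \left[ \tilde{B}(y,x) + \sqrt{P_X(x) P_Y(y)} \right] \phi_v^{Y|V}(y) \\
&= \sum_{y \in \mathcal{Y}} \tilde{B}(y,x)\, \phi_v^{Y|V}(y),
\end{aligned} \tag{A16}$$

where the last equality follows from the fact that

$$\sum_{y \in \mathcal{Y}} \sqrt{P_Y(y)}\, \phi_v^{Y|V}(y) = \sum_{y \in \mathcal{Y}} \left[ P_{Y|V}(y|v) - P_Y(y) \right] = 0.$$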

Finally, rewrite (A16) in the matrix form and we obtain (A15). □
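The identity (A15) holds exactly for any Markov chain, which makes it easy to verify numerically. The following sketch is an illustration with randomly drawn conditional distributions. The alphabet sizes, the seed, and the helper `random_stochastic` are hypothetical, and $\tilde{B}(y,x) = [P_{XY}(x,y) - P_X(x) P_Y(y)] / \sqrt{P_X(x) P_Y(y)}$ is assumed, as in (A16).

```python
import numpy as np

rng = np.random.default_rng(1)               # illustrative seed (assumption)

def random_stochastic(rows, cols):
    """Random matrix whose rows are probability vectors."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

n_v, n_y, n_x = 4, 6, 8
p_v = random_stochastic(1, n_v)[0]
p_y_given_v = random_stochastic(n_v, n_y)    # indexed [v, y]
p_x_given_y = random_stochastic(n_y, n_x)    # indexed [y, x]

# Marginals and conditionals implied by the Markov chain X - Y - V.
p_y = p_v @ p_y_given_v
p_x = p_y @ p_x_given_y
p_x_given_v = p_y_given_v @ p_x_given_y      # indexed [v, x]

p_yx = p_y[:, None] * p_x_given_y
b_tilde = (p_yx - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

# Information vectors, one row per value v.
phi_y = (p_y_given_v - p_y) / np.sqrt(p_y)
phi_x = (p_x_given_v - p_x) / np.sqrt(p_x)

# (A15) in row form: phi_x[v] = phi_y[v] @ B~.
max_err = float(np.abs(phi_x - phi_y @ b_tilde).max())
```

Note that no local (small $\epsilon$) assumption is needed for this particular identity.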

In addition, the following lemma is useful for dealing with the expectation over an RIE.

Lemma A3.

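Let $z$ be a spherically symmetric random vector of dimension $M$, i.e., for any orthogonal $Q$ we have $z \overset{d}{=} Qz$. If $A$ is a fixed matrix of compatible dimensions, then

$$\mathbb{E}\left[ \left\| z^{\mathrm{T}} A \right\|^2 \right] = \frac{1}{M}\, \mathbb{E}\left[ \|z\|^2 \right] \|A\|_{\mathrm{F}}^2. \tag{A17}$$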

Proof of Lemma A3.

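By definition, we have $\Lambda_z = Q \Lambda_z Q^{\mathrm{T}}$ for any orthogonal $Q$; hence, $\Lambda_z$ is a scalar multiple of the identity matrix. Writing $\Lambda_z = \lambda I$, from

$$\mathrm{tr}(\Lambda_z) = \mathbb{E}\left[ \|z\|^2 \right] = \lambda M$$

we obtain

$$\lambda = \frac{1}{M} \mathrm{tr}(\Lambda_z).$$

As a result, we have

$$\mathbb{E}\left[ \left\| z^{\mathrm{T}} A \right\|^2 \right] = \mathrm{tr}\left( A^{\mathrm{T}} \Lambda_z A \right) = \lambda\, \mathrm{tr}\left( A^{\mathrm{T}} A \right) = \frac{1}{M}\, \mathbb{E}\left[ \|z\|^2 \right] \|A\|_{\mathrm{F}}^2.$$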
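A quick Monte Carlo experiment illustrates (A17). This is a sketch under stated assumptions: $z$ is built as a uniformly distributed direction times an independent exponential radius, which is one convenient spherically symmetric law among many, and the dimensions, sample size, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)     # illustrative seed (assumption)
M, k, n = 6, 3, 200_000
A = rng.standard_normal((M, k))    # fixed matrix of compatible dimensions

# Spherically symmetric samples: uniform direction times an independent radius.
g = rng.standard_normal((n, M))
direction = g / np.linalg.norm(g, axis=1, keepdims=True)
radius = rng.exponential(size=(n, 1))
z = radius * direction

lhs = float(np.mean(np.sum((z @ A) ** 2, axis=1)))        # E ||z^T A||^2
rhs = float(np.mean(np.sum(z ** 2, axis=1)) / M * np.sum(A ** 2))
rel_gap = abs(lhs / rhs - 1.0)     # shrinks roughly like 1 / sqrt(n)
```

Replacing the radius distribution changes both sides by the same factor $\mathbb{E}[\|z\|^2]$, consistent with the lemma depending on $z$ only through its second moment.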

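Proceeding to our proof of Theorem 1, by definition of feature functions, we have $\mathbb{E}_{P_X}[f_i(X)] = 0$, $i = 1, \dots, k$. Suppose $f$ is the vector representation of $f^k$ and denote by $\tilde{f} \triangleq \Lambda_f^{-1/2} f$ the normalized $f$, with $\Lambda_f^{1/2}$ denoting any square root matrix of $\Lambda_f$. Then, the corresponding statistics $\tilde{f}^k = (\tilde{f}_1, \dots, \tilde{f}_k)$ satisfy the constraints (A2) and (A3). In addition, we construct the statistic $\tilde{h}^k = (\tilde{h}_1, \dots, \tilde{h}_k)$ as [cf. (A1)]

$$\tilde{h}_i = \frac{1}{n} \sum_{l=1}^{n} \tilde{f}_i(x_l), \quad i = 1, \dots, k. \tag{A18}$$

Then, from Lemma A1, the error exponent of distinguishing $v$ and $v'$ based on $\tilde{h}^k$ is

$$E_{\tilde{h}^k}(v, v') = \frac{1}{8} \sum_{i=1}^{k} \left[ \left( \phi_v^{X|V} - \phi_{v'}^{X|V} \right)^{\mathrm{T}} \tilde{\xi}_i^X \right]^2 + o(\epsilon^2) = \frac{1}{8} \left\| \left( \phi_v^{X|V} - \phi_{v'}^{X|V} \right)^{\mathrm{T}} \tilde{\Xi}^X \right\|^2 + o(\epsilon^2),$$

where $\phi_v^{X|V}$ denotes the associated information vector for $P_{X|V}(\cdot|v)$, $\tilde{\xi}_i^X$ denotes the information vector of $\tilde{f}_i$, and $\tilde{\Xi}^X \triangleq [\tilde{\xi}_1^X, \dots, \tilde{\xi}_k^X]$. Since the optimal decision rule is linear, the error exponent is invariant under linear transformations of the statistics, i.e.,

$$E_{h^k}(v, v') = E_{\tilde{h}^k}(v, v') = \frac{1}{8} \left\| \left( \phi_v^{X|V} - \phi_{v'}^{X|V} \right)^{\mathrm{T}} \tilde{\Xi}^X \right\|^2 + o(\epsilon^2) = \frac{1}{8} \left\| \left( \phi_v^{Y|V} - \phi_{v'}^{Y|V} \right)^{\mathrm{T}} \tilde{B} \tilde{\Xi}^X \right\|^2 + o(\epsilon^2), \tag{A19}$$

where the last equality follows from Lemma A2.

As a result, taking the expectation of (A19) over a given RIE yields

$$\mathbb{E}\left[ E_{h^k}(v, v') \right] = \frac{1}{8}\, \mathbb{E}\left[ \left\| \left( \phi_v^{Y|V} - \phi_{v'}^{Y|V} \right)^{\mathrm{T}} \tilde{B} \tilde{\Xi}^X \right\|^2 \right] + o(\epsilon^2) = \frac{\mathbb{E}\left[ \left\| \phi_v^{Y|V} - \phi_{v'}^{Y|V} \right\|^2 \right]}{8 |\mathcal{Y}|} \left\| \tilde{B} \tilde{\Xi}^X \right\|_{\mathrm{F}}^2 + o(\epsilon^2),$$

where we have exploited Lemma A3. Finally, the error exponent (7) can be obtained by noting from the definition of $\tilde{f}^k$ that

$$\tilde{\Xi}^X = \Xi^X \left( {\Xi^X}^{\mathrm{T}} \Xi^X \right)^{-\frac{1}{2}}.$$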

Appendix B. Proof of Lemma 2

We first prove two useful lemmas.

Lemma A4.

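For distributions $P \in \mathrm{relint}(\mathcal{P}^{\mathcal{X}})$, $Q, R \in \mathcal{P}^{\mathcal{X}}$, and sufficiently small $\epsilon$, if $D(P \| Q) \le \epsilon^2$ and $D(P \| R) \le \epsilon^2$, then there exists a constant $C > 0$ independent of $\epsilon$, such that $D(Q \| R) \le C \epsilon^2$.

Proof of Lemma A4.

Denote by $\|\cdot\|_1$ the $\ell_1$-distance between distributions, i.e., $\|P - Q\|_1 \triangleq \sum_{x \in \mathcal{X}} |P(x) - Q(x)|$; then from Pinsker’s inequality [14], we have

$$\|P - Q\|_1 \le \sqrt{2 D(P \| Q)} < \sqrt{2}\, \epsilon, \tag{A20}$$
$$\|P - R\|_1 \le \sqrt{2 D(P \| R)} < \sqrt{2}\, \epsilon, \tag{A21}$$

which implies

$$\|Q - R\|_1 \le \|P - Q\|_1 + \|P - R\|_1 \le 2\sqrt{2}\, \epsilon. \tag{A22}$$

In addition, with $p_{\min} \triangleq \min_{x \in \mathcal{X}} P(x)$, for all $x \in \mathcal{X}$ we have

$$\begin{aligned}
R(x) &> P(x) - |P(x) - R(x)| && \text{(A23)} \\
&> \min_{x \in \mathcal{X}} P(x) - \sqrt{2}\, \epsilon && \text{(A24)} \\
&= p_{\min} - \sqrt{2}\, \epsilon, && \text{(A25)}
\end{aligned}$$

where to obtain (A24) we have used (A21). Note that since $P \in \mathrm{relint}(\mathcal{P}^{\mathcal{X}})$ we have $p_{\min} > 0$, and thus $R(x) > p_{\min}/2$ for sufficiently small $\epsilon$. As a result,

$$\begin{aligned}
D(Q \| R) &\le \sum_{x \in \mathcal{X}} \frac{(Q(x) - R(x))^2}{R(x)} && \text{(A26)} \\
&\le \frac{2}{p_{\min}} \sum_{x \in \mathcal{X}} [Q(x) - R(x)]^2 && \text{(A27)} \\
&\le \frac{2 \|Q - R\|_1^2}{p_{\min}} && \text{(A28)} \\
&\le \frac{16}{p_{\min}} \epsilon^2, && \text{(A29)}
\end{aligned}$$

where to obtain (A26) we have used the fact that KL divergence is upper bounded by the corresponding $\chi^2$-divergence [55], and to obtain (A29) we have used (A22). □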

Lemma A5.

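For all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, we have

$$D\left( P_X P_Y \,\middle\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \right) \ge P_X(x) \log \left[ P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \tau(x,y)} \right],$$

where $\tilde{P}_{Y|X}^{(s,v,b)}$ is as defined in (4), and where we have defined $\tau(x, y) \triangleq \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y)$.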

Proof of Lemma A5.

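First, we can rewrite the conditional distribution $\tilde{P}_{Y|X}^{(s,v,b)}(y|x)$ as

$$\begin{aligned}
\tilde{P}_{Y|X}^{(s,v,b)}(y|x) &= \frac{e^{v^{\mathrm{T}}(y) s(x) + b(y)}}{\sum_{y' \in \mathcal{Y}} e^{v^{\mathrm{T}}(y') s(x) + b(y')}} = \frac{P_Y(y)\, e^{v^{\mathrm{T}}(y) s(x) + d(y)}}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{v^{\mathrm{T}}(y') s(x) + d(y')}} \\
&= \frac{P_Y(y)\, e^{\tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y)}}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tilde{v}^{\mathrm{T}}(y') s(x) + \tilde{d}(y')}} = \frac{P_Y(y)\, e^{\tau(x,y)}}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')}}.
\end{aligned} \tag{A30}$$

Then, the KL divergence $D\left( P_X P_Y \,\middle\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \right)$ can be expressed as

$$\begin{aligned}
D\left( P_X P_Y \,\middle\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \right) &= \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_X(x) P_Y(y) \log \frac{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')}}{e^{\tau(x,y)}} \\
&= \sum_{x \in \mathcal{X}} P_X(x) \log \sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')} - \mathbb{E}_{P_X P_Y}[\tau(X, Y)] \\
&= \sum_{x \in \mathcal{X}} P_X(x) \log \sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')},
\end{aligned} \tag{A31}$$

where to obtain the last equality we have used the fact that $\mathbb{E}_{P_X P_Y}[\tau(X, Y)] = 0$. As a result, we have

$$\begin{aligned}
D\left( P_X P_Y \,\middle\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \right) &\ge P_X(x) \log \sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')} && \text{(A32)} \\
&\ge P_X(x) \log \left[ P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \tau(x,y)} \right], && \text{(A33)}
\end{aligned}$$

where the last inequality follows from Jensen’s inequality:

$$\begin{aligned}
\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')} &= P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y)) \sum_{y' \ne y} \frac{P_Y(y')}{1 - P_Y(y)}\, e^{\tau(x,y')} \\
&\ge P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y)) \exp\left( \frac{1}{1 - P_Y(y)} \sum_{y' \ne y} P_Y(y')\, \tau(x,y') \right) \\
&= P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \tau(x,y)}.
\end{aligned}$$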

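Proceeding to our proof of Lemma 2, first note that when $v = d = 0$, we have $\tilde{P}_{Y|X}^{(s,v,b)} = P_Y$. As a result, the optimal $v, d$ for (8) satisfy

$$D\left( P_{XY} \,\middle\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \right) \le D(P_{XY} \| P_X P_Y) \le \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \frac{\left[ P_{X,Y}(x,y) - P_X(x) P_Y(y) \right]^2}{P_X(x) P_Y(y)} \le \epsilon^2, \tag{A34}$$

where to obtain the second inequality we have again exploited the $\chi^2$-divergence as an upper bound of KL divergence [55], and to obtain the last inequality we have used the definition of $\epsilon$-dependency.

As $P_{XY} \in \mathrm{relint}(\mathcal{P}^{\mathcal{X} \times \mathcal{Y}})$, from Lemma A4, there exist $C > 0$ and $\epsilon_1 > 0$ such that $D\left( P_X P_Y \,\middle\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \right) < C \epsilon^2$ for all $\epsilon < \epsilon_1$. Furthermore, from Lemma A5, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $\epsilon \in (0, \epsilon_1)$, we have

$$C \epsilon^2 \ge P_X(x) \log \left[ P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \tau(x,y)} \right]. \tag{A35}$$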

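Note that the logarithmic term on the right-hand side of (A35) satisfies

$$\log \left[ P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \tau(x,y)} \right] = \frac{P_Y(y)}{2 (1 - P_Y(y))} \tau^2(x,y) + o(\tau^2(x,y)).$$

Therefore, there exists $\delta > 0$ independent of $\epsilon_1$, such that for all $|\tau(x,y)| \le \delta$, we have

$$\log \left[ P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \tau(x,y)} \right] > \frac{P_Y(y)}{2} \tau^2(x,y). \tag{A36}$$

In addition, if $|\tau(x,y)| > \delta$, we have

$$\begin{aligned}
&\log \left[ P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \tau(x,y)} \right] \\
&\quad \ge \min \left\{ \log \left[ P_Y(y) e^{\delta} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \delta} \right],\ \log \left[ P_Y(y) e^{-\delta} + (1 - P_Y(y))\, e^{\frac{P_Y(y)}{1 - P_Y(y)} \delta} \right] \right\} \\
&\quad \ge \frac{P_Y(y)}{2} \delta^2,
\end{aligned}$$

where to obtain the first inequality we have exploited the monotonicity of the function $t \mapsto P_Y(y) e^{t} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} t}$ on each side of $t = 0$, and to obtain the second inequality we have exploited (A36).

As a result, we have

$$\log \left[ P_Y(y) e^{\tau(x,y)} + (1 - P_Y(y))\, e^{-\frac{P_Y(y)}{1 - P_Y(y)} \tau(x,y)} \right] > \frac{P_Y(y)}{2} \cdot \min\{\delta^2, \tau^2(x,y)\}. \tag{A37}$$

Hence, (A35) becomes

$$C \epsilon^2 \ge \frac{P_X(x) P_Y(y)}{2} \cdot \min\{\delta^2, \tau^2(x,y)\}, \tag{A38}$$

from which we can obtain $\tau(x,y) = O(\epsilon)$. To see this, let

$$\epsilon_2 \triangleq \frac{\delta}{\sqrt{2C}} \cdot \min_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \sqrt{P_X(x) P_Y(y)}, \quad \epsilon_0 \triangleq \min\{\epsilon_1, \epsilon_2\}.$$

Then, for all $\epsilon < \epsilon_0$, we have

$$C \epsilon^2 < \frac{P_X(x) P_Y(y)}{2} \cdot \delta^2,$$

and (A38) implies $|\tau(x,y)| < C' \epsilon$ with $C' = \sqrt{\frac{2C}{P_X(x) P_Y(y)}}$.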

Appendix C. Proof of Lemma 3

Proof. 

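From Lemma 2, there exists $C > 0$ such that for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, we have

$$\left| \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) \right| < C \epsilon, \tag{A39}$$

which implies

$$\left| \mu_s^{\mathrm{T}} \tilde{v}(y) + \tilde{d}(y) \right| < C' \epsilon, \tag{A40}$$
$$\left| \tilde{v}^{\mathrm{T}}(y) \tilde{s}(x) \right| < 2 C' \epsilon, \tag{A41}$$

with $C' = \max\{C, 1\}$.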

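From (A30), we can assume $\mathbb{E}_{P_Y}[v(Y)] = \mathbb{E}_{P_Y}[d(Y)] = 0$ without loss of generality. Then, (4) can be rewritten as

$$\tilde{P}_{Y|X}^{(s,v,b)}(y|x) = \frac{P_Y(y)\, e^{\tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y)}}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tilde{v}^{\mathrm{T}}(y') s(x) + \tilde{d}(y')}}, \tag{A42}$$

and the numerator can be written as

$$P_Y(y)\, e^{\tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y)} = P_Y(y) \left[ 1 + \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) + o(\epsilon) \right] = P_Y(y) \left[ 1 + \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) \right] + o(\epsilon),$$

where we have used (A39). Similarly, from

$$\begin{aligned}
\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tilde{v}^{\mathrm{T}}(y') s(x) + \tilde{d}(y')} &= \sum_{y' \in \mathcal{Y}} P_Y(y') \left[ 1 + \tilde{v}^{\mathrm{T}}(y') s(x) + \tilde{d}(y') \right] + o(\epsilon) \\
&= 1 + \mathbb{E}_{P_Y}\left[ \tilde{v}^{\mathrm{T}}(Y) \right] s(x) + \mathbb{E}_{P_Y}\left[ \tilde{d}(Y) \right] + o(\epsilon) \\
&= 1 + o(\epsilon)
\end{aligned}$$

we obtain

$$\frac{1}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tilde{v}^{\mathrm{T}}(y') s(x) + \tilde{d}(y')}} = \frac{1}{1 + o(\epsilon)} = 1 + o(\epsilon).$$

As a result, (A42) can be written as

$$\tilde{P}_{Y|X}^{(s,v,b)}(y|x) = \left\{ P_Y(y) \left[ 1 + \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) \right] + o(\epsilon) \right\} [1 + o(\epsilon)] = P_Y(y) \left[ 1 + \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) \right] + o(\epsilon), \tag{A43}$$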

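which implies $P_X \tilde{P}_{Y|X}^{(s,v,b)} \in \mathcal{N}_{C'\epsilon}^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y)$ for sufficiently small $\epsilon$. In addition, the local assumption of distributions implies that $P_{XY} \in \mathcal{N}_{\epsilon}^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y) \subset \mathcal{N}_{C'\epsilon}^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y)$. Again, from the local approximation of KL divergence [18]

$$D(P_1 \| P_2) = \frac{1}{2} \|\phi_1 - \phi_2\|^2 + o(\epsilon^2), \tag{A44}$$

we have

$$\begin{aligned}
&D\left( P_{Y,X} \,\middle\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \right) \\
&\quad = \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \frac{\left[ P_{Y,X}(y,x) - \tilde{P}_{Y|X}^{(s,v,b)}(y|x) P_X(x) \right]^2}{P_Y(y) P_X(x)} + o(\epsilon^2) \\
&\quad = \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \left[ \frac{P_{Y,X}(y,x) - P_Y(y) P_X(x)}{\sqrt{P_Y(y) P_X(x)}} - \sqrt{P_Y(y) P_X(x)} \left( \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) + o(\epsilon) \right) \right]^2 + o(\epsilon^2) \\
&\quad = \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \left[ \tilde{B}(y,x) - \sqrt{P_Y(y) P_X(x)}\, \tilde{v}^{\mathrm{T}}(y) \tilde{s}(x) - \sqrt{P_Y(y) P_X(x)} \left( \tilde{d}(y) + \mu_s^{\mathrm{T}} \tilde{v}(y) \right) - \sqrt{P_Y(y) P_X(x)}\, o(\epsilon) \right]^2 + o(\epsilon^2) \\
&\quad \overset{(*)}{=} \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \left[ \tilde{B}(y,x) - \sqrt{P_Y(y) P_X(x)}\, \tilde{v}^{\mathrm{T}}(y) \tilde{s}(x) \right]^2 + \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_Y(y) P_X(x) \left[ \tilde{d}(y) + \mu_s^{\mathrm{T}} \tilde{v}(y) \right]^2 + o(\epsilon^2) \\
&\quad = \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \left[ \tilde{B}(y,x) - \xi^Y(y)^{\mathrm{T}} \xi^X(x) \right]^2 + \frac{1}{2}\, \mathbb{E}_{P_Y}\left[ \left( \tilde{d}(Y) + \mu_s^{\mathrm{T}} \tilde{v}(Y) \right)^2 \right] + o(\epsilon^2) \\
&\quad = \frac{1}{2} \left\| \tilde{B} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 + \frac{1}{2} \eta^{(v,b)}(s) + o(\epsilon^2),
\end{aligned}$$

where to obtain $(*)$, we have used (A40) and (A41) together with the fact that $|\tilde{B}(y,x)| < \epsilon$, and that

$$\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \tilde{B}(y,x) \sqrt{P_Y(y) P_X(x)} \left[ \tilde{d}(y) + \mu_s^{\mathrm{T}} \tilde{v}(y) \right] = 0, \qquad \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_Y(y) P_X(x)\, \tilde{v}^{\mathrm{T}}(y) \tilde{s}(x) \left[ \tilde{d}(y) + \mu_s^{\mathrm{T}} \tilde{v}(y) \right] = 0,$$

since $\mathbb{E}[\tilde{d}(Y)] = 0$ and $\mathbb{E}[\tilde{s}(X)] = \mathbb{E}[\tilde{v}(Y)] = 0$. □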

Appendix D. Proofs of Theorems 2 and 3

Theorems 2 and 3 can be proved based on Lemma 3.

Proofs of Theorems 2 and 3.

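Note that the value of $d(\cdot)$ only affects the second term of the KL divergence; hence, we can always choose $d(\cdot)$ such that $\tilde{d}(y) + \mu_s^{\mathrm{T}} \tilde{v}(y) = 0$. Then, the $(\Xi^Y, \Xi^X)$ pair should be chosen as

$$(\Xi^Y, \Xi^X)^* = \underset{(\Xi^Y, \Xi^X)}{\arg\min} \left\| \tilde{B} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2. \tag{A45}$$

Set the derivative (we use the denominator-layout notation of matrix calculus where the scalar-by-matrix derivative will have the same dimension as the matrix)

$$\frac{\partial}{\partial \Xi^Y} \left\| \tilde{B} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 = 2 \left( \Xi^Y {\Xi^X}^{\mathrm{T}} \Xi^X - \tilde{B} \Xi^X \right) \tag{A46}$$

to zero, and the optimal $\Xi^Y$ for fixed $\Xi^X$ is (here, we assume the matrix ${\Xi^X}^{\mathrm{T}} \Xi^X = \Lambda_{\tilde{s}(X)}$ is invertible; for the case where ${\Xi^X}^{\mathrm{T}} \Xi^X$ is singular, we can obtain a similar result with the ordinary matrix inverse replaced by the Moore–Penrose inverse)

$${\Xi^Y}^* = \tilde{B} \Xi^X \left( {\Xi^X}^{\mathrm{T}} \Xi^X \right)^{-1}. \tag{A47}$$

As $\mathbf{1}^{\mathrm{T}} \sqrt{\mathbf{P}_Y}\, \tilde{B} = 0$, we have $\mathbf{1}^{\mathrm{T}} \sqrt{\mathbf{P}_Y}\, {\Xi^Y}^* = 0$, which demonstrates that ${\Xi^Y}^*$ is a valid matrix for a zero-mean feature vector.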

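To express ${\Xi^Y}^*$ of (A47) in the form of $s$ and $v$, we can make use of the correspondence between feature and information vectors. We can show that, for a zero-mean feature function $f(X)$ with corresponding information vector $\phi$, we have the correspondence $\mathbb{E}_{P_{X|Y}}[f(X)|Y] \leftrightarrow \tilde{B} \phi$. To see this, note that the $y$-th element of the information vector $\tilde{B} \phi$ is given by

$$\sum_{x \in \mathcal{X}} \tilde{B}(y,x) \phi(x) = \sum_{x \in \mathcal{X}} \frac{P_{XY}(x,y) - P_X(x) P_Y(y)}{\sqrt{P_X(x) P_Y(y)}}\, f(x) \sqrt{P_X(x)} = \frac{1}{\sqrt{P_Y(y)}} \sum_{x \in \mathcal{X}} P_{XY}(x,y) f(x) = \sqrt{P_Y(y)}\; \mathbb{E}_{P_{X|Y}}[f(X) | Y = y],$$

where the second equality uses the zero-mean property of $f(X)$.

Using similar methods, we can verify that $\Lambda_{\tilde{s}(X)} = {\Xi^X}^{\mathrm{T}} \Xi^X$. As a result, (A47) is equivalent to

$$\tilde{v}^*(y) = \mathbb{E}_{P_{X|Y}}\left[ \Lambda_{\tilde{s}(X)}^{-1} \tilde{s}(X) \,\middle|\, Y = y \right]. \tag{A48}$$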
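The equivalence between the matrix form (A47) and the conditional-expectation form (A48) can be checked numerically. The sketch below is an illustration, assuming the convention $\xi^X(x) = \sqrt{P_X(x)}\, \tilde{s}(x)$ and $\xi^Y(y) = \sqrt{P_Y(y)}\, \tilde{v}(y)$ used in Appendix C. The random joint distribution, the random feature `s`, and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)               # illustrative seed (assumption)
n_y, n_x, k = 6, 8, 2
p_yx = rng.random((n_y, n_x))                # p_yx[y, x] = P_XY(x, y)
p_yx /= p_yx.sum()
p_y, p_x = p_yx.sum(axis=1), p_yx.sum(axis=0)
b_tilde = (p_yx - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

s = rng.standard_normal((n_x, k))            # an arbitrary feature s(x)
s_tilde = s - p_x @ s                        # centered feature s~(x)
xi_x = np.sqrt(p_x)[:, None] * s_tilde       # information matrix Xi^X

# (A47): optimal Xi^Y for the fixed Xi^X.
gram = xi_x.T @ xi_x                         # equals the covariance of s(X)
xi_y_star = b_tilde @ xi_x @ np.linalg.inv(gram)

# (A48): v~*(y) = E[ Lambda^{-1} s~(X) | Y = y ], one row per y.
p_x_given_y = p_yx / p_y[:, None]
v_star = p_x_given_y @ s_tilde @ np.linalg.inv(gram)

# The two agree once v~* is mapped to an information matrix.
gap = float(np.abs(xi_y_star - np.sqrt(p_y)[:, None] * v_star).max())
```

When `gram` is singular, `np.linalg.pinv` could be substituted, in line with the Moore–Penrose remark after (A46).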

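By a symmetry argument, we can also obtain the first two equations of Theorem 3. To obtain the third equations of these two theorems, we need to minimize $\eta^{(v,b)}(s) = \mathbb{E}_{P_Y}\left[ \left( \mu_s^{\mathrm{T}} \tilde{v}(Y) + \tilde{d}(Y) \right)^2 \right]$. For given $\tilde{v}$ and $\mu_s$, the optimal $\tilde{d}$ is

$$\tilde{d}^*(y) = -\mu_s^{\mathrm{T}} \tilde{v}(y), \tag{A49}$$

and the corresponding $\eta^{(v,b)}(s) = 0$.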

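In addition, for given $\tilde{d}$ and $\tilde{v}$, we have

$$\eta^{(v,b)}(s) = \mathbb{E}_{P_Y}\left[ \left( \mu_s^{\mathrm{T}} \tilde{v}(Y) + \tilde{d}(Y) \right)^2 \right] = \mu_s^{\mathrm{T}} \Lambda_{\tilde{v}(Y)} \mu_s + 2 \mu_s^{\mathrm{T}}\, \mathbb{E}_{P_Y}\left[ \tilde{v}(Y) \tilde{d}(Y) \right] + \mathrm{var}(\tilde{d}(Y)). \tag{A50}$$

Set $\frac{\partial}{\partial \mu_s} \eta^{(v,b)}(s) = 0$ and we obtain

$$\mu_s^* = -\Lambda_{\tilde{v}(Y)}^{-1}\, \mathbb{E}_{P_Y}\left[ \tilde{v}(Y) \tilde{d}(Y) \right]. \tag{A51}$$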

Appendix E. Proof of Theorem 4

Proof. 

From Lemma 3, choosing the optimal $(\Xi^Y, \Xi^X)$ is equivalent to solving the matrix factorization problem of $\tilde{B}$. Since both $\Xi^Y$ and $\Xi^X$ have rank no greater than $k$, from the Eckart–Young–Mirsky theorem [56], the optimal choice of $\Xi^Y {\Xi^X}^{\mathrm{T}}$ should be the truncated singular value decomposition of $\tilde{B}$ with the top $k$ singular values. As a result, $(\Xi^Y, \Xi^X)^*$ are the left and right singular vectors of $\tilde{B}$ corresponding to the largest $k$ singular values.

The optimality of the bias $\tilde{d}(y) = -\mu_s^{\mathrm{T}} \tilde{v}(y)$ has already been shown in Appendix D. □

Appendix F. Proof of Theorem 5

The following lemma is useful to prove Theorem 5.

Lemma A6

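(Pythagorean theorem). Let ${\Xi^X}^*$ be the optimal matrix for given $\Xi^Y$ as defined in (13). Then,

$$\left\| \tilde{B} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 - \left\| \tilde{B} - \Xi^Y {\Xi^X}^{*\mathrm{T}} \right\|_{\mathrm{F}}^2 = \left\| \Xi^Y {\Xi^X}^{*\mathrm{T}} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2. \tag{A52}$$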

Proof of Lemma A6.

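Denote by $\langle U, V \rangle$ the Frobenius inner product of matrices $U$ and $V$, i.e., $\langle U, V \rangle \triangleq \mathrm{tr}(U^{\mathrm{T}} V)$, and we have

$$\left\langle \tilde{B} - \Xi^Y {\Xi^X}^{*\mathrm{T}},\ \Xi^Y {\Xi^X}^{\mathrm{T}} \right\rangle = \mathrm{tr}\left( \tilde{B} \Xi^X {\Xi^Y}^{\mathrm{T}} \right) - \mathrm{tr}\left( {\Xi^X}^* {\Xi^Y}^{\mathrm{T}} \Xi^Y {\Xi^X}^{\mathrm{T}} \right) = \mathrm{tr}\left( \tilde{B} \Xi^X {\Xi^Y}^{\mathrm{T}} \right) - \mathrm{tr}\left( \tilde{B}^{\mathrm{T}} \Xi^Y {\Xi^X}^{\mathrm{T}} \right) = 0.$$

As a result, we obtain

$$\begin{aligned}
\left\| \tilde{B} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 &= \left\| \tilde{B} - \Xi^Y {\Xi^X}^{*\mathrm{T}} + \Xi^Y {\Xi^X}^{*\mathrm{T}} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 \\
&= \left\| \tilde{B} - \Xi^Y {\Xi^X}^{*\mathrm{T}} \right\|_{\mathrm{F}}^2 + \left\| \Xi^Y {\Xi^X}^{*\mathrm{T}} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 + 2 \left\langle \tilde{B} - \Xi^Y {\Xi^X}^{*\mathrm{T}},\ \Xi^Y \left( {\Xi^X}^{*\mathrm{T}} - {\Xi^X}^{\mathrm{T}} \right) \right\rangle \\
&= \left\| \tilde{B} - \Xi^Y {\Xi^X}^{*\mathrm{T}} \right\|_{\mathrm{F}}^2 + \left\| \Xi^Y {\Xi^X}^{*\mathrm{T}} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2,
\end{aligned}$$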

which finishes the proof. □
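Since (A52) is a purely linear-algebraic identity, it can be checked with arbitrary matrices. The sketch below is illustrative: a random matrix `b` stands in for $\tilde{B}$, the optimal ${\Xi^X}^*$ is computed from the closed form given in (A56), and the dimensions and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)               # illustrative seed (assumption)
n_y, n_x, k = 6, 8, 3
b = rng.standard_normal((n_y, n_x))          # stands in for B~
xi_y = rng.standard_normal((n_y, k))         # given Xi^Y
xi_x = rng.standard_normal((n_x, k))         # arbitrary competitor Xi^X

# Least-squares optimal Xi^X for the given Xi^Y, cf. (A56).
xi_x_star = b.T @ xi_y @ np.linalg.inv(xi_y.T @ xi_y)

def sq_fro(m):
    """Squared Frobenius norm."""
    return float(np.sum(m ** 2))

lhs = sq_fro(b - xi_y @ xi_x.T) - sq_fro(b - xi_y @ xi_x_star.T)
rhs = sq_fro(xi_y @ xi_x_star.T - xi_y @ xi_x.T)
```

The gap `lhs - rhs` vanishes up to floating-point error because the residual `b - xi_y @ xi_x_star.T` is orthogonal to every matrix of the form `xi_y @ m`.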

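Proceeding to our proof of Theorem 5, from Lemma A6 we have

$$\begin{aligned}
L(s) - L(s^*) &= \frac{1}{2} \left[ \left\| \tilde{B} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 - \left\| \tilde{B} - \Xi^Y {\Xi^X}^{*\mathrm{T}} \right\|_{\mathrm{F}}^2 \right] + \frac{1}{2} \left[ \eta^{(v,b)}(s) - \eta^{(v,b)}(s^*) \right] + o(\epsilon^2) \\
&= \frac{1}{2} \left\| \Xi^Y {\Xi^X}^{*\mathrm{T}} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 + \frac{1}{2} \kappa^{(v,b)}(s, s^*) + o(\epsilon^2),
\end{aligned}$$

where $\kappa^{(v,b)}(s, s^*) \triangleq \eta^{(v,b)}(s) - \eta^{(v,b)}(s^*)$. We then optimize $\left\| \Xi^Y {\Xi^X}^{*\mathrm{T}} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2$ and $\kappa^{(v,b)}(s, s^*)$ separately.

For the first term, we need to express $\Xi^X$ in terms of $W$ and $\Xi_1^X$. From (17), we obtain

$$\mathbb{E}[s_z(X)] = \sigma(c(z)) + o(\epsilon), \tag{A53}$$
$$\tilde{s}_z(x) = w^{\mathrm{T}}(z)\, \tilde{t}(x) \cdot \sigma'(c(z)) + o(\epsilon), \tag{A54}$$

which can be expressed in information vectors as

$$\Xi^X = \Xi_1^X W^{\mathrm{T}} J + o(\epsilon). \tag{A55}$$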

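From Theorem 3, we have

$${\Xi^X}^* = \tilde{B}^{\mathrm{T}} \Xi^Y \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{-1}. \tag{A56}$$

As a result, we have

$$\begin{aligned}
\left\| \Xi^Y {\Xi^X}^{*\mathrm{T}} - \Xi^Y {\Xi^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 &= \left\| \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{1/2} \left( {\Xi^X}^{*\mathrm{T}} - {\Xi^X}^{\mathrm{T}} \right) \right\|_{\mathrm{F}}^2 \\
&= \left\| \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{1/2} \cdot \left( {\Xi^X}^{*\mathrm{T}} - J W {\Xi_1^X}^{\mathrm{T}} - o(\epsilon) \right) \right\|_{\mathrm{F}}^2 \\
&= \left\| \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{1/2} \cdot \left( {\Xi^X}^{*\mathrm{T}} - J W {\Xi_1^X}^{\mathrm{T}} \right) \right\|_{\mathrm{F}}^2 + o(\epsilon^2) \\
&= \left\| \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{1/2} J \cdot \left( J^{-1} {\Xi^X}^{*\mathrm{T}} - W {\Xi_1^X}^{\mathrm{T}} \right) \right\|_{\mathrm{F}}^2 + o(\epsilon^2) \\
&= \left\| \Theta \tilde{B}_1 - \Theta W {\Xi_1^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 + o(\epsilon^2),
\end{aligned} \tag{A57}$$

where the third equality follows from the fact that [cf. (A41)] $\tilde{s}(x) = O(\epsilon)$ and $\tilde{v}(y) = O(1)$, and the last equality follows from the definitions $\tilde{B}_1 \triangleq J^{-1} {\Xi^X}^{*\mathrm{T}}$ and $\Theta \triangleq \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{1/2} J$.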

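For the second term, from (A50) and (A51), we have

$$\begin{aligned}
\kappa^{(v,b)}(s, s^*) &= \left[ (\mu_s - \mu_s^*) + \mu_s^* \right]^{\mathrm{T}} \Lambda_{\tilde{v}(Y)} \left[ (\mu_s - \mu_s^*) + \mu_s^* \right] - {\mu_s^*}^{\mathrm{T}} \Lambda_{\tilde{v}(Y)} \mu_s^* + 2 (\mu_s - \mu_s^*)^{\mathrm{T}}\, \mathbb{E}_{P_Y}\left[ \tilde{v}(Y) \tilde{d}(Y) \right] \\
&= (\mu_s - \mu_s^*)^{\mathrm{T}} \Lambda_{\tilde{v}(Y)} (\mu_s - \mu_s^*) + 2 (\mu_s - \mu_s^*)^{\mathrm{T}} \left( \Lambda_{\tilde{v}(Y)} \mu_s^* + \mathbb{E}_{P_Y}\left[ \tilde{v}(Y) \tilde{d}(Y) \right] \right) \\
&= (\mu_s - \mu_s^*)^{\mathrm{T}} \Lambda_{\tilde{v}(Y)} (\mu_s - \mu_s^*).
\end{aligned} \tag{A58}$$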

Combining (A57) and (A58) finishes the proof.

Appendix G. Analyses of Hidden Layer Parameters

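First, from (A53), the bias $c(z)$ of the hidden layer is

$$c(z) = \sigma^{-1}(\mu_s^*(z)) + o(\epsilon).$$

(When $\mu_t \neq 0$, the formula should be modified to $c(z) = \sigma^{-1}(\mu_s^*(z)) - \mu_t^{\mathrm{T}} w(z) + o(\epsilon)$.)

To obtain $\mu_s^*$, let us define $\sigma_{\min} \triangleq \inf_x \sigma(x)$ and $\sigma_{\max} \triangleq \sup_x \sigma(x)$. Then, the optimal $\mu_s$ is the solution of

$$\begin{aligned}
\underset{\mu_s}{\text{minimize}} \quad & (\mu_s - \mu_s^*)^{\mathrm{T}} \Lambda_{\tilde{v}(Y)} (\mu_s - \mu_s^*) \\
\text{subject to} \quad & \sigma_{\min} \le \mu_s \le \sigma_{\max}.
\end{aligned} \tag{A59}$$

If $\mu_s^*$ satisfies the constraint of (A59), then it is the optimal solution. Otherwise, some elements of the optimal $\mu_s$ will be pushed to either $\sigma_{\min}$ or $\sigma_{\max}$, which is known as the saturation phenomenon [21].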

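To obtain $W^*$, let

$$\tilde{B}_1' \triangleq \Theta \tilde{B}_1 = \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{-1/2} {\Xi^Y}^{\mathrm{T}} \tilde{B}, \qquad W' \triangleq \Theta W = \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{1/2} J W.$$

Then, the optimal $W'$ is given by

$${W'}^* = \underset{W'}{\arg\min} \left\| \tilde{B}_1' - W' {\Xi_1^X}^{\mathrm{T}} \right\|_{\mathrm{F}}^2 = \tilde{B}_1' \Xi_1^X \left( {\Xi_1^X}^{\mathrm{T}} \Xi_1^X \right)^{-1}. \tag{A60}$$

Hence, $W^*$ is given by

$$W^* = \Theta^{-1} {W'}^* = \Theta^{-1} \tilde{B}_1' \Xi_1^X \left( {\Xi_1^X}^{\mathrm{T}} \Xi_1^X \right)^{-1} = \tilde{B}_1 \Xi_1^X \left( {\Xi_1^X}^{\mathrm{T}} \Xi_1^X \right)^{-1} = J^{-1} \cdot \left[ \Xi^Y \left( {\Xi^Y}^{\mathrm{T}} \Xi^Y \right)^{-1} \right]^{\mathrm{T}} \tilde{B} \Xi_1^X \left( {\Xi_1^X}^{\mathrm{T}} \Xi_1^X \right)^{-1},$$

where the term $\tilde{B} \Xi_1^X \left( {\Xi_1^X}^{\mathrm{T}} \Xi_1^X \right)^{-1}$ corresponds to a feature projection of $\tilde{t}(X)$:

$$\tilde{B} \Xi_1^X \left( {\Xi_1^X}^{\mathrm{T}} \Xi_1^X \right)^{-1} \leftrightarrow \mathbb{E}_{P_{X|Y}}\left[ \Lambda_{\tilde{t}(X)}^{-1} \tilde{t}(X) \,\middle|\, Y \right]. \tag{A61}$$

As a consequence, this multi-layer neural network conducts a generalized feature projection between features extracted from different layers. Note that the projected feature $\mathbb{E}_{P_{\tilde{t}|Y}}\left[ \Lambda_{\tilde{t}}^{-1} \tilde{t} \,\middle|\, Y \right]$ depends only on the distribution $P_{\tilde{t}|Y}$ and does not depend on the distribution $P_{X|Y}$. Therefore, the above computations can be accomplished without knowing the hidden random variable $X$ and can be applied to general cases.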

Author Contributions

X.X., S.-L.H., L.Z. and G.W.W. contributed to the conceptualization, methodology, and writing of this paper. All authors have read and agreed to the published version of the manuscript.

Funding

The work of S.-L. Huang was supported in part by the National Natural Science Foundation of China under Grant 61807021 and the Shenzhen Science and Technology Program under Grant KQTD20170810150821146. The work of L. Zheng was supported in part by the National Science Foundation (NSF) under Award CNS-2002908 and the Office of Naval Research (ONR) under Grant N00014-19-1-2621.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 26 June–1 July 2016; pp. 770–778.
2. Devlin J., Chang M.W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Minneapolis, MN, USA. 3–5 June 2019; Minneapolis, MN, USA: Association for Computational Linguistics; 2019. pp. 4171–4186. Volume 1 (Long and Short Papers).
3. Brown T., Mann B., Ryder N., Subbiah M., Kaplan J.D., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A., et al. Language Models are Few-Shot Learners. In: Larochelle H., Ranzato M., Hadsell R., Balcan M.F., Lin H., editors. Advances in Neural Information Processing Systems. Volume 33. Curran Associates, Inc.; Red Hook, NY, USA: 2020. pp. 1877–1901.
4. Silver D., Huang A., Maddison C.J., Guez A., Sifre L., Van Den Driessche G., Schrittwieser J., Antonoglou I., Panneershelvam V., Lanctot M., et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529:484–489. doi: 10.1038/nature16961.
5. Arulkumaran K., Cully A., Togelius J. Alphastar: An evolutionary computation perspective; Proceedings of the Genetic and Evolutionary Computation Conference Companion; Prague, Czech Republic. 13–17 July 2019; pp. 314–315.
6. MacKay D.J.C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press; Cambridge, UK: 2003.
7. Zintgraf L.M., Cohen T.S., Adel T., Welling M. Visualizing Deep Neural Network Decisions: Prediction Difference Analysis; Proceedings of the 5th International Conference on Learning Representations, ICLR 2017; Toulon, France. 24–26 April 2017.
8. Papyan V., Han X., Donoho D.L. Prevalence of neural collapse during the terminal phase of deep learning training. Proc. Natl. Acad. Sci. USA. 2020;117:24652–24663. doi: 10.1073/pnas.2015509117.
9. Adebayo J., Gilmer J., Muelly M., Goodfellow I., Hardt M., Kim B. Sanity Checks for Saliency Maps. In: Bengio S., Wallach H., Larochelle H., Grauman K., Cesa-Bianchi N., Garnett R., editors. Advances in Neural Information Processing Systems. Volume 31. Curran Associates, Inc.; Red Hook, NY, USA: 2018.
10. Guidotti R., Monreale A., Ruggieri S., Turini F., Giannotti F., Pedreschi D. A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 2018;51:1–42. doi: 10.1145/3236009.
11. Jacot A., Gabriel F., Hongler C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In: Bengio S., Wallach H., Larochelle H., Grauman K., Cesa-Bianchi N., Garnett R., editors. Advances in Neural Information Processing Systems. Volume 31. Curran Associates, Inc.; Red Hook, NY, USA: 2018.
12. Mei S., Montanari A., Nguyen P.M. A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. USA. 2018;115:E7665–E7671. doi: 10.1073/pnas.1806579115.
13. Arora S., Du S., Hu W., Li Z., Wang R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks; Proceedings of the International Conference on Machine Learning, PMLR; Long Beach, CA, USA. 9–15 June 2019; pp. 322–332.
14. Cover T.M., Thomas J.A. Elements of Information Theory. John Wiley & Sons; Hoboken, NJ, USA: 2012.
15. Huang S.L., Xu X., Zheng L., Wornell G.W. An information theoretic interpretation to deep neural networks; Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT); Paris, France. 7–12 July 2019; pp. 1984–1988.
16. Tishby N., Zaslavsky N. Deep learning and the information bottleneck principle; Proceedings of the Information Theory Workshop (ITW); Jerusalem, Israel. 26 April–1 May 2015; pp. 1–5.
17. Goldfeld Z., Polyanskiy Y. The information bottleneck problem and its applications in machine learning. IEEE J. Sel. Areas Inf. Theory. 2020;1:19–38. doi: 10.1109/JSAIT.2020.2991561.
18. Huang S.L., Makur A., Zheng L., Wornell G.W. An information-theoretic approach to universal feature selection in high-dimensional inference; Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT); Aachen, Germany. 25–30 June 2017; pp. 1336–1340.
19. Arjovsky M., Chintala S., Bottou L. Wasserstein generative adversarial networks; Proceedings of the International Conference on Machine Learning; PMLR, Sydney, Australia. 6–11 August 2017; pp. 214–223.
20. Saxe A.M., Bansal Y., Dapello J., Advani M., Kolchinsky A., Tracey B.D., Cox D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019;2019:124020. doi: 10.1088/1742-5468/ab3985.
21. Goodfellow I., Bengio Y., Courville A. Deep Learning. MIT Press; Cambridge, MA, USA: 2017.
22. Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy A., Khosla A., Bernstein M., et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015;115:211–252. doi: 10.1007/s11263-015-0816-y.
23. Huang S.L., Zheng L. Linear information coupling problems; Proceedings of the 2012 IEEE International Symposium on Information Theory Proceedings; Cambridge, MA, USA. 1–6 July 2012; pp. 1029–1033.
24. Huang S.L., Makur A., Wornell G.W., Zheng L. On universal features for high-dimensional learning and inference. arXiv. 2019. arXiv:1911.09105.
25. Hirschfeld H.O. A connection between correlation and contingency. Proc. Camb. Phil. Soc. 1935;31:520–524. doi: 10.1017/S0305004100013517.
26. Gebelein H. Das statistische problem der Korrelation als variations-und Eigenwertproblem und sein Zusammenhang mit der Ausgleichungsrechnung. Z. Angew. Math. Mech. 1941;21:364–379. doi: 10.1002/zamm.19410210604.
27. Rényi A. On Measures of Dependence. Acta Math. Acad. Sci. Hung. 1959;10:441–451. doi: 10.1007/BF02024507.
28. du Pin Calmon F., Makhdoumi A., Médard M., Varia M., Christiansen M., Duffy K.R. Principal inertia components and applications. IEEE Trans. Inf. Theory. 2017;63:5011–5038. doi: 10.1109/TIT.2017.2700857.
29. Hsu H., Asoodeh S., Salamatian S., Calmon F.P. Generalizing bottleneck problems; Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT); Vail, CO, USA. 17–22 June 2018; pp. 531–535.
30. Hsu H., Salamatian S., Calmon F.P. Correspondence analysis using neural networks; Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, PMLR; Okinawa, Japan. 16–18 April 2019; pp. 2671–2680.
31. Anantharam V., Gohari A., Kamath S., Nair C. On hypercontractivity and a data processing inequality; Proceedings of the 2014 IEEE International Symposium on Information Theory; Honolulu, HI, USA. 29 June–4 July 2014; pp. 3022–3026.
32. Raginsky M. Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels. IEEE Trans. Inf. Theory. 2016;62:3355–3389. doi: 10.1109/TIT.2016.2549542.
33. Polyanskiy Y., Wu Y. Convexity and Concentration. Springer; Berlin/Heidelberg, Germany: 2017. Strong data-processing inequalities for channels and Bayesian networks; pp. 211–249.
34. Greenacre M.J. Theory and Applications of Correspondence Analysis. Academic Press; London, UK: 1984.
35. Wang H., Vo L., Calmon F.P., Médard M., Duffy K.R., Varia M. Privacy with estimation guarantees. IEEE Trans. Inf. Theory. 2019;65:8025–8042. doi: 10.1109/TIT.2019.2934414.
36. Breiman L., Friedman J.H. Estimating Optimal Transformations for Multiple Regression and Correlation. J. Am. Stat. Assoc. 1985;80:614–619.
37. Sutskever I., Vinyals O., Le Q.V. Sequence to sequence learning with neural networks; Proceedings of the Advances in Neural Information Processing Systems; Montreal, QC, Canada. 8–13 December 2014; pp. 3104–3112.
38. Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735.
39. Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer; New York, NY, USA: 2009. Neural Networks; pp. 389–416.
40. Cybenko G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. 1989;2:303–314. doi: 10.1007/BF02551274.
41. Stoer J., Bulirsch R. Introduction to Numerical Analysis. Volume 12. Springer Science & Business Media; Berlin/Heidelberg, Germany: 2013.
42. Alain G., Bengio Y. Understanding intermediate layers using linear classifier probes; Proceedings of the 5th International Conference on Learning Representations, ICLR 2017; Toulon, France. 24–26 April 2017.
43. Sutskever I., Martens J., Dahl G., Hinton G. On the importance of initialization and momentum in deep learning; Proceedings of the International Conference on Machine Learning, PMLR; Atlanta, GA, USA. 17–19 June 2013; pp. 1139–1147.
44. Simonyan K., Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In: Bengio Y., LeCun Y., editors. Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015; San Diego, CA, USA. 7–9 May 2015; Conference Track Proceedings.
45. Howard A.G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T., Andreetto M., Adam H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv. 2017. arXiv:1704.04861.
46. Huang G., Liu Z., Weinberger K.Q., van der Maaten L. Densely connected convolutional networks; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–29 July 2017; p. 3.
47. Chollet F. Xception: Deep learning with depthwise separable convolutions; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–29 July 2017; pp. 1251–1258.
48. Szegedy C., Vanhoucke V., Ioffe S., Shlens J., Wojna Z. Rethinking the inception architecture for computer vision; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 26 June–1 July 2016; pp. 2818–2826.
49. Szegedy C., Ioffe S., Vanhoucke V., Alemi A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning; Proceedings of the AAAI; San Francisco, CA, USA. 4–9 February 2017; pp. 4278–4284.
50. Xu X., Huang S.L., Zheng L., Zhang L. The geometric structure of generalized softmax learning; Proceedings of the 2018 IEEE Information Theory Workshop (ITW); Guangzhou, China. 25–29 November 2018; pp. 1–5.
51. Wen W., Wu C., Wang Y., Chen Y., Li H. Learning structured sparsity in deep neural networks. Adv. Neural Inf. Process. Syst. 2016;29:2074–2082.
52. Wang L., Wu J., Huang S.L., Zheng L., Xu X., Zhang L., Huang J. An efficient approach to informative feature extraction from multimodal data; Proceedings of the AAAI Conference on Artificial Intelligence; Honolulu, HI, USA. 27 January–1 February 2019; pp. 5281–5288.
53. Lee J., Sattigeri P., Wornell G. Learning new tricks from old dogs: Multi-source transfer learning from pre-trained networks. Adv. Neural Inf. Process. Syst. 2019;32:4370–4380.
54. Dembo A., Zeitouni O. Large Deviations Techniques and Applications. Springer; Berlin/Heidelberg, Germany: 2010. p. 38. Corrected Reprint of the Second (1998) Edition. Stochastic Modelling and Applied Probability.
  • 55.Sason I., Verdú S. f-divergence Inequalities. IEEE Trans. Inf. Theory. 2016;62:5973–6006. doi: 10.1109/TIT.2016.2603151. [DOI] [Google Scholar]
  • 56.Eckart C., Young G. The approximation of one matrix by another of lower rank. Psychometrika. 1936;1:211–218. doi: 10.1007/BF02288367. [DOI] [Google Scholar]

Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
