Entropy. 2020 Sep 30;22(10):1107. doi: 10.3390/e22101107

Complexity as Causal Information Integration

Carlotta Langer 1,*, Nihat Ay 1,2,3
PMCID: PMC7597220  PMID: 33286876

Abstract

Complexity measures in the context of the Integrated Information Theory of consciousness try to quantify the strength of the causal connections between different neurons. This is done by minimizing the KL-divergence between a full system and one without causal cross-connections. Various measures have been proposed and compared in this setting. We will discuss a class of information geometric measures that aim at assessing the intrinsic causal cross-influences in a system. One promising candidate of these measures, denoted by ΦCIS, is based on conditional independence statements and does satisfy all of the properties that have been postulated as desirable. Unfortunately it does not have a graphical representation, which makes it less intuitive and difficult to analyze. We propose an alternative approach using a latent variable, which models a common exterior influence. This leads to a measure ΦCII, Causal Information Integration, that satisfies all of the required conditions. Our measure can be calculated using an iterative information geometric algorithm, the em-algorithm. Therefore we are able to compare its behavior to existing integrated information measures.

Keywords: complexity, integrated information, causality, conditional independence, em-algorithm

1. Introduction

The theory of Integrated Information aims at quantifying the amount and quality of consciousness of a neural network. It was originally proposed by Tononi and went through various phases of evolution, starting with one of the first papers, "Consciousness and Complexity" [1] in 1998, to "Consciousness as Integrated Information—a Provisional Manifesto" [2] in 2008 and Integrated Information Theory (IIT) 3.0 [3] in 2014, up to ongoing research. Although important parts of the methodology of this theory have changed or been extended, the two key concepts determining consciousness have remained essentially fixed: "Information" and "Integration". Information refers to the number of different states a system can be in, and Integration describes the degree to which this information is integrated among different parts of the system. Tononi summarizes this idea in Reference [2] with the following sentence:

In short, integrated information captures the information generated by causal interactions in the whole, over and above the information generated by the parts.

Therefore, Integrated Information can be seen as a measure of the system's complexity. In this context it belongs to the class of theories that define complexity as the extent to which the whole is more than the sum of its parts.

There are various ways to define a split system and the difference between them. Therefore, there exist different branches of complexity measures in the context of Integrated Information. The most recent theory, IIT 3.0 [3], goes far beyond the original measures and includes a different level of definitions corresponding to the quality of the measured consciousness, including the maximally irreducible conceptual structure (MICS) and the integrated conceptual information. In order to focus on the information geometric aspects of IIT, we follow the strategy of Oizumi et al. [4] and Amari et al. [5], restricting attention to measuring the integrated information in discrete n-dimensional stationary Markov processes from an information geometric point of view.

In detail, we will measure the distance between the full and the split system using the KL-divergence, as proposed in Reference [6] and published in Reference [7]. This framework was further discussed in Reference [8]. Oizumi et al. [4] and Amari et al. [5] summarize these ideas and add a Markov condition and an upper bound to clarify what a complexity measure should satisfy. The Markov condition is intended to model the removal of certain cross-time connections, which we call causal cross-connections. These connections are the ones that integrate information among the different nodes across different points in time. The upper bound was originally proposed in Reference [9] and is given by the mutual information, which aims at quantifying the total information flow from one time step to the next. These conditions are defined as necessary and do not specify a measure uniquely. We will discuss the conditions in the next section.

Additionally Oizumi et al. [4] and Amari et al. [5] introduce one measure that satisfies all of these requirements. This measure is described by conditional independence statements and will be denoted here by ΦCIS. We will introduce ΦCIS along with two other existing measures, namely Stochastic Interaction ΦSI [7] and Geometric Integrated Information ΦG [10]. The measure ΦSI is not bounded from above by the mutual information and ΦG does not satisfy the postulated Markov condition.

Although ΦCIS fits perfectly in the proposed framework, this measure does not correspond to a graphical representation and it is therefore difficult to analyze the causal nature of the measured information flow. We focus on the notion of causality defined by Pearl in Reference [11], in which the correspondence between conditional independence statements and graphs, for instance DAGs or more generally chain graphs, is a key concept. Moreover, we demonstrate that it is not possible to express the conditional independence statements corresponding to ΦCIS using a chain graph, even after adding latent variables. Following the reasoning of Pearl's causality theory, however, this would be a desirable property.

The main purpose of this paper is to propose a more intuitive approach that ensures the consistency between graphical representation and conditional independence statements. This is achieved by using a latent variable that models a common exterior influence. Doing so leads to a new measure, which we call Causal Information Integration ΦCII. This measure is specifically created to measure only the intrinsic causal cross-influences in a setting with an unknown exterior influence, and it satisfies all the required conditions postulated by Oizumi et al. Assuming the existence of an unknown exterior influence is not unreasonable; in fact, one point of criticism concerning ΦSI is that this measure does not account for exterior influences and therefore erroneously measures them as internal, see Section 6.9 in Reference [10]. In a setting with known external influences, these can be integrated in the model as visible variables. This leads to a measure, discussed in Section 2.1.1, that we call ΦT, which is an upper bound for ΦCII.

We discuss the relationships between the introduced measures in Section 2.1.2 and present a way of calculating ΦCII by using an iterative information geometric algorithm, the em-algorithm described in Section 2.1.3. This algorithm is guaranteed to converge to a minimum, but this might be a local minimum. Therefore we have to run the algorithm multiple times to find a global minimum. Utilizing this algorithm we are able to compare the behavior of ΦCII to existing integrated information measures.

Integrated Information Measures

Measures corresponding to Integrated Information investigate the information flow in a system from a time t to t+1. This flow is represented by the connections from the nodes Xi at time t to the nodes Yi at time t+1, i ∈ {1,…,n}, as displayed in Figure 1.

Figure 1. The fully connected system for n=2 and n=3.

The systems are modeled as discrete, stationary, n-dimensional Markov processes (Z_t)_{t∈ℕ},

X = (X_1,\dots,X_n) = (X_{1,t},\dots,X_{n,t}), \quad Y = (Y_1,\dots,Y_n) = (X_{1,t+1},\dots,X_{n,t+1}), \quad Z = (X,Y)

on a finite set Z, which is the Cartesian product of the sample spaces of the Xi, i ∈ {1,…,n},

\mathcal{Z} = \mathcal{X}\times\mathcal{Y} = (\mathcal{X}_1\times\cdots\times\mathcal{X}_n)\times(\mathcal{Y}_1\times\cdots\times\mathcal{Y}_n).

It is possible to apply the following methods to non-stationary distributions, but stationarity, together with the Markov property, allows us to restrict the discussion to one time step.

Let MP(Z) be the set of distributions that belong to these Markov processes.

Denote the complement of Xi in X by XI\{i} = (X1,…,Xi−1,Xi+1,…,Xn) with I = {1,…,n}. Corresponding to this notation, xI\{i} ∈ XI\{i} describes the elementary events of XI\{i}. We will use the analogous notation in the case of Y, and we will write z ∈ Z instead of (x,y) ∈ X×Y. The set of probability distributions on Z will be denoted by P(Z). Throughout this article we will restrict attention to strictly positive distributions.

The core idea of measuring Integrated Information is to determine how much the initial system differs from one in which no information integration takes place. The former will be called a “full” system, because we allow all possible connections between the nodes, and the latter will be called a “split” system. Graphical representations of the full systems for n=2,3 and their connections are depicted in Figure 1. In this article we are using graphs that describe the conditional independence structure of the corresponding sets of distributions. An introduction to those is given in Appendix A.

Graphs are not only a tool to conveniently represent conditional independence statements; the connection between conditional independence and graphs is a core concept of Pearl's causality theory. The interplay between graphs and conditional independence statements provides a consistent foundation of causality. In Reference [11], Section 1.3, Pearl emphasizes the importance of a graphical representation with the following statement:

It seems that if conditional independence judgments are by-products of stored causal relationships, then tapping and representing those relationships directly would be a more natural and more reliable way of expressing what we know or believe about the world. This is indeed the philosophy behind causal Bayesian networks.

Therefore, measures of the strength of causal cross-connections should be based on split models that have a graphical representation.

Following the concept introduced in References [6,7], the difference between the measures corresponding to the full and split systems will be calculated by using the KL-divergence.

Definition 1 (Complexity).

Let M be a set of probability distributions on Z corresponding to a split system. Then we minimize the KL-divergence between M and the distribution of the fully connected system P˜ to calculate the complexity

\Phi_{\mathcal{M}} = \inf_{Q \in \mathcal{M}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q), \qquad D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) = \sum_{z \in \mathcal{Z}} \tilde{P}(z) \log \frac{\tilde{P}(z)}{Q(z)}.

Minimizing the KL-divergence with respect to the second argument is called m-projection or rI-projection. Hence we will call P with

P = \operatorname*{arg\,inf}_{Q \in \mathcal{M}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q)

the projection of P˜ to M.
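The following minimal sketch, in Python with NumPy, shows how the KL-divergence between two strictly positive distributions on Z can be computed when the distributions are stored as arrays over the joint state space; the array layout and function name are our own illustrative choices, not part of the original text.

```python
import numpy as np

def kl_divergence(p, q):
    """D_Z(p || q) = sum_z p(z) * log(p(z) / q(z)) for strictly positive arrays of equal shape."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# toy example: a joint distribution of two binary variables and the product of its marginals
p = np.array([[0.3, 0.2],
              [0.1, 0.4]])
q = np.outer(p.sum(axis=1), p.sum(axis=0))
print(kl_divergence(p, q))  # here this equals the mutual information of p
```

Computing ΦM additionally requires minimizing this quantity over the split model M, which is what the projection algorithms discussed below are used for.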

The question remains how to define the split model M. We want to measure the information that gets integrated between different nodes in different points in time. In Figure 1 these are the dashed connections, also called cross-influences in Reference [4]. We will refer to the dashed connections as causal cross-connections.

In order to ensure that these connections are removed in the split system, the authors of Reference [4] and Reference [5] argue that Yj should be independent of Xi given XI\{i}, i ≠ j, leading to the following property.

Property 1.

A valid split system should satisfy the Markov condition

Q(X_i, Y_j \mid X_{I\setminus\{i\}}) = Q(X_i \mid X_{I\setminus\{i\}}) \, Q(Y_j \mid X_{I\setminus\{i\}}), \quad i \neq j, (1)

with Q ∈ P(Z). This can also be written in the following form

Y_j \perp\!\!\!\perp X_i \mid X_{I\setminus\{i\}}. (2)

Now we take a closer look at the remaining connections. The dotted lines connect nodes belonging to the same point in time. These connections between the Yis might result from common internal influences, meaning a correlation between the Xis passed on to the next point in time via the dashed or solid connections. Additionally Amari points out in Section 6.9 in Reference [10] that there might exist a common exterior influence on the Yis. Although the measured integrated information should be internal and independent of external influences, the system itself is in general not completely independent of its environment.

Since we want to measure the amount of integrated information between t and t+1, the distribution in t, and therefore the connection between the Xis, should stay unchanged in the split system. The dotted connections between the Yis play an important role in Property 2. For this property, we will consider the split system in which the solid and dashed connections are removed.

The solid arrows represent the influence of a node in t on itself in t+1 and removing these arrows, in addition to the causal cross-connections, leads to a system with completely disconnected points in time as shown on the right in Figure 2. The distributions corresponding to this split system are

\mathcal{M}_I = \{ Q \in \mathcal{P}(\mathcal{Z}) \mid Q(z) = Q(x)\,Q(y), \ z = (x,y) \in \mathcal{Z} \}

and the measure ΦI is given by the mutual information I(X;Y), which is defined in the following way

\Phi_I = I(X;Y) = \sum_{z \in \mathcal{Z}} P(x,y) \log \frac{P(x,y)}{P(x)\,P(y)}.

Since there is no information flow between the time steps, Oizumi et al. argue in Reference [4] that an integrated information measure should be bounded from above by the mutual information.
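Because the split model MI fixes only the marginals on X and Y, its projection is the product of the marginals of P˜, so ΦI can be evaluated directly. The sketch below is a self-contained illustration of ours, assuming the joint distribution is stored as a matrix with one axis for the X-states and one for the Y-states.

```python
import numpy as np

def phi_I(p_xy):
    """Phi_I = I(X;Y): KL-divergence from the joint to the product of its marginals.
    p_xy[a, b] is the probability of the a-th X-state together with the b-th Y-state."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal on X
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal on Y
    return float(np.sum(p_xy * np.log(p_xy / (p_x * p_y))))
```

For an n-node system, the 2^n states of X index the rows and the 2^n states of Y index the columns.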

Figure 2. Interior and exterior influences on Y in the full and the split system corresponding to ΦI.

Property 2.

The mutual information should be an upper bound for an Integrated Information measure

\Phi_{\mathcal{M}} = \inf_{Q \in \mathcal{M}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) \le I(X;Y).

Oizumi et al. [4,9] and Amari et al. [5] state that this property is natural, because an Integrated Information measure should be bounded by the total amount of information flow between the different points in time. The postulation of this property led to a discussion in Reference [12]. The point of disagreement concerns the edge between the Yis. On the one hand, this edge takes into account that there might be a common exterior influence that affects all the Yis, as pointed out by Amari in Reference [10]. This is symbolized by the additional node W in Figure 2, and such an influence should not contribute to the value of Integrated Information between the different points in time.

On the other hand, we know that if the Xis are correlated, then the correlation is passed to the Yis via the solid and dashed arrows. The edges created by calculating the marginal distribution on Y also contain these correlations. The question now is how much of these correlations integrate information in the system and should therefore be measured. Kanwal et al. discuss this problem in Reference [12]. They distinguish between intrinsic and extrinsic influences that cause the connections between the Yis, in the way displayed in Figure 2. By calculating the split system for ΦI, the edge between the Yis might compensate for the solid arrows and common exterior influences, but also for the dashed, causal cross-connections, as shown in Figure 2 on the right. Kanwal et al. analyze an example of a full system without a common exterior influence, with the result that there are cases in which a measure that only removes the causal cross-connections has a larger value than ΦI. This is only possible if the undirected edge between the Yis compensates a part of the causal cross-connections. Hence ΦI does not measure all the intrinsic causal cross-influences. Therefore, Kanwal et al. question the use of the mutual information as an upper bound.

Then again, we would like to contribute a different perspective. Accepting Property 2 does not necessarily mean that the connections between the Yis are fixed. It may merely mean that MI is a subset of the set of split distributions. We will see that the measures ΦCIS and ΦCII satisfy Property 2 in this way. Although the argument that ΦI measures all the intrinsic influences is no longer valid, satisfying Property 2 is still desirable in general. Consider an initial system with the distribution P˜(z) = P˜(x)P˜(y), z ∈ Z. This system has a common exterior influence on the Yis and no connection between the different points in time. Since there is no information flow between the points in time, a measure for Integrated Information ΦM should be zero for all distributions of this form. This is the case exactly when MI ⊆ M, hence when ΦI is an upper bound for ΦM. In order to emphasize this point we propose a modified version of Property 2.

Property 3.

The set MI should be a subset of the split model M corresponding to the Integrated Information measure ΦM. Then the inequality

\Phi_{\mathcal{M}} = \inf_{Q \in \mathcal{M}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) \le I(X;Y)

holds.

Note that the new formulation is stronger, hence Property 2 is a consequence of Property 3. Every measure discussed here that satisfies Property 2 also fulfills Property 3. Therefore we will keep referring to Property 2 in the following sections.

Figure 3 displays an overview of the different measures and whether they satisfy Properties 1 and 2.

Figure 3. The different measures and their properties in the case of n=2.

The first complexity measure that we are discussing does not fulfill Property 2. It is called Stochastic Interaction and was introduced by Ay in Reference [6] in 2001, later published in Reference [7]. Barrett and Seth discuss it in Reference [13] in the context of Integrated Information. In Reference [5] the corresponding model is called “fully split model”.

The core idea is to allow only the connections among the random variables at time t and additionally the connections between Xi and Yi, meaning the same random variable at different points in time. The latter correspond to the solid arrows in Figure 1. A graphical representation for n=2 can be found in the first column of Figure 3.

Definition 2 (Stochastic Interaction).

The set of distributions belonging to the split model in the sense of Stochastic Interaction can be defined as

\mathcal{M}_{SI} = \left\{ Q \in \mathcal{P}(\mathcal{Z}) \,\middle|\, Q(Y \mid X) = \prod_{i=1}^{n} Q(Y_i \mid X_i) \right\}

and the complexity measure can be calculated as follows

\Phi_{SI} = \inf_{Q \in \mathcal{M}_{SI}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) = \sum_{i=1}^{n} H(Y_i \mid X_i) - H(Y \mid X),

as shown in Reference [7]. In the definition above, H denotes the conditional entropy

H(Y_i \mid X_i) = -\sum_{x_i \in \mathcal{X}_i} \sum_{y_i \in \mathcal{Y}_i} \tilde{P}(x_i, y_i) \log \tilde{P}(y_i \mid x_i).

This measure does not satisfy Property 2 and therefore the corresponding graph is displayed only in the first column of Figure 3. Amari points out in Reference [10] that this measure is not applicable in the case of an exterior influence on the Yis. Such an influence can cause the Yis to be correlated even in the case of independent Xis and no causal cross-connections.

Consider a setting without exterior influences; then ΦSI quantifies the strength of the causal cross-connections alone and is therefore a reasonable choice for an Integrated Information measure. Accounting for an exterior influence that does not exist leads to a split system that compensates a part of the removal of the causal cross-connections, so that the resulting measure does not quantify all of the interior causal cross-influences.
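Because ΦSI has the closed form given above, it can be evaluated directly from the joint distribution. The sketch below assumes, as an illustrative convention of ours, that the joint is stored as a 2n-dimensional array whose first n axes index X1,…,Xn and whose last n axes index Y1,…,Yn.

```python
import numpy as np

def conditional_entropy(p_ab, axes_b):
    """H(B|A) = -sum p(a,b) log p(b|a), where axes_b are the axes of the B variables."""
    p_a = p_ab.sum(axis=axes_b, keepdims=True)
    return float(-np.sum(p_ab * np.log(p_ab / p_a)))

def phi_SI(p, n):
    """Closed form of Stochastic Interaction: sum_i H(Y_i|X_i) - H(Y|X)."""
    y_axes = tuple(range(n, 2 * n))
    phi = -conditional_entropy(p, y_axes)          # -H(Y|X)
    for i in range(n):
        drop = tuple(a for a in range(2 * n) if a not in (i, n + i))
        p_xi_yi = p.sum(axis=drop)                 # marginal of (X_i, Y_i)
        phi += conditional_entropy(p_xi_yi, (1,))  # +H(Y_i|X_i)
    return phi
```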

To force the model to satisfy Property 2, one can add the interaction between Yi and Yj, which results in the measure Geometric Integrated Information [10].

Definition 3 (Geometric Integrated Information).

The graphical model corresponding to the graph in the second row and first column of Figure 3 is the set

\mathcal{M}_{G} = \left\{ P \in \mathcal{P}(\mathcal{Z}) \,\middle|\, \exists\, f_1,\dots,f_{n+2} \in \mathbb{R}_{+}^{\mathcal{Z}} \ \text{s.t.}\ P(z) = f_{n+1}(x)\, f_{n+2}(y) \prod_{i=1}^{n} f_i(x_i, y_i) \right\}

and the measure is defined as

\Phi_{G} = \inf_{Q \in \mathcal{M}_{G}} D_{\mathcal{Z}}(P \,\|\, Q).

MG is called the diagonally split model in Reference [5]. It is not causally split in the sense that the corresponding distributions in general do not satisfy Property 1. This can be seen by analyzing the conditional independence structure of the graph as described in Appendix A. By introducing the edges between the Yis as fixed, ΦG might force these connections to be stronger than they originally are. A result of this might be that an effect of the causal cross-connections gets compensated by the new edge. We discussed this above in the context of Property 2.

This measure has no closed form solution, but we are able to calculate the corresponding split system with the help of the iterative scaling algorithm (see, for example, Section 5.1 in Reference [14]).
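The split model MG is a hierarchical model whose generating sets are the X-marginal, the Y-marginal and the pairs (Xi,Yi), so a generic iterative proportional fitting routine can be used for the projection. The following sketch is our own illustration of that idea under the same 2n-dimensional array convention as above; it is not the authors' implementation.

```python
import numpy as np

def ipf_projection(p, generators, sweeps=200):
    """Iteratively rescale q so that it matches the marginals of p on every generator,
    which converges to the m-projection of p onto the corresponding hierarchical model."""
    q = np.ones_like(p) / p.size            # start from the uniform distribution
    all_axes = set(range(p.ndim))
    for _ in range(sweeps):
        for keep in generators:
            drop = tuple(sorted(all_axes - set(keep)))
            q = q * p.sum(axis=drop, keepdims=True) / q.sum(axis=drop, keepdims=True)
    return q

# generating sets of M_G for an n-node system under our axis convention
n = 2
generators = [tuple(range(n)), tuple(range(n, 2 * n))] + [(i, n + i) for i in range(n)]
# Phi_G is then the KL-divergence from the joint p to ipf_projection(p, generators).
```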

The first measure that satisfies both properties is called "Integrated Information" [4]; its model is referred to as the "causally split model" in Reference [5] and it is derived from the first property. Since we are able to define it using conditional independence statements, we will denote it by ΦCIS. It requires Yi to be independent of XI\{i} given Xi.

Definition 4 (Integrated Information).

The set of distributions that belong to the split system corresponding to Integrated Information is defined as

\mathcal{M}_{CIS} = \left\{ Q \in \mathcal{P}(\mathcal{Z}) \,\middle|\, Q(Y_i \mid X) = Q(Y_i \mid X_i) \ \text{for all}\ i \in \{1,\dots,n\} \right\} (3)

and this leads to the measure

\Phi_{CIS} = \inf_{Q \in \mathcal{M}_{CIS}} D_{\mathcal{Z}}(P \,\|\, Q).

We can write the requirements on the distributions in (3) as conditional independence statements

Y_i \perp\!\!\!\perp X_{I\setminus\{i\}} \mid X_i.

A detailed analysis of probabilistic independence statements can be found in Reference [15]. Unfortunately, these conditional independence statements cannot be encoded in terms of a chain graph in general. The definition of this measure arises naturally from Property 1 by applying the relation (1)

Q(X_i, Y_j \mid X_{I\setminus\{i\}}) = Q(X_i \mid X_{I\setminus\{i\}}) \, Q(Y_j \mid X_{I\setminus\{i\}}), \quad i \neq j,

to all pairs i, j ∈ {1,…,n}, i ≠ j. This leads to

Q(Y_j \mid X) = Q(Y_j \mid X_j), (4)

as shown in Appendix B.

Note that this implies that every model satisfying Property 1 is a submodel of MCIS. In order to show that ΦCIS satisfies Property 1, we are going to rewrite the condition in Property 1 as

Q(Y_j \mid X) = Q(Y_j \mid X_{I\setminus\{i\}}).

The definition of MCIS allows us to write

Q(Y_j \mid X) = Q(Y_j \mid X_j) = Q(Y_j \mid X_{I\setminus\{i\}}),

for Q ∈ MCIS. Therefore ΦCIS satisfies Property 1, and since MI meets the conditional independence statements of Property 1, the relation MI ⊆ MCIS holds and ΦCIS fulfills Property 2.

In Reference [4] Oizumi et al. derive an analytical solution for Gaussian variables, but there does not exist a closed form solution for discrete variables in general. Therefore they use Newton’s method in the case of discrete variables.

Due to the lack of a graphical representation, it is difficult to interpret the causal nature of the elements of MCIS. In Example 1 we will see a type of model that is part of MCIS but has a graphical representation. This model does not lie in the set of Markovian processes MP(Z) discussed in this article. This implies that not all the split distributions in MCIS arise from removing connections from a full distribution, as depicted in Figure 1.

2. Causal Information Integration

Inspired by the discussion about extrinsic and intrinsic influences in the context of Property 2, we now utilize the notion of a common exterior influence to define the measure ΦCII, which we call Causal Information Integration. This measure should be used in the case of an unknown exterior influence.

2.1. Definition

Explicitly including a common exterior influence allows us to avoid the problems of a fixed edge between the Yis discussed earlier. This leads to the graphs in Figure 4.

Figure 4. Split systems with exterior influences for n=2 and n=3.

The factorization of the distributions belonging to these graphical models is the following one

P(z, w) = P(x) \prod_{i=1}^{n} P(y_i \mid x_i, w) \, P(w).

By marginalizing over the elements of W we get a distribution on Z defining our new model.

Definition 5 (Causal Information Integration).

The set of distributions belonging to the marginalized model for |Wm| = m is

\mathcal{M}_{CII}^{m} = \left\{ P \in \mathcal{P}(\mathcal{Z}) \,\middle|\, \exists\, Q \in \mathcal{P}(\mathcal{Z}\times\mathcal{W}_m) : P(z) = \sum_{j=1}^{m} Q(x)\, Q(w_j) \prod_{i=1}^{n} Q(y_i \mid x_i, w_j) \right\}.

We will define the split model for Causal Information Integration as the closure (denoted by a bar) of the union of the sets MCIIm

\mathcal{M}_{CII} = \overline{\bigcup_{m \in \mathbb{N}} \mathcal{M}_{CII}^{m}}. (5)

This leads to the measure

\Phi_{CII} = \inf_{Q \in \mathcal{M}_{CII}} D_{\mathcal{Z}}(P \,\|\, Q).

Since the split system MCII was defined by utilizing graphs, we are able to use the graphical representation to get a more precise notion of the cases in which ΦCII(P˜)=0 holds. In those cases the initial distribution can be completely explained as a limit of marginalized distributions without causal cross-influences and with exterior influences.

Proposition 1.

The measure ΦCII(P˜) is 0 if and only if there exists a sequence of distributions Qm ∈ P(Z) with the following properties.

  • 1. 

    \tilde{P} = \lim_{m \to \infty} Q_m.

  • 2. 
    For every m ∈ ℕ there exists a distribution Q̂m ∈ P(Z×Wm) that has Z-marginals equal to Qm,
    Q_m(z) = \hat{Q}_m(z), \quad z \in \mathcal{Z}.
    Additionally, Q̂m factors according to the graph corresponding to the split system,
    \hat{Q}_m(z, w) = \hat{Q}_m(x) \prod_{i=1}^{n} \hat{Q}_m(y_i \mid x_i, w) \, \hat{Q}_m(w), \quad (z, w) \in \mathcal{Z}\times\mathcal{W}_m.

In order to show that ΦCII satisfies the conditional independence statements in Property 1, we will calculate the conditional distributions P(yi|xi) and P(yi|x) of

P(z) = \sum_{w} P(x) \prod_{j=1}^{n} P(y_j \mid x_j, w) \, P(w).

This results in

P(y_i \mid x_i) = \frac{\sum_{y_{I\setminus\{i\}}} \sum_{x_{I\setminus\{i\}}} \sum_{w} P(x) \prod_{j=1}^{n} P(y_j \mid x_j, w) \, P(w)}{P(x_i)} = \frac{\sum_{x_{I\setminus\{i\}}} \sum_{w} P(x) \, P(y_i \mid x_i, w) \, P(w)}{P(x_i)} = \sum_{w} P(y_i \mid x_i, w) \, P(w),

P(y_i \mid x) = \frac{\sum_{y_{I\setminus\{i\}}} \sum_{w} P(x) \prod_{j=1}^{n} P(y_j \mid x_j, w) \, P(w)}{P(x)} = \sum_{w} P(y_i \mid x_i, w) \, P(w)

for all z ∈ Z. Hence P(yi|xi) = P(yi|x) for every P ∈ MCIIm, m ∈ ℕ. Since every element P̂ ∈ MCII is a limit point of distributions that satisfy the conditional independence statements, P̂ also fulfills them; a proof can be found in Reference [16], Proposition 3.12. Therefore ΦCII satisfies Property 1 and the set of all such distributions is a subset of MCIS,

\mathcal{M}_{CII} \subseteq \mathcal{M}_{CIS}.

We are able to represent the marginalized model by using the methods from Reference [17]. Up to this point we have been using chain graphs. These are graphs consisting of directed and undirected edges such that there are no semi-directed cycles, as described in Appendix A. In order to gain a graph that represents the conditional independence structure of the marginalized model, we need the concept of chain mixed graphs (CMGs). In addition to the directed and undirected edges belonging to chain graphs, chain mixed graphs also have arcs ↔. Two nodes connected by an arc are called spouses. The connection between spouses appears when we marginalize over a common influence; hence spouses do not have a directed information flow from one node to the other but are affected by the same mechanisms. Algorithm A3 from Reference [17] allows us to transform a chain graph with latent variables into a chain mixed graph that represents the conditional independence structure of the marginalized chain graph. Applying this to the graphs in Figure 4 leads to the CMGs in Figure 5. Unfortunately, no factorization corresponding to the CMGs is known to the authors.

Figure 5. Marginalized Model for n=2 and n=4.

In order to prove that ΦCII satisfies Property 2, we will show that MI is a subset of MCII. First, we consider the following subset of MCII,

\mathcal{M}_{CI}^{m} = \left\{ P \in \mathcal{P}(\mathcal{Z}) \,\middle|\, \exists\, Q \in \mathcal{P}(\mathcal{Z}\times\mathcal{W}_m) : P(z) = \sum_{j=1}^{m} Q(x)\, Q(w_j) \prod_{i=1}^{n} Q(y_i \mid w_j) \right\}, \qquad \mathcal{M}_{CI} = \overline{\bigcup_{m \in \mathbb{N}} \mathcal{M}_{CI}^{m}},

where we remove the connections between the different stages, as shown in Figure 6.

Figure 6. Submodels of the split models with exterior influences for n=2 and n=3.

Now X and Y are independent of each other,

Q(z) = Q(x) \cdot Q(y),

with

Q(y) = \sum_{w} Q(w) \prod_{i=1}^{n} Q(y_i \mid w)

for Q ∈ MCIm, and since independence structures of discrete distributions are preserved in the limit, we have MCI ⊆ MI. In order to gain equality it remains to show that Q(Y) can approximate every distribution on Y if the state space of W is sufficiently large. These distributions are mixtures of discrete product distributions, where

\prod_{i=1}^{n} Q(y_i \mid w)

are the mixture components and Q(w) are the mixture weights. Hence we are able to use the following result.

Theorem 1

(Theorem 1.3.1 from Reference [18]). Let q be a prime power. The smallest m for which any probability distribution on {1,…,q}^n can be approximated arbitrarily well as a mixture of m product distributions is q^{n−1}.

Universal approximation results like the theorem above may suggest that the models MCII and MCIS are equal. However, we will present numerically calculated examples of elements belonging to MCIS, but not to MCII, even with an extremely large state space. We will discuss this matter further in Section 2.1.2.

In conclusion, ΦCII satisfies Properties 1 and 2.

Note that using ΦCII in cases without an exterior influence might not capture all the internal cross-influences, since the additional latent variable can compensate some of the difference between the initial distribution and the split model. This can only be avoided when the exterior influence is known and can therefore be included in the model. We will discuss that case in the next section.

2.1.1. Ground Truth

The concept of an exterior influence suggests that there exists a ground truth in a larger model in which W is a visible variable. This is shown in Figure 7 on the right.

Figure 7. The graphs corresponding to E (left) and Ef (right).

Assuming that we know the distribution of the whole model, we are able to apply the concepts discussed above to define an Integrated Information measure ΦT on the larger space. This allows us to really only remove the causal cross-connections as shown in Figure 7 on the left. Thus we can interpret ΦT as the ultimate measure of Integrated Information, if the ground truth is available. Note that using the measure ΦSI in the setting with no external influences is a special case of ΦT.

The set of distributions belonging to the larger, fully connected model will be called Ef, and the set corresponding to the graph on the left of Figure 7 depicts the split system, which will be denoted by E. Since W is now known, we are able to fix the state space W to its actual size m.

\mathcal{E} = \left\{ P \in \mathcal{P}(\mathcal{Z}\times\mathcal{W}_m) \,\middle|\, P(z,w) = P(x) \prod_{i=1}^{n} P(y_i \mid x_i, w)\, P(w), \ (z,w) \in \mathcal{Z}\times\mathcal{W}_m \right\}, \quad |\mathcal{W}| = m,

\mathcal{E}_f = \left\{ P \in \mathcal{P}(\mathcal{Z}\times\mathcal{W}_m) \,\middle|\, P(z,w) = P(x) \prod_{i=1}^{n} P(y_i \mid x, w)\, P(w), \ (z,w) \in \mathcal{Z}\times\mathcal{W}_m \right\}, \quad |\mathcal{W}| = m.

Note that E is the set of all the distributions that result in an element of MCII after marginalization over Wm:

\mathcal{M}_{CII}^{m} = \left\{ P \in \mathcal{P}(\mathcal{Z}) \,\middle|\, \exists\, Q \in \mathcal{E} : P(z) = \sum_{j=1}^{m} Q(x)\, Q(w_j) \prod_{i=1}^{n} Q(y_i \mid x_i, w_j) \right\}.

Calculating the KL-divergence between P ∈ Ef and E results in the new measure.

Proposition 2.

Let P ∈ Ef. Minimizing the KL-divergence between P and E leads to

\Phi_{T} = \inf_{Q \in \mathcal{E}} D_{\mathcal{Z}\times\mathcal{W}_m}(P \,\|\, Q) = \sum_{z,w} P(z,w) \log \frac{\prod_i P(y_i \mid x, w)}{\prod_i P(y_i \mid x_i, w)} = \sum_i I(Y_i ; X_{I\setminus\{i\}} \mid X_i, W).

In the definition above, I(Yi;XI\{i}|Xi,W) is the conditional mutual information defined by

I(Y_i ; X_{I\setminus\{i\}} \mid X_i, W) = \sum_{y_i, x, w} P(y_i, x, w) \log \frac{P(y_i, x_{I\setminus\{i\}} \mid x_i, w)}{P(y_i \mid x_i, w) \, P(x_{I\setminus\{i\}} \mid x_i, w)}.

It characterizes the reduction of uncertainty in Yi due to XI\{i} when W and Xi are given. Therefore this measure decomposes into a sum in which each addend characterizes the information flow towards one Yi. Writing this as conditional independence statements, ΦT is 0 if and only if

Y_i \perp\!\!\!\perp X_{I\setminus\{i\}} \mid \{X_i, W\} \quad \text{for all } i.

Ignoring W would lead exactly to the conditional independence statements in Equation (3). For a more detailed description of the conditional mutual information and its properties, see Reference [19].

Furthermore, ΦT=0 if and only if the initial distribution P factors according to the graph that belongs to E. This follows from Proposition 2 and the fact that the KL-divergence is 0 if and only if both distributions are equal. Hence this measure truly removes the causal cross-connections.

Additionally, by using that W ⫫ X, we are able to split up the conditional mutual information into a part corresponding to the conditional independence statements of Property 1 and another conditional mutual information.

I(Y_i ; X_{I\setminus\{i\}} \mid X_i, W) = \sum_{y_i, x, w} P(y_i, x, w) \log \left[ \frac{P(y_i, x_{I\setminus\{i\}} \mid x_i)}{P(y_i \mid x_i)\, P(x_{I\setminus\{i\}} \mid x_i)} \cdot \frac{P(y_i, x_i)\, P(x)\, P(y_i, x, w)\, P(x_i, w)}{P(y_i, x)\, P(x_i)\, P(y_i, x_i, w)\, P(x, w)} \right]
= I(Y_i ; X_{I\setminus\{i\}} \mid X_i) + \sum_{y_i, x, w} P(y_i, x, w) \log \frac{P(y_i, x_i)\, P(x)\, P(y_i, x, w)\, P(x_i, w)}{P(y_i, x)\, P(x_i)\, P(y_i, x_i, w)\, P(x, w)}
= I(Y_i ; X_{I\setminus\{i\}} \mid X_i) + \sum_{y_i, x, w} P(y_i, x, w) \log \frac{P(w, x_{I\setminus\{i\}} \mid y_i, x_i)}{P(w \mid y_i, x_i)\, P(x_{I\setminus\{i\}} \mid y_i, x_i)}
= I(Y_i ; X_{I\setminus\{i\}} \mid X_i) + I(W ; X_{I\setminus\{i\}} \mid Y_i, X_i).

Since the conditional mutual information is non-negative, ΦT is 0 if and only if the conditional independence statements of Equation (3) hold and additionally the reduction of uncertainty in W due to XI\{i} given Yi,Xi is 0.
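In the ground-truth setting, ΦT is simply a sum of conditional mutual informations and can be evaluated directly from the joint distribution over (X, Y, W). The sketch below uses an axis layout of our own choosing (first the n X-axes, then the n Y-axes, then W) and hypothetical helper names.

```python
import numpy as np

def conditional_mutual_information(p, a_axes, b_axes, c_axes):
    """I(A;B|C) = sum p log[ p(a,b,c) p(c) / (p(a,c) p(b,c)) ] for a joint array p."""
    def marg(keep):
        drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
        return p.sum(axis=drop, keepdims=True)
    p_abc = marg(set(a_axes) | set(b_axes) | set(c_axes))
    p_ac = marg(set(a_axes) | set(c_axes))
    p_bc = marg(set(b_axes) | set(c_axes))
    p_c = marg(set(c_axes))
    return float(np.sum(p * np.log((p_abc * p_c) / (p_ac * p_bc))))

def phi_T(p, n):
    """Proposition 2: Phi_T = sum_i I(Y_i ; X_{I\\{i}} | X_i, W).
    p is a (2n+1)-dimensional array: axes 0..n-1 are X, axes n..2n-1 are Y, axis 2n is W."""
    w_axis = 2 * n
    total = 0.0
    for i in range(n):
        total += conditional_mutual_information(
            p,
            a_axes=(n + i,),                              # Y_i
            b_axes=tuple(j for j in range(n) if j != i),  # X_{I\{i}}
            c_axes=(i, w_axis),                           # X_i and W
        )
    return total
```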

In general, we do not know what the ground truth of our system is and therefore we have to assume that W is a hidden variable. This leads us back to ΦCII. Minimizing over all possible W might compensate a part of the causal information flow. One example, in which accounting for an exterior influence that does not exist leads to a value smaller than the true integrated information, was discussed earlier in the context of Property 2. There we refer to an example in Reference [12] where ΦSI exceeds ΦI in a setting without an exterior influence. Similarly, ΦCII is smaller or equal to the true value ΦT.

Proposition 3.

The new measure ΦT is an upper bound for ΦCII

\Phi_{CII} \le \Phi_{T}.

Hence, by assuming that there exists a common exterior influence, we are able to show that ΦCII is bounded from above by the true value, which measures all the intrinsic cross-influences. We are able to observe this behavior in Section 2.2.2.

2.1.2. Relationships between the Different Measures

Now we are going to analyze the relationship between the different measures ΦSI,ΦG,ΦCIS and ΦCII. We will start with ΦG and ΦCII. Previously we already showed that ΦCII satisfies Property 1 and since ΦG does not satisfy Property 1, we have

\mathcal{M}_{G} \not\subseteq \mathcal{M}_{CII}.

To evaluate the other inclusion, we will consider the more refined parametrizations of elements P ∈ MCIIm and Q ∈ MG as described in Definition A1. These are

P(z) = P(x)\, f_2(x_1,y_1)\, g_2(x_2,y_2) \sum_{w} P(w)\, f_1(w,y_1)\, f_3(x_1,y_1,w)\, g_1(w,y_2)\, g_3(x_2,y_2,w) = P(x)\, f_2(x_1,y_1)\, g_2(x_2,y_2)\, \phi(x_1,x_2,y_1,y_2),

Q(z) = h_{n+1}(x)\, h_{n+2}(y) \prod_{i=1}^{n} h_i(y_i, x_i),

where f1, f2, f3, g1, g2, g3, h1, h2, h3, h4 are non-negative functions such that P, Q ∈ P(Z) and

\phi(x_1,x_2,y_1,y_2) = \sum_{w} P(w)\, f_1(w,y_1)\, f_3(x_1,y_1,w)\, g_1(w,y_2)\, g_3(x_2,y_2,w).

Since ϕ depends on more than Y1 and Y2, P(z) does not factorize according to MG in general. Hence MCII ⊄ MG holds.

Furthermore, looking at the parametrizations allows us to identify a subset of distributions that lies in the intersection of MG and MCII. Allowing P to only have pairwise interactions would lead to

P(z) = P(x)\, \tilde{f}_2(x_1,y_1)\, \tilde{g}_2(x_2,y_2) \sum_{w} P(w)\, \tilde{f}_1(w,y_1)\, \tilde{g}_1(w,y_2) = P(x)\, \tilde{f}_2(x_1,y_1)\, \tilde{g}_2(x_2,y_2)\, \tilde{\phi}(y_1,y_2),

with the non-negative functions f̃1, f̃2, g̃1, g̃2 such that P ∈ P(Z) and

\tilde{\phi}(y_1,y_2) = \sum_{w} P(w)\, \tilde{f}_1(w,y_1)\, \tilde{g}_1(w,y_2).

This P is an element of MG ∩ MCII.

In the next part we will discuss the relationship between MCII and MCIS. The elements in MCII satisfy the conditional independence statements of Property 1; therefore

\mathcal{M}_{CII} \subseteq \mathcal{M}_{CIS}.

Previously we have seen that, by making the state space of W large enough, we can approximate any distribution on the Yis; see Theorem 1. This gives the impression that MCII and MCIS coincide. However, based on numerically calculated examples, we have the following conjecture.

Conjecture 1.

It is not possible to approximate every distribution Q ∈ MCIS with arbitrary accuracy by elements P ∈ MCII. Therefore, we have that

\mathcal{M}_{CII} \subsetneq \mathcal{M}_{CIS}.

The following example strongly suggests this conjecture to be true.

Example 1.

Consider the set of distributions that factor according to the graph in Figure 8

\mathcal{N}_{CIS} = \{ P \in \mathcal{P}(\mathcal{Z}) \mid P(z) = P(x_1)\, P(x_2)\, P(y_1 \mid x_1, y_2)\, P(y_2) \}.

This model satisfies the conditional independence statements of Property 1 and is therefore a subset of the model MCIS. In this case X1 and X2 are independent of each other; hence, from a causal perspective, the influence of Y2 on Y1 should be purely external. Therefore we try to model this with a subset of MCII,

\mathcal{N}_{CII} = \overline{\bigcup_{m \in \mathbb{N}} \mathcal{N}_{CII}^{m}}, \qquad \mathcal{N}_{CII}^{m} = \left\{ P \in \mathcal{P}(\mathcal{Z}) \,\middle|\, \exists\, Q \in \mathcal{P}(\mathcal{Z}\times\mathcal{W}_m) : P(z) = Q(x_1)\, Q(x_2) \sum_{j=1}^{m} Q(y_1 \mid x_1, w_j)\, Q(y_2 \mid w_j)\, Q(w_j) \right\} (6)

and this corresponds to Figure 9.

Figure 8. Graph of the model NCIS.

Figure 9. Graph of the model NCII.

Using the em-algorithm described in Section 2.1.3 we took 500 random elements of NCIS and calculated the closest element of NCII by using the minimum KL-divergence of 50 different random input distributions in each run. The results are displayed in Table 1.

Table 1.

The results of the em-algorithm between NCIS and NCII.

|W| Minimum Maximum Arithmetic Mean
2 0.011969035529826939 0.5028091152589176 0.15263592877594967
3 0.021348311360946 0.5499395859771526 0.1538653506807848
4 0.014762084688030863 0.3984635189946462 0.15139198568055212
8 0.017334311629729246 0.4383731978333986 0.15481967618112732
16 0.024306996171092318 0.4238222051787452 0.1490336847067273
300 0.016524177216064712 0.47733473380366764 0.15493896625208842

This provides examples of elements lying in MCIS that cannot be approximated by elements of MCII.

Now we are going to look at this example from the causal perspective. Proposition 1 states that ΦCII(P˜) is 0 if and only if P˜ is the limit of a sequence of distributions in MCII corresponding to distributions on the extended space that factor according to the split model. Hence a distribution resulting in ΦCII>0 cannot be explained by a split model with an exterior influence. Taking into account that MCIS does not correspond to a graph, we do not have a similar result describing the distributions for which ΦCIS=0. Nonetheless, by looking at the graphical model NCIS, we are able to discuss the causal structure of a submodel of MCIS, a class of distributions for which ΦCIS=0 holds.

If we trust the results in Table 1, this would imply that the influence from Y2 to Y1 is not purely external, but that an internal influence suddenly develops at time step t+1 that did not exist at time step t. Therefore the distributions in NCIS do not, in general, belong to the stationary Markovian processes MP(Z) depicted in Figure 1. For these Markovian processes the connections between the Yis arise from correlated Xis or external influences, as pointed out by Amari in Section 6.9 of Reference [10]. So from a causal perspective NCIS does not fit into our framework. Hence the initial distribution P˜, which corresponds to a full model, will in general not be an element of NCIS. However, the projection of P˜ to MCIS might lie in NCIS, as illustrated in Figure 10.

Figure 10. Sketch of the relationships among MP(Z), MCIS and NCIS.

When this is the case, then P˜ is closer to an element with a causal structure that does not fit into the discussed setting than to a split model in which only the causal cross-connections are removed. Hence a part of the internal cross-connections is being compensated by this type of model, and therefore it does not measure all the intrinsic integrated information.

Further examples, which hint towards MCII ⊊ MCIS, can be found in Section 2.2.2.

Adding the hidden variable W seems not to be sufficient to approximate elements of MCIS. Now the question naturally arises whether there are other exterior influences that need to be included in order to be able to approximate MCIS. We will explore this thought by starting with the graph corresponding to the split model MSI, depicted in Figure 11 on the left. In the next step we add hidden vertices and edges to the graph in a way such that the whole graph is still a chain graph. An example for a valid hidden structure is given in Figure 11 in the middle. Since we are going to marginalize over the hidden structure, it is only important how the visible nodes are connected via the hidden nodes. In the case of the example in Figure 11 we have a directed path from X1 to X2 going through the hidden nodes. Therefore we are able to reduce the structure to a gray box shown on the right in Figure 11.

Figure 11. Example of an exterior influence on the initial graph.

Then we use the Algorithm A3 mentioned earlier, which converts a chain graph with hidden variables to a chain mixed graph reflecting the conditional independence structure of the marginalized model. This leads to a directed edge from X1 to X2 by marginalizing over the nodes in the hidden structures. Seeing that this directed edge already existed, the resulting model now is a subset of MSI and therefore does not approximate MCIS.

Following this procedure we are able to show that adding further hidden nodes and subgraphs of hidden nodes does not lead to a chain mixed graph belonging to a model that satisfies the conditional independence statements of Property 1 and strictly contains MCII.

Theorem 2.

It is not possible to create a chain mixed graph corresponding to a model M, such that its distributions satisfy Property 1 and MCII ⊊ M, by introducing a more complicated hidden structure to the graph of MSI.

In conclusion, assuming that Conjecture 1 holds, we have the following relations among the different presented models.

\mathcal{M}_{I} \subsetneq \mathcal{M}_{G}, \qquad \mathcal{M}_{I} \subsetneq \mathcal{M}_{CII} \subsetneq \mathcal{M}_{CIS}, \qquad \mathcal{M}_{SI} \subsetneq \mathcal{M}_{CII} \subsetneq \mathcal{M}_{CIS}.

A sketch of the inclusion properties among the models is displayed in Figure 12.

Figure 12. Sketch of the relationship between the manifolds corresponding to the different measures.

Every set that lies inside MCIS satisfies Property 1 and every set that completely contains MI fulfills Property 2.

2.1.3. em-Algorithm

The calculation of the measure ΦCIIm with

\Phi_{CII}^{m} = \inf_{Q \in \mathcal{M}_{CII}^{m}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q)

can be done with the em-algorithm, a well-known information geometric algorithm. It was proposed by Csiszár and Tusnády in 1984 in Reference [20], and its usage in the context of neural networks with hidden variables was described, for example, by Amari et al. in Reference [21]. The expectation-maximization (EM) algorithm [22] used in statistics is equivalent to the em-algorithm in many cases, including this one, as we will see below. A detailed discussion of the relationship between these algorithms can be found in Reference [23].

In order to calculate the distance between the distribution P˜ and the set MCIIm on Z, we will make use of the extended space of distributions on Z×Wm, P(Z×Wm). Let MW|Z be the set of all distributions on Z×Wm that have Z-marginals equal to the distribution of the whole system P˜,

\mathcal{M}_{W|Z} = \left\{ P \in \mathcal{P}(\mathcal{Z}\times\mathcal{W}_m) \mid P(z) = \tilde{P}(z), \ z \in \mathcal{Z} \right\} = \left\{ P \in \mathcal{P}(\mathcal{Z}\times\mathcal{W}_m) \mid P(z,w) = \tilde{P}(z)\, P(w \mid z), \ (z,w) \in \mathcal{Z}\times\mathcal{W}_m \right\}.

This is an m-flat submanifold since it is linear w.r.t P(w|z). Therefore there exists a unique e-projection to MW|Z.

The second set that we are going to use is the set Em of distributions that factor according to the split model including the common exterior influence. We have seen this set before in Section 2.1.1.

\mathcal{E}_m = \left\{ P \in \mathcal{P}(\mathcal{Z}\times\mathcal{W}_m) \,\middle|\, P(z,w) = P(x) \prod_{i=1}^{n} P(y_i \mid x_i, w)\, P(w), \ (z,w) \in \mathcal{Z}\times\mathcal{W}_m \right\}. (7)

This set is in general not e-flat, but we will show that there is a unique m-projection to it. We are able to use these sets instead of P˜ and MCIIm because of the following result.

Theorem 3

(Theorem 7 from Reference [21]). The minimum divergence between MW|Z and Em is equal to the minimum divergence between P˜ and MCIIm in the visible manifold

\inf_{P \in \mathcal{M}_{W|Z},\, Q \in \mathcal{E}_m} D_{\mathcal{Z}\times\mathcal{W}_m}(P \,\|\, Q) = \inf_{\tilde{Q} \in \mathcal{M}_{CII}^{m}} D_{\mathcal{Z}}(\tilde{P} \,\|\, \tilde{Q}).
Proof of Theorem 3.

Let P, Q ∈ P(Z×Wm); using the chain rule for the KL-divergence leads to

D_{\mathcal{Z}\times\mathcal{W}_m}(P \,\|\, Q) = D_{\mathcal{Z}}(P \,\|\, Q) + D_{W|Z}(P \,\|\, Q),

with

D_{W|Z}(P \,\|\, Q) = \sum_{(z,w) \in \mathcal{Z}\times\mathcal{W}_m} P(z,w) \log \frac{P(w \mid z)}{Q(w \mid z)}.

This results in

\inf_{P \in \mathcal{M}_{W|Z},\, Q \in \mathcal{E}_m} D_{\mathcal{Z}\times\mathcal{W}_m}(P \,\|\, Q) = \inf_{P \in \mathcal{M}_{W|Z},\, Q \in \mathcal{E}_m} \left[ D_{\mathcal{Z}}(P \,\|\, Q) + D_{W|Z}(P \,\|\, Q) \right] = \inf_{P \in \mathcal{M}_{W|Z},\, Q \in \mathcal{E}_m} \left[ D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) + D_{W|Z}(P \,\|\, Q) \right] = \inf_{Q \in \mathcal{M}_{CII}^{m}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q).

 □

The em-algorithm is an iterative algorithm that repeatedly performs an e-projection to MW|Z followed by an m-projection to Em. Let Q0 ∈ Em be an arbitrary starting point and define P1 as the e-projection of Q0 to MW|Z,

P_1 = \operatorname*{arg\,inf}_{P \in \mathcal{M}_{W|Z}} D_{\mathcal{Z}\times\mathcal{W}_m}(P \,\|\, Q_0).

Now we define Q1 as the m-projection of P1 to Em,

Q_1 = \operatorname*{arg\,inf}_{Q \in \mathcal{E}_m} D_{\mathcal{Z}\times\mathcal{W}_m}(P_1 \,\|\, Q).

Repeating this leads to

P_{i+1} = \operatorname*{arg\,inf}_{P \in \mathcal{M}_{W|Z}} D_{\mathcal{Z}\times\mathcal{W}_m}(P \,\|\, Q_i), \qquad Q_{i+1} = \operatorname*{arg\,inf}_{Q \in \mathcal{E}_m} D_{\mathcal{Z}\times\mathcal{W}_m}(P_{i+1} \,\|\, Q).

The correspondence between these projections in the extended space P(Z×Wm) and one m-projection in P(Z) is illustrated in Figure 13.

Figure 13. Sketch of the em-algorithm.

The algorithm iterates between the extended spaces MW|Z and Em on the left of Figure 13. Using Theorem 3 we see that this minimization is equivalent to the minimization between P˜ and MCIIm. The convergence of this algorithm is given by the following result.

Proposition 4

(Theorem 8 from Reference [21]). The monotonic relations

D_{\mathcal{Z}\times\mathcal{W}_m}(P_i \,\|\, Q_i) \ \ge\ D_{\mathcal{Z}\times\mathcal{W}_m}(P_{i+1} \,\|\, Q_i) \ \ge\ D_{\mathcal{Z}\times\mathcal{W}_m}(P_{i+1} \,\|\, Q_{i+1})

hold, where equality holds only for the fixed points (P̂, Q̂) ∈ MW|Z × Em of the projections

\hat{P} = \operatorname*{arg\,inf}_{P \in \mathcal{M}_{W|Z}} D_{\mathcal{Z}\times\mathcal{W}_m}(P \,\|\, \hat{Q}), \qquad \hat{Q} = \operatorname*{arg\,inf}_{Q \in \mathcal{E}_m} D_{\mathcal{Z}\times\mathcal{W}_m}(\hat{P} \,\|\, Q).
Proof of Proposition 4.

This is immediate, because of the definitions of the e- and m-projections. □

Hence this algorithm is guaranteed to converge towards a minimum, but this minimum might be local. We will see examples of that in Section 2.2.2.

In order to use this algorithm to calculate ΦCII, we first need to determine how to perform an e- and m-projection in this case. The e-projection from Q ∈ Em to MW|Z is given by

P(z,w) = \tilde{P}(z)\, Q(w \mid z),

for all (z,w) ∈ Z×Wm. This is the projection because of the following equality

D_{\mathcal{Z}\times\mathcal{W}_m}(P \,\|\, Q) = \sum_{(z,w) \in \mathcal{Z}\times\mathcal{W}_m} P(z,w) \log \frac{P(z,w)}{Q(z,w)} = \sum_{z \in \mathcal{Z}} \tilde{P}(z) \log \frac{\tilde{P}(z)}{Q(z)} + \sum_{(z,w) \in \mathcal{Z}\times\mathcal{W}_m} P(z,w) \log \frac{P(w \mid z)}{Q(w \mid z)}.

The first addend is a constant for a fixed distribution P˜ and the second addend is equal to 0 if and only if P(w|z)=Q(w|z). Note that this means that the conditional expectation of W remains fixed during the e-projection. This is an important point, because this guarantees the equivalence to the EM algorithm and therefore the convergence towards the MLE. For a proof and examples see Theorem 8.1 in Reference [10] and Section 6 in Reference [23].

After discussing the e-projection, we now consider the m-projection.

Proposition 5.

The m-projection from P ∈ MW|Z to Em is given by

Q(z,w) = P(x) \prod_{i=1}^{n} P(y_i \mid x_i, w)\, P(w)

for all (z,w) ∈ Z×Wm.

The last remaining decision to be made before calculating ΦCII is the choice of the initial distribution. Since it depends on the initial distribution whether the algorithm converges towards a local or a global minimum, it is important to take the minimal outcome of multiple runs. One class of starting points that immediately lead to an equilibrium, which is in general not minimal, are the ones in which Z and W are independent, P0(z,w) = P0(z)P0(w). It is easy to check that the algorithm converges here to the fixed point P̂

\hat{P}(z,w) = \tilde{P}(x)\, \frac{1}{|\mathcal{W}_m|} \prod_{i=1}^{n} \tilde{P}(y_i \mid x_i), \qquad \hat{P}(z) = \tilde{P}(x) \prod_{i=1}^{n} \tilde{P}(y_i \mid x_i).

Note that this is the result of the m-projection of P˜ to MSI, the manifold belonging to ΦSI.
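The two projections above translate directly into an iterative procedure. The following sketch is our own minimal Python illustration of the em-algorithm for ΦCIIm, assuming the joint over Z×Wm is stored as a (2n+1)-dimensional array (n axes for X, n for Y, one for W); it is not the authors' implementation, and as discussed above it should be run from several random starting points.

```python
import numpy as np

def e_projection(p_tilde, q):
    """e-step: P(z,w) = P_tilde(z) * Q(w|z); the Z-marginal is clamped to P_tilde."""
    q_z = q.sum(axis=-1, keepdims=True)               # Q(z)
    return p_tilde[..., None] * q / q_z

def m_projection(p, n):
    """m-step: Q(z,w) = P(x) * prod_i P(y_i|x_i,w) * P(w), with all factors taken from p."""
    full = p.ndim                                      # = 2n + 1
    p_x = p.sum(axis=tuple(range(n, full)))            # P(x)
    q = p_x.reshape(p_x.shape + (1,) * (n + 1))
    p_w = p.sum(axis=tuple(range(2 * n)))              # P(w)
    q = q * p_w.reshape((1,) * (2 * n) + (-1,))
    for i in range(n):
        keep = (i, n + i, 2 * n)
        drop = tuple(ax for ax in range(full) if ax not in keep)
        p_xiw_yi = p.sum(axis=drop, keepdims=True)     # P(x_i, y_i, w)
        p_xiw = p_xiw_yi.sum(axis=n + i, keepdims=True)  # P(x_i, w)
        q = q * (p_xiw_yi / p_xiw)                     # P(y_i | x_i, w)
    return q

def em_phi_CII_m(p_tilde, n, m, iterations=500, seed=0):
    """Approximate Phi_CII^m = inf_{Q in M_CII^m} D(p_tilde || Q) by alternating projections."""
    rng = np.random.default_rng(seed)
    q = rng.random(p_tilde.shape + (m,))
    q /= q.sum()
    q = m_projection(q, n)                             # random starting point Q_0 in E_m
    for _ in range(iterations):
        p = e_projection(p_tilde, q)                   # e-projection to M_{W|Z}
        q = m_projection(p, n)                         # m-projection to E_m
    q_z = q.sum(axis=-1)                               # Z-marginal of the fixed point
    return float(np.sum(p_tilde * np.log(p_tilde / q_z)))
```

Taking the minimum of em_phi_CII_m over several seeds approximates the global minimum, in line with the multiple-run strategy used in Section 2.2.2.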

2.2. Comparison

In order to compare the different measures, we need a setting in which we generate the probability distributions of full systems. We chose to use weighted Ising models as described in the next section.

2.2.1. Ising Model

The distributions used to compare the different measures in the next section are generated by weighted Ising models, also known as binary auto-logistic models, as described in Reference [24], Example 3.2.3. Let us consider n binary variables X = (X1,…,Xn) with state space {−1,1}^n. The matrix V ∈ ℝ^{n×n} contains the weights vij of the connection from Xi to Yj, as displayed in Figure 14. Note that this figure is not a graphical model corresponding to the stationary distribution, but merely displays the connections of the conditional distribution of Yj = yj given X = x with the respective weights

P(y_j \mid x) = \frac{1}{1 + e^{-2\beta \sum_{i=1}^{n} v_{ij} x_i y_j}}. (8)

The inverse temperature β>0 regulates the coupling strength between the nodes. For β close to zero the different nodes are almost independent and as β grows the connections become stronger.

Figure 14. The weights corresponding to the connections for n=2.

We calculate the stationary distribution P̂ by starting with a random initial distribution P0 and then repeatedly applying (8) in the following way

P_{t+1}(y) = \sum_{x \in \mathcal{X}} P_t(x) \prod_{j=1}^{n} P(y_j \mid x);

this leads to

\hat{P} = \lim_{t \to \infty} P_t.

There always exists a unique stationary distribution, see for instance Reference [24], Theorem 5.1.2.
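As a concrete illustration of this construction, the sketch below builds the full transition kernel over the 2^n spin configurations and iterates it to numerical convergence; it is a sketch of ours under the conventions of Equation (8), not the authors' generator code.

```python
import numpy as np
from itertools import product

def transition_kernel(V, beta):
    """K[a, b] = P(y^(b) | x^(a)) = prod_j 1 / (1 + exp(-2*beta*sum_i V[i, j]*x_i*y_j))."""
    n = V.shape[0]
    states = np.array(list(product([-1, 1], repeat=n)))   # all 2^n configurations
    K = np.ones((len(states), len(states)))
    for a, x in enumerate(states):
        fields = V.T @ x                                   # sum_i V[i, j] * x_i for each j
        for b, y in enumerate(states):
            K[a, b] = np.prod(1.0 / (1.0 + np.exp(-2.0 * beta * fields * y)))
    return K                                               # each row sums to 1

def stationary_distribution(K, tol=1e-12, max_iter=100000):
    """Iterate P_{t+1} = P_t K until convergence; the chain has a unique stationary law."""
    p = np.full(K.shape[0], 1.0 / K.shape[0])
    for _ in range(max_iter):
        p_next = p @ K
        if np.abs(p_next - p).max() < tol:
            break
        p = p_next
    return p_next
```

The full-system distribution over one time step, used as P˜ in the comparisons, can then be assembled from the stationary law together with the conditional distribution P(y|x).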

2.2.2. Results

In this section we are going to compare the different measures experimentally. Note that we do not have an exterior influence in these examples, so that ΦT=ΦSI holds.

To distinguish between the Causal Information Integration ΦCII calculated with different sized state spaces of W, we will denote

\Phi_{CII}^{m} = \inf_{Q \in \mathcal{M}_{CII}^{m}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q).

We start with the smallest example possible, with n=2, and the weight matrix

V = \begin{pmatrix} 0.0084181 & 0.2401545 \\ 0.39270161 & 0.37198751 \end{pmatrix}

shown in Figure 15. In this example every measure is bounded by ΦI, and the measures ΦI, ΦG and ΦSI display a limit behavior different from ΦCIS and ΦCII. The state spaces of W have the sizes 2, 3, 4, 36 and 92, and the respective measures are displayed in shades of blue that get darker as the state space gets larger. In every case the em-algorithm was initialized 100 times with a random input distribution in order to find a global minimum. Minimizing over the outcome of 100 different runs turns out to be sufficient, at least empirically, to reveal the behavior of the global minima. On the right side of this figure, we are able to see the difference between ΦCIS and ΦCII. Considering the precision of the algorithms, we assume that a difference smaller than 5e-07 is approximately zero. We can see that in a region from β=15 to β=25 the measures differ even in the case of 92 hidden states. So this small case already hints towards MCII ⊊ MCIS.

Figure 15. Ising model with 2 nodes and the differences between ΦCIS and ΦCII.

Increasing n from 2 to 3 makes the difference even more visible, as we can see in Figure 16 produced with the weight matrix

V = \begin{pmatrix} 0.43478388 & 0.47448218 & 0.36808313 \\ 0.52117467 & 0.00672578 & 0.7387737 \\ 0.56114795 & 0.96941243 & 0.76408711 \end{pmatrix}.
Figure 16. Ising model with 3 nodes.

Here we are able to observe a difference in the behavior of ΦG compared to the other measures: ΦI, ΦSI, ΦCII and ΦCIS are still increasing around β ≈ 1.1, while ΦG starts to decrease.

Now, we are going to focus on an example with 5 nodes. Since it is very time consuming to calculate ΦCIS for more than 3 nodes, we are going to restrict attention to ΦI, ΦG, ΦSI and ΦCII. The weight matrix

V = \begin{pmatrix}
0.35615839 & 0.09775903 & 0.89743801 & 0.00604247 & 0.03897772 \\
0.2260056 & 0.47769717 & 0.4302256 & 0.18692707 & 0.25140741 \\
0.86081159 & 0.18348132 & 0.71528754 & 0.08100602 & 0.64364176 \\
0.13967234 & 0.03233011 & 0.81057654 & 0.33327558 & 0.57447322 \\
0.18920264 & 0.99054716 & 0.32088358 & 0.69100397 & 0.69206604
\end{pmatrix}

produces Figure 17. This example shows that ΦSI is not bounded by ΦI and therefore does not satisfy Property 2. Since the focus in this example lies on the relationship between ΦSI and ΦI, the em-algorithm was run with ten different input distributions for each step.

Figure 17. Ising model with 5 nodes.

Using this example, we are going to take a closer look at the local minima the em-algorithm converges to. Considering only ΦCII and varying the size of the state space leads to the upper part in Figure 18. This figure displays ten different runs of the em-algorithm with each size of state space in different shades of the respective color, namely blue for ΦCII2, violet for ΦCII4, red for ΦCII8 and orange for ΦCII16. Note that we display the outcomes of every run in this case and not only the minimal one, since we are interested in the local minima. We are able to observe how increasing the state space leads to a smaller value of ΦCII. Additionally, the differences between the minimal values corresponding to each state space grow smaller and converge as the state spaces increase.

Figure 18. The effect of a different sized state space.

The bottom half of Figure 18 highlights an observation that we made. Each of the four illustrations is a copy of the one above, where the differences between the minima are shaded in the respective color. By increasing the size of the state space, the difference in value between the various local minima decreases visibly. We think this is consistent with the general observation made in the context of high-dimensional optimization, for example in Reference [25], in which the authors conjecture that the probability of finding a high-valued local minimum decreases when the network size grows.

Letting the algorithm run only once with |W|=2 on the same data leads to the curve on the left in Figure 19.

Figure 19. Curve of one run of the em-algorithm for each β, coloured according to the distribution of W.

The sets Em defined in (7) and MCII (5) do not change for different values of β, and therefore we have a fixed set of local minima for a fixed state space of W. What does change with different β is which of the local minima are global minima. The vertical dotted lines represent the steps from Pβt to Pβt+1 in which the KL-divergence between the projections to MCII is greater than 0.2,

D_{\mathcal{Z}}(P_{\beta_t} \,\|\, P_{\beta_{t+1}}) > 0.2,

meaning that inside the different sections of the curve, the projections to MCII are close. As β increases, a different region of local minima becomes global. A sketch of this is shown in Figure 20.

Figure 20. Sketch of different local minima.

The curve is colored according to the distribution of W, as shown on the right side of Figure 19. We see that a different distribution on W results in a different minimum, except for the region between 7.5 and 8. The colors light blue and yellow refer to distributions on W that are different, but symmetric in the following way. Consider two different distributions Q, Q̂ on Z×W such that

Q(z, w_1) = \hat{Q}(z, w_2) \quad \text{and} \quad Q(z, w_2) = \hat{Q}(z, w_1)

for all z ∈ Z. Then the corresponding marginalized distributions in MCII2 are equal,

\sum_{w} Q(z, w) = \sum_{w} \hat{Q}(z, w).

This symmetry is the reason for the different colors in the region between 7.5 and 8.

Using this geometric algorithm we therefore gain a notion of the local minima on E.

3. Discussion

This article discusses a selection of existing complexity measures in the context of Integrated Information Theory that follow the framework introduced in Reference [7], namely ΦSI,ΦG and ΦCIS. The main contribution is the proposal of a new measure, Causal Information Integration ΦCII.

In Reference [4] and Reference [5] the authors postulate a Markov condition, ensuring the removal of the causal cross-connections, and an upper bound, given by the mutual information ΦI, for valid Integrated Information measures. Although ΦSI is not bounded by ΦI, as we see in Figure 17, it does measure the intrinsic causal cross-connections in a setting in which there exists no common exterior influence. Therefore the authors of Reference [12] criticize this bound. Since wrongly assuming the existence of a common exterior influence might lead to a value that does not measure all the intrinsic causal influences, the question of which measure to use strongly depends on how much we know about the system and its environment. We argue that using ΦI as an upper bound in the cases in which we have an unknown common exterior influence is reasonable. The measure ΦG attempts to extend ΦSI to a setting with exterior influences, but it does not satisfy the Markov condition postulated in Reference [4].

One measure that fulfills all the requirements of this framework is ΦCIS, but it has no graphical representation. Hence the causal nature of the measured information flow is difficult to analyze. We present in Example 1 a submodel of MCIS whose causal structure does not lie inside the set of Markovian processes MP(Z) discussed in this article. Therefore, by projecting to MCIS we might project to a distribution that still holds some of the integrated information of the original system, although it does not have any causal cross-connections. Additionally, we demonstrate that MCIS does not correspond to a graphical representation, even after adding any number of latent variables to the model of MSI. This conflicts with the strong connection between conditional independence statements and graphs in Pearl's causality theory. For discrete variables ΦCIS does not have a closed form solution and has to be calculated numerically.

We propose a new measure ΦCII that also satisfies all the conditions and additionally has a graphical and intuitive interpretation. Numerically calculated examples indicate that ΦCII ≠ ΦCIS. The definition of ΦCII explicitly includes an exterior influence as a latent variable and therefore aims at only measuring intrinsic causal influences. This measure should be used in the setting in which there exists an unknown common exterior influence. By assuming the existence of a ground truth, we are able to prove that our new measure is bounded from above by the ultimate value of Integrated Information ΦT of this system. Although ΦCII also has no analytical solution, we are able to use the information geometric em-algorithm to calculate it. The em-algorithm is guaranteed to converge towards a minimum, but this minimum might be local. Even after letting our smallest example, depicted in Figure 15, run with 100 random input distributions, we still get local minima. On the other hand, in our experience the em-algorithm seems to be more reliable, and for larger networks faster, than the numerical methods we used to calculate ΦCIS. Additionally, by letting the algorithm run multiple times we are able to gain a notion of how the local minima in E are related to each other, as demonstrated in Figure 19.

4. Materials and Methods

The distributions used in Section 2.2.2 were generated by a Python program, and the measures ΦI, ΦCII, ΦSI and ΦG are implemented in C++. The Python routine scipy.optimize.minimize was used to calculate ΦCIS. The code is available at Reference [26].

Appendix A. Graphical Models

Graphical models are a useful tool to visualize conditional independence structures. In this method a graph is used to describe the set of distributions that factor according to it. In our case, we are considering chain graphs. These are graphs with vertex set V and edge set E ⊆ V×V, consisting of directed and undirected edges, such that we are able to partition the vertex set into subsets V = V1 ∪ … ∪ Vm, called chain components, with the properties that all edges between different subsets are directed, all edges between vertices of the same chain component are undirected, and there are no directed cycles between chain components. For a vertex set τ, we will denote by pa(τ) the set of parents of elements in τ, which are vertices α with a directed arrow from α to an element of τ. Vertices connected by an undirected edge are called neighbours. A more detailed description can be found in Reference [16].

Definition A1.

Let T be the set of chain components. A distribution factorizes with respect to a chain graph G if the distribution can be written as follows

$$P(z)=\prod_{\tau\in T}P\left(x_\tau\mid x_{\mathrm{pa}(\tau)}\right),$$

where the structure of P(x_τ | x_pa(τ)) can be described in more detail. Let A(τ), τ ∈ T, be the set of all subsets of τ ∪ pa(τ) that are complete in a graph τ∗, which is an undirected graph with the vertex set τ ∪ pa(τ) whose edges are the ones between elements in τ ∪ pa(τ) that exist in G and additionally the ones between elements in pa(τ). An undirected graph is complete if every pair of distinct vertices is connected by an edge. Then there are non-negative functions ϕa such that

$$P\left(x_\tau\mid x_{\mathrm{pa}(\tau)}\right)=\prod_{a\in\mathcal{A}(\tau)}\phi_a(x).$$

If τ is a singleton, then τ∗ is already complete. There are different kinds of independence statements a chain graph can encode, but we only need the global chain graph Markov property. In order to define this property we need the concepts of an ancestral set and a moral graph.
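As a small worked example (assuming the graph on the left of Figure A1, that is, an undirected edge between X1 and X2 and directed edges Xi → Yi), the chain components are {X1, X2}, {Y1} and {Y2}, the first component has no parents and pa({Yi}) = {Xi}, so the factorization of Definition A1 reads

$$P(z)=P(x_1,x_2)\,P(y_1\mid x_1)\,P(y_2\mid x_2).$$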

The boundary bd(A) of a set A ⊆ V is the set of vertices in V∖A that are parents or neighbours of vertices in A. If bd(α) ⊆ A for all α ∈ A, we call A an ancestral set. For any A ⊆ V there exists a smallest ancestral set containing A, because the intersection of ancestral sets is again an ancestral set. This smallest ancestral set of A is denoted by An(A).

Let G be a chain graph. The moral graph of G is an undirected graph, denoted by Gm, that consists of the same vertex set as G and in which two vertices α, β are connected if and only if either they were already connected by an edge in G or there are vertices γ, δ belonging to the same chain component such that α → γ and β → δ.
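The construction of the moral graph is mechanical; the following Python fragment is a minimal sketch under our own encoding (vertices as strings, undirected edges as frozensets, directed edges as parent–child pairs) and is not part of the code in Reference [26]. The example graph is of the kind discussed in the proof of Theorem 2, with a common exterior influence W on Y1 and Y2.

from itertools import combinations

def moral_graph(undirected, directed, components):
    # undirected: set of frozensets {a, b}; directed: set of (parent, child) pairs;
    # components: list of vertex sets (the chain components)
    edges = set(undirected)
    edges |= {frozenset(e) for e in directed}      # drop the directions
    for comp in components:
        parents = {a for (a, b) in directed if b in comp}
        for a, b in combinations(sorted(parents), 2):
            edges.add(frozenset({a, b}))           # marry parents of the component
    return edges

# Example: X1 - X2 as one chain component, Xi -> Yi, and a latent W -> Y1, W -> Y2
und = {frozenset({"X1", "X2"})}
dirs = {("X1", "Y1"), ("X2", "Y2"), ("W", "Y1"), ("W", "Y2")}
comps = [{"X1", "X2"}, {"Y1"}, {"Y2"}, {"W"}]
print(moral_graph(und, dirs, comps))

In the output the parents of each chain component become connected, for example X1 and W; this marrying of parents is what connects two latent influences W1 and W2 in the moralized graph in the proof of Theorem 2.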

Definition A2.

(Global Chain Graph Markov Property). Let P be a distribution on Z and G a chain graph. P satisfies the global chain Markov property with respect to G if, for any triple (ZA, ZB, ZS) of disjoint subsets of Z such that ZS separates ZA from ZB in (GAn(ZA∪ZB∪ZS))m, the moral graph of the smallest ancestral set containing ZA ∪ ZB ∪ ZS, the conditional independence statement

$$Z_A \perp\!\!\!\perp Z_B \mid Z_S$$

holds.

Since we are only considering positive discrete distributions, we have the following result.

Lemma A1.

The global chain Markov property and the factorization property are equivalent for positive discrete distributions.

Proof of Lemma A1.

Theorem 4.1 from Reference [27] combined with the Hammersley–Clifford theorem, for example, Theorem 2.9 in Reference [28], proves this statement. □

In order to understand the conditional independence structure of a chain graph after marginalization, we need the following algorithm from Reference [17]. This algorithm converts a chain graph with latent variables into a chain mixed graph with the conditional independence structure of the marginalized chain graph. A chain mixed graph has, in addition to directed and undirected edges, bidirected edges, called arcs. The condition that there are no semi-directed cycles also applies to chain mixed graphs.

Definition A3.

Let M be the set of vertices over which we want to marginalize. The following algorithm produces a chain mixed graph (CMG) with the conditional independence structure of the marginalized chain graph.

  1. Generate an edge between i and j, as in Table A1, steps 8 and 9, on a collider trislide with endpoint j and an endpoint in M, if an edge of the same type does not already exist.

  2. Generate an appropriate edge, as in Table A1, steps 1 to 7, between the endpoints of every tripath with inner node in M, if an edge of the same type does not already exist. Apply this step until no further edge can be generated (these tripath rules are encoded as a small lookup in the sketch following Table A1).

  3. Remove all nodes in M.

Table A1.

Types of edges induced by tripaths with inner node m ∈ M and trislides with endpoint m ∈ M.

1 i ← m ← j generates i ← j
2 i ← m – j generates i ← j
3 i ↔ m – j generates i ↔ j
4 i ← m → j generates i ↔ j
5 i ← m ↔ j generates i ↔ j
6 i – m ← j generates i ← j
7 i – m – j generates i – j
8 m → i – ⋯ – j generates i ← j
9 m ↔ i – ⋯ – j generates i ↔ j
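The tripath rules of Table A1 can be encoded as a simple lookup. The following Python fragment covers only steps 1 to 7 (the trislide rules 8 and 9 and the removal of M are omitted); the string encoding of the edge types is our own illustration and not taken from Reference [26].

# Rules 1-7 of Table A1: (edge between i and m, edge between m and j) -> induced edge
TRIPATH_RULES = {
    ("i <- m", "m <- j"): "i <- j",    # 1
    ("i <- m", "m - j"): "i <- j",     # 2
    ("i <-> m", "m - j"): "i <-> j",   # 3
    ("i <- m", "m -> j"): "i <-> j",   # 4
    ("i <- m", "m <-> j"): "i <-> j",  # 5
    ("i - m", "m <- j"): "i <- j",     # 6
    ("i - m", "m - j"): "i - j",       # 7
}

def induced_edge(left, right):
    # Edge generated by a tripath i ~ m ~ j with inner node m in M, if any
    return TRIPATH_RULES.get((left, right))

print(induced_edge("i <- m", "m -> j"))    # 'i <-> j', rule 4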

Conditional independence in CMGs is defined using the concept of c-separation, see for example Section 4 of Reference [17]. For this definition we need the concepts of a walk and of a collider section. A walk is a list of vertices α0, …, αk, k ∈ ℕ, such that there is an edge or arrow between αi and αi+1, i ∈ {0, …, k−1}. A set of vertices connected by undirected edges is called a section. If there exists a walk including a section such that an arrow points at the first and at the last vertex of the section, then this section is called a collider section.

Definition A4 (c-separation).

Let A, B and C be disjoint sets of vertices of a graph. A walk π is called a c-connecting walk given C if every collider section of π has a node in C and all non-collider sections are disjoint from C. The sets A and B are called c-separated given C if there are no c-connecting walks between them given C, and we write A ⊥c B | C.

Appendix B. Proofs

Proof of the Relationship (4).

For n = 2 this is immediate. Let now n ≥ 3 and let i, j, k ∈ {1, …, n} be pairwise distinct. Applying (1) two times leads to

$$Q(y_j,x)=\frac{Q(y_j,x_{I\setminus\{i\}})\,Q(x)}{Q(x_{I\setminus\{i\}})},\qquad Q(y_j,x)=\frac{Q(y_j,x_{I\setminus\{k\}})\,Q(x)}{Q(x_{I\setminus\{k\}})},$$

$$Q(y_j,x_{I\setminus\{i\}})\,Q(x_{I\setminus\{k\}})=Q(y_j,x_{I\setminus\{k\}})\,Q(x_{I\setminus\{i\}})$$

for all (x, yj) ∈ X × Yj. Marginalizing over the elements of Xk yields

$$Q(y_j,x_{I\setminus\{i,k\}})\,Q(x_{I\setminus\{k\}})=Q(y_j,x_{I\setminus\{k\}})\,Q(x_{I\setminus\{i,k\}}),\qquad\text{that is,}\qquad Q(y_j\mid x_{I\setminus\{i,k\}})=Q(y_j\mid x_{I\setminus\{k\}}).$$

Using the remaining relations inductively results in (4). □

Proof of Proposition 1.

If ΦCII(P̃) = 0 holds, then

$$\inf_{Q\in M_{CII}} D_Z(\tilde P\,\|\,Q)=0.$$

Since MCII is compact, the infimum is attained by an element of MCII, so there exists Q ∈ MCII such that DZ(P̃ ∥ Q) = 0. Therefore P̃ ∈ MCII and the existence of a sequence Qm follows from the definition of MCII.

Assume now that there exists a sequence Qm that satisfies 1. and 2. Then every element Qm lies in MCIIm per definition, and for the limit we have

$$\tilde P\in\overline{\bigcup_{m\in\mathbb{N}} M_{CII}^{m}}=M_{CII}.$$

Hence

$$\Phi_{CII}(\tilde P)=\inf_{Q\in M_{CII}}D_Z(\tilde P\,\|\,Q)=D_Z(\tilde P\,\|\,\tilde P)=0.$$

 □

Proof of Proposition 2.

Let P ∈ Ef and Q ∈ E; then the KL-divergence between the two elements is

$$\begin{aligned}
D_{Z\times W_m}(P\,\|\,Q)&=\sum_{z,w}P(z,w)\log\frac{P(x)\prod_i P(y_i\mid x,w)\,P(w)}{Q(x)\prod_i Q(y_i\mid x_i,w)\,Q(w)}\\
&=\sum_{x}P(x)\log\frac{P(x)}{Q(x)}+\sum_{z,w}P(z,w)\log\frac{\prod_i P(y_i\mid x,w)}{\prod_i Q(y_i\mid x_i,w)}+\sum_{w}P(w)\log\frac{P(w)}{Q(w)}\\
&\geq\sum_{x}P(x)\log\frac{P(x)}{P(x)}+\sum_{z,w}P(z,w)\log\frac{\prod_i P(y_i\mid x,w)}{\prod_i P(y_i\mid x_i,w)}+\sum_{w}P(w)\log\frac{P(w)}{P(w)}\\
&=\sum_{z,w}P(z,w)\log\frac{\prod_i P(y_i\mid x,w)}{\prod_i P(y_i\mid x_i,w)}.
\end{aligned}$$

The inequality holds because in the first and third addend we are able to apply that the cross entropy is greater than or equal to the entropy, and in the second addend we use the log-sum inequality in the following way:

$$\begin{aligned}
&\sum_{z,w}P(z,w)\log\frac{\prod_i P(y_i\mid x,w)}{\prod_i Q(y_i\mid x_i,w)}-\sum_{z,w}P(z,w)\log\frac{\prod_i P(y_i\mid x,w)}{\prod_i P(y_i\mid x_i,w)}\\
&\qquad=\sum_i\sum_{x,w}P(x)P(w)\sum_{y_i}P(y_i\mid x_i,w)\log\frac{P(y_i\mid x_i,w)}{Q(y_i\mid x_i,w)}\\
&\qquad\geq\sum_i\sum_{x,w}P(x)P(w)\log\frac{\sum_{y_i}P(y_i\mid x_i,w)}{\sum_{y_i}Q(y_i\mid x_i,w)}=0.
\end{aligned}$$

Therefore the new integrated information measure results in

$$\inf_{Q\in E}D_{Z\times W_m}(P\,\|\,Q)=\sum_{z,w}P(z,w)\log\frac{\prod_i P(y_i\mid x,w)}{\prod_i P(y_i\mid x_i,w)}.$$

This can be rewritten to

$$\begin{aligned}
\sum_{z,w}P(z,w)\log\frac{\prod_i P(y_i\mid x,w)}{\prod_i P(y_i\mid x_i,w)}
&=\sum_{z,w}P(z,w)\log\prod_i\frac{P(y_i,x,w)\,P(x_i,w)}{P(y_i,x_i,w)\,P(x,w)}\\
&=\sum_{z,w}P(z,w)\log\prod_i\frac{P(y_i,x_{I\setminus\{i\}}\mid x_i,w)\,P(x_i,w)}{P(y_i\mid x_i,w)\,P(x,w)}\\
&=\sum_{z,w}P(z,w)\log\prod_i\frac{P(y_i,x_{I\setminus\{i\}}\mid x_i,w)}{P(y_i\mid x_i,w)\,P(x_{I\setminus\{i\}}\mid x_i,w)}\\
&=\sum_i I(Y_i;X_{I\setminus\{i\}}\mid X_i,W).
\end{aligned}$$

 □
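The right-hand side of Proposition 2 is easy to evaluate numerically. The following Python sketch does this for an assumed toy setting with n = 2 binary units and a binary W, the joint array being indexed by (x1, x2, y1, y2, w); it only illustrates the quantity and is not taken from Reference [26].

import numpy as np

rng = np.random.default_rng(2)
P = rng.random((2, 2, 2, 2, 2))     # toy joint P(x1, x2, y1, y2, w)
P /= P.sum()

def cmi(P_abc):
    # I(A; B | C) for a joint array with axes ordered (A, B, C)
    P_c = P_abc.sum(axis=(0, 1))
    P_ac = P_abc.sum(axis=1)
    P_bc = P_abc.sum(axis=0)
    num = P_abc * P_c[None, None, :]
    den = P_ac[:, None, :] * P_bc[None, :, :]
    mask = P_abc > 0
    return float(np.sum(P_abc[mask] * np.log(num[mask] / den[mask])))

# I(Y1; X2 | X1, W): reorder the axes to (y1, x2, (x1, w))
P_1 = P.sum(axis=3).transpose(2, 1, 0, 3).reshape(2, 2, 4)
# I(Y2; X1 | X2, W): reorder the axes to (y2, x1, (x2, w))
P_2 = P.sum(axis=2).transpose(2, 0, 1, 3).reshape(2, 2, 4)

print("sum of conditional mutual informations:", cmi(P_1) + cmi(P_2))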

Proof of Proposition 3.

By using the log-sum inequality we get

$$\begin{aligned}
\Phi_{CII}^{m}&=\inf_{Q\in M_{CII}^{m}}\sum_{z}P(z)\log\frac{\sum_{w}P(x)\prod_i P(y_i\mid x,w)\,P(w)}{\sum_{w}Q(x)\prod_i Q(y_i\mid x_i,w)\,Q(w)}\\
&\leq\inf_{Q\in M_{CII}^{m}}\sum_{w}\sum_{z}P(z,w)\log\frac{P(x)\prod_i P(y_i\mid x,w)\,P(w)}{Q(x)\prod_i Q(y_i\mid x_i,w)\,Q(w)}\\
&=\inf_{Q\in E}D_{Z\times W_m}(P\,\|\,Q).
\end{aligned}$$

The fact that every element Q ∈ E corresponds via marginalization to an element in MCIIm, and every element in MCIIm has at least one corresponding element in E, leads to the equality in the last row. Since taking the infimum over a larger space can only decrease the value further, the relation

$$\Phi_{CII}\leq\Phi_{T}$$

holds. □

Proof of Proposition 5.

$$\begin{aligned}
D_{Z\times W_m}(P\,\|\,Q)&=\sum_{(z,w)\in Z\times W_m}P(z,w)\log\frac{P(z,w)}{Q(x)\prod_{i=1}^{n}Q(y_i\mid x_i,w)\,Q(w)}\\
&=\sum_{(z,w)}P(z,w)\log P(z,w)+\sum_{(z,w)}P(z,w)\log\frac{1}{Q(x)}\\
&\quad+\sum_{(z,w)}\sum_{i=1}^{n}P(z,w)\log\frac{1}{Q(y_i\mid x_i,w)}+\sum_{(z,w)}P(z,w)\log\frac{1}{Q(w)}.
\end{aligned}$$

The first addend is a constant depending only on P and the others are cross-entropies, which are greater than or equal to the corresponding entropies:

$$\begin{aligned}
D_{Z\times W_m}(P\,\|\,Q)&\geq\sum_{(z,w)}P(z,w)\log P(z,w)+\sum_{(z,w)}P(z,w)\log\frac{1}{P(x)}\\
&\quad+\sum_{(z,w)}\sum_{i=1}^{n}P(z,w)\log\frac{1}{P(y_i\mid x_i,w)}+\sum_{(z,w)}P(z,w)\log\frac{1}{P(w)}\\
&=\sum_{(z,w)\in Z\times W_m}P(z,w)\log\frac{P(z,w)}{P(x)\prod_{i=1}^{n}P(y_i\mid x_i,w)\,P(w)}.
\end{aligned}$$

Therefore this projection is unique. □
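Proposition 5 provides exactly the update needed for the m-step of the em-algorithm used to compute ΦCII: the projection onto E is obtained from the marginals P(x), P(w) and the conditionals P(yi | xi, w) of the current extension. The following Python sketch shows one possible form of the resulting iteration for an assumed toy case with n = 2 binary units and a binary hidden variable W; the implementation accompanying this article is written in C++ and organized differently.

import numpy as np

rng = np.random.default_rng(1)
P_obs = rng.random((2, 2, 2, 2))       # observed joint P~(x1, x2, y1, y2)
P_obs /= P_obs.sum()
Q = rng.random((2, 2, 2, 2, 2))        # initial Q(z, w), axes (x1, x2, y1, y2, w)
Q /= Q.sum()

def e_step(P_obs, Q):
    # extend P~ to the hidden variable using Q(w | z)
    Q_z = Q.sum(axis=4, keepdims=True)
    return P_obs[..., None] * Q / np.maximum(Q_z, 1e-300)

def m_step(P_hat):
    # projection onto E: Q(x) Q(y1 | x1, w) Q(y2 | x2, w) Q(w), cf. Proposition 5
    P_x = P_hat.sum(axis=(2, 3, 4))
    P_w = P_hat.sum(axis=(0, 1, 2, 3))
    P_x1y1w = P_hat.sum(axis=(1, 3))
    P_x2y2w = P_hat.sum(axis=(0, 2))
    c_y1 = P_x1y1w / np.maximum(P_x1y1w.sum(axis=1, keepdims=True), 1e-300)
    c_y2 = P_x2y2w / np.maximum(P_x2y2w.sum(axis=1, keepdims=True), 1e-300)
    return (P_x[:, :, None, None, None] * c_y1[:, None, :, None, :]
            * c_y2[None, :, None, :, :] * P_w[None, None, None, None, :])

for _ in range(200):
    Q = m_step(e_step(P_obs, Q))

Q_z = Q.sum(axis=4)
phi_cii = float(np.sum(P_obs * (np.log(P_obs + 1e-12) - np.log(Q_z + 1e-12))))
print("Phi_CII estimate for one random initialization:", phi_cii)

Since the iteration may stop in a local minimum, several random initializations of Q are compared in practice, as discussed above.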

Proof of Theorem 2.

We need a way to understand the connections in a graph after marginalization. In Reference [17] Sadeghi presents an algorithm that converts a chain graph into a chain mixed graph representing the Markov properties of the original graph after marginalizing; see Definition A3.

Although the actual set of distributions after marginalizing might be more complicated, it is a subset of the distributions factorizing according to the new graph, if the new graph is still a chain graph. This is due to the equivalence of the global chain Markov property and the factorization property in Lemma A1.

At first we consider the case of two nodes per time step, n = 2, and take a close look at the possible ways a hidden structure could be connected to the left graph in Figure A1. We begin with the possible connections between two nodes, depicted on the right in Figure A1. The boxes stand for any kind of subgraph of hidden nodes such that the whole graph is still a chain graph, and the two-headed dotted arrows stand for a line or an arrow in either direction. Consider two nodes A and B; then the connections including a box between the nodes can take one of the five following forms:

  1. they form an undirected path between A and B,

  2. they can form a directed path from A to B,

  3. they can form a directed path from B to A,

  4. there exists a collider,

  5. A and B have a common exterior influence.

A collider is a node, or a set of nodes connected by undirected edges, that has an arrow pointing at the set at both ends.

Figure A1. Starting graph and possible two-way interactions.

We will start with the gridded hidden structure connected to X1 and X2. Since there already is an undirected edge between X1 and X2, an undirected path would make no difference in the marginalized model. The cases (2) and (3) would form a directed cycle, which violates the requirements of a chain mixed graph. A collider would also make no difference, since it disappears in the marginalized model. A common exterior influence leads to

$$P(\hat w)\,P(x\mid\hat w)\,P(y_1\mid x_1)\,P(y_2\mid x_2)=P(x,\hat w)\,P(y_1\mid x_1)\,P(y_2\mid x_2),$$

$$\sum_{\hat w}P(x,\hat w)\,P(y_1\mid x_1)\,P(y_2\mid x_2)=P(x)\,P(y_1\mid x_1)\,P(y_2\mid x_2).$$

Now let us discuss these possibilities in the case of a gray hidden structure between Xi and Yj, i, j ∈ {1,2}, i ≠ j. An undirected path (1) or a directed path (3) would create a directed cycle. A directed path (2) from Xi to Yj would lead to a chain graph in which Xi and Yj are not conditionally independent given Xj. If there exists a collider (4) in the hidden structure, then nothing else in the graph depends on this part of the structure, and it reduces to a factor of one when we marginalize over the hidden variables. Therefore the path between Xi and Yj gets interrupted, leaving a potential external influence or effect; these do not have an additional impact on the marginalized model. A common exterior influence (5) leads to a chain mixed graph that does not satisfy the necessary conditional independence structure, because applying the algorithm of Definition A3 produces an arc between Xi and Yj, hence they are c-connected in the sense of Definition A4.

The next possibility is a dotted hidden structure between Xi and Yi, i ∈ {1,2}. An undirected path (1) and a directed path (3) would lead to a directed cycle. A directed path (2) would add no new structure to the model, since there already is a directed edge between Xi and Yi. A collider (4) does not have an effect on the marginalized model. Adding a common exterior influence W1 on X1, Y1 results in a new model which is not symmetric in i ∈ {1,2} and does not include MI; therefore it does not fully contain MCII. Adding additional common exterior influences W2 on X2, Y2 or Y1, Y2, in order to include MI in the new model, violates the conditional independence statements, since nodes in W1 and W2 are connected in the moralized graph.

The last hidden structure between two nodes is the striped one between Y1 and Y2. An undirected path (1) or any directed path (2), (3) leads to a graph that does not satisfy the conditional independence statements. A collider (4) has no impact on the model, and a common exterior influence leads to the definition of Causal Information Integration.

Connecting Y1, Y2 and Xi, i ∈ {1,2}, leads either to a violation of the conditional independence statements or to a structure that contains a collider, in which case the marginalized model reduces to one of the cases above.

All the possible ways a hidden structure could be connected to three nodes X1,X2,Y1 by directed edges are shown in Figure A2. Replacing any of these edges by an undirected edge would either make no difference or lead to a model that does not satisfy the conditional independence statements. In this case the black boxes represent sections. More complicated hidden structures reduce to this case, since these structures either contain a collider and correspond to one of the cases above or contain longer directed paths in the direction of the edges connecting the structure to the visible nodes, which does not change the marginalized model.

Figure A2. The eight possible hidden structures between three nodes.

The models in (c), (d), (e), (f) and (g) either contain a collider, and therefore reduce to one of the cases discussed above, or induce a directed cycle. We see that (a) and (h) display structures that do not satisfy the conditional independence statements. The hidden structure in (b) has no impact on the model.

A hidden structure connected to all four nodes contains one of the structures above and therefore does not induce a new valid model.

Let us now consider a model with n > 2. Any hidden structure on this model either connects only up to four nodes and therefore reduces to one of the cases above, contains one of the connections discussed in Figure A2, or only connects nodes within one point in time. The only structures possible to add would be a common exterior influence on the Xi, a common exterior influence on the Yi, or a collider section on any nodes. None of these structures changes the marginalized model. Therefore it is not possible to create a chain graph with hidden nodes in order to get a model strictly larger than MCII. □

Author Contributions

Conceptualization, N.A. and C.L.; methodology, N.A. and C.L.; software, C.L.; investigation, C.L.; writing, C.L.; supervision, N.A.; project administration, N.A.; funding acquisition, N.A. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge funding by Deutsche Forschungsgemeinschaft Priority Programme “The Active Self” (SPP 2134).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tononi G., Edelman G.M. Consciousness and Complexity. Science. 1999;282:1846–1851. doi: 10.1126/science.282.5395.1846.
  2. Tononi G. Consciousness as Integrated Information: A Provisional Manifesto. Biol. Bull. 2008;215:216–242. doi: 10.2307/25470707.
  3. Oizumi M., Albantakis L., Tononi G. From the Phenomenology to the Mechanisms of Consciousness: Integrated Information Theory 3.0. PLoS Comput. Biol. 2014;10:1–25. doi: 10.1371/journal.pcbi.1003588.
  4. Oizumi M., Tsuchiya N., Amari S. Unified framework for information integration based on information geometry. Proc. Natl. Acad. Sci. USA. 2016;113:14817–14822. doi: 10.1073/pnas.1603583113.
  5. Amari S., Tsuchiya N., Oizumi M. Geometry of Information Integration. In: Ay N., Gibilisco P., Matúš F., editors. Information Geometry and Its Applications. Springer International Publishing; Cham, Switzerland: 2018. pp. 3–17.
  6. Ay N. Information Geometry on Complexity and Stochastic Interaction. MPI MIS Preprint 95. 2001. Available online: https://www.mis.mpg.de/preprints/2001/preprint2001_95.pdf (accessed on 28 September 2020).
  7. Ay N. Information Geometry on Complexity and Stochastic Interaction. Entropy. 2015;17:2432–2458. doi: 10.3390/e17042432.
  8. Ay N., Olbrich E., Bertschinger N. A Geometric Approach to Complexity. Chaos. 2011;21. doi: 10.1063/1.3638446.
  9. Oizumi M., Amari S., Yanagawa T., Fujii N., Tsuchiya N. Measuring Integrated Information from the Decoding Perspective. PLoS Comput. Biol. 2016;12. doi: 10.1371/journal.pcbi.1004654.
  10. Amari S. Information Geometry and Its Applications. Springer Japan; Tokyo, Japan: 2016.
  11. Pearl J. Causality. Cambridge University Press; Cambridge, UK: 2009.
  12. Kanwal M.S., Grochow J.A., Ay N. Comparing Information-Theoretic Measures of Complexity in Boltzmann Machines. Entropy. 2017;19:310. doi: 10.3390/e19070310.
  13. Barrett A.B., Seth A.K. Practical Measures of Integrated Information for Time-Series Data. PLoS Comput. Biol. 2011;7. doi: 10.1371/journal.pcbi.1001052.
  14. Csiszár I., Shields P. Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory. Now Publishers Inc.; Delft, The Netherlands: 2004. pp. 417–528.
  15. Studený M. Probabilistic Conditional Independence Structures. Springer; London, UK: 2005.
  16. Lauritzen S.L. Graphical Models. Clarendon Press; Oxford, UK: 1996.
  17. Sadeghi K. Marginalization and conditioning for LWF chain graphs. Ann. Stat. 2016;44:1792–1816. doi: 10.1214/16-AOS1451.
  18. Montúfar G. On the expressive power of discrete mixture models, restricted Boltzmann machines, and deep belief networks—A unified mathematical treatment. Ph.D. Thesis. Universität Leipzig; Leipzig, Germany: 2012.
  19. Cover T.M., Thomas J.A. Elements of Information Theory. John Wiley & Sons; Hoboken, NJ, USA: 2006.
  20. Csiszár I., Tusnády G. Information geometry and alternating minimization procedures. Stat. Decis. 1984; Supplemental Issue Number 1:205–237.
  21. Amari S., Kurata K., Nagaoka H. Information geometry of Boltzmann machines. IEEE Trans. Neural Netw. 1992;3:260–271. doi: 10.1109/72.125867.
  22. Dempster A.P., Laird N.M., Rubin D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. 1977;39:2–38.
  23. Amari S. Information Geometry of the EM and em Algorithms for Neural Networks. Neural Netw. 1995;9:1379–1408. doi: 10.1016/0893-6080(95)00003-8.
  24. Winkler G. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods. Springer; Berlin/Heidelberg, Germany: 2003.
  25. Choromanska A., Henaff M., Mathieu M., Arous G.B., LeCun Y. The Loss Surfaces of Multilayer Networks. PMLR. 2015;38:192–204.
  26. Langer C. Integrated-Information-Measures GitHub Repository. Available online: https://github.com/CarlottaLanger/Integrated-Information-Measures (accessed on 18 August 2020).
  27. Frydenberg M. The Chain Graph Markov Property. Scand. J. Stat. 1990;17:333–353.
  28. Ay N., Jost J., Lê H.V., Schwachhöfer L. Information Geometry. Springer International Publishing; Cham, Switzerland: 2017.
