. 2022 Aug 19;24(8):1156. doi: 10.3390/e24081156

Entropy-Based Discovery of Summary Causal Graphs in Time Series

Charles K Assaad 1,2,*, Emilie Devijver 2, Eric Gaussier 2
Editors: Philip Broadbridge, Fan Yang, Wenkai Hu
PMCID: PMC9407574  PMID: 36010820

Abstract

This study addresses the problem of learning a summary causal graph on time series with potentially different sampling rates. To do so, we first propose a new causal temporal mutual information measure for time series. We then show how this measure relates to an entropy reduction principle that can be seen as a special case of the probability raising principle. We finally combine these two ingredients in PC-like and FCI-like algorithms to construct the summary causal graph. These algorithms are evaluated on several datasets, and the results show both their efficacy and efficiency.

Keywords: causal discovery, time series, summary causal graph, mutual information

1. Introduction

Time series arise as soon as observations, from sensors or experiments, for example, are collected over time. They are present in various forms in many different domains, such as healthcare (through, e.g., monitoring systems), Industry 4.0 (through, e.g., predictive maintenance and industrial monitoring systems), surveillance systems (from images, acoustic signals, seismic waves, etc.), or energy management (through, e.g., energy consumption data), to name but a few. In this study, we are interested in analyzing quantitative discrete-time series to detect the causal relations that exist between them, under the assumption of consistency throughout time [1], which states that causal relationships remain constant in direction throughout time. Under this assumption, one can replace the infinite full-time causal graph (Figure 1a) by a window causal graph (Figure 1b), defined as follows:

Figure 1.

Figure 1

Example of a full-time causal graph (a), a window causal graph (b), and a summary causal graph (c).

Definition 1

(Window causal graph [1]). Let X be a multivariate discrete-time series and G = (V, E) the associated window causal graph for a window of size τ. The set of vertices in that graph consists of the set of components X^1, …, X^d at each time t, …, t+τ. The edges E of the graph are defined as follows: variables X_{t−i}^p and X_t^q are connected by a lag-specific directed link X_{t−i}^p → X_t^q in G pointing forward in time if and only if X^p causes X^q at time t with a time lag of 0 ≤ i ≤ τ for p ≠ q and with a time lag of 0 < i ≤ τ for p = q.

The window causal graph only covers a fixed number of time instants, which are sufficient to understand the dynamics of the system given the largest time gap between causes and effects. This said, it is difficult for an expert to provide or validate a window causal graph because it is difficult to determine which exact time instant is the cause of another. In contrast, there exists another type of causal graph where a node corresponds to a time series, as illustrated in Figure 1c, which usually can be analyzed or validated by an expert easily [2,3,4]. Such graphs are referred to as summary causal graphs [1] and are defined as follows:

Definition 2

(Summary causal graph [1]). Let X be a multivariate discrete-time series and G = (V, E) the associated summary causal graph. The set of vertices in that graph consists of the set of time series X^1, …, X^d. The edges E of the graph are defined as follows: variables X^p and X^q are connected if and only if there exist some time t and some time lag i such that X_{t−i}^p causes X_t^q at time t with a time lag of 0 ≤ i for p ≠ q and with a time lag of 0 < i for p = q.

Note that a summary causal graph can be deduced from a window causal graph, but the reverse is not true. We focus in this study on the summary causal graph, as it provides a simple and efficient view of the causal relations that exist between time series. In particular, we are interested in inferring a summary causal graph without passing through a window causal graph, to avoid unnecessary computations.

An important aspect of real-world time series is that different time series, as they measure different elements, usually have different sampling rates. Despite this, the algorithms developed so far to discover causal structures [5,6,7,8] rely on the idealized assumption that all time series have the same sampling rate with identical timestamps (assuming identical timestamps in the case of identical sampling rates seems reasonable, as one can shift time series so that they coincide in time).

We introduce in this paper a constraint-based strategy to infer a summary causal graph from discrete-time series with continuous values with equal or different sampling rates under the two classical assumptions of causal discovery (causal Markov condition, faithfulness), in addition to acyclicity in summary causal graphs. In summary, our contribution is four-fold:

  • First of all, we propose a new causal temporal mutual information measure defined on a window-based representation of time series;

  • We then show how this measure relates to an entropy reduction principle, which can be seen as a special case of the probability raising principle;

  • We also show how this measure can be used for time series with different sampling rates;

  • We finally combine these three ingredients in PC-like and FCI-like algorithms [9] to construct the summary causal graph from time series with equal or different sampling rates.

The remainder of the paper is organized as follows: Section 2 describes the related work. Section 3 introduces the (conditional) mutual information measures we propose for time series and the entropy reduction principle that our method is based on. Section 4 presents two causal discovery algorithms we developed on top of these measures. The causal discovery algorithms we propose are illustrated and evaluated on datasets, including time series with equal and different sampling rates and a real dataset in Section 5. Finally, Section 6 concludes the paper.

2. Related Work

Granger Causality is one of the oldest methods proposed to detect causal relations between time series. However, in its standard form [5], it is known to handle a restricted version of causality that focuses on linear relations and causal priorities, as it assumes that the past of a cause is necessary and sufficient for optimally forecasting its effect. This approach has nevertheless been improved since then in MVGC [10] and has recently been extended to handle nonlinearities through an attention mechanism within convolutional networks [8]. This last extension is referred to as TCDF. Score-based approaches [11] search over the space of possible graphs, trying to maximize a score that reflects how well the graph fits the data. Recently, a new score-based method called Dynotears [12] was presented to infer a window causal graph from time series. In a different line, approaches based on the noise assume that the causal system can be defined by a set of equations that explains each variable by its direct causes and an additional noise. Causal relations are in this case discovered using footprints produced by the causal asymmetry in the data. For time series, the most popular nonlinear algorithm in this family is TiMINo [6], which discovers a causal relationship by looking at independence between the noise and the potential causes. The most popular approaches for inferring causal graphs are certainly constraint-based approaches, based on the PC and FCI algorithms [9], which turn knowledge about (conditional) independencies into causal knowledge assuming the causal Markov condition and faithfulness. Several algorithms, adapted from non-temporal causal graph discovery algorithms, have been proposed in this family for time series, namely oCSE by [13], which is limited to one time lag between causal relations, PCMCI [7], and tsFCI [14]. Our work is thus more closely related to that of [7,14], which use constraint-based strategies to infer a window causal graph. 
However, we focus here on the summary causal graph, and we introduce a new causal temporal mutual information measure and entropy reduction principle on which we ground our causal discovery algorithms. Constraint-based methods have been also used jointly with other approaches (e.g., with a score-based method [15] and with a noise-based method [16]), but we consider such hybrid methods as beyond the scope of this paper.

At the core of constraint-based approaches lie (in)dependence measures, to detect relevant dependencies, which are based here on an information theoretic approach. Since their introduction [17], information theoretic measures have become very popular due to their non-parametric nature, their robustness against strictly monotonic transformations, which makes them capable of handling nonlinear distortions in the system, and their good behavior in previous studies on causal discovery [18]. However, their application to temporal data raises several problems related to the fact that time series may have different sampling rates, be shifted in time, and have strong internal dependencies. Many studies have attempted to re-formalize mutual information for time series. Reference [19] considered each value of each time series as different random variables and proceeded by whitening the data, such that time-dependent data will be transformed into independent residuals through a parametric model. However, whitening the data can have severe consequences on causal relations. Reference [20] proposed a reformulation of mutual information, called the transfer entropy, which represents the information flow from one state to another and, thus, is asymmetric. Later, Ref. [13] generalized transfer entropy to handle conditioning. However, their approach, called causation entropy, can only handle causal relations with lags equal to 1. Closely related, the directed information [21,22] allows the computation of the mutual information between two instantaneous time series conditioned on the past of a third time series with a lag of 1. One of our goals in this paper is to extend these approaches to handle any lag equal to or greater than 0. Another related approach is the time-delayed mutual information proposed in [23], which aims at addressing the problem of non-uniform sampling rates. 
The computation of the time-delayed mutual information relates single points from a single time series (shifted in time), but does not consider potentially complex relations between timestamps in different time series. In contrast, the measure we propose captures such relations through the use of window-based representations and compatible time lags, making it better suited to discovering summary causal graphs; it is also more general, as it can accommodate different sampling rates.

3. Information Measures for Causal Discovery in Time Series

We present in this section a new mutual information measure that operates on a window-based representation of time series to assess whether time series are (conditionally) dependent or not. We then show how this measure is related to an entropy reduction principle that is a special case of the probability raising principle [24].

We first assume that all time series are aligned in time, with the same sampling rate, prior to showing how our development can be applied to time series with different sampling rates. Without loss of generality, time instants are assumed to be integers. Lastly, as performed in previous studies [20], we assume that all time series are first-order Markov self-causal (any time instant is caused by its previous instant within the same time series).

3.1. Causal Temporal Mutual Information

Let us consider d univariate time series X^1, …, X^d and their observations (X_t^p)_{1≤t≤N_p, 1≤p≤d}. Throughout this section, we will make use of the following Example 1, illustrated in Figure 1, to discuss the notions we introduce.

Example 1.

Let us consider the following two time series defined, for all t, by

X_t^1 = X_{t−1}^1 + ξ_t^1,  X_t^2 = X_{t−1}^2 + X_{t−2}^1 + X_{t−1}^1 + ξ_t^2,

with (ξ_t^1, ξ_t^2) ∼ N(0, 1).
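As a concrete illustration, the generating process of Example 1 can be simulated in a few lines. This is a minimal sketch using numpy; the helper name `simulate_example1` and the zero initialization of the first two points are our own choices, not part of the paper:

```python
import numpy as np

def simulate_example1(n, seed=0):
    """Simulate Example 1: X1 is self-caused, and X2 receives X1 with
    lags 1 and 2 (the first two points are initialized to zero)."""
    rng = np.random.default_rng(seed)
    x1 = np.zeros(n)
    x2 = np.zeros(n)
    for t in range(2, n):
        x1[t] = x1[t - 1] + rng.standard_normal()
        x2[t] = x2[t - 1] + x1[t - 2] + x1[t - 1] + rng.standard_normal()
    return x1, x2
```

On such data, the dependence between the two series only appears once a lag is taken into account, as discussed next.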

One can see in Example 1 that, in order to capture the dependencies between the two time series, one needs to take into account a lag between them, as the true causal relations are not instantaneous. Several studies have recognized the importance of taking into account lags to measure (conditional) dependencies between time series; for example, in [7], a pointwise mutual information between time series with lags was used to assess whether they are dependent or not.

In addition to lags, Example 1 also reveals that a window-based representation may be necessary to fully capture the dependencies between the two time series. Indeed, as X_{t−1}^2 and X_t^2 are the effects of the same cause (X_{t−2}^1), it may be convenient to consider them together when assessing whether the time series are dependent or not. For example, defining (overlapping) windows of size two for X^2 and one for X^1, with a lag of 1 from X^1 to X^2, as in Figure 2, allows one to fully represent the causal dependencies between the two time series.

Figure 2.

Figure 2

Why do we need windows and lags? An illustration with two time series where X1 causes X2 in two steps (circles correspond to observed points and rectangles to windows). The arrows in black are discussed in the text.

Definition 3.

Let γ_max denote the maximum lag between two time series X^p and X^q, and let the maximum window size be λ_max = γ_max + 1. The window-based representation, of size 0 < λ_{pq} ≤ λ_max < N_p, of the time series X^p with respect to X^q, denoted X^{(p;λ_{pq})}, simply amounts to considering the (N_p − λ_{pq} + 1) windows (X_t^p, …, X_{t+λ_{pq}−1}^p), 1 ≤ t ≤ N_p − λ_{pq} + 1. The window-based representation, of size 0 < λ_{qp} ≤ λ_max < N_q, of the time series X^q with respect to X^p is defined in the same way. A temporal lag γ_{pq} ∈ Z compatible with λ_{pq} and λ_{qp} relates windows in X^{(p;λ_{pq})} and X^{(q;λ_{qp})} whose starting time points are separated by γ_{pq}. We denote by C(p,q) the set of window sizes and compatible temporal lags.
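The window-based representation of Definition 3 is simply a sliding-window view of the series. A minimal sketch (the helper name `window_representation` is ours; numpy assumed):

```python
import numpy as np

def window_representation(x, lam):
    """Return the (N - lam + 1) overlapping windows of size lam of a
    univariate series x, one window per row."""
    n = len(x)
    assert 0 < lam <= n
    return np.stack([x[t:t + lam] for t in range(n - lam + 1)])
```

For instance, a series of length 5 with lam = 2 yields 4 windows, matching the count N_p − λ_{pq} + 1 in the definition.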

Based on the above elements, we define the causal temporal mutual information between two time series Xp and Xq as the maximum of the standard mutual information over all possible compatible temporal lags and windows C(p,q), conditioned by the past of the two time series. Indeed, as we are interested in obtaining a summary causal graph, we do not have to consider all the potential dependencies between two time series (which would be necessary for inferring a window causal graph). Using the maximum over all possible associations is a way to summarize all temporal dependencies, which ensures that one does not miss a dependency between the two time series. Furthermore, conditioning on the past allows one to eliminate spurious dependencies in the form of auto-correlation, as in transfer entropy [20]. We follow this idea here and, as in transfer entropy, consider windows of size 1 and a temporal lag of 1 for conditioning on the past, which is in line with the first-order Markov self-causal assumption mentioned above.

Definition 4.

Consider two time series Xp and Xq. We define the causal temporal mutual information between Xp and Xq as:

CTMI(X^p; X^q) = max_{(λ_{pq}, λ_{qp}, γ_{pq}) ∈ C(p,q)} I(X_t^{(p;λ_{pq})}; X_{t+γ_{pq}}^{(q;λ_{qp})} | X_{t−1}^{(p;1)}, X_{t+γ_{pq}−1}^{(q;1)}) ≜ I(X_t^{(p;λ̄_{pq})}; X_{t+γ̄_{pq}}^{(q;λ̄_{qp})} | X_{t−1}^{(p;1)}, X_{t+γ̄_{pq}−1}^{(q;1)}), (1)

where I represents the mutual information. In case the maximum can be obtained with different values in C(p,q), we first set γ̄_{pq} to its largest possible value. We then set λ̄_{pq} to its smallest possible value and, finally, λ̄_{qp} to its smallest possible value. γ̄_{pq}, λ̄_{pq}, and λ̄_{qp} correspond, respectively, to the optimal lag and optimal window sizes.

In the context we have retained, in which dependencies are constant over time, CTMI satisfies the standard properties of mutual information, namely it is nonnegative, symmetric, and equals 0 iff the time series are independent. Thus, two time series X^p and X^q such that CTMI(X^p; X^q) > 0 are dependent. Setting γ̄_{pq} to its largest possible value allows one to get rid of instants that are not crucial in determining the mutual information between two time series. The choice for the window sizes, even though arbitrary in treating one window size before the other, is based on the same ground, as the mutual information defined above cannot decrease when one increases the size of the windows. Indeed:

I(X_t^{(p;λ_{pq})}; X_{t+γ_{pq}}^{(q;λ_{qp})} | X_{t−1}^{(p;1)}, X_{t+γ_{pq}−1}^{(q;1)})
= I((X_t^{(p;λ_{pq}−1)}, X_{t+λ_{pq}−1}^{(p;1)}); X_{t+γ_{pq}}^{(q;λ_{qp})} | X_{t−1}^{(p;1)}, X_{t+γ_{pq}−1}^{(q;1)})
= I(X_t^{(p;λ_{pq}−1)}; X_{t+γ_{pq}}^{(q;λ_{qp})} | X_{t−1}^{(p;1)}, X_{t+γ_{pq}−1}^{(q;1)}) + I(X_{t+λ_{pq}−1}^{(p;1)}; X_{t+γ_{pq}}^{(q;λ_{qp})} | X_{t−1}^{(p;1)}, X_{t+γ_{pq}−1}^{(q;1)}, X_t^{(p;λ_{pq}−1)})
≥ I(X_t^{(p;λ_{pq}−1)}; X_{t+γ_{pq}}^{(q;λ_{qp})} | X_{t−1}^{(p;1)}, X_{t+γ_{pq}−1}^{(q;1)}). (2)

The last inequality is due to the fact that (conditional) mutual information is nonnegative.
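The maximization in Definition 4, with the tie-breaking order described above (largest lag first, then smallest window for p, then smallest window for q), can be sketched as a plain grid search. Here `mi` stands in for any conditional mutual-information estimator; both it and the function name are our own assumptions:

```python
def ctmi_search(x_p, x_q, mi, gamma_max):
    """Grid search over window sizes and compatible lags (sketch of
    Definition 4). mi(x_p, x_q, lam_pq, lam_qp, gamma) is any estimate of
    the conditional mutual information for that configuration."""
    lam_max = gamma_max + 1
    best_key, best = None, None
    for gamma in range(-gamma_max, gamma_max + 1):
        for lam_pq in range(1, lam_max + 1):
            for lam_qp in range(1, lam_max + 1):
                val = mi(x_p, x_q, lam_pq, lam_qp, gamma)
                # lexicographic key encodes the paper's tie-breaking rules:
                # highest value, then largest lag, then smallest windows
                key = (val, gamma, -lam_pq, -lam_qp)
                if best_key is None or key > best_key:
                    best_key, best = key, (lam_pq, lam_qp, gamma, val)
    return best
```

With a stub estimator that is maximal for several lags, the search returns the largest such lag with the smallest windows, as required by the definition.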

Example 2.

Consider the structure described in Example 1, and assume that λmax=3. First, we have, for the standard mutual information,

I(X_t^{(1;1)}; X_t^{(2;1)} | X_{t−1}^{(1;1)}, X_{t−1}^{(2;1)}) = 0.

We also have that any γ_{12} < 0 yields zero mutual information, because conditioning on the past of X_t^{(1;1)} and X_t^{(2;1)} (namely X_{t−1}^{(1;1)} and X_{t−1}^{(2;1)}) closes all paths from X_{t−i}^{(2;1)} to X_t^{(1;1)} for all i > 0. For γ_{12} > 0, starting with γ_{12} = 1,

I(X_t^{(1;1)}; X_{t+1}^{(2;1)} | X_{t−1}^{(1;1)}, X_t^{(2;1)}) > 0.

This is similar for γ_{12} > 2. Now, any increase of λ_{21} alone, or of λ_{12} and λ_{21} together, will generate an increase in the mutual information as long as the difference between the last time point of the window of X^1 (the cause) and the last time point of the window of X^2 is less than or equal to γ_{12}, as

I(X_t^{(1;λ_{12}−1)}; X_{t+γ_{12}}^{(2;λ_{12}−1−γ_{12})} | X_{t−1}^{(1;1)}, X_{t+γ_{12}−1}^{(2;1)}) = I(X_t^{(1;λ_{12})}; X_{t+γ_{12}}^{(2;λ_{12}−1−γ_{12})} | X_{t−1}^{(1;1)}, X_{t+γ_{12}−1}^{(2;1)})

because

I(X_{t+λ_{12}−1}^{(1;1)}; X_{t+γ_{12}}^{(2;λ_{12}−1−γ_{12})} | X_{t−1}^{(1;1)}, X_{t+γ_{12}−1}^{(2;1)}, X_t^{(1;λ_{12}−1)}) = 0,

where γ_{12} ≥ 1 (1 is the minimal lag that generates a correlation that cannot be removed by conditioning on the past of X^1 and X^2). For λ_max = 3, the optimal window size λ̄_{12} is equal to 2, as X^1 has no other cause than itself; λ̄_{21} is equal to 2, as X^1 causes only (except itself) X_{t+1}^2 and X_{t+2}^2. Furthermore, γ̄_{12} = 1 and

CTMI(X^1; X^2) = I((X_t^1, X_{t+1}^1); (X_{t+1}^2, X_{t+2}^2) | X_{t−1}^1, X_t^2)
= I(X_t^1; (X_{t+1}^2, X_{t+2}^2) | X_{t−1}^1, X_t^2) + I(X_{t+1}^1; (X_{t+1}^2, X_{t+2}^2) | X_{t−1}^1, X_t^2, X_t^1)
= I(X_t^1; X_{t+1}^2 | X_{t−1}^1, X_t^2) + I(X_t^1; X_{t+2}^2 | X_{t−1}^1, X_t^2, X_{t+1}^2) + I(X_{t+1}^1; X_{t+1}^2 | X_{t−1}^1, X_t^2, X_t^1) + I(X_{t+1}^1; X_{t+2}^2 | X_{t−1}^1, X_t^2, X_t^1, X_{t+1}^2)
= 2 I(X_t^1; X_{t+1}^2 | X_{t−1}^1, X_t^2) + I(X_t^1; X_{t+2}^2 | X_{t−1}^1, X_t^2, X_{t+1}^2)
= 3 log(3)/4.

3.2. Entropy Reduction Principle

Interestingly, CTMI can be related to a version of the probability raising principle [24], which states that a cause, here a time series, raises the probability of any of its effects, here another time series, even when the past of the two time series is taken into account, meaning that the relation between the two time series is not negligible compared to the internal dependencies of the time series. In this context, the following definition generalizes to window-based representations of time series the standard definition of prima facie causes for discrete variables.

Definition 5

(Prima facie cause for window-based time series). Let X^p and X^q be two time series with window sizes λ_{pq} and λ_{qp}, and let P_{t,t′} = (X_{t−1}^{(p;1)}, X_{t′−1}^{(q;1)}) represent the past of X^p and X^q for any two instants (t, t′). We say that X^p is a prima facie cause of X^q with delay γ_{pq} > 0 iff there exist Borel sets B_p, B_q, and B_P such that: P(X_{t+γ_{pq}}^{(q;λ_{qp})} ∈ B_q | X_t^{(p;λ_{pq})} ∈ B_p, P_{t,t+γ_{pq}} ∈ B_P) > P(X_{t+γ_{pq}}^{(q;λ_{qp})} ∈ B_q | P_{t,t+γ_{pq}} ∈ B_P).

We now introduce a slightly different principle based on the causal temporal mutual information, which we refer to as the entropy reduction principle.

Definition 6

(Entropic prima facie cause). Using the same notations as in Definition 5, we say that X^p is an entropic prima facie cause of X^q with delay γ_{pq} > 0 iff I(X_t^{(p;λ_{pq})}; X_{t+γ_{pq}}^{(q;λ_{qp})} | P_{t,t+γ_{pq}}) > 0.

Note that considering that the above mutual information is positive is equivalent to considering that the entropy of Xq when conditioned on the past reduces when one further conditions on Xp. One has the following relation between the entropy reduction and the probability raising principles.

Property 1.

With the same notations, if X^p is an entropic prima facie cause of X^q with delay γ_{pq} > 0, then X^p is a prima facie cause of X^q with delay γ_{pq} > 0. Furthermore, if CTMI(X^p; X^q) > 0 with γ̄_{pq} > 0, then X^p is an entropic prima facie cause of X^q with delay γ̄_{pq}.

Proof. 

Let us assume that X^p is not a prima facie cause of X^q for the delay γ_{pq}. Then, for all Borel sets B_p, B_q, and B_P, one has P(X_{t+γ_{pq}}^{(q;λ_{qp})} ∈ B_q | X_t^{(p;λ_{pq})} ∈ B_p, P_{t,t+γ_{pq}} ∈ B_P) ≤ P(X_{t+γ_{pq}}^{(q;λ_{qp})} ∈ B_q | P_{t,t+γ_{pq}} ∈ B_P). This translates, in terms of density functions denoted f, as:

∀ (x_t^p, x_{t+γ_{pq}}^q, p_{t,t+γ_{pq}}),  f(x_{t+γ_{pq}}^q | x_t^p, p_{t,t+γ_{pq}}) ≤ f(x_{t+γ_{pq}}^q | p_{t,t+γ_{pq}}),

which implies that H(X_{t+γ_{pq}}^{(q;λ_{qp})} | X_t^{(p;λ_{pq})}, P_{t,t+γ_{pq}}) is greater than or equal to H(X_{t+γ_{pq}}^{(q;λ_{qp})} | P_{t,t+γ_{pq}}), so that X^p is not an entropic prima facie cause of X^q with delay γ_{pq}. By contraposition, we conclude the proof of the first statement. The second statement derives directly from the definition of CTMI.  □

It is also interesting to note that, given two time series Xp and Xq such that Xp causes Xq with γpq=0, CTMI does not necessarily increase symmetrically with respect to the increase of λpq and λqp. For an illustration on a simple model, see Figure 3. In Figure 3a, the mutual information conditioned on the past between Xp and Xq with γpq=0 is positive since Xp causes Xq instantaneously. In Figure 3b the same mutual information does not increase when increasing only the window size of the cause. However, the mutual information increases when increasing only the window size of the effect, as in Figure 3c, or when increasing simultaneously the window sizes of the effect and the cause, as in Figure 3d.

Figure 3.

Figure 3

Illustration of the asymmetric increase of CTMI with the increase of the window sizes. The mutual information conditioned on the past in (a) is I(X_t^p; X_t^q | X_{t−1}^p, X_{t−1}^q) > 0. It increases when increasing only the window size of the effect, as in (c), i.e., I(X_t^p; X_t^{(q;2)} | X_{t−1}^p, X_{t−1}^q) > I(X_t^p; X_t^q | X_{t−1}^p, X_{t−1}^q), or when increasing simultaneously the window sizes of the effect and the cause, as in (d), i.e., I(X_t^{(p;2)}; X_t^{(q;2)} | X_{t−1}^p, X_{t−1}^q) > I(X_t^p; X_t^{(q;2)} | X_{t−1}^p, X_{t−1}^q). However, it does not increase when increasing only the window size of the cause, as in (b), i.e., I(X_t^{(p;2)}; X_t^q | X_{t−1}^p, X_{t−1}^q) = I(X_t^p; X_t^q | X_{t−1}^p, X_{t−1}^q). Dashed lines are for correlations that are not causations, and bold arrows correspond to causal relations between the window representations of time series.

3.3. Conditional Causal Temporal Mutual Information

We now extend the causal temporal mutual information by conditioning on a set of variables. In a causal discovery setting, conditioning is used to assess whether two dependent time series can be made independent by conditioning on connected time series, i.e., time series that are dependent with at least one of the two time series under consideration. Figure 4 illustrates the case where the dependence between X^1 and X^2 is due to spurious correlations originating from common causes. Conditioning on these common causes should lead to the conditional independence of the two time series. Of course, the conditioning variables cannot temporally succeed both time series under consideration. This leads us to the following definition of the conditional causal temporal mutual information.

Figure 4.

Figure 4

Example of conditional independence between dependent time series. In (a), the conditioning set contains one time series, X^3, in addition to the past of X^1 and X^2. In (b), the conditioning set contains two time series, X^3 and X^4, in addition to the past of X^1 and X^2. Dashed lines are for correlations that are not causations, and bold arrows correspond to conditioning variables.

Definition 7.

The conditional causal temporal mutual information between two time series X^p and X^q such that γ̄_{pq} ≥ 0, conditioned on a set X_R = {X^{r_1}, …, X^{r_K}}, is given by:

CTMI(X^p; X^q | X_R) = I(X_t^{(p;λ̄_{pq})}; X_{t+γ̄_{pq}}^{(q;λ̄_{qp})} | (X_{t−Γ̄_k}^{(r_k;λ̄_k)})_{1≤k≤K}, X_{t−1}^{(p;1)}, X_{t+γ̄_{pq}−1}^{(q;1)}). (3)

In case the minimum can be obtained with different values, we first set Γ̄_k to its largest possible value. We then set λ̄_k to its smallest possible value. (Γ̄_1, …, Γ̄_K) and (λ̄_1, …, λ̄_K) correspond to the optimal conditional lags and window sizes, which minimize, for Γ_1, …, Γ_K ≤ γ̄_{pq}:

I(X_t^{(p;λ̄_{pq})}; X_{t+γ̄_{pq}}^{(q;λ̄_{qp})} | (X_{t−Γ_k}^{(r_k;λ_k)})_{1≤k≤K}, X_{t−1}^{(p;1)}, X_{t+γ̄_{pq}−1}^{(q;1)}).

By considering the minimum over compatible lags and window sizes, one guarantees that, if there exist conditioning variables that make the two time series independent, they will be found. Note that the case in which γ̄_{qp} < 0 corresponds to CTMI(X^p; X^q | X_R), where γ̄_{pq} > 0.

Figure 4 illustrates the above on two different examples. On the left, X_{t−1}^1 is correlated to X_t^2, as X_{t−2}^3 is a common cause with a lag of 1 for X^1 and a lag of 2 for X^2. Conditioning on X_{t−2}^3 removes the dependency between X^1 and X^2. Note that all time series have here a window of size 1. On the right, X^3 and X^4 are common causes of X^1 and X^2: X^3 causes X^1 and X^2 with a temporal lag of 1, which renders X^1 and X^2 correlated at the same time point, while X^4 causes X^1 and X^2 with temporal lags of 1 and 2, respectively, which renders X^1 and X^2 correlated at lagged time points. The overall correlation between X^1 and X^2 is captured by considering a window size of 2 in X^2. All other time series have a window size of 1. By conditioning on both X^3 and X^4, X^1 and X^2 become independent, assuming we also condition on the past of X^1 and X^2 to remove the autocorrelation.

3.4. Estimation and Testing

In practice, the success of the CTMI approach (and, in fact, of any entropy-based approach) depends crucially on the reliable estimation of the relevant entropies from data. This raises two practical challenges. The first is that entropies must be estimated from finite time-series data. The second is that, to detect independence, we need a statistical test to check whether the causal temporal mutual information is equal to zero.

Here, we rely on the k-nearest neighbor method [25,26] for the estimation of CTMI. The distance between two windows considered here is the supremum distance, i.e., the maximum of the absolute difference between any two corresponding values in the two windows:

d((X_t^{(p;λ̄_{pq})}, X_{t+γ̄_{pq}}^{(q;λ̄_{qp})})_i, (X_t^{(p;λ̄_{pq})}, X_{t+γ̄_{pq}}^{(q;λ̄_{qp})})_j) = max_{0<ℓ_p≤λ̄_{pq}, 0<ℓ_q≤λ̄_{qp}} max(|(X_t^{(p;λ̄_{pq})})_{i,ℓ_p} − (X_t^{(p;λ̄_{pq})})_{j,ℓ_p}|, |(X_{t+γ̄_{pq}}^{(q;λ̄_{qp})})_{i,ℓ_q} − (X_{t+γ̄_{pq}}^{(q;λ̄_{qp})})_{j,ℓ_q}|).

In the case of the causal temporal mutual information, we denote by ε_i^k/2 the distance from

(X_t^{(p;λ̄_{pq})}, X_{t+γ̄_{pq}}^{(q;λ̄_{qp})}, X_{t−1}^p, X_{t+γ̄_{pq}−1}^q)

to its k-th neighbor, with n_i^{1,3}, n_i^{2,3}, and n_i^3 being the numbers of points with a distance strictly smaller than ε_i^k/2 in the subspaces

(X_t^{(p;λ̄_{pq})}, X_{t−1}^p, X_{t+γ̄_{pq}−1}^q), (X_{t+γ̄_{pq}}^{(q;λ̄_{qp})}, X_{t−1}^p, X_{t+γ̄_{pq}−1}^q), and (X_{t−1}^p, X_{t+γ̄_{pq}−1}^q),

respectively, and n_{γ_{pq},γ_{qp}} the number of observations. The estimate of the causal temporal mutual information is then given by:

CTMI^(X^p; X^q) = ψ(k) + (1/n_{γ_{pq},γ_{qp}}) Σ_{i=1}^{n_{γ_{pq},γ_{qp}}} [ψ(n_i^3) − ψ(n_i^{1,3}) − ψ(n_i^{2,3})],

where ψ denotes the digamma function.

Similarly, for the estimation of the conditional causal temporal mutual information, we denote by ε_i^k/2 the distance from

(X_t^{(p;λ̄_{pq})}, X_{t+γ̄_{pq}}^{(q;λ̄_{qp})}, X_{t−1}^p, X_{t+γ̄_{pq}−1}^q, (X_{t−Γ̄_k}^{(r_k;λ̄_k)})_{1≤k≤K})

to its k-th neighbor, with n_i^{1,3}, n_i^{2,3}, and n_i^3 being the numbers of points with a distance strictly smaller than ε_i^k/2 in the subspaces

(X_t^{(p;λ̄_{pq})}, X_{t−1}^p, X_{t+γ̄_{pq}−1}^q, (X_{t−Γ̄_k}^{(r_k;λ̄_k)})_{1≤k≤K}), (X_{t+γ̄_{pq}}^{(q;λ̄_{qp})}, X_{t−1}^p, X_{t+γ̄_{pq}−1}^q, (X_{t−Γ̄_k}^{(r_k;λ̄_k)})_{1≤k≤K}), and (X_{t−1}^p, X_{t+γ̄_{pq}−1}^q, (X_{t−Γ̄_k}^{(r_k;λ̄_k)})_{1≤k≤K}),

respectively, and n_{γ_{rp},γ_{rq}} the number of observations. The estimate of the conditional causal temporal mutual information is then given by:

CTMI^(X^p; X^q | X_R) = ψ(k) + (1/n_{γ_{rp},γ_{rq}}) Σ_{i=1}^{n_{γ_{rp},γ_{rq}}} [ψ(n_i^3) − ψ(n_i^{1,3}) − ψ(n_i^{2,3})],

where ψ denotes the digamma function.
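A compact sketch of the k-nearest-neighbor estimate above, written for three univariate variables (the paper applies it to window vectors). The function name `cmi_knn` and the zero-count guard are our own choices; neighbors are counted under the supremum distance, as in the text:

```python
import numpy as np
from scipy.special import digamma

def cmi_knn(x, y, z, k=5):
    """Estimate I(X;Y|Z) as psi(k) + mean(psi(n_z) - psi(n_xz) - psi(n_yz)),
    with neighbor counts taken under the supremum (Chebyshev) distance."""
    n = len(x)
    joint = np.column_stack([x, y, z])
    d = np.max(np.abs(joint[:, None, :] - joint[None, :, :]), axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point itself
    eps = np.sort(d, axis=1)[:, k - 1]     # distance to the k-th neighbor

    def count(sub):
        ds = np.max(np.abs(sub[:, None, :] - sub[None, :, :]), axis=-1)
        np.fill_diagonal(ds, np.inf)
        # strictly closer than eps_i; guard against psi(0)
        return np.maximum((ds < eps[:, None]).sum(axis=1), 1)

    n_xz = count(np.column_stack([x, z]))
    n_yz = count(np.column_stack([y, z]))
    n_z = count(np.column_stack([z]))
    return digamma(k) + np.mean(digamma(n_z) - digamma(n_xz) - digamma(n_yz))
```

On a sample where Y = X + noise and Z is independent of both, the estimate is clearly positive, while it stays close to zero for two independent series.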

To detect independencies through CTMI, we rely on the following permutation test:

Definition 8

(Permutation test of CTMI). Given Xp, Xq, and XR, the p-value associated with the permutation test of CTMI is given by:

p = (1/B) Σ_{b=1}^{B} 1[CTMI^((X^p)_b; X^q | X_R) ≥ CTMI^(X^p; X^q | X_R)], (4)

where (Xp)b is a permuted version of Xp and follows the local permutation scheme presented in [27].

The advantage of the scheme presented in [27] is that it preserves marginals by drawing as much as possible without replacement and it performs local permutation, which ensures that by permuting Xp, the dependence between Xp and Xr is not destroyed.

Note that Definition 8 is applicable to the causal temporal mutual information (when R is empty) and to the conditional causal temporal mutual information.
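The test of Definition 8 can be sketched generically. Here `stat` stands in for the CTMI estimate, and we use a plain global permutation for simplicity, whereas the paper relies on the local permutation scheme of [27]:

```python
import numpy as np

def permutation_pvalue(stat, x, y, n_perm=100, seed=0):
    """Fraction of permuted statistics at least as large as the observed
    one (Eq. (4)); small p-values indicate dependence."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    hits = sum(stat(rng.permutation(x), y) >= observed for _ in range(n_perm))
    return hits / n_perm
```

For instance, with an absolute-correlation statistic and strongly dependent series, the permuted statistics rarely reach the observed value, so the p-value is small.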

3.5. Extension to Time Series with Different Sampling Rates

The above development readily applies to time series with different sampling rates, as one can define window-based representations of the two time series, as well as a sequence of joint observations.

Indeed, as one can note, Definition 3 does not rely on the fact that the time series have the same sampling rates. Figure 5 displays two time series X^p and X^q with different sampling rates where, while λ_{pq} = 2 and λ_{qp} = 3, the time spanned by each window is the same. The joint sequence of observations, relating pairs of windows from X^p and X^q in the form S = {(X_{1_p}^{(p;λ_{pq})}, X_{1_q}^{(q;λ_{qp})}), …, (X_{n_p}^{(p;λ_{pq})}, X_{n_q}^{(q;λ_{qp})})}, should, however, be such that, for all indices i of the sequence, one has s(X_{i_q}^{(q;λ_{qp})}) = s(X_{i_p}^{(p;λ_{pq})}) + γ_{pq}, where s(X) represents the starting time of the window X and γ_{pq} is constant over time. This is not the case for the first example, but is true for the second one, which is a relevant sequence of observations.

Figure 5.

Figure 5

Illustration for constructing sequences of windows for two time series with different sampling rates. In (a) the construction represents incompatible lags and in (b) the construction represents compatible lags.

If the two time series are sufficiently long, there always exists a correct sequence of joint observations. Indeed, if the window sizes λ_{pq} and λ_{qp} are known, let γ_{pq} = s(X_1^{(q;λ_{qp})}) − s(X_1^{(p;λ_{pq})}). Furthermore, let N_p and N_q denote the numbers of observations per time unit (the time unit corresponds to the largest (integer) time interval according to the sampling rates of the different time series; for example, if one time series has a sampling rate of 10 per second and another a sampling rate of 3 per 10 min, then the time unit is equal to 10 min). Then, λ_{pq}, λ_{qp}, and γ_{pq} are compatible through the set of joint observations S with s(X_{i_p}^{(p;λ_{pq})}) = s(X_1^{(p;λ_{pq})}) + (i_p − 1)·LCM(N_p, N_q) and s(X_{i_q}^{(q;λ_{qp})}) = s(X_1^{(q;λ_{qp})}) + (i_q − 1)·LCM(N_p, N_q), where LCM denotes the least common multiple.

Note that this methodology for handling different sampling rates is not unique to our proposal and can be used as soon as lags and windows are defined.
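Under these conventions, the start times of a compatible joint sequence can be computed directly. A sketch (the helper name `joint_observation_starts` is ours; `n_p` and `n_q` are the numbers of observations per time unit):

```python
import math

def joint_observation_starts(s1_p, s1_q, n_p, n_q, n_obs):
    """Start times of n_obs joint windows: successive windows are spaced
    by LCM(n_p, n_q), so the lag s1_q - s1_p stays constant over time."""
    step = math.lcm(n_p, n_q)
    return [(s1_p + i * step, s1_q + i * step) for i in range(n_obs)]
```

For example, with sampling rates of 2 and 3 observations per time unit, successive joint windows are spaced by LCM(2, 3) = 6.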

4. Causal Discovery Based on Causal Temporal Mutual Information

We present in this section two new methods for causal discovery in time series based on the causal temporal mutual information introduced above to construct the skeleton of the causal graph. The first method assumes causal sufficiency (there exists no hidden common cause of two observable time series), while the second method is an extension of the first for the case where causal sufficiency is not satisfied. In both methods, the skeleton is oriented on the basis of the entropy reduction principle in addition to classical constraint-based orientation rules (the rules used in the PC algorithm or the rules used in the FCI algorithm). Our methods assume both the causal Markov condition and faithfulness of the data distribution, which are classical assumptions for causal discovery within constraint-based methods in addition to acyclicity in the summary causal graph.

4.1. Without Hidden Common Causes

We present here our main method, which assumes causal sufficiency. We start by describing how we infer the skeleton of the summary graph, then we show how we orient edges using classical PC-rules in addition to rules based on the entropy reduction principle.

4.1.1. Skeleton Construction

We follow the same steps as the ones of the PC algorithm [9], which assumes that all variables are observed. It aims at building causal graphs by orienting a skeleton obtained, from a complete graph, by removing edges connecting independent variables. The summary causal graphs considered are directed acyclic graphs (DAGs) in which self-loops are allowed to represent temporal dependencies within a time series.

Starting with a complete graph that relates all time series, the first step consists of computing CTMI for all pairs of time series and removing edges if the two time series are considered independent. Once this is done, one checks, for the remaining edges, whether the two time series are conditionally independent (the edge is removed) or not (the edge is kept). Starting from a single time series connected to Xp or Xq, the set of conditioning time series is gradually increased until either the edge between Xp and Xq is removed or all time series connected to Xp and Xq have been considered. We will denote by Sepset(p,q) the separation set of Xp and Xq, which corresponds to the smallest set of time series connected to Xp and Xq such that Xp and Xq are conditionally independent given this set. Note that we make use of the same strategy as the one used in PC-stable [28], which consists of sorting time series according to their CTMI scores and, when an independence is detected, removing all other occurrences of the time series. This leads to an order-independent procedure. The following theorem states that the skeleton obtained by the above procedure is the true one.
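The skeleton phase described above can be sketched as follows (a rough sketch: `indep_test` is a placeholder standing in for the CTMI significance test, and the names and data structures are ours, not the paper's implementation):

```python
from itertools import combinations

def skeleton(series_names, indep_test):
    """PC-stable-like skeleton search over whole time series (sketch).

    `indep_test(p, q, cond)` returns True when X^p and X^q are judged
    (conditionally) independent given the set `cond` of series names.
    Returns the skeleton edges and the separation sets Sepset(p, q).
    """
    adj = {p: set(series_names) - {p} for p in series_names}
    sepset = {}
    n = 0
    while any(len(adj[p]) >= n + 1 for p in series_names):
        # collect all level-n removals first, so removals within one
        # level do not influence later tests (order independence)
        removals = []
        for p, q in combinations(series_names, 2):
            if q not in adj[p]:
                continue
            for cond in combinations(sorted(adj[p] - {q}), n):
                if indep_test(p, q, set(cond)):
                    removals.append((p, q, set(cond)))
                    break
        for p, q, cond in removals:
            adj[p].discard(q)
            adj[q].discard(p)
            sepset[(p, q)] = sepset[(q, p)] = cond
        n += 1
    edges = {frozenset((p, q)) for p in adj for q in adj[p]}
    return edges, sepset
```

On a fork X1 → X2, X1 → X3, an oracle test that declares X2 and X3 independent given {X1} leaves exactly the two true edges and records Sepset(2, 3) = {X1}.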

Theorem 1.

Let G = (V, E) be a summary causal graph, and assume that we are given perfect conditional independence information about all pairs of variables (Xp, Xq) in V given subsets S ⊆ V ∖ {Xp, Xq}. Then, the skeleton previously constructed is the skeleton of G.

Proof. 

Let us consider two time series Xp and Xq. If they are independent given XR, then CTMI(Xp; Xq | XR) = 0, as otherwise the conditional mutual information between Xp and Xq would be non-null and the two time series would not be conditionally independent, since we are given perfect information. By the causal Markov and faithfulness conditions, there is no edge in this case between Xp and Xq in the corresponding skeleton, as in the true one. Conversely, if CTMI(Xp; Xq | XR) = 0 for any XR, then the two time series cannot be dependent conditioned on XR. Indeed, if this were the case, as we are given perfect conditional information, there would exist a lag γ and two window sizes λpq and λqp such that I(X_t^{(p;λpq)}; X_{t+γ}^{(q;λqp)} | XR) > 0 with 0 < λpq, λqp, γ ≤ γmax. In this case, the two windows of size λmax centered on time point t in both Xp and Xq contain the windows of sizes λpq and λqp separated by a lag γ in Xp and Xq, as λmax = 2γmax + 1. Thus, CTMI(Xp; Xq | XR) would be positive, as this quantity cannot be less than I(X_t^{(p;λpq)}; X_{t+γ}^{(q;λqp)} | XR), which leads to a contradiction. Finally, as we tested all necessary conditioning sets in the construction of the skeleton, we have the guarantee of removing all unnecessary edges.  □

4.1.2. Orientation

Once the skeleton has been constructed, one tries to orient as many edges as possible using the standard PC-rules [9], which yields a completed partially directed acyclic graph (CPDAG).

PC-rule 0

(Origin of causality). In an unshielded triple Xp − Xr − Xq, if Xr ∉ Sepset(p, q), then Xr is an unshielded collider: Xp → Xr ← Xq.

PC-rule 1

(Propagation of causality). In an unshielded triple Xp → Xr − Xq, if Xr ∈ Sepset(p, q), then orient the unshielded triple as Xp → Xr → Xq.

PC-rule 2.

If there exists a directed path from Xp to Xq and an edge Xp − Xq, then orient it as Xp → Xq.

PC-rule 3.

If there exists an unshielded triple Xp → Xr ← Xq and an unshielded triple Xp − Xs − Xq with an edge Xs − Xr, then orient Xs → Xr.

As we are using here the standard PC-rules, we have the following theorem, the proof of which directly derives from the results on PC [9].

Theorem 2

(Theorem 5.1 of [9]). Let the distribution of V be faithful to a DAG G = (V, E), and assume that we are given perfect conditional independence information about all pairs of variables (Xp, Xq) in V given subsets XR ⊆ V ∖ {Xp, Xq}. Then, the skeleton constructed previously followed by PC-rules 0, 1, 2, and 3 represents the CPDAG of G.

In addition to the PC orientation rules, we introduce two new rules, which are based on the notion of possible spurious correlations and the mutual information we have introduced. The notion of possible spurious correlations captures the fact that two variables may be correlated through relations that do not only correspond to direct causal relations between them. It is formalized as follows:

Definition 9

(Possible spurious correlations). We say that two connected nodes Xp − Xq have possible spurious correlations if there exists a path between them that neither contains the edge Xp − Xq nor any collider.

Interestingly, when two connected variables do not have possible spurious correlations, one can conclude about their orientation using CTMI.
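Definition 9 can be checked mechanically on a partially oriented graph. The sketch below is ours (the representation of the graph and the function names are assumptions, not the paper's implementation): it searches for a simple path between the two nodes that avoids the direct edge and contains no inner node with both neighboring edges pointing into it.

```python
def has_possible_spurious_correlation(p, q, undirected, directed):
    """Check Definition 9 on a partially oriented graph (sketch).

    `undirected` is a set of frozensets {a, b}; `directed` is a set of
    (a, b) pairs meaning a -> b.  We look for a simple path between p
    and q that avoids the direct edge p - q and contains no collider
    (an inner node r whose two neighboring edges both point into r).
    """
    def neighbors(a):
        for e in undirected:
            if a in e:
                yield next(iter(e - {a}))
        for x, y in directed:
            if x == a:
                yield y
            elif y == a:
                yield x

    def into(a, b):  # True iff the edge a *- b has an arrowhead at b
        return (a, b) in directed

    def dfs(path):
        last = path[-1]
        for nxt in neighbors(last):
            if nxt in path:
                continue
            if len(path) == 1 and nxt == q:      # skip the direct edge p - q
                continue
            if len(path) >= 2:                   # collider check on `last`
                if into(path[-2], last) and into(nxt, last):
                    continue
            if nxt == q:
                return True
            if dfs(path + [nxt]):
                return True
        return False

    return dfs([p])
```

For instance, with X1 → X2 and X1 → X3, the edge X2 − X3 has a possible spurious correlation through the non-collider X1; with X2 → X1 ← X3, it does not, since X1 is a collider on the only alternative path.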

Property 2.

Let us assume that we are given perfect conditional independence information about all pairs of variables (Xp, Xq) in V given subsets S ⊆ V ∖ {Xp, Xq}. Then, every non-oriented edge in the CPDAG obtained by the above procedure corresponds to a prima facie cause and, by causal sufficiency, to a true causal relation between the related time series. Furthermore, the orientation of an unoriented edge between two nodes Xp and Xq that do not have possible spurious correlations is given by the “direction” of the optimal lag in CTMI(Xp, Xq), assuming that the maximal window size is larger than the longest lag γmax between causes and effects.

Proof. 

The first part of Property 2 directly derives from Property 1. As we assume that we are given perfect conditional information, the skeleton is the true one from Theorem 1. Thus, if two variables do not have possible spurious correlations, the only correlations observed between them correspond to a causal relation. We now need to prove that the optimal lag can be used to orient edges between any pair of variables Xp and Xq.

Without loss of generality, let us assume that Xp causes X_t^q, for any time t, via the K time instants {t−γ, t−γ_1, …, t−γ_{K-1}} with 0 < γ_{K-1} < ⋯ < γ_1 < γ. First, let us consider a window of size 1 in Xq and a window of arbitrary size λ in Xp with a lag γpq set to γ ≥ 0. As γ ≥ 0, there is no cause of X_t^q in the window X_{t+γ}^{(p;λ)}. Furthermore, the only observed correlations between Xp and Xq correspond to causal relations. We thus have:

I(X_t^{(q;1)}; X_{t+γ}^{(p;λ)} | X_{t-1}^{(q;1)}, X_{t+γ-1}^{(p;1)}) = 0,

as X_t^q and all the variables in X_{t+γ}^{(p;λ)} are independent of each other. On the contrary, for the same window size in Xp and a lag γpq set to −γ with γ > 0, one has:

I(X_t^{(q;1)}; X_{t-γ}^{(p;γ)} | X_{t-1}^{(q;1)}, X_{t-γ-1}^{(p;1)}) ≥ I(X_t^{(q;1)}; X_{t-γ}^{(p;1)} | X_{t-1}^{(q;1)}, X_{t-γ-1}^{(p;1)}) > 0.

The first inequality derives from Inequality 2. The second inequality is due to the fact that Xtγp is a true cause of Xtq and the fact that we are given perfect information. Thus, when considering a window of size 1 for Xq, the optimal lag given by CTMI will necessarily go from Xp to Xq, which corresponds to the correct orientation.

We now consider the case where we have a window of arbitrary size λ in Xq. Let us further consider a window of arbitrary size λ in Xp with a lag γpq set to γ ≥ 0. If λ < γ + γ_{K-1}, there are no causal relations between the elements in X_t^{(q;λ)} and the elements in X_{t+γ}^{(p;λ)}, and the mutual information between these two windows is 0. Otherwise, one can decompose this mutual information as:

I(X_t^{(q;λ)}; X_{t+γ}^{(p;λ)} | X_{t-1}^{(q;1)}, X_{t+γ-1}^{(p;1)}) = I(X_t^{(q;γ+γ_{K-1})}; X_{t+γ}^{(p;λ)} | X_{t-1}^{(q;1)}, X_{t+γ-1}^{(p;1)}) + I(X_{t+γ+γ_{K-1}}^{(q;λ-γ-γ_{K-1})}; X_{t+γ}^{(p;λ)} | X_{t+γ+γ_{K-1}-1}^{(q;1)}, X_{t+γ-1}^{(p;1)}),

as the conditioning on X_t^{(q;γ+γ_{K-1})} and X_{t-1}^{(q;1)} amounts to conditioning on the instant X_{t+γ+γ_{K-1}-1}^{(q;1)} due to the first-order Markov self-causal assumption.

As there are no causal relations between the elements in X_t^{(q;γ+γ_{K-1})} and the elements in X_{t+γ}^{(p;λ)}, the first term on the right-hand side is 0. Using a similar decomposition in order to exclude the elements at the end of X_{t+γ}^{(p;λ)}, which do not cause any element in X_t^{(q;λ)}, one obtains:

I(X_t^{(q;λ)}; X_{t+γ}^{(p;λ)} | X_{t-1}^{(q;1)}, X_{t+γ-1}^{(p;1)}) = I(X_{t+γ+γ_{K-1}}^{(q;λ-γ-γ_{K-1})}; X_{t+γ}^{(p;min(λ, λ-γ_{K-1}-γ))} | X_{t+γ+γ_{K-1}-1}^{(q;1)}, X_{t+γ-1}^{(p;1)}).

Let us now consider the window in Xp of size λ with a lag γpq set to −γ_{K-1}. Using the same reasoning as before, one obtains:

I(X_t^{(q;λ)}; X_{t-γ_{K-1}}^{(p;λ)} | X_{t-1}^{(q;1)}, X_{t-γ_{K-1}-1}^{(p;1)}) = I(X_t^{(q;γ+γ_{K-1})}; X_{t-γ_{K-1}}^{(p;λ)} | X_{t-1}^{(q;1)}, X_{t-γ_{K-1}-1}^{(p;1)}) + I(X_{t+γ+γ_{K-1}}^{(q;λ-γ-γ_{K-1})}; X_{t-γ_{K-1}}^{(p;λ)} | X_{t+γ+γ_{K-1}-1}^{(q;1)}, X_{t-γ_{K-1}-1}^{(p;1)}), (5)

with:

I(X_{t+γ+γ_{K-1}}^{(q;λ-γ-γ_{K-1})}; X_{t-γ_{K-1}}^{(p;λ)} | X_{t+γ+γ_{K-1}-1}^{(q;1)}, X_{t-γ_{K-1}-1}^{(p;1)}) ≥ I(X_{t+γ+γ_{K-1}}^{(q;λ-γ-γ_{K-1})}; X_{t+γ}^{(p;min(λ, λ-γ_{K-1}-γ))} | X_{t+γ+γ_{K-1}-1}^{(q;1)}, X_{t+γ-1}^{(p;1)}),

as the window X_{t-γ_{K-1}}^{(p;λ)} contains the window X_{t+γ}^{(p;min(λ, λ-γ_{K-1}-γ))}. In addition, the first term on the right-hand side of Equation (5) is strictly positive, as all the elements in X_t^{(q;γ+γ_{K-1})} have causes in X_{t-γ_{K-1}}^{(p;λ)}. Thus, the mutual information obtained with the negative lag −γ_{K-1} is larger than the one obtained with any positive lag:

I(X_t^{(q;λ)}; X_{t-γ_{K-1}}^{(p;λ)} | X_{t-1}^{(q;1)}, X_{t-γ_{K-1}-1}^{(p;1)}) > I(X_t^{(q;λ)}; X_{t+γ}^{(p;λ)} | X_{t-1}^{(q;1)}, X_{t+γ-1}^{(p;1)});

meaning that the optimal lag given by CTMI will necessarily go from Xp to Xq, which corresponds to the correct orientation.  □

The following orientation rule is a direct application of the above property.

ER-rule 0

(Entropy reduction—γ). In a pair Xp − Xq such that Xp and Xq do not have a possible spurious correlation, if γ¯pq > 0, then orient the edge as Xp → Xq.

Furthermore, we make use of the following rule to orient additional edges when the optimal lag γ¯pq is null based on the fact that CTMI increases asymmetrically with respect to the increase of λpq and λqp (Figure 3). This rule infers the direction of the cause by checking the difference in the window sizes as the window size of the cause cannot be greater than the window size of the effect.

ER-rule 1

(Entropy reduction—λ). In a pair Xp − Xq such that Xp and Xq do not have a possible spurious correlation, if γ¯pq = 0 and λ¯pq < λ¯qp, then orient the edge as Xp → Xq.

In practice, we also apply ER-rule 0 before the PC-rules because, experimentally, we found ER-rule 0 to be more reliable than PC-rule 0 in detecting lagged unshielded colliders, especially for low sample sizes.
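The two ER-rules can be sketched as a single decision function over one unoriented edge (a minimal sketch under our own conventions: a positive γ¯pq is taken to mean that Xp leads Xq, and the caller is assumed to have already ruled out possible spurious correlations):

```python
def apply_er_rules(edge, gamma_pq, lam_pq, lam_qp):
    """ER-rules 0 and 1 on one unoriented edge (sketch).

    gamma_pq : optimal lag found by CTMI from X^p toward X^q
               (positive taken to mean X^p leads X^q).
    lam_pq, lam_qp : optimal window sizes for X^p and X^q.
    Returns the oriented (cause, effect) pair, or None when neither
    rule applies.
    """
    p, q = edge
    if gamma_pq > 0:                       # ER-rule 0: lag direction
        return (p, q)
    if gamma_pq < 0:                       # symmetric case of ER-rule 0
        return (q, p)
    if lam_pq < lam_qp:                    # ER-rule 1: window asymmetry
        return (p, q)                      # (cause window <= effect window)
    if lam_qp < lam_pq:
        return (q, p)
    return None                            # no entropy-reduction evidence
```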

We call our method PCTMI; the pseudo-code is available in Algorithm 1. Adj(Xq,G) represents all adjacent nodes to Xq in the graph G, and sepset(p,q) is the separation set of Xp and Xq. The output of the algorithm is a CPDAG version of the summary graph such that all lagged relations are oriented, but instantaneous relations are partially oriented.

Algorithm 1 PCTMI.
Require: X a d-dimensional time series of length T, γmax ∈ ℕ the maximum number of lags, α a significance threshold
Form a complete undirected graph G = (V, E) with d nodes
n = 0
while there exists Xq ∈ V such that card(Adj(Xq, G)) ≥ n + 1 do
   D = list()
   for Xq ∈ V s.t. card(Adj(Xq, G)) ≥ n + 1 do
      for Xp ∈ Adj(Xq, G) do
         for all subsets XR ⊆ Adj(Xq, G) ∖ {Xp} such that card(XR) = n and (γrp ≠ 0 or γrq ≠ 0) for all r ∈ R do
            yq,p,R = CTMI(Xp; Xq | XR)
            append(D, {Xq, Xp, XR, yq,p,R})
   Sort D by increasing order of y
   while D is not empty do
      {Xq, Xp, XR, y} = pop(D)
      if Xp ∈ Adj(Xq, G) and XR ⊆ Adj(Xq, G) then
         Compute z, the p-value of CTMI(Xp; Xq | XR), given by Equation (4)
         if z > α then
            Remove edge Xp − Xq from G
            Sepset(p, q) = Sepset(q, p) = XR
   n = n + 1
for each unshielded triple in G, apply PC-rule 0
while edges can be oriented do
   for each triple in G, apply PC-rules 1, 2, and 3
for each connected pair in G, apply ER-rules 0 and 1
Return G

4.2. Extension to Hidden Common Causes

When unobserved variables are causing a variable of interest, the PC algorithm is no longer appropriate and one needs to resort to the FCI algorithm introduced in [9], which infers a partial ancestral graph (PAG). We extend here the version of this algorithm presented in [29], without selection bias.

We recall the notations needed for FCI: a double arrow ↔ indicates the presence of a hidden common cause, the symbol ∘ represents an undetermined mark, and the meta-symbol * is used as a wildcard that can denote a tail, an arrow, or a circle. Furthermore, we make use of the notion of Possible-Dsep, which is defined as follows:

Definition 10

(Possible-Dsep [9]). Xr is in the Possible-Dsep set of Xp and Xq if and only if Xr is different from Xp and Xq and there is an undirected path U between Xp and Xr such that, for every subpath <Xw,Xs,Xv> of U, either Xs is a collider on the subpath or Xw and Xv are adjacent in the PAG.

Lastly, we make use of the following types of paths. A discriminating path between Xp and Xq for Xr is a path that includes at least three edges, on which Xr is the non-endpoint vertex adjacent to Xq, Xp is not adjacent to Xq, and every vertex between Xp and Xr is a collider on the path and a parent of Xq. An uncovered path is a path where every consecutive triple on the path is unshielded. A potentially directed path from Xp to Xq is a path on which every edge between two consecutive vertices is neither into the first vertex nor out of the second.
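The Possible-Dsep set of Definition 10 can be computed by a search over (previous, current) vertex pairs, so that each triple on a path is checked exactly once. The sketch below is ours (the `adjacent` and `is_collider` callables are placeholders for queries on the current PAG):

```python
from collections import deque

def possible_dsep(p, q, nodes, adjacent, is_collider):
    """Possible-Dsep(X^p, X^q) following Definition 10 (sketch).

    `adjacent(a, b)` and `is_collider(w, s, v)` stand in for adjacency
    and collider queries on the current PAG.  We explore states
    (previous, current) so every subpath <w, s, v> is validated once.
    """
    result = set()
    seen = {(None, p)}
    queue = deque([(None, p)])
    while queue:
        prev, cur = queue.popleft()
        for nxt in nodes:
            if nxt == cur or not adjacent(cur, nxt):
                continue
            if prev is not None:
                # every subpath <prev, cur, nxt> must have cur a collider
                # or its endpoints adjacent
                if not (is_collider(prev, cur, nxt) or adjacent(prev, nxt)):
                    continue
            if (cur, nxt) in seen:
                continue
            seen.add((cur, nxt))
            if nxt not in (p, q):
                result.add(nxt)
            queue.append((cur, nxt))
    return result
```

On the chain p − a − b with no collider and p, b non-adjacent, only a is in Possible-Dsep(p, q); if a were a collider, b would be reachable as well.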

From the skeleton obtained in Section 4.1.1, unshielded colliders are detected using the FCI-rule 0.

FCI-rule 0

(Origin of causality). For each unshielded triple Xp *−* Xr *−* Xq, if Xr ∉ Sepset(p, q), then orient the unshielded triple as a collider: Xp *→ Xr ←* Xq.

From this, Possible-Dsep sets can be constructed. As elements of Possible-Dsep sets in a PAG play a role similar to the ones of parents in a DAG, additional edges are removed by conditioning on the elements of the Possible-Dsep sets, using the same strategy as the one given in Section 4.1.1. All edges are then unoriented, and the FCI-rule 0 is again applied as some of the edges of the unshielded colliders originally detected may have been removed by the previous step. Then, as in FCI, FCI-rules 1, 2, 3, 4, 8, 9, and 10 are applied.

FCI-rule 1.

In an unshielded triple Xp *→ Xr ∘−* Xq, if Xr ∈ Sepset(p, q), then orient the unshielded triple as Xp *→ Xr → Xq.

FCI-rule 2.

If there exists a triple Xp → Xr *→ Xq or a triple Xp *→ Xr → Xq with Xp *−∘ Xq, then orient the pair as Xp *→ Xq.

FCI-rule 3.

If there exists an unshielded triple Xp *→ Xr ←* Xq and an unshielded triple Xp *−∘ Xs ∘−* Xq with Xs *−∘ Xr, then orient the pair as Xs *→ Xr.

FCI-rule 4.

If there exists a discriminating path between Xp and Xq for Xr, and Xr ∘−* Xq, then: if Xr ∈ Sepset(p, q), orient Xr ∘−* Xq as Xr → Xq; otherwise, orient the triple as Xs ↔ Xr ↔ Xq, where Xs is the vertex preceding Xr on the path.

FCI-rule 8.

If Xp → Xr → Xq or Xp −∘ Xr → Xq, and Xp ∘→ Xq, then orient Xp ∘→ Xq as Xp → Xq.

FCI-rule 9.

If Xp ∘→ Xq and U = ⟨Xp, Xr, …, Xq⟩ is an uncovered potentially directed path from Xp to Xq such that Xq and Xr are not adjacent, then orient Xp ∘→ Xq as Xp → Xq.

FCI-rule 10.

Suppose Xp ∘→ Xq, Xr → Xq ← Xs, U1 is an uncovered potentially directed path from Xp to Xr, and U2 is an uncovered potentially directed path from Xp to Xs. Let μ be the vertex adjacent to Xp on U1 (μ could be Xr) and ω be the vertex adjacent to Xp on U2 (ω could be Xs). If μ and ω are distinct and are not adjacent, then orient Xp ∘→ Xq as Xp → Xq.

Note that we have not included FCI-rules 5, 6, and 7 from [29], as these rules deal with selection bias, a phenomenon that is not present in the datasets we consider. Including these rules in our framework is nevertheless straightforward.

Finally, as in PCTMI, we orient additional edges using an adapted version of the ER-rules.

LER-rule 0

(Latent entropy reduction—γ). In a pair Xp ∘−∘ Xq such that Xp and Xq do not have a possible spurious correlation, if γ¯pq > 0, then orient the edge as Xp ∘→ Xq.

LER-rule 1

(Latent entropy reduction—λ). In a pair Xp ∘−∘ Xq such that Xp and Xq do not have a possible spurious correlation, if γ¯pq = 0 and λ¯pq < λ¯qp, then orient the edge as Xp ∘→ Xq.

The overall process, referred to as FCITMI, is described in Algorithm 2.

Algorithm 2 FCITMI.
Require: X a d-dimensional time series of length T, γmax ∈ ℕ the maximum number of lags, α a significance threshold
Form a complete undirected graph G = (V, E) with d nodes
n = 0
while there exists Xq ∈ V such that card(Adj(Xq, G)) ≥ n + 1 do
   D = list()
   for Xq ∈ V s.t. card(Adj(Xq, G)) ≥ n + 1 do
      for Xp ∈ Adj(Xq, G) do
         for all subsets XR ⊆ Adj(Xq, G) ∖ {Xp} such that card(XR) = n and (γrp ≠ 0 or γrq ≠ 0) for all r ∈ R do
            yq,p,R = CTMI(Xp; Xq | XR)
            append(D, {Xq, Xp, XR, yq,p,R})
   Sort D by increasing order of y
   while D is not empty do
      {Xq, Xp, XR, y} = pop(D)
      if Xp ∈ Adj(Xq, G) and XR ⊆ Adj(Xq, G) then
         Compute z, the p-value of CTMI(Xp; Xq | XR), given by Equation (4)
         if z > α then
            Remove edge Xp − Xq from G
            Sepset(p, q) = Sepset(q, p) = XR
   n = n + 1
for each unshielded triple in G, apply FCI-rule 0
Using the Possible-Dsep sets, remove additional edges using CTMI
Reorient all edges as ∘−∘ in G
for each unshielded triple in G, apply FCI-rule 0
while edges can be oriented do
   for each triple in G, apply FCI-rules 1, 2, 3, 4, 8, 9, and 10
for each connected pair in G, apply LER-rules 0 and 1
Return G

5. Experiments

In this section, we evaluate our methods experimentally on several datasets. We first present an extensive analysis on simulated data with equal and different sampling rates, generated from basic causal structures; we then present an analysis on real-world datasets. To assess the quality of causal inference, we use the F1-score computed on the directed edges of the graph, without considering self-loops.
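The evaluation metric can be stated precisely as follows (a minimal sketch under our own conventions: graphs are given as sets of (cause, effect) pairs):

```python
def f1_directed(true_edges, pred_edges):
    """F1-score over directed edges, ignoring self-loops (sketch).

    Edges are (cause, effect) pairs; self-loops (a, a) are discarded
    before computing precision and recall, as in the evaluation above.
    """
    t = {e for e in true_edges if e[0] != e[1]}
    p = {e for e in pred_edges if e[0] != e[1]}
    tp = len(t & p)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(t)
    return 2 * precision * recall / (precision + recall)
```

Note that with this convention, a reversed edge counts both as a false positive and a false negative.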

In the following, we first describe the different settings of the methods we compare with and the datasets, and then, we describe the results and provide a complexity analysis.

5.1. Methods and Their Use

We compared our method PCTMI, available at https://github.com/ckassaad/PCTMI (accessed on 29 June 2022), with several methods. From the Granger family, we ran MVGC through the implementation available at http://www.sussex.ac.uk/sackler/mvgc/ (accessed on 29 June 2022) and TCDF through the implementation available at https://github.com/M-Nauta/TCDF (accessed on 29 June 2022). For TCDF, some hyperparameters have to be defined: we used a kernel of size 4, a dilation coefficient of 4, one hidden layer, a learning rate of 0.01, and 5000 epochs. From the score-based family, we retained Dynotears, which is available at https://github.com/quantumblacklabs/causalnex (accessed on 29 June 2022), the hyperparameters of which were set to their recommended values (the regularization constants λW=λA=0.05). Among the noise-based approaches, we ran TiMINo, which is available at http://web.math.ku.dk/~peters/code.html (accessed on 29 June 2022); TiMINo uses a test based on the cross-correlation. From the constraint-based family, we ran PCMCI, using mutual information to measure dependence, both provided in the implementation available at https://github.com/jakobrunge/tigramite (accessed on 29 June 2022).

We compared FCITMI with tsFCI, provided at https://sites.google.com/site/dorisentner/publications/tsfci (accessed on 29 June 2022), where independence and conditional independence are tested with tests of zero correlation and zero partial correlation, respectively.

For all methods using mutual information, we used the k-nearest neighbor estimator [26], for which we fixed the number of nearest neighbors to k=10. Since the outputs of these measures are necessarily positive for a finite sample size and finite numerical precision, we used the significance permutation test introduced in [27]. For all methods, we used γmax=5, and when performing a statistical test, we used a significance level of 0.05.
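The permutation test can be sketched as follows (a sketch under our own conventions: `mi_estimate` stands in for the k-NN estimator of [26], and the permutation count and smoothing are illustrative choices, not the settings of [27]):

```python
import numpy as np

def permutation_pvalue(x, y, mi_estimate, n_perm=200, seed=0):
    """Permutation significance test for a (conditional) MI estimate.

    The observed statistic is compared to its distribution under the
    null of independence, obtained by shuffling y: the p-value is the
    fraction of permuted samples whose estimate is at least as large.
    """
    rng = np.random.default_rng(seed)
    observed = mi_estimate(x, y)
    count = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)          # break any dependence
        if mi_estimate(x, y_perm) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)        # add-one smoothing
```

Independence is rejected when the returned p-value falls below the chosen significance level (0.05 above).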

5.2. Datasets

To illustrate the behavior of the causal inference algorithms, we relied on both artificial and real-world datasets.

5.2.1. Simulated Data

In the case of causal sufficiency, we simulated time series of size 1000, with equal and with different sampling rates, generated from three different causal structures: the fork, the v structure, and the diamond, represented in Table 1. In the case where causal sufficiency does not hold, we considered the 7ts2h structure represented in Table 2.

Table 1.

Structures of simulated data without hidden common causes. A → B means that A causes B.

V-Structure Fork Diamond
graphic file with name entropy-24-01156-i001.jpg graphic file with name entropy-24-01156-i002.jpg graphic file with name entropy-24-01156-i003.jpg
Table 2.

Structures of simulated data with hidden common causes. A → B means that A causes B and A ↔ B represents the existence of a hidden common cause between A and B.

7ts2h
graphic file with name entropy-24-01156-i004.jpg

For each structure, we generated 10 datasets using the following generating process: for all q, X0q=0, and for all t>0,

X_t^q = a_{t-1}^{qq} X_{t-1}^q + Σ_{(p,γ): X_{t-γ}^p ∈ Par(X_t^q)} a_{t-γ}^{pq} f(X_{t-γ}^p) + 0.1 ξ_t^q,

where γ ≥ 0, the a_t^{jq} are random coefficients chosen uniformly in U([−1;−0.1]∪[0.1;1]) for all 1 ≤ j ≤ d, ξ_t^q ∼ N(0,15), and f is a nonlinear function chosen uniformly at random among the absolute value, tanh, sine, and cosine.
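A simplified version of this generating process can be sketched as follows (our own sketch: coefficients are drawn once per edge rather than per time step, the noise term is a standard normal scaled by 0.1, and lags are assumed to satisfy γ ≥ 1 for simplicity):

```python
import numpy as np

def simulate(parents, d=3, T=1000, seed=0):
    """Simulate a simplified version of the generating process above.

    `parents[q]` lists the lagged causes (p, gamma) of X^q, gamma >= 1.
    Coefficients are drawn in U([-1,-0.1] U [0.1,1]); f is drawn
    uniformly among |.|, tanh, sin, and cos; the noise term is
    0.1 * xi with xi standard normal (a simplification of the paper's
    xi term).
    """
    rng = np.random.default_rng(seed)

    def coef():
        c = rng.uniform(0.1, 1.0)
        return c if rng.random() < 0.5 else -c

    funcs = [np.abs, np.tanh, np.sin, np.cos]
    a_self = {q: coef() for q in range(d)}
    a_edge = {(p, g, q): coef()
              for q in range(d) for (p, g) in parents.get(q, [])}
    f = {e: funcs[rng.integers(len(funcs))] for e in a_edge}

    X = np.zeros((T, d))
    for t in range(1, T):
        for q in range(d):
            val = a_self[q] * X[t - 1, q]          # self-cause
            for (p, g) in parents.get(q, []):
                if t - g >= 0:                     # lagged cause
                    val += a_edge[(p, g, q)] * f[(p, g, q)](X[t - g, p])
            X[t, q] = val + 0.1 * rng.normal()     # scaled noise
    return X
```

For instance, `simulate({1: [(0, 1)], 2: [(0, 1)]})` generates a fork in which X^0 causes X^1 and X^2 with lag 1.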

5.2.2. Real Data

First, we considered the realistic functional magnetic resonance imaging (fMRI) benchmark, which contains blood-oxygen-level dependent (BOLD) datasets [30] for 28 different underlying brain networks. It measures the neural activity of different regions of interest in the brain based on the change of blood flow. There are 50 regions in total, each with its own associated time series. Since not all existing methods can handle 50 time series, datasets with more than 10 time series were excluded. In total, we were left with 26 datasets containing between 5 and 10 brain regions.

Second, we considered time series collected from an IT monitoring system with a one-minute sampling rate provided by EasyVista (https://www.easyvista.com/fr/produits/servicenav, accessed on 29 June 2022). In total, we considered 24 datasets with 1000 timestamps. Each dataset contains three time series: the metric extraction (M.E), which represents the activity of the extraction of the metrics from the messages; the group history insertion (G.H.I), which represents the activity of the insertion of the historical status in the database; and the collector monitoring information (C.M.I), which represents the activity of the updates in a given database. Lags between time series are unknown, as well as the existence of self-causes. According to the domain experts, all these datasets follow a fork structure such that metric extraction is a common cause of the other two time series.

5.3. Numerical Results

5.3.1. Simulated Data

We provide in Table 3 the performance of all methods on simulated data with an equal sampling rate. PCTMI obtains better results than the other methods for the v structure and the fork structure. For the diamond structure, PCTMI and PCMCI reach the same performance; however, PCTMI has a lower standard deviation. In general, constraint-based algorithms perform best.

Table 3.

Results for simulated datasets with an equal sampling rate. We report the mean and the standard deviation of the F1-score. The best results are in bold.

PCTMI PCMCI TiMINo VarLiNGAM Dynotears TCDF MVGC
V structure 0.78±0.18 0.67±0.37 0.65±0.37 0.0±0.0 0.07±0.20 0.13±0.26 0.37±0.26
Fork 0.83±0.31 0.78±0.17 0.52±0.44 0.0±0.0 0.07±0.20 0.26±0.32 0.44±0.38
Diamond 0.82±0.11 0.82±0.16 0.60±0.25 0.03±0.09 0.23±0.24 0.16±0.19 0.68±0.26

We also assess the behavior of PCTMI when the time series have different sampling rates in Table 4. We present here results only for PCTMI, because the other methods are not applicable to time series with different sampling rates. As one can see, the performance obtained here is close to the one obtained with equal sampling rates in the case of the v structure, but significantly lower in the case of the fork structure and the diamond. The degradation of the results is not really surprising, as one has less data to rely on in a different sampling rate scenario. By comparing the results of PCTMI on time series with different sampling rates with the results of the other methods obtained when time series have an equal sampling rate, we can see that PCTMI still performs better than most methods.

Table 4.

Results for simulated datasets with different sampling rates. We report the mean and the standard deviation of the F1-score.

PCTMI
V structure 0.80±0.31
Fork 0.56±0.30
Diamond 0.66±0.24

In the case of causal non-sufficiency, our proposed algorithm FCITMI outperforms tsFCI, as shown in Table 5.

Table 5.

Results for simulated datasets with an equal sampling rate and with hidden common causes. We report the mean and the standard deviation of the F1-score.

FCITMI tsFCI
7ts2h 0.44±0.11 0.37±0.09

5.3.2. Real Data

We provide in Table 6 the results on the fMRI benchmark and the IT benchmark for all the methods. In the case of fMRI, VarLiNGAM performs best, followed by Dynotears and then by PCTMI and TiMINo. The success of the linear methods (VarLiNGAM and Dynotears) on this benchmark might suggest that causal relations in fMRI are linear, especially as this result was not replicated on the other datasets considered in this paper. In the case of the IT datasets, TiMINo performs best, followed by PCTMI; however, while investigating the results, we noticed that for 6/10 datasets, TiMINo returns a fully connected summary graph with arcs oriented in both directions. In both benchmarks, PCTMI clearly outperforms the other constraint-based methods.

Table 6.

Results for real datasets. We report the mean and the standard deviation of the F1-score. The best results are in bold.

PCTMI PCMCI TiMINo VarLiNGAM Dynotears TCDF MVGC
fMRI 0.32±0.17 0.22±0.18 0.32±0.11 0.49±0.28 0.34±0.13 0.07±0.13 0.24±0.18
IT 0.40±0.32 0.25±0.31 0.62±0.14 0.36±0.19 0.0±0.0 0.0±0.0 0.38±0.17

In the case of the IT datasets, we also provide in Table 7, for each dataset, the summary causal graphs inferred by the methods that have an F1-score >0 in Table 6. Here, we can see that in 5 out of 10 datasets, PCTMI was able to infer that M.E is a cause of G.H.I. On the other hand, PCTMI was able to infer that M.E is a cause of C.M.I in only one dataset. Interestingly, in 9 out of 10 datasets, PCTMI did not infer any false positive. As one can expect from Table 6, PCMCI suffers on all datasets; however, one can notice that, like PCTMI, PCMCI tends to yield sparse graphs. TiMINo, which has the best F1-score in Table 6, seems to have a very low precision. In 7 out of 10 datasets, TiMINo returned a complete bidirected graph. In the other three datasets, TiMINo detects all true causal relations, in addition to one false positive. VarLiNGAM has an even lower precision than TiMINo and a lower recall. Finally, MVGC infers a complete bidirected graph most of the time.

Table 7.

Summary causal graphs inferred by different methods using 10 different monitoring IT datasets. A → B means that A causes B and A ⇄ B means that A causes B and B causes A.

PCTMI PCMCI TiMINo VarLiNGAM MVGC
Dataset 1 graphic file with name entropy-24-01156-i005.jpg graphic file with name entropy-24-01156-i006.jpg graphic file with name entropy-24-01156-i007.jpg graphic file with name entropy-24-01156-i008.jpg graphic file with name entropy-24-01156-i009.jpg
Dataset 2 graphic file with name entropy-24-01156-i010.jpg graphic file with name entropy-24-01156-i011.jpg graphic file with name entropy-24-01156-i012.jpg graphic file with name entropy-24-01156-i013.jpg graphic file with name entropy-24-01156-i014.jpg
Dataset 3 graphic file with name entropy-24-01156-i015.jpg graphic file with name entropy-24-01156-i016.jpg graphic file with name entropy-24-01156-i017.jpg graphic file with name entropy-24-01156-i018.jpg graphic file with name entropy-24-01156-i019.jpg
Dataset 4 graphic file with name entropy-24-01156-i020.jpg graphic file with name entropy-24-01156-i021.jpg graphic file with name entropy-24-01156-i022.jpg graphic file with name entropy-24-01156-i023.jpg graphic file with name entropy-24-01156-i024.jpg
Dataset 5 graphic file with name entropy-24-01156-i025.jpg graphic file with name entropy-24-01156-i026.jpg graphic file with name entropy-24-01156-i027.jpg graphic file with name entropy-24-01156-i028.jpg graphic file with name entropy-24-01156-i029.jpg
Dataset 6 graphic file with name entropy-24-01156-i030.jpg graphic file with name entropy-24-01156-i031.jpg graphic file with name entropy-24-01156-i032.jpg graphic file with name entropy-24-01156-i033.jpg graphic file with name entropy-24-01156-i034.jpg
Dataset 7 graphic file with name entropy-24-01156-i035.jpg graphic file with name entropy-24-01156-i036.jpg graphic file with name entropy-24-01156-i037.jpg graphic file with name entropy-24-01156-i038.jpg graphic file with name entropy-24-01156-i039.jpg
Dataset 8 graphic file with name entropy-24-01156-i040.jpg graphic file with name entropy-24-01156-i041.jpg graphic file with name entropy-24-01156-i042.jpg graphic file with name entropy-24-01156-i043.jpg graphic file with name entropy-24-01156-i044.jpg
Dataset 9 graphic file with name entropy-24-01156-i045.jpg graphic file with name entropy-24-01156-i046.jpg graphic file with name entropy-24-01156-i047.jpg graphic file with name entropy-24-01156-i048.jpg graphic file with name entropy-24-01156-i049.jpg
Dataset 10 graphic file with name entropy-24-01156-i050.jpg graphic file with name entropy-24-01156-i051.jpg graphic file with name entropy-24-01156-i052.jpg graphic file with name entropy-24-01156-i053.jpg graphic file with name entropy-24-01156-i054.jpg

5.3.3. Complexity Analysis

Here, we provide a complexity analysis on different constraint-based algorithms including our proposed method.

In the worst case, the complexity of PC in a window causal graph is bounded by (dγmax)^2 (dγmax − 1)^{k−1} / (k−1)!, where k represents the maximal degree of any vertex and γmax the maximum number of lags. Each operation consists of conducting a significance test on a conditional independence measure. Algorithms adapted to time series, such as PCMCI [7], rely on time information to reduce the number of tests. PCTMI reduces the number of tests even more since it infers a summary graph. Indeed, PCTMI's complexity in the worst case is bounded by d^2 (d − 1)^{k−1} / (k−1)!.
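The two worst-case bounds above are directly comparable (function names are ours):

```python
from math import factorial

def pc_window_bound(d, gamma_max, k):
    """Worst-case number of CI tests for PC on the window causal graph:
    (d*gamma_max)^2 * (d*gamma_max - 1)^(k-1) / (k-1)!."""
    n = d * gamma_max
    return n**2 * (n - 1)**(k - 1) / factorial(k - 1)

def pctmi_bound(d, k):
    """Worst-case number of CI tests for PCTMI on the summary graph:
    d^2 * (d - 1)^(k-1) / (k-1)!."""
    return d**2 * (d - 1)**(k - 1) / factorial(k - 1)
```

For d = 3, γmax = 5, and k = 2, the window-graph bound is 3150 tests against 18 for the summary graph, which illustrates why inferring the summary graph directly saves a substantial number of tests.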

Figure 6 provides an empirical illustration of the difference in the complexity of the two approaches on the three structures (v structure, fork, diamond), sorted according to their number of nodes, their maximal out-degree, and their maximal in-degree. The time is given in seconds. As one can note, PCTMI is always faster than PCMCI, the difference being more important when the structure to be inferred is complex.

Figure 6. Time computation for PCTMI and PCMCI.

5.3.4. Hyperparameters’ Analysis

Here, we provide a hyperparameter analysis of CTMI with respect to the maximum lag γmax and the number of nearest neighbors k. Using the same 10 datasets that are compatible with the fork structure in Table 1 and generated as described in Section 5.2.1, we performed two experiments.

First, we computed the average and the standard deviation of CTMI(X2;X3) and CTMI(X2;X3|X1) over the 10 datasets while varying γmax between 3 and 10. The results given in Figure 7a show that CTMI is robust with respect to γmax in the case of dependence and in the case of conditional independence: the mean of CTMI(X2;X3) remains the same for γmax=3,4,5 and slightly decreases for γmax=10, while the standard deviation remains the same for γmax=3,4 and slightly increases for γmax=5,10. On the other hand, the mean and the standard deviation of CTMI(X2;X3|X1) remain the same for all γmax.

Figure 7. CTMI with respect to the maximum lag γmax in (a) and the k-nearest neighbor k in (b) for dependent time series (X2 and X3 in the fork structure in Table 1) and conditionally independent time series (X2 and X3 conditioned on X1 in the fork structure in Table 1).

Second, we again computed the average and the standard deviation of CTMI(X2;X3) and CTMI(X2;X3|X1) over the 10 datasets, but now while varying k between 5 and 100. The results given in Figure 7b show that CTMI is robust with respect to k in the case of conditional independence, but suffers in the case of dependence when k increases: the mean of CTMI(X2;X3) slightly decreases between k=5 and k=10, but drastically decreases for k=50.

6. Discussion and Conclusions

We addressed in this paper the problem of learning a summary causal graph on time series with equal or different sampling rates. To do so, we first proposed a new temporal mutual information measure defined on a window-based representation of time series. We then showed how this measure relates to an entropy reduction principle, which can be seen as a special case of the probability raising principle. We finally combined these two ingredients in PC-like and FCI-like algorithms to construct the summary causal graph. Our methods proved to work well with a small time complexity in comparison with similar approaches that use mutual information. Their main limitations are that they cannot orient causal relations that have a possible spurious correlation and are not oriented by the classical PC-rules or FCI-rules, and that they assume acyclicity of the summary causal graph.

Acknowledgments

We thank Ali Aït-Bachir and Christophe de Bignicourt from EasyVista for providing us with the IT monitoring data along with their underlying causal graph.

Author Contributions

Conceptualization, C.K.A., E.D. and E.G.; methodology, C.K.A., E.D. and E.G.; software, C.K.A., E.D. and E.G.; validation, C.K.A., E.D. and E.G.; formal analysis, C.K.A., E.D. and E.G.; investigation, C.K.A., E.D. and E.G.; resources, C.K.A., E.D. and E.G.; data curation, C.K.A., E.D. and E.G.; writing—original draft preparation, C.K.A., E.D. and E.G.; writing—review and editing, C.K.A., E.D. and E.G.; visualization, C.K.A., E.D. and E.G.; supervision, C.K.A., E.D. and E.G.; project administration, C.K.A., E.D. and E.G.; funding acquisition, C.K.A., E.D. and E.G. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data presented in this study are publicly available: Simulated data are available at https://dataverse.harvard.edu/dataverse/basic_causal_structures_additive_noise (accessed on 29 June 2022); The original fMRI data are available at https://www.fmrib.ox.ac.uk/datasets/netsim/index.html (accessed on 29 June 2022), and a preprocessed version is available at https://github.com/M-Nauta/TCDF/tree/master/data/fMRI (accessed on 29 June 2022); IT monitoring data are available at https://easyvista2015-my.sharepoint.com/:f:/g/personal/aait-bachir_easyvista_com/ElLiNpfCkO1JgglQcrBPP9IBxBXzaINrM5f0ILz6wbgoEQ?e=OBTsUY (accessed on 29 June 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This research was partly funded by MIAI@Grenoble Alpes Grant Number ANR-19-P3IA-0003.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Assaad C.K., Devijver E., Gaussier E. Survey and Evaluation of Causal Discovery Methods for Time Series. J. Artif. Intell. Res. 2022;73:767–819. doi: 10.1613/jair.1.13428.
2. Wang P., Xu J., Ma M., Lin W., Pan D., Wang Y., Chen P. CloudRanger: Root Cause Identification for Cloud Native Systems. In: Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '18); Washington, DC, USA, 1–4 May 2018; pp. 492–502.
3. Wang H., Wu Z., Jiang H., Huang Y., Wang J., Kopru S., Xie T. Groot: An Event-Graph-Based Approach for Root Cause Analysis in Industrial Settings. In: Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE '21); Melbourne, Australia, 15–19 November 2021; pp. 419–429.
4. Zhang Y., Guan Z., Qian H., Xu L., Liu H., Wen Q., Sun L., Jiang J., Fan L., Ke M. CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms. In: Proceedings of the 30th ACM International Conference on Information and Knowledge Management; Online, 1–5 November 2021; pp. 4373–4382.
5. Granger C. Investigating Causal Relations by Econometric Models and Cross-Spectral Methods. Econometrica. 1969;37:424–438. doi: 10.2307/1912791.
6. Peters J., Janzing D., Schölkopf B. Causal Inference on Time Series using Restricted Structural Equation Models. In: Advances in Neural Information Processing Systems 26; Lake Tahoe, NV, USA, 5–10 December 2013; pp. 154–162.
7. Runge J., Nowack P., Kretschmer M., Flaxman S., Sejdinovic D. Detecting and quantifying causal associations in large nonlinear time series datasets. Sci. Adv. 2019;5:eaau4996. doi: 10.1126/sciadv.aau4996.
8. Nauta M., Bucur D., Seifert C. Causal Discovery with Attention-Based Convolutional Neural Networks. Mach. Learn. Knowl. Extr. 2019;1:312–340. doi: 10.3390/make1010019.
9. Spirtes P., Glymour C., Scheines R. Causation, Prediction, and Search. 2nd ed. MIT Press; Cambridge, MA, USA; 2000.
10. Granger C.W.J. Time Series Analysis, Cointegration, and Applications. Am. Econ. Rev. 2004;94:421–425. doi: 10.1257/0002828041464669.
11. Chickering D.M. Learning Equivalence Classes of Bayesian-Network Structures. J. Mach. Learn. Res. 2002;2:445–498. doi: 10.1162/153244302760200696.
12. Pamfil R., Sriwattanaworachai N., Desai S., Pilgerstorfer P., Georgatzis K., Beaumont P., Aragam B. DYNOTEARS: Structure Learning from Time-Series Data. In: Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics; Online, 28 August 2020; pp. 1595–1605.
13. Sun J., Taylor D., Bollt E.M. Causal Network Inference by Optimal Causation Entropy. SIAM J. Appl. Dyn. Syst. 2015;14:73–106. doi: 10.1137/140956166.
14. Entner D., Hoyer P. On Causal Discovery from Time Series Data using FCI. In: Proceedings of the 5th European Workshop on Probabilistic Graphical Models (PGM 2010); Helsinki, Finland, 13–15 September 2010.
15. Malinsky D., Spirtes P. Causal Structure Learning from Multivariate Time Series in Settings with Unmeasured Confounding. In: Proceedings of the 2018 ACM SIGKDD Workshop on Causal Discovery; London, UK, 20 August 2018; pp. 23–47.
16. Assaad C.K., Devijver E., Gaussier E., Ait-Bachir A. A Mixed Noise and Constraint-Based Approach to Causal Inference in Time Series. In: Machine Learning and Knowledge Discovery in Databases: Research Track, Bilbao, Spain, 13–17 September 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 453–468.
17. Shannon C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948;27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x.
18. Affeldt S., Isambert H. Robust Reconstruction of Causal Graphical Models Based on Conditional 2-point and 3-point Information. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI '15); Amsterdam, The Netherlands, 12–16 July 2015; AUAI Press, 2015.
19. Galka A., Ozaki T., Bayard J.B., Yamashita O. Whitening as a Tool for Estimating Mutual Information in Spatiotemporal Data Sets. J. Stat. Phys. 2006;124:1275–1315. doi: 10.1007/s10955-006-9131-x.
20. Schreiber T. Measuring Information Transfer. Phys. Rev. Lett. 2000;85:461. doi: 10.1103/PhysRevLett.85.461.
21. Marko H. The Bidirectional Communication Theory—A Generalization of Information Theory. IEEE Trans. Commun. 1973;21:1345–1351. doi: 10.1109/TCOM.1973.1091610.
22. Massey J.L. Causality, feedback and directed information. In: Proceedings of the International Symposium on Information Theory and Its Applications; Waikiki, HI, USA, 27–30 November 1990.
23. Albers D.J., Hripcsak G. Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series. Chaos Solitons Fractals. 2012;45:853–860. doi: 10.1016/j.chaos.2012.03.003.
24. Suppes P. A Probabilistic Theory of Causality. North-Holland Pub. Co.; Amsterdam, The Netherlands; 1970.
25. Kraskov A., Stögbauer H., Grassberger P. Estimating mutual information. Phys. Rev. E. 2004;69:066138. doi: 10.1103/PhysRevE.69.066138.
26. Frenzel S., Pompe B. Partial Mutual Information for Coupling Analysis of Multivariate Time Series. Phys. Rev. Lett. 2007;99:204101. doi: 10.1103/PhysRevLett.99.204101.
27. Runge J. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics; Playa Blanca, Spain, 9 April 2018; pp. 938–947.
28. Colombo D., Maathuis M.H. Order-Independent Constraint-Based Causal Structure Learning. J. Mach. Learn. Res. 2014;15:3921–3962.
29. Zhang J. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artif. Intell. 2008;172:1873–1896. doi: 10.1016/j.artint.2008.08.001.
30. Smith S.M., Miller K.L., Khorshidi G.S., Webster M.A., Beckmann C.F., Nichols T.E., Ramsey J., Woolrich M.W. Network modelling methods for FMRI. NeuroImage. 2011;54:875–891. doi: 10.1016/j.neuroimage.2010.08.063.



Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
