Abstract
Discovering the causal structure of a dynamical system from observed time series is a traditional and important problem. In many practical applications, observed data are obtained by applying subsampling or temporal aggregation to the original causal processes, making it difficult to discover the underlying causal relations. Subsampling refers to the procedure in which, of every k consecutive observations, one is kept and the rest are skipped; some advances have recently been made in causal discovery from such data. With temporal aggregation, the local averages or sums of k consecutive, non-overlapping observations in the causal process are computed as new observations, and causal discovery from such data is even harder. In this paper, we investigate how to recover causal relations at the original causal frequency from temporally aggregated data when k is known. Assuming the time series at the causal frequency follows a vector autoregressive (VAR) model, we show that the causal structure at the causal frequency is identifiable from aggregated time series if the noise terms are independent and non-Gaussian and some other technical conditions hold. We then present an estimation method based on non-Gaussian state-space modeling and evaluate its performance on both synthetic and real data.
1 INTRODUCTION
Causal modeling (Spirtes et al., 2001; Pearl, 2000) of time series data has been widely applied in many fields such as econometrics (Ghysels et al., 2016), neuroscience (Zhou et al., 2014), and climate science (Van Nes et al., 2015). Classical causal discovery approaches, e.g., the Granger causality test (Granger, 1969), usually assume that the data measurement frequency matches the true causal frequency of the underlying physical process. However, since the true causal frequency is usually unknown, time series data are often measured at a frequency lower than the causal frequency. For example, econometric indicators such as GDP and non-farm payrolls are usually recorded at quarterly and monthly scales, while the causal interactions between the processes may take place at weekly or fortnightly scales (Ghysels et al., 2016). In neuroscience, imaging technologies have relatively low temporal resolution, while many high-frequency neuronal interactions are important for understanding neuronal dynamics (Zhou et al., 2014). In these situations, the available observations have a lower resolution than the underlying causal process.
There are two typical schemes for generating low-resolution or low-frequency data from high-frequency ones (Silvestrini & Veredas, 2008; Marcellino, 1999). One is subsampling: of every k consecutive observations, one is kept and the rest are skipped. The other is temporal aggregation, i.e., taking the local averages or sums of k consecutive, non-overlapping observations from the underlying causal process as new observations. For instance, time series of interest rates, money supply, and temperature are usually obtained by subsampling; in contrast, the U.S. nominal GDP is obtained by aggregation: it measures the total number of dollars spent over a time period.
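To make the two schemes concrete, the following minimal NumPy sketch (our own illustration; only x and k follow the paper's notation) produces both kinds of low-resolution data from a toy high-frequency series:

```python
# A minimal sketch (ours) contrasting subsampling and temporal aggregation.
import numpy as np

rng = np.random.default_rng(0)
k = 3
x = rng.standard_normal(12)                  # high-frequency observations

subsampled = x[k - 1::k]                     # keep one of every k points
aggregated = x.reshape(-1, k).mean(axis=1)   # average k non-overlapping points

print(subsampled.shape, aggregated.shape)    # both have length 12 / k = 4
```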
Numerous contributions have analyzed how these two schemes for generating low-resolution data affect properties of the time series, such as estimated causal relations and exogeneity (Tiao, 1972; Weiss, 1984; Granger, 1987; Marcellino, 1999; Rajaguru & Abeysinghe, 2008). These studies found that temporal aggregation can lead to errors in the estimated causal relations if not properly addressed. For example, Breitung & Swanson (2002) examined the impact of temporal aggregation on Granger causality in vector autoregressive (VAR) models and found that the results of Granger causal analysis depend heavily on the degree of temporal aggregation.
Recovering the high frequency causal relations from temporally aggregated data is a very hard problem due to information loss in the aggregation process. A classical way to discover high frequency causal relations from temporally aggregated data is to first disaggregate the low frequency time series to high frequency ones and then apply standard causal discovery methods on the disaggregated data. Temporal disaggregation of low resolution time series has been extensively studied in the econometric and statistical literature (Boot et al., 1967; Stram & Wei, 1986; Harvey & Chung, 2000; Moauro & Savio, 2005; Proietti, 2006), which is clearly an even harder problem than discovering causal relations.
Recently, a set of methods have been proposed to estimate the causal structure at the causal frequency from subsampled data without resorting to disaggregation techniques (Hyttinen et al., 2016; Gong et al., 2015; Plis et al., 2015a; Danks & Plis, 2013). Plis et al. (2015a,b) first inferred the causal structure from the subsampled data and then searched for the causal structure at the causal frequency consistent with the structure inferred in the first step. Based on this framework, Hyttinen et al. (2016) proposed a much faster inference method using a general-purpose Boolean constraint solver. Gong et al. (2015) proposed a model-based approach and examined the identifiability of the underlying vector autoregressive (VAR) model at the causal frequency from subsampled time series; they showed that the causal transition matrix is identifiable if the noise process is non-Gaussian. This work was recently extended to mixed-frequency data via structural VAR modeling (Tank et al., 2017). However, how to estimate causal relations from aggregated data remains an open problem.
Compared to subsampling, temporal aggregation is perhaps more widely used to produce low-resolution time series, especially in economics and finance. However, the effect of temporal aggregation is more complex, and accordingly it is technically more difficult to recover the underlying causal relations from such data. Specifically, because the noise term of the aggregated process mixes a larger number of independent components and the mixing matrix thus has a more complicated structure, the estimation is both statistically and computationally harder.
The objective of this paper is to seek a possible solution to this problem, by studying the theoretical identifiability of the underlying causal relations and developing a practical causal discovery algorithm. Following (Gong et al., 2015), we assume that the high-frequency data follow a VAR model, the error terms are non-Gaussian, and there are no confounders (Geiger et al., 2015). We show that the original causal relation can be estimated from the aggregated data with known k, under a set of technical conditions.
Moreover, we propose an estimation method based on non-Gaussian state-space modeling of the aggregated data. Since the exact inference in the non-Gaussian state-space model is intractable, we estimate the model parameters using the particle stochastic approximation EM (PSAEM) algorithm (Lindsten, 2013; Svensson et al., 2014), which combines the efficient conditional particle filter with ancestor sampling (CPF-AS) (Lindsten et al., 2014) with the stochastic approximation EM (SAEM) algorithm (Delyon et al., 1999). Interestingly, in the extreme case where the aggregation factor k becomes larger and larger, we show that the observed time series will become independent and identically distributed (i.i.d.), and we study to what extent the underlying time-delayed causal relations can be recovered from the instantaneous dependence in the observed data.
2 EFFECT OF TEMPORAL AGGREGATION
In the linear case, Granger causal analysis (Granger, 1969) can be performed by fitting the following first-order VAR model (Sims, 1980):

x_t = A x_{t−1} + e_t,    (1)

where x_t = (x_{t,1}, x_{t,2}, …, x_{t,n})^𝖳 is the vector of the observed data, e_t = (e_{t,1}, …, e_{t,n})^𝖳 is the temporally and contemporaneously independent noise process, and A is the causal transition matrix containing the temporal causal relations.
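For concreteness, the following sketch simulates the VAR(1) process in (1) with independent Laplace (non-Gaussian) noise; rescaling A to spectral radius 0.9 is our illustrative way of enforcing stability (anticipating assumption A2 below):

```python
# A simulation sketch (ours) of the VAR(1) model (1) with non-Gaussian noise.
import numpy as np

rng = np.random.default_rng(1)
n, T = 2, 10000
A = rng.uniform(-0.5, 0.5, size=(n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # enforce stability

x = np.zeros((T, n))
for t in range(1, T):
    e_t = rng.laplace(size=n)   # temporally and contemporaneously independent
    x[t] = A @ x[t - 1] + e_t
```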
2.1 WITH A FINITE k
Gong et al. (2015) studied causal discovery from subsampled data. With subsampling, the observations follow

x̃_t = A^k x̃_{t−1} + Σ_{l=0}^{k−1} A^l e_{tk−l},    (2)

which turns out to be a VAR model with a temporally independent but contemporaneously dependent noise process. They demonstrated that it is possible to identify the high-resolution causal relation A from the low-resolution observations if the noise terms are non-Gaussian.
In this paper, we are concerned with the temporally aggregated data x̃_{1:T} ≜ (x̃_1, x̃_2, …, x̃_T), which are obtained by taking the average (or sum) of every k non-overlapping points, i.e., x̃_t = (1/k) Σ_{i=1}^{k} x_{(t−1)k+i}, where

x_{(t−1)k+i} = A^k x_{(t−2)k+i} + Σ_{l=0}^{k−1} A^l e_{(t−1)k+i−l}.

Taking the average of the above equation over i = 1, 2, …, k, we have

x̃_t = A^k x̃_{t−1} + (1/k) Σ_{i=1}^{k} Σ_{l=0}^{k−1} A^l e_{(t−1)k+i−l},    (3)

which is a vector autoregressive moving-average (VARMA) model with one autoregressive term and two moving-average terms:

x̃_t = A^k x̃_{t−1} + Θ_0 ε_t + Θ_1 ε_{t−1},    (4)

where ε_t ≜ (e_{tk}^𝖳, e_{tk−1}^𝖳, …, e_{(t−1)k+1}^𝖳)^𝖳, Θ_0 = (1/k)[I, I + A, …, Σ_{j=0}^{k−1} A^j], and Θ_1 = (1/k)[Σ_{j=1}^{k−1} A^j, Σ_{j=2}^{k−1} A^j, …, A^{k−1}, 0]. Here, I represents the n × n identity matrix, and 0 represents the n × n zero matrix. We call (A, e, k) the representation of the k-th order aggregated time series x̃. Clearly, A cannot be recovered by simply fitting a VAR model on x̃_t, as done in Granger causal analysis. Even with VARMA modeling, we are only guaranteed to identify A^k rather than the original A. In Section 3, we show under what conditions the causal relation A at the causal frequency can be identified from the aggregated time series x̃_{1:T}.
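The following sketch (ours, with an illustrative stable A) shows numerically that a naive VAR(1) fit on the aggregated series recovers neither A nor A^k exactly: the moving-average part of (4) is correlated with x̃_{t−1} and biases least squares.

```python
# A sketch (ours): naive VAR fitting on aggregated data is biased.
import numpy as np

rng = np.random.default_rng(2)
n, k, T = 2, 3, 60000
A = np.array([[0.4, 0.3], [-0.2, 0.5]])           # illustrative stable A
x = np.zeros((T, n))
for t in range(1, T):
    x[t] = A @ x[t - 1] + rng.laplace(size=n)

x_agg = x.reshape(-1, k, n).mean(axis=1)          # k-th order aggregation

X0, X1 = x_agg[:-1], x_agg[1:]
A_naive = np.linalg.lstsq(X0, X1, rcond=None)[0].T  # naive VAR(1) fit

print(np.round(A_naive, 3))                       # biased estimate
print(np.round(np.linalg.matrix_power(A, k), 3))  # the true A^k
```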
2.2 WHEN k → ∞
Interestingly, causal discovery from aggregated data with a large aggregation factor k has a wide range of applications. For instance, in the stock market, the causal influences between stocks take place very quickly (as suggested by the efficient market hypothesis), but we usually work with low-frequency data such as daily returns. The daily return is the sum of the high-frequency returns within the same day, so discovering the causal interactions between stocks from their daily returns becomes a problem of causal discovery from aggregated data with a large k.
When the aggregation factor k is very large, e⃗_t becomes a mixture of numerous independent components. Fortunately, we can use a simple model to approximate the generating process of the aggregated data. From (1), we have

Σ_{i=1}^{k} x_{(t−1)k+i} = A Σ_{i=1}^{k} x_{(t−1)k+i−1} + Σ_{i=1}^{k} e_{(t−1)k+i},

that is,

x̃_t = A x̃_t + (1/k) A (x_{(t−1)k} − x_{tk}) + (1/k) Σ_{i=1}^{k} e_{(t−1)k+i}.

Denote by ē_t the last error term above, i.e., ē_t = (1/k) Σ_{i=1}^{k} e_{(t−1)k+i}. Note that ē_t has contemporaneously independent components. Since the boundary term (1/k) A (x_{(t−1)k} − x_{tk}) becomes negligible for a stable process, we have

x̃_t = A x̃_t + ē_t    (5)

as k → ∞. This is a linear instantaneous causal model for the components of x̃_t because the components of the total error term, ē_t, are still contemporaneously independent. When the error terms are non-Gaussian, it has the same form as the Linear, Non-Gaussian Model (LiNG) (Lacerda et al., 2008); when the causal relations are further assumed to be acyclic, it follows the form of the Linear, Non-Gaussian Acyclic Model (LiNGAM) (Shimizu et al., 2006). The difference is that in LiNG or LiNGAM, the self-loop influences, A_ii, are assumed to be zero. We will also investigate the identifiability of A in this case in Section 3.
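A quick numerical check of this approximation (our own, with an illustrative stable A): the gap between (I − A)x̃_t and ē_t shrinks relative to ē_t as k grows.

```python
# A numerical check (ours) of the large-k approximation (5).
import numpy as np

rng = np.random.default_rng(3)
n, T = 2, 200000
A = np.array([[0.4, 0.3], [-0.2, 0.5]])           # illustrative stable A
e = rng.laplace(size=(T, n))
x = np.zeros((T, n))
for t in range(1, T):
    x[t] = A @ x[t - 1] + e[t]

for k in (2, 10, 50):
    m = T // k
    x_agg = x[:m * k].reshape(m, k, n).mean(axis=1)[1:]   # drop first block
    e_bar = e[:m * k].reshape(m, k, n).mean(axis=1)[1:]
    resid = x_agg - x_agg @ A.T - e_bar                   # (I - A) x̃_t - ē_t
    print(k, np.std(resid) / np.std(e_bar))               # decreases with k
```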
3 IDENTIFIABILITY OF CAUSAL RELATIONS IN A
We investigate the identifiability of the high-resolution causal transition matrix A from the aggregated time series x̃_{1:T}. In other words, supposing x̃ also admits another representation (A′, e′, k), we aim to see whether it is always the case that A = A′ as the sample size T → ∞. If the noise terms follow the Gaussian distribution, A is usually not identifiable (Palm & Nijman, 1984). Recently, it has been shown that A is identifiable from subsampled time series if the noise terms are non-Gaussian (Gong et al., 2015). However, this does not imply the identifiability of A from aggregated time series: the latter is much more difficult to establish, as the aggregated model (3) has a more complicated structure. Here, we show that, in the exact model (3), A is identifiable from the aggregated data under appropriate conditions; furthermore, as k → ∞, the approximate model (5) holds, and A is partially identifiable from the aggregated data, with an identification procedure that is computationally much more efficient.
First, we will show that A^k can be identified by fitting the VARMA model (4). We make the following assumption.

A1. At least one of the τ-step (τ ≥ 2) delayed cross-covariance matrices of x̃_t, Σ_τ ≜ E[x̃_t x̃_{t−τ}^𝖳], is invertible.
Since ε_t is both temporally and contemporaneously independent, ε_t and ε_{t−1} are independent of x̃_{t−τ} for τ ≥ 2, which implies that E[ε_t x̃_{t−τ}^𝖳] = 0 and E[ε_{t−1} x̃_{t−τ}^𝖳] = 0. Multiplying both sides of (4) from the right by x̃_{t−τ}^𝖳 and taking the expectation, we have

Σ_τ = A^k Σ_{τ−1},  τ ≥ 2.    (6)

Under assumption A1, we can first see that A^k is identifiable: choosing τ ≥ 2 such that Σ_τ is invertible,

A^k = Σ_{τ+1} Σ_τ^{−1}.    (7)
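A moment-based sketch of (6)–(7) (our own implementation, with an illustrative A, k, and Laplace noise, assuming A1 holds for τ = 2):

```python
# A sketch (ours): recover A^k from lagged cross-covariances of x̃_t.
import numpy as np

rng = np.random.default_rng(4)
n, k, T = 2, 3, 300000
A = np.array([[0.4, 0.3], [-0.2, 0.5]])
x = np.zeros((T, n))
for t in range(1, T):
    x[t] = A @ x[t - 1] + rng.laplace(size=n)
x_agg = x.reshape(-1, k, n).mean(axis=1)

def cross_cov(xs, tau):
    # Sample estimate of Sigma_tau = E[x̃_t x̃_{t-tau}^T] (zero-mean series).
    return (xs[tau:].T @ xs[:-tau]) / (len(xs) - tau)

tau = 2                                           # tau >= 2, as in A1
Ak_hat = cross_cov(x_agg, tau + 1) @ np.linalg.inv(cross_cov(x_agg, tau))
print(np.round(Ak_hat, 3))
print(np.round(np.linalg.matrix_power(A, k), 3))  # close to Ak_hat
```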
3.1 IDENTIFIABILITY WITH FINITE k
Substituting the estimated A^k into (3), one can then obtain e⃗_t = x̃_t − A^k x̃_{t−1}, which is a linear mixture of the (2k − 1) noise terms e_{tk}, e_{tk−1}, …, e_{(t−2)k+2}. In the following, we concentrate on the identifiability of A from e⃗.
Let

H ≜ (1/k) [I, I + A, …, Σ_{j=0}^{k−1} A^j, Σ_{j=1}^{k−1} A^j, …, A^{k−1}].    (8)

The error terms in (3) correspond to the following mixing procedure of random vectors:

e⃗ = He,  with e ≜ (e^{(0)𝖳}, e^{(1)𝖳}, …, e^{(2k−2)𝖳})^𝖳.    (9)

Here, e^{(l)} together with the time index t represents e_{tk−l}. The components of e are independent, and for each i, the e_i^{(l)}, l = 0, …, 2k − 2, have the same distribution p_{e_i}. Under the condition that p_{e_i} is non-Gaussian for each i, H can be estimated up to the permutation and scaling indeterminacies (including the sign indeterminacy) of its columns, as given in the following proposition.
Proposition 1
Suppose that all p_{e_i} are non-Gaussian. Given k and x̃_{1:T} generated according to (3), H can be determined up to permutation and scaling of its columns.
For the proof of Proposition 1, please refer to (Gong et al., 2015).
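The block structure of H can be made explicit in code. The sketch below (ours; the C_m blocks are our notation for the coefficient of e_{tk−m}, obtained by expanding the aggregated recursion) builds H from A and k and numerically verifies that x̃_t − A^k x̃_{t−1} = He on a simulated path:

```python
# A sketch (ours) of the mixing matrix H in (8)-(9), with a numerical check.
import numpy as np

rng = np.random.default_rng(5)
n, k = 2, 3
A = np.array([[0.4, 0.3], [-0.2, 0.5]])

pows = [np.linalg.matrix_power(A, j) for j in range(k)]
C = [sum(pows[:m + 1]) for m in range(k)] \
    + [sum(pows[m - k + 1:]) for m in range(k, 2 * k - 1)]
H = np.hstack(C) / k                          # n x n(2k-1) mixing matrix

T = 4 * k
e = rng.laplace(size=(T, n))
x = np.zeros((T, n))
for t in range(1, T):
    x[t] = A @ x[t - 1] + e[t]

t = 3                                         # a block index t (1-based)
x_t  = x[(t - 1) * k + 1: t * k + 1].mean(axis=0)         # x̃_t
x_tm = x[(t - 2) * k + 1: (t - 1) * k + 1].mean(axis=0)   # x̃_{t-1}
e_stack = np.concatenate([e[t * k - m] for m in range(2 * k - 1)])
print(np.allclose(x_t - np.linalg.matrix_power(A, k) @ x_tm, H @ e_stack))
```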
We make the following assumptions on the underlying dynamic process (1) and the distributions p_{e_i}, and then we have the identifiability result for the causal transition matrix A.

A2. The system is stable, in that all eigenvalues of A have modulus smaller than one.

A3. The distributions p_{e_i} are different for different i after re-scaling by any non-zero factor, their characteristic functions are all analytic (or they are all non-vanishing), and none of the characteristic functions has an exponential factor with a polynomial of degree at least 2.
The following identifiability result states that, in various situations, the matrix A of the original high-frequency data is fully identifiable.
Theorem 1
Suppose that all e_{it} are non-Gaussian, that the data x̃_t are generated by (3), and that x̃ also admits another k-th order aggregation representation (A′, e′, k). Let assumptions A1–A3 hold. When the number of observed data points T → ∞, the following statements are true.

(i) A′ can be represented as A′ − I = (A − I)D, where D is a diagonal matrix with 1 or −1 on its diagonal. If we constrain all the self influences, represented by the diagonal entries of A and A′, to be no greater than 1, then A′ = A.

(ii) If each p_{e_i} is asymmetric, we have A′ = A.
A complete proof of Theorem 1 can be found in the Appendix.
3.2 IDENTIFIABILITY AS k → ∞
We have shown that A is identifiable from the aggregated data (3) when k is finite. However, as k becomes larger, estimating A encounters more difficulty, because more independent components are involved in (9). When k = ∞, Proposition 1, and hence Theorem 1, need not hold, because e⃗ in (9) becomes a mixture of an infinite number of independent components.

Interestingly, as k → ∞, the observations x̃_t become i.i.d. and follow the instantaneous causal model (5). We will then answer the following two questions. In this case, can we still estimate A from aggregated data? If we can, is there an efficient procedure to do so?
Equation (5) implies (I − A)x̃_t = ē_t. That is, applying the linear transformation (I − A) to x̃_t produces independent components, namely the components of ē_t. This can be achieved by the independent component analysis (ICA) procedure (Hyvärinen et al., 2001), and (I − A) can be estimated up to row-scaling and permutation indeterminacies. We then have the following observations.

First, the diagonal entries of A, A_ii, which represent the self influences or "self-loops" of the time-delayed causal relations, cannot be determined (Lacerda et al., 2008). (Here we have assumed A_ii ≠ 1.) This is because the scale of each row of (I − A) is unknown, and so is (1 − A_ii).
Let D_A be the diagonal matrix with A_11, A_22, …, A_nn on its diagonal. Equation (5) is equivalent to

x̃_t = (I − D_A)^{−1}(A − D_A) x̃_t + (I − D_A)^{−1} ē_t,    (10)

in which the coefficient matrix (I − D_A)^{−1}(A − D_A) is free of self-loops; we denote it by A_NoSelfLoop.
Secondly, suppose there is no feedback loop between the processes after removing the self-loops, meaning that (A − D_A) can be permuted to a strictly lower-triangular matrix by equal row and column permutations. According to the LiNGAM model, which assumes there are no self-loops, (I − D_A)^{−1}(A − D_A) in (10) can be uniquely estimated (Shimizu et al., 2006). In other words, if one applies LiNGAM analysis to x̃_t, the estimated causal coefficient from the ith variable to the jth variable is actually (1 − A_jj)^{−1} A_ji. From this we can see whether A_ji is zero or not; furthermore, if the self-loops A_jj are given by prior knowledge, then A is fully identifiable, as illustrated in the sketch below.
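A small sketch (ours, with an illustrative acyclic A) of this relation: the LiNGAM-style coefficients equal (1 − A_jj)^{−1} A_ji, and A is recovered exactly once the self-loops are supplied.

```python
# A sketch (ours) of the self-loop rescaling in (10).
import numpy as np

A = np.array([[0.5, 0.0],
              [0.3, 0.4]])                    # illustrative: x1 -> x2
D_A = np.diag(np.diag(A))
A_noself = np.linalg.inv(np.eye(2) - D_A) @ (A - D_A)
print(A_noself)                  # entry (2,1) is 0.3 / (1 - 0.4) = 0.5

A_rec = D_A + (np.eye(2) - D_A) @ A_noself    # undo the row rescaling
print(np.allclose(A_rec, A))                  # True
```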
Thirdly, suppose there exist feedback loops between the processes after removing the self-loops. In this case, (A − D_A) cannot be permuted to a strictly lower-triangular matrix by equal row and column permutations. The identifiability of A in (5) has been studied by Lacerda et al. (2008): if the feedback loops are disjoint, there are in theory multiple solutions for A_NoSelfLoop, but the most stable solution (the one in which the product of the coefficients along each loop is minimized) is unique.
4 ESTIMATING THE CAUSAL RELATIONS FROM AGGREGATED DATA
In this section, we present the algorithm to estimate A from aggregated data with finite k. Clearly, the larger k is, the more difficult it is to estimate A from aggregated data. Therefore, when k is relatively large (say, larger than 6), we advocate the methods given in Section 3.2 to partially estimate A.
Since the identifiability of A from aggregated data relies on the non-Gaussianity of the error terms, we use Gaussian mixtures to represent their distributions. It is natural to estimate the parameters with the expectation-maximization (EM) algorithm, which, unfortunately, involves an intractably large number of Gaussian components in the posterior. To avoid this issue, we propose to use the stochastic approximation EM (SAEM) algorithm, a variant of EM, and further resort to the conditional particle filter with ancestor sampling (CPF-AS) to achieve computational efficiency.
4.1 STATE-SPACE MODELING
We can consider (3) as a special state-space model:

x̃_t = A^k x̃_{t−1} + H ẽ_t,    (11)

where ẽ_t ≜ (e_{tk}^𝖳, e_{tk−1}^𝖳, …, e_{(t−2)k+2}^𝖳)^𝖳 is the latent state stacking the 2k − 1 noise vectors entering (3); its first nk entries are the noise terms generated at block t, while the remaining n(k − 1) entries are copied over from ẽ_{t−1}. The noise terms e_{tk}, e_{tk−1}, …, e_{(t−1)k+1} share the same distribution for the same channel and are mutually independent. Since non-Gaussianity is essential to the identifiability of A, we use a Gaussian mixture model to represent each channel of the noise term e, i.e., p_{e_i}(e_i) = Σ_{c=1}^{m} w_{i,c} 𝒩(e_i | μ_{i,c}, σ²_{i,c}), where w_{i,c} ≥ 0 and Σ_{c=1}^{m} w_{i,c} = 1, for i = 1, …, n. Correspondingly, each channel of ẽ is also represented by a Gaussian mixture model.
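As a sketch of this per-channel noise model (the two-component parameters below are illustrative placeholders, not the paper's values), equal means with unequal variances give a symmetric but non-Gaussian density:

```python
# A sketch (ours) of sampling one noise channel from a Gaussian mixture.
import numpy as np

def sample_gm(w, mu, sigma, size, rng):
    """Draw from the 1-D Gaussian mixture sum_c w_c N(mu_c, sigma_c^2)."""
    c = rng.choice(len(w), size=size, p=w)
    return rng.normal(np.asarray(mu)[c], np.asarray(sigma)[c])

rng = np.random.default_rng(6)
e_i = sample_gm(w=[0.2, 0.8], mu=[0.0, 0.0], sigma=[1.0, 0.3],
                size=100000, rng=rng)
excess_kurtosis = ((e_i - e_i.mean())**4).mean() / e_i.var()**2 - 3.0
print(excess_kurtosis)            # clearly nonzero, hence non-Gaussian
```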
We aim to estimate the parameters A and the noise terms (if necessary) in the above state-space model. We introduce the additional latent variable z̃t = (z̃t,1, …, z̃t,nk)𝖳, in which z̃t,j ∈ {1, …, m}, to model the distribution of noise terms ẽt by Gaussian mixture models. The joint distribution of the state-space model (11) over both observed and unobserved variables is given by
pθ(x̃_{1:T}, ẽ_{1:T}, z̃_{1:T}) = Π_{t=1}^{T} pθ(x̃_t | x̃_{t−1}, ẽ_t) pθ(ẽ_t | ẽ_{t−1}, z̃_t) pθ(z̃_t).    (12)

The distributions in (12) are specified as follows:

pθ(z̃_t) = Π_{j=1}^{nk} π̃_{j, z̃_{t,j}},    (13a)

pθ(ẽ_t | ẽ_{t−1}, z̃_t) = 𝒩(ẽ_t | F ẽ_{t−1} + μ̃_t, Σ̃_t),    (13b)

pθ(x̃_t | x̃_{t−1}, ẽ_t) = 𝒩(x̃_t | A^k x̃_{t−1} + H ẽ_t, Λ),    (13c)

where F is the shift matrix that copies the noise terms of block t − 1 into the last n(k − 1) entries of ẽ_t. Since there are no additional additive noise terms in the model, we fix Λ to a small value in our estimation algorithm for regularization. μ̃_t is the conditional mean of the innovation of ẽ_t, i.e., μ̃_t = [μ̃_{1,z̃_{t,1}}, …, μ̃_{nk,z̃_{t,nk}}, 0_{1×n(k−1)}]^𝖳. Σ̃_t is a diagonal matrix containing the conditional variance parameters of ẽ_t, i.e., Σ̃_t = diag(σ̃²_{1,z̃_{t,1}}, …, σ̃²_{nk,z̃_{t,nk}}, 0, …, 0). According to the structure of ẽ, the parameters π̃_{j,z̃_{t,j}}, μ̃_{j,z̃_{t,j}}, and σ̃_{j,z̃_{t,j}} are tied to the parameters of e, i.e., π̃_{i+nl,c} = w_{i,c}, μ̃_{i+nl,c} = μ_{i,c}, and σ̃_{i+nl,c} = σ_{i,c}, for i = 1, …, n, l = 0, …, k − 1, and c = 1, …, m.
4.2 STOCHASTIC APPROXIMATION EM
The expectation-maximization (EM) algorithm is usually adopted to find the maximum likelihood estimate of the parameters in a probabilistic model with unobserved variables. We can estimate the parameters θ = (A, w_{i,c}, μ_{i,c}, σ_{i,c}) in (12) using the EM algorithm, which iteratively maximizes a lower bound of the marginal log-likelihood log pθ(x̃_{1:T}) = log Σ_{z̃_{1:T}} ∫ pθ(x̃_{1:T}, ẽ_{1:T}, z̃_{1:T}) dẽ_{1:T}. In the E-step of the k-th iteration, given the parameters θ_{k−1} estimated in the (k − 1)-th iteration, the EM algorithm first computes the posterior distribution pθ_{k−1}(z̃_{1:T}, ẽ_{1:T} | x̃_{1:T}) and then the lower bound 𝒬(θ, θ_{k−1}) = Σ_{z̃_{1:T}} ∫ pθ_{k−1}(z̃_{1:T}, ẽ_{1:T} | x̃_{1:T}) log pθ(x̃_{1:T}, ẽ_{1:T}, z̃_{1:T}) dẽ_{1:T}. In the M-step, the parameters are updated as θ_k = arg max_θ 𝒬(θ, θ_{k−1}).
However, we note that the number of Gaussian mixture components in the posterior distribution grows exponentially with the dimension of the time series n, the aggregation factor k, and the duration of the time series T. Therefore, computing the exact posterior pθ_{k−1}(z̃_{1:T}, ẽ_{1:T} | x̃_{1:T}) and 𝒬(θ, θ_{k−1}) is intractable in this situation. A possible solution is to adopt the Monte Carlo EM (MCEM) algorithm (Wei & Tanner, 1990), which approximately calculates 𝒬(θ, θ_{k−1}) using samples drawn from the posterior distribution pθ_{k−1}(z̃_{1:T}, ẽ_{1:T} | x̃_{1:T}). However, MCEM makes inefficient use of the generated samples, as it discards the samples generated in previous EM iterations. Therefore, a large number of sample points are required for each iteration, which is computationally expensive when the sampling method is complex.
To reduce the number of simulated sample points, we propose to use the stochastic approximation EM (SAEM) algorithm (Delyon et al., 1999), which requires only a single realization of the unobserved variables at each iteration. At the k-th iteration, the E-step and M-step are replaced by the following:

- E-step: Generate a single sample point z̃_{1:T}[k] from the posterior pθ_{k−1}(z̃_{1:T} | x̃_{1:T}), and compute

𝒬̂_k(θ) = (1 − γ_k) 𝒬̂_{k−1}(θ) + γ_k E[log pθ(x̃_{1:T}, ẽ_{1:T}, z̃_{1:T}[k]) | z̃_{1:T}[k], x̃_{1:T}].    (14)

- M-step: Update the parameters by θ_k = arg max_θ 𝒬̂_k(θ).

In (14), {γ_k} is a sequence of decreasing step sizes satisfying Σ_k γ_k = ∞ and Σ_k γ_k² < ∞. Here we use Rao-Blackwellization (Svensson et al., 2014) to avoid sampling ẽ_{1:T}, because it can be integrated out analytically. It has been shown in (Delyon et al., 1999) that the resulting sequence {θ_k}_{k≥1} converges to a stationary point of pθ(x̃_{1:T}) under weak assumptions.
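The SAEM recursion is easiest to see on a toy model with a closed-form MLE (our example, not the model above): latent z_i ~ 𝒩(μ, 1) and observation x_i = z_i + ε_i with ε_i ~ 𝒩(0, 1), so the exact MLE is μ̂ = mean(x). One posterior draw per iteration suffices:

```python
# A toy SAEM loop (ours) with a valid step-size schedule.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(1.5, np.sqrt(2.0), size=2000)

mu, S = 0.0, 0.0
for it in range(1, 201):
    gamma = 1.0 if it <= 50 else 1.0 / (it - 50)  # sum = inf, sum of squares < inf
    z = rng.normal((x + mu) / 2.0, np.sqrt(0.5))  # one draw from p(z | x, mu)
    S = (1 - gamma) * S + gamma * z.mean()        # stochastic approximation
    mu = S                                        # M-step in closed form
print(mu, x.mean())                               # SAEM estimate vs. exact MLE
```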
4.3 CONDITIONAL PARTICLE FILTER WITH ANCESTOR SAMPLING
In our model, sampling from the posterior pθ_{k−1}(z̃_{1:T} | x̃_{1:T}) is usually performed with a forward-filter/backward-simulator particle smoother, which typically requires a large number of particles to generate a smooth backward trajectory z̃_{1:T}[k]. To reduce the number of required particles, we use the Markovian version of SAEM (Kuhn & Lavielle, 2004), which samples from a Markov kernel ℳ_{θ_{k−1}} that leaves the posterior distribution invariant. Specifically, letting z̃_{1:T}[k − 1] be the previous draw from the Markov kernel, the current state is sampled as z̃_{1:T}[k] ~ ℳ_{θ_{k−1}}(· | z̃_{1:T}[k − 1]). Following (Lindsten, 2013; Svensson et al., 2014), we construct the Markov kernel using the Rao-Blackwellized conditional particle filter with ancestor sampling (RB-CPF-AS) (Lindsten et al., 2014), which was originally proposed for Gibbs sampling.

The machinery inside RB-CPF-AS resembles a standard particle filter, with two main differences: one particle trajectory is deterministically set to a reference trajectory z̃_{1:T}[k − 1], and the ancestors of the reference trajectory are randomly chosen and stored during the algorithm's execution. Algorithm 1 gives a brief description of the RB-CPF-AS algorithm. Let {z̃_{1:t−1}^i, w_{t−1}^i}_{i=1}^{N} be the weighted particle system approximating pθ(z̃_{1:t−1} | x̃_{1:t−1}); RB-CPF-AS propagates this sample to time t by introducing the auxiliary variables a_t^i, referred to as ancestor indices. To generate z̃_t^i for the first N − 1 particle trajectories, we first sample the ancestor index a_t^i with P(a_t^i = j) ∝ w_{t−1}^j, and then sample z̃_t^i from the proposal distribution. The first N − 1 trajectories are then augmented as z̃_{1:t}^i = (z̃_{1:t−1}^{a_t^i}, z̃_t^i). The N-th particle is set to the reference particle, z̃_t^N = z̃_t[k − 1], and its ancestor index is sampled according to
(15)

where

(16)
Conditioned on z̃_{1:T}, the conditional distributions of ẽ_t can be computed with the Kalman filter and the Rauch-Tung-Striebel (RTS) smoother. The filtering, prediction, and smoothing PDFs are

(17a)

(17b)

(17c)

respectively. In (16),

(18a)

(18b)

where the required quantities are given by (19a)–(19g). With the terminal condition Ω_T = 0, they can be computed recursively backward in time using (19a)–(19g). Once all the ancestors have been sampled, we can calculate the new particle weights as follows:

(20)
After all the particle trajectories have been generated, we obtain z̃1:T[k] by sampling from these trajectories according to the weights at time T.
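To convey the mechanics of CPF-AS, here is a self-contained sketch (ours) for a one-dimensional linear-Gaussian model x_t = a x_{t−1} + v_t, y_t = x_t + w_t; the paper's RB-CPF-AS additionally Rao-Blackwellizes the continuous state with per-particle Kalman filters.

```python
# A minimal CPF-AS sketch (ours) on a toy linear-Gaussian state-space model.
import numpy as np

def cpf_as(y, x_ref, N, a, q, r, rng):
    T = len(y)
    X = np.zeros((T, N))                          # particle trajectories
    X[0, :N - 1] = rng.normal(0.0, np.sqrt(q), N - 1)
    X[0, N - 1] = x_ref[0]                        # reference particle
    logw = -(y[0] - X[0])**2 / (2 * r)
    for t in range(1, T):
        w = np.exp(logw - logw.max()); w /= w.sum()
        anc = rng.choice(N, size=N - 1, p=w)      # resample ancestors
        X[t, :N - 1] = a * X[t - 1, anc] + rng.normal(0.0, np.sqrt(q), N - 1)
        X[t, N - 1] = x_ref[t]                    # keep the reference path
        # Ancestor sampling: reconnect the reference to a compatible history.
        logw_as = np.log(w + 1e-300) - (x_ref[t] - a * X[t - 1])**2 / (2 * q)
        w_as = np.exp(logw_as - logw_as.max()); w_as /= w_as.sum()
        anc_all = np.concatenate([anc, [rng.choice(N, p=w_as)]])
        X[:t] = X[:t, anc_all]                    # update the genealogies
        logw = -(y[t] - X[t])**2 / (2 * r)
    w = np.exp(logw - logw.max()); w /= w.sum()
    return X[:, rng.choice(N, p=w)].copy()        # draw one trajectory

rng = np.random.default_rng(8)
a, q, r, T = 0.9, 0.5, 0.5, 100
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(0.0, np.sqrt(q))
y = x + rng.normal(0.0, np.sqrt(r), T)

traj = np.zeros(T)                   # arbitrary initial reference trajectory
for sweep in range(20):              # iterating the kernel leaves the
    traj = cpf_as(y, traj, N=20, a=a, q=q, r=r, rng=rng)  # posterior invariant
```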
4.4 PARAMETER UPDATE
At the k-th M-step, given the sample z̃_{1:T}[k] drawn by RB-CPF-AS, we can obtain pθ_{k−1}(ẽ_t | z̃_{1:T}[k], x̃_{1:T}) = 𝒩(ẽ_t | μ̂_{s,t}, Σ̂_{s,t}) using the RTS smoother. Then we have

(21)

where z̃_t = z̃_t[k], ỹ_t = x̃_t − A^k x̃_{t−1}, and q(ẽ_t) = pθ_{k−1}(ẽ_t | z̃_{1:T}, x̃_{1:T}).

It can be seen that we only need the sufficient statistics ∫ ẽ_t q(ẽ_t) dẽ_t and ∫ ẽ_t ẽ_t^𝖳 q(ẽ_t) dẽ_t to maximize (21). Denoting the sufficient statistics at the k-th iteration by S_k, we use the stochastic approximation 𝕊_k = (1 − γ_k) 𝕊_{k−1} + γ_k S_k in maximizing 𝒬̂_k(θ). To maximize 𝒬̂_k(θ) with respect to A, we compute the gradient with respect to A of the terms involving A^k and H and apply a conjugate gradient method, as done in Gong et al. (2015); a toy version of such a gradient computation is sketched below.
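As an illustration of such a gradient (our example: the simpler scalar loss f(A) = ‖A^k − M‖_F² in place of 𝒬̂_k), the chain rule through A^k gives ∇_A f = 2 Σ_{j=0}^{k−1} (A^𝖳)^j (A^k − M)(A^𝖳)^{k−1−j}, which the sketch below checks against finite differences:

```python
# A gradient-through-A^k sketch (ours), verified by finite differences.
import numpy as np

def grad_f(A, M, k):
    G = np.linalg.matrix_power(A, k) - M
    return 2 * sum(np.linalg.matrix_power(A.T, j) @ G
                   @ np.linalg.matrix_power(A.T, k - 1 - j) for j in range(k))

rng = np.random.default_rng(9)
A = rng.uniform(-0.5, 0.5, (2, 2))
M = rng.uniform(-0.5, 0.5, (2, 2))
k, eps = 3, 1e-6

f = lambda B: np.sum((np.linalg.matrix_power(B, k) - M)**2)
Ap = A.copy(); Ap[0, 1] += eps
print(grad_f(A, M, k)[0, 1], (f(Ap) - f(A)) / eps)   # the two values agree
```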
Algorithm 1. RB-CPF-AS

Input: reference trajectory z̃_{1:T}[k − 1], θ = θ_{k−1}
Output: z̃_{1:T}[k] ~ ℳ_{θ_{k−1}}(· | z̃_{1:T}[k − 1])

Compute the backward statistics according to (19a)–(19g)
Draw the initial particles z̃_1^i for i = 1, …, N − 1, and set z̃_1^N = z̃_1[k − 1]
Compute the initial Kalman statistics and weights w_1^i for i = 1, …, N
for t = 2 to T do
  // Resampling and ancestor sampling
  Draw the ancestor indices a_t^i with P(a_t^i = j) ∝ w_{t−1}^j for i = 1, …, N − 1
  Compute the quantities in (18a) and (18b)
  Draw the ancestor index a_t^N of the reference particle according to (15)
  // Particle propagation
  Draw z̃_t^i and set z̃_{1:t}^i = (z̃_{1:t−1}^{a_t^i}, z̃_t^i) for i = 1, …, N − 1
  Set z̃_t^N = z̃_t[k − 1] and z̃_{1:t}^N = (z̃_{1:t−1}^{a_t^N}, z̃_t^N)
  // Weighting
  Update the Kalman filter statistics and compute the weights w_t^i according to (20)
end for
Draw J with P(J = i) ∝ w_T^i and set z̃_{1:T}[k] = z̃_{1:T}^J.
5 EXPERIMENTS
In this section, we conduct empirical studies of the two estimation methods presented in Section 3.2 and Section 4 on both synthetic and real data to show their effectiveness.
5.1 SIMULATED DATA
We conduct a series of simulations to investigate the effectiveness of the proposed estimation methods. Following (Gong et al., 2015), we generated the data at the causal frequency using the VAR model (1) with a randomly generated matrix A and independent Gaussian mixture noises e_t. The elements of A were drawn from a uniform distribution 𝒰(−0.5, 0.5). The Gaussian mixture model contains two components for each channel. The mixing weights were w_{1,1} = 0.2, w_{1,2} = 0.8, w_{2,1} = 0.3, w_{2,2} = 0.7, the means were μ_{i,1} = μ_{i,2} = 0, and the two components had different variances σ²_{i,1} ≠ σ²_{i,2}. Low-resolution observations were obtained by aggregating the high-resolution data with aggregation factor k. Similarly, we also generated data with Gaussian noise (by setting σ²_{i,1} = σ²_{i,2}) for comparison of the different methods. We tested data with dimension n = 2, aggregation factor k = 2 and 3, and sample size T = 150 and 300, respectively. For comparison, we replaced the Gaussian mixture models in our method with Gaussian noise models, yielding a Gaussian baseline. We denote the method proposed in Section 4, which performs causal discovery from temporally aggregated data, by CDTAfinite, and its Gaussian counterpart by CDTAGauss. We also compare with the NG-EM method (Gong et al., 2015) on the aggregated data with non-Gaussian noises. Each experiment was repeated 10 times; a sketch of this data-generating process is given below.
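```python
# A sketch of this data-generating process; the mixture standard deviations
# are our placeholders, since their exact values are not recoverable here.
import numpy as np

rng = np.random.default_rng(10)
n, k, T_low = 2, 2, 300
w = np.array([[0.2, 0.8], [0.3, 0.7]])        # mixing weights per channel
sigma = np.array([[1.0, 0.2], [1.0, 0.2]])    # placeholder std. deviations
A = rng.uniform(-0.5, 0.5, size=(n, n))       # entries from U(-0.5, 0.5)

T = k * T_low
x = np.zeros((T, n))
for t in range(1, T):
    c = np.array([rng.choice(2, p=w[i]) for i in range(n)])
    e_t = rng.normal(0.0, sigma[np.arange(n), c])   # zero-mean GM noise
    x[t] = A @ x[t - 1] + e_t

x_agg = x.reshape(T_low, k, n).mean(axis=1)   # low-resolution observations
```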
Table 1 shows the mean squared error (MSE) of the estimated causal transition matrix A. As the sample size T increases, both the proposed CDTAfinite and the baseline CDTAGauss obtain smaller estimation errors. On the non-Gaussian data, CDTAGauss produces much higher errors than CDTAfinite. On the Gaussian data, neither CDTAfinite nor CDTAGauss obtains accurate estimates, because the estimation algorithms can converge to many solutions with the same marginal likelihood when the noise is Gaussian or a Gaussian noise model is assumed. The results are consistent with the theory: the causal relations are in general not uniquely determined under Gaussian noise models. NG-EM fails on the aggregated data because it was designed for subsampled rather than aggregated data.
Table 1. Mean squared error (MSE) of the estimated causal transition matrix A ("NG" denotes non-Gaussian noise, "G" Gaussian noise).

| Methods | NG, k=2, T=150 | NG, k=2, T=300 | NG, k=3, T=150 | NG, k=3, T=300 | G, k=2, T=150 | G, k=2, T=300 | G, k=3, T=150 | G, k=3, T=300 |
|---|---|---|---|---|---|---|---|---|
| CDTAfinite | 2.10e-4 | 1.19e-4 | 8.17e-4 | 7.36e-4 | 1.42e-2 | 3.67e-3 | 7.63e-3 | 9.69e-3 |
| CDTAGauss | 1.28e-2 | 4.49e-3 | 1.20e-2 | 7.22e-3 | 1.13e-2 | 3.08e-3 | 6.26e-2 | 9.07e-3 |
| NG-EM | 8.75e-2 | 8.51e-2 | 5.27e-1 | 1.88e-1 | - | - | - | - |
Further, we examined the performance of the method described in Section 3.2, denoted CDTAinfty, with finite k values. To do so, we generated aggregated data with a fixed transition matrix A, aggregation factors k = 2, 3, 4, 10, and the same Gaussian mixture noise parameters described above; the true A_NoSelfLoop can then be computed from A. Using the linear instantaneous non-Gaussian model, we obtained estimates of A_NoSelfLoop from the aggregated data. The results for k = 2, 3, 4, 10 are given as follows:

(22)

It appears that when k ≥ 4, the linear instantaneous non-Gaussian causal model (10), which assumes that there is no self-loop, can estimate the corresponding A_NoSelfLoop accurately and very efficiently, at the cost of losing the self-loops in the original process. The self-loops, in contrast, can be estimated with CDTAfinite when k is reasonably small. As a cautionary note, the parameters produced by linear instantaneous causal models, which assume no self-loops, should be interpreted carefully: such models estimate (1 − A_jj)^{−1} A_ji, whose magnitude can be very different from that of the true causal parameter A_ji.
5.2 REAL DATA
We conducted experiments on the Temperature Ozone data (Mooij et al., 2016) and the macroeconomic data used in (Moneta, 2008). Both time series are collected by averaging records over specified time intervals. For example, the Temperature Ozone data contain daily mean values of ozone and temperature for the year 2009 in Chaumont, Switzerland. The macroeconomic data contain quarterly U.S. macro variables for the period 1947:2 to 1994:1.
Temperature Ozone
The Temperature Ozone data is the 50th cause-effect pair from the repository at https://webdav.tuebingen.mpg.de/cause-effect/. The data contain records of ozone density X and daily mean temperature Y, and the ground-truth causal relation is Y → X. We first applied CDTAinfty to the data; the resulting estimate of A_NoSelfLoop indicates that instantaneous effects exist in both directions. This could be caused by aggregation with a small k, for which the estimated A_NoSelfLoop is likely to be inaccurate. We then estimated the causal transition matrix A with CDTAfinite for k = 1, 2, 3; the estimated matrices sensibly captured the self-influences and cross-influences between the ozone and temperature processes.
Macroeconomic Data
The data are quarterly U.S. observations of aggregated real macroeconomic variables. Here we consider the causal relations between two variables: real balances X and price inflation Y. X denotes the logarithm of per capita M2 minus the logarithm of the implicit price deflator; Y is the log of the implicit price deflator at time t minus the log of the implicit price deflator at time t − 1. Again, we first applied CDTAinfty to find a rough estimate of the causal relations excluding self-loops. The estimated A_NoSelfLoop indicates that no influence from effect to cause can be detected in the instantaneous dependencies, which is consistent with the ground truth. We also employed CDTAfinite to estimate the complete causal transition matrix A for k = 1, 2, 3, 4. The estimated A gives weaker responses from effect Y to cause X as k increases. If we take k = 4 as the aggregation factor, the A_NoSelfLoop computed from the estimated A is close to the estimate obtained by CDTAinfty.
6 CONCLUSION
In this paper, we have investigated the problem of discovering high-frequency causal relations from temporally aggregated time series. When the aggregation factor is finite, we proved that the causal relations are fully identifiable if the underlying causal relations are linear, the noise process is non-Gaussian, and some technical conditions hold. We also showed that the causal matrix with self-loops removed is identifiable from the instantaneous dependencies as the aggregation factor goes to infinity. Based on these results, we proposed an algorithm to recover the complete causal matrix when the aggregation factor is relatively small, and a very efficient algorithm to partially recover the matrix when the aggregation factor is relatively large. Future work will focus on automatically estimating the aggregation factor k from data.
Acknowledgments
The authors would like to thank Dr. Tongliang Liu for helpful discussions. DT and MG would like to acknowledge the support from DP-140102164, FT-130101457, and LP-150100671. CG and KZ would like to acknowledge the support from NIH-1R01EB022858-01 FAIN-R01EB022858, NIH-1R01LM012087, and NIH-5U54HG008540-02 FAINU54HG008540.
APPENDIX
PROOF OF THEOREM 1
Proof
Here we consider the limit when T → ∞. According to the identifiability result (7) for A^k, we have

A^k = A′^k.    (23)
We then consider the remaining error term, e⃗_t. The corresponding random vector e⃗ follows both the representation (9) and

e⃗ = H′e′,    (24)

with H′ defined as in (8) but with A replaced by A′, and

e′ ≜ (e′^{(0)𝖳}, e′^{(1)𝖳}, …, e′^{(2k−2)𝖳})^𝖳,    (25)

where for each i, the e′^{(l)}_i, l = 0, …, 2k − 2, have the same distribution p_{e′_i}.
According to Proposition 1, each column of H′ is a scaled version of a column of H. Denote by H_{ln+i}, l = 0, …, 2k − 2; i = 1, …, n, the (ln + i)-th column of H, and similarly for H′. According to the Uniqueness Theorem in Eriksson & Koivunen (2004), we know that under condition A3, for each i there exists one and only one j such that the distribution of the e_i^{(l)}, l = 0, …, 2k − 2 (which have the same distribution), is the same as the distribution of the e′^{(l)}_j, l = 0, …, 2k − 2, up to changes of location and scale. As a consequence, the columns {H′_{ln+j} | l = 0, …, 2k − 2} correspond to {H_{ln+i} | l = 0, …, 2k − 2} up to the permutation and scaling arbitrariness.
According to the structure of H, for all m ≤ k − 2 we have H_{(k−1)n+i} = H_{mn+i} + H_{(m+k)n+i}, and similarly for H′. Hence H_{(k−1)n+i} is proportional to H′_{(k−1)n+j}; write H_{ln+i} = λ_{li} H′_{ln+j} for the corresponding scale factors. Assuming that A^{k−1} and A′^{k−1} are non-diagonal matrices, and since H_i and H′_j must be proportional to columns of I, as implied by the structure of H and H′, we can see that λ_{0i} = 1 and that i = j. Consequently, λ_{(k−1)i} must be 1 or −1. Let B = I + A + ⋯ + A^{k−1} and B′ = I + A′ + ⋯ + A′^{k−1}; we thus have B = B′D, where D is a diagonal matrix with 1 or −1 as its diagonal entries. Moreover, because AB − B = A^k − I, A′B′ − B′ = A′^k − I, and A^k = A′^k, we have

A′ − I = (A − I)D.    (26)
If both A′ and A have diagonal entries which are smaller than 1, D must be the identity matrix, i.e., A′ = A. Therefore statement (i) is true.
If each p_{e_i} is asymmetric, e_i and −e_i have different distributions. Consequently, the representation (24) no longer holds if one changes the signs of a subset, but not all, of the non-zero columns H′_{ln+i}. This implies that for non-zero H_{ln+i} the scale factors λ_{li}, including λ_{0i}, have the same sign, and they therefore all equal 1 since λ_{0i} = 1. In particular λ_{(k−1)i} = 1, which leads to D = I and thus gives A′ = A. That is, statement (ii) is true.
References
- Boot JCG, Feibes W, Lisman JHC. Further methods of derivation of quarterly figures from annual data. Applied Statistics. 1967:65–75.
- Breitung J, Swanson NR. Temporal aggregation and spurious instantaneous causality in multiple time series models. Journal of Time Series Analysis. 2002;23:651–665.
- Danks D, Plis S. Learning causal structure from undersampled time series. In: JMLR: Workshop and Conference Proceedings; 2013.
- Delyon B, Lavielle M, Moulines E. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics. 1999:94–128.
- Eriksson J, Koivunen V. Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Processing Letters. 2004;11(7):601–604.
- Geiger P, Zhang K, Gong M, Schölkopf B, Janzing D. Causal inference by identification of vector autoregressive processes with hidden components. In: 32nd International Conference on Machine Learning; 2015. pp. 1917–1925.
- Ghysels E, Hill JB, Motegi K. Testing for Granger causality with mixed frequency data. Journal of Econometrics. 2016;192(1):207–230.
- Gong M, Zhang K, Schölkopf B, Tao D, Geiger P. Discovering temporal causal relations from subsampled data. In: ICML; 2015. pp. 1898–1906.
- Granger CWJ. Investigating causal relations by econometric models and cross-spectral methods. Econometrica. 1969:424–438.
- Granger CWJ. Implications of aggregation with common factors. Econometric Theory. 1987;3(02):208–222.
- Harvey AC, Chung CH. Estimating the underlying change in unemployment in the UK. Journal of the Royal Statistical Society, Series A. 2000;163:303–309.
- Hyttinen A, Plis S, Järvisalo M, Eberhardt F, Danks D. Causal discovery from subsampled time series data by constraint optimization. In: International Conference on Probabilistic Graphical Models; 2016. pp. 216–227.
- Hyvärinen A, Karhunen J, Oja E. Independent Component Analysis. John Wiley & Sons; 2001.
- Kuhn E, Lavielle M. Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics. 2004;8:115–131.
- Lacerda G, Spirtes P, Ramsey J, Hoyer PO. Discovering cyclic causal models by independent components analysis. In: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI 2008); Helsinki, Finland; 2008.
- Lindsten F. An efficient stochastic approximation EM algorithm using conditional particle filters. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2013. pp. 6274–6278.
- Lindsten F, Jordan MI, Schön TB. Particle Gibbs with ancestor sampling. Journal of Machine Learning Research. 2014;15(1):2145–2184.
- Marcellino M. Some consequences of temporal aggregation in empirical analysis. Journal of Business and Economic Statistics. 1999;17:129–136.
- Moauro F, Savio G. Temporal disaggregation using multivariate structural time series models. The Econometrics Journal. 2005;8:210–234.
- Moneta A. Graphical causal models and VARs: an empirical assessment of the real business cycles hypothesis. Empirical Economics. 2008;35(2):275–300.
- Mooij JM, Peters J, Janzing D, Zscheischler J, Schölkopf B. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research. 2016;17(32):1–102.
- Palm FC, Nijman TE. Missing observations in the dynamic regression model. Econometrica. 1984;52:1415–1435.
- Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Press; 2000.
- Plis S, Danks D, Freeman C, Calhoun V. Rate-agnostic (causal) structure learning. In: Advances in Neural Information Processing Systems; 2015a. pp. 3303–3311.
- Plis S, Danks D, Yang J. Mesochronal structure learning. In: Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence; 2015b.
- Proietti T. Temporal disaggregation by state space methods: dynamic regression methods revisited. The Econometrics Journal. 2006;9:357–372.
- Rajaguru G, Abeysinghe T. Temporal aggregation, cointegration and causality inference. Economics Letters. 2008;101:223–226.
- Shimizu S, Hoyer PO, Hyvärinen A, Kerminen AJ. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research. 2006;7:2003–2030.
- Silvestrini A, Veredas D. Temporal aggregation of univariate and multivariate time series models: a survey. Journal of Economic Surveys. 2008;22:458–497.
- Sims CA. Macroeconomics and reality. Econometrica. 1980;48:1–48.
- Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. 2nd ed. MIT Press; 2001.
- Stram DO, Wei WWS. A methodological note on the disaggregation of time series totals. Journal of Time Series Analysis. 1986;7(4):293–302.
- Svensson A, Schön TB, Lindsten F. Identification of jump Markov linear models using particle filters. In: 53rd IEEE Conference on Decision and Control (CDC); 2014. pp. 6504–6509.
- Tank A, Fox EB, Shojaie A. Identifiability and estimation of structural vector autoregressive models for subsampled and mixed frequency time series. arXiv preprint arXiv:1704.02519; 2017.
- Tiao GC. Asymptotic behaviour of temporal aggregates of time series. Biometrika. 1972:525–531.
- Van Nes EH, Scheffer M, Brovkin V, Lenton TM, Ye H, Deyle E, Sugihara G. Causal feedbacks in climate change. Nature Climate Change. 2015;5(5):445–448.
- Wei GCG, Tanner MA. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association. 1990;85(411):699–704.
- Weiss A. Systematic sampling and temporal aggregation in time series models. Journal of Econometrics. 1984;26:271–281.
- Zhou D, Zhang Y, Xiao Y, Cai D. Analysis of sampling artifacts on the Granger causality analysis for topology extraction of neuronal dynamics. Frontiers in Computational Neuroscience. 2014;8. doi: 10.3389/fncom.2014.00075.