Author manuscript; available in PMC: 2016 May 11.
Published in final edited form as: Adv Neural Inf Process Syst. 2015 Dec;28:3303–3311.

Rate-Agnostic (Causal) Structure Learning

Sergey Plis 1, David Danks 2, Cynthia Freeman 3, Vince Calhoun 4
PMCID: PMC4863709  NIHMSID: NIHMS759948  PMID: 27182188

Abstract

Causal structure learning from time series data is a major scientific challenge. Extant algorithms assume that measurements occur sufficiently quickly; more precisely, they assume approximately equal system and measurement timescales. In many domains, however, measurements occur at a significantly slower rate than the underlying system changes, but the size of the timescale mismatch is often unknown. This paper develops three causal structure learning algorithms, each of which discovers all dynamic causal graphs that could explain the observed measurement data, given possible undersampling. That is, these algorithms all learn causal structure in a “rate-agnostic” manner: they do not assume any particular relation between the measurement and system timescales. We apply these algorithms to data from simulations to gain insight into the challenge of undersampling.

1 Introduction

Dynamic causal systems are a major focus of scientific investigation in diverse domains, including neuroscience, economics, meteorology, and education. One significant limitation in all of these sciences is the difficulty of measuring the relevant variables at an appropriate timescale for the particular scientific domain. This challenge is particularly salient in neuroimaging: standard fMRI experiments sample the brain’s bloodflow approximately every one or two seconds, though the underlying neural activity (i.e., the major driver of bloodflow) occurs much more rapidly. Moreover, the precise timescale of the underlying causal system is unknown; it is almost certainly faster than the fMRI measurements, but it is unknown how much faster.

In this paper, we aim to learn the causal structure of a system that evolves at timescale τS, given measurements at timescale τM. We focus on the case in which τS is faster than τM to an unknown degree. We assume that the underlying causal structure can be modeled as a directed graphical model G without simultaneous influence. There has been substantial work on modeling the statistics of time series, but relatively less on learning causal structure, and almost all of that assumes that the measurement and causal timescales match [1–5]. The problem of causal learning from “undersampled” time series data was explicitly addressed by [6, 7], but they assumed that the degree of undersampling—i.e., the ratio of τS to τM—was both known and small. In contrast, we focus on the significantly harder challenge of causal learning when that ratio is unknown.

We provide a formal specification of the problem and representational framework in Section 2. We then present three different Rate-Agnostic Structure Learning (RASL) algorithms in Section 3. We finish in Section 4 by exploring their performance on synthetic data.

2 Representation and Formalism

A dynamic causal graphical model consists of a graph G over random variables V at the current time t, as well as nodes for V at all previous (relative) timesteps that contain a direct cause of a variable at the current timestep.1 The Markov order of the system is the largest k such that Vi^{t−k} → Vj^t, where superscripts denote timestep. We assume throughout that the “true” underlying causal system is Markov order 1, and that all causally relevant variables are measured.2 Finally, we assume that there are no isochronal causal edges Vi^t → Vj^t; causal influences inevitably take time to propagate, and so any apparent isochronal edge will disappear when measured sufficiently finely. Since we do not assume that the causal timescale τS is known, this is a relatively innocuous assumption.

G1 is thus over 2V nodes, where the only edges are Vi^{t−1} → Vj^t, where possibly i = j. There is additionally a conditional probability distribution or density, P(Vt | Vt−1), which we assume to be time-independent. We do not, however, assume stationarity of P(Vt). Finally, we assume appropriate versions of the Markov (“Variable V is independent of non-descendants given parents”) and Faithfulness/Stability (“The only independencies are those implied by Markov”) assumptions, such that the graph and probability distribution/density mutually constrain each other.

Let {t0, t1, …, tk, …} be the measurement timesteps. We undersample at rate u when we measure only timesteps {t0, tu, …, tuk, …}; the causal timescale is thus “undersampled at rate 1.” We denote the causal graph resulting from undersampling at rate u by Gu. To obtain Gu, we “unroll” G1 by introducing nodes for Vt−2 that bear the same graphical and parametric relations to Vt−1 as those variables bear to Vt, and iterate until we have included Vt−u. We then marginalize out all variables except those in Vt and Vt−u.

Marginalization yields an Acyclic Directed Mixed Graph (ADMG) Gu containing both directed and bidirected edges [8]. Vi^{t−u} → Vj^t in Gu iff there is a directed path from Vi^{t−u} to Vj^t in the unrolled graph. Define a trek to be a pair of directed paths (π1, π2) such that both have the same start variable. Vi^t ↔ Vj^t in Gu iff there is a trek between Vi^t and Vj^t with length(π1) = length(π2) = k < u. Clearly, if a bidirected edge occurs in Gm, then it occurs in Gu for all u ≥ m.

Unrolling-and-marginalizing can be computationally complex due to duplication of nodes, and so we instead use compressed graphs that encode temporal relations in edges. For an arbitrary dynamic causal graph H, ℋ is its compressed graph representation: (i) ℋ is over non-time-indexed nodes for V; (ii) Vi → Vj in ℋ iff Vi^{t−1} → Vj^t in H; and (iii) Vi ↔ Vj in ℋ iff Vi^t ↔ Vj^t in H. Compressed graphs can be cyclic (Vi ⇄ Vj when both Vi^{t−1} → Vj^t and Vj^{t−1} → Vi^t), including self-cycles. There is clearly a 1-1 mapping between dynamic ADMGs and compressed graphs.

Computationally, the effects of undersampling at rate u can be computed in a compressed graph simply by finding directed paths of length u in 𝒢1. More precisely, Vi → Vj in 𝒢u iff there is a directed path of length u from Vi to Vj in 𝒢1. Similarly, Vi ↔ Vj in 𝒢u iff there is a trek between Vi and Vj with length(π1) = length(π2) = k < u in 𝒢1. We thus use compressed graphs going forward.
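To make this concrete, here is a minimal Python sketch of forward inference on a compressed graph: from an adjacency-set representation of 𝒢1 it computes the directed edges (paths of length exactly u) and bidirected edges (treks with equal-length sides k < u) of 𝒢u. The representation and the name undersample are illustrative choices, not the authors' implementation.

```python
from itertools import product

def undersample(g1, u):
    """Compute the measurement-timescale graph G^u from a compressed G^1.

    g1: dict mapping each node to the set of its children (self-loops allowed).
    Returns (directed, bidirected): directed edges as (i, j) pairs, bidirected
    edges as frozensets {i, j} with i != j.
    """
    nodes = list(g1)
    # reach[k][v]: nodes reachable from v by a directed path of length exactly k
    reach = {1: {v: set(g1[v]) for v in nodes}}
    for k in range(2, u + 1):
        reach[k] = {v: {w for x in reach[k - 1][v] for w in g1[x]} for v in nodes}

    directed = {(i, j) for i in nodes for j in reach[u][i]}
    bidirected = set()
    for k in range(1, u):                         # trek sides of equal length k < u
        for start in nodes:
            for i, j in product(reach[k][start], repeat=2):
                if i != j:
                    bidirected.add(frozenset((i, j)))
    return directed, bidirected

g1 = {1: {2}, 2: {3}, 3: {1, 3}}                  # a 3-cycle with a self-loop on node 3
print(undersample(g1, 2))
```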

3 Algorithms

The core question of this paper is: given ℋ = 𝒢u for unknown u, what can be inferred about 𝒢1? Let ⟦ℋ⟧ = {𝒢1 : ∃u 𝒢u = ℋ} be the equivalence class of 𝒢1 that could, for some undersample rate, yield ℋ. We are thus trying to learn ⟦ℋ⟧ from ℋ. An obvious brute-force algorithm is: for each possible 𝒢1, compute the corresponding graphs for all u, and then output every 𝒢1 for which some 𝒢u = ℋ. Equally obviously, this algorithm will be computationally intractable for any reasonable n, as there are 2^(n²) possible 𝒢1 and u can (in theory) be arbitrarily large. Instead, we pursue three different constructive strategies that more efficiently “build” the members of ⟦ℋ⟧ (Sections 3.2, 3.3, and 3.4). Because these algorithms make no assumptions about u, we refer to each of them as a RASL (Rate-Agnostic Structure Learner) and use subscripts to distinguish the variants. First, though, we provide some key theoretical results about forward inference that will be used by all three algorithms.

3.1 Nonparametric Forward Inference

For given 𝒢1 and u, there is an efficient algorithm [9] for calculating 𝒢u, but it is only useful in learning if we have stopping rules that constrain which 𝒢1 and u should ever be considered. These rules will depend on how 𝒢u changes as u → ∞. A key notion is a strongly connected component (SCC) in 𝒢1: a maximal set of variables S ⊆ V such that, for every X, Y ∈ S (possibly X = Y), there is a directed path from X to Y. Non-singleton SCCs are clearly cyclic and can provably be decomposed into a set of (possibly overlapping) simple loops (i.e., those in which no node is repeated): σ1, …, σs [10]. Let ℒS be the set of those simple loop lengths.
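As an illustration, the simple loops of a compressed 𝒢1, and hence ℒS, can be enumerated with Johnson's algorithm [10]; the sketch below relies on networkx for the enumeration and also computes gcd(ℒS), which drives the stopping rule introduced next. This is an illustrative sketch, not the authors' code.

```python
import math
from functools import reduce
import networkx as nx

def simple_loop_lengths(g1_edges):
    """Lengths of all simple loops (elementary circuits) of a compressed graph."""
    g = nx.DiGraph(g1_edges)                        # edge list; self-loops allowed
    return [len(c) for c in nx.simple_cycles(g)]    # Johnson's algorithm [10]

edges = [(1, 2), (2, 3), (3, 1), (3, 3)]            # 3-node SCC: one 3-loop, one self-loop
lengths = simple_loop_lengths(edges)                # L_S = {3, 1}
print(sorted(lengths), reduce(math.gcd, lengths))   # gcd(L_S) = 1
```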

One stopping rule must specify, for given 𝒢1, which u to consider. For a single SCC, the greatest common divisor of simple loop lengths (where gcd(ℒS) = 1 for singleton S) is key: gcd(ℒS) = 1 iff ∃f s.t. ∀u > f[𝒢u = 𝒢f]; that is, gcd() determines whether an SCC “converges” to a fixed-point graph as u → ∞. We can constrain u if there is such a fixed-point graph, and Theorem 3.1 generalizes [9, Theorem 5] to provide an upper bound on (interesting) u. (All proofs found in supplement.)

Theorem 3.1

If gcd(ℒS) = 1, then stabilization occurs at f ≤ nF + γ + d + 1.

where nF is the Frobenius number,3 d is the graph diameter, and γ is the transit number (see supplement). This is a theoretically useful bound, but it is not practically helpful since neither γ nor nF has a known analytic expression. Moreover, gcd(ℒS) = 1 is a weak restriction, but a restriction nonetheless. We instead use a functional stopping rule for u (Theorem 3.2) that holds for all 𝒢:

Theorem 3.2

If 𝒢u = 𝒢v for u > v, then ∀w > u ∃k < u [𝒢w = 𝒢k].

That is, as u increases, if we find a graph that we previously encountered, then there cannot be any new graphs as u → ∞. For a given 𝒢1, we can thus determine all possible corresponding undersampled graphs by computing 𝒢2, 𝒢3, … until we encounter a previously-observed graph. This stopping rule enables us to (correctly) constrain the u that are considered for each 𝒢1.
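With the undersample sketch from Section 2, this rule yields a simple routine that enumerates every graph obtainable from a candidate 𝒢1; again an illustrative sketch, not the authors' code.

```python
def all_undersamples(g1):
    """All distinct graphs obtainable from G^1 by undersampling.

    Following Theorem 3.2, stop as soon as some G^u repeats a previously seen
    graph, since no new graphs can appear for larger u.
    """
    seen, result = set(), []
    u = 1
    while True:
        gu = undersample(g1, u)                       # sketch from Section 2
        key = (frozenset(gu[0]), frozenset(gu[1]))
        if key in seen:
            return result
        seen.add(key)
        result.append(gu)
        u += 1
```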

We also require a stopping rule for 𝒢1, as we cannot evaluate all 2^(n²) possible graphs for any reasonable n. The key theoretical result is:

Theorem 3.3

If 𝒢1 ⊆ 𝒥1, then ∀u[𝒢u ⊆ 𝒥u].

Let 𝒢E1 be the graph resulting from adding the edges in E to 𝒢1. Since this is simply another graph, it can be undersampled at rate u; denote the result (𝒢E1)u. Since 𝒢E1 can always serve as 𝒥1 in Theorem 3.3, we immediately have the following two corollaries:

Corollary 3.4

Corollary 3.4

If 𝒢u ⊈ ℋ, then ∀E [(𝒢E1)u ⊈ ℋ].

Corollary 3.5

If ∀u[𝒢u ⊈ ℋ], then ∀E, u [(𝒢E1)u ⊈ ℋ].

We thus have a stopping rule for some candidate 𝒢1: if 𝒢u is not an edge-subset of ℋ for all u, then do not consider any edge-superset of 𝒢1. This stopping rule fits very cleanly with “constructive” algorithms that iteratively add edge(s) to candidate 𝒢1. We now develop three such algorithms.
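The sketch below packages both stopping rules into a single check on a candidate 𝒢1, using the all_undersamples helper above; ℋ is represented as a pair of edge sets, and the labels 'member', 'extendable', and 'reject' are illustrative, not the paper's terminology.

```python
def classify(g1, h_directed, h_bidirected):
    """Check a candidate G^1 against the input graph H.

    Returns 'member' if some G^u equals H; 'extendable' if some G^u is an
    edge-subset of H, so edge-supersets of G^1 are still worth exploring; and
    'reject' otherwise (by Corollary 3.5, no edge-superset of G^1 can work).
    """
    status = 'reject'
    for d, b in all_undersamples(g1):
        if d == h_directed and b == h_bidirected:
            return 'member'
        if d <= h_directed and b <= h_bidirected:   # edge-subset of H
            status = 'extendable'
    return status
```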

3.2 A recursive edgewise inverse algorithm

The two stopping rules naturally suggest a recursive structure learning algorithm with ℋ as input and ⟦ℋ⟧ as output. Start with an empty graph. For each edge e (of n² possible edges), construct 𝒢1 containing only e. If 𝒢u ⊈ ℋ for all u, then reject; else if 𝒢u = ℋ for some u,4 then add 𝒢1 to ⟦ℋ⟧; else, recurse into non-conflicting graphs in order. Effectively, this is a depth-first search (DFS) algorithm on the solution tree; denote it as RASLre for “recursive edgewise.” Figure 1a provides pseudo-code, and Figure 1b shows how one DFS path in the search tree unfolds. We can prove:

Figure 1.

Figure 1

RASLre algorithm 1a specification, and 1b search tree example

Theorem 3.6

The RASLre algorithm is correct and complete.

One significant drawback of RASLre is that the same graph can be constructed in many different ways, corresponding to different orders of edge addition; the search tree is actually a search lattice. The algorithm is thus unnecessarily inefficient, even when we use dynamic programming via memoization of input graphs.

3.3 An iterative edgecentric inverse algorithm

To minimize multiple constructions of the same graph, we can use RASLie (“iterative edgewise”), which generates, at stage i, all not-yet-eliminated 𝒢1 with exactly i edges. More precisely, at stage 0, RASLie starts with the empty graph; if ℋ is also empty, then it adds the empty graph to ⟦ℋ⟧. Otherwise, it moves to stage 1. In general, at stage i + 1, RASLie (a) considers each graph 𝒢1 resulting from a single edge addition to an acceptable graph at stage i; (b) rejects 𝒢1 if it conflicts (for all u) with ℋ; (c) otherwise keeps 𝒢1 as acceptable at i + 1; and (d) if ∃u[𝒢u = ℋ], then adds 𝒢1 to ⟦ℋ⟧. RASLie continues until there are no more edges to add (or it reaches stage n² + 1). Figure 2 provides the main loop (Figure 2a) and core function of RASLie (Figure 2c), as well as an example of the number of graphs potentially considered at each stage (Figure 2b). RASLie provides significant speed-up and memory gains over RASLre (see Figure 3).
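A compact sketch of this stage-wise loop, built on the classify helper from Section 3.1; the edge-set representation is an illustrative choice, and the sketch omits the single-edge bookkeeping and analytic constraints discussed below.

```python
from itertools import product

def rasl_ie(h_directed, h_bidirected, nodes):
    """Iterative edgewise search: stage i holds the acceptable G^1 with i edges."""
    if not h_directed and not h_bidirected:
        return [frozenset()]                        # stage 0: H is the empty graph
    all_edges = set(product(nodes, repeat=2))       # n^2 possible directed edges
    equivalence_class = []
    current = {frozenset()}                         # start from the empty graph
    while current:
        nxt = set()
        for edges in current:
            for e in all_edges - edges:             # (a) single-edge extensions
                cand = edges | {e}
                if cand in nxt:
                    continue                        # already generated at this stage
                g1 = {v: {w for x, w in cand if x == v} for v in nodes}
                verdict = classify(g1, h_directed, h_bidirected)
                if verdict == 'reject':
                    continue                        # (b) conflicts with H for every u
                nxt.add(cand)                       # (c) acceptable at stage i + 1
                if verdict == 'member':
                    equivalence_class.append(cand)  # (d) some G^u equals H
        current = nxt
    return equivalence_class
```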

Figure 2.

Figure 2

RASLie algorithm (a) main loop; (b) example of graphs considered; and (c) core function.

Figure 3.

Figure 3

Run-time comparison.

We optimize RASLie by tracking the single edges that could possibly still be added; for example, if a single-edge graph is rejected in stage 1, then we do not consider adding that edge at any later stage. Additional conflicts can be derived analytically, further reducing the graphs to consider. In general, absence of an edge in ℋ implies, for the corresponding (unknown) u, absence of length-u paths in all 𝒢1 ∈ ⟦ℋ⟧. Since we do not know u, we cannot directly apply this constraint. However, Lemmas 3.7 and 3.8 provide useful special-case constraints that hold whenever u > 1, which is implied by the presence of even a single bidirected edge in ℋ.

Lemma 3.7

If u > 1, then for every pair V, W with no directed edge V → W in ℋ, 𝒢1 cannot contain any of the following paths: 1. V → W with a self-loop on V; 2. V → X → W with a self-loop on V; 3. V → X → W with a self-loop on X; 4. V → X → W with a self-loop on W; 5. V → W with a self-loop on W.

Lemma 3.8

If u > 1, then for every pair V, W with no bidirected edge V ↔ W in ℋ, ∄T [V ← T → W] in 𝒢1.

3.4 An iterative loopcentric inverse algorithm

RASLie yields results in reasonable time for ℋ with up to 8 nodes, though it is computationally demanding. We can gain further computational advantages if we assume that ℋ is an SCC. This assumption is relatively innocuous, as it requires only that our time series be generated by a system with (appropriate) feedback loops. As noted earlier, any SCC is composed of a set of simple loops, and so we modify RASLie to iteratively add loops instead of edges; call the resulting algorithm RASLil for “iterative loopwise.” More precisely, RASLil uses the same algorithm as in Figure 2, but successively attempts to add non-conflicting simple loops, rather than non-conflicting edges. RASLil also incorporates the additional constraints due to Lemmas 3.7 and 3.8.

RASLil is surprisingly much faster than RASLie even though, for n nodes, there are ∑_{i=1}^{n} C(n, i)·(i − 1)! simple loops (compared to n² possible edges). The key is that introducing a single simple loop induces multiple constraints simultaneously, and so conflicting graphs are discovered at a much earlier stage. As a result, RASLil checks many fewer graphs in practice. For example, consider the 𝒢1 in Figure 1, with corresponding ℋ for u = 3. RASLre constructs (not counting pruned single edges) 28,661 graphs; RASLie constructs only 249 graphs; and RASLil considers only 47. For u = 2, these numbers are 413, 44, and 7 respectively. Unsurprisingly, these differences in numbers of examined graphs translate directly into wall-clock time differences (Figure 3).
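As a quick check of this count, the snippet below evaluates the sum directly; for n = 5 it gives 89 simple loops versus 5² = 25 possible edges.

```python
from math import comb, factorial

def n_simple_loops(n):
    """sum_{i=1..n} C(n, i) * (i - 1)!: one term per subset size i of loop nodes."""
    return sum(comb(n, i) * factorial(i - 1) for i in range(1, n + 1))

print(n_simple_loops(5))   # 89, versus 5**2 = 25 possible directed edges
```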

4 Results

All three RASL algorithms take a measurement timescale graph ℋ as input. They are therefore compatible with any structure learning algorithm that outputs a measurement timescale graph, whether Structural Vector Autoregression (SVAR) [11], direct Dynamic Bayes Net search [12], or modifications of standard causal structure learning algorithms such as PC [1, 13] and GES [14]. The problem of learning a measurement timescale graph is a very hard one, but is also not our primary focus here. Instead, we focus on the performance of the novel RASL algorithms.

First, we abstract away from learning measurement timescale structure and assume that the correct ℋ is provided as input. For these simulated graphs, we focus on SCCs, which are the most scientifically interesting cases. For simplicity (and because within-SCC structure can be learned in parallel for a complex ℋ [9]), we employ single-SCC graphs. To generate random SCCs, we (i) build a single simple loop over n nodes, and (ii) uniformly sample from the other n(n − 1) possible edges until we reach the specified density (i.e., proportion of the n2 total possible edges). We employ density in order to measure graph complexity in an (approximately) n-independent way.
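A sketch of this generation procedure (the function name, seeding, and rounding of the edge target are illustrative assumptions, not the exact simulation code):

```python
import random

def random_scc(n, density, seed=None):
    """Random single-SCC compressed graph G^1: (i) a simple loop over all n
    nodes, then (ii) extra edges sampled uniformly until the requested density
    (fraction of the n^2 possible edges, self-loops included) is reached."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    edges = {(order[i], order[(i + 1) % n]) for i in range(n)}   # base simple loop
    others = [(i, j) for i in range(n) for j in range(n) if (i, j) not in edges]
    rng.shuffle(others)
    while len(edges) < round(density * n * n) and others:
        edges.add(others.pop())
    return {v: {w for x, w in edges if x == v} for v in range(n)}

g1 = random_scc(6, 0.25, seed=1)                                 # e.g. a 6-node, 25% SCC
```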

We can improve the runtime of RASLre using memoization, though it is then memory-constrained for n ≥ 6. Figure 3 provides the wall-clock running times for all three RASL algorithms applied to 100 random 5-node graphs at each of three densities. This figure substantiates our earlier claims that RASLil is faster than RASLie, which is faster than RASLre. In fact, each is at least an order of magnitude faster than the previous one.

RASLre would take over a year on the most difficult problems, so we focus exclusively on RASLil. Unsurprisingly, run-time complexity of all RASL algorithms depends on the density of ℋ. For each of three density values (20%, 25%, 30%), we generated 100 random 6-node SCCs, which were then undersampled at rates 2, 3, and 4 before being provided as input to RASLil. Figure 4 summarizes wall clock computation time as a function of ℋ’s density, with different plots based on density of 𝒢1 and undersampling rate. We also show three examples of ℋ with a range of computation runtime. Unsurprisingly, the most difficult ℋ is quite dense; ℋ with densities below 50% typically require less than one minute.

Figure 4.

Figure 4

Run-time behavior.

4.1 Equivalence classes

We first use RASLil to determine ⟦ℋ⟧ size and composition for varying ℋ; that is, we explore the degree of underdetermination produced by undersampling. The worst-case underdetermination occurs if ℋ is a super-clique with every possible edge: ∀X, Y [X → Y & X ↔ Y]. Any SCC with gcd(ℒS) = 1 becomes a super-clique as u → ∞ [9], so ⟦ℋ⟧ contains all such graphs for super-clique ℋ. We thus note when ℋ is a super-clique, rather than computing the size of ⟦ℋ⟧.

Figures 5 and 6 plot equivalence class size as a function of both 𝒢1 density and the true undersampling rate. For each n and density, we (i) generated 100 random 𝒢1; (ii) undersampled each at indicated u; (iii) passed 𝒢u = ℋ to RASLil; and (iv) computed the size of ⟦ℋ⟧. Interestingly, ⟦ℋ⟧ is typically quite small, sometimes even a singleton. For example, 5-node graphs at u = 2 typically have singleton ⟦ℋ⟧ up to 40% 𝒢1 density. Even 10-node graphs often have a singleton ⟦ℋ⟧ (though with relatively sparse 𝒢1). Increased undersampling and density both clearly worsen underdetermination, but often not intractably so, particularly since even nonsingleton ⟦ℋ⟧ can be valuable if they permit post hoc inspection or analysis.

Figure 5.

Figure 5

Size of equivalence classes for 100 random SCCs at each density and u ∈ {2, 3, 4}.

Figure 6.

Figure 6

Size of equivalence classes for larger graphs, n ∈ {7, 8, 10}, for u ∈ {2, 3}.

To focus on the impact of undersampling, we generated 100 random 5-node SCCs with 25% density, each of which was undersampled for u ∈ {2, …, 11}. Figure 7 plots the size of ⟦ℋ⟧ as a function of u for these graphs. For u ≤ 4, singleton ⟦ℋ⟧ still dominate. Interestingly, even u = 11 still yields some non-superclique ℋ.

Figure 7.

Figure 7

Effect of the undersampling rate on equivalence class size.

Finally, 𝒢1 ∈ ⟦ℋ⟧ iff ∃u[𝒢u = ℋ], but the appropriate u need not be the same for all members of ⟦ℋ⟧. Figure 8 plots the percentages of u-values appropriate for each 𝒢1 ∈ ⟦ℋ⟧, for the ℋ from Figure 5. If the true rate is utrue = 2, then almost all 𝒢1 ∈ ⟦ℋ⟧ are members because of 𝒢2; there are rarely 𝒢1 ∈ ⟦ℋ⟧ due to u > 2. If utrue > 2, though, then many 𝒢1 ∈ ⟦ℋ⟧ are due to 𝒢u with u ≠ utrue. As density and utrue increase, there is increased underdetermination in both 𝒢1 and u.

Figure 8.

Figure 8

Distribution of u for 𝒢u = ℋ for 𝒢1 ∈ ⟦ℋ⟧ for 5- and 6-node graphs

4.2 Synthetic data

In practice, we typically must learn ℋ structure from finite sample data. As noted earlier, there are many algorithms for learning ℋ, as it is a measurement timescale structure (though small modifications are required to learn bidirected edges). In pilot testing, we found that structural vector autoregressive (SVAR) model [11] optimization provided the most accurate and stable solutions for ℋ for our simulation regime. We thus employ the SVAR procedure here, though we note that other measurement timescale learning algorithms might work better in different domains.

To test the two-step procedure—SVAR learning passed to RASLil—we generated 20 random 6-node SCCs for each density in {25%, 30%, 35%}. For each random graph, we generated a random transition matrix A by sampling weights for the non-zero elements of the adjacency matrix, and controlling system stability (by keeping the maximal eigenvalue at or below 1). We then generated time series data using a vector auto-regressive (VAR) model [11] with A and random noise (σ = 1). To simulate undersampling, datapoints were removed to yield u = 2. SVAR optimization on the resulting time series yielded a candidate ℋ that was passed to RASLil to obtain ⟦ℋ⟧.
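The data-generation step can be sketched as follows; the sample length, seed, and spectral-radius rescaling detail are illustrative assumptions rather than the exact simulation settings.

```python
import numpy as np

def simulate_undersampled(g1, T=2000, u=2, seed=0):
    """Generate VAR(1) data from a compressed G^1 and keep every u-th sample.

    g1 maps node indices 0..n-1 to their sets of children. Weights for the
    nonzero entries of the transition matrix A are drawn at random, and A is
    rescaled so its maximal eigenvalue magnitude is at most 1 (stability).
    """
    rng = np.random.default_rng(seed)
    n = len(g1)
    A = np.zeros((n, n))
    for i, children in g1.items():
        for j in children:
            A[j, i] = rng.normal()            # edge i -> j: node i drives node j
    radius = max(abs(np.linalg.eigvals(A)))
    if radius > 1:
        A /= radius
    X = np.zeros((n, T))
    for t in range(1, T):
        X[:, t] = A @ X[:, t - 1] + rng.normal(scale=1.0, size=n)   # noise sigma = 1
    return X[:, ::u]                          # undersampling: drop intermediate samples
```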

The space of possible ℋ is a factor of 2^(n(n−1)/2) larger than the space of possible 𝒢1 (one factor of two for each possible bidirected edge), and so SVAR optimization can return an ℋ such that ⟦ℋ⟧ = ∅. If RASLil returns ∅, then we rerun it on all ℋ* that result from a single edge addition or deletion on ℋ. If RASLil returns ∅ for all of those graphs, then we consider the ℋ* that result from two changes to ℋ, and then from three changes. This search through the 3-step Hamming neighborhood of ℋ essentially always finds an ℋ* with ⟦ℋ*⟧ ≠ ∅.

Figure 9 shows the results of the two-step process, where algorithm output is evaluated by two error types. Omission error: the number of omitted edges, normalized by the total number of edges in the ground truth. Commission error: the number of edges not present in the ground truth, normalized by the total number of possible edges minus the number of edges in the ground truth. We also plot the estimation errors of SVAR (on the undersampled data) to capture the dependence of RASLil estimation errors on estimation errors for ℋ. Interestingly, RASLil does not significantly increase the error rates over those produced by the SVAR estimation. In fact, we find the contrary (similarly to [6]): the requirement to use an ℋ that could be generated by some undersampled 𝒢1 functions as a regularization constraint that corrects some SVAR estimation errors.
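For concreteness, a small sketch of the two error measures on directed-edge sets; the normalizations follow the definitions above, and the function name is illustrative.

```python
def edge_errors(estimated, truth, n):
    """Omission and commission error rates for directed edges over n nodes.

    estimated, truth: sets of (i, j) directed edges (self-loops allowed).
    """
    omission = len(truth - estimated) / len(truth) if truth else 0.0
    commission = len(estimated - truth) / (n * n - len(truth)) if n * n > len(truth) else 0.0
    return omission, commission
```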

Figure 9.

Figure 9

The estimation and search errors on synthetic data: 6-node graphs, u = 2, 20 per density.

5 Conclusion

Time series data are widespread in many scientific domains, but if the measurement and system timescales differ, then we can make significant causal inference errors [9, 15]. Despite this potential for numerous errors, there have been only limited attempts to address this problem [6, 7], and even those methods required strong assumptions about the undersample rate.

We here provided the first causal inference algorithms that can reliably learn causal structure from time series data when the system and measurement timescales diverge to an unknown degree. The RASL algorithms are complex, but not restricted to toy problems. We also showed that underdetermination of 𝒢1 is sometimes minimal, given the right methods. ⟦ℋ⟧ was often small; substantial system timescale causal structure could be learned from undersampled measurement timescale data. Significant open problems remain, such as more efficient methods when ℋ has ⟦ℋ⟧ = ∅. This paper has, however, expanded our causal inference “toolbox” to include cases of unknown undersampling.

Acknowledgments

This work was supported by awards NIH R01EB005846 (SP); NSF IIS-1318759 (SP); NSF IIS-1318815 (DD); & NIH U54HG008540 (DD) (from the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative).

Footnotes

1

We use difference equations in our analyses. The results and algorithms will be applicable to systems of differential equations to the extent that they can be approximated by a system of difference equations.

2

More precisely, we assume a dynamic variant of the Causal Sufficiency assumption, though it is more complicated than just “no unmeasured common causes.”

3

For a set B = {b1, …, bm} of positive integers with gcd(B) = 1, nF is the largest integer that cannot be written as ∑_{i=1}^{m} αi·bi with integers αi ≥ 0.

4

This check requires at most min(eu, eℋ) + 1 (fast) operations, where eu and eℋ are the numbers of edges in 𝒢u and ℋ, respectively. This equality check occurs relatively rarely, since 𝒢u and ℋ must be non-conflicting.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Contributor Information

Sergey Plis, Email: s.m.plis@gmail.com, The Mind Research Network, Albuquerque, NM.

David Danks, Email: ddanks@cmu.edu, Carnegie-Mellon University, Pittsburgh, PA.

Cynthia Freeman, Email: cynthiaw2004@gmail.com, The Mind Research Network, CS Dept., University of New Mexico, Albuquerque, NM.

Vince Calhoun, Email: vcalhoun@mrn.org, The Mind Research Network, ECE Dept., University of New Mexico, Albuquerque, NM.

References

1. Moneta A, Chlaß N, Entner D, Hoyer P. Causal search in structural vector autoregressive models. JMLR: Workshop and Conference Proceedings, Causality in Time Series (Proc. NIPS 2009 Mini-Symposium on Causality in Time Series); 2011. pp. 95–114.
2. Granger CWJ. Investigating causal relations by econometric models and cross-spectral methods. Econometrica. 1969;37(3):424–438.
3. Thiesson B, Chickering D, Heckerman D, Meek C. ARMA time-series modeling with graphical models. Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI-04); Arlington, Virginia. AUAI Press; 2004. pp. 552–560.
4. Voortman M, Dash D, Druzdzel M. Learning why things change: the difference-based causality learner. Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI); Corvallis, Oregon. AUAI Press; 2010. pp. 641–650.
5. Friedman N, Murphy K, Russell S. Learning the structure of dynamic probabilistic networks. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence; San Francisco. Morgan Kaufmann; 1999. pp. 139–147.
6. Plis S, Danks D, Yang J. Mesochronal structure learning. Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI-15); Corvallis, Oregon. AUAI Press; 2015.
7. Gong M, Zhang K, Schölkopf B, Tao D, Geiger P. Discovering temporal causal relations from subsampled data. Proc. ICML; 2015. pp. 1898–1906.
8. Richardson T, Spirtes P. Ancestral graph Markov models. The Annals of Statistics. 2002;30(4):962–1030.
9. Danks D, Plis S. Learning causal structure from undersampled time series. JMLR: Workshop and Conference Proceedings; 2013. pp. 1–10.
10. Johnson DB. Finding all the elementary circuits of a directed graph. SIAM Journal on Computing. 1975;4(1):77–84.
11. Lütkepohl H. New Introduction to Multiple Time Series Analysis. Springer; 2007.
12. Murphy K. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley; 2002.
13. Glymour C, Spirtes P, Scheines R. Causal inference. In: Erkenntnis Orientated: A Centennial Volume for Rudolf Carnap and Hans Reichenbach. Springer; 1991. pp. 151–189.
14. Chickering DM. Optimal structure identification with greedy search. Journal of Machine Learning Research. 2003;3:507–554.
15. Seth AK, Chorley P, Barnett LC. Granger causality analysis of fMRI BOLD signals is invariant to hemodynamic convolution but not downsampling. NeuroImage. 2013;65:540–555. doi:10.1016/j.neuroimage.2012.09.049.
