Published in final edited form as: Proc IEEE Int Conf Data Min. 2017 Dec 18;2017:913–918. doi: 10.1109/ICDM.2017.114

Behind Distribution Shift: Mining Driving Forces of Changes and Causal Arrows

Biwei Huang, Kun Zhang, Jiji Zhang, Ruben Sanchez-Romero, Clark Glymour, Bernhard Schölkopf

Abstract

We address two important issues in causal discovery from nonstationary or heterogeneous data, where parameters associated with a causal structure may change over time or across data sets. First, we investigate how to efficiently estimate the “driving force” of the nonstationarity of a causal mechanism. That is, given a causal mechanism that varies over time or across data sets and whose qualitative structure is known, we aim to extract from data a low-dimensional and interpretable representation of the main components of the changes. For this purpose we develop a novel kernel embedding of nonstationary conditional distributions that does not rely on sliding windows. Second, the embedding also leads to a measure of dependence between the changes of causal modules that can be used to determine the directions of many causal arrows. We demonstrate the power of our methods with experiments on both synthetic and real data.

I. INTRODUCTION

A fundamental problem in science and engineering is to discover and make use of causal relations among variables of interest. The standard way to establish causal relations relies on interventions or randomized experiments, which, however, are often difficult or even impossible to conduct. Consequently, inferring causal relations from observational data, or from combinations of observational and experimental data, known as causal discovery [9], [6], has drawn much attention in several disciplines over the past three decades.

Most causal discovery methods assume that there is a fixed causal model underlying the observed data and aim to estimate it from the data. In this setting, constraint-based causal discovery methods [9], [6] make use of conditional independence relationships among variables to infer the equivalence class of the underlying causal structure. With the rapid accumulation of huge volumes of data of various types, collected data often exhibit distribution shift, which can occur across data sets or over time. From a causal standpoint, the shift in the joint distribution of the data may result from changes in just a few local causal mechanisms or modules because of varied background variables or experimental conditions, while a large portion of the data-generating process remains the same.

As illustrated in [11], applying causal discovery methods that assume a fixed causal model to data with distribution shift may lead to spurious extra causal edges; accordingly, it is desirable to develop causal analysis methods specifically for such data. A procedure was proposed in [11] that asymptotically correctly recovers the skeleton of the causal structure over the observed variables and locates the changing causal modules. In this paper we build on that work and aim to address two further problems that arise after the skeleton of the causal structure is learned.

  1. How to efficiently estimate the nonstationary “driving force” of a causal mechanism that changes over time or across data sets? An interpretable representation of the main components of the nonstationarity will greatly enhance understanding of the data generating process.

  2. How to make use of distribution shifts to determine causal directions in a system with an arbitrary number of variables? Such a method will supplement the classic Meek orientation rules [5] to derive more causal information from nonstationary/heterogeneous data.

Both problems are essential components of our causal analysis framework for nonstationary/heterogeneous data. Regarding problem 1, traditionally one may use Bayesian change point detection to detect change points in observed time series [1], or one may use sliding-window-based methods. However, Bayesian change point detection only detects changes in marginal distributions, whereas causal mechanisms are represented by conditional distributions; moreover, neither approach is appropriate when the causal mechanisms change continuously over time. The method proposed in [4] can automatically learn how the causal model changes over time, but it requires the assumption of linearity and fails to handle cases where the nonstationarity results from changes in the influence of the noise. Problem 2, as a sub-problem of causal discovery, exploits a generalized notion of the invariance property [10] or the exogeneity property [2], [13] of causal systems: if there is no nonstationary confounder for Vi and Vj, then the causal mechanisms, represented by the conditional distributions P(Vi | PAi) and P(Vj | PAj), change independently over time or across data sets.

The paper is organized as follows. After reviewing the procedure of causal skeleton discovery in the case of distribution shift in Section II, we present our solution to problem 1 in Section III, in which we assume that the direct causes of the considered variable are known. Then we address problem 2 in Section IV. We discuss problem 1 before problem 2, since our method for dealing with problem 2 takes advantage of the technical results derived for solving problem 1. Section V gives the experimental results on both synthetic and real data.

II. CAUSAL SKELETON DISCOVERY FROM NONSTATIONARY/HETEROGENEOUS DATA

Suppose that the underlying causal structure over variables $\mathbf{V} = \{V_i\}_{i=1}^{m}$ is represented by a DAG G. For each Vi, let PAi denote the set of parents of Vi in G. Suppose that at each time point or in each domain, the joint probability distribution of $\mathbf{V}$ factorizes according to G: $P(\mathbf{V}) = \prod_{i=1}^{m} P(V_i \mid PA^i)$. We call each P(Vi | PAi) a causal module. Changes in the modules may be due to changes of causal strengths, of influences from the noise, etc. We assume that those quantities that change over time or across domains can be written as functions of a time or domain index, which we denote by C. If the changes in some modules are related, one can treat the situation as if there exists some unobserved quantity that influences the changes of those modules simultaneously. We call such quantities nonstationary confounders.

We assume that the local causal process for each Vi can be represented by the following structural equation model (SEM):

$$V_i = f_i\big(PA^i, g^i(C), \theta_i(C), \epsilon_i\big), \qquad (1)$$

where $g^i(C) \subseteq \{g_l(C)\}_{l=1}^{L}$ denotes the set of nonstationary confounders that influence Vi (it is an empty set if there is no confounder behind Vi and any other variable), $\theta_i(C)$ denotes the effective parameters of the model, which are also assumed to be functions of C, and $\epsilon_i$ is a disturbance term that is independent of C and has non-zero variance (i.e., the model is not deterministic). The noise terms $\epsilon_i$ are assumed to be independent and identically distributed.
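To make the model class concrete, the following minimal simulation (our illustration, not taken from the paper) instantiates model (1) for a single effect whose causal strength varies smoothly with the time index C; the specific functional forms and constants are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 600
c = np.linspace(0.0, 1.0, N)            # time/domain index C

# Smoothly varying effective parameter theta(C); illustrative choice.
theta = 1.0 + np.sin(2 * np.pi * c)

x = rng.normal(size=N)                  # root cause: P(X) is stationary here
eps = 0.3 * rng.normal(size=N)          # i.i.d. disturbance, independent of C
y = theta * x + eps                     # V_i = f_i(PA^i, theta(C), eps_i)

# The causal module P(Y | X, C) changes with C through theta(C),
# while P(X) does not; lambda(C) should track theta(C) up to scale.
```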

The procedure for causal discovery from nonstationary data proposed in [11] is briefly described in Algorithm 1. Step 2 identifies nonstationary causal modules; Step 3 discovers the skeleton of the causal structure over $\mathbf{V}$, i.e., an undirected graph representing which variables are adjacent in the underlying causal structure. The (asymptotic) correctness of the procedure is justified by the following theorem, proved in [11].

Theorem 1. Given the above assumptions, for every $V_i, V_j \in \mathbf{V}$, Vi and Vj are not adjacent in G if and only if they are independent conditional on some subset of {Vk | k ≠ i, k ≠ j} ∪ {C}.

III. NONSTATIONARY DRIVING FORCE ESTIMATION

In this section, we focus on discovering how a causal module P(Vi | PAi) changes, i.e., where the changes occur, how fast they occur, and how to visualize them. We assume that we already know the causal structure and which causal modules are nonstationary (see Algorithm 1).

Algorithm 1.

Detection of Changing Modules and Recovery of Causal Skeleton

1) Build a complete undirected graph UC on the variable set $\mathbf{V}$ ∪ {C}.
2) (Detection of changing modules) For each i, test for the marginal and conditional independence between Vi and C. If they are independent given a subset of {Vk | k ≠ i}, remove the edge between Vi and C in UC.
3) (Recovery of causal skeleton) For every i ≠ j, test for the marginal and conditional independence between Vi and Vj. If they are independent given a subset of {Vk | k ≠ i, k ≠ j} ∪ {C}, remove the edge between Vi and Vj in UC.
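A schematic rendering of Algorithm 1, assuming a user-supplied conditional independence test `ci_test(i, j, S)` (for instance, the KCI test of [12]); this sketch shows the control flow only and searches subsets exhaustively, unlike an efficient PC-style implementation.

```python
from itertools import combinations

def skeleton_with_c(n_vars, ci_test, max_cond=2):
    """Algorithm 1 (schematic): detect changing modules, recover skeleton.

    Variables are 0..n_vars-1; index n_vars plays the role of C.
    ci_test(i, j, S) should return True if V_i _||_ V_j given set S.
    """
    c = n_vars
    nodes = list(range(n_vars + 1))
    edges = {frozenset(e) for e in combinations(nodes, 2)}  # complete graph

    def try_remove(i, j, candidates):
        for size in range(max_cond + 1):
            for S in combinations(candidates, size):
                if ci_test(i, j, set(S)):
                    edges.discard(frozenset((i, j)))
                    return

    # Step 2: test each V_i against C given subsets of the other variables.
    for i in range(n_vars):
        try_remove(i, c, [k for k in range(n_vars) if k != i])
    # Step 3: test V_i against V_j given subsets of {V_k} U {C}.
    for i, j in combinations(range(n_vars), 2):
        try_remove(i, j, [k for k in range(n_vars) if k not in (i, j)] + [c])
    return edges
```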

In the parametric case, if we know which parameters of the causal model PAi → Vi are changing (e.g., the mean of a root cause or the coefficients in a linear SEM), then we can estimate those parameters for different values of C and see how they change. However, such knowledge is usually not available, and for the sake of flexibility it is better to model the causal processes nonparametrically. We therefore develop a general nonparametric procedure for capturing the nonstationarity of changing causal modules.

We aim to find a low-dimensional mapping of P(Vi | PAi) which captures its nonstationarity in a nonparametric way:

$$\lambda_i(C) = h_i\big(P(V_i \mid PA^i, C)\big). \qquad (2)$$

Note that changes in P(Vi | PAi) are irrelevant to changes in P(PAi), and accordingly, they are not necessarily the same as changes in the joint distribution P(Vi, PAi). If Vi is a root cause, then PAi is an empty set, and P(Vi | PAi) reduces to the marginal distribution P(Vi).

We call λi(C) the nonstationary driving force of P(Vi | PAi, C). If P(Vi | PAi, C) does not change along with C, then λi(C) remains constant. Otherwise, λi(C) is intended to capture the variability of P(Vi | PAi, C) across different values of C.

Two problems remain to be solved. The first is how to conveniently represent the conditional distributions given only observed data. The second is how to make λi(C) capture the variability of the conditional distribution along C. We tackle both problems using kernels [7] and accordingly propose a method called Nonstationary Driving Force Estimation (NoDFEs) of causal modules.

A. Kernel Embedding of Constructed Joint Distributions

Notation:

Throughout the paper, we use the following notation. Let X be a random variable on domain $\mathcal{X}$, and let $(\mathcal{H}, k)$ be a reproducing kernel Hilbert space (RKHS) with a measurable kernel on $\mathcal{X}$. Let $\phi(x) \in \mathcal{H}$ denote the feature map for each $x \in \mathcal{X}$, with $\phi: \mathcal{X} \to \mathcal{H}$. We assume integrability: $\mathbb{E}_X[k(X, X)] < \infty$. Similar notation applies to the variables Y and C. The cross-covariance operator $\mathcal{C}_{YX}: \mathcal{H} \to \mathcal{G}$ is defined as $\mathcal{C}_{YX} := \mathbb{E}_{YX}[\phi(X) \otimes \psi(Y)]$, where $\mathcal{G}$ is the RKHS associated with Y.

Intuitively, to represent the kernel embedding of nonstationary causal modules, we need to consider P(Vi | PAi) for each value of C separately. If C is a domain index, for each value of C we have a dataset of (Vi, PAi). If C is a time index, one may use a sliding window and take the data of (Vi, PAi) within a window of length L centered at C = c. However, it may be hard to find an appropriate window length L, especially when the causal module changes fast. In the following, we propose a way to estimate the kernel embedding of nonstationary causal modules on the whole dataset, avoiding window segmentation. For conciseness, below we use Y and X to denote Vi and PAi, respectively.

Instead of working with P(Y | X, C = cn) (n = 1, …, N) directly, we "virtually" construct a particular distribution $\tilde{P}(\underline{Y}, X \mid C = c_n)$ as follows:¹

$$\tilde{P}(\underline{Y}, X \mid C = c_n) = P(Y \mid X, C = c_n)\, P(X).$$

The constructed distribution $\tilde{P}(\underline{Y}, X \mid C = c_n)$ captures changes in P(Y | X, C = cn) across different cn.

Proposition 1 shows that the kernel embedding of the distribution $\tilde{P}(\underline{Y}, X \mid C = c_n)$ can be estimated on the whole dataset, without window segmentation.

Proposition 1. Let X represent the direct causes of Y, and suppose that we have N observations of them. The kernel embedding of the distribution $\tilde{P}(\underline{Y}, X \mid C = c_n)$ can be represented as

$$\hat{\tilde{\mu}}_{\underline{Y}, X \mid C = c_n} = \frac{1}{N}\, \Phi_y \big( K_x \odot K_c + \lambda I \big)^{-1} \mathrm{diag}(k_{c, c_n})\, K_x \Phi_x^{\top},$$

where $\Phi_y \triangleq [\phi(y_1), \ldots, \phi(y_N)]$, $\Phi_x \triangleq [\phi(x_1), \ldots, \phi(x_N)]$, $k_{c, c_n} := [k(c_1, c_n), \ldots, k(c_N, c_n)]^{\top}$, and $\odot$ denotes the pointwise product.
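In finite samples, the embedding is thus represented by an N × N coefficient matrix sandwiched between the (implicit) feature matrices $\Phi_y$ and $\Phi_x$. A minimal numpy sketch of that coefficient matrix, assuming precomputed kernel matrices `Kx` and `Kc` and the column vector `kc_n` $= k_{c, c_n}$:

```python
import numpy as np

def embedding_weights(Kx, Kc, kc_n, lam=1e-3):
    """Coefficient matrix W_n of Proposition 1, so that the embedding
    equals (1/N) * Phi_y @ W_n @ Phi_x.T with feature maps left implicit."""
    N = Kx.shape[0]
    A = Kx * Kc + lam * np.eye(N)            # K_x (.) K_c + lambda*I
    return np.linalg.solve(A, np.diag(kc_n) @ Kx) / N
```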

B. Nonstationary Driving Force Estimation As an Eigenvalue Decomposition Problem

Next, we use the estimated kernel embeddings of the distributions, $\hat{\tilde{\mu}}_{\underline{Y}, X \mid C = c_n}$ (n = 1, …, N), as input, and aim to find $\hat{\lambda}(C)$ as a low-dimensional representation of $\tilde{\mu}_{\underline{Y}, X \mid C = c_n}$ that captures its variability across different values of C. This can be readily achieved with kernel principal component analysis (KPCA) [8], which computes principal components in a kernel space of the input.

To perform KPCA, we first need the N × N Gram matrix of $\hat{\tilde{\mu}}_{\underline{Y}, X \mid C = c}$. If we use a linear kernel, the (c, c′)th entry of the Gram matrix $M^{l}_{\underline{Y}X}$ is the inner product between $\hat{\tilde{\mu}}_{\underline{Y}, X \mid C = c}$ and $\hat{\tilde{\mu}}_{\underline{Y}, X \mid C = c'}$:

$$M^{l}_{\underline{Y}X}(c, c') \triangleq \mathrm{tr}\big( \hat{\tilde{\mu}}_{\underline{Y}, X \mid C = c}^{\top}\, \hat{\tilde{\mu}}_{\underline{Y}, X \mid C = c'} \big) = \frac{1}{N^2}\, k_{c, c}^{\top} \Big[ K_x^{3} \odot \big( (K_x \odot K_c + \lambda I)^{-1} K_y (K_x \odot K_c + \lambda I)^{-1} \big) \Big] k_{c, c'},$$

which is the (c, c′)th entry of the matrix

$$M^{l}_{\underline{Y}X} = \frac{1}{N^2}\, K_c \Big[ K_x^{3} \odot \big( (K_x \odot K_c + \lambda I)^{-1} K_y (K_x \odot K_c + \lambda I)^{-1} \big) \Big] K_c. \qquad (3)$$

If we use a Gaussian kernel with kernel width $\sigma_2$, the Gram matrix is given by

$$M^{g}_{\underline{Y}X}(c, c') = \exp\!\Big( -\frac{ \|\tilde{\mu}_{\underline{Y}, X \mid C = c} - \tilde{\mu}_{\underline{Y}, X \mid C = c'}\|_F^2 }{ 2\sigma_2^2 } \Big) = \exp\!\Big( -\frac{ M^{l}_{\underline{Y}X}(c, c) + M^{l}_{\underline{Y}X}(c', c') - 2 M^{l}_{\underline{Y}X}(c, c') }{ 2\sigma_2^2 } \Big), \qquad (4)$$

where $\|\cdot\|_F$ denotes the Frobenius norm.
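Both Gram matrices reduce to a handful of matrix operations on the kernel matrices $K_x$, $K_y$, and $K_c$. A sketch under the assumption that λ and $\sigma_2$ are set by hand (the paper learns or sets them as described below):

```python
import numpy as np

def gram_linear(Kx, Ky, Kc, lam=1e-3):
    """Eq. (3): linear-kernel Gram matrix of the embeddings."""
    N = Kx.shape[0]
    A_inv = np.linalg.inv(Kx * Kc + lam * np.eye(N))
    inner = np.linalg.matrix_power(Kx, 3) * (A_inv @ Ky @ A_inv)
    return Kc @ inner @ Kc / N**2

def gram_gaussian(Ml, sigma2=1.0):
    """Eq. (4): Gaussian-kernel Gram matrix built from the linear one."""
    d = np.diag(Ml)
    sq_dist = d[:, None] + d[None, :] - 2 * Ml   # ||mu_c - mu_c'||_F^2
    return np.exp(-sq_dist / (2 * sigma2**2))
```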

Finally, $\hat{\lambda}_i(C)$ can be found by performing eigenvalue decomposition on the above Gram matrix, $M^{l}_{\underline{Y}X}$ or $M^{g}_{\underline{Y}X}$; for details see [8]. In practice, one may take the first few eigenvectors, which capture most of the variance.

Note that with our method we never need to explicitly learn the high-dimensional kernel embedding $\tilde{\mu}_{\underline{Y}, X \mid C = c}$ for each c. With the kernel trick, the final Gram matrix is expressed directly in terms of N × N kernel matrices, and the nonstationary driving force $\hat{\lambda}_i(C)$ is then estimated by eigenvalue decomposition of the Gram matrix.

Algorithm 2 summarizes the proposed NoDFEs method. Several hyperparameters must be set. The hyperparameters associated with $K_x$ and $K_c$ and the regularization parameter λ in equation (3) are learned within a Gaussian process regression framework, by maximizing the marginal likelihood. The hyperparameters associated with $K_y$ and the kernel width $\sigma_2$ in equation (4) are set to empirical values. See [12] for details.

Change in marginal distributions.

As a special case, when we are concerned with how the marginal distribution of Y changes with C, i.e., when X = ∅, we have $\mu_{Y \mid C = c_n} = \mathcal{C}_{YC}\, \mathcal{C}_{CC}^{-1} \phi(c_n)$; this can also be obtained by constraining X in $\tilde{\mu}_{\underline{Y}, X \mid C = c_n}$ to a fixed value. The empirical estimate is

$$\hat{\mu}_{Y \mid C = c_n} = \Phi_y (K_c + \lambda I)^{-1} k_{c, c_n}.$$

Then the Gram matrix with a linear kernel is

$$M^{l}_{Y} = K_c (K_c + \lambda I)^{-1} K_y (K_c + \lambda I)^{-1} K_c.$$
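Under the same assumptions as the sketch after Eq. (4), the marginal case is a one-function special case with $K_x$ dropped:

```python
import numpy as np

def gram_linear_marginal(Ky, Kc, lam=1e-3):
    """Linear-kernel Gram matrix when X is empty, i.e., for P(Y | C)."""
    N = Ky.shape[0]
    B = np.linalg.inv(Kc + lam * np.eye(N))
    return Kc @ B @ Ky @ B @ Kc
```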
Algorithm 2.

NoDFEs of Causal Modules P(Y | X)

1) Input: N observations of X and Y.
2) Calculate the Gram matrix $M_{\underline{Y}X}$ (see Eq. (3) for linear kernels and Eq. (4) for Gaussian kernels).
3) Find $\hat{\lambda}_i(C)$ by feeding the Gram matrix $M_{\underline{Y}X}$ directly to KPCA; that is, perform eigenvalue decomposition on $M_{\underline{Y}X}$ to find the nonlinear principal components $\hat{\lambda}_i(C)$ [8].
4) Output: the estimated nonstationary driving force $\hat{\lambda}_i(C)$.
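Putting the pieces together, a sketch of Algorithm 2 that reuses `gram_linear` and `gram_gaussian` from the sketch after Eq. (4); the centering and scaling of the components follow standard KPCA [8]:

```python
import numpy as np

def nodfes(Kx, Ky, Kc, lam=1e-3, sigma2=1.0, n_components=1):
    """Algorithm 2 (sketch): estimate the driving force lambda(C)."""
    M = gram_gaussian(gram_linear(Kx, Ky, Kc, lam), sigma2)
    N = M.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    eigvals, eigvecs = np.linalg.eigh(H @ M @ H)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Column j holds the j-th nonlinear principal component at c_1..c_N.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```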

IV. CAUSAL DIRECTION ESTIMATION BY DEPENDENCE MINIMIZATION

In this section, we propose a nonparametric method to determine causal directions by exploiting the independence property between causal modules. Suppose that X → Y; if only one of the distributions P(X) and P(Y|X) changes, the independent-change property still holds, because a constant is independent of any variable. Therefore, below we do not separately study the case where only one of the two considered variables is adjacent to C, but treat it as a special case.

We also note that to accelerate causal direction determination, one may first apply Meek's orientation rules [5] to derive the equivalence class and then use the procedure proposed below to find some of the orientations left undetermined by the equivalence class.

A. Two-Variable Case

For simplicity, let us start with the two-variable case: suppose that X and Y are adjacent and at least one of them is adjacent to C, and there are no confounders behind them. We aim to identify the causal direction between them, which, without loss of generality, we assume to be XY. The guiding idea is that distribution shift may carry information that confirms “independence” of causal modules, which, in the simple case we are considering, is the “independence” between P(X) and P(Y|X). If P(X) and P(Y |X) are “independent” but P(Y) and P(X|Y ) are not, then the causal direction is inferred to be from X to Y.

The dependence between P(X) and P(Y|X) can be estimated by extending the Hilbert-Schmidt Independence Criterion (HSIC) [3].

a). HSIC:

Given a set of observations {(u1, v1), (u2, v2), …, (uN, vN)} for variables U and V, HSIC provides a statistic for testing their statistical independence as well as a measure of their dependence. Let MU and MV be the Gram matrices of U and V calculated on the sample. An estimator of HSIC is given by [3]

$$\mathrm{HSIC}_{UV} = \frac{1}{(N-1)^2}\, \mathrm{tr}(M_U H M_V H), \qquad (5)$$

where H is used to center the features, with entries $H_{ij} \triangleq \delta_{ij} - N^{-1}$.

We will use a normalized version of the estimated HSIC, which is invariant to the scales of $M_U$ and $M_V$:

$$\mathrm{HSIC}^{N}_{UV} = \frac{\mathrm{HSIC}_{UV}}{\frac{1}{N-1}\,\mathrm{tr}(M_U H) \cdot \frac{1}{N-1}\,\mathrm{tr}(M_V H)} = \frac{\mathrm{tr}(M_U H M_V H)}{\mathrm{tr}(M_U H)\, \mathrm{tr}(M_V H)}. \qquad (6)$$
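Equation (6) is immediate to compute from two Gram matrices; a minimal sketch:

```python
import numpy as np

def normalized_hsic(MU, MV):
    """Eq. (6): scale-invariant HSIC estimate from Gram matrices MU, MV."""
    N = MU.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    return np.trace(MU @ H @ MV @ H) / (
        np.trace(MU @ H) * np.trace(MV @ H))
```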

b). Dependence between Nonstationary Modules and Causal Direction Estimation:

In our case, we aim to check whether P(Y|X,C) and P(X|C) change independently as C changes. We work with the estimates of their embeddings: we treat $\{(\hat{\mu}_{X \mid C = c},\, \hat{\tilde{\mu}}_{\underline{Y}, X \mid C = c})\}_{c = c_1}^{c_N}$ as observed data pairs and measure the dependence between the two components from these pairs.

This can be done by applying the normalized HSIC estimate of equation (6) to the above data pairs. The expression then involves $M_X$, the Gram matrix of $\hat{\mu}_{X \mid C}$ at C = c1, c2, …, cN, and $M_{\underline{Y}X}$, the Gram matrix of $\hat{\tilde{\mu}}_{\underline{Y}, X \mid C}$ at C = c1, c2, …, cN. In particular, the dependence between P(Y|X,C) and P(X|C) on the given data can be estimated by

$$\hat{\Delta}_{X \to Y} = \frac{\mathrm{tr}(M_X H M_{\underline{Y}X} H)}{\mathrm{tr}(M_X H)\, \mathrm{tr}(M_{\underline{Y}X} H)}. \qquad (7)$$

Similarly, for the hypothetical direction Y → X, the dependence between P(X|Y,C) and P(Y|C) on the data is estimated by

$$\hat{\Delta}_{Y \to X} = \frac{\mathrm{tr}(M_Y H M_{\underline{X}Y} H)}{\mathrm{tr}(M_Y H)\, \mathrm{tr}(M_{\underline{X}Y} H)}. \qquad (8)$$

We have the following rule to infer the causal direction between X and Y.

Causal Direction Inference Rule:

Suppose that X and Y are two random variables with N observations. We assume that X and Y are adjacent, that at least one of them is adjacent to C, and that there are no confounders behind them. If $\hat{\Delta}_{X \to Y} < \hat{\Delta}_{Y \to X}$, where the two quantities are given by equations (7) and (8), respectively, then X is the cause of Y. Otherwise we conclude that Y is a cause of X.
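Assuming the four Gram matrices have been computed as in Section III, the rule itself is a single comparison; `normalized_hsic` refers to the sketch after Eq. (6):

```python
def infer_direction(MX, MYX, MY, MXY):
    """Two-variable rule: 'X->Y' iff Delta_{X->Y} < Delta_{Y->X}.

    MX, MY: Gram matrices of the embeddings of P(X|C) and P(Y|C);
    MYX, MXY: Gram matrices for P(Y|X,C) and P(X|Y,C) (Eqs. 7-8).
    """
    d_xy = normalized_hsic(MX, MYX)   # Eq. (7)
    d_yx = normalized_hsic(MY, MXY)   # Eq. (8)
    return 'X->Y' if d_xy < d_yx else 'Y->X'
```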

B. With More Than Two Variables

Our rule for the two-variable case can be extended to a heuristic method for inferring causal directions among multiple variables. Suppose that we have m observed random variables $\{V_i\}_{i=1}^{m}$ and that the causal skeleton UG over them has been recovered by Algorithm 1. Let VS be the subset of $\{V_i\}_{i=1}^{m}$ such that Vi ∈ VS iff Vi's causal module is nonstationary or some Vj adjacent to Vi has a nonstationary causal module. Assume that there are no nonstationary confounders behind VS. We propose the following heuristic to estimate the causal directions between variables in VS (Algorithm 3; a schematic sketch follows the algorithm).

Algorithm 3.

Causal Direction Determination

1) Input: observations of $\{V_i\}_{i=1}^{m}$, subset VS, causal skeleton UG.
2) Let R = VS.
3) For each variable Vi in R, let Adi be the set of variables adjacent to Vi in UG. Estimate the dependence between P(Adi) and P(Vi | Adi) using equation (7), and denote the estimate by $\hat{\Delta}(i)$. Find the variable Vl in R with the minimum $\hat{\Delta}$.
4) Orient all edges incident to Vl in UG into Vl (in other words, make Vl a leaf).
5) Remove Vl from R.
6) Repeat steps 3, 4, and 5 until only one variable is left in R.
7) Output: graph UG (with edges between variables in VS oriented).
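A schematic sketch of Algorithm 3, assuming a hypothetical scoring function `delta(i, adj)` that evaluates equation (7) for Vi against its adjacency set; the guard preserves orientations fixed in earlier rounds:

```python
def orient_by_dependence(VS, adjacency, delta):
    """Algorithm 3 (sketch): greedy orientation over the subset VS.

    adjacency: dict mapping each variable to its neighbors in U_G.
    delta(i, adj): estimated dependence between P(Ad_i) and P(V_i | Ad_i).
    Returns the set of oriented edges (parent, child).
    """
    R = set(VS)
    oriented = set()
    while len(R) > 1:
        # The variable whose module depends least on its neighbors
        # is treated as the most plausible sink.
        leaf = min(R, key=lambda i: delta(i, adjacency[i]))
        for j in adjacency[leaf]:
            if (leaf, j) not in oriented:        # keep earlier orientations
                oriented.add((j, leaf))          # orient j -> leaf
        R.discard(leaf)
    return oriented
```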

For variables outside VS (i.e., variables whose modules are stationary and which are adjacent only to variables with stationary modules), the causal direction between them cannot be determined by Algorithm 3. In such a case, one may further infer some causal directions by making use of Meek orientation rules [5].

V. EXPERIMENTAL RESULTS

A. Simulations

We generated synthetic data according to the SEMs specified in Fig. 1. We considered two sources of nonstationarity: (1) nonstationarity due to changes of causal coefficients; (2) nonstationarity due to changes of influences from the noise. More specifically, the modules for V2, V3, and V5 are nonstationary in sense (1), and those for V2, V6, and V7 are nonstationary in sense (2). The nonstationarity is governed by functions ai(t) (i = 2, 3, 5, 6, 7). We considered both smooth and sudden changes of ai. Smooth change: we generated ai by sampling from a Gaussian process (GP) prior with a squared exponential kernel. Sudden change: we generated sudden changes of ai with a block signal. In both cases, the ai are sampled independently to ensure that causal modules change independently (that is, there is no nonstationary confounding). The functions $\{f_i\}_{i=2}^{8}$ are randomly chosen from linear, sinusoidal, and polynomial functions. The noise terms Ei (i = 1, …, 8) are randomly drawn from Gaussian and uniform distributions. We also considered two sample sizes (N = 600, 1200). There are hence four settings in total: (1) N = 600, smooth change; (2) N = 600, sudden change; (3) N = 1200, smooth change; (4) N = 1200, sudden change. For each setting, we ran 50 trials.

Fig. 1: The SEMs according to which we generated the synthetic data.

We first learned causal skeletons by the procedure in Algorithm 1, with PC search [9] and the kernel-based conditional independence (KCI) test [12]. We included the time information T as C in the causal system to capture nonstationarity, which allows us to recover the causal skeleton and detect changing causal modules. Next, we inferred causal directions by exploiting the independence between causal modules, with the procedure proposed in Algorithm 3. We compared it with the method proposed in [11], which uses a window-based approach to infer causal directions. For pairs of adjacent variables without nonstationary causal modules, we inferred causal directions, where possible, by Meek's orientation rules [5]. Then, based on the recovered causal graph, we extracted the nonstationary driving forces of the changing causal modules with the NoDFEs procedure in Algorithm 2, using Gaussian kernels both in the kernel embedding of the constructed joint distributions and in kernel PCA. We compared our approach with the linear time-dependent functional causal model [4], which puts a GP prior on time-varying coefficients and uses the posterior mean to represent the nonstationary driving force. In addition, we compared our methods with Bayesian change point detection [1], which is widely used on nonstationary data to detect change points; we applied it to V2, V3, V5, V6, and V7, whose causal modules change over time.

We counted a causal connection between two variables as genuine if it was found in more than 85% of trials. Algorithm 1 identified the causal skeleton and the nonstationary causal modules correctly in all four settings. Table I shows the accuracy of the inferred causal directions in the different settings, compared with the window-based method proposed in [11]. Our method significantly outperforms the window-based method, especially for smooth changes. To our knowledge, there are no other comparable methods for inferring causal directions in the nonstationary case.

TABLE I: Accuracy of inferred causal directions

                        Our method    Window-based
  N = 600, smooth       85.4%         56.4%
  N = 600, sudden       85.0%         70.2%
  N = 1200, smooth      87.9%         59.6%
  N = 1200, sudden      88.5%         74.2%

Figure 2 visualizes the estimated nonstationary driving forces of the changing causal modules for smooth changes with N = 600. Left panel: blue lines are the nonstationary driving forces estimated by NoDFEs; red lines are the ground truth; vertical black dashed lines indicate change points detected by Bayesian change point detection. Middle panel: the largest ten eigenvalues of the Gram matrix $M^g$. Right panel: blue lines are the nonstationary components recovered by the linear time-dependent GP; red lines are the ground truth. The scales of the recovered curves have been adapted. We show only the first principal component in the left panel, since the first eigenvector captures most of the variance (middle panel). NoDFEs gives the best recovery in all cases. Bayesian change point detection fails in the case of smooth changes, although it works for sudden changes. The linear time-dependent GP does not work well when the influences from the noise change (2&5 → 6, 3 → 7).

Fig. 2: Visualization of the estimated nonstationary driving forces of changing causal modules. See main text for details.

B. Real-World Datasets

US Stock Market: We applied our methods to daily returns of stocks traded on the New York Stock Exchange, downloaded from Yahoo Finance. The data contain 80 major stocks from 07/05/2006 to 12/16/2009, grouped into 10 sectors: energy, public utilities, capital goods, health care, consumer services, finance, transportation, consumer non-durable goods, basic industries, and technology.

Figure 3 shows the causal connections between stock returns, with each color representing one sector. We found that intra-sector connections are denser than inter-sector connections. Stocks in energy, finance, public utilities, and basic industries are more likely to be causes of stocks in other sectors; among these four sectors, stocks in energy and finance cause stocks in public utilities and basic industries. 37 of the 80 causal modules are nonstationary; most of them are in finance (7 out of 9) and consumer services (5 out of 7).

Fig. 3: Recovered causal graph over 80 NYSE stocks. Each node color represents one sector.

Figure 4 visualizes the estimated nonstationary driving forces of the stocks USB, JCP, GE, PBR, SAN, and CHK, recovered by NoDFEs. Among these six stocks, USB, JCP, GE, and PBR have change points around 07/16/2007 (T1) and 05/05/2008 (T2), while SAN and CHK have change points only around 05/05/2008 (T2). Most stocks with change points only at T2 have more direct causes. These findings match critical time points of the financial crisis around 2008.

Fig. 4: The estimated nonstationary driving forces of six stock returns from 07/05/2006 to 12/16/2009. The change points match critical times of the financial crisis.

VI. CONCLUSION

In this paper we proposed nonparametric methods for estimating the underlying driving force of changes in local causal mechanisms and for determining causal directions by leveraging distribution shift. The discovered causal directions help construct correct causal models; moreover, the estimated nonstationary driving force of the changes in the causal mechanisms facilitates understanding of why and how the generating process changes, and suggests which variables to further incorporate into the system to make it causally sufficient. We note that causal modeling and distribution shift are heavily coupled, and that distribution shift in fact contains useful information for causal direction determination. A line of our future research is to exploit this connection to improve online prediction in nonstationary environments.

ACKNOWLEDGEMENTS

This project was supported by the National Institutes of Health (NIH) under Award Numbers NIH-1R01EB022858-01 FAIN-R01EB022858, NIH-1R01LM012087, and NIH-5U54HG008540-02 FAIN-U54HG008540.

Footnotes

¹ Here we use $\underline{Y}$ instead of Y to emphasize that in this constructed distribution Y and X are not symmetric; this asymmetry will be used in Section IV.

REFERENCES

  • [1] Adams RP and MacKay DJC. Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK, 2007. Preprint at http://arxiv.org/abs/0710.3742v1.
  • [2] Engle RF, Hendry DF, and Richard JF. Exogeneity. Econometrica, 51:277–304, 1983.
  • [3] Gretton A, Fukumizu K, Teo CH, Song L, Schölkopf B, and Smola AJ. A kernel statistical test of independence. In NIPS 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
  • [4] Huang B, Zhang K, and Schölkopf B. Identification of time-dependent causal model: A Gaussian process treatment. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), Machine Learning Track, pages 3561–3568, Buenos Aires, Argentina, 2015.
  • [5] Meek C. Strong completeness and faithfulness in Bayesian networks. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 411–419, 1995.
  • [6] Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.
  • [7] Schölkopf B and Smola A. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
  • [8] Schölkopf B, Smola A, and Müller K. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
  • [9] Spirtes P, Glymour C, and Scheines R. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.
  • [10] Woodward J. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, New York, 2003.
  • [11] Zhang K, Huang B, Zhang J, Glymour C, and Schölkopf B. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI, 2017.
  • [12] Zhang K, Peters J, Janzing D, and Schölkopf B. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), Barcelona, Spain, 2011.
  • [13] Zhang K, Zhang J, and Schölkopf B. Distinguishing cause from effect based on exogeneity. In Proceedings of the 15th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2015), 2015.
