The Convex Mixture Distribution: Granger Causality for Categorical Time Series

Alex Tank; Xiudi Li; Emily B Fox; Ali Shojaie

doi:10.1137/20m133097x

Author manuscript; available in PMC: 2023 Oct 19.

Published in final edited form as: SIAM J Math Data Sci. 2021;3(1):83–112. doi: 10.1137/20m133097x


PMCID: PMC10586348 NIHMSID: NIHMS1885629 PMID: 37859797

Abstract

We present a framework for learning Granger causality networks for multivariate categorical time series based on the mixture transition distribution (MTD) model. Traditionally, MTD is plagued by a nonconvex objective, non-identifiability, and presence of local optima. To circumvent these problems, we recast inference in the MTD as a convex problem. The new formulation facilitates the application of MTD to high-dimensional multivariate time series. As a baseline, we also formulate a multi-output logistic autoregressive model (mLTD), which, while a straightforward extension of autoregressive Bernoulli generalized linear models, has not been previously applied to the analysis of multivariate categorical time series. We establish identifiability conditions of the MTD model and compare them to those for mLTD. We further devise novel and efficient optimization algorithms for MTD based on our proposed convex formulation, and compare the MTD and mLTD in both simulated and real data experiments. Finally, we establish consistency of the convex MTD in high dimensions. Our approach simultaneously provides a comparison of methods for network inference in categorical time series and opens the door to modern, regularized inference with the MTD model.

Keywords: time series, Granger causality, categorical data, structured sparsity, convex

AMS subject classifications: 68Q25, 68R10, 68U05

1. Introduction.

Granger causality [17] is a popular framework for assessing the relationships between time series, and has been widely applied in econometrics, neuroscience, and genomics, amongst other fields. Given two time series $x$ and $y$ , the idea is to use the temporal structure of the data to assess whether the past values of one, say $x$ , are predictive of future values of the other, $y$ , beyond what the past of $y$ can predict alone; if so, $x$ is said to Granger cause $y$ . Recently, the focus has shifted to inferring Granger causality networks from multivariate time series data, with the goal of uncovering a sparse set of Granger causal relationships amongst the individual univariate time series. Building on the typical autoregressive framework for assessing Granger causality, the majority of approaches for inferring Granger causal networks have focused on real-valued Gaussian time series using the vector autoregressive model (VAR) with sparsity inducing penalties [19, 42]. More recently, this approach has been extended to non-Gaussian data such as multivariate point processes using sparse Hawkes processes [48], count data using autoregressive Poisson generalized linear models [18], or even time series with heavy tails using VAR models with elliptical errors [36]. In contrast, inferring networks for multivariate categorical time series under this paradigm has received less attention.

Multivariate categorical time series arise naturally in many domains. For example, we might have health states from various indicators for a patient over time, voting records for a set of politicians, action labels for players on a team, social behaviors for kids in a school, or musical notes in an orchestrated piece. There are also many datasets that can be viewed as binary multivariate time series based on the presence or absence of an action for some set of entities. Furthermore, in some applications, collections of continuous-valued time series are each quantized into a small set of discrete values, like the weather data from multiple stations [12], wind data [39], stock returns [32], or sales volume for a collection of products [10]. Our work develops both interpretable and computationally efficient methodology for Granger causality network estimation in such cases using sparse penalties [19, 42]. Existing approaches to modeling categorical series neither scale to higher dimensional series nor offer Granger causal interpretability, hampering their ability to estimate large Granger causality networks. We first discuss the specific drawbacks of existing approaches and then introduce the contributions of our proposed framework.

The mixture transition distribution (MTD) model [4, 39], originally proposed for parsimonious modeling of higher order Markov chains, can provide an approach to modeling multivariate categorical time series [10, 32, 49]. The MTD model reduces each categorical interaction to a standard single dimensional Markov transition probability table. While alluring due to its elegant construction and intuitive interpretation, widespread use of the MTD model has been limited by a non-convex objective with many local optima, a large number of parameter constraints, and unknown identifiability conditions [32, 49, 3]. For these reasons, the few applications of the MTD model to multivariate time series have looked at a maximum of three or four time series. To bypass the limitations of MTD, autoregressive generalized linear models have been advocated for categorical time series. In particular, autoregressive generalized linear binomial models are often used for the special case of two categories per series [18, 2]. While their multinomial-output extension to a larger number of states per series has not been widely adopted, they have been applied to the univariate time series case [24].

We refer to the autoregressive multinomial GLM as the multinomial logistic transition distribution (mLTD). The mLTD model uses a logistic function to bypass parameter constraints, results in a convex objective, and has well-known identifiability conditions. However, these advantages of mLTD come at the cost of reduced interpretability, mainly because the transition distribution in mLTD depends nonlinearly on the model parameters. Recently, a constrained autoregressive probit model that improves interpretability has been proposed [32]. However, the probit model is non-convex and its inference is computationally intensive, limiting applications to higher dimensional series. As such, one still faces a tradeoff between computational tractability and interpretability. Methods for learning Granger causality networks among general time series based on transfer entropy or directed information have been proposed. In particular, the empirical estimator [37] and the context tree weighting estimator [23] for directed information are specifically applicable to categorical time series. However, consistency guarantees of these estimators are derived under the pairwise (group-wise) Markov assumption, and implementing these algorithms can be computationally intensive.

We address these issues by going back to the interpretable MTD framework and showing how one can improve its computational drawbacks. In particular, we recast inference in the MTD model as a convex problem through a novel re-parameterization. We further develop a regularized estimation framework for identifying Granger causality for multivariate categorical time series. We also establish, for the first time, conditions for identifiability in the MTD model and compare the identifiability conditions for MTD and mLTD models. We find that while the identifiability conditions for the MTD model are given by a non-convex set, we may easily enforce the constraints using our convex re-parameterization by augmenting the likelihood with appropriate convex penalties. We then develop efficient projected gradient and Frank-Wolfe algorithms for optimizing the penalized convex MTD objective. Our projected gradient algorithm depends on a Dykstra splitting method for projection onto the constraint sets of the MTD model. This computational approach for MTD enables this model to be applied to large, modern datasets for the first time. Importantly, the computational insights we provide carry over to the suite of other applications of MTD models, such as higher-order Markov chains, beyond the multivariate categorical time series which are the focus herein.

As a comparison benchmark we also develop a penalized mLTD model for Granger causality in multivariate Markov chains. While straightforward, the application of the penalized mLTD framework to multivariate categorical time series with more than two categories is new. We compare MTD and mLTD methods under multiple simulation conditions and use the MTD method to uncover Granger causality structure in both music [27] and iEEG brain recording [9] data sets. Finally, we also establish, for the first time, consistency of the convex MTD in high dimensions, which facilitates future theoretical developments in this area.

2. Categorical Time Series and Granger Causality.

2.1. Granger Causality.

Let $x_{t} = (x_{1 t}, \dots, x_{d t}) \in 𝒳$ denote a $d$ -dimensional categorical random variable indexed by time where $𝒳 = (𝒳_{1} \times 𝒳_{2} \times \dots \times 𝒳_{d})$ , with $𝒳_{i}$ denoting the set of possible values of $x_{i t}$ . Let $m_{i} = |𝒳_{i}|$ be the cardinality of set $𝒳_{i}$ , i.e., the number of categories that series $i$ may take. A length $T$ multivariate categorical time series is the sequence $X = \{x_{1}, \dots, x_{t}, \dots, x_{T}\}$ .

An order $k$ multivariate Markov chain models the transition probability between the categories at lagged times $t - 1, \dots, t - k$ and those at time $t$ using a transition probability distribution:

p (x_{t} | x_{t - 1}, \dots) = p (x_{t} | x_{t - 1}, \dots, x_{t - k}) .

(2.1)

Due to the complexity of fully parameterizing this transition distribution, it is common to simplify the model and assume that the categories at time $t$ are conditionally independent of one another given the past realizations:

p (x_{t} | x_{t - 1}, \dots, x_{t - k}) = \prod_{i = 1}^{d} p (x_{i t} | x_{t - 1}, \dots, x_{t - k}) .

(2.2)

For simplicity, we assume $k = 1$ , but stress that all models and results equally apply to higher orders of $k$ . By the decomposition assumption (2.2), the problem of estimation and inference can be divided into independent subproblems over each series $i$ . Using this decomposition, we define Granger non-causality for two categorical time series $x_{i}$ and $x_{j}$ as follows.

Definition 2.1.

Time series $x_{j}$ is not Granger causal for time series $x_{i}$ iff $\forall t$ ,

p (x_{i t} | x_{1 (t - 1)}, \dots, x_{j (t - 1)}, \dots, x_{d (t - 1)}) = p (x_{i t} | x_{1 (t - 1)}, \dots, x_{(j - 1) (t - 1)}, x_{(j + 1) (t - 1)}, \dots, x_{d (t - 1)}) .

Definition 2.1 states that $x_{j t}$ is not Granger causal for time series $x_{i t}$ if the probability that $x_{i t}$ is in any state at time $t$ is conditionally independent of the value of $x_{j (t - 1)}$ at time $t - 1$ given the values of all other series $x_{k (t - 1)},$ $k \neq j$ , at time $t - 1$ . Definition 2.1 is natural since it implies that if $x_{j t}$ does not Granger cause $x_{i t}$ , then knowing $x_{j (t - 1)}$ does not help predict the future state of series $i,$ $x_{i t}$ . For real-valued data, classical definitions of Granger non-causality generally state that the conditional mean, in homoskedastic models, or conditional variance, in heteroskedastic models, of $x_{i t}$ does not depend on the past values $x_{j (t - 1)}$ . Thus, Definition 2.1 is a generalization of the classical case to multivariate categorical data, where notions like conditional mean and variance are less applicable. The same definition has been considered before, for example, in [14].

2.2. Tensor Representation for Categorical Time Series.

Each individual conditional distribution in Equation (2.2) can be represented as a conditional probability tensor ${\tilde{P}}^{i}$ with $d + 1$ modes of dimension $m_{i} \times m_{1} \times \dots \times m_{d}$ . Each entry of the tensor is given by

{\tilde{P}}_{x_{i t}, x_{1 (t - 1)}, \dots, x_{d (t - 1)}}^{i} = p (x_{i t} | x_{1 (t - 1)}, \dots, x_{d (t - 1)}) .

(2.3)

Definition 2.1 may be stated equivalently using the language of tensors: $x_{j}$ does not Granger cause $x_{i}$ if all subtensors along the mode associated with $x_{j}$ are equal. Specifically,

{\tilde{P}}_{1 : m_{i}, 1 : m_{1}, \dots, x_{j (t - 1)} = 1, \dots, 1 : m_{d}}^{i} = \dots = {\tilde{P}}_{1 : m_{i}, 1 : m_{1}, \dots, x_{j (t - 1)} = m_{j}, \dots, 1 : m_{d}}^{i} .

(2.4)

This subtensor view of Granger non-causality in categorical time series is displayed graphically in Figure 1.
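The subtensor condition in Equation (2.4) can be checked directly on a small tensor. The following is a minimal sketch, assuming the tensor is stored as a NumPy array whose axis 0 indexes $x_{i t}$ and whose axis $j$ indexes $x_{j (t - 1)}$; the function name `is_granger_noncausal` and the toy values are illustrative and not part of the original paper.

```python
import numpy as np

def is_granger_noncausal(P_tilde, j, atol=1e-12):
    """Check Equation (2.4): all subtensors along the mode of series j are equal.

    P_tilde has shape (m_i, m_1, ..., m_d); axis 0 indexes x_{it} and
    axis j (j = 1, ..., d) indexes x_{j(t-1)}.
    """
    # Keep the first slice along axis j and compare every slice against it.
    ref = np.take(P_tilde, [0], axis=j)
    return bool(np.allclose(P_tilde, ref, atol=atol))

# Toy example with d = 2 series and m = 3 categories (made-up values).
rng = np.random.default_rng(0)
base = rng.random((3, 3))
base /= base.sum(axis=0, keepdims=True)   # each column is a distribution over x_{it}

# The conditional depends on x_{1(t-1)} only, so it is constant along axis 2.
P_tilde = np.repeat(base[:, :, None], 3, axis=2)

noncausal_2 = is_granger_noncausal(P_tilde, j=2)   # series 2 is ignored by the tensor
noncausal_1 = is_granger_noncausal(P_tilde, j=1)   # series 1 drives the tensor
```

This check requires the full tensor in memory, so it is only practical for small $d$.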

The tensor interpretation suggests a naive penalized likelihood method for Granger non-causality selection in categorical time series: perform penalized maximum likelihood estimation of the conditional probability tensor with a penalty that enforces equality among the subtensors of each mode. While we have explored the above approach in low dimensions, e.g. for $d \leq 5$ , memory, and in turn, computational requirements for storing the complete probability tensor become infeasible for even moderate dimensions since ${\tilde{P}}^{i}$ has $m_{i} \times m_{1} \times \dots \times m_{d}$ entries. Other authors have modeled the conditional probability distribution of Markov chains using a Bayesian nonparametric higher order singular value decomposition [41] that adaptively shrinks the number of parameters needed to represent the high-dimensional tensor. We take an alternative approach and, instead, in Sections 2.3 and 2.4, present tensor parameterizations where the number of parameters needed to represent the full conditional probability tensor grows linearly with $d$ . We establish Granger non-causality conditions and associated penalized likelihood methods for estimation under these parameterizations in Sections 3 and 4, respectively.
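To make the storage argument concrete, the sketch below counts raw array entries for the full tensor and for a parameterization with one $m_{i} \times m_{j}$ matrix per lagged series plus a length-$m_{i}$ intercept, which is the shape shared by the parameterizations in Sections 2.3 and 2.4. The function names are illustrative, and the counts ignore any reduction from normalization or identifiability constraints.

```python
from math import prod

def full_tensor_entries(m_i, m):
    """Entries of the full conditional tensor: m_i * m_1 * ... * m_d."""
    return m_i * prod(m)

def pairwise_entries(m_i, m):
    """Entries of an intercept (m_i) plus one m_i x m_j matrix per lagged series."""
    return m_i + sum(m_i * m_j for m_j in m)

m = [3] * 10                      # d = 10 series with 3 categories each
full = full_tensor_entries(3, m)  # grows exponentially in d
pairwise = pairwise_entries(3, m) # grows linearly in d
```

For this configuration the full tensor has 177147 entries, while the pairwise parameterization has 93.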

In specifying our models, and throughout the remainder of the paper, we focus on a single conditional of $x_{i t}$ given $x_{t - 1}$ in Equation (2.2). For notational simplicity, we drop the $i$ index.

2.3. The MTD Model.

The MTD model as in [39] provides an elegant and intuitive parameterization of a high-order Markov chain. Here, we extend this model to the case of multiple time series, and model the multivariate Markov transition as a convex combination of pairwise transition probabilities. The MTD model is given by:

p (x_{i t} | x_{1 (t - 1)}, \dots, x_{d (t - 1)}) = γ_{0} p_{0} (x_{i t}) + \sum_{j = 1}^{d} γ_{j} p_{j} (x_{i t} | x_{j (t - 1)}),

(2.5)

where $p_{0}$ is a probability vector, $p_{j} (\cdot | \cdot)$ is a pairwise transition probability table between $x_{j (t - 1)}$ and $x_{i t}$ and $γ = (γ_{0}, γ_{1}, \dots, γ_{d})$ is a $d + 1$ dimensional probability distribution such that $1^{T} γ = 1$ with $γ_{j} \geq 0,$ $j = 0, \dots, d$ . We let the matrix $P^{j} \in R^{m_{i} \times m_{j}}$ denote the pairwise transition probability $P_{x_{i t}, x_{j (t - 1)}}^{j} = p_{j} (x_{i t} | x_{j (t - 1)})$ . Thus, $1^{T} P^{j} = 1^{T},$ $P_{l k}^{j} \geq 0, l = 1, \dots, m_{i},$ $k = 1, \dots, m_{j}$ . We also let $p^{0} \in R^{m_{i}}$ denote the intercept, where $p_{x_{i t}}^{0} = p_{0} (x_{i t})$ . While past formulations of the MTD model neglect the $p_{0}$ intercept term, we show below that the intercept is crucial for model identifiability and, consequently, Granger causality inference. Finally, we note that the MTD model may be extended by adding interaction terms for pairwise effects [4], such as $p_{j k} (x_{i t} | x_{j (t - 1)}, x_{k (t - 1)})$ , though we focus our presentation on the simple case above.
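As an illustration of Equation (2.5), the following sketch evaluates the MTD conditional distribution for one lagged observation. It assumes categories are coded as 0-based integers and that each $P^{j}$ is stored with columns summing to one; the function name `mtd_conditional` and the numerical values are hypothetical.

```python
import numpy as np

def mtd_conditional(gamma, p0, P, x_prev):
    """Evaluate Equation (2.5) for one lagged observation.

    gamma  : length d + 1 weights on the simplex (gamma[0] is the intercept weight)
    p0     : length m_i probability vector
    P      : list of d arrays; P[j-1] has shape (m_i, m_j) with columns summing to one
    x_prev : length d integer codes (0-based) for x_{1(t-1)}, ..., x_{d(t-1)}
    """
    probs = gamma[0] * p0
    for j, Pj in enumerate(P, start=1):
        # Add the column of P^j selected by the lagged state of series j.
        probs = probs + gamma[j] * Pj[:, x_prev[j - 1]]
    return probs

gamma = np.array([0.2, 0.5, 0.3])
p0 = np.array([1 / 3, 1 / 3, 1 / 3])
P1 = np.array([[0.6, 0.2, 0.3],
               [0.3, 0.5, 0.3],
               [0.1, 0.3, 0.4]])
P2 = np.array([[0.50, 0.25, 0.25],
               [0.25, 0.50, 0.25],
               [0.25, 0.25, 0.50]])

probs = mtd_conditional(gamma, p0, [P1, P2], x_prev=[0, 2])
```

Because the result is a convex combination of probability vectors, it is itself a probability vector without any further normalization.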

2.4. The mLTD Model.

The multinomial logistic transition distribution model (mLTD) is given by:

p (x_{i t} | x_{1 (t - 1)}, \dots, x_{d (t - 1)}) = \frac{\exp (z_{x_{i t}}^{0} + \sum_{j = 1}^{d} Z_{x_{i t}, x_{j (t - 1)}}^{j})}{\sum_{x^{'} \in 𝒳_{i}} \exp (z_{x^{'}}^{0} + \sum_{j = 1}^{d} Z_{x^{'}, x_{j (t - 1)}}^{j})},

(2.6)

where $Z^{j} \in R^{m_{i} \times m_{j}}$ and $z^{0} \in R^{m_{i}}$ . The probit model in [32] is not a natural fit for inferring Granger causality networks both due to the non-convexity of the probit model and the non-convex constraints imposed on the $Z^{j}$ matrices. Note that, like the MTD model, the mLTD model naturally allows adding interaction terms, though we focus again on the simple case above.
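Equation (2.6) is a softmax over additive logits, which the sketch below evaluates for one lagged observation. It assumes 0-based integer category codes; the function name `mltd_conditional` and the random parameter values are illustrative only.

```python
import numpy as np

def mltd_conditional(z0, Z, x_prev):
    """Evaluate Equation (2.6): softmax of z^0 plus one column from each Z^j."""
    logits = np.asarray(z0, dtype=float).copy()
    for j, Zj in enumerate(Z):
        logits += Zj[:, x_prev[j]]
    logits -= logits.max()          # stabilize exp; the softmax is unchanged
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
z0 = rng.normal(size=3)
Z = [rng.normal(size=(3, 3)) for _ in range(2)]   # unconstrained real matrices

probs = mltd_conditional(z0, Z, x_prev=[1, 0])
# Adding a constant to every logit leaves the distribution unchanged.
shifted = mltd_conditional(z0 + 5.0, Z, x_prev=[1, 0])
```

The invariance to a constant shift shown in the last line is one source of the non-identifiability discussed for multinomial logistic models.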

2.5. Comparing MTD and mLTD Models.

Both MTD and mLTD models represent the full conditional probability tensor using individual matrices for each $x_{j}$ series, $P^{j}$ for MTD and $Z^{j}$ for mLTD. However, how these matrices are composed and restrictions on their domains differ substantially between the two models. The MTD model is a convex combination of pairwise probability tables, whereas mLTD is a nonlinear function of the unrestricted $Z^{j}$s. MTD may thus be thought of as a linear tensor factorization method for conditional probability tensors, where the tensor is created by summing probability table slices along each dimension. This interpretation of MTD is displayed graphically in Figure 2.

Figure 2. Schematic of the MTD factorization of the conditional probability tensor $p (x_{1 t} | x_{1 (t - 1)}, x_{2 (t - 1)})$ for $d = 2$ time series and $m = 3$ categories.

3. Convexity, Identifiability and Granger Causality.

In this section, we first introduce a novel re-parameterization of the MTD model that renders the log likelihood of the MTD model convex. The convex formulation alone opens up an array of possibilities for the MTD framework beyond our multivariate categorical time series focus, eliminating the primary barrier to adoption of this model, i.e., non-convexity and associated computationally demanding inference procedures. The proposed change-of-variables also allows us to derive both novel identifiability conditions for the MTD model and Granger causality restrictions that hold for both MTD and mLTD models. The non-identifiability of the MTD model was first pointed out by [28], but no explicit conditions or general framework for identifiability were given. We show that while the identifiability conditions for MTD are non-convex, they may be enforced implicitly by adding an appropriate convex penalty to the convex log-likelihood objective. The proofs of all results are given in the Supplementary Material.

3.1. Convex MTD.

The maximum likelihood estimator for the MTD model under the $(γ, P)$ parameterization is defined by the non-convex optimization problem:

\underset{P, γ}{\text{minimize}} \; - \sum_{t = 1}^{T} \log \left( γ_{0} p_{x_{i t}}^{0} + \sum_{j = 1}^{d} γ_{j} P_{x_{i t}, x_{j (t - 1)}}^{j} \right) \quad \text{subject to} \quad 1^{T} P^{j} = 1^{T}, \; P^{j} \geq 0 \; \forall j, \quad 1^{T} γ = 1, \; γ \geq 0.

(3.1)

The log-likelihood surface is highly non-convex, following from the multiplication of $γ_{j}$ and $P^{j}$ in the log term. It also contains many local optima due to the general non-identifiability. Indeed, the set of equivalent models forms a non-convex region in the $(γ, P)$ parameterization (i.e., the convex combination of equivalent models is not necessarily another equivalent model). This limitation may lead to many non-convex shaped ridges and sets of equal probability.

Fortunately, the optimization problem in (3.1) may be recast as a convex program using the re-parameterization $Z^{j} = γ_{j} P^{j}$ and $z^{0} = γ_{0} p^{0}$ . Using this reparameterization, we can rewrite the factorization of the conditional probability tensor for MTD in Equation (2.5) as

p (x_{i t} | x_{1 (t - 1)}, \dots, x_{d (t - 1)}) = z_{x_{i t}}^{0} + \sum_{j = 1}^{d} Z_{x_{i t} x_{j (t - 1)}}^{j} .

(3.2)

The full optimization problem for maximum log-likelihood including constraints then becomes:

\underset{Z, γ}{\text{minimize}} \; - \sum_{t = 1}^{T} \log \left( z_{x_{i t}}^{0} + \sum_{j = 1}^{d} Z_{x_{i t} x_{j (t - 1)}}^{j} \right) \quad \text{subject to} \quad 1^{T} Z^{j} = γ_{j} 1^{T}, \; Z^{j} \geq 0 \; \forall j, \quad 1^{T} γ = 1, \; γ \geq 0.

(3.3)

Problem (3.3) is convex: the objective is a sum of negative logarithms of linear functions of $(z^{0}, Z)$ , each of which is convex, and the constraints are linear equalities and inequalities [6].
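The reparameterization and the convexity claim can be probed numerically. The sketch below draws two feasible parameter sets, recovers $γ_{j}$ from the column sums of $Z^{j}$, and checks the midpoint convexity inequality for the objective of Problem (3.3). The synthetic data, the helper names `random_mtd` and `mtd_nll`, and the use of Dirichlet draws are assumptions made for illustration; a midpoint check is a sanity test, not a proof.

```python
import numpy as np

def random_mtd(rng, d, m):
    """Draw a feasible (gamma, z0, Z) with Z^j = gamma_j P^j and z0 = gamma_0 p^0."""
    gamma = rng.dirichlet(np.ones(d + 1))
    z0 = gamma[0] * rng.dirichlet(np.ones(m))
    # Transpose so that each column of P^j is a probability vector over x_{it}.
    Z = [gamma[j + 1] * rng.dirichlet(np.ones(m), size=m).T for j in range(d)]
    return gamma, z0, Z

def mtd_nll(z0, Z, y, X_prev):
    """Objective of Problem (3.3); y[t] codes x_{it}, X_prev[t, j] codes x_{j(t-1)}."""
    probs = z0[y]
    for j, Zj in enumerate(Z):
        probs = probs + Zj[y, X_prev[:, j]]
    return -np.log(probs).sum()

rng = np.random.default_rng(0)
d, m, T = 3, 4, 200
y = rng.integers(0, m, size=T)            # synthetic responses (0-based codes)
X_prev = rng.integers(0, m, size=(T, d))  # synthetic lagged states

gamma_a, z0_a, Z_a = random_mtd(rng, d, m)
gamma_b, z0_b, Z_b = random_mtd(rng, d, m)

# The constraint 1^T Z^j = gamma_j 1^T means any column sum of Z^j returns gamma_j.
gamma_rec = np.array([z0_a.sum()] + [Zj[:, 0].sum() for Zj in Z_a])

# Midpoint of two feasible points is feasible, and the objective obeys
# f(mid) <= (f(a) + f(b)) / 2 for a convex function.
z0_mid = 0.5 * (z0_a + z0_b)
Z_mid = [0.5 * (Za + Zb) for Za, Zb in zip(Z_a, Z_b)]
nll_a = mtd_nll(z0_a, Z_a, y, X_prev)
nll_b = mtd_nll(z0_b, Z_b, y, X_prev)
nll_mid = mtd_nll(z0_mid, Z_mid, y, X_prev)
```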

The $Z^{j}$ reparameterization in Equation (3.2) also provides clear intuition for why the MTD model may not be identifiable. Since the probability function is a linear sum of $Z^{j}$ s, one may move probability mass around, taking mass from some $Z^{j}$ and moving to some $Z^{k},$ $k \neq j$ or $z^{0}$ , while keeping the conditional probability tensor constant. These sets of equivalent MTD parameterizations have the following appealing property:

Proposition 3.1.

The set of MTD parameters, $Z$ , that yield the same factorized conditional distribution $p (x_{i t} | x_{(t - 1)})$ forms a convex set.

Taken together, the convex reparameterization and Proposition 3.1 imply that the objective in Problem (3.3) has no spurious local optima, and that the globally optimal solution to Problem (3.3) is given by a convex set of equivalent MTD models.
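Proposition 3.1 can be illustrated on a small example with $d = 2$. The sketch below constructs two equivalent parameterizations by moving probability mass from a row of $Z^{1}$ to the intercept, and then checks that a convex combination of the two yields the same conditional tensor. The helper `mtd_tensor`, the amount moved, and the mixing weight are illustrative choices.

```python
import numpy as np

def mtd_tensor(z0, Z1, Z2):
    """Full conditional tensor for d = 2 via Equation (3.2):
    T[x, a, b] = z0[x] + Z1[x, a] + Z2[x, b]."""
    return z0[:, None, None] + Z1[:, :, None] + Z2[:, None, :]

z0 = 0.2 * np.array([1 / 3, 1 / 3, 1 / 3])
Z1 = 0.5 * np.array([[0.6, 0.2, 0.3],
                     [0.3, 0.5, 0.3],
                     [0.1, 0.3, 0.4]])
Z2 = 0.3 * np.array([[0.50, 0.25, 0.25],
                     [0.25, 0.50, 0.25],
                     [0.25, 0.25, 0.50]])

# Equivalent parameters: move 0.05 from every entry in row 0 of Z1 into z0[0].
delta = 0.05
z0_b = z0.copy()
z0_b[0] += delta
Z1_b = Z1.copy()
Z1_b[0, :] -= delta

# A convex combination of the two equivalent parameter sets.
w = 0.3
z0_c = w * z0 + (1 - w) * z0_b
Z1_c = w * Z1 + (1 - w) * Z1_b

T_a = mtd_tensor(z0, Z1, Z2)
T_b = mtd_tensor(z0_b, Z1_b, Z2)
T_c = mtd_tensor(z0_c, Z1_c, Z2)
```

All three tensors agree, and the column sums of `Z1_b` remain equal to each other, so the modified parameters still satisfy the MTD constraints.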

3.2. Identifiability.

3.2.1. Identifiability for the MTD Model.

The re-parameterization of the MTD model in terms of $Z^{j}$ s, instead of $γ_{j}$ and $P^{j}$ , combined with the introduction of an intercept term, allows us to explicitly characterize identifiability conditions for the MTD model.

Theorem 3.2.

Every MTD distribution has a unique parameterization where the minimal element in each row of $P^{j}$ (and thus $Z^{j}$ ) is zero for all $j$ .

The intuition for this result is simple: any excess probability mass on a row of each $Z^{j}$ may be pushed onto the same row of the intercept term $z^{0}$ without changing the full conditional probability. This operation may be done until the smallest element in each row is zero, but no more without violating the positivity constraints of the pairwise transitions. The identifiability condition in Theorem 3.2 also offers an interpretation of the parameters in the MTD model. Specifically, the element $Z_{m n}^{j}$ denotes the additive increase in probability that $x_{i t}$ is in state $m$ given that $x_{j (t - 1)}$ is in state $n$ . Furthermore, the $γ_{j}$ parameters now represent the total amount of probability mass in the full conditional distribution explained by categorical variable $x_{j}$ , providing an interpretable notion of dependence in categorical time series. The mLTD model, however, does not readily suggest a simple and interpretable notion of dependence from the $Z^{j}$ matrix due to the non-linearity of the link function. The identifiability conditions are displayed pictorially in Figure 3.
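The mass-shifting argument behind Theorem 3.2 translates into a short procedure: subtract each row minimum from $Z^{j}$ and add it to the matching entry of $z^{0}$. The sketch below is one way to write this, assuming $d = 2$ for the verification step; the function names and numerical values are illustrative, not taken from the paper.

```python
import numpy as np

def canonicalize_mtd(z0, Z):
    """Move the row-wise minimum of each Z^j into the intercept z^0."""
    z0_new = np.asarray(z0, dtype=float).copy()
    Z_new = []
    for Zj in Z:
        row_min = Zj.min(axis=1)
        z0_new += row_min                      # excess mass goes to the intercept
        Z_new.append(Zj - row_min[:, None])    # smallest entry of each row becomes 0
    return z0_new, Z_new

def mtd_tensor(z0, Z):
    """Conditional tensor for d = 2: T[x, a, b] = z0[x] + Z1[x, a] + Z2[x, b]."""
    Z1, Z2 = Z
    return z0[:, None, None] + Z1[:, :, None] + Z2[:, None, :]

z0 = 0.2 * np.array([1 / 3, 1 / 3, 1 / 3])
Z = [0.5 * np.array([[0.6, 0.2, 0.3],
                     [0.3, 0.5, 0.3],
                     [0.1, 0.3, 0.4]]),
     0.3 * np.array([[0.50, 0.25, 0.25],
                     [0.25, 0.50, 0.25],
                     [0.25, 0.25, 0.50]])]

z0_id, Z_id = canonicalize_mtd(z0, Z)

# Weights implied by the identifiable form: intercept mass and column sums.
gamma_id = np.array([z0_id.sum()] + [Zj[:, 0].sum() for Zj in Z_id])
```

In this example the implied weights change from $(0.2, 0.5, 0.3)$ to $(0.725, 0.2, 0.075)$ while the conditional distribution is unchanged, which shows how much of each $γ_{j}$ in the original parameterization was not attributable to dependence on $x_{j}$.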

Figure 3. Schematic displaying the identifiability conditions for the MTD model (top) and the mLTD model (bottom) for an example with $d = 3$ and $m_{1} = m_{2} = m_{3} = 3$. Identifiability for MTD requires a zero entry in each row of $Z^{j}$, while for mLTD the first column and last row must all be zero. In MTD the columns of each $Z^{j}$ must also sum to the same value, and must sum to one across all $Z^{j}$.

Unfortunately, the necessary constraint set for identifiability specified in Theorem 3.2 is a non-convex set since the locations of the zero elements in each row of $Z^{j}$ are unknown. Naively, one could search over all possible locations for the zero element in each row of each $Z^{j}$ ; however, this quickly becomes infeasible as both $m$ and $d$ grow. Instead, we add a penalty term $Ω (Z)$ , or prior, that biases the solution towards the uniqueness constraints. This regularization also aids convergence of optimization since the maximum likelihood solution without identifiability constraints is not unique. Letting

L_{MTD} (Z) = - \sum_{t = 1}^{T} \log (z_{x_{i t}}^{0} + \sum_{j = 1}^{d} Z_{x_{i t} x_{j (t - 1)}}^{j}),

(3.4)

the regularized estimation problem is given by

\underset{Z, γ}{\text{minimize}} \; L_{MTD} (Z) + λ Ω (Z) \quad \text{subject to} \quad 1^{T} Z^{j} = γ_{j} 1^{T}, \; Z^{j} \geq 0 \; \forall j, \quad 1^{T} γ = 1, \; γ \geq 0 .

(3.5)

Theorem 3.3.

For any $λ > 0$ and $Ω (Z)$ that does not depend on $z^{0}$ and is increasing with respect to the absolute value of entries in $Z^{j}$ , the solution to the problem in Equation (3.5) is contained in the set of identifiable MTD models described in Theorem 3.2.

Intuitively, by penalizing the entries of the $Z^{j}$ matrices, but not the intercept term, solutions will be biased to having the intercept contain the excess probability mass, rather than the $Z^{j}$ matrices. Thus, even with a very small penalty, we constrain the solution space to the set of identifiable models. Theorem 3.3 characterizes an entire class of regularizers that enforce the identifiability constraints for MTD. As we explain in Section 4.1, a simple choice for $Ω (Z)$ is a regularizer that also selects for Granger causality.

3.2.2. Identifiability for the mLTD Model.

The non-identifiability of multinomial logistic models is also well-known, as is the non-identifiability of generalized linear models with categorical covariates. Combining the standard identifiability restrictions for both settings gives the following result.

Proposition 3.4.

([1]) Every mLTD model has a unique parameterization such that the first column and last row of $Z^{j}$ are zero for all $j$ and the last element of $z^{0}$ is zero.

These conditions are displayed pictorially in Figure 3. Under the identifiability constraints for both MTD and mLTD models, at least one element in each row must be zero. For MTD this zero may be in any column, while for mLTD the zero may, without loss of generality, be placed in the first column of each row. For mLTD, the last row of $Z^{j}$ must also be zero due to the logistic output (one category serves as the ‘baseline’); in MTD, instead, each column of $P^{j}$ must sum to one.
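One way to reach the parameterization in Proposition 3.4 from arbitrary mLTD parameters is sketched below. The two-step procedure (absorb the first column of each $Z^{j}$ into the intercept, then subtract the last row) is an illustrative construction, not a procedure stated in the paper; the check at the end is written for $d = 2$ and the helper names are hypothetical.

```python
import numpy as np

def softmax_tensor(z0, Z1, Z2):
    """mLTD conditional tensor for d = 2 from Equation (2.6)."""
    logits = z0[:, None, None] + Z1[:, :, None] + Z2[:, None, :]
    logits = logits - logits.max(axis=0, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)

def canonicalize_mltd(z0, Z):
    """Zero the first column and last row of each Z^j and the last entry of z^0."""
    z0_new = np.asarray(z0, dtype=float).copy()
    Z_new = []
    for Zj in Z:
        # Step 1: absorb the first column into the intercept (covariate baseline).
        z0_new += Zj[:, 0]
        Zj = Zj - Zj[:, [0]]
        # Step 2: subtract the last row from every row (output baseline). For a
        # fixed lagged state this shifts all logits by the same constant.
        Zj = Zj - Zj[[-1], :]
        Z_new.append(Zj)
    z0_new -= z0_new[-1]   # a constant shift of the intercept is also invisible
    return z0_new, Z_new

rng = np.random.default_rng(0)
z0 = rng.normal(size=3)
Z = [rng.normal(size=(3, 3)) for _ in range(2)]

z0_c, Z_c = canonicalize_mltd(z0, Z)
same = np.allclose(softmax_tensor(z0, *Z), softmax_tensor(z0_c, *Z_c))
```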

3.3. Granger Causality in MTD and mLTD.

Under the $Z^{j}$ parameterization of MTD in Equation (3.2) and the mLTD specification of Equation (2.6), we have the following simple result for Granger non-causality conditions:

Proposition 3.5.

In both the MTD model of Equation (3.2) and the mLTD model of Equation (2.6), time series $x_{j}$ is Granger non-causal for time series $x_{i}$ if and only if the columns of $Z^{j}$ are all equal. Furthermore, all equivalent MTD model parameterizations give the same Granger causality conclusions.

Intuitively, if all columns of $Z^{j}$ are equal, the transition distribution for $x_{i t}$ does not depend on $x_{j (t - 1)}$ . This result for mLTD and MTD models is analogous to the general Granger non-causality result that the slices of the conditional probability tensor along the $x_{j (t - 1)}$ mode are all equal. Based on Proposition 3.5, we might select for Granger non-causality by penalizing the columns of $Z^{j}$ to be the same. While this approach is potentially interesting, a more direct, stable method takes into account the conditions required for identifiability of the $Z^{j}$ under both models.

Under the identifiability constraints for both MTD and mLTD given in Theorem 3.2 and Proposition 3.4, respectively, $x_{j}$ is Granger non-causal for $x_{i}$ if and only if $Z^{j} = 0$ (a special case of all columns being equal). For both MTD and mLTD models this fact follows from each row having at least one zero element; for all the columns to be equal, as stated in Proposition 3.5, all elements in each row must also be equal to zero. Taken together, if we enforce the identifiability constraints, we may uniquely select for Granger non-causality by encouraging some $Z^{j}$ to be zero.
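The link between equal columns and $Z^{j} = 0$ can be seen in a small MTD example. In the sketch below, a matrix with identical columns becomes exactly zero once its row minima are removed, and the remaining nonzero matrices identify the Granger causal series. The function names, the tolerance, and the numerical values are illustrative assumptions.

```python
import numpy as np

def columns_equal(Zj, atol=1e-12):
    """Condition of Proposition 3.5: every column of Z^j equals the first one."""
    return bool(np.allclose(Zj, Zj[:, [0]], atol=atol))

def granger_parents(Z, tol=1e-8):
    """1-based indices j with Z^j != 0, valid for identifiable parameters."""
    return [j for j, Zj in enumerate(Z, start=1) if np.linalg.norm(Zj) > tol]

# Z^1 has identical columns, so series 1 is Granger non-causal.
v = np.array([0.10, 0.15, 0.05])
Z1 = np.repeat(v[:, None], 3, axis=1)
# Z^2 has columns that differ, so series 2 is Granger causal.
Z2 = 0.3 * np.array([[0.50, 0.25, 0.25],
                     [0.25, 0.50, 0.25],
                     [0.25, 0.25, 0.50]])

# Identifiable MTD form: remove row minima (the mass would move to the intercept).
Z_ident = [Zj - Zj.min(axis=1, keepdims=True) for Zj in (Z1, Z2)]
parents = granger_parents(Z_ident)
```

With estimated rather than exact parameters, the tolerance `tol` would need to reflect the accuracy of the optimizer.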

4. Granger Causality Selection.

We now turn to procedures for inferring Granger non-causality statements from observed multivariate categorical time series. In Section 3, we derived that if $Z^{j} = 0$ , then $x_{j}$ is Granger non-causal for $x_{i}$ in both MTD and mLTD models. To perform model selection, we take a penalized likelihood approach and present a set of penalty terms that encourage $Z^{j} = 0$ , while maintaining convexity of the overall objective. The final parameter estimates automatically satisfy the identifiability constraints for MTD. We also develop an analogous penalized criterion for selecting Granger causality in the mLTD model.

4.1. Model Selection in MTD.

We now explore penalties that encourage the $Z^{j}$ matrices to be zero. Under the $(P^{j}; γ_{j})$ parameterization, this is equivalent to encouraging the $γ_{j}$ to be zero. We first introduce an $L_{0}$ penalized problem in terms of the original $γ_{j}$ parameterization, and then show how convex relaxations of the $L_{0}$ norm on $γ_{j}$ lead to natural convex penalties on $Z^{j}$ . Ideally, we would solve the penalized $L_{0}$ problem:

\underset{Z, γ}{\text{minimize}} \; L_{MTD} (Z) + λ {∥γ_{1 : d}∥}_{0} \quad \text{subject to} \quad 1^{T} Z^{j} = γ_{j} 1^{T}, \; Z^{j} \geq 0 \; \forall j, \quad 1^{T} γ = 1, \; γ \geq 0,

(4.1)

where $λ \geq 0$ is a regularization parameter and ${∥γ_{1 : d}∥}_{0}$ is the $L_{0}$ norm over the weights; the intercept weight $γ_{0}$ is not regularized. The $L_{0}$ penalty simply counts the number of non-zero $γ_{j}$ , which is equivalent to the number of non-zero $Z^{j}$ . This results in a non-convex objective. Instead, we develop alternative convex penalties suited to model selection in MTD. Importantly, we require that any such penalty, $Ω (Z)$ , fall in the intersection of two penalty classes: 1) $Ω (Z)$ must be a convex relaxation to the $L_{0}$ norm in Problem (4.1) to promote sparse solutions and 2) $Ω (Z)$ must satisfy the conditions of Theorem 3.3 to ensure the final parameter estimates satisfy the MTD identifiability constraints. We propose and compare two penalties that satisfy these criteria.

Our first proposal is the standard $L_{1}$ relaxation, as in lasso regression, which simply sums the absolute values of $γ_{j}$ . This penalty encourages soft-thresholding, where some estimated $γ_{j}$ are set exactly to zero while others are shrunk relative to the estimates from the unpenalized objective. Note that due to the non-negativity constraint, the $L_{1}$ norm on $γ_{1 : d}$ is simply given by $\sum_{j = 1}^{d} γ_{j}$ . If $γ_{0}$ were included in the $L_{0}$ regularization, the $L_{1}$ relaxation would fail: under the simplex constraints $1^{T} γ = 1,$ $γ \geq 0$ , the $L_{1}$ norm of $γ$ equals one everywhere on the feasible set [35]. Our addition of an unpenalized intercept to the MTD model allows us to sidestep this issue and leverage the sparsity promoting properties of the $L_{1}$ penalty for model selection in MTD. The $L_{1}$ regularized MTD problem is thus given by

\underset{Z, γ}{\text{minimize}} \; L_{MTD} (Z) + λ \sum_{j = 1}^{d} γ_{j} \quad \text{subject to} \quad 1^{T} Z^{j} = γ_{j} 1^{T}, \; Z^{j} \geq 0 \; \forall j, \quad 1^{T} γ = 1, \; γ \geq 0 .

(4.2)

Equation (4.2) may be rewritten solely in terms of the $Z^{j}$ s by noting that $γ_{j} = \frac{1}{m_{j}} 1^{T} Z^{j} 1$ . Defining ${\tilde{z}}^{T} = (\mathrm{vec} {(Z^{1})}^{T}, \dots, \mathrm{vec} {(Z^{d})}^{T})$ , and assuming, for simplicity of presentation, $|𝒳_{i}| = m \; \forall i$ , we can rewrite the MTD constraints as

(I_{d} \otimes A) \tilde{z} = 0, 1^{T} \tilde{z} = m, \tilde{z} \geq 0,

where

A = (\begin{matrix} 1_{m}^{T} & - 1_{m}^{T} & 0 & 0 & \dots \\ 0 & 1_{m}^{T} & - 1_{m}^{T} & 0 & \dots \\ \dots & \dots & ⋱ & ⋮ & ⋮ \\ 0 & 0 & \dots & 1_{m}^{T} & - 1_{m}^{T} \end{matrix}),

(4.3)

and $I_{d}$ is a $d$ -dimensional identity matrix. This gives the final penalized optimization problem only in terms of $Z^{j}$ as

\underset{Z}{\text{minimize}} \; L_{MTD} (Z) + λ \sum_{j = 1}^{d} \frac{1}{m} 1^{T} Z^{j} 1 \quad \text{subject to} \quad (I_{d} \otimes A) \tilde{z} = 0, \; 1^{T} \tilde{z} = m, \; \tilde{z} \geq 0 .

(4.4)

Writing the $L_{1}$ penalized problem in this form shows that the $L_{1}$ penalty increases with the absolute value of the entries in $Z^{j}$ and does not penalize the intercept; it thus satisfies the conditions of Theorem 3.3. As a result, the solution to the problem given in Equation (4.4) automatically satisfies the MTD identifiability constraints. Furthermore, the solution will lead to Granger causality estimates since many of the $Z^{j}$s will be zero due to the $L_{1}$ penalty.
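The matrix $A$ in Equation (4.3) and the identity $γ_{j} = \frac{1}{m} 1^{T} Z^{j} 1$ can be checked numerically. The sketch below builds $A$, stacks the column-major vectorizations of the $Z^{j}$ into $\tilde{z}$, and verifies that $(I_{d} \otimes A) \tilde{z} = 0$ for feasible parameters. It assumes that $\mathrm{vec}$ stacks columns and that all series have $m$ categories; the helper names and the Dirichlet draws are illustrative.

```python
import numpy as np

def build_A(m):
    """Matrix A of Equation (4.3): row k is (..., 1_m^T, -1_m^T, ...), so that
    A vec(Z) returns the differences between adjacent column sums of Z."""
    A = np.zeros((m - 1, m * m))
    for k in range(m - 1):
        A[k, k * m:(k + 1) * m] = 1.0
        A[k, (k + 1) * m:(k + 2) * m] = -1.0
    return A

def vec(Zj):
    """Column-major vectorization."""
    return Zj.flatten(order="F")

rng = np.random.default_rng(0)
d, m = 3, 4
gamma = rng.dirichlet(np.ones(d + 1))
# Feasible Z^j = gamma_j P^j, where each column of P^j sums to one.
Z = [gamma[j + 1] * rng.dirichlet(np.ones(m), size=m).T for j in range(d)]

z_tilde = np.concatenate([vec(Zj) for Zj in Z])
residual = np.kron(np.eye(d), build_A(m)) @ z_tilde   # zero when column sums match

gamma_from_Z = np.array([Zj.sum() / m for Zj in Z])   # (1/m) 1^T Z^j 1
l1_penalty = gamma_from_Z.sum()                       # penalty term of (4.4), lambda = 1
```

Since the intercept is excluded from the penalty, `l1_penalty` equals $1 - γ_{0}$ here and can shrink as mass moves to the intercept.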

Another natural convex relaxation of the objective in Equation (4.1) is given by a group lasso penalty on each $Z^{j}$ [47]. The relaxation is derived by writing the $L_{0}$ norm as a rank constraint in terms of $Z^{j}$ , which is then relaxed to a group lasso. Specifically, assume all time series have the same number of categories, i.e., $m_{j} = m \forall j$ . Due to the equality and non-negativity constraints,

{∥γ_{1 : d}∥}_{0} = {∥(1^{T} \mathrm{vec} (Z^{1}), \dots, 1^{T} \mathrm{vec} (Z^{d}))∥}_{0} = \mathrm{rank} (Q^{T} Q) = \mathrm{rank} (Q),

where

Q = (\begin{matrix} v e c (Z^{1}) & 0 & \dots & 0 \\ 0 & v e c (Z^{2}) & \dots & 0 \\ 0 & \dots & ⋱ & ⋮ \\ 0 & \dots & \dots & v e c (Z^{d}) \end{matrix}) .

Thus, we can use the nuclear norm on $Q$ as a convex relaxation of ${∥γ_{1 : d}∥}_{0}$. Furthermore, because the columns of $Q$ have disjoint supports, the nuclear norm of $Q$ is given by the sum of the Frobenius norms of the $Z^{j}$. More specifically, denoting by $∥ \cdot ∥_{*}$ the nuclear norm and by $∥ \cdot ∥_{F}$ the Frobenius norm,

∥ Q ∥_{*} = \sum_{j = 1}^{d} {∥Z^{j}∥}_{F} = \sum_{j = 1}^{d} \sqrt{tr ({(Z^{j})}^{T} (Z^{j}))} .

This group lasso penalty gives the final problem

\underset{Z}{\text{minimize}} \; L_{MTD} (Z) + λ \sum_{j = 1}^{d} {∥Z^{j}∥}_{F} \quad \text{subject to} \quad (I_{d} \otimes A) \tilde{z} = 0, \; 1^{T} \tilde{z} = m, \; \tilde{z} \geq 0 .

(4.5)

Here, we penalize $Z^{j}$ directly, rather than indirectly via $γ_{j}$ . The group lasso penalty drives all elements of $Z^{j}$ to zero together, such that the optimal solution sets some $Z^{j}$ to be all zero. This effect naturally coincides with our conditions of Granger non-causality that all elements of $Z^{j} = 0$ . The group lasso penalty also satisfies the conditions of Theorem 3.3 since the $L_{2}$ norm is increasing with respect to each element in $Z^{j}$ and the intercept is not penalized. Thus, solutions to Problem (4.5) automatically enforce the MTD identifiability constraints.
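The two penalties, and the identity between the nuclear norm of $Q$ and the sum of Frobenius norms, can be checked numerically. The following is a minimal sketch rather than the authors' code: the dimensions, the weights `gamma`, and the helper names `l1_penalty` and `group_penalty` are hypothetical, and $\mathrm{vec}$ is assumed to stack columns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3

# Hypothetical weights gamma_1..gamma_d; the remaining 0.2 is the unpenalized intercept weight.
gamma = np.array([0.5, 0.0, 0.3, 0.0])
# P[j] is column-stochastic: every column of P[j] sums to one.
P = rng.dirichlet(np.ones(m), size=(d, m)).transpose(0, 2, 1)
Z = gamma[:, None, None] * P            # Z^j = gamma_j * P^j

def l1_penalty(Z):
    """sum_j (1/m) 1^T Z^j 1, i.e. the penalty of (4.4) without lambda."""
    return Z.sum() / Z.shape[1]

def group_penalty(Z):
    """sum_j ||Z^j||_F, i.e. the penalty of (4.5) without lambda."""
    return sum(np.linalg.norm(Zj, "fro") for Zj in Z)

# Block-diagonal Q with vec(Z^j) (column-major) in the j-th diagonal block.
Q = np.zeros((d * m * m, d))
for j in range(d):
    Q[j * m * m:(j + 1) * m * m, j] = Z[j].flatten(order="F")

nuclear_norm = np.linalg.norm(Q, "nuc")
rank_Q = np.linalg.matrix_rank(Q)
```

In this toy setting the $L_{1}$ penalty recovers $\sum_{j} γ_{j}$, the rank of $Q$ counts the nonzero $γ_{j}$, and the nuclear norm of $Q$ agrees with the group lasso penalty.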

The group lasso penalty tends to favor larger groups [20]. When the time series have different numbers of categories, the coefficient matrices $Z^{j}$ have different sizes. In this case, one can use penalties that scale with the group size, for example, $λ \sum_{j = 1}^{d} \sqrt{m_{j}} {∥ Z^{j} ∥}_{F}$. For simplicity, we hereafter focus on the case where all time series have the same number of categories, and omit the dependence of the penalty on group sizes.

4.2. Model Selection in mLTD.

To select for Granger causality in the mLTD model, we add a group lasso penalty to each of the $Z^{j}$ matrices, similar to Equation (4.5), leading to the following optimization problem:

\underset{Z}{\text{minimize}} \; - \sum_{t = 1}^{T} \left[ z_{x_{i t}}^{0} + \sum_{j = 1}^{d} Z_{x_{i t} x_{j (t - 1)}}^{j} - \log \left( \sum_{x^{'} \in 𝒳_{i}} \exp \left( z_{x^{'}}^{0} + \sum_{j = 1}^{d} Z_{x^{'} x_{j (t - 1)}}^{j} \right) \right) \right] + λ \sum_{j = 1}^{d} {‖ Z^{j} ‖}_{F} \quad \text{subject to} \quad Z_{1 : m_{i}, 1}^{j} = 0, \; Z_{m_{i}, 1 : m_{j}}^{j} = 0 \; \forall j .

(4.6)

For two categories, $m_{i} = 2 \forall i$ , this problem reduces to sparse logistic regression for binary time series, which was recently studied theoretically [18]. As in the MTD case, the group lasso penalty shrinks some $Z^{j}$ entirely to zero.
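As an illustration, the penalized mLTD objective can be evaluated as a negative multinomial-logit log-likelihood plus the group lasso term. This is a sketch under stated assumptions, not the authors' implementation: categories are coded `0..m-1`, `Z[j, a, b]` is taken to be the effect of $x_{j (t - 1)} = b$ on $x_{i t} = a$, the function name `mltd_objective` is hypothetical, and the identifiability constraints are not enforced here.

```python
import numpy as np

def mltd_objective(x_i, X_prev, z0, Z, lam):
    """Negative mLTD log-likelihood for series i plus lam * sum_j ||Z^j||_F.

    x_i:    (T,)   states of series i at times 1..T
    X_prev: (T, d) states of all series one step earlier
    z0:     (m,)   intercept;  Z: (d, m, m) interaction matrices
    """
    T, d = X_prev.shape
    scores = np.tile(z0, (T, 1)).astype(float)      # scores[t, a] starts at z0[a]
    for j in range(d):
        scores += Z[j][:, X_prev[:, j]].T           # add Z^j[a, x_j(t-1)]
    mx = scores.max(axis=1)                          # stabilized log-sum-exp
    log_norm = mx + np.log(np.exp(scores - mx[:, None]).sum(axis=1))
    nll = -(scores[np.arange(T), x_i] - log_norm).sum()
    penalty = lam * sum(np.linalg.norm(Z[j], "fro") for j in range(d))
    return nll + penalty

rng = np.random.default_rng(1)
T, d, m = 50, 2, 3
X_prev = rng.integers(0, m, size=(T, d))
x_i = rng.integers(0, m, size=T)
value_at_zero = mltd_objective(x_i, X_prev, np.zeros(m), np.zeros((d, m, m)), lam=0.5)
```

With all parameters at zero, every category is equally likely, so the objective equals $T \log m$, which gives a quick sanity check.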

5. Optimization.

Here we present fast proximal algorithms for fitting both penalized MTD and mLTD models. The convex formulation invites new optimization routines for fitting MTD models since many options exist for solving problems with convex objectives and linear equality and inequality constraints. In the Supplementary Material, we present alternative MTD solvers based on Frank-Wolfe [22] and Majorization-Minimization (MM) [21], and discuss their trade-offs. Both Frank-Wolfe and MM algorithms for MTD take elegant and simple forms. Furthermore, the MM algorithm for the non-penalized convex Problem 3.2 is equivalent to an EM algorithm for the MTD model in the original non-convex parameterization of Problem 3.1. As a byproduct, this equivalence shows that the EM algorithm under the non-convex parameterization converges to a global optimum. Here we focus on proximal algorithms since the MM algorithm for MTD is applicable only to the non-penalized MTD objective and Frank-Wolfe converges slowly relative to proximal gradient for the dimensions we consider; see the Supplementary Material for more details.

For the mLTD model, we perform gradient steps with respect to the mLTD likelihood followed by a proximal step with respect to the group lasso penalty. This leads to a gradient step of the smooth likelihood followed by separate soft group thresholding [33] on each $Z^{j}$ .
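The group soft-thresholding step is the standard proximal operator of the Frobenius norm. A minimal sketch follows, where the function name and the threshold `tau` (step size times $λ$) are illustrative choices:

```python
import numpy as np

def group_soft_threshold(Zj, tau):
    """Proximal operator of tau * ||.||_F: shrink the whole block toward zero."""
    norm = np.linalg.norm(Zj, "fro")
    if norm <= tau:
        return np.zeros_like(Zj)        # the entire block is set to zero
    return (1.0 - tau / norm) * Zj      # otherwise scale the block uniformly

Zj = np.array([[3.0, 0.0], [0.0, 4.0]])   # Frobenius norm 5
shrunk = group_soft_threshold(Zj, 1.0)
zeroed = group_soft_threshold(Zj, 6.0)
```

Because the operator only rescales a block, entries that are already zero stay zero after the proximal step.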

For the MTD model, our proximal algorithm reduces to a projected gradient algorithm [33]. Projected gradient algorithms take steps along the negative gradient of the objective, and then project the result onto the feasible region defined by the constraints. Compared to other MTD optimization methods, our projected gradient algorithm under the $Z^{j}$ parameterization is guaranteed to reach a global optimum of the MTD log-likelihood. The gradient of the regularized MTD model with respect to entries in $Z^{j}$ over the feasible set is given by

\frac{d L}{d Z_{x^{'} x^{″}}^{j}} = - \sum_{t = 1}^{T} 1_{\{x_{i t} = x^{'}, x_{j (t - 1)} = x^{″}\}} \frac{1}{Z_{x_{i t}}^{0} + \sum_{l = 1}^{d} Z_{x_{i t} x_{l (t - 1)}}^{l}} + λ \frac{d Ω}{d Z_{x^{'} x^{″}}^{j}} .

(5.1)

For the $L_{1}$ norm, $Ω (Z)$ is not differentiable when an element in any $Z^{j}$ is zero. For the $L_{2}$ group norm, $Ω (Z)$ is not differentiable when every element in at least one $Z^{j}$ is zero. However, the MTD constraints enforce that $Z^{j} \geq 0$ . Since the point of non-differentiability for the $L_{2}$ norm in our case occurs when elements are identically zero, we modify the constraints so that $Z^{j} \geq ϵ$ for some small $ϵ$ when using the group penalty. This allows us to ignore non-differentiability issues, and instead take gradient steps directly along the penalized MTD objective.
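The likelihood part of this gradient can be accumulated from co-occurrence counts and checked against a finite difference. The sketch below takes $L$ to be the negative log-likelihood being minimized and omits the penalty term; the function name, the toy data, and the weights `gamma` are hypothetical, and categories are coded `0..m-1`.

```python
import numpy as np

def mtd_nll_and_grad(x_i, X_prev, z0, Z):
    """Negative MTD log-likelihood for series i and its gradient w.r.t. Z."""
    T, d = X_prev.shape
    # p[t] = Z^0[x_it] + sum_j Z^j[x_it, x_j(t-1)]
    p = z0[x_i] + sum(Z[j][x_i, X_prev[:, j]] for j in range(d))
    nll = -np.log(p).sum()
    grad = np.zeros_like(Z)
    for j in range(d):
        # add -1/p_t to entry (x_it, x_j(t-1)) for every time step t
        np.add.at(grad[j], (x_i, X_prev[:, j]), -1.0 / p)
    return nll, grad

rng = np.random.default_rng(2)
T, d, m = 200, 3, 3
X_prev = rng.integers(0, m, size=(T, d))
x_i = rng.integers(0, m, size=T)
gamma = np.array([0.4, 0.3, 0.2, 0.1])               # gamma_0..gamma_d (illustrative)
z0 = gamma[0] * rng.dirichlet(np.ones(m))
Z = gamma[1:, None, None] * rng.dirichlet(np.ones(m), size=(d, m)).transpose(0, 2, 1)

nll, grad = mtd_nll_and_grad(x_i, X_prev, z0, Z)

# Central finite-difference check on one entry of Z^2.
eps = 1e-6
Zp, Zm = Z.copy(), Z.copy()
Zp[1, 0, 2] += eps
Zm[1, 0, 2] -= eps
numeric = (mtd_nll_and_grad(x_i, X_prev, z0, Zp)[0]
           - mtd_nll_and_grad(x_i, X_prev, z0, Zm)[0]) / (2 * eps)
```

On the interior of the feasible set the penalty gradients are simple to add: $λ / m$ per entry for the $L_{1}$ penalty, and $λ Z^{j} / {∥Z^{j}∥}_{F}$ for the group lasso penalty.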

Following the notation from the end of Section 4.1, let the set $C = \{\tilde{z} | \tilde{z} \geq ϵ, (I_{d} \otimes A) \tilde{z} = 0, 1^{T} \tilde{z} = m\}$ denote the modified MTD constraints with respect to the $Z^{j}$ parameterization. We perform projected gradient descent by taking a step along the regularized MTD gradient of Equation (5.1) and then project the result onto $C$ . Specifically, the algorithm iterates the following recursion starting at iteration $k = 0$ :

{\tilde{z}}^{k + 1} = 𝒫_{C} ({\tilde{z}}^{k} - δ_{k} \frac{d L}{d \tilde{z}}),

(5.2)

where $δ_{k}$ is a step size chosen by line search [33]. For ease of presentation, here we have written the projected gradient steps in terms of the vectorized variables $\tilde{z}$ , rather than the $Z^{j}$ matrices. The $𝒫_{C} (x)$ operation is the projection of a vector $x$ onto the modified MTD constraint set $C$ :

\underset{z}{\text{minimize}} \; ∥ z - x ∥_{2}^{2} \quad \text{subject to} \quad z \geq ϵ, \; (I_{d} \otimes A) z = 0, \; 1^{T} z = m,

with $ϵ = 0$ for the $L_{1}$ penalty and $ϵ > 0$ but small for the group lasso penalty. While this is a standard quadratic program for which we may use the dual method [16], e.g., as implemented in the R quadratic programming package quadprog [43], we have found that standard solvers may scale poorly as the number of time series $d$ becomes large. To mitigate this inefficiency, here we develop a fast projection algorithm based on Dykstra’s splitting algorithm [7] that harnesses the particular structure of the constraint set for much faster projection, as described in Section 5.1.

5.1. Dykstra’s Splitting Algorithm for Projection onto the MTD Constraints.

The set $C$ may be written as the intersection of two simpler sets: $C = S \cap B$ , where $S$ is the simplex constraint set of the first column of each $Z^{j}$ matrix and the non-negativity constraint for all entries of $Z^{j}$ . Specifically,

S = \left\{ \{Z^{j} \in ℝ^{m \times m}\}_{j = 0}^{d} \;\middle|\; \sum_{j = 0}^{d} \sum_{i = 1}^{m} Z_{i 1}^{j} = 1, \; Z^{j} \geq 0 \; \forall j \right\} .

(5.3)

On the other hand, $B = \cap_{j = 1}^{d} B_{j}$, where $B_{j}$ is the constraint set that all columns in $Z^{j}$ sum to the same value:

B_{j} = \{Z^{j} \in ℝ^{m \times m} \mid A \, \mathrm{vec} (Z^{j}) = 0\},

(5.4)

where the matrix $A$ is given in Equation (4.3). Dykstra’s algorithm alternates between projecting onto the simplex constraints $S$ and the equal column sums $B$ by iterating the following steps. Let $w^{0} = x, u^{0} = v^{0} = 0$ . Denote by $𝒫_{S}$ the projection onto the set $S$ and by $𝒫_{B}$ the projection onto the set $B$ . Dykstra’s algorithm amounts to the following iterations starting with $l = 0$ :

y^{l} = 𝒫_{S} (w^{l} + u^{l})

u^{l + 1} = w^{l} + u^{l} - y^{l}

w^{l + 1} = 𝒫_{B} (y^{l} + v^{l})

v^{l + 1} = y^{l} + v^{l} - w^{l + 1} .

The $𝒫_{S}$ projection may be split into a simplex projection for the constraint $\sum_{j = 0}^{d} \sum_{i = 1}^{m} Z_{i 1}^{j} = 1, Z_{i 1}^{j} \geq 0 \; \forall i, j$ and a non-negativity projection for the constraint $Z_{i n}^{j} \geq 0 \; \forall i, j$ and $n > 1$. We perform the simplex projection in $O (d m \log (d m))$ time using the algorithm of [13]; the non-negativity projection simply thresholds elements at zero and is performed in linear time. The $𝒫_{B}$ linear projection is performed separately for each $Z^{j}$:

𝒫_{B_{j}} (x) = (I - A^{T} {(A A^{T})}^{- 1} A) x,

(5.5)

where $I - A^{T} {(A A^{T})}^{- 1} A$ may be precomputed, so the per-iteration complexity for the full $B$ projection is $O (d m^{4})$ since $A$ is an $(m - 1) \times m^{2}$ matrix and the projector is therefore $m^{2} \times m^{2}$. Importantly, this projection scheme harnesses the structure of the constraint set by splitting the projections into components that admit fast and simple low-dimensional projections. The full projection algorithm is given in Algorithm 5.2.
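To make the $B_{j}$ projection concrete, the sketch below builds the $(m - 1) \times m^{2}$ matrix $A$ of Equation (4.3), forms the orthogonal projector onto its null space, and compares the result with an equivalent column-wise formula. The function names are hypothetical, $\mathrm{vec}$ is assumed to stack columns, and the closed form is an observation added here rather than a step taken from the paper.

```python
import numpy as np

def build_A(m):
    """Row k of A is (e_k - e_{k+1})^T kron 1_m^T, so A vec(Z) lists differences of adjacent column sums."""
    D = np.eye(m)[:-1] - np.eye(m)[1:]          # (m-1) x m
    return np.kron(D, np.ones((1, m)))          # (m-1) x m^2

def projector_onto_null_space(A):
    """I - A^T (A A^T)^{-1} A, an m^2 x m^2 matrix that can be precomputed."""
    return np.eye(A.shape[1]) - A.T @ np.linalg.solve(A @ A.T, A)

def project_equal_colsums(Zj):
    """Same projection without forming the projector: shift each column so all column sums equal their mean."""
    s = Zj.sum(axis=0)
    return Zj - (s - s.mean())[None, :] / Zj.shape[0]

m = 4
A = build_A(m)
P_B = projector_onto_null_space(A)
X = np.random.default_rng(5).normal(size=(m, m))
proj = (P_B @ X.flatten(order="F")).reshape((m, m), order="F")
```

The column-wise version costs $O (m^{2})$ per block, which may be preferable when $m$ is large, although the matrix form matches Equation (5.5) more directly.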

We compare projection times of the Dykstra algorithm to the active set method of [16] implemented in the R package quadprog [43]. The Dykstra projection for the MTD constraints was implemented in C++. Elements of $Z^{j}$ were drawn independently from a normal distribution with standard deviation 0.7 and then projected onto $C$. Average run times across 10 random realizations for $d \in \{10, 20, 30, 40, 50, 60, 70\}$ series and $m = 5$ categories are displayed in Figure 4. The Dykstra algorithm was run until iterates changed by less than 10⁻¹⁰. For each run, the elementwise maximum difference between the Dykstra projection and the quadprog projection was always on the scale of 10⁻¹⁰. Across this range of $d$, the quadprog runtime appears to scale quadratically in $d$, with a total run time on the scale of seconds for $d \geq 20$. The Dykstra projection method, however, appears to scale near linearly in this range with run times on the order of milliseconds. We also performed experiments with differing standard deviations for the independent draws of $Z^{j}$ and observed very similar results.

Figure 4. — (left) A runtime comparison of the quadprog projection method and the Dykstra projection method on a range of time series dimensions. (right) A zoom in of only the compute time of the Dykstra method.

Algorithm 5.1.

Projected gradient algorithm for MTD using Dykstra projections.

Initialize $Z^{(0)}$ (all blocks $Z^{j}$) and set $k = 0$

while $Z^{(k)}$ not converged do

compute $\nabla L (Z^{(k)})$ via Equation (5.1)

determine the step size $δ_{k}$ by line search [33]

$Z^{(k + 1)} = \mathrm{DykstraMTD} (Z^{(k)} - δ_{k} \nabla L (Z^{(k)}))$

$k = k + 1$

end while

return $Z^{(k)}$

5.2. Comparing Model Selection and Optimization in MTD and mLTD.

Approaches to model selection in MTD and mLTD models are conceptually similar; both add regularizing penalties to enforce elements in $Z^{j}$ to zero. However, these two approaches differ in practice. We explore the differences in selecting for Granger causality between these two approaches via extensive simulations in Section 7.

Both MTD and mLTD models take gradient steps followed by a proximal operation. In the mLTD model, this proximal operation is given by soft thresholding on the elements of $Z^{j}$ . In the MTD optimization the proximal operation reduces to a projection onto the MTD constraint set. Importantly, due to the restricted domain of the MTD parameter set, the normally non-smooth penalty terms become smooth over the constraint set and we thus include them in the gradient step. In mLTD, the soft threshold proximal operation is performed in linear time while in MTD the projection is performed by iteratively using the Dykstra algorithm, where each step of the Dykstra algorithm is performed in log-linear time.

Algorithm 5.2.

DykstraMTD: Dykstra algorithm for projection onto the MTD constraints.

Input: $z = {({(z^{0})}^{T}, \mathrm{vec} {(Z^{1})}^{T}, \dots, \mathrm{vec} {(Z^{d})}^{T})}^{T}$

Let $S$ be the ordered indices of $z$ whose elements belong to the first column of some $Z^{j}, j > 0$, or to $z^{0}$, and let $\setminus S$ denote the remaining indices

Let $(j)$ refer to the ordered indices of $z$ whose elements belong to $Z^{j}$, for all $j$, with $(0)$ the indices of $z^{0}$

$w^{0} = z$, $u^{0} = v^{0} = 0$, $l = 0$

while $w^{l}$ not converged do

$y_{S}^{l} = \mathrm{SimplexProjection} (w_{S}^{l} + u_{S}^{l})$ via [13]

$y_{\setminus S}^{l} = \mathrm{PositiveThreshold} (w_{\setminus S}^{l} + u_{\setminus S}^{l})$

$u^{l + 1} = w^{l} + u^{l} - y^{l}$

$w_{(0)}^{l + 1} = y_{(0)}^{l} + v_{(0)}^{l}$

for $j = 1 : d$

$w_{(j)}^{l + 1} = 𝒫_{B_{j}} (y_{(j)}^{l} + v_{(j)}^{l})$ via Equation (5.5)

end for

$v^{l + 1} = y^{l} + v^{l} - w^{l + 1}$

$l = l + 1$

end while

return $w^{l}$
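The Dykstra projection can be sketched in a few lines of NumPy. This is an illustrative reimplementation under assumptions, not the authors' C++ code: it uses $ϵ = 0$, stores the blocks as an array of shape `(d, m, m)`, uses a sort-based simplex projection, replaces the matrix form of the $B_{j}$ projection with an equivalent column-sum shift, and adds a stopping rule that also requires the two half-steps to agree.

```python
import numpy as np

def project_simplex(v, total=1.0):
    """Sort-based Euclidean projection onto {w >= 0, sum(w) = total}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - total
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def project_equal_colsums(Zj):
    """Projection onto B_j: make all column sums equal to their mean."""
    s = Zj.sum(axis=0)
    return Zj - (s - s.mean())[None, :] / Zj.shape[0]

def dykstra_mtd(z0, Z, tol=1e-9, max_iter=50000):
    """Project (z0, Z) onto the MTD constraint set C = S ∩ B with eps = 0."""
    d, m, _ = Z.shape
    w0, W = z0.astype(float), Z.astype(float)
    u0, U = np.zeros(m), np.zeros((d, m, m))
    v0, V = np.zeros(m), np.zeros((d, m, m))
    for _ in range(max_iter):
        # P_S: simplex on intercept + first columns, thresholding elsewhere.
        t0, Tm = w0 + u0, W + U
        first = project_simplex(np.concatenate([t0, Tm[:, :, 0].ravel()]))
        y0, Y = first[:m], np.maximum(Tm, 0.0)
        Y[:, :, 0] = first[m:].reshape(d, m)
        u0, U = t0 - y0, Tm - Y
        # P_B: identity on the intercept, equal column sums on each Z^j.
        s0, Sm = y0 + v0, Y + V
        w0_new = s0
        W_new = np.stack([project_equal_colsums(Sm[j]) for j in range(d)])
        v0, V = s0 - w0_new, Sm - W_new
        change = max(np.abs(w0_new - w0).max(), np.abs(W_new - W).max(),
                     np.abs(w0_new - y0).max(), np.abs(W_new - Y).max())
        w0, W = w0_new, W_new
        if change < tol:
            break
    return w0, W

rng = np.random.default_rng(4)
d, m = 5, 3
z0_hat, Z_hat = dykstra_mtd(rng.normal(0, 0.7, m), rng.normal(0, 0.7, (d, m, m)))
```

A point that already satisfies the constraints is returned unchanged after one pass, which is a convenient check when wiring the projection into a projected gradient loop.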

6. Estimation Consistency of MTD Model.

In this section, we establish an upper bound for estimation error of MTD parameters under the group lasso penalty. Analogous results can be obtained for the standard lasso penalty.

We first note that the MTD likelihood is of the same form as a multinomial GLM model with identity link, i.e., with probability modeled as linear combination of covariates. However, the dependence in the time series and the identity link create additional technicalities in the proof, and we will use newly developed concentration and entropy results in the dependent sample setting to overcome these difficulties.

We begin by stating the assumptions. Recall that $X = \{x_{1}, \dots, x_{t}, \dots, x_{T}\}$ is a Markov chain with state space $𝒳$ . The transition kernel is given by (2.2) and (2.5). As in [34], we say that $X$ is $φ$ -irreducible, if there exists a non-zero $σ$ -finite measure $φ$ on $𝒳$ such that for all $A \subset 𝒳$ with $φ (A) > 0$ and for all $x \in 𝒳$ , there exists a positive integer $n$ such that $P^{n} (x; A) > 0$ . Here, $P^{n} (x; \cdot)$ is the distribution of $x_{n}$ given $x_{0} = x$ . Our first assumption concerns the nature of the data generating model and is rather mild.

Assumption 1.

$X$ is aperiodic and $φ$ -irreducible, and has a unique stationary distribution $π$ .

For ease of presentation, we write the MTD likelihood as a multinomial model with identity link. Let $I \{\cdot\}$ be the indicator function. Define $W_{t 0} = {(W_{t 0}^{1}, \dots, W_{t 0}^{m})}^{⊤} \in R^{m}$ where $W_{t 0}^{l} = I \{x_{i t} = l\}$, so that $W_{t 0}$ indicates the state of time series $i$ at time $t$. We define $W_{t j} = {({(W_{t j}^{1})}^{⊤}, \dots, {(W_{t j}^{m})}^{⊤})}^{⊤} \in R^{m^{2}}$, for each $j \in \{1, \dots, d\}$, where $W_{t j}^{l} = {(W_{t j}^{l 1}, \dots, W_{t j}^{l m})}^{⊤}$ and $W_{t j}^{l k} = I \{x_{i t} = l, x_{j (t - 1)} = k\}$. Hence, $W_{t j}$ indicates both the state of time series $i$ at time $t$ and the state of time series $j$ at time $t - 1$. Define a new covariate vector $W_{t} \in R^{m + d m^{2}}$ as $W_{t} = {(W_{t 0}^{⊤}, W_{t 1}^{⊤}, \dots, W_{t d}^{⊤})}^{⊤}$. We note that each component of $W_{t}$ can take values only in $\{0, 1\}$, and denote the set of possible values of $W_{t}$ by $𝒲$. The MTD model can then be written as

p (x_{i t} | x_{t - 1}) = W_{t}^{⊤} β^{0},

(6.1)

where $β^{0} \in R^{m + d m^{2}}$ is the coefficient of interest defined in terms of the $Z$’s. Specifically, for a general set of MTD parameters, we let $β_{0} = Z^{0}$, $β_{j} = \mathrm{vec} (Z^{j})$ for $j \in \{1, \dots, d\}$ and define $β = {(β_{0}^{⊤}, β_{1}^{⊤}, \dots, β_{d}^{⊤})}^{⊤}$. In other words, the first $m$ components correspond to the intercept and each subsequent consecutive $m^{2}$ components correspond to a transition matrix. The superscript 0 denotes the true parameter value.
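The indicator encoding can be checked directly: the inner product $W_{t}^{⊤} β$ should pick out the intercept entry and one entry from each $Z^{j}$. In the sketch below, states are coded `0..m-1`, the helper name `design_vector` is hypothetical, and each $Z^{j}$ is flattened row by row, which is the ordering that matches the definition of $W_{t j}$ (with $l$ as the outer index); this ordering is an assumption about how $\mathrm{vec}$ is meant here.

```python
import numpy as np

def design_vector(x_it, x_prev, m):
    """Binary covariate W_t of length m + d*m^2 for current state x_it and lagged states x_prev."""
    d = len(x_prev)
    W = np.zeros(m + d * m * m)
    W[x_it] = 1.0                                    # W_{t0}: state of series i at time t
    for j, k in enumerate(x_prev):
        W[m + j * m * m + x_it * m + k] = 1.0        # W_{tj}^{lk} with l = x_it, k = x_j(t-1)
    return W

rng = np.random.default_rng(3)
d, m = 3, 4
z0 = 0.1 * rng.random(m)
Z = 0.1 * rng.random((d, m, m))
beta = np.concatenate([z0] + [Z[j].ravel() for j in range(d)])   # row-major blocks

x_it, x_prev = 2, [0, 3, 1]
W = design_vector(x_it, x_prev, m)
prob = W @ beta
direct = z0[x_it] + sum(Z[j][x_it, x_prev[j]] for j in range(d))
```

Exactly $d + 1$ entries of $W_{t}$ are nonzero, so evaluating $W_{t}^{⊤} β$ amounts to $d + 1$ table lookups.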

Denote the group lasso penalty by $Ω (β) = \sum_{j = 1}^{d} {∥β_{j}∥}_{2} = \sum_{j = 1}^{d} {∥Z^{j}∥}_{F}$ , where the intercept is left unpenalized. The MTD optimization problem can be written as

\underset{β}{\text{minimize}} \; \left\{ - \frac{1}{T} \sum_{t = 1}^{T} \log (W_{t}^{⊤} β) + λ Ω (β) \right\},

(6.2)

subject to (I_{d} \otimes A) β_{1 : d} = 0, m 1^{⊤} β_{0} + \sum_{j = 1}^{d} 1^{⊤} β_{j} = m, β \geq 0.

(6.3)

Let $R_{n}$ and $R$ be the empirical and conditional expected negative log-likelihood risks, respectively

R_{n} (β) = - \frac{1}{T} \sum_{t = 1}^{T} \log (W_{t}^{⊤} β); R (β) = - \frac{1}{T} \sum_{t = 1}^{T} E [\log (W_{t}^{⊤} β) | 𝒜_{t - 1}],

(6.4)

where $𝒜_{t}$ is the $σ$-algebra generated by $x_{1}, \dots, x_{t}$. Furthermore, let $S$ denote the active set of $β^{0}$, i.e., $S = \{j : j > 0, β_{j}^{0} \neq 0\}$, and let $S^{c}$ denote its complement in $\{1, \dots, d\}$. We define $Ω^{+} (β) = \sum_{j \in S} {∥β_{j}∥}_{1}$ and $Ω^{-} (β) = \sum_{j \in S^{c}} {∥β_{j}∥}_{1}$. With this formulation, we are now ready to state the next assumptions.

Assumption 2.

For all $W \in 𝒲$ such that $W^{T} β^{0} \neq 0$, we have $W^{T} β^{0} \geq c (T, d)$ for some function $c$ that depends only on $T$ and $d$. Moreover, we assume that

\frac{| S |}{c^{2} (T, d)} \sqrt{\frac{\log (d) \log^{3} (T)}{T}} = o (1) .

(6.5)

Assumption 3.

Define a semi-norm $τ (\cdot)$ as $τ (β) = \sqrt{β^{⊤} E_{π} [W_{t} W_{t}^{⊤}] β}$ . For a stretching factor $L \geq 1$ , define

Γ_{Ω} (L, S, τ) = {\left( \underset{β}{\min} \{τ (β) : β \in ℬ, {∥β_{0}∥}_{1} + Ω^{+} (β) = 1, Ω^{-} (β) \leq L\} \right)}^{- 1},

(6.6)

ϕ^{2} (L, S, τ) = Γ_{Ω}^{- 2} (L, S, τ) | S |,

(6.7)

where $ℬ$ is the set of all $β$ that can be written as a scaled difference between two vectors that satisfy the MTD model constraints and identifiability constraints. We assume that for some $L \geq 1, ϕ^{2} (L, S, τ) \geq c_{1}$ for some constant $c_{1}$ .

Assumption 2 states that the transition probabilities are either 0 or bounded away from 0 by some quantity that only depends on the sample size and dimension. We further assume that this quantity is larger than the estimation error, which we will derive later. It ensures that when the parameter estimates are close to the true value, the likelihoods are also close. This in general may not be the case, as log(⋅) is unbounded when its argument approaches 0 and is not Lipschitz-continuous. Assumption 3 is a compatibility condition, often used in establishing estimation consistency of lasso-type estimators [8]. It is slightly weaker than the restricted eigenvalue condition which is also commonly used. Intuitively, this assumption requires that inactive groups are not too correlated with the active ones. The requirement that $β \in ℬ$ constrains the inherent co-linearity among the covariates.

Due to the Markovian structure, the design ${(W_{t})}_{t = 1}^{T}$ has to be treated as random, yet the compatibility constant is defined using population quantities. Hence, we need to show that the sample version of compatibility constant converges to its population counterpart defined in Assumption 3. To this end we use concentration results for Markov chains developed in [34] based on spectral methods.

A key quantity for the concentration results is the pseudo spectral gap of the chain [34]. We re-state the relevant definitions here for completeness. Let $L^{2} (π)$ be the Hilbert space of complex valued measurable functions on $𝒳$ that are square integrable with respect to $π$ . We equip $L^{2} (π)$ with the inner product $⟨ f, g ⟩_{π} = \int f g^{*} d π$ . Define a linear operator $P$ on $L^{2} (π)$ as $(P f) (x) = E_{P (x, \cdot)} (f)$ , which is induced from the transition kernel $P$ . The spectrum of a chain is defined as

S_{2} = \{λ \in ℂ : {(λ I - P)}^{- 1} \text{ does not exist as a bounded linear operator on } L^{2} (π)\} .

(6.8)

If $P$ is a self-adjoint operator, the spectral gap is defined as

γ = \begin{cases} 1 - \sup \{λ : λ \in S_{2}, λ \neq 1\} & \text{if eigenvalue 1 has multiplicity 1,} \\ 0 & \text{otherwise.} \end{cases}

(6.9)

The self-adjointness of $P$ corresponds to the reversibility of the Markov chain with transition kernel $P$ . In general, the chain specified by the MTD model may not be reversible. In this case, define the time reversal of $P$ as the transition kernel

P^{*} (x, y) = \frac{P (y, x)}{π (x)} π (y) .

(6.10)

Then, the induced linear operator $P^{*}$ is the adjoint of $P$ on $L^{2} (π)$ . Note that when the chain is indeed reversible, we have $P^{*} = P$ . Finally, the pseudo spectral gap of $P$ is defined as

γ_{p s} = \underset{k \geq 1}{\max} \{γ ({(P^{*})}^{k} P^{k}) / k\},

(6.11)

where $γ ({(P^{*})}^{k} P^{k})$ denotes the spectral gap of the self-adjoint operator ${(P^{*})}^{k} P^{k}$ . See Section 3.1 in [34] for additional discussion on the pseudo spectral gap. We make the following assumption on the pseudo spectral gap:
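For a chain on a small finite state space, these definitions can be evaluated directly. The sketch below is illustrative only: it assumes a row-stochastic transition matrix `P` with a unique stationary distribution, truncates the maximum over $k$ at a hypothetical `k_max`, and is not practical for the joint chain of an MTD model, whose state space has $m^{d}$ elements.

```python
import numpy as np

def stationary(P):
    """Stationary distribution: left eigenvector of P for the eigenvalue closest to 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def spectral_gap(M):
    """Gap of an operator that is self-adjoint on L2(pi), so its eigenvalues are real."""
    lam = np.sort(np.real(np.linalg.eigvals(M)))[::-1]
    if np.isclose(lam[1], 1.0):          # eigenvalue 1 has multiplicity > 1
        return 0.0
    return 1.0 - lam[1]

def pseudo_spectral_gap(P, k_max=20):
    pi = stationary(P)
    # Time reversal: P*(x, y) = P(y, x) * pi(y) / pi(x).
    P_star = (P.T * pi[None, :]) / pi[:, None]
    gaps = []
    for k in range(1, k_max + 1):
        M = np.linalg.matrix_power(P_star, k) @ np.linalg.matrix_power(P, k)
        gaps.append(spectral_gap(M) / k)
    return max(gaps)

P = np.array([[0.7, 0.3],
              [0.3, 0.7]])               # reversible two-state chain
gap = pseudo_spectral_gap(P)
```

For this two-state chain the eigenvalues of $P$ are 1 and 0.4, so $γ (P^{*} P) = 1 - 0.4^{2} = 0.84$, and larger $k$ only lowers $γ ({(P^{*})}^{k} P^{k}) / k$.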

Assumption 4.

The pseudo spectral gap $γ_{p s}$ satisfies $| S | \sqrt{\log (d) / (T γ_{p s})} = o (1)$.

This assumption requires that as $d$ grows, the pseudo spectral gap of the chain does not approach 0 too fast. For a uniformly ergodic chain, the pseudo spectral gap is closely related to its mixing time, and this assumption requires that the mixing time does not grow too large. If $γ_{p s}$ is lower bounded by some constant, Assumption 4 reduces to an assumption on the dimension and sparsity relative to the sample size. Methods have been proposed to estimate the pseudo spectral gap [44], which can be used to assess the validity of this assumption empirically.

We are now ready to state our main theorem on the estimation error of the MTD model.

Theorem 6.1.

(Estimation error) Let $0 < δ < 1$. Suppose that there exist $M_{\max} \geq 0$ and $λ_{ϵ}$ such that for all $0 \leq M \leq M_{\max}$

\underset{β : {∥β_{0} - β_{0}^{0}∥}_{1} + Ω (β - β^{0}) \leq M}{\sup} |(R_{n} (β) - R (β)) - (R_{n} (β^{0}) - R (β^{0}))| \leq λ_{ϵ} M,

(6.12)

and

\frac{32 λ_{ϵ} (1 + δ)^{2} | S |}{δ^{2} ϕ^{2} (1 / (1 - δ), S, τ)} \leq M_{\max} .

(6.13)

Take $λ \geq 8 λ_{ϵ} / δ$ . Then, under Assumption 1 and Assumption 4, for sufficiently large $T$ , we have that

{∥{\hat{β}}_{0} - β_{0}^{0}∥}_{1} + Ω (\hat{β} - β^{0}) \leq \frac{4 λ (1 + δ)^{2} | S |}{δ ϕ^{2} (\frac{1}{1 - δ}, S, τ)} .

(6.14)

Furthermore, under Assumption 3, the RHS is upper bounded by $C (δ) λ | S |$ where $C (δ)$ is a constant depending on $δ$ .

This theorem states that the estimation error defined in terms of $Ω (\cdot)$ is closely related to $λ_{ϵ}$ . The next lemma quantifies the magnitude of $λ_{ϵ}$ .

Lemma 6.2.

Under Assumption 2 and Assumption 3, we can take $λ_{ϵ}$ and $M_{\max}$ to satisfy (6.12) and (6.13), and

λ_{ϵ} = O_{p} \left( \frac{1}{c (T, d)} \sqrt{\frac{\log (d) \log^{3} (T)}{T}} \right), \quad M_{\max} = O (c (T, d)) .

(6.15)

Combining Theorem 6.1 and Lemma 6.2, we have the following corollary.

Corollary 6.3. Under Assumptions 1–4, we have that

{∥{\hat{β}}_{0} - β_{0}^{0}∥}_{1} + Ω (\hat{β} - β^{0}) = O_{p} \left( \frac{| S |}{c (T, d)} \sqrt{\frac{\log (d) \log^{3} (T)}{T}} \right) .

(6.16)

If the minimal nonzero transition probability is large enough so that $1 / c (T, d) = O (1)$, we get a convergence rate of $O_{p} \left( | S | \sqrt{\frac{\log (d) \log^{3} (T)}{T}} \right)$. Compared with the classical results on the estimation error of lasso (see, for example, [5]), we have an extra factor that is polylogarithmic in $T$. This is due to a concentration result in the dependent data setting [40]. Investigating whether this log factor can be removed would be an interesting question for future research.

Based on the estimation error bound, one can consider a thresholded version of the MTD estimator to achieve variable selection consistency. The thresholding step helps eliminate false positives without the stringent irrepresentable condition, which is required for variable selection consistency of the lasso [30]. Specifically, we can use a threshold of $c_{t} \sqrt{\frac{\log (d) \log^{3} (T)}{T}}$ for some appropriately chosen $c_{t}$. If we additionally assume that the minimal signal strength is of order larger than the estimation error bound, we can achieve variable selection consistency asymptotically.

7. Experiments.

We study the performance of our approaches to Granger causality detection in categorical time series. First, we compare penalized mLTD and MTD methods across multiple simulated data scenarios in Section 7.1. In Section 7.2 we apply our penalized MTD method to detect Granger causal connectivity between musical elements in a music dataset of Bach chorales and in Section 7.3 between iEEG sensors during seizures in an awake canine.

7.1. Simulated Data.

We perform a set of simulation experiments to compare the MTD and mLTD model selection methods. Specifically, we compare the MTD group lasso, $L_{1}$-MTD, and mLTD group lasso methods on simulated categorical time series generated from a sparse MTD model, a sparse mLTD model and a sparse latent vector autoregressive model (VAR) with quantized outputs. In the sparse VAR setting, we also compare the three proposed methods to a penalized VAR fit using the ordinal categorical variables. For all experiments, we consider time series of length $T \in \{200, 400, 800, 1600\}$, dimension $d \in \{15, 25\}$, and number of categories $m \in \{2, 3, 4, 5, 6\}$. We first explain the details of each simulation condition and then discuss the results.

Sparse MTD.

For the MTD model, we randomly generate parameters by $γ_{i j} \sim \frac{z_{i j} ϕ_{i j}}{\sum_{l = 1}^{d} z_{i l} ϕ_{i l}}$ where $ϕ_{i} \sim \mathrm{Dirichlet} (α)$ and $z_{i j} \sim \mathrm{Binomial} (δ)$. We let $δ = 0.15$ and $α = 5$. Columns of $P^{i j}$ are generated according to $P_{: l}^{i j} \sim \mathrm{Dirichlet} (γ)$ with $γ = 0.7$. (Note that here we have added a superscript $i$ to $P$ to specifically indicate the $j$ to $i$ interaction, whereas previously we dropped the $i$ index for notational simplicity by assuming we were just looking at the series $i$ term.) To ensure that the columns are not close to identical in $P^{i j}$ (which would imply Granger non-causality), $P^{i j}$ is sampled until the average total variation norm between the columns is greater than some tolerance $ρ$. This ensures that non-causality occurs only when $P^{i j}$ are zero, and not due to equal columns in the simulation. For our simulations, we set $ρ = 0.3$. A lower value of $ρ$ makes it more difficult to learn the Granger causality graph since some true interactions might be extremely weak.

Sparse mLTD.

For the mLTD model, the nonzero $Z^{i j}$ parameters are generated by $Z_{l k}^{i j} \sim z_{i j} N (0, σ_{Z}^{2})$ where $z_{i j} \sim \mathrm{Binomial} (δ)$ with $δ = 0.15$.

Sparse Latent VAR.

To examine data generated from neither of the models considered, we simulate data from a continuous time series $y_{t} \in R^{d}$ according to a sparse VAR(1):

y_{t} = A y_{t - 1} + ϵ_{t},

(7.1)

where $ϵ_{t} \sim N (0, σ^{2} I_{d})$. The sparse matrix $A$ is generated by first sampling entries $B_{i j} \sim N (0, σ_{A}^{2})$ and then setting $A_{i j} = B_{i j} z_{i j}$, where $z_{i j} \sim \mathrm{Binomial} (δ)$ with $δ = 0.15$. We then quantize each dimension, $y_{i t}$, into $m$ categories to create a categorical time series $x_{i t}$. For example, when $m = 3$, $x_{i t} = 1$ if $y_{i t}$ falls between the 0 and 0.33 quantiles of $\{y_{i 1}, \dots, y_{i T}\}$, and so forth.
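The latent VAR simulation can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: the default values of `sigma_A` and `sigma`, the rescaling of $A$ to keep the process stable, the zero initial state without burn-in, and the function name itself.

```python
import numpy as np

def simulate_quantized_var(T, d, m, delta=0.15, sigma_A=1.0, sigma=1.0, seed=0):
    """Simulate a sparse VAR(1) and quantize each series into m equal-frequency categories (1..m)."""
    rng = np.random.default_rng(seed)
    # Sparse coefficients: A_ij = B_ij * z_ij with z_ij ~ Bernoulli(delta).
    A = rng.normal(0.0, sigma_A, (d, d)) * rng.binomial(1, delta, (d, d))
    # Assumption: shrink A if needed so that its spectral radius is below one.
    radius = np.max(np.abs(np.linalg.eigvals(A)))
    if radius >= 0.9:
        A *= 0.9 / radius
    y = np.zeros((T, d))
    for t in range(1, T):
        y[t] = A @ y[t - 1] + rng.normal(0.0, sigma, d)
    # Quantize each series at its own empirical quantiles.
    cuts_at = np.linspace(0.0, 1.0, m + 1)[1:-1]
    x = np.empty((T, d), dtype=int)
    for i in range(d):
        cuts = np.quantile(y[:, i], cuts_at)
        x[:, i] = np.searchsorted(cuts, y[:, i]) + 1
    return x, A

x, A = simulate_quantized_var(T=300, d=5, m=3)
```

The true Granger causality graph is the support of `A`, which the rescaling leaves unchanged.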

Results.

For all methods (MTD $L_{1}$, MTD group lasso, and mLTD group lasso), we compute the true positive rate and false positive rate over a grid of $λ$ values, and trace out the ROC curve. We then compute the area under the ROC curve. The results are displayed as boxplots across 20 simulation runs in Figures 5, 6, and 7 for the categorical time series generated by MTD, mLTD, and latent VAR, respectively.

We note that the mLTD group lasso model performs best when the data are generated from an mLTD model, and likewise the MTD $L_{1}$ and MTD group lasso perform better when the data are generated from an MTD model. As pointed out in [20], when the groups are homogeneous in the sense that most coefficients in the active group are nonzero, group lasso tends to perform well. This is the case in the MTD model as the coefficients in nonzero $P^{i j}$ are generated from a Dirichlet distribution. However, this principle is less applicable when the data are generated from an mLTD model, as the MTD model is then misspecified. MTD with either group lasso or lasso penalty tries to find the best MTD approximation to the true data generating mechanism.

Interestingly, for data generated from mLTD, we see improved performance as a function of the number of categories $m$ for all $T$ and $d$ settings, while for MTD performance starts high, dips and goes back up with increasing $m$. This is probably due to the simulation conditions, as in both MTD and mLTD models Granger causality can be quantified as the difference between the columns of $Z^{i j}$. When there are more categories, there is higher probability under our simulation conditions that there will be some columns with large deviation from other columns in $Z^{i j}$. This leads to improved Granger causality detection when it exists.

Furthermore, we notice that in general the performances of all three methods are better when the data are generated from an mLTD model compared to an MTD model. This is again related to the simulation conditions. In the MTD model, the columns of $Z^{i j}$ are generated from a Dirichlet distribution with values constrained between 0 and 1, and the differences among columns are in general smaller than those in the mLTD model where the coefficients are generated using a normal distribution. Thus the connections among time series in the sense of Granger causality are weaker in the MTD model than in the mLTD model. The difference in the signal strengths is illustrated in Figure SM3 in the Supplementary Material.
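The ROC summary can be computed from the estimated graphs along the $λ$ path. The sketch below is one plausible way to do this and is not taken from the paper: the function name is hypothetical, the endpoints (0, 0) and (1, 1) are added by assumption, all entries of the adjacency matrix (including the diagonal) are scored, and the area is obtained with the trapezoid rule.

```python
import numpy as np

def roc_auc(true_adj, est_adjs):
    """Area under the ROC curve traced by a list of estimated adjacency matrices."""
    truth = np.asarray(true_adj, dtype=bool).ravel()
    points = {(0.0, 0.0), (1.0, 1.0)}              # empty and complete graphs
    for est in est_adjs:
        e = np.asarray(est, dtype=bool).ravel()
        tpr = float((e & truth).sum() / truth.sum())
        fpr = float((e & ~truth).sum() / (~truth).sum())
        points.add((fpr, tpr))
    fpr, tpr = np.array(sorted(points)).T          # sort by FPR, then TPR
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

true_adj = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [1, 0, 0]])
auc_perfect = roc_auc(true_adj, [true_adj])
auc_chance = roc_auc(true_adj, [])
```

An estimate that recovers the true graph at some $λ$ gives an area of one, while a path with no informative points gives 0.5.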

Figure 5. — AUC for data generated by a sparse MTD process. Boxplots over 20 simulation runs.

Figure 6. — AUC for data generated by a sparse mLTD process. Boxplots over 20 simulation runs.

Figure 7. — AUC for data generated by a sparse latent VAR process. Boxplots over 20 simulation runs.

In the latent VAR simulation, the MTD $L_{1}$ and the mLTD methods have comparable performance, and both outperform the MTD group lasso approach. However, under model misspecification, the relative performance of these methods might depend on how well they approximate the true data generating mechanism. There is also evidence of worsened performance for all three methods as the quantization of the latent VAR processes becomes finer, that is, as the number of categories increases. This might be due to the increased extent of model misspecification. We additionally compare the proposed methods to a sparse VAR fit, where we use the ordinal categorical variables directly. We observe that when the number of categories is small, our proposed methods perform similarly to the sparse VAR approach, as not much information is lost by ignoring the order. However, as the number of categories increases, the sparse VAR approach performs better by taking the order into account.

As expected, across all simulation conditions and estimation methods increasing the sample size $T$ leads to improved performance while increasing the dimension $d$ worsens performance.

We additionally present the median ROC curves in the Supplementary Material, along with points on the ROC curves chosen by cross-validation and BIC. In general, our numerical experiments indicate that the values of the tuning parameter selected by cross-validation tend to over-select edges, which has been observed in previous studies [29]. This highlights the importance of the thresholding step to reduce false positives. In contrast, BIC tends to give a large tuning parameter and results in an overly sparse graph when the sample size is small compared to the dimension; however, its performance improves considerably with large sample sizes.

7.2. Music Data Analysis.

We analyze Granger causality connections in the ‘Bach Choral Harmony’ data set available at the UCI machine learning repository [27] (https://archive.ics.uci.edu/ml/datasets/Bach+Chorales). This data set, which has been used previously [38, 15], consists of 60 chorales for a total of 5665 time steps. At each time step, 15 unique discrete events are recorded. There are 12 harmony notes, {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, that take values either ‘on’ (played) or ‘off’ (not played), i.e., $x_{j t} \in \{0,1\}$ for $j \in \{1, \dots, 12\}$. There is a ‘meter’ category taking values in $\{1, \dots, 5\}$, where lower numbers indicate less accented events and higher numbers indicate more accented events. There is also the ‘pitch class of the bass note’, taking 12 different values, and a ‘chord’ category. We group all chords that occur fewer than 200 times into one group, giving a total of 12 chord categories.
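The chord-grouping step can be sketched as follows. This is a minimal illustration rather than the preprocessing code used for the analysis: the function name `group_rare_chords`, the label `"other"`, and the toy chord labels are all hypothetical, and the toy call uses a small threshold so that the effect is visible.

```python
from collections import Counter

def group_rare_chords(chords, min_count=200, other_label="other"):
    """Relabel every chord seen fewer than `min_count` times as `other_label`."""
    counts = Counter(chords)
    return [c if counts[c] >= min_count else other_label for c in chords]

# Toy sequence with made-up chord labels (not taken from the data set).
toy = ["C_M"] * 5 + ["G_M"] * 3 + ["D_m"] * 1 + ["A_m7"] * 1
grouped = group_rare_chords(toy, min_count=3)
print(sorted(set(grouped)))
```

With `min_count=200` applied to the full chord sequence, the same function would merge all infrequent chords into a single category before model fitting.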

We apply the sparse MTD model for Granger causality selection. As the sample size is relatively small compared to the number of time series and the number of categories per series, we choose the tuning parameter $λ$ by five-fold cross-validation over a grid of $λ$ values. However, since cross-validation tends to over-select Granger causality relationships, we threshold the $γ$ weights at 0.01. The resulting estimated Granger causality graph is plotted in Figure 8. To aid the presentation of our structural analysis below, we bold all edges with $γ$ weight magnitudes greater than 0.06.
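The thresholding and bolding rules can be sketched as a small post-processing step. The layout `gamma[i][j]` (weight of series `j` in the model for series `i`), the function name `select_edges`, and the choice to drop self-loops for plotting are assumptions made for this illustration; the toy weights are invented.

```python
def select_edges(gamma, keep=0.01, bold=0.06):
    """Return (source, target, weight, is_bold) for weights above `keep`.

    gamma[i][j] is assumed to be the estimated weight of series j in the
    model for series i. Self-loops are skipped (a plotting assumption).
    """
    edges = []
    for i, row in enumerate(gamma):
        for j, w in enumerate(row):
            if i != j and abs(w) > keep:
                edges.append((j, i, w, abs(w) > bold))
    return edges

# Toy 3 x 3 weight matrix (made-up numbers).
toy_gamma = [
    [0.50, 0.005, 0.08],
    [0.02, 0.90, 0.00],
    [0.00, 0.07, 0.60],
]
edges = select_edges(toy_gamma)
for source, target, weight, is_bold in edges:
    print(source, "->", target, weight, "bold" if is_bold else "thin")
```

Here the weight 0.005 falls below the 0.01 threshold and is dropped, while the weights 0.08 and 0.07 exceed 0.06 and would be drawn in bold.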

Figure 8. — The Granger causality graph for the ‘Bach Choral Harmony’ data set using the penalized MTD method. The harmony notes are displayed around the edge in a circle corresponding to the circle of fifths. Orange links display directed interactions between the harmony notes while green links display interactions to and from the ‘bass’, ‘chord’, and ‘meter’ variables.

The harmony notes in the graph are displayed in a circle corresponding to the circle of fifths; the circle of fifths is a sequence of pitches where the next pitch in the circle is found seven semitones higher or lower, and it is a common way of displaying and understanding relationships between pitches in Western classical music. Plotting the graph in this way reveals substantially stronger connectivity between pitches that are close on this circle. For example, moving both clockwise and counter-clockwise around the circle of fifths we see strong connections between adjacent pitches, and in some cases strong connections between pitches that are two hops away on the circle of fifths. Strong connections to pitches far away on the circle of fifths are much rarer. Together, the results suggest that in these chorales there is strong dependence in time between pitches moving in both the clockwise and counter-clockwise directions on the circle of fifths.
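For readers less familiar with the circle of fifths, the ordering and the notion of "hops" used above can be reproduced with a few lines of code. This is an illustrative sketch only; the helper names `circle_of_fifths` and `hops` are ours.

```python
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def circle_of_fifths():
    """Order the 12 pitch classes by repeatedly stepping up 7 semitones."""
    return [PITCHES[(7 * k) % 12] for k in range(12)]

def hops(a, b):
    """Smallest number of steps between two pitch classes on the circle."""
    order = circle_of_fifths()
    d = abs(order.index(a) - order.index(b))
    return min(d, 12 - d)

print(circle_of_fifths())
print(hops("C", "G"), hops("C", "D"), hops("C", "F#"))
```

Under this ordering, C is one hop from both G and F, two hops from D, and six hops (the maximum) from F#.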

We also note that the ‘chord’ category has very strong outgoing connections, implying that it has a strong Granger causality relationship with all harmony pitches. This result is intuitive, as it implies that there is strong dependence between what chord is played at time step $t$ and what harmony notes are played at time step $t + 1$. The ‘bass’ pitch is also influenced by ‘chord’ and tends to both influence and be influenced by most harmony pitches. Finally, we note that the ‘meter’ category has far fewer and weaker incoming and outgoing connections, capturing the intuitive notion that the level of accentuation of certain notes is largely unrelated to which notes are played.

As mentioned in Section 3.2.1, the MTD model is much more appropriate than the mLTD model for this type of exploratory Granger causality analysis: the $γ$ weights describe the amount of probability mass that is accounted for in the conditional probability table, giving an intuitive notion of dependence between categorical variables. In the mLTD model, in contrast, there is no comparably intuitive interpretation of ‘link strength’ between two categorical variables, due to the non-linearity of the softmax function. For this reason, it is not clear how to define the strength of interaction and dependence given a set of estimated $Z^{i j}$ parameters. We still attempted to draw such a comparison. We chose to use the normalized $L_{2}$ norm of each $Z^{i j}$ matrix, $\frac{\|Z^{i j}\|}{\sqrt{m_{i}} \sqrt{m_{j}}}$, as a measure of connection strength in the mLTD model. However, this metric does not have a direct interpretation with respect to the conditional probability tensor. Due to these interpretational difficulties, we present the results of the mLTD Bach analysis in the Supplementary Material. We note here that the final graph shows some of the structure of the MTD analysis: strong connections between chord and the harmony notes and some strong connections between notes on the circle of fifths. However, in general, the resulting graph is much less sparse and less interpretable than the MTD graph.
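As a minimal sketch of the normalized norm used as an mLTD connection strength, the following assumes that the $L_{2}$ norm of the matrix is the entrywise (Frobenius) norm and that $m_{i}$ and $m_{j}$ are the numbers of rows and columns of $Z^{i j}$. Both are our assumptions for illustration, as is the function name `link_strength`.

```python
import math

def link_strength(Z):
    """Entrywise L2 (Frobenius) norm of Z divided by sqrt(rows) * sqrt(cols)."""
    m_i, m_j = len(Z), len(Z[0])
    frob = math.sqrt(sum(z * z for row in Z for z in row))
    return frob / (math.sqrt(m_i) * math.sqrt(m_j))

# Toy 2 x 2 parameter matrix (made-up numbers): Frobenius norm 5, scaled by 2.
Z_toy = [[3.0, 0.0], [0.0, 4.0]]
strength = link_strength(Z_toy)
print(strength)
```

The normalization makes matrices of different sizes comparable: a matrix whose entries all equal one has strength one regardless of its dimensions.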

7.3. Functional Connectivity in Canine iEEG.

We analyze functional connectivity among intracranial electroencephalogram (iEEG) sensors during seizures in an awake canine [11]. The data was collected from a single canine undergoing seizures and is available at ieeg.org [9]. Recent time series segmentation of iEEG data around seizure events has shown that different discrete dynamic states are active before, during, and after a seizure onset [45, 11]. We analyze Granger causal connectivity between the iEEG recording channels at the level of these discrete dynamic states, providing a Granger causal analysis at a more abstract level. Specifically, we segment the continuous measurements into nominal categorical states using a Markov switching autoregressive model. This analysis illustrates which channel’s dynamic states are predictive of another channel’s states.

Each of the 18 iEEG recordings from a single dog contains $d = 16$ channels and $T = 20000$ time points corresponding to a two-minute window around a seizure event. The time series for each channel was segmented into a categorical time series with $m = 5$ states using a Markov switching autoregressive model of multiple time series [46, 45]. See the Supplementary Material for details on the segmentation model and procedure.

We separately apply our sparse MTD model to the resulting iEEG multivariate categorical time series from 18 different seizure events. For each seizure, the hyperparameter $λ$ was varied over 800 values sampled uniformly between 0.01 and 100000. As the sample size is large, the final model was selected by the Bayesian information criterion (BIC). The resulting estimated graphs for six representative seizure events are shown in Figure 9. To aid interpretability, only edges that contribute more than 1% of the total conditional probability tensor are displayed. In Figure 10 we display two graphs that summarize Granger causality across all 18 seizures. In the first, we compute the average edge weight across all seizures and threshold the final graph at 0.5%. In the second, for each edge we display the number of times that edge is included across all seizures.
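The two summary graphs can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce Figure 10: each seizure is represented by a $d \times d$ matrix of edge weights, and an edge counts as "included" when its weight exceeds a cutoff `include` (set here to the 1% display threshold, which is our assumption). The function name and toy numbers are hypothetical.

```python
def summarize_graphs(graphs, include=0.01, avg_threshold=0.005):
    """Average-weight graph and inclusion counts across seizures.

    graphs: list of d x d weight matrices, one per seizure (assumed layout).
    include: cutoff above which an edge counts as included (an assumption).
    """
    n, d = len(graphs), len(graphs[0])
    avg = [[sum(g[i][j] for g in graphs) / n for j in range(d)] for i in range(d)]
    # Zero out averaged weights that do not exceed the threshold.
    avg_graph = [[w if w > avg_threshold else 0.0 for w in row] for row in avg]
    counts = [[sum(1 for g in graphs if g[i][j] > include) for j in range(d)]
              for i in range(d)]
    return avg_graph, counts

# Two toy "seizure" graphs on d = 2 channels (made-up numbers).
toy_graphs = [
    [[0.0, 0.02], [0.0, 0.0]],
    [[0.0, 0.004], [0.03, 0.0]],
]
avg_graph, counts = summarize_graphs(toy_graphs)
print(avg_graph)
print(counts)
```

Averaging rewards edges that are consistently strong, while the inclusion count rewards edges that appear often regardless of magnitude, so the two summaries can highlight different edges.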

Figure 9. — Granger causality graphs estimated from a sparse MTD model across six different seizure events for canine iEEG data.

Figure 10. — (left) Graph weighted by the average across 18 seizures and (right) graph weighted and colored by the number of edge inclusions across 18 seizures.

The graphs in Figures 9 and 10 indicate persistent shared structure across seizures. The four nodes in the same row represent a strip of four electrodes that were placed along the anterior-posterior direction. There were two parallel strips of four electrodes on each hemisphere. Most connections appear horizontally across the sensor locations, corresponding to anterior-posterior connections among regions within the same strip, which should be close both spatially and functionally. The few vertical connections are between adjacent rows, which represent connections between strips next to each other. Some edges, such as 1 → 9, 14 → 13, 13 → 14, 3 → 7, 7 → 3, 4 → 8, 8 → 4, 12 → 16, and 16 → 12, appear in at least fifteen of the seizure graphs, showing that some Granger causal connectivity persists across different seizure events. Future work aims to assess the clinical significance of these findings. At a high level, however, we have identified that the AR states, which capture the frequency content in individual channel signals, are correlated across time in a structured and sparse manner during seizure events.

8. Discussion.

We have proposed a novel convex framework for the MTD model, as well as two penalized estimation strategies that simultaneously promote sparsity in Granger causality estimation and constrain the solution to an identifiable space. We have also introduced the mLTD model as a baseline for multivariate categorical time series that, although straightforward, has not been explored in the literature. Novel identifiability conditions for the MTD have been derived and compared to those for the mLTD model. Finally, we have developed both projected gradient and Frank-Wolfe algorithms for the MTD model that harness the new convex formulation. For the projected gradient optimization, we also developed a Dykstra projection method to quickly project onto the MTD constraint set, allowing the MTD model to scale to much higher dimensions. Our experiments demonstrate the utility of both the MTD and mLTD models for inferring Granger causality networks from categorical time series, even under model misspecification.

There are a number of potential directions for future work. Consistency of high dimensional autoregressive GLMs with univariate natural parameters for each series has been recently established [18]. With less stringent parametric assumptions, the MTD model offers a more flexible framework than autoregressive GLMs. To handle this additional flexibility, we need additional assumptions on the Gram matrix and the spectral properties of the process when deriving an upper bound for the estimation error. We also have an extra $\log(T)$ factor in the upper bound compared to the results for lasso-type estimators in the independent data setting. This log factor also appears in [18]. Whether it can be removed is an interesting question for future research. Further theoretical comparison between mLTD and MTD is also important. For example, to what extent may an mLTD distribution be represented by an MTD one, and vice versa; or, to what extent are both models consistent for Granger causality estimation under model misspecification? Our simulation results suggest that both methods perform well under model misspecification, but more general theoretical results are certainly needed. Our sparse MTD framework also presents a simple approach to sparsity estimation under simplex constraints. As mentioned in Section 4.1, $L_{1}$ penalties are typically avoided under simplex constraints since the sum is constrained to equal one. Many authors have proposed a variety of non-convex sparsity regularizers that demand more involved optimization routines [35]. Inspired by our work with MTD, a simple solution is to leave some of the important coefficients known to be in the model unpenalized. Examples include treasury bonds in sparse portfolio optimization [26] and large background clusters in sparse clustering and density estimation [25, 35].
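To make the role of an unpenalized coefficient concrete, consider the following illustrative calculation under assumed generic notation: a weight vector $\gamma = (\gamma_{0}, \gamma_{1}, \dots, \gamma_{d})$ constrained to the probability simplex, with $\gamma_{0}$ standing for a coefficient known to be in the model. The symbols are chosen here for illustration and are not tied to a specific equation in the preceding sections.

$$
\gamma_{j} \ge 0, \quad \sum_{j=0}^{d} \gamma_{j} = 1 \quad \Longrightarrow \quad \lambda \|\gamma\|_{1} = \lambda \sum_{j=0}^{d} \gamma_{j} = \lambda ,
$$

$$
\lambda \sum_{j=1}^{d} |\gamma_{j}| = \lambda \, (1 - \gamma_{0}) .
$$

The first line shows that a full $L_{1}$ penalty is constant on the simplex and therefore cannot induce sparsity. The second shows that penalizing every coefficient except $\gamma_{0}$ yields a linear, hence convex, penalty that decreases as mass moves onto $\gamma_{0}$, so that the remaining coefficients can be shrunk exactly to zero.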

It would also be interesting to explore other regularized MTD objectives, such as the nuclear norm on $Z^{j}$ when the number of categories per time series is large. This penalty would both select for sparse dependencies and share information about transitions within each $Z^{j}$. While we have considered sparsity in $γ$, in other applications involving categorical time series with large state-spaces, such as language modeling, the entries within each $Z^{j}$ might be sparse. Comparing the projected gradient and Frank-Wolfe algorithms in these sparse, large state-space settings would be interesting. Another possible extension is the hierarchical group lasso over lags for higher order Markov chains [31], which would automatically select the order of the Markov chain. Overall, the methods presented herein open many new opportunities for analyzing multivariate categorical time series, both in practice and theoretically.

Supplementary Material

NIHMS1885629-supplement-SupplementaryMaterial.pdf (9.9 MB, PDF)

Funding:

This work was partially funded by ONR grant N00014-18-1-2862 and grants from the National Science Foundation (CAREER IIS-1350133, DMS-1161565, DMS-1561814) and the National Institutes of Health (R01GM114029).

Footnotes

Submitted to the editors April 9th, 2020.

REFERENCES

[1] Agresti A and Kateri M, Categorical Data Analysis, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 206–208, doi:10.1007/978-3-642-04898-2.
[2] Bahadori MT, Liu Y, and Xing EP, Fast structure learning in generalized stochastic processes with latent factors, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, New York, NY, USA, 2013, ACM, pp. 284–292, doi:10.1145/2487575.2487578, http://doi.acm.org/10.1145/2487575.2487578.
[3] Berchtold A, Estimation in the mixture transition distribution model, Journal of Time Series Analysis, 22 (2001), pp. 379–397, doi:10.1111/1467-9892.00231, https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9892.00231.
[4] Berchtold A and Raftery A, The mixture transition distribution model for high-order Markov chains and non-Gaussian time series, Statist. Sci., 17 (2002), pp. 328–356, doi:10.1214/ss/1042727943.
[5] Bickel PJ, Ritov Y, Tsybakov AB, et al., Simultaneous analysis of lasso and Dantzig selector, The Annals of Statistics, 37 (2009), pp. 1705–1732.
[6] Boyd S and Vandenberghe L, Convex optimization, Cambridge University Press, 2004.
[7] Boyle JP and Dykstra RL, A method for finding projections onto the intersection of convex sets in Hilbert spaces, in Advances in Order Restricted Statistical Inference, Dykstra R, Robertson T, and Wright FT, eds., New York, NY, 1986, Springer New York, pp. 28–47.
[8] Bühlmann P and Van De Geer S, Statistics for high-dimensional data: methods, theory and applications, Springer Science & Business Media, 2011.
[9] M. C and U. of Pennsylvania, Ieeg.org.
[10] Ching W, Fung ES, and Ng MK, A multivariate Markov chain model for categorical data sequences and its applications in demand predictions, IMA Journal of Management Mathematics, 13 (2002), pp. 187–199, doi:10.1093/imaman/13.3.187.
[11] Davis KA, Ung H, Wulsin D, Wagenaar J, Fox E, Patterson N, Vite C, Worrell G, and Litt B, Mining continuous intracranial EEG in focal canine epilepsy: Relating interictal bursts to seizure onsets, Epilepsia, 57 (2016), pp. 89–98, doi:10.1111/epi.13249, https://onlinelibrary.wiley.com/doi/abs/10.1111/epi.13249.
[12] Doshi-Velez F, Wingate D, Tenenbaum J, and Roy N, Infinite dynamic Bayesian networks, in Proceedings of the 28th International Conference on Machine Learning, ICML ’11, USA, 2011, Omnipress, pp. 913–920, http://dl.acm.org/citation.cfm?id=3104482.3104597.
[13] Duchi J, Shalev-Shwartz S, Singer Y, and Chandra T, Efficient projections onto the l1-ball for learning in high dimensions, in Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, 2008, ACM, pp. 272–279, doi:10.1145/1390156.1390191, http://doi.acm.org/10.1145/1390156.1390191.
[14] Eichler M, Graphical modelling of multivariate time series, Probability Theory and Related Fields, 153 (2012), pp. 233–268.
[15] Esposito R and Radicioni DP, Carpediem: Optimizing the Viterbi algorithm and applications to supervised sequential learning, Journal of Machine Learning Research, 10 (2009), pp. 1851–1880.
[16] Goldfarb D and Idnani A, Dual and primal-dual methods for solving strictly convex quadratic programs, in Numerical Analysis, Hennart JP, ed., Berlin, Heidelberg, 1982, Springer Berlin Heidelberg, pp. 226–239.
[17] Granger C, Investigating causal relations by econometric, Rational Expectations and Econometric Practice, 1 (1981), p. 371.
[18] Hall EC, Raskutti G, and Willett R, Inference of high-dimensional autoregressive generalized linear models, arXiv preprint arXiv:1605.02693, (2016).
[19] Han F, Lu H, and Liu H, A direct estimation of high dimensional stationary vector autoregressions, The Journal of Machine Learning Research, 16 (2015), pp. 3115–3150.
[20] Huang J, Zhang T, et al., The benefit of group sparsity, The Annals of Statistics, 38 (2010), pp. 1978–2004.
[21] Hunter DR and Lange K, A tutorial on MM algorithms, The American Statistician, 58 (2004), pp. 30–37, doi:10.1198/0003130042836.
[22] Jaggi M, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in ICML (1), 2013, pp. 427–435.
[23] Jiao J, Permuter HH, Zhao L, Kim Y-H, and Weissman T, Universal estimation of directed information, IEEE Transactions on Information Theory, 59 (2013), pp. 6220–6242.
[24] Kedem B and Fokianos K, Regression Models for Time Series Analysis, vol. 488, John Wiley & Sons, 2005, ch. Regression models for categorical time series.
[25] Kyrillidis A, Becker S, Cevher V, and Koch C, Sparse projections onto the simplex, in International Conference on Machine Learning, 2013, pp. 235–243.
[26] Kyrillidis A, Becker S, Cevher V, and Koch C, Sparse simplex projections for portfolio optimization, in 2013 IEEE Global Conference on Signal and Information Processing, Dec 2013, pp. 1141–1141, doi:10.1109/GlobalSIP.2013.6737104.
[27] Lichman M et al., UCI machine learning repository, 2013.
[28] Lèbre S and Bourguignon P-Y, An EM algorithm for estimation in the mixture transition distribution model, Journal of Statistical Computation and Simulation, 78 (2008), pp. 713–729, doi:10.1080/00949650701266666.
[29] Meinshausen N, Bühlmann P, et al., High-dimensional graphs and variable selection with the lasso, The Annals of Statistics, 34 (2006), pp. 1436–1462.
[30] Meinshausen N, Yu B, et al., Lasso-type recovery of sparse representations for high-dimensional data, The Annals of Statistics, 37 (2009), pp. 246–270.
[31] Nicholson WB, Bien J, and Matteson DS, Hierarchical vector autoregression, arXiv preprint arXiv:1412.5250, (2014).
[32] Nicolau J, A new model for multivariate Markov chains, Scandinavian Journal of Statistics, 41 (2014), pp. 1124–1135.
[33] Parikh N, Boyd S, et al., Proximal algorithms, Foundations and Trends in Optimization, 1 (2014).
[34] Paulin D et al., Concentration inequalities for Markov chains by Marton couplings and spectral methods, Electronic Journal of Probability, 20 (2015).
[35] Pilanci M, Ghaoui LE, and Chandrasekaran V, Recovery of sparse probability measures via convex programming, in Advances in Neural Information Processing Systems 25, Pereira F, Burges CJC, Bottou L, and Weinberger KQ, eds., Curran Associates, Inc., 2012, pp. 2420–2428, http://papers.nips.cc/paper/4504-recovery-of-sparse-probability-measures-via-convex-programming.pdf.
[36] Qiu H, Xu S, Han F, Liu H, and Caffo B, Robust estimation of transition matrices in high dimensional heavy-tailed vector autoregressive processes, in Proceedings of the... International Conference on Machine Learning, vol. 37, NIH Public Access, 2015, p. 1843.
[37] Quinn CJ, Kiyavash N, and Coleman TP, Directed information graphs, IEEE Transactions on Information Theory, 61 (2015), pp. 6887–6909.
[38] Radicioni DP and Esposito R, BREVE: An HMPerceptron-Based Chord Recognition System, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 143–164, doi:10.1007/978-3-642-11674-2.
[39] Raftery AE, A model for high-order Markov chains, Journal of the Royal Statistical Society: Series B (Methodological), 47 (1985), pp. 528–539, doi:10.1111/j.2517-6161.1985.tb01383.x, https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1985.tb01383.x.
[40] Rakhlin A, Sridharan K, and Tewari A, Sequential complexities and uniform martingale laws of large numbers, Probability Theory and Related Fields, 161 (2015), pp. 111–153.
[41] Sarkar A and Dunson DB, Bayesian nonparametric modeling of higher order Markov chains, Journal of the American Statistical Association, 111 (2016), pp. 1791–1803, doi:10.1080/01621459.2015.1115763.
[42] Shojaie A and Michailidis G, Discovering graphical Granger causality using the truncating lasso penalty, Bioinformatics, 26 (2010), pp. i517–i523, doi:10.1093/bioinformatics/btq377.
[43] Turlach B and Weingessel A, quadprog R package. Available online, 2013.
[44] Wolfer G and Kontorovich A, Estimating the mixing time of ergodic Markov chains, arXiv preprint arXiv:1902.01224, (2019).
[45] Wulsin D, Fox E, and Litt B, Parsing epileptic events using a Markov switching process model for correlated time series, in International Conference on Machine Learning, 2013, pp. 356–364.
[46] Wulsin DF, Fox EB, and Litt B, Modeling the complex dynamics and changing correlations of epileptic events, Artificial Intelligence, 216 (2014), pp. 55–75, doi:10.1016/j.artint.2014.05.006, http://www.sciencedirect.com/science/article/pii/S0004370214000599.
[47] Yuan M and Lin Y, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68 (2006), pp. 49–67.
[48] Zhou K, Zha H, and Song L, Learning social infectivity in sparse low-rank networks using multidimensional Hawkes processes, in Artificial Intelligence and Statistics, 2013, pp. 641–649.
[49] Zhu D and Ching W, A new estimation method for multivariate Markov chain model with application in demand predictions, in 2010 Third International Conference on Business Intelligence and Financial Engineering, Aug 2010, pp. 126–130, doi:10.1109/BIFE.2010.39.
