Published in final edited form as: Artif Intell. 2014 Nov 1;216:55–75. doi: 10.1016/j.artint.2014.05.006

Modeling the Complex Dynamics and Changing Correlations of Epileptic Events

Drausin F Wulsin a, Emily B Fox c, Brian Litt a,b

Abstract

Patients with epilepsy can manifest short, sub-clinical epileptic “bursts” in addition to full-blown clinical seizures. We believe the relationship between these two classes of events—something not previously studied quantitatively—could yield important insights into the nature and intrinsic dynamics of seizures. A goal of our work is to parse these complex epileptic events into distinct dynamic regimes. A challenge posed by the intracranial EEG (iEEG) data we study is the fact that the number and placement of electrodes can vary between patients. We develop a Bayesian nonparametric Markov switching process that allows for (i) shared dynamic regimes between a variable number of channels, (ii) asynchronous regime-switching, and (iii) an unknown dictionary of dynamic regimes. We encode a sparse and changing set of dependencies between the channels using a Markov-switching Gaussian graphical model for the innovations process driving the channel dynamics and demonstrate the importance of this model in parsing and out-of-sample predictions of iEEG data. We show that our model produces intuitive state assignments that can help automate clinical analysis of seizures and enable the comparison of sub-clinical bursts and full clinical seizures.

Keywords: Bayesian nonparametric, EEG, factorial hidden Markov model, graphical model, time series

1. Introduction

Despite over three decades of research, we still have very little idea of what defines a seizure. This ignorance stems both from the complexity of epilepsy as a disease and from a paucity of quantitative tools that are flexible enough to describe epileptic events yet restrictive enough to distill intelligible information from them. Much of the recent machine learning work in electroencephalogram (EEG) analysis has focused on seizure prediction [cf. 1, 2], an important area of study but one that generally has not addressed parsing the EEG directly, as a human EEG reader would. Such parsings are central for diagnosis and for relating various types of abnormal activity. Recent evidence shows that the range of epileptic events extends beyond clinical seizures to include shorter, sub-clinical “bursts” lasting fewer than 10 seconds [3]. What is the relationship between these shorter bursts and the longer seizures? In this work, we demonstrate that machine learning techniques can have substantial impact in this domain by unpacking how seizures begin, progress, and end.

In particular, we build a Bayesian nonparametric time series model to analyze intracranial EEG (iEEG) data. We take a modeling approach similar to a physician’s in analyzing EEG events: look directly at the evolution of the raw EEG voltage traces. EEG signals exhibit nonstationary behavior during a variety of neurological events, and time-varying autoregressive (AR) processes have been proposed to model single channel data [4]. Here we aim to parse the recordings into interpretable regions of activity and thus propose to use autoregressive hidden Markov models (AR-HMMs) to define locally stationary processes. In the presence of multiple channels of simultaneous recordings, as is almost always the case in EEG, we wish to share AR states between the channels while allowing for asynchronous switches. The recent beta process (BP) AR-HMM of Fox et al. [5] provides a flexible model of such dynamics: a shared library of infinitely many possible AR states is defined and each time series uses a finite subset of the states. The process encourages sharing of AR states, while allowing for time-series-specific variability.

Conditioned on the selected AR dynamics, the BP-AR-HMM assumes independence between time series. In the case of iEEG, this assumption is almost assuredly false. Figure 1 shows an example of a 4×8 intracranial electrode grid and the residual EEG traces of 16 channels after subtracting the predicted value in each channel under a conventional BP-AR-HMM. While the error term in some channels remains low throughout the recording, other channels—especially those spatially adjacent in the electrode grid—have highly correlated error traces. We propose to capture correlations between channels by modeling a multivariate innovations process that drives independently evolving channel dynamics. We demonstrate the importance of accounting for this error structure in predicting heldout seizure recordings, making this a crucial modeling step before undertaking large-scale EEG analysis.

Figure 1. An iEEG grid electrode and (bottom left) corresponding graphical model. (middle) Residual EEG values after subtracting predictions from a BP-AR-HMM assuming independent channels. All EEG scale bars indicate 1 mV vertically and 1 second horizontally.

To aid in scaling to large electrode grids, we exploit a sparse dependency structure for the multivariate innovations process. In particular, we assume a graph with known vertex structure that encodes conditional independencies in the multivariate innovations process. The graph structure is based on the spatial adjacencies of the iEEG channels, with a few exceptions to make the graphical model fully decomposable. Figure 1 (bottom left) shows an example of such a graphical model over the channels. Although the relative position of channels in the electrode grid is clear, determining the precise 3D location of each channel is extremely difficult. Furthermore, unlike in scalp EEG or magnetoencephalography (MEG), which have generally consistent channel positions from patient to patient, iEEG channels vary in number and position for each patient. These issues impede the use of alternative spatial and multivariate time series modeling techniques.

It is well-known that the correlations between EEG channels usually vary during the beginning, middle, and end of a seizure [6, 7]. Prado et al. [8] employ a mixture-of-expert vector autoregressive (VAR) model to describe the different dynamics present in seven channels of scalp EEG. We take a similar approach by allowing for a Markov evolution to an underlying innovations covariance state.

An alternative modeling approach is to treat the channel recordings as a single multivariate time series, perhaps using a switching VAR process as in Prado et al. [8]. However, such an approach (i) assumes synchronous switches in dynamics between channels, (ii) scales poorly with the number of channels, and (iii) requires an identical number of channels between patients to share dynamics between event recordings.

Other work has explored nonparametric modeling of multiple time series. The infinite factorial HMM of Van Gael et al. [9] considers an infinite collection of chains each with a binary state space. The infinite hierarchical HMM [10] also involves infinitely many chains with finite state spaces, but with constrained transitions between the chains in a top-down fashion. The infinite DBN of Doshi-Velez et al. [11] considers more general connection structures and arbitrary state spaces. Alternatively, the graph-coupled HMM of Dong et al. [12] allows graph-structured dependencies in the underlying states of some N Markov chains. Here, we consider a finite set of chains with infinite state spaces that evolve independently. The factorial structure combines the chain-specific AR dynamic states and the graph-structured innovations to generate the multivariate observations with sparse dependencies.

Expanding upon previous work [13], we show that our model for correlated time series has better out-of-sample predictions of iEEG data than standard AR- and BP-AR-HMMs and demonstrate the utility of our model in comparing short, sub-clinical epileptic bursts with longer, clinical seizures. Our inferred parsings of iEEG data concur with key features hand-annotated by clinicians but provide additional insight beyond what can be extracted from a visual read of the data. The importance of our methodology is twofold: (i) the output is interpretable to a practitioner and (ii) the parsings can be used to relate seizure types both within and between patients even with different electrode setups. Enabling such broad-scale automatic analysis, and identifying dynamics unique to sub-clinical seizures, can lead to new insights in epilepsy treatments.

Although we are motivated by the study of seizures from iEEG data, our work is much more broadly applicable in time series analysis. For example, perhaps one has a collection of stocks and wants to model shared dynamics between them while capturing changing correlations. The BP-AR-HMM was applied to the analysis of a collection of motion capture data assuming independence between individuals; our modeling extension could account for coordinated motion with a sparse dependency structure between individuals. Regardless, we find the impact in the neuroscience domain to be quite significant.

2. A Structured Bayesian Nonparametric Factorial AR-HMM

2.1. Dynamic Model

Consider an event with N univariate time series of length T. This event could be a seizure, where each time series is one of the iEEG voltage-recording channels. For clarity of exposition, we refer to the individual univariate time series as channels and the resulting N-dimensional multivariate time series (stacking up the channel series) as the event. We denote the scalar value for each channel i at each (discrete) time point t as yt(i) and model it using an r-order AR-HMM [5]. That is, each channel is modeled via Markov switches between a set of AR dynamics. Denoting the latent state at time t by zt(i), we have:

$$z_t^{(i)} \sim \pi^{(i)}_{z_{t-1}^{(i)}}, \qquad y_t^{(i)} = \sum_{j=1}^{r} a_{z_t^{(i)},j}\, y_{t-j}^{(i)} + \varepsilon_t^{(i)} = a_{z_t^{(i)}}^{T}\, \tilde{y}_t^{(i)} + \varepsilon_t^{(i)}. \qquad (1)$$

Here, $a_k = (a_{k,1},\ldots,a_{k,r})^T$ are the AR parameters for state $k$ and $\pi_k$ is the transition distribution from state $k$ to any other state. We also introduce the notation $\tilde{y}_t^{(i)}$ for the vector of $r$ previous observations $(y_{t-1}^{(i)},\ldots,y_{t-r}^{(i)})^T$.

In contrast to a vector AR (VAR) HMM specification of the event, our modeling of channel dynamics separately as in Eq. (1) allows for (i) asynchronous switches and (ii) sharing of dynamic parameters between recordings with a potentially different number of channels. However, a key aspect of our data is the fact that the channels are correlated. Likewise, these correlations change as the patient progresses through various seizure event states (e.g., “resting”, “onset”, “offset”, …). That is, the channels may display one innovation covariance before a seizure (e.g., relatively independent and low-magnitude) but quite a different covariance during a seizure (e.g., correlated, higher magnitude). To capture this, we jointly model the innovations $\varepsilon_t = (\varepsilon_t^{(1)},\ldots,\varepsilon_t^{(N)})^T$ driving the AR-HMMs of Eq. (1) as

$$Z_t \sim \varphi_{Z_{t-1}}, \qquad \varepsilon_t \sim \mathcal{N}(0, \Delta_{Z_t}), \qquad (2)$$

where $Z_t$ denotes a Markov-evolving event state distinct from the individual channel states $\{z_t^{(i)}\}$, $\varphi_l$ the transition distributions, and $\Delta_l$ the event-state-specific channel covariance. That is, each $\Delta_l$ describes a different set of channel relationships.

For compactness, we sometimes alternately write

$$y_t = A_{z_t} \tilde{Y}_t + \varepsilon_t(Z_t), \qquad (3)$$

where $y_t$ is the concatenation of the $N$ channel observations at time $t$, $z_t$ is the vector of concatenated channel states, and $A_{z_t}$ and $\tilde{Y}_t$ collect the corresponding AR coefficients and lag vectors. The overall dynamic model is represented graphically in Figure 2.

Figure 2. Graphical model of our factorial AR-HMM. The $N$ channel states $z_t^{(i)}$ evolve according to independent Markov processes (transition distributions omitted for simplicity) and index the AR dynamic parameters $a_k$ used in generating observation $y_t^{(i)}$. The Markov-evolving event state $Z_t$ indexes the graph-structured covariance $\Delta_l$ of the correlated AR innovations, resulting in multivariate observations $y_t = [y_t^{(1)},\ldots,y_t^{(N)}]^T$ sharing the same conditional independencies.

Scaling to large electrode grids

To scale our model to a large number of channels, we consider a Gaussian graphical model (GGM) for $\varepsilon_t$ capturing a sparse dependency structure amongst the channels. Let $G = (V, E)$ be an undirected graph with $V$ the set of channel nodes $i$ and $E$ the set of edges, with $(i,j) \in E$ if $i$ and $j$ are connected by an edge in the graph. Then $[\Delta_l^{-1}]_{ij} = 0$ for all $(i,j) \notin E$, implying that $\varepsilon_t^{(i)}$ is conditionally independent of $\varepsilon_t^{(j)}$ given $\varepsilon_t^{(k)}$ for all channels $k \neq i, j$. In our dynamic model of Eq. (1), statements of conditional independence of $\varepsilon_t$ translate directly to statements about the observations $y_t$.

In our application, we choose G based on the spatial adjacencies of channels in the electrode grid, as depicted in Figure 1 (bottom left). In addition to encoding the spatial proximities of iEEG electrodes, the graphical model also yields a sparse precision matrix Δl1, allowing for more efficient scaling to the large number of channels commonly present in iEEG. These computational efficiencies are made clear in Section 3.
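To make the graph construction concrete, the sketch below (our own illustration, not the authors' code; the grid dimensions and helper names are hypothetical, and the extra edges needed to make the graph decomposable are omitted) builds the adjacency structure for a rectangular electrode grid and shows which precision-matrix entries the GGM allows to be nonzero.

```python
import numpy as np

def grid_adjacency(n_rows, n_cols):
    """Adjacency matrix for a rectangular electrode grid: channels are
    connected when they are horizontal or vertical neighbors."""
    N = n_rows * n_cols
    A = np.zeros((N, N), dtype=bool)
    idx = lambda r, c: r * n_cols + c
    for r in range(n_rows):
        for c in range(n_cols):
            if r + 1 < n_rows:
                A[idx(r, c), idx(r + 1, c)] = A[idx(r + 1, c), idx(r, c)] = True
            if c + 1 < n_cols:
                A[idx(r, c), idx(r, c + 1)] = A[idx(r, c + 1), idx(r, c)] = True
    return A

# A 4x8 grid as in Figure 1: 32 channels, each with at most 4 neighbors.
A = grid_adjacency(4, 8)
# The GGM forces [Delta_l^{-1}]_{ij} = 0 whenever (i, j) is not an edge (i != j),
# so each precision matrix shares the sparsity pattern of A plus the diagonal.
allowed = A | np.eye(A.shape[0], dtype=bool)
print(allowed.sum(), "of", allowed.size, "precision entries may be nonzero")
```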

Interpretation as a sparse factorial HMM

Recall that our formulation involves $N + 1$ independently evolving Markov chains: $N$ chains for the channel states $z_t^{(i)}$ plus one for the event state sequence $Z_t$. As indicated by the observation model of Eq. (3), the $N + 1$ Markov chains jointly generate our observation vector $y_t$, leading to an interpretation of our formulation as a factorial HMM [14]. However, here we have a sparse dependency structure in how the Markov chains influence a given observation $y_t$, as induced by the conditional independencies in $\varepsilon_t$ encoded in the graph $G$. That is, $y_t^{(i)}$ only depends on $Z_t$ and the set of $z_t^{(j)}$ for which $j$ is a neighbor of $i$ in $G$.

2.2. Prior Specification

Emission parameters

As in the AR-HMM, we place a multivariate normal prior on the AR coefficients,

$$a_k \sim \mathcal{N}(m_0, \Sigma_0), \qquad (4)$$

with mean m0 and covariance Σ0. Throughout this work, we let m0 = 0.

For the channel covariances $\Delta_l$, whose sparse precisions $\Delta_l^{-1}$ are determined by the graph $G$, we specify a hyper-inverse Wishart (HIW) prior,

$$\Delta_l \sim \text{HIW}_G(b_0, D_0), \qquad (5)$$

where b0 denotes the degrees of freedom and D0 the scale. The HIW prior [15] enforces hyper-Markov conditions specified by G.

Feature constrained channel transition distributions

A natural question is: how many AR states do the channels switch between? Likewise, which states are shared between the channels and which are unique? We expect to see similar dynamics present in the channels (sharing of AR processes), but also some differences. For example, maybe only some of the channels ever get excited into a certain state. To capture this structure, we take a Bayesian nonparametric approach building on the beta process (BP) AR-HMM of Fox et al. [16]. Through the beta process prior [17], the BP-AR-HMM defines a shared library of infinitely many AR coefficients $\{a_k\}$, but encourages each channel to use only a sparse subset of them.

The BP-AR-HMM specifically defines a featural model. Let $f^{(i)}$ be a binary feature vector associated with channel $i$, with $f_k^{(i)} = 1$ indicating that channel $i$ uses the dynamic described by $a_k$. Formally, the feature assignments $f_k^{(i)}$ and their corresponding parameters $a_k$ are generated by a beta process random measure and the conjugate Bernoulli process (BeP),

$$B \mid B_0 \sim \text{BP}(1, B_0), \qquad X^{(i)} \mid B \sim \text{BeP}(B), \qquad (6)$$

with base measure $B_0$ over the parameter space $\Theta = \mathbb{R}^r$ for our $r$-order autoregressive parameters $a_k$. As specified in Eq. (4), we take the normalized measure $B_0/B_0(\Theta)$ to be $\mathcal{N}(m_0, \Sigma_0)$. The discrete measures $B$ and $X^{(i)}$ can be represented as

$$B = \sum_{k=1}^{\infty} \omega_k \delta_{a_k}, \qquad X^{(i)} = \sum_{k=1}^{\infty} f_k^{(i)} \delta_{a_k}, \qquad (7)$$

with $f_k^{(i)} \sim \text{Ber}(\omega_k)$.

The resulting feature vectors $f^{(i)}$ constrain the set of states available to $z_t^{(i)}$ by constraining the $k$th element of each transition distribution $\pi_j^{(i)}$ to be 0 when $f_k^{(i)} = 0$. Specifically, the BP-AR-HMM defines $\pi_j^{(i)}$ by introducing a set of gamma random variables, $\eta_{jk}^{(i)}$, and setting

$$\eta_{jk}^{(i)} \sim \text{Gamma}(\gamma_c + \kappa_c\,\delta(j,k),\, 1), \qquad (8)$$
$$\pi_j^{(i)} = \frac{\eta_j^{(i)} \circ f^{(i)}}{\sum_{k \mid f_k^{(i)}=1} \eta_{jk}^{(i)}}. \qquad (9)$$

The positive elements of $\pi_j^{(i)}$ can also be thought of as a sample from a finite Dirichlet distribution with only $K^{(i)}$ dimensions, where $K^{(i)} = \sum_k f_k^{(i)}$ is the number of states channel $i$ uses. For convenience, we sometimes denote the set of transition variables $\{\eta_{jk}^{(i)}\}_{j,k}$ as $\eta^{(i)}$. As in the sticky HDP-HMM of Fox et al. [18], the parameter $\kappa_c$ encourages self-transitions (i.e., state $j$ at time $t-1$ to state $j$ at time $t$).
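As a concrete illustration of Eqs. (8)-(9), the following numpy sketch (our own; the hyperparameter values and feature vector are hypothetical) draws the gamma variables, masks them by a channel's feature vector, and renormalizes to obtain that channel's feature-constrained transition matrix. Rows corresponding to inactive states are never used by the sampler.

```python
import numpy as np

def channel_transitions(f_i, gamma_c, kappa_c, rng):
    """Feature-constrained transition matrix for one channel (Eqs. 8-9)."""
    K = len(f_i)
    # eta[j, k] ~ Gamma(gamma_c + kappa_c * delta(j, k), 1): extra mass on self-transitions
    eta = rng.gamma(shape=gamma_c + kappa_c * np.eye(K))
    pi = eta * f_i[None, :]               # zero out transitions into inactive states
    pi /= pi.sum(axis=1, keepdims=True)   # renormalize each row over the active states
    return pi

rng = np.random.default_rng(0)
f_i = np.array([1, 1, 0, 1, 0])           # channel i uses states 0, 1, and 3
pi_i = channel_transitions(f_i, gamma_c=1.0, kappa_c=10.0, rng=rng)
```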

Unconstrained event transition distributions

We again take a Bayesian non-parametric approach to define the event state HMM, building on the sticky HDP-HMM [18]. In particular, the transition distributions φl are hierarchically defined as

$$\beta \sim \text{stick}(\alpha), \qquad \varphi_l \sim \text{DP}(\alpha_e \beta + \kappa_e e_l), \qquad (10)$$

where stick(α) refers to a stick-breaking measure, also known as GEM(α), with β generated by

$$\beta_k' \sim \text{Beta}(1, \alpha), \quad k = 1, 2, \ldots, \qquad \beta_k = \beta_k' \prod_{\ell=1}^{k-1} (1 - \beta_\ell') = \beta_k' \Big(1 - \sum_{\ell=1}^{k-1} \beta_\ell\Big), \quad k = 1, 2, \ldots \qquad (11)$$

Again, the sticky parameter κe promotes self-transitions, reducing state redundancy.

We term this model the sparse factorial BP-AR-HMM. Although the graph $G$ can be arbitrarily structured, because our motivating seizure modeling application focuses on a spatially-based graph structure, we often describe the sparse factorial BP-AR-HMM as capturing spatial correlations. We depict this model in the directed acyclic graphs shown in Figure 3. Note that while we formally consider a model of only a single event for notational simplicity, our formulation scales straightforwardly to multiple independent events. In this case, everything except the library of AR states $\{a_k\}$ becomes event-specific. If all events share the same channel setup, we can assume the channel covariances $\{\Delta_l\}$ are shared as well.

Figure 3. Referencing the channel state and event state sequences of Figure 2, here we depict the graphical model associated with our Bayesian nonparametric prior specification of Section 2.2. (top) The channel $i$ feature indicators $f^{(i)}$ are samples from a Bernoulli process with weights $\{\omega_k\}$ and constrain the channel transition distributions $\pi^{(i)}$. Channel states $z_t^{(i)}$ evolve independently for each channel according to these feature-constrained transition distributions $\pi^{(i)}$. (bottom) The event state $Z_t$ evolves independently of each channel $i$'s state $z_t^{(i)}$ according to transition distributions $\{\varphi_l\}$, which are coupled by the global transition distribution $\beta$.

3. Posterior Computations

Although the components of our model related to the individual channel dynamics are similar to those in the BP-AR-HMM, our posterior computations are significantly different due to the coupling of the Markov chains via the correlated innovations ɛt. In the BP-AR-HMM, conditioned on the feature assignments, each time series is independent. Here, however, we are faced with a factorial HMM structure and the associated challenges. Yet the underlying graph structure of the channel dependencies mitigates the scale of these challenges.


Algorithm 1 Sparse factorial BP-AR-HMM master MCMC sampler

1: for each MCMC iteration do
2:   get a random permutation h of the channel indices,
3:   for each channel ih do
4:    sample feature indicators f(i) as in Eq. (12)
5:    sample state sequence z1:T(i) as in Eq. (13)
6:    sample state transition parameters η(i) as in Eq. (14)
7:   end for
8:   sample event state sequence Z1:T
9:   sample event state transition parameters φ as in Eq. (17)
10:  sample channel AR parameters {ak} as in Eq. (19)
11:  sample channel {Δl} as in Eq. (18)
12:  (sample hyperparameters γc, κc, αe, κe, γe, and αc = B0(Θ))
13: end for

Conditioned on the neighboring channel state sequences $\{z_{1:T}^{(i')}\}$, we can marginalize $z_{1:T}^{(i)}$; because of the graph structure, we need only condition on a sparse set of other channels $i'$ (i.e., the neighbors of channel $i$ in the graph). This step is important for efficiently sampling the feature assignments $f^{(i)}$.

At a high level, each MCMC iteration proceeds through sampling channel states, events states, dynamic model parameters, and hyperparameters. Algorithm 1 summarizes these steps, which we briefly describe below and more fully in Appendices B-D.

Individual channel variables

We harness the fact that we can compute the marginal likelihood of $y_{1:T}^{(i)}$ given $f^{(i)}$ and the neighboring channel state sequences $z_{1:T}^{(i')}$ in order to block sample $\{f^{(i)}, z_{1:T}^{(i)}\}$. That is, we first sample $f^{(i)}$ marginalizing $z_{1:T}^{(i)}$ and then sample $z_{1:T}^{(i)}$ given the sampled $f^{(i)}$. Sampling the active features $f^{(i)}$ for channel $i$ follows as in Fox et al. [5], using the Indian buffet process (IBP) [19] predictive representation associated with the beta process, but using a likelihood term that conditions on the neighboring channel state sequences $z_{1:T}^{(i')}$ and observations $y_{1:T}^{(i')}$. We additionally condition on the event state sequence $Z_{1:T}$ to define the sequence of distributions on the innovations. Generically, this yields

$$p(f_k^{(i)} \mid y_{1:T}^{(i)}, y_{1:T}^{(i')}, z_{1:T}^{(i')}, Z_{1:T}, F^{-ik}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}) \propto p(f_k^{(i)} \mid F^{-ik})\; p(y_{1:T}^{(i)} \mid y_{1:T}^{(i')}, z_{1:T}^{(i')}, Z_{1:T}, F^{-ik}, f_k^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}). \qquad (12)$$

Here, $F^{-ik}$ denotes the set of feature assignments not including $f_k^{(i)}$. The first term is given by the IBP prior and the second term is the marginal conditional likelihood (marginalizing $z_{1:T}^{(i)}$). Based on the derived marginal conditional likelihood, feature sampling follows similarly to that of Fox et al. [5].

Conditioned on $f^{(i)}$, we block sample the state sequence $z_{1:T}^{(i)}$ using a backward filtering forward sampling algorithm (see Appendix A) based on a decomposition of the full conditional as

$$p(z_{1:T}^{(i)} \mid y_{1:T}^{(i)}, y_{1:T}^{(i')}, z_{1:T}^{(i')}, f^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}) = p(z_1^{(i)} \mid y_1^{(i)}, y_1^{(i')}, z_1^{(i')}, f^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}) \cdot \prod_{t=2}^{T} p(z_t^{(i)} \mid y_{t:T}^{(i)}, y_{t:T}^{(i')}, z_{t-1}^{(i)}, z_{t:T}^{(i')}, f^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}). \qquad (13)$$

For sampling the transition parameters η(i), we follow Hughes et al. [20, Supplement] and sample from the full conditional

$$p(\eta_{jk}^{(i)} \mid z_{1:T}^{(i)}, f_k^{(i)}) \propto \frac{(\eta_{jk}^{(i)})^{\,n_{jk}^{(i)} + \gamma_c + \kappa_c\delta(j,k) - 1}\, e^{-\eta_{jk}^{(i)}}}{\sum_{k' \mid f_{k'}^{(i)}=1} \eta_{jk'}^{(i)}}, \qquad (14)$$

where $n_{jk}^{(i)}$ denotes the number of times channel $i$ transitions from state $j$ to state $k$. We sample $\eta_j^{(i)} = C_j^{(i)} \bar{\eta}_j^{(i)}$ from its posterior via two auxiliary variables,

$$\bar{\eta}_j^{(i)} \sim \text{Dir}(\gamma_c + \kappa_c e_j + n_j^{(i)}), \qquad C_j^{(i)} \sim \text{Gamma}(K^{(i)}\gamma_c + \kappa_c,\, 1), \qquad (15)$$

where $n_j^{(i)}$ gives the transition counts from state $j$ in channel $i$.
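A minimal numpy sketch of the auxiliary-variable update in Eq. (15) follows (our own illustration; the counts, feature vector, and hyperparameter values are hypothetical).

```python
import numpy as np

def sample_eta_row(j, n_j, f_i, gamma_c, kappa_c, rng):
    """Sample eta_j^(i) over channel i's active states via Eq. (15):
    a Dirichlet draw scaled by a Gamma-distributed total mass."""
    active = np.flatnonzero(f_i)                       # indices of the channel's active states
    e_j = (active == j).astype(float)                  # self-transition indicator
    eta_bar = rng.dirichlet(gamma_c + kappa_c * e_j + n_j[active])
    C_j = rng.gamma(shape=active.size * gamma_c + kappa_c)
    eta_j = np.zeros(len(f_i))
    eta_j[active] = C_j * eta_bar
    return eta_j

rng = np.random.default_rng(1)
n_j = np.array([40., 3., 0., 5., 0.])                  # transitions out of state j (hypothetical)
eta_j = sample_eta_row(0, n_j, np.array([1, 1, 0, 1, 0]), gamma_c=1.0, kappa_c=10.0, rng=rng)
```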

Event variables $\{\varphi_l\}$, $\{\Delta_l\}$, $Z_{1:T}$

Conditioned on the channel state sequences $z_{1:T}$ and AR coefficients $\{a_k\}$, we can compute an innovations sequence as $\varepsilon_t = y_t - A_{z_t}\tilde{Y}_t$, where we recall the definition of $A_{z_t}$ and $\tilde{Y}_t$ from Eq. (3). These innovations are the observations of the sticky HDP-HMM of Eq. (2). For simplicity and to allow block-sampling of $Z_{1:T}$, we consider a weak limit approximation of the sticky HDP-HMM as in [18]. The top-level Dirichlet process is approximated by an $L$-dimensional Dirichlet distribution [21], inducing a finite Dirichlet for $\varphi_l$:

$$\beta \sim \text{Dir}(\gamma_e/L, \ldots, \gamma_e/L), \qquad \varphi_l \sim \text{Dir}(\alpha_e\beta + \kappa_e e_l). \qquad (16)$$

Here, L provides an upper bound on the number of states in the HDP-HMM. The weak limit approximation still encourages using a subset of these L states.

Based on the weak limit approximation, we first sample the parent transition distribution β as in [22, 18], followed by sampling each φl from its Dirichlet posterior,

$$p(\varphi_l \mid Z_{1:T}, \beta) = \text{Dir}(\alpha_e\beta + \kappa_e e_l + n_l), \qquad (17)$$

where nl is a vector of transition counts of Z1:T from state l to the L different states.

Using standard conjugacy results, based on “observations” $\varepsilon_t = y_t - A_{z_t}\tilde{Y}_t$ for $t$ such that $Z_t = l$, the full conditional for $\Delta_l$ is given by

$$p(\Delta_l \mid y_{1:T}, z_{1:T}, Z_{1:T}, \{a_k\}) = \text{HIW}_G(b_l, D_l), \qquad (18)$$

where

$$b_l = b_0 + |\{t \mid Z_t = l,\; t = 1,\ldots,T\}|, \qquad D_l = D_0 + \sum_{t \mid Z_t = l} \varepsilon_t \varepsilon_t^T.$$

Details on how to efficiently sample from a HIW distribution are provided in [23].
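The sufficient statistics in Eq. (18) are simple to accumulate; the sketch below (our own, with placeholder data) computes $b_l$ and $D_l$ for each event state from the current innovations and event state sequence. Drawing $\Delta_l$ from the resulting HIW then proceeds clique-by-clique as in [23], which is not shown here.

```python
import numpy as np

def hiw_posterior_stats(eps, Z, L, b0, D0):
    """Posterior HIW parameters (b_l, D_l) of Eq. (18) for each event state l,
    given innovations eps (T x N) and the event state sequence Z (length T)."""
    stats = []
    for l in range(L):
        mask = (Z == l)
        b_l = b0 + mask.sum()                     # degrees of freedom
        D_l = D0 + eps[mask].T @ eps[mask]        # scale: sum of eps_t eps_t^T over t with Z_t = l
        stats.append((b_l, D_l))
    return stats

# eps_t = y_t - A_{z_t} Ytilde_t would come from the current channel states and
# AR coefficients; here we use placeholder values just to exercise the function.
T, N, L = 200, 4, 3
rng = np.random.default_rng(2)
eps = rng.standard_normal((T, N))
Z = rng.integers(0, L, size=T)
stats = hiw_posterior_stats(eps, Z, L, b0=N + 2, D0=np.eye(N))
```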

Conditioned on the truncated HDP-HMM event transition distributions {φl} and emission parameters {Δl}, we use a standard backward filtering forward sampling scheme to block sample Z1:T.

AR coefficients, {ak}

Each observation $y_t$ is generated based on a matrix of AR parameters $A_{z_t} = [a_{z_t^{(1)}} | \cdots | a_{z_t^{(N)}}]$. Thus, sampling $a_k$ involves conditioning on $\{a_{k'}\}_{k' \neq k}$ and disentangling the contribution of $a_k$ to each $y_t$. As derived in Appendix E, the full conditional for $a_k$ is a multivariate normal

$$p(a_k \mid y_{1:T}, z_{1:T}, Z_{1:T}, \{a_{k'}\}_{k' \neq k}, \{\Delta_l\}) = \mathcal{N}(\mu_k, \Sigma_k), \qquad (19)$$

where

$$\Sigma_k^{-1} = \Sigma_0^{-1} + \sum_{t=1}^{T} \tilde{Y}_t^{(k^+)}\, \Delta_{Z_t}^{-1}(k^+,k^+)\, (\tilde{Y}_t^{(k^+)})^T, \qquad \Sigma_k^{-1}\mu_k = \sum_{t=1}^{T} \tilde{Y}_t^{(k^+)} \Big( \Delta_{Z_t}^{-1}(k^+,k^+)\, y_t^{(k^+)} + \Delta_{Z_t}^{-1}(k^+,k^-)\, \varepsilon_t^{(k^-)} \Big).$$

The index sets $k^+$ and $k^-$ denote the channels assigned and not assigned to state $k$ at time $t$, respectively. We use these to index into the rows and columns of the vectors $\varepsilon_t$ and $y_t$ and the matrix $\Delta_{Z_t}$. Each column of the matrix $\tilde{Y}_t^{(k^+)}$ is the vector of $r$ previous observations for one of the channels assigned to state $k$ at time $t$.

Hyperparameters

See Appendix D for the prior and full conditionals of the hyperparameters γc, κc, αe, κe, γe, and αc = B0(Θ).

4. Experiments

4.1. Simulation Experiments

To initially explore some characteristics of our sparse factorial BP-AR-HMM, we examined a small simulated dataset of six time series in a 2×3 spatial arrangement, with edges connecting all adjacent nodes (i.e., two cliques of 4 nodes each). We generated an event of length 2000 time points as follows. We defined five first-order AR channel states with coefficients linearly spaced between −0.9 and 0.9 and three event states with covariances shown in the bottom left of Figure 4. Channel and event state transition distributions were set to 0.99 and 0.9, respectively, for a self-transition, with the remaining mass spread uniformly over the other states. Channel feature indicators $f_k^{(i)}$ were simulated from an IBP with $\alpha_c = 10$ (no channel had indicators exceeding the five specified states). The sampled $f^{(i)}$ were then used to modify the channel state transition distributions by setting to 0 transitions to states with $f_k^{(i)} = 0$ and then renormalizing. Using these feature-constrained transition distributions, we simulated sequences $z_{1:T}^{(i)}$ for each channel $i = 1, \ldots, 6$ and for $T = 2000$. The event sequence $Z_{1:T}$ was likewise simulated. Based on these sampled state sequences, and using the defined state-specific AR coefficients and channel covariances, we generated observations $y_{1:T}$ as in Eq. (3), as sketched below.
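The following numpy sketch mirrors this generative procedure under simplifying assumptions of our own: it skips the IBP feature constraints, uses placeholder diagonal event covariances rather than those of Figure 4, and fixes an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K, L = 2000, 6, 5, 3

a = np.linspace(-0.9, 0.9, K)              # first-order AR coefficient for each channel state
P_chan = np.full((K, K), 0.01 / (K - 1)); np.fill_diagonal(P_chan, 0.99)
P_event = np.full((L, L), 0.10 / (L - 1)); np.fill_diagonal(P_event, 0.90)
Deltas = np.stack([s * np.eye(N) for s in (0.5, 1.0, 2.0)])   # placeholder event covariances

def sample_chain(P, T, rng):
    """Simulate a Markov chain of length T with transition matrix P."""
    z = np.empty(T, dtype=int)
    z[0] = rng.integers(P.shape[0])
    for t in range(1, T):
        z[t] = rng.choice(P.shape[0], p=P[z[t - 1]])
    return z

z = np.stack([sample_chain(P_chan, T, rng) for _ in range(N)], axis=1)  # T x N channel states
Z = sample_chain(P_event, T, rng)                                       # length-T event states
y = np.zeros((T, N))
for t in range(1, T):
    eps = rng.multivariate_normal(np.zeros(N), Deltas[Z[t]])            # Eq. (2)
    y[t] = a[z[t]] * y[t - 1] + eps                                     # Eq. (3) with r = 1
```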

Figure 4. (top left) The six simulated channel time series overlaid on the five true channel states, denoted by different colors; the three true event states are shown in grayscale in the bar below. (top right) The true and estimated channel (color) and event (grayscale) states shown below for comparison after 6000 MCMC iterations. The true (bottom left) and estimated (bottom right) event state innovation covariances.

We ran our MCMC sampler for 6000 iterations, discarding the first 1000 as burn-in and thinning the chain by 10. Figure 4 shows the generated data and its true states along with the inferred states and learned channel covariances for a representative posterior sample. The event state matching is almost perfect, and the channel state matching is quite good, though we see that the sampler added an additional (yellow) state in the middle of the first time series when it should have assigned that section to the cyan state. The scale and structure of the estimated event state covariances match the true covariances quite well. Furthermore, Table 1 shows how the posterior estimates of the channel state AR coefficients also center well around the true values.

Table 1.

The true and estimated values for the channel state coefficients in the simulated dataset. We include the posterior mean and 95% credible interval.

channel state    true a_k    posterior a_k mean    posterior a_k 95% interval
1                -0.900      -0.906                [-0.917, -0.896]
2                -0.450      -0.456                [-0.474, -0.436]
3                 0.000      -0.009                [-0.038, 0.020]
4                 0.450       0.445                [0.425, 0.466]
5                 0.900       0.902                [0.890, 0.913]

4.2. Parsing a Seizure

We tested the sparse factorial BP-AR-HMM on two similar seizures (events) from a patient of the Children’s Hospital of Pennsylvania. These seizures were chosen because they qualitatively displayed a variety of dynamics throughout the beginning, middle, and end of the seizure and thus are ideal for exploring the extent to which our sparse factorial BP-AR-HMM can parse a set of rich neurophysiologic signals. We used the 90 seconds of data after the clinically-determined start of each seizure from 16 channels, whose spatial layout in the electrode grid is shown in Figure 5 along with the graph encoding our conditional independence assumptions. The data were low-pass filtered and downsampled from 200 to 50 Hz, preserving the clinically important signals while reducing the computational burden of posterior inference. The data were also scaled so that 99% of values fall within [-10, 10] for numerical reasons. We examined a 5th-order sparse factorial BP-AR-HMM and ran 10 MCMC chains for 6000 iterations, discarding 1000 samples as burn-in and using 10-sample thinning.

Figure 5. The graph used for a 16-channel iEEG electrode and the corresponding traces over 25 seconds of a seizure onset, with colors indicating the inferred channel states. The event states are shown below along with the associated innovation covariances. Vertical dashed lines indicate the EEG transition times marked independently by an epileptologist. Vertical and horizontal scale bars denote 1 mV and 1 second, respectively.

The sparse factorial BP-AR-HMM inferred state sequences for the sample corresponding to a minimum expected Hamming distance criterion ([18]) are shown in Figure 5. The results were analyzed by a board-certified epileptologist who agreed with the model’s judgement in identifying the subtle changes from the background dynamic (cyan) initially present in all channels. The model’s grouping of spatially-proximate channels into similar state transition patterns (e.g., channels 03, 07, 11, 15) was clinically intuitive and consistent with his own reading of the raw EEG. Using only the raw EEG, and prior to disclosing our results, he independently identified roughly six points in the duration of the seizure where the dynamics fundamentally change. The three main event state transitions shown in Figure 5 occurred almost exactly at the same time as three of his own marked transitions. The fourth coincides with a major shift in the channel dynamics with most channels transitioning to the green dynamic. The other two transitions he marked that are not displayed occurred after this onset period. From this analysis, we see that our event states provide an important global summary of the dynamics of the seizure that augments the information conveyed from the channel state sequences.

Clinical relevance

While interpreting these state sequences and covariances from the model, it is important to keep in mind that they are ultimately estimates of a system whose parsing even highly-trained physicians disagree upon. Nevertheless, we believe that the event states directly describe the activity of particular clinical interest.

In modeling the correlations between channels, the event states give insight into how different physiologic areas of the brain interact over the course of a seizure. In the clinical workup for resective brain surgery, these event states could help define and specifically quantify the full range of ways in which neurophysiologic regions initiate seizures and how others are recruited over the numerous seizures of a patient. In addition, given fixed model parameters, our model can fit the channel and event state sequences of an hour’s worth of 64-channel EEG data in a matter of minutes on a single 8-core machine, possibly facilitating epileptologist EEG annotation of long-term monitoring records.

The ultimate clinical aim of this work, however, involves understanding the relationship between epileptic bursts and seizures. Because the event state aspect of our model involves a Markov assumption, the intrinsic length of the event has little bearing on the states assigned to particular time points. Thus, these event states allow us to straightforwardly compare the neurophysiologic relationship dynamics in short bursts (often less than two seconds long) to those in much longer seizures (on the order of two minutes long), as explored in Section 4.4. Prior to this analysis, we first examine the importance of our various model components by comparing to baseline alternatives.

4.3. Model Comparison

The advantages of a spatial model

We explored the extent to which the spatial information and sparse dependencies encoded in the HIW prior improve our predictions of heldout data relative to a number of baseline models. To assess the impact of the sparse dependencies induced by the Gaussian graphical model for $\varepsilon_t$, we compare to a full-covariance model with an IW prior on $\Delta_l$ (dense factorial). For assessing the importance of spatial correlations, we additionally compare to two alternatives where channels evolve independently: the BP-AR-HMM of Fox et al. [5] and an AR-HMM without the feature-based modeling provided by the beta process [24]. Both of these models use inverse gamma (IG) priors on the individual channel innovation variances. We learned a set of AR coefficients $\{a_k\}$ and event covariances $\{\Delta_l\}$ on one seizure and then computed the heldout log-likelihood on a separate seizure, constraining it to use the learned model parameters from the training seizure.

For the training seizure, 5000 MCMC samples were collected across 10 chains, each with a 1000-sample burn-in and 10-sample thinning. To compute the predictive log-likelihood of the heldout seizure, we analytically marginalized the heldout event state sequence $Z_{1:T}$ but performed a Monte Carlo integration over the feature vectors $f^{(i)}$ and channel states $z_{1:T}$ using our MCMC sampler. For each original MCMC sample generated from the training seizure, a secondary chain is run fixing $\{a_k\}$ and $\{\Delta_l\}$ and sampling $z_t^{(i)}$, $Z_t$, $f^{(i)}$, $\eta^{(i)}$, and $\{\varphi_l\}$ for the heldout seizure. We approximate $p(y_{1:T} \mid \{\varphi_l\}, \{a_k\}, \{\Delta_l\})$ by averaging the secondary chain's closed-form $p(y_{1:T} \mid z_{1:T}, \{\varphi_l\}, \{a_k\}, \{\Delta_l\})$, described in Appendix B.

Figure 6 (left) shows how conditioning on the innovations of neighboring channels in the sparse factorial model improves the prediction of an individual channel, as seen by its reduced innovation trace relative to the original BP-AR-HMM. The quantitative benefits of accounting for these correlations are seen in our predictions of heldout events, as depicted in Figure 6 (right), which compares the heldout log-likelihoods for the original and the factorial models listed above. As expected, the factorial models have significantly larger predictive power than the original models. Though hard to see due to the large factorial/original difference, the BP-based model also improves on the standard non-feature-based AR-HMM. Performance of the sparse factorial model is also at least as good as that of the full-covariance model (dense factorial). We would expect to see even larger gains for electrode grids with more channels due to the parsimonious representation provided by the graphical model. Regardless, these results demonstrate that the assumptions of sparsity in the channel dependencies do not adversely affect our performance.

Figure 6. (left) An example 16-channel clip of iEEG with the middle section of one channel zoomed in and the innovations from the original BP-AR-HMM and our sparse factorial BP-AR-HMM shown below. (right) Boxplots of the heldout event log-likelihoods for the two original and two factorial models, with mean and median posterior likelihood given by green and red lines. Boxes denote the middle 50% prediction interval.

The advantages of sparse factorial dependencies

In addition to providing a parsimonious modeling tool, the sparse dependencies among channels induced by the HIW prior allow our computations to scale linearly to the large number of channels present in iEEG. We compared a dense factorial BP-AR-HMM (entailing a fully-connected spatial graph) and a sparse factorial BP-AR-HMM on five datasets of 8, 16, 32, 64, and 96 channels (from three 32-channel electrodes) from the same seizure used previously. We ran the two models on each of the five datasets for at least 1000 MCMC iterations, using a profiler to tabulate the time spent in each step of the MCMC iteration.

Figure 7 shows the average time required to calculate the channel likelihoods at each time point under each AR channel state. This computation is used both for calculating the marginal likelihood (averaging over all the state sequences $z_{1:T}^{(i)}$) required in active feature sampling and for sampling the state sequences $z_{1:T}^{(i)}$. In our sparse factorial model, each channel has a constant set of $M$ dependencies, assuming $M$ neighboring channels. As such, the channel likelihood computation at each time point has $O(M)$ complexity, implying an $O(MN)$ complexity for calculating the likelihoods of all $N$ channels at each time point. In contrast, the likelihood computation at each time point under the full-covariance model has complexity $O(N)$, implying $O(N^2)$ for calculating all the channel likelihoods. For $M \ll N$, as is typically the case, our sparse dependency model is significantly more computationally efficient.

Figure 7. The average time per MCMC iteration required to calculate all of the channel likelihoods under each AR model at each time point.

Anecdotally, we also found that the IW prior experiments—especially those with larger numbers of channels—tended to occasionally have numerical underflow problems associated with the inverse term $\Delta_{Z_t}^{-1}(i',i')$ in the conditional channel likelihood calculation. This underflow in the IW prior model calculations is not surprising since the matrices inverted are of dimension $N - 1$ (for $N$ channels), whereas in the HIW prior, the sparse spatial dependencies of the electrode grids make these matrices no larger than eight-by-eight.

4.4. Comparing Epileptic Events of Different Scales

We applied our sparse factorial BP-AR-HMM to six channels of iEEG over 15 events from a human patient with hippocampal depth electrodes. These events comprise 14 short sub-clinical epileptic bursts of roughly five to eight seconds and a final, 2-3 minute clinical seizure. Our hypothesis was that the sub-clinical bursts display initiation dynamics similar to those of a full, clinical seizure and thus contain information about the seizure-generation process.

The events were automatically extracted from the patient’s continuous iEEG record by taking sections of iEEG whose median line-length feature [25] crossed a preset threshold, also including 10 seconds before and after each event. The iEEG was preprocessed in the same way as in the previous section. The six channels studied came from a depth electrode implanted in the left temporal lobe of the patient’s brain. We ran our MCMC sampler jointly on the 15 events. In particular, the AR channel state and event state parameters, $\{a_k\}$ and $\{\Delta_l\}$, were shared between the 15 events such that the parsings of each recording jointly informed the posteriors of these shared parameters. The hyperparameter settings, number of MCMC iterations, chains, and thinning were as in the experiment of Section 4.2.

Figure 8 compares two of the 14 sub-clinical bursts and the onset of the single seizure. We have aligned the three events relative to the beginnings of the red event state common to all three, which we treat as the start of the epileptic activity. The individual channel states of the four middle channels are also all green throughout most of the red event state. It is interesting to note that at this time the fifth channel’s activity in all three events is much lower than those of the three channels above it, yet it is still assigned to the green state and continues in that state along with the other three channels as the event state switches from the red to the lime green state in all three events. While clinical opinions can vary widely in EEG reading, a physician would most likely not consider this segment of the fifth channel similar to the other three, as our model consistently does. But on a relative voltage axis, the segments actually look quite similar. In a sense, the fifth channel has the same dynamics as the other three but just with smaller magnitude. This kind of relationship is difficult for the human EEG readers to identify and shows how models such as ours are capable of providing new perspectives not readily apparent to a human reader. Additionally, we note the similarities in event state transitions.

Figure 8. Six iEEG traces from two sub-clinical bursts and the onset of the single seizure, with colors indicating inferred channel and event states. The dashed lines indicate the start of the red state in the three events. Vertical and horizontal scale bars denote 1 mV and 1 second, respectively.

The similarities mentioned above, among others, suggest some relationship between these two different classes of epileptic events. However, all bursts make a notable departure from the seizure: a large one-second depolarization in the middle three channels, highlighted at the end by the magenta event state and followed shortly thereafter by the end of the event. Neither the states assigned by our model nor the iEEG itself indicates that this dynamic is present in the clinical seizure. This difference leads us to posit that perhaps these sub-clinical bursts are a kind of false-start seizure, with similar onset patterns but a disrupting discharge that prevents the event from escalating to a full-blown seizure.

5. Conclusion

In this work, we develop a sparse factorial BP-AR-HMM model that allows for shared dynamic regimes between a variable number of time series, asynchronous regime-switching, and an unknown dictionary of dynamic regimes. Key to our model is capturing the between-series correlations and their evolution with a Markov-switching multivariate innovations process. For scalability, we assume a sparse dependency structure between the time series using a Gaussian graphical model.

This model is inspired by challenges in modeling high-dimensional intracranial EEG time series of seizures, which contain a large variety of small- and large-scale dynamics that are of interest to clinicians. We demonstrate the value of this unsupervised model by showing its ability to parse seizures in a clinically intuitive manner and to produce state-of-the-art out-of-sample predictions on the iEEG data. Finally, we show how our model allows for flexible, large-scale analysis of different classes of epileptic events, opening new, valuable clinical research directions previously too challenging to pursue. Such analyses have direct relevance to clinical decision-making for patients with epilepsy.


Algorithm 2 HMM forward-filtering algorithm for calculating p(y1:T)

1: let π_0 and π = (π_1 | … | π_K)^T be the initial and transition distributions
2: let u_t ∈ ℝ_+^K be the likelihoods p(y_t | z_t = k) for each k
3: let ξ_t ∈ ℝ_+^K be the forward messages p(y_{1:t−1}, z_{t−1} = k) at time t for each k
4: normalize each u_t to sum to 1, preventing underflow during computations
5: for t = 1, …, T do
6:   store the marginal log probability over u_t in υ_t : υ_t ← log(1^T u_t)
7:   normalize u_t : ũ_t ← u_t / exp(υ_t)
8: end for
9: calculate and normalize initial forward messages
10:  ξ_1 ← ũ_1 ∘ π_0
11:  w_1 ← 1^T ξ_1, ξ̃_1 ← ξ_1 / w_1
12:  υ_1 ← υ_1 + log(w_1)
13: propagate messages forward through each time point
14: for t = 2, …, T do
15:   transmit messages forward : ξ_t ← ũ_t ∘ (π^T ξ̃_{t−1})
16:   normalize : w_t ← 1^T ξ_t, ξ̃_t ← ξ_t / w_t
17:   υ_t ← υ_t + log(w_t)
18: end for
19: calculate final marginal log likelihood : log p(y_{1:T}) = Σ_{t=1}^T υ_t
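As a concrete reference, a compact numpy version of Algorithm 2 might look like the following (our own sketch; it works from log-likelihoods and uses a max-shift in place of the sum normalization of lines 4-8, which is an equivalent stabilization).

```python
import numpy as np

def hmm_log_marginal(log_lik, pi0, Pi):
    """Numerically stable forward filter returning log p(y_{1:T}).
    log_lik[t, k] = log p(y_t | z_t = k); pi0 is the initial distribution;
    Pi[j, k] = p(z_t = k | z_{t-1} = j)."""
    T, K = log_lik.shape
    shift = log_lik.max(axis=1)                  # per-time-point scaling to avoid underflow
    u = np.exp(log_lik - shift[:, None])
    total = shift.sum()
    xi = u[0] * pi0                              # unnormalized forward message at t = 1
    w = xi.sum(); xi /= w; total += np.log(w)
    for t in range(1, T):
        xi = u[t] * (Pi.T @ xi)                  # propagate and reweight
        w = xi.sum(); xi /= w; total += np.log(w)
    return total
```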

Acknowledgments

This work is supported in part by AFOSR Grant FA9550-12-1-0453 and DARPA Grant FA9550-12-1-0406 negotiated by AFOSR, ONR Grant N00014-10-1-0746, NINDS RO1-NS041811, RO1-NS48598, and U24-NS063930-03, and the Mirowski Discovery Fund for Epilepsy Research.

Appendix A. HMM sum-product algorithm

Consider a hidden Markov model of a sequence y1:T with corresponding discrete states z1:T, each of which takes one of K values. The joint probability of y1:t and zt = ℓ is

$$p(y_{1:t}, z_t = \ell) = p(y_t \mid z_t = \ell) \sum_{k=1}^{K} p(y_{1:t-1}, z_{t-1} = k)\, p(z_t = \ell \mid z_{t-1} = k), \qquad (A.1)$$

sometimes called a forward message, which depends on a recursive call for y1:t−1 and zt−1 = k with

$$p(y_1, z_1 = k) = p(y_1 \mid z_1 = k)\, p(z_1 = k). \qquad (A.2)$$

Calculating the marginal likelihood p(y1:T) simply involves one last marginalization over zT,

$$p(y_{1:T}) = \sum_{k=1}^{K} p(y_{1:T}, z_T = k). \qquad (A.3)$$

Algorithm 3 HMM backward-filtering forward-sampling algorithm for block-sampling z1:T

1: let π_0 and π = (π_1 | … | π_K)^T be the initial and transition distributions
2: let u_t ∈ ℝ_+^K be the likelihoods p(y_t | z_t = k) for each k
3: let ζ_t ∈ ℝ_+^K be the backward messages p(y_{t+1:T} | z_t = k) at t for each k
4: normalize each u_t to sum to 1, preventing underflow during computations
5: for t = 1, …, T do
6:   ũ_t ← u_t / (1^T u_t)
7: end for
8: calculate backward messages over all time points
9: ζ̃_T ← 1
10: for t = T − 1, …, 1 do
11:  transmit messages backward : τ_{t+1} ← ũ_{t+1} ∘ ζ̃_{t+1}, ζ_t ← π τ_{t+1}
12:  normalize : ζ̃_t ← ζ_t / (1^T ζ_t)
13: end for
14: τ_1 ← ũ_1 ∘ ζ̃_1
15: sample first time point
16: q_1 ← π_0 ∘ τ_1, q̃_1 ← q_1 / (1^T q_1)
17: z_1 ~ q̃_1
18: sample other time points
19: for t = 2, …, T do
20:  q_t ← π_{z_{t−1}} ∘ τ_t, q̃_t ← q_t / (1^T q_t)
21:  z_t ~ q̃_t
22: end for

Algorithm 2 provides a numerically stable recipe for calculating this marginal likelihood.

We can sample the states $z_{1:T}$ from their joint distribution, also known as block sampling, via a similar recursive formulation. The conditional likelihood of the last $T - t$ observations given the state at time $t$ is

$$p(y_{t+1:T} \mid z_t = \ell) = \sum_{k=1}^{K} p(y_{t+1} \mid z_{t+1} = k)\, p(y_{t+2:T} \mid z_{t+1} = k)\, p(z_{t+1} = k \mid z_t = \ell), \qquad (A.4)$$

which depends recursively on the backward messages $p(y_{t+2:T} \mid z_{t+1} = k)$ for each $k \in \{1,\ldots,K\}$. To sample $z_{1:T}$ at once we use the joint posterior distribution of the entire state sequence $z_{1:T}$, which factors into

$$p(z_{1:T} \mid y_{1:T}) = p(z_T \mid z_{T-1}, y_{1:T})\, p(z_{T-1} \mid z_{T-2}, y_{1:T}) \cdots p(z_2 \mid z_1, y_{1:T})\, p(z_1 \mid y_{1:T}). \qquad (A.5)$$

If we first sample z1, we can condition on it to then sample z2 and continue in this fashion until we finish with zT. The posterior for zt is the product of the backward message, the likelihood of yt given zt, and the probability of zt given zt−1,

$$p(z_t \mid z_{t-1}, y_{1:T}) \propto p(y_{t+1:T} \mid z_t)\, p(y_t \mid z_t)\, p(z_t \mid z_{t-1}), \qquad (A.6)$$

where $p(z_t \mid z_{t-1}) \triangleq p(z_1)$ for $t = 1$. A numerically stable recipe for this backward-filtering forward-sampling is given in Algorithm 3.
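In numpy, the backward-filtering forward-sampling recursion of Algorithm 3 might be sketched as follows (our own illustration; it takes log-likelihoods and normalizes messages only for numerical stability).

```python
import numpy as np

def block_sample_states(log_lik, pi0, Pi, rng):
    """Draw z_{1:T} from its joint posterior given log_lik[t, k] = log p(y_t | z_t = k)."""
    T, K = log_lik.shape
    u = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))   # scaled likelihoods
    zeta = np.ones((T, K))                    # zeta[t] is prop. to p(y_{t+1:T} | z_t = k)
    for t in range(T - 2, -1, -1):
        tau = u[t + 1] * zeta[t + 1]
        zeta[t] = Pi @ tau
        zeta[t] /= zeta[t].sum()              # normalize to prevent underflow
    z = np.empty(T, dtype=int)
    q = pi0 * u[0] * zeta[0]
    z[0] = rng.choice(K, p=q / q.sum())       # sample z_1
    for t in range(1, T):
        q = Pi[z[t - 1]] * u[t] * zeta[t]
        z[t] = rng.choice(K, p=q / q.sum())   # sample z_t given z_{t-1}
    return z
```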

Appendix B. Individual channel variables posterior

Sampling the variables associated with the individual channel i involves first sampling active features f(i) (while marginalizing z1:T(i)), then conditioning on these feature assignments f(i) to block sample the state sequence z1:T(i), and finally sampling the transition distribution π(i) given the feature indicators f(i) and state sequence z1:T(i). Explicit algorithms for this sampling are given in Wulsin [26, Section 4.2.1].

Channel marginal likelihood

Let $i' \subseteq \{1,\ldots,N\}$ index the neighboring channels in the graph upon which channel $i$ is conditioned. The conditional likelihood of observation $y_t^{(i)}$ under AR model $k$ given the neighboring observations $y_t^{(i')}$ at time $t$ is

$$p(y_t^{(i)} \mid \tilde{y}_t^{(i)}, y_t^{(i')}, z_t^{(i)} = k, z_t^{(i')}, Z_t, \{a_k\}, \{\Delta_l\}) = \mathcal{N}(\mu_t, \sigma_t^2) \qquad (B.1)$$

for

$$\mu_t = a_k^T \tilde{y}_t^{(i)} + \Delta_{Z_t}(i,i')\, \Delta_{Z_t}^{-1}(i',i') \big( y_t^{(i')} - A_{z_t^{(i')}} \tilde{Y}_t^{(i')} \big), \qquad \sigma_t^2 = \Delta_{Z_t}(i,i) - \Delta_{Z_t}(i,i')\, \Delta_{Z_t}^{-1}(i',i')\, \Delta_{Z_t}(i',i), \qquad (B.2)$$

which follows from the conditional distribution of the multivariate normal [27, pg. 579]. Using the forward-filtering scheme (see Algorithm 2) to marginalize over the exponentially many state sequences z1:T(i), we can calculate the channel marginal likelihood,

$$p(y_{1:T}^{(i)} \mid y_{1:T}^{(i')}, z_{1:T}^{(i')}, Z_{1:T}, f^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}), \qquad (B.3)$$

of channel $i$'s observations over all $t = 1,\ldots,T$ given the observations $y_{1:T}^{(i')}$ and assigned states $z_{1:T}^{(i')}$ of the neighboring channels $i'$ and given the event state sequence $Z_{1:T}$. As previously discussed, taking the non-zero elements of the infinite-dimensional transition distributions $\pi^{(i)}$, derived from $f^{(i)}$ and $\eta^{(i)}$ as in Eq. (9), yields a set of $K^{(i)}$-dimensional active-feature transition distributions $\tilde{\pi}^{(i)} = \{\tilde{\pi}_j^{(i)}\}$, reducing this marginalization to a series of matrix-vector products.
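The per-time-point quantity of Eqs. (B.1)-(B.2), which feeds the forward filter of Algorithm 2, can be computed directly. The sketch below is our own illustration with assumed array shapes: Delta is the N x N covariance for the current event state, ytil_t holds each channel's r lagged observations, and A_nbrs stacks the AR coefficients currently assigned to the neighbors.

```python
import numpy as np

def channel_conditional(i, nbrs, y_t, ytil_t, a_k, A_nbrs, Delta):
    """Mean and variance of y_t^(i) under AR state k, conditioned on the
    neighboring channels' observations and current AR states (Eq. B.2)."""
    pred_i = a_k @ ytil_t[i]                                   # AR prediction for channel i
    eps_nbrs = y_t[nbrs] - np.einsum('ij,ij->i', A_nbrs, ytil_t[nbrs])
    S = Delta[np.ix_(nbrs, nbrs)]                              # covariance of the neighbors
    cross = Delta[i, nbrs]
    mu = pred_i + cross @ np.linalg.solve(S, eps_nbrs)
    var = Delta[i, i] - cross @ np.linalg.solve(S, cross)
    return mu, var
```

Because the GGM makes each channel conditionally independent of non-neighbors given its neighbors, the linear system solved here is only as large as the neighborhood, which is the source of the per-channel $O(M)$ cost discussed in Section 4.3.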

Sampling active features, f(i)

We briefly describe the active feature sampling scheme given in detail by Fox et al. [5]. Recall that for our sparse factorial BP-AR-HMM, we need to condition on the neighboring channel state sequences $z_{1:T}^{(i')}$ and the event state sequence $Z_{1:T}$. Sampling the feature indicators $f^{(i)}$ for channel $i$ via the Indian buffet process (IBP) involves considering those features shared by other channels and those unique to channel $i$. Let $K_+ = \sum_{k} \mathbb{1}(f_k^{(1)} \vee \cdots \vee f_k^{(N)})$ denote the total number of active features used by at least one of the channels. We consider the set of features shared across channels, not including those specific to channel $i$, as $S^{(-i)} \subseteq \{1,\ldots,K_+\}$ and the set of unique features for channel $i$ as $U^{(i)} \subseteq \{1,\ldots,K_+\} \setminus S^{(-i)}$.

Shared features

The posterior for each shared feature kS(−i) for channel i is given by

$$p(f_k^{(i)} \mid y_{1:T}^{(i)}, y_{1:T}^{(i')}, z_{1:T}^{(i')}, Z_{1:T}, F^{-ik}, f_{k' \neq k}^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}) \propto p(f_k^{(i)} \mid F^{-ik})\; p(y_{1:T}^{(i)} \mid y_{1:T}^{(i')}, z_{1:T}^{(i')}, Z_{1:T}, f_{k' \neq k}^{(i)}, f_k^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}), \qquad (B.4)$$

where the marginal likelihood term for $y_{1:T}^{(i)}$ (see Eq. (B.3)) follows from the sum-product algorithm. Recalling the form of the IBP posterior predictive distribution, we have $p(f_k^{(i)} = 1 \mid F^{-ik}) = m_k^{(-i)}/N$, where $m_k^{(-i)}$ denotes the number of other channels that use feature $k$. We use this posterior to formulate a Metropolis-Hastings proposal that flips the current indicator value $f_k^{(i)}$ to its complement $\bar{f}_k^{(i)}$ with probability $\rho(\bar{f}_k^{(i)} \mid f_k^{(i)})$,

$$f_k^{(i)} = \begin{cases} \bar{f}_k^{(i)}, & \text{w.p. } \rho(\bar{f}_k^{(i)} \mid f_k^{(i)}), \\ f_k^{(i)}, & \text{otherwise}, \end{cases} \qquad (B.5)$$

where

$$\rho(\bar{f}_k^{(i)} \mid f_k^{(i)}) = \min\left( \frac{p(\bar{f}_k^{(i)} \mid y_{1:T}^{(i)}, y_{1:T}^{(i')}, z_{1:T}^{(i')}, Z_{1:T}, F^{-ik}, f_{k' \neq k}^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\})}{p(f_k^{(i)} \mid y_{1:T}^{(i)}, y_{1:T}^{(i')}, z_{1:T}^{(i')}, Z_{1:T}, F^{-ik}, f_{k' \neq k}^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\})},\; 1 \right). \qquad (B.6)$$

Unique features

We either propose a new feature or remove an existing unique feature for channel $i$ using a birth-death reversible jump MCMC sampler [28, 29, 30] (see Fox et al. [16] for details relevant to the BP-AR-HMM). We denote the number of unique features for channel $i$ as $n_i = |U^{(i)}|$. We define the vector of shared feature indicators as $f_-^{(i)} = f^{(i)}_{k \mid k \in S^{(-i)}}$ and that of unique feature indicators as $f_+^{(i)} = f^{(i)}_{k \mid k \in U^{(i)}}$, which together $(f_-^{(i)}, f_+^{(i)})$ define the full feature indicator vector $f^{(i)}$ for channel $i$. Similarly, $\{a_k\}_+^{(i)}$ and $\eta_+^{(i)}$ describe the model dynamics and transition parameters associated with these unique features. We propose a new unique feature vector $f_+'$ and corresponding model dynamics $a_+'$ and transition parameters $\eta_+'$ (sampled from their priors in the case of a feature birth) with a proposal distribution of

$$q(f_+', a_+', \eta_+' \mid f_+^{(i)}, \{a_k\}_+^{(i)}, \eta_+^{(i)}) = q(f_+' \mid f_+^{(i)})\; q(a_+' \mid f_+', f_+^{(i)}, \{a_k\}_+^{(i)})\; q(\eta_+' \mid f_+', f_+^{(i)}, \eta_+^{(i)}). \qquad (B.7)$$

A new unique feature is proposed with probability 0.5 and each existing unique feature is removed with probability 0.5/n(i). This proposal is accepted with probability

$$\rho(f_+', a_+', \eta_+' \mid f_+^{(i)}, \{a_k\}_+^{(i)}, \eta_+^{(i)}) = \min\left( \frac{p(y_{1:T}^{(i)} \mid y_{1:T}^{(i')}, z_{1:T}^{(i')}, [f_-^{(i)}\; f_+'], \eta^{(i)}, \eta_+', \{a_k\}, \{\Delta_l\})}{p(y_{1:T}^{(i)} \mid y_{1:T}^{(i')}, z_{1:T}^{(i')}, [f_-^{(i)}\; f_+^{(i)}], \eta^{(i)}, \{a_k\}, \{\Delta_l\})} \cdot \frac{\text{Poisson}(n_i' \mid \alpha_c/N)\; q(f_+^{(i)} \mid f_+')}{\text{Poisson}(n_i \mid \alpha_c/N)\; q(f_+' \mid f_+^{(i)})},\; 1 \right). \qquad (B.8)$$

Channel state sequence, z1:T(i)

We block sample the state sequence for all the time points of channel $i$, given that channel's feature-constrained transition distributions $\pi^{(i)}$, the state parameters $\{a_k\}$, the observations $y_{1:T}^{(i)}$, and the neighboring observations $y_{1:T}^{(i')}$ and current states $z_{1:T}^{(i')}$. The joint probability of the state sequence $z_{1:T}^{(i)}$ is given by

$$p(z_{1:T}^{(i)} \mid y_{1:T}^{(i)}, y_{1:T}^{(i')}, z_{1:T}^{(i')}, f^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}) = p(z_1^{(i)} \mid y_1^{(i)}, y_1^{(i')}, z_1^{(i')}, f^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}) \cdot \prod_{t=2}^{T} p(z_t^{(i)} \mid y_{t:T}^{(i)}, y_{t:T}^{(i')}, z_{t-1}^{(i)}, z_{t:T}^{(i')}, f^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}). \qquad (B.9)$$

Again following the backward-filtering forward-sampling scheme (Algorithm 3), at each time point $t$ we sample state $z_t^{(i)}$ conditioned on $z_{t-1}^{(i)}$ while marginalizing $z_{t+1:T}^{(i)}$. The conditional probability of $z_t^{(i)}$ is given by

$$p(z_t^{(i)} \mid y_{t:T}^{(i)}, y_{t:T}^{(i')}, z_{t-1}^{(i)}, Z_{1:T}, z_{t:T}^{(i')}, f^{(i)}, \eta^{(i)}, \{a_k\}, \{\Delta_l\}) = \text{Multi}\big( \pi_{z_{t-1}^{(i)}}^{(i)} \circ u_t^{(i)} \circ \psi_t \big), \qquad (B.10)$$

where $\pi_{z_{t-1}^{(i)}}^{(i)}$ is the transition distribution given the assigned state at $t-1$, $u_t^{(i)} \in \mathbb{R}^{K^{(i)}}$ is the vector of likelihoods under each possible state at time $t$ (as in Eq. (B.1)), and $\psi_t \in \mathbb{R}^{K^{(i)}}$ is the vector of backward messages (see Eq. (A.4)) from time point $t + 1$ to $t$.

Channel transition parameters, η(i)

Following [20, Supplement], the posterior for the transition variable ηjk(i) is given by

$$p(\eta_{jk}^{(i)} \mid z_{1:T}^{(i)}, f_k^{(i)}) \propto \frac{(\eta_{jk}^{(i)})^{\,n_{jk}^{(i)} + \gamma_c + \kappa_c\delta(j,k) - 1}\, e^{-\eta_{jk}^{(i)}}}{\sum_{k' \mid f_{k'}^{(i)}=1} \eta_{jk'}^{(i)}}, \qquad (B.11)$$

where $n_{jk}^{(i)}$ denotes the number of times channel $i$ transitions from state $j$ to state $k$. We can sample from this posterior via two auxiliary variables,

$$\bar{\eta}_j^{(i)} \sim \text{Dir}(\gamma_c + \kappa_c e_j + n_j^{(i)}), \qquad C_j^{(i)} \sim \text{Gamma}(K^{(i)}\gamma_c + \kappa_c,\, 1), \qquad \eta_j^{(i)} = C_j^{(i)} \bar{\eta}_j^{(i)}. \qquad (B.12)$$

Appendix C. Event state variables posterior

Since we model the event state process with a (truncated approximation to the) HDP-HMM, inference is more straightforward than with the channel states. We block sample the event state sequence Z1:T and then sample the event state transition distributions φ.

Event marginal likelihood

Let zt denote the vector of N channel states at time t. Since the space of zt is exponentially large, we cannot integrate it out to compute the marginal conditional likelihood of the data given the event state sequence Z1:T (and model parameters). Instead, we consider the conditional likelihood of an observation at time t given channel states zt and event state Zt = l,

$$p(y_t \mid \tilde{Y}_t, z_t, Z_t = l, \{a_k\}, \Delta_l) = \mathcal{N}(A_{z_t}\tilde{Y}_t, \Delta_l). \qquad (C.1)$$

Recalling Eq. (3), we see that this conditional likelihood of yt is equivalent to a zero-mean multivariate normal model on the channel innovations εt,

$$p(\varepsilon_t \mid Z_t = l, \Delta_l) = \mathcal{N}(0, \Delta_l).$$

As with the channel marginal likelihoods, we use the forward-filtering algorithm (see Algorithm 2) to marginalize over the possible event state sequences Z1:T, yielding a likelihood conditional on the channel states zt and autoregressive parameters {ak}, in addition to the event transition distribution φ and event state covariances {Δl},

$$p(y_{1:T} \mid z_{1:T}, \varphi, \{a_k\}, \{\Delta_l\}). \qquad (C.2)$$

Event state sequence, Z1:T

The mechanics of sampling the event state sequence Z1:T directly parallel those of sampling the individual channel state sequences z1:T(i). The joint probability of the event state sequence is given by

$$p(Z_{1:T} \mid y_{1:T}, z_{1:T}, \varphi, \{a_k\}, \{\Delta_l\}) = p(Z_1 \mid y_1, z_1, \varphi, \{a_k\}, \{\Delta_l\}) \prod_{t=2}^{T} p(Z_t \mid y_{t:T}, z_{t:T}, Z_{t-1}, \varphi, \{a_k\}, \{\Delta_l\}). \qquad (C.3)$$

We again use the backward-filtering forward-sampling scheme of the sum-product algorithm to block sample each event state, whose conditional probability distribution over the $L$ states is given by

$$p(Z_t \mid \tilde{Y}_t, z_t, Z_{t-1}, \varphi, \{a_k\}, \{\Delta_l\}) = \text{Multi}\big( \varphi_{Z_{t-1}} \circ v_t \circ \psi_t \big), \qquad (C.4)$$

where $\varphi_{Z_{t-1}}$ is the transition distribution given the assigned state at $t-1$, $v_t \in \mathbb{R}^L$ is the vector of likelihoods under each of the $L$ possible states at time $t$ (as in Eq. (C.1)), $\psi_t \in \mathbb{R}^L$ is again the vector of backward messages from time point $t + 1$ to $t$, and $\circ$ denotes the element-wise product.

Event transition parameters, φ

The Dirichlet posterior for the event state l’s transition distribution φl simply involves the transition counts nl from event state l to all L states,

$$p(\varphi_l \mid Z_{1:T}, \beta) = \text{Dir}(\alpha_e\beta + \kappa_e e_l + n_l), \qquad (C.5)$$

for global weights β, concentration parameter αe, and self-transition parameter κe.

Global transition parameters, β

The Dirichlet posterior of the global transition distribution $\beta$ involves the auxiliary variables $(\bar{m}_{\cdot 1}, \ldots, \bar{m}_{\cdot L})$,

$$p(\beta \mid Z_{1:T}) = \text{Dir}(\gamma_e/L + \bar{m}_{\cdot 1}, \ldots, \gamma_e/L + \bar{m}_{\cdot L}), \qquad (C.6)$$

where these auxiliary variables are defined as

$$\bar{m}_{ll'} = \begin{cases} m_{ll'}, & l \neq l', \\ m_{ll'} - w_{l'}, & l = l', \end{cases} \qquad m_{ll'} = \sum_{r=1}^{n_{ll'}} \theta_r, \quad \theta_r \sim \text{Ber}\!\left( \frac{\alpha_e\beta_{l'} + \kappa_e\delta(l,l')}{\alpha_e\beta_{l'} + \kappa_e\delta(l,l') + r} \right), \qquad w_{l'} \sim \text{Bin}\!\left( m_{l'l'},\, \frac{\rho_e}{\rho_e + \beta_{l'}(1-\rho_e)} \right), \qquad (C.7)$$

and $\bar{m}_{\cdot l'} = \sum_l \bar{m}_{ll'}$. See Fox [31, Appendix A] for full derivations.

Appendix D. Hyperparameters posterior

Below we give brief descriptions for the MCMC sampling of the hyperparameters in our model. Full derivations are given in Fox [31, Section 5.2.3, Appendix C].

Channel dynamics model hyperparameters, γc, κc

We use Metropolis-Hastings steps to propose a new value $\gamma_c'$ from a gamma distribution with fixed variance $\sigma_{\gamma_c}^2$ and accept with probability $\min(r(\gamma_c' \mid \gamma_c), 1)$,

$$r(\gamma_c' \mid \gamma_c) = \frac{p(\{\pi^{(i)}\} \mid \gamma_c', \kappa_c, \{f^{(i)}\})\; p(\gamma_c')\; q(\gamma_c \mid \gamma_c'^2/\sigma_{\gamma_c}^2,\, \gamma_c'/\sigma_{\gamma_c}^2)}{p(\{\pi^{(i)}\} \mid \gamma_c, \kappa_c, \{f^{(i)}\})\; p(\gamma_c)\; q(\gamma_c' \mid \gamma_c^2/\sigma_{\gamma_c}^2,\, \gamma_c/\sigma_{\gamma_c}^2)} = \frac{p(\{\pi^{(i)}\} \mid \gamma_c', \kappa_c, \{f^{(i)}\})}{p(\{\pi^{(i)}\} \mid \gamma_c, \kappa_c, \{f^{(i)}\})} \cdot \frac{\Gamma(\nu)\, (\gamma_c')^{\nu' - \nu + a}}{\Gamma(\nu')\, \gamma_c^{\,\nu - \nu' + a}}\; \sigma_{\gamma_c}^{-2(\nu' - \nu)}\, \exp\{-b(\gamma_c' - \gamma_c)\}, \qquad (D.1)$$

where $\nu = \gamma_c^2/\sigma_{\gamma_c}^2$, $\nu' = \gamma_c'^2/\sigma_{\gamma_c}^2$, $q(\cdot \mid \nu, \theta)$ denotes a gamma density with shape $\nu$ and rate $\theta$, and we have a Gamma($a$, $b$) prior on $\gamma_c$. The likelihood term $p(\{\pi^{(i)}\} \mid \gamma_c, \kappa_c, \{f^{(i)}\})$ follows from the Dirichlet distribution and is given by

$$p(\{\pi^{(i)}\} \mid \gamma_c, \kappa_c, \{f^{(i)}\}) = \prod_{i=1}^{N} \prod_{k=1}^{K^{(i)}} \left( \frac{\Gamma(\gamma_c K^{(i)} + \kappa_c)}{\Gamma(\gamma_c)^{K^{(i)} - 1}\, \Gamma(\gamma_c + \kappa_c)} \prod_{j=1}^{K^{(i)}} \big(\pi_{kj}^{(i)}\big)^{\gamma_c + \kappa_c\delta(k,j) - 1} \right), \qquad (D.2)$$

and the expression for $p(\{\pi^{(i)}\} \mid \gamma_c', \kappa_c, \{f^{(i)}\})$ is analogous. Recall that the transition parameters $\{\pi^{(i)}\}$ are independent over $i$, and thus their likelihoods multiply. The proposal and acceptance ratio for $\kappa_c$ are similar.

Channel active features model hyperparameter αc

We place a Gamma(aαc, bαc) prior on αc, which implies a gamma posterior of the form

$$p(\alpha_c \mid \{f^{(i)}\}) = \text{Gamma}\Big( a_{\alpha_c} + K_+,\; b_{\alpha_c} + \sum_{i=1}^{N} 1/i \Big), \qquad (D.3)$$

where $K_+ = \sum_{k} \mathbb{1}(f_k^{(1)} \vee \cdots \vee f_k^{(N)})$ denotes the number of channel states activated in at least one of the channels.

Event dynamics model hyperparameters, γe, αe, κe, ρe

Instead of sampling $\alpha_e$ and $\kappa_e$ independently, we introduce an additional parameter $\rho_e = \kappa_e/(\alpha_e + \kappa_e)$ and sample $(\alpha_e + \kappa_e)$ and $\rho_e$, which is simpler.

$(\alpha_e + \kappa_e)$: With a Gamma($a_{\alpha_e+\kappa_e}$, $b_{\alpha_e+\kappa_e}$) prior on $(\alpha_e + \kappa_e)$, we use auxiliary variables $\{r_l\}_{l=1}^{L}$ and $\{s_l\}_{l=1}^{L}$ to define the posterior,

$$p(\alpha_e + \kappa_e \mid Z_{1:T}) = \text{Gamma}\Big( a_{\alpha_e+\kappa_e} + \bar{m}_{\cdot\cdot} - \sum_{l=1}^{L} s_l,\; b_{\alpha_e+\kappa_e} - \sum_{l=1}^{L} \log r_l \Big), \qquad (D.4)$$

where $\bar{m}_{\cdot\cdot} = \sum_{l,l'=1}^{L} \bar{m}_{ll'}$ is the sum over the auxiliary variables $\bar{m}_{ll'}$ defined in Eq. (C.7), and the auxiliary variables $r_l$ and $s_l$ are sampled as

r_l ∼ Beta(α_e + κ_e + 1, n_{l·}),   s_l ∼ Ber( n_{l·} / (n_{l·} + α_e + κ_e) ).

ρ_e With a Beta(c_{ρ_e}, d_{ρ_e}) prior on ρ_e, we use auxiliary variables {w_{l·}}_{l=1}^{L} to define the posterior,

p(ρ_e | m̄, β) ∝ Beta( c_{ρ_e} + ∑_{l=1}^{L} w_{l·},  d_{ρ_e} + m̄_{··} − ∑_{l=1}^{L} w_{l·} ).  (D.5)

For w_{ls} ∼ Ber(ρ_e) over s = 1, …, m_{ll}, the posterior of the auxiliary variable w_{l·} is

p(w_{l·} | m_{ll}, β_l) ∝ Bin( m_{ll}, ρ_e / (ρ_e + β_l (1 − ρ_e)) ).  (D.6)

γe With a Gamma(aγe, bγe) prior on γe, we use auxiliary variables υ and q to define the posterior,

p(γ_e | m̄) ∝ Gamma( a_{γ_e} − q + ∑_{l=1}^{L} 1(m̄_{·l} > 0),  b_{γ_e} − log υ ).  (D.7)

The auxiliary variables are sampled via

υ ∼ Beta(γ_e + 1, m̄_{··}),   q ∼ Ber( m̄_{··} / (γ_e + m̄_{··}) ).
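The three updates in Eqs. (D.4)–(D.7) can be collected into a single sketch. The counts m̄, n_{l·}, and w_{l·} are assumed to have been computed as described above; the guard on empty states and all of the names below are ours.

```python
import numpy as np

def sample_event_hyperparameters(m_bar, n_ldot, w_ldot, alpha_plus_kappa, gamma_e,
                                 a_ak, b_ak, c_rho, d_rho, a_g, b_g, rng):
    """Resample (alpha_e + kappa_e), rho_e, and gamma_e via Eqs. (D.4)-(D.7).

    m_bar  : (L, L) corrected auxiliary counts from Eq. (C.7).
    n_ldot : (L,) transition counts out of each event state.
    w_ldot : (L,) override counts w_{l.} from Eq. (D.6).
    """
    m_dotdot = m_bar.sum()

    # (alpha_e + kappa_e), Eq. (D.4), with auxiliary variables r_l and s_l.
    r = rng.beta(alpha_plus_kappa + 1.0, np.maximum(n_ldot, 1e-10))  # guard empty states
    s = rng.random(len(n_ldot)) < n_ldot / (n_ldot + alpha_plus_kappa)
    alpha_plus_kappa = rng.gamma(a_ak + m_dotdot - s.sum(),
                                 1.0 / (b_ak - np.log(r).sum()))

    # rho_e, Eq. (D.5).
    rho_e = rng.beta(c_rho + w_ldot.sum(), d_rho + m_dotdot - w_ldot.sum())

    # gamma_e, Eq. (D.7), with auxiliary variables upsilon and q.
    upsilon = rng.beta(gamma_e + 1.0, m_dotdot)
    q = rng.random() < m_dotdot / (gamma_e + m_dotdot)
    n_used = np.sum(m_bar.sum(axis=0) > 0)
    gamma_e = rng.gamma(a_g - q + n_used, 1.0 / (b_g - np.log(upsilon)))

    return alpha_plus_kappa, rho_e, gamma_e
```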

Appendix E. Autoregressive state coefficients posterior

Recall that our prior on the autoregressive coefficients ak is a multivariate normal with zero mean and covariance Σ0,

p(a_k | Σ_0) = N(a_k; 0, Σ_0),   log p(a_k | Σ_0) ∝ −(1/2) a_k^T Σ_0^{−1} a_k.  (E.1)

The conditional event likelihood given the channel states z1:T and the event states Z1:T is

p(y_{1:T} | z_{1:T}, Z_{1:T}, {a_k}, {Δ_l}) = ∏_{t=1}^{T} N(y_t; A_{z_t} Ỹ_t, Δ_{Z_t}),
log p(y_{1:T} | z_{1:T}, Z_{1:T}, {a_k}, {Δ_l}) ∝ −(1/2) ∑_{t=1}^{T} (y_t − A_{z_t} Ỹ_t)^T Δ_{Z_t}^{−1} (y_t − A_{z_t} Ỹ_t).  (E.2)

The product of these prior and likelihood terms is the joint distribution over ak and y1:T,

log p(a_k, y_{1:T} | z_{1:T}, Z_{1:T}, {a_{k′}}_{k′≠k}, {Δ_l}) ∝ −(1/2) a_k^T Σ_0^{−1} a_k − (1/2) ∑_{t=1}^{T} (y_t − A_{z_t} Ỹ_t)^T Δ_{Z_t}^{−1} (y_t − A_{z_t} Ỹ_t).  (E.3)

We take a brief tangent to prove a useful identity,

Lemma Appendix E.1

Let the column vector x ∈ ℝ^m and the symmetric matrix A ∈ S^{m×m} be defined as

x = [ y ]        A = [ B    C ]
    [ z ]  and       [ C^T  D ],

where y ∈ ℝ^p, z ∈ ℝ^q, B ∈ S^{p×p}, D ∈ S^{q×q}, C ∈ ℝ^{p×q}, and m = p + q. Then

x^T A x = y^T B y + z^T D z + 2 y^T C z.  (E.4)

Proof

x^T A x = [y^T  z^T] A [y; z]
        = y^T (B y + C z) + z^T (C^T y + D z)
        = y^T B y + y^T C z + z^T C^T y + z^T D z
        = y^T B y + z^T D z + 2 y^T C z.  ∎
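A quick numerical check of the identity in Eq. (E.4), with dimensions chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 4
y, z = rng.standard_normal(p), rng.standard_normal(q)
B = rng.standard_normal((p, p)); B = B + B.T   # symmetric diagonal blocks
D = rng.standard_normal((q, q)); D = D + D.T
C = rng.standard_normal((p, q))

x = np.concatenate([y, z])
A = np.block([[B, C], [C.T, D]])

lhs = x @ A @ x
rhs = y @ B @ y + z @ D @ z + 2 * y @ C @ z
assert np.isclose(lhs, rhs)
```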

Note that this identity also holds for any permutation p applied to the rows of x and the rows and columns of A. We can now manipulate the likelihood term of Eq. (E.3) into a form that separates a_k from a_{k′≠k}. Suppose that k_+ denotes the indices of the N channels for which z_t^{(i)} = k and k_- = {1, …, N} \ k_+ denotes those for which z_t^{(i)} ≠ k. Furthermore, we use superscript indexing with these two sets of indices to select the corresponding portions of the y_t vector and the A_{z_t}, Ỹ_t, and Δ_{Z_t}^{−1} matrices. We start by decomposing the likelihood term into three parts,

(y_t − A_{z_t} Ỹ_t)^T Δ_{Z_t}^{−1} (y_t − A_{z_t} Ỹ_t)
  = (y_t^{(k_+)} − A_{z_t}^{(k_+,k_+)} Ỹ_t^{(k_+,k_+)})^T Δ_{Z_t}^{−1(k_+,k_+)} (y_t^{(k_+)} − A_{z_t}^{(k_+,k_+)} Ỹ_t^{(k_+,k_+)})
  + 2 (y_t^{(k_+)} − A_{z_t}^{(k_+,k_+)} Ỹ_t^{(k_+,k_+)})^T Δ_{Z_t}^{−1(k_+,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)})
  + (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)})^T Δ_{Z_t}^{−1(k_-,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)}),  (E.5)

which we then insert into our previous expression for the joint distribution of ak and y1:T (Eq. (E.3)),

log p(a_k, y_{1:T} | z_{1:T}, Z_{1:T}, {a_{k′}}_{k′≠k}, {Δ_l}) ∝ −(1/2) a_k^T Σ_0^{−1} a_k
  − (1/2) ∑_{t=1}^{T} { (y_t^{(k_+)} − A_{z_t}^{(k_+,k_+)} Ỹ_t^{(k_+,k_+)})^T Δ_{Z_t}^{−1(k_+,k_+)} (y_t^{(k_+)} − A_{z_t}^{(k_+,k_+)} Ỹ_t^{(k_+,k_+)})
  + 2 (y_t^{(k_+)} − A_{z_t}^{(k_+,k_+)} Ỹ_t^{(k_+,k_+)})^T Δ_{Z_t}^{−1(k_+,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)})
  + (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)})^T Δ_{Z_t}^{−1(k_-,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)}) }.  (E.6)

Conditioning on y_{1:T} allows us to absorb the third term of the sum into the proportionality constant, and after replacing A_{z_t}^{(k_+,k_+)} Ỹ_t^{(k_+,k_+)} with a more explicit expression, we have

log p(a_k | y_{1:T}, z_{1:T}, Z_{1:T}, {a_{k′}}_{k′≠k}, {Δ_l}) ∝ −(1/2) a_k^T Σ_0^{−1} a_k
  − (1/2) ∑_{t=1}^{T} { (y_t^{(k_+)} − [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}]^T a_k)^T Δ_{Z_t}^{−1(k_+,k_+)} (y_t^{(k_+)} − [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}]^T a_k)
  + 2 (y_t^{(k_+)} − [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}]^T a_k)^T Δ_{Z_t}^{−1(k_+,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)}) },  (E.7)

which we can further expand to yield

log p(a_k | y_{1:T}, z_{1:T}, Z_{1:T}, {a_{k′}}_{k′≠k}, {Δ_l}) ∝ −(1/2) a_k^T Σ_0^{−1} a_k
  − (1/2) ∑_{t=1}^{T} { (y_t^{(k_+)})^T Δ_{Z_t}^{−1(k_+,k_+)} y_t^{(k_+)}
  + a_k^T [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}] Δ_{Z_t}^{−1(k_+,k_+)} [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}]^T a_k
  − 2 (y_t^{(k_+)})^T Δ_{Z_t}^{−1(k_+,k_+)} [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}]^T a_k }
  − ∑_{t=1}^{T} { (y_t^{(k_+)})^T Δ_{Z_t}^{−1(k_+,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)})
  − a_k^T [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}] Δ_{Z_t}^{−1(k_+,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)}) }.  (E.8)

Absorbing more terms unrelated to ak into the proportionality, we have

log p(a_k | y_{1:T}, z_{1:T}, Z_{1:T}, {a_{k′}}_{k′≠k}, {Δ_l}) ∝ −(1/2) a_k^T Σ_0^{−1} a_k
  − (1/2) ∑_{t=1}^{T} { a_k^T [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}] Δ_{Z_t}^{−1(k_+,k_+)} [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}]^T a_k
  − 2 (y_t^{(k_+)})^T Δ_{Z_t}^{−1(k_+,k_+)} [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}]^T a_k }
  + ∑_{t=1}^{T} { a_k^T [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}] Δ_{Z_t}^{−1(k_+,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)}) },  (E.9)

which after some rearranging gives

log p(a_k | y_{1:T}, z_{1:T}, Z_{1:T}, {a_{k′}}_{k′≠k}, {Δ_l})
  ∝ −(1/2) a_k^T { Σ_0^{−1} + ∑_{t=1}^{T} [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}] Δ_{Z_t}^{−1(k_+,k_+)} [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}]^T } a_k
  + a_k^T ∑_{t=1}^{T} { [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}] Δ_{Z_t}^{−1(k_+,k_+)} y_t^{(k_+)}
  + [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}] Δ_{Z_t}^{−1(k_+,k_-)} (y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)}) }.  (E.10)

Before completing the square, we will find it useful to introduce a bit more notation to simplify the expression,

Ȳ_t^{(k_+)} = [ỹ_t^{(k_1^+)} | ⋯ | ỹ_t^{(k_{|k_+|}^+)}],   ε_t^{(k_-)} = y_t^{(k_-)} − A_{z_t}^{(k_-,k_-)} Ỹ_t^{(k_-,k_-)},

yielding

log p(a_k | y_{1:T}, z_{1:T}, Z_{1:T}, {a_{k′}}_{k′≠k}, {Δ_l}) ∝ −(1/2) a_k^T { Σ_0^{−1} + ∑_{t=1}^{T} Ȳ_t^{(k_+)} Δ_{Z_t}^{−1(k_+,k_+)} Ȳ_t^{(k_+)T} } a_k
  + a_k^T { ∑_{t=1}^{T} Ȳ_t^{(k_+)} ( Δ_{Z_t}^{−1(k_+,k_+)} y_t^{(k_+)} + Δ_{Z_t}^{−1(k_+,k_-)} ε_t^{(k_-)} ) }.  (E.11)

We desire an expression of the form −(1/2)(a_k − μ_k)^T Σ_k^{−1} (a_k − μ_k) for unknown μ_k and Σ_k^{−1} so that it conforms to the multivariate normal density with mean μ_k and precision Σ_k^{−1}. We already have our Σ_k^{−1} value from the quadratic term above,

Σ_k^{−1} = Σ_0^{−1} + ∑_{t=1}^{T} Ȳ_t^{(k_+)} Δ_{Z_t}^{−1(k_+,k_+)} Ȳ_t^{(k_+)T},  (E.12)

which allows us to solve the cross-term for μk,

−(1/2)(−2 μ_k^T Σ_k^{−1} a_k) = a_k^T ∑_{t=1}^{T} Ȳ_t^{(k_+)} ( Δ_{Z_t}^{−1(k_+,k_+)} y_t^{(k_+)} + Δ_{Z_t}^{−1(k_+,k_-)} ε_t^{(k_-)} ),
Σ_k^{−1} μ_k = ∑_{t=1}^{T} Ȳ_t^{(k_+)} ( Δ_{Z_t}^{−1(k_+,k_+)} y_t^{(k_+)} + Δ_{Z_t}^{−1(k_+,k_-)} ε_t^{(k_-)} ).  (E.13)

We can pull the final required −(1/2) μ_k^T Σ_k^{−1} μ_k term out of the proportionality constant and complete the square. Thus, we have the form of the posterior for a_k,

p(a_k | y_{1:T}, z_{1:T}, Z_{1:T}, {a_{k′}}_{k′≠k}, {Δ_l}) ∝ exp( −(1/2)(a_k − μ_k)^T Σ_k^{−1} (a_k − μ_k) ) = N(a_k; μ_k, Σ_k),  (E.14)

where

Σ_k^{−1} = Σ_0^{−1} + ∑_{t=1}^{T} Ȳ_t^{(k_+)} Δ_{Z_t}^{−1(k_+,k_+)} Ȳ_t^{(k_+)T},
Σ_k^{−1} μ_k = ∑_{t=1}^{T} Ȳ_t^{(k_+)} ( Δ_{Z_t}^{−1(k_+,k_+)} y_t^{(k_+)} + Δ_{Z_t}^{−1(k_+,k_-)} ε_t^{(k_-)} ).  (E.15)
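Assembling Σ_k^{−1} and Σ_k^{−1} μ_k from per-time-step pieces and drawing a_k can be sketched as follows. The partitioned inputs (lag matrices Ȳ_t^{(k_+)}, the precision blocks, the observations y_t^{(k_+)}, and the residuals ε_t^{(k_-)}) are assumed to be precomputed elsewhere, and the function name and argument names are illustrative.

```python
import numpy as np

def sample_ar_coefficients(Sigma0_inv, Ybar, Delta_pp, Delta_pm, y_p, eps_m, rng):
    """Draw a_k ~ N(mu_k, Sigma_k) from the posterior in Eqs. (E.12)-(E.15).

    Sigma0_inv : (r, r) prior precision of a_k.
    Ybar       : list of (r, n_t) lag matrices Ybar_t^{(k+)} for t = 1..T.
    Delta_pp   : list of (n_t, n_t) precision blocks Delta_{Z_t}^{-1 (k+, k+)}.
    Delta_pm   : list of (n_t, m_t) precision blocks Delta_{Z_t}^{-1 (k+, k-)}.
    y_p        : list of (n_t,) observations y_t^{(k+)}.
    eps_m      : list of (m_t,) residuals eps_t^{(k-)} of the other-state channels.
    """
    r = Sigma0_inv.shape[0]
    prec = Sigma0_inv.copy()          # accumulates Sigma_k^{-1}
    lin = np.zeros(r)                 # accumulates Sigma_k^{-1} mu_k

    for Yb, Dpp, Dpm, yp, em in zip(Ybar, Delta_pp, Delta_pm, y_p, eps_m):
        prec += Yb @ Dpp @ Yb.T
        lin += Yb @ (Dpp @ yp + Dpm @ em)

    Sigma_k = np.linalg.inv(prec)
    mu_k = Sigma_k @ lin
    # Draw from N(mu_k, Sigma_k) via a Cholesky factor of the covariance.
    L = np.linalg.cholesky(Sigma_k)
    return mu_k + L @ rng.standard_normal(r)
```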

Appendix F. Experiment Parameters

Below, we give explicit values for the various priors and parameters used in our experiments.

Table F.2.

Parameters used in sparse factorial BP-AR-HMM simulation experiment

parameter | description | value
N | number of time series per event | 6
r | AR model order | 1
m_0 | a_k normal prior mean | 0
Σ_0 | a_k normal prior covariance | 0.1 · I_{1×1}
L | truncated number of event states | 20
b_0 | Δ_l HIW prior degrees of freedom | N + 3
D_0 | Δ_l HIW prior scale | 0.05(b_0 − N − 1)(I_{N×N} + 1)
(a_{α_e+κ_e}, b_{α_e+κ_e}) | α_e + κ_e gamma prior | (1, 1)
(a_{γ_e}, b_{γ_e}) | γ_e gamma prior | (1, 1)
(c_{ρ_e}, d_{ρ_e}) | ρ_e beta prior | (1, 1)
(a_{γ_c}, b_{γ_c}) | γ_c gamma prior | (1, 1)
(a_{κ_c}, b_{κ_c}) | κ_c gamma prior | (1000, 1)
σ²_{γ_c} | γ_c Metropolis-Hastings proposal variance | 1
σ²_{κ_c} | κ_c Metropolis-Hastings proposal variance | 100
(a_{α_c}, b_{α_c}) | α_c gamma prior | (1, 1)

Table F.3.

Parameters used in the epileptic seizure and burst experiments. When applicable, the same parameters were used for the standard BP-AR-HMM as for the correlated BP-AR-HMMs. The analysis of the two seizures involved 16 iEEG channels, and the analysis of the 15 bursts and single seizure involved 6 iEEG channels.

parameter | description | value
N | number of time series per event | 16 and 6
r | AR model order | 5
m_0 | a_k normal prior mean | 0
Σ_0 | a_k normal prior covariance | Cov({y_t^{(i)}}_{∀t,i})
L | truncated number of event states | 30
b_0 | Δ_l (H)IW prior degrees of freedom | N + 3
D_0 | Δ_l (H)IW prior scale | (b_0 − N − 1) · Cov({y_{t+1} − y_t}_{∀t})
(a_{α_e+κ_e}, b_{α_e+κ_e}) | α_e + κ_e gamma prior | (1, 1)
(a_{γ_e}, b_{γ_e}) | γ_e gamma prior | (1, 1)
(c_{ρ_e}, d_{ρ_e}) | ρ_e beta prior | (1, 1)
(a_{γ_c}, b_{γ_c}) | γ_c gamma prior | (1, 1)
(a_{κ_c}, b_{κ_c}) | κ_c gamma prior | (1000, 1)
σ²_{γ_c} | γ_c Metropolis-Hastings proposal variance | 1
σ²_{κ_c} | κ_c Metropolis-Hastings proposal variance | 100
(a_{α_c}, b_{α_c}) | α_c gamma prior | (1, 1)
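For reference, the Table F.3 settings could be collected into a simple configuration sketch. How Σ_0 and D_0 are expanded from the pooled covariances is only our reading of the table, and all names below are illustrative.

```python
import numpy as np

def seizure_experiment_config(data):
    """Collect the Table F.3 prior settings for one event; data is a (T, N) iEEG array."""
    N = data.shape[1]
    r = 5                                              # AR model order
    diffs = np.diff(data, axis=0)                      # y_{t+1} - y_t
    b0 = N + 3
    return {
        "ar_order": r,
        "num_event_states": 30,                        # truncation level L
        "ak_prior_mean": np.zeros(r),
        # One reading of Cov({y_t^(i)} for all t, i): pooled sample variance times I_r.
        "ak_prior_cov": np.var(data) * np.eye(r),
        "hiw_dof": b0,
        "hiw_scale": (b0 - N - 1) * np.cov(diffs.T),   # (b0 - N - 1) * Cov({y_{t+1} - y_t})
        "alpha_plus_kappa_prior": (1.0, 1.0),
        "gamma_e_prior": (1.0, 1.0),
        "rho_e_prior": (1.0, 1.0),
        "gamma_c_prior": (1.0, 1.0),
        "kappa_c_prior": (1000.0, 1.0),
        "gamma_c_proposal_var": 1.0,
        "kappa_c_proposal_var": 100.0,
        "alpha_c_prior": (1.0, 1.0),
    }
```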

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

1. D’Alessandro M, Vachtsevanos G, Esteller R, Echauz J, Cranstoun S, Worrell G, Parish L, Litt B. A multi-feature and multi-channel univariate selection process for seizure prediction. Clinical Neurophysiology. 2005;116:506–516. doi: 10.1016/j.clinph.2004.11.014.
2. Mirowski P, Madhavan D, Lecun Y, Kuzniecky R. Classification of patterns of EEG synchronization for seizure prediction. Clinical Neurophysiology. 2009;120:1927–1940. doi: 10.1016/j.clinph.2009.09.002.
3. Litt B, Esteller R, Echauz J, D’Alessandro M, Shor R, Henry T, Pennell P, Epstein C, Bakay R, Dichter M, Vachtsevanos G. Epileptic seizures may begin hours in advance of clinical onset: a report of five patients. Neuron. 2001;30:51–64. doi: 10.1016/s0896-6273(01)00262-8.
4. Krystal AD, Prado R, West M. New methods of time series analysis of non-stationary EEG data: eigenstructure decompositions of time varying autoregressions. Clinical Neurophysiology. 1999;110:2197–2206. doi: 10.1016/s1388-2457(99)00165-0.
5. Fox EB, Sudderth EB, Jordan MI, Willsky AS. Sharing features among dynamical systems with beta processes. Advances in Neural Information Processing Systems; 2009.
6. Schindler K, Leung H, Elger CE, Lehnertz K. Assessing seizure dynamics by analysing the correlation structure of multichannel intracranial EEG. Brain. 2007;130:65–77. doi: 10.1093/brain/awl304.
7. Schiff SJ, Sauer T, Kumar R, Weinstein SL. Neuronal spatiotemporal pattern discrimination: the dynamical evolution of seizures. NeuroImage. 2005;28:1043–1055. doi: 10.1016/j.neuroimage.2005.06.059.
8. Prado R, Molina F, Huerta G. Multivariate time series modeling and classification via hierarchical VAR mixtures. Computational Statistics & Data Analysis. 2006;51:1445–1462.
9. Van Gael J, Saatci Y, Teh YW, Ghahramani Z. Beam sampling for the infinite hidden Markov model. Proceedings of the 25th International Conference on Machine Learning; 2008.
10. Heller K, Teh YW, Gorur D. The infinite hierarchical hidden Markov model. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics; 2009.
11. Doshi-Velez F, Wingate D, Tenenbaum J, Roy N. Infinite dynamic Bayesian networks. Proceedings of the 28th International Conference on Machine Learning; 2011.
12. Dong W, Pentland A, Heller KA. Graph-coupled HMMs for modeling the spread of infection. Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence; 2012.
13. Wulsin DF, Fox EB, Litt B. Parsing epileptic events using a Markov switching process for correlated time series. Proceedings of the 30th International Conference on Machine Learning; 2013.
14. Ghahramani Z, Jordan MI. Factorial hidden Markov models. Machine Learning. 1997;29:245–273.
15. Dawid AP, Lauritzen SL. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics. 1993;21:1272–1317.
16. Fox EB, Sudderth EB, Jordan MI, Willsky AS. Joint modeling of multiple related time series via the beta process. arXiv preprint arXiv:1111.4226v1; 2011.
17. Thibaux R, Jordan MI. Hierarchical beta processes and the Indian buffet process. Proceedings of the 10th Conference on Artificial Intelligence and Statistics; 2007.
18. Fox EB, Sudderth EB, Jordan MI, Willsky AS. A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics. 2011;5:1020–1056.
19. Griffiths TL, Ghahramani Z. Infinite latent feature models and the Indian buffet process. Technical Report 2005-001, Gatsby Computational Neuroscience Unit; 2005.
20. Hughes MC, Fox EB, Sudderth EB. Effective split-merge Monte Carlo methods for nonparametric models of sequential data. Advances in Neural Information Processing Systems; 2012.
21. Ishwaran H, Zarepour M. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics. 2002;30:269–283.
22. Teh YW, Jordan MI, Beal MJ, Blei DM. Hierarchical Dirichlet processes. Journal of the American Statistical Association. 2006;101:1566–1581.
23. Carvalho CM, Massam H, West M. Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika. 2007;94:647–659.
24. Fox EB, Sudderth EB, Jordan MI, Willsky AS. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing. 2011;59:1569–1585.
25. Esteller R, Echauz J, Tcheng T, Litt B, Pless B. Line length: an efficient feature for seizure onset detection. Proceedings of the 23rd Engineering in Medicine and Biology Society Conference; 2001.
26. Wulsin DF. Bayesian Nonparametric Modeling of Epileptic Events. PhD thesis, University of Pennsylvania; 2013.
27. Gelman A, Carlin JB, Stern HS, Rubin DS. Bayesian Data Analysis. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC; 2004.
28. Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
29. Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society Series B. 1997;59:731–792.
30. Brooks SP, Gelman A, Jones GL, Meng X, editors. Handbook of Markov Chain Monte Carlo. Boca Raton, Florida: Chapman & Hall/CRC; 2011.
31. Fox EB. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. PhD thesis, Massachusetts Institute of Technology; 2009.
