The Discriminative Kalman Filter for Bayesian Filtering with Nonlinear and Non-Gaussian Observation Models

Michael C Burkhart; David M Brandman; Brian Franco; Leigh R Hochberg; Matthew T Harrison

doi:10.1162/neco_a_01275

Author manuscript; available in PMC: 2021 Jul 6.

Published in final edited form as: Neural Comput. 2020 Mar 18;32(5):969–1017. doi: 10.1162/neco_a_01275


PMCID: PMC8259355 NIHMSID: NIHMS1714316 PMID: 32187000

Abstract

The Kalman filter provides a simple and efficient algorithm to compute the posterior distribution for state-space models where both the latent state and measurement models are linear and Gaussian. Extensions to the Kalman filter, including the extended and unscented Kalman filters, incorporate linearizations for models where the observation model p(observation|state) is nonlinear. We argue that in many cases a model for p(state|observation) proves both easier to learn and more accurate for latent state estimation.

Approximating p(state|observation) as Gaussian leads to a new filtering algorithm, the discriminative Kalman filter (DKF), that can perform well even when p(observation|state) is highly nonlinear and/or non-Gaussian. The approximation, motivated by the Bernstein–von Mises Theorem, improves as the dimensionality of the observations increases. The DKF has computational complexity similar to the Kalman filter, allowing it in some cases to perform much faster than particle filters with similar precision, while better accounting for nonlinear and non-Gaussian observation models than Kalman-based extensions.

When the observation model must be learned from training data prior to filtering, off-the-shelf nonlinear and/or nonparametric regression techniques can provide a Gaussian model for p(state|observation) that cleanly integrates with the DKF. As part of the BrainGate2 clinical trial, we successfully implemented Gaussian process regression with the DKF framework in a brain computer interface to provide real-time closed-loop cursor control to a person with a complete spinal cord injury. In this paper, we explore the theory underlying the DKF, exhibit some illustrative examples, and outline potential extensions.

Keywords: Bayesian filtering, discriminative learning, dynamic state-space models, neural decoding, the Kalman filter

1. Introduction

Consider a state space model for Z_1:T := Z₁,…,Z_T (latent states) and X_1:T := X₁,…,X_T (observations) represented as a Bayesian network:

\begin{array}{ccccccccccc} Z_{1} & \to & \cdots & \to & Z_{t - 1} & \to & Z_{t} & \to & \cdots & \to & Z_{T} \\ \downarrow & & & & \downarrow & & \downarrow & & & & \downarrow \\ X_{1} & & & & X_{t - 1} & & X_{t} & & & & X_{T} \end{array}

(1)

The conditional density of Z_t given X_1:t can be expressed recursively using the Chapman–Kolmogorov equation and Bayes’ rule (see Chen, 2003, for further details):

p (z_{t} | x_{1 : t - 1}) = \int p (z_{t} | z_{t - 1}) p (z_{t - 1} | x_{1 : t - 1}) d z_{t - 1},

(2a)

p (z_{t} | x_{1 : t}) = \frac{p (x_{t} | z_{t}) p (z_{t} | x_{1 : t - 1})}{\int p (x_{t} | z_{t}) p (z_{t} | x_{1 : t - 1}) d z_{t}} = \frac{p (x_{t} | z_{t}) p (z_{t} | x_{1 : t - 1})}{p (x_{t} | x_{1 : t - 1})},

(2b)

where p(z₀|x_1:0) = p(z₀) and the conditional densities p(z_t|z_t−1) and p(x_t|z_t) are either specified a priori or learned from training data prior to filtering. Computing or approximating Equation 2 is often called Bayesian filtering. Bayesian filtering arises in a large number of applications, including global positioning systems, target tracking, aircraft and spacecraft guidance, weather forecasting, computer vision, digital communications, and brain computer interfaces (Chen, 2003; Hall, 1966; Battin and Levine, 1970; Grewal and Andrews, 2010; Buehner et al., 2017; Brown and Hwang, 2012; Schmidt et al., 1970; Brandman et al., 2017).

Exact solutions to Equation 2 are only available in special cases, such as the Kalman filter (Kalman, 1960; Kalman and Bucy, 1961). The Kalman filter models the conditional densities p(z_t|z_t−1) and p(x_t|z_t) as linear and Gaussian so that the posterior distribution p(z_t|x_1:t) is also Gaussian and quickly computable. Beneš (1981) and Daum (1984, 1986) broadened the class of models for which the integrals in Equation 2 are analytically tractable but many model specifications still fall outside this class. When the latent state space is finite, the integrals in Equation 2 become sums that can be calculated exactly using a grid-based filter (Elliott, 1994; Arulampalam et al., 2002). For more general models, there are many techniques for approximate Bayesian filtering; see Chen (2003) for a review.
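For a finite latent state space, the recursion in Equation 2 can be sketched in a few lines. The following is a minimal illustration rather than a reference implementation; the function name `grid_filter_step` and the two-state numbers are assumptions made only for this example.

```python
import numpy as np

def grid_filter_step(prior, transition, likelihood):
    """One step of Equation 2 on a finite state space with K states.

    prior:      p(z_{t-1} | x_{1:t-1}), shape (K,)
    transition: transition[i, j] = p(z_t = j | z_{t-1} = i), shape (K, K)
    likelihood: p(x_t | z_t = j) for the observed x_t, shape (K,)
    """
    predicted = prior @ transition            # Equation 2a: the integral becomes a sum
    unnormalized = likelihood * predicted     # numerator of Equation 2b
    return unnormalized / unnormalized.sum()  # divide by p(x_t | x_{1:t-1})

# Hypothetical two-state example.
prior = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
likelihood = np.array([0.7, 0.1])
posterior = grid_filter_step(prior, transition, likelihood)
```

Each step costs O(K²) operations, which is why this exact approach is limited to small discrete state spaces.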

In some applications, parts of the underlying model are first learned from supervised training data consisting of (Z_t,X_t) pairs and then the learned model is used for filtering on new (X_t) data. For instance, (Z_t,X_t) pairs might be used to learn p(x_t|z_t) with nonparametric conditional density estimation and then the learned p(x_t|z_t), say $\hat{p} (x_{t} | z_{t})$ , is substituted into whatever algorithm is used to approximate Bayes’ rule in Equation 2b. This motivates the search for combinations of approximation algorithms and learning methods that work well together. It also opens the door to novel approximation algorithms that would not traditionally be considered for a known model but become practical when the model can be learned. For instance, from (Z_t,X_t) pairs we can choose to learn p(x_t|z_t) or p(z_t|x_t) and incorporate either into the approximation algorithm, whereas traditional approximation algorithms assume that only p(x_t|z_t) is available.

In this paper, we explore the idea of using a novel approximation algorithm that pairs well with learning and demonstrate its use in an intracortical brain computer interface (iBCI) for a human volunteer with tetraplegia as part of the ongoing BrainGate2 clinical trial. Our approach focuses on the approximation of Bayes’ rule in Equation 2b, making use of the fact that p(x_t|z_t) can be replaced with p(z_t|x_t)/p(z_t) throughout. (The p(x_t) term cancels.) This strategy combines well with various Gaussian assumptions that are often employed in approximate Bayesian filtering, resulting in what we call the discriminative Kalman filter (DKF). The DKF retains much of the computational simplicity of the classical Kalman filter, but allows for arbitrary observation models. Some of our clinical research using the DKF has already been published (Brandman et al., 2018b,a), and theoretical aspects of the DKF are further explored in the first author’s dissertation (Burkhart, 2019).

2. The discriminative Kalman filter

In Section 2.1, we derive the DKF approximation for a class of models that generalizes the Kalman filter by allowing for arbitrary observation models. We discuss approximation accuracy in Section 2.2 and introduce a modified algorithm that can be more robust to model misspecification in Section 2.3. In Section 2.4, we compare the DKF formalism to a variety of existing approaches that generalize the Kalman filter and, in Section 2.5, we discuss using the DKF approximation in models with nonlinear and/or non-Gaussian state dynamics.

We now introduce some notation and conventions. We let the latent states Z_t take values in $ℝ^{d \times 1}$ and the observations X_t take values in an abstract space $X$ . In all of our examples, $X \subseteq ℝ^{n \times 1}$ , but this is not necessary. We use η_d(z;µ,Σ) to denote the d-dimensional multivariate Gaussian distribution with mean vector $μ \in ℝ^{d \times 1}$ and covariance matrix $Σ \in S_{d}$ evaluated at $z \in ℝ^{d \times 1}$ , where $S_{d}$ denotes the set of d×d positive definite (symmetric) matrices. We let A⊤ refer to the transpose of a matrix A and use $E$ and $V$ for expected value and variance/covariance, respectively.

2.1. Filter derivation

For the basic derivation, we assume that the latent states form a stationary, mean zero, Gaussian, vector autoregressive model of order 1. Namely, for $A \in ℝ^{d \times d}$ and S, $Γ \in S_{d}$ ,

p (z_{0}) = η_{d} (z_{0}; 0, S),

(3a)

p (z_{t} | z_{t - 1}) = η_{d} (z_{t}; A z_{t - 1}, Γ),

(3b)

for t = 1, 2,…, where $S = A S A^{⊤} + Γ$ so that the process is stationary. Note that Equation 3 matches the latent state model for the stationary Kalman filter. (The assumption of zero mean is easily generalized, but it is usually more convenient to center the Z_t process by subtracting the common mean.)
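The stationarity condition $S = A S A^{⊤} + Γ$ is a discrete Lyapunov equation in S. One way to solve it numerically is by vectorization, sketched below under the assumption that the spectral radius of A is less than one. The helper name `stationary_covariance` and the example matrices are hypothetical.

```python
import numpy as np

def stationary_covariance(A, Gamma):
    """Solve S = A S A^T + Gamma for S.

    Uses the identity vec(A S A^T) = (A kron A) vec(S), so that
    (I - A kron A) vec(S) = vec(Gamma).
    """
    d = A.shape[0]
    vec_S = np.linalg.solve(np.eye(d * d) - np.kron(A, A), Gamma.reshape(-1))
    S = vec_S.reshape(d, d)
    return (S + S.T) / 2  # symmetrize against round-off

# Hypothetical example values.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
Gamma = np.array([[0.1, 0.0],
                  [0.0, 0.2]])
S = stationary_covariance(A, Gamma)
```

The linear system has d² unknowns, so this direct approach is only sensible for small d. SciPy's `scipy.linalg.solve_discrete_lyapunov` solves the same equation and may be preferable in practice.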

The observation model p(x_t|z_t) is assumed to not vary with t, so that the joint (Z_t,X_t) process is stationary but otherwise arbitrary. The observation model can be non-Gaussian, multimodal, discrete, etc. For instance, in neural decoding for BCI applications, the observations are often vectors of counts of neural spiking events (binned action potentials), which might be restricted to small integers or even be binary-valued.

The DKF is based on a Gaussian approximation for p(z_t|x_t), namely,

p (z_{t} | x_{t}) \approx η_{d} (z_{t}; f (x_{t}), Q (x_{t})),

(4)

where $f : X \to ℝ^{d}$ and $Q : X \to S_{d}$ . Note that Equation 4 is not an approximation of the observation model, but rather of the conditional density of the latent state given the observation at a single time step. In Section 2.4, we compare this to other approaches that use Gaussian approximations for Bayesian filtering. When the dimensionality of the observation space $(X)$ is large relative to the dimensionality of the state space $(ℝ^{d})$ , the Bernstein–von Mises Theorem states that there exist f and Q such that this approximation will be accurate, requiring only mild regularity conditions on the observation model p(x_t|z_t); see Section 2.2 and van der Vaart (1998). Furthermore, we can take f and Q to be the conditional mean and covariance of Z_t given X_t, namely,

f (x) = E (Z_{t} | X_{t} = x), Q (x) = V (Z_{t} | X_{t} = x),

(5)

which is the approach taken in this paper, although other choices are certainly possible, such as $f (x_{t}) = \arg \max_{z_{t}} p (z_{t} | x_{t})$ or $f (x_{t}) = \arg \max_{z_{t}} p (x_{t} | z_{t})$ , the latter of which is most commonly used in statements of the Bernstein–von Mises Theorem.

To make use of Equation 4 for approximating Equation 2, we first rewrite Equation 2b in terms of p(z_t|x_t) as

\begin{aligned} p (z_{t} | x_{1 : t}) &= \frac{p (x_{t})}{p (x_{t} | x_{1 : t - 1})} \frac{p (z_{t} | x_{t})}{p (z_{t})} p (z_{t} | x_{1 : t - 1}) \\ &= \frac{p (x_{t})}{p (x_{t} | x_{1 : t - 1})} \frac{p (z_{t} | x_{t})}{p (z_{t})} \int p (z_{t} | z_{t - 1}) p (z_{t - 1} | x_{1 : t - 1}) d z_{t - 1}, \end{aligned}

(6)

where the second line follows from the Chapman–Kolmogorov equation (Equation 2a). We then substitute the latent state model (Equation 3) and the DKF approximation (Equation 4) into Equation 6. We absorb terms not depending on z_t into a normalizing constant κ to obtain

p (z_{t} | x_{1 : t}) \approx κ (x_{1 : t}) \frac{η_{d} (z_{t}; f (x_{t}), Q (x_{t}))}{η_{d} (z_{t}; 0, S)} \int η_{d} (z_{t}; A z_{t - 1}, Γ) p (z_{t - 1} | x_{1 : t - 1}) d z_{t - 1} .

(7)

If p(z_t−1|x_1:t−1) is approximately Gaussian, which it is for the base case of t = 1 from Equation 3a (defining p(z₀|x_1:0) = p(z₀)), then all of the terms on the right side of Equation 7 are approximately Gaussian. If these approximations are exact and the analytic expression for covariance is valid (specifically if Σ_t in Equation 9 is positive definite), we find that the right side of Equation 7 is again Gaussian, giving a Gaussian approximation for p(z_t|x_1:t). We rely on the fact that dividing two Gaussian pdf’s yields an exponentiated quadratic form that will itself be Gaussian if the associated covariance matrix is positive definite (and that the product of two Gaussian pdf’s is Gaussian, without any additional assumptions). See the proof of Lemma 1 in Appendix 7 for a full derivation and further details.
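The Gaussian algebra behind this claim can be sketched as follows, under the assumption that $p (z_{t - 1} | x_{1 : t - 1}) = η_{d} (z_{t - 1}; μ_{t - 1}, Σ_{t - 1})$ holds exactly. The shorthand $ν = A μ_{t - 1}$ , $M = A Σ_{t - 1} A^{⊤} + Γ$ , $f = f (x_{t})$ , and $Q = Q (x_{t})$ is introduced here only for brevity. The integral in Equation 7 then equals $η_{d} (z_{t}; ν, M)$ , and collecting the terms in the exponent that depend on z_t gives

\frac{η_{d} (z_{t}; f, Q) \, η_{d} (z_{t}; ν, M)}{η_{d} (z_{t}; 0, S)} \propto \exp \left( - \frac{1}{2} z_{t}^{⊤} (Q^{- 1} + M^{- 1} - S^{- 1}) z_{t} + z_{t}^{⊤} (Q^{- 1} f + M^{- 1} ν) \right) .

Provided that $Q^{- 1} + M^{- 1} - S^{- 1}$ is positive definite, the right side is proportional to a Gaussian density in z_t whose precision matrix is $Q^{- 1} + M^{- 1} - S^{- 1}$ and whose mean is the inverse of that matrix applied to $Q^{- 1} f + M^{- 1} ν$ .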

Let

p (z_{t} | x_{1 : t}) \approx η_{d} (z_{t}; μ_{t} (x_{1 : t}), Σ_{t} (x_{1 : t}))

(8)

be the Gaussian approximation of p(z_t|x_1:t) obtained from successively applying the approximation in Equation 7. Defining µ₀ = 0 and Σ₀ = S, we can sequentially compute $μ_{t} = μ_{t} (x_{1 : t}) \in ℝ^{d \times 1}$ and $Σ_{t} = Σ_{t} (x_{1 : t}) \in S_{d}$ via

\begin{aligned} ν_{t} &= A μ_{t - 1}, \\ M_{t} &= A Σ_{t - 1} A^{⊤} + Γ, \\ Σ_{t} &= {(M_{t}^{- 1} + Q {(x_{t})}^{- 1} - S^{- 1})}^{- 1}, \\ μ_{t} &= Σ_{t} (M_{t}^{- 1} ν_{t} + Q {(x_{t})}^{- 1} f (x_{t})) . \end{aligned}

(9)

The first two steps incorporate the exact state dynamics in Equation 3b and the final two steps incorporate the observation information using the DKF approximation in Equation 4. The function Q needs to be defined so that Σ_t exists and is a proper covariance matrix. A sufficient condition that is easy to enforce in practice is $Q {(\cdot)}^{- 1} - S^{- 1} \in S_{d}$ ; see Appendix 6.3.

Equation 9 encapsulates the DKF. For pseudocode, see Algorithm 1. Once f(x_t) and Q(x_t) have been evaluated, there is no remaining dependence on n and a single iteration of the algorithm takes O(d³) operations, which is at least as fast as the Kalman filter (when d < n). The power of the DKF, along with potential computational difficulties, comes from evaluating f and Q. If f is linear and Q is constant, then the DKF and the Kalman filter are equivalent (cf. Section 4.1). More general f and Q allow the filter to depend nonlinearly on the observations, improving performance in many cases. If f and Q can be quickly evaluated and the dimension d of Z_t is not too large, then the DKF is fast enough for use in real-time applications, such as the BCI decoding example below.
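A single iteration of Equation 9 might be sketched in NumPy as follows. The function name `dkf_step`, the scalar model, and the particular choices of f and Q are illustrative assumptions, not the models used in this paper; explicit matrix inverses are used for readability rather than numerical robustness.

```python
import numpy as np

def dkf_step(mu_prev, Sigma_prev, x, f, Q, A, Gamma, S):
    """One DKF update (Equation 9); f and Q are callables on an observation."""
    nu = A @ mu_prev                              # predicted mean
    M = A @ Sigma_prev @ A.T + Gamma              # predicted covariance
    M_inv = np.linalg.inv(M)
    Q_inv = np.linalg.inv(Q(x))
    Sigma = np.linalg.inv(M_inv + Q_inv - np.linalg.inv(S))
    mu = Sigma @ (M_inv @ nu + Q_inv @ f(x))
    return mu, Sigma

# Hypothetical scalar model with S = A S A^T + Gamma (0.81 * 1 + 0.19 = 1).
A = np.array([[0.9]])
Gamma = np.array([[0.19]])
S = np.array([[1.0]])
f = lambda x: np.tanh(x)              # illustrative nonlinear conditional mean
Q = lambda x: np.array([[0.5]])       # illustrative; Q^{-1} - S^{-1} = 1 > 0

mu, Sigma = np.zeros(1), S            # mu_0 = 0, Sigma_0 = S
for x in [np.array([0.3]), np.array([-1.2]), np.array([2.0])]:
    mu, Sigma = dkf_step(mu, Sigma, x, f, Q, A, Gamma, S)
```

Only d×d matrices are inverted in each step, so the cost per iteration beyond evaluating f and Q is O(d³).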

2.2. Approximation accuracy

Let the observation space be $X = B^{n}$ for some set B. As n grows, the Bernstein–von Mises (BvM) Theorem guarantees under mild assumptions that the conditional distribution of Z_t|X_t is asymptotically normal in total variation distance and concentrates at Z_t (van der Vaart, 1998). This asymptotic normality result provides the main rationale for our key approximation expressed in Equation 4. The BvM Theorem is usually stated in the context of Bayesian estimation. To apply it in our context, we equate Z_t with the parameter and X_t with the data, so that p(z_t|x_t) becomes the posterior distribution of the parameter at a fixed time t. We then let the dimension n of x_t grow, meaning that we are observing growing amounts of data at a fixed time t associated with the parameter Z_t. Very loosely speaking, the BvM Theorem tends to be applicable in situations where X_t uniquely determines Z_t in the limit as n → ∞, but does not uniquely determine Z_t for any finite n.

Algorithm 1: the DKF

Data: observations x₁, x₂, …; matrices $A \in ℝ^{d \times d}$ and S, $Γ \in S_{d}$ such that z₀, z₁, … are drawn from a stationary process satisfying Equation 3; functions $f : X \to ℝ^{d}$ and $Q : X \to S_{d}$ such that $p (z_{t} | x_{t}) \approx η_{d} (z_{t}; f (x_{t}), Q (x_{t}))$ and $Q {(\cdot)}^{- 1} - S^{- 1} \in S_{d}$ , either derived analytically or approximated from data
Result: $μ_{t} = μ_{t} (x_{1 : t}) \in ℝ^{d \times 1}$ and $Σ_{t} = Σ_{t} (x_{1 : t}) \in S_{d}$ to approximate the posterior distribution as $p (z_{t} | x_{1 : t}) \approx η_{d} (z_{t}; μ_{t}, Σ_{t})$ for t = 1, 2, …
Initialize µ₀ = 0 and Σ₀ = S;
for t ≥ 1 do
    set $ν_{t} = A μ_{t - 1}$ and $M_{t} = A Σ_{t - 1} A^{⊤} + Γ$ ;
    set $Σ_{t} = {(M_{t}^{- 1} + Q {(x_{t})}^{- 1} - S^{- 1})}^{- 1}$ and $μ_{t} = Σ_{t} (M_{t}^{- 1} ν_{t} + Q {(x_{t})}^{- 1} f (x_{t}))$ ;
end

One concern is that Equation 6 will amplify approximation errors. Along these lines, we prove the following result that holds whenever the BvM Theorem is applicable for Equation 4:

Theorem 1. Under mild assumptions, the total variation distance between our approximation η_d(z_t;µ_t(x_1:t), Σ_t(x_1:t)) and the exact filtering distribution p(z_t|x_1:t) converges in probability to zero for each t as n → ∞.

This result is stated formally and proven in Appendix 7. We interpret the theorem to mean that under most conditions, as the dimensionality of the observations increases, the approximation error of the DKF tends to zero.

The proof is elementary, but involves several subtleties that arise because of the p(z_t) term in the denominator of Equation 6 corresponding to η_d(z_t;0,S). This term can amplify approximation errors in the tails of p(z_t|x_t), which are not uniformly controlled by the asymptotic normality results in the BvM Theorem. To remedy this, our proof also uses the concentration results in the BvM Theorem to control pathological behaviors in the tails. As an intermediate step, we prove that the theorem above still holds when the p(z_t) term is omitted from the denominator of Equation 6 (see Remark 3 in Appendix 7).

2.3. Robust DKF

Omitting the p(z_t) from the denominator of Equation 6 is also helpful for making the DKF robust to violations of the modeling assumptions and to errors introduced when f and Q are learned from training data. Repeating the original derivation, but without η_d(z_t;0,S) in the denominator, gives the following filtering algorithm that we call the robust DKF. One can think of the robust DKF as a special case of the standard DKF where all eigenvalues of S⁻¹ are so small that the effect of subtracting S⁻¹ is negligible. This has the effect of placing an improper prior on Z₀. Defining µ₁(x₁) = f(x₁) and Σ₁(x₁) = Q(x₁), we sequentially compute µ_t and Σ_t for t ≥ 2 via

\begin{aligned} ν_{t} &= A μ_{t - 1}, \\ M_{t} &= A Σ_{t - 1} A^{⊤} + Γ, \\ Σ_{t} &= {(M_{t}^{- 1} + Q {(x_{t})}^{- 1})}^{- 1}, \\ μ_{t} &= Σ_{t} (M_{t}^{- 1} ν_{t} + Q {(x_{t})}^{- 1} f (x_{t})) . \end{aligned}

(10)

(Note that we initialize at t = 1 and not t = 0 in the robust DKF.) Justification for the robust DKF comes from Remark 3 in Appendix 7 showing that the robust DKF accurately approximates the true p(z_t|x_1:t) in total variation distance for each t as n increases. We sometimes find that the robust DKF outperforms the DKF on real-data examples, but not on simulated examples that closely match the DKF assumptions. For pseudocode, see Algorithm 2.
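The robust DKF recursion in Equation 10 could be sketched as below. The function name `robust_dkf` and the scalar example (including the tanh form of f) are hypothetical and chosen only to make the snippet self-contained.

```python
import numpy as np

def robust_dkf(xs, f, Q, A, Gamma):
    """Run the robust DKF (Equation 10) over a list of observations xs."""
    mu, Sigma = f(xs[0]), Q(xs[0])                # initialize at t = 1
    means, covs = [mu], [Sigma]
    for x in xs[1:]:
        nu = A @ mu
        M = A @ Sigma @ A.T + Gamma
        M_inv = np.linalg.inv(M)
        Q_inv = np.linalg.inv(Q(x))
        Sigma = np.linalg.inv(M_inv + Q_inv)      # no S^{-1} term is subtracted
        mu = Sigma @ (M_inv @ nu + Q_inv @ f(x))
        means.append(mu)
        covs.append(Sigma)
    return means, covs

# Hypothetical scalar model.
A = np.array([[0.9]])
Gamma = np.array([[0.19]])
f = lambda x: np.tanh(x)              # illustrative conditional mean
Q = lambda x: np.array([[0.5]])       # illustrative conditional covariance
means, covs = robust_dkf([np.array([0.3]), np.array([-1.2])], f, Q, A, Gamma)
```

Because a sum of positive definite precision matrices is positive definite, Σ_t is always a valid covariance matrix here, with no extra condition on Q.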

Algorithm 2: the robust DKF

Data: observations x₁, x₂, …; matrices $A \in ℝ^{d \times d}$ and S, $Γ \in S_{d}$ such that z₀, z₁, … are drawn from a stationary process satisfying Equation 3; functions $f : X \to ℝ^{d}$ and $Q : X \to S_{d}$ such that $p (z_{t} | x_{t}) \approx η_{d} (z_{t}; f (x_{t}), Q (x_{t}))$
Result: $μ_{t} = μ_{t} (x_{1 : t}) \in ℝ^{d \times 1}$ and $Σ_{t} = Σ_{t} (x_{1 : t}) \in S_{d}$ to approximate the posterior distribution as $p (z_{t} | x_{1 : t}) \approx η_{d} (z_{t}; μ_{t}, Σ_{t})$ for t = 1, 2, …
Initialize µ₁(x₁) = f(x₁) and Σ₁(x₁) = Q(x₁);
for t ≥ 2 do
    set $ν_{t} = A μ_{t - 1}$ and $M_{t} = A Σ_{t - 1} A^{⊤} + Γ$ ;
    set $Σ_{t} = {(M_{t}^{- 1} + Q {(x_{t})}^{- 1})}^{- 1}$ and $μ_{t} = Σ_{t} (M_{t}^{- 1} ν_{t} + Q {(x_{t})}^{- 1} f (x_{t}))$ ;
end

2.4. Other Gaussian approximations

The DKF enforces a Gaussian form for the filtering distribution p(z_t|x_1:t), which is a common strategy for approximate Bayesian filtering, owing to the analytic and representational tractability of Gaussians. In this section, we describe several other methods that use Gaussian approximations, focusing on the case of linear, Gaussian state dynamics. For this type of state dynamics the transition from time t−1 to time t is usually separated into two distinct steps when using Gaussian approximations. Beginning with the Gaussian approximation from time t−1,

p (z_{t - 1} | x_{1 : t - 1}) \approx η_{d} (z_{t - 1}; μ_{t - 1}, Σ_{t - 1}),

the first step uses the exact state dynamics (Equation 3b) to create a Gaussian approximation for p(z_t|x_1:t−1), namely,

p (z_{t} | x_{1 : t - 1}) \approx η_{d} (z_{t}; ν_{t}, M_{t}),

(11)

where ν_t = Aµ_t−1 and $M_{t} = A Σ_{t - 1} A^{⊤} + Γ$ , as in Equations 9 and 10. Most Gaussian methods would proceed similarly for the first step under these state dynamics. Differences between methods appear for nonlinear or non-Gaussian state dynamics; see Section 2.5.

The second step attempts to incorporate the observation information x_t via Bayes’ rule:

p (z_{t} | x_{1 : t}) = \frac{p (x_{t} | z_{t}) p (z_{t} | x_{1 : t - 1})}{\int p (x_{t} | z_{t}) p (z_{t} | x_{1 : t - 1}) d z_{t}} .

Beginning with the Gaussian approximation from step 1 (Equation 11) and enforcing the final approximation

p (z_{t} | x_{1 : t}) \approx η_{d} (z_{t}; μ_{t}, Σ_{t}),

the problem reduces to finding µ_t and Σ_t so that

η_{d} (z_{t}; μ_{t}, Σ_{t}) \approx \frac{p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t})}{\int p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t}} = q_{t} (z_{t}),

(12)

where q_t is defined by this equation.

There are many strategies in the literature for choosing µ_t and Σ_t in Equation 12. The terminology is not standardized, but we will attempt to describe some prominent classes of strategies.

2.4.1. Gaussian assumed density filter

The Gaussian assumed density filter (G-ADF) usually refers to choosing µ_t and Σ_t to be the mean vector and covariance matrix of the density q_t in Equation 12 (Kushner, 1967; Ito, 2000; Ito and Xiong, 2000; Minka, 2001a). Moment matching, in this case, minimizes the relative entropy D(q_t||η_d(·;µ_t, Σ_t)). The G-ADF directly seeks a Gaussian approximation to the full posterior p(z_t|x_1:t), whereas the DKF derives a Gaussian approximation to the full posterior from a Gaussian approximation of p(z_t|x_t). While the G-ADF approach tends to prove quite accurate, it is only practical if the mean and covariance of q_t are available. In particular, we must be able to efficiently compute or easily approximate the integrals

\begin{aligned} a &= \int p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t}, \\ b &= \int z_{t} p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t}, \\ c &= \int z_{t} z_{t}^{⊤} p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t}, \end{aligned}

(13)

to obtain µ_t = b/a and $Σ_{t} = c / a - μ_{t} μ_{t}^{⊤}$ . There also exist extensions of the G-ADF. For instance, expectation propagation uses iterative refinement of estimates to improve upon assumed density filtering (Minka, 2001b,a). It may be possible to similarly improve the DKF, but iterating over the history of observations is typically not practical in an online setting and we do not explore that approach here.
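For a scalar latent state, the integrals in Equation 13 can be approximated on a grid, as in the following rough sketch. The function name `gadf_update_1d`, the grid settings, and the heteroscedastic likelihood are assumptions made for illustration; the Riemann sum stands in for whatever integration scheme a real application would use.

```python
import numpy as np

def gadf_update_1d(likelihood, nu, M, num_points=4001, width=10.0):
    """Moment-match q_t (Equation 12) for scalar z_t using grid sums.

    likelihood: vectorized callable z -> p(x_t | z) for the observed x_t
    nu, M:      predicted mean and variance
    """
    half = width * np.sqrt(M)
    z = np.linspace(nu - half, nu + half, num_points)
    dz = z[1] - z[0]
    prior = np.exp(-0.5 * (z - nu) ** 2 / M) / np.sqrt(2 * np.pi * M)
    w = likelihood(z) * prior
    a = w.sum() * dz                   # normalizing constant
    b = (z * w).sum() * dz             # first moment
    c = (z ** 2 * w).sum() * dz        # second moment
    mu = b / a
    Sigma = c / a - mu ** 2
    return mu, Sigma

# Hypothetical observation model: X_t | Z_t = z ~ N(0, exp(z)), observed x_t = 2.
x_obs = 2.0
lik = lambda z: np.exp(-0.5 * x_obs ** 2 / np.exp(z)) / np.sqrt(2 * np.pi * np.exp(z))
mu, Sigma = gadf_update_1d(lik, nu=0.0, M=1.0)
```

Grid integration of this kind scales poorly with d, which is one reason the G-ADF is only practical when these moments are cheap to obtain.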

In cases where the DKF is derived from a known model, as opposed to being learned from training data, computing f(x_t) and Q(x_t) requires the computation of very similar integrals to those needed for the G-ADF, the difference being that ν_t and M_t are replaced by 0 and S, respectively, throughout Equation 13 (and then f(x_t) = b/a and $Q (x_{t}) = c / a - f (x_{t}) f {(x_{t})}^{⊤}$ ). For this reason, in models where the G-ADF can be easily used, there would seem to be no reason to use the DKF. The main difference is that the DKF can be easily learned from training data, whereas the G-ADF cannot, since the latter is based on the conditional mean and variance of Z_t|X_t derived under a different marginal distribution for Z_t at each time step, namely, η_d(z_t;ν_t,M_t). The example in Section 4.2 below illustrates a model where both the DKF and G-ADF can be analytically computed; there is little difference in performance. The example in Section 4.3 illustrates a somewhat contrived model where the DKF can be easily computed, but it seems the G-ADF cannot.

2.4.2. Laplace approximation

The Laplace approximation uses a Taylor approximation at the maximum to coerce the numerator in Equation 12 into a Gaussian form as a function of z_t (Butler, 2007; Koyama et al., 2010; Quang et al., 2015). Defining

g_{t} (z_{t}) = \log (p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t})) and z_{t}^{*} = \arg \max_{z_{t}} g_{t} (z_{t}),

a second order Taylor approximation of g_t at $z_{t}^{*}$ is

g_{t} (z_{t}) \approx g_{t} (z_{t}^{*}) + {\dot{g}}_{t} {(z_{t}^{*})}^{⊤} (z_{t} - z_{t}^{*}) + {(z_{t} - z_{t}^{*})}^{⊤} {\ddot{g}}_{t} (z_{t}^{*}) (z_{t} - z_{t}^{*}) / 2,

where ${\dot{g}}_{t} (z)$ and ${\ddot{g}}_{t} (z)$ denote, respectively, the d×1 gradient vector and the d×d Hessian matrix of g_t evaluated at z. The second term vanishes since ${\dot{g}}_{t}$ is zero at the maximum, giving

q_{t} (z_{t}) \propto \exp (g_{t} (z_{t})) \approx \exp (g_{t} (z_{t}^{*}) + {(z_{t} - z_{t}^{*})}^{⊤} {\ddot{g}}_{t} (z_{t}^{*}) (z_{t} - z_{t}^{*}) / 2) \propto η_{d} (z_{t}; z_{t}^{*}, - {\ddot{g}}_{t} {(z_{t}^{*})}^{- 1}) .

This motivates the choice of $μ_{t} = z_{t}^{*}$ and $Σ_{t} = - {\ddot{g}}_{t} {(z_{t}^{*})}^{- 1}$ . Similar to the DKF, the Laplace approximation can be justified in the limit of increasing observation dimensionality using the BvM Theorem. If $z_{t}^{*}$ or the derivatives of g_t are not available in closed form, then the Laplace approximation can be slow, owing to the need to solve an optimization problem at each time step. Laplace approximations are also criticized for being too local, in that the local curvature in the density at $z_{t}^{*}$ dictates the variance chosen for a global approximation to the density.
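As a small worked illustration of the Laplace update, consider a scalar latent state with a Poisson observation whose rate is exp(z_t). This model is an assumption chosen for the example and is not taken from the paper, and the plain Newton iteration below has no step-size safeguards.

```python
import numpy as np

def laplace_update_poisson(x, nu, M, num_iters=50):
    """Laplace approximation for scalar z_t with X_t | Z_t = z ~ Poisson(exp(z)).

    Up to an additive constant,
        g(z) = x * z - exp(z) - (z - nu)^2 / (2 * M),
    which is strictly concave, so its maximizer z* is unique.
    """
    z = nu
    for _ in range(num_iters):
        grad = x - np.exp(z) - (z - nu) / M       # first derivative of g
        hess = -np.exp(z) - 1.0 / M               # second derivative of g (< 0)
        z = z - grad / hess                       # Newton step toward z*
    hess = -np.exp(z) - 1.0 / M
    return z, -1.0 / hess                         # mu_t = z*, Sigma_t = -1 / g''(z*)

mu, Sigma = laplace_update_poisson(x=3, nu=0.0, M=1.0)
```

The loop makes the cost noted above concrete: an optimization problem is solved at every time step, whereas the DKF update is closed-form once f and Q have been evaluated.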

2.4.3. Linearization methods

Several methods, often called linearization methods, can be motivated by attempting to approximate the numerator of Equation 12 as jointly Gaussian in (z_t,x_t), namely,

p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) \approx η_{d + n} \left( \begin{pmatrix} z_{t} \\ x_{t} \end{pmatrix}; \begin{pmatrix} ν_{t} \\ h_{t} \end{pmatrix}, \begin{pmatrix} M_{t} & C_{t} \\ C_{t}^{⊤} & N_{t} \end{pmatrix} \right),

(14)

where the history of observations x_1:t is allowed to influence the choice of $h_{t} \in ℝ^{n \times 1}$ , $N_{t} \in S_{n}$ , and $C_{t} \in ℝ^{d \times n}$ . Using this approximation allows Equation 12 to be exactly integrated to obtain

μ_{t} = ν_{t} + C_{t} N_{t}^{- 1} (x_{t} - h_{t}) and Σ_{t} = M_{t} - C_{t} N_{t}^{- 1} C_{t}^{⊤} .

(15)

Methods differ in how they choose h_t, N_t, and C_t.

Using η_d(z_t;ν_t,M_t) as the marginal density for Z_t, Equation 14 can be rewritten as

p (x_{t} | z_{t}) \approx η_{n} (x_{t}; b_{t} + H_{t} z_{t}, Λ_{t}) .

(16)

The implicit linearization in Equation 14 is now explicit: $E (X_{t} | Z_{t} = z_{t})$ is approximated as the linear function b_t + H_tz_t. The relationship between the different parameters in Equations 14 and 16 is $b_{t} = h_{t} - C_{t}^{⊤} M_{t}^{- 1} ν_{t}$ , $H_{t} = C_{t}^{⊤} M_{t}^{- 1}$ , and $Λ_{t} = N_{t} - C_{t}^{⊤} M_{t}^{- 1} C_{t}$ . Upon re-parameterization, Equation 15 can be used for filtering with

Σ_{t} = {(M_{t}^{- 1} + H_{t}^{⊤} Λ_{t}^{- 1} H_{t})}^{- 1},

μ_{t} = Σ_{t} (M_{t}^{- 1} ν_{t} + H_{t}^{⊤} Λ_{t}^{- 1} (x_{t} - b_{t})),

which has a similar appearance to the corresponding DKF updates in Equation 9.
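The agreement between Equation 15 and the re-parameterized update can be checked numerically. The sketch below draws an arbitrary positive definite joint covariance; the random seed, dimensions, and variable names are all assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 3

# Hypothetical predicted mean and joint-Gaussian parameters (Equation 14).
nu = rng.normal(size=d)
h = rng.normal(size=n)
L = rng.normal(size=(d + n, d + n))
joint = L @ L.T + np.eye(d + n)                   # positive definite joint covariance
M, C, N = joint[:d, :d], joint[:d, d:], joint[d:, d:]
x = rng.normal(size=n)

# Update in the form of Equation 15.
mu_15 = nu + C @ np.linalg.solve(N, x - h)
Sigma_15 = M - C @ np.linalg.solve(N, C.T)

# Re-parameterization (Equation 16) followed by the information-form update.
H = C.T @ np.linalg.inv(M)
b = h - H @ nu
Lam = N - H @ C                                   # N - C^T M^{-1} C
Sigma_info = np.linalg.inv(np.linalg.inv(M) + H.T @ np.linalg.solve(Lam, H))
mu_info = Sigma_info @ (np.linalg.solve(M, nu) + H.T @ np.linalg.solve(Lam, x - b))
```

Both forms give the same posterior moments up to round-off. The first inverts an n×n matrix and the second inverts d×d matrices plus Λ_t, so which is cheaper depends on the relative sizes of d and n and on the structure of Λ_t.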

Equation 16 underlies several Gaussian approximations to Bayes’ rule, including the approximations used in the extended Kalman filter (EKF), the unscented Kalman filter (UKF: Julier and Uhlmann, 1997; Wan and van der Merwe, 2000; van der Merwe, 2004), and the statistically linearized filter (SLF: Gelb, 1974; Särkkä, 2013). The EKF, for instance, begins with the functions

h (z) = E (X_{t} | Z_{t} = z) and Λ (z) = V (X_{t} | Z_{t} = z),

which are assumed known, and takes $H_{t} = \dot{h} (ν_{t})$ , $b_{t} = h (ν_{t}) - H_{t} ν_{t}$ , and Λ_t = Λ(ν_t), where $\dot{h} (z)$ is the n×d matrix of partial derivatives of h evaluated at z. These choices of b_t and H_t correspond to a first-order Taylor approximation of h at the point ν_t. Like the Laplace approximation, the EKF is often criticized for being too local, because the gradient of h at a single point drives the approximation.
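The EKF choices of b_t and H_t can be sketched as follows. For brevity the Jacobian is approximated by central finite differences, which is an assumption of this example (an analytic Jacobian would normally be used), and the function h is hypothetical.

```python
import numpy as np

def ekf_linearize(h, nu, eps=1e-6):
    """Return (b_t, H_t) from a first-order Taylor expansion of h at nu."""
    n, d = h(nu).shape[0], nu.shape[0]
    H = np.zeros((n, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        H[:, j] = (h(nu + step) - h(nu - step)) / (2 * eps)   # j-th Jacobian column
    b = h(nu) - H @ nu
    return b, H

# Hypothetical observation mean with d = 2 and n = 3.
h = lambda z: np.array([z[0] ** 2, np.sin(z[1]), z[0] * z[1]])
nu = np.array([1.0, 0.5])
b, H = ekf_linearize(h, nu)
```

By construction b + H ν equals h(ν), so the linear surrogate agrees with h at the predicted mean and is only trustworthy in its neighborhood.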

The unscented Kalman filter (UKF) employs the eponymous transform to propagate weighted, deterministically-chosen points through a nonlinear transformation and recover estimates for h_t, N_t, and C_t from Equation 15. The estimates for all three parameters prove exact for linear transformations of Gaussians but inexact for general higher order polynomials (Särkkä, 2013), so we consider this a linearization method. Variations on this approach, collectively called sigma-point filters (van der Merwe, 2004), include the central difference Kalman filter (CDKF: Ito and Xiong, 2000; Nørgaard et al., 2000), the Gauss–Hermite Kalman Filter, the Quadrature Kalman filter (Ito, 2000; Ito and Xiong, 2000) and the Cubature Kalman filter (Arasaratnam et al., 2007; Arasaratnam and Haykin, 2009).

The SLF is a related, but more global, approximation for the same observation model. It selects b_t and H_t by minimizing the difference between the true observation model $X_{t} = h (Z_{t}) + ϵ_{t}$ and a linear approximation $X_{t} \approx a_{t} + B_{t} Z_{t} + ϵ_{t}$ , where Z_t is drawn from the current, approximate, predicted distribution, and then setting b_t = a_t and H_t = B_t. For instance, a_t and B_t can be chosen to minimize

\int {‖ h (z_{t}) - (a_{t} + B_{t} z_{t}) ‖}^{2} η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t},

where || · || is the usual Euclidean norm in $ℝ^{n}$ . Defining ${\bar{h}}_{t} = \int h (z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t}$ and ${\bar{H}}_{t} = \int (h (z_{t}) - {\bar{h}}_{t}) {(z_{t} - ν_{t})}^{⊤} η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t}$ , the solution is $B_{t} = {\bar{H}}_{t} M_{t}^{- 1}$ , and $a_{t} = {\bar{h}}_{t} - B_{t} ν_{t}$ , again with Λ_t = Λ. Like the EKF, this version of the SLF is best suited for additive, Gaussian noise models, but it further requires that the integrals defining ${\bar{h}}_{t}$ and ${\bar{H}}_{t}$ can be efficiently computed or easily approximated.
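For scalar z_t and scalar h, the two integrals can be evaluated with Gauss–Hermite quadrature, as in this minimal sketch. The function name `slf_linearize_1d`, the quadrature degree, and the quadratic h are assumptions for illustration.

```python
import numpy as np

def slf_linearize_1d(h, nu, M, degree=20):
    """Statistical linearization of a scalar function h under N(nu, M)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(degree)
    weights = weights / np.sqrt(2 * np.pi)        # weights now sum to 1
    z = nu + np.sqrt(M) * nodes                   # quadrature points for N(nu, M)
    h_bar = np.sum(weights * h(z))                # E[h(Z)]
    H_bar = np.sum(weights * (h(z) - h_bar) * (z - nu))   # Cov(h(Z), Z)
    B = H_bar / M
    a = h_bar - B * nu
    return a, B

# Hypothetical example: h(z) = z^2 with predicted distribution N(1, 0.25).
a, B = slf_linearize_1d(lambda z: z ** 2, nu=1.0, M=0.25)
```

For this quadratic h the quadrature is exact, giving B = 2ν = 2 and a = ν² + M − 2ν² = −0.75. In this particular case the slope coincides with the EKF slope at ν, but the intercept differs because the SLF averages h over the whole predicted distribution.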

The UKF, the SLF, and many related techniques improve upon some of the deficiencies of the EKF. Nevertheless, these methods tend to perform poorly when the conditional distribution of X_t given Z_t cannot be well-approximated as Gaussian. The examples in Sections 4.2 and 4.3 below illustrate models where linearization proves completely ineffectual, as $h (z) = E (X_{t} | Z_{t} = z) = 0$ for all z in these examples, even though the G-ADF and the DKF work well.

2.5. Nonlinear state dynamics

As described in the previous section, filtering can be conceptually separated into two steps. The first step uses the state dynamics to transition from Z_t−1|X_1:t−1 to Z_t|X_1:t−1 via Equation 2a and the second step uses Bayes’ rule to update Z_t|X_1:t−1 into Z_t|X_1:t via Equation 2b. In this paper, difficulties with the first step are removed by assuming linear, Gaussian, state dynamics (Equation 3). There are, however, a variety of approximation methods for more complicated state dynamics, including methods that approximate p(z_t|x_1:t−1) as a Gaussian. Any such Gaussian method could be easily combined with the DKF approximation, which relates to Bayes’ rule in the second step. In particular, given the approximation

p (z_{t} | x_{1 : t - 1}) \approx η_{d} (z_{t}; ν_{t}, M_{t}),

we simply use these values of ν_t and M_t in the DKF algorithm (Equation 9) or the robust DKF algorithm (Equation 10), instead of computing them in the first two lines of these algorithms. In this paper, we do not explore in depth this generalization to nonlinear state dynamics, although we do provide a proof of concept example in Section 4.4 below.

There is a vast literature on more general approximation algorithms for Bayesian filtering (Särkkä, 2013; Chen, 2003). Monte Carlo integration (Metropolis and Ulam, 1949) can almost always be used. Such approaches are called sequential Monte Carlo or particle filtering and include sequential importance sampling and sequential importance resampling (Handschin and Mayne, 1969; Handschin, 1970; Gordon et al., 1993; Kitagawa, 1996; del Moral, 1996; Doucet et al., 2000; Cappé et al., 2005, 2007). These methods apply to all classes of models but tend to be the most expensive to compute online and suffer from the curse of dimensionality (Daum and Huang, 2003). Alternate sampling strategies (see, e.g., Chen, 2003; Liu, 2008) can be used to improve filter performance, including: acceptance-rejection sampling (Handschin and Mayne, 1969), stratified sampling (Douc and Cappé, 2005), hybrid MC (Choo and Fleet, 2001), and quasi-MC (Gerber and Chopin, 2015). There are also ensemble versions of the Kalman filter that are used to propagate the covariance matrix in high dimensions including the ensemble Kalman filter (enKF: Evensen, 1994) and the ensemble transform Kalman filter (ETKF: Bishop et al., 2001; Majumdar et al., 2002), along with versions that produce local, parallelizable approximations for covariance (Ott et al., 2004; Hunt et al., 2007).

It may be possible to usefully combine the DKF approximation with some of these more advanced filtering techniques. The key approximation in the DKF is

p (x_{t} | z_{t}) = p (x_{t}) \frac{p (z_{t} | x_{t})}{p (z_{t})} \approx κ (x_{t}) \frac{η_{d} (z_{t}; f (x_{t}), Q (x_{t}))}{η_{d} (z_{t}; 0, S)} .

(17)

This approximation could, in principle, be substituted for the likelihood p(x_t|z_t) in any filtering algorithm, including particle filters, which incorporate the likelihood into the particle weights. The normalizing term κ(x_t) from Equation 17 will generally cancel, since the final posterior distribution p(z_t|x_1:t) is invariant to terms depending only on x_1:t. The advantage of Equation 17 is that f(·), Q(·), and S, might be easier to learn from data than the full conditional density p(x_t|z_t). For complex state dynamics, it is worth noting that the denominator η_d(z_t;0,S) will no longer precisely correspond to p(z_t) but will also be an approximation. If the Gaussian approximations for p(z_t|x_t) and p(z_t) are learned separately, some care may need to be taken to ensure the resulting approximation to p(x_t|z_t) remains a good one. One strategy might be to learn a Gaussian-shaped approximation to the density ratio p(z_t|x_t)/p(z_t), as a function of z_t (Sugiyama et al., 2012). Another strategy might be to use the robust DKF approximation as in Section 2.3, which simply drops the denominator in Equation 17. In future work, we plan to explore these and other approaches that might allow a DKF-style approximation to be incorporated into more general filtering models.
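To make the particle-filter suggestion concrete, the following sketch computes normalized importance weights from the ratio in Equation 17 for a scalar state. It is only an illustration under stated assumptions: the helper names `gauss_pdf` and `dkf_particle_weights`, the particle locations, and the values of f(x_t), Q(x_t), and S are hypothetical, and the resampling and propagation steps of a full particle filter are omitted.

```python
import math


def gauss_pdf(z, mean, var):
    """Density of a scalar Gaussian N(mean, var) evaluated at z."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def dkf_particle_weights(particles, f_x, Q_x, S):
    """Normalized weights proportional to eta(z; f(x), Q(x)) / eta(z; 0, S)."""
    raw = [gauss_pdf(z, f_x, Q_x) / gauss_pdf(z, 0.0, S) for z in particles]
    total = sum(raw)
    # Any factor depending only on x_t, such as kappa(x_t), cancels here.
    return [w / total for w in raw]


particles = [-1.0, 0.0, 0.5, 1.0, 2.0]
weights = dkf_particle_weights(particles, f_x=0.6, Q_x=0.25, S=1.0)
```

The normalization step is where κ(x_t) drops out, which is why it never needs to be evaluated.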

3. Learning the DKF

The parameters in the DKF are A, Γ, f(·), and Q(·). (S is specified from A and Γ using the stationarity assumption.) In many problems, some or all of these parameters might be unknown or not easily computable. In this section we discuss some strategies for learning or approximating the parameters in the situation where fully supervised training data is available, meaning that we have a sequence of (Z_t,X_t) pairs assumed to be sampled from the underlying Bayesian network in Equation 1 and denoted $(z_{1}^{'}, x_{1}^{'}), \dots, (z_{m}^{'}, x_{m}^{'})$ . This training data might be real data, or it might be simulated from a known generative model for which the parameters, particularly f and Q, are not easily computable.

We use $\hat{A}$ , $\hat{Γ}$ , $\hat{f}$ , and $\hat{Q}$ to denote the respective learned parameters. We only consider the situation where the parameters are learned from training data and then fixed for subsequent filtering on a different sequence of observations. In particular, for filtering we simply replace each parameter with its corresponding estimate in the DKF algorithm in Equation 9. We do not consider a more fully Bayesian approach where parameter uncertainty is propagated through the filtering equations.

A and Γ are the parameters of a well-specified statistical model given by Equations 3a–3b. In the learning experiments below we learn them from $(z_{t - 1}^{'}, z_{t}^{'})$ pairs using only Equation 3b, which reduces to multiple linear regression and is a common approach when learning the parameters of a Kalman filter from fully observed training data (see, for example, Wu et al., 2002).
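The regression step can be sketched in the scalar case as follows. The example simulates a stationary sequence with made-up values A = 0.9 and S = 1, then recovers A by least squares without an intercept and Γ as the mean squared residual. The function name `fit_linear_dynamics` and the simulated data are assumptions for illustration; in higher dimensions the same idea becomes multiple linear regression.

```python
import random


def fit_linear_dynamics(z):
    """Least-squares fit of z_t = A z_{t-1} + noise for a scalar sequence."""
    prev, curr = z[:-1], z[1:]
    A_hat = sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)
    resid = [c - A_hat * p for p, c in zip(prev, curr)]
    Gamma_hat = sum(r * r for r in resid) / len(resid)
    return A_hat, Gamma_hat


# Simulated training states (hypothetical parameter values).
rng = random.Random(0)
A_true, S = 0.9, 1.0
Gamma_true = S - A_true * S * A_true  # stationarity: S = A S A^T + Gamma
z = [rng.gauss(0.0, S ** 0.5)]
for _ in range(20000):
    z.append(A_true * z[-1] + rng.gauss(0.0, Gamma_true ** 0.5))

A_hat, Gamma_hat = fit_linear_dynamics(z)
```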

The parameters f and Q are more unusual, since they are not uniquely defined by the model, but are introduced via a Gaussian approximation in Equation 4. One possibility, and the one we focus on here, is to define f and Q via Equation 5 and then learn them directly from training data as

\hat{f} (x) \approx f (x) = E (Z_{t} | X_{t} = x) and \hat{Q} (x) \approx Q (x) = V (Z_{t} | X_{t} = x) .

(18)

Using Equation 18, we learn f and Q from $(z_{t}^{'}, x_{t}^{'})$ pairs ignoring the overall temporal structure of the data, which reduces to a standard nonlinear regression problem with heteroscedastic variance. The conditional mean f can be learned using any number of off-the-shelf regression tools and then Q can be learned from the residuals, ideally using a held-out portion of the training data. We think that the ability to easily incorporate off-the-shelf discriminative learning tools into a closed-form filtering equation is one of the most exciting and useful aspects of this approach.

In the experiments below, we compare three standard nonlinear regression methods for learning f: Nadaraya-Watson (NW) kernel regression, neural network (NN) regression, and Gaussian process (GP) regression. Details are in the Appendix. While we have found that these methods work well with the DKF framework, one could readily use any arbitrary regression model.

For learning Q, we first define R_t = Z_t − f(X_t) and ${\hat{R}}_{t} = Z_{t} - \hat{f} (X_{t})$ , so that

Q (x) = V (Z_{t} | X_{t} = x) = E (R_{t} R_{t}^{⊤} | X_{t} = x) \approx E ({\hat{R}}_{t} {\hat{R}}_{t}^{⊤} | X_{t} = x) .

(19)

The final expression in Equation 19 is a conditional expectation and can in principle be learned with regression on $({\hat{R}}_{t} {\hat{R}}_{t}^{⊤}, X_{t})$ pairs. Learning Q in this way using off-the-shelf regression tools is more challenging because of the additional requirement that Q(x) be a valid covariance matrix. Since ${\hat{R}}_{t} {\hat{R}}_{t}^{⊤}$ is positive semidefinite, any regression estimator that is a weighted average of the training data with only nonnegative weights will also be positive semidefinite and, in most cases, positive definite. NW kernel regression constitutes one such method and we use it for learning Q in all of our examples below. Given a subset of the training set ${(z_{i}^{″}, x_{i}^{″})}_{i = 1}^{k}$ , distinct from the subset used to learn the function f, we define the residuals ${\hat{r}}_{i} = z_{i}^{″} - \hat{f} (x_{i}^{″})$ , and then learn Q using NW kernel regression via

\hat{Q} (x) = \frac{\sum_{i = 1}^{k} {\hat{r}}_{i} {\hat{r}}_{i}^{⊤} κ (x, x_{i}^{″})}{\sum_{i = 1}^{k} κ (x, x_{i}^{″})},

(20)

for a kernel $κ : X \times X \to [0, \infty)$ . Complete details are in the Appendix.
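Equation 20 can be sketched directly as a kernel-weighted average of residual outer products. The Gaussian (RBF) kernel, its bandwidth, and the toy inputs below are assumptions made for illustration, as are the function names; the kernel and bandwidth selection actually used are described in the Appendix.

```python
import math


def rbf_kernel(x, xi, bandwidth=1.0):
    """Nonnegative Gaussian kernel between two feature tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-0.5 * sq / bandwidth ** 2)


def nw_covariance(x, xs, residuals, bandwidth=1.0):
    """Nadaraya-Watson estimate of Q(x) from held-out residuals (Equation 20)."""
    d = len(residuals[0])
    num = [[0.0] * d for _ in range(d)]
    den = 0.0
    for xi, r in zip(xs, residuals):
        w = rbf_kernel(x, xi, bandwidth)
        den += w
        for a in range(d):
            for b in range(d):
                num[a][b] += w * r[a] * r[b]
    # den can underflow to zero if x is far from every training point.
    return [[v / den for v in row] for row in num]


xs = [(0.0,), (1.0,), (2.0,)]                      # held-out observations
residuals = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]   # z'' - f_hat(x'')
Q_hat = nw_covariance((1.0,), xs, residuals, bandwidth=0.5)
```

Since every weight is nonnegative, the result is a nonnegative combination of positive semidefinite matrices and is therefore positive semidefinite, as the text argues.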

4. Examples

In this section, we compare filter performance on both artificial models and on real neural data. Corresponding MATLAB code (and Python code for the LSTM comparison) is freely available online at https://github.com/burkh4rt/Discriminative-Kalman-Filter under the GNU General Public License v3.0 to encourage code use and adaptation. For timing comparisons, the code was run on a Mid-2018 MacBook Pro laptop with a 2.6 GHz Intel Core i7 processor using MATLAB v. 2019a and Python v. 3.6.8.

4.1. Kalman observation model

The stationary Kalman filter observation model is

p (x_{t} | z_{t}) = η_{n} (x_{t}; b + H z_{t}, Λ)

for observations in $X = ℝ^{n \times 1}$ and for fixed $b \in ℝ^{n \times 1}$ , $H \in ℝ^{n \times d}$ , and $Λ \in S_{n}$ . Defining f and Q via Equation 5 gives

Q (x) \equiv Q = {(S^{- 1} + H^{⊤} Λ^{- 1} H)}^{- 1} and f (x) = Q H^{⊤} Λ^{- 1} (x - b) .

It is straightforward to verify that the DKF in Equation 9 is exactly the well-known Kalman filter recursion. Hence, the DKF computes the exact posterior p(z_t|x_1:t) in this special case.
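The equivalence can also be checked numerically. The scalar sketch below runs a standard Kalman filter and a DKF side by side, with f and Q set by the formulas above. It assumes the DKF update of Equation 9 has the information form Σ_t = (Q⁻¹ + M_t⁻¹ − S⁻¹)⁻¹, μ_t = Σ_t(Q⁻¹f(x_t) + M_t⁻¹ν_t); the parameter values and observations are made up.

```python
def kalman_step(mu, sigma, x, A, Gamma, H, Lam, b):
    """Standard scalar Kalman predict and update."""
    nu, M = A * mu, A * sigma * A + Gamma
    K = M * H / (H * M * H + Lam)
    return nu + K * (x - b - H * nu), (1.0 - K * H) * M


def dkf_step(mu, sigma, x, A, Gamma, S, f, Q):
    """Scalar DKF predict and update (assumed form of Equation 9)."""
    nu, M = A * mu, A * sigma * A + Gamma
    sigma_new = 1.0 / (1.0 / Q(x) + 1.0 / M - 1.0 / S)
    mu_new = sigma_new * (f(x) / Q(x) + nu / M)
    return mu_new, sigma_new


S, A, H, Lam, b = 1.0, 0.9, 2.0, 0.5, 0.3
Gamma = S - A * S * A
Q_const = 1.0 / (1.0 / S + H * H / Lam)      # Q = (S^-1 + H' Lam^-1 H)^-1
f = lambda x: Q_const * H * (x - b) / Lam    # f(x) = Q H' Lam^-1 (x - b)
Q = lambda x: Q_const

kf = dkf = (0.0, S)
for x in [1.2, -0.4, 0.7, 2.1]:
    kf = kalman_step(*kf, x, A, Gamma, H, Lam, b)
    dkf = dkf_step(*dkf, x, A, Gamma, S, f, Q)
```

The two recursions agree because Q⁻¹ − S⁻¹ = H²/Λ, so the DKF precision reduces to the Kalman information-form update H²/Λ + 1/M_t.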

4.2. Kalman observation mixtures

This example and the next are designed to illustrate how the Gaussian approximation underlying the DKF is more similar in spirit to the G-ADF than to linearization approximations such as the Kalman filter, the EKF, and the UKF (Section 2.4). In particular, the specific observation model used in the simulation below is engineered so that the state Z_t and the observation X_t are uncorrelated (but not independent). Linearization methods are useless in this case, whereas the DKF is able to take advantage of the higher-order dependence, much like the G-ADF.

The observation model is a probabilistic mixture of Kalman observation models (Section 4.1), namely,

p (x_{t} | z_{t}) = \sum_{ℓ = 1}^{L} π_{ℓ} η_{n} (x_{t}; b_{ℓ} + H_{ℓ} z_{t}, Λ_{ℓ})

for a probability vector π = π_1:L, where each $b_{ℓ} \in ℝ^{n \times 1}$ , $H_{ℓ} \in ℝ^{n \times d}$ , and $Λ_{ℓ} \in S_{n}$ . At each time step, one of L possible Kalman observation models is randomly and independently selected according to π and then used to generate the observation. This model can be viewed as a special case of a switching state space model with independent switching (see Shumway and Stoffer, 1991; Ghahramani and Hinton, 2000). The integrals in Equation 13 can be efficiently computed for any choice of ν_t and M_t, including ν_t = 0 and M_t = S, so the G-ADF and the DKF can be computed exactly for this model (see Appendix 6.1 for details), although the DKF is much faster for large n, because it allows for more pre-computation. Figure 1 illustrates that the DKF is comparable to the G-ADF in terms of root mean squared error (RMSE) for a particular instance of this model, and it also shows that the computational savings of the DKF over a particle filter with similar accuracy can be dramatic, especially as n gets large.

Figure 1: This figure shows filtering performance on an instance of the model in Section 4.2 for various approximation algorithms as the observation dimension n increases. The hidden state dimension is d = 10, and the state model parameters are S = I_d, A = 0.95I_d − 0.05, and $Γ = S - A S A^{⊤}$ . The number of categories is L = 2, the category probabilities are π = (0.5, 0.5), and the Kalman parameters are $b_{1} = b_{2} = \bar{b} = 0$ , Λ₁ = I_n, Λ₂ = 5I_n, and H₂ = −H₁, so that $\bar{H} = 0$ ; see Equation 21. The entries of H₁ were generated as independent N(0,d⁻¹) using the Matlab 9.6 code rng(42,'twister'); H = randn(1000,10)/sqrt(10);. The data was generated for an observation dimension of 1000 and the plot shows filter performance using only the first n dimensions of X_t for selected n between 1 and 1000. Filter performance was measured using root mean squared error (RMSE, left panel) and computation time (s, right panel) on a single test sequence of length T = 10⁴. Because X_t and Z_t are uncorrelated, linearization methods (e.g., KF, EKF, and UKF) ignore X_t and always predict $Z_{t} \approx E (Z_{t}) = 0$ giving an RMSE of approximately 1 (black line) in this case. The accuracy of particle filtering increases with the number of particles at the expense of increased computation, and we show performance for different numbers of particles: 10¹, 10², 10³, 10⁴, 10⁵ (blue lines, ordered as expected). We also show RMSE for the optimal prediction using only X_t (as opposed to the entire history X_1:t), namely, $Z_{t} \approx E (Z_{t} | X_{t}) = f (X_{t})$ (dotted red line). (This serves to demonstrate the performance gain that filtering provides.) Finally, we caution that the model parameters have much more influence on the relative performance of the different Gaussian approximation methods when n is small than when n is large. The parameters in this model were chosen so that the DKF also performs well for small n, even though we only have guarantees about its performance in the large n setting.

Define $\bar{b} = \sum_{ℓ} π_{ℓ} b_{ℓ}$ and $\bar{H} = \sum_{ℓ} π_{ℓ} H_{ℓ}$ so that

E (X_{t} | Z_{t}) = \bar{b} + \bar{H} Z_{t} .

(21)

An interesting special case of this model is when $\bar{H} = 0$ , so that the mean of X_t given Z_t does not depend on Z_t, and, consequently, X_t and Z_t are uncorrelated. Information about the states is only found in higher-order moments of the observations. Algorithms that are designed around $E (X_{t} | Z_{t})$ , such as the Kalman filter, EKF, and UKF, are not useful when $\bar{H} = 0$ , illustrating the important difference between a Gaussian approximation for the observation model and the DKF approximation in Equation 4. The simulation in Figure 1 used $\bar{H} = 0$ , and the ineffectiveness of linearization techniques is easily seen.
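A small simulation makes the point tangible. The sketch below draws scalar states and observations from a two-component mixture with H₂ = −H₁, Λ₁ = 1, and Λ₂ = 5, then compares the sample correlation of (X_t, Z_t) with that of (X_t², Z_t²). The value H₁ = 2, the sample size, and the seed are arbitrary choices made for illustration.

```python
import random


def corr(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


rng = random.Random(1)
H1, Lam = 2.0, (1.0, 5.0)
zs, xs = [], []
for _ in range(20000):
    z = rng.gauss(0.0, 1.0)              # Z_t ~ N(0, S) with S = 1
    ell = rng.randrange(2)               # mixture label, pi = (0.5, 0.5)
    H = H1 if ell == 0 else -H1          # H_2 = -H_1, so H_bar = 0
    xs.append(H * z + rng.gauss(0.0, Lam[ell] ** 0.5))
    zs.append(z)

linear_corr = corr(xs, zs)                                   # close to zero
second_order_corr = corr([x * x for x in xs], [z * z for z in zs])
```

The first correlation is near zero while the second is clearly positive, which is the second-order dependence that a linearization around E(X_t | Z_t) cannot exploit.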

4.3. Independent Bernoulli mixtures

Here we describe a model where observations take values in {0, 1}ⁿ to further emphasize that our Gaussian approximation is in the state space, not in the observation space. Like the example in Section 4.2, this example is also engineered so that the states and observations are uncorrelated, rendering linearization-based methods ineffective (Section 2.4). Finally, the specific parameters of this example are chosen to have the peculiar property that the DKF is efficiently computable, whereas the G-ADF is not (insofar as we can tell).

The observation model is a probabilistic mixture of conditionally independent Bernoulli random variables, namely,

p (x_{t} | z_{t}) = \sum_{ℓ = 1}^{L} π_{ℓ} \prod_{i = 1}^{n} {g_{ℓ i} (z_{t})}^{x_{t i}} {(1 - g_{ℓ i} (z_{t}))}^{1 - x_{t i}},

for a probability vector π = π_1:L. For each $ℓ = 1, \dots, L$ and i = 1,…,n, the functions $g_{ℓ i} : ℝ^{d \times 1} \to (0, 1)$ are defined by

g_{ℓ i} (z_{t}) = α_{ℓ i} 𝟙 {z_{t d_{i}} < γ_{i}} + β_{ℓ i} 𝟙 {z_{t d_{i}} \geq γ_{i}},

where each $γ_{i} \in ℝ$ , $α_{ℓ i}$ , $β_{ℓ i} \in (0, 1)$ , $d_{i} \in \{1, \dots, d\}$ and where z_tk indicates the kth coordinate of z_t. The ith coordinate of X_t depends on Z_t only through the d_ith coordinate of Z_t, and the probability distribution of X_ti is different depending on whether $Z_{t d_{i}} < γ_{i}$ or not. Each of the L components of the mixture changes the probability distribution of X_ti, via $α_{ℓ i}$ and $β_{ℓ i}$ , but it does not change the corresponding coordinate d_i or the change point γ_i.

For the state dynamics, we use S = I_d, which makes it possible to compute f(x_t) and Q(x_t) exactly; see Appendix 6.2. In general, however, the integrals in Equation 13 are not easily evaluated, so the G-ADF is not a practical approximation technique in this example. Figure 2 suggests that the DKF approximation performs well for a particular instance of this model, in the sense that the DKF’s RMSE is near or better than that of a particle filter with a large number of particles. The figure also shows that the computational savings over a particle filter with similar accuracy can be dramatic, especially as n gets large.

Figure 2: This figure shows filtering performance on an instance of the model in Section 4.3 for various approximation algorithms as the observation dimension n increases. The state model (Z_t) and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. (Using this many particles with higher n was too time consuming.) The number of categories is L = 2, the category probabilities are π = (0.5, 0.5), and for each i, α_1i = β_2i = 0.01 and α_2i = β_1i = 0.99, so that each ${\bar{g}}_{i} \equiv 0.5$ ; see Equation 22. The d_1:n were chosen as independent uniform{1,…,d}, and the γ_1:n were chosen as independent N(0, 1).

Define ${\bar{g}}_{i} = \sum_{ℓ} π_{ℓ} g_{ℓ i}$ , so that

E (X_{t i} | Z_{t}) = ℙ (X_{t i} = 1 | Z_{t}) = {\bar{g}}_{i} (Z_{t}) .

(22)

An interesting special case of this model is when ${\bar{g}}_{i}$ is constant for each i, so that the mean of X_t given Z_t does not depend on Z_t, and, consequently, X_t and Z_t are uncorrelated. As in the previous section, linearization approximations like the Kalman filter, EKF, and UKF are not useful when ${\bar{g}}_{i}$ is constant. Furthermore, when ${\bar{g}}_{i}$ is constant, then X_ti and Z_t are independent, i.e., individual coordinates of the observations carry no information about the states. Only the vector of observations X_t can be used for meaningful predictions of Z_t. The simulation in Figure 2 used ${\bar{g}}_{i} \equiv 0.5$ for all i, so that each coordinate of the observations is independent of the state.
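The following sketch uses the parameter values from the Figure 2 caption to check that each marginal success probability is 0.5 on both sides of the change point, while the probability that two coordinates agree still depends on the state. The common change point γ = 0 and the function names are assumptions for illustration only.

```python
def g(z_coord, gamma, alpha, beta):
    """Component success probability: alpha below the change point, else beta."""
    return alpha if z_coord < gamma else beta


pi = (0.5, 0.5)
alpha = (0.01, 0.99)   # alpha_1i, alpha_2i
beta = (0.99, 0.01)    # beta_1i, beta_2i
gamma = 0.0            # hypothetical shared change point


def g_bar(z_coord):
    """Marginal P(X_ti = 1 | z): the mixture average of the components."""
    return sum(p * g(z_coord, gamma, a, b) for p, a, b in zip(pi, alpha, beta))


def prob_agree(z1, z2):
    """P(X_t1 == X_t2 | z) for two coordinates sharing one mixture label."""
    total = 0.0
    for p, a, b in zip(pi, alpha, beta):
        g1, g2 = g(z1, gamma, a, b), g(z2, gamma, a, b)
        total += p * (g1 * g2 + (1.0 - g1) * (1.0 - g2))
    return total


same_side = prob_agree(-1.0, -1.0)
opposite_sides = prob_agree(-1.0, 1.0)
```

Each coordinate alone is a fair coin regardless of the state, but two coordinates agree with high probability when their state coordinates fall on the same side of the change point and disagree otherwise, so the joint vector remains informative.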

4.4. Kalman observation mixtures with nonlinear state dynamics

This example illustrates how the DKF approximation can be combined with other filtering approximations for use with nonlinear state dynamics; see Section 2.5. We include it here as a proof of concept and leave for future work a more thorough exploration of when the DKF approximation is useful for filtering with nonlinear state dynamics. We use the same mixture of Kalman observation models from Section 4.2 but we modify the state dynamics in Equation 3 as follows. Define the 2×2 rotation matrix $R (θ) = (\begin{matrix} \sin θ & \cos θ \\ - \cos θ & \sin θ \end{matrix})$ and for even d define the d×d rotation matrix R_d(θ) to be the block-diagonal matrix with R(θ) repeated along the diagonal, namely,

R_{d} (θ) = (\begin{array}{cccc} R (θ) & 0 & \dots & 0 \\ 0 & R (θ) & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & R (θ) \end{array}) .

Define the function $a : ℝ^{d \times 1} \to ℝ^{d \times 1}$ via a(z) = AR_d(|z|)z, where | · | denotes the Euclidean norm. The new state dynamics are

p (z_{0}) = η_{d} (z_{0}; 0, S),

p (z_{t} | z_{t - 1}) = η_{d} (z_{t}; a (z_{t - 1}), Γ),

for t = 1, 2,…, where $S = A S A^{⊤} + Γ$ . These are the same dynamics as before except that the conditional mean of Z_t given Z_t−1 has changed from the linear function AZ_t−1 to the nonlinear function a(Z_t−1). In particular, before being multiplied by A, the state vector is rotated by an amount that depends on its length. This type of nonlinearity was chosen because when S = I_d (as in our examples), then Z_t remains marginally Gaussian, which is an important part of the DKF approximation.
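The nonlinear conditional mean a(z) can be sketched as follows, using the 2×2 block R(θ) exactly as written above. For simplicity the sketch assumes A is a scalar multiple of the identity (0.95 I_d), which differs from the A used in the figures; the function names are hypothetical.

```python
import math


def rotate_pairs(z, theta):
    """Apply the block-diagonal R_d(theta) to an even-length vector z."""
    s, c = math.sin(theta), math.cos(theta)
    out = []
    for k in range(0, len(z), 2):
        u, v = z[k], z[k + 1]
        # R(theta) = [[sin, cos], [-cos, sin]] as defined in the text.
        out.extend([s * u + c * v, -c * u + s * v])
    return out


def a(z, scale=0.95):
    """a(z) = A R_d(|z|) z with the assumed choice A = scale * I_d."""
    norm = math.sqrt(sum(v * v for v in z))
    return [scale * v for v in rotate_pairs(z, norm)]


z_prev = [1.0, 0.0, 0.5, -0.5]
z_mean = a(z_prev)
```

Because R_d(θ) is orthogonal, the rotation leaves the length of the state unchanged, so under this choice of A the only change in length comes from the factor 0.95.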

We use an unscented Kalman filter (UKF) approximation for the state dynamics; i.e., we replaced ν_t and M_t in Equations 9 and 10 with the mean and covariance obtained from performing the unscented transform (Julier and Uhlmann, 1997). We used Matlab’s unscentedKalmanFilter implementation with alpha=1, beta=kappa=0. The UKF approximations of ν_t and M_t can also be substituted directly into the G-ADF used in Section 4.2.

Figure 3 shows filtering performance for a specific instance of this model and illustrates that, at least in this case, a DKF approximation for nonlinear, non-Gaussian observation models can be usefully combined with other approximations for nonlinear state dynamics, and that there is little loss of performance compared to the G-ADF.

4.5. Unknown observation model: Macaque reaching-task data

This example illustrates Bayesian filtering in a case where the observation model is unknown and must be learned from data. Flint et al. (2012) implanted a rhesus monkey with a 96-channel microelectrode array (Blackrock Microsystems LLC) over the arm area of its primary motor cortex (M1). The monkey was trained to move a manipulandum to acquire illuminated targets for a juice reward. While performing this task, the monkey’s neural spikes were recorded with a 128-channel acquisition system (Cerebus, Blackrock Microsystems LLC). The signal was sampled at 30 kHz, high-pass filtered at 300 Hz, and then thresholded and manually sorted into spikes offline. Walker and Kording (2013) continue to make this data publicly available as part of the Database for Reaching Experiments and Models (DREAM). We used data from Flint et al. (2012) and aggregated spike counts over 100ms bins. The first n = 10 principal component analysis (PCA) components of neural data became the observed variable X_t, and we used the d = 2 dimensional (horizontal and vertical) cursor velocity (lagged 50ms after the end of the spike count bin) as the latent variable Z_t.

Tables 1 and 2 compare filtering performance using various learning algorithms and filtering methods. For learning the function $f : ℝ^{10} \to ℝ^{2}$ for the DKF, we experimented with Nadaraya-Watson (NW) kernel regression, neural network (NN) regression, and Gaussian process (GP) regression. In each case we learned the function $Q : ℝ^{10} \to S_{2}$ using NW kernel regression from the approximate residuals as in Equation 20. For the Kalman filter, parameters are learned in the usual manner via multivariate (linear) regression. For the EKF and UKF (see Section 2.4) we learned the conditional mean $h : ℝ^{2} \to ℝ^{10}$ defined by

h (z) = E (X_{t} | Z_{t} = z)

(23)

via neural network regression, and took the conditional covariance to be constant, namely, $Λ (z) = V (X_{t} | Z_{t} = z) \equiv Λ \in S_{10}$ , which we learned from the approximate residuals. Finally, we also experimented with a Long Short Term Memory (LSTM) recurrent neural network for predicting Z_t given X_1:t. In all cases, we used 5000 training points and a different 1000 testing points. More details about all of these methods are in Appendices 6.4–6.7.

Table 1:

This table compares the normalized RMSE (nRMSE) for various filtering methods on the Flint dataset from Section 4.5. The nRMSE is computed by dividing the RMSE by the root mean square of the observation vector, so that predicting identically zero would yield a nRMSE of 1. The top row shows the nRMSE of the Kalman filter. Each remaining row shows the percentage change in nRMSE relative to the Kalman filter, with methods ordered from best (top) to worst (bottom) average performance. Columns 1–6 refer to completely separate trials using new training and testing data. The final column gives average performance across the six trials.

	Trial 1	Trial 2	Trial 3	Trial 4	Trial 5	Trial 6	Avg.
Kalman	0.765	0.942	0.788	0.793	0.780	0.765	0.805

DKF-NW	−21%	−18%	−17%	−23%	−20%	−23%	−20%
DKF-GP	−21%	−19%	−15%	−20%	−18%	−20%	−19%
DKF-NN	−19%	−15%	−13%	−13%	−13%	−17%	−15%

LSTM	−15%	−19%	−16%	−13%	−16%	−11%	−15%

EKF	2%	24%	12%	18%	12%	3%	12%
UKF	2%	31%	18%	18%	15%	6%	15%


Table 2:

This table compares the mean absolute angular error (radians) for various filtering methods on the Flint dataset from Section 4.5. Because cursor speed is often adjustable in BCIs (Willett et al., 2019), this may provide a more informative measure of performance. See the caption for Table 1 for more details about the table arrangement. Note that 45° = π/4 ≈ 0.79 radians, so all of these methods have fairly substantial angular error over 100 ms prediction intervals. Chance performance would be π/2 ≈ 1.57 radians.

	Trial 1	Trial 2	Trial 3	Trial 4	Trial 5	Trial 6	Avg
Kalman	0.889	0.955	1.025	0.933	0.964	0.926	0.949

DKF-NW	−15%	−1%	−20%	−17%	−25%	−28%	−18%
DKF-GP	−11%	7%	−22%	−16%	−24%	−25%	−15%
DKF-NN	−7%	−2%	−17%	−16%	−21%	−23%	−14%

LSTM	−2%	−2%	−12%	−6%	−10%	−8%	−7%

UKF	0%	3%	−3%	−3%	−8%	−6%	−3%
EKF	4%	3%	−2%	−4%	−8%	−7%	−2%

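The two metrics reported in Tables 1 and 2 can be sketched as below for two-dimensional velocity vectors. This is one plausible reading rather than the exact evaluation code: the normalizer in `nrmse` is taken to be the root mean square of the true velocity vectors, an assumption chosen so that an identically zero prediction scores exactly 1, and the angle is computed with `atan2` of the cross and dot products.

```python
import math


def nrmse(pred, true):
    """RMSE of the predictions divided by the RMS of the true vectors."""
    num = sum((p - t) ** 2 for pv, tv in zip(pred, true) for p, t in zip(pv, tv))
    den = sum(t ** 2 for tv in true for t in tv)
    return math.sqrt(num / den)


def mean_abs_angular_error(pred, true):
    """Mean absolute angle (radians) between predicted and true 2-D vectors."""
    errs = []
    for (px, py), (tx, ty) in zip(pred, true):
        cross = px * ty - py * tx
        dot = px * tx + py * ty
        # The angle is undefined for a zero vector; atan2(0, 0) returns 0 here.
        errs.append(abs(math.atan2(cross, dot)))
    return sum(errs) / len(errs)


true = [(1.0, 0.0), (0.0, 2.0)]
zero_pred = [(0.0, 0.0), (0.0, 0.0)]
orthogonal_pred = [(0.0, 1.0), (-3.0, 0.0)]

baseline = nrmse(zero_pred, true)
angle = mean_abs_angular_error(orthogonal_pred, true)
```

The angular error ignores the magnitude of the prediction, which is why it is less sensitive to an overall speed gain than the nRMSE.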

The DKF using NW kernel regression was the best method among the ones that we tried, and all versions of the DKF were near the top in performance. Under the Mean Absolute Angular Error (MAAE) metric (Simeral et al., 2011), each version of the DKF outperformed prediction using the corresponding $\hat{f}$ , illustrating the benefit of filtering to combine information from both past and present observations. The EKF and UKF performed poorly. We do not know the degree to which poor performance is a result of errors introduced by the EKF and UKF approximations or a result of errors introduced from learning the function h in Equation 23. All versions of the DKF outperformed the LSTM that we used. The LSTM and its variants require manually selecting a neural network architecture and several tuning parameters. This is often done by experts through trial and error. While we suspect that there is some combination of architecture and tuning parameters that would allow the LSTM to meet or exceed the DKF performance, automating this process of searching through network architecture remains an area of active research requiring extensive computational resources (Zoph and Le, 2017; Real et al., 2017).

4.6. Closed-loop decoding in a person with paralysis

Neural decoding for closed-loop brain-computer interfaces (BCIs) provided the motivating application for the development of the DKF. BCIs use neural measurements from the brain to enable voluntary control of external devices (Wolpaw et al., 2002; Hochberg and Donoghue, 2006; Brandman et al., 2017). Intracortical BCI systems (iBCIs) have been shown to provide users with paralysis the ability to control computer cursors (Pandarinath et al., 2015; Jarosiewicz et al., 2015; Nuyujukian et al., 2018), robotic arms (Hochberg et al., 2012; Collinger et al., 2013), and functional electrical stimulation systems (Bouton et al., 2016; Ajiboye et al., 2017) with the real-time decoded neural activity generated during attempted movement. State-of-the-art decoding approaches have been based on the Kalman filter (Pandarinath et al., 2017; Jarosiewicz et al., 2015; Gilja et al., 2015), with observed neural features and latent motor intention used to move external devices. To construct a supervised training set, motor intentions are inferred as vectors from the instantaneous cursor position to the target position Z_t (Brandman et al., 2018b).

The DKF is a natural choice for closed-loop neural decoding using iBCIs for a few reasons. First, evidence suggests that neurons have very complex behavior. Neurons in the motor cortex have been shown to encode direction of movement (Georgopoulos et al., 1988), velocity (Schwartz, 1994), acceleration (Paninski et al., 2004), muscle activation (Lemon, 2008; Pohlmeyer et al., 2007), proprioception (Bensmaia and Miller, 2014), visual information related to the task (Rao and Donoghue, 2014) and preparatory activity (Churchland et al., 2012). Hence, iBCI-related recordings are highly complex and non-linear (Vargas-Irwin et al., 2015). Moving away from the linear constraints of the Kalman filter could potentially capture more of the inherent complexity of the signals, resulting in higher end-effector control for the user.

Second, evidence suggests that the quality of control directly relates to the rate at which the decoding systems perform real-time decoding. Modern iBCI systems update velocity estimates on the order of 20ms (Jarosiewicz et al., 2015) or even 1ms (Pandarinath et al., 2015). Thus, any potential filtering technique must be computationally feasible to implement for real-time use.

Third, over the past decades, new technologies have allowed neuroscientists to simultaneously record from increasingly large numbers of neurons. In fact, the number of observed brain signals has been growing exponentially (Stevenson and Kording, 2011). By contrast, the dimensionality of the underlying device being controlled remains small, generally not exceeding ten dimensions (Wodlinger et al., 2015; Vargas-Irwin et al., 2010).

We previously reported how three people with spinal cord injuries used the DKF with GP regression to rapidly gain closed-loop neural control (Brandman et al., 2018b,a). Here, as an additional proof of concept, we present data from a person with amyotrophic lateral sclerosis (participant T9) using the DKF. In these research sessions, the observations constitute neural data collected from an electrode array surgically implanted in the participant’s brain and the hidden states represent the intended cursor velocity. The DKF prediction of intended cursor velocity is used at each time step to move the cursor. For learning the DKF parameters, training data is collected during an initial calibration phase in which the participant is instructed to attempt to move the cursor to various target locations, and the intended velocity at each time step is assumed to be pointing from the current cursor position to the instructed target. GP regression was used to learn f, and, for computational efficiency, Q was assumed to be constant and set as the covariance of the residuals. The participant’s performance using an out-of-the-box DKF was comparable to state-of-the-art decoders based on modifications of the Kalman filter designed specifically for the BrainGate2 clinical trials.

4.6.1. Participant

The participant in this study was T9, a 52 year-old right-handed male with paralysis from late stage amyotrophic lateral sclerosis (ALSFRS-R score = 7; see Cedarbaum et al. (1999) for a detailed explanation of this metric). T9 underwent surgical placement of two 96-channel intracortical silicon microelectrode arrays (Maynard et al., 1997) (1.5-mm electrode length, Blackrock Microsystems, Salt Lake City, UT) in the motor cortex as previously described (Kim et al., 2008; Simeral et al., 2011). Data was used from trial (post-implant) days 292 and 293.

4.6.2. Signal acquisition

Raw neural signals for each channel (electrode) were sampled at 30kHz using the NeuroPort System (Blackrock Microsystems, Salt Lake City, UT). Further signal processing and neural decoding were performed using the xPC target real-time operating system (Mathworks, Natick, MA). Raw signals were downsampled to 15kHz for decoding and de-noised by subtracting an instantaneous common average reference (Gilja et al., 2015; Jarosiewicz et al., 2015) using 40 of the 96 channels on each array with the lowest root-mean-square value (selected based on their baseline activity during a one minute reference block run at the start of each session). The de-noised signal was band-pass filtered between 250 Hz and 5000 Hz using an 8th order non-causal Butterworth filter (Masse et al., 2015). Spike events were triggered by crossing a threshold set at 3.5x the root mean square amplitude of each channel, as determined by data from the reference block. The neural feature used was the total power in the band-pass filtered signal (Jarosiewicz et al., 2015; Brandman et al., 2018b). Neural features were binned in 20ms non-overlapping increments for decoding. We used the top 40 features ranked by signal-to-noise-ratio (Malik et al., 2015).
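Two of the steps above, common average referencing and binning of signal power, can be sketched in a few lines. This is a simplified illustration, not the real-time pipeline: the band-pass filter is omitted, power is taken to be the sum of squared samples within a bin (an assumption), and the function names and toy inputs are hypothetical.

```python
def common_average_reference(samples, ref_channels):
    """Subtract, at each time step, the mean over the reference channels.

    samples[t][c] holds the value of channel c at time step t.
    """
    out = []
    for row in samples:
        ref = sum(row[c] for c in ref_channels) / len(ref_channels)
        out.append([v - ref for v in row])
    return out


def binned_power(signal, bin_size):
    """Sum of squared samples in non-overlapping bins (partial last bin dropped)."""
    n_bins = len(signal) // bin_size
    return [
        sum(v * v for v in signal[k * bin_size:(k + 1) * bin_size])
        for k in range(n_bins)
    ]


FS_HZ = 15000                       # sampling rate after downsampling
BIN_SAMPLES = FS_HZ * 20 // 1000    # samples per 20 ms bin

referenced = common_average_reference([[1.0, 2.0, 3.0]], ref_channels=[0, 1, 2])
power = binned_power([1.0, 2.0, 3.0, 4.0, 5.0], bin_size=2)
```

At 15 kHz each 20 ms bin contains 300 samples, so every feature value summarizes 300 filtered samples from one channel.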

4.6.3. Decoder calibration

Decoder calibration was performed using the standard Radial-8 task (Simeral et al., 2011; Gilja et al., 2015) using custom-built software running in Matlab. An LCD monitor was placed 55–60 cm from T9 at a comfortable angle and orientation. Targets (size = 2.4 cm, visual angle = 2.5°) were presented sequentially in a pseudo-random order, alternating between one of eight radially distributed targets and a center target (radial target distance from center = 12.1 cm, visual angle = 12.6°). Successful target acquisition required the user to place the cursor (size = 1.5cm, visual angle = 1.6°) within the target’s diameter for 300ms, before a pre-determined timeout of 15 seconds. Target timeouts resulted in the cursor moving directly to the intended target, with immediate presentation of the next target.

Calibration began with two minutes of open-loop presentation of a cursor; that is, the cursor moved automatically to pseudorandomly presented targets in a straight path. During this time, T9 was instructed to “imagine” or “attempt” to move the computer cursor as if he had control of it. After two minutes, initial hyperparameters for the GP were learned. Next, T9 acquired targets for three minutes with 80% attenuation of the component of the decoded vector perpendicular to the vector between the cursor and the target (Jarosiewicz et al., 2013; Velliste et al., 2008), in order to assist with target acquisition. GP hyperparameters were then recomputed with all of the available data. The Radial-8 task was repeated two more times with the attenuated components at 50% and 20%, for a total of 11 minutes of calibration data collected. We used a total of 3000 data points randomly subsampled from the 11 minutes of collected data, using all 192 neural features (96 features per array, two arrays).
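The assisted-control step can be illustrated with a short sketch that splits a decoded 2-D velocity into components parallel and perpendicular to the cursor-to-target direction and shrinks only the perpendicular part. It assumes that the stated percentages (80%, 50%, 20%) are the fractions of the perpendicular component that are removed; the function name and example vectors are hypothetical.

```python
def attenuate_error(v, to_target, attenuation):
    """Shrink the component of v perpendicular to the cursor-to-target vector.

    attenuation = 0.8 removes 80% of the perpendicular component and leaves
    the parallel component unchanged. to_target must be nonzero.
    """
    tx, ty = to_target
    norm = (tx * tx + ty * ty) ** 0.5
    ux, uy = tx / norm, ty / norm                 # unit vector toward the target
    par = v[0] * ux + v[1] * uy                   # signed parallel magnitude
    perp_x, perp_y = v[0] - par * ux, v[1] - par * uy
    keep = 1.0 - attenuation
    return (par * ux + keep * perp_x, par * uy + keep * perp_y)


assisted = attenuate_error(v=(1.0, 1.0), to_target=(2.0, 0.0), attenuation=0.8)
```

With attenuation set to 0 the decoded vector passes through unchanged, which corresponds to fully unassisted closed-loop control.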

4.6.4. Performance measurement

We quantified the performance of the DKF decoder with the mFitts1 task (Gilja et al., 2015; Simeral et al., 2011). Under the Fitts model (Fitts, 1954), movement time (MT) varies linearly with the index of difficulty (ID) as

MT = a \cdot ID + b

(24)

where the parameters a and b depend on the input device. Parameters are estimated using linear regression on observed (ID, MT) pairs for each input method. These estimates are then used to evaluate filter performance.

A single target was presented on the screen in a pseudorandom location, with one of three pseudorandomly fixed diameters (size = 1.6cm, 3.5cm, and 5.6cm, visual angles 1.7°, 3.7°, and 5.8°). Targets were acquired by having the cursor contact the target for 500ms, within a timeout of 10 seconds. For the mFitts1 task, the Index of Difficulty for each trial was calculated as follows:

ID = \log_{2} [\frac{D}{W} + 1]

where D is the distance from the cursor’s start position to the goal, and W is the sum of the target’s diameter and cursor’s radius. Hence, $\frac{D}{W}$ reflects a measure of difficulty for acquiring targets.
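As an illustration of how the Fitts parameters in Equation 24 can be estimated, the following sketch computes the index of difficulty for a trial and fits a and b by ordinary least squares. The function names and the example values in the comments are hypothetical; this is not the analysis code used in the study.

```python
import math

def index_of_difficulty(distance, width):
    """ID = log2(D / W + 1), with D and W in the same units."""
    return math.log2(distance / width + 1.0)

def fit_fitts(ids, movement_times):
    """Least-squares fit of MT = a * ID + b; returns (a, b)."""
    n = len(ids)
    mean_id = sum(ids) / n
    mean_mt = sum(movement_times) / n
    sxx = sum((x - mean_id) ** 2 for x in ids)
    sxy = sum((x - mean_id) * (y - mean_mt)
              for x, y in zip(ids, movement_times))
    a = sxy / sxx             # slope: added time per unit of difficulty
    b = mean_mt - a * mean_id  # intercept
    return a, b

# Example with made-up (ID, MT) pairs lying on the line MT = ID + 1.5:
# fit_fitts([1.0, 2.0, 3.0], [2.5, 3.5, 4.5])
```

A smaller fitted slope a means movement time grows more slowly as targets become harder, which is how the slope is interpreted in the comparison of decoders.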

4.6.5. Results

T9 acquired 98% of targets presented over two research sessions (N = 299) with the mFitts1 task. The Fitts regression parameters were comparable to the previously described performance by different participants (T6 and T7) using the ReFIT decoder (Gilja et al., 2015) (Fig. 4.6.5, slope = 1.08 ± 0.06, p < 1.2 × 10⁻³⁰, intercept = 1.6 ± 1.3, p < 2.2 × 10⁻⁴¹).

5. Discussion

The DKF is a novel filtering method that should prove a helpful addition to the filtering toolbox. It provides a fast, analytic approximation for models with linear, Gaussian dynamics, but nonlinear, non-Gaussian observations. The approximations underlying the DKF tend to improve as the dimensionality of the observation space increases relative to the dimensionality of the state space. For known models, the DKF is quite similar in nature to the G-ADF; however, when models must be learned from training data as is the case for many practical applications, the G-ADF entails integrals which require approximation and does not provide a closed-form update. In comparison to Laplace or saddle-point approximations, the DKF provides a more global approximation to the true filtering distribution. As we demonstrate in our examples, there are many families of state space models that render the EKF and UKF ineffective but for which the DKF performs well.

In applications where the model must be learned from supervised training data prior to filtering, off-the-shelf nonlinear and/or nonparametric regression tools can be used to learn the conditional mean and variance for the DKF directly, avoiding the more complicated task of learning the complete observation model p(x_t|z_t). Using the DKF in this way appears to be novel within the large literature on state space models. Most approaches either learn a fully generative model and invert it for filtering (this includes the use of discriminative methods for training filters derived from generative models (Abbeel et al., 2005; Hess and Fern, 2009)) or learn a fully discriminative model that directly predicts states from the sequence of observations. The DKF allows a generative model for the state dynamics to be combined in a principled way with a discriminative model for predicting the states from the observations at individual time steps. We think that the ability to easily incorporate off-the-shelf discriminative learning tools into a closed-form filtering equation is one of the most exciting and useful aspects of this methodology.

Many promising opportunities exist to apply and extend the DKF. For example, using a Gaussian approximation for p(z_t|x_t) can permit a more principled approach to mitigating nonstationarities that occur in the measurement model. In neural decoding, a large change in the behavior of a single neuron that occurs between model training and filter use can result in significant performance degradation for the decoder. In the DKF framework with a GP regression model for p(z_t|x_t), one can select a kernel function that ignores large differences along any single dimension. Clinical results demonstrate that this modification allows the filter to be more robust to erratic firing patterns in an arbitrary single neuron. See Brandman et al. (2018a) for further details. It seems that this approach could be readily applied more generally to increase filter resilience to nonstationarities.

While the DKF assumes an approximately Gaussian posterior, for general filtering models there may also be ways to incorporate the underlying Gaussian approximation for p(z_t|x_t) to improve performance. Methods that preserve the full form of the filtering distribution, such as particle filters, could be combined with alternatively-specified measurement models, as in Equation 17, to create general purpose filters that are both more convenient to learn from data and use in filtering applications. The DKF marks a first step in this direction.

Supplementary Material

code to reproduce simulations discussed in manuscript

NIHMS1714316-supplement-code_to_reproduce_simulations_discussed_in_manuscript.zip^{(20.7MB, zip)}

LaTeX source

NIHMS1714316-supplement-LaTeX_source.tex^{(144.4KB, tex)}

Figure 4: On the left, we plot movement time vs. index of difficulty for T9 during the Radial-8 task. On the right, we compare Fitts metrics for the DKF to those for Kalman ReFit. In particular, the slope and intercept from the line of best fit on the left correspond to the yellow bars for slope and intercept on the right. Error bars correspond to a 95% confidence interval for each estimated parameter. Following the discussion in Section 4.6.4, lower values for the slope parameter (a in Equation 24) correspond to less of an increase in movement time for more difficult targets. Estimates for the intercept parameter correspond to b in Equation 24.

Acknowledgments

The authors would like to thank participant T9 and T9’s family, the anonymous reviewers and E. Crites for their thoughtful feedback on the manuscript, B. Travers and D. Rosler for administrative support, and C. Grant for clinical assistance. This work was supported by the National Institutes of Health: National Institute on Deafness and Other Communication Disorders - NIDCD (R01DC009899), Rehabilitation Research and Development Service, Department of Veterans Affairs (B6453R and N9228C); National Science Foundation (DMS1309004), National Institute of Health (IDeA P20GM103645, R01MH102840); Massachusetts General Hospital (MGH) - Deane Institute for Integrated Research on Atrial Fibrillation and Stroke; Joseph Martin Prize for Basic Research; the Executive Committee on Research (ECOR) of Massachusetts General Hospital; Canadian Institute of Health Research (336092); Killam Trust Award Foundation; and the Brown Institute of Brain Science. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, the Department of Veterans Affairs, or the United States Government.

Appendix

Section 6 covers technical details and section 7 includes a proof of the main theorem.

6. Technical details

This section provides the derivations used in Sections 4.2 and 4.3, along with some information on numerical stability and details for the discriminative learning methods employed in Section 4.5.

6.1. Kalman observation mixtures

For the model in Section 4.2 we provide analytic expressions for the integrals in Equation 13, which are needed for the G-ADF and the DKF (using ν_t = 0 and M_t = S for the DKF). Define

U_{t ℓ} = {(M_{t}^{- 1} + H_{ℓ}^{⊤} Λ_{ℓ}^{- 1} H_{ℓ})}^{- 1},

y_{t ℓ} = U_{t ℓ} (M_{t}^{- 1} ν_{t} + H_{ℓ}^{⊤} Λ_{ℓ}^{- 1} (x_{t} - b_{ℓ})),

κ_{t ℓ} = η_{d} (ν_{t}; 0, M_{t}) η_{n} (x_{t}; b_{ℓ}, Λ_{ℓ}) / η_{d} (y_{t ℓ}; 0, U_{t ℓ}) .

Then

a = \int p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} \int η_{n} (x_{t}; b_{ℓ} + H_{ℓ} z_{t}, Λ_{ℓ}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} η_{n} (x_{t}; b_{ℓ} + H_{ℓ} ν_{t}, Λ_{ℓ} + H_{ℓ} M_{t} H_{ℓ}^{⊤}),

b = \int z_{t} p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} \int z_{t} η_{n} (x_{t}; b_{ℓ} + H_{ℓ} z_{t}, Λ_{ℓ}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} κ_{t ℓ} \int z_{t} η_{d} (z_{t}; y_{t ℓ}, U_{t ℓ}) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} κ_{t ℓ} y_{t ℓ},

c = \int z_{t} z_{t}^{⊤} p (x_{t} | z_{t}) η_{d} (z_{t}; ν_{t}, M_{t}) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} κ_{t ℓ} \int z_{t} z_{t}^{⊤} η_{d} (z_{t}; y_{t ℓ}, U_{t ℓ}) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} κ_{t ℓ} (U_{t ℓ} + y_{t ℓ} y_{t ℓ}^{⊤}) .
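The expressions for a, b, and c can be sanity-checked in the scalar case (n = d = 1). The sketch below is illustrative only: it evaluates the three integrals through U_{tℓ}, y_{tℓ}, and κ_{tℓ}, and the function and variable names are hypothetical. Because each Gaussian in z_t integrates to one, the value of a obtained this way should agree with the closed form given on the right-hand side of the expression for a.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_moments_1d(x, components, nu=0.0, M=1.0):
    """Scalar version of the integrals a, b, c for a Kalman observation mixture.

    components: list of (pi_l, b_l, H_l, Lambda_l) tuples (all scalars).
    Returns (a, first, second), the zeroth, first, and second moments in z.
    """
    a = first = second = 0.0
    for pi_l, b_l, H_l, lam_l in components:
        U = 1.0 / (1.0 / M + H_l * H_l / lam_l)
        y = U * (nu / M + H_l * (x - b_l) / lam_l)
        kappa = normal_pdf(nu, 0.0, M) * normal_pdf(x, b_l, lam_l) / normal_pdf(y, 0.0, U)
        a += pi_l * kappa
        first += pi_l * kappa * y
        second += pi_l * kappa * (U + y * y)
    return a, first, second

def conditional_mean_and_variance(x, components, nu=0.0, M=1.0):
    """f(x) = b / a and Q(x) = c / a - f(x)^2 in the scalar case."""
    a, first, second = mixture_moments_1d(x, components, nu, M)
    f = first / a
    return f, second / a - f * f
```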

6.2. Independent Bernoulli mixtures

For the model in Section 4.3 we provide analytic expressions for the integrals in Equation 13 for the special case of ν_t = 0 and M_t = S = I_d, which are needed for the DKF. For each k = 1,…,d, define N_k = {i : d_i = k}, Γ_k = {γ_i : i ∈ N_k}, n_k = |Γ_k|, and let $γ_{k, 1} < \dots < γ_{k, n_{k}}$ denote the sorted (distinct) values in Γ_k, using $γ_{k, 0} = - \infty$ and $γ_{k, n_{k} + 1} = + \infty$ . Using η(u) = η₁(u;0, 1) to denote the standard normal pdf and $ϕ (v) = \int_{- \infty}^{v} η (u) d u$ to denote the corresponding distribution function, define

Φ_{k j} = \int_{γ_{k, j - 1}}^{γ_{k, j}} η (u) d u = ϕ (γ_{k, j}) - ϕ (γ_{k, j - 1}),

Φ_{k j}^{'} = \int_{γ_{k, j - 1}}^{γ_{k, j}} u η (u) d u - Φ_{k j} = η (γ_{k, j - 1}) - η (γ_{k, j}) - Φ_{k j},

Φ_{k j}^{″} = \int_{γ_{k, j - 1}}^{γ_{k, j}} u^{2} η (u) d u - Φ_{k j} - 2 Φ_{k j}^{'} = γ_{k, j - 1} η (γ_{k, j - 1}) - γ_{k, j} η (γ_{k, j}) - 2 Φ_{k j}^{'},

ρ_{ℓ i j} = α_{ℓ i} 𝟙 \{γ_{k, j} \leq γ_{i}\} + β_{ℓ i} 𝟙 \{γ_{i} < γ_{k, j}\}, (i \in N_{k}),

for k = 1,…,d and j = 1,…,n_k + 1 and $ℓ = 1, \dots, L$ .
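The three interval quantities can be evaluated with nothing more than the standard normal pdf and cdf. The sketch below is a minimal illustration with hypothetical function names; it treats infinite endpoints explicitly, since u η(u) tends to 0 as u → ±∞.

```python
import math

def eta(u):
    """Standard normal pdf, with eta(+-inf) = 0."""
    if math.isinf(u):
        return 0.0
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(v):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def u_eta(u):
    """u * eta(u), using the limit 0 at +-inf to avoid inf * 0."""
    return 0.0 if math.isinf(u) else u * eta(u)

def interval_terms(lower, upper):
    """Return (Phi, Phi', Phi'') for the interval [lower, upper)."""
    phi0 = std_normal_cdf(upper) - std_normal_cdf(lower)
    phi1 = eta(lower) - eta(upper) - phi0
    phi2 = u_eta(lower) - u_eta(upper) - 2.0 * phi1
    return phi0, phi1, phi2
```

With these definitions, Φ + Φ′ recovers the first moment of η over the interval and Φ + 2Φ′ + Φ″ recovers the second moment, which is the combination that the Kronecker deltas select in the expressions for b_r and c_rs.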

Let $x_{t N_{k}} = (x_{t i} : i \in N_{k})$ and define

D_{ℓ k j} (x_{t N_{k}}) = \prod_{i \in N_{k}} ρ_{ℓ i j}^{x_{t i}} {(1 - ρ_{ℓ i j})}^{1 - x_{t i}},

p_{ℓ} (x_{t N_{k}} | z_{t k}) = \prod_{i \in N_{k}} (α_{ℓ i}^{x_{t i}} {(1 - α_{ℓ i})}^{1 - x_{t i}} 𝟙 \{z_{t k} < γ_{i}\} + β_{ℓ i}^{x_{t i}} {(1 - β_{ℓ i})}^{1 - x_{t i}} 𝟙 \{z_{t k} \geq γ_{i}\}) = \sum_{j = 1}^{n_{k} + 1} 𝟙 \{γ_{k, j - 1} \leq z_{t k} < γ_{k, j}\} D_{ℓ k j} (x_{t N_{k}})

so that $p (x_{t} | z_{t}) = \sum_{ℓ = 1}^{L} π_{ℓ} \prod_{k = 1}^{d} p_{ℓ} (x_{t N_{k}} | z_{t k})$ and (with S = I_d)

p (x_{t} | z_{t}) η_{d} (z_{t}; 0, S) = p (x_{t} | z_{t}) \prod_{k = 1}^{d} η (z_{t k}) = \sum_{ℓ = 1}^{L} π_{ℓ} \prod_{k = 1}^{d} p_{ℓ} (x_{t N_{k}} | z_{t k}) η (z_{t k}) .

Hence, using $δ_{k r} = 𝟙 \{k = r\}$ ,

a = \int p (x_{t} | z_{t}) η_{d} (z_{t}; 0, S) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} \prod_{k = 1}^{d} \int p_{ℓ} (x_{t N_{k}} | z_{t k}) η (z_{t k}) d z_{t k} = \sum_{ℓ = 1}^{L} π_{ℓ} \prod_{k = 1}^{d} \sum_{j = 1}^{n_{k} + 1} D_{ℓ k j} (x_{t N_{k}}) \int_{γ_{k, j - 1}}^{γ_{k, j}} η (z_{t k}) d z_{t k} = \sum_{ℓ = 1}^{L} π_{ℓ} \prod_{k = 1}^{d} \sum_{j = 1}^{n_{k} + 1} D_{ℓ k j} (x_{t N_{k}}) Φ_{k j},

b_{r} = \int z_{t r} p (x_{t} | z_{t}) η_{d} (z_{t}; 0, S) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} \prod_{k = 1}^{d} \sum_{j = 1}^{n_{k} + 1} D_{ℓ k j} (x_{t N_{k}}) (Φ_{k j} + Φ_{k j}^{'} δ_{k r}),

c_{r s} = \int z_{t r} z_{t s} p (x_{t} | z_{t}) η_{d} (z_{t}; 0, S) d z_{t} = \sum_{ℓ = 1}^{L} π_{ℓ} \prod_{k = 1}^{d} \sum_{j = 1}^{n_{k} + 1} D_{ℓ k j} (x_{t N_{k}}) (Φ_{k j} + Φ_{k j}^{'} δ_{k r} + Φ_{k j}^{'} δ_{k s} + Φ_{k j}^{″} δ_{k r} δ_{k s}),

where in Equation 13 the vector b = (b_r : r = 1,…,d) and the matrix c = (c_rs : r,s = 1,…,d). We have f(x) = b/a and $Q (x) = c / a - f (x) f {(x)}^{⊤}$ .

6.3. Measures to prevent numerical instabilities

The covariance matrix Σ_t must be positive definite for the DKF algorithm to make sense. As n gets large, using $Q (x_{t}) = V (Z_{t} | X_{t} = x_{t})$ , the probability that Σ_t is positive definite goes to 1; see Lemma 1 below. However, when n is small or when Q is learned, Σ_t will often not be positive definite. An easy remedy is to force Q⁻¹(x) − S⁻¹ to be positive semidefinite for every x by shrinking the (generalized) eigenvalues of Q(x) for any x where this constraint is not satisfied. In particular, beginning with a target Q = Q(x) for a given fixed x, consider the generalized eigenvalue decomposition QV = SVD, where $V \in ℝ^{d \times d}$ is invertible and $D \in ℝ^{d \times d}$ is diagonal. (This decomposition can be computed in Matlab using [V,D]=eig(Q,S).) Let D ∧ 1 denote the element-wise minimum of D and 1, and define Q′ = SV(D ∧ 1)V⁻¹. By redefining Q(x) as Q′, we ensure that Q⁻¹(x) − S⁻¹ is positive semidefinite, as required. Moreover, Q′ will be the same as the original Q if this condition was already satisfied by the original Q, showing that this modification to the DKF algorithm does not affect our asymptotic analysis. We used this modification for all of the experiments with the DKF. The robust DKF does not require this modification.

Here is a proof of the claims about this method: Q⁻¹ − S⁻¹ is positive semidefinite if and only if S − Q is positive semidefinite, if and only if $S^{- 1 / 2} (S - Q) S^{- 1 / 2}$ is positive semidefinite. We have $S^{- 1 / 2} (S - Q) S^{- 1 / 2} = S^{- 1 / 2} (S - S V D V^{- 1}) S^{- 1 / 2} = I - S^{1 / 2} V D {(S^{1 / 2} V)}^{- 1} = (S^{1 / 2} V) (I - D) {(S^{1 / 2} V)}^{- 1}$ , which is positive semidefinite if and only if all entries of D (which is diagonal) are ≤ 1. Replacing D with D ∧ 1 exactly enforces this constraint.
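A minimal NumPy sketch of the eigenvalue-shrinking step is given below. It is an illustration under stated assumptions rather than the code used for the experiments: Q and S are assumed symmetric positive definite, the function name is hypothetical, and instead of a generalized eigensolver it uses the equivalent symmetric problem for $S^{- 1 / 2} Q S^{- 1 / 2}$, whose eigenvalues are the generalized eigenvalues of the pair (Q, S).

```python
import numpy as np

def clip_generalized_eigenvalues(Q, S):
    """Return Q' = S V (D ∧ 1) V^{-1}, where Q V = S V D.

    Assumes Q and S are symmetric positive definite. With V = S^{-1/2} W
    and W orthogonal, Q' simplifies to S^{1/2} W (D ∧ 1) W^T S^{1/2}.
    """
    # Symmetric square root of S and its inverse.
    s_vals, s_vecs = np.linalg.eigh(S)
    S_half = (s_vecs * np.sqrt(s_vals)) @ s_vecs.T
    S_half_inv = (s_vecs / np.sqrt(s_vals)) @ s_vecs.T
    # Generalized eigenvalues of (Q, S) via a symmetric eigenproblem.
    M = S_half_inv @ Q @ S_half_inv
    d, W = np.linalg.eigh((M + M.T) / 2.0)
    d_clipped = np.minimum(d, 1.0)  # element-wise D ∧ 1
    return S_half @ (W * d_clipped) @ W.T @ S_half
```

If Q already satisfies the constraint, all generalized eigenvalues are at most 1 and the function returns Q up to floating-point error.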

For our DKF experiments with nonlinear state dynamics using an extended Kalman filter (EKF) approximation (not described here), we found that the DKF-EKF became unstable for small n, because the EKF approximation to the nonlinearity was quite poor. To remedy this, we modified the DKF algorithm to prevent µ_t from diverging too far from ν_t and f(x_t) (the posterior means of Z_t given X_1:t−1 and given X_t, respectively). In particular, we forced |µ_t|² ≤ |ν_t|² + |f(x_t)|² (by scaling µ_t whenever its norm exceeded our bound). For larger n, once the DKF approximation becomes more accurate, this constraint was always satisfied in our experiments without intervention, but for smaller n, enforcing it was important for preventing numerical instabilities. The robust DKF did not require this modification. Although not used in this paper, we report this modification in case others find it useful in their application.
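The norm constraint on µ_t amounts to a simple rescaling. The sketch below is one way to write it, with vectors as plain Python lists and a hypothetical function name; it is not taken from the authors' implementation.

```python
import math

def bound_posterior_mean(mu, nu, f_x):
    """Rescale mu so that |mu|^2 <= |nu|^2 + |f_x|^2.

    mu, nu, f_x are equal-length sequences of floats. mu is returned
    unchanged (as a list) when the bound already holds.
    """
    bound_sq = sum(v * v for v in nu) + sum(v * v for v in f_x)
    norm_sq = sum(v * v for v in mu)
    if norm_sq <= bound_sq:
        return list(mu)
    # Shrink mu along its own direction until it sits on the bound.
    scale = math.sqrt(bound_sq / norm_sq)
    return [scale * v for v in mu]
```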

6.4. Nadaraya-Watson kernel regression

We can learn $f : ℝ^{n} \to ℝ^{d}$ with a variety of regression methods. The well-known Nadaraya-Watson kernel regression estimator (Nadaraya, 1964; Watson, 1964) is

\hat{f} (x) = \frac{\sum_{i = 1}^{m} z_{i}^{'} κ_{X} (x, x_{i}^{'})}{\sum_{i = 1}^{m} κ_{X} (x, x_{i}^{'})}

where κ_X(x,x′) is a nonnegative kernel and m is the size of the training set. The bandwidth can be chosen using a rule of thumb or with leave-one-out cross validation, the latter scaling as $O (m^{2})$ . Evaluation of $\hat{f}$ scales like $O (m)$ . In the examples we use a Gaussian kernel with a bandwidth chosen by minimizing leave-one-out mean squared error (MSE) on the training set.
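For concreteness, the estimator and the leave-one-out criterion can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used for the examples: the Gaussian kernel uses a single scalar bandwidth, the function names are hypothetical, and no safeguard is included for the case where every kernel weight underflows to zero.

```python
import math

def gaussian_kernel(x, x_prime, bandwidth):
    """Isotropic Gaussian kernel between two points given as sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-0.5 * sq_dist / bandwidth ** 2)

def nadaraya_watson(x, X_train, Z_train, bandwidth):
    """Kernel-weighted average of training states; O(m) per evaluation."""
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in X_train]
    total = sum(weights)
    d = len(Z_train[0])
    return [sum(w * z[j] for w, z in zip(weights, Z_train)) / total
            for j in range(d)]

def leave_one_out_mse(X_train, Z_train, bandwidth):
    """Leave-one-out mean squared error; O(m^2) per bandwidth."""
    m = len(X_train)
    err = 0.0
    for i in range(m):
        X_rest = X_train[:i] + X_train[i + 1:]
        Z_rest = Z_train[:i] + Z_train[i + 1:]
        pred = nadaraya_watson(X_train[i], X_rest, Z_rest, bandwidth)
        err += sum((p - z) ** 2 for p, z in zip(pred, Z_train[i]))
    return err / m
```

A bandwidth can then be picked from a candidate grid with, for example, `min(candidates, key=lambda h: leave_one_out_mse(X_train, Z_train, h))`.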

6.5. Neural network regression

We can also learn f as a neural network (NN). NN’s are attractive for online filtering, because evaluation of $\hat{f}$ scales $O (1)$ with the size of the training set. With mean squared error (MSE) as an objective function, we optimize parameters over the training set. Typically, optimization continues until performance stops improving on a validation subset (to prevent overfitting), but instead we use Bayesian regularization to ensure network generalizability (MacKay, 1992; Foresee and Hagan, 1997). Training costs depend on the training algorithm chosen. Traditional optimizers include: stochastic gradient descent, scaling with $O (m)$ ; scaled conjugate gradient, with $O (m^{2})$ ; Levenberg–Marquardt, with $O (m^{3})$ (Castillo et al., 2010), where m is the size of the training set. More recently, Hessian-free approaches have been developed to train NN’s on larger data sets (Schmidhuber, 2015). Training costs also grow with d, depending on choice of architecture.

We implemented all feedforward neural networks with Matlab’s Neural Network Toolbox R2019a. Our implementation consisted of a single hidden layer of tansig neurons trained via Levenberg-Marquardt optimization (Levenberg, 1944; Marquardt, 1963; Hagan and Menhaj, 1994) with Bayesian regularization.

6.6. Gaussian process regression

Gaussian process (GP) regression is another popular method for nonlinear regression (Rasmussen and Williams, 2006). The idea is to put a prior distribution on the function f and approximate f with its posterior mean given training data. We will first briefly describe the case d = 1. We form an m×n-dimensional matrix X′ by concatenating the 1×n-dimensional vectors $X_{i}^{'}$ and an m×d-dimensional matrix Z′ by concatenating the vectors $Z_{i}^{'}$ . We assume that $p (z_{i}^{'} | x_{i}^{'}, f) = η (z_{i}^{'}; f (x_{i}^{'}), σ^{2})$ , where f is sampled from a mean-zero GP with covariance kernel K(·, ·). Under this model,

\hat{f} (x) = E (f (x) | Z^{'}, X^{'}) = K (x, X^{'}) {(K (X^{'}, X^{'}) + σ^{2} I_{m})}^{- 1} Z^{'},

where K(x,X′) denotes the 1×m vector with ith entry $K (x, X_{i}^{'})$ , where K(X′,X′) denotes the m×m matrix with ijth entry $K (X_{i}^{'}, X_{j}^{'})$ , where Z′ is a column vector, and where I_m is the m×m identity matrix. The noise variance σ² and any parameters controlling the kernel shape are hyperparameters. For our examples, we used the radial basis function kernel with two parameters: length scale and maximum covariance. These hyperparameters were selected via maximum likelihood. For d > 1, we repeated this process for each dimension to separately learn the coordinates of f. Training costs for a single dimension scale as $O (m^{3})$ . Sparse approximations to GP’s can reduce training requirements to $O (m \cdot N_{S}^{2})$ where N_S is the size of the sparse GP (Quiñonero Candela and Rasmussen, 2005). Evaluation of $\hat{f}$ scales $O (m)$ for each dimension, or $O (N_{S})$ for sparse approximations.
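The posterior-mean formula for d = 1 can be written directly in NumPy. The sketch below is illustrative only and does not use the GPML package: the hyperparameters (length scale, maximum covariance, noise variance) are fixed by hand rather than selected by maximum likelihood, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale, max_cov):
    """Radial basis function kernel between the rows of X1 and X2."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return max_cov * np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior_mean(x, X_train, Z_train,
                      length_scale=1.0, max_cov=1.0, noise_var=0.1):
    """E[f(x) | Z', X'] = K(x, X') (K(X', X') + sigma^2 I_m)^{-1} Z'.

    X_train has shape (m, n) and Z_train has shape (m,), i.e. d = 1.
    The linear solve costs O(m^3); each evaluation afterwards is O(m).
    """
    m = X_train.shape[0]
    K = rbf_kernel(X_train, X_train, length_scale, max_cov)
    k_x = rbf_kernel(np.atleast_2d(x), X_train, length_scale, max_cov)
    # Solve the linear system instead of forming an explicit inverse.
    alpha = np.linalg.solve(K + noise_var * np.eye(m), Z_train)
    return k_x @ alpha
```

In practice `alpha` would be computed once after training and reused for every test point, which is what gives the O(m) evaluation cost quoted above.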

All GP training was performed using the publicly available GPML package (Rasmussen and Nickisch, 2010).

6.7. Comparison with a long short term memory (LSTM) neural network

An LSTM is a stateful recurrent neural network designed to overcome error backflow problems (Hochreiter and Schmidhuber, 1997). Such recurrent neural networks have previously been shown to outperform state-of-the-art Kalman-based filters on this primate neural decoding task and so provide a good point of comparison (Sussillo et al., 2012, 2016; Pandarinath et al., 2018; Hosman et al., 2019). While there are many variants on the LSTM architecture, none seem to universally improve on the basic design (Jozefowicz et al., 2015; Greff et al., 2016). LSTM optimization uses many of the same methods that work for feedforward NN’s (Schmidhuber, 2015). Training and evaluation requirements are similar.

All LSTM trials were conducted with TensorFlow r1.4.0 in a Python 3.6.8 environment. The LSTM cell used in these trials was built from scratch in TensorFlow following Gers et al. (2000). Dropout was used to prevent overfitting (Srivastava et al., 2014), but it was only applied to feedforward connections, not recurrent connections (Pham et al., 2014; Zaremba et al., 2014). The recurrent states and outputs at each intermediate time step were batch-normalized to accommodate internal covariate shift (Ioffe and Szegedy, 2015). Model parameters were initialized via a Xavier-type method (Glorot and Bengio, 2010) designed to stabilize variance from layer to layer. Optimization was then performed with Adadelta (Zeiler, 2012), an algorithm designed to improve upon Adagrad (Duchi et al., 2011) with the explicit goals of decreasing sensitivity to hyperparameters and permitting the learning rate to sometimes increase.

7. Mathematical results

Our main technical result is Theorem 2 below. After stating the theorem we translate it into the setting of the paper. Probability density functions (pdfs) are with respect to Lebesgue measure over $ℝ^{d}$ . || · ||₁ and || · ||_∞ denote the L₁ and L_∞ norms, respectively, $\overset{w}{\to}$ denotes weak convergence of probability measures (equivalent, for instance, to convergence of the expected values of bounded continuous functions), and δ_c denotes the unit point mass at $c \in ℝ^{d}$ . Define the Markov transition density $τ (y, z) = η_{d} (z; A y, Γ)$ , and let $τ h$ denote the function

(τ h) (z) = \int τ (y, z) h (y) d y

for an arbitrary, integrable h. Define p(z) = η_d(z;0,S), where S satisfies $S = A S A^{⊤} + Γ$ .

Theorem 2. Fix pdfs s_n and u_n (n ≥ 1) so that the pdfs

p_{n} = \frac{u_{n} τ s_{n} / p}{{‖ u_{n} τ s_{n} / p ‖}_{1}}

(25)

are well-defined for each n. Suppose that for some $b \in ℝ^{d}$ and some probability measure P over $ℝ^{d}$

A1. $s_{n} \overset{w}{\to} P$ as n → ∞;

A2. there exists a sequence of Gaussian pdfs $(s_{n}^{'})$ such that ${‖ s_{n} - s_{n}^{'} ‖}_{1} \to 0$ as n → ∞;

A3. $u_{n} \overset{w}{\to} δ_{b}$ as n → ∞;

A4. there exists a sequence of Gaussian pdfs $(u_{n}^{'})$ such that ${‖ u_{n} - u_{n}^{'} ‖}_{1} \to 0$ as n → ∞;

A5. $p_{n} \overset{w}{\to} δ_{b}$ as n → ∞;

Then

C1. $s_{n}^{'} \overset{w}{\to} P$ as n → ∞;

C2. $u_{n}^{'} \overset{w}{\to} δ_{b}$ as n → ∞;

C3. the pdf

p_{n}^{'} = \frac{u_{n}^{'} τ s_{n}^{'} / p}{{‖ u_{n}^{'} τ s_{n}^{'} / p ‖}_{1}}

is well defined and Gaussian for n sufficiently large;

C4. $p_{n}^{'} \overset{w}{\to} δ_{b}$ as n → ∞;

C5. ${‖ p_{n} - p_{n}^{'} ‖}_{1} \to 0$ as n → ∞;

Remark 1. The L₁ distance between pdfs is equivalent to the total variation distance between the respective probability measures.

Remark 2. We are not content to show the existence of a sequence of Gaussian pdfs $(p_{n}^{'})$ that satisfy C4–C5. Rather, we are trying to show that the specific $p_{n}^{'}$ defined in C3 satisfies C4–C5 regardless of the choice of $u_{n}^{'}$ and $s_{n}^{'}$ .

Remark 3. An inspection of the proof shows that the pdf

r_{n}^{'} = p_{n}^{'} p / {‖ p_{n}^{'} p ‖}_{1} = u_{n}^{'} τ s_{n}^{'} / {‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}

is well-defined and Gaussian with $r_{n}^{'} \overset{w}{\to} δ_{b}$ and

{‖ p_{n} - r_{n}^{'} ‖}_{1} \leq A_{n} + B_{n} + C_{n}

where the terms A_n, B_n, C_n are those defined in Equation 26, each of which tends to zero in the limit. Thus ${‖ p_{n} - r_{n}^{'} ‖}_{1} \to 0$ . These $r_{n}^{'}$ are precisely the estimates formed using the robust DKF.

Remark 4. Suppose the pdfs s_n, $s_{n}^{'}$ , u_n, $u_{n}^{'} (n \geq 1)$ , the constant b, and the probability measure P are themselves random, defined on a common probability space, so that p_n is well-defined with probability one, and suppose that the limits in A1–A5 hold in probability. Then the probability that $p_{n}^{'}$ is a well-defined, Gaussian pdf converges to one, and the limits in C1–C5 hold in probability.

For the setting of the paper, first fix t ≥ 1 and note that p is the common pdf of each Z_t and $τ$ is the common conditional pdf of Z_t given Z_t−1. The limit of interest is for increasing dimension (n) of a single observation. To formalize this, we let each X_t be infinite dimensional and consider observing only the first n dimensions, denoted $X_{t}^{1 : n} \in ℝ^{n}$ . Similarly, $X_{1 : t}^{1 : n} = (X_{1}^{1 : n}, \dots, X_{t}^{1 : n})$ . We will abuse notation and use $ℙ (Z_{t} = \cdot | W)$ to denote the conditional pdf of Z_t given another random variable W. These conditional pdfs (formally defined via disintegrations) exist under very mild regularity assumptions (Chang and Pollard, 1997). Note that we are in the setting of Remark 4, where the randomness comes from X_1:t,Z_1:t. With this in mind, define

u_{n} (\cdot) = u_{n} (\cdot; X_{t}^{1 : n}) = ℙ (Z_{t} = \cdot | X_{t}^{1 : n})

u_{n}^{'} (\cdot) = u_{n}^{'} (\cdot; X_{t}^{1 : n}) = η_{d} (\cdot; f_{n} (X_{t}^{1 : n}), Q_{n} (X_{t}^{1 : n}))

s_{n} (\cdot) = s_{n} (\cdot; X_{1 : t - 1}^{1 : n}) = ℙ (Z_{t - 1} = \cdot | X_{1 : t - 1}^{1 : n}) (t > 1)

s_{n}^{'} (\cdot) = s_{n}^{'} (\cdot; X_{1 : t - 1}^{1 : n}) = η_{d} (\cdot; μ_{t - 1, n} (X_{1 : t - 1}^{1 : n}), Σ_{t - 1, n} (X_{1 : t - 1}^{1 : n})) (t > 1)

p_{n} (\cdot) = p_{n} (\cdot; X_{1 : t}^{1 : n}) = ℙ (Z_{t} = \cdot | X_{1 : t}^{1 : n})

p_{n}^{'} (\cdot) = p_{n}^{'} (\cdot; X_{1 : t}^{1 : n}) = η_{d} (\cdot; μ_{t, n} (X_{1 : t}^{1 : n}), Σ_{t, n} (X_{1 : t}^{1 : n}))

b = Z_{t}

P (\cdot) = P (\cdot; Z_{t - 1}) = δ_{Z_{t - 1}} (t > 1),

and define $s_{n} \equiv s_{n}^{'} \equiv P \equiv p$ when t = 1. The pdf $u_{n}^{'}$ is our Gaussian approximation of the conditional pdf of Z_t for a given $X_{t}^{1 : n}$ . We have added the subscript n to f and Q from the main text to emphasize the dependence on the dimensionality of the observations. The pdfs $s_{n}^{'}$ and $p_{n}^{'}$ are our Gaussian approximations of the conditional pdfs of Z_t−1 and Z_t given $X_{1 : t - 1}^{1 : n}$ and $X_{1 : t}^{1 : n}$ , respectively. Again, we added the subscript n to µ_t and Σ_t from the main text. Note that Equation 25 above is simply a condensed version of Equation 6 in the main text, and, for the same reason, the $p_{n}^{'}$ defined in C3 is the same $p_{n}^{'}$ defined above.

The Bernstein–von Mises (BvM) Theorem gives conditions for the existence of functions f_n and Q_n so that A3–A4 hold in probability. We refer the reader to van der Vaart (1998) for details. Very loosely speaking, the BvM Theorem requires Z_t to be completely determined in the limit of increasing amounts of data, but not completely determined after observing only a finite amount of data. The simplest case is when $X_{t}^{1 : n}$ are conditionally iid given Z_t and distinct values of Z_t give rise to distinct conditional distributions for $X_{t}^{1 : n}$ , but the result holds in much more general settings. A separate application of the BvM Theorem gives A5 (in probability). In applying the BvM Theorem to obtain A5, we also obtain the existence of a sequence of (random) Gaussian pdfs $(p_{n}^{″})$ such that ${‖ p_{n} - p_{n}^{″} ‖}_{1} \to 0$ (in probability), but we do not make use of this result, and, as explained in Remark 2, we care about the specific sequence $(p_{n}^{'})$ defined in C3.

As long as the BvM Theorem is applicable, the only remaining thing to show is A1–A2 (in probability). For the case t = 1, we have $s_{n} \equiv s_{n}^{'} \equiv P \equiv p$ , so A1–A2 are trivially true and the theorem holds. For any case t > 1, we note that s_n and $s_{n}^{'}$ are simply p_n and $p_{n}^{'}$ , respectively, for the case t − 1. So the conclusions C4–C5 in the case t − 1 become the assumptions A1–A2 for the subsequent case t. The theorem then holds for all t ≥ 1 by induction. The key conclusion is C5, which says that our Gaussian filter approximation $p_{n}^{'}$ will be close in total variation distance (see Remark 1) to the true Bayesian filter distribution p_n with high probability when n is large.

Proof. C1 follows immediately from A1 and A2. C2 follows immediately from A3 and A4. C3 and C4 are proved in Lemma 1 below. To show C5 we first bound

{‖ p_{n} - p_{n}^{'} ‖}_{1} \leq \underbrace{{‖ p_{n} - \frac{p_{n} p}{p (b)} ‖}_{1}}_{A_{n}} + \underbrace{{‖ \frac{p_{n} p}{p (b)} - \frac{p_{n} p}{{‖ p_{n} p ‖}_{1}} ‖}_{1}}_{B_{n}} + \underbrace{{‖ \frac{p_{n} p}{{‖ p_{n} p ‖}_{1}} - \frac{p_{n}^{'} p}{{‖ p_{n}^{'} p ‖}_{1}} ‖}_{1}}_{C_{n}} + \underbrace{{‖ \frac{p_{n}^{'} p}{{‖ p_{n}^{'} p ‖}_{1}} - \frac{p_{n}^{'} p}{p (b)} ‖}_{1}}_{B_{n}^{'}} + \underbrace{{‖ \frac{p_{n}^{'} p}{p (b)} - p_{n}^{'} ‖}_{1}}_{A_{n}^{'}} .

(26)

Since $p_{n} \overset{w}{\to} δ_{b}$ and p(z) is bounded and continuous,

A_{n} = \int p_{n} | 1 - \frac{p}{p (b)} | = E_{Z_{n} ~ p_{n}} | 1 - \frac{p (Z_{n})}{p (b)} | \to | 1 - \frac{p (b)}{p (b)} | = 0

and

B_{n} = \int \frac{p_{n} p}{{‖ p_{n} p ‖}_{1}} | \frac{{‖ p_{n} p ‖}_{1}}{p (b)} - 1 | = | \frac{{‖ p_{n} p ‖}_{1}}{p (b)} - 1 | = | \frac{E_{Z_{n} ~ p_{n}} | p (Z_{n}) |}{p (b)} - 1 | \to | \frac{p (b)}{p (b)} - 1 | = 0.

Similarly, since $p_{n}^{'} \overset{w}{\to} δ_{b}$ ,

A_{n}^{'} = \int p_{n}^{'} | 1 - \frac{p}{p (b)} | = E_{Z_{n} ~ p_{n}^{'}} | 1 - \frac{p (Z_{n})}{p (b)} | \to | 1 - \frac{p (b)}{p (b)} | = 0

and

B_{n}^{'} = \int \frac{p_{n}^{'} p}{{‖ p_{n}^{'} p ‖}_{1}} | \frac{{‖ p_{n}^{'} p ‖}_{1}}{p (b)} - 1 | = | \frac{{‖ p_{n}^{'} p ‖}_{1}}{p (b)} - 1 | = | \frac{E_{Z_{n} ~ p_{n}^{'}} | p (Z_{n}) |}{p (b)} - 1 | \to | \frac{p (b)}{p (b)} - 1 | = 0.

All that remains is to show that C_n → 0.

We first observe that

\frac{p_{n} p}{{‖ p_{n} p ‖}_{1}} = \frac{u_{n} τ s_{n}}{{‖ u_{n} τ s_{n} ‖}_{1}} and \frac{p_{n}^{'} p}{{‖ p_{n}^{'} p ‖}_{1}} = \frac{u_{n}^{'} τ s_{n}^{'}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} .

Define

α = E_{(Y, Z) ~ P \times δ_{b}} η_{d} (Z; A Y, Γ) = E_{Y ~ P} η_{d} (b; A Y, Γ) \in (0, \infty) .

Since $s_{n} \overset{w}{\to} P$ , $u_{n} \overset{w}{\to} δ_{b}$ , and $(z, y) \mapsto τ (y, z) = η_{d} (z; A y, Γ)$ is bounded and continuous, we have

{‖ u_{n} τ s_{n} ‖}_{1} = \iint η_{d} (z; A y, Γ) s_{n} (y) u_{n} (z) d y d z = E_{(Y_{n}, Z_{n}) ~ s_{n} \times u_{n}} η_{d} (Z_{n}; A Y_{n}, Γ) \to α .

Similarly, since $s_{n}^{'} \overset{w}{\to} P$ and $u_{n}^{'} \overset{w}{\to} δ_{b}$ ,

{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1} = \iint η_{d} (z; A y, Γ) s_{n}^{'} (y) u_{n}^{'} (z) d y d z = E_{(Y_{n}, Z_{n}) ~ s_{n}^{'} \times u_{n}^{'}} η_{d} (Z_{n}; A Y_{n}, Γ) \to α .

Defining β = η_d(0;0, Γ) ∈ (0, ∞), gives

{‖ τ h ‖}_{\infty} \leq \sup_{z} | (τ h) (z) | \leq \sup_{z, y} η_{d} (z; A y, Γ) \int | h (t) | d t \leq η_{d} (0; 0, Γ) {‖ h ‖}_{1} = β {‖ h ‖}_{1}

for any integrable h. With these facts in mind we obtain

C_{n} = {| \frac{u_{n} τ s_{n}}{{‖ u_{n} τ s_{n} ‖}_{1}} - \frac{u_{n}^{'} τ s_{n}^{'}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} |}_{1} \leq {| \frac{u_{n} τ s_{n}}{{‖ u_{n} τ s_{n} ‖}_{1}} - \frac{u_{n}^{'} τ s_{n}}{{‖ u_{n} τ s_{n} ‖}_{1}} |}_{1} + {| \frac{u_{n}^{'} τ s_{n}}{{‖ u_{n} τ s_{n} ‖}_{1}} - \frac{u_{n}^{'} τ s_{n}^{'}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} |}_{1} \leq \frac{{‖ τ s_{n} ‖}_{\infty}}{{‖ u_{n} τ s_{n} ‖}_{1}} {‖ u_{n} - u_{n}^{'} ‖}_{1} + {‖ \frac{τ s_{n}}{{‖ u_{n} τ s_{n} ‖}_{1}} - \frac{τ s_{n}^{'}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} ‖}_{\infty} {‖ u_{n}^{'} ‖}_{1} \leq \frac{β}{{‖ u_{n} τ s_{n} ‖}_{1}} {‖ u_{n} - u_{n}^{'} ‖}_{1} + {‖ \frac{τ s_{n}}{{‖ u_{n} τ s_{n} ‖}_{1}} - \frac{τ s_{n}^{'}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} ‖}_{\infty} + {‖ \frac{τ s_{n}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} - \frac{τ s_{n}^{'}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} ‖}_{\infty} \leq \frac{β}{{‖ u_{n} τ s_{n} ‖}_{1}} {‖ u_{n} - u_{n}^{'} ‖}_{1} + \frac{{‖ τ s_{n} ‖}_{\infty}}{{‖ u_{n} τ s_{n} ‖}_{1}} {| 1 - \frac{{‖ u_{n} τ s_{n} ‖}_{1}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} |}_{\infty} + \frac{{‖ τ s_{n} - τ s_{n}^{'} ‖}_{\infty}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} \leq \underset{\to β / α}{\underset{︸}{\frac{β}{{‖ u_{n} τ s_{n} ‖}_{1}}}} \underset{\to 0}{\underset{︸}{{‖ u_{n} - u_{n}^{'} ‖}_{1}}} + \underset{\to β / α}{\underset{︸}{\frac{β}{{‖ u_{n} τ s_{n} ‖}_{1}}}} \underset{\to | 1 - α / α | = 0}{\underset{︸}{| 1 - \frac{{‖ u_{n} τ s_{n} ‖}_{1}}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}} |}} + \underset{\to β / α}{\underset{︸}{\frac{β}{{‖ u_{n}^{'} τ s_{n}^{'} ‖}_{1}}}} \underset{\to 0}{\underset{︸}{{‖ s_{n} - s_{n}^{'} ‖}_{1}}}

Since α > 0, we see that C_n → 0 and the proof of the theorem is complete.

Remark 4 follows from standard arguments by making use of the equivalence between convergence in probability and the existence of a strongly convergent subsequence within each subsequence. The theorem can be applied to each strongly convergent subsequence.

Lemma 1 (DKF equation). If $s_{n}^{'} (z) = η_{d} (z; a_{n}, V_{n})$ and $u_{n}^{'} (z) = η_{d} (z; b_{n}, U_{n})$ , then defining

p_{n}^{'} = \frac{u_{n}^{'} τ s_{n}^{'} / p}{{‖ u_{n}^{'} τ s_{n}^{'} / p ‖}_{1}},

gives

p_{n}^{'} (z) = η_{d} (z; c_{n}, T_{n}),

where $G_{n} = A V_{n} A^{⊤} + Γ$ , $T_{n} = {(U_{n}^{- 1} + G_{n}^{- 1} - S^{- 1})}^{- 1}$ , and $c_{n} = T_{n} (U_{n}^{- 1} b_{n} + G_{n}^{- 1} A a_{n})$ , as long as T_n is well-defined and positive definite. Furthermore, if $s_{n}^{'} \overset{w}{\to} P$ and $u_{n}^{'} \overset{w}{\to} δ_{b}$ , then $p_{n}^{'}$ is eventually well-defined and $p_{n}^{'} \overset{w}{\to} δ_{b}$ .

Proof. See above for the definition of $τ$ , p, A, Γ, S. Assuming $u_{n}^{'} τ s_{n}^{'} / p$ is integrable, we have

p_{n}^{'} (z) \propto \frac{η_{d} (z; b_{n}, U_{n})}{η_{d} (z; 0, S)} \int η_{d} (z; A y, Γ) η_{d} (y; a_{n}, V_{n}) d y .

Since

\int η_{d} (z; A y, Γ) η_{d} (y; a_{n}, V_{n}) d y = η_{d} (z; A a_{n}, A V_{n} A^{⊤} + Γ) = η_{d} (z; A a_{n}, G_{n})

and

\frac{η_{d} (z; b_{n}, U_{n})}{η_{d} (z; 0, S)} \propto \frac{\exp (- \frac{1}{2} {(z - b_{n})}^{⊤} U_{n}^{- 1} (z - b_{n}))}{\exp (- \frac{1}{2} z^{⊤} S^{- 1} z)} \propto \exp (- \frac{1}{2} (z^{⊤} (U_{n}^{- 1} - S^{- 1}) z - 2 z^{⊤} U_{n}^{- 1} b_{n})) \propto \exp (- \frac{1}{2} {(z - b_{n}^{'})}^{⊤} {(U_{n}^{'})}^{- 1} (z - b_{n}^{'})) \propto η_{d} (z, b_{n}^{'}, U_{n}^{'})

for $U_{n}^{'} = {(U_{n}^{- 1} - S^{- 1})}^{- 1}$ and $b_{n}^{'} = U_{n}^{'} U_{n}^{- 1} b_{n}$, we have

p_{n}^{'} (z) \propto η_{d} (z; b_{n}^{'}, U_{n}^{'}) η_{d} (z; A a_{n}, G_{n}) \propto η_{d} (z; T_{n} ({(U_{n}^{'})}^{- 1} b_{n}^{'} + G_{n}^{- 1} A a_{n}), T_{n}) = η_{d} (z; c_{n}, T_{n}) .

As the normal density integrates to 1, the proportionality constant drops out.
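As an illustration only, the closed-form update of Lemma 1 can be sketched in a few lines of NumPy. The function name `dkf_lemma_update` and its argument names are hypothetical choices made for this sketch and are not taken from the supplementary code. The inputs are assumed to be arrays of compatible shapes, with S the covariance of p as used in the proof. The eigenvalue check mirrors the lemma's requirement that T_n be well-defined and positive definite.

```python
import numpy as np


def dkf_lemma_update(a, V, b, U, A, Gamma, S):
    """Return (c, T) with p'(z) = N(z; c, T), following the formulas of Lemma 1.

    a, V        : mean and covariance of s'_n
    b, U        : mean and covariance of u'_n
    A, Gamma, S : model matrices as they appear in the proof
    (all names are illustrative assumptions, not the authors' code)
    """
    G = A @ V @ A.T + Gamma                      # G_n = A V_n A^T + Gamma
    G_inv = np.linalg.inv(G)
    U_inv = np.linalg.inv(U)
    T_inv = U_inv + G_inv - np.linalg.inv(S)     # T_n^{-1}
    T_inv = (T_inv + T_inv.T) / 2                # symmetrise against round-off
    # Lemma 1 only applies when T_n is well-defined and positive definite.
    if np.min(np.linalg.eigvalsh(T_inv)) <= 0:
        raise ValueError("T_n is not positive definite; Lemma 1 does not apply")
    T = np.linalg.inv(T_inv)
    c = T @ (U_inv @ b + G_inv @ (A @ a))        # c_n
    return c, T


# One-dimensional example with hand-checkable numbers (values are arbitrary).
A = np.array([[0.5]])
Gamma = np.array([[0.75]])
S = np.array([[1.0]])
a, V = np.array([2.0]), np.array([[1.0]])
b, U = np.array([1.0]), np.array([[0.5]])

c, T = dkf_lemma_update(a, V, b, U, A, Gamma, S)
print(c, T)  # c is [1.5] and T is [[0.5]]
```

In this example G = 0.25 + 0.75 = 1, so T⁻¹ = 2 + 1 − 1 = 2 and c = 0.5 (2 · 1 + 1 · 0.5 · 2) = 1.5. Explicit inverses keep the sketch close to the formulas; an implementation meant for real use would likely prefer linear solves.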

Now, suppose additionally that $s_{n}^{'} \overset{w}{\to} P$ and $u_{n}^{'} \overset{w}{\to} δ_{b}$ . Consider the characteristic functions

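ϕ_{s_{n}^{'}} (t) = E_{X \sim s_{n}^{'}} [e^{i t^{⊤} X}] = e^{i t^{⊤} a_{n} - \frac{1}{2} t^{⊤} V_{n} t} and ϕ_{u_{n}^{'}} (t) = E_{X \sim u_{n}^{'}} [e^{i t^{⊤} X}] = e^{i t^{⊤} b_{n} - \frac{1}{2} t^{⊤} U_{n} t}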

for these random variables. Lévy’s continuity theorem (Thm. 2.13 in van der Vaart, 1998) implies that $ϕ_{s_{n}^{'}} (t) \to ϕ_{P} (t)$ and $ϕ_{u_{n}^{'}} (t) \to ϕ_{δ_{b}} (t)$ for all $t \in ℝ^{d}$ where

ϕ_{P} (t) = e^{i t^{⊤} a - \frac{1}{2} t^{⊤} V t} and ϕ_{δ_{b}} (t) = e^{i t^{⊤} b}

denote the characteristic functions for P and δ_b, respectively. Here, a and V are the mean vector and covariance matrix, respectively, of the distribution P, which must itself be Gaussian, although possibly degenerate. It follows that

(i t^{⊤} a_{n} - \frac{1}{2} t^{⊤} V_{n} t) \to (i t^{⊤} a - \frac{1}{2} t^{⊤} V t)

and, as $ϕ_{s_{n}^{'}} (- t) \to ϕ_{P} (- t)$ ,

(- i t^{⊤} a_{n} - \frac{1}{2} t^{⊤} V_{n} t) \to (- i t^{⊤} a - \frac{1}{2} t^{⊤} V t)

so $t^{⊤} a_{n} \to t^{⊤} a$ and $t^{⊤} V_{n} t \to t^{⊤} V t$ for all $t \in ℝ^{d}$ . Choosing t to be coordinate vectors, we see that this implies a_n → a and V_n → V coordinate-wise. An analogous argument allows us to conclude that b_n → b and U_n → 0_d×d. Thus, $G_{n} \to G = A V A^{⊤} + Γ$ , which is invertible, since Γ is positive definite, and so $G_{n}^{- 1} \to G^{- 1}$ .

The Woodbury matrix identity gives

T_{n} = {(U_{n}^{- 1} + G_{n}^{- 1} - S^{- 1})}^{- 1} = U_{n} - U_{n} {({(G_{n}^{- 1} - S^{- 1})}^{- 1} + U_{n})}^{- 1} U_{n} .

(27)
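A quick numerical sanity check of the identity in Equation 27 is sketched below. The matrices are random stand-ins rather than quantities from the paper. For the purposes of this example only, S is constructed so that S − G is positive definite, which makes D = G⁻¹ − S⁻¹ positive definite and hence invertible. The helper `random_spd` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3


def random_spd(scale):
    """Random symmetric positive definite d x d matrix (illustrative helper)."""
    M = rng.standard_normal((d, d))
    return scale * (M @ M.T + d * np.eye(d))


U = random_spd(0.01)          # a "small" covariance, playing the role of U_n
G = random_spd(1.0)           # plays the role of G_n
S = G + random_spd(1.0)       # assumption for this example: S - G is positive definite
D = np.linalg.inv(G) - np.linalg.inv(S)   # D = G^{-1} - S^{-1}, invertible here

# Left- and right-hand sides of Equation 27.
lhs = np.linalg.inv(np.linalg.inv(U) + D)
rhs = U - U @ np.linalg.inv(np.linalg.inv(D) + U) @ U

woodbury_gap = np.max(np.abs(lhs - rhs))
print(woodbury_gap)           # expected to be at round-off level
```

The right-hand side never inverts U_n itself, which is why this form is convenient for studying the limit U_n → 0.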

Since U_n → 0_d×d and ${({(G_{n}^{- 1} - S^{- 1})}^{- 1} + U_{n})}^{- 1} \to G^{- 1} - S^{- 1}$ , we see that T_n → 0_d×d.

To show T_n is eventually well-defined and strictly positive definite, it suffices to show the same for

T_{n}^{- 1} = U_{n}^{- 1} + D_{n}

where we set $D_{n} = G_{n}^{- 1} - S^{- 1}$. For a symmetric matrix $M \in ℝ^{d \times d}$, let λ₁(M) ≥ ··· ≥ λ_d(M) denote its ordered eigenvalues. As a corollary to Hoffman and Wielandt’s result (see Cor. 6.3.8 in Horn and Johnson, 2013), it follows that

\max_{j} | λ_{j} (T_{n}^{- 1}) - λ_{j} (U_{n}^{- 1}) | \leq {‖ D_{n} ‖}_{2}

where || · ||₂ denotes the Frobenius norm. Since ||D_n||₂ → ||G⁻¹ − S⁻¹||₂ < ∞, the difference between the jth ordered eigenvalues for $T_{n}^{- 1}$ and $U_{n}^{- 1}$ is upper bounded independently of n for 1 ≤ j ≤ d. Since U_n is positive definite and since U_n → 0_d×d, it follows that $λ_{j} (U_{n}^{- 1}) \geq λ_{d} (U_{n}^{- 1}) = 1 / λ_{1} (U_{n}) \to \infty$ . Hence, all eigenvalues of $T_{n}^{- 1}$ must eventually become positive, so that $T_{n}^{- 1}$ becomes positive definite, hence also T_n. For the means, we have

c_{n} = T_{n} U_{n}^{- 1} b_{n} + T_{n} G_{n}^{- 1} A a_{n} .

Because $T_{n} \to 0_{d \times d}$ , $G_{n}^{- 1} \to G^{- 1}$ , and a_n → a, we have $T_{n} G_{n}^{- 1} A a_{n} \to \vec{0}$ . Using Equation 27 for T_n gives

T_{n} U_{n}^{- 1} b_{n} = b_{n} - U_{n} {({(G_{n}^{- 1} - S^{- 1})}^{- 1} + U_{n})}^{- 1} b_{n},

where the eventual boundedness of ${({(G_{n}^{- 1} - S^{- 1})}^{- 1} + U_{n})}^{- 1}$ implies

U_{n} {({(G_{n}^{- 1} - S^{- 1})}^{- 1} + U_{n})}^{- 1} b_{n} \to \vec{0} .

As b_n → b, we conclude c_n → b. Since also $T_{n} \to 0_{d \times d}$, it follows that $p_{n}^{'} \overset{w}{\to} δ_{b} . □$
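The second half of the lemma can be illustrated numerically under simplifying assumptions. In the sketch below, U_n = U_0 / n for a fixed positive definite U_0, the matrix D is a fixed, deliberately negative definite stand-in for G_n⁻¹ − S⁻¹, the vector m is a fixed stand-in for G_n⁻¹ A a_n, and b_n = b for all n. None of these values come from the paper. The loop records the eigenvalue gap bounded by the Frobenius norm of D, the smallest eigenvalue of T_n⁻¹, the spectral norm of T_n, and the distance from c_n to b.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
M = rng.standard_normal((d, d))
U0 = M @ M.T + np.eye(d)          # positive definite; U_n = U0 / n -> 0
B = rng.standard_normal((d, d))
D = -(B @ B.T) - np.eye(d)        # stand-in for G_n^{-1} - S^{-1}; negative definite on purpose
b = np.array([1.0, -2.0, 0.5])    # limit point of the means b_n (here b_n = b)
m = rng.standard_normal(d)        # stand-in for G_n^{-1} A a_n, held fixed

frob_D = np.linalg.norm(D, "fro")
rows = []
for n in [1, 10**2, 10**4, 10**8]:
    U_inv = n * np.linalg.inv(U0)                 # U_n^{-1}
    T_inv = U_inv + D                             # T_n^{-1} = U_n^{-1} + D
    eig_T = np.linalg.eigvalsh(T_inv)             # ascending order
    eig_U = np.linalg.eigvalsh(U_inv)             # same ordering, so pairs match
    eig_gap = np.max(np.abs(eig_T - eig_U))       # bounded by ||D||_F
    T = np.linalg.inv(T_inv)
    c = T @ (U_inv @ b + m)                       # c_n
    rows.append((n, eig_gap, eig_T[0], np.linalg.norm(T, 2), np.linalg.norm(c - b)))

for n, gap, min_eig, t_norm, c_err in rows:
    print(f"n={n:>9d}  gap={gap:8.3f}  min eig={min_eig:12.3e}  "
          f"||T_n||={t_norm:9.2e}  ||c_n - b||={c_err:9.2e}")
```

With these choices T_1⁻¹ has only negative eigenvalues, so T_1 is not a valid covariance, while for large n the matrix T_n⁻¹ is positive definite, T_n shrinks to zero, and c_n approaches b. This matches the "eventually well-defined" wording of the lemma.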

Footnotes

with $h_{t} = b_{t} + H_{t} ν_{t}$ , $C_{t} = M_{t} H_{t}^{⊤}$ , and $N_{t} = Λ_{t} + H_{t} M_{t} H_{t}^{⊤}$

References

Abbeel P, Coates A, Montemerlo M, Ng AY, and Thrun S (2005). Discriminative training of Kalman filters. In Robotics: Sci. and Syst. (RSS).
Ajiboye AB, Willett FR, Young DR, Memberg WD, Murphy BA, Miller JP, Walter BL, Sweet JA, Hoyen HA, Keith MW, Peckham PH, Simeral JD, Donoghue JP, Hochberg LR, and Kirsch RF (2017). Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: a proof-of-concept demonstration. The Lancet, 389:1821–1830.
Arasaratnam I and Haykin S (2009). Cubature Kalman filters. IEEE Trans. Autom. Control, 54(6):1254–1269.
Arasaratnam I, Haykin S, and Elliott RJ (2007). Discrete-time nonlinear filtering algorithms using Gauss–Hermite quadrature. Proc. IEEE, 95(5):953–977.
Arulampalam MS, Maskell S, Gordon N, and Clapp T (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process, 50(2):174–188.
Battin RH and Levine GM (1970). Application of Kalman filtering techniques to the Apollo program. In Leondes CT, editor, Theory and Applications of Kalman Filtering, chapter 14. NATO, Advisory Group for Aerospace Research and Development.
Beneš VE (1981). Exact finite-dimensional filters for certain diffusions with nonlinear drift. Stochastics, 5(1–2):65–92.
Bensmaia SJ and Miller LE (2014). Restoring sensorimotor function through intracortical interfaces: progress and looming challenges. Nat. Rev. Neurosci, 15(5):313–325.
Bishop CH, Etherton BJ, and Majumdar SJ (2001). Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects. Mon. Weather Rev, 129(3):420–436.
Bouton CE, Shaikhouni A, Annetta NV, Bockbrader MA, Friedenberg DA, Nielson DM, Sharma G, Sederberg PB, Glenn BC, Mysiw WJ, Morgan AG, Deogaonkar M, and Rezai AR (2016). Restoring cortical control of functional movement in a human with quadriplegia. Nature, 533:247–250.
Brandman DM, Burkhart MC, Kelemen J, Franco B, Harrison MT, and Hochberg LR (2018a). Robust closed-loop control of a cursor in a person with tetraplegia using Gaussian process regression. Neural Comput, 30(11):2986–3008.
Brandman DM, Cash SS, and Hochberg LR (2017). Review: Human intracortical recording and neural decoding for brain-computer interfaces. IEEE Trans. Neural Syst. Rehabil. Eng, PP(99).
Brandman DM, Hosman T, Saab J, Burkhart MC, Shanahan BE, Ciancibello JG, Sarma AA, Milstein DJ, Vargas-Irwin CE, Franco B, Kelemen J, Blabe C, Murphy BA, Young DR, Willett FR, Pandarinath C, Stavisky SD, Kirsch RF, Walter BL, Bolu Ajiboye A, Cash SS, Eskandar EE, Miller JP, Sweet JA, Shenoy KV, Henderson JM, Jarosiewicz B, Harrison MT, Simeral JD, and Hochberg LR (2018b). Rapid calibration of an intracortical brain–computer interface for people with tetraplegia. J. Neural Eng, 15(2).
Brown RG and Hwang PYC (2012). Introduction to Random Signals and Applied Kalman Filtering. John Wiley & Sons, Inc., fourth edition.
Buehner M, McTaggart-Cowan R, and Heilliette S (2017). An ensemble Kalman filter for numerical weather prediction based on variational data assimilation: VarEnKF. Mon. Weather Rev, 145(2):617–635.
Burkhart MC (2019). A Discriminative Approach to Bayesian Filtering with Applications to Human Neural Decoding. PhD thesis, Brown University, Providence, Rhode Island, U.S.A.
Butler RW (2007). Saddlepoint approximations with applications, volume 22 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Cappé O, Godsill SJ, and Moulines E (2007). An overview of existing methods and recent advances in sequential Monte Carlo. Proc. IEEE, 95(5):899–924.
Cappé O, Moulines E, and Ryden T (2005). Inference in Hidden Markov Models. Springer-Verlag.
Castillo E, Guijarro-Berdiñas B, Fontenla-Romero O, and Alonso-Betanzos A (2010). A very fast learning method for neural networks based on sensitivity analysis. J. Mach. Learn. Res, 7:1159–1182.
Cedarbaum JM, Stambler N, Malta E, Fuller C, Hilt D, Thurmond B, and Nakanishi A (1999). The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function. J. Neurol. Sci, 169(1):13–21.
Chang JT and Pollard D (1997). Conditioning as disintegration. Stat. Neerl, 51(3):287–317.
Chen Z (2003). Bayesian filtering: From Kalman filters to particle filters, and beyond. Statistics, 182(1):1–69.
Choo K and Fleet DJ (2001). People tracking using hybrid Monte Carlo filtering. In Proc. Int. Conf. Comput. Vis, volume 2, pages 321–328.
Churchland MM, Cunningham JP, Kaufman MT, Foster JD, Nuyujukian P, Ryu SI, and Shenoy KV (2012). Neural population dynamics during reaching. Nature, 487(7405):1–20.
Collinger JL, Wodlinger B, Downey JE, Wang W, Tyler-Kabara EC, Weber DJ, McMorland AJC, Velliste M, Boninger ML, and Schwartz AB (2013). High-performance neuroprosthetic control by an individual with tetraplegia. Lancet, 381(9866):557–564.
Daum FE (1984). Exact finite dimensional nonlinear filters for continuous time processes with discrete time measurements. In IEEE Conf. Decis. Control, pages 16–22.
Daum FE (1986). Exact finite-dimensional nonlinear filters. IEEE Trans. Autom. Control, 31(7):616–622.
Daum FE and Huang J (2003). Curse of dimensionality and particle filters. In 2003 IEEE Aerosp. Conf. Proc, volume 4.
del Moral P (1996). Nonlinear filtering using random particles. Theory Probab. Appl, 40(4):690–701.
Douc R and Cappé O (2005). Comparison of resampling schemes for particle filtering. In Proc. Int. Symp. Image and Signal Process. Anal, pages 64–69.
Doucet A, Godsill S, and Andrieu C (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput, 10(3):197–208.
Duchi J, Hazan E, and Singer Y (2011). Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res, 12:2121–2159.
Elliott R (1994). Exact adaptive filters for Markov chains observed in Gaussian noise. Automatica, 30(9):1399–1408.
Evensen G (1994). Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res: Oceans, 99:10143–10162.
Fitts PM (1954). The information capacity of the human motor system in controlling the amplitude of movement. J. Exp. Psychol, 47(6):381–391.
Flint RD, Lindberg EW, Jordan LR, Miller LE, and Slutzky MW (2012). Accurate decoding of reaching movements from field potentials in the absence of spikes. J. Neural Eng, 9(4).
Foresee FD and Hagan MT (1997). Gauss-Newton approximation to Bayesian learning. In Int. Conf. Neural Netw, volume 3, pages 1930–1935.
Gelb A (1974). Applied Optimal Estimation. The MIT Press.
Georgopoulos AP, Kettner RE, and Schwartz AB (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci, 8(8):2928–2937.
Gerber M and Chopin N (2015). Sequential quasi Monte Carlo. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.), 77(3):509–579.
Gers FA, Schmidhuber J, and Cummins F (2000). Learning to forget: Continual prediction with LSTM. Neural Comput, 12(10):2451–2471.
Ghahramani Z and Hinton GE (2000). Variational learning for switching state-space models. Neural Comput, 12(4):831–864.
Gilja V, Pandarinath C, Blabe CH, Nuyujukian P, Simeral JD, Sarma AA, Sorice BL, Perge JA, Jarosiewicz B, Hochberg LR, Shenoy KV, and Henderson JM (2015). Clinical translation of a high-performance neural prosthesis. Nat. Med, 21(10):1142–1145.
Glorot X and Bengio Y (2010). Understanding the difficulty of training deep feedforward neural networks. In Int. Conf. Artif. Intell. Stats, volume 9, pages 249–256.
Gordon NJ, Salmond DJ, and Smith AFM (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F - Radar and Signal Process, 140(2):107–113.
Greff K, Srivastava RK, Koutník J, Steunebrink BR, and Schmidhuber J (2016). LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst, PP(99):1–11.
Grewal MS and Andrews AP (2010). Applications of Kalman filtering in aerospace 1960 to the present. IEEE Control Syst. Mag, 30(3):69–78.
Hagan MT and Menhaj MB (1994). Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Netw, 5(6):989–993.
Hall EC (1966). Case History of the Apollo Guidance Computer. MIT.
Handschin J (1970). Monte Carlo techniques for prediction and filtering of non-linear stochastic processes. Automatica, 6(4):555–563.
Handschin JE and Mayne DQ (1969). Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering. Int. J. Control, 9(5):547–559.
Hess R and Fern A (2009). Discriminatively trained particle filters for complex multi-object tracking. In Comput. Vis. Pattern Recognit, pages 240–247.
Hochberg LR, Bacher D, Jarosiewicz B, Masse NY, Simeral JD, Vogel J, Haddadin S, Liu J, Cash SS, van der Smagt P, and Donoghue JP (2012). Reach and grasp by people with tetraplegia using a neurally controlled robotic arm. Nature, 485(7398):372–375.
Hochberg LR and Donoghue JP (2006). Sensors for brain-computer interfaces. IEEE Eng. Med. Biol. Mag, pages 32–38.
Hochreiter S and Schmidhuber J (1997). Long short-term memory. Neural Comput, 9(8):1735–1780.
Horn RA and Johnson CR (2013). Matrix Analysis. Cambridge University Press, second edition.
Hosman T, Vilela M, Milstein D, Kelemen JN, Brandman DM, Hochberg LR, and Simeral JD (2019). BCI decoder performance comparison of an LSTM recurrent neural network and a Kalman filter in retrospective simulation. In Int. IEEE EMBS Conf. Neural Eng.
Hunt BR, Kostelich EJ, and Szunyogh I (2007). Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D: Nonlinear Phenom, 230(1):112–126.
Ioffe S and Szegedy C (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Bach F and Blei D, editors, Int. Conf. Mach. Learn, volume 37, pages 448–456. PMLR.
Ito K (2000). Gaussian filter for nonlinear filtering problems. In IEEE Conf. Decis. Control, volume 2.
Ito K and Xiong K (2000). Gaussian filters for nonlinear filtering problems. IEEE Trans. Autom. Control, pages 910–927.
Jarosiewicz B, Masse NY, Bacher D, Cash SS, Eskandar E, Friehs G, Donoghue JP, and Hochberg LR (2013). Advantages of closed-loop calibration in intracortical brain-computer interfaces for people with tetraplegia. J. Neural Eng, 10(4).
Jarosiewicz B, Sarma AA, Bacher D, Masse NY, Simeral JD, Sorice B, Oakley EM, Blabe C, Pandarinath C, Gilja V, Cash SS, Eskandar EN, Friehs GM, Henderson JM, Shenoy KV, Donoghue JP, and Hochberg LR (2015). Virtual typing by people with tetraplegia using a self-calibrating intracortical brain-computer interface. Sci. Transl. Med, 7(313):1–11.
Jozefowicz R, Zaremba W, and Sutskever I (2015). An empirical exploration of recurrent network architectures. In Bach F and Blei D, editors, Int. Conf. Mach. Learn, volume 37, pages 2342–2350. PMLR.
Julier SJ and Uhlmann JK (1997). New extension of the Kalman filter to nonlinear systems. Proc. SPIE, 3068:182–193.
Kalman RE (1960). A new approach to linear filtering and prediction problems. J. Basic Eng, 82(1):35–45.
Kalman RE and Bucy RS (1961). New results in linear filtering and prediction theory. J. Basic Eng, 83(1):95–108.
Kim S-P, Simeral JD, Hochberg LR, Donoghue JP, and Black MJ (2008). Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia. J. Neural Eng, 5(4).
Kitagawa G (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Stat, 5(1).
Koyama S, Pérez-Bolde LC, Shalizi CR, and Kass RE (2010). Approximate methods for state-space models. J. Am. Stat. Assoc, 105(489):170–180.
Kushner H (1967). Approximations to optimal nonlinear filters. IEEE Trans. Autom. Control, 12(5):546–556.
Lemon RN (2008). Descending pathways in motor control. Annu. Rev. Neurosci, 31:195–218.
Levenberg K (1944). A method for the solution of certain non-linear problems in least squares. Quart. Appl. Math, 2:164–168.
Liu JS (2008). Monte Carlo Strategies in Scientific Computing. Springer.
MacKay DJC (1992). Bayesian interpolation. Neural Comput, 4(3):415–447.
Majumdar SJ, Bishop CH, Etherton BJ, and Toth Z (2002). Adaptive sampling with the ensemble transform Kalman filter. Part II: Field program implementation. Mon. Weather Rev, 130(5):1356–1369.
Malik WQ, Hochberg LR, Donoghue JP, and Brown EN (2015). Modulation depth estimation and variable selection in state-space models for neural interfaces. IEEE Trans. Biomed. Eng, 62(2):570–581.
Marquardt DW (1963). An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Indust. Appl. Math, 11:431–441.
Masse NY, Jarosiewicz B, Simeral JD, Bacher D, Stavisky SD, Cash SS, Oakley EM, Berhanu E, Eskandar E, Friehs G, Hochberg LR, and Donoghue JP (2015). Non-causal spike filtering improves decoding of movement intention for intracortical BCIs. J. Neurosci. Methods, 244:94–103.
Maynard EM, Nordhausen CT, and Normann RA (1997). The Utah intracortical electrode array: a recording structure for potential brain-computer interfaces. Electroencephalogr. Clin. Neurophysiol, 102(3):228–239.
Metropolis N and Ulam S (1949). The Monte Carlo method. J. Am. Stat. Assoc, 44(247):335–341.
Minka TP (2001a). Expectation propagation for approximate Bayesian inference. Uncertain. Artif. Intell.
Minka TP (2001b). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, Cambridge, Massachusetts, U.S.A.
Nadaraya EA (1964). On a regression estimate. Teor. Verojatnost. i Primenen, 9:157–159.
Nørgaard M, Poulsen NK, and Ravn O (2000). New developments in state estimation for nonlinear systems. Automatica, 36(11):1627–1638.
Nuyujukian P, Albites Sanabria J, Saab J, Pandarinath C, Jarosiewicz B, Blabe CH, Franco B, Mernoff ST, Eskandar EN, Simeral JD, Hochberg LR, Shenoy KV, and Henderson JM (2018). Cortical control of a tablet computer by people with paralysis. PLOS ONE, 13(11).
Ott E, Hunt BR, Szunyogh I, Zimin AV, Kostelich EJ, Corazza M, Kalnay E, Patil DJ, and Yorke JA (2004). A local ensemble Kalman filter for atmospheric data assimilation. Tellus A, 56(5):415–428.
Pandarinath C, Gilja V, Blabe CH, Nuyujukian P, Sarma AA, Sorice BL, Eskandar EN, Hochberg LR, Henderson JM, and Shenoy KV (2015). Neural population dynamics in human motor cortex during movements in people with ALS. eLife, 4.
Pandarinath C, Nuyujukian P, Blabe CH, Sorice BL, Saab J, Willett F, Hochberg LR, Shenoy KV, and Henderson JM (2017). High performance communication by people with paralysis using an intracortical brain-computer interface. eLife, pages 1–27.
Pandarinath C, O’Shea DJ, Collins J, Jozefowicz R, Stavisky SD, Kao JC, Trautmann EM, Kaufman MT, Ryu SI, Hochberg LR, Henderson JM, Shenoy KV, Abbott LF, and Sussillo D (2018). Inferring single-trial neural population dynamics using sequential auto-encoders. Nat. Methods, 15(10):805–815.
Paninski L, Fellows MR, Hatsopoulos NG, and Donoghue JP (2004). Spatiotemporal tuning of motor cortical neurons for hand position and velocity. J. Clin. Neurophysiol, 91:515–532.
Pham V, Bluche T, Kermorvant C, and Louradour J (2014). Dropout improves recurrent neural networks for handwriting recognition. In Int. Conf. Front. Handwriting Recognit, pages 285–290.
Pohlmeyer E, Solla S, Perreault EJ, and Miller LE (2007). Prediction of upper limb muscle activity from motor cortical discharge during reaching. J. Neural Eng, 4:369–379.
Quang PB, Musso C, and Le Gland F (2015). The Kalman Laplace filter: A new deterministic algorithm for nonlinear Bayesian filtering. In Intern. Conf. Inf. Fusion, pages 1566–1573.
Quiñonero Candela J and Rasmussen CE (2005). A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res, 6:1939–1959.
Rao NG and Donoghue JP (2014). Cue to action processing in motor cortex populations. J. Neurophysiol, 111(2):441–453.
Rasmussen CE and Nickisch H (2010). Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res, 11:3011–3015.
Rasmussen CE and Williams CKI (2006). Gaussian processes for machine learning. MIT Press, Cambridge, MA.
Real E, Moore S, Selle A, Saxena S, Suematsu YL, Le Q, and Kurakin A (2017). Large-scale evolution of image classifiers. Int. Conf. Mach. Learn.
Särkkä S (2013). Bayesian Filtering and Smoothing. Cambridge University Press.
Schmidhuber J (2015). Deep learning in neural networks: An overview. Neural Netw, 61:85–117.
Schmidt SF, Weinberg JD, and Lukesh JS (1970). Application of Kalman filtering to the C-5 guidance and control system. In Leondes CT, editor, Theory and Applications of Kalman Filtering, chapter 13. NATO, Advisory Group for Aerospace Research and Development.
Schwartz AB (1994). Direct cortical representation of drawing. Science, 265(5171):540–542.
Shumway RH and Stoffer DS (1991). Dynamic linear models with switching. J. Am. Stat. Assoc, 86(415):763–769.
Simeral JD, Kim S-P, Black MJ, Donoghue JP, and Hochberg LR (2011). Neural control of cursor trajectory and click by a human with tetraplegia 1000 days after implant of an intracortical microelectrode array. J. Neural Eng, 8(2).
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, and Salakhutdinov R (2014). Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res, 15:1929–1958.
Stevenson IH and Kording KP (2011). How advances in neural recording affect data analysis. Nat. Neurosci, 14(2):139–142.
Sugiyama M, Suzuki T, and Kanamori T (2012). Density Ratio Estimation in Machine Learning. Cambridge University Press.
Sussillo D, Nuyujukian P, Fan JM, Kao JC, Stavisky SD, Ryu S, and Shenoy K (2012). A recurrent neural network for closed-loop intracortical brain–machine interface decoders. J. Neural Eng, 9(2).
Sussillo D, Stavisky SD, Kao JC, Ryu SI, and Shenoy KV (2016). Making brain–machine interfaces robust to future neural variability. Nat. Commun, 7.
van der Merwe R (2004). Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models. PhD thesis, Oregon Health & Science University, Portland, Oregon, U.S.A.
van der Vaart AW (1998). Asymptotic statistics. Cambridge University Press, Cambridge.
Vargas-Irwin CE, Brandman DM, Zimmermann JB, Donoghue JP, and Black MJ (2015). Spike train SIMilarity space (SSIMS): a framework for single neuron and ensemble data analysis. Neural Comput, 27(1):1–31.
Vargas-Irwin CE, Shakhnarovich G, Yadollahpour P, Mislow JMK, Black MJ, and Donoghue JP (2010). Decoding complete reach and grasp actions from local primary motor cortex populations. J. Neurosci, 30(29):9659–9669.
Velliste M, Perel S, Spalding MC, Whitford AS, and Schwartz AB (2008). Cortical control of a prosthetic arm for self-feeding. Nature, 453(7198):1098–101.
Walker B and Kording K (2013). The database for reaching experiments and models. PLOS ONE, 8(11).
Wan EA and van der Merwe R (2000). The unscented Kalman filter for nonlinear estimation. In Adaptive Syst. for Signal Process., Commun., and Control Symp, pages 153–158.
Watson GS (1964). Smooth regression analysis. Sankhyā Ser. A, 26:359–372.
Willett FR, Young DR, Murphy BA, Memberg WD, Blabe CH, Pandarinath C, Stavisky SD, Rezaii P, Saab J, Walter BL, Sweet JA, Miller JP, Henderson JM, Shenoy KV, Simeral JD, Jarosiewicz B, Hochberg LR, Kirsch RF, and Bolu Ajiboye A (2019). Principled BCI decoder design and parameter selection using a feedback control model. Sci. Rep, 9(8881).
Wodlinger B, Downey JE, Tyler-Kabara EC, Schwartz AB, Boninger ML, and Collinger JL (2015). Ten-dimensional anthropomorphic arm control in a human brain machine interface: difficulties, solutions, and limitations. J. Neural Eng, 12(1).
Wolpaw JR, Birbaumer N, McFarland DJ, Pfurtscheller G, and Vaughan TM (2002). Brain-computer interfaces for communication and control. Clin. Neurophysiol, 113(6):767–91.
Wu W, Black MJ, Gao Y, Bienenstock E, Serruya M, and Donoghue JP (2002). Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter. In SAB’02-Workshop on Motor Control in Humans and Robots: On the Interplay of Real Brains and Artificial Devices, pages 66–73.
Zaremba W, Sutskever I, and Vinyals O (2014). Recurrent neural network regularization. ArXiv e-prints.
Zeiler MD (2012). Adadelta: An adaptive learning rate method. ArXiv e-prints.
Zoph B and Le QV (2017). Neural architecture search with reinforcement learning. Int. Conf. Learn. Represent.

Associated Data


Supplementary Materials

Code to reproduce simulations discussed in manuscript: NIHMS1714316-supplement-code_to_reproduce_simulations_discussed_in_manuscript.zip (20.7 MB, zip)

LaTeX source: NIHMS1714316-supplement-LaTeX_source.tex (144.4 KB, tex)

[R1] Abbeel P, Coates A, Montemerlo M, Ng AY, and Thrun S (2005). Discriminative training of Kalman filters. In Robotics: Sci. and Syst. (RSS)

[R2] Ajiboye AB, Willett FR, Young DR, Memberg WD, Murphy BA, Miller JP, Walter BL, Sweet JA, Hoyen HA, Keith MW, Peckham PH, Simeral JD, Donoghue JP, Hochberg LR, and Kirsch RF (2017). Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: a proof-of-concept demonstration. The Lancet, 389:1821–1830. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Arasaratnam I and Haykin S (2009). Cubature Kalman filters. IEEE Trans. Autom. Control, 54(6):1254–1269. [Google Scholar]

[R4] Arasaratnam I, Haykin S, and Elliott RJ (2007). Discrete-time nonlinear filtering algorithms using Gauss–Hermite quadrature. Proc. IEEE, 95(5):953–977. [Google Scholar]

[R5] Arulampalam MS, Maskell S, Gordon N, and Clapp T (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process, 50(2):174–188. [Google Scholar]

[R6] Battin RH and Levine GM (1970). Application of Kalman filtering techniques to the Apollo program. In Leondes CT, editor, Theory and Applications of Kalman Filtering, chapter 14 NATO, Advisory Group for Aerospace Research and Development. [Google Scholar]

[R7] Beneš VE (1981). Exact finite-dimensional filters for certain diffusions with nonlinear drift. Stochastics, 5(1–2):65–92. [Google Scholar]

[R8] Bensmaia SJ and Miller LE (2014). Restoring sensorimotor function through intracortical interfaces: progress and looming challenges. Nat. Rev. Neurosci, 15(5):313–325. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Bishop CH, Etherton BJ, and Majumdar SJ (2001). Adaptive sampling with the ensemble transform Kalman filter. part I: Theoretical aspects. Mon. Weather Rev, 129(3):420–436. [Google Scholar]

[R10] Bouton CE, Shaikhouni A, Annetta NV, Bockbrader MA, Friedenberg DA, Nielson DM, Sharma G, Sederberg PB, Glenn BC, Mysiw WJ, Morgan AG, Deogaonkar M, and Rezai AR (2016). Restoring cortical control of functional movement in a human with quadriplegia. Nature, 533:247–250. [DOI] [PubMed] [Google Scholar]

[R11] Brandman DM, Burkhart MC, Kelemen J, Franco B, Harrison MT, and Hochberg LR (2018a). Robust closed-loop control of a cursor in a person with tetraplegia using gaussian process regression. Neural Comput, 30(11):2986–3008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Brandman DM, Cash SS, and Hochberg LR (2017). Review: Human intracortical recording and neural decoding for brain-computer interfaces. IEEE Trans. Neural Syst. Rehabil. Eng, PP(99). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Brandman DM, Hosman T, Saab J, Burkhart MC, Shanahan BE, Ciancibello JG, Sarma AA, Milstein DJ, Vargas-Irwin CE, Franco B, Kelemen J, Blabe C, Murphy BA, Young DR, Willett FR, Pandarinath C, Stavisky SD, Kirsch RF, Walter BL, Bolu Ajiboye A, Cash SS, Eskandar EE, Miller JP, Sweet JA, Shenoy KV, Henderson JM, Jarosiewicz B, Harrison MT, Simeral JD, and Hochberg LR (2018b). Rapid calibration of an intracortical brain–computer interface for people with tetraplegia. J. Neural Eng, 15(2). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Brown RG and Hwang PYC (2012). Introduction to Random Signals and Applied Kalman Filtering John Wiley & Sons, Inc., fourth edition. [Google Scholar]

[R15] Buehner M, McTaggart-Cowan R, and Heilliette S (2017). An ensemble Kalman filter for numerical weather prediction based on variational data assimilation: Varenkf. Mon. Weather Rev, 145(2):617–635. [Google Scholar]

[R16] Burkhart MC (2019). A Discriminative Approach to Bayesian Filtering with Applications to Human Neural Decoding PhD thesis, Brown University, Providence, Rhode Island, U.S.A. [Google Scholar]

[R17] Butler RW (2007). Saddlepoint approximations with applications, volume 22 of Cambridge Series in Statistical and Probabilistic Mathematics Cambridge University Press. [Google Scholar]

[R18] Cappé O, Godsill SJ, and Moulines E (2007). An overview of existing methods and recent advances in sequential Monte Carlo. Proc. IEEE, 95(5):899–924. [Google Scholar]

[R19] Cappé O, Moulines E, and Ryden T (2005). Inference in Hidden Markov Models. Springer-Verlag.

[R20] Castillo E, Guijarro-Berdiñas B, Fontenla-Romero O, and Alonso-Betanzos A (2010). A very fast learning method for neural networks based on sensitivity analysis. J. Mach. Learn. Res, 7:1159–1182.

[R21] Cedarbaum JM, Stambler N, Malta E, Fuller C, Hilt D, Thurmond B, and Nakanishi A (1999). The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function. J. Neurol. Sci, 169(1):13–21.

[R22] Chang JT and Pollard D (1997). Conditioning as disintegration. Stat. Neerl, 51(3):287–317.

[R23] Chen Z (2003). Bayesian filtering: From Kalman filters to particle filters, and beyond. Statistics, 182(1):1–69.

[R24] Choo K and Fleet DJ (2001). People tracking using hybrid Monte Carlo filtering. In Proc. Int. Conf. Comput. Vis, volume 2, pages 321–328.

[R25] Churchland MM, Cunningham JP, Kaufman MT, Foster JD, Nuyujukian P, Ryu SI, and Shenoy KV (2012). Neural population dynamics during reaching. Nature, 487(7405):1–20.

[R26] Collinger JL, Wodlinger B, Downey JE, Wang W, Tyler-Kabara EC, Weber DJ, McMorland AJC, Velliste M, Boninger ML, and Schwartz AB (2013). High-performance neuroprosthetic control by an individual with tetraplegia. Lancet, 381(9866):557–564.

[R27] Daum FE (1984). Exact finite dimensional nonlinear filters for continuous time processes with discrete time measurements. In IEEE Conf. Decis. Control, pages 16–22.

[R28] Daum FE (1986). Exact finite-dimensional nonlinear filters. IEEE Trans. Autom. Control, 31(7):616–622.

[R29] Daum FE and Huang J (2003). Curse of dimensionality and particle filters. In 2003 IEEE Aerosp. Conf. Proc, volume 4.

[R30] del Moral P (1996). Nonlinear filtering using random particles. Theory Probab. Appl, 40(4):690–701.

[R31] Douc R and Cappé O (2005). Comparison of resampling schemes for particle filtering. In Proc. Int. Symp. Image and Signal Process. Anal, pages 64–69.

[R32] Doucet A, Godsill S, and Andrieu C (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput, 10(3):197–208.

[R33] Duchi J, Hazan E, and Singer Y (2011). Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res, 12:2121–2159.

[R34] Elliott R (1994). Exact adaptive filters for Markov chains observed in Gaussian noise. Automatica, 30(9):1399–1408.

[R35] Evensen G (1994). Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res: Oceans, 99:10143–10162.

[R36] Fitts PM (1954). The information capacity of the human motor system in controlling the amplitude of movement. J. Exp. Psychol, 47(6):381–391.

[R37] Flint RD, Lindberg EW, Jordan LR, Miller LE, and Slutzky MW (2012). Accurate decoding of reaching movements from field potentials in the absence of spikes. J. Neural Eng, 9(4).

[R38] Foresee FD and Hagan MT (1997). Gauss-Newton approximation to Bayesian learning. In Int. Conf. Neural Netw, volume 3, pages 1930–1935.

[R39] Gelb A (1974). Applied Optimal Estimation. The MIT Press.

[R40] Georgopoulos AP, Kettner RE, and Schwartz AB (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci, 8(8):2928–2937.

[R41] Gerber M and Chopin N (2015). Sequential quasi Monte Carlo. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.), 77(3):509–579.

[R42] Gers FA, Schmidhuber J, and Cummins F (2000). Learning to forget: Continual prediction with LSTM. Neural Comput, 12(10):2451–2471.

[R43] Ghahramani Z and Hinton GE (2000). Variational learning for switching state-space models. Neural Comput, 12(4):831–864.

[R44] Gilja V, Pandarinath C, Blabe CH, Nuyujukian P, Simeral JD, Sarma AA, Sorice BL, Perge JA, Jarosiewicz B, Hochberg LR, Shenoy KV, and Henderson JM (2015). Clinical translation of a high-performance neural prosthesis. Nat. Med, 21(10):1142–1145.

[R45] Glorot X and Bengio Y (2010). Understanding the difficulty of training deep feedforward neural networks. In Int. Conf. Artif. Intell. Stats, volume 9, pages 249–256.

[R46] Gordon NJ, Salmond DJ, and Smith AFM (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F - Radar and Signal Process, 140(2):107–113.

[R47] Greff K, Srivastava RK, Koutník J, Steunebrink BR, and Schmidhuber J (2016). LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst, PP(99):1–11.

[R48] Grewal MS and Andrews AP (2010). Applications of Kalman filtering in aerospace 1960 to the present. IEEE Control Syst. Mag, 30(3):69–78.

[R49] Hagan MT and Menhaj MB (1994). Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Netw, 5(6):989–993.

[R50] Hall EC (1966). Case History of the Apollo Guidance Computer. MIT.

[R51] Handschin J (1970). Monte Carlo techniques for prediction and filtering of non-linear stochastic processes. Automatica, 6(4):555–563.

[R52] Handschin JE and Mayne DQ (1969). Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering. Int. J. Control, 9(5):547–559.

[R53] Hess R and Fern A (2009). Discriminatively trained particle filters for complex multi-object tracking. In Comput. Vis. Pattern Recognit, pages 240–247.

[R54] Hochberg LR, Bacher D, Jarosiewicz B, Masse NY, Simeral JD, Vogel J, Haddadin S, Liu J, Cash SS, van der Smagt P, and Donoghue JP (2012). Reach and grasp by people with tetraplegia using a neurally controlled robotic arm. Nature, 485(7398):372–375.

[R55] Hochberg LR and Donoghue JP (2006). Sensors for brain-computer interfaces. IEEE Eng. Med. Biol. Mag, pages 32–38.

[R56] Hochreiter S and Schmidhuber J (1997). Long short-term memory. Neural Comput, 9(8):1735–1780.

[R57] Horn RA and Johnson CR (2013). Matrix Analysis. Cambridge University Press, second edition.

[R58] Hosman T, Vilela M, Milstein D, Kelemen JN, Brandman DM, Hochberg LR, and Simeral JD (2019). BCI decoder performance comparison of an LSTM recurrent neural network and a Kalman filter in retrospective simulation. In Int. IEEE EMBS Conf. Neural Eng.

[R59] Hunt BR, Kostelich EJ, and Szunyogh I (2007). Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D: Nonlinear Phenom, 230(1):112–126.

[R60] Ioffe S and Szegedy C (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Bach F and Blei D, editors, Int. Conf. Mach. Learn, volume 37, pages 448–456. PMLR.

[R61] Ito K (2000). Gaussian filter for nonlinear filtering problems. In IEEE Conf. Decis. Control, volume 2.

[R62] Ito K and Xiong K (2000). Gaussian filters for nonlinear filtering problems. IEEE Trans. Autom. Control, pages 910–927.

[R63] Jarosiewicz B, Masse NY, Bacher D, Cash SS, Eskandar E, Friehs G, Donoghue JP, and Hochberg LR (2013). Advantages of closed-loop calibration in intracortical brain-computer interfaces for people with tetraplegia. J. Neural Eng, 10(4).

[R64] Jarosiewicz B, Sarma AA, Bacher D, Masse NY, Simeral JD, Sorice B, Oakley EM, Blabe C, Pandarinath C, Gilja V, Cash SS, Eskandar EN, Friehs GM, Henderson JM, Shenoy KV, Donoghue JP, and Hochberg LR (2015). Virtual typing by people with tetraplegia using a self-calibrating intracortical brain-computer interface. Sci. Transl. Med, 7(313):1–11.

[R65] Jozefowicz R, Zaremba W, and Sutskever I (2015). An empirical exploration of recurrent network architectures. In Bach F and Blei D, editors, Int. Conf. Mach. Learn, volume 37, pages 2342–2350. PMLR.

[R66] Julier SJ and Uhlmann JK (1997). New extension of the Kalman filter to nonlinear systems. Proc. SPIE, 3068:182–193.

[R67] Kalman RE (1960). A new approach to linear filtering and prediction problems. J. Basic Eng, 82(1):35–45.

[R68] Kalman RE and Bucy RS (1961). New results in linear filtering and prediction theory. J. Basic Eng, 83(1):95–108.

[R69] Kim S-P, Simeral JD, Hochberg LR, Donoghue JP, and Black MJ (2008). Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia. J. Neural Eng, 5(4).

[R70] Kitagawa G (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Stat, 5(1).

[R71] Koyama S, Pérez-Bolde LC, Shalizi CR, and Kass RE (2010). Approximate methods for state-space models. J. Am. Stat. Assoc, 105(489):170–180.

[R72] Kushner H (1967). Approximations to optimal nonlinear filters. IEEE Trans. Autom. Control, 12(5):546–556.

[R73] Lemon RN (2008). Descending pathways in motor control. Annu. Rev. Neurosci, 31:195–218.

[R74] Levenberg K (1944). A method for the solution of certain non-linear problems in least squares. Quart. Appl. Math, 2:164–168.

[R75] Liu JS (2008). Monte Carlo Strategies in Scientific Computing. Springer.

[R76] MacKay DJC (1992). Bayesian interpolation. Neural Comput, 4(3):415–447.

[R77] Majumdar SJ, Bishop CH, Etherton BJ, and Toth Z (2002). Adaptive sampling with the ensemble transform Kalman filter. Part II: Field program implementation. Mon. Weather Rev, 130(5):1356–1369.

[R78] Malik WQ, Hochberg LR, Donoghue JP, and Brown EN (2015). Modulation depth estimation and variable selection in state-space models for neural interfaces. IEEE Trans. Biomed. Eng, 62(2):570–581.

[R79] Marquardt DW (1963). An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Indust. Appl. Math, 11:431–441.

[R80] Masse NY, Jarosiewicz B, Simeral JD, Bacher D, Stavisky SD, Cash SS, Oakley EM, Berhanu E, Eskandar E, Friehs G, Hochberg LR, and Donoghue JP (2015). Non-causal spike filtering improves decoding of movement intention for intracortical BCIs. J. Neurosci. Methods, 244:94–103.

[R81] Maynard EM, Nordhausen CT, and Normann RA (1997). The Utah intracortical electrode array: a recording structure for potential brain-computer interfaces. Electroencephalogr. Clin. Neurophysiol, 102(3):228–239.

[R82] Metropolis N and Ulam S (1949). The Monte Carlo method. J. Am. Stat. Assoc, 44(247):335–341.

[R83] Minka TP (2001a). Expectation propagation for approximate Bayesian inference. Uncertain. Artif. Intell.

[R84] Minka TP (2001b). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, Cambridge, Massachusetts, U.S.A.

[R85] Nadaraya EA (1964). On a regression estimate. Teor. Verojatnost. i Primenen, 9:157–159.

[R86] Nørgaard M, Poulsen NK, and Ravn O (2000). New developments in state estimation for nonlinear systems. Automatica, 36(11):1627–1638.

[R87] Nuyujukian P, Albites Sanabria J, Saab J, Pandarinath C, Jarosiewicz B, Blabe CH, Franco B, Mernoff ST, Eskandar EN, Simeral JD, Hochberg LR, Shenoy KV, and Henderson JM (2018). Cortical control of a tablet computer by people with paralysis. PLOS ONE, 13(11).

[R88] Ott E, Hunt BR, Szunyogh I, Zimin AV, Kostelich EJ, Corazza M, Kalnay E, Patil DJ, and Yorke JA (2004). A local ensemble Kalman filter for atmospheric data assimilation. Tellus A, 56(5):415–428.

[R89] Pandarinath C, Gilja V, Blabe CH, Nuyujukian P, Sarma AA, Sorice BL, Eskandar EN, Hochberg LR, Henderson JM, and Shenoy KV (2015). Neural population dynamics in human motor cortex during movements in people with ALS. eLife, 4.

[R90] Pandarinath C, Nuyujukian P, Blabe CH, Sorice BL, Saab J, Willett F, Hochberg LR, Shenoy KV, and Henderson JM (2017). High performance communication by people with paralysis using an intracortical brain-computer interface. eLife, pages 1–27.

[R91] Pandarinath C, O’Shea DJ, Collins J, Jozefowicz R, Stavisky SD, Kao JC, Trautmann EM, Kaufman MT, Ryu SI, Hochberg LR, Henderson JM, Shenoy KV, Abbott LF, and Sussillo D (2018). Inferring single-trial neural population dynamics using sequential auto-encoders. Nat. Methods, 15(10):805–815.

[R92] Paninski L, Fellows MR, Hatsopoulos NG, and Donoghue JP (2004). Spatiotemporal tuning of motor cortical neurons for hand position and velocity. J. Clin. Neurophysiol, 91:515–532.

[R93] Pham V, Bluche T, Kermorvant C, and Louradour J (2014). Dropout improves recurrent neural networks for handwriting recognition. In Int. Conf. Front. Handwriting Recognit, pages 285–290.

[R94] Pohlmeyer E, Solla S, Perreault EJ, and Miller LE (2007). Prediction of upper limb muscle activity from motor cortical discharge during reaching. J. Neural Eng, 4:369–379.

[R95] Quang PB, Musso C, and Le Gland F (2015). The Kalman Laplace filter: A new deterministic algorithm for nonlinear Bayesian filtering. In Intern. Conf. Inf. Fusion, pages 1566–1573.

[R96] Quiñonero Candela J and Rasmussen CE (2005). A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res, 6:1939–1959.

[R97] Rao NG and Donoghue JP (2014). Cue to action processing in motor cortex populations. J. Neurophysiol, 111(2):441–453.

[R98] Rasmussen CE and Nickisch H (2010). Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res, 11:3011–3015.

[R99] Rasmussen CE and Williams CKI (2006). Gaussian processes for machine learning. MIT Press, Cambridge, MA.

[R100] Real E, Moore S, Selle A, Saxena S, Suematsu YL, Le Q, and Kurakin A (2017). Large-scale evolution of image classifiers. Int. Conf. Mach. Learn.

[R101] Särkkä S (2013). Bayesian Filtering and Smoothing. Cambridge University Press.

[R102] Schmidhuber J (2015). Deep learning in neural networks: An overview. Neural Netw, 61:85–117.

[R103] Schmidt SF, Weinberg JD, and Lukesh JS (1970). Application of Kalman filtering to the C-5 guidance and control system. In Leondes CT, editor, Theory and Applications of Kalman Filtering, chapter 13. NATO, Advisory Group for Aerospace Research and Development.

[R104] Schwartz AB (1994). Direct cortical representation of drawing. Science, 265(5171):540–542.

[R105] Shumway RH and Stoffer DS (1991). Dynamic linear models with switching. J. Am. Stat. Assoc, 86(415):763–769.

[R106] Simeral JD, Kim S-P, Black MJ, Donoghue JP, and Hochberg LR (2011). Neural control of cursor trajectory and click by a human with tetraplegia 1000 days after implant of an intracortical microelectrode array. J. Neural Eng, 8(2).

[R107] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, and Salakhutdinov R (2014). Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res, 15:1929–1958.

[R108] Stevenson IH and Kording KP (2011). How advances in neural recording affect data analysis. Nat. Neurosci, 14(2):139–142.

[R109] Sugiyama M, Suzuki T, and Kanamori T (2012). Density Ratio Estimation in Machine Learning. Cambridge University Press.

[R110] Sussillo D, Nuyujukian P, Fan JM, Kao JC, Stavisky SD, Ryu S, and Shenoy K (2012). A recurrent neural network for closed-loop intracortical brain–machine interface decoders. J. Neural Eng, 9(2).

[R111] Sussillo D, Stavisky SD, Kao JC, Ryu SI, and Shenoy KV (2016). Making brain–machine interfaces robust to future neural variability. Nat. Commun, 7.

[R112] van der Merwe R (2004). Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models. PhD thesis, Oregon Health & Science University, Portland, Oregon, U.S.A.

[R113] van der Vaart AW (1998). Asymptotic statistics. Cambridge University Press, Cambridge.

[R114] Vargas-Irwin CE, Brandman DM, Zimmermann JB, Donoghue JP, and Black MJ (2015). Spike train SIMilarity space (SSIMS): a framework for single neuron and ensemble data analysis. Neural Comput, 27(1):1–31.

[R115] Vargas-Irwin CE, Shakhnarovich G, Yadollahpour P, Mislow JMK, Black MJ, and Donoghue JP (2010). Decoding complete reach and grasp actions from local primary motor cortex populations. J. Neurosci, 30(29):9659–9669.

[R116] Velliste M, Perel S, Spalding MC, Whitford AS, and Schwartz AB (2008). Cortical control of a prosthetic arm for self-feeding. Nature, 453(7198):1098–101.

[R117] Walker B and Kording K (2013). The database for reaching experiments and models. PLOS ONE, 8(11).

[R118] Wan EA and van der Merwe R (2000). The unscented Kalman filter for nonlinear estimation. In Adaptive Syst. for Signal Process., Commun., and Control Symp, pages 153–158.

[R119] Watson GS (1964). Smooth regression analysis. Sankhyā Ser. A, 26:359–372.

[R120] Willett FR, Young DR, Murphy BA, Memberg WD, Blabe CH, Pandarinath C, Stavisky SD, Rezaii P, Saab J, Walter BL, Sweet JA, Miller JP, Henderson JM, Shenoy KV, Simeral JD, Jarosiewicz B, Hochberg LR, Kirsch RF, and Bolu Ajiboye A (2019). Principled BCI decoder design and parameter selection using a feedback control model. Sci. Rep, 9(8881).

[R121] Wodlinger B, Downey JE, Tyler-Kabara EC, Schwartz AB, Boninger ML, and Collinger JL (2015). Ten-dimensional anthropomorphic arm control in a human brain machine interface: difficulties, solutions, and limitations. J. Neural Eng, 12(1).

[R122] Wolpaw JR, Birbaumer N, McFarland DJ, Pfurtscheller G, and Vaughan TM (2002). Brain-computer interfaces for communication and control. Clin. Neurophysiol, 113(6):767–91.

[R123] Wu W, Black MJ, Gao Y, Bienenstock E, Serruya M, and Donoghue JP (2002). Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter. In SAB’02-Workshop on Motor Control in Humans and Robots: On the Interplay of Real Brains and Artificial Devices, pages 66–73.

[R124] Zaremba W, Sutskever I, and Vinyals O (2014). Recurrent neural network regularization. ArXiv e-prints.

[R125] Zeiler MD (2012). Adadelta: An adaptive learning rate method. ArXiv e-prints.

[R126] Zoph B and Le QV (2017). Neural architecture search with reinforcement learning. Int. Conf. Learn. Represent.
