Abstract
For dynamical systems that can be modelled as asymptotically stable linear systems forced by Gaussian noise, this paper develops methods to infer (estimate) their dominant modes from observations in real time. The modes can be real or complex. For a real mode (monotone decay), the goal is to infer its damping rate and mode shape. For a complex mode (oscillatory decay), the goal is to infer its frequency, damping rate and (complex) mode shape. Their amplitudes and correlations are encoded in a mode covariance matrix that is also to be inferred. The work is motivated and illustrated by the problem of detection of oscillations in power flow in AC electrical networks. Suggestions of some other applications are given.
Keywords: inference, linear stochastic process, mode, Gaussian process, Kalman filter, AC power networks
1. Introduction
In January 2015, National Grid asked if I could improve their methods for detection of oscillations in power flow, to estimate frequency, damping constant, mode shape and amplitude. Fig. 1 from [1] shows an example where such a mode of oscillation became clear. This type of oscillation is called ‘inter-area’; for a review of oscillations in electrical power flow, see [2]. National Grid is interested in detecting such modes in nascent form, so that they can design and install suitable controllers to limit them.
As my brother David, author of [3], was expert in data analysis, I asked him what he would recommend. He responded ‘Use a Gaussian process’. It looked a good idea and this paper is the result.
I specialized to linear stochastic processes because they are physically well motivated and the inference (also known as estimation) of their state can be carried out in streaming mode with a constant amount of computation per observation. It is a well-developed class, e.g. [4], with major results for inference dating back to the 1960s.
Inference methods for linear stochastic processes are generally used to infer the state of a system with known parameters from observations (the parameters consist of the system matrix, the noise covariance matrix, the observation function and the observational noise covariance matrix). They can also be used to infer the observable parameters of the system if the others are known or a strong prior probability distribution for them is assumed. This generally requires knowing a significant fraction of the parameters to a significant accuracy in advance. Furthermore, for a large system the remaining parameter space can be high dimensional, making the inference imprecise. If the goal is to infer the possible modes of the system, the transformation from the system matrix to modes (computation of eigenvalues and eigenvectors) can be highly sensitive to the system matrix, adding yet more uncertainty, so it would be better to devise a method to infer the modes directly. Another consideration is that in many circumstances one would want a method that can run in real time, as a monitor and eventually for use in control.
The point of this paper is to present a method to infer the ‘dominant’ modes, those that have significant amplitude, for a linear stochastic process of many degrees of freedom, with significantly lower dimension of parameter-fit than for the whole system, and to do this in streaming mode. I consider both complex modes (those with oscillatory decay) and real modes (those with monotone decay). What I will propose here has precedents, yet I hope it will be valuable, particularly for AC power flow.
In addition to detecting oscillations in power flow in electricity networks, I envisage the method to be useful in various other contexts, for example detecting soft modes in civil engineering structures, inferring the internal structure of the sun from observation of acoustic waves at the surface (helioseismology), understanding gene expression, and studying business cycles.
The paper starts by presenting my approach to inferring dominant modes. Then it specializes to a method to infer them in streaming mode. An extension is given to filtered Gaussian noise forcing. Next, a formulation of AC power flow network dynamics is proposed, to set it up for potential treatment by the method. Implementing and testing the method on an AC power system, whether real or simulated, is deferred to future work, though some simple data analysis tests are reported here. Finally, a discussion section compares the method with other approaches and proposes some other applications. For the uninitiated, a series of appendices gives a pedagogical introduction to Gaussian processes (GPs), linear stochastic processes and Bayesian inference for them, including Kalman filtering, plus some useful formulae and covariance functions for the AC power model.
2. Fitting dominant modes
Suppose an autonomous differentiable dynamical system is subject to random forcing near an asymptotically stable equilibrium m. Linearizing about the equilibrium produces a system of the form
$$\dot{x} = A(x - m) + \xi \tag{2.1}$$
with A an asymptotically stable matrix. Suppose ξ is a (multi-dimensional) Gaussian white noise process with zero mean and auto-correlation 〈ξ(t)ξ(s)^T〉 = Cξ δ(t − s). Modifications of the noise process will be discussed in §4.
Suppose observations are taken at an increasing sequence of times ti (not necessarily equally spaced), in the form
$$y_i = Z_i\, x(t_i) + \zeta_i \tag{2.2}$$
with Zi being observation matrices (not necessarily all the same) and ζi independent zero-mean Gaussians with given covariance matrices, representing measurement error.
From the observations y and knowing m, A, Cξ, Zi and the measurement noise covariances, a standard approach is to infer a probability distribution for x(t); the method reduces to linear algebra (appendix A.4). The next level of inference, if m, A, Cξ are not known but one has a prior probability on their joint distribution, is to infer them too from the observations (one could also infer the measurement noise covariances if they are not known). This is a nonlinear problem in Bayesian inference but can be tackled by Monte Carlo methods or by gradient methods for the likelihood. After that, one could determine the eigenvalues and eigenvectors of the resulting matrix A, thereby inferring the modes of the system, and one could compute their amplitudes and covariance under the noise process with the inferred Cξ.
Instead of the above, I propose to fit the dominant modes of A and their covariance, directly from the observations, without any prior on A or Cξ. The dominant modes are those that best explain the observations, in the sense of Bayesian model comparison (appendix A.5).
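The idea is that the system matrix A can always be put into a block-diagonal form D, i.e.
$$A = B D B^{-1}$$
for some invertible matrix B (that I take real), with the diagonal blocks of D taking simple forms, e.g. a single (negative) real number −λ or a 2 × 2 block of the form
$$\begin{pmatrix} -\alpha & -\omega \\ \omega & -\alpha \end{pmatrix}$$
with α > 0, which has eigenvalues −α ± iω.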
The former case represents a real mode, the latter a complex (or oscillatory) mode. More complicated blocks may be required in the case of multiple eigenvalues. They may also be advisable for groups of nearby eigenvalues, but for present purposes those refinements are ignored. The columns of B are mode shape vectors. Columns for complex modes must be taken in pairs that I call the real and imaginary parts.
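As a concrete illustration of the block-diagonal form, the following minimal sketch assembles D from a list of real decay rates and a list of (α, ω) pairs. The names `build_D` and `block_eigenvalues` are hypothetical, and the sign placement of ω inside the 2 × 2 block is an assumed convention; either placement gives the eigenvalues −α ± iω.

```python
import cmath

def build_D(real_rates, complex_modes):
    """Assemble block-diagonal D as a list of rows.

    real_rates    : decay rates lambda_n > 0, one 1x1 block (-lambda_n) each
    complex_modes : pairs (alpha_n, omega_n), one 2x2 block each
    """
    d = len(real_rates) + 2 * len(complex_modes)
    D = [[0.0] * d for _ in range(d)]
    k = 0
    for lam in real_rates:
        D[k][k] = -lam
        k += 1
    for alpha, omega in complex_modes:
        # Assumed convention: [[-alpha, -omega], [omega, -alpha]]
        D[k][k] = -alpha
        D[k][k + 1] = -omega
        D[k + 1][k] = omega
        D[k + 1][k + 1] = -alpha
        k += 2
    return D

def block_eigenvalues(alpha, omega):
    """Eigenvalues of one 2x2 block, from its trace and determinant."""
    trace = -2.0 * alpha
    det = alpha * alpha + omega * omega
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0

# One real mode (decay rate 0.5) and one lightly damped oscillatory mode.
D = build_D(real_rates=[0.5], complex_modes=[(0.05, 3.0)])
ev_plus, ev_minus = block_eigenvalues(0.05, 3.0)
```

Here d = Nr + 2Nc = 3, and the oscillatory block contributes the conjugate pair −0.05 ± 3i.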
Thus, one can think of the state x as being an observation
$$x = m + B u \tag{2.3}$$
on a process for the amplitudes u of modes:
$$\dot{u} = D u + \eta \tag{2.4}$$
with D block diagonal, and η Gaussian white noise with covariance matrix Cη = B^{−1} Cξ B^{−T}. The real observations yi become observations on the mode process:
$$y_i = Z_i\,(m + B\,u(t_i)) + \zeta_i \tag{2.5}$$
The beauty of this view is that one can then forget about A and x and consider equations (2.4), (2.5) as a self-contained inference problem for u, m, D, Cη, B, given the set of Zi, ti, yi and the measurement noise covariances (which could also be considered unknown). Furthermore, there is no need to keep D, B and m of the same dimension as A. If M is the dimension of the space spanned by the rows of all the matrices Zi then one can try to fit equations (2.4), (2.5) to the data for much smaller sizes of D than A, say dimension d × d, with Nr and Nc real and complex modes (d = Nr + 2Nc), and then fit an equilibrium vector m of dimension M and mode shape matrix B of dimension M × d. The unfitted modes will just contribute to the inferred forcing and measurement noises.
Bayesian model comparison (appendix A.5) allows one to compare the evidence for models with different numbers Nr and Nc of real and complex modes. The fit with the highest Bayes factor gives the dominant modes.
Finally, the covariance matrix S = 〈u u^T〉 for the resulting mode amplitudes is given by (appendix A.3)
$$S = \int_0^\infty e^{Dt}\, C_\eta\, e^{D^T t}\, dt \tag{2.6}$$
which can be evaluated explicitly since D is block diagonal with small blocks. For example, the term corresponding to two real modes m, n is $S_{mn} = (C_\eta)_{mn}/(\lambda_m + \lambda_n)$. Formulae for the other cases can be derived using results in appendix A.7.
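For real modes only, the stationary covariance can be written down entry by entry. The sketch below assumes that S solves the standard Lyapunov equation D S + S D^T + Cη = 0 for a stable linear system forced by white noise, so that with D = diag(−λ) each entry is (Cη)mn/(λm + λn). The function names and the numbers are made up for illustration, and complex blocks are not handled.

```python
def stationary_cov_real_modes(rates, C_eta):
    """S[m][n] = C_eta[m][n] / (lambda_m + lambda_n) for D = diag(-rates)."""
    n = len(rates)
    return [[C_eta[a][b] / (rates[a] + rates[b]) for b in range(n)]
            for a in range(n)]

def lyapunov_residual(rates, C_eta, S):
    """Largest |(D S + S D^T + C_eta)[m][n]| with D = diag(-rates)."""
    n = len(rates)
    return max(abs(-(rates[a] + rates[b]) * S[a][b] + C_eta[a][b])
               for a in range(n) for b in range(n))

rates = [0.5, 2.0]                       # decay rates of two real modes
C_eta = [[1.0, 0.3], [0.3, 0.4]]         # mode forcing covariance
S = stationary_cov_real_modes(rates, C_eta)
residual = lyapunov_residual(rates, C_eta, S)
```

The off-diagonal entry shows how correlated forcing of two modes with different decay rates carries over to correlated mode amplitudes.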
There are some redundancies in this specification, which will give rise to problems in maximizing likelihood and so need removing. Firstly, the order in which the modes are labelled is irrelevant. One could eliminate this freedom by choosing to list first all the real modes and then all the complex modes and labelling them in order of size of λ and α, respectively. Secondly, each mode shape vector can be scaled by an arbitrary non-zero scalar (real for a real mode, complex for a complex mode), subject to scaling Cη by the inverse square. Note that since I am using a purely real representation, when I say multiplication of a complex mode shape vector by a complex scalar x + iy, I mean to take the linear combinations xbr − ybi, xbi + ybr of the columns br, bi of the mode shape vector. One could eliminate this freedom by selecting a ‘large’ component of the mode shape for each mode n and setting it to +1 for a real mode, [+1, 0] for a complex mode. But as one explores parameter space, one may need to change these choices, so a continuous choice would be preferable.
Also, one needs to enforce Cη to be positive semi-definite (PSD). One way to achieve this is to write Cη = e^R for R symmetric. There are efficient algorithms for exponentiating matrices. Another is to write Cη = L L^T with L lower triangular (in some chosen order on modes), but the diagonal elements of L should be chosen non-negative to remove another redundancy of sign. Such a Cholesky decomposition is a common step for efficient matrix computations so could come for free.
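The Cholesky option can be sketched as a map from an unconstrained parameter vector to a covariance matrix. The function name `cov_from_params` and the use of an exponential on the diagonal of L are illustrative assumptions: the exponential makes the diagonal strictly positive, which is slightly stronger than the non-negativity asked for above and excludes singular Cη.

```python
import math

def cov_from_params(theta, d):
    """Map an unconstrained vector theta of length d(d+1)/2 to C = L L^T."""
    assert len(theta) == d * (d + 1) // 2
    L = [[0.0] * d for _ in range(d)]
    idx = 0
    for i in range(d):
        for j in range(i + 1):
            # exp() keeps the diagonal of L positive (illustrative choice)
            L[i][j] = math.exp(theta[idx]) if i == j else theta[idx]
            idx += 1
    # C[i][j] = sum_k L[i][k] * L[j][k], symmetric by construction
    C = [[sum(L[i][k] * L[j][k] for k in range(d)) for j in range(d)]
         for i in range(d)]
    return L, C

L, C = cov_from_params([0.0, 0.5, 0.0], d=2)
```

An optimizer can then move freely in theta without ever leaving the set of valid covariance matrices.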
As mentioned above, it might be that a complex mode is close to transition to a pair of real modes, or vice versa. To allow parameter search in a uniform way near such a transition, it would be better to generalize complex modes to also allow pairs of real modes, as in [5], but I leave incorporating that refinement to the future. Similarly, one could allow the formation of non-trivial Jordan blocks and the associated transitions.
3. Inferring dominant modes in real time
To infer dominant modes in real time, I take the view expressed in §2, and apply a Kalman filter (appendix A.6). Thus consider a mode process u:
$$\dot{u} = D u + \eta \tag{3.1}$$
with D block diagonal, and observations
$$y_i = Z_i\,(m + B\,u(t_i)) + \zeta_i \tag{3.2}$$
Suppose Zi and the measurement noise covariances are known (this would be from calibration and testing of the instruments). It is desired to infer D, B, m and Cη. Given all of these, the Kalman filter enables one to infer u(t) in real time, both its mean and covariance. Then one can calculate a discounted evidence rate for the parameters D, B, m, Cη, as explained in appendix A.6. One can choose how to seek to maximize this but for definiteness I will describe the Newton method.
Choose numbers NR, NC of real and complex modes to fit, respectively. Choose a discount rate λ for the evidence. Make an initial guess at the mode time-constants: λn for each real mode, αn, ωn for each complex mode, and at the mode shape matrix B (one column for each real mode, two columns for each complex mode), and the mode forcing covariance matrix Cη. Make an initial guess at the mean vector m. All this forms an initial guess for the parameter vector μ = (D, B, m, Cη). Choose initial u0|0 and P0|0.
When each new observation yi arrives, note the time ti it was taken, the observation matrix Zi and the measurement noise covariance, and set
| 3.3 |
| 3.4 |
| 3.5 |
| 3.6 |
| 3.7 |
| 3.8 |
| 3.9 |
| 3.10 |
| 3.11 |
| 3.12 |
| 3.13 |
| 3.14 |
| 3.15 |
| 3.16 |
| 3.17 |
| 3.18 |
| 3.19 |
The above notation for second derivatives in equations (3.16)–(3.19) is condensed, but the first occurrence of ′ in each term should be understood as ∂/∂μj and the second as ∂/∂μk. Also, the subscripts i on all the terms in have been suppressed to save space.
Note that the integral for Gi (in equation (3.5)) is easy to work out because D is block diagonal. Similarly, the update (in equation (3.6)) for Pi|i−1 is easy. These can be derived from appendix A.7.
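To make the streaming structure concrete, here is a minimal sketch of one predict/update cycle for the simplest case of a single real mode observed through a scalar measurement. It uses the textbook Kalman recursions with an exact transition over an irregular time step; it is not a transcription of equations (3.3)–(3.19), it omits the discounting of the evidence and the parameter derivatives, and all names (`kalman_step`, `run_filter`, `c_eta`, `r`) and the sample readings are made up.

```python
import math

def kalman_step(u, P, dt, y, lam, c_eta, m, b, r):
    """One predict/update cycle for a single real mode.

    Assumed model: du/dt = -lam*u + eta, <eta(t) eta(s)> = c_eta*delta(t-s),
                   y = m + b*u + zeta, var(zeta) = r.
    Returns filtered mean, filtered variance, log-evidence increment
    and the predicted variance.
    """
    phi = math.exp(-lam * dt)                    # exact transition over dt
    G = c_eta * (1.0 - phi * phi) / (2.0 * lam)  # integrated forcing variance
    u_pred = phi * u
    P_pred = phi * phi * P + G
    v = y - (m + b * u_pred)                     # innovation
    F = b * b * P_pred + r                       # innovation variance
    K = P_pred * b / F                           # Kalman gain
    u_new = u_pred + K * v
    P_new = (1.0 - K * b) * P_pred
    log_ev = -0.5 * (math.log(2.0 * math.pi * F) + v * v / F)
    return u_new, P_new, log_ev, P_pred

def run_filter(times, ys, lam, c_eta, m, b, r):
    """Process observations one at a time, accumulating the log evidence."""
    u, P = 0.0, c_eta / (2.0 * lam)   # start from the stationary distribution
    total, t_prev = 0.0, times[0]
    for t, y in zip(times, ys):
        u, P, dl, _ = kalman_step(u, P, t - t_prev, y, lam, c_eta, m, b, r)
        total += dl
        t_prev = t
    return u, P, total

times = [0.0, 0.4, 1.0, 1.1, 2.5]          # irregular sampling is allowed
ys = [50.02, 50.01, 49.98, 49.99, 50.03]   # made-up readings
u_end, P_end, log_evidence = run_filter(
    times, ys, lam=0.5, c_eta=1e-3, m=50.0, b=1.0, r=1e-4)
```

The work per observation is fixed, and the accumulated log evidence is the quantity one would compare across candidate parameter values. For several modes the scalars become small matrices, with the transition and integrated forcing covariance computed block by block.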
The method can also handle the case where the measurement noise covariances are not known but are taken from some prior probability distribution. One just adds them to the list of parameters to infer.
If one wants to allow the number of modes to vary then one needs to do Bayesian model comparison, by running several different models alongside each other and computing their Bayes’ factors (appendix A.5).
4. Filtered noise models
The assumption of forcing by Gaussian white noise might not be realistic in many cases. Perhaps one has to leave the Gaussian world. For example, one can generalize to the world of Student t-processes [6] or to elliptic stable processes [7], both of which continue to be specified by a mean function and a covariance kernel. Marginals remain in the same class. Conditioning on N variables increases by N the number of degrees of freedom in a multivariate t-distribution. Conditioning stable distributions is not so easy, e.g. [8], but see also [9]. The result of forcing a linear system by a stable process belongs to the same class (but this fails for t-processes). I am not aware of an analogue of the Kalman filter to speed up the inference for stable processes in streaming mode, but I leave that for future investigation (see [10]).
On the other hand, there is a class of generalizations of Gaussian white noise that can be incorporated easily in my framework. They are the filtered Gaussian noises, defined as the solution of
$$\dot{\xi} = J \xi + w \tag{4.1}$$
for some asymptotically stable matrix J and w a (multidimensional) Gaussian white noise with covariance 〈w(t)w(s)^T〉 = Cw δ(t − s). They fit right into the framework by simply considering the joint process
$$\dot{x} = A(x - m) + \xi \tag{4.2}$$
and
$$\dot{\xi} = J \xi + w \tag{4.3}$$
which is just a special skew-product form of the general case of a linear system forced by Gaussian white noise w. Some component of white noise can also be added to the equation if desired.
Then inference of the dominant modes would also involve inference of the modes of the noise filter J, in particular its eigenvalues (its eigenvectors are not observable from the measurements). A curious feature is that I do not see a rational way to assign the eigenvalues between A and J.
5. AC electricity networks
I turn now to the motivating application.
The dynamics of an AC (alternating current) electricity network can be modelled approximately by a connected graph with a node for each rotating machine (synchronous generator or motor) [11] (this leaves open the question of how to model DC/AC convertors, such as at wind farms, solar photovoltaic farms and DC interconnector terminals). Let N be the number of nodes. As described in [12] (other useful references are [13,14]), one can model an AC network at various levels of complexity. If one ignores aspects like the dynamics of the voltages1, 3-phase imbalances, reactive power control and harmonics, the state can be specified by a phase ϕl and frequency2 at each location l, and dynamics for the vector f of frequencies and phases ϕ are given by balancing power:
| 5.1 |
where Il is an inertia, Γl a damping constant, Vl is the amplitude of the voltage at l, Bll′ is a symmetric matrix of ideal admittances of the line between l and l′, Gll′ is a symmetric PSD matrix of conductances of the line between l and l′ (which produces transmission losses) including self-conductances, and p is a vector of power imbalances (generation minus consumption), which is to be regarded as an external stochastic process (e.g. people switching loads on and off, wind farms producing varying power). For the moment, think of p as fixed. For an example of more detailed modelling, see [16].
Note that it is common in the electrical engineering literature (e.g. (1) of [17] or (17) of [18]) to partially linearize equation (5.1) about a reference frequency f0 (usually 100π or 120π s−1) by writing ωl = fl − f0, δl = ϕl − f0 t, and replacing by with Ml = Il f0 (which is often called an inertia again) and by Dωl with D = 2Γl f0. I shall completely linearize later in this section, but for the present retain the fully nonlinear form (equation (5.1)) for discussion of its global phase symmetry and its equilibria.
The system has the special feature of global phase-rotation invariance: if one adds the same constant to all the phases then the dynamics produce the same trajectory but with the constant added. One can quotient by this symmetry group, which we denote by S.3 For example, choose a root node o and a spanning tree in the graph, orient its edges e away from o (other choices are alright but this is to make a definite choice), and let Δe = ϕl′ − ϕl for each edge e = ll′ in the spanning tree; there are N − 1 of these, and we denote the vector of phase differences by Δ. Then the phase difference between any two nodes can be expressed as a signed sum of the Δe, and the equations can be replaced by .
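The spanning-tree quotient can be illustrated with a few lines of code. The sketch below stores the tree as a child-to-parent map rooted at node 0 (a hypothetical representation), forms Δe as child phase minus parent phase, and recovers the phase difference between two arbitrary nodes as a signed sum of the Δe. Phases are treated as plain real numbers, without wrapping modulo 2π, and the values are made up.

```python
def edge_differences(phases, parent):
    """Delta_e = phi_child - phi_parent for each tree edge, keyed by child."""
    return {child: phases[child] - phases[par]
            for child, par in parent.items()}

def phase_relative_to_root(node, parent, delta):
    """Sum the Delta_e along the tree path from the root to `node`."""
    total = 0.0
    while node in parent:          # climb towards the root
        total += delta[node]
        node = parent[node]
    return total

def phase_difference(a, b, parent, delta):
    """phi_b - phi_a expressed through the edge differences only."""
    return (phase_relative_to_root(b, parent, delta)
            - phase_relative_to_root(a, parent, delta))

phases = {0: 0.10, 1: 0.35, 2: -0.20, 3: 0.05}   # N = 4 nodes
parent = {1: 0, 2: 0, 3: 1}                      # spanning tree, root 0
delta = edge_differences(phases, parent)

# Adding the same constant to every phase leaves the Delta_e unchanged.
shifted = {node: phi + 1.234 for node, phi in phases.items()}
delta_shifted = edge_differences(shifted, parent)
```

There are N − 1 = 3 edge differences, and they are insensitive to the global phase rotation, which is the point of passing to the quotient.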
The quotient system has a manifold of equilibria in the space of all power imbalance vectors p, frequency vectors f and phase difference vectors Δ. For an equilibrium (mod S), each node has the same frequency and the phase differences are constant. The manifold of equilibria is a graph of power imbalance vector over the space of common frequency (which I take positive) and phase differences :
| 5.2 |
The manifold of equilibria is folded, however, so for given power imbalance vector p there may be 0, 2 or up to 2^{N−1} equilibria, of which only some sheets (or parts of sheets) are stable. For equilibria with all phase differences in a suitable subinterval of (−π/2, π/2), stability can be established by the energy method used in [15], modified to include the conductance matrix G and ignore the voltage dynamics. It should be noted, however, that inclusion of governors or power system stabilizers in the model can destabilize the equilibrium and produce oscillations [12], presumably by a Hopf bifurcation. The method of the present paper is not well adapted to detecting autonomous oscillations as opposed to damped ones forced by noise.
Suppose the system is near a stable equilibrium for some p. As p moves in time, the response roughly follows it on the manifold of equilibria, but deviations from equilibrium are in general excited and these would relax back to equilibrium if p were to stop moving. For small movements of p about a mean imbalance vector P with corresponding stable equilibrium (F, Δ), it is appropriate to linearize the system. A reference for small-signal stability in power systems is [19]. Write δfl, δΔe, δpl for the deviations of fl, Δe and pl from the equilibrium. Write
| 5.3 |
and
| 5.4 |
Then
| 5.5 |
Write this as
| 5.6 |
with
| 5.7 |
The power imbalances δp are an input to equation (5.6). They fluctuate in time because of variations in generation (in particular, wind and solar) and variations in consumption. I choose to model the dynamics of the power imbalances by
| 5.8 |
for some matrix J (with −J asymptotically stable) and (multidimensional) Gaussian white noise σξ with covariance matrix K = σσ^T (later, J, P, T and K may vary slowly in time). This is a somewhat crude representation, but captures the idea that p has random increments and reversion to a mean. There is evidence that load distribution is close to Gaussian, e.g. fig. 14 of [20], which is consistent with this model, though those data say nothing about the temporal correlations. It is common to neglect temporal correlations of the power imbalance, e.g. [21], but there are automated and human responses to power imbalance which have a filtering effect. One might argue that National Grid’s balancing actions are based more on the deviations of the average frequency and phase differences from nominal than the power imbalances, but on the manifold of equilibria these are equivalent.
The resulting system (equations (5.6) and (5.8)) for (x, δp) is of the form (2.1). It has a skew-product structure that we could exploit, though that does not play a role for application of my method, so that discussion is deferred to appendix A.8.
So now we can fit observations of (f, Δ) at as many locations as available (say, k) and as a function of time t to a mode process (2.4) and (2.5) with mean m of the form (F1, Δ) for some mean frequency F and vector Δ of mean phase differences, where 1 is the vector of length k with all components 1. I make the obvious step of shrinking the spanning tree to one for just the observed nodes. The observations can be deduced from phasor measurement units (PMUs), which measure (among many things) the (voltage) phase relative to a notional 50 Hz reference and the instantaneous frequency at their location.
Let k be the number of PMUs. For NR real modes and NC complex modes and
$$M = 2k - 1$$
observation components (fl for each PMU l and Δe for the voltage phase difference along each edge e in the spanning tree of the PMUs), the parameter space consists of NR decay rates λn for the real modes, NC frequencies ωm and decay rates αm for the complex modes, NR vectors Bin of length M for the real mode shapes normalized to have one component +1, NC pairs of vectors Bim of length M for the complex modes normalized to have one component (+1, 0), d(d + 1)/2 coefficients of the mode correlation matrix Cη (symmetric), where
$$d = N_R + 2N_C,$$
one mean frequency F and k − 1 mean phase differences along the edges of the spanning tree. This makes a total dimension
$$dM + \tfrac{1}{2}d(d+1) + k$$
of parameter space. This is slightly less than the dimension stated in appendix A.4, because for the AC electricity system it is automatic that the time-mean frequencies at all PMUs are the same. If one desires to fit many modes, this dimension could be quite large, but it is still much smaller than the dimension of the parameter space for the whole system.
As an example, if there are k = 10 PMUs and one wishes to fit two real modes and one complex mode then d = 4, M = 19 and the parameter space has dimension 4 × 19 + 10 + 10 = 96 (4 time constants, 72 free mode-shape components, 10 covariance coefficients and 10 means). One might say one is not interested in real modes but some of them are probably the biggest ones and to detect a complex mode accurately one needs to fit the biggest behaviour too.
There is the question of how many modes to allow, both real and complex. This can be decided by the Bayesian comparison method already mentioned (appendix A.5).
One could expect the most important mode behaviour to be an Ornstein–Uhlenbeck (OU) process (see appendix A.2) for fo, assuming o to be a central node for the network. Indeed, using GPML, I found that a 2 h trace of frequency at 1 s intervals, figure 2, which was publicly available from National Grid [22], fit reasonably well to an OU process with a decay time of about 30 min and amplitude 0.045 Hz. The time constant of 30 min is so long compared to the period (about 2 s) or decay time (about 20 s) of typical inter-area oscillations that it is hardly relevant, and one could just say that on a timescale of up to a minute the basic behaviour of fo is a Wiener process (continuous-time limit of a random walk) rather than OU. The inferred decay time is a significant fraction of the duration of the time series, so might not be determined very accurately.
Figure 2. A frequency trace over 2 h from National Grid [22].
On shorter timescales, however, the data look differentiable (figure 3). This is my principal reason for rejecting the hypothesis (e.g. [21]) that power imbalance is a white Gaussian noise, because that would make frequency a nowhere differentiable function of time. Instead, I propose that power imbalance is a first-order filtered white Gaussian noise. Analysis of the power spectrum of fluctuations in the frequency supports this proposal. Figure 4 shows a log-log plot of the power spectrum of the data of figure 2 multiplied by a Hann window function (sin^2(πt/T), where T = 7200 s is the duration of the series) to prevent the jump between the values at the two ends provoking high-frequency components. The main part of figure 4 has a slope near −2, consistent with frequency being an OU process. But for frequency larger than 0.04 Hz (period 25 s) the slope steepens, plausibly to −4, until the fact that the data were provided at only 1 s intervals causes an inevitable flattening off of the power spectrum near the Nyquist frequency of 0.5 Hz. National Grid have the data at 1/50 s intervals, but those data are confidential so I cannot use them here. Otherwise, we could see whether the slope −4 extends to higher frequency. Reference [23] shows a power spectrum in which a slope of roughly −4 goes from 0.9 Hz to 3 Hz, but it is for power flow on some line rather than frequency at a node.
Figure 3. The first 3 min 20 s of the frequency trace.
Figure 4. Log-log plot of the power spectrum of the data of figure 2 using a Hann window.
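The spectral check described above can be tried on synthetic data. The sketch below simulates an OU process at 1 s intervals with a 30 min decay time, applies the sin^2(πt/T) window, evaluates the power spectrum on a band of frequencies by a direct discrete Fourier sum, and fits a straight line in log-log coordinates. The series length, band limits, seed and function names are arbitrary choices for illustration, and the National Grid trace itself is not used.

```python
import cmath
import math
import random

def simulate_ou(n, dt, rate, sigma, seed=0):
    """Exact discretization of dx = -rate*x dt + sigma dW, started at 0."""
    rng = random.Random(seed)
    phi = math.exp(-rate * dt)
    step_sd = sigma * math.sqrt((1.0 - phi * phi) / (2.0 * rate))
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + step_sd * rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def hann_power_spectrum(x, dt, f_lo, f_hi):
    """Hann-windowed periodogram for Fourier frequencies in [f_lo, f_hi]."""
    n = len(x)
    T = n * dt
    windowed = [xi * math.sin(math.pi * i / n) ** 2 for i, xi in enumerate(x)]
    k_lo = max(1, math.ceil(f_lo * T))
    k_hi = min(n // 2, math.floor(f_hi * T))
    freqs, power = [], []
    for k in range(k_lo, k_hi + 1):
        w = -2.0 * math.pi * k / n
        s = sum(v * cmath.exp(1j * w * i) for i, v in enumerate(windowed))
        freqs.append(k / T)
        power.append(abs(s) ** 2)
    return freqs, power

def loglog_slope(freqs, power):
    """Least-squares slope of log(power) against log(frequency)."""
    xs = [math.log(f) for f in freqs]
    ys = [math.log(p) for p in power]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

x = simulate_ou(n=2048, dt=1.0, rate=1.0 / 1800.0, sigma=1.0, seed=1)
freqs, power = hann_power_spectrum(x, dt=1.0, f_lo=0.005, f_hi=0.1)
slope = loglog_slope(freqs, power)
```

For an OU process the fitted slope should come out near −2 over this band, up to the scatter of a raw periodogram; a filtered OU process would show the steeper slope above its second corner frequency.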
A simple model for the data is a first-order filtered OU process (FOU). To justify this, imagine the system is aggregated to a single generator. Then we have two equations of the form
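$$M\,\delta\dot{f} = -\gamma\,\delta f + \delta p, \qquad \delta\dot{p} = -J\,\delta p + \sigma\,\xi. \tag{5.9}$$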
It follows from the second equation that δp is OU with covariance function k(τ) = (σ^2/2J) e^{−J|τ|}. Then applying equation (A 21) we see that δf is a GP with covariance function
$$\int_0^\infty\!\!\int_0^\infty h(s)\,h(s')\,k(\tau - s + s')\,ds\,ds' \tag{5.10}$$
where h is the impulse response for the first equation, viz. h(s) = (1/M) e^{−Γs}, with Γ = γ/M. Computation of the integral (for the generic case Γ ≠ J) yields
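$$\frac{\sigma^2}{2M^2(J^2 - \Gamma^2)}\left(\frac{e^{-\Gamma|\tau|}}{\Gamma} - \frac{e^{-J|\tau|}}{J}\right). \tag{5.11}$$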
A sample from the FOU process is shown in figure 5. Note that the same covariance function arises for the overdamped linear Langevin process, with −Γ and −J being the two real eigenvalues.
Figure 5. A sample from the filtered OU process for Γ = 1/e, J = e2.
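Two properties of the FOU covariance are easy to check numerically. The sketch below assumes the closed form obtained by passing the OU kernel (σ^2/2J) e^{−J|τ|} through the impulse response h(s) = (1/M) e^{−Γs}, namely σ^2/(2M^2(J^2 − Γ^2)) (e^{−Γ|τ|}/Γ − e^{−J|τ|}/J) for Γ ≠ J. The function name `fou_cov` and the parameter values are made up.

```python
import math

def fou_cov(tau, sigma, M, Gamma, J):
    """Covariance of the filtered OU process at lag tau (requires Gamma != J)."""
    a = abs(tau)
    pref = sigma ** 2 / (2.0 * M ** 2 * (J ** 2 - Gamma ** 2))
    return pref * (math.exp(-Gamma * a) / Gamma - math.exp(-J * a) / J)

sigma, M, Gamma, J = 1.0, 1.0, 0.5, 2.0

# Variance: the tau = 0 value reduces to sigma^2 / (2 M^2 Gamma J (Gamma + J)).
variance = fou_cov(0.0, sigma, M, Gamma, J)
variance_closed_form = sigma ** 2 / (2.0 * M ** 2 * Gamma * J * (Gamma + J))

# Swapping the two rates leaves the covariance unchanged at every lag.
lags = [0.0, 0.3, 1.0, 4.0]
original = [fou_cov(t, sigma, M, Gamma, J) for t in lags]
swapped = [fou_cov(t, sigma, M, J, Gamma) for t in lags]

# The kernel has zero slope at tau = 0 (unlike an OU kernel, which has a
# kink there), which is what makes sample paths differentiable.
h = 1e-4
one_sided_slope = (fou_cov(h, sigma, M, Gamma, J) - variance) / h
```

The symmetry under exchange of Γ and J means that the covariance of the frequency alone cannot tell which time constant belongs to the filter and which to the noise.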
Fitting an FOU to the 2 h of data with GPML yields maximum-likelihood estimates for the time constants 1/Γ and 1/J around 11.1 min and 1.87 s, though one cannot say from the data analysis which is which. On the basis of estimates of UK system parameters, Andrey Gorbunov (2021, private communication) suggests the most likely match is 1/Γ ≈ 1.87 s, and hence 1/J ≈ 11.1 min, providing additional evidence that the power imbalance noise is not white on the timescale of interest. It is again awkward that the data are not available at more frequent intervals than 1 s, as the determined time constant 1.87 s is close to this limit. A more thorough treatment would evaluate the posterior uncertainty in the parameter fits and attempt to resolve the discrepancies between the previously estimated OU time constant of 30 min and the current one of 11.1 min, and between the eyeball estimate f = 0.04 Hz from figure 4 of where the slope changes, giving a time constant of (2πf)^{−1} around 4 s, and the current one of 1.87 s.
It would also be good to apply the approach of §3 to the data to see if two real modes are justified. In particular, one could compare to see whether the model (5.9) is justified, which corresponds to a special subcase of two real-mode fitting in which the mode-noise forcing is perfectly correlated and the mode observation matrix is precisely determined to make only indirect influence of the noise on f.
Over long timescales, deviations from Gaussianity have been established [24]. Nevertheless, I believe this does not invalidate Gaussian modelling for short times.
To take this project further, one should next tackle simultaneous readings from two PMUs. This would need the Kalman filter and its parameter-fitting coding up to allow for a number of real and complex modes. It would be best to test it first on simulated data from a power system model with, for example, two generators, one load and a noise process for the power imbalance. Then it could be tested on real data: the phase difference between the two PMUs and their two frequencies.
6. Discussion
I have presented a method to detect oscillations in systems with many components. It also detects real modes. It is promising because it can integrate data from many locations simultaneously to enhance the sensitivity of detection of modes of oscillation, and it can run in real-time with constant computation time per observation.
Some references on the problem of calculating modes and mode shapes from phasor measurement units (PMUs) in an AC electrical network are [25,26]. The authors of [27] consider the problem to have been solved. They cite [28–31]. I am not so convinced, because these papers depend to some extent on external estimates of system parameters. It may of course be good to use all available knowledge, but the idea I present here is that one could determine modes of oscillation without determining any system parameters in advance. I think it would be good to try the method of this paper on that problem, particularly the streaming version. It could also help in determining the time constant for inertia in an AC power system, of crucial importance for its operation.
Detection of modes of oscillation is important in many other contexts. One example is to detect soft (i.e. lightly damped) modes for civil engineering structures such as buildings and bridges, e.g. [32] and ch. 13 of [33]. Another is the identification of modes of oscillation in the sun (helioseismology), which enables one to deduce its temperature and rotation profiles4 [34]. A third is the analysis of gene expression data, e.g. [35]. A fourth is the analysis of business cycles, e.g. ch. 4 of [36], which have been seen for a long time but are still not understood. Use of the Kalman filter to evaluate the likelihood function for parameters is a standard part of training in econometrics, e.g. [37], but I had not seen it used to analyse business cycles until the recent paper [38], which remarkably uses a detection of dominant modes approach, as here.
Detection of oscillations is a very old subject, so I next give a brief review of traditional methods.
A standard approach to detecting oscillations is to identify peaks in the Fourier spectrum [33] or variants [39]. For example, the response x of the second-order system
| 6.1 |
to noise η with power spectrum P has power spectrum
| 6.2 |
as a function of frequency Ω. So if the noise is white (P is constant), then the inverse quality factor is precisely the fullwidth at half maximum for the power spectrum of the velocity (its maximum is at , known as the resonant frequency), and the damping ratio ζ = (1/2)Q−1 is the halfwidth at half maximum. For P slowly varying on the scale of , the results remain good approximations. This was given a sound grounding in Bayesian analysis (see [40] for a survey and [41] for a pedagogical presentation). It was employed by an MSc student Tajhame Francis on National Grid data, with a view to extending to multidimensional time series. However, it still suffers from issues like dealing with trends, choosing windowing functions, missing data, failure to cater for slowly shifting phase, and poor theoretical justification for taking more than the largest peak if one wants to infer more than one mode of oscillation. Nonetheless, after the initial submission of my paper, [23] was brought to my attention, in which Bayesian spectral analysis (also known as operational modal analysis) is performed with promising results.
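The width statement for the velocity spectrum can be checked numerically. The sketch below assumes the common parametrization x″ + (ω0/Q) x′ + ω0^2 x = η of the second-order system (the symbols ω0 and Q and this exact form are assumptions here), for which the velocity power spectrum under white forcing is Ω^2 P / ((ω0^2 − Ω^2)^2 + (ω0 Ω/Q)^2). It locates the two half-maximum points by bisection and compares their separation with ω0/Q.

```python
import math

def velocity_spectrum(Omega, omega0, Q, P=1.0):
    """Velocity power spectrum of the assumed second-order system."""
    return P * Omega ** 2 / ((omega0 ** 2 - Omega ** 2) ** 2
                             + (omega0 * Omega / Q) ** 2)

def bisect(g, lo, hi, iters=200):
    """Root of g on [lo, hi], assuming g changes sign exactly once there."""
    g_lo_positive = g(lo) > 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (g(mid) > 0) == g_lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

omega0 = 2.0 * math.pi * 0.5      # a 0.5 Hz oscillation (period 2 s)
Q = 10.0
half_max = 0.5 * velocity_spectrum(omega0, omega0, Q)

def excess(Omega):
    return velocity_spectrum(Omega, omega0, Q) - half_max

lower = bisect(excess, 0.1 * omega0, omega0)
upper = bisect(excess, omega0, 10.0 * omega0)
fwhm = upper - lower
relative_width = fwhm / omega0    # compare with 1/Q
```

Under these assumptions the peak sits at ω0 and the full width at half maximum, measured in units of ω0, equals 1/Q.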
Wavelet transforms are popular for resolving signals in both time and frequency (up to the limits of the uncertainty principle), but I am not aware whether they can give an estimate of damping rate.
Another approach is to study the effect of excitation by an impulse (the Prony method and variants like MUSIC and ESPRIT, e.g. [42,43]), but many real-world systems may not be subjectable to impulses. For a review of these and some other methods (e.g. Hilbert transform), see [44].5
To detect periodic components in a signal, my brother David [45] proposed the family of stationary GPs with covariance function of the form
| 6.3 |
for which samples are exactly periodic with period 2π/ω. A slight modification was used in [46] to remove the effect of its non-zero mean, namely
| 6.4 |
where I0 is a modified Bessel function of the first kind. As λ → ∞, it takes the limiting form
| 6.5 |
called the Cos kernel, which has the property that it forces anti-periodicity with anti-period π/ω: f(t + π/ω) = −f(t). Although these have found valuable uses, and can be made less rigid by multiplication by a decaying kernel such as exp (−α|t|) (which with the Cos kernel produces OUosc of [35] or the ‘exponentially decaying cosine’ of [47]), it seems to me highly preferable to start from the point of view of a linear system forced by noise, which furthermore allows for efficient treatment in streaming mode.
The approaches that are closest in spirit to this paper are ‘reduced order’ methods such as ‘subspace identification’ e.g. [48], ‘dominant mode analysis’ e.g. [49], and ‘dynamic mode decomposition’ e.g. [50]. The idea of subspace identification is, given observations of some input and output functions in time, to infer a state-space model for the system. The state variables do not have to correspond to any physical variables. My abstract modes are examples. The Kalman filter is used to do the inference. Thus, perhaps my method should be seen as a variant of subspace identification. A difference is that [48] talks a lot about projections, and even the use of the word ‘subspace’ suggests that they are looking for a subspace of some larger space. I do not have any such larger space and I have no projections. It is possible, however, that if one gets to the bottom of the comparison, the similarities outweigh the differences.
The idea of dominant mode analysis is to subject the system to an excitatory signal, e.g. an impulse or a random binary sequence, and fit a low-dimensional linear model to the sequence of observations as functions of the forcing sequence, allowing for a measurement noise. The differences are that I obtain eigenvectors and use unknown natural forcing.
The idea of dynamic mode decomposition is to infer a linear recurrence relation for a time-series of observations and then to find its eigenvalues and eigenvectors (as time-sequences). There is an implicit assumption of white noise. This is reviewed in [50] along with extensions and relations to some other methods, notably the eigensystem-realization algorithm and linear inverse modelling. It was used by [31] in the power-system context. My eigenvectors are in a space of simultaneous quantities rather than time-delayed, and I do not require equal time intervals between observations.
Next, I discuss deficiencies of my method. One defect is that the forcing might not be Gaussian. For example, even a compound Poisson process with independent Gaussian amplitude is not Gaussian. Also, a consequence of the Gaussian assumption is that the covariance of the response is time-symmetric, as shown in equation (A 23), whereas this might not be true for real systems. As already mentioned, evidence for Gaussian distribution of electrical load is given in fig. 14 of [20], but this reference does not report on time-correlation. Load variations are likely to be the sum of many small independent factors, however, which would make them Gaussian if they have finite variance, by the central limit theorem, but not particularly white. Wind power is far from Gaussian and has long-time dependency: there is considerable research on the statistics of wind power, e.g. [51–53]. Some directions to allow fat-tailed distributions were discussed in §4.
Another defect of the approach of the present paper is that it does not allow for nonlinearity. Nevertheless, for small fluctuations around an equilibrium, linearizing is a good approach. It will fail to give a good approximation, however, if the eigenvalues of any mode approach or cross the imaginary axis. A big question with power-flow oscillations, gene expression and business cycles is whether there is a limit cycle of some underlying deterministic dynamics, or just lightly damped oscillations around an equilibrium forced by noise. Figure 1 suggests that there was a Hopf bifurcation, but the general interpretation of such events in the power system community is that the oscillations are transient, triggered by a switching event, e.g. [1]. For gene expression this dichotomy has been addressed by [54]. For business cycles, most economists decided long ago that they are just a near-unit-root process (meaning lightly damped oscillations forced by shocks) [36], though Grandmont proposed deterministic models with a variety of forms of dynamics [55]. [56] fit a vector autoregressive (VAR) model, but with perhaps more free parameters than justified by the data. Our approach would restrict to a small number of modes, as has been done in the recent paper [38].
Figure 1.
Voltage angle at seven locations in England relative to CE2 as a function of time. Reproduced with permission from [1]. Angle differences drive power flow, so oscillations in angle differences indicate oscillations in power flow.
A catch with the discounted evidence approach I suggested to allow fitting slowly varying parameters is that the parameters might sometimes vary faster than the chosen memory time-constant. Indeed, this would be a problem in the case of figure 1. It might be better to make a probabilistic model of parameter variation that allows jumps. Large mismatch between prediction and observation in the Kalman filter could be used as an anomaly detector to decide when to insert such jumps.
An interesting issue is that if the noise is considered to be the result of filtering white noise then our method also finds the modes of the filter. Without further information about the structure of the system or direct observations of the forcing process, I see no way of distinguishing between modes of the filter and modes of the system from observations of just the system. An example of this was given in §5.
Lastly, I have not yet tested the method. There will doubtless be issues that arise once one starts to implement it. A likely one is that the choice of normalization condition on the mode vectors and the ordering of the modes by time-constants are both discontinuous and could lead to awkwardness in the fitting; it would be better to find some continuous ways of removing these redundancies. Another is that the computations may turn out to take more than one-fiftieth of a second and thus not be implementable in real time; then alternating between prediction and correction in the Kalman filter on a slower timescale might be used. A third is that the discounting time for the evidence might turn out to require careful tuning; or as mentioned above, one might do better to switch to a probabilistic model of parameter variation that allows jumps. A fourth is that a particle swarm with many different parameter values and different choices of numbers of modes all running in parallel might be the best way to manage parameter variation; one will then need to implement a good Bayes’ factor comparator on top of all the Kalman filters, to decide which particles to mutate and which ones to terminate.
Nevertheless, I think the savings in dimension and the ability to run in streaming mode make my method promising.
Acknowledgements
I am grateful to Ben Marshall of National Grid for proposing the problem of detecting inter-area oscillations in January 2015, and to him and his colleague Phillip Ashton for helpful discussions on the topic and pointers to the literature; to MSc student Tajhame Francis for initial investigations by spectral analysis; to my brother David for telling me to ‘Use a Gaussian process’; to PhD student Marcos Tello Fraile and postdoc Lisa Flatley for trying to follow my suggestions; to Hannes Nickisch and Colm Connaughton for helping me implement my resulting solutions in GPML; to Carl Rasmussen and Hannes Nickisch for having created GPML; to Zoubin Ghahramani for answering some questions about GPs; to Igor Mezic and Yoshihiko Susuki for discussions on modelling AC networks; to undergraduate summer project student John Prater for coding up the 2 × 2 underdamped linear Langevin covariance for GPML; and to Chris Williams, Darren Wilkinson, Keith Worden and especially Janusz Bialek and Andrey Gorbunov, and the reviewers for many useful comments and questions.
Appendix A
A.1. Gaussian processes
A GP on a set T is a probability distribution for functions F: T → ℝ such that for all n ≥ 1 the marginal density P for the vector of values (f1, …, fn) = (F(t1), …, F(tn)) at any finite sequence t1, …, tn ∈ T is Gaussian. Examples for the set T are ℝ representing time, or the set V of vertices in a graph representing spatial locations in a network, or ℝ × V for time and vertices, or ℝ × V × I where I is a set of labels representing components of a vector of values at each vertex and time.
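A basic theorem (e.g. [57]) for a GP is that there is a ‘mean’ function M: T → ℝ and a positive-definite ‘covariance’ function C: T × T → ℝ such that
$$P(f_1,\dots,f_n) = \frac{1}{\sqrt{(2\pi)^n \det c}}\,\exp\!\left(-\tfrac12 (f-m)^T c^{-1} (f-m)\right), \tag{A 1}$$
where f is the vector with components fi, m is the vector with components mi = M(ti) and c is the matrix with components cij = C(ti, tj). A function C of two variables is said to be positive-definite if for all n ≥ 1, distinct t1, …, tn ∈ T and v ∈ ℝⁿ not all zero, vᵀcv > 0.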
It is convenient to extend the concept of GP to degenerate cases by allowing C to be PSD (vᵀcv ≥ 0). In this case, c may fail to be invertible but the above formula for the density P can be understood as the product of a delta-function on the null space of c and a Gaussian of complementary dimension on the range of c, centred at m. More formally, its characteristic function is 〈exp(i kᵀf)〉 = exp(i kᵀm − ½ kᵀck) for k ∈ ℝⁿ.
Given a GP and observations of a realization of it at a subset T′ ⊂ T, possibly with an assumed Gaussian distribution for measurement error (essential if the covariance is not positive-definite), then conditioning on the observations produces a posterior probability distribution for the realization, which is again a GP. It has mean function
$$M_+(t) = M(t) + C(t,T')\,\bigl(C(T',T') + V\bigr)^{-1}\,\bigl(Y - M(T')\bigr), \tag{A 2}$$
where Y is the column vector of observations for ti ∈ T′, V is the covariance matrix of the measurement error vector (assumed zero-mean Gaussian), C(t, T′) denotes the row vector of C(t, ti), C(T′, T′) the matrix of C(ti, tj), and M(T′) the column vector of M(ti). It has covariance function
$$C_+(t,t') = C(t,t') - C(t,T')\,\bigl(C(T',T') + V\bigr)^{-1}\,C(T',t'), \tag{A 3}$$
where C(T′, t) denotes the column vector of C(ti, t).
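As an illustration of conditioning a GP on observations, here is a minimal sketch of the standard posterior mean M(t) + C(t, T′)(C(T′, T′) + V)−1(Y − M(T′)) and posterior covariance C(t, t′) − C(t, T′)(C(T′, T′) + V)−1C(T′, t′). The function name `gp_posterior`, the exponential kernel and the data values are assumptions chosen only for the example.

```python
import numpy as np

def gp_posterior(C, M, t_obs, y, V, t_new):
    """Posterior mean and covariance at t_new given noisy observations y at t_obs."""
    t_obs = np.asarray(t_obs, dtype=float)
    t_new = np.asarray(t_new, dtype=float)
    C_oo = C(t_obs[:, None], t_obs[None, :]) + V      # C(T', T') + V
    C_no = C(t_new[:, None], t_obs[None, :])          # C(t, T')
    C_nn = C(t_new[:, None], t_new[None, :])          # C(t, t')
    mean = M(t_new) + C_no @ np.linalg.solve(C_oo, y - M(t_obs))
    cov = C_nn - C_no @ np.linalg.solve(C_oo, C_no.T)
    return mean, cov

kernel = lambda s, t: np.exp(-np.abs(s - t))          # assumed covariance function
zero_mean = lambda t: np.zeros_like(t)                # assumed mean function
t_obs = np.array([0.0, 1.0, 2.5])
y = np.array([0.3, -0.1, 0.4])                        # made-up observations
V = np.zeros((3, 3))                                  # noise-free for the check

mean_at_obs, cov_at_obs = gp_posterior(kernel, zero_mean, t_obs, y, V, t_obs)
mean_new, cov_new = gp_posterior(kernel, zero_mean, t_obs, y, V, [1.7])
```

With V = 0 and a positive-definite kernel, the posterior interpolates the observations exactly and has zero variance there, which is a convenient sanity check.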
Given a family of GPs, labelled by one or more parameters, a prior probability distribution on the parameter space, and observations of a realization, then Bayesian inference gives a posterior probability distribution over the joint space of parameters and realizations. In particular, its marginal on the parameter space gives a posterior probability over the parameter space. In general, this cannot be computed explicitly, but search algorithms can find the parameter values maximizing the posterior likelihood. In this way, one can infer the parameters. Computational methods can also give an idea of the posterior uncertainty in the parameters.
There are many introductions to GPs, e.g. [45,46,57–59], and software packages to implement them and infer from them, e.g. GPML and GPy.
Much GP modelling, however, seems to me to be ad hoc. A family of covariance functions is chosen, for example to reflect assumed smoothness class or periodicity, the mean function is often set to zero, and a best fit to the data is obtained. Instead, it is better to use known or assumed structure of the system under study to choose a sensible class of models. This strategy is recognized under the names ‘hybrid modelling’ or ‘latent force models’, e.g. [60], or data-based mechanistic modelling [49].
For time-dependent systems, in many contexts, a natural class of models is an asymptotically stable continuous-time linear system forced by Gaussian noise. Furthermore, it is often natural to assume the linear system to be autonomous (some say ‘time-invariant’) and the noise to be stationary, at least on short time-scales. The idea that filtering white noise produces interesting processes is old, e.g. [61–63].6
The noise is not necessarily white. I make the assumption that it is the result of forcing some other autonomous asymptotically stable linear system with white Gaussian noise. The end-result of the assumption on the noise is a skew-product asymptotically stable linear system (consisting of the real system and the noise filter) forced by Gaussian white noise.
Another name for linear stochastic systems is continuous-time VAR processes (see append. B.2.1 of [57]). Classic books on the discrete-time version of such models, including inference for them, are [64,65]. In the latter, they are called dynamic linear models. Linear stochastic process models have been used for inference in many contexts, e.g. [35,66,67].
A.2. Simple examples of linear stochastic system
The simplest example of an asymptotically stable linear system forced by Gaussian noise is the OU process:
$$\dot{x} = -\mu x + \sigma \xi, \tag{A 4}$$
with x ∈ ℝ, μ > 0, σ > 0 and ξ unit Gaussian white noise (which can be considered as a degenerate GP on ℝ with mean M(t) = 0 and covariance C(t, t′) = δ(t − t′)). Then Duhamel’s formula yields
$$x(t) = \sigma \int_{-\infty}^{t} e^{-\mu (t-s)}\,\xi(s)\,ds, \tag{A 5}$$
which shows that x is a GP with mean zero and covariance
$$C(t,t') = \frac{\sigma^2}{2\mu}\,e^{-\mu |t-t'|}. \tag{A 6}$$
A sample is shown in figure 6. With probability one, samples are continuous but nowhere differentiable [68].
Figure 6.

A sample from the OU process with μ = 1, .
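For readers who want to reproduce a sample like this, here is a minimal sketch of exact sampling of the stationary OU process at arbitrary (possibly unequal) times. It assumes the standard stationary covariance (σ²/2μ)e−μ|τ|; the value σ = √2 (unit variance), the seed and the function names are choices made for the example.

```python
import numpy as np

def ou_covariance(tau, mu, sigma):
    """Stationary OU covariance at lag tau."""
    return sigma**2 / (2.0 * mu) * np.exp(-mu * np.abs(tau))

def sample_ou(times, mu, sigma, rng):
    """Exact sample of the stationary OU process at increasing times."""
    times = np.asarray(times, dtype=float)
    var = sigma**2 / (2.0 * mu)
    x = np.empty(len(times))
    x[0] = rng.normal(0.0, np.sqrt(var))              # draw from stationary law
    for i in range(1, len(times)):
        decay = np.exp(-mu * (times[i] - times[i - 1]))
        # conditional law: mean decay * x[i-1], variance var * (1 - decay^2)
        x[i] = decay * x[i - 1] + rng.normal(0.0, np.sqrt(var * (1.0 - decay**2)))
    return x

rng = np.random.default_rng(0)
times = np.arange(20000) * 0.5
path = sample_ou(times, mu=1.0, sigma=np.sqrt(2.0), rng=rng)
empirical_var = path.var()
empirical_lag1 = np.mean(path[:-1] * path[1:])        # compare with ou_covariance(0.5, ...)
```

Because the transition is sampled exactly, no time-step error is introduced, which is the same Markov property that the Kalman filter exploits.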
Next consider the linear Langevin process:
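$$m\ddot{x} + \beta \dot{x} + k x = \sigma \xi, \tag{A 7}$$
with m, β, k, σ > 0 (cf. [69]). It follows that x is a GP on ℝ with mean zero and covariance
$$C(t,t') = \frac{\sigma^2}{2\beta k}\,e^{-\alpha|\tau|}\left(\cos \omega\tau + \frac{\alpha}{\omega}\sin \omega|\tau|\right), \tag{A 8}$$
where τ = t − t′, α = β/2m, ω = √(k/m − α²). This formula is most appropriate for the underdamped case β²/4 < mk. In the overdamped case β²/4 > mk, it is more usefully written as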
| A 9 |
where and . In the critically damped case β2/4 = mk, then
| A 10 |
Figure 7 shows a sample for an underdamped case, and a sample for an overdamped case appeared in figure 5. With probability one, solutions of the linear Langevin equation are differentiable but nowhere twice differentiable [68].
Figure 7.

A sample for the underdamped linear Langevin process with σ2 = 2βk, α = 1/e, ω = e (e being the base of natural logarithms).
The linear Langevin equation can be written as a system of two first-order stochastic differential equations. This can be generalized to the two-dimensional system
| A 11 |
where , A a 2 × 2 matrix with , , and with η being two-dimensional Gaussian white noise with (PSD) covariance matrix K (i.e. 〈ηi(s)ηj(t)〉 = Kijδ(t − s)). Then x is a GP on , the first factor indicating the component of x (for which I use subscript notation). It has zero mean. Its covariance function, which I write as a matrix function on is
| A 12 |
where
| A 13 |
Taking the first (or second) component of the general two-dimensional system produces a family of covariance functions that we have proposed for purposes such as deciding if a system is under- or over-damped [5].
A.3. General linear stochastic system
In this section, I review the calculation of the mean and covariance functions for an asymptotically stable continuous-time forced linear system of arbitrary dimension, cf. [57,61,69]. Initially, I allow the system to be non-autonomous and do not restrict the forcing to be Gaussian. Thus consider
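$$\dot{x} = A(t)\,x + \eta(t), \tag{A 14}$$
with x, η real vectors of the same dimension. The asymptotic stability assumption implies that the response x to forcing η can be written as
$$x(t) = \int_{-\infty}^{t} H(t,t')\,\eta(t')\,dt', \tag{A 15}$$
with H the impulse response (matrix-valued Green function), i.e. the matrix solution of
$$\frac{\partial H}{\partial t}(t,t') = A(t)\,H(t,t') \tag{A 16}$$
for t > t′ with H(t′+, t′) = I. Note that for any t < t′ < t″,
$$H(t'',t) = H(t'',t')\,H(t',t). \tag{A 17}$$
If η is a GP with mean function Mη and covariance function Cη (so Cη(s, t) = 〈η(s)ηT(t)〉), then x is also a GP, with mean function
$$M_x(t) = \int_{-\infty}^{t} H(t,t')\,M_\eta(t')\,dt' \tag{A 18}$$
and covariance function
$$C_x(s,t) = \int_{-\infty}^{s} ds' \int_{-\infty}^{t} dt'\; H(s,s')\,C_\eta(s',t')\,H(t,t')^T. \tag{A 19}$$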
If the system is autonomous then H(s, s′) is a matrix-function h(σ) of just one variable σ = s − s′. If the forcing is stationary then Mη is constant and Cη(s, t) is a matrix-function k(τ) of τ = t − s and k(−τ) = k(τ)T. So assuming both and changing variables to σ and τ′ = t′ − s′,
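$$M_x = \int_0^\infty h(\sigma)\,d\sigma\; M_\eta \tag{A 20}$$
and
$$C_x(\tau) = \int_0^\infty d\sigma \int_{-\infty}^{\tau+\sigma} d\tau'\; h(\sigma)\,k(\tau')\,h(\tau+\sigma-\tau')^T. \tag{A 21}$$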
Now specialize further to white forcing of zero-mean, i.e. k(τ) = Kδ(τ) for some PSD symmetric matrix K. A common way to write this is η(t) = Bξ(t) for ξ a vector of independent unit Gaussian white noises and a matrix B; then K = BBT. But B can be replaced by BO for any orthogonal matrix O without changing the probability distribution for η, so this description contains useless redundancy and it is better to specify the noise η by just its covariance matrix K.
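The redundancy in B is easy to see numerically. The sketch below draws an arbitrary B and an arbitrary orthogonal O (both random, purely for illustration) and confirms that B and BO give the same covariance matrix K = BBT.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))                      # arbitrary noise-input matrix
O, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # a random orthogonal matrix

K_from_B = B @ B.T
K_from_BO = (B @ O) @ (B @ O).T                  # = B O O^T B^T = B B^T
print(np.max(np.abs(K_from_B - K_from_BO)))      # should be at rounding level
```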
In this case, x has zero mean and
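$$C_x(\tau) = \int_0^\infty h(\sigma)\,K\,h(\tau+\sigma)^T\,d\sigma \quad (\tau \ge 0). \tag{A 22}$$
For τ < 0, Cx(τ) = Cx(−τ)T. Using h(τ + σ) = h(σ)h(τ) for σ, τ > 0 (a special case of equation (A 17)), this boils down to
$$C_x(\tau) = S\,h(\tau)^T \quad (\tau \ge 0), \tag{A 23}$$
where the symmetric matrix
$$S = \int_0^\infty h(\sigma)\,K\,h(\sigma)^T\,d\sigma, \tag{A 24}$$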
giving the covariance of the response of an asymptotically stable autonomous linear system to Gaussian white noise in terms of the impulse response function (cf. (4.5.71a&b) of [61]).
Note that S satisfies a Sylvester equation (actually the special case known as the Lyapunov matrix equation) (cf. (4.5.64) of [61], but with the opposite sign-convention for A):
$$A S + S A^T + K = 0. \tag{A 25}$$
The theory of Sylvester equations (e.g. [70]) shows that this equation has a unique solution for S, because A has been assumed to have all its spectrum in the open left half plane, so there are no pairs (λi, λj) of eigenvalues for A and AT that sum to zero. An interesting approach using equation (A 25) to infer A and K from S in the AC electricity context is presented in [21], where the model is called a vector OU process.
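To make the Lyapunov equation concrete, here is a minimal sketch that solves AS + SAT + K = 0 for S by rewriting it as a linear system with Kronecker products. This equation follows from S = ∫0∞ eAσKeATσ dσ for ẋ = Ax + η; the function name and the example matrices (a unit-mass damped oscillator forced on the velocity) are assumptions for illustration.

```python
import numpy as np

def stationary_covariance(A, K):
    """Solve A S + S A^T + K = 0 for S (A must be asymptotically stable)."""
    n = A.shape[0]
    I = np.eye(n)
    # With row-major flattening, vec(A S) = kron(A, I) vec(S)
    # and vec(S A^T) = kron(I, A) vec(S).
    lhs = np.kron(A, I) + np.kron(I, A)
    return np.linalg.solve(lhs, -K.reshape(-1)).reshape(n, n)

# Example: x'' + 0.5 x' + 4 x = xi written as a first-order system.
A = np.array([[0.0, 1.0],
              [-4.0, -0.5]])
K = np.array([[0.0, 0.0],
              [0.0, 1.0]])
S = stationary_covariance(A, K)
residual = A @ S + S @ A.T + K
print(S)
```

For this example the position variance is σ²/(2βk) = 0.25 and the velocity variance is σ²/(2β) = 1, with zero cross-covariance. For large systems a dedicated Lyapunov solver would be preferable to forming the Kronecker system.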
A.4. Bayesian inference for linear stochastic systems
I review the standard Bayesian approach to inference of a linear stochastic dynamical system from observations. The use of the Kalman filter to speed up the computations and to update the inference in real time will be treated in appendix A.6. There are alternative approaches, such as subspace identification methods, e.g. [48], but my goal is not to go into detail, rather to present enough to contrast the approach that I propose in §2.
The system is modelled by
$$\dot{x} = A x + \xi \tag{A 26}$$
with 〈ξ(s)ξT(t)〉 = Kδ(t − s). The matrices A and K are considered unknown, though one may have strong prior probability distributions for them. This needs augmenting by a model for the observations, e.g. vectors
$$y_i = Z_i\,x(t_i) + \zeta_i \tag{A 27}$$
for some known times ti, known observation matrices Zi, and unknown measurement errors ζi that I assume to be independent zero-mean Gaussian vectors with known covariance matrices Hi. The idea is that the matrices Zi specify which components (or combinations of components) of x are measured.
Then the parameters of the model are the matrix elements of A and K. If x has dimension L, the parameters form a continuous space of dimension L2 + L(L + 1)/2 = L(3L + 1)/2 (though this may be reduced significantly if the system has known structure). Denote the parameters compactly by a vector μ.
Previous knowledge about the system is encoded into a prior probability density P−(μ) for the parameters. Given the observations Y = (yi), a posterior probability density P+ is computed for the parameters by Bayes’ rule:
$$P_+(\mu|Y) = \frac{P_m(Y|\mu)\,P_-(\mu)}{Z(Y)}, \tag{A 28}$$
where Pm is the probability density for the observations given the parameters, specified by the model, and Z(Y) = ∫ Pm(Y|μ) P−(μ) dμ is a normalization factor.
If enough observations have been taken (depending on how tight the prior P− was), then P+ will be tightly peaked around some value of the parameter vector. The maximum posterior likelihood value is the value of μ that maximizes P+(μ|Y) (assuming it is unique). Although the functional form for P+ is not in general computable, numerical algorithms like conjugate gradient ascent can search for the maximizer, based on evaluation of P+ (or in practice log P+) and its gradient with respect to μ at a suitable sequence of points. They can also compute a quadratic approximation to P+ around the maximizer to give an idea of the posterior uncertainty in the inference. Another nice method, called Bayesian optimization, is to fit a GP to the posterior likelihood and optimally choose a sequence of evaluation points to reduce the uncertainty of its maximum [71].
It might be that one is not interested in all the parameters. For example, one might want A but not K (and if one were to treat the measurement noise covariances Hi as unknown they would also appear in the parameter list but probably not be of primary interest). In this case, one could marginalize the posterior probability distribution over the uninteresting parameters, for example compute P+(A|Y) = ∫ P+(A, K|Y) dK. In general, this is not possible analytically, but it can be done approximately by various numerical methods. One of the nicest is Bayesian quadrature [72], which fits a GP to the integrand and hence obtains a (one-dimensional) Gaussian distribution for the integral.
Once a best fit to A has been obtained, one could compute the modes of A (frequency, damping, shape) by diagonalization of A. If a fit to K has also been obtained, one can calculate the covariance of the response from equations (A 23), (A 24), and hence the covariance of the modes using the diagonalization.
The main point of this paper, however, is that the above approach to inferring modes is overkill. If one is interested only in the dominant modes one can infer them without inferring A and K. I say the dominant modes are the subset of modes which best explain the observations, in the sense of Bayesian model comparison (highest Bayes’ factor, described in appendix A.5). One can always fit the data better by adding more modes but at the expense of making the model bigger and potentially reducing its explanatory power. To infer d modes (counting complex ones twice) from M observation components requires a parameter space of dimension only (d + 1)(M + d/2). This is likely to be much less than the dimension L(3L + 1)/2 of the space of A and K above, because both d and M are smaller than L.
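The saving in dimension can be made concrete with a short calculation. The sketch below evaluates the two parameter counts quoted above, L(3L + 1)/2 for the full model and (d + 1)(M + d/2) for the mode model; the values L = 20, d = 4 and M = 6 are hypothetical and chosen only to illustrate the comparison.

```python
def full_model_dim(L):
    """Number of parameters in A (L^2) plus symmetric K (L(L+1)/2)."""
    return L * (3 * L + 1) // 2

def mode_model_dim(d, M):
    """Parameter count (d + 1)(M + d/2) for d modes and M observed components."""
    return (d + 1) * (2 * M + d) // 2   # always an integer

L, d, M = 20, 4, 6                      # hypothetical sizes
print(full_model_dim(L), mode_model_dim(d, M))   # 610 versus 40
```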
The second main point of the paper is that the inference of dominant modes can be run in streaming mode. The description given so far involves collecting all the data and then maximizing the posterior likelihood. This is called batch mode of inference. For inference of a process running in real-time, it is better to update the maximum posterior likelihood estimates after each new observation arrives. This is called streaming mode of inference. It can be done efficiently, with each new observation requiring the same time to process regardless of how many previous observations have been made, whereas for a general GP the computation time to infer from n observations scales like n3 and the time to take into account one new observation scales like n2 [57].
A.5. Bayesian model comparison
The number of modes to attempt to fit can be decided by Bayesian model comparison [3]. This is an extension of maximum posterior likelihood search to a setting with two or more models Mj, each of which has its own continuous parameter space. For each model Mj, one can compute the posterior probability density P+(μ|Y, Mj) for μ in its parameter space. By various methods, e.g. [73] or Bayesian quadrature [72], one can also compute the normalization constant Z(Y|Mj), called Bayes’ factor for the model. Then given prior probabilities P−(Mj) for the models (which can be taken the same if one is agnostic about which model is best), one applies Bayes’ rule again to obtain posterior probabilities
$$P_+(M_j|Y) = \frac{Z(Y|M_j)\,P_-(M_j)}{Z(Y)}, \tag{A 29}$$
where Z(Y) is a normalization factor again, depending on Y and the chosen set of models, but is not required for what follows. This formula can be used to decide which model is the best explanation of the observations and to keep track of near-competitors. For each model Mj, the method of appendix A.4 determines best-fit parameters. In our case, the different models correspond to the numbers NR and NC of real and complex modes to fit, respectively. The idea is that even though a better fit is achievable with more modes, the required increased dimension of parameter space might not justify it (Occam’s razor).
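In practice the normalization constants Z(Y|Mj) are handled as logarithms, and the posterior model probabilities are obtained by normalizing Z(Y|Mj)P−(Mj) over the candidate models. A minimal sketch follows; the function name and the log-evidence values are hypothetical, and the computation of the log-evidences themselves is not shown.

```python
import numpy as np

def model_posteriors(log_Z, log_prior=None):
    """Posterior model probabilities from log normalization constants log Z(Y|M_j)."""
    log_Z = np.asarray(log_Z, dtype=float)
    if log_prior is None:
        log_prior = np.zeros_like(log_Z)     # agnostic (equal) priors
    w = log_Z + log_prior
    w = w - w.max()                          # avoid underflow before exponentiating
    p = np.exp(w)
    return p / p.sum()

# Hypothetical log-evidences for models with 1, 2 and 3 modes.
log_Z = [-105.2, -100.0, -101.5]
posterior = model_posteriors(log_Z)
best = int(np.argmax(posterior))             # index of the preferred model
```

Subtracting the maximum before exponentiating leaves the normalized result unchanged, which reflects the remark that Z(Y) itself is not needed.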
A.6. Inference from streaming data: Kalman filtering
In many circumstances, it would be preferable to run the inference of modes in real time rather than batch, and efficiently. There are papers on real-time inference with GPs, e.g. [58,66,67,74]. The most important insight is that for linear stochastic processes one can use the Kalman filter, a method that uses the Markovian structure to reduce the computational burden of inference to a constant amount per observation. It goes back to the 1960s and is in many textbooks, though rarely in continuous time. A nice text on recursive estimation more generally is [75].
Here, I review the use of the Kalman filter to infer the state and parameters of a continuous-time linear system forced by white noise from real-time observations. Denote the state of the system at time t by x(t) and suppose it evolves according to
$$\dot{x} = A x + \eta \tag{A 30}$$
with Gaussian white noise of covariance matrix Cη. Suppose observations are taken at an increasing sequence of times ti. In contrast to claims in some of the literature, these do not need to be equally spaced and one can observe different components of x at different times. Furthermore, one can also allow A and Cη to be time-dependent, though for simplicity of exposition I will not do that here.
So let the observations be
$$y_i = Z_i\,x_i + \zeta_i, \tag{A 31}$$
where , xi = x(ti), , Zi are matrices specifying which combinations of components of xi are observed, and ζi is a zero-mean Gaussian measurement noise with covariance which I suppose independent for different i.
Then for a sequence of vectors xi at the times ti, use the notation xi|i−1 = 〈xi|yi−1, …y1〉 and xi|i = 〈xi|yi, …y1〉. Similarly, define yi|i−1 = 〈yi|yi−1, …y1〉. Let
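$$P_{i|i-1} = \bigl\langle (x_i - x_{i|i-1})(x_i - x_{i|i-1})^T \bigr\rangle \tag{A 32}$$
and similarly Pi|i = 〈(xi − xi|i)(xi − xi|i)T〉. Write τi = ti − ti−1. As a consequence of the Duhamel formula
$$x_i = e^{A\tau_i}\,x_{i-1} + \int_{t_{i-1}}^{t_i} e^{A(t_i - s)}\,\eta(s)\,ds, \tag{A 33}$$
one obtains
$$x_{i|i-1} = e^{A\tau_i}\,x_{i-1|i-1} \tag{A 34}$$
and
$$P_{i|i-1} = e^{A\tau_i}\,P_{i-1|i-1}\,e^{A^T\tau_i} + G_i, \tag{A 35}$$
with
$$G_i = \int_0^{\tau_i} e^{As}\,C^\eta\,e^{A^T s}\,ds. \tag{A 36}$$
Also, averaging equation (A 31),
$$y_{i|i-1} = Z_i\,x_{i|i-1}. \tag{A 37}$$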
Let
| A 38 |
and
| A 39 |
Then
| A 40 |
Finally, by conditioning on yi, one obtains
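$$x_{i|i} = x_{i|i-1} + K_i\,(y_i - y_{i|i-1}) \tag{A 41}$$
and
$$P_{i|i} = P_{i|i-1} - K_i\,Z_i\,P_{i|i-1}, \tag{A 42}$$
where the ‘Kalman gain matrix’
$$K_i = P_{i|i-1}\,Z_i^T\,F_i^{-1}. \tag{A 43}$$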
The standard use of these equations is to provide an estimate xi|i of the state xi and its uncertainty (from Pi|i). But they can also be used to provide the likelihood for the parameters of the model, given the observations, and this is my primary goal.
To see how to do this, the likelihoods for the observations given the parameter values μ satisfy
$$P(y_i,\dots,y_1|\mu) = P(y_i|y_{i-1},\dots,y_1,\mu)\;P(y_{i-1},\dots,y_1|\mu). \tag{A 44}$$
Now yi|(yi−1, …y1, μ) is Gaussian with mean yi|i−1 and covariance Fi. So the evidence Li(μ) for the parameter value μ, defined to be the log-likelihood of the observations as a function of the parameters, updates by
| A 45 |
where
| A 46 |
Recall that di is the dimension of the observation vector yi at time ti. This provides the total evidence for the given parameters μ, starting from the initial time. Despite the fact that for a general GP it takes time O(N3) to compute the likelihood from N observations, the class of linear stochastic processes with the above algorithm takes equal time per observation, allowing the computation to be done in real-time.
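The recursion can be sketched in a few lines. The code below is a minimal illustration, not a tuned implementation, under these assumptions: the system is autonomous and stable, the filter starts from the stationary distribution, A is diagonalizable (so that the matrix exponential can be formed by eigendecomposition), and the measurement noise covariances are passed as `Hs`. The names `expm_eig`, `stationary_cov` and `kalman_log_likelihood` are hypothetical. The integrated noise covariance over a step is computed as S − eAτSeATτ, where S is the stationary covariance.

```python
import numpy as np

def expm_eig(A, t):
    """exp(A t) by eigendecomposition (assumes A is diagonalizable)."""
    lam, V = np.linalg.eig(A)
    return (V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V)).real

def stationary_cov(A, C_eta):
    """Solve A S + S A^T + C_eta = 0 for the stationary covariance S."""
    n = A.shape[0]
    I = np.eye(n)
    lhs = np.kron(A, I) + np.kron(I, A)
    return np.linalg.solve(lhs, -C_eta.reshape(-1)).reshape(n, n)

def kalman_log_likelihood(A, C_eta, times, ys, Zs, Hs):
    """Log-likelihood of observations ys at (possibly unequal) times."""
    S = stationary_cov(A, C_eta)
    x, P = np.zeros(A.shape[0]), S.copy()       # stationary prior
    t_prev, log_lik = times[0], 0.0
    for t, y, Z, H in zip(times, ys, Zs, Hs):
        Phi = expm_eig(A, t - t_prev)
        x = Phi @ x                              # predicted state
        P = Phi @ P @ Phi.T + (S - Phi @ S @ Phi.T)
        v = y - Z @ x                            # prediction error
        F = Z @ P @ Z.T + H                      # its covariance
        log_lik += -0.5 * (len(y) * np.log(2 * np.pi)
                           + np.linalg.slogdet(F)[1]
                           + v @ np.linalg.solve(F, v))
        gain = P @ Z.T @ np.linalg.inv(F)        # Kalman gain
        x = x + gain @ v                         # corrected state
        P = P - gain @ Z @ P
        t_prev = t
    return log_lik

# Cross-check on a scalar OU process with unit stationary variance.
mu, h = 1.0, 0.1
A, C_eta = np.array([[-mu]]), np.array([[2.0 * mu]])
times = np.array([0.0, 0.3, 1.1, 1.5, 3.0])
obs = np.array([0.2, 0.5, -0.1, -0.4, 0.3])     # made-up data
n = len(times)
ll_kalman = kalman_log_likelihood(A, C_eta, times, [np.array([v]) for v in obs],
                                  [np.eye(1)] * n, [h * np.eye(1)] * n)
C = np.exp(-mu * np.abs(times[:, None] - times[None, :])) + h * np.eye(n)
ll_direct = -0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(C)[1]
                    + obs @ np.linalg.solve(C, obs))
```

The direct evaluation factorizes an n × n covariance matrix, whereas the filter does a fixed amount of work per observation; the two log-likelihoods should agree to rounding error.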
One can similarly work out how to update the derivative of Li with respect to the parameters; use
| A 47 |
where prime denotes derivative with respect to any parameter, and propagate the derivatives through the Kalman filter equations. Thus one can make gradient steps to improve the estimate of the maximum-likelihood parameters.
Note that as usual in numerical optimization using gradients, it is best to use an adapted inner product μ·ν = μTGν with G a positive-definite symmetric matrix to compute the gradient from the vector L′ of derivatives, by ∇L = G−1L′. The inner product should be chosen to reflect the different typical sizes of different components of μ. Ideally, it would be close to the negative of the second derivative of the evidence at the maximum. One has also to decide how far to step along the gradient. There are various choices which go under the name of conjugate gradient methods and also modify the inner product as one goes along (this includes the Davidon-Fletcher-Powell and Broyden-Fletcher-Goldfarb-Shanno methods).
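The effect of the adapted inner product can be seen on a toy problem. The sketch below uses a hypothetical quadratic evidence function with very different curvatures in its two parameters and compares a plain gradient step with a step along G−1L′, taking G equal to the negative second derivative. All numbers and names are illustrative assumptions.

```python
import numpy as np

def adapted_gradient(dL, G):
    """Gradient with respect to the inner product mu . nu = mu^T G nu, i.e. G^{-1} L'."""
    return np.linalg.solve(G, dL)

# Toy evidence L(mu) = -0.5 (mu - mu_star)^T Hm (mu - mu_star).
Hm = np.array([[100.0, 0.0],
               [0.0, 1.0]])               # minus the second derivative of L
mu_star = np.array([1.0, -2.0])           # location of the maximum
dL = lambda mu: -Hm @ (mu - mu_star)      # vector of derivatives L'

mu0 = np.zeros(2)
mu_plain = mu0 + 0.01 * dL(mu0)                       # step size limited by the stiff direction
mu_adapted = mu0 + adapted_gradient(dL(mu0), Hm)      # unit step with G = -L''
print(mu_plain, mu_adapted)
```

With G equal to the negative second derivative, a unit step lands on the maximum of a quadratic, whereas the plain step barely moves in the weakly curved direction.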
To adapt to the case where the parameters may in reality be time-varying, one could make a GP model for the parameters as functions of time, for example an OU process or slowly varying plus jumps, and infer them. An alternative that I propose as probably more practical is to maximize an exponentially weighted sum of the gains in evidence, allowing one to forget past evidence because it is likely to become irrelevant. Choose a rate constant λ for forgetting past evidence. The evidence gained at time ti relative to ti−1 is of equation (A 46). An appropriate notion of the weighted sum of gains, that I call discounted evidence rate, is
| A 48 |
and it updates by
| A 49 |
The use of first-order filters such as equation (A 49) to take an exponentially weighted time average, allowing one to forget the distant past, is old. In the Kalman filter context, I have found it used by [76], for example, under the name ‘adaptive filtering’. In particular, the idea to first-order filter the evidence gains occurs there, as an anomaly detector. Similarly, it comes in [77]. The approach is analogous to ‘fading memory’ filters, e.g. [78], but which are for state rather than parameters. Again, derivative information can be updated and gradient steps made to track maximum-likelihood parameters.
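A first-order filter of this kind can be sketched as follows. The code assumes one plausible form of exponentially weighted sum, Ri = Σj≤i e−λ(ti − tj) gj, where gj denotes the evidence gained at time tj; the normalization in equations (A 48) and (A 49) may differ (for example by a factor of λ), and the names and gain values here are hypothetical.

```python
import numpy as np

def discounted_evidence(times, gains, lam):
    """Recursive update R_i = exp(-lam * tau_i) * R_{i-1} + g_i."""
    R = 0.0
    t_prev = times[0]
    history = []
    for t, g in zip(times, gains):
        R = np.exp(-lam * (t - t_prev)) * R + g   # forget, then add the new gain
        history.append(R)
        t_prev = t
    return np.array(history)

times = np.array([0.0, 0.4, 1.0, 1.1, 2.5])       # unequal spacing is allowed
gains = np.array([-1.2, -0.8, -1.5, -0.9, -1.1])  # hypothetical evidence gains
lam = 0.7
recursive = discounted_evidence(times, gains, lam)
direct = np.array([np.sum(np.exp(-lam * (times[i] - times[:i + 1])) * gains[:i + 1])
                   for i in range(len(times))])
```

The recursive form needs only the previous value and the newest gain, so the cost per observation stays constant, in keeping with the streaming setting.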
In principle, one can begin by specifying a prior probability density on the parameter space but its effect on the discounted evidence rate will go to zero exponentially in the time since the start, so it may not be very important.
There are issues with doing this optimization in real time. To compute at a new parameter point requires in principle to recompute all the previous at the new point. In practice, it is probably enough to compute the new , j ≥ i, at the new point and let the earlier ones get forgotten by the discounting. This would slow down each step of the optimization, however, to the discounting rate. Alternatively, one could run many filters in parallel for different parameter values, and use Bayes’ rule to combine them. A practical way to deal with pruning and merging the set of such filters is given in [79].
It may be better to also compute the second derivative matrix with respect to μ. This can be done by differentiating again. It gives the ideal choice of parameter step Δμ when near the maximum, namely the Newton step . It is not necessary to update frequently, an approximation suffices. Thus one could update (or at least ) as each observation comes in, and less frequently.
Note that with the Newton method, it is not in fact necessary to compute itself. Nevertheless, the ingredients (vi, Fi) that go into computation of are required for computation of , so one might as well track too.
There may be better optimization methods for this problem. For example, as already discussed, one could compute at a set of parameter values simultaneously and replace parameter values with low by ones chosen to be likely to have high from time to time. Indeed, there are Bayesian optimization procedures that fit a GP to the evidence function and automatically choose the next evaluation point to minimize the variance of the maximum of the posterior GP. Gradient information can be incorporated too.
An alternative to linear stochastic processes and Kalman filtering might be to use a more general stationary GP with equal observation time intervals and fast methods for Cholesky decomposition of the resulting block Toeplitz covariance matrices, e.g. [80–83].
A.7. Terms involving eDt
For block-diagonal D with blocks of the forms −λm and , then eDt is block-diagonal with blocks and .
So for any matrix C, is given by blocks of the following forms:
- (i) two real modes: 1 × 1 blocks
- (ii) one real mode m and one complex mode n: 1 × 2 blocks
- (iii) one complex mode n and one real mode m: 2 × 1 blocks
- (iv) two complex modes: 2 × 2 blocks
In particular, $G_i$ of equation (3.5) has blocks of the corresponding types (dropping the suffix i and the superscript η from $C^\eta$):
where
and
where , and cc, cs, sc, ss are defined by
with $a = \alpha_n + \alpha_{n'}$, $\sigma = \omega_n + \omega_{n'}$, $\delta = \omega_n - \omega_{n'}$.
The covariance matrix S of equation (2.6) is the case τ = +∞ of G.
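The block-diagonal structure of the matrix exponential stated at the start of this section can be checked numerically. The sketch below assumes a particular sign convention for the complex-mode block, namely B = [[−α, ω], [−ω, −α]], which is a choice made here and may differ from the paper's. Under that assumption the exponential of Bt is a decaying rotation. The helper names are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_taylor(M, terms=60):
    """Matrix exponential by a truncated Taylor series (fine for small norms)."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes M^k / k!
        term = [[x / k for x in row] for row in matmul(term, M)]
        result = [[r + s for r, s in zip(rr, ss)]
                  for rr, ss in zip(result, term)]
    return result

def exp_dt_closed_form(lam, alpha, omega, t):
    """Closed form of exp(Dt) for D = diag(-lam, B) with the assumed block B."""
    decay = math.exp(-alpha * t)
    c, s = math.cos(omega * t), math.sin(omega * t)
    return [[math.exp(-lam * t), 0.0, 0.0],
            [0.0, decay * c, decay * s],
            [0.0, -decay * s, decay * c]]

# One real mode (rate lam) and one complex mode (rate alpha, frequency omega).
lam, alpha, omega, t = 0.7, 0.3, 2.0, 1.5
Dt = [[-lam * t, 0.0, 0.0],
      [0.0, -alpha * t, omega * t],
      [0.0, -omega * t, -alpha * t]]
numeric = expm_taylor(Dt)
closed = exp_dt_closed_form(lam, alpha, omega, t)
max_err = max(abs(numeric[i][j] - closed[i][j])
              for i in range(3) for j in range(3))
```

The two results agree to rounding error, which confirms that each real mode contributes a 1 × 1 exponential block and each complex mode a 2 × 2 decaying-rotation block.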
A.8. Skew-product structure of the power system model
The power system model (5.6) and (5.8) has a skew-product structure, namely $\dot p$ does not depend on x (the x-dynamics also has structure, in that only the frequency deviations δf see δp directly). In reality, perhaps $\dot p$ does depend a little on x, e.g. through National Grid balancing operations and frequency-sensitive generators and loads, but I continue with this model. One way to exploit the skew-product structure is to derive the covariance function for δp using equation (A 23) and then insert this into the formula (A 21) for the covariance function of x, but it leads to an integration whose treatment is not simple. Alternatively, one can apply equation (A 23) to the joint systems (5.6) and (5.8), exploit the skew-product form of the impulse response, and take the xx-block of the covariance function. I chose the latter approach, subject to the simplifying but generic assumption of simple eigenvalues for the full system.
The impulse response of equation (5.8) can be written in matrix exponential notation as $\delta p(t) = e^{-Jt}$. Similarly, the impulse response of equation (5.6) can be written as $x(t) = e^{At}$. To compute the response of x to an impulse on $\dot p$, it is convenient to assume that A and −J have no eigenvalues in common, as is generically the case. Then there exists a unique solution E to another Sylvester equation
| A 50 |
and defining y = x + Ep shows that $\dot y = Ay$. So the response of y to an impulse on $\dot p$ is $e^{At}E$. It follows that the response of x = y − Ep to an impulse on $\dot p$ is
| A 51 |
Note that, using equation (A 50), the time-derivative of $h_{xp}$ at t = 0 is just C. Thus the impulse response of the full system has the block form
| A 52 |
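The chain of reasoning leading to (A 50) to (A 52) can be sketched as a short derivation. It rests on an assumption made here: that the deterministic parts of (5.6) and (5.8) read $\dot x = Ax + Cp$ and $\dot p = -Jp$, with C the coupling matrix referred to in the text. The displayed forms are therefore candidate reconstructions consistent with the surrounding statements, not quotations of the original equations.

```latex
\begin{align*}
\dot x &= A x + C p, \qquad \dot p = -J p
  && \text{(assumed deterministic parts of (5.6) and (5.8))} \\
y = x + E p \;&\Rightarrow\;
  \dot y = A x + C p - E J p = A y + (C - A E - E J)\, p \\
A E + E J = C \;&\Rightarrow\; \dot y = A y
  && \text{(candidate form of (A 50))} \\
h_{xp}(t) &= e^{At} E - E\, e^{-Jt}, \qquad
  h_{xp}(0) = 0, \qquad \dot h_{xp}(0) = A E + E J = C
  && \text{(candidate form of (A 51))} \\
h(t) &= \begin{pmatrix} e^{At} & h_{xp}(t) \\ 0 & e^{-Jt} \end{pmatrix}
  && \text{(candidate form of (A 52))}
\end{align*}
```

The Sylvester equation $AE + EJ = C$ has a unique solution exactly when A and −J share no eigenvalue, which matches the genericity assumption stated above.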
Then the stationary covariance matrix S (A 24) of the joint process has block form
| A 53 |
It follows from equation (A 23) that (for τ > 0)
| A 54 |
Thus the covariance of x = (δf, δΔ) is a linear combination of functions from the impulse response of x to and of p to .7
Footnotes
1. This is relatively easy to incorporate, e.g. [15], but a full treatment would require including voltage control, power system stabilizers and exciter control.
2. As $\phi_l$ is in radians it might be better to denote $f_l$ by $\omega_l$, but I am already using ω for mode frequencies.
3. In reality, the system operator is required to keep the phases within some interval (of about 100 cycles) around that for a reference rotor at the nominal frequency, so it makes changes to p to achieve this, thereby breaking the phase rotation invariance, but I ignore that.
4. This is achieved with great precision already, but it might be that my approach would have advantages in some circumstances.
5. Yet another method to determine the eigenvalues of an asymptotically stable system from the response to an impulse is to Laplace transform the response numerically and then fit a Padé approximation and read off its poles.
6. Also, I took a computer music course at Princeton in 1978/1979 in which we made artificial human song by filtering white noise.
7. In the case of common eigenvalues λ to −J and A there would in general also be terms of the form $P(\tau)e^{\lambda\tau}$ with P a polynomial of degree higher than those which might already result from multiplicity in −J or A.
Data accessibility
The frequency trace used in §5 was obtained from a publicly available site, www.nationalgrid.com/Enhanced-Frequency-Response.aspx. That link no longer appears to be active, so I have created a Dryad entry containing the Excel file at doi:10.5061/dryad.z34tmpgb4 [22].
Competing interests
The author is an associate editor of RSOS but has not been involved in the assessment of the submission.
Funding
The beginning of the work was supported by National Grid under Network Innovation Allowance award NIA_NGET0161. The later parts were supported by the Alan Turing Institute under award no. TU/B/000101.
References
- 1. Turunen J, Renner H, Hung WW, Carter AM, Ashton PM, Haarla LC. 2015. Simulated and measured inter-area mode shapes and frequencies in the electrical power system of Great Britain. In IET Int. Conf. on Resilience of Transmission and Distribution Networks (RTDN2015), pp. 136–141. Stevenage, UK: IET.
- 2. CIGRE Task Force 38.01.07 on Power System Oscillations, Paserba J (convenor). 1996. Analysis and control of power system oscillations. CIGRE Technical Brochure no. 111, December 1996.
- 3. MacKay DJC. 2003. Information theory, inference, and learning algorithms. Cambridge, UK: Cambridge University Press.
- 4. Lindquist A, Picci G. 2015. Linear stochastic systems. New York, NY: Springer.
- 5. MacKay RS, Phillips NE. 2018. A natural 4-parameter family of covariance functions for stationary Gaussian processes. See http://arxiv.org/abs/1810.07738.
- 6. Shah A, Wilson AG, Ghahramani Z. 2014. Student-t processes as alternatives to Gaussian processes. Proc. AISTATS Reykjavik. J. Machine Learning Res: Workshop and Conf. Proc. 33.
- 7. Fang KT, Kotz S, Ng KW. 1989. Symmetric multivariate and related distributions. London, UK: Chapman & Hall.
- 8. Tsionas MG. 2016. Bayesian analysis of multivariate stable distributions using one-dimensional projections. J. Multivar. Anal. 143, 185–193. (doi:10.1016/j.jmva.2015.09.005)
- 9. Nguyen TT. 1995. Conditional distributions and characterizations of multivariate stable distribution. J. Multivar. Anal. 53, 181–193. (doi:10.1006/jmva.1995.1031)
- 10. Roth M, Ardeshiri T, Özkan E, Gustafsson F. 2017. Robust Bayesian filtering and smoothing using Student's t distribution. See http://arxiv.org/abs/1703.02428.
- 11. Machowski J, Bialek JW, Bumby JR. 2008. Power system dynamics, 2nd edn. New York, NY: Wiley.
- 12. Rogers G. 2000. Power system oscillations. Amsterdam, The Netherlands: Kluwer.
- 13. Andersson G. 2012. Power system analysis. ETH lecture notes. Zurich, Switzerland: ETH.
- 14. Grainger JJ, Stevenson WD. 1994. Power system analysis. Singapore: McGraw-Hill.
- 15. Trip S, Bürger M, De Persis C. 2015. An internal model approach to (optimal) frequency regulation in power grids with time-varying voltages. See http://arxiv.org/abs/1403.7019v3.
- 16. Janssens N, Kamagate A. 2000. Interarea oscillations in power systems. IFAC Power Plants and Power System Control, pp. 217–226. Brussels, Belgium: Elsevier.
- 17. Susuki Y, Mezic I, Hikihara T. 2012. Coherent swing instability of interconnected power grids and a mechanism of cascading failure. In Control and optimisation methods for electric smart grids (eds A Chakraborty, MD Ilic), pp. 185–202. New York, NY: Springer.
- 18. Susuki Y, Mezic I. 2012. Nonlinear Koopman modes and a precursor to power system swing instabilities. IEEE Trans. Power Syst. 27, 1182–1191. (doi:10.1109/TPWRS.2012.2183625)
- 19. Gibbard MJ, Pourbeik P, Vowles DJ. 2015. Small-signal stability, control and dynamics performance of power systems. Adelaide, Australia: University of Adelaide Press.
- 20. Turunen J, Thambirajah J, Larsson M, Pal BC, Thornhill NF, Haarla LC, Hung WW, Carter AM, Rauhala T. 2011. Comparison of three electromechanical oscillation damping estimation methods. IEEE Trans. Power Syst. 26, 2398–2407. (doi:10.1109/TPWRS.2011.2155684)
- 21. Wang X, Bialek JW, Turitsyn K. 2018. PMU-based estimation of dynamic state jacobian matrix and dynamic system state matrix in ambient conditions. IEEE Trans. Power Syst. 33, 681–690. (doi:10.1109/TPWRS.2017.2712762)
- 22. National Grid frequency trace. See doi:10.5061/dryad.z34tmpgb4.
- 23. Seppänen J, Au S-K, Turunen J, Haarla L. 2017. Bayesian approach in the modal analysis of electromechanical oscillations. IEEE Trans. Power Syst. 32, 316–325. (doi:10.1109/TPWRS.2016.2561020)
- 24. Schäfer B, Beck C, Aihara K, Witthaut D, Timme M. 2018. Non-Gaussian power grid frequency fluctuations characterised by Lévy-stable laws and superstatistics. Nat. Energy 3, 119–126. (doi:10.1038/s41560-017-0058-z)
- 25. Trudnowski DJ. 2008. Estimating electromechanical mode shape from synchrophasor measurements. IEEE Trans. Power Syst. 23, 1188–1195. (doi:10.1109/TPWRS.2008.922226)
- 26. Han S, Rong N, Sun T, Zhang J. 2013. An approach for estimating mode shape for participation of inter-area oscillation mode. In IEEE Int. Symp. on Circuits & Systems (ISCAS), pp. 2968–2971. Beijing, China: IEEE.
- 27. Gorbunov A, Dymarsky A, Bialek J. 2020. Estimation of parameters of a dynamic generator model from modal PMU measurements. IEEE Trans. Power Syst. 35, 53–62. (doi:10.1109/TPWRS.2019.2925127)
- 28. Pierre JW, Trudnowski DJ, Donnelly MK. 1997. Initial results in electromechanical mode identification from ambient data. IEEE Trans. Power Syst. 12, 1245–1251. (doi:10.1109/59.630467)
- 29. Zhou N, Trudnowski DJ, Pierre JW, Mittelstadt WA. 2008. Electromechanical mode online estimation using regularised robust RLS methods. IEEE Trans. Power Syst. 23, 1670–1680. (doi:10.1109/TPWRS.2008.2002173)
- 30. Messina A, Vittal V. 2007. Extraction of dynamic patterns from wide-area measurements using empirical orthogonal functions. IEEE Trans. Power Syst. 22, 682–692. (doi:10.1109/TPWRS.2007.895157)
- 31. Barocio E, Pal BC, Thornhill NF, Messina AR. 2015. A dynamic mode decomposition framework for global power system oscillation analysis. IEEE Trans. Power Syst. 30, 2902–2912. (doi:10.1109/TPWRS.2014.2368078)
- 32. Au S-K. 2017. Operational modal analysis. New York, NY: Springer.
- 33. He J, Fu Z-F. 2001. Modal analysis. Oxford, UK: Butterworth-Heinemann.
- 34. Kosovichev AG. 2009. Solar oscillations. In Stellar Pulsation. AIP Conf. Proc., vol. 1170, pp. 547–559. New York, USA.
- 35. Phillips NE, Manning C, Papalopulu N, Rattray M. 2016. Identifying stochastic oscillations in single-cell live imaging time series using Gaussian processes. See http://arxiv.org/abs/1608.06476v2.
- 36. Romer D. 2001. Advanced macroeconomics. New York, NY: McGraw-Hill.
- 37. Hamilton JD. 1994. Time series analysis. Princeton, NJ: Princeton University Press.
- 38. Hasenzagl T, Pellegrino F, Reichlin L, Ricco G. 2019. A model of the Fed's view on inflation, working paper, 7 Nov, Centre for Economic Policy Research, London, UK.
- 39. Brincker R, Zhang L, Andersen P. 2000. Output-only modal analysis by frequency domain decomposition, Proc. ISMA25, 2, 7.
- 40. Gregory PC. 2001. A Bayesian revolution in spectral analysis. In Bayesian inference and maximum entropy methods in science and engineering (ed. A Mohammad-Djafari), Am. Inst. Phys. Conf. Proc., vol. 568, pp. 557–568. Melville, NY: American Institute of Physics.
- 41. Bretthorst GL. 1988. Bayesian spectrum analysis and parameter estimation. Lect. Notes Stats, vol. 48. New York, NY: Springer.
- 42. Papy JM, De Lathauwer L, Van Huffel S. 2005. Exponential data fitting using multilinear algebra: the single-channel and multi-channel case. Numer. Lin. Alg. Appl. 12, 809–826. (doi:10.1002/nla.453)
- 43. Potts D, Tasche M. 2013. Parameter estimation for nonincreasing exponential sums by Prony-like methods. Lin. Alg. Appl. 439, 1024–1039. (doi:10.1016/j.laa.2012.10.036)
- 44. Zielinski TP, Duda K. 2011. Frequency and damping estimation methods - an overview. Metrol. Meas. Syst. 18, 505–528. (doi:10.2478/v10178-011-0051-y)
- 45. MacKay DJC. 1998. Introduction to Gaussian processes. NATO ASI series F Comp. Syst. Sci. 168, 133–166.
- 46. Lloyd JR, Duvenaud D, Grosse R, Tenenbaum JB, Ghahramani Z. 2014. Automatic construction and natural-language description of nonparametric regression models. See http://arxiv.org/abs/1402.4304.
- 47. Abrahamsen P. 1997. A review of Gaussian random fields and correlation functions, Technical report 917, Norwegian Computing Center, Oslo.
- 48. van Overschee P, de Moor B. 1996. Subspace identification. Amsterdam, The Netherlands: Kluwer.
- 49. Young PC. 1999. Data-based mechanistic modelling, generalised sensitivity and dominant mode analysis. Comp. Phys. Commun. 117, 113–129. (doi:10.1016/S0010-4655(98)00168-4)
- 50. Tu JH, Rowley CW, Luchtenburg DM, Brunton SL, Kutz JN. 2014. On dynamic mode decomposition: theory and applications. J. Comput. Dyn. 1, 391–421. (doi:10.3934/jcd.2014.1.391)
- 51. D'Amico G, Petroni F, Prattico F. 2015. Wind speed prediction for wind farm applications by extreme value theory and copulas. J. Wind Eng. Ind. Aerodyn. 145, 229–236. (doi:10.1016/j.jweia.2015.06.018)
- 52. Troffaes MCM, Williams E, Dent CJ. Data analysis and robust modelling of the impact of renewable generation on long term security of supply and demand. In 2015 IEEE Power and Energy Society General Meeting, pp. 1–5. London, Ontario: IEEE.
- 53. Wadman WS, Bloemhof G, Crommelin D, Frank J. 2012. Probabilistic power flow simulation allowing temporary current overloading. Proc. PMAPS 2012, 494–499.
- 54. Durrande N, Hensman J, Rattray M, Lawrence ND. 2016. Detecting periodicities with Gaussian processes. See http://arxiv.org/abs/1303.7090v2.
- 55. Grandmont J-M. 1985. On endogenous competitive business cycles. Econometrica 53, 995–1045. (doi:10.2307/1911010)
- 56. Sims CA. 1980. Macroeconomics and reality. Econometrica 48, 1–48. (doi:10.2307/1912017)
- 57. Rasmussen CE, Williams CKI. 2006. Gaussian processes for machine learning. Cambridge, MA: MIT Press.
- 58. Reece S, Roberts S. An introduction to Gaussian processes for the Kalman filter expert. In Information Fusion 2010 (IEEE). Edinburgh, UK. (doi:10.1109/ICIF.2010.5711863)
- 59. Roberts S, Osborne M, Ebden M, Reece S, Gibson N, Aigrain S. 2013. Gaussian processes for timeseries modelling. Phil. Trans. R. Soc. A 371, 20110550. (doi:10.1098/rsta.2011.0550)
- 60. Alvarez M, Luengo D, Lawrence ND. 2009. Latent force models. In Proc. 12th AISTATS, Clearwater Beach, FL, USA.
- 61. Gardiner CW. 2009. Stochastic methods, 4th edn. New York, NY: Springer.
- 62. Boyle P, Frean M. 2005. Dependent Gaussian processes. In Advances in Neural Information Processing Systems. Montreal, Canada.
- 63. Tobar F, Bui TD, Turner RE. 2015. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, vol. 28. Cambridge, MA: MIT Press.
- 64. Caines PE. 1988. Linear stochastic systems. New York, NY: Wiley.
- 65. West M, Harrison PJ. 1989. Bayesian forecasting and dynamic models. Berlin, Germany: Springer.
- 66. Hartikainen J, Särkkä S. 2010. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In Proc. IEEE Internat. Workshop on Machine Learning for Signal Processing. New York, NY: IEEE. (doi:10.1109/MLSP16741.2010)
- 67. Reece S, Ghosh S, Rogers A, Jennings N, Roberts S. 2014. Efficient state-space inference of periodic latent force models. See http://arxiv.org/abs/1310.6319v2.
- 68. Adler RJ. 1981. The geometry of random fields. New York, NY: Wiley.
- 69. Papoulis A. 1991. Probability, random variables and stochastic processes. New York, NY: McGraw-Hill.
- 70. Bhatia R, Rosenthal P. 1997. How and why to solve the operator equation AX − XB = Y. Bull. Lond. Math. Soc. 29, 1–21. (doi:10.1112/S0024609396001828)
- 71. Mockus J. 1989. Bayesian approach to global optimization. Amsterdam, The Netherlands: Kluwer.
- 72. O'Hagan A. 1991. Bayes-Hermite quadrature. J. Stat. Plan. Inference 29, 245–260. (doi:10.1016/0378-3758(91)90002-V)
- 73. Alston C, Kuhnert P, Low Choy S, McVinish R, Mengersen K. 2005. Bayesian model comparison: review and discussion. In Internat. Stat. Institute 55th session. Sydney, Australia.
- 74. Bui TD, Nguyen CV, Turner RE. 2017. Streaming sparse Gaussian process approximations. See http://arxiv.org/abs/1705.07131v2.
- 75. Young PC. 2011. Recursive estimation and time-series analysis, 2nd edn. New York, NY: Springer.
- 76. Knorn F, Leith DJ. 2008. Adaptive Kalman filtering for anomaly detection in software applications. Proc. Info Comm workshop on automated network management. New York, NY: IEEE. (doi:10.1109/INFOCOM.2008.4544581)
- 77. Söderström T, Stoica P. 1989. System identification. Denver, CO: Prentice Hall.
- 78. Simon D. 2006. Optimal state estimation. New York, NY: Wiley.
- 79. Blom HAP, Bar-Shalom Y. 1988. The interacting multiple model algorithm for systems with Markovian switching coefficients. IEEE Trans. Autom. Control 33, 780–783. (doi:10.1109/9.1299)
- 80. Ammar GS, Gragg WB. 1987. The generalised Schur algorithm for the superfast solution of Toeplitz systems. In Rational approximation and its applications in mathematics and physics (eds J Gilewicz, M Pindor, W Siemaszko), Lect. Notes Math., vol. 1237, pp. 315–330. New York, NY: Springer.
- 81. Gallivan K, Thirumalai S, van Dooren P. On solving block Toeplitz systems using a block Schur algorithm. Parallel Processing 1994 (IEEE Conf.), vol. 3, pp. 274–281. New York, NY: IEEE.
- 82. Park H, Elden L. 2000. Schur-type methods for solving least squares problems with Toeplitz structure. SIAM J. Sci. Comput. 22, 406–430. (doi:10.1137/S1064827598347423)
- 83. Stewart M. 1997. Cholesky factorization of semidefinite Toeplitz matrices. Lin. Alg. Appl. 254, 497–525. (doi:10.1016/S0024-3795(96)00517-4)