Abstract
We develop a statistical framework for simulating natural hazard events that combines extreme value theory and geostatistics. Robust generalized additive model forms represent generalized Pareto marginal distribution parameters while a Student’s t-process captures spatial dependence and gives a continuous-space framework for natural hazard event simulations. Efficiency of the simulation method allows many years of data (typically over 10 000) to be obtained at relatively little computational cost. This makes the model viable for forming the hazard module of a catastrophe model. We illustrate the framework by simulating maximum wind gusts for European windstorms, which are found to have realistic marginal and spatial properties, and validate well against wind gust measurements.
Keywords: spatial statistics, statistics of extremes, Student’s t-process, wind gust data, generalized Pareto distribution, quantile regression
1. Introduction
Natural hazard events can have devastating and widespread effects on society. In 2014, natural hazards are estimated to have caused USD 106 and 29 billion of economic and insured loss, respectively. Over the past few decades, European windstorms, for example, have been the second biggest cause of insured loss from natural hazards globally, after US hurricanes. Windstorms Christian, Xavier, Dirk and Tini, which struck over the 2013/2014 winter, caused insured losses totalling USD 3.3 billion. To ensure their solvency, insurance companies must have accurate understanding of their potential losses.
Events that cause significant loss are rare, resulting in a lack of vital data and other relevant knowledge. One method for overcoming data scarcity is to simulate events. This can help build a probabilistic view of loss, in addition to providing information on types and strengths of defences required to offer sufficient protection.
One strategy for producing probabilistic estimates of losses from natural hazard events is known as catastrophe modelling (see Grossi & Kunreuther [1] for an overview). Such models are usually formed by linking hazard, vulnerability, damage and loss modules. When combined these characterize the extent of the hazard event, the property susceptible to damage, the damage caused to the property, given the hazard, and the subsequent loss. Often catastrophe models are used to estimate loss distributions, of which extreme quantiles are typically important. (For example, the Solvency II directive1 is based on the 99.5% quantile of a company’s annual loss distribution.) Hazard modules can be used to produce arbitrarily many synthetic events from which losses can be calculated, to improve precision of extreme quantile estimates. Various contrasting approaches to the hazard module exist: translations, distortions or parsimonious parametrizations of historical events, which can fail to capture the full population variation in events; physically based simulation models similar to climate models, which can under-represent processes, such as wind gust speeds [2,3], and, in turn, underestimate losses; or multivariate statistical models that incorporate extreme value statistics, which have extended from bivariate dependencies [4,5] to max-stable processes [6].
The area affected by a natural hazard event can vary considerably in size: heatwaves might affect entire continents, whereas flooding events may only span a few metres. To capture events entirely, simulation domains may be very different in size. Local variation of natural hazards can also vary considerably: relative variations in temperature over a small domain are typically much less than those of rainfall. European windstorms, however, can affect large areas and have high local variability; adequate simulations of these must represent a large domain at high resolution. A robust simulation model for natural hazard events must allow various different combinations of domain size, resolution and variability amounts on different spatial scales.
To meet these simulation criteria, we propose a framework that couples extreme value and geostatistical methods. Works by Casson & Coles [7], Cooley et al. [8] and Sang & Gelfand [9] consider a similar coupling, although here we focus on a geostatistical model for residuals—an approach sometimes referred to as anamorphosis. Our approach allows excesses corresponding to exceedances of a high threshold to be simulated, and provides an alternative benchmark for catastrophe models for the following reasons: geostatistical models can give highly efficient simulations of high-resolution random fields, compared with fully multivariate models; various forms for dependence over space and time exist for geostatistical models, which have well represented various types of environmental phenomena (see Diggle & Ribeiro [10]); and statistically sound marginal estimates for extremes are used. Furthermore, we can quickly implement the method on a personal computer, thus requiring less computational time and resource than other types of hazard module, such as those built similarly to climate models.
The following section describes a spatial, extreme value framework for simulating natural hazard events. Section 3 presents a model for simulating extreme windstorm events for a large part of Europe together with some simulations from the model. Section 4 provides a summary of the framework presented and of its performance for simulating windstorm events.
2. Method
This section gives details of the framework proposed to give realistic simulations of natural hazard events. Throughout let Y (s,t) represent values of a natural hazard process at location s∈S and time t=1,…,T. The simulation model generates excesses above a high threshold, which are assumed to follow a generalized Pareto distribution (GPD). The threshold can either be chosen or estimated, but must be sufficiently high that the GPD assumption is valid. For loss estimation, it is useful if the threshold is below any value of the natural hazard process above which damage can occur; for example, around 25 m s−1 for wind gusts. The simulation model must also represent spatio-temporal dependence realistically and include an estimate of the threshold exceedance rate to allow simulations to represent specific time periods.
(a). Marginal threshold excess model
We assume that excesses of some spatially varying threshold u(s), for location s, follow the GPD, so that
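$$\Pr\{Y(s,t)-u(s)\le y \mid Y(s,t)>u(s)\} = 1-\left[1+\frac{\xi(s)\,y}{\psi_u(s)}\right]^{-1/\xi(s)}, \tag{2.1}$$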
where ψu(s)>0 and ξ(s) are scale and shape parameters, respectively, for {y : 1+ξ(s)y/ψu(s)>0}; the scale parameter may vary with u(s). We consider generalized additive model (GAM, [11,12]) forms for GPD parameters, summarized as
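$$\log\psi_u(s)\ \text{or}\ \xi(s) = f_0(\mathrm{lon}(s),\mathrm{lat}(s)) + \sum_{p=1}^{q} f_p(z_p(s)) + \sum_{p<p'} f_{pp'}(z_p(s),z_{p'}(s)), \tag{2.2}$$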
where lon(s) and lat(s) represent longitude and latitude, and zp(s), p=1,…,q, represent covariates, at location s. The log link in equation (2.2) ensures ψu(s)>0, for all s∈S; a transformation for ξ(s) may not be necessary. Parametric or non-parametric forms can be considered for f0( ), fp( ) and fpp′( , ). We focus on GAM forms and propose specific types that are suitable for simulating many different hazard types. These are found to be more flexible than parametric forms, such as those described in Coles [13], ch. 6, and simpler to fit than non-parametric forms, such as Casson & Coles [7] and Cooley et al. [8]. We represent f0( , ) as a thin plate regression spline [14], as longitude and latitude have the same units; for this reason, we distinguish longitude and latitude from other covariates. For other covariates, where interactions may occur and scales and units may differ, we propose additive and tensor-product forms [15].
GPD models whose parameters have spline forms can be fitted in various ways. Chavez-Demoulin & Davison [12] propose an approach based on maximum likelihood that combines Newton–Raphson and generalized ridge regression steps. This can be used to simultaneously estimate both regression and smoothing parameters. Their orthogonal parametrization guarantees convergence. For interpretability, we use spline forms for log ψu(s) and ξ(s), which are not orthogonal. We estimate smoothing parameters using generalized (approximate) cross-validation [16], where at each iteration we find GPD parameters by numerically maximizing the penalized likelihood [17,18].
(b). Threshold estimation
Here we consider models for the spatially varying threshold, u(s). A constant threshold may not suit large domains, especially if marginal distributions exhibit large variation. We use quantile regression to estimate u(s), which is assumed to have GAM form. Put briefly, u(s) is chosen to minimize the so-called tilted loss function, subject to some additional roughness penalty terms, which depend on the spline choices (see Koenker [19], ch. 7 for complete details). This approach to modelling and estimating u(s), coupled with a GPD model for excesses of u(s), was proposed by Northrop & Jonathan [20]. We extend the additive spatial GAM forms for longitude and latitude used in Northrop & Jonathan [20] to thin plate regression splines forms. Link functions may also be considered to relate u(s) to GAM terms.
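The tilted loss can be illustrated for the simplest case of a constant threshold, where its minimizer is the sample quantile. The sketch below is only that special case: it omits the GAM terms and roughness penalties, and the synthetic gust speeds and the grid search are illustrative assumptions rather than part of the method described here.

```python
import numpy as np

def tilted_loss(u, y, tau):
    """Mean tilted (pinball) loss of a candidate threshold u at quantile level tau."""
    r = np.asarray(y) - u
    # Residuals above u are weighted by tau, residuals below by (1 - tau)
    return np.mean(np.where(r >= 0, tau * r, (tau - 1.0) * r))

rng = np.random.default_rng(1)
gusts = rng.weibull(2.0, size=5000) * 12.0  # synthetic gust speeds (m/s), illustrative only

tau = 0.98
candidates = np.linspace(gusts.min(), gusts.max(), 2001)
losses = np.array([tilted_loss(u, gusts, tau) for u in candidates])
u_hat = candidates[np.argmin(losses)]  # constant-threshold estimate of the 98th percentile
```

In the spatial setting, u would be replaced by a spline function of location and covariates, and the penalty terms added to the loss before minimizing.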
For simulations to represent given time periods, the rate of upcrossing of the threshold must be taken into account, which we denote ζ(s)=Pr(Y (s,t)>u(s)). If the quantile level used in quantile regression is fixed at 1−ζ, then ζ(s)=ζ for all s∈S, t=1,…,T. Alternatively, the threshold can be fixed and logistic regression used to estimate ζ(s), using a similar additive form to equation (2.2) (see Wood [21] for inference details in this case).
(c). Residual dependence model
By considering marginal and dependence models separately, the model can be related to copula-based approaches. Disadvantages to this separation, identified in Mikosch [22], are potential bias in stochastic dependence estimates, a lack of supporting statistical theory and potentially poor representation of multivariate extremes or, by virtue of being static, temporal dependence. Here the advantages, given in §1, of fast, high-resolution simulations, robust dependence forms and extreme-value margins are seen to outweigh these disadvantages. To efficiently achieve high-resolution simulations, we restrict attention to residual models based on spatially continuous stochastic processes, which are widely used in geostatistics.
Gaussian processes are commonly used in geostatistics and may be used for anamorphosis. They are robust and can give efficient simulations on high-resolution grids using circulant embedding methods [23], or for irregularly spread locations. The Gaussian process model imposes asymptotic independence between different locations s and s′ at time t: if y+ is the upper endpoint of the distribution of Y (s,t) and χ = lim_{y→y+} Pr{Y (s,t)>y | Y (s′,t)>y}, then χ=0 [24]; when χ>0 asymptotic dependence occurs. This assumption may be inappropriate for some natural hazards (see Coles et al. [25] for environmental examples). The Student’s t-process can be seen as a generalization. It imposes asymptotic dependence when its degrees of freedom, ν, are finite, and asymptotic independence when infinite, as the latter can be parametrized as a Gaussian process. We treat ν as an unknown parameter.
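The conditional exceedance probability Pr{Y (s,t)>y | Y (s′,t)>y} can be estimated empirically at increasing levels y to see how dependence behaves towards the tail. The following is a minimal sketch on synthetic, correlated Gaussian data (which are asymptotically independent); the function names, sample size and bootstrap settings are assumptions made for illustration.

```python
import numpy as np

def cond_exceed_prob(y1, y2, y):
    """Empirical Pr{Y1 > y | Y2 > y}; NaN if the second series never exceeds y."""
    exceed2 = y2 > y
    if not exceed2.any():
        return np.nan
    return np.mean(y1[exceed2] > y)

def bootstrap_interval(y1, y2, y, n_boot=200, level=0.95, rng=None):
    """Percentile bootstrap interval, resampling time points (pairs) with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y1)
    est = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        est[b] = cond_exceed_prob(y1[idx], y2[idx], y)
    a = (1.0 - level) / 2.0
    return np.nanquantile(est, [a, 1.0 - a])

rng = np.random.default_rng(0)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=20000)
y1, y2 = z[:, 0], z[:, 1]

p_low = cond_exceed_prob(y1, y2, 1.0)   # moderate level
p_high = cond_exceed_prob(y1, y2, 2.5)  # higher level: estimate is smaller for Gaussian data
lo, hi = bootstrap_interval(y1, y2, 2.5, rng=rng)
```

For asymptotically independent data the estimates decay towards zero as y increases, but, as with real gust data, few joint exceedances at high levels make the intervals wide.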
We consider a Student’s t-process tail model: a Student’s t and continuous space extension of the multivariate Gaussian tail model of Bortot et al. [26]. The marginal model of §2a can be used to give uniformly distributed residuals on [1−ζ(s),1],
$$e^*(s,t) = 1-\zeta(s)\left[1+\xi(s)\,\frac{y(s,t)-u(s)}{\psi_u(s)}\right]^{-1/\xi(s)}, \qquad y(s,t)>u(s), \tag{2.3}$$
which may be transformed to the final residuals e(s,t)=Gν^{−1}(e*(s,t)), where Gν^{−1} is the inverse Student’s t-distribution function with ν degrees of freedom. Residuals are modelled with a tail Student’s t-process, which is written as
$$e(s,t) \sim \text{tail-tProcess}_{\nu}\big(0,\, c((s,t),(s',t'))\big), \tag{2.4}$$
where c((s,t),(s′,t′)) is a spatio-temporal correlation function and ν is the degrees of freedom of the process. A suitable form for the correlation function will depend on the application, although inclusion of a nugget term is likely to be necessary to allow local measurement error. A form for c((s,t),(s′,t′)) suitable for extreme windstorm events is presented in §3. Parameters of the tail Student’s t-process can be estimated by maximizing the censored log-likelihood of the corresponding finite-dimensional multivariate tail Student’s t-distribution.
It is natural to consider max-stable processes (R. L. Smith 1990, unpublished manuscript; [27]) as alternative residual models. These fit readily into the proposed framework by converting the uniform residuals of equation (2.3) to have unit Fréchet distribution, i.e. −1/log e*(s,t). Options for max-stable models include Schlather, extremal t and Brown–Resnick processes [27–29]. We focus on anamorphosis-based models for their interpretability, relative to widely used geostatistical methods, and simulation efficiency. Related findings are presented in §3b(iii) and §3d(iii).
(d). Event simulation
Values of a natural hazard process are simulated for a fixed number of locations s1,…,snS, which may be regularly or irregularly spread. The time domain of the event is the set of times for which values are simulated. For example, if daily values are simulated for a 3-day event centred at time t*, the time domain is {t*−1, t*, t*+1}. The following algorithm details simulation of a single event, and assumes that the threshold, marginal excess and residual dependence models have been fitted.
(i) Simulate e(s,t) from tail-tProcessν(0,c((s,t),(s′,t′))) for locations s1,…,snS and all times in the event’s time domain.
(ii) Obtain uniform residuals e*(s,t)=Gν(e(s,t)).
(iii) Set
$$y(s,t) = u(s) + \frac{\psi_u(s)}{\xi(s)}\left[\left\{\frac{1-e^*(s,t)}{\zeta(s)}\right\}^{-\xi(s)} - 1\right] \qquad \text{for } e^*(s,t) > 1-\zeta(s),$$
where ζ(s)=Pr(Y (s,t)>u(s)); recall §2b.
Repeat steps (i)–(iii) to give the required number of simulated events. Extensions for the case where simulated values of non-exceedances of the threshold u(s) are also required are proposed in §4.
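Steps (i)–(iii) can be sketched for a single time point and a small set of locations as follows. This is an illustrative sketch under stated assumptions, not the implementation used here: the Student's t field is drawn as a Gaussian vector scaled by a chi-squared mixing variable, non-exceedances are returned as NaN, ξ(s)≠0 is assumed, and the exponential correlation and parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

def simulate_event(u, psi, xi, zeta, chol, nu, rng):
    """One simulated field of threshold exceedances; NaN marks non-exceedances."""
    n = len(u)
    # (i) Student's t draw: correlated Gaussian vector divided by sqrt(chi-square / nu)
    z = chol @ rng.standard_normal(n)
    e = z / np.sqrt(rng.chisquare(nu) / nu)
    # (ii) uniform residuals via the Student's t distribution function
    e_star = stats.t.cdf(e, df=nu)
    # (iii) invert the GPD tail wherever the residual lies above 1 - zeta
    y = np.full(n, np.nan)
    exc = e_star > 1.0 - zeta
    p = (1.0 - e_star[exc]) / zeta[exc]  # conditional survival probability
    y[exc] = u[exc] + psi[exc] / xi[exc] * (p ** (-xi[exc]) - 1.0)
    return y

rng = np.random.default_rng(42)
n = 50  # hypothetical locations on a line
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
chol = np.linalg.cholesky(np.exp(-dist / 10.0))  # assumed exponential correlation
u, psi = np.full(n, 20.0), np.full(n, 3.0)       # hypothetical threshold and GPD scale
xi, zeta = np.full(n, -0.1), np.full(n, 0.02)    # hypothetical shape and exceedance rate

events = np.array([simulate_event(u, psi, xi, zeta, chol, 5.0, rng) for _ in range(2000)])
rate = np.mean(~np.isnan(events))  # should be close to zeta = 0.02
```

Because the residual field has Student's t margins with ν degrees of freedom, the proportion of simulated exceedances matches ζ(s) on average, while exceedances cluster in space through the correlation function.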
3. Simulation of extreme windstorm events
This section illustrates the framework by simulating windstorm events for a large part of Europe.
(a). Data
The windstorm data are maximum daily surface wind gust speeds (in metres per second), measured by anemometer and extracted from the National Climatic Data Center global summary of the day database.2 These are supplemented with MIDAS land surface station observations [30], to improve coverage for Norway and the UK, and ECA&D data [31], to improve coverage for Spain. The period 1 January 1994 to 31 December 2014 is studied. Only winter (defined as December, January, February) storms are modelled. Locations of data stations used are shown on figure 1 together with a plot representing their elevations.
Figure 1.
(a) Station locations overlaid on a map of elevation. Seven sites used for initial data analysis (figure 2) are identified as A–G. Nine sites used for model checking (figure 6) are identified as A, C, E–K. (b) Histogram of station elevations.
Figure 2 shows plots of wind gust speeds for seven stations. These are chosen as they are the nearest two to London and Paris, and the nearest to Madrid, the Ruhr and Milan (which are the five most populated metropolitan areas within the domain studied, in descending order3). Whether the gust speeds are asymptotically dependent or independent is assessed using the conditional exceedance probabilities Pr{Y (s,t)>y | Y (s′,t)>y} of §2c. These probabilities are estimated empirically and their uncertainty is quantified by bootstrap resampling the data. Two stations are chosen near to London and Paris to help assess asymptotic dependence for these proximate and highly populated station pairs (see also §3b(iii)). Between the London and Paris stations, dependence tends to be higher across the range of gust speeds, compared with between stations from different areas. However, the rarity of concomitant exceedances for the highest gust speeds, such as above 25 m s−1, makes it inconclusive from figure 2 whether the gust speeds are asymptotically dependent or independent. Further analysis is given in §3b(iii) by considering extremal index estimates.
Figure 2.
Plots of wind gust speeds exceeding 15 m s−1 for stations A–G. Diagonal plots show histograms of wind gust speeds. Lower off-diagonal plots show daily wind gust speeds at two stations. Upper off-diagonal plots show empirical estimates of Pr{Y (s,t)>y | Y (s′,t)>y} for station pairs; see §2c.
(b). Model specification and estimates
This section gives details of the model specifications used to estimate and then simulate extreme wind gust speeds.
(i). Threshold estimation
We use a spatially varying threshold. A constant threshold did not suit the large domain, primarily because it was not possible to find a single threshold for which the GPD assumption was valid for all stations.
The high proportion of missing data, which is typical of European wind gust measurements, presents a significant obstacle to estimating the threshold. Estimating the threshold at a pre-specified quantile using only available measurements leads to bias. This is because of a tendency towards the start of the study period for stations to only record high wind gust measurements; ignoring this, and assuming that measurements are missing at random, leads to severe overestimation of high quantiles. We overcome this with an infilling procedure based on a regression model: its response is Box–Cox transformed wind gust measurements using λ=0.2 [32]; its covariates are taken from temporally complete ERA-Interim reanalysis data [33], such as gust and wind speeds and mean sea-level pressure. A multivariate Gaussian model is then fitted to the model’s residuals. The infilled data are obtained by simulating residuals for the missing data, conditional on residuals derived from available data. Residuals and predictions from the ERA-Interim regression model are combined, and the infilled gust speeds obtained by inverting the Box–Cox transformation. These infilled values are only used to estimate the threshold, and not the excess model.
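The infilling step can be sketched for a single day as follows, assuming a zero-mean multivariate Gaussian residual model and using the standard formula for a conditional Gaussian distribution. The covariance matrix, residual values and regression prediction are hypothetical placeholders, not estimates from the data.

```python
import numpy as np

LAMBDA = 0.2  # Box-Cox parameter quoted in the text

def boxcox(y, lam=LAMBDA):
    return (y ** lam - 1.0) / lam

def inv_boxcox(z, lam=LAMBDA):
    return (lam * z + 1.0) ** (1.0 / lam)

def conditional_draw(res_obs, cov, obs, rng):
    """Draw residuals at missing stations given residuals at observed stations."""
    mis = ~obs
    s_oo = cov[np.ix_(obs, obs)]
    s_mo = cov[np.ix_(mis, obs)]
    s_mm = cov[np.ix_(mis, mis)]
    a = np.linalg.solve(s_oo, s_mo.T).T  # S_mo S_oo^{-1}
    mean = a @ res_obs
    cond_cov = s_mm - a @ s_mo.T
    return rng.multivariate_normal(mean, cond_cov)

rng = np.random.default_rng(5)
cov = np.array([[1.0, 0.8, 0.5],
                [0.8, 1.0, 0.6],
                [0.5, 0.6, 1.0]])        # hypothetical residual covariance for 3 stations
obs = np.array([True, False, True])      # station 2 has a missing measurement
res_obs = np.array([0.3, -0.1])          # hypothetical residuals at observed stations

draw = conditional_draw(res_obs, cov, obs, rng)
pred_missing = boxcox(18.0)              # hypothetical regression prediction, transformed scale
infilled = inv_boxcox(pred_missing + draw)  # infilled gust speed (m/s)
```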
As described in §2b, and proposed by Northrop & Jonathan [20], the threshold is estimated by non-parametric quantile regression. The 98th percentile is estimated for both practical and theoretical reasons. Klawa & Ulbrich [34] found this to perform well for estimating financial loss from European windstorms over Germany, while it is sufficiently high that tests show the GPD assumption for threshold excesses to be valid, as supported by figures 6 and 7. We find estimation of both the threshold and excesses models unreliable at higher thresholds, due to too few data, and that GPD estimates incur systematic spatial bias for thresholds below the 95th percentile. Empirical estimates of the 98th percentile for each station show variation with elevation and mean winter wind speed.4 Consequently, the quantile regression model for u(s) takes the form
$$u(s) = f_0(\mathrm{lon}(s),\mathrm{lat}(s)) + f_{12}(\mathrm{elev}(s),\mathrm{mws}(s)), \tag{3.1}$$
where f0( , ) is a thin plate regression spline, elev(s) and mws(s) are elevation and mean winter wind speed at location s, respectively, and f12( , ) is formed by a tensor-product of P-splines (which negates the need for fp( ) terms).
The estimate of the 98th percentile, shown in figure 3, ranges from approximately 20 m s−1 in areas that include densely populated cities, such as Paris and London, to over 35 m s−1 for the Alps and Pyrenees (although there are very few stations at such altitudes). Figure 3 shows that elevation influences the threshold estimate most; mean winter wind speed has a much smaller effect. The thin plate spline captures that windstorms tend to track north of the UK and Norway. Relative to the rest of the region studied, wind gust data for the Balkans are scarce; the increased threshold estimate there is accompanied by large uncertainty.
Figure 3.
Estimate of 98th percentile of the distribution of wind gust speeds (m s−1) from non-parametric quantile regression (a), together with the contributions to this estimate of the thin-plate regression spline (b, f0(lon(s),lat(s))) and the tensor-product spline of elevation and mean winter wind speed (c, f12(elev(s),mws(s))).
(ii). Excess model
We then model excesses of the estimated 98th percentile as GPD. The following GAM forms are used:
$$\log\psi_u(s) = f_0(\mathrm{lon}(s),\mathrm{lat}(s)) + f_1(\mathrm{elev}(s)) + f_2(\mathrm{mws}(s))$$
and
$$\xi(s) = f_0(\mathrm{lon}(s),\mathrm{lat}(s)) + f_1(\mathrm{elev}(s)),$$
where f1( ) and f2( ) are P-splines, and the smooth terms are estimated separately for each parameter. Estimates of ψu(s) and ξ(s) are shown in figure 4, with contributions from each of the GAM components shown in figure 5. Elevation appears to have the greatest effect. In general, increased elevation coincides with higher values of ψu(s) and lower values of ξ(s), although for ψu(s) this relationship is not entirely monotonically increasing. Increased mean winter wind speed also coincides with higher ψu(s) values, but to a much lesser extent than elevation. The thin plate spline contributions show that ψu(s) decreases smoothly from the northwest to the southeast of the region studied; a similar, but lesser, effect can be seen for ξ(s), although its increased values around the Balkan states are most apparent. Uncertainty in the ξ(s) estimate around the Balkan states is, however, again large, due to few data.
Figure 4.
GPD model parameter estimates. (a) GPD scale parameter estimate (log scale, log ψu(s)). (b) GPD shape parameter.
Figure 5.
(a) Additive contributions of f0(lon(s),lat(s)) (i), f1(elev(s)) (ii) and f2(mws(s)) (iii) to (log) GPD scale parameter estimate. (b) Contributions of f0(lon(s),lat(s)) (i) and f1(elev(s)) (ii) to GPD shape parameter estimate.
Quantile plots for assessing the GPD fits are shown in figure 6. These are achieved by omitting the data for the nine validation sites (A, C, E–K) to give estimates of a reduced model, which is used to predict sites’ GPD parameters. Data for these sites are included when estimating the final model. For the nine sites considered, which coincide with the nine most populated metropolitan areas, and are highlighted in figure 1, adequate agreement between model-based and empirical estimates of the distribution of excesses of u(s) can be seen. Slight signs of higher model-based estimates than observed gust speeds are apparent. We suspect that careful quality control on measurements could offer slight improvement, as, due to a scarcity of data, we have taken a conservative approach to quality control. We have also only considered fairly simple covariate choices. Other similarly simple covariates, such as mean winter temperature, were also considered. Of those considered, we have chosen the best fitting model. However, allowing less simple covariates may improve the model’s fit, in particular covariates derived from reanalysis data.
Figure 6.
GPD quantile-quantile plots for nine stations identified as A, C, E–K in figure 1.
We also collectively assess the fits for the stations using the Kolmogorov–Smirnov (K–S) test. Ties in the data make the K–S test in its standard form unreliable. Instead, we use a Monte Carlo technique to obtain p-values, which involves simulating sets of excesses from estimated GPDs, rounding the simulated excesses, and then computing the K–S test statistic for each set of simulated data. The p-values, shown for each station in figure 7, result from comparing the K–S test statistic for the gust speed data with the test statistics from the simulated data. For 40 of 789 stations (≃5.1%), the p-value is below 5%, which indicates adequate fit of the GPD across all stations. Importantly, there are no signs of systematic spatial deviations, such as large parts of the study region where the GPD fit is inadequate.
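The Monte Carlo K–S procedure for a single station can be sketched as below. This is a simplified illustration: the measurement resolution of 0.5 m s−1, the number of simulations and the parameter values are assumptions, and excesses (rather than gust speeds) are rounded for brevity.

```python
import numpy as np
from scipy import stats

def ks_stat(x, psi, xi):
    """K-S statistic against a GPD; scipy's genpareto uses shape c = xi, scale = psi."""
    return stats.kstest(x, stats.genpareto(c=xi, scale=psi).cdf).statistic

def mc_ks_pvalue(excesses, psi, xi, resolution, n_sim=200, rng=None):
    """Monte Carlo p-value that accounts for ties introduced by rounding."""
    rng = np.random.default_rng() if rng is None else rng
    d_obs = ks_stat(excesses, psi, xi)
    d_sim = np.empty(n_sim)
    for b in range(n_sim):
        sim = stats.genpareto.rvs(c=xi, scale=psi, size=len(excesses), random_state=rng)
        sim = np.round(sim / resolution) * resolution  # mimic measurement rounding
        d_sim[b] = ks_stat(sim, psi, xi)
    # Proportion of simulated statistics at least as large as the observed one
    return (1 + np.sum(d_sim >= d_obs)) / (n_sim + 1)

rng = np.random.default_rng(7)
psi, xi, res = 3.0, -0.1, 0.5  # hypothetical GPD estimates and resolution
raw = stats.genpareto.rvs(c=xi, scale=psi, size=200, random_state=rng)
obs = np.round(raw / res) * res  # synthetic rounded excesses

p_good = mc_ks_pvalue(obs, psi, xi, res, rng=rng)   # data consistent with the fitted GPD
p_bad = mc_ks_pvalue(obs, 10.0, xi, res, rng=rng)   # deliberately wrong scale
```

Repeating this per station and counting p-values below 5% gives the collective assessment described above.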
Figure 7.
(a) Stations with Kolmogorov–Smirnov test p-value greater than 5% (black) and less than 5% (red). (b) Pairwise, censored extremal index estimates for nine stations of figure 6 based on wind gust speed data (black) and simulations from Gaussian process model (grey). Estimates (black circle, grey circle) are shown alongside 95% uncertainty estimates based on profile likelihood (black line) and variability from multiple simulations (grey line).
(iii). Residual model
We build a residual model based on an anisotropic correlation function. This is formed by considering a transformed space with coordinates s*=(s1*,s2*)T, where
$$s^* = \begin{pmatrix} 1/\phi_1 & 0 \\ 0 & 1/\phi_2 \end{pmatrix} \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} s$$
and s=(s1,s2)T. This corresponds to scaling by 1/ϕ1 and 1/ϕ2 in longitude and latitude directions, and a counter-clockwise rotation through angle θ. We allow anisotropy as windstorms follow tracks, which typically causes dependence between gust speeds to be greatest along the track. Angle θ allows the prevailing direction of the tracks to deviate from perfectly following the longitude or latitude axes. The correlation function includes a nugget, 1−τ, to allow measurement error, and is written
$$c(h^*) = \tau\,\rho(h^*) + (1-\tau)\,\delta(h^*), \tag{3.2}$$
where δ( ) is the Kronecker delta function and h*=‖s*−s′*‖ for locations s* and s′* in transformed space. We choose the Whittle form for ρ( ), so that
$$\rho(h^*) = \frac{2^{1-\kappa}}{\Gamma(\kappa)}\,(h^*)^{\kappa} K_{\kappa}(h^*) \tag{3.3}$$
for κ>0 and Kκ the modified Bessel function of the second kind. The Whittle form is chosen for compatibility with the storm model of Cox & Isham [36], which follows in §3c(i); otherwise Matérn or powered exponential forms might be considered. We are required to estimate τ, ϕ1, ϕ2, θ and κ.
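A correlation matrix of this anisotropic type can be assembled as in the sketch below. It assumes that coordinates are rotated counter-clockwise by θ and then scaled by 1/ϕ1 and 1/ϕ2, that ρ is the Whittle correlation 2^(1−κ) h^κ Kκ(h)/Γ(κ), and that the nugget contributes 1−τ at zero distance only; the station coordinates and parameter values are hypothetical.

```python
import numpy as np
from scipy.special import gamma, kv

def transform(s, phi1, phi2, theta):
    """Rotate counter-clockwise by theta, then scale the axes by 1/phi1 and 1/phi2."""
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (s @ rot.T) / np.array([phi1, phi2])

def whittle(h, kappa):
    """Whittle correlation; equals 1 at h = 0 by continuity."""
    h = np.asarray(h, dtype=float)
    out = np.ones_like(h)
    pos = h > 0
    out[pos] = 2.0 ** (1.0 - kappa) / gamma(kappa) * h[pos] ** kappa * kv(kappa, h[pos])
    return out

def corr_matrix(s, tau, phi1, phi2, theta, kappa):
    """Correlation tau * rho(h*) off the diagonal and 1 on it (nugget 1 - tau)."""
    s_star = transform(s, phi1, phi2, theta)
    h = np.linalg.norm(s_star[:, None, :] - s_star[None, :, :], axis=-1)
    c = tau * whittle(h, kappa)
    c[h == 0] = 1.0
    return c

# Hypothetical station coordinates (longitude, latitude) and parameter values
s = np.array([[-0.1, 51.5], [2.3, 48.9], [-3.7, 40.4], [9.2, 45.5]])
C = corr_matrix(s, tau=0.8, phi1=5.0, phi2=2.0, theta=0.1, kappa=1.0)
```

With κ=0.5 the Whittle form reduces to the exponential correlation exp(−h), which gives a simple check of the implementation.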
Figure 7 shows pairwise estimates of the extremal index for the nine validation sites of figure 6 based on the censored method of Schlather & Tawn [37], with 95% uncertainty bounds estimated by profiling the likelihood. The maximum-likelihood estimates generally equal 2 (and uncertainty bounds all include 2), which indicates asymptotic independence. This is confirmed when we fit the tail Student’s t-process, under the assumption of no temporal dependence, and find that its likelihood is maximized for large ν (i.e. ν>10 000); at this value the difference between Student’s t- and Gaussian processes is negligible. Consequently, we focus on Gaussian processes. The estimated correlation structure of the Gaussian process is represented in figure 8 by a semi-variogram, which compares empirical with model-based estimates of the dependence structure against distance defined on transformed space; the structure itself is also shown with distance defined on the original scale. Figure 8 shows a decay in dependence. The practical range, where correlation falls below 0.05, is approximately 13°. The mean estimates approximately show that, once rotated counter-clockwise by 7 radians, dependence extends 2.8 times further along the longitude than the latitude axis. Figure 7 shows extremal index estimates for the nine validation sites based on simulations from the Gaussian process model. These mimic the availability of the wind gust data so that missing values occur simultaneously. Once uncertainty is taken into account, we see that simulations from the model are consistent with the extremal index estimates from the wind gust data, which supports the adequacy of the estimated dependence structure.
Figure 8.
(a) Semi-variogram against distance, defined in degrees for transformed coordinates, with Whittle model superimposed. (b) Estimated anisotropic correlation function, i.e. ρ( ) from equation (3.3).
(c). Simulated windstorm events
(i). Simulation specification
We base windstorm simulations on a spatio-temporal extension of the Whittle correlation function, developed as a storm model in Cox & Isham [36] and later presented for Gaussian processes in Gneiting et al. [38] and Schlather [39]. Each event, j=1,…,J, is assumed to move at a random bivariate normally distributed velocity Vj∼BVN(μj,M/2) and has corresponding correlation function
| 3.4 |
for s≠s′, t≠t′, where cj((s,t),(s′,t′))=1 otherwise, dt=t−t′, and . The large number of windstorms, J, results in too many μjs to reliably obtain maximum-likelihood estimates of M and the μjs alongside τ, ϕ1, ϕ2, θ and κ. A simpler model with μj=μ for all j fails to capture the variation that is empirically evident. Instead, the μjs and M are empirically estimated from storm tracks extracted from the ERA-Interim reanalysis [33] using the tracking algorithm of Hodges [40,41]. Comparison against the windstorm track observations of the IBTrACS database ([42]; not shown) shows that, although the intensities of windstorms may tend to be underestimated, velocities are represented sufficiently well for the purposes of this simulation.
(ii). Results on simulated events
Events are simulated for locations sk, k=1,…,nS, on a 300×300 grid, where nS=90 000. We simulate events equivalent to 10 000 December–February winters and analyse 3-day events because a 72-h period is typically used in the insurance industry to define individual events. Only excesses corresponding to exceedances of u(s) are simulated. Under this specification, each simulated winter takes approximately 20 s, but simulations are entirely parallelizable. Figure 9 shows two randomly chosen 3-day event simulations. For both events, exceedances of the estimated 98th percentile can be seen on each of the three simulated days. Very few damage-causing gust speeds exceeding 25 m s−1 occur. Therefore, neither event is likely to be classed as extreme in terms of the financial loss that it might have caused.
Figure 9.
Plots of two randomly chosen 3-day windstorm event simulations (a,b) and extreme events, based on largest SSI (c) and exceedance area (d) over 3 days.
To highlight potentially catastrophic events, two measures are used to quantify whether a windstorm event is extreme [2]. Each event, j=1,…,J, is defined for locations s∈S and a 3-day time domain centred on tj, where tj gives the peak time of event j. Plots of the events with the highest values of each of these measures are shown in figure 9.
Klawa & Ulbrich [34] propose a storm severity index (SSI) given by
which is designed to quantify an event’s kinetic energy, which relates closely to financial loss, where SSSI⊂S is a region for which the SSI will be calculated. Compared with S, SSSI excludes the Balkan states, where shape parameter estimates are relatively imprecise; corresponding simulated gusts would otherwise dominate SSI values. Windstorms with large financial losses seldom occur in the excluded regions. Figure 9 shows the 3-day event with highest SSI. We see that many threshold exceedances occur on day 3 in various countries, in particular, France, Switzerland, Germany, Italy, Belgium and Austria, and also, to a lesser extent, the Netherlands, UK, Denmark and Slovenia. The renowned windstorms Martin, Lothar and Kyrill affected a similar set of countries.
We define the exceedance area of a storm as
This measure captures an event’s spatial extent over 3 days; that is, the proportion of Europe affected by high wind gust speeds, relative to the local 98th percentile. Figure 9 also shows the 3-day event with the largest exceedance area. The event with largest exceedance area is very similar to that with largest SSI, as it has greatest effect on day 3 and affects a similar set of countries.
(d). Alternative residual models
One of the benefits of the separation of the simulation model into marginal and residual models is the ability to combine alternative models for each.
(i). The effect of measurement error
The simulations in figure 9 could prove awkward for loss estimation if the signal-to-noise ratio (the ratio of τ to 1−τ in equation (3.2)) is seen to be small, i.e. if the gust speeds have relatively large measurement error. Figure 10 shows single-day windstorm events simulated with different signal-to-noise ratios, based on the GPD estimates described in §3b(ii). Part of the motivation for this analysis is that increasing the signal-to-noise ratio might lead to simulated events representative of unobserved actual windstorm events, as opposed to representing measurements. The effect of increasing the signal-to-noise ratio is clear, with the spatial structure of events becoming smoother, eventually forming larger, more distinct patches affecting fewer parts of Europe.
Figure 10.
(a–f) Simulated wind gusts for single days based on differing measurement error amounts, i.e. 1−τ in equation (3.2). Simulations use the same seed to aid comparison between different signal-to-noise ratios.
(ii). A reanalysis-based residual model
As a further sensitivity analysis, we swap the measurement-based residual model for one based on ERA-Interim reanalysis wind gust output, which are exempt from measurement error. We estimate the ERA-Interim 3-day dependence structure using an empirical probability transformation to convert margins to Gaussian, interpolate the resulting data onto the simulation grid S and then calculate the empirical correlation matrix. We then use this to simulate Gaussian residuals, which are then converted to the original wind gust scale using the GPD estimates of §3b(ii). Figure 11 shows simulated events chosen according to the same criteria as figure 9. Increased smoothness of the events is clear to see. We also note that the events with largest SSI and exceedance area coincide. As the ERA-Interim data are aggregated onto a 0.75° grid, their residual structure will inevitably be smoother than in reality, which will lead to larger-than-realistic loss estimates; residuals based on the measurements will suffer the opposite. These alternative residual models could therefore be used to place upper and lower bounds on loss estimates. Based on the preceding two sensitivity analyses for the residual model, we should hope to reduce the uncertainty in loss estimates with higher-resolution wind gust output instead of ERA-Interim and/or more precise measurements.
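The reanalysis-based residual model can be sketched in a few lines: rank-transform each grid cell to Gaussian margins, take the empirical correlation matrix, and simulate Gaussian residuals from it. The synthetic field below stands in for interpolated ERA-Interim gust output, and the array sizes are illustrative assumptions; more days than grid cells are used so that the empirical correlation matrix is positive definite.

```python
import numpy as np
from scipy import stats

def normal_scores(x):
    """Column-wise empirical probability transform to standard Gaussian margins."""
    n = x.shape[0]
    ranks = stats.rankdata(x, axis=0)
    return stats.norm.ppf(ranks / (n + 1))

rng = np.random.default_rng(3)
n_days, n_cells = 500, 6
common = rng.standard_normal((n_days, 1))
# Synthetic, positively skewed stand-in for reanalysis gusts at n_cells grid cells
field = np.exp(0.8 * common + 0.6 * rng.standard_normal((n_days, n_cells)))

z = normal_scores(field)                # Gaussian margins
corr = np.corrcoef(z, rowvar=False)     # empirical correlation matrix
chol = np.linalg.cholesky(corr)

resid = chol @ rng.standard_normal(n_cells)  # one simulated Gaussian residual field
e_star = stats.norm.cdf(resid)               # uniform residuals for the GPD back-transform
```

The uniform residuals can then be converted to the wind gust scale with the fitted GPD parameters, as in step (iii) of §2d.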
Figure 11.
Events simulated using ERA-Interim residual model. Two randomly chosen 3-day windstorm event simulations (rows 1 and 2) and an extreme event (row 3) based on largest SSI or exceedance area (both happen to coincide) over 3 days.
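The steps of the reanalysis-based residual model can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: the input array x stands in for gridded reanalysis gusts, the interpolation onto the simulation grid is omitted, the GPD parameters, threshold u and exceedance probability zeta are hypothetical values, and simulated non-exceedances are simply censored at the threshold.

```python
import numpy as np
from scipy import stats

def to_gaussian_margins(x):
    """Column-wise empirical probability transform to standard normal scores."""
    n = x.shape[0]
    return stats.norm.ppf(stats.rankdata(x, axis=0) / (n + 1.0))

def simulate_gusts(x, u, scale, shape, zeta, n_sim, seed=0):
    """Simulate Gaussian residuals and map them to the gust scale via a GPD tail."""
    rng = np.random.default_rng(seed)
    # Empirical correlation matrix of the Gaussian-scale data.
    corr = np.corrcoef(to_gaussian_margins(x), rowvar=False)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_sim)
    p = stats.norm.cdf(z)
    # Probabilities below 1 - zeta are censored at the threshold (an assumption).
    p_excess = np.clip((p - (1.0 - zeta)) / zeta, 0.0, 1.0)
    return u + stats.genpareto.ppf(p_excess, c=shape, scale=scale)

# Synthetic stand-in for reanalysis gusts: 500 days at 6 grid points.
rng = np.random.default_rng(42)
common = rng.standard_normal((500, 1))
x = 15.0 + 4.0 * (0.8 * common + 0.6 * rng.standard_normal((500, 6)))

# Hypothetical marginal parameters, one value per grid point.
u = np.full(6, 20.0)
scale = np.full(6, 3.0)
shape = np.full(6, -0.1)

sims = simulate_gusts(x, u, scale, shape, zeta=0.1, n_sim=1000)
print(sims.shape, sims.max())
```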
(iii). A max-stable process residual model
To fit a max-stable process residual model, we obtain unit Fréchet residuals as described in §2c, i.e. with e*(s,t) defined in equation (2.3). We fit the max-stable models by maximizing pairwise censored likelihoods [43–46] and hence present the two-dimensional case. For two locations and , defined on the transformed space of §3b(iii), we assume that
where V ( , ) is the exponent measure function. As asymptotic independence is indicated by figure 7, we consider the extension of the Schlather model [27] proposed in Davison & Gholamrezaee [6], where
with c( ) as in equation (3.2), where . We take α(d)={1−|d|/(2r)}+, which gives asymptotic independence when |d|>2r and find that . The resulting estimate of c( ) has a relatively large nugget estimate (τ≃0.63); its ρ( ) component is shown on the original coordinate scale in figure 12, which shows a very short range of dependence. Extremal index estimates for the max-stable model with distance defined on the transformed space, which are shown in figure 12, highlight a lack of agreement between the max-stable model and the wind gust data. Simulations from the model (not shown), which take on average 90 times longer than those for the Gaussian process model, have qualitatively similar characteristics to the τ=1 case presented in figure 10. We have, however, only considered a fairly simple form for α( ). More flexible forms are presented in Davison & Gholamrezaee [6], and Huser & Davison [44] use a further extension when modelling extreme hourly rainfall.
Figure 12.
(a) Plot of correlation function estimate for asymptotically independent parametrization of Schlather model. (b) Extremal index estimates from model (red) against empirical estimates (black).
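For orientation, the pairwise model of §3d(iii) can be written in the standard form given by Davison & Gholamrezaee [6] for their extension of the Schlather model. The notation is assumed for illustration: Z denotes the unit Fréchet residual process, s_1 and s_2 the two locations, z_1 and z_2 the corresponding levels and d the separation of the locations on the transformed space.

```latex
\Pr\{Z(s_1) \le z_1,\; Z(s_2) \le z_2\} = \exp\{-V(z_1, z_2)\},
\qquad
V(z_1, z_2) = \left(\frac{1}{z_1} + \frac{1}{z_2}\right)
\left[ 1 - \frac{\alpha(d)}{2}
\left\{ 1 - \sqrt{1 - \frac{2\{c(d) + 1\}\, z_1 z_2}{(z_1 + z_2)^2}} \right\} \right].
```

Setting z_1 = z_2 = 1 gives the pairwise extremal coefficient implied by this form:

```latex
\theta(d) = V(1, 1) = 2 - \alpha(d)\left[ 1 - \sqrt{\frac{1 - c(d)}{2}} \right],
```

so that θ(d) = 2 whenever α(d) = 0, i.e. when |d| > 2r, which corresponds to independence between the two locations.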
4. Summary
We have presented a statistical framework for simulating natural hazard events that combines extreme value theory, to accurately capture event magnitude, and geostatistics, to robustly incorporate spatial dependence. The framework can be used to quickly give high-resolution simulations that can be formally validated, and has been used here to produce realistic simulations of European windstorm events. By virtue of its generality, its speed and its ability to run on a standard computer, the framework could be readily used as a hazard module in a catastrophe model for other types of hazard, or as a tool for assessing other modules. Various aspects of the proposed framework that could be changed or extended to bring potential improvements are discussed in the remainder of this section.
The Student’s t-process residual model includes the Gaussian process as a special case. It can be fitted quickly using maximum likelihood, based on finite-dimensional counterparts of the processes, and subsequently gives fast simulations. The finite-dimensional representation, as opposed to the continuous-space one, could better suit certain types of hazard phenomena. One example, as studied in Keef et al. [47], is river flows, where complex dependencies between flow gauges may benefit from a finite specification. The finite-dimensional representation loses certain computational benefits, such as those gained by circulant embedding, but remains relatively efficient. Student’s t- and Gaussian processes lack max-stability. Although in §3d(iii) we have looked into using max-stable processes to simulate extreme windstorm events, we note that they may better represent other types of natural hazard, or other temporal resolutions. For example, max-stable processes have been used in Huser & Davison [44] to model hourly rainfall, and for modelling annual maximum snow depths [48] and temperatures [6]. Estimates of the extremal index, as in figure 7, may help reveal which, if any, residual model is most suitable. If the model is used as a hazard module in a catastrophe model, so that loss estimates are produced, it is important that, in addition to formal statistical validation of the residual model, validation against an appropriate loss function for the application is also performed. The anamorphosis and max-stable models can also be extended to capture dependencies between multiple processes, although a trade-off between resolution and increased dimensionality may be needed to maintain computational feasibility.
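To illustrate why finite-dimensional Student’s t simulation is cheap and why the Gaussian case is nested within it, the following sketch uses the standard scale-mixture construction: a multivariate Gaussian draw divided by the square root of an independent chi-squared variable over its degrees of freedom. The correlation matrix, the function name and the parameter values are hypothetical, and a Cholesky factor is used in place of circulant embedding.

```python
import numpy as np

def simulate_student_t(corr, df, n_sim, seed=0):
    """Draw n_sim vectors from a multivariate Student's t with correlation corr.

    df = np.inf returns the Gaussian special case.
    """
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(corr)
    z = rng.standard_normal((n_sim, corr.shape[0])) @ chol.T
    if np.isinf(df):
        return z
    # One mixing variable per simulated event, shared across all sites.
    w = rng.chisquare(df, size=(n_sim, 1)) / df
    return z / np.sqrt(w)

# Hypothetical exponential correlation for five sites on a line.
sites = np.arange(5.0)
corr = np.exp(-np.abs(sites[:, None] - sites[None, :]) / 2.0)

gaussian = simulate_student_t(corr, np.inf, n_sim=4, seed=3)
student = simulate_student_t(corr, 4.0, n_sim=4, seed=3)
print(student / gaussian)  # each row is constant: one scale factor per event
```

Because a single mixing variable rescales each whole event, heavier-tailed events are amplified at all sites together, which is the source of the stronger extremal dependence relative to the Gaussian case.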
We have used thin plate spline forms to capture spatial smoothness in marginal distribution parameters, which appear to capture the expected parameter variation reliably, upon appropriate choice of basis dimension and cross-validation of smoothing parameters. We see spline forms as a compromise between parametric forms, which can be inflexible, and stochastic forms, such as Gaussian processes, for which inference might require computer-intensive Markov chain Monte Carlo (MCMC) techniques. We have seen little to justify this extra computational demand and also think it could hinder use of the model by end-users. The GAM forms can be extended to incorporate temporal variability, for example to give a model for the entire year, either by modelling each season separately or by allowing temporal variation in parameter estimates. For extreme gust speeds it is not valid to assume a constant distribution across the year, and the same is likely to hold for most hazard phenomena. Unless a fairly homogeneous period can be identified when events tend to occur, such as summer for heatwaves, a temporal model should be used. Similarly, corresponding residual models may benefit from allowing for temporal variability.
For some hazard types, such as flooding due to extreme rainfall, simulated values of non-exceedances of the threshold u(s) may be required. To model marginal distributions in this case, a GPD tail model can be coupled with a model for non-exceedances, which can be empirical [47], non-parametric [49] or parametric [50]. To give spatial simulations, empirical estimates of spatial dependence can be used together with a suitable interpolation scheme. A geostatistical model can also be used, although a different spatial dependence structure between exceedances and non-exceedances of u(s) should be considered for sufficient flexibility. Alternatively, flooding can be studied by modelling extreme river flows [47], for which the proposed framework is better suited.
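As an illustration of coupling a GPD tail with an empirical model for non-exceedances at a single site, the sketch below builds a quantile function that uses empirical quantiles below the threshold and GPD quantiles above it. The synthetic data, the threshold choice and the function names are assumptions made for the example; the spatial dependence discussed above is not addressed.

```python
import numpy as np
from scipy import stats

def fit_tail(y, u):
    """Fit a GPD to excesses of u; return shape, scale and exceedance probability."""
    excess = y[y > u] - u
    shape, _, scale = stats.genpareto.fit(excess, floc=0.0)
    return shape, scale, excess.size / y.size

def semiparametric_quantile(p, y, u, shape, scale, zeta):
    """Empirical quantiles below the threshold, GPD quantiles above it."""
    p = np.atleast_1d(np.asarray(p, dtype=float))
    out = np.empty_like(p)
    bulk = p < 1.0 - zeta
    # Rescale bulk probabilities onto the non-exceedances only.
    out[bulk] = np.quantile(y[y <= u], p[bulk] / (1.0 - zeta))
    # Rescale tail probabilities onto the conditional excess distribution.
    tail_p = (p[~bulk] - (1.0 - zeta)) / zeta
    out[~bulk] = u + stats.genpareto.ppf(tail_p, c=shape, scale=scale)
    return out

# Synthetic daily values at one site (hypothetical data).
rng = np.random.default_rng(7)
y = rng.gamma(shape=2.0, scale=5.0, size=5000)
u = np.quantile(y, 0.9)

shape, scale, zeta = fit_tail(y, u)
probs = np.array([0.1, 0.5, 1.0 - zeta, 0.99, 0.999])
print(semiparametric_quantile(probs, y, u, shape, scale, zeta))
```

Simulated uniform variates from any dependence model can then be passed through this quantile function to give values over the whole range of the variable, not only above u(s).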
We have allowed for measurement error by including a nugget term in correlation functions. This is computationally convenient, but may lack interpretability when compared with the model Y(s,t)∼N(W(s,t),ω²), where W(s,t) is an actual, but unobservable, value of the hazard process at location s and time t, as it may be possible to specify ω² based on knowledge of the measurement process. The W(s,t) values can then be modelled as the Y(s,t) values have been previously. As it is not possible to integrate out the W(s,t) values analytically within the present framework, numerical procedures, such as MCMC, will be required for inference. Integration would also need to carry over to the threshold excess and residual models, negating much of the simplicity offered by GAM forms; this might make stochastic processes preferable for capturing spatial variation in parameters. This model can also be formulated as a hierarchical model and, if informative prior knowledge is available, implemented from a Bayesian perspective.
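The link between the nugget and the measurement-error variance can be made explicit in a simplified setting. The following is a sketch under assumptions not made above: W(s,t) is taken to be a stationary Gaussian process with variance σ_W² and correlation function ρ(d), measurement errors are independent of W and of each other, and the nugget form c(d) = τρ(d) for d ≠ 0 is assumed for equation (3.2).

```latex
\operatorname{Corr}\{Y(s,t),\, Y(s+d,t)\}
= \frac{\sigma_W^2\, \rho(d)}{\sigma_W^2 + \omega^2}, \quad d \neq 0,
\qquad\text{so that}\qquad
\tau = \frac{\sigma_W^2}{\sigma_W^2 + \omega^2},
\qquad
\frac{\tau}{1 - \tau} = \frac{\sigma_W^2}{\omega^2}.
```

In this setting, knowledge of ω² from the measurement process would fix the signal-to-noise ratio once σ_W² is estimated. The correspondence is exact only on the Gaussian scale and would be approximate after the marginal transformations used in the framework.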
One aspect of the model where scope exists to add generality is the correlation function, which we have assumed is the same across the study region. This assumption can be relaxed, for example by allowing parameters of the correlation function to vary in space. We considered replacing ϕ1 with ϕ1(s) and using simple forms such as . Although such model extensions could not be reliably estimated here, they may benefit other applications. Correlation can also be defined on an alternative (potentially higher-dimensional) space, such as the climate space used in Cooley et al. [8]. When modelling gust speeds over larger domains, it should be noted that storms have highs and lows, which suggests an oscillating correlation function might be more suitable, as illustrated in Lindgren et al. [51]. Windstorms are also known to behave differently over land than over sea; therefore partitioning distance into over-land and over-sea components might better represent dependencies. For hazard types for which events tend to affect smaller areas, the proposed model may be built based on a smaller region, which might improve the validity of using spatially constant correlation functions.
Acknowledgements
We thank two reviewers and an editorial board member for their useful comments, which have led to many significant improvements to this paper.
Footnotes
EU directive 2009/138/EC, available from http://ec.europa.eu/finance/insurance/solvency/solvency2/index_en.htm.
Source https://en.wikipedia.org/wiki/List_of_metropolitan_areas_in_Europe (accessed 8 April 2016).
Mean wind speed data obtained from https://crudata.uea.ac.uk/cru/data/hrg/ (see New et al. [35]).
Data accessibility
The NCDC GSOD European wind gust measurements, which are provided as the electronic supplementary material, cannot be redistributed for commercial purposes.
Authors' contributions
B.D.Y. developed and implemented the statistical methodology. D.B.S. provided critical advice on the methodology, weather-related issues and results.
Competing interests
We have no competing interests.
Funding
This work has been kindly funded by the Willis Research Network.
References
- 1. Grossi P, Kunreuther H. 2006. Catastrophe modeling: a new approach to managing risk. Huebner International series on risk, insurance and economic security. Berlin, Germany: Springer Science+Business Media.
- 2. Roberts JF, Champion AJ, Dawkins LC, Hodges KI, Shaffrey LC, Stephenson DB, Stringer MA, Thornton HE, Youngman BD. 2014. The XWS open access catalogue of extreme European windstorms from 1979 to 2012. Nat. Hazards Earth Syst. Sci. 14, 2487–2501. (doi:10.5194/nhess-14-2487-2014)
- 3. Haas R, Pinto JG, Born K. 2014. Can dynamically downscaled windstorm footprints be improved by observations through a probabilistic approach? J. Geophys. Res. Atmos. 119, 713–725. (doi:10.1002/2013JD020882)
- 4. Brodin E, Rootzén H. 2009. Univariate and bivariate GPD methods for predicting extreme wind storm losses. Insur. Math. Econ. 44, 345–356. (doi:10.1016/j.insmatheco.2008.11.002)
- 5. Bonazzi A, Cusack S, Mitas C, Jewson S. 2012. The spatial structure of European wind storms as characterized by bivariate extreme-value copulas. Nat. Hazards Earth Syst. Sci. 12, 1769–1782. (doi:10.5194/nhess-12-1769-2012)
- 6. Davison AC, Gholamrezaee MM. 2012. Geostatistics of extremes. Proc. R. Soc. A 468, 581–608. (doi:10.1098/rspa.2011.0412)
- 7. Casson E, Coles S. 1999. Spatial regression models for extremes. Extremes 1, 449–468. (doi:10.1023/A:1009931222386)
- 8. Cooley D, Nychka D, Naveau P. 2007. Bayesian spatial modeling of extreme precipitation return levels. J. Am. Stat. Assoc. 102, 824–840. (doi:10.1198/016214506000000780)
- 9. Sang H, Gelfand A. 2009. Hierarchical modeling for extreme values observed over space and time. Environ. Ecol. Stat. 16, 407–426. (doi:10.1007/s10651-007-0078-0)
- 10. Diggle P, Ribeiro P. 2007. Model-based geostatistics. Springer series in statistics. Berlin, Germany: Springer.
- 11. Hastie TJ, Tibshirani RJ. 1990. Generalized additive models, vol. 43. Boca Raton, FL: CRC Press.
- 12. Chavez-Demoulin V, Davison AC. 2005. Generalized additive modelling of sample extremes. J. R. Stat. Soc. C 54, 207–222. (doi:10.1111/j.1467-9876.2005.00479.x)
- 13. Coles S. 2001. An introduction to statistical modeling of extreme values. Springer series in statistics. London, UK: Springer.
- 14. Wood SN. 2003. Thin-plate regression splines. J. R. Stat. Soc. B 65, 95–114. (doi:10.1111/1467-9868.00374)
- 15. De Boor C. 1978. A practical guide to splines. Applied mathematical sciences, number v. 27. Berlin, Germany: Springer.
- 16. Wahba G. 1990. Spline models for observational data. CBMS-NSF regional conference series in applied mathematics. Philadelphia, PA: SIAM.
- 17. Green PJ. 1987. Penalized likelihood for general semi-parametric regression models. Int. Stat. Rev. 55, 245–259. (doi:10.2307/1403404)
- 18. Pauli F, Coles S. 2001. Penalized likelihood inference in extreme value analyses. J. Appl. Stat. 28, 547–560. (doi:10.1080/02664760120047889)
- 19. Koenker R. 2005. Quantile regression. Econometric Society monographs. Cambridge, UK: Cambridge University Press.
- 20. Northrop PJ, Jonathan P. 2011. Threshold modelling of spatially dependent non-stationary extremes with application to hurricane-induced wave heights. Environmetrics 22, 799–809. (doi:10.1002/env.1106)
- 21. Wood SN. 2006. Generalized additive models: an introduction with R. London, UK: Chapman & Hall.
- 22. Mikosch T. 2006. Copulas: tales and facts. Extremes 9, 3–20. (doi:10.1007/s10687-006-0015-x)
- 23. Wood AT, Chan G. 1994. Simulation of stationary Gaussian processes in [0,1]^d. J. Comput. Graph. Stat. 3, 409–432. (doi:10.2307/1390903)
- 24. Sibuya M. 1959. Bivariate extreme statistics, I. Ann. Inst. Stat. Math. 11, 195–210. (doi:10.1007/BF01682329)
- 25. Coles S, Heffernan J, Tawn J. 1999. Dependence measures for extreme value analyses. Extremes 2, 339–365. (doi:10.1023/A:1009963131610)
- 26. Bortot P, Coles S, Tawn J. 2000. The multivariate Gaussian tail model: an application to oceanographic data. J. R. Stat. Soc. C 49, 31–49. (doi:10.1111/1467-9876.00177)
- 27. Schlather M. 2002. Models for stationary max-stable random fields. Extremes 5, 33–44. (doi:10.1023/A:1020977924878)
- 28. Opitz T. 2013. Extremal processes: elliptical domain of attraction and a spectral representation. J. Multivariate Anal. 122, 409–413. (doi:10.1016/j.jmva.2013.08.008)
- 29. Davis R, Resnick S. 1984. Tail estimates motivated by extreme value theory. Ann. Stat. 12, 1467–1487. (doi:10.1214/aos/1176346804)
- 30. Met Office. 2012. Met Office integrated data archive system (MIDAS) land and marine surface stations data (1853–current). NCAS British Atmospheric Data Centre, 4 February 2015. See http://catalogue.ceda.ac.uk/uuid/220a65615218d5c9cc9e4785a3234bd0.
- 31. Klein Tank A. et al. 2002. Daily dataset of 20th-century surface air temperature and precipitation series for the European climate assessment. Int. J. Climatol. 22, 1441–1453. (doi:10.1002/joc.773)
- 32. Box GEP, Cox DR. 1964. An analysis of transformations. J. R. Stat. Soc. B 26, 211–252.
- 33. Dee DP. et al. 2011. The ERA-Interim reanalysis: configuration and performance of the data assimilation system. Q. J. R. Meteorol. Soc. 137, 553–597. (doi:10.1002/qj.828)
- 34. Klawa M, Ulbrich U. 2003. A model for the estimation of storm losses and the identification of severe winter storms in Germany. Nat. Hazards Earth Syst. Sci. 3, 725–732. (doi:10.5194/nhess-3-725-2003)
- 35. New M, Lister D, Hulme M, Makin I. 2002. A high-resolution data set of surface climate over global land areas. Clim. Res. 21, 1–25. (doi:10.3354/cr021001)
- 36. Cox DR, Isham V. 1988. A simple spatial-temporal model of rainfall. Proc. R. Soc. Lond. A 415, 317–328. (doi:10.1098/rspa.1988.0016)
- 37. Schlather M, Tawn JA. 2003. A dependence measure for multivariate and spatial extreme values: properties and inference. Biometrika 90, 139–156. (doi:10.1093/biomet/90.1.139)
- 38. Gneiting T, Genton M, Guttorp P. 2006. Geostatistical space-time models, stationarity, separability and full symmetry. In Statistical methods for spatio-temporal systems, pp. 151–175. Boca Raton, FL: CRC Press.
- 39. Schlather M. 2010. Some covariance models based on normal scale mixtures. Bernoulli 16, 780–797. (doi:10.3150/09-BEJ226)
- 40. Hodges KI. 1995. Feature tracking on the unit-sphere. Mon. Weather Rev. 123, 3458–3465. (doi:10.1175/1520-0493(1995)123<3458:FTOTUS>2.0.CO;2)
- 41. Hodges KI. 1999. Adaptive constraints for feature tracking. Mon. Weather Rev. 127, 1362–1373. (doi:10.1175/1520-0493(1999)127<1362:ACFFT>2.0.CO;2)
- 42. Knapp KR, Kruk MC, Levinson DH, Diamond HJ, Neumann CJ. 2010. The international best track archive for climate stewardship (IBTrACS) unifying tropical cyclone data. Bull. Am. Meteorol. Soc. 91, 363–376. (doi:10.1175/2009BAMS2755.1)
- 43. Ledford AW, Tawn JA. 1996. Statistics for near independence in multivariate extreme values. Biometrika 83, 169–187. (doi:10.1093/biomet/83.1.169)
- 44. Huser R, Davison AC. 2014. Space-time modelling of extreme events. J. R. Stat. Soc. B 76, 439–461. (doi:10.1111/rssb.12035)
- 45. Lindsay B. 1988. Composite likelihood methods. In Statistical inference from stochastic processes (Ithaca, NY, 1987). Contemp. Math., pp. 221–239. Providence, RI: Amer. Math. Soc.
- 46. Padoan SA, Ribatet M, Sisson SA. 2010. Likelihood-based inference for max-stable processes. J. Am. Stat. Assoc. 105, 263–277. (doi:10.1198/jasa.2009.tm08577)
- 47. Keef C, Tawn J, Svensson C. 2009. Spatial risk assessment for extreme river flows. J. R. Stat. Soc. C 58, 601–618. (doi:10.1111/j.1467-9876.2009.00672.x)
- 48. Blanchet J, Davison AC. 2011. Spatial modeling of extreme snow depth. Ann. Appl. Stat. 5, 1699–1725. (doi:10.1214/11-AOAS464)
- 49. Carreau J, Naveau P, Sauquet E. 2009. A statistical rainfall-runoff mixture model with heavy-tailed components. Water Resour. Res. 45, W10437. (doi:10.1029/2009WR007880)
- 50. Frigessi A, Haug O, Rue H. 2002. A dynamic mixture model for unsupervised tail estimation without threshold selection. Extremes 5, 219–235. (doi:10.1023/A:1024072610684)
- 51. Lindgren F, Rue H, Lindström J. 2011. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Stat. Soc. B 73, 423–498. (doi:10.1111/j.1467-9868.2011.00777.x)