Author manuscript; available in PMC: 2021 Feb 26.
Published in final edited form as: Eng Struct. 2017;3:10.1061/AJRUA6.0000933. doi: 10.1061/AJRUA6.0000933

Estimating peaks of stationary random processes: a peaks-over-threshold approach

Dat Duthinh 1, Adam L Pintar 2, Emil Simiu 3
PMCID: PMC7909582  NIHMSID: NIHMS1658792  PMID: 33642655

Abstract

Estimating properties of the distribution of the peak of a stationary process from a single finite realization is a problem that arises in a variety of science and engineering applications. Further, it is often the case that the realization is of length T while the distribution of the peak is sought for a different length of time, T1 > T. Current methods for estimating peaks of time series have drawbacks that motivated the development of a new procedure, based on the peaks-over-threshold method, an advantage of which is that it often results in an increased size of the relevant extreme value data set compared with epochal procedures. For further comparison, the translation approach depends upon the estimate of the marginal distribution of a non-Gaussian time series, which is typically difficult to perform reliably. The epochal procedure for estimating peaks combined with Best Linear Unbiased Estimates (BLUE) of the Gumbel parameters was found to depend, in some cases very significantly, upon the number of partitions being used.

The proposed procedure is based on a Poisson process model for the thresholded pressure coefficient y, with threshold u. The estimated peak depends upon the choice of the threshold. However, unlike for the choice of the number n of partitions for the epochal procedure, a criterion is available that allows the analyst to make an optimal choice (according to a chosen metric) of the threshold value.

Two versions of the proposed new procedure have been developed. One version, denoted by FpotMax, includes estimation of a tail length parameter with a similar interpretation to the generalized extreme value distribution tail length parameter. The second version, denoted by GpotMax, assumes that the tail length parameter vanishes, resulting in a tail of the Gumbel distribution type.

Keywords: autocorrelation, bootstrap, decluster, Gumbel distribution, independent peaks, Monte-Carlo simulation, peaks-over-threshold, Poisson process, stationary time series, wind pressure

INTRODUCTION

Estimating properties of the distribution of the peak of a stationary process from a single finite realization is a problem that arises in a variety of science and engineering applications. Further, it is often the case that the realization is of length T while the distribution of the peak is sought for a different length of time, T1 > T. An example of interest in structural engineering, which is the focus of this paper, is the time history of pressure coefficients measured on a model in an aerodynamics laboratory.

For the particular case in which the marginal distribution of the process is Gaussian, closed-form expressions for the mean and standard deviation of the distribution of the peak are available (Rice, 1954, Davenport, 1964). If the distribution is not Gaussian, a nonlinear mapping procedure, sometimes referred to in the literature as “translation,” has been developed by Sadek and Simiu (2002), by which those statistics can be obtained. The translation method depends heavily on the engineer’s ability to choose an appropriate marginal probability distribution. In practice, because of the difficulty of this task, the performance of the translation method can be unsatisfactory.

Eaton and Mayne (1975) proposed a simple procedure in which the time history of the pressure coefficients of length T is divided into n equal segments. A sample is created consisting of the peaks of each of those segments, and a Gumbel cumulative distribution function (CDF) is fitted to that sample. To obtain the largest peak for a time history of length T1 = rT/n (r ≥ n), that CDF is raised to the r-th power, resulting in a second Gumbel distribution describing the peak of a series of length T1. This procedure is commonly implemented using the BLUE (Best Linear Unbiased Estimate; Lieblein, 1974) of the parameters of the Gumbel distribution. This method is henceforth referred to as the epochal method. One drawback is that the result depends significantly on n (see the section preceding the Conclusions), and no criterion for an optimal choice of n is available in the literature.
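The r-th power step can be made concrete. If the segment peaks follow the Gumbel CDF F(y) = exp{−exp[−(y − μ)/σ]}, then F(y)^r is again a Gumbel CDF with the same scale σ and with location μ + σ ln r. The following is a minimal sketch of the epochal method under that identity. It assumes method-of-moments estimates as a simple stand-in for the BLUE, and the function names are hypothetical, not taken from any published implementation.

```python
import math

EULER_GAMMA = 0.5772156649015329


def segment_peaks(series, n):
    """Split the series into n equal segments and return each segment's maximum."""
    seg_len = len(series) // n  # trailing samples that do not fill a segment are dropped
    return [max(series[i * seg_len:(i + 1) * seg_len]) for i in range(n)]


def gumbel_moments_fit(peaks):
    """Method-of-moments Gumbel fit (an assumed stand-in for the BLUE)."""
    m = sum(peaks) / len(peaks)
    var = sum((p - m) ** 2 for p in peaks) / (len(peaks) - 1)
    sigma = math.sqrt(6.0 * var) / math.pi   # Gumbel variance is (pi * sigma)^2 / 6
    mu = m - EULER_GAMMA * sigma             # Gumbel mean is mu + gamma * sigma
    return mu, sigma


def extrapolate_location(mu, sigma, r):
    """Location of the Gumbel CDF raised to the r-th power; the scale is unchanged."""
    return mu + sigma * math.log(r)
```

Repeating the fit for several values of n on the same series is a direct way to see the dependence on n discussed above.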

The purposes of this paper are (1) to describe a novel procedure for estimating the distribution of the peak based on peaks-over-threshold models for extreme values; and (2) to assess its performance through comparison with results obtained from observed data. A software implementation that leverages the R [R Core Team (2015)] environment for statistical computing and graphics is available at https://github.com/usnistgov/potMax. Instructions for installation and use may be found there as well.

PEAK ESTIMATION BY A PEAKS-OVER-THRESHOLD (POT) MODEL

Peaks-over-Threshold (POT) models are applied to observations, y, that exceed a specified threshold, u, of the time series being considered. We choose a POT model as opposed to the epochal method using the Gumbel or generalized extreme value distribution for two reasons. First, POT models generally allow one to use more observations from the original time series to infer the values of the model parameters, potentially leading to less uncertainty.

Second, and perhaps more importantly, the threshold selection approach of Pintar et al. (2015) may be applied. Recall that the approach of Eaton and Mayne (1975) depends upon the chosen number of segments, n. While the results of the POT approach depend upon the choice of threshold, a procedure for objectively selecting an appropriate threshold using the data themselves eliminates the dependence of the final result on an arbitrary choice. The threshold selection procedure is summarized below and described fully in Pintar et al. (2015).

The complete procedure is described and illustrated in what follows with reference to the data set jp1 in the UWO-NIST Aerodynamic Database for Rigid Buildings (NIST 2004, Ho et al. 2005, http://fris2.nist.gov/winddata/), Building 7, in open terrain. The building was modeled at a scale of 1:100 and tested for wind directions ranging from 0° to 90° and from 270° to 360° in increments of 5°; data were collected for 100 s at 500 Hz. Due to building symmetry, for terrain with the same exposure in all directions, only half of all possible wind directions need to be investigated. In Figure 1, 0° and 90° are in the +y and the +x directions, respectively. Normalized pressure coefficients (Fig. 2) from tap 1715, in the middle of the roof edge, were investigated for wind direction 270°. The procedure is described by the following steps:

Fig. 1.

Building 7, data set jp1 (NIST–UWO database). Dimensions in feet (1 ft = 0.3048 m); roof slope 1:12.

Fig. 2.

Pressure coefficients from tap 1715 for wind in –x direction

1. Reverse the signs of the time series, if necessary.

The procedure is developed for positive peaks. The peaks of interest in Fig. 2 being negative, the signs of this time series were reversed. If analysts are interested in both positive and negative peaks, they should apply the method twice, first with the original signs, and second with reversed signs.

2. Choose a model.

The peaks-over-threshold models leveraged in this work are two-dimensional Poisson processes defined by the intensity function λ in Equation 1 or 2. The appropriateness of such models for crossings of a high threshold was first discussed by Pickands (1971).

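$$\lambda(t, y) = \frac{1}{\sigma}\left[1 + \frac{k\,(y - \mu)}{\sigma}\right]_{+}^{-1 - 1/k} \qquad (1)$$

$$\lambda(t, y) = \frac{1}{\sigma}\exp\left\{-\frac{y - \mu}{\sigma}\right\} \qquad (2)$$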

General Poisson processes are described in many texts, for example Chapter 4 of Resnick (1992). Briefly, for the two-dimensional case, if D is some two-dimensional region, the number of observations found in D follows a Poisson distribution with mean equal to the volume under the intensity function over D. Further, if D1 and D2 are disjoint regions, the numbers of observations falling in each are independent. Note that the left-hand sides of Equations 1 and 2 are functions of t (time) and y (e.g., normalized pressure coefficient), whereas the right-hand sides are functions of y only. The implication is that only stationary time series are considered. The argument t on the left-hand sides is kept to convey explicitly that the Poisson process necessarily has two dimensions, one of which is time (the subject of interest is, after all, time histories of pressure coefficients). The + subscript in Equation 1 means that negative values inside the square brackets are raised to zero. The parameters μ and σ are the location and scale parameters, respectively, and k is the tail length parameter. Equation 2 is the limit of Equation 1 as k approaches zero, just as the Gumbel distribution is the generalized extreme value distribution with the tail length parameter set to zero. For this reason, the Poisson process defined by the intensity function in Equation 2 is henceforth referred to as the Gumbel model, and the complete approach based on it is designated GpotMax; the Poisson process with λ(t, y) defined in Equation 1 is referred to as the Full model, and the corresponding approach is designated FpotMax.
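To illustrate how the intensity function determines counts, the sketch below evaluates both intensities and the mean number of observations above a threshold u in a window of given duration, which is the integral of the intensity over [0, duration] × (u, ∞). The function names are hypothetical, and the handling of the region where the bracket in Equation 1 is not positive is an assumption made for this illustration.

```python
import math


def intensity_gumbel(y, mu, sigma):
    """Equation 2: intensity of the Gumbel model."""
    return math.exp(-(y - mu) / sigma) / sigma


def intensity_full(y, mu, sigma, k):
    """Equation 1: intensity of the Full model, for k != 0."""
    base = 1.0 + k * (y - mu) / sigma
    if base <= 0.0:
        # Assumed convention: zero beyond the upper end point (k < 0);
        # for k > 0 this region lies below the range where the model is used.
        return 0.0
    return base ** (-1.0 / k - 1.0) / sigma


def expected_exceedances(u, duration, mu, sigma, k=0.0):
    """Mean number of observations above u during a window of the given duration."""
    if k == 0.0:
        return duration * math.exp(-(u - mu) / sigma)
    base = 1.0 + k * (u - mu) / sigma
    if base <= 0.0:
        return 0.0 if k < 0.0 else math.inf
    return duration * base ** (-1.0 / k)
```

Evaluating `expected_exceedances` with a very small k reproduces the Gumbel value, which is a quick numerical check that Equation 2 is the k → 0 limit of Equation 1.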

3. Decluster.

Figure 3a depicts the same raw time series as Figure 2, and two thresholded variants with the threshold u = 1.8 (Fig. 3b) and u = 2.0 (Fig. 3c). Note in Figure 3 that successive peaks can be separated by time intervals smaller than the time between an up-crossing of the mean and the subsequent down-crossing of the mean. Such successive peaks are typically strongly correlated, as shown by Figure 4, where it is seen that observations separated by more than 40 increments of time (in this case 0.08 s) remain highly positively correlated (this type of correlation is referred to as autocorrelation). Poisson processes are not appropriate for highly autocorrelated data without further processing because of the independence assumption that underlies them.

Figure 3.

(a) Raw time series, (b) observations above 1.8, and (c) observations above 2.0.

Figure 4.

Estimated autocorrelation function for the time series in Figure 3a.

Clusters are data blocks within time intervals defined by an up-crossing of the mean and the subsequent down-crossing of the mean. Declustering is an operation that is effective in removing the high autocorrelation from the data. It proceeds by discarding, in each cluster, all but the cluster maximum. Figure 5 displays the counterparts of Figure 3 after declustering, and Figure 6 depicts the estimated autocorrelation function of the series in Figure 5a. Figure 6 shows that declustering is indeed very effective. After removing the autocorrelation in the series, or declustering, the use of Poisson processes as models for crossings of a high threshold is justified. They are used for such purposes in many papers, e.g., Smith (1989), Smith (2004), Coles (2004), and Pintar et al. (2015).
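The declustering rule described above can be sketched in a few lines. This is an illustrative implementation, not the one in the potMax package. It assumes that a run of values above the mean at the very start or end of the record is also treated as a cluster, and the function name is hypothetical.

```python
def decluster(series):
    """Keep only the maximum of each cluster.

    A cluster is the run of samples between an up-crossing of the series
    mean and the next down-crossing, i.e. a maximal run of values above
    the mean. Returns a list of (index, value) pairs, one per cluster.
    """
    mean = sum(series) / len(series)
    maxima = []
    best = None  # (index, value) of the running maximum in the current cluster
    for i, y in enumerate(series):
        if y > mean:
            if best is None or y > best[1]:
                best = (i, y)
        elif best is not None:
            maxima.append(best)  # a down-crossing closes the cluster
            best = None
    if best is not None:
        maxima.append(best)  # the record ended while still above the mean
    return maxima
```

For example, `decluster([0, 3, 5, 4, 0, -1, 2, 6, 0, 0])` has mean 1.9 and two clusters, and it returns `[(2, 5), (7, 6)]`.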

Figure 5.

(a) Declustered time series, and resulting observations (b) above 1.8 and (c) above 2.0.

Figure 6.

Estimated autocorrelation function for the time series in Figure 5a.

4. Select optimal threshold.

Historically, a hurdle to the use of POT models has been the appropriate choice of a threshold. Since the threshold dictates the data that are included in (or omitted from) the sample used to fit the model, its impact on the results can be large. The model becomes more appropriate as the threshold increases, but the threshold cannot be too high because too few observations will remain for fitting the model, since observations are taken over a finite period of time. Any approach to choosing a threshold must balance these competing aspects. A common and easy-to-implement, though not necessarily optimal, approach is to pick a high quantile of the series, e.g., 95 % (see Mannshardt-Shamseldin et al. 2010, p. 489). This work takes the approach of Pintar et al. (2015): the threshold is chosen to optimize the fit of the model to the data, as judged by the W-statistics defined in Equation (1.30) of Smith (2004). (The W-statistic is unitless and defines a transformation of the data such that, if the Poisson process model were perfectly correct, the transformed data would follow exactly an exponential distribution with mean one.) Figure 7 shows a plot of the ordered W-statistics versus quantiles of the standard exponential distribution using the optimal threshold for the series in Fig. 5a. If the data fit the model perfectly, the points would fall exactly on the diagonal line. The threshold is chosen by creating such a plot for a sequence of potential thresholds and selecting the one that minimizes the maximum absolute vertical distance to the diagonal line. The maximum vertical distance to the diagonal line is nonzero, and possibly quite large, even for a threshold that allows only 2 observations (or 3 for Equation (1)) because of random noise, which generally increases as the number of observations decreases. However, in practice, limits on the minimum and maximum number of observations to be used in estimation should be set. As a minimum, 10 and 15 have been found to work reasonably well for Equations (2) and (1), respectively, providing 5 observations per parameter. The maximum should be chosen sufficiently high so that it does not correspond to the number of data above the selected threshold, and yet still be feasible from a computational perspective. This approach to selecting a threshold is comparable to the one used by Pickands (1994).
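The threshold scan can be sketched for the Gumbel model as follows. Several assumptions are made here and are not taken from the source: for Equation 2 the W-statistic is taken to reduce to W = (y − u)/σ (the k → 0 limit of the generalized Pareto transformation); σ is estimated by the mean excess over u, which is the closed-form maximum likelihood estimate under Equation 2; the exponential quantiles use plotting positions i/(n + 1); and candidate thresholds are midpoints between consecutive ordered observations. The function names and the default limits are hypothetical.

```python
import math


def w_plot_distance(peaks, u):
    """Max absolute vertical distance of the W-plot from the diagonal (Gumbel model)."""
    excess = sorted(y - u for y in peaks if y > u)
    n = len(excess)
    sigma = sum(excess) / n           # mean excess: MLE of sigma under Equation 2
    w = [e / sigma for e in excess]   # ordered W-statistics (assumed Gumbel form)
    # Standard exponential quantiles at plotting positions i/(n+1) (assumed convention).
    q = [-math.log(1.0 - i / (n + 1.0)) for i in range(1, n + 1)]
    return max(abs(wi - qi) for wi, qi in zip(w, q))


def select_threshold(peaks, n_min=10, n_max=200):
    """Scan candidate thresholds; return (threshold, distance) with the smallest distance."""
    ordered = sorted(peaks, reverse=True)
    best = None
    for n in range(n_min, min(n_max, len(ordered) - 1) + 1):
        if ordered[n - 1] == ordered[n]:
            continue  # tied values: no threshold leaves exactly n exceedances
        # The midpoint between the n-th and (n+1)-th largest values leaves n exceedances.
        u = 0.5 * (ordered[n - 1] + ordered[n])
        d = w_plot_distance(ordered, u)
        if best is None or d < best[1]:
            best = (u, d)
    return best
```

The input `peaks` is meant to be the declustered series. The limits `n_min` and `n_max` play the role of the minimum and maximum numbers of observations discussed above.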

Figure 7.

Plot of the W-statistics versus their corresponding standard exponential quantiles for the declustered series depicted in Figure 5a using the optimal threshold. Best fit of data using GpotMax (threshold = 2.10, 45 data points).

5. Estimate model parameters.

The model parameters, η = (μ, σ, k) for the intensity function in Equation 1, and η = (μ, σ) for the intensity function in Equation 2 are estimated by maximum likelihood (Casella and Berger, 2002, Section 7.2.2); the Hessian matrix of the log-likelihood at the maximum is also calculated in this step because it is used in the bootstrap algorithm for quantifying uncertainty, as discussed below.
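For the Gumbel model the maximum likelihood estimates have a closed form, which may help make this step concrete. The derivation below is a sketch that assumes the standard Poisson process likelihood on the region [0, T] × (u, ∞), with N observations y_1, ..., y_N above the threshold u and with T expressed in the time unit of the intensity function; the notation ℓ, N and the hats is introduced here for illustration.

```latex
\begin{aligned}
\ell(\mu,\sigma) &= -T\exp\left\{-\frac{u-\mu}{\sigma}\right\}
                    - N\ln\sigma - \sum_{i=1}^{N}\frac{y_i-\mu}{\sigma},\\[4pt]
\frac{\partial \ell}{\partial \mu} = 0
  &\;\Longrightarrow\; T\exp\left\{-\frac{u-\mu}{\sigma}\right\} = N,\\[4pt]
\frac{\partial \ell}{\partial \sigma} = 0
  &\;\Longrightarrow\; -N\,\frac{u-\mu}{\sigma^{2}} - \frac{N}{\sigma}
     + \sum_{i=1}^{N}\frac{y_i-\mu}{\sigma^{2}} = 0
  \;\Longrightarrow\; \hat\sigma = \frac{1}{N}\sum_{i=1}^{N}(y_i-u),\\[4pt]
\hat\mu &= u + \hat\sigma\,\ln\frac{N}{T}.
\end{aligned}
```

Under these assumptions the scale estimate is the mean excess over the threshold, and the location estimate matches the expected number of threshold crossings to the observed number N. No comparable closed form is available for Equation 1, for which numerical maximization is needed.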

6. Empirically build the distribution of the peak by Monte-Carlo simulation.

A series of desired length is generated from the fitted model, and the peak of the generated series is recorded. This is repeated nmc times. The recorded peaks form an empirical approximation to the distribution of the peak. A histogram of the simulated peaks over 100 s with nmc = 1000 for the example data set is shown in Figure 8, in which the mean value is marked by the triangle.
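A minimal sketch of this simulation for the Gumbel model follows. Under Equation 2, crossings of the threshold u occur in time as a Poisson process with rate exp{−(u − μ)/σ}, and each excess over u is exponentially distributed with mean σ. Both facts follow from integrating and normalizing the intensity. The function names, the default values, and the choice to discard windows without any crossing are assumptions of this illustration.

```python
import math
import random


def simulate_peak(mu, sigma, u, duration, rng):
    """One simulated peak above threshold u from the Gumbel model (Equation 2)."""
    rate = math.exp(-(u - mu) / sigma)  # expected threshold crossings per unit time
    peak = None
    t = rng.expovariate(rate)           # time of the first crossing
    while t <= duration:
        y = u + rng.expovariate(1.0 / sigma)  # excess over u is exponential, mean sigma
        if peak is None or y > peak:
            peak = y
        t += rng.expovariate(rate)      # waiting time to the next crossing
    return peak  # None if no crossing occurred in the window


def peak_distribution(mu, sigma, u, duration, n_mc=1000, seed=1):
    """Empirical distribution of the peak from n_mc simulated windows."""
    rng = random.Random(seed)
    peaks = [simulate_peak(mu, sigma, u, duration, rng) for _ in range(n_mc)]
    return [p for p in peaks if p is not None]  # windows without crossings are dropped
```

When the expected number of crossings is large, the simulated peaks should be close to a Gumbel distribution with location μ + σ ln(duration) and scale σ, which offers a simple check of the simulation.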

Figure 8.

Histogram of the estimated distribution of the peak value starting with the time series depicted in Figure 5a. The triangle shows the mean of the distribution.

7. Quantify uncertainty.

Recall that our goal is to estimate the distribution of the peak of the time series under study. Thus, the uncertainty in the estimate of the entire distribution of the peak is being quantified, not just, for example, the uncertainty in the mean of that distribution. To accomplish this, a second layer of Monte Carlo sampling is included. The input to step 6 was the maximum likelihood estimate of η, denoted η^. However, because only a finite sample is available, this estimate is uncertain. That uncertainty may be described using a multivariate Gaussian distribution and the Hessian matrix calculated in step 5. More specifically, we may sample values of η that are also consistent with the observed time series and repeat step 6 for each sampled value, nboot times in all. The mean of the multivariate Gaussian distribution is equal to the estimated parameters, η^, and the covariance matrix is equal to the negative inverse of the Hessian matrix of the log-likelihood evaluated at its maximum. The result of step 7 is nboot empirical approximations to the distribution of the peak. Figure 9 shows nboot = 50 replicates of the distribution of the peak for the example data set. Only 50 replicates are shown in Figure 9 for clarity of illustration; typically, more replicates, say 1000, would be desired. The bar shown in Figure 9 depicts an 80 % confidence interval for the mean, which is calculated from 1000 replicates. This technique is an approximation to a bootstrap algorithm [Efron and Tibshirani (1993)]. It is an approximation because the bootstrap distribution of the estimated parameter values is not constructed by resampling the data, but is assumed to be a multivariate Gaussian distribution.
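The parameter sampling in this step can be sketched for the two-parameter Gumbel model, η = (μ, σ). The sketch below inverts a 2 × 2 Hessian by hand and draws from the Gaussian distribution through a Cholesky factor. The function names are hypothetical, and the Hessian is assumed to be negative definite, as it should be at a maximum of the log-likelihood.

```python
import math
import random


def covariance_from_hessian(h):
    """Negative inverse of a symmetric 2x2 Hessian [[a, b], [b, d]]."""
    (a, b), (_, d) = h
    det = a * d - b * b
    # The inverse of h is [[d, -b], [-b, a]] / det; the covariance is its negative.
    return [[-d / det, b / det], [b / det, -a / det]]


def sample_parameters(eta_hat, hessian, n_boot, seed=1):
    """Draw n_boot parameter vectors from N(eta_hat, -H^{-1})."""
    cov = covariance_from_hessian(hessian)
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        draws.append((eta_hat[0] + l11 * z1,
                      eta_hat[1] + l21 * z1 + l22 * z2))
    return draws
```

Each draw would then be passed to the simulation of step 6 to produce one replicate of the distribution of the peak. A draw with a nonpositive σ is not a valid parameter value and would have to be handled separately; how to do so is not specified here.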

Figure 9.

Bootstrap replicates of the distribution of the peak starting with the series shown in Figure 5a. The short horizontal line shows an 80 % confidence interval (CI) for the mean of the distribution of the peak value.

DISCUSSION OF RESULTS

The dashed line in Figure 10 shows the (global) peak estimated by GpotMax applied to the entire time series of duration 100 s. This estimate is close to the observed peak of the time series, shown by the solid line. On the same plot, the squares show the results of six analyses performed on six partitions of the same time series, each of length 100 s/6. The GpotMax estimates closely track the observed peaks, shown by the circles, for each of the partitions. For each partition, GpotMax may also be used to calculate the mean of the distribution of the peak for a duration of 100 s, shown by the triangles in Figure 10. The six individual partitions can yield estimates that differ by as much as approximately 25 % from the estimate based on the entire 100-s time series. However, the average of these six estimates, shown by the dashed line, is reasonably close to the global estimate and the observed 100-s peak.

Figure 10.

Comparison of estimates based on 6 equal data blocks and on global analysis, using GpotMax

The “full” version of the algorithm based on Equation 1, referred to as FpotMax, does not assume the tail parameter to be zero. The W-plot for the threshold that provides the best match between the data and the model for the example series is shown in Figure 11 and corresponds to a threshold of 2.16 and 37 data points.

Figure 11.

Best fit to the data for the example series, tap 1715, using FpotMax (threshold =2.16, 37 data points)

Observed peaks and estimates obtained by GpotMax and FpotMax for the example series are compared in Table 1. The respective estimates are close, as should be expected from the good fit of the two models to the data (Figs. 7 and 11). Comparisons for an additional four taps (see Appendix) are also shown in Table 1 (positive peaks are of interest for tap 2011).

Table 1.

Observed and Estimated Peaks Obtained by GpotMax and FpotMax.

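| Tap | Wind Dir. | Obs. | GpotMax Mean | GpotMax Stand. Error | GpotMax 10% Quant. | GpotMax 90% Quant. | GpotMax Mean of 6 | FpotMax Mean | FpotMax Stand. Error | FpotMax 10% Quant. | FpotMax 90% Quant. | FpotMax Mean of 6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1715 | 270° | 3.17 | 3.25 | 0.17 | 3.03 | 3.46 | 3.34 | 3.17 | 0.33 | 2.77 | 3.58 | 3.22 |
| 1712 | 270° | 3.79 | 3.80 | 0.21 | 3.54 | 4.06 | 3.98 | 3.70 | 0.21 | 3.46 | 3.98 | 3.69 |
| 2011+ | 270° | 2.48 | 2.54 | 0.16 | 2.36 | 2.76 | 2.68 | 2.49 | 0.14 | 2.34 | 2.68 | 2.55 |
| 708 | 360° | 3.24 | 3.41 | 0.24 | 3.10 | 3.71 | 2.63 | 3.35 | 0.34 | 2.97 | 3.81 | 4.47 |
| 813 | 315° | 7.45 | 7.22 | 0.43 | 6.70 | 7.79 | 7.54 | 7.42 | 1.18 | 6.34 | 9.03 | 8.11 |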

Figure 12 shows data pertaining to all five time series analyzed as part of this paper. The filled squares show the observed peaks of the 100-s (global) time series. For each tap, the symbols and bars at the left (right) of the filled squares pertain to peaks estimated by GpotMax (FpotMax). The other symbols show the estimated mean of the distribution of the peak from the complete 100-s time history and the mean of the estimates of the 100-s extrapolations obtained from the six partitions of length 100 s/6. The vertical bars depict 80 % confidence intervals based on the complete 100-s series.

Figure 12.

Observed peaks, and estimates by GpotMax and FpotMax of (i) mean peaks and (ii) the 80 % confidence intervals.

Table 1 and Figure 12 show that for all five taps the results obtained by using GpotMax and FpotMax are similar to each other and to the observed peaks. In four of the five cases, the estimates based on FpotMax are slightly lower than those based on GpotMax. The choice of GpotMax versus FpotMax should be based on the fit of the model to the data, assessed by plots such as Figures 7 and 11. Applying the principle of parsimony, GpotMax should be preferred to FpotMax unless the fit of the FpotMax model to the data is appreciably better than that of the GpotMax model. A better fit by FpotMax would be expected, for example, if there were one or two very large peaks relative to other threshold crossings.

ILLUSTRATION OF THE DEPENDENCE OF THE EPOCHAL APPROACH UPON THE NUMBER n OF PARTITIONS

Figure 13 is a plot of the peaks estimated using the epochal method for two probabilities of non-exceedance, p = 0.78 and p = 0.5704. The latter corresponds to the mean of the Gumbel distribution, and the former is commonly used by wind tunnel operators (Peng et al. 2013) and is close to the value 0.80 specified in ISO 4354 (2009). For 10 ≤ n ≤ 24, say, the estimated peaks for tap 708, wind direction 360°, vary between 3.72 and 4.20 for p = 0.78, and between 3.48 and 3.82 for p = 0.57. For comparison, the single GpotMax and FpotMax estimates are 3.41 and 3.35, respectively, and the observed peak is 3.24.
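The correspondence between p = 0.5704 and the Gumbel mean can be checked directly. The Gumbel quantile at probability p is μ − σ ln(−ln p) and the mean is μ + γσ, where γ is the Euler-Mascheroni constant, so the mean is the quantile at p = exp(−exp(−γ)). A short sketch, with hypothetical names:

```python
import math

EULER_GAMMA = 0.5772156649015329


def gumbel_quantile(mu, sigma, p):
    """Value not exceeded with probability p under a Gumbel(mu, sigma) distribution."""
    return mu - sigma * math.log(-math.log(p))


# Probability of non-exceedance of the Gumbel mean; independent of mu and sigma.
p_mean = math.exp(-math.exp(-EULER_GAMMA))
```

Here `p_mean` rounds to 0.5704. Because −ln(−ln 0.78) is larger than γ, the p = 0.78 quantile always lies above the mean, which is consistent with the higher values reported for p = 0.78.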

Figure 13.

Peaks estimated by the epochal method and BLUEs for the Gumbel parameters

CONCLUSIONS

Current methods for estimating peaks of pressure time series have drawbacks that motivated the development of the new procedure, an advantage of which is that it often results in an increased size of the relevant extreme value data set compared with epochal procedures. The translation approach depends upon the estimate of the marginal distribution of a non-Gaussian time series, which is typically difficult to perform reliably. The epochal procedure for estimating peaks combined with BLUEs of the Gumbel parameters was found to depend, in some cases very significantly, upon the number of partitions being used.

The proposed procedure is based on a Poisson process model for the observations, y, that exceed a specified threshold, u, of the time series being considered. The estimate depends upon the choice of the threshold. However, unlike for the choice of the number n of partitions for the epochal procedure, a criterion is available that allows the analyst to make an optimal choice (according to the specified metric) of the threshold value.

Two versions of the proposed new procedure have been developed. One version, denoted by FpotMax, includes estimation of a tail length parameter with a similar interpretation to the generalized extreme value distribution tail length parameter. The second version, denoted by GpotMax, assumes that the tail length parameter vanishes, resulting in a tail of the Gumbel distribution type. By the principle of parsimony, GpotMax should be preferred unless FpotMax provides a considerable improvement in the fit of the model to the data, as judged, for instance, by a plot like Figure 11.

APPENDIX:

Tap locations, wind directions, and pressure time histories for taps 1712, 2011, 708, and 813.

REFERENCES

  1. Casella G, & Berger RL (2002). “Statistical inference” (Vol. 2). Pacific Grove, CA: Duxbury.
  2. Coles S (2004) “The use and Misuse of Extreme Value Models in Practice,” in Extreme Values in Finance, Telecommunications, and the Environment, Finkenstädt B and Rootzén H, editors, chapter 2, 79–100, Chapman & Hall/CRC
  3. Davenport AG (1964) “Note on the Distribution of the Largest Value of a Random Function with Application to Gust Loading,” Proc. Institution of Civil Engineers, 28, 187–196
  4. Eaton KJ and Mayne JR (1975). “The measurement of wind pressures on two-story houses at Aylesbury,” Journal of Industrial Aerodynamics, 1(1), 67–109.
  5. Efron B, & Tibshirani RJ (1994). “An introduction to the bootstrap.” CRC press.
  6. Ho TCE, Surry D, Morrish D, and Kopp GA (2005) “The UWO Contribution to the NIST Aerodynamic Database for Wind Loads on Low Buildings Part 1: Archiving Format and Basic Aerodynamic Data” J. Wind Eng. Ind. Aerodyn, 93 (1), 1–30
  7. INTERNATIONAL STANDARD, ISO 4354 (2009-June-01), 2nd edition, “Wind actions on structures,” Annex D (informative) “Aerodynamic pressure and force coefficients,” Geneva, Switzerland, p. 22
  8. Lieblein J (1974) “Efficient Methods of Extreme Value Methodology,” NBSIR74–602, National Bureau of Standards, Washington, DC
  9. Mannshardt-Shamseldin EC, Smith RL, Sain SR, Mearns LO, and Cooley D (2010) “Downscaling Extremes: A Comparison of Extreme Value Distributions in Point-source and Gridded Precipitation Data,” The Annals of Applied Statistics, 4, 484–502
  10. NIST (2004) “Extreme Winds and Wind Effects on Structures” (Aerodynamic Database for Rigid Buildings) www.itl.nist.gov/div898/winds/homepage.htm, National Institute of Standards and Technology, Gaithersburg, MD, created 03/05/2004, last updated 05/22/2015, last accessed 09/04/2015
  11. Peng X, Yang L, Gurley K, Prevatt D and Gavanski E (2013) “Prediction of peak wind loads on low-rise building,” 12th Americas Conf. on Wind Engg., 12ACWE, Seattle, WA. June
  12. Pickands J III (1971) “The Two-dimensional Poisson Process and Extremal Processes,” Journal of Applied Probability, 8, 745–756
  13. Pickands J III (1994) “Bayes quantile estimation and threshold selection for the Generalized Pareto family,” Proc. Conf. on Extreme Value Theory and Applications, Gaithersburg, MD, 1993, Kluwer Academic Pub., Boston
  14. Pintar AL, Simiu E, Lombardo FT, and Levitan M (2015) “Maps of Non-hurricane Non-tornadic Winds Speeds With Specified Mean Recurrence Intervals for the Contiguous United States Using a Two-dimensional Poisson Process Extreme Value Model and Local Regression,” NIST Special Publication 500–301, URL http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.500-301.pdf
  15. R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, URL https://www.R-project.org/
  16. Resnick SI (1992). “Adventures in stochastic processes.” Birkhäuser.
  17. Rice SO (1954) “Mathematical analysis of random noise” Select papers on noise and stochastic processes, Wax N, ed., Dover, New York.
  18. Sadek F and Simiu E (2002) “Peak Non-Gaussian Wind Effects for Database-Assisted Low-Rise Building Design,” ASCE J. Eng. Mech 128 (5), 530–539, May
  19. Simiu E and Scanlan RH (1996) “Wind effects on structures,” Wiley, New York
  20. Smith RL (1989) “Extreme Value Analysis of Environmental Time Series: an Application to Trend Detection in Ground-level Ozone,” Statistical Science, 4, 367–393
  21. Smith RL (2004) “Statistics of Extremes, with Applications in Environment, Insurance, and Finance,” in Extreme Values in Finance, Telecommunications, and the Environment, Finkenstädt B and Rootzén H, editors, chapter 1, 1–78, Chapman & Hall/CRC