Scientific Reports. 2022 Oct 15;12:17327. doi: 10.1038/s41598-022-22243-8

A mixed distribution to fix the threshold for Peak-Over-Threshold wave height estimation

Antonio M Durán-Rosal 1, Mariano Carbonero 1, Pedro Antonio Gutiérrez 2, César Hervás-Martínez 2
PMCID: PMC9569348  PMID: 36243880

Abstract

Modelling extreme value distributions, such as those of wave height time series where the higher waves are much less frequent than the lower ones, has been tackled with Peak-Over-Threshold (POT) methodologies, where modelling is based on the values higher than a threshold. This threshold is usually predefined by the user, while the rest of the values are ignored. In this paper, we propose a new method to estimate the distribution of the complete time series, including both extreme and regular values. This methodology assumes that extreme value time series can be modelled by a normal distribution in combination with a uniform one. The resulting theoretical distribution is then used to fix the threshold for the POT methodology. The methodology is tested on nine real-world time series collected in the Gulf of Alaska, Puerto Rico and Gibraltar (Spain), which are provided by the National Data Buoy Center (USA) and Puertos del Estado (Spain). According to the Kolmogorov-Smirnov statistical test, the results confirm that the time series can be modelled with this type of mixed distribution. Based on this, the return values and the confidence intervals for wave height over different periods of time are also calculated.

Subject terms: Physical oceanography, Ocean sciences

Introduction

Marine forecasting has become an essential task to ensure the safety of navigation, fishery and engineering construction, among others1. In particular, wave height prediction is key to the design of coastal and offshore structures2. In this sense, the incorporation of wave models into numerical weather prediction models can improve atmospheric forecasts3. The development of offshore installations for oil and gas extraction and for renewable energy exploitation requires knowledge of the wave fields and any potential changes in them. One of the main problems is that the maximum peak-to-trough wave height is not usually available, although the largest waves have the greatest impact on ships and offshore structures4.

The importance of time series data mining has been increasing exponentially in the last decade5,6. Time series are present in different fields of application, e.g. climate7, oceanography8, biology9 and many more. In addition, they are used for different research objectives, such as classification10, tipping point detection11, forecasting12, etc.

Basically, a time series can be defined as temporal data collected in different periods of time. In this sense, the observation of a random variable in regular periods of time can lead to the introduction of noise. That is, if the period between two consecutive observations is much lower than the real cadence of the phenomenon under investigation, a high number of observed values will be very close to the average value of the characteristic studied.

In the context of oceanography and specifically, in the determination of extreme wave height values, if we consider a buoy collecting the wave height value every four hours, then a high proportion of values close to the average wave height will be recorded. This results in the fact that extreme wave heights, which are probably the most interesting ones, will be outnumbered by a set of very similar values without special interest. These non-informative observations have a distorting effect on the measures that could be taken to analyse the variable, because they do not significantly change the mean value but reduce the deviation, increasing the sample size.

Consequently, wave height extreme values will change from being more or less infrequent to atypical or outliers, with the drawbacks that this means for its analysis and prediction. The presence of these extreme values produces a denaturalization of the standard wave height probability distribution. For this reason, it is necessary to define thresholds of wave height from which the extreme wave distributions are considered, where large time series are needed, given that the number of these events every year is very low and depends on the oceanic position of the buoy.

Statistical methods to determine extreme wave heights using the Peaks-Over-Threshold (POT) approach have improved significantly over the years. Mathiesen et al.13 use the POT method along with a Weibull distribution estimated by a maximum likelihood procedure. This is applied to the prediction of individual wave heights associated with high return periods, considering that 100 years or more is enough for the extensive use of the ocean's resources. In 2001, Coles14 introduced the GPD-Poisson model by fitting a Generalized Pareto Distribution (GPD), which was also used later on15.

In 2011, Mazas and Hamm16 proposed the determination of extreme wave heights using a POT approach in which a double threshold $(u_1,u_2)$ is used. A low value $u_1$ is set to select both weak and strong storms. Then, a second, higher threshold $u_2$ has to be determined to decide which storms have a statistically extreme behaviour. Three probability distributions of extreme values are used to determine $u_2$: GPD-Poisson, Weibull and Gamma distributions. To select the best-fitting distribution, two objective criteria based on likelihood (Bayesian Information Criterion17, BIC, and Akaike Information Criterion18, AIC) are used.

More recently, Petrov et al.19 presented a maximum entropy (MaxEnt) method for the prediction of extreme significant wave heights, comparing it with the state of the art methodologies of the Extreme Value Theory (EVT): the GPD and the Generalized Extreme Value distribution (GEV). According to the definition of the MaxEnt principle, the distribution that provides the highest entropy is selected to give more information among all other possible distributions that satisfy the proposed constraints.

As can be seen, all methods are based on selecting a threshold and modelling the distribution of the wave heights over this threshold. Thus, the main problem is how to select this threshold in order to avoid information loss. To that end, it is of interest to model the complete time series, with both regular and extreme values, and to use this theoretical distribution to fix the threshold for the POT approach. In this paper, we propose a new methodology to determine the distribution of the extreme wave heights, considering that the normally distributed extreme wave heights are mixed with regular values drawn from a uniform distribution. The reason for choosing a uniform distribution is that, outside a range around the mean, all observations of wave height should be assumed to be part of the problem and never noise. This leads us to discard the normal distribution as a contamination distribution. After that, using the estimated theoretical mixed distribution, we set the threshold for the POT methodologies. We then fit several distributions to the values over this threshold and select the best-fitting distribution according to the BIC and AIC criteria.

The novel contributions of this work to applied energy issues are:

  • In atmospheric time series, such as wave height20, wind power21 or fog formation in airports22,23, there are many values close to the average. As a result, the extreme values of the time series, which are the most interesting ones, are hidden by uninteresting values, which have a distorting effect on the analysis of the extremes. In this paper, we show that regular values do not significantly change the mean value of the time series, but they reduce the deviation by increasing the sample size.

  • We propose a new methodology which, to the best of the authors' knowledge, has not been applied before to wave height time series. This methodology is able to determine the distribution of the complete time series, taking into account that the wave height time series distribution is a mixture of a normal distribution of extreme values and noise from a uniform distribution.

  • For adjusting the four parameters needed to define the mixed distribution, we used the method of moments24, given that our methodology uses the raw time series.

  • When the mixed distribution is estimated, this methodology is used to determine the threshold needed for POT approaches. We assume that using the extreme values situated over a percentile of the theoretical mixed distribution is more reliable than using a predefined value adjusted by a trial and error process. In this way, our methodology is applied to obtain return values for 1, 2, 5, 10, 20, 50 and 100 years for nine real-world wave height time series, using three different percentiles from the mixed distribution.

The rest of the paper is organised as follows: section “Methodology” presents the details of the proposed method. Section “Dataset and experimental design” describes the data considered and the characteristics of the experiments, while section “Results and discussion” includes the results and the associated discussion. Finally, section “Conclusion” concludes the paper.

Methodology

This section introduces the Extreme Value Theory and presents the methodology proposed in this work.

Extreme value theory

Extreme Value Theory (EVT) is concerned with the sample maximum $M_n=\max(X_1,\ldots,X_n)$, where $(X_1,\ldots,X_n)$ is a set of independent random variables with common distribution function $F$. In this case, the distribution of the maximum observation is given by $\Pr(M_n<x)=F^n(x)$. The hypothesis of independence when the $X$ variables represent the wave height over a determined threshold is quite acceptable because, for oceanographic data, it is common to adopt a POT scheme which selects extreme wave height events that are approximately independent25. Also, in ref. 26, the authors affirm that “The maximum wave heights in successive sea states can be considered independent, in the sense that the maximum height is dependent only on the sea state parameters and not in the maximum height in adjacent sea states”. This $M_n$ variable is described by one of the following three distributions: Gumbel, Fréchet, and Weibull.

One methodology in EVT is to consider wave height time series with the annual maximum approach (AM)27, where X represents the wave height collected on regular periods of time of one year, and Mn is formed by the maximum values of each year. The statistical behaviour of AM can be described by the distribution of the maximum wave height in terms of Generalized Extreme Value (GEV) distribution:

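$$G(x)=\begin{cases}\exp\left\{-\left[1+\xi\left(\dfrac{x-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}, & \xi\neq 0,\\[2ex] \exp\left\{-\exp\left[-\left(\dfrac{x-\mu}{\sigma}\right)\right]\right\}, & \xi=0,\end{cases} \tag{1}$$

where:

$$0<1+\xi\left(\frac{x-\mu}{\sigma}\right), \tag{2}$$

with $-\infty<\mu<\infty$, $\sigma>0$ and $-\infty<\xi<\infty$. As can be seen, the model has three parameters: location ($\mu$), scale ($\sigma$), and shape ($\xi$).

The estimates of the return values, corresponding to the return period ($T_p$), are obtained by inverting Eq. (1):

$$z_p=\begin{cases}\mu-\dfrac{\sigma}{\xi}\left[1-\left\{-\log(1-p)\right\}^{-\xi}\right], & \xi\neq 0,\\[2ex] \mu-\sigma\log\left\{-\log(1-p)\right\}, & \xi=0,\end{cases} \tag{3}$$

where $G(z_p)=1-p$. Then, $z_p$ will be exceeded on average once every $1/p$ years, which corresponds to $T_p$.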
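As an illustration of Eqs. (1) and (3), the following minimal Python sketch evaluates the GEV distribution function and its inverse. The function names and the parameter values are hypothetical and chosen only for the example; they are not fitted to any of the series used in this work.

```python
import math


def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function G(x) of Eq. (1)."""
    s = (x - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-s))
    t = 1.0 + xi * s
    if t <= 0.0:
        # Outside the support of Eq. (2): below it if xi > 0, above it if xi < 0.
        return 0.0 if xi > 0.0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def gev_return_level(p, mu, sigma, xi):
    """Return value z_p of Eq. (3), which satisfies G(z_p) = 1 - p."""
    y = -math.log(1.0 - p)
    if xi == 0.0:
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))


# Illustrative parameters only (assumed, not taken from the paper).
z_100 = gev_return_level(p=0.01, mu=6.0, sigma=1.2, xi=0.1)
print(z_100, gev_cdf(z_100, 6.0, 1.2, 0.1))
```

With annual maxima, `p=0.01` corresponds to a return period of 100 years, and the second printed value should be close to 0.99.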

The alternative method in the EVT context is the Peak-Over-Threshold (POT) approach, where all values over a threshold predefined by the user are selected to be statistically described, instead of only the maximum values28,29. The POT method has become a standard approach for these predictions13,29,30. Furthermore, several improvements over the basic approach have since been proposed by various authors19,31,32.

The POT method is based on the fact that if the AM approach uses a GEV distribution (Eq. 1), the peaks over a high threshold should follow the related approximate distribution: the Generalized Pareto Distribution (GPD). The GPD fitted to the tail of the distribution gives the conditional non-exceedance probability $P(X\le x\mid X>u)$, where $u$ is the threshold level. The conditional distribution function can be calculated as:

$$P(X\le x\mid X>u)=\begin{cases}1-\left[1+\xi\left(\dfrac{x-u}{\sigma}\right)\right]^{-1/\xi}, & \xi\neq 0,\\[2ex] 1-\exp\left[-\left(\dfrac{x-u}{\sigma}\right)\right], & \xi=0.\end{cases} \tag{4}$$

There is consistency between the GEV and GPD models: the GPD shape parameter equals the GEV shape parameter $\xi$, and the GPD scale parameter equals $\sigma+\xi(u-\mu)$ when written in terms of the GEV parameters. In Eq. (4), $\sigma$ and $\xi$ denote the GPD scale and shape parameters, respectively. When $\xi\ge 0$, the distribution is referred to as long tailed. When $\xi<0$, the distribution is referred to as short tailed. The methods used to estimate the parameters of the GPD and the selection of the threshold are discussed next.

The use of the GPD for modelling the tail of the distribution is also justified by asymptotic arguments in ref. 14. In that work, the author notes that it is usually more convenient to interpret extreme value models in terms of return levels, rather than individual parameters. In order to obtain these return levels, the exceedance rate of the threshold, $P(X>u)$, has to be determined. In this way, using Eq. (4) together with $P(X>x\mid X>u)=P(X>x)/P(X>u)$, and considering that $z_N$ is exceeded on average once every $N$ observations, we have:

$$P(X>u)\left[1+\xi\left(\frac{z_N-u}{\sigma}\right)\right]^{-1/\xi}=\frac{1}{N}. \tag{5}$$

Then, the $N$-observation return level $z_N$ is obtained as:

$$z_N=u+\frac{\sigma}{\xi}\left[\left(N\,P(X>u)\right)^{\xi}-1\right]. \tag{6}$$
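The step from Eq. (5) to Eq. (6) can be checked numerically. The sketch below is only an illustration: the function names, the parameter values and the choice of six-hourly records are assumptions, and the $\xi=0$ branch uses the standard exponential limit of the GPD, which is not written out in Eq. (6).

```python
import math


def gpd_return_level(n_obs, u, sigma, xi, p_exceed):
    """Level z_N of Eq. (6), exceeded on average once every n_obs observations."""
    m = n_obs * p_exceed
    if xi == 0.0:
        # Exponential limit of the GPD (assumed here, not stated in Eq. (6)).
        return u + sigma * math.log(m)
    return u + (sigma / xi) * (m ** xi - 1.0)


def unconditional_exceedance(z, u, sigma, xi, p_exceed):
    """Left-hand side of Eq. (5): P(X > u) * P(X > z | X > u)."""
    if xi == 0.0:
        return p_exceed * math.exp(-(z - u) / sigma)
    return p_exceed * (1.0 + xi * (z - u) / sigma) ** (-1.0 / xi)


# Illustrative values: 100 years of six-hourly records, 5% of them above u.
n_obs = 4 * 365 * 100
z_n = gpd_return_level(n_obs, u=5.0, sigma=1.0, xi=-0.1, p_exceed=0.05)
print(z_n, unconditional_exceedance(z_n, 5.0, 1.0, -0.1, 0.05) * n_obs)
```

The second printed value should be close to 1, which is Eq. (5) multiplied by $N$.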

Many techniques have been proposed for the estimation of the parameters of the GEV and GPD. In ref. 19, the authors applied the maximum likelihood (ML) methodology described in ref. 14. However, the use of this methodology for two-parameter distributions (i.e. Weibull or Gamma) has a very important drawback: these distributions are very sensitive to the distance between the high threshold ($u_2$) and the first peak16. For this reason, ML could be used with two-parameter distributions when $u_2$ reaches a peak. As this peak is excluded, the first value of the exceedance is as far from $u_2$ as possible. A solution would be to use the three-parameter Weibull and Gamma distributions. However, ML estimation of such distributions is very difficult, and the algorithms usually fit two-parameter distributions inside a discrete range of location parameters33.

Proposed methodology

As stated before, in this paper, we present a new methodology to model this kind of time series considering not only extreme values but also the rest of the observations. In this way, instead of selecting the maximum values per period (usually a year) or defining thresholds in the distribution of these extreme wave heights, which has an appreciable subjective component, we model the distribution of all wave heights, considering that it is a mixture formed by a normal distribution and a uniform distribution. The motivation is that the uniform distribution is associated with regular wave height values which contaminate the normal distribution of extreme values. This theoretical mixed distribution is then used to fix the threshold for the estimation of the POT distributions. Thus, the threshold is determined in a much more objective and probabilistic way.

Let us consider a sequence of independent random variables $(X_1,\ldots,X_n)$ of wave height data. These data follow an unknown continuous distribution. We assume that this distribution is a mixture of two independent distributions, $Y_1\sim N(\mu,\sigma)$ and $Y_2\sim U(\mu-\delta,\mu+\delta)$, where $N(\mu,\sigma)$ is a Gaussian distribution, $U(\mu-\delta,\mu+\delta)$ is a uniform distribution, $\mu>0$ is the common mean of both distributions, $\sigma$ is the standard deviation of $Y_1$, and $\delta$ is the radius of $Y_2$, with $\mu-\delta>0$. Then $f(x)=\gamma f_1(x)+(1-\gamma)f_2(x)$, where $\gamma$ is the probability that an observation comes from the normal distribution, and $f(x)$, $f_1(x)$ and $f_2(x)$ are the probability density functions (pdf) of $X$, $Y_1$ and $Y_2$, respectively.

For the estimation of the values of the four above-mentioned parameters (μ,σ,δ,γ), the standard statistical theory considers the least squares methods, the method of moments and the maximum likelihood (ML) method. In this context, Mathiesen et al.13 found that the least squares methods are sensitive to outliers, although Goda34 recommended this method with modified plotting position formulae.

Clauset et al.35 show that methods based on least-squares fitting for the estimation of probability-distribution parameters can have many problems, and, usually, the results are biased. These authors propose the method of ML for fitting parametrized models such as power-law distributions to observed data, given that ML provably gives accurate parameter estimates in the limit of large sample size36. The ML method is commonly used in multiple applications, e.g. in metocean applications25, due to its asymptotic properties of being unbiased and efficient. In this regard, White et al.37 conclude that ML estimation outperforms the other fitting methods, as it always yields the lowest variance and bias of the estimator. This is not unexpected, as the ML estimator is asymptotically efficient37,38. Also, in Clauset et al.35, it is shown, among other properties, that under mild regularity conditions, the ML estimate $\hat{\alpha}$ converges almost surely to the true $\alpha$ when estimating the scaling parameter ($\alpha$) of a power law in the case of continuous data. It is asymptotically Gaussian, with variance $(\alpha-1)^2/n$. However, the ML estimators do not achieve these asymptotic properties until they are applied to large sample sizes. Hosking and Wallis39 showed that the ML estimators are non-optimal for sample sizes up to 500, with higher bias and variance than other estimators, such as moments and probability-weighted moments estimators.

Deluca and Corral40 also presented the estimation of a single parameter α associated with a truncated continuous power-law distribution. In order to find the ML estimator of the exponent, they proceed by directly maximizing the log-likelihood l(α). The reason is practical since their procedure is part of a more general method, valid for arbitrary distributions f(x), for which the derivative of l(α) can be challenging to evaluate. They claim that one needs to be cautious when the value of α is very close to one in the maximization algorithm and replace l(α) by its limit at α=1.

Furthermore, the use of ML estimation for two-parameter distributions such as the Weibull and Gamma distributions has the drawback16 previously discussed. Besides, ML estimation is known to provide poor results when the maximum is at the limit of the interval of validity of one of the parameters. On the other hand, the estimation of the GPD parameters is the subject of ongoing research. A quantitative comparison of recent methods for estimating the parameters was presented by Kang and Song41. In our case, having to estimate four parameters, we have decided to use the method of moments for its analytical simplicity: the estimates are obtained by matching sample moments to population moments. Besides, adequate estimates are obtained in multi-parametric estimation and with limited samples, as shown in this work.

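Considering $\phi$ as the pdf of a standard normal distribution $N(0,1)$, the pdf of $Y_1$ is defined as:

$$f_1(x)=\frac{1}{\sigma}\phi(z_x),\quad z_x=\frac{x-\mu}{\sigma},\quad x\in\mathbb{R}. \tag{7}$$

The pdf of $Y_2$ is:

$$f_2(x)=\frac{1}{2\delta},\quad x\in(\mu-\delta,\mu+\delta). \tag{8}$$

Consequently, the pdf of $X$ is:

$$f(x)=\gamma f_1(x)+(1-\gamma)f_2(x),\quad x\in\mathbb{R}. \tag{9}$$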
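To make Eqs. (7)–(9) concrete, the following Python sketch evaluates the mixed density, its cumulative distribution function, and draws samples from it. The function names and the parameter values are hypothetical; the sampler simply picks the normal component with probability $\gamma$ and the uniform one otherwise.

```python
import math
import random


def normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mixture_pdf(x, mu, sigma, delta, gamma):
    """Mixed density f(x) of Eq. (9)."""
    f1 = normal_pdf((x - mu) / sigma) / sigma                       # Eq. (7)
    f2 = 1.0 / (2.0 * delta) if mu - delta < x < mu + delta else 0.0  # Eq. (8)
    return gamma * f1 + (1.0 - gamma) * f2


def mixture_cdf(x, mu, sigma, delta, gamma):
    """Distribution function obtained by integrating Eq. (9)."""
    c1 = normal_cdf((x - mu) / sigma)
    c2 = min(1.0, max(0.0, (x - (mu - delta)) / (2.0 * delta)))
    return gamma * c1 + (1.0 - gamma) * c2


def mixture_sample(n, mu, sigma, delta, gamma, rng):
    """Draw n values: normal with probability gamma, uniform otherwise."""
    return [
        rng.gauss(mu, sigma) if rng.random() < gamma
        else rng.uniform(mu - delta, mu + delta)
        for _ in range(n)
    ]


# Illustrative parameters only (assumed for the example).
params = dict(mu=2.7, sigma=1.5, delta=1.2, gamma=0.2)
xs = mixture_sample(5, rng=random.Random(0), **params)
print(xs, mixture_pdf(2.7, **params), mixture_cdf(2.7, **params))
```

Because both components are symmetric about $\mu$, the distribution function equals 0.5 at $x=\mu$ for any admissible parameter values.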

To infer the values of the four parameters of the wave height time series ($\mu$, $\sigma$, $\delta$, $\gamma$), we define, for any random variable that is symmetric with respect to its mean $\mu$, with pdf $g$ and finite moments, a set of functions of the form:

$$U_k(x)=\int_{|t-\mu|\ge x}|t-\mu|^k g(t)\,dt,\quad x\ge 0,\quad k=1,2,3,\ldots, \tag{10}$$

or, because of its symmetry:

$$U_k(x)=2\int_{x+\mu}^{\infty}(t-\mu)^k g(t)\,dt,\quad k=1,2,3,\ldots. \tag{11}$$

These functions are well defined whenever the corresponding moments of the variable are finite, because:

$$U_k(x)\le\int_{-\infty}^{\infty}|t-\mu|^k g(t)\,dt<\infty,\quad k=1,2,3,\ldots. \tag{12}$$

In particular, for the normal and uniform distributions, all the moments are finite, and the same happens for all the $U_k(x)$ functions. This function measures, for each pair of values $x$ and $k$, the bilateral tail from the value $x$ of the moment of order $k$ of the variable with respect to the mean. It is, therefore, a generalization of the concept of probability tail, which is obtained for $k=0$.

Now, if we denote the corresponding functions for the distributions $Y_1$ and $Y_2$ by $U_{k,1}(x)$ and $U_{k,2}(x)$, it is verified that:

$$U_k(x)=\gamma U_{k,1}(x)+(1-\gamma)U_{k,2}(x). \tag{13}$$

Then, to calculate the function $U_k(x)$, we just need to calculate the functions $U_{k,1}(x)$ and $U_{k,2}(x)$.

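Calculation of $U_k$ for the uniform distribution ($U_{k,2}$)

From the definition of $f_2(x)$ and $U_k(x)$, if $\mu>\delta$:

$$U_{k,2}(x)=2\int_{\mu+x}^{\mu+\delta}(t-\mu)^k\frac{1}{2\delta}\,dt=\left[\frac{(t-\mu)^{k+1}}{(k+1)\delta}\right]_{\mu+x}^{\mu+\delta}=\frac{\delta^{k+1}-x^{k+1}}{(k+1)\delta}, \tag{14}$$

then,

$$U_{k,2}(x)=\begin{cases}\dfrac{\delta^{k+1}-x^{k+1}}{(k+1)\delta}, & 0\le x\le\delta,\\[2ex] 0, & x>\delta.\end{cases} \tag{15}$$

Calculation of $U_k$ for the normal distribution ($U_{k,1}$)

From the definition of $f_1(x)$ and $U_k(x)$, we have:

$$U_{k,1}(x)=\frac{2}{\sigma}\int_{\mu+x}^{\infty}(t-\mu)^k\phi\left(\frac{t-\mu}{\sigma}\right)dt. \tag{16}$$

Let the variable $u$ be of the form $u=\frac{t-\mu}{\sigma}$, then:

$$U_{k,1}(x)=2\int_{x/\sigma}^{\infty}(u\sigma)^k\phi(u)\,du=\sigma^k\,\Upsilon_k\left(\frac{x}{\sigma}\right), \tag{17}$$

where $\Upsilon_k(x)=2\int_{x}^{\infty}u^k\phi(u)\,du$. $\Upsilon_k$ is the $U_k$ function calculated for a $N(0,1)$ distribution, which is evaluated next for $k=1,2,3$.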

Proposition I

The following equations are verified:

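$$\Upsilon_1(x)=2\int_{x}^{\infty}u\,\phi(u)\,du=2\phi(x), \tag{18}$$
$$\Upsilon_2(x)=2\int_{x}^{\infty}u^2\phi(u)\,du=2\left(1-\Phi(x)+x\phi(x)\right), \tag{19}$$
$$\Upsilon_3(x)=2\int_{x}^{\infty}u^3\phi(u)\,du=2\left(2+x^2\right)\phi(x), \tag{20}$$

where $\Phi$ is the cumulative distribution function (CDF) of the $N(0,1)$ distribution. The proof is given below.

The three equations can be obtained using integration by parts, but it is easier to differentiate the functions $\Upsilon_k(x)$ to check the result. From the definition of the functions, for each value of $k$, we have:

$$\Upsilon_k'(x)=\frac{\partial\Upsilon_k(x)}{\partial x}=-2x^k\phi(x). \tag{21}$$

Taking into account that $\frac{\partial\phi(x)}{\partial x}=-x\phi(x)$ and $\frac{\partial\Phi(x)}{\partial x}=\phi(x)$:

$$\frac{\partial\left(2\phi(x)\right)}{\partial x}=-2x\phi(x)=\Upsilon_1'(x), \tag{22}$$
$$\frac{\partial\left(2\left(1-\Phi(x)+x\phi(x)\right)\right)}{\partial x}=2\left(-\phi(x)+\phi(x)-x^2\phi(x)\right)=-2x^2\phi(x)=\Upsilon_2'(x), \tag{23}$$
$$\frac{\partial\left(2\left(2+x^2\right)\phi(x)\right)}{\partial x}=2\left(2x\phi(x)-\left(2+x^2\right)x\phi(x)\right)=-2x^3\phi(x)=\Upsilon_3'(x). \tag{24}$$

Therefore, the left and right sides of Eqs. (18)–(20) differ by, at most, a constant. To verify that they are the same, we check the value at $x=0$:

$$\Upsilon_1(0)=2\int_{0}^{\infty}u\,\phi(u)\,du=\sqrt{\frac{2}{\pi}}, \tag{25}$$
$$\Upsilon_2(0)=2\int_{0}^{\infty}u^2\phi(u)\,du=1, \tag{26}$$
$$\Upsilon_3(0)=2\int_{0}^{\infty}u^3\phi(u)\,du=2\sqrt{\frac{2}{\pi}}, \tag{27}$$

which match the right sides of Eqs. (18)–(20):

$$\Upsilon_1(0)=2\phi(0)=\sqrt{\frac{2}{\pi}}, \tag{28}$$
$$\Upsilon_2(0)=2\left(1-\Phi(0)\right)=1, \tag{29}$$
$$\Upsilon_3(0)=2(2)\phi(0)=2\sqrt{\frac{2}{\pi}}. \tag{30}$$

Substituting these results in Eq. (17) we have:

$$U_{1,1}(x)=\sigma\,\Upsilon_1\left(\frac{x}{\sigma}\right)=2\sigma\,\phi\left(\frac{x}{\sigma}\right), \tag{31}$$
$$U_{2,1}(x)=\sigma^2\,\Upsilon_2\left(\frac{x}{\sigma}\right)=2\sigma^2\left[1-\Phi\left(\frac{x}{\sigma}\right)+\frac{x}{\sigma}\,\phi\left(\frac{x}{\sigma}\right)\right], \tag{32}$$
$$U_{3,1}(x)=\sigma^3\,\Upsilon_3\left(\frac{x}{\sigma}\right)=2\sigma^3\left[2+\left(\frac{x}{\sigma}\right)^2\right]\phi\left(\frac{x}{\sigma}\right). \tag{33}$$

These functions are the basis for estimating the parameters of the distribution of the variable $X$, except in the case of $\mu$, as we will comment later. The estimates are made with the corresponding sample estimates of $U_k$, defined in the following section.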
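The closed forms in Eqs. (15) and (31)–(33) can be checked against a direct numerical evaluation of Eq. (16). The following Python sketch does so with a simple midpoint rule; the function names, the test point and the integration settings are assumptions made only for this check.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def U_normal(k, x, sigma):
    """U_{k,1}(x) from Eqs. (31)-(33), for k = 1, 2, 3."""
    z = x / sigma
    if k == 1:
        return 2.0 * sigma * phi(z)
    if k == 2:
        return 2.0 * sigma ** 2 * (1.0 - Phi(z) + z * phi(z))
    if k == 3:
        return 2.0 * sigma ** 3 * (2.0 + z * z) * phi(z)
    raise ValueError("closed form only given for k = 1, 2, 3")


def U_uniform(k, x, delta):
    """U_{k,2}(x) from Eq. (15)."""
    if x > delta:
        return 0.0
    return (delta ** (k + 1) - x ** (k + 1)) / ((k + 1) * delta)


def U_mixture(k, x, sigma, delta, gamma):
    """U_k(x) of the mixed distribution, Eq. (13)."""
    return gamma * U_normal(k, x, sigma) + (1.0 - gamma) * U_uniform(k, x, delta)


def U_normal_numeric(k, x, sigma, steps=20000, width=12.0):
    """Midpoint-rule approximation of Eq. (16), written in terms of d = t - mu."""
    upper = width * sigma
    h = (upper - x) / steps
    total = 0.0
    for i in range(steps):
        d = x + (i + 0.5) * h
        total += d ** k * phi(d / sigma) / sigma
    return 2.0 * total * h


for k in (1, 2, 3):
    print(k, U_normal(k, 0.7, 1.5), U_normal_numeric(k, 0.7, 1.5))
```

For each $k$, the two printed values should agree to several decimal places.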

Sample estimates of Uk

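For each value of $k$ and $x\ge 0$, the sample estimator of $U_k$ obtained by the method of moments is:

$$u_k(x)=\frac{1}{n}\sum_{|x_i-\mu|\ge x}|x_i-\mu|^k, \tag{34}$$

which has the properties described in the following propositions.

Proposition II

The estimator $u_k(x)$ is an unbiased estimator of $U_k(x)$. For the proof, we first rewrite $u_k$ in the form:

$$u_k(x)=\frac{1}{n}\sum_{i=1}^{n}|x_i-\mu|^k\,I\{|x_i-\mu|\ge x\}, \tag{35}$$

where $I$ is the indicator function. Considering the previous expression, we check the condition of an unbiased estimator:

$$\begin{aligned}E(u_k(x))&=\frac{1}{n}\sum_{i=1}^{n}E\left(|x_i-\mu|^k\,I\{|x_i-\mu|\ge x\}\right)=E\left(|t-\mu|^k\,I\{|t-\mu|\ge x\}\right)\\&=\int_{|t-\mu|\ge x}|t-\mu|^k g(t)\,dt=U_k(x).\end{aligned} \tag{36}$$

Proposition III

The estimator $u_k(x)$ is a consistent estimator of $U_k(x)$. Considering again Eq. (35), for the variance of $u_k(x)$ we have:

$$\begin{aligned}V(u_k(x))&=\frac{1}{n^2}\sum_{i=1}^{n}V\left(|x_i-\mu|^k\,I\{|x_i-\mu|\ge x\}\right)=\frac{1}{n}V\left(|t-\mu|^k\,I\{|t-\mu|\ge x\}\right)\\&=\frac{1}{n}\left[E\left(|t-\mu|^{2k}\,I\{|t-\mu|\ge x\}\right)-E^2\left(|t-\mu|^k\,I\{|t-\mu|\ge x\}\right)\right]\\&=\frac{1}{n}\left(U_{2k}(x)-U_k^2(x)\right)\xrightarrow[n\to\infty]{}0,\end{aligned} \tag{37}$$

taking into account that $I^2\{\cdot\}=I\{\cdot\}$.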
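The estimator of Eqs. (34) and (35) is straightforward to compute. The following Python sketch is a direct transcription; the function name is hypothetical, and replacing an unknown $\mu$ by the sample mean is an assumption of the example (it corresponds to the estimator of $\mu$ used later in Eq. (38)).

```python
def u_k(sample, k, x=0.0, mu=None):
    """Sample estimate u_k(x) of Eq. (34).

    If mu is not supplied, the sample mean is used in its place
    (an assumption of this sketch).
    """
    n = len(sample)
    if mu is None:
        mu = sum(sample) / n
    # Only observations at distance >= x from mu contribute, but the
    # normalisation is always by the full sample size n.
    return sum(abs(v - mu) ** k for v in sample if abs(v - mu) >= x) / n


heights = [1.0, 2.0, 3.0, 6.0]          # toy data, mean 3.0
print(u_k(heights, 1), u_k(heights, 2), u_k(heights, 1, x=2.0))
```

For the toy data, the absolute deviations from the mean are 2, 1, 0 and 3, so the three printed values are 1.5, 3.5 and 1.25.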

Parameter estimation of the mixed distribution of X

The estimates are based on the $u_k(0)$ values, for $k=1,2,3$, which estimate the corresponding population parameters.

Estimation of $\mu$

Given that the mean value of both distributions (uniform and normal) is the same, this value is not affected by the mixture. Therefore, the natural estimator is

$$\hat{\mu}=\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i. \tag{38}$$

Estimation of $\sigma$, $\delta$, and $\gamma$ parameters

Applying the method of moments, we have the following three-equation system:

$$U_k(0)=u_k(0),\quad k=1,2,3. \tag{39}$$

The reason for choosing the origin is that it has the maximum amount of information about the $u_k(x)$ functions defined in Eq. (34). If a nonzero $x$ value is chosen, the estimate will discard all observations in the interval $(\mu-x,\mu+x)$. Substituting Eqs. (15), (31), (32) and (33) in Eq. (13), the resulting equation system is:

$$\gamma U_{1,1}(0)+(1-\gamma)U_{1,2}(0)=\gamma\sigma\sqrt{\frac{2}{\pi}}+(1-\gamma)\frac{\delta}{2}=u_1(0), \tag{40}$$
$$\gamma U_{2,1}(0)+(1-\gamma)U_{2,2}(0)=\gamma\sigma^2+(1-\gamma)\frac{\delta^2}{3}=u_2(0), \tag{41}$$
$$\gamma U_{3,1}(0)+(1-\gamma)U_{3,2}(0)=\gamma\sigma^3\,2\sqrt{\frac{2}{\pi}}+(1-\gamma)\frac{\delta^3}{4}=u_3(0), \tag{42}$$

where the solution must satisfy $\hat{\sigma},\hat{\delta}>0$ and $\hat{\gamma}\in[0,1]$.
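The text does not state how the nonlinear system in Eqs. (40)–(42) is solved numerically. One possible approach, sketched below in Python as an assumption and not as the procedure used in this work, is to fix $\gamma$, solve Eqs. (40) and (41) for $\sigma$ and $\delta$ in closed form (Eq. (40) gives $\delta$ as a linear function of $\sigma$, which turns Eq. (41) into a quadratic in $\sigma$), and then search for the value of $\gamma$ that also satisfies Eq. (42) by scanning a grid and bisecting. All function names are hypothetical, and the sketch returns only the first root found on the branch defined by the larger root of the quadratic.

```python
import math

A = math.sqrt(2.0 / math.pi)  # Upsilon_1(0), see Eq. (25)


def _sigma_delta(gamma, u1, u2):
    """Solve Eqs. (40)-(41) for (sigma, delta) at fixed gamma; None if infeasible."""
    w = 1.0 - gamma
    # Eq. (40): delta = 2 (u1 - gamma A sigma) / w. Substituting in Eq. (41)
    # gives a quadratic a*s^2 + b*s + c = 0 in sigma.
    a = gamma + 4.0 * (gamma * A) ** 2 / (3.0 * w)
    b = -8.0 * u1 * gamma * A / (3.0 * w)
    c = 4.0 * u1 ** 2 / (3.0 * w) - u2
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    sigma = (-b + math.sqrt(disc)) / (2.0 * a)  # larger root (a choice of this sketch)
    delta = 2.0 * (u1 - gamma * A * sigma) / w
    if sigma <= 0.0 or delta <= 0.0:
        return None
    return sigma, delta


def _eq42_residual(gamma, u1, u2, u3):
    """Left side minus right side of Eq. (42) at fixed gamma; None if infeasible."""
    sd = _sigma_delta(gamma, u1, u2)
    if sd is None:
        return None
    sigma, delta = sd
    return gamma * 2.0 * A * sigma ** 3 + (1.0 - gamma) * delta ** 3 / 4.0 - u3


def solve_moments(u1, u2, u3, grid=200, iters=80):
    """Return (sigma, delta, gamma) satisfying Eqs. (40)-(42), or None."""
    prev_g, prev_r = None, None
    for i in range(1, grid):
        g = i / grid
        r = _eq42_residual(g, u1, u2, u3)
        if r is not None and prev_r is not None and prev_r * r <= 0.0:
            lo, hi, r_lo = prev_g, g, prev_r      # sign change: bisect on gamma
            for _ in range(iters):
                mid = 0.5 * (lo + hi)
                r_mid = _eq42_residual(mid, u1, u2, u3)
                if r_mid is None:
                    return None
                if r_lo * r_mid <= 0.0:
                    hi = mid
                else:
                    lo, r_lo = mid, r_mid
            sigma, delta = _sigma_delta(lo, u1, u2)
            return sigma, delta, lo
        prev_g, prev_r = g, r
    return None


# Moments generated from known (illustrative) parameters, then recovered.
s0, d0, g0 = 2.0, 1.0, 0.3
u1 = g0 * A * s0 + (1 - g0) * d0 / 2
u2 = g0 * s0 ** 2 + (1 - g0) * d0 ** 2 / 3
u3 = g0 * 2 * A * s0 ** 3 + (1 - g0) * d0 ** 3 / 4
print(solve_moments(u1, u2, u3))
```

In practice, `u1`, `u2` and `u3` would be the sample values $u_k(0)$ of Eq. (34). With noisy sample moments the system may have no admissible solution, in which case this sketch returns `None`; a general-purpose nonlinear solver would be an equally reasonable choice.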

Adjustment to the mixed distribution

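To check whether the obtained estimates are valid, we can test whether the set of observations $\{x_1,\ldots,x_n\}$ fits the pdf of the final distribution:

$$\hat{f}(x)=\hat{\gamma}\hat{f}_1(x)+(1-\hat{\gamma})\hat{f}_2(x),\quad x\in\mathbb{R}, \tag{43}$$

where:

$$\hat{f}_1(x)=\frac{1}{\hat{\sigma}}\phi\left(\frac{x-\hat{\mu}}{\hat{\sigma}}\right),\quad x\in\mathbb{R}, \tag{44}$$

and:

$$\hat{f}_2(x)=\frac{1}{2\hat{\delta}},\quad x\in(\hat{\mu}-\hat{\delta},\hat{\mu}+\hat{\delta}). \tag{45}$$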

For this purpose, a test that can be used is the Kolmogorov-Smirnov test. The one-sample Kolmogorov-Smirnov test42 is commonly used to examine whether samples come from a specific distribution function by comparing the observed cumulative distribution function with an assumed theoretical distribution. The Kolmogorov-Smirnov statistic $Z$ is computed from the largest difference (in absolute value) between the observed and theoretical cumulative distributions. In this way, $Z$ is the greatest vertical distance between the empirical distribution function $S(x)$ and the specified hypothesized distribution function $F^{*}(x)$, which can be calculated as:

$$Z=\max_{x}\left|F^{*}(x)-S(x)\right|, \tag{46}$$

where the null hypothesis is $H_0: F(x)=F^{*}(x)$ for all $-\infty<x<\infty$, and the alternative hypothesis is $H_1: F(x)\neq F^{*}(x)$ for at least one value of $x$, $F(x)$ being the true distribution. If $Z$ exceeds the $1-\alpha$ quantile value ($Q(1-\alpha)$), then we reject $H_0$ at the significance level $\alpha$. When the number of observations $n$ is large, the $Q(1-\alpha)$ value can be approximated as43:

$$Q(1-\alpha)=\sqrt{\frac{-0.5\,\log\left(\frac{\alpha}{2}\right)}{n}}. \tag{47}$$
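Eqs. (46) and (47) can be sketched in Python as follows. The function names are hypothetical, and the statistic is computed here in its classical form, at every sample point of the empirical distribution function, which is an assumption of this example and may differ from the binned evaluation described in the experimental design.

```python
import math


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov statistic Z of Eq. (46) for a hypothesized cdf."""
    xs = sorted(sample)
    n = len(xs)
    z = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical distribution function jumps from (i-1)/n to i/n at x,
        # so both sides of the jump are compared with the hypothesized cdf.
        z = max(z, i / n - f, f - (i - 1) / n)
    return z


def ks_critical_value(alpha, n):
    """Large-sample approximation Q(1 - alpha) of Eq. (47)."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0) / n)


# Toy check against a U(0, 1) hypothesis, whose cdf is F(x) = x on [0, 1].
z = ks_statistic([0.1, 0.4, 0.7], lambda x: x)
print(z, ks_critical_value(0.05, 3), ks_critical_value(0.05, 7303))
```

For $\alpha=0.05$ and $n=7303$, Eq. (47) gives a critical value of about 0.016.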

Using the theoretical mixed distribution to fix the threshold of the POT approaches

In this paper, once the mixed distribution is estimated, we use it to set the threshold for estimating the POT distributions. We assume that using the points which are situated over a percentile of the theoretical mixed distribution is more reliable than using a threshold value predefined by a trial-and-error procedure. Identifying extreme values when studying a phenomenon is supported by the determination of a limit value or a probability threshold. Since what counts as extreme is determined by an unusual deviation from the central values of the distribution of the phenomenon under investigation, we understand that the probabilistic approach is preferable. In our work, we consider the 95%, 97.5% and 99% percentiles as possible thresholds.
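A threshold defined as a percentile of the mixed distribution can be obtained by inverting its distribution function. The Python sketch below does this by bisection; the function names, the bracketing interval and the parameter values are assumptions of the example, not values estimated in this work.

```python
import math


def mixture_cdf(x, mu, sigma, delta, gamma):
    """Distribution function of the normal-uniform mixture of Eq. (9)."""
    c1 = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    c2 = min(1.0, max(0.0, (x - (mu - delta)) / (2.0 * delta)))
    return gamma * c1 + (1.0 - gamma) * c2


def mixture_quantile(q, mu, sigma, delta, gamma, tol=1e-10):
    """Value u with mixture_cdf(u) = q, found by bisection."""
    half_width = 12.0 * sigma + delta        # wide bracket around mu (assumed)
    lo, hi = mu - half_width, mu + half_width
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, mu, sigma, delta, gamma) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exceedances(series, threshold):
    """Observations strictly above the threshold."""
    return [x for x in series if x > threshold]


# Illustrative parameters only.
params = dict(mu=2.7, sigma=1.5, delta=1.2, gamma=0.2)
for q in (0.95, 0.975, 0.99):
    print(q, mixture_quantile(q, **params))
```

The list returned by `exceedances` is the sample to which the threshold exceedance distributions would then be fitted.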

In this way, a new sample of independent random variables is defined by $Z=(z_1,z_2,\ldots,z_M)$, where $Z$ collects the values of $X$ that exceed the threshold ($X>u$), $u$ being the threshold and $M$ being the number of exceedances. In this work, three distributions are fitted for the threshold exceedance distribution:

  • The first one is the GPD44, whose cumulative function is defined in Eq. (4).

  • The second distribution is the Gamma distribution, with the following cumulative function:
    $$F(z;\xi,\sigma)=\frac{\gamma\left(\xi,\frac{z}{\sigma}\right)}{\Gamma(\xi)}, \tag{48}$$
    where $\gamma$ is the lower incomplete gamma function, and $\Gamma$ is the Gamma function.
  • Finally, the Weibull distribution is also considered:
    $$F(z;\xi,\sigma)=1-\exp\left[-\left(\frac{z}{\sigma}\right)^{\xi}\right]. \tag{49}$$

These three distributions are fitted to the exceedances using the Maximum Likelihood Estimator (MLE)13. After that, we select the best fit based on two objective criteria: BIC17 and AIC18. On the one hand, BIC minimizes the bias between the fitted model and the unknown true model:

$$\mathrm{BIC}=-2\ln L+k_p\ln M, \tag{50}$$

where $L$ is the likelihood of the fit, $M$ is the sample size (in our case, the number of exceedances) and $k_p$ is the number of parameters of the distribution. On the other hand, AIC gives the model providing the best compromise between bias and variance:

$$\mathrm{AIC}=-2\ln L+2k_p. \tag{51}$$

Both criteria need to be minimized.
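Eqs. (50) and (51) translate directly into code. In the Python sketch below, the log-likelihood values and the number of exceedances are made-up numbers used only to show how the best-fitting candidate would be selected; they are not results from this work.

```python
import math


def bic(log_likelihood, n_params, n_exceedances):
    """Bayesian Information Criterion, Eq. (50)."""
    return -2.0 * log_likelihood + n_params * math.log(n_exceedances)


def aic(log_likelihood, n_params):
    """Akaike Information Criterion, Eq. (51)."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Hypothetical maximised log-likelihoods (ln L) and parameter counts.
M = 365
candidates = {
    "GPD": (-410.2, 2),
    "Gamma": (-415.8, 2),
    "Weibull": (-413.1, 2),
}
scores = {name: (bic(ll, kp, M), aic(ll, kp)) for name, (ll, kp) in candidates.items()}
best_by_bic = min(scores, key=lambda name: scores[name][0])
best_by_aic = min(scores, key=lambda name: scores[name][1])
print(scores, best_by_bic, best_by_aic)
```

When all candidates have the same number of parameters, as in this toy example, both criteria rank the models by likelihood alone.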

Once the best-fitted distribution is obtained, the return value for a return period $T$ ($H_{sT}$) is calculated, and then the confidence intervals are computed. As can be seen in the experimental section, the GPD is the best distribution in all cases. The quantile for the GPD is:

$$H_{sT}=\mu+\frac{\sigma}{\xi}\left[1-(\lambda T)^{-\xi}\right], \tag{52}$$

where $\lambda$ is the mean number of exceedances per year.
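Eq. (52) can be evaluated for the return periods considered in this work as in the following Python sketch. The function name and the parameter values are hypothetical. The formula is transcribed as written; note that, written this way, the shape parameter enters with the opposite sign to Eq. (6), so a fitted value should be checked against the convention of the fitting routine before it is plugged in.

```python
def gpd_return_value(T, mu, sigma, xi, lam):
    """Return value H_sT of Eq. (52) for a return period of T years.

    mu, sigma, xi are the location, scale and shape of the fitted GPD and
    lam is the mean number of exceedances per year (xi must be nonzero).
    """
    return mu + (sigma / xi) * (1.0 - (lam * T) ** (-xi))


# Illustrative parameters only (assumed, not fitted values from the paper).
for T in (1, 2, 5, 10, 20, 50, 100):
    print(T, gpd_return_value(T, mu=5.0, sigma=1.0, xi=0.1, lam=4.0))
```

With these illustrative values the return value increases with $T$ and stays below $\mu+\sigma/\xi$.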

Finally, confidence intervals are also computed. For that, many authors use the classical asymptotic method14. However, Mathiesen et al.13 advocate the use of Monte-Carlo (MC) simulation techniques. Also, Mackay and Johanning26 proposed a storm-based MC method for calculating return periods of individual wave and crest heights. In the MC method, a random realisation of the maximum wave height in each sea state is simulated from the metocean parameter time series, and the GPD is fitted to storm peak wave heights exceeding some threshold. Mackay and Johanning26 showed that using $n=1000$ is sufficient to obtain a stable estimation, although in our case we have considered $n=100000$, following ref. 16. In that work, as in ours, the authors used the MC simulation method and, after 100000 iterations, the 90% confidence interval is obtained using the percentiles $[H_{sT,5\%};H_{sT,95\%}]$ of the 100000 $H_{sT}$ values obtained with the procedure.

Dataset and experimental design

Dataset

As stated before, the objective of this work is to model wave height time series where extreme values are present. For this reason, we evaluate the performance of the proposed methodology in several real-world wave height time series from different locations:

  • Gulf of Alaska: two wave height time series collected from the National Data Buoy Center of the USA45 in the Gulf of Alaska have been used. The buoys have the registration numbers 46001 and 46075. For the two buoys, one value every six hours is considered. The buoy 46001 is an offshore buoy placed in the coordinates 56.23N 147.95W, and data from 1st January 2008 to 31st December 2013 is considered, with a total of 8767 observations. On the other hand, 46075 is an offshore buoy whose coordinates are 53.98N 160.82W and data from 1st January 2011 to 31st December 2015 are collected in this buoy (7303 observations).

  • Puerto Rico: a total of six offshore buoys from Puerto Rico have been selected in our experiments to evaluate the proposed methodology. These buoys also belong to the NDBC of the USA, with registration ids 41043, 41044, 41046, 41047, 41048 and 41049. One value every six hours is considered, and data from 1st January 2011 to 31st December 2015 are used (7303 observations for each one). The geographical coordinates for each buoy are 21.13N 64.86W, 21.58N 58.63W, 23.83N 68.42W, 27.52N 71.53W, 31.86N 69.59W, and 27.54N 62.95W, respectively.

  • Spain: this dataset comes from the SIMAR-44 hindcast database provided by Puertos del Estado (Spain). The point is placed in the Strait of Gibraltar, at coordinates 36N 6W. One value every three hours is considered in this dataset from 1st January 1959 to 31st December 2000, forming a set of 122278 observations. Note that it is the largest time series in our experiments. Given that the time series covers 42 years, we can estimate long return periods of wave height.

A summary of the information for each time series can be seen in Table 1, which includes the type of buoy, the location, the geographical coordinates, the number of observations, the mean value of the time series (Hs), and the maximum value of each one. The locations are shown on the map in Fig. 1, while the time series are represented in Fig. 2.

Table 1. Characteristics of the time series recorded for every buoy.

| Id | Type | Location | Coordinates | # Observations | Average Hs (m) | Max Hs (m) |
|---|---|---|---|---|---|---|
| 46001 | Offshore | Alaska | 56.23N 147.95W | 8767 | 2.65 | 10.17 |
| 46075 | Offshore | Alaska | 53.98N 160.82W | 7303 | 2.72 | 13.39 |
| 41043 | Offshore | Puerto Rico | 21.13N 64.86W | 7303 | 1.76 | 6.12 |
| 41044 | Offshore | Puerto Rico | 21.58N 58.63W | 7303 | 1.84 | 8.98 |
| 41046 | Offshore | Puerto Rico | 23.83N 68.42W | 7303 | 1.71 | 7.85 |
| 41047 | Offshore | Puerto Rico | 27.52N 71.53W | 7303 | 1.63 | 8.51 |
| 41048 | Offshore | Puerto Rico | 31.86N 69.59W | 7303 | 1.85 | 12.07 |
| 41049 | Offshore | Puerto Rico | 27.54N 62.95W | 7303 | 1.78 | 10.96 |
| SIMAR-44 | Coastal | Spain | 36.00N 6.00W | 122278 | 1.09 | 8.60 |

Figure 1. Locations of the different buoys considered for the experimentation.

Figure 2. Graphical representation of the time series recorded for every buoy.

Experimental design

The experimental design for the time series under study is presented in this subsection. We divide the experiments into three stages:

  • Firstly, a Kolmogorov-Smirnov test is applied to determine whether the wave height distributions follow a normal distribution. That is, their distributions fit a simple Gaussian. The reason behind applying this test is that, if the wave height distributions follow a normal distribution, using the proposed methodology will not make sense. If this is not the case, we will proceed with the following points.

  • Secondly, the methodology is tested on the raw time series presented in the previous subsection. The algorithm estimates the parameters of the mixed distribution (μ,σ,δ,γ) for each wave height time series, and then, the Kolmogorov-Smirnov test is applied to check if the estimated distribution corresponds to the empirical distribution of the data. It is important to mention that the Kolmogorov-Smirnov test is applied considering n=50, which is an acceptable value for the Eq. (47), that is, we calculate the CDF of the estimated theoretical function and the empirical one in 50 intervals. Graphically, in this paper, we show the comparison between the theoretical distribution (estimated) and the empirical one (Fig. 3).

  • Finally, as we stated in previous sections, we use the theoretical mixed distribution to establish the threshold. In this sense, we delete the values below the threshold, and we fit the GPD, Gamma and Weibull distributions with the remaining values (those which are higher than the threshold). Based on two objective criteria, BIC and AIC, we select the best-fitted distribution and, finally, the return values of this distribution for the following return periods in years T=(1,2,5,10,20,50,100) are calculated.
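The second stage can be illustrated with a minimal Python sketch. It compares an empirical CDF with a candidate theoretical CDF at 50 evenly spaced points and contrasts the largest gap with a critical value. Several things here are assumptions rather than the authors' implementation: the function names, the evenly spaced grid, the significance level α = 0.05, and the use of the asymptotic Kolmogorov approximation for the critical value. A Gaussian sample stands in for real wave heights, and any callable CDF (for example, the fitted mixed distribution) could be passed in its place.

```python
import bisect
import math
import random
from statistics import NormalDist


def ks_critical_value(n, alpha=0.05):
    """Asymptotic Kolmogorov-Smirnov critical value for n comparison points."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)


def ks_statistic_on_grid(sample, cdf, n_points=50):
    """Largest |empirical CDF - theoretical CDF| over an evenly spaced grid."""
    data = sorted(sample)
    lo, hi = data[0], data[-1]
    step = (hi - lo) / (n_points - 1)
    z = 0.0
    for i in range(n_points):
        x = lo + i * step
        # Fraction of observations that are <= x.
        empirical = bisect.bisect_right(data, x) / len(data)
        z = max(z, abs(empirical - cdf(x)))
    return z


# Stand-in data: a seeded Gaussian sample, NOT real wave heights.
rng = random.Random(0)
sample = [rng.gauss(2.0, 1.0) for _ in range(2000)]

z_good = ks_statistic_on_grid(sample, NormalDist(2.0, 1.0).cdf)
z_bad = ks_statistic_on_grid(sample, NormalDist(5.0, 1.0).cdf)
q = ks_critical_value(50)

print(f"Q(1-alpha) = {q:.6f}")
print(f"matching model:   Z = {z_good:.4f}, rejected = {z_good >= q}")
print(f"mismatched model: Z = {z_bad:.4f}, rejected = {z_bad >= q}")
```

Under these assumptions, `ks_critical_value(50)` evaluates to roughly 0.19206, and the same approximation with n = 7303 gives roughly 0.016, the order of magnitude quoted for the normality test on the buoy series.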

Figure 3.

Estimated theoretical distribution versus empirical distribution in all wave height time series considered.

Results and discussion

As mentioned above, the first phase of the experimentation checks that the distributions of the wave height time series do not follow a normal distribution. The Kolmogorov-Smirnov test yields Z values between 0.6 and 0.8, while the critical values are around 0.016. Moreover, the p-value is 0 in all cases and, therefore, lower than any α value. Thus, for all time series, the null hypothesis is rejected, and it can be stated that the wave height distributions do not fit a simple Gaussian. We therefore proceed to the second stage of the experimentation.

For the mixed distribution proposed in this paper, the parameter estimates and the Kolmogorov-Smirnov test results are shown in Table 2. As can be seen, the estimate of μ is the same as the mean value of the time series (see Table 1), because we have used the sample mean as the estimator (see section “Proposed methodology”). The estimate of σ is high relative to the mean, which makes sense given that the estimation is made with approximately 7000 points, so the variance needs to be high. δ takes values in the interval (0.74,1.80) because there are wave height data that, although not very small, contaminate the normal distribution (in intervals of three months, the parameter value is lower). γ, which is the probability that an observation comes from the normal distribution, is very low. Again, this makes sense because of the large amount of data that are not extreme values and represent regular waves (uniform distribution). The Kolmogorov-Smirnov test does not reject the null hypothesis in any case, since Z<Q(1-α), confirming that the estimated mixed distribution agrees with the empirical distribution. For this reason, we can accept the approach proposed in this paper as a good method to estimate the theoretical distribution of wave height time series. Note that the Z values are lower in those time series whose mean value is higher, so the time series collected from buoys 46001 and 46075 are better fitted by this distribution, while the Spanish time series shows a worse fit. The results of the Kolmogorov-Smirnov test can be complemented by the graphical comparison of the empirical and theoretical distributions in Fig. 3, which shows how the estimated theoretical distributions adapt to the empirical distributions in each dataset.

Table 2.

Parameter estimation and Kolmogorov-Smirnov test results.

| Id | μ̂ | σ̂ | δ̂ | γ̂ | Z | Q(1-α) |
|---|---|---|---|---|---|---|
| 46001 | 2.652597 | 2.082763 | 1.708683 | 0.296738 | 0.081194 | 0.192065 |
| 46075 | 2.724890 | 2.522156 | 1.799095 | 0.189406 | 0.080575 | 0.192065 |
| 41043 | 1.762838 | 0.956801 | 0.743943 | 0.224906 | 0.086916 | 0.192065 |
| 41044 | 1.836434 | 1.449356 | 0.795858 | 0.077810 | 0.107365 | 0.192065 |
| 41046 | 1.705895 | 1.236797 | 0.793447 | 0.170138 | 0.099714 | 0.192065 |
| 41047 | 1.633332 | 1.853012 | 0.893645 | 0.113544 | 0.110250 | 0.192065 |
| 41048 | 1.849044 | 2.435167 | 1.158171 | 0.109262 | 0.119285 | 0.192065 |
| 41049 | 1.777286 | 2.023050 | 0.998251 | 0.091232 | 0.132657 | 0.192065 |
| SIMAR-44 | 1.093372 | 1.580551 | 0.748225 | 0.125561 | 0.142356 | 0.192065 |

For the third experiment, Table 3 shows the values of the BIC and AIC criteria when the GPD, Gamma and Weibull distributions are fitted to the values over the threshold determined by the 95%, 97.5% and 99% percentiles of the theoretical mixed distribution. The number of POTs (M) and the number of peaks per year (λ) are also included. As can be seen, the higher the percentile, the fewer peaks per year, because the number of POTs is much lower. The results confirm that the best-fitted distribution for all datasets and all percentiles is the GPD.
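The thresholding step described above can be sketched in Python as follows. The sketch takes a fitted CDF, inverts it numerically at a chosen percentile to obtain the threshold, and then counts the exceedances M and the yearly rate λ = M / years. The function names, the bisection search, the made-up series and the normal distribution used as a stand-in for the fitted mixed distribution are all assumptions for illustration. Every observation above the threshold is counted as a peak, without any declustering, which is also an assumption.

```python
from statistics import NormalDist


def quantile_by_bisection(cdf, p, lo, hi, tol=1e-9):
    """Find x in [lo, hi] with cdf(x) = p for a continuous, non-decreasing cdf."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def peaks_over_threshold(series, threshold, n_years):
    """Return the exceedances, their count M and the rate lambda = M / years."""
    peaks = [x for x in series if x > threshold]
    return peaks, len(peaks), len(peaks) / n_years


# Hypothetical fitted CDF: a normal stands in for the mixed distribution.
fitted = NormalDist(mu=2.0, sigma=1.0)
# Made-up wave heights in metres, assumed to cover two years.
series = [0.5, 1.2, 2.0, 2.8, 3.7, 4.1, 1.9, 3.9, 2.2, 4.4]

for p in (0.95, 0.975, 0.99):
    u = quantile_by_bisection(fitted.cdf, p, lo=-10.0, hi=20.0)
    peaks, m, lam = peaks_over_threshold(series, u, n_years=2)
    print(f"p={p}: threshold={u:.3f} m, M={m}, lambda={lam:.2f} per year")
```

The same ratio explains the λ column of Table 3: for buoy 46001, whose series covers six years, M = 806 gives λ = 806 / 6 ≈ 134.33.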

Table 3.

BIC and AIC criteria for the estimated distributions of the POT method.

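| Id | Distribution | M (95%) | λ (95%) | BIC (95%) | AIC (95%) | M (97.5%) | λ (97.5%) | BIC (97.5%) | AIC (97.5%) | M (99%) | λ (99%) | BIC (99%) | AIC (99%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46001 | GPD | 806 | 134.33 | 662.72 | 653.33 | 379 | 63.17 | 786.42 | 774.61 | 154 | 25.67 | 313.09 | 303.98 |
| | Gamma | | | 2193.33 | 2183.95 | | | 1002.74 | 994.87 | | | 389.46 | 383.38 |
| | Weibull | | | 2497.93 | 2488.54 | | | 1146.18 | 1138.30 | | | 441.27 | 435.20 |
| 46075 | GPD | 786 | 157.20 | 1894.56 | 1880.56 | 337 | 67.40 | 818.69 | 807.23 | 126 | 25.20 | 290.39 | 281.88 |
| | Gamma | | | 2381.82 | 2372.49 | | | 1025.61 | 1017.97 | | | 389.93 | 384.26 |
| | Weibull | | | 2719.86 | 2710.53 | | | 1188.91 | 1181.27 | | | 458.29 | 452.62 |
| 41043 | GPD | 784 | 156.80 | 302.62 | 288.63 | 298 | 59.60 | 79.40 | 68.20 | 94 | 18.80 | 49.64 | 41.98 |
| | Gamma | | | 820.51 | 811.18 | | | 375.38 | 367.92 | | | 158.16 | 153.06 |
| | Weibull | | | 1307.63 | 1298.30 | | | 574.64 | 567.17 | | | 207.79 | 202.69 |
| 41044 | GPD | 758 | 151.60 | 346.78 | 332.89 | 694 | 138.80 | 320.77 | 307.14 | 110 | 22.00 | 50.51 | 42.41 |
| | Gamma | | | 1018.04 | 1008.78 | | | 947.71 | 938.63 | | | 249.05 | 243.65 |
| | Weibull | | | 1638.67 | 1629.41 | | | 1521.01 | 1511.93 | | | 328.35 | 322.95 |
| 41046 | GPD | 669 | 167.25 | 606.02 | 592.50 | 280 | 70.00 | 238.24 | 227.33 | 92 | 23.00 | 62.41 | 54.84 |
| | Gamma | | | 1040.63 | 1031.62 | | | 449.81 | 442.54 | | | 173.41 | 168.36 |
| | Weibull | | | 1399.50 | 1390.49 | | | 628.37 | 621.10 | | | 235.26 | 230.21 |
| 41047 | GPD | 629 | 157.25 | 1064.67 | 1051.34 | 316 | 79.00 | 580.17 | 568.91 | 97 | 24.25 | 185.58 | 177.85 |
| | Gamma | | | 1503.31 | 1494.42 | | | 775.16 | 767.65 | | | 253.51 | 248.36 |
| | Weibull | | | 1749.18 | 1740.29 | | | 910.31 | 902.80 | | | 295.82 | 290.67 |
| 41048 | GPD | 806 | 161.20 | 1776.19 | 1762.11 | 412 | 82.40 | 971.75 | 959.69 | 120 | 24.00 | 301.70 | 293.34 |
| | Gamma | | | 2320.91 | 2311.53 | | | 1231.09 | 1223.04 | | | 368.35 | 362.77 |
| | Weibull | | | 2626.43 | 2617.05 | | | 1392.24 | 1384.20 | | | 421.23 | 415.66 |
| 41049 | GPD | 811 | 162.20 | 1227.71 | 1213.61 | 558 | 111.60 | 895.71 | 882.74 | 112 | 22.40 | 226.25 | 218.09 |
| | Gamma | | | 1870.43 | 1861.03 | | | 1337.14 | 1328.49 | | | 324.23 | 318.80 |
| | Weibull | | | 2277.40 | 2268.00 | | | 1624.41 | 1615.76 | | | 378.43 | 372.99 |
| SIMAR-44 | GPD | 13847 | 329.69 | 16998.99 | 16976.38 | 5768 | 137.33 | 8345.27 | 8325.29 | 1908 | 45.43 | 2867.27 | 2850.61 |
| | Gamma | | | 28375.75 | 28360.68 | | | 12646.92 | 12633.60 | | | 4089.08 | 4077.97 |
| | Weibull | | | 33396.55 | 33381.48 | | | 14701.35 | 14688.03 | | | 4842.63 | 4831.52 |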
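To make the selection criteria concrete, the following Python sketch computes AIC and BIC from a maximised log-likelihood using their usual definitions, AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of fitted parameters and n the number of exceedances. The function names and the log-likelihood values are hypothetical; the sketch does not fit any distribution and is not the authors' code.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for a model with k fitted parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood


def best_model(criteria):
    """Name of the model with the smallest criterion value."""
    return min(criteria, key=criteria.get)


# Hypothetical maximised log-likelihoods and parameter counts, n = 379 peaks.
n = 379
fits = {"GPD": (-390.0, 3), "Gamma": (-495.0, 2), "Weibull": (-567.0, 2)}

bic_values = {name: bic(ll, k, n) for name, (ll, k) in fits.items()}
aic_values = {name: aic(ll, k) for name, (ll, k) in fits.items()}

for name in fits:
    print(f"{name}: BIC = {bic_values[name]:.2f}, AIC = {aic_values[name]:.2f}")
print("best by BIC:", best_model(bic_values))
print("best by AIC:", best_model(aic_values))
```

With these definitions the gap BIC − AIC equals k(ln n − 2) and does not depend on the likelihood. For n = 379 it is about 11.81 with k = 3 and about 7.88 with k = 2, which agrees with the differences between the BIC and AIC columns for buoy 46001 at the 97.5% percentile in Table 3 (11.81 for the GPD, 7.87 for Gamma and 7.88 for Weibull).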

The BIC and AIC values are very highly correlated for the three percentiles (0.977, 0.998 and 1.000, respectively) across the three distributions and the nine time series. In Table 3, it can be seen that the number of annual peaks is more reasonable when considering the 97.5% and 99% percentiles. This is because the lower the threshold, the more waves from the uniform distribution, i.e. non-extreme waves, contaminate the distribution of extreme waves, and the larger the number of less relevant peaks. For instance, for buoy 46001, the BIC values of the GPD at the 97.5% and 99% percentiles are 786.42 and 313.09, respectively, which are 21.57% and 19.61% lower than the values for the Gamma distribution, and 31.39% and 29.05% lower than the values for the Weibull distribution. These results differ from those obtained by Mazas and Hamm16 for the SIMAR-44 time series, where the GPD gives poor results with respect to these criteria when compared to Gamma; however, it is important to mention that we use a 3-parameter GPD instead of a 2-parameter one.
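The percentages quoted for buoy 46001 can be reproduced with a few lines of Python. The BIC values are copied from Table 3; the function and variable names are illustrative only.

```python
def relative_reduction(value, reference):
    """Percentage by which `value` is lower than `reference`."""
    return 100.0 * (reference - value) / reference


# BIC values for buoy 46001 from Table 3: (97.5% percentile, 99% percentile).
bic_46001 = {
    "GPD": (786.42, 313.09),
    "Gamma": (1002.74, 389.46),
    "Weibull": (1146.18, 441.27),
}

for rival in ("Gamma", "Weibull"):
    for i, percentile in enumerate(("97.5%", "99%")):
        r = relative_reduction(bic_46001["GPD"][i], bic_46001[rival][i])
        print(f"GPD vs {rival} at {percentile}: {r:.2f}% lower BIC")
```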

Finally, the return values and the confidence intervals for each dataset considering the different thresholds are summarized in Table 4. We have considered return periods of T ∈ {1,2,5,10,20,50,100} years. If we compare the obtained return values and confidence intervals with those obtained by Mazas and Hamm16 for the SIMAR-44 time series, we can see that the results are not the same, owing to the differences in the thresholds and because they consider 44 years instead of 42, as the first and the last years are used although they are not complete. We agree with the authors of that work that choosing the right threshold is not always straightforward. For example, if we consider the 97.5% percentile of the theoretical mixed distribution, the return values and the confidence intervals are quite similar to the ones obtained by Mazas and Hamm (with the slight differences commented above). With respect to the values obtained for the rest of the buoys, to our knowledge, there are no other reference values. These estimates are approximate, given the reduced length of the time series (six years for buoy 46001 and five for the other buoys). If we compare them with the maximum values in Table 1, we can see that, for buoys 46075, 41043 and 41046, the confidence intervals for the 95% percentile tend to contain these values more frequently; for buoys 41047, 41048, 41049 and SIMAR-44, the confidence intervals are tighter; and, for buoys 46001 and 41044, no confidence interval contains them.

Table 4.

Return values and confidence intervals for the GPD distribution considering T=(1,2,5,10,20,50,100) and the percentiles 95%, 97.5%, and 99%.

| Id | T (years) | HsT (95%) | Confidence interval (95%) | HsT (97.5%) | Confidence interval (97.5%) | HsT (99%) | Confidence interval (99%) |
|---|---|---|---|---|---|---|---|
| 46001 | 100 | 23.50 | 18.25–32.75 | 20.65 | 15.17–32.21 | 28.71 | 18.17–62.30 |
| | 50 | 21.46 | 17.00–29.06 | 18.95 | 14.46–28.29 | 25.09 | 16.99–50.77 |
| | 20 | 18.97 | 15.47–24.49 | 16.87 | 13.31–23.42 | 21.01 | 15.12–37.18 |
| | 10 | 17.22 | 14.34–21.59 | 15.40 | 12.56–20.75 | 18.38 | 13.84–29.61 |
| | 5 | 15.60 | 13.18–19.17 | 14.03 | 11.70–18.04 | 16.09 | 12.60–24.10 |
| | 2 | 13.61 | 11.89–16.11 | 12.35 | 10.66–15.08 | 13.51 | 11.28–18.14 |
| | 1 | 12.22 | 10.88–14.15 | 11.18 | 9.93–13.15 | 11.84 | 10.32–14.79 |
| 46075 | 100 | 16.24 | 12.99–21.77 | 16.69 | 12.48–25.15 | 12.59 | 9.79–21.29 |
| | 50 | 15.39 | 12.49–19.95 | 15.78 | 12.11–22.95 | 12.22 | 9.67–19.40 |
| | 20 | 14.28 | 11.85–18.21 | 14.59 | 11.59–20.30 | 11.70 | 9.48–17.23 |
| | 10 | 13.44 | 11.34–16.68 | 13.70 | 11.15–18.38 | 11.27 | 9.40–15.90 |
| | 5 | 12.60 | 10.78–15.27 | 12.81 | 10.69–16.51 | 10.82 | 9.19–14.24 |
| | 2 | 11.49 | 10.06–13.58 | 11.64 | 10.00–14.26 | 10.18 | 8.94–12.58 |
| | 1 | 10.64 | 9.49–12.23 | 10.77 | 9.50–12.82 | 10.18 | 8.94–12.58 |
| 41043 | 100 | 6.47 | 5.38–8.34 | 4.68 | 4.04–5.93 | 4.58 | 3.99–6.48 |
| | 50 | 6.20 | 5.26–7.81 | 4.61 | 4.02–5.72 | 4.54 | 3.97–6.23 |
| | 20 | 5.85 | 5.02–7.10 | 4.50 | 3.97–5.46 | 4.48 | 3.96–5.90 |
| | 10 | 5.57 | 4.84–6.63 | 4.41 | 3.94–5.23 | 4.42 | 3.94–5.59 |
| | 5 | 5.29 | 4.66–6.21 | 4.30 | 3.89–5.03 | 4.35 | 3.93–5.33 |
| | 2 | 4.93 | 4.43–5.62 | 4.15 | 3.81–4.69 | 4.23 | 3.88–4.96 |
| | 1 | 4.64 | 4.24–5.21 | 4.02 | 3.73–4.46 | 4.13 | 3.84–4.66 |
| 41044 | 100 | 5.10 | 4.42–6.19 | 5.06 | 4.40–6.15 | 4.03 | 3.78–4.65 |
| | 50 | 4.99 | 4.36–5.96 | 4.95 | 4.32–5.94 | 4.02 | 3.78–4.61 |
| | 20 | 4.83 | 4.28–5.66 | 4.80 | 4.26–5.67 | 4.01 | 3.78–4.54 |
| | 10 | 4.70 | 4.21–5.45 | 4.68 | 4.18–5.43 | 4.00 | 3.78–4.49 |
| | 5 | 4.56 | 4.12–5.19 | 4.55 | 4.11–5.21 | 3.98 | 3.77–4.42 |
| | 2 | 4.36 | 4.00–4.87 | 4.35 | 3.99–4.89 | 3.95 | 3.76–4.30 |
| | 1 | 4.20 | 3.89–4.62 | 4.19 | 3.88–4.62 | 3.91 | 3.75–4.21 |
| 41046 | 100 | 7.53 | 6.01–10.21 | 6.50 | 5.13–9.55 | 4.87 | 4.26–6.83 |
| | 50 | 7.20 | 5.86–9.49 | 6.29 | 5.07–8.94 | 4.83 | 4.25–6.60 |
| | 20 | 6.75 | 5.62–8.63 | 6.00 | 4.96–8.11 | 4.77 | 4.24–6.22 |
| | 10 | 6.41 | 5.43–7.99 | 5.77 | 4.83–7.49 | 4.72 | 4.22–5.98 |
| | 5 | 6.06 | 5.21–7.35 | 5.53 | 4.72–6.96 | 4.65 | 4.20–5.70 |
| | 2 | 5.60 | 4.92–6.59 | 5.19 | 4.55–6.27 | 4.55 | 4.17–5.31 |
| | 1 | 5.24 | 4.68–6.04 | 4.92 | 4.41–5.76 | 4.45 | 4.14–5.05 |
| 41047 | 100 | 7.83 | 6.25–10.50 | 10.37 | 7.58–16.37 | 9.35 | 6.55–19.55 |
| | 50 | 7.57 | 6.14–9.99 | 9.85 | 7.41–15.03 | 8.98 | 6.45–17.26 |
| | 20 | 7.19 | 5.95–9.22 | 9.15 | 7.06–13.36 | 8.47 | 6.34–14.93 |
| | 10 | 6.89 | 5.78–8.63 | 8.61 | 6.82–11.91 | 8.06 | 6.21–13.08 |
| | 5 | 6.58 | 5.58–8.03 | 8.06 | 6.54–10.81 | 7.63 | 6.09–11.60 |
| | 2 | 6.13 | 5.32–7.33 | 7.33 | 6.15–9.27 | 7.04 | 5.85–9.66 |
| | 1 | 5.78 | 5.09–6.76 | 6.77 | 5.81–8.32 | 6.57 | 5.65–8.38 |
| 41048 | 100 | 10.09 | 8.17–13.15 | 12.93 | 9.78–19.31 | 16.06 | 10.43–34.91 |
| | 50 | 9.73 | 8.01–12.51 | 12.28 | 9.42–17.50 | 14.98 | 10.20–30.03 |
| | 20 | 9.22 | 7.72–11.53 | 11.41 | 8.99–15.61 | 13.59 | 9.74–24.07 |
| | 10 | 8.81 | 7.47–10.83 | 10.75 | 8.69–14.30 | 12.56 | 9.37–20.67 |
| | 5 | 8.39 | 7.22–10.09 | 10.07 | 8.30–12.97 | 11.54 | 8.89–17.51 |
| | 2 | 7.79 | 6.81–9.15 | 9.15 | 7.73–11.32 | 10.24 | 8.35–14.23 |
| | 1 | 7.31 | 6.48–8.41 | 8.44 | 7.34–10.10 | 9.28 | 7.85–11.99 |
| 41049 | 100 | 6.69 | 5.64–8.32 | 7.14 | 5.92–9.35 | 7.63 | 5.98–12.71 |
| | 50 | 6.53 | 5.57–8.02 | 6.96 | 5.84–8.89 | 7.48 | 5.93–11.75 |
| | 20 | 6.30 | 5.45–7.59 | 6.70 | 5.69–8.31 | 7.25 | 5.88–10.73 |
| | 10 | 6.11 | 5.35–7.27 | 6.48 | 5.57–7.88 | 7.05 | 5.83–9.94 |
| | 5 | 5.91 | 5.22–6.87 | 6.25 | 5.46–7.49 | 6.82 | 5.75–9.14 |
| | 2 | 5.61 | 5.02–6.41 | 5.91 | 5.24–6.88 | 6.48 | 5.61–8.19 |
| | 1 | 5.36 | 4.85–6.03 | 5.63 | 5.06–6.44 | 6.18 | 5.49–7.43 |
| SIMAR-44 | 100 | 4.49 | 4.31–4.70 | 6.84 | 6.37–7.41 | 10.68 | 9.39–12.36 |
| | 50 | 4.43 | 4.25–4.63 | 6.64 | 6.20–7.16 | 10.03 | 8.96–11.51 |
| | 20 | 4.34 | 4.18–4.52 | 6.35 | 5.97–6.79 | 9.19 | 8.31–10.32 |
| | 10 | 4.26 | 4.11–4.42 | 6.12 | 5.78–6.51 | 8.56 | 7.84–9.50 |
| | 5 | 4.17 | 4.03–4.32 | 5.87 | 5.57–6.20 | 7.94 | 7.36–8.69 |
| | 2 | 4.02 | 3.90–4.16 | 5.51 | 5.26–5.78 | 7.14 | 6.70–7.69 |
| | 1 | 3.90 | 3.79–4.02 | 5.22 | 5.01–5.44 | 6.54 | 6.20–6.94 |
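As an illustration of how a T-year return value follows from a fitted GPD, the sketch below uses the textbook peaks-over-threshold expression x_T = u + (σ/ξ)[(λT)^ξ − 1], with u the threshold, σ the scale, ξ the shape and λ the number of peaks per year (and x_T = u + σ ln(λT) when ξ = 0). This parameterisation, the function names and the parameter values are assumptions; the paper's 3-parameter GPD and its confidence-interval procedure may differ, and no confidence intervals are computed here.

```python
import math


def gpd_survival(x, u, sigma, xi):
    """P(X > x | X > u) for a GPD with threshold u, scale sigma and shape xi."""
    y = (x - u) / sigma
    if abs(xi) < 1e-12:
        return math.exp(-y)
    return max(0.0, 1.0 + xi * y) ** (-1.0 / xi)


def gpd_return_value(T, u, sigma, xi, lam):
    """Level exceeded on average once every T years, with lam peaks per year."""
    m = lam * T  # expected number of peaks in T years
    if abs(xi) < 1e-12:
        return u + sigma * math.log(m)
    return u + (sigma / xi) * (m ** xi - 1.0)


# Hypothetical parameters, not fitted to any of the buoys in the paper.
u, sigma, xi, lam = 4.0, 1.2, -0.1, 25.0

return_values = []
for T in (1, 2, 5, 10, 20, 50, 100):
    x_T = gpd_return_value(T, u, sigma, xi, lam)
    return_values.append(x_T)
    # Sanity check: the expected number of exceedances of x_T in T years is 1.
    expected = lam * T * gpd_survival(x_T, u, sigma, xi)
    print(f"T = {T:3d} years: HsT = {x_T:.2f} m, expected exceedances = {expected:.3f}")
```

The loop prints one line per return period and confirms that each x_T is the level whose expected number of exceedances in T years equals one, which is the defining property of a return value.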

Conclusions

This paper proposes a novel methodology for wave height time series modelling based on the assumption that, given a time series where high waves are less common than low ones, its distribution can be modelled as a mixture of a normal distribution and a uniform distribution. The methodology is based on the method of moments, and we use it to establish the threshold for the Peak-Over-Threshold (POT) methodology. The automatic determination of this threshold is an important task, given that the alternative is a trial-and-error method which, as several authors agree, can be problematic and quite subjective. The whole approach is tested on nine real-world time series collected from the Gulf of Alaska (46001 and 46075), Puerto Rico (41043, 41044, 41046, 41047, 41048 and 41049), and Spain (SIMAR-44). For SIMAR-44, we compare our return values with those obtained by Mazas and Hamm. The return values obtained for the remaining buoys can be considered an initial approximation, given the reduced length of the time series.

The experimentation is divided into three stages. The first verifies that the time series do not follow a normal distribution and that it therefore makes sense to apply the proposed methodology. The second analyses the estimation of the distribution in the nine time series, showing that the estimated theoretical distribution fits the empirical one. These results are corroborated by a Kolmogorov-Smirnov test, where Z<Q(1-α) for all datasets. For the third, we use the 95%, 97.5% and 99% percentiles of the estimated theoretical distribution as possible thresholds for the POT distribution estimation. Results show that the best-fitted distribution for the POT is the Generalized Pareto Distribution in all cases, for which we report the return values and confidence intervals.

A future line of work could address the segmentation of the time series based on the percentiles of the obtained distribution and the subsequent prediction of the segments obtained. We also plan to extend this work using time series from different fields and more advanced forecasting methods, such as artificial neural networks. One line of work already underway is the elimination of uniform noise, after which the extraction of extreme values can be carried out on a normal distribution. Although the probability distributions of extreme values are independent of the starting distribution, we believe that knowledge about it would allow a better approximation.

Acknowledgements

This work was supported in part by the “Agencia Española de Investigación” under Grant PID2020-115454GB-C22, AEI/10.13039/501100011033, and in part by the “Consejería de Transformación Económica, Industria, Conocimiento y Universidades (Junta de Andalucía) y Programa Operativo FEDER 2014-2020” under Grant PY20 00074. We would like to thank Puertos del Estado (Spain) for providing the dataset from the SIMAR-44 hindcast database.

Author contributions

A.M.D.R. and P.A.G. processed the experimental data; M.C., P.A.G. and C.H.M. were involved in planning and supervised the work; A.M.D.R. performed the analysis, wrote the manuscript and designed the figures. All authors reviewed the manuscript.

Data availability

The datasets generated and/or analysed during the current study and the code generated in the experimental design are available at https://github.com/amduran/mixed_distributions.git, with the exception of SIMAR-44 which is available on request from Puertos del Estado.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Peng S, et al. Improving the real-time marine forecasting of the northern South China Sea by assimilation of glider-observed T/S profiles. Sci. Rep. 2019;9:1–9. doi: 10.1038/s41598-019-54241-8.
  • 2. Soares CG, Scotto M. Modelling uncertainty in long-term predictions of significant wave height. Ocean Eng. 2001;28:329–342. doi: 10.1016/S0029-8018(00)00011-1.
  • 3. Saetra, Ø. & Bidlot, J.-R. Assessment of the ECMWF Ensemble Prediction System for Waves and Marine Winds (European Centre for Medium-Range Weather Forecasts, 2002).
  • 4. Feng X, Tsimplis M, Yelland M, Quartly G. Changes in significant and maximum wave heights in the Norwegian Sea. Global Planet. Change. 2014;113:68–76. doi: 10.1016/j.gloplacha.2013.12.010.
  • 5. Esling P, Agon C. Time-series data mining. ACM Comput. Surv. (CSUR) 2012;45:12. doi: 10.1145/2379776.2379788.
  • 6. Fontes CH, Budman H. A hybrid clustering approach for multivariate time series-a case study applied to failure analysis in a gas turbine. ISA Trans. 2017;2017:5. doi: 10.1016/j.isatra.2017.09.004.
  • 7. Pérez-Ortiz M, et al. On the use of evolutionary time series analysis for segmenting paleoclimate data. Neurocomputing. 2017;2017:5.
  • 8. Kim J-S, Seo K-W, Chen J, Wilson C. Uncertainty in GRACE/GRACE-Follow On global ocean mass change estimates due to mis-modeled glacial isostatic adjustment and geocenter motion. Sci. Rep. 2022;12:1–7. doi: 10.1038/s41598-022-10628-8.
  • 9. Omranian N, Mueller-Roeber B, Nikoloski Z. Segmentation of biological multivariate time-series data. Sci. Rep. 2015;5:1–6. doi: 10.1038/srep08937.
  • 10. Bagnall A, Lines J, Hills J, Bostrom A. Time-series classification with COTE: The collective of transformation-based ensembles. IEEE Trans. Knowl. Data Eng. 2015;27:2522–2535. doi: 10.1109/TKDE.2015.2416723.
  • 11. Nikolaou A, et al. Detection of early warning signals in paleoclimate data using a genetic time series segmentation algorithm. Clim. Dyn. 2015;44:1919–1933. doi: 10.1007/s00382-014-2405-0.
  • 12. Zhao Y, et al. A novel bidirectional mechanism based on time series model for wind power forecasting. Appl. Energy. 2016;177:793–803. doi: 10.1016/j.apenergy.2016.03.096.
  • 13. Mathiesen M, et al. Recommended practice for extreme wave analysis. J. Hydraul. Res. 1994;32:803–814. doi: 10.1080/00221689409498691.
  • 14. Coles, S., Bawa, J., Trenner, L. & Dorazio, P. An Introduction to Statistical Modeling of Extreme Values, vol. 208 (Springer, 2001).
  • 15. Méndez FJ, Menéndez M, Luceño A, Losada IJ. Estimation of the long-term variability of extreme significant wave height using a time-dependent peak over threshold (POT) model. J. Geophys. Res.: Oceans. 2006;111:5.
  • 16. Mazas F, Hamm L. A multi-distribution approach to POT methods for determining extreme wave heights. Coast. Eng. 2011;58:385–394. doi: 10.1016/j.coastaleng.2010.12.003.
  • 17. Schwarz G, et al. Estimating the dimension of a model. Ann. Stat. 1978;6:461–464. doi: 10.1214/aos/1176344136.
  • 18. Akaike, H. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike 199–213 (Springer, 1998).
  • 19. Petrov V, Soares CG, Gotovac H. Prediction of extreme significant wave heights using maximum entropy. Coast. Eng. 2013;74:1–10. doi: 10.1016/j.coastaleng.2012.11.009.
  • 20. Durán-Rosal A, Fernández J, Gutiérrez P, Hervás-Martínez C. Detection and prediction of segments containing extreme significant wave heights. Ocean Eng. 2017;142:268–279. doi: 10.1016/j.oceaneng.2017.07.009.
  • 21. Dorado-Moreno M, et al. Robust estimation of wind power ramp events with reservoir computing. Renew. Energy. 2017;111:428–437. doi: 10.1016/j.renene.2017.04.016.
  • 22. Guijo-Rubio D, et al. Prediction of low-visibility events due to fog using ordinal classification. Atmos. Res. 2018;214:64–73. doi: 10.1016/j.atmosres.2018.07.017.
  • 23. Durán-Rosal A, et al. Efficient fog prediction with multi-objective evolutionary neural networks. Appl. Soft Comput. 2018;70:347–358. doi: 10.1016/j.asoc.2018.05.035.
  • 24. Bowman K, Shenton L. Estimation: Method of moments. Encycl. Stat. Sci. 2004;3:5.
  • 25. Jonathan P, Ewans K. Statistical modelling of extreme ocean environments for marine design: A review. Ocean Eng. 2013;62:91–109. doi: 10.1016/j.oceaneng.2013.01.004.
  • 26. Mackay E, Johanning L. Long-term distributions of individual wave and crest heights. Ocean Eng. 2018;165:164–183. doi: 10.1016/j.oceaneng.2018.07.047.
  • 27. DeLeo F, Besio G, Briganti R, Vanem E. Non-stationary extreme value analysis of sea states based on linear trends analysis of annual maxima series of significant wave height and peak period in the Mediterranean Sea. Coast. Eng. 2021;167:103896. doi: 10.1016/j.coastaleng.2021.103896.
  • 28. Davison AC, Smith RL. Models for exceedances over high thresholds. J. R. Stat. Soc. Ser. B (Methodol.) 1990;1990:393–442.
  • 29. Ferreira J, Soares CG. An application of the peaks over threshold method to predict extremes of significant wave height. J. Offshore Mech. Arct. Eng. 1998;120:165–176. doi: 10.1115/1.2829537.
  • 30. Caires S, Sterl A. 100-year return value estimates for ocean wind speed and significant wave height from the ERA-40 data. J. Clim. 2005;18:1032–1048. doi: 10.1175/JCLI-3312.1.
  • 31. Stefanakos CN, Athanassoulis GA. Extreme value predictions based on nonstationary time series of wave data. Environmetrics. 2006;17:25–46. doi: 10.1002/env.742.
  • 32. Jonathan P, Randell D, Wadsworth J, Tawn J. Uncertainties in return values from extreme value analysis of peaks over threshold using the generalised Pareto distribution. Ocean Eng. 2021;220:107725. doi: 10.1016/j.oceaneng.2020.107725.
  • 33. Panchang VG, Gupta RC. On the determination of three-parameter Weibull MLE’s. Commun. Stat.-Simul. Comput. 1989;18:1037–1057. doi: 10.1080/03610918908812805.
  • 34. Goda, Y. Random Seas and Design of Maritime Structures, vol. 33 (World Scientific Publishing Company, 2010).
  • 35. Clauset A, Shalizi CR, Newman ME. Power-law distributions in empirical data. SIAM Rev. 2009;51:661–703. doi: 10.1137/070710111.
  • 36. Wasserman, L. All of Statistics: A Concise Course in Statistical Inference, vol. 26 (Springer, 2004).
  • 37. White EP, Enquist BJ, Green JL. On estimating the exponent of power-law frequency distributions. Ecology. 2008;89:905–912. doi: 10.1890/07-1288.1.
  • 38. Bauke H. Parameter estimation for power-law distributions by maximum likelihood methods. Eur. Phys. J. B. 2007;58:167–173. doi: 10.1140/epjb/e2007-00219-y.
  • 39. Hosking JR, Wallis JR. Parameter and quantile estimation for the generalized Pareto distribution. Technometrics. 1987;29:339–349. doi: 10.1080/00401706.1987.10488243.
  • 40. Deluca A, Corral Á. Fitting and goodness-of-fit test of non-truncated and truncated power-law distributions. Acta Geophys. 2013;61:1351–1394. doi: 10.2478/s11600-013-0154-9.
  • 41. Kang S, Song J. Parameter and quantile estimation for the generalized Pareto distribution in peaks over threshold framework. J. Korean Stat. Soc. 2017;46:487–501. doi: 10.1016/j.jkss.2017.02.003.
  • 42. Chakravarty, I. M., Roy, J. & Laha, R. G. Handbook of Methods of Applied Statistics (McGraw-Hill, 1967).
  • 43. Pearson, E. S. & Hartley, H. O. Biometrika Tables for Statisticians (Cambridge University Press, 1966).
  • 44. Pickands J. Statistical inference using extreme order statistics. Ann. Stat. 1975;1975:119–131.
  • 45. National Data Buoy Center. http://www.ndbc.noaa.gov/ (National Oceanic and Atmospheric Administration of the USA (NOAA), 2021).

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
