Entropy. 2021 Jan 4;23(1):70. doi: 10.3390/e23010070

A Method for Confidence Intervals of High Quantiles

Mei Ling Huang 1,*, Xiang Raney-Yan 2
PMCID: PMC7823321  PMID: 33406678

Abstract

High quantile estimation for heavy-tailed distributions has many important applications. Heavy-tailed distributions are theoretically difficult to study because they often have infinite moments. Existing methods for confidence intervals (CIs) of high quantiles also suffer from bias. This paper proposes a new estimator for high quantiles based on the geometric mean. The new estimator has good asymptotic properties and provides a computational algorithm for estimating confidence intervals of high quantiles. It avoids these difficulties, improves efficiency and reduces bias. The efficiencies and biases of the new estimator are compared with those of existing estimators. The theoretical results are confirmed through Monte Carlo simulations. Finally, applications to two real-world examples are provided.

Keywords: efficiency, extreme value distributions, generalized Pareto distribution, Hill estimator, mean square errors, order statistics, tail index, Weissman estimator

1. Introduction

Extreme value analysis (EVA) was first introduced by Leonard Tippett (Fisher and Tippett, 1928 [1]). While working on how to make cotton thread stronger, Tippett realized that the strength of the weakest threads was the only factor that mattered in determining the strength of the cotton thread. Nowadays, extreme value analysis is widely used in almost all fields, including engineering, social science, economics, traffic prediction and insurance. People in these fields are interested in extreme events such as the shortest life span of a new engine, the maximum appreciation of the stock market, the longest driving time on a highway at rush hour, or the biggest medical claim to an insurance company. The distributions of these extreme events are usually unknown. In general, EVA involves the extrapolation of an unknown distribution and its high quantiles. Estimating a high quantile from observations is very important in EVA, since it gives the value x corresponding to a very small exceedance probability p.

Certain risks are not decided by us and can barely be predicted until right before they happen. Examples include earthquakes, terrorist attacks and virus outbreaks. Such events call for risk management, which is put in place to minimize, monitor, and control the impact of unfortunate events, or to maximize the realization of opportunities. Estimating the confidence interval of high quantiles plays an important role in risk management. Since a high quantile is located in the tail area, it depends heavily on the behaviour of the tail distribution or, from the statistical point of view, on the k largest order statistics. This leads to the challenges of instability in the choice of k and of bias. There is much research in the literature on mathematical models and theoretical studies for estimating confidence intervals of high quantiles; we review it in Section 2.

This paper proposes a new method to estimate high quantiles of a heavy-tailed distribution. The new method improves on existing methods in several respects. This paper makes three main contributions to methodology.

(1) This paper proposes a new estimation method based on a geometric mean with good asymptotic properties. It is consistent and stable relative to the existing methods. The paper provides a computational algorithm which overcomes the mathematical difficulties and bias problems of the estimation of confidence intervals of high quantiles of a heavy tailed distribution.

(2) Monte Carlo simulation studies are carried out on three heavy-tailed distribution models: Fréchet (0.25), GPD (0.5) and GPD (2) (GPD: generalized Pareto distribution). The simulation results confirm that the proposed method is more efficient than the existing quantile estimators.

(3) This paper uses the proposed estimation method to predict extreme values in two examples: flu in Canada and gamma rays from solar flares. It is interesting to see that these data sets fit the GPD model very well. We apply the proposed method to estimate the confidence intervals of high quantiles. The numerical results show that the proposed method gives more efficient results than the existing methods.

In this paper, we review several existing high quantile estimators with their behavior in Section 2. We propose a new estimator for the confidence interval of high quantiles based on the geometric mean and explore its asymptotic properties in Section 3. To compare the new estimator with the existing estimators, Section 4 presents Monte Carlo simulation results and the improvement of the proposed quantile estimator relative to existing methods. In Section 5 we apply the proposed new method to construct confidence intervals of high quantiles on flu in Canada and gamma ray examples. Finally, conclusions and discussions are given in Section 6.

2. Existing Estimator for High Quantiles

Heavy-tailed distributions (de Haan and Ferreira, 2006 [2]) are important for modelling extreme value events.

Definition 1.

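A random variable $X$ is said to have a heavy-tailed distribution if its distribution function $F(x)$ satisfies

$$1 - F(x) = L(x)\,x^{-1/\gamma}, \quad x \in (-\infty, \infty), \quad \text{as } x \to \infty, \quad \gamma > 0,$$

where $L(t)$ is a slowly varying function with $\lim_{t\to\infty} \frac{L(tx)}{L(t)} = 1$ for all $x > 0$, and $\gamma$ is the tail index.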

Notice that we can have $L(x) = (\ln x)^{b}$, $b \in \mathbb{R}$ (de Haan and Ferreira, 2006, p. 362 [2]). Since $L(t)$ behaves approximately as a constant $c$, for simplicity we assume that a heavy-tailed distribution satisfies

$$1 - F(x,\gamma) \sim c\,x^{-1/\gamma}, \quad x \in (-\infty, \infty), \quad \text{as } x \to \infty, \quad c > 0, \quad \gamma > 0. \tag{1}$$

Heavy-tailed distributions decay more slowly than exponential distributions and have longer tails. A tail function is defined as follows.

Definition 2.

A tail function $U(t)$ of any distribution function $F(x)$ is defined as

$$U(t) = \left(\frac{1}{1-F}\right)^{\leftarrow}(t), \quad \text{where } \leftarrow \text{ denotes the inverse function.}$$

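For the heavy-tailed distribution in (1), we can rewrite the tail function as

$$U(t) = \left(\frac{1}{ct}\right)^{-\gamma} = c^{\gamma}t^{\gamma} = C\,t^{\gamma}, \quad \text{as } t \to \infty, \quad \text{where } c^{\gamma} = (L(t))^{\gamma} \text{ and } C = c^{\gamma}. \tag{2}$$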

Definition 3.

The quantile function $Q(1-p,\gamma)$ of a heavy-tailed distribution $F(x,\gamma)$ in (1) for a given probability $1-p$ is defined by

$$x_{1-p} = Q(1-p,\gamma) = \inf\{x : F(x,\gamma) \ge 1-p\}, \quad x \in (-\infty, \infty), \quad 0 < p < 1,$$

where $Q(1-p,\gamma)$ is the generalized inverse function of $F$. We call $Q(1-p,\gamma)$ the $(1-p)$th quantile function of $F(x)$.

Value at Risk (VaR) is widely used in risk management. When $p$ is very small, $x_{1-p}$ becomes a high quantile, the $p$th value at risk, and we define

$$\mathrm{VaR}_{p,\gamma} = x_{1-p} = Q(1-p,\gamma), \quad 0 < p < 1, \quad p \text{ is very small.} \tag{3}$$

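We can also use the tail function in (2) to write $\mathrm{VaR}_{p,\gamma}$ as

$$\mathrm{VaR}_{p,\gamma} = U\!\left(\frac{1}{p},\gamma\right) = C\,t^{\gamma}, \quad t = \frac{1}{p}, \quad p = p_n \to 0, \quad n p_n \to 0, \quad \text{as } n \to \infty.$$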

Heavy-tailed models necessarily have an infinite right endpoint. If the sample contains negative observations, the sample size should be taken as the number of positive observations, $n^{+}$, although some authors prefer a deterministic shift of the data so as to work only with positive values. In this paper, we work on the positive half-line $(0,\infty)$.

To estimate $\mathrm{VaR}_{p,\gamma}$, let $X_{1:n} \le X_{2:n} \le \dots \le X_{n:n}$ be the order statistics from a random sample $X_1, X_2, \dots, X_n$. We review four high quantile estimation methods from the literature.

2.1. Quantile Function-Tail Index Method

To estimate high quantiles, we work on the ln scale:

$$\ln Q^{(p)}_{\hat\gamma} = \ln \mathrm{VaR}_{p,\hat\gamma} = \ln x_{1-p,\hat\gamma}, \quad 0 < p < 1, \quad p \text{ is very small.} \tag{4}$$

To estimate the high quantile function, we first estimate the tail index (Dekkers and de Haan, 1989 [3]). The Hill (1975) [4] estimator is a well-known consistent estimator of the tail index $\gamma$.

Definition 4.

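Consider the order statistics $X_{n-k:n}$, with $k$ an intermediate sequence of integers. The Hill estimator is defined as

$$\hat\gamma_H = H(k) = \frac{1}{k}\sum_{i=1}^{k} U_i, \quad U_i = i\left(\ln\frac{X_{n-i+1:n}}{X_{n-i:n}}\right), \quad 1 \le i \le k, \tag{5}$$

where $k = k_n$, $k \in [1,n)$, and $k = o(n)$ as $n \to \infty$.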
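As a minimal sketch of (5), the Hill estimator can be computed directly from the sorted sample. The function name `hill_estimator` and the toy sample are illustrative assumptions, not part of the paper.

```python
import math


def hill_estimator(sample, k):
    """Hill estimator H(k) of (5), built from the k largest order statistics."""
    x = sorted(sample)  # x[j - 1] holds the order statistic X_{j:n}
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    if x[n - k - 1] <= 0:
        raise ValueError("the k + 1 largest observations must be positive")
    # U_i = i * ln(X_{n-i+1:n} / X_{n-i:n}) for i = 1, ..., k
    u = [i * math.log(x[n - i] / x[n - i - 1]) for i in range(1, k + 1)]
    return sum(u) / k


# Toy sample (assumption): constant log-spacings of 0.5, so U_i = 0.5 * i.
toy = [math.exp(0.5 * j) for j in range(10)]
h4 = hill_estimator(toy, 4)  # (0.5 + 1.0 + 1.5 + 2.0) / 4
```

On this toy sample the scaled log-spacings are $U_i = 0.5\,i$, so $H(4)$ is their average over $i = 1,\dots,4$.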

The Hill estimator $\hat\gamma_H = H(k)$ in (5) uses the $k$ largest order statistics of a random sample. Substituting $\hat\gamma_H = H(k)$ from (5) into (4), we obtain the ln of the $(1-p)$th high quantile as

$$\ln q_{H,p}(k) = \ln Q^{(p)}_{H}(k) = \ln x_{1-p,H(k)}, \quad 0 < p < 1, \quad p \text{ is very small.} \tag{6}$$

This estimator depends on $k$: small values of $k$ give high volatility, whereas large values of $k$ induce considerable bias. Hence, semi-parametric extensions may be considered to increase the degrees of freedom in the trade-off between variance and bias. Note that the tail index $\gamma$ is a parameter of a given distribution, and a quantile of a distribution is a function of $\gamma$.

2.2. Weissman Method

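Weissman (1978) [5] proposed the following semiparametric estimator of a high quantile:

$$Q^{(p)}_{\hat\gamma}(k) = \mathrm{VaR}_{p,\hat\gamma} = x_{1-p,\hat\gamma} = X_{n-k:n}\left(\frac{k}{np}\right)^{\hat\gamma}, \quad 0 < p < 1, \quad 1 \le k \le n-1, \quad \text{and}$$

$$\ln Q^{(p)}_{\hat\gamma}(k) = \ln X_{n-k:n} + \hat\gamma \ln\frac{k}{np}.$$

Substituting $\hat\gamma_H = H(k)$ from (5) into the function above, we have

$$\ln \hat Q^{(p)}_{H}(k) = \ln X_{n-k:n} + H(k)\ln\frac{k}{np}, \quad 1 \le k \le n-1. \tag{7}$$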
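The Weissman ln-quantile in (7) is a one-line computation once a tail index estimate is available. The sketch below takes the tail index as an argument; the function name and the toy numbers are assumptions made for illustration.

```python
import math


def weissman_ln_quantile(sample, k, p, gamma_hat):
    """ln-quantile of (7): ln X_{n-k:n} + gamma_hat * ln(k / (n p))."""
    x = sorted(sample)  # x[j - 1] holds the order statistic X_{j:n}
    n = len(x)
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    return math.log(x[n - k - 1]) + gamma_hat * math.log(k / (n * p))


# Toy check (assumption): n = 10, k = 2, p = 0.05, so k / (n p) = 4 and
# X_{n-k:n} = 8, giving ln 8 + 0.5 * ln 4 = ln 16.
toy_value = weissman_ln_quantile(range(1, 11), 2, 0.05, 0.5)
```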

Without any prior indication of $k$, the Weissman estimator shows large volatility, as it depends on the sample fraction $k$. Although minimizing the bias and the MSE can be considered as a criterion for selecting $k$, this is impractical because both are unknown. Other methods for selecting the sample fraction $k$ can be found in Beirlant et al. (1996) [6]; Drees and Kaufmann (1998) [7]; Guillou and Hall (2001) [8]; Gomes and Oliveira (2001) [9].

The optimal k value through the tail index Hill estimator H, k0, is given by formula (15) in Section 2.4 Optimal k Values.

2.3. Reduced-Bias Method

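Hall and Welsh (1985) [10] proposed a second-order expansion of the tail function $U$ in (2),

$$U(t) = C\,t^{\gamma}\left(1 + \frac{A(t)}{\rho} + o(t^{\rho})\right), \quad A(t) = \gamma\beta t^{\rho}, \quad \text{as } t \to \infty, \tag{8}$$

with $C, \gamma > 0$, $\rho < 0$, and $\beta \ne 0$, where $\beta$ is the scale second-order parameter and $\rho$ is the shape second-order parameter.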

Reducing the bias of quantile estimators further requires estimating the second-order parameters $\beta$ and $\rho$. Second-order reduced-bias estimation was discussed by Peng (1998) [11], Beirlant, Dierckx, Goegebeur and Matthys (1999) [12], Feuerverger and Hall (1999) [13], Gomes, Martins and Neves (2000) [14], Caeiro and Gomes (2002) [15], Gomes, Figueiredo and Mendonça (2004) [16], among others. Gomes and Pestana (2007) [17] considered the estimators $\hat\rho_{\tau}(k)$ and $\hat\beta_{\hat\rho}(k)$ for the second-order parameters $(\rho,\beta)$.

Caeiro et al. (2005, p. 122) [18] advise the use of a tuning parameter $\tau$ in the estimation of $\rho$. By any stability criterion, it gives estimates that are more stable as functions of $k$, the number of top order statistics used, over a wide range of large $k$ values.

Definition 5.

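Caeiro et al. (2005) [18] defined the bias-corrected Hill estimator

$$\bar H(k) \equiv \bar H_{\hat\beta,\hat\rho}(k) = H(k)\left(1 - \frac{\hat\beta}{1-\hat\rho}\left(\frac{n}{k}\right)^{\hat\rho}\right), \tag{9}$$

where $H(k)$ is defined in (5). For a tuning real parameter $\tau$,

$$\hat\rho_{\tau}(k) \equiv \hat\rho^{(\tau)}_{n}(k) = \min\left(0,\ \frac{3\left(T^{(\tau)}_n(k) - 1\right)}{T^{(\tau)}_n(k) - 3}\right), \tag{10}$$

$$T^{(\tau)}_n(k) = \begin{cases} \dfrac{\left(M^{(1)}_n(k)\right)^{\tau} - \left(M^{(2)}_n(k)/2\right)^{\tau/2}}{\left(M^{(2)}_n(k)/2\right)^{\tau/2} - \left(M^{(3)}_n(k)/6\right)^{\tau/3}} & \text{if } \tau \ne 0; \\[2ex] \dfrac{\ln M^{(1)}_n(k) - \frac{1}{2}\ln\left(M^{(2)}_n(k)/2\right)}{\frac{1}{2}\ln\left(M^{(2)}_n(k)/2\right) - \frac{1}{3}\ln\left(M^{(3)}_n(k)/6\right)} & \text{if } \tau = 0, \end{cases}$$

$$M^{(j)}_n(k) = \frac{1}{k}\sum_{i=1}^{k}\left(\log X_{n-i+1:n} - \log X_{n-k:n}\right)^{j}, \quad j = 1, 2, 3.$$

$$\hat\beta_{\hat\rho}(k) = \left(\frac{k}{n}\right)^{\hat\rho}\frac{d_{\hat\rho}(k)\,D_0(k) - D_{\hat\rho}(k)}{d_{\hat\rho}(k)\,D_{\hat\rho}(k) - D_{2\hat\rho}(k)}, \tag{11}$$

$$\text{where for any } \theta \le 0, \quad d_{\theta}(k) = \frac{1}{k}\sum_{i=1}^{k}\left(\frac{i}{k}\right)^{-\theta} \quad \text{and} \quad D_{\theta}(k) = \frac{1}{k}\sum_{i=1}^{k}\left(\frac{i}{k}\right)^{-\theta} U_i,$$

with $U_i$, $1 \le i \le k$, as defined in (5). The estimator in (10) achieves consistency if $\sqrt{k}\,A(n/k) \to \infty$ as $n \to \infty$ and $\hat\rho - \rho = o_p(1/\ln n)$.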
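The second-order estimators in (10) and (11) are easier to follow in code. The sketch below is one reading of those formulas, with the exponent $-\theta$ in $d_\theta$ and $D_\theta$ and the minimum in (10) taken as assumptions about the intended notation. The function names and the deterministic test sample (Fréchet (0.25) quantiles at plotting positions) are illustrative only.

```python
import math


def log_moments(x_sorted, k):
    """M_n^(j)(k), j = 1, 2, 3: moments of log-excesses over X_{n-k:n}."""
    n = len(x_sorted)
    base = math.log(x_sorted[n - k - 1])  # ln X_{n-k:n}
    excess = [math.log(x_sorted[n - i]) - base for i in range(1, k + 1)]
    return tuple(sum(e ** j for e in excess) / k for j in (1, 2, 3))


def rho_hat(x_sorted, k, tau=0):
    """Shape second-order estimator of (10) for tuning parameter tau."""
    m1, m2, m3 = log_moments(x_sorted, k)
    if tau == 0:
        num = math.log(m1) - 0.5 * math.log(m2 / 2)
        den = 0.5 * math.log(m2 / 2) - math.log(m3 / 6) / 3
    else:
        num = m1 ** tau - (m2 / 2) ** (tau / 2)
        den = (m2 / 2) ** (tau / 2) - (m3 / 6) ** (tau / 3)
    t = num / den
    return min(0.0, 3 * (t - 1) / (t - 3))


def beta_hat(x_sorted, k, rho):
    """Scale second-order estimator of (11) for a given (estimated) rho < 0."""
    n = len(x_sorted)
    u = [i * math.log(x_sorted[n - i] / x_sorted[n - i - 1])
         for i in range(1, k + 1)]

    def d(theta):
        return sum((i / k) ** (-theta) for i in range(1, k + 1)) / k

    def big_d(theta):
        return sum((i / k) ** (-theta) * u[i - 1] for i in range(1, k + 1)) / k

    num = d(rho) * big_d(0) - big_d(rho)
    den = d(rho) * big_d(rho) - big_d(2 * rho)
    return (k / n) ** rho * num / den


# Deterministic stand-in for a Frechet(0.25) sample (assumption): quantiles
# (-ln u)^(-0.25) evaluated at u = j / (n + 1), already in increasing order.
n_demo = 1000
xs = [(-math.log(j / (n_demo + 1))) ** (-0.25) for j in range(1, n_demo + 1)]
rho0 = rho_hat(xs, 400, tau=0)
beta_demo = beta_hat(xs, 400, -1.0)
```

Note that `beta_hat` is undefined when the supplied `rho` is exactly 0, since both numerator and denominator of (11) vanish; a caller would need to guard against that case.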

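The corresponding ln-quantile estimator with the tail index estimator $\bar H$ in (9) is

$$\ln \hat Q^{(p)}_{\bar H}(k) = \ln X_{n-k+1:n} + \bar H(k)\ln\frac{k}{np}, \quad 1 \le k \le n-1. \tag{12}$$

An estimator similar to the one in (12) is considered in Lekina et al. (2014) [19] and Lekina (2010) [20].

Gomes and Pestana (2007) [17] considered the ln-VaR estimator

$$\ln \bar Q^{(p)}_{\hat\gamma}(k) = \ln X_{n-k+1:n} + \hat\gamma\ln\frac{k}{np} + C_p(k;\hat\beta,\hat\rho), \quad C_p(k;\hat\beta,\hat\rho) = \hat\beta\left(\frac{n}{k}\right)^{\hat\rho}\frac{\left(\frac{k}{np}\right)^{\hat\rho} - 1}{\hat\rho}. \tag{13}$$

Substituting the estimator $\bar H$ in (9) into (13), we have another estimator for a high quantile:

$$\ln \bar Q^{(p)}_{\bar H}(k) = \ln X_{n-k+1:n} + \bar H(k)\ln\frac{k}{np} + C_p(k;\hat\beta,\hat\rho). \tag{14}$$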
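A short sketch of how (9), (13) and (14) fit together, assuming the second-order estimates $\hat\beta$ and $\hat\rho$ are already available. The function names and numeric inputs are hypothetical and chosen so the arithmetic can be followed by hand.

```python
import math


def corrected_hill(h, k, n, beta, rho):
    """Bias-corrected Hill estimator of (9)."""
    return h * (1 - beta / (1 - rho) * (n / k) ** rho)


def c_p(k, n, p, beta, rho):
    """Adjustment term C_p(k; beta, rho) of (13); requires rho != 0."""
    return beta * (n / k) ** rho * ((k / (n * p)) ** rho - 1) / rho


def ln_q_bar(x_kth_top, k, n, p, h_bar, beta, rho):
    """ln-quantile of (14); x_kth_top is X_{n-k+1:n}, the k-th largest value."""
    return (math.log(x_kth_top) + h_bar * math.log(k / (n * p))
            + c_p(k, n, p, beta, rho))


# Hand-checkable inputs (assumption): n = 1000, k = 100, beta = 1, rho = -1.
h_bar_demo = corrected_hill(0.5, 100, 1000, 1.0, -1.0)  # 0.5 * (1 - 0.05)
c_demo = c_p(100, 1000, 0.001, 1.0, -1.0)               # 0.1 * 0.99
q_demo = ln_q_bar(math.e, 100, 1000, 0.001, h_bar_demo, 1.0, -1.0)
```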

2.4. Optimal k Values

As discussed previously, the estimates vary with $k$ and become very unreliable when $k$ is large. Gomes and Pestana (2007) [17] suggested using numerically estimated optimal $k$ values.

The optimal $k$ for the tail index Hill estimator $H(k)$ in (5) is $k_0$,

$$k_0 = \left(\frac{(1-\rho)\,n^{-\rho}}{\beta\sqrt{-2\rho}}\right)^{\frac{2}{1-2\rho}}. \tag{15}$$

The optimal $k$ for the semiparametric quantile estimator $\ln \hat Q^{(p)}_{H}(k)$ in (7) is $k_0^{Q_H}$,

$$k_0^{Q_H} = \arg\min_{k}\ \ln^2\!\left(\frac{k}{np}\right)\left(\frac{1}{k} + \frac{\beta^2 (n/k)^{2\rho}}{(1-\rho)^2}\right). \tag{16}$$

The optimal $k$ for the second-order reduced-bias quantile estimators $\ln \hat Q^{(p)}_{\bar H}(k)$ in (12) and $\ln \bar Q^{(p)}_{\bar H}(k)$ in (14), which should be larger than $k_0$, is $k_{01}$,

$$k_{01} = \left(\frac{1.96\,(1-\rho)\,n^{-\rho}}{\beta}\right)^{\frac{2}{1-2\rho}}. \tag{17}$$

With these optimal $k$ values, all the quantile estimators give better results. However, with an unknown distribution and estimated second-order parameters, these numerically estimated $k$ values are not always accurate. Since all the quantile estimators are so sensitive to the value of $k$, in this paper we propose a new quantile estimator that does not depend on $k$.
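The closed-form levels $k_0$ in (15) and $k_{01}$ in (17) can be evaluated as follows. The function names are assumptions, and the second-order parameters are taken with negative $\rho$ (for example $\rho = -1$, $\beta = 0.5$ for Fréchet (0.25)). Rounding up to the next integer is also an assumption; it reproduces the tabulated $k_0$ and $k_{01}$ values for the cases checked in the comments.

```python
import math


def k0_hill(n, beta, rho):
    """Optimal level k_0 of (15) for the Hill estimator; requires rho < 0."""
    base = (1 - rho) * n ** (-rho) / (beta * math.sqrt(-2 * rho))
    return base ** (2 / (1 - 2 * rho))


def k01_reduced_bias(n, beta, rho, z=1.96):
    """Optimal level k_01 of (17) for the reduced-bias quantile estimators."""
    base = z * (1 - rho) * n ** (-rho) / beta
    return base ** (2 / (1 - 2 * rho))


# Frechet(0.25): beta = 0.5, rho = -1; GPD(gamma): beta = 1, rho = -gamma.
k0_frechet_500 = math.ceil(k0_hill(500, 0.5, -1.0))             # Table 2: 126
k0_gpd05_500 = math.ceil(k0_hill(500, 1.0, -0.5))               # Table 3: 34
k0_gpd2_500 = math.ceil(k0_hill(500, 1.0, -2.0))                # Table 4: 170
k01_frechet_1000 = math.ceil(k01_reduced_bias(1000, 0.5, -1.0)) # Table 5: 395
```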

3. New Estimator for High Quantile

3.1. New Estimator

Our goal is to improve on the quantile estimators in Section 2. The existing estimation methods suffer from bias and from the difficulty of determining $k$. To overcome these problems, Huang (2011) [21] proposed a new quantile estimator, the geometric mean of the reduced-bias quantile estimator in (14).

Definition 6.

$$\hat Q^{(p)}_{New,\hat\gamma} = \left(\prod_{k=1}^{n-1} X_{n-k+1:n}\left(\frac{k}{np}\right)^{\hat\gamma}\right)^{\frac{1}{n-1}}, \quad 0 < p < 1, \quad 1 \le k \le n-1, \tag{18}$$

where $X_{n-k:n}$ is the $(k+1)$th top order statistic, $\hat\gamma$ is any consistent estimator of $\gamma$, and $Q$ stands for the quantile function.

Based on (14), (18) can be written as

$$\ln \hat Q^{(p)}_{New,\hat\gamma} = \frac{1}{n-1}\sum_{k=1}^{n-1}\left[\ln X_{n-k+1:n} + \hat\gamma\ln\frac{k}{np} + \alpha\,C_p(k;\hat\beta,\hat\rho)\right], \tag{19}$$

where $0 < p < 1$ and $\alpha$ is a constant, $\alpha \in \mathbb{R}$.

$\alpha\,C_p(k;\hat\beta,\hat\rho)$ is the adjustment term, where $C_p(k;\hat\beta,\hat\rho)$, defined in (13), reduces bias using the second-order parameters. $\alpha$ is a key value that depends only on $n$ and further reduces the bias by accounting for the behavior of the second-order parameters. We discuss the choice of $\alpha$ in Section 4.
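The averaged form of the new estimator is straightforward to sketch: each of the $n-1$ levels $k$ contributes a log order statistic, a tail-index term, and an $\alpha$-weighted adjustment $C_p$. The function name and the three-point toy sample are assumptions for illustration; all estimates ($\hat\gamma$, $\hat\beta$, $\hat\rho$, $\alpha$) are passed in rather than computed here.

```python
import math


def ln_q_new(sample, p, gamma_hat, beta_hat, rho_hat, alpha):
    """Average over k = 1..n-1 of ln X_{n-k+1:n} + gamma ln(k/(np)) + alpha C_p.

    Requires rho_hat != 0 and positive values for all but the smallest point.
    """
    x = sorted(sample)  # x[j - 1] holds the order statistic X_{j:n}
    n = len(x)
    total = 0.0
    for k in range(1, n):
        ratio = k / (n * p)
        cp = beta_hat * (n / k) ** rho_hat * (ratio ** rho_hat - 1) / rho_hat
        total += math.log(x[n - k]) + gamma_hat * math.log(ratio) + alpha * cp
    return total / (n - 1)


# Toy check (assumption): n = 3, p = 1/3 so that np = 1, and alpha = 0.
# The two terms are ln(e^2) + 0 and ln(e) + 0.5 ln 2.
toy_new = ln_q_new([1.0, math.e, math.e ** 2], 1 / 3, 0.5, 1.0, -1.0, 0.0)
```

With `alpha = 0` the estimator reduces to the log of the geometric mean of Weissman-type terms, which is what the toy check exercises.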

Section 3, Section 4 and Section 5 will show that the new estimator $\ln \hat Q^{(p)}_{New,\hat\gamma}$ has good properties:

1. The new quantile estimator $\ln \hat Q^{(p)}_{New,\hat\gamma}$ has the least bias, the smallest MSE and the highest efficiency.

2. The new quantile estimator $\ln \hat Q^{(p)}_{New,\hat\gamma}$ is consistent and does not depend on $k$ as the existing quantile estimators do.

3. The confidence interval based on the new quantile estimator $\ln \hat Q^{(p)}_{New,\hat\gamma}$ is the most efficient compared with the existing methods: in most cases it has both the shortest interval length and the highest probability coverage of the true value.

3.2. Asymptotic Properties of the New Estimator $\ln \hat Q^{(p)}_{New,\bar H}$

Using the Hall-Welsh class of models in (8), we derive the asymptotic properties of the new estimator $\ln \hat Q^{(p)}_{New,\hat\gamma}$ in (19) under the following conditions, when $\hat\gamma = \bar H$ in (9).

Condition 1 (C1).

For intermediate $k$, $k = k_n$, $k \in [1,n)$, $k = o(n)$, as $n \to \infty$.

Condition 2 (C2).

$\ln(n p_n) = o(\sqrt{k})$, $\lim_{n\to\infty}\sqrt{k}\,A\!\left(\frac{n}{k}\right) = \lambda$, where $A$ is in (8).

Theorem 1.

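Under (C1) and (C2), if we use $\hat\gamma = \bar H$ in (9), then $\ln \hat Q^{(p)}_{New,\bar H}$ has an asymptotic normal distribution

$$\ln \hat Q^{(p)}_{New,\bar H} - \ln \mathrm{VaR}_p \ \xrightarrow[n\to\infty]{d}\ \mathrm{Normal}\!\left(0,\ \frac{\gamma^2}{(n-1)^2}\left[\sum_{k=1}^{n-1}\frac{\left[\ln\frac{k}{np}\right]^2}{k} + w\sum\sum_{1 \le i<j \le n-1}\frac{\ln\frac{i}{np}\,\ln\frac{j}{np}}{\sqrt{i\cdot j}}\right]\right). \tag{20}$$

The asymptotic mean, variance and efficiency of $\ln \hat Q^{(p)}_{New,\bar H}$ in (19) relative to $\ln \bar Q^{(p)}_{\bar H}(k)$ in (14) are given by

$$E\left[\ln \hat Q^{(p)}_{New,\bar H}\right] \underset{n\to\infty}{\longrightarrow} \ln \mathrm{VaR}_p;$$

$$Var\left[\ln \hat Q^{(p)}_{New,\bar H}\right] \underset{n\to\infty}{\sim} \frac{\gamma^2}{(n-1)^2}\left[\sum_{k=1}^{n-1}\frac{\left[\ln\frac{k}{np}\right]^2}{k} + w\sum\sum_{1 \le i<j \le n-1}\frac{\ln\frac{i}{np}\,\ln\frac{j}{np}}{\sqrt{i\cdot j}}\right]; \tag{21}$$

$$EFF\left(\ln \hat Q^{(p)}_{New,\bar H} \,\middle|\, \ln \bar Q^{(p)}_{\bar H}(k)\right) \underset{n\to\infty}{\sim} \frac{(n-1)^2\left[\ln\frac{k}{np}\right]^2 / k}{\sum_{k=1}^{n-1}\frac{\left[\ln\frac{k}{np}\right]^2}{k} + w\sum\sum_{1 \le i<j \le n-1}\frac{\ln\frac{i}{np}\,\ln\frac{j}{np}}{\sqrt{i\cdot j}}} > 1, \quad \text{for } k = 1,\dots,n-1, \tag{22}$$

where $w$ is the weight, $w = \max_{i \ne j}\rho^{+}_{ij}$, $0 \le w \le 1$, $\rho^{+}_{ij} = |\rho_{ij}|$, $0 \le \rho^{+}_{ij} \le 1$; $\rho_{ij}$ is the correlation coefficient of $\ln \bar Q^{(p)}_{\bar H}(i)$ and $\ln \bar Q^{(p)}_{\bar H}(j)$,

$$\rho_{ij} = \frac{Cov\left(\ln \bar Q^{(p)}_{\bar H}(i),\ \ln \bar Q^{(p)}_{\bar H}(j)\right)}{\sqrt{Var\left[\ln \bar Q^{(p)}_{\bar H}(i)\right]\,Var\left[\ln \bar Q^{(p)}_{\bar H}(j)\right]}}, \quad i \ne j, \quad i,j = 1,\dots,n-1, \quad -1 \le \rho_{ij} \le 1, \quad \text{and}$$

$$EFF\left(\ln \hat Q^{(p)}_{New,\bar H} \,\middle|\, \ln \bar Q^{(p)}_{\bar H}(k)\right) = \frac{Var\left[\ln \bar Q^{(p)}_{\bar H}(k)\right]}{Var\left[\ln \hat Q^{(p)}_{New,\bar H}\right]}, \quad \text{where } Var\left[\ln \bar Q^{(p)}_{\bar H}(k)\right] \underset{n\to\infty}{\sim} \gamma^2\,\frac{\left[\ln\frac{k}{np}\right]^2}{k}.$$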

See Appendix A for the proof of Theorem 1.

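3.3. The C.I. for the New Estimator $\ln \hat Q^{(p)}_{New,\bar H}$

Theorem 2.

Under conditions (C1) and (C2), a $(1-\alpha)100\%$ confidence interval for $\ln \mathrm{VaR}_p$ using $\ln \hat Q^{(p)}_{New,\bar H}$ in (19) is given by

$$\left(LCL_{\ln \hat Q^{(p)}_{New,\bar H}}(k),\ UCL_{\ln \hat Q^{(p)}_{New,\bar H}}(k)\right) = \left(\ln \hat Q^{(p)}_{New,\bar H} - UCL_{\bar H}(k)\,b_3,\ \ln \hat Q^{(p)}_{New,\bar H} + UCL_{\bar H}(k)\,b_3\right), \tag{23}$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the standard normal distribution, and

$$b_3 = \frac{z_{1-\alpha/2}}{n-1}\sqrt{\sum_{k=1}^{n-1}\frac{\left[\ln\frac{k}{np}\right]^2}{k} + w\sum\sum_{1 \le i<j \le n-1}\frac{\ln\frac{i}{np}\,\ln\frac{j}{np}}{\sqrt{i\cdot j}}}, \quad UCL_{\bar H}(k) = \frac{\bar H}{1 - \frac{z_{1-\alpha/2}}{\sqrt{k}}}.$$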

See Appendix A for the proof of Theorem 2.
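The interval in Theorem 2 can be sketched numerically. The code below assumes that the cross-term sum in $b_3$ runs over pairs $i<j$, as in the variance of Theorem 1, and uses the identity $\sum_{i<j} a_i a_j = \left((\sum_i a_i)^2 - \sum_i a_i^2\right)/2$ with $a_i = \ln(i/(np))/\sqrt{i}$ to avoid a double loop. The function name and inputs are illustrative assumptions.

```python
import math


def new_estimator_ci(ln_q_new, h_bar, k, n, p, w=1.0, z=1.96):
    """Confidence interval for ln VaR_p around the new estimator.

    Assumes the cross sum in b_3 is over pairs i < j and that sqrt(k) > z.
    """
    if math.sqrt(k) <= z:
        raise ValueError("k is too small: need sqrt(k) > z")
    a = [math.log(i / (n * p)) / math.sqrt(i) for i in range(1, n)]
    diag = sum(v * v for v in a)         # sum of [ln(k/(np))]^2 / k
    cross = (sum(a) ** 2 - diag) / 2     # sum over i < j of a_i * a_j
    b3 = z / (n - 1) * math.sqrt(diag + w * cross)
    ucl_h = h_bar / (1 - z / math.sqrt(k))
    half_width = ucl_h * b3
    return ln_q_new - half_width, ln_q_new + half_width


# Illustrative inputs (assumption): a Frechet(0.25)-like setting with
# n = 1000, p = 0.0005, k = 395, h_bar = 0.25 and maximum weight w = 1.
lo, hi = new_estimator_ci(1.88, 0.25, 395, 1000, 0.0005)
length = hi - lo
```

Only the half-width depends on $k$ (through $UCL_{\bar H}(k)$); the centre of the interval does not.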

Remark 1.

Note that in the CI in (23), the main term $\ln \hat Q^{(p)}_{New,\bar H}$ does not depend on $k$; only the error term $UCL_{\bar H}(k)\,b_3$ depends on $k$.

Remark 2.

In Section 4 Simulations and Section 5 Applications, we use the maximum weight $w = 1$ in Formula (23). We therefore compare the maximum CI length for the proposed estimator $\ln \hat Q^{(p)}_{New,\bar H}$ with the existing methods. Even with the maximum CI length, Section 4 and Section 5 show that the confidence interval in (23) obtained from the new estimator is still shorter than the confidence intervals obtained from the existing estimators for most values of $k$.

4. Simulations

4.1. Computer Simulations of Quantile Estimators

To verify that the new estimator $\ln \hat Q^{(p)}_{new,\bar H}$ has good properties, we use simulations and compare the new estimator with the existing estimators using the following statistics:

  1. The expected value $E[\cdot]$.

  2. The root mean squared error $RMSE[\cdot]$.

  3. The relative efficiency $REFF[\cdot]$,

$$REFF_{\tilde Q \mid H \text{ or } \bar H} = \sqrt{\frac{MSE\left[\ln q^{(p)}_{H}(k_0)\right]}{MSE\left[\ln \tilde Q^{(p)}_{H \text{ or } \bar H}(k_0)\right]}} \quad \text{for } \tilde Q = Q \text{ or } \bar Q, \quad p = 1/(2n), \quad k_0 \text{ is defined in (15).} \tag{24}$$

In this section, we choose the Fréchet (0.25), GPD (0.5) and GPD (2) models in order to compare with the simulation results of Gomes and Pestana (2007) [17]. We use the four quantile estimators in Table 1 in the simulations. When $\rho \ge -1$, the estimators $\hat\beta$ and $\hat\rho$ in $\bar H$ use the tuning parameter $\tau = 0$; otherwise, they use $\tau = 1$.

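Table 1. The four ln-quantile estimators used in the simulations.

| Quantile estimator | Defined in | Tail index estimator |
| --- | --- | --- |
| $\ln Q_{\hat\gamma = H} = \ln q_H$ | (6) | $H$ in (5) |
| $\ln Q_H$ | (7) | $H$ in (5) |
| $\ln \tilde Q_{\bar H}$ | $\ln \hat Q_{\bar H}$ in (12) when $\rho \ne -1$; $\ln \bar Q_{\bar H}$ in (14) when $\rho = -1$ | $\bar H$ in (9) |
| $\ln \hat Q_{new,\bar H}$ | (19) | $\bar H$ in (9) |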
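  • (1)
    The Fréchet distribution (Fréchet, 1927) [22] has the c.d.f.
    $$F(x) = \exp\left(-x^{-\frac{1}{\gamma}}\right), \quad x > 0, \quad \gamma > 0. \tag{25}$$
    An estimator of the $p$th ln-high quantile function is
    $$\ln Q^{(p)}_{\hat\gamma} = \ln x_{1-p,\hat\gamma} = -\hat\gamma\,\ln\left(\ln\frac{1}{1-p}\right), \quad 0 < p < 1, \quad p \text{ is very small.}$$
  • (2)
    The generalized Pareto distribution (GPD) (de Zea Bermudez and Kotz, 2010) [23] has the c.d.f.
    $$F(x;\gamma) = 1 - (1 + \gamma x)^{-\frac{1}{\gamma}}, \quad x \ge 0, \quad \gamma \ne 0. \tag{26}$$
    For $\gamma > 0$, an estimator of the $p$th ln-high quantile function is
    $$\ln Q^{(p)}_{\hat\gamma} = \ln \mathrm{VaR}_{\hat\gamma} = \ln x_{1-p,\hat\gamma} = \ln\frac{p^{-\hat\gamma} - 1}{\hat\gamma}, \quad 0 < p < 1, \quad p \text{ is very small.}$$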
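The ln-quantile expressions for the two models follow from inverting the c.d.f.s in (25) and (26). As a quick check, the sketch below evaluates them at $p = 1/(2n)$ with the true tail index; the function names are assumptions. The results agree with the true $\ln \mathrm{VaR}_p$ rows of Tables 2, 3 and 4 to the four decimals shown there.

```python
import math


def ln_var_frechet(p, gamma):
    """ln of the (1 - p) quantile of F(x) = exp(-x^(-1/gamma))."""
    # -ln(1 - p) is computed with log1p for accuracy at very small p.
    return -gamma * math.log(-math.log1p(-p))


def ln_var_gpd(p, gamma):
    """ln of the (1 - p) quantile of F(x) = 1 - (1 + gamma x)^(-1/gamma)."""
    return math.log((p ** (-gamma) - 1) / gamma)


# n = 500, p = 1/(2n) = 0.001
frechet_500 = ln_var_frechet(1 / 1000, 0.25)  # Table 2 lists 1.7268
gpd05_500 = ln_var_gpd(1 / 1000, 0.5)         # Table 3 lists 4.1149
gpd2_500 = ln_var_gpd(1 / 1000, 2.0)          # Table 4 lists 13.1224
```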

4.2. The Choice of α

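As mentioned in Section 3, $\alpha$ is a key value for reducing the bias of $\ln \hat Q^{(p)}_{New,\hat\gamma}$ defined in (19). We developed an algorithm to estimate $\alpha$ based on the results of $m$ simulation runs:

Step 1: For a fixed sample size $n$, $\alpha_i(n)$ in the $i$th iteration, $i = 1,\dots,m$, $m = 500$, is the exact solution of the equation

$$\ln \hat Q^{(p)}_{New,\bar H_i} = \frac{1}{n-1}\sum_{k=1}^{n-1}\left[\ln X_{i,n-k+1:n} + \bar H_i\ln\frac{k}{np} + \alpha\,C_p(k;\hat\beta,\hat\rho)\right] = \ln \mathrm{VaR}_p, \quad i = 1,\dots,m,$$

then $\alpha(n) = \frac{1}{m}\sum_{i=1}^{m}\alpha_i(n)$. Note that $\alpha(n)$ depends on $n$, and $\ln \mathrm{VaR}_p$ is the true ln VaR value.

Step 2: Obtain the estimator $\hat\alpha(n)$ from linear regression (LR) models in which $\alpha$ is related to $n$. We collect a data set $(\alpha_j, n_j)$, $j = 1,\dots,l$, of size $l$.

$$\hat\alpha(n) = \begin{cases} \hat\rho(n), & |\rho(n)| < 1, \text{ for GPD (0.5)}; \\ 1.7488 - 0.0002\,n + 2.9693\,X_1 + 2.6604\,X_2, & |\rho(n)| \ge 1, \end{cases} \tag{27}$$

where, for Fréchet (0.25), $X_1 = \hat\beta(n)$ and $X_2 = \hat\rho(n)$; for GPD (2), $X_1 = \hat\rho(n)$ and $X_2 = \hat\beta(n)$.

Note that the estimate $\hat\alpha(n)$ in (27) depends on the parameters of the models and on the LR relationship with the sample size $n$.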
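Because the estimator in Step 1 is linear in $\alpha$, each $\alpha_i(n)$ has a closed form: writing the average as `base + alpha * adj`, the solution is `(ln_var_true - base) / adj`. The sketch below shows that single-iteration solve; averaging over the $m$ runs and the regression of Step 2 are not shown. Function names and the deterministic test sample are assumptions.

```python
import math


def estimator_parts(sample, p, h_bar, beta_hat, rho_hat):
    """Split the averaged estimator into (base, adj) with value base + alpha*adj."""
    x = sorted(sample)  # x[j - 1] holds the order statistic X_{j:n}
    n = len(x)
    base = 0.0
    adj = 0.0
    for k in range(1, n):
        ratio = k / (n * p)
        base += math.log(x[n - k]) + h_bar * math.log(ratio)
        adj += beta_hat * (n / k) ** rho_hat * (ratio ** rho_hat - 1) / rho_hat
    return base / (n - 1), adj / (n - 1)


def solve_alpha(sample, p, h_bar, beta_hat, rho_hat, ln_var_true):
    """alpha_i(n): the value of alpha that makes the estimator equal ln VaR_p."""
    base, adj = estimator_parts(sample, p, h_bar, beta_hat, rho_hat)
    return (ln_var_true - base) / adj


# Deterministic stand-in sample (assumption): Frechet(0.25) quantiles at
# plotting positions, n = 200, p = 1/(2n).
xs = [(-math.log(j / 201.0)) ** (-0.25) for j in range(1, 201)]
base_demo, adj_demo = estimator_parts(xs, 1 / 400, 0.25, 0.5, -1.0)
target = base_demo + 0.7 * adj_demo  # a target built so that alpha = 0.7
alpha_demo = solve_alpha(xs, 1 / 400, 0.25, 0.5, -1.0, target)
```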

Remark 3.

If we assume that $\alpha_j$ in $(\alpha_j, n_j)$, $j = 1,\dots,l$, is normally distributed, then, based on (Bickel and Doksum, 2015, pp. 286–388) [24], $\hat\alpha(n)$ is a maximum likelihood estimator (MLE) and has an asymptotic normal distribution. Since the estimator $\hat\alpha(n)$ depends only on $n$ and is not related to the order statistics, it does not affect the asymptotic properties of the proposed estimator $\ln \hat Q^{(p)}_{New,\hat\gamma}$ in (19).

4.3. Simulation of Fréchet (0.25), GPD (0.5) and GPD (2)

Table 2, Table 3 and Table 4 list the results of simulations under the Fréchet (0.25), GPD (0.5) and GPD (2) models, with $N = 500$ iterations for sample sizes $n = 500, 1000, 2000, 5000$ and $p = 1/(2n)$. With $\hat\alpha(n)$ in (27), we compare the mean values, mean squared errors (MSE) and REFF of the four ln VaR estimators in Table 1 at the optimal level $k = k_0$ based on (15). Note that the new estimator $\ln \hat Q^{(p)}_{New,\hat\gamma}$ has the highest REFF values among the four estimators (shown in bold) in all three models. The simulation MSE of $\ln \hat Q^{(p)}_{New,\hat\gamma}$ is defined as

$$MSE\left[\ln \hat Q^{(p)}_{New,\hat\gamma}\right] = \frac{1}{N}\sum_{i=1}^{N}\left(\ln \hat Q^{(p)}_{New,\hat\gamma,i} - \ln \mathrm{VaR}_p\right)^2,$$

where $\ln \hat Q^{(p)}_{New,\hat\gamma,i}$ is the value of $\ln \hat Q^{(p)}_{New,\hat\gamma}$ in the $i$th iteration, $i = 1,\dots,N$. The MSEs of the other ln-quantile estimators are defined in the same way.
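The simulation MSE and the relative efficiency are simple to compute once the $N$ estimates are collected. The sketch below assumes REFF is the square root of the MSE ratio against the baseline $\ln q_H$, which is the reading that matches the tabulated values; function names are illustrative.

```python
import math


def simulation_mse(estimates, true_value):
    """Mean squared error of N simulated estimates around the true value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)


def reff(mse_baseline, mse_estimator):
    """Relative efficiency: sqrt(MSE of baseline / MSE of estimator)."""
    return math.sqrt(mse_baseline / mse_estimator)


# Rounded MSEs from Table 2, n = 500: ln q_H 0.0429, new estimator 0.0095.
reff_new_500 = reff(0.0429, 0.0095)  # Table 2 lists 2.1252
```

The small gap to the tabulated 2.1252 comes from using MSEs rounded to four decimals.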

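Table 2. Fréchet (0.25), $N = 500$, $\beta = 0.5$, $\rho = -1$. Mean, MSE and REFF of the ln VaR estimators. The highest REFF values are in bold.

| Estimator | Statistic | n = 500 | n = 1000 | n = 2000 | n = 5000 |
| --- | --- | --- | --- | --- | --- |
| $\ln \mathrm{VaR}_p$, $p = 1/(2n)$ | | 1.7268 | 1.9002 | 2.0735 | 2.3026 |
| $k_0$ | | 126 | 200 | 318 | 585 |
| $\hat\alpha(n)$ = LR | | 1.1218 | 1.1400 | 0.4991 | −0.0357 |
| $\ln q_H$ | Mean (MSE) | 1.8526 (0.0429) | 2.0038 (0.0300) | 2.1657 (0.0228) | 2.3755 (0.0147) |
| | REFF | 1 | 1 | 1 | 1 |
| $\ln Q_H$ | Mean (MSE) | 1.7906 (0.0219) | 1.9540 (0.0154) | 2.1239 (0.0115) | 2.3431 (0.0074) |
| | REFF | 1.4004 | 1.3933 | 1.4104 | 1.4125 |
| $\ln \bar Q_{\bar H}$ | Mean (MSE) | 1.7092 (0.0206) | 1.8849 (0.0141) | 2.0764 (0.0111) | 2.3073 (0.0072) |
| | REFF | 1.4419 | 1.4576 | 1.4347 | 1.4257 |
| $\ln \hat Q_{new,\bar H}$ | Mean (MSE) | 1.7185 (0.0095) | 1.8791 (0.0065) | 2.0716 (0.0051) | 2.2798 (0.0044) |
| | REFF | **2.1252** | **2.1399** | **2.1139** | **1.8231** |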

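Table 3. GPD (0.5), $N = 500$, $\beta = 1$, $\rho = -0.5$. Mean, MSE and REFF of the ln VaR estimators. The highest REFF values are in bold.

| Estimator | Statistic | n = 500 | n = 1000 | n = 2000 | n = 5000 |
| --- | --- | --- | --- | --- | --- |
| $\ln \mathrm{VaR}_p$, $p = 1/(2n)$ | | 4.1149 | 4.4710 | 4.8242 | 5.2883 |
| $k_0$ | | 34 | 48 | 68 | 107 |
| $\hat\alpha(n) = \hat\rho$ | | −0.7512 | −0.7482 | −0.7427 | −0.7244 |
| $\ln q_H$ | Mean (MSE) | 4.7019 (0.6349) | 4.9773 (0.4863) | 5.3065 (0.4554) | 5.7209 (0.3427) |
| | REFF | 1 | 1 | 1 | 1 |
| $\ln Q_H$ | Mean (MSE) | 4.2913 (0.2172) | 4.6258 (0.1628) | 4.9904 (0.1491) | 5.4485 (0.1074) |
| | REFF | 1.7159 | 1.7282 | 1.7478 | 1.7865 |
| $\ln \hat Q_{\bar H}$ | Mean (MSE) | 4.1140 (0.1654) | 4.4801 (0.1267) | 4.8656 (0.1166) | 5.3434 (0.0825) |
| | REFF | 1.9663 | 1.9591 | 1.9763 | 2.0379 |
| $\ln \hat Q_{new,\bar H}$ | Mean (MSE) | 3.9076 (0.0779) | 4.4239 (0.0233) | 4.8674 (0.0241) | 5.4359 (0.0382) |
| | REFF | **2.8657** | **4.5666** | **4.3428** | **2.9954** |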

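Table 4. GPD (2), $N = 500$, $\beta = 1$, $\rho = -2$. Mean, MSE and REFF of the ln VaR estimators. The highest REFF values are in bold.

| Estimator | Statistic | n = 500 | n = 1000 | n = 2000 | n = 5000 |
| --- | --- | --- | --- | --- | --- |
| $\ln \mathrm{VaR}_p$, $p = 1/(2n)$ | | 13.1224 | 14.5087 | 15.8949 | 17.7275 |
| $k_0$ | | 170 | 269 | 515 | 1071 |
| $\hat\alpha(n)$ = LR | | −2.7893 | −2.8417 | −2.8687 | −3.1684 |
| $\ln q_H$ | Mean (MSE) | 13.6276 (1.3232) | 14.9415 (0.9745) | 16.2733 (0.6548) | 18.0099 (0.3833) |
| | REFF | 1 | 1 | 1 | 1 |
| $\ln Q_H$ | Mean (MSE) | 13.4502 (0.9965) | 14.8004 (0.7283) | 16.1618 (0.4779) | 17.9283 (0.2804) |
| | REFF | 1.1523 | 1.1567 | 1.2412 | 1.1693 |
| $\ln \hat Q_{\bar H}$ | Mean (MSE) | 13.1933 (0.8926) | 14.5960 (0.6477) | 15.9719 (0.4491) | 17.7717 (0.2751) |
| | REFF | 1.2175 | 1.2267 | 1.2075 | 1.1804 |
| $\ln \hat Q_{new,\bar H}$ | Mean (MSE) | 13.0009 (0.6007) | 14.4907 (0.3680) | 15.8926 (0.3070) | 17.6429 (0.2127) |
| | REFF | **1.4841** | **1.6274** | **1.4606** | **1.3426** |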

Figure 1, Figure 2 and Figure 3 are based on the results in Table 2, Table 3 and Table 4. Figure 1 is for Fréchet (0.25), with $N = 500$ iterations, sample size $n = 1000$, $\gamma = 0.25$, $\rho = -1$, $\beta = 0.5$ and $p = 1/(2n)$. The new estimator $\ln \hat Q^{(p)}_{New,\hat\gamma}$ has the best performance, with the least bias and RMSE, and it does not change as $k$ varies. Figure 2 and Figure 3 are for GPD (0.5) and GPD (2), with $N = 500$ iterations, sample size $n = 1000$, $\gamma = 0.5$ and $2$, $\rho = -\gamma$, $\beta = 1$ and $p = 1/(2n)$. Here too the new estimator $\ln \hat Q_{new,\bar H}$ is the best estimator, with the least bias, stability as $k$ varies, and the smallest RMSE. Note that the values of $\ln \hat Q^{(p)}_{New,\hat\gamma}$ are very close to the true $\ln \mathrm{VaR}_p$ values.

Figure 1. Underlying Fréchet (0.25), $\rho = -1$, $\beta = 0.5$, $N = 500$, $n = 1000$. (a) The means of the ln-quantile estimators, with the true $\ln \mathrm{VaR}_{0.0005} \approx 1.9$ ($\ln \hat Q^{(0.0005)}_{new,\bar H} \approx 1.88$). (b) The RMSE of the ln-quantile estimators, $p = 0.0005$, $\hat\alpha = 1.14$.

Figure 2. Underlying GPD (0.5), $\rho = -0.5$, $\beta = 1$, $N = 500$, $n = 1000$. (a) The means of the ln-quantile estimators, with the true $\ln \mathrm{VaR}_{0.0005} \approx 4.47$ ($\ln \hat Q^{(0.0005)}_{new,\bar H} \approx 4.42$). (b) The RMSE of the ln-quantile estimators, $p = 0.0005$, $\hat\alpha = -0.7482$.

Figure 3. Underlying GPD (2), $\rho = -2$, $\beta = 1$, $N = 500$, $n = 1000$. (a) The means of the ln-quantile estimators, with the true $\ln \mathrm{VaR}_{0.0005} \approx 14.51$ ($\ln \hat Q^{(0.0005)}_{new,\bar H} \approx 14.49$). (b) The RMSE of the ln-quantile estimators, $p = 0.0005$, $\hat\alpha = -2.8417$.

4.4. Simulations of Confidence Intervals

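Following Gomes and Pestana (2007) [17], the 95% confidence interval of the true tail index using $H$ is

$$\left(LCL_H(k),\ UCL_H(k)\right) = \left(\frac{H(k)}{1 + \frac{\beta(n/k)^{\rho}}{1-\rho} + \frac{1.96}{\sqrt{k}}},\ \frac{H(k)}{1 + \frac{\beta(n/k)^{\rho}}{1-\rho} - \frac{1.96}{\sqrt{k}}}\right) \tag{28}$$

and the 95% confidence interval of the true tail index using $\bar H$ is

$$\left(LCL_{\bar H}(k),\ UCL_{\bar H}(k)\right) = \left(\frac{\bar H(k)}{1 + \frac{1.96}{\sqrt{k}}},\ \frac{\bar H(k)}{1 - \frac{1.96}{\sqrt{k}}}\right). \tag{29}$$

Next, we compute the confidence intervals for the true ln-quantile using the quantile estimators. We use three of the four quantile estimators in Table 1, omitting $\ln q_H$, which has the worst results. We therefore compare the CIs based on $\ln Q_H$, $\ln \tilde Q_{\bar H}$ and $\ln \hat Q_{new,\bar H}$, given in (30), (31) and (23), respectively.

  • (1)
    The 95% confidence interval for the true $\ln \mathrm{VaR}_p$ using $\ln Q_H$ is
    $$\begin{aligned} LCL_{\ln Q_H}(k) &= \min\left\{\ln Q_H(k) - LCL_H(k)\ln\frac{k}{np}\,b_2,\ \ln Q_H(k) - UCL_H(k)\ln\frac{k}{np}\,b_2\right\}; \\ UCL_{\ln Q_H}(k) &= \max\left\{\ln Q_H(k) + LCL_H(k)\ln\frac{k}{np}\,b_1,\ \ln Q_H(k) + UCL_H(k)\ln\frac{k}{np}\,b_1\right\}, \end{aligned} \tag{30}$$
    where $LCL_H(k)$ and $UCL_H(k)$ are given in (28), and
    $$b_1 = \frac{1.96}{\sqrt{k}} - \frac{\beta(n/k)^{\rho}}{1-\rho}, \quad b_2 = \frac{1.96}{\sqrt{k}} + \frac{\beta(n/k)^{\rho}}{1-\rho}.$$
  • (2)
    The 95% confidence interval for the true $\ln \mathrm{VaR}_p$ using $\ln \tilde Q_{\bar H}$ is
    $$LCL_{\ln \tilde Q_{\bar H}}(k) = \ln \tilde Q_{\bar H} - UCL_{\bar H}(k)\ln\frac{k}{np}\,\frac{1.96}{\sqrt{k}}, \quad UCL_{\ln \tilde Q_{\bar H}}(k) = \ln \tilde Q_{\bar H} + UCL_{\bar H}(k)\ln\frac{k}{np}\,\frac{1.96}{\sqrt{k}}, \tag{31}$$
    where $LCL_{\bar H}(k)$ and $UCL_{\bar H}(k)$ are given in (29).
  • (3)

    The 95% confidence interval for the true $\ln \mathrm{VaR}_p$ using $\ln \hat Q_{new,\bar H}$ is given in (23).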
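For the reduced-bias estimator, the tail-index interval and the resulting ln-quantile interval can be sketched as below. This follows the readings of (29) and (31) with denominators $1 \pm 1.96/\sqrt{k}$; the function names and the inputs are illustrative assumptions.

```python
import math


def hbar_interval(h_bar, k, z=1.96):
    """(LCL, UCL) for the tail index based on H-bar, as in (29)."""
    s = z / math.sqrt(k)
    if s >= 1:
        raise ValueError("k is too small: need sqrt(k) > z")
    return h_bar / (1 + s), h_bar / (1 - s)


def ln_quantile_interval(ln_q, h_bar, k, n, p, z=1.96):
    """(LCL, UCL) for ln VaR_p around a reduced-bias estimate, as in (31)."""
    _, ucl_h = hbar_interval(h_bar, k, z)
    half_width = ucl_h * math.log(k / (n * p)) * z / math.sqrt(k)
    return ln_q - half_width, ln_q + half_width


# Illustrative inputs (assumption): n = 1000, p = 0.0005, k = 395, H-bar = 0.25.
lcl_h, ucl_h = hbar_interval(0.25, 395)
lo31, hi31 = ln_quantile_interval(1.9, 0.25, 395, 1000, 0.0005)
```

Unlike the interval for the new estimator, both the centre and the width here change with $k$.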

To compare the proposed CI in (23) with the CIs in (30) and (31), we evaluate the length and the probability coverage of the CIs.

The length of a CI is given as

$$\text{length of CI} = UCL_{\text{quantile estimator}} - LCL_{\text{quantile estimator}},$$

and the efficiency of the length of the 95% CI is given as

$$EFF_{length} = \frac{\text{C.I. length of } \ln Q_H \text{ at } k_0^{Q_H}}{\text{C.I. length of } \ln \bar Q_{\bar H} \text{ or } \ln \hat Q_{new,\bar H} \text{ at } k_{01}}. \tag{32}$$

A confidence interval is also more efficient when it has a higher coverage of the true value under the simulations, where the probability coverage of the 95% CI is defined as

$$P.C. = \frac{\text{number of 95\% CIs that contain the true value}}{\text{number of 95\% CIs simulated in total}} \times 100\%,$$

and the efficiency of the probability coverage of the 95% CI is given as

$$EFF_{P.C.} = \frac{\left|P.C._{\ln Q_H} - 95\%\right|}{\left|P.C._{\ln \bar Q_{\bar H}} \text{ or } P.C._{\ln \hat Q_{new,\bar H}} - 95\%\right|}. \tag{33}$$

A larger $EFF_{P.C.}$ means a more efficient interval.
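The three comparison measures above reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and checks the two efficiency ratios against entries reported in Table 5.

```python
def probability_coverage(intervals, true_value):
    """Percentage of simulated (LCL, UCL) intervals containing the true value."""
    hits = sum(1 for lcl, ucl in intervals if lcl <= true_value <= ucl)
    return 100.0 * hits / len(intervals)


def eff_length(length_ln_qh, length_other):
    """Length efficiency: CI length of ln Q_H divided by the other CI length."""
    return length_ln_qh / length_other


def eff_pc(pc_ln_qh, pc_other, nominal=95.0):
    """Coverage efficiency: ratio of absolute deviations from the nominal level."""
    return abs(pc_ln_qh - nominal) / abs(pc_other - nominal)


# Table 5, Frechet(0.25): lengths 0.5142 and 0.2668, coverages 94.2% and 99.6%.
eff_len_demo = eff_length(0.5142, 0.2668)  # Table 5 lists 1.9275
eff_pc_demo = eff_pc(94.2, 99.6)           # Table 5 lists 0.1739
```

Note that this coverage efficiency measures distance from the nominal 95% level in either direction, so an interval that over-covers is penalized in the same way as one that under-covers.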

Figure 4, Figure 5 and Figure 6 show the 95% confidence intervals of the three ln-quantile estimators under Fréchet (0.25), GPD (0.5) and GPD (2) with $p = 0.0005$. We compare the length and the probability coverage of each confidence interval at its optimal $k$ level. Recall that the optimal $k$ level for $\ln Q_H$ is $k_0^{Q_H}$ from (16), and the optimal $k$ level for $\ln \tilde Q_{\bar H}$ and $\ln \hat Q_{new,\bar H}$ is $k_{01}$ from (17).

Figure 4. Fréchet (0.25) model, 95% confidence intervals of the quantile estimators, $N = 500$, $n = 1000$, $p = 0.0005$, $\beta = 0.5$, $\rho = -1$, $\hat\alpha(1000) = LR = 1.14$, $k_0^{Q_H} = 165$, $k_{01} = 395$. Note that $\ln \hat Q_{new,\bar H}$ (purple) has the shortest CI, with length 0.2668. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level.)

Figure 5. GPD (0.5) model, 95% confidence intervals of the quantile estimators, $N = 500$, $n = 1000$, $p = 0.0005$, $\beta = 1$, $\rho = -0.5$, $\hat\alpha(1000) = \hat\rho = -0.7482$, $k_0^{Q_H} = 28$, $k_{01} = 93$. Note that $\ln \hat Q_{new,\bar H}$ (purple) has the shortest CI, with length 0.7094. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level.)

Figure 6. GPD (2) model, 95% confidence intervals of the quantile estimators, $N = 500$, $n = 1000$, $p = 0.0005$, $\beta = 1$, $\rho = -2$, $\hat\alpha(1000) = LR = -2.8417$, $\hat k_0 = 75$, $\hat k_0^{Q_H} = 80$, $\hat k_{01} = 70$. Note that $\ln \hat Q_{new,\bar H}$ (purple) has the shortest CI, with length 2.2511. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level.)

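Table 5 compares the efficiencies of the 95% CIs of the three quantile estimators under Fréchet (0.25), GPD (0.5) and GPD (2). The efficiency of a 95% CI is assessed by its length and its probability coverage, denoted by $EFF_{length}$ and $EFF_{P.C.}$.

Table 5. $N = 500$, $n = 1000$, efficiencies of the 95% CI for $\ln \mathrm{VaR}_{0.01}$.

| Model | CI of | At optimal $k$ | Length | $EFF_{length}$ | Probability coverage | $EFF_{P.C.}$ |
| --- | --- | --- | --- | --- | --- | --- |
| Fréchet (0.25) | $\ln Q_H$ | $k_0^{Q_H} = 165$ | 0.5142 | 1 | 94.2% | 1 |
| | $\ln \bar Q_{\bar H}$ | $k_{01} = 395$ | 0.3564 | 1.4517 | 96.7% | 0.4706 |
| | $\ln \hat Q_{new,\bar H}$ | $k_{01} = 395$ | 0.2668 | 1.9275 | 99.6% | 0.1739 |
| GPD (0.5) | $\ln Q_H$ | $k_0^{Q_H} = 28$ | 2.4922 | 1 | 47.4% | 1 |
| | $\ln \hat Q_{\bar H}$ | $k_{01} = 93$ | 1.5204 | 1.6392 | 79.0% | 2.9750 |
| | $\ln \hat Q_{new,\bar H}$ | $k_{01} = 93$ | 0.7094 | 3.5130 | 99.6% | 10.3478 |
| GPD (2) | $\ln Q_H$ | $k_0^{Q_H} = 270$ | 3.4410 | 1 | 79.7% | 1 |
| | $\ln \hat Q_{\bar H}$ | $k_{01} = 511$ | 2.7291 | 1.2609 | 83.2% | 1.2966 |
| | $\ln \hat Q_{new,\bar H}$ | $k_{01} = 511$ | 2.2511 | 1.5286 | 99.6% | 3.3261 |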

In this section, we compared the new quantile estimator $\ln \hat Q^{(p)}_{new,\bar H}$ in (19) with the existing methods. $\ln \hat Q^{(p)}_{new,\bar H}$ has the least bias and the smallest RMSE, and it depends little on $k$. In most cases its 95% confidence interval also has the shortest length and the highest probability coverage. The simulation results indicate that $\ln \hat Q_{new,\bar H}$ is the best quantile estimator among the three methods. In the next section, we apply the new estimator $\ln \hat Q^{(p)}_{new,\bar H}$ to real-world examples.

5. Applications

We study two real-world examples in this section. In each example, we are interested in the population above the threshold. The goal is to estimate the $(1-p)$th high quantiles, where $0 < p < 1$ is very small. We use the four quantile estimators in Table 1, $\ln q_H$, $\ln Q_H$, $\ln \hat Q_{\bar H}$ and $\ln \hat Q_{new,\hat\gamma}$ in (19), and compare their performance.

  1. Procedure:

    • Step 1:

      Choose and collect data of examples of real life extreme events.

    • Step 2:

      Run goodness-of-fit tests to check whether the data are heavy-tailed.

    • Step 3:

      Estimate the high quantiles and construct the confidence intervals by using the new method and the existing methods.

  2. Estimators

    1. Two tail index estimators H(k) in (5) and H¯(k) in (9).

    2. Four quantile estimators (6), (7), (12) and (19) are in Table 1.

    3. We use α^(n) in (27) for the new estimator lnQ^new,H¯(p) in (19) for the GPD model.

Remark 4.

In applications, the GPD is used as a tail approximation to the population distribution from which a sample of excesses $x - \mu$ above some suitably high threshold $\mu$ is observed. The GPD is parameterized by location, scale and shape parameters $\mu$, $\lambda > 0$ and $\gamma$, and can equivalently be specified in terms of threshold excesses $x - \mu$ or, as here, exceedances $x > \mu$, as the three-parameter $(\gamma, \mu, \lambda)$ GPD in (34) (de Zea Bermudez and Kotz, 2010) [23],

$$H_{\gamma}(x) = 1 - \left(1 + \gamma\,\frac{x - \mu}{\lambda}\right)^{-\frac{1}{\gamma}}, \quad 0 < \mu < x < \left(0 \vee (-\gamma)\right)^{-1}, \quad \lambda > 0. \tag{34}$$

Traditionally, the threshold was chosen before fitting, giving the so-called fixed threshold approach (Pickands, 1975 [25], Balkema and de Haan, 1974 [26]). It is common for practitioners to assume a constant quantile level, determined by some assessment of fit across all or a subset of the data sets (Scarrott and McDonald, 2012, p. 36 [27]). In our applications, the threshold is pre-determined by physical considerations: the number of type A flu viruses detected weekly in Canada above the flu-season average, and the counts of gamma rays released from significant solar flares (M and X rated) during the Sun’s active years. Although it is possible to choose the threshold by some arbitrary definition, it is preferable not to become involved with such a delicate question. The application of the proposed method is presented in both examples for illustrative purposes.

5.1. Flu in Canada Example

According to the WHO (World Health Organization, 2020 [28]), seasonal influenza is a common infection of the airways and lungs that can spread easily among humans. There are 37 million people in Canada, and flu season usually runs from November to April. Most people recover from the flu in about a week. However, influenza may be associated with serious complications such as pneumonia, especially in infants, the elderly and those with underlying medical conditions like diabetes, anemia, cancer, and immune suppression. On average, the flu and its complications send about 12,200 Canadians to the hospital every year, and around 3500 Canadians die. There are 3 types of flu viruses, A, B and C. Type A flu virus is the most harmful, and it is constantly changing and is generally responsible for the large flu epidemics. The 1918 Spanish Flu, 1957 Asian Flu, 1968 Hong Kong Flu, 2009 Swine flu, and the most recent 2014 H5N1 Bird Flu are all type A flu. In this paper, we study type A viruses in Canada.

We collected the number of the type A flu viruses detected weekly in Canada, from 1 January 1997 to 31 December 2019, resulting in a sample size of n*= 994 weeks. According to the WHO, the average number of type A flu viruses detested per week in the flu season, November to April, is 953, for the past 10 years. We set 953 viruses/week as the threshold, which reduced our sample size to n= 111 weeks. Full data-set is available at http://apps.who.int/influenza/gisrs_laboratory/flunet/en.

Figure 7a shows a chart of the type A flu viruses detected in Canada over n* = 994 weeks, together with the n = 111 weeks that remain above the threshold of 953 flu viruses. Within each flu period, the virus can persist from one week up to a few weeks, which is why some arches in this figure are narrow and others are more bell shaped. The top three weeks are circled in the plot. Figure 7b shows a histogram of the n* = 994 weeks of data. We are interested in the 99% quantile, $x_{0.99}$, such that there is a 99% chance that the number of viruses detected in a given week is less than this value or, equivalently, a 1% chance that the number of flu viruses detected in a given week exceeds this value. This information is useful for monitoring and studying the virus, and is also helpful for medical organizations that deal with disease control and prevention, pharmaceutical availability, and hospital resource readiness, especially during a serious flu outbreak. The approximate location of $\hat{x}_{0.99}$ is marked in the plot. In this paper, we propose a new method for estimating high quantiles and compare it with existing methods.

Figure 7.


Flu original data from 1 January 1997 to December 31 2019, n*= 994 weeks, (a) Flu chart of type A flu viruses detected in Canada, and n= 111 weeks remaining after the threshold, of average 953 flu viruses. (b) Histogram of the number of type A flu viruses detected in Canada.

Our interest is to find the 5% VaR and 1% VaR of the number of type A flu viruses detected in a week, and their 95% confidence intervals.

5.1.1. Goodness-of-Fit Test

We transform the data as $Y_i = (X_i - \mu)/\lambda$, $i = 1, \ldots, n$, $n = 111$. Taking $\mu = 953$ as the threshold, the maximum likelihood estimators (MLE) are $\hat{\lambda}_{MLE} = 1275.97287$ and $\hat{\gamma}_{MLE} = 0.01345$. Figure 8a is the log-log plot of the GPD curve, with $\ln(x)$ on the horizontal axis against $\ln(P\{x < X\})$ on the vertical axis. Visually, the transformed data fit the one-parameter GPD in (26) best when using $\hat{\gamma}_{MLE}$ (red curve). Figure 8b shows that the GPD density curve (red curve) fits the histogram very well.
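The thresholding and transformation step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the sample counts are hypothetical, and keeping only observations strictly above the threshold is an assumption.

```python
def exceedances(values, mu):
    """Keep only the observations strictly above the threshold mu (assumed rule)."""
    return [x for x in values if x > mu]


def transform(values, mu, lam):
    """Apply Y_i = (X_i - mu) / lam to each retained observation."""
    return [(x - mu) / lam for x in values]


# Synthetic weekly counts, for illustration only (not the WHO data).
weekly_counts = [120, 980, 2400, 953, 5100, 40, 1500]
mu = 953.0           # threshold stated in the text
lam = 1275.97287     # scale estimate reported in the text

kept = exceedances(weekly_counts, mu)
y = transform(kept, mu, lam)
print(len(kept), [round(v, 4) for v in y])
```

All transformed values are positive, so their logarithms, which the ln-quantile estimators work with, are well defined.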

Figure 8.


Transformed flu data above the threshold of 953 flu viruses, n = 111. (a) Log-log plot for the flu in Canada example. (b) Estimated GPD curve, the 99% high quantile, and histogram of the distribution of type A flu viruses detected weekly.

Besides the visual check in Figure 8, we also carry out three goodness-of-fit tests: the Kolmogorov-Smirnov (K-S) test (Kolmogorov, 1933 [29]), the Anderson-Darling (A-D) test, and the Cramér-von Mises (C-v-M) test (Anderson and Darling, 1952 [30]). All three tests are based on the distance between the empirical distribution function of the observations and the parent distribution function, here the GPD; the K-S test uses the maximum vertical distance.

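The hypotheses for all three tests are

$$H_0: F(x) = F^{*}(x) \ \text{for all values of } x, \qquad H_1: F(x) \neq F^{*}(x) \ \text{for at least one value of } x.$$

$F(x)$ is the true but unknown distribution of the sample. $F^{*}(x)$ is the theoretical distribution, which in our case is the parent distribution, the GPD. $S_n(x)$ is the empirical distribution function, a step function of the sample. It is defined as

$$S_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i), \quad \text{where } I_A(x) = \begin{cases} 1, & \text{if } x \in A; \\ 0, & \text{if } x \notin A, \end{cases}$$

where $-\infty < x < \infty$ and $0 \le S_n(x) \le 1$.

The test statistic of the K-S test under $H_0$ is

$$T = \sup_x \left| F^{*}(x) - S_n(x) \right|. \tag{35}$$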
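Because $S_n$ is a step function, the supremum in the K-S statistic is attained at one of its jumps, so it can be computed from the order statistics alone. The sketch below is illustrative only: the function names and sample values are hypothetical, and the GPD form $F(y) = 1 - (1 + \gamma y)^{-1/\gamma}$ is an assumed parametrization that may differ from the one in (26).

```python
def ks_statistic(sample, cdf):
    """Return T = sup_x |F*(x) - S_n(x)| for a continuous hypothesized cdf."""
    xs = sorted(sample)
    n = len(xs)
    t = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # S_n jumps from (i-1)/n to i/n at x, so check both sides of the jump.
        t = max(t, i / n - f, f - (i - 1) / n)
    return t


def gpd_cdf(gamma):
    """Assumed one-parameter GPD cdf, F(y) = 1 - (1 + gamma*y)^(-1/gamma), gamma > 0."""
    def cdf(y):
        if y <= 0.0:
            return 0.0
        return 1.0 - (1.0 + gamma * y) ** (-1.0 / gamma)
    return cdf


# Synthetic transformed observations, for illustration only.
y = [0.05, 0.31, 0.62, 1.10, 2.40]
print(ks_statistic(y, gpd_cdf(0.01345)))
```

The p-value is not computed here; in practice it would come from the Kolmogorov distribution or from a statistical library.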

Based on the goodness-of-fit test results in Table 6, we adopt the GPD model for the flu in Canada data. We define the absolute errors (AE) in (34) and integrated errors (IE) in (35) as

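$$IE = \left[ \frac{1}{X_{n:n} - X_{n-r+1:n}} \int_{X_{n-r+1:n}}^{X_{n:n}} \left( S_n(x) - F^{*}(x) \right)^{2} dx \right]^{1/2}. \tag{36}$$

Table 6.

The goodness-of-fit tests under the GPD model for the flu in Canada data.

| Estimator | K-S test statistic | K-S p-value | A-D test statistic | A-D p-value | C-v-M test statistic | C-v-M p-value |
|---|---|---|---|---|---|---|
| $\hat{\gamma}_{MLE}$ | 0.0628 | 0.6406 | 0.4475 | 0.8007 | 0.0621 | 0.8006 |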

For both AE and IE, we use three different values of $r$, taking the top $r = n/10$, $r = n/2$, and $r = n$ order statistics. Table 7 lists the AE and IE values, which are very small.

Table 7.

AE and IE under the GPD model for the flu in Canada data using $\hat{\gamma}_{MLE}$, over the $r$ highest amounts of type A viruses.

| Estimator | AE, r = 12 | AE, r = 56 | AE, r = 111 | IE, r = 12 | IE, r = 56 | IE, r = 111 |
|---|---|---|---|---|---|---|
| $\hat{\gamma}_{MLE}$ | 0.0450 | 0.0450 | 0.0628 | 0.0085 | 0.0071 | 0.0074 |
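The integrated error compares $S_n$ and $F^{*}$ over the range of the $r$ largest order statistics. A minimal numerical sketch is given below, assuming a simple midpoint rule; the function names and the `steps_per_gap` parameter are hypothetical and not part of the paper.

```python
import math


def empirical_cdf(xs_sorted, x):
    """S_n(x): fraction of observations less than or equal to x."""
    count = 0
    for v in xs_sorted:
        if v <= x:
            count += 1
    return count / len(xs_sorted)


def integrated_error(sample, cdf, r, steps_per_gap=200):
    """Approximate IE over [X_{n-r+1:n}, X_{n:n}] with a midpoint rule (r >= 2)."""
    xs = sorted(sample)
    n = len(xs)
    top = xs[n - r:]                # the r largest order statistics
    lo, hi = top[0], top[-1]
    total = 0.0
    # S_n is constant between consecutive order statistics, so integrate gap by gap.
    for a, b in zip(top[:-1], top[1:]):
        h = (b - a) / steps_per_gap
        for m in range(steps_per_gap):
            mid = a + (m + 0.5) * h
            total += (empirical_cdf(xs, mid) - cdf(mid)) ** 2 * h
    return math.sqrt(total / (hi - lo))


# Toy check: two points and a uniform cdf give sqrt(1/12).
print(integrated_error([0.0, 1.0], lambda x: x, r=2))
```

The toy check has a closed form: with $S_n = 1/2$ on $[0, 1)$ and $F^{*}(x) = x$, the integral of $(1/2 - x)^2$ over $[0, 1]$ equals $1/12$.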

Next, we estimate the high quantiles and their confidence interval for this example.

5.1.2. Compare Four Estimation Methods

We use the four estimators in Table 1: $\ln q_H$, $\ln Q_H$, $\ln Q_{\bar{H}}$, and the new estimator $\ln \hat{Q}_{new,\bar{H}}$.

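We use $\hat{\rho}_\tau(k)$ in (10) and $\hat{\beta}_{\hat{\rho}_0}(k)$ in (11). To decide whether the tuning parameter $\tau$ is 0 or 1, consider $\{\hat{\rho}_\tau(k)\}_{k \in \mathcal{K}}$ for $\mathcal{K} = (n^{0.995}, n^{0.999})$, and compute their median $x_\tau$; then

$$\tau = \arg\min_{\tau} \sum_{k \in \mathcal{K}} \left( \hat{\rho}_\tau(k) - x_\tau \right)^{2}.$$

With $n = 111$, we get $\mathcal{K} = (108, 110)$ and $x_\tau = 109$; then $\sum_{k \in \mathcal{K}} (\hat{\rho}_0(k) - x_\tau)^2 \approx 36116 < \sum_{k \in \mathcal{K}} (\hat{\rho}_1(k) - x_\tau)^2 \approx 37033$, so we conclude that $\tau = 0$. Thus we have $\hat{\rho}_0(k_1) = 0.7101$ and $\hat{\beta}_{\hat{\rho}_0}(k_1) = 1.026571$, where $k_1$ is the optimal $k$ value. Figure 9 shows the results.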
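The selection rule for the tuning parameter can be sketched as below. The function names and the example paths of second-order estimates are hypothetical, and taking the integer parts of $n^{0.995}$ and $n^{0.999}$ is an assumption, although it does give the range (108, 110) quoted for $n = 111$.

```python
import math
from statistics import median


def k_range(n):
    """Integer end points of K = (n^0.995, n^0.999), taking integer parts (assumed)."""
    return math.floor(n ** 0.995), math.floor(n ** 0.999)


def choose_tau(rho_by_tau):
    """Pick the tau whose estimates over K deviate least from their own median.

    rho_by_tau maps each candidate tau (0 or 1) to the list of rho_hat_tau(k), k in K.
    """
    def spread(values):
        centre = median(values)
        return sum((v - centre) ** 2 for v in values)

    return min(rho_by_tau, key=lambda tau: spread(rho_by_tau[tau]))


print(k_range(111))
# Hypothetical estimate paths, for illustration only.
paths = {0: [-0.71, -0.70, -0.72], 1: [-0.95, -0.60, -0.80]}
print(choose_tau(paths))
```

The rule favours the value of $\tau$ for which the estimates are most stable across the largest values of $k$.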

Figure 9.


For the flu in Canada data, n = 111: (a) estimates of the second-order parameter, $\hat{\rho}$ and $\hat{\rho}_\tau(k)$, $\tau = 0$; (b) estimates $\hat{\beta}$ and $\hat{\beta}_{\hat{\rho}_0}(k)$; (c) tail index estimators $H$, $\bar{H}$; (d) ln-quantile estimators, $p = 0.01$. The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level.

Figure 9a shows estimates of the second-order parameter $\rho$ through $\hat{\rho}$ and $\hat{\rho}_\tau(k)$, $\tau = 0$; Figure 9b shows the estimates $\hat{\beta}$ and $\hat{\beta}_{\hat{\rho}_0}(k)$. Figure 9c shows the two estimated tail indices, $H$ and $\bar{H}$: $H = 0.4379$ at its optimal level using $\hat{k}_0 = 21$ based on (15), and $\bar{H} = 0.3736$ at its optimal level using $\hat{k}_{01} = 42$ based on (17). Figure 9d shows the four quantile estimators for the flu in Canada example, with $p = 0.01$. The full circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level. We note that $\ln \hat{Q}^{(p)}_{new,\bar{H}}$ has a constant value, which does not depend on $k$.

Figure 10 compares the confidence intervals of the three quantile estimators in (7), (12) and (19). This figure shows that the new quantile estimator $\ln \hat{Q}^{(p)}_{new,\bar{H}}$ has the shortest confidence interval, with length 0.7966, where we use $\hat{\alpha} = \hat{\rho} = 0.7101$. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level).

Figure 10.


95% confidence intervals of three ln-quantile estimators above the threshold 953 for the flu in Canada example, n = 111, p = 0.01. Note that $\ln \hat{Q}_{new,\bar{H}}$ (purple) has the shortest CI, with length 0.7966. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level).

In Table 8, we compare the ln-quantile estimators and their mean, median, $VaR_{0.05}$ and $VaR_{0.01}$. Table 9 compares the size of the confidence intervals for $\ln VaR_{0.01}$ and $VaR_{0.01}$ under the three quantile estimators.

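Table 8.

Estimated $VaR_{0.05}$ and $VaR_{0.01}$ for the flu in Canada data. (Unit: Type A flu viruses).

| Estimation | $\hat{\alpha}$ | $\hat{\gamma}$ | Mean | Median | $VaR_{0.05}$ | $VaR_{0.01}$ |
|---|---|---|---|---|---|---|
| $\ln Q_H$ | N/A | $H = 0.4370$ | 3219.29 | 2257.03 | 4519.70 | 8159.10 |
| $\ln Q_{\bar{H}}$ | N/A | $\bar{H} = 0.3736$ | 2989.93 | 2130.78 | 3736.80 | 6031.79 |
| $\ln \hat{Q}_{new,\bar{H}}$ | $\hat{\rho} = 0.7101$ | $\bar{H} = 0.3736$ | 2989.93 | 1690.07 | 2924.80 | 5499.85 |

Table 9.

The 95% confidence interval for $\ln VaR_{0.01}$ and $VaR_{0.01}$. Values in parentheses are on the original scale.

| Estimation method | $k$ | LCL | $\ln VaR_{0.01}$ ($VaR_{0.01}$) | UCL | Length | EFF |
|---|---|---|---|---|---|---|
| $\ln Q_H$ | $\hat{k}_0 = 21$ | 0.6920 | 1.7312 | 2.1452 | 1.4531 | 1 |
| ($Q_H$) | | (3502.14) | (8159.10) | (11854.31) | (8352.17) | (1) |
| $\ln Q_{\bar{H}}$ | $\hat{k}_{01} = 42$ | 0.7929 | 1.3814 | 1.9698 | 1.1770 | 1.2346 |
| ($Q_{\bar{H}}$) | | (3772.58) | (6031.78) | (10101.19) | (6328.21) | (1.3197) |
| $\ln Q_{new,\bar{H}}$ | $\hat{k}_{01} = 42$ | 0.8724 | 1.2707 | 1.6690 | 0.7966 | 1.8242 |
| ($Q_{new,\bar{H}}$) | | (4006.07) | (5499.85) | (7724.49) | (3718.42) | (2.2462) |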

In Table 9, we compare $Q_H$, $Q_{\bar{H}}$ and $Q_{new,\bar{H}}$; $Q_{new,\bar{H}}$ has the shortest confidence interval, with the highest efficiency of 2.2462.
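The two scales in Table 9 appear to be linked by inverting the transformation $Y = (X - \mu)/\lambda$, that is, $X = \mu + \lambda \exp(\ln Q)$, and the EFF column appears to equal the ratio of the reference CI length to each estimator's CI length. Both relations are inferred from the reported numbers rather than stated in the text, and the function names below are hypothetical.

```python
import math

MU = 953.0            # threshold stated in the text
LAM = 1275.97287      # scale estimate reported in Section 5.1.1


def to_original_scale(ln_q, mu=MU, lam=LAM):
    """Invert Y = (X - mu) / lam for a quantile given on the ln(Y) scale."""
    return mu + lam * math.exp(ln_q)


def length_ratio(reference_length, length):
    """Ratio of CI lengths; values above 1 mean a shorter interval than the reference."""
    return reference_length / length


# ln-scale rows of Table 9: (LCL, estimate, UCL).
rows = {
    "ln Q_H": (0.6920, 1.7312, 2.1452),
    "ln Q_Hbar": (0.7929, 1.3814, 1.9698),
    "ln Q_new,Hbar": (0.8724, 1.2707, 1.6690),
}
ref = rows["ln Q_H"]
for name, (lcl, est, ucl) in rows.items():
    x_lcl, x_est, x_ucl = (to_original_scale(v) for v in (lcl, est, ucl))
    print(name, round(x_lcl, 1), round(x_est, 1), round(x_ucl, 1),
          round(length_ratio(ref[2] - ref[0], ucl - lcl), 4))
```

Small differences from the tabulated values are expected, because the ln-scale inputs are rounded to four decimals.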

5.1.3. Summary

Based on Figure 10 and Table 9, we conclude that the new estimator $\ln \hat{Q}_{new,\bar{H}}$ in (19) is the best estimator for the flu in Canada example. We estimate $VaR_{0.01}$ at about 5500 type A flu viruses per week during a flu outbreak, for weeks above the threshold of 953 viruses/week. This is shown in Figure 8b.

5.2. Gamma Ray of Solar Flare Example

Gamma rays have the most penetrating power of all radiation. Bursts of gamma rays, thought to be due to the collapse of stars called hypernovas, are the most powerful events so far discovered in the cosmos. Gamma rays are measured in counts, that is, the number of atoms in a given quantity of radioactive material that are detected by an instrument to have decayed. We have collected gamma ray data from solar flares, from November 2008 to September 2020, from NASA (National Aeronautics and Space Administration, 2020 [31]). The full data set is available at http://hesperia.gsfc.nasa.gov/fermi/gbm/qlook/fermi_gbm_flare_list.txt.

A solar flare travels at hundreds of miles per second and can reach the Earth within hours. It can disrupt communication and navigational equipment, damage satellites, and even cause blackouts by damaging power plants. In 1989, a strong solar storm knocked out the power grid in Québec, Canada, causing 6 million people to lose power for more than 9 hours, and it cost millions of dollars to repair. Solar flares can also bring additional radiation around the north and south poles, a risk that forces airlines to reroute flights. The Fermi Gamma-ray Space Telescope was launched in late 2008 to explore high-energy phenomena in the Universe. It is worth noting that more than one trigger may have occurred during a flare; only the one nearest the peak of the flare is listed, resulting in a sample size of 5128. Solar flares are classified as A, B, C, M or X according to the peak flux (in watts per square meter, W/m2) of 1 to 8 angstrom X-rays near the Earth, as measured on the GOES spacecraft (the angstrom is a unit of length equal to 1/10,000,000,000, one ten-billionth, of a meter). Gamma ray activity is correlated with X-ray activity, as shown in Figure 11 (NOAA, 2020 [32]). When the amount of gamma rays released is over 5 million counts, it usually corresponds to an X rated flare or a significant M rated flare.

Figure 11.


Two weeks plot of gamma ray & X ray from July 2 to 16, 2012.

Figure 12a shows a gamma ray chart of n* = 5128 flares, with n = 104 flares remaining above the threshold of 86 million counts. The most powerful gamma ray release occurred on March 7, 2012, with nearly 1.5 billion counts; the Sun brightened by 1000 times and became the brightest object in the gamma ray sky. The top three events are circled in the chart. Figure 12b shows a histogram of the n* = 5128 flares. We are interested in the 99% quantile, $x_{0.99}$, such that 99% of the gamma ray releases from solar flares are under this value or, equivalently, there is a 1% chance that the amount of gamma rays a solar flare releases exceeds this value. During the spring and fall, the satellites that are used to detect solar flares experience eclipses, in which the Earth or the Moon passes between the satellites and the Sun for a short period every day. Eclipse season lasts for about 45 to 60 days, and each eclipse ranges from minutes to just over an hour. The quantile estimation would provide useful predictions for these times. The approximate location of $\hat{x}_{0.99}$ is marked in the plot, since we do not know this value yet.

Figure 12.


Gamma ray original data from November 2008 to April 2017, n*=5182, (a) Gamma ray released V.S solar flare occurred. After the threshold of 86 million counts, n=104 flares remaining. (b) Histogram of gamma ray released from solar flares.

We chose the threshold as the mean of the data from the peak period. The solar cycle is every 11.6 years, and the sun’s activity peaked from 2011 to 2014. In Figure 12a we can see that the top 3 flares, in fact, almost 90% of the top 100 flares, are from the 2011 to 2014 time period. Taking the average of all the X rated and significant M rated flares from this peak period, we obtained a mean of 86 million counts, resulting in a remaining sample size of n=104.

For the Gamma ray of solar flare example, our goal is to find out the high quantiles, specifically, the 5% VaR and 1% VaR of the amount of gamma ray a solar flare would release, and their 95% confidence intervals.

5.2.1. Goodness-of-Fit Tests

As in the flu in Canada example, we set $\mu = 86$ million and obtain $\hat{\lambda}_{MLE} = 171.0708592$ and $\hat{\gamma}_{MLE} = 0.2580384847$. Figure 13a is a log-log plot of the gamma ray data under the GPD model, with $\ln(x)$ on the horizontal axis against $\ln(P\{x < X\})$ on the vertical axis. Figure 13b shows that the GPD model fits the histogram.

Figure 13.


Transformed data above the threshold of 86 million counts, n = 104. (a) Log-log plot for the gamma ray from solar flare example. (b) The estimated GPD and the 99% high quantile of the distribution of gamma rays released by solar flares.

Next, we perform three goodness-of-fit tests: the Kolmogorov-Smirnov test, the Anderson-Darling test and the Cramér-von Mises test. The results are listed in Table 10; the data fit the GPD with $\hat{\gamma}_{MLE}$ well, with p-values of up to nearly 59%.

Table 10.

The goodness-of-fit tests under the GPD model for the gamma ray data.

| Estimator | K-S test statistic | K-S p-value | A-D test statistic | A-D p-value | C-v-M test statistic | C-v-M p-value |
|---|---|---|---|---|---|---|
| $\hat{\gamma}_{MLE}$ | 0.0697 | 0.5750 | 0.7276 | 0.5362 | 0.0991 | 0.5893 |

In Table 11, all the errors are less than 0.07 for AE, and less than 0.01 for IE.

Table 11.

AE and IE under the GPD model for the gamma ray data using $\hat{\gamma}_{MLE}$, over the $r$ highest gamma ray releases.

| Estimator | AE, r = 11 | AE, r = 53 | AE, r = 104 | IE, r = 11 | IE, r = 53 | IE, r = 104 |
|---|---|---|---|---|---|---|
| $\hat{\gamma}_{MLE}$ | 0.0359 | 0.0697 | 0.0697 | 0.0062 | 0.0092 | 0.0089 |

Next, we can compare the four high quantile estimators and their confidence intervals of this example.

5.2.2. Compare Four Estimation Methods

As in the first example, we use the four quantile estimators in Table 1: $\ln q_H$, $\ln Q_H$, $\ln Q_{\bar{H}}$, and $\ln \hat{Q}_{new,\bar{H}}$.

We use $\hat{\rho}_\tau(k)$ and $\hat{\beta}_{\hat{\rho}_0}(k)$ with $\tau = 0$; thus we have $\hat{\rho}_0(k_1) = 0.7269$ and $\hat{\beta}_{\hat{\rho}_0}(k_1) = 1.0257$, where $k_1$ is the optimal $k$ value for the second-order parameters. The results are in Figure 14.

Figure 14.


For the gamma ray of solar flare example, n = 104: (a) estimates of the second-order parameter, $\hat{\rho}$ and $\hat{\rho}_\tau(k)$, $\tau = 0$; (b) estimates $\hat{\beta}$ and $\hat{\beta}_{\hat{\rho}_0}(k)$; (c) tail index estimators $H$, $\bar{H}$; (d) ln-quantile estimators, $p = 0.01$. The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level.

Figure 14a shows the estimates of the second-order parameter, $\hat{\rho}$ and $\hat{\rho}_\tau(k)$, $\tau = 0$. Figure 14b shows $\hat{\beta}$ and $\hat{\beta}_{\hat{\rho}_0}(k)$. Figure 14c shows the two different tail index estimators, $H$ and $\bar{H}$. We have $H = 0.5324$ at its optimal level with $\hat{k}_0 = 21$, and $\bar{H} = 0.6517$ at its optimal level with $\hat{k}_{01} = 41$. Figure 14d shows all four quantile estimators for the gamma ray example, with $p = 0.01$. We note that $\ln \hat{Q}_{new,\bar{H}}$ has a constant value, which does not depend on $k$.

Figure 15 compares the confidence intervals of the three ln-quantile estimators in (7), (12) and (19). This figure shows that the new quantile estimator $\ln \hat{Q}_{new,\bar{H}}$ has the shortest confidence interval, with length 1.4451, where we use $\hat{\alpha} = \hat{\rho} = 0.7269$. The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level.

Figure 15.


95% confidence intervals of three ln-quantile estimators above the threshold of 86 million counts for the gamma ray example, n = 104, p = 0.01. Note that $\ln \hat{Q}_{new,\bar{H}}$ (purple) has the shortest CI, with length 1.4451. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal $k$ level).

In Table 12, we compare the quantile estimators under $VaR_{0.05}$ and $VaR_{0.01}$. Table 13 compares the size of the confidence intervals of $\ln VaR_{0.01}$ and $VaR_{0.01}$ under the three quantile estimators.

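Table 12.

Estimated $VaR_{0.05}$ and $VaR_{0.01}$ in the gamma ray example. (Unit: million counts).

| Estimation method | $\hat{\alpha}$ | $\hat{\gamma}$ | Mean | Median | $VaR_{0.05}$ | $VaR_{0.01}$ |
|---|---|---|---|---|---|---|
| $\ln Q_H$ | N/A | $H = 0.5324$ | 451.82 | 315.27 | 867.12 | 1926.04 |
| $\ln Q_{\bar{H}}$ | N/A | $\bar{H} = 0.6517$ | 577.22 | 232.28 | 742.01 | 1958.67 |
| $\ln \hat{Q}_{new,\bar{H}}$ | $\hat{\rho} = 0.7269$ | $\bar{H} = 0.6517$ | 577.22 | 189.35 | 441.60 | 1102.57 |

Table 13.

The 95% confidence interval of $\ln VaR_{0.01}$ and $VaR_{0.01}$. Values in parentheses are on the original scale.

| Estimation method | $k$ | LCL | $\ln VaR_{0.01}$ ($VaR_{0.01}$) | UCL | Length | EFF |
|---|---|---|---|---|---|---|
| $\ln Q_H$ | $\hat{k}_0 = 21$ | 1.0807 | 2.3755 | 2.8864 | 1.8057 | 1 |
| ($Q_H$) | | (590.12) | (1926.04) | (3153.18) | (2563.06) | (1) |
| $\ln Q_{\bar{H}}$ | $\hat{k}_{01} = 41$ | 1.3367 | 2.3930 | 3.4494 | 2.1128 | 0.8547 |
| ($Q_{\bar{H}}$) | | (737.14) | (1958.67) | (5471.71) | (4734.56) | (0.5414) |
| $\ln Q_{new,\bar{H}}$ | $\hat{k}_{01} = 41$ | 1.0595 | 1.7821 | 2.5047 | 1.4451 | 1.2495 |
| ($Q_{new,\bar{H}}$) | | (579.55) | (1102.57) | (2179.85) | (1600.30) | (1.6016) |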

Table 13 shows that the new estimator has the shortest confidence interval compared to $\ln Q_H$ and $\ln Q_{\bar{H}}$, with the highest efficiency of 1.6016.

5.2.3. Summary

Based on Figure 15 and Table 13, we conclude that the new estimator $\ln \hat{Q}_{new,\bar{H}}$ in (19) is the best estimator for the gamma ray example. We predict that $VaR_{0.01}$ is a gamma ray release of 1102.57 million counts, which most likely corresponds to an X rated solar flare. This is shown in Figure 13b.

6. Conclusions

Based on the studies in this paper, we conclude that:

1. High quantile and its CI estimation provides important information for risk management and for extreme event predictions.

2. Based on the theoretical and simulation results, the proposed new method for estimating confidence intervals of high quantiles has advantageous properties compared with other existing methods. The estimation is consistent and stable, with less error. The proposed method also provides a useful computational algorithm for readers.

3. The confidence interval of high quantile obtained by the new proposed method also has the highest efficiency compared to the existing methods, in terms of having the smallest size of confidence interval, and the highest probability coverage of the true quantile values in most cases.

4. Based on the analysis of the two real-world examples, flu in Canada and gamma ray from the solar flare, we can see that the new proposed method can be applied to many more fields, including other extreme events such as insurance claims, natural disasters, stock market predictions and pandemic disease monitoring.

Acknowledgments

We are grateful for the comments of the reviewers and editor. They have helped us to improve the paper. We deeply appreciate the Brock Library Open Access Publishing Fund support.

Appendix A. Proofs of Theorems 1 and 2

Lemma 1.

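The sum of $\mathrm{Cov}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i), \ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)$, $i \neq j$, $i, j = 1, \ldots, n-1$, satisfies the inequality

$$\mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \mathrm{Cov}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i), \ln \bar{Q}^{(p)}_{\bar{H}}(j)\right) \le w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \sqrt{\mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i)\right) \mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)}, \tag{A1}$$

where $w$ is a weight

$$w = \max_{i \neq j} \rho^{+}_{ij}, \quad 0 \le w \le 1, \quad \text{where } \rho^{+}_{ij} = |\rho_{ij}|, \quad 0 \le \rho^{+}_{ij} \le 1;$$

$\rho_{ij}$ is the correlation coefficient of $\ln \bar{Q}^{(p)}_{\bar{H}}(i)$ and $\ln \bar{Q}^{(p)}_{\bar{H}}(j)$,

$$\rho_{ij} = \frac{\mathrm{Cov}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i), \ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)}{\sqrt{\mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i)\right) \mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)}}, \quad i \neq j, \ i, j = 1, \ldots, n-1, \quad -1 \le \rho_{ij} \le 1.$$

Proof of Lemma 1.

For each pair $(i, j)$, $i \neq j$, $i, j = 1, \ldots, n-1$,

$$\mathrm{Cov}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i), \ln \bar{Q}^{(p)}_{\bar{H}}(j)\right) = \rho_{ij} \sqrt{\mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i)\right) \mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)},$$

then we obtain a bound if we only add the positive $\rho^{+}_{ij}$ terms in the following summation:

$$\begin{aligned}
\mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \mathrm{Cov}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i), \ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)
&= \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \rho_{ij} \sqrt{\mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i)\right) \mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)} \\
&\le \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \rho^{+}_{ij} \sqrt{\mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i)\right) \mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)} \\
&\le w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \sqrt{\mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(i)\right) \mathrm{Var}\left(\ln \bar{Q}^{(p)}_{\bar{H}}(j)\right)}.
\end{aligned}$$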
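The inequality of Lemma 1 can be checked numerically on any covariance matrix. The sketch below is illustrative only: the function name and the matrix are hypothetical, and $w$ is taken as the largest absolute correlation over pairs $i \neq j$, which is one reading of the definition of $\rho^{+}_{ij}$.

```python
import math


def lemma1_sides(cov):
    """Return (lhs, rhs) of (A1) for a covariance matrix given as a list of lists."""
    m = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(m)]
    pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
    # w is taken as the largest absolute correlation over all pairs i != j (assumed).
    w = max(abs(cov[i][j]) / (sd[i] * sd[j]) for i, j in pairs)
    lhs = sum(cov[i][j] for i, j in pairs)
    rhs = w * sum(sd[i] * sd[j] for i, j in pairs)
    return lhs, rhs


# Hypothetical 3x3 covariance matrix, for illustration only.
cov = [
    [4.0, 1.2, -0.5],
    [1.2, 1.0, 0.3],
    [-0.5, 0.3, 2.25],
]
lhs, rhs = lemma1_sides(cov)
print(lhs, rhs, lhs <= rhs)
```

The bound is loose when correlations differ widely across pairs, since a single large correlation determines $w$.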

Lemma 2.

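Under conditions (C1) and (C2), for $\ln \bar{Q}^{(p)}_{\bar{H}}(k)$ in (14), by Theorem 5.1, formula (5.2) in Gomes and Pestana (2007, p.285 [17]), as $n \to \infty$,

$$\frac{\sqrt{k}}{\ln\left(\frac{k}{np}\right)} \left( \ln \bar{Q}^{(p)}_{\bar{H}}(k) - \ln VaR_p \right) \xrightarrow[n \to \infty]{d} \mathrm{Normal}\left(0, \gamma^{2}\right),$$

then the asymptotic expected value and variance are

$$E\left( \ln \bar{Q}^{(p)}_{\bar{H}}(k) - \ln VaR_p \right) \to 0, \quad \text{as } n \to \infty;$$

$$\mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(k) \right) \approx \frac{\gamma^{2} \left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k}, \quad \text{as } n \to \infty. \tag{A2}$$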

Proof of Theorem 1.

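Under conditions (C1) and (C2), in the Hall-Welsh class of models in (6), where $\bar{H}$ is in (8), we have

$$\sqrt{k} \left( \bar{H}_{\beta,\rho}(k) - \gamma \right) \xrightarrow[n \to \infty]{d} \mathrm{Normal}\left(0, \gamma^{2}\right),$$

and

$$\bar{H}_{\hat{\beta},\hat{\rho}}(k) \overset{d}{=} \gamma + \frac{\gamma}{\sqrt{k}} V_k + o_p\left( A(n/k) \right), \quad n \to \infty,$$

where $V_k$ is an asymptotically standard normal random variable.

By the Schwarz inequality and Lemma 1, formula (A1), since $\alpha$ is a constant in (19), and based on the asymptotic properties of $C_p(k; \hat{\beta}, \hat{\rho})$ in (13) (Gomes and Pestana, 2007, p.286 [17]), we have that

$$\begin{aligned}
\mathrm{Var}\left( \ln \hat{Q}^{(p)}_{New,\bar{H}} \right)
&= \mathrm{Var}\left( \frac{1}{n-1} \sum_{k=1}^{n-1} \ln \bar{Q}^{(p)}_{\bar{H}}(k) \right) \\
&= \frac{1}{(n-1)^{2}} \left\{ \sum_{k=1}^{n-1} \mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(k) \right) + \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \mathrm{Cov}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(i), \ln \bar{Q}^{(p)}_{\bar{H}}(j) \right) \right\} \\
&= \frac{1}{(n-1)^{2}} \left\{ \sum_{k=1}^{n-1} \mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(k) \right) + \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \rho_{ij} \sqrt{\mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(i) \right) \mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(j) \right)} \right\} \\
&\le \frac{1}{(n-1)^{2}} \left\{ \sum_{k=1}^{n-1} \mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(k) \right) + w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \sqrt{\mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(i) \right) \mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(j) \right)} \right\}.
\end{aligned}$$

Therefore, when $n$ is large enough, using Lemma 2, formula (A2),

$$\mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(k) \right) \approx \frac{\gamma^{2} \left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k}, \quad \text{as } n \to \infty;$$

and

$$E\left( \ln \hat{Q}^{(p)}_{New,\bar{H}} - \ln VaR_p \right) \to 0, \quad \text{as } n \to \infty,$$

we have the following approximate relation:

$$\begin{aligned}
\mathrm{Var}\left( \ln \hat{Q}^{(p)}_{New,\bar{H}} \right)
&\lesssim \frac{1}{(n-1)^{2}} \left\{ \sum_{k=1}^{n-1} \frac{\gamma^{2} \left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k} + w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \gamma^{2} \sqrt{ \frac{\left[ \ln\left(\frac{i}{np}\right) \right]^{2}}{i} \cdot \frac{\left[ \ln\left(\frac{j}{np}\right) \right]^{2}}{j} } \right\} \\
&= \frac{\gamma^{2}}{(n-1)^{2}} \left\{ \sum_{k=1}^{n-1} \frac{\left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k} + w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \frac{\ln\left(\frac{i}{np}\right) \ln\left(\frac{j}{np}\right)}{\sqrt{i \cdot j}} \right\}, \quad n \to \infty.
\end{aligned}$$

This proves (21). Therefore, we have the asymptotic normal distribution in (20):

$$\ln \hat{Q}^{(p)}_{New,\bar{H}} - \ln VaR_p \xrightarrow[n \to \infty]{d} \mathrm{Normal}\left( 0, \ \frac{\gamma^{2}}{(n-1)^{2}} \left\{ \sum_{k=1}^{n-1} \frac{\left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k} + w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \frac{\ln\left(\frac{i}{np}\right) \ln\left(\frac{j}{np}\right)}{\sqrt{i \cdot j}} \right\} \right). \tag{A3}$$

Furthermore, using (21) and (A2), we obtain (22) as

$$EFF\left( \ln \hat{Q}^{(p)}_{New,\bar{H}} \mid \ln \hat{Q}^{(p)}_{\bar{H}}(k) \right) = \frac{\mathrm{Var}\left( \ln \bar{Q}^{(p)}_{\bar{H}}(k) \right)}{\mathrm{Var}\left( \ln \hat{Q}^{(p)}_{New,\bar{H}} \right)} \gtrsim \frac{(n-1)^{2} \left[ \ln\left(\frac{k}{np}\right) \right]^{2} / k}{\sum_{k=1}^{n-1} \frac{\left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k} + w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \frac{\ln\left(\frac{i}{np}\right) \ln\left(\frac{j}{np}\right)}{\sqrt{i \cdot j}}}, \quad n \to \infty.$$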

Proof of Theorem 2.

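Under conditions (C1) and (C2), use $\mathrm{Var}\left( \ln \hat{Q}^{(p)}_{New,\bar{H}} \right)$ in Theorem 1, formula (22), and $z_{\alpha/2} = -z_{1-\alpha/2}$, with

$$b_3 = \frac{z_{1-\alpha/2}}{n-1} \sqrt{ \sum_{k=1}^{n-1} \frac{\left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k} + w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \frac{\ln\left(\frac{i}{np}\right) \ln\left(\frac{j}{np}\right)}{\sqrt{i \cdot j}} }, \qquad UCL_{\bar{H}}(k) = \frac{\bar{H}}{1 - \frac{z_{1-\alpha/2}}{\sqrt{k}}}.$$

We have

$$Z = \frac{n-1}{\gamma \sqrt{ \sum_{k=1}^{n-1} \frac{\left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k} + w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \frac{\ln\left(\frac{i}{np}\right) \ln\left(\frac{j}{np}\right)}{\sqrt{i j}} }} \left( \ln \hat{Q}^{(p)}_{New,\bar{H}} - \ln VaR_p \right) \xrightarrow[n \to \infty]{d} \mathrm{Normal}(0, 1),$$

therefore approximately

$$P\left( -z_{1-\alpha/2} \le \frac{n-1}{\gamma \sqrt{ \sum_{k=1}^{n-1} \frac{\left[ \ln\left(\frac{k}{np}\right) \right]^{2}}{k} + w \mathop{\sum^{n-1}\sum^{n-1}}_{i \neq j} \frac{\ln\left(\frac{i}{np}\right) \ln\left(\frac{j}{np}\right)}{\sqrt{i j}} }} \left( \ln \hat{Q}^{(p)}_{New,\bar{H}} - \ln VaR_p \right) \le z_{1-\alpha/2} \right) = 1 - \alpha,$$

then

$$P\left( -\gamma b_3 \le \ln \hat{Q}^{(p)}_{New,\bar{H}} - \ln VaR_p \le \gamma b_3 \right) = 1 - \alpha,$$

$$P\left( \ln \hat{Q}^{(p)}_{New,\bar{H}} - \gamma b_3 \le \ln VaR_p \le \ln \hat{Q}^{(p)}_{New,\bar{H}} + \gamma b_3 \right) = 1 - \alpha,$$

and using $\hat{\gamma} = \bar{H}(k) \le UCL_{\bar{H}}(k)$, to guarantee a coverage probability of at least $(1 - \alpha)100\%$, we have

$$P\left( \ln \hat{Q}^{(p)}_{New,\bar{H}} - UCL_{\bar{H}}(k)\, b_3 \le \ln VaR_p \le \ln \hat{Q}^{(p)}_{New,\bar{H}} + UCL_{\bar{H}}(k)\, b_3 \right) \ge 1 - \alpha.$$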
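The interval at the end of this proof can be sketched numerically as follows. This is not the authors' implementation: the function names are hypothetical, the weight $w$ must be supplied by the user (the value 0.5 below is arbitrary), and `ci_single_k` is an inferred combination of the Lemma 2 variance with the upper limit $UCL_{\bar{H}}(k)$, which happens to reproduce the $\ln Q_{\bar{H}}$ row of Table 9 to about three decimals.

```python
import math
from statistics import NormalDist


def ucl_tail_index(h_bar, k, alpha=0.05):
    """Upper limit H_bar / (1 - z_{1-alpha/2} / sqrt(k)); needs sqrt(k) > z."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return h_bar / (1.0 - z / math.sqrt(k))


def b3(n, p, w, alpha=0.05):
    """b3 from the proof of Theorem 2, with the weight w supplied by the user."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    a = [math.log(k / (n * p)) / math.sqrt(k) for k in range(1, n)]
    sum_sq = sum(v * v for v in a)
    # The double sum over i != j of a_i * a_j equals (sum a)^2 - sum a^2.
    cross = sum(a) ** 2 - sum_sq
    return z / (n - 1) * math.sqrt(sum_sq + w * cross)


def ci_new_estimator(ln_q_new, h_bar, k, n, p, w, alpha=0.05):
    """Interval ln_q_new -/+ UCL * b3, as in the last display of the proof."""
    half = ucl_tail_index(h_bar, k, alpha) * b3(n, p, w, alpha)
    return ln_q_new - half, ln_q_new + half


def ci_single_k(ln_q, h_bar, k, n, p, alpha=0.05):
    """Inferred interval for one k: Lemma 2 standard deviation with gamma set to UCL."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * ucl_tail_index(h_bar, k, alpha) * math.log(k / (n * p)) / math.sqrt(k)
    return ln_q - half, ln_q + half


# Flu example inputs from the text; w = 0.5 is a hypothetical value, not from the paper.
print(ucl_tail_index(0.3736, 42))
print(ci_single_k(1.3814, 0.3736, 42, 111, 0.01))
print(ci_new_estimator(1.2707, 0.3736, 42, 111, 0.01, w=0.5))
```

Replacing $\gamma$ by its upper confidence limit widens the interval, which is how the proof obtains a coverage probability of at least $1 - \alpha$ rather than exactly $1 - \alpha$.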

Author Contributions

The authors M.L.H. and X.R.-Y. carried out this work and drafted the manuscript together. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) grant: MLH DDG-2019-04206.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The datasets can be found here: http://hesperia.gsfc.nasa.gov/fermi/gbm/qlook/fermi_gbm_flare_list.txt [31] and https://www.who.int/influenza/gisrs_laboratory/flunet/en [28].

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Fisher R.A., Tippett L.H.C. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Proc. Camb. Philos. Soc. 1928;24:180–190. doi: 10.1017/S0305004100015681.
  • 2. de Haan L., Ferreira A. Extreme Value Theory. Springer; New York, NY, USA: 2006.
  • 3. Dekkers A.L.M., de Haan L. On the estimation of the extreme value index and large quantile estimation. Ann. Stat. 1989;17:1795–1832. doi: 10.1214/aos/1176347396.
  • 4. Hill B.M. A simple general approach to inference about the tail of a distribution. Ann. Stat. 1975;3:1163–1174. doi: 10.1214/aos/1176343247.
  • 5. Weissman I. Estimation of parameters and large quantiles based on the k largest observations. J. Am. Stat. Assoc. 1978;73:812–815.
  • 6. Beirlant J., Vynckier P., Teugels J.L. Tail index estimation, Pareto quantile plots and regression diagnostics. J. Am. Stat. Assoc. 1996;91:1659–1667.
  • 7. Drees H., Kaufmann E. Selecting the optimal sample fraction in univariate extreme value estimation. Stoch. Proc. Appl. 1998;75:149–172.
  • 8. Guillou A., Hall P.G. A diagnostic for selecting the threshold in extreme value analysis. J. R. Stat. Soc. B. 2001;63:293–305. doi: 10.1111/1467-9868.00286.
  • 9. Gomes M.I., Oliveira O. The bootstrap methodology in statistics of extremes: Choice of the optimal sample fraction. Extremes. 2001;4:331–358. doi: 10.1023/A:1016592028871.
  • 10. Hall P., Welsh A.H. Adaptive estimates of parameters of regular variation. Ann. Stat. 1985;13:331–341. doi: 10.1214/aos/1176346596.
  • 11. Peng L. Asymptotic unbiased estimator for extreme-value index. Stat. Probab. Lett. 1998;38:107–115. doi: 10.1016/S0167-7152(97)00160-0.
  • 12. Beirlant J., Dierckx G., Goegebeur Y., Matthys G. Tail index estimation and an exponential regression model. Extremes. 1999;2:177–200. doi: 10.1023/A:1009975020370.
  • 13. Feuerverger A., Hall P.G. Estimating a tail exponent by modelling departure from a Pareto. Ann. Stat. 1999;27:760–781.
  • 14. Gomes M.I., Martins M.J., Neves M. Alternatives to a semi-parametric estimator of parameters of rare events-the jackknife methodology. Extremes. 2000;3:207–229. doi: 10.1023/A:1011470010228.
  • 15. Caeiro F., Gomes M.I. A class of asymptotically unbiased semi-parametric estimations of the tail index. Test. 2002;11:345–364. doi: 10.1007/BF02595711.
  • 16. Gomes M.I., Figueiredo F., Mendonca S. Asymptotically best linear unbiased tail estimators under second order regular variation. J. Stat. Plan. Inference. 2004;134:409–433. doi: 10.1016/j.jspi.2004.04.013.
  • 17. Gomes M.I., Pestana D. A sturdy reduced-bias extreme quantile (VaR) estimator. J. Am. Stat. Assoc. 2007;102:280–292. doi: 10.1198/016214506000000799.
  • 18. Caeiro F., Gomes M.I., Pestana D. Direct reduction of bias of the classical Hill estimator. Revstat. 2005;3:113–136.
  • 19. Lekina A., Chebana F., Ouarda T.B.M.J. Weighted estimate of extreme quantile: An application to the estimation of high flood return periods. Stoch. Environ. Res. Risk Assess. 2014;28:147–165. doi: 10.1007/s00477-013-0705-2.
  • 20. Lekina A. Ph.D. Thesis. Université de Grenoble; Saint-Martin-d’Hères, France: 2010. Estimation Non-Paramétrique des Quantiles Extrêmes Conditionnels.
  • 21. Huang M.L. A New High Quantile Estimator for Heavy Tailed Distributions. Department of Mathematics, Brock University; St. Catharines, ON, Canada: 2011. (Working Paper)
  • 22. Fréchet M. Sur la loi de probabilité de l’écart maximum. Ann. Soc. Pol. Math. 1927;6:93–116.
  • 23. de Zea Bermudez P., Kotz S. Parameter estimation of the generalized Pareto distribution. J. Stat. Plan. Inference. 2010;140:1374–1388. doi: 10.1016/j.jspi.2008.11.020.
  • 24. Bickel P.J., Doksum K.A. Mathematical Statistics, Basic Ideas and Selected Topics. 2nd ed. Volume 1. CRC Press, Taylor & Francis Group; Boca Raton, FL, USA: 2015.
  • 25. Pickands J. Statistical inference using extreme order statistics. Ann. Stat. 1975;3:119–131.
  • 26. Balkema A.A., de Haan L. Residual life time at great age. Ann. Prob. 1974;2:792–804. doi: 10.1214/aop/1176996548.
  • 27. Scarrott G., McDonald A. A review of extreme value threshold estimation and uncertainty quantification. Revstat. 2012;10:33–60.
  • 28. World Health Organization (WHO) [(accessed on 31 December 2020)];2020 Available online: https://www.who.int/influenza/gisrs_laboratory/flunet/en.
  • 29. Kolmogorov A.N. Sulla determinazione empirica di una legge di distribuzione. G. Dell. Istituto Ital. Degli Attuari. 1933;4:83–91.
  • 30. Anderson T.W., Darling D.A. Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann. Math. Stat. 1952;23:193–212. doi: 10.1214/aoms/1177729437.
  • 31. National Aeronautics and Space Administration (NASA) Gamma Ray. [(accessed on 31 December 2020)];2020 Available online: http://hesperia.gsfc.nasa.gov/fermi/gbm/qlook/fermi_gbm_flare_list.txt.
  • 32. National Weather Service (NOAA) Space Weather Prediction Center. [(accessed on 31 December 2020)];2020 Available online: https://satdat.ngdc.noaa.gov/sem/goes/data/plots.


Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
