Entropy. 2020 Mar 6;22(3):304. doi: 10.3390/e22030304

Robust Model Selection Criteria Based on Pseudodistances

Aida Toma 1,2,*, Alex Karagrigoriou 3, Paschalini Trentou 3
PMCID: PMC7516763  PMID: 33286078

Abstract

In this paper, we introduce a new class of robust model selection criteria. These criteria are defined by estimators of the expected overall discrepancy using pseudodistances and the minimum pseudodistance principle. Theoretical properties of these criteria are proved, namely asymptotic unbiasedness, robustness, consistency, as well as the limit laws. The case of the linear regression models is studied and a specific pseudodistance based criterion is proposed. Monte Carlo simulations and applications for real data are presented in order to exemplify the performance of the new methodology. These examples show that the new selection criterion for regression models is a good competitor of some well known criteria and may have superior performance, especially in the case of small and contaminated samples.

Keywords: model selection, minimum pseudodistance estimation, robustness

1. Introduction

Model selection is fundamental to the practical applications of statistics and there is a substantial literature on this issue. Classical model selection criteria include, among others, the Cp-criterion, the Akaike Information Criterion (AIC), based on the Kullback-Leibler divergence, and the Bayesian Information Criterion (BIC), as well as the General Information Criterion (GIC), a general class of criteria that also estimate the Kullback-Leibler divergence. These criteria were proposed in [1,2,3,4], respectively, and represent powerful tools for choosing the best model among the candidate models that can be used to fit a given data set. On the other hand, many classical procedures for model selection are extremely sensitive to outliers and to other departures from the distributional assumptions of the model. Robust versions of classical model selection criteria, which are not strongly affected by outliers, have been proposed for example in [5,6,7]. Some recent proposals for robust model selection are criteria based on divergences and minimum divergence estimators. We recall here the Divergence Information Criteria (DIC) based on the density power divergences introduced in [8], the Modified Divergence Information Criteria (MDIC) introduced in [9] and the criteria based on minimum dual divergence estimators introduced in [10].

Interest in statistical methods based on divergence measures has grown significantly in recent years. For a wide variety of models, statistical methods based on divergences have high model efficiency and are also robust, representing attractive alternatives to the classical methods. We refer to the monographs [11,12] for an excellent presentation of such methods, their importance and their applications. The pseudodistances that we use in the present paper were originally introduced in [13], where they are called “type-0” divergences, and the corresponding minimum divergence estimators have been studied. They are also presented and extensively studied in [14], where they are called γ-divergences, as well as in [15] in the context of decomposable pseudodistances. Like divergences, the pseudodistances are not mathematical metrics in the strict sense of the term. They satisfy two properties, namely nonnegativity and the fact that the pseudodistance between two probability measures equals zero if and only if the two measures are equal. The divergences are moreover characterized by the information processing property, that is, the complete invariance with respect to statistically sufficient transformations of the observation space. In general, a pseudodistance may not satisfy this property. We have adopted the term pseudodistance for this reason, but in the literature one can also encounter the other terms mentioned above.

The pseudodistances that we consider in this paper have also been used to define robustness and efficiency measures, as well as the corresponding optimal robust M-estimators following Hampel’s infinitesimal approach in [16]. The minimum pseudodistance estimators for general parametric models have been studied in [15] and consist of minimizing an empirical version of a pseudodistance between the assumed theoretical model and the true model underlying the data. These estimators have the advantage of not requiring any prior smoothing and reconcile robustness with high efficiency, providing a high degree of stability under model misspecification, often with a minimal loss in model efficiency. Such estimators are also defined and studied in the case of the multivariate normal model, as well as for linear regression models in [17,18], where applications to portfolio optimization models are also presented.

In the present paper we propose new criteria for model selection, based on pseudodistances and on minimum pseudodistance estimators. These new criteria have robustness properties, are asymptotically unbiased, consistent and compare well with some other known model selection criteria, even for small samples.

The paper is organized as follows. Section 2 is devoted to minimum pseudodistance estimators and to their asymptotic properties, which will be needed in the next sections. Section 3 presents new estimators of the expected overall discrepancy using pseudodistances, together with corresponding theoretical properties including robustness, consistency and limit laws. The new asymptotically unbiased model selection criteria are presented in Section 3.3, where the case of the univariate normal model and the case of linear regression models are investigated. Applications based on Monte Carlo simulations and on real data, illustrating the performance of the new methodology in the case of linear regression models, are included in Section 4.

2. Minimum Pseudodistance Estimators

The construction of new model selection criteria is based on the following family of pseudodistances (see [15]). For two probability measures $P$ and $Q$ admitting densities $p$ and $q$, respectively, with respect to the Lebesgue measure, the family of pseudodistances of order $\gamma>0$ is defined by

$$R_\gamma(P,Q)=\frac{1}{\gamma+1}\ln\left(\int p^\gamma\,dP\right)+\frac{1}{\gamma(\gamma+1)}\ln\left(\int q^\gamma\,dQ\right)-\frac{1}{\gamma}\ln\left(\int p^\gamma\,dQ\right) \tag{1}$$

and satisfies the limit relation

$$\lim_{\gamma\downarrow 0}R_\gamma(P,Q)=R_0(P,Q), \tag{2}$$

where $R_0(P,Q):=\int\ln\frac{q}{p}\,dQ$ is the modified Kullback-Leibler divergence.

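Let $(P_\theta)$ be a parametric model indexed by $\theta\in\Theta$, where $\Theta$ is a $d$-dimensional parameter space, and let $p_\theta$ be the corresponding densities with respect to the Lebesgue measure $\lambda$. Let $X_1,\dots,X_n$ be a random sample from $P_{\theta_0}$, $\theta_0\in\Theta$. For fixed $\gamma>0$, a minimum pseudodistance estimator of the unknown parameter $\theta_0$ of the law $P_{\theta_0}$ is defined by replacing the measure $P_{\theta_0}$ in the pseudodistance $R_\gamma(P_\theta,P_{\theta_0})$ by the empirical measure $P_n$ pertaining to the sample, and then minimizing this empirical quantity with respect to $\theta$ on the parameter space. Since the middle term in $R_\gamma(P_\theta,P_{\theta_0})$ does not depend on $\theta$, these estimators are defined by

$$\hat\theta_n=\arg\min_{\theta\in\Theta}\left\{\frac{1}{\gamma+1}\ln\left(\int p_\theta^{\gamma+1}\,d\lambda\right)-\frac{1}{\gamma}\ln\left(\frac{1}{n}\sum_{i=1}^n p_\theta^{\gamma}(X_i)\right)\right\}, \tag{3}$$

or equivalently as

$$\hat\theta_n=\arg\max_{\theta\in\Theta}\left\{C_\gamma(\theta)^{-1}\cdot\frac{1}{n}\sum_{i=1}^n p_\theta^{\gamma}(X_i)\right\}, \tag{4}$$

where $C_\gamma(\theta)=\left(\int p_\theta^{\gamma+1}\,d\lambda\right)^{\gamma/(\gamma+1)}$. Denoting $h(x,\theta):=C_\gamma(\theta)^{-1}\cdot p_\theta^{\gamma}(x)$, these estimators can be written as

$$\hat\theta_n=\arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n h(X_i,\theta). \tag{5}$$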

The optimum given above need not be uniquely defined.

On the other hand,

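$$\arg\max_{\theta\in\Theta}\int h(x,\theta)\,dP_{\theta_0}(x)=\theta_0 \tag{6}$$

and here $\theta_0$ is the unique optimizer, since $R_\gamma(P_\theta,P_{\theta_0})=0$ implies $\theta=\theta_0$.

Define

$$R_\gamma(\theta_0):=\max_{\theta\in\Theta}\int h(x,\theta)\,dP_{\theta_0}(x)=\int h(x,\theta_0)\,dP_{\theta_0}(x).$$

An estimator of $R_\gamma(\theta_0)$ is defined by

$$\hat R_\gamma(\theta_0):=\max_{\theta\in\Theta}\int h(x,\theta)\,dP_n(x)=\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n h(X_i,\theta)=\frac{1}{n}\sum_{i=1}^n h(X_i,\hat\theta_n). \tag{7}$$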
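As an illustration of the estimator in (5), the following minimal sketch treats the univariate normal model $N(m,\sigma^2)$. Setting the derivatives of $\frac{1}{n}\sum_i h(X_i,\theta)$ with respect to $m$ and $\sigma$ to zero gives a weighted mean and a weighted variance scaled by $(\gamma+1)$, with weights proportional to $p_\theta^\gamma(X_i)$. The fixed-point iteration, the starting values and the function name `normal_mpe` are assumptions made for this example. They are not taken from the paper, which does not prescribe a numerical scheme here.

```python
import math
import random


def normal_mpe(xs, gamma, iters=500, tol=1e-10):
    """Fixed-point sketch of the maximizer in (5) for the model N(m, sigma^2)."""
    n = len(xs)
    # Robust starting values (an illustrative choice): median and MAD-based scale.
    m = sorted(xs)[n // 2]
    mad = sorted(abs(x - m) for x in xs)[n // 2]
    sigma = 1.4826 * mad if mad > 0 else 1.0
    for _ in range(iters):
        # Weights proportional to p_theta^gamma(X_i) at the current iterate.
        w = [math.exp(-0.5 * gamma * ((x - m) / sigma) ** 2) for x in xs]
        sw = sum(w)
        m_new = sum(wi * x for wi, x in zip(w, xs)) / sw
        var_new = (gamma + 1.0) * sum(wi * (x - m_new) ** 2 for wi, x in zip(w, xs)) / sw
        s_new = math.sqrt(var_new)
        converged = abs(m_new - m) + abs(s_new - sigma) < tol
        m, sigma = m_new, s_new
        if converged:
            break
    return m, sigma


# Hypothetical data: 90 draws from N(0, 1) plus 10 outliers centred at 10.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(90)] + [rng.gauss(10.0, 1.0) for _ in range(10)]
m_hat, sigma_hat = normal_mpe(sample, gamma=0.5)
print(m_hat, sigma_hat, sum(sample) / len(sample))
```

Because the optimum in (5) need not be unique, such an iteration only locates a stationary point near its starting values. A careful implementation would compare several starting points.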

The following regularity conditions of the model will be assumed throughout the rest of the paper.

(C1) The density pθ(x) has continuous partial derivatives with respect to θ up to the third order (for all x λ-a.e.).

(C2) There exists a neighborhood Nθ0 of θ0 such that the first-, the second- and the third- order partial derivatives with respect to θ of h(x,θ) are dominated on Nθ0 by some Pθ0-integrable functions.

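(C3) The integrals $\int\left[\frac{\partial^2}{\partial\theta^2}h(x,\theta)\right]_{\theta=\theta_0}dP_{\theta_0}(x)$ and $\int\left[\frac{\partial}{\partial\theta}h(x,\theta)\right]_{\theta=\theta_0}\left[\frac{\partial}{\partial\theta}h(x,\theta)\right]^t_{\theta=\theta_0}dP_{\theta_0}(x)$ exist.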

Theorem 1.

Assume that conditions (C1), (C2) and (C3) are fulfilled. Then

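  • (a) Let $B:=\{\theta\in\Theta;\ \|\theta-\theta_0\|\le n^{-1/3}\}$. Then, as $n\to\infty$, with probability one, the function $\theta\mapsto\frac{1}{n}\sum_{i=1}^n h(X_i,\theta)$ attains a local maximal value at some point $\hat\theta_n$ in the interior of $B$, which implies that the estimator $\hat\theta_n$ is $n^{1/3}$-consistent.

  • (b) $\sqrt{n}\,(\hat\theta_n-\theta_0)$ converges in distribution to a centered multivariate normal random variable with covariance matrix
    $$V=S^{-1}MS^{-1}, \tag{8}$$
    where $S:=\int\left[\frac{\partial^2}{\partial\theta^2}h(x,\theta)\right]_{\theta=\theta_0}dP_{\theta_0}(x)$ and $M:=\int\left[\frac{\partial}{\partial\theta}h(x,\theta)\right]_{\theta=\theta_0}\left[\frac{\partial}{\partial\theta}h(x,\theta)\right]^t_{\theta=\theta_0}dP_{\theta_0}(x)$.

  • (c) $\sqrt{n}\,(\hat R_\gamma(\theta_0)-R_\gamma(\theta_0))$ converges in distribution to a centered normal variable with variance $\sigma^2(\theta_0)=\int h(x,\theta_0)^2\,dP_{\theta_0}(x)-\left(\int h(x,\theta_0)\,dP_{\theta_0}(x)\right)^2$.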

We refer to [15] for details regarding these estimators and for the proofs of the above asymptotic properties.

3. Model Selection Criteria Based on Pseudodistances

Model selection is a method for selecting the best model among candidate models that can be used to fit a given data set. A model selection criterion can be considered as an approximately unbiased estimator of the expected overall discrepancy, a nonnegative quantity which measures the distance between the true unknown model and a fitted approximating model. If the value of the criterion is small, then the approximated candidate model can be chosen. In the following, by applying the same methodology used for AIC, we construct new criteria for model selection using pseudodistances (1) and minimum pseudodistance estimators.

Let $X_1,\dots,X_n$ be a random sample from the distribution associated with the true model $Q$ with density $q$ and let $p_\theta$ be the density of a candidate model $P_\theta$ from a parametric family $(P_\theta)$, where $\theta\in\Theta\subset\mathbb{R}^d$.

3.1. The Expected Overall Discrepancy

For γ>0 fixed, we consider the quantity

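$$W_\theta=\frac{1}{\gamma+1}\ln\left(\int p_\theta^{\gamma+1}\,d\lambda\right)-\frac{1}{\gamma}\ln\left(\int p_\theta^{\gamma}q\,d\lambda\right), \tag{9}$$

which is the pseudodistance $R_\gamma(P_\theta,Q)$ without the middle term, a term that remains constant irrespective of the model $(P_\theta)$ used.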

The target theoretical quantity that will be approximated by an asymptotically unbiased estimator is given by

$$E[W_{\hat\theta_n}]=E\left[W_\theta\,\big|\,\theta=\hat\theta_n\right], \tag{10}$$

where $\hat\theta_n$ is a minimum pseudodistance estimator defined as in (3). The same pseudodistance is used for both $W_\theta$ and $\hat\theta_n$. The quantity (10) can be seen as an average distance between $Q$ and $(P_\theta)$ up to a constant and is called the expected overall discrepancy between $Q$ and $(P_\theta)$.

The next lemma gives the gradient vector and the Hessian matrix of $W_\theta$ and is useful for the evaluation of $E[W_{\hat\theta_n}]$ through a Taylor expansion.

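Throughout this paper, for a scalar function $\varphi_\theta(\cdot)$, the quantity $\frac{\partial}{\partial\theta}\varphi_\theta(\cdot)$ denotes the $d$-dimensional gradient vector of $\varphi_\theta(\cdot)$ with respect to the vector $\theta$ and $\frac{\partial^2}{\partial\theta^2}\varphi_\theta(\cdot)$ denotes the corresponding $d\times d$ Hessian matrix. We also use the notations $\dot\varphi_\theta$ and $\ddot\varphi_\theta$ for the first and the second order derivatives of $\varphi_\theta$ with respect to $\theta$.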

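We assume the following conditions, which allow differentiation under the integral sign:

(C4) There exists a neighborhood $N_\theta$ of $\theta$ such that

$$\int\sup_{t\in N_\theta}\left\|\frac{\partial}{\partial t}p_t^{\gamma+1}\right\|d\lambda<\infty,\qquad \int\sup_{t\in N_\theta}\left\|\frac{\partial}{\partial t}\left[p_t^{\gamma}\dot p_t\right]\right\|d\lambda<\infty.$$

(C5) There exists a neighborhood $N_\theta$ of $\theta$ such that

$$\int\sup_{t\in N_\theta}\left\|\frac{\partial}{\partial t}p_t^{\gamma}\right\|q\,d\lambda<\infty,\qquad \int\sup_{t\in N_\theta}\left\|\frac{\partial}{\partial t}\left[p_t^{\gamma-1}\dot p_t\right]\right\|q\,d\lambda<\infty.$$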

Lemma 1.

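Under (C4) and (C5), the gradient vector and the Hessian matrix of $W_\theta$ are

$$\frac{\partial}{\partial\theta}W_\theta=\frac{\int p_\theta^{\gamma}\dot p_\theta\,d\lambda}{\int p_\theta^{\gamma+1}\,d\lambda}-\frac{\int p_\theta^{\gamma-1}\dot p_\theta\,q\,d\lambda}{\int p_\theta^{\gamma}q\,d\lambda}, \tag{11}$$

$$\begin{aligned}\frac{\partial^2}{\partial\theta^2}W_\theta={}&\frac{\left[\gamma\int p_\theta^{\gamma-1}\dot p_\theta\dot p_\theta^{t}\,d\lambda+\int p_\theta^{\gamma}\ddot p_\theta\,d\lambda\right]\int p_\theta^{\gamma+1}\,d\lambda-(\gamma+1)\int p_\theta^{\gamma}\dot p_\theta\,d\lambda\left(\int p_\theta^{\gamma}\dot p_\theta\,d\lambda\right)^{t}}{\left(\int p_\theta^{\gamma+1}\,d\lambda\right)^2}\\&-\frac{\left[(\gamma-1)\int p_\theta^{\gamma-2}\dot p_\theta\dot p_\theta^{t}\,q\,d\lambda+\int p_\theta^{\gamma-1}\ddot p_\theta\,q\,d\lambda\right]\int p_\theta^{\gamma}q\,d\lambda-\gamma\int p_\theta^{\gamma-1}\dot p_\theta\,q\,d\lambda\left(\int p_\theta^{\gamma-1}\dot p_\theta\,q\,d\lambda\right)^{t}}{\left(\int p_\theta^{\gamma}q\,d\lambda\right)^2}.\end{aligned}$$

When the true model $Q$ belongs to the parametric model $(P_\theta)$, hence $Q=P_{\theta_0}$ and $q=p_{\theta_0}$, the gradient vector and the Hessian matrix of $W_\theta$ simplify to

$$\left[\frac{\partial}{\partial\theta}W_\theta\right]_{\theta=\theta_0}=0, \tag{12}$$

$$\left[\frac{\partial^2}{\partial\theta^2}W_\theta\right]_{\theta=\theta_0}=M_\gamma(\theta_0), \tag{13}$$

where

$$M_\gamma(\theta_0):=\frac{\left(\int p_{\theta_0}^{\gamma-1}\dot p_{\theta_0}\dot p_{\theta_0}^{t}\,d\lambda\right)\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)-\left(\int p_{\theta_0}^{\gamma}\dot p_{\theta_0}\,d\lambda\right)\left(\int p_{\theta_0}^{\gamma}\dot p_{\theta_0}\,d\lambda\right)^{t}}{\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}. \tag{14}$$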

In the following propositions we suppose that the true model $Q$ belongs to the parametric model $(P_\theta)$, hence $Q=P_{\theta_0}$, $q=p_{\theta_0}$, and $\theta_0$ is the value of the parameter corresponding to the true model. We also say that $\theta_0$ is the true value of the parameter. The proofs of all propositions are given in Appendix A.

Proposition 1.

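When the true model $Q$ belongs to the parametric model $(P_\theta)$, assuming that (C4) and (C5) are fulfilled for $q=p_{\theta_0}$ and $\theta=\theta_0$, the expected overall discrepancy is given by

$$E[W_{\hat\theta_n}]=W_{\theta_0}+\frac{1}{2}E\left[(\hat\theta_n-\theta_0)^{t}M_\gamma(\theta_0)(\hat\theta_n-\theta_0)\right]+E[R_n], \tag{15}$$

where $R_n=o(\|\hat\theta_n-\theta_0\|^2)$ and $M_\gamma(\theta_0)$ is given by (14).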

3.2. Estimation of the Expected Overall Discrepancy

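In this section, we introduce an estimator of the expected overall discrepancy, under the hypothesis that the true model $Q$ belongs to the parametric model $(P_\theta)$. Hence, $Q=P_{\theta_0}$ and the unknown parameter $\theta_0$ will be estimated by a minimum pseudodistance estimator $\hat\theta_n$.

For a given $\theta\in\Theta$, a natural estimator of $W_\theta$ is defined by

$$Q_\theta:=\frac{1}{\gamma+1}\ln\left(\int p_\theta^{\gamma+1}\,d\lambda\right)-\frac{1}{\gamma}\ln\left(\frac{1}{n}\sum_{i=1}^n p_\theta^{\gamma}(X_i)\right). \tag{16}$$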

Lemma 2.

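Assuming (C4), the gradient vector and the Hessian matrix of $Q_\theta$ are given by

$$\frac{\partial}{\partial\theta}Q_\theta=\frac{\int p_\theta^{\gamma}\dot p_\theta\,d\lambda}{\int p_\theta^{\gamma+1}\,d\lambda}-\frac{\sum_{i=1}^n p_\theta^{\gamma-1}(X_i)\dot p_\theta(X_i)}{\sum_{i=1}^n p_\theta^{\gamma}(X_i)},$$

$$\begin{aligned}\frac{\partial^2}{\partial\theta^2}Q_\theta={}&\frac{\left[\gamma\int p_\theta^{\gamma-1}\dot p_\theta\dot p_\theta^{t}\,d\lambda+\int p_\theta^{\gamma}\ddot p_\theta\,d\lambda\right]\int p_\theta^{\gamma+1}\,d\lambda-(\gamma+1)\int p_\theta^{\gamma}\dot p_\theta\,d\lambda\left(\int p_\theta^{\gamma}\dot p_\theta\,d\lambda\right)^{t}}{\left(\int p_\theta^{\gamma+1}\,d\lambda\right)^2}\\&-\frac{\left[(\gamma-1)\sum_{i=1}^n p_\theta^{\gamma-2}(X_i)\dot p_\theta(X_i)\dot p_\theta(X_i)^{t}+\sum_{i=1}^n p_\theta^{\gamma-1}(X_i)\ddot p_\theta(X_i)\right]\sum_{i=1}^n p_\theta^{\gamma}(X_i)}{\left(\sum_{i=1}^n p_\theta^{\gamma}(X_i)\right)^2}\\&+\frac{\gamma\left(\sum_{i=1}^n p_\theta^{\gamma-1}(X_i)\dot p_\theta(X_i)\right)\left(\sum_{i=1}^n p_\theta^{\gamma-1}(X_i)\dot p_\theta(X_i)\right)^{t}}{\left(\sum_{i=1}^n p_\theta^{\gamma}(X_i)\right)^2}.\end{aligned}$$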

Proposition 2.

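When the true model $Q$ belongs to the parametric model $(P_\theta)$, under the conditions (C1)-(C5), it holds that

$$E[Q_{\theta_0}]=E[Q_{\hat\theta_n}]+\frac{1}{2}E\left[(\theta_0-\hat\theta_n)^{t}M_\gamma(\theta_0)(\theta_0-\hat\theta_n)\right]+E[R_n], \tag{17}$$

where $R_n=o(\|\hat\theta_n-\theta_0\|^2)$.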

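The following result makes it possible to define an asymptotically unbiased estimator of the expected overall discrepancy.

Proposition 3.

When the true model $Q$ belongs to the parametric model $(P_\theta)$, under (C1)-(C5), it holds that

$$E[W_{\hat\theta_n}]=E[Q_{\hat\theta_n}]+E\left[(\theta_0-\hat\theta_n)^{t}M_\gamma(\theta_0)(\theta_0-\hat\theta_n)\right]+\frac{1}{2\gamma n}\left[1-\frac{\int p_{\theta_0}^{2\gamma+1}\,d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}\right]+E[R_n]+\frac{1}{\gamma}E[R'_n], \tag{18}$$

where $R_n=o(\|\hat\theta_n-\theta_0\|^2)$ and $R'_n=o\left(\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2\right)$.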

3.2.1. Limit Properties of the Estimator Qθ^n

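Under the hypothesis that the true model $Q$ belongs to the family of models $(P_\theta)$, hence $Q=P_{\theta_0}$, we prove the consistency and the asymptotic normality of the estimator $Q_{\hat\theta_n}$.

Note that

$$Q_{\hat\theta_n}=\frac{1}{\gamma+1}\ln\left(\int p_{\hat\theta_n}^{\gamma+1}\,d\lambda\right)-\frac{1}{\gamma}\ln\left(\frac{1}{n}\sum_{i=1}^n p_{\hat\theta_n}^{\gamma}(X_i)\right) \tag{19}$$

$$=\ln\left[\frac{\frac{1}{n}\sum_{i=1}^n p_{\hat\theta_n}^{\gamma}(X_i)}{\left(\int p_{\hat\theta_n}^{\gamma+1}\,d\lambda\right)^{\gamma/(\gamma+1)}}\right]^{-1/\gamma}=\ln\left[\hat R_\gamma(\theta_0)\right]^{-1/\gamma}, \tag{20}$$

where $\int p_{\hat\theta_n}^{\gamma+1}\,d\lambda=\left[\int p_\theta^{\gamma+1}\,d\lambda\right]_{\theta=\hat\theta_n}$ and $\hat R_\gamma(\theta_0)$ is given by (7).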
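The identity between (19) and (20) holds at any parameter value, not only at the estimate, so it can be checked numerically. The sketch below does this for the univariate normal model, using the standard Gaussian integral $\int p_\theta^{\gamma+1}\,d\lambda=(\sigma\sqrt{2\pi})^{-\gamma}/\sqrt{\gamma+1}$. The function names and the data values are hypothetical and chosen only for illustration.

```python
import math


def q_theta(xs, m, sigma, gamma):
    """Q_theta as in (16) for N(m, sigma^2), with the integral in closed form."""
    c = sigma * math.sqrt(2.0 * math.pi)
    log_integral = -gamma * math.log(c) - 0.5 * math.log(gamma + 1.0)
    mean_pg = sum(
        c ** (-gamma) * math.exp(-0.5 * gamma * ((x - m) / sigma) ** 2) for x in xs
    ) / len(xs)
    return log_integral / (gamma + 1.0) - math.log(mean_pg) / gamma


def mean_h(xs, m, sigma, gamma):
    """(1/n) * sum of h(X_i, theta), with h = C_gamma(theta)^(-1) * p_theta^gamma."""
    c = sigma * math.sqrt(2.0 * math.pi)
    integral = c ** (-gamma) / math.sqrt(gamma + 1.0)
    c_gamma = integral ** (gamma / (gamma + 1.0))
    mean_pg = sum(
        c ** (-gamma) * math.exp(-0.5 * gamma * ((x - m) / sigma) ** 2) for x in xs
    ) / len(xs)
    return mean_pg / c_gamma


data = [-1.3, -0.4, 0.1, 0.2, 0.9, 1.7, 6.0]  # hypothetical observations
lhs = q_theta(data, m=0.2, sigma=1.1, gamma=0.3)
rhs = -math.log(mean_h(data, m=0.2, sigma=1.1, gamma=0.3)) / 0.3
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding, which reflects the algebraic identity rather than any property of the data.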

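First we prove that $\hat R_\gamma(\theta_0)$ is a consistent estimator of $R_\gamma(\theta_0)$. Indeed, using Theorem 1 and the fact that $\int\frac{\partial}{\partial\theta}h(x,\theta_0)\,dP_{\theta_0}(x)=0$, a Taylor expansion of $\frac{1}{n}\sum_{i=1}^n h(X_i,\theta)$ in $\hat\theta_n$ around $\theta_0$ gives

$$\hat R_\gamma(\theta_0)=\frac{1}{n}\sum_{i=1}^n h(X_i,\theta_0)+o_P(n^{-1/2}). \tag{21}$$

Using the weak law of large numbers,

$$\frac{1}{n}\sum_{i=1}^n h(X_i,\theta_0)=R_\gamma(\theta_0)+o_P(1). \tag{22}$$

Combining (21) and (22), we obtain that $\hat R_\gamma(\theta_0)$ converges to $R_\gamma(\theta_0)$ in probability.

Then, using the continuous mapping theorem, since $g(t)=\ln t^{-1/\gamma}$ is a continuous function, we get

$$Q_{\hat\theta_n}=\ln\left[\hat R_\gamma(\theta_0)\right]^{-1/\gamma}\to\ln\left[R_\gamma(\theta_0)\right]^{-1/\gamma}=W_{\theta_0}$$

in probability.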

On the other hand, using the asymptotic normality of the estimator $\hat R_\gamma(\theta_0)$ (according to Theorem 1 (c)) together with the univariate delta method, we obtain the asymptotic normality of $Q_{\hat\theta_n}$. The proposition below summarizes the above asymptotic results.

Proposition 4.

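Under (C1)-(C3), when $Q=P_{\theta_0}$, it holds that

(a) $Q_{\hat\theta_n}$ converges to $W_{\theta_0}$ in probability.

(b) $\sqrt{n}\,(Q_{\hat\theta_n}-W_{\theta_0})$ converges in distribution to a centered univariate normal random variable with variance $\frac{\sigma^2(\theta_0)}{\gamma^2R_\gamma(\theta_0)^2}$, $\sigma^2(\theta_0)$ being defined in Theorem 1.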
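The variance in part (b) can be traced through the delta method in one step. The following worked computation is a sketch that uses only quantities already defined above, with $g(t)=\ln t^{-1/\gamma}=-\frac{1}{\gamma}\ln t$ and the limit law of Theorem 1 (c):

$$g'(t)=-\frac{1}{\gamma t},\qquad \sqrt{n}\left(g(\hat R_\gamma(\theta_0))-g(R_\gamma(\theta_0))\right)\xrightarrow{d}N\left(0,\;g'(R_\gamma(\theta_0))^2\,\sigma^2(\theta_0)\right)=N\left(0,\;\frac{\sigma^2(\theta_0)}{\gamma^2R_\gamma(\theta_0)^2}\right).$$

Since $g(\hat R_\gamma(\theta_0))=Q_{\hat\theta_n}$ and $g(R_\gamma(\theta_0))=W_{\theta_0}$, this is the stated limit.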

3.2.2. Robustness Properties of the Estimator Qθ^n

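The influence function is a useful tool for describing the robustness of an estimator. Recall that a map $T$, defined on a set of probability measures and taking values in the parameter space, is a statistical functional corresponding to an estimator $\hat\theta_n$ of the parameter $\theta$ whenever $\hat\theta_n=T(P_n)$, where $P_n$ is the empirical measure associated with the sample. The influence function of $T$ at $P_\theta$ is defined by

$$\mathrm{IF}(x;T,P_\theta):=\left.\frac{\partial T(\tilde P_{\varepsilon x})}{\partial\varepsilon}\right|_{\varepsilon=0},$$

where $\tilde P_{\varepsilon x}:=(1-\varepsilon)P_\theta+\varepsilon\delta_x$, $\varepsilon>0$, $\delta_x$ being the Dirac measure putting all mass at $x$. The gross error sensitivity of the estimator is defined by

$$\gamma^*(T,P_\theta)=\sup_x\left\|\mathrm{IF}(x;T,P_\theta)\right\|.$$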

Whenever the influence function is bounded with respect to x, the corresponding estimator is called B-robust (see [19]).

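In what follows, for a given $\gamma>0$, we derive the influence function of the estimator $Q_{\hat\theta_n}$. The statistical functional associated with this estimator, which we denote by $U$, is defined by

$$U(P):=\frac{1}{\gamma+1}\ln\left(\int p_{T(P)}^{\gamma+1}\,d\lambda\right)-\frac{1}{\gamma}\ln\left(\int p_{T(P)}^{\gamma}\,dP\right),$$

where $T$ is the statistical functional corresponding to the minimum pseudodistance estimator $\hat\theta_n$ used, namely

$$T(P):=\arg\sup_\theta\,C_\gamma(\theta)^{-1}\int p_\theta^{\gamma}\,dP,$$

where $C_\gamma(\theta)=\left(\int p_\theta^{\gamma+1}\,d\lambda\right)^{\gamma/(\gamma+1)}$.

Due to the Fisher consistency of the functional $T$, according to (6), we have $T(P_{\theta_0})=\theta_0$, which implies that $U(P_{\theta_0})=W_{\theta_0}$.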

Proposition 5.

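When $Q=P_{\theta_0}$, the influence function of $Q_{\hat\theta_n}$ is given by

$$\mathrm{IF}(x;U,P_{\theta_0})=\frac{1}{\gamma}\left[1-\frac{p_{\theta_0}^{\gamma}(x)}{\int p_{\theta_0}^{\gamma+1}\,d\lambda}\right]. \tag{23}$$

Note that the influence function of the estimator $Q_{\hat\theta_n}$ does not depend on the estimator $\hat\theta_n$, but depends on the pseudodistance used. Usually, $p_{\theta_0}^{\gamma}(x)$ is bounded with respect to $x$ and therefore $Q_{\hat\theta_n}$ is a robust estimator of $W_{\theta_0}$.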

For comparison at the level of the influence function, we consider the AIC criterion, which is defined by

$$\mathrm{AIC}=-2\ln(L(\hat\theta_n))+2d,$$

where $L(\hat\theta_n)$ is the maximum value of the likelihood function for the model, $\hat\theta_n$ the maximum likelihood estimator and $d$ the dimension of the parameter. The statistical functional corresponding to the statistic $-2\ln(L(\hat\theta_n))$ is

$$V(P)=-2\int\ln p_{T(P)}\,dP,$$

where $T$ here is the statistical functional corresponding to the maximum likelihood estimator. The influence function of the functional $V$ is given by

$$\mathrm{IF}(x;V,P_{\theta_0})=2\left[\int\ln p_{\theta_0}\,dP_{\theta_0}-\ln p_{\theta_0}(x)\right]. \tag{24}$$

This influence function is not bounded with respect to $x$, therefore the statistic $-2\ln(L(\hat\theta_n))$ is not robust.

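For example, in the case of the univariate normal model, for a positive $\gamma$, the influence function (23) becomes

$$\mathrm{IF}(x;U,P_{\theta_0})=\frac{1}{\gamma}\left[1-\sqrt{\gamma+1}\cdot\exp\left(-\frac{\gamma}{2}\left(\frac{x-m}{\sigma}\right)^2\right)\right] \tag{25}$$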

while the influence function (24) writes as

$$\mathrm{IF}(x;V,P_{\theta_0})=\left(\frac{x-m}{\sigma}\right)^2-1 \tag{26}$$

(here $\theta_0=(m,\sigma)$). For all the pseudodistances, the influence function (25) is bounded with respect to $x$, therefore the selection criteria based on the statistic $Q_{\hat\theta_n}$ will be robust. On the other hand, the influence function (26) is not bounded with respect to $x$, showing the non-robustness of AIC in this case. Moreover, the gross error sensitivities corresponding to these influence functions are $\gamma^*(U,P_{\theta_0})=\frac{1}{\gamma}$ and $\gamma^*(V,P_{\theta_0})=\infty$. These results show that, in the case of the normal model, the gross error sensitivity decreases as $\gamma$ increases. Therefore, larger values of $\gamma$ are associated with more robust procedures. For the particular case $m=0$ and $\sigma=1$, the influence functions (25) and (26) are represented in Figure 1.
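The contrast between the two influence functions can be made concrete with a few evaluations. The sketch below codes (25) directly and evaluates (24) for the normal density, using $\int\ln p_{\theta_0}\,dP_{\theta_0}=-\ln(\sigma\sqrt{2\pi})-\frac{1}{2}$. The function names and the grid of points are illustrative assumptions.

```python
import math


def if_pseudodistance(x, m, sigma, gamma):
    """Influence function (25) of Q at the normal model N(m, sigma^2)."""
    z = (x - m) / sigma
    return (1.0 - math.sqrt(gamma + 1.0) * math.exp(-0.5 * gamma * z * z)) / gamma


def if_loglik(x, m, sigma):
    """Influence function (24) evaluated for the normal density: 2 * (E[ln p] - ln p(x))."""
    log_c = math.log(sigma * math.sqrt(2.0 * math.pi))
    expected_log_p = -log_c - 0.5
    log_p_x = -log_c - 0.5 * ((x - m) / sigma) ** 2
    return 2.0 * (expected_log_p - log_p_x)


gamma = 0.2
for x in (0.0, 1.0, 3.0, 10.0, 100.0):
    print(x, if_pseudodistance(x, 0.0, 1.0, gamma), if_loglik(x, 0.0, 1.0))
```

For this $\gamma$ the first column of values stays below $1/\gamma=5$ in absolute value, whereas the second grows quadratically in $x$.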

Figure 1. Influence functions in the case of the normal model.

3.3. Model Selection Criteria Using Pseudodistances

3.3.1. The Case of Univariate Normal Family

The criteria that we propose in this section correspond to the case where the candidate model is a univariate normal model from the family of normal models (Pθ) indexed by θ=(μ,σ). We also suppose that the true model Q belongs to (Pθ).

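In the case of the univariate normal model, $M_\gamma(\theta_0)$ defined in (14) can be expressed as

$$M_\gamma(\theta_0)=\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}A(\gamma)V^{-1}, \tag{27}$$

where $V$ is the asymptotic covariance matrix given by (8) and the matrix $A(\gamma)$ is given by

$$A(\gamma)=\begin{pmatrix}1&0\\0&\dfrac{3\gamma^2+4\gamma+2}{2(2\gamma+1)}\end{pmatrix}.$$

For small positive values of $\gamma$, the matrix $A(\gamma)$ can be approximated by the identity matrix $I$.

According to Theorem 1, $\sqrt{n}\,(\hat\theta_n-\theta_0)$ is asymptotically multivariate normal and then the statistic $n(\theta_0-\hat\theta_n)^{t}V^{-1}(\theta_0-\hat\theta_n)$ has approximately a $\chi^2_d$ distribution. For large $n$, it holds that

$$E\left[(\theta_0-\hat\theta_n)^{t}M_\gamma(\theta_0)(\theta_0-\hat\theta_n)\right]\approx\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\cdot\frac{d}{n}. \tag{28}$$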

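Also, for the normal model, it holds that

$$\frac{\int p_{\theta_0}^{2\gamma+1}\,d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}=\frac{\gamma+1}{\sqrt{2\gamma+1}}. \tag{29}$$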
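The identity (29) follows from a standard Gaussian integral. As a worked check, for the normal density $p_{\theta_0}$ with mean $m$ and standard deviation $\sigma$ and any $a>0$,

$$\int p_{\theta_0}^{a}\,d\lambda=\frac{1}{(\sigma\sqrt{2\pi})^{a}}\int\exp\left(-\frac{a}{2}\left(\frac{x-m}{\sigma}\right)^2\right)dx=\frac{(\sigma\sqrt{2\pi})^{1-a}}{\sqrt{a}}.$$

Taking $a=2\gamma+1$ in the numerator and $a=\gamma+1$ in the denominator gives

$$\frac{\int p_{\theta_0}^{2\gamma+1}\,d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}=\frac{(\sigma\sqrt{2\pi})^{-2\gamma}/\sqrt{2\gamma+1}}{(\sigma\sqrt{2\pi})^{-2\gamma}/(\gamma+1)}=\frac{\gamma+1}{\sqrt{2\gamma+1}},$$

so the ratio depends neither on $m$ nor on $\sigma$.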

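Therefore, (18) becomes

$$E[W_{\hat\theta_n}]\approx E[Q_{\hat\theta_n}]+\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\cdot\frac{d}{n}+\frac{1}{2\gamma n}\left[1-\frac{\gamma+1}{\sqrt{2\gamma+1}}\right]+E[R_n]+\frac{1}{\gamma}E[R'_n]. \tag{30}$$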

Using the central limit theorem and asymptotic properties of θ^n given in Theorem 1, the following hold

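$$n\cdot o(\|\hat\theta_n-\theta_0\|^2)=o_P(1), \tag{31}$$

$$n\cdot o\left(\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2\right)=o_P(1). \tag{32}$$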

Using (30), (31) and (32) we obtain:

Proposition 6.

For the univariate normal family, an asymptotically unbiased estimator of the expected overall discrepancy is given by

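$$Q_{\hat\theta_n}+\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\cdot\frac{d}{n}+\frac{1}{2\gamma n}\left[1-\frac{\gamma+1}{\sqrt{2\gamma+1}}\right], \tag{33}$$

where $\hat\theta_n$ is a minimum pseudodistance estimator given by (3).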

Under the hypothesis that (Pθ) is the univariate normal model, as we supposed in this subsection, the function h writes as

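$$h(x,\theta)=\left(\sqrt{\gamma+1}\right)^{\gamma/(\gamma+1)}\cdot\left(\sigma\sqrt{2\pi}\right)^{-\gamma/(\gamma+1)}\cdot\exp\left(-\frac{\gamma}{2}\left(\frac{x-m}{\sigma}\right)^2\right) \tag{34}$$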

and it can be easily checked that all the conditions (C1)–(C5) are fulfilled. Therefore we can use all results presented in the preceding subsections, such that Proposition 6 is fully justified.

Moreover, the selection criteria based on (33) are consistent on the basis of Proposition 4. It should also be noted that the bias correction term in (33) decreases slowly as the parameter $\gamma$ increases, always staying very close to zero (of the order of $10^{-2}$). As expected, the larger the sample size, the smaller the bias correction. As we saw in Section 3.2.2, since the gross error sensitivity of $Q_{\hat\theta_n}$ is $\gamma^*(U,P_{\theta_0})=\frac{1}{\gamma}$, larger values of $\gamma$ are associated with more robust procedures. On the other hand, the approximation of $A(\gamma)$ by the identity matrix holds for values of $\gamma$ close to zero. Thus, positive values of $\gamma$ smaller than 0.5, for example, could represent choices that satisfy the robustness requirement while keeping the approximation of $A(\gamma)$ by the identity matrix, which is necessary to construct the criterion in this case.
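To give a sense of the magnitude of the bias correction in (33), the following sketch evaluates its two terms for a few values of $\gamma$ and $n$. The function name and the chosen values ($d=2$, $n=100$ or $200$) are illustrative assumptions.

```python
import math


def bias_correction(gamma, d, n):
    """Sum of the two correction terms added to Q in (33) for the normal family."""
    first = (gamma + 1.0) ** 2 / (2.0 * gamma + 1.0) ** 1.5 * d / n
    second = (1.0 - (gamma + 1.0) / math.sqrt(2.0 * gamma + 1.0)) / (2.0 * gamma * n)
    return first + second


for g in (0.01, 0.1, 0.3, 0.5):
    print(g, bias_correction(g, d=2, n=100), bias_correction(g, d=2, n=200))
```

With $d=2$ and $n=100$ the values are of the order of $10^{-2}$, they shrink as $\gamma$ grows over this range, and they roughly halve when $n$ doubles.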

3.3.2. The Case of Linear Regression Models

In the following, we adapt the pseudodistance based model selection criterion in the case of linear regression models. Consider the linear regression model

$$Y=\alpha+\beta^{t}X+e \tag{35}$$

where $e\sim N(0,\sigma)$ and $e$ is independent of $X$. Suppose we have a sample given by the i.i.d. random vectors $Z_i=(X_i,Y_i)$, $i=1,\dots,n$, such that $Y_i=\alpha+\beta^{t}X_i+e_i$.

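We consider the joint distribution of the entire data and write a pseudodistance between the theoretical model and the true model corresponding to the data. Let $P_\theta$, $\theta:=(\alpha,\beta,\sigma)$, be the probability measure associated with the theoretical model given by the random vector $Z=(X,Y)$ and $Q$ the probability measure associated with the true model corresponding to the data. Denote by $p_\theta$ and $q$, respectively, the corresponding densities. For $\gamma>0$, the pseudodistance between $P_\theta$ and $Q$ is defined by

$$R_\gamma(P_\theta,Q):=\frac{1}{\gamma+1}\ln\left(\int p_\theta^{\gamma}(x,y)\,dP_\theta(x,y)\right)+\frac{1}{\gamma(\gamma+1)}\ln\left(\int q^{\gamma}(x,y)\,dQ(x,y)\right)-\frac{1}{\gamma}\ln\left(\int p_\theta^{\gamma}(x,y)\,dQ(x,y)\right). \tag{36}$$

Similar to [18], since the middle term above does not depend on $P_\theta$, a minimum pseudodistance estimator of the parameter $\theta_0=(\alpha_0,\beta_0,\sigma_0)$ is defined by

$$\hat\theta_n=(\hat\alpha_n,\hat\beta_n,\hat\sigma_n)=\arg\min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\left(\int p_\theta^{\gamma}(x,y)\,dP_\theta(x,y)\right)-\frac{1}{\gamma}\ln\left(\int p_\theta^{\gamma}(x,y)\,dP_n(x,y)\right)\right\}, \tag{37}$$

where $P_n$ is the empirical measure associated with the sample. This estimator can be written as

$$\hat\theta_n=(\hat\alpha_n,\hat\beta_n,\hat\sigma_n)=\arg\min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\left(\int\phi_\sigma^{\gamma+1}(e)\,de\right)-\frac{1}{\gamma}\ln\left(\frac{1}{n}\sum_{i=1}^n\phi_\sigma^{\gamma}(Y_i-\alpha-\beta^{t}X_i)\right)\right\}, \tag{38}$$

where $\phi_\sigma$ is the density of the random variable $e\sim N(0,\sigma)$. Then, the estimator $Q_{\hat\theta_n}$ can be written as

$$Q_{\hat\theta_n}=\min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\left(\frac{1}{(\sigma\sqrt{2\pi})^{\gamma}\sqrt{\gamma+1}}\right)-\frac{1}{\gamma}\ln\left(\frac{1}{n}\sum_{i=1}^n\frac{1}{(\sigma\sqrt{2\pi})^{\gamma}}\cdot\exp\left(-\frac{\gamma}{2\sigma^2}(Y_i-\alpha-\beta^{t}X_i)^2\right)\right)\right\}. \tag{39}$$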

In order to construct an asymptotically unbiased estimator of the expected overall discrepancy in the case of linear regression models, we evaluated the second and the third terms in (18).

For values of $\gamma$ close to 0 ($\gamma$ smaller than 0.3), we found the following approximation of the matrix $M_\gamma(\theta_0)$:

$$M_\gamma(\theta_0)\approx\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}V^{-1}\begin{pmatrix}I&0\\0&\dfrac{3\gamma^2+4\gamma+2}{2\gamma+1}\end{pmatrix}, \tag{40}$$

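where $V$ is the asymptotic covariance matrix of $\hat\theta_n$ and $I$ is the identity matrix. We refer to [15] for the asymptotic properties of the minimum pseudodistance estimators in the case of linear regression models. Since $\sqrt{n}\,(\hat\theta_n-\theta_0)$ is asymptotically multivariate normal, using the $\chi^2$ distribution, we obtain the approximation

$$E\left[(\hat\theta_n-\theta_0)^{t}M_\gamma(\theta_0)(\hat\theta_n-\theta_0)\right]\approx\frac{1}{n}\cdot\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\left[(d-1)+\frac{3\gamma^2+4\gamma+2}{2(\gamma+1)(2\gamma+1)}\right]. \tag{41}$$

Also, the third term in (18) is given by

$$\frac{1}{2\gamma n}\left[1-\left(\frac{\gamma+1}{\sqrt{2\gamma+1}}\right)^{d}\right]. \tag{42}$$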

Then, according to Proposition 3, an asymptotically unbiased estimator of the expected overall discrepancy is given by

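$$Q_{\hat\theta_n}+\frac{1}{n}\cdot\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\left[(d-1)+\frac{3\gamma^2+4\gamma+2}{2(\gamma+1)(2\gamma+1)}\right]+\frac{1}{2\gamma n}\left[1-\left(\frac{\gamma+1}{\sqrt{2\gamma+1}}\right)^{d}\right], \tag{43}$$

where $Q_{\hat\theta_n}$ is given by (39). Note that, using the asymptotic properties of $\hat\theta_n$ and the central limit theorem, the last two terms in (18) of Proposition 3 are $o_P\left(\frac{1}{n}\right)$.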

When we compare different linear regression models, as in Section 4 below, we can ignore the terms depending only on n and γ in (43). Therefore, we can use as model selection criterion the simplified expression

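$$Q_{\hat\theta_n}+\frac{(\gamma+1)^2}{\sqrt{(2\gamma+1)^3}}\cdot\frac{d}{n}-\frac{1}{2\gamma n}\left(\frac{\gamma+1}{\sqrt{2\gamma+1}}\right)^{d}, \tag{44}$$

which we call the Pseudodistance-based Information Criterion (PIC).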
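A minimal sketch of how (44) could be evaluated once a model has been fitted is given below. It assumes that the residuals $Y_i-\hat\alpha-\hat\beta^{t}X_i$ and the scale $\hat\sigma$ come from a minimum pseudodistance fit as in (38), which is not implemented here, and that `d` is the dimension of $\theta=(\alpha,\beta,\sigma)$. The function names are hypothetical.

```python
import math


def q_regression(residuals, sigma, gamma):
    """Expression inside the minimum in (39), evaluated at given residuals and scale."""
    n = len(residuals)
    log_c = math.log(sigma * math.sqrt(2.0 * math.pi))
    first = (-gamma * log_c - 0.5 * math.log(gamma + 1.0)) / (gamma + 1.0)
    mean_exp = sum(math.exp(-gamma * r * r / (2.0 * sigma ** 2)) for r in residuals) / n
    second = (-gamma * log_c + math.log(mean_exp)) / gamma
    return first - second


def pic(residuals, sigma, d, gamma):
    """PIC as read from (44); residuals and sigma are assumed to be the fitted values."""
    n = len(residuals)
    penalty = (gamma + 1.0) ** 2 / (2.0 * gamma + 1.0) ** 1.5 * d / n
    offset = ((gamma + 1.0) / math.sqrt(2.0 * gamma + 1.0)) ** d / (2.0 * gamma * n)
    return q_regression(residuals, sigma, gamma) + penalty - offset


# Hypothetical residuals from two fits of the same data.
print(pic([0.1, -0.2, 0.3, -0.1], sigma=0.25, d=3, gamma=0.1))
print(pic([1.0, -2.0, 3.0, -1.0], sigma=2.0, d=3, gamma=0.1))
```

Among candidate models fitted to the same data, the one with the smallest value would be selected.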

4. Applications

4.1. Simulation Study

In order to illustrate the performance of the PIC criterion (44) in the case of linear regression models, we performed a simulation study using for comparison the model selection criteria AIC, BIC and MDIC. These criteria are defined respectively by

$$\mathrm{AIC}=n\log\hat\sigma_p^2+2(p+2),$$

$$\mathrm{BIC}=n\log\hat\sigma_p^2+(p+2)\log n,$$

where $n$ is the sample size, $p$ the number of covariates of the model and $\hat\sigma_p^2$ the classical unbiased estimator of the variance of the model,

$$\mathrm{MDIC}=n\,MQ_{\hat\theta}+(2\pi)^{-\alpha/2}(1+\alpha)^{2+p/2}\,p$$

with $\alpha=0.25$ and

$$MQ_{\hat\theta}=-\left(1+\alpha^{-1}\right)\frac{1}{n}\sum_{i=1}^n f_{\hat\theta}^{\alpha}(X_i),$$

where $\hat\theta$ is a consistent estimate of the vector of unknown parameters involved in the model with $p$ covariates and $f_{\hat\theta}$ is the associated probability density function. Note that MDIC is based on the well-known BHHJ family of divergence measures indexed by a parameter $\alpha>0$ and on the minimum divergence estimating method for robust parameter estimation (see [20]). The value $\alpha=0.25$ was found in [9] to be an ideal one for a great variety of settings. These three criteria were chosen for comparison with PIC not only because of their popularity, but also because of their special characteristics. Indeed, AIC is the classical representative of asymptotically efficient criteria, BIC is known to be consistent, while MDIC is associated with robust estimation (see e.g., [20,21,22,23]).

Let X1,X2,X3,X4 be four variables following respectively the normal distributions N(0,3), N(1,3), N(2,3) and N(3,3). We consider the model

$$Y=a_0+a_1X_1+a_2X_2+\varepsilon$$

with $a_0=a_1=a_2=1$ and $\varepsilon\sim N(0,1)$. This is the uncontaminated model. In order to evaluate the robustness of the new PIC criterion, we also consider the contaminated model

$$Y=d_1(a_0+a_1X_1+a_2X_2+\varepsilon)+d_2(a_0+a_1X_1+a_2X_2+\varepsilon^*)$$

where $\varepsilon^*\sim N(5,1)$ and $d_1,d_2\in[0,1]$ such that $d_1+d_2=1$. Note that for $d_1=1$ and $d_2=0$ the uncontaminated model is obtained.

The simulated data corresponding to the contaminated model are

$$Y_i=d_1(1+X_{1,i}+X_{2,i}+\varepsilon_i)+d_2(1+X_{1,i}+X_{2,i}+\varepsilon_i^*),$$

for $i=1,\dots,n$, where $X_{1,i},X_{2,i},\varepsilon_i,\varepsilon_i^*$ are values of the variables $X_1,X_2,\varepsilon,\varepsilon^*$ independently generated from the normal distributions $N(1,3)$, $N(2,3)$, $N(0,1)$, $N(5,1)$, respectively.
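The data-generating step just described can be sketched as follows. Two points are assumptions of this example rather than statements of the paper: the second argument of $N(\cdot,\cdot)$ is read as a standard deviation, and the covariate means are taken from the sentence directly above. The function name and the seeds are hypothetical.

```python
import random


def simulate_contaminated(n, d1, rng, x_means=(1.0, 2.0), x_sd=3.0):
    """Draw n triples (X1, X2, Y) from the contaminated model with weights d1 and 1 - d1."""
    d2 = 1.0 - d1
    rows = []
    for _ in range(n):
        x1 = rng.gauss(x_means[0], x_sd)   # assumed: 3 is a standard deviation
        x2 = rng.gauss(x_means[1], x_sd)
        eps = rng.gauss(0.0, 1.0)          # regular error
        eps_star = rng.gauss(5.0, 1.0)     # contaminating error
        signal = 1.0 + x1 + x2             # a0 = a1 = a2 = 1
        y = d1 * (signal + eps) + d2 * (signal + eps_star)
        rows.append((x1, x2, y))
    return rows


clean = simulate_contaminated(2000, 1.0, random.Random(1))
contaminated = simulate_contaminated(2000, 0.8, random.Random(1))
shift = sum(c[2] - k[2] for c, k in zip(contaminated, clean)) / len(clean)
print(shift)
```

With the same seed the two samples differ only through the term $d_2(\varepsilon^*-\varepsilon)$, so the average shift in $Y$ is close to $0.2\times 5=1$.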

With a set of four possible regressors there are $2^4-1=15$ possible model specifications that include at least one regressor. These 15 possible models constitute the set of candidate models in our study. More precisely, this set contains the full model $(X_1,X_2,X_3,X_4)$ given by

$$Y=b_0+b_1X_1+b_2X_2+b_3X_3+b_4X_4+\varepsilon$$

as well as all 14 possible subsets of the full model consisting of one $(X_{j_1})$, two $(X_{j_1},X_{j_2})$ and three $(X_{j_1},X_{j_2},X_{j_3})$ of the four regressors $X_1,X_2,X_3$ and $X_4$, with $j_1\neq j_2\neq j_3$, $j_i\in\{1,2,3,4\}$ and $i=1,2,3$.
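The candidate set described above can be enumerated mechanically. The sketch below lists all non-empty subsets of the four regressors and includes a small helper that picks the model with the smallest criterion value. The names `candidate_models` and `select_model` and the scores in the usage lines are hypothetical.

```python
from itertools import combinations

REGRESSORS = ("X1", "X2", "X3", "X4")


def candidate_models(regressors=REGRESSORS):
    """All non-empty subsets of the regressors, ordered by size."""
    return [
        subset
        for k in range(1, len(regressors) + 1)
        for subset in combinations(regressors, k)
    ]


def select_model(scores):
    """Return the candidate whose criterion value is smallest."""
    return min(scores, key=scores.get)


models = candidate_models()
print(len(models))  # 15
print(select_model({("X1",): 2.4, ("X1", "X2"): 1.1, ("X1", "X2", "X3"): 1.3}))
```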

In our simulation study, for several values of the parameter $\gamma$ associated with the pseudodistance, we compared the new criterion PIC with the other model selection criteria. Different levels of contamination and different sample sizes have been considered. In the examples presented in this work, $d_1\in\{0.8,0.9,0.95,1\}$ and $n\in\{20,50,100\}$. Additional examples for $n=30,75,200,500$ have been analyzed (results not shown) with similar findings (see below). For each setting, fifty experiments were performed in order to select the best model among the available candidate models. In each of the fifty experiments, on the basis of the simulated observations, the value of each of the above model selection criteria was calculated for each of the 15 possible models. Then, for each criterion, the 15 candidate models were ranked from 1st to 15th according to the value of the criterion. The model chosen by a given criterion is the one for which the value of the criterion is the lowest among all the 15 candidate models.

Tables 1 to 12 present the proportions of models selected by the considered criteria. Among the 15 candidate models, only four were chosen at least once. These four models are the same in all instances and appear in the second column of all tables.

Table 1. Proportions (in %) of selected models by the considered criteria (n = 20, d1 = 0.8).

| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 90 | 84 | 88 | 84 | 92 | 90 | 86 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 10 | 16 | 12 | 16 | 8 | 10 | 14 |
| AIC | X1,X2 | 62 | 56 | 52 | 56 | 66 | 56 | 60 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 38 | 44 | 48 | 44 | 34 | 44 | 40 |
| BIC | X1,X2 | 74 | 76 | 60 | 74 | 72 | 68 | 70 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 26 | 24 | 40 | 26 | 28 | 32 | 30 |
| MDIC | X1,X2 | 86 | 86 | 64 | 78 | 84 | 80 | 74 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 14 | 14 | 36 | 22 | 16 | 20 | 26 |

Table 2. Proportions (in %) of selected models by the considered criteria (n = 20, d1 = 0.9).

| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 80 | 84 | 90 | 82 | 82 | 80 | 80 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 20 | 16 | 10 | 18 | 18 | 20 | 20 |
| AIC | X1,X2 | 60 | 52 | 56 | 62 | 64 | 54 | 52 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 40 | 48 | 44 | 38 | 36 | 46 | 48 |
| BIC | X1,X2 | 76 | 70 | 78 | 72 | 84 | 76 | 76 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 24 | 30 | 22 | 28 | 16 | 24 | 24 |
| MDIC | X1,X2 | 86 | 76 | 88 | 74 | 92 | 78 | 86 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 14 | 24 | 22 | 26 | 8 | 22 | 14 |

Table 3. Proportions (in %) of selected models by the considered criteria (n = 20, d1 = 0.95).

| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 82 | 88 | 80 | 94 | 82 | 88 | 86 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 8 | 12 | 20 | 6 | 18 | 12 | 14 |
| AIC | X1,X2 | 78 | 50 | 66 | 70 | 66 | 64 | 66 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 22 | 50 | 34 | 30 | 34 | 36 | 34 |
| BIC | X1,X2 | 84 | 64 | 74 | 84 | 84 | 76 | 82 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 16 | 36 | 26 | 16 | 16 | 24 | 18 |
| MDIC | X1,X2 | 90 | 74 | 82 | 88 | 88 | 80 | 88 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 10 | 26 | 18 | 12 | 12 | 20 | 12 |

Table 4. Proportions (in %) of selected models by the considered criteria (n = 20, d1 = 1).

| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 86 | 86 | 86 | 86 | 88 | 82 | 92 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 14 | 14 | 14 | 14 | 12 | 18 | 8 |
| AIC | X1,X2 | 64 | 74 | 62 | 58 | 64 | 62 | 70 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 36 | 26 | 38 | 42 | 36 | 38 | 30 |
| BIC | X1,X2 | 78 | 90 | 78 | 80 | 82 | 80 | 74 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 22 | 10 | 22 | 20 | 18 | 20 | 26 |
| MDIC | X1,X2 | 84 | 92 | 88 | 88 | 88 | 88 | 80 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 16 | 8 | 12 | 12 | 12 | 12 | 20 |

Table 5. Proportions (in %) of selected models by the considered criteria (n = 50, d1 = 0.8).

| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 86 | 96 | 94 | 90 | 88 | 86 | 90 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 14 | 4 | 6 | 10 | 12 | 14 | 10 |
| AIC | X1,X2 | 74 | 64 | 82 | 62 | 64 | 78 | 72 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 26 | 36 | 18 | 38 | 36 | 22 | 28 |
| BIC | X1,X2 | 94 | 86 | 96 | 86 | 90 | 88 | 90 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 6 | 14 | 4 | 14 | 10 | 12 | 10 |
| MDIC | X1,X2 | 94 | 82 | 98 | 82 | 86 | 88 | 90 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 6 | 18 | 2 | 18 | 14 | 12 | 10 |

Table 6. Proportions (in %) of selected models by the considered criteria (n = 50, d1 = 0.9).

| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 92 | 88 | 92 | 90 | 82 | 94 | 86 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 8 | 12 | 8 | 10 | 18 | 6 | 14 |
| AIC | X1,X2 | 70 | 64 | 62 | 64 | 66 | 74 | 72 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 30 | 36 | 38 | 36 | 34 | 26 | 28 |
| BIC | X1,X2 | 92 | 88 | 82 | 92 | 88 | 88 | 86 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 8 | 12 | 18 | 8 | 12 | 12 | 14 |
| MDIC | X1,X2 | 92 | 86 | 76 | 88 | 84 | 88 | 86 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 8 | 14 | 24 | 12 | 16 | 12 | 14 |

Table 7. Proportions (in %) of selected models by the considered criteria (n = 50, d1 = 0.95).

| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 94 | 92 | 92 | 88 | 84 | 90 | 88 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 6 | 8 | 8 | 12 | 16 | 10 | 12 |
| AIC | X1,X2 | 70 | 62 | 66 | 68 | 70 | 72 | 58 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 30 | 38 | 34 | 32 | 30 | 28 | 42 |
| BIC | X1,X2 | 96 | 82 | 92 | 86 | 92 | 92 | 86 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 4 | 18 | 8 | 14 | 8 | 8 | 14 |
| MDIC | X1,X2 | 90 | 78 | 88 | 86 | 86 | 90 | 82 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | 10 | 22 | 12 | 14 | 14 | 10 | 18 |

Table 8.

Proportions of selected models by the considered criteria (n = 50, d1=1).

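| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 94 | 90 | 80 | 84 | 90 | 94 | 88 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (6) | (10) | (20) | (16) | (10) | (6) | (12) |
| AIC | X1,X2 | 64 | 68 | 62 | 68 | 66 | 64 | 62 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (34) | (32) | (38) | (32) | (34) | (36) | (38) |
| BIC | X1,X2 | 86 | 86 | 86 | 90 | 86 | 94 | 82 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (14) | (14) | (14) | (10) | (14) | (6) | (18) |
| MDIC | X1,X2 | 84 | 84 | 82 | 88 | 84 | 90 | 82 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (16) | (16) | (18) | (12) | (16) | (10) | (18) |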

Table 9.

Proportions of selected models by the considered criteria (n = 100, d1=0.8).

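| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 94 | 94 | 94 | 92 | 88 | 88 | 94 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (6) | (6) | (6) | (8) | (12) | (12) | (6) |
| AIC | X1,X2 | 70 | 82 | 78 | 70 | 68 | 68 | 72 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (30) | (18) | (22) | (30) | (32) | (32) | (28) |
| BIC | X1,X2 | 90 | 96 | 98 | 90 | 96 | 94 | 88 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (10) | (4) | (2) | (10) | (4) | (6) | (12) |
| MDIC | X1,X2 | 86 | 96 | 92 | 86 | 92 | 90 | 88 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (14) | (4) | (8) | (14) | (8) | (10) | (12) |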

Table 10.

Proportions of selected models by the considered criteria (n = 100, d1=0.9).

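| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 88 | 92 | 96 | 88 | 88 | 88 | 86 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (12) | (8) | (4) | (12) | (12) | (12) | (14) |
| AIC | X1,X2 | 68 | 72 | 78 | 66 | 70 | 78 | 60 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (32) | (28) | (22) | (34) | (30) | (22) | (40) |
| BIC | X1,X2 | 98 | 98 | 96 | 88 | 92 | 94 | 92 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (2) | (2) | (4) | (12) | (8) | (6) | (8) |
| MDIC | X1,X2 | 90 | 90 | 96 | 84 | 82 | 90 | 82 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (10) | (10) | (4) | (16) | (18) | (10) | (18) |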

Table 11.

Proportions of selected models by the considered criteria (n = 100, d1=0.95).

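| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 90 | 88 | 92 | 90 | 98 | 96 | 92 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (10) | (12) | (8) | (10) | (2) | (4) | (8) |
| AIC | X1,X2 | 70 | 78 | 78 | 66 | 82 | 68 | 68 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (30) | (22) | (22) | (34) | (18) | (32) | (32) |
| BIC | X1,X2 | 96 | 92 | 92 | 94 | 96 | 94 | 88 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (4) | (8) | (8) | (6) | (4) | (6) | (12) |
| MDIC | X1,X2 | 90 | 88 | 82 | 90 | 94 | 84 | 88 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (10) | (12) | (18) | (10) | (6) | (16) | (12) |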

Table 12.

Proportions of the selected models by the considered criteria (n = 100, d1=1).

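| Criteria | Variables | γ=0.01 | γ=0.05 | γ=0.1 | γ=0.15 | γ=0.2 | γ=0.25 | γ=0.3 |
|---|---|---|---|---|---|---|---|---|
| PIC | X1,X2 | 94 | 96 | 92 | 92 | 96 | 90 | 94 |
| PIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (6) | (4) | (8) | (8) | (4) | (10) | (6) |
| AIC | X1,X2 | 78 | 74 | 72 | 74 | 70 | 62 | 74 |
| AIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (22) | (26) | (28) | (26) | (30) | (38) | (26) |
| BIC | X1,X2 | 96 | 100 | 92 | 96 | 94 | 90 | 100 |
| BIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (4) | | (8) | (4) | (6) | (10) | |
| MDIC | X1,X2 | 94 | 92 | 86 | 90 | 86 | 80 | 94 |
| MDIC | X1,X2,X3; X1,X2,X4; X1,X2,X3,X4 (combined) | (6) | (8) | (14) | (10) | (14) | (20) | (6) |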

For small sample sizes (n=20, n=30) the criteria PIC and MDIC yield the best results. When the level of contamination is 10% or 20%, the PIC criterion yields very good results and beats the other competitors almost all the time. When the level of contamination is small, for example 5% or when there is no contamination, the two criteria are comparable, in the sense that in many cases the proportions of selected models by the two criteria are very close, so that sometimes PIC wins and sometimes MDIC wins. Table 1, Table 2, Table 3 and Table 4 present these results for n=20, but similar results are obtained for n=30, too.

For medium sample sizes (n=50, n=75), the criteria PIC and BIC yield the best results. The results for n=50 are given in Table 5, Table 6, Table 7 and Table 8. Note that the PIC criterion yields the best results for 0% and 10% contamination. For the other levels of contamination, there are values of γ for which PIC is the best among all the considered criteria. On the other hand, in most cases when BIC wins, the proportions of selections of the true model by BIC and PIC are close.

When the sample size is large (n=100, n=200, n=500), BIC generally yields better results than PIC which stays relatively close behind, but sometimes BIC and PIC have the same performance. Table 9, Table 10, Table 11 and Table 12 present the results obtained for n=100.

Thus, the new PIC criterion works very well for small to medium sample sizes and for levels of contamination up to 20%, but falls behind BIC for large sample sizes. Note that for contaminated data, PIC with γ=0.15 prevails in most of the considered cases. On the other hand, for uncontaminated data, it is PIC with γ=0.2 that prevails in all the considered instances. It is also worth mentioning that PIC with γ=0.3 appears to behave very satisfactorily in most cases, irrespective of the proportion of contamination (0%–20%) and the sample size. Observe also that in all cases AIC has the highest overestimation rate, which is somewhat expected (see [24]).

Although consistency is the main focus of the applications presented in this work, one should point out that if prediction is part of the objective of a regression analysis, then model selection carried out using criteria such as the ones used in this work has desirable properties. In fact, in the case of finite-dimensional normal regression models, criteria such as AIC and BIC have been shown to be associated with satisfactory prediction errors (see [25]). Furthermore, it should be pointed out that in many instances PIC behaves quite similarly to the above criteria, choosing the same models. Also, according to the presented simulation results, the proportion of selections of the true model by PIC is always better than the corresponding proportion for AIC (even in the case of uncontaminated data) and is sometimes better than the corresponding proportion for BIC. These results suggest a satisfactory prediction ability for the proposed PIC criterion.

In conclusion, the new PIC criterion is a good competitor of the well known model selection criteria AIC, BIC and MDIC and may have superior performance especially in the case of small and contaminated samples.

4.2. Real Data Example

In order to illustrate the proposed method, we used the Hald cement data (see [26]), which represent a popular example for multiple linear regression. This example concerns the heat evolved in calories per gram of cement Y as a function of the amount of each of four ingredients in the mix: tricalcium aluminate (X1), tricalcium silicate (X2), tetracalcium alumino-ferrite (X3) and dicalcium silicate (X4). The data are presented in Table 13.

Table 13.

Hald cement data.

X1 X2 X3 X4 Y
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4

Since 4 variables are available, there are 15 possible candidate models (involving at least one regressor) for this data set. Note that the 4 single-variable models should be excluded from the analysis, because cement involves a mixture of at least two components that react chemically (see [27], p. 102). The model selection criteria that have been used are PIC for several values of γ, AIC, BIC and MDIC with α=0.25. Table 14 shows the model selected by each of the considered criteria.

Table 14.

Selected models by model selection criteria.

Criteria Variables
PIC, γ=0.05 X1,X2,X4
PIC, γ=0.15 X1,X2,X4
PIC, γ=0.2 X1,X2,X3
PIC, γ=0.25 X1,X2,X4
PIC, γ=0.3 X1,X2,X4
AIC X1,X2,X4
BIC X1,X2
MDIC X1,X2,X3

Observe that, in this example, PIC behaves similarly to AIC and MDIC, having a slight tendency of overestimation. Note that for this specific dataset the collinearity is quite strong, with X1 and X3 as well as X2 and X4 being seriously correlated. It should be pointed out that the model (X1,X2,X4) is chosen not only by AIC and PIC, but also by Mallows’ Cp criterion ([1]), with (X1,X2,X3) coming a very close second. Note further that (X1,X2,X4) has also been chosen by cross validation ([28], p. 33) and PRESS ([26], p. 325). Finally, it is worth noticing that these two models share the highest adjusted R² values, which are almost identical (0.976 for (X1,X2,X4) and 0.974 for (X1,X2,X3)), making the distinction between them extremely hard. Thus, in this example, the new PIC criterion gives results as good as other recognized classical model selection criteria.
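The candidate set described above (all subsets of at least two regressors) can be enumerated and scored with a few lines of Python. This is only a minimal sketch, not the authors' R implementation: the data are copied from Table 13, the helper names `candidate_models` and `fit_summary` are hypothetical, and the AIC and BIC formulas are the usual Gaussian least-squares forms, which is an assumption about the variant used in the paper. PIC and MDIC are not implemented here.

```python
from itertools import combinations

import numpy as np

# Hald cement data copied from Table 13 (columns: X1, X2, X3, X4, Y).
HALD = np.array([
    [7, 26, 6, 60, 78.5],
    [1, 29, 15, 52, 74.3],
    [11, 56, 8, 20, 104.3],
    [11, 31, 8, 47, 87.6],
    [7, 52, 6, 33, 95.9],
    [11, 55, 9, 22, 109.2],
    [3, 71, 17, 6, 102.7],
    [1, 31, 22, 44, 72.5],
    [2, 54, 18, 22, 93.1],
    [21, 47, 4, 26, 115.9],
    [1, 40, 23, 34, 83.8],
    [11, 66, 9, 12, 113.3],
    [10, 68, 8, 12, 109.4],
])


def candidate_models(n_vars=4, min_size=2):
    """All regressor subsets with at least `min_size` variables (0-based indices)."""
    return [cols
            for size in range(min_size, n_vars + 1)
            for cols in combinations(range(n_vars), size)]


def fit_summary(cols):
    """Ordinary least squares with intercept; returns adjusted R^2, AIC and BIC."""
    y = HALD[:, 4]
    X = np.column_stack([np.ones(len(y)), HALD[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    n, p = len(y), len(cols)
    k = p + 2  # slopes + intercept + error variance (assumed parameter count)
    adj_r2 = 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))
    aic = n * np.log(rss / n) + 2 * k          # Gaussian AIC up to a constant
    bic = n * np.log(rss / n) + k * np.log(n)  # Gaussian BIC up to a constant
    return {"adj_r2": adj_r2, "aic": aic, "bic": bic}


results = {tuple(f"X{j + 1}" for j in cols): fit_summary(cols)
           for cols in candidate_models()}

for name, s in sorted(results.items(), key=lambda kv: kv[1]["bic"]):
    print(",".join(name), f"adjR2={s['adj_r2']:.3f}",
          f"AIC={s['aic']:.2f}", f"BIC={s['bic']:.2f}")
```

Excluding the four single-variable models leaves 6 + 4 + 1 = 11 candidates. Rankings produced by this sketch depend on the assumed AIC/BIC variants and need not coincide with Table 14.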

5. Conclusions

In this work, by applying the same methodology as for AIC to a family of pseudodistances, we constructed new model selection criteria using minimum pseudodistance estimators. We proved theoretical properties of these criteria including asymptotic unbiasedness, robustness, consistency, as well as the limit laws. The case of the linear regression models was studied in detail and specific selection criteria based on pseudodistance are proposed.

For linear regression models, a comparative study based on Monte Carlo simulations illustrates the performance of the new methodology. For small sample sizes, the criteria PIC and MDIC yield the best results and in many cases PIC wins, for example when the level of contamination is 10% or 20%. For medium sample sizes, the criteria PIC and BIC yield the best results. When the sample size is large, BIC generally yields better results than PIC, which stays relatively close behind, but sometimes BIC and PIC have the same performance.

Based on the results of the simulation study and on the real data example, we conclude that the new PIC criterion is a good competitor of the well known criteria AIC, BIC and MDIC with an overall performance which is very satisfactory for all possible settings according to the sample size and contamination rate. Also PIC may have superior performance, especially in the case of small and contaminated samples.

An important issue that needs further investigation is the choice of the appropriate value for the parameter γ associated with the procedure. The findings of the presented simulation study show that, for contaminated data, the value γ=0.15 leads to very good results, irrespective of the sample size. Also, γ=0.3 produces overall very satisfactory results, irrespective of the sample size and the contamination rate. We hope to explore this problem further and provide a clear solution in future work. We also intend to extend this methodology to other types of models, including nonlinear or time series models.

Acknowledgments

The work of the first author was partially supported by a grant of the Romanian National Authority for Scientific Research, CNCS-UEFISCDI, project number PN-II-RU-TE-2012-3-0007. The work of the third author was completed as part of the activities of the Laboratory of Statistics and Data Analysis of the University of the Aegean.

Abbreviations

The following abbreviations are used in this manuscript:

AIC Akaike Information Criterion
BIC Bayesian Information Criterion
GIC General Information Criterion
DIC Divergence Information Criterion
MDIC Modified Divergence Information Criterion
PIC Pseudodistance based Information Criterion
BHHJ family of measures Basu, Harris, Hjort and Jones family of measures

Appendix A

Proof of Proposition 1.

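Using a Taylor expansion of $W_\theta$ around the true parameter $\theta_0$ and taking $\theta=\hat{\theta}_n$, on the basis of (12) and (13) we obtain

$$W_{\hat{\theta}_n}=W_{\theta_0}+\frac{1}{2}(\hat{\theta}_n-\theta_0)^t M_\gamma(\theta_0)(\hat{\theta}_n-\theta_0)+o(\|\hat{\theta}_n-\theta_0\|^2). \tag{A1}$$

Then (15) holds. □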

Proof of Proposition 2.

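Using a Taylor expansion of $Q_\theta$ around $\hat{\theta}_n$ and taking $\theta=\theta_0$, we obtain

$$Q_{\theta_0}=Q_{\hat{\theta}_n}+\left[\frac{\partial}{\partial\theta}Q_\theta\right]^t_{\theta=\hat{\theta}_n}(\theta_0-\hat{\theta}_n)+\frac{1}{2}(\theta_0-\hat{\theta}_n)^t\left[\frac{\partial^2}{\partial\theta^2}Q_\theta\right]_{\theta=\hat{\theta}_n}(\theta_0-\hat{\theta}_n)+o(\|\hat{\theta}_n-\theta_0\|^2). \tag{A2}$$

Note that $\left[\frac{\partial}{\partial\theta}Q_\theta\right]_{\theta=\hat{\theta}_n}=0$ by the very definition of $\hat{\theta}_n$.

By applying the weak law of large numbers and the continuous mapping theorem, we get

$$\left[\frac{\partial^2}{\partial\theta^2}Q_\theta\right]_{\theta=\theta_0}-\left[\frac{\partial^2}{\partial\theta^2}W_\theta\right]_{\theta=\theta_0}\xrightarrow{P}0 \tag{A3}$$

and using (13)

$$\left[\frac{\partial^2}{\partial\theta^2}Q_\theta\right]_{\theta=\theta_0}-M_\gamma(\theta_0)\xrightarrow{P}0. \tag{A4}$$

Then, using the consistency of $\hat{\theta}_n$ and (A4), we obtain

$$\left[\frac{\partial^2}{\partial\theta^2}Q_\theta\right]_{\theta=\hat{\theta}_n}=M_\gamma(\theta_0)+o_P(1). \tag{A5}$$

Consequently,

$$Q_{\theta_0}=Q_{\hat{\theta}_n}+\frac{1}{2}(\theta_0-\hat{\theta}_n)^t M_\gamma(\theta_0)(\theta_0-\hat{\theta}_n)+o(\|\hat{\theta}_n-\theta_0\|^2) \tag{A6}$$

and we deduce (17). □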

Proof of Proposition 3.

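Using Proposition 1 and Proposition 2, we obtain

$$E[W_{\hat{\theta}_n}]=E[Q_{\hat{\theta}_n}]+E[(\theta_0-\hat{\theta}_n)^t M_\gamma(\theta_0)(\theta_0-\hat{\theta}_n)]-E[Q_{\theta_0}]+W_{\theta_0}+E[R_n] \tag{A7}$$

where $R_n=o(\|\hat{\theta}_n-\theta_0\|^2)$.

In order to evaluate $W_{\theta_0}-E[Q_{\theta_0}]$, note that

$$Q_{\theta_0}-W_{\theta_0}=-\frac{1}{\gamma}\left[\ln\left(\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)\right)-\ln\int p_{\theta_0}^{\gamma+1}\,d\lambda\right]. \tag{A8}$$

A Taylor expansion of the function $\ln x$ around $\int p_{\theta_0}^{\gamma+1}\,d\lambda$ yields

$$\begin{aligned}\ln\left(\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)\right)={}&\ln\int p_{\theta_0}^{\gamma+1}\,d\lambda+\frac{1}{\int p_{\theta_0}^{\gamma+1}\,d\lambda}\left[\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right]\\&-\frac{1}{2}\cdot\frac{1}{\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}\left[\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right]^2\\&+o\left(\left[\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right]^2\right).\end{aligned} \tag{A9}$$

Then

$$\begin{aligned}E[Q_{\theta_0}-W_{\theta_0}]={}&-\frac{1}{\gamma}E\left[\ln\left(\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)\right)-\ln\int p_{\theta_0}^{\gamma+1}\,d\lambda\right]\\={}&-\frac{1}{\gamma}\Bigg\{\frac{1}{\int p_{\theta_0}^{\gamma+1}\,d\lambda}E\left[\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right]\\&-\frac{1}{2}\cdot\frac{1}{\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}E\left[\left(\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2\right]+E[R_n]\Bigg\}\end{aligned}$$

where $R_n=o\left(\left[\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right]^2\right)$.

On the other hand,

$$\begin{aligned}E\left[\left(\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)-\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2\right]&=\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}p_{\theta_0}^{\gamma}(X_i)\right)=\frac{1}{n}\mathrm{Var}\left(p_{\theta_0}^{\gamma}(X)\right)\\&=\frac{1}{n}\left\{E[p_{\theta_0}^{2\gamma}(X)]-E[p_{\theta_0}^{\gamma}(X)]^2\right\}=\frac{\int p_{\theta_0}^{2\gamma+1}\,d\lambda-\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}{n}.\end{aligned} \tag{A10}$$

Consequently,

$$E[Q_{\theta_0}]-W_{\theta_0}=-\frac{1}{2\gamma n}\left[1-\frac{\int p_{\theta_0}^{2\gamma+1}\,d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}\right]-\frac{1}{\gamma}E[R_n]. \tag{A11}$$

Using (A7) and (A11), we obtain (18). □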
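As a hedged illustration that is not part of the original proof, the ratio of integrals entering (A10) and (A11) can be evaluated in closed form under the additional assumption that $p_{\theta_0}$ is a normal density $N(\mu,\sigma^2)$. For any $a>0$,

$$\int p_{\theta_0}^{a}\,d\lambda=(2\pi\sigma^2)^{(1-a)/2}\,a^{-1/2},$$

so that, taking $a=2\gamma+1$ and $a=\gamma+1$,

$$\frac{\int p_{\theta_0}^{2\gamma+1}\,d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\,d\lambda\right)^2}=\frac{(2\pi\sigma^2)^{-\gamma}(2\gamma+1)^{-1/2}}{(2\pi\sigma^2)^{-\gamma}(\gamma+1)^{-1}}=\frac{\gamma+1}{\sqrt{2\gamma+1}}\geq 1.$$

Under this assumption the $O(1/n)$ term obtained by combining (A9) and (A10) is free of $\mu$ and $\sigma$:

$$\frac{1}{2\gamma n}\left[\frac{\gamma+1}{\sqrt{2\gamma+1}}-1\right],\qquad\text{e.g. for }\gamma=0.15:\quad\frac{1}{0.3\,n}\left[\frac{1.15}{\sqrt{1.3}}-1\right]\approx\frac{0.0287}{n}.$$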

Proof of Proposition 5.

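For the contaminated model $\tilde{P}_{\varepsilon x}=(1-\varepsilon)P_{\theta_0}+\varepsilon\delta_x$, it holds

$$U(\tilde{P}_{\varepsilon x})=\frac{1}{\gamma+1}\ln\int p_{T(\tilde{P}_{\varepsilon x})}^{\gamma+1}\,d\lambda-\frac{1}{\gamma}\ln\int p_{T(\tilde{P}_{\varepsilon x})}^{\gamma}\,d\tilde{P}_{\varepsilon x}. \tag{A12}$$

Derivation with respect to $\varepsilon$ yields

$$\begin{aligned}\frac{\partial}{\partial\varepsilon}\left[U(\tilde{P}_{\varepsilon x})\right]_{\varepsilon=0}={}&\frac{1}{\int p_{\theta_0}^{\gamma+1}\,d\lambda}\cdot\int p_{\theta_0}^{\gamma}\dot{p}_{\theta_0}\,d\lambda\cdot\mathrm{IF}(x;T,P_{\theta_0})\\&-\frac{1}{\gamma}\cdot\frac{1}{\int p_{\theta_0}^{\gamma+1}\,d\lambda}\left\{-\int p_{\theta_0}^{\gamma+1}\,d\lambda+\gamma\cdot\int p_{\theta_0}^{\gamma}\dot{p}_{\theta_0}\,d\lambda\cdot\mathrm{IF}(x;T,P_{\theta_0})+p_{\theta_0}^{\gamma}(x)\right\}\\={}&\frac{1}{\gamma}\cdot\left[1-\frac{p_{\theta_0}^{\gamma}(x)}{\int p_{\theta_0}^{\gamma+1}\,d\lambda}\right].\end{aligned}$$

Thus we obtain

$$\mathrm{IF}(x;U,P_{\theta_0})=\frac{1}{\gamma}\left[1-\frac{p_{\theta_0}^{\gamma}(x)}{\int p_{\theta_0}^{\gamma+1}\,d\lambda}\right]. \tag{A13}$$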
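To make the boundedness of the influence function in (A13) concrete, the following minimal Python sketch evaluates it under the assumption that $p_{\theta_0}$ is a normal density. The function names (`normal_pdf`, `integral_p_power`, `influence_U`) are hypothetical, and the closed form used for $\int p_{\theta_0}^{a}\,d\lambda$ is specific to the normal case; the sketch is illustrative only and is not taken from the paper.

```python
import numpy as np


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def integral_p_power(a, sigma=1.0):
    """Closed form of the integral of p^a for a normal density (a > 0)."""
    return (2.0 * np.pi * sigma ** 2) ** ((1.0 - a) / 2.0) / np.sqrt(a)


def influence_U(x, gamma, mu=0.0, sigma=1.0):
    """Right-hand side of (A13) for a normal model (illustrative assumption)."""
    ratio = normal_pdf(x, mu, sigma) ** gamma / integral_p_power(gamma + 1.0, sigma)
    return (1.0 - ratio) / gamma


if __name__ == "__main__":
    gamma = 0.2
    for x in (0.0, 1.0, 3.0, 10.0, 1000.0):
        print(f"x={x:8.1f}  IF={float(influence_U(x, gamma)):.4f}")
```

In this normal case the expression equals $(1-\sqrt{\gamma+1})/\gamma$ at $x=\mu$ and approaches $1/\gamma$ as $|x|\to\infty$, so a gross outlier has a bounded effect for any fixed $\gamma>0$.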

Author Contributions

A.T. conceived the methodology and obtained the theoretical results. A.T., A.K. and P.T. conceived the application part. A.K. and P.T. implemented the method in R and obtained the numerical results. All authors wrote the paper. All authors have read and approved the final manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1. Mallows C.L. Some comments on Cp. Technometrics. 1973;15:661–675.
  • 2. Akaike H. Proceedings of the Second International Symposium on Information Theory Petrov. Springer; Berlin/Heidelberg, Germany: 1973. Information theory and an extension of the maximum likelihood principle; pp. 267–281.
  • 3. Schwarz G. Estimating the dimension of a model. Ann. Stat. 1978;6:461–464. doi: 10.1214/aos/1176344136.
  • 4. Konishi S., Kitagawa G. Generalised information criteria in model selection. Biometrika. 1996;83:875–890. doi: 10.1093/biomet/83.4.875.
  • 5. Ronchetti E. Robust model selection in regression. Statist. Probab. Lett. 1985;3:21–23. doi: 10.1016/0167-7152(85)90006-9.
  • 6. Ronchetti E., Staudte R.G. A robust version of Mallows’ Cp. J. Am. Stat. Assoc. 1994;89:550–559.
  • 7. Agostinelli C. Robust model selection in regression via weighted likelihood estimating equations. Stat. Probab. Lett. 2002;76:1930–1934. doi: 10.1016/j.spl.2006.04.048.
  • 8. Mattheou K., Lee S., Karagrigoriou A. A model selection criterion based on the BHHJ measure of divergence. J. Stat. Plann. Inf. 2009;139:228–235. doi: 10.1016/j.jspi.2008.04.022.
  • 9. Mantalos P., Mattheou K., Karagrigoriou A. An improved divergence information criterion for the determination of the order of an AR process. Commun. Stat.-Simul. Comput. 2010;39:865–879. doi: 10.1080/03610911003650391.
  • 10. Toma A. Model selection criteria using divergences. Entropy. 2014;16:2686–2698. doi: 10.3390/e16052686.
  • 11. Pardo L. Statistical Inference Based on Divergence Measures. Chapman & Hall; London, UK: 2006.
  • 12. Basu A., Shioya H., Park C. Statistical Inference: The Minimum Distance Approach. Chapman & Hall; London, UK: 2011.
  • 13. Jones M.C., Hjort N.L., Harris I.R., Basu A. A comparison of related density-based minimum divergence estimators. Biometrika. 2001;88:865–873. doi: 10.1093/biomet/88.3.865.
  • 14. Fujisawa H., Eguchi S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008;99:2053–2081. doi: 10.1016/j.jmva.2008.02.004.
  • 15. Broniatowski M., Toma A., Vajda I. Decomposable pseudodistances and applications in statistical estimation. J. Stat. Plan. Infer. 2012;142:2574–2585. doi: 10.1016/j.jspi.2012.03.019.
  • 16. Toma A., Leoni-Aubin S. Optimal robust M-estimators using Renyi pseudodistances. J. Multivar. Anal. 2013;115:359–373. doi: 10.1016/j.jmva.2012.10.003.
  • 17. Toma A., Leoni-Aubin S. Robust portfolio optimization using pseudodistances. PLoS ONE. 2015;10:e0140546. doi: 10.1371/journal.pone.0140546.
  • 18. Toma A., Fulga C. Robust estimation for the single index model using pseudodistances. Entropy. 2018;20:374. doi: 10.3390/e20050374.
  • 19. Hampel F.R., Ronchetti E., Rousseeuw P.J., Stahel W. Robust Statistics: The Approach Based on Influence Functions. Wiley Blackwell; Hoboken, NJ, USA: 1986.
  • 20. Basu A., Harris I.R., Hjort N.L., Jones M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika. 1998;85:549–559. doi: 10.1093/biomet/85.3.549.
  • 21. Karagrigoriou A. Asymptotic efficiency of the order selection of a nongaussian AR process. Stat. Sin. 1997;7:407–423.
  • 22. Vonta F., Karagrigoriou A. Generalized measures of divergence in survival analysis and reliability. J. Appl. Prob. 2010;47:216–234. doi: 10.1239/jap/1269610827.
  • 23. Karagrigoriou A., Mattheou K., Vonta F. On asymptotic properties of AIC variants with applications. Open J. Stat. 2011;1:105–109. doi: 10.4236/ojs.2011.12012.
  • 24. Shibata R. Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika. 1976;63:117–126. doi: 10.1093/biomet/63.1.117.
  • 25. Speed T.P., Yu B. Model selection and prediction: Normal regression. Ann. Inst. Stat. Math. 1993;45:35–54. doi: 10.1007/BF00773667.
  • 26. Draper N.R., Smith H. Applied Regression Analysis. 2nd ed. Wiley Blackwell; Hoboken, NJ, USA: 1981.
  • 27. Burnham K.P., Anderson D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer; Berlin/Heidelberg, Germany: 2002.
  • 28. Hjorth J.S.U. Computer Intensive Statistical Methods: Validation, Model Selection and Bootstrap. Chapman and Hall; London, UK: 1994.

Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
