Entropy. 2020 Feb 27;22(3):270. doi: 10.3390/e22030270

Model Selection in a Composite Likelihood Framework Based on Density Power Divergence

Elena Castilla 1,*, Nirian Martín 2, Leandro Pardo 1, Konstantinos Zografos 3
PMCID: PMC7516723  PMID: 33286044

Abstract

This paper presents a model selection criterion in a composite likelihood framework, based on density power divergence measures and on the composite minimum density power divergence estimators, which depend on a tuning parameter α. After introducing this criterion, some asymptotic properties are established. We present a simulation study and two numerical examples in order to illustrate the robustness properties of the introduced model selection criterion.

Keywords: composite likelihood, composite minimum density power divergence estimators, model selection

1. Introduction

Composite likelihood inference is an important approach for dealing with real situations involving large data sets or very complex models, in which classical likelihood methods are computationally difficult or even infeasible. Composite likelihood methods have been successfully used in many applications concerning, for example, genetics ([1]), generalized linear mixed models ([2]), spatial statistics ([3,4,5]), frailty models ([6]), multivariate survival analysis ([7,8]), etc.

Let us introduce the problem, adopting here the notation of [9]. Let $\{f(\cdot;\theta),\ \theta\in\Theta\subseteq\mathbb{R}^p,\ p\geq 1\}$ be a parametric identifiable family of distributions for an observation $y=(y_1,\dots,y_m)^T$, a realization of a random m-vector Y. In this setting, the composite likelihood function based on K different marginal or conditional distributions has the form

$CL(\theta,y)=\prod_{k=1}^{K}f_{A_k}(y_j,\ j\in A_k;\theta)^{w_k}$

and the corresponding composite log-density

$\log CL(\theta,y)=\sum_{k=1}^{K}w_k\,\ell_{A_k}(\theta,y)$, (1)

with $\ell_{A_k}(\theta,y)=\log f_{A_k}(y_j,\ j\in A_k;\theta)$, where $\{A_k\}_{k=1}^{K}$ is a family of sets of indices associated either with marginal or conditional distributions involving some $y_j$, $j\in\{1,\dots,m\}$, and $w_k$, $k=1,\dots,K$, are non-negative and known weights. If the weights are all equal, then they can be ignored; in this case, all the statistical procedures give equivalent results. The composite maximum likelihood estimator (CMLE), $\hat{\theta}_c$, is obtained by maximizing expression (1) with respect to $\theta\in\Theta$.
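As a small illustration of (1), the following R sketch (assuming the mvtnorm package; the pairwise blocks $A_1=\{1,2\}$, $A_2=\{3,4\}$ anticipate the structure used in Section 4) evaluates a composite log-density built from two bivariate normal marginals with unit weights.

```r
# Minimal sketch: composite log-density (1) with blocks A1 = {1,2}, A2 = {3,4},
# bivariate normal marginals with a common 2x2 correlation block, unit weights.
library(mvtnorm)

comp_loglik <- function(rho, y, mu = rep(0, 4), w = c(1, 1)) {
  Sigma0 <- matrix(c(1, rho, rho, 1), 2, 2)             # common block Sigma_0
  l1 <- dmvnorm(y[1:2], mean = mu[1:2], sigma = Sigma0, log = TRUE)
  l2 <- dmvnorm(y[3:4], mean = mu[3:4], sigma = Sigma0, log = TRUE)
  w[1] * l1 + w[2] * l2                                 # sum_k w_k l_{A_k}(theta, y)
}

comp_loglik(0.15, c(0.3, -0.2, 1.1, 0.4))
```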

The CMLE is consistent and asymptotically normal and, based on it, we can establish hypothesis testing procedures in a way similar to the classical likelihood ratio test, Wald test or Rao's score test. A development of the asymptotic theory of the CMLE, including its application to composite likelihood ratio statistics, Wald-type tests and Rao score tests in the composite likelihood context, can be found in [10]. However, it is shown in [11,12,13] that the CMLE and the derived testing procedures present an important lack of robustness. In this sense, [11,12,13] derived some new distance-based estimators and tests with good robustness behaviour and without an important loss of efficiency. In this paper, we consider the composite minimum density power divergence estimator (CMDPDE), introduced in [12], in order to present a model selection criterion in a composite likelihood framework.

Model selection criteria, which summarize data evidence in favor of a model, are a very well studied subject in the statistical literature, especially in the context of the full likelihood. The construction of such criteria requires a measure of similarity between two models, typically described in terms of their distributions. This can be achieved if an unbiased estimator is found of the expected overall discrepancy, which measures the statistical distance between the true but unknown model and the entertained model. The model with the smallest value of the criterion is then the most preferable. The use of divergence measures, in particular the Kullback–Leibler divergence ([14]), to measure this discrepancy is the main idea behind some of the best-known criteria: the Akaike Information Criterion (AIC, [15,16]), the criterion proposed by Takeuchi (TIC, [17]) and other modifications of AIC [18]. The DIC criterion, based on the density power divergence (DPD), was presented in [19] and, recently, [20] presented a local BHHJ power divergence information criterion following [21]. In the context of the composite likelihood there are some criteria based on the Kullback–Leibler divergence; see for instance [22,23,24] and references therein. To the best of our knowledge, only the Kullback–Leibler divergence has been used to develop model selection criteria in a composite likelihood framework. To fill this gap, our interest is now focused on the DPD.

In this paper, we present a new information criterion for model selection in the framework of the composite likelihood, based on the DPD measure. This divergence measure, introduced and studied in the case of the full likelihood by [25], has been considered previously in [12,13] in the context of the composite likelihood. In those papers, a new estimator, the CMDPDE, was introduced and its robustness in relation to the CMLE, as well as the robustness of some families of test statistics, were studied, but the problem of model selection was not considered. That problem is considered in this paper. The criterion introduced here will be called the composite likelihood DIC criterion (CLDIC). The motivation for considering a criterion based on the DPD instead of the Kullback–Leibler divergence is the robustness of the procedures based on the DPD in statistical inference, not only in the context of the full likelihood [25,26], but also in the context of the composite likelihood [12,13]. In Section 2, the CMDPDE is presented and some properties of this estimator are discussed. The new model selection criterion, CLDIC, based on the CMDPDE, is introduced in Section 3 and some of its asymptotic properties are studied. A simulation study is carried out in Section 4 and some numerical examples are presented in Section 5. Finally, some concluding remarks are presented in Section 6.

2. Composite Minimum Density Power Divergence Estimator

Given two probability density functions g and f, associated with two m-dimensional random variables, the DPD ([25]) measures the statistical distance between g and f by

$d_\alpha(g,f)=\int_{\mathbb{R}^m}\left\{f(y)^{1+\alpha}-\left(1+\frac{1}{\alpha}\right)f(y)^{\alpha}g(y)+\frac{1}{\alpha}g(y)^{1+\alpha}\right\}dy$, (2)

for α>0, while for α=0 it is defined by

$d_0(g,f)=\lim_{\alpha\to 0^+}d_\alpha(g,f)=d_{KL}(g,f),$

where $d_{KL}(g,f)$ is the Kullback–Leibler divergence (see, for example, [26]). For $\alpha=1$, the expression (2) leads to the $L_2$ distance $L_2(g,f)=\int_{\mathbb{R}^m}\left(f(y)-g(y)\right)^2dy$. It is also interesting to note that (2) is a special case of the so-called Bregman divergence

$\int_{\mathbb{R}^m}\left[T(g(y))-T(f(y))-\{g(y)-f(y)\}\,T'(f(y))\right]dy.$ (3)

If we consider $T(l)=\frac{1}{\alpha}l^{1+\alpha}$ in (3), we get $d_\alpha(g,f)$. The parameter α controls the trade-off between robustness and asymptotic efficiency of the parameter estimates, which are the minimizers of this family of divergences. For more details about this family of divergence measures we refer to [27].
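To make the limiting relation above concrete, here is a small R sketch (illustrative univariate densities and integration limits, not from the paper) that evaluates $d_\alpha(g,f)$ by numerical integration and checks that it approaches $d_{KL}(g,f)$ for small α.

```r
# Sketch: the DPD (2) between two univariate densities via numerical integration,
# checked against the Kullback-Leibler divergence d_KL(g, f) as alpha -> 0.
dpd <- function(alpha, f, g, lower = -10, upper = 10) {
  integrand <- function(y)
    f(y)^(1 + alpha) - (1 + 1 / alpha) * f(y)^alpha * g(y) +
      (1 / alpha) * g(y)^(1 + alpha)
  integrate(integrand, lower, upper)$value
}
kl <- function(f, g, lower = -10, upper = 10)
  integrate(function(y) g(y) * log(g(y) / f(y)), lower, upper)$value

f <- function(y) dnorm(y, 0, 1)
g <- function(y) dnorm(y, 0.5, 1.2)
dpd(0.001, f, g)   # close to kl(f, g)
kl(f, g)
```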

Let now $Y_1,\dots,Y_n$ be independent and identically distributed replications of Y, which are characterized by the true but unknown distribution g. Taking into account that the true model g is unknown, suppose that $\Xi=\{f(\cdot;\theta),\ \theta\in\Theta\subseteq\mathbb{R}^p,\ p\geq 1\}$ is a parametric identifiable family of candidate distributions to describe the observations $y_1,\dots,y_n$. Then, the DPD between the true model g and the composite likelihood function $CL(\theta,\cdot)$ associated with the parametric model $f(\cdot;\theta)$ is defined as

$d_\alpha(g(\cdot),CL(\theta,\cdot))=\int_{\mathbb{R}^m}\left\{CL(\theta,y)^{1+\alpha}-\left(1+\frac{1}{\alpha}\right)CL(\theta,y)^{\alpha}g(y)+\frac{1}{\alpha}g(y)^{1+\alpha}\right\}dy$, (4)

for $\alpha>0$, while for $\alpha=0$ we have $d_{KL}(g(\cdot),CL(\theta,\cdot))$, which is defined by

$d_{KL}(g(\cdot),CL(\theta,\cdot))=\int_{\mathbb{R}^m}g(y)\log\frac{g(y)}{CL(\theta,y)}dy.$ (5)

In Section 3, we are going to introduce and study the CLDIC criterion based on (4).

Let

$\{M_k\}_{k\in\{1,\dots,\ell\}}$, (6)

be a family of candidate models to govern the observations $Y_1,\dots,Y_n$. We shall assume that the true model is included in $\{M_k\}_{k\in\{1,\dots,\ell\}}$. For a specific $k=1,\dots,\ell$, the parametric model $M_k$ is described by the composite likelihood function

$CL(\theta,\cdot),\quad\theta\in\Theta_k\subseteq\mathbb{R}^{k}.$

In this setting, it is quite clear that the most suitable candidate model to describe the observations is the model that minimizes the DPD in (4). However, the unknown parameter θ is involved in it, so it is not possible to use this measure directly for the choice of the most suitable model. A way to overcome this problem is to plug into (4), in place of the unknown parameter θ, an estimator enjoying desirable properties such as consistency and asymptotic normality. For this purpose, the CMDPDE, introduced in [12], can be used. This estimator is described in the sequel for the sake of completeness.

If we denote the kernel of (4) as

$W_\alpha(\theta)=\int_{\mathbb{R}^m}CL(\theta,y)^{1+\alpha}dy-\left(1+\frac{1}{\alpha}\right)\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha}g(y)dy$, (7)

we can write

$d_\alpha(g(\cdot),CL(\theta,\cdot))=W_\alpha(\theta)+\frac{1}{\alpha}\int_{\mathbb{R}^m}g(y)^{1+\alpha}dy$

and the term $\frac{1}{\alpha}\int_{\mathbb{R}^m}g(y)^{1+\alpha}dy$ does not depend on θ, so it can be ignored when minimizing over θ. A natural estimator of $W_\alpha(\theta)$, given in (7), can be obtained by observing that the last integral in (7) can be expressed in the form $\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha}dG(y)$, for G the distribution function corresponding to g. Hence, if the empirical distribution function of $Y_1,\dots,Y_n$ is used, this last integral is approximated by $\frac{1}{n}\sum_{i=1}^{n}CL(\theta,Y_i)^{\alpha}$, i.e.,

$W_{n,\alpha}(\theta)=\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}dy-\left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}CL(\theta,Y_i)^{\alpha}.$ (8)

Definition 1.

The CMDPDE of θ, $\hat{\theta}_c^\alpha$, is defined, for $\alpha>0$, by

$\hat{\theta}_c^\alpha=\arg\min_{\theta\in\Theta}W_{n,\alpha}(\theta).$ (9)
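As an illustration, the following R sketch (assuming mvtnorm) implements (8) and (9) for the one-parameter bivariate-block normal composite model used later in Section 4; the closed form used for $\int CL(\theta,y)^{1+\alpha}dy$ is specific to that normal case, so this is a special-case sketch rather than a general implementation.

```r
# Sketch: W_{n,alpha} of (8) and the CMDPDE of (9) for the composite likelihood
# CL(rho, y) = f_N(y1, y2; rho) * f_N(y3, y4; rho) with standardized margins.
library(mvtnorm)

W_n_alpha <- function(rho, Y, alpha, mu = rep(0, 4)) {
  Sigma0 <- matrix(c(1, rho, rho, 1), 2, 2)
  # For one bivariate normal block:
  #   int f^(1+alpha) dy = (1+alpha)^(-1) (2*pi)^(-alpha) |Sigma0|^(-alpha/2)
  int_block <- (1 + alpha)^(-1) * (2 * pi)^(-alpha) * (1 - rho^2)^(-alpha / 2)
  int_CL <- int_block^2                        # product over the two blocks
  CLi <- dmvnorm(Y[, 1:2], mu[1:2], Sigma0) *  # CL(rho, y_i), one value per row
         dmvnorm(Y[, 3:4], mu[3:4], Sigma0)
  int_CL - (1 + 1 / alpha) * mean(CLi^alpha)   # W_{n,alpha}(rho)
}

# CMDPDE: minimize W_{n,alpha} over the admissible range of rho (see Section 4)
cmdpde <- function(Y, alpha)
  optimize(W_n_alpha, interval = c(-1/5, 1/3), Y = Y, alpha = alpha)$minimum
```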

We shall denote the score of the composite likelihood by

$u(\theta,y)=\frac{\partial\log CL(\theta,y)}{\partial\theta}.$ (10)

Let $\theta_0$ be the true value of the parameter θ. In [12], it was shown that the asymptotic distribution of $\hat{\theta}_c^\alpha$ is given by

$\sqrt{n}\left(\hat{\theta}_c^\alpha-\theta_0\right)\xrightarrow[n\to\infty]{L}N\left(0_p,\ H_\alpha(\theta_0)^{-1}J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right),$

where

$H_\alpha(\theta)=\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}u(\theta,y)u(\theta,y)^Tdy$ (11)

and

$J_\alpha(\theta)=\int_{\mathbb{R}^m}CL(\theta,y)^{2\alpha+1}u(\theta,y)u(\theta,y)^Tdy-\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}u(\theta,y)dy\int_{\mathbb{R}^m}u(\theta,y)^TCL(\theta,y)^{1+\alpha}dy.$ (12)

Remark 1.

For α=0 we get the CMLE of θ

$\hat{\theta}_c=\arg\min_{\theta\in\Theta}\left(-\frac{1}{n}\sum_{i=1}^{n}\log CL(\theta,y_i)\right).$ (13)

At the same time it is well-known that

$\sqrt{n}\left(\hat{\theta}_c-\theta\right)\xrightarrow[n\to\infty]{L}N\left(0_p,\ G(\theta)^{-1}\right),$

where $G(\theta)$ denotes the Godambe information matrix defined by $G(\theta)=H(\theta)J(\theta)^{-1}H(\theta)$, with $H(\theta)$ being the sensitivity or Hessian matrix and $J(\theta)$ being the variability matrix, defined, respectively, by

$H(\theta)=E_\theta\left[-\frac{\partial}{\partial\theta}u(\theta,y)^T\right],\qquad J(\theta)=E_\theta\left[u(\theta,y)u(\theta,y)^T\right].$

3. A New Model Selection Criterion

In order to describe the CLDIC criterion we consider the model $M_k$ given in (6). Following standard methodology (cf. [28], p. 240), the most suitable candidate model to describe the data $Y_1,\dots,Y_n$ is the model that minimizes the expected estimated DPD

$E_{Y_1,\dots,Y_n}\left[d_\alpha\left(g(\cdot),CL(\hat{\theta}_c^\alpha,\cdot)\right)\right]$, (14)

subject to the assumption that the unknown model g belongs to Ξ, i.e., the true model is included in $\{M_s\}_{s\in\{1,\dots,\ell\}}$, and taking into account that $\hat{\theta}_c^\alpha$, defined in (9), is a consistent and asymptotically normally distributed estimator of θ. However, this expected value still depends on the unknown parameter θ. So, as a criterion, an asymptotically unbiased estimator of (14) should be used, for $g\in\Xi$.

The most appropriate model to select is the model which minimizes the expected value

$E_{Y_1,\dots,Y_n}\left[W_\alpha(\hat{\theta}_c^\alpha)\right].$

This expected value still depends on the unknown parameter θ. So, an asymptotically unbiased estimator of the above expected value could be the basis of a selection criterion, for $g\in\Xi$. To proceed with the derivation of such an estimator, note that the empirical version of $W_\alpha(\theta)$ in (7) is $W_{n,\alpha}(\theta)$, given in (8); it plays a central role in the development of the model selection criterion through the next theorem, which expresses the expected value $E_{Y_1,\dots,Y_n}\left[W_\alpha(\hat{\theta}_c^\alpha)\right]$ by means of the respective expected value of $W_{n,\alpha}(\hat{\theta}_c^\alpha)$ in an asymptotically equivalent way.

Theorem 1.

If the true distribution g belongs to the parametric family Ξ and θ0 denotes the true value of the parameter θ, then we have

$E_{Y_1,\dots,Y_n}\left[W_\alpha(\hat{\theta}_c^\alpha)\right]=E_{Y_1,\dots,Y_n}\left[W_{n,\alpha}(\hat{\theta}_c^\alpha)\right]+\frac{\alpha+1}{n}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1),$

with $H_\alpha(\theta)$ and $J_\alpha(\theta)$ given in (11) and (12), respectively.

Based on the above theorem, whose proof is presented in full detail in Appendix A, an asymptotically unbiased estimator of $E_{Y_1,\dots,Y_n}\left[W_\alpha(\hat{\theta}_c^\alpha)\right]$ is given by

$W_{n,\alpha}(\hat{\theta}_c^\alpha)+\frac{\alpha+1}{n}\,\mathrm{trace}\left(J_\alpha(\hat{\theta}_c^\alpha)H_\alpha(\hat{\theta}_c^\alpha)^{-1}\right).$

This result is the basis of, and a strong motivation for, the next definition, which introduces the model selection criterion.

Definition 2.

Let $\{M_k\}_{k\in\{1,\dots,\ell\}}$ be candidate models for the observations $Y_1,\dots,Y_n$. The selected model $M$ verifies

$M=\arg\min_{k\in\{1,\dots,\ell\}}\mathrm{CLDIC}_\alpha(M_k),$

where

$\mathrm{CLDIC}_\alpha(M_k)=W_{n,\alpha}(\hat{\theta}_c^\alpha)+\frac{\alpha+1}{n}\,\mathrm{trace}\left(J_\alpha(\hat{\theta}_c^\alpha)H_\alpha(\hat{\theta}_c^\alpha)^{-1}\right),$

$W_{n,\alpha}(\theta)$ was given in (8), and $J_\alpha(\theta)$ and $H_\alpha(\theta)$ were defined in (11) and (12), respectively.
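A schematic R implementation for a scalar parameter is sketched below, with $J_\alpha$ and $H_\alpha$ replaced by the empirical versions of Appendix B; the arguments CL and u (the composite density and its score, evaluated at one observation) are placeholders to be supplied for each candidate model.

```r
# Sketch: CLDIC_alpha for a one-parameter candidate model, using empirical
# estimates of J_alpha and H_alpha as in Appendix B. CL(y, theta) and
# u(y, theta) must be supplied per model; W_n_alpha as in the Section 2 sketch.
cldic <- function(theta_hat, Y, alpha, W_n_alpha, CL, u) {
  CLi <- apply(Y, 1, CL, theta = theta_hat)      # CL(theta_hat, y_i)
  ui  <- apply(Y, 1, u,  theta = theta_hat)      # u(theta_hat, y_i)
  H <- mean(CLi^(alpha + 1) * ui^2)              # empirical H_alpha, cf. (A3)
  J <- mean(CLi^(2 * alpha + 1) * ui^2) -
       mean(CLi^(alpha + 1) * ui)^2              # empirical J_alpha, cf. (A2)
  W_n_alpha(theta_hat, Y, alpha) + (alpha + 1) / nrow(Y) * J / H
}
# The candidate with the smallest cldic(...) value is selected.
```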

The next remark summarizes the model selection criterion in the case $\alpha=0$; it therefore extends, in a sense, the pioneering and classic AIC.

Remark 2.

For $\alpha=0$ we have

$d_{KL}(g(\cdot),CL(\theta,\cdot))=W_0(\theta)+\int_{\mathbb{R}^m}g(y)\log g(y)\,dy,$

with $W_0(\theta)=-\int_{\mathbb{R}^m}\log CL(\theta,y)\,g(y)\,dy$. Therefore, the most appropriate model to select is the model which minimizes the expected value

$E_{Y_1,\dots,Y_n}\left[W_0(\hat{\theta}_c)\right]$, (15)

where $\hat{\theta}_c$ is the CMLE of $\theta_0$, given in (13).

The expected value (15) still depends on the unknown parameter θ. A natural estimator of $W_0(\hat{\theta}_c)$ can be obtained by replacing the distribution function G of g by the empirical distribution function based on $Y_1,\dots,Y_n$,

$W_{n,0}(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\log CL(\theta,y_i).$

Based on it, we select the model M that verifies

$M=\arg\min_{k\in\{1,\dots,\ell\}}\mathrm{CLDIC}_0(M_k),$

with

$\mathrm{CLDIC}_0(M_k)=W_{n,0}(\hat{\theta}_c)+\frac{1}{n}\,\mathrm{trace}\left(J(\hat{\theta}_c)H(\hat{\theta}_c)^{-1}\right),$

where $J(\hat{\theta}_c)$ and $H(\hat{\theta}_c)$ are defined in Remark 1. In a manner quite similar to that of the previous theorem, it can be established that $\mathrm{CLDIC}_0(M_k)$ is an asymptotically unbiased estimator of $E_{Y_1,\dots,Y_n}\left[W_0(\hat{\theta}_c)\right]$.

This is the model selection criterion in a composite likelihood framework based on the Kullback–Leibler divergence. We can observe that this criterion coincides with the criterion given in [22] as a generalization of the classical Akaike criterion, and it will be referred to from now on as the Composite Akaike Information Criterion (CAIC).

4. Numerical Simulations

4.1. Scenario 1: Two-Component Mixed Model

We start with a simulation example, motivated by and following the ideas of [29] and Example 4.1 in [20], which compares the behaviour of the proposed criterion with the CAIC criterion, obtained for α=0 (see Remark 2).

Consider the random vector $Y=(Y_1,Y_2,Y_3,Y_4)^T$ from an unknown density g and let now $Y_1,\dots,Y_n$ be independent and identically distributed replications of Y, described by the true but unknown distribution g. Taking into account that the true model g is unknown, suppose that $\{f(\cdot;\theta),\ \theta\in\Theta\subseteq\mathbb{R}^p,\ p\geq 1\}$ is a parametric identifiable family of candidate distributions to describe the observations $y_1,\dots,y_n$. Let also $CL(\theta,y)$ denote the composite likelihood function associated with the parametric model $f(\cdot;\theta)$.

We consider the problem of choosing (on the basis of n independent and identically distributed replications $y_1,\dots,y_n$ of $Y=(Y_1,Y_2,Y_3,Y_4)^T$) between a 4-variate normal distribution, $N(\mu_N,\Sigma)$, with $\mu_N=(\mu_{1N},\mu_{2N},\mu_{3N},\mu_{4N})^T$ and

$\Sigma=\begin{pmatrix}1&\rho&2\rho&2\rho\\\rho&1&2\rho&2\rho\\2\rho&2\rho&1&\rho\\2\rho&2\rho&\rho&1\end{pmatrix},$

and a 4-variate t-distribution with ν degrees of freedom, $t_\nu(\mu_{t_\nu},\Sigma)$, with different location parameters $\mu_{t_\nu}=(\mu_{1t_\nu},\mu_{2t_\nu},\mu_{3t_\nu},\mu_{4t_\nu})^T$, the same variance-covariance matrix Σ, and density

$C_m|\Sigma^*|^{-1/2}\left[1+\frac{1}{\nu}(y-\mu_{t_\nu})^T(\Sigma^*)^{-1}(y-\mu_{t_\nu})\right]^{-(\nu+m)/2},$

with $\Sigma^*=\frac{\nu-2}{\nu}\Sigma$, $C_m=\frac{\Gamma\left[(\nu+m)/2\right]}{(\pi\nu)^{m/2}\Gamma(\nu/2)}$ and $m=4$.
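For reference, this 4-variate t density can be evaluated with the mvtnorm package (a sketch; note that dmvt returns the log-density by default, hence log = FALSE):

```r
# Sketch: the 4-variate t density above, with Sigma* = ((nu - 2) / nu) * Sigma
# so that Sigma is the variance-covariance matrix of the distribution.
library(mvtnorm)
dens_t4 <- function(y, mu, Sigma, nu)
  dmvt(y, delta = mu, sigma = (nu - 2) / nu * Sigma, df = nu, log = FALSE)
```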

Consider the composite likelihood function,

$CL_N(\rho,y)=f_{A_1}^N(y;\rho)\,f_{A_2}^N(y;\rho),$

with $f_{A_1}^N(y;\rho)=f_{12}^N(y_1,y_2;\mu_{1N},\mu_{2N};\rho)$ and $f_{A_2}^N(y;\rho)=f_{34}^N(y_3,y_4;\mu_{3N},\mu_{4N};\rho)$, where $f_{12}^N$ and $f_{34}^N$ are the densities of the bivariate marginals of Y, i.e., bivariate normal distributions with mean vectors $(\mu_{1N},\mu_{2N})^T$ and $(\mu_{3N},\mu_{4N})^T$, respectively, and common variance-covariance matrix

$\Sigma_0=\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}.$

In a similar manner, consider the composite likelihood

$CL_{t_\nu}(\rho,y)=f_{A_1}^{t_\nu}(y;\rho)\,f_{A_2}^{t_\nu}(y;\rho),$

with $f_{A_1}^{t_\nu}(y;\rho)=f_{12}^{t_\nu}(y_1,y_2;\mu_{1t_\nu},\mu_{2t_\nu};\rho)$ and $f_{A_2}^{t_\nu}(y;\rho)=f_{34}^{t_\nu}(y_3,y_4;\mu_{3t_\nu},\mu_{4t_\nu};\rho)$, where $f_{12}^{t_\nu}$ and $f_{34}^{t_\nu}$ are the densities of the bivariate marginals of Y, i.e., bivariate t-distributions with mean vectors $(\mu_{1t_\nu},\mu_{2t_\nu})^T$ and $(\mu_{3t_\nu},\mu_{4t_\nu})^T$, respectively, and common variance-covariance matrix

$\Sigma_0=\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}.$

Under this formulation, the simulation study proceeds with the following two scenarios.

4.1.1. Scenario 1a

Following Example 4.1 in [20], the steps of the simulation study are the following:

  • Generate 1000 samples of size $n=5,7,10,20,40,50,70,100$ from a two-component mixture of two 4-variate distributions, namely, a 4-variate normal and a 4-variate t-distribution,
    $h_\omega(y)=\omega\,N(\mu_N,\Sigma)+(1-\omega)\,t_\nu(\mu_{t_\nu},\Sigma),\quad 0\leq\omega\leq 1,$
    with $\mu_N=(0,0,0.5,0)$ and $\mu_{t_\nu}=(3.2,1.5,0.5,2)$, for $\omega=0,0.25,0.45,0.5,0.55,0.75,1$, $\nu=5,10,30$ degrees of freedom and specific values $\rho=0.15,0.10,-0.10$. As pointed out in [29], taking into account that Σ should be positive semi-definite, the following condition is imposed: $-\frac{1}{5}\leq\rho\leq\frac{1}{3}$.
  • Estimate the common parameter ρ, separately in each model, using the CMDPDE for different values of the tuning parameter $\alpha=0,0.3$. The composite density which corresponds to the mixture $h_\omega(y)$ is defined by
    $CL(\rho,y)=\omega\,CL_N(\rho,y)+(1-\omega)\,CL_{t_\nu}(\rho,y),\quad 0\leq\omega\leq 1,$
    and it is used to obtain the CMDPDE, $\hat\rho$, of ρ.
  • Define the mixture composite likelihood function
    $CL(\hat\rho,y)=\omega\,CL_N(\hat\rho,y)+(1-\omega)\,CL_{t_\nu}(\hat\rho,y),\quad 0\leq\omega\leq 1.$
  • Calculate $\mathrm{CLDIC}_\alpha(M_k)$, the value of the model selection criterion considered in this paper, for the two candidate models, with
    $\mathrm{CLDIC}_\alpha(M_k)=W_{n,\alpha}(\hat\rho)+\frac{\alpha+1}{n}\,\mathrm{trace}\left(J_\alpha(\hat\rho)H_\alpha(\hat\rho)^{-1}\right).$

    An explanation of how to obtain this value for both candidate models is given in Appendix B.

  • Compute the number of times that the 4-variate normal model was selected (a compact skeleton of this loop is sketched below).
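The helper functions in the following R skeleton are assumptions standing in for the steps above: rmix() draws a sample from $h_\omega$, cmdpde_mix() minimizes $W_{n,\alpha}$ for the mixture composite density, and cldic_N() and cldic_t() evaluate $\mathrm{CLDIC}_\alpha$ for the two candidate models.

```r
# Illustrative skeleton of the Scenario 1a loop (helper names are assumptions).
count_normal_selected <- function(n, omega, nu, rho, alpha, B = 1000) {
  sum(replicate(B, {
    Y <- rmix(n, omega, nu, rho)                 # one simulated sample of size n
    rho_hat <- cmdpde_mix(Y, omega, nu, alpha)   # CMDPDE of rho under h_omega
    cldic_N(rho_hat, Y, alpha) < cldic_t(rho_hat, Y, nu, alpha)  # normal chosen?
  }))
}
```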

Results are summarized in Table 1. The extreme values $\omega=0,1$ represent the number of times that the 4-variate normal model was selected when the data come from the 4-variate t-distribution and the 4-variate normal distribution, respectively. This means that, for $\omega=1$, perfect discrimination is achieved when 1000 of the 1000 simulated samples are correctly assigned, while for $\omega=0$, the closer the count is to 0, the better the discrimination of the criterion. $\omega=0.5$ means that each sample was generated from the normal and the t-distribution in equal proportion.

Table 1.

Main results, Scenario 1a.

α=0 (CAIC) α=0.3
ω 0 0.25 0.45 0.5 0.55 0.75 1 0 0.25 0.45 0.5 0.55 0.75 1
ν=5,ρ=0.15
n = 5 0 1 269 499 713 996 1000 0 0 273 498 712 1000 1000
7 0 1 246 504 758 998 1000 0 1 220 511 738 999 1000
10 0 0 202 482 775 1000 1000 0 0 185 467 771 1000 1000
20 0 0 114 486 871 1000 1000 0 0 112 473 866 1000 1000
40 0 0 41 459 947 1000 1000 0 0 54 496 954 1000 1000
50 0 0 21 475 964 1000 1000 0 0 41 556 986 1000 1000
70 0 0 9 461 985 1000 1000 0 0 48 656 995 1000 1000
100 0 0 5 472 992 1000 1000 0 0 142 885 1000 1000 1000
ν=10,ρ=0.15
5 0 3 222 445 688 996 1000 0 3 218 433 688 997 1000
7 0 1 191 439 720 1000 1000 0 0 179 431 690 999 1000
10 0 0 163 432 747 1000 1000 0 0 152 402 725 1000 1000
20 0 0 59 399 819 1000 1000 0 0 49 361 773 1000 1000
40 0 0 19 336 912 1000 1000 0 0 12 326 899 1000 1000
50 0 0 6 362 936 1000 1000 0 0 10 334 925 1000 1000
70 0 0 1 292 960 999 1000 0 0 2 356 973 1000 1000
100 0 0 0 301 983 1000 1000 0 0 1 531 992 1000 1000
ν=30,ρ=0.15
5 0 4 237 423 677 997 1000 0 2 235 413 656 996 1000
7 0 0 155 394 689 1000 1000 0 0 141 379 677 999 1000
10 0 0 144 413 719 1000 1000 0 0 134 393 701 1000 1000
20 0 0 57 351 801 1000 1000 0 0 40 311 764 1000 1000
40 0 0 11 296 904 1000 1000 0 0 8 263 882 1000 1000
50 0 0 6 271 918 1000 1000 0 0 3 253 903 1000 1000
70 0 0 1 225 942 1000 1000 0 0 0 229 941 1000 1000
100 0 0 0 208 978 1000 1000 0 0 0 303 989 1000 1000
ν=10,ρ=0.10
5 0 4 242 464 680 996 1000 0 3 238 459 682 999 1000
7 0 0 187 461 733 997 1000 0 0 199 457 731 998 1000
10 0 0 162 445 738 1000 1000 0 0 165 407 713 1000 1000
20 0 0 62 378 807 1000 1000 0 0 59 354 789 1000 1000
40 0 0 19 357 902 999 1000 0 0 14 333 895 1000 1000
50 0 0 6 325 932 1000 1000 0 0 8 325 931 1000 1000
70 0 0 2 305 954 1000 1000 0 0 6 367 967 1000 1000
100 0 0 0 307 979 1000 1000 0 0 2 507 993 1000 1000
ν=10,ρ=−0.10
5 0 11 268 459 669 991 1000 1 11 268 478 680 993 1000
7 0 1 211 456 720 999 1000 0 3 207 464 716 998 1000
10 0 0 168 423 704 1000 1000 0 0 162 403 702 1000 1000
20 0 0 86 360 789 1000 999 0 0 89 357 786 1000 1000
40 0 0 35 367 893 1000 1000 0 0 38 398 896 1000 1000
50 0 0 19 331 886 1000 1000 0 0 19 360 913 1000 1000
70 0 0 11 311 933 1000 1000 0 0 16 379 963 1000 1000
100 0 0 2 276 969 1000 1000 0 0 7 490 985 1000 1000

4.1.2. Scenario 1b

The same scenario is evaluated under the closer means $\mu_N=(0,1.5,0.5,0.75)$ and $\mu_{t_\nu}=(0,1.5,0.5,2)$ for moderate-to-large sample sizes and $\alpha\in\{0,0.2,0.4\}$. Here $\nu=5$ and $\rho=0.15$. Results are shown in Table 2. In this case, the models under consideration are more similar, so it is understandable that the CLDIC criterion does not discriminate as well.

Table 2.

Main results, Scenario 1b.

α=0 (CAIC) α=0.2 α=0.4
ω 0 0.25 0.75 1 0 0.25 0.75 1 0 0.25 0.75 1
n = 40 0 0 39 731 0 0 537 961 0 0 580 949
50 0 0 24 732 0 0 859 990 0 0 944 994
60 0 0 14 772 0 0 999 1000 0 1 999 1000
70 0 0 9 734 0 0 999 1000 0 27 999 1000
80 0 0 5 770 0 1 1000 1000 0 326 1000 1000
90 0 0 4 782 0 23 1000 1000 2 794 1000 1000
100 0 0 4 802 0 173 1000 1000 26 978 1000 1000

4.2. Scenario 2: Three-Component Mixed Model

Now, we consider a mixed model composed of two 4-variate normal distributions and a 4-variate t-distribution with $\nu=10$ degrees of freedom. The three distributions have a common variance-covariance matrix, as in the previous scenario, with unknown $\rho=0.15$ and different but known means $\mu_{1N}=(0,0,0.5,0)$, $\mu_{2N}=(0,1.5,0.5,0)$ and $\mu_t=(0,1.5,0.5,2)$. The model is defined by

$\omega\,N(\mu_{1N},\Sigma)+\lambda\,N(\mu_{2N},\Sigma)+(1-\omega-\lambda)\,t_{\nu=10}(\mu_t,\Sigma),\quad 0\leq\omega,\lambda,\ \omega+\lambda\leq 1,$

with Σ being again a common variance-covariance matrix with unknown parameter ρ of the form

$\Sigma=\begin{pmatrix}1&\rho&2\rho&2\rho\\\rho&1&2\rho&2\rho\\2\rho&2\rho&1&\rho\\2\rho&2\rho&\rho&1\end{pmatrix}.$
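For completeness, one observation from this three-component mixture can be drawn as in the following R sketch (Sigma4 denotes the 4×4 matrix above; the t component uses the scaling $\Sigma^*=\frac{\nu-2}{\nu}\Sigma$ of Section 4.1).

```r
# Sketch: draw one observation from the three-component mixture above.
library(mvtnorm)
r_mix3 <- function(omega, lambda, mu1, mu2, mut, Sigma4, nu = 10) {
  u <- runif(1)
  if (u < omega) {
    rmvnorm(1, mu1, Sigma4)                      # first normal component
  } else if (u < omega + lambda) {
    rmvnorm(1, mu2, Sigma4)                      # second normal component
  } else {
    rmvt(1, sigma = (nu - 2) / nu * Sigma4, df = nu, delta = mut)  # t component
  }
}
```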

Following the same steps as in the first scenario, we generate 1000 samples of the three-component mixture for different sample sizes $n=5,7,10,20,40,50,70,100$ and different values of ω and λ. Then, we consider the problem of choosing among the two 4-variate normal distributions and the 4-variate t-distribution through the CLDIC criterion, for different values of the tuning parameter $\alpha=0,0.3,0.5,0.7$. See Table 3 for the results. Here, the normal models are denoted by N1 and N2, respectively, while the 4-variate t-distribution is denoted by MT. The first three cases evaluate the selected model when the data come from each of these multivariate distributions. In the last two cases, a mixed model is considered as the true distribution.

Table 3.

Main results, Scenario 2.

α=0 (CAIC) α=0.3 α=0.5 α=0.7
Model N1 N2 MT N1 N2 MT N1 N2 MT N1 N2 MT
True model: N(μ1N,Σ)
n = 5 957 24 19 950 16 34 939 23 38 936 28 36
7 970 19 11 966 13 24 961 13 26 950 22 28
10 993 3 4 986 4 10 979 6 15 971 6 23
20 1000 0 0 1000 0 0 998 0 2 997 0 3
40 1000 0 0 1000 0 0 1000 0 0 1000 0 0
50 1000 0 0 1000 0 0 1000 0 0 1000 0 0
70 1000 0 0 1000 0 0 1000 0 0 1000 0 0
100 1000 0 0 1000 0 0 1000 0 0 999 0 0
True model: N(μ2N,Σ)
5 29 638 333 34 610 356 38 639 323 50 646 304
7 15 622 363 13 589 398 17 599 384 28 627 345
10 6 610 384 5 540 455 5 540 455 11 586 403
20 1 612 387 1 518 481 1 472 527 1 527 472
40 0 566 434 0 650 350 0 590 410 0 614 386
50 0 561 439 0 804 196 0 797 203 0 835 165
70 0 584 416 0 987 13 0 994 6 0 998 2
100 0 520 480 0 1000 0 0 1000 0 0 1000 0
True model: tν=10(μt,Σ)
5 2 15 983 1 6 993 1 8 991 3 15 982
7 0 3 997 0 1 999 2 2 996 0 4 996
10 0 1 999 0 2 998 0 2 998 0 3 997
20 0 0 1000 0 0 1000 0 0 1000 0 0 1000
40 0 0 1000 0 0 1000 0 0 1000 0 0 1000
50 0 0 1000 0 0 1000 0 0 1000 0 0 1000
70 0 0 1000 0 0 1000 0 0 1000 0 0 1000
100 0 0 1000 0 0 1000 0 4 996 0 296 704
True model: $0.7\,N(\mu_{2N},\Sigma)+0.3\,t_{\nu=10}(\mu_t,\Sigma)$
5 6 384 610 6 375 619 4 401 595 11 452 537
7 1 331 668 1 294 705 1 317 682 1 373 626
10 1 261 738 1 218 781 1 253 746 1 306 693
20 0 109 891 0 101 899 0 107 893 0 141 859
40 0 26 974 0 126 874 0 122 878 0 166 834
50 0 13 987 0 311 689 0 345 655 0 445 555
70 0 6 994 0 948 52 0 982 18 0 994 6
100 0 2 998 0 1000 0 0 1000 0 0 999 1
True model: $\frac{1}{3}N(\mu_{1N},\Sigma)+\frac{1}{3}N(\mu_{2N},\Sigma)+\frac{1}{3}t_{\nu=10}(\mu_t,\Sigma)$
5 127 377 496 121 363 516 107 392 501 107 424 469
7 87 357 556 70 339 591 66 356 578 63 396 541
10 69 326 605 61 314 625 56 330 614 45 381 574
20 37 259 704 25 298 677 17 337 646 15 349 636
40 7 145 848 9 452 539 4 508 488 1 469 530
50 2 122 876 5 744 251 3 814 183 3 853 144
70 0 99 901 4 996 0 4 996 0 4 996 0
100 0 36 964 355 645 0 645 355 0 856 144 0

Here the model candidates are denoted by N1, N2 and MT for $N(\mu_{1N},\Sigma)$, $N(\mu_{2N},\Sigma)$ and $t_{10}(\mu_t,\Sigma)$, respectively.

4.3. Discussion of Results

In Scenario 1a, two well-differentiated multivariate models are considered. In this case the CLDIC criterion works very efficiently, with almost perfect discrimination for extreme values of ω. Good behaviour is also observed for less extreme values of ω, such as ω=0.55 or 0.45. We do not observe a significant difference across choices of α.

In Scenario 1b we consider closer models, which affects the discrimination power of the CLDIC. However, in this case, we do observe great differences when considering different values of α. While the discrimination power of CLDIC for α=0 (CAIC) and ω=1 is around 75%, for α=0.2 or α=0.4 the behaviour is excellent. This also happens for large but not extreme values of ω, such as ω=0.75. However, a medium value of α leads to worse discrimination for low values of ω.

Scenario 2 deals with three different models, two multivariate normal and one multivariate t (N1, N2 and MT, respectively). The second normal distribution is closer to MT in terms of means. While the CLDIC criterion discriminates well between N1 and N2 and between N1 and MT, it has difficulties in distinguishing the N2 and MT distributions, especially for small sample sizes and α=0.

It seems, therefore, that when we have well-differentiated models, the CLDIC criterion works very well, independently of the sample size and the tuning parameter α considered. Dealing with closer models leads, as expected, to worse results, especially for α=0 (CAIC).

Note that the behaviour of Wald-type and Rao tests based on CMDPDEs was studied in [12,13] through extensive simulation studies.

5. Numerical Examples

5.1. Choice of the Tuning Parameter

In the previous sections, we have seen that the CLDIC criterion generally works very well, independently of α, but that some values present better behaviour, especially when distinguishing similar models. In these situations, it appears that values close to 0.2 or 0.3 work well, while the CAIC criterion presents worse behaviour. A data-driven approach for the choice of the tuning parameter would therefore be helpful in practice. The approach of [30] was adapted in [13] for the choice of the optimum α in CMDPDEs. This approach consists of minimizing the estimated mean squared error by means of a pilot estimator, $\theta_P$. This approximation is given by

$\widehat{MSE}_\alpha=(\hat{\theta}_c^\alpha-\theta_P)^T(\hat{\theta}_c^\alpha-\theta_P)+\frac{1}{n}\,\mathrm{trace}\left(H_\alpha^{-1}(\hat{\theta}_c^\alpha)J_\alpha(\hat{\theta}_c^\alpha)H_\alpha^{-1}(\hat{\theta}_c^\alpha)\right)$, (16)

where $H_\alpha(\theta)$ and $J_\alpha(\theta)$ are given in (11) and (12). The optimum α is the one that minimizes expression (16). The choice of the pilot estimator is probably one of the major drawbacks of this approach, as it may lead to a choice of α too close to that used for the pilot estimator. A pilot estimator with $\alpha=0.4$ was proposed in [13] after some simulations, in concordance with [30], where the initial choice of the pilot is suggested to be a robust one in order to obtain the best results in terms of robustness.
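A sketch of this grid search for a scalar parameter follows; cmdpde() is as in the Section 2 sketch, and J_hat() and H_hat() denote the empirical estimates of (12) and (11) (assumed helpers).

```r
# Sketch: Warwick-Jones-type choice of alpha via the estimated MSE in (16),
# specialized to a scalar parameter (trace(H^-1 J H^-1) = J / H^2).
choose_alpha <- function(Y, alpha_grid = seq(0.01, 1, length.out = 100),
                         alpha_pilot = 0.4) {
  theta_P <- cmdpde(Y, alpha_pilot)               # pilot estimator
  mse <- sapply(alpha_grid, function(a) {
    th <- cmdpde(Y, a)
    (th - theta_P)^2 + J_hat(th, Y, a) / (nrow(Y) * H_hat(th, Y, a)^2)
  })
  alpha_grid[which.min(mse)]                      # alpha minimizing (16)
}
```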

5.2. Iris Data

The Iris data (Fisher, [31]) include 3 categories of 50 sample values each, where each category refers to a type of iris plant: setosa, versicolor and virginica. Each plant is categorized in its class and described by 4 further variables: (1) sepal length, (2) sepal width, (3) petal length and (4) petal width. This is one of the best-known data sets for discriminant analysis. In [32], the use of a Gaussian finite mixture was proposed for modeling the Iris data, in which each known class is modeled by a single Gaussian term with the same variance-covariance matrix. The resulting model is as follows:

$f(x)=\frac{1}{3}N(\mu_1,\Sigma)+\frac{1}{3}N(\mu_2,\Sigma)+\frac{1}{3}N(\mu_3,\Sigma)$, (17)

with

$\mu_1=(\mu_{11},\mu_{12},\mu_{13},\mu_{14})^T,\quad\mu_2=(\mu_{21},\mu_{22},\mu_{23},\mu_{24})^T,\quad\mu_3=(\mu_{31},\mu_{32},\mu_{33},\mu_{34})^T$

and

$\Sigma=\begin{pmatrix}\sigma_1^2&\sigma_{12}&\sigma_{13}&\sigma_{14}\\\sigma_{21}&\sigma_2^2&\sigma_{23}&\sigma_{24}\\\sigma_{31}&\sigma_{32}&\sigma_3^2&\sigma_{34}\\\sigma_{41}&\sigma_{42}&\sigma_{43}&\sigma_4^2\end{pmatrix}.$

Exact values can be obtained with the MclustDA() function of the mclust package in R ([32]).
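For instance, a sketch of this fit with mclust (modelType "EDDA" with modelNames "EEE" gives one Gaussian component per class with a common covariance matrix):

```r
# Sketch: equal-covariance Gaussian mixture (17) for the Iris data via mclust.
library(mclust)
data(iris)
fit <- MclustDA(iris[, 1:4], class = iris$Species,
                modelType = "EDDA", modelNames = "EEE")
summary(fit)   # estimated means mu_1, mu_2, mu_3 and common Sigma
```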

We propose a composite likelihood approach to modeling (17) in which we assume independence between the first two and the last two variables. This is

$f_{CL}(y)=\frac{1}{3}CL_{N_1}+\frac{1}{3}CL_{N_2}+\frac{1}{3}CL_{N_3}$, (18)

with

$CL_{N_i}=f_{A_{i1}}^N(\rho_{12},y)\,f_{A_{i2}}^N(\rho_{34},y),$

where $f_{A_{i1}}^N(\rho_{12},y)=f_{A_{i1}}^N(\rho_{12},\mu_{i1},\mu_{i2},\Sigma_{A_1},y)$ and $f_{A_{i2}}^N(\rho_{34},y)=f_{A_{i2}}^N(\rho_{34},\mu_{i3},\mu_{i4},\Sigma_{A_2},y)$, $i=1,2,3$, are bivariate normals with variance-covariance matrices

$\Sigma_{A_1}=\begin{pmatrix}\sigma_1^2&\rho_{12}\sigma_1\sigma_2\\\rho_{12}\sigma_1\sigma_2&\sigma_2^2\end{pmatrix},\quad\Sigma_{A_2}=\begin{pmatrix}\sigma_3^2&\rho_{34}\sigma_3\sigma_4\\\rho_{34}\sigma_3\sigma_4&\sigma_4^2\end{pmatrix}.$

We are going to evaluate the behavior of the CLDIC criterion proposed in the previous sections. After estimating the parameters $\rho_{12}$ and $\rho_{34}$ in (18), we consider 10 different subsets of the Iris data:

  • SE subset: first 50 observations, corresponding to Setosa plants (n=50).

  • VE subset: next 50 observations, corresponding to Versicolor plants (n=50).

  • VI subset: last 50 observations, corresponding to Virginica plants (n=50).

  • SE(VE) subset: SE subset plus the first 2 observations of the VE subset (n=52).

    Equivalently: SE(VI), VE(SE), VE(VI), VI(SE) and VI(VE).

  • VI(SE+VE) subset: VI subset plus the first 2 observations of the SE and VE subsets (n=54).

In Table 4, the model chosen for each of the subsets by the proposed CLDIC criterion is reported. When a “pure” subset is considered, all the tuning parameters lead to optimal decisions, but when a “contaminated” subset is under consideration, only α=0.2, 0.3 give an optimal response in all the cases.

Table 4.

Selected model in each of the subsets. Iris data.

α SE VE VI SE(VE) SE(VI) VE(SE) VE(VI) VI(SE) VI(VE) VI(SE+VE)
0 (CAIC) CN1 CN2 CN3 CN1 CN1 CN1 CN2 CN1 CN3 CN3
0.2 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN3
0.3 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN3
0.4 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN1 CN3 CN3
0.5 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN1 CN3 CN3
0.8 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN1 CN3 CN3
0.22 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN3

We now apply the ad hoc approach presented in Section 5.1 for selecting the tuning parameter α in a composite likelihood framework. Applying this procedure to our data set through a grid search of length 100 and by means of a pilot estimator with α=0.4 leads to the optimal tuning parameter α=0.22, which is in concordance with the obtained results (see Table 5). We can see that the use of other pilot estimators would not greatly affect the final decision.

Table 5.

Selected α for different pilot estimators, ad-hoc tuning parameter selection procedure. Iris and Wine data.

αpilot 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Iris αopt 0.31 0.17 0.20 0.21 0.22 0.23 0.24 0.24 0.25 0.25 0.25
Wine αopt 0.45 0.46 0.47 0.49 0.51 0.53 0.55 0.56 0.56 0.56 0.57

5.3. Wine Data

We now work with the Wine data ([33]), which contain the results of a chemical analysis of 178 Italian wines from three different cultivars (Barolo, Grignolino, Barbera), yielding 13 measurements. In order to illustrate our criterion, we work with only the first four explanatory variables: Alcohol, Malic, Ash and Alkalinity. As in the previous section, we fit a Gaussian mixture model, in this case with weights 59/178, 72/178 and 47/178 corresponding to the Barolo, Grignolino and Barbera classes, respectively. We now consider these 10 different subsets of the Wine data:

  • BO subset: first 20 observations of Barolo wines (n=20).

  • GR subset: first 20 observations of Grignolino wines (n=20).

  • BA subset: first 20 observations of Barbera wines (n=20).

  • BO(GR) subset: BO subset plus the first 5 observations of the GR subset (n=25).

    Equivalently: BO(BA), GR(BO), GR(BA), BA(BO) and BA(GR).

  • BA(BO+GR) subset: BA subset plus the first 3 observations of the BO and GR subsets (n=26).

We can observe that, for medium values of α, the discrimination is perfect (see Table 6). Applying the ad hoc tuning parameter selection procedure we obtain $\alpha_{opt}\approx 0.51$, with perfect discrimination again (Table 5).

Table 6.

Selected model in each of the subsets. Wine data.

α BO GR BA BO(GR) BO(BA) GR(BO) GR(BA) BA(BO) BA(GR) BA(BO+GR)
0 (CAIC) CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN2
0.2 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN3
0.3 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN3
0.4 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN3
0.5 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN3
0.8 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN2 CN2 CN3
0.51 CN1 CN2 CN3 CN1 CN1 CN2 CN2 CN3 CN3 CN3

6. Conclusions and Future Research

In this paper, we have addressed the problem of model selection in the framework of the composite likelihood methodology, on the basis of the DPD as a measure of the closeness of the composite density and the true model that drives the data. In this context, an information criterion is introduced and studied which is defined by means of composite minimum distance type estimators of the unknown parameters, well known for their nice robustness properties. Through a simulation study, we have shown that the model selection criterion proposed here works well in practice and, mainly, that the use of the CMDPDE makes the criterion more robust than the criterion based on the classic CMLE and the Kullback–Leibler divergence, given in [22]. The analysis of two real data examples from the literature illustrates how the model selection criterion presented here can be applied in practical cases. This paper is part of a series of papers by the authors in which composite likelihood ideas and methods are harmonically woven with divergence-theoretic methods in order to develop statistical inference (estimation and testing of hypotheses) as well as model selection criteria. We envision future work in several directions. The development of change-point methodology on the basis of the composite density, the CMDPDE and divergence measures would perhaps be an appealing problem for future research on the topic. Moreover, all the information-theoretic methods developed on the basis of the composite likelihood depend on the choice of the family of sets $\{A_k\}_{k=1}^{K}$ appearing in Formula (1). A question is raised at this point: how are the information-theoretic procedures developed on the basis of the composite likelihood affected by this family of sets? This is an appealing problem which also deserves investigation in future work.

Acknowledgments

The authors would like to thank the Editor and Reviewers for taking their precious time to make several valuable comments on the manuscript.

Abbreviations

The following abbreviations are used in this manuscript:

MLE Maximum likelihood estimator
CMLE Composite maximum likelihood estimator
CLDIC Composite likelihood DIC
DPD Density power divergence
MDPDE Minimum density power divergence estimator
CMDPDE Composite minimum density power divergence estimator
AIC Akaike Information Criterion
CAIC Composite Akaike Information Criterion
TIC Takeuchi Information Criterion

Appendix A. Proof of Theorem 1

Proof. 

A Taylor expansion of $W_\alpha(\theta)$ around the true parameter $\theta_0$, evaluated at $\theta=\hat{\theta}_c^\alpha$, gives

$W_\alpha(\hat{\theta}_c^\alpha)=W_\alpha(\theta_0)+\left(\left.\frac{\partial W_\alpha(\theta)}{\partial\theta}\right|_{\theta=\theta_0}\right)^T(\hat{\theta}_c^\alpha-\theta_0)+\frac{1}{2}(\hat{\theta}_c^\alpha-\theta_0)^T\left.\frac{\partial^2W_\alpha(\theta)}{\partial\theta\partial\theta^T}\right|_{\theta=\theta_0}(\hat{\theta}_c^\alpha-\theta_0)+o\left(\left\|\hat{\theta}_c^\alpha-\theta_0\right\|^2\right).$

Now,

$\frac{\partial W_\alpha(\theta)}{\partial\theta}=(1+\alpha)\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha}\frac{\partial CL(\theta,y)}{\partial\theta}dy-\left(1+\frac{1}{\alpha}\right)\alpha\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha-1}\frac{\partial CL(\theta,y)}{\partial\theta}g(y)dy=(1+\alpha)\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}u(\theta,y)dy-(1+\alpha)\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha}u(\theta,y)g(y)dy.$

It is clear that if the true distribution g belongs to the parametric family $\{f(\cdot;\theta),\ \theta\in\Theta\}$ and $\theta_0$ denotes the true value of the parameter θ, we get

$\left.\frac{\partial W_\alpha(\theta)}{\partial\theta}\right|_{\theta=\theta_0}=0.$

We now compute the second derivative:

$\frac{\partial^2W_\alpha(\theta)}{\partial\theta\partial\theta^T}=(1+\alpha)\left[(1+\alpha)\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}u(\theta,y)u(\theta,y)^Tdy+\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}\frac{\partial^2\log CL(\theta,y)}{\partial\theta\partial\theta^T}dy-\alpha\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha}u(\theta,y)u(\theta,y)^Tg(y)dy-\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha}\frac{\partial^2\log CL(\theta,y)}{\partial\theta\partial\theta^T}g(y)dy\right].$

If the true distribution g belongs to the parametric family $\{f(\cdot;\theta),\ \theta\in\Theta\}$ and $\theta_0$ denotes the true value of the parameter θ, this verifies

$\left.\frac{\partial^2W_\alpha(\theta)}{\partial\theta\partial\theta^T}\right|_{\theta=\theta_0}=(1+\alpha)\int_{\mathbb{R}^m}CL(\theta_0,y)^{\alpha+1}u(\theta_0,y)u(\theta_0,y)^Tdy=(1+\alpha)H_\alpha(\theta_0).$

Therefore,

$nW_\alpha(\hat{\theta}_c^\alpha)=nW_\alpha(\theta_0)+\frac{1+\alpha}{2}\sqrt{n}(\hat{\theta}_c^\alpha-\theta_0)^TH_\alpha(\theta_0)\sqrt{n}(\hat{\theta}_c^\alpha-\theta_0)+n\,o\left(\left\|\hat{\theta}_c^\alpha-\theta_0\right\|^2\right).$

But

$\sqrt{n}\left(\hat{\theta}_c^\alpha-\theta_0\right)\xrightarrow[n\to\infty]{L}N\left(0,\ H_\alpha(\theta_0)^{-1}J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right),$

and $n\,o\left(\left\|\hat{\theta}_c^\alpha-\theta_0\right\|^2\right)=o(O_p(1))=o_p(1)$.

The quadratic form $\sqrt{n}(\hat{\theta}_c^\alpha-\theta_0)^TH_\alpha(\theta_0)\sqrt{n}(\hat{\theta}_c^\alpha-\theta_0)$ satisfies

$\sqrt{n}(\hat{\theta}_c^\alpha-\theta_0)^TH_\alpha(\theta_0)\sqrt{n}(\hat{\theta}_c^\alpha-\theta_0)\xrightarrow[n\to\infty]{L}\sum_{r=1}^{k}\lambda_rZ_r^2,$

where $\lambda_r$, $r=1,\dots,k$, are the eigenvalues of the matrix

$H_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}=J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}$

and the $Z_r$ are independent normal random variables with mean zero and variance 1. Therefore,

$E_{Y_1,\dots,Y_n}\left[\sqrt{n}(\hat{\theta}_c^\alpha-\theta_0)^TH_\alpha(\theta_0)\sqrt{n}(\hat{\theta}_c^\alpha-\theta_0)\right]=\sum_{r=1}^{k}\lambda_r+o_p(1)=\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)$

and

$E_{Y_1,\dots,Y_n}\left[nW_\alpha(\hat{\theta}_c^\alpha)\right]=nW_\alpha(\theta_0)+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1).$

Now a Taylor expansion of $W_{n,\alpha}(\theta)$ around $\hat{\theta}_c^\alpha$, evaluated at $\theta=\theta_0$, gives

$W_{n,\alpha}(\theta_0)=W_{n,\alpha}(\hat{\theta}_c^\alpha)+\left(\left.\frac{\partial W_{n,\alpha}(\theta)}{\partial\theta}\right|_{\theta=\hat{\theta}_c^\alpha}\right)^T(\theta_0-\hat{\theta}_c^\alpha)+\frac{1}{2}(\theta_0-\hat{\theta}_c^\alpha)^T\left.\frac{\partial^2W_{n,\alpha}(\theta)}{\partial\theta\partial\theta^T}\right|_{\theta=\hat{\theta}_c^\alpha}(\theta_0-\hat{\theta}_c^\alpha)+o\left(\left\|\theta_0-\hat{\theta}_c^\alpha\right\|^2\right).$

But

$\frac{\partial W_{n,\alpha}(\theta)}{\partial\theta}=(\alpha+1)\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}u(\theta,y)dy-(\alpha+1)\frac{1}{n}\sum_{k=1}^{n}CL(\theta,y_k)^{\alpha}u(\theta,y_k),$

therefore

$\left.\frac{\partial W_{n,\alpha}(\theta)}{\partial\theta}\right|_{\theta=\hat{\theta}_c^\alpha}\xrightarrow[n\to\infty]{P}0.$

On the other hand

$\frac{\partial^2W_{n,\alpha}(\theta)}{\partial\theta\partial\theta^T}=(1+\alpha)\left[(1+\alpha)\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}u(\theta,y)u(\theta,y)^Tdy+\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}\frac{\partial u(\theta,y)^T}{\partial\theta}dy-\frac{1}{n}\sum_{i=1}^{n}\alpha\,CL(\theta,y_i)^{\alpha}u(\theta,y_i)u(\theta,y_i)^T-\frac{1}{n}\sum_{i=1}^{n}CL(\theta,y_i)^{\alpha}\frac{\partial u(\theta,y_i)^T}{\partial\theta}\right].$

But

$\frac{1}{n}\sum_{i=1}^{n}CL(\theta,y_i)^{\alpha}u(\theta,y_i)u(\theta,y_i)^T\xrightarrow[n\to\infty]{P}\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}u(\theta,y)u(\theta,y)^Tdy$

and

$\frac{1}{n}\sum_{i=1}^{n}CL(\theta,y_i)^{\alpha}\frac{\partial u(\theta,y_i)^T}{\partial\theta}\xrightarrow[n\to\infty]{P}\int_{\mathbb{R}^m}CL(\theta,y)^{\alpha+1}\frac{\partial u(\theta,y)^T}{\partial\theta}dy.$

Therefore

$\left.\frac{\partial^2W_{n,\alpha}(\theta)}{\partial\theta\partial\theta^T}\right|_{\theta=\hat{\theta}_c^\alpha}\xrightarrow[n\to\infty]{P}(1+\alpha)H_\alpha(\theta_0).$

We can now write

$nW_{n,\alpha}(\theta_0)=nW_{n,\alpha}(\hat{\theta}_c^\alpha)+\frac{1+\alpha}{2}\sqrt{n}(\theta_0-\hat{\theta}_c^\alpha)^TH_\alpha(\theta_0)\sqrt{n}(\theta_0-\hat{\theta}_c^\alpha)+o_p(1).$

It is clear that

$E_{Y_1,\dots,Y_n}\left[\sqrt{n}(\theta_0-\hat{\theta}_c^\alpha)^TH_\alpha(\theta_0)\sqrt{n}(\theta_0-\hat{\theta}_c^\alpha)\right]=\sum_{r=1}^{k}\lambda_r+o_p(1)=\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1).$

Then

$E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\theta_0)\right]=E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\hat{\theta}_c^\alpha)\right]+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)$

and, on the other hand, it is clear that

$E_{Y_1,\dots,Y_n}\left[W_{n,\alpha}(\theta_0)\right]=W_\alpha(\theta_0).$

Therefore,

$E_{Y_1,\dots,Y_n}\left[nW_\alpha(\hat{\theta}_c^\alpha)\right]=nW_\alpha(\theta_0)+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)=E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\theta_0)\right]+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)=E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\hat{\theta}_c^\alpha)\right]+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)=E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\hat{\theta}_c^\alpha)\right]+(1+\alpha)\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1).$

Hence $nW_{n,\alpha}(\hat{\theta}_c^\alpha)+(1+\alpha)\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)$ is an asymptotically unbiased estimator of

$E_{Y_1,\dots,Y_n}\left[nW_\alpha(\hat{\theta}_c^\alpha)\right].$

 □

Appendix B. Computation of the CLDIC in Section 4.1

We have to compute

$\mathrm{CLDIC}_\alpha(M_k)=W_{n,\alpha}(\hat\rho)+\frac{\alpha+1}{n}J_\alpha(\hat\rho)H_\alpha(\hat\rho)^{-1},$

where

$W_{n,\alpha}(\hat\rho)=\int_{\mathbb{R}^4}CL(\hat\rho,y)^{\alpha+1}dy-\left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}CL(\hat\rho,y_i)^{\alpha},$ (A1)
$J_\alpha(\hat\rho)=\int_{\mathbb{R}^4}CL(\hat\rho,y)^{2\alpha+1}u(\hat\rho,y)^2dy-\left(\int_{\mathbb{R}^4}CL(\hat\rho,y)^{\alpha+1}u(\hat\rho,y)dy\right)^2,$ (A2)
$H_\alpha(\hat\rho)=\int_{\mathbb{R}^4}CL(\hat\rho,y)^{\alpha+1}u(\hat\rho,y)^2dy,$ (A3)

for our candidate models, namely, the composite normal and the composite 4-variate t-distribution. As commented in Section 4.1, we consider a composite likelihood function based on the product of two bivariate distributions with a common variance-covariance matrix. It is therefore necessary, in this example, to obtain the values (A1), (A2) and (A3) for both the composite normal and the composite t-distribution. However, as stated in [10], while the sensitivity and variability matrices can sometimes be evaluated explicitly, it is more usual to use empirical estimates. Following this comment, in the current example, we compute Equations (A1), (A2) and (A3) empirically from the sample data using

$\widehat{W}_{n,\alpha}(\hat\rho)=\frac{1}{n}\sum_{i=1}^{n}CL(\hat\rho,y_i)^{\alpha+1}-\left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}CL(\hat\rho,y_i)^{\alpha},\quad\widehat{J}_\alpha(\hat\rho)=\frac{1}{n}\sum_{i=1}^{n}CL(\hat\rho,y_i)^{2\alpha+1}u(\hat\rho,y_i)^2-\left(\frac{1}{n}\sum_{i=1}^{n}CL(\hat\rho,y_i)^{\alpha+1}u(\hat\rho,y_i)\right)^2,\quad\widehat{H}_\alpha(\hat\rho)=\frac{1}{n}\sum_{i=1}^{n}CL(\hat\rho,y_i)^{\alpha+1}u(\hat\rho,y_i)^2.$

Now, we obtain the score of the composite likelihood u(ρ^,yi) explicitly for both cases. By equation (A.5) in [12],

$u_N(\hat\rho,y_i)=\frac{\hat\rho}{1-\hat\rho^2}\left[2+\frac{1}{\hat\rho}(t_{1i}t_{2i}+t_{3i}t_{4i})-\frac{1}{1-\hat\rho^2}\left(t_{1i}^2-2\hat\rho t_{1i}t_{2i}+t_{2i}^2\right)-\frac{1}{1-\hat\rho^2}\left(t_{3i}^2-2\hat\rho t_{3i}t_{4i}+t_{4i}^2\right)\right],$

with $t_{ji}=y_{ji}-\mu_j$, $j=1,\dots,4$. On the other hand, we want to compute $u_{t_\nu}(\hat\rho,y_i)$:

$u_{t_\nu}(\hat\rho,y_i)=\frac{\partial\log CL_{t_\nu}(\hat\rho,y_i)}{\partial\hat\rho}=\frac{1}{CL_{t_\nu}(\hat\rho,y_i)}\frac{\partial CL_{t_\nu}(\hat\rho,y_i)}{\partial\hat\rho}=\frac{1}{f_{12}^{t_\nu}(y_i;\hat\rho)f_{34}^{t_\nu}(y_i;\hat\rho)}\frac{\partial}{\partial\hat\rho}\left[f_{12}^{t_\nu}(y_i;\hat\rho)f_{34}^{t_\nu}(y_i;\hat\rho)\right]=\frac{1}{f_{12}^{t_\nu}(y_i;\hat\rho)}\frac{\partial f_{12}^{t_\nu}(y_i;\hat\rho)}{\partial\hat\rho}+\frac{1}{f_{34}^{t_\nu}(y_i;\hat\rho)}\frac{\partial f_{34}^{t_\nu}(y_i;\hat\rho)}{\partial\hat\rho}.$

Now, it can be shown that

$\frac{\partial f_{12}^{t_\nu}(y_i;\hat\rho)}{\partial\hat\rho}=f_{12}^{t_\nu}(y_i;\hat\rho)\,\frac{(\nu-2)\hat\rho^3-t_{1i}t_{2i}\nu\hat\rho^2+\left[(t_{1i}^2+t_{2i}^2-1)\nu+t_{2i}^2+t_{1i}^2+2\right]\hat\rho-t_{1i}t_{2i}\nu-2t_{1i}t_{2i}}{(1-\hat\rho^2)\left[(\nu-2)\hat\rho^2+2t_{1i}t_{2i}\hat\rho-\nu-t_{1i}^2-t_{2i}^2+2\right]}$

and

$\frac{\partial f_{34}^{t_\nu}(y_i;\hat\rho)}{\partial\hat\rho}=f_{34}^{t_\nu}(y_i;\hat\rho)\,\frac{(\nu-2)\hat\rho^3-t_{3i}t_{4i}\nu\hat\rho^2+\left[(t_{3i}^2+t_{4i}^2-1)\nu+t_{4i}^2+t_{3i}^2+2\right]\hat\rho-t_{3i}t_{4i}\nu-2t_{3i}t_{4i}}{(1-\hat\rho^2)\left[(\nu-2)\hat\rho^2+2t_{3i}t_{4i}\hat\rho-\nu-t_{3i}^2-t_{4i}^2+2\right]}.$
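The two scores can be transcribed to R as in the following sketch ($t_j=y_j-\mu_j$; u_N follows (A.5) in [12], and u_t combines the two partial derivatives above through the log-derivative decomposition; function names are illustrative).

```r
# Sketch: composite scores u_N and u_t of Appendix B.
u_N <- function(rho, y, mu = rep(0, 4)) {
  tt <- y - mu                                   # t_j = y_j - mu_j
  Q12 <- tt[1]^2 - 2 * rho * tt[1] * tt[2] + tt[2]^2
  Q34 <- tt[3]^2 - 2 * rho * tt[3] * tt[4] + tt[4]^2
  rho / (1 - rho^2) *
    (2 + (tt[1] * tt[2] + tt[3] * tt[4]) / rho - (Q12 + Q34) / (1 - rho^2))
}

dlogf_t <- function(rho, t1, t2, nu) {  # (1/f) df/drho for one bivariate t block
  num <- (nu - 2) * rho^3 - nu * t1 * t2 * rho^2 +
         ((t1^2 + t2^2 - 1) * nu + t1^2 + t2^2 + 2) * rho - (nu + 2) * t1 * t2
  den <- (1 - rho^2) *
         ((nu - 2) * rho^2 + 2 * t1 * t2 * rho - nu - t1^2 - t2^2 + 2)
  num / den
}

u_t <- function(rho, y, nu, mu = rep(0, 4)) {
  tt <- y - mu
  dlogf_t(rho, tt[1], tt[2], nu) + dlogf_t(rho, tt[3], tt[4], nu)
}
```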

Author Contributions

Conceptualization, E.C., N.M., L.P. and K.Z.; Methodology, E.C., N.M., L.P. and K.Z.; Software, E.C., N.M., L.P. and K.Z.; Validation, E.C., N.M., L.P. and K.Z.; Formal Analysis, E.C., N.M., L.P. and K.Z.; Investigation, E.C., N.M., L.P. and K.Z.; Resources, E.C., N.M., L.P. and K.Z.; Data Curation, E.C., N.M., L.P. and K.Z.; Writing—Original Draft Preparation, E.C., N.M., L.P. and K.Z.; Writing—Review & Editing, E.C., N.M., L.P. and K.Z.; Visualization, E.C., N.M., L.P. and K.Z.; Supervision, E.C., N.M., L.P. and K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially supported by Grant PGC2018-095194-B-I00 and Grant FPU16/03104 from Ministerio de Ciencia, Innovación y Universidades (Spain). E. Castilla, N. Martín and L. Pardo are members of the Instituto de Matemática Interdisciplinar, Complutense University of Madrid.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1.Fearnhead P., Donnelly P. Approximate likelihood methods for estimating local recombination rates. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2002;64:657–680. doi: 10.1111/1467-9868.00355. [DOI] [Google Scholar]
  • 2.Renard D., Molenberghs G., Geys H. A pairwise likelihood approach to estimation in multilevel probit models. J. Comput. Stat. Data Anal. 2004;44:649–667. doi: 10.1016/S0167-9473(02)00263-3. [DOI] [Google Scholar]
  • 3.Hjort N.L., Omre H. Topics in spatial statistics. Scand. J. Stat. 1994;21:289–357. [Google Scholar]
  • 4.Heagerty P.J., Lele S.R. A composite likelihood approach to binary spatial data. J. Am. Stat. Assoc. 1998;93:1099–1111. doi: 10.1080/01621459.1998.10473771. [DOI] [Google Scholar]
  • 5.Varin C., Host G., Skare O. Pairwise likelihood inference in spatial generalized linear mixed models. Comput. Stat. Data Anal. 2005;49:1173–1191. doi: 10.1016/j.csda.2004.07.021. [DOI] [Google Scholar]
  • 6.Henderson R., Shimakura S. A serially correlated gamma frailty model for longitudinal count data. Biometrika. 2003;90:355–366. doi: 10.1093/biomet/90.2.355. [DOI] [Google Scholar]
  • 7.Parner E.T. A composite likelihood approach to multivariate survival data. Scand. J. Stat. 2001;28:295–302. doi: 10.1111/1467-9469.00238. [DOI] [Google Scholar]
  • 8.Li Y., Lin X. Semiparametric Normal Transformation Models for Spatially Correlated Survival Data. J. Am. Stat. Assoc. 2006;101:593–603. doi: 10.1198/016214505000001186. [DOI] [Google Scholar]
  • 9.Joe H., Reid N., Song P.X., Firth D., Varin C. Composite Likelihood Methods. Report on the Workshop on Composite Likelihood. [(accessed on 23 July 2019)];2012 Available online: http://www.birs.ca/events/2012/5-day-workshops/12w5046.
  • 10.Varin C., Reid N., Firth D. An overview of composite likelihood methods. Statist. Sin. 2011;21:5–42. [Google Scholar]
  • 11.Martín N., Pardo L., Zografos K. On divergence tests for composite hypotheses under composite likelihood. Stat. Pap. 2019;60:1883–1919. doi: 10.1007/s00362-017-0900-1. [DOI] [Google Scholar]
  • 12.Castilla E., Martin N., Pardo L., Zografos K. Composite Likelihood Methods Based on Minimum Density Power Divergence Estimator. Entropy. 2018;20:18. doi: 10.3390/e20010018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Castilla E., Martin N., Pardo L., Zografos K. Composite likelihood methods: Rao-type tests based on composite minimum density power divergence estimator. Stat. Pap. 2019 doi: 10.1007/s00362-019-01122-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Kullback S. Information Theory and Statistics. Wiley; New York, NY, USA: 1959. [Google Scholar]
  • 15.Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov B.N., Csaki F., editors. 2nd International Symposium on Information Theory. Akademiai Kiado; Budapest, Hungary: 1973. pp. 267–281. [Google Scholar]
  • 16.Akaike H. A new look at the statistical model identification. IEEE Trans. Autom. Control. 1974;19:716–723. doi: 10.1109/TAC.1974.1100705. [DOI] [Google Scholar]
  • 17.Takeuchi K. Distribution of information statistics and criteria for adequacy of models. Math. Sci. 1976;153:12–18. (In Japanese) [Google Scholar]
  • 18.Murari A., Peluso E., Cianfrani F., Gaudio P., Lungaroni M. On the Use of Entropy to Improve Model Selection Criteria. Entropy. 2019;21:394. doi: 10.3390/e21040394. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Mattheou K., Lee S., Karagrigoriou A. A model selection criterion based on the BHHJ measure of divergence. J. Stat. Plan. Inference. 2009;139:228–235. doi: 10.1016/j.jspi.2008.04.022. [DOI] [Google Scholar]
  • 20.Avlogiaris G., Micheas A., Zografos K. A criterion for local model selection. Sankhya. 2019;81:406–444. doi: 10.1007/s13171-018-0126-x. [DOI] [Google Scholar]
  • 21.Avlogiaris G., Micheas A., Zografos K. On local divergences between two probability measures. Metrika. 2016;79:303–333. doi: 10.1007/s00184-015-0556-6. [DOI] [Google Scholar]
  • 22.Varin C., Vidoni P. A note on composite likelihood inference and model selection. Biometrika. 2005;92:519–528. doi: 10.1093/biomet/92.3.519. [DOI] [Google Scholar]
  • 23.Gao X., Song P.X.K. Composite likelihood Bayesian information criteria for model selection in high-dimensional data. J. Am. Stat. Assoc. 2010;105:1531–1540. doi: 10.1198/jasa.2010.tm09414. [DOI] [Google Scholar]
  • 24.Ng C.T., Joe H. Model comparison with composite likelihood information criteria. Bernoulli. 2014;20:1738–1764. doi: 10.3150/13-BEJ539. [DOI] [Google Scholar]
  • 25.Basu A., Harris I.R., Hjort N.L., Jones M.C. Robust and efficient estimation by minimizing a density power divergence. Biometrika. 1998;85:549–559. doi: 10.1093/biomet/85.3.549. [DOI] [Google Scholar]
  • 26.Pardo L. Statistical Inference Based on Divergence Measures. Chapman & Hall CRC Press; Boca Raton, FL, USA: 2006. [Google Scholar]
  • 27.Basu A., Shioya H., Park C. Statistical Inference. The Minimum Distance Approach. Chapman & Hall/CRC; Boca Raton, FL, USA: 2011. [Google Scholar]
  • 28.Burnham K.P., Anderson D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer; New York, NY, USA: 2002. [Google Scholar]
  • 29.Xu X., Reid N. On the robustness of maximum composite estimate. J. Stat. Plan. Inference. 2011;141:3047–3054. doi: 10.1016/j.jspi.2011.03.026. [DOI] [Google Scholar]
  • 30.Warwick J., Jones M.C. Choosing a robustness tuning parameter. J. Stat. Comput. Simul. 2005;75:581–588. doi: 10.1080/00949650412331299120. [DOI] [Google Scholar]
  • 31.Fisher R.A. The use of multiple measurements in taxonomic problems. Ann. Eugenics. 1936;7:179–188. doi: 10.1111/j.1469-1809.1936.tb02137.x. [DOI] [Google Scholar]
  • 32.Fraley C., Raftery A.E., Murphy T.B., Scrucca L. MCLUST Version 4 for R: Normal Mixture Modeling for Model-based Clustering, Classification, and Density Estimation. Department of Statistics, University of Washington; Seattle, WA, USA: 2012. Technical Report 597. [Google Scholar]
  • 33.Forina M., Lanteri S., Armanino C., Leardi R. PARVUS: An Extendable Package of Programs for Data Exploration, Classification, and Correlation. Institute of Pharmaceutical and Food Analysis Technologies; Genoa, Italy: 1998. [Google Scholar]
