. 2023 Dec 19;33(2):589–608. doi: 10.1007/s11749-023-00912-8

A generalized Hosmer–Lemeshow goodness-of-fit test for a family of generalized linear models

Nikola Surjanovic, Richard A. Lockhart, Thomas M. Loughin
PMCID: PMC11164741  PMID: 38868722

Abstract

Generalized linear models (GLMs) are very widely used, but formal goodness-of-fit (GOF) tests for the overall fit of the model seem to be in wide use only for certain classes of GLMs. We develop and apply a new goodness-of-fit test, similar to the well-known and commonly used Hosmer–Lemeshow (HL) test, that can be used with a wide variety of GLMs. The test statistic is a variant of the HL statistic, but we rigorously derive an asymptotically correct sampling distribution using methods of Stute and Zhu (Scand J Stat 29(3):535–545, 2002) and demonstrate its consistency. We compare the performance of our new test with other GOF tests for GLMs, including a naive direct application of the HL test to the Poisson problem. Our test provides competitive or comparable power in various simulation settings, and we identify a situation where a naive version of the test fails to hold its size. Our generalized HL test is straightforward to implement and interpret, and an R package is publicly available.

Supplementary Information

The online version contains supplementary material available at 10.1007/s11749-023-00912-8.

Keywords: Empirical regression process, Exponential dispersion family, Generalized linear model, Goodness-of-fit test, Hosmer–Lemeshow test

Introduction

Generalized linear models (GLMs), including among others the linear, logistic, and Poisson regression models, have been used in a vast number of application domains and are extremely popular in medical and biological applications.

Naturally, it is desirable to have a model that fits the observed data well. Hosmer and Lemeshow (1980) constructed a GOF test for the logistic regression model, the HL test, which applies a Pearson test statistic to differences of observed and expected event counts from data grouped based on the ordered fitted values from the model. As a result, their test is very easy to interpret and is extremely popular, particularly in medical applications. Due to the simplicity of the test, it is tempting to naively apply it to other GLMs with minimal modification, as has occasionally been suggested in the literature (Bilder and Loughin 2014; Agresti 1996, p. 90). However, the HL test has not been rigorously justified outside of the binomial setting, and hence, its validity is unknown. Indeed, there are indications that its limiting distribution is incorrect in the non-binomial setting, as simulation results in Sect. 5 suggest.

A considerable number of GOF tests available for GLMs with a non-binomial or non-normal response (or other regression models beyond GLMs) involve kernel density estimation or some other form of smoothing—for example, the test of Cheng and Wu (1994). However, González-Manteiga and Crujeiras (2013) mention that the selection of the smoothing parameter used in some tests is “a broadly studied problem in regression estimation but with serious gaps for testing problems”. Also, use of continuous covariates in a GLM model can render certain basic tests, such as the Pearson chi-squared test, invalid (Pulkstenis and Robinson 2002).

In this paper, we work towards two goals. First, we explore some of the properties of the “naive” application of the HL test to other GLM distributions. Second, we derive a more appropriate modification to the HL test statistic and determine its correct limiting null sampling distribution. The modification is based on an application of theory developed by Stute and Zhu (2002). We show that the test statistic has an asymptotic chi-squared distribution for many GLMs in the exponential dispersion family, adding to the appealing simplicity of the test. We investigate both the new and the naive tests’ small-sample performances in a series of carefully designed computational experiments.

Section 2 gives an overview of previously developed GOF tests. Section 3 has a more detailed description of certain GOF tests, including the naive HL test and our new test. We present our main theorems concerning the asymptotic distribution of our proposed test under the null hypothesis as well as consistency results. The design for our simulation study comparing these tests is laid out in Sect. 4, and the results are provided in Sect. 5. We find that our test provides competitive or comparable power to other available tests in various simulation settings, is computationally efficient, and avoids the use of kernel-based estimators. Finally, we discuss the results and potential future work in Sect. 6. Proofs of several results are in the supplementary material and an R package implementing the generalized Hosmer–Lemeshow test is available at https://github.com/nikola-sur/goodGLM.

Background and notation

We let Y be a response variable associated with a covariate vector X \in \mathbb{R}^d. We write (X_i, Y_i), i = 1, \ldots, n, to denote a random sample in which each (X_i, Y_i) has the same distribution as (X, Y) and yields observed data (x_i, y_i). Limits are taken as the sample size n tends to infinity.

The HL GOF test assesses departures between observed and expected event counts from data grouped based on the fitted values from the GLM. For instance, in the binary case for logistic regression, we assume

E(Y \mid X = x) = \pi(\beta' x) = \frac{\exp(\beta' x)}{1 + \exp(\beta' x)},

for some \beta \in \mathbb{R}^d. The likelihood function is then given by

L(\beta) = \prod_{i=1}^{n} \pi(\beta' x_i)^{y_i} \, (1 - \pi(\beta' x_i))^{1 - y_i},

from which a maximum likelihood estimate (MLE), \beta_n, of \beta can be obtained. Computing the HL test statistic starts with partitioning the data into G groups, usually so that the groups are of approximately equal size and the fitted values within each group are similar. The partition fixes interval endpoints k_g for 1 \le g \le G, with -\infty = k_0 < k_1 < \cdots < k_{G-1} < k_G = \infty. The k_g are often set equal to the logit of equally spaced quantiles of the fitted values \hat{\pi}_i = \pi(\beta_n' x_i). We define I_i(g) = 1(k_{g-1} < \beta_n' x_i \le k_g), O_g = \sum_{i=1}^{n} y_i I_i(g), E_g = \sum_{i=1}^{n} \hat{\pi}_i I_i(g), n_g = \sum_{i=1}^{n} I_i(g), and \bar{\pi}_g = E_g / n_g for 1 \le g \le G, where 1(A) is the indicator function of a set A. Here, n_g is the number of observations in the gth group, and \bar{\pi}_g is the average of the fitted values within the gth group. The HL test statistic is

\hat{C}_G = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{n_g \bar{\pi}_g (1 - \bar{\pi}_g)}.  (1)
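For concreteness, the computation in (1) can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the helper name hosmer_lemeshow is ours, and we form equal-sized groups from the ordered fitted values with numpy.array_split rather than using logit-quantile endpoints.

```python
import numpy as np

def hosmer_lemeshow(y, pi_hat, G=10):
    """Hosmer-Lemeshow statistic (1): groups from ordered fitted values."""
    order = np.argsort(pi_hat)
    # split the ordered observations into G nearly equal-sized groups
    C = 0.0
    for idx in np.array_split(order, G):
        O_g = y[idx].sum()        # observed event count in group g
        E_g = pi_hat[idx].sum()   # expected event count in group g
        n_g = len(idx)
        pi_bar = E_g / n_g        # average fitted value in group g
        C += (O_g - E_g) ** 2 / (n_g * pi_bar * (1.0 - pi_bar))
    return C
```

The resulting value would then be compared against a \chi^2_{G-2} reference distribution, per the usual HL recommendation.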

The theory behind this test, based on work by Moore and Spruill (1975), suggests that the asymptotic null distribution of the test statistic follows the distribution of a weighted sum of chi-squared random variables. Hosmer and Lemeshow (1980) approximated this distribution with a single chi-squared distribution, where the G-2 degrees of freedom were determined partly by simulation.

To generalize the HL test to other GLMs, we follow Stute and Zhu (2002). We assume that E(Y^2) < \infty under the null hypothesis, given by (4) below. As a consequence, E|Y| < \infty, and we may define m(x) = E(Y \mid X = x) and \sigma^2(x) = \mathrm{Var}(Y \mid X = x). We also assume that \sigma^2(x) > 0 for all x in the support of X. The GLM that we fit and test assumes that the conditional density of Y given X = x is an exponential family member with inverse link function m. Specifically,

f_{Y \mid X}(y \mid x, \beta_0) = \exp\{y\theta - b(\theta)\},  (2)

where the vector \beta_0, \theta, and x are related by E(Y \mid X = x) = b'(\theta) = m(\beta_0' x), and the density is taken with respect to a suitable dominating measure \nu. The parameter \theta must belong to \Theta = \{\theta : \int \exp(y\theta)\,\nu(dy) < \infty\}, and we must have m(\beta_0' x) \in b'(\Theta) for all x in the support of X. In such a model the variance is a function of the mean, and we may write \sigma^2(x) = v(m(\beta_0' x)) for a smooth function v determined by the function b. In some cases it is of interest to add an (unknown) dispersion parameter, \phi, to the model and to assume further that there is a scalar \phi_0 such that the density of Y given X = x has the form

f_{Y \mid X}(y \mid x, \beta_0, \phi_0) = \exp\left\{ \frac{y\theta - b(\theta)}{\phi_0} - c(y, \phi_0) \right\},  (3)

with respect to a (possibly different) dominating measure \nu, where \beta_0, \theta, and x are related as before. With f_{Y \mid X} as in (3), we can write \sigma^2(x) = \phi_0 \, v(m(\beta_0' x)). The parameter space for (\theta, \phi) is \Theta \times \Lambda for a suitable \Lambda \subseteq (0, \infty).

Let B_0 = \{\beta : P(m(\beta' X) \in b'(\Theta)) = 1\}. We test the null hypothesis

H_0 : \exists\, \beta \in B_0 \ \text{s.t.}\ m(X) = m(\beta' X) \ \text{a.s., and}\ P\left(X \in \{x \ \text{s.t.}\ Y \mid X = x \sim f_{Y \mid X = x}\}\right) = 1,  (4)

for a pre-specified inverse link function m(·). Our alternative hypothesis is

H_1 : \forall\, \beta \in B_0, \ \text{we have}\ P\left(m(X) \ne m(\beta' X)\right) > 0 \ \text{or}\ P\left(X \in \{x \ \text{s.t.}\ Y \mid X = x \not\sim f_{Y \mid X = x}\}\right) > 0.  (5)

That is, we simultaneously test for a misspecification of the link function and of the conditional response distribution. We denote the maximum likelihood estimate of \beta by \beta_n and let \phi_n be some consistent estimate of \phi_0; if there is no dispersion parameter, we take \phi_n = \phi_0 = 1.

For many popular link functions m^{-1}, it is automatic that m(\beta' X) \in b'(\Theta) almost surely for all \beta \in \mathbb{R}^d and any distribution of X. For other link functions and some distributions of the predictor X this may not hold for all \beta; this motivates our definition of B_0. Similar comments apply to models with an unknown dispersion parameter, as in (3); see the supplementary material.

Related work

Tsiatis (1980) introduced a grouped GOF test for logistic regression models, similar to the HL test, that uses a data-independent (i.e., non-random) partitioning scheme. Canary et al. (2016) constructed a generalized Tsiatis test statistic for binary regression models with a non-canonical link and used a data-dependent partitioning scheme. We note that the theoretical validity of such a data-dependent partitioning method for the generalized Tsiatis test should be verified, similar to what Halteman (1980) did for logistic regression. Our work focuses on proving the validity of a large class of data-dependent partitioning schemes, also allowing for extensions to a much wider class of GLMs, as Canary (2013) briefly suggested. Our test also extends the X_w^2 statistic of Hosmer and Hjort (2002), with all weights equal to one, to a broader class of problems. There are other versions of the HL test for specific models beyond logistic regression, including binomial regression models with a log link (Blizzard and Hosmer 2006; Quinn et al. 2015) and other non-canonical links (Canary et al. 2016), the multinomial regression model (Fagerland et al. 2008), and the proportional odds and other ordinal logistic regression models (Fagerland and Hosmer 2013, 2016).

GOF tests not based on the HL test have been constructed that can be used with broader classes of GLMs (Su and Wei 1991; Stute and Zhu 2002; Cheng and Wu 1994; Lin et al. 2002; Liu et al. 2004; Rodríguez-Campos et al. 1998; Xiang and Wahba 1995). A review of GOF tests for regression models is given by González-Manteiga and Crujeiras (2013). While many of these tests appear to have merit, they do not seem to have been widely adopted in practice. For instance, the p-values accompanying some tests need to be obtained through simulation, and calculations may be time-consuming in the presence of several explanatory variables (Christensen and Lin 2015).

Methods and test statistics

Tests for GOF require statistics measuring departures from the null hypothesis. We use the residual process defined in Stute and Zhu (2002): for u \in \mathbb{R},

R_n^1(u) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} 1(\beta_n' X_i \le u)\,[Y_i - m(\beta_n' X_i)].  (6)

For the special case of logistic regression, the HL test statistic can be rewritten as a quadratic form in terms of R_n^1(u). Define the (length-G) vector

S_n^1 = \left( R_n^1(k_1) - R_n^1(k_0), \ldots, R_n^1(k_G) - R_n^1(k_{G-1}) \right)'.  (7)

Then the HL test statistic can be rewritten as \hat{C}_G = S_n^{1\prime} D^{-1} S_n^1, where

D = \mathrm{diag}\left\{ \frac{n_g \bar{\pi}_g (1 - \bar{\pi}_g)}{n},\ 1 \le g \le G \right\}.  (8)
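The equivalence between (1) and the quadratic form S_n^{1\prime} D^{-1} S_n^1 can be verified numerically. In the sketch below (helper names are ours), the group increments of R_n^1 carry the 1/\sqrt{n} scaling of (6), and D carries the matching 1/n, so the scalings cancel to recover (1).

```python
import numpy as np

def hl_direct(y, pi_hat, groups):
    """HL statistic computed directly from (1)."""
    C = 0.0
    for idx in groups:
        O, E, n_g = y[idx].sum(), pi_hat[idx].sum(), len(idx)
        pb = E / n_g
        C += (O - E) ** 2 / (n_g * pb * (1.0 - pb))
    return C

def hl_quadratic(y, pi_hat, groups):
    """HL statistic as the quadratic form S' D^{-1} S from (7)-(8)."""
    n = len(y)
    # increments of R_n^1 over the cells: (O_g - E_g) / sqrt(n)
    S = np.array([(y[idx] - pi_hat[idx]).sum() / np.sqrt(n) for idx in groups])
    D = np.diag([len(idx) * pi_hat[idx].mean() * (1.0 - pi_hat[idx].mean()) / n
                 for idx in groups])
    return S @ np.linalg.inv(D) @ S
```

On any common grouping of the data, the two functions agree up to floating-point error.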

Provided that G>d—a requirement not always cited in references to the HL test—and using Theorem 5.1 in Moore and Spruill (1975), Hosmer and Lemeshow (1980) show that their test statistic is asymptotically distributed under the null hypothesis as a weighted sum of chi-squared random variables, with

\hat{C}_G \to_d \chi^2_{G-d} + \sum_{j=1}^{d} \lambda_j \chi^2_{1j},

where each \chi^2_{1j} is a chi-squared random variable with one degree of freedom, and each \lambda_j is an eigenvalue of a particular matrix that depends on \beta_0 and the distribution of X. Through simulations, they concluded that the term \sum_{j=1}^{d} \lambda_j \chi^2_{1j} can be approximated in various settings by a \chi^2_{d-2} distribution, leading to the recommended G - 2 degrees of freedom; in other words, \hat{C}_G \stackrel{\cdot}{\sim} \chi^2_{G-2}. However, in certain settings with a finite sample size this approximation is poor, as we discuss in Sect. 5.

Naive generalization of the Hosmer–Lemeshow test

The HL test statistic depends on the binomial assumption only through D in (8), with the gth diagonal element representing an estimate of the variance of the counts in the gth group, divided by n. To extend this test to other GLMs, it is tempting to define a “naive” HL test statistic

\hat{C}_G^* = S_n^{1\prime} (D^*)^{-1} S_n^1,  (9)

where

D^* = \mathrm{diag}\left\{ \frac{1}{n} \sum_{i=1}^{n} \widehat{\mathrm{Var}}(Y \mid X = x_i)\, 1(k_{g-1} < \beta_n' x_i \le k_g) \right\},

for 1 \le g \le G, similar to the estimates of the variances of the group counts given by D in the original HL test. For example, for Poisson regression models,

D^* = \mathrm{diag}\left\{ \frac{1}{n} \sum_{i=1}^{n} m(\beta_n' x_i)\, 1(k_{g-1} < \beta_n' x_i \le k_g) \right\},

since the conditional variance of the response equals the conditional mean. This idea is suggested very briefly on p. 90 of Agresti (1996) and in Bilder and Loughin (2014), but it has not been developed further or assessed in the literature. The limiting distribution of this test statistic has not been determined, although one might naively assume that it retains the same \chi^2_{G-2} limit as the original HL test. Implementing this test for Poisson regression models, our simulation results suggest that as the number of estimated parameters in the model increases, the mean and variance of the \hat{C}_G^* statistic tend to decrease for a fixed sample size. Thus, it is apparent that the naive limiting distribution is not correct. Further properties of this test are discussed in Sect. 5.
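For the Poisson case, the naive statistic (9) collapses to a grouped Pearson statistic, because the gth diagonal entry of D^* is 1/n times the sum of the fitted means in group g, and the 1/\sqrt{n} scaling in S_n^1 cancels the 1/n in D^*. A minimal sketch (naive_hl_poisson is a hypothetical helper; equal-sized groups are used for simplicity):

```python
import numpy as np

def naive_hl_poisson(y, mu_hat, G=10):
    """'Naive' HL statistic (9) for Poisson regression: the variance
    estimate in each group is the sum of fitted means, since the
    Poisson variance equals the mean."""
    order = np.argsort(mu_hat)
    C = 0.0
    for idx in np.array_split(order, G):
        O_g, E_g = y[idx].sum(), mu_hat[idx].sum()
        C += (O_g - E_g) ** 2 / E_g   # grouped Pearson form of S'(D*)^{-1}S
    return C
```

As discussed above, comparing this statistic to a \chi^2_{G-2} reference is exactly the naive procedure whose size properties are examined in Sect. 5.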

The generalized HL test statistic

The HL test uses a limiting distribution that is only partially supported mathematically and does not seem to be appropriate for finite samples for GLMs outside of logistic regression. Rather than try to fix the flaws in the naive HL test, we propose a test statistic whose limiting law is demonstrated by appropriate techniques to be chi-squared with a determined number of degrees of freedom (less than or equal to G). We allow cell boundaries k_{n,g} that may depend on the data but must be distinct and properly ordered for each n. We assume that each k_{n,g} converges in probability to some k_g, that these limits are all distinct, and that they satisfy P(\beta_0' X = k_g) = 0 for all g. In our simulation study, described in Sect. 4, we use random interval endpoints chosen so that \sum_{i=1}^{n} \hat{\sigma}^2(x_i) I_i(g) is approximately equal across groups. Implementation details are in the supplementary material.
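One simple deterministic variant of this endpoint choice, in which the cumulative sum of estimated variances along the ordered linear predictor is cut at equal fractions of its total, can be sketched as follows. The helper variance_balanced_groups is our own illustration; the paper's exact implementation is in its supplementary material.

```python
import numpy as np

def variance_balanced_groups(eta_hat, var_hat, G=10):
    """Assign observations to G groups, ordered by the fitted linear
    predictor, so that within-group sums of estimated variances are
    approximately equal."""
    order = np.argsort(eta_hat)
    cum = np.cumsum(var_hat[order])
    # target cumulative-variance boundaries at 1/G, 2/G, ... of the total
    targets = cum[-1] * np.arange(1, G) / G
    boundaries = np.searchsorted(cum, targets, side="right")
    return np.split(order, boundaries)   # list of G index arrays
```

When all estimated variances are equal, this reduces to the familiar equal-sized grouping of the ordered fitted values.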

We follow Stute and Zhu (2002) to directly develop our generalized HL test and rigorously determine its correct limiting distribution. We first define

[G_n]_{ig} = 1(k_{n,g-1} < \beta_n' x_i \le k_{n,g}), \quad V^{1/2} = \mathrm{diag}\left\{ v(m(\beta' x_i))^{1/2} \right\}\Big|_{\beta = \beta_0}, \quad W^{1/2} = \mathrm{diag}\left\{ \frac{m'(\beta' x_i)}{v(m(\beta' x_i))^{1/2}} \right\}\Big|_{\beta = \beta_0},

for i = 1, \ldots, n and g = 1, \ldots, G, so that G_n is the n \times G matrix of group-membership indicators. Note that in the above definition of W^{1/2}, we write m'(u) to denote the derivative of m(u) with respect to u. Also, let X be the n \times d matrix whose ith row is x_i'. Define W_n^{1/2} and V_n^{1/2} in the same way as W^{1/2} and V^{1/2}, respectively, but evaluated at \beta_n instead of \beta_0.

Let I_n be the n \times n identity matrix, and denote the generalized hat matrix by H_n = W_n^{1/2} X (X' W_n X)^{-1} X' W_n^{1/2}. Define

\Sigma_n = \frac{1}{n} G_n' \left[ V_n - V_n^{1/2} W_n^{1/2} X (X' W_n X)^{-1} X' W_n^{1/2} V_n^{1/2} \right] G_n = \frac{1}{n} G_n' V_n^{1/2} \left[ I_n - W_n^{1/2} X (X' W_n X)^{-1} X' W_n^{1/2} \right] V_n^{1/2} G_n = \frac{1}{n} G_n' V_n^{1/2} (I_n - H_n) V_n^{1/2} G_n.  (10)

Denote the Moore–Penrose pseudoinverse of a matrix A by A+. Our “generalized HL” (GHL) test statistic is then given as

X^2_{\mathrm{GHL}} = S_n^{1\prime} \Sigma_n^{+} S_n^1 / \phi_n.  (11)

We note that without the H_n term in \Sigma_n, the GHL test statistic reduces to the naive HL statistic in (9). Under some conditions described next, we have under the null hypothesis that

S_n^{1\prime} \Sigma_n^{+} S_n^1 / \phi_n \to_d \chi^2_r,  (12)

with r = \mathrm{rank}(\Sigma), where \Sigma is specified below in (13).
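Putting (7), (10), and (11) together for a Poisson model with log link, where v(\mu) = \mu and the GLM weight and variance matrices coincide (W_n = V_n = \mathrm{diag}(\hat{\mu}_i)) and \phi_n = 1, the statistic can be sketched as below. This is a simplified illustration only (equal-sized groups, and a supplied coefficient vector in place of a full MLE routine), not the goodGLM implementation.

```python
import numpy as np
from scipy.stats import chi2

def ghl_poisson(y, X, beta_hat, G=10):
    """Generalized HL statistic (11) for a Poisson GLM with log link."""
    n = len(y)
    eta = X @ beta_hat
    mu = np.exp(eta)
    # group-membership matrix G_n (n x G): equal-sized groups of ordered eta
    Gn = np.zeros((n, G))
    for g, idx in enumerate(np.array_split(np.argsort(eta), G)):
        Gn[idx, g] = 1.0
    S = Gn.T @ (y - mu) / np.sqrt(n)          # increments of R_n^1, as in (7)
    V12 = np.sqrt(mu)                         # V^{1/2}; here W^{1/2}X = V^{1/2}X
    W12X = V12[:, None] * X
    H = W12X @ np.linalg.solve(W12X.T @ W12X, W12X.T)   # generalized hat matrix
    A = V12[:, None] * Gn                     # V^{1/2} G_n
    Sigma_n = (A.T @ A - A.T @ H @ A) / n     # (10)
    stat = float(S @ np.linalg.pinv(Sigma_n) @ S)
    r = np.linalg.matrix_rank(Sigma_n)
    return stat, chi2.sf(stat, df=r)
```

The Moore-Penrose pseudoinverse and the estimated rank of \Sigma_n implement the \chi^2_r comparison in (12).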

In this work, we focus on the GHL test with G=10 groups. However, the impact of the choice of the number of groups is a well-known limitation of group-based test statistics. In the long history of the HL test, most sources use G=10, and in our limited investigation we found no evidence that this is a bad choice. As informal guidance, we remark that Bilder and Loughin (2014) suggest trying a few different values of G to make sure that a single result is not overly influenced by unfortunate grouping.

GHL test statistic limiting distribution

We first state conditions used in our main theorem on the limiting distribution of X^2_{\mathrm{GHL}} under the null hypothesis (4). The theorem gives sufficient conditions for the convergence in distribution in (12). Conditions (A), (C), and the first inequality in (B) below come from Stute and Zhu (2002). We assume that the conditional density of Y given X is given by (2) or (3), with respect to some dominating measure \nu. For models without a dispersion parameter, the score function from observation i is U_i(\beta_0) = X_i\, m'(\beta_0' X_i)\, [Y_i - m(\beta_0' X_i)] / v(m(\beta_0' X_i)), and the Fisher information matrix is I_1(\beta_0) = E\left[ X X' \{m'(\beta_0' X)\}^2 / v(m(\beta_0' X)) \right].

Condition (A)

  1. I1(β0) exists and is positive definite.

  2. Let \ell(X_i, Y_i, \beta_0) = [I_1(\beta_0)]^{-1} U_i(\beta_0). Under the null hypothesis, we have
    n^{1/2}(\beta_n - \beta_0) = n^{-1/2} \sum_{i=1}^{n} \ell(X_i, Y_i, \beta_0) + o_P(1).

Condition (B): The function m is twice continuously differentiable and the function v is continuously differentiable. For some δ>0 we have

1) E\left[\sup\left\{ \max_j |X_j\, m'(\beta' X)| : \|\beta - \beta_0\| \le \delta \right\}\right] < \infty,
2) E\left[ v^2(m(\beta_0' X)) \right] < \infty,
3) E\left[ \|X X'\| \sup\left\{ w_A^2(\beta' X) : \|\beta - \beta_0\| \le \delta \right\}\right] < \infty,
4) E\left[ \|X X'\|\, m'(\beta_0' X)^2 \right] < \infty,
5) E\left[ \|X X'\|^2 \sup\left\{ w_B^2(\beta' X) : \|\beta - \beta_0\| \le \delta \right\}\right] < \infty, and
6) E\left[ \|X X'\| \sup\left\{ |w_C(\beta' X)| : \|\beta - \beta_0\| \le \delta \right\}\right] < \infty,

where

w_A(u) = \frac{m'(u)}{v(m(u))}, \quad w_B(u) = m''(u), \quad \text{and} \quad w_C(u) = \frac{d}{du}\left[ \frac{\{m'(u)\}^2}{v(m(u))} \right].

Condition (C): Define \tilde{H}(u, \beta) = E\left[\sigma^2(X)\, 1(\beta' X \le u)\right]. Then \tilde{H} is uniformly continuous in u at \beta = \beta_0 (and \phi = \phi_0). This condition requires that \beta_0' X have a continuous distribution; in particular, \beta_0 \ne 0.

Condition (D): Under the null hypothesis, with Σ as defined below,

  1. \Sigma_n \to_p \Sigma, and

  2. \mathrm{rank}(\Sigma_n) \to_p \mathrm{rank}(\Sigma).

We specify Σ when fYX is as in (2) or (3). Define the column vector

q(x, \beta) = \frac{\partial m(\beta' x)}{\partial \beta} = (q_1(x, \beta), \ldots, q_d(x, \beta))' = m'(\beta' x)\, x.

Define the vector-valued function Q=(Q1,,Qd) by

Q_i(u) \equiv Q_i(u, \beta_0) = E\left( q_i(X, \beta_0)\, 1(\beta_0' X \le u) \right), \quad 1 \le i \le d.

The matrix Σ is defined by

\Sigma = \Sigma^{(1)} - \Sigma^{(2)},  (13)

where, for 1g,gG,

\Sigma^{(1)}_{gg'} = \begin{cases} E\left[ v(m(\beta_0' X))\, 1(k_{g-1} < \beta_0' X \le k_g) \right], & g = g' \\ 0, & g \ne g' \end{cases} \qquad \Sigma^{(2)}_{gg'} = \left(Q(k_g) - Q(k_{g-1})\right)' [I_1(\beta_0)]^{-1} \left(Q(k_{g'}) - Q(k_{g'-1})\right).

In the supplementary material, we prove our main theorem.

Theorem 1

Suppose that E(Y^2) < \infty, that conditions (A), (B), and (C) hold, and that \phi_n \to_p \phi_0. Assume that the cell boundaries k_{n,g} satisfy k_{n,g} \to_p k_g for g = 0, \ldots, G and that the k_g are distinct. Then, under the null hypothesis given by (4), with f_{Y \mid X} as in (2) or (3), we have

S_n^1 \to_d S^1 \sim \mathrm{MVN}_G(0, \phi_0 \Sigma),

where Sn1 is as defined in (7), and Σ is given by (13).

If there exists any sequence of matrices Σn that satisfies condition (D), then, putting r=rank(Σ),

S_n^{1\prime} \Sigma_n^{+} S_n^1 / \phi_n \to_d \chi^2_r.

If conditions (A), (B), and (C) hold, then the particular matrix Σn in (10) converges almost surely to Σ, so (D1) holds under the null hypothesis. Finally, if conditions (i) and (ii) of Sect. 3.4 are satisfied, then conditions (B) and (C) hold.

GLMs for which the GHL test is valid

Formally, conditions (A), (B), (C), and (D) should be verified before using the generalized HL test statistic, X^2_{\mathrm{GHL}}. From Theorem 1, the conditions below are sufficient for the validity of conditions (B) and (C), provided that f_{Y \mid X} is of the form presented in (2) or (3):

  • (i)

    One of the distribution/link function combinations from Table 1 is used.

  • (ii)

    The joint probability distribution of the explanatory variables, X, has compact support. For all \beta in an open neighbourhood N of \beta_0, the variable \beta' X has a bounded continuous Lebesgue density, the support, \mathrm{supp}(\beta' X), of the linear predictor is an interval, and P(m(\beta' X) \in b'(\Theta)) = 1.

The supplementary material outlines how to verify conditions (A), (B), (C), and (D), weaken the compactness assumption in (ii), and extend our test to other GLMs.

Table 1.

Several possible distribution and link function combinations. Distributions are parameterized so that μ or π represents the mean of the distribution. *Gamma distribution with variance μ2/k. **Inverse Gaussian distribution with variance μ3/λ. ***Negative binomial distribution with variance μ+μ2/k. For the negative binomial distribution, k is assumed to be known

Distribution Example possible links
Normal(μ,σ2) identity
Bernoulli(π) logit, probit, cauchit, cloglog
Poisson(λ) log, square root
Gamma(μ,k)* log
IG(μ,λ)** log
NB(μ,k)*** log

Consistency of the GHL test

We discuss power in terms of consistency; we outline one possible set of conditions on the alternative distribution, the specific model, and the choice of cell boundaries that together ensure that our test is consistent. Precise versions of the conditions (K1)–(K5) that follow are in the supplementary material.

Our conclusions are affected by the presence or absence of an intercept term. Here, we present results for models that do not have an intercept. For such models, we assume that the rows X_i' of the design matrix are i.i.d. and have a Lebesgue density. Models with intercepts are discussed in the supplementary material. We let \eta(x) = \beta' x denote the linear predictor of our GLM.

Behaviour of coefficient estimator under the alternative

In the supplementary material, we give conditions (K1), used to check conditions in White (1982), which guarantee that the estimate \beta_n has a limit under the alternative being considered; we denote this limit by \beta^*. The most important additional components are a restriction to a compact parameter set B not containing \beta = 0, uniformity over B in some of our conditions, and the assumption that X has a density. From White (1982), conditions (K1) are enough to ensure the existence of a (possibly not unique) maximizer of the GLM likelihood.

Also in the supplementary material, we give conditions (K2) on the joint density of X and Y, which come, essentially, from White (1982). The two conditions (K1) and (K2) now imply Assumptions A1, A2, and A3 of White (1982). In turn, these imply almost sure convergence of \beta_n to a unique \beta^* \in B. We write \eta^*(X) for the linear predictor evaluated at \beta^*; that is, \eta^*(X) = \beta^{*\prime} X.

Behaviour of interval endpoints under the alternative

We consider both fixed and random interval endpoints. Our consistency results need assumptions about the probability that \eta^*(X) belongs to each limiting interval; these probabilities depend on \beta^*, and the assumptions would be false for \beta^* = 0. For some choices of intervals, there can be intervals that contain no observations almost surely. We need to assume that this does not happen for the predictor \eta^*. In the supplementary material, we present condition (K3), which strengthens our main condition (C); this condition requires \beta^* \ne 0.

There may be \beta^* \in B such that \mathrm{supp}(\eta^*) is bounded; in that case, some methods of choosing boundaries (such as fixed boundaries) will not satisfy (K3). However, the method used in our simulations chooses cell boundaries using the estimate \beta_n so as to make all the cells have approximately the same sum of estimated response variances. For this method, (K3) holds.

Behaviour of covariance estimate under the alternative

We consider our estimate \Sigma_n of \Sigma. Let \Sigma_n(\beta) and \Sigma(\beta) be the matrices in (10) and (13) evaluated at a general \beta; as before, \Sigma_n = \Sigma_n(\beta_n). Condition (B) imposes moment conditions in a neighbourhood of \beta_0, implying that our estimate \Sigma_n is consistent for \Sigma(\beta_0) under the null. Assumption (K4) extends these conditions to every \beta \in B so that they apply to the unknown value \beta^*. Under conditions (K1)–(K4), our arguments show that, with probability 1, \Sigma_n(\beta) \to \Sigma(\beta) uniformly in \beta \in B; moreover, \Sigma(\beta) is a continuous function of \beta, so \Sigma_n converges to \Sigma(\beta^*).

In the supplementary material, we show that under reasonable conditions the matrix \Sigma(\beta^*) has rank G or G - 1 and that, when n is large, \Sigma_n has the same rank with high probability. We delineate the special cases where rank G - 1 arises. Here we give one common special case of our results; a more general version is in the supplementary material.

Theorem 2

Fix \beta^* \in B. Assume conditions (K1), (K3), and (K4), and that we are not fitting an intercept. Then \mathrm{rank}(\Sigma(\beta^*)) = G unless there is a constant c such that

v(m(u)) = c\, u\, m'(u)  (14)

for all u \in \mathrm{supp}(\eta^*(X)). If (14) holds for all u \in \mathrm{supp}(\eta^*(X)), then \mathrm{rank}(\Sigma(\beta^*)) = G - 1.

We turn to the rank of \Sigma_n. Unless identity (14) holds on \mathrm{supp}(\eta^*(X)), the rank of \Sigma(\beta^*) is G under our conditions. Since \Sigma_n converges to \Sigma(\beta^*) and \beta^* \in B, we find that P(\mathrm{rank}(\Sigma_n) = G) converges to 1 in this case.
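As a concrete illustration of identity (14), consider Poisson models without an intercept. With the log link, m(u) = e^u and v(\mu) = \mu, so

v(m(u)) = e^u \quad \text{while} \quad c\, u\, m'(u) = c\, u\, e^u;

equality would force c\,u = 1 for all u in an interval of the support, which is impossible, so the rank is G. With the square-root link, m(u) = u^2 and m'(u) = 2u, so

v(m(u)) = u^2 = \tfrac{1}{2}\, u \cdot 2u,

and (14) holds with c = 1/2, giving rank G - 1.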

Our consistency theorem

Consistency requires that we have modelled the mean incorrectly in a fairly strong sense. For 1gG, define

\delta_g(\beta) = E\left[ 1(k_{g-1} < \eta(X) \le k_g)\, \{Y - m(\eta(X))\} \right].

Also define \bar{\delta}(\beta) = \frac{1}{G} \sum_{g=1}^{G} \delta_g(\beta).

Condition (K5): The model fitted does not have an intercept and either:

  1. For every constant c, the set of u \in \mathrm{supp}(\beta' X) where identity (14) does not hold has positive Lebesgue measure, and \sum_g \delta_g(\beta)^2 > 0 for all \beta \in B; or

  2. Identity (14) holds for all u, and \sum_g \left(\delta_g(\beta) - \bar{\delta}(\beta)\right)^2 > 0 for all \beta \in B.

Theorem 3

Under conditions (K1), (K2), (K3), (K4), and (K5),

X^2_{\mathrm{GHL}} \to_p \infty.

We note that, in the setting of the above theorem, the test based on X^2_{\mathrm{GHL}} is consistent against the alternative in question. That is, for any \alpha \in (0, 1), under H_1 we have P(X^2_{\mathrm{GHL}} > \chi^2_{r, 1-\alpha}) \to 1 as n \to \infty, where r = \mathrm{rank}(\Sigma).

Simulation study design

We perform a simulation study to assess the performance of our proposed test. The purpose of this paper is to extend the HL test to a wide array of new distribution models. We therefore focus on applications to data generated from Poisson distributions, since Poisson appears to be a common non-binomial GLM, but we include other response distributions such as the gamma and inverse Gaussian distributions. While this paper does not study the performance of the GHL test in binomial models, we do compare the performance of the HL test and the binomial version of the GHL test in Surjanovic and Loughin (2021). That work arose because we discovered that certain plausible data structures cause the regular HL test to fail, whereas the binomial version of the GHL test appropriately resolves the issue. We carefully document that problem and discuss it in Surjanovic and Loughin (2021) as a warning to practitioners who routinely apply HL to binomial models.

The binomial version of the GHL test is also studied in Hosmer and Hjort (2002), where it is equivalent to their X_w^2 test when all weights are set to 1. Full details of the simulation settings are in the following sections and the supplementary material. We first give an overview and describe features common to all experiments.

Unless otherwise specified, the mean is taken to be m(\beta' x), where m^{-1} is the log link, and the null hypothesis is that the Poisson distribution with this structure is correct. We compare rejection rates under different null and alternative hypothesis settings for four different GOF tests: the naive generalized HL test, our new generalized HL (GHL) test, the Stute–Zhu (SZ) test, and the Su–Wei (SW) test. For the naive and GHL tests, we use G = 10 unless specified otherwise. The naive generalized HL test is included to demonstrate that a test without proper theoretical justification may fail. We believe that the SZ test has an appealing construction, but it is perhaps not as well known as the SW test, which can be found in other simulation studies. These two tests are included because they do not rely heavily on kernel-based density estimation and are relatively straightforward to implement.

For x, v \in \mathbb{R}^d, let 1(x \le v) be 1 if and only if x_j \le v_j for all j = 1, 2, \ldots, d. Define

\tilde{R}_n(v) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} 1(X_i \le v)\,[Y_i - m(\beta_n' X_i)].

Using our notation, the SW test statistic is defined as

X^2_{\mathrm{SW}} = \sup_{v \in \mathbb{R}^d} |\tilde{R}_n(v)|.  (15)
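The supremum in (15) is over a piecewise-constant process whose jumps occur at the observed covariate points. The sketch below (helper name is ours) evaluates \tilde{R}_n at each observed X_i; this attains the supremum exactly when d = 1, while for d > 1 a full search over componentwise maxima of subsets of the data would be needed, so the sketch then gives a lower bound.

```python
import numpy as np

def sw_statistic(y, X, mu_hat):
    """Su-Wei statistic (15), evaluating the cumulative-residual
    process at each observed covariate point."""
    n = len(y)
    resid = y - mu_hat
    vals = np.empty(n)
    for i in range(n):
        le = np.all(X <= X[i], axis=1)   # 1(X_j <= X_i) componentwise
        vals[i] = resid[le].sum()
    return np.abs(vals).max() / np.sqrt(n)
```

The computational burden noted later for large d stems from this combinatorial search over covariate regions.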

The SZ test statistic has a more complicated form as a Cramér–von Mises statistic applied to a specially transformed version of the Rn1 process. We also slightly modify the SZ test statistic to detect overdispersion in the Poisson case. These test statistics, including the modification to SZ, are described in greater detail in the supplementary material.

We use sample sizes of 100 and 500 throughout this simulation study, representing moderate and large sample sizes in many studies in medical and other disciplines where GLMs are used. Unless otherwise stated, only the results for the sample size of 100 are reported for the null and power simulation settings; important differences between the two sample sizes are summarized in Sect. 5. For each simulation setting, we produce 2500 realizations. On each realization, we record a binary value for each test indicating whether the test rejects the null hypothesis for that data set. The proportion of the 2500 realizations for which a test rejects the null hypothesis estimates the test's true probability of type I error or power in that setting. In the null simulations, approximate 95% confidence intervals for the binomial probability of rejecting H_0 can be obtained from the observed rejection rates by adding and subtracting 0.009 (1.96 \times \sqrt{0.05 \cdot 0.95 / 2500} \approx 0.009). Conservative 95% confidence intervals for power can be obtained from the observed rejection rates by adding and subtracting 0.02, which corresponds to the widest interval, occurring when a proportion equals 0.5 (1.96 \times \sqrt{0.5 \cdot 0.5 / 2500} \approx 0.02). The simulations are performed using R.
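The half-widths quoted above follow from the usual binomial standard error with 2500 replications; as a quick arithmetic check:

```python
import math

def ci_half_width(p, R=2500):
    """Half-width of an approximate 95% CI for a rejection probability p,
    estimated from R independent replications."""
    return 1.96 * math.sqrt(p * (1.0 - p) / R)
```

Evaluating at p = 0.05 gives roughly 0.009, and at p = 0.5 (the widest case) exactly 0.0196, matching the values used in the study.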

Null distribution

Under the null hypothesis described by (4), where f_{Y \mid X} is a Poisson density, we consider six settings with a log link and three with a square-root link, varying the distribution of the explanatory variables and the true parameter values in each case. In the first three null settings with a log link, a model with a single covariate, X, and an intercept term is used. These settings serve to examine the effect of small and large fitted values on the null distribution of the test statistics. The distribution of X and the values of \beta_0 and \beta_1 are chosen so that the fitted values take on a wide range of values (approximately 0.1 to 100) in the first setting, are moderate in size (approximately 1 to 10) in the second setting, and are very small (approximately 0.1 to 1) in the third setting. Specifics on these and other settings are given in the supplementary material.

For settings 4, 5, and 6, coefficients are chosen so that the fitted values are moderate in size (rarely less than 1, with an average of approximately 4 or 5), so that other sources of potential problems for the GOF tests can be explored. The fourth setting examines a model including two continuous covariates and one dichotomous covariate (the “Normal-Bernoulli” model), and is similar to the one used in Hosmer and Hjort (2002). The fifth and sixth simulation settings examine the effects of correlated and right-skewed covariates, respectively. It is well known that in the presence of multicollinearity the variance of regression parameter estimates can become inflated. The correlated covariates setting is included to assess the impact of multicollinearity on the GOF tests, since the estimated covariance matrix of the regression coefficients is used in the calculation of the GHL test statistic. The right-skewed covariate in setting 6 is included in order to assess its potential impact on the SZ test, since this test makes use of kernel-based density estimation as a part of the calculation of the test statistic, albeit in a one-dimensional case.

We next create three settings, labelled 1b, 2b, and 3b, that are the same as settings 1, 2, and 3, respectively, except that the true and fitted models use a square-root link rather than a log link. We also use simulations to verify that the proposed GHL test maintains its size in models where there are unknown dispersion parameters that must be estimated. We consider several settings with gamma, inverse Gaussian, and negative binomial responses with a single covariate, a log link, and moderate-sized means. We fix the true dispersion parameter at \phi_0 = 0.1, so that the variance of the response distribution given its mean, \mu, is \phi_0 \mu^2, \phi_0 \mu^3, and \mu + \phi_0 \mu^2 for the gamma, inverse Gaussian, and negative binomial responses, respectively. The dispersion parameter for the gamma and inverse Gaussian models is estimated using a weighted average of the squared residuals, which is the default in summary.glm(). The negative binomial distribution with an unknown dispersion parameter is a popular alternative to Poisson regression that does not fall within the exponential dispersion family framework presented here. In this case, the parameter that controls the variance is estimated by maximum likelihood, which is the default estimation procedure in MASS::glm.nb() in R.
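The weighted average of squared residuals mentioned above is the Pearson-type moment estimator of the dispersion. A minimal sketch (assuming the variance-function values v(\hat{\mu}_i) are supplied as an array, with d estimated regression parameters; the helper name is ours, and this is analogous to, not a transcription of, the summary.glm() default):

```python
import numpy as np

def pearson_dispersion(y, mu_hat, v_mu, d):
    """Pearson-type moment estimate of the dispersion phi:
    sum of squared residuals weighted by the variance function,
    divided by the residual degrees of freedom n - d."""
    n = len(y)
    return np.sum((y - mu_hat) ** 2 / v_mu) / (n - d)
```

For a gamma fit, for example, v_mu would be \hat{\mu}_i^2 / k evaluated at the fitted means.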

Finally, we consider simulation settings that examine the performance of the GHL and naive generalized HL tests when only discrete covariates are present. We repeat setting 2 above, except that X is sampled uniformly from 30 or 50 possible values on a grid. These numbers of points are chosen so that the data can be split into at least G=10 groups. All model coefficients and further details of the simulation study, such as the implementation of the GHL test for the negative binomial model, are given in the supplementary material.
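The discrete-covariate design above can be sketched as follows, sampling with replacement from an evenly spaced grid on [-3, 3] (the grid endpoints are those stated in Table 4); this is an illustrative reconstruction, not the study's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_discrete_covariate(n, npoints, low=-3.0, high=3.0, rng=rng):
    """Sample n covariate values uniformly (with replacement) from an
    evenly spaced grid of `npoints` values on [low, high], as in the
    discrete-covariate settings of the study."""
    grid = np.linspace(low, high, npoints)
    return rng.choice(grid, size=n, replace=True)

x = sample_discrete_covariate(100, 30)
print(len(set(x.tolist())))  # at most 30 distinct values
```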

Power

To examine the power of the GOF tests, we consider four types of deviations from the null model: a missing quadratic term, overdispersion, a missing interaction term, and an incorrectly specified link function. These settings are similar to those used in Hosmer and Hjort (2002) and are realistic model misspecifications. In the first three settings, the severity of deviation between the true model and the fitted model is controlled by regression parameters to represent four levels ranging from “small” to “large” deviations from the assumed additive linear model. In the incorrect link setting, the true model uses a square root link, but the fitted model assumes a log link. In all four settings, we use a Poisson GLM and choose appropriate regression coefficients so that the fitted values are moderate in size, rarely less than 1 and often smaller than 10, to ensure that a large rejection rate is not simply due to small fitted values in the Pearson-like test statistics. All four power simulation settings are described in detail in the supplementary material.
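As a sketch of the first power setting, data can be generated under a missing-quadratic alternative, with severity controlled by a single coefficient. The coefficients below are illustrative, not the values used in the study (those are given in the supplementary material):

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_quadratic_alternative(n, beta0, beta1, gamma, rng=rng):
    """Draw Poisson responses whose true log-mean contains a quadratic
    term, gamma * x**2, that the fitted additive model omits.  gamma
    controls the severity of the misspecification; gamma = 0 recovers
    the null model."""
    x = rng.normal(size=n)
    mu = np.exp(beta0 + beta1 * x + gamma * x**2)
    y = rng.poisson(mu)
    return x, y, mu

# Null model (gamma = 0) versus a moderate deviation (gamma = 0.15),
# with coefficients chosen so means are moderate (mostly above 1):
x0, y0, mu0 = draw_quadratic_alternative(500, 1.4, 0.2, 0.0)
x1, y1, mu1 = draw_quadratic_alternative(500, 1.4, 0.2, 0.15)
```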

Performance with larger models

We additionally assess the performance of each of the tests with larger models. The theoretical results presented in this work assume a fixed dimension, d, and consider limiting distributions as the sample size tends to infinity. Here, d is again held fixed in each simulation, but we treat it as a factor worth studying in its own right.
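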

Realizations of Y are again drawn from a Poisson distribution with a log link and d = 2, 10, 20, 30, 40, 50 parameters. We use sample sizes n=100 and 500. To keep the distribution of the fitted values approximately constant as d is varied, we set β0 = 1.67 and β1 = ⋯ = βd−1 = 0.0717/(d − 1). This gives a distribution of fitted values mostly within the interval [1, 10], ensuring that expected counts within each group used in the calculation of the Pearson statistic are sufficiently large. The SW test is omitted due to computational challenges that arise with this test with large models and because we observe that a large proportion of the data needs to be omitted when d is large and n=100 for the test statistic to be computed (see the supplementary material).
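A sketch of this design, taking the covariates as independent standard normals (the covariate distribution is an assumption of this sketch, not stated in this section), confirms that the resulting Poisson means stay within [1, 10]:

```python
import numpy as np

rng = np.random.default_rng(3)

def fitted_value_spread(n, d, rng=rng):
    """Distribution of true Poisson means under the larger-model design:
    intercept beta_0 = 1.67 and d - 1 equal slopes 0.0717 / (d - 1).
    Independent standard normal covariates are assumed here."""
    X = rng.normal(size=(n, d - 1))
    beta = np.full(d - 1, 0.0717 / (d - 1))
    return np.exp(1.67 + X @ beta)

for d in (2, 10, 50):
    mu = fitted_value_spread(500, d)
    print(d, round(float(mu.min()), 2), round(float(mu.max()), 2))
```

Because the total slope variance shrinks as d grows, the means concentrate near exp(1.67) ≈ 5.3 for every d, keeping expected group counts comfortably large.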

Simulation results

Null distribution

From the null simulation results in Table 2, we see that the estimated type I error rate for the GHL test does not significantly differ from the nominal level, since all values are in the interval (0.041, 0.059). However, the naive generalized HL test falls outside this interval in three settings, whereas the SW and SZ tests fall outside it in two and four settings, respectively. Interestingly, even with d=4 in setting 4, we begin to see a decreased type I error rate for the naive generalized HL test, a phenomenon discussed in more detail later in this section. However, with a sample size of 500, the naive generalized HL test and the SZ test generally have better empirical rejection rates, whereas the SW test shows similarly poor performance even with a larger sample size. Table 3 summarizes type I error rates for the GHL test in the presence of a dispersion parameter. A larger sample size is sometimes needed to ensure that the finite sampling distribution of the test statistic is well approximated by its limiting chi-squared distribution. In our simulation settings, the estimated type I error rate is closer to the nominal rate for the larger sample size of 500. Type I error rates for settings with discrete covariates are in Table 4. The naive generalized HL test and the GHL test hold their size for the considered sub-settings.

Table 2.

Estimated type I error rates (null setting simulation results)

Statistic/setting    1      1b     2      2b     3      3b
C^G                  0.058  0.061  0.042  0.051  0.051  0.047
XGHL2                0.053  0.056  0.049  0.052  0.052  0.050
XSW2                 0.043  0.062  0.041  0.052  0.042  0.048
XSZ2                 0.032  0.045  0.039  0.039  0.044  0.041

Statistic/setting    4      5      6
C^G                  0.037  0.055  0.059
XGHL2                0.048  0.056  0.054
XSW2                 0.051  0.048  0.053
XSZ2                 0.045  0.049

Numbers in italics represent cases where the estimated rejection rate is significantly different from 0.05 (i.e., outside of the interval (0.041, 0.059))
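The band (0.041, 0.059) is consistent with a 95% normal-approximation interval around the nominal 0.05 level. The check below reproduces it using a hypothetical replication count N; the study's actual number of Monte Carlo replications is given in its supplementary material, so N here is only an assumption:

```python
import math

# Half-width of the normal-approximation 95% interval for an estimated
# rejection rate at the nominal 0.05 level, with a hypothetical N.
N = 2250
halfwidth = 1.96 * math.sqrt(0.05 * 0.95 / N)
print(round(0.05 - halfwidth, 3), round(0.05 + halfwidth, 3))  # 0.041 0.059
```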

Table 3.

Estimated type I error rates for the GHL test in the presence of a dispersion parameter for samples of size n=100 and n=500. For the negative binomial response with n=100, approximately 3% of simulation draws were discarded due to GLM convergence warnings

Distribution n=100 n=500
Gamma 0.041 0.042
Inverse Gaussian 0.064 0.055
Negative binomial 0.042 0.053

Table 4.

Estimated type I error rates for the naive generalized HL and GHL tests when all covariates are discrete. Sample sizes n=100 and n=500 are considered, with the discrete covariate being a random sample of size n from a uniformly spaced sequence of length npoints on [-3,3]

Statistic n=100 n=500
C^G (npoints=30) 0.051 0.051
C^G (npoints=50) 0.053 0.042
XGHL2 (npoints=30) 0.052 0.052
XGHL2 (npoints=50) 0.050 0.044

Power

The power simulation results displayed in Fig. 1 show that our new test has power to detect each of the violated model assumptions we tested. However, for model flaws that are detectable by the SZ and SW tests, these two tests generally have better power than the tests based on grouped residuals. In the simulation settings we explored, our test had power to detect overdispersion while the SW and SZ tests had little or none, although these two competitors were not necessarily designed to detect overdispersion.

Fig. 1.

Power simulation results for the first three settings. Solid red lines are 95% Wilson CIs for the mean rejection rate (colour figure online)

Larger models

The null distribution of the naive generalized HL test statistic, C^G, is not well approximated by the usual χ² distribution with G−2 degrees of freedom in this setting with a finite sample size. The impact of the number of parameters on the estimated mean and level of the naive generalized HL test statistic can be seen in Table 5, where the G−2 degrees of freedom approximation for this test deteriorates as the model size grows, relative to the sample size. This adverse effect is less pronounced with a larger sample size. The estimated type I error rates steadily decrease for the naive generalized HL test from about 0.050 to 0.001 as d grows from 2 to 50 with a sample size of 100, and down to 0.030 with a sample size of 500. Our proposed test does not seem to be affected by the number of parameters present in the model. Similar results were obtained for both tests with G=50, used to ensure that G>d, as is required by the traditional HL test on which the naive GHL test is based.

Table 5.

Estimated means (top) and levels (bottom) of the naive HL and generalized HL statistics for Poisson regression models with n{100,500} and G=10

Estimated means:

Statistic  n    d=2   d=10  d=20  d=30  d=40  d=50
C^G        100  8.05  7.24  6.48  5.61  4.76  3.85
XGHL2      100  8.99  8.89  9.03  9.09  9.20  9.13
C^G        500  8.02  7.86  7.81  7.57  7.34  7.23
XGHL2      500  8.98  8.95  9.09  9.09  9.02  9.02

Estimated levels:

Statistic  n    d=2    d=10   d=20   d=30   d=40   d=50
C^G        100  0.050  0.030  0.012  0.005  0.002  0.001
XGHL2      100  0.044  0.049  0.048  0.053  0.054  0.056
C^G        500  0.049  0.047  0.045  0.036  0.030  0.030
XGHL2      500  0.050  0.043  0.056  0.054  0.051  0.054

For the naive HL and generalized HL test statistics, the means should be approximately G-2=8 and G-1=9, respectively. In all cases, the type I error rate should be α=0.05
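The conservatism of the naive test can be illustrated with a small Monte Carlo sketch: when the statistic's null mean drifts down from 8 toward 3.85 (the d=50, n=100 cell of Table 5) while the χ² critical value with G−2 degrees of freedom is retained, the rejection rate falls far below 0.05. Modelling the drifted statistic as a scaled chi-squared variable is a crude stand-in used only for illustration, not the statistic's actual null distribution:

```python
import numpy as np

rng = np.random.default_rng(11)

# Null reference: chi-squared with G - 2 = 8 degrees of freedom, mean 8.
null_draws = rng.chisquare(8, size=200_000)
crit = np.quantile(null_draws, 0.95)  # approximates the chi^2_8 0.95 quantile

# A statistic whose mean has drifted down to about 3.85, crudely
# modelled as a scaled chi-squared, tested against the same cutoff:
shrunk = null_draws * (3.85 / 8)
print(round(float((shrunk > crit).mean()), 4))  # far below the nominal 0.05
```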

Discussion

The simulation results of Sect. 5 show that the GHL test provides competitive or comparable power in various simulation settings. Our test is also computationally efficient, straightforward to implement, and works in a variety of scenarios. There is no need to choose a kernel bandwidth (although the number of groups, G, can play a role in determining the outcome of the test), and the output can be interpreted meaningfully by assessing differences between observed and expected counts in each of the groups. The naive generalization of the HL test does not work well in certain settings, but the GHL test resolves these issues.
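The grouped observed-versus-expected comparison underlying this interpretation can be sketched as follows. This is the naive grouped Pearson form; the GHL statistic uses the same grouping but standardizes the group differences using, among other things, the estimated covariance matrix of the regression coefficients. All numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

def grouped_pearson(y, fitted, G=10):
    """Hosmer-Lemeshow-style grouped Pearson statistic: sort by fitted
    value, split into G roughly equal-size groups, and sum
    (O - E)^2 / E of observed and expected totals over the groups."""
    order = np.argsort(fitted)
    stat = 0.0
    for g in np.array_split(order, G):
        O, E = y[g].sum(), fitted[g].sum()
        stat += (O - E) ** 2 / E
    return float(stat)

# Toy illustration with true Poisson means standing in for fitted values:
x = rng.normal(size=200)
mu = np.exp(1.4 + 0.3 * x)
y = rng.poisson(mu)
print(grouped_pearson(y, mu))
```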

Supplementary Information

Below is the link to the electronic supplementary material.

Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). N.S. acknowledges the support of a Vanier Canada Graduate Scholarship.

Declarations

Conflict of interest

The authors declare no conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Agresti A. An introduction to categorical data analysis. New York: Wiley; 1996.
  2. Bilder CR, Loughin TM. Analysis of categorical data with R. Boston: Chapman and Hall/CRC; 2014.
  3. Blizzard L, Hosmer DW. Parameter estimation and goodness-of-fit in log binomial regression. Biom J. 2006;48(1):5–22. doi: 10.1002/bimj.200410165.
  4. Canary JD. Grouped goodness-of-fit tests for binary regression models. PhD thesis, University of Tasmania; 2013.
  5. Canary JD, Blizzard L, Barry RP, Hosmer DW, Quinn SJ. Summary goodness-of-fit statistics for binary generalized linear models with noncanonical link functions. Biom J. 2016;58(3):674–690. doi: 10.1002/bimj.201400079.
  6. Cheng KF, Wu JW. Testing goodness of fit for a parametric family of link functions. J Am Stat Assoc. 1994;89(426):657–664. doi: 10.1080/01621459.1994.10476790.
  7. Christensen R, Lin Y. Lack-of-fit tests based on partial sums of residuals. Commun Stat Theory Methods. 2015;44(13):2862–2880. doi: 10.1080/03610926.2013.844256.
  8. Fagerland MW, Hosmer DW. A goodness-of-fit test for the proportional odds regression model. Stat Med. 2013;32(13):2235–2249. doi: 10.1002/sim.5645.
  9. Fagerland MW, Hosmer DW. Tests for goodness of fit in ordinal logistic regression models. J Stat Comput Simul. 2016;86(17):3398–3418. doi: 10.1080/00949655.2016.1156682.
  10. Fagerland MW, Hosmer DW, Bofin AM. Multinomial goodness-of-fit tests for logistic regression models. Stat Med. 2008;27(21):4238–4253. doi: 10.1002/sim.3202.
  11. González-Manteiga W, Crujeiras RM. An updated review of goodness-of-fit tests for regression models. TEST. 2013;22(3):361–411. doi: 10.1007/s11749-013-0327-5.
  12. Halteman WA. A goodness of fit test for binary logistic regression. Unpublished doctoral dissertation, Department of Biostatistics, University of Washington, Seattle, WA; 1980.
  13. Hosmer DW, Hjort NL. Goodness-of-fit processes for logistic regression: simulation results. Stat Med. 2002;21(18):2723–2738. doi: 10.1002/sim.1200.
  14. Hosmer DW, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Commun Stat Theory Methods. 1980;9(10):1043–1069. doi: 10.1080/03610928008827941.
  15. Lin DY, Wei LJ, Ying Z. Model-checking techniques based on cumulative residuals. Biometrics. 2002;58(1):1–12. doi: 10.1111/j.0006-341X.2002.00001.x.
  16. Liu A, Meiring W, Wang Y. Testing generalized linear models using smoothing spline methods. Stat Sin. 2004;15:235–256.
  17. Moore DS, Spruill MC. Unified large-sample theory of general chi-squared statistics for tests of fit. Ann Stat. 1975;3:599–616. doi: 10.1214/aos/1176343125.
  18. Pulkstenis E, Robinson TJ. Two goodness-of-fit tests for logistic regression models with continuous covariates. Stat Med. 2002;21(1):79–93. doi: 10.1002/sim.943.
  19. Quinn SJ, Hosmer DW, Blizzard CL. Goodness-of-fit statistics for log-link regression models. J Stat Comput Simul. 2015;85(12):2533–2545. doi: 10.1080/00949655.2014.940953.
  20. Rodríguez-Campos MC, González-Manteiga W, Cao R. Testing the hypothesis of a generalized linear regression model using nonparametric regression estimation. J Stat Plan Inference. 1998;67(1):99–122. doi: 10.1016/S0378-3758(97)00098-0.
  21. Stute W, Zhu L-X. Model checks for generalized linear models. Scand J Stat. 2002;29(3):535–545. doi: 10.1111/1467-9469.00304.
  22. Su JQ, Wei LJ. A lack-of-fit test for the mean function in a generalized linear model. J Am Stat Assoc. 1991;86(414):420–426. doi: 10.1080/01621459.1991.10475059.
  23. Surjanovic N, Loughin TM. Improving the Hosmer–Lemeshow goodness-of-fit test in large models with replicated trials. arXiv preprint arXiv:2102.12698; 2021.
  24. Tsiatis AA. A note on a goodness-of-fit test for the logistic regression model. Biometrika. 1980;67(1):250–251. doi: 10.1093/biomet/67.1.250.
  25. White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50(1):1–25. doi: 10.2307/1912526.
  26. Xiang D, Wahba G. Testing the generalized linear model null hypothesis versus ‘smooth’ alternatives. Technical Report 953, Department of Statistics, University of Wisconsin; 1995.

Articles from Test (Madrid, Spain) are provided here courtesy of Springer
