Quantifying the Impact of Unobserved Heterogeneity on Inference from the Logistic Model

Salma Ayis

doi:10.1080/03610920802491782

2009 Jun 11;38(13):2164–2177. doi: 10.1080/03610920802491782

PMCID: PMC4453966 PMID: 26085712

Abstract

While the consequences of unobserved heterogeneity, such as biased estimates in binary response regression models, are generally known, quantification of these consequences and awareness of the situations in which the impact on inference is more serious are remarkably lacking. This study examines the effect of unobserved heterogeneity on estimates of the standard logistic model. An estimate of bias was derived for the maximum likelihood estimator $\hat{β}$, and simulated data were used to investigate a range of situations that influence the size of bias due to unobserved heterogeneity. It was found that the position of the probabilities along the logistic curve and the variance of the unobserved heterogeneity were important determinants of the size of bias.

Keywords: Biased estimate, Logistic model, Unobserved heterogeneity

1. Introduction

Theoretical models, such as health models, generally conceptualize outcomes as the result of interaction among a complex set of components, including biological, genetic, behavioral, and socio-economic factors (Mosley and Chen, 2003; World Health Organization, 2001). In practical data analysis, however, it is not possible to account for all variables that lead to an outcome by including them as explanatory variables in a statistical model. Even in the richest model specification, several factors will be unobserved, immeasurable, or unknown, and some of these will be of high importance to the resulting outcome (Lee and Lee, 2003; Zohoori and Savitz, 1997). Nonetheless, it is not uncommon in many specialized journals to find conclusions that were reached on the basis of inference from observed variables, assuming that unobserved heterogeneity is of little relevance. Economists were puzzled for nearly two decades by the spurious positive association between drinking alcohol (medically known as a drug with depressant properties that is unlikely to affect productivity positively) and high wages: drinkers were persistently found to earn more than alcohol abstainers. The association was reversed by the introduction of an individual-specific fixed effect, which rid the results of bias due to time-invariant unobserved heterogeneity (Peters, 2004).

Unobserved factors, whether environmental or personal, may have a large effect on outcomes of relevance (for example, health, economic, or social outcomes). Ignoring such effects may lead to the identification of incorrect risk factors; in addition, the estimated magnitude of the associations may be so seriously biased that the conclusions are misleading.

The objectives of this study were to quantify the bias of the maximum likelihood (ML) estimate of a standard logistic model due to un-modeled unobserved heterogeneity, and to highlight situations where the impact of such bias is more serious. Using Taylor series theory, an approximate estimate of bias was derived; simulation was then used to investigate situations that were thought to affect the size of bias. The importance of higher-order terms of the derived approximation was also examined under various situations, including different variances of unobserved heterogeneity and differing positions of the outcome probabilities along the logistic curve. The estimation described was confined to the case of a single binary explanatory variable x, a binary response y, and unobserved heterogeneity that was linked to each individual.

It was found that in most situations the first-order approximation, defined as δ₁, provides an adequate approximation of the bias due to unobserved heterogeneity. In special situations with large variance and a large difference between the two probabilities, however, the first-order approximation becomes inadequate, and it performs worse as the difference between the two probabilities increases.

2. Methods

2.1. Assumptions and Models

We consider a simple hypothetical example with lung cancer as the outcome and smoking as the only explanatory variable. It was also assumed that there are other factors that are likely to cause lung cancer but that are unobserved, difficult to measure, or totally unknown. These may include genetic factors, personal differences in diet, childhood exposures, lifestyle, or other individual-specific factors. In a statistical model, the relationship may be expressed as:

y = f (x, ε),

2.0

where y was the outcome, x was the observed binary explanatory variable, and ϵ was the unobserved variable or variables. The conditional expectation of y given x and ϵ, may be written as:

E (y | x, ε) = h (β_{0} + β_{1} x + ε) = h (η) .

2.1

Under the linear model assumptions, the estimates of β₁ will be unbiased whether or not ϵ is included in the model. In other situations, where nonlinear models such as the logistic regression model (Agresti, 2002) are used, the effect of omitting ϵ on the estimates is different. The situation with a missing binary variable has been described before (Gail et al., 1984). Here, we consider the situation where the omitted variable is continuous, which is quite common in many research areas, such as health. The probability of a positive outcome (response) in our example with one explanatory variable x may be written as:

p_{1} (x) = p (Y = 1 | X = x)

2.2

and the logit as:

logit [p_{1} (x)] = log (\frac{p_{1} (x)}{1 - p_{1} (x)}) = x^{'} β .

2.3

The maximum likelihood (ML) estimate equation for such a model may be written as:

\hat{β} = log n_{1} {\hat{p}}_{1} - log n_{1} (1 - {\hat{p}}_{1}) - log n_{0} {\hat{p}}_{0} + log n_{0} (1 - {\hat{p}}_{0}),

2.4

where n₁ and n₀ were the numbers of subjects exposed (X = 1) and unexposed (X = 0), respectively, and the conditional probabilities, ${\hat{p}}_{1} = p (Y = 1 | X = 1)$ and ${\hat{p}}_{0} = p (Y = 1 | X = 0)$ , are the probabilities of positive response, among the exposed and unexposed subjects, respectively.
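As a minimal sketch of Eq. (2.4), the estimate can be computed directly from the four cell counts of the 2×2 table, since with a single binary explanatory variable it reduces to the log odds ratio. The function name, argument names, and the example counts below are hypothetical and chosen for illustration only; all four cells are assumed to be non-zero.

```python
import math


def log_odds_ratio(y1_pos, n1, y0_pos, n0):
    """Evaluate Eq. (2.4) for a single binary explanatory variable.

    y1_pos, y0_pos: numbers of positive responses among the exposed
    and unexposed subjects; n1, n0: the corresponding group sizes.
    Every cell of the 2x2 table is assumed to be non-zero.
    """
    p1_hat = y1_pos / n1
    p0_hat = y0_pos / n0
    return (math.log(n1 * p1_hat) - math.log(n1 * (1 - p1_hat))
            - math.log(n0 * p0_hat) + math.log(n0 * (1 - p0_hat)))


# Hypothetical counts: 365 of 500 exposed and 250 of 500 unexposed respond.
beta_hat = log_odds_ratio(y1_pos=365, n1=500, y0_pos=250, n0=500)
```

The group sizes n₁ and n₀ cancel within each pair of terms, so the result equals the log of the ratio of the two sample odds.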

2.2. Estimation of Bias, Using Taylor Series Expansion: The Uncorrelated Case

We consider the situation where the sampled observations are independent and identically distributed (IID), the expected value of y, E(Y), exists, X is a binary variable (0/1), and ϵ is a random variable that is unobserved but has an influence on the response y. The unobserved term was assumed to follow a normal distribution with mean zero and variance $σ_{ε}^{2}$. The maximum likelihood estimator $\hat{β}$ for the logistic model with one explanatory variable x was first obtained ignoring unobserved heterogeneity; the effect of the unobserved heterogeneity ϵ on the response was then brought in, and the expected value of the ML estimator was evaluated again. A Taylor series expansion was used, and the two estimates were compared (ignoring unobserved heterogeneity versus including the influence of the term on the response y). The derivation showed that the ML estimator is biased due to un-modeled unobserved heterogeneity, and that the bias may be approximated by a first-order term δ₁ of the Taylor series as follows:

δ_{1} = {\frac{g_{1} (p_{1}) - p_{1}}{p_{1} (1 - p_{1})}} - {\frac{g_{1} (p_{0}) - p_{0}}{p_{0} (1 - p_{0})}} .

2.5

The bias may also be approximated by the first- and second-order terms, δ₁ + δ₂, or by the first-, second-, and third-order terms, δ₁ + δ₂ + δ₃. The terms δ_i for i = 1, 2, and 3, the functions g₁(p₁) and g₁(p₀), and the derivation steps are fully described in Appendix A.

2.3. Simulation

Simulation was used to examine situations that were thought to affect the severity of bias. Data was simulated from a logistic model with a response y, an explanatory variable x, and an extra term representing unobserved heterogeneity. A Fortran program was written to generate data from a logit function of the form:

log {\frac{p_{j}}{1 - p_{j}}} = β_{0} + {x^{'}}_{j} β_{1} + ε_{j},

2.6

where x_j is the binary explanatory variable for subject j = 1, 2,…, n, p_j = p(y_j = 1 | x_j) is the conditional probability, β₀ and β₁ are the logistic regression parameters, and ϵ_j is an extra term representing unobserved heterogeneity, assumed to be a normally distributed random variable with zero mean and variance $σ_{ε}^{2}$. Several parameter values were considered so that they covered a range of probabilities, including situations where one of the probabilities was 0.5, where the two probabilities lay on one side of 0.5, and where they lay further apart on the two sides of 0.5. We also considered a range of values for the variance of the unobserved heterogeneity term. The standard logistic model was then fitted to predict the response y, using x as the only explanatory variable, in order to detect the effect of ignoring unobserved heterogeneity on the estimated parameter $\hat{β}$. Estimates were calculated from 100 simulations, each based on a sample of 1,000 independent and identically distributed (IID) individuals.
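The simulation design described above can be sketched in a few lines of Python. This is not the Fortran program used in the study. It is an illustrative re-implementation under stated assumptions: a balanced design (half of each sample has x = 1), a fixed random seed, and the closed-form log odds ratio of Eq. (2.4) in place of an iterative logistic fit, which is equivalent for a single binary explanatory variable. The function and argument names are hypothetical.

```python
import math
import random


def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def simulate_bias(p0, p1, var_eps, n=1000, n_sims=100, seed=0):
    """Average of (beta_hat - beta_1) over n_sims simulated samples.

    p0, p1 are the outcome probabilities at eps = 0 for x = 0 and x = 1,
    so beta_0 = logit(p0) and beta_1 = logit(p1) - logit(p0).
    """
    rng = random.Random(seed)
    beta0 = logit(p0)
    beta1 = logit(p1) - beta0
    sd = math.sqrt(var_eps)
    estimates = []
    for _ in range(n_sims):
        pos = [0, 0]  # positive responses for x = 0 and x = 1
        tot = [0, 0]  # group sizes for x = 0 and x = 1
        for j in range(n):
            x = j % 2  # balanced design: an assumption of this sketch
            eps = rng.gauss(0.0, sd)  # unobserved heterogeneity, Eq. (2.6)
            y = int(rng.random() < expit(beta0 + beta1 * x + eps))
            tot[x] += 1
            pos[x] += y
        # Fit the standard logistic model ignoring eps (log odds ratio).
        # Raises ValueError if a cell is empty, which is unlikely here.
        est = (math.log(pos[1]) - math.log(tot[1] - pos[1])
               - math.log(pos[0]) + math.log(tot[0] - pos[0]))
        estimates.append(est)
    return sum(estimates) / n_sims - beta1


bias_large_var = simulate_bias(p0=0.5, p1=0.88, var_eps=1.0)
bias_small_var = simulate_bias(p0=0.5, p1=0.88, var_eps=0.01)
```

Because the seed and design differ from the original program, the numbers produced will not match Table 2 exactly; the sketch is only meant to show the structure of the experiment.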

3. Results

3.1. Theoretical Bias: The Effect of Position of the Two Probabilities

We examined the size of bias with a focus on the importance of the first-, second-, and third-order terms in the overall bias approximation. Table 1 illustrates the contribution of each term under varying situations. Different positions of the two probabilities were examined, while the variance of unobserved heterogeneity was kept fixed at 1.0. The positions considered include: (i) the two probabilities lie on one side of 0.5 on the logistic curve; (ii) one of the probabilities is 0.5; and (iii) the two probabilities lie on opposite sides of 0.5.

Table 1.

Theoretical contribution of first-, second-, and third-order terms to the approximation of bias due to unobserved heterogeneity, with variance $σ_{ε}^{2}$ = 1.0


p₀, p₁	δ₁	δ₂	δ₃
0.73, 0.82	−0.08	0.02	0.0
0.88, 0.99	−0.28	0.13	−0.03
0.73, 0.88	−0.15	0.04	−0.01
0.50, 0.38	−0.08	0.0	0.0
0.50, 0.73	−0.18	0.0	0.0
0.50, 0.82	−0.26	0.02	−0.01
0.50, 0.92	−0.39	0.09	−0.02
0.38, 0.62	−0.20	0.0	0.0
0.27, 0.73	−0.36	0.01	0.0
0.12, 0.88	−0.67	0.10	−0.03


For all three situations, the first-order term δ₁ was dominant, the second-order term was of less importance, and the third-order term made a very small contribution that may be safely ignored in almost all situations. There are, however, special situations where the second-order term becomes more important, for example, as one probability approaches 0.9.
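To make the entries of Table 1 concrete, the row (p₀, p₁) = (0.50, 0.73) can be worked by hand from Eq. (2.5) together with the expansion of g₁ in Eq. (A.23) of Appendix A. The shorthand u(p) below is introduced here for illustration only and does not appear in the original derivation; the arithmetic is rounded to three decimals.

```latex
% Shorthand (an assumption of this example): u(p) = (g_1(p) - p) / (p (1 - p))
u(p) = \frac{σ_{ε}^{2}}{2} (1 - 2p)
     + \frac{3 σ_{ε}^{4}}{4!} (1 - 14p + 36p^{2} - 24p^{3})
     + \frac{15 σ_{ε}^{6}}{6!} (1 - 62p + 540p^{2} - 1560p^{3} + 1800p^{4} - 720p^{5})

% At p_0 = 0.50 each of the three polynomials is zero, so
u(0.50) = 0

% At p_1 = 0.73 with σ_ε^2 = 1
u(0.73) \approx -0.230 + 0.078 - 0.030 = -0.182

% First-order bias term, Eq. (2.5)
δ_{1} = u(0.73) - u(0.50) \approx -0.18
```

The result agrees with the tabulated value of −0.18 for this row.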

Table 2.

A comparison between the theoretical bias and the numerical (simulation) bias of the ML estimator $\hat{β}$ of the logistic model, for various probabilities and for a range of variances (0.01–1.0) of the unobserved heterogeneity term ϵ


Variance	Theoretical bias				Simulation bias
$σ_{ε}^{2}$	δ₁	δ₂	δ₃	Total	Bias ($\hat{β}$ − β)	95% C.I.	Rej%	(p₀, p₁)
1.0	−0.12	0.0	0.0	−0.12	−0.12	−0.15, −0.09	13	0.62, 0.73
0.75	−0.07	0.01	0.0	−0.06	−0.07	−0.10, −0.04	12	0.62, 0.73
0.5	−0.04	0.0	0.0	−0.05	−0.05	−0.06, 0.0	7	0.62, 0.73
0.1	−0.01	0.0	0.0	−0.01	−0.01	−0.04, 0.02	6	0.62, 0.73
0.01	0.0	0.0	0.0	0.0	0.01	−0.03, 0.03	3	0.62, 0.73
1.0	−0.39	0.11	−0.03	−0.31	−0.33	−0.36, −0.30	36	0.62, 0.95
0.75	−0.29	0.06	−0.01	−0.24	−0.23	−0.27, −0.19	17	0.62, 0.95
0.5	−0.19	0.02	0.0	−0.17	−0.15	−0.19, −0.11	15	0.62, 0.95
0.1	−0.03	0.0	0.0	−0.03	−0.04	−0.09, 0.01	6	0.62, 0.95
0.01	0.0	0.0	0.0	0.0	0.05	0.0, 0.10	4	0.62, 0.95
1.0	−0.09	0.0	0.0	−0.10	−0.08	−0.10, −0.06	12	0.5, 0.62
0.75	−0.07	0.0	0.0	−0.07	−0.06	−0.08, −0.04	8	0.5, 0.62
0.5	−0.05	0.0	0.0	−0.05	−0.05	−0.07, −0.03	5	0.5, 0.62
0.1	−0.01	0.0	0.0	−0.01	0.0	−0.03, 0.03	4	0.5, 0.62
0.01	0.0	0.0	0.0	0.0	0.01	−0.02, 0.02	2	0.5, 0.62
1.0	−0.34	0.05	−0.01	−0.30	−0.29	−0.32, −0.26	49	0.5, 0.88
0.75	−0.27	0.03	−0.01	−0.24	−0.23	−0.27, −0.21	27	0.5, 0.88
0.5	−0.18	0.01	0.0	−0.17	−0.16	−0.19, −0.13	14	0.5, 0.88
0.1	−0.04	0.0	0.0	−0.04	−0.03	−0.06, 0.0	4	0.5, 0.88
0.01	0.0	0.0	0.0	0.0	−0.01	−0.02, 0.02	3	0.5, 0.88
1.0	−0.48	0.11	−0.03	−0.40	−0.39	−0.43, −0.35	49	0.5, 0.95
0.75	−0.36	0.04	−0.01	−0.32	−0.31	−0.35, −0.27	32	0.5, 0.95
0.5	−0.25	0.04	0.0	−0.21	−0.20	−0.24, −0.16	21	0.5, 0.95
0.1	−0.05	0.0	0.0	−0.04	−0.03	−0.08, 0.02	7	0.5, 0.95
0.01	0.0	0.0	0.0	0.0	0.06	0.01, 0.11	5	0.5, 0.95
1.0	−0.20	0.0	0.0	−0.20	−0.17	−0.19, −0.15	20	0.38, 0.62
0.75	−0.15	0.0	0.0	−0.15	−0.14	−0.17, −0.11	18	0.38, 0.62
0.5	−0.10	0.0	0.0	−0.10	−0.10	−0.13, −0.07	13	0.38, 0.62
0.1	−0.02	0.0	0.0	−0.02	−0.03	−0.06, 0.0	10	0.38, 0.62
0.01	0.0	0.0	0.0	0.0	0.03	−0.01, 0.04	3	0.38, 0.62


Note: 95% C.I.: lower and upper limits of the 95% confidence interval for the simulated bias.

Rej%: the percentage of simulations in which the true parameter β lay outside the 95% confidence interval of the simulated estimator $\hat{β}$.

3.2. Simulation Results: Theoretical and Numerical Bias

Table 2 reports the average estimates from 100 simulations, each based on a sample of size 1,000. The table covers various positions of the two probabilities and a range of variances for unobserved heterogeneity (1.0 to 0.01), including the special case where unobserved heterogeneity was almost constant ($σ_{ε}^{2}$ = 0.01). Three situations regarding the position of the two probabilities were covered, as described earlier. A comparison was then drawn between the theoretical bias and the numerical bias, which was calculated from the simulation. The 95% confidence intervals for the simulated bias were reported, and the number of times the confidence interval of the estimate $\hat{β}$ failed to cover the true parameter β was reported as the percentage rejected (Rej%). The main findings may be summarized as follows:

The bias was well approximated by the first-order term in most of the situations considered.
The bias becomes more serious as the difference between the two probabilities gets larger, especially as one probability gets closer to 0.9 (the situation with one probability approaching 0.1 is identical), or where the two probabilities lie further apart on the two sides of the logistic curve.
The bias was more serious for the relatively large variances of unobserved heterogeneity, 1.0 and 0.75, within the range of values considered.
Contributions from the second-order term to the bias approximation become of some importance in the special situation where the difference between the two probabilities is large and the variance of unobserved heterogeneity is relatively large.
For small variance of unobserved heterogeneity, the bias was small and the percentage of rejections was modest.
For the special case of unobserved heterogeneity with variance = 0.01, the bias either disappeared or became negligible, and the percentage of confidence intervals for $\hat{β}$ that failed to cover β was particularly small, less than 5% in most cases.

4. Discussion

Much attention in the 1980s and after was given to the asymptotic bias of the maximum likelihood (ML) estimators of binary response regression models, which are widely used to describe associations between a binary outcome and explanatory variables in trials and surveys. In general, the asymptotic properties of ML estimators may not hold in small, finite samples, as shown by Anderson and Richardson (1979), who used simulation to investigate the bias of logistic model estimates. That study found that the bias can be substantial if the sample size is small, and a formula for correction was developed. A similar study (Griffiths et al., 1987), which examined the bias and other sample properties such as mean square error based on three alternative covariance matrix estimators for the probit model, reached the same conclusion with regard to the bias. A simpler formula, using Taylor series expansion, for the correction of bias in the logistic ML estimate was also developed by Copas (1988). For exponential family models such as the logistic model, the bias is of order O(1/n), suggesting that for large samples it is negligible relative to the standard errors of the estimates; the bias has been treated using Jeffreys' prior, as reported in McCullagh and Nelder (1989) and Firth (1992). A set of GLIM macros was developed to reduce bias (Firth, 1993; Steyerberg and Eijkemans, 2004), but the reduction achieved was reported to be small.

Another issue of importance to ML estimation is the deviation of the data from the assumption of independent and identically distributed observations. This was addressed in survey methodology, and procedures for the correction of estimates were developed (Skinner and Smith, 1989).

Of no less importance is the problem of unobserved heterogeneity. Although awareness of the problem has recently increased (Aprahamian et al., 2007; Arana and Leon, 2006; Cramer, 2007), many influential articles continue to report important findings while ignoring the possible impact of unobserved heterogeneity on these findings. In practice, in almost any biological investigation, there are factors (exogenous or endogenous, independent of a biological process or part of it, time varying or time invariant, personal or contextual) that will be unobserved. Using nonlinear models in such situations leads to biased estimates of population parameters.

Here we present the case in which unobserved heterogeneity is linked to individuals (for example, taste, charisma, emotions). Another scenario is where unobserved heterogeneity is correlated, which may occur with repeated measurements within an individual or with clusters such as households (a family-related gene, for example). More details on the correlated case were reported in an earlier study (Ayis, 1995), where it was shown that the leading term of the bias approximation for the correlated case is the same as that for the uncorrelated case. An extra term due to replications, however, becomes important if the number of clusters is small and where the two probabilities lie further apart on the logistic curve.

While there are situations where estimates from the logistic model may be fairly robust to unobserved heterogeneity, there are others where the problem deserves more attention. For situations with outcomes such as fertility or incidence of disease, where all of the probabilities are on the same side of 0.5, the potential for bias is there, but it is perhaps not as serious as where extreme probabilities occur in both tails, for example (p₀ ≤ 0.2, p₁ ≥ 0.8). The work by Copas (1988) is also relevant to the latter situation of extreme probabilities, although the assumption there was that extreme values occur due to mis-recording, that is, where the values of the response y are transposed in error between 0 and 1, rather than due to the nature of the association between the response and the explanatory variable that we present. Monte Carlo simulation was used to examine the sensitivity of different binary response models to such extreme values of probabilities, a model was proposed to allow for robust estimation where a small number of outcomes are mis-recorded, and techniques for diagnostics were developed. For situations like the one we present in this study, extreme values will be more common, but detection of such values may help in assessing the quality of the estimates and possibly in deciding whether to use alternative methods in situations where the bias is serious.

The misspecification bias of the logistic model due to unobserved heterogeneity described in this study is similar to the misspecification bias due to incorrectly assuming that the error term is logistically distributed when it is not, as described in Horowitz (1993), where the effect on estimates was measured using simulation. The bias was found to be small as long as the assumed error distribution has the same qualitative shape as the true distribution (unimodal for the logistic case considered) and more serious when the true distribution of the error term was bimodal or heteroskedastic. These findings suggest the need to explore the effect of other forms of distribution on the bias derived in this study. Similar findings were shown by Arana and Leon (2005), who used Monte Carlo simulation to compare the performance of a Bayesian mixture of normal distributions (a semiparametric model) with other parametric models (including the logit) and nonparametric models, using alternative assumptions for the error distribution and different sample sizes, when the data exhibit unobserved heterogeneity. The mean square error (MSE) in all models and for all sample sizes was found to be considerably large, reflecting the difficulty in modeling this type of data. The Bayesian specification, however, performed better than the competing models for small as well as for large samples and substantially reduced the bias and the MSE; the improvements in bias were larger for small samples.

Misspecification bias due to unobserved heterogeneity can seriously affect estimates from the logistic model as well as other binary regression models. Using panel data and suitable random effect models that allow for individual fixed effect adjustment is one solution, but obtaining such data may not always be easy. Attempting other semiparametric models, such as the Bayesian normal mixture model, seems to be an appealing approach, especially as suitable software is rapidly developing. Further work is needed to examine the impact on estimates in real situations where there are several, often correlated, explanatory variables. Methods of detection may also be developed to assess the need for alternative, more flexible models in situations where unobserved heterogeneity is likely to have a more serious impact on the estimates because of the data structure, before drawing inference that might be misleading.

Acknowledgments

I am grateful to Professor D. Holt for the initiatives, advice, and suggestions he gave to the theoretical and simulation investigations. Many thanks to Dr. Marie South and Dr. Peter Egger for considerable help with the FORTRAN programs used.

Appendix A

Consider the ML estimator $\hat{β}$ for a logistic model with binary response y and a single binary explanatory variable x,

\hat{β} = log n_{1} {\hat{p}}_{1} - log n_{1} (1 - {\hat{p}}_{1}) - log n_{0} {\hat{p}}_{0} + log n_{0} (1 - {\hat{p}}_{0}) .

A.1

We may rewrite the first term of the right-hand side (R.H.S) of Eq. (A.1) as follows:

log n_{1} {\hat{p}}_{1} = log \left[ n_{1} p_{1} \left\{ 1 + \frac{{\hat{p}}_{1} - p_{1}}{p_{1}} \right\} \right] .

A.2

Using Taylor series expansion, we rewrite Eq. (A.2) as follows:

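log n_{1} {\hat{p}}_{1} = log n_{1} p_{1} + \underbrace{\frac{{\hat{p}}_{1} - p_{1}}{p_{1}}}_{I} - \frac{1}{2} \underbrace{\left( \frac{{\hat{p}}_{1} - p_{1}}{p_{1}} \right)^{2}}_{II} + \frac{1}{3} \underbrace{\left( \frac{{\hat{p}}_{1} - p_{1}}{p_{1}} \right)^{3}}_{III} + H . O . T .

A.3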

The symbols I, II, and III were introduced to simplify reference to the original, more complex terms (H.O.T. stands for higher-order terms). Each term will be manipulated separately by algebraic procedures. We first consider term I and rewrite it in a form that involves the response y:

{\frac{{\hat{p}}_{1} - p_{1}}{p_{1}}} = \frac{1}{n_{1}} Σ {\frac{y_{1 j} - p_{1}}{p_{1}}} = \frac{1}{n_{1} p_{1}} Σ y_{1 j} - 1.

A.4

To appreciate the effect of unobserved heterogeneity, defined by the term ϵ, we consider the conditional expectation of the response y_1j given ϵ, evaluate the expectation of each of the three terms I, II, and III, and then collect terms with common powers of σ_ϵ. Some preliminary results are needed for the expectation, and these are described in this section. We first consider the expectation of the response y, given unobserved heterogeneity ϵ,

E (y_{1 j}) = E_{1} {E_{2} (y_{1 j} | ε)},

A.5

where E₁ is the expectation over ϵ and E₂ is the expectation given ϵ:

E_{2} (y_{1 j} | ε) = {\frac{p_{1}}{1 - p_{1}} e^{ε}} / {1 + \frac{p_{1}}{1 - p_{1}} e^{ε}} .

A.6

We introduce a function f(ϵ), such that

f (ε) = {\frac{p_{1}}{1 - p_{1}} e^{ε}} / {1 + \frac{p_{1}}{1 - p_{1}} e^{ε}} .

A.7

Taylor's theorem was then used to expand the function about zero; the expansion may be written as:

f (ε) = f (0) + f^{(1)} (0) ε + f^{(2)} (0) \frac{ε^{2}}{2!} + f^{(3)} (0) \frac{ε^{3}}{3!} + \dots + f^{(n - 1)} (0) \frac{ε^{n - 1}}{(n - 1)!} + H . O . T,

A.8

where f⁽¹⁾, f⁽²⁾, …, f⁽ⁿ⁻¹⁾ are the first, second, …, (n − 1)th derivatives of the function, and the derivatives of f(ϵ) are as follows:

f^{(1)} (ε) = {\frac{p_{1}}{1 - p_{1}} e^{ε}} / {\left\{ 1 + \frac{p_{1}}{1 - p_{1}} e^{ε} \right\}}^{2}

A.9

f^{(2)} (ε) = \frac{{1 + \frac{p_{1}}{1 - p_{1}} e^{ε}}^{2} {\frac{p_{1}}{1 - p_{1}} e^{ε}} - 2 {\frac{p_{1}}{1 - p_{1}} e^{ε}}^{2} {1 + \frac{p_{1}}{1 - p_{1}} e^{ε}}}{{1 + \frac{p_{1}}{1 - p_{1}} e^{ε}}^{4}}

A.10

f^{(3)} (ε) = \frac{{\frac{p_{1}}{1 - p_{1}} e^{ε}} [1 - 4 {\frac{p_{1}}{1 - p_{1}} e^{ε}} + {\frac{p_{1}}{1 - p_{1}} e^{ε}}^{2}]}{{1 + \frac{p_{1}}{1 - p_{1}} e^{ε}}^{4}} .

A.11

To simplify notation, define $q = {\frac{p_{1}}{1 - p_{1}} e^{ε}}$ ; accordingly, f⁽³⁾(ϵ) may be written as:

f^{(3)} (ε) = \frac{q [1 - 4 q + q^{2}]}{{(1 + q)}^{4}},

A.12

and the fourth, fifth, and sixth derivatives as:

\begin{matrix} f^{(4)} (ε) = \frac{q {{(1 + q)}^{4} (1 - 8 q + 3 q^{2}) - 4 q (1 - 4 q + q^{2}) {(1 + q)}^{3}}}{{(1 + q)}^{8}} \\ = \frac{q (1 - 11 q + 11 q^{2} - q^{3})}{{(1 + q)}^{5}} \end{matrix}

A.13

\begin{matrix} \begin{matrix} f^{(5)} (ε) = \frac{q {{(1 + q)}^{5} (1 - 22 q + 33 q^{2} - 4 q^{3}) - 5 q (1 - 11 q + 11 q^{2} - q^{3}) {(1 + q)}^{4}}}{{(1 + q)}^{10}} \\ = \frac{q {(1 - 26 q + 66 q^{2} - 26 q^{3} + q^{4})}}{{(1 + q)}^{6}} \end{matrix} \end{matrix}

A.14

\begin{matrix} f^{(6)} (ε) = \frac{q \left\{ {(1 + q)}^{6} (1 - 52 q + 198 q^{2} - 104 q^{3} + 5 q^{4}) - 6 q (1 - 26 q + 66 q^{2} - 26 q^{3} + q^{4}) {(1 + q)}^{5} \right\}}{{(1 + q)}^{12}} \\ = \frac{q (1 - 57 q + 302 q^{2} - 302 q^{3} + 57 q^{4} - q^{5})}{{(1 + q)}^{7}} \end{matrix}

A.15

Accordingly, the function and its derivatives evaluated at zero are:

f (0) = p_{1}

A.16

f^{(1)} (0) = p_{1} (1 - p_{1})

A.17

f^{(2)} (0) = p_{1} (1 - p_{1}) (1 - 2 p_{1})

A.18

f^{(3)} (0) = p_{1} (1 - p_{1}) (1 - 6 p_{1} + 6 p_{1}^{2})

A.19

f^{(4)} (0) = p_{1} (1 - p_{1}) (1 - 14 p_{1} + 36 p_{1}^{2} - 24 p_{1}^{3})

A.20

f^{(5)} (0) = p_{1} (1 - p_{1}) (1 - 30 p_{1} + 150 p_{1}^{2} - 240 p_{1}^{3} + 120 p_{1}^{4})

A.21

f^{(6)} (0) = p_{1} (1 - p_{1}) (1 - 62 p_{1} + 540 p_{1}^{2} - 1560 p_{1}^{3} + 1800 p_{1}^{4} - 720 p_{1}^{5})

A.22
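The values in Eqs. (A.17) to (A.22) can be checked mechanically. The sketch below relies on one observation that is not spelled out in the text: since f is a logistic function of ϵ, its first derivative equals f(1 − f), so each higher derivative is a polynomial in f obtained by differentiating the previous polynomial and multiplying by f(1 − f). At ϵ = 0, f equals p₁. Polynomials are stored as lists of integer coefficients in ascending powers of p₁; the helper names are hypothetical.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as ascending coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def poly_deriv(a):
    """Differentiate a polynomial given as an ascending coefficient list."""
    return [k * a[k] for k in range(1, len(a))] or [0]


P_ONE_MINUS_P = [0, 1, -1]  # p (1 - p)


def logistic_derivative_polys(order):
    """Polynomials in p for f(0), f'(0), ..., up to the given order."""
    polys = [[0, 1]]  # f(0) = p
    for _ in range(order):
        # Chain rule: d/d(eps) P(f) = P'(f) * f (1 - f)
        polys.append(poly_mul(poly_deriv(polys[-1]), P_ONE_MINUS_P))
    return polys


# Bracketed factors multiplying p1 (1 - p1) in Eqs. (A.17)-(A.22).
claimed_factors = {
    1: [1],
    2: [1, -2],
    3: [1, -6, 6],
    4: [1, -14, 36, -24],
    5: [1, -30, 150, -240, 120],
    6: [1, -62, 540, -1560, 1800, -720],
}

polys = logistic_derivative_polys(6)
checks = {k: polys[k] == poly_mul(P_ONE_MINUS_P, factor)
          for k, factor in claimed_factors.items()}
```

Each entry of `checks` compares the recursively generated polynomial with p₁(1 − p₁) times the factor stated in the corresponding equation.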

Based on the normality assumption for ϵ, the odd moments are equal to zero, and the even moments are $E (ε^{2}) = σ_{ε}^{2}, E (ε^{4}) = 3 σ_{ε}^{4}, E (ε^{6}) = 15 σ_{ε}^{6}$, and so on. Hence, only the even-order derivatives contribute to the expectation. If we consider the first six derivatives of f(ϵ), and if terms up to and including $σ_{ε}^{6}$ are involved, we may write the expectation needed for Term I as:

\begin{array}{l} E_{1} [E_{2} (y_{1 j} | ε)] \\ = p_{1} + \frac{σ_{ε}^{2}}{2} p_{1} (1 - p_{1}) (1 - 2 p_{1}) + \frac{3 σ_{ε}^{4}}{4!} p_{1} (1 - p_{1}) (1 - 14 p_{1} + 36 p_{1}^{2} - 24 p_{1}^{3}) \\ + \frac{15 σ_{ε}^{6}}{6!} p_{1} (1 - p_{1}) (1 - 62 p_{1} + 540 p_{1}^{2} - 1560 p_{1}^{3} + 1800 p_{1}^{4} - 720 p_{1}^{5}) . \end{array}

A.23
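One way to get a feel for the truncated series in Eq. (A.23) is to compare it with a direct numerical evaluation of the expectation of f(ϵ) under a normal distribution. The sketch below does this with a plain trapezoidal rule; the integration range (eight standard deviations either side of zero), the number of steps, and the function names are assumptions made for illustration, not part of the original derivation.

```python
import math


def g1_series(p, var_eps):
    """Truncated series of Eq. (A.23), terms up to sigma^6."""
    q2 = 1 - 2 * p
    q4 = 1 - 14 * p + 36 * p**2 - 24 * p**3
    q6 = (1 - 62 * p + 540 * p**2 - 1560 * p**3
          + 1800 * p**4 - 720 * p**5)
    return p + p * (1 - p) * (var_eps / 2 * q2
                              + 3 * var_eps**2 / 24 * q4
                              + 15 * var_eps**3 / 720 * q6)


def g1_numeric(p, var_eps, half_width=8.0, steps=4000):
    """E[f(eps)] for eps ~ N(0, var_eps), by the trapezoidal rule."""
    sd = math.sqrt(var_eps)
    eta = math.log(p / (1 - p))  # logit of the probability at eps = 0
    lower = -half_width * sd
    h = 2 * half_width * sd / steps
    total = 0.0
    for i in range(steps + 1):
        eps = lower + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = (math.exp(-eps**2 / (2 * var_eps))
                   / (sd * math.sqrt(2 * math.pi)))
        total += weight * density / (1 + math.exp(-(eta + eps)))
    return total * h


series_value = g1_series(0.73, 0.1)
numeric_value = g1_numeric(0.73, 0.1)
```

Repeating the comparison for larger values of `var_eps` gives a rough indication of how much the terms omitted from the truncated series matter.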

For the convenience of notations, we define Eq. (A.23) as:

E_{1} [E_{2} (y_{1 j} | ε)] = g_{1} (p_{1});

hence,

\begin{array}{l} E_{1} {E_{2} (\frac{{\hat{p}}_{1} - p_{1}}{p_{1}})} \\ = {\frac{g_{1} (p_{1}) - p_{1}}{p_{1}}} \end{array}

A.24

= (1 - p_{1}) \left\{ \begin{array}{l} \frac{σ_{ε}^{2}}{2} (1 - 2 p_{1}) + \frac{3 σ_{ε}^{4}}{4!} (1 - 14 p_{1} + 36 p_{1}^{2} - 24 p_{1}^{3}) \\ + \frac{15 σ_{ε}^{6}}{6!} (1 - 62 p_{1} + 540 p_{1}^{2} - 1560 p_{1}^{3} + 1800 p_{1}^{4} - 720 p_{1}^{5}) \end{array} \right\} .

A.25

Corollary A.1. Consider the expectations of corresponding terms of $log n_{1} (1 - {\hat{p}}_{1})$ of Eq. (A.1). These may be written as:

\begin{array}{l} E_{1} {E_{2} (\frac{1 - {\hat{p}}_{1} - (1 - p_{1})}{(1 - p_{1})})} \\ = - [\begin{array}{l} \frac{σ_{ε}^{2}}{2} p_{1} (1 - 2 p_{1}) + \frac{3 σ_{ε}^{4}}{4!} p_{1} (1 - 14 p_{1} + 36 p_{1}^{2} - 24 p_{1}^{3}) \\ + \frac{15 σ_{ε}^{6}}{6!} p_{1} (1 - 62 p_{1} + 540 p_{1}^{2} - 1560 p_{1}^{3} + 1800 p_{1}^{4} - 720 p_{1}^{5}) \end{array}] \end{array}

A.26

= - {\frac{g_{1} (p_{1}) - p_{1}}{(1 - p_{1})}} .

A.27

We apply the same procedures for terms associated with ${\hat{p}}_{0}$ and 1 − ${\hat{p}}_{0}$ , from Eq. (A.1) and define similar functions of p₀ at this stage.

Now consider Term II,

\left( \frac{{\hat{p}}_{1} - p_{1}}{p_{1}} \right)^{2} = \frac{1}{{(n_{1} p_{1})}^{2}} \left( \sum_{j} y_{1 j} \right)^{2} - \frac{2}{n_{1} p_{1}} \sum_{j} y_{1 j} + 1

A.28

= \frac{1}{{(n_{1} p_{1})}^{2}} \sum_{j} y_{1 j}^{2} + \sum_{j \neq j^{'}} \frac{y_{1 j} y_{1 j^{'}}}{{(n_{1} p_{1})}^{2}} + 1 - \frac{2}{n_{1} p_{1}} \sum_{j} y_{1 j} .

A.29

Since y_1j, is binary, the conditional expectation is:

E_{1} {E_{2} (y_{1 j}^{2} | ε)} = E_{1} {E_{2} (y_{1 j} | ε)}

A.30

E_{1} {E_{2} (y_{1 j} y_{1 j^{'}} | ε_{j} ε_{j^{'}})} = {E_{1} E_{2} (y_{1 j} | ε)}^{2} .

A.31

Assuming conditional independence of y_1j, we may then write:

E_{1} {E_{2} {(\frac{{\hat{p}}_{1} - p_{1}}{p_{1}})}^{2}} = o (\frac{1}{n_{1}}) + \frac{n_{1} (n_{1} - 1)}{{(n_{1} p_{1})}^{2}} {[g_{1} (p_{1})]}^{2} + 1 - \frac{2}{p_{1}} g_{1} (p_{1}),

A.32

where $o (\frac{1}{n_{1}})$ includes all terms of order (n₁)⁻¹, which vanish as n₁ becomes large under the asymptotic theory assumption; hence the R.H.S. of Eq. (A.32) may be written as:

E_{1} {E_{2} {(\frac{{\hat{p}}_{1} - p_{1}}{p_{1}})}^{2}} = {\frac{g_{1} (p_{1}) - p_{1}}{p_{1}}}^{2} + o (\frac{1}{n_{1}}),

A.33

by including corresponding terms of $log n_{1} (1 - {\hat{p}}_{1})$ , from Eq. (A.1) these give:

E_{1} {E_{2} {(\frac{{\hat{p}}_{1} - p_{1}}{1 - p_{1}})}^{2}} = {\frac{g_{1} (p_{1}) - p_{1}}{(1 - p_{1})}}^{2} .

A.34

We also consider Term III, and apply similar procedures; we write:

{\frac{{\hat{p}}_{1} - p_{1}}{p_{1}}}^{3} = {\sum_{j} \frac{y_{1 j}}{(n_{1} p_{1})} - 1}^{3}

A.35

= \frac{1}{{(n_{1} p_{1})}^{3}} {\sum y_{1 j}}^{3} - \frac{3}{{(n_{1} p_{1})}^{2}} {\sum y_{1 j}}^{2} + \frac{3}{n_{1} p_{1}} {\sum y_{1 j}} - 1

A.36

\begin{matrix} = \frac{1}{{(n_{1} p_{1})}^{3}} \left\{ \sum y_{1 j}^{3} + \sum_{j \neq j^{'}} \sum y_{1 j}^{2} y_{1 j^{'}} + \sum_{j \neq j^{'} \neq j^{″}} \sum \sum y_{1 j} y_{1 j^{'}} y_{1 j^{″}} \right\} \\ - \frac{3}{{(n_{1} p_{1})}^{2}} \left\{ \sum y_{1 j}^{2} + \sum_{j \neq j^{'}} \sum y_{1 j} y_{1 j^{'}} \right\} + \frac{3}{(n_{1} p_{1})} \left\{ \sum y_{1 j} \right\} - 1; \end{matrix}

A.37

therefore,

E_{1} {E_{2} {(\frac{{\hat{p}}_{1} - p_{1}}{p_{1}})}^{3}} = \frac{n_{1} (n_{1} - 1) (n_{1} - 2)}{{(n_{1} p_{1})}^{3}} {g_{1} (p_{1})}^{3} - \frac{3 n_{1} (n_{1} - 1)}{{(n_{1} p_{1})}^{2}} {g_{1} (p_{1})}^{2} + \frac{3}{p_{1}} g_{1} (p_{1}) - 1 + o (\frac{1}{n_{1}})

A.38

= {\frac{g_{1} (p_{1}) - p_{1}}{p_{1}}}^{3} + o (\frac{1}{n_{1}}) .

A.39

If we similarly consider the corresponding terms of $log n_{1} (1 - {\hat{p}}_{1})$ of Eq. (A.1), that gives:

- E_{1} {E_{2} {(\frac{{\hat{p}}_{1} - p_{1}}{1 - p_{1}})}^{3}} = - {\frac{g_{1} (p_{1}) - p_{1}}{(1 - p_{1})}}^{3}

A.40

Proof. We first restrict the bias to include only the first-order terms. The proof follows directly by substituting into Eq. (A.1) the expectation of term I from Eq. (A.24) and the complementary term from Eq. (A.27), plus the similar terms for $log (n_{0} {\hat{p}}_{0})$ and $log (n_{0} (1 - {\hat{p}}_{0}))$. The latter terms have the same form as those for ${\hat{p}}_{1}$ and $(1 - {\hat{p}}_{1})$, but they are functions of ${\hat{p}}_{0}$ and $(1 - {\hat{p}}_{0})$ and take the opposite signs, as Eq. (A.1) shows. The expectation of $\hat{β}$ may then be written as:

E_{1} {E_{2} (\hat{β})} = β + {\frac{g_{1} (p_{1}) - p_{1}}{p_{1} (1 - p_{1})}} - {\frac{g_{1} (p_{0}) - p_{0}}{p_{0} (1 - p_{0})}};

A.41

therefore,

bias (\hat{β}) = {\frac{g_{1} (p_{1}) - p_{1}}{p_{1} (1 - p_{1})}} - {\frac{g_{1} (p_{0}) - p_{0}}{p_{0} (1 - p_{0})}} = δ_{1}

A.42

If the second-order term II of Eq. (A.3) is also considered, by bringing in its relevant components from Eqs. (A.33) and (A.34) and the corresponding terms related to ${\hat{p}}_{0}$ and $(1 - {\hat{p}}_{0})$, and substituting all of these in Eq. (A.1), then we may rewrite the expectation of $\hat{β}$ as:

E_{1} {E_{2} (\hat{β})} = β + δ_{1} - \frac{1}{2} [{\frac{g_{1} (p_{1}) - p_{1}}{p_{1} (1 - p_{1})}}^{2} (1 - 2 p_{1}) - {\frac{g_{1} (p_{0}) - p_{0}}{p_{0} (1 - p_{0})}}^{2} (1 - 2 p_{0})] .

A.43

To simplify we write:

E_{1} {E_{2} (\hat{β})} = β + δ_{1} + δ_{2},

A.44

where

δ_{2} = - \frac{1}{2} [{\frac{g_{1} (p_{1}) - p_{1}}{p_{1} (1 - p_{1})}}^{2} (1 - 2 p_{1}) - {\frac{g_{1} (p_{0}) - p_{0}}{p_{0} (1 - p_{0})}}^{2} (1 - 2 p_{0})] .

A.45

If term III is also considered, with the components from Eqs. (A.39) and (A.40) and the analogous terms for ${\hat{p}}_{0}$ and $(1 - {\hat{p}}_{0})$ substituted into Eq. (A.1) by the same procedure used for I and II, the expected value $E_{1} {E_{2} (\hat{β})}$ may be written as:

E_{1} {E_{2} (\hat{β})} = β + δ_{1} - \frac{1}{2} [{\frac{g_{1} (p_{1}) - p_{1}}{p_{1} (1 - p_{1})}}^{2} (1 - 2 p_{1}) - {\frac{g_{1} (p_{0}) - p_{0}}{p_{0} (1 - p_{0})}}^{2} (1 - 2 p_{0})] + \frac{1}{3} [{\frac{g_{1} (p_{1}) - p_{1}}{p_{1} (1 - p_{1})}}^{3} (1 - 3 p_{1} + 3 p_{1}^{2}) - {\frac{g_{1} (p_{0}) - p_{0}}{p_{0} (1 - p_{0})}}^{3} (1 - 3 p_{0} + 3 p_{0}^{2})],

A.46

where

δ_{3} = \frac{1}{3} [{\frac{g_{1} (p_{1}) - p_{1}}{p_{1} (1 - p_{1})}}^{3} (1 - 3 p_{1} + 3 p_{1}^{2}) - {\frac{g_{1} (p_{0}) - p_{0}}{p_{0} (1 - p_{0})}}^{3} (1 - 3 p_{0} + 3 p_{0}^{2})] .
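The factor $(1 - 3 p + 3 p^{2})$ in $δ_{3}$ is not derived explicitly in the text. A sketch of where it comes from is given here, writing $p$ for either $p_{1}$ or $p_{0}$. It rests on two assumptions that are not restated in this part of the appendix: the third-order Taylor coefficient of $\log (1 + x)$ is $x^{3}/3$, and the $\log n_{1} (1 - {\hat{p}}_{1})$ term enters $\hat{β}$ in Eq. (A.1) with a minus sign, which cancels the minus sign in Eq. (A.40).

```latex
\frac{1}{3}\left[\left(\frac{g_{1}(p) - p}{p}\right)^{3}
+ \left(\frac{g_{1}(p) - p}{1 - p}\right)^{3}\right]
= \frac{1}{3}\left(\frac{g_{1}(p) - p}{p\,(1 - p)}\right)^{3}
\left[(1 - p)^{3} + p^{3}\right],
\qquad
(1 - p)^{3} + p^{3} = 1 - 3p + 3p^{2}.
```

The two cubes come from Eqs. (A.39) and (A.40); putting them over the common denominator ${(p (1 - p))}^{3}$ produces the polynomial factor.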

To simplify, we may write:

E_{1} {E_{2} (\hat{β})} = β + δ_{1} + δ_{2} + δ_{3}

A.47

and

bias (\hat{β}) = δ_{1} + δ_{2} + δ_{3} .

A.48

By substituting the actual forms of $g_{1} (p_{1})$ and $g_{1} (p_{0})$ into Eqs. (A.41), (A.43), and (A.46), we may evaluate the bias including first-, second-, and third-order terms.
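As an illustration of this evaluation, the following minimal Python sketch computes $δ_{1}$, $δ_{2}$, and $δ_{3}$ from Eqs. (A.42), (A.45), and the definition of $δ_{3}$. The function names `scaled_shift` and `bias_terms` are hypothetical, and the values of $g_{1} (p_{1})$ and $g_{1} (p_{0})$ are taken as numeric inputs (`g1`, `g0`) rather than computed, since their form depends on the heterogeneity model defined in the main text.

```python
def scaled_shift(p, g):
    """Return (g - p) / (p * (1 - p)), the building block shared by
    delta_1, delta_2 and delta_3. Here g stands for g_1(p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (g - p) / (p * (1.0 - p))


def bias_terms(p1, p0, g1, g0):
    """Return (delta_1, delta_2, delta_3) for given probabilities.

    p1, p0 : probabilities at the two covariate levels
    g1, g0 : assumed numeric values of g_1(p1) and g_1(p0)
    """
    a1 = scaled_shift(p1, g1)
    a0 = scaled_shift(p0, g0)
    # First-order term, Eq. (A.42)
    delta1 = a1 - a0
    # Second-order term, Eq. (A.45)
    delta2 = -0.5 * (a1**2 * (1 - 2 * p1) - a0**2 * (1 - 2 * p0))
    # Third-order term (definition of delta_3)
    delta3 = (a1**3 * (1 - 3 * p1 + 3 * p1**2)
              - a0**3 * (1 - 3 * p0 + 3 * p0**2)) / 3.0
    return delta1, delta2, delta3


# Illustrative inputs only; these numbers are not taken from the paper.
d1, d2, d3 = bias_terms(p1=0.5, p0=0.5, g1=0.45, g0=0.5)
total_bias = d1 + d2 + d3  # Eq. (A.48)
```

With `g1 == p1` and `g0 == p0` every term is zero, which corresponds to the case of no shift in the probabilities. At $p = 0.5$ the second-order factor $(1 - 2p)$ vanishes, so `d2` is zero for the illustrative inputs above.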

References

Agresti A. Categorical Data Analysis. Probability and Statistics. New York: Wiley-Interscience; 2002.
Anderson J. A., Richardson S. C. Logistic discrimination and bias correction in maximum likelihood estimation. Technometrics. 1979;21:71–78.
Aprahamian F., Chanel O., Luchini S. Modeling starting point bias as unobserved heterogeneity in contingent valuation surveys: an application to air pollution. Amer. J. Agri. Econ. 2007;89:533–547.
Arana J. E., Leon C. J. Flexible mixture distribution modeling of dichotomous choice contingent valuation with heterogeneity. J. Environm. Econ. Manage. 2005;50:170–188.
Arana J. E., Leon C. J. Modelling unobserved heterogeneity in contingent valuation of health risks. Appl. Econ. 2006;38:2315–2325.
Ayis S. A. M. Modelling Unobserved Heterogeneity: Theoretical and Practical Aspects. Southampton, UK: University of Southampton; 1995.
Copas J. B. Binary regression models for contaminated data. J. Roy. Statist. Soc. Ser. B (Methodological) 1988;50:225–265.
Cramer J. S. Robustness of logit analysis: unobserved heterogeneity and mis-specified disturbances. Oxford Bull. Econ. Statist. 2007;69:545–555.
Firth D. Bias reduction, the Jeffreys prior and GLIM. In: Fahrmeir L., Francis B., Gilchrist R., editors. Advances in GLIM and Statistical Modelling. New York: Springer; 1992. pp. 91–100.
Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80:27–38.
Gail M. H., Wieand S., Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika. 1984;71:431–444.
Griffiths W. E., Hill R. C., Pope P. J. Small sample properties of probit model estimators. J. Amer. Statist. Assoc. 1987;82:929–937.
Horowitz J. L. Semiparametric and nonparametric estimation of quantal response models. In: Maddala G. S., Rao C. R., Vinod H. D., editors. Handbook of Statistics. Amsterdam: Elsevier; 1993.
Lee S., Lee S. Testing heterogeneity for frailty distribution in shared frailty model. Commun. Statist. Theor. Meth. 2003;32:2245–2253.
McCullagh P. A., Nelder J. A. Generalized Linear Models. Monographs on Statistics and Applied Probability. 2nd ed. London: Chapman & Hall; 1989.
Mosley W. H., Chen L. C. An analytical framework for the study of child survival in developing countries. 1984. Bull. World Health Organ. 2003;81:140–145.
Peters B. L. Is there a wage bonus from drinking? Unobserved heterogeneity examined. Appl. Econ. 2004;36:2299–2315.
Skinner C. J., Holt D., Smith T. M. F. Analysis of Complex Surveys. Chichester: John Wiley & Sons, Ltd; 1989.
Steyerberg E. W., Eijkemans M. J. C. Heterogeneity bias: the difference between adjusted and unadjusted effects. Med. Deci. Making. 2004;24:102–104. doi: 10.1177/0272989X03262285.
World Health Organization. International Classification of Functioning, Disability and Health: ICF. Geneva: World Health Organization; 2001.
Zohoori N., Savitz D. A. Econometric approaches to epidemiologic data: Relating endogeneity and unobserved heterogeneity to confounding. Ann. Epidemio. 1997;7:251–257. doi: 10.1016/s1047-2797(97)00023-9.
