Author manuscript; available in PMC: 2011 May 1.
Published in final edited form as: Stat Methodol. 2010 May 1;7(3):307–322. doi: 10.1016/j.stamet.2009.09.003

Ecological Inference in the Social Sciences

Adam Glynn 1, Jon Wakefield 2
PMCID: PMC2885825  NIHMSID: NIHMS152933  PMID: 20563299

Abstract

Ecological inference is a problem of partial identification, and therefore reliable precise conclusions are rarely possible without the collection of individual level (identifying) data. Without such data, sensitivity analyses provide the only recourse. In this paper we review and critique approaches to ecological inference in the social sciences, and describe in detail hierarchical models, which allow both sensitivity analysis and the incorporation of individual level data into an ecological analysis. A crucial element of a sensitivity analysis in such models is prior specification, and we detail how this may be carried out. Furthermore, we demonstrate how the inclusion of a small amount of individual level data can dramatically improve the properties of such estimates.

1 Introduction

In this paper the problem of making individual-level inference from ecological data is considered. In particular, suppose we have a set of R × C tables in which only the margins are observed; for concreteness we suppose each table corresponds to a different geographical area. This problem arises in many disciplines including political science (Achen and Shively, 1995; King, 1997), sociology (Goodman, 1953, 1959; Duncan and Davis, 1953) and spatial epidemiology (Richardson and Montfort, 2000; Wakefield, 2008); King (1997) and Cleave et al. (1995) describe further application areas. However, the exact nature of the inferential question differs importantly across these disciplines; Salway and Wakefield (2004) compare objectives and approaches to inference in the social sciences and epidemiology.

In epidemiological applications the usual aim is to estimate a risk contrast between exposed and unexposed individuals in a certain population and time period. This contrast can then be used to predict the number of new cases in a future time period (for public health provision, for example) or to make causal inferences. Since the data are usually observational, to estimate the causal effect of the exposure an attempt must be made to control for confounding variables (for example, one or more of age, gender, race, and smoking history) that are responsible for differences in risk, beyond those due to exposure, of the study populations. Hence in ecological studies in epidemiology there is never a single predictor and the data are not simply in the form of a series of 2 × 2 tables, since within each area there will (typically) be multiple confounders for which to control. Control for confounders in ecological studies is more difficult than in individual studies, since the multivariate distribution of exposures/confounders within and between areas is needed (Richardson et al., 1987; Greenland and Robins, 1994; Prentice and Sheppard, 1995; Plummer and Clayton, 1996; Lasserre et al., 2000; Wakefield, 2008).

By contrast, in the social sciences, ecological inference can have causal and non-causal inferential purposes. In many cases, social scientists have concentrated on imputing the missing cells in the constituent 2×2 or R×C tables. This type of ecological inference is often referred to as EI (named after the software package that accompanies King’s famous 1997 book). For example, in political science, a typical analysis will examine the differences between racial voting patterns in a specific region. This query can be answered by imputing the number of votes by race for a particular party or candidate, by area. Hence, viewed in this way, prediction rather than causality is the aim. However, in many recent cases, EI is followed by a second stage analysis that utilizes the imputed data, often as the dependent variable in a regression (Herron and Shotts, 2003b) and may be implicitly causal. For example, Burden and Kimball (1998) use EI to analyze ticket splitting rates across congressional districts, and then use these rates in a second stage analysis to determine why voters are splitting their tickets. This type of analysis is often referred to as EI-R.

Although the shortcomings of ecological inference in the EI and epidemiological contexts have been documented (Achen and Shively, 1995; Greenland and Robins, 1994; Cho, 1998; Freedman et al., 1998; Gelman et al., 2001; Wakefield, 2008), the continued use of ecological data can be attributed to: the increased sample sizes and predictor ranges they provide; their routine availability; their increased reliability when a sensitive question is asked; and the impossibility of further data collection in historical contexts. Furthermore, King (1999) argues that EI is a reasonable approach and will only be unreliable in the following circumstances:

If the assumptions are wrong and the bounds and diagnostics are not sufficiently informative, and the researcher has no time to collect qualitative knowledge, then EI will perform poorly.

The difficulties of ecological inference may, however, become compounded when ecological estimates are used as data in second stage analyses in EI-R (Herron and Shotts, 2003b), and although extensions to EI can alleviate some of these problems (Adolph and King, 2003; Herron and Shotts, 2003a; Adolph et al., 2003), there are still concerns over the use of the technique (Herron and Shotts, 2004; Cho and Gaines, 2004).

Given the aforementioned difficulties of ecological inference, the ecological analyst will benefit from a method that possesses the following properties: 1) produces only logically consistent estimates that respect the Duncan and Davis (1953) bounds, 2) allows the incorporation of qualitative information, 3) provides a means for carrying out sensitivity analysis, 4) allows the inclusion of individual level data and characterizes the gains to be made from such an inclusion. In this paper, we demonstrate that the hierarchical convolution likelihood model described in Wakefield (2004a) has all of these properties.

The outline of this paper is as follows. In Section 2 we describe the fundamental difficulties of ecological inference, and describe a data set that will be used to illustrate the various issues raised throughout the paper. In Section 3 we describe the so-called convolution likelihood. Section 4 describes the Bayesian approach to inference, including subsections on prior specification and predictive distributions. The latter links the parameters of the model to the observables (unobserved counts), and is of great importance in political science applications, where there has been confusion between unobservable probabilities in a hypothetical super population model and the sample fractions in the study population, which are potentially observable. Section 5 describes the various computational schemes that have been suggested in the context of ecological inference. Section 6 presents a sensitivity analysis in the context of the Louisiana data, while we consider the combination of aggregate and individual data in Section 7. A concluding discussion rounds out the paper in Section 8.

2 The Fundamental Difficulty of Ecological Inference

To motivate our discussion we introduce a specific data set for which Y = 0/1 represents the event Democrat/Republican registration, and X = 0/1 the event Black/White. The data were collected in the U.S. state of Louisiana in 1990 and are available in each of the 64 counties of that state. These data are ideal for illustration of methods, since they are one of the few sources for which the individual level data are available.

Figure 1 gives an initial look at the ecological data. In panel (a) we give a histogram of the fraction black, and observe that in the majority of counties this fraction is less than 0.5. Panel (b) gives the proportion registered Republican against the fraction black, with a least squares line added to indicate the linear association. The general trend is that the proportion registered Republican decreases as the proportion black increases. The obvious explanation is that blacks are less likely to register Republican. Alternative explanations exist, however; in particular, the same pattern could be observed if whites are less likely to register Republican when in a predominantly black county, if blacks are more likely to register Republican in a predominantly white county, or if individual race is an unimportant predictor of registration behavior and instead an individual’s behavior, whether black or white, is predicted by the proportion of blacks in the area. In each of these scenarios the proportion black/white in an area is an example of a contextual variable, a variable reflecting characteristics of individuals in a shared environment. To help explain the results that follow, panels (c) and (d) give the fractions black and white who register Republican against the fraction black, again with least squares lines added. Note that it would not have been possible to produce these two plots without the individual level data, which we have in this special case. We see that the fraction black who register Republican decreases across counties as the fraction black in the counties increases; hence it seems plausible that some contextual variable is driving black Republican registration (e.g. income). We also see that the proportion white who register Republican increases across counties as the fraction black in the counties increases. This is consistent with the explanation that whites in areas with large numbers of blacks are more fearful of affirmative action policies, and register/vote accordingly.

Figure 1.

Across 64 counties of Louisiana: (a) Histogram of fraction of population that are black; (b) fraction registered Republican versus fraction black, (c) fraction black registered Republican versus fraction black, (d) fraction white registered Republican versus fraction black.

Hence Figure 1(b) understates the extent to which blacks are less likely to register Republican than whites, which is an example of what Selvin (1958) called the ecological fallacy: incorrect inference concerning individual effects gleaned from aggregate data. In an extreme case, the aggregate relationship could be the reversal of the true individual relationship, a phenomenon closely related to Simpson’s paradox (Simpson, 1951); see Wakefield (2004c) for further discussion. The ecological fallacy had been discussed in the sociology literature before 1950, but Robinson (1950) provided an extremely lucid account, which explains the subsequent influence of the paper in deterring the analysis of ecologic data. Recently, Robinson’s paper has been revisited within a multilevel framework (Subramanian et al., 2009b,a) and critiqued (Oakes, 2009; Firebaugh, 2009; Wakefield, 2009).

We now introduce some notation, in the context of the Louisiana data, in order to ease description of models that have been suggested for ecologic data. For a generic individual, Y = 0/1 will denote the event that an individual is unregistered/registered (the response), and X = 0/1 the event that an individual is of black/white race (the predictor). Table 1 describes the notation that we will use throughout the paper; Y0i, Y1i are the Y = 1 individuals from covariate group X = 0, 1, respectively, in area i. In an aggregate situation we do not observe the internal counts Y0i, Y1i. The fundamental difficulty of ecological inference is that we are interested in these two quantities, but it is their sum Yi only, that we observe.

Table 1.

Table summarizing data in area i; in an ecological study the margins only are observed.

      | Y = 0   | Y = 1 | Total
x = 0 |         | Y0i   | N0i
x = 1 |         | Y1i   | N1i
Total | Ni − Yi | Yi    | Ni

In the social science ecological inference literature the inference problem has often been treated as the imputation of the missing data, Y0i, Y1i, and due to this perspective approaches have often implicitly adopted a finite sampling view. Here we utilize a hypothetical infinite population of exchangeable blacks and whites within each area as the primitive modeling object, and define the parameter pji to be the fraction of race j in area i that register. With this viewpoint an estimate of this probability, $\hat{p}_{ji}$, is not equal to the true (but unobserved) fraction registered, Yji/Nji, which we denote by $\tilde{p}_{ji}$. In a finite sample view, if Yji were observed then inference is complete since the population has been observed. In contrast, in the infinite population view, even if Yji is observed, uncertainty concerning pji will remain (though it may be small if Nji is large). However, while the infinite population model takes pji to be the primary parameter of interest, note that this model can still be used to make predictions about the fractions $\tilde{p}_{ji}$ if these are of interest. Section 4.3 discusses this in greater detail.

To see the indeterminacy of ecological inference more clearly we write, for area i,

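$$\frac{Y_i}{N_i} = \frac{Y_{0i} + Y_{1i}}{N_i} = \frac{Y_{0i}}{N_{0i}} \times \frac{N_{0i}}{N_i} + \frac{Y_{1i}}{N_{1i}} \times \frac{N_{1i}}{N_i}$$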

which may be rewritten as

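$$\tilde{q}_i = \tilde{p}_{0i} \times x_i + \tilde{p}_{1i} \times (1 - x_i), \tag{1}$$

where $\tilde{q}_i$ is the fraction registered, $\tilde{p}_{0i}$ and $\tilde{p}_{1i}$ are the black and white fractions registered, and xi and 1 − xi are the proportions black and white respectively. In an ecological data set, $\tilde{q}_i$ and xi are observed while $\tilde{p}_{0i}$ and $\tilde{p}_{1i}$ are not. From (1) we see that the observed $\tilde{q}_i$ are consistent with many true fractions $\tilde{p}_{0i}$, $\tilde{p}_{1i}$. The bounds of Duncan and Davis (1953) may be written in terms of $\tilde{q}_i$ and xi:

$$\max\left\{0, \frac{\tilde{q}_i - (1 - x_i)}{x_i}\right\} \le \tilde{p}_{0i} \le \min\left\{1, \frac{\tilde{q}_i}{x_i}\right\}, \qquad \max\left\{0, \frac{\tilde{q}_i - x_i}{1 - x_i}\right\} \le \tilde{p}_{1i} \le \min\left\{1, \frac{\tilde{q}_i}{1 - x_i}\right\}.$$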

In terms of the underlying probabilities pji, there is no constraint beyond 0 < pji < 1. This is a crucial difference between the finite sample and infinite sampling views. Figure 2 shows the admissible ranges for $\tilde{p}_{0i}$ and $\tilde{p}_{1i}$ for the Louisiana data via a so-called tomography plot. We see that for the blacks in particular there is a great deal of uncertainty. The open circles correspond to the true fractions, which are available for these data.
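The Duncan and Davis bounds can be computed directly from the two observed margins. The following is a minimal Python sketch, not code from the paper: the function name `duncan_davis_bounds` and the example values of the registered fraction `q` and the proportion black `x` are hypothetical, chosen only for illustration.

```python
def duncan_davis_bounds(q, x):
    """Bounds on the group-specific fractions given the observed margins.

    q : observed fraction with Y = 1 in the area (fraction registered)
    x : proportion of the area in group X = 0 (proportion black)
    Returns ((lo0, hi0), (lo1, hi1)) for groups X = 0 and X = 1.
    Assumes 0 < x < 1 so that both groups are non-empty.
    """
    if not (0.0 < x < 1.0) or not (0.0 <= q <= 1.0):
        raise ValueError("need 0 < x < 1 and 0 <= q <= 1")
    lo0 = max(0.0, (q - (1.0 - x)) / x)
    hi0 = min(1.0, q / x)
    lo1 = max(0.0, (q - x) / (1.0 - x))
    hi1 = min(1.0, q / (1.0 - x))
    return (lo0, hi0), (lo1, hi1)


# Hypothetical area: 30% black, 20% registered overall.
bounds_black, bounds_white = duncan_davis_bounds(q=0.2, x=0.3)
print(bounds_black, bounds_white)
```

In this hypothetical area the black fraction is only constrained to lie between 0 and 2/3, while the white fraction lies between 0 and 2/7, which mirrors the remark that the smaller group is typically the less well determined.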

Figure 2.

Tomography lines for Louisiana data.

Two extreme explanations are consistent with (1). First, following Goodman (1953, 1959) we may assume that $\tilde{p}_{0i}$ and $\tilde{p}_{1i}$ are such that

$$E[\tilde{p}_{ji} \mid x_i] = p_j, \tag{2}$$

j = 0, 1, so that the fractions are uncorrelated with xi. The expectation here is with respect to repeated sampling in areas with proportion of blacks xi. We then have

$$E[\tilde{q}_i \mid x_i] = p_0 \times x_i + p_1 \times (1 - x_i) = a + b x_i, \tag{3}$$

where a = p1 and b = p0 − p1. Although it is only the expectations of the fractions that are considered constant in (2), the usual way of imputing the internal fractions is to simply take $\tilde{p}_{ji} = p_j$, which is equivalent to a model in which the fractions themselves are constant. This model has sometimes been described as Goodman regression, but the name ecological regression is more appropriate as Goodman did not encourage general use of the approach, and in particular was aware that the ‘constancy assumption’ (2) would often be inappropriate. The assumption of constancy allows the mean to be derived, but to formulate an estimation method it would be desirable to derive the variance and covariance of $Y_i = N_i \tilde{q}_i$. In general it has been assumed that counts in different areas are independent, and various forms for the variance have been considered. As we will describe in detail in Section 3, a plausible likelihood leads to Yi following a convolution distribution with variance that depends on p0i and p1i.

A very simple model, termed the ‘nonlinear neighborhood model’ (Freedman et al., 1991), is to assume that p0i = p1i = qi, i.e. to assume that registration and individual race are independent. This allows the table to be collapsed, and inference is straightforward. Freedman (2001) states that in this model, ‘…behavior is determined by geography not demography’. A specific version of the nonlinear neighborhood model, the ‘linear neighborhood’ model, was also described by Freedman et al. (1991) and makes the assumption that E[p0i|xi] and E[p1i|xi] are identical but depend on the proportion black via the linear form

$$E[p_{0i} \mid x_i] = E[p_{1i} \mid x_i] = E[q_i \mid x_i] = a + b x_i, \tag{4}$$

which is identical to (3) though the interpretation and imputed internal cells are drastically different under the two models, which was the motivation for Freedman et al. (1991) to introduce the model, to illustrate the fundamental unidentifiability of ecological inference. Other regression-type approaches, with a non-parametric flavor, are described by Chambers and Steel (2001).
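To make the unidentifiability concrete, the sketch below fits a single least squares line to aggregate data and then imputes the internal fractions under the two competing models. It is an illustration only: the four areas in `x` and `q` are made-up numbers, and the helper `least_squares_line` is a hypothetical name, not part of any package discussed here.

```python
def least_squares_line(x, q):
    """Ordinary least squares intercept a and slope b for q = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    q_bar = sum(q) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxq = sum((xi - x_bar) * (qi - q_bar) for xi, qi in zip(x, q))
    b = sxq / sxx
    a = q_bar - b * x_bar
    return a, b


# Hypothetical aggregate data: proportion black and fraction registered.
x = [0.1, 0.2, 0.4, 0.6]
q = [0.28, 0.26, 0.22, 0.18]
a, b = least_squares_line(x, q)

# Ecological regression: constant fractions with a = p1 and b = p0 - p1.
p0_er, p1_er = a + b, a

# Linear neighborhood model: both races share the area-level value a + b * x_i.
p_nb = [a + b * xi for xi in x]

# Both models imply the same fitted aggregate fraction in every area.
fit_er = [p0_er * xi + p1_er * (1 - xi) for xi in x]
fit_nb = [pi * xi + pi * (1 - xi) for pi, xi in zip(p_nb, x)]
print(p0_er, p1_er, p_nb)
```

With these made-up numbers ecological regression imputes a black fraction of about 0.1 and a white fraction of about 0.3 in every area, whereas the neighborhood model imputes equal fractions for both races within each area, even though the fitted aggregate line is the same.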

The assumption that $\tilde{p}_{ji}$ is uncorrelated with xi may be a major problem in some applications; see Freedman et al. (1998, 1999) and Freedman (2001) for examples. A further problem with ecological regression is the assumption that the estimated fractions are not allowed to vary across areas, so that between-area variability is not acknowledged. Least squares procedures are known to provide consistent estimates of regression parameters under a range of distributions of the errors, but are also known to be very poor at providing predictions of observable quantities. For prediction some knowledge of the distribution of the error terms is required. The great benefit of the hierarchical approach that was popularized by King (1997) is that between-area differences in fractions are assigned a distribution, so allowing variability in the estimates of race-specific fractions across areas.

To conclude, in this section we have reviewed how two competing explanations with vastly different interpretations and inferential implications lead to an identical mean function. To overcome this unidentifiability and estimate 2m quantities from m observables, it is clear that any approach that is considered must make assumptions (or incorporate additional information). It is not immediately apparent, but also true that some of the assumptions from any approach will be uncheckable from the aggregate data alone. In all observational studies untestable assumptions such as ‘no unmeasured confounding’ are required for causal interpretations (e.g. even Figures 1(c) and (d) are not sufficient to derive the full causal story). If causal inference is the goal of an ecological study, this problem is particularly acute since the amount of information concerning quantities of interest is much smaller than in typical individual-level observational studies (e.g. Figure 1(b) provides less information than Figures 1(c) and (d)).

3 The Convolution Likelihood

In the previous section we simply derived the form of the marginal mean of the fraction registered Republican under various assumptions. In this section we describe a likelihood function under a plausible sampling scheme, and compare this with various (often implicit) likelihoods that have been used in the ecological literature. Recall that

$$p_{0i} = \Pr(Y = 1 \mid x = 0, i) \quad \text{and} \quad p_{1i} = \Pr(Y = 1 \mid x = 1, i) \tag{5}$$

are the population probabilities in area i, i = 1, …, m. Returning to Table 1, we first note that if Y0i and Y1i were observed, and if we assume that each of the N0i black individuals in area i has an independent Bernoulli response with probability p0i, and each of the N1i white individuals in area i has an independent Bernoulli response with probability p1i, then

$$Y_{ji} \mid p_{ji} \sim \text{Binomial}(N_{ji}, p_{ji}),$$

j = 0, 1, i = 1, …, m. Under this sampling scheme, if Y0i and Y1i are unobserved then the sum Yi follows a convolution of these binomial distributions:

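$$\Pr(Y_i \mid p_{0i}, p_{1i}) = \sum_{y_{0i} = l_i}^{u_i} \binom{N_{0i}}{y_{0i}} \binom{N_{1i}}{Y_i - y_{0i}} p_{0i}^{y_{0i}} (1 - p_{0i})^{N_{0i} - y_{0i}} p_{1i}^{Y_i - y_{0i}} (1 - p_{1i})^{N_{1i} - Y_i + y_{0i}} \tag{6}$$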

where

$$l_i = \max(0, Y_i - N_{1i}), \qquad u_i = \min(N_{0i}, Y_i). \tag{7}$$

These values correspond to the admissible values that Y0i can take, given the margins in Table 1. McCullagh and Nelder (1989) consider this likelihood under the assumption that p0i = p0 and p1i = p1; see also Achen and Shively (1995, p. 46).
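The convolution probability (6) can be evaluated by direct summation over the admissible range (7). The sketch below does this in Python for small margins; the function name `convolution_pmf` and the example margins are hypothetical. For the large margins typical of registration data a log-scale implementation would be needed, which is not attempted here.

```python
from math import comb


def convolution_pmf(y, n0, n1, p0, p1):
    """Pr(Y = y) when Y = Y0 + Y1, Y0 ~ Bin(n0, p0), Y1 ~ Bin(n1, p1).

    Sums over the admissible values of y0, l = max(0, y - n1) to
    u = min(n0, y). Direct summation; intended for small margins only.
    """
    lo = max(0, y - n1)
    hi = min(n0, y)
    total = 0.0
    for y0 in range(lo, hi + 1):
        y1 = y - y0
        total += (comb(n0, y0) * p0 ** y0 * (1 - p0) ** (n0 - y0)
                  * comb(n1, y1) * p1 ** y1 * (1 - p1) ** (n1 - y1))
    return total


# Hypothetical small table: 4 individuals with x = 0, 6 with x = 1.
lik = convolution_pmf(5, 4, 6, 0.2, 0.7)
print(lik)
```

When p0 = p1 the sum collapses to an ordinary binomial probability with index n0 + n1, which gives a simple check on the implementation.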

We now briefly examine the shape of the likelihood for a single table. Plackett (1977) showed that the maximum likelihood estimate of the log odds ratio in a single table in which the margins only are observed is ±∞, which corresponds to p0i = 0 or 1 and/or p1i = 0 or 1. Steele et al. (2004) work with the convolution directly and report that the maximum likelihood estimator lies at the endpoint of the tomography line.

In King et al. (1999) the alternative model

$$Y_i \mid p_{0i}, p_{1i} \sim \text{Binomial}\{N_i, \; p_{0i} x_i + p_{1i} (1 - x_i)\} \tag{8}$$

was considered. As in King (1997), this produces a likelihood that is constant along the tomography line (an intuitively appealing feature given the implicit lack of information on the internal cells of the table). However, the underlying model should be viewed as an approximation in this context since it assumes sampling independently Ni individuals each with probability p0ixi + p1i(1 − xi).

In contrast, the convolution likelihood assumes that we sample N0i individuals with probability p0i and N1i individuals with probability p1i, i = 1, …, m. As Ni → ∞ with xi and $\tilde{q}_i$ constant, this convolution likelihood function becomes concentrated along the tomography line with an asymmetric U-shape, with the maximum at one endpoint. At first, this non-constancy of the likelihood may seem counterintuitive (given the lack of information). However, an MLE on the boundary of the parameter space is an alternative indication of a lack of information. Furthermore, if uniform priors are placed on the probabilities p0i and p1i, this likelihood implies a flat posterior predictive distribution for $\tilde{p}_{0i}$ and $\tilde{p}_{1i}$ along the associated tomography line. Therefore, the convolution likelihood produces constancy in exactly the space that we would expect ($\{\tilde{p}_{0i}, \tilde{p}_{1i}\}$ instead of {p0i, p1i}), and only produces constancy when the appropriate assumption of “no information” is made about p0i and p1i.

It is clear that the data in one table alone give limited information concerning p0i, p1i or $\tilde{p}_{0i}$, $\tilde{p}_{1i}$, since we only have a single observation, Yi. However, in most applications, ecological data from multiple areas are available.

4 Bayesian Inference

4.1 Priors

Following King (1997), a number of authors have developed hierarchical approaches in which, rather than reducing the dimensionality of the models as was described in the previous section, the full 2m parameters are retained but the probabilities/fractions are assumed to arise from a bivariate distribution.

At the second stage of the King (1997) model it is assumed that the pair $\tilde{p}_{0i}$, $\tilde{p}_{1i}$ arise from a truncated bivariate normal distribution, hence imposing identifiability. King (1997) views the truncated bivariate normal distribution as the likelihood, while we have referred to the tomography lines as providing the first stage of the model, with the truncated bivariate normal the second stage of the model. Inference is initially carried out via MLE for the five population parameters, using numerical integration, and then simulation is used to make more refined inference. Priors may be placed on the population parameters (that characterize the truncated normal) to give a Bayesian model. In common with the majority of approaches, it is assumed that the pairs of fractions form an independent sample from the second stage distribution (here the truncated bivariate normal); see Haneuse and Wakefield (2004) for a hierarchical model with spatial dependence between the probabilities. The model in its most basic form also assumes that the fractions are uncorrelated with xi. The latter may be relaxed (see King, 1997, Chapter 9) via the introduction of contextual effects (in King et al. (2004) it is recommended that such effects be included), but reliable estimation of both individual and contextual effects is crucially dependent on the existence of substantive prior information (see the example in Wakefield (2004c) for a further demonstration of this). The freely-available EzI software (Benoit and King, 1998) may be used to implement the truncated normal model and its extensions.

At the second stage, King et al. (1999) assume that p0i and p1i are independent with

$$p_{ji} \mid a_j, b_j \sim_{iid} \text{Beta}(a_j, b_j). \tag{9}$$

The third and final stage of the model consists of exponential priors, Exp(λ), on aj, bj, j = 0, 1, where $\lambda^{-1}$ is the mean of the exponential. Specifically, in the example considered it was assumed that these exponential priors had mean 2 (λ = 0.5), a choice which is not desirable in many instances because it often produces a prior for each probability which is very strongly U-shaped (since beta priors with aj < 1, bj < 1 are themselves U-shaped, and an exponential with mean 2 has a 0.39 probability of being less than one). This is discussed more fully in Wakefield (2004c); in particular see Figure 6. Choosing much smaller values of λ, for example λ = 0.01, produces almost uniform priors on the probabilities, though we would not universally recommend a particular hyperprior; given the sensitivity of inference, it should be context specific. As the number of tables decreases and the x distribution becomes more asymmetric this problem becomes more and more acute. The ideal situation is for substantive information to be available for prior specification. The strong dependence on the third stage prior is in stark contrast to the usual generalized mixed model case, for which there is far less dependence (except for priors on variance components, where again care must be taken with small numbers of units). Here we emphasize that the form of the prior should be examined through simulation. Specifically, for generic second stage, p(p|φ), and third stage, p(φ):

  1. For fixed φ, simulate $\phi^{(s)}$ ~ p(φ), for s = 1, …, S.

  2. Simulate $p^{(s)}$ ~ p(p | $\phi^{(s)}$), for s = 1, …, S.

  3. Examine graphical and numerical summaries of the collection {$\phi^{(s)}$, s = 1, …, S}.

This procedure will be illustrated shortly.
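As a rough illustration of the simulation steps above for the beta second stage (9) with exponential hyperpriors, the following Python sketch draws (a, b) from Exp(λ), then a probability from Beta(a, b), and summarizes how much prior mass falls near 0 or 1. The function names, the seed, the number of draws and the 0.05 tail cut-off are all assumptions made for this example rather than choices from the paper.

```python
import math
import random


def simulate_beta_prior(lam, n_draws, seed=0):
    """Draw p ~ Beta(a, b) with a, b ~ Exp(lam) independently (mean 1/lam)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Guard against an exact zero, which betavariate would reject.
        a = max(rng.expovariate(lam), 1e-12)
        b = max(rng.expovariate(lam), 1e-12)
        draws.append(rng.betavariate(a, b))
    return draws


def tail_mass(draws, eps=0.05):
    """Fraction of draws below eps or above 1 - eps (0.10 under a uniform)."""
    return sum(1 for p in draws if p < eps or p > 1 - eps) / len(draws)


# Probability that an Exp(0.5) draw is below one, as quoted in the text.
prob_below_one = 1 - math.exp(-0.5)

tails_mean2 = tail_mass(simulate_beta_prior(0.5, 20000, seed=1))
tails_mean100 = tail_mass(simulate_beta_prior(0.01, 20000, seed=1))
print(round(prob_below_one, 2), tails_mean2, tails_mean100)
```

A tail mass well above 0.10 for λ = 0.5 is a numerical symptom of the U-shape described above, while the value for λ = 0.01 should be much closer to what a uniform prior would give.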

The model given by (9) does not allow dependence between the two random effects (note this is distinct from the independence between pairs of random effects in different areas, which is also assumed) though it is conjugate (giving a marginal distribution for the data that is beta-binomial) which may offer some advantage in terms of computation. The model also allows area-level covariates to be added at the second stage.

Wakefield (2004c) proposed, as an alternative to the beta model, a second stage in which the logits of the registration probabilities arose from a bivariate normal distribution; this model was introduced for the analysis of a series of 2 × 2 tables when the internal cells were observed by Skene and Wakefield (1990). Specifically, a reasonably general form is

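$$\theta_{0i} = \log\left(\frac{p_{0i}}{1 - p_{0i}}\right) = \mu_0 + \beta_0 z_i + \delta_{0i}, \qquad \theta_{1i} = \log\left(\frac{p_{1i}}{1 - p_{1i}}\right) = \mu_1 + \beta_1 z_i + \delta_{1i} \tag{10}$$

with

$$\delta_i \sim N_2(0, \Sigma),$$

where

$$\delta_i = \begin{bmatrix} \delta_{0i} \\ \delta_{1i} \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{00} & \Sigma_{01} \\ \Sigma_{10} & \Sigma_{11} \end{bmatrix}. \tag{11}$$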

Hence θ0i and θ1i denote the logits of the probabilities p0i and p1i in table i, so that pji = exp(θji)/{1 + exp(θji)}, j = 0, 1. In the specification (10), zi represent area-level characteristics (and may, in principle, include xi) and β0, β1 are (ecological) log odds ratios associated with these variables.

A third stage hyperprior adds priors on μ0, μ1 and Σ (and β0, β1 if there are covariates). It is difficult to gain information on the covariance term Σ01 and so from this point onwards we assume that Σ01 = 0. Without substantive information for the registration-race data, Wakefield (2004c) chose logistic priors with location 0 and scale 1 for μ0 and μ1, since these induce uniform priors on exp(μj)/{1 + exp(μj)} (the median of the registration probability for race j across the population of areas). Since $G(z) \approx \Phi(cz)$ with $c = 16\sqrt{3}/(15\pi)$, where $G(z) = (1 + e^{-z})^{-1}$ is the CDF of a logistic random variable, as an alternative we may specify normal priors with mean 0 and standard deviation 1/c.
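The quality of the normal approximation to the logistic CDF can be checked numerically. This is a small sketch using only the Python standard library; the grid from −8 to 8 in steps of 0.01 is an arbitrary choice for illustration.

```python
from math import erf, exp, pi, sqrt

# Scaling constant from the text: c = 16 * sqrt(3) / (15 * pi).
C = 16 * sqrt(3) / (15 * pi)


def logistic_cdf(z):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + exp(-z))


def normal_cdf(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Largest absolute gap between G(z) and Phi(c z) over an arbitrary grid.
grid = [i / 100 for i in range(-800, 801)]
max_gap = max(abs(logistic_cdf(z) - normal_cdf(C * z)) for z in grid)
print(C, 1 / C, max_gap)
```

The printed value of 1/C is the standard deviation of the normal prior suggested as the alternative to the logistic prior.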

For the precisions $\Sigma_{00}^{-1}$, $\Sigma_{11}^{-1}$ we specify gamma distributions Ga(a, b) (where the parameterization is such that the mean is given by a/b). In the WinBUGS manual the priors Ga(0.001, 0.001) are often used for precisions within a hierarchical model. This choice is not to be recommended in general (that is, for all applications); here it is a very poor one (and leads to marginal priors for the probabilities that are highly U-shaped). We follow a previously suggested procedure (Wakefield, 2009), which we briefly describe for a generic log odds ratio in area i, δi ~iid N(0, Σ) with $\Sigma^{-1}$ ~ Ga(a, b). We integrate over Σ to find the marginal distribution p(δi), which is a t distribution with d = 2a degrees of freedom, location zero, and squared scale $\sigma^2 = b/a$. To construct a prior distribution we require a careful interpretation of δi, or more informatively, exp(δi), which is the perturbation of the odds of Republican registration from the median of the distribution of the odds of Republican registration across all areas. Hence we may refer to exp(δi) as a residual odds, since it is relative to the median odds across areas. We give a range for exp(δi). In particular, for the range (1/R, R) we use the relationship $\pm t^{d}_{0.025}\,\sigma = \pm \log R$, where $t^{d}_{r}$ is the 100 × r-th quantile of a Student t random variable with d degrees of freedom, to give a = d/2 and $b = (\log R)^2 d / \{2 (t^{d}_{1-(1-q)/2})^2\}$ for a range with coverage q (q = 0.95 in the relationship above). We choose d = 1, to give a Cauchy marginal distribution. As an example, for a 95% range of [0.1, 10] we obtain a = 0.5, b = 0.0164.
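The calculation of b from the range (1/R, R) can be sketched as follows for the case d = 1 used here. Because a t distribution with one degree of freedom is a Cauchy distribution, its quantile at probability r is tan{π(r − 1/2)}, so no statistical library is needed. The function name `gamma_prior_for_range` is hypothetical, and the sketch covers d = 1 only.

```python
from math import log, pi, tan


def gamma_prior_for_range(R, q=0.95):
    """Return (a, b) of the Ga(a, b) precision prior for residual odds range (1/R, R).

    Assumes d = 1 degrees of freedom (Cauchy marginal), so a = d / 2 = 0.5.
    The upper quantile at probability 1 - (1 - q) / 2 is tan(pi * q / 2).
    """
    d = 1
    t_upper = tan(pi * q / 2.0)
    a = d / 2.0
    b = (log(R) ** 2) * d / (2.0 * t_upper ** 2)
    return a, b


# The three 95% ranges considered in the text: [0.9, 1.1], [0.5, 2], [0.1, 10].
for R in (1.1, 2.0, 10.0):
    print(R, gamma_prior_for_range(R))
```

Under these assumptions the three ranges give b close to 0.000028, 0.00149 and 0.0164 respectively.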

We illustrate the prior simulation strategy with the priors $\mu \sim N(0, 1/c^2)$, $\Sigma^{-1}$ ~ Ga(0.5, b), with θi = μ + δi, and δi ~ N(0, Σ). We take three values of b, corresponding to 95% ranges of [0.9, 1.1], [0.5, 2] and [0.1, 10] (which correspond to b = 0.000028, 0.00149, 0.0164, respectively). Figure 3(a) gives the marginal distribution of median(p) = exp(μ)/{1 + exp(μ)}, which is close to uniform, in line with the theory outlined above. These priors are applied in Section 6.

Figure 3.

Simulations from the $N(0, 1/c^2)$ × Ga(0.5, b) prior. Panel (a) gives the marginal distribution of the median odds of Republican registration. Panels (b), (c) and (d) give the residual odds of Republican registration (across areas) with ranges [0.9, 1.1], [0.5, 2] and [0.1, 10], respectively.

4.2 Derivation of the posterior distribution

In the Bayesian approach all unknown quantities are assigned prior distributions and the posterior distribution reflects both these distributions and the information in the data that is contained in the likelihood. In the hierarchical models described in Section 4.1 two stage priors are specified, with the first stage of the prior assuming a common form for the pairs of probabilities, and the second stage assigning hyperpriors to the parameters of this form. Letting pi represent the pair of table specific probabilities, and φ a generic set of hyperparameters upon which the second stage of the prior depends, we have:

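$$\pi(p_1, \ldots, p_m, \phi \mid y_1, \ldots, y_m) \propto p(y_1, \ldots, y_m \mid p_1, \ldots, p_m, \phi) \times \pi(p_1, \ldots, p_m, \phi)$$

with

$$p(y_1, \ldots, y_m \mid p_1, \ldots, p_m, \phi) = \prod_{i=1}^{m} p(y_i \mid p_i),$$

by conditional independence of counts in different areas, and

$$\pi(p_1, \ldots, p_m, \phi) = \pi(p_1, \ldots, p_m \mid \phi) \times \pi(\phi),$$

to give the two-stage prior. Under the assumption of independence of the table-specific parameters (which would not be true if we assumed spatial dependence between these parameters), we may further write

$$\pi(p_1, \ldots, p_m \mid \phi) = \prod_{i=1}^{m} \pi(p_i \mid \phi).$$

Hence, under these assumptions, we have the posterior distribution

$$\pi(p_1, \ldots, p_m, \phi \mid y_1, \ldots, y_m) \propto \prod_{i=1}^{m} p(y_i \mid p_i) \times \prod_{i=1}^{m} \pi(p_i \mid \phi) \times \pi(\phi).$$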

Inference follows via consideration of marginal posterior distributions, and predictive distributions. For example π(pi|y1, …, ym) is the marginal posterior distribution for the pair of probabilities of Republican registration from table i.

4.3 The Posterior Predictive Distribution

We may also be interested in imputing the missing counts in area i. In particular, this is often the goal of ecological inference in the social sciences. This type of inference may be carried out via examination of the predictive distribution

$$\Pr(Y_{0i} \mid y_1, \ldots, y_m) = \int \Pr(Y_{0i} \mid p_i, N_{0i}, N_{1i}, N_i - y_i, y_i) \times \pi(p_i \mid y_1, \ldots, y_m) \, dp_i.$$

Note that we only need the distribution for Y0i since Y1i = Yi − Y0i, so that the distribution of Y1i is immediately available. The integral averages the distribution Pr(Y0i | pi, N0i, N1i, Ni − yi, yi) with respect to the posterior. The distribution of Y0i given the row and column margins and the table probabilities is a non-central (sometimes referred to as an extended) hypergeometric distribution; see, for example, McCullagh and Nelder (1989). Suppose the odds ratio in the table is given by ψi = p0i(1 − p1i)/{p1i(1 − p0i)}; then Y0i has a non-central hypergeometric distribution if its distribution is of the form

Pr(Y0i=y0iψi,N0i,N1i,Niyi,yi)={(N0iy0i)(N1iyiy0i)ψiy0iu=liui(N0iu)(N1iyiu)ψiuy0i=li,,ui,0otherwise (12)

where li = max(0, yiN1i) and ui = min(N0i, yi). Hence the predictive distribution is an overdispersed non-central hypergeometric distribution. The above predictive distribution produces (y0i/N0i, y1i/N1i) pairs that lie along the tomography line, and with flat priors on the probabilities, this distribution is uniform along the tomography line Wakefield (2004d). This provides a link with the likelihoods of King (1997) and King et al. (1999), but we emphasize that the flat distribution is with respect to the fractions, and not for the underlying probabilities.
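To make the non-central hypergeometric distribution in (12) concrete, the following Python sketch evaluates its probabilities over the support li, …, ui. The function names are hypothetical; the computation is carried out on the log scale because the binomial coefficients overflow quickly for margins of realistic size.

```python
import math

def log_choose(n, k):
    """Log of the binomial coefficient (n choose k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def odds_ratio(p0, p1):
    """Odds ratio psi = p0 (1 - p1) / [p1 (1 - p0)]."""
    return p0 * (1 - p1) / (p1 * (1 - p0))

def extended_hypergeom_pmf(n0, n1, y, psi):
    """Return (support, probabilities) of Y0 given the margins and psi."""
    lo, hi = max(0, y - n1), min(n0, y)
    support = list(range(lo, hi + 1))
    log_w = [log_choose(n0, u) + log_choose(n1, y - u) + u * math.log(psi)
             for u in support]
    top = max(log_w)  # subtract the maximum before exponentiating
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return support, [v / total for v in w]
```

With psi = 1 the function reduces to the ordinary (central) hypergeometric distribution, which gives a simple check of the implementation.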

5 Computation

Given the lack of identifiability in the posterior distribution, it is not surprising that computation for ecological inference with hierarchical models is not straightforward. In Wakefield (2004c) an “obvious” augmented data scheme was utilized.

Auxiliary Variable Sampling

For the missing data y0i, the distribution is an extended hypergeometric distribution with margins N0i, N1i, yi, Ni − yi, i = 1, …, m:

$$\Pr(Y_{0i} = y_{0i} \mid \psi_i, N_{0i}, N_{1i}, y_i) = \begin{cases} \dfrac{\binom{N_{0i}}{y_{0i}} \binom{N_{1i}}{y_i - y_{0i}} \psi_i^{y_{0i}}}{\sum_{u = l_i}^{u_i} \binom{N_{0i}}{u} \binom{N_{1i}}{y_i - u} \psi_i^{u}} & y_{0i} = l_i, \ldots, u_i, \\ 0 & \text{otherwise,} \end{cases} \tag{13}$$

where li = max(0, yi − N1i), ui = min(N0i, yi), and ψi = p0i(1 − p1i)/[p1i(1 − p0i)] is the odds ratio in the table. This discrete distribution may be sampled from in an obvious fashion, but in typical political science/sociology applications the margins are large, so generation is highly inefficient because of the summation over a large number of terms, each of which contains factorials. The mode is available in closed form, however, which may be exploited to produce an improved scheme; see Wakefield (2004c) for details.
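The “obvious” sampling scheme and the closed-form mode can be sketched in Python as follows. This is not the improved scheme of Wakefield (2004c), whose details are not reproduced here. The mode expression is the quadratic-root formula obtained by setting the ratio of successive probabilities in (13) to one, and the function names are hypothetical.

```python
import math
import random

def extended_hypergeom_mode(n0, n1, y, psi):
    """Closed-form mode of (13): largest u with Pr(u) / Pr(u - 1) >= 1."""
    lo, hi = max(0, y - n1), min(n0, y)
    a = psi - 1.0
    b = y - n1 - (n0 + y + 2) * psi
    c = (n0 + 1) * (y + 1) * psi
    root = -2.0 * c / (b - math.sqrt(b * b - 4.0 * a * c))
    return min(hi, max(lo, math.floor(root)))

def sample_y0(n0, n1, y, psi, rng):
    """Draw Y0 from (13) by inversion over the full support (slow for large margins)."""
    lo, hi = max(0, y - n1), min(n0, y)
    log_w = [-math.lgamma(u + 1) - math.lgamma(n0 - u + 1)
             - math.lgamma(y - u + 1) - math.lgamma(n1 - y + u + 1)
             + u * math.log(psi)          # terms constant in u are dropped
             for u in range(lo, hi + 1)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    target = rng.random() * sum(w)
    running = 0.0
    for u, v in zip(range(lo, hi + 1), w):
        running += v
        if target < running:
            return u
    return hi  # guard against floating-point round-off in the running sum
```

The cost of `sample_y0` grows with the width of the support, ui − li, which is the inefficiency noted above. A scheme that starts its search at `extended_hypergeom_mode` would typically visit far fewer terms.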

Posterior Probability Sampling

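Here we are required to sample from the conditional distribution of the pair (p0i, p1i), given the imputed counts y0i and y1i = yi − y0i, for i = 1, …, m. If we assume pji | aj, bj ~ Be(aj, bj), then this conditional distribution corresponds to the product

$$\mathrm{Be}(y_{0i} + a_0,\, N_{0i} - y_{0i} + b_0) \times \mathrm{Be}(y_{1i} + a_1,\, N_{1i} - y_{1i} + b_1), \tag{14}$$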

for i = 1, …, m, and is straightforward to sample from. With a normal second stage distribution for the logits, the conditional distribution is no longer of standard form but a Metropolis-Hastings step is easy.
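Under the beta second stage, the draw in (14) is a pair of independent beta variates. A minimal Python sketch using only the standard library is given below; the function name and argument order are hypothetical.

```python
import random

def draw_table_probabilities(y0, y1, n0, n1, a0, b0, a1, b1, rng):
    """Draw (p0, p1) from the product of beta distributions in (14).

    y0, y1 are the current imputed cell counts; n0, n1 the row totals;
    (a0, b0) and (a1, b1) the current second-stage beta parameters.
    """
    p0 = rng.betavariate(y0 + a0, n0 - y0 + b0)
    p1 = rng.betavariate(y1 + a1, n1 - y1 + b1)
    return p0, p1
```

Alternating this step with a draw of the missing count y0i given (p0i, p1i) gives the data-augmentation scheme for a single table; updates of the hyperparameters are not shown.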

For large tables Wakefield (2004c) proposed a normal approximation to the convolution. WinBUGS code for ecological inference using this normal approximation was given in the Appendix of Wakefield (2004b). In the JAGS (Just Another Gibbs Sampler) software (Plummer, 2009) there is a novel distribution, dsum, that may be used in the ecological inference context. It may be used in the following way (Plummer, personal communication). The specification y ~ dsum(y0, y1), where y is observed and y0, y1 are unobserved discrete-valued stochastic nodes, creates an MCMC sampler that will simultaneously update y0 and y1 while respecting the constraint y0 + y1 == y. In typical social science applications there are too many possible values of y0 (or y1) to use inversion, and so the sampler uses discrete slice sampling (Neal, 2003) as an alternative.
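The idea behind a normal approximation to the convolution can be illustrated with a short Python sketch. It assumes that the approximation matches the mean and variance of the sum of the two binomial counts, which may differ in detail from the form used in Wakefield (2004c); the function names are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of k successes, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def exact_convolution_pmf(y, n0, n1, p0, p1):
    """Exact Pr(Y = y) where Y = Y0 + Y1 with independent binomial Y0, Y1."""
    lo, hi = max(0, y - n1), min(n0, y)
    return sum(math.exp(log_binom_pmf(u, n0, p0) + log_binom_pmf(y - u, n1, p1))
               for u in range(lo, hi + 1))

def normal_approx_density(y, n0, n1, p0, p1):
    """Normal density with the same mean and variance as the convolution."""
    mean = n0 * p0 + n1 * p1
    var = n0 * p0 * (1 - p0) + n1 * p1 * (1 - p1)
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
```

The exact calculation sums over every feasible value of the missing count, whereas the approximation is a single density evaluation, which is what makes it attractive when the margins are in the thousands.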

The R package MCMCpack (Martin et al., 2009) contains a function MCMChierEI that implements the hierarchical model of Wakefield (2004c), with the normal approximation to the convolution likelihood implemented along with slice sampling. An extension of this work to the R × C table case is available in the R package RxCEcolInf (Greiner et al., 2009). This package contains functions to analyze either ecological data alone or ecological data supplemented with individual-level data, which is a very important extension, as we discuss further in Section 7. The methodological extension of the 2 × 2 table case is described in Greiner and Quinn (2009). An additional R package, eco, is also available and fits the models described in Imai et al. (2008). These models include a close relative of the parametric model of Wakefield (2004c) and a non-parametric model in which the second stage distribution is a Dirichlet process prior.

With sampling-based inference, if we can simulate from Pr(Y0i, Y1i | pi, N0i, N1i, Ni − yi, yi) then it is straightforward to simulate from the predictive distribution, once samples pi(s), s = 1, …, S, are available from π(pi | N0i, N1i, Ni − yi, yi), via

$$\frac{1}{S} \sum_{s=1}^{S} \Pr(Y_{0i}, Y_{1i} \mid p_i^{(s)}, N_{0i}, N_{1i}, N_i - y_i, y_i).$$
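The Monte Carlo average above can be sketched in Python as follows, assuming that posterior draws of (p0i, p1i) are already available from some sampler. Only the distribution of Y0i is computed, since Y1i is then determined by the column margin; the function names are hypothetical.

```python
import math

def extended_hypergeom_pmf(n0, n1, y, psi):
    """Dictionary {u: Pr(Y0 = u)} for the non-central hypergeometric distribution."""
    lo, hi = max(0, y - n1), min(n0, y)
    log_w = {u: -math.lgamma(u + 1) - math.lgamma(n0 - u + 1)
                - math.lgamma(y - u + 1) - math.lgamma(n1 - y + u + 1)
                + u * math.log(psi)
             for u in range(lo, hi + 1)}
    top = max(log_w.values())
    w = {u: math.exp(v - top) for u, v in log_w.items()}
    total = sum(w.values())
    return {u: v / total for u, v in w.items()}

def predictive_pmf(p_draws, n0, n1, y):
    """Average the conditional pmf of Y0 over posterior draws of (p0, p1)."""
    lo, hi = max(0, y - n1), min(n0, y)
    avg = {u: 0.0 for u in range(lo, hi + 1)}
    for p0, p1 in p_draws:
        psi = p0 * (1 - p1) / (p1 * (1 - p0))  # odds ratio for this draw
        for u, pr in extended_hypergeom_pmf(n0, n1, y, psi).items():
            avg[u] += pr / len(p_draws)
    return avg
```

Because each posterior draw contributes a different odds ratio, the averaged distribution is more dispersed than any single non-central hypergeometric distribution.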

6 Illustrative Analysis

We analyze the Louisiana data using the hierarchical normal model, and investigate the sensitivity of inference to the choice of the gamma prior on the precisions of the random effects distribution. Specifically, following the procedure outlined in Section 4.1 we pick ranges of: [0.1, 10], [0.2, 5], [0.5, 2], [0.9, 1.1], for the residual odds of Republican registration. These range from a prior that expects the probabilities across areas to be tightly clustered around the median, to one in which there is much larger variability. In all cases we specify N(0, (15π/(16√3))²) priors for μj, j = 1, 2.

We used the MCMChierEI function to carry out inference, and ran the Markov chains for 10⁶ iterations, after discarding 10⁵ iterations as burn-in. We summarize the accuracy of inference in terms of S0 and S1, where Sj = Σi |p̂ji − p̃ji|/p̃ji, j = 0, 1, is the summed relative absolute error of the estimated fractions p̂ji about the true fractions p̃ji. For the black probabilities we obtain S0 = 76.1, 72.8, 78.9, 81.6, while for the white probabilities S1 = 4.4, 4.2, 4.6, 4.7. The first thing to note is that inference is much less accurate for the blacks than for the whites, as we might expect from Figure 2, since there is far less information available. The empirical distribution of the residual odds may be calculated here (since the individual data are available), and gives 95% ranges for blacks and whites of [0.33, 2.3] and [0.41, 2.9], respectively, so that the second prior is most consistent with the data, which explains the above summaries of S0 and S1. Figure 4 gives the posterior medians of the fractions registered Republican for blacks and whites, versus the true fractions based on the individual data. The top row is under the prior with residual odds in the range [0.9, 1.1] and the bottom row is under the prior with range [0.1, 10]. We see that the black fractions tend to be overestimated and the white fractions underestimated. The effect of the prior is most apparent for the black fractions; under the narrower prior the estimates are virtually identical for all counties. In Figure 2 we saw that in many counties the bounds on the black fractions were wide, indicating the lack of information.
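For readers who wish to reproduce summaries of this kind, a minimal Python sketch of the accuracy measure is given below. It reads Sj as the sum over areas of the absolute error of the estimated fraction relative to the true fraction, which is an assumption about the intended definition; the function name and the example numbers are hypothetical.

```python
def summed_relative_error(estimates, truths):
    """Sum over areas of |estimate - truth| / truth.

    estimates, truths: sequences of fractions for one race, one entry per area.
    Areas with a true fraction of zero would need separate handling.
    """
    if len(estimates) != len(truths):
        raise ValueError("estimates and truths must have the same length")
    return sum(abs(e - t) / t for e, t in zip(estimates, truths))

# Hypothetical illustration with three areas (not the Louisiana data).
example = summed_relative_error([0.10, 0.25, 0.40], [0.05, 0.25, 0.50])
```

Because each term is scaled by the true fraction, small true fractions (as for black Republican registration) inflate the measure, which is one reason S0 is so much larger than S1.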

Figure 4.

(a) Estimated black fraction registered Republican (RR) versus black fraction RR under the narrow prior, (b) estimated white fraction RR versus white fraction RR under the narrow prior, (c) estimated black fraction RR versus black fraction RR under the wide prior, (d) estimated white fraction RR versus white fraction RR under the wide prior.

7 Combination of Individual and Aggregate Data for the Posterior Predictive

We now consider the situation in which survey data are available. Table 2 illustrates the notation in this case; the observed counts in area i are z0i, z1i and yi, i = 1, …, m.

Table 2.

Summary of notation for the situation in which we have both individual survey data, with sample sizes m0i and m1i, and aggregate marginal data in area i. There are Ni individuals in area i, with yi responding Y = 1, and N0i, N1i individuals with x = 0, 1, respectively.

| | Survey: Y = 0 | Survey: Y = 1 | Survey: Total | Aggregate: Y = 0 | Aggregate: Y = 1 | Aggregate: Total |
|---|---|---|---|---|---|---|
| x = 0 | | z0i | m0i | | | N0i − m0i |
| x = 1 | | z1i | m1i | | | N1i − m1i |
| Total | mi − zi | zi | mi | Ni − mi − (yi − zi) | yi − zi | Ni − mi |

When such survey data are available on a subset of individuals within particular areas, the resultant product of binomial distributions may simply be combined with the aggregate data likelihood, with each term being independent, i.e.

$$L(p_{0i}, p_{1i}) = p(z_{0i} \mid p_{0i}) \times p(z_{1i} \mid p_{1i}) \times p(y_i - z_i \mid p_{0i}, p_{1i}),$$

where zi = z0i + z1i, each of the first two terms is binomial, and the third is the convolution likelihood for the yi − zi responses with Y = 1 among the Ni − mi individuals who were not surveyed.
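The combined likelihood for one area can be sketched in Python as below. The sketch assumes that the aggregate margins include the surveyed individuals, so that the convolution term applies to the yi − zi remaining responses among the N0i − m0i and N1i − m1i unsurveyed individuals, as in Table 2. The function names are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of k successes, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def log_convolution_lik(y, n0, n1, p0, p1):
    """Log convolution likelihood of a column total y with row totals n0, n1."""
    lo, hi = max(0, y - n1), min(n0, y)
    terms = [log_binom_pmf(u, n0, p0) + log_binom_pmf(y - u, n1, p1)
             for u in range(lo, hi + 1)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def log_combined_lik(z0, m0, z1, m1, y, n0, n1, p0, p1):
    """Log of L(p0, p1): two survey binomials times the residual convolution."""
    survey = log_binom_pmf(z0, m0, p0) + log_binom_pmf(z1, m1, p1)
    y_rest = y - (z0 + z1)            # Y = 1 responses outside the survey
    aggregate = log_convolution_lik(y_rest, n0 - m0, n1 - m1, p0, p1)
    return survey + aggregate
```

Setting m0 = m1 = 0 recovers the ecological-only likelihood, while m0 = n0 and m1 = n1 leaves only the two binomial terms, so the same function covers both extremes.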

Wakefield (2004c) illustrated the benefits of adding individual (survey) data to the ecological data to gain identifiability. A number of discussants to the paper (Best, 2004; Jackson, 2004; Salway, 2004) suggested that smaller sample sizes may be all that is needed, and that the design of the survey is an important topic. Here we touch upon these issues by investigating a number of scenarios.

Rows two through five of Table 3 report S0 and S1 for survey samples within each area of sizes 1000, 500, 300, 100, respectively. Two sets of results are given in each of these rows: the first set corresponds to the use of the individual-level data only, and the second to the combined individual-ecologic data. All results were obtained using the AnalyzeWithExitPoll function within the RxCEcolInf package. Each MCMC run began with a burn-in of 10⁵ iterations, with 10³ samples collected subsequently over 10⁶ iterations. In all cases inference is greatly improved when individual-level data supplement the ecologic data. Examination of the resultant estimates of the probabilities revealed that for low sample sizes bias existed in the estimates, but this was more than offset by the reduction in variance.

Table 3.

Summaries for the combined survey/ecologic setting. Rows two through five of the main body show the effects of adding samples of the stated sizes to all areas. Error measures associated with the individual data only are also given for these rows. The last three rows report situations in which individual samples of size 500 were sampled from the reported proportion of areas.

| Data Source | Individual S0 | Individual S1 | Combined S0 | Combined S1 |
|---|---|---|---|---|
| Ecologic Only | | | 78.9 | 4.6 |
| 1000 Samples | 22.1 | 4.0 | 16.7 | 0.9 |
| 500 Samples | 26.2 | 6.1 | 15.9 | 0.8 |
| 300 Samples | 40.4 | 8.1 | 21.7 | 1.1 |
| 100 Samples | 73.1 | 13.4 | 31.1 | 1.7 |
| Samples in Half | | | 22.0 | 1.3 |
| Samples in Quarter | | | 24.9 | 1.6 |
| Samples in Eighth | | | 29.5 | 1.6 |

Viewed in the opposite direction to the emphasis of this paper, an important observation is that inference based on the individual data only can be greatly improved by supplementing the survey data with ecologic information.

In the second stage of our investigation we fixed the sample size at 500 but only sampled 1/2, 1/4 and 1/8 of areas. Again we see that in all cases inference is hugely improved over the ecologic only analysis. In fact, note that with samples of size 500 from 8/64 tables (4000 sample size), we achieve better results than with samples of 300 from 64/64 tables (19,200 sample size). Given the additional cost effectiveness of sampling within fewer tables, this result implies the potential for considerable savings from sampling design in this context.

In Figure 5 we plot the estimates versus the “true” fractions. In the top row these comparisons are for the areas with survey data, while in the second row the comparisons are for areas with no survey data. Although there is clearly bias in the estimates, for blacks in particular, it is far less than in the ecologic only case (compare with Figure 4). The improvement in estimation for areas with no survey data is due to the hierarchical model, which is common to all areas and so allows the areas with surveys to positively impact the areas with no data.

Figure 5.

Analysis based on subsamples in the first 16 areas only: (a) Estimated black fraction registered Republican (RR) versus black fraction RR in areas with survey data, (b) estimated white fraction RR versus white fraction RR in areas with survey data, (c) estimated black fraction RR versus black fraction RR in areas with no survey data, (d) estimated white fraction RR versus white fraction RR in areas with no survey data.

8 Discussion

In this paper, we have shown that it can be unreliable to estimate the probabilities pji or the fractions p̃ji with ecological data. We have also shown that the inclusion of individual level data in the analysis can mitigate these problems, and that only a small amount of data from a small number of tables may be necessary.

Furthermore, while we have focused on the estimation of pji or p̃ji in this paper, it is straightforward to see that the estimation of contextual parameters is not possible from ecological data alone. Let pji be the proportion of individuals of race j in area i and suppose the individual-level model is:

$$p_{ji} = a_{ji} + b_{ji} x_i$$

so that we have both effects due to race in area i, a0i and a1i, and contextual effects, b0i and b1i. Upon averaging across individuals to give ecological data, we obtain the marginal area-level probability

$$p_i = x_i \times p_{0i} + (1 - x_i) p_{1i} = a_{1i} + x_i (a_{0i} + b_{1i} - a_{1i}) + x_i^2 (b_{0i} - b_{1i})$$

which clearly shows the identifiability problem. With a non-linear model (for example a logistic model), the parameters become identifiable, but only due to the non-linearity and different non-linear forms will give different answers.
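The identifiability problem in the linear model can be checked numerically. In the Python sketch below, shifting a0 by +c and both b0 and b1 by −c leaves the three coefficients a1, a0 + b1 − a1 and b0 − b1 unchanged, so the marginal probability is the same for every x. The parameter values are hypothetical and chosen only so that all probabilities stay within (0, 1).

```python
def marginal_probability(x, a0, a1, b0, b1):
    """Area-level probability x * p0 + (1 - x) * p1 with p_j = a_j + b_j * x."""
    p0 = a0 + b0 * x
    p1 = a1 + b1 * x
    return x * p0 + (1 - x) * p1

def shift_parameters(a0, a1, b0, b1, c):
    """A different individual-level model with the same marginal probability."""
    return a0 + c, a1, b0 - c, b1 - c

original = (0.2, 0.5, 0.1, -0.1)            # hypothetical (a0, a1, b0, b1)
alternative = shift_parameters(*original, c=0.05)

# Largest difference in the marginal probability over a grid of x values.
max_gap = max(abs(marginal_probability(x / 20, *original)
                  - marginal_probability(x / 20, *alternative))
              for x in range(21))
```

Here `max_gap` is zero up to floating-point error even though the race effects and contextual effects differ between the two parameter sets, so no amount of aggregate data can distinguish them.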

Data and code for all of the examples of the paper are available at: http://faculty.washington.edu/jonno/cv.html

In Section 7 we showed the benefits of small amounts of individual-level data. In the simulations we sampled individuals without replacement from black and white populations to produce a representative sample. In practice selection bias may be present, and without further information on the nature of this bias, a sensitivity study is the only recourse. However, within the framework we have described such a study is straightforward.

Acknowledgments

The authors would like to thank Kevin Quinn and Sebastien Haneuse for useful discussions. The second author was supported by grant R01 CA095994 from the National Institutes of Health.

References

  1. Achen CH, Shively WP. Cross-level Inference. University of Chicago Press; 1995.
  2. Adolph C, King G. Analyzing second-stage ecological regressions: Comment on Herron and Shotts. Political Analysis. 2003;11:65–76.
  3. Adolph C, King G, Herron MC, Shotts KW. A consensus on second-stage analyses in ecological inference models. Political Analysis. 2003;11:86–94.
  4. Benoit K, King G. EzI: An Easy Program for Ecological Inference, Version 2.02. Boston: Department of Government, Harvard University; 1998.
  5. Best NG. Discussion of: “Ecological inference for 2 × 2 tables”. Journal of the Royal Statistical Society, Series A. 2004;167:426–427.
  6. Burden BC, Kimball DC. A new approach to the study of ticket splitting. American Political Science Review. 1998;92:533–544.
  7. Chambers RL, Steel DG. Simple methods for ecological inference in 2 × 2 tables. Journal of the Royal Statistical Society, Series A. 2001;164:175–92.
  8. Cho WKT, Gaines BJ. The Limits of Ecological Inference: The Case of Split-Ticket Voting. American Journal of Political Science. 2004;48:152–171.
  9. Cho WKTam. Iff the assumption fits…: A comment on the King ecological inference solution. Political Analysis. 1998;7:143–163.
  10. Cleave N, Brown PJ, Payne CD. Evaluation of methods for ecological inference. Journal of the Royal Statistical Society, Series A. 1995;158:55–72.
  11. Duncan OD, Davis B. An alternative to ecological correlation. American Sociological Review. 1953;18:665–6.
  12. Firebaugh G. Commentary: ‘Is the social world flat? W.S. Robinson and the ecologic fallacy’. International Journal of Epidemiology. 2009;38:368–370. doi: 10.1093/ije/dyn355.
  13. Freedman DA. Ecological inference and the ecological fallacy. In: Smelser NJ, Baltes PB, editors. International Encyclopedia of the Social and Behavioural Sciences. 6. New York: Elsevier; 2001. pp. 4027–4030.
  14. Freedman DA, Klein SP, Ostland M, Roberts MR. A solution to the ecological inference problem (book review). Journal of the American Statistical Association. 1998;93:1518–1522.
  15. Freedman DA, Klein SP, Sacks J, Smyth CA, Everett CG. Ecological regression and voting rights. Evaluation Review. 1991;15:673–711.
  16. Freedman DA, Ostland M, Roberts MR, Klein SP. Reply to G. King. Journal of the American Statistical Association. 1999;94:355–357.
  17. Gelman A, Park DK, Ansolabehere S, Price PN, Minnite LC. Models, assumptions and model checking in ecological regressions. Journal of the Royal Statistical Society, Series A. 2001;164:101–118.
  18. Goodman LA. Ecological regressions and the behavior of individuals. American Sociological Review. 1953;18:663–4.
  19. Goodman LA. Some alternatives to ecological correlation. American Journal of Sociology. 1959;64:610–25.
  20. Greenland S, Robins J. Ecological studies: biases, misconceptions and counterexamples. American Journal of Epidemiology. 1994;139:747–760. doi: 10.1093/oxfordjournals.aje.a117069.
  21. Greiner DJ, Baines P, Quinn KM. Package ‘RxCEcolInf’. 2009.
  22. Greiner DJ, Quinn KM. R × C ecological inference: bounds, correlations, flexibility and transparency of assumptions. Journal of the Royal Statistical Society, Series A. 2009;172:67–81.
  23. Haneuse S, Wakefield JC. Ecological inference incorporating spatial dependence. In: King G, Rosen O, Tanner M, editors. Ecological Inference: New Methodological Strategies. Chapter 12. Cambridge: Cambridge University Press; 2004. pp. 266–302.
  24. Herron MC, Shotts KW. Cross-contamination in EI-R: Reply. Political Analysis. 2003a;11(1):77–85.
  25. Herron MC, Shotts KW. Using ecological inference point estimates as dependent variables in second-stage linear regressions. Political Analysis. 2003b;11:44–64.
  26. Herron MC, Shotts KW. Logical inconsistency in EI-based second-stage regressions. American Journal of Political Science. 2004;48(1):172–183.
  27. Imai K, Lu Y, Strauss A. Bayesian and likelihood inference for 2 × 2 ecological tables: an incomplete data approach. Political Analysis. 2008;16:41–69.
  28. Jackson C. Discussion of: “Ecological inference for 2 × 2 tables”. Journal of the Royal Statistical Society, Series A. 2004;167:430.
  29. King G. A Solution to the Ecological Inference Problem. Princeton: Princeton University Press; 1997.
  30. King G. The future of ecological inference research: A comment on Freedman et al. Journal of the American Statistical Association. 1999;94:352–355.
  31. King G, Rosen O, Tanner M. Information in Ecological Inference: An Introduction. In: King G, Rosen O, Tanner M, editors. Ecological Inference: New Methodological Strategies. Cambridge: Cambridge University Press; 2004. pp. 1–12.
  32. King G, Rosen O, Tanner MA. Binomial-beta hierarchical models for ecological inference. Sociological Methods and Research. 1999;28:61–90.
  33. Lasserre V, Guihenneuc-Jouyaux C, Richardson S. Biases in ecological studies: utility of including within-area distribution of confounders. Statistics in Medicine. 2000;19:45–59. doi: 10.1002/(sici)1097-0258(20000115)19:1<45::aid-sim276>3.0.co;2-5.
  34. Martin AD, Quinn KM, Park JH. Package ‘MCMCpack’. 2009.
  35. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman and Hall; 1989.
  36. Neal RM. Slice sampling (with discussion). Annals of Statistics. 2003;31:705–767.
  37. Oakes JM. Commentary: Individual, ecological and multilevel fallacies. International Journal of Epidemiology. 2009;38:361–368. doi: 10.1093/ije/dyn356.
  38. Plackett RL. The marginal totals of a 2 × 2 table. Biometrika. 1977;64:37–42.
  39. Plummer M. JAGS Version 1.0.3 Manual. Technical report. 2009.
  40. Plummer M, Clayton D. Estimation of population exposure. Journal of the Royal Statistical Society, Series B. 1996;58:113–126.
  41. Prentice RL, Sheppard L. Aggregate data studies of disease risk factors. Biometrika. 1995;82:113–25.
  42. Richardson S, Montfort C. Ecological correlation studies. In: Elliott P, Wakefield JC, Best NG, Briggs D, editors. Spatial Epidemiology: Methods and Applications. Oxford: Oxford University Press; 2000. pp. 205–220.
  43. Richardson S, Stucker I, Hémon D. Comparison of relative risks obtained in ecological and individual studies: some methodological considerations. International Journal of Epidemiology. 1987;16:111–20. doi: 10.1093/ije/16.1.111.
  44. Robinson WS. Ecological correlations and the behavior of individuals. American Sociological Review. 1950;15:351–57.
  45. Salway R. Discussion of: “Ecological inference for 2 × 2 tables”. Journal of the Royal Statistical Society, Series A. 2004;167:438–439.
  46. Salway R, Wakefield JC. A comparison of approaches to ecological inference in epidemiology, political science and sociology. In: King G, Rosen O, Tanner M, editors. Ecological Inference: New Methodological Strategies. Cambridge University Press; 2004.
  47. Selvin HC. Durkheim’s ‘suicide’ and problems of empirical research. American Journal of Sociology. 1958;63:607–619.
  48. Simpson EH. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B. 1951;13:238–241.
  49. Skene AM, Wakefield JC. Hierarchical models for multi-centre binary response studies. Statistics in Medicine. 1990;9:919–929. doi: 10.1002/sim.4780090808.
  50. Steel DG, Beh EJ, Chambers RL. The information in aggregate data. In: King G, Rosen O, Tanner M, editors. Ecological Inference: New Methodological Strategies. Cambridge: Cambridge University Press; 2004.
  51. Subramanian SV, Jones K, Kaddour A, Krieger N. Response: The value of a historically informed multilevel analysis of Robinson’s data. International Journal of Epidemiology. 2009a;38:370–373. doi: 10.1093/ije/dyn359.
  52. Subramanian SV, Jones K, Kaddour A, Krieger N. Revisiting Robinson: the perils of individualistic and ecologic fallacy. International Journal of Epidemiology. 2009b;38:342–360. doi: 10.1093/ije/dyn359.
  53. Wakefield J. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society, Series A. 2004a;167:385–445.
  54. Wakefield JC. Prior and likelihood choices in the analysis of ecological data. In: King G, Rosen O, Tanner M, editors. Ecological Inference: New Methodological Strategies. Chapter 1. Cambridge: Cambridge University Press; 2004b. pp. 13–50.
  55. Wakefield J. Ecologic studies revisited. Annual Review of Public Health. 2008;29:75–90. doi: 10.1146/annurev.publhealth.29.020907.090821.
  56. Wakefield JC. Multi-level modelling, the ecologic fallacy, and hybrid study designs. International Journal of Epidemiology. 2009;38:330–336. doi: 10.1093/ije/dyp179.
  57. Wakefield JC. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society, Series A. 2004c;167:385–445.
  58. Wakefield JC. Response to the discussion of “Ecological inference for 2 × 2 tables”. Journal of the Royal Statistical Society, Series A. 2004d;167:440–445.
