A default prior distribution for contingency tables with dependent factor levels

Antony M Overstall; Ruth King

doi:10.1016/j.stamet.2013.08.007

2014 Jan;16:90–99. doi: 10.1016/j.stamet.2013.08.007

PMCID: PMC3990456 PMID: 24748854

Abstract

A default prior distribution is proposed for the Bayesian analysis of contingency tables. The prior is specified to allow for dependence between levels of the factors. Different dependence structures are considered, including conditional autoregressive and distance correlation structures. To demonstrate the prior distribution, a dataset is considered which involves estimating the number of injecting drug users in the eleven National Health Service board regions of Scotland using an incomplete contingency table where the dependence structure relates to geographical regions.

Keywords: Contingency table, Dependence structure, Default prior

1. Introduction

Contingency tables (e.g. [1]) are formed when a population is cross-classified according to a series of categories (or factors). Each cell count of the table gives the number of individuals observed under the corresponding cross-classification. The aim of forming such a table is to summarise the data, typically with a view to identifying interactions or relationships between the factors.

The standard statistical approach to modelling such interactions is the log–linear model (e.g. [1, Chapter 7]). In this case, the logarithm of the expected cell count is equal to a linear predictor depending on the main effect terms and interaction terms between the factors. Each combination of interaction terms defines its own log–linear model, so that the identification of the non-zero interaction terms translates to an exercise in model comparison. Additionally, incomplete contingency tables with missing cell counts can be used to estimate the size of closed populations [4], where some of the factors correspond to sources that have either observed or not observed individuals in the population.

In this paper, we consider the case where the levels of one or more of the factors may be dependent on one another. An obvious example is when one of the factors has levels corresponding to geographical regions or locations which may be dependent due to their geographical proximity. In these cases, we may expect the parameters of the log–linear model to have some dependence structure. Bayesian analysis of contingency tables is common (e.g. [3], [13], [5]) and is the approach taken here. One feature of the Bayesian approach is that prior information on the interaction terms can be incorporated through the prior distribution. We take the position of having weak prior information on the magnitude of the log–linear parameters but wish to incorporate the information provided by the dependence structure mentioned above. In the case of weak prior information and model uncertainty, care must be taken when specifying prior distributions due to Lindley’s paradox (e.g. [16, pp. 77–79]). There have been several attempts in the literature (e.g. [3], [15], [18]) to specify “default” prior distributions that can be applied for log–linear models under model uncertainty. We extend these approaches by developing a default prior that can take account of the dependence structure between the factor levels and can be seen as a generalisation of the above mentioned priors. The proposed prior is constructed by conditioning on the constraints on the parameters which are introduced in contingency table analysis to maintain identifiability of the parameters.

This paper is organised as follows. In Section 2 we set out our notation and briefly describe log–linear models. In Section 3 we derive our proposed default prior distribution including descriptions of different dependence structures. Finally, we apply our proposed prior to a real data application in Section 4, which involves estimating the number of injecting drug users in Scotland. Here, one of the factors corresponds to geographical regions, and we wish to take account of the possible dependence structure that may exist for the regions.

2. Notation and log–linear models

2.1. Notation

We assume that there are a total of $c$ factors such that each factor $k = 1, \dots, c$ has $l_{k}$ levels. The corresponding contingency table has $n = \prod_{k = 1}^{c} l_{k}$ cells. Let $y$ be the $n \times 1$ vector of cell counts with elements denoted as $y_{i}$, where $i = (i_{1}, \dots, i_{c})$ identifies the combination of factor levels that cross-classify the cell $i$. Let $S$ be the set of all $n$ cross-classifications, so that

S = \{(i_{1}, \dots, i_{c}) : i_{k} \in \{1, \dots, l_{k}\}, k = 1, \dots, c\} .

Finally, let $N = \sum_{i \in S} y_{i}$ be the total population size. In the case of an incomplete contingency table, $N$ is unknown, since elements of $y$ are unknown.

As a pedagogic example that we use for illustrative purposes throughout, suppose that there are three factors used to cross-classify a population of hospital patients: age (2 levels: young; old), hypertension (2 levels: no; yes) and region (3 levels: A; B; C). In this example, $c = 3$ , where $l_{1} = 2, l_{2} = 2$ and $l_{3} = 3$ , and the three factors (age, hypertension and region) have been labelled 1, 2 and 3, respectively. It follows that there are $n = 2 \times 2 \times 3 = 12$ cells.
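As a small illustration of this notation, the set $S$ for the pedagogic example can be enumerated directly. The following is a minimal sketch; the variable names are ours and are not part of the paper's notation.

```python
from itertools import product
from math import prod

# Numbers of levels l_1, l_2, l_3 for age, hypertension and region.
levels = [2, 2, 3]

# S: all combinations (i_1, ..., i_c) with i_k in {1, ..., l_k}.
S = list(product(*(range(1, l + 1) for l in levels)))

# Total number of cells, n = l_1 * l_2 * l_3.
n = prod(levels)

print(n, len(S))   # both are 12
print(S[:4])       # the last factor (region) varies fastest
```

With this enumeration the last factor varies fastest, which is the ordering convention used for the region factor in the examples of Section 3.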

2.2. Log–linear models

We now briefly describe log–linear models and initially assume that the form of the log–linear model is known, i.e. it is known which interactions are present. We extend to the case of model uncertainty later in this section. Let $η_{i}$ denote the linear predictor associated with cell $i \in S$ , where

η_{i} = ϕ + z_{i}^{T} θ,

with $ϕ \in R$ denoting the intercept term, $θ$ the $q \times 1$ vector of log–linear parameters (i.e. the main effects and interaction terms) and $z_{i}$ the $q \times 1$ vector of zeros and ones identifying which elements of $θ$ are applicable to cell $i \in S$ .

For identifiability, certain elements of $θ$ are constrained, e.g. by sum-to-zero, or corner-point constraints, so we can rewrite $η_{i}$ as

η_{i} = ϕ + x_{i}^{T} β,

where $β \in R^{p}$ is the $p \times 1$ vector of unconstrained regression parameters, and $x_{i}$ is the $p \times 1$ vector which identifies which elements of $β$ correspond to cell $i \in S$ , with $p < q$ .

Finally, let $η$ be the $n \times 1$ vector with elements $η_{i}$ , and let $X$ be the $n \times p$ model matrix with rows $x_{i}$ . Then we can write

η = ϕ 1_{n} + X β,

where $1_{n}$ denotes the $n \times 1$ vector of ones.

For the statistical analysis of contingency tables, it is common to assume that

y_{i} | ϕ, β \sim Poisson (λ_{i}),

(1)

independently, where $\log λ_{i} = η_{i}$.
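To make the model concrete, the sketch below evaluates the linear predictor and simulates cell counts for a hypothetical $2 \times 2$ table with two main effects under sum-to-zero constraints. The table, the parameter values and the random seed are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2 x 2 table (n = 4 cells): one column per main effect,
# coded +1 / -1 under sum-to-zero constraints.
X = np.array([
    [ 1.0,  1.0],
    [ 1.0, -1.0],
    [-1.0,  1.0],
    [-1.0, -1.0],
])
phi = 2.0                      # intercept (illustrative value)
beta = np.array([0.3, -0.5])   # regression parameters (illustrative values)

eta = phi * np.ones(X.shape[0]) + X @ beta   # eta = phi 1_n + X beta
lam = np.exp(eta)                            # log(lambda_i) = eta_i
y = rng.poisson(lam)                         # independent Poisson cell counts
```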

In practice, we typically do not know the form of the log–linear model. This is equivalent to not knowing the elements of $z_{i}$ and $x_{i}$ , or the columns of $X$ . Let $M$ be the set of competing log–linear models which are indexed by $m \in M$ . Associated with each log–linear model are $z_{i}^{(m)}, x_{i}^{(m)}, X^{(m)}, θ^{(m)}$ and $β^{(m)}$ , where $z_{i}^{(m)}$ and $θ^{(m)}$ are $q^{(m)} \times 1$ vectors, $x_{i}^{(m)}$ and $β^{(m)}$ are $p^{(m)} \times 1$ vectors, and $X^{(m)}$ is an $n \times p^{(m)}$ matrix.

In the next section, we derive a default prior distribution for $β^{(m)} | m$ . For the intercept, $ϕ$ , we assume a prior given by $π (ϕ) \propto 1$ . Although this prior is improper, the resulting posterior is still proper [5]. This prior will not cause a problem under Lindley’s paradox since it is present for all models in $M$ [16, p. 174].

3. A default prior distribution for $β^{(m)} | m$

3.1. Derivation

In this section we develop a default prior distribution for $β^{(m)} | m$ . For notational simplicity, we drop the dependency on the model $m$ by removing the superscript $(m)$ .

Suppose that there are a total of $T$ log–linear terms and $β = (β_{1}, \dots, β_{T})$ where $β_{t}$ , for $t = 1, \dots, T$ , is the $p_{t} \times 1$ vector corresponding to the regression parameters for the main effect or interaction term $t$ . Similarly let $θ_{t}$ denote the corresponding $q_{t} \times 1$ vector of log–linear parameters, for $t = 1, \dots, T$ .

Let $R_{t}$ be the set of $f$ main effect terms that define the $f$ -way interaction $β_{t}$ . Dellaportas and Forster [3] refer to $R_{t}$ as the constituent terms of the interaction. Note that $q_{t} = \prod_{j \in R_{t}} q_{j}$ and if $β_{t}$ corresponds to a main effect then $R_{t}$ has only one element, i.e. $t$ . Consider the pedagogic example, from Section 2.1, and $t$ corresponding to the 2-way interaction between age and region so that $q_{t} = 6$ and $p_{t} = 2$ . The constituent terms, $R_{t}$ , have two elements: the terms corresponding to age and region.

We initially consider deriving the default prior distribution under sum-to-zero constraints. We describe how the prior can be extended to any system of constraints in Section 3.4. Following Dellaportas and Forster [3] we assume that $β$ has a multivariate normal distribution with mean zero, where $β_{r}$ and $β_{t}$ are independent for $r, t = 1, \dots, T$ and $r \neq t$ . Thus, all that remains is to specify the $p_{t} \times p_{t}$ covariance matrix for each $β_{t}$ , for $t = 1, \dots, T$ .

The elements of $θ_{t}$ are subject to constraints and can be written in the form

θ_{t} = A_{t} β_{t},

(2)

where $A_{t}$ is a $q_{t} \times p_{t}$ matrix defining the constraints. Under sum-to-zero constraints, $A_{t}$ can be written as

A_{t} = P_{t} (\begin{matrix} I_{p_{t}} \\ C_{t} \end{matrix}),

(3)

where $I_{p_{t}}$ is the $p_{t} \times p_{t}$ identity matrix, $C_{t}$ is a $(q_{t} - p_{t}) \times p_{t}$ matrix and $P_{t}$ is a $q_{t} \times q_{t}$ permutation matrix. For $t$ corresponding to the age and region interaction in the pedagogic example,

θ_{t} = (\begin{matrix} θ_{t 1} \\ θ_{t 2} \\ θ_{t 3} \\ θ_{t 4} \\ θ_{t 5} \\ θ_{t 6} \end{matrix}), A_{t} = (\begin{matrix} 1 & 0 \\ 0 & 1 \\ - 1 & - 1 \\ - 1 & 0 \\ 0 & - 1 \\ 1 & 1 \end{matrix}), C_{t} = (\begin{matrix} - 1 & 0 \\ 0 & - 1 \\ 1 & 1 \\ - 1 & - 1 \end{matrix}),

P_{t} = (\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{matrix}) .

The elements of $θ_{t}$ are ordered so that the factor levels of region vary the fastest.
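The matrices of this example can be checked numerically. The sketch below confirms that $A_{t} = P_{t} (I_{p_{t}}; C_{t})$, that $θ_{t} = A_{t} β_{t}$ sums to zero over each margin, and that $P_{t}^{T} θ_{t}$ stacks $β_{t}$ on top of $C_{t} β_{t}$. The value chosen for $β_{t}$ is arbitrary.

```python
import numpy as np

A = np.array([[1, 0], [0, 1], [-1, -1], [-1, 0], [0, -1], [1, 1]])
C = np.array([[-1, 0], [0, -1], [1, 1], [-1, -1]])
P = np.array([
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
])

# A_t = P_t (I; C_t), as in (3).
stacked = np.vstack([np.eye(2, dtype=int), C])
assert np.array_equal(P @ stacked, A)

# An arbitrary beta_t and the implied constrained parameters theta_t.
beta = np.array([0.7, -0.2])
theta = A @ beta

# Rows index age, columns index region (region varies fastest).
theta_table = theta.reshape(2, 3)
assert np.allclose(theta_table.sum(axis=0), 0.0)  # sum over age
assert np.allclose(theta_table.sum(axis=1), 0.0)  # sum over region

# P_t^T theta_t = (beta_t; C_t beta_t), as in (4).
assert np.allclose(P.T @ theta, np.concatenate([beta, C @ beta]))
```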

Initially, ignoring the constraints that are applied to $θ_{t}$ , we assume that the distribution of $θ_{t}$ is

θ_{t} | σ_{t}^{2}, D_{t} \sim N (0, σ_{t}^{2} D_{t}),

where $σ_{t}^{2} > 0$ and $D_{t}$ is a $q_{t} \times q_{t}$ positive-definite scale matrix. The off-diagonal elements of $D_{t}$ control the dependence structure or correlation between the elements of the constrained parameters, $θ_{t}$ , corresponding to different factor levels.

It follows from (2), (3) that

P_{t}^{T} θ_{t} = (\begin{matrix} β_{t} \\ C_{t} β_{t} \end{matrix}) .

(4)

Let

γ_{t} = (\begin{matrix} γ_{t}^{(1)} \\ γ_{t}^{(2)} \end{matrix}) = P_{t}^{T} θ_{t}

be the permuted elements of $θ_{t}$ according to the inverse permutation $P_{t}^{- 1} = P_{t}^{T}$, so that $γ_{t}^{(1)} = β_{t}$ and $γ_{t}^{(2)} = C_{t} β_{t}$. The prior distribution for $β_{t}$ is the conditional distribution of $γ_{t}^{(1)}$ (which is $β_{t}$) given that $γ_{t}^{(2)} = C_{t} γ_{t}^{(1)}$, i.e. we find the distribution of $β_{t}$ from (4) subject to the constraints. It can be shown (see Appendix A) that

β_{t} | σ_{t}^{2}, D_{t} \sim N (0, σ_{t}^{2} Σ_{t}),

(5)

where

Σ_{t} = {(A_{t}^{T} D_{t}^{- 1} A_{t})}^{- 1} .

(6)

In the next two sections we consider $D_{t}$. It may be that $D_{t}$ is completely specified a priori. The most plausible situation for this is when we assume independence between the levels of this term, so that $D_{t} = I_{q_{t}}$. We consider this case in Section 3.2. In Section 3.3 we consider the case where $D_{t}$ is unknown due to its dependence on some unknown hyperparameter which controls the strength of correlation between the elements of $θ_{t}$.

3.2. Independent correlation structure

Suppose we assume that the factor levels are independent, i.e. $D_{t} = I_{q_{t}}$ , so that

Σ_{t} = {(A_{t}^{T} A_{t})}^{- 1} .

Denote by $X_{t}$ the $n \times p_{t}$ matrix formed by the columns of $X$ corresponding to $β_{t}$. Since $X_{t}$ is a row permutation of the matrix formed by stacking $n / q_{t}$ copies of $A_{t}$ to form an $n \times p_{t}$ matrix, it follows that

X_{t}^{T} X_{t} = \frac{n}{q_{t}} A_{t}^{T} A_{t},

and therefore $Σ_{t} = (n / q_{t}) {(X_{t}^{T} X_{t})}^{- 1}$. The corresponding prior distribution for $β_{t}$ is

β_{t} | σ_{t}^{2} \sim N (0, \frac{σ_{t}^{2} n}{q_{t}} {(X_{t}^{T} X_{t})}^{- 1}) .

If $σ_{t}^{2} = g q_{t} / n$, then since (under sum-to-zero constraints) $X_{t}^{T} X_{r} = 0$ for all $t \neq r$ [14], so that $X^{T} X$ is block diagonal, it follows that the prior distribution for $β = (β_{1}, \dots, β_{T})$ is

β | g \sim N (0, g {(X^{T} X)}^{- 1}) .

(7)

If $g > 0$ is unknown and given a prior distribution, then (7) is a hierarchical prior distribution that is identical to the generalised hyper-g prior proposed by Sabanes-Bové and Held [18] for generalised linear models (GLMs) when applied to log–linear models. If, instead, $g$ is fixed then (7) is the default prior distribution considered by Dellaportas and Forster [3] who advocate setting $g = k n$ for some constant $k$ , which represents the number of units of prior information. Ntzoufras et al. [15] use $k = 1$ under their unit information prior for GLMs when applied to log–linear models.
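The two identities used in this section can be checked on the pedagogic example of Section 2.1. The sketch below builds the columns of the model matrix for the age and region interaction and for the two main effects, with cells ordered so that region varies fastest. The construction of the model matrix columns is our own and is only one way of coding the sum-to-zero constraints.

```python
from itertools import product
import numpy as np

# Sum-to-zero coding: rows of A_t are indexed by (age, region), region fastest.
A_t = np.array([[1, 0], [0, 1], [-1, -1], [-1, 0], [0, -1], [1, 1]])
a_age = np.array([[1], [-1]])
a_region = np.array([[1, 0], [0, 1], [-1, -1]])

# Cells (age, hypertension, region), zero-based, last factor fastest.
cells = list(product(range(2), range(2), range(3)))
n, q_t = len(cells), A_t.shape[0]

X_t = np.array([A_t[3 * age + region] for age, hyp, region in cells])
X_age = np.array([a_age[age] for age, hyp, region in cells])
X_region = np.array([a_region[region] for age, hyp, region in cells])

# X_t^T X_t = (n / q_t) A_t^T A_t.
lhs = X_t.T @ X_t
rhs = (n / q_t) * (A_t.T @ A_t)

# Scale matrix under independence, Sigma_t = (A_t^T A_t)^{-1}.
Sigma_t = np.linalg.inv(A_t.T @ A_t)

# Columns belonging to different terms are orthogonal.
cross_age = X_age.T @ X_t
cross_region = X_region.T @ X_t
```

Here $n / q_{t} = 2$ because each age and region combination appears once for each of the two hypertension levels.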

3.3. General correlation structure

We now consider terms, $t$, whose constituent terms, $R_{t}$, contain factors with correlated levels, so that $D_{t}$ depends on some unknown hyperparameter $τ$. This hyperparameter, $τ$, controls the strength of correlation through some structure imposed on $D_{t}$. Initially consider a main effect term $t$. In this paper we focus on the case where the factor levels correspond to geographical regions or locations and propose two structural forms for $D_{t}$. However, there exist many possible applications with correlated factor levels, and other correlation structures can be used depending on the nature of the factor levels.

1. Conditional autoregressive structure
Suppose that the $q_{t}$ levels correspond to regions. Let $G$ be the $q_{t} \times q_{t}$ neighbourhood matrix with $ij$th element
$G_{i j} = \begin{cases} 1 & \text{if regions } i \neq j \text{ are neighbours}, \\ 0 & \text{otherwise}, \end{cases}$
for $i, j = 1, \dots, q_{t}$. Then for the conditional autoregressive (CAR) structure (e.g. [2]),
$D_{t} = {(I_{q_{t}} - τ G)}^{- 1},$
where $τ$ determines the strength of spatial correlation for the constrained parameters. To ensure that $D_{t}$ is positive-definite, the hyperparameter $τ$ must lie in the interval $(τ_{\min}, τ_{\max}) = (e_{q_{t}}^{- 1}, e_{1}^{- 1})$, where $e_{1}$ and $e_{q_{t}}$ are the maximum and minimum eigenvalues of $G$, respectively.
2. Distance correlation structure
Suppose the $q_{t}$ levels correspond to locations such as cities. Then the $ij$th element of $D_{t}$ is given by a correlation function that depends on the distance, $d_{i j}$, between locations $i$ and $j$, and on $τ$. For example, the Gaussian correlation function gives
$D_{t, i j} = \exp (- \frac{d_{i j}^{2}}{2 τ^{2}}),$
where, again, $τ > 0$ controls the strength of correlation.

Note that in both examples, the hyperparameter, $τ$ , is not actually a correlation coefficient; it merely controls the strength of correlation. We need to specify a prior distribution for $τ$ . This will depend on the application.
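The two structures above can be sketched in a few lines. The function names, the four-region chain and the three locations used below are hypothetical and serve only to illustrate the formulas; they are not taken from the application in Section 4.

```python
import numpy as np

def car_scale_matrix(G, tau):
    """D_t = (I - tau * G)^{-1} for a symmetric 0/1 neighbourhood matrix G."""
    G = np.asarray(G, dtype=float)
    return np.linalg.inv(np.eye(G.shape[0]) - tau * G)

def car_tau_interval(G):
    """Open interval (1 / e_min, 1 / e_max) of tau for which D_t is positive-definite.

    Assumes G has at least one pair of neighbours, so that e_min < 0 < e_max.
    """
    e = np.linalg.eigvalsh(np.asarray(G, dtype=float))  # ascending order
    return 1.0 / e[0], 1.0 / e[-1]

def gaussian_scale_matrix(dist, tau):
    """D_{t,ij} = exp(-d_ij^2 / (2 tau^2)) for a matrix of pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    return np.exp(-dist**2 / (2.0 * tau**2))

# Hypothetical example: four regions arranged in a chain 1 - 2 - 3 - 4.
G = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
tau_min, tau_max = car_tau_interval(G)
D_car = car_scale_matrix(G, 0.3)          # 0.3 lies inside (tau_min, tau_max)

# Hypothetical example: three locations at positions 0, 1 and 3 on a line.
pos = np.array([0.0, 1.0, 3.0])
dist = np.abs(pos[:, None] - pos[None, :])
D_gauss = gaussian_scale_matrix(dist, 1.5)
```

Choosing $τ$ outside the interval returned by `car_tau_interval` would give a matrix $I_{q_{t}} - τ G$ that is not positive-definite, which is why the prior for $τ$ under the CAR structure is restricted to that interval.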

For a term $t$ that corresponds to an interaction term, we propose

D_{t} = ⨂_{r \in R_{t}} D_{r} .

(8)

The form given by (8) has been chosen for its consistency. Suppose that the correlation between two levels of a main effect term is $d$ . Then, for an interaction involving this main effect, the correlation between the two levels will be $d$ if and only if the factor levels of the other constituent terms are identical. To demonstrate this we return to our pedagogic example where the regions A and B, and B and C are neighbours, but A and C are not neighbours. A CAR structure is specified. In this example, the neighbourhood matrix is

G = (\begin{matrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{matrix}),

so that $D_{t}$ for the main effect of region is

D_{region} = \frac{1}{1 - 2 τ^{2}} (\begin{matrix} 1 - τ^{2} & τ & τ^{2} \\ τ & 1 & τ \\ τ^{2} & τ & 1 - τ^{2} \end{matrix}) .

The eigenvalues of $G$ are $(- \sqrt{2}, 0, \sqrt{2})$ , so, therefore, $τ \in (τ_{\min}, τ_{\max}) = (- 1 / \sqrt{2}, 1 / \sqrt{2})$ . If an independent correlation structure is specified for the main effect of age, then

D_{age : region} = \frac{1}{1 - 2 τ^{2}} (\begin{matrix} 1 - τ^{2} & τ & τ^{2} & 0 & 0 & 0 \\ τ & 1 & τ & 0 & 0 & 0 \\ τ^{2} & τ & 1 - τ^{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 - τ^{2} & τ & τ^{2} \\ 0 & 0 & 0 & τ & 1 & τ \\ 0 & 0 & 0 & τ^{2} & τ & 1 - τ^{2} \end{matrix}) .

(9)

The correlation between A and B for the main effect of region is $τ {(1 - τ^{2})}^{- 1 / 2}$ . For the age and region interaction, the correlation between levels involving A and B is $τ {(1 - τ^{2})}^{- 1 / 2}$ if and only if they have the same level for age. It now follows from (6), (9) that the scale matrix for the prior distribution is

Σ_{age : region} = \frac{1}{3 + 4 τ} (\begin{matrix} 1 + τ & - 1 / 2 \\ - 1 / 2 & 1 \end{matrix}) .

If we denote the regression parameters for this term as $β_{t} = (β_{t 1}, β_{t 2})$ , where $t = age : region$ , then the prior correlation between $β_{t 1}$ and $β_{t 2}$ is

corr (β_{t 1}, β_{t 2}) = - \frac{1}{2 \sqrt{1 + τ}} .

If $τ = 0$, corresponding to independence between the regions (i.e. $D_{t} = I_{q_{t}}$, which recovers the Sabanes-Bové and Held [18] prior), then $corr (β_{t 1}, β_{t 2}) = - 1 / 2$. The function $corr (β_{t 1}, β_{t 2})$ is increasing in $τ$ but the correlation is always negative, which is caused by the sum-to-zero constraints. As $τ$ increases, the magnitude of the negative correlation decreases.
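The closed forms in this example can be reproduced numerically from (6) and (8). The sketch below assumes a particular value of $τ$ inside $(- 1 / \sqrt{2}, 1 / \sqrt{2})$; the function name is ours.

```python
import numpy as np

def sigma_age_region(tau):
    """Scale matrix (A^T D^{-1} A)^{-1} for the age:region interaction."""
    G = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    D_region = np.linalg.inv(np.eye(3) - tau * G)        # CAR structure
    D = np.kron(np.eye(2), D_region)                     # age independent, region fastest
    A = np.array([[1, 0], [0, 1], [-1, -1],
                  [-1, 0], [0, -1], [1, 1]], dtype=float)
    return np.linalg.inv(A.T @ np.linalg.solve(D, A))

tau = 0.4   # illustrative value
Sigma = sigma_age_region(tau)

# Closed forms stated in the text.
Sigma_closed = np.array([[1 + tau, -0.5], [-0.5, 1.0]]) / (3 + 4 * tau)
corr_closed = -1 / (2 * np.sqrt(1 + tau))

corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
assert np.allclose(Sigma, Sigma_closed)
assert np.isclose(corr, corr_closed)
```

Setting `tau = 0.0` in the same function returns the scale matrix for the independence structure, with correlation $- 1 / 2$.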

A further advantage of using the structure defined by (8) is computational. If we assume that the independence model, containing only the main effect terms, is the simplest model we wish to consider then we will always have the same set of hyperparameters in each model.

3.4. Alternative constraint systems

We now consider alternative constraint systems to sum-to-zero constraints, e.g. corner-point or Helmert constraints. Let $β_{A}$ and $β$ denote the vectors of regression parameters under the alternative and sum-to-zero constraints, respectively. Since, under the sum-to-zero constraints, each component, $β_{t}$ , of $β$ has a normal distribution, then $β$ has a normal distribution with mean zero and variance matrix $Ψ = diag {σ_{1}^{2} Σ_{1}, \dots, σ_{T}^{2} Σ_{T}}$ . It can be shown (see Appendix B) that

β_{A} = {(X_{A}^{T} (I_{n} - \frac{1}{n} J_{n}) X_{A})}^{- 1} X_{A}^{T} (I_{n} - \frac{1}{n} J_{n}) X β = R_{A} X β,

(10)

where $X_{A}$ and $X$ are the model matrices under the alternative and sum-to-zero constraints, respectively, $J_{n}$ is the $n \times n$ matrix of ones and

R_{A} = {(X_{A}^{T} (I_{n} - \frac{1}{n} J_{n}) X_{A})}^{- 1} X_{A}^{T} (I_{n} - \frac{1}{n} J_{n}) .

Therefore $β_{A} \sim N (0, Ψ_{A})$ , where the prior variance matrix, $Ψ_{A}$ , is given by

Ψ_{A} = R_{A} X Ψ X^{T} R_{A}^{T} .

Note that, under the alternative constraints, $β_{t}$ and $β_{r}$ may no longer, necessarily, be independent. This is equivalent to the fact that $Ψ_{A}$ (given by the above expression) may no longer, necessarily, be block diagonal.
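The mapping (10) can be illustrated on a small hypothetical table with two factors (2 and 3 levels) and main effects only, taking corner-point constraints as the alternative system. The design, the parameter values and the way the intercept is recovered are assumptions made for this sketch.

```python
from itertools import product
import numpy as np

# Cells of a 2 x 3 table, second factor fastest.
cells = list(product(range(2), range(3)))
n = len(cells)

# Sum-to-zero coding (X) and corner-point coding (X_A) of the two main effects.
stz_1 = np.array([[1.0], [-1.0]])
stz_2 = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
cp_1 = np.array([[0.0], [1.0]])
cp_2 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X = np.array([np.concatenate([stz_1[a], stz_2[b]]) for a, b in cells])
X_A = np.array([np.concatenate([cp_1[a], cp_2[b]]) for a, b in cells])

# R_A = (X_A^T (I - J/n) X_A)^{-1} X_A^T (I - J/n).
centre = np.eye(n) - np.ones((n, n)) / n
R_A = np.linalg.solve(X_A.T @ centre @ X_A, X_A.T @ centre)

# Illustrative parameter values under sum-to-zero constraints.
phi, beta = 1.5, np.array([0.4, -0.3, 0.8])
beta_A = R_A @ X @ beta

# Both parameterisations give the same linear predictor once the
# intercept under the alternative constraints absorbs the shift.
eta = phi + X @ beta
phi_A = np.mean(eta - X_A @ beta_A)
assert np.allclose(phi_A + X_A @ beta_A, eta)
```

In this example the elements of `beta_A` are differences between each level's effect and the effect of the first level, which is the usual interpretation of corner-point parameters.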

Under the independence structure described in Section 3.2, where $D_{t} = I_{q_{t}}$ , for $t = 1, \dots, T$ , then

Ψ_{A} = g R_{A} X {(X^{T} X)}^{- 1} X^{T} R_{A}^{T} .

The matrix $H = X {(X^{T} X)}^{- 1} X^{T}$ is called the hat matrix and is invariant to the type of constraint system used, i.e. $H = H_{A} = X_{A} {(X_{A}^{T} X_{A})}^{- 1} X_{A}^{T}$ and therefore

Ψ_{A} = g {(X_{A}^{T} X_{A})}^{- 1} .

Therefore the proposed prior distribution is a generalisation of the default prior distribution of Sabanes-Bové and Held [18] for any type of constraint system.

4. Example: estimating the number of injecting drug users (IDUs) in Scotland from capture–recapture data

In this section we apply our proposed default prior distribution to an incomplete contingency table which has six factors and 352 cells that involves estimating the number of injecting drug users (IDUs) in Scotland in 2006. These data have been previously analysed by King et al. [12] and Overstall et al. [17]. The six factors are social enquiry reports (2 levels: observed; unobserved); hospital records (2 levels: observed; unobserved); Scottish drug misuse database (2 levels: observed; unobserved); age (2 levels: ≤35 years; >35 years); gender (2 levels: male; female) and region (11 levels: National Health Service (NHS) board regions—see Fig. 1). The first three factors are sources and the 44 cells which correspond to not being observed by any of these sources for the different age/gender/region combinations have missing counts. Therefore the total population of IDUs, $N$ , is unknown. We use Markov chain Monte Carlo (MCMC) methods to obtain posterior distributions for the missing cell entries and therefore a posterior distribution for the total population of IDUs.
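The cell counts quoted above follow directly from the factor levels. The short sketch below enumerates the table and counts the cells that are unobserved by all three sources; the level labels are placeholders.

```python
from itertools import product

# Three sources, then age, gender and the eleven regions.
source = ["observed", "unobserved"]
factors = [
    source, source, source,
    ["<=35", ">35"],
    ["male", "female"],
    list(range(1, 12)),
]

cells = list(product(*factors))

# Cells not observed by any of the three sources have missing counts.
missing = [c for c in cells if c[:3] == ("unobserved",) * 3]

print(len(cells), len(missing))   # 352 cells, of which 44 are missing
```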

Fig. 1. Map showing the eleven regions of Scotland which correspond to National Health Service (NHS) board regions.

King et al. [12] and Overstall et al. [17] merged the eleven regions into just two levels: Greater Glasgow and Clyde, and the Rest of Scotland. Without merging, using all eleven distinct regions, there are small cell counts for many of the regions. For instance, in one region there are only 19 observed IDUs over all source, age and gender cross-classifications. This suggests that a prior distribution that involves smoothing (or borrowing of information), such as the prior proposed in Section 3, is required. We apply the proposed prior where the independence structure is specified for all of the factors except region, for which we use the CAR structure described in Section 3.3. Calculating the eigenvalues of the neighbourhood matrix, $G$, for this example gives $τ_{\min} = - 0.457$ and $τ_{\max} = 0.247$. We place a uniform prior on $τ$ in the interval $(τ_{\min}, τ_{\max})$. The prior distribution for each $β_{t}$ is

β_{t} | σ_{t}^{2}, D_{t} \sim N (0, σ_{t}^{2} Σ_{t}),

where $Σ_{t}$ is given by (6). Following from Section 3.2, we set $σ_{t}^{2} = g q_{t} / n$ , with

g \sim IG (\frac{a}{2}, \frac{b n}{2}),

where IG denotes the inverse-gamma distribution, and $a = b = 10^{- 3}$, as suggested by Sabanes-Bové and Held [18]. We only specify non-zero prior model probabilities for the log–linear models that contain at most two-way interactions and assume a discrete uniform prior over all of these models. It was found that this allowed enough complexity to obtain an adequate overall model when using the Bayesian $p$-value to assess model adequacy (see [8, Chapter 6]).

We use the data-augmentation MCMC approach proposed by King and Brooks [13] with the reversible jump implementation for GLMs of Forster et al. [6] to make moves between log–linear models and the weighted least squares Metropolis–Hastings implementation of Gamerman [7] to make moves within the same log–linear model. We ran the algorithm for one million iterations (discarding the first 10% as burn-in).

We obtain a posterior distribution for the total population size of IDUs with a mean of 21 700 and a 95% highest posterior density interval (HPDI) of (18 900, 24 800). Overstall et al. [17] obtained a posterior mean of 24 000 and a 95% HPDI of (19 500, 29 700), and King et al. [12] a mean of 25 000 with a 95% HPDI of (20 700, 35 000). The advantage of our approach over the latter two analyses is that we are able to provide posterior distributions of the total population size in each NHS board region, broken down by age and gender. Our approach also results in a narrower credible interval for the total population size, because it allows for correlated regions and does not discard information by merging the factor levels of region.

The posterior mean of $τ$ is 0.108 with a 95% HPDI of $(- 0.096, 0.247)$ . The posterior probability of $τ$ being positive is 0.816. It follows that the Bayes factor in support of the hypothesis that $τ > 0$ is 8.205. Therefore there appears to be positive evidence [11] in support of positive spatial correlation between the regions of Scotland.
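The Bayes factor quoted above is the ratio of posterior odds to prior odds for $τ > 0$, where the prior odds follow from the uniform prior on $(τ_{\min}, τ_{\max})$. A minimal sketch of the arithmetic, using only the figures reported in this section:

```python
tau_min, tau_max = -0.457, 0.247
post_prob_pos = 0.816                      # posterior probability that tau > 0

# Under the uniform prior, P(tau > 0) is the share of the interval above zero.
prior_prob_pos = tau_max / (tau_max - tau_min)

prior_odds = prior_prob_pos / (1 - prior_prob_pos)
post_odds = post_prob_pos / (1 - post_prob_pos)

bayes_factor = post_odds / prior_odds
print(round(bayes_factor, 3))              # approximately 8.205
```

Because the interval extends further below zero than above it, the prior odds favour $τ < 0$, so the Bayes factor is larger than the posterior odds alone.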

5. Concluding remarks

In this paper we have proposed a default prior distribution for the regression parameters of a log–linear model that can take account of any dependence structure that may exist between the factor levels. This prior can be applied in situations of model uncertainty and can be seen as a generalisation of other default prior distributions applied to log–linear models including those of Dellaportas and Forster [3], Ntzoufras et al. [15] and Sabanes-Bové and Held [18].

Acknowledgements

The authors thank Dr. Gordon Hay for providing the data in Section 4 and the reviewer for providing helpful comments and suggestions that improved the paper. Both authors were partly funded by the MRC-funded addictions cluster, NIQUAD (Grant No. G1000021).

Appendix A. Justification of default prior distribution

In this appendix we give justification for the prior given in Section 3.1, given by (5), (6). The prior distribution for $β_{t}$ is the conditional distribution of $γ^{(1)}$ given that $γ^{(2)} = C_{t} γ^{(1)}$ , where $γ = {(γ^{(1)}, γ^{(2)})}^{T} \sim N (0, σ_{t}^{2} M)$ , and $M = P_{t}^{T} D_{t} P_{t}$ . Define

ψ = (\begin{matrix} ψ^{(1)} \\ ψ^{(2)} \end{matrix}) = (\begin{matrix} I & 0 \\ - C_{t} & I \end{matrix}) (\begin{matrix} γ^{(1)} \\ γ^{(2)} \end{matrix}),

so that we now require the conditional distribution of $ψ^{(1)}$ given that $ψ^{(2)} = 0$ . It follows, from the properties of the multivariate normal distribution, that $ψ$ has a multivariate normal distribution with mean $0$ and covariance matrix $σ_{t}^{2} T$ where

T = (\begin{matrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{matrix}),

and

T_{11} = M_{11},

T_{12} = M_{12} - M_{11} C_{t}^{T},

T_{21} = M_{21} - C_{t} M_{11},

T_{22} = M_{22} - M_{21} C_{t}^{T} - C_{t} M_{12} + C_{t} M_{11} C_{t}^{T} .

The partitioning of $M$ and $T$ follows from the partitioning of $γ$ into $γ^{(1)}$ and $γ^{(2)}$ . Using the properties of the multivariate normal distribution the covariance matrix of $β_{t}$ is $σ_{t}^{2} Σ_{t}$ , where

Σ_{t} = M_{11} - (M_{12} - M_{11} C_{t}^{T}) {(M_{22} - C_{t} M_{12} - M_{21} C_{t}^{T} + C_{t} M_{11} C_{t}^{T})}^{- 1} (M_{21} - C_{t} M_{11}) .

Consider the inverse of $Σ_{t}$ . It can be shown using, e.g., [10], and after some matrix algebra, that

Σ_{t}^{- 1} = M_{11}^{- 1} + M_{11}^{- 1} M_{12} S_{M}^{- 1} M_{21} M_{11}^{- 1} - M_{11}^{- 1} M_{12} S_{M}^{- 1} C_{t} - C_{t}^{T} S_{M}^{- 1} M_{21} M_{11}^{- 1} + C_{t}^{T} S_{M}^{- 1} C_{t},

where $S_{M} = M_{22} - M_{21} M_{11}^{- 1} M_{12}$ is the Schur complement (e.g. [9, p. 95]) of $M_{11}$ in $M$ . As

M^{- 1} = {(P_{t}^{T} D_{t} P_{t})}^{- 1} = L = (\begin{matrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{matrix}),

then it can be shown that

Σ_{t}^{- 1} = L_{11} + L_{12} C_{t} + C_{t}^{T} L_{21} + C_{t}^{T} L_{22} C_{t} = (\begin{matrix} I & C_{t}^{T} \end{matrix}) (P_{t}^{T} D_{t}^{- 1} P_{t}) (\begin{matrix} I \\ C_{t} \end{matrix}) = A_{t}^{T} D_{t}^{- 1} A_{t} .

Therefore $Σ_{t} = {(A_{t}^{T} D_{t}^{- 1} A_{t})}^{- 1}$ as required.
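The result of this appendix can be checked numerically. The sketch below uses the matrices $A_{t}$, $C_{t}$ and $P_{t}$ from the pedagogic example of Section 3.1 together with a randomly generated positive-definite $D_{t}$, and compares the conditional covariance obtained from the blocks of $T$ with the closed form (6). The random $D_{t}$ and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[1, 0], [0, 1], [-1, -1], [-1, 0], [0, -1], [1, 1]], dtype=float)
C = np.array([[-1, 0], [0, -1], [1, 1], [-1, -1]], dtype=float)
P = np.array([
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
], dtype=float)
q, p = A.shape

# Random symmetric positive-definite scale matrix D_t.
B = rng.normal(size=(q, q))
D = B @ B.T + q * np.eye(q)

# Partition M = P^T D P conformably with (gamma^(1), gamma^(2)).
M = P.T @ D @ P
M11, M12 = M[:p, :p], M[:p, p:]
M21, M22 = M[p:, :p], M[p:, p:]

# Blocks of T, the scaled covariance of psi.
T11 = M11
T12 = M12 - M11 @ C.T
T21 = M21 - C @ M11
T22 = M22 - M21 @ C.T - C @ M12 + C @ M11 @ C.T

# Conditional covariance of psi^(1) given psi^(2) = 0 (up to sigma_t^2).
Sigma_conditional = T11 - T12 @ np.linalg.solve(T22, T21)

# Closed form (6).
Sigma_closed = np.linalg.inv(A.T @ np.linalg.solve(D, A))

assert np.allclose(Sigma_conditional, Sigma_closed)
```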

Appendix B. Correspondence of parameters between different constraint systems

In this appendix we provide a justification of the correspondence between the regression parameters under any constraint system and sum-to-zero constraints, given by (10). Let $Z_{A} = (1_{n}, X_{A})$ and $Z = (1_{n}, X)$ be the $n \times (p + 1)$ matrices formed by appending the vector of ones to the model matrices under the alternative and sum-to-zero constraints. The vector $(ϕ_{A}, β_{A})$ , where $ϕ_{A}$ is the intercept under the alternative constraints, is given by

(\begin{matrix} ϕ_{A} \\ β_{A} \end{matrix}) = {(Z_{A}^{T} Z_{A})}^{- 1} Z_{A}^{T} Z (\begin{matrix} ϕ \\ β \end{matrix})
= {(\begin{matrix} n & 1_{n}^{T} X_{A} \\ X_{A}^{T} 1_{n} & X_{A}^{T} X_{A} \end{matrix})}^{- 1} (\begin{matrix} n ϕ + 1_{n}^{T} X β \\ ϕ X_{A}^{T} 1_{n} + X_{A}^{T} X β \end{matrix})
= (\begin{matrix} \frac{1}{n} + \frac{1}{n^{2}} 1_{n}^{T} X_{A} U_{A}^{- 1} X_{A}^{T} 1_{n} & - \frac{1}{n} 1_{n}^{T} X_{A} U_{A}^{- 1} \\ - \frac{1}{n} U_{A}^{- 1} X_{A}^{T} 1_{n} & U_{A}^{- 1} \end{matrix}) (\begin{matrix} n ϕ + 1_{n}^{T} X β \\ ϕ X_{A}^{T} 1_{n} + X_{A}^{T} X β \end{matrix}),

where $U_{A} = X_{A}^{T} (I_{n} - \frac{1}{n} J_{n}) X_{A}$ . The expression for $β_{A}$ , given by (10), easily follows.
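The partitioned inverse used above, and the resulting expression for $β_{A}$, can be verified on a small hypothetical design. The sketch below again uses a $2 \times 3$ table with main effects only and corner-point constraints as the alternative system; the design and parameter values are assumptions for illustration.

```python
from itertools import product
import numpy as np

cells = list(product(range(2), range(3)))
stz_1 = np.array([[1.0], [-1.0]])
stz_2 = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
cp_1 = np.array([[0.0], [1.0]])
cp_2 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X = np.array([np.concatenate([stz_1[a], stz_2[b]]) for a, b in cells])
X_A = np.array([np.concatenate([cp_1[a], cp_2[b]]) for a, b in cells])

n, p = X_A.shape
one = np.ones((n, 1))
Z = np.hstack([one, X])
Z_A = np.hstack([one, X_A])

# U_A = X_A^T (I_n - J_n / n) X_A.
centre = np.eye(n) - one @ one.T / n
U_inv = np.linalg.inv(X_A.T @ centre @ X_A)

# Partitioned form of (Z_A^T Z_A)^{-1}.
block_inverse = np.block([
    [1 / n + one.T @ X_A @ U_inv @ X_A.T @ one / n**2, -one.T @ X_A @ U_inv / n],
    [-U_inv @ X_A.T @ one / n, U_inv],
])
assert np.allclose(block_inverse, np.linalg.inv(Z_A.T @ Z_A))

# The last p elements reproduce beta_A = R_A X beta from (10).
phi, beta = 1.5, np.array([0.4, -0.3, 0.8])
coef_A = block_inverse @ (Z_A.T @ Z @ np.concatenate([[phi], beta]))
R_A = U_inv @ X_A.T @ centre
assert np.allclose(coef_A[1:], R_A @ X @ beta)
```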

References

1. Agresti A. An Introduction to Categorical Data Analysis. Second ed. Wiley; 2007.
2. Cressie N., Stern H., Wright D. Mapping rates associated with polygons. Journal of Geographical Systems. 2000;2:61–69.
3. Dellaportas P., Forster J. Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika. 1999;86:615–633.
4. Fienberg S. The multiple recapture census for closed populations and incomplete $2^{k}$ contingency tables. Biometrika. 1972;59:591–603.
5. Forster J. Bayesian inference for Poisson and multinomial log-linear models. Statistical Methodology. 2010;7:210–224.
6. Forster J., Gill R., Overstall A. Reversible jump methods for generalised linear models and generalised linear mixed models. Statistics and Computing. 2012;22:107–120.
7. Gamerman D. Sampling from the posterior distribution in generalised linear mixed models. Statistics and Computing. 1997;7:57–68.
8. Gelman A., Carlin J., Stern H., Rubin D. Bayesian Data Analysis. Second ed. Chapman and Hall; 2004.
9. Gentle J. Matrix Algebra: Theory, Computation, and Applications in Statistics. Springer; 2007.
10. Henderson H., Searle S. On deriving the inverse of a sum of matrices. SIAM Review. 1981;23:53–60.
11. Kass R., Raftery A. Bayes factors. Journal of the American Statistical Association. 1995;90:773–795.
12. King R., Bird S., Overstall A., Hay G., Hutchinson S. Injecting drug users in Scotland, 2006: number, demography, and opiate-related death-rates. Addiction Research and Theory. 2013;21:235–246. doi: 10.3109/16066359.2012.706344.
13. King R., Brooks S. On the Bayesian analysis of population size. Biometrika. 2001;88:317–336.
14. Knuiman M., Speed T. Incorporating prior information into the analysis of contingency tables. Biometrics. 1988;44:1061–1071.
15. Ntzoufras I., Dellaportas P., Forster J. Bayesian variable and link determination for generalised linear models. Journal of Statistical Planning and Inference. 2003;111:165–180.
16. O’Hagan A., Forster J. Kendall’s Advanced Theory of Statistics, vol. 2B: Bayesian Inference. Second ed. John Wiley & Sons; 2004.
17. Overstall A., King R., Bird S., Hutchinson S., Hay G. Incomplete contingency tables with censored cells with application to estimating the number of people who inject drugs in Scotland. Tech. Rep., School of Mathematics and Statistics, University of St. Andrews; 2013.
18. Sabanes-Bové D., Held L. Hyper-g priors for generalized linear models. Bayesian Analysis. 2011;6:387–410.
