Abstract
Consider a set of categorical variables where at least one, denoted by Y, is binary. The log-linear model that describes the contingency table counts implies a logistic regression model, with outcome Y. Extending results from Christensen (1997, Log-linear models and logistic regression, 2nd edn. New York, NY: Springer), we prove that the maximum-likelihood estimates (MLE) of the logistic regression parameters equal the MLE of the corresponding log-linear model parameters, also considering the case where contingency table factors are not present in the corresponding logistic regression and some of the contingency table cells are collapsed together. We prove that, asymptotically, standard errors are also equal. These results demonstrate the extent to which inferences from the log-linear framework translate to inferences within the logistic regression framework, on the magnitude of main effects and interactions. Finally, we prove that the deviance of the log-linear model is equal to the deviance of the corresponding logistic regression, provided that no cell observations are collapsed together when one or more factors become obsolete. We illustrate the derived results with the analysis of a real dataset.
Keywords: categorical variables, contingency table, generalized linear modelling
1. Introduction
Let v = {v1, …, vn} denote a set of observations, θ = {θ1, …, θn} a set of parameters, and consider known or nuisance quantities ϕ = {ϕ1, …, ϕn}. Now, vi, i = 1, …, n, belongs to the exponential family of distributions if its probability function can be written as
f(vi | θi, ϕi) = exp{wi[viθi − b(θi)]/ϕi + c(vi, ϕi)},
where w = {w1, …, wn} are known weights, and ϕi is the dispersion or scale parameter. Regarding first-order moments, μi ≡ E(vi) = b′(θi). A generalized linear model relates μ = {μ1, …, μn} to covariates by setting ζ(μ) = Xdγ, where ζ denotes the link function, Xd the covariate design matrix and γ a vector of parameters. For a single μi, we write ζi(μi) = Xd(i)γ, where Xd(i) denotes the ith row of Xd, defining ζ as a vector function ζ ≡ {ζ1, …, ζn}.
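As a concrete sketch of this form (standard generalized-linear-model facts, stated here for later reference; the weight convention wi = ti for binomial proportions is one common choice), the two distributions used in this manuscript specialize as:

```latex
% Poisson counts (Section 2.1):
\theta_i = \log\mu_i, \qquad b(\theta_i) = e^{\theta_i}, \qquad w_i = \phi_i = 1,
\qquad \mu_i = b'(\theta_i) = e^{\theta_i}.
% Binomial proportions (Section 2.2), with $t_i$ trials:
\theta_i = \log\frac{p_i}{1-p_i}, \qquad b(\theta_i) = \log\!\left(1+e^{\theta_i}\right),
\qquad w_i = t_i,\ \phi_i = 1, \qquad
\mu_i \equiv p_i = b'(\theta_i) = \frac{e^{\theta_i}}{1+e^{\theta_i}}.
```

The canonical links, ζ(μ) = log(μ) for the Poisson and ζ(p) = logit(p) for the binomial, follow by inverting μi = b′(θi).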
Consider a finite set of P categorical variables. Observations from these variables can be arranged as counts in a P-way contingency table, with cell counts denoted by ni, i = 1, …, nll. The ‘ll’ indicator alludes to a log-linear model. The counts follow a Poisson distribution with E(ni) = μi. A Poisson log-linear interaction model is a generalized linear model that relates the expected counts to the categorical variables.
From Christensen [1], there is an association between log-linear modelling and multinomial logistic regression. Consider categorical variables X, Y and Z, with JX, JY and JZ levels, respectively. Let jX, jY, jZ be integer indices that describe the level of X, Y and Z. In a multinomial logistic regression with outcome Y, one typically models the log-odds of an observation at level jY + 1 relative to one at level jY, jY = 0, …, JY − 1. This can be viewed as equivalent to fitting a log-linear model as
For more details, see [1, Section 4.6] where, in addition to the above approach, the alternative of constructing a multinomial model to model the log-odds of an observation at level jY, jY = 1, …, JY − 1, relative to one at fixed level JY is considered. In this manuscript, we focus on the association between log-linear modelling and binary logistic regression. Assume that the categorical variable Y is binary. Then, a logistic regression can be fitted with Y as the outcome, and all or some of the remaining P − 1 variables as covariates. We write logit(pi) ≡ log{pi/(1 − pi)} = Xlt(i)β, i = 1, …, nlt, using the ‘lt’ indicator for the logistic model, denoting by pi the conditional probability that Y = 1 given covariates Xlt(i), and by β the vector of model parameters.
From Agresti [2], when the set of categorical variables contains a binary Y, a log-linear model implies a specific logistic regression model with parameters β defined uniquely by λ. As Y is binary, jY = 0, 1. Consider the log-linear model
| M1 |
where the superscript denotes the main effect or interaction term. Similar to the derivation above, the corresponding logistic regression model for the conditional odds ratios for Y is
This is a logistic regression with parameters β0 = λY1 − λY0, βXjX = λXYjX1 − λXYjX0 and βZjZ = λYZ1jZ − λYZ0jZ. Identifiability corner point constraints set all elements in λ with a zero subscript equal to zero. Then, β0 = λY1, βXjX = λXYjX1 and βZjZ = λYZ1jZ. This scales in a straightforward manner to larger log-linear models. If a factor does not interact with Y in the log-linear model, this factor disappears from the corresponding logistic regression. Without any loss of generality, and to simplify the analysis and notation, we henceforth assume corner point constraints.
Considering the log-odds implied by a logistic regression, more than one log-linear model provides the same structure. For example, the log-linear model log(μjYjXjZ) = λ + λXjX + λYjY + λZjZ + λXYjXjY + λYZjYjZ implies the same conditional log-odds structure for Y as (M1). However, as shown in Christensen [3, Section 3.3.2] in conjunction with Christensen [1, Sections 11.1 and 12.4], the log-linear model that determines exactly the same logistic structure is the one that contains all possible interaction terms between the categorical factors other than Y. Other log-linear models, even when they imply the same log-odds, impose additional constraints on the logistic structure. To avoid any confusion, the description of our results in this manuscript will make explicit that the considered log-linear model contains all possible interaction terms between the categorical factors other than Y.
The relationship between β and λ can be described as β = Tλ, where T is an incidence matrix [4]. In the context of this manuscript, matrix T has one row for each element of β, and one column for each element of λ. The elements of T are zero, except in the case where the element of β is defined by the corresponding element of λ. The number of rows of T cannot be greater than the number of columns.
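The mapping β = Tλ can be sketched numerically. The snippet below uses hypothetical λ values for model (M1), with X and Z taken binary for brevity so that, under corner point constraints, β = (β0, βX1, βZ1); the parameter ordering is an assumption made for illustration only:

```python
import numpy as np

# hypothetical lambda for model (M1) under corner-point constraints, ordered as
# [lam, lam_X, lam_Z, lam_XZ, lam_Y, lam_XY, lam_YZ] (X, Z binary for brevity)
lam = np.array([1.2, -0.5, 0.1, 0.3, -0.414, 0.550, 0.489])

# incidence matrix T: one row per element of beta, one column per element of
# lambda; a single 1 marks the lambda term carrying a Y in its superscript
T = np.zeros((3, 7))
T[0, 4] = 1.0   # beta_0  = lam_Y
T[1, 5] = 1.0   # beta_X1 = lam_XY
T[2, 6] = 1.0   # beta_Z1 = lam_YZ

beta = T @ lam  # beta = (lam_Y, lam_XY, lam_YZ)
```

Each row of T contains exactly one non-zero entry, so T simply selects the λ terms that involve Y, as described above.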
Papathomas [5] studied the correspondence between the two modelling frameworks within the Bayesian setting, deriving exact and asymptotic results. In this manuscript, we focus on the frequentist framework, and derive results on maximum-likelihood estimates (MLE), interval estimates and deviances. Christensen [1] offers a comprehensive account of log-linear and logistic regression modelling. In Christensen [1, ch. 11], results on the equivalence between MLE and confidence intervals were derived. We extend these results by also considering the case where factors present in the contingency table and log-linear model are not present in the corresponding logistic regression model, and some of the contingency table cells are collapsed together. This case is not considered in [1,2] or, to the best of our knowledge, in any other published work. As stated in theorem 3.2, the MLE for the parameters of the logistic regression equals the MLE for the corresponding parameters of the log-linear model. Theorem 3.3 states that, asymptotically, standard errors for the logistic regression and corresponding log-linear model parameters are equal. Subsequently, Wald confidence intervals [2] are asymptotically equal.
For theorem 3.4, we stipulate that the logistic model is fitted to a dataset where no cell observations are collapsed together when one or more factors in the log-linear model are not present in the logistic regression. Then, we prove that the deviance of the log-linear model equals the deviance of the corresponding logistic regression. Christensen [1, p. 371] refers to this equality, considering a simple logistic regression with two parameters and showing that the likelihood ratio test statistic (LRTS) for the log-linear model equals the LRTS for the logistic regression. This is done by using the invariance of the MLE and the properties of the product-binomial sampling scheme [1, Section 2.6]. Christensen [1, p. 365] also shows that applying the logistic regression to a contingency table implies that the sampling scheme of the contingency table is product-binomial instead of multinomial. As these results are based on a logistic regression with two parameters, a general mathematical proof is required, provided in appendix A.
In §2 we provide additional notation and essential derivations for the log-linear and logistic regression model, then §3 contains the main contributions in this manuscript. In §4, the correspondence from a log-linear to a logistic regression model is illustrated using real data. We conclude with a discussion, where we also consider possible practical implications of our results.
2. Deviances and the information matrix
The deviance of a generalized linear model is crucial for assessing goodness of fit [6]. Let θ̂ denote the MLE of θ. Let L(θ̂sat, v) and L(θ̂sim, v) denote the log-likelihood for the saturated model, and for a simpler model, respectively. The deviance is defined as
D = 2{L(θ̂sat, v) − L(θ̂sim, v)}.
Then,
D = 2 Σi wi{vi(θ̂sat,i − θ̂sim,i) − b(θ̂sat,i) + b(θ̂sim,i)}/ϕi.
Denote by γ̂ the MLE of γ, and by I(γ) the information matrix. (I(γ) will be specified below for both modelling frameworks as I(λ) and I(β).) Then, from Agresti [2], asymptotically
γ̂ ∼ N(γ, I(γ̂)−1).
2.1. Log-linear regression
Consider a vector n of counts ni, i = 1, …, nll. Now, ni ∼ Poisson(μi), and
f(ni | θi) = exp{ni log(μi) − μi − log(ni!)},
with θi = log(μi), b(θi) = exp(θi) and c(ni, ϕi) = −log(ni!). Also, var(ni) = μi, so that wi = 1 implies ϕi = 1. Note that μi = b′(θi) = exp(θi) and b″(θi) = μi. For the log-linear model, ζ(μ) = log(μ) = Xllλ, where Xll is a nll × nλ design matrix of covariates, and γ ≡ λ. Given the above,
Dll = 2 Σi {ni log(ni/μ̂i) − (ni − μ̂i)}.
From Agresti [2, p. 140], when the log-linear model contains an intercept, Σi (ni − μ̂i) = 0. Then,
Dll = 2 Σi ni log(ni/μ̂i). (2.1)
The diagonal matrix Wll in the information matrix I(λ) = Xll⊤WllXll has non-zero elements μi, i = 1, …, nll.
2.2. Logistic regression
Assume that yi, i = 1, …, nlt, is the proportion of successes out of ti trials. Now, tiyi ∼ Binomial(ti, pi), and
f(yi | θi) = exp{ti[yiθi − b(θi)] + c(yi, ϕi)},
where θi = log{pi/(1 − pi)} and b(θi) = log{1 + exp(θi)}. Also, var(yi) = pi(1 − pi)/ti, so that wi = ti implies ϕi = 1. Note that
μi ≡ pi = b′(θi) = exp(θi)/{1 + exp(θi)}.
For the logistic regression, ζ(p) = logit(p) = Xltβ, where Xlt is a nlt × nβ design matrix, and γ ≡ β. Given the above,
Dlt = 2 Σi ti{yi log(yi/p̂i) + (1 − yi) log[(1 − yi)/(1 − p̂i)]}.
After some algebra,
Dlt = 2 Σi {tiyi log(tiyi/tip̂i) + (ti − tiyi) log[(ti − tiyi)/(ti − tip̂i)]}. (2.2)
The diagonal matrix Wlt in the information matrix I(β) = Xlt⊤WltXlt has non-zero elements tipi(1 − pi), i = 1, …, nlt.
3. Results
To facilitate the derivation of theoretical results, we introduce the following additional notation. Without any loss of generality, let x.1 be the binary Y factor, and x.2, …, x.q the q − 1 factors that are present in the log-linear model but disappear from the logistic regression model as they do not interact with Y. Denote the rest of the factors by x.q+1, …, x.P. Each element of n is denoted by nj, j = (j1, …, jP), 0 ≤ jp ≤ Jp − 1, p = 1, …, P, where Jp is the number of levels of x.p. Here, j identifies the combination of variable levels that cross-classify the given cell. We define L as the set of all nll cross-classifications, so that n = {nj, j ∈ L}. Elements yj and μj are defined analogously.
Lemma 3.1. —
Assume that the log-linear model contains all possible interaction terms between the categorical factors other than Y. Then, for all 0 ≤ jp ≤ Jp − 1, p = 2, …, P,
μ̂0j2…jP + μ̂1j2…jP = n0j2…jP + n1j2…jP.
Proof. —
The proof is given in appendix A. ▪
Theorem 3.2. —
Assume that the log-linear model contains all possible interaction terms between the categorical factors other than Y. Then, the MLE of the parameters of the logistic regression is equal to the MLE of the corresponding parameters of the log-linear model.
Proof. —
The proof is given in appendix A. ▪
Theorem 3.3. —
Assume that the log-linear model contains all possible interaction terms between the categorical factors other than Y. Then, asymptotically, the standard error for each element of β is equal to the standard error for the corresponding parameter of the log-linear model.
Proof. —
The proof is given in appendix A. ▪
The proofs for theorems 3.2 and 3.3 include the case where factors present in the log-linear model are not present in the corresponding logistic regression and some of the contingency table cells are collapsed together. For completeness, our proofs also include the case where all factors are present in the logistic regression model. Theorem 3.4 requires that nlt = nll/2, i.e. the number of proportions fitted by the logistic regression should be half the number of cell counts in the contingency table. This happens either because all factors other than Y are present in the logistic regression, or because counts in cells with the same cross-classification considering x.q+1, …, x.P are not collapsed. This is important for observing equal deviances for the log-linear model and the corresponding logistic regression. Intuitively, when nlt = nll/2, the number of observations fitted by the logistic regression is in direct correspondence with the number of observations fitted by the log-linear model. When nlt < nll/2, a logistic regression model with the same number of parameters fits a smaller number of observations, which naturally results in a smaller deviance compared with the deviance observed when the contingency table is not collapsed. This is illustrated in §4 with the analysis of a real dataset.
Theorem 3.4. —
Assume that the log-linear model contains all possible interaction terms between the categorical factors other than Y. Assume also that the corresponding logistic regression is fitted to a dataset where nlt = nll/2. Then, the deviance of the log-linear model equals the deviance of the corresponding logistic regression.
Proof. —
The proof is given in appendix A. ▪
4. Illustration
Edwards & Havránek [7] presented a 2⁶ contingency table in which 1841 men were cross-classified by six binary risk factors {A, B, C, D, E, F} for coronary heart disease. Adopting the notation in Agresti [2], a single letter denotes the presence of a main effect, two-letter terms denote the presence of the implied first-order interaction, and so on. The presence of an interaction between a set of variables implies the presence of all lower-order interactions plus main effects for that set. Consider the log-linear model
| M2 |
Treating A as the outcome, the corresponding logistic regression is
| M3 |
The deviances, MLE and standard errors for the relevant parameters of both models are given in table 1, after fitting the models in R using the ‘glm’ function. We observe that corresponding quantities are equal. To obtain equal deviances, although factors B and F are not present in the logistic regression, the logistic model was fitted to a dataset where contingency table cell counts discriminated only by B and F were not collapsed together. This resulted in nlt = 32. The datasets for (M2) and (M3) are given in appendix A. The design matrix is shown below, with ⊤ denoting the transpose; some of its rows are identical.
Table 1.
Deviances, MLE and standard errors for the relevant parameters of log-linear model (M2) and the corresponding logistic regressions (M3) and (M4). (Standard errors are given in brackets.)
| log-linear model (M2), deviance = 33.51 | | | | |
| | A | AC | AD | AE |
| MLE | −0.4140 (0.0892) | 0.5501 (0.0958) | −0.3684 (0.0967) | 0.4893 (0.0973) |
| outcome is A (M3), deviance = 33.51 | | | | |
| | intercept | C | D | E |
| MLE | −0.4140 (0.0892) | 0.5501 (0.0958) | −0.3684 (0.0967) | 0.4893 (0.0973) |
| outcome is A (M4), deviance = 3.47 | | | | |
| | intercept | C | D | E |
| MLE | −0.4140 (0.0892) | 0.5501 (0.0958) | −0.3684 (0.0967) | 0.4893 (0.0973) |
As factors B and F disappear from the logistic regression that corresponds to (M2), one may decide to collapse together the contingency table cells with the same cross-classification considering C, D and E. A logistic regression is fitted, denoted by (M4). It only contains main effects for C, D and E, as does (M3). The dataset for (M4) is shown in appendix A. The design matrix for (M4) is
Relevant output is given in table 1. MLE and standard errors are equal, as theorems 3.2 and 3.3 hold. However, as cells are collapsed together and nlt ≠ nll/2, the deviances differ.
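The pattern in table 1 can be reproduced on synthetic counts. The sketch below is our own illustration, not the authors' code: `fit_poisson` and `fit_logistic` are hypothetical helper names implementing plain iteratively reweighted least squares, and the 2 × 2 × 2 counts are invented. It fits a log-linear model saturated in (X, Z) with Y, XY and YZ terms, and the corresponding logistic regression with nlt = nll/2, so that theorems 3.2–3.4 all apply:

```python
import numpy as np

def fit_poisson(X, n, iters=50):
    """IRLS for a Poisson log-linear model: returns MLE, standard errors, deviance."""
    mu = n + 0.5                          # standard GLM starting values
    eta = np.log(mu)
    for _ in range(iters):
        z = eta + (n - mu) / mu           # working response, log link
        XtW = X.T * mu                    # X^T diag(mu)
        coef = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ coef
        mu = np.exp(eta)
    se = np.sqrt(np.diag(np.linalg.inv((X.T * mu) @ X)))
    dev = 2.0 * np.sum(n * np.log(n / mu) - (n - mu))
    return coef, se, dev

def fit_logistic(X, y, t, iters=50):
    """IRLS for binomial logistic regression: y successes out of t trials."""
    p = (y + 0.5) / (t + 1.0)             # standard GLM starting values
    eta = np.log(p / (1 - p))
    for _ in range(iters):
        w = t * p * (1 - p)
        z = eta + (y - t * p) / w         # working response, logit link
        XtW = X.T * w                     # X^T diag(t p (1 - p))
        coef = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ coef
        p = 1 / (1 + np.exp(-eta))
    se = np.sqrt(np.diag(np.linalg.inv((X.T * (t * p * (1 - p))) @ X)))
    dev = 2.0 * np.sum(y * np.log(y / (t * p))
                       + (t - y) * np.log((t - y) / (t * (1 - p))))
    return coef, se, dev

# invented 2x2x2 table: binary outcome Y and binary factors X, Z
patterns = [(x, z) for x in (0, 1) for z in (0, 1)]
n0 = np.array([20., 35., 12., 40.])       # counts with Y = 0, per (x, z) pattern
n1 = np.array([15., 28., 22., 30.])       # counts with Y = 1

# log-linear model: saturated in (X, Z), plus Y, XY and YZ terms
X_ll, n = [], []
for i, (x, z) in enumerate(patterns):
    for jy, cnt in ((0, n0[i]), (1, n1[i])):
        X_ll.append([1, x, z, x * z, jy, jy * x, jy * z])
        n.append(cnt)
X_ll, n = np.array(X_ll, float), np.array(n)

# corresponding logistic regression: intercept, X, Z, with n_lt = n_ll / 2
X_lt = np.array([[1, x, z] for x, z in patterns], float)

lam, se_ll, dev_ll = fit_poisson(X_ll, n)
beta, se_lt, dev_lt = fit_logistic(X_lt, n1, n0 + n1)

print(np.allclose(lam[4:], beta))     # MLE of (Y, XY, YZ) terms equal beta
print(np.allclose(se_ll[4:], se_lt))  # standard errors equal (theorem 3.3)
print(abs(dev_ll - dev_lt))           # deviances equal (theorem 3.4)
```

With these counts both deviances are strictly positive (the XYZ term is omitted) yet they coincide exactly, as theorem 3.4 states; collapsing cells would leave the estimates and standard errors unchanged but change the logistic deviance, mirroring (M4).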
5. Discussion
The results in Christensen [1] and this manuscript demonstrate the extent to which inferences from the log-linear framework translate to inferences within the logistic regression framework, on the magnitude of main effects and interactions.
When factors are not present in the logistic regression, one may choose to collapse the counts in the contingency table cells that are only discriminated by the obsolete variables x.2, …, x.q. Logistic regression parameter estimates and associated standard errors are not affected by collapsing the cell counts. This is shown in the proofs for theorems 3.2 and 3.3 in appendix A. However, the logistic regression fitted to the collapsed dataset returns a different deviance compared with a logistic regression with the same covariates (parameters) fitted without collapsing. This is expected, as two models with the same number of parameters are fitted to a different number of data points. The deviance naturally increases for the larger dataset.
Our results concern two of the most popular approaches for the analysis of categorical observations and the correspondence between them. Theoretical derivations on such associations improve understanding and enhance the models’ use, as advances for one framework are not always readily available to the other. For instance, to describe the joint probability distribution between covariates, Zhou et al. [8] adopt a PARAFAC factorization. Marginal independence is modelled with fixed baseline vectors, providing expressions for parameters of the log-linear models that correspond to the adopted latent class model. Another example is Papathomas & Richardson [9], where the use of variable selection within clustering to assist log-linear modelling is investigated, without examining logistic regression models.
Acknowledgements
We are grateful to Prof. Ronald Christensen for the instructive discussions we had during the preparation of this manuscript. We are also grateful for the comments by two reviewers and the editor that improved this manuscript.
Appendix A.
Proof of Lemma 3.1. —
To facilitate this and subsequent proofs, the following notation is introduced, similar to Papathomas [5]. Using the incidence matrix T discussed in §1, write the mapping between β and λ as β = Tλ, where T = (λ(1), …, λ(nβ))⊤,
and λ(k), k = 1, …, nβ, is a vector of zeros with the exception of one element that is equal to one. This element is in the position of the kth λ parameter with a Y in its superscript. With nβ we denote the number of parameters in λ with a Y in their superscript. To ease algebraic calculations, and without any loss of generality, rearrange the elements of λ, creating a new vector λr, so that T changes accordingly to Tr = (I 0), where I is an nβ × nβ identity matrix. (Vector μ is similarly rearranged to μr.) The rows and columns of Xll are also rearranged accordingly to create Xrll, so that
Xrll = ( Xlt* Xll−lt ; 0 Xll−lt ). (A 1)
Xll−lt is a square (nll/2 × nll/2) matrix. This is because we consider the log-linear model that, in addition to the terms that involve Y, contains all possible interaction terms between the categorical factors other than Y. The number of parameters that correspond to the intercept, main effects and interactions for the factors other than Y is nll/2. Xlt* is a nll/2 × nβ matrix. When q = 1, all factors other than Y remain in the logistic regression model as covariates. When no cell counts are collapsed, either because q = 1, or because we opt not to collapse, Xlt* = Xlt, and nll = 2 × nlt. When the cell counts that are only discriminated by the obsolete variables x.2, …, x.q are collapsed, by rearranging the rows of Xrll when necessary, we can write Xlt* as Xlt* = (Xlt⊤, …, Xlt⊤)⊤, where Xlt is repeated (J1 − 1) × J2 × … × Jq times. For example, for q = 2, Xlt repeats J2 times within Xlt*, and nll = 2 × J2 × nlt. When q = P, the corresponding logistic regression model only contains an intercept, and one may decide to fit the logistic regression to a collapsed contingency table that only contains two cells describing the total number of counts where Y = 0 and Y = 1. Then, nll = 2 × J2 × … × JP × nlt.
We can now write β = Trλr. For example, assume the log-linear model (M1) describes a 3 × 2 × 2 contingency table. Then, q = 1, and the standard arrangement of the elements of λ would be such that,
After rearranging
See Papathomas [5] for another example where q = 2. From Agresti [2, p. 138], the likelihood equations for a log-linear model are
Σl∈L nl xlj = Σl∈L μ̂l xlj, j = 1, …, nλ,
where xlj is the element of Xrll in the row that corresponds to nl, and column j, j = 1, …, nλ. As Xll−lt includes all interactions between factors other than Y, Xll−lt is the design matrix for a saturated log-linear model for all factors other than Y. Because Xll−lt repeats within Xrll (as shown in (A1)), the nll/2 likelihood equations for j = nβ + 1, …, nλ are also the likelihood equations of a saturated log-linear model for fitting the nll/2 collapsed observations n0j2…jP + n1j2…jP:
Σ (n0j2…jP + n1j2…jP) xlj = Σ (μ̂0j2…jP + μ̂1j2…jP) xlj, j = nβ + 1, …, nλ.
Here, xlj is the element of Xll−lt in the row that corresponds to n0j2…jP + n1j2…jP, and column j, j = nβ + 1, …, nλ. As these are the likelihood equations of a saturated model,
μ̂0j2…jP + μ̂1j2…jP = n0j2…jP + n1j2…jP for all (j2, …, jP),
and this completes the proof. ▪
Proof of Theorem 3.2. —
All factors other than Y are present in the logistic regression, or no collapsing of cells. From Agresti [2, p. 193], the likelihood equations for the logistic regression model are
Σi tiyi xlt(i)j = Σi tip̂i xlt(i)j,
for j = 1, …, nβ. Now,
where, a[a1 : a2], specifies the vector formed by all elements from the a1th to the a2th element of vector a, including the a1th and a2th elements. Therefore,
Thus, to estimate β, the likelihood equations are
For the log-linear model, for λr[1 : nβ], the likelihood equations are
where j = 1, …, nβ. As, for all j, the likelihood equations for estimating λr[1 : nβ] are
As the two sets of equations coincide, the likelihood equations for estimating β and the corresponding λr[1 : nβ] are the same. Therefore, β̂ = λ̂r[1 : nβ], as the number of equations equals the number of parameters.
Factors not present in the logistic regression, with collapsing of cells. As Xlt repeats J2 × · · · × Jq times within Xlt*, the likelihood equations for estimating λr[1 : nβ], for j = 1, …, nβ, are shown below:
where
These are also the equations for estimating the logistic regression parameters β. So, β̂ = λ̂r[1 : nβ], as the number of equations equals the number of parameters. ▪
Proof of Theorem 3.3. —
Consider a vector of cell counts n = {n1, …, nnll}, and the log-linear model log(μ) = Xllλ. Then, from Agresti [2], asymptotically:
λ̂ ∼ N(λ, (Xll⊤WllXll)−1).
After rearranging the rows and columns of Xll, consider the log-linear model with linear predictor Xrllλr, for cell counts nr, where nr is n rearranged to correspond to Xrll. Now
denotes a diagonal matrix with non-zero elements , i = 1, …, nll/2. denotes a diagonal matrix with non-zero elements , i = 1, …, nll/2, where denotes the MLE for . Now,
where and . From Lutkepohl [10, p. 147, result 2(a)], and Lutkepohl [10, p. 29, line 6], the submatrix H that is formed by the first nβ rows and columns of is
Thus,
All factors other than Y are present in the logistic regression, or no collapsing of cells. Assume cell counts are not collapsed (by choice or when q = 1), so that nlt = nll/2 and Xlt* = Xlt. We now use the standard result (e.g. [11, p. 200]) that, asymptotically, the Binomial distribution of a data point tiyi, i = 1, …, nlt, can be approximated by the Normal distribution N(tipi, tipi(1 − pi)). Considering the Poisson log-linear model, the Binomial observation ti − ti × yi follows the Poisson distribution:
Therefore, approximately,
In matrix notation, we can now write that, asymptotically,
where has diagonal elements , i = 1, …, nlt. is, asymptotically, the variance of when the logistic regression is fitted directly, and this completes the proof when no collapsing of cell counts takes place.
Factors not present in the logistic regression, with collapsing of cells. When one chooses to collapse the counts in the contingency table cells that are only discriminated by the obsolete variables x.2, …, x.q,
where denotes a diagonal matrix with non-zero elements , i = 1, …, nlt. , k = 1, …, J2 × · · · × Jq, denotes a diagonal matrix with elements . Similar to the previous case, we use the standard result that, asymptotically, the Binomial distribution of a data point ti yi, i = 1, …, nlt, can be approximated by . When cell counts are collapsed, the Binomial observation ti − ti × yi is formed by adding J2 × · · · × Jq independent Poisson cell counts. Considering the Poisson log-linear model, ti − ti yi follows the Poisson distribution:
Therefore, approximately
In matrix notation, we can now write that, asymptotically
where t is a diagonal matrix with diagonal elements the number of trials ti, and has diagonal elements , i = 1, …, nlt. is, asymptotically, the variance of when the logistic regression is fitted directly, and this completes the proof. ▪
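The partitioned-inverse step invoked at the start of this proof (Lutkepohl's block-inversion result) can be checked numerically. Below is a minimal sketch, using an arbitrary symmetric positive-definite matrix with arbitrarily chosen block sizes: the leading block of the full inverse equals the inverse of the Schur complement.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 3, 4                               # arbitrary block sizes
M = rng.normal(size=(k + m, k + m))
M = M @ M.T + (k + m) * np.eye(k + m)     # symmetric positive definite

A, B, D = M[:k, :k], M[:k, k:], M[k:, k:]
H = np.linalg.inv(M)[:k, :k]              # leading k x k block of the inverse
schur = np.linalg.inv(A - B @ np.linalg.inv(D) @ B.T)

print(np.allclose(H, schur))
```

In the proof, M plays the role of the rearranged information matrix Xrll⊤WXrll, with k = nβ.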
Proof of Theorem 3.4. —
Assume that no cell observations are collapsed when one or more factors in the log-linear model are not present in the logistic regression. From (2.2),
This, in turn, is equal to
A 2
A 3
A 4 For the log-linear model, from (2.1),
This, in turn, is equal to
A 5
A 6
A 7
A 8 Now, (A2)=(A5) by inspection. Furthermore, from theorem 3.2, . As,
we have that (A3)=(A6). Finally, from Lemma 3.1,
Also,
Then,
This completes the proof of theorem 3.4. ▪
Data analysed in §4. The dataset for log-linear model (M2) is given by vector
The dataset for the logistic regression (M3) is
The dataset for (M4) is,
and
Data accessibility
All data considered in this manuscript are provided in appendix A. This article has no additional data.
Authors' contributions
W.J. and M.P. contributed equally to all parts of this manuscript. All authors gave final approval for publication.
Competing interests
We declare we have no competing interests.
Funding
The first author acknowledges the support of the School of Mathematics and Statistics, as well as CREEM, at the University of St Andrews, and the University of St Andrews St Leonard’s 7th Century Scholarship.
References
- 1. Christensen R. 1997. Log-linear models and logistic regression, 2nd edn. New York, NY: Springer.
- 2. Agresti A. 2002. Categorical data analysis, 2nd edn. Hoboken, NJ: John Wiley and Sons.
- 3. Christensen R. 1996. Plane answers to complex questions: the theory of linear models, 4th edn. New York, NY: Springer.
- 4. Bapat RB. 2001. Graphs and matrices. New Delhi, India: Springer; Hindustan Book Agency.
- 5. Papathomas M. 2018. On the correspondence from Bayesian log-linear modelling to logistic regression modelling with g-priors. Test 27, 197–220. (doi:10.1007/s11749-017-0540-8)
- 6. Wood SN. 2006. Generalized additive models: an introduction with R. New York, NY: Chapman and Hall.
- 7. Edwards D, Havránek T. 1985. A fast procedure for model search in multi-dimensional contingency tables. Biometrika 72, 339–351. (doi:10.1093/biomet/72.2.339)
- 8. Zhou J, Bhattacharya A, Herring AH, Dunson DB. 2015. Bayesian factorizations of big sparse tensors. J. Am. Statist. Assoc. 110, 1562–1576. (doi:10.1080/01621459.2014.983233)
- 9. Papathomas M, Richardson S. 2016. Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms. J. Stat. Plan. Infer. 173, 47–63. (doi:10.1016/j.jspi.2016.01.002)
- 10. Lutkepohl H. 1996. Handbook of matrices. Chichester, UK: John Wiley and Sons.
- 11. Rohatgi VK. 1976. An introduction to probability theory and mathematical statistics. New York, NY: John Wiley and Sons.
