Abstract
The two main algorithms that have been considered for fitting constrained marginal models to discrete data, one based on Lagrange multipliers and the other on a regression model, are studied in detail. It is shown that the updates produced by the two methods are identical, but that the Lagrangian method is more efficient in the case of identically distributed observations. A generalization is given of the regression algorithm for modelling the effect of exogenous individual-level covariates, a context in which the use of the Lagrangian algorithm would be infeasible for even moderate sample sizes. An extension of the method to likelihood-based estimation under L1-penalties is also considered.
Keywords: categorical data, L1-penalty, marginal log-linear model, maximum likelihood, non-linear constraint
1. Introduction
The application of marginal constraints to multi-way contingency tables has been much investigated in the last 20 years; see, for example, McCullagh and Nelder (1989); Liang et al. (1992); Lang and Agresti (1994); Glonek and McCullagh (1995); Agresti (2002); Bergsma et al. (2009). Bergsma and Rudas (2002) introduced marginal log-linear parameters (MLLPs), which generalize other discrete parameterizations including ordinary log-linear parameters and Glonek and McCullagh’s multivariate logistic parameters. The flexibility of this family of parameterizations enables their application to many popular classes of conditional independence models, and especially to graphical models (Forcina et al., 2010; Rudas et al., 2010; Evans and Richardson, 2013). Bergsma and Rudas (2002) show that, under certain conditions, models defined by linear constraints on MLLPs are curved exponential families. However, naïve algorithms for maximum likelihood estimation with MLLPs face several challenges: in general, there are no closed form equations for computing raw probabilities from MLLPs, so direct evaluation of the log-likelihood can be time consuming; in addition, MLLPs are not necessarily variation independent and, as noted by Bartolucci et al. (2007), ordinary Newton-Raphson or Fisher scoring methods may become stuck by producing updated estimates which are incompatible.
Lang (1996) and Bergsma (1997), amongst others, have tried to adapt a general algorithm introduced by Aitchison and Silvey (1958) for constrained maximum likelihood estimation to the context of marginal models. In this paper we provide an explicit formulation of Aitchison and Silvey’s algorithm, and show that an alternative method due to Colombi and Forcina (2001) is equivalent; we term this second approach the regression algorithm. Though the regression algorithm is less efficient, we show that it can be extended to deal with individual-level covariates, a context in which Aitchison and Silvey’s approach is infeasible, unless the sample size is very small. A variation of these algorithms, which can be used to fit marginal log-linear models under L1-penalties, and therefore perform automatic model selection, is also given.
Section 2 reviews marginal log-linear models and their basic properties, while in Section 3 we formulate the two algorithms, show that they are equivalent and discuss their properties. In Section 4 we derive an extension of the regression algorithm which can incorporate the effect of individual-level covariates. Finally Section 5 considers similar methods for L1-constrained estimation.
2. Notations and preliminary results
Let Xj, j = 1, …, d be categorical random variables taking values in {1, …, cj}. The joint distribution of X1, …, Xd is determined by the vector of joint probabilities π of dimension t = c1 × ⋯ × cd, whose entries correspond to cell probabilities, and are assumed to be strictly positive; we take the entries of π to be in lexicographic order. Further, let y denote the vector of cell frequencies with entries arranged in the same order as π. We write the multinomial log-likelihood in terms of the canonical parameters θ as
$$ l(\theta) = y' G \theta - n \log[1_t' \exp(G\theta)] $$
(see, for example, Bartolucci et al., 2007, p. 699); here n is the sample size, 1_t a vector of length t whose entries are all 1, and G a t × (t − 1) full rank design matrix which determines the log-linear parameterization. The mapping between the canonical parameters and the joint probabilities may be expressed as
$$ \log \pi = G\theta - 1_t \log[1_t' \exp(G\theta)] \quad\Longleftrightarrow\quad \theta = L \log \pi, $$
where L is a (t − 1) × t matrix of row contrasts and LG = I_{t−1}.
The score vector, s, and the expected information matrix, F, with respect to θ take the form
$$ s = G'(y - n\pi), \qquad F = n\, G' \Omega G; $$
here Ω = diag(π) − ππ′.
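The quantities above can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the 2 × 2 table of counts, the choice of G as the identity with its first column removed, and the function names are all assumptions made for the example.

```python
import numpy as np

def probs_from_theta(theta, G):
    """Cell probabilities from canonical parameters: pi is proportional to exp(G theta)."""
    lin = G @ theta
    p = np.exp(lin - lin.max())        # subtract the maximum for numerical stability
    return p / p.sum()

def score_and_information(theta, y, G):
    """Score s = G'(y - n pi) and expected information F = n G' Omega G."""
    n = y.sum()
    pi = probs_from_theta(theta, G)
    omega = np.diag(pi) - np.outer(pi, pi)
    s = G.T @ (y - n * pi)
    F = n * G.T @ omega @ G
    return s, F

# Hypothetical 2 x 2 table, cells in lexicographic order (1,1), (1,2), (2,1), (2,2).
y = np.array([30.0, 20.0, 10.0, 40.0])
t = y.size
G = np.eye(t)[:, 1:]                   # assumed design: identity without its first column
theta_sat = np.log(y[1:] / y[0])       # with this G, theta_k = log(pi_{k+1} / pi_1)
pi_sat = probs_from_theta(theta_sat, G)
s_sat, F_sat = score_and_information(theta_sat, y, G)
```

At the saturated estimate the fitted probabilities equal the observed proportions, so the score vanishes, which gives a simple sanity check on the formulas.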
2.1. Marginal log-linear parameters
Marginal log-linear parameters (MLLPs) enable the simultaneous modelling of several marginal distributions (see, for example, Bergsma et al., 2009, Chapters 2 and 4) and the specification of suitable conditional independencies within marginal distributions of interest (see Evans and Richardson, 2013). In the following let η denote an arbitrary vector of MLLPs; it is well known that this can be written as
$$ \eta = C \log(M\pi), $$
where C is a suitable matrix of row contrasts, and M a matrix of 0’s and 1’s producing the appropriate margins (see, for example, Bergsma et al., 2009, Section 2.3.4).
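As a concrete illustration of the matrices C and M, the following sketch builds them for a 2 × 2 table, with the two marginal logits defined in the one-dimensional margins and the log odds ratio defined in the joint table. The particular choice of margins, contrasts and the function name `mllp` are assumptions made for this example only.

```python
import numpy as np

# M stacks the marginalisation matrices (cells in lexicographic order):
# rows 1-2 give the X1 margin, rows 3-4 the X2 margin, rows 5-8 the joint table.
M = np.vstack([
    [[1, 1, 0, 0], [0, 0, 1, 1]],
    [[1, 0, 1, 0], [0, 1, 0, 1]],
    np.eye(4),
]).astype(float)

# Rows of C are contrasts: logit of X1, logit of X2, log odds ratio.
C = np.array([
    [-1, 1,  0, 0, 0,  0,  0, 0],
    [ 0, 0, -1, 1, 0,  0,  0, 0],
    [ 0, 0,  0, 0, 1, -1, -1, 1],
], dtype=float)

def mllp(pi):
    """Marginal log-linear parameters eta = C log(M pi)."""
    return C @ np.log(M @ pi)

# Under independence the log odds ratio (third parameter) is zero.
pi_indep = np.outer([0.5, 0.5], [0.4, 0.6]).ravel()
eta = mllp(pi_indep)
```

Here u = 8 and t = 4, so C is 3 × 8 and M is 8 × 4, in line with the dimensions used later when comparing the algorithms.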
Bergsma and Rudas (2002) have shown that if a vector of MLLPs η is complete and hierarchical, two properties defined below, models determined by linear restrictions on η are curved exponential families, and thus smooth. Like ordinary log-linear parameters, MLLPs may be grouped into interaction terms involving a particular subset of variables; each interaction term must be defined within a margin of which it is a subset.
Definition 1
A vector of MLLPs η is called complete if every possible interaction is defined in precisely one margin.
Definition 2
A vector of MLLPs η is called hierarchical if there is a nondecreasing ordering of the margins of interest M1, …, Ms such that, for each j = 1, …, s, no interaction term which is a subset of Mj is defined within a later margin.
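One simple way to obtain a parameterization satisfying both definitions is to assign each interaction to the first margin, in the given order, that contains it. The sketch below illustrates this rule; the function name and the example margins are hypothetical, and the rule is offered as an illustration consistent with Definitions 1 and 2 rather than as the only possible construction.

```python
from itertools import combinations

def assign_interactions(margins):
    """Assign every non-empty subset of the variables to the first margin containing it.

    `margins` is an ordered sequence of tuples of variable labels. Each interaction
    is defined in exactly one margin (completeness), and never in a margin later
    than the first one containing it (hierarchy).
    """
    variables = sorted(set().union(*margins))
    assignment = {}
    for size in range(1, len(variables) + 1):
        for effect in combinations(variables, size):
            for margin in margins:
                if set(effect) <= set(margin):
                    assignment[effect] = tuple(sorted(margin))
                    break
            else:
                # No margin contains this interaction, so completeness fails.
                raise ValueError(f"no margin contains interaction {effect}")
    return assignment

margins = [(1,), (1, 2), (1, 2, 3)]
assignment = assign_interactions(margins)
```

With these margins the main effect of variable 1 is defined in the margin {1}, the effects {2} and {1, 2} in the margin {1, 2}, and all remaining interactions in the full table.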
3. Two algorithms for fitting marginal log-linear models
Here we describe the two main algorithms used for fitting models of the kind described above.
3.1. An adaptation of Aitchison and Silvey’s algorithm
Aitchison and Silvey (1958) study maximum likelihood estimation under non-linear constraints in a very general context, showing that, under certain conditions, the maximum likelihood estimates exist and are asymptotically normal; they also outline an algorithm for computing those estimates. Suppose we wish to maximize l(θ) subject to h(θ) = 0, a set of r non-linear constraints, under the assumption that the second derivative of h(θ) exists and is bounded. Aitchison and Silvey consider stationary points of the function l(θ) + h(θ)′λ, where λ is a vector of Lagrange multipliers; this leads to the system of equations
$$ s(\hat\theta) + H(\hat\theta)\hat\lambda = 0, \qquad h(\hat\theta) = 0, \tag{1} $$
where θ̂ is the ML estimate and H the derivative of h′ with respect to θ. Since these are non-linear equations, they suggest an iterative algorithm which proceeds as follows: suppose that at the current iteration we have θ0, a value reasonably close to θ̂. Replace s and h with first order approximations around θ0; in addition replace H(θ̂) with H(θ0) and the second derivative of the log-likelihood with −F, minus the expected information matrix. The resulting equations, after rearrangement, may be written in matrix form as
$$ \begin{pmatrix} \hat\theta - \theta_0 \\ \hat\lambda \end{pmatrix} = \begin{pmatrix} F_0 & -H_0 \\ -H_0' & 0 \end{pmatrix}^{-1} \begin{pmatrix} s_0 \\ h_0 \end{pmatrix}, $$
where s0, F0, H0 and so on denote the corresponding quantities evaluated at θ0. To compute a solution, Aitchison and Silvey (1958) exploit the structure of the partitioned matrix, while Bergsma (1997) solves explicitly for θ̂ by substitution; in both cases, if we are uninterested in the Lagrange multipliers, we get the updating equation
$$ \hat\theta = \theta_0 + F_0^{-1}s_0 - F_0^{-1}H_0 (H_0' F_0^{-1} H_0)^{-1} (H_0' F_0^{-1} s_0 + h_0). \tag{2} $$
As noted by Bergsma (1997), the algorithm does not always converge unless some sort of step length adjustment is introduced.
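The following sketch iterates an update of the form (2), without step length adjustment, for a hypothetical 2 × 2 table under the single constraint that the log odds ratio is zero, so that the result can be compared with the closed-form independence fit. The data, the matrices C, M and K, the design G and the function names are assumptions made for this illustration; the constraint derivative is computed as H′ = K′C diag(Mπ)⁻¹ M diag(π) G.

```python
import numpy as np

def probs(theta, G):
    lin = G @ theta
    p = np.exp(lin - lin.max())
    return p / p.sum()

def aitchison_silvey_step(theta0, y, G, C, M, K):
    """One update of the form (2); no step-length adjustment is applied."""
    n = y.sum()
    pi = probs(theta0, G)
    s0 = G.T @ (y - n * pi)                                  # score
    F0 = n * G.T @ (np.diag(pi) - np.outer(pi, pi)) @ G      # expected information
    h0 = K.T @ C @ np.log(M @ pi)                            # constraint h = K' eta
    H0 = (K.T @ C @ np.diag(1.0 / (M @ pi)) @ M @ np.diag(pi) @ G).T
    Finv_s = np.linalg.solve(F0, s0)
    Finv_H = np.linalg.solve(F0, H0)
    mult = np.linalg.solve(H0.T @ Finv_H, H0.T @ Finv_s + h0)
    return theta0 + Finv_s - Finv_H @ mult

# Hypothetical 2 x 2 example: marginal logits and log odds ratio as MLLPs.
M = np.vstack([[[1, 1, 0, 0], [0, 0, 1, 1]],
               [[1, 0, 1, 0], [0, 1, 0, 1]],
               np.eye(4)]).astype(float)
C = np.array([[-1, 1, 0, 0, 0, 0, 0, 0],
              [0, 0, -1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, -1, -1, 1]], dtype=float)
K = np.array([[0.0], [0.0], [1.0]])      # constrains the log odds ratio to zero
y = np.array([30.0, 20.0, 10.0, 40.0])
G = np.eye(4)[:, 1:]

theta = np.log(y[1:] / y[0])             # start from the saturated estimate
for _ in range(25):
    theta = aitchison_silvey_step(theta, y, G, C, M, K)
pi_hat = probs(theta, G)

# Closed-form independence fit: product of the observed margins.
table = y.reshape(2, 2)
pi_indep = np.outer(table.sum(axis=1), table.sum(axis=0)).ravel() / y.sum() ** 2
```

In this well-behaved example the plain iteration settles on the independence fit; in harder problems a step length adjustment of the kind mentioned above may be needed.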
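Linearly constrained marginal models are defined by K′η = 0, where K is a matrix of full column rank r ≤ t − 1. The multinomial likelihood is a regular exponential family, so these models may be fitted using the smooth constraint h(θ) = K′η(θ) = 0, which implies that
$$ H' = \frac{\partial h}{\partial \theta'} = \frac{\partial h}{\partial \eta'} \frac{\partial \eta}{\partial \theta'} = K' C \,\mathrm{diag}(M\pi)^{-1} M \,\mathrm{diag}(\pi)\, G. $$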
Remark 1
In the equation above we have replaced Ω with diag(π) by exploiting the fact that η is a homogeneous function of π (see Bergsma et al., 2009, Section 2.3.4). If the constrained model were not smooth then at singular points the Jacobian matrix R would not be invertible, implying that H is not of full rank and thus violating a crucial assumption in Aitchison and Silvey (1958). It has been shown (Bergsma and Rudas, 2002, Theorem 3) that completeness is a necessary condition for smoothness.
Calculation of (2) may be simplified by noting that K′C does not need to be updated; in addition, if we choose, for example, G to be the identity matrix of size t with the first column removed, an explicit inverse of F exists:
$$ F^{-1} = n^{-1}\left[ \mathrm{diag}(\dot\pi)^{-1} + \frac{1_{t-1} 1_{t-1}'}{1 - 1_{t-1}'\dot\pi} \right], $$
where π̇ denotes the vector π with the first element removed; this expression may be exploited when computing F⁻¹H.
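The explicit inverse can be checked numerically. The sketch below assumes G is the identity of size t with its first column removed, so that F = n[diag(π̇) − π̇π̇′], and uses an arbitrary probability vector chosen for illustration.

```python
import numpy as np

pi = np.array([0.3, 0.2, 0.1, 0.4])      # hypothetical cell probabilities
n = 100
t = pi.size
G = np.eye(t)[:, 1:]                      # identity with first column removed

# Expected information F = n G' Omega G.
F = n * G.T @ (np.diag(pi) - np.outer(pi, pi)) @ G

# Closed form: diag(pi_dot)^{-1} plus a constant matrix with entries 1 / pi_1, all divided by n.
pi_dot = pi[1:]
F_inv = (np.diag(1.0 / pi_dot) + np.ones((t - 1, t - 1)) / (1.0 - pi_dot.sum())) / n
```

Since 1 − 1′π̇ is simply the first cell probability, the closed form costs O(t) to assemble implicitly, whereas a general inversion costs O(t³).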
3.2. A regression algorithm
By noting that the Aitchison-Silvey algorithm is essentially based on a quadratic approximation of l(θ) with a linear approximation of the constraints, Colombi and Forcina (2001) designed an algorithm which they believed to be equivalent to the original, though no formal argument was provided; this equivalence is proven in Proposition 1 below. Recall that, by elementary linear algebra, there exists a (t − 1) × (t − r − 1) design matrix X of full column rank such that K′X = 0, from which it follows that η = Xβ for a vector of t − r − 1 unknown parameters β. Let
$$ R = \frac{\partial\theta}{\partial\eta'} = [C\,\mathrm{diag}(M\pi)^{-1} M\,\mathrm{diag}(\pi)\,G]^{-1} $$
and s̄ = R′s, F̄ = R′F R respectively denote the score and information relative to η; then the regression algorithm consists of alternating the following steps:
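- update the estimate of β by
$$ \hat\beta - \beta_0 = (X'\bar F_0 X)^{-1} X'(\bar F_0\gamma_0 + \bar s_0), \tag{3} $$
where γ0 = η0 − Xβ0;
- update θ by
$$ \hat\theta - \theta_0 = R_0[X(\hat\beta - \beta_0) - \gamma_0]. \tag{4} $$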
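The two steps can be sketched as follows for a hypothetical 2 × 2 table in which the log odds ratio is constrained to zero, so that X selects the two marginal logits. The data, the matrices C, M and X, the design G and the function names are assumptions made for this illustration, not the authors' code.

```python
import numpy as np

def probs(theta, G):
    lin = G @ theta
    p = np.exp(lin - lin.max())
    return p / p.sum()

def regression_step(theta0, beta0, y, G, C, M, X):
    """One pass through steps (3) and (4)."""
    n = y.sum()
    pi = probs(theta0, G)
    eta0 = C @ np.log(M @ pi)
    # R = d theta / d eta' = [C diag(M pi)^{-1} M diag(pi) G]^{-1}
    R0 = np.linalg.inv(C @ np.diag(1.0 / (M @ pi)) @ M @ np.diag(pi) @ G)
    F0 = n * G.T @ (np.diag(pi) - np.outer(pi, pi)) @ G
    s_bar = R0.T @ G.T @ (y - n * pi)      # score relative to eta
    F_bar = R0.T @ F0 @ R0                 # information relative to eta
    gamma0 = eta0 - X @ beta0
    delta = np.linalg.solve(X.T @ F_bar @ X, X.T @ (F_bar @ gamma0 + s_bar))  # step (3)
    theta1 = theta0 + R0 @ (X @ delta - gamma0)                              # step (4)
    return theta1, beta0 + delta

M = np.vstack([[[1, 1, 0, 0], [0, 0, 1, 1]],
               [[1, 0, 1, 0], [0, 1, 0, 1]],
               np.eye(4)]).astype(float)
C = np.array([[-1, 1, 0, 0, 0, 0, 0, 0],
              [0, 0, -1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, -1, -1, 1]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # log odds ratio fixed at zero
y = np.array([30.0, 20.0, 10.0, 40.0])
G = np.eye(4)[:, 1:]

theta, beta = np.log(y[1:] / y[0]), np.zeros(2)
for _ in range(25):
    theta, beta = regression_step(theta, beta, y, G, C, M, X)
pi_hat = probs(theta, G)
```

For these data the fitted probabilities are the product of the observed margins, and β̂ contains the two marginal logits.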
Proposition 1
The updating equation in (2) is equivalent to the combined steps given in (3) and (4).
Proof
First, consider matrices X and K such that the columns of X span the orthogonal complement of the space spanned by the columns of K. Then we claim that for any symmetric and positive definite matrix W
$$ W^{-1} - W^{-1}K(K'W^{-1}K)^{-1}K'W^{-1} = X(X'WX)^{-1}X'. \tag{5} $$
To see this, let U = W^{−1/2}K and V = W^{1/2}X and note that U′V = K′X = 0; then (5) follows from the identity U(U′U)⁻¹U′ + V(V′V)⁻¹V′ = I after pre- and post-multiplying both sides by W^{−1/2}.
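Now, recall s̄ = R′s and F̄ = R′F R, and note that
$$ F_0^{-1} = R_0 \bar F_0^{-1} R_0', \qquad H_0 = (R_0^{-1})' K, \qquad h_0 = K'\eta_0; $$
using this in the updating equation (2) enables us to rewrite it as
$$ \hat\theta - \theta_0 = R_0\left[\bar F_0^{-1} - \bar F_0^{-1}K(K'\bar F_0^{-1}K)^{-1}K'\bar F_0^{-1}\right]\bar s_0 - R_0\bar F_0^{-1}K(K'\bar F_0^{-1}K)^{-1}K'\eta_0. \tag{6} $$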
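Set W = F̄0 and note that (5) may be substituted into the first component of (6) and that its equivalent formulation
$$ \bar F_0^{-1}K(K'\bar F_0^{-1}K)^{-1}K' = I - X(X'\bar F_0 X)^{-1}X'\bar F_0 $$
may be substituted into the second component, giving
$$ \hat\theta - \theta_0 = R_0\left[X(X'\bar F_0X)^{-1}X'(\bar s_0 + \bar F_0\eta_0) - \eta_0\right]. $$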
This is easily seen to be the same as combining equations (3) and (4).
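The matrix identity (5) on which the proof rests can be checked numerically. In the sketch below the dimensions, the random matrices and the use of a complete QR decomposition to obtain an orthogonal complement are choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 5, 2

K = rng.standard_normal((p, r))           # full column rank with probability one
Q, _ = np.linalg.qr(K, mode="complete")   # first r columns of Q span col(K)
X = Q[:, r:]                              # columns span the orthogonal complement of K

A = rng.standard_normal((p, p))
W = A @ A.T + p * np.eye(p)               # symmetric positive definite
W_inv = np.linalg.inv(W)

lhs = W_inv - W_inv @ K @ np.linalg.inv(K.T @ W_inv @ K) @ K.T @ W_inv
rhs = X @ np.linalg.inv(X.T @ W @ X) @ X.T
```

The two sides agree up to rounding error, and replacing X by XA for any non-singular A leaves the right-hand side unchanged.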
Remark 2
From the form of the updating equations (2), (3) and (4) it is clear that Proposition 1 remains true if identical step length adjustments are applied to the θ updates. This does not hold, however, if adjustments are applied to the β updates of the regression algorithm.
3.2.1. Derivation of the regression algorithm
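In a neighbourhood of θ0, approximate l(θ) by a quadratic function Q having the same information matrix and the same score vector as l at θ0,
$$ Q(\theta) = -\tfrac12(\theta-\theta_0)'F_0(\theta-\theta_0) + s_0'(\theta-\theta_0). $$
Now compute a linear approximation of θ with respect to β in a neighbourhood of θ0,
$$ \theta - \theta_0 \cong R_0(\eta - \eta_0) = R_0(X\beta - \eta_0); \tag{7} $$
substituting into the expression for Q we obtain a quadratic function in β. By adding and subtracting R0Xβ0 and setting δ = β − β0, we have
$$ Q(\delta) = -\tfrac12 (X\delta - \gamma_0)'\bar F_0 (X\delta-\gamma_0) + \bar s_0'(X\delta-\gamma_0). $$
A weighted least squares solution of this local maximization problem gives (3); substitution into (7) gives (4).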
Remark 3
The choice of X is somewhat arbitrary because the design matrix XA, where A is any non-singular matrix, implements the same set of constraints as X. In many cases an obvious choice for X is provided by the context; otherwise, if we are not interested in the interpretation of β, any numerical complement of K will do.
3.3. Comparison of the two algorithms
Since the matrices C and M have dimensions (t − 1) × u and u × t respectively, where the value of u ≥ t depends upon the particular parametrization, the hardest step in the Aitchison-Silvey algorithm is the computation of (K′C) diag(Mπ)⁻¹M, whose computational complexity is O(rut). In contrast, the hardest step in the regression algorithm is the computation of R, which has computational complexity O(ut² + t³), making this procedure clearly less efficient. However, the regression algorithm can be extended to models with individual covariates, a context in which it is usually much faster than a straightforward extension of the ordinary algorithm; see Section 4.
Note that because step adjustments, if used, are not made on the same scale, each algorithm may take a slightly different number of steps to converge.
3.4. Properties of the algorithms
Detailed conditions for the asymptotic existence of the maximum likelihood estimates of constrained models are given by Aitchison and Silvey (1958); see also Bergsma and Rudas (2002), Theorem 8. Much less is known about existence for finite sample sizes where estimates might fail to exist because of observed zeros. In this case, some elements of π̂ may converge to 0, leading the Jacobian matrix R to become ill-conditioned and making the algorithm unstable.
Concerning the convergence properties of their algorithm, Aitchison and Silvey (1958, p. 827) noted only that it could be seen as a modified Newton algorithm and that similar modifications had been used successfully elsewhere. However, it is clear from the form of the updating equations that, if the algorithms converge to some θ*, then the constraints h(θ*) = 0 are satisfied, and θ* is a stationary point of the constrained likelihood. In addition, as a consequence of the Karush-Kuhn-Tucker conditions, if a local maximum of the constrained objective function exists, then it will be a saddle point of the Lagrangian (see, for example, Bertsekas, 1999).
To ensure that a stationary point β̂ reached by the algorithm is indeed a local maximum, one could check that the observed information with respect to β is positive definite. An efficient formula for computing the observed information matrix is given in Appendix A. Since the log-likelihood of constrained marginal models is not, in general, concave, it might be advisable to apply the algorithm to a range of starting values, in order to check that the achieved maximum is the global one.
3.5. Extension to more general constraints
Occasionally, one may wish to fit general constraints on marginal probabilities without the need to define a marginal log-linear parameterization; an interesting example is provided by the relational models of Klimova et al. (2011). They consider constrained models of the form h(θ) = A log(Mπ) = 0, where A is an arbitrary matrix of full row rank. Redefine
$$ K' = \frac{\partial h}{\partial\theta'} = A\,\mathrm{diag}(M\pi)^{-1} M\, \Omega\, G $$
and note that, because A is not a matrix of row contrasts, h is not homogeneous in π, and thus the simplification of Ω mentioned in Remark 1 does not apply. If the resulting model is smooth, implying that K is a matrix of full column rank r everywhere in the parameter space, it can be fitted with the ordinary Aitchison-Silvey algorithm. We now show how the same model can also be fitted by a slight extension of the regression algorithm.
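Let θ0 be a starting value and K̄0 be a right inverse of K′ at θ0; consider a first order expansion of the constraints
$$ h \cong h_0 + K_0'(\theta-\theta_0) = K_0'(\bar K_0 h_0 + \theta - \theta_0) = 0, $$
and let X0 be a matrix whose columns span the orthogonal complement of K0. Then, with the same order of approximation,
$$ \bar K_0 h_0 + \theta - \theta_0 = X_0\beta; $$
by solving the above equation for θ − θ0 and substituting into the quadratic approximation of the log-likelihood, we obtain an updating equation similar to (3):
$$ \hat\beta = (X_0'F_0X_0)^{-1}X_0'(s_0 + F_0\bar K_0 h_0). $$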
4. Modelling the effect of individual-level covariates
When exogenous individual-level covariates are available, it may be of interest to allow the marginal log-linear parameters to depend upon them as in a linear model: ηi = C log(M πi) = X iβ; here the matrix X i specifies how the non-zero marginal log-linear parameters depend on individual specific information, in addition to structural restrictions such as conditional independencies. Let yi, i = 1, …, n, be a vector of length t with a 1 in the entry corresponding to the response pattern of the ith individual, and all other values 0; define y to be the vector obtained by stacking the vectors yi, one below the other. Alternatively, if the sample size is large and the covariates can take only a limited number of distinct values, yi may contain the frequency table of the response variables within the sub-sample of subjects with the ith configuration of the covariates; in this case n denotes the number of strata. This arrangement avoids the need to construct a joint contingency table of responses and covariates; in addition the covariate configurations with no observations are simply ignored.
In either case, to implement the Aitchison-Silvey approach, stack the Xi matrices one below the other into the matrix X, and let K span the orthogonal complement of X; as before, we have to fit the set of constraints K′η = 0. However, whilst q, the size of β, does not depend on the number of subjects, H is now of size [n(t − 1) − q] × n(t − 1), and its computation has complexity O(n³t²u), where u ≥ t as before; in addition, the inversion of the [n(t − 1) − q] × [n(t − 1) − q] matrix H′F⁻¹H has complexity O(n³t³). With n moderately large, this approach becomes almost infeasible.
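For the regression algorithm, let θi denote the vector of canonical parameters for the ith individual and
$$ l(\theta_i) = y_i' G\theta_i - \log[1_t'\exp(G\theta_i)] $$
be the contribution to the log-likelihood. Note that Xi need not be of full column rank, a property which must instead hold for the matrix X; for this reason our assumptions are much weaker than those used by Lang (1996), and allow for more flexible models. Both the quadratic and the linear approximations must be applied at the individual level; thus we set θi − θi0 = Ri0(Xiβ − ηi0), and the log-likelihood becomes, up to an additive constant,
$$ \sum_{i=1}^n l(\theta_i) \cong \sum_{i=1}^n \left[ -\tfrac12 (X_i\delta - \gamma_{i0})' R_{i0}'F_{i0}R_{i0}(X_i\delta-\gamma_{i0}) + s_{i0}'R_{i0}(X_i\delta-\gamma_{i0}) \right], $$
where δ = β − β0, γi0 = ηi0 − Xiβ0, si = G′(yi − πi) and Fi = G′ΩiG.
Direct calculations lead to the updating expression
$$ \hat\beta - \beta_0 = \left(\sum_{i=1}^n X_i'W_iX_i\right)^{-1} \sum_{i=1}^n X_i'(W_i\gamma_{i0} + R_{i0}'s_{i0}), $$
where Wi = Ri0′Fi0Ri0. Thus, the procedure depends upon n only in that we have to sum across subjects, and so the complexity is O(n(t²u + t³)).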
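A minimal sketch of the individual-level update is given below. It accumulates the two sums over subjects and then updates each θi through the linear approximation. To keep the example small it uses a binary response (t = 2), for which the marginal parameter is a single logit and the procedure reduces to Fisher scoring for logistic regression; the data, the covariate and all names are hypothetical.

```python
import numpy as np

def probs(theta, G):
    lin = G @ theta
    p = np.exp(lin - lin.max())
    return p / p.sum()

def covariate_step(thetas, beta0, Y, Xs, G, C, M):
    """One regression update with eta_i = X_i beta, summing over individuals."""
    q = beta0.size
    lhs, rhs, cache = np.zeros((q, q)), np.zeros(q), []
    for theta_i, y_i, X_i in zip(thetas, Y, Xs):
        pi = probs(theta_i, G)
        eta_i = C @ np.log(M @ pi)
        R_i = np.linalg.inv(C @ np.diag(1.0 / (M @ pi)) @ M @ np.diag(pi) @ G)
        s_i = G.T @ (y_i - pi)
        F_i = G.T @ (np.diag(pi) - np.outer(pi, pi)) @ G
        W_i = R_i.T @ F_i @ R_i
        gamma_i = eta_i - X_i @ beta0
        lhs += X_i.T @ W_i @ X_i
        rhs += X_i.T @ (W_i @ gamma_i + R_i.T @ s_i)
        cache.append((R_i, gamma_i))
    delta = np.linalg.solve(lhs, rhs)
    new_thetas = [th + R_i @ (X_i @ delta - g_i)
                  for th, X_i, (R_i, g_i) in zip(thetas, Xs, cache)]
    return new_thetas, beta0 + delta

# Hypothetical binary response with one covariate.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, -1.0, 1.0, 0.0])
y01 = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])
Y = [np.array([1.0 - v, v]) for v in y01]          # indicator vectors y_i
Xs = [np.array([[1.0, xi]]) for xi in x]           # X_i = (1, x_i)
G = np.array([[0.0], [1.0]])
M = np.eye(2)
C = np.array([[-1.0, 1.0]])

thetas, beta = [np.zeros(1) for _ in x], np.zeros(2)
for _ in range(25):
    thetas, beta = covariate_step(thetas, beta, Y, Xs, G, C, M)
```

Each iteration only requires one pass over the subjects, with a small matrix inversion per subject, which is the source of the linear dependence on n.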
As an example of the utility of the method described above, consider the application to social mobility tables in Dardanoni et al. (2012). Social mobility tables are cross classifications of subjects according to their social class (columns) and that of their fathers (rows). The hypothesis of equality of opportunity would imply that the social class of sons is independent of that of their fathers. Mediating covariates may induce positive dependence between the social classes of fathers and sons, leading to the appearance of limited social mobility; to assess this, Dardanoni et al. (2012) fitted a model in which the vector of marginal parameters for each father-son pair was allowed to depend on individual covariates, including the father’s age, the results of cognitive and non-cognitive test scores taken by the son at school, and his academic qualifications. The analysis, based on the UK’s National Child Development Survey, included 1,942 father-son pairs classified in a 3 × 3 table. All marginal log-linear parameters for the father were allowed to depend on father’s age, the only available covariate for fathers; the parameters for the son and the interactions were allowed to depend on all 11 available covariates. The fitted model used 76 parameters.
5. L1-penalized parameters
Evans (2011) shows that, in the context of marginal log-linear parameters, consistent model selection can be performed using the so-called adaptive lasso. Since the adaptive lasso uses L1-penalties, we might therefore be interested in relaxing the equality constraints discussed above to a penalization framework, in which we maximize the penalized log-likelihood
$$ \phi(\theta) = l(\theta) - \sum_{j=1}^{t-1} \nu_j |\eta_j(\theta)| $$
for some vector of penalties ν = (νj) ≥ 0.
The advantage of penalties of this form is that one can obtain parameter estimates which are exactly zero (Tibshirani, 1996). Setting parameters of the form η to zero corresponds to many interesting submodels, such as those defined by conditional independences (Forcina et al., 2010; Rudas et al., 2010); we can therefore perform model selection without the need to fit many models separately. For now, assume that no equality constraints hold for η, so we can take X to be the identity, and β = η. This gives the quadratic form
$$ Q(\eta) = -\tfrac12(\eta-\eta_0)'\bar F_0(\eta-\eta_0) + \bar s_0'(\eta-\eta_0) $$
approximating l(θ) as before. Then φ is approximated by
$$ \tilde\phi(\eta) = Q(\eta) - \sum_{j=1}^{t-1} \nu_j|\eta_j|, $$
and we can attempt to maximize φ by repeatedly solving the sub-problem of maximizing φ̃. Now, because the quadratic form Q(η) is concave and differentiable, and the absolute value function | · | is convex, so that each penalty term −νj|ηj| is concave, coordinate-wise ascent is guaranteed to find a local maximum of φ̃ (Tseng, 2001). Coordinate-wise ascent cycles through j = 1, 2, …, t − 1, at each step minimizing
$$ -Q(\eta) + \nu_j|\eta_j| $$
with respect to ηj, with η1, …, ηj−1, ηj+1, …, ηt−1 held fixed. This is solved just by taking
$$ \eta_j = \mathrm{sign}(\check\eta_j)\left(|\check\eta_j| - \nu_j/[\bar F_0]_{jj}\right)_+, $$
where a+ = max{a, 0}, [F̄0]jj is the jth diagonal entry of F̄0, and η̌j maximizes Q with respect to ηj with the other coordinates held fixed (Friedman et al., 2010). This approach to the sub-problem may require a large number of iterations, but it is extremely fast in practice because each step is so simple. If the overall algorithm converges, then by a similar argument to that of Section 3.4, together with the fact that φ̃ has the same supergradient as φ at η = η0, we see that we must have reached a local maximum of φ.
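The inner sub-problem can be sketched as a soft-thresholding loop. The code below maximizes Q(η) − Σj νj|ηj| for a quadratic Q written as −½(η − η0)′F̄0(η − η0) + s̄0′(η − η0); this form of Q, the scaling of the threshold by the diagonal of F̄0, the fixed number of sweeps and the numerical inputs are assumptions made for the illustration.

```python
import numpy as np

def soft_threshold(z, thresh):
    """sign(z) * (|z| - thresh)_+"""
    return np.sign(z) * max(abs(z) - thresh, 0.0)

def penalized_quadratic_ascent(eta0, s_bar, F_bar, nu, sweeps=200):
    """Coordinate-wise maximisation of Q(eta) - sum_j nu_j |eta_j|."""
    eta = np.array(eta0, dtype=float)
    for _ in range(sweeps):
        for j in range(eta.size):
            # Partial derivative of Q with respect to eta_j at the current point.
            grad_j = s_bar[j] - F_bar[j] @ (eta - eta0)
            # Unpenalised maximiser of Q along coordinate j.
            eta_check = eta[j] + grad_j / F_bar[j, j]
            # The penalty shrinks it towards zero by nu_j / F_bar[j, j].
            eta[j] = soft_threshold(eta_check, nu[j] / F_bar[j, j])
    return eta

# With a diagonal information matrix the solution is available in closed form.
eta0 = np.array([1.0, 0.1])
s_bar = np.zeros(2)
F_bar = np.diag([2.0, 4.0])
nu = np.array([1.0, 1.0])
eta_hat = penalized_quadratic_ascent(eta0, s_bar, F_bar, nu)
```

In this example the first coordinate is shrunk from 1 to 0.5 and the second is set exactly to zero, illustrating how the penalty performs selection.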
Since penalty selection for the lasso and adaptive lasso is typically performed using computationally intensive procedures such as cross validation, its implementation makes fast algorithms such as the one outlined above essential.
Acknowledgments
We thank two anonymous referees, the associate editor, and editor for their suggestions, corrections, and patience.
Appendix A. Computation of the observed information matrix
Lemma 1
Suppose that A is a p×q matrix, and that y, b, x and u are column vectors with respective lengths q, p, k and r. Then if A and b are constant,
| (A.1) |
Proof
The observed information matrix is minus the second derivative of the log-likelihood with respect to β, that is
Since s depends on θ through both (y−nπ) and R, the above derivative has two main components, where the one obtained by differentiating (y − nπ) is minus the expected information. Using the well known expression for the derivative of an inverse matrix, it only remains to compute
where A = X′R′ and b = R′G′(y −nπ), giving
By two applications of (A.1), this is
References
- Agresti A. Categorical data analysis. John Wiley and Sons; 2002.
- Aitchison J, Silvey SD. Maximum-likelihood estimation of parameters subject to restraints. Ann Math Stat. 1958;29(3):813–828.
- Bartolucci F, Colombi R, Forcina A. An extended class of marginal link functions for modelling contingency tables by equality and inequality constraints. Statist Sinica. 2007;17(2):691.
- Bergsma W, Croon M, Hagenaars JA. Marginal Models: For Dependent, Clustered, and Longitudinal Categorical Data. Springer Verlag; 2009.
- Bergsma WP. Marginal models for categorical data. Tilburg University Press; Tilburg: 1997.
- Bergsma WP, Rudas T. Marginal models for categorical data. Ann Statist. 2002;30(1):140–159.
- Bertsekas DP. Nonlinear programming. 2nd ed. Athena Scientific; 1999.
- Colombi R, Forcina A. Marginal regression models for the analysis of positive association of ordinal response variables. Biometrika. 2001;88(4):1001–1019.
- Dardanoni V, Fiorini M, Forcina A. Stochastic monotonicity in intergenerational mobility tables. J Appl Econometrics. 2012;27:85–107.
- Evans RJ. Parametrizations of discrete graphical models. PhD thesis. University of Washington; 2011.
- Evans RJ, Richardson TS. Marginal log-linear parameters for graphical Markov models. J Roy Statist Soc B. To appear 2013. doi: 10.1111/rssb.12020.
- Forcina A, Lupparelli M, Marchetti GM. Marginal parameterizations of discrete models defined by a set of conditional independencies. Journ Mult Analysis. 2010;101(10):2519–2527.
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Soft. 2010;33(1).
- Glonek GFV, McCullagh P. Multivariate logistic models. J R Statist Soc B. 1995;57(3):533–546.
- Klimova A, Rudas T, Dobra A. Relational models for contingency tables. 2011. arXiv:1102.5390.
- Lang JB. Maximum likelihood methods for a generalized class of log-linear models. Ann Statist. 1996;24(2):726–752.
- Lang JB, Agresti A. Simultaneously modeling joint and marginal distributions of multivariate categorical responses. J Amer Statist Assoc. 1994;89(426):625–632.
- Liang KY, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data. J R Statist Soc B. 1992;54(1):3–40.
- McCullagh P, Nelder JA. Generalized linear models. Chapman & Hall/CRC; 1989.
- Rudas T, Bergsma WP, Németh R. Marginal log-linear parameterization of conditional independence models. Biometrika. 2010;97(4):1006–1012.
- Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Statist Soc B. 1996;58(1):267–288.
- Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl. 2001;109(3):475–494.
