MM Algorithms for Some Discrete Multivariate Distributions

Hua Zhou; Kenneth Lange

doi:10.1198/jcgs.2010.09014

Author manuscript; available in PMC: 2010 Sep 25.

Published in final edited form as: J Comput Graph Stat. 2010 Sep 1;19(3):645–665. doi: 10.1198/jcgs.2010.09014


PMCID: PMC2945396 NIHMSID: NIHMS205488 PMID: 20877446

Abstract

The MM (minorization–maximization) principle is a versatile tool for constructing optimization algorithms. Every EM algorithm is an MM algorithm but not vice versa. This article derives MM algorithms for maximum likelihood estimation with discrete multivariate distributions such as the Dirichlet-multinomial and Connor–Mosimann distributions, the Neerchal–Morel distribution, the negative-multinomial distribution, certain distributions on partitions, and zero-truncated and zero-inflated distributions. These MM algorithms increase the likelihood at each iteration and reliably converge to the maximum from well-chosen initial values. Because they involve no matrix inversion, the algorithms are especially pertinent to high-dimensional problems. To illustrate the performance of the MM algorithms, we compare them to Newton’s method on data used to classify handwritten digits.

Keywords: Dirichlet and multinomial distributions, Inequalities, Maximum likelihood, Minorization

1. INTRODUCTION

The MM algorithm generalizes the celebrated EM algorithm (Dempster, Laird, and Rubin 1977). In this article we apply the MM (minorization–maximization) principle to devise new algorithms for maximum likelihood estimation with several discrete multivariate distributions. A series of research papers and review articles (Groenen 1993; de Leeuw 1994; Heiser 1995; Hunter and Lange 2004; Lange 2004; Wu and Lange 2010) have argued that the MM principle can lead to simpler derivations of known EM algorithms. More importantly, the MM principle also generates many new algorithms of considerable utility. Some statisticians encountering the MM principle for the first time react against its abstraction, unfamiliarity, and dependence on the mathematical theory of inequalities. This is unfortunate because real progress can be made applying a few basic ideas in a unified framework. The current article relies on just three well-known inequalities. For most of our examples, the derivation of a corresponding EM algorithm appears much harder, the main hindrance being the difficulty of choosing an appropriate missing data structure.

Discrete multivariate distributions are seeing wider use throughout statistics. Modern data mining employs such distributions in image reconstruction, pattern recognition, document clustering, movie rating, network analysis, and random graphs. High-dimensional data demand high-dimensional models with tens to hundreds of thousands of parameters. Newton’s method and Fisher scoring are capable of finding the maximum likelihood estimates of these distributions via the parameter updates

θ^{(n + 1)} = θ^{(n)} + M (θ^{(n)})^{- 1} \nabla L (θ^{(n)}),

where ∇L(θ) is the score function and M(θ) is the observed or the expected information matrix, respectively. Several complications can compromise the performance of these traditional algorithms: (a) the information matrix M(θ) may be expensive to compute, (b) it may fail to be positive definite in Newton’s method, (c) in high dimensions it is expensive to solve the linear system M(θ⁽ⁿ⁾)x = ∇L(θ⁽ⁿ⁾), and (d) if parameter constraints and parameter bounds intrude, then the update itself requires modification. Although mathematical scientists have devised numerous remedies and safeguards, these all come at a cost of greater implementation complexity. The MM principle offers a versatile weapon for attacking optimization problems of this sort. Although MM algorithms have at best a linear rate of convergence, their updates are often very simple. This can tip the computational balance in their favor. In addition, MM algorithms are typically easy to code, numerically stable, and amenable to acceleration. For the discrete distributions considered here, there is one further simplification often missed in the literature. These distributions involve gamma functions. To avoid the complications of evaluating the gamma function and its derivatives, we fall back on a device suggested by Haldane (1941) that replaces ratios of gamma functions by rising polynomials.

Rather than tire the skeptical reader with more preliminaries, it is perhaps best to move on to our examples without delay. The next section defines the MM principle, discusses our three driving inequalities, and reviews two simple acceleration methods. Section 3 derives MM algorithms for some standard multivariate discrete distributions, namely the Dirichlet-multinomial and Connor–Mosimann distributions, the Neerchal–Morel distribution, the negative-multinomial distribution, certain distributions on partitions, and zero-truncated and zero-inflated distributions. Section 4 describes a numerical experiment comparing the performance of the MM algorithms, accelerated MM algorithms, and Newton’s method on model fitting of handwritten digit data. Our discussion concludes by mentioning directions for further research and by frankly acknowledging the limitations of the MM principle.

2. OVERVIEW OF THE MM ALGORITHM

As we have already emphasized, the MM algorithm is a principle for creating algorithms rather than a single algorithm. There are two versions of the MM principle, one for iterative minimization and another for iterative maximization. Here we deal only with the maximization version. Let f (θ) be the objective function we seek to maximize. An MM algorithm involves minorizing f (θ) by a surrogate function g(θ|θⁿ) anchored at the current iterate θⁿ of a search. Minorization is defined by the two properties

f (θ^{n}) = g (θ^{n} | θ^{n}),

(2.1)

f (θ) \geq g (θ | θ^{n}), θ \neq θ^{n} .

(2.2)

In other words, the surface θ ↦ g(θ|θⁿ) lies below the surface θ ↦ f (θ) and is tangent to it at the point θ = θⁿ. Construction of the surrogate function g(θ|θⁿ) constitutes the first M of the MM algorithm.

In the second M of the algorithm, we maximize the surrogate g(θ|θⁿ) rather than f (θ). If θⁿ⁺¹ denotes the maximum point of g(θ|θⁿ), then this action forces the ascent property f (θⁿ⁺¹) ≥ f (θⁿ). The straightforward proof

f (θ^{n + 1}) \geq g (θ^{n + 1} | θ^{n}) \geq g (θ^{n} | θ^{n}) = f (θ^{n})

reflects definitions (2.1) and (2.2) and the choice of θⁿ⁺¹. The ascent property is the source of the MM algorithm’s numerical stability. Strictly speaking, it depends only on increasing g(θ|θⁿ), not on maximizing g(θ|θⁿ).

The art in devising an MM algorithm revolves around intelligent choices of minorizing functions. This brings us to the first of our three basic minorizations

ln (\sum_{i = 1}^{m} α_{i}) \geq \sum_{i = 1}^{m} \frac{α_{i}^{n}}{\sum_{j = 1}^{m} α_{j}^{n}} ln (\frac{\sum_{j = 1}^{m} α_{j}^{n}}{α_{i}^{n}} α_{i}),

(2.3)

invoking the chord below the graph property of the concave function ln x. Note here that all parameter values are positive and that equality obtains whenever $α_{i} = α_{i}^{n}$ for all i. Our second basic minorization

- ln (c + α) \geq - ln (c + α^{n}) - \frac{1}{c + α^{n}} (α - α^{n})

(2.4)

restates the supporting hyperplane property of the convex function −ln(c + x). Our final basic minorization

- ln (1 - α) \geq - ln (1 - α^{n}) + \frac{α^{n}}{1 - α^{n}} ln (\frac{α}{α^{n}})

(2.5)

is just a rearrangement of the two-point information inequality

α^{n} ln α^{n} + (1 - α^{n}) ln (1 - α^{n}) \geq α^{n} ln α + (1 - α^{n}) ln (1 - α) .

Here α and αⁿ must lie in (0, 1). Any standard text on inequalities, for example, the book by Steele (2004), proves these three inequalities. Because piecemeal minorization works well, our derivations apply the basic minorizations only to strategic parts of the overall objective function, leaving other parts untouched.
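The three minorizations (2.3), (2.4), and (2.5) are easy to check numerically. The following is a minimal sketch, not part of the original derivation; the function names `rhs_2_3`, `rhs_2_4`, and `rhs_2_5` and the test points are hypothetical choices made only for illustration. Each function returns the right-hand side of the corresponding inequality, so the gap between the left-hand side and the returned value should be non-negative and should vanish at the anchor point.

```python
import math

def rhs_2_3(alpha, alpha_n):
    """Right side of (2.3): minorizes ln(sum(alpha)) at the anchor alpha_n."""
    total_n = sum(alpha_n)
    return sum(a_n / total_n * math.log(total_n / a_n * a)
               for a, a_n in zip(alpha, alpha_n))

def rhs_2_4(alpha, alpha_n, c):
    """Right side of (2.4): minorizes -ln(c + alpha) at the anchor alpha_n."""
    return -math.log(c + alpha_n) - (alpha - alpha_n) / (c + alpha_n)

def rhs_2_5(alpha, alpha_n):
    """Right side of (2.5): minorizes -ln(1 - alpha); both arguments in (0, 1)."""
    return (-math.log(1.0 - alpha_n)
            + alpha_n / (1.0 - alpha_n) * math.log(alpha / alpha_n))

# Made-up evaluation points: the printed gaps f - g should be >= 0,
# and equal to 0 (up to rounding) when the point coincides with the anchor.
anchor = [2.0, 1.0, 1.0]
for point in ([1.0, 2.0, 3.0], anchor):
    print(math.log(sum(point)) - rhs_2_3(point, anchor))
print(-math.log(1.0 + 3.0) - rhs_2_4(3.0, 1.0, 1.0))
print(-math.log(1.0 - 0.3) - rhs_2_5(0.3, 0.6))
```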

The convergence theory of MM algorithms is well known (Lange 2004). Convergence to a stationary point is guaranteed provided five properties of the objective function f (θ) and the MM algorithm map M(θ) hold: (a) f (θ) is coercive on its open domain; (b) f (θ) has only isolated stationary points; (c) M(θ) is continuous; (d) θ* is a fixed point of M(θ) if and only if it is a stationary point of f (θ); (e) f [M(θ*)] ≥ f (θ*), with equality if and only if θ* is a fixed point of M(θ). Most of these conditions are easy to verify for our examples, so the details will be omitted.

A common criticism of EM and MM algorithms is their slow convergence. Fortunately, MM algorithms can be easily accelerated (Jamshidian and Jennrich 1995; Lange 1995a; Jamshidian and Jennrich 1997; Varadhan and Roland 2008). We will employ two versions of the recent square iterative method (SQUAREM) developed by Varadhan and Roland (2008). These simple vector extrapolation techniques require computation of two MM updates at each iteration. Denote the two updates by M(θⁿ) and M ◦ M(θⁿ), where M(θ) is the MM algorithm map. These updates in turn define two vectors

u = M (θ^{n}) - θ^{n}, υ = M ◦ M (θ^{n}) - M (θ^{n}) - u .

The versions diverge in how they compute the steplength constant s. SqMPE1 (minimal polynomial extrapolation) takes $s = \frac{u^{t} u}{u^{t} υ}$, while SqRRE1 (reduced rank extrapolation) takes $s = \frac{u^{t} υ}{υ^{t} υ}$. Once s is specified, we define the next accelerated iterate by θⁿ⁺¹ = θⁿ − 2su + s²υ. Readers should consult the original article for motivation of SQUAREM. Whenever θⁿ⁺¹ decreases the log-likelihood L(θ), we revert to the MM update θⁿ⁺¹ = M ◦ M(θⁿ). Finally, we declare convergence when

\frac{| L (θ^{n}) - L (θ^{n - 1}) |}{| L (θ^{n - 1}) | + 1} < ε .

(2.6)

In the numerical examples that follow, we use the stringent criterion ε = 10⁻⁹. More sophisticated stopping criteria based on the gradient of the objective function and the norm of the parameter increment lead to similar results.
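The acceleration scheme and stopping rule just described can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `squarem_step`, `converged`, `mm_map`, and `loglik` are hypothetical, parameters are held in plain Python lists, and the code additionally falls back to the double MM update when the extrapolated point raises an error in `loglik` (for instance because it left the parameter domain), which is an assumption not spelled out in the text.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def squarem_step(theta, mm_map, loglik, scheme="SqMPE1"):
    """One SQUAREM-accelerated iterate starting from theta (list of floats)."""
    theta1 = mm_map(theta)                      # M(theta)
    theta2 = mm_map(theta1)                     # M(M(theta))
    u = [a - b for a, b in zip(theta1, theta)]
    v = [c - b - du for c, b, du in zip(theta2, theta1, u)]
    if scheme == "SqMPE1":
        num, den = dot(u, u), dot(u, v)         # s = u'u / u'v
    else:                                       # "SqRRE1"
        num, den = dot(u, v), dot(v, v)         # s = u'v / v'v
    if den == 0.0:
        return theta2                           # steplength undefined: plain MM
    s = num / den
    candidate = [t - 2.0 * s * du + s * s * dv
                 for t, du, dv in zip(theta, u, v)]
    try:
        if loglik(candidate) >= loglik(theta):
            return candidate
    except (ValueError, ZeroDivisionError):
        pass                                    # assumed: candidate left the domain
    return theta2                               # revert to two plain MM updates

def converged(l_new, l_old, eps=1e-9):
    """Stopping rule (2.6) on successive log-likelihood values."""
    return abs(l_new - l_old) / (abs(l_old) + 1.0) < eps
```

For a linear map in one dimension the extrapolation lands on the fixed point in a single step, which gives a quick sanity check: with the toy map θ ↦ 0.5θ + 1 and starting value 0, both schemes return 2.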

3. APPLICATIONS

3.1 Dirichlet-Multinomial and Connor–Mosimann Distributions

When count data exhibit overdispersion, the Dirichlet-multinomial distribution is often substituted for the multinomial distribution. The multinomial distribution is characterized by a vector p = (p₁,…, p_d) of cell probabilities and a total number of trials m. In the Dirichlet-multinomial sampling, p is first drawn from a Dirichlet distribution with parameter vector α = (α₁,…,α_d). Once the cell probabilities are determined, multinomial sampling commences. This leads to the admixture density

\begin{matrix} h (x | α) & = \frac{Γ (| α |)}{Γ (α_{1}) \dots Γ (α_{d})} \int_{Δ_{d}} (\begin{matrix} m \\ x \end{matrix}) \prod_{i = 1}^{d} p_{i}^{x_{i} + α_{i} - 1} {dp}_{1} \dots {dp}_{d} \\ = (\begin{matrix} m \\ x \end{matrix}) \frac{Γ (α_{1} + x_{1}) \dots Γ (α_{d} + x_{d})}{Γ (| α | + m)} \frac{Γ (| α |)}{Γ (α_{1}) \dots Γ (α_{d})}, \end{matrix}

(3.1)

where $| α | = \sum_{i = 1}^{d} α_{i}$ , Δ_d is the unit simplex in ℝ^d, and x = (x₁,…, x_d) is the vector of cell counts. Note that the count total $| x | = \sum_{i = 1}^{d} x_{i}$ is fixed at m. Standard calculations show that a random vector X drawn from h(x|α) has the means, variances, and covariances

\begin{matrix} E (X_{i}) & = m \frac{α_{i}}{| α |}, \\ Var (X_{i}) & = m \frac{α_{i}}{| α |} (1 - \frac{α_{i}}{| α |}) \frac{| α | + m}{| α | + 1}, \\ Cov (X_{i}, X_{j}) & = - m \frac{α_{i}}{| α |} \frac{α_{j}}{| α |} \frac{| α | + m}{| α | + 1}, i \neq j . \end{matrix}

If the fractions $\frac{α_{i}}{| α |}$ tend to constants p_i as |α| tends to ∞, then these moments collapse to the corresponding moments of the multinomial distribution with proportions p₁,…, p_d.

One of the most unappealing features of the density function h(x|α) is the occurrence of the gamma function. Fortunately, very early on Haldane (1941) noted the alternative representation

h (x | α) = (\begin{matrix} m \\ x \end{matrix}) \frac{\prod_{j = 1}^{d} α_{j} (α_{j} + 1) \dots (α_{j} + x_{j} - 1)}{| α | (| α | + 1) \dots (| α | + m - 1)} .

(3.2)
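As a quick numerical check that the gamma-function form (3.1) and the rising-polynomial form (3.2) agree, one can evaluate both on the log scale. This is only a sketch; the function names `dm_logpmf_gamma` and `dm_logpmf_rising` and the sample values are hypothetical. The multinomial coefficient is computed with `math.lgamma` in both versions, since it does not involve α.

```python
import math

def log_multinomial_coef(x):
    """ln of the multinomial coefficient m! / (x_1! ... x_d!)."""
    return math.lgamma(sum(x) + 1) - sum(math.lgamma(xj + 1) for xj in x)

def dm_logpmf_gamma(x, alpha):
    """ln h(x | alpha) from the gamma-function form (3.1)."""
    m, a = sum(x), sum(alpha)
    val = log_multinomial_coef(x) + math.lgamma(a) - math.lgamma(a + m)
    val += sum(math.lgamma(aj + xj) - math.lgamma(aj)
               for aj, xj in zip(alpha, x))
    return val

def dm_logpmf_rising(x, alpha):
    """ln h(x | alpha) from the rising-polynomial form (3.2)."""
    m, a = sum(x), sum(alpha)
    val = log_multinomial_coef(x)
    val -= sum(math.log(a + k) for k in range(m))
    val += sum(math.log(aj + k)
               for aj, xj in zip(alpha, x) for k in range(xj))
    return val

# Made-up example: the two printed values should coincide up to rounding.
print(dm_logpmf_gamma([3, 0, 2], [0.5, 1.2, 2.0]))
print(dm_logpmf_rising([3, 0, 2], [0.5, 1.2, 2.0]))
```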

The replacement of gamma functions by rising polynomials is a considerable gain in simplicity. Bailey (1957) later suggested the reparameterization

π_{j} = \frac{α_{j}}{| α |}, j = 1, \dots, d, θ = \frac{1}{| α |}

in terms of the proportion vector π = (π₁,…, π_d) and the overdispersion parameter θ. In this setting, the discrete density function becomes

h (x | π, θ) = (\begin{matrix} m \\ x \end{matrix}) \frac{\prod_{j = 1}^{d} π_{j} (π_{j} + θ) \dots [π_{j} + (x_{j} - 1) θ]}{(1 + θ) \dots [1 + (m - 1) θ]} .

(3.3)

This version of the density function is used to good effect by Griffiths (1973) in implementing Newton’s method for maximum likelihood estimation with the beta-binomial distribution.

In maximum likelihood estimation, we pass to log-likelihoods. This introduces logarithms and turns factors into sums. To construct an MM algorithm under the parameterization (3.2), we need to minorize terms such as ln(α_j + k) and −ln(|α| + k). The basic inequalities (2.3) and (2.4) are directly relevant. Suppose we draw t independent samples x₁,…, x_t from the Dirichlet-multinomial distribution with m_i trials for sample i. The term −ln(|α| + k) occurs in the log-likelihood for x_i if and only if m_i ≥ k + 1. Likewise the term ln(α_j + k) occurs in the log-likelihood for x_i if and only if x_ij ≥ k + 1. It follows that the log-likelihood for the entire sample can be written as

\begin{matrix} L (α) = - \sum_{k} r_{k} ln (| α | + k) + \sum_{j} \sum_{k} s_{jk} ln (α_{j} + k), \\ r_{k} = \sum_{i} 1_{\{m_{i} \geq k + 1\}}, s_{jk} = \sum_{i} 1_{\{x_{ij} \geq k + 1\}} . \end{matrix}

(3.4)

The index k in these formulas ranges from 0 to max_i m_i −1.

Applying our two basic minorizations to L(α) yields the surrogate function

g (α | α^{n}) = - \sum_{k} r_{k} \frac{1}{| α^{n} | + k} | α | + \sum_{j} \sum_{k} s_{jk} \frac{α_{j}^{n}}{α_{j}^{n} + k} ln α_{j}

up to an irrelevant additive constant. Equating the partial derivative of the surrogate with respect to α_j to 0 produces the simple MM update

α_{j}^{n + 1} = (\sum_{k} \frac{s_{jk} α_{j}^{n}}{α_{j}^{n} + k}) / (\sum_{k} \frac{r_{k}}{| α^{n} | + k}) .

(3.5)

Minka (2003) derived these updates from a different perspective.
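The counts in (3.4) and the update (3.5) translate almost line for line into code. The sketch below is illustrative only: the function names `dm_counts`, `dm_loglik`, and `dm_mm_update` are hypothetical, the data are passed as a list of count vectors, the log-likelihood omits the multinomial coefficients (which do not depend on α), and every category is assumed to have at least one positive count so that no component of α is driven to zero.

```python
import math

def dm_counts(X):
    """Counts r_k and s_jk of (3.4) for a list of count vectors X."""
    m = [sum(x) for x in X]
    K = max(m)                                   # k ranges over 0, ..., K - 1
    r = [sum(1 for mi in m if mi >= k + 1) for k in range(K)]
    s = [[sum(1 for x in X if x[j] >= k + 1) for k in range(K)]
         for j in range(len(X[0]))]
    return r, s

def dm_loglik(alpha, r, s):
    """Log-likelihood (3.4), multinomial coefficients omitted."""
    total = sum(alpha)
    val = -sum(rk * math.log(total + k) for k, rk in enumerate(r))
    for aj, sj in zip(alpha, s):
        val += sum(sjk * math.log(aj + k) for k, sjk in enumerate(sj))
    return val

def dm_mm_update(alpha, r, s):
    """One MM update (3.5) of the parameter vector alpha."""
    total = sum(alpha)
    denom = sum(rk / (total + k) for k, rk in enumerate(r))
    return [sum(sjk * aj / (aj + k) for k, sjk in enumerate(sj)) / denom
            for aj, sj in zip(alpha, s)]
```

Iterating `dm_mm_update` from any positive starting vector should produce a non-decreasing sequence of `dm_loglik` values, which is a convenient way to catch indexing mistakes in the counts.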

Under the parameterization (3.3), matters are slightly more complicated. Now we minorize the terms −ln(1 + kθ) and ln(π_j + kθ) via

- log (1 + k θ) \geq - log (1 + k θ^{n}) - \frac{1}{1 + k θ^{n}} (k θ - k θ^{n})

and

log (π_{j} + k θ) \geq \frac{π_{j}^{n}}{π_{j}^{n} + k θ^{n}} log (\frac{π_{j}^{n} + k θ^{n}}{π_{j}^{n}} π_{j}) + \frac{k θ^{n}}{π_{j}^{n} + k θ^{n}} log (\frac{π_{j}^{n} + k θ^{n}}{k θ^{n}} k θ) .

These minorizations lead to the surrogate function

- \sum_{k} r_{k} \frac{k}{1 + k θ^{n}} θ + \sum_{j} \sum_{k} s_{jk} \{\frac{π_{j}^{n}}{π_{j}^{n} + k θ^{n}} log π_{j} + \frac{k θ^{n}}{π_{j}^{n} + k θ^{n}} log θ\}

up to an irrelevant constant. Setting the partial derivative with respect to θ equal to 0 yields the MM update

θ^{n + 1} = (\sum_{j} \sum_{k} \frac{s_{jk} k θ^{n}}{π_{j}^{n} + k θ^{n}}) / (\sum_{k} \frac{r_{k} k}{1 + k θ^{n}}) .

(3.6)

The update of the proportion vector π must be treated as a Lagrange multiplier problem owing to the constraint Σ_j π_j = 1. Familiar arguments produce the MM update

π_{j}^{n + 1} = (\sum_{k} \frac{s_{jk} π_{j}^{n}}{π_{j}^{n} + k θ^{n}}) / (\sum_{l} \sum_{k} \frac{s_{lk} π_{l}^{n}}{π_{l}^{n} + k θ^{n}}) .

(3.7)

The two MM algorithms, summarized by update (3.5) and by updates (3.6) and (3.7), enjoy several desirable properties. First, parameter constraints are built in. Second, stationary points of the log-likelihood are fixed points of the updates. Virtually all MM algorithms share these properties. The update (3.7) also reduces to the maximum likelihood estimate

{\hat{π}}_{j} = \frac{\sum_{k} s_{jk}}{\sum_{l} \sum_{k} s_{lk}} = \frac{\sum_{i} x_{ij}}{\sum_{i} m_{i}}

(3.8)

of the corresponding multinomial proportion when θⁿ = 0.
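Updates (3.6) and (3.7) can be sketched in the same style. The names `pt_loglik` and `pt_mm_update` are hypothetical, and the counts `r[k]` and `s[j][k]` are assumed to be precomputed as in (3.4). The sketch also assumes at least one observation has two or more trials, since otherwise the denominator of (3.6) vanishes.

```python
import math

def pt_loglik(pi, theta, r, s):
    """Log-likelihood under parameterization (3.3), coefficients omitted."""
    val = -sum(rk * math.log(1.0 + k * theta) for k, rk in enumerate(r))
    for pj, sj in zip(pi, s):
        val += sum(sjk * math.log(pj + k * theta)
                   for k, sjk in enumerate(sj) if sjk)
    return val

def pt_mm_update(pi, theta, r, s):
    """One MM update of (pi, theta) via (3.7) and (3.6)."""
    # Unnormalized numerators of (3.7), one per category.
    pi_part = [sum(sjk * pj / (pj + k * theta) for k, sjk in enumerate(sj))
               for pj, sj in zip(pi, s)]
    theta_num = sum(sjk * k * theta / (pj + k * theta)
                    for pj, sj in zip(pi, s) for k, sjk in enumerate(sj))
    theta_den = sum(rk * k / (1.0 + k * theta) for k, rk in enumerate(r))
    total = sum(pi_part)
    return [p / total for p in pi_part], theta_num / theta_den
```

Calling `pt_mm_update` with `theta = 0` returns the multinomial proportions (3.8) and leaves θ at 0, which matches the reduction noted above.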

The estimate (3.8) furnishes a natural initial value $π_{j}^{0}$ . To derive an initial value for the overdispersion parameter θ, consider the first two moments

E (P_{j}) = \frac{α_{j}}{| α |}, E (P_{j}^{2}) = \frac{α_{j} (α_{j} + 1)}{| α | (| α | + 1)}

of a Dirichlet distribution with parameter vector α. These identities imply that

\sum_{j = 1}^{d} \frac{E (P_{j}^{2})}{E (P_{j})} = \frac{| α | + d}{| α | + 1} = ρ,

which can be solved for θ = 1/|α| in terms of ρ as θ = (ρ − 1)/(d − ρ). Substituting the estimate

\hat{ρ} = \sum_{j} \frac{\sum_{i} (x_{ij} / m_{i})^{2}}{\sum_{i} (x_{ij} / m_{i})}

for ρ gives a sensible initial value θ⁰.
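The moment-based starting values can be computed directly from the observed proportions. This is a sketch under the assumptions that every observation has at least one trial and every category has a positive total count; the function name `dm_initial_values` is hypothetical.

```python
def dm_initial_values(X):
    """Starting values (pi^0, theta^0) from (3.8) and the estimate of rho."""
    d = len(X[0])
    m = [sum(x) for x in X]
    # Multinomial proportions (3.8).
    pi0 = [sum(x[j] for x in X) / sum(m) for j in range(d)]
    # rho-hat: sum over categories of (sum of squared proportions) / (sum of proportions).
    rho = 0.0
    for j in range(d):
        frac = [x[j] / mi for x, mi in zip(X, m)]
        rho += sum(f * f for f in frac) / sum(frac)
    theta0 = (rho - 1.0) / (d - rho)
    return pi0, theta0
```

Because each proportion lies in [0, 1], the estimate of ρ falls between 1 and d, so θ⁰ is non-negative; it is undefined only in the degenerate case where every observed proportion is 0 or 1.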

To test our two MM algorithms, we now turn to the beta-binomial data of Haseman and Soares (1976) on male mice exposed to various mutagens. The two outcome categories are (a) dead implants and (b) survived implants. In their first dataset, there are t = 524 observations with between m = 1 and m = 20 trials per observation. Table 1 presents the final log-likelihood, number of iterations, and running time (in seconds) of the two MM algorithms and their SQUAREM accelerations on these data. All MM algorithms converge to the maximum point previously found by the scoring method (Paul, Balasooriya, and Banerjee 2005). For the choice ε = 10⁻⁹ in stopping criterion (2.6), the MM algorithm (3.5) takes 700 iterations and 0.1580 sec to converge on a laptop computer. The alternative MM algorithm given in the updates (3.6) and (3.7) takes 339 iterations and 0.1626 sec. Figure 1 depicts the progress of the MM iterates on a contour plot of the log-likelihood. The conventional MM algorithm crawls slowly along the ridge in the contour plot; the accelerated versions SqMPE1 and SqRRE1 significantly reduce both the number of iterations and the running time until convergence.

Table 1.

MM algorithms for the Haseman and Soares beta-binomial data.

	(π, θ) parameterization			α parameterization		
Algorithm	L	# iters	Time (sec)	L	# iters	Time (sec)
MM	−777.79	339	0.1626	−777.79	700	0.1580
SqMPE1 MM	−777.79	10	0.0093	−777.79	14	0.0105
SqRRE1 MM	−777.79	18	0.0159	−777.79	14	0.0100


Figure 1. MM ascent of the Dirichlet-multinomial log-likelihood surface. A color version of this figure is available in the electronic version of this article.

The Dirichlet-multinomial distribution suffers from two restrictions that limit its applicability, namely the negative correlation of coordinates and the determination of variances by means. It is possible to overcome these restrictions by choosing a more flexible mixing distribution as a prior for the multinomial. Connor and Mosimann (1969) suggested a generalization of the Dirichlet distribution that meets this challenge. The resulting admixed distribution, called the generalized Dirichlet-multinomial distribution, has proved its worth in machine learning problems such as the modeling and clustering of images, handwritten digits, and text documents (Bouguila 2008). It is therefore helpful to derive an MM algorithm for maximum likelihood estimation with this distribution that avoids the complications of gamma/digamma/trigamma functions arising with Newton’s method (Bouguila 2008). The Connor–Mosimann distribution is constructed inductively by the mechanism of stick breaking. Imagine breaking the interval [0, 1] into d subintervals of lengths P₁,…, P_d by choosing d − 1 independent beta variates Z_i with parameters α_i and β_i. The length of subinterval 1 is P₁ = Z₁. Given P₁ through P_i, the length of subinterval i + 1 is P_{i+1} = Z_{i+1}(1 − P₁ − ⋯ − P_i). The last length P_d = 1 − (P₁ + ⋯ + P_{d−1}) takes up the slack. Standard calculations show that the P_i have the joint density

g (p | α, β) = \prod_{j = 1}^{d - 1} \frac{Γ (α_{j} + β_{j})}{Γ (α_{j}) Γ (β_{j})} p_{j}^{α_{j} - 1} {(1 - \sum_{i = 1}^{j} p_{i})}^{γ_{j}}, p \in Δ_{d},

where γ_j = β_j − α_{j+1} − β_{j+1} for j = 1,…, d − 2 and γ_{d−1} = β_{d−1} − 1. The univariate case (d = 2) corresponds to the beta distribution. The Dirichlet distribution is recovered by taking β_j = α_{j+1} + ⋯ + α_d. With d − 2 more parameters than the Dirichlet distribution, the Connor–Mosimann distribution is naturally more versatile.

The Connor–Mosimann distribution is again conjugate to the multinomial distribution, and the marginal density of a count vector X over m trials is easily shown to be

\begin{matrix} Pr (X = x) & = \int_{Δ_{d}} (\begin{matrix} m \\ x \end{matrix}) \prod_{j = 1}^{d} p_{j}^{x_{j}} g (p | α, β) d p \\ & = (\begin{matrix} m \\ x \end{matrix}) \prod_{j = 1}^{d - 1} \frac{Γ (α_{j} + x_{j})}{Γ (α_{j})} \frac{Γ (β_{j} + y_{j + 1})}{Γ (β_{j})} \frac{Γ (α_{j} + β_{j})}{Γ (α_{j} + β_{j} + y_{j})}, \end{matrix}

where $y_{j} = \sum_{k = j}^{d} x_{k}$ . If we adopt the reparameterization

θ_{j} = \frac{1}{α_{j} + β_{j}}, π_{j} = \frac{α_{j}}{α_{j} + β_{j}}, j = 1, \dots, d - 1,

and use the fact that x_j + y_{j+1} = y_j, then the density can be re-expressed as

(\begin{matrix} m \\ x \end{matrix}) \prod_{j = 1}^{d - 1} \frac{π_{j} \dots [π_{j} + (x_{j} - 1) θ_{j}] \times (1 - π_{j}) \dots [1 - π_{j} + (y_{j + 1} - 1) θ_{j}]}{1 \dots [1 + (y_{j} - 1) θ_{j}]} .

(3.9)

Thus, maximum likelihood estimation of the parameter vectors π = (π₁,…, π_{d−1}) and θ = (θ₁,…, θ_{d−1}) by the MM algorithm reduces to the case of d − 1 independent beta-binomial problems.

Let x₁,…, x_t be a random sample from the generalized Dirichlet-multinomial distribution (3.9) with m_i trials for observation x_i. Following our reasoning for estimation with the Dirichlet-multinomial, we define the associated counts

r_{jk} = \sum_{i = 1}^{t} 1_{\{x_{ij} \geq k + 1\}}, s_{jk} = \sum_{i = 1}^{t} 1_{\{y_{ij} \geq k + 1\}}

for 1 ≤ j ≤ d − 1. In this notation, the reader can readily check that the MM updates become

\begin{matrix} π_{j}^{n + 1} = (\sum_{k} \frac{r_{jk} π_{j}^{n}}{π_{j}^{n} + k θ_{j}^{n}}) / (\sum_{k} [\frac{r_{jk} π_{j}^{n}}{π_{j}^{n} + k θ_{j}^{n}} + \frac{s_{j + 1, k} (1 - π_{j}^{n})}{1 - π_{j}^{n} + k θ_{j}^{n}}]), \\ θ_{j}^{n + 1} = (\sum_{k} [\frac{r_{jk} k θ_{j}^{n}}{π_{j}^{n} + k θ_{j}^{n}} + \frac{s_{j + 1, k} k θ_{j}^{n}}{1 - π_{j}^{n} + k θ_{j}^{n}}]) / (\sum_{k} \frac{s_{jk} k}{1 + k θ_{j}^{n}}) . \end{matrix}
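The generalized Dirichlet-multinomial updates can be sketched as d − 1 separate beta-binomial updates. The code below is an illustration under assumptions, not the authors' implementation: the names `gdm_counts`, `gdm_loglik`, and `gdm_mm_update` are hypothetical, indices are 0-based, the counts s are stored for all d categories so that the entry for category j + 1 is always available, and the denominator of the θ update carries the factor k that results from differentiating the minorized term −ln(1 + kθ_j), in parallel with (3.6).

```python
import math

def gdm_counts(X):
    """Counts r[j][k] (from x_ij) and s[j][k] (from tail sums y_ij)."""
    d = len(X[0])
    K = max(sum(x) for x in X)
    Y = [[sum(x[j:]) for j in range(d)] for x in X]      # y_ij = x_ij + ... + x_id
    r = [[sum(1 for x in X if x[j] >= k + 1) for k in range(K)]
         for j in range(d - 1)]
    s = [[sum(1 for y in Y if y[j] >= k + 1) for k in range(K)]
         for j in range(d)]
    return r, s

def gdm_loglik(pi, theta, r, s):
    """Log-likelihood from (3.9), multinomial coefficients omitted."""
    val = 0.0
    for j, (p, th) in enumerate(zip(pi, theta)):
        for k in range(len(r[j])):
            val += r[j][k] * math.log(p + k * th)
            val += s[j + 1][k] * math.log(1.0 - p + k * th)
            val -= s[j][k] * math.log(1.0 + k * th)
    return val

def gdm_mm_update(pi, theta, r, s):
    """One MM update of the vectors pi and theta (each of length d - 1)."""
    new_pi, new_theta = [], []
    for j, (p, th) in enumerate(zip(pi, theta)):
        K = range(len(r[j]))
        a = sum(r[j][k] * p / (p + k * th) for k in K)
        b = sum(s[j + 1][k] * (1.0 - p) / (1.0 - p + k * th) for k in K)
        num = sum(r[j][k] * k * th / (p + k * th)
                  + s[j + 1][k] * k * th / (1.0 - p + k * th) for k in K)
        den = sum(s[j][k] * k / (1.0 + k * th) for k in K)
        new_pi.append(a / (a + b))
        new_theta.append(num / den)
    return new_pi, new_theta
```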

3.2 Neerchal–Morel Distribution

Neerchal and Morel (1998, 2005) proposed an alternative to the Dirichlet-multinomial distribution that accounts for overdispersion by finite admixture. If x represents count data over m trials and d categories, then their discrete density is

h (x | π, ρ) = \sum_{j = 1}^{d} π_{j} (\begin{matrix} m \\ x \end{matrix}) {[(1 - ρ) π_{1}]}^{x_{1}} \dots {[(1 - ρ) π_{j} + ρ]}^{x_{j}} \dots {[(1 - ρ) π_{d}]}^{x_{d}},

(3.10)

where π = (π₁,…, π_d) is a vector of proportions and ρ ∈ [0, 1] is an overdispersion parameter. The Neerchal–Morel distribution collapses to the multinomial distribution when ρ = 0. Straightforward calculations show that the Neerchal–Morel distribution has means, variances, and covariances

\begin{matrix} E (X_{i}) & = m π_{i}, \\ Var (X_{i}) & = m π_{i} (1 - π_{i}) [1 - ρ^{2} + m ρ^{2}], \\ Cov (X_{i}, X_{j}) & = - m π_{i} π_{j} [1 - ρ^{2} + m ρ^{2}], i \neq j . \end{matrix}

These are precisely the same as the first- and second-order moments of the Dirichlet-multinomial distribution provided we identify π_i = α_i/|α| and ρ² = 1/(|α| + 1).

If we draw t independent samples x₁,…, x_t from the Neerchal–Morel distribution with m_i trials for sample i, then the log-likelihood is

\sum_{i} ln \{\sum_{j} π_{j} (\begin{matrix} m_{i} \\ x_{i} \end{matrix}) {[(1 - ρ) π_{1}]}^{x_{i 1}} \dots {[(1 - ρ) π_{j} + ρ]}^{x_{ij}} \dots {[(1 - ρ) π_{d}]}^{x_{id}}\} .

(3.11)

It is worth bearing in mind that every mixture model yields to the minorization (2.3). This is one of the secrets to the success of the EM algorithm. As a practical matter, explicit minorization via inequality (2.3) is more mechanical and often simpler to implement than performing the E step of the EM algorithm. This is particularly true when several minorizations intervene before we reach the ideal surrogate. Here two successive minorizations are needed.

To state the first minorization, let us abbreviate

Π_{ij} = π_{j} {[(1 - ρ) π_{1}]}^{x_{i 1}} \dots {[(1 - ρ) π_{j} + ρ]}^{x_{ij}} \dots {[(1 - ρ) π_{d}]}^{x_{id}}

and denote by $Π_{ij}^{n}$ the same quantity evaluated at the nth iterate. In this notation it follows that

ln (\sum_{j} Π_{ij}) \geq \sum_{j} w_{ij}^{n} ln (\frac{Π_{ij}}{w_{ij}^{n}}) = \sum_{j} w_{ij}^{n} ln Π_{ij} - \sum_{j} w_{ij}^{n} ln w_{ij}^{n}

with weights $w_{ij}^{n} = \frac{Π_{ij}^{n}}{\sum_{l} Π_{il}^{n}}$. The logarithm splits ln Π_ij into the sum

ln Π_{ij} = m_{i} ln (1 - ρ) + ln π_{j} + x_{i 1} ln π_{1} + \dots + x_{ij} ln (π_{j} + θ) + \dots + x_{id} ln π_{d}

for θ = ρ/(1 − ρ). To separate the parameters π_j and θ in the troublesome term ln(π_j + θ), we apply the minorization (2.3) again. This produces

ln (π_{j} + θ) \geq \frac{π_{j}^{n}}{π_{j}^{n} + θ^{n}} ln (\frac{π_{j}^{n} + θ^{n}}{π_{j}^{n}} π_{j}) + \frac{θ^{n}}{π_{j}^{n} + θ^{n}} ln (\frac{π_{j}^{n} + θ^{n}}{θ^{n}} θ),

and up to a constant the surrogate function takes the form

\sum_{i} \sum_{j} w_{ij}^{n} [\sum_{k} x_{ik} ln π_{k} + (1 - \frac{x_{ij} θ^{n}}{π_{j}^{n} + θ^{n}}) ln π_{j}] + \sum_{i} \sum_{j} w_{ij}^{n} [(m_{i} - \frac{x_{ij} θ^{n}}{π_{j}^{n} + θ^{n}}) ln (1 - ρ) + \frac{x_{ij} θ^{n}}{π_{j}^{n} + θ^{n}} ln ρ] .

Standard arguments now yield the updates

\begin{matrix} π_{k}^{n + 1} = (\sum_{i} \sum_{j} w_{ij}^{n} x_{ik} + \sum_{i} w_{ik}^{n} (1 - \frac{x_{ik} θ^{n}}{π_{k}^{n} + θ^{n}})) / (\sum_{l} \sum_{i} \sum_{j} w_{ij}^{n} x_{il} + \sum_{l} \sum_{i} w_{il}^{n} (1 - \frac{x_{il} θ^{n}}{π_{l}^{n} + θ^{n}})), \\ \begin{matrix} ρ^{n + 1} = (\sum_{i} \sum_{j} \frac{w_{ij}^{n} x_{ij} θ^{n}}{π_{j}^{n} + θ^{n}}) / (\sum_{i} m_{i}), & θ^{n + 1} = \frac{ρ^{n + 1}}{1 - ρ^{n + 1}} . \end{matrix} \end{matrix}
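A compact sketch of these Neerchal–Morel updates follows. The names `nm_loglik` and `nm_mm_update` are hypothetical, and the code assumes 0 < ρ < 1 and strictly positive proportions. Two simplifications are used: the weights w_ij are proportional to π_j (1 + θ/π_j)^{x_ij}, because all other factors of Π_ij are common to every j and cancel, and Σ_j w_ij = 1, so the first sum in the π update collapses to Σ_i x_ik.

```python
import math

def nm_loglik(pi, rho, X):
    """Log-likelihood (3.11), multinomial coefficients omitted."""
    total = 0.0
    for x in X:
        mix = 0.0
        for j, pj in enumerate(pi):
            term = pj
            for k, (xk, pk) in enumerate(zip(x, pi)):
                term *= ((1.0 - rho) * pk + (rho if k == j else 0.0)) ** xk
            mix += term
        total += math.log(mix)
    return total

def nm_mm_update(pi, rho, X):
    """One MM update of (pi, rho); theta = rho / (1 - rho) is derived from rho."""
    theta = rho / (1.0 - rho)
    pi_num = [0.0] * len(pi)
    rho_num = 0.0
    for x in X:
        # Log weights up to a constant shared by all j; subtract the max for stability.
        logw = [math.log(pj) + xj * math.log1p(theta / pj)
                for pj, xj in zip(pi, x)]
        top = max(logw)
        w = [math.exp(lw - top) for lw in logw]
        norm = sum(w)
        for k, (pk, xk) in enumerate(zip(pi, x)):
            wk = w[k] / norm
            a = xk * theta / (pk + theta)
            pi_num[k] += xk + wk * (1.0 - a)     # sum_j w_ij x_ik = x_ik
            rho_num += wk * a
    total = sum(pi_num)
    new_pi = [v / total for v in pi_num]
    new_rho = rho_num / sum(sum(x) for x in X)
    return new_pi, new_rho
```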

Table 2 lists convergence results for this MM algorithm and its SQUAREM accelerations on the previously discussed Haseman and Soares data.

Table 2.

Performance of the Neerchal–Morel MM algorithms.

Algorithm	L	# iters	Time
MM	−783.29	128	0.2289
SqMPE1 MM	−783.29	10	0.0207
SqRRE1 MM	−783.29	11	0.0221


3.3 Negative-Multinomial

The motivation for the negative-multinomial distribution comes from multinomial sampling with d + 1 categories assigned probabilities π₁,…, π_{d+1}. Sampling continues until category d + 1 accumulates β outcomes. At that moment we count the number of outcomes x_i falling in category i for 1 ≤ i ≤ d. For a given vector x = (x₁,…, x_d), elementary combinatorics gives the probability

\begin{matrix} h (x | β, π) & = (\begin{matrix} β + | x | - 1 \\ | x | \end{matrix}) (\begin{matrix} | x | \\ x \end{matrix}) \prod_{i = 1}^{d} π_{i}^{x_{i}} π_{d + 1}^{β} \\ = \frac{β (β + 1) \dots (β + | x | - 1)}{x_{1}! \dots x_{d}!} \prod_{i = 1}^{d} π_{i}^{x_{i}} π_{d + 1}^{β} . \end{matrix}

(3.12)

This formula continues to make sense even if the positive parameter β is not an integer. For arbitrary β > 0, the most straightforward way to construct the negative-multinomial distribution is to run d independent Poisson processes with intensities π₁,…, π_d. Wait a gamma distributed length of time with shape parameter β and intensity parameter π_{d+1}. At the expiration of this waiting time, count the number of random events X_i of each type i among the first d categories. The random vector X has precisely the discrete density (3.12).

The Poisson process perspective readily yields the moments

\begin{matrix} E (X_{i}) & = β \frac{π_{i}}{π_{d + 1}}, \\ Var (X_{i}) & = β \frac{π_{i}}{π_{d + 1}} (1 + \frac{π_{i}}{π_{d + 1}}), \\ Cov (X_{i}, X_{j}) & = β \frac{π_{i}}{π_{d + 1}} \frac{π_{j}}{π_{d + 1}}, i \neq j . \end{matrix}

(3.13)

Compared to a Poisson distributed random variable with the same mean, the component X_i is overdispersed. Also in contrast to the multinomial and Dirichlet-multinomial distributions, the counts from a negative-multinomial are positively correlated. Negative-multinomial sampling is therefore appealing in many applications.

Let x₁,…, x_t be a random sample from the negative-multinomial distribution with m_i = |x_i|. To maximize the log-likelihood

\begin{matrix} L (β, π) & = \sum_{k} r_{k} ln (β + k) + \sum_{j = 1}^{d} x_{\cdot j} ln π_{j} + t β ln π_{d + 1} - \sum_{i} \sum_{j} ln x_{ij}!, \\ r_{k} & = \sum_{i} 1_{\{m_{i} \geq k + 1\}}, x_{\cdot j} = \sum_{i} x_{ij}, \end{matrix}

we must deal with the terms ln(β + k). Fortunately, the minorization (2.3) implies

ln (β + k) \geq \frac{β^{n}}{β^{n} + k} ln (\frac{β^{n} + k}{β^{n}} β) + \frac{k}{β^{n} + k} ln (\frac{β^{n} + k}{k} k),

leading to the surrogate function

g (β, π | β^{n}, π^{n}) = \sum_{k} r_{k} \frac{β^{n}}{β^{n} + k} ln β + \sum_{j = 1}^{d} x_{\cdot j} ln π_{j} + t β ln π_{d + 1}

up to an irrelevant constant. In view of the constraint $π_{d + 1} = 1 - \sum_{j = 1}^{d} π_{j}$, the stationarity conditions for a maximum of the surrogate reduce to

0 = \frac{1}{β} \sum_{k} r_{k} \frac{β^{n}}{β^{n} + k} + t ln π_{d + 1}, 0 = \frac{x_{\cdot j}}{π_{j}} - \frac{t β}{π_{d + 1}}, 1 \leq j \leq d .

(3.14)

Unfortunately, it is impossible to solve this system of equations analytically. There are two resolutions to the dilemma. One is block relaxation (de Leeuw 1994) alternating the updates

β^{n + 1} = - (\sum_{k} r_{k} \frac{β^{n}}{β^{n} + k}) / (t ln π_{d + 1}^{n})

and

π_{d + 1}^{n + 1} = \frac{t β^{n + 1}}{\sum_{k = 1}^{d} x_{\cdot k} + t β^{n + 1}}, π_{j}^{n + 1} = \frac{x_{\cdot j}}{\sum_{k = 1}^{d} x_{\cdot k} + t β^{n + 1}}, 1 \leq j \leq d .

This strategy enjoys the ascent property of all MM algorithms.
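The block relaxation scheme alternates a closed-form β update with a closed-form π update. The sketch below uses hypothetical names (`negmult_stats`, `negmult_loglik`, `negmult_block_update`), drops the factorial term of the log-likelihood because it does not involve the parameters, and assumes the current π_{d+1} lies strictly between 0 and 1.

```python
import math

def negmult_stats(X):
    """Counts r_k, column totals x_.j, and the sample size t."""
    m = [sum(x) for x in X]
    r = [sum(1 for mi in m if mi >= k + 1) for k in range(max(m))]
    col = [sum(x[j] for x in X) for j in range(len(X[0]))]
    return r, col, len(X)

def negmult_loglik(beta, pi, pi_last, r, col, t):
    """Log-likelihood L(beta, pi) without the ln x_ij! terms."""
    val = sum(rk * math.log(beta + k) for k, rk in enumerate(r))
    val += sum(cj * math.log(pj) for cj, pj in zip(col, pi) if cj)
    return val + t * beta * math.log(pi_last)

def negmult_block_update(beta, pi_last, r, col, t):
    """Update beta with pi fixed, then pi_1..pi_d and pi_{d+1} with beta fixed."""
    c = sum(rk * beta / (beta + k) for k, rk in enumerate(r))
    beta_new = -c / (t * math.log(pi_last))
    denom = sum(col) + t * beta_new
    pi_new = [cj / denom for cj in col]
    return beta_new, pi_new, t * beta_new / denom
```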

The other possibility is to solve the stationarity equations numerically. It is clear that the system of equations (3.14) reduces to the single equation

0 = \frac{1}{β} \sum_{k} r_{k} \frac{β^{n}}{β^{n} + k} + t ln (\frac{t β}{\sum_{k = 1}^{d} x_{\cdot k} + t β})

for β. Equivalently, if we let

α = β^{- 1}, \bar{m} = \frac{1}{t} \sum_{i} \sum_{j} x_{ij}, c^{n} = \sum_{k} r_{k} \frac{β^{n}}{β^{n} + k},

then we must find a root of the equation f (α) = αcⁿ − t ln(αm̄ + 1) = 0. It is clear that f (α) is a strictly convex function with f (0) = 0 and lim_α→∞ f (α) = ∞. Furthermore, a little reflection shows that f′ (0) = cⁿ −tm̄ < 0. Thus, there is a single root of f (α) on the interval (0,∞). Owing to the convexity of f (α), Newton’s method will reliably find the root if started to the right of the minimum of f (α) at α = t/cⁿ − 1/m̄.
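The root-finding step can be sketched with a plain Newton iteration. The function name `solve_alpha`, the tolerance, and the starting point (twice the minimizer, which lies strictly to its right) are assumptions made for illustration; the text only requires a start to the right of the minimum. The new iterate is then β = 1/α.

```python
import math

def solve_alpha(c, t, mbar, tol=1e-12, max_iter=100):
    """Positive root of f(alpha) = alpha*c - t*ln(alpha*mbar + 1)."""
    if not 0.0 < c < t * mbar:
        raise ValueError("need 0 < c < t * mbar for a positive root")
    f = lambda a: a * c - t * math.log(a * mbar + 1.0)
    fprime = lambda a: c - t * mbar / (a * mbar + 1.0)
    alpha = 2.0 * (t / c - 1.0 / mbar)      # start right of the minimizer
    for _ in range(max_iter):
        step = f(alpha) / fprime(alpha)     # Newton step for a convex function
        alpha -= step
        if abs(step) <= tol * (1.0 + alpha):
            break
    return alpha
```

By convexity, a start between the minimizer and the root overshoots to the right of the root on the first step, after which the iterates decrease monotonically toward it.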

To find initial values, we again resort to the method of moments. Based on the moments (3.13), the mean and variance of |X| = Σ_j X_j are

E (| X |) = \frac{β (1 - π_{d + 1})}{π_{d + 1}}, Var (| X |) = \frac{β (1 - π_{d + 1})}{π_{d + 1}^{2}} .

These suggest that we take

β^{0} = \frac{{\bar{x}}^{2}}{s^{2} - \bar{x}}, π_{d + 1}^{0} = \frac{\bar{x}}{s^{2}}, π_{j}^{0} = \frac{π_{d + 1}^{0}}{β^{0}} \frac{x_{\cdot j}}{t}, 1 \leq j \leq d,

where

\bar{x} = \frac{1}{t} \sum_{i = 1}^{t} m_{i}, s^{2} = \frac{1}{t - 1} \sum_{i = 1}^{t} {(m_{i} - \bar{x})}^{2} .

When the data are underdispersed (s² < x̄), our proposed initial values are not meaningful, but a negative-multinomial model is a poor choice anyway.
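These moment-based starting values can be computed as follows. The function name `negmult_initial_values` is hypothetical, and raising an error for underdispersed data is one possible design choice, not something prescribed by the text.

```python
def negmult_initial_values(X):
    """Method-of-moments starting values (beta^0, pi^0, pi_{d+1}^0)."""
    t, d = len(X), len(X[0])
    m = [sum(x) for x in X]
    xbar = sum(m) / t
    s2 = sum((mi - xbar) ** 2 for mi in m) / (t - 1)
    if s2 <= xbar:
        raise ValueError("data are not overdispersed; moment estimates undefined")
    beta0 = xbar ** 2 / (s2 - xbar)
    pi_last0 = xbar / s2
    pi0 = [pi_last0 / beta0 * sum(x[j] for x in X) / t for j in range(d)]
    return beta0, pi0, pi_last0
```

By construction the d + 1 starting probabilities sum to one, since Σ_j π_j⁰ = (s² − x̄)/s².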

3.4 Distributions on Partitions

A partition of a positive integer m into k parts is a vector a = (a₁,…, a_m) of non-negative integers such that |a| = Σ_i a_i = k and Σ_i ia_i = m. In population genetics, the partition distributions of Ewens (2004) and Pitman (Pitman 1995; Johnson, Kotz, and Balakrishnan 1997) find wide application. We now develop an MM algorithm for Pitman’s distribution, which generalizes Ewens’s distribution. Pitman’s distribution

Pr (A = a | m, α, θ) = \frac{m! \prod_{i = 1}^{| a | - 1} (θ + i α)}{(θ + 1) \dots (θ + m - 1)} \times \prod_{j = 1}^{m} {[\frac{(1 - α) \dots (1 - α + j - 2)}{j!}]}^{a_{j}} \frac{1}{a_{j}!}

involves two parameters 0 ≤ α < 1 and θ > −α. Ewens’s distribution corresponds to the choice α = 0. We will restrict θ to be positive.

To estimate parameters given u independent partitions a₁,…, a_u from Pitman’s distribution, we use the minorizations (2.3) and (2.4) to derive the minorizations

\begin{matrix} ln (θ + i α) & \geq \frac{θ^{n}}{θ^{n} + i α^{n}} ln θ + \frac{i α^{n}}{θ^{n} + i α^{n}} ln α + c, \\ ln (1 - α + i) & \geq \frac{1 - α^{n}}{1 - α^{n} + i} ln (1 - α) + c, \\ - ln (θ + i) & \geq - \frac{1}{θ^{n} + i} θ + c, \end{matrix}

where c is a different irrelevant constant in each case. Assuming a_j is a partition of the integer m_j, it follows that the log-likelihood is minorized by

\sum_{i} \frac{r_{i} θ^{n}}{θ^{n} + i α^{n}} ln θ + \sum_{i} \frac{r_{i} i α^{n}}{θ^{n} + i α^{n}} ln α + \sum_{i} \frac{s_{i} (1 - α^{n})}{1 - α^{n} + i} ln (1 - α) - \sum_{i} \frac{t_{i}}{θ^{n} + i} θ + c,

where

r_{i} = \sum_{j} 1_{{| a_{j} | \geq i + 1}}, s_{i} = \sum_{j} \sum_{k \geq i + 2} a_{jk}, t_{i} = \sum_{j} 1_{{m_{j} \geq i + 1}} .

Standard arguments now yield the simple updates

\begin{matrix} α^{n + 1} = (\sum_{i} \frac{r_{i} i α^{n}}{θ^{n} + i α^{n}}) / (\sum_{i} \frac{r_{i} i α^{n}}{θ^{n} + i α^{n}} + \sum_{i} \frac{s_{i} (1 - α^{n})}{1 - α^{n} + i}), \\ θ^{n + 1} = (\sum_{i} \frac{r_{i} θ^{n}}{θ^{n} + i α^{n}}) / (\sum_{i} \frac{t_{i}}{θ^{n} + i}) . \end{matrix}

If we set α⁰ = 0, then in all subsequent iterates αⁿ = 0, and we get the MM updates for Ewens’s distribution. Despite the availability of the moments of the parts A_i (Charalambides 2007), it is not clear how to initialize α and θ. Unfortunately, the alternative suggestion of Nobuaki (2001) does not guarantee that the initial values satisfy the constraints α ∈ [0, 1) and θ > 0.
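The updates for Pitman’s distribution can be sketched as follows. This is an illustrative Python version under stated assumptions: each partition is supplied as a list of its part sizes (so `[3, 1, 1]` is a partition of 5 into 3 parts), and the helper names `pitman_counts`, `pitman_loglik`, and `pitman_mm_step` are hypothetical. The log-likelihood is computed only up to an additive constant, which is enough to monitor the ascent property.

```python
import math
from collections import Counter

def pitman_counts(partitions):
    """Count variables r_i, s_i, t_i from partitions given as lists of part sizes."""
    r, s, t = Counter(), Counter(), Counter()
    for parts in partitions:
        for i in range(1, len(parts)):   # indices i with |a_j| >= i + 1
            r[i] += 1
        for i in range(1, sum(parts)):   # indices i with m_j >= i + 1
            t[i] += 1
        for size in parts:               # a part of size k counts for i <= k - 2
            for i in range(size - 1):
                s[i] += 1
    return r, s, t

def pitman_loglik(alpha, theta, r, s, t):
    """Log-likelihood up to an additive constant free of alpha and theta."""
    return (sum(ri * math.log(theta + i * alpha) for i, ri in r.items())
            + sum(si * math.log(1 - alpha + i) for i, si in s.items())
            - sum(ti * math.log(theta + i) for i, ti in t.items()))

def pitman_mm_step(alpha, theta, r, s, t):
    """One MM update; assumes some part has size >= 2 and some m_j >= 2."""
    a_num = sum(ri * i * alpha / (theta + i * alpha) for i, ri in r.items())
    b_num = sum(si * (1 - alpha) / (1 - alpha + i) for i, si in s.items())
    c_num = sum(ri * theta / (theta + i * alpha) for i, ri in r.items())
    d_den = sum(ti / (theta + i) for i, ti in t.items())
    return a_num / (a_num + b_num), c_num / d_den
```

Starting from an interior point such as α⁰ = 0.5 and θ⁰ = 1 and calling `pitman_mm_step` repeatedly should never decrease `pitman_loglik`. Starting from α⁰ = 0 keeps every iterate at α = 0, which gives the Ewens updates.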

3.5 Zero-Truncated and Zero-Inflated Data

In this section we briefly indicate how the MM perspective sheds fresh light on EM algorithms for zero-truncated and zero-inflated data. Once again mastery of a handful of inequalities rather than computation of conditional expectations drives the derivations.

In many discrete probability models, only data with positive counts are observed. Counts that are 0 are missing. If f (x|θ) represents the density of the complete data, then the density of a random sample x₁,…, x_t of zero-truncated data amounts to

h (x | θ) = \prod_{i = 1}^{t} \frac{f (x_{i} | θ)}{1 - f (0 | θ)} .

Inequality (2.5) immediately implies the minorization

ln h (x | θ) \geq \sum_{i = 1}^{t} [ln f (x_{i} | θ) + \frac{f (0 | θ^{n})}{1 - f (0 | θ^{n})} ln f (0 | θ)] + c,

where c is an irrelevant constant. In many models, maximization of this surrogate function is straightforward.

For instance, with zero-truncated data from the binomial, Poisson, and negative-binomial distributions, the MM updates reduce to

\begin{matrix} p^{n + 1} = (\sum_{i} x_{i}) / (\sum_{i} \frac{m_{i}}{1 - {(1 - p^{n})}^{m_{i}}}), λ^{n + 1} = (\sum_{i} x_{i}) / (\sum_{i} \frac{1}{1 - e^{- λ^{n}}}), \\ p^{n + 1} = (\sum_{i} \frac{m_{i}}{1 - {(p^{n})}^{m_{i}}}) / (\sum_{i} [x_{i} + \frac{m_{i}}{1 - {(p^{n})}^{m_{i}}}]) . \end{matrix}

For observation i of the binomial model, there are x_i successes out of m_i trials with success probability p per trial. λ is the mean in the Poisson model. For observation i of the negative-binomial model, there are x_i failures before m_i required successes.
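To make the Poisson case concrete, here is a minimal Python sketch of the zero-truncated Poisson update. The function name, starting value, and stopping rule are assumptions for illustration. Because the denominator sum has t identical terms, the update simplifies to λⁿ⁺¹ = x̄(1 − e^{−λⁿ}), where x̄ is the mean of the observed positive counts.

```python
import math

def zt_poisson_mm(xs, lam=1.0, tol=1e-12, max_iter=10000):
    """MM iteration for zero-truncated Poisson data xs (all counts positive)."""
    t = len(xs)
    total = sum(xs)
    for _ in range(max_iter):
        # lambda^{n+1} = sum_i x_i / sum_i 1 / (1 - exp(-lambda^n))
        new_lam = total * (1.0 - math.exp(-lam)) / t
        if abs(new_lam - lam) <= tol:
            return new_lam
        lam = new_lam
    return lam
```

At convergence the iterate satisfies λ/(1 − e^{−λ}) = x̄, which is the likelihood equation for the zero-truncated Poisson model. When every observation equals 1 the maximum sits on the boundary λ = 0 and the iteration approaches it only slowly.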

More complicated models can be handled in similar fashion. The key insight in each case is to augment every ordinary observation x_i > 0 by a total of f (0|θⁿ)/[1 − f (0|θⁿ)] pseudo-observations of 0 at iteration n. With this amendment, the two MM algorithms for the beta-binomial distribution implemented in (3.5), (3.6), and (3.7) remain valid except that the count variables r_k and s_jk defining the updated parameters at iteration n become

\begin{matrix} s_{1 k} = \sum_{i} 1_{{x_{i 1} \geq k + 1}}, s_{2 k} = \sum_{i} [1_{{x_{i 2} \geq k + 1}} + \frac{f (0 | π^{n}, θ^{n})}{1 - f (0 | π^{n}, θ^{n})}], \\ r_{k} = \sum_{i} [1 + \frac{f (0 | π^{n}, θ^{n})}{1 - f (0 | π^{n}, θ^{n})}] 1_{{m_{i} \geq k + 1}}, \end{matrix}

where

f (0 | π^{n}, θ^{n}) = \frac{π_{2}^{n} (π_{2}^{n} + θ^{n}) \dots [π_{2}^{n} + (m_{i} - 1) θ^{n}]}{(1 + θ^{n}) \dots [1 + (m_{i} - 1) θ^{n}]} .

Here category 1 represents success and category 2 failure. If we start with θ⁰ = 0, then we recover the updates for the zero-truncated binomial distribution.
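The zero-class probability and the resulting pseudo-observation weight for the beta-binomial model are easy to compute directly from the product formula. The Python sketch below is illustrative only; the function names are hypothetical, and `pi2`, `theta`, and `m` stand for π₂ⁿ, θⁿ, and m_i.

```python
def bb_zero_prob(pi2, theta, m):
    """f(0 | pi, theta): probability of 0 successes in m beta-binomial trials."""
    p = 1.0
    for k in range(m):
        p *= (pi2 + k * theta) / (1.0 + k * theta)
    return p

def pseudo_zero_weight(pi2, theta, m):
    """Number of pseudo-observations of 0 added per observed x_i > 0."""
    f0 = bb_zero_prob(pi2, theta, m)
    return f0 / (1.0 - f0)
```

Setting `theta` to 0 reduces `bb_zero_prob` to π₂^m, the binomial probability of no successes, which matches the remark that θ⁰ = 0 recovers the zero-truncated binomial updates.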

Zero-inflated data are equally easy to handle. The density function is now

h (x | θ, π) = \prod_{i = 1}^{t} {[(1 - π) + π f (0 | θ)]}^{1_{{x_{i} = 0}}} {[π f (x_{i} | θ)]}^{1_{{x_{i} > 0}}} .

Inequality (2.3) entails the minorization

\begin{matrix} ln h (x | θ, π) & \geq \sum_{i = 1}^{t} 1_{{x_{i} = 0}} \{z^{n} ln (1 - π) + (1 - z^{n}) [ln π + ln f (0 | θ)]\} + \sum_{i = 1}^{t} 1_{{x_{i} > 0}} [ln π + ln f (x_{i} | θ)], \\ z^{n} & = \frac{1 - π^{n}}{1 - π^{n} + π^{n} f (0 | θ^{n})} . \end{matrix}

The MM update of the inflation-admixture parameter clearly is

π^{n + 1} = \frac{1}{t} \sum_{i = 1}^{t} [1_{{x_{i} > 0}} + 1_{{x_{i} = 0}} (1 - z^{n})] .

As a typical example, consider estimation with the zero-inflated Poisson (Patil and Shirke 2007). The mean λ of the Poisson component is updated by

λ^{n + 1} = \frac{\sum_{i} x_{i}}{\sum_{i} [(1 - z^{n}) 1_{{x_{i} = 0}} + 1_{{x_{i} > 0}}]} .

In other words, every 0 observation is discounted by the amount zⁿ at iteration n. This makes intuitive sense.
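The two zero-inflated Poisson updates can be combined into a single iteration, sketched here in Python. The function names and the list-of-counts input are assumptions for the example. Here f(0|λ) = e^{−λ}, and the log-likelihood helper is included only so that the ascent property can be checked numerically.

```python
import math

def zip_loglik(xs, pi, lam):
    """Log-likelihood of the zero-inflated Poisson with mixing weight pi."""
    n0 = sum(1 for x in xs if x == 0)
    ll = n0 * math.log((1.0 - pi) + pi * math.exp(-lam))
    for x in xs:
        if x > 0:
            ll += math.log(pi) - lam + x * math.log(lam) - math.lgamma(x + 1)
    return ll

def zip_mm_step(xs, pi, lam):
    """One MM update of (pi, lam); assumes at least one positive count."""
    t = len(xs)
    n0 = sum(1 for x in xs if x == 0)
    z = (1.0 - pi) / ((1.0 - pi) + pi * math.exp(-lam))
    # Each zero is discounted by z; each positive count has weight 1.
    weight = (t - n0) + n0 * (1.0 - z)
    return weight / t, sum(xs) / weight
```

Both updates share the same discounted count `weight`, so one pass over the data suffices per iteration.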

4. A NUMERICAL EXPERIMENT

As a numerical experiment, we fit the Dirichlet-multinomial (two parameterizations) and the Neerchal–Morel distributions to the 3823 training digits in the handwritten digit data from the UCI machine learning repository (Asuncion and Newman 2007). Each normalized 32 × 32 bitmap is divided into 64 blocks of size 4 × 4, and the black pixels are counted in each block. This generates a 64-dimensional count vector for each bitmap. Bouguila (2008) successfully fit mixtures of Connor–Mosimann distributions to the training data and used the estimated models to cluster the test data. For illustrative purposes, we restrict attention here to the Dirichlet-multinomial and Neerchal–Morel models. Based on the minorization (2.3), it is straightforward to extend our MM algorithms to fit finite mixture models using any of the previously encountered multivariate discrete distributions.
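The feature construction described above amounts to a block-wise pixel count. A minimal Python sketch follows, assuming the bitmap is a 32 × 32 list of lists with 1 for a black pixel and 0 otherwise, and assuming blocks are listed in row-major order. Both the encoding and the ordering are illustrative choices rather than details given in the text.

```python
def block_counts(bitmap, block=4):
    """Count black pixels (value 1) in each block x block tile of a square bitmap."""
    n = len(bitmap)
    counts = []
    for r0 in range(0, n, block):
        for c0 in range(0, n, block):
            counts.append(sum(bitmap[r][c]
                              for r in range(r0, r0 + block)
                              for c in range(c0, c0 + block)))
    return counts
```

For a 32 × 32 input this returns 64 counts, each between 0 and 16, whose sum is the total number of black pixels in the bitmap.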

Table 3 lists the final log-likelihoods, number of iterations, and running times of the different algorithms tested. The MM and accelerated MM algorithms were coded in plain Matlab script language. Newton’s method was implemented using the fmincon function in the Matlab Optimization Toolbox under the interior-point option with user-supplied analytical gradient and Hessian. All iterations started from the initial points θ⁰ = 1 and $π^{0} = α^{0} = (\frac{1}{64}, \dots, \frac{1}{64})$ . The stopping criterion for Newton’s method was tuned to achieve precision comparable to the stopping criterion (2.6) for the MM algorithms. Running times in seconds were recorded from a laptop computer.

Table 3.

Numerical experiment. For each digit, Row 1: MM; Row 2: SqMPE1 MM; Row 3: SqRRE1 MM; Row 4: Newton’s method using the fmincon function available in the Matlab Optimization Toolbox. L denotes the final log-likelihood; times are in seconds.

	DM (π, θ)			DM (α)			Neerchal–Morel

Digit	L	# iters	Time (s)	L	# iters	Time (s)	L	# iters	Time (s)
0	−37,358	232	0.18	−37,358	361	0.16	−38,828	15	0.08
	−37,358	18	0.04	−37,358	18	0.04	−38,828	7	0.10
	−37,358	21	0.04	−37,358	18	0.04	−38,828	7	0.09
	−37,359	11	0.13	−37,358	18	0.16	−38,828	13	106.42
1	−42,179	237	0.16	−42,179	120	0.06	−52,424	17	0.09
	−42,179	17	0.03	−42,179	12	0.03	−52,424	7	0.10
	−42,179	26	0.05	−42,179	13	0.03	−52,424	7	0.10
	−42,179	15	0.19	−42,179	14	0.13	−52,424	12	98.91
2	−39,985	213	0.14	−39,985	136	0.07	−47,723	14	0.07
	−39,985	17	0.04	−39,985	15	0.03	−47,723	6	0.08
	−39,985	17	0.04	−39,985	11	0.03	−47,723	6	0.08
	−39,986	15	0.19	−39,985	15	0.13	−47,721	14	113.15
3	−40,519	214	0.14	−40,519	173	0.08	−45,816	14	0.07
	−40,519	23	0.04	−40,519	15	0.03	−45,816	6	0.08
	−40,519	20	0.04	−40,519	11	0.03	−45,816	6	0.08
	−40,519	14	0.17	−40,519	15	0.13	−45,816	12	102.30
4	−43,489	203	0.13	−43,489	102	0.06	−55,432	14	0.07
	−43,489	17	0.04	−43,489	12	0.03	−55,432	6	0.08
	−43,489	19	0.04	−43,489	9	0.03	−55,432	6	0.08
	−43,489	13	0.17	−43,489	14	0.12	−55,432	14	114.40
5	−41,191	205	0.13	−41,191	116	0.06	−50,063	13	0.07
	−41,191	18	0.04	−41,191	12	0.03	−50,063	6	0.08
	−41,191	19	0.04	−41,191	12	0.03	−50,063	6	0.09
	−41,192	12	0.16	−41,191	15	0.13	−50,063	15	118.22
6	−37,703	232	0.15	−37,703	203	0.10	−41,888	20	0.10
	−37,703	19	0.04	−37,703	16	0.03	−41,888	8	0.11
	−37,703	21	0.04	−37,703	11	0.03	−41,888	8	0.11
	−37,703	15	0.19	−37,703	19	0.16	−41,888	13	104.25
7	−40,304	218	0.14	−40,304	141	0.07	−47,653	12	0.06
	−40,304	16	0.04	−40,304	15	0.03	−47,653	6	0.08
	−40,304	18	0.04	−40,304	11	0.03	−47,653	6	0.08
	−40,305	13	0.15	−40,304	15	0.13	−47,653	15	120.95
8	−43,131	227	0.15	−43,131	171	0.08	−48,844	17	0.09
	−43,131	19	0.04	−43,131	16	0.03	−48,844	7	0.10
	−43,131	23	0.04	−43,131	14	0.03	−48,844	7	0.09
	−43,132	10	0.13	−43,131	15	0.14	−48,844	13	107.22
9	−43,710	207	0.14	−43,710	116	0.06	−53,030	13	0.07
	−43,710	19	0.04	−43,710	12	0.03	−53,030	6	0.08
	−43,710	18	0.04	−43,710	11	0.03	−53,030	6	0.08
	−43,710	12	0.16	−43,710	15	0.14	−53,030	14	116.49


Inspection of Table 3 demonstrates that the MM algorithms outperform Newton’s method and that acceleration is often very beneficial. The cost of evaluating and inverting the observed information matrices of the Neerchal–Morel model significantly slows Newton’s method even in these problems with only 64 parameters. The observed information matrix of the Dirichlet-multinomial distribution possesses a special structure (diagonal plus rank-1 perturbation) that makes matrix inversion far easier. Table 3 does not show the human effort in devising, programming, and debugging the various algorithms. For Newton’s method, derivation and programming took in excess of one day. Formulas for the score and observed information of the Dirichlet-multinomial and Neerchal–Morel distributions are omitted for the sake of brevity. Fisher’s scoring algorithm was not implemented because it is even more cumbersome than Newton’s method (Neerchal and Morel 2005).

This numerical comparison is merely for illustrative purposes. Numerical analysts have developed quasi-Newton algorithms to mend the defects of Newton’s method. The limited-memory BFGS (LBFGS) algorithm (Nocedal and Wright 2006) is especially pertinent to high-dimensional problems. A systematic comparison of the MM and LBFGS methods is worth pursuing.

5. DISCUSSION

In designing algorithms for maximum likelihood estimation, Newton’s method and Fisher scoring come immediately to mind. In the last generation, statisticians have added the EM principle. These are good mental reflexes, but the broader MM principle also deserves serious consideration. In many problems, the EM and MM perspectives lead to the same algorithm. In other situations such as image reconstruction in transmission tomography, it is possible to construct different EM and MM algorithms for the same purpose (Lange 2004). One of the most appealing features of the EM perspective is that it provides a statistical interpretation of algorithm intermediates. Although it is a matter of taste and experience whether inequalities or missing data offer an easier path to algorithm development, the fact that there are two routes adds to the possibilities for new algorithms.

One can argue that applications of minorizations (2.3) and (2.5) are just disguised EM algorithms. This objection misses the point in three ways. First, it does not suggest missing data structures explaining the minorization (2.4) and other less well-known minorizations. Second, it fails to weigh the difficulties of invoking simple inequalities versus calculating conditional expectations. When the creation of an appropriate surrogate function requires several minorizations, the corresponding conditional expectations become harder to execute. For example, although the EM principle dictates adding pseudo-observations for zero-truncated data, it is easy to lose sight of this simple interpretation in complicated examples such as the beta-binomial distribution. The genetic segregation analysis example appearing in chapter 2 of the book by Lange (2002) falls into the same category. Third, it fails to acknowledge the conceptual clarity of the MM principle, which shifts focus away from the probability spaces connected with missing data to the simple act of minorization. For instance, when one undertakes maximum a posteriori estimation, should the E step of the EM algorithm take into account the prior?

Some EM and MM algorithms are notoriously slow to converge. As we noted earlier, slow convergence is partially offset by the simplicity of each iteration. There is a growing body of techniques for accelerating MM algorithms (Jamshidian and Jennrich 1995; Lange 1995a; Jamshidian and Jennrich 1997; Varadhan and Roland 2008). These techniques often lead to a ten-fold or even a hundred-fold reduction in the number of iterations. The various examples appearing in this article are typical in this regard. On problems with boundaries or nondifferentiable objective functions, acceleration may be less helpful.

Our negative-multinomial example highlights two useful tactics for overcoming complications in solving the maximization step of the EM and MM algorithms. It is a mistake to think of the various optimization algorithms in isolation. Often block relaxation (de Leeuw 1994) and Newton’s method can be combined creatively with the MM principle. Systematic application of Newton’s method in solving the maximization step of the MM algorithm is formalized in the MM gradient algorithm (Lange 1995b).

Parameter asymptotic standard errors are a natural byproduct of Newton’s method and scoring. With a modicum of additional effort, the EM and MM algorithms also deliver asymptotic standard errors (Meng and Rubin 1991; Hunter and Lange 2004). Virtually all optimization algorithms are prone to converge to inferior modes. For this reason, we have emphasized finding reasonable initial values. The overlooked article of Ueda and Nakano (1998) suggested an annealing approach to maximization with mixture models. Here the idea is to flatten the likelihood surface and eliminate all but the dominant mode. As the iterations proceed, the flat surface gradually warps into the true bumpy surface. Our recent work (Zhou and Lange 2010) extends this idea to many other EM and MM algorithms. A similar idea, called graduated non-convexity (GNC), appears in computer vision and signal processing literature (Blake and Zisserman 1987). In the absence of a good annealing procedure, one can fall back on starting an optimization algorithm from multiple random points, but this inevitably increases the computational load. The reassurance that a log-likelihood is concave is always welcome.

Readers may want to try their hands at devising their own MM algorithms. For instance, the Dirichlet-negative-multinomial distribution, the bivariate Poisson (Johnson, Kotz, and Balakrishnan 1997), and truncated multivariate discrete distributions yield readily to the techniques described. The performance of the MM algorithm on these problems is similar to that in our fully developed examples. Of course, many objective functions are very complicated, and devising a good MM algorithm is a challenge. The greatest payoffs are apt to be on high-dimensional problems. For simplicity of exposition, we have not tackled any extremely high-dimensional problems, but these certainly exist (Sabatti and Lange 2002; Ayers and Lange 2008; Lange and Wu 2008). In any event, most mathematicians and statisticians keep a few tricks up their sleeves. The MM principle belongs there, waiting for the right problems to come along.

Supplementary Material

Matlab Code

NIHMS205488-supplement-Matlab_Code.zip (112.8 KB, zip)

ACKNOWLEDGMENTS

The authors thank the editors and referees for their many valuable comments.

Footnotes

SUPPLEMENTAL MATERIALS

Datasets and Matlab codes: The supplementary material (a single zip package) contains all datasets appearing here and the Matlab codes generating our numerical results and graphs. The readme.txt file describes the contents of each file in the package. (supp_material.zip)

Contributor Information

Hua Zhou, Post-Doctoral Fellow, Department of Human Genetics, University of California, Los Angeles, CA 90095-7088 (huazhou@ucla.edu).

Kenneth Lange, Professor, Departments of Biomathematics, Human Genetics, and Statistics, University of California, Los Angeles, CA 90095-7088.

REFERENCES

Asuncion A, Newman DJ. (UCI) Machine Learning Repository. 2007. Available at http://www.ics.uci.edu/~mlearn/Repository.html
Ayers KL, Lange K. Penalized Estimation of Haplotype Frequencies. Bioinformatics. 2008;24:1596–1602. doi: 10.1093/bioinformatics/btn236.
Bailey NTJ. The Mathematical Theory of Epidemics. London: Charles Griffin & Company; 1957.
Blake A, Zisserman A. Visual Reconstruction. Cambridge, MA: MIT Press; 1987.
Bouguila N. Clustering of Count Data Using Generalized Dirichlet Multinomial Distributions. IEEE Transactions on Knowledge and Data Engineering. 2008;20(4):462–474.
Charalambides CA. Distributions of Random Partitions and Their Applications. Methodology and Computing in Applied Probability. 2007;9:163–193.
Connor RJ, Mosimann JE. Concepts of Independence for Proportions With a Generalization of the Dirichlet Distribution. Journal of the American Statistical Association. 1969;64:194–206.
Dempster AP, Laird NM, Rubin DB. Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B. 1977;39(1):1–38. (with discussion)
de Leeuw J. Block Relaxation Algorithms in Statistics. In: Bock HH, Lenski W, Richter MM, editors. Information Systems and Data Analysis. Berlin: Springer-Verlag; 1994.
Ewens WJ. Mathematical Population Genetics. 2nd ed. New York: Springer-Verlag; 2004.
Griffiths DA. Maximum Likelihood Estimation for the Beta-Binomial Distribution and an Application to the Household Distribution of the Total Number of Cases of a Disease. Biometrics. 1973;29:637–648.
Groenen PJF. The Majorization Approach to Multidimensional Scaling: Some Problems and Extensions. Leiden, The Netherlands: DSWO Press; 1993.
Haldane JBS. The Fitting of Binomial Distributions. Annals of Eugenics. 1941;11:179–181.
Haseman JK, Soares ER. The Distribution of Fetal Death in Control Mice and Its Implications on Statistical Tests for Dominant Lethal Effects. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis. 1976;41:277–288. doi: 10.1016/0027-5107(76)90101-9.
Heiser WJ. Convergent Computing by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis. In: Krzanowski WJ, editor. Recent Advances in Descriptive Multivariate Analysis. Oxford: Clarendon Press; 1995. pp. 157–189.
Hunter DR, Lange K. A Tutorial on MM Algorithms. The American Statistician. 2004;58:30–37.
Jamshidian M, Jennrich RI. Acceleration of the EM Algorithm by Using Quasi-Newton Methods. Journal of the Royal Statistical Society, Ser. B. 1995;59:569–587.
Jamshidian M, Jennrich RI. Quasi-Newton Acceleration of the EM Algorithm. Journal of the Royal Statistical Society, Ser. B. 1997;59:569–587.
Johnson NL, Kotz S, Balakrishnan N. Discrete Multivariate Distributions. New York: Wiley; 1997.
Lange K. A Quasi-Newton Acceleration of the EM Algorithm. Statistica Sinica. 1995a;5:1–18.
Lange K. A Gradient Algorithm Locally Equivalent to the EM Algorithm. Journal of the Royal Statistical Society, Ser. B. 1995b;57(2):425–437.
Lange K. Mathematical and Statistical Methods for Genetic Analysis. 2nd ed. New York: Springer-Verlag; 2002.
Lange K. Optimization. New York: Springer-Verlag; 2004.
Lange K, Wu TT. An MM Algorithm for Multicategory Vertex Discriminant Analysis. Journal of Computational and Graphical Statistics. 2008;17:527–544.
Meng XL, Rubin DB. Using EM to Obtain Asymptotic Variance–Covariance Matrices: The SEM Algorithm. Journal of the American Statistical Association. 1991;86:899–909.
Minka TP. Estimating a Dirichlet Distribution. Technical report, Microsoft. 2003.
Neerchal NK, Morel JG. Large Cluster Results for Two Parametric Multinomial Extra Variation Models. Journal of the American Statistical Association. 1998;93(443):1078–1087.
Neerchal NK, Morel JG. An Improved Method for the Computation of Maximum Likelihood Estimates for Multinomial Overdispersion Models. Computational Statistics & Data Analysis. 2005;49(1):33–43.
Nobuaki H. Applying Pitman’s Sampling Formula to Microdata Disclosure Risk Assessment. Journal of Official Statistics. 2001;17:499–520.
Nocedal J, Wright S. Numerical Optimization. New York: Springer; 2006.
Patil MK, Shirke DT. Testing Parameter of the Power Series Distribution of a Zero Inflated Power Series Model. Statistical Methodology. 2007;4:393–406.
Paul SR, Balasooriya U, Banerjee T. Fisher Information Matrix of the Dirichlet-Multinomial Distribution. Biometrical Journal. 2005;47(2):230–236. doi: 10.1002/bimj.200410103.
Pitman J. Exchangeable and Partially Exchangeable Random Partitions. Probability Theory and Related Fields. 1995;102(2):145–158.
Sabatti C, Lange K. Genomewide Motif Identification Using a Dictionary Model. Proceedings of the IEEE. 2002;90:1803–1810.
Steele JM. The Cauchy–Schwarz Master Class. MAA Problem Books Series. Washington, DC: Mathematical Association of America; 2004.
Ueda N, Nakano R. Deterministic Annealing EM Algorithm. Neural Networks. 1998;11:271–282. doi: 10.1016/s0893-6080(97)00133-0.
Varadhan R, Roland C. Simple and Globally Convergent Methods for Accelerating the Convergence of Any EM Algorithm. Scandinavian Journal of Statistics. 2008;35:335–353.
Wu TT, Lange K. The MM Alternative to EM. Statistical Science. 2010, to appear.
Zhou H, Lange K. On the Bumpy Road to the Dominant Mode. Scandinavian Journal of Statistics. 2010, to appear. doi: 10.1111/j.1467-9469.2009.00681.x.
