MM Algorithms For Variance Components Models

Hua Zhou; Liuyi Hu; Jin Zhou; Kenneth Lange

doi:10.1080/10618600.2018.1529601

. Author manuscript; available in PMC: 2020 Mar 9.

Published in final edited form as: J Comput Graph Stat. 2019 Mar 9;28(2):350–361. doi: 10.1080/10618600.2018.1529601

MM Algorithms For Variance Components Models

Hua Zhou ^a, Liuyi Hu ^b, Jin Zhou ^c, Kenneth Lange ^d

PMCID: PMC6779174 NIHMSID: NIHMS1520713 PMID: 31592195

Abstract

Variance components estimation and mixed model analysis are central themes in statistics with applications in numerous scientific disciplines. Despite the best efforts of generations of statisticians and numerical analysts, maximum likelihood estimation and restricted maximum likelihood estimation of variance component models remain numerically challenging. Building on the minorization-maximization (MM) principle, this paper presents a novel iterative algorithm for variance components estimation. Our MM algorithm is trivial to implement and competitive on large data problems. The algorithm readily extends to more complicated problems such as linear mixed models, multivariate response models possibly with missing data, maximum a posteriori estimation, and penalized estimation. We establish the global convergence of the MM algorithm to a Karush-Kuhn-Tucker (KKT) point and demonstrate, both numerically and theoretically, that it converges faster than the classical EM algorithm when the number of variance components is greater than two and all covariance matrices are positive definite.

Keywords: global convergence, matrix convexity, linear mixed model (LMM), maximum a posteriori (MAP) estimation, minorization-maximization (MM), multivariate response, penalized estimation, variance components model

1. Introduction

Variance components and linear mixed models are among the most potent tools in a statistician’s toolbox, finding numerous applications in agriculture, biology, economics, genetics, epidemiology, and medicine. Given an observed n × 1 response vector y and n × p predictor matrix X, the simplest variance components model postulates that Y ∼ N (Xβ, Ω), where $Ω = \sum_{i = 1}^{m} σ_{i}^{2} V_{i}$ , and the V₁, …, V_m are m fixed positive semidefinite matrices. The parameters of the model can be divided into mean effects β = (β₁, …, β_p) and variance components $σ^{2} = (σ_{1}^{2}, \dots, σ_{m}^{2})$ . Throughout we assume Ω is positive definite. The extension to singular Ω will not be pursued here. Estimation revolves around the log-likelihood function

L (β, σ^{2}) = - \frac{1}{2} \ln \det Ω - \frac{1}{2} {(y - X β)}^{T} Ω^{- 1} (y - X β) .

(1)

Among the commonly used methods for estimating variance components, maximum likelihood estimation (MLE) (Hartley and Rao, 1967) and restricted (or residual) MLE (REML) (Harville, 1977) are the most popular. REML first projects y to the null space of X and then estimates variance components based on the projected responses. If the columns of a matrix B span the null space of X^T, then REML estimates the $σ_{i}^{2}$ by maximizing the log-likelihood of the redefined response vector B^T Y, which is normally distributed with mean 0 and covariance $B^{T} Ω B = \sum_{i = 1}^{m} σ_{i}^{2} B^{T} V_{i} B$ .

There exists a large literature on iterative algorithms for finding MLE and REML (Laird and Ware, 1982; Lindstrom and Bates, 1988, 1990; Harville and Callanan, 1990; Callanan and Harville, 1991; Bates and Pinheiro, 1998; Schafer and Yucel, 2002). Fitting variance components models remains a challenge in models with a large sample size n or a large number of variance components m. Newton’s method (Lindstrom and Bates, 1988) converges quickly but is numerically unstable owing to the non-concavity of the log-likelihood. Fisher’s scoring algorithm replaces the observed information matrix in Newton’s method by the expected information matrix and yields an ascent algorithm when safeguarded by step halving. However the calculation and inversion of expected information matrices cost O(mn³) + O(m³) flops and quickly become impractical for large n or m, unless V_i are low rank, block diagonal, or have other special structures. The expectation-maximization (EM) algorithm initiated by Dempster et al. (1977) is a third alternative (Laird and Ware, 1982; Laird et al., 1987; Lindstrom and Bates, 1988; Bates and Pinheiro, 1998). Compared to Newton’s method, the EM algorithm is easy to implement and numerically stable, but painfully slow to converge. In practice, a strategy of priming Newton’s method by a few EM steps leverages the stability of EM and the faster convergence of second-order methods.

In this paper we derive a novel minorization-maximization (MM) algorithm for finding the MLE and REML estimates of variance components. We prove global convergence of the MM algorithm to a Karush-Kuhn-Tucker (KKT) point and explain why MM generally converges faster than EM for models with more than two variance components. We also sketch extensions of the MM algorithm to the multivariate response model with possibly missing responses, the linear mixed model (LMM), maximum a posteriori (MAP) estimation, and penalized estimation. The numerical efficiency of the MM algorithm is illustrated through simulated data sets and a genomic example with 200 variance components.

2. Preliminaries

Background on MM algorithms

Throughout we reserve Greek letters for parameters and indicate the current iteration number by a superscript t. The MM principle for maximizing an objective function f (θ) involves minorizing the objective function f (θ) by a surrogate function g(θ | θ^(t)) around the current iterate θ^(t) of a search (Lange et al., 2000). Minorization is defined by the two conditions

f (θ^{(t)}) = g (θ^{(t)} | θ^{(t)}) f (θ) \geq g (θ | θ^{(t)}), θ \neq θ^{(t)} .

(2)

In other words, the surface $θ \mapsto g (θ | θ^{(t)})$ lies below the surface $θ \mapsto f (θ)$ and is tangent to it at the point θ = θ^(t). Construction of the minorizing function g(θ | θ^(t)) constitutes the first M of the MM algorithm. The second M of the algorithm maximizes the surrogate g(θ | θ^(t)) rather than f (θ). The point θ^(t+1) maximizing g(θ | θ^(t)) satisfies the ascent property $f (θ^{(t + 1)}) \geq f (θ^{(t)})$ . This fact follows from the inequalities

f (θ^{(t + 1)}) \geq g (θ^{(t + 1)} | θ^{(t)}) \geq g (θ^{(t)} | θ^{(t)}) = f (θ^{(t)}),

(3)

reflecting the definition of θ^(t+1) and the tangency and domination conditions (2). The ascent property makes the MM algorithm remarkably stable. The validity of the descent property depends only on increasing g(θ | θ^(t)), not on maximizing g(θ | θ^(t)). With obvious changes, the MM algorithm also applies to minimization rather than to maximization. To minimize a function f (θ), we majorize it by a surrogate function g(θ | θ^(t)) and minimize g(θ | θ^(t)) to produce the next iterate θ^(t+1). The acronym should not be confused with the maximization-maximization algorithm in the variational Bayes context (Jeon, 2012).

The MM principle (De Leeuw, 1994; Heiser, 1995; Kiers, 2002; Lange et al., 2000; Hunter and Lange, 2004) finds applications in multidimensional scaling (Borg and Groenen, 2005), ranking of sports teams (Hunter, 2004), variable selection (Hunter and Li, 2005; Yen, 2011), optimal experiment design (Yu, 2010), multivariate statistics (Zhou and Lange, 2010), geometric programming (Lange and Zhou, 2014), survival models (Hunter and Lange, 2002; Ding et al., 2015), sparse covariance estimation (Bien and Tibshirani, 2011), and many other areas (Lange, 2016). The celebrated EM principle (Dempster et al., 1977) is a special case of the MM principle. The Q function produced in the E step of an EM algorithm minorizes the log-likelihood up to an irrelevant constant. Thus, both EM and MM share the same advantages: simplicity, stability, graceful adaptation to constraints, and the tendency to avoid large matrix inversion. The more general MM perspective frees algorithm derivation from the missing data straitjacket and invites wider applications (Wu and Lange, 2010). Figure 1 shows the minorization functions of EM and MM for a variance components model with m = 2 variance components.

Figure 1: — Surrogate functions of EM and MM minorize the log-likelihood surface of a 2-variance component model at point $(σ_{1}^{2 (t)}, σ_{2}^{2 (t)}) = (18.5, 0.7)$ . MM surrogate function hugs the log-likelihood surface tighter than EM.

Convex matrix functions

For symmetric matrices we write $A ⪯ B$ when B − A is positive semidefinite and $A ≺ B$ if B − A is positive definite. A matrix-valued function f is said to be (matrix) convex if

f [λ A + (1 - λ) B] ⪯ λ f (A) + (1 - λ) f (B)

for all A, B, and λ ∈ [0, 1]. Our derivation of the MM variance components algorithm hinges on the convexity of the two functions mentioned in the next lemma. See standard text Boyd and Vandenberghe (2004) for the verification of both facts.

Lemma 1. (a) The matrix fractional function f (A, B) = A^T B⁻¹A is jointly convex in the m × n matrix A and the m × m positive definite matrix B. (b) The log determinant function f (B) = ln det B is concave on the set of positive definite matrices.

3. Univariate response model

Our strategy for maximizing the log-likelihood (1) is to alternate updating the mean parameters β and the variance components σ². Updating β given σ² is a standard general least squares problem with solution

β^{(t + 1)} = {(X^{T} Ω^{- (t)} X)}^{- 1} X^{T} Ω^{- (t)} y,

(4)

where Ω^−(t) represents the inverse of $Ω^{(t)} = \sum_{i = 1}^{m} σ_{i}^{2 (t)} V_{i}$ . Updating σ² given β^(t) depends on two minorizations. If we assume that all of the V_i are positive definite, then the joint convexity of the map $(X, Y) \mapsto X^{T} Y^{- 1} X$ for positive definite Y implies that

Ω^{(t)} Ω^{- 1} Ω^{(t)} = (\sum_{i = 1}^{m} σ_{i}^{2 (t)} V_{i}) {(\sum_{i = 1}^{m} σ_{i}^{2} V_{i})}^{- 1} (\sum_{i = 1}^{m} σ_{i}^{2 (t)} V_{i}) ⪯ \sum_{i = 1}^{m} \frac{σ_{i}^{2 (t)}}{\sum_{j} σ_{j}^{2 (t)}} (\frac{\sum_{j} σ_{j}^{2 (t)}}{σ_{i}^{2 (t)}} σ_{i}^{2 (t)} V_{i}) {(\frac{\sum_{j} σ_{j}^{2 (t)}}{σ_{i}^{2 (t)}} σ_{i}^{2} V_{i})}^{- 1} (\frac{\sum_{j} σ_{j}^{2 (t)}}{σ_{i}^{2 (t)}} σ_{i}^{2 (t)} V_{i}) = \sum_{i = 1}^{m} \frac{σ_{i}^{4 (t)}}{σ_{i}^{2}} V_{i} .

When one or more of the V_i are rank deficient, we replace each V_i by V_i,ϵ = V_i + ϵI for ϵ > 0 small and let $Ω_{ϵ}^{(t)} = \sum_{i} σ_{i}^{2 (t)} V_{i, ϵ}$ . Sending $ϵ$ to 0 in $Ω_{ϵ}^{(t)} Ω_{ϵ}^{- 1} Ω_{ϵ}^{(t)} ⪯ \sum_{i = 1}^{m} (σ_{i}^{4 (t)} / σ_{i}^{2}) V_{i, ϵ}$ now gives the desired majorization $Ω^{(t)} Ω^{- 1} Ω^{(t)} ⪯ \sum_{i = 1}^{m} (σ_{i}^{4 (t)} / σ_{i}^{2}) V_{i}$ in the general case. Negating both sides leads to the minorization

- {(y - X β)}^{T} Ω^{- 1} (y - X β) ⪰ - {(y - X β)}^{T} Ω^{- (t)} (\sum_{i = 1}^{m} \frac{σ_{i}^{4 (t)}}{σ_{i}^{2}} V_{i}) Ω^{- (t)} (y - X β)

(5)

that effectively separates the variance components $σ_{1}^{2}, \dots, σ_{m}^{2}$ in the quadratic term of the log-likelihood (1).

The convexity of the function $A \mapsto - \log \det A$ is equivalent to the supporting hyperplane minorization

- \ln \det Ω \geq - \ln \det Ω^{(t)} - tr [Ω^{- (t)} (Ω - Ω^{(t)})]

(6)

that separates $σ_{1}^{2}, \dots, σ_{m}^{2}$ in the log determinant term of the log-likelihood (1). Combination of the minorizations (5) and (6) gives the overall minorization

g (σ^{2} | σ^{2 (t)}) = - \frac{1}{2} tr (Ω^{- (t)} Ω) - \frac{1}{2} {(y - X β^{(t)})}^{T} Ω^{- (t)} (\sum_{i = 1}^{m} \frac{σ_{i}^{4 (t)}}{σ_{i}^{2}} V_{i}) Ω^{- (t)} (y - X β^{(t)}) + c^{(t)} = \sum_{i = 1}^{m} [- \frac{σ_{i}^{2}}{2} tr (Ω^{- (t)} V_{i}) - \frac{1}{2} \frac{σ_{i}^{4 (t)}}{σ_{i}^{2}} {(y - X β^{(t)})}^{T} Ω^{- (t)} V_{i} Ω^{- (t)} (y - X β^{(t)})] + c^{(t)},

(7)

where c^(t) is an irrelevant constant. Maximization of g(σ² | σ^2(t)) with respect to $σ_{i}^{2}$ yields the simple multiplicative update

σ_{i}^{2 (t + 1)} = σ_{i}^{2 (t)} \sqrt{\frac{{(y - X β^{(t)})}^{T} Ω^{- (t)} V_{i} Ω^{- (t)} (y - X β^{(t)})}{tr (Ω^{- (t)} V_{i})}}, i = 1, \dots, m .

(8)

As a sanity check on our derivation, consider the partial derivative

\frac{\partial}{\partial σ_{i}^{2}} L (β, σ^{2}) = - \frac{1}{2} tr (Ω^{- 1} V_{i}) + \frac{1}{2} {(y - X β)}^{T} Ω^{- 1} V_{i} Ω^{- 1} (y - X β) .

(9)

Given $σ_{i}^{2 (t)} > 0$ , it is clear from the update formula (8) that $σ_{i}^{2 (t + 1)} < σ_{i}^{2 (t)}$ when $\frac{\partial}{\partial σ_{i}^{2}} L < 0$ . Conversely $σ_{i}^{2 (t + 1)} > σ_{i}^{2 (t)}$ when $\frac{\partial}{\partial σ_{i}^{2}} L > 0$ .

graphic file with name nihms-1520713-f0002.jpg

Algorithm 1 summarizes the MM algorithm for MLE of the univariate response model (1). The update formula (8) assumes that the numerator under the square root sign is nonnegative and the denominator is positive. The numerator requirement is a consequence of the positive semidefiniteness of V_i. The denominator requirement is not obvious but can be verified through the Hadamard (elementwise) product representation $tr (Ω^{- (t)} V_{i}) = 1^{T} (Ω^{- (t)} ⊙ V_{i}) 1$ . The following lemma of Schur (1911) is crucial. We give a self-contained probabilistic proof in Supplementary Materials S.1.

Lemma 2 (Schur). The Hadamard product of a positive definite matrix with a positive semidefinite matrix with positive diagonal entries is positive definite.

We can now obtain the following characterization of the MM iterates.

Proposition 1. Assume V_i has strictly positive diagonal entries. Then tr(Ω^−(t)V_i) > 0 for all t. Furthermore if $σ_{i}^{2 (0)} > 0$ and $Ω^{- (t)} (y - X β^{(t)}) \notin n u l l (V_{i})$ for all t, then $σ_{i}^{2 (t)} > 0$ for all t. When V_i is positive definite, $σ_{i}^{2 (t)} > 0$ holds if and only if $y \neq X β^{(t)}$ .

Proof. The first claim follows easily from Schur’s lemma. The second claim follows by induction. The third claim follows from the observation that null(V_i) = {0}.

In most applications, V_m = I. Proposition 1 guarantees that if $σ_{m}^{2 (0)} > 0$ and the residual vector y − Xβ^(t) is nonzero, then $σ_{m}^{2 (t)}$ remains positive and thus Ω^(t) remains positive definite throughout all iterations. This fact does not prevent any of the sequences $σ_{i}^{2 (t)}$ from converging to 0. In this sense, the MM algorithm acts like an interior point method, approaching the optimum from inside the feasible region.

Univariate response: two variance components

The major computational cost of Algorithm 1 is inversion of the covariance matrix Ω^(t) at each iteration. The special case of m = 2 variance components deserves attention as repeated matrix inversion can be avoided by invoking the simultaneous congruence decomposition for two symmetric matrices, one of which is positive definite (Rao, 1973; Horn and Johnson, 1985). This decomposition is also called the generalized eigenvalue decomposition (Golub and Van Loan, 1996; Boyd and Vandenberghe, 2004). If one assumes $Ω = σ_{1}^{2} V_{1} + σ_{2}^{2} V_{2}$ and

lets $(V_{1}, V_{2}) \mapsto (D, U)$ be the decomposition with U nonsingular, U ^T V₁U = D diagonal, and U ^T V₂U = I, then

Ω^{(t)} = U^{- T} (σ_{1}^{2 (t)} D + σ_{2}^{2 (t)} I_{n}) U^{- 1} Ω^{- (t)} = U {(σ_{1}^{2 (t)} D + σ_{2}^{2 (t)} I_{n})}^{- 1} U^{T} \det (Ω^{(t)}) = \det (σ_{1}^{2 (t)} D + σ_{2}^{2 (t)} I_{n}) \det (V_{2}) .

(10)

With the revised responses $\tilde{y} = U^{T} y$ and the revised predictor matrix $\tilde{X} = U^{T} X$ , the update (8) requires only vector operations and costs O(n) flops. Updating the fixed effects is a weighted least squares problem with the transformed data $(\tilde{y}, \tilde{X})$ and observation weights $w_{i}^{(t)} = {(σ_{1}^{2 (t)} d_{i} + σ_{2}^{2 (t)})}^{- 1}$ . Algorithm 2 summarizes the simplified MM algorithm for two variance components.

Numerical experiments

This section compares the numerical performance of MM, EM, Fisher scoring, and the lme4 package in R (Bates et al., 2015) on simulated data from a two-way ANOVA random effects model and a genetic model. For ease of comparison, all algorithm runs start from σ²⁽⁰⁾ = 1 and terminate when the relative change (L^(t+1) − L^(t))/(|L^(t)|+ 1) in the log-likelihood is less than 10⁻⁸.

Two-way ANOVA:

We simulated data from a two-way ANOVA random effects model

y_{i j k} = μ + α_{i} + β_{j} + {(α β)}_{i j} + ϵ_{i j k}, 1 \leq i \leq a, 1 \leq j \leq b, 1 \leq k \leq c,

where $α_{i} ~ N (0, σ_{1}^{2})$ , $β_{j} ~ N (0, σ_{2}^{2})$ , ${(α β)}_{i j} ~ N (0, σ_{3}^{2})$ , and $ϵ_{i j k} ~ N (0, σ_{e}^{2})$ are jointly independent. Here i indexes levels in factor 1, j indexes levels in factor 2, and k indexes observations in the (i, j)-combination. This corresponds to m = 4 variance components. In the simulation, we set $σ_{1}^{2} = σ_{2}^{2} = σ_{3}^{2}$ and varied the ratio $σ_{1}^{2} / σ_{e}^{2}$ ; the numbers of levels a and b in factor 1 and factor 2, respectively; and the number of observations c in each combination of factor levels. For each simulation scenario, we simulated 50 replicates. The sample size was n = abc for each replicate.

Tables 1 and 2 show the average number of iterations and the average runtimes when there are a = b = 5 levels of each factor. Based on these results and further results not shown for other combinations of a and b, we draw the following conclusions: Fisher scoring takes the fewest iterations; the MM algorithm always takes fewer iterations than the EM algorithm; the faster rate of convergence of Fisher scoring is outweighed by the extra cost of evaluating and inverting the information matrix. Table 1 in Supplementary Materials S.2 shows that all algorithms converged to same objective values.

Table 1:

Fisher scoring converges fastest and MM takes fewer iterations than EM. Shown below are average number of iterations until convergence for MM, EM, and FS for fitting a two-way ANOVA model with a = b = 5 levels of both factors. Standard errors are given in parentheses.

$σ_{1}^{2} / σ_{e}^{2}$	Method	c = # observations per combination

		5	10	20	50
0.00	MM	143.12(99.76)	118.26(62.91)	96.26(50.61)	81.10(33.42)
	EM	2297.72(797.95)	1711.70(485.92)	1170.06(365.48)	788.10(216.60)
	FS	25.64(11.20)	21.10(7.00)	16.46(4.37)	13.88(2.88)
0.05	MM	121.86(98.52)	69.38(50.23)	55.88(37.34)	29.50(18.80)
	EM	1464.26(954.27)	538.04(504.42)	254.90(253.86)	104.98(157.97)
	FS	16.78(9.13)	12.62(6.22)	9.68(3.22)	8.10(1.34)
0.10	MM	84.74(59.33)	62.98(50.48)	40.46(31.43)	25.86(18.79)
	EM	985.46(830.49)	360.32(462.62)	157.70(231.91)	68.26(107.85)
	FS	15.20(10.10)	10.58(5.92)	8.58(3.56)	7.50(1.72)
1.00	MM	31.04(33.27)	29.60(27.66)	25.32(25.39)	24.90(20.76)
	EM	130.18(299.03)	161.14(290.23)	64.20(135.38)	84.88(137.88)
	FS	6.62(4.72)	6.32(3.64)	5.12(1.87)	5.36(1.50)
10.00	MM	29.80(35.42)	34.16(38.25)	28.82(28.44)	20.90(14.28)
	EM	115.94(274.33)	177.30(301.71)	80.12(155.67)	75.02(127.38)
	FS	12.72(5.14)	12.86(4.94)	11.66(3.95)	11.76(3.66)
20.00	MM	30.10(32.92)	32.72(39.02)	23.70(21.20)	19.62(15.67)
	EM	148.04(318.40)	85.86(180.28)	61.74(140.84)	37.36(83.89)
	FS	18.76(7.51)	17.40(5.21)	17.22(5.67)	16.28(5.03)

Open in a new tab

Table 2:

MM shows shortest run times than EM, Fisher scoring (FS), and lme4. Shown below are average run times (milliseconds) for fitting a two-way ANOVA model with a = b = 5 levels of both factors. Standard errors are given in parentheses.

$σ_{1}^{2} / σ_{e}^{2}$	Method	c = # observations per combination

		5	10	20	50
0.00	MM	11.46(7.77)	10.06(5.29)	11.93(6.35)	10.44(3.99)
	EM	189.32(71.32)	148.20(48.13)	147.87(49.97)	96.28(24.97)
	FS	34.27(33.47)	24.89(8.55)	23.70(14.15)	20.46(4.54)
	lme4	25.84(12.10)	22.32(1.25)	27.34(4.06)	36.14(5.59)
0.05	MM	9.79(7.72)	6.19(4.22)	6.87(4.37)	4.45(2.20)
	EM	116.03(75.57)	47.72(45.35)	30.60(29.88)	14.23(19.68)
	FS	19.18(10.23)	15.37(7.48)	12.78(4.06)	12.39(2.35)
	lme4	22.76(1.96)	24.88(2.60)	28.72(3.10)	47.34(16.29)
0.10	MM	7.07(4.78)	6.29(4.94)	5.14(3.72)	3.95(2.23)
	EM	78.96(66.19)	35.48(45.81)	19.53(27.71)	9.67(13.56)
	FS	17.36(11.26)	14.44(9.00)	12.08(6.31)	11.47(2.40)
	lme4	22.66(1.83)	28.90(8.70)	30.16(4.43)	44.58(4.89)
1.00	MM	2.66(2.61)	3.22(2.91)	3.57(3.15)	3.85(2.50)
	EM	10.71(23.93)	15.88(27.52)	8.35(16.26)	11.34(16.65)
	FS	7.88(5.44)	9.10(4.95)	7.12(2.42)	8.46(2.27)
	lme4	23.12(1.75)	30.22(9.37)	29.96(4.47)	42.82(8.32)
10.00	MM	2.48(2.72)	3.24(3.19)	3.84(3.35)	3.35(1.71)
	EM	9.66(22.02)	15.98(26.57)	10.24(18.78)	10.27(15.40)
	FS	15.19(6.05)	16.39(6.11)	15.81(5.15)	18.14(5.46)
	lme4	35.02(3.83)	47.12(8.10)	63.24(15.33)	102.78(34.49)
20.00	MM	2.57(2.49)	3.13(3.53)	3.13(2.44)	3.07(1.81)
	EM	12.28(25.71)	8.44(16.89)	8.01(17.12)	5.47(9.76)
	FS	22.09(8.53)	22.03(6.14)	23.08(7.21)	23.99(7.38)
	lme4	37.34(12.91)	50.24(8.59)	63.62(17.39)	91.14(28.39)

Open in a new tab

Genetic model:

We simulated a quantitative trait y from a genetic model with two variance components and covariance matrix $Ω = σ_{a}^{2} \hat{Φ} + σ_{e}^{2} I$ , where $\hat{Φ}$ is a full-rank empirical kinship matrix estimated from the genome-wide measurements of 212 individuals using Option 29 of the Mendel software (Lange et al., 2013). In this example, MM had run times similar to Fisher scoring, and both were much faster than EM and lme4.

In summary, the MM algorithm appears competitive even in small-scale examples. Many applications involve a large number of variance components. In this setting, the EM algorithm suffers from slow convergence and Fisher scoring from an extremely high cost per iteration. Our genomic example in Section 7 reinforces this point.

4. Global convergence of the MM algorithm

The Karush-Kuhn-Tucker (KKT) necessary conditions for a local maximum $σ^{2} = (σ_{1}^{2}, \dots, σ_{m}^{2})$ of the log-likelihood (1) require each component of the score vector to satisfy

\frac{\partial}{\partial σ_{i}^{2}} L (σ^{2}) \in {\begin{matrix} {0} \\ (- \infty, 0] \end{matrix} \begin{matrix} σ_{i}^{2} > 0 \\ σ_{i}^{2} = 0. \end{matrix}

In this section we establish the global convergence of Algorithm 1 to a KKT point. To reduce the notational burden, we assume that X is null and omit estimation of fixed effects β. The analysis easily extends to the nontrivial X case. Our convergence analysis relies on characterizing the properties of the objective function L(σ²) and the MM algorithmic mapping $σ^{2} \mapsto M (σ^{2})$ defined by equation (8). Special attention must be paid to the boundary values $σ_{i}^{2} = 0$ . We prove convergences for two cases, which cover most applications. For example, the genetic model in Section 3 satisfies Assumption 1, while the two-way ANOVA model satisfies Assumption 2.

Assumption 1. All V_i are positive definite.

Assumption 2. V₁ is positive definite, each V_i is nontrivial, $H = span {V_{2}, \dots, V_{m}}$ has dimension q < n, and $y \notin H$ .

The key condition $y \notin span {V_{2}, \dots, V_{m}}$ in the second case is also necessary for the existence of an MLE or REML (Demidenko and Massam, 1999; Grzadziel and Michalski, 2014). In Supplementary Materials S.4, we derive a sequence of lemmas en route to the global convergence result declared in Theorem 1.

Theorem 1. Under either Assumption 1 or 2, the MM sequence ${σ^{2 (t)}}_{t \geq 0}$ has at least one limit point. Every limit point is a fixed point of M (σ²). If the set of fixed points is discrete, then the MM sequence converges to one of them. Finally, when the iterates converge, their limit is a KKT point.

5. MM versus EM

Examination of Tables 2 and 3 suggests that the MM algorithm usually converges faster than the EM algorithm. We now provide an explanation for this observation. Again for notational convenience, we consider the REML case where X is null. Since the EM principle is just a special instance of the MM principle, we can compare their convergence properties in a unified framework. Consider an MM map M (θ) for maximizing the objective function f (θ) via the surrogate function g(θ | θ^(t)). Close to the optimal point $θ^{\infty}$ ,

θ^{(t + 1)} - θ^{\infty} \approx d M (θ^{\infty}) (θ^{(t)} - θ^{\infty}),

where $d M (θ^{\infty})$ is the differential of the mapping M at the optimal point $θ^{\infty}$ of f (θ). Hence, the local convergence rate of the sequence θ^(t+1) = M (θ^(t)) coincides with the spectral radius of $d M (θ^{\infty})$ . Familiar calculations (Lange, 2010) demonstrate that

d M (θ^{\infty}) = I - {[d^{2} g (θ^{\infty} | θ^{\infty})]}^{- 1} d^{2} f (θ^{\infty}) .

In other words, the local convergence rate is determined by how well the surrogate surface $g (θ | θ^{\infty})$ approximates the objective surface f (θ) near the optimal point $θ^{\infty}$ . In the EM literature, $d M (θ^{\infty})$ is called the rate matrix (Meng and Rubin, 1991). Fast convergence occurs when the surrogate $g (θ | θ^{\infty})$ hugs the objective f (θ) tightly around $θ^{\infty}$ . Figure 1 shows a case where the MM surrogate locally dominates the EM surrogate. We demonstrate that this is no accident.

Table 3:

MM and Fisher scoring (FS) show superior performance than EM and lme4. Shown below are average performance for fitting a genetic model. Standard errors are given in parentheses.

$σ_{a}^{2} / σ_{e}^{2}$	Method	Iteration	Runtime (ms)	Objective
0.00	MM	198.02(102.23)	133.61(822.67)	−375.59(9.63)
	EM	1196.10(958.51)	29.71(12.34)	−375.60(9.64)
	FS	7.60(3.07)	19.34(33.77)	−375.59(9.63)
	lme4	–	401.02(142.04)	−375.59(9.64)
0.05	MM	185.86(99.41)	17.26(1.76)	−377.39(10.52)
	EM	1227.62(1030.07)	29.82(12.74)	−377.40(10.52)
	FS	7.84(2.74)	14.97(1.55)	−377.39(10.52)
	lme4	–	425.04(144.00)	−377.39(10.52)
0.10	MM	169.24(99.75)	16.97(1.59)	−378.40(11.44)
	EM	924.80(912.23)	26.06(11.26)	−378.41(11.45)
	FS	7.32(2.75)	15.06(1.38)	−378.40(11.44)
	lme4	–	435.14(128.87)	−378.40(11.44)
1.00	MM	58.96(23.69)	15.53(0.75)	−409.54(10.90)
	EM	105.10(79.65)	15.49(0.96)	−409.54(10.90)
	FS	5.80(1.05)	14.66(0.89)	−409.54(10.90)
	lme4	–	493.14(52.80)	−409.54(10.90)
10.00	MM	110.00(63.13)	16.22(1.12)	−532.48(8.77)
	EM	642.48(1470.38)	22.32(18.37)	−532.57(8.75)
	FS	14.98(5.21)	14.78(0.97)	−531.72(8.92)
	lme4	–	2897.12(15006.38)	−532.48(8.77)
20.00	MM	110.52(34.81)	16.07(0.91)	−590.87(7.15)
	EM	1014.22(1775.40)	27.03(22.33)	−590.89(7.15)
	FS	17.72(3.13)	14.79(0.93)	−588.46(7.27)
	lme4	–	5059.24(20692.67)	−590.79(7.15)

Open in a new tab

The Q-function in the EM algorithm

g_{EM} (σ^{2} | σ^{2 (t)}) = - \frac{1}{2} \sum_{i = 1}^{m} [rank (V_{i}) \cdot \ln σ_{i}^{2} + rank (V_{i}) \frac{σ_{i}^{2 (t)}}{σ_{i}^{2}} - \frac{σ_{i}^{4 (t)}}{σ_{i}^{2}} tr (Ω^{- (t)} V_{i})] - \frac{1}{2} \sum_{i = 1}^{m} \frac{σ_{i}^{4 (t)}}{σ_{i}^{2}} y^{T} Ω^{- (t)} V_{i} Ω^{- (t)} y

minorizes the log-likelihood up to an irrelevant constant. Supplementary Materials S.6 gives a detailed derivation for the more general multivariate response case. Both surrogates g_EM(σ² | σ^2(∞)) and g_MM(σ² | σ^2(∞)) are parameter separated. This implies that both second differentials $d^{2} g_{EM} (σ^{2 (\infty)} | σ^{2 (\infty)})$ and $d^{2} g_{MM} (σ^{2 (\infty)} | σ^{2 (\infty)})$ are diagonal. A small diagonal entry of either matrix indicates fast convergence of the corresponding variance component. Our next result shows that, under Assumption 1, on average the diagonal entries of $d^{2} g_{EM} (σ^{2 (\infty)} | σ^{2 (\infty)})$ dominate those of $d^{2} g_{MM} (σ^{2 (\infty)} | σ^{2 (\infty)})$ when m > 2. Thus, the EM algorithm tends to converge more slowly than the MM algorithm, and the difference is more pronounced as the number of variance components m grows. See Supplementary Materials S.4 for the proof.

Theorem 2. Let $σ^{2 (\infty)} ≻ 0_{m}$ be a common limit point of the EM and MM algorithms. Then both second differentials $d^{2} g_{M M} (σ^{2 (\infty)} | σ^{2 (\infty)})$ and $d^{2} g_{E M} (σ^{2 (\infty)} | σ^{2 (\infty)})$ are diagonal with

d^{2} g_{E M} {(σ^{2 (\infty)} | σ^{2 (\infty)})}_{i i} = - \frac{rank (V_{i})}{2 σ_{i}^{4 (\infty)}} d^{2} g_{M M} {(σ^{2 (\infty)} | σ^{2 (\infty)})}_{i i} = - \frac{y^{T} Ω^{- (\infty)} V_{i} Ω^{- (\infty)} y}{σ_{i}^{2 (\infty)}} = - \frac{tr (Ω^{- (\infty)} V_{i})}{σ_{i}^{2 (\infty)}} .

Furthermore, the average ratio

\frac{1}{m} \sum_{i = 1}^{m} \frac{d^{2} g_{M M} {(σ^{2 (\infty)} | σ^{2 (\infty)})}_{i i}}{d^{2} g_{E M} {(σ^{2 (\infty)} | σ^{2 (\infty)})}_{i i}} = \frac{2}{m n} \sum_{i = 1}^{m} tr (Ω^{- (\infty)} σ_{i}^{2 (\infty)} V_{i}) = \frac{2}{m} < 1

for m > 2 when all V_i have full rank n.

It is not clear whether a similar result holds under Assumption 2. Empirically we observed faster convergence of MM than EM, for example, in the two-way ANOVA example (Table 1). Also note that both the EM and MM algorithms must evaluate the traces tr(Ω^−(t)V_i) and quadratic forms (y − Xβ^(t))^T Ω^−(t)V_iΩ^−(t)(y − Xβ^(t)) at each iteration. Since these quantities are also the building blocks of the approximate rate matrices d²g(σ^2(t) | σ^2(t)), one can rationally choose either the EM or MM updates based on which has smaller diagonal entries measured by the $l_{1}$ , $l_{2}$ or $l_{\infty}$ norms. At negligible extra cost, this produces a hybrid algorithm that retains the ascent property and enjoys the better of the two convergence rates under either Assumption 1 or 2.

6. Extensions

Besides its competitive numerical performance, Algorithm 1 is attractive for its simplicity and ease of generalization. In this section, we outline MM algorithms for multivariate response models possibly with missing data, linear mixed models, MAP estimation, and penalized estimation.

6.1. Multivariate response model

Consider the multivariate response model with n × d response matrix Y, which has no missing entries, mean E Y = XB, and covariance

Ω = Cov (vec Y) = \sum_{i = 1}^{m} Γ_{i} \otimes V_{i} .

The p × d coefficient matrix B collects the fixed effects, the Γ_i are unknown d × d variance components, and the V_i are known n × n covariance matrices. If the vector vecY is normally distributed, then Y equals a sum of independent matrix normal distributions (Gupta and Nagar, 1999). We now make this assumption and pursue estimation of B and the Γ_i, which we collectively denote as Γ. Under the normality assumption, Roth’s Kronecker product identity vec(CDE) = (E^T ⊗ C)vec(D) yields the log-likelihood

L (B, Γ) = - \frac{1}{2} \ln \det Ω - \frac{1}{2} vec {(Y - X B)}^{T} Ω^{- 1} vec (Y - X B) = - \frac{1}{2} \ln \det Ω - \frac{1}{2} {[vec Y - (I_{d} \otimes X) vec B]}^{T} Ω^{- 1} [vec Y - (I_{d} \otimes X) vec B] .

(11)

Updating B given Γ^(t) is accomplished by solving the general least squares problem met earlier in the univariate case. Update of Γ_i given B^(t) is difficult due to the positive semidefiniteness constraint. Typical solutions involve reparameterization of the covariance matrix (Pinheiro and Bates, 1996). The MM algorithm derived in this section gracefully accommodates the covariance constraints.

Updating Γ given B^(t) requires generalizing the minorization (5). In view of Lemma 1 and the identities (A ⊗ B)(C ⊗ D) = (AC) ⊗(BD) and (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹, we have

Ω^{(t)} Ω^{- 1} Ω^{(t)} = m [\frac{1}{m} \sum_{i = 1}^{m} Γ_{i}^{(t)} \otimes V_{i}] {[\frac{1}{m} \sum_{i = 1}^{m} Γ_{i} \otimes V_{i}]}^{- 1} [\frac{1}{m} \sum_{i = 1}^{m} Γ_{i}^{(t)} \otimes V_{i}] ⪯ m \frac{1}{m} \sum_{i = 1}^{m} (Γ_{i}^{(t)} \otimes V_{i}) {(Γ_{i} \otimes V_{i})}^{- 1} (Γ_{i}^{(t)} \otimes V_{i}) = \sum_{i = 1}^{m} (Γ_{i}^{(t)} Γ_{i}^{- 1} Γ_{i}^{(t)}) \otimes V_{i},

or equivalently

Ω^{- 1} ⪯ Ω^{- (t)} [\sum_{i = 1}^{m} (Γ_{i}^{(t)} Γ_{i}^{- 1} Γ_{i}^{(t)}) \otimes V_{i}] Ω^{- (t)} .

(12)

This derivation relies on the invertibility of the matrices V_i. One can relax this assumption by substituting V_ϵ,i = V_i + ϵI_n for V_i and sending s to 0.

The majorization (12) and the minorization (6) jointly yield the surrogate

g (Γ | Γ^{(t)}) = - \frac{1}{2} \sum_{i = 1}^{m} {tr [Ω^{- (t)} (Γ_{i} \otimes V_{i})] + {(vec R^{(t)})}^{T} [(Γ_{i}^{(t)} Γ_{i}^{- 1} Γ_{i}^{(t)}) \otimes V_{i}] (vec R^{(t)})} + c^{(t)},

where R^(t) is the n × d matrix satisfying vec R^(t) = Ω^−(t)vec(Y − XB^(t)) and c^(t) is an irrelevant constant. Based on the Kronecker identities (vec A)^T vec B = tr(A^T B) and vec(CDE) = (E^T ⊗ C)vec(D), the surrogate can be rewritten as

g (Γ | Γ^{(t)}) = - \frac{1}{2} \sum_{i = 1}^{m} {tr [Ω^{- (t)} (Γ_{i} \otimes V_{i})] + tr (R^{(t) T} V_{i} R^{(t)} Γ_{i}^{(t)} Γ_{i}^{- 1} Γ_{i}^{(t)})} + c^{(t)} = - \frac{1}{2} \sum_{i = 1}^{m} {tr [Ω^{- (t)} (Γ_{i} \otimes V_{i})] + tr (Γ_{i}^{(t)} R^{(t) T} V_{i} R^{(t)} Γ_{i}^{(t)} Γ_{i}^{- 1})} + c^{(t)} .

The first trace is linear in Γ_i with the coefficient of entry (Γ_i)_jk equal to

tr (Ω_{j k}^{- (t)} V_{i}) = 1_{n}^{T} (V_{i} ⊙ Ω_{j k}^{- (t)}) 1_{n},

where $Ω_{j k}^{- (t)}$ is the (j, k)-th n × n block of Ω^−(t) and $⊙$ indicates elementwise product. The matrix M_i of these coefficients can be written as

M_{i} = {(I_{d} \otimes 1_{n})}^{T} [(1_{n} 1_{d}^{T} \otimes V_{i}) ⊙ Ω^{- (t)}] (I_{d} \otimes 1_{n}) .

The directional derivative of g(Γ | Γ^(t)) with respect to Γ_i in the direction Δ_i is

- \frac{1}{2} tr (M_{i} Δ_{i}) + \frac{1}{2} tr (Γ_{i}^{(t)} R^{(t) T} V_{i} R^{(t)} Γ_{i}^{- 1} Δ_{i} Γ_{i}^{- 1}) = - \frac{1}{2} tr (M_{i} Δ_{i}) + \frac{1}{2} tr (Γ_{i}^{- 1} Γ_{i}^{(t)} R^{(t) T} V_{i} R^{(t)} Γ_{i}^{(t)} Γ_{i}^{- 1} Δ_{i}) .

Because all directional derivatives of g(Γ | Γ^(t)) vanish at a stationarity point, the matrix equation

M_{i} = Γ_{i}^{- 1} Γ_{i}^{(t)} R^{(t) T} V_{i} R^{(t)} Γ_{i}^{(t)} Γ_{i}^{- 1}

(13)

holds. Fortunately, this equation admits an explicit solution. For positive scalers a and b, the solution to the equation $b = X^{- 1} a X^{- 1}$ is $X = \pm \sqrt{a / b}$ . The matrix analogue of this equation is the Riccati equation B = X⁻¹AX⁻¹, whose solution is summarized in the next lemma.

Lemma 3. Assume A and B are positive definite and L is the Cholesky factor of B. Then Y = L^−T (L^T AL)^1/2L⁻¹ is the unique positive definite solution to the matrix equation B = X⁻¹AX⁻¹.

The Cholesky factor L in Lemma 3 can be replaced by the symmetric square root of B. The solution, which is unique, remains the same. The Cholesky decomposition is preferred for its cheaper computational cost and better numerical stability.

Algorithm 3 summarizes the MM algorithm for fitting the multi-response model (3). Each iteration invokes m Cholesky decompositions and symmetric square roots of d × d positive definite matrices. Fortunately in most applications, d is a small number. The following result guarantees the non-singularity of the Cholesky factor throughout the iterations. See Supplementary Materials S.8 for the proof.

Proposition 2. Assume V_i has strictly positive diagonal entries. Then the symmetric matrix $M_{i} = {(I_{d} \otimes 1_{n})}^{T} [(1_{d} 1_{d}^{T} \otimes V_{i}) ⊙ Ω^{- (t)}] (I_{d} \otimes 1_{n})$ is positive definite for all t. Furthermore if $Γ_{i}^{(0)} ≻ 0$ and no column of R^(t) lies in the null space of V_i for all t, then $Γ_{i}^{(t)} ≻ 0$ for all t.

Multivariate response, two variance components

When there are m = 2 variance components Ω = Γ₁ ⊗V₁ + Γ₂ ⊗V₂, repeated inversion of the nd×nd covariance matrix Ω reduces to a single n×n simultaneous congruence decomposition and, per iteration, two d×d Cholesky decompositions and one d×d simultaneous congruence decomposition. The simultaneous congruence decomposition of the matrix pair (V₁, V₂) involves generalized eigenvalues d = (d₁, …, d_n) and a nonsingular matrix U such that U ^T V₁U = D = diag(d) and U ^T V₂U = I. If the simultaneous congruence decomposition of $(Γ_{1}^{(t)}, Γ_{2}^{(t)})$ is $(Λ^{(t)}, Φ^{(t)})$ with $Φ^{(t) T} Γ_{1}^{(t)} Φ^{(t)} = Λ^{(t)} = diag (λ^{(t)})$ and $Φ^{(t) T} Γ_{2}^{(t)} Φ^{(t)} = I_{d}$ , then

Ω^{(t)} = {(Φ^{- (t)} \otimes U^{- 1})}^{T} (Λ^{(t)} \otimes D + I_{D} \otimes I_{n}) (Φ^{- (t)} \otimes U^{- 1}) Ω^{- (t)} = (Φ^{(t)} \otimes U) {(Λ^{(t)} \otimes D + I_{D} \otimes I_{n})}^{- 1} {(Φ^{(t)} \otimes U)}^{T} \det Ω^{(t)} = \det (Λ^{(t)} \otimes D + I_{D} \otimes I_{n}) \det [{(Φ^{- (t)} \otimes U^{- 1})}^{T} (Φ^{- (t)} \otimes U^{- 1})] = \det (Λ^{(t)} \otimes D + I_{D} \otimes I_{n}) \det (Γ_{2}^{(t)} \otimes V_{2}) = \det (Λ^{(t)} \otimes D + I_{D} \otimes I_{n}) \det {(Γ_{2}^{(t)})}^{n} \det {(V_{2})}^{D} .

Updating the fixed effects reduces to a weighted least squares problem for the transformed responses $\tilde{Y} = U^{T} Y$ , transformed predictor matrix $\tilde{X} = U^{T} X$ , and observation weights ${(λ_{k}^{(t)} d_{i} + 1)}^{- 1}$ . Algorithm 4 summarizes the simplified MM algorithm. The lengthy derivations are relegated to Supplementary Materials S.5.

6.1.

6.2. Multivariate response model with missing responses

In many applications the multivariate response model (11) involves missing responses. For instance, in testing multiple longitudinal traits in genetics, some trait values y_ij may be missing due to dropped patient visits, while their genetic covariates are complete. Missing data destroys the symmetry of the log-likelihood (11) and complicates finding the MLE. Fortunately, MM algorithm 3 easily adapts to this challenge.

The familiar EM argument (McLachlan and Krishnan, 2008, Section 2.2) shows that

- \frac{n}{2} \ln \det Ω^{(t)} - \frac{1}{2} tr {Ω^{- (t)} [vec (Z^{(t)} - X B^{(t)}) vec {(Z^{(t)} - X B^{(t)})}^{T} + C^{(t)}]}

(14)

minorizes the observed log-likelihood at the current iterate $(B^{(t)}, Γ_{1}^{(t)}, \dots, Γ_{m}^{(t)})$ . Here Z^(t) is the completed response matrix given the observed responses $Y_{o b s}^{(t)}$ and the current parameter values. The complete data Y is assumed to be normally distributed N (vec(XB^(t)), Ω^(t)). The block matrix C^(t) is 0 except for a lower-right block consisting of a Schur complement.

To maximize the surrogate (14), we invoke the familiar minorization (6) and majorization (12) to separate the variance components Γ_i. At each iteration we impute missing entries by their conditional means, compute their conditional variances and covariances to supply the Schur complement, and then update the fixed effects and variance components by the explicit updates of Algorithm 3. The required conditional means and conditional variances can be conveniently obtained in the process of inverting Ω^(t) by the sweep operator of computational statistics (Lange, 2010, Section 7.3).

6.3. Linear mixed model (LMM)

The linear mixed model plays a central role in longitudinal data analysis. Consider the single-level LMM (Laird and Ware, 1982; Bates and Pinheiro, 1998) for n independent data clusters (y_i, X_i, Z_i) with

Y_{i} = X_{i} β + Z_{i} γ_{i} + ϵ_{i}, i = 1, \dots, n,

where β is a vector of fixed effects, the γi ∼ N (0, R_i(θ)) are independent random effects, and $ϵ_{i} \sim N (0, σ^{2} I_{n_{i}})$ captures random noise independent of γi. We assume the matrices Z_i have full column rank. The within-cluster covariance matrices R_i(θ) depend on a parameter vector θ; typical choices for R_i(θ) impose autocorrelation, compound symmetry, or unstructured correlation. It is clear that Y_i is normal with mean X_iβ, covariance $Ω_{i} = Z_{i} R_{i} (θ) Z_{i}^{T} + σ^{2} I_{n_{i}},$ and log-likelihood

L_{i} (β, θ, σ^{2}) = - \frac{1}{2} \ln \det Ω_{i} - \frac{1}{2} {(y_{i} - X_{i} β)}^{T} Ω_{i}^{- 1} (y_{i} - X_{i} β) .

The next three technical facts about pseudo-inverses are used in deriving the MM algorithm for LMM and their proofs are in Supplementary Materials S.9-S.11.

Lemma 4. If A has full column rank and B has full row rank, then (AB)⁺ = B⁺A⁺.

Lemma 5. If A and B are positive semidefinite matrices with the same range, then

\lim_{ϵ ↓ 0} (B + ϵ I) {(A + ϵ I)}^{- 1} (B + ϵ I) = B A^{+} B .

Lemma 6. If R and S are positive definite matrices, and the conformable matrix Z has full column rank, then the matrices ZRZ^T and ZSZ^T share a common range.

The convexity of the map $(X, Y) \mapsto X^{T} Y^{- 1} X$ and Lemmas 4–6 now yield via the obvious limiting argument the majorization

Ω^{(t)} Ω^{- 1} Ω^{(t)} = (Z_{i} R_{i} (θ^{(t)}) Z_{i}^{T} + σ^{2 (t)} I_{n_{i}}) {(Z_{i} R_{i} (θ) Z_{i}^{T} + σ^{2} I_{n_{i}})}^{- 1} (Z_{i} R_{i} (θ^{(t)}) Z_{i}^{T} + σ^{2 (t)} I_{n_{i}}) ⪯ (Z_{i} R_{i} (θ^{(t)}) Z_{i}^{T}) {(Z_{i} R_{i} (θ) Z_{i}^{T})}^{+} (Z_{i} R_{i} (θ^{(t)}) Z_{i}^{T}) + \frac{σ^{4 (t)}}{σ^{2}} I_{n_{i}} = [Z_{i} R_{i} (θ^{(t)}) Z_{i}^{T} Z_{i}^{T +}] R_{i}^{- 1} (θ) [Z_{i}^{+} Z_{i} R_{i} (θ^{(t)}) Z_{i}^{T}] + \frac{σ^{4 (t)}}{σ^{2}} I_{n_{i}} .

In combination with the minorization (6), this gives the surrogate

g_{i} (θ, σ^{2} | θ^{(t)}, σ^{2 (t)}) = - \frac{1}{2} tr (Z_{i}^{T} Ω_{i}^{- (t)} Z_{i} R_{i} (θ)) - \frac{1}{2} r_{i}^{(t) T} R_{i}^{- 1} (θ) r_{i}^{(t)} - \frac{1}{2} tr (Ω_{i}^{- (t)}) - \frac{σ^{4 (t)}}{2 σ^{2}} {(y_{i} - X_{i} β^{(t)})}^{T} Ω_{i}^{- 2 (t)} (y_{i} - X_{i} β^{(t)}) + c^{(t)},

for the log-likelihood L_i(θ, σ²), where

r_{i}^{(t)} = (Z_{i}^{+} Z_{i} R_{i} (θ^{(t)}) Z_{i}^{T}) Ω_{i}^{- (t)} (y_{i} - X_{i} β^{(t)}) = R_{i} (θ^{(t)}) Z_{i}^{T} Ω_{i}^{- (t)} (y_{i} - X_{i} β^{(t)}) .

The parameters θ and σ² are nicely separated. To maximize the overall minorization function $\sum_{i} g_{i} (θ, σ^{2} | θ^{(t)}, σ^{2 (t)})$ , we update σ² via

σ^{2 (t + 1)} = σ^{2 (t)} \sqrt{\frac{\sum_{i} {(y_{i} - X_{i} β^{(t)})}^{T} Ω_{i}^{- 2 (t)} (y_{i} - X_{i} β^{(t)})}{\sum_{i} tr (Ω_{i}^{- (t)})}} .

For structured models such as autocorrelation and compound symmetry, updating θ is a low-dimensional optimization problem that can be approached through the stationarity condition

{\sum_{i} vec (Z_{i}^{T} Ω_{i}^{(T)} Z_{i} - R_{i}^{- 1} (θ) r_{i}^{(t)} r_{i}^{(t) T} R_{i}^{- 1} (θ))}^{T} \frac{\partial}{\partial θ_{j}} vec R_{i} (θ) = 0

for each component θ_j. For the unstructured model with R_i(θ) = R for all i, the stationarity condition reads

\sum_{i} Z_{i}^{T} Ω_{i}^{(t)} Z_{i} = R^{- 1} (\sum_{i} r_{i}^{(t)} r_{i}^{(t) T}) R^{- 1}

and admits an explicit solution based on Lemma 3.

The same tactics apply to a multilevel LMM (Bates and Pinheiro, 1998) with responses

Y_{i} = X_{i} β + Z_{i 1} γ_{i 1} + \cdot \cdot \cdot Z_{i m} γ_{i m} + ϵ_{i} .

Minorization separates parameters for each level (variance component). Depending on the complexity of the covariance matrices, maximization of the surrogate can be accomplished analytically. For the sake of brevity, details are omitted.

6.4. MAP estimation

Suppose β follows an improper flat prior, the variance components $σ_{i}^{2}$ follow inverse gamma priors with shapes α_i > 0 and scales γ_i > 0, and these priors are independent. The log-posterior density then reduces to

- \frac{1}{2} \ln \det Ω - \frac{1}{2} {(y - X β)}^{T} Ω^{- 1} (y - X β) - \sum_{i = 1}^{m} (α_{i} + 1) \ln σ_{i}^{2} - \sum_{i = 1}^{m} \frac{γ_{i}}{σ_{i}^{2}} + c,

(15)

where c is an irrelevant constant. The MAP estimator of (β, σ²) is the mode of the posterior distribution. The update (4) of β given σ² remains the same. To update σ² given β, apply the same minorizations (5) and (6) to the first two terms of equation (15). This separates parameters and yields a convex surrogate for each $σ_{i}^{2}$ . The minimum of the $σ_{i}^{2}$ surrogate is defined by the stationarity condition

0 = - \frac{1}{2} tr (Ω^{- (t)} V_{i}) + \frac{σ_{i}^{4 (t)}}{2 σ_{i}^{4}} {(y - X β^{(t)})}^{T} Ω^{- (t)} V_{i} Ω^{- (t)} (y - X β^{(t)}) - \frac{α_{i} + 1}{σ_{i}^{2}} + \frac{γ_{i}}{σ_{i}^{4}} .

Multiplying this by $σ_{i}^{4}$ gives a quadratic equation in $σ_{i}^{2}$ . The positive root should be taken to meet the nonnegativity constraint on $σ_{i}^{2}$ .

For the multivariate response model (11), we assume the variance components Γ_i follow independent inverse Wishart distributions with degrees of freedom ν_i > d − 1 and scale matrix $Ψ_{i} ≻ 0$ . The log density of the posterior distribution is

- \frac{1}{2} \ln \det Ω - \frac{1}{2} vec {(Y - X B)}^{T} Ω^{- 1} vec (Y - X B) - \frac{1}{2} \sum_{i = 1}^{m} (v_{i} + d + 1) \ln \det Γ_{i} - \frac{1}{2} \sum_{i = 1}^{m} tr (Ψ_{i} Γ_{i}^{- 1}) + c,

(16)

where c is an irrelevant constant. Invoking the minorizations (6) and (12) for the first two terms and the supporting hyperplane minorization for − ln det Γ_i gives the surrogate function

g (Γ | Γ^{(t)}) = - \frac{1}{2} \sum_{i = 1}^{m} tr (Ω^{- (t)} (Γ_{i} \otimes V_{i})) - \frac{1}{2} \sum_{i = 1}^{m} tr (Γ_{i}^{(t)} R^{(t) T} V_{i} R^{(t)} Γ_{i}^{(t)} Γ_{i}^{- 1}) - \frac{1}{2} \sum_{i = 1}^{m} (v_{i} + d + 1) tr (Γ_{i}^{- (t)} Γ_{i}) - \frac{1}{2} \sum_{i = 1}^{m} tr (Ψ_{i} Γ_{i}^{- 1}) + c^{(t)} .

The optimal Γ_i satisfies the stationarity condition

{(I_{d} \otimes 1_{n})}^{T} [(1_{d} 1_{d}^{T} \otimes V_{i}) ⊙ Ω^{- (t)}] (I_{d} \otimes 1_{n}) + (V_{i} + d + 1) Γ_{i}^{- (t)} = Γ_{i}^{- 1} (Γ_{i}^{(t)} R^{(t) T} V_{i} R^{(t)} Γ_{i}^{(t)} + Ψ_{i}) Γ_{i}^{- 1},

which can be solved by Lemma 3.

6.5. Variable selection

In the statistical analysis of high-dimensional data, the imposition of sparsity leads to better interpretation and more stable parameter estimation. MM algorithms mesh well with penalized estimation. The simple variance components model (1) illustrates this fact. For the selection of fixed effects, minimizing the lasso-penalized log-likelihood $- L (β, σ^{2}) + λ \sum_{j} | β_{j} |$ is often recommended (Schelldorfer et al., 2011). The only change to the MM Algorithm 1 is that in estimating β, one solves a lasso penalized general least squares problem rather than an ordinary general least squares problem. The updates of the variance components $σ_{i}^{2}$ remain the same. For estimation of a large number of variance components, one can minimize the ridge-penalized log-likelihood

- L (β, σ^{2}) + λ \sum_{i = 1}^{m} σ_{i}^{2}

subject to the nonnegativity constraints $σ_{i}^{2} \geq 0$ . The variance update (8) becomes

σ_{i}^{2 (t + 1)} = σ_{i}^{2 (t)} \sqrt{\frac{{(y - X β^{(t)})}^{T} Ω^{- (t)} V_{i} Ω^{- (t)} (y - X β^{(t)})}{tr (Ω^{- (t)} V_{i}) + 2 λ}}, i = 1, \dots, m,

which clearly exhibits shrinkage but no thresholding. The lasso penalized log-likelihood

- L (β, σ^{2}) + λ \sum_{i = 1}^{m} σ_{i}

(17)

subject to nonnegativity constraint $σ_{i} \geq 0$ achieves both ends. The update of σ_i is chosen among the positive roots of a quartic equation and the boundary 0, whichever yields a lower objective value. Next section illustrates variance component selection using lasso penalty on a real genetic data set.

7. A numerical example

Quantitative trait loci (QTL) mapping aims to identify genes associated with a quantitative trait. Current sequencing technology measures millions of genetic markers in study subjects. Traditional single-marker tests suffer from low power due to the low frequency of many markers and the corrections needed for multiple hypothesis testing. Region-based association tests are a powerful alternative for analyzing next generation sequencing data with abundant rare variants.

Suppose y is a n × 1 vector of quantitative trait measurements on n people, X is an n × p predictor matrix (incorporating predictors such as sex, smoking history, and principal components for ethnic admixture), and G is an n × m genotype matrix of m genetic variants in a pre-defined region. The linear mixed model assumes

Y = X β + G γ + ϵ, γ \sim N (0, σ_{g}^{2} I), ϵ \sim N (0, σ_{e}^{2} I_{n}),

where β are fixed effects, γ are random genetic effects, and $σ_{g}^{2}$ and $σ_{e}^{2}$ are variance components for the genetic and environmental effects, respectively. Thus, the phenotype vector Y has covariance $σ_{g}^{2} G G^{T} + σ_{e}^{2} I_{n}$ , where GG^T is the kernel matrix capturing the overall effect of the m variants. Current approaches test the null hypothesis $σ_{g}^{2} = 0$ for each region separately and then adjust for multiple testing (Lee et al., 2014; Zhou et al., 2016). Instead of this marginal testing strategy, we consider the joint model

y = X β + s_{1}^{- 1 / 2} G_{1} γ_{1} + \cdot \cdot \cdot + s_{m}^{- 1 / 2} G_{m} γ_{m} + ϵ, γ_{i} \sim N (0, σ_{i}^{2} I), ϵ \sim N (0, σ_{e}^{2} I_{n})

and select the variance components $σ_{i}^{2}$ via the penalization (17). Here s_i is the number of variants in region i, and the weights $s_{i}^{- 1 / 2}$ put all variance components on the same scale.

We illustrate this approach using the COPDGene exome sequencing study (Regan et al., 2010). After quality control, 399 individuals and 646,125 genetic variants remain for analysis. Genetic variants are grouped into 16,619 genes to expose those genes associated with the complex trait height. We include age, sex, and the top 3 principal components in the mean effects. Because the number of genes vastly exceeds the sample size n = 399, we first pare the 16,619 genes down to 200 genes according to their marginal likelihood ratio test p-values and then carry out penalized estimation of the 200 variance components in the joint model (17). This is similar to the sure independence screening strategy for selecting mean effects (Fan and Lv, 2008). Genes are ranked according to the order they appear in the lasso solution path. Table 4 lists the top 10 genes together with their marginal LRT p-values. Figure 1 in Supplementary Materials displays the corresponding segment of the lasso solution path. It is noteworthy that the ranking of genes by penalized estimation differs from the ranking according to marginal p-values. The same phenomenon occurs in selection of highly correlated mean predictors. This penalization approach for selecting variance components warrants further theoretical study.

Table 4:

Top 10 genes selected by the lasso penalized variance component model (17) are tallied with their marginal p-values in an association study of 200 genes and the complex trait height.

Lasso Rank	Gene	Marginal P-value	# Variants
1	DOLPP1	2.35 × 10⁻⁶	2
2	C9orf21	3.70 × 10⁻⁵	4
3	PLS1	2.29 × 10⁻³	5
4	ATP5D	6.80 × 10⁻⁷	3
5	ADCY4	1.01 × 10⁻³	11
6	SLC22A25	3.95 × 10⁻³	14
7	RCSD1	9.04 × 10⁻⁴	4
8	PCDH7	1.20 × 10⁻⁴	7
9	AVIL	8.34 × 10⁻⁴	11
10	AHR	1.14 × 10⁻³	7

Open in a new tab

8. Discussion

The current paper leverages the MM principle to design powerful and versatile algorithms for variance components estimation. The MM algorithms derived are notable for their simplicity, generality, numerical efficiency, and theoretical guarantees. Both ordinary MLE and REML are apt to benefit. Other extensions are possible. In nonlinear models (Bates and Watts, 1988; Lindstrom and Bates, 1990), the mean response is a nonlinear function in the fixed effects β. One can easily modify the MM algorithms to update β by a few rounds of Gauss- Newton iteration. The variance components updates remain unchanged.

One can also extend our MM algorithms to elliptically symmetric densities

f (y) = \frac{e^{- \frac{1}{2} κ (δ^{2})}}{{(2 π)}^{\frac{n}{2}} {(\det Ω)}^{\frac{1}{2}}}

defined for $y \in ℝ^{n}$ , where δ² = (y − µ)^T Ω⁻¹(y − µ) denotes the Mahalanobis distance between y and µ. Here we assume that the function κ(s) is strictly increasing and strictly concave. Examples of elliptically symmetric densities include the multivariate t, slash, contaminated normal, power exponential, and stable families. Previous work (Huber and Ronchetti, 2009; Lange and Sinsheimer, 1993) has focused on using the MM principle to convert parameter estimation for these robust families into parameter estimation under the multivariate normal. One can chain the relevant majorization $κ (s) \leq κ (s^{(t)}) + κ^{'} (s^{(t)}) (s - s^{(t)})$ with our previous minorizations and simultaneously split variance components and pass to the more benign setting of the multivariate normal. These extensions are currently under investigation.

Supplementary Material

Supp1

NIHMS1520713-supplement-Supp1.pdf^{(486.3KB, pdf)}

Acknowledgments

The research is partially supported by NIH grants R01HG006139, R01GM53275, R01GM105785 and K01DK106116. The authors thank Michael Cho, Dandi Qiao, and Edwin Silverman for their assistance in processing and assessing COPDGene exome sequencing data. COPDGene is supported by NIH grants R01HL089897 and R01HL089856.

References

Bates D, Mächler M, Bolker B, and Walker S (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48. [Google Scholar]
Bates D and Pinheiro J (1998). Computational methods for multilevel models. Technical Report Technical Memorandum BL0112140–980226-01TM, Bell Labs, Lucent Technologies, Murray Hill, NJ. [Google Scholar]
Bates DM and Watts DG (1988). Nonlinear Regression Analysis and Its Applications. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics John Wiley & Sons, Inc, New York. [Google Scholar]
Bien J and Tibshirani RJ (2011). Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820. [DOI] [PMC free article] [PubMed] [Google Scholar]
Borg I and Groenen PJ (2005). Modern Multidimensional Scaling: Theory and Applications Springer Science & Business Media. [Google Scholar]
Boyd S and Vandenberghe L (2004). Convex Optimization Cambridge University Press, Cambridge. [Google Scholar]
Callanan TP and Harville DA (1991). Some new algorithms for computing restricted maximum likelihood estimates of variance components. J. Statist. Comput. Simulation, 38(1–4):239–259. [Google Scholar]
De Leeuw J (1994). Block-relaxation algorithms in statistics. In Information Systems and Data Analysis, pages 308–324. Springer. [Google Scholar]
Demidenko E and Massam H (1999). On the existence of the maximum likelihood estimate in variance components models. Sankhy¯a Ser. A, 61(3):431–443. [Google Scholar]
Dempster A, Laird N, and Rubin D (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Soceity Series B, 39(1–38). [Google Scholar]
Ding J, Tian G-L, and Yuen KC (2015). A new MM algorithm for constrained estimation in the proportional hazards model. Comput. Statist. Data Anal, 84:135–151. [Google Scholar]
Fan J and Lv J (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Statist. Soc. B, 70:849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
Golub GH and Van Loan CF (1996). Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences Johns Hopkins University Press, Baltimore, MD, third edition. [Google Scholar]
Grzadziel M and Michalski A (2014). A note on the existence of the maximum likelihood estimate in variance components models. Discuss. Math. Probab. Stat, 34(1–2):159–167. [Google Scholar]
Gupta A and Nagar D (1999). Matrix Variate Distributions. Monographs and Surveys in Pure and Applied Mathematics Taylor & Francis. [Google Scholar]
Hartley HO and Rao JNK (1967). Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika, 54:93–108. [PubMed] [Google Scholar]
Harville D and Callanan T (1990). Computational aspects of likelihood-based inference for variance components. In Gianola D and Hammond K, editors, Advances in Statistical Methods for Genetic Improvement of Livestock, volume 18 of Advanced Series in Agricultural Sciences, pages 136–176. Springer Berlin Heidelberg. [Google Scholar]
Harville DA (1977). Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Statist. Assoc, 72(358):320–340. With a comment by J. N. K. Rao and a reply by the author. [Google Scholar]
Heiser WJ (1995). Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent Advances in Descriptive Multivariate Analysis, pages 157–189. [Google Scholar]
Horn RA and Johnson CR (1985). Matrix Analysis Cambridge University Press, Cambridge. [Google Scholar]
Huber PJ and Ronchetti EM (2009). Robust Statistics. Wiley Series in Probability and Statistics John Wiley & Sons, Inc, Hoboken, NJ, second edition. [Google Scholar]
Hunter DR (2004). MM algorithms for generalized Bradley-Terry models. Ann. Statist, 32(1):384–406. [Google Scholar]
Hunter DR and Lange K (2002). Computing estimates in the proportional odds model. Ann. Inst. Statist. Math, 54(1):155–168. [Google Scholar]
Hunter DR and Lange K (2004). A tutorial on MM algorithms. Amer. Statist, 58(1):30–37. [Google Scholar]
Hunter DR and Li R (2005). Variable selection using MM algorithms. Ann. Statist, 33(4):1617–1642. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jeon M (2012). Estimation of Complex Generalized Linear Mixed Models for Measurement and Growth PhD thesis, University of California, Berkeley. [Google Scholar]
Kiers HA (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Computational Statistics & Data Analysis, 41(1):157–170. [Google Scholar]
Laird N, Lange N, and Stram D (1987). Maximum likelihood computations with repeated measures: application of the EM algorithm. J. Amer. Statist. Assoc, 82(397):97–105. [Google Scholar]
Laird NM and Ware JH (1982). Random-effects models for longitudinal data. Biometrics, 38(4):963–974. [PubMed] [Google Scholar]
Lange K (2010). Numerical Analysis for Statisticians. Statistics and Computing Springer, New York, second edition. [Google Scholar]
Lange K (2016). MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA. [Google Scholar]
Lange K, Hunter DR, and Yang I (2000). Optimization transfer using surrogate objective functions. J. Comput. Graph. Statist, 9(1):1–59. With discussion, and a rejoinder by Hunter and Lange. [Google Scholar]
Lange K, Papp J, Sinsheimer J, Sripracha R, Zhou H, and Sobel E (2013). Mendel: the Swiss army knife of genetic analysis programs. Bioinformatics, 29:1568–1570. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lange K and Sinsheimer JS (1993). Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics, 2:175–198. [Google Scholar]
Lange K and Zhou H (2014). MM algorithms for geometric and signomial programming. Mathematical Programming Series A, 143:339–356. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lee S, Abecasis GR, Boehnke M, and Lin X (2014). Rare-variant association analysis: Study designs and statistical tests. The American Journal of Human Genetics, 95(1):5–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lindstrom MJ and Bates DM (1988). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. J. Amer. Statist. Assoc, 83(404):1014–1022. [Google Scholar]
Lindstrom MJ and Bates DM (1990). Nonlinear mixed effects models for repeated measures data. Biometrics, 46(3):673–687. [PubMed] [Google Scholar]
McLachlan GJ and Krishnan T (2008). The EM Algorithm and Extensions. Wiley Series in Probability and Statistics Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition. [Google Scholar]
Meng X-L and Rubin DB (1991). Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86(416):899–909. [Google Scholar]
Pinheiro J and Bates D (1996). Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3):289–296. [Google Scholar]
Rao CR (1973). Linear Statistical Inference and its Applications, 2nd ed John Wiley & Sons. [Google Scholar]
Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, Curran-Everett D, Silverman EK, and Crapo JD (2010). Genetic epidemiology of COPD (COPDGene) study designs. COPD, 7:32–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schafer JL and Yucel RM (2002). Computational strategies for multivariate linear mixed-effects models with missing values. J. Comput. Graph. Statist, 11(2):437–457. [Google Scholar]
Schelldorfer J, Bühlmann P, and van de Geer S (2011). Estimation for high-dimensional linear mixed-effects models using A1-penalization. Scand. J. Stat, 38(2):197–214. [Google Scholar]
Schur J (1911). Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. J. Reine Angew. Math, 140:1–28. [Google Scholar]
Wu TT and Lange K (2010). The MM alternative to EM. Statistical Science, 25:492–505. [Google Scholar]
Yen T-J (2011). A majorization-minimization approach to variable selection using spike and slab priors. Ann. Statist, 39(3):1748–1775. [Google Scholar]
Yu Y (2010). Monotonic convergence of a general algorithm for computing optimal designs. Ann. Statist, 38(3):1593–1606. [Google Scholar]
Zhou H and Lange K (2010). MM algorithms for some discrete multivariate distributions. Journal of Computational and Graphical Statistics, 19:645–665. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhou JJ, Hu T, Qiao D, Cho MH, and Zhou H (2016). Boosting gene mapping power and efficiency with efficient exact variance component tests of single nucleotide polymorphism sets. Genetics, 204(3):921–931. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supp1

NIHMS1520713-supplement-Supp1.pdf^{(486.3KB, pdf)}

[R1] Bates D, Mächler M, Bolker B, and Walker S (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48. [Google Scholar]

[R2] Bates D and Pinheiro J (1998). Computational methods for multilevel models. Technical Report Technical Memorandum BL0112140–980226-01TM, Bell Labs, Lucent Technologies, Murray Hill, NJ. [Google Scholar]

[R3] Bates DM and Watts DG (1988). Nonlinear Regression Analysis and Its Applications. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics John Wiley & Sons, Inc, New York. [Google Scholar]

[R4] Bien J and Tibshirani RJ (2011). Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Borg I and Groenen PJ (2005). Modern Multidimensional Scaling: Theory and Applications Springer Science & Business Media. [Google Scholar]

[R6] Boyd S and Vandenberghe L (2004). Convex Optimization Cambridge University Press, Cambridge. [Google Scholar]

[R7] Callanan TP and Harville DA (1991). Some new algorithms for computing restricted maximum likelihood estimates of variance components. J. Statist. Comput. Simulation, 38(1–4):239–259. [Google Scholar]

[R8] De Leeuw J (1994). Block-relaxation algorithms in statistics. In Information Systems and Data Analysis, pages 308–324. Springer. [Google Scholar]

[R9] Demidenko E and Massam H (1999). On the existence of the maximum likelihood estimate in variance components models. Sankhy¯a Ser. A, 61(3):431–443. [Google Scholar]

[R10] Dempster A, Laird N, and Rubin D (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Soceity Series B, 39(1–38). [Google Scholar]

[R11] Ding J, Tian G-L, and Yuen KC (2015). A new MM algorithm for constrained estimation in the proportional hazards model. Comput. Statist. Data Anal, 84:135–151. [Google Scholar]

[R12] Fan J and Lv J (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Statist. Soc. B, 70:849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Golub GH and Van Loan CF (1996). Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences Johns Hopkins University Press, Baltimore, MD, third edition. [Google Scholar]

[R14] Grzadziel M and Michalski A (2014). A note on the existence of the maximum likelihood estimate in variance components models. Discuss. Math. Probab. Stat, 34(1–2):159–167. [Google Scholar]

[R15] Gupta A and Nagar D (1999). Matrix Variate Distributions. Monographs and Surveys in Pure and Applied Mathematics Taylor & Francis. [Google Scholar]

[R16] Hartley HO and Rao JNK (1967). Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika, 54:93–108. [PubMed] [Google Scholar]

[R17] Harville D and Callanan T (1990). Computational aspects of likelihood-based inference for variance components. In Gianola D and Hammond K, editors, Advances in Statistical Methods for Genetic Improvement of Livestock, volume 18 of Advanced Series in Agricultural Sciences, pages 136–176. Springer Berlin Heidelberg. [Google Scholar]

[R18] Harville DA (1977). Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Statist. Assoc, 72(358):320–340. With a comment by J. N. K. Rao and a reply by the author. [Google Scholar]

[R19] Heiser WJ (1995). Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent Advances in Descriptive Multivariate Analysis, pages 157–189. [Google Scholar]

[R20] Horn RA and Johnson CR (1985). Matrix Analysis Cambridge University Press, Cambridge. [Google Scholar]

[R21] Huber PJ and Ronchetti EM (2009). Robust Statistics. Wiley Series in Probability and Statistics John Wiley & Sons, Inc, Hoboken, NJ, second edition. [Google Scholar]

[R22] Hunter DR (2004). MM algorithms for generalized Bradley-Terry models. Ann. Statist, 32(1):384–406. [Google Scholar]

[R23] Hunter DR and Lange K (2002). Computing estimates in the proportional odds model. Ann. Inst. Statist. Math, 54(1):155–168. [Google Scholar]

[R24] Hunter DR and Lange K (2004). A tutorial on MM algorithms. Amer. Statist, 58(1):30–37. [Google Scholar]

[R25] Hunter DR and Li R (2005). Variable selection using MM algorithms. Ann. Statist, 33(4):1617–1642. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Jeon M (2012). Estimation of Complex Generalized Linear Mixed Models for Measurement and Growth PhD thesis, University of California, Berkeley. [Google Scholar]

[R27] Kiers HA (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Computational Statistics & Data Analysis, 41(1):157–170. [Google Scholar]

[R28] Laird N, Lange N, and Stram D (1987). Maximum likelihood computations with repeated measures: application of the EM algorithm. J. Amer. Statist. Assoc, 82(397):97–105. [Google Scholar]

[R29] Laird NM and Ware JH (1982). Random-effects models for longitudinal data. Biometrics, 38(4):963–974. [PubMed] [Google Scholar]

[R30] Lange K (2010). Numerical Analysis for Statisticians. Statistics and Computing Springer, New York, second edition. [Google Scholar]

[R31] Lange K (2016). MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA. [Google Scholar]

[R32] Lange K, Hunter DR, and Yang I (2000). Optimization transfer using surrogate objective functions. J. Comput. Graph. Statist, 9(1):1–59. With discussion, and a rejoinder by Hunter and Lange. [Google Scholar]

[R33] Lange K, Papp J, Sinsheimer J, Sripracha R, Zhou H, and Sobel E (2013). Mendel: the Swiss army knife of genetic analysis programs. Bioinformatics, 29:1568–1570. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] Lange K and Sinsheimer JS (1993). Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics, 2:175–198. [Google Scholar]

[R35] Lange K and Zhou H (2014). MM algorithms for geometric and signomial programming. Mathematical Programming Series A, 143:339–356. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] Lee S, Abecasis GR, Boehnke M, and Lin X (2014). Rare-variant association analysis: Study designs and statistical tests. The American Journal of Human Genetics, 95(1):5–23. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] Lindstrom MJ and Bates DM (1988). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. J. Amer. Statist. Assoc, 83(404):1014–1022. [Google Scholar]

[R38] Lindstrom MJ and Bates DM (1990). Nonlinear mixed effects models for repeated measures data. Biometrics, 46(3):673–687. [PubMed] [Google Scholar]

[R39] McLachlan GJ and Krishnan T (2008). The EM Algorithm and Extensions. Wiley Series in Probability and Statistics Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition. [Google Scholar]

[R40] Meng X-L and Rubin DB (1991). Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86(416):899–909. [Google Scholar]

[R41] Pinheiro J and Bates D (1996). Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3):289–296. [Google Scholar]

[R42] Rao CR (1973). Linear Statistical Inference and its Applications, 2nd ed John Wiley & Sons. [Google Scholar]

[R43] Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, Curran-Everett D, Silverman EK, and Crapo JD (2010). Genetic epidemiology of COPD (COPDGene) study designs. COPD, 7:32–43. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R44] Schafer JL and Yucel RM (2002). Computational strategies for multivariate linear mixed-effects models with missing values. J. Comput. Graph. Statist, 11(2):437–457. [Google Scholar]

[R45] Schelldorfer J, Bühlmann P, and van de Geer S (2011). Estimation for high-dimensional linear mixed-effects models using A1-penalization. Scand. J. Stat, 38(2):197–214. [Google Scholar]

[R46] Schur J (1911). Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. J. Reine Angew. Math, 140:1–28. [Google Scholar]

[R47] Wu TT and Lange K (2010). The MM alternative to EM. Statistical Science, 25:492–505. [Google Scholar]

[R48] Yen T-J (2011). A majorization-minimization approach to variable selection using spike and slab priors. Ann. Statist, 39(3):1748–1775. [Google Scholar]

[R49] Yu Y (2010). Monotonic convergence of a general algorithm for computing optimal designs. Ann. Statist, 38(3):1593–1606. [Google Scholar]

[R50] Zhou H and Lange K (2010). MM algorithms for some discrete multivariate distributions. Journal of Computational and Graphical Statistics, 19:645–665. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R51] Zhou JJ, Hu T, Qiao D, Cho MH, and Zhou H (2016). Boosting gene mapping power and efficiency with efficient exact variance component tests of single nucleotide polymorphism sets. Genetics, 204(3):921–931. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

MM Algorithms For Variance Components Models

Hua Zhou

Liuyi Hu

Jin Zhou

Kenneth Lange

Abstract

1. Introduction

2. Preliminaries

Background on MM algorithms

Figure 1:

Convex matrix functions

3. Univariate response model

Univariate response: two variance components

Numerical experiments

Two-way ANOVA:

Table 1:

Table 2:

Genetic model:

4. Global convergence of the MM algorithm

5. MM versus EM

Table 3:

6. Extensions

6.1. Multivariate response model

Multivariate response, two variance components

6.2. Multivariate response model with missing responses

6.3. Linear mixed model (LMM)

6.4. MAP estimation

6.5. Variable selection

7. A numerical example

Table 4:

8. Discussion

Supplementary Material

Acknowledgments

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases