Author manuscript; available in PMC 2020 Mar 9.
Published in final edited form as: J Comput Graph Stat. 2019 Mar 9;28(2):350–361. doi: 10.1080/10618600.2018.1529601

MM Algorithms For Variance Components Models

Hua Zhou, Liuyi Hu, Jin Zhou, Kenneth Lange
PMCID: PMC6779174  NIHMSID: NIHMS1520713  PMID: 31592195

Abstract

Variance components estimation and mixed model analysis are central themes in statistics with applications in numerous scientific disciplines. Despite the best efforts of generations of statisticians and numerical analysts, maximum likelihood estimation and restricted maximum likelihood estimation of variance component models remain numerically challenging. Building on the minorization-maximization (MM) principle, this paper presents a novel iterative algorithm for variance components estimation. Our MM algorithm is trivial to implement and competitive on large data problems. The algorithm readily extends to more complicated problems such as linear mixed models, multivariate response models possibly with missing data, maximum a posteriori estimation, and penalized estimation. We establish the global convergence of the MM algorithm to a Karush-Kuhn-Tucker (KKT) point and demonstrate, both numerically and theoretically, that it converges faster than the classical EM algorithm when the number of variance components is greater than two and all covariance matrices are positive definite.

Keywords: global convergence, matrix convexity, linear mixed model (LMM), maximum a posteriori (MAP) estimation, minorization-maximization (MM), multivariate response, penalized estimation, variance components model

1. Introduction

Variance components and linear mixed models are among the most potent tools in a statistician's toolbox, finding numerous applications in agriculture, biology, economics, genetics, epidemiology, and medicine. Given an observed $n \times 1$ response vector $y$ and an $n \times p$ predictor matrix $X$, the simplest variance components model postulates that $Y \sim N(X\beta, \Omega)$, where $\Omega = \sum_{i=1}^m \sigma_i^2 V_i$ and $V_1, \ldots, V_m$ are $m$ fixed positive semidefinite matrices. The parameters of the model can be divided into mean effects $\beta = (\beta_1, \ldots, \beta_p)$ and variance components $\sigma^2 = (\sigma_1^2, \ldots, \sigma_m^2)$. Throughout we assume $\Omega$ is positive definite; the extension to singular $\Omega$ will not be pursued here. Estimation revolves around the log-likelihood function

$$L(\beta, \sigma^2) = -\frac{1}{2}\ln\det\Omega - \frac{1}{2}(y - X\beta)^T\Omega^{-1}(y - X\beta). \tag{1}$$

Among the commonly used methods for estimating variance components, maximum likelihood estimation (MLE) (Hartley and Rao, 1967) and restricted (or residual) MLE (REML) (Harville, 1977) are the most popular. REML first projects $y$ onto the null space of $X^T$ and then estimates the variance components from the projected responses. If the columns of a matrix $B$ span the null space of $X^T$, then REML estimates the $\sigma_i^2$ by maximizing the log-likelihood of the redefined response vector $B^TY$, which is normally distributed with mean $\mathbf{0}$ and covariance $B^T\Omega B = \sum_{i=1}^m \sigma_i^2 B^TV_iB$.

There exists a large literature on iterative algorithms for finding the MLE and REML estimates (Laird and Ware, 1982; Lindstrom and Bates, 1988, 1990; Harville and Callanan, 1990; Callanan and Harville, 1991; Bates and Pinheiro, 1998; Schafer and Yucel, 2002). Fitting variance components models remains challenging when the sample size $n$ or the number of variance components $m$ is large. Newton's method (Lindstrom and Bates, 1988) converges quickly but is numerically unstable owing to the non-concavity of the log-likelihood. Fisher's scoring algorithm replaces the observed information matrix in Newton's method by the expected information matrix and yields an ascent algorithm when safeguarded by step halving. However, the calculation and inversion of the expected information matrix costs $O(mn^3) + O(m^3)$ flops and quickly becomes impractical for large $n$ or $m$, unless the $V_i$ are low rank, block diagonal, or have other special structure. The expectation-maximization (EM) algorithm initiated by Dempster et al. (1977) is a third alternative (Laird and Ware, 1982; Laird et al., 1987; Lindstrom and Bates, 1988; Bates and Pinheiro, 1998). Compared to Newton's method, the EM algorithm is easy to implement and numerically stable, but painfully slow to converge. In practice, a strategy of priming Newton's method with a few EM steps leverages the stability of EM and the faster convergence of second-order methods.

In this paper we derive a novel minorization-maximization (MM) algorithm for finding the MLE and REML estimates of variance components. We prove global convergence of the MM algorithm to a Karush-Kuhn-Tucker (KKT) point and explain why MM generally converges faster than EM for models with more than two variance components. We also sketch extensions of the MM algorithm to the multivariate response model with possibly missing responses, the linear mixed model (LMM), maximum a posteriori (MAP) estimation, and penalized estimation. The numerical efficiency of the MM algorithm is illustrated through simulated data sets and a genomic example with 200 variance components.

2. Preliminaries

Background on MM algorithms

Throughout we reserve Greek letters for parameters and indicate the current iteration number by a superscript t. The MM principle for maximizing an objective function f (θ) involves minorizing the objective function f (θ) by a surrogate function g(θ | θ(t)) around the current iterate θ(t) of a search (Lange et al., 2000). Minorization is defined by the two conditions

$$f(\theta^{(t)}) = g(\theta^{(t)} \mid \theta^{(t)}), \qquad f(\theta) \ge g(\theta \mid \theta^{(t)}), \quad \theta \ne \theta^{(t)}. \tag{2}$$

In other words, the surface $\theta \mapsto g(\theta \mid \theta^{(t)})$ lies below the surface $\theta \mapsto f(\theta)$ and is tangent to it at the point $\theta = \theta^{(t)}$. Construction of the minorizing function $g(\theta \mid \theta^{(t)})$ constitutes the first M of the MM algorithm. The second M of the algorithm maximizes the surrogate $g(\theta \mid \theta^{(t)})$ rather than $f(\theta)$. The point $\theta^{(t+1)}$ maximizing $g(\theta \mid \theta^{(t)})$ satisfies the ascent property $f(\theta^{(t+1)}) \ge f(\theta^{(t)})$. This fact follows from the inequalities

$$f(\theta^{(t+1)}) \ge g(\theta^{(t+1)} \mid \theta^{(t)}) \ge g(\theta^{(t)} \mid \theta^{(t)}) = f(\theta^{(t)}), \tag{3}$$

reflecting the definition of $\theta^{(t+1)}$ and the tangency and domination conditions (2). The ascent property makes the MM algorithm remarkably stable. The validity of the ascent property depends only on increasing $g(\theta \mid \theta^{(t)})$, not on maximizing it. With obvious changes, the MM algorithm also applies to minimization rather than maximization: to minimize a function $f(\theta)$, we majorize it by a surrogate function $g(\theta \mid \theta^{(t)})$ and minimize $g(\theta \mid \theta^{(t)})$ to produce the next iterate $\theta^{(t+1)}$. The acronym should not be confused with the maximization-maximization algorithm in the variational Bayes context (Jeon, 2012).
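To fix ideas, here is a minimal generic sketch (ours, not from the paper) of the two M steps; `f` and `maximize_surrogate` are hypothetical problem-specific callables, and the loop simply monitors the ascent property (3).

```python
import numpy as np

def mm_optimize(f, maximize_surrogate, theta0, tol=1e-8, maxiter=1000):
    """Generic MM loop: maximize f by repeatedly maximizing a minorizing surrogate.

    maximize_surrogate(theta_t) must return argmax_theta g(theta | theta_t),
    where g minorizes f at theta_t in the sense of (2).
    """
    theta = theta0
    obj = f(theta)
    for it in range(maxiter):
        theta_new = maximize_surrogate(theta)       # second M: maximize the surrogate
        obj_new = f(theta_new)
        assert obj_new >= obj - 1e-12, "ascent property violated: check the minorization"
        if obj_new - obj < tol * (abs(obj) + 1):    # relative-change stopping rule
            return theta_new, obj_new, it + 1
        theta, obj = theta_new, obj_new
    return theta, obj, maxiter
```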

The MM principle (De Leeuw, 1994; Heiser, 1995; Kiers, 2002; Lange et al., 2000; Hunter and Lange, 2004) finds applications in multidimensional scaling (Borg and Groenen, 2005), ranking of sports teams (Hunter, 2004), variable selection (Hunter and Li, 2005; Yen, 2011), optimal experiment design (Yu, 2010), multivariate statistics (Zhou and Lange, 2010), geometric programming (Lange and Zhou, 2014), survival models (Hunter and Lange, 2002; Ding et al., 2015), sparse covariance estimation (Bien and Tibshirani, 2011), and many other areas (Lange, 2016). The celebrated EM principle (Dempster et al., 1977) is a special case of the MM principle. The Q function produced in the E step of an EM algorithm minorizes the log-likelihood up to an irrelevant constant. Thus, both EM and MM share the same advantages: simplicity, stability, graceful adaptation to constraints, and the tendency to avoid large matrix inversion. The more general MM perspective frees algorithm derivation from the missing data straitjacket and invites wider applications (Wu and Lange, 2010). Figure 1 shows the minorization functions of EM and MM for a variance components model with m = 2 variance components.

Figure 1:

Surrogate functions of EM and MM minorize the log-likelihood surface of a model with two variance components at the point $(\sigma_1^{2(t)}, \sigma_2^{2(t)}) = (18.5, 0.7)$. The MM surrogate hugs the log-likelihood surface more tightly than the EM surrogate.

Convex matrix functions

For symmetric matrices we write $A \preceq B$ when $B - A$ is positive semidefinite and $A \prec B$ when $B - A$ is positive definite. A matrix-valued function $f$ is said to be (matrix) convex if

$$f[\lambda A + (1 - \lambda)B] \preceq \lambda f(A) + (1 - \lambda)f(B)$$

for all $A$, $B$, and $\lambda \in [0, 1]$. Our derivation of the MM variance components algorithm hinges on the two facts recorded in the next lemma. See the standard text by Boyd and Vandenberghe (2004) for verification of both facts.

Lemma 1. (a) The matrix fractional function $f(A, B) = A^TB^{-1}A$ is jointly convex in the $m \times n$ matrix $A$ and the $m \times m$ positive definite matrix $B$. (b) The log determinant function $f(B) = \ln\det B$ is concave on the set of positive definite matrices.
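A quick numerical spot check of Lemma 1(a) (our sketch, not part of the paper) verifies midpoint matrix convexity on random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3

def frac(A, B):
    """Matrix fractional function f(A, B) = A^T B^{-1} A."""
    return A.T @ np.linalg.solve(B, A)

def random_pd(k):
    """Random k x k positive definite matrix."""
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

A1, A2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
B1, B2 = random_pd(m), random_pd(m)

# midpoint convexity: f((A1+A2)/2, (B1+B2)/2) <= (f(A1,B1) + f(A2,B2)) / 2 in the PSD order
gap = 0.5 * (frac(A1, B1) + frac(A2, B2)) - frac(0.5 * (A1 + A2), 0.5 * (B1 + B2))
print(np.linalg.eigvalsh(gap).min() >= -1e-10)   # True: the gap is positive semidefinite
```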

3. Univariate response model

Our strategy for maximizing the log-likelihood (1) is to alternate updating the mean parameters β and the variance components σ2. Updating β given σ2 is a standard general least squares problem with solution

$$\beta^{(t+1)} = \big(X^T\Omega^{-(t)}X\big)^{-1}X^T\Omega^{-(t)}y, \tag{4}$$

where $\Omega^{-(t)}$ denotes the inverse of $\Omega^{(t)} = \sum_{i=1}^m\sigma_i^{2(t)}V_i$. Updating $\sigma^2$ given $\beta^{(t)}$ depends on two minorizations. If we assume that all of the $V_i$ are positive definite, then the joint convexity of the map $(X, Y) \mapsto X^TY^{-1}X$ for positive definite $Y$ implies that

$$\begin{aligned}\Omega^{(t)}\Omega^{-1}\Omega^{(t)} &= \Big(\sum_{i=1}^m\sigma_i^{2(t)}V_i\Big)\Big(\sum_{i=1}^m\sigma_i^2V_i\Big)^{-1}\Big(\sum_{i=1}^m\sigma_i^{2(t)}V_i\Big)\\ &\preceq \sum_{i=1}^m\frac{\sigma_i^{2(t)}}{\sum_j\sigma_j^{2(t)}}\Big(\frac{\sum_j\sigma_j^{2(t)}}{\sigma_i^{2(t)}}\sigma_i^{2(t)}V_i\Big)\Big(\frac{\sum_j\sigma_j^{2(t)}}{\sigma_i^{2(t)}}\sigma_i^2V_i\Big)^{-1}\Big(\frac{\sum_j\sigma_j^{2(t)}}{\sigma_i^{2(t)}}\sigma_i^{2(t)}V_i\Big)\\ &= \sum_{i=1}^m\frac{\sigma_i^{4(t)}}{\sigma_i^2}V_i.\end{aligned}$$

When one or more of the $V_i$ are rank deficient, we replace each $V_i$ by $V_{i,\epsilon} = V_i + \epsilon I$ for small $\epsilon > 0$ and let $\Omega_\epsilon^{(t)} = \sum_i\sigma_i^{2(t)}V_{i,\epsilon}$. Sending $\epsilon$ to 0 in $\Omega_\epsilon^{(t)}\Omega_\epsilon^{-1}\Omega_\epsilon^{(t)} \preceq \sum_{i=1}^m(\sigma_i^{4(t)}/\sigma_i^2)V_{i,\epsilon}$ now gives the desired majorization $\Omega^{(t)}\Omega^{-1}\Omega^{(t)} \preceq \sum_{i=1}^m(\sigma_i^{4(t)}/\sigma_i^2)V_i$ in the general case. Negating both sides leads to the minorization

$$-(y - X\beta)^T\Omega^{-1}(y - X\beta) \ge -(y - X\beta)^T\Omega^{-(t)}\Big(\sum_{i=1}^m\frac{\sigma_i^{4(t)}}{\sigma_i^2}V_i\Big)\Omega^{-(t)}(y - X\beta) \tag{5}$$

that effectively separates the variance components $\sigma_1^2, \ldots, \sigma_m^2$ in the quadratic term of the log-likelihood (1).

The convexity of the function $A \mapsto -\ln\det A$ is equivalent to the supporting hyperplane minorization

$$-\ln\det\Omega \ge -\ln\det\Omega^{(t)} - \operatorname{tr}\big[\Omega^{-(t)}(\Omega - \Omega^{(t)})\big] \tag{6}$$

that separates $\sigma_1^2, \ldots, \sigma_m^2$ in the log determinant term of the log-likelihood (1). Combination of the minorizations (5) and (6) gives the overall minorization

$$\begin{aligned}g(\sigma^2 \mid \sigma^{2(t)}) &= -\frac{1}{2}\operatorname{tr}\big(\Omega^{-(t)}\Omega\big) - \frac{1}{2}(y - X\beta^{(t)})^T\Omega^{-(t)}\Big(\sum_{i=1}^m\frac{\sigma_i^{4(t)}}{\sigma_i^2}V_i\Big)\Omega^{-(t)}(y - X\beta^{(t)}) + c^{(t)}\\ &= \sum_{i=1}^m\Big[-\frac{\sigma_i^2}{2}\operatorname{tr}\big(\Omega^{-(t)}V_i\big) - \frac{\sigma_i^{4(t)}}{2\sigma_i^2}(y - X\beta^{(t)})^T\Omega^{-(t)}V_i\Omega^{-(t)}(y - X\beta^{(t)})\Big] + c^{(t)},\end{aligned} \tag{7}$$

where $c^{(t)}$ is an irrelevant constant. Maximization of $g(\sigma^2 \mid \sigma^{2(t)})$ with respect to $\sigma_i^2$ yields the simple multiplicative update

$$\sigma_i^{2(t+1)} = \sigma_i^{2(t)}\sqrt{\frac{(y - X\beta^{(t)})^T\Omega^{-(t)}V_i\Omega^{-(t)}(y - X\beta^{(t)})}{\operatorname{tr}\big(\Omega^{-(t)}V_i\big)}}, \qquad i = 1, \ldots, m. \tag{8}$$

As a sanity check on our derivation, consider the partial derivative

$$\frac{\partial}{\partial\sigma_i^2}L(\beta, \sigma^2) = -\frac{1}{2}\operatorname{tr}\big(\Omega^{-1}V_i\big) + \frac{1}{2}(y - X\beta)^T\Omega^{-1}V_i\Omega^{-1}(y - X\beta). \tag{9}$$

Given $\sigma_i^{2(t)} > 0$, it is clear from the update formula (8) that $\sigma_i^{2(t+1)} < \sigma_i^{2(t)}$ when $\frac{\partial}{\partial\sigma_i^2}L < 0$. Conversely, $\sigma_i^{2(t+1)} > \sigma_i^{2(t)}$ when $\frac{\partial}{\partial\sigma_i^2}L > 0$.


Algorithm 1 summarizes the MM algorithm for MLE of the univariate response model (1). The update formula (8) assumes that the numerator under the square root sign is nonnegative and the denominator is positive. The numerator requirement is a consequence of the positive semidefiniteness of $V_i$. The denominator requirement is not obvious but can be verified through the Hadamard (elementwise) product representation $\operatorname{tr}(\Omega^{-(t)}V_i) = \mathbf{1}^T(\Omega^{-(t)} \odot V_i)\mathbf{1}$. The following lemma of Schur (1911) is crucial. We give a self-contained probabilistic proof in Supplementary Materials S.1.

Lemma 2 (Schur). The Hadamard product of a positive definite matrix with a positive semidefinite matrix with positive diagonal entries is positive definite.

We can now obtain the following characterization of the MM iterates.

Proposition 1. Assume $V_i$ has strictly positive diagonal entries. Then $\operatorname{tr}(\Omega^{-(t)}V_i) > 0$ for all $t$. Furthermore, if $\sigma_i^{2(0)} > 0$ and $\Omega^{-(t)}(y - X\beta^{(t)}) \notin \operatorname{null}(V_i)$ for all $t$, then $\sigma_i^{2(t)} > 0$ for all $t$. When $V_i$ is positive definite, $\sigma_i^{2(t)} > 0$ holds if and only if $y \ne X\beta^{(t)}$.

Proof. The first claim follows easily from Schur’s lemma. The second claim follows by induction. The third claim follows from the observation that null(Vi) = {0}.

In most applications, $V_m = I$. Proposition 1 guarantees that if $\sigma_m^{2(0)} > 0$ and the residual vector $y - X\beta^{(t)}$ is nonzero, then $\sigma_m^{2(t)}$ remains positive and thus $\Omega^{(t)}$ remains positive definite throughout all iterations. This fact does not prevent any of the sequences $\sigma_i^{2(t)}$ from converging to 0. In this sense, the MM algorithm acts like an interior point method, approaching the optimum from inside the feasible region.
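Algorithm 1 amounts to only a few lines of linear algebra. The following NumPy sketch is our own illustration of the updates (4) and (8) under the assumption that $\Omega^{(t)}$ stays positive definite; the function and argument names are hypothetical and no attempt is made to exploit structure in the $V_i$.

```python
import numpy as np

def mm_variance_components(y, X, Vs, maxiter=1000, tol=1e-8):
    """MM sketch (Algorithm 1) for the model y ~ N(X beta, sum_i sigma2_i V_i)."""
    n, m = len(y), len(Vs)
    sigma2 = np.ones(m)                              # sigma^2(0) = 1
    beta = np.zeros(X.shape[1])
    loglik_old = -np.inf
    for it in range(maxiter):
        Omega = sum(s2 * V for s2, V in zip(sigma2, Vs))
        Oinv = np.linalg.inv(Omega)                  # Omega^{-(t)}
        # fixed effects by general least squares, eq. (4)
        XtOinv = X.T @ Oinv
        beta = np.linalg.solve(XtOinv @ X, XtOinv @ y)
        r = Oinv @ (y - X @ beta)                    # Omega^{-(t)} (y - X beta^{(t)})
        # multiplicative variance component update, eq. (8)
        for i, V in enumerate(Vs):
            sigma2[i] *= np.sqrt((r @ V @ r) / np.trace(Oinv @ V))
        # monitor the log-likelihood (1) and stop on small relative change
        sign, logdet = np.linalg.slogdet(Omega)
        loglik = -0.5 * logdet - 0.5 * (y - X @ beta) @ Oinv @ (y - X @ beta)
        if loglik - loglik_old < tol * (abs(loglik_old) + 1):
            break
        loglik_old = loglik
    return beta, sigma2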

Univariate response: two variance components

The major computational cost of Algorithm 1 is inversion of the covariance matrix $\Omega^{(t)}$ at each iteration. The special case of $m = 2$ variance components deserves attention, as repeated matrix inversion can be avoided by invoking the simultaneous congruence decomposition for two symmetric matrices, one of which is positive definite (Rao, 1973; Horn and Johnson, 1985). This decomposition is also called the generalized eigenvalue decomposition (Golub and Van Loan, 1996; Boyd and Vandenberghe, 2004). If one assumes $\Omega = \sigma_1^2V_1 + \sigma_2^2V_2$ and lets $(V_1, V_2) \mapsto (D, U)$ be the decomposition with $U$ nonsingular, $U^TV_1U = D$ diagonal, and $U^TV_2U = I$, then

$$\Omega^{(t)} = U^{-T}\big(\sigma_1^{2(t)}D + \sigma_2^{2(t)}I_n\big)U^{-1}, \qquad \Omega^{-(t)} = U\big(\sigma_1^{2(t)}D + \sigma_2^{2(t)}I_n\big)^{-1}U^T, \qquad \det\Omega^{(t)} = \det\big(\sigma_1^{2(t)}D + \sigma_2^{2(t)}I_n\big)\det(V_2). \tag{10}$$

With the revised responses $\tilde{y} = U^Ty$ and the revised predictor matrix $\tilde{X} = U^TX$, the update (8) requires only vector operations and costs $O(n)$ flops. Updating the fixed effects is a weighted least squares problem with the transformed data $(\tilde{y}, \tilde{X})$ and observation weights $w_i^{(t)} = (\sigma_1^{2(t)}d_i + \sigma_2^{2(t)})^{-1}$. Algorithm 2 summarizes the simplified MM algorithm for two variance components.

Numerical experiments

This section compares the numerical performance of MM, EM, Fisher scoring, and the lme4 package in R (Bates et al., 2015) on data simulated from a two-way ANOVA random effects model and a genetic model. For ease of comparison, all algorithm runs start from $\sigma^{2(0)} = 1$ and terminate when the relative change $(L^{(t+1)} - L^{(t)})/(|L^{(t)}| + 1)$ in the log-likelihood drops below $10^{-8}$.

Two-way ANOVA:

We simulated data from a two-way ANOVA random effects model

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \qquad 1 \le i \le a, \; 1 \le j \le b, \; 1 \le k \le c,$$

where $\alpha_i \sim N(0, \sigma_1^2)$, $\beta_j \sim N(0, \sigma_2^2)$, $(\alpha\beta)_{ij} \sim N(0, \sigma_3^2)$, and $\epsilon_{ijk} \sim N(0, \sigma_e^2)$ are jointly independent. Here $i$ indexes levels in factor 1, $j$ indexes levels in factor 2, and $k$ indexes observations in the $(i, j)$ combination. This corresponds to $m = 4$ variance components. In the simulation, we set $\sigma_1^2 = \sigma_2^2 = \sigma_3^2$ and varied the ratio $\sigma_1^2/\sigma_e^2$; the numbers of levels $a$ and $b$ in factor 1 and factor 2, respectively; and the number of observations $c$ in each combination of factor levels. For each simulation scenario, we simulated 50 replicates. The sample size was $n = abc$ for each replicate.

Tables 1 and 2 show the average number of iterations and the average runtimes when there are $a = b = 5$ levels of each factor. Based on these results and further results (not shown) for other combinations of $a$ and $b$, we draw the following conclusions: Fisher scoring takes the fewest iterations; the MM algorithm always takes fewer iterations than the EM algorithm; and the faster rate of convergence of Fisher scoring is outweighed by the extra cost of evaluating and inverting the expected information matrix. Table 1 in Supplementary Materials S.2 shows that all algorithms converged to the same objective values.

Table 1:

Fisher scoring (FS) converges in the fewest iterations, and MM takes fewer iterations than EM. Shown below are the average numbers of iterations until convergence for MM, EM, and FS when fitting a two-way ANOVA model with $a = b = 5$ levels of both factors. Standard errors are given in parentheses.

$\sigma_1^2/\sigma_e^2$  Method  c = 5  c = 10  c = 20  c = 50
0.00 MM 143.12(99.76) 118.26(62.91) 96.26(50.61) 81.10(33.42)
EM 2297.72(797.95) 1711.70(485.92) 1170.06(365.48) 788.10(216.60)
FS 25.64(11.20) 21.10(7.00) 16.46(4.37) 13.88(2.88)
0.05 MM 121.86(98.52) 69.38(50.23) 55.88(37.34) 29.50(18.80)
EM 1464.26(954.27) 538.04(504.42) 254.90(253.86) 104.98(157.97)
FS 16.78(9.13) 12.62(6.22) 9.68(3.22) 8.10(1.34)
0.10 MM 84.74(59.33) 62.98(50.48) 40.46(31.43) 25.86(18.79)
EM 985.46(830.49) 360.32(462.62) 157.70(231.91) 68.26(107.85)
FS 15.20(10.10) 10.58(5.92) 8.58(3.56) 7.50(1.72)
1.00 MM 31.04(33.27) 29.60(27.66) 25.32(25.39) 24.90(20.76)
EM 130.18(299.03) 161.14(290.23) 64.20(135.38) 84.88(137.88)
FS 6.62(4.72) 6.32(3.64) 5.12(1.87) 5.36(1.50)
10.00 MM 29.80(35.42) 34.16(38.25) 28.82(28.44) 20.90(14.28)
EM 115.94(274.33) 177.30(301.71) 80.12(155.67) 75.02(127.38)
FS 12.72(5.14) 12.86(4.94) 11.66(3.95) 11.76(3.66)
20.00 MM 30.10(32.92) 32.72(39.02) 23.70(21.20) 19.62(15.67)
EM 148.04(318.40) 85.86(180.28) 61.74(140.84) 37.36(83.89)
FS 18.76(7.51) 17.40(5.21) 17.22(5.67) 16.28(5.03)
Table 2:

MM shows shorter run times than EM, Fisher scoring (FS), and lme4. Shown below are average run times (milliseconds) for fitting a two-way ANOVA model with $a = b = 5$ levels of both factors. Standard errors are given in parentheses.

$\sigma_1^2/\sigma_e^2$  Method  c = 5  c = 10  c = 20  c = 50
 0.00 MM 11.46(7.77) 10.06(5.29) 11.93(6.35) 10.44(3.99)
EM 189.32(71.32) 148.20(48.13) 147.87(49.97) 96.28(24.97)
FS 34.27(33.47) 24.89(8.55) 23.70(14.15) 20.46(4.54)
lme4 25.84(12.10) 22.32(1.25) 27.34(4.06) 36.14(5.59)
 0.05 MM 9.79(7.72) 6.19(4.22) 6.87(4.37) 4.45(2.20)
EM 116.03(75.57) 47.72(45.35) 30.60(29.88) 14.23(19.68)
FS 19.18(10.23) 15.37(7.48) 12.78(4.06) 12.39(2.35)
lme4 22.76(1.96) 24.88(2.60) 28.72(3.10) 47.34(16.29)
 0.10 MM 7.07(4.78) 6.29(4.94) 5.14(3.72) 3.95(2.23)
EM 78.96(66.19) 35.48(45.81) 19.53(27.71) 9.67(13.56)
FS 17.36(11.26) 14.44(9.00) 12.08(6.31) 11.47(2.40)
lme4 22.66(1.83) 28.90(8.70) 30.16(4.43) 44.58(4.89)
 1.00 MM 2.66(2.61) 3.22(2.91) 3.57(3.15) 3.85(2.50)
EM 10.71(23.93) 15.88(27.52) 8.35(16.26) 11.34(16.65)
FS 7.88(5.44) 9.10(4.95) 7.12(2.42) 8.46(2.27)
lme4 23.12(1.75) 30.22(9.37) 29.96(4.47) 42.82(8.32)
 10.00 MM 2.48(2.72) 3.24(3.19) 3.84(3.35) 3.35(1.71)
EM 9.66(22.02) 15.98(26.57) 10.24(18.78) 10.27(15.40)
FS 15.19(6.05) 16.39(6.11) 15.81(5.15) 18.14(5.46)
lme4 35.02(3.83) 47.12(8.10) 63.24(15.33) 102.78(34.49)
 20.00 MM 2.57(2.49) 3.13(3.53) 3.13(2.44) 3.07(1.81)
EM 12.28(25.71) 8.44(16.89) 8.01(17.12) 5.47(9.76)
FS 22.09(8.53) 22.03(6.14) 23.08(7.21) 23.99(7.38)
lme4 37.34(12.91) 50.24(8.59) 63.62(17.39) 91.14(28.39)

Genetic model:

We simulated a quantitative trait $y$ from a genetic model with two variance components and covariance matrix $\Omega = \sigma_a^2\hat{\Phi} + \sigma_e^2I$, where $\hat{\Phi}$ is a full-rank empirical kinship matrix estimated from the genome-wide measurements of 212 individuals using Option 29 of the Mendel software (Lange et al., 2013). In this example, MM had run times similar to Fisher scoring, and both were much faster than EM and lme4.

In summary, the MM algorithm appears competitive even in small-scale examples. Many applications involve a large number of variance components. In this setting, the EM algorithm suffers from slow convergence and Fisher scoring from an extremely high cost per iteration. Our genomic example in Section 7 reinforces this point.

4. Global convergence of the MM algorithm

The Karush-Kuhn-Tucker (KKT) necessary conditions for a local maximum $\sigma^2 = (\sigma_1^2, \ldots, \sigma_m^2)$ of the log-likelihood (1) require each component of the score vector to satisfy

$$\frac{\partial}{\partial\sigma_i^2}L(\sigma^2) \in \begin{cases}\{0\} & \sigma_i^2 > 0\\ (-\infty, 0] & \sigma_i^2 = 0.\end{cases}$$

In this section we establish the global convergence of Algorithm 1 to a KKT point. To reduce the notational burden, we assume that $X$ is null and omit estimation of the fixed effects $\beta$; the analysis easily extends to the nontrivial $X$ case. Our convergence analysis relies on characterizing the properties of the objective function $L(\sigma^2)$ and the MM algorithmic mapping $\sigma^2 \mapsto M(\sigma^2)$ defined by equation (8). Special attention must be paid to the boundary values $\sigma_i^2 = 0$. We prove convergence in two cases, which cover most applications. For example, the genetic model in Section 3 satisfies Assumption 1, while the two-way ANOVA model satisfies Assumption 2.

Assumption 1. All Vi are positive definite.

Assumption 2. $V_1$ is positive definite, each $V_i$ is nontrivial, $\mathcal{H} = \operatorname{span}\{V_2, \ldots, V_m\}$ has dimension $q < n$, and $y \notin \mathcal{H}$.

The key condition $y \notin \operatorname{span}\{V_2, \ldots, V_m\}$ in the second case is also necessary for the existence of an MLE or REML estimate (Demidenko and Massam, 1999; Grzadziel and Michalski, 2014). In Supplementary Materials S.4, we derive a sequence of lemmas en route to the global convergence result declared in Theorem 1.

Theorem 1. Under either Assumption 1 or 2, the MM sequence $\{\sigma^{2(t)}\}_{t \ge 0}$ has at least one limit point. Every limit point is a fixed point of $M(\sigma^2)$. If the set of fixed points is discrete, then the MM sequence converges to one of them. Finally, when the iterates converge, their limit is a KKT point.

5. MM versus EM

Examination of Tables 2 and 3 suggests that the MM algorithm usually converges faster than the EM algorithm. We now provide an explanation for this observation. Again for notational convenience, we consider the REML case where $X$ is null. Since the EM principle is just a special instance of the MM principle, we can compare their convergence properties in a unified framework. Consider an MM map $M(\theta)$ for maximizing the objective function $f(\theta)$ via the surrogate function $g(\theta \mid \theta^{(t)})$. Close to the optimal point $\theta^\infty$,

$$\theta^{(t+1)} - \theta^\infty \approx dM(\theta^\infty)\big(\theta^{(t)} - \theta^\infty\big),$$

where $dM(\theta^\infty)$ is the differential of the mapping $M$ at the optimal point $\theta^\infty$ of $f(\theta)$. Hence, the local convergence rate of the sequence $\theta^{(t+1)} = M(\theta^{(t)})$ coincides with the spectral radius of $dM(\theta^\infty)$. Familiar calculations (Lange, 2010) demonstrate that

$$dM(\theta^\infty) = I - \big[d^2g(\theta^\infty \mid \theta^\infty)\big]^{-1}d^2f(\theta^\infty).$$

In other words, the local convergence rate is determined by how well the surrogate surface $g(\theta \mid \theta^\infty)$ approximates the objective surface $f(\theta)$ near the optimal point $\theta^\infty$. In the EM literature, $dM(\theta^\infty)$ is called the rate matrix (Meng and Rubin, 1991). Fast convergence occurs when the surrogate $g(\theta \mid \theta^\infty)$ hugs the objective $f(\theta)$ tightly around $\theta^\infty$. Figure 1 shows a case where the MM surrogate locally dominates the EM surrogate. We demonstrate that this is no accident.

Table 3:

MM and Fisher scoring (FS) outperform EM and lme4. Shown below is the average performance for fitting the genetic model. Standard errors are given in parentheses.

$\sigma_a^2/\sigma_e^2$  Method  Iterations  Runtime (ms)  Objective
0.00 MM 198.02(102.23) 133.61(822.67) −375.59(9.63)
EM 1196.10(958.51) 29.71(12.34) −375.60(9.64)
FS 7.60(3.07) 19.34(33.77) −375.59(9.63)
lme4 401.02(142.04) −375.59(9.64)
0.05 MM 185.86(99.41) 17.26(1.76) −377.39(10.52)
EM 1227.62(1030.07) 29.82(12.74) −377.40(10.52)
FS 7.84(2.74) 14.97(1.55) −377.39(10.52)
lme4 425.04(144.00) −377.39(10.52)
0.10 MM 169.24(99.75) 16.97(1.59) −378.40(11.44)
EM 924.80(912.23) 26.06(11.26) −378.41(11.45)
FS 7.32(2.75) 15.06(1.38) −378.40(11.44)
lme4 435.14(128.87) −378.40(11.44)
1.00 MM 58.96(23.69) 15.53(0.75) −409.54(10.90)
EM 105.10(79.65) 15.49(0.96) −409.54(10.90)
FS 5.80(1.05) 14.66(0.89) −409.54(10.90)
lme4 493.14(52.80) −409.54(10.90)
10.00 MM 110.00(63.13) 16.22(1.12) −532.48(8.77)
EM 642.48(1470.38) 22.32(18.37) −532.57(8.75)
FS 14.98(5.21) 14.78(0.97) −531.72(8.92)
lme4 2897.12(15006.38) −532.48(8.77)
20.00 MM 110.52(34.81) 16.07(0.91) −590.87(7.15)
EM 1014.22(1775.40) 27.03(22.33) −590.89(7.15)
FS 17.72(3.13) 14.79(0.93) −588.46(7.27)
lme4 5059.24(20692.67) −590.79(7.15)

The Q-function in the EM algorithm

$$g_{\mathrm{EM}}(\sigma^2 \mid \sigma^{2(t)}) = -\frac{1}{2}\sum_{i=1}^m\left[\operatorname{rank}(V_i)\ln\sigma_i^2 + \frac{\operatorname{rank}(V_i)\,\sigma_i^{2(t)}}{\sigma_i^2} - \frac{\sigma_i^{4(t)}}{\sigma_i^2}\operatorname{tr}\big(\Omega^{-(t)}V_i\big)\right] - \frac{1}{2}\sum_{i=1}^m\frac{\sigma_i^{4(t)}}{\sigma_i^2}\,y^T\Omega^{-(t)}V_i\Omega^{-(t)}y$$

minorizes the log-likelihood up to an irrelevant constant. Supplementary Materials S.6 gives a detailed derivation for the more general multivariate response case. Both surrogates $g_{\mathrm{EM}}(\sigma^2 \mid \sigma^{2(\infty)})$ and $g_{\mathrm{MM}}(\sigma^2 \mid \sigma^{2(\infty)})$ are parameter separated. This implies that both second differentials $d^2g_{\mathrm{EM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})$ and $d^2g_{\mathrm{MM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})$ are diagonal. A small diagonal entry (in magnitude) of either matrix indicates fast convergence of the corresponding variance component. Our next result shows that, under Assumption 1, the diagonal entries of $d^2g_{\mathrm{EM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})$ dominate those of $d^2g_{\mathrm{MM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})$ in magnitude on average when $m > 2$. Thus, the EM algorithm tends to converge more slowly than the MM algorithm, and the difference becomes more pronounced as the number of variance components $m$ grows. See Supplementary Materials S.4 for the proof.

Theorem 2. Let $\sigma^{2(\infty)} \succ \mathbf{0}_m$ be a common limit point of the EM and MM algorithms. Then both second differentials $d^2g_{\mathrm{MM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})$ and $d^2g_{\mathrm{EM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})$ are diagonal with

$$d^2g_{\mathrm{EM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})_{ii} = -\frac{\operatorname{rank}(V_i)}{2\sigma_i^{4(\infty)}}, \qquad d^2g_{\mathrm{MM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})_{ii} = -\frac{y^T\Omega^{-(\infty)}V_i\Omega^{-(\infty)}y}{\sigma_i^{2(\infty)}} = -\frac{\operatorname{tr}\big(\Omega^{-(\infty)}V_i\big)}{\sigma_i^{2(\infty)}}.$$

Furthermore, the average ratio

$$\frac{1}{m}\sum_{i=1}^m\frac{d^2g_{\mathrm{MM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})_{ii}}{d^2g_{\mathrm{EM}}(\sigma^{2(\infty)} \mid \sigma^{2(\infty)})_{ii}} = \frac{2}{mn}\sum_{i=1}^m\operatorname{tr}\big(\Omega^{-(\infty)}\sigma_i^{2(\infty)}V_i\big) = \frac{2}{m} < 1$$

for m > 2 when all Vi have full rank n.

It is not clear whether a similar result holds under Assumption 2. Empirically we observed faster convergence of MM than EM, for example, in the two-way ANOVA example (Table 1). Note also that both the EM and MM algorithms must evaluate the traces $\operatorname{tr}(\Omega^{-(t)}V_i)$ and the quadratic forms $(y - X\beta^{(t)})^T\Omega^{-(t)}V_i\Omega^{-(t)}(y - X\beta^{(t)})$ at each iteration. Since these quantities are also the building blocks of the approximate rate matrices $d^2g(\sigma^{2(t)} \mid \sigma^{2(t)})$, one can rationally choose either the EM or the MM update based on which surrogate has the smaller diagonal entries in magnitude, measured by the $\ell_1$, $\ell_2$, or $\ell_\infty$ norm. At negligible extra cost, this produces a hybrid algorithm that retains the ascent property and enjoys the better of the two convergence rates under either Assumption 1 or 2.
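A sketch of one step of this hybrid scheme (our reading of the heuristic; the limiting curvatures of Theorem 2 serve as proxies, and the EM step is the update obtained by maximizing the Q-function displayed above, with $X$ null as in this section):

```python
import numpy as np

def hybrid_update(sigma2, traces, quads, ranks):
    """One hybrid EM/MM variance component update (sketch).

    Inputs, all computed once per iteration from the shared building blocks:
      traces[i] = tr(Omega^{-(t)} V_i),
      quads[i]  = y^T Omega^{-(t)} V_i Omega^{-(t)} y  (X null),
      ranks[i]  = rank(V_i).
    The update whose curvature proxies have the smaller l-infinity norm is applied.
    """
    sigma2, traces, quads, ranks = (np.asarray(a, dtype=float)
                                    for a in (sigma2, traces, quads, ranks))
    curv_mm = quads / sigma2                 # |d^2 g_MM|_ii proxy at the current iterate
    curv_em = ranks / (2.0 * sigma2 ** 2)    # |d^2 g_EM|_ii proxy (Theorem 2 limit)
    if np.max(curv_mm) <= np.max(curv_em):   # flatter surrogate => faster local rate
        return sigma2 * np.sqrt(quads / traces)               # MM update (8)
    return sigma2 + sigma2 ** 2 / ranks * (quads - traces)    # EM update from the Q-function
```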

6. Extensions

Besides its competitive numerical performance, Algorithm 1 is attractive for its simplicity and ease of generalization. In this section, we outline MM algorithms for multivariate response models possibly with missing data, linear mixed models, MAP estimation, and penalized estimation.

6.1. Multivariate response model

Consider the multivariate response model with an $n \times d$ response matrix $Y$, which has no missing entries, mean $\mathbf{E}\,Y = XB$, and covariance

$$\Omega = \operatorname{Cov}(\operatorname{vec}Y) = \sum_{i=1}^m\Gamma_i \otimes V_i.$$

The $p \times d$ coefficient matrix $B$ collects the fixed effects, the $\Gamma_i$ are unknown $d \times d$ variance components, and the $V_i$ are known $n \times n$ covariance matrices. If the vector $\operatorname{vec}Y$ is normally distributed, then $Y$ equals a sum of independent matrix normal variates (Gupta and Nagar, 1999). We now make this assumption and pursue estimation of $B$ and the $\Gamma_i$, which we collectively denote by $\Gamma$. Under the normality assumption, Roth's Kronecker product identity $\operatorname{vec}(CDE) = (E^T \otimes C)\operatorname{vec}(D)$ yields the log-likelihood

$$\begin{aligned}L(B, \Gamma) &= -\frac{1}{2}\ln\det\Omega - \frac{1}{2}\operatorname{vec}(Y - XB)^T\Omega^{-1}\operatorname{vec}(Y - XB)\\ &= -\frac{1}{2}\ln\det\Omega - \frac{1}{2}\big[\operatorname{vec}Y - (I_d \otimes X)\operatorname{vec}B\big]^T\Omega^{-1}\big[\operatorname{vec}Y - (I_d \otimes X)\operatorname{vec}B\big].\end{aligned} \tag{11}$$

Updating $B$ given $\Gamma^{(t)}$ is accomplished by solving the general least squares problem met earlier in the univariate case. Updating the $\Gamma_i$ given $B^{(t)}$ is difficult due to the positive semidefiniteness constraints. Typical solutions involve reparameterization of the covariance matrix (Pinheiro and Bates, 1996). The MM algorithm derived in this section gracefully accommodates the covariance constraints.

Updating $\Gamma$ given $B^{(t)}$ requires generalizing the minorization (5). In view of Lemma 1 and the identities $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ and $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$, we have

$$\begin{aligned}\Omega^{(t)}\Omega^{-1}\Omega^{(t)} &= m\Big[\frac{1}{m}\sum_{i=1}^m\Gamma_i^{(t)} \otimes V_i\Big]\Big[\frac{1}{m}\sum_{i=1}^m\Gamma_i \otimes V_i\Big]^{-1}\Big[\frac{1}{m}\sum_{i=1}^m\Gamma_i^{(t)} \otimes V_i\Big]\\ &\preceq m\cdot\frac{1}{m}\sum_{i=1}^m\big(\Gamma_i^{(t)} \otimes V_i\big)\big(\Gamma_i \otimes V_i\big)^{-1}\big(\Gamma_i^{(t)} \otimes V_i\big) = \sum_{i=1}^m\big(\Gamma_i^{(t)}\Gamma_i^{-1}\Gamma_i^{(t)}\big) \otimes V_i,\end{aligned}$$

or equivalently

$$\Omega^{-1} \preceq \Omega^{-(t)}\Big[\sum_{i=1}^m\big(\Gamma_i^{(t)}\Gamma_i^{-1}\Gamma_i^{(t)}\big) \otimes V_i\Big]\Omega^{-(t)}. \tag{12}$$

This derivation relies on the invertibility of the matrices $V_i$. One can relax this assumption by substituting $V_{i,\epsilon} = V_i + \epsilon I_n$ for $V_i$ and sending $\epsilon$ to 0.

The majorization (12) and the minorization (6) jointly yield the surrogate

$$g(\Gamma \mid \Gamma^{(t)}) = -\frac{1}{2}\sum_{i=1}^m\Big\{\operatorname{tr}\big[\Omega^{-(t)}(\Gamma_i \otimes V_i)\big] + (\operatorname{vec}R^{(t)})^T\big[\big(\Gamma_i^{(t)}\Gamma_i^{-1}\Gamma_i^{(t)}\big) \otimes V_i\big](\operatorname{vec}R^{(t)})\Big\} + c^{(t)},$$

where $R^{(t)}$ is the $n \times d$ matrix satisfying $\operatorname{vec}R^{(t)} = \Omega^{-(t)}\operatorname{vec}(Y - XB^{(t)})$ and $c^{(t)}$ is an irrelevant constant. Based on the Kronecker identities $(\operatorname{vec}A)^T\operatorname{vec}B = \operatorname{tr}(A^TB)$ and $\operatorname{vec}(CDE) = (E^T \otimes C)\operatorname{vec}(D)$, the surrogate can be rewritten as

$$\begin{aligned}g(\Gamma \mid \Gamma^{(t)}) &= -\frac{1}{2}\sum_{i=1}^m\Big\{\operatorname{tr}\big[\Omega^{-(t)}(\Gamma_i \otimes V_i)\big] + \operatorname{tr}\big(R^{(t)T}V_iR^{(t)}\Gamma_i^{(t)}\Gamma_i^{-1}\Gamma_i^{(t)}\big)\Big\} + c^{(t)}\\ &= -\frac{1}{2}\sum_{i=1}^m\Big\{\operatorname{tr}\big[\Omega^{-(t)}(\Gamma_i \otimes V_i)\big] + \operatorname{tr}\big(\Gamma_i^{(t)}R^{(t)T}V_iR^{(t)}\Gamma_i^{(t)}\Gamma_i^{-1}\big)\Big\} + c^{(t)}.\end{aligned}$$

The first trace is linear in $\Gamma_i$, with the coefficient of the entry $(\Gamma_i)_{jk}$ equal to

$$\operatorname{tr}\big(\Omega^{-(t)}_{jk}V_i\big) = \mathbf{1}_n^T\big(V_i \odot \Omega^{-(t)}_{jk}\big)\mathbf{1}_n,$$

where $\Omega^{-(t)}_{jk}$ is the $(j, k)$-th $n \times n$ block of $\Omega^{-(t)}$ and $\odot$ denotes the elementwise (Hadamard) product. The matrix $M_i$ of these coefficients can be written as

$$M_i = (I_d \otimes \mathbf{1}_n)^T\big[(\mathbf{1}_d\mathbf{1}_d^T \otimes V_i) \odot \Omega^{-(t)}\big](I_d \otimes \mathbf{1}_n).$$

The directional derivative of $g(\Gamma \mid \Gamma^{(t)})$ with respect to $\Gamma_i$ in the direction $\Delta_i$ is

$$-\frac{1}{2}\operatorname{tr}(M_i\Delta_i) + \frac{1}{2}\operatorname{tr}\big(\Gamma_i^{(t)}R^{(t)T}V_iR^{(t)}\Gamma_i^{(t)}\Gamma_i^{-1}\Delta_i\Gamma_i^{-1}\big) = -\frac{1}{2}\operatorname{tr}(M_i\Delta_i) + \frac{1}{2}\operatorname{tr}\big(\Gamma_i^{-1}\Gamma_i^{(t)}R^{(t)T}V_iR^{(t)}\Gamma_i^{(t)}\Gamma_i^{-1}\Delta_i\big).$$

Because all directional derivatives of $g(\Gamma \mid \Gamma^{(t)})$ vanish at a stationary point, the matrix equation

$$M_i = \Gamma_i^{-1}\Gamma_i^{(t)}R^{(t)T}V_iR^{(t)}\Gamma_i^{(t)}\Gamma_i^{-1} \tag{13}$$

holds. Fortunately, this equation admits an explicit solution. For positive scalars $a$ and $b$, the solution of the equation $b = x^{-1}ax^{-1}$ is $x = \pm\sqrt{a/b}$. The matrix analogue of this equation is the Riccati equation $B = X^{-1}AX^{-1}$, whose solution is summarized in the next lemma.

Lemma 3. Assume $A$ and $B$ are positive definite and $L$ is the Cholesky factor of $B$. Then $Y = L^{-T}(L^TAL)^{1/2}L^{-1}$ is the unique positive definite solution to the matrix equation $B = X^{-1}AX^{-1}$.

The Cholesky factor L in Lemma 3 can be replaced by the symmetric square root of B. The solution, which is unique, remains the same. The Cholesky decomposition is preferred for its cheaper computational cost and better numerical stability.
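Lemma 3 translates directly into code. The sketch below (ours) forms the solution from a Cholesky factor and a symmetric matrix square root and then verifies it on random positive definite inputs:

```python
import numpy as np
from scipy.linalg import cholesky, sqrtm, solve_triangular

def riccati_solve(A, B):
    """Return the positive definite X solving B = X^{-1} A X^{-1} (Lemma 3)."""
    L = cholesky(B, lower=True)                               # B = L L^T
    S = sqrtm(L.T @ A @ L).real                               # symmetric square root of L^T A L
    T = solve_triangular(L, S, lower=True, trans='T')         # T = L^{-T} S
    X = solve_triangular(L, T.T, lower=True, trans='T').T     # X = T L^{-1}
    return (X + X.T) / 2                                      # symmetrize against round-off

# quick check on random positive definite matrices
rng = np.random.default_rng(1)
M1, M2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
A, B = M1 @ M1.T + np.eye(4), M2 @ M2.T + np.eye(4)
X = riccati_solve(A, B)
Xinv = np.linalg.inv(X)
print(np.allclose(Xinv @ A @ Xinv, B))                        # True
```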


Algorithm 3 summarizes the MM algorithm for fitting the multivariate response model (11). Each iteration invokes $m$ Cholesky decompositions and symmetric square roots of $d \times d$ positive definite matrices. Fortunately, in most applications $d$ is a small number. The following result guarantees the non-singularity of the Cholesky factors throughout the iterations. See Supplementary Materials S.8 for the proof.

Proposition 2. Assume $V_i$ has strictly positive diagonal entries. Then the symmetric matrix $M_i = (I_d \otimes \mathbf{1}_n)^T[(\mathbf{1}_d\mathbf{1}_d^T \otimes V_i) \odot \Omega^{-(t)}](I_d \otimes \mathbf{1}_n)$ is positive definite for all $t$. Furthermore, if $\Gamma_i^{(0)} \succ 0$ and no column of $R^{(t)}$ lies in the null space of $V_i$ for all $t$, then $\Gamma_i^{(t)} \succ 0$ for all $t$.

Multivariate response, two variance components

When there are $m = 2$ variance components, $\Omega = \Gamma_1 \otimes V_1 + \Gamma_2 \otimes V_2$, and repeated inversion of the $nd \times nd$ covariance matrix $\Omega$ reduces to a single $n \times n$ simultaneous congruence decomposition and, per iteration, two $d \times d$ Cholesky decompositions and one $d \times d$ simultaneous congruence decomposition. The simultaneous congruence decomposition of the matrix pair $(V_1, V_2)$ involves generalized eigenvalues $d = (d_1, \ldots, d_n)$ and a nonsingular matrix $U$ such that $U^TV_1U = D = \operatorname{diag}(d)$ and $U^TV_2U = I_n$. If the simultaneous congruence decomposition of $(\Gamma_1^{(t)}, \Gamma_2^{(t)})$ is $(\Lambda^{(t)}, \Phi^{(t)})$ with $\Phi^{(t)T}\Gamma_1^{(t)}\Phi^{(t)} = \Lambda^{(t)} = \operatorname{diag}(\lambda^{(t)})$ and $\Phi^{(t)T}\Gamma_2^{(t)}\Phi^{(t)} = I_d$, then

$$\begin{aligned}\Omega^{(t)} &= \big(\Phi^{(t)-1} \otimes U^{-1}\big)^T\big(\Lambda^{(t)} \otimes D + I_d \otimes I_n\big)\big(\Phi^{(t)-1} \otimes U^{-1}\big)\\ \Omega^{-(t)} &= \big(\Phi^{(t)} \otimes U\big)\big(\Lambda^{(t)} \otimes D + I_d \otimes I_n\big)^{-1}\big(\Phi^{(t)} \otimes U\big)^T\\ \det\Omega^{(t)} &= \det\big(\Lambda^{(t)} \otimes D + I_d \otimes I_n\big)\det\big[\big(\Phi^{(t)-1} \otimes U^{-1}\big)^T\big(\Phi^{(t)-1} \otimes U^{-1}\big)\big]\\ &= \det\big(\Lambda^{(t)} \otimes D + I_d \otimes I_n\big)\det\big(\Gamma_2^{(t)} \otimes V_2\big) = \det\big(\Lambda^{(t)} \otimes D + I_d \otimes I_n\big)\det\big(\Gamma_2^{(t)}\big)^n\det(V_2)^d.\end{aligned}$$

Updating the fixed effects reduces to a weighted least squares problem with the transformed responses $\tilde{Y} = U^TY$, transformed predictor matrix $\tilde{X} = U^TX$, and observation weights $(\lambda_k^{(t)}d_i + 1)^{-1}$. Algorithm 4 summarizes the simplified MM algorithm. The lengthy derivations are relegated to Supplementary Materials S.5.


6.2. Multivariate response model with missing responses

In many applications the multivariate response model (11) involves missing responses. For instance, in testing multiple longitudinal traits in genetics, some trait values yij may be missing due to dropped patient visits, while their genetic covariates are complete. Missing data destroys the symmetry of the log-likelihood (11) and complicates finding the MLE. Fortunately, MM algorithm 3 easily adapts to this challenge.

The familiar EM argument (McLachlan and Krishnan, 2008, Section 2.2) shows that

$$-\frac{1}{2}\ln\det\Omega - \frac{1}{2}\operatorname{tr}\Big\{\Omega^{-1}\big[\operatorname{vec}(Z^{(t)} - XB)\operatorname{vec}(Z^{(t)} - XB)^T + C^{(t)}\big]\Big\} \tag{14}$$

minorizes the observed log-likelihood at the current iterate $(B^{(t)}, \Gamma_1^{(t)}, \ldots, \Gamma_m^{(t)})$. Here $Z^{(t)}$ is the completed response matrix given the observed responses $Y_{\mathrm{obs}}$ and the current parameter values. The complete data $\operatorname{vec}Y$ is assumed to be normally distributed as $N(\operatorname{vec}(XB^{(t)}), \Omega^{(t)})$. The block matrix $C^{(t)}$ is 0 except for a lower-right block consisting of a Schur complement.

To maximize the surrogate (14), we invoke the familiar minorization (6) and majorization (12) to separate the variance components $\Gamma_i$. At each iteration we impute the missing entries by their conditional means, compute their conditional variances and covariances to supply the Schur complement, and then update the fixed effects and variance components by the explicit updates of Algorithm 3. The required conditional means and conditional variances can be conveniently obtained in the process of inverting $\Omega^{(t)}$ by the sweep operator of computational statistics (Lange, 2010, Section 7.3).
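For intuition, here is a dense-matrix sketch (ours; the paper instead recommends the sweep operator) of the conditional moments needed in this E-step: the missing block of $\operatorname{vec}Y$ is replaced by its conditional mean, and the Schur complement supplies $C^{(t)}$.

```python
import numpy as np

def impute_missing(mu, Omega, yvec, miss):
    """Conditional mean and covariance of the missing block of a normal vector.

    mu, Omega: mean and covariance of vec Y at the current parameters;
    yvec: data vector with arbitrary values in missing positions;
    miss: boolean mask of missing entries.
    Returns the completed vector z and the Schur complement C (conditional covariance).
    """
    obs = ~miss
    Ooo = Omega[np.ix_(obs, obs)]
    Omo = Omega[np.ix_(miss, obs)]
    Omm = Omega[np.ix_(miss, miss)]
    w = np.linalg.solve(Ooo, yvec[obs] - mu[obs])            # Ooo^{-1} (y_obs - mu_obs)
    z = yvec.copy()
    z[miss] = mu[miss] + Omo @ w                             # conditional mean imputation
    C = Omm - Omo @ np.linalg.solve(Ooo, Omo.T)              # Schur complement
    return z, C
```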

6.3. Linear mixed model (LMM)

The linear mixed model plays a central role in longitudinal data analysis. Consider the single-level LMM (Laird and Ware, 1982; Bates and Pinheiro, 1998) for n independent data clusters (yi, Xi, Zi) with

$$Y_i = X_i\beta + Z_i\gamma_i + \epsilon_i, \qquad i = 1, \ldots, n,$$

where $\beta$ is a vector of fixed effects, the $\gamma_i \sim N(0, R_i(\theta))$ are independent random effects, and $\epsilon_i \sim N(0, \sigma^2I_{n_i})$ captures random noise independent of $\gamma_i$. We assume the matrices $Z_i$ have full column rank. The within-cluster covariance matrices $R_i(\theta)$ depend on a parameter vector $\theta$; typical choices for $R_i(\theta)$ impose autocorrelation, compound symmetry, or unstructured correlation. It is clear that $Y_i$ is normal with mean $X_i\beta$, covariance $\Omega_i = Z_iR_i(\theta)Z_i^T + \sigma^2I_{n_i}$, and log-likelihood

$$L_i(\beta, \theta, \sigma^2) = -\frac{1}{2}\ln\det\Omega_i - \frac{1}{2}(y_i - X_i\beta)^T\Omega_i^{-1}(y_i - X_i\beta).$$

The next three technical facts about pseudo-inverses are used in deriving the MM algorithm for the LMM; their proofs appear in Supplementary Materials S.9–S.11.

Lemma 4. If A has full column rank and B has full row rank, then (AB)+ = B+A+.

Lemma 5. If A and B are positive semidefinite matrices with the same range, then

$$\lim_{\epsilon \downarrow 0}(B + \epsilon I)(A + \epsilon I)^{-1}(B + \epsilon I) = BA^+B.$$

Lemma 6. If R and S are positive definite matrices, and the conformable matrix Z has full column rank, then the matrices ZRZT and ZSZT share a common range.

The convexity of the map $(X, Y) \mapsto X^TY^{-1}X$ and Lemmas 4–6 now yield, via the obvious limiting argument, the majorization

$$\begin{aligned}\Omega_i^{(t)}\Omega_i^{-1}\Omega_i^{(t)} &= \big(Z_iR_i(\theta^{(t)})Z_i^T + \sigma^{2(t)}I_{n_i}\big)\big(Z_iR_i(\theta)Z_i^T + \sigma^2I_{n_i}\big)^{-1}\big(Z_iR_i(\theta^{(t)})Z_i^T + \sigma^{2(t)}I_{n_i}\big)\\ &\preceq \big(Z_iR_i(\theta^{(t)})Z_i^T\big)\big(Z_iR_i(\theta)Z_i^T\big)^{+}\big(Z_iR_i(\theta^{(t)})Z_i^T\big) + \frac{\sigma^{4(t)}}{\sigma^2}I_{n_i}\\ &= \big[Z_iR_i(\theta^{(t)})Z_i^TZ_i^{T+}\big]R_i^{-1}(\theta)\big[Z_i^{+}Z_iR_i(\theta^{(t)})Z_i^T\big] + \frac{\sigma^{4(t)}}{\sigma^2}I_{n_i}.\end{aligned}$$

In combination with the minorization (6), this gives the surrogate

$$g_i(\theta, \sigma^2 \mid \theta^{(t)}, \sigma^{2(t)}) = -\frac{1}{2}\operatorname{tr}\big(Z_i^T\Omega_i^{-(t)}Z_iR_i(\theta)\big) - \frac{1}{2}r_i^{(t)T}R_i^{-1}(\theta)r_i^{(t)} - \frac{\sigma^2}{2}\operatorname{tr}\big(\Omega_i^{-(t)}\big) - \frac{\sigma^{4(t)}}{2\sigma^2}(y_i - X_i\beta^{(t)})^T\Omega_i^{-2(t)}(y_i - X_i\beta^{(t)}) + c^{(t)},$$

for the log-likelihood Li(θ, σ2), where

$$r_i^{(t)} = \big(Z_i^{+}Z_iR_i(\theta^{(t)})Z_i^T\big)\Omega_i^{-(t)}(y_i - X_i\beta^{(t)}) = R_i(\theta^{(t)})Z_i^T\Omega_i^{-(t)}(y_i - X_i\beta^{(t)}).$$

The parameters $\theta$ and $\sigma^2$ are nicely separated. To maximize the overall minorization function $\sum_ig_i(\theta, \sigma^2 \mid \theta^{(t)}, \sigma^{2(t)})$, we update $\sigma^2$ via

$$\sigma^{2(t+1)} = \sigma^{2(t)}\sqrt{\frac{\sum_i(y_i - X_i\beta^{(t)})^T\Omega_i^{-2(t)}(y_i - X_i\beta^{(t)})}{\sum_i\operatorname{tr}\big(\Omega_i^{-(t)}\big)}}.$$

For structured models such as autocorrelation and compound symmetry, updating θ is a low-dimensional optimization problem that can be approached through the stationarity condition

$$\sum_i\operatorname{vec}\big(Z_i^T\Omega_i^{-(t)}Z_i - R_i^{-1}(\theta)r_i^{(t)}r_i^{(t)T}R_i^{-1}(\theta)\big)^T\frac{\partial}{\partial\theta_j}\operatorname{vec}R_i(\theta) = 0$$

for each component θj. For the unstructured model with Ri(θ) = R for all i, the stationarity condition reads

$$\sum_iZ_i^T\Omega_i^{-(t)}Z_i = R^{-1}\Big(\sum_ir_i^{(t)}r_i^{(t)T}\Big)R^{-1}$$

and admits an explicit solution based on Lemma 3.
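The two updates are easy to assemble; the sketch below (ours, for the unstructured case $R_i(\theta) = R$) accumulates the cluster-level quantities and reuses the `riccati_solve` helper sketched in Section 6.1 for the Lemma 3 step.

```python
import numpy as np

def lmm_mm_step(ys, Xs, Zs, beta, R, sigma2):
    """One MM update of (R, sigma^2) for the unstructured single-level LMM (sketch).

    ys, Xs, Zs are lists over clusters; beta is the current fixed effect vector;
    riccati_solve is the Lemma 3 solver sketched in Section 6.1.
    """
    num, den = 0.0, 0.0
    M = 0.0      # accumulates sum_i Z_i^T Omega_i^{-(t)} Z_i
    S = 0.0      # accumulates sum_i r_i^{(t)} r_i^{(t)T}
    for y, X, Z in zip(ys, Xs, Zs):
        Omega = Z @ R @ Z.T + sigma2 * np.eye(len(y))
        Oinv = np.linalg.inv(Omega)
        resid = Oinv @ (y - X @ beta)              # Omega_i^{-(t)} (y_i - X_i beta^{(t)})
        r = R @ Z.T @ resid                        # r_i^{(t)}
        num += resid @ resid                       # (y_i - X_i beta)^T Omega_i^{-2(t)} (y_i - X_i beta)
        den += np.trace(Oinv)
        M += Z.T @ Oinv @ Z
        S += np.outer(r, r)
    sigma2_new = sigma2 * np.sqrt(num / den)       # noise variance update
    R_new = riccati_solve(S, M)                    # solves M = R^{-1} S R^{-1} (Lemma 3)
    return R_new, sigma2_new
```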

The same tactics apply to a multilevel LMM (Bates and Pinheiro, 1998) with responses

$$Y_i = X_i\beta + Z_{i1}\gamma_{i1} + \cdots + Z_{im}\gamma_{im} + \epsilon_i.$$

Minorization separates parameters for each level (variance component). Depending on the complexity of the covariance matrices, maximization of the surrogate can be accomplished analytically. For the sake of brevity, details are omitted.

6.4. MAP estimation

Suppose $\beta$ follows an improper flat prior, the variance components $\sigma_i^2$ follow inverse gamma priors with shapes $\alpha_i > 0$ and scales $\gamma_i > 0$, and these priors are independent. The log posterior density then reduces to

$$-\frac{1}{2}\ln\det\Omega - \frac{1}{2}(y - X\beta)^T\Omega^{-1}(y - X\beta) - \sum_{i=1}^m(\alpha_i + 1)\ln\sigma_i^2 - \sum_{i=1}^m\frac{\gamma_i}{\sigma_i^2} + c, \tag{15}$$

where $c$ is an irrelevant constant. The MAP estimator of $(\beta, \sigma^2)$ is the mode of the posterior distribution. The update (4) of $\beta$ given $\sigma^2$ remains the same. To update $\sigma^2$ given $\beta$, apply the same minorizations (5) and (6) to the first two terms of equation (15). This separates the parameters and yields a univariate surrogate for each $\sigma_i^2$, whose optimum satisfies the stationarity condition

$$0 = -\frac{1}{2}\operatorname{tr}\big(\Omega^{-(t)}V_i\big) + \frac{\sigma_i^{4(t)}}{2\sigma_i^4}(y - X\beta^{(t)})^T\Omega^{-(t)}V_i\Omega^{-(t)}(y - X\beta^{(t)}) - \frac{\alpha_i + 1}{\sigma_i^2} + \frac{\gamma_i}{\sigma_i^4}.$$

Multiplying this by $\sigma_i^4$ gives a quadratic equation in $\sigma_i^2$. The positive root should be taken to meet the nonnegativity constraint on $\sigma_i^2$.
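Reading off the stationarity condition, the quadratic is $(t_i/2)\,x^2 + (\alpha_i + 1)\,x - (\sigma_i^{4(t)}q_i/2 + \gamma_i) = 0$ in $x = \sigma_i^2$, where $t_i = \operatorname{tr}(\Omega^{-(t)}V_i)$ and $q_i = (y - X\beta^{(t)})^T\Omega^{-(t)}V_i\Omega^{-(t)}(y - X\beta^{(t)})$. A vectorized sketch of the resulting update (ours, with hypothetical argument names):

```python
import numpy as np

def map_sigma2_update(sigma2, traces, quads, alpha, gamma):
    """MAP update of the variance components under inverse gamma priors (sketch).

    traces[i] = tr(Omega^{-(t)} V_i),
    quads[i]  = (y - X beta)^T Omega^{-(t)} V_i Omega^{-(t)} (y - X beta).
    Solves (t_i/2) x^2 + (alpha_i + 1) x - (sigma_i^{4(t)} q_i / 2 + gamma_i) = 0 for x = sigma_i^2 > 0.
    """
    sigma2, traces, quads, alpha, gamma = (np.asarray(a, dtype=float)
                                           for a in (sigma2, traces, quads, alpha, gamma))
    a = 0.5 * traces                                           # quadratic coefficient
    b = alpha + 1.0                                            # linear coefficient
    c = 0.5 * sigma2 ** 2 * quads + gamma                      # (positive) constant term
    return (-b + np.sqrt(b ** 2 + 4.0 * a * c)) / (2.0 * a)    # positive root of a x^2 + b x - c = 0
```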

For the multivariate response model (11), we assume the variance components $\Gamma_i$ follow independent inverse Wishart priors with degrees of freedom $\nu_i > d - 1$ and scale matrices $\Psi_i \succ 0$. The log density of the posterior distribution is

$$-\frac{1}{2}\ln\det\Omega - \frac{1}{2}\operatorname{vec}(Y - XB)^T\Omega^{-1}\operatorname{vec}(Y - XB) - \frac{1}{2}\sum_{i=1}^m(\nu_i + d + 1)\ln\det\Gamma_i - \frac{1}{2}\sum_{i=1}^m\operatorname{tr}\big(\Psi_i\Gamma_i^{-1}\big) + c, \tag{16}$$

where $c$ is an irrelevant constant. Invoking the minorizations (6) and (12) for the first two terms and the supporting hyperplane minorization for $-\ln\det\Gamma_i$ gives the surrogate function

$$g(\Gamma \mid \Gamma^{(t)}) = -\frac{1}{2}\sum_{i=1}^m\operatorname{tr}\big[\Omega^{-(t)}(\Gamma_i \otimes V_i)\big] - \frac{1}{2}\sum_{i=1}^m\operatorname{tr}\big(\Gamma_i^{(t)}R^{(t)T}V_iR^{(t)}\Gamma_i^{(t)}\Gamma_i^{-1}\big) - \frac{1}{2}\sum_{i=1}^m(\nu_i + d + 1)\operatorname{tr}\big(\Gamma_i^{-(t)}\Gamma_i\big) - \frac{1}{2}\sum_{i=1}^m\operatorname{tr}\big(\Psi_i\Gamma_i^{-1}\big) + c^{(t)}.$$

The optimal Γi satisfies the stationarity condition

$$(I_d \otimes \mathbf{1}_n)^T\big[(\mathbf{1}_d\mathbf{1}_d^T \otimes V_i) \odot \Omega^{-(t)}\big](I_d \otimes \mathbf{1}_n) + (\nu_i + d + 1)\Gamma_i^{-(t)} = \Gamma_i^{-1}\big(\Gamma_i^{(t)}R^{(t)T}V_iR^{(t)}\Gamma_i^{(t)} + \Psi_i\big)\Gamma_i^{-1},$$

which can be solved by Lemma 3.

6.5. Variable selection

In the statistical analysis of high-dimensional data, the imposition of sparsity leads to better interpretation and more stable parameter estimation. MM algorithms mesh well with penalized estimation. The simple variance components model (1) illustrates this fact. For the selection of fixed effects, minimizing the lasso-penalized negative log-likelihood $-L(\beta, \sigma^2) + \lambda\sum_j|\beta_j|$ is often recommended (Schelldorfer et al., 2011). The only change to MM Algorithm 1 is that, in estimating $\beta$, one solves a lasso-penalized general least squares problem rather than an ordinary general least squares problem. The updates of the variance components $\sigma_i^2$ remain the same. For estimation of a large number of variance components, one can minimize the ridge-penalized negative log-likelihood

$$-L(\beta, \sigma^2) + \lambda\sum_{i=1}^m\sigma_i^2$$

subject to the nonnegativity constraints $\sigma_i^2 \ge 0$. The variance update (8) becomes

$$\sigma_i^{2(t+1)} = \sigma_i^{2(t)}\sqrt{\frac{(y - X\beta^{(t)})^T\Omega^{-(t)}V_i\Omega^{-(t)}(y - X\beta^{(t)})}{\operatorname{tr}\big(\Omega^{-(t)}V_i\big) + 2\lambda}}, \qquad i = 1, \ldots, m,$$

which clearly exhibits shrinkage but no thresholding. The lasso-penalized negative log-likelihood

$$-L(\beta, \sigma^2) + \lambda\sum_{i=1}^m\sigma_i \tag{17}$$

subject to the nonnegativity constraints $\sigma_i \ge 0$ achieves both ends. The update of $\sigma_i$ is chosen among the positive roots of a quartic equation and the boundary value 0, whichever yields the lower objective value. The next section illustrates variance component selection using the lasso penalty on a real genetic data set.
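In code, the ridge-penalized update is a one-line variant of update (8); the sketch below (ours, with hypothetical argument names) simply adds $2\lambda$ to the denominator.

```python
import numpy as np

def ridge_penalized_update(sigma2, r, Oinv, Vs, lam):
    """Ridge-penalized variance component update: eq. (8) with 2*lambda in the denominator.

    r = Omega^{-(t)} (y - X beta^{(t)}), Oinv = Omega^{-(t)}; lam = 0 recovers the unpenalized (8).
    """
    new = np.empty(len(Vs))
    for i, V in enumerate(Vs):
        new[i] = sigma2[i] * np.sqrt((r @ V @ r) / (np.trace(Oinv @ V) + 2.0 * lam))
    return new
```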

7. A numerical example

Quantitative trait loci (QTL) mapping aims to identify genes associated with a quantitative trait. Current sequencing technology measures millions of genetic markers in study subjects. Traditional single-marker tests suffer from low power due to the low frequency of many markers and the corrections needed for multiple hypothesis testing. Region-based association tests are a powerful alternative for analyzing next generation sequencing data with abundant rare variants.

Suppose $y$ is an $n \times 1$ vector of quantitative trait measurements on $n$ people, $X$ is an $n \times p$ predictor matrix (incorporating predictors such as sex, smoking history, and principal components for ethnic admixture), and $G$ is an $n \times m$ genotype matrix of $m$ genetic variants in a pre-defined region. The linear mixed model assumes

$$Y = X\beta + G\gamma + \epsilon, \qquad \gamma \sim N(0, \sigma_g^2I_m), \quad \epsilon \sim N(0, \sigma_e^2I_n),$$

where $\beta$ are fixed effects, $\gamma$ are random genetic effects, and $\sigma_g^2$ and $\sigma_e^2$ are variance components for the genetic and environmental effects, respectively. Thus, the phenotype vector $Y$ has covariance $\sigma_g^2GG^T + \sigma_e^2I_n$, where $GG^T$ is the kernel matrix capturing the overall effect of the $m$ variants. Current approaches test the null hypothesis $\sigma_g^2 = 0$ for each region separately and then adjust for multiple testing (Lee et al., 2014; Zhou et al., 2016). Instead of this marginal testing strategy, we consider the joint model

$$y = X\beta + s_1^{-1/2}G_1\gamma_1 + \cdots + s_m^{-1/2}G_m\gamma_m + \epsilon, \qquad \gamma_i \sim N(0, \sigma_i^2I), \quad \epsilon \sim N(0, \sigma_e^2I_n)$$

and select the variance components $\sigma_i^2$ via the penalization (17). Here $s_i$ is the number of variants in region $i$, and the weights $s_i^{-1/2}$ put all variance components on the same scale.
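A sketch (ours, with hypothetical argument names) of how the region kernels would be assembled: each gene contributes $V_i = s_i^{-1}G_iG_i^T$, and the identity matrix carries the environmental component.

```python
import numpy as np

def region_kernels(G, region_index):
    """Build the scaled kernel matrices V_i = G_i G_i^T / s_i for each region (sketch).

    G: n x (total variants) genotype matrix;
    region_index: length-(total variants) array of region labels.
    Returns the list of n x n kernels plus the identity for the environmental component.
    """
    G = np.asarray(G, dtype=float)
    region_index = np.asarray(region_index)
    n = G.shape[0]
    Vs = []
    for region in np.unique(region_index):
        Gi = G[:, region_index == region]        # variants in region i
        Vs.append(Gi @ Gi.T / Gi.shape[1])       # the s_i^{-1/2} weight enters the kernel as 1/s_i
    Vs.append(np.eye(n))                         # environmental variance component
    return Vs
```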

We illustrate this approach using the COPDGene exome sequencing study (Regan et al., 2010). After quality control, 399 individuals and 646,125 genetic variants remain for analysis. Genetic variants are grouped into 16,619 genes to expose those genes associated with the complex trait height. We include age, sex, and the top 3 principal components in the mean effects. Because the number of genes vastly exceeds the sample size n = 399, we first pare the 16,619 genes down to 200 genes according to their marginal likelihood ratio test p-values and then carry out penalized estimation of the 200 variance components in the joint model (17). This is similar to the sure independence screening strategy for selecting mean effects (Fan and Lv, 2008). Genes are ranked according to the order they appear in the lasso solution path. Table 4 lists the top 10 genes together with their marginal LRT p-values. Figure 1 in Supplementary Materials displays the corresponding segment of the lasso solution path. It is noteworthy that the ranking of genes by penalized estimation differs from the ranking according to marginal p-values. The same phenomenon occurs in selection of highly correlated mean predictors. This penalization approach for selecting variance components warrants further theoretical study.

Table 4:

Top 10 genes selected by the lasso penalized variance component model (17) are tallied with their marginal p-values in an association study of 200 genes and the complex trait height.

Lasso Rank Gene Marginal P-value # Variants
1 DOLPP1 2.35 × 10⁻⁶ 2
2 C9orf21 3.70 × 10⁻⁵ 4
3 PLS1 2.29 × 10⁻³ 5
4 ATP5D 6.80 × 10⁻⁷ 3
5 ADCY4 1.01 × 10⁻³ 11
6 SLC22A25 3.95 × 10⁻³ 14
7 RCSD1 9.04 × 10⁻⁴ 4
8 PCDH7 1.20 × 10⁻⁴ 7
9 AVIL 8.34 × 10⁻⁴ 11
10 AHR 1.14 × 10⁻³ 7

8. Discussion

The current paper leverages the MM principle to design powerful and versatile algorithms for variance components estimation. The MM algorithms derived are notable for their simplicity, generality, numerical efficiency, and theoretical guarantees. Both ordinary MLE and REML are apt to benefit. Other extensions are possible. In nonlinear models (Bates and Watts, 1988; Lindstrom and Bates, 1990), the mean response is a nonlinear function of the fixed effects $\beta$. One can easily modify the MM algorithms to update $\beta$ by a few rounds of Gauss-Newton iteration. The variance components updates remain unchanged.

One can also extend our MM algorithms to elliptically symmetric densities

$$f(y) = \frac{e^{-\frac{1}{2}\kappa(\delta^2)}}{(2\pi)^{n/2}(\det\Omega)^{1/2}}$$

defined for $y \in \mathbb{R}^n$, where $\delta^2 = (y - \mu)^T\Omega^{-1}(y - \mu)$ is the squared Mahalanobis distance between $y$ and $\mu$. Here we assume that the function $\kappa(s)$ is strictly increasing and strictly concave. Examples of elliptically symmetric densities include the multivariate t, slash, contaminated normal, power exponential, and stable families. Previous work (Huber and Ronchetti, 2009; Lange and Sinsheimer, 1993) has focused on using the MM principle to convert parameter estimation for these robust families into parameter estimation under the multivariate normal. One can chain the relevant majorization $\kappa(s) \le \kappa(s^{(t)}) + \kappa'(s^{(t)})(s - s^{(t)})$ with our previous minorizations to simultaneously split the variance components and pass to the more benign setting of the multivariate normal. These extensions are currently under investigation.
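As a concrete instance (our worked example, not derived in the paper), the multivariate t family with $\nu$ degrees of freedom has $\kappa(s) = (\nu + n)\ln(1 + s/\nu)$, which is strictly increasing and strictly concave; the tangent-line majorization then reduces each iteration, up to constants, to a weighted normal log-likelihood:

$$\kappa(\delta^2) \le \kappa(\delta^{2(t)}) + \frac{\nu + n}{\nu + \delta^{2(t)}}\big(\delta^2 - \delta^{2(t)}\big) \quad\Longrightarrow\quad \ln f(y) \ge -\frac{1}{2}\ln\det\Omega - \frac{w^{(t)}}{2}(y - \mu)^T\Omega^{-1}(y - \mu) + c^{(t)}, \qquad w^{(t)} = \frac{\nu + n}{\nu + \delta^{2(t)}},$$

so each iteration amounts to a weighted version of the normal-theory updates derived earlier.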

Supplementary Material

Supp1

Acknowledgments

The research is partially supported by NIH grants R01HG006139, R01GM53275, R01GM105785 and K01DK106116. The authors thank Michael Cho, Dandi Qiao, and Edwin Silverman for their assistance in processing and assessing COPDGene exome sequencing data. COPDGene is supported by NIH grants R01HL089897 and R01HL089856.

References

  1. Bates D, Mächler M, Bolker B, and Walker S (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48.
  2. Bates D and Pinheiro J (1998). Computational methods for multilevel models. Technical Memorandum BL0112140-980226-01TM, Bell Labs, Lucent Technologies, Murray Hill, NJ.
  3. Bates DM and Watts DG (1988). Nonlinear Regression Analysis and Its Applications. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons, Inc., New York.
  4. Bien J and Tibshirani RJ (2011). Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820.
  5. Borg I and Groenen PJ (2005). Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media.
  6. Boyd S and Vandenberghe L (2004). Convex Optimization. Cambridge University Press, Cambridge.
  7. Callanan TP and Harville DA (1991). Some new algorithms for computing restricted maximum likelihood estimates of variance components. J. Statist. Comput. Simulation, 38(1–4):239–259.
  8. De Leeuw J (1994). Block-relaxation algorithms in statistics. In Information Systems and Data Analysis, pages 308–324. Springer.
  9. Demidenko E and Massam H (1999). On the existence of the maximum likelihood estimate in variance components models. Sankhyā Ser. A, 61(3):431–443.
  10. Dempster A, Laird N, and Rubin D (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
  11. Ding J, Tian G-L, and Yuen KC (2015). A new MM algorithm for constrained estimation in the proportional hazards model. Comput. Statist. Data Anal, 84:135–151.
  12. Fan J and Lv J (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Statist. Soc. B, 70:849–911.
  13. Golub GH and Van Loan CF (1996). Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition.
  14. Grzadziel M and Michalski A (2014). A note on the existence of the maximum likelihood estimate in variance components models. Discuss. Math. Probab. Stat, 34(1–2):159–167.
  15. Gupta A and Nagar D (1999). Matrix Variate Distributions. Monographs and Surveys in Pure and Applied Mathematics. Taylor & Francis.
  16. Hartley HO and Rao JNK (1967). Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika, 54:93–108.
  17. Harville D and Callanan T (1990). Computational aspects of likelihood-based inference for variance components. In Gianola D and Hammond K, editors, Advances in Statistical Methods for Genetic Improvement of Livestock, volume 18 of Advanced Series in Agricultural Sciences, pages 136–176. Springer Berlin Heidelberg.
  18. Harville DA (1977). Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Statist. Assoc, 72(358):320–340. With a comment by J. N. K. Rao and a reply by the author.
  19. Heiser WJ (1995). Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent Advances in Descriptive Multivariate Analysis, pages 157–189.
  20. Horn RA and Johnson CR (1985). Matrix Analysis. Cambridge University Press, Cambridge.
  21. Huber PJ and Ronchetti EM (2009). Robust Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, second edition.
  22. Hunter DR (2004). MM algorithms for generalized Bradley-Terry models. Ann. Statist, 32(1):384–406.
  23. Hunter DR and Lange K (2002). Computing estimates in the proportional odds model. Ann. Inst. Statist. Math, 54(1):155–168.
  24. Hunter DR and Lange K (2004). A tutorial on MM algorithms. Amer. Statist, 58(1):30–37.
  25. Hunter DR and Li R (2005). Variable selection using MM algorithms. Ann. Statist, 33(4):1617–1642.
  26. Jeon M (2012). Estimation of Complex Generalized Linear Mixed Models for Measurement and Growth. PhD thesis, University of California, Berkeley.
  27. Kiers HA (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Computational Statistics & Data Analysis, 41(1):157–170.
  28. Laird N, Lange N, and Stram D (1987). Maximum likelihood computations with repeated measures: application of the EM algorithm. J. Amer. Statist. Assoc, 82(397):97–105.
  29. Laird NM and Ware JH (1982). Random-effects models for longitudinal data. Biometrics, 38(4):963–974.
  30. Lange K (2010). Numerical Analysis for Statisticians. Statistics and Computing. Springer, New York, second edition.
  31. Lange K (2016). MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA.
  32. Lange K, Hunter DR, and Yang I (2000). Optimization transfer using surrogate objective functions. J. Comput. Graph. Statist, 9(1):1–59. With discussion, and a rejoinder by Hunter and Lange.
  33. Lange K, Papp J, Sinsheimer J, Sripracha R, Zhou H, and Sobel E (2013). Mendel: the Swiss army knife of genetic analysis programs. Bioinformatics, 29:1568–1570.
  34. Lange K and Sinsheimer JS (1993). Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics, 2:175–198.
  35. Lange K and Zhou H (2014). MM algorithms for geometric and signomial programming. Mathematical Programming Series A, 143:339–356.
  36. Lee S, Abecasis GR, Boehnke M, and Lin X (2014). Rare-variant association analysis: Study designs and statistical tests. The American Journal of Human Genetics, 95(1):5–23.
  37. Lindstrom MJ and Bates DM (1988). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. J. Amer. Statist. Assoc, 83(404):1014–1022.
  38. Lindstrom MJ and Bates DM (1990). Nonlinear mixed effects models for repeated measures data. Biometrics, 46(3):673–687.
  39. McLachlan GJ and Krishnan T (2008). The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.
  40. Meng X-L and Rubin DB (1991). Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86(416):899–909.
  41. Pinheiro J and Bates D (1996). Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3):289–296.
  42. Rao CR (1973). Linear Statistical Inference and its Applications, 2nd ed. John Wiley & Sons.
  43. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, Curran-Everett D, Silverman EK, and Crapo JD (2010). Genetic epidemiology of COPD (COPDGene) study designs. COPD, 7:32–43.
  44. Schafer JL and Yucel RM (2002). Computational strategies for multivariate linear mixed-effects models with missing values. J. Comput. Graph. Statist, 11(2):437–457.
  45. Schelldorfer J, Bühlmann P, and van de Geer S (2011). Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat, 38(2):197–214.
  46. Schur J (1911). Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. J. Reine Angew. Math, 140:1–28.
  47. Wu TT and Lange K (2010). The MM alternative to EM. Statistical Science, 25:492–505.
  48. Yen T-J (2011). A majorization-minimization approach to variable selection using spike and slab priors. Ann. Statist, 39(3):1748–1775.
  49. Yu Y (2010). Monotonic convergence of a general algorithm for computing optimal designs. Ann. Statist, 38(3):1593–1606.
  50. Zhou H and Lange K (2010). MM algorithms for some discrete multivariate distributions. Journal of Computational and Graphical Statistics, 19:645–665.
  51. Zhou JJ, Hu T, Qiao D, Cho MH, and Zhou H (2016). Boosting gene mapping power and efficiency with efficient exact variance component tests of single nucleotide polymorphism sets. Genetics, 204(3):921–931.
