Fast Moment Estimation for Generalized Latent Dirichlet Models

Shiwen Zhao; Barbara E Engelhardt; Sayan Mukherjee; David B Dunson

doi:10.1080/01621459.2017.1341839

Author manuscript; available in PMC: 2022 Jul 21.

Published in final edited form as: J Am Stat Assoc. 2018 Nov 13;113(524):1528–1540. doi: 10.1080/01621459.2017.1341839

PMCID: PMC9302535 NIHMSID: NIHMS1815570 PMID: 35875263

Abstract

We develop a generalized method of moments (GMM) approach for fast parameter estimation in a new class of Dirichlet latent variable models with mixed data types. Parameter estimation via GMM has computational and statistical advantages over alternative methods, such as expectation maximization, variational inference, and Markov chain Monte Carlo. A key computational advantage of our method, Moment Estimation for latent Dirichlet models (MELD), is that parameter estimation does not require instantiation of the latent variables. Moreover, performance is agnostic to distributional assumptions of the observations. We derive population moment conditions after marginalizing out the sample-specific Dirichlet latent variables. The moment conditions only depend on component mean parameters. We illustrate the utility of our approach on simulated data, comparing results from MELD to alternative methods, and we show the promise of our approach through the application to several datasets. Supplementary materials for this article are available online.

Keywords: Generalized method of moments, Latent Dirichlet allocation, Latent variables, Mixed membership model, Mixed scale data, Tensor factorization

1. Introduction

Many modern statistical applications require the analysis of large-scale, heterogeneous data types including continuous, categorical, and count variables. For example, in social science, data often consist of collections of different data types (e.g., height, gender, and age); in population genetics, researchers are interested in analyzing genotype (integer-valued) and heterogeneous traits (e.g., blood pressure, BMI, alcoholic drinks per day). Often data take the form of an n × p matrix Y = (y₁, …, y_n)^T, with y_i = (y_i1, …, y_ip)^T a p-dimensional vector of measurements of varying types for subject i, for i = 1, …, n.

This article focuses on (1) developing a new class of generalized latent variable models for mixed data types, and (2) performing fast and robust parameter estimation using the generalized method of moments (GMM). Many models have been proposed to address the first goal (reviewed below), but often these models lack robustness both statistically and computationally. In addition, parameter estimation in these models is often both inefficient and unstable, which necessitates our second goal. Hence, there is a clear need for new classes of models and corresponding robust and efficient approaches for routine inference under these models in general applications.

For modeling mixed scale data, there are two general strategies that have been most commonly used in the literature. The first is to assume an underlying Gaussian model, and then to characterize the feature dependence through a structured model of the Gaussian covariance (Muthén 1984). Such models are routinely used in the social science literature, focusing almost entirely on the case in which data are categorical or continuous, with the categorical variables arising via thresholding of Gaussian variables. These models are related to Gaussian copula models, which have been developed for broad classes of mixed scale data also including counts (Murray et al. 2013). The second approach is to define an exponential family distribution for each of the p variables, inducing dependence between variables through generalized linear models containing shared latent variables (Sammel, Ryan, and Legler 1997; Moustaki and Knott 2000; Dunson 2003).

There are a number of disadvantages to the above approaches. The underlying Gaussian framework is restrictive in forcing continuous variables to be Gaussian and non-continuous variables to be categorical. The exponential family latent variable models are potentially more flexible, but in practice suffer in several respects. Most notably, there is a fundamental lack of robustness due to the model structure, which arises because of the dual role of the latent variables in controlling sample dependence and the shape of the marginal distributions. For example, if there are two count variables, y_i1 and y_i2, that are highly correlated, then both counts must load strongly on a similar set of latent variables, which will induce over-dispersion in the marginal distributions with the variance much larger than the mean. There is no possibility of having highly correlated counts that do not have high over-dispersion. Copula models solve this problem by imposing a restrictive Gaussian copula covariance structure. Another fundamental issue is computation, which tends to rely on expectation-maximization (EM) or Markov chain Monte Carlo (MCMC) algorithms that alternate between updating latent variables and population parameters, intrinsically leading to slow convergence, inefficiency, and instability.

One promising approach to address the above issues is to rely on a generalized method of moments (GMM) estimator, which relates to the second goal in this article. The GMM estimator is robust to misspecification of the higher-order moments, while also leading to optimal efficiency among all estimators using only information on the initial moments (Hansen 1982). Applications of GMM to structural models with latent variables have a long history, with early examples including Bentler (1983) and Anderson and Gerbing (1988), among others. More recently, Gallant, Giacomini, and Ragusa (2013) applied GMM to a specific class of latent variable models by defining moment conditions based on the complete data, including the latent variables. In contrast, Bollen, Kolenikov, and Bauldry (2014) relied on a model-implied instrumental variable GMM estimator. Generally, current GMM methods focus on latent variable models that satisfy restrictive assumptions or require the instantiation of latent variables in a computationally intensive estimation algorithm.

The focus of our work is on a broad new class of latent variable models for mixed scale data. This eliminates some of the problems of current exponential family latent variable models, while also enabling derivation of an efficient GMM implementation that marginalizes out latent variables. With the first goal in mind, we focus on generalized mixed membership models, which incorporate Dirichlet latent variables. Mixed membership models have a rich literature with applications ranging from inference of ancestral populations in genomic data (Pritchard, Stephens, and Donnelly 2000a; Pritchard et al. 2000b) to topic modeling of documents (Blei, Ng, and Jordan 2003). Using Dirichlet latent variables to define cluster membership allows samples to partially belong to each of k latent components. Our model class includes latent Dirichlet allocation (Blei, Ng, and Jordan 2003) and simplex factor models (Bhattacharya and Dunson 2012) as special cases; however, we go beyond current approaches by allowing mixed scale data.

For the second goal, we develop an efficient moment tensor approach for parameter estimation with the Dirichlet latent variables marginalized out. This objective is related to recent moment tensor methods developed for latent variable models including mixtures of Gaussians, hidden Markov models, mixed membership models, and stochastic block models (Arora, Ge, and Moitra 2012; Anandkumar, Hsu, and Kakade 2012a; Anandkumar et al. 2012b; Hsu and Kakade 2013; Anandkumar et al. 2014a, 2014b). This recent work finds moment tensors of these latent variable models that have a symmetric PARAFAC (parallel factors) tensor decomposition (Kiers 2000). Parameter estimation using a moment tensor approach is performed as follows: (1) third-order empirical moment tensors are transformed to an orthogonal decomposable form; (2) orthogonal decompositions are performed using tensor power methods or singular value decompositions; (3) parameter estimates are recovered from the decomposition of the moment tensors. The moment tensor approach offers substantial computational advantages over other approximation methods, as illustrated in various applications (Tung and Smola 2014; Anandkumar et al. 2014a; Colombo and Vlassis 2015).

In this work, we develop a moment tensor approach for generalized mixed membership models. Our approach, moment estimation for latent Dirichlet models (MELD), differs from previous approaches in that the moment functions are defined for variables with different generative distributions. This goes beyond previous moment tensor approaches that construct moment tensors for homogeneous data distributions. Our approach constructs second- and third-order moments for variables with different data types. Parameter estimation is performed in the GMM framework by minimizing quadratic forms of the moment functions using a coordinate descent algorithm. Our algorithm scales as either O(p²) or O(p³), depending on whether third moments are used in the GMM procedure. In the setting where n is large and p is moderate, our algorithm shows clear computational advantages over alternative methods such as MCMC or EM. We also provide results regarding the asymptotic efficiency of our estimators.

In Section 2, we characterize the class of generalized Dirichlet latent variable models in MELD. In Section 3, we briefly review GMM and introduce the estimation procedure used in MELD. Asymptotic properties of the GMM estimator are also discussed. In Section 4, we evaluate the performance in a simulation study. We apply our method to three examples in Section 5 and conclude with a discussion in Section 6.

2. Generalized Dirichlet Latent Variable Models

In this section, we specify a generalized Dirichlet latent variable model. Let y_i = (y_i1, …, y_ip)^T denote a vector of p measurements having heterogeneous data types (e.g., continuous, categorical, and counts) over subjects i = 1, …, n. We are interested in flexible models for joint distributions of the data. A simple way to achieve this is to assume the elements of y_i are conditionally independent given a component index $s_{i} \in \{1, \dots, k\}$, with the densities for each variable in each component having a parametric form. This is inflexible due to the assumption that each subject i belongs to exactly one component. We instead allow subjects to partially belong to each of the k components. We assume that the elements of y_i are conditionally independent given a latent Dirichlet vector $x_{i} = {(x_{i 1}, \dots, x_{i k})}^{T} \in Δ^{k - 1}$, with $Δ^{k - 1}$ denoting the (k − 1)-dimensional probability simplex. In particular, we let

y_{i j} \sim \sum_{h = 1}^{k} x_{i h} g_{j} (ϕ_{j h}),

(1)

where g_j (ϕ_jh) is the density of the jth variable specific to component h. The elements of x_i are interpretable as probability weights for subject i on each of k components; in this setting, pure subjects have weight vectors with all zeros except for a single one.

The corresponding density g_j (ϕ_jh) is absolutely continuous with respect to a dominating measure $(Ω, H, μ)$. In this article, we focus on exponential family models for the density, but the procedures we propose will hold for general parametric forms. Our model in Equation (1) generalizes current mixed membership models to allow mixed data types. The parameters $Φ = {\{ϕ_{j h}\}}_{1 \leq j \leq p, 1 \leq h \leq k}$ are mixture component parameters shared by all subjects. A full likelihood specification is completed by choosing a population distribution for the latent variable vector, x_i ∼ P, with P a distribution on the simplex $Δ^{k - 1}$. We place a Dirichlet distribution on this latent variable vector, x_i ∼ Dir(α), with parameter α = (α₁, …, α_k)^T.

Let m_i = (m_i1, … m_ip)^T denote a membership vector for subject i, where $m_{i j} \in \{1, \dots, k\}$ indicates the component that feature j in subject i is generated from. We specify the following generative model:

y_{i j} |m_{i j} = h \sim g_{j} (ϕ_{j h}), m_{i j}| x_{i} \sim Multi (x_{i}), x_{i} \sim Dir (α) .

(2)
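The generative process in Equation (2) can be simulated directly. The following is a minimal sketch, assuming p = 3 variables (one categorical, one Poisson count, one unit-variance Gaussian); the function names `draw_index` and `sample_mixed`, the zero-based category coding, and all parameter values are illustrative choices, not part of the original model specification.

```python
import numpy as np

def draw_index(probs, rng):
    """Draw one index per row of `probs` (rows sum to one) by inverse CDF."""
    cdf = np.cumsum(probs, axis=1)
    u = rng.random((probs.shape[0], 1))
    idx = (u > cdf).sum(axis=1)
    return np.minimum(idx, probs.shape[1] - 1)  # guard against round-off

def sample_mixed(n, alpha, phi_cat, phi_pois, phi_norm, rng):
    """Simulate n subjects with p = 3 variables from Equation (2).

    alpha    : (k,)   Dirichlet parameter
    phi_cat  : (d, k) category probabilities, one column per component
    phi_pois : (k,)   Poisson means
    phi_norm : (k,)   Gaussian means (unit variance is an assumption here)
    """
    x = rng.dirichlet(alpha, size=n)                # x_i ~ Dir(alpha)
    # m_ij | x_i ~ Multi(x_i), drawn independently for each variable j
    m = np.stack([draw_index(x, rng) for _ in range(3)], axis=1)
    y_cat = draw_index(phi_cat.T[m[:, 0]], rng)     # categorical, coded 0..d-1
    y_pois = rng.poisson(phi_pois[m[:, 1]])         # count
    y_norm = rng.normal(phi_norm[m[:, 2]], 1.0)     # continuous
    return x, m, y_cat, y_pois, y_norm

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
phi_cat = np.array([[0.7, 0.1], [0.2, 0.2], [0.1, 0.7]])
phi_pois = np.array([2.0, 10.0])
phi_norm = np.array([-1.0, 3.0])
x, m, y_cat, y_pois, y_norm = sample_mixed(20000, alpha, phi_cat, phi_pois, phi_norm, rng)

# The marginal mean of the count variable should be near phi_pois @ alpha / alpha_0
print(y_pois.mean(), phi_pois @ alpha / alpha.sum())
```

Each variable receives its own membership draw m_ij, which is what lets a subject belong partially to several components.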

When y_i is a multivariate categorical vector and g_j represents a multinomial distribution, Bhattacharya and Dunson (2012) proposed a simplex factor model. Their model reduces to latent Dirichlet allocation (Blei, Ng, and Jordan 2003) in the special case in which y_ij is the number of occurrences of word j in document i, and g_j (ϕ_jh) is a Poisson distribution, with scalar ϕ_jh indicating the rate of occurrence of word j in topic h. We also assume x_i ∼ Dir(α), but allow a general form for the component specific densities. This leads to the following likelihood after marginalizing out the latent variables:

p (Y |α, Φ) = \prod_{i = 1}^{n} [\int \prod_{j = 1}^{p} (\sum_{h = 1}^{k} x_{i h} g_{j} (y_{i j} |ϕ_{j h})) d P (x_{i})] .

Categorical data.

Let $y_{i j} \in \{1, \dots, d_{j}\}$ , for j = 1, …, p, so that we have multivariate categorical data that can be organized as a p-way contingency table. Let c = (c₁, …, c_p)^T with $c_{j} \in \{1, \dots, d_{j}\}$ and $π_{c} = \Pr (y_{i} = c) = \Pr (y_{i 1} = c_{1}, \dots, y_{i p} = c_{p})$ . Then $π = \{π_{c}\}$ is a probability tensor (Dunson and Xing 2009) satisfying $π_{c} \geq 0$ and $\sum_{c} π_{c} = 1$ . In this case, $g_{j} (ϕ_{j h}) \overset{d}{=} Multi (ϕ_{j h})$ is a multinomial distribution, with $ϕ_{j h} = {(ϕ_{j h 1}, …, ϕ_{j h d_{j}})}^{T} \in Δ^{d_{j} - 1}$ a probability vector. As shown by Bhattacharya and Dunson (2012), marginalizing out the mixture proportion x_i leads to a Tucker decomposition of π

π_{c} = \sum_{h_{1} = 1}^{k} \cdots \sum_{h_{p} = 1}^{k} g_{h_{1}, …, h_{p}} \prod_{j = 1}^{p} ϕ_{j h_{j} c_{j}} .

Mixed data types.

When y_i includes variables of mixed data types, g_j (ϕ_jh) belongs to the exponential family of distributions with mean parameter ϕ_jh. We are interested in estimating mean parameters $Φ = (Φ_{1}; …; Φ_{p})$ with $Φ_{j} = (ϕ_{j 1}, …, ϕ_{j k})$ . For categorical variables y_ij, ϕ_jh is a probability vector indicating the distribution of the categories in component h; for noncategorical variables y_ij, ϕ_jh is a scalar term indicating the mean parameter of that feature for component h.

3. Generalized Methods of Moments Estimation

In this section, we provide a procedure for parameter estimation in generalized Dirichlet latent variable models based on a GMM framework. We first state the moment functions and then propose a two-stage procedure for parameter estimation.

3.1. Moment Functions Used in MELD

The idea behind method of moments (MM) algorithms is to derive a list of moment equations that have expected value zero at the true parameter values. Given observations ${\{y_{i}\}}_{1 \leq i \leq n}$ , MM algorithms specify ℓ moment functions and define a moment vector $f (y_{i}, θ) = {(f_{1} (y_{i}, θ), …, f_{ℓ} (y_{i}, θ))}^{T}$ such that $E [f (y_{i}, θ)] = 0$ at the true parameter $θ = θ_{0} \in Θ \subseteq ℝ^{p}$ . The moment vector is then used to specify an estimator by solving the following ℓ sample equations with p unknowns:

f_{n} (θ) = 0,

(3)

where $f_{n} (θ) = \frac{1}{n} \sum_{i = 1}^{n} f (y_{i}, θ)$ is the sample estimate of $E [f (y_{i}, θ)]$. When f (y_i, θ) is linear in θ (e.g., linear regression with instrumental variables) with independent moment functions and ℓ = p, then θ can be uniquely determined. When ℓ > p, the system in Equation (3) might be over-determined, in which case standard MM cannot be applied.

Generalized method of moments (GMM) minimizes the following quadratic form rather than solving a linear system:

Q_{n} (θ; A_{n}) = f_{n} {(θ)}^{T} A_{n} f_{n} (θ),

(4)

where A_n is the asymptotically optimal positive semidefinite matrix:

A_{n}^{- 1} = S_{n} = Var [n^{1 / 2} f_{n} (θ)] .

In the over-determined setting, minimizing the quadratic form is still well defined. Hansen’s two-stage procedure (Hansen 1982) obtains an initial estimate $\hat{θ}$ using a suboptimal weight matrix, such as the identity. This initial estimate is used to calculate A_n, which is then plugged into Equation (4) to obtain the final estimate (Hall 2005), which is consistent and asymptotically normal (Hansen 1982).

Applying GMM to all of the parameters in the generalized Dirichlet latent variable model is not feasible due to the dimensionality of the parameter space. Higher-order moment functions are complex and involve large numbers of unknowns. We will make use of a tensor decomposition and define moment functions that only depend on lower order moments. This extends recent moment tensor approaches to our proposed broad class of latent variable models (Arora, Ge, and Moitra 2012; Anandkumar, Hsu, and Kakade 2012a; Anandkumar et al. 2012b; Hsu and Kakade 2013; Anandkumar et al. 2014a, 2014b). One key innovation in our procedure is that we allow for lower order interaction moments across variables, where each variable can have a different distribution. The moment functions will have rank one decompositions of the component mean parameters over the heterogeneous variables. These heterogeneous low-order moment functions lead to fast and robust parameter estimation in MELD, as we show in the simulation and application sections.

We now define the moment functions. Recall that y_ij is the jth variable for subject i, which may include various data types, including continuous, categorical, or count data. Each categorical variable y_ij is encoded as a vector b_ij of length d_j containing all zeros except for a one in position y_ij. If y_ij is a non-categorical variable, then b_ij = y_ij is a scalar value. In the following, we assume that there are k latent components and that the latent variable x_i follows a Dirichlet distribution with parameters α = (α₁, …, α_k)^T and $α_{0} = \sum_{h = 1}^{k} α_{h}$ . Given an observation b_ij and hyperparameter α, we define moment functions based on second and third moments:

F_{j t}^{(2)} (y_{i}, Φ) = b_{i j} \circ b_{i t} - \frac{α_{0}}{α_{0} + 1} μ_{j} \circ μ_{t} - Φ_{j} Λ^{(2)} {Φ_{t}}^{T}, 1 \leq j, t \leq p, j \neq t

(5)

\begin{array}{l} F_{j s t}^{(3)} (y_{i}, Φ) = b_{i j} \circ b_{i s} \circ b_{i t} \\ - \frac{α_{0}}{α_{0} + 2} (b_{i j} \circ b_{i s} \circ μ_{t} + μ_{j} \circ b_{i s} \circ b_{i t} \\ + b_{i j} \circ μ_{s} \circ b_{i t}) + \frac{2 α_{0}^{2}}{(α_{0} + 1) (α_{0} + 2)} μ_{j} \circ μ_{s} \circ μ_{t} \\ - Λ^{(3)} \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t}, \\ 1 \leq j, s, t \leq p, j \neq s \neq t . \end{array}

(6)

We use ◦ to denote an outer product, and ×_s indicates multiplication of a tensor with a matrix for mode s. The means μ_j take the values $μ_{j} = E (b_{i j} |Φ) = \frac{1}{α_{0}} Φ_{j} α$ for j = 1, …, p. The second moment function $F_{j t}^{(2)} (y_{i}, Φ)$ is a d_j × d_t matrix, and the third moment function $F_{j s t}^{(3)} (y_{i}, Φ)$ is a d_j × d_s × d_t tensor. We set d_j = 1 when variable j is noncategorical. The matrix $Λ^{(2)}$ is a k × k diagonal matrix with hth diagonal entry $λ_{h}^{(2)} = α_{h} / [α_{0} (α_{0} + 1)]$ and zeros elsewhere. $Λ^{(3)}$ is a k × k × k three-way tensor with diagonal entries $λ_{h}^{(3)} = 2 α_{h} / [α_{0} (α_{0} + 1) (α_{0} + 2)]$ for h = 1, …, k and zeros elsewhere.
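Because $Λ^{(2)}$ and $Λ^{(3)}$ are diagonal, the last terms of Equations (5) and (6) are sums of k rank-one terms in the columns of Φ_j, Φ_s, and Φ_t. The sketch below checks this numerically with NumPy; the dimensions and the randomly drawn parameter values are hypothetical and serve only to illustrate the notation.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
alpha = np.array([0.2, 0.5, 1.3])
a0 = alpha.sum()

lam2 = alpha / (a0 * (a0 + 1))                  # diagonal of Lambda^(2)
lam3 = 2 * alpha / (a0 * (a0 + 1) * (a0 + 2))   # diagonal of Lambda^(3)
Lam3 = np.zeros((k, k, k))
Lam3[np.arange(k), np.arange(k), np.arange(k)] = lam3

# Hypothetical mean parameters: d_j = 4 (categorical), d_s = 1, d_t = 2
Phi_j = rng.dirichlet(np.ones(4), size=k).T     # (4, k), columns on the simplex
Phi_s = rng.normal(size=(1, k))                 # noncategorical, scalar means
Phi_t = rng.dirichlet(np.ones(2), size=k).T     # (2, k)

# Second-order term: Phi_j Lambda^(2) Phi_t^T versus a sum of rank-one matrices
M2 = Phi_j @ np.diag(lam2) @ Phi_t.T
M2_rank1 = sum(lam2[h] * np.outer(Phi_j[:, h], Phi_t[:, h]) for h in range(k))

# Third-order term: mode products versus a sum of rank-one tensors
M3 = np.einsum('abc,ja,sb,tc->jst', Lam3, Phi_j, Phi_s, Phi_t)
M3_rank1 = np.einsum('h,jh,sh,th->jst', lam3, Phi_j, Phi_s, Phi_t)

print(np.allclose(M2, M2_rank1), np.allclose(M3, M3_rank1), M3.shape)
```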

The following theorem states that, at the true parameter value, the expectations of the moment functions are zero. The proof can be found in Appendix A.

Theorem 1 (Moment conditions in MELD). The expectations of the second moment matrix $F_{j t}^{(2)} (y_{i}, Φ)$ and third moment tensor $F_{j s t}^{(3)} (y_{i}, Φ)$ defined in Equations (5) and (6) are zero at the true model parameter values $Φ_{0}$:

E [F_{j t}^{(2)} (y_{i}, Φ_{0})] = 0, E [F_{j s t}^{(3)} (y_{i}, Φ_{0})] = 0,

with the expectations taken with respect to y_i.
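The second-moment part of Theorem 1 can be checked by Monte Carlo. The sketch below, under assumed parameter values, simulates one Poisson variable and one categorical variable from the generative model and averages the second moment function of Equation (5) at the true parameters; the helper `draw_index` and all numerical settings are illustrative. The sample average should be close to zero up to simulation noise.

```python
import numpy as np

def draw_index(probs, rng):
    """Draw one index per row of `probs` by inverse CDF."""
    cdf = np.cumsum(probs, axis=1)
    u = rng.random((probs.shape[0], 1))
    return np.minimum((u > cdf).sum(axis=1), probs.shape[1] - 1)

rng = np.random.default_rng(2)
n = 200_000
alpha = np.array([0.3, 0.7])
a0 = alpha.sum()
Phi_j = np.array([[2.0, 10.0]])                          # Poisson means, d_j = 1
Phi_t = np.array([[0.7, 0.1], [0.2, 0.2], [0.1, 0.7]])   # categorical, d_t = 3

x = rng.dirichlet(alpha, size=n)
m_j, m_t = draw_index(x, rng), draw_index(x, rng)        # separate memberships
b_j = rng.poisson(Phi_j[0, m_j])[:, None].astype(float)  # (n, 1)
b_t = np.eye(3)[draw_index(Phi_t.T[m_t], rng)]           # one-hot coding, (n, 3)

mu_j = Phi_j @ alpha / a0                                # population means
mu_t = Phi_t @ alpha / a0
Lam2 = np.diag(alpha / (a0 * (a0 + 1)))

cross = np.einsum('ia,ib->ab', b_j, b_t) / n             # mean of b_ij outer b_it
F2_mean = cross - a0 / (a0 + 1) * np.outer(mu_j, mu_t) - Phi_j @ Lam2 @ Phi_t.T
print(np.abs(F2_mean).max())                             # small, up to Monte Carlo error
```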

3.2. Two Stage Optimal Estimation

To fix notation, we first restate Hansen’s two stage GMM procedure as:

(1) estimate $\hat{θ} = \arg \min_{θ} Q_{n} (θ; I)$;
(2) given $\hat{θ}$, calculate S_n and set $A_{n} = S_{n}^{- 1}$;
(3) compute $\hat{θ} = \arg \min_{θ} Q_{n} (θ; A_{n})$ as the final parameter estimate.
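The three steps can be illustrated on a toy over-determined problem that is unrelated to MELD itself: estimating a Poisson mean θ from two moment conditions, E[y] = θ and E[y²] = θ + θ². This is a sketch only; the grid search stands in for a proper optimizer, and the function names are hypothetical.

```python
import numpy as np

def moments(y, theta):
    """Per-sample moment functions; both have mean zero at the true theta."""
    return np.stack([y - theta, y**2 - theta - theta**2], axis=1)

def q_n(y, theta, A):
    """Quadratic form f_n(theta)^T A f_n(theta) of Equation (4)."""
    f_n = moments(y, theta).mean(axis=0)
    return f_n @ A @ f_n

def argmin_on_grid(y, A, grid):
    return grid[np.argmin([q_n(y, th, A) for th in grid])]

rng = np.random.default_rng(3)
y = rng.poisson(4.0, size=5000).astype(float)
grid = np.linspace(0.5, 10.0, 1901)               # crude grid search, step 0.005

theta_1 = argmin_on_grid(y, np.eye(2), grid)      # step 1: identity weight
S_n = np.cov(moments(y, theta_1), rowvar=False)   # step 2: estimate S_n at theta_1
A_n = np.linalg.inv(S_n)
theta_2 = argmin_on_grid(y, A_n, grid)            # step 3: reweighted estimate
print(theta_1, theta_2)
```

Both estimates should land near the true value of 4; the second step reweights the two moment conditions according to their estimated variability.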

We specify two versions of MELD, which we denote as MELD-2 and MELD-3, depending on whether we use only the second moment matrices or both the second moment matrices and the third moment tensors for estimation.

We now define the second and third moment vectors. We denote as $f^{(2)} (y_{i}, Φ)$ the second moment vector, which is constructed by stacking the second moment matrices in Equation (5) as follows:

\begin{array}{l} f^{(2)} (y_{i}, Φ) = (vec {[F_{12}^{(2)} (y_{i}, Φ)]}^{T}, …, vec {[F_{1 p}^{(2)} (y_{i}, Φ)]}^{T}, \\ vec {[F_{23}^{(2)} (y_{i}, Φ)]}^{T}, …, vec {[F_{2 p}^{(2)} (y_{i}, Φ)]}^{T}, …, \\ {vec {[F_{p - 1, p}^{(2)} (y_{i}, Φ)]}^{T})}^{T} . \end{array}

(7)

As the moment matrix is symmetric in the sense that $F_{j t}^{(2)} (y_{i}, Φ) = {[F_{t j}^{(2)} (y_{i}, Φ)]}^{T}$ we only consider the case where j < t. Assuming d levels for each element of y_i results in a moment vector of dimension p(p – 1)d²/2. The third moment vector $f^{(3)} (y_{i}, Φ)$ is constructed by stacking the second moment matrices and third moment tensors in Equations (5) and (6)

\begin{array}{l} f^{(3)} (y_{i}, Φ) = (vec {[F_{12}^{(2)} (y_{i}, Φ)]}^{T}, …, vec {[F_{p - 1, p}^{(2)} (y_{i}, Φ)]}^{T}, \\ vec {[F_{123}^{(3)} (y_{i}, Φ)]}^{T}, …, vec {[F_{12 p}^{(3)} (y_{i}, Φ)]}^{T}, \\ vec {[F_{134}^{(3)} (y_{i}, Φ)]}^{T}, …, vec {[F_{13 p}^{(3)} (y_{i}, Φ)]}^{T} …, \\ {vec {[F_{p - 2, p - 1, p}^{(3)} (y_{i}, Φ)]}^{T})}^{T} . \end{array}

(8)

The moment matrices $F_{j t}^{(2)} (y_{i}, Φ)$ follow the same order as in $f^{(2)} (y_{i}, Φ)$ . The third moment tensors $F_{j s t}^{(3)} (y_{i}, Φ)$ are ordered such that the index on the right runs faster than the index on the left in the subscript. The third moment tensor is symmetric with respect to any permutations of its indices, so we include only the moment tensors with j < s < t. The dimension of this moment vector is $p (p - 1) d^{2} / 2 + [p^{3} - 3 p (p - 1) - p] d^{3} / 6$ with the assumption of d levels for each element of y_i.
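The two dimension formulas are simply the numbers of pairs j < t and triples j < s < t multiplied by d² and d³, respectively. A small check, assuming a common number of levels d for every variable as in the text (the function name is illustrative):

```python
from itertools import combinations

def moment_vector_dims(p, d):
    """Lengths of the MELD-2 and MELD-3 moment vectors for p variables with d levels each."""
    dim2 = p * (p - 1) // 2 * d**2
    dim3 = dim2 + (p**3 - 3 * p * (p - 1) - p) // 6 * d**3
    return dim2, dim3

p, d = 5, 3
n_pairs = len(list(combinations(range(p), 2)))     # pairs with j < t
n_triples = len(list(combinations(range(p), 3)))   # triples with j < s < t
dim2, dim3 = moment_vector_dims(p, d)
print(dim2, dim3)  # 90 360
```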

Based on the two moment vector versions, we define the following two quadratic functions used for our GMM estimation:

Q_{n}^{(2)} (Φ; A_{n}^{(2)}) = f_{n}^{(2)} {(Φ)}^{T} A_{n}^{(2)} f_{n}^{(2)} (Φ),

(9)

Q_{n}^{(3)} (Φ; A_{n}^{(3)}) = f_{n}^{(3)} {(Φ)}^{T} A_{n}^{(3)} f_{n}^{(3)} (Φ),

(10)

where $f_{n}^{(2)} (Φ)$ and $f_{n}^{(3)} (Φ)$ are sample estimates of the expectations of the moment vectors

f_{n}^{(2)} (Φ) = \frac{1}{n} \sum_{i = 1}^{n} f^{(2)} (y_{i}, Φ), f_{n}^{(3)} (Φ) = \frac{1}{n} \sum_{i = 1}^{n} f^{(3)} (y_{i}, Φ),

and $A_{n}^{(2)}$ and $A_{n}^{(3)}$ are two positive semidefinite matrices. We emphasize that when calculating $f_{n}^{(2)} (Φ)$, the μ_j and μ_t in $f^{(2)} (y_{i}, Φ)$ are replaced by their sample estimates ${\hat{μ}}_{j} = \frac{1}{n} \sum_{i = 1}^{n} b_{i j}$ and ${\hat{μ}}_{t} = \frac{1}{n} \sum_{i = 1}^{n} b_{i t}$, instead of their parametric counterparts $\frac{1}{α_{0}} Φ_{j} α$ and $\frac{1}{α_{0}} Φ_{t} α$. In doing this, the second moment matrix $F_{j t}^{(2)} (y_{i}, Φ)$ becomes a function of Φ only through $Φ_{j} Λ^{(2)} Φ_{t}^{T}$, which is a sum of rank-one terms in ϕ_jh and ϕ_th. This substitution facilitates development of a fast coordinate descent algorithm for parameter estimation. The same substitution is used when calculating $f_{n}^{(3)} (Φ)$.

In the first stage, we set the weight matrix to an identity matrix. Then, the quadratic functions in Equations (9) and (10) can be rewritten as follows:

Q_{n}^{(2)} (Φ, I) = \sum_{j = 1}^{p - 1} \sum_{t = j + 1}^{p} {‖F_{n, j t}^{(2)} (Φ)‖}_{F}^{2},

\begin{array}{l} Q_{n}^{(3)} (Φ, I) = \sum_{j = 1}^{p - 1} \sum_{t = j + 1}^{p} {‖F_{n, j t}^{(2)} (Φ)‖}_{F}^{2} \\ + \sum_{j = 1}^{p - 2} \sum_{s = j + 1}^{p - 1} \sum_{t = s + 1}^{p} {‖F_{n, j s t}^{(3)} (Φ)‖}_{F}^{2}, \end{array}

where $F_{n, j t}^{(2)} (Φ)$ and $F_{n, j s t}^{(3)} (Φ)$ are defined as

F_{n, j t}^{(2)} (Φ) = \frac{1}{n} \sum_{i = 1}^{n} F_{j t}^{(2)} (y_{i}, Φ), F_{n, j s t}^{(3)} (Φ) = \frac{1}{n} \sum_{i = 1}^{n} F_{j s t}^{(3)} (y_{i}, Φ),

with μ_j, μ_s, and μ_t replaced by ${\hat{μ}}_{j}$, ${\hat{μ}}_{s}$, and ${\hat{μ}}_{t}$, respectively, in $F_{j t}^{(2)} (y_{i}, Φ)$ and $F_{j s t}^{(3)} (y_{i}, Φ)$. Here ${‖\cdot‖}_{F}^{2}$ denotes the squared Frobenius norm, that is, the element-wise sum of squares.

We obtain a first-stage estimator of Φ by minimizing the quadratic forms using Newton-Raphson. Note that, after we substitute μ_j by its sample estimate, only the last term of $F_{n, j t}^{(2)} (Φ)$ and $F_{n, j s t}^{(3)} (Φ)$ involves the unknown parameter Φ. For simplicity, we denote

E_{n, j t}^{(2)} = F_{n, j t}^{(2)} (Φ) + Φ_{j} Λ^{(2)} Φ_{t}^{T},

E_{n, j s t}^{(3)} = F_{n, j s t}^{(3)} (Φ) + Λ^{(3)} \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t} .

The two quantities $E_{n, j t}^{(2)}$ and $E_{n, j s t}^{(3)}$ can be computed directly from the samples. We optimize ϕ_jh with other mean parameters fixed. After calculating the gradient and Hessian of $Q_{n}^{(2)} (ϕ_{j h}, I)$ , the update rule simply becomes

ϕ_{j h}^{s} = \frac{{\sum_{t = 1, t \neq j}^{p} ({\bar{E}}_{n, j t}^{(2)} ϕ_{t h})}^{T}}{(λ_{h}^{(2)}) \sum_{t = 1, t \neq j}^{p} ϕ_{t h}^{T} ϕ_{t h}},

(11)

where ${\bar{E}}_{n, j t}^{(2)} = E_{n, j t}^{(2)} - \sum_{h^{'} \neq h} λ_{h^{'}}^{(2)} ϕ_{j h^{'}} \circ ϕ_{t h^{'}}$ and $λ_{h}^{(2)}$ is the hth diagonal entry of $Λ^{(2)}$ .
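A single coordinate update following Equation (11) might be written as below. This is a sketch under assumptions: parameters are stored as a list of (d_j, k) arrays, the sample quantities $E_{n, j t}^{(2)}$ are stored in a dictionary keyed by (j, t), and the demonstration uses noise-free moments built from known parameters so that the true column is recovered exactly. None of these storage choices come from the original text.

```python
import numpy as np

def update_phi_jh(j, h, Phi, E, lam2):
    """One coordinate update of column h of Phi[j] following Equation (11).

    Phi  : list of (d_j, k) arrays of current mean parameters
    E    : dict mapping (j, t), j != t, to the (d_j, d_t) matrix E_{n,jt}^{(2)}
    lam2 : (k,) diagonal of Lambda^(2)
    """
    num = np.zeros(Phi[j].shape[0])
    den = 0.0
    for t in range(len(Phi)):
        if t == j:
            continue
        # E-bar: subtract the contribution of every component except h
        others = (Phi[j] * lam2) @ Phi[t].T - lam2[h] * np.outer(Phi[j][:, h], Phi[t][:, h])
        E_bar = E[(j, t)] - others
        num += E_bar @ Phi[t][:, h]
        den += Phi[t][:, h] @ Phi[t][:, h]
    return num / (lam2[h] * den)

rng = np.random.default_rng(4)
k, dims = 2, [3, 1, 1, 4]
alpha = np.array([0.4, 0.6])
lam2 = alpha / (alpha.sum() * (alpha.sum() + 1))
Phi_true = [rng.random((d, k)) for d in dims]
# Noise-free "sample" moments, so the true parameters are a fixed point
E = {(j, t): (Phi_true[j] * lam2) @ Phi_true[t].T
     for j in range(4) for t in range(4) if j != t}

Phi = [P.copy() for P in Phi_true]
Phi[0][:, 1] = rng.random(3)                       # perturb one column
Phi[0][:, 1] = update_phi_jh(0, 1, Phi, E, lam2)   # one update restores it
print(np.allclose(Phi[0], Phi_true[0]))
```

With real data the moments are noisy and the updates are cycled over all j and h until the objective stabilizes.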

The update rule for ϕ_jh with $Q_{n}^{(3)} (ϕ_{j h}, I)$ can be calculated as

ϕ_{j h}^{s} = \frac{λ_{h}^{(2)} \sum_{t = 1, t \neq j}^{p} {({\bar{E}}_{n, j t}^{(2)} ϕ_{t h})}^{T} + λ_{h}^{(3)} \sum_{s = 1, s \neq j}^{p} [\sum_{t = 1, t \neq s, t \neq j}^{p} {({\bar{E}}_{n, j s t}^{(3)} \times_{2} ϕ_{s h} \times_{3} ϕ_{t h})}^{T}]}{{(λ_{h}^{(2)})}^{2} \sum_{t = 1, t \neq j}^{p} ϕ_{t h}^{T} ϕ_{t h} + {(λ_{h}^{(3)})}^{2} \sum_{s = 1, s \neq j}^{p} [\sum_{t = 1, t \neq s, t \neq j}^{p} (ϕ_{s h}^{T} ϕ_{s h}) (ϕ_{t h}^{T} ϕ_{t h})]},

(12)

where ${\bar{E}}_{n, j s t}^{(3)} = E_{n, j s t}^{(3)} - \sum_{h^{'} \neq h} λ_{h^{'}}^{(3)} ϕ_{j h^{'}} \circ ϕ_{s h^{'}} \circ ϕ_{t h^{'}}$ and $λ_{h}^{(3)}$ is the hth diagonal entry of $Λ^{(3)}$ . The derivations can be found in Appendix E. After updating ϕ_jh using the above equations, we retract ϕ_jh to its probability simplex when y_ij is a categorical variable.
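The text does not specify how the retraction to the simplex is carried out. One common choice, shown here purely as an assumption and not as the authors' method, is the Euclidean projection onto the probability simplex computed by a sort-based algorithm.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = np.sort(v)[::-1]                         # entries in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1)[0][-1]   # last index kept positive
    theta = (css[rho] - 1) / (rho + 1)           # shift applied to all entries
    return np.maximum(v - theta, 0.0)

phi_raw = np.array([1.2, -0.1, 0.1])             # hypothetical unconstrained update
phi_proj = project_to_simplex(phi_raw)
print(phi_proj, phi_proj.sum())
```

A vector that already lies on the simplex is left unchanged by this projection, so the step only affects updates that leave the feasible set.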

We find the optimization procedure to be robust to parameter initialization. Our default initialization sets the parameters to the observed frequencies for categorical variables and to the empirical means for noncategorical variables. We also initialized the parameters of categorical variables to equal weights or a draw from the uniform distribution over the simplex with almost identical results. Although the optimization problem in Equations (9) and (10) is nonconvex, our algorithm tended to find a local minimum after a small number of iterations. Our convergence criterion was a change of less than 1 × 10⁻⁵ in the objective between two successive iterations.

MELD-2 minimizes the quadratic form Q⁽²⁾(Φ, A⁽²⁾), while MELD-3 minimizes Q⁽³⁾(Φ, A⁽³⁾). After an initial consistent estimator of Φ is found, we calculate the asymptotic covariance matrix of the moment vectors S_n and define a new weight matrix A_n = (S_n)⁻¹ for second stage GMM estimation. The form of S_n can be derived analytically for both MELD-2 and MELD-3, and we provide the results in the supplementary materials. The calculation of A_n requires the inversion of a full-rank dense matrix S_n, with computation scaling as O(p²d²) for MELD-2 and O(p³d³) for MELD-3. When the off-diagonal terms are included in the weight matrix, the updating rules become considerably more complicated. In practice, we extract only the diagonal elements of S_n and let A_n = 1/diag(S_n) in the second stage. This approximation has been used in previous GMM implementations (Jöreskog and Sörbom 1987). The gradient descent update equations can be found by slightly modifying Equations (11) and (12) to include weights.
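The diagonal approximation can be sketched as follows. The paper derives S_n analytically in its supplementary materials; here, as a stand-in assumption, the diagonal of S_n is estimated by the per-coordinate sample variance of the moment functions evaluated at the first-stage estimate. The input array `f_samples` and the function names are hypothetical.

```python
import numpy as np

def diagonal_weights(f_samples, floor=1e-12):
    """A_n = 1 / diag(S_n), with diag(S_n) estimated by per-coordinate sample variance.

    f_samples : (n, l) array; row i is the moment vector f(y_i, Phi_hat)
    """
    s_diag = f_samples.var(axis=0, ddof=1)
    return 1.0 / np.maximum(s_diag, floor)    # floor avoids division by zero

def weighted_objective(f_samples, weights):
    """Quadratic form f_n^T diag(weights) f_n."""
    f_n = f_samples.mean(axis=0)
    return float(np.sum(weights * f_n**2))

rng = np.random.default_rng(5)
# Hypothetical moment values: the second coordinate is ten times noisier
f_samples = rng.normal(loc=[0.1, 0.1], scale=[1.0, 10.0], size=(50_000, 2))
w = diagonal_weights(f_samples)
print(w, weighted_objective(f_samples, w))
```

Noisier moment conditions receive smaller weights, so they contribute less to the second-stage objective.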

Note that the moment functions do not solve the identifiability problems with respect to Φ. When α is a constant vector, any permutation τ of 1, …, k with $Φ_{j} (τ) = (ϕ_{j τ (1)}, \dots, ϕ_{j τ (k)})$ for all j = 1, …, p satisfies the moment condition. This problem is inherited from the label switching problem in mixture models. A similar situation occurs when there are ties in α and the permutation is restricted to each tie. In real applications, we typically do not find ties, and our parameter estimates tend to be close to the true parameter values.

3.2.1. Algorithmic Guidance

In deciding which version of MELD to apply to a particular problem, the main considerations are how to balance robustness to misspecification, statistical efficiency, and computational time. MELD-3 has better statistical efficiency (when the model is correctly specified), but it has a greater computational cost as a function of p, scaling as O(p³), and is less robust to model misspecification. In practice, this tradeoff is somewhat complex, as it is difficult to test to what extent the data comply with the modeling assumptions on the third moment. As a general rule of thumb, we recommend using MELD-2 when the additional computational cost of MELD-3 is prohibitive. If p is modest (as in the political-economic data below), and thus the additional computational burden is small, we would generally recommend MELD-3.

The performance of the one-stage versus two-stage estimator relies on whether the matrix A_n is estimated accurately. If the variance matrix is accurately estimated, the two-stage method will not perform worse than the one-stage method. However, we find that, when the sample size is small or the model is misspecified, the variance can be poorly estimated and the one-stage method can outperform the two stage method. In general, if the sample size is small in comparison to the number of variables, we suggest using the one-stage estimator.

3.3. Properties of Parameter Estimates

We use GMM asymptotic theory to show that parameter estimates in MELD are consistent. We assume the following regularity conditions on the moment vector $f (y_{i}, Φ)$ and the parameter space Θ.

Assumption 1 (Regularity conditions (Assumptions 3.2, 3.9, and 3.10 (Hall 2005))).

(1) $f (y_{i}, Φ)$ is continuous on Θ for all $y_{i} \in Y$; (2) $E [f (y_{i}, Φ)] < \infty$ and is continuous for $Φ \in Θ$; (3) Θ is compact and $E [\sup_{Φ \in Θ} ‖f (y_{i}, Φ)‖] < \infty$.

Remark 1. Conditions (1) and (2) are satisfied in MELD. Condition (3) is satisfied automatically for categorical variables, noting that $ϕ_{j h} \in Δ^{d_{j} - 1} \subset ℝ^{d_{j}}$ is compact. For noncategorical variables, we can restrict support to a large compact domain such as closed intervals in $ℝ$ without sacrificing practical performance.

With these conditions, we further assume that the weight matrix A_n converges to a positive definite matrix A in probability. We define the population analogs of the quadratic functions as

Q_{0} (Φ; A) = E {[f (y_{i}, Φ)]}^{T} A E [f (y_{i}, Φ)] .

(13)

We have the following lemma showing the uniform convergence of $Q_{n} (Φ; A_{n})$ .

Lemma 1 (Uniform convergence (Lemma 3.1 (Hall 2005))). Under the regularity conditions in Assumption 1

\sup_{Φ \in Θ} |Q_{n} (Φ; A_{n}) - Q_{0} (Φ; A)| \overset{p}{\to} 0.

Theorem 2 (Consistency). Under the same conditions in Lemma 1, the estimator $\hat{Φ}$ that minimizes Q_n(Φ;A_n) in MELD-2 and MELD-3 converges to the true parameter Φ₀ in probability.

The proof can be found in Appendix B.

The asymptotic normality of $\hat{Φ}$ can also be established by assuming the following conditions on $\partial f (y_{i}, Φ) / \partial Φ$ .

Assumption 2 (Conditions on $\partial f (y_{i}, Φ) / \partial Φ$ (Assumptions 3.5, 3.12, and 3.13. (Hall 2005))).

(1) $\partial f (y_{i}, Φ) / \partial Φ$ exists and is continuous on Θ for all $y_{i} \in Y$ ; (2) Φ₀ is an interior point of Θ; (3) $E [\partial f (y_{i}, Φ) / \partial Φ] = G_{0} (Φ) < \infty$ ; (4) G₀(Φ) is continuous on some neighborhood $N_{ϵ}$ of Φ₀; (5) the sample estimate G_n(Φ) uniformly converges to G₀(Φ).

Remark 2. We derive $\partial f (y_{i}, Φ) / \partial Φ$ for both MELD-2 and MELD-3 in Appendix D. Conditions 1, 3, and 4 are satisfied in MELD. Condition 5 can be shown using the continuity of the derivative and the compactness of Θ.

Theorem 3 (Asymptotic normality). With Assumptions 1 and 2, we have

n^{1 / 2} (vec (\hat{Φ}) - vec (Φ_{0})) \overset{d}{\to} N (0, M S M^{T})

with

M = {[{(G_{0})}^{T} A G_{0}]}^{- 1} {(G_{0})}^{T} A,

where $G_{0} = E [\partial f (y_{i}, Φ) / \partial Φ] |_{Φ = Φ_{0}}$ and $S = \lim_{n \to \infty} V a r [n^{1 / 2} f_{n} (Φ_{0})]$ .

The proof can be found in Appendix C. The optimal estimator is obtained when the weight matrix satisfies A_n → A = S⁻¹ (Hansen 1982).
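To see what the optimal weight matrix buys, one can substitute A = S⁻¹ into the covariance of Theorem 3. The following is the standard GMM simplification, written here as a worked step under the added assumptions that S is invertible and G₀ has full column rank:

M S M^{T} = {[{(G_{0})}^{T} S^{- 1} G_{0}]}^{- 1} {(G_{0})}^{T} S^{- 1} S S^{- 1} G_{0} {[{(G_{0})}^{T} S^{- 1} G_{0}]}^{- 1} = {[{(G_{0})}^{T} S^{- 1} G_{0}]}^{- 1} .

For any other positive definite A, the difference between $M S M^{T}$ and this matrix is positive semidefinite, which is the sense in which the choice is optimal (Hansen 1982).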

3.4. Model Selection Using Goodness of Fit Tests

We use goodness-of-fit tests to choose the number of latent components k. With the optimal weight matrix A_n → (S)⁻¹, Wald-, Lagrange multiplier- (score), and likelihood ratio-type tests can be constructed. We could construct a sequence of test statistics with increasing k to quantify the goodness-of-fit in MELD. However, this approach requires the calculation of the optimal weight matrix and large matrix inversions. Instead, we quantify goodness of fit using a fitness index (FI) (Bentler 1983). Practically, we show in an extensive simulation study that the FI has low error in estimating k in MELD.

The FI is defined based on the value of the objective function evaluated at the parameter estimate (Bentler 1983)

FI = 1 - \frac{Q_{n} (\hat{Φ}, A_{n})}{{(e_{n})}^{T} A_{n} e_{n}},

(14)

where e_n is the concatenation of $vec (E_{n, j t}^{(2)})$ with j < t for MELD-2 and the concatenation of both $vec (E_{n, j t}^{(2)})$ with j < t and $vec (E_{n, j s t}^{(3)})$ with j < s < t for MELD-3. This FI may be constructed for any weight matrix A_n. It can be viewed as a normalized objective function; thus, the FI has values less than one. Larger values of the FI indicate a better fit.
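As a minimal sketch of Equation (14), the following Python snippet computes the FI from a stacked residual moment vector. It assumes that Q_n is the sample analog of Equation (13), that is, $Q_{n} = f_{n}^{T} A_{n} f_{n}$ ; the function name `fitness_index` and the toy numbers are hypothetical and are not taken from the MELD implementation.

```python
import numpy as np

def fitness_index(f_n, e_n, A_n=None):
    """FI = 1 - Q_n / (e_n^T A_n e_n), assuming Q_n = f_n^T A_n f_n."""
    f_n = np.asarray(f_n, dtype=float)
    e_n = np.asarray(e_n, dtype=float)
    if A_n is None:
        A_n = np.eye(f_n.size)  # identity weight, as in first-stage estimation
    q_n = f_n @ A_n @ f_n       # objective value at the parameter estimate
    baseline = e_n @ A_n @ e_n  # same quadratic form applied to the empirical moments
    return 1.0 - q_n / baseline

# Toy numbers (hypothetical): residual moments that are small relative to e_n.
e_n = np.array([0.30, 0.20, 0.10, 0.40])
f_n = np.array([0.03, -0.02, 0.01, 0.00])
fi = fitness_index(f_n, e_n)
```

Because the quadratic form in the numerator is nonnegative for a positive definite A_n, the value can never exceed one, and it approaches one as the fitted moments match the empirical moments.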

4. Simulation Study

In this section, we evaluate the accuracy and run time of MELD-2 and MELD-3 in simulations with categorical and mixed data types. In the first stage, an identity weight matrix is used. In the second stage, we set A_n = 1/diag[(S_n)]. We denote MELD-2-j and MELD-3-j as the procedure using up to the jth stage of MELD-2 and MELD-3, respectively, for j = 1, 2.
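The second-stage diagonal weight can be sketched as follows. This is an illustration only: it assumes that S_n is estimated by the sample variance of the per-observation moment vectors evaluated at the first-stage estimate, and the small floor `eps` is an added safeguard, not part of the method described here. The function name `diagonal_weight` is hypothetical.

```python
import numpy as np

def diagonal_weight(F, eps=1e-8):
    """F is an (n, q) array whose i-th row is the moment vector f(y_i, Phi_hat)."""
    F = np.asarray(F, dtype=float)
    s_diag = F.var(axis=0, ddof=1)  # diagonal of the sample covariance (assumed S_n)
    # eps is a hypothetical guard against division by a zero variance
    return np.diag(1.0 / np.maximum(s_diag, eps))

# Toy moment contributions with different scales in each coordinate.
rng = np.random.default_rng(0)
F = rng.normal(scale=[1.0, 2.0, 0.5], size=(200, 3))
A_n = diagonal_weight(F)
```

Coordinates whose moment contributions are noisier receive smaller weights, which is also why a poor variance estimate at small n can hurt the second stage.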

4.1. Categorical Data

For a categorical data simulation, we considered a low-dimensional setting (p = 20) so that both second- and third-order moment functions may be efficiently calculated. A simulation with a moderate dimension (p = 100) is summarized in the supplementary materials.

We assumed each of the 20 categorical variables has d = 4 levels. We set the number of components to k = 3 and generated ϕ_jh from Dir(0.5, 0.5, 0.5, 0.5) with h = 1, …, k; α was set to (0.1, 0.1, 0.1)^T. We drew n = {50, 100, 200, 500, 1000, 10000} samples from the model in Equation (1) with g_j (ϕ_jh) a multinomial distribution. For each value of n, we generated ten independent datasets. We ran MELD for different values of k = {1, …, 5}. The FI criterion consistently chose the correct number of latent components k in first stage estimation (Table 1). For second-stage estimation, FI did not perform as well. The trajectories of the objective functions under different values of k are shown in Figure S1. Convergence of parameter estimates for the ten simulated datasets under n = 1000 and k = 3 is shown in Figure S2. MELD-2 converged in about 25 iterations and MELD-3 converged in about 10 iterations.
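The data-generating process of this simulation can be sketched in a few lines. The sketch assumes the usual mixed membership reading of the model, in which each subject draws proportions x_i from the Dirichlet distribution, each variable draws its own membership m_ij from x_i, and y_ij is then drawn from the corresponding ϕ_jh; the function name `simulate_categorical` and its arguments are hypothetical.

```python
import numpy as np

def simulate_categorical(n, p=20, d=4, k=3, alpha=0.1, phi_conc=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # phi[j, h] is the level distribution of variable j under component h
    phi = rng.dirichlet(np.full(d, phi_conc), size=(p, k))  # shape (p, k, d)
    # x[i] holds the mixture proportions of subject i
    x = rng.dirichlet(np.full(k, alpha), size=n)             # shape (n, k)
    y = np.empty((n, p), dtype=int)
    for i in range(n):
        m = rng.choice(k, size=p, p=x[i])  # membership m_ij for each variable
        for j in range(p):
            y[i, j] = rng.choice(d, p=phi[j, m[j]])
    return y, x, phi

y, x, phi = simulate_categorical(n=50)
```

Levels are coded 0, …, d − 1 here. With α = 0.1 most subjects load almost entirely on one component, so the simulated data are close to a finite mixture with occasional mixed membership.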

Table 1.

Goodness-of-fit tests using the fitness index (FI) in categorical simulation. Larger values of FI indicate better fit, with the maximum at one. Results shown are based on ten simulated datasets for each value of n. Standard deviations of FI are provided in parentheses. Bold type indicates the best FI across k.

n	k	MELD-2–1	MELD-2–2	MELD-3–1	MELD-3–2
50	1	0.824(0.020)	−1.865(4.950)	0.585(0.031)	−27.987(40.128)
	2	0.908(0.011)	−1.285(2.630)	0.745(0.032)	−27.240(54.415)
	3	0.930(0.004)	0.570(0.044)	0.775(0.022)	−27.713(42.607)
	4	0.901(0.010)	0.588(0.032)	0.735(0.023)	0.254(0.090)
	5	0.795(0.033)	0.686(0.021)	0.653(0.024)	0.305(0.091)
100	1	0.860(0.012)	−0.521(2.142)	0.651(0.021)	−0.031(0.703)
	2	0.930(0.011)	0.282(0.644)	0.795(0.030)	−4.457(8.924)
	3	0.960(0.005)	0.677(0.044)	0.851(0.009)	−4.868(11.889)
	4	0.942(0.012)	0.691(0.042)	0.822(0.010)	0.225(0.019)
	5	0.863(0.046)	0.782(0.022)	0.768(0.044)	0.232(0.080)
200	1	0.869(0.012)	0.679(0.060)	0.682(0.021)	0.298(0.070)
	2	0.940(0.007)	0.699(0.047)	0.838(0.014)	0.306(0.054)
	3	0.980(0.001)	0.761(0.061)	0.919(0.004)	0.278(0.049)
	4	0.967(0.006)	0.780(0.019)	0.891(0.008)	0.287(0.050)
	5	0.911(0.017)	0.824(0.008)	0.864(0.015)	0.286(0.042)
500	1	0.882(0.007)	0.783(0.022)	0.713(0.012)	0.414(0.080)
	2	0.948(0.006)	0.799(0.019)	0.870(0.013)	0.427(0.080)
	3	0.992(<0.001)	0.884(0.024)	0.966(0.001)	0.388(0.073)
	4	0.983(0.004)	0.874(0.026)	0.938(0.005)	0.365(0.065)
	5	0.937(0.014)	0.894(0.006)	0.921(0.007)	0.353(0.042)
1000	1	0.888(0.003)	0.828(0.006)	0.729(0.008)	0.571(0.017)
	2	0.951(0.003)	0.855(0.009)	0.881(0.008)	0.615(0.030)
	3	0.996(<0.001)	0.950(0.005)	0.982(0.001)	0.609(0.031)
	4	0.989(0.002)	0.938(0.004)	0.961(0.003)	0.579(0.030)
	5	0.953(0.010)	0.932(0.006)	0.951(0.006)	0.550(0.034)
10000	1	0.920(0.001)	0.853(0.003)	0.812(<0.001)	0.582(0.009)
	2	0.980(<0.001)	0.914(0.002)	0.940(<0.001)	0.725(0.005)
	3	0.999(<0.001)	0.992(0.001)	0.998(<0.001)	0.900(0.008)
	4	0.990(0.003)	0.977(0.004)	0.981(<0.001)	0.861(0.008)
	5	0.885(0.005)	0.929(0.002)	0.981(<0.001)	0.856(0.008)

We compared MELD with the simplex factor model (SFM) (Bhattacharya and Dunson 2012) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan 2003). For the LDA model, we used the topicmodels package in R (Hornik and Grün 2011) with both collapsed Gibbs sampling and variational EM. In our notation, y_ij is a word and y_i is a document. Our generative model is more flexible than LDA in that each variable was allowed to have its own k component distributions with parameters ϕ_j1, …, ϕ_jk. For SFM and LDA, we ran 10,000 steps of MCMC with fixed k and a burn-in of 5000 iterations. We collected every 50th sample to thin the chain. For LDA variational EM, we used the package defaults. Dirichlet parameters were set to α = 0.1 and β = 0.5 for mixture proportions and topic distributions, respectively. Mean squared errors (MSEs) were calculated for each variable y_ij by recovering the associated membership variable m_ij from the estimated parameters, calculating the estimate of y_ij using the estimated Φ_j parameter for membership variable m_ij, and then computing the MSE of that estimated y_ij with the true value (see supplementary materials). MSEs for different values of k are shown in Figure 1, and the running times of different methods are reported in Table S1. The LDA implementation in topicmodels does not allow k = 1, so we did not consider this case. The results show that MELD-2–1 was the most accurate and the fastest in most cases. MELD-3–2 did not perform well when n was small (i.e., n = 50), but estimation accuracy improved as n increased. SFM had comparable MSEs when n was small. However, when n = 500, 1000, or 10,000, MELD outperformed SFM. The model we simulated from is more flexible than the standard LDA model, so it was not surprising that LDA did not perform as well as MELD and SFM.

Figure 1. Comparison of mean squared error (MSE) of estimated parameters in categorical simulations. For SFM and LDA with Gibbs sampling, posterior means of parameters were calculated using 100 posterior draws on each of the ten simulated datasets. MSEs were calculated for each variable y_ij by recovering the associated membership variable m_ij from the estimated parameters, calculating the estimate of y_ij using the estimated Φ_j parameter for membership variable m_ij, and then computing the MSE of that estimated y_ij with the true value. Details of MSEs and their standard deviations are in Supplementary Tables S2–S4.

To provide a comparison more favorable to LDA, we generated samples from the LDA model. This is equivalent to setting $ϕ_{1 h} = \dots = ϕ_{p h}$ for h = 1, …, k. We set the same values for α and k, and we drew ϕ_jh from Dir(0.5, 0.5, 0.5, 0.5). The MSEs with different contamination levels (see below) are shown in Tables S5–S7. When there was no contamination and n was small (i.e., n = 50 or 100), LDA had the highest accuracy in parameter estimation. However, when the sample size was large or there was contamination, MELD outperformed LDA.

We also found that MELD using two-stage estimation had larger MSEs compared to MELD using one-stage estimation. This was especially true when the number of observations was small or the number of latent components was misspecified. FIs for two-stage estimation were less accurate. The loss in accuracy appeared to be a result of poorly estimating the diagonal weight matrix. We confirmed this hypothesis by calculating mean squared distances between true and estimated diagonal weight matrices for different n and k (Table S8; Figure S3).

We further evaluated performance in the presence of contamination. For each simulated dataset, we randomly replaced a proportion of observations (4% and 10%) with draws from a discrete uniform distribution. MSEs under different values of k are shown in Figure 1. With 4% contamination, MELD had the most accurate parameter estimation in almost all cases. MELD-2–1 performed the best, followed by MELD-3–1. When we increased contamination to 10%, MELD had the lowest MSE in all cases. MELD-2–1 performed best, followed by MELD-2–2. MELD-2 consistently performed better than MELD-3, as expected given the robustness of using only low-order moments.
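The contamination step can be sketched as below. The sketch assumes that contamination is applied to individual entries y_ij rather than to whole rows, and that the uniform draw is over the d levels of the variable; both are assumptions of this illustration, and the function name `contaminate` is hypothetical.

```python
import numpy as np

def contaminate(y, proportion, d, seed=0):
    """Replace a given proportion of the entries of y with uniform draws from {0, ..., d-1}."""
    rng = np.random.default_rng(seed)
    y_noisy = np.array(y, copy=True)
    n_replace = int(round(proportion * y_noisy.size))
    # choose distinct entries so that exactly n_replace positions are redrawn
    idx = rng.choice(y_noisy.size, size=n_replace, replace=False)
    y_noisy.flat[idx] = rng.integers(0, d, size=n_replace)
    return y_noisy

y_clean = np.zeros((100, 20), dtype=int)
y_4pct = contaminate(y_clean, 0.04, d=4)
```

A redrawn entry can coincide with its original value, so the fraction of entries that actually change is on average somewhat below the nominal contamination level.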

To evaluate the performance of MELD in terms of selecting the correct number of components, we varied the number of components k = {6, 7, 8, 9, 10} and generated 10,000 observations using the same simulation model. We found that MELD-3 was able to estimate the correct number of components for k ≤ 8, MELD-2–2 was accurate for k ≤ 7 and MELD-2–2 for k = 6 (see Tables S9–S11). For k ≥ 9, none of the variants of MELD were able to correctly estimate the number of components (see Tables S12 and S13).

4.2. Mixed Data Types

We considered a simulation setting mimicking applications in which DNA sequence variations influence a quantitative trait. An additional simulation combining categorical, Gaussian, and Poisson variables is summarized in the Supplementary Materials. We generated a sequence of nucleotides {A, C, G, T} at 50 genetic loci along with a continuous or integer-valued trait, resulting in p = 51 variables. We set k = 2 latent components and simulated n = 1000 individuals, with the first 500 from the first component and the last 500 from the second component. The continuous traits were drawn from $N (- 3, 1)$ and $N (3, 1)$ , while count traits were drawn from Poisson(5) and Poisson(10), respectively, for the two components. We chose eight loci J = {2, 4, 12, 14, 32, 34, 42, 44} to be associated with the trait. The multinomial parameters for each of the two components were randomly drawn from Dir(0.5, 0.5, 0.5, 0.5). The association between the trait and the eight loci is given by marginalizing the component variables. The distribution for the remaining 42 loci was fixed across individuals independently of the trait as multinomial with parameters generated from Dir(0.5, 0.5, 0.5, 0.5). Ten datasets were simulated from the model in Equation (1). To assess robustness, we added contamination (e.g., genotyping errors) by randomly replacing 4%, 10%, and 20% of the nucleotides with values uniformly generated from {A, C, G, T}.

We ran MELD-2–1 and chose the number of components k = {1, …, 5}. The fitness test FI chose the correct value of k on all ten datasets (Table S14). For each genomic locus, we calculated its marginal frequency according to the simulated data, and then computed the averaged Kullback–Leibler (KL) distance between the estimated component distributions and the marginal frequency as follows:

\begin{array}{l} aveKL (y_{i j}) = \frac{1}{k} \sum_{h = 1}^{k} \sum_{c_{j} = 1}^{d_{j}} \Pr (y_{i j} = c_{j} |m_{i j} = h) \\ \times \log (\frac{\Pr (y_{i j} = c_{j} |m_{i j} = h)}{\Pr (y_{i j} = c_{j})}) . \end{array}

(15)

A smaller averaged KL distance suggests that the component distributions were closer to the marginal distribution, implying that the locus frequency was not differentiated across components. The set J corresponded exactly to the eight loci with largest averaged KL distance in the MELD results in most cases (Table 2).
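Equation (15) translates directly into code. The sketch below takes the estimated component distributions of one variable as the columns of a matrix and its marginal frequencies as a vector; the function name `ave_kl`, the clipping constant `eps`, and the toy inputs are hypothetical choices for illustration.

```python
import numpy as np

def ave_kl(phi_j, marginal, eps=1e-12):
    """Averaged KL distance of Equation (15).

    phi_j:    (d_j, k) matrix; column h is Pr(y_ij = c | m_ij = h).
    marginal: length-d_j vector Pr(y_ij = c).
    """
    # clipping avoids log(0) when an estimated probability is exactly zero
    phi_j = np.clip(np.asarray(phi_j, dtype=float), eps, None)
    marginal = np.clip(np.asarray(marginal, dtype=float), eps, None)
    kl_per_component = np.sum(phi_j * np.log(phi_j / marginal[:, None]), axis=0)
    return kl_per_component.mean()  # average over the k components

marginal = np.array([0.5, 0.5])
flat = ave_kl(np.array([[0.5, 0.5], [0.5, 0.5]]), marginal)   # undifferentiated locus
sharp = ave_kl(np.array([[0.9, 0.1], [0.1, 0.9]]), marginal)  # differentiated locus
```

In this toy case the undifferentiated locus has distance zero, while the locus whose components disagree receives a clearly positive value, which is the ordering used to rank loci.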

Table 2.

Quantitative trait association simulation with 50 nucleotides and one response. Nucleotides not in J = {2, 4, 12, 14, 32, 34, 42, 44} are labeled by an underline and missing nucleotides are crossed out. Results shown are for one of the ten simulated datasets. The complete results can be found in Tables S15 and S16.

Response	Contamination	Q⁽²⁾(Φ) 1st stage	Bayesian copula factor model
Gaussian	0%	{2, 4, 12, 14, 32, 34, 42, 44}	{1, 2, $4$ , 12, 14, 32, 34, 42, 44}
	4%	{2, 4, 12, 14, 32, 34, 42, 44}	{2, $4$ , 5, 12, 14, 32, 35, $34$ , 42, 44}
	10%	{2, 4, 12, 14, 32, 34, 42, 44}	{2, $4$ , 5, 12, 14, 25, 32, $34$ , 42, 44}
	20%	{2, 4, 12, 14, 32, 34, 42, 44}	{1, 2, 4, $12$ , $14$ , 25, 32, 34, 42, 44}
Poisson	0%	{2, 4, 12, 14, 32, 34, 42, 44}	{ $2$ , 3, 4, 12, $14$ , 32, 34, 37, 42, 44}
	4%	{2, 4, 12, 14, 32, 34, 42, 44}	{ $2$ , 3, 4, 12, 14, 32, 34, 42, 44}
	10%	{2, 4, 12, 14, 32, 34, 42, 44}	{ $2$ , 3, 4, 12, 14, 32, 34, 42, 44}
	20%	{2, 4, 7, 12, 14, 32, 34, $42$ , 44}	{ $2$ , 3, 4, 12, 14, 30, 32, 34, $42$ , 44}

We compared results from MELD with results from the Bayesian copula factor model (Murray et al. 2013), which estimates a correlation matrix C between variables. From this estimate, we computed partial correlations between the response variable (trait) and each genetic locus (Hoff 2007). We ran MCMC for the Bayesian copula factor model for 10,000 iterations with the correct value for k, a burn-in of 5000 iterations, and saving every 50th iteration. We selected genomic locations for which the 95% posterior intervals of the partial correlation did not include zero. Table 2 shows the eight loci having the largest absolute values of the posterior mean partial correlation. The Bayesian copula factor model had much higher error rates than MELD.

5. Applications of MELD

Promoter sequence analysis.

We applied MELD to gene transcription promoter data available in the UCI machine learning repository (Lichman 2013). The data include n = 106 nucleotide sequences {A, C, G, T} of length 57. Of those, 53 sequences are located in promoter regions, and 53 sequences are located in nonpromoter regions. The goal is binary classification: using nucleotide sequence alone, we would like to predict whether or not the sequence is in a promoter or a non-promoter region.

We included the promoter or nonpromoter status of the sequences as an additional binary variable, giving us p = 57 + 1 categorical variables. We applied MELD Q⁽²⁾(Φ) with first-stage estimation on the full data and also separately on the subsets of promoter and non-promoter sequences. We set k = {1, …, 8}. For k = 2, MELD converged in 2.13 seconds, compared with SFM, which took 41.6 seconds to perform 10,000 MCMC iterations. The FI test selected k = 2 for the full data, k = 2 for the promoter data, and k = 1 for the nonpromoter data (Table S17).

We set k = 2 in the following analysis. For each nucleotide position, we calculated the averaged KL distance between the estimated component distributions and its marginal distribution using Equation (15). An interpretation of the averaged KL distance is that it quantifies stratification of each nucleotide distribution across components: a larger value indicates greater stratification, suggesting that the nucleotide is important in defining and differentiating the components. For the full dataset, we observed approximately two peaks of the averaged KL distance, one around the 15th nucleotide and one around the 42nd (Figure S5). The first peak corresponds to the start of the biologically conserved region for promoter sequences (Harley and Reynolds 1987). For MELD applied only to promoter sequences, this peak was reduced, suggesting that, at approximately the 15th nucleotide, the components all included similarly well conserved distributions of this nucleotide. The estimated component membership variables also showed the importance of nucleotides around the 15th locus (Figures S6 and S7). For the peak around the 42nd nucleotide, this phenomenon was reversed: the increased averaged KL distance remained in promoter sequences but diminished in non-promoter sequences. One possible explanation is that this region was conserved uniformly in the non-promoter region, but differentially conserved in the promoter region. Applying MELD to the entire data and separately within each group of sequences (promoter, nonpromoter) provides insight into the sequence differences across the two groups.

Political-economic risk data.

In a second application, we applied MELD to political-economic risk data (Quinn 2004), which include five proxy variables of mixed types measured for 62 countries. The dataset has been analyzed using a mixed Gaussian and probit factor model (Quinn 2004) and a Bayesian copula factor model (Murray et al. 2013). The data are available in the MCMCpack package. There are three categorical variables and two real valued variables (Table 3).

Table 3.

Variables in the political-economic risk data

Variable	Type	Explanation
`ind.jud`	binary	An indicator variable that measures the independence of the national judiciary. This variable is equal to one if the judiciary is judged to be independent and equal to zero otherwise.
`blk.mkt`	real	Black-market premium measurement. Original values are measured as the black-market exchange rate (local currency per dollar) divided by the official exchange rate minus one. Quinn (2004) transformed the original data to log scale.
`lack.exp.risk`	ordinal	Lack of expropriation risk measurement. Six levels with coding 0 < 1 < 2 < 3 < 4 < 5.
`lack.corrup`	ordinal	Lack of corruption measurement. Six levels with coding 0 < 1 < 2 < 3 < 4 < 5.
`gdp.worker`	real	Real gross domestic product (GDP) per worker in 1985 international prices. Recorded data are log transformed.

We applied MELD with k = {1, …, 5} using both MELD-2–1 and MELD-3–1. The FI criterion selected k = 4 for MELD-2–1 and k = 3 for MELD-3–1 (Table S18). We chose results from MELD-3–1 with k = 3 for further analysis.

The estimated component parameters for the five variables showed distinct interpretations of the three components (Figure S8). We might interpret the three components as low-risk, intermediate-risk, and high-risk political–economic status, respectively. The first component had a high probability of independence of the national judiciary (ind.jud being one) and a low measurement of black-market premium. The first component also had a high probability of observing 4th and 5th levels in lack.exp.risk and 3rd, 4th, and 5th levels in lack.corrup. The mean of the GDP per worker was highest among the three components. The second component had a relatively high probability of being zero in ind.jud and a large mean value of blk.mkt. Both the lack of expropriation risk and the lack of corruption measurements put higher weights on the lower categories (0, 1, and 2), indicating more risk and higher levels of corruption; the GDP per worker was still high. We might interpret this component as a society that is relatively unstable while still having a good economic forecast, meaning that GDP per worker is high, possibly through the black market. The last component had the least judicial independence as quantified by the probability of ind.jud being zero. The black-market premium was also low, as were the lack of expropriation risk and lack of corruption levels. The GDP per worker was by far the lowest among the three components. We might interpret this component as societies that are the most unstable, with the greatest economic risk. We found that, although the three components had distinct stratification, each country was a mixture of the three components (Figure S9).

Human single nucleotide polymorphism data.

As a third application, we applied MELD to Human HapMap phase 3 (HM3) genotype data (Stranger et al. 2012). The HM3 data consist of 608 individuals from diverse ancestral populations. We focused on single nucleotide polymorphisms (SNPs) on chromosome 21 for which the frequency of the less frequent variant is greater than 5%. SNPs were represented by the number of copies of the less frequent variant at each genetic locus {1,2,3}. After selecting every 10th locus to avoid including well-correlated SNPs, we were left with 1672 SNPs for this application.

We applied MELD-2–1 to estimate genotype distributions in different populations. We considered different numbers of ancestral populations, including k = {1, 2, 3, 4, 5}. In this analysis, FI chose k = 2 (Table S19). The bands in the estimated membership variables shown in Figure S10 suggest a clear population structure among the 608 individuals. We calculated average KL distance for the 1672 SNPs. The five SNPs with highest average KL distance are shown in Table S20, along with their genotypes in the samples. As with the previous applications, these SNPs define and differentiate the two inferred ancestral populations.

6. Discussion

In this article, we developed a new class of latent variable models with Dirichlet-distributed latent variables for mixed data types. These generalized latent Dirichlet variable models extend previous mixed membership models, such as LDA (Blei, Ng, and Jordan 2003) and simplex factor models (Bhattacharya and Dunson 2012), to allow mixed data types. For this class of models, we developed a fast parameter estimation procedure using the generalized method of moments. Our procedure extends recent moment tensor methods (Anandkumar et al. 2014b) to models with mixed data types and does not require instantiation of latent variables. We derive population moment conditions after marginalizing out the Dirichlet latent variables. We compared the two stages of our estimation procedure in different simulation settings and found that, for small-sample sizes and misspecified models, one-stage estimation had better performance. One possibility in terms of improving two-stage estimation is to apply a penalization procedure to better estimate the weight matrix.

Our results show that MELD is a promising alternative to MCMC or EM methods for producing fast and robust parameter estimates. An online method to update moment statistics when new samples arrive would allow its extension to streaming data. One limitation of our method is that the computational cost of the Newton-Raphson method is of order O(p²) using second moment functions and of order O(p³) using third moment functions. One possible approach to speed up computation when p is large is to use stochastic gradient methods to calculate an approximate gradient in each step. Another possible extension is to adapt our model to include ordinal variables; for example, by thresholding latent variables.

Appendix

A. Proof of Theorem 1

Proof. We start with the case where y_ij is categorical with d_j levels. The probability vector $x_{i} = {(x_{i 1}, \dots, x_{i k})}^{T} \in Δ^{k - 1}$ defines the mixture proportions of individual i. We assume x_i ∼ Dir(α₁, …, α_k). Define the normalizer $α_{0} = \sum_{h = 1}^{k} α_{h}$ and α = (α₁, …, α_k)^T.

We use the standard basis for encoding. We encode y_ij = c_j as $b_{i j} \in ℝ^{d_{j}}$ , a binary (0/1) vector with the c_jth coordinate being 1 and others being 0. Similarly, we encode the membership variable m_ij as a k dimensional binary vector $m_{i j} \in ℝ^{k}$ . Consider the first moment of b_ij,

\begin{array}{l} μ_{j} = E (b_{i j}) = E [E (b_{i j} | m_{i j})] = E (Φ_{j} m_{i j}) \\ = E [E (Φ_{j} m_{i j} | x_{i})] = E (Φ_{j} x_{i}) = Φ_{j} \frac{α}{α_{0}}, \end{array}

where Φ_j = (ϕ_j1, …, ϕ_jk).

We consider second-order moment conditions. There are four types of second-order moments: same-variable, same-subject (type SS), same-variable, cross-subject (type SC), cross-variable, same-subject (type CS), and cross-variable, cross-subject (type CC). Of the four types, only the CS type is required to prove the theorem. The CS type second moment for b_ij and b_it (j ≠ t) can be written as

E (b_{i j} \circ b_{i t}) = Φ_{j} E (m_{i j} \circ m_{i t}) Φ_{t}^{T} = Φ_{j} E (x_{i} \circ x_{i}) Φ_{t}^{T} .

For a Dirichlet distributed variable,

\begin{array}{l} E (x_{i} \circ x_{i}) = cov (x_{i}) + E (x_{i}) \circ E (x_{i}) \\ = \frac{1}{α_{0} (α_{0} + 1)} diag (α) + \frac{α_{0}}{α_{0}^{2} (α_{0} + 1)} α \circ α . \end{array}

Then, we have

\begin{array}{l} E (b_{i j} \circ b_{i t}) = Φ_{j} E (x_{i} \circ x_{i}) Φ_{t}^{T} \\ = \frac{1}{α_{0} (α_{0} + 1)} \sum_{h = 1}^{k} α_{h} ϕ_{j h} \circ ϕ_{t h} + \frac{α_{0}}{α_{0} + 1} μ_{j} \circ μ_{t} . \end{array}

(A.1)
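Equation (A.1) can be checked numerically. The sketch below builds E(x_i ◦ x_i) from the Dirichlet covariance formula above and compares both sides of (A.1) for randomly generated parameter matrices; the dimensions, the seed, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d_j, d_t = 3, 4, 5
alpha = np.array([0.1, 0.2, 0.3])
alpha0 = alpha.sum()

# Columns of Phi_j and Phi_t are the component distributions phi_jh and phi_th.
Phi_j = rng.dirichlet(np.ones(d_j), size=k).T  # shape (d_j, k)
Phi_t = rng.dirichlet(np.ones(d_t), size=k).T  # shape (d_t, k)

# E(x o x) for x ~ Dir(alpha): [diag(alpha) + alpha alpha^T] / [alpha0 (alpha0 + 1)]
Exx = (np.diag(alpha) + np.outer(alpha, alpha)) / (alpha0 * (alpha0 + 1))
lhs = Phi_j @ Exx @ Phi_t.T

# Right-hand side of (A.1), using the first moments mu_j = Phi_j alpha / alpha0.
mu_j = Phi_j @ alpha / alpha0
mu_t = Phi_t @ alpha / alpha0
rhs = (Phi_j * alpha) @ Phi_t.T / (alpha0 * (alpha0 + 1)) \
    + alpha0 / (alpha0 + 1) * np.outer(mu_j, mu_t)
```

For categorical variables the left-hand side is the joint probability table of (y_ij, y_it), so its entries also sum to one.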

We next consider third-order moment conditions. There are eight different types of third-order moments for b_ij. Only the moments with different variables for the same subject are required to prove the theorem.

We consider the third cross moment for b_ij, b_it, and b_is with j ≠ t ≠ s for the same subject. First, we calculate $E (m_{i j} \circ m_{i s} \circ m_{i t})$ ,

\begin{array}{l} E (m_{i j} \circ m_{i s} \circ m_{i t}) = E [E (m_{i j} \circ m_{i s} \circ m_{i t} | x_{i})] \\ = E (x_{i} \circ x_{i} \circ x_{i}) = \frac{1}{α_{0} (α_{0} + 1) (α_{0} + 2)} \\ \times ((α \circ α \circ α) + \sum_{h = 1}^{k} α_{h} (e_{h} \circ e_{h} \circ α) \\ + \sum_{h = 1}^{k} α_{h} (e_{h} \circ α \circ e_{h}) + \sum_{h = 1}^{k} α_{h} (α \circ e_{h} \circ e_{h}) \\ + 2 \sum_{h = 1}^{k} α_{h} (e_{h} \circ e_{h} \circ e_{h})) . \end{array}

Here, e_h is a standard basis vector of length k with the hth coordinate equal to one. The third order moment tensor of b_ij ◦ b_is ◦ b_it can be derived as

\begin{array}{l} E (b_{i j} \circ b_{i s} \circ b_{i t}) = E (m_{i j} \circ m_{i s} \circ m_{i t}) \times \{Φ_{j}, Φ_{s}, Φ_{t}\} \\ = \frac{1}{α_{0} (α_{0} + 1) (α_{0} + 2)} ((Φ_{j} α) \circ (Φ_{s} α) \circ (Φ_{t} α) \\ + \sum_{h = 1}^{k} α_{h} [ϕ_{j h} \circ ϕ_{s h} \circ (Φ_{t} α)] \\ + \sum_{h = 1}^{k} α_{h} [ϕ_{j h} \circ (Φ_{s} α) \circ ϕ_{t h}] \\ + \sum_{h = 1}^{k} α_{h} [(Φ_{j} α) \circ ϕ_{s h} \circ ϕ_{t h}] \\ + 2 \sum_{h = 1}^{k} α_{h} ϕ_{j h} \circ ϕ_{s h} \circ ϕ_{t h}) \\ = \frac{1}{α_{0} (α_{0} + 1) (α_{0} + 2)} (α_{0}^{3} μ_{j} \circ μ_{s} \circ μ_{t} \\ + α_{0}^{2} (α_{0} + 1) E (b_{i j} \circ b_{i s} \circ μ_{t}) - α_{0}^{3} μ_{j} \circ μ_{s} \circ μ_{t} \\ + α_{0}^{2} (α_{0} + 1) E (μ_{j} \circ b_{i s} \circ b_{i t}) - α_{0}^{3} μ_{j} \circ μ_{s} \circ μ_{t} \\ + α_{0}^{2} (α_{0} + 1) E (b_{i j} \circ μ_{s} \circ b_{i t}) - α_{0}^{3} μ_{j} \circ μ_{s} \circ μ_{t} \\ + 2 \sum_{h = 1}^{k} α_{h} ϕ_{j h} \circ ϕ_{s h} \circ ϕ_{t h}) . \end{array}

(A.2)

The theorem follows from Equations (A.1) and (A.2). For noncategorical data, we let b_ij = y_ij and ϕ_jh is a scalar mean parameter for y_ij. Equations (A.1) and (A.2) still hold. □
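The closed form for E(x_i ◦ x_i ◦ x_i) used in this proof can be verified against the exact raw moments of the Dirichlet distribution, which are ratios of rising factorials. The following sketch performs that comparison for a small k; the helper names `rising` and `outer3` are hypothetical.

```python
import itertools
import numpy as np

alpha = np.array([0.1, 0.2, 0.3])
k, alpha0 = alpha.size, alpha.sum()
I = np.eye(k)

def rising(a, n):
    """Rising factorial a (a + 1) ... (a + n - 1), equal to 1 when n = 0."""
    return float(np.prod([a + r for r in range(n)])) if n else 1.0

def outer3(u, v, w):
    """Three-way outer product u o v o w."""
    return np.einsum("a,b,c->abc", u, v, w)

# Exact raw moments E[x_a x_b x_c] of x ~ Dir(alpha).
exact = np.zeros((k, k, k))
for a, b, c in itertools.product(range(k), repeat=3):
    counts = np.bincount([a, b, c], minlength=k)
    numer = np.prod([rising(alpha[h], counts[h]) for h in range(k)])
    exact[a, b, c] = numer / rising(alpha0, 3)

# Closed form from the proof of Theorem 1.
formula = outer3(alpha, alpha, alpha)
for h in range(k):
    e = I[h]
    formula += alpha[h] * (outer3(e, e, alpha) + outer3(e, alpha, e) + outer3(alpha, e, e))
    formula += 2 * alpha[h] * outer3(e, e, e)
formula /= alpha0 * (alpha0 + 1) * (alpha0 + 2)
```

The two tensors agree entry by entry, and multiplying each mode by Φ_j, Φ_s, and Φ_t then gives the first expression in Equation (A.2).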

B. Proof of Theorem 2

Proof. For notational simplicity, we suppress A_n in Q_n(Φ; A_n) and A in Q₀(Φ; A). Lemma 1 implies that $\lim_{n \to \infty} \Pr [|Q_{n} (\hat{Φ}) - Q_{0} (\hat{Φ})| < ϵ / 3] = 1$ and $\lim_{n \to \infty} \Pr [|Q_{n} (Φ_{0}) - Q_{0} (Φ_{0})| < ϵ / 3] = 1$ for $ϵ > 0$ . This result also implies that

\lim_{n \to \infty} \Pr [Q_{0} (\hat{Φ}) < Q_{n} (\hat{Φ}) + ϵ / 3] = 1.

(B.1)

\lim_{n \to \infty} \Pr [Q_{n} (Φ_{0}) < Q_{0} (Φ_{0}) + ϵ / 3] = 1.

(B.2)

On the other hand, $\hat{Φ}$ minimizes Q_n(Φ), therefore

\lim_{n \to \infty} \Pr [Q_{n} (\hat{Φ}) < Q_{n} (Φ_{0}) + ϵ / 3] = 1.

(B.3)

Equations (B.1) and (B.3) imply that

\lim_{n \to \infty} \Pr [Q_{0} (\hat{Φ}) < Q_{n} (Φ_{0}) + 2 ϵ / 3] = 1.

Together with Equation (B.2), we get

\lim_{n \to \infty} \Pr [Q_{0} (\hat{Φ}) < Q_{0} (Φ_{0}) + ϵ] = 1.

Therefore,

\lim_{n \to \infty} \Pr [0 \leq Q_{0} (\hat{Φ}) < ϵ] = 1

(B.4)

follows with Q₀(Φ₀) = 0 and $Q_{0} (\hat{Φ}) \geq 0$ . Next, we choose an open neighborhood N of Φ₀ in Θ. Due to the compactness of Θ, the complement N^C is also compact. The continuity of Q₀(Φ) implies that $\inf_{Φ \in N^{C}} Q_{0} (Φ)$ is attained and, provided Q₀(Φ) vanishes only at Φ₀, that it is positive. Let $ϵ = \inf_{Φ \in N^{C}} Q_{0} (Φ)$ , then we get

\lim_{n \to \infty} \Pr [0 \leq Q_{0} (\hat{Φ}) < \inf_{Φ \in N^{C}} Q_{0} (Φ)] = 1.

(B.5)

Therefore, $\lim_{n \to \infty} \Pr (\hat{Φ} \notin N^{C}) = 1$ , which implies $\lim_{n \to \infty} \Pr (\hat{Φ} \in N) = 1$ . Because the neighborhood N can be chosen arbitrarily small, we get

\hat{Φ} \overset{p}{\to} Φ_{0} .

□

C. Proof of Theorem 3

Proof. We approximate $f_{n} (\hat{Φ})$ using a first-order Taylor expansion

f_{n} (\hat{Φ}) = f_{n} (Φ_{0}) + G_{n} (Φ_{0}) [vec (\hat{Φ}) - vec (Φ_{0})] + O \{{[vec (\hat{Φ}) - vec (Φ_{0})]}^{2}\}

Ignoring the high-order term, we left multiply both sides by ${[G_{n} (\hat{Φ})]}^{T} A_{n}$ . Then, we get

\begin{array}{l} {[G_{n} (\hat{Φ})]}^{T} A_{n} f_{n} (\hat{Φ}) \approx {[G_{n} (\hat{Φ})]}^{T} A_{n} f_{n} (Φ_{0}) \\ + {[G_{n} (\hat{Φ})]}^{T} A_{n} G_{n} (Φ_{0}) [vec (\hat{Φ}) - vec (Φ_{0})] . \end{array}

The fact that the estimator $\hat{Φ}$ minimizes Q_n(Φ; A_n) implies, by the first-order condition, that the left-hand side equals zero. Therefore,

\begin{array}{l} n^{1 / 2} [vec (\hat{Φ}) - vec (Φ_{0})] \\ \approx - {\{{[G_{n} (\hat{Φ})]}^{T} A_{n} G_{n} (Φ_{0})\}}^{- 1} {[G_{n} (\hat{Φ})]}^{T} A_{n} n^{1 / 2} f_{n} (Φ_{0}) . \end{array}

The theorem follows with $n^{1 / 2} f_{n} (Φ_{0}) \overset{d}{\to} N (0, S)$ and Assumptions 1 and 2. □

D. Derivatives of Moment Functions

D.1. Second Moment Matrix

The second moment matrix $F_{j t}^{(2)} (y_{i}, Φ)$ may be written as $b_{i j} \circ b_{i t} - Φ_{j} E (x_{i} \circ x_{i}) Φ_{t}^{T}$ . The derivatives of $F_{j t}^{(2)} (y_{i}, Φ)$ with respect to Φ_j and Φ_t can be written as

\begin{array}{l} \frac{\partial vec [F_{j t}^{(2)} (y_{i}, Φ)]}{\partial vec (Φ_{j})} = - \frac{\partial \{[Φ_{t} E (x_{i} \circ x_{i})] \otimes I_{d_{j}}\} vec (Φ_{j})}{\partial vec (Φ_{j})} \\ = - [Φ_{t} E (x_{i} \circ x_{i})] \otimes I_{d_{j}}, \end{array}

\begin{array}{l} \frac{\partial vec [F_{j t}^{(2)} (y_{i}, Φ)]}{\partial vec (Φ_{t})} = - T \frac{\partial vec [Φ_{t} E (x_{i} \circ x_{i}) Φ_{j}^{T}]}{\partial vec (Φ_{t})} \\ = - T \frac{\partial \{[Φ_{j} E (x_{i} \circ x_{i})] \otimes I_{d_{t}} vec (Φ_{t})\}}{\partial vec (Φ_{t})} \\ = - T \{[Φ_{j} E (x_{i} \circ x_{i})] \otimes I_{d_{t}}\}, \end{array}

where ⊗ indicates a Kronecker product and T is a d_jd_t × d_jd_t 0/1 matrix that satisfies

vec [Φ_{j} E (x_{i} \circ x_{i}) Φ_{t}^{T}] = T vec [Φ_{t} E (x_{i} \circ x_{i}) Φ_{j}^{T}] .

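Therefore, $E [\partial f^{(2)} (y_{i}, Φ) / \partial Φ]$ is a block matrix whose rows corresponding to $vec [F_{j t}^{(2)} (y_{i}, Φ)]$ contain the block $- [Φ_{t} E (x_{i} \circ x_{i})] \otimes I_{d_{j}}$ in the columns corresponding to vec(Φ_j) and the block $- T \{[Φ_{j} E (x_{i} \circ x_{i})] \otimes I_{d_{t}}\}$ in the columns corresponding to vec(Φ_t).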
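The role of the matrix T and of the Kronecker identities above can be illustrated numerically. The sketch below assumes the column-stacking vec operator and constructs T explicitly as a commutation (vec-permutation) matrix; the helper names `vec` and `commutation` and the random inputs are hypothetical.

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order="F")  # column-stacking vec operator

def commutation(rows, cols):
    """0/1 matrix T with T @ vec(B) == vec(B.T) for any B of shape (rows, cols)."""
    T = np.zeros((rows * cols, rows * cols))
    for r in range(rows):
        for c in range(cols):
            # B[r, c] sits at c*rows + r in vec(B) and at r*cols + c in vec(B.T)
            T[r * cols + c, c * rows + r] = 1.0
    return T

rng = np.random.default_rng(2)
k, d_j, d_t = 3, 4, 5
Phi_j = rng.random((d_j, k))
Phi_t = rng.random((d_t, k))
alpha = np.array([0.1, 0.2, 0.3])
a0 = alpha.sum()
Exx = (np.diag(alpha) + np.outer(alpha, alpha)) / (a0 * (a0 + 1))  # symmetric

M = Phi_j @ Exx @ Phi_t.T                # model part of the second moment
T = commutation(d_t, d_j)                # acts on vec of a (d_t, d_j) matrix

via_phi_j = np.kron(Phi_t @ Exx, np.eye(d_j)) @ vec(Phi_j)
via_phi_t = T @ (np.kron(Phi_j @ Exx, np.eye(d_t)) @ vec(Phi_t))
```

Both products reproduce vec(M), so the Jacobian blocks of the moment function are the negatives of the two Kronecker factors, and T has one row and one column for each of the d_j d_t entries of M.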

D.2. Third Moment Tensor

We next consider the third moment tensor. Write $F_{j s t}^{(3)} (y_{i}, Φ)$ as $b_{i j} \circ b_{i s} \circ b_{i t} - E (x_{i} \circ x_{i} \circ x_{i}) \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t}$ . Then, only the second term involves Φ.

The derivatives of $F_{j s t}^{(3)} (y_{i}, Φ)$ with respect to Φ_j can be written as

\begin{array}{l} \frac{\partial vec [F_{j s t}^{(3)} (y_{i}, Φ)]}{\partial vec (Φ_{j})} = - \frac{\partial vec \{{[E (x_{i} \circ x_{i} \circ x_{i}) \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t}]}_{(1)}\}}{\partial vec (Φ_{j})} \\ = - \frac{\partial vec \{Φ_{j} E {(x_{i} \circ x_{i} \circ x_{i})}_{(1)} {(Φ_{t} \otimes Φ_{s})}^{T}\}}{\partial vec (Φ_{j})} \\ = - \{(Φ_{t} \otimes Φ_{s}) {[E {(x_{i} \circ x_{i} \circ x_{i})}_{(1)}]}^{T} \otimes I_{d_{j}}\}, \end{array}

where subscript (1) indicates mode-1 unfolding of a three-way tensor.

The derivatives of $F_{j s t}^{(3)} (y_{i}, Φ)$ with respect to Φ_s and Φ_t can be calculated accordingly by introducing binary transformation matrices $T_{(2) (1)}$ and $T_{(3) (1)}$, both of size $d_{j} d_{s} d_{t} \times d_{j} d_{s} d_{t}$, that satisfy

\begin{array}{l} vec \{{[F_{j s t}^{(3)} (y_{i}, Φ)]}_{(1)}\} = T_{(2) (1)} vec \{{[F_{j s t}^{(3)} (y_{i}, Φ)]}_{(2)}\} \\ = T_{(3) (1)} vec \{{[F_{j s t}^{(3)} (y_{i}, Φ)]}_{(3)}\} . \end{array}

Then,

\begin{array}{l} \frac{\partial vec [F_{j s t}^{(3)} (y_{i}, Φ)]}{\partial vec (Φ_{s})} \\ = - T_{(2) (1)} \frac{\partial vec \{{[E (x_{i} \circ x_{i} \circ x_{i}) \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t}]}_{(2)}\}}{\partial vec (Φ_{s})} \\ = - T_{(2) (1)} \frac{\partial vec \{Φ_{s} E {(x_{i} \circ x_{i} \circ x_{i})}_{(2)} {(Φ_{t} \otimes Φ_{j})}^{T}\}}{\partial vec (Φ_{s})} \\ = - T_{(2) (1)} \{(Φ_{t} \otimes Φ_{j}) {[E {(x_{i} \circ x_{i} \circ x_{i})}_{(2)}]}^{T} \otimes I_{d_{s}}\}, \end{array}

\begin{array}{l} \frac{\partial vec [F_{j s t}^{(3)} (y_{i}, Φ)]}{\partial vec (Φ_{t})} \\ = - T_{(3) (1)} \frac{\partial vec \{{[E (x_{i} \circ x_{i} \circ x_{i}) \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t}]}_{(3)}\}}{\partial vec (Φ_{t})} \\ = - T_{(3) (1)} \frac{\partial vec \{Φ_{t} E {(x_{i} \circ x_{i} \circ x_{i})}_{(3)} {(Φ_{s} \otimes Φ_{j})}^{T}\}}{\partial vec (Φ_{t})} \\ = - T_{(3) (1)} \{(Φ_{s} \otimes Φ_{j}) {[E {(x_{i} \circ x_{i} \circ x_{i})}_{(3)}]}^{T} \otimes I_{d_{t}}\} . \end{array}

Conditions (1), (3), and (4) in Assumption 2 in the main text follow after we calculate the derivatives of the moment functions.
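The unfolding identities used in D.2 can also be verified numerically. The sketch below assumes the common convention in which the columns of a mode-n unfolding are ordered with the earlier remaining index varying fastest, and it uses a random tensor `Lam` in place of $E (x_{i} \circ x_{i} \circ x_{i})$; the helper `unfold` and all dimensions are illustrative assumptions rather than part of the original derivation.

```python
import numpy as np

def unfold(x, mode):
    """Mode-`mode` unfolding (0-based); earlier remaining indices vary fastest."""
    return np.reshape(np.moveaxis(x, mode, 0), (x.shape[mode], -1), order="F")

def vec(a):
    """Column-stacking vectorization."""
    return a.reshape(-1, order="F")

rng = np.random.default_rng(1)
k, d_j, d_s, d_t = 2, 3, 4, 5    # hypothetical dimensions
Lam = rng.random((k, k, k))      # stand-in for E(x_i o x_i o x_i)
Phi_j = rng.random((d_j, k))
Phi_s = rng.random((d_s, k))
Phi_t = rng.random((d_t, k))

# G = Lam x_1 Phi_j x_2 Phi_s x_3 Phi_t
G = np.einsum("abc,ia,jb,kc->ijk", Lam, Phi_j, Phi_s, Phi_t)

# Unfoldings of the multilinear product along each mode
ok_mode1 = np.allclose(unfold(G, 0), Phi_j @ unfold(Lam, 0) @ np.kron(Phi_t, Phi_s).T)
ok_mode2 = np.allclose(unfold(G, 1), Phi_s @ unfold(Lam, 1) @ np.kron(Phi_t, Phi_j).T)
ok_mode3 = np.allclose(unfold(G, 2), Phi_t @ unfold(Lam, 2) @ np.kron(Phi_s, Phi_j).T)

# Jacobian of vec(G_(1)) with respect to vec(Phi_j)
J_1 = np.kron(np.kron(Phi_t, Phi_s) @ unfold(Lam, 0).T, np.eye(d_j))
ok_jac = np.allclose(vec(unfold(G, 0)), J_1 @ vec(Phi_j))
```

The derivative of $F_{j s t}^{(3)}$ with respect to vec(Φ_j) is the negative of `J_1`; the mode-2 and mode-3 cases follow the same pattern after reordering entries with the binary transformation matrices.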

E. Derivation of Newton-Raphson Update

We denote

E_{n, j t}^{(2)} = F_{n, j t}^{(2)} (Φ) + Φ_{j} Λ^{(2)} Φ_{t}^{T},

E_{n, j s t}^{(3)} = F_{n, j s t}^{(3)} (Φ) + Λ^{(3)} \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t} .

With the identity matrix as the weight matrix, the two objective functions can be written as

Q_{n}^{(2)} (Φ) = \sum_{j = 1}^{p - 1} \sum_{t = j + 1}^{p} {‖E_{n, j t}^{(2)} - Φ_{j} Λ^{(2)} Φ_{t}^{T}‖}_{F}^{2},

\begin{array}{l} Q_{n}^{(3)} (Φ) = \sum_{j = 1}^{p - 1} \sum_{t = j + 1}^{p} {‖E_{n, j t}^{(2)} - Φ_{j} Λ^{(2)} Φ_{t}^{T}‖}_{F}^{2} \\ + \sum_{j = 1}^{p - 2} \sum_{s = j + 1}^{p - 1} \sum_{t = s + 1}^{p} {‖E_{n, j s t}^{(3)} - Λ^{(3)} \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t}‖}_{F}^{2} . \end{array}

Here, we suppress the weight matrix in the objective functions. We first consider $Q_{n}^{(2)} (Φ)$ . The terms that involve ϕ_jh are

\sum_{t = 1, t \neq j}^{p} [- 2 λ_{h}^{(2)} {({\bar{E}}_{n, j t}^{(2)} ϕ_{t h})}^{T} ϕ_{j h} + {(λ_{h}^{(2)})}^{2} (ϕ_{t h}^{T} ϕ_{t h}) ϕ_{j h}^{T} ϕ_{j h}],

where ${\bar{E}}_{n, j t}^{(2)} = E_{n, j t}^{(2)} - \sum_{h^{'} \neq h} λ_{h^{'}}^{(2)} ϕ_{j h^{'}} \circ ϕ_{t h^{'}}$ and $λ_{h}^{(2)}$ is the hth diagonal element of $Λ^{(2)}$ . Here, we use the fact that ${‖E_{n, j t}^{(2)} - Φ_{j} Λ^{(2)} Φ_{t}^{T}‖}_{F}^{2} = {‖E_{n, t j}^{(2)} - Φ_{t} Λ^{(2)} Φ_{j}^{T}‖}_{F}^{2}$ . By letting

ξ^{(2)} = - 2 λ_{h}^{(2)} \sum_{t = 1, t \neq j}^{p} ({\bar{E}}_{n, j t}^{(2)} ϕ_{t h}),

γ^{(2)} = {(λ_{h}^{(2)})}^{2} \sum_{t = 1, t \neq j}^{p} ϕ_{t h}^{T} ϕ_{t h},

the gradient $\nabla Q_{n}^{(2)} (ϕ_{j h})$ and Hessian $\nabla^{2} Q_{n}^{(2)} (ϕ_{j h})$ can be written as

\nabla Q_{n}^{(2)} (ϕ_{j h}) = ξ^{(2)} + 2 γ^{(2)} ϕ_{j h},

\nabla^{2} Q_{n}^{(2)} (ϕ_{j h}) = 2 γ^{(2)} I .

The update rule in (11) can be derived accordingly.
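As an illustration, the quantities $ξ^{(2)}$ and $γ^{(2)}$ can be computed directly and the resulting gradient compared with finite differences. This is a minimal sketch on random inputs, not the MELD implementation: the functions `q2` and `xi_gamma`, the dimensions, and the matrices `E[j][t]` are hypothetical, and the step shown is the plain unconstrained Newton step implied by the gradient and Hessian above, which may differ in details from rule (11) in the main text.

```python
import numpy as np

def q2(Phi, lam, E):
    """Second-moment objective with identity weights, summed over pairs j < t."""
    p = len(Phi)
    return sum(
        np.linalg.norm(E[j][t] - Phi[j] @ np.diag(lam) @ Phi[t].T) ** 2
        for j in range(p) for t in range(j + 1, p)
    )

def xi_gamma(Phi, lam, E, j, h):
    """Return (xi, gamma) for column h of Phi[j], following the definitions above."""
    others = [c for c in range(len(lam)) if c != h]
    xi = np.zeros(Phi[j].shape[0])
    gamma = 0.0
    for t in range(len(Phi)):
        if t == j:
            continue
        # E-bar: subtract the contribution of every component except h
        E_bar = E[j][t] - (Phi[j][:, others] * lam[others]) @ Phi[t][:, others].T
        xi += -2.0 * lam[h] * (E_bar @ Phi[t][:, h])
        gamma += lam[h] ** 2 * (Phi[t][:, h] @ Phi[t][:, h])
    return xi, gamma

rng = np.random.default_rng(2)
dims, k = [3, 4, 5], 2                       # hypothetical sizes d_1, d_2, d_3 and k
p = len(dims)
Phi = [rng.random((d, k)) for d in dims]
lam = rng.random(k) + 0.5
E = [[None] * p for _ in range(p)]
for a in range(p):
    for b in range(a + 1, p):
        E[a][b] = rng.random((dims[a], dims[b]))
        E[b][a] = E[a][b].T                  # the symmetry used in the derivation

j, h = 1, 0
xi, gamma = xi_gamma(Phi, lam, E, j, h)
grad = xi + 2.0 * gamma * Phi[j][:, h]

# Central finite differences of the full objective with respect to phi_jh
eps = 1e-6
fd = np.zeros_like(grad)
for i in range(grad.size):
    plus = [P.copy() for P in Phi]
    minus = [P.copy() for P in Phi]
    plus[j][i, h] += eps
    minus[j][i, h] -= eps
    fd[i] = (q2(plus, lam, E) - q2(minus, lam, E)) / (2 * eps)

# Unconstrained Newton step: phi <- phi - (2 gamma I)^{-1} grad = -xi / (2 gamma)
q_before = q2(Phi, lam, E)
Phi[j][:, h] = Phi[j][:, h] - grad / (2.0 * gamma)
q_after = q2(Phi, lam, E)
xi_new, gamma_new = xi_gamma(Phi, lam, E, j, h)
grad_after = xi_new + 2.0 * gamma_new * Phi[j][:, h]
```

Because the objective is quadratic in ϕ_jh with Hessian $2 γ^{(2)} I$, a single such step reaches the minimizer over ϕ_jh with all other columns held fixed, so `grad_after` should vanish up to rounding.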

Then, we consider $Q_{n}^{(3)} (Φ)$ . The terms that involve ϕ_jh are

\begin{array}{l} \sum_{t = 1, t \neq j}^{p} [- 2 λ_{h}^{(2)} {({\bar{E}}_{n, j t}^{(2)} ϕ_{t h})}^{T} ϕ_{j h} + {(λ_{h}^{(2)})}^{2} (ϕ_{t h}^{T} ϕ_{t h}) ϕ_{j h}^{T} ϕ_{j h}] \\ + \sum_{s = 1, s \neq j}^{p} \sum_{t = 1, t \neq s, t \neq j}^{p} [- 2 〈{\bar{E}}_{n, j s t}^{(3)}, λ_{h}^{(3)} ϕ_{j h} \circ ϕ_{s h} \circ ϕ_{t h}〉 \\ + {‖λ_{h}^{(3)} ϕ_{j h} \circ ϕ_{s h} \circ ϕ_{t h}‖}_{F}^{2}], \end{array}

where ${\bar{E}}_{n, j t}^{(2)} = E_{n, j t}^{(2)} - \sum_{h^{'} \neq h} λ_{h^{'}}^{(2)} ϕ_{j h^{'}} \circ ϕ_{t h^{'}}$ and ${\bar{E}}_{n, j s t}^{(3)} = E_{n, j s t}^{(3)} - \sum_{h^{'} \neq h} λ_{h^{'}}^{(3)} ϕ_{j h^{'}} \circ ϕ_{s h^{'}} \circ ϕ_{t h^{'}}$ . Again we use the symmetry of ${‖E_{n, j t}^{(2)} - Φ_{j} Λ^{(2)} Φ_{t}^{T}‖}_{F}^{2}$ and ${‖E_{n, j s t}^{(3)} - Λ^{(3)} \times_{1} Φ_{j} \times_{2} Φ_{s} \times_{3} Φ_{t}‖}_{F}^{2}$ under permutation of their indices. By organizing the terms, we get

\begin{array}{l} \sum_{t = 1, t \neq j}^{p} [- 2 λ_{h}^{(2)} {({\bar{E}}_{n, j t}^{(2)} ϕ_{t h})}^{T} ϕ_{j h} + {(λ_{h}^{(2)})}^{2} (ϕ_{t h}^{T} ϕ_{t h}) ϕ_{j h}^{T} ϕ_{j h}] \\ + \sum_{s = 1, s \neq j}^{p} \sum_{t = 1, t \neq s, t \neq j}^{p} [- 2 λ_{h}^{(3)} {({\bar{E}}_{n, j s t}^{(3)} \times_{2} ϕ_{s h} \times_{3} ϕ_{t h})}^{T} ϕ_{j h} \\ + {(λ_{h}^{(3)})}^{2} (ϕ_{s h}^{T} ϕ_{s h}) (ϕ_{t h}^{T} ϕ_{t h}) (ϕ_{j h}^{T} ϕ_{j h})] . \end{array}

By letting

\begin{array}{l} ξ^{(3)} = - 2 λ_{h}^{(2)} \sum_{t = 1, t \neq j}^{p} ({\bar{E}}_{n, j t}^{(2)} ϕ_{t h}) - 2 λ_{h}^{(3)} \\ \times \sum_{s = 1, s \neq j}^{p} [\sum_{t = 1, t \neq s, t \neq j}^{p} ({\bar{E}}_{n, j s t}^{(3)} \times_{2} ϕ_{s h} \times_{3} ϕ_{t h})], \end{array}

\begin{array}{l} γ^{(3)} = {(λ_{h}^{(2)})}^{2} \sum_{t = 1, t \neq j}^{p} ϕ_{t h}^{T} ϕ_{t h} + {(λ_{h}^{(3)})}^{2} \\ \times \sum_{s = 1, s \neq j}^{p} [\sum_{t = 1, t \neq s, t \neq j}^{p} (ϕ_{s h}^{T} ϕ_{s h}) (ϕ_{t h}^{T} ϕ_{t h})], \end{array}

the gradient $\nabla Q_{n}^{(3)} (ϕ_{j h})$ and Hessian $\nabla^{2} Q_{n}^{(3)} (ϕ_{j h})$ can be written as

\nabla Q_{n}^{(3)} (ϕ_{j h}) = ξ^{(3)} + 2 γ^{(3)} ϕ_{j h},

\nabla^{2} Q_{n}^{(3)} (ϕ_{j h}) = 2 γ^{(3)} I .

The update rule in (12) follows directly.
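The step from the inner-product form to the organized form rests on two identities for a single (s, t) term: the inner product with a rank-one tensor equals a contraction of ${\bar{E}}_{n, j s t}^{(3)}$ along modes 2 and 3, and the squared Frobenius norm of a rank-one tensor factorizes. A small numerical sketch follows, assuming random inputs and hypothetical names (`E_bar`, `phi_j`, `phi_s`, `phi_t`, `lam`); it checks one term only and is not the full update (12).

```python
import numpy as np

rng = np.random.default_rng(3)
d_j, d_s, d_t = 3, 4, 5                     # hypothetical dimensions
E_bar = rng.random((d_j, d_s, d_t))         # stand-in for one E-bar^(3) tensor
phi_j, phi_s, phi_t = rng.random(d_j), rng.random(d_s), rng.random(d_t)
lam = 0.7                                   # stand-in for lambda_h^(3)

def term(v):
    """|| E_bar - lam * v o phi_s o phi_t ||_F^2 as a function of v = phi_jh."""
    return np.sum((E_bar - lam * np.einsum("i,j,k->ijk", v, phi_s, phi_t)) ** 2)

rank_one = lam * np.einsum("i,j,k->ijk", phi_j, phi_s, phi_t)
# E_bar x_2 phi_s x_3 phi_t: contract modes 2 and 3, leaving a length-d_j vector
contracted = np.einsum("ijk,j,k->i", E_bar, phi_s, phi_t)
scale = (phi_s @ phi_s) * (phi_t @ phi_t)

inner_ok = np.isclose(np.sum(E_bar * rank_one), lam * (contracted @ phi_j))
norm_ok = np.isclose(np.sum(rank_one ** 2), lam ** 2 * scale * (phi_j @ phi_j))

# Contribution of this term to the gradient: xi-part + 2 * gamma-part * phi_jh
grad = -2.0 * lam * contracted + 2.0 * lam ** 2 * scale * phi_j

# Central finite differences for comparison
eps = 1e-6
fd = np.array([
    (term(phi_j + eps * e) - term(phi_j - eps * e)) / (2 * eps)
    for e in np.eye(d_j)
])
```

Summing the `-2 * lam * contracted` pieces over s and t, together with the second-moment part, gives $ξ^{(3)}$, and summing the `lam ** 2 * scale` pieces gives the third-moment part of $γ^{(3)}$.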

Footnotes

Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/JASA.

Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA.

Supplementary Materials

We provide supplementary materials that include (a) derivations of asymptotic optimal weight matrices; (b) calculation of MSE and computational complexity analysis; (c) two additional simulations with categorical variables and mixed data types; (d) Supplementary Tables S1–S25; (e) Supplementary Figures S1–S10. Python code implementing MELD and code used to generate results in the article are available at https://github.com/judyboon/MELD.

References

Anandkumar A, Ge R, Hsu D, and Kakade SM (2014a), “A Tensor Approach to Learning Mixed Membership Community Models,” The Journal of Machine Learning Research, 15, 2239–2312.
Anandkumar A, Ge R, Hsu D, Kakade SM, and Telgarsky M (2014b), “Tensor Decompositions for Learning Latent Variable Models,” The Journal of Machine Learning Research, 15, 2773–2832.
Anandkumar A, Hsu D, and Kakade SM (2012a), “A Method of Moments for Mixture Models and Hidden Markov Models,” JMLR W&CP 23: COLT.
Anandkumar A, Liu Y, Hsu DJ, Foster DP, and Kakade SM (2012b), “A Spectral Algorithm for Latent Dirichlet Allocation,” in Advances in Neural Information Processing Systems 25, pp. 917–925.
Anderson JC, and Gerbing DW (1988), “Structural Equation Modeling in Practice: A Review and Recommended Two-Step Approach,” Psychological Bulletin, 103, 411.
Arora S, Ge R, and Moitra A (2012), “Learning Topic Models—Going Beyond SVD,” in Fifty-Third IEEE Annual Symposium on Foundations of Computer Science, pp. 1–10.
Bentler PM (1983), “Some Contributions to Efficient Statistics in Structural Models: Specification and Estimation of Moment Structures,” Psychometrika, 48, 493–517.
Bhattacharya A, and Dunson DB (2012), “Simplex Factor Models for Multivariate Unordered Categorical Data,” Journal of the American Statistical Association, 107, 362–377.
Blei DM, Ng AY, and Jordan MI (2003), “Latent Dirichlet Allocation,” The Journal of Machine Learning Research, 3, 993–1022.
Bollen KA, Kolenikov S, and Bauldry S (2014), “Model-Implied Instrumental Variable-Generalized Method of Moments (MIIV-GMM) Estimators for Latent Variable Models,” Psychometrika, 79, 20–50.
Colombo N, and Vlassis N (2015), “FastMotif: Spectral Sequence Motif Discovery,” Bioinformatics, 31, 2623–2631.
Dunson DB (2003), “Dynamic Latent Trait Models for Multidimensional Longitudinal Data,” Journal of the American Statistical Association, 98, 555–563.
Dunson DB, and Xing C (2009), “Nonparametric Bayes Modeling of Multivariate Categorical Data,” Journal of the American Statistical Association, 104, 1042–1051.
Gallant AR, Giacomini R, and Ragusa G (2013), Generalized Method of Moments With Latent Variables, Centre for Economic Policy Research.
Hall AR (2005), Generalized Method of Moments, New York: Oxford University Press.
Hansen LP (1982), “Large Sample Properties of Generalized Method of Moments Estimators,” Econometrica: Journal of the Econometric Society, 50, 1029.
Harley CB, and Reynolds RP (1987), “Analysis of E. coli Promoter Sequences,” Nucleic Acids Research, 15, 2343–2361.
Hoff PD (2007), “Extending the Rank Likelihood for Semiparametric Copula Estimation,” The Annals of Applied Statistics, 1, 265–283.
Hornik K, and Grün B (2011), “topicmodels: An R Package for Fitting Topic Models,” Journal of Statistical Software, 40, 1–30.
Hsu D, and Kakade SM (2013), “Learning Mixtures of Spherical Gaussians: Moment Methods and Spectral Decompositions,” in Proceedings of the 4th conference on Innovations in Theoretical Computer Science, ACM, pp. 11–20.
Jöreskog KG, and Sörbom D (1987), “New Developments in LISREL,” Paper presented at the National Symposium on Methodological Issues in Causal Modeling, University of Alabama, Tuscaloosa.
Kiers HA (2000), “Towards a Standardized Notation and Terminology in Multiway Analysis,” Journal of Chemometrics, 14, 105–122.
Lichman M (2013), “UCI Machine Learning Repository,” available at https://archive.ics.uci.edu/ml/index.php.
Moustaki I, and Knott M (2000), “Generalized Latent Trait Models,” Psychometrika, 65, 391–411.
Murray JS, Dunson DB, Carin L, and Lucas JE (2013), “Bayesian Gaussian Copula Factor Models for Mixed Data,” Journal of the American Statistical Association, 108, 656–665.
Muthén B (1984), “A General Structural Equation Model With Dichotomous, Ordered Categorical, and Continuous Latent Variable Indicators,” Psychometrika, 49, 115–132.
Pritchard JK, Stephens M, and Donnelly P (2000a), “Inference of Population Structure Using Multilocus Genotype Data,” Genetics, 155, 945–959.
Pritchard JK, Stephens M, Rosenberg NA, and Donnelly P (2000b), “Association Mapping in Structured Populations,” The American Journal of Human Genetics, 67, 170–181.
Quinn KM (2004), “Bayesian Factor Analysis for Mixed Ordinal and Continuous Responses,” Political Analysis, 12, 338–353.
Sammel MD, Ryan LM, and Legler JM (1997), “Latent Variable Models for Mixed Discrete and Continuous Outcomes,” Journal of the Royal Statistical Society, Series B, 59, 667–678.
Stranger BE, Montgomery SB, Dimas AS, Parts L, Stegle O, Ingle CE, Sekowska M, Smith GD, Evans D, Gutierrez-Arcelus M, Price A, Raj T, Nisbett J, Nica AC, Beazley C, Durbin R, Deloukas P, and Dermitzakis ET (2012), “Patterns of Cis-Regulatory Variation in Diverse Human Populations,” PLoS Genetics, 8, e1002639.
Tung HF, and Smola AJ (2014), “Spectral Methods for Indian Buffet Process Inference,” in Advances in Neural Information Processing Systems 27, pp. 1484–1492.
