Bioinformatics. 2023 Apr 5;39(5):btad167. doi: 10.1093/bioinformatics/btad167

Finite mixtures of matrix variate Poisson-log normal distributions for three-way count data

Anjali Silva 1,2, Xiaoke Qin 3, Steven J Rothstein 4, Paul D McNicholas 5, Sanjeena Subedi 6
Editor: Alfonso Valencia
PMCID: PMC10159656  PMID: 37018147

Abstract

Motivation

Three-way data structures, characterized by three entities (the units, the variables, and the occasions), are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix variate distributions offer a natural way to model three-way data, and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks.

Results

In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo-based approach, a variational Gaussian approximation-based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.

Availability and implementation

The GitHub R package for this work is available at https://github.com/anjalisilva/mixMVPLN and is released under the open source MIT license.

1 Introduction

Finite mixture models are popular for clustering applications and are widely used on two-way data (McLachlan and Peel 2000; McNicholas 2016). Three-way data are becoming increasingly commonplace in several fields, including bioinformatics. Three-way data structures are characterized by three entities or modes: the units (rows), the variables (columns), and the occasions (layers). For two-way data, each observation is represented as a vector whereas, for three-way data, each observation can be regarded as a matrix. A random matrix $T_n$ is said to contain $k \in \{1,\ldots,p\}$ responses over $i \in \{1,\ldots,r\}$ occasions, and $n = 1,\ldots,N$ such units are considered. This provides $N$ independent and identically distributed random matrices $T_1, T_2, \ldots, T_N$.

Matrix variate distributions offer a natural approach for modeling three-way data. Extensions of matrix variate distributions in the context of mixture models have given rise to mixtures of matrix variate distributions, which have been used to cluster three-way data (Viroli 2011; Anderlucci and Viroli 2015; Dogru et al. 2016; Gallaugher and McNicholas 2018). Here, the interest lies in clustering the $N$ observed matrices into $G$ clusters, while utilizing all information from the other two modes (Viroli 2011). It is assumed that matrices $T_1, T_2, \ldots, T_N$ are conditionally independent and identically distributed observations coming from a mixture model with $G$ possible groups in proportions $\pi_1,\ldots,\pi_G$ (Viroli 2011). The density of the $G$-component mixture is

$$f(T_n \mid \pi_1,\ldots,\pi_G,\vartheta_1,\ldots,\vartheta_G) = \sum_{g=1}^{G} \pi_g\, f_{(r\times p)}(T_n \mid \vartheta_g).$$

Here the parameters of the distribution function $f_{(r\times p)}(\cdot)$ are represented by $\vartheta_g$, and $\pi_g > 0$, such that $\sum_{g=1}^{G}\pi_g = 1$, is the mixing proportion of the $g$th component.

Three-way datasets are common in biological studies, including RNA sequencing (RNA-seq), where gene expression count data are collected for $N$ genes across $p$ conditions at $r$ occasions. However, efficiently analyzing these complex data remains an ongoing challenge. While some early work utilized the Poisson distribution to model such count data (Marioni et al. 2008; Bullard et al. 2010), this was not ideal because of its restrictive mean–variance relationship, and so the negative binomial distribution emerged as the univariate distribution of choice (Love et al. 2014; Dong et al. 2016). However, the multivariate extensions of the Poisson (Campbell 1934) and negative binomial distributions (Doss 1979) are seldom used in practice due to their computational complexity (Brijs et al. 2004). Silva et al. (2019) proposed a mixture model-based clustering methodology for overdispersed, multivariate count data based on the multivariate Poisson-log normal (MPLN) distribution for two-way RNA-seq data. For genes $n \in \{1,\ldots,N\}$ and samples $c \in \{1,\ldots,rp\}$, the MPLN distribution is given by

$$Y_{nc} \mid \theta_{nc} \sim \mathcal{P}\big(\exp\{\theta_{nc} + \log s_c\}\big), \qquad (\theta_{n1},\ldots,\theta_{n,rp})^{\top} \sim \mathcal{N}_{rp}(\mu, \Sigma),$$

where $\mathcal{P}$ denotes the Poisson distribution and $\mathcal{N}_{rp}$ is an $rp$-dimensional normal distribution. To account for the differences in library sizes across each sample $c$ of an RNA-seq study, a fixed, known constant $s_c$, representing the normalized library size, enters the mean of the Poisson distribution as the offset $\log s_c$. In this work, mixtures of MPLN distributions and matrix variate normal distributions are extended to give rise to mixtures of matrix variate Poisson-log normal (MVPLN) distributions for clustering three-way count data. Details of parameter estimation are provided, and both real and simulated data illustrations are used to demonstrate the clustering ability.

2 Materials and methods

2.1 Matrix variate Poisson-log normal distribution

Mathematical properties of the matrix variate normal distribution can be found in Gupta and Nagar (2000). By considering a matrix variate structure, the number of free covariance parameters to be estimated is reduced from $\frac{1}{2}rp(rp+1)$ to $\frac{1}{2}[r(r+1)+p(p+1)]$. The matrix variate normal distribution can be extended to give rise to the MVPLN distribution using a hierarchical structure. Consider $N$ independent and identically distributed random matrices $Y_n$, $n = 1,\ldots,N$, each of dimension $r \times p$. In the MVPLN framework, $Y_{nik} \mid \theta_{nik}$ follows a Poisson distribution with mean $\exp(\theta_{nik})$, and $\theta_n$ follows an $r \times p$ matrix variate normal distribution $\mathcal{N}_{r\times p}(M,\Phi,\Omega)$, where $M$ is an $r \times p$ matrix of means, $\Phi$ is an $r \times r$ covariance matrix containing the variances and covariances between the $r$ occasions, and $\Omega$ is a $p \times p$ covariance matrix containing the variances and covariances of the $p$ variables. Figure 1 provides a graphical representation of a mixture of MVPLN distributions.

Figure 1.

Graphical representation of the MVPLN mixture model.
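The hierarchical structure described in Section 2.1 can be sketched in a few lines of NumPy. This is only an illustrative sampler, not the authors' implementation: the function name `sample_mvpln`, the parameter values, and the unit library sizes are all assumptions made for the example. The latent matrix is generated as $\theta = M + A Z B^{\top}$ with $AA^{\top} = \Phi$ and $BB^{\top} = \Omega$, and the counts are then drawn from a Poisson distribution with the library-size offset.

```python
import numpy as np

def sample_mvpln(n, M, Phi, Omega, s, rng):
    """Draw n count matrices from an MVPLN distribution (illustrative sketch).

    M     : (r, p) mean matrix of the latent variable theta
    Phi   : (r, r) covariance between occasions
    Omega : (p, p) covariance between variables
    s     : (r, p) library-size factors (hypothetical values in this example)
    """
    r, p = M.shape
    A = np.linalg.cholesky(Phi)        # Phi = A A'
    B = np.linalg.cholesky(Omega)      # Omega = B B'
    Z = rng.standard_normal((n, r, p))
    theta = M + A @ Z @ B.T            # each slice is matrix variate normal
    Y = rng.poisson(np.exp(theta + np.log(s)))
    return Y, theta

# Hypothetical parameter values for r = 2 occasions and p = 3 variables
rng = np.random.default_rng(1)
M = np.array([[2.0, 2.5, 3.0], [1.0, 1.5, 2.0]])
Phi = np.array([[1.0, 0.3], [0.3, 0.5]])
Omega = np.array([[0.4, 0.1, 0.0], [0.1, 0.3, 0.05], [0.0, 0.05, 0.2]])
s = np.ones((2, 3))
Y, theta = sample_mvpln(2000, M, Phi, Omega, s, rng)

# With NumPy's row-major flattening, the covariance of the flattened latent
# matrix is the Kronecker product np.kron(Phi, Omega).
emp_cov = np.cov(theta.reshape(len(theta), -1).T)
```

In this sketch the empirical covariance `emp_cov` of the flattened latent matrices should be close to `np.kron(Phi, Omega)`, which is the reduced-parameter structure that the matrix variate formulation exploits.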

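The vectorization of $Y_n$, denoted $\text{vec}(Y_n)$, is $rp$-dimensional. Given all $\text{vec}(Y_n)$, i.e. for $n = 1,\ldots,N$, the library sizes $\text{vec}(s)$ can be calculated. The $\text{vec}(s)$ and $\text{vec}(\theta_n)$ are $rp$-dimensional. The covariance matrix of $\text{vec}(Y_n)$ is $\Sigma = \Phi \otimes \Omega$, where $\otimes$ denotes the Kronecker product. Note that $\Sigma$ has dimension $rp \times rp$, and the probability mass function of the MVPLN distribution is

$$f(Y_n, s \mid \vartheta) = \int_{\mathbb{R}^{rp}} \prod_{c=1}^{rp} f\big(\text{vec}(Y_n)_c \mid \text{vec}(\theta_n)_c, \text{vec}(s)_c\big)\; g_{(r\times p)}(\theta_n \mid \vartheta)\; d\,\text{vec}(\theta_n),$$

where $\vartheta = (M,\Phi,\Omega)$, $f(\cdot)$ is the probability mass function of the Poisson distribution, $\text{vec}(Y_n)_c$ represents the $c$th element of $\text{vec}(Y_n)$, and $g_{(r\times p)}(\cdot)$ is the probability density function of the matrix variate normal distribution.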

The unconditional mean and covariance of the MPLN distribution can be calculated using the properties of the log-normal distribution and of the conditional expectation (Aitchison and Ho 1989; Tunaru 2002). For the MVPLN distribution, the unconditional mean and variance are

$$E(Y_{ik}) = E[E(Y_{ik} \mid \theta_{ik})] = \exp\{\mu_{ik} + \tfrac{1}{2}(\Phi_{ii}\Omega_{kk})\} = \mathbb{M}_{ik},$$

$$\text{Var}(Y_{ik}) = E[\text{Var}(Y_{ik} \mid \theta_{ik})] + \text{Var}[E(Y_{ik} \mid \theta_{ik})] = \mathbb{M}_{ik} + \mathbb{M}_{ik}^{2}\big(\exp\{\Phi_{ii}\Omega_{kk}\} - 1\big),$$

respectively, where $\mu_{ik}$ is the $(i,k)$th element of $M$ and $\mathbb{M}_{ik}$ denotes the unconditional mean of $Y_{ik}$ (not to be confused with the latent mean matrix $M$). The MVPLN distribution can account for both the correlations between variables and the correlations between occasions, as two different covariance matrices are used for the two modes. This makes the model ideal for modeling RNA-seq data when expression measurements for different conditions at different occasions, e.g. time-points or replicates, are available.
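As a small worked illustration of the unconditional moments given above, the following sketch evaluates them for every cell of the $r \times p$ matrix. The function name `mvpln_moments` and the parameter values are hypothetical; library sizes are omitted, as in the formulas above.

```python
import numpy as np

def mvpln_moments(M, Phi, Omega):
    """Unconditional mean and variance of each count Y_ik under the MVPLN model."""
    # Latent variance of theta_ik is Phi_ii * Omega_kk
    var_theta = np.outer(np.diag(Phi), np.diag(Omega))
    mean_y = np.exp(M + 0.5 * var_theta)
    var_y = mean_y + mean_y**2 * (np.exp(var_theta) - 1.0)
    return mean_y, var_y

# Hypothetical parameter values for r = 2, p = 3
M = np.array([[2.0, 2.5, 3.0], [1.0, 1.5, 2.0]])
Phi = np.array([[1.0, 0.3], [0.3, 0.5]])
Omega = np.array([[0.4, 0.1, 0.0], [0.1, 0.3, 0.05], [0.0, 0.05, 0.2]])
mean_y, var_y = mvpln_moments(M, Phi, Omega)
```

Because the second term of the variance is strictly positive whenever $\Phi_{ii}\Omega_{kk} > 0$, `var_y` exceeds `mean_y` in every cell, which is the overdispersion relative to a plain Poisson model.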

2.2 Finite mixtures of MVPLN distributions

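In the mixture model context, a random matrix $Y_n$ is assumed to come from a population with $G$ subgroups, each distributed according to an MVPLN distribution. Then $N$ such matrices $Y_1, Y_2, \ldots, Y_N$ are observed, each of which belongs to one of $g \in \{1,\ldots,G\}$ different sub-populations with mixing proportions $\pi_1,\ldots,\pi_G$. Then the probability density function of a $G$-component mixture of MVPLN distributions can be written as

$$f(Y;\Theta) = \sum_{g=1}^{G} \pi_g f_Y(Y \mid M_g,\Phi_g,\Omega_g) = \sum_{g=1}^{G} \pi_g \int \prod_{c=1}^{rp} f_g\big(\text{vec}(Y_n)_c \mid \text{vec}(\theta_{ng})_c, \text{vec}(s)_c\big) \times g_g^{(r\times p)}(\theta_{ng} \mid M_g,\Phi_g,\Omega_g)\; d\,\text{vec}(\theta_{ng}),$$

where $\Theta = (\pi_1,\ldots,\pi_G, M_1,\ldots,M_G, \Phi_1,\ldots,\Phi_G, \Omega_1,\ldots,\Omega_G)$, $f_g(\cdot)$ is the probability mass function of a Poisson distribution, and $g_g^{(r\times p)}(\cdot)$ is the probability density function of the matrix variate normal distribution. The cluster membership of all units is assumed to be unknown, and $z_{ng}$ is used to denote cluster membership, where $z_{ng} = 1$ if $Y_n$ is in component $g$ and $z_{ng} = 0$ otherwise. The complete data consist of the observed and missing data, i.e. $(Y_1,\ldots,Y_N, z_1,\ldots,z_N, \theta_1,\ldots,\theta_N)$. The complete-data likelihood is

$$L_c(\Theta) = \prod_{n=1}^{N}\prod_{g=1}^{G} \Big[\pi_g \prod_{c=1}^{rp} f_g\big(\text{vec}(Y_n)_c \mid \text{vec}(\theta_{ng})_c, \text{vec}(s)_c\big) \times g_g^{(r\times p)}(\theta_{ng} \mid M_g,\Phi_g,\Omega_g)\Big]^{z_{ng}},$$

and the complete-data log-likelihood is

$$\begin{aligned}
l_c(\Theta) ={}& \sum_{g=1}^{G} n_g \log \pi_g - \sum_{n=1}^{N}\sum_{g=1}^{G}\sum_{c=1}^{rp} z_{ng}\exp\{\text{vec}(\theta_{ng})_c + \log \text{vec}(s)_c\} + \sum_{n=1}^{N}\sum_{g=1}^{G} z_{ng}\,[\text{vec}(\theta_{ng}) + \log \text{vec}(s)]^{\top}\text{vec}(Y_n) \\
& - \sum_{n=1}^{N}\Big(\sum_{g=1}^{G} z_{ng}\Big)\sum_{c=1}^{rp}\log\big(\text{vec}(Y_n)_c!\big) - \frac{Nrp}{2}\log(2\pi) - \frac{p}{2}\sum_{g=1}^{G} n_g \log|\Phi_g| - \frac{r}{2}\sum_{g=1}^{G} n_g \log|\Omega_g| \\
& - \frac{1}{2}\sum_{n=1}^{N}\sum_{g=1}^{G} z_{ng}\,\text{tr}\big[\Phi_g^{-1}(\theta_{ng}-M_g)\,\Omega_g^{-1}(\theta_{ng}-M_g)^{\top}\big],
\end{aligned}$$

where $n_g = \sum_{n=1}^{N} z_{ng}$ and $\log(\text{vec}(Y_n)_c!)$ is the log of the factorial of the $c$th element of $\text{vec}(Y_n)$. Compared to the mixtures of MPLN distributions, the number of free parameters to be estimated is reduced by considering a matrix variate structure (see Figs 2 and 3). For the mixtures of MPLN model, the number of free parameters is $K = (G-1) + (Grp) + \frac{1}{2}Grp[rp+1]$, whereas for the mixtures of MVPLN model it is $K = (G-1) + (Grp) + \frac{1}{2}G[r(r+1)+p(p+1)]$.

Figure 2.

Scatter plot illustrating how the number of free parameters $K$ grows with data dimensionality $rp$ for the mixtures of MPLN model and for the mixtures of MVPLN model. Here $G = 2$, $r = 2$, and $rp = 4$ up to 100.

Figure 3.

Scatter plot illustrating how the number of free parameters $K$ grows with the number of clusters $G$ for the mixtures of MPLN model and for the mixtures of MVPLN model. Here $G = 1,\ldots,100$, $r = 2$, $p = 5$.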
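The two parameter counts stated in Section 2.2 are easy to compare numerically. The sketch below simply evaluates both expressions; the function names `k_mpln` and `k_mvpln` are hypothetical and chosen for this illustration.

```python
def k_mpln(G, r, p):
    """Free parameters for a G-component mixture of MPLN distributions."""
    d = r * p
    return (G - 1) + G * d + G * d * (d + 1) // 2

def k_mvpln(G, r, p):
    """Free parameters for a G-component mixture of MVPLN distributions."""
    return (G - 1) + G * r * p + G * (r * (r + 1) + p * (p + 1)) // 2

# Setting used in the simulations: G = 2 components, r = 2, p = 3
print(k_mpln(2, 2, 3), k_mvpln(2, 2, 3))   # prints: 55 31
```

For $G = 2$, $r = 2$, and $p = 3$, the covariance part drops from $2 \times 21 = 42$ parameters to $2 \times 9 = 18$, and the gap widens quickly as $rp$ grows.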

2.3 Parameter estimation

Three different frameworks for parameter estimation for the mixtures of MVPLN models are proposed: one based on Markov chain Monte Carlo (MCMC) methods, one based on variational Gaussian approximation (VGA) as well as a hybrid approach. MCMC-based approaches are computationally intensive; hence, we also provide a computationally efficient variational approximation framework for parameter estimation. Finally, a hybrid approach combines the variational approximation-based approach and MCMC-based approach.

2.3.1 MCMC-based approach

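In the MCMC-based approach, the MCMC expectation–maximization (MCMC–EM) algorithm is used to estimate the model parameters (see Silva et al. 2019, for details). Using an MCMC–EM algorithm, the expected values of $\theta_{ng}$ and $Z_{ng}$, conditional on the parameter updates from the $t$th iteration, are updated in the expectation (E-) step as follows:

$$E(\theta_{ng} \mid Y_n) \approx \frac{1}{W}\sum_{f=1}^{W}\theta_{ng}^{(f)} := \theta_{ng}^{(t)}, \qquad E(Z_{ng} \mid Y_n,\theta_{ng},s) = \frac{q_{ng}}{\sum_{h=1}^{G} q_{nh}} := z_{ng}^{(t)}, \tag{1}$$

where

$$q_{ng} = \pi_g^{(t)}\Big\{\prod_{c=1}^{rp} f_g\big(\text{vec}(Y_n)_c \mid \text{vec}(\theta_{ng})_c, \text{vec}(s)_c\big)\Big\} \times g_g^{(r\times p)}\big(\theta_{ng}^{(t)} \mid M_g^{(t)},\Phi_g^{(t)},\Omega_g^{(t)}\big)$$

and $\theta_{ng}^{(f)}$ is a random sample simulated via the RStan package for iterations $f = 1,\ldots,B$. In the E-step, the expectation is taken conditional on the current parameter estimates; hence, the use of $(t)$ on parameters in (1). As the values from initial iterations are discarded from further analysis to minimize bias, the number of iterations used for parameter estimation is $W$, where $W < B$. The conditional expected value of the complete-data log-likelihood is

$$Q(\Theta) \approx E\{l_c(\Theta)\} \approx C - \frac{p}{2}\sum_{g=1}^{G} n_g^{(t)}\log|\Phi_g| - \frac{r}{2}\sum_{g=1}^{G} n_g^{(t)}\log|\Omega_g| - \frac{1}{2}\sum_{n=1}^{N}\sum_{g=1}^{G} z_{ng}^{(t)}\, E\Big[\text{tr}\big(\Phi_g^{-1}(\theta_{ng}-M_g)\,\Omega_g^{-1}(\theta_{ng}-M_g)^{\top}\big) \,\Big|\, \theta_{ng}^{(t)}, z_{ng}=1\Big],$$

where $C$ is a constant with respect to $M_g$, $\Phi_g$, and $\Omega_g$, and $n_g^{(t)} = \sum_{n=1}^{N} z_{ng}^{(t)}$. During the M-step, the updates for the parameters are obtained as follows:

$$\pi_g^{(t+1)} = \frac{n_g^{(t)}}{N}, \qquad M_g^{(t+1)} = \frac{\sum_{n=1}^{N} z_{ng}^{(t)} E(\theta_{ng})}{n_g^{(t)}},$$

$$\Phi_g^{(t+1)} = \frac{\sum_{n=1}^{N} z_{ng}^{(t)}\, E\Big(\big(\theta_{ng}-M_g^{(t+1)}\big)\big(\Omega_g^{(t)}\big)^{-1}\big(\theta_{ng}-M_g^{(t+1)}\big)^{\top}\Big)}{p\, n_g^{(t)}},$$

$$\Omega_g^{(t+1)} = \frac{\sum_{n=1}^{N} z_{ng}^{(t)}\, E\Big(\big(\theta_{ng}-M_g^{(t+1)}\big)^{\top}\big(\Phi_g^{(t+1)}\big)^{-1}\big(\theta_{ng}-M_g^{(t+1)}\big)\Big)}{r\, n_g^{(t)}}.$$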
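The M-step updates above can be sketched for a single component as follows. This is a minimal illustration rather than the authors' code: the function name `m_step_component` and the array layout are assumptions, the expectations are replaced by averages over the retained draws, and in the demonstration the "draws" are simply simulated latent matrices with $W = 1$ instead of output from RStan.

```python
import numpy as np

def m_step_component(theta_draws, z, Omega_prev):
    """One M-step for a single component g (illustrative sketch).

    theta_draws : (N, W, r, p) retained draws of theta_ng (after burn-in)
    z           : (N,) current membership weights z_ng
    Omega_prev  : (p, p) value of Omega_g from the previous iteration
    """
    N, W, r, p = theta_draws.shape
    n_g = z.sum()
    # Monte Carlo estimate of E(theta_ng), then the weighted mean matrix M_g
    theta_bar = theta_draws.mean(axis=1)
    M = np.einsum("n,nij->ij", z, theta_bar) / n_g
    D = theta_draws - M                       # deviations, shape (N, W, r, p)
    # E[(theta - M) Omega^{-1} (theta - M)'] averaged over the W draws
    S_phi = np.einsum("nwij,jk,nwlk->nil", D, np.linalg.inv(Omega_prev), D) / W
    Phi = np.einsum("n,nil->il", z, S_phi) / (p * n_g)
    # E[(theta - M)' Phi^{-1} (theta - M)] using the freshly updated Phi
    S_omega = np.einsum("nwij,ik,nwkl->njl", D, np.linalg.inv(Phi), D) / W
    Omega = np.einsum("n,njl->jl", z, S_omega) / (r * n_g)
    return n_g / N, M, Phi, Omega

# Demonstration with hypothetical true values (G = 1, so every z_ng = 1)
rng = np.random.default_rng(0)
M_true = np.array([[2.0, 2.5, 3.0], [1.0, 1.5, 2.0]])
Phi_true = np.array([[1.0, 0.3], [0.3, 0.5]])
Omega_true = np.array([[0.4, 0.1, 0.0], [0.1, 0.3, 0.05], [0.0, 0.05, 0.2]])
A, B = np.linalg.cholesky(Phi_true), np.linalg.cholesky(Omega_true)
draws = M_true + A @ rng.standard_normal((3000, 1, 2, 3)) @ B.T
pi_hat, M_hat, Phi_hat, Omega_hat = m_step_component(draws, np.ones(3000), Omega_true)
```

Note that $\Phi_g$ is updated using the previous $\Omega_g$, and $\Omega_g$ is then updated using the new $\Phi_g$, mirroring the order of the equations above. Since $\Phi_g$ and $\Omega_g$ are only identified up to a constant, the estimates are best compared through their Kronecker product.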

2.3.2 VGA-based approach

Variational approximations (Wainwright and Jordan 2007) are approximate inference techniques in which a computationally convenient approximating density is used in place of a more complex but “true” posterior density. The approximating density is obtained by minimizing the Kullback–Leibler (KL) divergence between the true and the approximating densities. Suppose we have an approximating density $q(\theta)$; then the marginal log of the probability mass function can be written

$$\log f_Y(Y) = F(q, Y) + D_{KL}(q \,\|\, f_Y),$$

where $D_{KL}(q \,\|\, f) = \int q(\theta) \log \frac{q(\theta)}{f(\theta \mid Y)}\, d\theta$ is the KL divergence between $f(\theta \mid Y)$ and the approximating distribution $q(\theta)$, and $F(q, Y) = \int [\log f(Y,\theta) - \log q(\theta)]\, q(\theta)\, d\theta$ is our evidence lower bound (ELBO). Because the left-hand side does not depend on $q$, minimizing the KL divergence is equivalent to maximizing the ELBO. For VGA, $q(\theta)$ is assumed to be a Gaussian distribution.

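The complete-data log-likelihood for the mixtures of MVPLN distributions can be written

$$l_c(\Theta) = \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\log\pi_g + \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\log f(Y_n \mid M_g,\Phi_g,\Omega_g) = \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\log\pi_g + \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\big[F(q_{ng}, Y_n) + D_{KL}(q_{ng} \,\|\, f_{ng})\big],$$

where $D_{KL}(q_{ng} \,\|\, f_{ng}) = \int q(\theta_{ng}) \log \frac{q(\theta_{ng})}{f(\theta_n \mid Y_n, Z_{ng}=1)}\, d\theta_{ng}$ is the KL divergence between $f(\theta_n \mid Y_n, Z_{ng}=1)$ and the approximating distribution $q(\theta_{ng})$. Assuming $q(\theta_{ng}) = \mathcal{N}_{r\times p}(\xi_{ng},\Delta_{ng},\kappa_{ng})$, the ELBO for each observation $Y_n$ becomes

$$\begin{aligned}
F(q_{ng}, Y_n) ={}& -\Big[\sum_{i=1}^{r}\sum_{j=1}^{p}\exp\big\{(\xi_{ng})_{ij} + \tfrac{1}{2}(\Delta_{ng})_{ii}(\kappa_{ng})_{jj} + \log s_{ij}\big\}\Big] + [\text{vec}(\xi_{ng}) + \log \text{vec}(s)]^{\top}\text{vec}(Y_n) \\
& - \Big[\sum_{c=1}^{rp}\log\big(\text{vec}(Y_n)_c!\big)\Big] - \frac{p}{2}\log|\Phi_g| - \frac{r}{2}\log|\Omega_g| \\
& - \frac{1}{2}\big(\text{vec}(\xi_{ng}) - \text{vec}(M_g)\big)^{\top}\big(\Phi_g^{-1}\otimes\Omega_g^{-1}\big)\big(\text{vec}(\xi_{ng}) - \text{vec}(M_g)\big) - \frac{1}{2}\text{tr}\{\Phi_g^{-1}\Delta_{ng}\}\,\text{tr}\{\Omega_g^{-1}\kappa_{ng}\} \\
& + \frac{p}{2}\log|\Delta_{ng}| + \frac{r}{2}\log|\kappa_{ng}| + \frac{rp}{2}.
\end{aligned}$$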
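To make the lower-bound property concrete, the sketch below works through the scalar special case $r = p = 1$, i.e. a univariate Poisson-log normal model with a Gaussian approximating density $q = N(m, v)$. The ELBO is written out directly from the Poisson likelihood, the normal prior, and the Gaussian entropy, and the exact log marginal is obtained by Gauss–Hermite quadrature. The function names and the numerical values are assumptions made for illustration only.

```python
import math
import numpy as np

def elbo_univariate(y, m, v, mu, sigma2, s=1.0):
    """ELBO for y ~ Poisson(s * exp(theta)), theta ~ N(mu, sigma2), q = N(m, v)."""
    return (-math.exp(m + 0.5 * v + math.log(s))       # -E_q[s * exp(theta)]
            + (m + math.log(s)) * y - math.lgamma(y + 1)
            - 0.5 * math.log(sigma2)
            - 0.5 * ((m - mu) ** 2 + v) / sigma2        # E_q of the prior quadratic
            + 0.5 * math.log(v) + 0.5)                  # from the entropy of q

def log_marginal_univariate(y, mu, sigma2, s=1.0, n_nodes=80):
    """Exact log marginal probability of y via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    theta = mu + np.sqrt(2.0 * sigma2) * x
    log_pois = y * (theta + np.log(s)) - s * np.exp(theta) - math.lgamma(y + 1)
    return float(np.log(np.sum(w * np.exp(log_pois)) / np.sqrt(np.pi)))

# Hypothetical values: one observed count and a few candidate q densities
y, mu, sigma2 = 3, 1.0, 0.5
log_marg = log_marginal_univariate(y, mu, sigma2)
candidates = [(1.0, 0.3), (0.5, 0.1), (1.2, 0.5)]
elbos = [elbo_univariate(y, m, v, mu, sigma2) for m, v in candidates]
```

Every entry of `elbos` lies below `log_marg`, and the gap is the KL divergence between the corresponding $q$ and the true posterior, so the candidate with the largest ELBO is the closest approximation.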

The variational parameters that maximize the ELBO will minimize the KL divergence between the true posterior and the approximating density. Thus, parameter estimation can be done in an iterative EM-type approach such that the following steps are iterated. At the (t + 1)th step:

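  1. Conditional on the variational parameters $\xi_{ng}$, $\Delta_{ng}$, and $\kappa_{ng}$ and on $M_g$, $\Phi_g$, and $\Omega_g$, $E(Z_{ng})$ is computed. Given $\pi_g$, $M_g$, $\Phi_g$, and $\Omega_g$,
    $$E(Z_{ng} \mid Y_n) = \frac{\pi_g f(Y_n \mid M_g,\Phi_g,\Omega_g)}{\sum_{h=1}^{G}\pi_h f(Y_n \mid M_h,\Phi_h,\Omega_h)}.$$
    Note that this involves the marginal distribution of $Y$, which is difficult to compute. Hence, we use an approximation of $E(Z_{ng})$ in which the marginal density is replaced by the exponential of the ELBO, such that
    $$\hat{z}_{ng}^{(t+1)} \overset{\text{def}}{=} \frac{\pi_g \exp[F(q_{ng}, Y_n)]}{\sum_{h=1}^{G}\pi_h \exp[F(q_{nh}, Y_n)]}.$$

    This approximation is computationally convenient and a similar framework has been previously utilized (Gollini and Murphy 2014; Tang et al. 2015). This approximation works well in simulation studies and real data analysis.

  2. Given $\hat{z}_{ng}^{(t+1)}$, the variational parameters $\xi_{ng}$, $\Delta_{ng}$, and $\kappa_{ng}$ are updated conditional on $M_g^{(t)}$, $\Phi_g^{(t)}$, and $\Omega_g^{(t)}$.

    1. A fixed-point method is used for updating $\Delta_{ng}$:
      $$\Delta_{ng}^{(t+1)} = p\Big[I_{r\times r} \odot \Big\{\text{diag}\big(\kappa_{ng}^{(t)}\big)^{\top}\Big[\exp\Big\{\xi_{ng}^{(t)} + \log s + \tfrac{1}{2}\,\text{diag}\big(\Delta_{ng}^{(t)}\big)\,\text{diag}\big(\kappa_{ng}^{(t)}\big)^{\top}\Big\}\Big]^{\top}\Big\} + \Phi_g^{-1(t)}\,\text{tr}\big\{\Omega_g^{-1(t)}\kappa_{ng}^{(t)}\big\}\Big]^{-1},$$

      where the vector function $\exp[a] = (e^{a_1},\ldots,e^{a_r})^{\top}$ is a vector of the exponential of each element of the $r$-dimensional vector $a$, $\text{diag}(\kappa) = (\kappa_{11},\ldots,\kappa_{pp})^{\top}$ puts the diagonal elements of the $p \times p$ matrix $\kappa$ into a $p$-dimensional vector, and $\odot$ is the Hadamard product.

    2. A fixed-point method is used for updating $\kappa_{ng}$:
      $$\kappa_{ng}^{(t+1)} = r\Big[I_{p\times p} \odot \Big\{\text{diag}\big(\Delta_{ng}^{(t+1)}\big)^{\top}\Big[\exp\Big\{\xi_{ng}^{(t)} + \log s + \tfrac{1}{2}\Big(\text{diag}\big(\kappa_{ng}^{(t)}\big)\,\text{diag}\big(\Delta_{ng}^{(t+1)}\big)^{\top}\Big)^{\top}\Big\}\Big]\Big\} + \Omega_g^{-1(t)}\,\text{tr}\big\{\Phi_g^{-1(t)}\Delta_{ng}^{(t+1)}\big\}\Big]^{-1},$$

      where the vector function $\exp[a] = (e^{a_1},\ldots,e^{a_p})^{\top}$ is a vector of the exponential of each element of the $p$-dimensional vector $a$, $\text{diag}(\Delta) = (\Delta_{11},\ldots,\Delta_{rr})^{\top}$ puts the diagonal elements of the $r \times r$ matrix $\Delta$ into an $r$-dimensional vector, and $\odot$ is the Hadamard product.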

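    3. Newton’s method is used to update $\xi_{ng}$:
      $$\text{vec}\big(\xi_{ng}^{(t+1)}\big) = \text{vec}\big(\xi_{ng}^{(t)}\big) - \Psi_{ng}^{-1(t+1)}\Big\{\text{vec}(Y_n) - \exp\Big[\log \text{vec}(s) + \text{vec}\big(\xi_{ng}^{(t)}\big) + \tfrac{1}{2}\,\text{diag}\big(\Psi_{ng}^{-1(t+1)}\big)\Big] - \Psi_{ng}^{-1(t+1)}\Big(\text{vec}\big(\xi_{ng}^{(t)}\big) - \text{vec}\big(M_g^{(t)}\big)\Big)\Big\},$$

      where $\Psi_{ng}^{(t+1)} = \Delta_{ng}^{(t+1)} \otimes \kappa_{ng}^{(t+1)}$.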

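  3. Given $\hat{z}_{ng}^{(t+1)}$ and the variational parameters $\xi_{ng}^{(t+1)}$, $\Delta_{ng}^{(t+1)}$, and $\kappa_{ng}^{(t+1)}$, the parameters $\pi_g$, $M_g$, $\Phi_g$, and $\Omega_g$ can be solved for as
    $$\pi_g^{(t+1)} = \frac{n_g^{(t+1)}}{N} \quad \text{where } n_g^{(t+1)} = \sum_{n=1}^{N}\hat{z}_{ng}^{(t+1)}, \qquad M_g^{(t+1)} = \frac{\sum_{n=1}^{N}\hat{z}_{ng}^{(t+1)}\xi_{ng}^{(t+1)}}{n_g^{(t+1)}},$$
    $$\Phi_g^{(t+1)} = \frac{\sum_{n=1}^{N}\hat{z}_{ng}^{(t+1)}\big(\xi_{ng}^{(t+1)} - M_g^{(t+1)}\big)\,\Omega_g^{-1(t)}\big(\xi_{ng}^{(t+1)} - M_g^{(t+1)}\big)^{\top}}{p\, n_g^{(t+1)}} + \frac{\sum_{n=1}^{N}\hat{z}_{ng}^{(t+1)}\Delta_{ng}^{(t+1)}\,\text{tr}\big\{\Omega_g^{-1(t)}\kappa_{ng}^{(t+1)}\big\}}{p\, n_g^{(t+1)}},$$
    $$\Omega_g^{(t+1)} = \frac{\sum_{n=1}^{N}\hat{z}_{ng}^{(t+1)}\kappa_{ng}^{(t+1)}\,\text{tr}\big\{\Phi_g^{-1(t+1)}\Delta_{ng}^{(t+1)}\big\}}{r\, n_g^{(t+1)}} + \frac{\sum_{n=1}^{N}\hat{z}_{ng}^{(t+1)}\big(\xi_{ng}^{(t+1)} - M_g^{(t+1)}\big)^{\top}\Phi_g^{-1(t+1)}\big(\xi_{ng}^{(t+1)} - M_g^{(t+1)}\big)}{r\, n_g^{(t+1)}}.$$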
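The approximate membership update in step 1 is a softmax of $\log \pi_g + F(q_{ng}, Y_n)$ over components. Because ELBO values for count data can be large and negative, exponentiating them directly may underflow, so a practical sketch subtracts the row-wise maximum first (the log-sum-exp trick). The function name `responsibilities` and the input values are hypothetical; this is not taken from the authors' package.

```python
import numpy as np

def responsibilities(log_pi, elbo):
    """Approximate z_hat[n, g] proportional to pi_g * exp(F(q_ng, Y_n)).

    log_pi : (G,) log mixing proportions
    elbo   : (N, G) ELBO values F(q_ng, Y_n)
    """
    a = log_pi + elbo
    a = a - a.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(a)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical ELBO values for N = 2 observations and G = 2 components
elbo = np.array([[-1000.0, -1002.0], [-5.0, -5.0]])
z_hat = responsibilities(np.log([0.5, 0.5]), elbo)
```

In this example a naive `np.exp(-1000.0)` would underflow to zero for both components of the first observation, whereas the stabilized version returns weights that depend only on the ELBO difference of 2.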

2.3.3 Hybrid approach

While the MCMC-based approach can generate exact results, fitting such models can be computationally intensive because we need to evaluate the expected complete-data log-likelihood with respect to the posterior distribution of the latent variables at every iteration of the EM algorithm. For example, for datasets with $N = 1000$, $rp = 6$, and $G = 2$ in Simulation setting 2, the MCMC-based approach took an average of ∼41 h. On the other hand, the VGA approach is computationally efficient: on the same set of datasets from Simulation setting 2, fitting such a model took an average of ∼2 min (see Table 7 of Supplementary File 1 for complete details). However, it does not guarantee an exact posterior (Ghahramani and Beal 1999). Thus, we provide a computationally efficient hybrid approach in which

  • Step 1: Fit the model using the VGA-based approach.

  • Step 2: Estimate the component indicator variable Zng conditional on the parameter estimates from the VGA-based approach.

  • Step 3: Using the parameter estimates from Step 1 as the initial values for the parameters and using the classification from Step 2, compute the MCMC-based expectation for the latent variable θng and obtain the final estimates of the model parameters.

The hybrid approach comes with a substantial reduction in computational overhead compared to a traditional MCMC-EM, yet it can still generate samples from the exact posterior distribution. Fitting such a model using the hybrid approach on the same set of datasets from Simulation setting 2 with $N = 1000$, $rp = 6$, and $G = 2$ took on average just under 10 min. Complete details of computational times for all three simulation settings are provided in Table 7 of Supplementary File 1. When the primary goal is to detect the underlying clusters (which is the case for the real data analysis), the VGA-based approach is sufficient. However, when the primary goal is posterior inference, we recommend the hybrid approach, as it yields a posterior similar to that of the MCMC-EM approach while remaining computationally efficient.

Details on the convergence criteria, initialization, and parallel implementation for the MCMC-EM approach are provided in Supplementary File 3.

2.4 Identifiability

Model identifiability is vital to obtain unique and consistent parameter estimates. Identifiability of univariate and multivariate finite mixtures of normal distributions has been proved (Teicher 1963; Yakowitz and Spragins 1968). With regard to the mixtures of MVPLN distributions, the estimates for $\Phi_g$ and $\Omega_g$ are only unique up to a strictly positive constant. To eliminate identifiability issues, a constraint needs to be imposed, e.g. the trace of $\Omega_g$ can be set equal to $p$, the trace of $\Phi_g$ can be set equal to $r$, or the first diagonal element of $\Phi_g$ can be set equal to 1. The latter solution, which is used by Gallaugher and McNicholas (2018), is used for all analyses in this paper. To obtain final parameter estimates, the resulting $\Phi_g^{(t)}$ is divided by the first diagonal element of $\Phi_g^{(t)}$, and $\Omega_g^{(t)}$ is multiplied by the first diagonal element of $\Phi_g^{(t)}$.
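The rescaling described above leaves the Kronecker product $\Phi_g \otimes \Omega_g$, and hence the fitted model, unchanged. A minimal sketch, with a hypothetical function name `rescale_covariances` and made-up matrices:

```python
import numpy as np

def rescale_covariances(Phi, Omega):
    """Impose the constraint Phi[0, 0] = 1 without changing Phi (x) Omega."""
    c = Phi[0, 0]
    return Phi / c, Omega * c

# Hypothetical unconstrained estimates
Phi = np.array([[2.0, 0.6], [0.6, 1.0]])
Omega = np.array([[0.4, 0.1, 0.0], [0.1, 0.3, 0.05], [0.0, 0.05, 0.2]])
Phi_c, Omega_c = rescale_covariances(Phi, Omega)
```

Here `Phi_c[0, 0]` equals 1 and `np.kron(Phi_c, Omega_c)` equals `np.kron(Phi, Omega)`, which is why the constraint removes the scale ambiguity without affecting the likelihood.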

2.5 Model selection and performance assessment

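Four model selection criteria are offered, which include the Akaike information criterion (AIC; Akaike 1973), the Bayesian information criterion (BIC; Schwarz 1978), a variation of the AIC used by Bozdogan (1994) called AIC3, and the integrated completed likelihood (ICL; Biernacki et al. 2000). These criteria are given by

$$\text{AIC} = -2\log L\big(\vartheta^{(\text{MLE})} \mid y\big) + 2K, \qquad \text{BIC} = 2\log L\big(\vartheta^{(\text{MLE})} \mid y\big) - K\log N,$$

$$\text{AIC3} = -2\log L\big(\vartheta^{(\text{MLE})} \mid y\big) + 3K, \qquad \text{ICL} \approx \text{BIC} + 2\sum_{n=1}^{N}\sum_{g=1}^{G}\text{MAP}\{\hat{z}_{ng}^{(t)}\}\log \hat{z}_{ng}^{(t)},$$

respectively, where $\log L(\vartheta^{(\text{MLE})} \mid y)$ represents the maximized log-likelihood, $\vartheta^{(\text{MLE})}$ is the maximum likelihood estimate of the model parameters $\vartheta$, $K$ is the number of free parameters, $N$ is the number of observations, and $\text{MAP}\{\hat{z}_{ng}^{(t)}\}$ is the maximum a posteriori classification given $\hat{z}_{ng}^{(t)}$.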

In situations where the true classes are known but, for clustering purposes, are ignored, the adjusted Rand index (ARI; Hubert and Arabie 1985) can be used for performance assessment. The ARI takes a value 1 under perfect class agreement and has expected value 0 under random classification.
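For readers who want to see how the ARI is computed, the following sketch implements the Hubert and Arabie (1985) formula from the contingency table of the two labelings, using only the Python standard library. The function name `adjusted_rand_index` is chosen for this illustration; established implementations are available in standard statistical software.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same units."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two units: agreement is trivial
    cells = Counter(zip(labels_true, labels_pred))   # contingency table counts
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / total_pairs     # index expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # both labelings put every unit in a single cluster
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # prints: 1.0
```

The index is invariant to relabeling of the clusters, which is why the two labelings in the example agree perfectly even though the label values are swapped.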

3 Results

3.1 Simulations

Simulation studies were conducted to illustrate the ability to recover the true underlying parameters for the mixtures of MVPLN algorithm. For Simulation 1, datasets with $G = 1$ component were generated with $N = 1000$ observations, $r = 2$ and $p = 3$. For Simulation 2, datasets with $G = 2$ components and $\pi_1 = 0.79$ were generated with $N = 1000$ observations, $r = 2$ and $p = 3$. For Simulation 3, datasets with $G = 2$ components and $\pi_1 = 0.6$ were generated with $N = 1000$ observations, $r = 2$ and $p = 3$. Further, only diagonal covariance structures for both $\Phi_g$ and $\Omega_g$ were considered in Simulation 3. Each of the simulation settings consisted of 25 different datasets. The count range in the simulated datasets closely represented the range observed in the RNA-seq data (Freixas-Coutin et al. 2017). The covariance matrices $\Phi_g$ and $\Omega_g$ for each setting were generated using the clusterGeneration package (Qiu and Joe 2015). Initialization of $z_{ng}$ was done using one hundred different runs of the k-means algorithm. Clustering was performed on each dataset for values $G = 1,\ldots,5$.

We also compared the performance of our approach on count datasets generated from competitive models. For Simulation 4, we generated 25 datasets from a mixture of six independent Poisson distributions with $G = 2$ components, $\pi_1 = 0.45$ and $N = 1000$, and we analyzed the data as $2 \times 3$ matrices when using mixtures of MVPLN distributions. For Simulation 5, we generated 25 datasets from a mixture of six independent negative binomial distributions with $G = 2$ components, $\pi_1 = 0.79$ and $N = 2000$, and we analyzed the data as $2 \times 3$ matrices when using mixtures of MVPLN distributions. For Simulation 6, to show performance on a dataset with a similar number of components to the real data, we generated 25 datasets from a mixture of MVPLN distributions with $G = 8$ components. We set $N = 1500$, $r = 2$, $p = 3$, and $\pi_1 = \cdots = \pi_8 = 0.125$.

Comparative studies were also conducted on datasets from all six simulation settings. Because no comparable methods capable of clustering three-way count data were found in the current literature, datasets from Simulations 1, 2, 3, and 6 were vectorized and analyzed with clustering methods designed for two-way data. For comparison, a model-based clustering technique for count data, HTSCluster (Rau et al. 2011, 2015), and a distance-based method, k-means clustering (MacQueen 1967), were used. For HTSCluster, initialization and clustering ranges were the same as those used for the mixtures of MVPLN algorithm. In a classification EM framework (CEM; Celeux and Govaert 1992), k-means has been shown to be equivalent to an isotropic Gaussian mixture model with equal variance across all components and with equal mixing proportions. Here, we fitted an equal-variance isotropic Gaussian mixture model using the R package mclust (Scrucca et al. 2016) on the log-transformed data, which would be similar to fitting a fuzzy version of k-means on the latent space and allows us to compute the ICL, which relies on the estimated soft $\hat{Z}$. We then utilized model selection criteria to select the optimal number of components.

The clustering results along with ARI values of our proposed method and other comparative methods are provided in Table 1. As evident in Table 1, in all six simulations, our proposed approach was able to recover the underlying cluster structure in all 25 datasets using both BIC and ICL, including in Simulations 4 and 5, where the datasets were generated from mixtures of Poisson and negative binomial distributions. However, overfitting was evident with AIC and AIC3 as, in various datasets, they favored models with a larger number of components (i.e. $G = 3$, 4, and 5 for Simulations 1–5 and $G = 9$ and $G = 10$ for Simulation 6). The AIC and AIC3 penalize the log-likelihood only for the number of free parameters in the model, and the penalty is constant with respect to the sample size. When the number of observations is large, these model selection criteria tend to favor more complex models and are known to overestimate the number of components (Shibata 1976; Katz 1981). Note that we only provide the ARI values from the VGA approach. The ARI from the hybrid approach is the same as that from the VGA approach because the cluster membership indicator variable is determined in the VGA step in the hybrid approach. While we provide the estimated parameters obtained using the MCMC approach, due to the extreme computational cost that comes with fitting these models, we only fitted the model with the correct number of components and thus do not provide the summary in Table 1.

For each simulation setting, we also provide a measure of cluster separation. For each simulated dataset, we computed the normalized separation index, $I \in [0,1]$, by Hennig (2019) using the R package fpc (Hennig 2020), where higher values indicate that clusters have good separation. As Euclidean distance is used as the distance measure for computing the separation index, we computed the separation index using the log-transformed data, which would be equivalent to assessing separation in the latent space. The average values of the separation index along with the standard deviation are provided in Table 1.

The parameter estimation results for $M_g$, $\Phi_g$, and $\Omega_g$ via the mixtures of MVPLN algorithm for Simulations 1–3 using all three approaches are summarized in Supplementary File 1. As can be seen through the simulations, all three approaches can recover the parameters very well. However, there is a slight increase in the precision of these estimates with the hybrid approach. Given the large number of groups, for Simulation 6, the Frobenius norms of the difference between the estimated parameters and the true values of the parameters are provided in Supplementary File 1. In Table 7 of Supplementary File 1, we provide the computation time taken to fit the mixtures of MVPLN model with the correct number of components using all three approaches. Although all three approaches provide comparable parameter estimates, the computational time taken to fit MVPLN using the MCMC-based approach is substantially larger than for the VGA and the hybrid approaches, as seen in Table 7 of Supplementary File 1. Overall, the simulation experiments illustrated that our approach for parameter estimation (Section 2.3) is effective at parameter recovery for the mixtures of MVPLN distributions.

Table 1.

Mean of normalized cluster separation index (standard deviation) for each simulation setting and the number of clusters selected (average ARI, standard deviation) from all 25 datasets for each simulation setting using different model selection criteria.

| Method | Simulation setting | Mean of normalized separation index (SD) | BIC | ICL | AIC | AIC3 |
|---|---|---|---|---|---|---|
| Mix. of MVPLN (VGA-based approach) | 1 | | 1 (1.00, 0.00) | 1 (1.00, 0.00) | 1, 2 (0.88, 0.33) | 1 (1.00, 0.00) |
| Mix. of MVPLN (VGA-based approach) | 2 | 0.44 (0.02) | 2 (1.00, 0.00) | 2 (1.00, 0.00) | 2, 3, 4, 5 (0.89, 0.22) | 2 (1.00, 0.00) |
| Mix. of MVPLN (VGA-based approach) | 3 | 0.37 (0.02) | 2 (1.00, 0.00) | 2 (1.00, 0.00) | 2, 3, 4, 5 (0.86, 0.21) | 2, 3, 4 (0.96, 0.12) |
| Mix. of MVPLN (VGA-based approach) | 4 | 0.69 (0.01) | 2 (1.00, 0.00) | 2 (1.00, 0.00) | 2 (1.00, 0.00) | 2 (1.00, 0.00) |
| Mix. of MVPLN (VGA-based approach) | 5 | 0.44 (0.02) | 2 (1.00, 0.00) | 2 (1.00, 0.00) | 2, 5 (0.99, 0.06) | 2 (1.00, 0.00) |
| Mix. of MVPLN (VGA-based approach) | 6 | 0.14 (0.01) | 8 (0.99, 0.00) | 8 (0.99, 0.00) | 8, 9, 10 (0.99, 0.01) | 8, 9 (0.99, 0.01) |
| HTSCluster | 1 | | 5 (0.00, 0.00) | 5 (0.00, 0.00) | 5 (0.00, 0.00) | 5 (0.00, 0.00) |
| HTSCluster | 2 | 0.44 (0.02) | 5 (−0.00, 0.00) | 5 (−0.00, 0.00) | 5 (−0.00, 0.00) | 5 (−0.00, 0.00) |
| HTSCluster | 3 | 0.37 (0.02) | 5 (0.00, 0.01) | 5 (0.00, 0.01) | 5 (0.00, 0.01) | 5 (0.00, 0.01) |
| HTSCluster | 4 | 0.69 (0.01) | 2 (1.00, 0.00) | 2 (1.00, 0.00) | 2, 3, 4 (0.97, 0.06) | 2, 3, 4 (0.97, 0.06) |
| HTSCluster | 5 | 0.44 (0.02) | 5 (0.22, 0.01) | 5 (0.22, 0.01) | 5 (0.22, 0.01) | 5 (0.22, 0.01) |
| HTSCluster | 6 | 0.14 (0.01) | 10 (0.03, 0.13) | 10 (0.03, 0.13) | 10 (0.03, 0.13) | 10 (0.03, 0.13) |
| Fuzzy version of k-means | 1 | | 5 (0.00, 0.00) | 4, 5 (0.00, 0.00) | 5 (0.00, 0.00) | 5 (0.00, 0.00) |
| Fuzzy version of k-means | 2 | 0.44 (0.02) | 5 (0.24, 0.01) | 5 (0.24, 0.01) | 5 (0.24, 0.01) | 5 (0.24, 0.01) |
| Fuzzy version of k-means | 3 | 0.37 (0.02) | 3, 4, 5 (0.55, 0.06) | 2, 3, 4, 5 (0.81, 0.23) | 4, 5 (0.54, 0.04) | 4, 5 (0.54, 0.04) |
| Fuzzy version of k-means | 4 | 0.69 (0.01) | 2 (1.00, 0.00) | 2 (1.00, 0.00) | 2, 4, 5 (0.54, 0.12) | 2, 4, 5 (0.54, 0.12) |
| Fuzzy version of k-means | 5 | 0.44 (0.02) | 2 (1.00, 0.00) | 2 (1.00, 0.00) | 2, 3, 4, 5 (0.82, 0.27) | 2, 3, 4, 5 (0.82, 0.27) |
| Fuzzy version of k-means | 6 | 0.14 (0.01) | 8, 9, 10 (0.93, 0.03) | 8, 9, 10 (0.96, 0.02) | 10 (0.90, 0.01) | 10 (0.90, 0.01) |

In the two simulation settings where the datasets were generated from mixtures of independent Poisson distributions and mixtures of independent negative binomial distributions (Simulations 4 and 5), our approach was able to recover the underlying cluster structure and estimate the means and variances of samples fairly well. With regard to HTSCluster, a model with $G = 5$, the highest value for $G$ considered, was selected in Simulations 1, 2, 3, and 5, and a model with $G = 10$, the highest value for $G$ considered, was selected in Simulation 6. Furthermore, the ARI values were low across these simulation settings, indicating that observations were not assigned to the correct clusters. However, in Simulation 4, where the datasets are generated from mixtures of independent Poisson distributions, HTSCluster is able to recover the underlying cluster structure in all 25 datasets using BIC and ICL with perfect classification. It is worth noting that several studies have demonstrated that, in the presence of biological variation, the observed variability in RNA-seq data is greater than what is predicted by the Poisson model (Gao et al. 2010). Thus, the negative binomial distribution (a hierarchical Poisson model similar to the univariate version of the MPLN model) has emerged as the predominant distribution of choice for univariate analysis involving RNA-seq data (Gao et al. 2010). For k-means clustering, the average ARI values were also low for Simulations 1–3, but the value for Simulation 6 was high, and perfect classification was observed in Simulations 4 and 5 using BIC and ICL.

3.2 Clustering transcriptome data

To illustrate the applicability of mixtures of MVPLN distributions for detecting the underlying cluster structure, the VGA-based approach was applied to an RNA-seq dataset. Typically, only a subset of genes from the experiment are used for cluster analysis, in order to reduce noise. For this analysis, only the differentially expressed genes were used for clustering. Freixas-Coutin et al. (2017) used RNA-seq to monitor the transcriptional dynamics in the seed coats of darkening and nondarkening cranberry beans (Phaseolus vulgaris) at three developmental stages: early, intermediate and mature. The aim of the study was to evaluate if the changes in the seed coat transcriptome were associated with proanthocyanidin levels as a function of seed development in cranberry beans. The RNA-seq data are available on the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under the BioProject PRJNA380220.

The study identified 1336 differentially expressed genes, which were used for clustering. The raw read counts for genes were obtained from Binary Alignment/Map files using samtools (Li et al. 2009) and HTSeq (Anders et al. 2015). The median value from the three replicates for each developmental stage was used. Each observation or gene (the unit) was structured as a $2 \times 3$ matrix, such that it contained counts for the two varieties (the variables), darkening and nondarkening, across the three developmental stages (the occasions): early, intermediate, and mature. On the three-way data of dimensions $1336 \times 2 \times 3$, a clustering range of $G = 1,\ldots,10$ was considered using k-means initialization (100 runs). Furthermore, we repeated the analysis 10 times. Since BIC and ICL both performed well in recovering the underlying cluster structure in the simulated data, here we also used BIC and ICL for model selection. Both BIC and ICL selected a model with $G = 8$. In this model, Clusters 1–8 were composed of 206 (15.4%), 163 (12.2%), 104 (7.8%), 162 (12.1%), 126 (9.4%), 147 (11.0%), 162 (12.1%), and 266 (19.9%) genes, respectively. See Supplementary File 2 for the gene composition of each cluster. The log-transformed expression patterns of the clusters are visualized using the heatmap in Fig. 4.
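The data layout described above (median over replicates, then one $2 \times 3$ matrix per gene) can be sketched with NumPy. The counts below are made up, and the sample order D-E, D-I, D-M, ND-E, ND-I, ND-M is an assumption taken from the caption of Fig. 4; the variable names are hypothetical.

```python
import numpy as np

# Toy raw counts: 4 genes x 6 samples x 3 replicates (hypothetical values).
# Assumed sample order: D-E, D-I, D-M, ND-E, ND-I, ND-M.
raw = np.arange(4 * 6 * 3).reshape(4, 6, 3)

# Median across the three replicates of each sample
counts = np.median(raw, axis=2)        # shape (genes, 6)

# One 2 x 3 matrix per gene: rows = variety (darkening, nondarkening),
# columns = developmental stage (early, intermediate, mature)
Y = counts.reshape(-1, 2, 3)           # shape (genes, 2, 3)
```

With this row-major reshape, `Y[n, 1, 0]` holds the nondarkening early (ND-E) value for gene `n`, i.e. the fourth column of `counts`.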

Figure 4.

Heatmap showing, for Cluster 1 through Cluster 8, log-transformed gene expression patterns for the G =8 model selected by both BIC and ICL for the cranberry bean RNA-seq dataset. The red and blue colors represent the log-transformed expression levels, where red represents high expression and blue represents low expression. The rows represent the genes and the columns represent samples involved in the RNA-seq study. Respectively, the samples are D-E: darkening early, D-I: darkening intermediate, D-M: darkening mature, ND-E: nondarkening early, ND-I: nondarkening intermediate, and ND-M: nondarkening mature cranberry bean.

In simulation studies, the true labels are available, so the ARI between the true and predicted class labels can be used to assess clustering performance. Because no true labels are available for real data, we instead use the heatmap in Fig. 4 to assess the trends in the clusters obtained on the transcriptome data. We also visualize the cluster-specific μg, which relates to the mean trends of the expression levels of genes in different clusters, in Fig. 5. As evident in Figs 4 and 5, each cluster has a distinctive expression signature. In some clusters (for example, Cluster 4), the mean expression signatures are similar between the darkening and nondarkening beans but vary between developmental stages, whereas in Cluster 8 they are similar across the developmental stages but vary slightly between the darkening and nondarkening beans. In Cluster 3, on the other hand, the mean expression signatures vary both across the developmental stages and between darkening and nondarkening beans.
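For readers unfamiliar with the ARI, the sketch below computes the adjusted Rand index of Hubert and Arabie (1985) from two label vectors using only the Python standard library. The function name and the handling of degenerate inputs are choices made for this illustration and are not taken from the mixMVPLN package.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same units."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        # No pairs of units to compare; treated as perfect agreement here.
        return 1.0

    # Contingency table cells and its row and column margins.
    cell_counts = Counter(zip(labels_a, labels_b))
    row_counts = Counter(labels_a)
    col_counts = Counter(labels_b)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both partitions place all units in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])       # relabeled copy
different = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # crossed partition
```

An ARI of 1 indicates identical partitions up to relabeling, while values near 0 indicate agreement no better than expected by chance.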

Figure 5.

Visualization of the cluster-specific μg which relates to the mean trends of the expression levels of genes in Clusters 1–8.

For all simulation and transcriptome data analyses, the normalization factors representing library sizes for samples were obtained with the trimmed mean of M values (TMM) method, using the calcNormFactors function of the edgeR package (Robinson and Oshlack 2010; McCarthy et al. 2012).
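The idea behind a trimmed mean of M values can be sketched as below. This is a deliberately simplified illustration and not a reimplementation of edgeR's calcNormFactors: it assumes a fixed reference sample, trims only on the log-ratios, and omits the precision weights, the trimming on absolute expression, and the final rescaling of factors described by Robinson and Oshlack (2010). The function name `simple_tmm_factor` is hypothetical.

```python
from math import log2


def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted, log-ratio-trimmed scaling factor of sample vs reference."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    lib_sample, lib_reference = sum(sample), sum(reference)

    # M values: log2 ratios of library-size-scaled counts, restricted to
    # genes with nonzero counts in both samples.
    m_values = sorted(
        log2((y_s / lib_sample) / (y_r / lib_reference))
        for y_s, y_r in zip(sample, reference)
        if y_s > 0 and y_r > 0
    )
    if not m_values:
        raise ValueError("no gene has nonzero counts in both samples")

    # Drop the given fraction of M values from each tail, then average.
    cut = int(len(m_values) * trim)
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts: one highly expressed gene inflates the library size.
reference = [100] * 10
sample = [1000] + [100] * 9
factor = simple_tmm_factor(sample, reference)
```

In this toy example the factor is below 1, which offsets the single dominant gene so that the nine unchanged genes are treated as equally expressed in both samples.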

A table of mathematical notation is provided in Supplementary File 4 and a table of abbreviations is provided in Supplementary File 5.

4 Discussion

A mixture of MVPLN distributions is introduced for clustering three-way count data, targeted at expression data arising from RNA-seq experiments. This is the first use of a mixture of MVPLN distributions for clustering within the literature. By allowing for a direct analysis of three-way data structures, matrix variate distributions permit the estimation of correlations within and between variables and occasions, which makes them very attractive for clustering matrix data. Further, by considering a matrix variate structure, the number of free covariance parameters to be estimated is greatly reduced in high-dimensional settings. Herein, three parameter estimation frameworks are proposed: one based on MCMC, one based on VGA, and a hybrid approach. When posterior inference is of interest, the MCMC-based approach is favorable, but it can be computationally intensive. The VGA-based approach, on the other hand, is computationally efficient but only approximates the posterior distribution. Therefore, we also propose a hybrid approach that is computationally efficient and samples from the true posterior. Through simulation studies, we show that the VGA approach provides good clustering performance even when the datasets are generated from mixtures of other discrete distributions.
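The reduction in covariance parameters can be made concrete with a short count. The sketch below compares an unstructured covariance matrix for the vectorized observation with a Kronecker-structured one built from a row covariance and a column covariance. The function name is hypothetical, and the Kronecker count ignores the scale identifiability constraint, which some authors handle by subtracting one parameter, so the exact count used in the paper's model selection may differ.

```python
def covariance_parameter_counts(n_rows, n_cols):
    """Free covariance parameters for one n_rows x n_cols matrix observation.

    Returns (unstructured, kronecker): the count for a full covariance of the
    vectorized matrix, and the count for separate row and column covariances.
    """
    dim = n_rows * n_cols
    unstructured = dim * (dim + 1) // 2
    kronecker = n_rows * (n_rows + 1) // 2 + n_cols * (n_cols + 1) // 2
    return unstructured, kronecker


small = covariance_parameter_counts(2, 3)    # the cranberry bean layout
large = covariance_parameter_counts(10, 10)  # a larger hypothetical layout
print(small, large)
```

For the 2 × 3 layout the counts are 21 versus 9 per mixture component, and the gap widens quickly as the dimensions grow (5050 versus 110 for a 10 × 10 layout).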

Using simulated data, it was illustrated that the algorithm for mixtures of MVPLN distributions is effective and returns favorable clustering results. The transcriptome data analysis showed the applicability of the mixture model-based clustering method to RNA-seq count data. A possible future direction of this work would be to make use of subspace clustering methods and to develop a matrix variate factor analyzers model. This would permit clustering of data in low-dimensional subspaces as high-dimensional RNA-seq datasets become more frequent. Another path is to consider restrictions on the matrices Φg and Ωg, as done by Viroli (2011). Also, constraints on Φg similar to those introduced by McNicholas and Murphy (2010), and used by Anderlucci and Viroli (2015), could be beneficial when analyzing longitudinal RNA-seq data.

Supplementary Material

btad167_Supplementary_Data

Acknowledgements

The authors acknowledge the computational support by Dr Marcelo Ponce at the SciNet HPC Consortium, University of Toronto, ON, Canada. This research was enabled in part by support provided by Research Computing Services (https://carleton.ca/rcs) at Carleton University.

Contributor Information

Anjali Silva, Department of Mathematics and Statistics, University of Guelph, Guelph, ON N1G 2W1, Canada; Department of Molecular and Cellular Biology, University of Guelph, Guelph, ON N1G 2W1, Canada.

Xiaoke Qin, School of Mathematics and Statistics, Carleton University, Ottawa, ON K1S 5B6, Canada.

Steven J Rothstein, Department of Molecular and Cellular Biology, University of Guelph, Guelph, ON N1G 2W1, Canada.

Paul D McNicholas, Department of Mathematics and Statistics, McMaster University, Hamilton, ON L8S 4L8, Canada.

Sanjeena Subedi, School of Mathematics and Statistics, Carleton University, Ottawa, ON K1S 5B6, Canada.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) [to S.S. Grant no. RGPIN-2021-03812 and to P.M. Grant no. RGPIN-2023-06030], Canada Research Chairs program [to S.S. Grant no. 2020-00303 and P.M. Grant no. 950-230797, respectively], Queen Elizabeth II Graduate Scholarships in Science & Technology [to A.S.], Ontario Graduate Fellowship [to A.S.], and an E.W.R. Steacie Memorial Fellowship from NSERC [to P.M. Grant no. 537596-2019]. No funding body played a role in the design of the study, analysis and interpretation of data, or in writing the manuscript.

Data availability

The RNA-seq data used in the manuscript are publicly available on the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under the BioProject PRJNA380220. The GitHub R package for this work is available at https://github.com/anjalisilva/mixMVPLN.

References

  1. Aitchison J, Ho CH. The multivariate Poisson-log normal distribution. Biometrika 1989;76:643–53.
  2. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) Second International Symposium on Information Theory. New York: Springer Verlag, 1973, 267–81.
  3. Anderlucci L, Viroli C. Covariance pattern mixture models for the analysis of multivariate heterogeneous longitudinal data. Ann Appl Stat 2015;9:777–800.
  4. Anders S, Pyl PT, Huber W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 2015;31:166–9.
  5. Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated classification likelihood. IEEE Trans Pattern Anal Machine Intell 2000;22:719–25.
  6. Bozdogan H. Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. In: Bozdogan H, Sclove SL, Gupta AK, Haughton D, Kitagawa G, Ozaki T, Tanabe K (eds) Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach: Volume 2 Multivariate Statistical Modeling. Dordrecht: Springer Netherlands, 1994, 69–113.
  7. Brijs T, Karlis D, Swinnen G et al. A multivariate Poisson mixture model for marketing applications. Stat Neerland 2004;58:322–48.
  8. Bullard JH, Purdom E, Hansen KD et al. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 2010;11:94.
  9. Campbell J. The Poisson correlation function. Proc Edinburgh Math Soc 1934;4:18–26.
  10. Celeux G, Govaert G. A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 1992;14:315–32.
  11. Dogru FZ, Bulut YM, Arslan O. Finite mixtures of matrix variate t distributions. Gazi Univ J Sci 2016;29:335–41.
  12. Dong K, Zhao H, Tong T et al. NBLDA: negative binomial linear discriminant analysis for RNA-seq data. BMC Bioinformatics 2016;17:369.
  13. Doss D. Definition and characterization of multivariate negative binomial distribution. J Multivariate Anal 1979;9:460–4.
  14. Freixas-Coutin JA, Munholland S, Silva A et al. Proanthocyanidin accumulation and transcriptional responses in the seed coat of cranberry beans (Phaseolus vulgaris L) with different susceptibility to postharvest darkening. BMC Plant Biol 2017;17:89.
  15. Gallaugher MPB, McNicholas PD. Finite mixtures of skewed matrix variate distributions. Pattern Recognit 2018;80:83–93.
  16. Gao D, Kim J, Kim H et al. A survey of statistical software for analysing RNA-seq data. Hum Genomics 2010;5:56–5.
  17. Ghahramani Z, Beal M. Variational inference for Bayesian mixtures of factor analysers. Adv Neural Inf Process Syst 1999;12:449–55.
  18. Gollini I, Murphy TB. Mixture of latent trait analyzers for model-based clustering of categorical data. Stat Comput 2014;24:569–88.
  19. Gupta AK, Nagar DK. Matrix Variate Distributions. Boca Raton: Chapman and Hall/CRC, 2000.
  20. Hennig C. Cluster validation by measurement of clustering characteristics relevant to the user. Data analysis and applications 1: clustering and regression, modeling-estimating. Forecast Data Mining 2019;2:1–24.
  21. Hennig C. fpc: Flexible Procedures for Clustering. R package version 2.2-9, 2020. https://CRAN.R-project.org/package=fpc.
  22. Hubert L, Arabie P. Comparing partitions. J Classif 1985;2:193–218.
  23. Katz RW. On some criteria for estimating the order of a Markov chain. Technometrics 1981;23:243–9.
  24. Li H, Handsaker B, Wysoker A et al.; 1000 Genome Project Data Processing Subgroup. The sequence alignment/map (SAM) format and SAMtools. Bioinformatics 2009;25:2078–9.
  25. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15:550.
  26. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Volume 1: Statistics. Berkeley: University of California Press, 1967, 281–97.
  27. Marioni JC, Mason CE, Mane SM et al. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008;18:1509–17.
  28. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 2012;40:4288–97.
  29. McLachlan GJ, Peel D. Finite Mixture Models. New York: Wiley, 2000.
  30. McNicholas PD. Mixture Model-based Classification. Boca Raton: Chapman and Hall/CRC Press, 2016.
  31. McNicholas PD, Murphy TB. Model-based clustering of longitudinal data. Can J Stat 2010;38:153–68.
  32. Qiu W, Joe H. clusterGeneration: Random Cluster Generation (With Specified Degree of Separation). R Package Version 1.3.4, 2015.
  33. Rau A, Celeux G, Martin-Magniette M et al. Clustering high-throughput sequencing data with Poisson mixture models. Technical Report RR-7786, INRIA, Saclay, Ile-de-France, 2011.
  34. Rau A, Maugis-Rabusseau C, Martin-Magniette ML et al. Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics 2015;31:1420–7.
  35. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010;11:R25.
  36. Schwarz G. Estimating the dimension of a model. Ann Stat 1978;6:461–4.
  37. Scrucca L, Fop M, Murphy TB et al. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 2016;8:289.
  38. Shibata R. Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika 1976;63:117–26.
  39. Silva A, Rothstein SJ, McNicholas PD et al. A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data. BMC Bioinformatics 2019;20:394.
  40. Tang Y, Browne RP, McNicholas PD. Model based clustering of high-dimensional binary data. Comput Stat Data Anal 2015;87:84–101.
  41. Teicher H. Identifiability of finite mixtures. Ann Math Stat 1963;34:1265–9.
  42. Tunaru R. Hierarchical Bayesian models for multiple count data. Austr J Stat 2002;31:221–9.
  43. Viroli C. Finite mixtures of matrix normal distributions for classifying three-way data. Stat Comput 2011;21:511–22.
  44. Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. FNT Mach Learn 2007;1:1–305.
  45. Yakowitz SJ, Spragins JD. On the identifiability of finite mixtures. Ann Math Stat 1968;39:209–14.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
