Published in final edited form as: J Am Stat Assoc. 2016 Jan 15;110(512):1562–1576. doi: 10.1080/01621459.2014.983233

Bayesian factorizations of big sparse tensors

Jing Zhou 1, Anirban Bhattacharya 2, Amy Herring 3, David Dunson 4
PMCID: PMC6579540  NIHMSID: NIHMS855930  PMID: 31210707


It has become routine to collect data that are structured as multiway arrays (tensors). There is an enormous literature on low rank and sparse matrix factorizations, but limited consideration of extensions to the tensor case in statistics. The most common low rank tensor factorization relies on parallel factor analysis (PARAFAC), which expresses a rank k tensor as a sum of rank one tensors. When observations are only available for a tiny subset of the cells of a big tensor, the low rank assumption is not sufficient and PARAFAC has poor performance. We induce an additional layer of dimension reduction by allowing the effective rank to vary across dimensions of the table. For concreteness, we focus on a contingency table application. Taking a Bayesian approach, we place priors on terms in the factorization and develop an efficient Gibbs sampler for posterior computation. Theory is provided showing posterior concentration rates in high-dimensional settings, and the methods are shown to have excellent performance in simulations and several real data applications.

Keywords: Big data, Bayesian, Categorical data, Contingency table, Low rank, Matrix completion, PARAFAC, Tensor factorization

1. Introduction

Sparsely observed big tabular data sets are commonly collected in many applied domains. One example corresponds to recommender systems in which the dimensions of the table correspond to users, items and different contexts (Karatzoglou et al. (2010)), with a tiny proportion of the cells filled in for users providing rankings. The task is to fill in the rest of the huge table in order to make recommendations to users of which items they may prefer in each context. This extends the widely studied matrix completion problem (Candes and Recht (2009)) of which the Netflix challenge was one example. Another setting corresponds to contingency tables in which multivariate categorical data are collected for each individual, and the cells of the table contain counts of the number of individuals having a particular combination of values. In contingency table analyses, the focus is typically on inferring associations among the different variables, but challenges arise when there are many variables, so that the number of cells in the table is vastly bigger than the sample size.

Suppose that the tensor of interest is $\pi \in \mathbb{R}^{d_1 \times \cdots \times d_p}$, with $\mathbb{R}^{d_1 \times \cdots \times d_p}$ a space of $p$-way tensors having $d_j$ rows in the $j$th direction. Often there are constraints on the elements of the tensor. For recommender systems, ratings are non-negative so that one is faced with a non-negative tensor factorization problem (Paatero and Tapper (1994); Lee and Seung (1999); Friedlander and Hatz (2005); Lim and Comon (2009); Liu et al. (2012)). For contingency tables, the tensor corresponds to the joint probability mass function for multivariate categorical data, so that the elements are non-negative and add to one across all the cells (Dunson and Xing (2009); Bhattacharya and Dunson (2012)). Let $Y$ denote the data collected on tensor $\pi$. For recommender systems, $Y$ consists of ratings for a small subset of the $\prod_{j=1}^{p} d_j$ cells in the tensor, while for contingency tables $Y$ includes response vectors $y_i = (y_{i1}, \ldots, y_{ip})^T$ for subjects $i = 1, \ldots, n$, with $y_{ij} \in \{1, \ldots, d_j\}$ for $j = 1, \ldots, p$. In both cases, data are extremely sparse, with no observations in the overwhelming majority of cells.

To combat this data sparsity, it is necessary to substantially reduce dimensionality in estimating $\pi$. The usual way to accomplish this is through a low rank assumption. Unlike for matrices, there is no unique definition of rank, but the most common convention is to define the rank $k$ of a tensor $\pi$ as the smallest value of $k$ such that $\pi$ can be expressed as

$$\pi = \sum_{h=1}^{k} \psi_h^{(1)} \otimes \cdots \otimes \psi_h^{(p)}, \tag{1}$$

which is a sum of $k$ rank one tensors, each an outer product of vectors for each dimension (Kolda and Bader, 2009). Expression (1) is commonly referred to as parallel factor analysis (PARAFAC) (Harshman (1970); Bro (1997)). For $k$ small, the number of parameters is massively reduced from $\prod_{j=1}^{p} d_j$ to $k \sum_{j=1}^{p} d_j$; as the low rank assumption often holds approximately, this leads to an effective approach in many applications, and a rich variety of algorithms are available for estimation.
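The parameter saving in (1) can be made concrete with a small sketch. The function names (`parafac_cell`, `parafac_tensor`, `n_cells`, `n_parafac_params`) and the toy factor values below are illustrative assumptions, not part of the paper; the code only evaluates the sum of outer products cell by cell and compares the two counts.

```python
from itertools import product
from math import prod


def parafac_cell(factors, index):
    """Value of sum_h prod_j psi_h^(j)[c_j] at one cell.

    factors[h][j] is the vector psi_h^(j) (a list of length d_j).
    """
    return sum(prod(comp[j][c] for j, c in enumerate(index)) for comp in factors)


def parafac_tensor(factors, dims):
    """Full tensor as a dict from cell index to value (small tables only)."""
    return {idx: parafac_cell(factors, idx)
            for idx in product(*(range(d) for d in dims))}


def n_cells(dims):
    """Number of cells of the unrestricted tensor: prod_j d_j."""
    return prod(dims)


def n_parafac_params(dims, k):
    """Number of entries in the k sets of component vectors: k * sum_j d_j."""
    return k * sum(dims)


# Hypothetical toy example: a 3 x 4 x 2 tensor of rank at most 2.
dims = (3, 4, 2)
factors = [
    [[1.0, 0.0, 2.0], [1.0, 1.0, 0.0, 0.5], [2.0, 1.0]],
    [[0.5, 1.0, 0.0], [0.0, 2.0, 1.0, 1.0], [1.0, 3.0]],
]
tensor = parafac_tensor(factors, dims)
```

With 20 variables of 10 levels each, `n_cells((10,) * 20)` is 10^20 while `n_parafac_params((10,) * 20, 5)` is 1000, which is the exponential-to-linear reduction described above.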

However, the decrease in degrees of freedom from exponential in $p$ to linear in $p$ is not sufficient when $p$ is big. Large $p$ small $n$ problems arise routinely, and a usual solution outside of tensor settings is to incorporate sparsity. For example, in linear regression, many of the coefficients are set to zero, while in estimation of large covariance matrices, sparse factor models are used that assume few factors and many zeros in the factor loadings matrices (West (2003); Carvalho et al. (2008)). In the matrix factorization literature, there has been consideration of low rank plus sparse decompositions (Chartrand (2012)), but this approach does not solve our problem of too many parameters. Including zeros in the component vectors $\{\psi_h^{(j)}\}$ is not a viable solution, particularly as we do not want to enforce exact zeros in blocks of the tensor $\pi$ but require an alternative notion of sparsity.

Our notion is as follows. For component $h$ ($h = 1, \ldots, k$), we partition the dimensions into two mutually exclusive subsets, $S_h \cup S_h^c = \{1, \ldots, p\}$. The proposed sparse PARAFAC (sp-PARAFAC) factorization is then

$$\pi = \sum_{h=1}^{k} \psi_h^{(1)} \otimes \cdots \otimes \psi_h^{(p)}, \qquad \psi_h^{(j)} = \psi_0^{(j)} \ \text{for } j \in S_h^c. \tag{2}$$

Hence, instead of having to introduce a separate vector $\psi_h^{(j)}$ for every $h$ and $j$, we allow there to be more degrees of freedom used to characterize the tensor structure in certain directions than in others. Consider the recommender systems application and suppose we have three dimensions, including users ($j = 1$), items ($j = 2$) and context ($j = 3$). If we let $\psi_h^{(3)} = \psi_0^{(3)}$ for $h = 1, \ldots, k$, then

$$\pi_{c_1 c_2 c_3} = \psi_{0 c_3}^{(3)} \sum_{h=1}^{k} \psi_{h c_1}^{(1)} \psi_{h c_2}^{(2)}, \tag{3}$$

so that we factorize the user-item matrix as being of rank $k$, and then include a multiplier specific to each level of the context factor. This assumes that users rank systematically higher or lower depending on context but there is no interaction. In the contingency table application, $\Pr(y_{i1} = c_1, \ldots, y_{ip} = c_p) = \pi_{c_1 \cdots c_p}$. If $j \in S_h^c$ for $h = 1, \ldots, k$, then the $j$th variable is independent of the other variables with $\Pr(y_{ij} = c_j) = \psi_{0 c_j}^{(j)}$. By including $j \in S_h^c$ for some but not all $h \in \{1, \ldots, k\}$ one can use fewer degrees of freedom in characterizing the interaction between the $j$th factor and the other factors. In practice, we will learn $\{S_h\}$ using a Bayesian approach, as the appropriate lower dimensional structure is typically not known in advance.
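The independence claim for a variable that sits in the baseline set of every component can be checked numerically. The sketch below is illustrative only: the weights `nu`, the component vectors `lam`, and the shared baseline `lam0_3` are made-up values, and component weights are included as in the probabilistic form of the factorization used for contingency tables.

```python
from itertools import product

# Hypothetical two-component example with three binary variables.
nu = [0.6, 0.4]                       # component weights (sum to one)
lam = [
    [[0.9, 0.1], [0.2, 0.8]],         # component 1: variables 1 and 2
    [[0.3, 0.7], [0.5, 0.5]],         # component 2: variables 1 and 2
]
lam0_3 = [0.25, 0.75]                 # baseline vector shared by all components


def cell(c1, c2, c3):
    """Cell probability when variable 3 uses the baseline in every component."""
    return sum(nu[h] * lam[h][0][c1] * lam[h][1][c2] * lam0_3[c3]
               for h in range(len(nu)))


pi = {c: cell(*c) for c in product(range(2), repeat=3)}
total = sum(pi.values())

# Marginal of variables (1, 2) and marginal of variable 3.
marg12 = {(a, b): sum(pi[(a, b, c)] for c in range(2))
          for a, b in product(range(2), repeat=2)}
marg3 = [sum(pi[(a, b, c)] for a, b in product(range(2), repeat=2))
         for c in range(2)]

# Largest deviation from the product form pi_{c1 c2 c3} = pi_{c1 c2} * pi_{c3}.
max_gap = max(abs(pi[(a, b, c)] - marg12[(a, b)] * marg3[c])
              for a, b, c in pi)
```

Here `max_gap` is zero up to rounding and `marg3` reproduces `lam0_3`, so the third variable is independent of the other two while variables 1 and 2 remain dependent through the two components.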

We conjecture that many tensor data sets can be concisely represented via (2), with results substantially improved over usual PARAFAC factorizations due to the second layer of dimension reduction. For concreteness and brevity, we focus on contingency tables, but the methods are easily modified to other settings. Contingency table analysis is routine in practice; refer to Agresti (2002); Fienberg and Rinaldo (2007). However, in stark contrast to the well developed literature on linear regression and covariance matrix estimation in big data settings, very few flexible methods are scalable beyond small tables. Throughout the rest of the paper, we assume that the observed data yi = (yi1, …, yip)T, i = 1, …, n, are multivariate unordered categorical, with yij ∈ {1, …, dj}. Our interest is in situations where the dimensionality p is comparable to or even larger than the number of samples n.

2. Sparse Factor Models for Tables

2.1. Model and prior

We focus on a Bayesian implementation of sp-PARAFAC in (2). Let $\mathcal{S}^{r-1} = \{x \in \mathbb{R}^r : x_j \ge 0, \sum_{j=1}^{r} x_j = 1\}$ denote the $(r - 1)$-dimensional probability simplex. In the contingency table case, Dunson and Xing (2009) proposed the following probabilistic PARAFAC factorization:

$$\Pr(y_{i1} = c_1, \ldots, y_{ip} = c_p) = \pi_{c_1 \cdots c_p} = \sum_{h=1}^{k} \nu_h \prod_{j=1}^{p} \lambda_{h c_j}^{(j)}, \tag{4}$$

where $\nu = \{\nu_h\} \in \mathcal{S}^{k-1}$ and $\lambda_h^{(j)} = (\lambda_{h1}^{(j)}, \ldots, \lambda_{h d_j}^{(j)}) \in \mathcal{S}^{d_j - 1}$ is a vector of probabilities of $y_{ij} = 1, \ldots, d_j$ in component $h$. Introducing a latent sub-population index $z_i \in \{1, \ldots, k\}$ for subject $i$, the elements of $y_i$ are conditionally independent given $z_i$ with $\Pr(y_{ij} = c_j \mid z_i = h) = \lambda_{h c_j}^{(j)}$, and marginalizing out the latent index $z_i$ leads to a mixture of product multinomial distributions for $y_i$. Placing Dirichlet priors on the component vectors leads to a simple and efficient Gibbs sampler for posterior computation. We will refer to this model (4) as standard PARAFAC.

This approach has excellent performance in small to moderate $p$ problems, but as $p$ increases there is an inevitable breakdown point. The number of parameters increases linearly in $p$, as for other PARAFAC factorizations, so problems arise as $p$ approaches the order of $n$ or $p \gg n$. For example, we are particularly motivated by epidemiology studies collecting many categorical predictors, such as occupation type, demographic variables, and single nucleotide polymorphisms. For continuous response vectors $y_i \in \mathbb{R}^p$, there is a well developed literature on Gaussian sparse factor models that are adept at accommodating $p \gg n$ data (West (2003); Lucas et al. (2006); Carvalho et al. (2008); Bhattacharya and Dunson (2011)). These models include many zeros in the loadings matrices to induce additional dimension reduction on top of the low rank assumption. Pati et al. (2013a) provided theoretical support through characterizing posterior concentration.

Our sp-PARAFAC factorization provides an analog of sparse factor models in the tensor setting. Modifying for the categorical data case, we let

$$\pi_{c_1 \cdots c_p} = \sum_{h=1}^{k} \nu_h \prod_{j \in S_h} \lambda_{h c_j}^{(j)} \prod_{j \in S_h^c} \lambda_{0 c_j}^{(j)}, \tag{5}$$

where $|S_h| \ll p$ ($|S|$ denotes the cardinality of a set $S$) and the $\lambda_0^{(j)}$ vectors are fixed in advance; we consider two cases:

(i) $\lambda_0^{(j)} = \left(\frac{1}{d_j}, \ldots, \frac{1}{d_j}\right)^T$ and (ii) $\lambda_0^{(j)} = \left(\frac{1}{n} \sum_{i=1}^{n} 1(y_{ij} = 1), \ldots, \frac{1}{n} \sum_{i=1}^{n} 1(y_{ij} = d_j)\right)^T$,

corresponding to the discrete uniform distribution and to empirical estimates of the marginal category probabilities, respectively. By fixing the baseline dictionary vectors $\{\lambda_0^{(j)}\}$ in advance, and allocating a large subset of the variables within each cluster $h$ to the baseline component, we dramatically reduce the size of the model space. In particular, the probability tensor $\pi$ in (5) can be parameterized as $\theta_\pi = (\nu, \{S_h\}_{1 \le h \le k}, \{\lambda_h^{(j)}\}_{1 \le h \le k,\, j \in S_h})$, where $\nu \in \mathcal{S}^{k-1}$, $S_h \subset \{1, \ldots, p\}$, $\lambda_h^{(j)} \in \mathcal{S}^{d_j - 1}$. Thus, the effective number of model parameters is now reduced to $(k - 1) + \sum_{h=1}^{k} |S_h| + \sum_{h=1}^{k} \sum_{j \in S_h} (d_j - 1)$, which is substantially smaller than the $(k - 1) + \sum_{j=1}^{p} k (d_j - 1)$ parameters in the original specification, provided $|S_h| \ll p$ for all $h = 1, \ldots, k$. This is ensured via a sparsity favoring prior on $|S_h|$ below. We will illustrate that this can lead to huge differences in practical performance.
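To give a feel for the size of the reduction, the two parameter counts quoted above can be evaluated directly. This is a small sketch; the function names and the example configuration (100 four-level variables, 5 components, 10 active variables per component) are hypothetical choices for illustration.

```python
def n_params_parafac(dims, k):
    """(k - 1) + sum_j k (d_j - 1): standard probabilistic PARAFAC as in (4)."""
    return (k - 1) + sum(k * (d - 1) for d in dims)


def n_params_sp_parafac(dims, active_sets):
    """(k - 1) + sum_h |S_h| + sum_h sum_{j in S_h} (d_j - 1): sp-PARAFAC as in (5).

    active_sets[h] lists the (0-based) variables in S_h.
    """
    k = len(active_sets)
    membership = sum(len(S) for S in active_sets)
    free_probs = sum(dims[j] - 1 for S in active_sets for j in S)
    return (k - 1) + membership + free_probs


# Hypothetical configuration: p = 100, d_j = 4, k = 5, |S_h| = 10 for every h.
dims = (4,) * 100
active_sets = [list(range(10)) for _ in range(5)]
standard = n_params_parafac(dims, 5)          # 4 + 100 * 5 * 3
sparse = n_params_sp_parafac(dims, active_sets)  # 4 + 50 + 5 * 10 * 3
```

In this configuration the standard factorization has 1504 free parameters against 204 for the sparse version.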

Completing a Bayesian specification with priors for the unknown parameter vectors and expressing the model in hierarchical form, we have

$$\begin{aligned}
y_{ij} &\sim \mathrm{Mult}\left(\{1, \ldots, d_j\}; \lambda_{z_i 1}^{(j)}, \ldots, \lambda_{z_i d_j}^{(j)}\right), \\
\lambda_h^{(j)} &\sim (1 - \tau_h)\, \delta_{\lambda_0^{(j)}} + \tau_h\, \mathrm{Diri}(a_{j1}, \ldots, a_{j d_j}), \\
\Pr(z_i = h) &= \nu_h = V_h \prod_{l < h} (1 - V_l), \\
V_h &\sim \mathrm{Beta}(1, \alpha), \quad \alpha \sim \mathrm{Gamma}(a_\alpha, b_\alpha), \quad \tau_h \sim \mathrm{Beta}(1, \gamma).
\end{aligned} \tag{6}$$

It is evident that the hierarchical prior in (6) is supported on the space of probability tensors with a sp-PARAFAC decomposition as in (5), since (6) is equivalent to letting the subset-size $|S_h| \sim \mathrm{Binom}(p, \tau_h)$ and drawing a random subset $S_h$ uniformly from all subsets of $\{1, \ldots, p\}$ of size $|S_h|$ in (5). A stick-breaking prior (Sethuraman, 1994) is chosen for the component weights $\{\nu_h\}$, taking a nonparametric Bayes approach that allows $k = \infty$, with a hyperprior placed on the concentration parameter $\alpha$ in the stick-breaking process to allow the data to inform more strongly about the component weights. The probability of allocation $\tau_h$ to the active (non-baseline) category in component $h$ is chosen as Beta$(1, \gamma)$, with $\gamma > 1$ favoring allocation of many of the $\lambda_h^{(j)}$s to the baseline category $\lambda_0^{(j)}$. In the limiting case as $\gamma \to \infty$, the joint probability tensor $\pi$ becomes an outer product of the baseline probabilities for the individual variables, $\pi = \lambda_0^{(1)} \otimes \cdots \otimes \lambda_0^{(p)}$. On the other hand, as $\gamma \to 0$, one reduces back to standard PARAFAC (4).
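A draw from the prior in (6) can be sketched with the Python standard library alone. Several choices below are assumptions made for illustration rather than the paper's implementation: the stick-breaking process is truncated at `k` components (the last stick is set to one so the weights sum to one), `alpha` is passed in as a fixed value instead of being drawn from its Gamma hyperprior, the baseline is the uniform vector of case (i), and a symmetric Dirichlet with parameter `a` is used for active variables. The function names are hypothetical.

```python
import random


def dirichlet(alpha, rng):
    """Dirichlet draw via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def draw_prior(p, d, k, gamma, alpha, rng, a=1.0):
    """One draw of (nu, lambda, {S_h}) from a k-truncated version of prior (6)."""
    # Stick-breaking weights nu_h = V_h * prod_{l<h} (1 - V_l); V_k = 1 (truncation).
    sticks = [rng.betavariate(1.0, alpha) for _ in range(k - 1)] + [1.0]
    nu, remaining = [], 1.0
    for v in sticks:
        nu.append(v * remaining)
        remaining *= 1.0 - v

    lam0 = [1.0 / d] * d                      # uniform baseline, case (i)
    lam, active = [], []
    for _ in range(k):
        tau = rng.betavariate(1.0, gamma)     # tau_h ~ Beta(1, gamma)
        # Each variable is active with probability tau_h, so |S_h| ~ Binom(p, tau_h).
        s_h = [j for j in range(p) if rng.random() < tau]
        chosen = set(s_h)
        lam.append([dirichlet([a] * d, rng) if j in chosen else lam0
                    for j in range(p)])
        active.append(s_h)
    return nu, lam, active


rng = random.Random(0)
nu, lam, active = draw_prior(p=20, d=3, k=5, gamma=4.0, alpha=1.0, rng=rng)
```

Larger values of `gamma` push `tau` toward zero, so more variables fall back on the baseline vector, which mirrors the role of γ described above.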

Line 2 of expression (6) is key in inducing the second level of dimensionality reduction in our Bayesian sparse PARAFAC factorization. The inclusion of the baseline component that does not vary with $h$ massively reduces the number of parameters, and can additionally be argued to have minimal impact on the flexibility of the specification. The $\lambda_h^{(j)}$s are incorporated within $\prod_{j=1}^{p} \lambda_{h c_j}^{(j)}$, which for large $p$ is highly concentrated around its mean since the $\lambda_h^{(j)}$s are independent across $j$. This is a manifestation of the concentration of measure phenomenon (Talagrand, 1996), which roughly states that a random variable that depends in a smooth way on the influence of many independent variables, but not too much on any one of them, is essentially constant. For example, if $\theta_j \overset{iid}{\sim} U(0,1)$ and $\Theta = \prod_{j=1}^{p} \theta_j$, then $E(\Theta) = (1/2)^p$ and $\mathrm{var}(\Theta) = (1/3)^p - (1/4)^p$, which rapidly converges to zero. This implies that replacing a large randomly chosen subset of the $\lambda_h^{(j)}$s by $\lambda_0^{(j)}$ should have minimal impact on modeling flexibility.
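For reference, the moments of the product of independent uniforms in this example follow directly from independence of the $\theta_j$; a short derivation, offered as a sketch, is:

```latex
\begin{align*}
E(\Theta) &= \prod_{j=1}^{p} E(\theta_j) = (1/2)^p, \\
E(\Theta^2) &= \prod_{j=1}^{p} E(\theta_j^2) = (1/3)^p, \\
\mathrm{var}(\Theta) &= E(\Theta^2) - \{E(\Theta)\}^2 = (1/3)^p - (1/4)^p \le (1/3)^p .
\end{align*}
```

At $p = 10$, for instance, this gives $E(\Theta) \approx 9.8 \times 10^{-4}$ and $\mathrm{var}(\Theta) \approx 1.6 \times 10^{-5}$.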

2.2. Induced prior in log-linear parameterization

An important challenge is accommodating higher order interactions, which play an important role in many applications (e.g., genetics), but are typically assumed to equal zero for tractability. As p grows, it is challenging to even accommodate two-way interactions in traditional categorical data models (log-linear, logistic regression) due to an explosion in the number of terms. In contrast, the tensor factorization does not explicitly parameterize interactions, but indirectly induces a shrinkage prior on the terms in a saturated log-linear model. One can then reparameterize in terms of the log-linear model in conducting inferences in a post model-fitting step. We illustrate the induced priors on the main effects and interactions below.

For ease of exposition, we first focus on a case where $p = 3$ and $d_j = d = 2$ for $j = 1, \ldots, 3$. We generate 10,000 random probability tensors $\pi^{(t)} = (\pi^{(t)}_{c_1 c_2 c_3})$, $t = 1, \ldots, 10{,}000$, distributed according to (6), where we fix the baseline $\lambda_0^{(j)} = (1/2, 1/2)$ for all $j$. Given a 2 × 2 × 2 tensor $\pi$, we can equivalently characterize $\pi$ in terms of its log-linear parameterization


consisting of 3 main effect terms β1, β2, β3, three second-order interaction terms β12, β13, β23 and one third order interaction term β123; refer to §5.3.5 of Agresti (2002). Given each prior sample π(t), we equivalently obtain a sample β(t) from the induced prior on β, which allows us to estimate the marginal densities of the main effects and interactions and also their joint distributions. In particular, since γ plays an important role in placing weights on the baseline component, we would like to see how our induced priors differ with different γ values.
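The conversion from a probability tensor to log-linear coefficients can be sketched as follows. The coding is an assumption: the sketch uses the corner-point (baseline-category) convention, in which the log ratio of each cell to the all-zeros cell is a sum of coefficients over subsets of the variables set to one, so each coefficient is recovered by inclusion-exclusion over the log cell probabilities. The coding used for the figures in this section may differ, and the function name `loglinear_coefficients` is hypothetical.

```python
from itertools import combinations, product
from math import log


def loglinear_coefficients(pi, p):
    """Corner-point log-linear coefficients of a binary probability tensor.

    pi maps 0/1 tuples of length p to strictly positive probabilities.
    Returns a dict from variable subsets S (tuples of 0-based indices) to
    beta_S = sum over T subset of S of (-1)^(|S| - |T|) * log pi[cell with ones on T].
    """
    beta = {}
    for size in range(1, p + 1):
        for S in combinations(range(p), size):
            total = 0.0
            for t_size in range(size + 1):
                for T in combinations(S, t_size):
                    cell = tuple(1 if j in T else 0 for j in range(p))
                    total += (-1) ** (size - t_size) * log(pi[cell])
            beta[S] = total
    return beta


# Hypothetical check: an independence table has no interactions.
q = [0.3, 0.6, 0.5]                           # Pr(variable j equals 1)
pi_indep = {}
for cell in product((0, 1), repeat=3):
    prob = 1.0
    for j, c in enumerate(cell):
        prob *= q[j] if c == 1 else 1.0 - q[j]
    pi_indep[cell] = prob
beta_indep = loglinear_coefficients(pi_indep, 3)
```

For the independence table, every interaction coefficient is zero and each main effect equals the logit of the corresponding marginal probability; applying the same function to each prior draw of the tensor would give one draw from the induced prior under this coding.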

In our simulation exercise, we fix three values of γ, namely, γ = 1, 5, 20. Note that γ = 1 corresponds to a U(0,1) prior on τh. For different values of γ, we show the histograms of one main effect term β1, one two-way interaction β12 and the three-way interaction β123 in Figure 1. Table 1 additionally reports summary statistics.

Figure 1:

Histograms of induced priors for one main effect β1, one two-way interaction β12, and the three-way interaction β123 - Top Row: γ = 1; Middle Row: γ = 5; Bottom Row: γ = 20.

Table 1:

Summary statistics of induced priors on coefficients in log-linear model parameterization.

γ    Coefficient   Mean     SD      Min      Max     Skewness   Kurtosis
1    β1            0.014    0.831   −6.765   6.389   0.210      9.109
1    β12           −0.002   0.340   −2.895   3.105   −0.025     16.583
1    β123          0.002    0.196   −2.223   2.632   0.525      24.686
5    β1            −0.002   0.485   −5.648   5.433   0.031      27.980
5    β12           0.000    0.124   −2.085   2.244   0.495      93.438
5    β123          0.000    0.051   −1.214   0.745   −3.701     159.360
20   β1            0.002    0.246   −3.109   5.669   2.474      99.554
20   β12           0.000    0.042   −1.126   1.819   9.488      632.790
20   β123          0.000    0.009   −0.664   0.214   −44.051    3014.000

In high-dimensional regression, $y_i = x_i^T \beta + \epsilon_i$, there has been substantial interest in shrinkage priors, which draw $\beta_j$ a priori from a density concentrated at zero with heavy tails. Such priors strongly shrink the small coefficients to zero, while limiting shrinkage of the larger signals (Park and Casella, 2008; Carvalho et al., 2010; Polson and Scott, 2010; Hans, 2011; Armagan et al., 2013a). In Figure 1, the induced prior on any of the log-linear model parameters is symmetric about zero, with a large spike very close to zero, and heavy tails. Thus, we have indirectly induced a continuous shrinkage prior on the main effects and interactions through our tensor decomposition approach. In addition, the prior automatically shrinks more aggressively as the interaction order increases. Such greater shrinkage of interactions is commonly recommended (Gelman et al., 2008). Importantly, we do not zero out small interactions but allow many small coefficients, which is an important distinction in applications, such as genomics, having many small signals. The hyperparameter $\gamma$ serves as a penalty controlling the degree of shrinkage.

Our next set of simulations involves larger values of $p$, where the necessity of the regularization implied by $\gamma$ becomes strikingly evident. In the log-linear parameterization, we now have $p$ main effects $\beta_1, \ldots, \beta_p$; let $\beta_{\mathrm{main}} = (\beta_1, \ldots, \beta_p)^T$. In the $p \gg n$ setting, one cannot even hope to consistently recover all the main effects unless a large fraction of the $\beta_j$'s are zero or close to zero. One would thus favor a shrinkage prior on $\beta_{\mathrm{main}}$, with any particular draw resembling a near-sparse vector. Since the induced prior on the $\beta_j$'s is continuous, we study the $l_1$ norm $\|\beta_{\mathrm{main}}\|_1 = \sum_{j=1}^{p} |\beta_j|$ as a surrogate for the $l_0$ norm to quantify the sparsity.

We consider $p = 50, 100, 150, 200$ and plot histograms of the induced density of $\|\beta_{\mathrm{main}}\|_1$ based on 10,000 prior draws in Figures 2 and 3. Figure 2 corresponds to the case where $\gamma = 0$, i.e., when the sp-PARAFAC formulation reduces back to the standard PARAFAC (4), while $\gamma / p$ is set to a constant $\beta \in (0,1)$ in Figure 3. Figure 2 reveals a highly undesirable property of the standard PARAFAC in high dimensions, where the entire distribution of $\|\beta_{\mathrm{main}}\|_1$ shifts to the right with increasing $p$, with $E\|\beta_{\mathrm{main}}\|_1$ of order $p$. The induced prior clearly lacks any automatic multiplicity adjustment property (Scott and Berger, 2010), and would bias inferences for moderate to large values of $p$. On the other hand, under the sp-PARAFAC model, the induced prior on $\|\beta_{\mathrm{main}}\|_1$ is robust to increasing $p$, as evident from Figure 3. The choice $\gamma = \beta p$ essentially forces a constant proportion of the variables to be assigned to the null group; see Castillo and van der Vaart (2012) for a similar choice of the hyper-parameter in a regression setting.

Figure 2:

Histograms of $\|\beta_{\mathrm{main}}\|_1$ for different values of $p$ under the standard PARAFAC model.

Figure 3:

Histograms of $\|\beta_{\mathrm{main}}\|_1$ for different values of $p$ under the sp-PARAFAC model with $\gamma = 0.1p$.

3. Posterior concentration

3.1. Preliminaries

In this section, we provide theoretical justification for the proposed sp-PARAFAC procedure in high dimensional settings by studying the concentration properties of the posterior with growing sample size. When the parameter space is finite dimensional, it is well known that the posterior contracts at the parametric rate of $n^{-1/2}$ under mild regularity conditions (Ghosal et al., 2000). However, we are interested in the asymptotic framework of the dimension $p = p_n$ growing with the sample size $n$, potentially at a faster rate, reflecting the applications we are interested in. There is a small but increasing literature on asymptotic properties of Bayesian procedures in models with growing dimensionality, with most of the focus being on linear models or generalized linear models belonging to the exponential family; refer to Ghosal (1999, 2000); Belitser and Ghosal (2003); Jiang (2007); Armagan et al. (2013b); Bontemps (2011); Castillo and van der Vaart (2012); Yang and Dunson (2013) among others. In all these cases, the object of interest is a vector of high-dimensional regression coefficients or more generally, the conditional distribution $f(y \mid x)$ of a univariate response $y$ given high-dimensional predictors $x$. However, our object of interest is significantly different as we are concerned with estimation of the high-dimensional joint probability tensor $\pi$.

Let $\mathcal{F}_n$ denote the class of all $d_1 \times \cdots \times d_{p_n}$ probability tensors; we shall assume $d_1 = \cdots = d_{p_n} = d$ in the sequel for notational convenience. Let $\pi^{(0n)} \in \mathcal{F}_n$ be a sequence of true tensors. We observe $y_1, \ldots, y_n \sim \pi^{(0n)}$ and set $y^{(n)} = (y_1, \ldots, y_n)$. We denote the prior distribution on $\mathcal{F}_n$ induced by the sp-PARAFAC formulation by $\Pi_n$ and the corresponding posterior distribution by $\Pi_n(\cdot \mid y^{(n)})$.

For two probability tensors $\pi^{(1)}$ and $\pi^{(2)} \in \mathcal{F}_n$, the $L_1$ distance is defined as:

$$\|\pi^{(1)} - \pi^{(2)}\|_1 = \sum_{c_1 = 1}^{d} \cdots \sum_{c_{p_n} = 1}^{d} \left| \pi^{(1)}_{c_1 \cdots c_{p_n}} - \pi^{(2)}_{c_1 \cdots c_{p_n}} \right|.$$

For a sequence of numbers $\epsilon_n \to 0$ and a constant $M > 0$ independent of $\epsilon_n$, let

$$U_n = \left\{ \pi : \|\pi - \pi^{(0n)}\|_1 \le M \epsilon_n \right\} \tag{7}$$

denote a ball of radius $M \epsilon_n$ around $\pi^{(0n)}$ in the $L_1$ norm. We seek to find a minimum possible sequence $\epsilon_n$ such that

$$\lim_{n \to \infty} \Pi_n(U_n^c \mid y^{(n)}) = 0, \quad \text{a.s. } \pi^{(0n)}. \tag{8}$$

3.2. Assumptions

In this section we state our assumptions on the true data generating model and briefly discuss their implications.

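Assumption 3.1. The true sequence of probability tensors $\pi^{(0n)}$ is of the form

$$\pi^{(0n)}_{c_1 \cdots c_{p_n}} = \sum_{h=1}^{k_n} \nu_{0h} \prod_{j \in S_{0h}} \lambda^{(0j)}_{h c_j} \prod_{j \in S_{0h}^c} \lambda^{(j)}_{0 c_j}, \quad 1 \le c_j \le d, \ 1 \le j \le p_n, \tag{A0}$$

where $\lambda_0^{(j)} \in \mathcal{S}^{d-1}$ are assumed to be known. Unless otherwise specified, we shall assume $\lambda_0^{(j)} = (1/d, \ldots, 1/d)$ is the probability vector corresponding to the uniform distribution on $\{1, \ldots, d\}$.

We now provide some intuition for assumption (A0). Letting $S_0 = \bigcup_{h=1}^{k_n} S_{0h}$, we can rewrite the expansion of $\pi^{(0n)}$ in (A0) as

$$\pi^{(0n)}_{c_1 \cdots c_{p_n}} = \sum_{h=1}^{k_n} \nu_{0h} \prod_{j \in S_0} \bar{\lambda}^{(0j)}_{h c_j} \prod_{j \in S_0^c} \lambda^{(j)}_{0 c_j}, \tag{9}$$

where

$$\bar{\lambda}_h^{(0j)} = \begin{cases} \lambda_h^{(0j)} & \text{if } j \in S_{0h}, \\ \lambda_0^{(j)} & \text{if } j \in S_0 \setminus S_{0h}. \end{cases}$$

In (9), the term $\prod_{j \in S_0^c} \lambda^{(j)}_{0 c_j}$ does not involve $h$ and can be factored out completely. Assumption (A0) thus posits that the variables in $S_0^c$ are marginally independent and the entire dependence structure is driven by the variables in $S_0$. We shall refer to $S_0$ and $S_0^c$ as the non-null and null group of variables respectively.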

Let $q_n = |S_0|$ and define a mapping $j \mapsto e_j$ from $\{1, \ldots, q_n\}$ to the ordered elements of $S_0$, so that $e_1 < \cdots < e_{q_n}$. As $j$ varies between 1 and $q_n$, $e_j$ ranges over the elements of $S_0$. Denote by $\psi^{(0n)}$ the $d^{q_n}$ joint probability tensor for the variables $\{y_{ij} : j \in S_0\}$, so that

$$\psi^{(0n)}_{c_1 \cdots c_{q_n}} = \Pr(y_{i e_1} = c_1, \ldots, y_{i e_{q_n}} = c_{q_n}) = \sum_{h=1}^{k_n} \nu_{0h} \prod_{j=1}^{q_n} \bar{\lambda}^{(0 e_j)}_{h c_j}. \tag{10}$$

Thus, after factoring out the marginally independent variables in $S_0^c$, (A0) implies a standard PARAFAC expansion (10) for $\psi^{(0n)}$ with $k_n$ many components. Since any non-negative tensor admits a standard PARAFAC decomposition (Lim and Comon, 2009), we can always write an expansion of $\psi^{(0n)}$ as in (10).

The next set of assumptions is provided below.

Assumption 3.2. In addition to (A0), $\pi^{(0n)}$ satisfies

(A1) The number of components $k_n = O(1)$.

(A2) Letting $s_n = \max_{1 \le h \le k_n} |S_{0h}|$, one has $s_n = o(\log p_n)$.

(A3) There exists a constant $\varepsilon_0 \in (0,1)$ such that $\lambda^{(0j)}_{hc} \ge \varepsilon_0$ for all $1 \le h \le k_n$, $1 \le c \le d$, $j \in S_{0h}$.

(A1) and (A2) imply that the size of the non-null group is much smaller than $p_n$, since $q_n = |S_0| \le \sum_{h=1}^{k_n} |S_{0h}| \le k_n s_n \ll p_n$.

Some discussion is in order for condition (A3). First, note that we can choose $\varepsilon_0$ in a way so that $\bar{\lambda}^{(0j)}_{hc} \ge \varepsilon_0$ for all $h$, $c$ and $j \in S_0$. Hence, (A3) implies a lower bound on the joint probability $\psi^{(0n)}$ in (10). Such a lower bound on a compactly supported target density is a standard assumption in Bayesian non-parametric theory; see for example van der Vaart and van Zanten (2008). However, unlike univariate or multivariate density estimation in fixed dimensions where the density can be assumed to be bounded below by a constant, we need to precisely characterize the decay rate of the lower bound of the joint probability. Since $\psi^{(0n)}$ is a $d^{q_n}$ probability tensor, $\min_{c_1, \ldots, c_{q_n}} \psi^{(0n)}_{c_1 \cdots c_{q_n}} \le (1/d)^{q_n} = \exp(-s_n k_n \log d)$. Assumption (A3) implies that

$$\min_{c_1, \ldots, c_{q_n}} \psi^{(0n)}_{c_1 \cdots c_{q_n}} \ge \exp(-q_n \log(1/\varepsilon_0)) = \exp(-c_0 s_n) \tag{11}$$

for some constant $c_0 > 0$.

3.3. Main result

We are now in a position to state a theorem on posterior convergence rates.

Theorem 3.1. Assume the true sequence of tensors $\pi^{(0n)} \in \mathcal{F}_n$ satisfies assumptions (A0) – (A3) and $s_n \log p_n / n \to 0$. Also, assume the sp-PARAFAC model is fitted with the stick-breaking prior truncated to $k_n$ many components and $\gamma = \beta p_n^2$ for some constant $\beta \in (0,1)$ in (6). Then, (8) is satisfied with $\epsilon_n = \sqrt{s_n \log p_n / n}$ in (7).

A proof of Theorem 3.1 can be found in the appendix. As an implication of Theorem 3.1, if $p_n = n^d$ for some constant $d$, then the posterior contracts at the near parametric rate $\sqrt{(\log n)^c / n}$ for some constant $c > 0$. Moreover, consistent estimation is possible even if $p_n$ is exponentially large as long as $p_n \ll \exp(n)$. In particular, with $p_n = \exp(n^{\delta/2})$ for $\delta < 1$, the posterior contracts at least at the rate $n^{-(1-\delta)/2}$.
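As a worked illustration of these two regimes, the arithmetic can be spelled out under the reading $\epsilon_n = \sqrt{s_n \log p_n / n}$ of the rate in Theorem 3.1 (the radical and the exponents are a reconstruction of the typeset formula, so this is a sketch rather than a restatement of the proof), using $s_n = o(\log p_n)$ from (A2):

```latex
\begin{align*}
p_n = n^{d}: \quad
  & s_n \log p_n = o\{(\log p_n)^2\} = o\{d^2 (\log n)^2\}
  \;\Longrightarrow\; \epsilon_n = o\{ d \log n / \sqrt{n} \}, \\
p_n = \exp(n^{\delta/2}): \quad
  & s_n \log p_n = o\{(\log p_n)^2\} = o(n^{\delta})
  \;\Longrightarrow\; \epsilon_n = o\{ n^{-(1-\delta)/2} \}.
\end{align*}
```

The first line is a near parametric rate up to a logarithmic factor, and the second tends to zero exactly when $\delta < 1$.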

Remark 3.2. We assume the number of components $k_n$ is known in Theorem 3.1 for ease of exposition, with our main focus on dimensionality reduction. Adapting to an unknown number of components in mixture models is a well-studied problem; see, for example, Ge and Jiang (2006); Pati et al. (2013b); Shen et al. (2011). For the infinite stick-breaking prior on the mixture components, one can use the sieving technique developed in Pati et al. (2013b) to estimate deviation bounds for the tail sum of a stick-breaking process.

Remark 3.3. In practice, we recommend the choice $\gamma = \beta p_n$ for numerical stability, with $\beta = 0.2$ used as a default choice in all our examples. The probability mass function of the induced beta-Bernoulli prior on $|S_h|$ with $\gamma = \beta p_n^2$ behaves like $\exp(-c s \log p_n)$ for small $s$, while the same is $\exp(-c s)$ for $\gamma = \beta p_n$; refer to the proof of Theorem 3.1 for further details.

4. Posterior Computation

Under model (6), we can easily proceed to draw posterior samples from a Gibbs sampler since all the full conditionals have recognizable forms. The algorithm iterates through the following steps:

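  1. For each variable $j$ and latent class $h$, update $\lambda_h^{(j)} = (\lambda_{h1}^{(j)}, \ldots, \lambda_{h d_j}^{(j)})$ from a mixture of two distributions with different weights. Given the prior we specified for $\lambda_h^{(j)}$ in (6), the posterior maintains its conjugacy and comes from either a Dirichlet or the baseline category, i.e., for $j = 1, \ldots, p$, $h = 1, \ldots, k^*$, where $k^* = \max\{z_1, \ldots, z_n\}$:
    $(\lambda_h^{(j)} \mid -) = w_{0h}^{(j)} \delta_{\lambda_0^{(j)}} + w_{1h}^{(j)} \mathrm{Diri}\left(a_{j1} + \sum_{i=1}^{n} 1(y_{ij} = 1, z_i = h), \ldots, a_{j d_j} + \sum_{i=1}^{n} 1(y_{ij} = d_j, z_i = h)\right)$, (12)
    where $w_{0h}^{(j)}$ and $w_{1h}^{(j)}$ are the mixture weights.
  2. Let $S_{hj}$ be the allocation variable with $S_{hj} = 0$ if $\lambda_h^{(j)}$ is updated from the baseline component, and $S_{hj} = 1$ if $\lambda_h^{(j)}$ is from a Dirichlet posterior distribution. Update $\tau_h$, $h = 1, \ldots, k^*$, from a Beta full conditional:
    $\tau_h \mid - \sim \mathrm{Beta}\left(1 + \sum_{j=1}^{p} 1(S_{hj} = 1), \gamma + \sum_{j=1}^{p} 1(S_{hj} = 0)\right)$. (13)
  3. The full conditional of $V_h$, $h = 1, \ldots, k^*$, only requires the updated information on latent class allocation for all subjects:
    $V_h \mid - \sim \mathrm{Beta}\left(1 + \sum_{i=1}^{n} 1(z_i = h), \alpha + \sum_{i=1}^{n} 1(z_i > h)\right)$. (14)
  4. We sample $z_i$, $i = 1, \ldots, n$, from the multinomial full conditional with:
    $\Pr(z_i = h \mid -) = \dfrac{\nu_h \prod_{j=1}^{p} \lambda_{h y_{ij}}^{(j)}}{\sum_{l=1}^{k^*} \nu_l \prod_{j=1}^{p} \lambda_{l y_{ij}}^{(j)}}$, (15)
    where $\nu_h = V_h \prod_{l < h} (1 - V_l)$.
  5. Update $\alpha$ from the Gamma full conditional:
    $\alpha \mid - \sim \mathrm{Gamma}\left(a_\alpha + k^*, b_\alpha - \sum_{h=1}^{k^*} \log(1 - V_h)\right)$. (16)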

These steps are simple to implement and we gain efficiency by updating the parameters in blocks. For example, instead of updating $\lambda_h^{(j)}$ one at a time, we sample $\lambda = \{\lambda_h^{(j)}, h = 1, \ldots, k^*, j = 1, \ldots, p\}$ jointly with corresponding parameters in matrix form. In all our examples, we ran the chain for 25,000 iterations, discarding the first 10,000 iterations as burn-in and collecting every fifth sample post burn-in to thin the chain. Mixing and convergence were satisfactory based on the examination of trace plots and the run time scaled linearly with $n$ and $p$. We also carried out sensitivity analysis by multiplying and dividing the hyperparameters $a_\alpha$, $b_\alpha$ and $\gamma$ in (6) by a factor of 2, and the conclusions remained unchanged from the default setting $a_\alpha = b_\alpha = 1$ and $\gamma = 0.2p$.
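The five steps can be sketched in plain Python as a single sweep. This is a minimal illustration under stated assumptions, not the authors' implementation: the sampler is truncated at a fixed number of components `k` (with the last stick set to one, so the shape in the update of α uses `k - 1` sticks), categories and class labels are 0-based, the baseline is uniform, and the mixture weights in step 1 are an assumption derived from conjugacy (prior mass times the baseline likelihood versus prior mass times the Dirichlet-multinomial marginal likelihood), since their formula is not shown in the text. All function and variable names are hypothetical.

```python
import random
from math import exp, lgamma, log


def slog(x):
    """log with a floor, to guard against log(0) from floating point underflow."""
    return log(max(x, 1e-300))


def log_mv_beta(a):
    """Log of the multivariate beta function B(a)."""
    return sum(lgamma(x) for x in a) - lgamma(sum(a))


def dirichlet(a, rng):
    g = [rng.gammavariate(x, 1.0) for x in a]
    total = sum(g)
    return [x / total for x in g]


def gibbs_sweep(y, z, state, d, k, gamma, a_alpha, b_alpha, rng, a=1.0):
    """One sweep over steps 1-5; mutates z and state, returns the weights nu."""
    n, p = len(y), len(y[0])
    lam0 = [1.0 / d] * d
    lam, tau = state["lam"], state["tau"]
    for h in range(k):
        members = [i for i in range(n) if z[i] == h]
        n_active = 0
        for j in range(p):
            counts = [0] * d
            for i in members:
                counts[y[i][j]] += 1
            post = [a + c for c in counts]
            # Step 1 (assumed weights): spike at baseline vs. Dirichlet slab.
            log_w0 = slog(1.0 - tau[h]) + sum(c * log(q) for c, q in zip(counts, lam0))
            log_w1 = slog(tau[h]) + log_mv_beta(post) - log_mv_beta([a] * d)
            diff = log_w0 - log_w1
            prob_active = 0.0 if diff > 700 else 1.0 / (1.0 + exp(diff))
            if rng.random() < prob_active:
                lam[h][j] = dirichlet(post, rng)
                n_active += 1
            else:
                lam[h][j] = lam0
        # Step 2: tau_h from its Beta full conditional, as in (13).
        tau[h] = rng.betavariate(1 + n_active, gamma + p - n_active)
    # Step 3: sticks V_h as in (14), then nu_h = V_h * prod_{l<h} (1 - V_l).
    sizes = [sum(1 for zi in z if zi == h) for h in range(k)]
    sticks = [min(rng.betavariate(1 + sizes[h], state["alpha"] + sum(sizes[h + 1:])),
                  1.0 - 1e-12) for h in range(k - 1)]
    nu, rest = [], 1.0
    for v in sticks + [1.0]:
        nu.append(v * rest)
        rest *= 1.0 - v
    # Step 4: class labels from the multinomial full conditional, as in (15).
    for i in range(n):
        logw = [slog(nu[h]) + sum(slog(lam[h][j][y[i][j]]) for j in range(p))
                for h in range(k)]
        top = max(logw)
        z[i] = rng.choices(range(k), weights=[exp(w - top) for w in logw])[0]
    # Step 5: alpha from its Gamma full conditional (gammavariate takes a scale).
    rate = b_alpha - sum(log(1.0 - v) for v in sticks)
    state["alpha"] = rng.gammavariate(a_alpha + k - 1, 1.0 / rate)
    return nu


# Toy data (hypothetical): variables 0 and 1 follow a hidden group, the rest are noise.
rng = random.Random(1)
n, p, d, k = 60, 8, 3, 4
y = []
for i in range(n):
    g = i % 2
    row = [rng.randrange(d) for _ in range(p)]
    row[0] = g if rng.random() < 0.9 else rng.randrange(d)
    row[1] = g + 1 if rng.random() < 0.9 else rng.randrange(d)
    y.append(row)
z = [rng.randrange(k) for _ in range(n)]
state = {"lam": [[[1.0 / d] * d for _ in range(p)] for _ in range(k)],
         "tau": [0.5] * k, "alpha": 1.0}
for sweep in range(50):
    nu = gibbs_sweep(y, z, state, d, k, gamma=0.2 * p,
                     a_alpha=1.0, b_alpha=1.0, rng=rng)
```

The sketch updates one vector at a time for readability; the blocked, matrix-form updates mentioned above would replace the inner loops over `h` and `j`.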

5. Simulation Studies

5.1. Estimating sparse interactions

We first conduct a replicated simulation study to assess the estimation of sparse interactions using the proposed sp-PARAFAC model. We simulated 100 dependent binary variables $y_{ij} \in \{0, 1\}$, $j = 1, \ldots, p = 100$ ($d_j = d = 2$) for $i = 1, \ldots, n = 100$ subjects from a log-linear model having up to three-way interactions:

$$\log\left(\frac{\pi_{c_1 \cdots c_p}}{\pi_{0 \cdots 0}}\right) = \sum_{s=1}^{3} \sum_{S \subset \{1, \ldots, p\} : |S| = s} \beta_S\, 1(c_S = 1). \tag{17}$$

For example, if $S = \{1, 2, 4\}$, then $\beta_S = \beta_{1,2,4}$ and $1(c_S = 1) = 1(c_1 = 1, c_2 = 1, c_4 = 1)$ with $1(\cdot)$ denoting the indicator function. To mimic the situation where only a few interactions are present, we restrict to $S \subset S^* = \{2, 4, 12, 14\}$ and set all coefficients to zero except the non-null main effects and interactions whose true values are listed in Table 2. This data generating mechanism induces dependence among the variables in $S^*$, while rendering the other variables to be marginally independent. Figure 4 reports the posterior means and 95% credible intervals for all main effects and interactions for the variables in $S^*$ averaged across 100 simulation replicates along with the true coefficients. As illustrated in Figure 4, averaging across the simulation replicates and different parameters, the 95% credible intervals cover the true parameter values 80% of the time.
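One way to generate data of this kind is sketched below, as an illustration rather than the simulation code used in the study. It takes the non-null coefficients from the "True coefficient" row of Table 2, enumerates the 16 cells of the four variables in S* to normalize model (17), and draws every other variable as an independent fair coin, which is what (17) implies when all coefficients involving that variable are zero. The function names are hypothetical.

```python
import random
from itertools import product
from math import exp

S_STAR = (2, 4, 12, 14)                        # 1-based variable labels
BETA = {                                        # true coefficients, from Table 2
    (2,): 1.0, (4,): -1.5, (12,): 2.0, (14,): 1.5,
    (2, 4): -0.5, (2, 12): 0.5, (4, 12): -0.5, (4, 14): -0.5, (12, 14): 0.5,
    (2, 4, 12): 0.25, (4, 12, 14): 0.5,
}


def joint_over_s_star():
    """Normalized joint pmf of the variables in S* under model (17)."""
    weights = {}
    for cell in product((0, 1), repeat=len(S_STAR)):
        on = {v for v, c in zip(S_STAR, cell) if c == 1}
        # beta_S contributes only when every variable in S equals one.
        weights[cell] = exp(sum(b for S, b in BETA.items() if set(S) <= on))
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}


def simulate(n, p, rng):
    """n rows of p binary variables; columns are 0-based, labels 1-based."""
    pmf = joint_over_s_star()
    cells, probs = list(pmf), list(pmf.values())
    data = []
    for _ in range(n):
        row = [rng.randint(0, 1) for _ in range(p)]   # null variables: fair coins
        cell = rng.choices(cells, weights=probs)[0]
        for v, c in zip(S_STAR, cell):
            row[v - 1] = c
        data.append(row)
    return data


pmf = joint_over_s_star()
data = simulate(100, 100, random.Random(0))
```

Ratios of cell probabilities recover the coefficients, e.g. `pmf[(1, 0, 0, 0)] / pmf[(0, 0, 0, 0)]` equals `exp(1.0)`, the exponentiated main effect of variable 2.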

Figure 4:

Posterior means and 95% credible intervals for all main effects and interactions in S* compared with the true coefficients.

Next, we study performance in estimating the dependence structure. Cramer’s V is a popular statistic measuring the strength of association or dependence between two (nominal) categorical variables in a contingency table, ranging from 0 (no association) to 1 (perfect association). Let $\rho_{jj'}$ denote the Cramer’s V statistic for variables $j$ and $j'$, so that

$$\rho_{jj'}^2 = \frac{1}{\min\{d_j, d_{j'}\} - 1} \sum_{c_j = 1}^{d_j} \sum_{c_{j'} = 1}^{d_{j'}} \frac{\left(\pi^{(jj')}_{c_j c_{j'}} - \pi^{(j)}_{c_j} \pi^{(j')}_{c_{j'}}\right)^2}{\pi^{(j)}_{c_j} \pi^{(j')}_{c_{j'}}}, \tag{18}$$

where $\pi^{(jj')}_{l l'} = \Pr(y_{ij} = l, y_{ij'} = l')$ and $\pi^{(j)}_{l} = \Pr(y_{ij} = l)$. Under the log-linear model (17), $\rho = (\rho_{jj'})$ is a sparse matrix with the Cramer’s V for all pairs except those in $S^* \times S^*$ being zero. This is an immediate consequence of the fact that if $(j, j') \notin S^* \times S^*$, then $y_{ij}$ and $y_{ij'}$ are independent.
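Expression (18) translates directly into code. The sketch below computes Cramer's V from a bivariate probability table and also builds the empirical table from two data columns; the function names are hypothetical, categories are 0-based, and cells whose marginal probability is zero are skipped to avoid division by zero (a choice made here for the sketch).

```python
from math import sqrt


def cramers_v(joint):
    """Cramer's V from joint[a][b] = Pr(y_j = a, y_j' = b), following (18)."""
    row = [sum(r) for r in joint]
    col = [sum(r[b] for r in joint) for b in range(len(joint[0]))]
    total = 0.0
    for a, r in enumerate(joint):
        for b, p_ab in enumerate(r):
            expected = row[a] * col[b]
            if expected > 0:                    # skip empty categories
                total += (p_ab - expected) ** 2 / expected
    return sqrt(total / (min(len(row), len(col)) - 1))


def empirical_joint(col_a, col_b, d_a, d_b):
    """Empirical bivariate table from two columns of 0-based category labels."""
    n = len(col_a)
    joint = [[0.0] * d_b for _ in range(d_a)]
    for a, b in zip(col_a, col_b):
        joint[a][b] += 1.0 / n
    return joint


v_independent = cramers_v([[0.25, 0.25], [0.25, 0.25]])
v_perfect = cramers_v([[0.5, 0.0], [0.0, 0.5]])
v_empirical = cramers_v(empirical_joint([0, 1, 0, 1], [0, 1, 0, 1], 2, 2))
```

Applying `cramers_v` to the bivariate margins implied by each posterior draw of the model parameters would give posterior samples of the statistic, while `empirical_joint` gives the plug-in estimator.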

We compare estimation of the off-diagonal entries of $\rho$ under the sp-PARAFAC model with the empirical Cramer’s V matrix $\hat{\rho}$. We can convert posterior samples for the model parameters to posterior samples for $\rho_{jj'}$ through (18). The empirical estimator is obtained by replacing $\pi^{(jj')}_{c_j c_{j'}}$ and $\pi^{(j)}_{c_j}$ by their empirical estimators. The left panel in Figure 5 shows the posterior summaries (averaged across simulation replicates) of the Cramer’s V values for all possible dependent pairs along with the true Cramer’s V values (which can be calculated from (17)). In the right panel of Figure 5, we overlay kernel density estimators of posterior samples (in grey) and the empirical estimators (in red) of the Cramer’s V values for all null pairs across all simulation replicates. Note the axes are also marked in grey and red for the respective cases. The sp-PARAFAC method clearly outperforms the empirical estimator, with the posterior density for the null pairs highly concentrated near zero while the empirical estimator has a mean Cramer’s V value of 0.08 across the null pairs.

Figure 5:

Left: Posterior summaries of the Cramer’s V values for all dependent pairs vs. the true Cramer’s V values; Right: Estimated density of Cramer’s V combining all null pairs under sp-PARAFAC vs. empirical estimation.

Furthermore, we can obtain power for any non-null variable or type I error for any null variable by computing the percentage of detected significance over the simulation replicates. We first look at the power and type I error of the main effects and interactions in S*. Most of the power and type I error values are appealing, although a few of them are far from satisfactory (see Table 2 and Table 3). However, given the Cramer’s V results in the right panel of Figure 5, the type I error for any variable not in S* should be very small or zero. As an example, we tested the main effects and all the possible interactions for positions 20, 30, 40 and 50. The type I error rates are 0 for all of them. These results are based on examining whether 95% intervals contain zero, and it is to be expected that the approach may have difficulty assessing the exact interaction structure among a set of associated variables based on limited data.

Table 2:

Power for Non-null Variables Based on 100 Simulations

Coefficient       β2    β4    β12   β14   β2,4  β2,12  β4,12  β4,14  β12,14  β2,4,12  β4,12,14
Power             0.97  0.9   1     1     0.95  0.99   0.98   0.97   0.99    0        0
True coefficient  1     −1.5  2     1.5   −0.5  0.5    −0.5   −0.5   0.5     0.25     0.5

Table 3:

Type I Error for Null Variables Based on 100 Simulations

Coefficient       β2,4  β2,4,14  β2,12,14  β2,4,12,14
Type I error      0.97  0        0.68      0
True coefficient  0     0        0         0

5.2. Comparison with standard PARAFAC

We now conduct a simulation study to compare estimation of the Cramer’s V matrix ρ under the proposed approach to the usual specification of the PARAFAC model without any sparsity as in (4), which is equivalent to setting γ = 0 in (6). We considered 100 simulation replicates, with data in each replicate consisting of p = 100 categorical variables for n = 100 subjects, with each variable having 4 possible levels ($d_j = d = 4$). Two simulation settings were considered to induce dependence between the variables in S* = {2, 4, 12, 14}: (i) via multiple subpopulations as in the simulation study in Dunson and Xing (2009), and (ii) via a nominal GLM $\Pr(y_{ij} = c) = \exp(y_i^{(j)}\beta_c) / \{1 + \sum_{c'=2}^{4} \exp(y_i^{(j)}\beta_{c'})\}$ for $j \in S^*$, where $y_i^{(j)}\beta_c$ is a linear combination of all variables that are associated with the jth variable, excluding the jth variable itself. The remaining variables were independently generated from a discrete uniform distribution.
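A data set with the shape of setting (i) can be generated along the following lines. This is only a sketch: the number of subpopulations, the equal mixing weights and the Dirichlet draws for the class-specific probabilities are illustrative assumptions, not the design of Dunson and Xing (2009), and the function name `simulate_latent_class` is hypothetical.

```python
import numpy as np

def simulate_latent_class(n=100, p=100, d=4, dependent=(2, 4, 12, 14),
                          n_classes=2, concentration=0.2, seed=0):
    """Categorical data in which only the `dependent` positions are associated."""
    rng = np.random.default_rng(seed)
    # Null variables: independent discrete uniform on {0, ..., d-1}.
    y = rng.integers(0, d, size=(n, p))
    # Latent subpopulation label for each subject (equal weights, an assumption).
    z = rng.integers(0, n_classes, size=n)
    for j in dependent:
        # One probability vector per subpopulation; a small concentration
        # makes the vectors peaked, so the subpopulations differ markedly.
        probs = rng.dirichlet(np.full(d, concentration), size=n_classes)
        for i in range(n):
            # Positions are 1-indexed in the text, columns are 0-indexed here.
            y[i, j - 1] = rng.choice(d, p=probs[z[i]])
    return y, z

y, z = simulate_latent_class()
```

Dependence among the positions in S* arises only through the shared label `z`; conditional on `z` all variables are independent, which is the structure a PARAFAC-type model represents.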

The color plot on the left in Figure 6 shows the true pairwise Cramer’s V values under simulation setting (i) (only the top-left 20 × 20 submatrix of ρ is shown for clarity). Figure 6 (right) and Figure 7 represent one of the replicates: the right plot in Figure 6 shows the Cramer’s V under the standard non-sparse PARAFAC method, while Figure 7 shows the Cramer’s V using our method with two different baseline components. Our approach yields visibly better estimates for both the truly dependent pairs and the true nulls. Results for simulation setting (ii), shown in Figure 8, again show the superiority of our sparse improvement to PARAFAC.

Figure 6:

Simulation setting (i) - Left: True Cramer’s V matrix; Right: Posterior means of Cramer’s V using standard PARAFAC.

Figure 7:

Posterior means of Cramer’s V under simulation setting (i) using proposed method - Left: with λ0(j) being discrete uniform; Right: with λ0(j) being empirical estimates of the marginal category probabilities.

Figure 8:

Posterior means of Cramer’s V under simulation setting (ii) – Left: using standard PARAFAC; Middle: under proposed method using empirical marginal with Diri(1, …,1) prior for λ0; Right: using proposed method with discrete uniform λ0.

6. Application

6.1. Splice-junction Gene Sequences

We applied the method to the Splice-junction Gene Sequences (abbreviated as splice data below). Splice junctions are points on a DNA sequence at which ‘superfluous’ DNA is removed during the process of protein creation in higher organisms. These data consist of A, C, G, T nucleotides at p = 60 positions for N = 3,175 sequences. Since the sample size is much larger than the number of variables, we compared our approach with the standard PARAFAC in two scenarios: first, a small randomly selected subset (of size n = 2p = 120) of the full data set, and second, the full data set itself. Using two different sample sizes in this manner allows for a study of the new and existing methods and a comparison to a gold standard (a sufficiently large data set). We ran the analysis to estimate the pairwise positional dependence structure under the standard PARAFAC method and the proposed approach with a discrete uniform baseline component. As is apparent in Figure 10, both methods have similar performance when $n \gg p$. In the smaller sample size situation, however, Figure 9 demonstrates that our proposed method has the advantage of identifying the dependence structure and pushing the independent pairs to zero, which is closer to the results in the large sample case (Figure 10).

Figure 10:

Posterior quantiles of Cramer’s V with 3,175 sequences of splice data – Upper panel: under standard PARAFAC; Bottom panel: under proposed method.

Figure 9:

Posterior quantiles of Cramer’s V with 120 sequences of splice data – Upper panel: under standard PARAFAC; Bottom panel: under proposed method.

7. Discussion

We have proposed a sparse modification to the widely used PARAFAC tensor factorization, and have applied this in a Bayesian context to improve analyses of ultra sparse huge contingency tables. Given the compelling success in this application area, we hope that the proposed notion of sparsity will have a major impact in other areas, including tensor completion problems in machine learning. There is an enormous literature on low rank and sparse matrix factorizations, and the sp-PARAFAC should facilitate scaling of such approaches to many-way tables while dealing with the inevitable curse of dimensionality. Although we take a Bayesian approach, we suspect that frequentist penalized optimization methods can also exploit the same concept of sparsity in learning a compressed characterization of a huge array based on limited data.


7.1. Proof of Theorem 3.1

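We verify the conditions of Theorem 4 in Yang & Dunson (2013), which is a minor modification of Theorem 2 appearing in Ghosal et al. (2000). Let $\epsilon_n \to 0$ be such that $n\epsilon_n^2 \to \infty$ and $\sum_n \exp(-n\epsilon_n^2) < \infty$. Suppose there exist a sequence of sets $\mathcal{P}_n \subset \mathcal{F}_n$ and a constant $C > 0$ such that the following hold:

  1. $\log N(\epsilon_n; \mathcal{P}_n, \|\cdot\|_1) \le n\epsilon_n^2$
  3. $\Pi_n(\pi : \|\log(\pi/\pi^{(0n)})\|_\infty \le \epsilon_n^2) \ge \exp(-Cn\epsilon_n^2)$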

Then, the posterior contracts at the rate ϵn, i.e., (8) is satisfied. We now proceed to verify conditions (1) - (3). We define,

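$$\mathcal{P}_n = \Big\{\pi \in \mathcal{F}_n : \pi_{c_1 \cdots c_p} = \sum_{h=1}^{k_n} \nu_h^* \prod_{j \in S_h^*} \lambda^{(*j)}_{hc_j} \prod_{j \in S_h^{*c}} \lambda^{(j)}_{0c_j};\ \nu^* \in \mathcal{S}^{k_n-1},\ |S_h^*| \le As_n,\ h = 1, \ldots, k_n\Big\} \qquad (19)$$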

where $\mathcal{S}^{r-1}$ denotes the (r − 1)-dimensional probability simplex and A > 0 is an absolute constant. We shall use C to denote an absolute constant whose meaning may change from one line to the next.

To estimate $N(\epsilon_n; \mathcal{P}_n, \|\cdot\|_1)$, we make use of the following Lemma, which follows in a straightforward manner by repeated uses of the triangle inequality.

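Lemma 7.1. Let $\pi^{(1)}, \pi^{(2)} \in \mathcal{F}_n$ with

$$\pi^{(i)} = \sum_{h=1}^{k_n} \nu_{ih}\, \lambda_{ih}^{(1)} \otimes \cdots \otimes \lambda_{ih}^{(p_n)}, \quad i = 1, 2.$$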



Lemma 7.1 implies that if $\pi^{(1)}, \pi^{(2)} \in \mathcal{P}_n$ with $S_{1h}^* = S_{2h}^* = S_h^*$, then


Based on the above observation, we create an ϵ-net of $\mathcal{P}_n$ as follows: In (19), (i) vary $S_h^*$ over all possible subsets of $\{1, \ldots, p_n\}$ with $|S_h^*| \le As_n$ for $h = 1, \ldots, k_n$, (ii) for $h \in \{1, \ldots, k_n\}$ and $j \in S_h^*$, vary $\lambda_h^{(*j)}$ over an $\epsilon_n/(2Ads_n)$-net of $\mathcal{S}^{d-1}$ and (iii) vary $\nu^*$ over an $\epsilon_n/(2k_n)$-net of $\mathcal{S}^{k_n-1}$.

For a fixed $h$, there are $\sum_{s=0}^{As_n} \binom{p}{s}$ subsets of size smaller than or equal to $As_n$. Using the inequality $\binom{p}{s} \le (pe/s)^s$ for $s \le p/2$, the number of possible subsets in (i) can be bounded above by $\exp(Ck_ns_n\log p_n)$. Hence,


Using the fact that $N(\delta; \mathcal{S}^{r-1}, \|\cdot\|_1) \le (C/\delta)^r$ (Vershynin, 2010), the right hand side in the above display can be bounded above by $\exp(Cs_n\log p_n) = \exp(n\epsilon_n^2)$, since $k_n = O(1)$.
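The subset-counting step in the ϵ-net construction can be checked numerically. The sketch below works on the log scale with exact integer binomial coefficients; the function names and the values of p and the maximal subset size are illustrative choices, not quantities from the paper.

```python
import math

def log_num_subsets(p, max_size):
    """log of sum_{s=0}^{max_size} C(p, s), using exact integer arithmetic."""
    total = sum(math.comb(p, s) for s in range(max_size + 1))
    return math.log(total)

def binom_bound_holds(p):
    """Check C(p, s) <= (p e / s)^s for every 1 <= s <= p/2, on the log scale."""
    return all(
        math.log(math.comb(p, s)) <= s * (1 + math.log(p / s))
        for s in range(1, p // 2 + 1)
    )

p, max_size = 100, 8
log_count = log_num_subsets(p, max_size)   # log number of candidate subsets
log_budget = max_size * math.log(p)        # s log p, the order used in the proof
ok = binom_bound_holds(p)
```

For these values the log count stays below `max_size * log(p)`, which illustrates why the entropy contributed by the choice of subsets is of order $s_n \log p_n$.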

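We now bound $\Pi_n(\mathcal{F}_n \cap \mathcal{P}_n^c)$. Recall that in the sp-PARAFAC model, the induced prior on the subset size $|S_h|$ is $\mathrm{Bin}(p_n, \tau_h)$, with $\tau_h \sim \mathrm{Beta}(1, \gamma)$. Now,

$$\Pi_n(\mathcal{F}_n \cap \mathcal{P}_n^c) \le \Pr(\exists\, h \in \{1, \ldots, k_n\} \text{ s.t. } |S_h| > As_n) \le k_n \Pr(|S_1| > As_n).$$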

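Integrating out $\tau_1$, the distribution of $|S_1|$ is a beta-binomial distribution with probability mass function

$$\Pr(|S_1| = s) = \binom{p_n}{s} \frac{1}{B(1, \gamma)} \int_0^1 \tau^s (1-\tau)^{p_n - s} (1-\tau)^{\gamma - 1}\, d\tau = \binom{p_n}{s} \frac{B(1+s, \gamma + p_n - s)}{B(1, \gamma)} = \gamma\, \frac{p_n!}{(p_n - s)!}\, \frac{(\gamma + p_n - s - 1)!}{(\gamma + p_n)!},$$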

for $s = 0, 1, \ldots, p_n$. B(·, ·) denotes the Beta function in the above display. Hence, for $s \ge 1$,

$$\frac{\Pr(|S_1| = s)}{\Pr(|S_1| = s-1)} = \frac{p_n - s + 1}{p_n - s + \gamma}.$$
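The ratio of successive probabilities can be verified numerically from the Beta-function form of the mass function. This is a small sketch using only the standard library; the values p = 50 and γ = p² are illustrative, and `subset_size_pmf` is a hypothetical helper name.

```python
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def subset_size_pmf(s, p, gamma):
    """Pr(|S_1| = s) when |S_1| ~ Bin(p, tau) and tau ~ Beta(1, gamma)."""
    log_choose = math.lgamma(p + 1) - math.lgamma(s + 1) - math.lgamma(p - s + 1)
    return math.exp(log_choose + log_beta(1 + s, gamma + p - s) - log_beta(1, gamma))

p = 50
gamma = float(p ** 2)
pmf = [subset_size_pmf(s, p, gamma) for s in range(p + 1)]
# Ratios of successive probabilities versus the closed form in the text.
ratios = [pmf[s] / pmf[s - 1] for s in range(1, 6)]
closed_form = [(p - s + 1) / (p - s + gamma) for s in range(1, 6)]
```

With γ of order $p_n^2$ each ratio is of order $1/p_n$, so the prior mass on subset sizes decays geometrically in s, which is the property the proof exploits.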

Now, letting $\gamma = p_n^2$, one has for any $p_n \ge 2$ and $1 \le s \le p_n/2$,


In general, for $\gamma = \beta p_n^2$, we can bound this from both sides by $C/p_n$. Noting that $\Pr(|S_1| = 0) = C/p_n^3$, we have

$$\Pr(|S_1| = s) = \frac{C}{p_n^3} \prod_{j=1}^{s} \frac{\Pr(|S_1| = j)}{\Pr(|S_1| = j-1)},$$

implying there exist constants $c_1, c_2 > 0$ such that

$$e^{-c_1(s+3)\log p_n} \le \Pr(|S_1| = s) \le e^{-c_2(s+3)\log p_n}, \qquad (20)$$

for $0 \le s \le p_n/2$. In particular, the upper bound holds for all $0 \le s \le p_n$, since $(p_n - s + 1)/(p_n - s + \gamma) \le C/p_n$ for all $s$. Hence, for $n$ large enough so that $s_n \ge 3$,


We finally show that (3) holds. Recall the decomposition of $\pi^{(0n)}$ from (9). A probability tensor π following an sp-PARAFAC model with a truncated stick-breaking prior on ν can be parameterized as

$$\theta_\pi = \big(\nu, \{S_h\}_{1 \le h \le k_n}, \{\lambda_h^{(j)}\}_{1 \le h \le k_n,\, j \in S_h}\big),$$

where $\nu \in \mathcal{S}^{k_n-1}$, $S_h \subset \{1, \ldots, p_n\}$, $\lambda_h^{(j)} \in \mathcal{S}^{d-1}$. Consider the following subset $A$ of the parameter space,


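We now show that $\theta_\pi \in A$ implies $\|\log(\pi/\pi^{(0n)})\|_\infty \le \epsilon_n^2$, so that $\Pi_n(\|\log(\pi/\pi^{(0n)})\|_\infty \le \epsilon_n^2)$ can be bounded below by $\Pi_n(A)$. First, observe that since $S_h = S_0$ for all $h$ on $A$, $\pi/\pi^{(0n)} = \psi/\psi^{(0n)}$, where $\psi^{(0n)}$ is as in (10) and ψ is the $d^{q_n}$ joint probability tensor implied by the sp-PARAFAC model for the variables $\{y_{ij} : j \in S_0\}$,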




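where the penultimate step follows from an application of the triangle inequality and the last step uses log(1 + x) ≤ x for x ≥ 0. For any $c_1, \ldots, c_{s_n}$, by an application of the triangle inequality,

$$\big|\psi_{c_1 \cdots c_{s_n}} - \psi^{(0n)}_{c_1 \cdots c_{s_n}}\big| \le \sum_{h=1}^{k_n} |\nu_h - \nu_{0h}| + \sum_{h=1}^{k_n} \nu_{0h} \Big| \prod_{j=1}^{q_n} \lambda^{(e_j)}_{hc_j} - \prod_{j=1}^{q_n} \bar\lambda^{(0e_j)}_{hc_j} \Big|. \qquad (21)$$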

We now state a Lemma to facilitate bounding the second term of the above display.

Lemma 7.2. Let $v_1, \ldots, v_r \in (\varepsilon_0, 1 - \varepsilon_0)$ for some $\varepsilon_0 > 0$. Let $\delta > 0$ be such that $r\delta < \varepsilon_0/2$. If $u_1, \ldots, u_r$ satisfy $|u_j - v_j| \le \delta$ for all $j = 1, \ldots, r$, then


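Apply Lemma 7.2 with $r = q_n$, $u_j = \lambda^{(e_j)}_{hc_j}$, $v_j = \bar\lambda^{(0e_j)}_{hc_j}$ and $\delta = \epsilon_n^2\varepsilon_0/(4q_n)$ (clearly $r\delta/\varepsilon_0 = \epsilon_n^2/4 < 1/2$) to obtain that for any $1 \le h \le k_n$, $\big|\prod_{j=1}^{q_n} \lambda^{(e_j)}_{hc_j} - \prod_{j=1}^{q_n} \bar\lambda^{(0e_j)}_{hc_j}\big| \le (\epsilon_n^2/2) \prod_{j=1}^{q_n} \bar\lambda^{(0e_j)}_{hc_j}$. Substituting this bound in (21), we have on $A$,

$$\frac{\big|\psi_{c_1 \cdots c_{s_n}} - \psi^{(0n)}_{c_1 \cdots c_{s_n}}\big|}{\psi^{(0n)}_{c_1 \cdots c_{s_n}}} \le \sum_{h=1}^{k_n} |\nu_h - \nu_{0h}|\, e^{c_0 s_n} + \frac{\epsilon_n^2}{2}\, \frac{\sum_{h=1}^{k_n} \nu_{0h} \prod_{j=1}^{q_n} \bar\lambda^{(0e_j)}_{hc_j}}{\psi^{(0n)}_{c_1 \cdots c_{s_n}}} \le \epsilon_n^2.$$

For the two terms in the above display after the first inequality, we used the lower bound (11) for the first term along with $\sum_{h=1}^{k_n} |\nu_h - \nu_{0h}| \le \epsilon_n^2/(2e^{c_0 s_n})$ on $A$, and by definition of $\psi^{(0n)}$, the second term is $\epsilon_n^2/2$.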

It thus remains to lower bound $\Pi_n(A)$. By independence across $h$, $\Pr(S_h = S_0, 1 \le h \le k_n) = \Pr(S_1 = S_0)^{k_n}$. Further, by exchangeability of the prior on $S_1$, since all subsets of a particular size receive the same prior probability, $\Pr(S_1 = S_0) = \Pr(|S_1| = q_n)/\binom{p_n}{q_n}$. From (20), $\Pr(|S_1| = q_n) \ge \exp(-Cs_n\log p_n)$. Using $\binom{p_n}{q_n} \le (p_ne/q_n)^{q_n}$, we conclude that $\Pr(S_1 = S_0) \ge \exp(-Cs_n\log p_n)$.

Recall that $\nu_h = \nu_h^* \prod_{l<h}(1 - \nu_l^*)$, where $\nu_l^* \sim \mathrm{Beta}(1, \alpha)$ independently. Find numbers $\{\nu_{0h}^*\}$ such that $\nu_{0h} = \nu_{0h}^* \prod_{l<h}(1 - \nu_{0l}^*)$. It is easy to see that there exists a constant $C > 0$ such that $|\nu_h^* - \nu_{0h}^*| \le \epsilon_n/(Ck_n)$ for all $h = 1, \ldots, k_n$ implies $\sum_{h=1}^{k_n} |\nu_h - \nu_{0h}| \le \epsilon_n$. Hence, using a general result on small ball probability estimate of Dirichlet random vectors (Lemma 6.1 of Ghosal et al. (2000)), one has


Again, applying Lemma 6.1 of Ghosal et al. (2000),

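$$\Pr\Big(\sum_{c=1}^{d} \big|\lambda^{(j)}_{hc} - \bar\lambda^{(0j)}_{hc}\big| \le \frac{\epsilon_n^2\varepsilon_0}{4q_n}\Big) \ge \exp\{-C\log(s_n/\epsilon_n)\}.$$

Combining, we get $\Pr(A) \ge \exp(-Cs_n\log p_n) \ge \exp(-n\epsilon_n^2)$. Hence, we have established (1) - (3), completing the proof.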
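The stick-breaking map $\nu_h = \nu_h^* \prod_{l<h}(1 - \nu_l^*)$ used in the proof, and its inverse (which yields the numbers $\nu_{0h}^*$ from a given $\nu_0$), can be sketched as follows. The truncation convention of setting the last stick to one, the value of α and the function names are assumptions made for illustration.

```python
import numpy as np

def sticks_to_weights(v):
    """nu_h = v_h * prod_{l<h} (1 - v_l)."""
    v = np.asarray(v, dtype=float)
    # Mass remaining before each break: 1, (1-v_1), (1-v_1)(1-v_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def weights_to_sticks(nu):
    """Invert the map: v_h = nu_h / (1 - sum_{l<h} nu_l)."""
    nu = np.asarray(nu, dtype=float)
    remaining = 1.0 - np.concatenate(([0.0], np.cumsum(nu[:-1])))
    return nu / remaining

rng = np.random.default_rng(2)
k, alpha = 5, 1.0
v = rng.beta(1.0, alpha, size=k)
v[-1] = 1.0  # truncation (assumed): the last stick takes all remaining mass
nu = sticks_to_weights(v)
```

Since the map is smooth and invertible on the interior of the simplex, small perturbations of the sticks translate into small perturbations of the weights, which is the content of the implication stated in the proof.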

7.2. Proof of Lemma 7.2

Observe that


Now, since $u_h \le v_h + \delta$ for all $h$,


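Using the binomial theorem, $(1 + \delta/\varepsilon_0)^r - 1 = r\delta/\varepsilon_0 + \sum_{h=2}^{r} \binom{r}{h} (\delta/\varepsilon_0)^h$. Next, bound $\binom{r}{h} \le r^h$ and use the fact that $r\delta/\varepsilon_0 < 1/2$ to conclude that $\sum_{h=2}^{r} \binom{r}{h} (\delta/\varepsilon_0)^h \le \sum_{h=1}^{\infty} (r\delta/\varepsilon_0)^h \le 2r\delta/\varepsilon_0$.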
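A quick Monte Carlo check of the product perturbation bound is sketched below. It tests the form in which the lemma is applied in Section 7.1, namely a relative error of at most $2r\delta/\varepsilon_0$; the sampling scheme, the parameter values and the function name `max_relative_gap` are illustrative assumptions.

```python
import numpy as np

def max_relative_gap(r, eps0, ratio, trials=2000, seed=3):
    """Largest observed |prod(u) - prod(v)| / prod(v) with |u_j - v_j| <= delta.

    delta is chosen so that r * delta / eps0 equals `ratio` (taken below 1/2).
    """
    rng = np.random.default_rng(seed)
    delta = ratio * eps0 / r
    worst = 0.0
    for _ in range(trials):
        v = rng.uniform(eps0, 1.0 - eps0, size=r)
        u = v + rng.uniform(-delta, delta, size=r)
        gap = abs(np.prod(u) - np.prod(v)) / np.prod(v)
        worst = max(worst, gap)
    return worst

r, eps0, ratio = 10, 0.1, 0.4
worst = max_relative_gap(r, eps0, ratio)
bound = 2 * ratio   # 2 r delta / eps0
```

Random draws cannot prove the inequality, but the observed worst case staying below `bound` is consistent with the binomial-expansion argument.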

On the other hand, using $u_h \ge v_h - \delta$ for all $h$,


The proof is concluded by observing that




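For $p = 2$, $\psi^{(1)} \otimes \psi^{(2)} = \psi^{(1)}\psi^{(2)\mathrm{T}}$. In general, $(\psi^{(1)} \otimes \cdots \otimes \psi^{(p)})_{c_1 \cdots c_p} = \psi^{(1)}_{c_1} \cdots \psi^{(p)}_{c_p}$.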


Mult({1, …, d}; λ1, …, λd) denotes a discrete distribution on {1, …, d} with probabilities λ1, …, λd associated to each atom.


For sequences $a_n$, $b_n$, we write $a_n = o(b_n)$ if $a_n/b_n \to 0$ as $n \to \infty$ and $a_n = O(b_n)$ if $a_n \le Cb_n$ for all large $n$.


Given a metric space (X,d), let N(ϵ;X,d) denote its ϵ-covering number, i.e., the minimum number of d-balls of radius ϵ needed to cover X.

