Bayesian factorizations of big sparse tensors

Jing Zhou; Anirban Bhattacharya; Amy Herring; David Dunson

doi:10.1080/01621459.2014.983233

Author manuscript; available in PMC: 2019 Jun 17.

Published in final edited form as: J Am Stat Assoc. 2016 Jan 15;110(512):1562–1576. doi: 10.1080/01621459.2014.983233


PMCID: PMC6579540 NIHMSID: NIHMS855930 PMID: 31210707

Abstract

It has become routine to collect data that are structured as multiway arrays (tensors). There is an enormous literature on low rank and sparse matrix factorizations, but limited consideration of extensions to the tensor case in statistics. The most common low rank tensor factorization relies on parallel factor analysis (PARAFAC), which expresses a rank k tensor as a sum of rank one tensors. When observations are only available for a tiny subset of the cells of a big tensor, the low rank assumption is not sufficient and PARAFAC has poor performance. We induce an additional layer of dimension reduction by allowing the effective rank to vary across dimensions of the table. For concreteness, we focus on a contingency table application. Taking a Bayesian approach, we place priors on terms in the factorization and develop an efficient Gibbs sampler for posterior computation. Theory is provided showing posterior concentration rates in high-dimensional settings, and the methods are shown to have excellent performance in simulations and several real data applications.

Keywords: Big data, Bayesian, Categorical data, Contingency table, Low rank, Matrix completion, PARAFAC, Tensor factorization

1. Introduction

Sparsely observed big tabular data sets are commonly collected in many applied domains. One example corresponds to recommender systems in which the dimensions of the table correspond to users, items and different contexts (Karatzoglou et al. (2010)), with a tiny proportion of the cells filled in for users providing rankings. The task is to fill in the rest of the huge table in order to make recommendations to users of which items they may prefer in each context. This extends the widely studied matrix completion problem (Candes and Recht (2009)) of which the Netflix challenge was one example. Another setting corresponds to contingency tables in which multivariate categorical data are collected for each individual, and the cells of the table contain counts of the number of individuals having a particular combination of values. In contingency table analyses, the focus is typically on inferring associations among the different variables, but challenges arise when there are many variables, so that the number of cells in the table is vastly bigger than the sample size.

Suppose that the tensor of interest is $π \in \Pi_{d_{1} \times \dots \times d_{p}}$, with $\Pi_{d_{1} \times \dots \times d_{p}}$ a space of p-way tensors having d_j rows in the jth direction. Often there are constraints on the elements of the tensor. For recommender systems, ratings are non-negative so that one is faced with a non-negative tensor factorization problem (Paatero and Tapper (1994); Lee and Seung (1999); Friedlander and Hatz (2005); Lim and Comon (2009); Liu et al. (2012)). For contingency tables, the tensor corresponds to the joint probability mass function for multivariate categorical data, so that the elements are non-negative and add to one across all the cells (Dunson and Xing (2009); Bhattacharya and Dunson (2012)). Let Y denote the data collected on tensor π. For recommender systems, Y consists of ratings for a small subset of the $\prod_{j = 1}^{p} d_{j}$ cells in the tensor, while for contingency tables Y includes response vectors y_i = (y_i1, …, y_ip)^T for subjects i = 1, …, n, with y_ij ∈ {1, …, d_j} for j = 1, …, p. In both cases, data are extremely sparse, with no observations in the overwhelming majority of cells.

To combat this data sparsity, it is necessary to substantially reduce dimensionality in estimating π. The usual way to accomplish this is through a low rank assumption. Unlike for matrices, there is no unique definition of rank but the most common convention is to define the rank k of a tensor π as the smallest value of k such that π can be expressed as

π = \sum_{h = 1}^{k} ψ_{h}^{(1)} \otimes \dots \otimes ψ_{h}^{(p)},

(1)

which is a sum of k rank one tensors, each an outer product of vectors¹ for each dimension (Kolda and Bader, 2009). Expression (1) is commonly referred to as parallel factor analysis (PARAFAC) (Harshman (1970); Bro (1997)). For k small, the number of parameters is massively reduced from $\prod_{j = 1}^{p} d_{j}$ to $k \sum_{j = 1}^{p} d_{j}$; as the low rank assumption often holds approximately, this leads to an effective approach in many applications, and a rich variety of algorithms are available for estimation.
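As a small illustration of expression (1), the following sketch builds a three-way tensor as a sum of k outer products and compares the two parameter counts mentioned above. The dimensions, the rank, and the helper name `parafac_tensor` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def parafac_tensor(factors):
    """Sum of k rank-one tensors; factors[j] has shape (k, d_j)."""
    k = factors[0].shape[0]
    dims = tuple(f.shape[1] for f in factors)
    pi = np.zeros(dims)
    for h in range(k):
        term = np.array(1.0)
        for f in factors:
            # Outer product of the h-th vector from each dimension.
            term = np.multiply.outer(term, f[h])
        pi += term
    return pi

rng = np.random.default_rng(0)
dims, k = (4, 5, 6), 2                      # hypothetical sizes
factors = [rng.random((k, d)) for d in dims]
pi = parafac_tensor(factors)

n_cells = int(np.prod(dims))                # prod_j d_j = 120
n_params = k * sum(dims)                    # k * sum_j d_j = 30
print(pi.shape, n_cells, n_params)
```

Even in this toy case the factorized form uses a quarter of the cell count, and the gap widens rapidly as p grows.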

However, the decrease in degrees of freedom from exponential in p to linear in p is not sufficient when p is big. Large p small n problems arise routinely, and a usual solution outside of tensor settings is to incorporate sparsity. For example, in linear regression, many of the coefficients are set to zero, while in estimation of large covariance matrices, sparse factor models are used that assume few factors and many zeros in the factor loadings matrices (West (2003); Carvalho et al. (2008)). In the matrix factorization literature, there has been consideration of low rank plus sparse decompositions (Chartrand (2012)), but this approach does not solve our problem of too many parameters. Including zeros in the component vectors $\{ψ_{h}^{(j)}\}$ is not a viable solution, particularly as we do not want to enforce exact zeros in blocks of the tensor π but require an alternative notion of sparsity.

Our notion is as follows. For component h (h = 1, …, k), we partition the dimensions into two mutually exclusive subsets $S_{h}$ and $S_{h}^{c}$, with $S_{h} \cup S_{h}^{c} = \{1, \dots, p\}$. The proposed sparse PARAFAC (sp-PARAFAC) factorization is then

π = \sum_{h = 1}^{k} ψ_{h}^{(1)} \otimes \dots \otimes ψ_{h}^{(p)}, ψ_{h}^{(j)} = ψ_{0}^{(j)} for j \in S_{h}^{c} .

(2)

Hence, instead of having to introduce a separate vector $ψ_{h}^{(j)}$ for every h and j, we allow there to be more degrees of freedom used to characterize the tensor structure in certain directions than in others. Consider the recommender systems application and suppose we have three dimensions, including users (j = 1), items (j = 2) and context (j = 3). If we let $ψ_{h}^{(3)} = ψ_{0}^{(3)}$ for h = 1, …, k,

π_{c_{1} c_{2} c_{3}} = ψ_{0 c_{3}}^{(3)} \sum_{h = 1}^{k} ψ_{h c_{1}}^{(1)} ψ_{h c_{2}}^{(2)},

(3)

so that we factorize the user-item matrix as being of rank k, and then include a multiplier specific to each level of the context factor. This assumes that users rank systematically higher or lower depending on context but there is no interaction. In the contingency table application, $Pr (y_{i 1} = c_{1}, \dots, y_{i p} = c_{p}) = π_{c_{1} \dots c_{p}}$ . If $j \in S_{h}^{c}$ for h = 1, …, k, then the jth variable is independent of the other variables with $Pr (y_{i j} = c_{j}) = ψ_{0 c_{j}}^{(j)}$ . By including $j \in S_{h}^{c}$ for some but not all h ∈ {1, …, k} one can use fewer degrees of freedom in characterizing the interaction between the jth factor and the other factors. In practice, we will learn {S_h} using a Bayesian approach, as the appropriate lower dimensional structure is typically not known in advance.
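The equivalence between the shared-vector form (2) and the multiplier form (3) can be checked numerically. This is a minimal sketch with made-up sizes and random non-negative factors; the variable names are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, d3, k = 4, 3, 2, 2                  # hypothetical sizes
psi1 = rng.random((k, d1))                  # user vectors, one row per component
psi2 = rng.random((k, d2))                  # item vectors, one row per component
psi0_3 = rng.random(d3)                     # shared context vector psi_0^(3)

# Expression (2) with the context dimension in S_h^c for every h:
# every component uses the same context vector.
pi = np.einsum('ha,hb,c->abc', psi1, psi2, psi0_3)

# Expression (3): a rank-k user-item matrix times a context multiplier.
user_item = psi1.T @ psi2
pi_alt = user_item[:, :, None] * psi0_3[None, None, :]

print(np.allclose(pi, pi_alt))
```

In this toy setting the context direction costs d3 parameters instead of k * d3, which is the kind of saving the second layer of dimension reduction is meant to deliver.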

We conjecture that many tensor data sets can be concisely represented via (2), with results substantially improved over usual PARAFAC factorizations due to the second layer of dimension reduction. For concreteness and brevity, we focus on contingency tables, but the methods are easily modified to other settings. Contingency table analysis is routine in practice; refer to Agresti (2002); Fienberg and Rinaldo (2007). However, in stark contrast to the well developed literature on linear regression and covariance matrix estimation in big data settings, very few flexible methods are scalable beyond small tables. Throughout the rest of the paper, we assume that the observed data y_i = (y_i1, …, y_ip)^T, i = 1, …, n, are multivariate unordered categorical, with y_ij ∈ {1, …, d_j}. Our interest is in situations where the dimensionality p is comparable to or even larger than the number of samples n.

2. Sparse Factor Models for Tables

2.1. Model and prior

We focus on a Bayesian implementation of sp-PARAFAC in (2). Let $S^{r - 1} = \{x \in ℜ^{r} : x_{j} \geq 0, \sum_{j = 1}^{r} x_{j} = 1\}$ denote the (r − 1)-dimensional probability simplex. In the contingency table case, Dunson and Xing (2009) proposed the following probabilistic PARAFAC factorization:

Pr (y_{i 1} = c_{1}, \dots, y_{i p} = c_{p}) = π_{c_{1} \dots c_{p}} = \sum_{h = 1}^{k} ν_{h} \prod_{j = 1}^{p} λ_{h c_{j}}^{(j)},

(4)

where $ν = \{ν_{h}\} \in S^{k - 1}$ and $λ_{h}^{(j)} = (λ_{h 1}^{(j)}, \dots, λ_{h d_{j}}^{(j)}) \in S^{d_{j} - 1}$ is a vector of probabilities of y_ij = 1, …, d_j in component h. Introducing a latent sub-population index z_i ∈ {1, …, k} for subject i, the elements of y_i are conditionally independent given z_i with $Pr (y_{i j} = c_{j} | z_{i} = h) = λ_{h c_{j}}^{(j)}$, and marginalizing out the latent index z_i leads to a mixture of product multinomial distributions for y_i. Placing Dirichlet priors on the component vectors leads to a simple and efficient Gibbs sampler for posterior computation. We will refer to this model (4) as standard PARAFAC.
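The latent class reading of (4) can be sketched in a few lines: draw z_i from ν, then draw each y_ij independently given z_i. The sizes, the random parameter values, and the function names `joint_pmf` and `sample_subjects` are assumptions made for illustration, and categories are coded 0, …, d_j − 1 rather than 1, …, d_j.

```python
import numpy as np

def joint_pmf(nu, lam):
    """pi = sum_h nu_h * outer(lam[0][h], ..., lam[p-1][h]), as in (4)."""
    pi = 0.0
    for h, weight in enumerate(nu):
        term = np.array(1.0)
        for lam_j in lam:
            term = np.multiply.outer(term, lam_j[h])
        pi = pi + weight * term
    return pi

def sample_subjects(n, nu, lam, rng):
    """Draw z_i ~ nu, then y_ij | z_i = h ~ lam[j][h], independently over j."""
    z = rng.choice(len(nu), size=n, p=nu)
    y = np.empty((n, len(lam)), dtype=int)
    for j, lam_j in enumerate(lam):
        for i, h in enumerate(z):
            y[i, j] = rng.choice(lam_j.shape[1], p=lam_j[h])
    return y, z

rng = np.random.default_rng(2)
k, dims = 3, (2, 3, 4)                                    # hypothetical sizes
nu = rng.dirichlet(np.ones(k))                            # component weights
lam = [rng.dirichlet(np.ones(d), size=k) for d in dims]   # lam[j] has shape (k, d_j)
pi = joint_pmf(nu, lam)
y, z = sample_subjects(500, nu, lam, rng)
print(pi.sum(), y.shape)
```

Because every λ vector lies on a simplex and ν sums to one, the resulting tensor is automatically a valid joint probability mass function.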

This approach has excellent performance in small to moderate p problems, but as p increases there is an inevitable breakdown point. The number of parameters increases linearly in p, as for other PARAFAC factorizations, so problems arise as p approaches the order of n or p ≫ n. For example, we are particularly motivated by epidemiology studies collecting many categorical predictors, such as occupation type, demographic variables, and single nucleotide polymorphisms. For continuous response vectors $y_{i} \in ℜ^{p}$ , there is a well developed literature on Gaussian sparse factor models that are adept at accommodating p ≫ n data (West (2003); Lucas et al. (2006); Carvalho et al. (2008); Bhattacharya and Dunson (2011)). These models include many zeros in the loadings matrices to induce additional dimension reduction on top of the low rank assumption. Pati et al. (2013a) provided theoretical support through characterizing posterior concentration.

Our sp-PARAFAC factorization provides an analog of sparse factor models in the tensor setting. Modifying for the categorical data case, we let

π_{c_{1} \dots c_{p}} = \sum_{h = 1}^{k} ν_{h} \prod_{j \in S_{h}} λ_{h c_{j}}^{(j)} \prod_{j \in S_{h}^{c}} λ_{0 c_{j}}^{(j)},

(5)

where |S_h| ≪ p (|S| denotes the cardinality of a set S) and the $λ_{0}^{(j)}$ vectors are fixed in advance; we consider two cases:

(i) λ_{0}^{(j)} = {(\frac{1}{d_{j}}, \dots, \frac{1}{d_{j}})}^{T} and (ii) λ_{0}^{(j)} = {(\frac{1}{n} \sum_{i = 1}^{n} 1 (y_{i j} = 1), \dots, \frac{1}{n} \sum_{i = 1}^{n} 1 (y_{i j} = d_{j}))}^{T},

corresponding to a discrete uniform and empirical estimates of the marginal category probabilities. By fixing the baseline dictionary vectors $\{λ_{0}^{(j)}\}$ in advance, and allocating a large subset of the variables within each cluster h to the baseline component, we dramatically reduce the size of the model space. In particular, the probability tensor π in (5) can be parameterized as $θ_{π} = (ν, \{S_{h}\}_{1 \leq h \leq k}, \{λ_{h}^{(j)}\}_{1 \leq h \leq k, j \in S_{h}})$, where $ν \in S^{k - 1}, S_{h} \subset \{1, \dots, p\}, λ_{h}^{(j)} \in S^{d_{j} - 1}$. Thus, the effective number of model parameters is now reduced to $(k - 1) + \sum_{h = 1}^{k} | S_{h} | + \sum_{h = 1}^{k} \sum_{j \in S_{h}} (d_{j} - 1)$, which is substantially smaller than the $(k - 1) + \sum_{j = 1}^{p} k (d_{j} - 1)$ parameters in the original specification, provided |S_h| ≪ p for all h = 1, …, k. This is ensured via a sparsity favoring prior on |S_h| below. We will illustrate that this can lead to huge differences in practical performance.
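To make the two parameter counts concrete, here is a small calculation for a hypothetical configuration: p = 100 binary variables, k = 3 components, and five active variables per component. The subsets and the function names are illustrative assumptions; variables are indexed from 0.

```python
def n_params_standard(k, d):
    """(k - 1) + sum_j k (d_j - 1): standard PARAFAC as in (4)."""
    return (k - 1) + sum(k * (dj - 1) for dj in d)

def n_params_sparse(d, subsets):
    """(k - 1) + sum_h |S_h| + sum_h sum_{j in S_h} (d_j - 1): sp-PARAFAC as in (5)."""
    k = len(subsets)
    n_alloc = sum(len(S) for S in subsets)
    n_probs = sum(d[j] - 1 for S in subsets for j in S)
    return (k - 1) + n_alloc + n_probs

d = [2] * 100                                            # p = 100 binary variables
subsets = [{0, 1, 2, 3, 4}, {2, 3, 4, 5, 6}, {10, 11, 12, 13, 14}]

print(n_params_standard(3, d))    # 302
print(n_params_sparse(d, subsets))  # 32
```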

Completing a Bayesian specification with priors for the unknown parameter vectors and expressing the model in hierarchical form, we have²

\begin{aligned}
y_{i j} &\sim \mathrm{Mult} (\{1, \dots, d_{j}\}; λ_{z_{i} 1}^{(j)}, \dots, λ_{z_{i} d_{j}}^{(j)}), \\
λ_{h}^{(j)} &\sim (1 - τ_{h}) δ_{λ_{0}^{(j)}} + τ_{h} \mathrm{Diri} (a_{j 1}, \dots, a_{j d_{j}}), \\
Pr (z_{i} = h) &= ν_{h} = V_{h} \prod_{l < h} (1 - V_{l}), \\
V_{h} &\sim \mathrm{Beta} (1, α), \quad α \sim \mathrm{Gamma} (a_{α}, b_{α}), \quad τ_{h} \sim \mathrm{Beta} (1, γ) .
\end{aligned}

(6)

It is evident that the hierarchical prior in (6) is supported on the space of probability tensors with a sp-PARAFAC decomposition as in (5), since (6) is equivalent to letting the subset-size |S_h| ~ Binom(p, τ_h) and drawing a random subset S_h uniformly from all subsets of {1, …, p} of size |S_h| in (5). A stick-breaking prior (Sethuraman, 1994) is chosen for the component weights {ν_h}, taking a nonparametric Bayes approach that allows k = ∞, with a hyperprior placed on the concentration parameter α in the stick-breaking process to allow the data to inform more strongly about the component weights. The probability of allocation τ_h to the active (non-baseline) category in component h is chosen as Beta(1, γ), with γ > 1 favoring allocation of many of the $λ_{h}^{(j)}$s to the baseline category $λ_{0}^{(j)}$. In the limiting case as γ → ∞, the joint probability tensor π becomes an outer product of the baseline probabilities for the individual variables, $π = λ_{0}^{(1)} \otimes \dots \otimes λ_{0}^{(p)}$. On the other hand, as γ → 0, one reduces back to standard PARAFAC (4).

Line 2 of expression (6) is key in inducing the second level of dimensionality reduction in our Bayesian sparse PARAFAC factorization. The inclusion of the baseline component that does not vary with h massively reduces the number of parameters, and can additionally be argued to have minimal impact on the flexibility of the specification. The $λ_{h}^{(j)}$s are incorporated within $\prod_{j = 1}^{p} λ_{h c_{j}}^{(j)}$, which for large p is highly concentrated around its mean since the $λ_{h}^{(j)}$s are independent across j. This is a manifestation of the concentration of measure phenomenon (Talagrand, 1996), which roughly states that a random variable that depends in a smooth way on the influence of many independent variables, but not too much on any one of them, is essentially constant. For example, if $θ_{j} \overset{iid}{\sim} U (0, 1)$ and $Θ = \prod_{j = 1}^{p} θ_{j}$, then E(Θ) = (1/2)^p and var(Θ) = E(Θ²) − E(Θ)² = (1/3)^p − (1/4)^p, which rapidly converges to zero. This implies that replacing a large randomly chosen subset of the $λ_{h}^{(j)}$s by $λ_{0}^{(j)}$ should have minimal impact on modeling flexibility.
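The uniform-product example is easy to check by simulation. The sketch below compares Monte Carlo estimates with the exact moments E(Θ) = (1/2)^p and var(Θ) = E(Θ²) − E(Θ)² = (1/3)^p − (1/4)^p; the choice p = 5, the number of draws, and the function name are assumptions made for the illustration.

```python
import numpy as np

def product_uniform_moments(p):
    """Exact mean and variance of a product of p iid U(0, 1) variables."""
    mean = 0.5 ** p
    var = (1.0 / 3.0) ** p - 0.25 ** p      # E(Theta^2) - E(Theta)^2
    return mean, var

rng = np.random.default_rng(3)
p, n_draws = 5, 200_000                     # hypothetical settings
theta = rng.random((n_draws, p)).prod(axis=1)

exact_mean, exact_var = product_uniform_moments(p)
print(exact_mean, theta.mean())
print(exact_var, theta.var())

# Both exact moments shrink geometrically as p grows.
for q in (10, 50, 100):
    print(q, product_uniform_moments(q))
```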

2.2. Induced prior in log-linear parameterization

An important challenge is accommodating higher order interactions, which play an important role in many applications (e.g., genetics), but are typically assumed to equal zero for tractability. As p grows, it is challenging to even accommodate two-way interactions in traditional categorical data models (log-linear, logistic regression) due to an explosion in the number of terms. In contrast, the tensor factorization does not explicitly parameterize interactions, but indirectly induces a shrinkage prior on the terms in a saturated log-linear model. One can then reparameterize in terms of the log-linear model in conducting inferences in a post model-fitting step. We illustrate the induced priors on the main effects and interactions below.

For ease of exposition, we first focus on a case where p = 3 and d_j = d = 2 for j = 1, …, 3. We generate 10,000 random probability tensors $π^{(t)} = (π_{c_{1} c_{2} c_{3}}^{(t)})$, t = 1, …, 10,000 distributed according to (6), where we fix the baseline $λ_{0}^{(j)} = (1 / 2, 1 / 2)$ for all j. Given a 2 × 2 × 2 tensor π, we can equivalently characterize π in terms of its log-linear parameterization

β = {(β_{1}, β_{2}, β_{3}, β_{12}, β_{13}, β_{23}, β_{123})}^{T},

consisting of 3 main effect terms β₁, β₂, β₃, three second-order interaction terms β₁₂, β₁₃, β₂₃ and one third order interaction term β₁₂₃; refer to §5.3.5 of Agresti (2002). Given each prior sample π^(t), we equivalently obtain a sample β^(t) from the induced prior on β, which allows us to estimate the marginal densities of the main effects and interactions and also their joint distributions. In particular, since γ plays an important role in placing weights on the baseline component, we would like to see how our induced priors differ with different γ values.

In our simulation exercise, we fix three values of γ, namely, γ = 1, 5, 20. Note that γ = 1 corresponds to a U(0,1) prior on τ_h. For different values of γ, we show the histograms of one main effect term β₁, one two-way interaction β₁₂ and the three-way interaction β₁₂₃ in Figure 1. Table 1 additionally reports summary statistics.
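A rough sketch of this prior simulation is given below. It rests on several assumptions that are not stated in the text: the stick-breaking prior is truncated at a fixed number of components, the Dirichlet hyperparameters are all set to one, a_α = b_α = 1, and the log-linear coefficients use corner-point (reference-cell) constraints of the form in (17), which may differ from the identifiability constraints behind Figure 1 and Table 1. The function names are hypothetical.

```python
import itertools
import numpy as np

def sample_prior_tensor(rng, p=3, d=2, gamma=1.0, k_trunc=20,
                        a=1.0, a_alpha=1.0, b_alpha=1.0):
    """One draw of pi from (6), truncated to k_trunc components (assumption)."""
    alpha = rng.gamma(a_alpha, 1.0 / b_alpha)
    v = rng.beta(1.0, alpha, size=k_trunc)
    v[-1] = 1.0                                   # close the truncated stick
    nu = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    tau = rng.beta(1.0, gamma, size=k_trunc)
    baseline = np.full(d, 1.0 / d)                # uniform lambda_0
    pi = np.zeros((d,) * p)
    for h in range(k_trunc):
        term = np.array(1.0)
        for j in range(p):
            active = rng.random() < tau[h]        # j in S_h with probability tau_h
            lam = rng.dirichlet(np.full(d, a)) if active else baseline
            term = np.multiply.outer(term, lam)
        pi += nu[h] * term
    return pi

def corner_loglinear(pi):
    """Coefficients beta_S for binary pi under corner-point constraints."""
    p = pi.ndim
    log_pi = np.log(pi)
    beta = {}
    for s in range(1, p + 1):
        for S in itertools.combinations(range(p), s):
            total = 0.0
            for t in range(s + 1):
                for T in itertools.combinations(S, t):
                    cell = tuple(1 if j in T else 0 for j in range(p))
                    total += (-1) ** (s - t) * log_pi[cell]   # Moebius inversion
            beta[S] = total
    return beta

rng = np.random.default_rng(4)
for gamma in (1.0, 5.0, 20.0):
    draws = [corner_loglinear(sample_prior_tensor(rng, gamma=gamma))
             for _ in range(1000)]
    spread = {S: np.std([b[S] for b in draws]) for S in [(0,), (0, 1), (0, 1, 2)]}
    print(gamma, spread)
```

Keys such as `(0,)`, `(0, 1)` and `(0, 1, 2)` correspond to β₁, β₁₂ and β₁₂₃ with variables indexed from 0. No expected numbers are shown because they depend on the truncation level and the constraint convention assumed here.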

Figure 1: Histograms of induced priors for one main effect β₁, one two-way interaction β₁₂, and the three-way interaction β₁₂₃. Top row: γ = 1; middle row: γ = 5; bottom row: γ = 20.

Table 1: Summary statistics of induced priors on coefficients in the log-linear model parameterization.

γ	Coefficient	Mean	Std.dev	Min	Max	Skewness	Kurtosis
1	β₁	0.014	0.831	−6.765	6.389	0.210	9.109
1	β₁₂	−0.002	0.340	−2.895	3.105	−0.025	16.583
1	β₁₂₃	0.002	0.196	−2.223	2.632	0.525	24.686
5	β₁	−0.002	0.485	−5.648	5.433	0.031	27.980
5	β₁₂	0.000	0.124	−2.085	2.244	0.495	93.438
5	β₁₂₃	0.000	0.051	−1.214	0.745	−3.701	159.360
20	β₁	0.002	0.246	−3.109	5.669	2.474	99.554
20	β₁₂	0.000	0.042	−1.126	1.819	9.488	632.790
20	β₁₂₃	0.000	0.009	−0.664	0.214	−44.051	3014.000


In high-dimensional regression, $y_{i} = x_{i}^{T} β + ϵ_{i}$ , there has been substantial interest in shrinkage priors, which draw β_j a priori from a density concentrated at zero with heavy tails. Such priors strongly shrink the small coefficients to zero, while limiting shrinkage of the larger signals (Park and Casella, 2008; Carvalho et al., 2010; Polson and Scott, 2010; Hans, 2011; Armagan et al., 2013a). In Figure 1, the induced prior on any of the log-linear model parameters is symmetric about zero, with a large spike very close to zero, and heavy tails. Thus, we have indirectly induced a continuous shrinkage prior on the main effects and interactions through our tensor decomposition approach. In addition, the prior automatically shrinks more aggressively as the interaction order increases. Such greater shrinkage of interactions is commonly recommended (Gelman et al., 2008). Importantly, we do not zero out small interactions but allow many small coefficients, which is an important distinction in applications, such as genomics, having many small signals. The hyperparameter γ serves as a penalty controlling the degree of shrinkage.

Our next set of simulations involves larger values of p, where the necessity of the regularization implied by γ becomes strikingly evident. In the log-linear parameterization, we now have p main effects β₁, …, β_p; let β_main = (β₁, …, β_p)^T. In the p ≫ n setting, one cannot even hope to consistently recover all the main effects unless a large fraction of the β_j’s are zero or close to zero. One would thus favor a shrinkage prior on β_main, with any particular draw resembling a near-sparse vector. Since the induced prior on the β_j’s is continuous, we study the l₁ norm $\| β_{main} \|_{1} = \sum_{j = 1}^{p} | β_{j} |$ as a surrogate for the l₀ norm to quantify the sparsity.

We consider p = 50, 100, 150, 200 and plot histograms of the induced density of ∥β_main∥₁ based on 10,000 prior draws in Figures 2 and 3. Figure 2 corresponds to the case where γ = 0, i.e., when the sp-PARAFAC formulation reduces back to the standard PARAFAC (4), while γ/p is set to a constant β ∈ (0,1) in Figure 3. Figure 2 reveals a highly undesirable property of the standard PARAFAC in high dimensions, where the entire distribution of ∥β_main∥₁ shifts to the right with increasing p, with $E \| β_{main} \|_{1} ≍ p$. The induced prior clearly lacks any automatic multiplicity adjustment property (Scott and Berger, 2010), and would bias inferences for moderate to large values of p. On the other hand, under the sp-PARAFAC model, the induced prior on ∥β_main∥₁ is robust to increasing p, as evident from Figure 3. The choice γ = βp essentially forces a constant proportion of the variables to be assigned to the null group; see Castillo and van der Vaart (2012) for a similar choice of the hyper-parameter in a regression setting.

Figure 3: Histograms of ∥β_main∥₁ for different values of p under the sp-PARAFAC model with γ = 0.1p.

3. Posterior concentration

3.1. Preliminaries

In this section, we provide theoretical justification for the proposed sp-PARAFAC procedure in high dimensional settings by studying the concentration properties of the posterior with growing sample size. When the parameter space is finite dimensional, it is well known that the posterior contracts at the parametric rate of n^{−1/2} under mild regularity conditions (Ghosal et al., 2000). However, we are interested in the asymptotic framework of the dimension p = p_n growing with the sample size n, potentially at a faster rate, reflecting the applications we are interested in. There is a small but increasing literature on asymptotic properties of Bayesian procedures in models with growing dimensionality, with most of the focus being on linear models or generalized linear models belonging to the exponential family; refer to Ghosal (1999, 2000); Belitser and Ghosal (2003); Jiang (2007); Armagan et al. (2013b); Bontemps (2011); Castillo and van der Vaart (2012); Yang and Dunson (2013) among others. In all these cases, the object of interest is a vector of high-dimensional regression coefficients or more generally, the conditional distribution f (y | x) of a univariate response y given high-dimensional predictors x. However, our object of interest is significantly different as we are concerned with estimation of the high-dimensional joint probability tensor π.

Let $F_{n}$ denote the class of all $d_{1} \times \dots \times d_{p_{n}}$ probability tensors; we shall assume $d_{1} = \dots = d_{p_{n}} = d$ in the sequel for notational convenience. Let $π^{(0 n)} \in F_{n}$ be a sequence of true tensors. We observe y₁, …, y_n ~ π⁽⁰ⁿ⁾ and set y⁽ⁿ⁾ = (y₁, …, y_n). We denote the prior distribution on $F_{n}$ induced by the sp-PARAFAC formulation by $ℙ_{n}$ and the corresponding posterior distribution by $ℙ_{n} (\cdot | y^{(n)})$.

For two probability tensors π⁽¹⁾ and $π^{(2)} \in F_{n}$ , the L₁ distance is defined as:

{‖ π^{(1)} - π^{(2)} ‖}_{1} = \sum_{c_{1} = 1}^{d} \dots \sum_{c_{p_{n}} = 1}^{d} | π_{c_{1} \dots c_{p_{n}}}^{(1)} - π_{c_{1} \dots c_{p_{n}}}^{(2)} | .

For a sequence of numbers ϵ_n → 0 and a constant M > 0 independent of ϵ_n, let

U_{n} = \{π : \| π - π^{(0 n)} \|_{1} \leq M ϵ_{n}\}

(7)

denote a ball of radius Mϵ_n around π⁽⁰ⁿ⁾ in the L₁ norm. We seek to find a minimum possible sequence ϵ_n such that

\lim_{n \to \infty} ℙ_{n} (U_{n}^{c} | y^{(n)}) = 0, a.s. π^{(0 n)} .

(8)

3.2. Assumptions

In this section we state our assumptions on the true data generating model and briefly discuss their implications.

Assumption 3.1. The true sequence of probability tensors π⁽⁰ⁿ⁾ is of the form

π_{c_{1} \dots c_{p_{n}}}^{(0 n)} = \sum_{h = 1}^{k_{n}} ν_{0 h} \prod_{j \in S_{0 h}} λ_{h c_{j}}^{(0 j)} \prod_{j \in S_{0 h}^{c}} λ_{0 c_{j}}^{(j)}, 1 \leq c_{j} \leq d, 1 \leq j \leq p_{n},

(A0)

where $λ_{0}^{(j)} \in S^{d - 1}$ are assumed to be known. Unless otherwise specified, we shall assume $λ_{0}^{(j)} = (1 / d, \dots, 1 / d)$ is the probability vector corresponding to the uniform distribution on {1, …, d}.

We now provide some intuition for assumption (A0). Letting $S_{0} = \cup_{h = 1}^{k_{n}} S_{0 h}$ , we can rewrite the expansion of π⁽⁰ⁿ⁾ in (A0) as

π_{c_{1} \dots c_{p_{n}}}^{(0 n)} = \sum_{h = 1}^{k_{n}} ν_{0 h} \prod_{j \in S_{0}} {\bar{λ}}_{h c_{j}}^{(0 j)} \prod_{j \in S_{0}^{c}} λ_{0 c_{j}}^{(j)},

(9)

where

{\bar{λ}}_{h}^{(0 j)} = \begin{cases} λ_{h}^{(0 j)} & \text{if } j \in S_{0 h}, \\ λ_{0}^{(j)} & \text{if } j \in S_{0} \setminus S_{0 h} . \end{cases}

In (9), the term $\prod_{j \in S_{0}^{c}} λ_{0 c_{j}}^{(j)}$ doesn’t involve h and can be factored out completely. Assumption (A0) thus posits that the variables in $S_{0}^{c}$ are marginally independent and the entire dependence structure is driven by the variables in S₀. We shall refer to S₀ and $S_{0}^{c}$ as the non-null and null group of variables respectively.

Let q_n = |S₀| and define a mapping j → e_j from {1, …, q_n} to the ordered elements of S₀, so that $e_{1} \leq \dots \leq e_{q_{n}}$. As j varies from 1 to q_n, e_j ranges over the elements of S₀. Denote by ψ⁽⁰ⁿ⁾ the $d^{q_{n}}$ joint probability tensor for the variables {y_ij : j ∈ S₀}, so that

ψ_{c_{1} \dots c_{q_{n}}}^{(0 n)} = Pr (y_{i e_{1}} = c_{1}, \dots, y_{i e_{q_{n}}} = c_{q_{n}}) = \sum_{h = 1}^{k_{n}} ν_{0 h} \prod_{j = 1}^{q_{n}} {\bar{λ}}_{h c_{j}}^{(0 e_{j})} .

(10)

Thus, after factoring out the marginally independent variables in $S_{0}^{c}$, (A0) implies a standard PARAFAC expansion (10) for ψ⁽⁰ⁿ⁾ with k_n many components. Since any non-negative tensor admits a standard PARAFAC decomposition (Lim and Comon, 2009), we can always write an expansion of ψ⁽⁰ⁿ⁾ as in (10).

The next set of assumptions are provided below.³

Assumption 3.2. In addition to (A0), π⁽⁰ⁿ⁾ satisfies

(A1) The number of components k_n = O(1).

(A2) Letting $s_{n} = {max}_{1 \leq h \leq k_{n}} | S_{0 h} |$ , one has s_n = o(log p_n).

(A3) There exists a constant ε₀ ∈ (0,1) such that $λ_{h c}^{(0 j)} \geq ε_{0}$ for all 1 ≤ h ≤ k_n, 1 ≤ c ≤ d, j ∈ S_0h.

(A1) and (A2) imply that the size of the non-null group is much smaller than p_n, since $q_{n} = | S_{0} | \leq \sum_{h = 1}^{k_{n}} | S_{0 h} | \leq k_{n} s_{n} ≪ p_{n}$.

Some discussion is in order for condition (A3). First, note that we can choose ε₀ in a way so that ${\bar{λ}}_{h c}^{(0 j)} \geq ε_{0}$ for all h, c and j ∈ S₀. Hence, (A3) implies a lower bound on the joint probability ψ⁽⁰ⁿ⁾ in (10). Such a lower bound on a compactly supported target density is a standard assumption in Bayesian non-parametric theory; see for example van der Vaart and van Zanten (2008). However, unlike univariate or multivariate density estimation in fixed dimensions where the density can be assumed to be bounded below by a constant, we need to precisely characterize the decay rate of the lower bound of the joint probability. Since ψ⁽⁰ⁿ⁾ is a $d^{q_{n}}$ probability tensor, ${min}_{c_{1}, \dots, c_{q_{n}}} ψ_{c_{1} \dots c_{q_{n}}}^{(0 n)} \leq {(1 / d)}^{q_{n}} = exp (- q_{n} log d)$. Since q_n ≤ k_n s_n with k_n = O(1), assumption (A3) implies that

min_{c_{1}, \dots, c_{q_{n}}} ψ_{c_{1} \dots c_{q_{n}}}^{(0 n)} \geq exp (- q_{n} log (1 / ε_{0})) \geq exp (- c_{0} s_{n})

(11)

for some constant c₀ > 0.

3.3. Main result

We are now in a position to state a theorem on posterior convergence rates.

Theorem 3.1. Assume the true sequence of tensors $π^{(0 n)} \in F_{n}$ satisfies assumptions (A0) – (A3) and s_n log p_n / n → 0. Also, assume the sp-PARAFAC model is fitted with the stick-breaking prior truncated to k_n many components and $γ = β p_{n}^{2}$ for some constant β ∈ (0,1) in (6). Then, (8) is satisfied with $ϵ_{n} = \sqrt{s_{n} log p_{n} / n}$ in (7).

A proof of Theorem 3.1 can be found in the appendix. As an implication of Theorem 3.1, if p_n = n^d for some constant d, then the posterior contracts at the near parametric rate $\sqrt{{(log n)}^{c} / n}$ for some constant c > 0. Moreover, consistent estimation is possible even if p_n is exponentially large as long as $p_{n} \leq exp (\sqrt{n})$. In particular, with p_n = exp(n^{δ/2}) for δ < 1, the posterior contracts at least at the rate n^{−(1−δ)/2}.
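These rates can be traced by substituting into $ϵ_{n} = \sqrt{s_{n} log p_{n} / n}$. The following is only a sketch: it uses the working bound s_n ≤ log p_n for large n, which follows from (A2), and it reads the exponent in the last example as n^{δ/2}.

```latex
\begin{aligned}
p_n = n^{d}: \quad
  & \epsilon_n^2 = \frac{s_n \log p_n}{n} \le \frac{(\log p_n)^2}{n}
    = \frac{d^2 (\log n)^2}{n},
  && \text{so } \epsilon_n \le d \sqrt{(\log n)^{2} / n}; \\
p_n \le \exp(\sqrt{n}): \quad
  & \frac{s_n \log p_n}{n} = \frac{s_n}{\log p_n} \cdot \frac{(\log p_n)^2}{n}
    \le \frac{s_n}{\log p_n} \to 0,
  && \text{so } \epsilon_n \to 0; \\
p_n = \exp(n^{\delta/2}): \quad
  & \epsilon_n^2 \le \frac{(\log p_n)^2}{n} = \frac{n^{\delta}}{n} = n^{-(1-\delta)},
  && \text{so } \epsilon_n \le n^{-(1-\delta)/2}.
\end{aligned}
```

In the first line the exponent c = 2 is an upper bound obtained from the working inequality; since s_n = o(log p_n), the actual rate is slightly faster.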

Remark 3.2. We assume the number of components k_n known in Theorem 3.1 for ease of exposition, with our main focus on dimensionality reduction. Adapting to an unknown number of components in mixture models is a well-studied problem; see, for example, Ge and Jiang (2006); Pati et al. (2013b); Shen et al. (2011). For the infinite stick-breaking prior on the mixture components, one can use the sieving technique developed in Pati et al. (2013b) to estimate deviation bounds for the tail sum of a stick-breaking process.

Remark 3.3. In practice, we recommend the choice γ = βp_n for numerical stability, with β = 0.2 used as a default choice in all our examples. The probability mass function of the induced beta-Bernoulli prior on |S_h| with $γ = β p_{n}^{2}$ behaves like exp(−cs log p_n) for small s, while it behaves like exp(−cs) for γ = βp_n; refer to the proof of Theorem 3.1 for further details.
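The two decay regimes can be seen directly from the beta-binomial probability mass function implied by |S_h| | τ_h ~ Binom(p, τ_h) and τ_h ~ Beta(1, γ). The sketch below is an illustration under that marginalization; the function names and the values of p are assumptions. The log-ratio of consecutive probabilities equals log((p − s + γ − 1)/(p − s)), which is roughly log(1 + β) when γ = βp and roughly log(βp) when γ = βp².

```python
import math

def log_pmf_subset_size(s, p, gamma):
    """log Pr(|S_h| = s) with |S_h| | tau ~ Binom(p, tau), tau ~ Beta(1, gamma)."""
    log_choose = math.lgamma(p + 1) - math.lgamma(s + 1) - math.lgamma(p - s + 1)
    # log B(1 + s, gamma + p - s) - log B(1, gamma)
    log_num = math.lgamma(1 + s) + math.lgamma(gamma + p - s) - math.lgamma(gamma + p + 1)
    log_den = math.lgamma(1.0) + math.lgamma(gamma) - math.lgamma(1.0 + gamma)
    return log_choose + log_num - log_den

def log_decay_per_step(p, gamma, s=0):
    """log Pr(|S_h| = s) - log Pr(|S_h| = s + 1): the cost of one more active variable."""
    return log_pmf_subset_size(s, p, gamma) - log_pmf_subset_size(s + 1, p, gamma)

beta = 0.2
for p in (50, 200, 800):
    linear = log_decay_per_step(p, beta * p)          # stays near log(1 + beta)
    quadratic = log_decay_per_step(p, beta * p ** 2)  # grows like log(beta * p)
    print(p, round(linear, 3), round(quadratic, 3))
```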

4. Posterior Computation

Under model (6), we can easily proceed to draw posterior samples from a Gibbs sampler since all the full conditionals have recognizable forms. The algorithm iterates through the following steps:

For each variable j and latent class h, update $λ_{h}^{(j)} \equiv (λ_{h 1}^{(j)}, \dots, λ_{h d_{j}}^{(j)})$ from a two-component mixture. Given the prior specified for $λ_{h}^{(j)}$ in (6), the full conditional remains conjugate: it is a mixture of a point mass at the baseline and a Dirichlet distribution, i.e., for j = 1, …, p, h = 1, …, k*, where k* = max{z₁, …, z_n}:
$(λ_{h}^{(j)} | -) \sim w_{0 h}^{(j)} δ_{λ_{0}^{(j)}} + w_{1 h}^{(j)} Diri (a_{j 1} + \sum_{i = 1}^{n} 1 (y_{i j} = 1, z_{i} = h), \dots, a_{j d_{j}} + \sum_{i = 1}^{n} 1 (y_{i j} = d_{j}, z_{i} = h)),$ (12)
where $w_{0 h}^{(j)}$ and $w_{1 h}^{(j)}$ are the mixture weights:
$w_{0 h}^{(j)} = \frac{(1 - τ_{h}) \prod_{c = 1}^{d_{j}} {(λ_{0 c}^{(j)})}^{\sum_{i = 1}^{n} 1 (z_{i} = h, y_{i j} = c)}}{(1 - τ_{h}) \prod_{c = 1}^{d_{j}} {(λ_{0 c}^{(j)})}^{\sum_{i = 1}^{n} 1 (z_{i} = h, y_{i j} = c)} + τ_{h} \frac{Γ (\sum_{c = 1}^{d_{j}} a_{j c})}{\prod_{c = 1}^{d_{j}} Γ (a_{j c})} \cdot \frac{\prod_{c = 1}^{d_{j}} Γ (a_{j c} + \sum_{i = 1}^{n} 1 (z_{i} = h, y_{i j} = c))}{Γ (\sum_{c = 1}^{d_{j}} a_{j c} + \sum_{i = 1}^{n} 1 (z_{i} = h))}},$

$w_{1 h}^{(j)} = 1 - w_{0 h}^{(j)} .$

Let S_hj be the allocation variable with S_hj = 0 if $λ_{h}^{(j)}$ is updated from the baseline component, and S_hj = 1 if $λ_{h}^{(j)}$ is from a Dirichlet posterior distribution. Update τ_h, h = 1, …, k* from a Beta full conditional:
$τ_{h} | - ~ Beta (1 + \sum_{j = 1}^{p} 1 (S_{h j} = 1), γ + \sum_{j = 1}^{p} 1 (S_{h j} = 0)) .$ (13)

The full conditional of V_h, h = 1, …, k* only requires the updated information on latent class allocation for all subjects:
$V_{h} | - ~ Beta (1 + \sum_{i = 1}^{n} 1 (z_{i} = h), α + \sum_{i = 1}^{n} 1 (z_{i} > h)) .$ (14)

We sample z_i, i = 1, …, n from the multinomial full conditional with:
$Pr (z_{i} = h | -) = \frac{ν_{h} \prod_{j = 1}^{p} λ_{h y_{i j}}^{(j)}}{\sum_{l = 1}^{k^{*}} ν_{l} \prod_{j = 1}^{p} λ_{l y_{i j}}^{(j)}},$ (15)
where $ν_{h} = V_{h} \prod_{l < h} (1 - V_{l})$.

Update α from the Gamma full conditional:
$α | - ~ Gamma (a_{α} + k^{*}, b_{α} - \sum_{h = 1}^{k^{*}} log (1 - V_{h})) .$ (16)
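The updates (12) to (16) can be sketched as a single sweep in Python. This is not the authors' implementation. It makes several simplifying assumptions that are not in the text: the stick-breaking prior is truncated at a fixed number of components K (with V_K = 1 and the α update adjusted to K − 1 free sticks) instead of using k* = max{z_i}, all variables share d categories coded 0, …, d − 1, and every Dirichlet hyperparameter equals a common value a. The function name and argument layout are hypothetical.

```python
import math
import numpy as np

def gibbs_sweep(y, z, lam, S, tau, V, alpha, lam0, gamma,
                a=1.0, a_alpha=1.0, b_alpha=1.0, rng=None):
    """One sweep over updates (12)-(16). Arrays are modified in place.

    y: (n, p) ints in 0..d-1; z: (n,) ints in 0..K-1; lam: (K, p, d);
    S: (K, p) bool; tau, V: (K,); lam0: (p, d). Returns the new alpha.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = y.shape
    K, _, d = lam.shape
    tiny = 1e-12

    # (12): baseline point mass versus Dirichlet posterior for each (h, j).
    for h in range(K):
        members = y[z == h]
        n_h = members.shape[0]
        for j in range(p):
            counts = np.bincount(members[:, j], minlength=d)
            log_w0 = math.log1p(-tau[h]) + float(np.sum(counts * np.log(lam0[j])))
            log_w1 = (math.log(tau[h]) + math.lgamma(d * a) - d * math.lgamma(a)
                      + sum(math.lgamma(a + c) for c in counts)
                      - math.lgamma(d * a + n_h))
            w1 = math.exp(log_w1 - np.logaddexp(log_w0, log_w1))
            S[h, j] = rng.random() < w1
            lam[h, j] = rng.dirichlet(a + counts) if S[h, j] else lam0[j]

    # (13): tau_h given the number of active variables in component h.
    n_active = S.sum(axis=1)
    tau[:] = np.clip(rng.beta(1.0 + n_active, gamma + p - n_active), tiny, 1 - tiny)

    # (14): stick-breaking fractions; V_K = 1 closes the truncated stick.
    sizes = np.bincount(z, minlength=K)
    n_greater = sizes[::-1].cumsum()[::-1] - sizes
    V[:] = np.clip(rng.beta(1.0 + sizes, alpha + n_greater), tiny, 1 - tiny)
    V[-1] = 1.0
    nu = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))

    # (15): class labels from their multinomial full conditional (log scale).
    log_post = np.log(nu)[None, :] + sum(np.log(lam[:, j, y[:, j]]).T for j in range(p))
    prob = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)
    u = rng.random(n)
    z[:] = np.minimum((prob.cumsum(axis=1) < u[:, None]).sum(axis=1), K - 1)

    # (16): concentration parameter (truncated form: K - 1 free sticks).
    return rng.gamma(a_alpha + K - 1, 1.0 / (b_alpha - np.sum(np.log1p(-V[:-1]))))
```

A caller would initialize `z`, `lam`, `S`, `tau` and `V`, then call `gibbs_sweep` repeatedly, storing whatever summaries are needed after burn-in. Working on the log scale in steps (12) and (15) avoids underflow when p or the class sizes are large.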

These steps are simple to implement and we gain efficiency by updating the parameters in blocks. For example, instead of updating $λ_{h}^{(j)}$ one at a time, we sample $λ \equiv \{λ_{h}^{(j)}, h = 1, \dots, k^{*}, j = 1, \dots, p\}$ jointly with corresponding parameters in matrix form. In all our examples, we ran the chain for 25,000 iterations, discarding the first 10,000 iterations as burn-in and collecting every fifth sample post burn-in to thin the chain. Mixing and convergence were satisfactory based on the examination of trace plots and the run time scaled linearly with n and p. We also carried out sensitivity analysis by multiplying and dividing the hyperparameters a_α, b_α and γ in (6) by a factor of 2, with the conclusions remaining unchanged from the default setting a_α = b_α = 1 and γ = 0.2 p.

5. Simulation Studies

5.1. Estimating sparse interactions

We first conduct a replicated simulation study to assess the estimation of sparse interactions using the proposed sp-PARAFAC model. We simulated 100 dependent binary variables y_ij ∈ {0,1}, j = 1, …, p = 100 (d_j = d = 2) for i = 1, …, n = 100 subjects from a log-linear model having up to three-way interactions:

$\log \left( \frac{π_{c_{1} \dots c_{p}}}{π_{0 \dots 0}} \right) = \sum_{s = 1}^{3} \sum_{S \subset \{1, \dots, p\} : | S | = s} β_{S} 1_{(c_{S} = 1)} .$ (17)

For example, if S = {1, 2, 4}, then β_S = β_1,2,4 and $1_{(c_{S} = 1)} = 1_{(c_{1} = 1, c_{2} = 1, c_{4} = 1)}$ with 1_(·) denoting the indicator function. To mimic the situation where only a few interactions are present, we restrict to subsets S ⊂ S* = {2, 4, 12, 14} and set all interactions except

$β = (β_{2}, β_{4}, β_{12}, β_{14}, β_{2, 4}, β_{2, 12}, β_{4, 12}, β_{4, 14}, β_{12, 14}, β_{2, 4, 12}, β_{4, 12, 14})^{T}$

to zero. This data generating mechanism induces dependence among the variables in S*, while rendering the other variables to be marginally independent. Figure 4 reports the posterior means and 95% credible intervals for all main effects and interactions for the variables in S* averaged across 100 simulation replicates along with the true coefficients. As illustrated in Figure 4, averaging across the simulation replicates and different parameters, the 95% credible intervals cover the true parameter values 80% of the time.

Figure 4: — Posterior means and 95% credible intervals for all main effects and interactions in S* compared with the true coefficients.
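The data generating mechanism of model (17) can be sketched as follows. Because all coefficients outside S* are zero, (17) implies that the remaining variables are independent and uniform on {0, 1}, so only the 2⁴ = 16 cells for the variables in S* need to be enumerated. The coefficient values are taken from the "True coefficient" row of Table 2; the function names and the 0-based column layout are assumptions made for illustration.

```python
import itertools
import numpy as np

S_STAR = (2, 4, 12, 14)  # 1-based variable indices, as in the text
BETA = {
    (2,): 1.0, (4,): -1.5, (12,): 2.0, (14,): 1.5,
    (2, 4): -0.5, (2, 12): 0.5, (4, 12): -0.5,
    (4, 14): -0.5, (12, 14): 0.5,
    (2, 4, 12): 0.25, (4, 12, 14): 0.5,
}

def cell_probs(beta=BETA, s_star=S_STAR):
    """Joint pmf of the variables in S* implied by the log-linear model."""
    cells = list(itertools.product((0, 1), repeat=len(s_star)))
    log_w = []
    for cell in cells:
        on = {v for v, c in zip(s_star, cell) if c == 1}
        # beta_S contributes only when every variable in S equals 1
        log_w.append(sum(b for S, b in beta.items() if set(S) <= on))
    w = np.exp(np.array(log_w))
    return np.array(cells), w / w.sum()

def simulate(n=100, p=100, seed=0):
    """Draw an (n, p) binary data matrix from model (17)."""
    rng = np.random.default_rng(seed)
    cells, probs = cell_probs()
    y = rng.integers(0, 2, size=(n, p))       # null variables: uniform on {0, 1}
    idx = rng.choice(len(cells), size=n, p=probs)
    cols = [v - 1 for v in S_STAR]            # convert to 0-based columns
    y[:, cols] = cells[idx]
    return y
```

Calling `simulate()` once per replicate with a different `seed` would give the 100 replicated data sets described above.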

Next, we study performance in estimating the dependence structure. Cramer’s V is a popular statistic measuring the strength of association or dependence between two (nominal) categorical variables in a contingency table, ranging from 0 (no association) to 1 (perfect association). Let ρ_jj′ denote the Cramer’s V statistic for variables j and j′, so that

$ρ_{j j'}^{2} = \frac{1}{\min \{d_{j}, d_{j'}\} - 1} \sum_{c_{j} = 1}^{d_{j}} \sum_{c_{j'} = 1}^{d_{j'}} \frac{(π_{c_{j} c_{j'}}^{(j j')} - π_{c_{j}}^{(j)} π_{c_{j'}}^{(j')})^{2}}{π_{c_{j}}^{(j)} π_{c_{j'}}^{(j')}},$ (18)

where $π_{l l'}^{(j j')} = \Pr (y_{i j} = l, y_{i j'} = l')$ and $π_{l}^{(j)} = \Pr (y_{i j} = l)$. Under the log-linear model (17), ρ = (ρ_jj′) is a sparse matrix with the Cramer’s V for all pairs except those in S* × S* being zero. This is an immediate consequence of the fact that if (j, j′) ∉ S* × S*, then y_ij and y_ij′ are independent.
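Formula (18) translates directly into a few lines of code. The sketch below computes Cramer's V from a joint probability table and, as the empirical counterpart, from two observed category vectors; the function names are hypothetical, categories are assumed to be coded 0, …, d − 1, and cells whose product of marginals is zero are skipped to avoid division by zero.

```python
import numpy as np

def cramers_v(joint):
    """Cramer's V from a (d_j, d_j') joint probability table, as in (18)."""
    joint = np.asarray(joint, dtype=float)
    row = joint.sum(axis=1)              # marginal of variable j
    col = joint.sum(axis=0)              # marginal of variable j'
    indep = np.outer(row, col)           # joint under independence
    mask = indep > 0
    stat = np.sum((joint - indep)[mask] ** 2 / indep[mask])
    return np.sqrt(stat / (min(joint.shape) - 1))

def empirical_cramers_v(yj, yk, dj, dk):
    """Plug-in estimate from two integer-coded category vectors."""
    table = np.zeros((dj, dk))
    np.add.at(table, (np.asarray(yj), np.asarray(yk)), 1.0)
    return cramers_v(table / table.sum())
```

Applying `cramers_v` to the pairwise marginal implied by each posterior draw of the model parameters would give posterior samples of ρ_jj′.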

We compare estimation of the off-diagonal entries of ρ under the sp-PARAFAC model with the empirical Cramer’s V matrix $\hat{ρ}$. Posterior samples for the model parameters can be converted to posterior samples for ρ_jj′ through (18). The empirical estimator is obtained by replacing $π_{c_{j} c_{j'}}^{(j j')}$ and $π_{c_{j}}^{(j)}$ by their empirical estimators. The left panel in Figure 5 shows the posterior summaries (averaged across simulation replicates) of the Cramer’s V values for all possible dependent pairs along with the true Cramer’s V values (which can be calculated from (17)). In the right panel of Figure 5, we overlay kernel density estimators of posterior samples (in grey) and the empirical estimators (in red) of the Cramer’s V values for all null pairs across all simulation replicates. Note the axes are also marked in grey and red for the respective cases. The sp-PARAFAC method clearly outperforms the empirical estimator, with the posterior density for the null pairs highly concentrated near zero while the empirical estimator has a mean Cramer’s V value of 0.08 across the null pairs.

Figure 5: — Left: Posterior summaries of the Cramer’s V values for all dependent pairs vs. the true Cramer’s V values; Right: Estimated density of Cramer’s V combining all null pairs under sp-PARAFAC vs. empirical estimation.

Furthermore, we can obtain the power for any non-null variable, or the type I error for any null variable, by computing the percentage of simulation replicates in which it is detected as significant. We first look at the power and type I error of the main effects and interactions in S*. Most of the power and type I error values are appealing, although a few of them are far from satisfactory (see Table 2 and Table 3). However, given the Cramer’s V results in the right panel of Figure 5, the type I error for any variable not in S* should be very small or zero. As an example, we tested the main effects and all the possible interactions for positions 20, 30, 40 and 50. The type I error rates are 0 for all of them. These results are based on examining whether 95% intervals contain zero and, as expected, the approach may have difficulty assessing the exact interaction structure among a set of associated variables based on limited data.
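The interval-based decision rule described above can be sketched as follows. The function name `rejection_rate` and the array layout (one row of posterior draws per simulation replicate, for a single coefficient) are assumptions for illustration; equal-tailed intervals are used here, which the text does not specify.

```python
import numpy as np

def rejection_rate(draws, level=0.95):
    """Fraction of replicates whose credible interval excludes zero.

    draws : array of shape (replicates, posterior samples) for one coefficient.
    For a non-null coefficient the result estimates power; for a null
    coefficient it estimates the type I error.
    """
    tail = (1.0 - level) / 2.0
    lower = np.quantile(draws, tail, axis=1)
    upper = np.quantile(draws, 1.0 - tail, axis=1)
    excludes_zero = (lower > 0) | (upper < 0)
    return excludes_zero.mean()
```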

Table 2:

Power for Non-null Variables Based on 100 Simulations

	β₂	β₄	β₁₂	β₁₄	β_2,4	β_2,12	β_4,12	β_4,14	β_12,14	β_2,4,12	β_4,12,14
Power	0.97	0.9	1	1	0.95	0.99	0.98	0.97	0.99	0	0
True coefficient	1	−1.5	2	1.5	−0.5	0.5	−0.5	−0.5	0.5	0.25	0.5


Table 3:

Type I Error for Null Variables Based on 100 Simulations

	β_2,14	β_2,4,14	β_2,12,14	β_2,4,12,14
Type I error	0.97	0	0.68	0
True coefficient	0	0	0	0


5.2. Comparison with standard PARAFAC

We now conduct a simulation study to compare estimation of the Cramer’s V matrix ρ under the proposed approach to the usual specification of the PARAFAC model without any sparsity as in (4), which is equivalent to setting γ = 0 in (6). We considered 100 simulation replicates, with data in each replicate consisting of p = 100 categorical variables for n = 100 subjects, with each variable having 4 possible levels (d_j = d = 4). Two simulation settings were considered to induce dependence between the variables in S* = {2, 4, 12, 14}: (i) via multiple subpopulations as in the simulation study in Dunson and Xing (2009), and (ii) via a nominal GLM $\Pr (y_{i j} = c) = \frac{\exp (y_{i (j)} β_{c})}{1 + \sum_{c = 2}^{4} \exp (y_{i (j)} β_{c})}$ for j ∈ S*, where y_i(j)β_c is a linear combination of all variables that are associated with the j^th variable, excluding the j^th variable itself. The remaining variables were independently generated from a discrete uniform distribution.

The color plot on the left in Figure 6 shows the true pairwise Cramer’s V values under simulation setting (i) (only the top-left 20 × 20 submatrix of ρ is shown for clarity). Figure 6 (right) and Figure 7 are based on one of the replicates: the right plot in Figure 6 shows the Cramer’s V under the standard non-sparse PARAFAC method, while Figure 7 shows the Cramer’s V using our method with two different baseline components. Our approach gives visibly better estimates both for the truly dependent pairs and for the true nulls. Results for simulation setting (ii), shown in Figure 8, again favor the sparse modification over standard PARAFAC.

Figure 7: — Posterior means of Cramer’s V under simulation setting (i) using proposed method - Left: with $λ_{0}^{(j)}$ being discrete uniform; Right: with $λ_{0}^{(j)}$ being empirical estimates of the marginal category probabilities.

Figure 8: — Posterior means of Cramer’s V under simulation setting (ii) – Left: using standard PARAFAC; Middle: under proposed method using empirical marginal with Diri(1, …,1) prior for λ₀; Right: using proposed method with discrete uniform λ₀.

6. Application

6.1. Splice-junction Gene Sequences

We applied the method to the Splice-junction Gene Sequences data (abbreviated as splice data below). Splice junctions are points on a DNA sequence at which ‘superfluous’ DNA is removed during the process of protein creation in higher organisms. These data consist of A, C, G, T nucleotides at p = 60 positions for N = 3,175 sequences. Since the sample size is much larger than the number of variables, we compared our approach with standard PARAFAC in two scenarios: first on a small randomly selected subset (of size n = 2p = 120) of the full data set, and second on the full data set itself. Using two different sample sizes in this manner allows a comparison of the new and existing methods against a gold standard (a sufficiently large data set). We estimated the pairwise positional dependence structure under the standard PARAFAC method and under the proposed approach with a discrete uniform baseline component. As is apparent in Figure 10, both methods have similar performance when n ≫ p. In the smaller sample size situation, however, Figure 9 demonstrates that our proposed method has the advantage of identifying the dependence structure and pushing the independent pairs to zero, which brings it closer to the results in the large sample case (Figure 10).

Figure 10: — Posterior quantiles of Cramer’s V with 3,175 sequences of splice data – Upper panel: under standard PARAFAC; Bottom panel: under proposed method.

Figure 9: — Posterior quantiles of Cramer’s V with 120 sequences of splice data – Upper panel: under standard PARAFAC; Bottom panel: under proposed method.

7. Discussion

We have proposed a sparse modification to the widely-used PARAFAC tensor factorization, and have applied this in a Bayesian context to improve analyses of ultra sparse huge contingency tables. Given the compelling success in this application area, we hope that the proposed notion of sparsity will have a major impact in other areas, including tensor completion problems in machine learning. There is an enormous literature on low rank and sparse matrix factorizations, and the sp-PARAFAC should facilitate scaling of such approaches to many-way tables while dealing with the inevitable curse of dimensionality. Although we take a Bayesian approach, we suspect that frequentist penalized optimization methods can also exploit our same concept of sparsity in learning a compressed characterization of a huge array based on limited data.

Appendix

7.1. Proof of Theorem 3.1

We verify the conditions of Theorem 4 in Yang & Dunson (2013), which is a minor modification of Theorem 2 in Ghosal et al. (2000). Let ϵ_n → 0 be such that $n ϵ_{n}^{2} \to \infty$ and $\sum_{n \geq 1} \exp (- n ϵ_{n}^{2}) < \infty$. Suppose there exist a sequence of sets $P_{n} \subset F_{n}$ and a constant C > 0 such that the following hold:⁴

(1) $\log N (ϵ_{n}; P_{n}, ‖ \cdot ‖_{1}) \leq n ϵ_{n}^{2}$,

(2) $ℙ_{n} (F_{n} \cap P_{n}^{c}) \leq \exp \{- (2 + C) n ϵ_{n}^{2}\}$,

(3) $ℙ_{n} (π : ‖ \log \frac{π}{π^{(0 n)}} ‖_{\infty} \leq ϵ_{n}^{2}) \geq \exp (- C n ϵ_{n}^{2})$.

Then, the posterior contracts at the rate ϵ_n, i.e., (8) is satisfied. We now proceed to verify conditions (1) - (3). We define,

$P_{n} = \{π \in F_{n} : π_{c_{1} \dots c_{p}} = \sum_{h = 1}^{k_{n}} ν_{h}^{*} \prod_{j \in S_{h}^{*}} λ_{h c_{j}}^{(* j)} \prod_{j \in S_{h}^{* c}} λ_{0 c_{j}}^{(j)}; ν \in S^{(k_{n} - 1)}, | S_{h}^{*} | \leq A s_{n}, h = 1, \dots, k_{n}\}$ (19)

where $S^{(r - 1)}$ denotes the (r − 1)-dimensional probability simplex and A > 0 is an absolute constant. We shall use C to denote an absolute constant whose meaning may change from one line to the next.

To estimate $N (ϵ_{n}; P_{n}, ‖ \cdot ‖_{1})$, we make use of the following lemma, which follows in a straightforward manner by repeated use of the triangle inequality.

Lemma 7.1. Let $π^{(1)}, π^{(2)} \in F_{n}$ with

π^{(i)} = \sum_{h = 1}^{k_{n}} ν_{i h} λ_{i h}^{(1)} \otimes \dots \otimes λ_{i h}^{(p_{n})}, i = 1, 2.

Then,

‖ π^{(1)} - π^{(2)} ‖_{1} \leq \sum_{h = 1}^{k_{n}} | ν_{1 h} - ν_{2 h} | + \sum_{h = 1}^{k_{n}} ν_{2 h} (\sum_{j = 1}^{p_{n}} \sum_{c = 1}^{d} | λ_{1 h c}^{(j)} - λ_{2 h c}^{(j)} |) .

Lemma 7.1 implies that if $π^{(1)}, π^{(2)} \in P_{n}$ with $S_{1 h}^{*} = S_{2 h}^{*} = S_{h}^{*}$, then

‖ π^{(1)} - π^{(2)} ‖_{1} \leq \sum_{h = 1}^{k_{n}} | ν_{1 h} - ν_{2 h} | + \sum_{h = 1}^{k_{n}} ν_{2 h} (\sum_{j \in S_{h}^{*}} \sum_{c = 1}^{d} | λ_{1 h c}^{(j)} - λ_{2 h c}^{(j)} |) .

Based on the above observation, we create an ϵ-net of $P_{n}$ as follows: In (19), (i) vary $S_{h}^{*}$ over all possible subsets of {1, …, p_n} with $| S_{h}^{*} | \leq A s_{n}$ for h = 1, …, k_n, (ii) for h ∈ {1, … k_n} and $j \in S_{h}^{*}$ , vary $λ_{h}^{(* j)}$ over an ϵ_n/(2Ads_n)-net of $S^{(d - 1)}$ and (iii) vary ν* over an ϵ_n/(2k_n)-net of $S^{(k_{n} - 1)}$ .

For a fixed h, there are $\sum_{s = 0}^{A s_{n}} \binom{p_{n}}{s}$ subsets of size smaller than or equal to As_n. Using the inequality $\binom{p_{n}}{s} \leq (p_{n} e / s)^{s}$ for s ≤ p_n/2, the number of possible subsets in (i) can be bounded above by exp(Ck_ns_n log p_n). Hence,

N (ϵ_{n}; P_{n}, ‖ \cdot ‖_{1}) \leq \exp (C k_{n} s_{n} \log p_{n}) \, N (ϵ_{n} / (2 A d s_{n}); S^{d - 1}, ‖ \cdot ‖_{1})^{2 A d s_{n} k_{n}} \, N (ϵ_{n} / (2 k_{n}); S^{k_{n} - 1}, ‖ \cdot ‖_{1}) .

Using the fact that $N (δ, S^{r - 1}, ‖ \cdot ‖_{1}) \leq {(C / δ)}^{r}$ (Vershynin, 2010), the right hand side in the above display can be bounded above by $exp (C s_{n} log p_{n}) = exp (n ϵ_{n}^{2})$ , since k_n = O(1).
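The last step can be unpacked by taking logarithms and applying the simplex covering bound to each factor. The display below is a sketch under assumptions that are not spelled out in this passage: d is fixed, k_n = O(1), and log(s_n/ϵ_n) is at most a constant multiple of log p_n; the constant C′ is introduced here only for illustration.

```latex
\begin{align*}
\log N(\epsilon_n; P_n, \|\cdot\|_1)
  &\le C k_n s_n \log p_n
     + 2 A d\, s_n k_n \cdot d \,\log\!\Big(\frac{2 A C d\, s_n}{\epsilon_n}\Big)
     + k_n \log\!\Big(\frac{2 C k_n}{\epsilon_n}\Big) \\
  &\le C' s_n \log p_n .
\end{align*}
```

The first term counts the admissible subsets, the second covers the at most 2Ads_nk_n simplex-valued vectors, and the third covers the weight vector ν*.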

We now bound $ℙ_{n} (F_{n} \cap P_{n}^{c})$ . Recall that in the sp-PARAFAC model, the induced prior on the subset size |S_h| is Bin(p_n, τ_h), with τ_h ~ Beta(1, γ). Now,

ℙ_{n} (F_{n} \cap P_{n}^{c}) \leq \Pr (\exists h \in \{1, \dots, k_{n}\} \text{ s.t. } | S_{h} | > A s_{n}) \leq k_{n} \Pr (| S_{1} | > A s_{n}) .

Integrating out τ₁, the distribution of |S₁| is beta-binomial with probability mass function

\Pr (| S_{1} | = s) = \binom{p_{n}}{s} \frac{1}{B (1, γ)} \int_{τ = 0}^{1} τ^{s} (1 - τ)^{p_{n} - s} (1 - τ)^{γ - 1} d τ = \binom{p_{n}}{s} \frac{B (1 + s, γ + p_{n} - s)}{B (1, γ)} = \frac{1}{γ} \frac{p_{n}!}{(p_{n} - s)!} \frac{(γ + p_{n} - s - 1)!}{(γ + p_{n})!},

for s = 0,1,…, p_n. B(·, ·) denotes the Beta function in the above display. Hence, for s ≥ 1,

\frac{Pr (| S_{1} | = s)}{Pr (| S_{1} | = s - 1)} = \frac{(p_{n} - s + 1)}{(p_{n} - s + γ)} .

Now, letting $γ = p_{n}^{2}$ , one has for any p_n ≥ 2 and 1 ≤ s ≤ p_n/2,

\frac{1}{4 p_{n}} \leq \frac{(p_{n} - s + 1)}{(p_{n} - s + γ)} \leq \frac{1}{p_{n}} .
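The ratio identity and the two-sided bound can be checked numerically. The sketch below evaluates the probability mass function through its Beta-function form, $\binom{p_{n}}{s} B (1 + s, γ + p_{n} - s) / B (1, γ)$, using log-gamma functions; the function names and the choice p_n = 50 are illustrative assumptions.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def subset_size_pmf(p, gamma):
    """Pr(|S_1| = s), s = 0..p, for |S_1| ~ Bin(p, tau), tau ~ Beta(1, gamma)."""
    pmf = []
    for s in range(p + 1):
        log_binom = (
            math.lgamma(p + 1) - math.lgamma(s + 1) - math.lgamma(p - s + 1)
        )
        log_prob = log_binom + log_beta(1 + s, gamma + p - s) - log_beta(1, gamma)
        pmf.append(math.exp(log_prob))
    return pmf

p = 50
pmf = subset_size_pmf(p, gamma=p ** 2)
# successive ratios Pr(|S_1| = s) / Pr(|S_1| = s - 1) for 1 <= s <= p / 2
ratios = [pmf[s] / pmf[s - 1] for s in range(1, p // 2 + 1)]
```

Each entry of `ratios` can be compared with (p − s + 1)/(p − s + γ) and with the bounds 1/(4p) and 1/p from the display above.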

In general, for $γ = β p_{n}^{2}$ , we can bound this from both sides by C/p_n. Noting that $Pr (| S_{1} | = 0) = C / p_{n}^{3}$ , we have

Pr (| S_{1} | = s) = \frac{C}{p_{n}^{3}} \prod_{j = 1}^{s} \frac{Pr (| S_{1} | = j)}{Pr (| S_{1} | = j - 1)},

implying there exist constants c₁, c₂ > 0 such that

$e^{- c_{1} (s + 3) \log p_{n}} \leq \Pr (| S_{1} | = s) \leq e^{- c_{2} (s + 3) \log p_{n}},$ (20)

for 0 ≤ s ≤ p_n/2. In particular, the upper bound holds for all 0 ≤ s ≤ p_n, since (p_n − s + 1)/(p_n − s + γ) ≤ C/p_n for all s. Hence, for n large enough so that s_n ≥ 3,

P (| S_{1} | > A s_{n}) \leq \sum_{j = A s_{n} + 1}^{p_{n}} exp (- C j log p_{n}) \leq exp (- C s_{n} log p_{n}) \leq exp (- n ϵ_{n}^{2}) .

We finally show that (3) holds. Recall the decomposition of π⁽⁰ⁿ⁾ from (9). A probability tensor π following a sp-PARAFAC model with a truncated stick-breaking prior on ν can be parameterized as

θ_{π} = (ν, \{S_{h}\}_{1 \leq h \leq k_{n}}, \{λ_{h}^{(j)}\}_{1 \leq h \leq k_{n}, j \in S_{h}}),

where $ν \in S^{k_{n} - 1}$, S_h ⊂ {1, …, p_n}, $λ_{h}^{(j)} \in S^{d - 1}$. Consider the following subset $A$ of the parameter space,

A = \{S_{h} = S_{0}, 1 \leq h \leq k_{n}; \sum_{h = 1}^{k_{n}} | ν_{h} - ν_{0 h} | \leq \frac{ϵ_{n}^{2}}{2 e^{c_{0} s_{n}}}; \sum_{c = 1}^{d} | λ_{h c}^{(j)} - {\bar{λ}}_{h c}^{(0 j)} | \leq \frac{ϵ_{n}^{2} ε_{0}}{4 q_{n}}, 1 \leq h \leq k_{n}, j \in S_{0}\} .

We now show that $θ_{π} \in A$ implies $\log ‖ π / π^{(0 n)} ‖_{\infty} \leq ϵ_{n}^{2}$, so that $ℙ_{n} (\log ‖ π / π^{(0 n)} ‖_{\infty} \leq ϵ_{n}^{2})$ can be bounded below by $ℙ_{n} (A)$. First, observe that since S_h = S₀ for all h on $A$, π/π⁽⁰ⁿ⁾ = ψ/ψ⁽⁰ⁿ⁾, where ψ⁽⁰ⁿ⁾ is as in (10) and ψ is the $d^{q_{n}}$ joint probability tensor implied by the sp-PARAFAC model for the variables {y_ij : j ∈ S₀},

ψ_{c_{1} \dots c_{q_{n}}} = \sum_{h = 1}^{k_{n}} ν_{h} \prod_{j \in S_{0}} λ_{h c_{j}}^{(e_{j})} .

Hence,

log {‖ \frac{π}{π^{(0 n)}} ‖}_{\infty} = log {‖ \frac{ψ}{ψ^{(0 n)}} ‖}_{\infty} \leq log (1 + {‖ (\frac{ψ}{ψ^{(0 n)}} - 1) ‖}_{\infty}) \leq {‖ (\frac{ψ}{ψ^{(0 n)}} - 1) ‖}_{\infty},

where the penultimate step follows from an application of triangle inequality and the last step uses log(1 + x) ≤ x for x ≥ 0. For any $c_{1}, \dots, c_{s_{n}}$ , by an application of triangle inequality,

$| ψ_{c_{1} \dots c_{s_{n}}} - ψ_{c_{1} \dots c_{s_{n}}}^{(0 n)} | \leq \sum_{h = 1}^{k_{n}} | ν_{h} - ν_{0 h} | + \sum_{h = 1}^{k_{n}} ν_{0 h} | \prod_{j = 1}^{q_{n}} λ_{h c_{j}}^{(e_{j})} - \prod_{j = 1}^{q_{n}} {\bar{λ}}_{h c_{j}}^{(0 e_{j})} | .$ (21)

We now state a Lemma to facilitate bounding the second term of the above display.

Lemma 7.2. Let v₁,… v_r ∈ (ε₀, 1 − ε₀) for some ε₀ > 0. Let δ > 0 be such that rδ < ε₀/2. Then, if u₁, …, u_r satisfy |u_j − v_j | ≤ δ for all j = 1, …, r, then

| u_{1} \dots u_{r} - v_{1} \dots v_{r} | \leq \frac{2 r δ}{ε_{0}} v_{1} \dots v_{r} .

Apply Lemma 7.2 with r = q_n, $u_{j} = λ_{h c_{j}}^{(e_{j})}$, $v_{j} = {\bar{λ}}_{h c_{j}}^{(0 e_{j})}$ and $δ = ϵ_{n}^{2} ε_{0} / (4 q_{n})$ (clearly $r δ / ε_{0} = ϵ_{n}^{2} / 4 < 1 / 2$) to obtain that for any 1 ≤ h ≤ k_n, $| \prod_{j = 1}^{q_{n}} λ_{h c_{j}}^{(e_{j})} - \prod_{j = 1}^{q_{n}} {\bar{λ}}_{h c_{j}}^{(0 e_{j})} | \leq (ϵ_{n}^{2} / 2) \prod_{j = 1}^{q_{n}} {\bar{λ}}_{h c_{j}}^{(0 e_{j})}$. Substituting this bound in (21), we have on $A$,

\frac{| ψ_{c_{1} \dots c_{s_{n}}} - ψ_{c_{1} \dots c_{s_{n}}}^{(0 n)} |}{ψ_{c_{1} \dots c_{s_{n}}}^{(0 n)}} \leq \frac{Σ_{h = 1}^{k_{n}} | ν_{h} - ν_{0 h} |}{e^{- c_{0} s_{n}}} + (ϵ_{n}^{2} / 2) \frac{Σ_{h = 1}^{k_{n}} ν_{0 h} Π_{j = 1}^{q_{n}} {\bar{λ}}_{h c_{j}}^{(0 e_{j})}}{ψ_{c_{1} \dots c_{s_{n}}}^{(0 n)}} \leq ϵ_{n}^{2} .

For the two terms in the above display after the first inequality, we used the lower bound (11) for the first term along with $Σ_{h = 1}^{k_{n}} | ν_{h} - ν_{0 h} | \leq ϵ_{n}^{2} / (2 e^{c_{0} s_{n}})$ on $A$ , and by definition of ψ⁽⁰ⁿ⁾, the second term is $ϵ_{n}^{2} / 2$ .

It thus remains to lower bound $ℙ_{n} (A)$. By independence across h, $\Pr (S_{h} = S_{0}, 1 \leq h \leq k_{n}) = \Pr (S_{1} = S_{0})^{k_{n}}$. Further, by exchangeability of the prior on S₁, since all subsets of a particular size receive the same prior probability, $\Pr (S_{1} = S_{0}) = \Pr (| S_{1} | = q_{n}) / \binom{p_{n}}{q_{n}}$. From (20), Pr(|S₁| = q_n) ≥ exp(−Cs_n log p_n). Using $\binom{p_{n}}{q_{n}} \leq (p_{n} e / q_{n})^{q_{n}}$, we conclude that Pr(S₁ = S₀) ≥ exp(−Cs_n log p_n).

Recall that $ν_{h} = ν_{h}^{*} Π_{l < h} (1 - ν_{l}^{*})$ , where $ν_{l}^{*} ~ Beta (1, α)$ independently. Find numbers { $ν_{0 h}^{*}$ } such that $ν_{0 h} = ν_{0 h}^{*} Π_{l < h} (1 - ν_{0 l}^{*})$ . It is easy to see that there exists a constant C > 0 such that $| ν_{h}^{*} - ν_{0 h}^{*} | \leq ϵ_{n} / (C k_{n})$ for all h = 1, …, k_n implies $Σ_{h = 1}^{k_{n}} | ν_{h} - ν_{0 h} | \leq ϵ_{n}$ . Hence, using a general result on small ball probability estimate of Dirichlet random vectors (Lemma 6.1 of Ghosal et al. (2000)), one has

Pr (\sum_{h = 1}^{k_{n}} | ν_{h} - ν_{0 h} | \leq \frac{ϵ_{n}^{2}}{2 e^{c_{0} s_{n}}}) \geq exp {- C s_{n} log (1 / ϵ_{n})} .

Again, applying Lemma 6.1 of Ghosal et al. (2000),

\Pr (\sum_{c = 1}^{d} | λ_{h c}^{(j)} - {\bar{λ}}_{h c}^{(0 j)} | \leq \frac{ϵ_{n}^{2} ε_{0}}{4 q_{n}}) \geq \exp \{- C \log (s_{n} / ϵ_{n})\} .

Combining, we get $\Pr (A) \geq \exp (- C s_{n} \log p_{n}) \geq \exp (- n ϵ_{n}^{2})$. Hence, we have established (1) - (3), completing the proof.

7.2. Proof of Lemma 7.2

Observe that

| u_{1} \dots u_{r} - v_{1} \dots v_{r} | = | v_{1} \dots v_{r} | | \frac{u_{1} \dots u_{r}}{v_{1} \dots v_{r}} - 1 | = v_{1} \dots v_{r} max {\frac{u_{1} \dots u_{r}}{v_{1} \dots v_{r}} - 1, 1 - \frac{u_{1} \dots u_{r}}{v_{1} \dots v_{r}}} .

Now, since u_h ≤ v_h + δ for all h,

\frac{u_{1} \dots u_{r}}{v_{1} \dots v_{r}} \leq \prod_{h = 1}^{r} (1 + δ / v_{h}) \leq {(1 + δ / ε_{0})}^{r} .

Using the binomial theorem, ${(1 + δ / ε_{0})}^{r} - 1 = \sum_{h = 1}^{r} \binom{r}{h} {(δ / ε_{0})}^{h}$. Next, bound $\binom{r}{h} \leq r^{h}$ and use the fact that rδ/ε₀ < 1/2 to conclude that $\sum_{h = 1}^{r} \binom{r}{h} {(δ / ε_{0})}^{h} \leq \sum_{h = 1}^{\infty} {(r δ / ε_{0})}^{h} = \frac{r δ / ε_{0}}{1 - r δ / ε_{0}} \leq 2 r δ / ε_{0}$.

On the other hand, using u_h ≥ v_h − δ for all h,

\frac{u_{1} \dots u_{r}}{v_{1} \dots v_{r}} \geq \prod_{h = 1}^{r} (1 - δ / v_{h}) \geq {(1 - δ / ε_{0})}^{r} \geq 1 - r δ / ε_{0} .

The proof is concluded by observing that

max {\frac{u_{1} \dots u_{r}}{v_{1} \dots v_{r}} - 1, 1 - \frac{u_{1} \dots u_{r}}{v_{1} \dots v_{r}}} \leq 2 r δ / ε_{0} .
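A quick numerical check of Lemma 7.2 on random inputs can be sketched as follows. It is only an illustration of the statement, not part of the proof; the function name and the particular values of ε₀, r and δ are assumptions chosen so that rδ < ε₀/2.

```python
import numpy as np

def lemma_7_2_holds(v, u, eps0):
    """Check |prod(u) - prod(v)| <= (2 r delta / eps0) prod(v) for one draw."""
    r = v.size
    delta = np.max(np.abs(u - v))        # smallest delta valid for this pair
    if not r * delta < eps0 / 2:
        raise ValueError("condition r * delta < eps0 / 2 is violated")
    lhs = abs(np.prod(u) - np.prod(v))
    rhs = 2 * r * delta / eps0 * np.prod(v)
    return lhs <= rhs

rng = np.random.default_rng(1)
eps0, r = 0.1, 8
delta_max = 0.9 * eps0 / (2 * r)         # guarantees r * delta < eps0 / 2
checks = []
for _ in range(1000):
    v = rng.uniform(eps0, 1 - eps0, size=r)
    u = v + rng.uniform(-delta_max, delta_max, size=r)
    checks.append(lemma_7_2_holds(v, u, eps0))
```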

Footnotes

¹ For p = 2, $ψ^{(1)} \otimes ψ^{(2)} = ψ^{(1)} ψ^{(2) T}$. In general, ${(ψ^{(1)} \otimes \dots \otimes ψ^{(p)})}_{c_{1} \dots c_{p}} = ψ_{c_{1}}^{(1)} \dots ψ_{c_{p}}^{(p)}$.

² Mult({1, …, d}; λ_1, …, λ_d) denotes a discrete distribution on {1, …, d} with probabilities λ_1, …, λ_d associated to each atom.

³ For sequences a_n, b_n, we write a_n = o(b_n) if a_n/b_n → 0 as n → ∞ and a_n = O(b_n) if a_n ≤ Cb_n for all large n.

⁴ Given a metric space $(X, d)$, let $N (ϵ; X, d)$ denote its ϵ-covering number, i.e., the minimum number of d-balls of radius ϵ needed to cover $X$.

Contributor Information

Jing Zhou, Department of Biostatistics, The University of North Carolina at Chapel Hill.

Anirban Bhattacharya, Department of Statistics, Texas A&M University.

Amy Herring, Department of Biostatistics and Carolina Population Center, The University of North Carolina at Chapel Hill.

David Dunson, Department of Statistical Science, Duke University.

References

Agresti A (2002), Categorical data analysis, Vol. 359, Wiley-Interscience.
Armagan A, Dunson D, and Lee J (2013a), “Generalized double Pareto shrinkage,” Statistica Sinica, 23, 119–143.
Armagan A, Dunson D, Lee J, Bajwa W, and Strawn N (2013b), “Posterior consistency in high-dimensional linear models,” Biometrika (to appear).
Belitser E, and Ghosal S (2003), “Adaptive Bayesian inference on the mean of an infinite-dimensional normal distribution,” The Annals of Statistics, 31, 536–559.
Bhattacharya A, and Dunson D (2011), “Sparse Bayesian infinite factor models,” Biometrika, 98, 291–306.
Bhattacharya A, and Dunson D (2012), “Simplex factor models for multivariate unordered categorical data,” Journal of the American Statistical Association, 107, 362–377.
Bontemps D (2011), “Bernstein-von Mises theorems for Gaussian regression with increasing number of regressors,” The Annals of Statistics, 39, 2557–2584.
Bro R (1997), “PARAFAC. Tutorial and applications,” Chemometrics and Intelligent Laboratory Systems, 38, 149–171.
Candes E, and Recht B (2009), “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, 9, 717–772.
Carvalho C, Lucas J, Wang Q, Nevins J, and West M (2008), “High-dimensional sparse factor modelling: applications in gene expression genomics,” Journal of the American Statistical Association, 103, 1438–1456.
Carvalho C, Polson N, and Scott J (2010), “The horseshoe estimator for sparse signals,” Biometrika, 97, 465–480.
Castillo I, and van der Vaart A (2012), “Needles and straws in a haystack: Posterior concentration for possibly sparse sequences,” The Annals of Statistics, 40, 2069–2101.
Chartrand R (2012), “Nonconvex splitting for regularized low-rank plus sparse decomposition,” IEEE Transactions on Signal Processing, 60, 5810–5819.
Dunson DB, and Xing C (2009), “Nonparametric Bayes modeling of multivariate categorical data,” Journal of the American Statistical Association, 104, 1042–1051.
Fienberg S, and Rinaldo A (2007), “Three centuries of categorical data analysis: Log-linear models and maximum likelihood estimation,” Journal of Statistical Planning and Inference, 137, 3430–3445.
Friedlander M, and Hatz K (2005), “Computing non-negative tensor factorizations,” Optimization Methods and Software, 23, 631–647.
Ge Y, and Jiang W (2006), “On consistency of Bayesian inference with mixtures of logistic regression,” Neural Computation, 18, 224–243.
Gelman A, Jakulin A, Pittau M, and Su Y (2008), “A weakly informative default prior distribution for logistic and other regression models,” Annals of Applied Statistics, 2, 1360–1383.
Ghosal S (1999), “Asymptotic normality of posterior distributions in high-dimensional linear models,” Bernoulli, 5, 315–331.
Ghosal S (2000), “Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity,” Journal of Multivariate Analysis, 74, 49–68.
Ghosal S, Ghosh J, and van der Vaart A (2000), “Convergence rates of posterior distributions,” Annals of Statistics, 28, 500–531.
Hans C (2011), “Elastic net regression modeling with the orthant normal prior,” Journal of the American Statistical Association, 106, 1383–1393.
Harshman R (1970), “Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis,” UCLA Working Papers in Phonetics, 16, 84.
Jiang W (2007), “Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities,” The Annals of Statistics, 1487–1511.
Karatzoglou A, Amatriain X, Baltrunas L, and Oliver N (2010), “Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering,” Proceedings of the Fourth ACM Conference on Recommender Systems.
Kolda T, and Bader B (2009), “Tensor decompositions and applications,” SIAM Review, 51, 455–500.
Lee DD, and Seung HS (1999), “Learning the parts of objects by non-negative matrix factorization,” Nature, 401, 788–791.
Lim L, and Comon P (2009), “Nonnegative approximations of nonnegative tensors,” Jour. Chemometrics, 432–441.
Liu J, Liu J, Wonka P, and Ye J (2012), “Sparse non-negative tensor factorization using column-wise coordinate descent,” Pattern Recognition, 45, 649–656.
Lucas JE, Carvalho C, Wang Q, Bild A, Nevins J, and West M (2006), “Sparse statistical modelling in gene expression genomics,” in Bayesian Inference for Gene Expression and Proteomics, eds. Do K, Müller P and Vannucci M, Cambridge University Press, pp. 155–176.
Paatero P, and Tapper U (1994), “Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, 5, 111–126.
Park T, and Casella G (2008), “The Bayesian lasso,” Journal of the American Statistical Association, 103, 681–686.
Pati D, Bhattacharya A, Pillai N, and Dunson D (2013a), “Posterior contraction in sparse Bayesian factor models for massive covariance matrices,” arXiv:1206.3627.
Pati D, Dunson DB, and Tokdar ST (2013b), “Posterior consistency in conditional distribution estimation,” Journal of Multivariate Analysis.
Polson N, and Scott J (2010), “Shrink globally, act locally: Sparse Bayesian regularization and prediction,” in Bayesian Statistics 9 (Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM and West M, eds.), Oxford University Press, New York, pp. 501–538.
Scott J, and Berger J (2010), “Bayes and empirical-Bayes multiplicity adjustment in the variable selection problem,” The Annals of Statistics, 38, 2587–2619.
Sethuraman J (1994), “A constructive definition of Dirichlet priors,” Statistica Sinica, 4, 639–650.
Shen W, Tokdar S, and Ghosal S (2011), “Adaptive Bayesian multivariate density estimation with Dirichlet mixtures,” arXiv preprint arXiv:1109.6406.
Talagrand M (1996), “A new look at independence,” The Annals of Probability, 24, 1–34.
van der Vaart A, and van Zanten J (2008), “Rates of contraction of posterior distributions based on Gaussian process priors,” The Annals of Statistics, 36, 1435–1463.
Vershynin R (2010), “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027.
West M (2003), “Bayesian factor regression models in the “large p, small n” paradigm,” in Bayesian Statistics 7 (Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM and West M, eds.), Oxford University Press, New York, pp. 733–742.
Yang Y, and Dunson DB (2013), “Bayesian conditional tensor factorizations for high-dimensional classification,” arXiv preprint arXiv:1301.4950.

[R1] Agresti A (2002), Categorical data analysis, Vol. 359, Wiley-interscience. [Google Scholar]

[R2] Armagan A, Dunson D, and Lee J (2013a), “Generalized double Pareto shrinkage,” Statistica Sinica, 23, 119–143. [PMC free article] [PubMed] [Google Scholar]

[R3] Armagan A, Dunson D, Lee J, Bajwa W, and Strawn N (2013b), “Posterior consistency in high-dimensional linear models,” Biometrika (to appear). [Google Scholar]

[R4] Belitser E, and Ghosal S (2003), “Adaptive Bayesian inference on the mean of an infinitedimensional normal distribution,” The Annals of Statistics, 31, 536–559. [Google Scholar]

[R5] Bhattacharya A, and Dunson D (2011), “Sparse Bayesian infinite factor models,” Biometrika, 98, 291–306. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Bhattacharya A, and Dunson D (2012), “Simplex factor models for multivariate unordered categorical data,” Journal of the American Statistical Association, 107, 362–377. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Bontemps D (2011), “Bernstein-von Mises theorems for Gaussian regression with increasing number of regressors,” The Annals of Statistics, 39, 2557–2584. [Google Scholar]

[R8] Bro R (1997), “PARAFAC. Tutorial and applications,” Chemometrics and Intelligent Laboratory Systems, 38, 149–171. [Google Scholar]

[R9] Candes E, and Recht B (2009), “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, 9, 717–772. [Google Scholar]

[R10] Carvalho C, Lucas J, Wang Q, Nevins J, and West M (2008), “High-dimensional sparse factor modelling: applications in gene expression genomics,” Journal of the American Statistical Association, 103, 1438–1456. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Carvalho C, Polson N, and Scott J (2010), “The horseshoe estimator for sparse signals,” Biometrika, 97, 465–480. [Google Scholar]

[R12] Castillo I, and van der Vaart A (2012), “Needles and straws in a haystack: Posterior concentration for possibly sparse sequences,” The Annals of Statistics, 40, 2069–2101. [Google Scholar]

[R13] Chartrand R (2012), “Nonconvex splitting for regularized low-rank plus sparse decomposition,” IEEE Transactions on Signal Processing, 60, 5810–5819. [Google Scholar]

[R14] Dunson DB, and Xing C (2009), “Nonparametric Bayes modeling of multivariate categorical data,” Journal of the American Statistical Association, 104, 1042–1051. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Fienberg S, and Rinaldo A (2007), “Three centuries of categorical data analysis: Log-linear models and maximum likelihood estimation,” Journal of Statistical Planning and Inference, 137, 3430–3445. [Google Scholar]

[R16] Friedlander M, and Hatz K (2005), “Computing non-negative tensor factorizations,” Optimization Methods and Software, 23, 631–647. [Google Scholar]

[R17] Ge Y, and Jiang W (2006), “On consistency of Bayesian inference with mixtures of logistic regression,” Neural Computation, 18, 224–243.

[R18] Gelman A, Jakulin A, Pittau M, and Su Y (2008), “A weakly informative default prior distribution for logistic and other regression models,” Annals of Applied Statistics, 2, 1360–1383.

[R19] Ghosal S (1999), “Asymptotic normality of posterior distributions in high-dimensional linear models,” Bernoulli, 5, 315–331.

[R20] Ghosal S (2000), “Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity,” Journal of Multivariate Analysis, 74, 49–68.

[R21] Ghosal S, Ghosh J, and van der Vaart A (2000), “Convergence rates of posterior distributions,” The Annals of Statistics, 28, 500–531.

[R22] Hans C (2011), “Elastic net regression modeling with the orthant normal prior,” Journal of the American Statistical Association, 106, 1383–1393.

[R23] Harshman R (1970), “Foundations of the PARAFAC procedure: Models and conditions for an ‘explanatory’ multi-modal factor analysis,” UCLA Working Papers in Phonetics, 16, 84.

[R24] Jiang W (2007), “Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities,” The Annals of Statistics, 1487–1511.

[R25] Karatzoglou A, Amatriain X, Baltrunas L, and Oliver N (2010), “Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering,” Proceedings of the Fourth ACM Conference on Recommender Systems.

[R26] Kolda T, and Bader B (2009), “Tensor decompositions and applications,” SIAM Review, 51, 455–500.

[R27] Lee DD, and Seung HS (1999), “Learning the parts of objects by non-negative matrix factorization,” Nature, 401, 788–791.

[R28] Lim L, and Comon P (2009), “Nonnegative approximations of nonnegative tensors,” Journal of Chemometrics, 432–441.

[R29] Liu J, Liu J, Wonka P, and Ye J (2012), “Sparse non-negative tensor factorization using column-wise coordinate descent,” Pattern Recognition, 45, 649–656.

[R30] Lucas JE, Carvalho C, Wang Q, Bild A, Nevins J, and West M (2006), “Sparse statistical modelling in gene expression genomics,” in Bayesian Inference for Gene Expression and Proteomics, eds. Do K, Müller P and Vannucci M, Cambridge University Press, pp. 155–176.

[R31] Paatero P, and Tapper U (1994), “Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, 5, 111–126.

[R32] Park T, and Casella G (2008), “The Bayesian lasso,” Journal of the American Statistical Association, 103, 681–686.

[R33] Pati D, Bhattacharya A, Pillai N, and Dunson D (2013a), “Posterior contraction in sparse Bayesian factor models for massive covariance matrices,” arXiv:1206.3627.

[R34] Pati D, Dunson DB, and Tokdar ST (2013b), “Posterior consistency in conditional distribution estimation,” Journal of Multivariate Analysis.

[R35] Polson N, and Scott J (2010), “Shrink globally, act locally: Sparse Bayesian regularization and prediction,” in Bayesian Statistics 9 (Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM and West M, eds.), Oxford University Press, New York, pp. 501–538.

[R36] Scott J, and Berger J (2010), “Bayes and empirical-Bayes multiplicity adjustment in the variable selection problem,” The Annals of Statistics, 38, 2587–2619.

[R37] Sethuraman J (1994), “A constructive definition of Dirichlet priors,” Statistica Sinica, 4, 639–650.

[R38] Shen W, Tokdar S, and Ghosal S (2011), “Adaptive Bayesian multivariate density estimation with Dirichlet mixtures,” arXiv preprint arXiv:1109.6406.

[R39] Talagrand M (1996), “A new look at independence,” The Annals of Probability, 24, 1–34.

[R40] van der Vaart A, and van Zanten J (2008), “Rates of contraction of posterior distributions based on Gaussian process priors,” The Annals of Statistics, 36, 1435–1463.

[R41] Vershynin R (2010), “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027.

[R42] West M (2003), “Bayesian factor regression models in the ‘large p, small n’ paradigm,” in Bayesian Statistics 7 (Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM and West M, eds.), Oxford University Press, New York, pp. 733–742.

[R43] Yang Y, and Dunson DB (2013), “Bayesian conditional tensor factorizations for high-dimensional classification,” arXiv preprint arXiv:1301.4950.
