Published in final edited form as: Ann. Statist. 2010;38(6):3217–3244. doi:10.1214/10-AOS805

GAMMA-BASED CLUSTERING VIA ORDERED MEANS WITH APPLICATION TO GENE-EXPRESSION ANALYSIS

Michael A. Newton and Lisa M. Chung

Abstract

Discrete mixture models provide a well-known basis for effective clustering algorithms, although technical challenges have limited their scope. In the context of gene-expression data analysis, a model is presented that mixes over a finite catalog of structures, each one representing equality and inequality constraints among latent expected values. Computations depend on the probability that independent gamma-distributed variables attain each of their possible orderings. Each ordering event is equivalent to an event in independent negative-binomial random variables, and this finding guides a dynamic-programming calculation. The structuring of mixture-model components according to constraints among latent means leads to strict concavity of the mixture log likelihood. In addition to its beneficial numerical properties, the clustering method shows promising results in an empirical study.

Key words and phrases: Gamma ranking, mixture model, next generation sequencing, Poisson embedding, rank probability

1. Introduction

A common problem in statistical genomics is how to organize expression data from genes that have been determined to exhibit differential expression relative to various cellular states. Cells in a time-course experiment may exhibit such genes, as may cells in any sort of designed experiment or observational study where expression alterations are being examined [e.g., Parmigiani et al. (2003), Speed (2004)]. In the event that the error-rate-controlled list of significantly altered genes is small, the post-processing problem amounts to inspecting observed patterns of expression, investigating what is known about the relatively few genes identified, and planning follow-up experiments as necessary. However, it is all too common that hundreds or even thousands of genes are detected as significantly altered in their expression pattern relative to the cellular states. Postprocessing these nonnull genes presents a substantial statistical problem. Difficulties are compounded in the multi-group setting because a gene can be nonnull in many different ways [Jensen et al. (2009)].

Ever since Eisen et al. (1998), clustering methods have been used to organize expression data. Information about a gene’s biological function may be conveyed by the other genes sharing its pattern of expression. Thalamuthu et al. (2006) provides a recent perspective. Clustering methods are often applied in order to partition nonnull genes which have been identified in differential expression analysis [e.g., Campbell et al. (2006), Grasso et al. (2008)]. Popular approaches are informative but not completely satisfactory. There are idiosyncratic problems, like how to select the number of clusters, but there is also the subtle issue that the clusters identified by most algorithms are anonymous: each cluster is defined only by similarity of its contents rather than by some external pattern that its genes may be approximating. Anonymity may contribute to technical problems, such as that the objective function being minimized is not convex, and that realized clusters have a more narrow size distribution than is warranted by the biological system.

Model-based clustering treats data as arising from a mixture of component distributions, and then forms clusters by assigning each data point to its most probable component [e.g., Titterington, Smith and Makov (1985), McLachlan and Basford (1988)]. For example, the mclust procedure is based on mixtures of Gaussian components [Fraley and Raftery (2002)]; the popular K-means algorithm is implicitly so based [Hastie, Tibshirani and Friedman (2001), page 463]. There is considerable flexibility in model-based clustering, though technical challenges have also affected its development: the likelihood function is often multi-modal; identifiability can be difficult to establish [e.g., Redner and Walker (1984), Holzmann, Munk and Gneiting (2006)]; and even where constraints may create identifiability, there can be a problem of label-switching during Bayesian inference [Stephens (2000)]. Some sophisticated model-based clustering methods have been developed for gene expression [e.g., Medvedovic, Yeung and Bumgarner (2004)]. Beyond empirical studies, it is difficult to determine properties of such approaches, and their reliance on Monte Carlo computation is somewhat limiting.

Here a model-based clustering method is developed that aims to support multi-group gene-expression analysis and possibly other applications. The method, called gamma ranking, places genes in a cluster if their expression patterns commonly approximate one element from a finite catalog of possible structures, in contrast to anonymous methods (Section 2). Under certain conditions, the component distributions are linearly independent functions—each one associated with a structure in the finite catalog—and this confers favorable computational characteristics to the gamma-ranking procedure (Sections 4, 5). The cataloged structures record patterns of equality and inequality among latent expected values. Where normal-theory specifications seem to be intractable, a gamma-based mixture model produces closed formulas for all necessary component densities, thanks to an embedding of the relevant gamma-distributed variables in a set of Poisson processes (Section 3). The formulation also extends to Poisson-distributed responses that are characteristic of gene expression measured by next-generation sequencing (Section 6).

2. Mixture of structured components

The data considered has a relatively simple layout. Each gene g from a possibly large number is associated with a vector xg = (xg,1, xg,2, …, xg,n) holding measurements of gene expression from n distinct biological samples. The n samples are distributed among 1 < p ≤ n different groups, which represent possibly different transcriptional states of the cells under study. The groups may represent cells exposed to p different chemical treatments, cells at p different developmental stages, or cells at p different points along a time course, for example. The layout of samples {1, 2, …, n} is recorded in a vector, say l = (l1, l2, …, ln), with li = j indicating that sample i comes from group j. This is fixed by design and known to the analyst; to simplify the development we suppress l from the notation below except where clarification is warranted.

Each expression measurement xg,i is treated as a positive, continuous variable representing a fluorescence intensity from a microarray, after preprocessing has adjusted for various systematic effects not related to the groupings of interest. Recent technological advances allow expression to be measured instead as an explicit abundance count. The mixture model developed below adapts readily to this case (Section 6).

Gamma ranking entails clustering genes according to the fit of a specific model of gene-level data. The joint probability density for a data vector xg, denoted p(xg ), is treated as a finite mixture over a catalog of discrete structures η, each of which determines ordering constraints among latent expected values. More specifically,

$$p(x_g) = \sum_{\eta} p(x_g \mid \eta)\, \pi_\eta, \qquad (1)$$

where πη is a mixing proportion and the component density p(xg |η) is determined through modeling.

Each η is a partition of group labels {1, 2, …, p}, containing Kη subsets, that also carries an ordering of these subsets. For example, three structures cover the two-group comparison, denoted {(1)(2), (12), (2)(1)}. The notation conveys both the partition of group means and the ordering of subsets within the partition. For instance, in η = (2)(1) the expected expression level in group 2 is less than that of group 1; while η = (12) indicates that both groups share a common latent mean. With p = 3 groups, there are 13 structures

(123), (12)(3), (3)(12), (13)(2), (2)(13), (1)(23), (23)(1), (1)(2)(3), (2)(1)(3), (1)(3)(2), (2)(3)(1), (3)(1)(2), (3)(2)(1),

and the number grows rapidly with the number of groups (Table 1). A way to think about Hord = {η}, the catalog of these ordered structures on p groups, is to imagine p real values y = (y1, y2, …, yp) and the possible vectors you would get by ranking y. Of course there are p! rankings if ties are not permitted, but generally there are far more rankings, and Hord is in 1–1 correspondence with the set of rankings of p numbers, allowing ties. (A small enumeration sketch follows Table 1.)

Table 1.

The number of ordered structures, Bell+, as a function of the number of groups, p. This is Σ_{k=1}^{p} k! S(p, k), where S(p, k) are Stirling numbers of the second kind. The Bell number of partitions of 1, …, p is included for comparison

p Bell+ Bell
2 3 2
3 13 5
4 75 15
5 541 52
6 4683 203
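To make the catalog concrete, the following sketch enumerates Hord by choosing, recursively, the block of groups receiving the lowest remaining mean. This is an illustrative helper written for this exposition (the name ordered_structures is ours, not from any released package); it reproduces the Bell+ counts of Table 1.

```python
from itertools import combinations

def ordered_structures(groups):
    """Yield every ordered set partition of `groups` (the catalog H_ord),
    with blocks listed from smallest to largest latent mean."""
    groups = tuple(groups)
    if not groups:
        yield ()
        return
    # Choose the block of groups sharing the lowest mean, then recurse.
    for r in range(1, len(groups) + 1):
        for first_block in combinations(groups, r):
            rest = tuple(g for g in groups if g not in first_block)
            for tail in ordered_structures(rest):
                yield (first_block,) + tail

# Reproduces the Bell+ column of Table 1: 3, 13, 75, 541, 4683.
for p in range(2, 7):
    print(p, sum(1 for _ in ordered_structures(range(1, p + 1))))
```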

An ordered structure η also dictates an association between sample labels i ∈ {1, 2, …, n} and levels of the latent expected values. The null structure η = (12 ··· p), for example, entails equal mean expression across all p groups; all observations are associated with a single mean value (and we write Kη = 1). More generally, there are Kη > 1 distinct mean values, μ1 < μ2 < ··· < μKη, say. Without loss of generality, we index the means by rank order. The association maps each i ∈ {1, 2, …, n} to some μk; it amounts to a partition of the samples together with an ordering of the subsets within the partition matching the order of the latent means. We express this association with disjoint subsets σ(η, k), k = 1, 2, …, Kη, and have k follow the order of the expected values. For example, suppose that samples {1, 2, …, 6} constitute two replicate samples in each of p = 3 groups, and η = (13)(2) is considered to relate the group-specific expected values (i.e., the gene is upregulated in group 2, and not differentially expressed between groups 1 and 3). Then Kη = 2, σ(η, 1) = {1, 2, 5, 6} and σ(η, 2) = {3, 4}. Subset σ(η, k) includes nk samples and induces gene-level statistics such as

$$s_{g,k} = \sum_{i \in \sigma(\eta,k)} x_{g,i} \qquad\text{and}\qquad t_{g,k} = \prod_{i \in \sigma(\eta,k)} x_{g,i}.$$

The structure/partition notation is convenient in multi-group mixture modeling. For clarification, let us refer back to the layout notation and take the replicates rj = {i: li = j}, which equal those samples in group j. Consider a gene that is completely differentially expressed relative to the p groups; that is, it assumes one of the p! structures η in which Kη = p. It follows that each set rj equals exactly one of the subsets σ(η, k). [It would be σ(η, 1) if rj had the lowest mean expression level, e.g.] In the absence of complete differential expression, multiple groups share expected values. Generally, therefore, each subset σ (η, k) is a union of various replicate sets rj. The language also conveys the assumption that replicates i1 and i2 in the same set rj necessarily share expected value, regardless of the structure η. In calculating probabilities, the sets σ(η, k) of equi-mean samples are more important than the replicate sets rj.
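As a small illustration of this bookkeeping, the sketch below maps a layout l and an ordered structure η (written as a tuple of blocks of group labels, lowest mean first) to the sample blocks σ(η, k) and the statistics sg,k and tg,k. The helper name structure_stats is hypothetical, introduced only for this example.

```python
import numpy as np

def structure_stats(x, layout, eta):
    """Blocks sigma(eta, k) plus the sums s_{g,k} and products t_{g,k}
    for one gene's data x (length n) under ordered structure eta."""
    x = np.asarray(x, dtype=float)
    blocks, s, t = [], [], []
    for group_block in eta:                  # k-th block, lowest mean first
        idx = [i for i, g in enumerate(layout) if g in group_block]
        blocks.append(idx)                   # sigma(eta, k), 0-based indices
        s.append(x[idx].sum())               # s_{g,k}
        t.append(x[idx].prod())              # t_{g,k}
    return blocks, np.array(s), np.array(t)

# The example above: two replicates in each of p = 3 groups, eta = (13)(2).
layout = [1, 1, 2, 2, 3, 3]
blocks, s, t = structure_stats([1.0, 1.2, 3.0, 2.8, 0.9, 1.1], layout,
                               ((1, 3), (2,)))
print([[i + 1 for i in b] for b in blocks])  # [[1, 2, 5, 6], [3, 4]]
```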

From the mixture model (1), posterior structure probabilities are p(η|xg) = p(xg|η)πη/p(xg) and these determine gene clusters by Bayes’s rule assignment. Alternatively, the cluster contents can be regulated by a threshold parameter c, and

$$\operatorname{cluster}(\eta) = \{\, g : p(\eta \mid x_g) \ge c \,\}, \qquad (2)$$

though some genes may go unassigned in this formulation. In any case, each cluster holds genes with empirical characteristics matching some discrete mean-ordering structure.
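In code, the posterior computation and both assignment rules take only a few lines. The sketch below assumes a G × H matrix of log component densities log p(xg|η) has already been computed; the function and argument names are ours, for illustration only.

```python
import numpy as np

def assign_clusters(log_comp, log_pi, c=None):
    """Posterior structure probabilities p(eta | x_g) and cluster labels.
    log_comp: (G, H) array of log p(x_g | eta); log_pi: length-H array of
    log mixing proportions.  With a threshold c, genes whose maximal
    posterior falls below c are left unassigned (label -1), as in (2)."""
    a = log_comp + log_pi                      # log numerator of p(eta | x_g)
    a -= a.max(axis=1, keepdims=True)          # stabilize before exponentiating
    post = np.exp(a)
    post /= post.sum(axis=1, keepdims=True)    # p(eta | x_g)
    labels = post.argmax(axis=1)               # Bayes's rule assignment
    if c is not None:
        labels = np.where(post.max(axis=1) >= c, labels, -1)
    return post, labels
```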

The latent expected values are constrained by η to the order μ1 < μ2 < ··· < μKη. Propelling our calculations is the ability to integrate out these ordered means (i.e., marginalize them) in a model involving gamma distributions on some transformation of the μk’s. Recall that a gamma distribution with shape a > 0 and rate λ > 0, denoted Gamma(a, λ), has probability density

$$p(z) = \frac{\lambda^{a} z^{a-1} \exp\{-z\lambda\}}{\Gamma(a)}, \qquad z > 0.$$

We assume that inverse means ψk = 1/μk have joint density

$$p_{\eta}(\psi_1, \ldots, \psi_{K_\eta}) = K_\eta! \left[\prod_{k=1}^{K_\eta} \frac{(\alpha_0\nu_0)^{\alpha_0}\, \psi_k^{\alpha_0 - 1} \exp\{-\alpha_0\nu_0\psi_k\}}{\Gamma(\alpha_0)}\right] \times 1[\psi_1 > \psi_2 > \cdots > \psi_{K_\eta}], \qquad (3)$$

which reflects independent and identically distributed Gamma(α0, α0ν0) components, conditioned to one ordering. This parameterization gives ν0 an interpretation as a centering parameter; on the null structure having a single latent mean μ1, 1/ν0 = E(1/μ1).

To complete the hierarchical specification, we assume a gamma observation model

$$p(x_g \mid \psi_1, \ldots, \psi_{K_\eta}, \eta) = \prod_{k=1}^{K_\eta} \prod_{i \in \sigma(\eta,k)} \frac{(\alpha\psi_k)^{\alpha}\, x_{g,i}^{\alpha-1} \exp\{-x_{g,i}\psi_k\alpha\}}{\Gamma(\alpha)} = \prod_{k=1}^{K_\eta} \frac{(\alpha\psi_k)^{\alpha n_k}\, t_{g,k}^{\alpha-1} \exp\{-s_{g,k}\psi_k\alpha\}}{[\Gamma(\alpha)]^{n_k}}. \qquad (4)$$

Equivalently, for sample i ∈ σ(η, k), measurement xg,i is distributed as Gamma(α, αψk), all conditionally on the latent values and η, and independently across samples. The gamma observation component is often supported empirically; there is theoretical support from stochastic models of population abundance [Dennis and Patil (1984), Rempala and Pawlikowska (2008)]; and there is the practical consideration that a gamma-based model may be the only one for continuous data in which the ordering calculations are feasible.

The structured component p(xg |η) in (1) arises by integrating (4) against the continuous mixing distribution (3). Specifically,

$$p(x_g \mid \eta) = \int p(x_g \mid \psi_1, \ldots, \psi_{K_\eta}, \eta)\, p_{\eta}(\psi_1, \ldots, \psi_{K_\eta})\, d\psi_1 \cdots d\psi_{K_\eta}.$$

Moving allowable factors out of the integral gives

$$p(x_g \mid \eta) = \frac{K_\eta!\, (\alpha_0\nu_0)^{K_\eta\alpha_0}\, \alpha^{\alpha n}}{\Gamma(\alpha_0)^{K_\eta}\, \Gamma(\alpha)^{n}} \left(\prod_{k=1}^{K_\eta} J_k\, t_{g,k}^{\alpha-1}\right) \times \int_{E} \prod_{k=1}^{K_\eta} \frac{\psi_k^{\alpha_0 + \alpha n_k - 1} \exp\{-\psi_k(\alpha_0\nu_0 + \alpha s_{g,k})\}}{J_k}\, d\psi_1 \cdots d\psi_{K_\eta},$$

where the integral is over the set E of decreasing ψk’s, and where Jk represents any cluster-specific quantity which does not depend on ψk. Choosing

$$J_k = \frac{\Gamma(\alpha_0 + \alpha n_k)}{(\alpha_0\nu_0 + \alpha s_{g,k})^{\alpha_0 + \alpha n_k}}$$

provides just the right normalization, because then the integrand becomes the joint density of independent gamma-distributed variables, with the kth variable having shape ak = α0 + αnk and rate λk = α0ν0 + αsg,k. The integral itself, denoted Pord(η), is the probability that independent gamma-distributed variables assume a certain order. The preceding factor can be arranged as products of the product statistics tg,k multiplied by factors involving the sum statistics sg,k. After a bit of simplification, the following result is established.

Theorem 1

In the model defined above, the component density p(xg |η) equals

$$c_\eta \left(\prod_{i=1}^{n} x_{g,i}^{\alpha-1}\right) \underbrace{\prod_{k=1}^{K_\eta} \left(s_{g,k} + \frac{\alpha_0\nu_0}{\alpha}\right)^{-a_k}}_{\operatorname{center}(\eta)}\; \underbrace{P(Z_1 > Z_2 > \cdots > Z_{K_\eta})}_{P_{\mathrm{ord}}(\eta)}, \qquad (5)$$

where the Zk’s are mutually independent gamma-distributed random variables with shapes ak = α0 + αnk and rates λk = α0ν0 + αsg,k, and where the normalizing constant is

$$c_\eta = \frac{K_\eta!}{[\Gamma(\alpha)]^{n}\, [\Gamma(\alpha_0)]^{K_\eta}} \left(\frac{\alpha_0\nu_0}{\alpha}\right)^{\alpha_0 K_\eta} \prod_{k=1}^{K_\eta} \Gamma(a_k).$$

In (5), Pord(η) = 1 for the null case involving Kη = 1.

The null structure η = (12 ··· p) entails equal mean expression for all samples; there is a single partition element, and Kη = 1. In this case, the distribution in (5) is exchangeable and equals a multivariate compound gamma [Hutchinson (1981)]. The positive parameters α and α0 regulate within-group and among-group variation, and ν0 is a scale parameter. Inspection also confirms that if the random vector X = (X1, …, Xn) has density p(x|η) in (5), and if b > 0, then Y = (bX1, …, bXn) has a density of the same type, with shape parameters α0 and α unchanged, but with scale parameter bν0.

Special cases of the density (5) have been reported: Newton et al. (2004) presented the case p = 2; Jensen et al. (2009) presented the case p = 4. See also Yuan and Kendziorski (2006a). Evidently an algorithm to compute Pord(η) is required in order to evaluate the component mixing densities. Beyond the p = 2 case, previous reports have evaluated these gamma-rank probabilities by Monte Carlo.
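For concreteness, the following sketch evaluates log p(xg|η) in (5) for one gene, given the blocks σ(η, k) and a routine rank_prob(shapes, rates) for the gamma-rank probability (a dynamic-programming sketch of such a routine appears in Section 3). It assumes integer α and α0, as required for computing Pord(η); the function names are ours, not from any released implementation.

```python
import numpy as np
from scipy.special import gammaln

def log_component_density(x, blocks, alpha, alpha0, nu0, rank_prob):
    """log p(x | eta) from (5).  `blocks` lists the sample index sets
    sigma(eta, k), lowest mean first; `rank_prob(shapes, rates)` returns
    P(Z_1 > ... > Z_K) for independent gammas with integer shapes."""
    x = np.asarray(x, dtype=float)
    n = x.size
    K = len(blocks)
    nk = np.array([len(b) for b in blocks])      # block sizes n_k
    s = np.array([x[b].sum() for b in blocks])   # sums s_{g,k}
    a = alpha0 + alpha * nk                      # shapes a_k
    lam = alpha0 * nu0 + alpha * s               # rates lambda_k
    # log c_eta, as displayed after (5); gammaln(K + 1) = log K!
    log_c = (gammaln(K + 1) - n * gammaln(alpha) - K * gammaln(alpha0)
             + K * alpha0 * np.log(alpha0 * nu0 / alpha) + gammaln(a).sum())
    log_center = -(a * np.log(s + alpha0 * nu0 / alpha)).sum()
    log_pord = 0.0 if K == 1 else np.log(rank_prob(a, lam))
    return log_c + (alpha - 1) * np.log(x).sum() + log_center + log_pord
```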

Figure 1 displays contours of the three structured components when n = 2 and p = 2. Clearly the components distribute mass quite differently from one another, and in a way that reflects constraints encoded by η. The densities from different structures η have the same support; the constraints restrict latent expected values rather than observables. In this way, the approach shares something with generalized linear modeling wherein responses are modeled by generic exponential family densities and covariate information constrains the expected values [McCullagh and Nelder (1989)].

Fig. 1.

Three structured components in ℝ². Here α = 10, α0 = 3 and ν0 = 25. Contours cover 50%, 80%, 95% and 99% probability. For convenience, each density is shown for log2-transformed pairs.

3. Gamma-rank probabilities

A statistical computing problem must be solved in order to implement gamma ranking. Specifically, it is required to calculate the probability P (E) of the event

$$E = \{Z_1 > Z_2 > \cdots > Z_K\}, \qquad (6)$$

where {Zk: k = 1, 2, …, K} are mutually independent gamma-distributed random variables with possibly different shapes a1, a2, …, aK and rates λ1, λ2, …, λK. [Each Pord(η) in Section 2 is an instance of P(E).] In the special case K = 2, the event in two gamma-distributed variables is equivalent to the event E′ = {B > λ1/(λ1 + λ2)}, where B is a Beta(a1, a2) distributed variable. Thus, P(E) = P(E′) can be computed by standard numerical approaches for the Beta distribution. Although a similar representation is possible for Dirichlet-distributed vectors when K > 2, a direct numerical approach is not clearly indicated. In modeling permutation data, Stern (1990) presented a formula for P(E) for any value K, but assuming common shape parameters ak = a. Sobel and Frankowski (1994) calculated P(E) for K < 5 assuming constant rates λk = λ, but to our knowledge a general formula has not been developed. A Monte Carlo approximation is certainly feasible, but a fast and accurate numerical approach would be preferable for computational efficiency: target values may be small, and P(E) may need to be recomputed for many shape and rate settings.

There is an efficient numerical approach to computing P(E) when the shapes ak are positive integers. The approach involves embedding {Zk} in a collection of independent Poisson processes {ℕk}, k = 1, 2, …, K. Specifically, let ℕk denote a Poisson process on (0, ∞) with rate λk, so that ℕk(0, t] ~ Poisson(λkt), for example. Of course, gaps between points in ℕk are independent and exponentially distributed, and the gamma-distributed Zk can be constructed by summing the first ak gaps:

$$Z_k = \min\{t > 0 : \mathbb{N}_k(0, t] \ge a_k\}.$$

Next, form processes {𝕄k} by accumulating points in the originating processes: 𝕄k = ℕ1 + ℕ2 + ··· + ℕk. Marginally, 𝕄k is a Poisson process with rate Λk = λ1 + λ2 + ··· + λk, but over k the processes are dependent owing to overlapping points. To complete the construction, define count random variables M1, M2, …, MK−1 by

$$M_k = \mathbb{M}_k(0, Z_{k+1}]. \qquad (7)$$

It is immediate that each Mk has a marginal negative binomial distribution: the gamma-distributed Zk+1 is independent of 𝕄k; conditioning on Zk+1 in (7) gives a Poisson variable which mixes out to the negative binomial [Greenwood and Yule (1920)]. Specifically,

$$M_k \sim \operatorname{NB}\big(\text{shape} = a_{k+1},\ \text{scale} = \Lambda_k/\lambda_{k+1}\big),$$

which corresponds to the probability mass function

$$p_k(m) = \frac{\Gamma(m + a_{k+1})}{\Gamma(a_{k+1})\, \Gamma(m+1)} \left(\frac{\lambda_{k+1}}{\Lambda_{k+1}}\right)^{a_{k+1}} \left(\frac{\Lambda_k}{\Lambda_{k+1}}\right)^{m} \qquad (8)$$

for integers m ≥ 0. The next main finding is the following.

Theorem 2

With E as in (6), Mk as in (7) and pk as in (8), P (E) equals

$$\sum_{m_1=0}^{a_1-1}\; \sum_{m_2=0}^{m_1+a_2-1} \cdots \sum_{m_{K-1}=0}^{m_{K-2}+a_{K-1}-1} p_1(m_1)\, p_2(m_2) \cdots p_{K-1}(m_{K-1}). \qquad (9)$$

It does not seem to be obvious that E in (6) is equivalent to an event in the {Mk}. We also find it striking that the Mk variables are independent considering that they are constructed from highly dependent 𝕄k processes. Proof of (9) and the related distribution theory are presented in Appendix A.

A redistribution of products and sums allows a numerically efficient evaluation of (9), as in the sum-product algorithm [e.g., Kschischang, Frey and Loeliger (2001)]. For instance, with K = 4,

$$P(E) = \sum_{m_1=0}^{a_1-1} p_1(m_1) \left\{\sum_{m_2=0}^{m_1+a_2-1} p_2(m_2) \left[\sum_{m_3=0}^{m_2+a_3-1} p_3(m_3)\right]\right\}. \qquad (10)$$

Here, one would evaluate P(E) by first constructing, for each m2 ∈ {0, 1, …, a1 + a2 − 2}, an inner sum P(M3 ≤ m2 + a3 − 1). This vector in m2 values is used to process the second inner sum, for each value m1 ∈ {0, 1, …, a1 − 1}. Indeed the computation is completely analogous to the Baum–Welch backward recursion [e.g., Rabiner (1989)], although, interestingly, there seems to be no hidden Markov chain in the system. A version of the Viterbi algorithm identifies the maximal summand and thus provides an approach to computing log P(E) in case P(E) is very small.
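A minimal sketch of this backward recursion, assuming positive-integer shapes, is given below; the name gamma_rank_prob and all variable names are ours, not from any released implementation. The closing lines check the K = 2 case against the Beta representation noted at the start of this section.

```python
import numpy as np
from scipy.stats import beta, nbinom

def gamma_rank_prob(shapes, rates):
    """P(Z_1 > Z_2 > ... > Z_K) for independent Z_k ~ Gamma(a_k, rate lam_k)
    with positive-integer shapes, via the sum (9) evaluated backward
    as in (10)."""
    a = np.asarray(shapes, dtype=int)
    lam = np.asarray(rates, dtype=float)
    K = a.size
    if K == 1:
        return 1.0
    Lam = np.cumsum(lam)                 # Lambda_k = lam_1 + ... + lam_k
    M = int(a.sum())                     # indices m_k never exceed sum(a) - 1
    h = np.ones(M + 1)                   # h_K(.) = 1
    for k in range(K - 1, 0, -1):        # 1-based k = K-1, ..., 1
        # p_k(m) in (8): negative binomial with shape a_{k+1} and
        # success probability lam_{k+1} / Lambda_{k+1}
        p = nbinom.pmf(np.arange(M + 1), a[k], lam[k] / Lam[k])
        cum = np.concatenate(([0.0], np.cumsum(p * h)))  # prefix sums
        # h_k(m_{k-1}) = sum_{m=0}^{m_{k-1}+a_k-1} p_k(m) h_{k+1}(m)
        upper = np.minimum(np.arange(M + 1) + a[k - 1], M + 1)
        h = cum[upper]
    return h[0]                          # P(E) = h_1(m_0 = 0)

# Check against the Beta representation for K = 2:
a1, a2, l1, l2 = 3, 5, 1.2, 0.7
print(gamma_rank_prob([a1, a2], [l1, l2]))
print(beta.sf(l1 / (l1 + l2), a1, a2))   # P(B > l1/(l1+l2)), B ~ Beta(a1, a2)
```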

4. Linear independence

The component densities (5) seem to have the useful property of being linearly independent functions on ℝ^n. Linear independence of the component density functions is equivalent to identifiability of the mixture model [Yakowitz and Spragins (1968)]. It is necessary for strict concavity of the log likelihood, but it is not routinely established. Establishing identifiability is also a key step in determining sampling properties of the maximum likelihood estimator.

Let a = (aη) denote a vector of real numbers indexed by structures η. Recall that the finite catalog of functions {p(x|η)} is linearly independent if

$$T_a(x) = \sum_{\eta} a_\eta\, p(x \mid \eta) = 0 \ \text{ for all } x \quad\text{implies}\quad a_\eta = 0 \ \text{ for all } \eta.$$

It is plausible that this property holds generally, but we have been able to establish a proof only in a special case.

Theorem 3

In a balanced experiment where m replicate samples are measured in each of p = 2 or p = 3 groups, the component densities p(xg|η) in (5) are linearly independent functions on ℝ^{mp}.

A proof proceeds by finding a multivariate polynomial φ(x) > 0 such that φ(x)Ta(x) is itself a multivariate polynomial. A close study of the degrees and coefficients of this polynomial leads us to the result (Appendix B). That such a φ(x) exists follows from (5): the center is a rational function, and the factor Pord(η) is also rational, being a linear combination of rational functions, as established in (9).

5. Data analysis considerations

5.1. Estimation

To deploy model (1)–(5) requires the estimation of parameters α, α0 and ν0, which are shared by the different components, as well as mixing proportions π = {πη}, which link the components together. Consider first the log likelihood for π alone (treating the shared parameters as known) under independent and identically distributed sampling from (1):

$$l(\pi) = \sum_{g=1}^{G} \log\left\{\sum_{\eta} \pi_\eta\, p(x_g \mid \eta)\right\}, \qquad (11)$$

where G is the number of genes providing data. Maximum likelihood estimation of π is buttressed by the following finding.

Theorem 4

Suppose that the component densities are linearly independent functions in the mixture of structured components model. If G is sufficiently large, then the log likelihood l(π) in (11) is strictly concave on a convex domain, and thus admits a unique maximizer π̂ = {π̂η}. This property holds almost surely over data sets.

The expectation–maximization (EM) algorithm naturally applies to approximate π̂. By strict concavity of l(π), it is not necessary to rerun EM from multiple starting points. The final estimate and resultant clustering should be insensitive to starting position, as has been found in numerical experiments. This is a convenient but unusual property in the domain of mixture-based clustering [McLachlan and Peel (2000), page 44].

In a small simulation experiment, we confirmed that our implementation of the EM algorithm was able to recover mixture proportions given sufficiently many draws from the marginal distribution (1) (data not shown).

Full maximum likelihood for both the mixing proportions and shared parameters is feasible via the EM algorithm, but this increases computational costs. In the prototype implementation used here, we fixed the shared parameters at estimates obtained from a simpler mixture model, and then ran the EM algorithm to estimate the mixing proportions. Specifically, we used the gamma–gamma method in EBarrays (www.bioconductor.org), which corresponds to mixing as in (1) but over the smaller set of unordered structures. Experiments indicated that this approximation had a small effect on the identified clusters (see Appendix D).
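A minimal EM sketch for this final step is below. It holds the shared parameters fixed, so the G × H matrix of log component densities can be precomputed once; by Theorem 4 the starting value of π is immaterial. This is an illustration under those assumptions, not the authors' released code, and the names are ours.

```python
import numpy as np

def em_mixing_proportions(log_comp, n_iter=100):
    """EM for the mixing proportions pi in (1), with log_comp the fixed
    (G, H) matrix of log p(x_g | eta).  E-step: posterior p(eta | x_g);
    M-step: pi_eta = average posterior weight across genes."""
    G, H = log_comp.shape
    pi = np.full(H, 1.0 / H)                  # any interior start works
    for _ in range(n_iter):
        a = log_comp + np.log(pi)
        a -= a.max(axis=1, keepdims=True)     # stabilize
        post = np.exp(a)
        post /= post.sum(axis=1, keepdims=True)
        pi = post.mean(axis=0)
    return pi
```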

Inference derived through the proposed parametric model is reliant to some degree on the validity of the governing assumptions. Quantile–quantile plots and plots relating sample coefficient-of-variation to sample mean provide useful diagnostics for the gamma observation-component of the model. The within-component model is restrictive in the sense that three parameters are shared among all the components (i.e., structures). This can be checked by making comparisons of inferred clusters, but only large clusters would deliver any power. Clusters reveal patterns in mean expression, while the shared parameters have more to do with variability; if other domains of statistics provide a guide, one expects that misspecifying the variance may reduce some measure of efficiency without disabling the entire procedure. The ultimate issue is whether or not the clustering method usefully represents any underlying biology. This is difficult to assess, though we examine the issue in a limited way in the examples considered next.

5.2. Example

Edwards et al. (2003) studied the transcriptional response of mouse heart tissue to oxidative stress. Three biological replicate samples were measured using Affymetrix oligonucleotide arrays at each of five time points (baseline and one, three, five and seven hours after a stress treatment) for several ages of mice. Considering the older mice for illustration, we have p = 5 distinct groups, n = 15 samples and 10,043 genes (i.e., probe sets, after pre-processing). Gene-specific moderated F-testing [Smyth (2004)] produced a list of G = 786 genes that exhibited a significant temporal response to stress at the 10% false discovery rate [by q-value; Storey and Tibshirani (2003)]. Gamma ranking involved fitting the mixture of structured components, which with p = 5 mixes over 540 distinct components. (Since we worked with significantly altered genes, we did not include the null component in which all means are equal; other aspects of model fitting and diagnosis are provided in Appendix D.) From the catalog of 540 possibilities, genes populated 23 clusters by gamma ranking, though only four clusters contained 10 or more of the G = 786 stress-responding genes (Figure 2). Most expression changes occurred between baseline and the first time point, but 30 genes (red cluster) showed significant upregulation at all but one time point, for example.

Fig. 2.

Dominant patterns of differential expression in time course data from Edwards et al. (2003). Each panel summarizes data from one cluster identified by gamma ranking (the nine largest clusters are shown). A digital code signifies the inferred ordering of the latent expected values (i.e., η, in an alternative notation). Each gene is a single line trace; triplicate measurements were reduced by averaging and then standardized for display; raw data went into the model fitting. Results are based on 100 cycles of EM to estimate mixing proportions followed by Bayes’ rule assignment.

Gamma ranking gave different results than K-means or mclust, which, respectively, found 20 and 2 clusters in Edwards’ data. Here K was chosen according to guidelines in Hastie, Tibshirani and Friedman (2001), and mclust used the Bayesian information criterion over the range from 1 to 50 clusters. Otherwise, both methods used default settings in the respective R functions (www.r-project.org). The adjusted Rand index [Hubert and Arabie (1985)], which measures dissimilarity of partitions, was 0.09 comparing gamma ranking and K-means, 0.16 for gamma ranking and mclust, while for K-means and mclust it was smaller, at 0.02.

The biological significance of clusters identified by any algorithm may be worth investigating. For example, the cluster of 30 increasing expressors includes 2 genes (Mgst1 & Gsta4) from among only 17 in the whole genome that are involved in glutathione transferase activity. Understanding the increased activity of this molecular function will give a more complete picture of the biology [e.g., Girardot, Monnier and Tricoire (2004)]. In isolation, it is difficult to see how such investigation is supportive of a given clustering approach. The benefits become more apparent when we look at many data sets and many functional categories.

5.3. Empirical study

Gamma ranking was applied to a series of 11 data sets obtained from the Gene Expression Omnibus (GEO) repository [Edgar, Domrachev and Lash (2002)]. These were all the data sets satisfying a specific and relevant query (Table 2). They represent experiments on different organisms and they exhibit a range of variation characteristics. In each case, we applied the moderated F-test and selected genes with q-value no larger than 5%. Gamma ranking and, for comparison, mclust and K-means, were applied in order to cluster genes separately for each data set. Basic facts about the identified clusters are reported in Table 2. Figure 3 shows that gamma ranking tends to produce smaller clusters than mclust and K-means, although it also has a wider size distribution; and there was a relatively low level of overlap among the three approaches.

Table 2.

Summary of 11 data sets from the Gene Expression Omnibus (GEO). GDS is the GEO data set accession number. These sets satisfied the search query from August 2008 having subset variable type time or development stage or age and having a single factor with three to eight levels. p indicates the number of groups and n is the number of samples. G indicates the number of genes deemed significantly altered by one-way moderated F-test and 0.05 FDR (limma). The remaining columns show how many clusters are found by gamma ranking with 100 EM iterations (GR), mclust (MC) and K-means (KM)

GDS Citation Organism p n G GR MC KM
2323 Coser et al. Homo sapiens 3 9 1409 11 5 13
1802 Tabuchi et al. Mus musculus 4 8 3433 49 7 10
2043 Tabuchi et al. Mus musculus 4 8 3001 51 8 18
2360 Ron et al. Mus musculus 4 9 8714 50 8 30
599 Vemula et al. Rattus norvegicus 5 10 673 42 2 40
812 Zeng et al. Mus musculus 5 17 10,982 135 7 15
1937 Pilot et al. Drosophila 5 15 7733 88 8 10
568 Welch et al. Mus musculus 6 18 3737 134 4 25
2431 Keller et al. Homo sapiens 6 18 8505 137 9 12
587 Tomczak et al. Mus musculus 7 21 860 50 2 20
586 Tomczak et al. Mus musculus 8 24 5211 118 5 20

Fig. 3.

Characteristics of clusters from an empirical study of 11 data sets.

The empirical study shows not only that gamma ranking produces substantially different clusters than popular approaches, but also that the identified clusters are significant in terms of their biological properties. Investigators often measure the biological properties of a gene cluster by identifying functional properties that seem to be over-represented in the cluster. Gene set enrichment analysis is most frequently performed by applying Fisher’s exact test to each of a long list of functional categories, testing the null hypothesis that the functional category is independent of the gene cluster [e.g., Newton et al. (2007)]. Functional categories from the Gene Ontology (GO) Consortium and the Kyoto encyclopedia (KEGG) were used to assess the biological properties of all the clusters identified in the above calculation. Specifically, we computed for each cluster a vector of p-values across GO and KEGG. Figure 4 shows the proportion of these p-values smaller than 0.05, stratified by cluster size and in comparison to results on random sets of the same size. Evidently, the clusters identified by gamma ranking contain substantial biological information.
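For example, a single enrichment p-value of the kind summarized in Figure 4 can be computed as below. The counts are hypothetical, loosely patterned on the glutathione-transferase example of Section 5.2 (a cluster of 30 genes, a category of 17 genes, roughly 10,043 genes on the array); they are not the actual contingency tables behind Figure 4.

```python
from scipy.stats import fisher_exact

# 2x2 table: rows = in/out of cluster, columns = in/out of category.
in_cl_in_cat, in_cl_out_cat = 2, 28          # hypothetical cluster of 30 genes
out_cl_in_cat = 17 - in_cl_in_cat            # category of 17 genes overall
out_cl_out_cat = 10043 - 30 - out_cl_in_cat  # everything else on the array
odds, pval = fisher_exact([[in_cl_in_cat, in_cl_out_cat],
                           [out_cl_in_cat, out_cl_out_cat]],
                          alternative="greater")  # test over-representation
print(pval)
```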

Fig. 4.

Empirical study of the association between clusters and biological function. For every cluster identified by gamma ranking (red) or mclust (green) in the data sets in Table 2, plotted is the proportion of small enrichment p-values (vertical) versus the cluster size (horizontal). The enrichment p-values are Fisher-exact-test p-values and the proportion is computed over a database of GO and KEGG pathways (Table 7). Bands indicate similar proportions computed for random sets.

Figure 4 also shows that mclust clusters carry substantial biological information, and a similar result is true for K-means (not shown). Whatever cluster signal is present in the expression data, it is evident that gamma ranking finds different aspects of this signal than do the standard approaches, while still delivering clusters that relate in some way to the biology. Gamma-ranking clusters are not anonymous sets of genes with similar expression profiles; they are sets of genes linked to an ordering pattern in the underlying means. The commonly used clustering methods are unsupervised, while gamma ranking utilizes the known grouping labels in the sample. It seems beneficial to use this grouping information; undoubtedly various schemes could be developed. By their construction, the gamma-ranking clusters have a simple interpretation in terms of sets of genes supporting particular hypotheses about changes in mean expression.

6. Count data

Microarray technology naturally leads to continuous measurements of gene expression, as modeled in Section 2, but technological advances allow investigators essentially to count the number of copies of each molecule of interest in each sample [e.g., Mortazavi et al. (2008)]. Poisson distributions are central in the analysis of such data [e.g., Marioni et al. (2008)], and gamma ranking extends readily to this case.

Briefly, data at each gene (or tag) is a vector xg = (xg,1, …, xg,n) as before, but xg,i is now a count from the ith library (rather than an expression level on the ith microarray). There may be replicate libraries within a given cellular state, and comparisons of interest may be between different cellular states. Library sizes {Ni}, say, are additional but known design parameters. Important parameters are expected counts relative to some common library size. Adopting the notation from Section 2, a cluster of libraries σ(η, k) may share their size-adjusted expected values, and so for any i ∈ σ(η, k) the observed count xg,i arises from the Poisson distribution with mean Niμk. Further, the structure η on test puts an ordering constraint μ1 < μ2 < ··· < μKη on these latent expectations. The key is to integrate away these latent expected values using a conjugate gamma prior, conditionally on the ordering. Prior to conditioning, the μk’s are independent and identically distributed gamma variables with (integer) shape α0 and rate α0ν0. Then, analogously to Theorem 1, the predictive distribution for the vector of conditionally Poisson responses is

$$p(x_g \mid \eta) = c_\eta \left(\prod_{i=1}^{n} \frac{1}{x_{g,i}!}\right) \underbrace{\left(\prod_{k=1}^{K_\eta} u_{g,k}\, \Gamma(s_{g,k} + \alpha_0)\right)}_{\operatorname{center}(\eta)} P_{\mathrm{ord}}(\eta), \qquad (12)$$

where

$$P_{\mathrm{ord}}(\eta) = P(Z_1 < Z_2 < \cdots < Z_{K_\eta})$$

with the Zk’s mutually independent gamma-distributed random variables with (gene-specific) shapes ak = α0 + sg,k and rates λk = α0ν0 + nk. In (12), the normalizing constant is

$$c_\eta = \frac{K_\eta!\, (\alpha_0\nu_0)^{\alpha_0 K_\eta}}{[\Gamma(\alpha_0)]^{K_\eta} \prod_{k=1}^{K_\eta} (\alpha_0\nu_0 + n_k)^{\alpha_0}}$$

and, further, sg,k = Σ_{i∈σ(η,k)} xg,i, nk = Σ_{i∈σ(η,k)} Ni and

$$u_{g,k} = \prod_{i \in \sigma(\eta,k)} \left(\frac{N_i}{\alpha_0\nu_0 + n_k}\right)^{x_{g,i}}.$$

Notice that in Pord(η) the event refers to an increasing sequence of gammas, rather than a decreasing sequence as in Theorem 1. This arises because for Poisson responses the conjugate prior involves a gamma distribution on the means, whereas for gamma responses the conjugate is inverse gamma on the means. For the computations to work out, the key requirement is that some monotone transformation of each latent mean has a gamma distribution. In the null structure (all means equal), Pord(η) = 1 and (12) reduces to the negative-multinomial distribution. It will be important to study the practical utility of (12) and overdispersed extensions [cf. Robinson and Smyth (2007)], but such investigation is not within the scope of the present paper. The main reason to present the finding here is to show that gamma-rank probabilities (Section 3) arise in multiple probability models.
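Computationally, the increasing-order probability needs nothing new: relabeling the variables in reverse turns it into the decreasing-order probability of Theorem 2, so the Section 3 sketch applies directly (again assuming integer shapes, hence integer α0).

```python
def gamma_rank_prob_increasing(shapes, rates):
    """P(Z_1 < Z_2 < ... < Z_K), as required by (12), computed via the
    decreasing-order routine applied to the reversed sequences."""
    return gamma_rank_prob(list(shapes)[::-1], list(rates)[::-1])
```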

7. Concluding remarks

Calculations presented here consider a discrete mixture model and the resulting clustering for gene-expression or similar data types. The discrete mixing is over patterns of equality and inequality among latent expected values (ordered structures). Clustering by these patterns addresses the important biological problem of organizing gene expression relative to various cellular states, which is part of the larger task of determining biological function. In the examples the method was applied after a round of feature selection, although it could have been applied to each full data set (i.e., by including the null structure in the mix) and it could have been the basis of a more comprehensive analysis, going beyond clustering and towards hypothesis testing and error-rate-controlled gene lists. Our more conservative line is attributable in part to an incomplete understanding of the method’s robustness. Relaxing the fixed-coefficient-of-variation assumption, as in Lo and Gottardo (2007) or Rossell (2009), could be considered to address the problem. The focus on clustering, however, is motivated largely by its practical utility in the context of genomic data analysis.

By cataloging ordered structures, rather than the smaller set of unordered structures, the mixture model produces readily interpretable clusters in the multi-group setting. Jensen et al. (2009) argues similarly. For example, the largest cluster of temporally responsive genes in Edwards’ data consists of genes upregulated immediately after treatment that show no significant fluctuations thereafter. The development of calculations for ordered structures has been more challenging than for unordered structures, which were presented in Kendziorski et al. (2003) and implemented in the Bioconductor package EBarrays. Mixture calculations are simplified in the unordered case because component densities reduce by factorization to elementary products [i.e., the last factor in (5) is not present]. The requirement to compute gamma-rank probabilities had limited a fuller development.

Gamma ranking produces clusters indexed by patterns of expected expression rather than anonymous clusters defined by high similarity of their contents. A referee noted that large gamma-ranking clusters may tend to swallow up genes more easily than small clusters because the estimated posterior assignment probability is proportional to the estimate of the mixing weight πη: that is, structures with large πη have a head start in the allocation of genes. On one hand, this provides an efficiency which may be advantageous for genes that have a relatively weak signal (and which otherwise might be assigned to a more null-like structure). It also implies that small clusters are more reliable, in a way, since the assigned genes have made it in spite of the small πη. Another feature of gamma ranking is that clusters can be tuned by a threshold parameter c, as in (2), rather than being determined by Bayes’s rule assignment. Taking c close to 1 tends to purify the clusters; the more equivocal genes drop into an unassigned category. Empirically, such swallowing up may not be substantial; at least in comparison to the simpler clustering methods analyzed, gamma ranking produces more and smaller clusters.

There is nothing explicit in gamma ranking that attends to the temporal dependence which might seem to be involved in time-course data. Independent cell lines were grown in the Edwards’ experiment, one for each microarray, and so there is independent sampling in spite of a time component. Additionally, the model imposes dependencies in (5) driven by whichever structure η governs data at a given gene. If there were complicated temporal dependence, the identified clusters would still reflect genes that act in concert in this experiment; they might act in concert by a different η in another run of the experiment, and we would not be confident in the fitted proportions, even though the clusters may continue to be informative. Neither does the model have explicit dependence among genes; but it produces clusters of genes that seem to be highly associated (genes that realize the same structure η seem to present correlated data). This shows that a sufficiently rich hierarchical model, based on lots of conditional independence, can represent characteristics of dependent data. Of course, care is needed since the sampling distribution of parameter estimates is affected by the intrinsic dependencies within the data generating mechanism.

The mixture framework from Kendziorski et al. (2003) has supported a number of extensions to related problems in statistical genomics: Yuan and Kendziorski (2006b) (time-course data), Kendziorski et al. (2006) (mapping expression traits) and Keles (2007) (localizing transcription factors). The ability to monitor ordered structures may have some application in these problems. Further, the ability to compute gamma-rank probabilities may have application in distinct inference problems [e.g., Doksum and Ozeki (2009)]. Future work includes developing a better software implementation of gamma ranking, enabling the implementation to have additional flexibility (e.g., gene-specific shapes α), studying the method’s sampling properties and exploring extensions to emerging data sources.

Acknowledgments

We thank Lev Borisov for the proof of Lemma 1, and Christina Kendziorski, an Associate Editor and two referees for comments that greatly improved the development of this work.

APPENDIX A: PROOF OF THEOREM 2

Let gk(z) denote the density of a gamma distribution with shape ak and rate λk. By definition

$$P(E) = \int_0^{\infty} \int_{z_K}^{\infty} \cdots \int_{z_2}^{\infty} \left[\prod_{k=1}^{K} g_k(z_k)\right] dz_1 \cdots dz_{K-1}\, dz_K = \int_0^{\infty} g_K(z_K) \int_{z_K}^{\infty} g_{K-1}(z_{K-1}) \cdots \int_{z_2}^{\infty} g_1(z_1)\, dz_1 \cdots dz_{K-1}\, dz_K,$$

where in the second line we move factors in the integrand as far as possible to the left. With this in mind we construct functions fk(z), z ≥ 0, recursively as f0(z) = 1 and, for k = 1, 2, …, K,

$$f_k(z) = \int_z^{\infty} f_{k-1}(u)\, g_k(u)\, du, \qquad (13)$$

and we observe that P(E) = fK(0). Evaluating these functions further, we see

$$f_1(z) = \int_z^{\infty} g_1(z_1)\, dz_1 = P(Z_1 \ge z) = P(M_1 < a_1 \mid Z_2 = z) = \sum_{m_1=0}^{a_1-1} \operatorname{po}(m_1; \lambda_1 z).$$

Here M1 = 𝕄1(0, Z2] is Poisson(λ1z) distributed conditionally upon Z2 = z, and po(·) indicates the Poisson probability mass function with the indicated parameter. The equivalence in the second and third lines above stems from basic relationships between objects in the underlying Poisson processes. As long as M1 is small, it means that the ℕ1 process has not accumulated many points up to time Z2 = z, and hence the Z1 value must be relatively large. More basically,

$$P(U > u) = P(X < a), \qquad (14)$$

when U ~ Gamma(a, λ) and X ~ Poisson(λu), for integer shapes a.
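Identity (14) is the usual gamma–Poisson duality; a quick numerical check (parameter values arbitrary):

```python
from scipy.stats import gamma, poisson

a, lam, u = 4, 2.5, 1.3
print(gamma.sf(u, a, scale=1 / lam))   # P(U > u), U ~ Gamma(a, rate lam)
print(poisson.cdf(a - 1, lam * u))     # P(X < a),  X ~ Poisson(lam * u)
```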

Proceeding to f2(z),

$$f_2(z) = \int_z^{\infty} f_1(z_2)\, g_2(z_2)\, dz_2 = \sum_{m_1=0}^{a_1-1} \int_z^{\infty} \operatorname{po}(m_1; \lambda_1 z_2)\, g_2(z_2)\, dz_2 = \sum_{m_1=0}^{a_1-1} p_1(m_1) \int_z^{\infty} \frac{\operatorname{po}(m_1; \lambda_1 z_2)\, g_2(z_2)}{p_1(m_1)}\, dz_2.$$

Here p1(m1) is the probability mass function of a negative-binomial distribution, as in (8). Indeed, we have reorganized the summand above to highlight that the integrand on the far right is precisely the density function of a gamma-distributed variable with shape a2 + m1 and rate λ1 + λ2. This reflects the Poisson–gamma conjugacy of ordinary Bayesian analysis [e.g., Gelman et al. (2004), pages 52 and 53]. The integral evaluates to 1 if z = 0, and hence we have proved the case K = 2. But furthermore, the integral represents the chance that a gamma-distributed variable is large, and so by (14)

$$f_2(z) = \sum_{m_1=0}^{a_1-1} \sum_{m_2=0}^{m_1+a_2-1} p_1(m_1)\, \operatorname{po}(m_2; (\lambda_1+\lambda_2) z) = \sum_{m_1=0}^{a_1-1} \sum_{m_2=0}^{m_1+a_2-1} p_1(m_1)\, \operatorname{po}(m_2; \Lambda_2 z).$$

The base case of an induction proof has been established. Assume that for some k ≥ 3,

$$f_{k-1}(z) = \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-2} p_j(m_j)\right) \operatorname{po}(m_{k-1}; \Lambda_{k-1} z) \qquad (15)$$

and then evaluate (13) to obtain

$$\begin{aligned}
f_k(z) &= \int_z^{\infty} f_{k-1}(z_k)\, g_k(z_k)\, dz_k \\
&= \int_z^{\infty} \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-2} p_j(m_j)\right) \operatorname{po}(m_{k-1}; \Lambda_{k-1} z_k)\, g_k(z_k)\, dz_k \\
&= \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-2} p_j(m_j)\right) \int_z^{\infty} \operatorname{po}(m_{k-1}; \Lambda_{k-1} z_k)\, g_k(z_k)\, dz_k \\
&= \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-1} p_j(m_j)\right) \int_z^{\infty} \frac{\operatorname{po}(m_{k-1}; \Lambda_{k-1} z_k)\, g_k(z_k)}{p_{k-1}(m_{k-1})}\, dz_k \\
&= \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-1} p_j(m_j)\right) \sum_{m_k=0}^{m_{k-1}+a_k-1} \operatorname{po}(m_k; \Lambda_k z) \\
&= \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_k=0}^{m_{k-1}+a_k-1} \left(\prod_{j=1}^{k-1} p_j(m_j)\right) \operatorname{po}(m_k; \Lambda_k z),
\end{aligned}$$

which establishes that (15) is true for all k. Evaluating at k = K and z = 0 establishes the theorem.

Coda

Further insight is gained by realizing from the definition of the counts that

$$\mathbb{M}_k(0, Z_k] = \mathbb{M}_{k-1}(0, Z_k] + a_k = M_{k-1} + a_k.$$

But 𝕄k also has a jump at Zk, and so we see the equivalence

$$Z_k > Z_{k+1} \quad\Longleftrightarrow\quad M_k < M_{k-1} + a_k. \qquad (16)$$

The event E is an intersection of these pairwise events, and this is manifested in the ranges of summation in (9). In contrast to (9), these event considerations give P (E) equal to

$$\sum_{m_1=0}^{a_1-1} \sum_{m_2=0}^{m_1+a_2-1} \cdots \sum_{m_{K-1}=0}^{m_{K-2}+a_{K-1}-1} p_{\mathrm{joint}}(m_1, m_2, \ldots, m_{K-1}). \qquad (17)$$

The implication seems to be that M1, M2, …, MK−1 are mutually independent, though Theorem 2 does not confirm this because the factorization into negative binomials is required for all arguments, beyond what is shown. It is a conjecture that the {Mk} are mutually independent. A proof by brute-force evaluation in the special cases K = 3 and K = 4 is available (not shown), but we have not found a general proof. The fact is somewhat surprising because the {𝕄k} processes are highly positively dependent. The independence seems to emerge as a balance between this positive dependence and the negative association created by Zk being inversely related to 𝕄k(0, t].

APPENDIX B: LINEAR INDEPENDENCE AND PROOF OF THEOREM 3

Consider the three-dimensional case, and initially consider a single replicate in each of the three groups. Data on each gene form the vector (x, y, z), say, of three positive reals. Thirteen component densities p(x, y, z|η) constitute the mixture model (Table 3). For a vector a = (aη) of reals, the test function is Ta(x, y, z) = Σ_η aη p(x, y, z|η). It needs to be shown that if Ta(x, y, z) = 0 for all x, y, z > 0, then aη = 0 for all structures η. Specializing (5) to this case, and eliminating the positive factor (xyz)^{α−1}, we see that Ta(x, y, z) = 0 is equivalent to

$$\sum_{\eta} a_\eta c_\eta\, \operatorname{center}(\eta)\, P_{\mathrm{ord}}(\eta) = 0. \qquad (18)$$

Table 3.

Thirteen structured components p(x, y, z|η) = cη (xyz)^{α−1} center(η) Pord(η) in the three-dimensional, no-replicate case. The forms have been simplified, w.l.o.g., by taking the scale ν0 = 1 and writing β = α0 + α and ξ = α0/α. Normalizing constants cη are as in (5). The em and em,n stand for constants (not involving x, y, z), possibly differing among rows. [center(η)]^{−1} depends only on the unordered partition, and so is shared ("same") by the orderings of each partition; for the three-block structures, every double sum runs over m = 0, …, β − 1 and n = 0, …, m + β − 1.

Structure η | [center(η)]^{−1} | Pord(η)
(123) | (x+y+z+ξ)^{β+2α} | 1
(12)(3) | (x+y+ξ)^{β+α} (z+ξ)^{β} | Σ_{m=0}^{β+α−1} em (z+ξ)^{β} (x+y+ξ)^{m} / (x+y+z+2ξ)^{β+m}
(3)(12) | same | Σ_{m=0}^{β−1} em (z+ξ)^{m} (x+y+ξ)^{β+α} / (x+y+z+2ξ)^{β+α+m}
(13)(2) | (x+z+ξ)^{β+α} (y+ξ)^{β} | Σ_{m=0}^{β+α−1} em (y+ξ)^{β} (x+z+ξ)^{m} / (x+y+z+2ξ)^{β+m}
(2)(13) | same | Σ_{m=0}^{β−1} em (y+ξ)^{m} (x+z+ξ)^{β+α} / (x+y+z+2ξ)^{β+α+m}
(1)(23) | (y+z+ξ)^{β+α} (x+ξ)^{β} | Σ_{m=0}^{β−1} em (x+ξ)^{m} (y+z+ξ)^{β+α} / (x+y+z+2ξ)^{β+α+m}
(23)(1) | same | Σ_{m=0}^{β+α−1} em (x+ξ)^{β} (y+z+ξ)^{m} / (x+y+z+2ξ)^{β+m}
(1)(2)(3) | [(x+ξ)(y+ξ)(z+ξ)]^{β} | Σ_m Σ_n em,n (x+ξ)^{m} [(y+ξ)(z+ξ)]^{β} (x+y+2ξ)^{n} / [(x+y+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(2)(1)(3) | same | Σ_m Σ_n em,n (y+ξ)^{m} [(x+ξ)(z+ξ)]^{β} (x+y+2ξ)^{n} / [(x+y+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(1)(3)(2) | same | Σ_m Σ_n em,n (x+ξ)^{m} [(y+ξ)(z+ξ)]^{β} (x+z+2ξ)^{n} / [(x+z+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(2)(3)(1) | same | Σ_m Σ_n em,n (y+ξ)^{m} [(z+ξ)(x+ξ)]^{β} (y+z+2ξ)^{n} / [(y+z+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(3)(1)(2) | same | Σ_m Σ_n em,n (z+ξ)^{m} [(x+ξ)(y+ξ)]^{β} (x+z+2ξ)^{n} / [(x+z+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(3)(2)(1) | same | Σ_m Σ_n em,n (z+ξ)^{m} [(x+ξ)(y+ξ)]^{β} (y+z+2ξ)^{n} / [(y+z+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]

A strictly positive multivariate polynomial φ(x, y, z) is required that can convert the left-hand side of (18) into a polynomial by the canceling of denominator factors. Specifically, φ = φ1φ2 where φ1(x, y, z) controls factors in center(η) and φ2(x, y, z) controls factors in Pord(η). Inspection suggests taking φ1(x, y, z) equal to

$$(x+y+z+\xi)^{\beta+2\alpha}\, [(x+y+\xi)(x+z+\xi)(y+z+\xi)]^{\beta+\alpha}\, [(x+\xi)(y+\xi)(z+\xi)]^{\beta}$$

and φ2(x, y, z) equal to

$$(x+y+z+2\xi)^{2\beta+\alpha-1}\, [(x+y+2\xi)(x+z+2\xi)(y+z+2\xi)]^{2\beta-1} \times (x+y+z+3\xi)^{3\beta-2}.$$

Observe that the degree of x in the polynomial φ = φ1φ2 is 13β + 5α − 5. Indeed this is also the degree of y and the degree of z, by symmetry. These degrees are reduced in the polynomial fη = φ(x, y, z) center(η)Pord(η) by factors in the denominators of center(η) and Pord(η). For example, if η = (12)(3), then

$$\begin{aligned}
f_\eta = {} & (x+y+z+\xi)^{\beta+2\alpha}\, [(x+z+\xi)(y+z+\xi)]^{\beta+\alpha}\, [(x+\xi)(y+\xi)]^{\beta} \\
&\times [(x+y+2\xi)(x+z+2\xi)(y+z+2\xi)]^{2\beta-1}\, (x+y+z+3\xi)^{3\beta-2} \\
&\times \sum_{m=0}^{\beta+\alpha-1} e_m\, (z+\xi)^{\beta} (x+y+\xi)^{m} (x+y+z+2\xi)^{\beta+\alpha-1-m},
\end{aligned}$$

which is a polynomial of degree 11β + 4α − 5 in both x and y, and of degree 12β + 5α − 5 in z. A similar construction is possible for all structures; Table 4 records the degrees of x, y and z in all component polynomials fη.

Table 4.

Degree of x, y and z in the multivariate polynomials fη = φ(x, y, z) center(η)Pord(η). Recall β = α0 + α and both α and α0 are positive integers

Structure η Degree(x) Degree(y) Degree(z)
(123) 11β + 4α − 4 11β + 4α − 4 11β + 4α − 4
(12)(3) 11β + 4α − 5 11β + 4α − 5 12β + 5α − 5
(3)(12) 12β + 4α − 5 12β + 4α − 5 11β + 4α − 5
(13)(2) 11β + 4α − 5 12β + 5α − 5 11β + 4α − 5
(2)(13) 12β + 4α − 5 11β + 4α − 5 12β + 4α − 5
(1)(23) 11β + 4α − 5 12β + 4α − 5 12β + 4α − 5
(23)(1) 12β + 5α − 5 11β + 4α − 5 11β + 4α − 5
(1)(2)(3) 10β + 5α − 5 11β + 5α − 5 12β + 5α − 5
(2)(1)(3) 11β + 5α − 5 10β + 5α − 5 12β + 5α − 5
(1)(3)(2) 10β + 5α − 5 12β + 5α − 5 11β + 5α − 5
(2)(3)(1) 12β + 5α − 5 10β + 5α − 5 11β + 5α − 5
(3)(1)(2) 11β + 5α − 5 12β + 5α − 5 10β + 5α − 5
(3)(2)(1) 12β + 5α − 5 11β + 5α − 5 10β + 5α − 5

Having introduced the multiplier φ, the linear independence (18) is equivalent to the assertion that the polynomial equation

$$\sum_{\eta} a_\eta c_\eta f_\eta(x, y, z) = 0 \qquad \text{for all } x, y, z > 0 \qquad (19)$$

implies aη = 0 for all η. Fixing any y and z, the left-hand side of equation (19) is a polynomial in x with degree 12β + 5α − 5, according to Table 4. Indeed, terms associated with structures η = (23)(1), (2)(3)(1) and (3)(2)(1) all contribute monomials with that highest power in x. The coefficient of x^{12β+5α−5}, denoted d = d(a, y, z), equals

$$a_{(23)(1)}\, c_{(23)(1)}\, f'_{(23)(1)} + a_{(2)(3)(1)}\, c_{(2)(3)(1)}\, f'_{(2)(3)(1)} + a_{(3)(2)(1)}\, c_{(3)(2)(1)}\, f'_{(3)(2)(1)},$$

where f′ indicates contributions from respective terms within fη. This coefficient d must equal zero, for all y and z; after all, a degree 12β + 5α − 5 polynomial can equal zero in x for at most that many x values, unless the coefficient d is exactly zero; and we are asking that it equal zero at all values of x. From this study of the high-power coefficient in x, we have reduced consideration to three structures and are able to focus on d = d(a, y, z) as a bivariate polynomial in y and z (Table 5).

Table 5.

Degrees of y and z in three terms of the bivariate polynomial d(a, y, z). This is a subset of Table 4

Structure η Degree(y) Degree(z)
(23)(1) 11β + 4α − 5 11β + 4α − 5
(2)(3)(1) 10β + 5α − 5 11β + 5α − 5
(3)(2)(1) 11β + 5α − 5 10β + 5α − 5

The initial argument focusing on the degree of x can be adapted to study other variables in Table 5. With the degree of y equal to 11β + 5α − 5, for instance, it must be that the coefficient d′(z), say, of y^{11β+5α−5} equals zero for all z; after all, the polynomial can equal zero at at most 11β + 5α − 5 values of y, and we require it to be zero at all y. But the only contribution to this coefficient comes from the (3)(2)(1) term, and its multiplier is strictly positive, hence we conclude a(3)(2)(1) = 0. By the same token, working with the degree 11β + 5α − 5 term in z, it follows that a(2)(3)(1) = 0, which then forces a(23)(1) = 0, because we require d = 0 overall. Three rows from Table 4 have been eliminated (i.e., forced aη = 0): all those in which the mean of the first variable is greater than the other two means. Next, return to the reduced table, and focus, say, on structures (13)(2), (1)(3)(2) and (3)(1)(2), in which the second variable has mean greater than the others. In doing so, three more coefficients a(13)(2) = a(1)(3)(2) = a(3)(1)(2) = 0 are forced, and Table 4 is further reduced to seven rows. Then the argument is repeated to get a(12)(3) = a(1)(2)(3) = a(2)(1)(3) = 0, and it remains to assess coefficients aη of the four structures in Table 6.

Table 6.

Final subtable

Structure η Degree(x) Degree(y) Degree(z)
(123) 11β + 4α − 4 11β + 4α − 4 11β + 4α − 4
(3)(12) 12β + 4α − 5 12β + 4α − 5 11β + 4α − 5
(2)(13) 12β + 4α − 5 11β + 4α − 5 12β + 4α − 5
(1)(23) 11β + 4α − 5 12β + 4α − 5 12β + 4α − 5

The argument is repeated in this domain, knowing that all but four terms in (19) have been eliminated. The degree of x is 12β + 4α − 5, and there are contributions from both η = (3)(12) and η = (2)(13). But then, restricted to these rows, we get a(3)(12) = 0 because the coefficient of x^{12β+4α−5}, viewed as a polynomial in y, attains its top degree 12β + 4α − 5 only through the (3)(12) term. The remaining constants aη are similarly zero, completing the proof in the no-replicate (m = 1), three-group (p = 3) case.

The balanced three-group case follows suit, noting that now x, y and z are sums taken, respectively, across replicates in each of the three groups. The product statistic is not xyz, but in any case it is common to all components and thus cancels in the linear-combination test function. The observation-related shape parameter α is replaced by mα. The two-dimensional (p = 2) case is simpler and is left as an exercise.

APPENDIX C: STRICT CONCAVITY OF LOG-LIKELIHOOD AND PROOF OF THEOREM 4

Let q denote the number of nonnull structures, and consider the log-likelihood l(π) in (11) to be defined on ℝ^q, with the null probability defined secondarily as π0 = 1 − Σ_{η≠η0} πη. This way we need not invoke Lagrange multipliers to compute derivatives of l(π). By calculus, the q × q Hessian H of negative second derivatives of l(π) has (i, j)th entry

$$H_{ij} = \sum_{g} \frac{[p(x_g \mid \eta_i) - p(x_g \mid \eta_0)]\,[p(x_g \mid \eta_j) - p(x_g \mid \eta_0)]}{[p(x_g)]^2} = \sum_{g} f_i(x_g)\, f_j(x_g),$$

where p(xg) is the marginal density obtained by mixing over structures, as in (1), and fi (x) = [p(x|ηi) − p(x|η0)]/p(x). Now let a = (aη) be a q-vector of constants. To determine curvature of the log-likelihood we consider the quadratic form

$$a^{T} H a = \sum_{i=1}^{q} \sum_{j=1}^{q} a_i a_j \sum_{g} f_i(x_g) f_j(x_g) = \sum_{g} \left(\sum_{i=1}^{q} a_i f_i(x_g)\right)^{2} = \sum_{g} [T_a(x_g)]^2,$$

where Ta(x) = Σ_{i=1}^{q} ai fi(x). Clearly, aᵀHa ≥ 0 regardless of a, and so H is nonnegative definite and l(π) is concave. To establish strict concavity requires showing that Ta(xg) = 0 for all g forces a = 0. The following lemma shows that knowing Ta(xg) = 0 at all G values xg is enough to force Ta(x) = 0 for all x, as long as G is sufficiently large. But then a = 0 by the linear independence assumption, completing the proof.

Lemma 1

Let ψ (x) be a multivariate polynomial in x ∈ ℝn, and let X1, X2, …, Xm denote a random sample from a continuous distribution on ℝn. If m is at least as large as the number of monomials in ψ, then, with probability one, ψ (Xi ) = 0 for i = 1, 2, …, m implies ψ (x) = 0 for all x.

Proof

Every point Xi puts a linear condition on the space of coefficients of ψ. It needs to be verified that these conditions are linearly independent. Suppose that the first k conditions are linearly independent, so the space of ψ’s that are zero at X1, …, Xk has dimension (number of monomials in ψ) minus k. Pick one such nonzero polynomial and call it φ. Since φ = 0 is a set with positive codimension, we may assume (with probability one) that φ(Xk+1) is not zero. Then, if we impose the additional condition ψ(Xk+1) = 0, the dimension of the solution space drops by at least one, hence it drops by one. Letting k increase from 1 to m completes the proof.

APPENDIX D: FURTHER DETAILS OF NUMERICAL EXAMPLES

The parameters α, α0 and ν0 were fixed at values obtained by first fitting the unordered gamma–gamma mixture model in EBarrays, without a null structure but otherwise allowing all possible unordered structures. Shape parameters were then rounded to the nearest positive integer (Table 7) and all three parameters were plugged into the EM procedure to fit the proposed mixture-model proportions. [Recall that Pord(η) in (5) can be computed only for integer shapes, hence the rounding.] To simplify EM calculations in the four examples having more than five groups, the full set of ordered structures was filtered to a reduced set based on the fitting of the unordered gamma–gamma model in EBarrays. Each ordered structure corresponds to exactly one unordered structure (a many to one mapping). If no gene had a high (greater than 0.5) probability of mapping to a given unordered structure, then we deemed all corresponding ordered structures to have πη = 0. This approximation is not ideal, since the Bayes rule assignment for some genes may be one of the structures eliminated by forcing πη = 0. This affects only 29 genes out of the 17,539 clustered in these four cases. It would not affect clustering by a high threshold.

Table 7.

Parameter estimates (not including mixing proportions) from the examples analyzed. The last column indicates the number of functional categories in GO and KEGG having at least five annotated genes, which were used in the development of Figure 4. KEGG was not available for GDS1937, and so this data set was not used in Figure 4

Data set α α0 ν0 # GO/KEGG
Edwards 113 1 586.5
GDS2323 14 1 119.1 3849/184
GDS1802 17 1 46.8 3619/182
GDS2043 22 1 47.7 3619/182
GDS2360 8 1 15.8 3258/175
GDS599 12 1 0.01 3180/159
GDS812 5 1 15.4 3258/175
GDS1937 6 1 20.5 NA
GDS568 10 1 37.1 3258/175
GDS2431 4 1 67.6 4085/188
GDS587 8 1 9999.2 1876/127
GDS586 13 1 4566.2 3258/175

For all data sets, we examined quantile–quantile plots and plots relating sample coefficient of variation to sample mean. Some model violations were noted, but largely the gamma observation model was supported.

For the Edwards data, we reran the EM algorithm for 10 cycles and updated shape parameter estimates via 2D grid search in each cycle. Estimated shapes changed slightly; 784/786 genes received the same Bayes rule cluster assignment.

Computations were done in R on industry-standard Linux machines. For the data sets analyzed, run times ranged from 6 to 860 CPU seconds per EM iteration, with a mean of 270 seconds. Run time is affected by the number of genes analyzed, the number of groups, and also by the shape parameters and sample sizes.

Footnotes

1. Supported in part by NIH Grants R01 ES017400 and T32 GM074904.

Contributor Information

Michael A. Newton, Email: newton@stat.wisc.edu.

Lisa M. Chung, Email: lchung@stat.wisc.edu.

References

1. Campbell EA, O’Hara L, Catalano RD, Sharkey AM, Freeman TC, Johnson MH. Temporal expression profiling of the uterine luminal epithelium of the pseudo-pregnant mouse suggests receptivity to the fertilized egg is associated with complex transcriptional changes. Human Reproduction. 2006;21:2495–2513. doi:10.1093/humrep/del195.
2. Dennis B, Patil GP. The gamma distribution and weighted multimodal gamma distributions as models of population abundance. Math Biosci. 1984;68:187–212. MR0738902.
3. Doksum KA, Ozeki A. Semiparametric models and likelihood—the power of ranks. In: Rojo J, editor. Optimality: The Third Erich L. Lehmann Symposium. IMS Lecture Notes—Monograph Series, Vol. 57. IMS; Beachwood, OH: 2009. pp. 67–92.
4. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research. 2002;30:207–210. doi:10.1093/nar/30.1.207.
5. Edwards M, Sarkar D, Klopp R, Morrow J, Weindruch R, Prolla T. Age-related impairment of the transcriptional response to oxidative stress in the mouse heart. Physiological Genomics. 2003;13:119–127. doi:10.1152/physiolgenomics.00172.2002.
6. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868. doi:10.1073/pnas.95.25.14863.
7. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Amer Statist Assoc. 2002;97:611–631. MR1951635.
8. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd ed. Chapman and Hall; Boca Raton, FL: 2004. MR2027492.
9. Girardot F, Monnier V, Tricoire H. Genome wide analysis of common and specific stress responses in adult Drosophila melanogaster. BMC Genomics. 2004;5:74. doi:10.1186/1471-2164-5-74.
10. Grasso LC, Maindonald J, Rudd S, Hayward DC, Saint R, Miller DJ, Ball EE. Microarray analysis identifies candidate genes for key roles in coral development. BMC Genomics. 2008;9:540. doi:10.1186/1471-2164-9-540.
11. Greenwood M, Yule GU. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J Roy Statist Soc Ser A. 1920;83:255–279.
12. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer; New York: 2001. MR1851606.
13. Hubert L, Arabie P. Comparing partitions. J Classification. 1985;2:193–218.
14. Holzmann H, Munk A, Gneiting T. Identifiability of finite mixtures of elliptical distributions. Scand J Statist. 2006;33:753–763. MR2300914.
15. Hutchinson TP. Compound gamma bivariate distributions. Metrika. 1981;28:263–271. MR0642934.
16. Jensen ST, Erkan I, Arnardottir ES, Small DS. Bayesian testing of many hypotheses × many genes: A study of sleep apnea. Ann Appl Statist. 2009;3:1080–1101.
17. Keles S. Mixture modeling for genome-wide localization of transcription factors. Biometrics. 2007;63:10–21. doi:10.1111/j.1541-0420.2005.00659.x. MR2345570.
18. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med. 2003;22:3899–3914. doi:10.1002/sim.1548.
19. Kendziorski CM, Chen M, Yuan M, Lan H, Attie AD. Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics. 2006;62:19–27. doi:10.1111/j.1541-0420.2005.00437.x. MR2226552.
20. Kschischang FR, Frey BJ, Loeliger HA. Factor graphs and the sum-product algorithm. IEEE Trans Inform Theory. 2001;47:498–519. MR1820474.
21. Lo K, Gottardo R. Flexible empirical Bayes models for differential gene expression. Bioinformatics. 2007;23:328–335. doi:10.1093/bioinformatics/btl612.
22. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research. 2008;18:1509–1517. doi:10.1101/gr.079558.108.
23. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Chapman and Hall; London: 1989. MR0727836.
24. McLachlan GJ, Basford KE. Mixture Models: Inference and Applications to Clustering. Dekker; New York: 1988. MR0926484.
25. McLachlan GJ, Peel D. Finite Mixture Models. Wiley; New York: 2000. MR1789474.
26. Medvedovic M, Yeung KY, Bumgarner RE. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics. 2004;20:1222–1232. doi:10.1093/bioinformatics/bth068.
27. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008;5:621–628. doi:10.1038/nmeth.1226.
28. Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5:155–176. doi:10.1093/biostatistics/5.2.155.
29. Newton MA, Quintana FA, den Boon JA, Sengupta S, Ahlquist P. Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann Appl Statist. 2007;1:85–106. MR2393842.
30. Parmigiani G, Garrett ES, Irizarry RA, Zeger SL, editors. The Analysis of Gene Expression Data: Methods and Software. Springer; New York: 2003. MR2001388.
31. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77:257–286.
32. Redner RA, Walker HF. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 1984;26:195–239. MR0738930.
33. Rempala GA, Pawlikowska I. Limit theorems for hybridization reactions on oligonucleotide microarrays. J Multivariate Anal. 2008;99:2082–2095. MR2466552.
34. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–2887. doi:10.1093/bioinformatics/btm453.
35. Rossell D. GaGa: A parsimonious and flexible model for differential expression analysis. Ann Appl Statist. 2009;3:1035–1051.
36. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article 3 (electronic). doi:10.2202/1544-6115.1027. MR2101454.
37. Sobel M, Frankowski K. The 500th anniversary of the sharing problem (the oldest problem in the theory of probability). Amer Math Monthly. 1994;101:833–847. MR1300489.
38. Speed T, editor. Statistical Analysis of Gene Expression Microarray Data. CRC Press; Boca Raton, FL: 2004.
39. Stephens M. Dealing with label switching in mixture models. J R Stat Soc Ser B Stat Methodol. 2000;62:795–809. MR1796293.
40. Stern H. Models for distributions on permutations. J Amer Statist Assoc. 1990;85:558–564.
41. Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc Natl Acad Sci USA. 2003;100:9440–9445. doi:10.1073/pnas.1530509100. MR1994856.
42. Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006;22:2405–2412. doi:10.1093/bioinformatics/btl406.
43. Titterington DM, Smith AFM, Makov UE. Statistical Analysis of Finite Mixture Distributions. Wiley; New York: 1985. MR0838090.
44. Yakowitz SJ, Spragins JD. On the identifiability of finite mixtures. Ann Math Statist. 1968;39:209–214. MR0224204.
45. Yuan M, Kendziorski CM. A unified approach for simultaneous gene clustering and differential expression identification. Biometrics. 2006a;62:1089–1098. doi:10.1111/j.1541-0420.2006.00611.x. MR2297680.
46. Yuan M, Kendziorski CM. Hidden Markov models for time course data in multiple biological conditions (with discussion). J Amer Statist Assoc. 2006b;101:1323–1340. MR2307565.
