Published in final edited form as: Ann. Statist. 2010;38(6):3217–3244. doi:10.1214/10-AOS805

GAMMA-BASED CLUSTERING VIA ORDERED MEANS WITH APPLICATION TO GENE-EXPRESSION ANALYSIS

Michael A. Newton and Lisa M. Chung

Abstract

Discrete mixture models provide a well-known basis for effective clustering algorithms, although technical challenges have limited their scope. In the context of gene-expression data analysis, a model is presented that mixes over a finite catalog of structures, each one representing equality and inequality constraints among latent expected values. Computations depend on the probability that independent gamma-distributed variables attain each of their possible orderings. Each ordering event is equivalent to an event in independent negative-binomial random variables, and this finding guides a dynamic-programming calculation. The structuring of mixture-model components according to constraints among latent means leads to strict concavity of the mixture log likelihood. In addition to its beneficial numerical properties, the clustering method shows promising results in an empirical study.

Key words and phrases: Gamma ranking, mixture model, next generation sequencing, Poisson embedding, rank probability

1. Introduction

A common problem in statistical genomics is how to organize expression data from genes that have been determined to exhibit differential expression relative to various cellular states. Cells in a time-course experiment may exhibit such genes, as may cells in any sort of designed experiment or observational study where expression alterations are being examined [e.g., Parmigiani et al. (2003), Speed (2004)]. In the event that the error-rate-controlled list of significantly altered genes is small, the post-processing problem amounts to inspecting observed patterns of expression, investigating what is known about the relatively few genes identified, and planning follow-up experiments as necessary. However, it is all too common that hundreds or even thousands of genes are detected as significantly altered in their expression pattern relative to the cellular states. Postprocessing these nonnull genes presents a substantial statistical problem. Difficulties are compounded in the multi-group setting because a gene can be nonnull in many different ways [Jensen et al. (2009)].

Ever since Eisen et al. (1998), clustering methods have been used to organize expression data. Information about a gene’s biological function may be conveyed by the other genes sharing its pattern of expression. Thalamuthu et al. (2006) provides a recent perspective. Clustering methods are often applied in order to partition nonnull genes which have been identified in differential expression analysis [e.g., Campbell et al. (2006), Grasso et al. (2008)]. Popular approaches are informative but not completely satisfactory. There are idiosyncratic problems, like how to select the number of clusters, but there is also the subtle issue that the clusters identified by most algorithms are anonymous: each cluster is defined only by similarity of its contents rather than by some external pattern that its genes may be approximating. Anonymity may contribute to technical problems, such as that the objective function being minimized is not convex, and that realized clusters have a more narrow size distribution than is warranted by the biological system.

Model-based clustering treats data as arising from a mixture of component distributions, and then forms clusters by assigning each data point to its most probable component [e.g., Titterington, Smith and Makov (1985), McLachlan and Basford (1988)]. For example, the mclust procedure is based on mixtures of Gaussian components [Fraley and Raftery (2002)]; the popular K-means algorithm is implicitly so based [Hastie, Tibshirani and Friedman (2001), page 463]. There is considerable flexibility in model-based clustering, though technical challenges have also affected its development: the likelihood function is often multi-modal; identifiability can be difficult to establish [e.g., Redner and Walker (1984), Holzmann, Munk and Gneiting (2006)]; and even where constraints may create identifiability, there can be a problem of label-switching during Bayesian inference [Stephens (2000)]. Some sophisticated model-based clustering methods have been developed for gene expression [e.g., Medvedovic, Yeung and Bumgarner (2004)]. Beyond empirical studies, it is difficult to determine properties of such approaches, and their reliance on Monte Carlo computation is somewhat limiting.

Here a model-based clustering method is developed that aims to support multi-group gene-expression analysis and possibly other applications. The method, called gamma ranking, places genes in a cluster if their expression patterns commonly approximate one element from a finite catalog of possible structures, in contrast to anonymous methods (Section 2). Under certain conditions, the component distributions are linearly independent functions—each one associated with a structure in the finite catalog—and this confers favorable computational characteristics to the gamma-ranking procedure (Sections 4, 5). The cataloged structures record patterns of equality and inequality among latent expected values. Where normal-theory specifications seem to be intractable, a gamma-based mixture model produces closed formulas for all necessary component densities, thanks to an embedding of the relevant gamma-distributed variables in a set of Poisson processes (Section 3). The formulation also extends to Poisson-distributed responses that are characteristic of gene expression measured by next-generation sequencing (Section 6).

2. Mixture of structured components

The data considered has a relatively simple layout. Each gene g from a possibly large number is associated with a vector xg = (xg,1, xg,2, …, xg,n) holding measurements of gene expression from n distinct biological samples. The n samples are distributed among 1 < p ≤ n different groups, which represent possibly different transcriptional states of the cells under study. The groups may represent cells exposed to p different chemical treatments, cells at p different developmental stages, or cells at p different points along a time course, for example. The layout of samples {1, 2, …, n} is recorded in a vector, say l = (l1, l2, …, ln), with li = j indicating that sample i comes from group j. This is fixed by design and known to the analyst; to simplify the development we suppress l from the notation below except where clarification is warranted.

Each expression measurement xg,i is treated as a positive, continuous variable representing a fluorescence intensity from a microarray, after preprocessing has adjusted for various systematic effects not related to the groupings of interest. Recent technological advances allow expression to be measured instead as an explicit abundance count. The mixture model developed below adapts readily to this case (Section 6).

Gamma ranking entails clustering genes according to the fit of a specific model of gene-level data. The joint probability density for a data vector xg, denoted p(xg ), is treated as a finite mixture over a catalog of discrete structures η, each of which determines ordering constraints among latent expected values. More specifically,

$$p(x_g) = \sum_{\eta} p(x_g \mid \eta)\, \pi_\eta, \qquad (1)$$

where πη is a mixing proportion and the component density p(xg |η) is determined through modeling.

Each η is a partition of group labels {1, 2, …, p}, containing Kη subsets, that also carries an ordering of these subsets. For example, three structures cover the two-group comparison, denoted {(1)(2), (12), (2)(1)}. The notation conveys both the partition of group means and the ordering of subsets within the partition. For instance, in η = (2)(1) the expected expression level in group 2 is less than that of group 1; while η = (12) indicates that both groups share a common latent mean. With p = 3 groups, there are 13 structures

(123), (12)(3), (3)(12), (13)(2), (2)(13), (1)(23), (23)(1), (1)(2)(3), (2)(1)(3), (1)(3)(2), (2)(3)(1), (3)(1)(2), (3)(2)(1),

and the number grows rapidly with the number of groups (Table 1). A way to think about Hord = {η}, the catalog of these ordered structures on p groups, is to imagine p real values y = (y1, y2, …, yp) and the possible vectors you would get by ranking y. Of course there are p! rankings if ties are not permitted, but generally there are far more rankings, and Hord is in 1–1 correspondence with the set of rankings of p numbers, allowing ties. (A small enumeration sketch follows Table 1.)

Table 1.

The number of ordered structures, Bell+, as a function of the number of groups, p. This is Σ_{k=1}^{p} k! S(p, k), where S(p, k) are Stirling numbers of the second kind. The Bell number of partitions of 1, …, p is included for comparison

p Bell+ Bell
2 3 2
3 13 5
4 75 15
5 541 52
6 4683 203
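To make the catalog concrete, the following sketch enumerates Hord by choosing, recursively, the block of groups receiving the lowest remaining mean. This is an illustrative helper written for this exposition (the name ordered_structures is ours, not from any released package); it reproduces the Bell+ counts of Table 1.

```python
from itertools import combinations

def ordered_structures(groups):
    """Yield every ordered set partition of `groups` (the catalog H_ord),
    with blocks listed from smallest to largest latent mean."""
    groups = tuple(groups)
    if not groups:
        yield ()
        return
    # Choose the block of groups sharing the lowest mean, then recurse.
    for r in range(1, len(groups) + 1):
        for first_block in combinations(groups, r):
            rest = tuple(g for g in groups if g not in first_block)
            for tail in ordered_structures(rest):
                yield (first_block,) + tail

# Reproduces the Bell+ column of Table 1: 3, 13, 75, 541, 4683.
for p in range(2, 7):
    print(p, sum(1 for _ in ordered_structures(range(1, p + 1))))
```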

An ordered structure η also dictates an association between sample labels i ∈ {1, 2, …, n} and levels of the latent expected values. The null structure η = (12 ··· p), for example, entails equal mean expression across all p groups; all observations are associated with a single mean value (and we write Kη = 1). More generally, there are Kη > 1 distinct mean values, μ1 < μ2 < ··· < μKη, say. Without loss of generality, we index the means by rank order. The association maps each i ∈ {1, 2, …, n} to some μk; it amounts to a partition of the samples together with an ordering of the subsets within the partition matching the order of the latent means. We express this association with disjoint subsets σ(η, k), k = 1, 2, …, Kη, and have k follow the order of the expected values. For example, suppose that samples {1, 2, …, 6} constitute two replicate samples in each of p = 3 groups, and η = (13)(2) is considered to relate the group-specific expected values (i.e., the gene is upregulated in group 2, and not differentially expressed between groups 1 and 3). Then Kη = 2, σ(η, 1) = {1, 2, 5, 6} and σ(η, 2) = {3, 4}. Subset σ(η, k) includes nk samples and induces gene-level statistics such as

$$s_{g,k} = \sum_{i \in \sigma(\eta,k)} x_{g,i} \qquad\text{and}\qquad t_{g,k} = \prod_{i \in \sigma(\eta,k)} x_{g,i}.$$

The structure/partition notation is convenient in multi-group mixture modeling. For clarification, let us refer back to the layout notation and take the replicates rj = {i: li = j}, which equal those samples in group j. Consider a gene that is completely differentially expressed relative to the p groups; that is, it assumes one of the p! structures η in which Kη = p. It follows that each set rj equals exactly one of the subsets σ(η, k). [It would be σ(η, 1) if rj had the lowest mean expression level, e.g.] In the absence of complete differential expression, multiple groups share expected values. Generally, therefore, each subset σ (η, k) is a union of various replicate sets rj. The language also conveys the assumption that replicates i1 and i2 in the same set rj necessarily share expected value, regardless of the structure η. In calculating probabilities, the sets σ(η, k) of equi-mean samples are more important than the replicate sets rj.
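As a small illustration of this bookkeeping, the sketch below maps a layout l and an ordered structure η (written as a tuple of blocks of group labels, lowest mean first) to the sample blocks σ(η, k) and the statistics sg,k and tg,k. The helper name structure_stats is hypothetical, introduced only for this example.

```python
import numpy as np

def structure_stats(x, layout, eta):
    """Blocks sigma(eta, k) plus the sums s_{g,k} and products t_{g,k}
    for one gene's data x (length n) under ordered structure eta."""
    x = np.asarray(x, dtype=float)
    blocks, s, t = [], [], []
    for group_block in eta:                  # k-th block, lowest mean first
        idx = [i for i, g in enumerate(layout) if g in group_block]
        blocks.append(idx)                   # sigma(eta, k), 0-based indices
        s.append(x[idx].sum())               # s_{g,k}
        t.append(x[idx].prod())              # t_{g,k}
    return blocks, np.array(s), np.array(t)

# The example above: two replicates in each of p = 3 groups, eta = (13)(2).
layout = [1, 1, 2, 2, 3, 3]
blocks, s, t = structure_stats([1.0, 1.2, 3.0, 2.8, 0.9, 1.1], layout,
                               ((1, 3), (2,)))
print([[i + 1 for i in b] for b in blocks])  # [[1, 2, 5, 6], [3, 4]]
```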

From the mixture model (1), posterior structure probabilities are p(η|xg) = p(xg|η)πη/p(xg) and these determine gene clusters by Bayes’s rule assignment. Alternatively, the cluster contents can be regulated by a threshold parameter c, and

$$\operatorname{cluster}(\eta) = \{\, g : p(\eta \mid x_g) \ge c \,\}, \qquad (2)$$

though some genes may go unassigned in this formulation. In any case, each cluster holds genes with empirical characteristics matching some discrete mean-ordering structure.
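In code, the posterior computation and both assignment rules take only a few lines. The sketch below assumes a G × H matrix of log component densities log p(xg|η) has already been computed; the function and argument names are ours, for illustration only.

```python
import numpy as np

def assign_clusters(log_comp, log_pi, c=None):
    """Posterior structure probabilities p(eta | x_g) and cluster labels.
    log_comp: (G, H) array of log p(x_g | eta); log_pi: length-H array of
    log mixing proportions.  With a threshold c, genes whose maximal
    posterior falls below c are left unassigned (label -1), as in (2)."""
    a = log_comp + log_pi                      # log numerator of p(eta | x_g)
    a -= a.max(axis=1, keepdims=True)          # stabilize before exponentiating
    post = np.exp(a)
    post /= post.sum(axis=1, keepdims=True)    # p(eta | x_g)
    labels = post.argmax(axis=1)               # Bayes's rule assignment
    if c is not None:
        labels = np.where(post.max(axis=1) >= c, labels, -1)
    return post, labels
```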

The latent expected values are constrained by η to the order μ1 < μ2 < ··· < μKη. Propelling our calculations is the ability to integrate out these ordered means (i.e., marginalize them) in a model involving gamma distributions on some transformation of the μk’s. Recall that a gamma distribution with shape a > 0 and rate λ > 0, denoted Gamma(a, λ), has probability density

$$p(z) = \frac{\lambda^{a} z^{a-1} \exp\{-z\lambda\}}{\Gamma(a)}, \qquad z > 0.$$

We assume that inverse means ψk = 1/μk have joint density

$$p_{\eta}(\psi_1, \ldots, \psi_{K_\eta}) = K_\eta! \left[\prod_{k=1}^{K_\eta} \frac{(\alpha_0\nu_0)^{\alpha_0}\, \psi_k^{\alpha_0 - 1} \exp\{-\alpha_0\nu_0\psi_k\}}{\Gamma(\alpha_0)}\right] \times 1[\psi_1 > \psi_2 > \cdots > \psi_{K_\eta}], \qquad (3)$$

which reflects independent and identically distributed Gamma(α0, α0ν0) components, conditioned to one ordering. This parameterization gives ν0 an interpretation as a centering parameter; on the null structure having a single latent mean μ1, 1/ν0 = E(1/μ1).

To complete the hierarchical specification, we assume a gamma observation model

$$p(x_g \mid \psi_1, \ldots, \psi_{K_\eta}, \eta) = \prod_{k=1}^{K_\eta} \prod_{i \in \sigma(\eta,k)} \frac{(\alpha\psi_k)^{\alpha}\, x_{g,i}^{\alpha-1} \exp\{-x_{g,i}\psi_k\alpha\}}{\Gamma(\alpha)} = \prod_{k=1}^{K_\eta} \frac{(\alpha\psi_k)^{\alpha n_k}\, t_{g,k}^{\alpha-1} \exp\{-s_{g,k}\psi_k\alpha\}}{[\Gamma(\alpha)]^{n_k}}. \qquad (4)$$

Equivalently, for sample i ∈ σ(η, k), measurement xg,i is distributed as Gamma(α, αψk), all conditionally on the latent values and η, and independently across samples. The gamma observation component is often supported empirically; there is theoretical support from stochastic models of population abundance [Dennis and Patil (1984), Rempala and Pawlikowska (2008)]; and there is the practical consideration that a gamma-based model may be the only one for continuous data in which the ordering calculations are feasible.

The structured component p(xg |η) in (1) arises by integrating (4) against the continuous mixing distribution (3). Specifically,

$$p(x_g \mid \eta) = \int p(x_g \mid \psi_1, \ldots, \psi_{K_\eta}, \eta)\, p_{\eta}(\psi_1, \ldots, \psi_{K_\eta})\, d\psi_1 \cdots d\psi_{K_\eta}.$$

Moving allowable factors out of the integral gives

$$p(x_g \mid \eta) = \frac{K_\eta!\, (\alpha_0\nu_0)^{K_\eta\alpha_0}\, \alpha^{\alpha n}}{\Gamma(\alpha_0)^{K_\eta}\, \Gamma(\alpha)^{n}} \left(\prod_{k=1}^{K_\eta} J_k\, t_{g,k}^{\alpha-1}\right) \times \int_{E} \prod_{k=1}^{K_\eta} \frac{\psi_k^{\alpha_0 + \alpha n_k - 1} \exp\{-\psi_k(\alpha_0\nu_0 + \alpha s_{g,k})\}}{J_k}\, d\psi_1 \cdots d\psi_{K_\eta},$$

where the integral is over the set E of decreasing ψk’s, and where Jk represents any cluster-specific quantity which does not depend on ψk. Choosing

$$J_k = \frac{\Gamma(\alpha_0 + \alpha n_k)}{(\alpha_0\nu_0 + \alpha s_{g,k})^{\alpha_0 + \alpha n_k}}$$

provides just the right normalization, because then the integrand becomes the joint density of independent gamma-distributed variables, with the kth variable having shape ak = α0 + αnk and rate λk = α0ν0 + αsg,k. The integral itself, denoted Pord(η), is the probability that independent gamma-distributed variables assume a certain order. The preceding factor can be arranged as products of the product statistics tg,k multiplied by factors involving the sum statistics sg,k. After a bit of simplification, the following result is established.

Theorem 1

In the model defined above, the component density p(xg |η) equals

$$c_\eta \left(\prod_{i=1}^{n} x_{g,i}^{\alpha-1}\right) \underbrace{\prod_{k=1}^{K_\eta} \left(s_{g,k} + \frac{\alpha_0\nu_0}{\alpha}\right)^{-a_k}}_{\operatorname{center}(\eta)}\; \underbrace{P(Z_1 > Z_2 > \cdots > Z_{K_\eta})}_{P_{\mathrm{ord}}(\eta)}, \qquad (5)$$

where the Zk’s are mutually independent gamma-distributed random variables with shapes ak = α0 + αnk and rates λk = α0ν0 + αsg,k, and where the normalizing constant is

$$c_\eta = \frac{K_\eta!}{[\Gamma(\alpha)]^{n}\, [\Gamma(\alpha_0)]^{K_\eta}} \left(\frac{\alpha_0\nu_0}{\alpha}\right)^{\alpha_0 K_\eta} \prod_{k=1}^{K_\eta} \Gamma(a_k).$$

In (5), Pord(η) = 1 for the null case involving Kη = 1.

The null structure η = (12 ··· p) entails equal mean expression for all samples; there is a single partition element, and Kη = 1. In this case, the distribution in (5) is exchangeable and equals a multivariate compound gamma [Hutchinson (1981)]. The positive parameters α and α0 regulate within-group and among-group variation, and ν0 is a scale parameter. Inspection also confirms that if the random vector X = (X1, …, Xn) has density p(x|η) in (5), and if b > 0, then Y = (bX1, …, bXn) has a density of the same type, with shape parameters α0 and α unchanged, but with scale parameter bν0.

Special cases of the density (5) have been reported: Newton et al. (2004) presented the case p = 2; Jensen et al. (2009) presented the case p = 4. See also Yuan and Kendziorski (2006a). Evidently an algorithm to compute Pord(η) is required in order to evaluate the component mixing densities. Beyond the p = 2 case, previous reports have evaluated these gamma-rank probabilities by Monte Carlo.
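For concreteness, the following sketch evaluates log p(xg|η) in (5) for one gene, given the blocks σ(η, k) and a routine rank_prob(shapes, rates) for the gamma-rank probability (a dynamic-programming sketch of such a routine appears in Section 3). It assumes integer α and α0, as required for computing Pord(η); the function names are ours, not from any released implementation.

```python
import numpy as np
from scipy.special import gammaln

def log_component_density(x, blocks, alpha, alpha0, nu0, rank_prob):
    """log p(x | eta) from (5).  `blocks` lists the sample index sets
    sigma(eta, k), lowest mean first; `rank_prob(shapes, rates)` returns
    P(Z_1 > ... > Z_K) for independent gammas with integer shapes."""
    x = np.asarray(x, dtype=float)
    n = x.size
    K = len(blocks)
    nk = np.array([len(b) for b in blocks])      # block sizes n_k
    s = np.array([x[b].sum() for b in blocks])   # sums s_{g,k}
    a = alpha0 + alpha * nk                      # shapes a_k
    lam = alpha0 * nu0 + alpha * s               # rates lambda_k
    # log c_eta, as displayed after (5); gammaln(K + 1) = log K!
    log_c = (gammaln(K + 1) - n * gammaln(alpha) - K * gammaln(alpha0)
             + K * alpha0 * np.log(alpha0 * nu0 / alpha) + gammaln(a).sum())
    log_center = -(a * np.log(s + alpha0 * nu0 / alpha)).sum()
    log_pord = 0.0 if K == 1 else np.log(rank_prob(a, lam))
    return log_c + (alpha - 1) * np.log(x).sum() + log_center + log_pord
```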

Figure 1 displays contours of the three structured components when n = 2 and p = 2. Clearly the components distribute mass quite differently from one another, and in a way that reflects constraints encoded by η. The densities from different structures η have the same support; the constraints restrict latent expected values rather than observables. In this way, the approach shares something with generalized linear modeling wherein responses are modeled by generic exponential family densities and covariate information constrains the expected values [McCullagh and Nelder (1989)].

Fig. 1.

Three structured components in ℝ². Here α = 10, α0 = 3 and ν0 = 25. Contours cover 50%, 80%, 95% and 99% probability. For convenience, each density is shown for log2-transformed pairs.

3. Gamma-rank probabilities

A statistical computing problem must be solved in order to implement gamma ranking. Specifically, it is required to calculate the probability P (E) of the event

$$E = \{Z_1 > Z_2 > \cdots > Z_K\}, \qquad (6)$$

where {Zk: k = 1, 2, …, K} are mutually independent gamma-distributed random variables with possibly different shapes a1, a2, …, aK and rates λ1, λ2, …, λK. [Each Pord(η) in Section 2 is an instance of P(E).] In the special case K = 2, the event in two gamma-distributed variables is equivalent to the event E′ = {B > λ1/(λ1 + λ2)}, where B is a Beta(a1, a2) distributed variable. Thus, P(E) = P(E′) can be computed by standard numerical approaches for the Beta distribution. Although a similar representation is possible for Dirichlet-distributed vectors when K > 2, a direct numerical approach is not clearly indicated. In modeling permutation data, Stern (1990) presented a formula for P(E) for any value K, but assuming common shape parameters ak = a. Sobel and Frankowski (1994) calculated P(E) for K < 5 assuming constant rates λk = λ, but to our knowledge a general formula has not been developed. A Monte Carlo approximation is certainly feasible, but a fast and accurate numerical approach would be preferable for computational efficiency: target values may be small, and P(E) may need to be recomputed for many shape and rate settings.

There is an efficient numerical approach to computing P(E) when the shapes ak are positive integers. The approach involves embedding {Zk} in a collection of independent Poisson processes {ℕk}, k = 1, 2, …, K. Specifically, let ℕk denote a Poisson process on (0, ∞) with rate λk, so that ℕk(0, t] ~ Poisson(λkt), for example. Of course, gaps between points in ℕk are independent and exponentially distributed, and the gamma-distributed Zk can be constructed by summing the first ak gaps:

$$Z_k = \min\{t > 0 : \mathbb{N}_k(0, t] \ge a_k\}.$$

Next, form processes {𝕄k} by accumulating points in the originating processes: 𝕄k = ℕ1 + ℕ2 + ··· + ℕk. Marginally, 𝕄k is a Poisson process with rate Λk = λ1 + λ2 + ··· + λk, but over k the processes are dependent owing to overlapping points. To complete the construction, define count random variables M1, M2, …, MK−1 by

$$M_k = \mathbb{M}_k(0, Z_{k+1}]. \qquad (7)$$

It is immediate that each Mk has a marginal negative binomial distribution: the gamma-distributed Zk+1 is independent of 𝕄k; conditioning on Zk+1 in (7) gives a Poisson variable which mixes out to the negative binomial [Greenwood and Yule (1920)]. Specifically,

$$M_k \sim \operatorname{NB}\big(\text{shape} = a_{k+1},\ \text{scale} = \Lambda_k/\lambda_{k+1}\big),$$

which corresponds to the probability mass function

$$p_k(m) = \frac{\Gamma(m + a_{k+1})}{\Gamma(a_{k+1})\, \Gamma(m+1)} \left(\frac{\lambda_{k+1}}{\Lambda_{k+1}}\right)^{a_{k+1}} \left(\frac{\Lambda_k}{\Lambda_{k+1}}\right)^{m} \qquad (8)$$

for integers m ≥ 0. The next main finding is the following.

Theorem 2

With E as in (6), Mk as in (7) and pk as in (8), P (E) equals

$$\sum_{m_1=0}^{a_1-1}\; \sum_{m_2=0}^{m_1+a_2-1} \cdots \sum_{m_{K-1}=0}^{m_{K-2}+a_{K-1}-1} p_1(m_1)\, p_2(m_2) \cdots p_{K-1}(m_{K-1}). \qquad (9)$$

It does not seem to be obvious that E in (6) is equivalent to an event in the {Mk}. We also find it striking that the Mk variables are independent considering that they are constructed from highly dependent 𝕄k processes. Proof of (9) and the related distribution theory are presented in Appendix A.

A redistribution of products and sums allows a numerically efficient evaluation of (9), as in the sum-product algorithm [e.g., Kschischang, Frey and Loeliger (2001)]. For instance, with K = 4,

$$P(E) = \sum_{m_1=0}^{a_1-1} p_1(m_1) \left\{\sum_{m_2=0}^{m_1+a_2-1} p_2(m_2) \left[\sum_{m_3=0}^{m_2+a_3-1} p_3(m_3)\right]\right\}. \qquad (10)$$

Here, one would evaluate P(E) by first constructing, for each m2 ∈ {0, 1, …, a1 + a2 − 2}, an inner sum P(M3 ≤ m2 + a3 − 1). This vector in m2 values is used to process the second inner sum, for each value m1 ∈ {0, 1, …, a1 − 1}. Indeed the computation is completely analogous to the Baum–Welch backward recursion [e.g., Rabiner (1989)], although, interestingly, there seems to be no hidden Markov chain in the system. A version of the Viterbi algorithm identifies the maximal summand and thus provides an approach to computing log P(E) in case P(E) is very small.
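A minimal sketch of this backward recursion, assuming positive-integer shapes, is given below; the name gamma_rank_prob and all variable names are ours, not from any released implementation. The closing lines check the K = 2 case against the Beta representation noted at the start of this section.

```python
import numpy as np
from scipy.stats import beta, nbinom

def gamma_rank_prob(shapes, rates):
    """P(Z_1 > Z_2 > ... > Z_K) for independent Z_k ~ Gamma(a_k, rate lam_k)
    with positive-integer shapes, via the sum (9) evaluated backward
    as in (10)."""
    a = np.asarray(shapes, dtype=int)
    lam = np.asarray(rates, dtype=float)
    K = a.size
    if K == 1:
        return 1.0
    Lam = np.cumsum(lam)                 # Lambda_k = lam_1 + ... + lam_k
    M = int(a.sum())                     # indices m_k never exceed sum(a) - 1
    h = np.ones(M + 1)                   # h_K(.) = 1
    for k in range(K - 1, 0, -1):        # 1-based k = K-1, ..., 1
        # p_k(m) in (8): negative binomial with shape a_{k+1} and
        # success probability lam_{k+1} / Lambda_{k+1}
        p = nbinom.pmf(np.arange(M + 1), a[k], lam[k] / Lam[k])
        cum = np.concatenate(([0.0], np.cumsum(p * h)))  # prefix sums
        # h_k(m_{k-1}) = sum_{m=0}^{m_{k-1}+a_k-1} p_k(m) h_{k+1}(m)
        upper = np.minimum(np.arange(M + 1) + a[k - 1], M + 1)
        h = cum[upper]
    return h[0]                          # P(E) = h_1(m_0 = 0)

# Check against the Beta representation for K = 2:
a1, a2, l1, l2 = 3, 5, 1.2, 0.7
print(gamma_rank_prob([a1, a2], [l1, l2]))
print(beta.sf(l1 / (l1 + l2), a1, a2))   # P(B > l1/(l1+l2)), B ~ Beta(a1, a2)
```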

4. Linear independence

The component densities (5) seem to have the useful property of being linearly independent functions on ℝ^n. Linear independence of the component density functions is equivalent to identifiability of the mixture model [Yakowitz and Spragins (1968)]. It is necessary for strict concavity of the log likelihood, but it is not routinely established. Establishing identifiability is also a key step in determining sampling properties of the maximum likelihood estimator.

Let a = (aη) denote a vector of real numbers indexed by structures η. Recall that the finite catalog of functions {p(x|η)} is linearly independent if

$$T_a(x) = \sum_{\eta} a_\eta\, p(x \mid \eta) = 0 \ \text{ for all } x \quad\text{implies}\quad a_\eta = 0 \ \text{ for all } \eta.$$

It is plausible that this property holds generally, but we have been able to establish a proof only in a special case.

Theorem 3

In a balanced experiment where m replicate samples are measured in each of p = 2 or p = 3 groups, the component densities p(xg|η) in (5) are linearly independent functions on ℝ^{mp}.

A proof proceeds by finding a multivariate polynomial φ(x) > 0 such that φ(x)Ta(x) is itself a multivariate polynomial. A close study of the degrees and coefficients of this polynomial leads us to the result (Appendix B). That such a φ(x) exists follows from (5): the center is a rational function, and the factor Pord(η) is also rational, being a linear combination of rational functions, as established in (9).

5. Data analysis considerations

5.1. Estimation

To deploy model (1)–(5) requires the estimation of parameters α, α0 and ν0, which are shared by the different components, as well as mixing proportions π = {πη}, which link the components together. Consider first the log likelihood for π alone (treating the shared parameters as known) under independent and identically distributed sampling from (1):

$$l(\pi) = \sum_{g=1}^{G} \log\left\{\sum_{\eta} \pi_\eta\, p(x_g \mid \eta)\right\}, \qquad (11)$$

where G is the number of genes providing data. Maximum likelihood estimation of π is buttressed by the following finding.

Theorem 4

Suppose that the component densities are linearly independent functions in the mixture of structured components model. If G is sufficiently large, then the log likelihood l(π) in (11) is strictly concave on a convex domain, and thus admits a unique maximizer π̂ = {π̂η}. This property holds almost surely over data sets.

The expectation–maximization (EM) algorithm naturally applies to approximate π̂. By strict concavity of l(π), it is not necessary to rerun EM from multiple starting points. The final estimate and resultant clustering should be insensitive to starting position, as has been found in numerical experiments. This is a convenient but unusual property in the domain of mixture-based clustering [McLachlan and Peel (2000), page 44].

In a small simulation experiment, we confirmed that our implementation of the EM algorithm was able to recover mixture proportions given sufficiently many draws from the marginal distribution (1) (data not shown).

Full maximum likelihood for both the mixing proportions and shared parameters is feasible via the EM algorithm, but this increases computational costs. In the prototype implementation used here, we fixed the shared parameters at estimates obtained from a simpler mixture model, and then ran the EM algorithm to estimate the mixing proportions. Specifically, we used the gamma–gamma method in EBarrays (www.bioconductor.org), which corresponds to mixing as in (1) but over the smaller set of unordered structures. Experiments indicated that this approximation had a small effect on the identified clusters (see Appendix D).
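A minimal EM sketch for this final step is below. It holds the shared parameters fixed, so the G × H matrix of log component densities can be precomputed once; by Theorem 4 the starting value of π is immaterial. This is an illustration under those assumptions, not the authors' released code, and the names are ours.

```python
import numpy as np

def em_mixing_proportions(log_comp, n_iter=100):
    """EM for the mixing proportions pi in (1), with log_comp the fixed
    (G, H) matrix of log p(x_g | eta).  E-step: posterior p(eta | x_g);
    M-step: pi_eta = average posterior weight across genes."""
    G, H = log_comp.shape
    pi = np.full(H, 1.0 / H)                  # any interior start works
    for _ in range(n_iter):
        a = log_comp + np.log(pi)
        a -= a.max(axis=1, keepdims=True)     # stabilize
        post = np.exp(a)
        post /= post.sum(axis=1, keepdims=True)
        pi = post.mean(axis=0)
    return pi
```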

Inference derived through the proposed parametric model is reliant to some degree on the validity of the governing assumptions. Quantile–quantile plots and plots relating sample coefficient-of-variation to sample mean provide useful diagnostics for the gamma observation-component of the model. The within-component model is restrictive in the sense that three parameters are shared among all the components (i.e., structures). This can be checked by making comparisons of inferred clusters, but only large clusters would deliver any power. Clusters reveal patterns in mean expression, while the shared parameters have more to do with variability; if other domains of statistics provide a guide, one expects that misspecifying the variance may reduce some measure of efficiency without disabling the entire procedure. The ultimate issue is whether or not the clustering method usefully represents any underlying biology. This is difficult to assess, though we examine the issue in a limited way in the examples considered next.

5.2. Example

Edwards et al. (2003) studied the transcriptional response of mouse heart tissue to oxidative stress. Three biological replicate samples were measured using Affymetrix oligonucleotide arrays at each of five time points (baseline and one, three, five and seven hours after a stress treatment) for several ages of mice. Considering the older mice for illustration, we have p = 5 distinct groups, n = 15 samples and 10,043 genes (i.e., probe sets, after pre-processing). Gene-specific moderated F-testing [Smyth (2004)] produced a list of G = 786 genes that exhibited a significant temporal response to stress at the 10% false discovery rate [by q-value; Storey and Tibshirani (2003)]. Gamma ranking involved fitting the mixture of structured components, which with p = 5 mixes over 540 distinct components. (Since we worked with significantly altered genes, we did not include the null component in which all means are equal; other aspects of model fitting and diagnosis are provided in Appendix D.) From the catalog of 540 possibilities, genes populated 23 clusters by gamma ranking, though only four clusters contained 10 or more of the G = 786 stress-responding genes (Figure 2). Most expression changes occurred between baseline and the first time point, but 30 genes (red cluster) showed significant upregulation at all but one time point, for example.

Fig. 2.

Dominant patterns of differential expression in time course data from Edwards et al. (2003). Each panel summarizes data from one cluster identified by gamma ranking (the nine largest clusters are shown). A digital code signifies the inferred ordering of the latent expected values (i.e., η, in an alternative notation). Each gene is a single line trace; triplicate measurements were reduced by averaging and then standardized for display; raw data went into the model fitting. Results are based on 100 cycles of EM to estimate mixing proportions followed by Bayes’ rule assignment.

Gamma ranking gave different results than K-means or mclust, which, respectively, found 20 and 2 clusters in Edwards’ data. Here K was chosen according to guidelines in Hastie, Tibshirani and Friedman (2001), and mclust used the Bayesian information criterion over the range from 1 to 50 clusters. Otherwise, both methods used default settings in the respective R functions (www.r-project.org). The adjusted Rand index [Hubert and Arabie (1985)], which measures dissimilarity of partitions, was 0.09 comparing gamma ranking and K-means, 0.16 for gamma ranking and mclust, while for K-means and mclust it was smaller, at 0.02.

The biological significance of clusters identified by any algorithm may be worth investigating. For example, the cluster of 30 increasing expressors includes 2 genes (Mgst1 & Gsta4) from among only 17 in the whole genome that are involved in glutathione transferase activity. Understanding the increased activity of this molecular function will give a more complete picture of the biology [e.g., Girardot, Monnier and Tricoire (2004)]. In isolation, it is difficult to see how such investigation is supportive of a given clustering approach. The benefits become more apparent when we look at many data sets and many functional categories.

5.3. Empirical study

Gamma ranking was applied to a series of 11 data sets obtained from the Gene Expression Omnibus (GEO) repository [Edgar, Domrachev and Lash (2002)]. These were all the data sets satisfying a specific and relevant query (Table 2). They represent experiments on different organisms and they exhibit a range of variation characteristics. In each case, we applied the moderated F-test and selected genes with q-value no larger than 5%. Gamma ranking and, for comparison, mclust and K-means, were applied in order to cluster genes separately for each data set. Basic facts about the identified clusters are reported in Table 2. Figure 3 shows that gamma ranking tends to produce smaller clusters than mclust and K-means, although it also has a wider size distribution; and there was a relatively low level of overlap among the three approaches.

Table 2.

Summary of 11 data sets from the Gene Expression Omnibus (GEO). GDS is the GEO data set accession number. These sets satisfied the search query from August 2008 having subset variable type time or development stage or age and having a single factor with three to eight levels. p indicates the number of groups and n is the number of samples. G indicates the number of genes deemed significantly altered by one-way moderated F-test and 0.05 FDR (limma). The remaining columns show how many clusters are found by gamma ranking with 100 EM iterations (GR), mclust (MC) and K-means (KM)

GDS Citation Organism p n G GR MC KM
2323 Coser et al. Homo sapiens 3 9 1409 11 5 13
1802 Tabuchi et al. Mus musculus 4 8 3433 49 7 10
2043 Tabuchi et al. Mus musculus 4 8 3001 51 8 18
2360 Ron et al. Mus musculus 4 9 8714 50 8 30
599 Vemula et al. Rattus norvegicus 5 10 673 42 2 40
812 Zeng et al. Mus musculus 5 17 10,982 135 7 15
1937 Pilot et al. Drosophila 5 15 7733 88 8 10
568 Welch et al. Mus musculus 6 18 3737 134 4 25
2431 Keller et al. Homo sapiens 6 18 8505 137 9 12
587 Tomczak et al. Mus musculus 7 21 860 50 2 20
586 Tomczak et al. Mus musculus 8 24 5211 118 5 20

Fig. 3.

Characteristics of clusters from an empirical study of 11 data sets.

The empirical study shows not only that gamma ranking produces substantially different clusters than popular approaches, but also that the identified clusters are significant in terms of their biological properties. Investigators often measure the biological properties of a gene cluster by identifying functional properties that seem to be over-represented in the cluster. Gene set enrichment analysis is most frequently performed by applying Fisher’s exact test to each of a long list of functional categories, testing the null hypothesis that the functional category is independent of the gene cluster [e.g., Newton et al. (2007)]. Functional categories from the Gene Ontology (GO) Consortium and the Kyoto encyclopedia (KEGG) were used to assess the biological properties of all the clusters identified in the above calculation. Specifically, we computed for each cluster a vector of p-values across GO and KEGG. Figure 4 shows the proportion of these p-values smaller than 0.05, stratified by cluster size and in comparison to results on random sets of the same size. Evidently, the clusters identified by gamma ranking contain substantial biological information.
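For example, a single enrichment p-value of the kind summarized in Figure 4 can be computed as below. The counts are hypothetical, loosely patterned on the glutathione-transferase example of Section 5.2 (a cluster of 30 genes, a category of 17 genes, roughly 10,043 genes on the array); they are not the actual contingency tables behind Figure 4.

```python
from scipy.stats import fisher_exact

# 2x2 table: rows = in/out of cluster, columns = in/out of category.
in_cl_in_cat, in_cl_out_cat = 2, 28          # hypothetical cluster of 30 genes
out_cl_in_cat = 17 - in_cl_in_cat            # category of 17 genes overall
out_cl_out_cat = 10043 - 30 - out_cl_in_cat  # everything else on the array
odds, pval = fisher_exact([[in_cl_in_cat, in_cl_out_cat],
                           [out_cl_in_cat, out_cl_out_cat]],
                          alternative="greater")  # test over-representation
print(pval)
```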

Fig. 4.

Empirical study of the association between clusters and biological function. For every cluster identified by gamma ranking (red) or mclust (green) in the data sets in Table 2, plotted is the proportion of small enrichment p-values (vertical) versus the cluster size (horizontal). The enrichment p-values are Fisher-exact-test p-values and the proportion is computed over a database of GO and KEGG pathways (Table 7). Bands indicate similar proportions computed for random sets.

Figure 4 also shows that mclust clusters carry substantial biological information, and a similar result is true for K-means (not shown). Whatever cluster signal is present in the expression data, it is evident that gamma ranking finds different aspects of this signal than do the standard approaches, while still delivering clusters that relate in some way to the biology. Gamma-ranking clusters are not anonymous sets of genes with similar expression profiles; they are sets of genes linked to an ordering pattern in the underlying means. The commonly used clustering methods are unsupervised, while gamma ranking utilizes the known grouping labels in the sample. It seems beneficial to use this grouping information; undoubtedly various schemes could be developed. By their construction, the gamma-ranking clusters have a simple interpretation in terms of sets of genes supporting particular hypotheses about changes in mean expression.

6. Count data

Microarray technology naturally leads to continuous measurements of gene expression, as modeled in Section 2, but technological advances allow investigators essentially to count the number of copies of each molecule of interest in each sample [e.g., Mortazavi et al. (2008)]. Poisson distributions are central in the analysis of such data [e.g., Marioni et al. (2008)], and gamma ranking extends readily to this case.

Briefly, data at each gene (or tag) is a vector xg = (xg,1, …, xg,n) as before, but xg,i is now a count from the ith library (rather than an expression level on the ith microarray). There may be replicate libraries within a given cellular state, and comparisons of interest may be between different cellular states. Library sizes {Ni}, say, are additional but known design parameters. Important parameters are expected counts relative to some common library size. Adopting the notation from Section 2, a cluster of libraries σ(η, k) may share their size-adjusted expected values, and so for any i ∈ σ(η, k) the observed count xg,i arises from the Poisson distribution with mean Niμk. Further, the structure η on test puts an ordering constraint μ1 < μ2 < ··· < μKη on these latent expectations. The key is to integrate away these latent expected values using a conjugate gamma prior, conditionally on the ordering. Prior to conditioning, the μk’s are independent and identically distributed gamma variables with (integer) shape α0 and rate α0ν0. Then, analogously to Theorem 1, the predictive distribution for the vector of conditionally Poisson responses is

$$p(x_g \mid \eta) = c_\eta \left(\prod_{i=1}^{n} \frac{1}{x_{g,i}!}\right) \underbrace{\left(\prod_{k=1}^{K_\eta} u_{g,k}\, \Gamma(s_{g,k} + \alpha_0)\right)}_{\operatorname{center}(\eta)} P_{\mathrm{ord}}(\eta), \qquad (12)$$

where

$$P_{\mathrm{ord}}(\eta) = P(Z_1 < Z_2 < \cdots < Z_{K_\eta})$$

with the Zk’s mutually independent gamma-distributed random variables with (gene-specific) shapes ak = α0 + sg,k and rates λk = α0ν0 + nk. In (12), the normalizing constant is

$$c_\eta = \frac{K_\eta!\, (\alpha_0\nu_0)^{\alpha_0 K_\eta}}{[\Gamma(\alpha_0)]^{K_\eta} \prod_{k=1}^{K_\eta} (\alpha_0\nu_0 + n_k)^{\alpha_0}}$$

and, further, sg,k = Σ_{i∈σ(η,k)} xg,i, nk = Σ_{i∈σ(η,k)} Ni and

$$u_{g,k} = \prod_{i \in \sigma(\eta,k)} \left(\frac{N_i}{\alpha_0\nu_0 + n_k}\right)^{x_{g,i}}.$$

Notice that in Pord(η) the event refers to an increasing sequence of gammas, rather than a decreasing sequence as in Theorem 1. This arises because for Poisson responses the conjugate prior involves a gamma distribution on the means, whereas for gamma responses the conjugate is inverse gamma on the means. For the computations to work out, the key requirement is that some monotone transformation of each latent mean has a gamma distribution. In the null structure (all means equal), Pord(η) = 1 and (12) reduces to the negative-multinomial distribution. It will be important to study the practical utility of (12) and overdispersed extensions [cf. Robinson and Smyth (2007)], but such investigation is not within the scope of the present paper. The main reason to present the finding here is to show that gamma-rank probabilities (Section 3) arise in multiple probability models.
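Computationally, the increasing-order probability needs nothing new: relabeling the variables in reverse turns it into the decreasing-order probability of Theorem 2, so the Section 3 sketch applies directly (again assuming integer shapes, hence integer α0).

```python
def gamma_rank_prob_increasing(shapes, rates):
    """P(Z_1 < Z_2 < ... < Z_K), as required by (12), computed via the
    decreasing-order routine applied to the reversed sequences."""
    return gamma_rank_prob(list(shapes)[::-1], list(rates)[::-1])
```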

7. Concluding remarks

Calculations presented here consider a discrete mixture model and the resulting clustering for gene-expression or similar data types. The discrete mixing is over patterns of equality and inequality among latent expected values (ordered structures). Clustering by these patterns addresses the important biological problem of organizing gene expression relative to various cellular states, which is part of the larger task of determining biological function. In the examples the method was applied after a round of feature selection, although it could have been applied to each full data set (i.e., by including the null structure in the mix) and it could have been the basis of a more comprehensive analysis, going beyond clustering and towards hypothesis testing and error-rate-controlled gene lists. Our more conservative line is attributable in part to an incomplete understanding of the method’s robustness. Relaxing the fixed-coefficient-of-variation assumption, as in Lo and Gottardo (2007) or Rossell (2009), could be considered to address the problem. The focus on clustering, however, is motivated largely by its practical utility in the context of genomic data analysis.

By cataloging ordered structures, rather than the smaller set of unordered structures, the mixture model produces readily interpretable clusters in the multi-group setting. Jensen et al. (2009) argues similarly. For example, the largest cluster of temporally responsive genes in Edwards’ data consists of genes upregulated immediately after treatment that show no significant fluctuations thereafter. The development of calculations for ordered structures has been more challenging than for unordered structures, which were presented in Kendziorski et al. (2003) and implemented in the Bioconductor package EBarrays. Mixture calculations are simplified in the unordered case because component densities reduce by factorization to elementary products [i.e., the last factor in (5) is not present]. The requirement to compute gamma-rank probabilities had limited a fuller development.

Gamma ranking produces clusters indexed by patterns of expected expression rather than anonymous clusters defined by high similarity of their contents. A referee noted that large gamma-ranking clusters may tend to swallow up genes more easily than small clusters because the estimated posterior assignment probability is proportional to the estimate of the mixing weight πη: that is, structures with large πη have a head start in the allocation of genes. On one hand, this provides an efficiency which may be advantageous for genes that have a relatively weak signal (and which otherwise might be assigned to a more null-like structure). It also implies that small clusters are more reliable, in a way, since the assigned genes have made it in spite of the small πη. Another feature of gamma ranking is that clusters can be tuned by a threshold parameter c, as in (2), rather than being determined by Bayes’s rule assignment. Taking c close to 1 tends to purify the clusters; the more equivocal genes drop into an unassigned category. Empirically, such swallowing up may not be substantial; at least in comparison to the simpler clustering methods analyzed, gamma ranking produces more and smaller clusters.

There is nothing explicit in gamma ranking that attends to the temporal dependence which might seem to be involved in time-course data. Independent cell lines were grown in the Edwards’ experiment, one for each microarray, and so there is independent sampling in spite of a time component. Additionally, the model imposes dependencies in (5) driven by whichever structure η governs data at a given gene. If there were complicated temporal dependence, the identified clusters would still reflect genes that act in concert in this experiment; they might act in concert by a different η in another run of the experiment, and we would not be confident in the fitted proportions, even though the clusters may continue to be informative. Neither does the model have explicit dependence among genes; but it produces clusters of genes that seem to be highly associated (genes that realize the same structure η seem to present correlated data). This shows that a sufficiently rich hierarchical model, based on lots of conditional independence, can represent characteristics of dependent data. Of course, care is needed since the sampling distribution of parameter estimates is affected by the intrinsic dependencies within the data generating mechanism.

The mixture framework from Kendziorski et al. (2003) has supported a number of extensions to related problems in statistical genomics: Yuan and Kendziorski (2006b) (time-course data), Kendziorski et al. (2006) (mapping expression traits) and Keles (2007) (localizing transcription factors). The ability to monitor ordered structures may have some application in these problems. Further, the ability to compute gamma-rank probabilities may have application in distinct inference problems [e.g., Doksum and Ozeki (2009)]. Future work includes developing a better software implementation of gamma ranking, enabling the implementation to have additional flexibility (e.g., gene-specific shapes α), studying the method’s sampling properties and exploring extensions to emerging data sources.

Acknowledgments

We thank Lev Borisov for the proof of Lemma 1, and Christina Kendziorski, an Associate Editor and two referees for comments that greatly improved the development of this work.

APPENDIX A: PROOF OF THEOREM 2

Let gk(z) denote the density of a gamma distribution with shape ak and rate λk. By definition

$$P(E) = \int_0^{\infty} \int_{z_K}^{\infty} \cdots \int_{z_2}^{\infty} \left[\prod_{k=1}^{K} g_k(z_k)\right] dz_1 \cdots dz_{K-1}\, dz_K = \int_0^{\infty} g_K(z_K) \int_{z_K}^{\infty} g_{K-1}(z_{K-1}) \cdots \int_{z_2}^{\infty} g_1(z_1)\, dz_1 \cdots dz_{K-1}\, dz_K,$$

where in the second line we move factors in the integrand as far as possible to the left. With this in mind we construct functions fk(z), z ≥ 0, recursively as f0(z) = 1 and, for k = 1, 2, …, K,

$$f_k(z) = \int_z^{\infty} f_{k-1}(u)\, g_k(u)\, du, \qquad (13)$$

and we observe that P(E) = fK(0). Evaluating these functions further, we see

$$f_1(z) = \int_z^{\infty} g_1(z_1)\, dz_1 = P(Z_1 \ge z) = P(M_1 < a_1 \mid Z_2 = z) = \sum_{m_1=0}^{a_1-1} \operatorname{po}(m_1; \lambda_1 z).$$

Here M1 = 𝕄1(0, Z2] is Poisson(λ1z) distributed conditionally upon Z2 = z, and po(·) indicates the Poisson probability mass function with the indicated parameter. The equivalence in the second and third lines above stems from basic relationships between objects in the underlying Poisson processes. As long as M1 is small, it means that the ℕ1 process has not accumulated many points up to time Z2 = z, and hence the Z1 value must be relatively large. More basically,

$$P(U > u) = P(X < a), \qquad (14)$$

when U ~ Gamma(a, λ) and X ~ Poisson(λu), for integer shapes a.
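Identity (14) is the usual gamma–Poisson duality; a quick numerical check (parameter values arbitrary):

```python
from scipy.stats import gamma, poisson

a, lam, u = 4, 2.5, 1.3
print(gamma.sf(u, a, scale=1 / lam))   # P(U > u), U ~ Gamma(a, rate lam)
print(poisson.cdf(a - 1, lam * u))     # P(X < a),  X ~ Poisson(lam * u)
```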

Proceeding to f2(z),

$$f_2(z) = \int_z^{\infty} f_1(z_2)\, g_2(z_2)\, dz_2 = \sum_{m_1=0}^{a_1-1} \int_z^{\infty} \operatorname{po}(m_1; \lambda_1 z_2)\, g_2(z_2)\, dz_2 = \sum_{m_1=0}^{a_1-1} p_1(m_1) \int_z^{\infty} \frac{\operatorname{po}(m_1; \lambda_1 z_2)\, g_2(z_2)}{p_1(m_1)}\, dz_2.$$

Here p1(m1) is the probability mass function of a negative-binomial distribution, as in (8). Indeed, we have reorganized the summand above to highlight that the integrand on the far right is precisely the density function of a gamma-distributed variable with shape a2 + m1 and rate λ1 + λ2. This reflects the Poisson–gamma conjugacy of ordinary Bayesian analysis [e.g., Gelman et al. (2004), pages 52 and 53]. The integral evaluates to 1 if z = 0, and hence we have proved the case K = 2. But furthermore, the integral represents the chance that a gamma-distributed variable is large, and so by (14)

$$f_2(z) = \sum_{m_1=0}^{a_1-1} \sum_{m_2=0}^{m_1+a_2-1} p_1(m_1)\, \operatorname{po}(m_2; (\lambda_1+\lambda_2) z) = \sum_{m_1=0}^{a_1-1} \sum_{m_2=0}^{m_1+a_2-1} p_1(m_1)\, \operatorname{po}(m_2; \Lambda_2 z).$$

The base case of an induction proof has been established. Assume that for some k ≥ 3,

$$f_{k-1}(z) = \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-2} p_j(m_j)\right) \operatorname{po}(m_{k-1}; \Lambda_{k-1} z) \qquad (15)$$

and then evaluate (13) to obtain

$$\begin{aligned}
f_k(z) &= \int_z^{\infty} f_{k-1}(z_k)\, g_k(z_k)\, dz_k \\
&= \int_z^{\infty} \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-2} p_j(m_j)\right) \operatorname{po}(m_{k-1}; \Lambda_{k-1} z_k)\, g_k(z_k)\, dz_k \\
&= \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-2} p_j(m_j)\right) \int_z^{\infty} \operatorname{po}(m_{k-1}; \Lambda_{k-1} z_k)\, g_k(z_k)\, dz_k \\
&= \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-1} p_j(m_j)\right) \int_z^{\infty} \frac{\operatorname{po}(m_{k-1}; \Lambda_{k-1} z_k)\, g_k(z_k)}{p_{k-1}(m_{k-1})}\, dz_k \\
&= \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_{k-1}=0}^{m_{k-2}+a_{k-1}-1} \left(\prod_{j=1}^{k-1} p_j(m_j)\right) \sum_{m_k=0}^{m_{k-1}+a_k-1} \operatorname{po}(m_k; \Lambda_k z) \\
&= \sum_{m_1=0}^{a_1-1} \cdots \sum_{m_k=0}^{m_{k-1}+a_k-1} \left(\prod_{j=1}^{k-1} p_j(m_j)\right) \operatorname{po}(m_k; \Lambda_k z),
\end{aligned}$$

which establishes that (15) is true for all k. Evaluating at k = K and z = 0 establishes the theorem.

Coda

Further insight is gained by realizing from the definition of the counts that

$$\mathbb{M}_k(0, Z_k] = \mathbb{M}_{k-1}(0, Z_k] + a_k = M_{k-1} + a_k.$$

But 𝕄k also has a jump at Zk, and so we see the equivalence

$$Z_k > Z_{k+1} \quad\Longleftrightarrow\quad M_k < M_{k-1} + a_k. \qquad (16)$$

The event E is an intersection of these pairwise events, and this is manifested in the ranges of summation in (9). In contrast to (9), these event considerations give P (E) equal to

$$\sum_{m_1=0}^{a_1-1} \sum_{m_2=0}^{m_1+a_2-1} \cdots \sum_{m_{K-1}=0}^{m_{K-2}+a_{K-1}-1} p_{\mathrm{joint}}(m_1, m_2, \ldots, m_{K-1}). \qquad (17)$$

The implication seems to be that M1, M2, …, MK−1 are mutually independent, though Theorem 2 does not confirm this because the factorization into negative binomials is required for all arguments, beyond what is shown. It is a conjecture that the {Mk} are mutually independent. A proof by brute-force evaluation in the special cases K = 3 and K = 4 is available (not shown), but we have not found a general proof. The fact is somewhat surprising because the {𝕄k} processes are highly positively dependent. The independence seems to emerge as a balance between this positive dependence and the negative association created by Zk being inversely related to 𝕄k(0, t].

APPENDIX B: LINEAR INDEPENDENCE AND PROOF OF THEOREM 3

Consider the three-dimensional case, and initially consider a single replicate in each of the three groups. Data on each gene form the vector (x, y, z), say, of three positive reals. Thirteen component densities p(x, y, z|η) constitute the mixture model (Table 3). For a vector a = (aη) of reals, the test function is Ta(x, y, z) = Σ_η aη p(x, y, z|η). It needs to be shown that if Ta(x, y, z) = 0 for all x, y, z > 0, then aη = 0 for all structures η. Specializing (5) to this case, and eliminating the positive factor (xyz)^{α−1}, we see that Ta(x, y, z) = 0 is equivalent to

$$\sum_{\eta} a_\eta c_\eta\, \operatorname{center}(\eta)\, P_{\mathrm{ord}}(\eta) = 0. \qquad (18)$$

Table 3.

Thirteen structured components p(x, y, z|η) = cη (xyz)^{α−1} center(η) Pord(η) in the three-dimensional, no-replicate case. The forms have been simplified, w.l.o.g., by taking the scale ν0 = 1 and writing β = α0 + α and ξ = α0/α. Normalizing constants cη are as in (5). The em and em,n stand for constants (not involving x, y, z), possibly differing among rows. [center(η)]^{−1} depends only on the unordered partition, and so is shared ("same") by the orderings of each partition; for the three-block structures, every double sum runs over m = 0, …, β − 1 and n = 0, …, m + β − 1.

Structure η | [center(η)]^{−1} | Pord(η)
(123) | (x+y+z+ξ)^{β+2α} | 1
(12)(3) | (x+y+ξ)^{β+α} (z+ξ)^{β} | Σ_{m=0}^{β+α−1} em (z+ξ)^{β} (x+y+ξ)^{m} / (x+y+z+2ξ)^{β+m}
(3)(12) | same | Σ_{m=0}^{β−1} em (z+ξ)^{m} (x+y+ξ)^{β+α} / (x+y+z+2ξ)^{β+α+m}
(13)(2) | (x+z+ξ)^{β+α} (y+ξ)^{β} | Σ_{m=0}^{β+α−1} em (y+ξ)^{β} (x+z+ξ)^{m} / (x+y+z+2ξ)^{β+m}
(2)(13) | same | Σ_{m=0}^{β−1} em (y+ξ)^{m} (x+z+ξ)^{β+α} / (x+y+z+2ξ)^{β+α+m}
(1)(23) | (y+z+ξ)^{β+α} (x+ξ)^{β} | Σ_{m=0}^{β−1} em (x+ξ)^{m} (y+z+ξ)^{β+α} / (x+y+z+2ξ)^{β+α+m}
(23)(1) | same | Σ_{m=0}^{β+α−1} em (x+ξ)^{β} (y+z+ξ)^{m} / (x+y+z+2ξ)^{β+m}
(1)(2)(3) | [(x+ξ)(y+ξ)(z+ξ)]^{β} | Σ_m Σ_n em,n (x+ξ)^{m} [(y+ξ)(z+ξ)]^{β} (x+y+2ξ)^{n} / [(x+y+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(2)(1)(3) | same | Σ_m Σ_n em,n (y+ξ)^{m} [(x+ξ)(z+ξ)]^{β} (x+y+2ξ)^{n} / [(x+y+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(1)(3)(2) | same | Σ_m Σ_n em,n (x+ξ)^{m} [(y+ξ)(z+ξ)]^{β} (x+z+2ξ)^{n} / [(x+z+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(2)(3)(1) | same | Σ_m Σ_n em,n (y+ξ)^{m} [(z+ξ)(x+ξ)]^{β} (y+z+2ξ)^{n} / [(y+z+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(3)(1)(2) | same | Σ_m Σ_n em,n (z+ξ)^{m} [(x+ξ)(y+ξ)]^{β} (x+z+2ξ)^{n} / [(x+z+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]
(3)(2)(1) | same | Σ_m Σ_n em,n (z+ξ)^{m} [(x+ξ)(y+ξ)]^{β} (y+z+2ξ)^{n} / [(y+z+2ξ)^{β+m} (x+y+z+3ξ)^{β+n}]

A strictly positive multivariate polynomial φ(x, y, z) is required that can convert the left-hand side of (18) into a polynomial by the canceling of denominator factors. Specifically, φ = φ1φ2 where φ1(x, y, z) controls factors in center(η) and φ2(x, y, z) controls factors in Pord(η). Inspection suggests taking φ1(x, y, z) equal to

$$(x+y+z+\xi)^{\beta+2\alpha}\, [(x+y+\xi)(x+z+\xi)(y+z+\xi)]^{\beta+\alpha}\, [(x+\xi)(y+\xi)(z+\xi)]^{\beta}$$

and φ2(x, y, z) equal to

$$(x+y+z+2\xi)^{2\beta+\alpha-1}\, [(x+y+2\xi)(x+z+2\xi)(y+z+2\xi)]^{2\beta-1} \times (x+y+z+3\xi)^{3\beta-2}.$$

Observe that the degree of x in the polynomial φ = φ1φ2 is 13β + 5α − 5. Indeed this is also the degree of y and the degree of z, by symmetry. These degrees are reduced in the polynomial fη = φ(x, y, z) center(η)Pord(η) by factors in the denominators of center(η) and Pord(η). For example, if η = (12)(3), then

$$\begin{aligned}
f_\eta = {} & (x+y+z+\xi)^{\beta+2\alpha}\, [(x+z+\xi)(y+z+\xi)]^{\beta+\alpha}\, [(x+\xi)(y+\xi)]^{\beta} \\
&\times [(x+y+2\xi)(x+z+2\xi)(y+z+2\xi)]^{2\beta-1}\, (x+y+z+3\xi)^{3\beta-2} \\
&\times \sum_{m=0}^{\beta+\alpha-1} e_m\, (z+\xi)^{\beta} (x+y+\xi)^{m} (x+y+z+2\xi)^{\beta+\alpha-1-m},
\end{aligned}$$

which is a polynomial of degree 11β + 4α − 5 in both x and y, and of degree 12β + 5α − 5 in z. A similar construction is possible for all structures; Table 4 records the degrees of x, y and z in all component polynomials fη.

Table 4.

Degree of x, y and z in the multivariate polynomials fη = φ(x, y, z) center(η)Pord(η). Recall β = α0 + α and both α and α0 are positive integers

Structure η Degree(x) Degree(y) Degree(z)
(123) 11β + 4α − 4 11β + 4α − 4 11β + 4α − 4
(12)(3) 11β + 4α − 5 11β + 4α − 5 12β + 5α − 5
(3)(12) 12β + 4α − 5 12β + 4α − 5 11β + 4α − 5
(13)(2) 11β + 4α − 5 12β + 5α − 5 11β + 4α − 5
(2)(13) 12β + 4α − 5 11β + 4α − 5 12β + 4α − 5
(1)(23) 11β + 4α − 5 12β + 4α − 5 12β + 4α − 5
(23)(1) 12β + 5α − 5 11β + 4α − 5 11β + 4α − 5
(1)(2)(3) 10β + 5α − 5 11β + 5α − 5 12β + 5α − 5
(2)(1)(3) 11β + 5α − 5 10β + 5α − 5 12β + 5α − 5
(1)(3)(2) 10β + 5α − 5 12β + 5α − 5 11β + 5α − 5
(2)(3)(1) 12β + 5α − 5 10β + 5α − 5 11β + 5α − 5
(3)(1)(2) 11β + 5α − 5 12β + 5α − 5 10β + 5α − 5
(3)(2)(1) 12β + 5α − 5 11β + 5α − 5 10β + 5α − 5

Having introduced the multiplier φ, the linear independence (18) is equivalent to the assertion that the polynomial equation

$$\sum_{\eta} a_\eta c_\eta f_\eta(x, y, z) = 0 \qquad \text{for all } x, y, z > 0 \qquad (19)$$

implies aη = 0 for all η. Fixing any y and z, the left-hand side of equation (19) is a polynomial in x with degree 12β + 5α − 5, according to Table 4. Indeed, terms associated with structures η = (23)(1), (2)(3)(1) and (3)(2)(1) all contribute monomials with that highest power in x. The coefficient of x^{12β+5α−5}, denoted d = d(a, y, z), equals

$$a_{(23)(1)}\, c_{(23)(1)}\, f'_{(23)(1)} + a_{(2)(3)(1)}\, c_{(2)(3)(1)}\, f'_{(2)(3)(1)} + a_{(3)(2)(1)}\, c_{(3)(2)(1)}\, f'_{(3)(2)(1)},$$

where f′ indicates contributions from respective terms within fη. This coefficient d must equal zero, for all y and z; after all, a degree 12β + 5α − 5 polynomial can equal zero in x for at most that many x values, unless the coefficient d is exactly zero; and we are asking that it equal zero at all values of x. From this study of the high-power coefficient in x, we have reduced consideration to three structures and are able to focus on d = d(a, y, z) as a bivariate polynomial in y and z (Table 5).

Table 5.

Degrees of y and z in three terms of the bivariate polynomial d(a, y, z). This is a subset of Table 4

Structure η Degree(y) Degree(z)
(23)(1) 11β + 4α − 5 11β + 4α − 5
(2)(3)(1) 10β + 5α − 5 11β + 5α − 5
(3)(2)(1) 11β + 5α − 5 10β + 5α − 5

The initial argument focusing on the degree of x can be adapted to study other variables in Table 5. With the degree of y equal to 11β + 5α − 5, for instance, it must be that the coefficient d′(z), say, of y^{11β+5α−5} equals zero for all z; after all, the polynomial can equal zero at at most 11β + 5α − 5 values of y, and we require it to be zero at all y. But the only contribution to this coefficient comes from the (3)(2)(1) term, and its multiplier is strictly positive, hence we conclude a(3)(2)(1) = 0. By the same token, working with the degree 11β + 5α − 5 term in z, it follows that a(2)(3)(1) = 0, which then forces a(23)(1) = 0, because we require d = 0 overall. Three rows from Table 4 have been eliminated (i.e., forced aη = 0): all those in which the mean of the first variable is greater than the other two means. Next, return to the reduced table, and focus, say, on structures (13)(2), (1)(3)(2) and (3)(1)(2), in which the second variable has mean greater than the others. In doing so, three more coefficients a(13)(2) = a(1)(3)(2) = a(3)(1)(2) = 0 are forced, and Table 4 is further reduced to seven rows. Then the argument is repeated to get a(12)(3) = a(1)(2)(3) = a(2)(1)(3) = 0, and it remains to assess coefficients aη of the four structures in Table 6.

Table 6.

Final subtable

Structure η Degree(x) Degree(y) Degree(z)
(123) 11β + 4α − 4 11β + 4α − 4 11β + 4α − 4
(3)(12) 12β + 4α − 5 12β + 4α − 5 11β + 4α − 5
(2)(13) 12β + 4α − 5 11β + 4α − 5 12β + 4α − 5
(1)(23) 11β + 4α − 5 12β + 4α − 5 12β + 4α − 5

The argument is repeated in this domain, knowing that all but four terms in (19) have been eliminated. The degree of x is 12β + 4α − 5, and there are contributions from both η = (3)(12) and η = (2)(13). But then, restricted to these rows, we get a(3)(12) = 0 because the coefficient of x^{12β+4α−5}, viewed as a polynomial in y, attains its top degree 12β + 4α − 5 only through the (3)(12) term. The remaining constants aη are similarly zero, completing the proof in the no-replicate (m = 1), three-group (p = 3) case.

The balanced three-group case follows suit, noting that now x, y and z are sums taken, respectively, across replicates in each of the three groups. The product statistic is not xyz, but in any case it is common to all components and thus cancels in the linear-combination test function. The observation-related shape parameter α is replaced by mα. The two-dimensional (p = 2) case is simpler and is left as an exercise.

APPENDIX C: STRICT CONCAVITY OF LOG-LIKELIHOOD AND PROOF OF THEOREM 4

Let q denote the number of nonnull structures, and consider the log-likelihood l(π) in (11) to be defined on ℝ^q, with the null probability defined secondarily as π0 = 1 − Σ_{η≠η0} πη. This way we need not invoke Lagrange multipliers to compute derivatives of l(π). By calculus, the q × q Hessian H of negative second derivatives of l(π) has (i, j)th entry

$$H_{ij} = \sum_{g} \frac{[p(x_g \mid \eta_i) - p(x_g \mid \eta_0)]\,[p(x_g \mid \eta_j) - p(x_g \mid \eta_0)]}{[p(x_g)]^2} = \sum_{g} f_i(x_g)\, f_j(x_g),$$

where p(xg) is the marginal density obtained by mixing over structures, as in (1), and fi (x) = [p(x|ηi) − p(x|η0)]/p(x). Now let a = (aη) be a q-vector of constants. To determine curvature of the log-likelihood we consider the quadratic form

$$a^{T} H a = \sum_{i=1}^{q} \sum_{j=1}^{q} a_i a_j \sum_{g} f_i(x_g) f_j(x_g) = \sum_{g} \left(\sum_{i=1}^{q} a_i f_i(x_g)\right)^{2} = \sum_{g} [T_a(x_g)]^2,$$

where Ta(x) = Σ_{i=1}^{q} ai fi(x). Clearly, aᵀHa ≥ 0 regardless of a, and so H is nonnegative definite and l(π) is concave. To establish strict concavity requires showing that Ta(xg) = 0 for all g forces a = 0. The following lemma shows that knowing Ta(xg) = 0 at all G values xg is enough to force Ta(x) = 0 for all x, as long as G is sufficiently large. But then a = 0 by the linear independence assumption, completing the proof.

Lemma 1

Let ψ (x) be a multivariate polynomial in x ∈ ℝn, and let X1, X2, …, Xm denote a random sample from a continuous distribution on ℝn. If m is at least as large as the number of monomials in ψ, then, with probability one, ψ (Xi ) = 0 for i = 1, 2, …, m implies ψ (x) = 0 for all x.

Proof

Every point Xi puts a linear condition on the space of coefficients of ψ. It needs to be verified that these conditions are linearly independent. Suppose that the first k conditions are linearly independent, so the space of ψ’s that are zero at X1, …, Xk has dimension (number of monomials in ψ) minus k. Pick one such nonzero polynomial and call it φ. Since φ = 0 is a set with positive codimension, we may assume (with probability one) that φ(Xk+1) is not zero. Then, if we impose the additional condition ψ(Xk+1) = 0, the dimension of the solution space drops by at least one, hence it drops by one. Letting k increase from 1 to m completes the proof.

APPENDIX D: FURTHER DETAILS OF NUMERICAL EXAMPLES

The parameters α, α0 and ν0 were fixed at values obtained by first fitting the unordered gamma–gamma mixture model in EBarrays, without a null structure but otherwise allowing all possible unordered structures. Shape parameters were then rounded to the nearest positive integer (Table 7) and all three parameters were plugged into the EM procedure to fit the proposed mixture-model proportions. [Recall that Pord(η) in (5) can be computed only for integer shapes, hence the rounding.] To simplify EM calculations in the four examples having more than five groups, the full set of ordered structures was filtered to a reduced set based on the fitting of the unordered gamma–gamma model in EBarrays. Each ordered structure corresponds to exactly one unordered structure (a many to one mapping). If no gene had a high (greater than 0.5) probability of mapping to a given unordered structure, then we deemed all corresponding ordered structures to have πη = 0. This approximation is not ideal, since the Bayes rule assignment for some genes may be one of the structures eliminated by forcing πη = 0. This affects only 29 genes out of the 17,539 clustered in these four cases. It would not affect clustering by a high threshold.

Table 7.

Parameter estimates (not including mixing proportions) from the examples analyzed. The last column indicates the number of functional categories in GO and KEGG having at least five annotated genes, which were used in the development of Figure 4. KEGG was not available for GDS1937, and so this data set was not used in Figure 4

Data set α α0 ν0 # GO/KEGG
Edwards 113 1 586.5
GDS2323 14 1 119.1 3849/184
GDS1802 17 1 46.8 3619/182
GDS2043 22 1 47.7 3619/182
GDS2360 8 1 15.8 3258/175
GDS599 12 1 0.01 3180/159
GDS812 5 1 15.4 3258/175
GDS1937 6 1 20.5 NA
GDS568 10 1 37.1 3258/175
GDS2431 4 1 67.6 4085/188
GDS587 8 1 9999.2 1876/127
GDS586 13 1 4566.2 3258/175

For all data sets, we examined quantile–quantile plots and plots relating sample coefficient of variation to sample mean. Some model violations were noted, but largely the gamma observation model was supported.

For the Edwards data, we reran the EM algorithm for 10 cycles and updated shape parameter estimates via 2D grid search in each cycle. Estimated shapes changed slightly; 784/786 genes received the same Bayes rule cluster assignment.

Computations were done in R on industry-standard Linux machines. For the data sets analyzed, run times ranged from 6 to 860 CPU seconds per EM iteration, with a mean of 270 seconds. Run time is affected by the number of genes analyzed, the number of groups, and also by the shape parameters and sample sizes.

Footnotes

1. Supported in part by NIH Grants R01 ES017400 and T32 GM074904.

Contributor Information

Michael A. Newton, Email: newton@stat.wisc.edu.

Lisa M. Chung, Email: lchung@stat.wisc.edu.

References

1. Campbell EA, O’Hara L, Catalano RD, Sharkey AM, Freeman TC, Johnson MH. Temporal expression profiling of the uterine luminal epithelium of the pseudo-pregnant mouse suggests receptivity to the fertilized egg is associated with complex transcriptional changes. Human Reproduction. 2006;21:2495–2513. doi:10.1093/humrep/del195.
2. Dennis B, Patil GP. The gamma distribution and weighted multimodal gamma distributions as models of population abundance. Math Biosci. 1984;68:187–212. MR0738902.
3. Doksum KA, Ozeki A. Semiparametric models and likelihood—the power of ranks. In: Rojo J, editor. Optimality: The Third Erich L. Lehmann Symposium. IMS Lecture Notes—Monograph Series, Vol. 57. IMS; Beachwood, OH: 2009. pp. 67–92.
4. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research. 2002;30:207–210. doi:10.1093/nar/30.1.207.
5. Edwards M, Sarkar D, Klopp R, Morrow J, Weindruch R, Prolla T. Age-related impairment of the transcriptional response to oxidative stress in the mouse heart. Physiological Genomics. 2003;13:119–127. doi:10.1152/physiolgenomics.00172.2002.
6. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868. doi:10.1073/pnas.95.25.14863.
7. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Amer Statist Assoc. 2002;97:611–631. MR1951635.
8. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd ed. Chapman and Hall; Boca Raton, FL: 2004. MR2027492.
9. Girardot F, Monnier V, Tricoire H. Genome wide analysis of common and specific stress responses in adult Drosophila melanogaster. BMC Genomics. 2004;5:74. doi:10.1186/1471-2164-5-74.
10. Grasso LC, Maindonald J, Rudd S, Hayward DC, Saint R, Miller DJ, Ball EE. Microarray analysis identifies candidate genes for key roles in coral development. BMC Genomics. 2008;9:540. doi:10.1186/1471-2164-9-540.
11. Greenwood M, Yule GU. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J Roy Statist Soc Ser A. 1920;83:255–279.
12. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer; New York: 2001. MR1851606.
13. Hubert L, Arabie P. Comparing partitions. J Classification. 1985;2:193–218.
14. Holzmann H, Munk A, Gneiting T. Identifiability of finite mixtures of elliptical distributions. Scand J Statist. 2006;33:753–763. MR2300914.
15. Hutchinson TP. Compound gamma bivariate distributions. Metrika. 1981;28:263–271. MR0642934.
16. Jensen ST, Erkan I, Arnardottir ES, Small DS. Bayesian testing of many hypotheses × many genes: A study of sleep apnea. Ann Appl Statist. 2009;3:1080–1101.
17. Keles S. Mixture modeling for genome-wide localization of transcription factors. Biometrics. 2007;63:10–21. doi:10.1111/j.1541-0420.2005.00659.x. MR2345570.
18. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med. 2003;22:3899–3914. doi:10.1002/sim.1548.
19. Kendziorski CM, Chen M, Yuan M, Lan H, Attie AD. Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics. 2006;62:19–27. doi:10.1111/j.1541-0420.2005.00437.x. MR2226552.
20. Kschischang FR, Frey BJ, Loeliger HA. Factor graphs and the sum-product algorithm. IEEE Trans Inform Theory. 2001;47:498–519. MR1820474.
21. Lo K, Gottardo R. Flexible empirical Bayes models for differential gene expression. Bioinformatics. 2007;23:328–335. doi:10.1093/bioinformatics/btl612.
22. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research. 2008;18:1509–1517. doi:10.1101/gr.079558.108.
23. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Chapman and Hall; London: 1989. MR0727836.
24. McLachlan GJ, Basford KE. Mixture Models: Inference and Applications to Clustering. Dekker; New York: 1988. MR0926484.
25. McLachlan GJ, Peel D. Finite Mixture Models. Wiley; New York: 2000. MR1789474.
26. Medvedovic M, Yeung KY, Bumgarner RE. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics. 2004;20:1222–1232. doi:10.1093/bioinformatics/bth068.
27. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008;5:621–628. doi:10.1038/nmeth.1226.
28. Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5:155–176. doi:10.1093/biostatistics/5.2.155.
29. Newton MA, Quintana FA, den Boon JA, Sengupta S, Ahlquist P. Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann Appl Statist. 2007;1:85–106. MR2393842.
30. Parmigiani G, Garrett ES, Irizarry RA, Zeger SL, editors. The Analysis of Gene Expression Data: Methods and Software. Springer; New York: 2003. MR2001388.
31. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77:257–286.
32. Redner RA, Walker HF. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 1984;26:195–239. MR0738930.
33. Rempala GA, Pawlikowska I. Limit theorems for hybridization reactions on oligonucleotide microarrays. J Multivariate Anal. 2008;99:2082–2095. MR2466552.
34. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–2887. doi:10.1093/bioinformatics/btm453.
35. Rossell D. GaGa: A parsimonious and flexible model for differential expression analysis. Ann Appl Statist. 2009;3:1035–1051.
36. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article 3 (electronic). doi:10.2202/1544-6115.1027. MR2101454.
37. Sobel M, Frankowski K. The 500th anniversary of the sharing problem (the oldest problem in the theory of probability). Amer Math Monthly. 1994;101:833–847. MR1300489.
38. Speed T, editor. Statistical Analysis of Gene Expression Microarray Data. CRC Press; Boca Raton, FL: 2004.
39. Stephens M. Dealing with label switching in mixture models. J R Stat Soc Ser B Stat Methodol. 2000;62:795–809. MR1796293.
40. Stern H. Models for distributions on permutations. J Amer Statist Assoc. 1990;85:558–564.
41. Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc Natl Acad Sci USA. 2003;100:9440–9445. doi:10.1073/pnas.1530509100. MR1994856.
42. Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006;22:2405–2412. doi:10.1093/bioinformatics/btl406.
43. Titterington DM, Smith AFM, Makov UE. Statistical Analysis of Finite Mixture Distributions. Wiley; New York: 1985. MR0838090.
44. Yakowitz SJ, Spragins JD. On the identifiability of finite mixtures. Ann Math Statist. 1968;39:209–214. MR0224204.
45. Yuan M, Kendziorski CM. A unified approach for simultaneous gene clustering and differential expression identification. Biometrics. 2006a;62:1089–1098. doi:10.1111/j.1541-0420.2006.00611.x. MR2297680.
46. Yuan M, Kendziorski CM. Hidden Markov models for time course data in multiple biological conditions (with discussion). J Amer Statist Assoc. 2006b;101:1323–1340. MR2307565.
