Abstract
Breitling and colleagues [1] introduced a statistical technique, the rank product method, for detecting differentially regulated genes in replicated microarray experiments. The technique has achieved widespread acceptance and is now used more broadly, in such diverse fields as RNAi analysis, proteomics, and machine learning. In this note, we relate the rank product method to linear rank statistics and provide an alternative derivation of distribution theory attending the rank product method.
Keywords: Rank product method, rank statistics, Wilcoxon scores, van der Waerden scores, gamma approximations, Edgeworth approximations
Introduction
In an influential paper, Breitling and colleagues [1] introduced a statistical technique for detecting differentially regulated genes in replicated microarray experiments. Their rank product method entails ranking expression levels within each replicate, then computing the product of the ranks across the replicates. The rank product is then compared to its sampling distribution under a permutation model for subsequent inference. The rank product method appears to be robust, with higher sensitivity and specificity than t-test methodology and desirable operating characteristics, as demonstrated in extensive numerical studies [2-5]. Although developed originally for microarrays, the rank product method has found widespread acceptance in diverse settings, e.g., RNAi analysis [6], proteomics [7], and machine learning model selection [8].
The purpose of this note is to provide an alternative method for establishing distributional properties of the rank product statistic, based on the classical notion of linear rank statistics [9]. This approach affords insight into theoretical properties of the rank product method, and leads to useful extensions.
The rank product statistic
We briefly describe the rank product statistic. We start with expression levels for n genes from k replicate samples. Denote the expression level for the ith gene in the jth replicate by Xij, where 1 ≤ i ≤ n, 1 ≤ j ≤ k. Next, rank the expression levels X1j, X2j, …, Xnj within each replicate j, forming Rij = rank(Xij), 1 ≤ Rij ≤ n. [The ranking is such that genes with the smallest ranks are the “most interesting” from a biological perspective.] Then, Breitling's rank product statistic for the ith gene is, up to a normalization constant, the product
$$RP_i = \prod_{j=1}^{k} R_{ij}.$$
Genes associated with sufficiently small RP values would be marked for further consideration, and Breitling and colleagues [1] posit a statistical formalism for such a determination. In particular, they propose a permutation approach to calculate the distribution of the RPi under the null hypothesis that the Xij are identically distributed (exchangeable) within each of the k independent replicates.
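As an illustration, the construction of the rank products can be sketched in a few lines of Python. This is a minimal sketch, not Breitling's implementation: the function names and the toy data are hypothetical, ranks are assigned in ascending order (so that the smallest expression level receives rank 1), and ties are broken by position.

```python
import math

def rank_within_replicate(values):
    """Return ranks 1..n for one replicate (1 = smallest value; ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def rank_products(expr):
    """expr[j][i] is the level of gene i in replicate j; returns RP_i for every gene."""
    rank_rows = [rank_within_replicate(replicate) for replicate in expr]
    n = len(expr[0])
    return [math.prod(row[i] for row in rank_rows) for i in range(n)]

# Hypothetical toy data: k = 3 replicates (rows) of n = 4 genes (columns).
expr = [
    [0.2, 1.5, 0.9, 2.1],
    [0.1, 1.1, 1.4, 2.5],
    [0.4, 0.3, 1.0, 1.9],
]
rp = rank_products(expr)
print(rp)
```

If large expression levels are the ones of interest, the ranking direction would simply be reversed before forming the products.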
An alternative formulation
An equivalent statistic to RPi is the monotone transformation
$$\log(RP_i) = \sum_{j=1}^{k} \log(R_{ij}).$$
Monotonicity ensures that achieved significance levels of RPi and log(RPi) are identical. There are two key notions reflected in this transformation. First, since the k replicates are independent, log(RPi) is the sum of k independent, identically distributed random variables under the null hypothesis. More fundamentally, the log transformation demonstrates that the rank product method engenders replacement of the ranks Rij within each replicate by rank scores aj(Rij), where the score function here is given simply by aj(i)=log(i), 1 ≤ i ≤ n, 1 ≤ j ≤ k.
Score functions
Probably the most common score function for linear rank statistics is the Wilcoxon rank score, a(i)=i, 1 ≤ i ≤ n. If this score function were used rather than the log function, then the resulting test statistic, entailing summation of ranks across the k replicates, is equivalent to Wise's genome scan meta-analysis (GSMA) statistic [10]. Wise and colleagues had invoked a clever inclusion/exclusion argument for deriving the exact null distribution of the GSMA statistic; Koziol and Feng [11,12] had used a more classical approach with probability generating functions for determination of the exact distribution of the GSMA statistic.
A second common choice of score function is the set of normal, or van der Waerden, scores [9],
$$a(i) = \Phi^{-1}\!\left(\frac{i}{n+1}\right), \quad 1 \le i \le n,$$
where Φ⁻¹ is the inverse of the standard normal distribution function. Within the family of univariate linear rank statistics, the resulting normal scores statistic is more powerful against location shift alternatives with normally distributed data than the Wilcoxon statistic [9]. In practice, one might expect the normal scores statistic to perform comparably to the t-test method.
Much of the richness and diversity of linear rank statistics arises from adoption of different score functions into the underlying construct. The rank product method is associated with a log score function, but other functions could easily be used, as we note later. At first glance, van der Waerden scores might be a reasonable choice if linear shifts in location of expression levels are expected. The log score function would seem more appropriate if shifts in the shape of the underlying distribution are also likely to occur.
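The three score functions discussed in this section can be compared side by side. The following is a minimal sketch using only the Python standard library; the function names and the example ranks are hypothetical, and the statistic computed is simply the sum of scores across replicates for a single gene.

```python
import math
from statistics import NormalDist

def wilcoxon_score(r, n):
    """Wilcoxon score: the rank itself."""
    return float(r)

def van_der_waerden_score(r, n):
    """Normal (van der Waerden) score: inverse normal CDF at r / (n + 1)."""
    return NormalDist().inv_cdf(r / (n + 1))

def log_score(r, n):
    """Log score, which corresponds to the rank product method."""
    return math.log(r)

def rank_statistic(ranks, n, score):
    """Sum of scores over the k replicates for one gene."""
    return sum(score(r, n) for r in ranks)

# Hypothetical ranks of one gene in k = 3 replicates of n = 10 genes.
ranks_for_gene = [2, 1, 3]
n = 10
y_wilcoxon = rank_statistic(ranks_for_gene, n, wilcoxon_score)
y_vdw = rank_statistic(ranks_for_gene, n, van_der_waerden_score)
y_log = rank_statistic(ranks_for_gene, n, log_score)
print(y_wilcoxon, y_vdw, y_log)
```

Exponentiating the log-score statistic recovers the rank product itself, here 2 × 1 × 3 = 6.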
Distribution theory
A generalized version of the rank product statistic is given by
$$Y_i = \sum_{j=1}^{k} a_j(R_{ij}), \quad 1 \le i \le n.$$
Note that under the null hypothesis of exchangeability within replicates and independence across replicates, the Yi are identically distributed. Also, in practice, one would likely choose the score functions aj(.) to be identical, all equaling a(.), say, though this is not necessary. For particular choices of a(.), exact distribution theory concerning Yi is readily available through the notion of probability generating functions [11, 12]. Wilcoxon scores are particularly amenable to this construction; details are in [11, 12].
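To make the generating function argument concrete: under the null hypothesis each rank is uniformly distributed on 1, …, n, so the probability generating function of a sum of k independent integer scores is the kth power of the generating function of a single score, and multiplying these polynomials amounts to repeated convolution. The sketch below carries this out with exact rational arithmetic for Wilcoxon scores; it is an illustration under these assumptions rather than the algorithm of [11, 12], and the function name and the small values of n and k are hypothetical.

```python
from fractions import Fraction

def exact_null_pmf(scores, k):
    """Exact PMF of the sum of k independent draws, each uniform over the integer `scores`."""
    n = len(scores)
    pmf = {0: Fraction(1)}
    for _ in range(k):
        # One convolution step, i.e., one multiplication by the single-score PGF.
        updated = {}
        for total, prob in pmf.items():
            for s in scores:
                updated[total + s] = updated.get(total + s, Fraction(0)) + prob / n
        pmf = updated
    return pmf

# Hypothetical small case: n = 5 genes, k = 3 replicates, Wilcoxon scores a(i) = i.
n, k = 5, 3
pmf = exact_null_pmf(list(range(1, n + 1)), k)
p_lower = sum(prob for total, prob in pmf.items() if total <= 4)
print(p_lower)
```

The work grows with the range of attainable sums, so this direct approach is practical for integer-valued scores such as Wilcoxon scores but not for real-valued scores such as log(i).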
Alternatively, one could invoke a normal approximation to the distribution of Yi, which is the sum of k independent random variables, identically distributed if all aj(.) = a(.). In this regard, one could improve on the normal approximation via Edgeworth correction for skewness and kurtosis. Details are given in the Appendix. Note that weights could easily be incorporated into the construct for the Yi: one might, for example, weight the replicates differently on a priori grounds.
A simple approximation for rank products
There is no need to invoke the formalism outlined above for the null distribution of the rank products, as there exists a remarkably simple approximation, which we now outline. First, note that Rij/(n+1) is approximately uniformly distributed on the unit interval (0,1), the approximation improving as n (the number of genes) increases. Next, let Uj denote a uniform random variable [that is, Uj is uniformly distributed on (0,1)]. Then -log(Uj) has an exponential distribution on the positive real line with scale parameter 1, commonly denoted Exp(1). The key here is that the Exp(1) distribution is a particular case of a gamma distribution, namely, a Gamma(1,1) distribution. [The gamma distribution is a two-parameter continuous probability distribution. The two parameters are commonly referred to as the shape parameter k and the scale parameter θ, and the distribution is denoted Gamma(k,θ).] The sum of independent, identically distributed exponentials is also gamma distributed, with the same scale parameter, but an altered shape parameter: in our setting, with U1, …, Uk independent, the sum -log(U1) - log(U2) - … - log(Uk) has a Gamma(k,1) distribution.
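How does all this relate to the rank product? Recall that we are interested in “sufficiently small” values of RPi. For a cutoff value c, we have the following steps:
$$\begin{aligned}
P(RP_i \le c) &= P\left(\prod_{j=1}^{k} \frac{R_{ij}}{n+1} \le \frac{c}{(n+1)^k}\right) \\
&= P\left(-\sum_{j=1}^{k} \log\frac{R_{ij}}{n+1} \ge k\log(n+1) - \log c\right) \\
&\approx P\left(G \ge k\log(n+1) - \log c\right), \quad G \sim \mathrm{Gamma}(k,1).
\end{aligned}$$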
That is, we may easily determine approximate critical values for RPi by back transformation from the associated probability that a random variable distributed as Gamma(k,1) exceeds the cutoff value specified in the last equation above.
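The calculation is short enough to sketch directly. Because the shape parameter here is the integer k, the upper tail of Gamma(k,1) has the closed (Erlang) form exp(-x) Σ x^m/m!, summed over m = 0, …, k-1, so no special functions are needed. The function names below are hypothetical, and the example values of RP, n, and k are chosen only for illustration.

```python
import math

def gamma_upper_tail(x, k):
    """P(G >= x) for G ~ Gamma(shape=k, scale=1) with integer k (Erlang closed form)."""
    if x <= 0:
        return 1.0
    term, total = 1.0, 1.0
    for m in range(1, k):
        term *= x / m          # term is now x**m / m!
        total += term
    return math.exp(-x) * total

def rank_product_pvalue(rp, n, k):
    """Gamma approximation to P(RP <= rp) under the null, for n genes and k replicates."""
    x = k * math.log(n + 1) - math.log(rp)
    return gamma_upper_tail(x, k)

# Hypothetical example: a rank product of 1000 with n = 1000 genes, k = 3 replicates.
print(rank_product_pvalue(1000, 1000, 3))
```

With a single replicate (k = 1) the approximation reduces to rp/(n+1), as one would expect from the uniform approximation to Rij/(n+1).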
To investigate the adequacy of the gamma approximation, we examined the following cases: n=1000, 5000, 30000 (genes), and k=3, 5, 10 (replicates). For each of these combinations of (n,k), we generated 10000 independent RP values, transformed the values to -log(RP)+k*log(n+1), and compared the empirical distributions of the transformed values to their respective Gamma(k,1) approximations. We present probability plots of the empirical distributions in Figure 1. Clearly, agreement is excellent over the range of support, indicating that the empirical distributions of the RP values are indeed well-approximated by the corresponding gamma distributions. We caution that estimation of extreme tail probabilities of the RP statistics from either the permutation approach of Breitling and colleagues or the simulation approach utilized here is rather imprecise, even with moderately large permutation or simulation runs. Our preference for tail probabilities is the simple gamma approximation, which should be quite satisfactory in settings with reasonably large numbers of genes and replicates.
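A scaled-down version of this kind of check can be sketched as follows. It is not the simulation code used for Figure 1: the seed, the single (n,k) combination, and the cutoff of 6.0 are arbitrary choices made for illustration, and each null RP value is generated by drawing k independent ranks uniformly from 1, …, n.

```python
import math
import random

def gamma_upper_tail(x, k):
    """P(G >= x) for G ~ Gamma(shape=k, scale=1) with integer k (Erlang closed form)."""
    if x <= 0:
        return 1.0
    term, total = 1.0, 1.0
    for m in range(1, k):
        term *= x / m
        total += term
    return math.exp(-x) * total

def simulate_transformed_rp(n, k, n_sim, seed=0):
    """Simulate -log(RP) + k*log(n+1) for n_sim independent null rank products."""
    rng = random.Random(seed)
    shift = k * math.log(n + 1)
    values = []
    for _ in range(n_sim):
        log_rp = sum(math.log(rng.randint(1, n)) for _ in range(k))
        values.append(shift - log_rp)
    return values

n, k, n_sim = 1000, 3, 10000
values = simulate_transformed_rp(n, k, n_sim)
sample_mean = sum(values) / n_sim                    # Gamma(k,1) has mean k
cutoff = 6.0
empirical_tail = sum(v >= cutoff for v in values) / n_sim
approx_tail = gamma_upper_tail(cutoff, k)
print(sample_mean, empirical_tail, approx_tail)
```

Comparing the empirical tail frequency with the gamma tail at several cutoffs gives a numerical counterpart to the probability plots.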
Figure 1.
Probability plots of the empirical distributions of the transformed RP statistics, based on 10000 simulations, versus corresponding quantiles of the approximating gamma distributions. A. n=10000 genes, k=3 replicates. B. n=10000 genes, k=5 replicates. C. n=10000 genes, k=10 replicates. D. n=5000 genes, k=3 replicates. E. n=5000 genes, k=5 replicates. F. n=5000 genes, k=10 replicates. G. n=30000 genes, k=3 replicates. H. n=30000 genes, k=5 replicates. I. n=30000 genes, k=10 replicates. Points should form approximately a straight line along the diagonal; departures from this straight line indicate departures from the specified distribution. In each instance, the approximating gamma distribution has scale parameter 1, and shape parameter equal to the number of replicates.
We note in passing that gamma probabilities are readily obtained in Excel, as both the gamma cumulative distribution function and inverse cumulative distribution function are built-in mathematical functions. Hence individuals utilizing Breitling's Excel template for calculation of rank products would not require additional specialized software for gamma calculations.
In summary, the underlying theory of linear rank statistics provides insight into distributional properties of the rank product method. Extensions of the rank product method involving other score functions are straightforward, and may afford enhanced power properties against particular alternatives of interest [9, 13]. A simple gamma approximation is available for the log-transformed rank product statistic, and should be quite satisfactory in practice.
Acknowledgments
The author thanks the reviewers for insightful comments and valuable suggestions. This research was supported in part by NIH grants PO1AI070167 and PO1CA104898.
Appendix
For simplicity, assume aj(.)=a(.) for all j. Then, the Yi are identically distributed, so it suffices to examine the distribution of Y1. Y1 is the sum of k independent, identically distributed random variables, so consideration can be further restricted to the distribution of Z=a(R11) under the null hypothesis that all permutations of (R11, R21, …, Rn1) are equally likely. Moments of Z are easily found: the mth moment is given by
$$E(Z^m) = \frac{1}{n}\sum_{i=1}^{n} a(i)^m$$
for any positive integer m. Closed form formulas are available for particular choices of a(.), e.g., Wilcoxon scores; otherwise, numerical calculation via spreadsheet is straightforward.
Next, consider the cumulants κm of Z. The cumulants of a distribution are closely related to the distribution's moments [14]. For example, if Z has an expected value μ=E(Z) and a variance σ²=E(Z-μ)², then these are the first two cumulants, that is, μ=κ1 and σ²=κ2. The third and fourth cumulants, κ3 and κ4, are commonly referred to as the skewness and kurtosis respectively of the distribution of Z; in standardized form these are κ3/σ³ and κ4/σ⁴. Stuart and Ord [14, Sec. 3.14] give formulas for cumulants in terms of moments and conversely. It is advantageous to work with cumulants rather than moments because of the additivity property of cumulants in the independence setting: each cumulant of a sum of independent random variables is the sum of the corresponding cumulants of each of the random variables. Here, each cumulant of Y1 is merely k times the corresponding cumulant of Z.
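The moment and cumulant calculations can be sketched numerically. The code below uses the standard expressions for the first four cumulants in terms of raw moments, together with the additivity property, for Wilcoxon scores; the function names and the small values of n and k are hypothetical, and for scores with large magnitudes a central-moment formulation would be numerically safer than the raw-moment formulas used here.

```python
def raw_moment(scores, m):
    """mth raw moment of Z = a(R) when R is uniform on 1..n: (1/n) * sum of a(i)**m."""
    return sum(s ** m for s in scores) / len(scores)

def first_four_cumulants(scores):
    """First four cumulants of Z from its raw moments (standard moment-cumulant relations)."""
    m1, m2, m3, m4 = (raw_moment(scores, m) for m in (1, 2, 3, 4))
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k1, k2, k3, k4

# Hypothetical small case: Wilcoxon scores a(i) = i with n = 5 genes, k = 3 replicates.
n, k = 5, 3
wilcoxon = [float(i) for i in range(1, n + 1)]
cum_z = first_four_cumulants(wilcoxon)
cum_y = tuple(k * c for c in cum_z)   # additivity: cumulants of Y1 are k times those of Z
print(cum_z, cum_y)
```

For Wilcoxon scores the third cumulant vanishes by symmetry, whereas for log scores it does not.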
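In addition, a distribution with cumulants κm can be readily approximated through an Edgeworth series representation [9]. Let Z1j = a(R1j) and Wj = (Z1j - μ)/σ for 1 ≤ j ≤ k, let Z̄ denote the mean of Z11, …, Z1k, and let
$$S_k = \frac{k^{1/2}(\bar{Z} - \mu)}{\sigma} = k^{-1/2}\sum_{j=1}^{k} W_j.$$
Then the Edgeworth expansion for Sk is given by
$$P(S_k \le x) \approx \Phi(x) - \phi(x)\left[\frac{\kappa_3}{6\sigma^3 k^{1/2}} H_2(x) + \frac{\kappa_4}{24\sigma^4 k} H_3(x) + \frac{\kappa_3^2}{72\sigma^6 k} H_5(x)\right],$$
where
$$H_2(x) = x^2 - 1, \quad H_3(x) = x^3 - 3x, \quad H_5(x) = x^5 - 10x^3 + 15x,$$
Φ and ϕ are the cumulative distribution function and density function respectively of the standard normal distribution, and κ3/σ³ and κ4/σ⁴ are the standardized skewness and kurtosis respectively of Z. Skewness is pronounced with non-symmetric score functions such as log(.), so skewness correction is appropriate in such instances.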
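A numerical sketch of the Edgeworth correction for log scores follows. It assumes the standard form of the expansion, with correction terms of order k^(-1/2) and k^(-1) expressed through the standardized skewness and excess kurtosis of Z; the function names, the choice n = 1000 and k = 5, and the evaluation point are hypothetical.

```python
import math
from statistics import NormalDist

def standardized_shape(scores):
    """Mean, standard deviation, skewness and excess kurtosis of Z = a(R), R uniform on 1..n."""
    n = len(scores)
    mu = sum(scores) / n
    c2 = sum((s - mu) ** 2 for s in scores) / n
    c3 = sum((s - mu) ** 3 for s in scores) / n
    c4 = sum((s - mu) ** 4 for s in scores) / n
    sigma = math.sqrt(c2)
    return mu, sigma, c3 / sigma ** 3, c4 / sigma ** 4 - 3.0

def edgeworth_cdf(x, skew, exkurt, k):
    """Edgeworth approximation to P(S_k <= x) for a standardized sum of k i.i.d. terms."""
    h2 = x ** 2 - 1
    h3 = x ** 3 - 3 * x
    h5 = x ** 5 - 10 * x ** 3 + 15 * x
    std = NormalDist()
    correction = (skew * h2 / (6 * math.sqrt(k))
                  + exkurt * h3 / (24 * k)
                  + skew ** 2 * h5 / (72 * k))
    return std.cdf(x) - std.pdf(x) * correction

# Hypothetical setting: log scores, n = 1000 genes, k = 5 replicates.
n, k = 1000, 5
log_scores = [math.log(i) for i in range(1, n + 1)]
mu, sigma, skew, exkurt = standardized_shape(log_scores)
y = k * mu - 2.0 * sigma * math.sqrt(k)        # a log rank product two SDs below its mean
x = (y - k * mu) / (sigma * math.sqrt(k))
print(skew, exkurt, edgeworth_cdf(x, skew, exkurt, k), NormalDist().cdf(x))
```

Being a truncated series, the Edgeworth approximation is not guaranteed to lie in [0,1] or to be monotone far out in the tails, so it is best used for moderate deviations.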
References
- 1. Breitling R, Armengaud P, Amtmann A, Herzyk P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters. 2004;573:83–92. doi: 10.1016/j.febslet.2004.07.055.
- 2. Breitling R, Herzyk P. Rank-based methods as a non-parametric alternative of the t-test for the analysis of biological microarray data. J Bioinf Comp Biol. 2005;3:1171–1189. doi: 10.1142/s0219720005001442.
- 3. Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J. RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006;22:2825–2827. doi: 10.1093/bioinformatics/btl476.
- 4. Jeffrey IB, Higgins DG, Culhane AE. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics. 2006;7:359. doi: 10.1186/1471-2105-7-359.
- 5. Hong F, Breitling R. A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008;24:374–382. doi: 10.1093/bioinformatics/btm620.
- 6. Birmingham A, Selfors LM, Forster T, et al. Statistical methods for analysis of high-throughput RNA interference screens. Nature Methods. 2009;6:569–575. doi: 10.1038/nmeth.1351.
- 7. Wiederhold E, Gandhi T, Permentier HP, et al. The yeast vacuolar membrane proteome. Mol Cell Proteomics. 2009;8:380–392. doi: 10.1074/mcp.M800372-MCP200.
- 8. Hoefsloot HC, Smit S, Smilde AK. A classification model for the Leiden proteomics competition. Stat Appl Genet Mol Biol. 2008;7:8. doi: 10.2202/1544-6115.1351.
- 9. Hajek J, Sidak Z. Theory of Rank Tests. Academic Press; New York: 1967.
- 10. Wise LH, Lanchbury JS, Lewis CM. Meta-analysis of genome searches. Annals of Human Genetics. 1999;63:263–272. doi: 10.1046/j.1469-1809.1999.6330263.x.
- 11. Koziol JA, Feng AC. A note on the genome scan meta-analysis statistic. Annals of Human Genetics. 2004;68:376–380. doi: 10.1046/j.1529-8817.2004.00103.x.
- 12. Koziol JA, Feng AC. A note on generalized genome meta-analysis statistics. BMC Bioinformatics. 2005;6:32. doi: 10.1186/1471-2105-6-32.
- 13. Sen PK. Asymptotically efficient tests by the method of n rankings. Journal of the Royal Statistical Society Series B. 1968;30:312–317.
- 14. Stuart A, Ord JK. Kendall's Advanced Theory of Statistics. 5th ed. Vol. 1. Oxford University Press; New York: 1987.

