Abstract
We published a new method (BMC Bioinformatics 2014, 15:14) for detecting differentially expressed genes in datasets from two biological conditions. The presentation of Theorem 1 in that paper was incomplete. An anonymous comment on our publication motivated the present work. Here, we present a complementary result that is necessary, from the theoretical point of view, to establish our theorem. We also show that this result has no negative impact on the conclusions we obtained with synthetic and experimental microarray datasets.
Keywords: Differentially expressed genes, Fold change, Average of ranks
Background
To search for differentially expressed (DE) genes in profiling studies, we presented a new method based on fold change rank ordering statistics (FCROS). For the derivation of this method, we considered microarray data from two biological conditions where $n$ probes (genes) were used with $m_1$ control and $m_2$ test samples. We performed $k$ pairwise comparisons ($k = m_1 m_2$) of the data samples and computed fold changes (FC) for each gene. The FCs obtained for each comparison were sorted in increasing order and their corresponding ranks were associated with genes. Hence, we can form a matrix of rank values $R$ with components $r_{ij}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, k$). We denote by $r_i = [r_{i1}\ r_{i2}\ \ldots\ r_{ik}]^T$ the vector of rank values associated with gene $i$, and by $\bar{r}_i = \frac{1}{k}\sum_{j=1}^{k} r_{ij}$ the average of ranks (a.o.r.) value for gene $i$. The value of $\bar{r}_i$ varies between 1 and $n$. This allows us to associate a unique ordered vector of a.o.r. values with the $n$ genes, in which the scalars $\delta_i$ are the differences between consecutive ordered a.o.r. values. Without loss of generality, we assumed that the differences $\delta_i$ all have the same value, which is approximated by their mean $\delta$. Using this notation, we derived a theorem showing a normal distribution for the vector of a.o.r. values [1]. The statement of this theorem was incomplete, as shown by the following lemma, which we received from an anonymous reader.
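The construction of the rank matrix and of the a.o.r. values described above can be sketched as follows. This is a minimal illustration, not the FCROS implementation: the function names `rank_matrix` and `average_of_ranks`, the input layout (one list of $n$ expression values per sample) and the toy data are assumptions made for the example. Fold changes are taken as plain test/control ratios; any monotone transform such as a logarithm would give the same ranks. The data are chosen so that no ties occur, in line with the distinct-ranks assumption used later.

```python
from itertools import product


def rank_matrix(control, test):
    """Return an n x k matrix of ranks, one column per control/test pair.

    control, test: lists of samples, each sample a list of n positive
    expression values (hypothetical input layout).
    """
    n = len(control[0])
    columns = []
    for c, t in product(control, test):  # k = m1 * m2 pairwise comparisons
        fold_changes = [t[i] / c[i] for i in range(n)]
        # Sort gene indices by increasing fold change; rank 1 = smallest FC.
        order = sorted(range(n), key=lambda i: fold_changes[i])
        ranks = [0] * n
        for rank, gene in enumerate(order, start=1):
            ranks[gene] = rank
        columns.append(ranks)
    # Transpose so that row i holds the k ranks of gene i.
    return [list(row) for row in zip(*columns)]


def average_of_ranks(ranks):
    """Average of ranks (a.o.r.) for each gene, i.e. the mean of each row."""
    return [sum(row) / len(row) for row in ranks]


# Toy data: n = 4 genes, m1 = 2 control samples, m2 = 2 test samples.
control = [[10.0, 20.0, 30.0, 40.0], [11.0, 19.0, 33.0, 41.0]]
test = [[30.0, 21.0, 15.0, 44.0], [28.0, 22.0, 17.0, 39.0]]

R = rank_matrix(control, test)
aor = average_of_ranks(R)
print(R)
print(aor)
```

Each column of `R` is a permutation of $\{1, \ldots, n\}$, so the a.o.r. values always sum to $n(n+1)/2$, whatever the data.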
Lemma 1
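Consider the matrix of rank values $R$ under the assumption that the rank values in each column are all distinct. Assume a uniform random sampling without replacement model for the columns of $R$, i.e. each column of $R$ is an independent draw from the set of all permutations of $\{1, \ldots, n\}$ with uniform probability for each permutation. Then, the asymptotic distribution of the unordered vector of average of ranks (a.o.r.), $\bar{r} = [\bar{r}_1\ \bar{r}_2\ \ldots\ \bar{r}_n]^T$, has mean $\frac{n+1}{2}\mathbf{1}_n$ and a degenerate $n \times n$ variance-covariance matrix $\Sigma$, with $\det\Sigma = 0$:

$$\Sigma = \begin{bmatrix} \beta & \alpha & \cdots & \alpha \\ \alpha & \beta & \cdots & \alpha \\ \vdots & \vdots & \ddots & \vdots \\ \alpha & \alpha & \cdots & \beta \end{bmatrix} \tag{1}$$

with diagonal element $\beta = \frac{n^2-1}{12}$, off-diagonal element $\alpha = -\frac{n+1}{12}$ and $\mathbf{1}_n = [1, 1, \ldots, 1]^T$.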
Proof
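Note that for $k \to \infty$, the appearances of all elements of the set $\{1, \ldots, n\}$ in each row of $R$ under the assumed sampling model are equally likely; hence, by the weak law of large numbers ([2], page 235), the asymptotic mean is the constant vector $\frac{n+1}{2}\mathbf{1}_n$. Under the same observation, the asymptotic variance (the diagonal element of $\Sigma$), $\forall \ell \in \{1, \ldots, n\}$, is equal to:

$$\Sigma_{\ell\ell} = \frac{1}{n}\sum_{i=1}^{n} i^2 - \left(\frac{n+1}{2}\right)^2 = \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} = \frac{n^2-1}{12} \tag{2}$$

The asymptotic covariance (the off-diagonal element of $\Sigma$) is computed as a two-index summation over the set $\{1, \ldots, n\}$ with the restriction that no two indices can be the same, since the columns are permutations by construction; hence $\forall \ell \neq m \in \{1, \ldots, n\}$:

$$\Sigma_{\ell m} = \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n} i\,j - \left(\frac{n+1}{2}\right)^2 \tag{3}$$

$$= \frac{1}{n(n-1)}\left[\left(\sum_{i=1}^{n} i\right)^2 - \sum_{i=1}^{n} i^2\right] - \frac{(n+1)^2}{4} \tag{4}$$

$$= \frac{(n+1)(3n+2)}{12} - \frac{(n+1)^2}{4} = -\frac{n+1}{12} \tag{5}$$

Thus, each row of $\Sigma$ sums to $\frac{n^2-1}{12} - (n-1)\frac{n+1}{12} = 0$, i.e. $\Sigma\mathbf{1}_n = \mathbf{0}$, and it follows that $\det\Sigma = 0$. □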
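The moments used in this proof can be checked exactly for small $n$ by enumerating every permutation. The sketch below is only a sanity check of the single-column moments under the uniform permutation model, not part of the original derivation; the function name `permutation_moments` is an assumption made for the example. It compares the enumerated values with $\frac{n+1}{2}$, $\frac{n^2-1}{12}$ and $-\frac{n+1}{12}$, and checks that the variance plus $(n-1)$ times the covariance is zero, which is the row-sum condition behind $\det\Sigma = 0$.

```python
from fractions import Fraction
from itertools import permutations


def permutation_moments(n):
    """Exact mean, variance and covariance of entries of a uniform permutation.

    Enumerates all n! permutations of {1, ..., n} and looks at positions
    0 and 1, which by symmetry stand for any pair of distinct genes.
    """
    perms = list(permutations(range(1, n + 1)))
    total = Fraction(len(perms))
    mean = sum(p[0] for p in perms) / total
    var = sum(p[0] ** 2 for p in perms) / total - mean ** 2
    cov = sum(p[0] * p[1] for p in perms) / total - mean ** 2
    return mean, var, cov


for n in range(2, 7):
    mean, var, cov = permutation_moments(n)
    assert mean == Fraction(n + 1, 2)
    assert var == Fraction(n * n - 1, 12)
    assert cov == Fraction(-(n + 1), 12)
    # Each row of Sigma sums to zero, so Sigma is singular.
    assert var + (n - 1) * cov == 0
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparisons do not depend on floating-point rounding.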
This lemma shows that the covariance term was missing from our theorem. In the next section, we present a complete version of our theorem using the notation we adopted in [1].
Results
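With our notation, $\bar{r}$ is the vector of ordered a.o.r. values. Writing $a$ and $b$ for its smallest and largest components and $\delta$ for the common difference between consecutive components, each component of the vector can be written as $\bar{r}_i = a + (i-1)\delta$, $i = 1, 2, \ldots, n$, so that $b = a + (n-1)\delta$. Theorem 1 in ([1], page 3) should read: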
Theorem 1
When the number $k$ of pairwise comparisons grows, the ordered average of ranks (a.o.r.) values have a normal distribution. The mean of this distribution is $\frac{a+b}{2}$; its variance-covariance matrix has diagonal element $\frac{(b-a)(b-a+2\delta)}{12}$ and off-diagonal element $-\frac{\delta(b-a+2\delta)}{12}$, where $a$ and $b$ are the minimum and the maximum of the observed a.o.r., $a = \min_i \bar{r}_i$ and $b = \max_i \bar{r}_i$, respectively, and $\delta = \frac{b-a}{n-1}$ is the average difference between consecutive ordered a.o.r. values.
Proof
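From the following identities:

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}, \qquad \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6},$$

and using $\bar{r}_i = a + (i-1)\delta$ with $b = a + (n-1)\delta$, a component of the mean of the normal distribution is:

$$\frac{1}{n}\sum_{i=1}^{n}\left[a + (i-1)\delta\right] = a + \frac{(n-1)\delta}{2} = \frac{a+b}{2} \tag{6}$$

A component of the variance (diagonal element) of the normal distribution matrix is:

$$\frac{1}{n}\sum_{i=1}^{n}\left[a + (i-1)\delta\right]^2 - \left(\frac{a+b}{2}\right)^2 = \frac{\delta^2(n^2-1)}{12} = \frac{(b-a)(b-a+2\delta)}{12} \tag{7}$$

A component of the covariance (off-diagonal element) of the normal distribution matrix is:

$$\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n}\left[a + (i-1)\delta\right]\left[a + (j-1)\delta\right] - \left(\frac{a+b}{2}\right)^2 = -\frac{\delta^2(n+1)}{12} = -\frac{\delta(b-a+2\delta)}{12} \tag{8}$$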
□
Setting $a = \delta = 1$ and $b = n$ in Theorem 1 gives the same mean and variance-covariance components as in Lemma 1. These values of $a$, $b$ and $\delta$ correspond to the case we called the ideal situation ([1], page 4).
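As a hedged numerical check of this reduction, the sketch below treats the ordered a.o.r. values as the equally spaced grid $a, a+\delta, \ldots, a+(n-1)\delta = b$, which is an assumption drawn from the equal-difference simplification in the Background. It computes exactly the mean and variance of one uniform draw from the grid and the covariance of two distinct draws. The function name `grid_moments` and the test values are hypothetical. The closed forms asserted in the code, $\frac{a+b}{2}$, $\frac{(b-a)(b-a+2\delta)}{12}$ and $-\frac{\delta(b-a+2\delta)}{12}$, follow from that grid assumption; with $a = \delta = 1$ and $b = n$ they reduce to $\frac{n+1}{2}$, $\frac{n^2-1}{12}$ and $-\frac{n+1}{12}$.

```python
from fractions import Fraction


def grid_moments(a, delta, n):
    """Exact moments for draws from the grid a, a + delta, ..., a + (n - 1) delta.

    Returns the mean and variance of one uniform draw and the covariance of
    two distinct draws (sampling without replacement).
    """
    values = [Fraction(a) + i * Fraction(delta) for i in range(n)]
    mean = sum(values) / n
    var = sum(v * v for v in values) / n - mean ** 2
    cross = sum(values) ** 2 - sum(v * v for v in values)  # sum over i != j
    cov = cross / (n * (n - 1)) - mean ** 2
    return mean, var, cov


# Arbitrary (hypothetical) grid: a = 3/2, delta = 1/4, n = 9, hence b = 7/2.
a, delta, n = Fraction(3, 2), Fraction(1, 4), 9
b = a + (n - 1) * delta
mean, var, cov = grid_moments(a, delta, n)
assert mean == (a + b) / 2
assert var == (b - a) * (b - a + 2 * delta) / 12
assert cov == -delta * (b - a + 2 * delta) / 12

# Ideal situation: a = delta = 1 and b = n (here n = 6).
assert grid_moments(1, 1, 6) == (Fraction(7, 2), Fraction(35, 12), Fraction(-7, 12))
```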
For the FCROS algorithm, we used the standardized rank value, i.e., each observed rank value is divided by $n$. The mean and the variance-covariance components should therefore be divided by $n$ and $n^2$, respectively. In the ideal situation, this leads to a mean component $r^\star = \frac{n+1}{2n}$ and a variance-covariance matrix with diagonal component $\beta^\star = \frac{n^2-1}{12n^2}$ and off-diagonal component $\alpha^\star = -\frac{n+1}{12n^2}$. Table 1 shows the values of $r^\star$, $\beta^\star$ and $\alpha^\star$ when $n$ increases. For large $n$, the off-diagonal components of the variance-covariance matrix vanish. Hence, when $n$ is large, good approximations for the mean and the variance components are $\frac{1}{2}$ and $\frac{1}{12}$, respectively.
Table 1.
Values of the mean, the variance and the covariance components when n increases
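n | 10 | 100 | 1,000 | 10,000 |
---|---|---|---|---|
$r^\star$ | 0.55 | 0.505 | 0.5005 | 0.50005 |
$\beta^\star$ | $8.25\times10^{-2}$ | $8.33\times10^{-2}$ | $8.33\times10^{-2}$ | $8.33\times10^{-2}$ |
$\alpha^\star$ | $-9.17\times10^{-3}$ | $-8.4\times10^{-4}$ | $-8.34\times10^{-5}$ | $-8.33\times10^{-6}$ |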
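The standardized components can be recomputed with a few lines of code. The sketch below assumes the ideal-situation values (mean $\frac{n+1}{2}$, variance $\frac{n^2-1}{12}$, covariance $-\frac{n+1}{12}$) divided by $n$ and $n^2$ as described above; the function name `standardized_components` is an assumption made for the example. The covariance values it prints agree with the $\alpha^\star$ row of Table 1 to the precision shown there.

```python
from fractions import Fraction


def standardized_components(n):
    """Mean, variance and covariance components after dividing ranks by n."""
    r_star = Fraction(n + 1, 2 * n)               # (n + 1) / 2, divided by n
    beta_star = Fraction(n * n - 1, 12 * n * n)   # (n^2 - 1) / 12, divided by n^2
    alpha_star = Fraction(-(n + 1), 12 * n * n)   # -(n + 1) / 12, divided by n^2
    return r_star, beta_star, alpha_star


for n in (10, 100, 1000, 10000):
    r_star, beta_star, alpha_star = standardized_components(n)
    print(f"n={n}: r*={float(r_star):.5f}, "
          f"beta*={float(beta_star):.3e}, alpha*={float(alpha_star):.3e}")
```

As $n$ grows, the mean and variance components approach $\frac{1}{2}$ and $\frac{1}{12}$, while the covariance shrinks roughly like $-\frac{1}{12n}$.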
Discussion and conclusions
As shown, the theorem we previously presented was incomplete because the covariance term was missing. The present complementary result is necessary from the theoretical point of view, and we are grateful to the anonymous reader for pointing this out. This result will be useful for small values of $n$. However, for high-throughput biological datasets, $n$ is large, often greater than 10,000 ([1], page 2). For such values of $n$, the rank-deficient variance-covariance matrix of the normal distribution associated with the a.o.r. values is close to a diagonal matrix. Hence, it is as if the standardized a.o.r. value of each gene followed a normal distribution with mean $\frac{1}{2}$ and variance $\frac{1}{12}$.
Acknowledgments
We thank the anonymous reader for drawing our attention to this result.
Funding
This work was supported by funds from CNRS, INSERM and University of Strasbourg.
Availability of data and materials
Not Applicable.
Authors’ contributions
DD drafted the paper and performed the analyses. Both authors developed the method and contributed to the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not Applicable.
Ethics approval and consent to participate
Not Applicable.
References
- 1. Dembélé D, Kastner P. Fold change rank ordering statistics: a new method for detecting differentially expressed genes. BMC Bioinforma. 2014;15(1):14. doi: 10.1186/1471-2105-15-14.
- 2. Feller W. An Introduction to Probability Theory and Its Applications. New York: John Wiley & Sons; 1971.