A multivariate statistical test for differential expression analysis

Michele Tumminello; Giorgio Bertolazzi; Gianluca Sottile; Nicolina Sciaraffa; Walter Arancio; Claudia Coronnello

doi:10.1038/s41598-022-12246-w

2022 May 18;12:8265. doi: 10.1038/s41598-022-12246-w

PMCID: PMC9117296 PMID: 35585166

Abstract

Statistical tests of differential expression usually suffer from two problems. Firstly, their statistical power is often limited when applied to small and skewed data sets. Secondly, gene expression data are usually discretized by applying arbitrary criteria to limit the number of false positives. In this work, a new statistical test obtained from a convolution of multivariate hypergeometric distributions, the Hy-test, is proposed to address these issues. The Hy-test has been applied to transcriptomic data from breast and kidney cancer tissues, and it has been compared with other differential expression analysis methods. The Hy-test allows implicit discretization of the expression profiles and is more selective in retrieving both differentially expressed genes and terms of Gene Ontology. The Hy-test can be adopted together with other tests to retrieve information that would otherwise remain hidden, e.g., terms of (1) cell cycle deregulation for breast cancer and (2) “programmed cell death” for kidney cancer.

Subject terms: Biological techniques, Bioinformatics, Gene expression analysis, Software

Introduction

Differential expression analysis (DEA) is a large-scale inference procedure used to identify genes whose expression differs under different biological conditions. Several variants of the t-test have been developed to perform DEA^1,2. However, the small and skewed data sets typically analysed rarely satisfy the parametric assumptions and, therefore, t-test p-values are often unreliable³. The easiest solution to small data size would be to increase the number of experiments, which, however, would increase experimental costs accordingly. Furthermore, data collected for poorly expressed genes are characterized by several zeros, which also violates the typical assumptions under which t-test statistics are reliable. As a result, t-tests tend to increase type I errors and overestimate the number of significant genes. Alternative definitions of the t-test have been proposed to reduce the impact of small samples and low expression variability, e.g., the moderated t-test⁴ and Significance Analysis of Microarrays (SAM)⁵. Accordingly, we compare the performance of the proposed test for differential expression with that of the moderated t-test and SAM. Conversely, t-tests applied to large data sets also produce too many significant genes, because average expression differences may be significantly different from zero from a statistical point of view without being large enough to be biologically meaningful.

A common strategy to reduce the number of selected differentially expressed genes is to discretize the gene expression. The discretization of gene expression data (GED) is widely used in genomics analysis. Despite a certain loss of information, GED discretization is often used as a preprocessing step to reduce raw data noise and facilitate the interpretation of data⁶. Several algorithms require data discretization during the preprocessing, e.g., the biclustering method⁷. Moreover, many network models require discrete data as input, e.g., Bayesian Networks and logical networks^8,9. Despite the importance of discretization in transcriptomics, the criteria behind discretization methods are always arbitrary: the log2-Fold Change (FC)-discretization¹⁰ depends on an arbitrarily set threshold, usually equal to 1, 1.5 or 2; the Equal Width discretization¹¹ depends on a tuning parameter; a simple rank-based discretization depends on the Xth percentile that identifies the top-X% genes.
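To make the arbitrariness of threshold-based discretization concrete, the following minimal Python sketch applies a log2 fold-change rule to paired values. The function name `fc_discretize`, the tumour/normal direction of the ratio, and the use of non-strict inequalities are illustrative assumptions, not details taken from the cited methods.

```python
from math import log2

def fc_discretize(normal, tumour, threshold=1.0):
    """Label each pair +1 / -1 / 0 from its log2 fold change (tumour over normal).

    The threshold is the arbitrary parameter discussed in the text
    (values such as 1, 1.5 or 2 are commonly used).
    """
    labels = []
    for a, b in zip(normal, tumour):
        lfc = log2(b / a)
        if lfc >= threshold:
            labels.append(1)       # "upregulated"
        elif lfc <= -threshold:
            labels.append(-1)      # "downregulated"
        else:
            labels.append(0)       # unchanged
    return labels

# The same data receive different labels under two equally defensible thresholds.
strict = fc_discretize([1, 4, 2], [2, 1, 3], threshold=1.0)
loose = fc_discretize([1, 4, 2], [2, 1, 3], threshold=0.5)
```

In this toy case the third pair is unchanged under a threshold of 1 but upregulated under a threshold of 0.5, which is the kind of dependence on a tuning parameter that the text criticises.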

We propose a novel statistical test for DEA based on a convolution of multivariate hypergeometric distributions (Hy-test), which addresses both issues of t-test methods discussed before. Moreover, the method implicitly comprises a novel approach for data discretization, which is free from arbitrary parameters. At the price of a slight loss of information, Hy-test presents the following advantages with respect to the currently used methods:

It is free from parametric assumptions;
It implicitly provides a discretization of the expression profiles;
It is more conservative than the t-tests, reducing type I errors;
It can be integrated with other methods.

In this paper, the Hy-test has been applied to investigate breast and kidney cancer tissues, and results have been compared to those obtained through the t-test approach. The results indicate that the joint use of the Hy-test and moderated t-test allows one to understand the biological implications of DEA better.

Methods

Algorithm

Let's consider a gene expression profile recorded in two experimental conditions, e.g., normal and cancer tissues, for n pairs of tissues. We estimate a threshold couple able to discretize gene expression as “downregulated”, “upregulated”, and “no-changed”. The optimum thresholds are obtained by maximizing the disagreement between the discretized levels of the two different experimental conditions. Applying the thresholds $k_{1}, k_{2}$ on the whole expression of a single gene, we obtain two discretized vectors, one for healthy tissues, say ${\vec{v}}_{H}$ , and one for diseased tissues, say ${\vec{v}}_{D}$ , with entries that take values {-1,0,1}, which means “downregulated”, “no-changed”, and “upregulated”, respectively. The thresholds $k_{1}, k_{2}$ are estimated by maximizing the quantity

$$H(\vec{v}_{H}, \vec{v}_{D}) = n_{+,-} + n_{-,+}$$

where $n_{+, -}$ ($n_{-, +}$) is the number of tissue couples in which an upregulated normal (cancer) tissue is paired with a downregulated cancer (normal) tissue. The optimization has been carried out by using a genetic algorithm¹². A threshold couple has been estimated for each gene of the dataset. However, this method can also be easily adapted to extract a single cut-off couple for all genes.
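The discretization and the quantity $H$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: values at or below $k_{1}$ are labelled $-1$ and values above $k_{2}$ are labelled $+1$ (the text does not specify how ties are handled), and an exhaustive search over observed values stands in for the genetic algorithm used in the paper. All function names are hypothetical.

```python
def discretize(values, k1, k2):
    """Map each value to -1 (v <= k1), +1 (v > k2) or 0 (otherwise); assumes k1 <= k2."""
    return [-1 if v <= k1 else 1 if v > k2 else 0 for v in values]

def disagreement(v_h, v_d):
    """H = n_{+,-} + n_{-,+}: number of pairs carrying opposite non-zero signs."""
    return sum(1 for a, b in zip(v_h, v_d) if a * b == -1)

def best_thresholds(healthy, diseased):
    """Exhaustive search for (H, k1, k2) over observed values.

    A simple stand-in for the genetic algorithm mentioned in the text;
    it is only practical for short vectors.
    """
    candidates = sorted(set(healthy) | set(diseased))
    best = (-1, None, None)
    for i, k1 in enumerate(candidates):
        for k2 in candidates[i:]:
            h = disagreement(discretize(healthy, k1, k2),
                             discretize(diseased, k1, k2))
            if h > best[0]:          # keep the first couple reaching the maximum
                best = (h, k1, k2)
    return best

# Toy gene: every diseased value exceeds every healthy value.
result = best_thresholds([1, 2, 3, 4], [5, 6, 7, 8])
```

For this toy gene the search returns a couple that separates the two groups completely, so all four pairs are discordant.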

As soon as the optimal values of the thresholds, $k_{1}$ and $k_{2}$, are determined, we calculate a p-value to assess whether gene expression is significantly different between cancer and normal tissues. To associate a p-value with $H ({\vec{v}}_{H}, {\vec{v}}_{D})$ it is necessary, as a preliminary step, to evaluate the probability that a value of $H ({\vec{v}}_{H}, {\vec{v}}_{D}) = n_{+, -} + n_{-, +}$ occurs by chance. For the sake of readability, we describe the analysis in two steps. In the first step, we set constraints on the total number of positive, negative, and null signs in both vectors under the null hypothesis; in the second, we describe the distribution of the null model after relaxing these constraints. Specifically, in the first step, the null model depends on the external parameters ${\vec{K}}_{H} = (K_{H}^{+}, K_{H}^{-}, K_{H}^{0})$ and ${\vec{K}}_{D} = (K_{D}^{+}, K_{D}^{-}, K_{D}^{0})$, where $K_{H}^{i}$ ($K_{D}^{i}$) is the total number of tissues with sign $i$ in vector ${\vec{v}}_{H}$ (${\vec{v}}_{D}$), with $i \in \{-1, 1, 0\}$, denoted by the superscripts $-$, $+$, and $0$. Such parameters are not independent. Indeed, $K_{H}^{+} + K_{H}^{-} + K_{H}^{0} = K_{D}^{+} + K_{D}^{-} + K_{D}^{0} = n$, where $n$ is the total number of tissue couples in the dataset. We are interested in calculating the probability that the matrix

$$C = \begin{pmatrix} n_{+,+} & n_{+,-} & n_{+,0} \\ n_{-,+} & n_{-,-} & n_{-,0} \\ n_{0,+} & n_{0,-} & n_{0,0} \end{pmatrix}$$

occurs by chance, subject to the aforementioned constraints. An entry $n_{i, j}$ of $C$ represents the number of tissue couples that display sign $i$ in vector ${\vec{v}}_{H}$ and sign $j$ in ${\vec{v}}_{D}$. The notation $C$ is used here because matrices such as the one above are sometimes called “confusion” matrices. The entries of matrix $C$ are not independent due to the constraints on the number of positive, negative, and null signs described above. Specifically, they are linearly dependent according to the following six equations:

$$\begin{cases} n_{+,+} + n_{+,-} + n_{+,0} = K_{H}^{+} \\ n_{-,+} + n_{-,-} + n_{-,0} = K_{H}^{-} \\ n_{0,+} + n_{0,-} + n_{0,0} = K_{H}^{0} \\ n_{+,+} + n_{-,+} + n_{0,+} = K_{D}^{+} \\ n_{+,-} + n_{-,-} + n_{0,-} = K_{D}^{-} \\ n_{+,0} + n_{-,0} + n_{0,0} = K_{D}^{0} \end{cases}$$

This linear system has rank equal to 5, because of the linear relationship between parameters: $K_{H}^{+} + K_{H}^{-} + K_{H}^{0} = K_{D}^{+} + K_{D}^{-} + K_{D}^{0} = n$ . Therefore, it can be solved as

$$\begin{cases} n_{+,0} = K_{H}^{+} - n_{+,-} - n_{+,+} \\ n_{-,0} = K_{H}^{-} - n_{-,-} - n_{-,+} \\ n_{0,+} = K_{D}^{+} - n_{-,+} - n_{+,+} \\ n_{0,-} = K_{D}^{-} - n_{-,-} - n_{+,-} \\ n_{0,0} = K_{H}^{0} + K_{D}^{0} - n + n_{-,-} + n_{-,+} + n_{+,-} + n_{+,+} \end{cases}$$

This result indicates that matrix $C$ is fully determined by the knowledge of $n_{-, -}$, $n_{-, +}$, $n_{+, -}$, and $n_{+, +}$. Therefore, the probability of $C$ can be factorized as

$$P(C) = P(n_{-,-}, n_{-,+}, n_{+,-}, n_{+,+} \mid \vec{K}_{H}, \vec{K}_{D}) = P(n_{-,-}, n_{-,+} \mid n_{+,-}, n_{+,+}, \vec{K}_{H}, \vec{K}_{D}) \, P(n_{+,-}, n_{+,+} \mid \vec{K}_{H}, \vec{K}_{D})$$

where, according to a simple combinatorial analysis of the problem,

$$P(n_{+,-}, n_{+,+} \mid \vec{K}_{H}, \vec{K}_{D}) = \frac{\binom{K_{D}^{+}}{n_{+,+}} \binom{K_{D}^{-}}{n_{+,-}} \binom{K_{D}^{0}}{n_{+,0}}}{\binom{n}{K_{H}^{+}}}$$

and

$$P(n_{-,-}, n_{-,+} \mid n_{+,-}, n_{+,+}, \vec{K}_{H}, \vec{K}_{D}) = \frac{\binom{K_{D}^{+} - n_{+,+}}{n_{-,+}} \binom{K_{D}^{-} - n_{+,-}}{n_{-,-}} \binom{K_{D}^{0} - n_{+,0}}{n_{-,0}}}{\binom{n - K_{H}^{+}}{K_{H}^{-}}}$$
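As a numerical illustration of the two factors above, the following Python sketch completes the matrix $C$ from its four free entries and the margins, and then evaluates $P(C)$ as the product of the two multivariate hypergeometric terms. The function names and the tuple layout $(K^{+}, K^{-}, K^{0})$ are assumptions made for this example.

```python
from math import comb

def fill_matrix(npp, npm, nmp, nmm, K_H, K_D):
    """Complete the 3x3 table from n_{+,+}, n_{+,-}, n_{-,+}, n_{-,-} and the margins.

    K_H and K_D are (K+, K-, K0) tuples with equal sums.
    Returns None when the margins force a negative entry.
    """
    n = sum(K_H)
    np0 = K_H[0] - npp - npm
    nm0 = K_H[1] - nmp - nmm
    n0p = K_D[0] - npp - nmp
    n0m = K_D[1] - npm - nmm
    n00 = K_H[2] + K_D[2] - n + npp + npm + nmp + nmm
    C = [[npp, npm, np0], [nmp, nmm, nm0], [n0p, n0m, n00]]
    return C if min(min(row) for row in C) >= 0 else None

def prob_C(C, K_H, K_D):
    """P(C) as the product of the two multivariate hypergeometric factors."""
    (npp, npm, np0), (nmp, nmm, nm0), _ = C
    n = sum(K_H)
    first = (comb(K_D[0], npp) * comb(K_D[1], npm) * comb(K_D[2], np0)
             / comb(n, K_H[0]))
    second = (comb(K_D[0] - npp, nmp) * comb(K_D[1] - npm, nmm)
              * comb(K_D[2] - np0, nm0) / comb(n - K_H[0], K_H[1]))
    return first * second

# Sanity check on a small example: the probabilities of all feasible tables sum to 1.
K_H, K_D = (2, 1, 1), (1, 2, 1)
total = 0.0
for a in range(3):
    for b in range(3):
        for c in range(2):
            for d in range(2):
                table = fill_matrix(a, b, c, d, K_H, K_D)
                if table is not None:
                    total += prob_C(table, K_H, K_D)
```

The normalization check is a quick way to confirm that the signs in the solved linear system and the arguments of the binomial coefficients are mutually consistent.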

The distribution of $C$ allows one to calculate the probability

$$P[H(\vec{v}_{H}, \vec{v}_{D}) = x] = P(n_{+,-} + n_{-,+} = x) = P(x)$$

$$\begin{aligned} P(x) &= \sum_{\{n_{+,+}, n_{-,-}, n_{-,+}\}} P(n_{-,-}, n_{-,+} \mid x - n_{-,+}, n_{+,+}, \vec{K}_{H}, \vec{K}_{D}) \, P(x - n_{-,+}, n_{+,+} \mid \vec{K}_{H}, \vec{K}_{D}) \\ &= \sum_{\{n_{+,+}, n_{-,-}, n_{-,+}\}} \frac{\binom{K_{D}^{+}}{n_{+,+}} \binom{K_{D}^{-}}{x - n_{-,+}} \binom{K_{D}^{0}}{n_{+,0}}}{\binom{n}{K_{H}^{+}}} \, \frac{\binom{K_{D}^{+} - n_{+,+}}{n_{-,+}} \binom{K_{D}^{-} - x + n_{-,+}}{n_{-,-}} \binom{K_{D}^{0} - n_{+,0}}{n_{-,0}}}{\binom{n - K_{H}^{+}}{K_{H}^{-}}} \end{aligned}$$

According to this distribution, the p-value associated with an observation $\hat{x} = {\hat{n}}_{-, +} + {\hat{n}}_{+, -}$ is:

$$P(x \geq \hat{x}) = \sum_{\{n_{+,+}, n_{-,-}, n_{-,+}, x \geq \hat{x}\}} \frac{\binom{K_{D}^{+}}{n_{+,+}} \binom{K_{D}^{-}}{x - n_{-,+}} \binom{K_{D}^{0}}{n_{+,0}}}{\binom{n}{K_{H}^{+}}} \, \frac{\binom{K_{D}^{+} - n_{+,+}}{n_{-,+}} \binom{K_{D}^{-} - x + n_{-,+}}{n_{-,-}} \binom{K_{D}^{0} - n_{+,0}}{n_{-,0}}}{\binom{n - K_{H}^{+}}{K_{H}^{-}}} \tag{9}$$
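With the sign counts of both vectors held fixed, the null model can be read as a uniformly random re-pairing of the diseased labels with the healthy ones. Under that reading (an interpretation, not a statement from the paper), the tail probability can be checked by brute force on very short vectors. The sketch below enumerates all permutations with Python's `itertools.permutations`; the function names are hypothetical.

```python
from itertools import permutations

def disagreement(v_h, v_d):
    """H = n_{+,-} + n_{-,+}: number of pairs carrying opposite non-zero signs."""
    return sum(1 for a, b in zip(v_h, v_d) if a * b == -1)

def exact_permutation_pvalue(v_h, v_d):
    """P(x >= x_hat) when the entries of v_d are re-paired uniformly at random.

    Every permutation keeps the sign counts of both vectors unchanged,
    so this mirrors the fixed-margin null model. Only feasible for tiny n,
    because the number of permutations grows as n!.
    """
    x_hat = disagreement(v_h, v_d)
    hits = 0
    total = 0
    for shuffled in permutations(v_d):
        total += 1
        if disagreement(v_h, shuffled) >= x_hat:
            hits += 1
    return hits / total

# Four tissue pairs, two of which are discordant.
p_small = exact_permutation_pvalue([1, 1, -1, 0], [-1, 1, 1, 0])
```

For this example, 8 of the 24 permutations give at least two discordant pairs, so the tail probability is 1/3.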

In the second step, we relax the constraints on the total number of positive, negative, and null signs in both the vectors associated with healthy (H) and diseased (D) tissues. This is done by assuming only that the overall numbers (across H and D tissues) of positive, $K^{+}$, negative, $K^{-}$, and null signs, $K^{0}$, are set. In this case, we have to modify the previous formula. Specifically, let us indicate with $K^{+} = K_{H}^{+} + K_{D}^{+}$, $K^{-} = K_{H}^{-} + K_{D}^{-}$, and $K^{0} = K_{H}^{0} + K_{D}^{0}$ the total numbers of positive, negative, and null signs across the $2 n = K^{+} + K^{-} + K^{0}$ samples, that is, two times the number of paired tissues. In this case, the null hypothesis is attained by assuming that $n$ tissues are randomly selected to be pathological and paired with the others, which are supposed to be the healthy ones. Therefore:

$$P(x \geq \hat{x}) = \sum_{Q} \frac{\binom{K^{+}}{K_{D}^{+}} \binom{K^{-}}{K_{D}^{-}} \binom{K^{0}}{K_{D}^{0}}}{\binom{2n}{n}} \, \frac{\binom{K_{D}^{+}}{n_{+,+}} \binom{K_{D}^{-}}{x - n_{-,+}} \binom{K_{D}^{0}}{n_{+,0}}}{\binom{n}{K_{H}^{+}}} \, \frac{\binom{K_{D}^{+} - n_{+,+}}{n_{-,+}} \binom{K_{D}^{-} - x + n_{-,+}}{n_{-,-}} \binom{K_{D}^{0} - n_{+,0}}{n_{-,0}}}{\binom{n - K_{H}^{+}}{K_{H}^{-}}} \tag{10}$$

where $Q = \{K_{D}^{+}, K_{D}^{-}, n_{+, +}, n_{-, -}, n_{-, +}\}$, such that $x \geq \hat{x}$. Therefore, unlike in Eq. (9), the quantities $K_{D}^{+}$ and $K_{D}^{-}$ can vary, and the sum is carried over all possible values of the parameters such that $x \geq \hat{x}$, under the constraint $K_{D}^{+} + K_{D}^{-} + K_{D}^{0} = n$, with $K_{H}^{i} = K^{i} - K_{D}^{i}$ for each sign $i$. In this manuscript, the Hy-test refers to Eq. (10). We use this test on a large set of genes; therefore, a multiple comparison correction is required. In all subsequent analyses, statistical significance indicates that a Bonferroni-corrected p-value is below the 5% level¹³.
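The two tail probabilities can be evaluated by direct enumeration. The Python sketch below is a straightforward, unoptimized reading of Eqs. (9) and (10), followed by a Bonferroni adjustment. Function names and argument layouts are assumptions made for this example, and the nested loops make it slow for realistic $n$, so it should be read as a check of the formulas rather than as the authors' implementation.

```python
from math import comb

def p_fixed_margins(x_hat, K_H, K_D):
    """Tail probability P(x >= x_hat) with margins K_H, K_D = (K+, K-, K0) held fixed."""
    n = sum(K_H)
    total = 0.0
    for npp in range(min(K_H[0], K_D[0]) + 1):
        for npm in range(min(K_H[0] - npp, K_D[1]) + 1):
            np0 = K_H[0] - npp - npm
            if np0 > K_D[2]:
                continue
            first = (comb(K_D[0], npp) * comb(K_D[1], npm) * comb(K_D[2], np0)
                     / comb(n, K_H[0]))
            for nmp in range(min(K_H[1], K_D[0] - npp) + 1):
                if npm + nmp < x_hat:        # keep only tables with x >= x_hat
                    continue
                for nmm in range(min(K_H[1] - nmp, K_D[1] - npm) + 1):
                    nm0 = K_H[1] - nmp - nmm
                    if nm0 > K_D[2] - np0:
                        continue
                    second = (comb(K_D[0] - npp, nmp) * comb(K_D[1] - npm, nmm)
                              * comb(K_D[2] - np0, nm0)
                              / comb(n - K_H[0], K_H[1]))
                    total += first * second
    return total

def hy_test_pvalue(x_hat, K):
    """Tail probability when only the pooled counts K = (K+, K-, K0) over 2n samples are fixed."""
    two_n = sum(K)
    n = two_n // 2
    p = 0.0
    for kdp in range(min(K[0], n) + 1):
        for kdm in range(min(K[1], n - kdp) + 1):
            kd0 = n - kdp - kdm
            if kd0 > K[2]:
                continue
            K_D = (kdp, kdm, kd0)
            K_H = (K[0] - kdp, K[1] - kdm, K[2] - kd0)
            weight = (comb(K[0], kdp) * comb(K[1], kdm) * comb(K[2], kd0)
                      / comb(two_n, n))
            p += weight * p_fixed_margins(x_hat, K_H, K_D)
    return p

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)
```

Setting `x_hat = 0` should return 1 in both functions, which is a convenient sanity check; an adjusted p-value below 0.05 corresponds to the significance criterion stated in the text.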

Preprocessing procedure for microarray data

To test the effectiveness of the proposed method, we consider gene expression profiles of breast cancer (BRCA) cells in a pattern of paired tissues; 17,632 genes have been recorded in 75 tumour tissues and in the 75 paired normal tissues. The analysis has then also been performed on 67 kidneys with renal clear cell carcinoma (KIRC) paired with 67 normal tissues. Data have been downloaded from The Cancer Genome Atlas (TCGA) database using the TCGA-assembler tool¹⁴. The expression profiles of duplicated genes have been replaced by their mean expression. Moreover, the expression of each gene has been normalized using a quantile normalization procedure implemented in the R package preprocessCore¹⁵. Finally, gene expression values were log2-transformed.
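The three preprocessing steps can be sketched in plain Python as follows. This is not the preprocessCore implementation: it is a textbook version of quantile normalization that breaks ties by input order, and the pseudo-count added before the log2 transform is an assumption introduced here to avoid taking the logarithm of zero. All function names are hypothetical.

```python
from collections import defaultdict
from math import log2

def average_duplicates(rows):
    """rows: list of (gene_name, [value per sample]); duplicated genes are averaged."""
    groups = defaultdict(list)
    for gene, values in rows:
        groups[gene].append(values)
    return {gene: [sum(col) / len(col) for col in zip(*profiles)]
            for gene, profiles in groups.items()}

def quantile_normalize(matrix):
    """matrix[g][s] = expression of gene g in sample s.

    Forces every sample (column) to share the same empirical distribution:
    the value of rank r in each column is replaced by the mean of the
    rank-r values across columns.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_columns) / n_samples
                  for r in range(n_genes)]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            out[g][s] = rank_means[rank]
    return out

def log2_transform(matrix, pseudo_count=1.0):
    """log2(value + pseudo_count); the pseudo-count is an assumption of this sketch."""
    return [[log2(v + pseudo_count) for v in row] for row in matrix]

normalized = quantile_normalize([[5, 4], [2, 1], [3, 6]])
```

After normalization, both columns of the toy matrix contain the same three values (1.5, 3.5, 5.5), assigned according to the original within-column ranks.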

Quantitative analysis of GO-terms

The performance of the Hy-test has been compared to that of two classical methods of differential expression analysis, i.e., the moderated t-test⁴ in combination with a fold change larger than 2, and significance analysis of microarrays⁵. Both tests are available from the Bioconductor repository and are implemented in the packages “limma” and “siggenes”, respectively. The genes that turned out to be significant according to the three methods were also compared by exploiting their functional roles with a Gene Ontology (GO) enrichment analysis¹⁶. We obtained three separate lists of significant GO-terms from the three sets of differentially expressed genes. The GO-analysis has been done using the topGO package from Bioconductor, focusing on biological process terms. Fisher exact p-values have been associated with each GO-term. To identify GO-terms (e.g., cell cycle) conceptually associated with a specific cell line (for example, breast cancer), we have defined a novel procedure that counts the PubMed articles related to the biological concepts under examination, for example, breast cancer and cell cycle. We assume that more articles related to both concepts indicate a stronger conceptual association between them. The automated PubMed search has been carried out using the R package RISmed¹⁷. The query considers articles published between January 2000 and December 2020. The probability of observing $n_{C, T}$ PubMed articles with both keywords “breast cancer” and “cell cycle” is

$$\Pr(N_{C,T} = n_{C,T} \mid N, N_{C}, N_{T}) = \frac{\binom{N_{C}}{n_{C,T}} \binom{N - N_{C}}{N_{T} - n_{C,T}}}{\binom{N}{N_{T}}}$$

where $N$ is the number of articles available on PubMed, $N_{C}$ is the number of articles with the keyword “breast cancer” and $N_{T}$ is the number of articles with “cell cycle” as keywords. Using a hypergeometric test we have associated a p-value of conceptual association with each GO term as

$$\Pr(N_{C,T} \geq n_{C,T}) = \sum_{X = n_{C,T}}^{\min(N_{C}, N_{T})} \Pr(X \mid N, N_{C}, N_{T}).$$
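The upper-tail hypergeometric probability above can be evaluated as in the following Python sketch. Because article counts can be very large, the binomial coefficients are computed in log space with `math.lgamma`; this numerical choice and the function names are assumptions made for illustration, and the small counts in the usage line are invented toy numbers, not PubMed data.

```python
from math import exp, lgamma

def log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k), for 0 <= k <= n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def association_pvalue(n_ct, N, N_C, N_T):
    """Pr(N_{C,T} >= n_ct) for a hypergeometric count.

    N   : total number of articles
    N_C : articles mentioning the condition keyword
    N_T : articles mentioning the term keyword
    n_ct: articles mentioning both
    """
    log_den = log_comb(N, N_T)
    total = 0.0
    for x in range(n_ct, min(N_C, N_T) + 1):
        if N_T - x > N - N_C:      # impossible overlap, probability zero
            continue
        total += exp(log_comb(N_C, x) + log_comb(N - N_C, N_T - x) - log_den)
    return total

# Toy numbers: 10 articles, 4 with the first keyword, 3 with the second, 2 with both.
p_toy = association_pvalue(2, 10, 4, 3)
```

In the toy case the tail is (36 + 4) / 120 = 1/3, which the function reproduces up to floating-point error.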

Results

The three methods, i.e. Hy-test, moderated t-test and SAM, have been compared. Venn diagrams reported in Fig. 1 clearly show the differences between the outcomes of the three considered methods.

Figure 1. Venn diagrams of the differentially expressed genes and significant terms found in each of the three analysis steps by the three methods: Hy-test, moderated t-test, and SAM. The upper panels (A, B, C) refer to the breast tissue and the lower panels (D, E, F) to the kidney. The first column (A and D) refers to the DE analysis, the second column (B and E) to the enrichment analysis and the third column (C and F) to the PubMed search. Significance is assessed when a Bonferroni-corrected p-value is below the 5% level.

Considering breast (kidney) tissues, the Hy-test identifies 1,304 (2,720) significant genes, whereas both SAM and the moderated t-test select many more genes: 7,620 (8,347) and 3,362 (4,192) significant genes, respectively. More importantly, panels A (breast cancer) and D (kidney cancer) of Fig. 1 clearly show that the Hy-test mostly identifies differentially expressed genes that are also identified by both other methods. These results indicate that the Hy-test is more conservative than the other two tests. According to a GO-enrichment analysis of the lists of differentially expressed genes, 109 (245) significant terms result from the Hy-test gene list, 230 (457) from the moderated t-test list, and 66 (162) from the SAM list. The intersections among the detected lists of terms are pictured in Fig. 1B, E. A selection of the terms significantly associated with breast (kidney) cancer, as evaluated by searching PubMed papers, is reported in Table 1 (Table 2).

Table 1.

GO-terms significantly associated with breast cancer among significant GO-terms found using Hy-test, moderated t-test and both procedures.

Sign. GO-term	GO ID	Analysis	Term size	BR term size	p value
Cell cycle checkpoint signaling	GO:0000075	Hy-test	167	34	< 1.11E−16
Mitotic spindle checkpoint signaling	GO:0071174	Hy-test	38	14	< 1.11E−16
Regulation of cell cycle	GO:0051726	Hy-test	951	134	< 1.11E−16
Regulation of cell cycle process	GO:0010564	Hy-test	594	102	< 1.11E−16
Spindle assembly checkpoint signaling	GO:0071173	Hy-test	37	15	< 1.11E−16
Cell surface receptor signaling pathway	GO:0007166	Mod t-test	2485	643	< 1.11E−16
Cell–cell signaling	GO:0007267	Mod t-test	1545	436	< 1.11E−16
Regulation of signal transduction	GO:0009966	Mod t-test	2734	619	< 1.11E−16
Regulation of signaling	GO:0023051	Mod t-test	3107	719	< 1.11E−16
Signal transduction	GO:0007165	Mod t-test	5175	1210	< 1.11E−16
Angiogenesis	GO:0001525	Both	493	171	< 1.11E−16
Cell communication	GO:0007154	Both	5681	1342	< 1.11E−16
Cell population proliferation	GO:0008283	Both	1835	473	< 1.11E−16
Mitotic cell cycle	GO:0000278	Both	833	217	< 1.11E−16
Tissue development	GO:0009888	Both	1749	483	< 1.11E−16

Term size is the number of genes that compose a GO-term; BR term size is the number of GO-term genes associated with breast cancer; p-value is computed by using the hypergeometric distribution.

Table 2.

GO-terms significantly associated with “kidney cancer” among significant GO-terms found using Hy-test, moderated t-test and both procedures.

Sign. GO-term	GO ID	Analysis	Term size	KIRC term size	p value
Apoptotic process	GO:0006915	Hy-test	1761	363	< 1.11E−16
Cell death	GO:0008219	Hy-test	1951	396	< 1.11E−16
Programmed cell death	GO:0012501	Hy-test	1808	371	< 1.11E−16
Cell differentiation	GO:0030154	Mod t-test	3844	1159	< 1.11E−16
Kidney development	GO:0001822	Mod t-test	283	115	< 1.11E−16
Kidney epithelium development	GO:0072073	Mod t-test	133	61	< 1.11E−16
Regulation of cell differentiation	GO:0045595	Mod t-test	1432	459	1.98E−05
Renal system development	GO:0072001	Mod t-test	292	118	< 1.11E−16
Antigen processing and presentation	GO:0019882	Both	102	54	2.37E−09
Cell killing	GO:0001906	Both	173	79	6.80E−15
Immune system development	GO:0002520	Both	881	301	6.86E−04
Leukocyte mediated cytotoxicity	GO:0001909	Both	117	62	7.01E−09
Lymphocyte proliferation	GO:0046651	Both	276	133	4.07E−07
Regulation of signaling	GO:0023051	Both	3110	924	< 1.11E−16

Term size is the number of genes that compose a GO-term; KIRC term size is the number of GO-term genes associated with kidney cancer; p-value is computed by using the hypergeometric distribution.

The list of all terms is reported in Supplementary Table S1 (Table S2). Just 8 (37) of those terms have been found by all the methods, as shown in Fig. 1C, F. It is worth noting that the SAM analysis provides such a large number of differentially expressed genes, more than 5000 in both applications, that it is reasonable to assume the presence of many false positives, while the Hy-test alone or the combined use of the Hy-test and the moderated t-test suggests a better recovery of significant terms associated with both types of cancer.

A crucial issue in interpreting results from transcriptomics studies is the bias due to the significantly high and increasing number of cancer-related studies with respect to any other disease¹⁸. The consequence is that almost any gene has been (or will be) associated with cancer. Evaluating the performance of our algorithm by measuring its ability to retrieve cancer-related genes might therefore not be sufficient. On the other hand, several different perturbations can trigger concerted “expression waves” marking state transitions that could cause global transcriptomic changes with common underlying characteristics¹⁹. The consequence, in this case, is the reported presence of a “generic signature” of differentially expressed genes, i.e. genes that are frequently detected as differentially expressed, regardless of the comparison performed²⁰. Therefore, we evaluated the algorithms by considering their ability to avoid the selection of the generic signature, not because the genes selected are unrelated to the comparisons we are performing, but to test which algorithm can retrieve more specific features of the system under investigation rather than the effect of the generic perturbation. To measure the condition specificity of the tests used, i.e., the ability to select differentially expressed genes specifically related to the sample comparisons performed, we used the DE prior score defined and computed in ref.²⁰. The genes selected with the SAM test show a DE prior score cumulative distribution very close to the diagonal, which is explained by the selection of a high number of genes, most of which are probably false positives (Supplementary Fig. 1). The DE prior scores of the genes selected as differentially expressed in breast cancer tissues with the Hy-test and the moderated t-test are similarly distributed. On the other hand, the Hy-test in kidney data analysis selects differentially expressed genes with significantly lower DE prior scores.
Even though the Hy-test selects a smaller number of differentially expressed genes, its focus is not on the genes that appear differentially expressed in any condition of comparison but, at least in these examples, on genes more peculiar to the system under investigation.

Correlation structures and spectral analysis

Besides using statistical techniques to identify differentially expressed genes, it is also important to use statistical charts to detect normalization problems, differential expression designation problems, and common analysis errors. For example, as shown in Fig. 2 (Fig. 3) for breast (kidney) cancer data, a simple comparison between the correlation matrices of tissues constructed by using (1) all available genes, (2) the genes selected by the moderated t-test and (3) the ones selected by the Hy-test allows one to perform a quality check on the two analyses of differential expression. Specifically, it is possible to observe that the panel of genes selected by the Hy-test can be considered a better filter than the one obtained through the moderated t-test, since the former amplifies gene expression differences between the two types of tissues, healthy (H) and cancer (C). Furthermore, results reported in Fig. 3 about kidney cancer data suggest a misclassification of one (H, C) tissue pair, namely TCGA.CW.5591, which corresponds to the straight lines of opposite colours in the figure.

Figure 2. Correlation structure of breast cancer expression genes. The top-left panel refers to all genes, the top-right panel refers to the set of genes selected by the moderated t-test, and the bottom panel refers to the set of genes selected by the Hy-test. $\bar{\varrho}$ is the block average correlation.

Figure 3. Correlation structure of kidney cancer expression genes. The top-left panel refers to all genes, the top-right panel refers to the set of genes selected by the moderated t-test, and the bottom panel refers to the set of genes selected by the Hy-test. $\bar{\varrho}$ is the block average correlation.
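One plausible way to compute a block average correlation such as $\bar{\varrho}$ is to average the Pearson correlations between tissue profiles within a block (H with H, C with C) or across blocks (H with C). The paper does not spell out the exact definition, so the Python sketch below is an assumption, and the function names are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equally long expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def block_mean_correlation(rows_a, rows_b, same_block=False):
    """Mean correlation over all pairs (a, b) with a in rows_a and b in rows_b.

    Each row is the profile of one tissue over the selected genes.
    With same_block=True, rows_a and rows_b are the same group and only
    distinct pairs (i < j) are used, so the diagonal is excluded.
    """
    values = []
    for i, a in enumerate(rows_a):
        for j, b in enumerate(rows_b):
            if same_block and j <= i:
                continue
            values.append(pearson(a, b))
    return sum(values) / len(values)

# Toy tissues measured on three genes.
healthy = [[1, 2, 3], [2, 4, 6]]
cancer = [[3, 2, 1]]
within_h = block_mean_correlation(healthy, healthy, same_block=True)
between = block_mean_correlation(healthy, cancer)
```

A gene panel that separates the two tissue types well would be expected to give high within-block averages and a low between-block average, as in this toy example.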

Genes often do not work in isolation, but their “effect” is organised into “eigengene” modes, which one can study by performing a Principal Component Analysis (PCA)²¹. The first component typically reflects the batch effect corresponding to the “average expression profile” of genes, whereas minor components may identify disease (or any other perturbation) effects²². The dimensionality reduction obtained by considering principal components provides relevant insights into the considered selection procedures of differentially expressed genes. Using all the genes, the first eigenvector, which explains about 90% of the total variance, also captures the differential effect of the two types of paired tissues as a background effect, making it impossible to use it to identify the gene-disease association. However, analysing the two reduced sets of genes that we identified through the moderated t-test and the Hy-test, we observe that the two effects (background and difference between healthy and cancer tissues) are split into the first two principal components. The variance explained by the two components together is the same as the one explained by the first component obtained from the whole dataset, i.e., about 90%. When looking at the distribution of gene scores projected on the first component (top panels of Supplementary Figs. 2a and 3a), we note a peak in the right tail of the distribution, which smooths out if one considers only genes selected through the t-test, and eventually disappears if one only focuses on genes selected through the Hy-test (batch effect). This evidence suggests using the second principal component (bottom panels of Supplementary Figs. 2a and 3a) to obtain more insights into the involvement of the selected genes in the differentiation between the two types of tissue. Remarkably, the second component for the Hy-test selected genes explains about twice the variance that the same component for t-test selected genes does.
In the Supplementary Material, we report the first two eigenvectors (Supplementary Figs. 2a and 3a) and the correlation structures obtained by ordering the genes according to the second principal component scores (Supplementary Figs. 2b and 3b). This unsupervised procedure can be used in conjunction with ours to visualise high-dimensional space better and investigate the structure of several complex systems in biology²³.
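The fraction of variance explained by the first two principal components can be obtained from the leading eigenvalues of the covariance matrix. The Python sketch below uses power iteration with deflation on a small matrix; it is a generic illustration with hypothetical function names, not the procedure used to produce the supplementary figures, and the orientation of the data (observations in rows, variables in columns) is an assumption.

```python
from math import sqrt

def covariance(rows):
    """Sample covariance matrix; rows are observations, columns are variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_eigenpair(S, iterations=500):
    """Largest eigenvalue and its eigenvector of a symmetric matrix, by power iteration."""
    p = len(S)
    v = [1.0 / (k + 1) for k in range(p)]     # deterministic starting vector
    for _ in range(iterations):
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(S[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v

def explained_variance(rows, k=2):
    """Fractions of total variance explained by the first k principal components."""
    S = covariance(rows)
    p = len(S)
    total = sum(S[i][i] for i in range(p))
    ratios = []
    for _ in range(k):
        lam, v = top_eigenpair(S)
        ratios.append(lam / total)
        # Deflation: remove the component just found before extracting the next one.
        S = [[S[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return ratios

ratios = explained_variance([[2, 0], [-2, 0], [0, 1], [0, -1]])
```

In this toy data set the two variables are uncorrelated with variances 8/3 and 2/3, so the two components explain 80% and 20% of the variance. For real expression matrices a library routine based on the singular value decomposition would be the more practical choice.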

Robustness analysis

To evaluate the finite sample properties of our test, we perform a Monte Carlo simulation in various scenarios. We assessed the performance of both the Hy-test and the moderated t-test in terms of power functions (i.e., rejection rates) under the null (a) and alternative (b) hypotheses, respectively. Simulations are performed by generating paired vectors $(Y_{1}, Y_{2})$ of length $n \in \{50, 75\}$ of synthetic expression profiles, once log-normally distributed and once power-law distributed ($Y_{1} \sim PL (x_{\min} = 20, \alpha = 3.5)$ and $Y_{2} \sim PL (x_{\min} = 40, \alpha = 3.5)$). Under the null hypothesis we simulated two independent samples, $\{Y_{1}\}$ and $\{Y_{2}\}$, such that (1) $E [Y_{2}] - E [Y_{1}] = 0$ and (2) $\mathrm{Var} [Y_{1}] = \mathrm{Var} [Y_{2}] = 0.25$, whereas under the alternative we considered (1) $E [Y_{2}] - E [Y_{1}] = 1$ and (2) $\mathrm{Var} [Y_{1}] = \mathrm{Var} [Y_{2}] = 0.25$. For the latter, we assessed the sensitivity of both methods, the Hy-test and the moderated t-test, according to three different correlation structures among the synthetic paired tissues, i.e., $\rho \in \{0.1, 0.2, 0.4\}$. For each block of simulations, we performed 250 Monte Carlo replicates. Table 3 shows the mean rejection rate after adjusting the p-values with the Bonferroni correction. Results of simulation block (a), under the null hypothesis of no differential expression, are not shown because both tests were robust in detecting true negatives.
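A skeleton of such a simulation can be sketched in Python as follows. Several choices here are assumptions rather than details from the paper: the power law is parameterized as a density proportional to $x^{-\alpha}$ for $x \geq x_{\min}$ and sampled by inverse transform, the two samples are drawn independently (the correlation structure is not reproduced), and `p_value_fn` and `n_tests` are hypothetical placeholders for the test under study and the Bonferroni factor.

```python
import random
from math import log2

def sample_power_law(x_min, alpha, size, rng):
    """Inverse-transform sampling from a density proportional to x**(-alpha), x >= x_min."""
    exponent = -1.0 / (alpha - 1.0)
    return [x_min * (1.0 - rng.random()) ** exponent for _ in range(size)]

def rejection_rate(p_value_fn, n, replicates=250, level=0.05, n_tests=1, seed=0):
    """Share of replicates in which the Bonferroni-adjusted p-value falls below `level`.

    p_value_fn(y1, y2) must return a p-value for one pair of simulated vectors.
    """
    rng = random.Random(seed)
    rejected = 0
    for _ in range(replicates):
        y1 = sample_power_law(20.0, 3.5, n, rng)
        y2 = sample_power_law(40.0, 3.5, n, rng)
        if min(1.0, p_value_fn(y1, y2) * n_tests) < level:
            rejected += 1
    return rejected / replicates

# Doubling x_min shifts the log2-scale mean by about one unit.
rng = random.Random(1)
y1 = sample_power_law(20.0, 3.5, 20000, rng)
y2 = sample_power_law(40.0, 3.5, 20000, rng)
shift = sum(log2(v) for v in y2) / len(y2) - sum(log2(v) for v in y1) / len(y1)
```

Any p-value function, for instance one built on the tail probability of the Hy-test, could be passed as `p_value_fn` to estimate a rejection rate in the spirit of Table 3.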

Table 3.

Results of simulation block (b), where two vectors of paired synthetic expression profiles, $(Y_{1}, Y_{2})$, have to satisfy (1) $E [Y_{2}] - E [Y_{1}] = 1$ and (2) $\mathrm{Var} [Y_{1}] = \mathrm{Var} [Y_{2}] = 0.25$.

	$\mathrm{Cor} (Y_{1}, Y_{2}) = 0.1$		$\mathrm{Cor} (Y_{1}, Y_{2}) = 0.2$		$\mathrm{Cor} (Y_{1}, Y_{2}) = 0.4$
	Log-normal	Power-law	Log-normal	Power-law	Log-normal	Power-law
n = 50
Hy-test	0.89	0.94	0.94	0.97	1.00	0.99
Mod t-test	0.85	0.70	0.96	0.88	1.00	0.98
n = 75
Hy-test	1.00	1.00	1.00	1.00	1.00	1.00
Mod t-test	0.90	0.76	0.98	0.93	1.00	1.00

Average rejection rates after 250 Monte Carlo replicates are reported for two different sample sizes, i.e., $n \in \{50, 75\}$, and two distributions (log-normal and power-law), after adjusting the p-values with the Bonferroni correction.

According to the results reported in Table 3, the Hy-test shows greater robustness than the moderated t-test in identifying true positives, even at low correlation and especially with highly leptokurtic distributions, such as the power-law distribution.

Discussion

DEA plays a central role in comparative transcriptomic studies, which represent the vast majority of gene expression analyses. The core action that defines a transcriptomic comparative study is the definition and retrieval of differentially expressed genes in different conditions. Working with data generated by a plethora of procedures in a very noisy and variable system, such as a biological one, requires one to adopt different approaches to analyse a given phenomenon. We provide a biological interpretation of the results obtained by performing a differential expression analysis of breast and kidney cancer genes through the moderated t-test and our Hy-test.

In the case of the real breast cancer profiles analysed, both the moderated t-test and the Hy-test reveal that DE genes are enriched in functions involved in tissue development and cell proliferation, as expected²⁴. While the t-test approach focuses on signal transduction^25–27, the Hy-test highlights the central role of cell cycle regulation in breast cancer, as strongly supported by recent literature^28,29.

In detail, the mammary gland is a tissue characterised by a high proliferation rate, and its developmental programs are prone to be subverted to promote cancer progression. In the gland, many cells are extremely polarised. When extrinsic or intrinsic factors disrupt the maintenance of this organisation, this disruption may act as a promoter of hyperplasia and transformation³⁰. Several studies also suggest that the disruption of the typical apical-basal polarity may contribute to the metastatic event³¹. The deregulation of extracellular matrix proteins and signalling is sufficient to promote breast cancer development and progression²⁴. Signal transduction has a central role in breast cancer; indeed, breast cancer molecular classification usually follows the presence or absence of specific hormone and growth factor receptors^25,26, with direct implications for diagnosis, prognosis, and therapy. Both tissue development and signal transduction thus have a central role in breast cancer. However, the moderated t-test is not efficient in retrieving the cell-intrinsic cell cycle deregulation GO terms that the Hy-test has pinpointed. Indeed, cell cycle deregulation is crucial to breast cancer development, and the cell cycle control machinery is the target of novel therapeutic strategies, such as CDK4/6 inhibitors^28,29.

In the case of kidney cancers, the differences between the Hy-test results and those from the moderated t-test are even more apparent. Both approaches retrieve an enrichment in cell signalling, particularly in the context of the immunological microenvironment^32,33, whereas only the t-test finds the involvement of functions related to kidney development³⁴. However, only the Hy-test points to “programmed cell death”, which is a central mechanism in kidney cancer, targeted by some therapeutic approaches to the disease³³.

In detail, it is known that the reshaping of the metabolism is one of the key steps that kidney tumour cells must undergo during cancer progression. This metabolic reshaping strongly relies on the cross-talk between cancer cells and the tumour microenvironment³⁵. In particular, the inflammatory microenvironment is involved in the development of both pre-neoplastic alterations and kidney cancer³⁶. To further support our findings, we can also mention that, for patients with renal clear cell carcinoma, a model has been proposed based on a few immune-related genes that can predict the prognosis based on tumour immune microenvironments³⁷. Considering that the subversion of programmed cell death plays a central role in kidney cancer development, it is intriguing to note that only the Hy-test retrieves this GO term from the enrichment analysis. This strongly suggests that a dual approach using both the Hy-test and the moderated t-test can be more suitable than either method alone to investigate the biological meaning of a DEA on real data.

Conclusions

The Hy-test can be adopted alone or jointly with other existing DEA tests to identify differentially expressed genes in a very conservative way, as confirmed by the analyses of real data of breast and kidney cancers reported in this paper. Such robust information would otherwise remain hidden within the extremely large number of genes identified by standard DEA tests as differentially expressed, likely including many false positives. According to our results, the moderated t-test substantially increases the number of significant genes retrieved from DEA with respect to the Hy-test, broadening the differential gene ontology enrichment. Consequently, the Hy-test is more selective than the moderated t-test in retrieving both DE genes and relevant GO terms. On the other hand, the SAM test detects even more statistically significant genes than the moderated t-test, leading to apparent issues in the identification of enriched GO terms. To evaluate the performance of the analysed DEA tests in detecting cancer-related genes, we have focused on the enriched ontology terms validated through the automated PubMed-search procedure described in the “Methods” section. In this way, we can focus our attention only on terms with a widely established involvement in cancer diseases. The excluded terms might also carry important cancer information, but their analysis goes beyond the purpose of the present performance evaluation. The Hy-test is not only able to narrow the window of selected genes, thereby focusing the functional analysis; it can also retrieve specific GO terms that would otherwise be missed. This is particularly evident in the breast cancer dataset, where the moderated t-test also collects the vast majority of DE genes retrieved by the Hy-test. However, the enrichment analysis shows only a moderate overlap, strongly suggesting that the Hy-test can retrieve a different set of genes that points to functions of biological relevance that would otherwise be missed. This is also true, to a lesser extent, for the kidney dataset.
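The two kinds of overlap contrasted above (DE genes on one side, enriched GO terms on the other) can be quantified with simple set measures. The following is a minimal sketch in Python, not the authors' R code. The function names `containment` and `jaccard`, and all gene and term identifiers, are hypothetical placeholders chosen for illustration; they are not results from the breast or kidney datasets.

```python
def containment(a, b):
    """Fraction of the elements of `a` that also occur in `b`."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy inputs (placeholders, not data from the paper).
hy_genes = {"G1", "G2", "G3", "G4"}  # DE genes from the conservative test
t_genes = {"G1", "G2", "G3", "G5", "G6", "G7", "G8", "G9"}  # DE genes from the broader test
hy_terms = {"T1", "T2", "T3"}  # enriched GO terms, conservative test
t_terms = {"T3", "T4", "T5", "T6"}  # enriched GO terms, broader test

# Share of the conservative gene list that the broader test also recovers.
gene_containment = containment(hy_genes, t_genes)

# Agreement between the two enrichment results.
term_jaccard = jaccard(hy_terms, t_terms)

# GO terms found only through the conservative gene list.
hy_only_terms = sorted(hy_terms - t_terms)

print(gene_containment, term_jaccard, hy_only_terms)
```

In this toy case most of the smaller gene list is contained in the larger one, while the enriched term sets share only one element. This is the qualitative pattern described for the breast cancer dataset, shown here on made-up identifiers only.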

Supplementary Information

Supplementary Information (docx, 14.1 MB).

Acknowledgements

MT, WA, and CC acknowledge support by Regione Siciliana, through the PO FESR action 1.1.5, project OBIND N.086202000366—CUP G29J18000700007. MT, GS, and GB acknowledge support from FFR2021. GS acknowledges support by the Italian Ministry of University and Research (MUR) through the project PON-AIM “Attraction and International Mobility”: AIM1873193-2 activity 1. GB acknowledges support by the Italian Ministry of University (MUR) through the project “PON Research and Innovation 2014–2020”.

Author contributions

M.T. and C.C. coordinated the research project. M.T., G.S. and G.B. developed the method and the R software. C.C. and W.A. gathered the data. All authors equally contributed to the interpretation of results, the writing and the revision of the manuscript.

Data availability

Our source codes and data are available for downloading in the GitHub repository (https://github.com/gianluca-sottile/A-Novel-Statistical-Test-For-Differential-Expression-Analysis).

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Gianluca Sottile, Email: gianluca.sottile@unipa.it.

Claudia Coronnello, Email: ccoronnello@fondazionerimed.com.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-022-12246-w.

References

1. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 2003;4:1–10. doi: 10.1186/gb-2003-4-4-210.
2. Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002;18:546–554. doi: 10.1093/bioinformatics/18.4.546.
3. Fagerland MW, Sandvik L. Performance of five two-sample location tests for skewed distributions with unequal variances. Contemp. Clin. Trials. 2009;30:490–496. doi: 10.1016/j.cct.2009.06.007.
4. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004;3:1. doi: 10.2202/1544-6115.1027.
5. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. 2001;98:5116–5121. doi: 10.1073/pnas.091062498.
6. Gallo CA, Cecchini RL, Carballido JA, Micheletto S, Ponzoni I. Discretization of gene expression data revised. Brief. Bioinform. 2016;17:758–770. doi: 10.1093/bib/bbv074.
7. Dussaut JS, Gallo CA, Carballido JA, Ponzoni I. Analysis of gene expression discretization techniques in microarray biclustering. In: International Conference on Bioinformatics and Biomedical Engineering. Springer; 2017. pp. 257–266.
8. Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol. 2008;9:770–780. doi: 10.1038/nrm2503.
9. Dimitrova ES, Licona MPV, McGee J, Laubenbacher R. Discretization of time series data. J. Comput. Biol. 2010;17:853–868. doi: 10.1089/cmb.2008.0023.
10. McCarthy DJ, Smyth GK. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics. 2009;25:765–771. doi: 10.1093/bioinformatics/btp053.
11. Catlett J. On changing continuous attributes into ordered discrete attributes. In: European Working Session on Learning. Springer; 1991. pp. 164–178.
12. Whitley D. A genetic algorithm tutorial. Stat. Comput. 1994;4:65–85. doi: 10.1007/BF00175354.
13. Miller RG. Simultaneous Statistical Inference. Springer; 1981.
14. Wei L, et al. TCGA-assembler 2: Software pipeline for retrieval and processing of TCGA/CPTAC data. Bioinformatics. 2018;34:1615–1617. doi: 10.1093/bioinformatics/btx812.
15. Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi: 10.1093/bioinformatics/19.2.185.
16. Zheng Q, Wang X-J. GOEAST: A web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res. 2008;36:W358–W363. doi: 10.1093/nar/gkn276.
17. Kovalchik S. RISmed: Download Content from NCBI Databases. R package version 2.3.0. https://cran.r-project.org/package=RISmed (2021).
18. de Magalhães JP. Every gene can (and possibly will) be associated with cancer. Trends Genet. (2021).
19. Zimatore G, Tsuchiya M, Hashimoto M, Kasperski A, Giuliani A. Self-organization of whole-gene expression through coordinated chromatin structural transition. Biophys. Rev. 2021;2:31303. doi: 10.1063/5.0058511.
20. Crow M, Lim N, Ballouz S, Pavlidis P, Gillis J. Predictability of human differential gene expression. Proc. Natl. Acad. Sci. 2019;116:6491–6500. doi: 10.1073/pnas.1802973116.
21. Roden JC, et al. Mining gene expression data by interpreting principal components. BMC Bioinform. 2006;7:1–22. doi: 10.1186/1471-2105-7-194.
22. Censi F, Calcagnini G, Bartolini P, Giuliani A. A systems biology strategy on differential gene expression data discloses some biological features of atrial fibrillation. PLoS ONE. 2010;5:e13668. doi: 10.1371/journal.pone.0013668.
23. Langfelder P, Horvath S. Eigengene networks for studying the relationships between co-expression modules. BMC Syst. Biol. 2007;1:1–17. doi: 10.1186/1752-0509-1-54.
24. Zhu J, Xiong G, Trinkle C, Xu R. Integrated extracellular matrix signaling in mammary gland development and breast cancer progression. Histol. Histopathol. 2014;29:1083. doi: 10.14670/hh-29.1083.
25. Akram M, Iqbal M, Daniyal M, Khan AU. Awareness and current knowledge of breast cancer. Biol. Res. 2017;50:1–23. doi: 10.1186/s40659-017-0140-9.
26. Tan PH, et al. The 2019 World Health Organization classification of tumours of the breast. (2020).
27. Rajan A, et al. Deregulated estrogen receptor signaling and DNA damage response in breast tumorigenesis. Biochim. Biophys. Acta (BBA) Rev. Cancer. 2021;1875:188482. doi: 10.1016/j.bbcan.2020.188482.
28. Thu KL, Soria-Bretones I, Mak TW, Cescon DW. Targeting the cell cycle in breast cancer: Towards the next phase. Cell Cycle. 2018;17:1871–1885. doi: 10.1080/15384101.2018.1502567.
29. Ding L, et al. The roles of cyclin-dependent kinases in cell-cycle progression and therapeutic strategies in human breast cancer. Int. J. Mol. Sci. 2020;21:1960. doi: 10.3390/ijms21061960.
30. Rejon C, Al-Masri M, McCaffrey L. Cell polarity proteins in breast cancer progression. J. Cell. Biochem. 2016;117:2215–2223. doi: 10.1002/jcb.25553.
31. Chatterjee SJ, McCaffrey L. Emerging role of cell polarity proteins in breast cancer progression and metastasis. Breast Cancer Targets Ther. 2014;6:15. doi: 10.2147/BCTT.S43764.
32. Drake CG, Stein MN. The immunobiology of kidney cancer. J. Clin. Oncol. 2018;36:3547–3552. doi: 10.1200/JCO.2018.79.2648.
33. Aggen DH, Drake CG, Rini BI. Targeting PD-1 or PD-L1 in metastatic kidney cancer: Combination therapy in the first-line setting. Clin. Cancer Res. 2020;26:2087–2095. doi: 10.1158/1078-0432.CCR-19-3323.
34. Drake KA, et al. Stromal β-catenin activation impacts nephron progenitor differentiation in the developing kidney and may contribute to Wilms tumor. Development. 2020;147:dev189597. doi: 10.1242/dev.189597.
35. Wettersten HI. Reprogramming of metabolism in kidney cancer. Semin. Nephrol. 2020;40:2–13. doi: 10.1016/j.semnephrol.2019.12.002.
36. Peterfi L, Yusenko MV, Kovacs G. IL6 shapes an inflammatory microenvironment and triggers the development of unique types of cancer in end-stage kidney. Anticancer Res. 2019;39:1869–1874. doi: 10.21873/anticanres.13294.
37. Zou Y, Hu C. A 14 immune-related gene signature predicts clinical outcomes of kidney renal clear cell carcinoma. PeerJ. 2020;8:e10183. doi: 10.7717/peerj.10183.


