TSCCA: A tensor sparse CCA method for detecting microRNA-gene patterns from multiple cancers

Wenwen Min; Tsung-Hui Chang; Shihua Zhang; Xiang Wan

doi:10.1371/journal.pcbi.1009044

PLoS Comput Biol. 2021 Jun 1;17(6):e1009044.

Editor: Moritz Gerstung

PMCID: PMC8195367 PMID: 34061840

Abstract

Existing studies have demonstrated that dysregulation of microRNAs (miRNAs or miRs) is involved in the initiation and progression of cancer. Many efforts have been devoted to identifying microRNAs as potential biomarkers for cancer diagnosis and prognosis and as therapeutic targets. With the rapid development of miRNA sequencing technology, a vast amount of miRNA expression data for multiple cancers has been collected. These invaluable data repositories provide new paradigms to explore the relationship between miRNAs and cancer. Thus, there is an urgent need to explore the complex cancer-related miRNA-gene patterns by integrating multi-omics data in a pan-cancer paradigm. In this study, we present a tensor sparse canonical correlation analysis (TSCCA) method for identifying cancer-related miRNA-gene modules across multiple cancers. TSCCA is able to overcome the drawbacks of existing solutions and capture both cancer-shared and cancer-specific miRNA-gene co-expressed modules with better biological interpretations. We comprehensively evaluate the performance of TSCCA using a set of simulated data and matched miRNA/gene expression data across 33 cancer types from the TCGA database. We uncover several dysfunctional miRNA-gene modules with important biological functions and statistical significance. These modules can advance our understanding of miRNA regulatory mechanisms of cancer and provide insights into miRNA-based treatments for cancer.

Author summary

MicroRNAs (miRNAs) are a class of small non-coding RNAs. Previous studies have revealed that miRNA-gene regulatory modules play key roles in the occurrence and development of cancer. However, little has been done to discover miRNA-gene regulatory modules from a pan-cancer view. Thus, new methods are urgently needed to explore the complex cancer-related miRNA-gene patterns by integrating multi-omics data across multiple cancers. To build the connections between miRNA-gene regulatory modules across different cancer types, we propose a tensor sparse canonical correlation analysis (TSCCA) method. Our specific contributions are two-fold: (1) We propose a sparse statistical learning model, TSCCA, and an efficient block-coordinate descent algorithm to solve it. (2) We apply TSCCA to a multi-omics data set of 33 cancer types from TCGA and identify some cancer-related miRNA-gene modules with important biological functions and statistical significance.

Introduction

Cancer is a complex and heterogeneous disease and the second leading cause of death worldwide [1, 2]. Although medical advances have made earlier diagnosis and more effective treatments possible, researchers still face many critical challenges, including cancer drug resistance, optimization of combinatorial drug treatments and design of personalized cancer therapies [3, 4]. A number of studies have been conducted to understand the mechanisms underlying cancer development for better prevention and treatment.

In the past decade, an increasing number of studies have reported that abnormal microRNAs (miRNAs) play important roles in the occurrence and development of cancer [5, 6], and some miRNAs can be used as drug targets for cancer treatment [7, 8]. miRNAs are small non-coding RNAs of about 20 bases that regulate gene expression at the post-transcriptional level [9]. In cancer cells, miRNAs have been found to be heavily dysregulated [8]. Thus, they are potential candidates for prognostic biomarkers and therapeutic targets in cancer. For example, Yang et al. reported that miR-506 plays essential roles in the pathogenesis of ovarian cancer and can be considered a potential therapeutic target [7]. Moreover, Lai et al. outlined some miRNAs as monotherapy or adjuvant therapy from a systems biology perspective [8].

Since miRNAs were discovered, researchers have studied the regulatory mechanisms between miRNAs and genes comprehensively. For example, sequence-based methods have been proposed to predict their regulatory relationships [10, 11]. However, such methods fail to capture context-specific miRNA-gene regulatory relationships. With the development of miRNA sequencing technology, a large amount of miRNA expression data from multiple species has been accumulated (e.g., in the Gene Expression Omnibus database repository [12]). The Cancer Genome Atlas (TCGA) [13] and NCI-60 [14] allow us to obtain matched miRNA and mRNA expression data in certain cancers. These invaluable database repositories provide new paradigms to explore context-specific miRNA-gene regulatory relationships. Several computational methods have been proposed on the basis of modular structure identification [15–21]. Zhang et al. developed a joint non-negative matrix factorization method to discover miRNA-gene co-modules in ovarian cancer [15]. However, the strength of the miRNA-gene relationships in the modules it identifies is still unclear, and the algorithm has a high computational complexity. Min et al. developed a simple two-step method for the same task [16]. This method first reconstructs a sparse miRNA-gene regulation matrix by integrating miRNA and mRNA expression data and prior miRNA group information. Then, a bi-clustering method based on a sparse matrix factorization is used to cluster the regulation matrix for discovering miRNA-gene modules. Yoon et al. (2019) also developed a bi-clustering method to identify condition-specific modules by integrating gene expression data and miRNA sequence-specific target information [21]. Although these methods can discover miRNA-gene modules for one cancer or tissue to some extent, they fail to identify cancer-specific and shared miRNA-gene modules when integrating data from multiple cancers.

Recently, some studies have focused on the integrative analysis of multiple omics data from multiple cancers [22–26]. For example, Tan et al. systematically investigated the positive correlation between miRNAs and genes in multiple human cancers [26]. However, little has been done to discover miRNA-gene regulatory modules from a pan-cancer view. Therefore, new methods are urgently needed to explore the complex cancer-related miRNA-gene patterns by integrating multi-omics data across multiple cancers.

In this study, we present a tensor sparse canonical correlation analysis (TSCCA) method for the explorative analysis of matched miRNA and gene expression data of multiple cancers, with a focus on identifying cancer-specific and shared miRNA-gene co-expressed modules (Fig 1). TSCCA first calculates a cancer-miRNA-gene correlation tensor, which is a “3D” array with gene, miRNA and cancer dimensions (Fig 1B). Then it decomposes the correlation tensor into a number of latent factors (u_i, v_i and w_i, i = 1, ⋯, r) that represent major patterns of variation in the tensor data (Fig 1C). The scores of u_i, v_i and w_i indicate the relative contribution of genes, miRNAs and cancers, respectively. Based on the non-zero elements of u_i, v_i and w_i for any i, we can discover a cancer-miRNA-gene module. In short, our main contributions are two-fold: (1) We design a statistical learning model, TSCCA, which is equivalent to an ℓ₀-norm constrained tensor-based model, and develop an efficient block-coordinate descent algorithm to solve it. (2) We apply TSCCA to a multi-omics data set of 33 cancer types from the TCGA database and discover some dysfunctional miRNA-gene modules with important biological functions and statistical significance.

Fig 1 — (A) Prepare the matched miRNA and gene expression data of 33 cancer types from TCGA. (B) Compute a cancer-miRNA-gene Pearson correlation tensor $A \in R^{p \times q \times M}$ , where p, q and M represent the number of genes, miRNAs and cancers respectively. (C) Estimate multiple sparse latent factors (u_i, v_i and w_i, i = 1, ⋯, r) and these non-zero genes in u_i, non-zero miRNAs in v_i and non-zero cancers in w_i are considered as a cancer-miRNA-gene module.

Materials and methods

Biological data

TCGA data

We used the biological data from 33 TCGA cancer types available from the Broad GDAC Firehose website (http://firebrowse.org/, accessed 28 January 2016). For each cancer type, we downloaded the processed (Level 3) mRNA-seq and miRNA-seq data, and clinical data. Before applying our method, we performed multi-step data preprocessing for each cancer data set: (1) We removed genes and miRNAs that were expressed in fewer than 5% of samples. (2) Missing elements were imputed with the k-nearest neighbor method using the R package “impute”. (3) The expression values were log2 transformed and scaled to zero mean and unit standard deviation for every gene/miRNA. (4) Differential gene expression analysis was carried out using the Wilcoxon test for each gene when the cancer type contained more than 5 normal samples. We found 7889 pan-cancer genes that were significantly differentially expressed in more than 15 cancers (Benjamini-Hochberg (BH) adjusted P < 0.05); the detailed results are shown in S1 Table. Finally, we obtained the matched mRNA and miRNA expression data of 33 cancer types including 9645 cancer samples, 7889 genes and 523 miRNAs (Fig 2A and S2 Table). To further analyze the biological functions of the cancer-miRNA-gene modules, we also downloaded the following data sets:

Fig 2 — (A) Number of cancer patients or samples on 33 cancer types from TCGA in this study. (B) Correlation between the modularity scores of identified modules (y-axis) and the corresponding singular values (objective function values) (x-axis) with PCC r = 0.98. (C) Distribution of modularity scores. The modularity scores of identified modules are significantly greater than those of random ones (Permutation test P < 0.05/50 for each identified module). (D) Among the 1793 genes from all the identified modules, 328 are reported to be related with cancer (Hypergeometric test P = 1.47e-06). (E) Among the 122 miRNAs from all the identified modules, 73 are reported to be related with cancer (Hypergeometric test P = 3.38e-03).

miRNA family database. We downloaded a miRNA family data set from the miRBase database [9]. A miRNA family contains a set of miRNAs.
miRNA-gene interaction network data. We collected an experimentally validated miRNA-gene interaction network data set from the miRTarBase database [27].
Gene interaction network data. We downloaded a protein-protein interaction (PPI) network data set from the Pathway-Commons database [28]. A gene interaction network was constructed from the PPI network.
Cancer gene and miRNA sets. We collected a cancer gene set from the allOnco database (http://www.bushmanlab.org/links/genelists) and a cancer miRNA set from http://mircancer.ecu.edu/ [29].
Gene functional annotations. We also downloaded multiple gene functional annotations including GO biological processes (GOBP), KEGG and Reactome pathways from the Molecular Signatures Database (MSigDB) [30].
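The filtering and scaling steps of the TCGA preprocessing described above can be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions: the function name `preprocess`, the definition of "expressed" as a value greater than zero, and the pseudocount added before the log2 transform are not taken from the paper, and the k-nearest neighbor imputation and differential expression steps are not shown.

```python
import numpy as np

def preprocess(expr, min_frac=0.05, pseudocount=1.0):
    """Filter, log2-transform and standardize a samples x features matrix.

    Assumptions (not from the paper): a feature counts as "expressed" in a
    sample when its value is > 0, and a pseudocount is added before log2.
    """
    expr = np.asarray(expr, dtype=float)
    # Step 1: keep features expressed in at least min_frac of the samples.
    keep = (expr > 0).mean(axis=0) >= min_frac
    kept = expr[:, keep]
    # Step 3: log2 transform, then scale each feature to zero mean, unit SD.
    logged = np.log2(kept + pseudocount)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant features
    return (logged - mean) / std, keep
```

The returned boolean mask `keep` records which columns survived the filter, so that gene and miRNA identifiers can be subset consistently.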

Sparse CCA

Canonical Correlation Analysis (CCA) is a common statistical learning method for analyzing paired data. It learns a projection for both representations such that they are maximally correlated in the dimensionality-reduced space. Suppose $X \in R^{n \times p}$ with n samples and p features and $Y \in R^{n \times q}$ with n samples and q features represent two omics data sets from a single cancer, and the columns of X and Y are centered and scaled to zero mean and unit variance. Then, the CCA model can be written as follows:

\begin{array}{ll} \underset{u, v}{\text{maximize}} & u^{T} X^{T} Y v \\ \text{subject to} & u^{T} X^{T} X u = 1, \; v^{T} Y^{T} Y v = 1 . \end{array}

(1)

Suppose X^T X = I and Y^T Y = I, where I is the identity matrix. Then the above model reduces to:

\begin{array}{ll} \underset{u, v}{\text{maximize}} & u^{T} X^{T} Y v \\ \text{subject to} & u^{T} u = 1, \; v^{T} v = 1 . \end{array}

(2)

which is called diagonal CCA and usually performs better than traditional CCA on high-dimensional data [31, 32]. However, classical CCA leads to non-sparse canonical vectors, which makes it difficult to select features and to interpret the results biologically. To this end, a large number of sparse CCA models have been proposed to obtain sparse canonical vectors by using different penalty functions [16, 33–37]. Specifically, a sparse CCA (SCCA) with ℓ₀-norm constraint [35] can be formulated as the following optimization problem:

\begin{array}{ll} \underset{u, v}{\text{maximize}} & u^{T} X^{T} Y v \\ \text{subject to} & ‖ u ‖_{0} \leq k_{u}, \; ‖ v ‖_{0} \leq k_{v}, \\ & u^{T} u = 1, \; v^{T} v = 1, \end{array}

(3)

where k_u and k_v are two parameters that control the sparsity of the canonical vectors (u and v), and ‖u‖₀ is the number of non-zero elements in u.

Proposed tensor sparse CCA (TSCCA)

Let $X^{i} \in R^{n_{i} \times p}$ with n_i samples and p genes and $Y^{i} \in R^{n_{i} \times q}$ with n_i samples and q miRNAs be the matched gene and miRNA expression matrices of cancer i (i = 1, ⋯, M). Each of their columns is normalized to zero mean and unit variance (Fig 1A). To capture miRNA-gene co-expressed patterns that are invariant across different cancers, we propose a tensor-based method to integrate miRNA and gene expression data from multiple cancers by weighting each cancer as follows:

\begin{array}{ll} \underset{u, v, w}{\text{maximize}} & \sum_{i = 1}^{M} w_{i} \left( u^{T} (X^{i})^{T} Y^{i} v \right) \\ \text{subject to} & ‖ u ‖_{0} \leq k_{u}, \; ‖ v ‖_{0} \leq k_{v}, \; ‖ w ‖_{0} \leq k_{w}, \\ & u^{T} u = 1, \; v^{T} v = 1, \; w^{T} w = 1, \end{array}

(4)

where w = (w₁, w₂, ⋯, w_M)^T. After simplification, we get the following TSCCA model:

\begin{array}{ll} \underset{u, v, w}{\text{maximize}} & A {\bar{\times}}_{1} u {\bar{\times}}_{2} v {\bar{\times}}_{3} w \\ \text{subject to} & ‖ u ‖_{0} \leq k_{u}, \; ‖ v ‖_{0} \leq k_{v}, \; ‖ w ‖_{0} \leq k_{w}, \\ & u^{T} u = 1, \; v^{T} v = 1, \; w^{T} w = 1, \\ \text{with} & A_{: : i} = (X^{i})^{T} Y^{i}, \; i = 1, \dots, M, \end{array}

(5)

where $A {\bar{\times}}_{i} z$ (i = 1, 2, 3) denotes the i-mode (vector) product of a tensor $A \in R^{I_{1} \times I_{2} \times I_{3}}$ with a vector $z \in R^{I_{i}}$, and $A_{: : i}$ is the i-th frontal slice of A, also written as $A_{i}$. More detailed definitions of tensor operations can be found in [38].
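As a small numerical check of the step from Eq (4) to Eq (5), the following NumPy sketch builds the tensor from toy data and confirms that the weighted sum of bilinear forms equals the three-mode contraction. The toy sizes, the random data and the helper names `standardize` and `unit` are illustrative assumptions. With sample standard deviations, each slice is, up to the factor n_i − 1, the gene-miRNA Pearson correlation matrix of cancer i.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, M = 5, 4, 3          # genes, miRNAs, cancers (toy sizes)
n = [12, 15, 10]           # samples per cancer (hypothetical)

def standardize(Z):
    """Center and scale columns to zero mean and unit sample variance."""
    return (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)

def unit(z):
    """Scale a vector to unit Euclidean norm."""
    return z / np.linalg.norm(z)

X = [standardize(rng.normal(size=(n_i, p))) for n_i in n]
Y = [standardize(rng.normal(size=(n_i, q))) for n_i in n]

# Frontal slices A[:, :, i] = (X^i)^T Y^i, stacked into a p x q x M tensor.
A = np.stack([Xi.T @ Yi for Xi, Yi in zip(X, Y)], axis=2)

u = unit(rng.normal(size=p))
v = unit(rng.normal(size=q))
w = unit(rng.normal(size=M))

# Objective of Eq (4): weighted sum of bilinear forms over cancers.
obj_eq4 = sum(w[i] * (u @ X[i].T @ Y[i] @ v) for i in range(M))
# Objective of Eq (5): contract A with u, v and w along modes 1, 2 and 3.
obj_eq5 = np.einsum("ijk,i,j,k->", A, u, v, w)
```

Both objectives agree to floating-point precision, which is all the "simplification" amounts to: the data enter the model only through the slices of A.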

Proposed optimization algorithm

Recently, a global block-coordinate update algorithm has been proposed to solve a class of nonconvex optimization problems [39]. The block-coordinate descent algorithm, also called the alternating iteration algorithm, updates one factor at a time with the others fixed. Inspired by this algorithm, we develop a block-coordinate descent algorithm to solve problem (5):

\begin{array}{ll} u^{k + 1} \leftarrow & \underset{u^{T} u = 1, \; ‖ u ‖_{0} \leq k_{u}}{\arg\min} \; f (u, v^{k}, w^{k}), \\ v^{k + 1} \leftarrow & \underset{v^{T} v = 1, \; ‖ v ‖_{0} \leq k_{v}}{\arg\min} \; f (u^{k + 1}, v, w^{k}), \\ w^{k + 1} \leftarrow & \underset{w^{T} w = 1, \; ‖ w ‖_{0} \leq k_{w}}{\arg\min} \; f (u^{k + 1}, v^{k + 1}, w), \end{array}

(6)

where $f (u, v, w) = - A {\bar{\times}}_{1} u {\bar{\times}}_{2} v {\bar{\times}}_{3} w$. To implement it, we need to solve the three sub-problems in Eq (6). Taking the first as an example, with v and w fixed, it is equivalent to solving

\begin{array}{ll} \underset{u}{\text{minimize}} & - u^{T} z \\ \text{subject to} & u^{T} u = 1, \; ‖ u ‖_{0} \leq k, \end{array}

(7)

where $z = A {\bar{\times}}_{2} v {\bar{\times}}_{3} w$. For convenience, we define a k-sparse projection operator Π(⋅, k) for a given $z \in R^{p}$ with k ≤ p:

{[Π (z, k)]}_{i} = \begin{cases} z_{i}, & \text{if } i \in \text{support} (z, k) \\ 0, & \text{otherwise} \end{cases}

(8)

where support(z, k) is the set of indices of the k entries of z with the largest absolute values. For example, if z = (−6, 4, 5, 2, −1, 3)^T, then Π(z, 3) = (−6, 4, 5, 0, 0, 0)^T. Proposition 1 gives the solution of Eq (7); its proof is detailed in S1 Text.

Proposition 1. Suppose z is a non-zero vector, then the solution of problem (7) is $u^{*} = \frac{Π (z, k)}{{‖ Π (z, k) ‖}_{2}}$ .
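The projection operator of Eq (8) and the closed-form solution in Proposition 1 can be illustrated with a few lines of plain Python. The function names `sparse_project` and `solve_subproblem` are hypothetical, and ties between equal absolute values are broken by position, which is an arbitrary choice not specified in the text.

```python
import math

def sparse_project(z, k):
    """Pi(z, k): keep the k entries of z with largest |z_i|, zero the rest."""
    order = sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)
    support = set(order[:k])
    return [float(z[i]) if i in support else 0.0 for i in range(len(z))]

def solve_subproblem(z, k):
    """u* = Pi(z, k) / ||Pi(z, k)||_2 (Proposition 1); z must be non-zero."""
    proj = sparse_project(z, k)
    norm = math.sqrt(sum(x * x for x in proj))
    return [x / norm for x in proj]

z = [-6, 4, 5, 2, -1, 3]
print(sparse_project(z, 3))  # [-6.0, 4.0, 5.0, 0.0, 0.0, 0.0]
```

For this z and k = 3, the optimal value of u^T z is ‖Π(z, 3)‖₂ = √77, since the unit vector aligned with the retained entries attains the largest inner product among all 3-sparse unit vectors.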

Based on Proposition 1, we develop a block-coordinate descent algorithm to solve (5). The details of this algorithm are shown in Algorithm 1, and its stopping condition, convergence analysis and computational complexity are given in S1 Text.

Algorithm 1 TSCCA algorithm solves Eq (5)

Require: $X^{i} \in R^{n_{i} \times p}$ (gene expression data) and $Y^{i} \in R^{n_{i} \times q}$ (miRNA expression data) for i = 1, ⋯, M (cancer types); Parameters: k_u, k_v, and k_w.

Ensure: u, v, w and singular value d.

1: Compute $A_{i} = {(X^{i})}^{T} Y^{i}, i = 1, \dots, M$

2: Initialize $w = {(\frac{1}{\sqrt{M}}, \dots, \frac{1}{\sqrt{M}})}^{T}$ with ‖w‖ = 1

3: Initialize u, v using the principal left and right singular vectors of $\sum_{i = 1}^{M} w_{i} A_{i}$

4: repeat

5: Compute a matrix $C = \sum_{i = 1}^{M} w_{i} A_{i}$

6: Let z_u = C v

7: $u \leftarrow \frac{Π (z_{u}, k_{u})}{‖ Π (z_{u}, k_{u}) ‖_{2}}$

8: Let z_v = C^T u

9: $v \leftarrow \frac{Π (z_{v}, k_{v})}{‖ Π (z_{v}, k_{v}) ‖_{2}}$

10: Let $z_{w} = {[u^{T} A_{1} v, \dots, u^{T} A_{M} v]}^{T}$

11: $w \leftarrow \frac{Π (z_{w}, k_{w})}{‖ Π (z_{w}, k_{w}) ‖_{2}}$

12: until convergence of u, v and w

13: $d = A {\bar{\times}}_{1} u {\bar{\times}}_{2} v {\bar{\times}}_{3} w$

14: return u, v, w and singular value d
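A minimal NumPy sketch of Algorithm 1 is given below, taking the tensor A (step 1) as input. It is an illustration rather than the authors' implementation: the stopping rule (largest change in u, v, w below `tol`) and the parameters `max_iter` and `tol` are assumptions, because the actual stopping condition is given in S1 Text, and the helper `sparse_unit` assumes its input is non-zero.

```python
import numpy as np

def sparse_unit(z, k):
    """Pi(z, k) / ||Pi(z, k)||_2: keep the k largest |z_i|, then normalize."""
    out = np.zeros_like(z, dtype=float)
    idx = np.argsort(-np.abs(z))[:k]
    out[idx] = z[idx]
    return out / np.linalg.norm(out)

def tscca(A, ku, kv, kw, max_iter=100, tol=1e-8):
    """A: p x q x M tensor with frontal slices A[:, :, i] = (X^i)^T Y^i."""
    M = A.shape[2]
    w = np.full(M, 1.0 / np.sqrt(M))                      # step 2
    U, _, Vt = np.linalg.svd(A @ w, full_matrices=False)
    u, v = U[:, 0], Vt[0]                                 # step 3
    for _ in range(max_iter):                             # steps 4-12
        u_old, v_old, w_old = u, v, w
        C = A @ w                                         # step 5: sum_i w_i A_i
        u = sparse_unit(C @ v, ku)                        # steps 6-7
        v = sparse_unit(C.T @ u, kv)                      # steps 8-9
        z_w = np.einsum("ijk,i,j->k", A, u, v)            # step 10
        w = sparse_unit(z_w, kw)                          # step 11
        change = max(np.linalg.norm(u - u_old),
                     np.linalg.norm(v - v_old),
                     np.linalg.norm(w - w_old))
        if change < tol:                                  # assumed stopping rule
            break
    d = float(np.einsum("ijk,i,j,k->", A, u, v, w))       # step 13
    return u, v, w, d
```

Here `A @ w` contracts the third mode of the tensor with w, which yields the p × q matrix C of step 5. The non-zero entries of the returned u, v and w give the genes, miRNAs and cancers of one module.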

Determination of cancer-miRNA-gene modules

Based on the output of Algorithm 1, the non-zero genes in u, the non-zero miRNAs in v and the non-zero cancer types in w together are considered as a cancer-miRNA-gene functional module (Fig 1C). Furthermore, we extend Algorithm 1 to identify the next module by updating the input $A ≔ A - d \cdot u \circ v \circ w$, where $d = A {\bar{\times}}_{1} u {\bar{\times}}_{2} v {\bar{\times}}_{3} w$ is also called the singular value and reflects the relative importance of the corresponding module (see Algorithm 2 in S1 Text). The parameter selection for Algorithm 1 is discussed in detail in S1 Text.
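The deflation update can be sketched as follows. Algorithm 2 itself is only given in S1 Text, so the loop in `extract_modules` is an assumption about how repeated extraction might be organized, and `solver` is a hypothetical callable that returns unit vectors (u, v, w) for a given tensor.

```python
import numpy as np

def deflate(A, u, v, w):
    """Return d = A x1 u x2 v x3 w and the residual A - d * (u o v o w)."""
    d = float(np.einsum("ijk,i,j,k->", A, u, v, w))
    return d, A - d * np.einsum("i,j,k->ijk", u, v, w)

def extract_modules(A, solver, r):
    """Extract r modules by alternating a single-module solver and deflation."""
    modules = []
    for _ in range(r):
        u, v, w = solver(A)          # hypothetical single-module solver
        d, A = deflate(A, u, v, w)   # remove the rank-one component
        modules.append((u, v, w, d))
    return modules
```

Because u, v and w have unit norm, contracting the residual tensor with the same three vectors gives exactly zero, so the pattern just extracted no longer contributes to the objective in the next round.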

Modularity

For a given module with a gene set I, a miRNA set J and a cancer type set K, we define a modularity score:

\text{Modularity} = \frac{1}{| I | | J | | K |} \sum_{i \in I, j \in J, k \in K} | C_{i j k} |,

(9)

where C_ijk is a Pearson correlation coefficient (PCC) between gene i and miRNA j in the cancer k. A high modularity score indicates that these genes and miRNAs within the module are strongly co-expressed across these selected cancers within the module.
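Eq (9) is the mean absolute correlation over the sub-tensor selected by the module. A short NumPy sketch, assuming C is stored as a genes × miRNAs × cancers array and the module is given as integer index lists (the function name `modularity` is illustrative):

```python
import numpy as np

def modularity(C, genes, mirnas, cancers):
    """Mean |C_ijk| over i in genes, j in mirnas, k in cancers (Eq 9)."""
    block = C[np.ix_(genes, mirnas, cancers)]
    return float(np.abs(block).mean())
```

Taking absolute values means that strongly negative and strongly positive correlations both raise the score.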

Results

Application to the TCGA data

We applied TSCCA to matched miRNA and gene expression data from TCGA consisting of 9645 cancer patients across 33 cancer types (Fig 2A and S2 Table). All outputs of TSCCA are detailed in S3–S6 Tables. We discovered 50 cancer-gene-miRNA modules (S7 Table). Each identified module contains about 100 genes, 10 miRNAs and 20 cancer types. Regarding the characteristics of TSCCA when applied to the TCGA data, we observed that (1) TSCCA converged in about 20 steps (S1 Fig) and took a total of about 1 hour on a personal laptop, and (2) the modularity scores of these modules are strongly correlated with their corresponding singular values in the TSCCA model (PCC r = 0.98 with P < 0.001, Fig 2B). In addition, we used a permutation test to assess the number of overlapping elements between any two modules (S8 Table, see section 7 in S1 Text for more detail). Only 51 out of 1225 pairs of identified modules overlap significantly (permutation test P < 0.05), indicating that the identified modules are largely distinct patterns.

Statistical analysis of correlation of modules

To evaluate the correlations between genes and miRNAs within each module, we randomly generated 1,000 modules with the same size as each identified module. Identified modules with P-values smaller than 0.05/50 were considered significant. We found that the modularity scores of all modules are significantly larger than those of the random ones (Fig 2C and S1 Text). For each cancer type in the TCGA data, we also computed a basic modularity score based on all considered miRNAs (n = 523) and genes (n = 7889). We observed that the 33 basic modularity scores of the 33 TCGA cancer types lie between about 0.1 and 0.2 (S9 Table). For example, the basic modularity score of TGCT is the largest with Modularity = 0.21 and CGA is the smallest with Modularity = 0.086. Full details on these 33 cancer types are given in S9 Table. We observed that the modularity scores of the identified modules are far greater than the corresponding basic modularity scores in the selected cancers.
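One way to set up such a permutation test is sketched below. The exact procedure is described in S1 Text, so the details here are assumptions: random modules are formed by drawing the same numbers of genes and miRNAs uniformly without replacement while keeping the module's cancer types fixed, and the P-value uses an add-one correction. The function name `modularity_pvalue` is hypothetical.

```python
import numpy as np

def modularity_pvalue(C, genes, mirnas, cancers, n_perm=1000, seed=0):
    """Empirical P-value of a module's modularity against random modules."""
    rng = np.random.default_rng(seed)
    p, q, _ = C.shape

    def score(g, m):
        # Mean absolute correlation over the selected sub-tensor (Eq 9).
        return np.abs(C[np.ix_(g, m, cancers)]).mean()

    observed = score(genes, mirnas)
    null = np.array([
        score(rng.choice(p, size=len(genes), replace=False),
              rng.choice(q, size=len(mirnas), replace=False))
        for _ in range(n_perm)
    ])
    # The add-one correction keeps the estimate away from exactly zero.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return float(observed), float(p_value)
```

With 1,000 random modules the smallest attainable value is 1/1001, which is just below the 0.05/50 = 0.001 threshold used in the text.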

Module miRNAs and genes are strongly implicated in cancer

To assess whether the identified modules are related to cancer, we first collected a total of 1793 genes and 122 miRNAs by combining all the modules. In addition, we collected a cancer gene set from the allOnco database and a cancer miRNA set from [29]. As expected, we found that 328 out of 1793 genes are cancer genes (Hypergeometric test P = 1.47e-06) (Fig 2D), and 73 out of 122 miRNAs are cancer miRNAs (Hypergeometric test P = 3.38e-03) (Fig 2E). We also used a hypergeometric test to evaluate whether the number of cancer genes or cancer miRNAs within each identified module is significantly larger than expected by chance (S10 Table and S1 Text). We found that each module contains an average of 6 cancer miRNAs and 20 cancer genes. Of the 50 modules, 8 include significantly more cancer miRNAs and 15 include significantly more cancer genes than expected. For example, module 1 contains 31 cancer genes (fold enrichment = 2.1, hypergeometric test P < 0.05) and module 4 contains 20 cancer genes (fold enrichment = 1.7, hypergeometric test P < 0.05).
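The one-sided hypergeometric enrichment test used here can be written with the Python standard library alone. The sketch below is generic: the argument names are illustrative, and the definition of fold enrichment as the ratio of the observed fraction to the background fraction is an assumption, since the text does not spell it out. The background sizes behind the reported P-values are not given in this section, so no attempt is made to reproduce them.

```python
from math import comb

def hypergeom_enrichment(N, K, n, k):
    """One-sided enrichment test.

    N: background size, K: annotated items in the background,
    n: items drawn (module size), k: annotated items among those drawn.
    Returns (P(X >= k), fold enrichment).
    """
    upper = min(K, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    p_value = tail / comb(N, n)
    fold = (k / n) / (K / N)  # assumed definition of fold enrichment
    return p_value, fold
```

For instance, with a background of 20 items of which 5 are annotated, drawing 4 items and observing 2 annotated ones gives P = 1205/4845 and a fold enrichment of 2.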

Characteristics of modules in different cancers

To visualize the co-expressed pattern of each identified cancer-miRNA-gene module, we first calculated, for each cancer within the module, a Pearson correlation matrix between the genes and the miRNAs within the module based on the corresponding miRNA and mRNA expression data. We then drew a heatmap of these correlation matrices to show the co-expressed pattern. The heatmaps of the identified modules are given in S2 Fig. We found that the identified modules show different co-expressed patterns. For example, the genes and the miRNAs within module 1 show strong positive correlation in all selected cancers (Fig 3A), those within module 2 are both positively and negatively correlated in all selected cancers (Fig 3B), whereas those within module 5 show strong negative correlation in all selected cancers (Fig 3C). These results suggest that miRNA-gene regulation in cancer is very complex.

Fig 3 — The top half of (A) corresponds to the module 1 (row corresponds to gene, column corresponds to miRNA) and the lower part of (A) is a random module for comparison. Similar setting is used for module 2 and module 5 in (B) and (C) respectively. (A), (B) and (C) show three different co-expression patterns.

We further investigated whether these modules are specifically related to some cancer types by visualizing the matrix W (Fig 4). W is the output matrix of Algorithm 2 (see section 6 in S1 Text); each of its columns corresponds to a module, and each row corresponds to a cancer type. The absolute value of W_ij reflects the co-expression intensity between the genes and the miRNAs within module j in cancer i. We first observed that there are only three negative elements in W (S3(A) and S3(B) Fig), i.e., (Module 31, TGCT) is −0.145, (Module 49, TGCT) is −0.23, and (Module 49, UCS) is −0.138. Interestingly, we also observed that the miRNAs and genes within module 31 are positively correlated in TGCT but negatively correlated in other cancer types, and those within module 49 are positively correlated in TGCT and UCS but negatively correlated in other cancer types (S3(C) Fig). In addition, a hierarchical clustering method was used to cluster the rows (cancer types) of W, and the 33 cancer types were divided into 4 clusters. The first cluster (including STAD, STES, COAD, COADREAD, READ, BLCA and ESCA) has the strongest weighted values of W. The second cluster contains TGCT, BRCA, LUSC, LUAD, HNSC, CHOL, UCEC, PAAD, PRAD and CESC, where LUSC and LUAD show very similar patterns across modules. The third cluster (including KIPAN, DLBC, UCS, KIRC, KIRP, THCA, OV, PCPG, MESO, LGG and UVM) has the weakest weighted values. Several cancer types in the third cluster show module-specific characteristics. For example, UVM is specifically related to module 45, and LGG is specifically related to module 11. Importantly, the results of the following survival analysis also show that, among all the modules, module 11 is the most important and clinically relevant module for LGG. The fourth cluster contains SARC, LIHC, SKCM, KICH and ACC.

We note that TSCCA is an explorative tool that identifies the “strongest” modular patterns in the current multi-cancer data. This means that, in a subset of the cancer data, it could identify other significant modules. For example, most of the 50 modules identified by TSCCA on the TCGA dataset are enriched in 60% of the cancer types, while the remaining cancer types are rarely selected. To this end, we may extract a subset of cancers from cluster 3 in Fig 4 and then re-apply TSCCA to extract modules from this subset of the previous data (across 18 cancers). We found some new modules with significant modularity scores, and more details are given in S4 Fig. This procedure overcomes the limitation that a small number of cancers may dominate the results of TSCCA.

We also calculated a modularity score for each cancer type of an identified module. Two examples are shown in Fig 5A and 5E. These modularity scores of different cancers for the two examples are larger than those of the random ones. All the results suggest that the miRNAs and the genes are strongly co-expressed on these selected cancers for each module.

Cooperativity of genes and miRNAs within modules

To evaluate the biological relevance of the modules, we performed GOBP, KEGG and Reactome pathway enrichment analysis for the genes within each module (see section 13 in S1 Text). We downloaded the gene functional annotations including GOBP, KEGG and Reactome pathways from MSigDB [30]. We found that 84% (42 out of 50) of the modules identified by TSCCA are significantly related to at least one functional term with a Benjamini-Hochberg (BH) adjusted P < 0.05 (S11 Table), and different modules tend to be enriched in different terms. On average, each module is significantly enriched in 40 GOBP terms (S12 Table), 2 KEGG terms (S13 Table) and 8 Reactome terms (S14 Table). For example, the top enriched GOBP terms of module 1 include cell cycle process, mitotic cell cycle, cell cycle, etc. (Fig 5B), and the top enriched GOBP terms of module 4 include cell cycle, cell cycle process and chromosome organization (Fig 5F). Importantly, the cell cycle has been reported to be one of 10 oncogenic signaling pathways [40].

To assess whether the genes within a module tend to be densely connected in the gene interaction network, we computed the number of gene interactions from this network for each module (S10 Table). We found that 64% (32 out of 50) of the modules contain significantly more gene interactions than expected by chance (Hypergeometric test P < 0.001). For example, module 1 contains 505 gene interactions, a 15-fold enrichment over the interaction density of the gene interaction network (Hypergeometric test P < 1.0e-16, Fig 5C middle), and module 4 contains 253 gene-gene interactions with 7-fold enrichment (Hypergeometric test P < 1.0e-16, Fig 5G middle). In addition, to avoid the influence of node degree in the gene interaction network, we developed a statistical permutation test to perform the gene-gene interaction set enrichment, and found that 88% (44 out of 50) of the modules contain significantly more gene interactions than expected by chance (Permutation test P < 0.05, see section 23 in S1 Text). All the above results suggest that the genes within each identified module tend to cooperate with each other.

Previous studies have shown that miRNAs co-regulate gene expression cooperatively and participate in cellular activities [5]. We therefore expect the miRNAs within a module to be cooperative. To test this, we collected a miRNA family data set from the miRBase database [9]. We found that 92% (46 out of 50) of the modules have at least two miRNAs in the same family (Permutation test P < 0.01, see section 17 in S1 Text and S15 Table). For example, the members of module 1 including hsa-miR-17-5p, hsa-miR-18a-5p, hsa-miR-93-5p, hsa-miR-106b-5p, and hsa-miR-106b-3p belong to the miR-17 family, which has been reported to be associated with cancer [41]. Module 8 includes seven miRNAs, namely hsa-miR-200b-5p, hsa-miR-200b-3p, hsa-miR-200c-5p, hsa-miR-200c-3p, hsa-miR-200a-5p, hsa-miR-200a-3p, and hsa-miR-429, which belong to the miR-8 family, which has been reported to be associated with cancer [42].

We also evaluated the statistical significance of the cooperation among the genes and among the miRNAs within each module using a permutation test. To this end, we computed the average absolute PCC over all pairs of genes (or miRNAs) within a given module (denoted as gene/miRNA modularity). We found that the gene/miRNA modularity scores of all the identified modules are significantly larger than those of 1000 randomly generated modules (Permutation test P < 0.01) (Fig 6A and 6B). On average, the miRNA modularity score is about 0.5, and the gene modularity score is about 0.45 for the identified modules. These results demonstrate that the genes/miRNAs within a module tend to cooperate from the perspective of co-expression.
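The gene (or miRNA) modularity described above is the mean absolute Pearson correlation over all distinct pairs within the set. A minimal NumPy sketch follows; the function name `within_set_modularity` is illustrative, and which samples enter the correlation (a single cancer or the module's selected cancers) is left to the caller because this section does not specify it.

```python
import numpy as np

def within_set_modularity(expr, members):
    """Mean |PCC| over all pairs of distinct features listed in `members`.

    expr: samples x features expression matrix.
    members: column indices of the genes (or miRNAs) in the module.
    """
    R = np.corrcoef(expr[:, members], rowvar=False)
    upper = np.triu_indices(len(members), k=1)  # each pair counted once
    return float(np.abs(R[upper]).mean())
```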

Fig 6 — (A) The average of absolute gene-gene PCCs of the genes within each module (Permutation test P < 0.01). (B) The same results about miRNAs.

miRNA-gene regulatory network analysis of modules

To evaluate whether the regulatory relationships between miRNAs and genes within a given module tend to be experimentally verified, we computed the number of experimentally validated miRNA-gene interactions between the miRNAs and genes within the module. These experimentally validated interactions come from a miRNA-gene interaction network collected from the miRTarBase database [27]. We found that 38% (19 out of 50) of the modules contain significantly more validated miRNA-gene interactions than expected by chance (Hypergeometric test P < 0.05) (S10 Table). For example, module 1 contains 33 validated miRNA-gene interactions, a 2.5-fold enrichment relative to the whole experimentally validated miRNA-gene network (Fig 5C right), and module 4 contains 57 validated miRNA-gene interactions (Fig 5G right). In addition, to avoid the influence of miRNA degree in the miRNA-gene network, we developed a statistical permutation test to perform the miRNA-gene interaction set enrichment (see section 23 in S1 Text). In this test, 28% (14 out of 50) of the modules contain significantly more miRNA-gene interactions than expected by chance (Permutation test P < 0.05).

For each identified miRNA-gene module, we confirmed that some miRNA-gene interactions are verified by the miRTarBase database, while many miRNA-gene pairs are not verified by the database. Furthermore, based on the experimentally validated miRNA-gene and gene-gene interactions, we built a three-layer miRNA-gene regulatory network for each module: miRNAs regulate genes, and these genes regulate the other genes within the three-layer network (S5(A) Fig). We found that 70% of the modules have at least three miRNAs participating in a three-layer network (Permutation test P < 0.01, see section 17 in S1 Text). The detailed results are shown in S16 Table. For example, we extracted the largest connected miRNA-gene subnetwork of module 1 (including 7 miRNAs, 84 genes and 538 edges), where the miRNAs directly regulate 21 genes and the 21 genes regulate 63 other genes (Fig 5D), and the largest connected miRNA-gene subnetwork of module 4 (including 7 miRNAs, 75 genes and 309 edges), where the miRNAs directly regulate 24 genes and the 24 genes regulate 51 other genes (Fig 5H). Interestingly, for the total of 1793 genes and 122 miRNAs obtained by combining all identified modules, we found 3619 experimentally validated miRNA-gene interactions with hypergeometric test P = 3.5e-43 (S5(B) Fig).

Survival analysis of modules

To evaluate whether the identified modules can serve as prognostic biomarkers, we further investigated the association between survival time and the expression of both miRNAs and genes within each module. For each module and each cancer within the module, we first extracted the first principal component (PC1) from the expression data of the genes and miRNAs within the module. We then divided the cancer samples into two groups based on the median value of PC1 and used the log-rank test to assess the difference between the two groups, yielding a P-value. All P-values were corrected using the Benjamini-Hochberg (BH) method. Based on the -log10(BH adjusted P-value) scores, we built a bipartite graph between the modules and the cancer types (Fig 7A and S17 Table), keeping only the module-cancer edges with BH adjusted P < 0.05. In total, the bipartite graph contains 45 modules, 17 cancer types and 116 significant module-cancer edges. Thus, 90% (45 out of 50) of the modules are significantly related to survival time in at least one cancer. For example, the M11-LGG and M36-KIPAN edges have the largest weights (i.e., the smallest P-values) in the bipartite graph. Module 11 is the most clinically relevant module for LGG (Log-rank test P = 3.18e-06) and module 36 is a clinically relevant module for KIPAN (Log-rank test P = 1.53e-05) (Fig 7B).
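The per-module, per-cancer procedure (PC1 extraction, median split, log-rank test) can be sketched with NumPy as below. This is a minimal illustration, not the authors' code: the function names are hypothetical, the expression matrix is assumed to have samples in rows and the module's miRNAs and genes in columns, and the P-value uses the standard chi-square approximation with one degree of freedom.

```python
import math

import numpy as np


def pc1_scores(expr):
    """First principal component score per sample (rows = samples)."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]


def logrank_p(time, event, group):
    """Two-group log-rank P-value; event is 1 for death, 0 for censoring."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    obs = exp = var = 0.0
    for t in np.unique(time[event]):  # loop over distinct event times
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & group).sum()
        died = (time == t) & event
        d, d1 = died.sum(), (died & group).sum()
        obs += d1
        exp += d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = (obs - exp) ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df


def module_survival_p(expr, time, event):
    """Split samples at the median of PC1 and compare the two groups."""
    scores = pc1_scores(np.asarray(expr, dtype=float))
    high = scores > np.median(scores)
    return logrank_p(time, event, high)
```

The sign of PC1 is arbitrary, but flipping it only swaps the two group labels, which leaves the log-rank P-value unchanged.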

Fig 7 — (A) A bipartite graph between the identified modules and the different cancer types based on -log₁₀(BH adjusted P-value). For each identified module and each cancer within the module, we first extracted the first principal component (PC1) from the expression matrix of both miRNAs and genes within the module for that cancer type. We then divided the samples from the cancer type into two groups based on the median value of PC1, and a P-value was computed using the log-rank test. In the graph, we only kept the edges between modules and cancer types with adjusted P < 0.05. (B) Some cancer-miRNA-gene modules relate to survival time. For a given cancer type and a given module, the Kaplan-Meier survival curves were drawn for each group, and “+” denotes a censored patient. Each sub-figure corresponds to a module and a cancer type. For example, Module 11 has a significant P = 3.2e-06 for LGG (cancer type), written as “M11-LGG, P = 3.2e-06”.

We also considered the expression of each miRNA within a module as a prognostic score (S18 Table). On average, we found two clinically relevant miRNAs with BH adjusted P < 0.05 for each cancer. For example, the two most significant miRNAs are hsa-miR-15b-3p of module 11 for LGG (log-rank test P = 3.33e-06) and hsa-miR-130b-3p of module 43 for KIPAN (log-rank test P = 1.13e-05). In addition, three miRNAs (hsa-miR-93-3p, hsa-miR-130b-5p and hsa-miR-130b-3p) of module 19 correlate with survival in ACC, and four miRNAs (hsa-let-7c-5p, hsa-miR-99a-5p, hsa-miR-125b-5p and hsa-miR-125b-2-3p) of module 3 correlate with survival in BLCA. These results reveal that some modules can be used as prognostic biomarkers in multiple cancer types.
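The BH adjustment applied to the log-rank P-values can be written in a few lines. The sketch below is a generic implementation of the Benjamini-Hochberg step-up procedure, with a hypothetical function name and made-up input values; it is not taken from the authors' code.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up P-values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
```

Edges of the module-cancer bipartite graph would then correspond to the entries whose adjusted value falls below 0.05.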

Case studies

Based on the above functional analysis, we found that some identified modules show diverse biological functions and relevance from different views (S19 Table). We took modules 1, 4 and 11 as examples. Module 1 consists of 100 genes, 10 miRNAs and 20 cancers, of which 5 are cancer miRNAs and 31 are cancer genes (Hypergeometric test P = 2.59e-05). The correlations between miRNAs and genes across the selected cancer types are statistically significant compared to random ones (Permutation test P < 0.001). For five cancer types (STAD, BRCA, STES, KICH and SARC), the expression pattern of miRNAs and genes within the module is significantly related to patient survival (Log-rank test BH adjusted P < 0.05, see section 18 in S1 Text). Therefore, we may consider module 1 as a potential prognostic biomarker for these five cancer types. Moreover, the module genes are enriched in a large number of cancer-related functional terms, including GOBP terms (cell cycle process, mitotic cell cycle, cell cycle, chromosome segregation and cell division) and KEGG pathways (cell cycle, oocyte meiosis, progesterone mediated oocyte maturation, homologous recombination and p53 signaling pathway), suggesting strong cancer relevance. Recent studies have shown that these cell cycle-related functions are related to multiple cancer processes [40, 43]. In addition, five module miRNAs (hsa-miR-17-5p, hsa-miR-18a-5p, hsa-miR-93-5p, hsa-miR-106b-5p and hsa-miR-106b-3p) belong to the miR-17 family, which has been reported to be related to cancer [41, 44]. Finally, we also found that 6 of the 10 miRNAs are related to patient survival in at least one cancer type (Log-rank test BH adjusted P < 0.05) (S18 Table). For example, the expression of hsa-miR-130b-5p and hsa-miR-130b-3p is significantly related to ACC patient survival.

Module 4 contains 5 cancer miRNAs and 25 cancer genes (Hypergeometric test P = 4.66e-03). The correlations between miRNAs and genes across the selected cancer types within this module are statistically significant compared to random ones (Permutation test P < 0.001). This module is significantly related to survival time in five cancer types (ACC, LIHC, LUAD, PAAD and KICH). The genes within the module are enriched in cancer-related functional terms, including GOBP terms (cell cycle, cell cycle process, chromosome organization, mitotic cell cycle and DNA metabolic process) and KEGG pathways (DNA replication, base excision repair, nucleotide excision repair, cell cycle and pyrimidine metabolism). Boyer et al. have reported that the DNA replication pathway plays an important role in cancer [45]. More importantly, 57 miRNA-gene interactions between the miRNAs and genes within this module have been experimentally verified. Using the gene-gene network from the PPI network, we constructed a miRNA-gene-gene regulatory sub-network with 7 miRNAs, 75 genes and 309 edges (Fig 5H and S16 Table).

The last example, module 11, exhibits distinct biological relevance to LGG (Brain Lower Grade Glioma) in terms of miRNAs and genes. Firstly, the miRNAs and genes across the selected cancer types within the module show strong correlations (Permutation test P < 0.001). Secondly, the genes within this module are enriched in several cancer-related KEGG pathways, including cell cycle, small cell lung cancer, DNA replication and mismatch repair. As mentioned earlier, the cell cycle and DNA replication pathways have been reported to play an important role in cancer. Thirdly, 36 miRNA-gene interactions between the miRNAs and genes within this module were verified by the miRTarBase database. We also constructed a miRNA-gene-gene regulatory sub-network, which contains 7 miRNAs, 68 genes and 208 miRNA-gene edges (S16 Table). Importantly, two miRNAs (hsa-miR-130b-5p and hsa-miR-130b-3p) within the module belong to the mir-130 family, which has been reported as a potential biomarker for brain cancer [46–48]. In particular, the expression pattern of miRNAs and genes within the module is significantly related to LGG patient survival (Log-rank test BH adjusted P = 3.18e-06).

Comparison on the simulated data

In this section, we compared TSCCA with SCCA [35] and Modularity_SA on a set of simulated data. Modularity_SA is a modularity-based simulated annealing method (see section 19 in S1 Text) that uses a simulated annealing algorithm to maximize the modularity index (Eq 9) for extracting a cancer-miRNA-gene module.

We generated a synthetic miRNA-gene correlation tensor $A \in \mathbb{R}^{300 \times 30 \times 4}$ with 300 genes, 30 miRNAs and 4 cancers, where (1) $A_{1}[i, j] \sim N(0.5, 0.2^{2})$ when 1 ≤ i ≤ 100 and 1 ≤ j ≤ 10, $A_{1}[i, j] \sim N(-0.5, 0.2^{2})$ when 101 ≤ i ≤ 200 and 11 ≤ j ≤ 20, and the other elements are from $N(0, 0.2^{2})$; (2) $A_{2}[i, j] \sim N(-0.5, 0.2^{2})$ when 1 ≤ i ≤ 100 and 1 ≤ j ≤ 10, $A_{2}[i, j] \sim N(0.5, 0.2^{2})$ when 201 ≤ i ≤ 300 and 21 ≤ j ≤ 30, and the other elements are from $N(0, 0.2^{2})$; (3) $A_{3}[i, j] \sim N(0.5, 0.2^{2})$ when 101 ≤ i ≤ 200 and 11 ≤ j ≤ 20, $A_{3}[i, j] \sim N(-0.5, 0.2^{2})$ when 201 ≤ i ≤ 300 and 21 ≤ j ≤ 30, and the other elements are from $N(0, 0.2^{2})$; (4) $A_{4}[i, j] \sim N(0, 0.2^{2})$ for any i and j. We repeatedly generated 50 such tensors, and Fig 8A shows one of them.

For each $A$, we applied SCCA to each single miRNA-gene correlation matrix $A_{i}$ and to the joint data defined as $\sum_{i=1}^{4} A_{i}$. To ensure a fair comparison between TSCCA, SCCA and Modularity_SA, their parameters were set to be consistent with the size of the true modules. We assessed the similarity between the true modules and the predicted modules using two metrics, the clustering error (CE) score and the Recovery score (S20 and S21 Tables, see section 20 in S1 Text for more detail). The results show that TSCCA is superior to the other methods in terms of Recovery and CE scores (Fig 8B). More results on simulated data with different variances are given in S22 and S23 Tables (see section 21 in S1 Text for more detail). We found that SCCA has two disadvantages on the single-cancer simulated data: (1) SCCA always loses a real module. For example, SCCA misses module 3 when applied to $A_{1}$, misses module 2 when applied to $A_{2}$, and misses all modules when applied to the noise matrix $A_{4}$. (2) SCCA cannot perform feature selection over the cancer types, i.e., it cannot assess the importance of a module for different cancers. Additionally, Modularity_SA has two shortcomings: (1) it misses some real members of the true modules; and (2) it is more time-consuming than TSCCA.
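The construction of the synthetic tensor described above can be reproduced with a short NumPy sketch. The function names, the seed handling and the zero-based index layout are choices made here for illustration; the block positions, means and standard deviation follow the text.

```python
import numpy as np


def make_synthetic_tensor(seed=0, sd=0.2):
    """Return a 300 x 30 x 4 tensor (genes x miRNAs x cancers) with 3 planted modules."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, sd, size=(300, 30, 4))  # background noise N(0, sd^2)
    gene_blocks = [slice(0, 100), slice(100, 200), slice(200, 300)]
    mirna_blocks = [slice(0, 10), slice(10, 20), slice(20, 30)]
    # (cancer index, module index, block mean); the 4th cancer is pure noise.
    planted = [
        (0, 0, 0.5), (0, 1, -0.5),
        (1, 0, -0.5), (1, 2, 0.5),
        (2, 1, 0.5), (2, 2, -0.5),
    ]
    for cancer, module, mean in planted:
        A[gene_blocks[module], mirna_blocks[module], cancer] = rng.normal(
            mean, sd, size=(100, 10)
        )
    return A


def shuffle_tensor(A, seed=0):
    """Shuffle genes (rows) and miRNAs (columns) consistently across cancers."""
    rng = np.random.default_rng(seed)
    gene_perm = rng.permutation(A.shape[0])
    mirna_perm = rng.permutation(A.shape[1])
    return A[gene_perm][:, mirna_perm], gene_perm, mirna_perm


A = make_synthetic_tensor(seed=1)
A_shuffled, gene_perm, mirna_perm = shuffle_tensor(A, seed=2)
joint = A.sum(axis=2)  # the joint matrix sum_i A_i given to SCCA
```

Note that each module appears with opposite signs in two cancers, so its blocks largely cancel in the joint matrix, which is one way to see why summing the slices can hide modules from a matrix-based method.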

Fig 8 — (A) A synthetic miRNA-gene correlation tensor $A$, which contains four matrices with the same number of genes (rows) and miRNAs (columns) and includes three true modules framed by rectangular boxes of different colors. The genes (rows) and miRNAs (columns) of $A$ are shuffled, and the shuffled $A$ is used as the input of the tested methods. (B) Comparison of different methods in terms of CE ± std and Recovery ± std on the simulated data. The Recovery and CE scores are computed over the repeatedly generated tensors $A$.

Comparison on the TCGA data

In this section, we compared TSCCA with SCCA and multiple tri-clustering methods on the TCGA data. Firstly, we used SCCA to identify 50 modules on each cancer data set and compared TSCCA with SCCA in terms of modularity scores and multiple biological indicators (S24 Table). The parameters of SCCA are consistent with those of TSCCA, with k_u = 100, k_v = 10 and k_w = 20 when applied to the TCGA data, matching the module size of 100 genes, 10 miRNAs and 20 cancers. For a single cancer data set, SCCA also ensures that the expression of miRNAs and genes within the identified modules is correlated in that specific cancer (see the eighth column in S24 Table), but it fails to ensure that the miRNAs and genes within the identified modules are correlated in most cancer types (see the seventh column in S24 Table). Thus, TSCCA is more suitable for multi-cancer data than SCCA.

Secondly, we also compared TSCCA with multiple tri-clustering methods, including Modularity_SA, Sparse Canonical Polyadic decomposition (SCP), which uses ℓ₁-regularization to enforce sparsity [49], and two merit-function based methods, “Variance” (see Eq 1 in [50]) and “Mean squared residue (MSR)” (see Eq 3 in [50]). The two merit functions are optimized using a simulated annealing algorithm. Var_SA is a variance-based simulated annealing method, which minimizes the variance merit function to extract a cancer-miRNA-gene module. Similarly, MSR_SA is an MSR-based simulated annealing method, which minimizes the MSR merit function to extract a cancer-miRNA-gene module. The comparison results are given in S25 Table and show that TSCCA is superior to the other tri-clustering methods in terms of multiple biological indicators and modularity score. Due to the definition of MSR, the MSR_SA method is very time-consuming: MSR_SA took an hour to identify a module, while Var_SA took only 5 seconds on a personal computer. Compared with TSCCA and Modularity_SA, the sub-tensors/modules identified by Var_SA or MSR_SA tend to be zero patterns (S6 Fig). We found that Modularity_SA performs well in terms of the number of cancer genes and miRNAs, while TSCCA is better in terms of the modularity score and the number of gene-gene and miRNA-gene edges (S25 Table). In addition, we compared the performance of TSCCA and Modularity_SA on the same input data. Compared with Modularity_SA, TSCCA obtained higher modularity scores and consumed less time (S7(A) and S7(B) Fig). Therefore, from the perspective of maximizing the modularity score, TSCCA is still better than Modularity_SA.
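To make the simulated annealing baselines concrete, the following sketch shows a generic variance-minimizing annealer over fixed-size sub-tensors. It is only an illustration under assumptions: the merit function is taken to be the plain variance of the selected entries (Eq 1 in [50] is not reproduced here and may differ), and the function name, move type, cooling schedule and default parameters are all hypothetical.

```python
import math

import numpy as np


def var_sa(A, sizes, n_iter=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for a sub-tensor of shape `sizes` with small entry variance."""
    rng = np.random.default_rng(seed)
    # Current selection: one index array per tensor mode (genes, miRNAs, cancers).
    sel = [rng.choice(dim, size=k, replace=False) for dim, k in zip(A.shape, sizes)]
    score = A[np.ix_(*sel)].var()
    best_sel, best_score = [s.copy() for s in sel], score
    temp = t0
    for _ in range(n_iter):
        mode = rng.integers(len(sel))
        outside = np.setdiff1d(np.arange(A.shape[mode]), sel[mode])
        if outside.size == 0:
            continue  # every index of this mode is already selected
        # Propose swapping one selected index for an unselected one.
        cand = [s.copy() for s in sel]
        cand[mode][rng.integers(sizes[mode])] = rng.choice(outside)
        cand_score = A[np.ix_(*cand)].var()
        # Always accept improvements; accept worse moves with Metropolis probability.
        if cand_score < score or rng.random() < math.exp((score - cand_score) / temp):
            sel, score = cand, cand_score
            if score < best_score:
                best_sel, best_score = [s.copy() for s in sel], score
        temp *= cooling
    return best_sel, best_score
```

Because a block of near-zero correlations has low variance, a purely variance-driven search of this kind has no incentive to prefer strongly correlated blocks, which is in line with the zero patterns reported for Var_SA.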

Finally, we also compared TSCCA with principal component analysis applied to the joint miRNA and gene expression data from 33 TCGA cancer types (S8 Fig and S26 Table). More details and results on the comparison of TSCCA with other methods are given in S1 Text.

Discussion

Many large projects (e.g., TCGA) have compiled large multi-omics data sets and provided an unprecedented opportunity for a deep understanding of the fundamental mechanisms of cancer [51–53]. To build connections between miRNA-gene regulatory modules across different cancer types, we developed TSCCA to identify cancer-specific and shared miRNA-gene modules using matched miRNA and gene expression data from multiple cancers.

We applied TSCCA to the matched miRNA and mRNA expression profiles across 33 cancer types with 9,645 cancer samples to detect cancer-related miRNA-gene modules. We found that the miRNA-miRNA, gene-gene and miRNA-gene correlations within each module are significantly higher than those of random ones. Furthermore, we investigated the cooperation mechanisms of miRNAs and genes within each module from multiple views: 1) whether miRNAs within the module tend to belong to the same miRNA family; 2) whether genes within the module tend to be enriched in known functional classes, and whether they tend to have significantly enriched interactions in the gene interaction network; 3) whether miRNAs and genes within the module tend to have significantly enriched miRNA-gene interactions in the miRNA-gene network; 4) whether genes and miRNAs within the module tend to be cancer-related markers. We found that most of the modules identified by TSCCA have cooperative characteristics or cancer-related biological functions.

We also revealed that the miRNA-gene co-expression patterns of the identified modules differ from one another (S2 Fig). Interestingly, a large number of miRNA-gene co-expression patterns with positive correlation coefficients were identified, as has been observed before [54]. These results show that 1) miRNA-gene correlation patterns are heterogeneous across cancers; and 2) there may be a large number of indirect miRNA-gene regulatory relationships within each module. Furthermore, our analysis implies that these miRNA-gene patterns take different forms in different cancers: they are strongly co-expressed in some cancers and weakly co-expressed in others. We also found that the miRNA-gene co-expression patterns of some modules are reversed in different cancers. For example, the miRNA-gene correlation coefficients within module 49 are mostly negative in most cancer types, while they are mostly positive in TGCT and UCS (S2(J) Fig). This observation reflects the complexity of miRNA-gene regulation in cancer. We also found that some miRNA-gene modules can be used as diagnostic markers in different cancers. Some survival-related modules are shared across cancers, while others are specific to certain cancers. Additionally, some cancer-specific or shared survival-related miRNAs were also found (S18 Table). This finding suggests that it may be possible to develop miRNA-targeted drugs to treat multiple cancers.

In this study, we have addressed a number of important challenges in the integrative analysis of multi-omics data across multiple cancers. Several directions deserve further investigation. First, how to extend our linear model to identify non-linear relationships between miRNAs and genes across cancer types. Second, how to integrate prior information on the relationships between genes or miRNAs (e.g., the PPI network and gene pathways) to identify more biologically meaningful patterns. Third, how to make use of other omics data, such as copy number variation and DNA methylation data. Last but not least, how to apply our approach to other biological problems. For example, GDSC and CCLE have released a wealth of drug and gene expression data across different cell lines [55–57]. This provides new opportunities to discover cell-specific and shared gene-drug co-modules using TSCCA.

Supporting information

S1 Text. Supporting methods and results.

(DOCX)

S1 Fig. Convergence analysis of 50 modules identified by TSCCA on the TCGA dataset across 33 cancer types.

(PDF)

S2 Fig. Heatmap of cancer-miRNA-gene modules identified by TSCCA in the TCGA dataset.

Each subfigure corresponds to an identified module and a random module. In each subfigure, the top half corresponds to the identified module (rows correspond to genes, columns correspond to miRNAs) and the lower half is a random module for comparison. (A) Heatmap of modules 1 to 5. (B) Heatmap of modules 6 to 10. (C) Heatmap of modules 11 to 15. (D) Heatmap of modules 16 to 20. (E) Heatmap of modules 21 to 25. (F) Heatmap of modules 26 to 30. (G) Heatmap of modules 31 to 35. (H) Heatmap of modules 36 to 40. (I) Heatmap of modules 41 to 45. (J) Heatmap of modules 46 to 50.

(PDF)

S3 Fig. Characteristics of modules in different cancers.

(A) Heatmap showing the output matrix W of Algorithm 2 when applied to the TCGA data. Each column corresponds to a module and each row corresponds to a cancer type, and W_ij reflects the co-expression intensity between the genes and the miRNAs within module j in cancer i. A hierarchical clustering method was used to cluster the rows (cancer types) into four clusters. (B) Scatter plot of the elements of the W matrix. There are three negative elements in W: (Module 31, TGCT) is −0.145, (Module 49, TGCT) is −0.23 and (Module 49, UCS) is −0.138. (C) Heatmaps of these three module-cancer pairs, shown in the blue frame.

(PDF)

S4 Fig. Application of TSCCA to the subset of TCGA cancer data from cluster 3 in Fig 4 to extract 50 modules.

We first extracted a subset of cancers (A) and then re-used TSCCA to extract 50 modules on the subset of the previous data, and we found some new modules with significant modularity scores (B). Finally, we show the heatmap of the corresponding W matrix (C).

(PDF)

S5 Fig. miRNA-gene regulatory network analysis of modules.

(A) For each identified module, a procedure was developed to identify the largest connected subgraph, i.e., a three-layer miRNA-gene regulatory network in which miRNAs regulate genes and these genes regulate other genes. The miRNA-gene interactions are from the miRTarBase network and the gene-gene interactions are from the gene interaction network. (B) A miRNA-gene network containing 3619 experimentally verified miRNA-gene interactions from the miRTarBase network, obtained by combining all genes and miRNAs of the modules identified by TSCCA (Hypergeometric test P = 3.5e-43).

(PDF)

S6 Fig. Heatmap of cancer-miRNA-gene modules identified by different methods in the TCGA dataset.

The top half of each heatmap corresponds to the module 1 (row corresponds to gene, column corresponds to miRNA) and the lower part is a random module for comparison.

(PDF)

S7 Fig. Comparison of different methods on the TCGA data in terms of Modularity score (A) and time (B).

We also compared the running time of different methods on a personal laptop. Box-plots show results in terms of modularity scores and running time of algorithm based on 50 different initializations of each method.

(PDF)

S8 Fig. Results of pcModule.

(A) Heatmap of pcModule. The top half of each heatmap corresponds to the module 1 (row corresponds to gene, column corresponds to miRNA) and the lower part is a random module for comparison. (B) Comparison of modularity scores of pcModule and TSCCA modules.

(PDF)

S9 Fig. Heatmap of some modules identified by TSCCA in the TCGA dataset.

(A) Heatmap of modules 1, 4 and 10. (B) Heatmap of modules 5, 8 and 9. (C) Heatmap of cancer-miRNA-gene module 31 identified by TSCCA in the TCGA dataset. Module 31 is a TGCT-cancer-specific miRNA-gene co-expressed module.

(PDF)

S1 Table. The list of 7889 significantly differentially expressed genes with BH adjusted P < 0.05 in at least 15 cancer types.

(XLSX)

S2 Table. Summary of the TCGA data.

(XLSX)

S3 Table. Objective function values (Singular values) of modules identified by TSCCA.

(XLSX)

S4 Table. Cancer types and weights of modules identified by TSCCA.

(XLSX)

S5 Table. miRNA members and weights of modules identified by TSCCA.

(XLSX)

S6 Table. Gene members and weights of modules identified by TSCCA.

(XLSX)

S7 Table. Summary of modules concerning gene names, miRNA names and cancer type names.

(XLSX)

S8 Table. Significant overlap between two miRNA-gene-cancer modules/subtensors in a binary form.

(XLSX)

S9 Table. Modularity values for different cancer types.

(XLSX)

S10 Table. Enrichment analysis of modules in terms of cancer miRNAs, cancer genes, PPIs and miRNA-gene interactions.

(XLSX)

S11 Table. Number of significant terms.

(XLSX)

S12 Table. Significant GOBP terms.

(XLSX)

S13 Table. Significant KEGG terms.

(XLSX)

S14 Table. Significant Reactome terms.

(XLSX)

S15 Table. Module miRNAs are cooperative within miRNA families.

(XLSX)

S16 Table. Largest connected subnetwork (LCS) of modules where each edge is from verified miRNA-gene and gene-gene interactions.

(XLSX)

S17 Table. Prognostic miRNA-gene module biomarkers in multiple cancer types.

(XLSX)

S18 Table. Prognostic miRNA biomarkers in multiple cancer types.

(XLSX)

S19 Table. Biological functional analysis of selected cancer-miRNA-gene modules.

(XLSX)

S20 Table. Comparison (in terms of CE ± std) on the simulated data.

(XLSX)

S21 Table. Comparison (in terms of Recovery ± std) on the simulated data.

(XLSX)

S22 Table. Comparison (in terms of CE ± std) on the simulated data with different variances.

(XLSX)

S23 Table. Comparison (in terms of Recovery ± std) on the simulated data with different variances.

(XLSX)

S24 Table. Performance comparison of TSCCA and SCCA, where we applied SCCA to identify 50 modules on each cancer data set.

(XLSX)

S25 Table. Performance comparison of TSCCA and the triclustering methods.

(XLSX)

S26 Table. Results of pcModule.

(XLSX)

S27 Table. Gene-gene interaction set enrichment for the identified modules by TSCCA on the TCGA data.

(XLSX)

S28 Table. miRNA-gene interaction set enrichment for the identified modules by TSCCA on the TCGA data.

(XLSX)

Data Availability

We used the biological data from 33 TCGA cancer types available from the Broad GDAC Firehose website (http://firebrowse.org/, accessed 28 January 2016). The code of TSCCA is available from https://github.com/wenwenmin/TSCCA.

Funding Statement

The work of X.W. was supported by the Key-Area Research and Development Program of Guangdong Province of China [2020B0101350001]. The work of W.M. was supported by the National Science Foundation of China [61802157], the Natural Science Foundation of Jiangxi Province of China [20192BAB217004] and the China Postdoctoral Science Foundation [2020M671902]. The work of T-H.C. was supported by the Open Research Fund from Shenzhen Research Institute of Big Data [2019ORF01002] and the National Science Foundation of China [61731018]. The work of S.Z. was supported by the National Science Foundation of China [11661141019, 61621003], the National Ten Thousand Talent Program for Young Top-notch Talents, the National Key Research and Development Program of China [2019YFA0709501] and the CAS Frontier Science Research Key Project for Top Young Scientist [QYZDB-SSW-SYS008]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394–424. doi: 10.3322/caac.21492
2. Mehtonen J, Pölönen P, Häyrynen S, Dufva O, Lin J, Liuksiala T, et al. Data-driven characterization of molecular phenotypes across heterogeneous sample collections. Nucleic Acids Res. 2019;47(13):e76. doi: 10.1093/nar/gkz281
3. Zugazagoitia J, Guedes C, Ponce S, Ferrer I, Molina-Pinelo S, Paz-Ares L. Current challenges in cancer treatment. Clin Ther. 2016;38(7):1551–1566. doi: 10.1016/j.clinthera.2016.03.026
4. Cheng F, Liang H, Butte AJ, Eng C, Nussinov R. Personal mutanomes meet modern oncology drug discovery and precision health. Pharmacol Rev. 2019;71(1):1–19. doi: 10.1124/pr.118.016253
5. Bracken CP, Scott HS, Goodall GJ. A network-biology perspective of microRNA function and dysfunction in cancer. Nat Rev Genet. 2016;17(12):719–732. doi: 10.1038/nrg.2016.134
6. Zhao XM, Liu KQ, Zhu G, He F, Duval B, Richer JM, et al. Identifying cancer-related microRNAs based on gene expression data. Bioinformatics. 2014;31(8):1226–1234. doi: 10.1093/bioinformatics/btu811
7. Yang D, Sun Y, Hu L, Zheng H, Ji P, Pecot CV, et al. Integrated analyses identify a master microRNA regulatory network for the mesenchymal subtype in serous ovarian cancer. Cancer Cell. 2013;23(2):186–199. doi: 10.1016/j.ccr.2012.12.020
8. Lai X, Eberhardt M, Schmitz U, Vera J. Systems biology-based investigation of cooperating microRNAs as monotherapy or adjuvant therapy in cancer. Nucleic Acids Res. 2019;47(15):7753–7766. doi: 10.1093/nar/gkz638
9. Kozomara A, Birgaoanu M, Griffiths-Jones S. miRBase: from microRNA sequences to function. Nucleic Acids Res. 2018;47(D1):D155–D162.
10. Tong Y, Ru B, Zhang J. miRNACancerMAP: an integrative web server inferring miRNA regulation network for cancer. Bioinformatics. 2018;34(18):3211–3213. doi: 10.1093/bioinformatics/bty320
11. Agarwal V, Bell GW, Nam JW, Bartel DP. Predicting effective microRNA target sites in mammalian mRNAs. Elife. 2015;4:e05005. doi: 10.7554/eLife.05005
12. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2012;41(D1):D991–D995. doi: 10.1093/nar/gks1193
13. Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell. 2018;173(2):291–304. doi: 10.1016/j.cell.2018.03.022
14. Marshall EA, Sage AP, Ng KW, Martinez VD, Firmino NS, Bennewith KL, et al. Small non-coding RNA transcriptome of the NCI-60 cell line panel. Sci Data. 2017;4:170157. doi: 10.1038/sdata.2017.157
15. Zhang S, Li Q, Liu J, Zhou XJ. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics. 2011;27(13):i401–i409. doi: 10.1093/bioinformatics/btr206
16. Min W, Liu J, Luo F, Zhang S. A two-stage method to identify joint modules from matched microRNA and mRNA expression data. IEEE Trans Nanobioscience. 2016;15(4):362–370. doi: 10.1109/TNB.2016.2556744
17. Bryan K, Terrile M, Bray IM, Domingo-Fernandez R, Watters KM, Koster J, et al. Discovery and visualization of miRNA–mRNA functional modules within integrated data using bicluster analysis. Nucleic Acids Res. 2013;42(3):e17. doi: 10.1093/nar/gkt1318
18. Li Y, Liang C, Wong KC, Luo J, Zhang Z. Mirsynergy: detecting synergistic miRNA regulatory modules by overlapping neighbourhood expansion. Bioinformatics. 2014;30(18):2627–2635. doi: 10.1093/bioinformatics/btu373
19. Jin D, Lee H. A computational approach to identifying gene-microRNA modules in cancer. PLoS Comput Biol. 2015;11(1):e1004042. doi: 10.1371/journal.pcbi.1004042
20. Shi WJ, et al. Unsupervised discovery of phenotype-specific multi-omics networks. Bioinformatics. 2019;35(21):4336–4343. doi: 10.1093/bioinformatics/btz226
21. Yoon S, Nguyen HC, Jo W, Kim J, Chi SM, Park J, et al. Biclustering analysis of transcriptome big data identifies condition-specific microRNA targets. Nucleic Acids Res. 2019;47(9):e53. doi: 10.1093/nar/gkz139
22. Li Y, Kang K, Krahn JM, Croutwater N, Lee K, Umbach DM, et al. A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. BMC Genomics. 2017;18(1):508. doi: 10.1186/s12864-017-3906-0
23. Yang X, Gao L, Zhang S. Comparative pan-cancer DNA methylation analysis reveals cancer common and specific patterns. Brief Bioinform. 2016;18(5):761–773.
24. Chen H, Li C, Peng X, Zhou Z, Weinstein JN, Caesar-Johnson SJ, et al. A pan-cancer analysis of enhancer expression in nearly 9000 patient samples. Cell. 2018;173(2):386–399. doi: 10.1016/j.cell.2018.03.027
25. Ma X, Liu Y, Liu Y, Alexandrov LB, Edmonson MN, Gawad C, et al. Pan-cancer genome and transcriptome analyses of 1,699 paediatric leukaemias and solid tumours. Nature. 2018;555(7696):371–376. doi: 10.1038/nature25795
26. Tan H, Huang S, Zhang Z, Qian X, Sun P, Zhou X. Pan-cancer analysis on microRNA-associated gene activation. EBioMedicine. 2019;43:82–97. doi: 10.1016/j.ebiom.2019.03.082
27. Huang HY, Lin YCD, Li J, Huang KY, Shrestha S, Hong HC, et al. miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database. Nucleic Acids Res. 2019;48(D1):D148–D154.
28. Rodchenkov I, Babur O, Luna A, Aksoy BA, Wong JV, Fong D, et al. Pathway Commons 2019 Update: integration, analysis and exploration of pathway data. Nucleic Acids Res. 2019;48(D1):D489–D497.
29. Xie B, Ding Q, Han H, Wu D. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics. 2013;29(5):638–644. doi: 10.1093/bioinformatics/btt014
30. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015;1(6):417–425. doi: 10.1016/j.cels.2015.12.004
31. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol. 2009;8(1):Article 28. [DOI] [PMC free article] [PubMed] [Google Scholar]
32. Chang BHW, Kruger U, Kustra R, Zhang J. Canonical correlation analysis based on Hilbert-Schmidt independence criterion and centered kernel target alignment. In: ICML; 2013. p. 316–324. [Google Scholar]
33. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol. 2009;8(1):Article 1. [DOI] [PubMed] [Google Scholar]
34. Chu D, Liao LZ, Ng MK, Zhang X. Sparse canonical correlation analysis: New formulation and algorithm. IEEE Trans Pattern Anal Mach Intell. 2013;35(12):3050–3065. doi: 10.1109/TPAMI.2013.104 [DOI] [PubMed] [Google Scholar]
35. Asteris M, Kyrillidis A, Koyejo O, Poldrack R. A simple and provable algorithm for sparse diagonal CCA. In: ICML; 2016. p. 1148–1157. [Google Scholar]
36. Rohart F, Gautier B, Singh A, Lê Cao KA. mixOmics: An R package for omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13(11):e1005752. doi: 10.1371/journal.pcbi.1005752 [DOI] [PMC free article] [PubMed] [Google Scholar]
37. Min EJ, Safo SE, Long Q. Penalized co-inertia analysis with applications to-omics data. Bioinformatics. 2018;35(6):1018–1025. [DOI] [PMC free article] [PubMed] [Google Scholar]
38. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev. 2009;51(3):455–500. doi: 10.1137/07070111X [DOI] [Google Scholar]
39. Xu Y, Yin W. A globally convergent algorithm for nonconvex optimization based on block coordinate update. J Sci Comput. 2017;72(2):700–734. doi: 10.1007/s10915-017-0376-0 [DOI] [Google Scholar]
40. Sanchez-Vega F, Mina M, Armenia J, Chatila WK, Luna A, La KC, et al. Oncogenic signaling pathways in the cancer genome atlas. Cell. 2018;173(2):321–337. doi: 10.1016/j.cell.2018.03.035 [DOI] [PMC free article] [PubMed] [Google Scholar]
41. Liu F, Zhang F, Li X, Liu Q, Liu W, Song P, et al. Prognostic role of miR-17-92 family in human cancers: evaluation of multiple prognostic outcomes. Oncotarget. 2017;8(40):69125. doi: 10.18632/oncotarget.19096 [DOI] [PMC free article] [PubMed] [Google Scholar]
42. Vallejo DM, Caparros E, Dominguez M. Targeting Notch signalling by the conserved miR-8/200 microRNA family in development and cancer cells. EMBO J. 2011;30(4):756–769. doi: 10.1038/emboj.2010.358 [DOI] [PMC free article] [PubMed] [Google Scholar]
43. Otto T, Sicinski P. Cell cycle proteins as promising targets in cancer therapy. Nat Rev Cancer. 2017;17(2):93. doi: 10.1038/nrc.2016.138 [DOI] [PMC free article] [PubMed] [Google Scholar]
44. Jansson MD, Lund AH. MicroRNA and cancer. Mol Oncol. 2012;6(6):590–610. doi: 10.1016/j.molonc.2012.09.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
45. Boyer AS, Walter D, Sørensen CS. DNA replication and cancer: From dysfunctional replication origin activities to therapeutic opportunities. Semin Cancer Biol. 2016;37–38:16–25. [DOI] [PubMed] [Google Scholar]
46. Petrescu GE, Sabo AA, Torsin LI, Calin GA, Dragomir MP. MicroRNA based theranostics for brain cancer: basic principles. J Exp Clin Cancer Res. 2019;38(1):231. doi: 10.1186/s13046-019-1180-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
47. Gu JJ, Fan KC, Zhang JH, Chen HJ, Wang SS. Suppression of microRNA-130b inhibits glioma cell proliferation and invasion, and induces apoptosis by PTEN/AKT signaling. Int J Mol Med. 2018;41(1):284–292. [DOI] [PMC free article] [PubMed] [Google Scholar]
48. Huang S, Xue P, Han X, Zhang C, Yang L, Liu L, et al. Exosomal miR-130b-3p targets SIK1 to inhibit medulloblastoma tumorigenesis. Cell Death Dis. 2020;11(6):1–16. doi: 10.1038/s41419-020-2621-y [DOI] [PMC free article] [PubMed] [Google Scholar]
49. Allen G. Sparse higher-order principal components analysis. In: Artificial Intelligence and Statistics; 2012. p. 27–36. [Google Scholar]
50. Henriques R, Madeira SC. Triclustering algorithms for three-dimensional data analysis: a comprehensive survey. ACM Comput Surv. 2018;51(5), Article 95. [Google Scholar]
51. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. Multi-Omics Factor Analysis: A framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018;14(6):e8124. doi: 10.15252/msb.20178124 [DOI] [PMC free article] [PubMed] [Google Scholar]
52. Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res. 2018;46(20):10546–10562. doi: 10.1093/nar/gky889 [DOI] [PMC free article] [PubMed] [Google Scholar]
53. Pierre-Jean M, Deleuze JF, Le Floch E, Mauger F. Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Brief Bioinform. 2020;21(6):2011–2030. doi: 10.1093/bib/bbz138 [DOI] [PubMed] [Google Scholar]
54. Dvinge H, Git A, Gräf S, Salmon-Divon M, Curtis C, Sottoriva A, et al. The shaping and functional consequences of the microRNA landscape in breast cancer. Nature. 2013;497(7449):378–382. doi: 10.1038/nature12108 [DOI] [PubMed] [Google Scholar]
55. Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, Schubert M, et al. A landscape of pharmacogenomic interactions in cancer. Cell. 2016;166(3):740–754. doi: 10.1016/j.cell.2016.06.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
56. Ghandi M, Huang FW, Jané-Valbuena J, Kryukov GV, Lo CC, McDonald ER, et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature. 2019;569(7757):503–508. doi: 10.1038/s41586-019-1186-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
57. Chen J, Zhang S. Integrative analysis for identifying joint modular patterns of gene-expression and drug-response data. Bioinformatics. 2016;32(11):1724–1732. doi: 10.1093/bioinformatics/btw059 [DOI] [PubMed] [Google Scholar]

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009044.r001

Decision Letter 0

Florian Markowetz, Moritz Gerstung

2 Oct 2020

Dear Dr. Min,

Thank you very much for submitting your manuscript "TSCCA: A tensor sparse CCA method for detecting microRNA-gene patterns from multiple cancers" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Please excuse the slow turnaround. The paper has now been assessed by two experts in the field who found your work of interest but have flagged a number of major issues. If you feel you can address these criticisms, which concern a comparison to other triclustering methods, the software implementation, and your simulation study, then we would invite you to resubmit an appropriately revised manuscript.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Moritz Gerstung

Guest Editor

PLOS Computational Biology

Florian Markowetz

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The paper presents TSCCA, a novel algorithm for identification of miRNA-gene modules across multiple cancer types from TCGA. TSCCA is based on CCA and extends it to tensors to support several cancer types. The authors apply the method to 33 cancers from TCGA, and show the biological relevance of the detected modules using several databases of biological knowledge, and by considering the survival implications of the selected genes and miRNA. The authors also compare TSCCA to Modularity SA and to SCCA on cancer and synthetic data.

The task of detecting miRNA-gene modules in cancer is of high importance, given the known involvement of miRNA in cancer and the complex regulatory processes that are perturbed in cancer. We are not familiar with previous methods designed to find miRNA-gene modules across several datasets, so this method has potentially high value. The paper is well organized and well written. However, as previous methods detected miRNA-gene modules, and the main innovation of this method is in working on multiple datasets, the benefit of using multiple cancers should be shown in a more convincing manner. Additionally, while no algorithm was previously developed for the specific data used in the study, the problem of tri-clustering (detecting modules in tensors) was previously investigated, and the suggested method should be compared to other tri-clustering methods, in addition to modularity SA. Furthermore, the authors should provide an implementation of TSCCA. If these points are satisfactorily addressed, we think that this work would merit publication in PLOS Computational Biology.

Major comments

1. Previous methods were developed to detect miRNA-gene modules, and the main novelty of TSCCA is that it uses multiple datasets. The advantage of using multiple datasets is not clearly demonstrated. The criteria used to show the biological and clinical relevance of the detected modules (enrichment of cancer genes and miRNAs in the modules, enrichment of miRNA families, miRNA-gene regulatory networks, and survival analysis using the modules) are never compared to solutions obtained on a single cancer dataset. It is very likely that all of these criteria would also show biological relevance if SCCA were applied to each cancer separately.

The multi-dataset nature of the algorithm is currently considered in two places. The first is Figure 4, where the W matrix is visualized. This matrix shows that the modules actually capture only a subset of the cancer types, while other cancer types have many zero or near-zero loadings in W. To these reviewers, it seems surprising that all these cancer types have such a small number of miRNA-gene modules. Is this a biological reality or a bias introduced by the algorithm? The authors should rerun TSCCA after excluding the few cancer types that are responsible for most of the modules, and examine the output of the algorithm. If new biologically significant modules emerge, this would suggest that dominant cancer types overshadow the others. In this case this limitation of the algorithm should be clearly stated.

The second place where the multi-dataset aspect is considered is the direct comparison to SCCA. This analysis is more convincing, but it should be expanded in two ways. First, biological criteria should be used for the comparison (enrichment of cancer genes etc.), and not only the modularity. Second, it is interesting to show the modularity when calculated only on one cancer type (e.g. ACC) when TSCCA is applied to all the data, and SCCA is applied only on ACC. This can show when using one dataset to detect miRNA-gene modules is preferable to using multiple datasets, highlighting the limitations of the algorithm.

2. To the best of our knowledge, no previous method was developed to detect miRNA-gene modules across multiple datasets, but other methods were developed for a similar computational task – triclustering. The only triclustering algorithm TSCCA is compared to is Modularity_SA, which is a method that the authors developed themselves for the comparison. The authors should compare TSCCA to other triclustering methods, even if they do not perform l0 regularization. A survey of triclustering algorithms can be found in [1].

As stated in the previous point, the comparison should include biological criteria. The authors demonstrate convincingly that TSCCA's modules are biologically relevant, but biological criteria (e.g. enrichment of gene-gene interactions) should be compared to other methods.

3. The authors should provide an implementation (or at least a well-documented executable) of TSCCA, as the method, rather than the biological results, is the main contribution of this work.

Other important comments

4. In order to optimize TSCCA's objective function (equation (5)), the authors optimize for u, v and w separately. If w is held constant, the optimization problem is that of sparse diagonal CCA, which is solved by the work of Xu et al. that the authors cite (reference [38] in the paper). Their algorithm can be used for an improved optimization procedure.

5. The analysis for whether the 50 modules significantly overlap is not clear.

a. Does the number of overlapping elements include genes, miRNA and cancer types? If so, it places a higher emphasis on gene expression, which is the most common feature. Or are three tests for overlap performed separately, one for genes, one for miRNA and one for cancer types? If the latter, how are these tests integrated?

b. The random modules are created by sampling 100 genes, 10 miRNA and 20 cancers. While these are the parameters used to run TSCCA, the authors state that not all modules include this number of genes, miRNA and cancers. The number of features in these random modules is therefore different from the number of features in the original modules, and the distribution of the size of the overlap is also different. The tests should be performed by conditioning on the size of the modules.

c. There seems to be a small error in the number of random modules. The appendix says 100 modules were generated, but the number of module pairs appears to be 1000 in the rest of the text.

6. In a couple of analyses the authors do not statistically show the merit of their results.

a. The authors count the number of modules with at least two miRNAs from the same family. Though 46 out of 50 sounds high, a statistical test should be performed to derive a p-value.

b. The authors found that 70% of the modules have at least 3 miRNAs participating in a three-layer network, but the significance of this observation is not clear without a statistical test.

7. The hypergeometric tests used to calculate gene-gene and miRNA-gene interaction set enrichment (as described in sections 15 and 16 of the appendix) do not condition on the degree of the genes and miRNAs in the networks. In the way these tests are currently performed, it is possible to obtain significant p-values just because a chosen gene has high degree in the gene-gene network (or similarly, a gene / miRNA has high degree in the gene-miRNA network), even if there's no enrichment of interactions in the module. Indeed, cancer genes are generally known to have high degrees in gene-gene networks. The tests should be performed by permuting genes / miRNA conditioned on their degree in the networks.

8. The CE and Recovery scores do not consider the cancers selected for each module. The authors need to add an additional metric (or change the current one) that considers the cancers. Additionally, because the number of miRNAs is smaller than the number of genes, it is of interest to report the CE and Recovery scores when considering only genes or only miRNAs. Otherwise the current reporting places more emphasis on genes.

9. The work will be much improved if the biology behind several selected modules is described. How do the discovered modules improve our understanding of pan-cancer gene and miRNA regulation?
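
Comment 5b above asks that the overlap test condition on module size. A minimal sketch of one way to do that is given here; it is not the authors' procedure. It assumes a module can be represented as a set of feature names, and the names `overlap_pvalue`, `universe`, `mod_a` and `mod_b` are hypothetical. The null distribution is built from random sets that have exactly the same sizes as the two observed modules.

```python
import random

def overlap_pvalue(mod_a, mod_b, universe, n_perm=1000, seed=0):
    """Empirical p-value for |mod_a & mod_b| under a size-matched null."""
    rng = random.Random(seed)
    observed = len(mod_a & mod_b)
    pool = sorted(universe)  # sorted so that sampling is reproducible
    hits = 0
    for _ in range(n_perm):
        # Random modules keep the sizes of the observed modules.
        rand_a = set(rng.sample(pool, len(mod_a)))
        rand_b = set(rng.sample(pool, len(mod_b)))
        if len(rand_a & rand_b) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 2000 genes, two modules sharing 40 genes.
universe = {f"gene{i}" for i in range(2000)}
mod_a = {f"gene{i}" for i in range(0, 100)}
mod_b = {f"gene{i}" for i in range(60, 130)}
p_overlap = overlap_pvalue(mod_a, mod_b, universe)
```

The same function could be called separately on the gene, miRNA and cancer-type memberships of a module pair, which is one way to address the question raised in comment 5a.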

Minor comments

10. Prior work by Tan et al. recently investigated miRNA-gene modules across cancers [2]. Their computational approach is quite naïve, but this work should be mentioned.

11. Though the appendix states where corrections for multiple hypotheses were performed, we think that for clarity, these corrections should be also mentioned in the main text when they are used, and whether reported p-values are after correction.

12. The external datasets used in this study are mentioned at the beginning, but it would be helpful if they are also mentioned when they are being used. For example, in section 3.3 it would be helpful if the databases for cancer genes and miRNA are mentioned.

13. The text describing Figure 3C is not clear. It resembles the text for 3B, even though these two examples demonstrate very different phenomena.

14. The objective functions for TSCCA allows for negative W values. It is therefore interesting to visualize W, in addition to the current visualization in Figure 4 of |W|. Are there miRNA-gene modules that are correlated in one cancer type, but are anti-correlated in another? This is briefly discussed in the discussion, and is seen in the supplementary figures, but will be more easily shown by visualizing W directly.

15. "For each miRNA-gene module, … have not been verified" – this sentence is not clear. What verification was performed?

16. Because each module has more genes than miRNA, the 1st PC will likely mainly represent the variance in gene expression. It will be interesting to repeat this analysis by taking the 1st PC using genes only, or the 1st PC using miRNA only.

17. When describing the survival analysis, modules 11 and 36 are specifically mentioned. We understand that this is because these modules had non-zero entries in W in the cancer types for which they were linked to survival. This should be explicitly stated, as otherwise it is not clear why these modules are mentioned.

18. Typos:

a. "For each cancer types, we downloaded..." – should be "type" (p. 3).

b. "We found that 7889…" – "that" should be removed (p. 3).

c. In equation (6), the l0 constraint should be on v, not on u.

d. "While we also found that the modularity…" – remove "While" (p. 7).

e. "negative correlation on all cancers on all cancers". (p. 9).

f. "from a experimentally validated…" – replace "a" with "an" (p. 12).

g. When describing the simulation, the distribution for A_3 is written twice. The second time should be A_4 (p. 15).

h. "it miss some real members" – should be "misses" (p. 15). There are several times "miss" should be replaced with "misses" in this page.

References

[1] - Rui Henriques and Sara C. Madeira. 2018. Triclustering Algorithms for Three-Dimensional Data Analysis: A Comprehensive Survey. ACM Comput. Surv. 51, 5, Article 95 (January 2019), 43 pages. DOI:https://doi.org/10.1145/3195833

[2] - Hua Tan, Shan Huang, Zhigang Zhang, Xiaohua Qian, Peiqing Sun, and Xiaobo Zhou. 2019. Pan-cancer analysis on microRNA-associated gene activation. EBioMedicine. 2019 May; 43: 82–97. doi: 10.1016/j.ebiom.2019.03.082

Reviewer #2: To identify cancer-specific and shared miRNA-gene co-expressed modules, the authors proposed a tensor sparse canonical correlation analysis (TSCCA) method for the analysis of matched miRNA and gene expression data of multiple cancers. The authors first constructed a tensor of gene, miRNA and cancers. Then they decomposed the correlation tensor into a number of latent factors. Finally, based on the non-zero latent factors, they identified cancer-miRNA-gene modules. Application to 33 TCGA cancer types identified novel cancer-related miRNA-gene modules. Here are my comments:

1. In the simulation, the data were generated from normal distributions with fixed variances. I wonder what would happen if the values were generated with larger variances, instead of 0.04?

2. It is claimed that the TSCCA method can identify both cancer-specific and shared miRNA-gene co-expressed modules. I wonder whether TSCCA identified any shared miRNA-gene co-expressed modules across 33 TCGA cancer types. If so, what properties do the shared modules have?

3. The authors evaluated the performance of TSCCA against other methods on the TCGA data mainly using the modularity score. They may want to evaluate the modules identified by the comparison methods using other measurements; for example, they may check the enrichment of cancer-related genes/miRNAs among these modules.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. 2021 Jun 1;17(6):e1009044. doi: 10.1371/journal.pcbi.1009044.r002

Author response to Decision Letter 0

2 Dec 2020

Attachment

Submitted filename: Response_Letter.pdf


PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009044.r003

Decision Letter 1

Florian Markowetz, Moritz Gerstung

28 Jan 2021

Dear Dr. Min,

Thank you very much for submitting your manuscript "TSCCA: A tensor sparse CCA method for detecting microRNA-gene patterns from multiple cancers" for consideration at PLOS Computational Biology.

When you are ready to resubmit, please upload the following:

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Sincerely,

Moritz Gerstung

Guest Editor

PLOS Computational Biology

Florian Markowetz

Deputy Editor

PLOS Computational Biology

***********************

Thank you for submitting a revised manuscript. As you will see, reviewer 1 continues to raise a series of major concerns that need to be addressed to make the manuscript suitable for publication.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors made a thorough revision and seriously considered our previous remarks. This significantly improved the paper. The three main issues we pointed to previously have been largely addressed. We do have several more comments, but if they are addressed as well, the paper is worthy of publication.

Major comment

The one-sample Wilcoxon signed-rank test is used several times in the manuscript, e.g. in the analysis in section 3.2. We do not think this test is appropriate here. The null hypothesis for the Wilcoxon test is that the modularity score for a module is higher than the median for the randomly generated modules. This p-value can be very low even if the module’s modularity score is not very extreme in comparison to the modularity scores of the random modules. For example, consider a module with modularity score 2000, and 1000 random modules with scores 1401, 1402, …, 2400. The p-value for Wilcoxon will be very small (<2e-16 according to our test in R), because we can confidently say that the modularity of the original module (2000) is higher than the median modularity of the random modules. However, 40% of the random modules have a higher score. A better way to calculate an empirical p-value is to count the number of randomly permuted modules with a modularity score equal to or greater than 2000.

In the same section (3.2), it is not clear how a single p-value is calculated. Section S8 in the appendix only describes how to calculate a p-value for a single module, and it is not clear how these p-values are integrated into a single one (which is stated in the paper as “P < 0.001”).

This comment doesn’t apply only to section 3.2, but to other cases in which the one-sample Wilcoxon signed-rank test is used (e.g. the number of modules with at least 2 miRNAs from the same family).
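
The reviewer's numerical example (a module scoring 2000 against random scores 1401, 1402, …, 2400) can be reproduced with the sketch below. It uses only the Python standard library. The signed-rank p-value is computed with a normal approximation, with tie correction and without continuity correction, so it will not match R's output digit for digit; the function names are illustrative and are not taken from the manuscript.

```python
import math

def empirical_pvalue(observed, null_scores):
    """Fraction of null scores that are >= the observed score."""
    hits = sum(1 for s in null_scores if s >= observed)
    return hits / len(null_scores)

def wilcoxon_signed_rank_pvalue(sample, mu):
    """Two-sided one-sample signed-rank test against mu (normal approximation)."""
    diffs = [x - mu for x in sample if x != mu]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        t = j - i + 1
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))

null_scores = list(range(1401, 2401))  # the 1000 random-module scores
p_empirical = empirical_pvalue(2000, null_scores)
p_wilcoxon = wilcoxon_signed_rank_pvalue(null_scores, 2000)
```

Here `p_empirical` is 0.401, because 401 of the 1000 random scores are at least 2000, while `p_wilcoxon` falls far below 1e-16. The signed-rank test asks whether the null distribution is centred at 2000, which is a different question from whether 2000 is extreme within that distribution.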

Other important comments

- To perform permutation testing while conditioning on the degrees in a network, the authors sample genes such that the sum of their degrees is close to the sum of degrees in the original gene set. This conditions on the sum of degrees, but not on the full degree distribution. The common way to perform permutations while conditioning on the degree is to permute gene names only between genes with the same degree. For example, a gene with degree 5 can only be replaced in the permutation with another gene of degree 5. (If the sample is insufficient, this can be done by binning genes by degree and permuting only between genes in the same bin.)

- Two of our major comments were that further comparison to triclustering methods is required, and that the advantage of using multiple cancers is not sufficiently shown. The authors addressed these points, but some of the results from these new analyses are only mentioned briefly in the discussion, while they should be stated more clearly:

o The W matrix visualization shows that a few cancers dominate all the created modules. In the discussion the authors mention this, and show another analysis in which these dominant cancers are removed (in appendix figure S16). This point, that a small number of cancers may dominate the results, is a major caveat of the analysis, and as such it should be mentioned when first presenting TSCCA’s results, and not briefly referred to in the discussion.

o Modularity_SA has very good results in terms of the number of cancer genes and miRs, while TSCCA is better in terms of the number of gene-gene and miR-gene edges. The good performance of Modularity_SA, and its advantages in comparison to TSCCA, should be mentioned in the main text.
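
The degree-conditioned permutation described in the first comment of this list (replace a gene only with a gene of the same degree, or of the same degree bin) can be sketched as follows. This is an illustrative sketch, not the authors' code: `degree` is assumed to be a dictionary mapping every gene in the network to its degree, `statistic` is any user-supplied function of a gene list (for example, the number of network edges inside the set), and all names are hypothetical.

```python
import random
from collections import defaultdict

def degree_matched_sample(gene_set, degree, rng, bin_width=1):
    """Draw a random gene set with the same number of genes per degree bin."""
    bins = defaultdict(list)
    for gene, deg in degree.items():
        bins[deg // bin_width].append(gene)
    needed = defaultdict(int)
    for gene in gene_set:
        needed[degree[gene] // bin_width] += 1
    sampled = []
    for b, count in needed.items():
        # Sampling without replacement inside each bin keeps genes distinct.
        sampled.extend(rng.sample(sorted(bins[b]), count))
    return sampled

def permutation_pvalue(gene_set, degree, statistic, n_perm=1000, seed=0, bin_width=1):
    """Empirical p-value of statistic(gene_set) under degree-matched resampling."""
    rng = random.Random(seed)
    observed = statistic(gene_set)
    hits = sum(
        statistic(degree_matched_sample(gene_set, degree, rng, bin_width)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy input: 50 genes with degrees 1..5.
degree = {f"g{i}": 1 + i % 5 for i in range(50)}
module = ["g0", "g1", "g5", "g12"]
null_set = degree_matched_sample(module, degree, random.Random(1))
```

With `bin_width=1` the sampled set has exactly the same degree multiset as the module. A larger `bin_width` trades exactness for a bigger pool, which matters for high-degree genes that have few same-degree partners.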

Minor comment

In tables reporting the results of TSCCA, the font in the rows representing TSCCA’s results is bold. It is common practice to mark in bold the best result in each column, and we suggest the authors do the same here, or remove the bold font from TSCCA’s row. Otherwise it looks as if TSCCA always has the best performance.

Typos:

- “within this module were verified reported before” (p. 12) – remove “verified” or “reported”.

- “We assessed the similarity of between the true modules” (p. 13) – remove “of”.

- “miRNA-gene correlation patterns are heterogeneity” (p. 15) – should be “heterogeneous”.

- “explorative tool, which identity” (p. 15) – should be “identify”.

Reviewer #2: The authors conducted simulation and real data studies to address the questions. There is a question about Table S25. The authors compared TSCCA with other methods using multiple biological indicators and modularity score and concluded TSCCA is superior to the other tri-clustering methods. However, Modularity_SA identified more cancer_miR and cancer_gene than the TSCCA. The authors may want to discuss this before directly concluding that TSCCA is superior to the other tri-clustering methods.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. 2021 Jun 1;17(6):e1009044. doi: 10.1371/journal.pcbi.1009044.r004

Author response to Decision Letter 1

22 Mar 2021

Attachment

Submitted filename: Response_Letter_0314.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009044.r005

Decision Letter 2

Florian Markowetz, Moritz Gerstung

5 May 2021

Dear Dr. Min,

We are pleased to inform you that your manuscript 'TSCCA: A tensor sparse CCA method for detecting microRNA-gene patterns from multiple cancers' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Moritz Gerstung

Guest Editor

PLOS Computational Biology

Florian Markowetz

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: PCOMPBIOL-D-20-01172: TSCCA: A tensor sparse CCA method for detecting microRNA-gene patterns from multiple cancers

Wenwen Min, Ph.D.; Tsung-Hui Chang; Shihua Zhang; Xiang Wan

The authors considered all our previous remarks, and we now consider this paper as worthy of publication.

We only have a couple of minor suggestions:

- Even though the analysis the authors performed in section 3.2 is now clear, its phrasing is not (this refers to the sentences "The identified modules with… random ones (permutation test P < 0.05 / 50)"). We understand that all modules were statistically significant, but this is not directly stated (it currently says "these modules" rather than "all modules"). Also, k should be removed, and the p-value of 0.05 / 50 mentioned only once. Since it is currently mentioned twice, it seems as if it relates to two different analyses.

- Section 3.4: "These modularity scores… those of the random ones". This sentence is not backed by any statistical analyses, but only by two examples in Figure 5A and 5E. Either a statistical analysis should be performed to validate this sentence, or it should be explicitly stated that this sentence only refers to two examples.
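To make the threshold in the first suggestion concrete: with 50 modules, a Bonferroni-corrected level of 0.05 / 50 equals 0.001. The following is a minimal sketch, not the authors' implementation, of how an empirical permutation p-value could be compared against that level. The function name `permutation_p_value`, the toy null scores, the hypothetical observed score, and the add-one correction are all assumptions made for illustration.

```python
import random


def permutation_p_value(observed, null_scores):
    """One-sided empirical p-value with the add-one correction.

    Counts how many null (permuted) scores are at least as large as the
    observed score; the +1 terms keep the estimate strictly positive.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


n_modules = 50                 # number of modules tested (from the review text)
alpha = 0.05                   # family-wise significance level
threshold = alpha / n_modules  # Bonferroni-corrected level, 0.05 / 50

# Toy null distribution standing in for scores of randomly drawn modules.
rng = random.Random(0)
null_scores = [rng.gauss(0.0, 1.0) for _ in range(9999)]

observed = 10.0                # hypothetical score of one identified module
p = permutation_p_value(observed, null_scores)
is_significant = p < threshold
```

Under the add-one correction the smallest attainable p-value is 1 / (n + 1) for n permutations, so at least 1000 permutations are needed before any module can fall below 0.001.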

Reviewer #2: No more comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009044.r006

Acceptance letter

Florian Markowetz, Moritz Gerstung

24 May 2021

PCOMPBIOL-D-20-01172R2

TSCCA: A tensor sparse CCA method for detecting microRNA-gene patterns from multiple cancers

Dear Dr Min,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Text. Supporting methods and results. (DOCX)

S1 Fig. Convergence analysis of 50 modules identified by TSCCA on the TCGA dataset across 33 cancer types. (PDF)

S2 Fig. Heatmap of cancer-miRNA-gene modules identified by TSCCA in the TCGA dataset. (PDF)

S3 Fig. Characteristics of modules in different cancers. (PDF)

S4 Fig. Application of TSCCA to the subset of TCGA cancer data from cluster 3 in Fig 4 to extract 50 modules. (PDF)

S5 Fig. miRNA-gene regulatory network analysis of modules. (PDF)

S6 Fig. Heatmap of cancer-miRNA-gene modules identified by different methods in the TCGA dataset. The top half of each heatmap corresponds to module 1 (rows correspond to genes, columns to miRNAs); the lower half is a random module for comparison. (PDF)

S7 Fig. Comparison of different methods on the TCGA data in terms of modularity score (A) and time (B). (PDF)

S8 Fig. Results of pcModule. (PDF)

S9 Fig. Heatmap of some modules identified by TSCCA in the TCGA dataset. (PDF)

S1 Table. List of 7889 significantly differentially expressed genes with BH-adjusted P < 0.05 in at least 15 cancer types. (XLSX)

S2 Table. Summary of the TCGA data. (XLSX)

S3 Table. Objective function values (singular values) of modules identified by TSCCA. (XLSX)

S4 Table. Cancer types and weights of modules identified by TSCCA. (XLSX)

S5 Table. miRNA members and weights of modules identified by TSCCA. (XLSX)

S6 Table. Gene members and weights of modules identified by TSCCA. (XLSX)

S7 Table. Summary of modules concerning gene names, miRNA names and cancer type names. (XLSX)

S8 Table. Significant overlap between two miRNA-gene-cancer modules/subtensors in a binary form. (XLSX)

S9 Table. Modularity values for different cancer types. (XLSX)

S10 Table. Enrichment analysis of modules in terms of cancer miRNAs, cancer genes, PPIs and miRNA-gene interactions. (XLSX)

S11 Table. Number of significant terms. (XLSX)

S12 Table. Significant GOBP terms. (XLSX)

S13 Table. Significant KEGG terms. (XLSX)

S14 Table. Significant Reactome terms. (XLSX)

S15 Table. Module miRNAs are cooperative within miRNA families. (XLSX)

S16 Table. Largest connected subnetwork (LCS) of modules where each edge is from verified miRNA-gene and gene-gene interactions. (XLSX)

S17 Table. Prognostic miRNA-gene module biomarkers in multiple cancer types. (XLSX)

S18 Table. Prognostic miRNA biomarkers in multiple cancer types. (XLSX)

S19 Table. Biological functional analysis of selected cancer-miRNA-gene modules. (XLSX)

S20 Table. Comparison (in terms of CE ± std) on the simulated data. (XLSX)

S21 Table. Comparison (in terms of Recovery ± std) on the simulated data. (XLSX)

S22 Table. Comparison (in terms of CE ± std) on the simulated data with different variances. (XLSX)

S23 Table. Comparison (in terms of Recovery ± std) on the simulated data with different variables. (XLSX)

S24 Table. Performance comparison of TSCCA and SCCA, where we applied SCCA to identify 50 modules on each cancer data set. (XLSX)

S25 Table. Performance comparison of TSCCA and the triclustering methods. (XLSX)

S26 Table. Results of pcModule. (XLSX)

S27 Table. Gene-gene interaction set enrichment for the modules identified by TSCCA on the TCGA data. (XLSX)

S28 Table. miRNA-gene interaction set enrichment for the modules identified by TSCCA on the TCGA data. (XLSX)

Attachment

Submitted filename: Response_Letter.pdf

Data Availability Statement
