Briefings in Bioinformatics. 2024 Jun 18;25(4):bbae288. doi: 10.1093/bib/bbae288

dCCA: detecting differential covariation patterns between two types of high-throughput omics data

Hwiyoung Lee 1,2, Tianzhou Ma 3, Hongjie Ke 4, Zhenyao Ye 5,6, Shuo Chen 7,8,9
PMCID: PMC11184902  PMID: 38888456

Abstract

Motivation

The advent of multimodal omics data has provided an unprecedented opportunity to systematically investigate underlying biological mechanisms from distinct yet complementary angles. However, the joint analysis of multi-omics data remains challenging because it requires modeling interactions between multiple sets of high-throughput variables. Furthermore, these interaction patterns may vary across different clinical groups, reflecting disease-related biological processes.

Results

We propose a novel approach called Differential Canonical Correlation Analysis (dCCA) to capture differential covariation patterns between two multivariate vectors across clinical groups. Unlike classical Canonical Correlation Analysis, which maximizes the correlation between two multivariate vectors, dCCA aims to maximally recover differentially expressed multivariate-to-multivariate covariation patterns between groups. We have developed computational algorithms and a toolkit to sparsely select paired subsets of variables from two sets of multivariate variables while maximizing the differential covariation. Extensive simulation analyses demonstrate the superior performance of dCCA in selecting variables of interest and recovering differential correlations. We applied dCCA to the Pan-Kidney cohort from the Cancer Genome Atlas Program database and identified differentially expressed covariations between noncoding RNAs and gene expressions.

Availability and Implementation

The R package that implements dCCA is available at https://github.com/hwiyoungstat/dCCA.

Keywords: canonical correlation analysis, differential correlation, bipartite graph, multivariate-to-multivariate, multiomics, RNA gene regulation

Introduction

Multiomics data have recently gained increased attention due to their multifaceted involvement in various aspects of the underlying biological environment. For example, in cancer research, the joint analysis of gene expression and non-coding RNAs (ncRNAs) that are not translated into proteins, including microRNAs (miRNAs), long noncoding RNAs (lncRNAs), and circular RNAs (circRNAs), has become a promising avenue to uncover the pivotal functional role of ncRNAs in cancer. ncRNAs may display both tumor suppressive and oncogenic functions, and aberrant expression of ncRNAs can induce abnormal transcriptional regulation in critical tumor-related genes, which ultimately contributes to tumor initiation and progression. Existing studies have focused on a few specific ncRNAs and their regulatory roles in a small set of genes without fully utilizing the information from the multiomics data generated by high-throughput technology [1, 2]. Gaining a comprehensive picture of the association between non-coding RNAs and genes at a transcriptome-wide level is imperative to advance our knowledge of cancer pathogenesis. In practical applications, the combined analysis of two types of omics data (such as gene expression and microbiome, or metabolomics and microbiome, among various combinations) offers a novel approach to comprehending the intricacies and interactive nature of biological systems. Despite the potentially valuable findings from multiomics data, the joint analysis of two sets of high-dimensional variables raises computational challenges.

Canonical Correlation Analysis (CCA), originally introduced by [3], is widely used to assess associations between two sets of multivariate data [4, 5]. As common covariation among variables may exist within each set of multivariate data, CCA aims to identify latent factors for both multivariate vectors that maximize the correlations between them. As a popular model to decipher the interactions between two sets of multivariate data, CCA has been applied to a wide range of biomedical data analyses [6]. The canonical variables (i.e. factors) produced by CCA facilitate visualization and effectively reveal associations between two distinct data blocks in a lower dimensional space. Furthermore, they can serve as input features in various tasks, including classification, particularly in situations where the use of the original variables is challenging due to multicollinearity and high dimensionality [7].

Conventionally, CCA is only applicable to multivariate vectors with a dimensionality lower than the sample size, because the sample covariance matrices are otherwise singular [8]. Recent statistical advances, e.g. various sparse CCA (sCCA) methods [9, 10], alleviate this dimensionality constraint by using regularization techniques that ensure algorithmic stability and promote parsimony for enhanced interpretability. However, sCCA methods still struggle to identify the underlying differential multivariate-to-multivariate association patterns across clinical groups. For example, the associations between ncRNAs and gene expression can vary with factors such as cancer stage and subtype, thereby introducing significant heterogeneity. Neither classic CCA nor sCCA methods can capture the underlying differential covariation patterns [11], which motivates our current research.

To address this unmet need, we propose a new differential Canonical Correlation Analysis (dCCA) method to identify the heterogeneity in multivariate-to-multivariate associations across groups with different clinical or experimental conditions. We propose a novel objective function that maximizes the multivariate-to-multivariate correlations while recognizing inter-group discrepancy. By relaxing multiple constraints imposed on the covariance matrices, we implement the objective function using a subgradient-based algorithm. Additionally, to address the high dimensionality of both data blocks, we introduce a bipartite dense graph-based screening procedure.

The rest of this paper is organized as follows. In Method, we introduce the details of the dCCA method and conduct extensive simulation studies to assess its performance against competing methods. In Results, we apply the method to data obtained from the Cancer Genome Atlas Program (TCGA) database to explore the association between noncoding RNA and gene expression in kidney cancer. The paper concludes in Conclusion with a discussion.

Method

In this study, we consider a multivariate-multivariate dataset comprising $n$ observations. The dataset consists of two high-dimensional data blocks of dimensions $p$ and $q$, respectively, denoted by $X$ and $Y$. We first consider the case where $p, q < n$, and for the case where $p, q > n$, we resort to a screening procedure (see Screening) to reduce dimensionality. Additionally, a binary group variable $Z$, which takes values of either 0 or 1, serves as a moderator, differentiating the association patterns between $X$ and $Y$. Based on $Z$, we divide the complete data $(X, Y)$ into two subsets: $(X_0, Y_0)$ and $(X_1, Y_1)$, where the subscripts indicate the corresponding $Z$ values. Here, $n_0$ and $n_1$ represent the numbers of participants in groups $Z = 0$ and $Z = 1$, respectively. For example, in our data application, $X$ represents microRNA (miRNA) data, $Y$ represents gene expression data, and $Z$ represents distinct subtypes of kidney cancer, where $Z = 0$ corresponds to a common subtype and $Z = 1$ corresponds to a rare subtype.

dCCA (Association analysis)

Our primary objective is two-fold: (i) to assess whether underlying association patterns exist between two sets of high-dimensional variables $X$ and $Y$ by maximally revealing the common patterns and (ii) to further identify differential associations between $X$ and $Y$ for those with $Z = 0$ vs. $Z = 1$. To achieve the first objective, we can employ the classic CCA with the objective function represented as follows:

$$\max_{u, v} \; \operatorname{corr}(Xu, Yv),$$

where $u$ and $v$ are loading vectors that assign the weights to the original variables in the datasets $X$ and $Y$, respectively.
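As a minimal sketch of the classical CCA criterion described above (an illustration only, not the authors' implementation), the leading pair of loading vectors can be obtained by whitening the cross-covariance matrix and taking its top singular pair. The function name `first_canonical_pair` and the small `ridge` term added for numerical stability are assumptions introduced here; the sketch presumes the low-dimensional case in which the sample covariance matrices are invertible.

```python
import numpy as np

def first_canonical_pair(X, Y, ridge=1e-8):
    """Return (u, v, rho) maximizing corr(Xu, Yv); rows are observations."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Sample covariance and cross-covariance matrices (ridge is a hypothetical safeguard).
    Sxx = Xc.T @ Xc / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten with Cholesky factors: M = Lx^{-1} Sxy Ly^{-T}.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    A, s, Bt = np.linalg.svd(M)
    # Map the leading singular vectors back to the original coordinates.
    u = np.linalg.solve(Lx.T, A[:, 0])
    v = np.linalg.solve(Ly.T, Bt[0])
    return u, v, s[0]
```

The returned `rho` is the first canonical correlation, and `X @ u` and `Y @ v` are the corresponding canonical variables.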

To address the second objective, we consider the differences in the canonical correlations between two subgroups categorized by the value of Inline graphic, represented as

graphic file with name DmEquation2.gif

This maximizes the discrepancy in association patterns across distinct subsets, allowing us to gain insights into the heterogeneity of the association patterns between subgroups. Therefore, to simultaneously achieve both goals, we propose the dCCA approach with an integrated objective function

graphic file with name DmEquation3.gif

We can rewrite the objective function as

graphic file with name DmEquation4.gif (1)

where the $p \times q$ matrices $\Sigma_{XY}$, $\Sigma_{X_0 Y_0}$, and $\Sigma_{X_1 Y_1}$ represent the cross-covariance matrices of the respective pairs of data: $(X, Y)$, $(X_0, Y_0)$, and $(X_1, Y_1)$. Additionally, $\Sigma_{XX}$, $\Sigma_{YY}$, $\Sigma_{X_0 X_0}$, $\Sigma_{Y_0 Y_0}$, $\Sigma_{X_1 X_1}$, and $\Sigma_{Y_1 Y_1}$ denote the covariance matrices of each individual dataset ($X$, $Y$, $X_0$, $Y_0$, $X_1$, and $Y_1$).

In summary, both CCA and dCCA aim to identify vectors $u$ and $v$. The goal of CCA is to maximize the correlation between $Xu$ and $Yv$ for all groups. In contrast, while maintaining this primary objective, dCCA also aims to maximize the difference in correlations between two groups ($Z = 0$ vs $Z = 1$), i.e., $\operatorname{corr}(X_0 u, Y_0 v)$ vs $\operatorname{corr}(X_1 u, Y_1 v)$.

The objective function (1) can simultaneously identify the underlying correlation patterns for both groups and highlight the differential correlations between groups. These two terms are linked by a tuning parameter $\lambda$. Thus, $\lambda$ plays a crucial role in balancing the classical CCA term and the discrepancy term. Specifically, a higher value of $\lambda$ places a stronger emphasis on the between-group discrepancy, whereas a smaller $\lambda$ leads to results more similar to the classic CCA. We adopt the commonly used cross-validation strategy to objectively select the optimal $\lambda$ [12]. The canonical variables (i.e. $Xu$ and $Yv$) in (1) are used to reduce the dimensions for both multivariate vectors and highlight the latent correlation patterns (see Fig. 1C).
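To make the trade-off between the pooled CCA term and the between-group discrepancy term concrete, the following sketch evaluates one plausible form of the combined criterion for given loading vectors. The exact discrepancy term is not recoverable from the text here, so the absolute difference between the two group-wise correlations is an assumption, as are the names `dcca_objective` and `lam` (the tuning parameter).

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def dcca_objective(X, Y, z, u, v, lam):
    """Pooled correlation plus lam times an ASSUMED absolute between-group gap."""
    z = np.asarray(z)
    pooled = corr(X @ u, Y @ v)                   # classical CCA term
    r0 = corr(X[z == 0] @ u, Y[z == 0] @ v)       # correlation in group 0
    r1 = corr(X[z == 1] @ u, Y[z == 1] @ v)       # correlation in group 1
    return pooled + lam * abs(r1 - r0)            # assumed discrepancy term
```

With `lam = 0` the function reduces to the classical CCA correlation, and larger `lam` rewards loading vectors whose projected correlations differ between the two groups.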

Figure 1.


The demonstration of dCCA workflow: (A) is the heatmap of marginal correlation matrix between two vectors of high-dimensional variables Inline graphic and Inline graphic; each row represents a gene expression variable, and each column represents a miRNA variable; (B) shows the heatmap in (A) after a screening step when the dimensionality of Inline graphic and Inline graphic is high; non-informative variables Inline graphic and Inline graphic can be excluded for further analysis; (C) the differential correlation patterns between the two clinical groups are demonstrated in the enlarged heatmaps; we next perform the dCCA analysis on postscreened Inline graphic and Inline graphic to compute Inline graphic and Inline graphic, and (D) illustrates the contrasting results of the canonical variables from Inline graphic and Inline graphic in dCCA vs CCA within the first block. Specifically, dCCA can better identify differential correlation patterns between the two clinical groups; note that the screening step in (B) is not necessary when Inline graphic.

Implementation. We numerically optimize the objective function as follows. The numerators involve the cross-covariance between $X$ and $Y$, capturing the correlation between these two sets of variables and forming a primary focus of our analysis. The denominators, which encompass the normalization of $u$ and $v$ using their respective covariance matrices, ensure that their contributions are scaled relative to the variability (or covariance) of the data.

By reformulating the above using the constraint form commonly employed in CCA, it can be expressed as

graphic file with name DmEquation5.gif

The above objective function retains its focus on maximizing the correlation between linear combinations while incorporating the regularization term that accounts for the difference between two groups. All constraints from the denominators in the original objective function in (1) aim to ensure the loading vectors Inline graphic and Inline graphic possessing unit length within distinct covariance structures associated with subsets of data (Inline graphic, and Inline graphic). However, optimizing the above while simultaneously satisfying all the constraints is computationally intractable. Following a commonly used numerical optimization strategy by [9] and [13], we relax the constraints by substituting all covariance matrices within the constraints with identity matrices of the same dimensions. Consequently, the modified objective function becomes

graphic file with name DmEquation6.gif (2)

By reformulating the constraint optimization problem in (2) using the Lagrangian function (i.e. Inline graphic), we develop an optimization algorithm in Algorithm 1.
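The effect of the identity-matrix relaxation can be illustrated with a small sketch. It is not Algorithm 1 and omits sparsity: it assumes that the relaxed problem maximizes the pooled cross-covariance term plus `lam` times the absolute difference of the group-wise cross-covariance terms, subject to unit-norm loading vectors. Under that assumed form, the maximizer can be found by comparing the leading singular pairs of two matrices, one for each sign of the difference. The names `cross_cov` and `relaxed_dcca` are hypothetical.

```python
import numpy as np

def cross_cov(X, Y):
    """Sample cross-covariance matrix between the columns of X and Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / (X.shape[0] - 1)

def relaxed_dcca(X, Y, z, lam):
    """Maximize u'Sv + lam*|u'Dv| over unit vectors (assumed relaxed form)."""
    z = np.asarray(z)
    S = cross_cov(X, Y)                                   # pooled term
    D = cross_cov(X[z == 1], Y[z == 1]) - cross_cov(X[z == 0], Y[z == 0])
    best = None
    # |u'Dv| equals max over sign in {+1, -1} of sign * u'Dv, so try both signs.
    for sign in (1.0, -1.0):
        A, s, Bt = np.linalg.svd(S + sign * lam * D)
        if best is None or s[0] > best[2]:
            best = (A[:, 0], Bt[0], s[0])
    return best  # (u, v, objective value)
```

A subgradient-based scheme, as used in the paper, becomes necessary once additional penalties or constraints rule out this closed-form comparison.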

graphic file with name bbae288fx1.jpg

Screening

When the dimensionality of $X$ and $Y$ is higher than the sample size $n$ or non-informative noise is present, we can first conduct a screening step to exclude inactive pairs before implementing the objective function (2). This alleviates computational limitations for CCA and facilitates a more efficient study by narrowing down the focus to a subset of variables of interest. Following [14], many screening methods have been developed across diverse contexts, each tailored to accomplish its specific objectives. For example, [15] developed a screening procedure for two high-dimensional variables. In this research, we introduce a novel graph-based screening process to efficiently identify active pairs of variables between high-dimensional $X$ and high-dimensional $Y$.

We present the association between Inline graphic and Inline graphic as a bipartite graph, denoted as Inline graphic, where Inline graphic and Inline graphic represent distinct node sets for Inline graphic and Inline graphic, respectively (i.e. Inline graphic, where Inline graphic denote the cardinality of the set), and Inline graphic denotes the edges (i.e. Inline graphic). Assuming that associations between Inline graphic and Inline graphic are concentrated in highly correlated pairs of nodes rather than occurring across the entire set of pairs, we extract Inline graphic quasi-bicliques which are subsets of pairs of nodes with dense associations and filter out the irrelevant variables (i.e. screening).

Let Inline graphic be the biadjacency matrix with entries Inline graphic, obtained by thresholding the weighted edge matrix of the bipartite graph (e.g. the absolute Inline graphic correlation matrix: Inline graphic) with the threshold value Inline graphic. Then, the Inline graphicth quasi-biclique consisting of the node set Inline graphic is obtained by optimizing the following objective function:

graphic file with name DmEquation8.gif (3)

where Inline graphic. Note that Inline graphic is the biadjacency matrix of the subgraph induced by the nodes (Inline graphic, Inline graphic), and Inline graphic is the entry-wise Inline graphic norm (i.e. Inline graphic). To implement the above screening procedure, we utilize a greedy algorithm [16], and the algorithm’s summary is provided in Algorithm 2 (see details for the algorithm in the Supplementary Material).
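The construction of the biadjacency matrix can be sketched as follows. This illustrates only the thresholding step and a simple density summary of a candidate block, not the greedy quasi-biclique search of Algorithm 2; the names `biadjacency`, `block_density`, and the threshold argument `r` are hypothetical, and columns are assumed to be non-constant.

```python
import numpy as np

def biadjacency(X, Y, r):
    """Binary p x q matrix with 1 where |corr(X_i, Y_j)| exceeds the threshold r."""
    # Standardize columns (population standard deviation, so R holds Pearson correlations).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    R = Xs.T @ Ys / X.shape[0]
    return (np.abs(R) > r).astype(int)

def block_density(A, rows, cols):
    """Fraction of edges present in the subgraph induced by the given node sets."""
    return A[np.ix_(rows, cols)].mean()
```

A dense block is one whose `block_density` is close to 1 while the remaining entries of the matrix are mostly 0.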

graphic file with name bbae288fx2.jpg

The tuning parameter Inline graphic in (3) plays a crucial role in extracting the dense subset Inline graphic. For example, large Inline graphic tends to yield a more parsimonious result, characterized by reduced size and increased density of Inline graphic. We select the optimal Inline graphic in a data-driven manner using the Kullback–Leibler (KL) divergence. Specifically, considering two distinct blocks; (i) the dense block (Inline graphic), and (ii) outside the dense block (Inline graphic), where within the dense block, Inline graphic is more likely to be 1 for a pair of variables Inline graphic and Inline graphic with a strong association, while outside the dense block, it is likely to be 0 (i.e. uncorrelated). Therefore, the binarized association strength indicator variable Inline graphic can be assumed to follow a mixture of Bernoulli distribution, i.e. Inline graphic, where Inline graphic. Alternatively, one can consider a reference Bernoulli distribution Inline graphic, assuming that Inline graphic pairs exhibit no clustered patterns, where the dense bipartite graph-based screening is not effective. As the KL divergence quantifies the dissimilarity between the well-modeled distribution Inline graphic (representing the dense pattern), and the naive distribution Inline graphic (i.e. Inline graphic), it serves as a suitable measure for selecting the tuning parameter Inline graphic. Thus, we select the tuning parameter Inline graphic by maximizing the following KL divergence:

graphic file with name DmEquation10.gif (4)

The Bernoulli distribution parameters (i.e. Inline graphic) can be estimated using maximum likelihood estimation; see the Supplementary Material for details.
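The ingredients of this selection rule can be sketched in a few lines. The exact form of (4) is not reproduced here, so the size-weighted combination below is an assumption made for illustration: the edge rates inside and outside a candidate block are estimated by their sample proportions (the Bernoulli maximum likelihood estimates) and compared with the overall edge rate through the Bernoulli KL divergence. The names `bernoulli_kl` and `block_kl_score` are hypothetical.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def block_kl_score(A, rows, cols):
    """Size-weighted KL of in-block and out-of-block edge rates vs the overall rate."""
    rows, cols = set(rows), set(cols)
    inside, outside = [], []
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            (inside if i in rows and j in cols else outside).append(a)
    total = len(inside) + len(outside)
    pi_ref = (sum(inside) + sum(outside)) / total      # reference (no-cluster) rate
    score = 0.0
    for cell in (inside, outside):
        if cell:
            pi_hat = sum(cell) / len(cell)             # MLE of the Bernoulli rate
            score += len(cell) / total * bernoulli_kl(pi_hat, pi_ref)
    return score
```

Evaluating such a score over a grid of tuning-parameter values and keeping the maximizer mirrors the data-driven selection described above.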

By filtering out non-informative signals (potential noise or weak associations), the remaining quasi-bicliques can better reveal the latent (differential) correlation patterns. Therefore, the screening step can generally reduce the computational cost and improve the estimation accuracy. However, when the dimensions of variables $X$ and $Y$ are moderate (e.g. less than the sample size, i.e. $p, q < n$) and the noise level is low, dCCA can be performed without the screening step.

Simulation

In this section, we conduct simulation studies to evaluate the performance of dCCA with the screening procedure (dCCA+Screen) and benchmark it against comparable multivariate analysis methods, including sparse CCA (SCCA), sparse PCA (SPCA), and sparse LDA (SLDA [17], implemented in the sparseLDA package). Both SCCA and SPCA are based on the unified penalized matrix decomposition framework in [9] and are implemented through the PMA package in R. In addition, to assess the effectiveness of the screening procedure, we also use the unscreened version of dCCA as a competing method. For SCCA and SPCA, we also performed stratified analyses by applying the methods separately to each group. SCCASep and SPCASep denote these separate applications, respectively. Within these methods, subscripts 0 and 1 indicate the groups corresponding to $Z = 0$ and $Z = 1$, respectively.

We simulate the multivariate predictors Inline graphic from a Inline graphic-dimensional multivariate normal distribution (i.e. Inline graphic). By introducing a binary group label Inline graphic, which serves as a moderator in the association between Inline graphic and Inline graphic, we generate the multivariate response Inline graphic (gene expression) from two different Inline graphic-dimensional multivariate normal distributions: Inline graphic and Inline graphic. Here, Inline graphic and Inline graphic represent the Inline graphic regression coefficients matrices corresponding to subgroups where Inline graphic equals Inline graphic or Inline graphic, respectively. Additionally, we use equal group sizes with Inline graphic.

We set the dimensions to Inline graphic and Inline graphic. Among all possible pairs of Inline graphic associations, we specify the active pairs within two dense blocks: the first block is sized Inline graphic and Inline graphic, and the second block is sized Inline graphic and Inline graphic. Non-zero values are assigned to the entries within these dense blocks of the coefficient matrix, while the remaining entries are set to zero. This configuration is designed to replicate the circuitry commonly observed in RNA gene regulation networks, wherein the RNA-gene pairs within the dense block exhibit concentrated interactions, while the inactive pairs outside the block do not play a role in influencing gene expression through RNA.

Under this multi-dense block configuration, we consider two settings. In the first, the direction (sign) of the association differs depending on Inline graphic, while in the second, the association between Inline graphic and Inline graphic only exists when Inline graphic. In both settings, the non-zero coefficients within Inline graphic are assigned negative values, while the first scenario involves positive coefficients in Inline graphic, and coefficients of 0 in Inline graphic for the second scenario. Our simulation settings are designed to emulate biologically plausible scenarios. In the first scenario, different clinical statuses can lead to contrasting regulatory effects of RNA on gene expression. The second scenario reflects the selective and condition-dependent nature of association in biological systems, where the association between RNA and gene expression is present only under specific clinical statuses.
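The data-generating design can be sketched as follows. The dimensions, block positions, coefficient magnitudes, and noise level below are hypothetical placeholders rather than the values used in the paper; only the structure (Gaussian predictors, two dense coefficient blocks, and group-specific coefficient matrices) follows the description above. The group assigned the negative coefficients is likewise an assumption of this sketch.

```python
import numpy as np

def simulate(n_per_group=50, p=20, q=30, coef0=1.0, coef1=-1.0, noise=1.0, seed=0):
    """Simulate (X, Y, z) with group-specific block-sparse coefficient matrices."""
    rng = np.random.default_rng(seed)
    z = np.repeat([0, 1], n_per_group)                 # equal group sizes
    X = rng.standard_normal((2 * n_per_group, p))      # predictors ~ N(0, I)
    # Two hypothetical dense blocks of active (row, column) pairs.
    blocks = [(slice(0, 5), slice(0, 8)), (slice(5, 9), slice(8, 14))]
    B0, B1 = np.zeros((p, q)), np.zeros((p, q))
    for rows, cols in blocks:
        B0[rows, cols] = coef0                         # group 0 coefficients
        B1[rows, cols] = coef1                         # group 1 coefficients
    signal = np.where((z == 0)[:, None], X @ B0, X @ B1)
    Y = signal + noise * rng.standard_normal(signal.shape)
    return X, Y, z, B0, B1
```

Calling `simulate()` with the defaults mimics the first setting (opposite signs), while `simulate(coef0=0.0)` mimics the second setting, in which the association is present in only one group.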

We assess the performance of all methods by two criteria: (i) variable selection accuracy and (ii) recovery of the differential correlation patterns between groups. Specifically, we evaluate the accuracy of variable selection in $X$ and $Y$ using precision, recall, and $F_1$ score. To assess how precisely the method recovers differential correlation patterns, we calculate the absolute bias (i.e. $|\hat{\rho}_z - \rho_z|$) for different groups separately. Here, $\hat{\rho}_z$ represents the estimated canonical correlation of the $z$th subgroup, while $\rho_z$ represents the theoretical correlation under the noiseless case. Thus, in the first and second settings, we have $\rho_0$ equal to 1 and 0, respectively, while $\rho_1 = -1$ remains fixed in both scenarios. Moreover, we assess whether canonical variables derived from dCCA and comparable methods can distinguish the two groups. We fitted a logistic regression with the group as the outcome and canonical variables as predictors. Performance was evaluated by comparing the Areas Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curves.
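The two evaluation criteria can be written down directly. The sketch below computes precision, recall, and the F1 score from the indices of selected and truly active variables, together with the absolute bias of an estimated correlation; the function names are hypothetical.

```python
def selection_scores(selected, truth):
    """Precision, recall, and F1 of a selected index set against the true active set."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)                          # true positives
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

def absolute_bias(rho_hat, rho_true):
    """Absolute difference between estimated and theoretical correlation."""
    return abs(rho_hat - rho_true)
```

For instance, selecting variables {0, 1, 2, 3} when the active set is {0, 1, 5} gives precision 1/2, recall 2/3, and F1 = 4/7.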

Results are summarized in Tables 1 and 2 and displayed in Fig. 2, with 100 replications per simulation setting. Regarding variable selection, dCCA+Screen accurately identifies non-zero, correlated variables in both data blocks $X$ and $Y$ for both settings (see $F_1$ scores), and outperforms the competing methods. SCCA exhibits a high rate of false positives across all settings, resulting in low precision. SPCA accurately selects variables in $Y$ but not in $X$. SLDA misses most true variables because it is designed for classification instead of variable selection.

Table 1.

Simulation Results (Setting 1: The direction of the association pattern is opposite between groups): We compare dCCA with the screening procedure (dCCA+Screen) to dCCA without the screening procedure (dCCA), and three competing methods (sparse CCA (SCCA), sparse LDA (SLDA), and sparse PCA (SPCA)); SCCASep and SPCASep denote SCCA and SPCA applied separately to each group; subscripts 0 and 1 denote the groups corresponding to $Z = 0$ and $Z = 1$, respectively.

Variable selection in $X$
Method Precision Recall $F_1$
dCCA+Screen 0.8582 (0.14) 0.9920 (0.03) 0.9121 (0.11)
SCCA 0.0809 (0.04) 0.1700 (0.09) 0.1094 (0.06)
SCCASep0 0.2645 (0.03) 0.7173 (0.05) 0.3856 (0.04)
SCCASep1 0.2590 (0.03) 0.7207 (0.06) 0.3803 (0.04)
SLDA 0.0607 (0.06) 0.0607 (0.06) 0.0607 (0.06)
SPCA 0.0760 (0.05) 0.1560 (0.10) 0.1019 (0.06)
SPCASep0 0.0771 (0.05) 0.1573 (0.10) 0.1031 (0.06)
SPCASep1 0.0692 (0.04) 0.1407 (0.09) 0.0925 (0.06)
Variable selection in $Y$
Method Precision Recall $F_1$
dCCA+Screen 0.9878 (0.09) 1.0000 (0.00) 0.9899 (0.09)
SCCA 0.2159 (0.04) 0.7547 (0.09) 0.3352 (0.05)
SCCASep0 0.1548 (0.02) 0.8140 (0.09) 0.2600 (0.03)
SCCASep1 0.1556 (0.02) 0.8063 (0.10) 0.2607 (0.04)
SLDA 0.1093 (0.05) 0.1093 (0.05) 0.1093 (0.05)
SPCA 1.0000 (0.00) 0.6543 (0.02) 0.7909 (0.01)
SPCASep0 1.0000 (0.00) 0.6517 (0.02) 0.7889 (0.01)
SPCASep1 1.0000 (0.00) 0.6507 (0.02) 0.7882 (0.01)
Identifying correlation and classification (Inline graphic)
Method $|\hat{\rho}_0 - \rho_0|$ $|\hat{\rho}_1 - \rho_1|$ AUC
dCCA+Screen 0.0342 (0.02) 0.0352 (0.02) 0.9831 (0.01)
dCCA 0.0938 (0.01) 0.1138 (0.02) 0.9498 (0.01)
SCCA 0.5722 (0.11) 1.4408 (0.1) 0.5601 (0.04)
SCCASep 0.0629 (0.01) 1.9353 (0.02) 0.5433 (0.03)
SLDA 1.0018 (0.07) 0.9951 (0.07) 0.8241 (0.02)
SPCA 1.0043 (0.12) 1.0113 (0.12) 0.5585 (0.04)
SPCASep 1.0300 (0.12) 0.9847 (0.12) 0.5486 (0.03)

Table 2.

Simulation Results (Setting 2: The association between $X$ and $Y$ exists only in one clinical group, specifically when $Z = 1$): We compare dCCA with the screening procedure (dCCA+Screen) to dCCA without the screening procedure (dCCA), and three competing methods (sparse CCA (SCCA), sparse LDA (SLDA), and sparse PCA (SPCA)); SCCASep and SPCASep denote SCCA and SPCA applied separately to each group; subscripts 0 and 1 denote the groups corresponding to $Z = 0$ and $Z = 1$, respectively.

Variable selection in $X$
Method Precision Recall $F_1$
dCCA+Screen 0.8620 (0.09) 0.9680 (0.05) 0.9083 (0.06)
SCCA 0.2564 (0.04) 0.7207 (0.06) 0.3772 (0.04)
SCCASep0 0.0752 (0.04) 0.1627 (0.09) 0.1026 (0.05)
SCCASep1 0.2585 (0.03) 0.7193 (0.06) 0.3796 (0.04)
SLDA 0.0627 (0.06) 0.0627 (0.06) 0.0627 (0.06)
SPCA 0.0788 (0.05) 0.1620 (0.10) 0.1057 (0.07)
SPCASep0 0.0766 (0.05) 0.1567 (0.10) 0.1025 (0.06)
SPCASep1 0.0690 (0.04) 0.1407 (0.09) 0.0924 (0.06)
Variable selection in $Y$
Method Precision Recall $F_1$
dCCA+Screen 0.9994 (0.00) 0.9930 (0.02) 0.9961 (0.01)
SCCA 0.1593 (0.02) 0.8103 (0.09) 0.2661 (0.03)
SCCASep0 0.0731 (0.03) 0.1587 (0.07) 0.0999 (0.04)
SCCASep1 0.1559 (0.02) 0.8067 (0.10) 0.2612 (0.04)
SLDA 0.1063 (0.05) 0.1063 (0.05) 0.1063 (0.05)
SPCA 1.0000 (0.00) 0.6520 (0.02) 0.7892 (0.01)
SPCASep0 0.0765 (0.05) 0.0837 (0.05) 0.0795 (0.05)
SPCASep1 1.0000 (0.00) 0.6507 (0.02) 0.7882 (0.01)
Identifying correlation and classification (Inline graphic)
Method $|\hat{\rho}_0 - \rho_0|$ $|\hat{\rho}_1 - \rho_1|$ AUC
dCCA+Screen 0.1030 (0.06) 0.0754 (0.03) 0.8648 (0.02)
dCCA 0.5532 (0.34) 1.2009 (0.69) 0.8369 (0.05)
SCCA 0.1798 (0.07) 1.9327 (0.02) 0.8423 (0.02)
SCCASep 0.8091 (0.02) 1.9354 (0.02) 0.7116 (0.02)
SLDA 0.0583 (0.04) 0.9958 (0.07) 0.8237 (0.01)
SPCA 0.0562 (0.04) 1.0188 (0.12) 0.5580 (0.04)
SPCASep 0.0631 (0.05) 0.9889 (0.12) 0.5465 (0.02)

Figure 2.


Results of simulation studies: (A) ROC curves compare the performance of the methods’ canonical variables in the classification task; the middle and bottom panels display scatter plots of the projected (canonical) variables from different methods for (B) Setting 1 and (C) Setting 2, respectively.

When capturing differential correlation patterns between groups, dCCA+Screen generally demonstrates the least absolute bias, except when $Z = 0$ in setting 2, where $\rho_0 = 0$. In this setting, SPCA achieves the best performance; however, dCCA+Screen shows nearly comparable performance. Note that the satisfactory performance of SLDA and SPCA in this specific case stems from their inherent design, which does not prioritize uncovering association patterns between two multivariate data blocks. Consequently, they consistently yield near-zero correlations in all settings, which leads to a significant bias in every other case. As a result, their projected variables lack meaningful interpretation (see Fig. 2). Because it does not address group heterogeneity, conventional SCCA produced nearly identical canonical correlations for both groups in setting 1. Furthermore, SCCA cannot distinguish the direction (sign) of the overall association for different groups and generated positively correlated canonical variables when $Z = 1$ (see (B) in Fig. 2), even though the true underlying correlation is negative. This results in a significant bias (see $|\hat{\rho}_1 - \rho_1|$ in Table 1 and Table 2). Performing SCCA and SPCA separately (i.e. SCCASep and SPCASep) for each clinical group misses the underlying differential association patterns between groups, resulting in high correlation estimation biases. In contrast, dCCA+Screen accurately discerns the underlying correlation in each of the two groups.

In addition, the canonical variables obtained from dCCA+Screen achieve the highest AUC in both settings, demonstrating their advantage in the classification task over those derived from other dimension-reduction techniques.

In summary, the dCCA method outperforms the benchmark multivariate association analysis models in accurately selecting active pairs of variables and identifying distinct underlying association patterns for different groups. The dCCA-derived canonical variables can also classify groups with improved accuracy.

Assessing robustness of dCCA: We further examine whether dCCA introduces false positive differential correlations when the cross-group differential association pattern is absent. In this setting, we simulate identical regression coefficient matrices for the two groups within the same multi-block structure employed in the previous simulation settings, and assess false positive findings.

The results in Table 3 demonstrate that the false positive rate (FPR) is below 5% for dCCA+Screen. The results for correlation estimation and variable selection are provided in Table S1 of the Supplementary Material.

Table 3.

False positive rate (FPR); we test the difference in canonical vectors between groups at the test level $\alpha = 0.05$.

Method FPR
dCCA+Screen 0.04
dCCA 0.16
SCCA 0.07
SCCASep 0.37
SLDA 0.04
SPCA 0.02
SPCASep 0.23

Results

We applied our method to Pan-kidney cohort data obtained from TCGA. This cohort offers a wide array of datasets, including gene expression, non-coding RNA (e.g. long noncoding RNAs (lncRNAs) and microRNAs (miRNAs)), along with clinical information (e.g. cancer stage and subtypes), enabling comprehensive research into kidney cancer. In our analysis, we uncover how the association between miRNAs and gene expression is influenced by different cancer subtypes. RNA sequencing was used for miRNA data (in RPM) and gene expression data (in RPKM), both of which were downloaded from LinkedOmics [18]. We conducted data preprocessing steps. Specifically, for the miRNA data, we excluded miRNAs with zero expression across all samples and applied Inline graphic transformation to stabilize variance and make the data more symmetrically distributed. In gene expression data, genes with low expression levels are regarded as uninformative. Therefore, we applied a mean expression cutoff of Inline graphic to filter out such genes, enabling us to prioritize those with robust expression levels. The processed dataset contains a total of Inline graphic observations and has dimensions of Inline graphic and Inline graphic for miRNAs and genes, respectively.
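The preprocessing steps described above can be sketched as follows. The exact transformation and cutoff are not recoverable from the text here, so the `log2(x + 1)` transform and the `min_mean_expr` argument are assumptions made for illustration; the function name `preprocess` is hypothetical, and rows are taken to be samples and columns to be features.

```python
import numpy as np

def preprocess(mirna, genes, min_mean_expr):
    """Filter and transform miRNA and gene expression matrices (samples x features)."""
    # Drop miRNAs with zero expression across all samples.
    mirna = mirna[:, (mirna != 0).any(axis=0)]
    # Assumed variance-stabilizing transform: log2(x + 1).
    mirna = np.log2(mirna + 1.0)
    # Keep genes whose mean expression reaches the (hypothetical) cutoff.
    genes = genes[:, genes.mean(axis=0) >= min_mean_expr]
    return mirna, genes
```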

Renal cell carcinoma (RCC) is the predominant form of kidney cancer in adults and is categorized into various subtypes based on histopathological characteristics. In samples from the Pan-Kidney cohort, three subtypes are identified: Clear cell renal cell carcinoma (ccRCC), Papillary renal cell carcinoma (pRCC), and Chromophobe renal cell carcinoma (chRCC). Each of these subtypes exhibits unique cancer progression patterns, genetic traits, and RNA profiles, which, in turn, can impact gene and RNA regulation differently. The first two subtypes (ccRCC and pRCC) are common types, collectively accounting for 85%–95% of RCC cases, while chRCC is a rare subtype that accounts for Inline graphic 5% of all RCC cases. In our analysis, we treat these kidney cancer types as the group variable, assigning Inline graphic to common kidney cancer types (ccRCC and pRCC), and Inline graphic for chRCC, a rare kidney cancer type. The number of subjects in each cancer subtype is Inline graphic for the common subtype and Inline graphic for the rare subtype.

Since both Inline graphic and Inline graphic in this study, we first perform the screening step of dCCA. The screening procedure filters out non-informative pairs of miRNAs and genes and retains 77 miRNA variables and 591 gene variables. These variables comprise three (Inline graphic) bipartite blocks (see Fig. 3): Block 1: 43 miRNAs × 319 genes; Block 2: 18 miRNAs × 227 genes; and Block 3: 16 miRNAs × 45 genes. Each block shows a distinct differential association pattern between the two cancer subtype groups (see Fig. 3). In Block 1 (upper left corner), miRNAs and genes are more strongly (positively) correlated in the common cancer type group than in the chRCC group. Block 2 also shows stronger (negative) correlations in the common cancer type group than in the chRCC group. In contrast, in Block 3 the correlations are stronger in the chRCC group than in the common subtypes. We then apply the objective function of dCCA to the filtered data to assess the differentially expressed Inline graphic correlation patterns between groups. The resulting canonical variables (see Fig. 4) reflect the differential correlation patterns in the three blocks. In Block 1, both groups exhibit positive correlations between canonical variables, with a stronger correlation in the common subtypes (ccRCC, pRCC) than in chRCC. In Block 2, the canonical correlation is negative for the common subtypes, whereas it is close to zero for chRCC. In Block 3, the correlation for chRCC is stronger than that for the common subtypes.
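The quantity that underlies these block patterns is the difference between the group-wise miRNA-gene correlation matrices. A minimal sketch of that difference is given below. It is not the dCCA screening or optimization procedure itself, only the raw entrywise difference (common minus rare) that a heat map of differential correlation would display. The function names, the string group labels, and the toy values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def diff_correlation(X, Y, group, first="common", second="rare"):
    """Return D with D[j][k] = cor(X_j, Y_k | first) - cor(X_j, Y_k | second).

    X: list of miRNA columns, Y: list of gene columns (each a list over samples).
    group: one label per sample. The label strings are illustrative.
    """
    idx_a = [i for i, g in enumerate(group) if g == first]
    idx_b = [i for i, g in enumerate(group) if g == second]

    def take(col, idx):
        return [col[i] for i in idx]

    return [
        [
            pearson(take(x, idx_a), take(y, idx_a))
            - pearson(take(x, idx_b), take(y, idx_b))
            for y in Y
        ]
        for x in X
    ]


# Toy example: one miRNA and one gene, four samples per group.
group = ["common"] * 4 + ["rare"] * 4
X = [[1, 2, 3, 4, 1, 2, 3, 4]]   # miRNA
Y = [[2, 4, 6, 8, 4, 3, 2, 1]]   # gene: rises with the miRNA in "common", falls in "rare"

D = diff_correlation(X, Y, group)
print(D)
```

Entries of `D` far from zero mark miRNA-gene pairs whose association differs between the two groups. The sign indicates which group has the larger (signed) correlation.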

Figure 3.

(A) Heat map illustrating the difference in the correlation matrix (miRNAs vs. genes) between different subtypes (common vs. rare) within the dense blocks; (B) network plots: in each block, nodes to the left represent the top 10 miRNAs, while nodes to the right represent the top 10 genes; the top 10 miRNAs and genes were chosen based on the summation of the absolute values of elements within each column (miRNA) and row (gene) of the corresponding block in the correlation matrix; the direction (sign) of the association is denoted by different colors (positive: red, negative: blue) in the edges, while the strength of the connection is visualized through both the width and transparency of the edge.
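The node-selection rule in panel (B) amounts to ranking rows and columns of a block by the sum of their absolute entries. A small sketch follows, assuming the block is stored with genes as rows and miRNAs as columns, as the caption describes. The function name and toy values are hypothetical.

```python
def top_k_nodes(block, gene_names, mirna_names, k=10):
    """Rank genes (rows) and miRNAs (columns) by the sum of absolute entries."""
    gene_scores = [sum(abs(v) for v in row) for row in block]
    mirna_scores = [
        sum(abs(row[j]) for row in block) for j in range(len(mirna_names))
    ]

    def top(names, scores):
        ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
        return [name for name, _ in ranked[:k]]

    return top(gene_names, gene_scores), top(mirna_names, mirna_scores)


# Toy 3-gene x 2-miRNA block of correlation differences.
block = [
    [0.9, -0.1],
    [0.2, 0.3],
    [-0.6, 0.0],
]
top_genes, top_mirnas = top_k_nodes(block, ["g1", "g2", "g3"], ["m1", "m2"], k=2)
print(top_genes, top_mirnas)
```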

Figure 4.

Comparison of scatter plots of canonical variables: miRNA (on the x-axis) vs. gene expression (on the y-axis) obtained from CCA in the upper panel and dCCA in the lower panels (orange strip); different subtypes of kidney cancer are visually distinguished by color (red for common subtypes (ccRCC and pRCC), and blue for chRCC); Pearson correlation coefficients (R) separately calculated from each subtype and their corresponding p-values (p) are given and color-coded similarly as above; in addition, the statistical significance of the difference (Diff) in canonical correlations between the two subtypes is tested, and the associated p-values are given (in black).
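The caption does not state which test was used for the difference (Diff) between the two group-wise correlations. One common choice for comparing correlations from two independent groups is Fisher's z-transformation test, sketched below as an assumption rather than as the authors' procedure. The sample sizes and correlations in the example are made up.

```python
import math


def fisher_z_diff(r1, n1, r2, n2):
    """Two-sided Fisher z-test for H0: rho1 == rho2 (independent samples).

    Returns (z, p). Requires n1 > 3, n2 > 3, and |r| < 1.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)       # Fisher transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))         # two-sided normal p-value
    return z, p


# Hypothetical inputs: correlation 0.6 in 100 samples vs. 0.1 in 50 samples.
z, p = fisher_z_diff(0.6, 100, 0.1, 50)
print(f"z = {z:.2f}, p = {p:.4f}")
```

When the two correlations are equal, the statistic is exactly zero and the p-value is one, which is a quick sanity check on the implementation.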

These findings align well with results from prior studies. For example, Block 1 identifies miRNA-gene pairs that are tightly connected in the ccRCC and pRCC subtypes but loosely connected in the chRCC subtype. We searched two existing databases, miRCancer [19] (http://mircancer.ecu.edu) and dbDEMC [20] (https://www.biosino.org/dbDEMC/index). Most miRNAs in Block 1 (e.g. miR-126, miR-145, miR-122) were identified as critical and differentially expressed in the ccRCC subtype but not in the chRCC subtype (see the Supplementary Material). In Block 2, miR-141, a unique miRNA signature in clear cell RCC [21], was found to be associated with critical tumor suppressor genes such as USH1C [22] in the common RCC subtypes but not in the rare RCC subtype. We also performed pathway analysis on the identified genes and found that several genes in Block 3 (e.g. CD3D, CD3E, CD2, SIT1) were enriched in pathways related to T cell and lymphocyte activation (see the Supplementary Material). dCCA identified miR-150, a miRNA that plays critical roles in lymphocyte development and is significantly associated with RCC survival [23, 24]. It was found to be strongly co-expressed with these genes in the chRCC subtype but only weakly in the ccRCC and pRCC subtypes. A stronger regulatory bond between miR-150 and immune genes in chRCC may explain its utility in the prognosis of RCC survival compared with the ccRCC and pRCC subtypes [25]. In addition, miR-223 in Block 3, a cancer-specific survival-related biomarker [26, 27], was found to be associated with genes in chRCC but not in the other kidney cancer subtypes.

For comparison, we also applied classic CCA [4] and sCCA [9] to this dataset. However, neither method identifies the underlying differential correlation patterns extracted by dCCA. For example, in Fig. 4, the correlations between the canonical variables from CCA are almost identical in the two clinical groups, so CCA misses the group differences that potentially reflect differential biological mechanisms.

Conclusion

We have developed a new multivariate-to-multivariate analysis tool, dCCA, to decipher the complex interaction patterns between two types of high-dimensional omics data. We focus on extracting differentially expressed omics-to-omics interaction patterns between clinical groups. Unlike classic CCA and sCCA, dCCA more effectively uncovers interaction patterns between two types of omics data that are related to clinical status. Thus, the differential interaction patterns identified by dCCA can help pinpoint potential biomarkers that distinguish the subtypes. This approach may also provide insights into the biological mechanisms that differ between groups. For example, identifying subtype-specific mechanisms may suggest targeted therapeutic strategies for the disease. However, our study remains exploratory rather than confirmatory, and future studies and experiments are needed to validate the findings. For example, in vitro experiments on cell lines grown from each cancer type could help validate them.

dCCA is computationally efficient and can handle interactions between thousands of variables in each of the two data types, because the graph-based screening procedure efficiently filters out non-informative features. For validation, we applied dCCA to an additional dataset (a breast cancer study in TCGA). See the Supplementary Material for details.

The proposed dCCA method currently focuses on analyzing datasets with binary group variables. Expanding dCCA to accommodate group variables with more than two categories, such as the four molecular subtypes of breast cancer (HER2-enriched, Luminal A, Luminal B, Basal-like), will significantly enhance its utility for analyzing more complex datasets. For example, applying this method to datasets involving patients at different stages of cancer can provide insights into uncovering ordinal trends and dynamically varying association patterns throughout the progression of the disease. We provide a potential two-step solution for handling more than two groups in the Supplementary Material.

Key Points

  • dCCA deciphers the complex interaction patterns between two types of high-dimensional omics data.

  • Specifically, dCCA extracts differentially expressed omics-to-omics interaction patterns between clinical groups, which can provide insights into the biological mechanisms that differ between groups.

  • We propose a novel graph-based approach to efficiently identify active variable pairs in two high-dimensional spaces (i.e. Inline graphic and Inline graphic), outperforming existing methods in accurately selecting variables within both Inline graphic and Inline graphic.

Supplementary Material

supp_material_dCCA_bbae288

Author Biographies

Hwiyoung Lee is a postdoctoral research fellow at the Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore.

Tianzhou Ma is an assistant professor at the Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park.

Hongjie Ke is a PhD candidate at the Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park.

Zhenyao Ye is a PhD candidate at the Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore.

Shuo Chen is a professor at the Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore.

Contributor Information

Hwiyoung Lee, Maryland Psychiatric Research Center, School of Medicine, University of Maryland, Baltimore, MD 21201, United States; The University of Maryland Institute for Health Computing (UM-IHC), North Bethesda, MD 20852, United States.

Tianzhou Ma, Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, United States.

Hongjie Ke, Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, United States.

Zhenyao Ye, The University of Maryland Institute for Health Computing (UM-IHC), North Bethesda, MD 20852, United States; Division of Biostatistics and Bioinformatics, Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore, MD 21201, United States.

Shuo Chen, Maryland Psychiatric Research Center, School of Medicine, University of Maryland, Baltimore, MD 21201, United States; The University of Maryland Institute for Health Computing (UM-IHC), North Bethesda, MD 20852, United States; Division of Biostatistics and Bioinformatics, Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore, MD 21201, United States.

Funding

This work was supported by the National Institutes of Health under Award Number: 1DP1DA048968-01.

Availability and Implementation

The software package that implements dCCA is available at https://github.com/hwiyoungstat/dCCA.

Data availability

The miRNA and gene expression data utilized in this study are accessible through the Cancer Genome Atlas Program (TCGA) Pan-kidney cohort via the website https://www.cancer.gov/ccg/research/genome-sequencing/tcga.

References

  • 1. Zhu S, Hailong W, Fangting W. et al.  MicroRNA-21 targets tumor suppressor genes in invasion and metastasis. Cell Res  2008;18:350–9. 10.1038/cr.2008.24.
  • 2. Bhan A, Soleimani M, Mandal SS. Long noncoding RNA and cancer: a new paradigm. Cancer Res  2017;77:3965–81. 10.1158/0008-5472.CAN-16-2634.
  • 3. Hotelling H. Relations between two sets of variates. Biometrika  1936;28:321–77. 10.1093/biomet/28.3-4.321.
  • 4. Yang X, Liu W, Liu W. et al.  A survey on canonical correlation analysis. IEEE Trans Knowl Data Eng  2019;33:2349–68. 10.1109/TKDE.2019.2958342.
  • 5. Zhuang X, Yang Z, Cordes D. A technical review of canonical correlation analysis for neuroscience applications. Hum Brain Mapp  2020;41:3807–33. 10.1002/hbm.25090.
  • 6. Jiang M-Z, Aguet F, Ardlie K. et al.  Canonical correlation analysis for multi-omics: application to cross-cohort analysis. PLoS Genet  2023;19:1–22. 10.1371/journal.pgen.1010517.
  • 7. Rousu J, Agranoff DD, Sodeinde O. et al.  Biomarker discovery by sparse canonical correlation analysis of complex clinical phenotypes of tuberculosis and malaria. PLoS Comput Biol  2013;9:1–10. 10.1371/journal.pcbi.1003018.
  • 8. Cao KAL, González I, Déjean S. integrOmics: an R package to unravel relationships between two omics datasets. Bioinformatics  2009;25:2855–6. 10.1093/bioinformatics/btp515.
  • 9. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics  2009;10:515–34. 10.1093/biostatistics/kxp008.
  • 10. Lei D, Liu K, Yao X. et al.  Detecting genetic associations with brain imaging phenotypes in Alzheimer’s disease via a novel structured SCCA approach. Med Image Anal  2020;61:101656. 10.1016/j.media.2020.101656.
  • 11. Lei D, Liu F, Liu K. et al.  Identifying diagnosis-specific genotype-phenotype associations via joint multitask sparse canonical correlation analysis and classification. Bioinformatics  2020;36:i371–9. 10.1093/bioinformatics/btaa434.
  • 12. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, 2nd edn. New York, NY: Springer, 2009. 10.1007/978-0-387-84858-7.
  • 13. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc  2002;97:77–87. 10.1198/016214502753479248.
  • 14. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Series B Stat Methodology  2008;70:849–911. 10.1111/j.1467-9868.2008.00674.x.
  • 15. Ke H, Ren Z, Qi J. et al.  High-dimension to high-dimension screening for detecting genome-wide epigenetic and noncoding RNA regulators of gene expression. Bioinformatics  2022;38:4078–87. 10.1093/bioinformatics/btac518.
  • 16. Charikar M. Greedy approximation algorithms for finding dense components in a graph. In: Jansen K, Khuller S (eds). Approximation Algorithms for Combinatorial Optimization. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000, 84–95. 10.1007/3-540-44436-X_10.
  • 17. Clemmensen L, Witten D, Hastie T. et al.  Sparse discriminant analysis. Technometrics  2011;53:406–13. 10.1198/TECH.2011.08118.
  • 18. Vasaikar SV, Straub P, Wang J. et al.  LinkedOmics: analyzing multi-omics data within and across 32 cancer types. Nucleic Acids Res  2017;46:D956–63. 10.1093/nar/gkx1090.
  • 19. Xie B, Ding Q, Han H. et al.  miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics  2013;29:638–44. 10.1093/bioinformatics/btt014.
  • 20. Feng X, Wang Y, Ling Y. et al.  dbDEMC 3.0: functional exploration of differentially expressed miRNAs in cancers of human and model organisms. Genomics Proteomics Bioinformatics  2022;20:446–54.
  • 21. Cairns P. Renal cell carcinoma. Cancer Biomark  2011;9:461–73. 10.3233/CBM-2011-0176.
  • 22. Chen L, Liu P, Evans TC. et al.  DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science  2017;355:752–6. 10.1126/science.aai8690.
  • 23. Hu YZ, Li Q, Wang PF. et al.  Multiple functions and regulatory network of miR-150 in B lymphocyte-related diseases. Front Oncol  2023;13:1140813. 10.3389/fonc.2023.1140813.
  • 24. Chanudet E, Wozniak MB, Bouaoun L. et al.  Large-scale genome-wide screening of circulating microRNAs in clear cell renal cell carcinoma reveals specific signatures in late-stage disease. Int J Cancer  2017;141:1730–40. 10.1002/ijc.30845.
  • 25. Garje R, Elhag D, Yasin HA. et al.  Comprehensive review of chromophobe renal cell carcinoma. Crit Rev Oncol Hematol  2021;160:103287. 10.1016/j.critrevonc.2021.103287.
  • 26. Ghafouri-Fard S, Shirvani-Farsani Z, Branicki W. et al.  MicroRNA signature in renal cell carcinoma. Front Oncol  2020;10:596359. 10.3389/fonc.2020.596359.
  • 27. Kajdasz A, Majer W, Kluzek K. et al.  Identification of RCC subtype-specific microRNAs: meta-analysis of high-throughput RCC tumor microRNA expression data. Cancers  2021;13:548. 10.3390/cancers13030548.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
