TOSCCA: a framework for interpretation and testing of sparse canonical correlations

Nuria Senar; Mark van de Wiel; Aeilko H Zwinderman; Michel H Hof

doi:10.1093/bioadv/vbae021

. 2024 Feb 21;4(1):vbae021. doi: 10.1093/bioadv/vbae021

TOSCCA: a framework for interpretation and testing of sparse canonical correlations

Nuria Senar ^1,^✉, Mark van de Wiel ², Aeilko H Zwinderman ³, Michel H Hof ⁴

Editor: Lina Ma

PMCID: PMC10919946 PMID: 38456127

Abstract

Summary

In clinical and biomedical research, multiple high-dimensional datasets are nowadays routinely collected from omics and imaging devices. Multivariate methods, such as Canonical Correlation Analysis (CCA), integrate two (or more) datasets to discover and understand underlying biological mechanisms. For an explorative method like CCA, interpretation is key. We present a sparse CCA method based on soft-thresholding that produces near-orthogonal components, allows for browsing over various sparsity levels, and permutation-based hypothesis testing. Our soft-thresholding approach avoids tuning of a penalty parameter. Such tuning is computationally burdensome and may render unintelligible results. In addition, unlike alternative approaches, our method is less dependent on the initialization. We examined the performance of our approach with simulations and illustrated its use on real cancer genomics data from drug sensitivity screens. Moreover, we compared its performance to Penalized Matrix Analysis (PMA), which is a popular alternative of sparse CCA with a focus on yielding interpretable results. Compared to PMA, our method offers improved interpretability of the results, while not compromising, or even improving, signal discovery.

Availability and implementation

The software and simulation framework are available at https://github.com/nuria-sv/toscca.

1 Introduction

New technologies in clinical and biomedical research have facilitated the collection of high-throughput omics data using DNA sequencing, RNA microarrays, or mass spectroscopy. These methods typically result in hundreds or thousands of variables per patient, yet, sample sizes remain low. As a result, the number of variables largely exceeds the number of observations. Genomics studies concerned with finding common structures between multiple pheno- or genotypical measures call for statistical models capable of dealing with high-dimensions.

Integrative approaches such as Canonical Correlation Analysis (CCA) (Hotelling 1936) can contribute to improvements in diagnostics and understanding of biological mechanisms, by exploring connecting attributes between datasets. This includes genomics as well as neurological data which recently has been at the centre of many CCA applications linking brain connectivity data to genetic, demographics, behavioural or thought patterns (Du et al. 2020, Wang et al. 2020). Additional applications of CCA methods and their applications are discussed in Rodosthenous et al. (2020). CCA is a multivariate method in high-dimensional analysis for exploring underlying signals relating two (or more) datasets through pairs of weight vectors. It is an immediate extension of PCA for more than one dataset, and a scale invariant adaptation of PLS. CCA searches for linear combinations of the data, called latent variables, that are maximally correlated to each other. The original variables are summarized into these lower-dimensional variables. This process may be repeated to render multiple latent variables. Methods combining dimension reduction and correlation maximization have competitive accuracy in predicting complex traits than other conventional ML methods (Rodosthenous et al. 2020), which means that CCA is an interesting tool for high-dimensional analysis.

For such, however, computation times are prone to be high and results may be difficult to extrapolate, generalize or test. Furthermore, interpreting the estimated weights is far from trivial in large dimension problems. This problem persists in CCA applications as there may be many possible linear combinations of variables maximizing correlations. Hence, the probability of selecting highly correlated noise increases and so does the in-sample canonical correlation (CC) estimate. In these scenarios, the presence of redundant variables is dealt with through sparse constraints dealing with regularization, variable selection or a combination of both.

Recent methods for CCA search for complex associations, i.e. nonlinear or supervised. However, these generally are focused on prediction and show lack interpretability, or assume some type of structure or classification. As we are concerned with finding interpretable results from an exploratory analysis, we focus on penalized alternatives which render sparse weights for both datasets.

Sparse extensions to CCA include lasso or elastic net penalization methods for parameter shrinkage and variable selection (Waaijenborg and Zwinderman 2007, Parkhomenko et al. 2009, Witten et al. 2009). The sparsity is chosen through cross-validation techniques of the penalty parameters. Said techniques are known to be unstable both in terms of the estimation of the penalty parameters (van Nee et al. 2022) and that of the coefficients (Xu et al. 2012). Not only does this affect interpretability of the results, but it also means that the permutation testing framework is ill-behaved.

We address these concerns by imposing a threshold on the support of the canonical vectors using soft-thresholding, rather than using a penalty parameter. Hence, we introduce sparsity into our canonical vectors by stating the number of nonzero weights, keeping the number of selected variables equal through permutations and promoting interpretable results via direct control over the number of selected variables. In addition, to achieve Type-I error control, we also show that using the out-of-sample correlation, instead of the in-sample correlation, accounts for spurious associations. We propose a fast estimation scheme based on the NIPALS (Wold 1966) algorithm, essential for efficient testing in high-dimensional CCA.

With simulations, we evaluated signal recovery and Type-I error control and compared its performance to the popular Penalized Matrix Analysis (PMA) method (Witten et al. 2009). Moreover, we applied our algorithm to real data on gene expression and drug sensitivity measures to study the performance of our sparse CCA. We found that, compared to PMA, fixing the number of nonzeros improved stability of the shape and size of the canonical weights for increasingly large matrices.

2 Methods

Suppose we have two data matrices $X_{1} \in R^{n \times p}$ and $X_{2} \in R^{n \times q}$ containing respectively p and q variables from n samples from which we want to extract a sequence of K pairs of canonical vectors ${(α_{1}, β_{1}), \dots, (α_{K}, β_{K})}$ , where $K \leq min (p, q)$ . The k^th pair of canonical variables is given by $γ_{k} = X_{1} α_{k}$ and $ζ_{k} = X_{2} β_{k}$ . The correlation between this pair of canonical variables, referred to as canonical correlation, is given by

\begin{matrix} ρ_{k} = \frac{α_{k}^{T} X_{1}^{T} X_{2} β_{k}}{\sqrt{α_{k}^{T} X_{1}^{T} X_{1} α_{k}} \sqrt{β_{k}^{T} X_{2}^{T} X_{2} β_{k}}} \end{matrix}

(1)

The goal of CCA is to choose the weights $A = (α_{1}, \dots, α_{k}, \dots, α_{K})$ and $B = (β_{1}, \dots, β_{k}, \dots, β_{K})$ such that the correlation between all canonical vectors is maximized under the restriction that the columns in the sets $(γ_{1}, \dots, γ_{K})$ and $(ζ_{1} . \dots, ζ_{K})$ are orthogonal. Generally, the weights are estimated such that the first pair of canonical vectors has the highest canonical correlation and with each succeeding pair the canonical correlation decreases.

2.1 Nonlinear iterative partial least squares and CCA

In this paper, we consider the CCA problem in a regression framework in which pairs of canonical vectors are sequentially estimated with an alternating regression procedure. This technique, known as Nonlinear Iterative Partial Least Squares (NIPALS) (Wold 1966), starts by initializing one of the canonical vectors, $α^{(0)}$ , and computing $β$ given $α^{(0)}$ . The estimation of the weights $β$ is equivalent to a simple least square problem. To obtain the first pair of canonical vectors, we use the equivalence between maximizing equation (1) and the optimization problem

(\hat{α}, \hat{β}) = \arg \min_{α, β} \sum_{i = 1}^{n} (x_{1, i} α_{i} - x_{2, i} β_{i})^{2},

(2)

where the canonical variables are required to have unit norm to have the same constraints as in Equation (1). In the alternating regression procedure, we initialize and fix vector $α^{(0)}$ and scale $γ$ to have unit norm after step 2 in Algorithm 1, i.e.

γ^{(0)} = \frac{X_{1} α^{(0)}}{\sqrt{α^{(0) T} X_{1}^{T} X_{1} α^{(0)}}}

By fixing $α^{(0)}$ , Equation (2) reduces to a simple (least squares) regression problem (Wilms and Croux 2015). An estimate of $β$ obtained as

β^{(1)} = {(X_{2}^{T} X_{2})}^{- 1} X_{2}^{T} γ^{(0)}

(3)

Vice versa, we obtain $ζ^{(1)}$ and $α^{(1)}$ fixing $β^{(1)}$ . The estimated vectors describe the strength of linear association between the matrices. This process is then repeated until convergence of some tolerance measure.

Generalized to k > 1, we initialize $α_{k}$ as $α_{k}^{(0)}$ to then repeatedly fix and re-estimate new weights, using the deflated matrices, to obtain a sequence ${(α_{k}^{(0)}, β_{k}^{(0)}), (α_{k}^{(1)}, β_{k}^{(1)}), \dots,}$ that is monotonically convergent (Hanafi 2007) for each component. The first canonical vector of $X_{1}, α_{k}^{(0)}$ , can be initialized randomly, with uniform weights or with some type of matrix decomposition.

Many penalized alternatives based on lasso (Parkhomenko et al. 2009, Witten et al. 2009) or elastic net (Waaijenborg and Zwinderman 2007) have been proposed to deal with high-dimensional CCA. Both approaches impose sparsity in the weights $α_{k}$ and $β_{k}$ by penalizing the model used in Equation 3. To obtain a certain sparsity, it is therefore necessary to search for the corresponding penalty parameter. As an alternative, we propose to introduce a soft-thresholding penalty to the regression formula (3). This penalization allows us direct control on the number of nonzero weights in both $α_{k}$ and $β_{k}$ . Not only will this improve the interpretation of the results, but it also speeds up NIPALS algorithm since we do not have to search for the penalty that corresponds to a particular number of nonzero weights. Additionally, this allows us to use permutations for hypothesis testing.

Algorithm 1.

Thresholded Ordered Sparse CCA (TOSCCA)

Input. $X_{1, s}, X_{2, s}, α^{(0)}, p_{α}$ , and $q_{β}$

Output. $α_{k}^{*}$ and $β_{k}^{*}$

$t \leftarrow 1, θ ≪ 1, ε = 10^{6}, ρ^{(0)} 0 \leftarrow 0$

1: while $ε > θ$ do ▹ Changes larger than tolerance measure

2: $γ \leftarrow X_{1, s} α^{(t - 1)}$

3: ${\tilde{β}}^{(t)} \leftarrow X_{2, s}^{T} γ$

4: $β_{k}^{(t)} \leftarrow 1_{| {\tilde{β}}^{(t)} | > q_{β}} {\tilde{β}}^{(t)} - q_{β}$

5: $ζ_{k} \leftarrow X_{2, s} β_{k}^{(t)}$ ▹ Standardize canonical variable for $X_{2}$

6: ${\tilde{α}}^{(t)} \leftarrow X_{1}^{T} ζ_{k}$

7: $α_{k}^{(t)} \leftarrow 1_{| {\tilde{α}}^{(t)} | > p_{α_{i}}} {\tilde{α}}^{(t)} - p_{α}$

8: $γ_{k} \leftarrow X_{1, s} α_{k}^{(t)}$ ▹ Standardize canonical variable for $X_{1}$

9: $ρ^{(t)} \leftarrow cor (γ_{k}, ζ_{k})$

10: $ε \leftarrow ρ^{(t)} - ρ^{(t - 1)}$

11: $t = t + 1$

12: end while

13: return $(α_{k}^{*}, β_{k}^{*})$ ▹ The canonical vectors

To add the soft-threshold penalty, we ignore the potential collinearity in the data by assuming that $X_{1}^{T} X_{1} = I_{p}$ . As with $X_{2}$ , simplifying the regression from Equation (3) into steps 3 and 6 (Dudoit et al. 2002). We calculate the optimal weight coefficients in Algorithm 1, through a modified NIPALS with soft-thresholding (steps 7 and 4) based on threshold parameters $p_{α} \in {1, 2, \dots, p}$ and $q_{β} \in {1, 2, \dots, q}$ . θ is fixed. We can show that the relationship between the in-sample canonical correlation and progressively larger $p_{α}$ , given $q_{β}$ , is nonconcave and increases for non-sparse solutions. Therefore, we use the out-of-sample canonical correlation which indeed shows a convex trajectory, implying decrease of the canonical correlation, as more irrelevant variables are included.

Through this algorithm, smaller choices of $(p_{α}, q_{β})$ yield canonical vectors which are subsets of dense alternatives when there is a signal, keeping one penalty fixed. That is, for threshold choices $p_{α, 1} \leq \dots \leq p$ and some fixed $q_{β}$ , both in the simulation study and the real application, $supp (α (p_{α, i})) \subseteq supp (α (p_{α, j}))$ if $i \leq j$ . This property is useful for interpretation of the results, as it shows selection stability of the larger contributors.

2.2 Estimating multiple canonical variates

In high-dimensional settings, finding the true canonical weights linking both datasets is particularly difficult as, in practice, there are many possible competitive combinations, rendering comparable canonical correlations. Furthermore, we wish to balance the percentage of explained variance to the number of selected variables, ruling out tuning of a penalty parameter. Instead, we choose soft-thresholding, which allows the user to search over and compare results from multiple specific sparsity levels at the same time.

From the computational perspective, the NIPALS algorithm allows the simultaneous and efficient estimation of the canonical vectors for different combinations of penalties by defining vector pair $(p_{α}, q_{β})$ . Then the canonical vectors for each component become matrices ( $A_{k}, B_{k}$ ) for which each column represents a $(p_{α, i}, q_{β, i})$ pairing. That is, steps 3 and 6 in Algorithm 1 become $B = X_{2}^{T} Γ$ and $A = X_{1}^{T} Z$ , where $Γ$ and Z are matrices of latent variables from canonical weights with different sparsity levels. In a single run of the NIPALS algorithm, it is possible to estimate the canonical vectors for several sparsity levels. This may be used to guide the researchers as to the range of appropriate sparsity levels.

We calculate the canonical weights and correlations for later components ( $k \geq 2$ ) deflating the data to account for the variance explained in previously estimated latent variables. We deflate the matrices as

X_{1}^{(k + 1)} = (I_{p} - γ^{(k)} {(γ^{(k) T} γ^{(k)})}^{- 1} γ^{(k) T}) X_{1}^{(k)},

(4)

where $X_{1}^{(1)} = X_{1}$ . Matrix $X_{2}$ is deflated following the same scheme.

In the original unpenalized version of NIPALS, this deflation would make latent variables orthogonal for different components. However, it is well known that introducing sparsity compromises this property (Jolliffe 1995) as it is the case with other sparse CCA methods. Consequently, the standard measure of cumulative percentage of explained variance (CPEV), which is used to determine the number of components (Shen and Huang 2008), may be inaccurate due to the presence of repeated information. That is, as orthogonality of the canonical vectors can no longer be guaranteed, new components do not necessarily contain new information, and hence latent variables may be correlated. We propose the following alternative measure to adjust for repeated information for $k = 2, \dots, K$ :

{CPEV}_{adj} (γ_{k}) = CPEV (γ_{k}) \cdot \prod_{i < k} (1 - | cor (γ_{i}, γ_{k}) |),

(5)

where CPEV is defined as $t r (γ_{1 : k}^{T} γ_{1 : k}) / t r (X_{1}^{T} X_{1})$ , and equally for $ζ$ and $X_{2}$ .

2.3 Permutation testing

To assess the estimated correlations, we test the null hypothesis of no correlation between the datasets, and their deflated counterpart for subsequent components via permutation testing. We permute one of the datasets and re-estimate the canonical correlation to approximate the distribution of the correlations under the null. Since the canonical correlation estimate is affected by the number of nonzero weights, using the number of nonzeros as the original analysis makes the permuted correlations comparable between themselves and the original estimate.

Standard penalties, such as the lasso or the elastic net, optimize the combination of weights and variable selection to match the corresponding dataset. This yields null distributions which are contingent on sparsity levels and, thus may lead to incorrect assessment of the estimated canonical correlations, as the same penalty across permutation may not return the same sparsity level (Supplementary Fig. A1).

The canonical correlation estimate is non-decreasing as a function of variables selected. Controlling over the number of the selected variables avoids catering the dimension reduction to the idiosyncrasies of the data. We argue that setting a more direct penalty over variable selection together with an appropriate residualization scheme has several advantages. Mainly these are improvements in subsequent signal detection, interpretability and assessment of the relationships found without interference form. This scheme returns appropriate type I error rates.

Multi-component canonical correlation analysis requires testing for each component. Due to the high-dimensional data, we expect the gaps between quantiles of the null distributions to be small; under the null distribution, the estimated correlations will have similar values. We have empirical evidence supporting this statement coming from the permutation distributions amongst different correlation estimates looking very similar. Hence, we determine that a simple Bonferroni correction will suffice to manage multiple testing concerns. We address multiple testing concerns using the statistic for the largest correlation as threshold for the rest.

3 Simulations

We simulated the data using the probabilistic CCA (Bach and Jordan 2005) as data generating process. We simulate three true components of different sizes for data with n = 100, p = 2500 and q = 500. We analysed the simulated data using the approach from Section 2, from now on referred to as TOSCCA for Thresholded Ordered Sparse CCA, and compared its performance to the existing method, PMA, a popular sparse CCA method used in the study of high-dimensional data aiming to improve interpretability. We examined each model’s accuracy (Fig. 1), selection stability, convergence and adjusted CPEV, from Equation (5).

Figure 1. — True (top) and estimated canonical vectors for TOSCCA (middle) and PMA (bottom). (a) Out-of-sample CC. (b) CPEV (%). (c) Correlation to k = 1. (d) Adjusted CPEV (%).

We fixed the sparsity level for each method as $p_{α} = q_{β} = 100$ variables for all components and found the best penalty for PMA using their built-in function. Figure 1 displays the true signals (top) and the estimated canonical vectors by TOSCCA (centre) and PMA (bottom). For the sake of comparability, both methods had initial values drawn from a random uniform distribution. This was a minor modification to the PMA algorithm as it was originally designed to be initialized with values from an eigen rendering canonical vectors which do not differ much from their initialization. That is, canonical vectors are effectively predetermined from the start. Since convergence irrespective of the initialization is a desirable property for iterative algorithms, we compared the two approaches using the same random initialization. We observed that TOSCCA consistently selects the corresponding variables for each component; when the signal involved fewer than ${p_{α}, q_{β}}$ variables for each, the remaining weights were set closer to zero. Moreover, there was no overlap between signals and the canonical weights were appropriately paired. PMA, on the other hand, selected the larger signal for both components.

We checked TOSCCA’s selection stability for running the algorithm for eight different sparsity levels, one for each choice of $p_{α}$ while keeping $q_{β}$ fixed. The signal was distinctly identified regardless the number of nonzero variables. We observed the selection stability described in the previous section (Supplementary Fig. A2). We observed the adjusted CPEV increase for the first three components, where the signal was located, and then plateaued for the fourth component. The auto-correlation between canonical vectors was effectively zero, reducing Equation (5) to the original formula.

We assessed the validity of the estimated canonical correlations for each component through permutation testing (Supplementary Fig. A3). We found the first three canonical correlations to be statistically different from those find in the permuted data. The fourth estimated correlation was correctly found to be not significant.

4 GDCS data

We applied TOSCCA to analyse data on the Genomics of Drug Sensitivity in Cancer (Garnett et al. 2012) from the GDSC project (Yang et al. 2013) aimed at identifying molecular markers of drug response. The data comprised drug sensitivity measures for cell lines and their corresponding genomic profile (gene expression, methylation profiles, mutations and copy numbers) from the Catalogue of Somatic Mutations in Cancer database.

We were interested in quantifying the associations between gene expression and drug sensitivity (IC₅₀) to explore how combinations of genes may affect drug effectiveness. The data comprised 737 samples of 49 386 gene expression measurements and 320 IC₅₀ values. We fixed the sparsity of the estimated canonical correlation to be of $p_{α} = 100$ variables belonging to gene expression to $q_{β} = 20$ from the IC₅₀ values. These numbers were chosen to simplify analysis, limiting the number of variables to be interpreted to what we believe is feasible and to illustrate how fixing sparsity may improve results for exploratory analysis. We ran the same analysis using PMA, where we used their proposed method based on cross-validation to find the optimal penalty parameters. These penalties rendered 800 gene expression variables and 7 drugs. Finally, we repeated the analysis with TOSCCA matching PMA’s optimal sparsity level.

We observed PMA achieve a greater correlation across components with the exception of the first component (k = 1), Fig. 2a. However, as previously argued, correlation alone is a weak indicator for links between high-dimensional data. The CPEV values for the four first components, Fig. 2b, show TOSCCA, with the default configuration (TOSCCA 100), generally outperformed all other alternatives. After inspection, PMA’s correlation and CPEV values for subsequent components were attributed to PMA selecting virtually the same variables across components. Thus, replicating the first, usually the highest, correlation estimate. This is in line with what observed from PMA in Section 3, as it is prone to compute very similar components (Fig. 1). Figure 3c shows said correlation values between the subsequent components the first one. These suggest that the variance added from subsequent components was unlikely coming from new information.

Figure 2. — CCA result comparison between TOSCCA and PMA for initializations from eigen decomposition, random uniform, and random normal.

Figure 3. — Latent variables plot for k = 2 (K2) and k = 3 (K3) against k = 1 (K1). Sparsity levels are $p_{α} = 100, q_{β} = 20$ (left), $p_{α} = 800, q_{β} = 7$ (centre) and $p_{α} = 800, q_{β} = 7$ (right).

We used the proposed adjusted CPEV in Equation (5) to control for almost identical information from subsequent components. This adjusted CPEV is displayed in Fig. 3d where the values from Fig. 2b account for high correlation between components, as an indicator of repeated information. After this adjustment, TOSCCA consistently outperformed PMA across the board.

We then followed with permutation testing for the estimated correlations. We observed all four components to be far away from the null distribution, hence deemed significant (Supplementary Fig. A4). Datasets of such dimensions and characteristics will most likely keep returning significant correlations for many components as biological interactions go beyond the simple digits in this example. Nevertheless, we previously stated the advantages of keeping these links manageable in favour of interpretation. We suggest using the Adjusted CPEV to determine the number of components to estimate.

Last, as CCA and PCA are closely related, we used the equivalent of a score plot to observe any potential similarities represented by the estimated latent variables. We grouped observations by the region assigned to each cell line, as displayed in Fig. 3. We chose to focus on the blood, lung and skin regions as these were the ones with the most observations. Studies show idiosyncrasies in drug resistance from cancers in organ systems as they create a micro-environment which impacts drug delivery outcomes (Park et al. 2022), compared to that of blood cancers.

Next, we use the cell line regions to validate the results of TOSCCA and PMA. TOSCCA found two or three different groups that were strongly associated with the cell lines’ region. PMA (right) showed the same linearity discussed above, consequently both k = 2 and k = 3 look similar when plotted against k = 1. These plots pick up on the distinction between cancers on organ tissues and blood cancers across methods, as blood cell lines tend to remain further apart from skin and lung cell lines. Further analysis into the dynamics between gene expression and drug sensitivity measures is beyond the scope of this paper.

In addition, we have included in Supplementary Appendix C the analysis for the breast cancer dataset (Chin et al. 2006) used in Witten et al. (2009).

5 Discussion

We introduced the method TOSCCA to carry out exploratory analysis on high-dimensional data. We used the NIPALS algorithm, together with soft-thresholding to induce sparsity, for its efficiency in dealing with high-dimensional data. Our method introduces computational and interpretational attributes that ease the search and analysis of the associations integrating this data. In our method, we fix the number of nonzero canonical weights therefore promoting interpretable results and limiting the computational burden in estimation and, consequently, permutation testing. This framework allows for multiple sparsity levels to be computed simultaneously, which further facilitates the choice of sparsity. Moreover, TOSCCA shows selection stability across different choices of sparsity and produces near-orthogonal components. Last, we found that when the threshold parameter was set to be larger than the true signal, the estimated canonical weights were forced to be closer zero, displaying robustness regardless of the chosen threshold parameters.

Understanding the contribution of a variable or set of variables in penalized high-dimension analysis is complicated as different results can easily yield very similar outcomes. This is particularly true of genomic data where the intra-correlation structure interferes with deriving inference from the results. We argue that simplifying the search is more aligned with the exploratory efforts on integrated high-dimensional data. Fixing sparsity levels achieves said goal.

Altogether, the above scheme supports reliable assessment of the estimated canonical correlations through permutation testing as this dimension reduction strategy has permutations be comparable and, hence, draw an appropriate null distribution. TOSCCA shows improvements in signal discovery, especially for subsequent components, and assessment when compared to the PMA method which, as established in Section 1, continues to be a popular method for exploring associations in genomics datasets.

Supplementary Material

vbae021_Supplementary_Data

vbae021_supplementary_data.pdf^{(10.7MB, pdf)}

Acknowledgements

The authors thank the anonymous reviewers for their valuable suggestions.

Contributor Information

Nuria Senar, Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands.

Mark van de Wiel, Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands.

Aeilko H Zwinderman, Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands.

Michel H Hof, Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands.

Author contributions

Nuria Senar (Methodology [equal], Formal analysis [lead], Software [lead], Writing—original draft [lead]), Mark van de Wiel (Methodology [supporting], Writing—review & editing [equal], Supervision [equal]), Aeilko Zwinderman (Conceptualisation [equal], Methodology [supporting], Writing—review & editing[equal], Supervision [equal]), and Michel Hof (Conceptualization [lead], Methodology [equal], Writing—review & editing [lead], Supervision [equal])

Supplementary data

Supplementary data are available at Bioinformatics Advances online.

Conflict of interest

None declared.

Funding

None declared.

Data availability

The software and examples for $toscca$ are available on GitHub (https://github.com/nuria-sv/toscca).

References

Bach F, Jordan MI.. A Probabilistic Interpretation of Canonical Correlation Analysis. Technical Report 688 (Department of Statistics, University of California), 2005.
Chin K, DeVries S, Fridlyand J. et al. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 2006;10:529–41. [DOI] [PubMed] [Google Scholar]
Du L, Liu F, Liu K. et al. ; Alzheimer’s Disease Neuroimaging Initiative. Identifying diagnosis-specific genotype-phenotype associations via joint multitask sparse canonical correlation analysis. Bioinformatics 2020;36:i371–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dudoit S, Fridlyand J, Speed TP.. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002;97:77–87. [Google Scholar]
Garnett MJ, Edelman EJ, Heidorn SJ. et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 2012;483:570–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hanafi M. PLS path modelling: computation of latent variables with the estimation mode B. Comput Stat 2007;22:275–92. [Google Scholar]
Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77. [Google Scholar]
Jolliffe I. Rotation of principal components: choice of normalization constraints. J Appl Stat 1995;22:29–35. [Google Scholar]
Parkhomenko E, Tritchler D, Beyene J.. Sparse canonical correlation analyisis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:Article 1. [DOI] [PubMed] [Google Scholar]
Park BJ, Raha P, Pankovich J. et al. Utilization of cancer cell line screening to elucidate the anticancer activity and biological pathways related to the ruthenium-based therapeutic BOLD-100. Oncotarget 2022;15:28. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rodosthenous T, Shahrezaei V, Evangelou M.. Integrating multi-omics data through sparse canonical correlation analysis for the prediction of complex traits: a comparison study. Bioinformatics 2020;36:4616–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shen H, Huang JZ.. Sparse principal component analysis via regularized low rank matrix approximation. J Multivariate Anal 2008;99:1015–34. [Google Scholar]
van Nee MM, van de Brug T, van de Wiel MA.. Fast marginal likelihood estimation of penalties for group-adaptive elastic net. J Comput Graph Stat 2022;32:950–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
Waaijenborg S, Zwinderman AH.. Penalized canonical correlation analysis to quantify the association between gene expression and DNA markers. BMC Proc 2007;1:S122. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang H-T, Smallwood J, Mourao-Miranda J. et al. Finding the needle in a high-dimensional haystack: canonical correlation analysis for neuroscientists. Neuroimage 2020;216:116745. [DOI] [PubMed] [Google Scholar]
Wilms I, Croux C.. Sparse canonical correlation analysis from a predictive point of view. Biom J 2015;57:834–51. [DOI] [PubMed] [Google Scholar]
Witten DM, Tibshirani R, Hastie T.. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wold H. Estimation of principal components and related models by iterative least squares. J Multivar Anal 1966;391–420. [Google Scholar]
Xu H, Caramanis C, Mannor S.. Sparse algorithms are not stable: a no-free-lunch theorem. IEEE Trans Pattern Anal Mach Intell 2012;34:187–93. [DOI] [PubMed] [Google Scholar]
Yang W, Soares J, Greninger P. et al. Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res 2013;41:D955–61. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

vbae021_Supplementary_Data

vbae021_supplementary_data.pdf^{(10.7MB, pdf)}

Data Availability Statement

The software and examples for $toscca$ are available on GitHub (https://github.com/nuria-sv/toscca).

[vbae021-B1] Bach F, Jordan MI.. A Probabilistic Interpretation of Canonical Correlation Analysis. Technical Report 688 (Department of Statistics, University of California), 2005.

[vbae021-B2] Chin K, DeVries S, Fridlyand J. et al. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 2006;10:529–41. [DOI] [PubMed] [Google Scholar]

[vbae021-B3] Du L, Liu F, Liu K. et al. ; Alzheimer’s Disease Neuroimaging Initiative. Identifying diagnosis-specific genotype-phenotype associations via joint multitask sparse canonical correlation analysis. Bioinformatics 2020;36:i371–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae021-B4] Dudoit S, Fridlyand J, Speed TP.. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002;97:77–87. [Google Scholar]

[vbae021-B5] Garnett MJ, Edelman EJ, Heidorn SJ. et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 2012;483:570–5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae021-B6] Hanafi M. PLS path modelling: computation of latent variables with the estimation mode B. Comput Stat 2007;22:275–92. [Google Scholar]

[vbae021-B7] Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77. [Google Scholar]

[vbae021-B8] Jolliffe I. Rotation of principal components: choice of normalization constraints. J Appl Stat 1995;22:29–35. [Google Scholar]

[vbae021-B9] Parkhomenko E, Tritchler D, Beyene J.. Sparse canonical correlation analyisis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:Article 1. [DOI] [PubMed] [Google Scholar]

[vbae021-B10] Park BJ, Raha P, Pankovich J. et al. Utilization of cancer cell line screening to elucidate the anticancer activity and biological pathways related to the ruthenium-based therapeutic BOLD-100. Oncotarget 2022;15:28. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae021-B11] Rodosthenous T, Shahrezaei V, Evangelou M.. Integrating multi-omics data through sparse canonical correlation analysis for the prediction of complex traits: a comparison study. Bioinformatics 2020;36:4616–25. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae021-B12] Shen H, Huang JZ.. Sparse principal component analysis via regularized low rank matrix approximation. J Multivariate Anal 2008;99:1015–34. [Google Scholar]

[vbae021-B13] van Nee MM, van de Brug T, van de Wiel MA.. Fast marginal likelihood estimation of penalties for group-adaptive elastic net. J Comput Graph Stat 2022;32:950–60. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae021-B14] Waaijenborg S, Zwinderman AH.. Penalized canonical correlation analysis to quantify the association between gene expression and DNA markers. BMC Proc 2007;1:S122. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae021-B15] Wang H-T, Smallwood J, Mourao-Miranda J. et al. Finding the needle in a high-dimensional haystack: canonical correlation analysis for neuroscientists. Neuroimage 2020;216:116745. [DOI] [PubMed] [Google Scholar]

[vbae021-B16] Wilms I, Croux C.. Sparse canonical correlation analysis from a predictive point of view. Biom J 2015;57:834–51. [DOI] [PubMed] [Google Scholar]

[vbae021-B17] Witten DM, Tibshirani R, Hastie T.. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515–34. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae021-B18] Wold H. Estimation of principal components and related models by iterative least squares. J Multivar Anal 1966;391–420. [Google Scholar]

[vbae021-B19] Xu H, Caramanis C, Mannor S.. Sparse algorithms are not stable: a no-free-lunch theorem. IEEE Trans Pattern Anal Mach Intell 2012;34:187–93. [DOI] [PubMed] [Google Scholar]

[vbae021-B20] Yang W, Soares J, Greninger P. et al. Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res 2013;41:D955–61. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

TOSCCA: a framework for interpretation and testing of sparse canonical correlations

Nuria Senar

Mark van de Wiel

Aeilko H Zwinderman

Michel H Hof

Roles