Author manuscript; available in PMC: 2021 Oct 6.
Published in final edited form as: Biometrika. 2020 Apr 15;107(3):609–625. doi: 10.1093/biomet/asaa007

Sparse semiparametric canonical correlation analysis for data of mixed types

GRACE YOON 1, RAYMOND J CARROLL 1, IRINA GAYNANOVA 1
PMCID: PMC8494134  NIHMSID: NIHMS1615831  PMID: 34621080

Summary

Canonical correlation analysis investigates linear relationships between two sets of variables, but often works poorly on modern datasets due to high-dimensionality and mixed data types (continuous/binary/zero-inflated). We propose a new approach for sparse canonical correlation analysis of mixed data types that does not require explicit parametric assumptions. Our main contribution is the use of truncated latent Gaussian copula to model the data with excess zeroes, which allows us to derive a rank-based estimator of latent correlation matrix without the estimation of marginal transformation functions. The resulting semiparametric sparse canonical correlation analysis method works well in high-dimensional settings as demonstrated via numerical studies, and application to the analysis of association between gene expression and micro RNA data of breast cancer patients.

Keywords: BIC, Gaussian copula model, Kendall’s τ, Latent correlation matrix, Truncated continuous variable, Zero-inflated data

1. Introduction

Canonical correlation analysis investigates linear associations between two sets of variables, and is widely used in various fields including biomedical sciences, imaging and genomics (Hardoon et al., 2004; Chi et al., 2013; Safo et al., 2018). However, sample canonical correlation analysis often performs poorly due to two main challenges: high-dimensionality and non-normality of the data.

In high-dimensional settings, sample canonical correlation analysis is known to overfit the data due to singularity of sample covariance matrices (Hardoon et al., 2004; Guo et al., 2016). Additional regularization is often used to address this challenge. González et al. (2008) focus on ridge regularization of sample covariance matrices to avoid singularity, while more recent methods focus on sparsity regularization of canonical vectors (Parkhomenko et al., 2009; Witten et al., 2009; Chi et al., 2013; Cruz-Cano & Lee, 2014; Wilms & Croux, 2015; Safo et al., 2018). At the same time, with the advancement in technology, it is common to collect data of different types. For example, the Cancer Genome Atlas Project contains matched data of mixed types such as gene expression (continuous), mutation (binary) and micro RNA (count) data. While regularized canonical correlation methods work well for Gaussian data, they still are based on sample covariance matrix, and therefore are not appropriate for the analysis in the presence of binary data or data with excess of zero values.

Several approaches have been proposed to address the non-normality of the data. On the one hand, there are completely nonparametric approaches such as kernel canonical correlation analysis (Hardoon et al., 2004). On the other hand, there are parametric approaches building on the probabilistic interpretation of Bach & Jordan (2005). For example, Zoh et al. (2016) develop probabilistic canonical correlation analysis for count data by exploring the natural parameter of the Poisson distribution. More recently, Agniel & Cai (2017) use the normal semiparametric transformation model for the analysis of mixed types of variables; however, the method requires estimation of marginal transformation functions via nonparametric maximum likelihood.

In summary, significant progress has been made in developing regularized variants of sample canonical correlation analysis that work well in high-dimensional settings. However, these approaches are not suited for mixed data types. At the same time, several methods have been proposed to account for non-normality of the data, but they are not designed for high-dimensional settings. More importantly, to our knowledge none of the existing methods explicitly addresses the case of zero-inflated measurements, which are common, for example, in micro RNA and microbiome abundance data.

To bridge this gap, we propose a semiparametric approach for sparse canonical correlation analysis, which handles high-dimensional data of mixed types via a common latent Gaussian copula framework. Our work has three main contributions. First, we assume that zeros in the data are observed due to truncation of an underlying latent continuous variable, and define the corresponding truncated Gaussian copula model. We derive explicit formulas for the bridge functions that connect the Kendall’s τ of the observed data to the latent correlation matrix for different combinations of data types, and use these formulas to construct a rank-based estimator of the latent correlation matrix for mixed (continuous/binary/truncated) data. Fan et al. (2016) use the bridge function approach in the context of graphical models; however, the authors do not consider the truncated variable type. The latter requires derivation of new bridge functions, and those derivations are considerably more involved than the corresponding derivations for the continuous/binary case. The significant advantage of the bridge function technique is that it allows estimation of the latent correlation structure of the Gaussian copula without estimating marginal transformation functions, in contrast to Agniel & Cai (2017). Second, we use the derived rank-based estimator instead of the sample correlation matrix within the sparse canonical correlation analysis framework that is motivated by Chi et al. (2013) and Wilms & Croux (2015). This allows us to take into account the dataset-specific correlation structure in addition to the cross-correlation structure. In contrast, Parkhomenko et al. (2009) and Witten et al. (2009) model the variables within each dataset as uncorrelated. We develop an efficient optimization algorithm to solve the corresponding problem.
Finally, we propose two types of Bayesian information criterion (BIC) for tuning parameter selection, which leads to significant computational savings compared to the commonly used cross-validation and permutation techniques (Witten & Tibshirani, 2009). Wilms & Croux (2015) also use BIC in the canonical correlation analysis context, but propose only one criterion. Our two criteria originate from the BIC formulation for Gaussian linear models, depending on whether the error variance is treated as known or unknown. We found that both are competitive in our numerical studies; however, one criterion works best for variable selection, whereas the other works best for prediction.

2. Background

2·1. Canonical correlation analysis

In this section we review both classical canonical correlation analysis and its sparse alternatives. Given two random vectors $X_1 \in \mathbb{R}^{p_1}$ and $X_2 \in \mathbb{R}^{p_2}$, let $\Sigma_1 = \mathrm{cov}(X_1)$, $\Sigma_2 = \mathrm{cov}(X_2)$ and $\Sigma_{12} = \mathrm{cov}(X_1, X_2)$. Population canonical correlation analysis (Hotelling, 1936) seeks linear combinations $w_1^\top X_1$ and $w_2^\top X_2$ with maximal correlation:

$$\underset{w_1, w_2}{\text{maximize}}\ \{w_1^\top \Sigma_{12} w_2\} \quad \text{subject to} \quad w_1^\top \Sigma_1 w_1 = 1, \quad w_2^\top \Sigma_2 w_2 = 1. \tag{1}$$

Problem (1) has a closed-form solution via the singular value decomposition of $\Sigma_1^{-1/2}\Sigma_{12}\Sigma_2^{-1/2}$. Given the first pair of singular vectors $(u, v)$, the solutions to (1) can be expressed as $w_1 = \Sigma_1^{-1/2}u$ and $w_2 = \Sigma_2^{-1/2}v$.
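The closed-form solution can be seen through a change of variables. The short derivation below is a sketch that assumes $\Sigma_1$ and $\Sigma_2$ are positive definite, so that their inverse square roots exist; the symbol $\sigma_{\max}$ is introduced here only for illustration.

```latex
% Substitute u = \Sigma_1^{1/2} w_1 and v = \Sigma_2^{1/2} w_2 in problem (1).
\begin{align*}
w_1^\top \Sigma_{12} w_2
  &= u^\top \Sigma_1^{-1/2} \Sigma_{12} \Sigma_2^{-1/2} v, \\
w_1^\top \Sigma_1 w_1 &= u^\top u = 1, \qquad
w_2^\top \Sigma_2 w_2 = v^\top v = 1 .
\end{align*}
% Maximizing a bilinear form u^\top M v over unit vectors u and v is
% solved by the leading left and right singular vectors of M, so
\begin{equation*}
\max_{\|u\|_2 = \|v\|_2 = 1} u^\top \Sigma_1^{-1/2} \Sigma_{12} \Sigma_2^{-1/2} v
  = \sigma_{\max}\bigl(\Sigma_1^{-1/2} \Sigma_{12} \Sigma_2^{-1/2}\bigr),
\end{equation*}
% and mapping back gives w_1 = \Sigma_1^{-1/2} u, w_2 = \Sigma_2^{-1/2} v.
```

The maximal value, the largest singular value, is the first canonical correlation.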

The sample canonical correlation analysis replaces $\Sigma_1$, $\Sigma_2$ and $\Sigma_{12}$ in (1) by the corresponding sample covariance matrices $S_1$, $S_2$ and $S_{12}$. In high-dimensional settings, when the sample size is small compared to the number of variables, $S_1$ and $S_2$ are singular, leading to non-uniqueness of the solution and poor performance due to overfitting. A common approach to circumvent this challenge is to consider sparse regularization of $w_1$ and $w_2$ via the addition of an $\ell_1$ penalty to the objective function of (1) (Witten et al., 2009; Parkhomenko et al., 2009; Chi et al., 2013; Wilms & Croux, 2015). The sparse canonical correlation analysis is then formulated as

$$\underset{w_1, w_2}{\text{maximize}}\ \{w_1^\top S_{12} w_2 - \lambda_1 \|w_1\|_1 - \lambda_2 \|w_2\|_1\} \quad \text{subject to} \quad w_1^\top S_1 w_1 \le 1, \quad w_2^\top S_2 w_2 \le 1. \tag{2}$$

In addition to the $\ell_1$ penalties, the equality constraints in (1) are replaced with inequality constraints, which define convex sets. This generalization is possible since nonzero solutions to (2) satisfy the constraints with equality; see Proposition 1 below.

While problem (2) works well in high-dimensional settings, it still relies on sample covariance matrices, and therefore is not well suited for skewed or non-continuous data, such as binary or zero-inflated data. Next we review the Gaussian copula models that we propose to use to address these challenges.

2·2. Latent Gaussian copula model for mixed data

In this section we review the Gaussian copula model in Liu et al. (2009), and its extension to mixed (continuous/binary) data in Fan et al. (2016).

Definition 1 (Gaussian copula model). A random vector $X = (X_1, \ldots, X_p)$ satisfies the Gaussian copula model if there exists a set of monotonically increasing transformations $f = (f_j)_{j=1}^p$ satisfying $f(X) = (f_1(X_1), \ldots, f_p(X_p)) \sim N_p(0, \Sigma)$ with $\Sigma_{jj} = 1$. We denote $X \sim \mathrm{NPN}(0, \Sigma, f)$.

Definition 2 (Latent Gaussian copula model for mixed data). Let $X_1 \in \mathbb{R}^{p_1}$ be continuous and $X_2 \in \mathbb{R}^{p_2}$ be binary random vectors with $X = (X_1, X_2)$. Then $X$ satisfies the latent Gaussian copula model if there exists a $p_2$-dimensional random vector $U_2 = (U_{p_1+1}, \ldots, U_{p_1+p_2})$ such that $U := (X_1, U_2) \sim \mathrm{NPN}(0, \Sigma, f)$ and $X_j = I(U_j > C_j)$ for all $j = p_1 + 1, \ldots, p_1 + p_2$, where $I(\cdot)$ is the indicator function and $C = (C_1, \ldots, C_p)$ is a vector of constants. We denote $X \sim \mathrm{LNPN}(0, \Sigma, f, C)$, where $\Sigma$ is the latent correlation matrix.

Fan et al. (2016) consider the problem of estimating $\Sigma$ for the latent Gaussian copula model based on Kendall’s τ. Given the observed data $(X_{j1}, X_{k1}), \ldots, (X_{jn}, X_{kn})$ for variables $X_j$ and $X_k$, Kendall’s τ is defined as

$$\hat\tau_{jk} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} \mathrm{sign}(X_{ji} - X_{ji'})\,\mathrm{sign}(X_{ki} - X_{ki'}).$$

Since $\hat\tau_{jk}$ is invariant under monotone transformations of the data, it is well suited to capture associations in copula models. Let $\tau_{jk} = E(\hat\tau_{jk})$ be the population Kendall’s τ. The latent correlation matrix $\Sigma$ can be connected to Kendall’s τ via the so-called bridge function $F$ such that $\Sigma_{jk} = F^{-1}(\tau_{jk})$ for all variables $j$ and $k$. Fan et al. (2016) derive an explicit form of the bridge function for continuous, binary and mixed (continuous/binary) variable pairs, which allows estimation of the latent correlation matrix via the method of moments. We summarize these results below.
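As an illustration of the pairwise statistic defined above, the following minimal Python sketch computes the sample Kendall’s τ directly from the double sum over pairs. The function names are hypothetical, and the O(n²) loop is chosen for clarity rather than speed.

```python
import math


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_tau(xj, xk):
    """Sample Kendall's tau: 2 / {n(n-1)} times the sum over pairs i < i'
    of sign(xj[i] - xj[i']) * sign(xk[i] - xk[i']).
    Tied pairs contribute zero, which matters for binary or truncated data."""
    n = len(xj)
    total = 0
    for i in range(n):
        for i2 in range(i + 1, n):
            total += sign(xj[i] - xj[i2]) * sign(xk[i] - xk[i2])
    return 2.0 * total / (n * (n - 1))


# Invariance under a monotone transformation (here exp) of one variable.
x = [0.3, -1.2, 2.5, 0.9, -0.4]
y = [1.0, 0.2, 3.1, 0.5, 0.7]
tau_raw = kendall_tau(x, y)
tau_transformed = kendall_tau([math.exp(v) for v in x], y)
```

Because only the ordering of the observations enters the statistic, `tau_raw` and `tau_transformed` coincide.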

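Theorem 1 (Fan et al. (2016)). Let $X = (X_1, X_2) \sim \mathrm{LNPN}(0, \Sigma, f, C)$ with $p_1$-dimensional continuous $X_1$ and $p_2$-dimensional binary $X_2$. The rank-based estimator of $\Sigma$ is given by the symmetric matrix $\hat R$ with $\hat R_{jj} = 1$ and $\hat R_{jk} = \hat R_{kj} = F_{jk}^{-1}(\hat\tau_{jk})$, where for $t \in (0, 1)$

$$F_{jk}(t) = \begin{cases} 2\sin^{-1}(t)/\pi & \text{if } 1 \le j < k \le p_1; \\ 2\{\Phi_2(\Delta_j, \Delta_k; t) - \Phi(\Delta_j)\Phi(\Delta_k)\} & \text{if } p_1 + 1 \le j < k \le p_1 + p_2; \\ 4\Phi_2(\Delta_j, 0; t/\sqrt{2}) - 2\Phi(\Delta_j) & \text{if } 1 \le j \le p_1,\ p_1 + 1 \le k \le p_1 + p_2. \end{cases}$$

Here $\Delta_j = f_j(C_j)$, $\Phi(\cdot)$ is the cdf of the standard normal distribution, and $\Phi_2(\cdot, \cdot; t)$ is the cdf of the standard bivariate normal distribution with correlation $t$.

Remark 1. Since $\Delta_j = f_j(C_j)$ is unknown in practice, Fan et al. (2016) propose to use a plug-in estimator from the moment equation $E(X_{ij}) = 1 - \Phi(\Delta_j)$, leading to $\hat\Delta_j = \Phi^{-1}(1 - \bar X_j)$.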
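To make the method-of-moments step concrete, the sketch below evaluates the three bridge functions of Theorem 1 and inverts them numerically. It uses only the Python standard library; the bivariate normal cdf is approximated by Simpson's rule, and the inversion uses bisection, which assumes the bridge function is increasing in $t$. All function names, the integration grid and the bisection bounds are illustrative choices, not part of the original method. In `bridge_mixed` the argument `delta` is taken to be the threshold of the binary variable.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal: STD.cdf is Phi, STD.pdf is its density


def bvn_cdf(a, b, r, steps=400):
    """Approximate P(Z1 <= a, Z2 <= b) for a standard bivariate normal with
    correlation r (|r| < 1), via Simpson's rule applied to
    int_{-8}^{a} phi(x) * Phi((b - r x) / sqrt(1 - r^2)) dx."""
    lo = -8.0
    if a <= lo:
        return 0.0
    s = math.sqrt(1.0 - r * r)
    h = (a - lo) / steps  # steps must be even for Simpson's rule

    def g(x):
        return STD.pdf(x) * STD.cdf((b - r * x) / s)

    total = g(lo) + g(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0


def bridge_continuous(t):
    """Continuous/continuous pair: F(t) = 2 asin(t) / pi."""
    return 2.0 * math.asin(t) / math.pi


def bridge_binary(t, delta_j, delta_k):
    """Binary/binary pair: F(t) = 2 {Phi2(dj, dk; t) - Phi(dj) Phi(dk)}."""
    return 2.0 * (bvn_cdf(delta_j, delta_k, t) - STD.cdf(delta_j) * STD.cdf(delta_k))


def bridge_mixed(t, delta):
    """Continuous/binary pair: F(t) = 4 Phi2(delta, 0; t / sqrt 2) - 2 Phi(delta)."""
    return 4.0 * bvn_cdf(delta, 0.0, t / math.sqrt(2.0)) - 2.0 * STD.cdf(delta)


def invert_bridge(bridge, tau, lo=-0.99, hi=0.99, iters=50):
    """Solve bridge(t) = tau by bisection, assuming bridge is increasing."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bridge(mid) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Plug-in threshold of Remark 1 for a toy binary sample.
x_binary = [0, 1, 1, 0, 1, 1, 1, 0]
delta_hat = STD.inv_cdf(1.0 - sum(x_binary) / len(x_binary))

# Round trip: a latent correlation of 0.5 maps to tau and back.
tau_bb = bridge_binary(0.5, 0.0, 0.0)
r_bb = invert_bridge(lambda t: bridge_binary(t, 0.0, 0.0), tau_bb)
```

For the continuous case the inverse is available in closed form, $\sin(\pi\tau/2)$, so the numerical inversion is only needed when a binary variable is involved.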

Fan et al. (2016) use these results in the context of Gaussian graphical models, and replace the sample covariance matrix with the rank-based estimator $\hat R$, which allows the use of Gaussian models with skewed continuous and binary data. However, Fan et al. (2016) do not consider the case of zero-inflated data, which requires formulation of a new model, and subsequently derivation of new bridge functions.

3. Methodology

3·1. Truncated latent Gaussian copula model

Our goal is to model zero-inflated data through latent Gaussian copula models. Two motivating examples are micro RNA and microbiome data, where it is common to encounter a large number of zero counts. In both examples it is reasonable to assume that zeros are observed due to truncation of an underlying latent continuous variable. More generally, one can think of zeros as representing the measurement error due to truncation of values below a certain positive threshold. This intuition leads us to consider the following model.

Definition 3 (Truncated latent Gaussian copula model). A random vector X = (X1, … , Xd) satisfies truncated Gaussian copula model if there exists a d-dimensional random vector U = (U1, … , Ud) ~ NPN(0,Σ, f) such that

$$X_j = I(U_j > C_j)\,U_j \quad (j = 1, \ldots, d),$$

where I(·) is the indicator function and C = (C1, … , Cd) is a vector of positive constants. We denote X ~ TLNPN(0,Σ, f,C), where Σ is the latent correlation matrix.

The methodology in Fan et al. (2016) allows estimation of the latent correlation matrix in the presence of mixed continuous and binary data. Our Definition 3 adds a third type, which we call truncated for short. To construct a rank-based estimator for $\Sigma$ as in Theorem 1 in the presence of truncated variables, below we derive an explicit form of the bridge function for all possible combinations of the data types (continuous/binary/truncated). Throughout, we use $\Phi(\cdot)$ for the cdf of the standard normal distribution and $\Phi_d(\cdots; \Sigma_d)$ for the cdf of the standard $d$-variate normal distribution with correlation matrix $\Sigma_d$. All proofs are deferred to Appendix A.

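Theorem 2. Let $X_j$ be truncated and $X_k$ be binary. Then $E(\hat\tau_{jk}) = F(\Sigma_{jk}; \Delta_j, \Delta_k)$, where

$$F(\Sigma_{jk}; \Delta_j, \Delta_k) = 2\{1 - \Phi(\Delta_j)\}\Phi(\Delta_k) - 2\Phi_3(-\Delta_j, \Delta_k, 0; \Sigma_{3a}) - 2\Phi_3(-\Delta_j, \Delta_k, 0; \Sigma_{3b}), \quad \Delta_j = f_j(C_j),\ \Delta_k = f_k(C_k),$$

and

$$\Sigma_{3a} = \begin{pmatrix} 1 & -\Sigma_{jk} & 1/\sqrt{2} \\ -\Sigma_{jk} & 1 & -\Sigma_{jk}/\sqrt{2} \\ 1/\sqrt{2} & -\Sigma_{jk}/\sqrt{2} & 1 \end{pmatrix}, \qquad \Sigma_{3b} = \begin{pmatrix} 1 & 0 & -1/\sqrt{2} \\ 0 & 1 & -\Sigma_{jk}/\sqrt{2} \\ -1/\sqrt{2} & -\Sigma_{jk}/\sqrt{2} & 1 \end{pmatrix}.$$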

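Theorem 3. Let $X_j$ be truncated and $X_k$ be continuous. Then $E(\hat\tau_{jk}) = F(\Sigma_{jk}; \Delta_j)$, where

$$F(\Sigma_{jk}; \Delta_j) = -2\Phi_2(-\Delta_j, 0; 1/\sqrt{2}) + 4\Phi_3(-\Delta_j, 0, 0; \Sigma_3), \quad \Delta_j = f_j(C_j),$$

and

$$\Sigma_3 = \begin{pmatrix} 1 & 1/\sqrt{2} & \Sigma_{jk}/\sqrt{2} \\ 1/\sqrt{2} & 1 & \Sigma_{jk} \\ \Sigma_{jk}/\sqrt{2} & \Sigma_{jk} & 1 \end{pmatrix}.$$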

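Theorem 4. Let both $X_j$ and $X_k$ be truncated. Then $E(\hat\tau_{jk}) = F(\Sigma_{jk}; \Delta_j, \Delta_k)$, where

$$F(\Sigma_{jk}; \Delta_j, \Delta_k) = -2\Phi_4(-\Delta_j, -\Delta_k, 0, 0; \Sigma_{4a}) + 2\Phi_4(-\Delta_j, -\Delta_k, 0, 0; \Sigma_{4b}), \quad \Delta_j = f_j(C_j),\ \Delta_k = f_k(C_k),$$

and

$$\Sigma_{4a} = \begin{pmatrix} 1 & 0 & 1/\sqrt{2} & -\Sigma_{jk}/\sqrt{2} \\ 0 & 1 & -\Sigma_{jk}/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -\Sigma_{jk}/\sqrt{2} & 1 & -\Sigma_{jk} \\ -\Sigma_{jk}/\sqrt{2} & 1/\sqrt{2} & -\Sigma_{jk} & 1 \end{pmatrix}$$

and

$$\Sigma_{4b} = \begin{pmatrix} 1 & \Sigma_{jk} & 1/\sqrt{2} & \Sigma_{jk}/\sqrt{2} \\ \Sigma_{jk} & 1 & \Sigma_{jk}/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & \Sigma_{jk}/\sqrt{2} & 1 & \Sigma_{jk} \\ \Sigma_{jk}/\sqrt{2} & 1/\sqrt{2} & \Sigma_{jk} & 1 \end{pmatrix}.$$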

We also show that the inverse bridge function exists for all of the cases.

Theorem 5. For any constants $\Delta_j$ and $\Delta_k$, the bridge functions $F(\Sigma_{jk})$ in Theorems 2–4 are strictly increasing in $\Sigma_{jk} \in (-1, 1)$, and therefore the inverse function $F^{-1}(\tau_{jk})$ exists.

Theorems 2–5 complement the results of Fan et al. (2016) summarized in Theorem 1 by adding three more cases (continuous/truncated, binary/truncated and truncated/truncated), thus allowing construction of the rank-based estimator $\hat R$ for $\Sigma$ in the presence of mixed (continuous/binary/truncated) variables.

Remark 2. Since $\hat R$ is not guaranteed to be positive semidefinite, Fan et al. (2016) regularize $\hat R$ by projecting it onto the cone of positive semidefinite matrices. We follow this approach using the nearPD function in the Matrix R package, leading to the estimator $\hat R_p$. Furthermore, we consider

$$\tilde R = (1 - \rho)\hat R_p + \rho I \tag{3}$$

with a small value of $\rho > 0$, so that $\tilde R$ is strictly positive definite. Throughout, we fix $\rho = 0.01$.
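The shrinkage step in (3) is simple enough to illustrate directly. The sketch below assumes the projected matrix $\hat R_p$ is already available as a nested list (the positive semidefinite projection itself is not implemented here), and the function name is hypothetical.

```python
def shrink_to_positive_definite(r_p, rho=0.01):
    """Return (1 - rho) * r_p + rho * I for a square matrix given as nested lists.
    If r_p is positive semidefinite with unit diagonal, the result is strictly
    positive definite and keeps the unit diagonal."""
    p = len(r_p)
    return [
        [(1.0 - rho) * r_p[i][j] + (rho if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]


# A singular (rank-one) correlation matrix becomes invertible after shrinkage.
r_p = [[1.0, 1.0], [1.0, 1.0]]
r_tilde = shrink_to_positive_definite(r_p)
det_before = r_p[0][0] * r_p[1][1] - r_p[0][1] * r_p[1][0]
det_after = r_tilde[0][0] * r_tilde[1][1] - r_tilde[0][1] * r_tilde[1][0]
```

In this toy case the determinant moves from zero to a strictly positive value while the diagonal stays equal to one.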

Remark 3. As in the binary case, $\Delta_j = f_j(C_j)$ is unknown for truncated variables. Similar to Fan et al. (2016), we use a plug-in estimator $\hat\Delta_j$ based on the moment equation $E\{I(X_{ij} > 0)\} = P(X_j > 0) = P\{f_j(U_j) > \Delta_j\} = 1 - \Phi(\Delta_j)$. Let $n_{\mathrm{nonzero}} = \sum_{i=1}^n I(X_{ij} > 0)$; then we use $\hat\Delta_j = \Phi^{-1}(1 - n_{\mathrm{nonzero}}/n)$.
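The following sketch illustrates Definition 3 and the plug-in estimator of Remark 3 on simulated data. It assumes, purely for illustration, a single variable with latent $U = \exp(Z)$, $Z \sim N(0, 1)$, so that $f(u) = \log u$ and $\Delta = \log C$; the function names, the threshold $C = 1.5$ and the sample size are hypothetical choices.

```python
import math
import random
from statistics import NormalDist


def simulate_truncated(n, c, seed=0):
    """Draw n observations X = I(U > c) * U with U = exp(Z), Z ~ N(0, 1)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        u = math.exp(rng.gauss(0.0, 1.0))
        sample.append(u if u > c else 0.0)
    return sample


def delta_hat(x):
    """Plug-in estimator Phi^{-1}(1 - n_nonzero / n).
    Requires at least one zero and one nonzero value, since inv_cdf
    is only defined for probabilities strictly between 0 and 1."""
    n_nonzero = sum(1 for v in x if v > 0)
    return NormalDist().inv_cdf(1.0 - n_nonzero / len(x))


c = 1.5
x = simulate_truncated(20000, c)
estimate = delta_hat(x)
truth = math.log(c)  # Delta = f(C) with f = log in this toy setup
```

The estimator uses only the proportion of zeros, so it does not require knowledge of the transformation $f$.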

3·2. Semiparametric sparse canonical correlation analysis

Our proposal is based on formulating sparse canonical correlation analysis using the latent correlation matrix from the Gaussian copula model for mixed data. On a population level, let $\Sigma$ be the latent correlation matrix for $(X_1, X_2) \sim \mathrm{LNPN}(0, \Sigma, f, C)$, where each of $X_1$ and $X_2$ follows one of the three data types: continuous, binary or truncated. In Section 3·1 we derived a rank-based estimator for $\Sigma$, which we propose to use within the sparse canonical correlation analysis framework (2).

Given the semiparametric estimator $\tilde R$ in (3), we propose to find canonical vectors by solving

$$\underset{w_1, w_2}{\text{minimize}}\ \{-w_1^\top \tilde R_{12} w_2 + \lambda_1 \|w_1\|_1 + \lambda_2 \|w_2\|_1\} \quad \text{subject to} \quad w_1^\top \tilde R_1 w_1 \le 1, \quad w_2^\top \tilde R_2 w_2 \le 1. \tag{4}$$

While we focus only on the estimation of the first canonical pair, the subsequent canonical pairs can be found sequentially by using a deflation scheme. Let $\tilde R_{12}^{(1)} = \tilde R_{12}$ and let $\hat w_1$, $\hat w_2$ be the $(k-1)$th estimated canonical pair. To estimate the $k$th pair for $k > 1$, form

$$\tilde R_{12}^{(k)} = \tilde R_{12}^{(k-1)} - (\hat w_1^\top \tilde R_{12}^{(k-1)} \hat w_2)\, \tilde R_1 \hat w_1 \hat w_2^\top \tilde R_2,$$

and solve (4) using $\tilde R_{12}^{(k)}$ instead of $\tilde R_{12}$.

While problem (4) is not jointly convex in w1 and w2, it is biconvex. Therefore, we propose to iteratively optimize over w1 and w2. First, consider optimizing over w1 with w2 fixed.

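Proposition 1. For a fixed $w_2 \in \mathbb{R}^{p_2}$, let

$$\hat w_1 = \underset{w_1}{\arg\min}\ \{-w_1^\top \tilde R_{12} w_2 + \lambda_1 \|w_1\|_1\} \quad \text{subject to} \quad w_1^\top \tilde R_1 w_1 \le 1. \tag{5}$$

This problem is equivalent to finding

$$\tilde w_1 = \underset{w_1}{\arg\min}\ \{(1/2) w_1^\top \tilde R_1 w_1 - w_1^\top \tilde R_{12} w_2 + \lambda_1 \|w_1\|_1\}, \tag{6}$$

and then setting $\hat w_1 = 0$ if $\tilde w_1 = 0$, and $\hat w_1 = \tilde w_1 / (\tilde w_1^\top \tilde R_1 \tilde w_1)^{1/2}$ if $\tilde w_1 \ne 0$.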

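Both problems (5) and (6) are convex, but unlike (5), problem (6) is unconstrained. Furthermore, problem (6) is of the same form as the well-studied penalized LASSO problem (Tibshirani, 1996), which can be solved efficiently using, for example, a coordinate-descent algorithm. Hence, the proposed optimization algorithm for (4) can be viewed as a sequence of LASSO problems with rescaling. Given the value of $w_2$ at iteration $t$, the updates at iteration $t + 1$ have the form

$$\begin{aligned} \tilde w_1 &= \underset{w_1}{\arg\min}\ \{(1/2) w_1^\top \tilde R_1 w_1 - w_1^\top \tilde R_{12} w_2^{(t)} + \lambda_1 \|w_1\|_1\}; \\ \hat w_1^{(t+1)} &= \tilde w_1 / (\tilde w_1^\top \tilde R_1 \tilde w_1)^{1/2}; \\ \tilde w_2 &= \underset{w_2}{\arg\min}\ \{(1/2) w_2^\top \tilde R_2 w_2 - w_2^\top \tilde R_{12}^\top w_1^{(t+1)} + \lambda_2 \|w_2\|_1\}; \\ \hat w_2^{(t+1)} &= \tilde w_2 / (\tilde w_2^\top \tilde R_2 \tilde w_2)^{1/2}. \end{aligned}$$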

If a zero solution is obtained at any of the steps, the optimization algorithm stops, and both w1 and w2 are returned as zeroes. Otherwise, the algorithm proceeds until convergence, which is guaranteed due to biconvexity of (4) (Gorski et al., 2007).

We further describe the coordinate-descent algorithm for (6). Consider the KKT conditions

$$\tilde R_1 w_1 - \tilde R_{12} w_2 + \lambda_1 s_1 = 0,$$

where $s_1$ is a subgradient of $\|w_1\|_1$. If $\lambda_1 \ge \|\tilde R_{12} w_2\|_\infty$, it follows that $\tilde w_1 = 0$. Otherwise, the $i$th element of $w_1$ can be expressed through the other coordinates as

$$w_{1i} = S_{\lambda_1}\{(\tilde R_{12})_i w_2^{(t)} - (\tilde R_1)_{i,-i} (w_1)_{-i}\},$$

where $S_\lambda(t) = \mathrm{sign}(t)(|t| - \lambda)_+$ is the soft-thresholding operator, $(R_{12})_i$ denotes the $i$th row of the matrix $R_{12}$, and $(R_1)_{i,-i}$ denotes the $i$th row of the matrix $R_1$ without the $i$th component, that is, $(R)_{i,-i} = (R_{i1}, \ldots, R_{i,i-1}, R_{i,i+1}, \ldots, R_{ip})$. The coordinate-descent algorithm proceeds by using the above formula to update one coordinate at a time until convergence to the global optimum is achieved. This convergence is guaranteed due to convexity of the objective function and separability of the penalty with respect to coordinates (Tseng, 1988).
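The coordinate update and the rescaling of Proposition 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the vector `c` stands for the product of the cross-correlation block with the fixed $w_2$, the function names and stopping rule are hypothetical, and the update divides by the diagonal entry, which equals one for a correlation matrix and then reduces to the formula above.

```python
import math


def soft_threshold(t, lam):
    """S_lam(t) = sign(t) * max(|t| - lam, 0)."""
    return math.copysign(max(abs(t) - lam, 0.0), t)


def solve_penalized(r1, c, lam, max_sweeps=500, tol=1e-12):
    """Coordinate descent for min_w 0.5 * w'R1 w - w'c + lam * ||w||_1."""
    p = len(c)
    w = [0.0] * p
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(p):
            partial = c[i] - sum(r1[i][j] * w[j] for j in range(p) if j != i)
            new_value = soft_threshold(partial, lam) / r1[i][i]
            largest_change = max(largest_change, abs(new_value - w[i]))
            w[i] = new_value
        if largest_change < tol:
            break
    return w


def rescale(r1, w):
    """Return w / sqrt(w'R1 w), or the zero vector if w is zero."""
    q = sum(w[i] * r1[i][j] * w[j] for i in range(len(w)) for j in range(len(w)))
    if q == 0.0:
        return [0.0] * len(w)
    return [v / math.sqrt(q) for v in w]


r1 = [[1.0, 0.5], [0.5, 1.0]]
c = [1.0, 0.2]
w_tilde = solve_penalized(r1, c, lam=0.1)
w_hat = rescale(r1, w_tilde)
w_zero = solve_penalized(r1, c, lam=2.0)  # lam >= max |c_i| gives the zero solution
```

In the full algorithm this solver would be called alternately for $w_1$ and $w_2$, each time with the other vector held fixed.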

Remark 4. Problem (6) allows an alternative interpretation of $\tilde R$. Using the definition of $\tilde R$ in (3), (6) can be written as

$$\underset{w_1}{\text{minimize}}\ \left[(1 - \rho)(1/2) w_1^\top \hat R_1 w_1 - (1 - \rho) w_1^\top \hat R_{12} w_2^{(t)} + \rho (1/2) w_1^\top w_1 + \lambda_1 \|w_1\|_1\right],$$

which is equivalent to using $\hat R$ with elastic net regularization rather than the lasso penalty (Zou & Hastie, 2005).

3·3. Selection of tuning parameters

Cross-validation is a popular approach to select the tuning parameter in LASSO. In our context, however, it amounts to performing a grid search over both $\lambda_1$ and $\lambda_2$. Moreover, splitting the data as in cross-validation leaves too few testing samples to construct the rank-based estimator of the latent correlation matrix. Instead, motivated by Wilms & Croux (2015), we propose to adapt the Bayesian information criterion to canonical correlation analysis to avoid splitting the data and to decrease computational costs.

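For the Gaussian linear regression model, the Bayesian information criterion (BIC) has the form

$$\mathrm{BIC} = -2l + \mathrm{df} \log n,$$

where df indicates the number of parameters in the model, and $l$ is the log-likelihood

$$l = \log L = -(n/2) \log \sigma^2 - \sum_{i=1}^n (y_i - X_i\beta)^2 / (2\sigma^2).$$

Two cases can be considered depending on whether the variance $\sigma^2$ is known or unknown.

1. If $\sigma^2$ is known, and the data are scaled so that $\sigma^2 = 1$, then

$$\mathrm{BIC} = n^{-1} \sum_{i=1}^n (y_i - X_i\hat\beta)^2 + \mathrm{df}_{\hat\beta} \log n / n.$$

2. If $\sigma^2$ is unknown, using $\hat\sigma^2_{\mathrm{MLE}} = n^{-1} \sum_{i=1}^n (y_i - X_i\beta)^2$ leads to

$$\mathrm{BIC} = n \log\left\{n^{-1} \sum_{i=1}^n (y_i - X_i\hat\beta)^2\right\} + \mathrm{df}_{\hat\beta} \log n.$$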

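Wilms & Croux (2015) use criterion 2 for canonical correlation analysis by substituting $\|X_1\hat w_1 - X_2 w_2\|_2^2/n$ for $\sum_{i=1}^n (y_i - X_i\hat\beta)^2/n$, with centered $X_1$ and $X_2$. Since $\|X_1\hat w_1 - X_2 w_2\|_2^2/n = \hat w_1^\top S_1 \hat w_1 - 2\hat w_1^\top S_{12} w_2 + w_2^\top S_2 w_2$, and we use $\tilde R$ instead of the sample covariance matrix $S$, we substitute

$$f(\hat w_1) = \hat w_1^\top \tilde R_1 \hat w_1 - 2\hat w_1^\top \tilde R_{12} w_2 + w_2^\top \tilde R_2 w_2$$

for the residual sum of squares. Furthermore, motivated by the performance of the adjusted degrees of freedom variance estimator in Reid et al. (2016), we also adjust $f(\hat w_1)$ for the second criterion, leading to

$$\mathrm{BIC}_1 = f(\hat w_1) + \mathrm{df}_{\hat w_1} \log n / n; \qquad \mathrm{BIC}_2 = \log\left\{\frac{n}{n - \mathrm{df}_{\hat w_1}} f(\hat w_1)\right\} + \mathrm{df}_{\hat w_1} \log n / n.$$

Here $\mathrm{df}_{\hat w_1}$ coincides with the size of the support of $\hat w_1$ (Tibshirani & Taylor, 2012). The BIC criteria for $w_2$ are defined analogously to those for $w_1$.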

We use both criteria in evaluating our approach. Given the selected criterion (either bic1 or bic2), we apply it sequentially at each step of biconvex optimization algorithm of Section 3·2, and each time select the tuning parameter corresponding to the smallest value of criterion.
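As a small worked illustration of the two criteria, the sketch below evaluates the residual substitute $f(\hat w_1)$ and both BIC values for given vectors and correlation blocks stored as nested lists. The function names and the toy inputs are hypothetical, and the degrees of freedom are taken as the number of nonzero entries of $\hat w_1$.

```python
import math


def quad_form(a, m, b):
    """Compute a' M b for vectors a, b and a matrix M given as nested lists."""
    return sum(a[i] * m[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def residual_f(w1, w2, r1, r12, r2):
    """f(w1) = w1'R1 w1 - 2 w1'R12 w2 + w2'R2 w2."""
    return quad_form(w1, r1, w1) - 2.0 * quad_form(w1, r12, w2) + quad_form(w2, r2, w2)


def support_size(w):
    """Degrees of freedom: number of nonzero coordinates."""
    return sum(1 for v in w if v != 0)


def bic1(f_value, df, n):
    """Known-variance form: f + df * log(n) / n."""
    return f_value + df * math.log(n) / n


def bic2(f_value, df, n):
    """Unknown-variance form with degrees-of-freedom adjustment:
    log{n / (n - df) * f} + df * log(n) / n."""
    return math.log(n / (n - df) * f_value) + df * math.log(n) / n


identity = [[1.0, 0.0], [0.0, 1.0]]
r12 = [[0.9, 0.0], [0.0, 0.0]]
w1, w2, n = [1.0, 0.0], [1.0, 0.0], 100
f_value = residual_f(w1, w2, identity, r12, identity)
df = support_size(w1)
value1, value2 = bic1(f_value, df, n), bic2(f_value, df, n)
```

In a tuning loop, these values would be computed for each candidate $\lambda_1$ and the candidate with the smallest criterion retained.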

4. Simulation Studies

In this section we evaluate the performance of the following methods: (i) Classical canonical correlation analysis based on the sample covariance matrix; (ii) Canonical ridge available in the R package CCA (González et al., 2008); (iii) Sparse canonical correlation analysis of Witten et al. (2009) available in the R package PMA; (iv) Sparse canonical correlation analysis via Kendall’s τ proposed in this paper. For our method, we evaluate both types of BIC criteria as described in Section 3·3.

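We generate $n = 100$ independent pairs $(Z_1, Z_2) \in \mathbb{R}^{p_1 + p_2}$ following

$$\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N\left\{\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_1 & \rho\,\Sigma_1 w_1 w_2^\top \Sigma_2 \\ \rho\,\Sigma_2 w_2 w_1^\top \Sigma_1 & \Sigma_2 \end{pmatrix}\right\}.$$

We consider two settings for the number of variables: low-dimensional ($p_1 = p_2 = 25$) and high-dimensional ($p_1 = p_2 = 100$). Each canonical vector $w_g$ ($g = 1, 2$) is defined by taking a vector of ones at the coordinates (1, 6, 11, 16, 21) and zeros elsewhere, and normalizing it such that $w_g^\top \Sigma_g w_g = 1$; a similar model is used in Chen et al. (2013). The value of the canonical correlation is set at $\rho = 0.9$. We use an autoregressive structure for $\Sigma_1 = \{\gamma^{|j-k|}\}_{j,k=1}^{p_1}$ and a block-diagonal structure for $\Sigma_2$:

$$\Sigma_2 = \begin{pmatrix} \Sigma_\gamma & 0 & \cdots & 0 \\ 0 & \Sigma_\gamma & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_\gamma \end{pmatrix},$$

where $\Sigma_\gamma \in \mathbb{R}^{d \times d}$ is an equicorrelation matrix with value 1 on the diagonal and $\gamma$ off the diagonal. We use five blocks of size $d \in \{3, 3, 6, 6, 7\}$ for the low-dimensional setting, and $d \in \{12, 14, 21, 25, 28\}$ for the high-dimensional setting. We set $\gamma = 0.7$ for both $\Sigma_1$ and $\Sigma_2$; similar results are obtained when the autoregressive structure is replaced with the identity matrix. We further randomly permute the order of variables in each $Z_g$ to remove the covariance-induced ordering.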
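The two covariance structures described above can be built as follows. This is a minimal sketch using nested lists; the function names are hypothetical, and the random permutation of variables and the sampling step are not shown.

```python
def autoregressive_matrix(p, gamma):
    """Return the p x p matrix with entries gamma ** |j - k|."""
    return [[gamma ** abs(j - k) for k in range(p)] for j in range(p)]


def block_equicorrelation_matrix(block_sizes, gamma):
    """Return a block-diagonal matrix whose blocks have 1 on the diagonal
    and gamma off the diagonal; entries outside the blocks are zero."""
    p = sum(block_sizes)
    matrix = [[0.0] * p for _ in range(p)]
    start = 0
    for d in block_sizes:
        for j in range(start, start + d):
            for k in range(start, start + d):
                matrix[j][k] = 1.0 if j == k else gamma
        start += d
    return matrix


gamma = 0.7
sigma1_low = autoregressive_matrix(25, gamma)
sigma2_low = block_equicorrelation_matrix([3, 3, 6, 6, 7], gamma)
sigma2_high = block_equicorrelation_matrix([12, 14, 21, 25, 28], gamma)
```

The block sizes sum to 25 and 100, matching the low- and high-dimensional settings.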

We consider transformations $U_g = f_g(Z_g + c)$ with $c$ being 0 or 1 with equal probability. The choice of $c$ allows us to vary the proportion of zero values in truncated and binary variables between 5% and 80%. We consider three choices for $f_g$: (copula 0) no transformation, $f_g(z) = z$ for $g = 1, 2$; (copula 1) exponential transformation for $U_1$, $f_1(z) = \exp(z)$, and no transformation for $U_2$, $f_2(z) = z$; (copula 2) exponential transformation for $U_1$, $f_1(z) = \exp(z)$, and cubic transformation for $U_2$, $f_2(z) = z^3$. Finally, we set $X_g$ equal to $U_g$ for the continuous variable type, and dichotomize $U_g$ at the value $C$ to form binary/truncated $X_g$. We set $C = 0$ for copula 0 and 1, and $C = 1.5$ for copula 2. For each case, we consider three combinations of variable types for $X_1/X_2$: truncated/truncated, truncated/continuous and truncated/binary.

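To compare the methods' performance, we evaluate the expected out-of-sample correlation

$$\hat\rho = \frac{\hat w_1^\top \Sigma_{12} \hat w_2}{(\hat w_1^\top \Sigma_1 \hat w_1)^{1/2} (\hat w_2^\top \Sigma_2 \hat w_2)^{1/2}}, \tag{7}$$

and the predictive loss

$$L(w_g, \hat w_g) = 1 - \frac{|\hat w_g^\top \Sigma_g w_g|}{(\hat w_g^\top \Sigma_g \hat w_g)^{1/2}} \quad (g = 1, 2); \tag{8}$$

a similar loss function is used in Gao et al. (2017). Since $w_g^\top \Sigma_g w_g = 1$, $L(w_g, \hat w_g) \in [0, 1]$ with $L(w_g, \hat w_g) = 0$ if $\hat w_g = w_g$. We also evaluate the variable selection performance using the selected model size, true-positive rate and true-negative rate, defined as

$$\mathrm{TPR}_g = \frac{\#\{j : \hat w_{gj} \ne 0 \text{ and } w_{gj} \ne 0\}}{\#\{j : w_{gj} \ne 0\}}, \qquad \mathrm{TNR}_g = \frac{\#\{j : \hat w_{gj} = 0 \text{ and } w_{gj} = 0\}}{\#\{j : w_{gj} = 0\}} \quad (g = 1, 2).$$

The results for the truncated/truncated case over 100 replications are presented in Figures 1–3; the results for other cases are qualitatively similar and deferred to Appendix B.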
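The evaluation metrics above are straightforward to compute. The sketch below implements the predictive loss (8) and the true-positive and true-negative rates for vectors stored as lists; the function names and the toy estimate are hypothetical, and the identity matrix is used in place of $\Sigma_g$ only to keep the example short.

```python
import math


def quad_form(a, m, b):
    """Compute a' M b for vectors a, b and a matrix M given as nested lists."""
    return sum(a[i] * m[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def predictive_loss(w_true, w_hat, sigma):
    """L(w, w_hat) = 1 - |w_hat' Sigma w| / (w_hat' Sigma w_hat)^{1/2},
    assuming w' Sigma w = 1 and w_hat is not the zero vector."""
    numerator = abs(quad_form(w_hat, sigma, w_true))
    return 1.0 - numerator / math.sqrt(quad_form(w_hat, sigma, w_hat))


def tpr_tnr(w_true, w_hat):
    """Return (true-positive rate, true-negative rate) of the estimated support."""
    pairs = list(zip(w_hat, w_true))
    true_pos = sum(1 for a, b in pairs if a != 0 and b != 0)
    true_neg = sum(1 for a, b in pairs if a == 0 and b == 0)
    positives = sum(1 for _, b in pairs if b != 0)
    negatives = sum(1 for _, b in pairs if b == 0)
    return true_pos / positives, true_neg / negatives


p = 25
support = [0, 5, 10, 15, 20]  # coordinates (1, 6, 11, 16, 21), zero-based
identity = [[1.0 if j == k else 0.0 for k in range(p)] for j in range(p)]
w_true = [1.0 / math.sqrt(5) if j in support else 0.0 for j in range(p)]

# A toy estimate that finds three true variables and one false one.
w_hat = [0.0] * p
for j in (0, 5, 10, 1):
    w_hat[j] = 0.5

tpr, tnr = tpr_tnr(w_true, w_hat)
loss = predictive_loss(w_true, w_hat, identity)
```

The loss does not depend on the scale of the estimate, so an estimate proportional to the truth attains zero loss.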

Fig. 1.


Top: The value of $\hat\rho$ from (7). The horizontal lines indicate the true canonical correlation value ρ = 0.9. Bottom: The value of predictive loss (8). Results over 100 replications. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either the BIC1 or BIC2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).

Fig. 3.


Selected model size over 100 replications. The horizontal lines indicate true model size 5. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either bic1 or bic2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).

From Figure 1, all methods perform better in the absence of data transformation (copula 0) than in cases where a transformation is applied (copula 1 and 2). Similarly, the performance deteriorates with increased dimensions, leading to smaller values of $\hat\rho$, larger predictive losses and worse true positive rates. The classical canonical correlation analysis performs especially poorly in high-dimensional settings, with $\hat\rho$ being very close to 0 and predictive loss being close to 1 for both $w_1$ and $w_2$. Canonical ridge works well in the copula 0 setting; however, its performance is strongly affected by the presence of transformations (copula 1 and 2). Witten’s method outperforms canonical ridge in the presence of transformations, but works worse than both variants of our approach. Overall, our method with BIC1 attains the highest values of $\hat\rho$ in low-dimensional settings, whereas BIC2 attains the highest values in high-dimensional settings. Unlike classical canonical correlation analysis and canonical ridge, both Witten’s method and our method perform variable selection. Unexpectedly, the number of selected variables varies significantly across replications for Witten’s method (Figure 3), leading to significant variations in true positive and true negative rates. In all cases BIC1 leads to the sparsest model and the highest true negative rate. On the other hand, since BIC1 sometimes misses true variables, especially in the high-dimensional settings, BIC2 shows more accurate values of $\hat\rho$ and smaller predictive loss (see Figure 1). In summary, BIC1 works better for variable selection, whereas BIC2 works better for prediction.

5. Application to TCGA Data

The Cancer Genome Atlas (TCGA) project collects data from multiple platforms using high-throughput sequencing technologies. We consider gene expression data (p1 = 891) and micro RNA data (p2 = 431) for n = 500 matched subjects from the TCGA BRCA database. We treat gene expression data as continuous and micro RNA data as truncated continuous. The proportion of zero values in each micro RNA variable ranges from 0 to 49.8%. The subjects belong to one of the 5 breast cancer subtypes: Normal, Basal, Her2, LumA and LumB, with 37 subjects having missing subtype information (denoted as NA). The goal of the analysis is to characterize the association between gene expression and micro RNA data, and investigate whether this association is relevant with respect to breast cancer subtypes.

To investigate the performance of our method relative to other approaches, we randomly split the data 100 times. Each time 400 samples are used for training, and the remaining 100 test samples are used to assess the found association via

$$\hat\rho_{\mathrm{test}} = \frac{\hat w_{1,\mathrm{train}}^\top \Sigma_{12,\mathrm{test}} \hat w_{2,\mathrm{train}}}{(\hat w_{1,\mathrm{train}}^\top \Sigma_{1,\mathrm{test}} \hat w_{1,\mathrm{train}})^{1/2} (\hat w_{2,\mathrm{train}}^\top \Sigma_{2,\mathrm{test}} \hat w_{2,\mathrm{train}})^{1/2}}.$$

Here $\Sigma_{\mathrm{test}}$ is evaluated based on the test samples, and is either the rank-based estimator $\tilde R$ (for our method) or the sample covariance matrix (for other methods). We also compare the number of selected genes and micro RNAs; the results are presented in Table 1.

Table 1.

Mean support sizes and values of $\hat\rho_{\mathrm{test}}$ over 100 random splits of breast cancer data; standard deviations are given in parentheses

Method        Selected genes     Selected micro RNAs   $\hat\rho_{\mathrm{test}}$
CCA           891 (0·00)         431 (0·00)            0·0219 (0·111)
RidgeCCA      891 (0·00)         431 (0·00)            0·704 (0·129)
WittenCCA     368·91 (195·38)    179·86 (100·95)       0·787 (0·0448)
KendallBIC1   83·73 (23·43)      6·11 (1·95)           0·888 (0·0438)
KendallBIC2   106·03 (10·86)     105·90 (10·20)        0·926 (0·231)

CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with either bic1 or bic2 criterion.

As expected, neither sample canonical correlation analysis nor the canonical ridge method performs variable selection. In addition, $\hat\rho_{\mathrm{test}}$ is very close to 0 for sample canonical correlation analysis, confirming the poor performance of the method. Canonical ridge leads to significantly higher values of $\hat\rho_{\mathrm{test}}$, demonstrating the advantage of added regularization; however, it still has smaller correlation values than the remaining approaches. The method of Witten et al. (2009) leads to higher correlation values than both sample canonical correlation analysis and canonical ridge; however, it still selects a large number of variables, with highly varied model sizes across replications. We suspect this is due to the use of a permutation-based algorithm for tuning parameter selection; similar behaviour is observed in Section 4. Finally, the values of $\hat\rho_{\mathrm{test}}$ are the highest for both variations of our method. At the same time, both variations result in the sparsest models, with the smallest variability in model size across replications. While the BIC2 criterion leads to the largest out-of-sample correlation value, the BIC1 criterion leads to the sparsest model. In light of these results and the results of Section 4, we conclude that BIC1 works well for variable selection, whereas BIC2 works well for prediction.

We further apply our method with the BIC1 criterion using the full set of n = 500 samples, leading to the selection of 64 genes and 8 micro RNAs. Figures 4 and 5 show heatmaps of the selected variables for each platform, with samples ordered by their respective cancer subtype. The heatmaps show clear separation between Basal and other subtypes, suggesting that the association found is relevant to cancer biology.

Fig. 4.


A heatmap of 64 genes selected by the proposed approach when using the BIC1 criterion. The dissimilarity measure is set as $1 - \tau^2$, with τ being the Kendall’s τ, and the Ward linkage is used.

Fig. 5.


A heatmap of 8 micro RNAs selected by the proposed approach when using the BIC1 criterion. The dissimilarity measure is set as $1 - \tau^2$, with τ being the Kendall’s τ, and the Ward linkage is used. Colors are assigned based on variable-specific quantiles.

Some of the selected genes and micro RNAs appear in recent literature that supports their association with breast cancer. For example, Xiao et al. (2018) identify hsa-miR-452-5p in the analysis of estrogen receptor subtypes of breast cancer, and Manvati et al. (2015) demonstrate negative correlation of hsa-miR-24-2 with both metastasis and increasing nodes in sporadic breast tumours. As for hsa-miR-135b, it is not only reported to be related to breast cancer cell growth (Aakula et al., 2015; Hua et al., 2016), but is also demonstrated to regulate the estrogen receptor α gene ESR1 (Aakula et al., 2015), which is itself among the 64 genes selected by our approach. Other selected genes that demonstrate association with breast cancer according to previous research are ERBB4, FOXA1, UGT2B15 and ELF5 (Kim et al., 2016; Hu et al., 2016; Piggin et al., 2016).

6. Discussion

One of the main contributions of this work is the proposed truncated Gaussian copula model for the zero-inflated data, and corresponding development of a rank-based estimator for the latent correlation matrix. While our focus is on canonical correlation analysis, the derived estimator can be used in conjunction with other covariance-based approaches, for example it can be used for constructing graphical models as in Fan et al. (2016) in cases where some or all of the variables have excess of zeroes. Micro RNA data is one example that we have explored in this work, however another prominent example is microbiome abundance data. It would be of interest to further explore the potential of our modeling approach in different application areas.

Supplementary Material

1

Fig. 2.

Top: True positive rate (TPR); Bottom: True negative rate (TNR). Results over 100 replications. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either the bic1 or the bic2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).
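
The two variable-selection metrics reported in these figures can be illustrated with a short Python sketch. It assumes the standard definitions (TPR is the fraction of truly nonzero coefficients that are selected, TNR is the fraction of truly zero coefficients that are left out); the function name `selection_rates`, the `tol` threshold and the toy estimate are hypothetical and are not taken from the paper.

```python
def selection_rates(estimate, true_support, tol=0.0):
    """Return (TPR, TNR) for one estimated canonical vector.

    estimate: sequence of estimated coefficients
    true_support: set of indices whose true coefficient is nonzero
    tol: a coefficient counts as selected when its absolute value exceeds tol
    """
    p = len(estimate)
    true_support = set(true_support)
    true_zero = set(range(p)) - true_support
    if not true_support or not true_zero:
        raise ValueError("need at least one true nonzero and one true zero")
    selected = {j for j, v in enumerate(estimate) if abs(v) > tol}
    tpr = len(selected & true_support) / len(true_support)
    tnr = len(true_zero - selected) / len(true_zero)
    return tpr, tnr


# Toy example (hypothetical): p = 25 variables, true model size 5.
p = 25
true_support = {0, 1, 2, 3, 4}
estimate = [0.0] * p
for j, v in [(0, 0.5), (1, -0.3), (2, 0.2), (7, 0.1)]:
    estimate[j] = v  # three true variables found, one false selection

tpr, tnr = selection_rates(estimate, true_support)
print(tpr, tnr)
```

In this toy case three of the five true variables are selected and one of the twenty null variables is selected in error.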

Fig. 6.

Top: The value of ρ̂ from (7). The horizontal lines indicate the true canonical correlation value ρ = 0.9. Bottom: The value of the predictive loss (8). Results over 100 replications. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either the bic1 or the bic2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).

Fig. 7.

Top: True positive rate (TPR); Bottom: True negative rate (TNR). Results over 100 replications. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either the bic1 or the bic2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).

Fig. 8.

Selected model size over 100 replications. The horizontal lines indicate true model size 5. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either bic1 or bic2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).

Fig. 9.

Top: The value of ρ̂ from (7). The horizontal lines indicate the true canonical correlation value ρ = 0.9. Bottom: The value of the predictive loss (8). Results over 100 replications. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either the bic1 or the bic2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).

Fig. 10.

Top: True positive rate (TPR); Bottom: True negative rate (TNR). Results over 100 replications. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either the bic1 or the bic2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).

Fig. 11.

Selected model size over 100 replications. The horizontal lines indicate true model size 5. CCA: Sample canonical correlation analysis; RidgeCCA: Canonical Ridge of González et al. (2008); WittenCCA: method of Witten et al. (2009); KendallBIC1, KendallBIC2: proposed method with tuning parameter selected using either bic1 or bic2 criterion; LD: low-dimensional setting (p1 = p2 = 25); HD: high-dimensional setting (p1 = p2 = 100).

Acknowledgements

Yoon’s research was funded by a grant from the National Cancer Institute (T32-CA090301). Carroll’s research was supported by a grant from the National Cancer Institute (U01-CA057030). Carroll is also Distinguished Professor, School of Mathematical and Physical Sciences, University of Technology Sydney, Broadway NSW 2007, Australia. Gaynanova’s research was supported by NSF grant DMS-1712943.

REFERENCES

  1. Aakula A, Leivonen S-K, Hintsanen P, Aittokallio T, Ceder Y, Børresen-Dale A-L, Perälä M, Östling P & Kallioniemi O (2015). Microrna-135b regulates erα, ar and hif1an and affects breast and prostate cancer cell growth. Molecular oncology 9, 1287–1300.
  2. Agniel D & Cai T (2017). Analysis of multiple diverse phenotypes via semiparametric canonical correlation analysis. Biometrics.
  3. Bach FR & Jordan MI (2005). A probabilistic interpretation of canonical correlation analysis.
  4. Chen M, Gao C, Ren Z & Zhou HH (2013). Sparse CCA via Precision Adjusted Iterative Thresholding. arXiv, 1311.6186v1.
  5. Chi EC, Allen GI, Zhou H, Kohannim O, Lange K & Thompson PM (2013). Imaging genetics via sparse canonical correlation analysis. In Biomedical Imaging (ISBI), 2013 IEEE 10th International Symposium on. IEEE.
  6. Cruz-Cano R & Lee M-LT (2014). Fast regularized canonical correlation analysis. Computational Statistics & Data Analysis 70, 88–100.
  7. Fan J, Liu H, Ning Y & Zou H (2016). High dimensional semiparametric latent graphical model for mixed data. Journal of the Royal Statistical Society, Series B.
  8. Gao C, Ma Z & Zhou HH (2017). Sparse CCA: Adaptive estimation and computational barriers. The Annals of Statistics 45, 2074–2101.
  9. González I, Déjean S, Martin PG & Baccini A (2008). CCA: An R package to extend canonical correlation analysis.
  10. Gorski J, Pfeuffer F & Klamroth K (2007). Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research 66, 373–407.
  11. Guo Y, Ding X, Liu C & Xue J-H (2016). Sufficient canonical correlation analysis. IEEE Transactions on Image Processing 25, 2610–2619.
  12. Hardoon DR, Szedmak S & Shawe-Taylor J (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation 16, 2639–2664.
  13. Hotelling H (1936). Relations between two sets of variates. Biometrika 28, 321–377.
  14. Hu DG, Selth LA, Tarulli GA, Meech R, Wijayakumara D, Chanawong A, Russell R, Caldas C, Robinson JL, Carroll JS et al. (2016). Androgen and estrogen receptors in breast cancer coregulate human udp-glucuronosyltransferases 2b15 and 2b17. Cancer research 76, 5881–5893.
  15. Hua K, Jin J, Zhao J, Song J, Song H, Li D, Maskey N, Zhao B, Wu C, Xu H et al. (2016). mir-135b, upregulated in breast cancer, promotes cell growth and disrupts the cell cycle by regulating lats2. International journal of oncology 48, 1997–2006.
  16. Kim J-Y, Jung HH, Do I-G, Bae S, Lee SK, Kim SW, Lee JE, Nam SJ, Ahn JS, Park YH et al. (2016). Prognostic value of erbb4 expression in patients with triple negative breast cancer. BMC cancer 16, 138.
  17. Liu H, Lafferty J & Wasserman L (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10, 2295–2328.
  18. Manvati S, Mangalhara KC, Kalaiarasan P, Srivastava N & Bamezai R (2015). mir-24-2 regulates genes in survival pathway and demonstrates potential in reducing cellular viability in combination with docetaxel. Gene 567, 217–224.
  19. Parkhomenko E, Tritchler D & Beyene J (2009). Sparse canonical correlation analysis with application to genomic data integration. Statistical applications in genetics and molecular biology 8, 1–34.
  20. Piggin CL, Roden DL, Gallego-Ortega D, Lee HJ, Oakes SR & Ormandy CJ (2016). ELF5 isoform expression is tissue-specific and significantly altered in cancer. Breast Cancer Research 18, 4.
  21. Plackett RL (1954). A Reduction Formula for Normal Multivariate Integrals. Biometrika 41, 351–360.
  22. Reid S, Tibshirani R & Friedman J (2016). A study of error variance estimation in lasso regression. Statistica Sinica, 35–67.
  23. Safo SE, Li S & Long Q (2018). Integrative analysis of transcriptomic and metabolomic data via sparse canonical correlation analysis with incorporation of biological information. Biometrics 74, 300–312.
  24. Tibshirani RJ (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Ser. B 58, 267–288.
  25. Tibshirani RJ & Taylor J (2012). Degrees of freedom in lasso problems. Annals of Statistics 40, 1198–1232.
  26. Tseng P (1988). Coordinate Ascent for Maximizing Nondifferentiable Concave Functions. Massachusetts Institute of Technology, Laboratory for Information and Decision Systems.
  27. Wilms I & Croux C (2015). Sparse canonical correlation analysis from a predictive point of view. Biometrical Journal 57, 834–851.
  28. Witten DM & Tibshirani RJ (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical applications in genetics and molecular biology 8, 1–27.
  29. Witten DM & Tibshirani RJ (2011). Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society, Ser. B 73, 753–772.
  30. Witten DM, Tibshirani RJ & Hastie T (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10, 515–534.
  31. Xiao B, Zhang W, Chen L, Hang J, Wang L, Zhang R, Liao Y, Chen J, Ma Q, Sun Z et al. (2018). Analysis of the mirna–mrna–lncrna network in human estrogen receptor-positive and estrogen receptor-negative breast cancer based on tcga data. Gene 658, 28–35.
  32. Zoh RS, Mallick B, Ivanov I, Baladandayuthapani V, Manyam G, Chapkin RS, Lampe JW & Carroll RJ (2016). PCAN: Probabilistic correlation analysis of two non-normal data sets. Biometrics, n/a–n/a.
  33. Zou H & Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.
