Abstract
We propose a new method for the simultaneous selection and estimation of multivariate sparse additive models with correlated errors. Our method, called Covariance Assisted Multivariate Penalized Additive Regression (CoMPAdRe), simultaneously selects among null, linear, and smooth non-linear effects for each predictor while incorporating joint estimation of the sparse residual structure among responses, with the motivation that accounting for inter-response correlation structure can lead to improved accuracy in variable selection and estimation efficiency. CoMPAdRe is constructed in a computationally efficient way that allows the selection and estimation of linear and non-linear covariate effects to be conducted in parallel across responses. Compared to single-response approaches that marginally select linear and non-linear covariate effects, we demonstrate in simulation studies that the joint multivariate modeling leads to gains in both estimation efficiency and selection accuracy, of greater magnitude in settings where signal is moderate relative to the level of noise. We apply our approach to protein-mRNA expression levels from multiple breast cancer pathways obtained from The Cancer Proteome Atlas and characterize both mRNA-protein associations and protein-protein subnetworks for each pathway. We find non-linear mRNA-protein associations for the Core Reactive, EMT, PI3K-AKT, and RTK pathways. Supplementary Materials are available online.
Keywords: Multivariate analysis, Multivariate regression, Non-convex optimization, Variable selection, Semi-parametric regression
1. Introduction
Additive models are a generalization of linear models in which a response is modeled as the sum of arbitrary smooth, non-linear functions of covariates (Hastie and Tibshirani 1986; Wood 2017). Likewise, multivariate linear regression generalizes classical linear regression to the setting where potentially correlated responses are regressed on a common set of predictors (Izenman 2013). Multivariate generalizations of the additive model are less common in the literature, and are the setting of interest in this article. To set notation, let $Y$ be an $n \times Q$ matrix in which the $Q$ responses are potentially correlated and have a common set of predictors in the $n \times p$ matrix $X$. Here, we model
$$y_{iq} = \sum_{j=1}^{p} f_{qj}(x_{ij}) + \epsilon_{iq}, \qquad i = 1, \dots, n, \quad q = 1, \dots, Q, \tag{1}$$
where the rows of the error matrix $E = (\epsilon_{iq})$ are independently and identically distributed $N_Q(0, \Omega^{-1})$, and $E[f_{qj}(x_j)] = 0$ is assumed for identifiability. Note that $f_{qj}$ is indexed by response $q$ and covariate $j$. Just as in the classical single-response setting of additive models, each response is represented by the sum of smooth, covariate-specific functions $f_{qj}$ for each $j = 1, \dots, p$. In contrast to traditional settings, however, the residual errors across the responses are correlated and related by the precision matrix $\Omega$.
In the single-response setting, a variety of approaches have been developed for variable selection in additive models. The majority of these methods have been $\ell_1$-type penalized regression procedures for selecting between null and non-linear predictor effects (Lin and Zhang 2006; Ravikumar et al. 2009; Huang et al. 2010). Work has also been developed in a Bayesian context (Scheipl et al. 2012) and for models whose estimated non-linear fits are piecewise constant with data-adaptive knots (Petersen et al. 2016). More recently, research has been extended to selection among null, linear, and non-linear predictor effects using variants of a group lasso penalty (Chouldechova and Hastie 2015; Lou et al. 2016; Petersen and Witten 2019). These approaches often require pre-selection of a hyperparameter to favor linear versus non-linear fits. Multi-step algorithmic approaches have been developed as an alternative that favors parsimonious, interpretable linear fits when they sufficiently explain the association between predictor and response, with the degree of preference towards linear fits controlled by a pre-specified hyperparameter (Tay and Tibshirani 2020).
There is also recent work on multivariate methods to perform simultaneous selection of regression coefficients and precision matrix elements to produce sparse solutions. For Gaussian multivariate linear regression, Rothman et al. (2010) and Yin and Li (2013) proposed joint penalties on regression coefficients and off-diagonal elements of the precision matrix. In a Bayesian paradigm, Bhadra and Mallick (2013), Ha et al. (2021) and Consonni et al. (2017) proposed hierarchical models with a hyper-inverse Wishart prior on the covariance matrix, and Deshpande et al. (2019) used optimization based on a multivariate spike-and-slab lasso (mSSL) prior to simultaneously select sparse sets of linear coefficients and precision elements.
There is some literature on methods for variable selection of non-linear functions in multivariate settings with non-independently distributed errors. Nandy et al. (2017) developed a method for the selection between null and smooth additive non-linear functions in the setting where errors are spatially dependent. The authors utilized the adaptive group lasso with an objective function that included a spatial weight matrix, establishing both selection consistency and convergence properties. In a multivariate context, Niu et al. (2020) developed a Bayesian variable selection procedure that, in the context of Gaussian graphical models, selected between null and smooth non-linear predictor effects for multiple responses while estimating the precision matrix among responses, provided that the precision matrix followed a decomposable graphical structure. This approach restricts all responses to have the same set of non-sparse predictors and does not distinguish between linear and non-linear predictor effects. To the best of our knowledge, the current literature does not encompass a multivariate approach that simultaneously selects among null, linear, and non-linear associations for each predictor-response pair while estimating an unconstrained dependence structure among responses.
To address this gap, we develop a computationally efficient solution to a sparse multivariate additive regression model, Covariance Assisted Multivariate Penalized Additive Regression (CoMPAdRe). CoMPAdRe employs a penalized spline basis representation (Demmler and Reinsch 1975) for each $f_{qj}$, and jointly selects among null, linear, and smooth non-linear predictor effects for each response while simultaneously estimating a sparse precision matrix among responses. Our method enables the selection and estimation of linear and non-linear predictor effects to be conducted in parallel across responses. Through simulation studies, we show that incorporating estimated residual structure into the selection and estimation of predictor effects leads to gains in selection accuracy and in statistical efficiency. We use CoMPAdRe to study the associations between mRNA and protein expression levels for 8 breast cancer pathways derived from The Cancer Proteome Atlas (TCPA) (Li et al. 2013), revealing several non-linear associations missed by other approaches. Software for our method, along with examples for implementation, can be found at https://github.com/nmd1994/ComPAdRe.
2. Methods
To begin this section, we refer back to (1) and re-establish some notation. Let $Y$ be an $n \times Q$ matrix with sample size $n$ and $Q$ possibly correlated responses. Similarly, let $X$ be an $n \times p$ matrix of covariates; we assume all covariates are continuous for ease of exposition, but $X$ may also contain discrete covariates. Then, we model $y_{iq} = \sum_{j=1}^{p} f_{qj}(x_{ij}) + \epsilon_{iq}$ where, in contrast to the classical single-response setting, the rows of $E$ are independently and identically distributed $N_Q(0, \Omega^{-1})$, and $E[f_{qj}(x_j)] = 0$ is assumed for identifiability.
Additive models can be written using a number of possible non-linear functional representations for each $f_{qj}$. Some common choices are Gaussian processes or smoothing splines (Cheng et al. 2019; Wood 2017). Our multivariate framework is based on the Demmler-Reinsch parameterization of smoothing splines or O'Sullivan penalized splines, which naturally partitions linear from non-linear components. In the next subsections, we introduce our notation and formulation, starting with spline representations, the univariate additive model, and our variable selection procedure, before fully specifying our multivariate additive model selection approach.
2.1. Additive Models with Penalized Splines
A smoothing spline is a bivariate smoother for a single ($Q = 1$) response $y$ and covariate $x$ whose objective is described as follows: $\min_f \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f''(x)^2\, dx$. Smoothness is determined by the tuning parameter $\lambda \ge 0$, which controls the balance between interpolation and having a small second derivative; $f$ is a linear function as $\lambda \to \infty$ (Wang 2011). The optimal solution to the smoothing spline problem can be obtained by solving a representation of $f$ determined by an ($n \times n$) basis matrix $B$ of splines with knots defined at the unique values of $x$. To further establish notation, let $\beta$ be an $n \times 1$ vector of coefficients; then $f(x) = B\beta$. Our objective is now $\min_\beta \|y - B\beta\|_2^2 + \lambda \beta^\top \Omega_B \beta$, where $[\Omega_B]_{jk} = \int B_j''(x) B_k''(x)\, dx$. This is a standard penalized least squares problem, whose solution is $\hat{\beta} = (B^\top B + \lambda \Omega_B)^{-1} B^\top y$ (Wang 2011; Hansen 2019). For computational efficiency one can use $K$ ($K < n$) knots defined at certain quantiles of $x$, in which case the result is called a penalized spline. We utilize O'Sullivan penalized splines in this article (Wand and Ormerod 2008).
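To make the estimator concrete, the following sketch computes the penalized least squares fit above. It uses a cubic truncated-power basis with a simple ridge penalty on the knot coefficients, a common textbook surrogate for the integrated second-derivative penalty; the function name and knot choices are ours for illustration, not the O'Sullivan construction used in the paper.

```python
import numpy as np

def penalized_spline_fit(x, y, knots, lam):
    """Minimal penalized-spline sketch: cubic truncated-power basis with a
    ridge penalty on the truncated terms (an illustrative surrogate for the
    exact second-derivative penalty used with O'Sullivan splines)."""
    # Design matrix: global cubic polynomial plus one truncated cubic per knot
    B = np.column_stack([x**d for d in range(4)] +
                        [np.clip(x - k, 0, None)**3 for k in knots])
    # Penalty matrix: leave the polynomial (unpenalized) part alone
    Om = np.zeros((B.shape[1], B.shape[1]))
    Om[4:, 4:] = np.eye(len(knots))
    # Penalized least squares solution (B'B + lam*Om)^{-1} B'y
    beta = np.linalg.solve(B.T @ B + lam * Om, B.T @ y)
    return B @ beta

# Example: smooth a noisy curve with knots at the deciles of x
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 200))
y = np.sin(3 * x) + 0.3 * rng.normal(size=200)
fit = penalized_spline_fit(x, y, np.quantile(x, np.linspace(0.1, 0.9, 9)), lam=1.0)
```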
Given the detailed description above, we define a penalized objective for an additive model in which each function $f_j$ has a spline representation. We see that
$$\underset{\beta_1, \dots, \beta_p}{\text{minimize}} \;\; \Big\|y - \sum_{j=1}^{p} B_j \beta_j\Big\|_2^2 + \sum_{j=1}^{p} \lambda_j\, \beta_j^\top \Omega_{B_j} \beta_j.$$
Conveniently, the optimal minimizer to this problem is an additive cubic spline model in which each $f_j$ is a cubic spline with knots defined at the unique values of each $x_j$ (Wang 2011; Wood 2017). Note that we assume $E[f_j(x_j)] = 0$ for all $j$ in order to ensure an identifiable solution. To represent each $f_j$, we next outline the Demmler-Reinsch representation before redefining our additive model objective.
To form the Demmler-Reinsch representation of a smoothing (or penalized) spline, which separates the linear and non-linear components of function $f$, we revisit the spline basis $B$. Specifically, we note that $B = UDV^\top$, where $U$ is an $n \times K$ orthogonal matrix, $D$ is a $K \times K$ diagonal matrix with all non-negative values, and $V$ is a $K \times K$ orthogonal matrix. Letting $\theta = DV^\top\beta$, we then rewrite $B\beta = U\theta$ and $\beta^\top \Omega_B \beta = \theta^\top D^{-1} V^\top \Omega_B V D^{-1} \theta$. Note that $D^{-1} V^\top \Omega_B V D^{-1}$ is a positive semi-definite matrix that can be diagonalized as $A \Lambda A^\top$, where $A$ is a $K \times K$ orthogonal matrix and $\Lambda$ is a diagonal matrix with decreasing non-negative elements whose last two diagonal entries are zero (Wood 2017; Hansen 2019). Given these representations, we now re-write $B\beta = Z\tilde{\theta}$, where $Z = UA$ is the Demmler-Reinsch basis with corresponding coefficient vector $\tilde{\theta} = A^\top\theta$, so that the smoothing penalty becomes the diagonal form $\tilde{\theta}^\top \Lambda \tilde{\theta}$. The final two columns of $Z$ correspond to the intercept and linear components of $f$, and the other columns of $Z$ correspond to non-linear components of $f$. The now-diagonal smoothing penalty leaves the 2 intercept and linear basis functions of $Z$ unpenalized and the remaining basis functions increasingly penalized by their complexity (i.e., the values along the diagonal of $\Lambda$) (Demmler and Reinsch 1975; Hansen 2019).
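The Demmler-Reinsch construction above is straightforward to compute numerically. The sketch below derives the basis $Z$ and penalty eigenvalues from any spline basis matrix `B` (assumed full column rank) and penalty matrix `Om`; the helper name and ordering conventions are ours.

```python
import numpy as np

def demmler_reinsch(B, Om):
    """Return (Z, lam) so that B @ beta = Z @ theta with orthonormal Z and
    penalty beta' Om beta = theta' diag(lam) theta."""
    U, d, Vt = np.linalg.svd(B, full_matrices=False)   # B = U D V'
    S = (Vt @ Om @ Vt.T) / np.outer(d, d)              # D^{-1} V' Om V D^{-1}
    evals, A = np.linalg.eigh(S)                       # ascending eigenvalues
    order = np.argsort(evals)[::-1]                    # decreasing; last ~ 0
    A, lam = A[:, order], np.clip(evals[order], 0.0, None)
    return U @ A, lam                                  # Z = U A
```

For a cubic smoothing penalty, the last two entries of `lam` are numerically zero, identifying the unpenalized intercept and linear columns of `Z`.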
To summarize, we can represent the objective for $f$ as follows. Let $Z^L, Z^N$ denote the linear and non-linear components of the basis functions $Z$, respectively, with corresponding coefficients $\theta^L, \theta^N$. Since $Z\tilde{\theta}$ is just an orthogonal rotation of $B\beta$, we can trivially replace $B\beta$ with $Z^L\theta^L + Z^N\theta^N$ and see:
$$\underset{\theta^L, \theta^N}{\text{minimize}} \;\; \|y - Z^L\theta^L - Z^N\theta^N\|_2^2 + \lambda\, \theta^{N\top} \Lambda^N \theta^N,$$
where $\Lambda^N$ contains the strictly positive diagonal entries of $\Lambda$, and can then rewrite our additive model objective with $p$ covariates as:
$$\underset{\{\theta_j^L, \theta_j^N\}_{j=1}^{p}}{\text{minimize}} \;\; \Big\|y - \sum_{j=1}^{p} \big(Z_j^L\theta_j^L + Z_j^N\theta_j^N\big)\Big\|_2^2 + \sum_{j=1}^{p} \lambda_j\, \theta_j^{N\top} \Lambda_j^N \theta_j^N.$$
To ensure identifiability constraints are satisfied, we omit the intercept basis function for each $f_j$ and fit one global intercept (Lee et al. 2018). We now conclude this subsection by briefly describing how we can fit additive models as a linear mixed model; this detail becomes relevant during the estimation portion of our framework.
Estimating Additive Models as a Linear Mixed Model:
One can represent any $f$ with an associated measure controlling 'wiggliness' (i.e., $\lambda \int f''(x)^2\, dx$) in terms of a mixed effects model by separating the linear and non-linear components of $f$, treating the linear components as fixed effects and the non-linear components as random effects (Wood 2017). Given the Demmler-Reinsch representation, we can then represent a smoothing spline or penalized spline as follows: $y = Z^L\theta^L + Z^N\theta^N + \epsilon$, where $\theta^N \sim N\big(0, \sigma^2\lambda^{-1}(\Lambda^N)^{-1}\big)$; $\Lambda^N$ represents a diagonal matrix of the positive non-zero values of $\Lambda$, and $\epsilon \sim N(0, \sigma^2 I)$. To incorporate an additive model into the mixed model framework, one can append the linear components $Z_j^L$ for each function to the matrix of fixed effects and the non-linear components $Z_j^N$ to the random effects design matrix, while ensuring identifiability by restricting to a global intercept (Wood 2017).
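In display form, one way to write the resulting additive mixed model (our rendering of the description above, with variance parameters $\tau_j^2$ introduced for illustration) is:
$$y = \mathbf{1}\beta_0 + \sum_{j=1}^{p} Z_j^L \theta_j^L + \sum_{j=1}^{p} Z_j^N b_j + \epsilon, \qquad b_j \sim N\big(0,\ \tau_j^2 (\Lambda_j^N)^{-1}\big), \quad \epsilon \sim N(0, \sigma^2 I),$$
so that estimating the variance components, for example by REML, implicitly selects the amount of smoothing through $\lambda_j = \sigma^2/\tau_j^2$.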
2.2. Variable Selection for Additive Models
Given that additive models consider multiple covariates whose dimension may be high, approaches that perform variable selection can greatly aid in producing interpretable, computationally feasible solutions. To perform variable selection for functions $f_1, \dots, f_p$, in this article we adapt the sparsity-smoothness penalty of Meier et al. (2009), which uses separate hyperparameters to control smoothness and selection in additive models. Adopting the above notation, the general form of the penalty is as follows:
$$J(f_j) = \lambda_1 \sqrt{\|f_j\|_n^2 + \lambda_2 \int f_j''(x)^2\, dx}.$$
The penalty has two hyperparameters: $\lambda_1$ controls the level of sparsity while $\lambda_2$ controls smoothness via the same second-derivative penalty utilized in smoothing splines. Using the spline basis formulation for each $f_j$ previously described, the method's objective is
$$\underset{\beta_1, \dots, \beta_p}{\text{minimize}} \;\; \Big\|y - \sum_{j=1}^{p} B_j\beta_j\Big\|_2^2 + \sum_{j=1}^{p} \lambda_1 \sqrt{\beta_j^\top B_j^\top B_j \beta_j + \lambda_2\, \beta_j^\top \Omega_{B_j} \beta_j}.$$
Computationally, this objective can be re-parameterized as a simple group lasso problem for fixed $\lambda_2$, ensuring that the smoothness of the functional fit is accounted for in the selection process (Meier et al. 2009).
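To make the group lasso connection explicit, the standard re-parameterization for fixed $\lambda_2$ proceeds as follows (a sketch in our notation): write
$$B_j^\top B_j + \lambda_2 \Omega_{B_j} = R_j^\top R_j \;\; \text{(Cholesky)}, \qquad \tilde{\beta}_j = R_j \beta_j, \qquad \lambda_1 \sqrt{\beta_j^\top \big(B_j^\top B_j + \lambda_2 \Omega_{B_j}\big)\beta_j} = \lambda_1 \|\tilde{\beta}_j\|_2,$$
so the problem in $\tilde{\beta}_j$, with transformed design matrices $B_j R_j^{-1}$, is an ordinary group lasso.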
To illustrate how we perform linear and non-linear predictor selection, we first present a single-response objective in this section before moving on to our full multivariate objective, where we induce sparsity in the precision matrix with $\ell_1$ penalization. More concretely, we utilize the Demmler-Reinsch basis to separate the linear and non-linear components of each $f_j$, penalizing linear basis functions with an $\ell_1$ penalty and non-linear basis functions with an adapted version of the sparsity-smoothness penalty outlined above (Meier et al. 2009). We now note the following objective:
$$\underset{\{\theta_j^L, \theta_j^N\}_{j=1}^{p}}{\text{minimize}} \;\; \Big\|y - \sum_{j=1}^{p} \big(Z_j^L\theta_j^L + Z_j^N\theta_j^N\big)\Big\|_2^2 + \lambda_L \sum_{j=1}^{p} |\theta_j^L| + \lambda_N \sum_{j=1}^{p} \sqrt{\|Z_j^N\theta_j^N\|_2^2 + \psi_j\, \theta_j^{N\top}\Lambda_j^N\theta_j^N}.$$
Linear selection is controlled by hyperparameter $\lambda_L$ while non-linear selection is controlled by $\lambda_N$. Hyperparameter $\psi_j$ controls the degree of smoothness for covariate $j$'s non-linear basis functions; note that we use $\Lambda_j^N$ because the first two diagonal components of $\Lambda_j$, corresponding to the linear basis functions, are zero (meaning linear components are left unsmoothed). This optimization problem produces a selection of a null, linear, or non-linear effect for each proposed covariate $x_j$, and in the univariate additive model setting we refer to this approach as Penalized Additive Regression, or PAdRe. In simulation studies, to show the benefit of multivariate modeling, we will use this as a comparator for our multivariate approach, CoMPAdRe, which we now describe.
2.3. Multivariate Objective and Model Fitting Procedure
Now transitioning to the multivariate additive model setting, suppose we observe an $n \times Q$ matrix $Y$ whose responses are potentially correlated and share a common predictor set $X$. We begin by identifying a joint likelihood for our model. Let $y_q = \sum_{j=1}^{p} \big(Z_j^L\theta_{qj}^L + Z_j^N\theta_{qj}^N\big) + E_q$ for $q = 1, \dots, Q$, where the rows of $E$ are independently and identically distributed $N_Q(0, \Omega^{-1})$. Extending to the $Q$ responses jointly, we concatenate notation and redefine $Y = Z^L\Theta^L + Z^N\Theta^N + E$, with the following negative log-likelihood (up to constants):
$$\ell(\Theta^L, \Theta^N, \Omega) = -\frac{n}{2}\log|\Omega| + \frac{1}{2}\,\mathrm{tr}\big\{(Y - Z^L\Theta^L - Z^N\Theta^N)\,\Omega\,(Y - Z^L\Theta^L - Z^N\Theta^N)^\top\big\},$$
where $Z^N$ is a concatenated matrix of non-linear basis functions in which each of the $p$ covariates contributes $K - 2$ non-linear basis functions.
We now introduce penalties to our negative log-likelihood, completing our objective. Along with the penalties outlined in the previous section, we include an $\ell_1$ penalty on the off-diagonal elements of the precision matrix $\Omega$, similar to that utilized in the graphical lasso (Friedman et al. 2008). Our final model objective is as follows:
$$\underset{\Theta^L, \Theta^N, \Omega}{\text{minimize}} \;\; \ell(\Theta^L, \Theta^N, \Omega) + \lambda_L \sum_{q=1}^{Q}\sum_{j=1}^{p} |\theta_{qj}^L| + \lambda_N \sum_{q=1}^{Q}\sum_{j=1}^{p} \sqrt{\|Z_j^N\theta_{qj}^N\|_2^2 + \psi_{qj}\, \theta_{qj}^{N\top}\Lambda_j^N\theta_{qj}^N} + \lambda_\Omega \sum_{q \ne q'} |\omega_{qq'}|. \tag{2}$$
To produce a solution to this non-convex objective, we break our problem down into a three-step procedure in which each step, conditioned on the others, reduces to a simpler convex optimization problem. Our model-fitting procedure also enables parallel computation across responses $q = 1, \dots, Q$, ensuring a computationally scalable solution. We next describe this procedure, and how model fitting is implemented, in more detail.
2.4. Model Fitting Procedure
Before outlining model fitting in Algorithm 1, we first describe a parameterization of the precision matrix $\Omega$ using a formulation based on the relationship between the precision matrix of a multivariate normal distribution and regression coefficients (Anderson 1984). Specifically, let the matrix $E_{-q}$ denote the errors other than $E_q$, the error term for response $q$. We can then represent $E_q$ in terms of the following regression: $E_q = E_{-q}\alpha_q + \delta_q$, where the coefficient $\alpha_q = -\omega_{qq}^{-1}\Omega_{-q,q}$ is a vector determined by the partial correlations of response $q$ with the other responses, and the error term $\delta_q \sim N(0, \omega_{qq}^{-1}I)$. Recent literature has extended this concept to variable selection for multi-layered Gaussian graphical models (mlGGM) (Ha et al. 2021). That work, in the context of multivariate regression, demonstrated that model fitting could be conducted as parallel single-response regressions.
Specifically, in our context we can rewrite each response $y_q$ of $Y$ as follows:
$$y_q = \sum_{j=1}^{p} \big(Z_j^L\theta_{qj}^L + Z_j^N\theta_{qj}^N\big) + \big(Y_{-q} - Z^L\Theta_{-q}^L - Z^N\Theta_{-q}^N\big)\alpha_q + \delta_q, \tag{3}$$
where the subscript $q$ denotes the column corresponding to response $q$ and $-q$ refers to the set of all columns other than the column indexing response $q$. Given estimates of the linear coefficients, non-linear coefficients, and precision matrix $\Omega$, this formulation enables updating the linear and non-linear selection and estimation steps as parallel single-response procedures, and reduces the selection of linear, non-linear, and precision components to simpler convex optimization problems when conditioned on previous estimates.
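As a quick numerical sanity check on this identity (not part of the authors' algorithm), one can simulate correlated errors and verify that the least squares regression of one error column on the rest recovers $\alpha_q = -\omega_{qq}^{-1}\Omega_{-q,q}$:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, n, q = 4, 100_000, 0
A = rng.normal(size=(Q, Q))
Omega = A @ A.T + Q * np.eye(Q)                   # a valid precision matrix
E = rng.multivariate_normal(np.zeros(Q), np.linalg.inv(Omega), size=n)

others = np.delete(np.arange(Q), q)
alpha_theory = -Omega[others, q] / Omega[q, q]    # -omega_qq^{-1} Omega_{-q,q}
alpha_ls, *_ = np.linalg.lstsq(E[:, others], E[:, q], rcond=None)
print(np.round(alpha_theory, 3), np.round(alpha_ls, 3))   # these should agree
```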
More specifically, Algorithm 1 breaks our overall model objective into three sequential steps given initial estimates of $\Theta^L$, $\Theta^N$, and $\Omega$. Rewriting each response of $Y$ as seen above in (3), we first condition on $(\Theta^N, \Omega)$ and update $\Theta^L$ via separate lasso procedures; selected coefficients are then re-estimated via OLS. Then, conditioning on $(\Theta^L, \Omega)$, we update the selection of $\Theta^N$ via separate group lasso procedures; both $\Theta^L$ and $\Theta^N$ are then re-estimated with separate linear mixed models. Finally, conditioning on $(\Theta^L, \Theta^N)$, we update $\Omega$ with the graphical lasso. At each step of the algorithm, all responses can be fit simultaneously (in parallel). We iterate among these three steps until the mean squared error of $Y$ converges within a pre-specified tolerance. All tuning parameters are selected via cross-validation; full details on how cross-validation is implemented are given at the end of supplemental section S.1. Smoothness hyperparameters $\psi_{qj}$ are pre-specified with generalized cross-validation (GCV) marginally for every covariate-response combination; pre-fixing every $\psi_{qj}$ enables the selection of $\Theta^N$ to be reduced to separate group lasso problems when conditioned on $\Theta^L$ and $\Omega$. We note that CoMPAdRe does not pre-specify a preference for linear vs. non-linear fits; both linear and non-linear selection are performed for every covariate-response combination.
Algorithm 1:
Covariance-Assisted Multivariate Sparse Additive Regression (CoMPAdRe)
Result: Mean components: linear $\hat{\Theta}^L$, non-linear $\hat{\Theta}^N$, and estimated precision $\hat{\Omega}$
Require: initial estimates $\hat{\Theta}^L$, $\hat{\Theta}^N$, and $\hat{\Omega}$;
Step 1: Linear Selection Update;
for $q = 1, \dots, Q$ do
    Set $\tilde{y}_q = y_q - Z^N\hat{\theta}_q^N - (Y_{-q} - Z^L\hat{\Theta}_{-q}^L - Z^N\hat{\Theta}_{-q}^N)\hat{\alpha}_q$;
    Solve $\hat{\theta}_q^L = \arg\min_{\theta_q^L} \|\tilde{y}_q - Z^L\theta_q^L\|_2^2 + \lambda_L \sum_{j} |\theta_{qj}^L|$;
end
Step 1.5: Linear Re-estimation with OLS;
Let $\tilde{Z}_q^L$, $\tilde{\theta}_q^L$ denote the subset of covariates and coefficients selected as non-zero for response $q$;
for $q = 1, \dots, Q$ do
    Re-estimate $\tilde{\theta}_q^L$ by OLS of $\tilde{y}_q$ on $\tilde{Z}_q^L$;
end
Step 2: Non-Linear Selection Update;
for $q = 1, \dots, Q$ do
    Set $\tilde{y}_q = y_q - Z^L\hat{\theta}_q^L - (Y_{-q} - Z^L\hat{\Theta}_{-q}^L - Z^N\hat{\Theta}_{-q}^N)\hat{\alpha}_q$;
    Solve $\hat{\theta}_q^N = \arg\min_{\theta_q^N} \big\|\tilde{y}_q - \sum_{j} Z_j^N\theta_{qj}^N\big\|_2^2 + \lambda_N \sum_{j} \sqrt{\|Z_j^N\theta_{qj}^N\|_2^2 + \psi_{qj}\,\theta_{qj}^{N\top}\Lambda_j^N\theta_{qj}^N}$;
end
Step 2.5: Linear and Non-Linear Coefficient Estimation with Mixed Effects Model;
for $q = 1, \dots, Q$ do
    Set $\tilde{y}_q = y_q - (Y_{-q} - Z^L\hat{\Theta}_{-q}^L - Z^N\hat{\Theta}_{-q}^N)\hat{\alpha}_q$;
    Let $\tilde{Z}_q^N$, $\tilde{\theta}_q^N$ denote the subset of basis functions and coefficients selected as non-zero and non-linear for response $q$;
    Update the estimation of $\tilde{\theta}_q^L$, $\tilde{\theta}_q^N$ with a linear mixed model;
end
Step 3: Selection of Precision Elements;
Let $\hat{E} = Y - Z^L\hat{\Theta}^L - Z^N\hat{\Theta}^N$;
Obtain an estimate of the empirical covariance $S = \hat{E}^\top\hat{E}/n$ and update the precision matrix $\hat{\Omega}$ using the graphical lasso (Friedman et al. 2008);
Stop when $\mathrm{MSE}_Y$ converges within tolerance $tol$.
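For concreteness, here is a minimal, hedged sketch of this block-coordinate loop in Python. The solver names come from scikit-learn, the update formula follows (3), and several details of the paper's procedure (the group lasso of Step 2, the OLS and mixed-model re-estimation Steps 1.5 and 2.5, and cross-validated tuning) are omitted or simplified; it illustrates the structure, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.covariance import graphical_lasso

def compadre_sketch(Y, ZL, ZN, lam_L, lam_O, tol=1e-4, max_iter=50):
    """Structural sketch of the three-step CoMPAdRe-style loop."""
    n, Q = Y.shape
    ThL = np.zeros((ZL.shape[1], Q))
    ThN = np.zeros((ZN.shape[1], Q))     # non-linear step left as-is for brevity
    Omega = np.eye(Q)
    mse_prev = np.inf
    for _ in range(max_iter):
        M = ZL @ ThL + ZN @ ThN                     # current mean fits
        for q in range(Q):                          # parallelizable across q
            # regression form (3): remove the correlated part of the error
            alpha = -Omega[:, q] / Omega[q, q]
            alpha[q] = 0.0
            y_adj = Y[:, q] - (Y - M) @ alpha
            # Step 1: linear selection via lasso (Step 2 would apply the
            # sparsity-smoothness group lasso to the non-linear basis ZN)
            r = y_adj - ZN @ ThN[:, q]
            ThL[:, q] = Lasso(alpha=lam_L, fit_intercept=False).fit(ZL, r).coef_
        # Step 3: precision update from the mean-model residuals
        E = Y - ZL @ ThL - ZN @ ThN
        _, Omega = graphical_lasso(E.T @ E / n, alpha=lam_O)
        mse = np.mean(E**2)
        if abs(mse_prev - mse) < tol:
            break
        mse_prev = mse
    return ThL, ThN, Omega
```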
3. Simulation Study
Simulation Design:
We assess CoMPAdRe's performance under settings with varying sample sizes, levels of residual dependence, and signal-to-noise ratios. We consider multiple sample sizes $n$ and numbers of responses $Q$, fix the number of covariates $p$ to one of two values, and use signal-to-noise ratios $\text{SNR} \in \{0.25, 0.5, 0.75, 1, 2\}$. To induce residual dependence, we specify a Toeplitz structure to generate covariance matrices (in this case correlation matrices): $\Sigma_{ij} = \rho^{|i-j|}$, where higher values of $\rho$ correspond to more highly dependent responses. We present results for the larger number of covariates in the main body of this manuscript and present results from all other settings in supplemental section S.1. This section of the supplement includes simulation settings with alternative covariance structures as well as additional details on how tuning parameters were selected for each method. Each covariate $x_j$, for $j = 1, \dots, p$, is generated using draws from a random uniform distribution: Unif(−1, 1). For non-null associations between a covariate and a response, we consider four types of non-linear functions as well as linear functions. A figure displaying the shapes of all functions, along with additional simulations assessing function-specific selection and estimation performance across all methods considered, is contained in supplemental section S.1.2.
For each simulated dataset, we generate the sparse set of predictors for each response as follows. We first randomly select 4 of the first 5 responses to have non-zero predictors. For each non-sparse response, we then randomly select 1–5 covariates to have any signal with that response. Each of these selected covariates is randomly assigned a function, with probability 0.125 for each of the four non-linear functions and 0.5 for a linear function. Each response is then generated as the sum of its assigned functions plus correlated Gaussian errors, with the error covariance scaled to achieve the specified signal-to-noise ratio. For each setting of sample size $n$, residual dependence level $\rho$, and signal-to-noise ratio considered, we simulate 50 datasets in the manner outlined above.
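A hedged sketch of this generator is given below. The specific non-linear shapes, as well as the values of `n`, `p`, `Q`, `rho`, and `snr`, are illustrative stand-ins; the paper's exact functional forms and settings appear in supplemental section S.1.2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, Q, rho, snr = 200, 10, 10, 0.7, 1.0   # illustrative values only

X = rng.uniform(-1, 1, size=(n, p))
Sigma = rho ** np.abs(np.subtract.outer(np.arange(Q), np.arange(Q)))  # Toeplitz

funcs = [np.sin,                     # stand-in non-linear shapes
         lambda x: x**2,
         np.tanh,
         lambda x: np.exp(x) - 1.0,
         lambda x: x]                # linear
probs = [0.125, 0.125, 0.125, 0.125, 0.5]

F = np.zeros((n, Q))
for q in rng.choice(5, size=4, replace=False):            # 4 of first 5 responses
    for j in rng.choice(p, size=rng.integers(1, 6), replace=False):
        F[:, q] += funcs[rng.choice(5, p=probs)](X[:, j])

E = rng.multivariate_normal(np.zeros(Q), Sigma, size=n)
Y = F + np.sqrt(max(F.var(), 1e-12) / snr) * E            # crude global SNR scaling
```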
Performance Assessment:
We compare CoMPAdRe to the following approaches: (1) PAdRe, (2) GAMSEL (Chouldechova and Hastie 2015), (3) the Lasso (Tibshirani 1996), and (4) mSSL (Deshpande et al. 2019). These approaches can be divided as follows: single-response approaches that marginally select linear and non-linear covariate associations for each response (1, 2), single-response approaches that marginally select linear covariate associations, ignoring the distinction between linear and non-linear fits (3), and multivariate approaches that simultaneously select linear associations and precision elements, ignoring the distinction between linear and non-linear fits (4). Note that GAMSEL has a user-selected parameter for favoring linear vs. non-linear fits; we used the approach's suggested default. All other tuning parameters across methods were selected via cross-validation. For selection accuracy, we report the true positive rate (TPR) and false positive rate (FPR) for null vs. non-null signal. For estimation accuracy, we present the ratio of mean absolute deviation (MAD) between CoMPAdRe and method (1), MAD(CoMPAdRe)/MAD(PAdRe), for estimating the true functions. Ratios of estimation accuracy between CoMPAdRe and the other methods considered can be seen in supplemental section S.1, where we also compare computational times across all approaches. After selection, the marginal approach (1) is re-estimated using a linear mixed effects model, as in CoMPAdRe.
Simulation Results:
Table 1 summarizes our key selection results, reporting the true positive rate (TPR) and false positive rate (FPR) for all five methods at different levels of residual dependence $\rho$ and signal-to-noise ratio (SNR). A comprehensive breakdown of selection results by type of function (linear vs. non-linear) can be seen in supplemental section S.1. Similarly, Figure 1 visualizes key estimation accuracy results across levels of $\rho$ and SNR for estimating the overall true signal.
Table 1.
Summary of simulation results for the larger number of covariates $p$. Results are divided into true positive rate (TPR) and false positive rate (FPR), expressed as percentages, for four levels of residual dependence, ordered from highest ($\rho_1$) to lowest ($\rho_4$), and signal-to-noise ratio (SNR). Signal is divided into null vs. non-null, and results are presented as the median with the interquartile range in parentheses. Fifty datasets were simulated for each setting, with the sample size and number of responses held fixed.
| $\rho$ | SNR | TPR: CoMPAdRe | TPR: PAdRe | TPR: GAMSEL | TPR: mSSL | TPR: Lasso | FPR: CoMPAdRe | FPR: PAdRe | FPR: GAMSEL | FPR: mSSL | FPR: Lasso |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $\rho_1$ | 0.25 | 77.0 (21.7) | 11.1 (20.0) | 11.1 (24.3) | 27.0 (21.3) | 11.1 (20.0) | 0.2 (0.5) | < 0.1 (0) | < 0.1 (0) | < 0.1 (0) | < 0.1 (0) |
| $\rho_1$ | 0.5 | 87.5 (10.2) | 40.8 (37.1) | 55.8 (34.5) | 71.4 (26.7) | 40.8 (34.7) | 0.3 (0.6) | < 0.1 (0.2) | 0.7 (0.9) | < 0.1 (0) | < 0.1 (0.2) |
| $\rho_1$ | 0.75 | 93.3 (12.2) | 69.6 (30.7) | 83.3 (19.3) | 73.0 (22.9) | 64.0 (31.6) | 0.4 (0.6) | 0.1 (0.2) | 1.5 (1.9) | < 0.1 (0) | 0.1 (0.2) |
| $\rho_1$ | 1 | 92.9 (13.3) | 76.9 (16.5) | 91.7 (9.10) | 75.0 (20.7) | 66.7 (13.5) | 0.2 (0.4) | 0.1 (0.2) | 1.5 (1.1) | < 0.1 (0) | 0.1 (0.2) |
| $\rho_1$ | 2 | 97.1 (8.20) | 85.1 (12.9) | > 99.8 (0) | 77.8 (15.9) | 72.7 (19.0) | 0.4 (1.0) | 0.1 (0.2) | 2.5 (1.9) | < 0.1 (0) | 0.1 (0.2) |
| $\rho_2$ | 0.25 | 33.3 (20.8) | 11.1 (24.5) | 10.6 (25.0) | 11.1 (12.3) | 11.1 (24.5) | 0.1 (0.3) | < 0.1 (0) | < 0.1 (0.1) | < 0.1 (0) | < 0.1 (0) |
| $\rho_2$ | 0.5 | 83.3 (20.0) | 44.2 (34.6) | 54.5 (23.8) | 55.8 (28.1) | 43.3 (33.3) | 0.3 (0.5) | < 0.1 (0.2) | 0.6 (1.0) | < 0.1 (0) | < 0.1 (0.2) |
| $\rho_2$ | 0.75 | 84.6 (13.7) | 60.0 (27.1) | 83.3 (17.5) | 63.6 (22.5) | 56.3 (22.9) | 0.2 (0.3) | 0.1 (0.2) | 1.5 (1.7) | < 0.1 (0) | 0.1 (0.2) |
| $\rho_2$ | 1 | 90.5 (15.1) | 72.4 (16.3) | 90.0 (18.2) | 66.7 (19.4) | 63.1 (22.0) | 0.2 (0.4) | 0.1 (0.2) | 1.3 (1.4) | < 0.1 (0) | 0.1 (0.2) |
| $\rho_2$ | 2 | 94.1 (10.8) | 84.0 (15.7) | > 99.8 (0) | 73.0 (26.2) | 72.1 (24.8) | 0.3 (0.6) | < 0.1 (0.3) | 2.8 (1.8) | < 0.1 (0) | < 0.1 (0.3) |
| $\rho_3$ | 0.25 | 15.4 (21.5) | 10.6 (19.7) | 12.5 (17.1) | 11.1 (12.7) | 10.6 (19.7) | < 0.1 (0.3) | < 0.1 (0) | < 0.1 (0) | < 0.1 (0) | < 0.1 (0) |
| $\rho_3$ | 0.5 | 72.7 (16.7) | 44.9 (26.3) | 56.9 (23.3) | 33.3 (19.4) | 44.4 (27.8) | 0.3 (0.1) | 0.1 (0.2) | 0.6 (1.0) | < 0.1 (0) | 0.1 (0.2) |
| $\rho_3$ | 0.75 | 84.0 (19.1) | 64.2 (29.8) | 81.8 (22.2) | 65.2 (25.1) | 61.5 (29.1) | 0.2 (0.2) | 0.1 (0.3) | 1.6 (1.6) | < 0.1 (0) | 0.1 (0.3) |
| $\rho_3$ | 1 | 88.9 (15.4) | 78.2 (23.9) | 89.4 (20.0) | 66.7 (28.4) | 69.2 (23.5) | 0.2 (0.3) | < 0.1 (0.2) | 1.5 (1.5) | < 0.1 (0) | < 0.1 (0.2) |
| $\rho_3$ | 2 | 92.9 (14.0) | 84.0 (17.2) | > 99.8 (0) | 72.7 (15.7) | 75.0 (19.0) | 0.2 (0.4) | 0.1 (0.1) | 3.0 (2.5) | < 0.1 (0) | 0.1 (0.1) |
| $\rho_4$ | 0.25 | 16.2 (24.3) | 12.5 (19.5) | 19.1 (19.8) | 11.1 (12.2) | 12.5 (19.5) | < 0.1 (0.1) | < 0.1 (0) | < 0.1 (0.1) | < 0.1 (0) | < 0.1 (0) |
| $\rho_4$ | 0.5 | 57.3 (20.2) | 36.9 (27.5) | 64.3 (23.2) | 23.1 (20.8) | 36.9 (25.0) | 0.1 (0.3) | < 0.1 (0.2) | 0.8 (1.3) | < 0.1 (0) | < 0.1 (0.2) |
| $\rho_4$ | 0.75 | 84.6 (17.3) | 70.7 (28.9) | 81.5 (21.5) | 57.1 (18.6) | 65.5 (27.5) | 0.2 (0.4) | < 0.1 (0.2) | 1.3 (1.4) | < 0.1 (0) | < 0.1 (0.2) |
| $\rho_4$ | 1 | 83.3 (17.9) | 72.1 (23.9) | 90.0 (20.0) | 64.5 (26.1) | 63.1 (28.0) | 0.2 (0.4) | 0.1 (0.2) | 1.5 (1.1) | < 0.1 (0) | 0.1 (0.2) |
| $\rho_4$ | 2 | 93.3 (11.1) | 84.6 (20.2) | > 99.8 (0) | 72.1 (21.8) | 72.7 (21.3) | 0.2 (0.6) | 0.1 (0.2) | 2.4 (1.6) | < 0.1 (0) | 0.1 (0.2) |
Fig. 1.
Boxplots of estimation accuracy results for non-linear signal. Midpoint lines represent the median ratio of mean absolute deviation (MAD), MAD(CoMPAdRe)/MAD(PAdRe), across 50 simulated datasets per setting. Settings considered vary based on signal-to-noise ratio (SNR) and level of residual dependence $\rho$.
At moderate to high levels of residual dependence, CoMPAdRe consistently demonstrates superior sensitivity for selecting significant covariates relative to the other approaches while maintaining favorable false positive rates. The relative improvement of CoMPAdRe over PAdRe, an equivalent method in every way except that it ignores the between-response correlation, demonstrates that the joint modeling improves variable selection. This difference was strongest at low signal-to-noise ratios ($\text{SNR} \le 0.5$). For example, at the highest level of residual dependence with $\text{SNR} = 0.25$, CoMPAdRe had a substantially higher median true positive rate than its competitors (CoMPAdRe = 0.770, PAdRe = 0.111, GAMSEL = 0.111, mSSL = 0.270, Lasso = 0.111). CoMPAdRe had the highest sensitivity in all settings except the highest signal-to-noise ratio ($\text{SNR} = 2$), for which GAMSEL had slightly higher sensitivity, though both GAMSEL and CoMPAdRe were near 1. GAMSEL, however, consistently reported a false positive rate approximately 10-fold higher than CoMPAdRe at higher signal-to-noise ratios across all levels of residual dependence. This pattern persisted across all settings considered. For example, at $\text{SNR} = 2$, GAMSEL showed an average median false positive rate of 2.7% while CoMPAdRe showed an average median false positive rate of 0.3%, averaging across all levels of residual dependence.
At lower levels of residual dependence, CoMPAdRe still generally exhibited superior sensitivity relative to competitors. The contrast between true positive rates at lower levels of signal-to-noise ($\text{SNR} \le 0.5$) was less stark than that seen at higher levels of residual dependence, across all methods. GAMSEL again had slightly higher sensitivity than CoMPAdRe at $\text{SNR} = 1$ and $\text{SNR} = 2$, but again accompanied by a higher FPR. From additional simulations conducted to assess function-specific selection and estimation performance (supplemental section S.1.2), we see that the linear selection approaches (Lasso and mSSL) consistently failed to select two of the non-linear functions, even at high signal-to-noise ratios, while CoMPAdRe, despite allowing for potential non-linear associations, did not appear to lose sensitivity for selecting covariates with linear associations.
In terms of estimation accuracy, CoMPAdRe outperformed method (1) when estimating the overall signal across both $\rho$ and SNR. In each signal-to-noise setting, the improved performance of CoMPAdRe relative to method (1) increased as the level of residual dependence increased; this is evidenced by the downward trend in Figure 1 at each level of SNR. This trend persisted across all settings considered (see S.1). The only settings where CoMPAdRe did not show notable gains in estimation accuracy occurred at the lowest signal-to-noise ratios combined with the lowest levels of residual dependence, where the MAD ratio remained centered around 1. These results demonstrate the strong benefit in estimation accuracy from the joint modeling when strong inter-response correlation structure is present, without substantial tradeoff when the responses have low levels of correlation. CoMPAdRe displayed similar performance relative to the other approaches considered, with improvements in estimation accuracy increasing with higher levels of inter-response correlation. From additional function-specific simulations, CoMPAdRe demonstrated superior estimation accuracy relative to the linear selection approaches, with results most evident for the non-linear functions. Given these simulation results showing the benefits of our approach, we now apply CoMPAdRe to protein-mRNA expression data obtained from The Cancer Proteome Atlas (TCPA).
Fig. 2.
Protein-Protein covariance networks for PI3K-AKT, EMT, DNA Damage, and Breast Reactive pathways. Blue edges indicate negative associations while red edges indicate positive associations. Edge thickness indicates the magnitude of the dependence between two corresponding proteins and node size is scaled relative to the strength and number of connections for a protein.
4. Analyses of Proteogenomics data in Breast Cancer
We applied CoMPAdRe to a proteogenomics dataset containing both protein and mRNA expression levels for 8 known breast cancer pathways obtained from The Cancer Proteome Atlas (TCPA). The central dogma of molecular biology states that genetic information is primarily passed from an individual's DNA to the production of mRNA (a process called transcription) before the mRNA carries that information to a cell's ribosomes to construct proteins (a process called translation), which carry out the body's biological functions (Crick 1970). Furthermore, proteins are known to carry out biological functions in coordinated networks (De Las Rivas and Fontanillo 2010; Garcia et al. 2012). Here we consider a subset of TCPA data consisting of mRNA and protein data, for a common set of subjects, from 8 cancer-relevant biological pathways. For each pathway, we have data on 3–11 proteins and 3–11 mRNA transcripts. Our objective is three-fold: (1) to find which mRNAs are predictive of protein expression in a particular pathway, (2) to assess the shape of each relationship, whether linear or non-linear, and (3) to estimate the protein-protein networks accounting for mRNA expression. We accomplish this by applying CoMPAdRe to each pathway, treating the proteins as responses and the transcripts as covariates. The joint approach gives us estimates of the protein-protein network and, as demonstrated by our simulations, we expect this joint modeling to result in improved detection of mRNA transcripts predictive of protein abundance relative to modeling the proteins independently.
More information about each pathway can be found in Akbani et al. (2014) and Ha et al. (2018). As an initial step, covariates (mRNA expression levels) were centered and scaled by their mean and standard deviation. Just as in our simulation studies, we then formed B-spline bases by taking knots at the deciles of each covariate considered within a pathway.
Biological Interpretations:
Table 2 summarizes mRNA selection results for all pathways analyzed. A comprehensive table of mRNA selection results is contained in supplemental section S.2, along with visualizations of the shapes of selected non-linear mRNA–protein associations. Among non-linear associations, we note that CDH1–β-catenin and CDH1–E-cadherin were selected for the Core Reactive pathway, CDH1–E-cadherin for the EMT pathway, INPP4B–INPP4B for the PI3K-AKT pathway, and ERBB2–HER2PY1248 for the RTK pathway. We consistently found the same pattern for all non-linear mRNA–protein associations: protein expression increases with mRNA expression until seemingly hitting a plateau. Protein expression has been found both to be positively correlated with mRNA expression and to often plateau at high expression levels because of a suspected saturation of ribosomes, which would limit translation (Liu et al. 2016; van Asbeck et al. 2021). Therefore, the identified non-linear mRNA–protein functional associations may provide insight into 'saturation points' beyond which increased mRNA expression ceases to result in an increase in protein abundance.
Table 2.
A summary of mRNA selection results for each pathway, divided into linear and non-linear associations. Names on the left side of the dash indicate mRNA biomarkers, while those on the right side indicate the corresponding protein. Lines ending in '...' indicate the presence of one or more associations beyond those listed.
| Pathway | Linear mRNA | Non-linear mRNA |
|---|---|---|
| Breast Reactive | GAPDH-GAPDH | - |
| Core Reactive | - | CDH1-β-catenin, ... |
| DNA damage | RAD50-RAD50, MRE11A-RAD50, ... | - |
| EMT | CDH1-β-catenin | CDH1-E-cadherin |
| PI3K-AKT | PTEN-PTEN, CDKN1B-AKTPT308, ... | INPP4B-INPP4B |
| RAS-MAPK | YBX1-JNKPT183Y185, YBX1-YB1PS102 | - |
| RTK | ERBB2-EGFRPY1068, EGFR-EGFRPY1068 | ERBB2-HER2PY1248 |
| TSC-mTOR | EIF4EBP1-X4EBP1PS65, ... | - |
Pathway specific protein-protein networks:
Figure 2 displays protein-protein networks for the pathways detailed in the mRNA selection results. Visualizations of protein-protein networks for all other pathways are detailed in supplemental section S.2. The most highly interconnected proteins, which we call hub nodes, include MYH11, E-cadherin, β-catenin, and EGFR. In particular, MYH11 showed strong associations with other proteins in the Breast Reactive pathway. MYH11 is a smooth muscle myosin heavy chain protein, which plays an essential role in cell movement and the transport of materials within and between cells (Brownstein et al. 2018). While its exact function in breast cancer is not fully understood, MYH11 has been found to be downregulated in breast cancer tissues and to be critical to empirically constructed signatures for survival prognosis in breast cancer (Zhu et al. 2020). Our results indicate that the role of MYH11 in breast cancer may be better understood through the biomarkers with which it was found to be highly associated (Caveolin-1, RBM15, GAPDH). We also note that previous literature has shown expression levels of E-cadherin and β-catenin to be strongly correlated with each other; reduced levels of both proteins are associated with poor survival prognosis in triple-negative breast cancer (Shen et al. 2016).
5. Discussion
In this article, we introduced CoMPAdRe, a framework for simultaneous variable selection and estimation of sparse residual precision matrices in multivariate additive models. The approach obtains a sparse estimate of the inter-response precision matrix and utilizes this structure to borrow strength across responses when identifying significant covariates for each response, determining whether each selected covariate effect is linear or non-linear. The estimated residual precision matrix is itself a quantity of scientific interest in many settings, including the protein-protein network modeling of our motivating example. Importantly, our approach allows different covariates to be selected for different responses, and the same covariate to have different functional relationships with different responses. The fitting procedure utilizes a regression approach to estimate and account for the residual correlation, simplifying the joint multivariate modeling into a series of single-response models to improve scalability and computational efficiency. We also note that the univariate special case of our method, which we call PAdRe, is a useful tool for linear and non-linear variable selection in the univariate additive model setting as well.
We empirically demonstrated that CoMPAdRe achieves superior overall variable selection accuracy and statistical efficiency relative to marginal competitors that model each response independently, with the benefit more pronounced as the inter-response correlation increases or the signal-to-noise level decreases. This demonstrates the benefit of joint multivariate modeling, with the inherent borrowing of strength across responses resulting in not just better estimation accuracy, but also improved variable selection and determination of the linearity or non-linearity of the effects. The improved estimation accuracy was expected based on the principles of seemingly unrelated regression (SUR; Zellner 1963), given that the sparsity penalty on covariates implies different covariate sets for each response. However, our results suggest that the joint modeling can substantially improve variable selection accuracy as well; future theoretical investigation to evaluate and validate this result would be of interest. Further, the improved performance at lower signal-to-noise ratios demonstrates that the efficiency gained by borrowing strength across responses may be especially important for detecting subtle signals or operating in the presence of higher noise levels. Our simulations also demonstrate that, for covariates with non-linear associations with responses, variable selection methods based on linear regression will tend to miss these variables, especially for certain non-linear functional shapes. Thus, the variable selection framework we introduce enables identification of important predictive variables even when they have highly non-linear associations with responses.
While the CoMPAdRe method introduced here assumes an unstructured inter-response correlation structure, it would be relatively straightforward to adapt it to incorporate structured covariance matrices, for example if the responses are observed on some known temporal or spatial grid, or if some decomposable graph structure is known beforehand. This extension could bring the enhanced variable selection and estimation accuracy inherent to the joint modeling while accounting for known structure among the responses; we leave this for future extensions. CoMPAdRe can also be modified to perform linear selection in a group-wise manner, which may aid interpretability in certain applications. This paper focused on variable selection and estimation, but it would also be of interest to obtain inferential quantities, including hypothesis tests for significant covariates, confidence measures for each selected variable, and confidence bands for estimated regression functions and precision elements; we also leave this to future work. Finally, while, as described in the supplement, we have semi-automatic methods to estimate the tuning parameters that seem to work well, more rigorous and automatic approaches for selecting smoothing parameters in this setting may improve performance further and will be investigated in the future. Software related to this article can be found at https://github.com/nmd1994/ComPAdRe along with examples for implementation.
6. Supplementary Materials
Supplemental File:
Supplementary materials online contain a document that includes the following: additional results from the simulations shown in the main body of the manuscript (including those with a smaller number of covariates), new results from simulations with alternative residual structure, new results assessing function-specific performance, new results assessing runtime and convergence properties of our algorithm, and additional results for the proteomics application analyzed in the main body of the manuscript (Supplementary_Materials.pdf).
References
- Akbani R, Ng PKS, Werner HM, Shahmoradgoli M, Zhang F, Ju Z, Liu W, Yang J-Y, Yoshihara K, Li J. et al. (2014), 'A pan-cancer proteomic perspective on the cancer genome atlas', Nature Communications 5(1), 3887.
- Anderson T. (1984), An Introduction to Multivariate Statistical Analysis.
- Bhadra A. and Mallick BK (2013), 'Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis', Biometrics 69(2), 447–457.
- Brownstein A, Ziganshin B. and Elefteriades J. (2018), 'Genetic disorders of the vasculature'.
- Cheng L, Ramchandran S, Vatanen T, Lietzén N, Lahesmaa R, Vehtari A. and Lähdesmäki H. (2019), 'An additive Gaussian process regression model for interpretable non-parametric analysis of longitudinal data', Nature Communications 10(1), 1798.
- Chouldechova A. and Hastie T. (2015), 'Generalized additive model selection', arXiv preprint arXiv:1506.03850.
- Consonni G, La Rocca L. and Peluso S. (2017), 'Objective Bayes covariate-adjusted sparse graphical model selection', Scandinavian Journal of Statistics 44(3), 741–764.
- Crick F. (1970), 'Central dogma of molecular biology', Nature 227(5258), 561–563.
- De Las Rivas J. and Fontanillo C. (2010), 'Protein–protein interactions essentials: key concepts to building and analyzing interactome networks', PLoS Computational Biology 6(6), e1000807.
- Demmler A. and Reinsch C. (1975), 'Oscillation matrices with spline smoothing', Numerische Mathematik 24(5), 375–382.
- Deshpande SK, Ročková V. and George EI (2019), 'Simultaneous variable and covariance selection with the multivariate spike-and-slab lasso', Journal of Computational and Graphical Statistics 28(4), 921–931.
- Friedman J, Hastie T. and Tibshirani R. (2008), 'Sparse inverse covariance estimation with the graphical lasso', Biostatistics 9(3), 432–441.
- Garcia J, Bonet J, Guney E, Fornes O, Planas J. and Oliva B. (2012), 'Networks of protein protein interactions: From uncertainty to molecular details', Molecular Informatics 31(5), 342–362.
- Ha MJ, Banerjee S, Akbani R, Liang H, Mills GB, Do K-A and Baladandayuthapani V. (2018), 'Personalized integrated network modeling of the cancer proteome atlas', Scientific Reports 8(1), 1–14.
- Ha MJ, Stingo FC and Baladandayuthapani V. (2021), 'Bayesian structure learning in multilayered genomic networks', Journal of the American Statistical Association 116(534), 605–618.
- Hansen NR (2019), Computational Statistics with R, cswr.nrhstat.org.
- Hastie T. and Tibshirani R. (1986), 'Generalized additive models', Statistical Science 1(3), 297–310.
- Huang J, Horowitz JL and Wei F. (2010), 'Variable selection in nonparametric additive models', Annals of Statistics 38(4), 2282–2313.
- Izenman AJ (2013), Multivariate regression, in 'Modern Multivariate Statistical Techniques', Springer, pp. 159–194.
- Lee W, Miranda MF, Rausch P, Baladandayuthapani V, Fazio M, Downs JC and Morris JS (2018), 'Bayesian semiparametric functional mixed models for serially correlated functional data, with application to glaucoma data', Journal of the American Statistical Association.
- Li J, Lu Y, Akbani R, Ju Z, Roebuck PL, Liu W, Yang J-Y, Broom BM, Verhaak RG, Kane DW et al. (2013), 'TCPA: a resource for cancer functional proteomics data', Nature Methods 10(11), 1046–1047.
- Lin Y. and Zhang HH (2006), 'Component selection and smoothing in multivariate nonparametric regression', The Annals of Statistics 34(5), 2272–2297.
- Liu Y, Beyer A. and Aebersold R. (2016), 'On the dependency of cellular protein levels on mRNA abundance', Cell 165(3), 535–550.
- Lou Y, Bien J, Caruana R. and Gehrke J. (2016), 'Sparse partially linear additive models', Journal of Computational and Graphical Statistics 25(4), 1126–1140.
- Meier L, Van de Geer S. and Bühlmann P. (2009), 'High-dimensional additive modeling', The Annals of Statistics 37(6B), 3779–3821.
- Nandy S, Lim CY and Maiti T. (2017), 'Additive model building for spatial regression', Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(3), 779–800.
- Niu Y, Guha N, De D, Bhadra A, Baladandayuthapani V. and Mallick BK (2020), 'Bayesian variable selection in multivariate nonlinear regression with graph structures', arXiv preprint arXiv:2010.14638.
- Petersen A. and Witten D. (2019), 'Data-adaptive additive modeling', Statistics in Medicine 38(4), 583–600.
- Petersen A, Witten D. and Simon N. (2016), 'Fused lasso additive model', Journal of Computational and Graphical Statistics 25(4), 1005–1025.
- Ravikumar P, Lafferty J, Liu H. and Wasserman L. (2009), 'Sparse additive models', Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(5), 1009–1030.
- Rothman AJ, Levina E. and Zhu J. (2010), 'Sparse multivariate regression with covariance estimation', Journal of Computational and Graphical Statistics 19(4), 947–962.
- Scheipl F, Fahrmeir L. and Kneib T. (2012), 'Spike-and-slab priors for function selection in structured additive regression models', Journal of the American Statistical Association 107(500), 1518–1532.
- Shen T, Zhang K, Siegal GP and Wei S. (2016), 'Prognostic value of E-cadherin and β-catenin in triple-negative breast cancer', American Journal of Clinical Pathology 146(5), 603–610.
- Tay JK and Tibshirani R. (2020), 'Reluctant generalised additive modelling', International Statistical Review 88, S205–S224.
- Tibshirani R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58(1), 267–288.
- van Asbeck AH, Dieker J, Oude Egberink R, van den Berg L, van der Vlag J. and Brock R. (2021), 'Protein expression correlates linearly with mRNA dose over up to five orders of magnitude in vitro and in vivo', Biomedicines 9(5), 511.
- Wand MP and Ormerod J. (2008), 'On semiparametric regression with O'Sullivan penalized splines', Australian & New Zealand Journal of Statistics 50(2), 179–198.
- Wang Y. (2011), Smoothing Splines: Methods and Applications, CRC Press.
- Wood SN (2017), Generalized Additive Models: An Introduction with R, CRC Press.
- Yin J. and Li H. (2013), 'Adjusting for high-dimensional covariates in sparse precision matrix estimation by ℓ1-penalization', Journal of Multivariate Analysis 116, 365–381.
- Zellner A. (1963), 'Estimators for seemingly unrelated regression equations: Some exact finite sample results', Journal of the American Statistical Association 58(304), 977–992.
- Zhu T, Zheng J, Hu S, Zhang W, Zhou H, Li X. and Liu Z-Q (2020), 'Construction and validation of an immunity-related prognostic signature for breast cancer', Aging (Albany NY) 12(21), 21597.