Published in final edited form as: IEEE Trans Inf Theory. 2019 Apr 11;65(8):4924–4939. doi: 10.1109/tit.2019.2909889

SOFAR: Large-Scale Association Network Learning

Yoshimasa Uematsu 1, Yingying Fan 1, Kun Chen 1, Jinchi Lv 1, Wei Lin 1

Abstract

Many modern big data applications feature large scale in both numbers of responses and predictors. Better statistical efficiency and scientific insights can be enabled by understanding the large-scale response-predictor association network structures via layers of sparse latent factors ranked by importance. Yet sparsity and orthogonality have been two largely incompatible goals. To accommodate both features, in this paper we suggest the method of sparse orthogonal factor regression (SOFAR) via the sparse singular value decomposition with orthogonality constrained optimization to learn the underlying association networks, with broad applications to both unsupervised and supervised learning tasks such as biclustering with sparse singular value decomposition, sparse principal component analysis, sparse factor analysis, and sparse vector autoregression analysis. Exploiting the framework of convexity-assisted nonconvex optimization, we derive nonasymptotic error bounds for the suggested procedure characterizing the theoretical advantages. The statistical guarantees are powered by an efficient SOFAR algorithm with convergence properties. Both computational and theoretical advantages of our procedure are demonstrated with several simulation and real data examples.

Keywords: Big data, Large-scale association network, Simultaneous response and predictor selection, Latent factors, Sparse singular value decomposition, Orthogonality constrained optimization, Nonconvex statistical learning

I. Introduction

The genetics of gene expression variation may be complex due to the presence of both local and distant genetic effects and shared genetic components across multiple genes [14, 18]. A useful statistical analysis in such studies is to simultaneously classify the genetic variants and gene expressions into groups that are associated. For example, in a yeast expression quantitative trait loci (eQTLs) mapping analysis, the goal is to understand how the eQTLs, which are regions of the genome containing DNA sequence variants, influence the expression level of genes in the yeast MAPK signaling pathways. Extensive genetic and biochemical analysis has revealed that there are a few functionally distinct signaling pathways of genes [36, 14], suggesting that the association structure between the eQTLs and the genes is of low rank. Each signaling pathway involves only a subset of genes, which are regulated by only a few genetic variants, suggesting that each association between the eQTLs and the genes is sparse in both the input and the output (or in both the responses and the predictors), and the pattern of sparsity should be pathway specific. Moreover, it is known that the yeast MAPK pathways regulate and interact with each other [36]. The complex genetic structures described above clearly call for a joint statistical analysis that can reveal multiple distinct associations between subsets of genes and subsets of genetic variants. If we treat the genetic variants and gene expressions as the predictors and responses, respectively, in a multivariate regression model, the task can then be carried out by seeking a sparse representation of the coefficient matrix and performing predictor and response selection simultaneously. The problem of large-scale response-predictor association network learning is indeed of fundamental importance in many modern big data applications featuring large scale in both numbers of responses and predictors.

Observing n independent pairs (x_i, y_i), i = 1, …, n, with x_i ∈ ℝ^p the covariate vector and y_i ∈ ℝ^q the response vector, and motivated by the above applications, we consider the following multivariate regression model

Y = XC* + E,   (1)

where Y = (y_1, …, y_n)^T ∈ ℝ^{n×q} is the response matrix, X = (x_1, …, x_n)^T ∈ ℝ^{n×p} is the predictor matrix, C* ∈ ℝ^{p×q} is the true regression coefficient matrix, and E = (e_1, …, e_n)^T is the error matrix. To model the sparse relationship between the responses and the predictors as in the yeast eQTLs mapping analysis, we exploit the following singular value decomposition (SVD) of the coefficient matrix

C* = U*D*V*^T = Σ_{j=1}^r d_j* u_j* v_j*^T,   (2)

where 1 ≤ r ≤ min(p, q) is the rank of matrix C*, D* = diag(d_1*, …, d_r*) is a diagonal matrix of nonzero singular values, and U* = (u_1*, …, u_r*) ∈ ℝ^{p×r} and V* = (v_1*, …, v_r*) ∈ ℝ^{q×r} are the orthonormal matrices of left and right singular vectors, respectively. Here, we assume that C* is low-rank with only r nonzero singular values, and the matrices U* and V* are sparse.

Under the sparse SVD structure (2), model (1) can be rewritten as

Ỹ = X̃D* + Ẽ,

where Ỹ = YV*, X̃ = XU*, and Ẽ = EV* ∈ ℝ^{n×r} are the matrices of latent responses, predictors, and random errors, respectively. The associations between the predictors and responses are thus diagonalized under the pairs of transformations specified by U* and V*. When C* is of low rank, this provides an appealing low-dimensional latent model interpretation for model (1). Further, note that the latent responses and predictors are linear combinations of the original responses and predictors, respectively. Thus, the interpretability of the SVD can be enhanced if we require that the left and right singular vectors be sparse so that each latent predictor/response involves only a small number of the original predictors/responses, thereby performing the task of variable selection among the predictors/responses, as needed in the yeast eQTLs analysis.
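
To make the latent-model interpretation concrete, the following minimal numpy sketch builds a coefficient matrix with a sparse SVD and checks that the transformed data satisfy the diagonalized latent model; all dimensions, supports, and singular values here are illustrative choices of our own, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 200, 30, 15, 3

# Sparse singular vectors: each factor loads on a disjoint block of coordinates,
# so the columns of U and V are orthonormal after normalization.
U = np.zeros((p, r))
V = np.zeros((q, r))
for j in range(r):
    U[5 * j:5 * (j + 1), j] = rng.choice([-1.0, 1.0], 5)
    V[4 * j:4 * (j + 1), j] = rng.uniform(0.5, 1.0, 4)
U /= np.linalg.norm(U, axis=0)
V /= np.linalg.norm(V, axis=0)
D = np.diag([20.0, 15.0, 10.0])
C = U @ D @ V.T                          # coefficient matrix with sparse SVD as in (2)

X = rng.standard_normal((n, p))
E = 0.1 * rng.standard_normal((n, q))
Y = X @ C + E                            # model (1)

# Latent responses/predictors: Y V = (X U) D + E V, a diagonalized association.
Y_tilde, X_tilde, E_tilde = Y @ V, X @ U, E @ V
print(np.allclose(Y_tilde, X_tilde @ D + E_tilde))   # True
```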

The above model (1) with low-rank coefficient matrix has been commonly adopted in the literature. In particular, the reduced rank regression [1, 39, 55] is an effective approach to dimension reduction by constraining the coefficient matrix C* to be of low rank. Bunea et al. [15] proposed a rank selection criterion that can be viewed as an L0 regularization on the singular values of C*. The popularity of L1 regularization methods such as the Lasso [60] led to the development of nuclear norm regularization in multivariate regression [66]. Chen et al. [21] proposed an adaptive nuclear norm penalization approach to bridge the gap between L0 and L1 regularization methods and combine some of their advantages. With the additional SVD structure (2), Chen et al. [19] proposed a new estimation method with a correctly specified rank by imposing a weighted L1 penalty on each rank-1 SVD layer for the classical setting of fixed dimensionality. Chen and Huang [22] and Bunea et al. [16] explored a low-rank representation of C* in which the rows of C* are sparse; however, their approaches do not impose sparsity on the right singular vectors and, hence, are inapplicable to settings with high-dimensional responses where response selection is highly desirable.

Recently, there have been some new developments in sparse and low-rank regression problems. Ma and Sun [50] studied the properties of row-sparse reduced-rank regression model with nonconvex sparsity-inducing penalties, and later Ma et al. [49] extended their work to two-way sparse reduced-rank regression. Chen and Huang [23] extended the row-sparse reduced-rank regression by incorporating covariance matrix estimation, and the authors mainly focused on computational issues. Lian et al. [46] proposed a semiparametric reduced-rank regression with a sparsity penalty on the coefficient matrix itself. Goh et al. [33] studied the Bayesian counterpart of the row/column-sparse reduced-rank regression and established its posterior consistency. However, none of these works considered the possible entrywise sparsity in the SVD of the coefficient matrix. The sparse and low-rank regression models have also been applied in various fields to solve important scientific problems. To name a few, Chen et al. [20] applied a sparse and low-rank bi-linear model for the task of source-sink reconstruction in marine ecology, Zhu et al. [70] used a Bayesian low-rank model for associating neuroimaging phenotypes and genetic markers, and Ma et al. [48] used a threshold SVD regression model for learning regulatory relationships in genomics.

In view of the key role that the sparse SVD plays for simultaneous dimension reduction and variable selection in model (1), in this paper we suggest a unified regularization approach to estimating such a sparse SVD structure. Our proposal successfully meets three key methodological challenges that are posed by the complex structural constraints on the SVD. First, sparsity and orthogonality are two largely incompatible goals and would seem difficult to accommodate within a single framework. For instance, a standard orthogonalization process such as QR factorization will generally destroy the sparsity pattern of a matrix (see the small example below). Previous methods either relaxed the orthogonality constraint to allow efficient search for sparsity patterns [19], or avoided imposing both sparsity and orthogonality requirements on the same factor matrix [22, 16]. To resolve this issue, we formulate our approach as an orthogonality constrained regularization problem, which yields simultaneously sparse and orthogonal factor matrices in the SVD. Second, we employ the nuclear norm penalty to encourage sparsity among the singular values and achieve rank reduction. As a result, our method produces a continuous solution path, which facilitates rank parameter tuning and distinguishes it from the L0 regularization method adopted by Bunea et al. [16]. Third, unlike rank-constrained estimation, the nuclear norm penalization approach makes the estimation of singular vectors more intricate, since one does not know a priori which singular values will vanish and, hence, which pairs of left and right singular vectors are unidentifiable. Noting that the degree of identifiability of the singular vectors increases with the singular value, we propose to penalize the singular vectors weighted by singular values, which proves to be meaningful and effective. Combining these aspects, we introduce sparse orthogonal factor regression (SOFAR), a novel regularization framework for high-dimensional multivariate regression. While respecting the orthogonality constraint, we allow the sparsity-inducing penalties to take a general, flexible form, which includes special cases that adapt to the entrywise and rowwise sparsity of the singular vector matrices, resulting in a nonconvex objective function for the SOFAR method.
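
As a small illustration of the first challenge, the snippet below (a toy example of our own, not from the paper) orthogonalizes a sparse matrix by QR factorization and counts nonzeros before and after; the Q factor picks up additional nonzero entries, destroying the original sparsity pattern.

```python
import numpy as np

# A sparse 4 x 2 matrix whose columns are not orthogonal.
M = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
Q, _ = np.linalg.qr(M)                 # orthogonalize the columns of M
nnz = lambda A: int(np.sum(np.abs(A) > 1e-12))
print(nnz(M), nnz(Q))                  # 4 nonzeros in M versus 5 in its orthogonalized Q
```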

In addition to the aforementioned three methodological challenges, the nonconvexity of the SOFAR objective function also poses important algorithmic and theoretical challenges in obtaining and characterizing the SOFAR estimator. To address these challenges, we suggest a two-step approach exploiting the framework of convexity-assisted nonconvex optimization (CANO) to obtain the SOFAR estimator. More specifically, in the first step we minimize the L1-penalized squared loss for the multivariate regression (1) to obtain an initial estimator. Then in the second step, we minimize the SOFAR objective function in an asymptotically shrinking neighborhood of the initial estimator. Thanks to the convexity of its objective function, the initial estimator can be obtained effectively and efficiently. Yet since the finer sparsity structure imposed through the sparse SVD (2) is completely ignored in the first step, the initial estimator meets none of the aforementioned three methodological challenges. Nevertheless, since it is theoretically guaranteed that the initial estimator is not far away from the true coefficient matrix C* with asymptotic probability one, searching in an asymptotically shrinking neighborhood of the initial estimator significantly alleviates the nonconvexity issue of the SOFAR objective function. In fact, under the framework of CANO we derive nonasymptotic bounds for the prediction, estimation, and variable selection errors of the SOFAR estimator characterizing the theoretical advantages. In implementation, to disentangle the sparsity and orthogonality constraints we develop an efficient SOFAR algorithm and establish its convergence properties.

Our suggested SOFAR method for large-scale association network learning is in fact connected to a variety of statistical methods in both unsupervised and supervised multivariate analysis. For example, the sparse SVD and sparse principal component analysis (PCA) for a high-dimensional data matrix can be viewed as unsupervised versions of our general method. Other prominent examples include sparse factor models, sparse canonical correlation analysis [63], and sparse vector autoregressive (VAR) models for high-dimensional time series. See Section II-B for more details on these applications and connections.

The rest of the paper is organized as follows. Section II introduces the SOFAR method and discusses its applications to several unsupervised and supervised learning tasks. We present the nonasymptotic properties of the method in Section III. Section IV develops an efficient optimization algorithm and discusses its convergence and tuning parameter selection. We provide several simulation and real data examples in Section V. All the proofs of main results and technical details are detailed in the Supplementary Material. An associated R package implementing the suggested method is available at https://cran.r-project.org/package=rrpack.

II. Large-scale association network learning via SOFAR

A. Sparse orthogonal factor regression

To estimate the sparse SVD of the true regression coefficient matrix C* in model (1), we start by considering an estimator of the form UDV^T, where D = diag(d_1, …, d_m) ∈ ℝ^{m×m} with d_1 ≥ ··· ≥ d_m ≥ 0 and 1 ≤ m ≤ min{p, q} is a diagonal matrix of singular values, and U = (u_1, …, u_m) ∈ ℝ^{p×m} and V = (v_1, …, v_m) ∈ ℝ^{q×m} are orthonormal matrices of left and right singular vectors, respectively. Although it is always possible to take m = min(p, q) without prior knowledge of the rank r, it is often sufficient in practice to take a small m that is slightly larger than the expected rank (estimated by some procedure such as in Bunea et al. [15]), which can dramatically reduce computation time and space. Throughout the paper, for any matrix M = (m_ij) we denote by ‖M‖_F, ‖M‖_1, ‖M‖_∞, and ‖M‖_{2,1} the Frobenius norm, entrywise L1-norm, entrywise L∞-norm, and rowwise (2, 1)-norm defined, respectively, as ‖M‖_F = (Σ_{i,j} m_ij^2)^{1/2}, ‖M‖_1 = Σ_{i,j} |m_ij|, ‖M‖_∞ = max_{i,j} |m_ij|, and ‖M‖_{2,1} = Σ_i (Σ_j m_ij^2)^{1/2}. We also denote by ‖·‖_2 the induced matrix norm (operator norm).

As mentioned in the Introduction, we employ the nuclear norm penalty to encourage sparsity among the singular values, which is exactly the entrywise L1 penalty on D. Penalization directly on U and V, however, is inappropriate since the singular vectors are not equally identifiable and should not be subject to the same amount of regularization. Singular vectors corresponding to larger singular values can be estimated more accurately and should contribute more to the regularization, whereas those corresponding to vanishing singular values are unidentifiable and should play no role in the regularization. Therefore, we propose an importance weighting by the singular values and place sparsity-inducing penalties on the weighted versions of singular vector matrices, UD and VD. Note also that our goal is to estimate not only the low-rank matrix C* but also the factor matrices D*, U*, and V*. Taking into account these points, we consider the orthogonality constrained optimization problem

(D̂, Û, V̂) = arg min_{D, U, V} { (1/2)‖Y − XUDV^T‖_F^2 + λ_d‖D‖_1 + λ_a ρ_a(UD) + λ_b ρ_b(VD) }  subject to  U^TU = I_m, V^TV = I_m,   (3)

where ρ_a(·) and ρ_b(·) are penalty functions to be clarified later, and λ_d, λ_a, λ_b ≥ 0 are tuning parameters that control the strengths of regularization. We call this regularization method sparse orthogonal factor regression (SOFAR) and the regularized estimator (D̂, Û, V̂) the SOFAR estimator. Note that ρ_a(·) and ρ_b(·) can be equal or distinct, depending on the scientific question and the goals of variable selection. Letting λ_d = λ_b = 0 while setting ρ_a(·) = ‖·‖_{2,1} reduces the SOFAR estimator to the sparse reduced-rank estimator of Chen and Huang [22]. In view of our choices of ρ_a(·) and ρ_b(·), although D appears in all three penalty terms, rank reduction is achieved mainly through the first term, while variable selection is achieved through the last two terms under necessary scalings by D.

We note that one major advantage of SOFAR is that the final estimates satisfy the orthogonality constraints in (3). This is also the major distinction of our method from many existing ones. The orthogonality constraints are motivated by a combination of practical, methodological, and theoretical considerations. On the practical side, the orthogonality constraints maximize the separation of different latent layers, ensure that the importance of these layers can be measured by the magnitudes of the diagonal entries of D*, and thus enhance the interpretation. On the methodological side, they are a natural, convenient way to ensure the identifiability of the factor matrices D*, U*, and V* [7]. On the theoretical side, they allow us to establish rigorous error bound inequalities for the estimates D̂, Â, and B̂. Nevertheless, the orthogonality condition among the sparse latent factors may not hold exactly in certain real applications. Thus, in Section V we will investigate the robustness of our method through simulation studies where the orthogonality condition in the model is violated. It would be interesting to formally study the scenario where the orthogonality condition holds only approximately; this is beyond the scope of the current paper and we leave it for future research.

Note that for simplicity we do not explicitly state the ordering constraint d1 ≥ ··· ≥ dm ≥ 0 in optimization problem (3). In fact, when ρa(·) and ρb(·) are matrix norms that satisfy certain invariance properties, such as the entrywise L1-norm and rowwise (2, 1)-norm, this constraint can be easily enforced by simultaneously permuting and/or changing the signs of the singular values and the corresponding singular vectors. The orthogonality constraints are, however, essential to the optimization problem in that a solution cannot be simply obtained through solving the unconstrained regularization problem followed by an orthogonalization process. The interplay between sparse regularization and orthogonality constraints is crucial for achieving important theoretical and practical advantages, which distinguishes our SOFAR method from most previous procedures.

B. Applications of SOFAR

The SOFAR method provides a unified framework for a variety of statistical problems in multivariate analysis. We give four such examples, and in each example, briefly review existing techniques and suggest new methods.

1). Biclustering with sparse SVD:

The biclustering problem of a data matrix, which can be traced back to Hartigan [37], aims to simultaneously cluster the rows (samples) and columns (features) of a data matrix into statistically related subgroups. A variety of biclustering techniques, which differ in the criteria used to relate clusters of samples and clusters of features and in whether overlapping of clusters is allowed, have been suggested as useful tools in the exploratory analysis of high-dimensional genomic and text data. See, for example, Busygin et al. [17] for a survey. One way of formulating the biclustering problem is through the mean model

X = C* + E,   (4)

where the mean matrix C* admits a sparse SVD (2) and the sparsity patterns in the left (or right) singular vectors serve as indicators for the samples (or features) to be clustered. Lee et al. [44] proposed to estimate the first sparse SVD layer by solving the optimization problem

(d̂, û, v̂) = arg min_{d, u, v} { (1/2)‖X − duv^T‖_F^2 + λ_a ρ_a(du) + λ_b ρ_b(dv) }  subject to  ‖u‖_2 = 1, ‖v‖_2 = 1,   (5)

and obtain the next sparse SVD layer by applying the same procedure to the residual matrix X − d̂ûv̂^T. Clearly, problem (5) is a specific example of the SOFAR problem (3) with m = 1 and λ_d = 0; however, the orthogonality constraints are not maintained during the layer-by-layer extraction process. The orthogonality issue also exists in most previous proposals, for example, Zhang et al. [68].
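
For intuition about this layer-by-layer scheme, the following numpy sketch extracts sparse rank-1 layers by alternating soft-thresholding followed by deflation; it is a simplified illustration in the spirit of problem (5), with fixed thresholds and no adaptive weighting, rather than the exact algorithm of Lee et al. [44]. As noted above, orthogonality across the extracted layers is not enforced.

```python
import numpy as np

def soft(z, lam):
    """Entrywise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_svd_layers(X, n_layers=2, lam_u=0.5, lam_v=0.5, n_iter=100):
    """Greedy extraction of sparse rank-1 layers with deflation (illustrative only)."""
    R = X.copy()
    layers = []
    for _ in range(n_layers):
        u0, _, v0t = np.linalg.svd(R, full_matrices=False)
        u, v = u0[:, 0], v0t[0]                  # warm start at the leading SVD pair
        for _ in range(n_iter):
            v = soft(R.T @ u, lam_v)             # update right vector, then rescale
            if not v.any():
                break
            v /= np.linalg.norm(v)
            u = soft(R @ v, lam_u)               # update left vector, then rescale
            if not u.any():
                break
            u /= np.linalg.norm(u)
        d = float(u @ R @ v)                     # layer strength
        layers.append((d, u, v))
        R = R - d * np.outer(u, v)               # deflate before the next layer
    return layers
```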

The multivariate linear model (1) with a sparse SVD (2) can be viewed as a supervised version of the above biclustering problem, which extends the mean model (4) to a general design matrix and can be used to identify interpretable clusters of predictors and clusters of responses that are significantly associated. Applying the SOFAR method to model (4) yields the new estimator

(D̂, Û, V̂) = arg min_{D, U, V} { (1/2)‖X − UDV^T‖_F^2 + λ_d‖D‖_1 + λ_a ρ_a(UD) + λ_b ρ_b(VD) }  subject to  U^TU = I_m, V^TV = I_m,   (6)

which estimates all sparse SVD layers simultaneously while determining the rank by nuclear norm penalization and preserving the orthogonality constraints.

2). Sparse PCA:

A useful technique closely related to sparse SVD is sparse principal component analysis (PCA), which enhances the convergence and improves the interpretability of PCA by introducing sparsity in the loadings of principal components. There has been a fast growing literature on sparse PCA due to its importance in dimension reduction for high-dimensional data. Various formulations coupled with efficient algorithms, notably through L0 regularization and its L1 and semidefinite relaxations, have been proposed by Zou et al. [72], d’Aspremont et al. [25], Shen and Huang [57], Johnstone and Lu [40], and Guo et al. [35], among others. Recently, Benidis et al. [9] developed a new method to estimate sparse eigenvectors without trading off their orthogonality based on the eigenvalue decomposition rather than the SVD using the Procrustes reformulation.

We are interested in two different ways of casting sparse PCA in our sparse SVD framework. The first approach bears a resemblance to the proposal of Zou et al. [72], which formulates sparse PCA as a regularized multivariate regression problem with the data matrix X treated as both the responses and the predictors. Specifically, they proposed to solve the optimization problem

(Â, V̂) = arg min_{A, V} { (1/2)‖X − XAV^T‖_F^2 + λ_a ρ_a(A) }  subject to  V^TV = I_m,   (7)

and the loading vectors are given by the normalized columns of Â, that is, â_j/‖â_j‖_2, j = 1, …, m. However, the orthogonality of the loading vectors, a desirable property enjoyed by the standard PCA, is not enforced by problem (7). Similarly applying the SOFAR method leads to the estimator

(D̂, Û, V̂) = arg min_{D, U, V} { (1/2)‖X − XUDV^T‖_F^2 + λ_d‖D‖_1 + λ_a ρ_a(UD) }  subject to  U^TU = I_m, V^TV = I_m,

which explicitly imposes orthogonality among the loading vectors (the columns of U^). One can optionally ignore the nuclear norm penalty and determine the number of principal components by some well-established criterion.

The second approach exploits the connection of sparse PCA with regularized SVD suggested by Shen and Huang [57]. They proposed to solve the rank-1 matrix approximation problem

(û, b̂) = arg min_{u, b} { (1/2)‖X − ub^T‖_F^2 + λ_b ρ_b(b) }  subject to  ‖u‖_2 = 1,   (8)

and obtain the first loading vector b̂/‖b̂‖_2. Applying the SOFAR method similarly to the rank-m matrix approximation problem yields the estimator

(D̂, Û, V̂) = arg min_{D, U, V} { (1/2)‖X − UDV^T‖_F^2 + λ_d‖D‖_1 + λ_b ρ_b(VD) }  subject to  U^TU = I_m, V^TV = I_m,

which constitutes a multivariate generalization of problem (8), with the desirable orthogonality constraint imposed on the loading vectors (the columns of V^) and the optional nuclear norm penalty useful for determining the number of principal components.

3). Sparse factor analysis:

Factor analysis plays an important role in dimension reduction and feature extraction for high-dimensional time series. A low-dimensional factor structure is appealing from both theoretical and practical angles, and can be conveniently incorporated into many other statistical tasks, such as forecasting with factor-augmented regression [59] and covariance matrix estimation [28]. See, for example, Bai and Ng [6] for an overview.

Let x_t ∈ ℝ^p be a vector of observed time series. Consider the factor model

x_t = Λf_t + e_t,   t = 1, …, T,   (9)

where f_t ∈ ℝ^m is a vector of latent factors, Λ ∈ ℝ^{p×m} is the factor loading matrix, and e_t is the idiosyncratic error. Most existing methods for high-dimensional factor models rely on classical PCA [5, 2] or maximum likelihood to estimate the factors and factor loadings [4, 3]; as a result, the estimated factors and loadings are generally nonzero. However, in order to assign economic meanings to the factors and loadings and to further mitigate the curse of dimensionality, it would be desirable to introduce sparsity in the factors and loadings. Writing model (9) in the matrix form

X = FΛ^T + E

with X = (x_1, …, x_T)^T, F = (f_1, …, f_T)^T, and E = (e_1, …, e_T)^T reveals its equivalence to model (4). Therefore, under the usual normalization restrictions that F^TF/T = I_m and Λ^TΛ is diagonal, we can solve for (D̂, Û, V̂) in problem (6) and estimate the sparse factors and loadings by F̂ = √T Û and Λ̂ = V̂D̂/√T.
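
This identification can be turned directly into estimates of the factors and loadings. The sketch below uses a plain SVD as a stand-in for the sparse SVD obtained from problem (6) and recovers F̂ and Λ̂ under the stated normalization; it illustrates the algebra only.

```python
import numpy as np

def factors_and_loadings(X, m):
    """Given X (T x p) and a rank-m SVD X ~ U D V^T with orthonormal U, V, return
    factor and loading estimates under the normalization F^T F / T = I_m."""
    T = X.shape[0]
    U, d, Vt = np.linalg.svd(X, full_matrices=False)   # stand-in for a sparse SVD fit
    U, D, V = U[:, :m], np.diag(d[:m]), Vt[:m].T
    F_hat = np.sqrt(T) * U                  # F_hat^T F_hat / T = I_m
    Lambda_hat = V @ D / np.sqrt(T)         # Lambda_hat^T Lambda_hat is diagonal
    # F_hat @ Lambda_hat.T reproduces U D V^T, the rank-m approximation of X.
    return F_hat, Lambda_hat
```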

4). Sparse VAR analysis:

Vector autoregressive (VAR) models have been widely used to analyze the joint dynamics of multivariate time series; see, for example, Stock and Watson [58]. Classical VAR analysis suffers greatly from the large number of free parameters in a VAR model, which grows quadratically with the dimensionality. Early attempts in reducing the impact of dimensionality have explored reduced rank methods such as canonical analysis and reduced rank regression [12, 62]. Regularization methods such as the Lasso have recently been adapted to VAR analysis for variable selection [38, 52, 42, 8].

We present an example in which our parsimonious model setup is most appropriate. Suppose we observe the data (y_t, x_t), where y_t ∈ ℝ^q is a low-dimensional vector of time series whose dynamics are of primary interest, and x_t ∈ ℝ^p is a high-dimensional vector of informational time series. We assume that the x_t are generated by the VAR equation

x_t = C*^T x_{t−1} + e_t,

where C* has a sparse SVD (2). This implies a low-dimensional latent model of the form

g_t = D* f_{t−1} + ẽ_t,

where ft = U*T xt, gt =V*T xt, and ẽt = V*T et. Following the factor-augmented VAR (FAVAR) approach of Bernanke et al. [10], we augment the latent factors ft and gt to the dynamic equation of yt and consider the joint model

\begin{pmatrix} y_t \\ g_t \end{pmatrix} = \begin{pmatrix} A^T & B^T \\ 0 & D^* \end{pmatrix} \begin{pmatrix} y_{t-1} \\ f_{t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_t \\ \tilde{e}_t \end{pmatrix}.

We can estimate the parameters A, B, and D* by a two-step method: first apply the SOFAR method to obtain estimates of D* and ft, and then estimate A and B by a usual VAR since both yt and ft are of low dimensionality. Our approach differs from previous methods in that we enforce sparse factor loadings; hence, it would allow the factors to be given economic interpretations and would be useful for uncovering the structural relationships underlying the joint dynamics of (yt, xt).
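
A minimal sketch of this two-step estimation is given below. The function sofar_fit is a hypothetical placeholder for any routine returning sparse SVD factors (U, D, V) of the coefficient matrix in the x_t equation; the second step is then an ordinary least-squares VAR on the low-dimensional series (y_t, f_t).

```python
import numpy as np

def favar_two_step(y, x, sofar_fit, m):
    """Two-step estimation for the joint model above.
    y: (T, q) primary series; x: (T, p) informational series.
    sofar_fit: hypothetical placeholder returning (U, D, V) from the sparse SVD
               regression of x_t on x_{t-1} (e.g., a SOFAR solver)."""
    U, D, _ = sofar_fit(x[1:], x[:-1], rank=m)     # step 1: estimate D* and the factors
    f = x @ U                                      # latent factors f_t = U^T x_t
    # Step 2: least-squares VAR of y_t on (y_{t-1}, f_{t-1}).
    Z = np.hstack([y[:-1], f[:-1]])
    coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
    A_hat = coef[: y.shape[1]]                     # estimate of A in y_t = A^T y_{t-1} + ...
    B_hat = coef[y.shape[1]:]                      # estimate of B in ... + B^T f_{t-1} + eps_t
    return A_hat, B_hat, D, f
```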

III. Theoretical properties

We now investigate the theoretical properties of the SOFAR estimator (3) for model (1) under the sparse SVD structure (2). Our results concern nonasymptotic error bounds, where both response dimensionality q and predictor dimensionality p can diverge simultaneously with sample size n. The major theoretical challenges stem from the nonconvexity issues of our optimization problem which are prevalent in nonconvex statistical learning.

A. Technical conditions

We begin by specifying a few assumptions that facilitate our technical analysis. To simplify the technical presentation, we focus on the scenario of p ≥ q; our proofs can be adapted easily to the case of p < q, with the only difference that the rates of convergence in Theorems 1 and 2 will be modified correspondingly. Assume that each column of X, x̃_j with j = 1, …, p, has been rescaled such that ‖x̃_j‖_2^2 = n. The SOFAR method minimizes the objective function in (3). Since the true rank r is unknown and we cannot expect that one can choose m to perfectly match r, the SOFAR estimates Û, V̂, and D̂ are generally of different sizes than U*, V*, and D*, respectively. To ease the presentation, we expand the dimensions of the matrices U*, V*, and D* by simply adding columns and rows of zeros to the right and to the bottom of each of the matrices to make them of sizes p × q, q × q, and q × q, respectively. We also expand the matrices D̂, Û, and V̂ similarly to match the sizes of D*, U*, and V*, respectively. Define A* = U*D* and B* = V*D*, and correspondingly Â = ÛD̂ and B̂ = V̂D̂ using the SOFAR estimates (Û, V̂, D̂).

Definition 1 (Robust spark). The robust spark κ_c of the n × p design matrix X is defined as the smallest possible positive integer such that there exists an n × κ_c submatrix of n^{−1/2}X having a singular value less than a given positive constant c.

Condition 1. (Parameter space) The true parameters (C*, D*, A*, B*) lie in 𝒞 × 𝒟 × 𝒜 × ℬ, where 𝒞 = {C ∈ ℝ^{p×q} : ‖C‖_0 < κ_{c_2}/2}, 𝒟 = {D = diag{d_j} ∈ ℝ^{q×q} : d_j = 0 or |d_j| ≥ τ}, 𝒜 = {A = (a_ij) ∈ ℝ^{p×q} : a_ij = 0 or |a_ij| ≥ τ}, and ℬ = {B = (b_ij) ∈ ℝ^{q×q} : b_ij = 0 or |b_ij| ≥ τ}, with κ_{c_2} the robust spark of X, c_2 > 0 some constant, and τ > 0 asymptotically vanishing.

Condition 2. (Constrained eigenvalue) It holds that max_{‖u‖_0 < κ_{c_2}/2, ‖u‖_2 = 1} ‖Xu‖_2^2 ≤ c_3 n and max_{1≤j≤r} ‖Xu_j*‖_2^2 ≤ c_3 n for some constant c_3 > 0, where u_j* is the left singular vector of C* corresponding to singular value d_j*.

Condition 3. (Error term) The error term E ∈ ℝ^{n×q} ~ N(0, I_n ⊗ Σ) with the maximum eigenvalue α_max of Σ bounded from above and the diagonal entries of Σ being the σ_j^2's.

Condition 4. (Penalty functions) For matrices M and M* of the same size, the penalty functions ρ_h with h ∈ {a, b} satisfy |ρ_h(M) − ρ_h(M*)| ≤ ‖M − M*‖_1.

Condition 5. (Relative spectral gap) The nonzero singular values of C* satisfy d_{j−1}*^2 − d_j*^2 ≥ δ^{1/2} d_{j−1}*^2 for 2 ≤ j ≤ r with δ > 0 some constant, and r and Σ_{j=1}^r (d_1*/d_j*)^2 can diverge as n → ∞.

The concept of robust spark in Definition 1 was introduced initially in [69] and [30], where the thresholded parameter space was exploited to characterize the global optimum for regularization methods with general penalties. Similarly, the thresholded parameter space and the constrained eigenvalue condition, which builds on the robust spark condition of the design matrix, in Conditions 1 and 2 are essential for investigating the computable solution to the nonconvex SOFAR optimization problem in (3). By Proposition 1 of [30], the robust spark κ_{c_2} can be at least of order O{n/(log p)} with asymptotic probability one when the rows of X are independently sampled from multivariate Gaussian distributions with dependency. Although Condition 3 assumes Gaussianity, our theory can in principle carry over to the case of sub-Gaussian errors, provided that the concentration inequalities for Gaussian random variables used in our proofs are replaced by those for sub-Gaussian random variables.

Condition 4 includes many kinds of penalty functions that bring about sparse estimates. Important examples include the entrywise L1-norm and rowwise (2, 1)-norm, where the former encourages sparsity among the predictor/response effects specific to each rank-1 SVD layer, while the latter promotes predictor/response-wise sparsity regardless of the specific layer. To see why the rowwise (2, 1)-norm satisfies Condition 4, observe that

‖M‖_1 = Σ_{i,j} |m_ij| = Σ_i (Σ_{j,k} |m_ij m_ik|)^{1/2} ≥ Σ_i (Σ_j m_ij^2)^{1/2} = ‖M‖_{2,1},

which along with the triangle inequality entails that Condition 4 is indeed satisfied. Moreover, Condition 4 allows us to use concave penalties such as SCAD [29] and MCP [67]; see, for instance, the proof of Lemma 1 in [30].

Intuitively, Condition 5 rules out the nonidentifiable case where some nonzero singular values are tied with each other and the associated singular vectors in matrices U* and V* are identifiable only up to some orthogonal transformation. In particular, Condition 5 enables us to establish the key Lemma 3 in Section F of Supplementary Material, where the matrix perturbation theory can be invoked.

B. Main results

Since the objective function of the SOFAR method (3) is nonconvex, solving this optimization problem is highly challenging. To overcome the difficulties, as mentioned in the Introduction we exploit the framework of CANO and suggest a two-step approach, where in the first step we solve the following L1-penalized squared loss minimization problem

C̃ = arg min_{C ∈ ℝ^{p×q}} { (2n)^{−1}‖Y − XC‖_F^2 + λ_0‖C‖_1 }   (10)

to construct an initial estimator C˜ with λ0 ≥ 0 some regularization parameter. If C˜=0, then we set the final SOFAR estimator as C^=0; otherwise, in the second step we do a refined search and minimize the SOFAR objective function (3) in an asymptotically shrinking neighborhood of C˜ to obtain the final SOFAR estimator C^. In the case of C˜=0, our two-step procedure reduces to a one-step procedure. Since Theorem 1 below establishes that C˜ can be close to C* with asymptotic probability one, having C˜=0 is a good indicator that the true C* = 0.

Thanks to its convexity, the objective function in (10) in the first step can be solved easily and efficiently. In fact, since the objective function in (10) is separable it follows that the jth column of C˜ can be obtained by solving the univariate response Lasso regression

min_{β ∈ ℝ^p} { (2n)^{−1}‖Ye_j − Xβ‖_2^2 + λ_0‖β‖_1 },

where e_j is a q-dimensional vector with jth component 1 and all other components 0. The above univariate response Lasso regression has been studied extensively and is well understood, and many efficient algorithms have been proposed for solving it. Denote by (D̃, Ũ, Ṽ) the initial estimator of (D*, U*, V*) obtained from the SVD of C̃, and let Ã = ŨD̃ and B̃ = ṼD̃. Since the bounds for the SVD are key to the analysis of the SOFAR estimator in the second step, for completeness we present the nonasymptotic bounds on the estimation errors of the initial estimator in the following theorem.
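
Since problem (10) separates across the q response columns, the first-step estimator can be computed with any off-the-shelf Lasso solver. A sketch using scikit-learn is given below; its Lasso objective, (2n)^{−1}‖y − Xβ‖_2^2 + α‖β‖_1, matches the per-column problem above with α = λ_0, which is treated as given here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def initial_estimator(X, Y, lam0):
    """First-step estimate C_tilde from (10), computed one response column at a time."""
    p, q = X.shape[1], Y.shape[1]
    C_tilde = np.zeros((p, q))
    for j in range(q):
        fit = Lasso(alpha=lam0, fit_intercept=False).fit(X, Y[:, j])
        C_tilde[:, j] = fit.coef_
    return C_tilde

# The SVD of C_tilde then provides (D_tilde, U_tilde, V_tilde) and anchors the
# shrinking neighborhood used in the second-step SOFAR refinement.
```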

Theorem 1 (Error bounds for initial estimator). Assume that Conditions 1–3 hold and let λ_0 = c_0 σ_max(n^{−1} log(pq))^{1/2} with σ_max = max_{1≤j≤q} σ_j and c_0 > √2 some constant. Then with probability at least 1 − 2(pq)^{1−c_0^2/2}, the estimation error is bounded as

‖C̃ − C*‖_F ≤ R_n ≡ c(n^{−1} s log(pq))^{1/2}   (11)

with s = ‖C*‖0 and c > 0 some constant. Under additional Condition 5, with the same probability bound the following estimation error bounds hold simultaneously

‖D̃ − D*‖_F ≤ c(n^{−1} s log(pq))^{1/2},   (12)
‖Ã − A*‖_F + ‖B̃ − B*‖_F ≤ cη_n(n^{−1} s log(pq))^{1/2},   (13)

where η_n = 1 + δ^{−1/2}(Σ_{j=1}^r (d_1*/d_j*)^2)^{1/2}.

For the case of q = 1, the estimation error bound (11) is consistent with the well-known oracle inequality for Lasso [11]. The additional estimation error bounds (12) and (13) for the SVD in Theorem 1 are, however, new to the literature. It is worth mentioning that Condition 5 and the latest results in [65] play a crucial role in establishing these additional error bounds.

After obtaining the initial estimator C̃ from the first step, we can solve the SOFAR optimization problem in an asymptotically shrinking neighborhood of C̃. More specifically, we define 𝒫̃_n = {C : ‖C − C̃‖_F ≤ 2R_n} with R_n the upper bound in (11). Then it is seen from Theorem 1 that the true coefficient matrix C* is contained in 𝒫̃_n with probability at least 1 − 2(pq)^{1−c_0^2/2}. Further define

𝒫_n = 𝒫̃_n ∩ (𝒞 × 𝒟 × 𝒜 × ℬ),   (14)

where the sets 𝒞, 𝒟, 𝒜, and ℬ are defined in Condition 1. Then with probability at least 1 − 2(pq)^{1−c_0^2/2}, the set 𝒫_n defined in (14) is nonempty, containing at least the element C*, by Condition 1. We minimize the SOFAR objective function (3) by searching in the shrinking neighborhood 𝒫_n and denote by Ĉ the resulting SOFAR estimator. Then it follows that with probability at least 1 − 2(pq)^{1−c_0^2/2},

‖Ĉ − C*‖_F ≤ ‖Ĉ − C̃‖_F + ‖C̃ − C*‖_F ≤ 3R_n,

where the first inequality follows from the triangle inequality and the second from the construction of the set 𝒫_n and Theorem 1. Therefore, we see that the SOFAR estimator given by our two-step procedure is guaranteed to have convergence rate at least O(R_n).

Since the initial estimator investigated in Theorem 1 completely ignores the finer sparse SVD structure of the coefficient matrix C*, intuitively the second step of SOFAR estimation can lead to improved error bounds. Indeed, we show in Theorem 2 below that with the second step of refinement, up to some columnwise sign changes the SOFAR estimator can admit estimation error bounds in terms of the parameters r, s_a, and s_b with r = ‖D*‖_0, s_a = ‖A*‖_0, and s_b = ‖B*‖_0. When r, s_a, and s_b are drastically smaller than s, these new upper bounds can have better rates of convergence.

Theorem 2 (Error bounds for SOFAR estimator). Assume that Conditions 1–5 hold, λ_max ≡ max(λ_d, λ_a, λ_b) = c_1(n^{−1} log(pr))^{1/2} with c_1 > 0 some large constant, log p = O(n^α), q = O(n^{β/2}), s = O(n^γ), and η_n^2 = o(min{λ_max^{−1}τ, n^{1−α−β−γ}τ^2}) with α, β, γ ≥ 0, α + β + γ < 1, and η_n as given in Theorem 1. Then with probability at least

1 − {2(pq)^{1−c_0^2/2} + 2(pr)^{−c̃_2} + 2pr exp(−c̃_3 n^{1−β−γ}τ^2 η_n^{−2})},   (15)

the SOFAR estimator satisfies the following error bounds simultaneously:

(a) ‖Ĉ − C*‖_F ≤ c min{s, (r + s_a + s_b)η_n^2}^{1/2}{n^{−1} log(pq)}^{1/2},   (16)
(b) ‖D̂ − D*‖_F + ‖Â − A*‖_F + ‖B̂ − B*‖_F ≤ c min{s, (r + s_a + s_b)η_n^2}^{1/2} η_n {n^{−1} log(pq)}^{1/2},   (17)
(c) ‖D̂ − D*‖_0 + ‖Â − A*‖_0 + ‖B̂ − B*‖_0 ≤ (r + s_a + s_b)[1 + o(1)],   (18)
(d) ‖D̂ − D*‖_1 + ‖Â − A*‖_1 + ‖B̂ − B*‖_1 ≤ c(r + s_a + s_b)η_n^2 λ_max,   (19)
(e) n^{−1}‖X(Ĉ − C*)‖_F^2 ≤ c(r + s_a + s_b)η_n^2 λ_max^2,   (20)

where c_0 > √2 and c, c̃_2, c̃_3 are some positive constants.

We see from Theorem 2 that the upper bounds in (16) and (17) are the minimum of two rates, one involving r + s_a + s_b (the total sparsity of D*, A*, and B*) and the other involving s (the sparsity of matrix C*). The rate involving s comes from the first step of Lasso estimation, while the rate involving r + s_a + s_b comes from the second step of SOFAR refinement. For the case of s > (r + s_a + s_b)η_n^2, our two-step procedure leads to enhanced error rates under the Frobenius norm. Moreover, the error rates in (18)–(20) are new to the literature and not shared by the initial Lasso estimator, showing again the advantages of having the second step of refinement. It is seen that our two-step SOFAR estimator is capable of recovering the sparsity structure of D*, A*, and B* very well.

Let us gain more insights into these new error bounds. In the case of univariate response with q = 1, we have η_n = 1 + δ^{−1/2}, r = 1, s_a = s, and s_b = 1. Then the upper bounds in (16)–(20) reduce to c{sn^{−1} log p}^{1/2}, c{sn^{−1} log p}^{1/2}, cs, cs{n^{−1} log p}^{1/2}, and cn^{−1}s log p, respectively, which are indeed within a logarithmic factor of the oracle rates for the case of high-dimensional univariate response regression. Furthermore, in the rank-one case of r = 1 we have η_n = 1 + δ^{−1/2} and s = s_a s_b. Correspondingly, the upper bounds in (11)–(13) for the initial Lasso estimator all become c{n^{−1}s_a s_b log(pq)}^{1/2}, while the upper bounds in (16)–(20) for the SOFAR estimator become c{(s_a + s_b)n^{−1} log(pq)}^{1/2}, c{(s_a + s_b)n^{−1} log(pq)}^{1/2}, c(s_a + s_b), c(s_a + s_b){n^{−1} log(pq)}^{1/2}, and cn^{−1}(s_a + s_b) log(pq), respectively. In particular, we see that the SOFAR estimator can have much improved rates of convergence even in the setting of r = 1.

IV. Implementation of SOFAR

The interplay between sparse regularization and orthogonality constraints creates substantial algorithmic challenges for solving the SOFAR optimization problem (3), for which many existing algorithms can become either inefficient or inapplicable. For example, coordinate descent methods that are popular for solving large-scale sparse regularization problems [32] are not directly applicable because the penalty terms in problem (3) are not separable under the orthogonality constraints. Also, the general framework for algorithms involving orthogonality constraints [26] does not take sparsity into account and hence does not lead to efficient algorithms in our context. Recently, Benidis et al. [9] focused on the unsupervised learning setting and introduced a new algorithm for estimating sparse eigenvectors without trading off their orthogonality based on the eigenvalue decomposition rather than the SVD. To obtain sparse orthogonal eigenvectors, they applied the minorization-maximization framework on the sparse PCA problem, which results in solving a sequence of rectangular Procrustes problems. Inspired by a recently revived interest in the augmented Lagrangian method (ALM) and its variants for large-scale optimization in statistics and machine learning [13], in this section we develop an efficient algorithm for solving problem (3).

A. SOFAR algorithm with ALM-BCD

The architecture of the proposed SOFAR algorithm is based on the ALM coupled with block coordinate descent (BCD). The first construction step is to utilize variable splitting to separate the orthogonality constraints and sparsity-inducing penalties into different subproblems, which then enables efficient optimization in a block coordinate descent fashion. To this end, we introduce two new variables A and B, and express problem (3) in the equivalent form

(Θ̂, Ω̂) = arg min_{Θ, Ω} { (1/2)‖Y − XUDV^T‖_F^2 + λ_d‖D‖_1 + λ_a ρ_a(A) + λ_b ρ_b(B) }  subject to  U^TU = I_m, V^TV = I_m, UD = A, VD = B,   (21)

where Θ = (D, U, V) and Ω = (A, B). We form the augmented Lagrangian for problem (21) as

L_μ(Θ, Ω, Γ) = (1/2)‖Y − XUDV^T‖_F^2 + λ_d‖D‖_1 + λ_a ρ_a(A) + λ_b ρ_b(B) + ⟨Γ_a, UD − A⟩ + ⟨Γ_b, VD − B⟩ + (μ/2)‖UD − A‖_F^2 + (μ/2)‖VD − B‖_F^2,

where Γ = (Γa, Γb) is the set of Lagrangian multipliers and μ > 0 is a penalty parameter. Based on ALM, the proposed algorithm consists of the following iterations:

  1. (Θ, Ω)-step: (Θ^{k+1}, Ω^{k+1}) ← arg min_{Θ: U^TU = V^TV = I_m, Ω} L_μ(Θ, Ω, Γ^k);

  2. Γ-step: Γ_a^{k+1} ← Γ_a^k + μ(U^{k+1}D^{k+1} − A^{k+1}) and Γ_b^{k+1} ← Γ_b^k + μ(V^{k+1}D^{k+1} − B^{k+1}).

The (Θ, Ω)-step can be solved by a block coordinate descent method [61] cycling through the blocks U, V, D, A, and B. Note that the orthogonality constraints and the sparsity-inducing penalties are now separated into subproblems with respect to Θ and Ω, respectively. To achieve convergence of the SOFAR algorithm in practice, an inexact minimization with a few block coordinate descent iterations is often sufficient. Moreover, to enhance the convergence of the algorithm to a feasible solution we optionally increase the penalty parameter μ by a ratio γ > 1 at the end of each iteration. This leads to the SOFAR algorithm with ALM-BCD described in Table I.

TABLE I.

SOFAR algorithm with ALM-BCD

Parameters: λd, λa, λb, and γ > 1
Initialize U0, V0, D0, A0, B0, Γa0, Γb0 and μ0
For k = 0, 1, … do
 update U, V, D, A and B:
  (a) U^{k+1} ← arg min_{U^TU = I_m} { (1/2)‖Y − XUD^k(V^k)^T‖_F^2 + (μ^k/2)‖UD^k − A^k + Γ_a^k/μ^k‖_F^2 }
  (b) V^{k+1} ← arg min_{V^TV = I_m} { (1/2)‖Y − XU^{k+1}D^kV^T‖_F^2 + (μ^k/2)‖VD^k − B^k + Γ_b^k/μ^k‖_F^2 }
  (c) D^{k+1} ← arg min_{D ≥ 0} { (1/2)‖Y − XU^{k+1}D(V^{k+1})^T‖_F^2 + (μ^k/2)‖U^{k+1}D − A^k + Γ_a^k/μ^k‖_F^2 + (μ^k/2)‖V^{k+1}D − B^k + Γ_b^k/μ^k‖_F^2 + λ_d‖D‖_1 }
  (d) A^{k+1} ← arg min_A { (μ^k/2)‖U^{k+1}D^{k+1} − A + Γ_a^k/μ^k‖_F^2 + λ_a ρ_a(A) }
  (e) B^{k+1} ← arg min_B { (μ^k/2)‖V^{k+1}D^{k+1} − B + Γ_b^k/μ^k‖_F^2 + λ_b ρ_b(B) }
  (f) optionally, repeat (a)–(e) until convergence
 update Γa and Γb:
  (a) Γ_a^{k+1} ← Γ_a^k + μ^k(U^{k+1}D^{k+1} − A^{k+1})
  (b) Γ_b^{k+1} ← Γ_b^k + μ^k(V^{k+1}D^{k+1} − B^{k+1})
 update μ by μ^{k+1} ← γμ^k
end

We still need to solve the subproblems in the algorithm in Table I. The U-update is similar to the weighted orthogonal Procrustes problem considered by Koschat and Swayne [43]. By expanding the squares and omitting terms not involving U, this subproblem is equivalent to minimizing

(1/2)‖XUD^k‖_F^2 − tr(U^TX^TYV^kD^k) − tr(U^T(μ^kA^k − Γ_a^k)D^k)

subject to U^TU = I_m. Taking a matrix Z such that Z^TZ = ρ^2 I_p − X^TX, where ρ^2 is the largest eigenvalue of X^TX, we can follow the argument of Koschat and Swayne [43] to obtain the iterative algorithm: for j = 0, 1, …, form the p × m matrix C_1 = (X^TYV^k + μ^kA^k − Γ_a^k + Z^TZU^jD^k)D^k, compute the SVD U_1Σ_1V_1^T = C_1, and update U^{j+1} = U_1V_1^T. Note that C_1 depends on Z^TZ only, and hence the explicit computation of Z is not needed. The V-update is similar to a standard orthogonal Procrustes problem and amounts to maximizing

tr(V^TY^TXU^{k+1}D^k) + tr(V^T(μ^kB^k − Γ_b^k)D^k)

subject to V^TV = I_m. A direct method for this problem [34, pp. 327–328] gives the algorithm: form the q × m matrix C_2 = (Y^TXU^{k+1} + μ^kB^k − Γ_b^k)D^k, compute the SVD U_2Σ_2V_2^T = C_2, and set V = U_2V_2^T. Since m is usually small, the SVD computations in the U- and V-updates are cheap. The Lasso problem in the D-update reduces to a standard quadratic program with the nonnegativity constraint, which can be readily solved by efficient algorithms; see, for example, Sha et al. [56]. Note that the D-update may set some singular values to exactly zero; hence, a greedy strategy can be taken to further bring down the computational complexity, by removing the zero singular values and reducing the sizes of the relevant matrices accordingly in subsequent computations. The updates of A and B are free of orthogonality constraints and therefore easy to solve. With the popular choices of ‖·‖_1 and ‖·‖_{2,1} as the penalty functions, the updates can be performed by entrywise and rowwise soft-thresholding, respectively.
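
Putting the pieces together, the numpy sketch below runs the ALM-BCD updates of Table I with entrywise L1 penalties on A and B. It is an illustrative implementation under simplifying assumptions (the D-update is solved in closed form coordinatewise, which is valid here because U and V have orthonormal columns so the quadratic decouples; the iteration counts and penalty-increase ratio γ are arbitrary illustrative values), and is not the implementation in the rrpack package.

```python
import numpy as np

def soft(Z, t):
    """Entrywise soft-thresholding."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sofar_alm_bcd(X, Y, U, V, D, lam_d, lam_a, lam_b,
                  mu=1.0, gamma=1.05, n_outer=200, n_inner=3):
    """Illustrative ALM-BCD loop for problem (21) with entrywise L1 penalties.
    U (p x m) and V (q x m) must start with orthonormal columns; D is diagonal (m x m)."""
    p, m = U.shape
    A, B = U @ D, V @ D
    Ga, Gb = np.zeros_like(A), np.zeros_like(B)
    rho2 = np.linalg.eigvalsh(X.T @ X).max()      # largest eigenvalue of X^T X
    ZtZ = rho2 * np.eye(p) - X.T @ X              # Z^T Z; Z itself is never formed
    for _ in range(n_outer):
        # (a) U-update: weighted orthogonal Procrustes iterations
        for _ in range(n_inner):
            C1 = (X.T @ Y @ V + mu * A - Ga + ZtZ @ U @ D) @ D
            P, _, Qt = np.linalg.svd(C1, full_matrices=False)
            U = P @ Qt
        # (b) V-update: standard orthogonal Procrustes
        C2 = (Y.T @ X @ U + mu * B - Gb) @ D
        P, _, Qt = np.linalg.svd(C2, full_matrices=False)
        V = P @ Qt
        # (c) D-update: with orthonormal U, V the penalized quadratic separates over
        #     the diagonal entries, giving a nonnegative closed-form solution
        XU = X @ U
        num = (np.einsum('ij,ij->j', XU, Y @ V)
               + mu * np.einsum('ij,ij->j', U, A - Ga / mu)
               + mu * np.einsum('ij,ij->j', V, B - Gb / mu) - lam_d)
        den = np.einsum('ij,ij->j', XU, XU) + 2.0 * mu
        D = np.diag(np.maximum(num / den, 0.0))
        # (d), (e) A- and B-updates: entrywise soft-thresholding
        A = soft(U @ D + Ga / mu, lam_a / mu)
        B = soft(V @ D + Gb / mu, lam_b / mu)
        # dual updates and penalty increase
        Ga = Ga + mu * (U @ D - A)
        Gb = Gb + mu * (V @ D - B)
        mu = gamma * mu
    return U, V, D, A, B
```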

Following the theoretical analysis for the SOFAR method in Section III, we employ the SVD of the cross-validated L1-penalized estimator C̃ in (10) to initialize U, V, D, A, and B; the multipliers Γ_a and Γ_b are initialized as zero matrices. In practice, for large-scale problems we can further scale up the SOFAR method by performing feature screening with the initial estimator C̃; that is, the response variables corresponding to zero columns in C̃ and the predictors corresponding to zero rows in C̃ can be removed prior to the finer SOFAR analysis.

B. Convergence analysis and tuning parameter selection

For general nonconvex problems, an ALM algorithm need not converge, and even if it converges, it need not converge to an optimal solution. We have the following convergence results regarding the proposed SOFAR algorithm with ALM-BCD.

Theorem 3 (Convergence of SOFAR algorithm). Assume that Σ_{k=1}^∞ {[ΔL_μ(U^k)]^{1/2} + [ΔL_μ(V^k)]^{1/2} + [ΔL_μ(D^k)]^{1/2}} < ∞ and the penalty functions ρ_a(·) and ρ_b(·) are convex, where ΔL_μ(·) denotes the decrease in L_μ(·) by a block update. Then the sequence generated by the SOFAR algorithm converges to a local solution of the augmented Lagrangian for problem (21).

Note that without the above assumption on (U^k), (V^k), and (D^k), we can only show that the differences between two consecutive U-, V-, and D-updates converge to zero by the convergence of the sequence (L_μ(·)), but the sequences (U^k), (V^k), and (D^k) may not necessarily converge. Although Theorem 3 does not ensure the convergence of the algorithm in Table I to an optimal solution, numerical evidence suggests that the algorithm has strong convergence properties and the produced solutions perform well in our numerical studies.

The above SOFAR algorithm is presented for a fixed triple of tuning parameters (λd, λa, λb). One may apply a fine grid search with K-fold cross-validation or an information criterion such as BIC and its high-dimensional extensions including GIC [31] to choose an optimal triple of tuning parameters and hence a best model. In either case, a full search over a three-dimensional grid would be prohibitively expensive, especially for large-scale problems. Theorem 2, however, suggests that the parameter tuning can be effectively reduced to one or two dimensions. Hence, we adopt a search strategy which is computationally affordable and still provides reasonable and robust performance. To this end, we first estimate an upper bound on each of the tuning parameters by considering the marginal null model, where two of the three tuning parameters are fixed at zero and the other is set to the minimum value leading to a null model. We denote the upper bounds thus obtained by (λd*, λa*, λb*), and conduct a search over a one-dimensional grid of values between (λd*, λa*, λb*) and (ελd*, ελa*, ελb*), with ε > 0 sufficiently small (e.g., 10−3) to ensure the coverage of a full spectrum of reasonable solutions. Our numerical experience suggests that this simple search strategy works well in practice while reducing the computational cost dramatically. More flexibility can be gained by adjusting the ratios between λd, λa, and λb if additional information about the relative sparsity levels of D, A, and B is available.
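
For concreteness, a minimal sketch of the one-dimensional search path is given below; the geometric (log-linear) spacing of the grid and the joint scaling of the three tuning parameters are our own illustrative choices, with the upper bounds (λ_d*, λ_a*, λ_b*) assumed to have been obtained from the marginal null models as described above.

```python
import numpy as np

def tuning_path(lam_d_max, lam_a_max, lam_b_max, n_grid=30, eps=1e-3):
    """One-dimensional grid from (lam_d*, lam_a*, lam_b*) down to eps times those
    upper bounds; each grid point scales the three tuning parameters jointly."""
    scale = eps ** (np.arange(n_grid) / (n_grid - 1))   # decays from 1 to eps on a log scale
    return [(lam_d_max * s, lam_a_max * s, lam_b_max * s) for s in scale]
```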

V. Numerical studies

A. Simulation examples

Our Condition 4 in Section III-A accommodates a large group of penalty functions including concave ones such as SCAD and MCP. As demonstrated in [73] and [27], nonconvex regularization problems can be solved using the idea of local linear approximation, which essentially reduces the original problem to the weighted L1-regularization with the weights chosen adaptively based on some initial solution. For this reason, in the simulation study we focus on the entrywise L1-norm ‖ · ‖1 and the rowwise (2, 1)-norm ‖ · ‖2,1, as well as their adaptive extensions. The use of adaptively weighted penalties has also been explored in the contexts of reduced rank regression [21] and sparse PCA [45]. We next provide more details on the adaptive penalties used in our simulation study. To simplify the presentation, we use the entrywise L1-norm as an example.

Incorporating adaptive weighting into the penalty terms in problem (21) leads to the adaptive SOFAR estimator

(Θ̂, Ω̂) = arg min_{Θ, Ω} { (1/2)‖Y − XUDV^T‖_F^2 + λ_d‖W_d ∘ D‖_1 + λ_a‖W_a ∘ A‖_1 + λ_b‖W_b ∘ B‖_1 }  subject to  U^TU = I_m, V^TV = I_m, UD = A, VD = B,

where W_d ∈ ℝ^{m×m}, W_a ∈ ℝ^{p×m}, and W_b ∈ ℝ^{q×m} are weighting matrices that depend on the initial estimates D̃, Ã, and B̃, respectively, and ∘ is the Hadamard or entrywise product. The weighting matrices are chosen to reflect the intuition that singular values and singular vectors of larger magnitude should be less penalized in order to reduce bias and improve efficiency in estimation. As suggested in [73], if one is interested in using some nonconvex penalty functions ρ_a(·) and ρ_b(·), then the weight matrices can be constructed by using the first order derivatives of the penalty functions and the initial solution (Ã, B̃, D̃). In our implementation, for simplification we adopt the alternative popular choice of W_d = diag(d̃_1^{−1}, …, d̃_m^{−1}) with d̃_j the jth diagonal entry of D̃, as suggested in Zou [71]. Similarly, we set W_a = (ã_ij^{−1}) and W_b = (b̃_ij^{−1}) with ã_ij and b̃_ij the (i, j)th entries of Ã and B̃, respectively. Extension of the SOFAR algorithm with ALM-BCD in Section IV-A is also straightforward, with the D-update becoming an adaptive Lasso problem and the updates of A and B now performed by adaptive soft-thresholding. A further way of improving the estimation efficiency is to exploit regularization methods in the thresholded parameter space [30] or thresholded regression [69], which we do not pursue in this paper.
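
A short sketch of this weighting scheme follows; the small constant added to the denominators is our own safeguard against division by zero (entries with zero initial estimates then receive an effectively infinite penalty).

```python
import numpy as np

def adaptive_weights(D_tilde, A_tilde, B_tilde, eps=1e-8):
    """Adaptive weights W_d = diag(1/|d_j|), W_a = (1/|a_ij|), W_b = (1/|b_ij|)
    built from the initial estimates (D_tilde, A_tilde, B_tilde)."""
    W_d = np.diag(1.0 / (np.abs(np.diag(D_tilde)) + eps))
    W_a = 1.0 / (np.abs(A_tilde) + eps)
    W_b = 1.0 / (np.abs(B_tilde) + eps)
    return W_d, W_a, W_b
```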

We compare the SOFAR estimator with the entrywise L1-norm (Lasso) penalty (SOFAR-L) or the rowwise (2, 1)-norm (group Lasso) penalty (SOFAR-GL) against five alternative methods, including three classical methods, namely, the ordinary least squares (OLS), separate adaptive Lasso regressions (Lasso), and reduced rank regression (RRR), and two recent sparse and low rank methods, namely, reduced rank regression with sparse SVD (RSSVD) proposed by Chen et al. [19] and sparse reduced rank regression (SRRR) considered by Chen and Huang [22] (see also the rank constrained group Lasso estimator in Bunea et al. [16]). Both Chen et al. [19] and Chen and Huang [22] used adaptive weighted penalization. We thus consider both nonadaptive and adaptive versions of the SOFAR-L, SOFAR-GL, RSSVD, and SRRR methods.

1). Simulation setups:

We consider several simulation settings with various model dimensions and sparse SVD patterns in the coefficient matrix C*. In all settings, we took the sample size n = 200 and the true rank r = 3. Models 1 and 2 concern the entrywise sparse SVD structure in C*. The design matrix X was generated with i.i.d. rows from N_p(0, Σ_x), where Σ_x = (0.5^{|i−j|}). In Model 1, we set p = 100 and q = 40, and let C* = Σ_{j=1}^3 d_j* u_j* v_j*^T with d_1* = 20, d_2* = 15, d_3* = 10, and

ũ_1 = (unif(S_u, 5), rep(0, 20))^T,
ũ_2 = (rep(0, 3), ũ_{1,4}, ũ_{1,5}, unif(S_u, 3), rep(0, 17))^T,
ũ_3 = (rep(0, 8), unif(S_u, 2), rep(0, 15))^T,
u_j* = ũ_j/‖ũ_j‖_2, j = 1, 2, 3,
ṽ_1 = (unif(S_v, 5), rep(0, 10))^T,
ṽ_2 = (rep(0, 5), unif(S_v, 5), rep(0, 5))^T,
ṽ_3 = (rep(0, 10), unif(S_v, 5))^T,
v_j* = ṽ_j/‖ṽ_j‖_2, j = 1, 2, 3,

where unif(S, k) denotes a k-vector with i.i.d. entries from the uniform distribution on the set S, S_u = {−1, 1}, S_v = [−1, −0.5] ∪ [0.5, 1], rep(α, k) denotes a k-vector replicating the value α, and ũ_{j,k} is the kth entry of ũ_j. Model 2 is similar to Model 1 except with higher model dimensions, where we set p = 400, q = 120, and appended 300 and 80 zeros to each u_j* and v_j* defined above, respectively.

Models 3 and 4 pertain to the rowwise/columnwise sparse SVD structure in C*. Also, we intend to study the case of approximate low-rankness/sparsity by not requiring the signals to be bounded away from zero. We generated X with i.i.d. rows from N_p(0, Σ_x), where Σ_x has diagonal entries 1 and off-diagonal entries 0.5. The rowwise sparsity patterns were generated in a similar way to the setup in Chen and Huang [22], except that we also allow the matrix of right singular vectors to be rowwise sparse, so that response selection may also be necessary. Specifically, we let C* = C_1 C_2^T, where C_1 ∈ ℝ^{p×r} has i.i.d. entries in its first p_0 rows from N(0, 1) and the rest set to zero, and C_2 ∈ ℝ^{q×r} has i.i.d. entries in its first q_0 rows from N(0, 1) and the rest set to zero. We set p = 100, p_0 = 10, q = q_0 = 10 in Model 3, and p = 400, p_0 = 10, q = 200, and q_0 = 10 in Model 4. We also investigate models with even higher dimensions. In Model 5, we experimented with increasing the dimensions of Model 2 to p = 1000 and q = 400 by adding more noise variables, i.e., appending zeros to the u_j* and v_j* vectors.

Finally, we consider Model 6 where the orthogonality among the sparse factors is violated. Specifically, Model 6 is similar to Model 1, except that we modify the true values of U* and V* as follows,

ũ_1 = (unif(S_u, 5), rep(0, 20))^T,
ũ_2 = (rep(0, 3), unif(S_u, 5), rep(0, 17))^T,
ũ_3 = (rep(0, 8), unif(S_u, 2), rep(0, 15))^T,
u_j* = ũ_j/‖ũ_j‖_2, j = 1, 2, 3,
ṽ_1 = (unif(S_v, 5), rep(0, 10))^T,
ṽ_2 = (rep(0, 4), unif(S_v, 5), rep(0, 6))^T,
ṽ_3 = (rep(0, 8), unif(S_v, 5), rep(0, 2))^T,
v_j* = ṽ_j/‖ṽ_j‖_2, j = 1, 2, 3.

Model 7 is similar to Model 6 except with higher model dimensionality, where we set p = 400, q = 120, and appended 300 and 80 zeros to each uj* and vj* defined above, respectively. We would like to point out that when the sparse factors are not exactly orthogonal, the model in fact can be regarded as close to a two-way row-sparse SOFAR model (similar to Models 3 and 4 where both U* and V* are orthogonal and row-sparse); this is because if we compute the SVD of the true coefficient matrix, the resulting orthogonal factors will still have sparsity corresponding to the completely irrelevant responses and predictors.

In all seven settings, we generated the data Y from the model Y = XC* + E, where the error matrix E has i.i.d. rows from N_q(0, σ^2Σ) with Σ = (0.5^{|i−j|}). In each simulation, σ^2 is computed to control the signal-to-noise ratio, defined as d_r*‖Xu_r*v_r*^T‖_F/‖E‖_F, to be exactly 1. The simulation was replicated 300 times in each setting.
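
The error scale follows directly from this definition of the signal-to-noise ratio: draw a unit-scale error matrix and rescale it so that the ratio equals the target. A short sketch is given below, assuming C*, its last SVD layer (d_r*, u_r*, v_r*), and X have already been generated; the specific random number generator calls are our own choices.

```python
import numpy as np
from scipy.linalg import toeplitz

def generate_errors(X, d_r, u_r, v_r, q, rng, snr=1.0):
    """Draw E with i.i.d. rows from N_q(0, sigma^2 * Sigma), Sigma = (0.5^{|i-j|}),
    with sigma chosen so that d_r ||X u_r v_r^T||_F / ||E||_F equals the target SNR."""
    n = X.shape[0]
    Sigma = toeplitz(0.5 ** np.arange(q))
    E0 = rng.multivariate_normal(np.zeros(q), Sigma, size=n)   # unit-scale errors
    signal = d_r * np.linalg.norm(X @ np.outer(u_r, v_r))      # Frobenius norm of the signal
    sigma = signal / (snr * np.linalg.norm(E0))
    return sigma * E0

# Y = X @ C_star + generate_errors(X, d_star[-1], U_star[:, -1], V_star[:, -1], q, rng)
```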

All methods under comparison except OLS require selection of tuning parameters, which include the rank parameter in RRR, RSSVD, and SRRR and the regularization parameters in SOFAR-L, SOFAR-GL, RSSVD, and SRRR. To reveal the full potential of each method, we chose the tuning parameters based on the predictive accuracy evaluated on a large, independently generated validation set of size 2000. The results with tuning parameters chosen by cross-validation or GIC [31] were similar to those based on a large validation set, and hence are not reported.

The model accuracy of each method is measured by the mean squared errors ‖Ĉ − C*‖_F^2/(pq) for estimation (MSE-Est) and ‖X(Ĉ − C*)‖_F^2/(nq) for prediction (MSE-Pred). The variable selection performance is characterized by the false positive rate (FPR%) and false negative rate (FNR%) in recovering the sparsity patterns of the SVD, that is, FPR = FP/(TN + FP) and FNR = FN/(TP + FN), where TP, FP, TN, and FN are the numbers of true nonzeros, false nonzeros, true zeros, and false zeros, respectively. The rank selection performance is evaluated by the average estimated rank (Rank) and the percentage of correct rank identification (Rank%). Finally, for the SOFAR-L, SOFAR-GL, and RSSVD methods which explicitly produce an SVD, the orthogonality of the estimated factor matrices is measured by 100(‖Û^TÛ‖_1 + ‖V̂^TV̂‖_1 − 2r) (Orth), which is minimized at zero when exact orthogonality is achieved.
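
A sketch of how these measures can be computed is given below; the boolean support arrays are assumed to encode the nonzero patterns of the true and estimated SVD components, and the rank r in the Orth measure is taken here to be the number of estimated factors.

```python
import numpy as np

def performance_measures(C_hat, C_star, X, U_hat, V_hat, supp_hat, supp_star):
    """MSE-Est, MSE-Pred, FPR/FNR on the SVD sparsity pattern, and the Orth measure."""
    n = X.shape[0]
    p, q = C_star.shape
    mse_est = np.linalg.norm(C_hat - C_star) ** 2 / (p * q)
    mse_pred = np.linalg.norm(X @ (C_hat - C_star)) ** 2 / (n * q)
    tp = np.sum(supp_hat & supp_star)
    fp = np.sum(supp_hat & ~supp_star)
    tn = np.sum(~supp_hat & ~supp_star)
    fn = np.sum(~supp_hat & supp_star)
    fpr, fnr = 100.0 * fp / (tn + fp), 100.0 * fn / (tp + fn)
    m = U_hat.shape[1]                     # number of estimated factors
    orth = 100.0 * (np.abs(U_hat.T @ U_hat).sum()
                    + np.abs(V_hat.T @ V_hat).sum() - 2 * m)
    return dict(MSE_Est=mse_est, MSE_Pred=mse_pred, FPR=fpr, FNR=fnr, Orth=orth)
```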

2). Simulation results:

We first compare the performance of nonadaptive and adaptive versions of the four sparse regularization methods. Because of the space constraint, only the results in terms of MSE-Pred in high-dimensional models 2 and 4 are presented. The comparisons in other model settings are similar and thus omitted. From Fig. 1, we observe that adaptive weighting generally improves the empirical performance of each method. For this reason, we only consider the adaptive versions of these regularization methods in other comparisons.

Fig. 1. Boxplots of MSE-Pred for Models 2 and 4 with nonadaptive (dark gray) and adaptive (light gray) versions of various methods.

The comparison results with adaptive penalty for Models 1 and 2 are summarized in Table II. The entrywise sparse SVD structure is exactly what the SOFAR-L and RSSVD methods aim to recover. We observe that SOFAR-L performs the best among all methods in terms of both model accuracy and sparsity recovery. Although RSSVD performs only second to SOFAR-L in Model 1, it has substantially worse performance in Model 2 in terms of model accuracy. This is largely because the RSSVD method does not impose any form of orthogonality constraints, which tends to cause nonidentifiability issues and compromise its performance in high dimensions. We note further that SOFAR-GL and SRRR perform worse than SOFAR-L, since they are not intended for entrywise sparsity recovery. However, these two methods still provide remarkable improvements over the OLS and RRR methods due to their ability to eliminate irrelevant variables, and over the Lasso method due to the advantages of imposing a low-rank structure. Compared to SRRR, the SOFAR-GL method results in fewer false positives and shows a clear advantage due to response selection.

TABLE II.

Simulation results for Models 1–2 with various methods.¹

Model  Method     MSE-Est          MSE-Pred         FPR (%)  FNR (%)  Rank  Rank (%)  Orth
1      OLS        250.7 (129.2)    753.8 (392.2)    100      0        –     –         –
       Lasso      12.7 (5.9)       80.8 (34.1)      3.8      0        –     –         –
       RRR        14.7 (6.8)       58.6 (29.3)      100      0        3     100       0
       SOFAR-L    0.4 (0.1)        2.8 (1.3)        0        0        3     100       0
       RSSVD      0.5 (0.3)        3.8 (2.3)        0.2      0        3     99.7      1.9
       SOFAR-GL   1.2 (0.5)        8.2 (4.1)        9.8      0        3     100       0
       SRRR       3.2 (1.0)        25.2 (12.6)      35.5     0        3     100       5.1
2      OLS        1013.0 (117.0)   765.6 (407.2)    100      0        –     –         –
       Lasso      21.3 (7.0)       59.0 (18.1)      1.3      0        –     –         –
       RRR        756.4 (56.8)     30.2 (15.9)      100      0        3     0         0
       SOFAR-L    0.2 (0.1)        0.7 (0.3)        0        0        3     0         0
       RSSVD      2.5 (2.4)        5.3 (4.1)        1        0.1      3     0         28.4
       SOFAR-GL   0.7 (0.4)        2.0 (1.0)        2.7      0        3     0         0
       SRRR       3.8 (1.5)        12.0 (6.3)       19.8     0        3     0         40.2

¹ Adaptive versions of Lasso, SOFAR-L, RSSVD, SOFAR-GL, and SRRR were applied. Means of performance measures with standard deviations in parentheses over 300 replicates are reported. MSE-Est values are multiplied by 10^4 in Model 1 and 10^5 in Model 2, and MSE-Pred values are multiplied by 10^3.

The simulation results for Models 3 and 4 are reported in Table III. For the rowwise sparse SVD structure in these two models, SOFAR-GL and SRRR are more suitable than the other methods. All sparse regularization methods result in higher false negative rates than in Models 1 and 2 because of the presence of some very weak signals. In Model 3, where the matrix of right singular vectors is not sparse and the dimensionality is moderate, SOFAR-GL performs slightly worse than SRRR since response selection is unnecessary. The advantages of SOFAR are clearly seen in Model 4, where the dimension is high and many irrelevant predictors and responses coexist; SOFAR-GL performs slightly better than SOFAR-L, and both methods substantially outperform the other methods. In both models, SOFAR-L and RSSVD result in higher false negative rates, since they introduce more parsimony than necessary by encouraging entrywise sparsity in U and V. For Model 5, which increases the dimensions of Model 2 by adding noise variables, Table IV shows that the SOFAR methods still greatly outperform the others in both estimation and sparsity recovery. In contrast, RSSVD becomes unstable and inaccurate; this again shows the effectiveness of enforcing orthogonality in high-dimensional sparse SVD recovery.

TABLE III.

Simulation results for Models 3–4 with various methods.¹

Model  Method     MSE-Est          MSE-Pred          FPR (%)  FNR (%)  Rank  Rank (%)  Orth
3      OLS        599.2 (339.2)    1530.1 (870.8)    100      0        –     –         –
       Lasso      97.6 (50.0)      472.8 (242.7)     15.5     0.6      –     –         –
       RRR        102.6 (70.2)     291.9 (191.8)     100      0        3     100       0
       SOFAR-L    24.8 (15.3)      129.5 (83.2)      0.3      7.4      3.7   30.3      0
       RSSVD      17.3 (11.3)      96.6 (66.4)       0.6      11       3     100       29
       SOFAR-GL   16.6 (11.4)      94.4 (67.5)       0.4      1.1      3.6   41.7      0
       SRRR       11.0 (6.7)       63.1 (40.2)       0.6      0.3      3     100       14.8
4      OLS        252.3 (78)       126.5 (65.4)      100      0        –     –         –
       Lasso      37.4 (11.8)      73.2 (24.1)       0.8      2.5      –     –         –
       RRR        186.6 (51.6)     6.1 (3.9)         100      0        3     99        0
       SOFAR-L    0.1 (0.1)        0.3 (0.2)         0.1      4.8      3     92.7      0.1
       RSSVD      1.0 (0.7)        2.2 (1.3)         0.3      11.5     3     100       40.1
       SOFAR-GL   0.1 (0.0)        0.2 (0.1)         0        0.1      3     100       0
       SRRR       0.8 (0.5)        2.0 (1.2)         24.9     0.2      3     100       31.3

¹ Adaptive versions of Lasso, SOFAR-L, RSSVD, SOFAR-GL, and SRRR were applied. Means of performance measures with standard deviations in parentheses over 300 replicates are reported. MSE-Est values are multiplied by 10^4 in Model 3 and 10^5 in Model 4, and MSE-Pred values are multiplied by 10^3.

TABLE IV.

Simulation results for Model 5, which uses Model 2 with dimensions increased to p = 1000 and q = 400 by adding noise variables.¹

Model  Method     MSE-Est        MSE-Pred        FPR (%)  FNR (%)  Rank  Rank (%)  Orth
5      OLS        151.5 (5.7)    230.1 (122.9)   100      0        –     –         –
       Lasso      3.9 (1.8)      29.3 (11.8)     0.6      0        –     –         –
       RRR        146.8 (7.7)    61.5 (77.1)     100      0        2.6   57.7      0
       SOFAR-L    0.1 (0.0)      0.1 (0.0)       0        0        3     100       0
       RSSVD      6.6 (14.4)     2.8 (2.7)       3.1      1        3     99        49.1
       SOFAR-GL   0.1 (0.0)      0.2 (0.1)       0.8      0        3     100       0
       SRRR       0.5 (0.2)      3.6 (1.8)       19.7     0        3     100       55.5

¹ Adaptive versions of Lasso, SOFAR-L, RSSVD, SOFAR-GL, and SRRR were applied. Means of performance measures with standard deviations in parentheses over 300 replicates are reported. MSE-Est values are multiplied by 10^5 and MSE-Pred values are multiplied by 10^3.

Finally, Table V summarizes the results for Models 6 and 7. As expected, in Model 6, where the model dimensionality is low, SOFAR-GL performs the best, followed by RSSVD and SOFAR-L, whose performances are comparable to each other. In Model 7, where the model dimensionality is much higher, SOFAR-GL and SOFAR-L perform much better than RSSVD, as the latter becomes less stable. In both models, SRRR is outperformed by SOFAR since the former does not pursue sparsity in V. We have also experimented with other modified settings, and all the results are consistent with those reported here. These results confirm that it is still preferable to apply SOFAR even when the underlying sparse factors are not exactly orthogonal, especially in high-dimensional problems.

TABLE V.

Simulation results for Models 6–7, where the sparse factors are not exactly orthogonal.¹

Model  Method     MSE-Est           MSE-Pred         FPR (%)  FNR (%)  Rank  Rank (%)  Orth
6      OLS        291.7 (133.9)     868.5 (403.1)    100      0        –     –         –
       Lasso      11.8 (4.7)        71.9 (28.4)      10.3     0        –     –         –
       RRR        17.6 (7.3)        69.5 (31.2)      100      0        3     100       0
       SOFAR-L    2.2 (0.9)         14.1 (6.3)       8        0.1      3     100       0
       RSSVD      2.0 (0.7)         13.6 (6.1)       7.3      0.1      3     100       22.7
       SOFAR-GL   1.2 (0.5)         8.9 (3.9)        9.5      0        3     100       0
       SRRR       3.5 (1.1)         28.7 (13.1)      36.2     0        3     100       5.6
7      OLS        1045.1 (122.4)    886.5 (403.4)    100      0        –     –         –
       Lasso      33.2 (12.7)       79.9 (24.4)      3.2      0        –     –         –
       RRR        754.0 (67.8)      35.2 (15.4)      100      0        3     99.7      0
       SOFAR-L    1.2 (0.5)         3.1 (1.5)        2.4      0.2      3     100       0
       RSSVD      4.8 (4.1)         8.5 (5.1)        2.4      1.7      3     99.7      64.4
       SOFAR-GL   1.0 (0.4)         2.6 (1.1)        3.6      0        3     100       0
       SRRR       4.3 (1.5)         13.9 (6.2)       20       0        3     100       45

¹ Adaptive versions of Lasso, SOFAR-L, RSSVD, SOFAR-GL, and SRRR were applied. Means of performance measures with standard deviations in parentheses over 300 replicates are reported. MSE-Est values are multiplied by 10^4 in Model 6 and 10^5 in Model 7, and MSE-Pred values are multiplied by 10^3.

B. Real data analysis

In genetical genomics experiments, gene expression levels are treated as quantitative traits in order to identify expression quantitative trait loci (eQTLs) that contribute to phenotypic variation in gene expression. The task can be regarded as a multivariate regression problem with the gene expression levels as responses and the genetic variants as predictors, where both responses and predictors are often of high dimensionality. Most existing methods for eQTL data analysis exploit entrywise or rowwise sparsity of the coefficient matrix to identify individual genetic effects or master regulators [54], which not only tends to suffer from low detection power for multiple eQTLs that combine to affect a subset of gene expression traits, but also may offer little information about the functional grouping structure of the genetic variants and gene expressions. By exploiting a sparse SVD structure, the SOFAR method is particularly appealing for such applications, and may provide new insights into the complex genetics of gene expression variation. Here the orthogonality can be roughly interpreted as maximum separability, so that different association layers are more likely to reflect different functional pathways.

We illustrate our approach with the analysis of a yeast eQTL data set described by Brem and Kruglyak [14], where n = 112 segregants were grown from a cross between two budding yeast strains, BY4716 and RM11–1a. For each of the segregants, gene expression was profiled on microarrays containing 6216 genes, and genotyping was performed at 2957 markers. Similar to Yin and Li [64], we combined the markers into blocks such that markers within the same block differed by at most one sample, and one representative marker was chosen from each block; a marginal gene–marker association analysis was then performed to identify markers that are associated with the expression levels of at least two genes at a p-value less than 0.05, resulting in a total of p = 605 markers.
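
As a rough illustration of the marginal screening step (not the authors' exact procedure), the sketch below keeps a marker if a two-sample t-test of expression between its two genotype groups yields p < 0.05 for at least two genes; the choice of test and all function names are our assumptions, and the block-combining of near-identical markers is taken as already done.

```python
import numpy as np
from scipy import stats

def screen_markers(G, Y_expr, p_cut=0.05, min_genes=2):
    """Marginal gene-marker screening: keep a marker if its genotype groups
    differ in expression (two-sample t-test, p < p_cut) for >= min_genes genes.

    G: (n, m) genotypes coded 0/1 for the two parental strains;
    Y_expr: (n, n_genes) expression matrix.
    """
    keep = []
    for j in range(G.shape[1]):
        g = G[:, j]
        if g.min() == g.max():          # monomorphic marker, nothing to test
            continue
        _, pvals = stats.ttest_ind(Y_expr[g == 0], Y_expr[g == 1], axis=0)
        if np.sum(pvals < p_cut) >= min_genes:
            keep.append(j)
    return keep
```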

Owing to the small sample size and weak genetic perturbations, we focused our analysis on q = 54 genes in the yeast MAPK signaling pathways [41]. We then applied the proposed SOFAR methods with adaptive weighting. Both the SOFAR-L and SOFAR-GL methods resulted in a model of rank 3, indicating that dimension reduction is very effective for this data set. Also, the SVD layers estimated by the SOFAR methods are indeed sparse. The SOFAR-L estimates include 140 nonzeros in $\hat{U}$, which involve only 112 markers, and 40 nonzeros in $\hat{V}$, which involve only 27 genes. The sparse SVD produced by SOFAR-GL involves only 34 markers and 15 genes. The SOFAR-GL method is more conservative since it tends to identify markers that regulate all selected genes rather than a subset of genes involved in a specific SVD layer. We compare the original gene expression matrix $Y$ and its estimates $X\hat{C}$ by the RRR, SOFAR-L, and SOFAR-GL methods using heat maps in Fig. 2. It is seen that the SOFAR methods achieve both low-rankness and sparsity while still capturing the main patterns in the original matrix.

Fig. 2. Heat maps of $Y$ and its estimates by RRR, SOFAR-L, and SOFAR-GL (from left to right).

Fig. 3 shows the scatterplots of the latent responses $Y\hat{v}_j$ versus the latent predictors $X\hat{u}_j$ for $j = 1, 2, 3$, where $\hat{u}_j$ and $\hat{v}_j$ are the $j$th columns of $\hat{U}$ and $\hat{V}$, respectively. The plots demonstrate a strong association between each pair of latent variables, with the association strength descending from layer 1 to layer 3. A closer look at the SVD layers reveals further information about clustered samples and genes. The plot for layer 1 indicates that the yeast samples form two clusters, suggesting that our method may be useful for classification based on the latent variables. Also, examining the nonzero entries in $\hat{v}_1$ shows that this layer is dominated by four genes, namely, STE3 (−0.66), STE2 (0.59), MFA2 (0.40), and MFA1 (0.22). All four genes are upstream in the pheromone response pathway, where MFA2 and MFA1 are genes encoding mating pheromones and STE3 and STE2 are genes encoding pheromone receptors [24]. The second layer is mainly dominated by CTT1 (−0.93), and other leading genes include SLN1 (0.16), SLT2 (−0.14), MSN4 (−0.14), and GLO1 (−0.13). Interestingly, CTT1, MSN4, and GLO1 are all downstream genes linked to the upstream gene SLN1 in the high osmolarity/glycerol pathway required for survival in response to hyperosmotic stress. Finally, layer 3 includes the leading genes FUS1 (0.81), FAR1 (0.32), STE2 (0.25), STE3 (0.24), GPA1 (0.22), FUS3 (0.18), and STE12 (0.11). These genes consist of two major groups that are downstream (FUS1, FAR1, FUS3, and STE12) and upstream (STE2, STE3, and GPA1) in the pheromone response pathway. Overall, our results suggest that there are common genetic components shared by the expression traits of the clustered genes and clearly reveal strong associations between the upstream and downstream genes on several signaling pathways, which are consistent with the current functional understanding of the MAPK signaling pathways.
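
The layer-wise scores and leading genes can be read off from the estimated factors as in the following sketch; the helper name layer_summary and the top-k truncation are ours, and gene_names is assumed to list the 54 MAPK-pathway genes in the column order of $Y$.

```python
import numpy as np

def layer_summary(X, Y, U_hat, V_hat, gene_names, j, top=5):
    """Latent scores X u_j and Y v_j for SVD layer j, plus the genes with the
    largest nonzero loadings in v_j (how the layer-wise summaries are read off)."""
    xu, yv = X @ U_hat[:, j], Y @ V_hat[:, j]
    order = np.argsort(-np.abs(V_hat[:, j]))[:top]
    leaders = [(gene_names[k], round(float(V_hat[k, j]), 2))
               for k in order if V_hat[k, j] != 0]
    return xu, yv, leaders
```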

Fig. 3. Scatterplots of the latent responses versus the latent predictors in three SVD layers for the yeast data, estimated by the SOFAR-L method.

To examine the predictive performance of SOFAR and other competing methods, we randomly split the data into a training set of size 92 and a test set of size 20. The model was fitted on the training set, and the predictive accuracy was evaluated on the test set based on the prediction error $\|Y - X\hat{C}\|_F^2/(nq)$. The splitting process was repeated 50 times. The scaled prediction errors for the RRR, SOFAR-L, SOFAR-GL, RSSVD, and SRRR methods are 3.4 (0.3), 2.6 (0.2), 2.5 (0.2), 2.9 (0.3), and 2.6 (0.2), respectively. The comparison shows the advantages of sparse and low-rank estimation. RSSVD yields a higher prediction error and is less stable than the SOFAR methods. Although the SRRR method yielded similar predictive accuracy to the SOFAR methods on this data set, it resulted in a less parsimonious model and cannot be used for gene selection or clustering.
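
A sketch of this repeated random-split evaluation is given below, assuming a generic fitter fit_fn(X_train, Y_train) that returns an estimated coefficient matrix; the names and interface are hypothetical placeholders for the methods being compared.

```python
import numpy as np

def split_prediction_error(fit_fn, X, Y, n_train=92, n_splits=50, seed=0):
    """Repeated random-split evaluation of the scaled prediction error
    ||Y_test - X_test C_hat||_F^2 / (n_test * q)."""
    rng = np.random.default_rng(seed)
    n, q = Y.shape
    errs = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        C_hat = fit_fn(X[tr], Y[tr])                  # refit on the training split
        errs.append(np.linalg.norm(Y[te] - X[te] @ C_hat, "fro") ** 2 / (len(te) * q))
    return float(np.mean(errs)), float(np.std(errs))
```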

Supplementary Material

sofarfinal

Acknowledgments

This work was supported by Grant-in-Aid for JSPS Fellows 26-1905, NIH Grant 1R01GM131407-01, NSF CAREER Awards DMS-0955316 and DMS-1150318, NIH grant U01 HL114494, NSF grants DMS-1613295 and IIS-1718798, a grant from the Simons Foundation, an Adobe Data Science Research Award, NSFC grants 11671018 and 71532001, and National Key R&D Program of China grant 2016YFC0207703. Most of this work was completed while Uematsu visited USC Marshall as a JSPS Overseas Research Fellow and Postdoctoral Scholar. Part of this work was completed while Fan and Lv visited the Departments of Statistics at the University of California, Berkeley and Stanford University. These authors sincerely thank both departments for their hospitality. The authors would also like to thank the Associate Editor and referees for their valuable comments, which helped improve the article substantially.

References

  • [1] Anderson TW (1951) Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann. Math. Statist., 22, 327–351.
  • [2] Bai J (2003) Inferential theory for factor models of large dimensions. Econometrica, 71, 135–171.
  • [3] Bai J and Li K (2012) Statistical analysis of factor models of high dimension. Ann. Statist., 40, 436–465.
  • [4] Bai J and Li K (2016) Maximum likelihood estimation and inference for approximate factor models of high dimension. Review of Economics and Statistics, 98, 298–309.
  • [5] Bai J and Ng S (2002) Determining the number of factors in approximate factor models. Econometrica, 70, 191–221.
  • [6] Bai J and Ng S (2008) Large dimensional factor analysis. Foundns Trends Econmetr., 3, 89–163.
  • [7] Bai J and Ng S (2013) Principal components estimation and identification of static factors. Journal of Econometrics, 176, 18–29. URL https://ideas.repec.org/a/eee/econom/v176y2013i1p18-29.html.
  • [8] Basu S and Michailidis G (2015) Regularized estimation in sparse high-dimensional time series models. Ann. Statist., 43, 1535–1567.
  • [9] Benidis K, Sun Y, Babu P and Palomar DP (2016) Orthogonal sparse PCA and covariance estimation via Procrustes reformulation. IEEE Trans. on Signal Processing, 64, 6211–6226.
  • [10] Bernanke BS, Boivin J and Eliasz P (2005) Measuring the effects of monetary policy: A factor-augmented vector autoregressive (FAVAR) approach. Q. J. Econ., 120, 387–422.
  • [11] Bickel P, Ritov Y and Tsybakov A (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37, 1705–1732.
  • [12] Box GEP and Tiao GC (1977) A canonical analysis of multiple time series. Biometrika, 64, 355–365.
  • [13] Boyd S, Parikh N, Chu E, Peleato B and Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundns Trends Mach. Learn., 3, 1–122.
  • [14] Brem RB and Kruglyak L (2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natn. Acad. Sci. USA, 102, 1572–1577.
  • [15] Bunea F, She Y and Wegkamp MH (2011) Optimal selection of reduced rank estimators of high-dimensional matrices. Ann. Statist., 39, 1282–1309.
  • [16] Bunea F, She Y and Wegkamp MH (2012) Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. Ann. Statist., 40, 2359–2388.
  • [17] Busygin S, Prokopyev O and Pardalos PM (2008) Biclustering in data mining. Comput. Oper. Res., 35, 2964–2987.
  • [18] Cai TT, Li H, Liu W and Xie J (2013) Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 100, 139–156.
  • [19] Chen K, Chan K-S and Stenseth NC (2012) Reduced rank stochastic regression with a sparse singular value decomposition. J. R. Statist. Soc. B, 74, 203–221.
  • [20] Chen K, Chan K-S and Stenseth NC (2014) Source-sink reconstruction through regularized multicomponent regression analysis–with application to assessing whether North Sea cod larvae contributed to local fjord cod in Skagerrak. Journal of the American Statistical Association, 109, 560–573.
  • [21] Chen K, Dong H and Chan K-S (2013) Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100, 901–920.
  • [22] Chen L and Huang JZ (2012) Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Am. Statist. Ass., 107, 1533–1545.
  • [23] Chen L and Huang JZ (2016) Sparse reduced-rank regression with covariance estimation. Statistics and Computing, 26, 461–470.
  • [24] Chen RE and Thorner J (2007) Function and regulation in MAPK signaling pathways: Lessons learned from the yeast Saccharomyces cerevisiae. Biochim. Biophys. Acta, 1773, 1311–1340.
  • [25] d’Aspremont A, El Ghaoui L, Jordan MI and Lanckriet GRG (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev., 49, 434–448.
  • [26] Edelman A, Arias TA and Smith ST (1998) The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20, 303–353.
  • [27] Fan J, Fan Y and Barut E (2014) Adaptive robust variable selection. Ann. Statist., 42, 324–351.
  • [28] Fan J, Fan Y and Lv J (2008) High dimensional covariance matrix estimation using a factor model. J. Econmetr., 147, 186–197.
  • [29] Fan J and Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96, 1348–1360.
  • [30] Fan Y and Lv J (2013) Asymptotic equivalence of regularization methods in thresholded parameter space. J. Am. Statist. Ass., 108, 1044–1061.
  • [31] Fan Y and Tang CY (2013) Tuning parameter selection in high dimensional penalized likelihood. J. R. Statist. Soc. B, 75, 531–552.
  • [32] Friedman J, Hastie T, Höfling H and Tibshirani R (2007) Pathwise coordinate optimization. Ann. Appl. Statist., 1, 302–332.
  • [33] Goh G, Dey DK and Chen K (2017) Bayesian sparse reduced rank multivariate regression. Journal of Multivariate Analysis, 157, 14–28.
  • [34] Golub GH and Van Loan CF (2013) Matrix Computations. Baltimore: The Johns Hopkins University Press, 4th edn.
  • [35] Guo J, James G, Levina E, Michailidis G and Zhu J (2010) Principal component analysis with sparse fused loadings. J. Computnl Graph. Statist., 19, 930–946.
  • [36] Gustin MC, Albertyn J, Alexander M and Davenport K (1998) MAP kinase pathways in the yeast Saccharomyces cerevisiae. Microbiology and Molecular Biology Reviews, 62, 1264–1300.
  • [37] Hartigan JA (1972) Direct clustering of a data matrix. J. Am. Statist. Ass., 67, 123–129.
  • [38] Hsu N-J, Hung H-L and Chang Y-M (2008) Subset selection for vector autoregressive processes using Lasso. Computnl Statist. Data Anal., 52, 3645–3657.
  • [39] Izenman AJ (1975) Reduced-rank regression for the multivariate linear model. J. Multiv. Anal., 5, 248–264.
  • [40] Johnstone IM and Lu AY (2009) On consistency and sparsity for principal components analysis in high dimensions. J. Am. Statist. Ass., 104, 682–703.
  • [41] Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M and Tanabe M (2014) Data, information, knowledge and principle: Back to metabolism in KEGG. Nucleic Acids Res., 42, D199–D205.
  • [42] Kock A and Callot L (2015) Oracle inequalities for high dimensional vector autoregressions. Journal of Econometrics, 186, 325–344.
  • [43] Koschat MA and Swayne DF (1991) A weighted Procrustes criterion. Psychometrika, 56, 229–239.
  • [44] Lee M, Shen H, Huang JZ and Marron JS (2010) Biclustering via sparse singular value decomposition. Biometrics, 66, 1087–1095.
  • [45] Leng C and Wang H (2009) On general adaptive sparse principal component analysis. J. Computnl Graph. Statist., 18, 201–215.
  • [46] Lian H, Feng S and Zhao K (2015) Parametric and semiparametric reduced-rank regression with flexible sparsity. Journal of Multivariate Analysis, 136, 163–174.
  • [47] Lv J (2013) Impacts of high dimensionality in finite samples. Ann. Statist., 41, 2236–2262.
  • [48] Ma X, Xiao L and Wong WH (2014) Learning regulatory programs by threshold SVD regression. Proceedings of the National Academy of Sciences of the United States of America, 111, 15675–15680.
  • [49] Ma Z, Ma Z and Sun T (2014) Adaptive estimation in two-way sparse reduced-rank regression. ArXiv e-prints arXiv:1403.1922.
  • [50] Ma Z and Sun T (2014) Adaptive sparse reduced-rank regression. ArXiv e-prints arXiv:1403.1922.
  • [51] Mirsky L (1960) Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of Mathematics, 11, 50–59.
  • [52] Nardi Y and Rinaldo A (2011) Autoregressive process modeling via the Lasso procedure. J. Multiv. Anal., 102, 528–549.
  • [53] Negahban SN, Ravikumar P, Wainwright MJ and Yu B (2012) A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27, 538–557.
  • [54] Peng J, Zhu J, Bergamaschi A, Han W, Noh D-Y, Pollack JR and Wang P (2010) Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Statist., 4, 53–77.
  • [55] Reinsel GC and Velu RP (1998) Multivariate Reduced-Rank Regression: Theory and Applications. New York: Springer.
  • [56] Sha F, Lin Y, Saul LK and Lee DD (2007) Multiplicative updates for nonnegative quadratic programming. Neur. Computn, 19, 2004–2031.
  • [57] Shen H and Huang JZ (2008) Sparse principal component analysis via regularized low rank matrix approximation. J. Multiv. Anal., 99, 1015–1034.
  • [58] Stock JH and Watson MW (2001) Vector autoregressions. J. Econ. Perspect., 15, 101–115.
  • [59] Stock JH and Watson MW (2002) Forecasting using principal components from a large number of predictors. J. Am. Statist. Ass., 97, 1167–1179.
  • [60] Tibshirani R (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267–288.
  • [61] Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optimizn Theor. Appl., 109, 475–494.
  • [62] Velu RP, Reinsel GC and Wichern DW (1986) Reduced rank models for multiple time series. Biometrika, 73, 105–118.
  • [63] Witten DM, Tibshirani R and Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10, 515–534.
  • [64] Yin J and Li H (2011) A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Statist., 5, 2630–2650.
  • [65] Yu Y, Wang T and Samworth R (2015) A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102, 315–323.
  • [66] Yuan M, Ekici A, Lu Z and Monteiro R (2007) Dimension reduction and coefficient estimation in multivariate linear regression. J. R. Statist. Soc. B, 69, 329–346.
  • [67] Zhang C-H (2010) Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38, 894–942.
  • [68] Zhang Z, Zha H and Simon H (2002) Low-rank approximations with sparse factors I: Basic algorithms and error analysis. SIAM J. Matrix Anal. Appl., 23, 706–727.
  • [69] Zheng Z, Fan Y and Lv J (2014) High dimensional thresholded regression and shrinkage effect. Journal of the Royal Statistical Society Series B, 76, 627–649.
  • [70] Zhu H, Khondker Z, Lu Z and Ibrahim JG (2014) Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers. Journal of the American Statistical Association, 109, 977–990.
  • [71] Zou H (2006) The adaptive lasso and its oracle properties. J. Am. Statist. Ass., 101, 1418–1429.
  • [72] Zou H, Hastie T and Tibshirani R (2006) Sparse principal component analysis. J. Computnl Graph. Statist., 15, 265–286.
  • [73] Zou H and Li R (2008) One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist., 36, 1509–1533.
