Published in final edited form as: J Comput Graph Stat. 2022 Oct 14;32(3):974–984. doi: 10.1080/10618600.2022.2119986

Stability Approach to Regularization Selection for Reduced-Rank Regression

Canhong Wen 1, Qin Wang 1, Yuan Jiang 2

Abstract

The reduced-rank regression model is a popular model to deal with multivariate response and multiple predictors, and is widely used in biology, chemometrics, econometrics, engineering, and other fields. In the reduced-rank regression modelling, a central objective is to estimate the rank of the coefficient matrix that represents the number of effective latent factors in predicting the multivariate response. Although theoretical results such as rank estimation consistency have been established for various methods, in practice rank determination still relies on information criterion based methods such as AIC and BIC or subsampling based methods such as cross validation. Unfortunately, the theoretical properties of these practical methods are largely unknown. In this paper, we present a novel method called StARS-RRR that selects the tuning parameter and then estimates the rank of the coefficient matrix for reduced-rank regression based on the stability approach. We prove that StARS-RRR achieves rank estimation consistency, i.e., the rank estimated with the tuning parameter selected by StARS-RRR is consistent to the true rank. Through a simulation study, we show that StARS-RRR outperforms other tuning parameter selection methods including AIC, BIC, and cross validation as it provides the most accurate estimated rank. In addition, when applied to a breast cancer dataset, StARS-RRR discovers a reasonable number of genetic pathways that affect the DNA copy number variations and results in a smaller prediction error than the other methods with a random-splitting process.

Keywords: Rank estimation consistency, Reduced-rank regression, Stability approach, Tuning parameter selection

1. Introduction

To model the relationship between two sets of multivariate data is of great significance in both theory and practice. One commonly used model for two sets of multivariate data is the multivariate regression model, y = Cx + e, which assumes a linear relationship between the multivariate predictor x and multivariate response y, with a coefficient matrix C and random errors e. An additional natural assumption of this model is that the coefficient matrix C has a low-rank structure, which further defines the reduced-rank regression model

$y = Cx + e, \quad \operatorname{rank}(C) \leq k$, (1)

where k is a given integer. Since Anderson (1951) proposed the reduced-rank regression model, it has been widely used in biology, econometrics, image science and many other fields (Anderson, 2002a; Kobak et al., 2019; Wen et al., 2020).

Many methods have been proposed to estimate the coefficient matrix in reduced-rank regression. Among them, Bunea et al. (2011) proposed a penalized least squares estimator of C with the l0-norm penalty of the singular values of C, based on the fact that the rank of a matrix is equal to the number of its non-zero singular values. Motivated by the application of the l1-norm penalty to variable selection, Yuan et al. (2007) proposed a penalized least squares estimator of C with the nuclear norm penalty of the coefficient matrix, which is the l1-norm of the singular values of C. To reduce estimation bias caused by the nuclear norm penalty, Chen et al. (2013) introduced an adaptive nuclear norm penalty and showed that their estimator achieves a flexible bias-variance trade-off: a large singular value receives a small penalty to control bias, and a small singular value receives a large penalty to induce sparsity. Other reduced-rank regression methods include sparse reduced-rank regression that further imposes sparsity on the columns of the coefficient matrix (Chen and Huang, 2012) and co-sparse reduced-rank regression that imposes sparsity on both the rows and columns of the coefficient matrix (Wen et al., 2020).

In the study of low-rank matrix estimation, rank determination has always been a key issue (Kanagal and Sindhwani, 2010; Ashraphijuo et al., 2017; Kong, 2020). The influence of rank determination on coefficient estimation has been studied by Anderson (2002b): when the rank is underestimated, it leads to estimation bias; when the rank is overestimated, the variance of the estimator can be unnecessarily large. To effectively determine the rank of the coefficient matrix, all the aforementioned methods rely on tuning parameters that need to be selected. In fact, theoretical results such as rank estimation consistency have been established (Bunea et al., 2011; Chen et al., 2013) when the tuning parameters are chosen to meet certain theoretical conditions. However, these conditions usually involve unknown model parameters and thus are hard to verify in real applications. In practice, tuning parameter selection still relies heavily on empirical methods, such as information criteria and subsampling methods.

The most well-known information criteria are probably the Akaike Information Criterion (AIC, Akaike, 1974) and the Bayesian Information Criterion (BIC, Schwarz et al., 1978). They are both widely used in reduced-rank regression (Corander and Villani, 2004; Chen et al., 2012; Bernardini and Cubadda, 2015). In addition, generalized cross validation (GCV) and generalized information criterion (GIC), proposed by Golub et al. (1979) and Fan and Tang (2013) respectively for linear models, are also often used in reduced-rank regression (Yuan et al., 2007; Chen, 2016). She (2017) considered a selective reduced-rank regression model that possesses both low-rank and sparse structure, and proposed a predictive information criterion (PIC) for tuning parameter selection. This method can also be applied to reduced-rank regression. Other information criteria include an extension of BIC for high-dimensional data named BICP (An et al., 2008). In addition, rank estimation in reduced-rank regression can be further improved by better estimating its degrees of freedom. A better estimation of the degrees of freedom can improve the performance of information criteria as they often involve such an estimator. For more details, please refer to Mukherjee et al. (2015) and Yuan (2016).

The subsampling methods are represented by cross validation, which has also been widely used in variable selection and reduced-rank regression (Mukherjee and Zhu, 2011; Ulfarsson and Solo, 2013; Jiang et al., 2016). Although these methods have been repeatedly used, their theoretical properties are largely unknown and still need to be studied. One such effort is She and Tran (2019) that investigated cross validation for sparse reduced-rank regression and found that conventional cross validation may be associated with inconsistent models on different training sets. Instead of cross validating the tuning parameter, She and Tran (2019) proposed to cross validate the sparsity structure to maintain the same model in different trainings and validations.

Another type of subsampling method for high-dimensional data is the stability approach. Different from cross validation and generalized cross validation, which evaluate the prediction accuracy of a model, stability approaches focus on the stability of a model across subsamples. Stability selection (Meinshausen and Bühlmann, 2010) is the first stability approach proposed for high-dimensional data and has a wide range of applicability, such as variable selection, graphical modelling, and cluster analysis. Furthermore, Liu et al. (2010) proposed the Stability Approach to Regularization Selection (StARS) method to select the tuning parameter in graphical models. Compared to other tuning parameter selection methods such as AIC, BIC, and cross validation, StARS enjoys both theoretical and empirical advantages in graphical models. Further studies of stability approaches include Shah and Samworth (2013), Yu (2013), and Sun et al. (2013), among others. Although the stability approach has been shown to be powerful for variable selection and graphical models, its theoretical and empirical performance has yet to be studied in reduced-rank regression. This is the main motivation of our work.

In this article, we propose a new tuning parameter selection method for reduced-rank regression based on the stability approach. To appropriately apply the idea of the stability approach, we define a new concept of instability specifically for reduced-rank regression models, namely, the sample variance of the estimated rank from the subsamples. In conjunction with the newly defined instability, we propose a new algorithm to select the tuning parameter based on the behavior of the instability along a grid of increasing tuning parameters. We call the new method the Stability Approach to Regularization Selection for Reduced-Rank Regression (StARS-RRR). Theoretically, we establish the consistency of rank estimation for StARS-RRR: the estimated rank is equal to the true rank with probability tending to one, a result that is stronger than the partial sparsistency property established for StARS (Liu et al., 2010). Empirically, we show that StARS-RRR outperforms information criteria and other subsampling methods for both simulated and real data. In simulated data, StARS-RRR recovers the rank correctly in most replications and leads to the smallest bias as long as the signal-to-noise ratio is not extremely small; in real data, StARS-RRR discovers a reasonable number of relationships between copy number variations and gene expressions among breast cancer patients and results in a smaller prediction error with a random-splitting process than the other methods.

2. Methodology and Algorithm

2.1. Adaptive nuclear norm penalization

In the following subsections, we will introduce the methodology and algorithm for StARS-RRR. As seen later, StARS-RRR is a general framework that can be applied to any reduced-rank regression method. For the purpose of illustration, we will use the adaptive nuclear norm penalization method (Chen et al., 2013) as an example to introduce StARS-RRR.

Denote $Y = (Y_1, \ldots, Y_n)^T \in \mathbb{R}^{n \times q}$ as the response, $X = (X_1, \ldots, X_n)^T \in \mathbb{R}^{n \times p}$ as the design matrix, and assume that they follow a multivariate linear model

$Y = XC + E$, (2)

where $C \in \mathbb{R}^{p \times q}$ is the coefficient matrix that is often assumed to have a low rank and $E \in \mathbb{R}^{n \times q}$ is the error matrix. The adaptive nuclear norm penalization method aims to estimate the coefficient matrix C by considering the following optimization problem

$\hat{C}_\lambda = \operatorname*{arg\,min}_{C \in \mathbb{R}^{p \times q}} \frac{1}{2}\|Y - XC\|_F^2 + \lambda \|XC\|_{*,w}$. (3)

In (3), $\|\cdot\|_F$ denotes the Frobenius norm and $\|XC\|_{*,w} = \sum_{i=1}^{p \wedge q} w_i d_i(XC)$ denotes the adaptive nuclear norm of the matrix XC with $w_i = d_i^{-\gamma}(PY)$, where $d_i(M)$ denotes the i-th largest singular value of a matrix M, $P = X(X^TX)^{-}X^T$, and γ ≥ 0. The above optimization leads to an explicit form for the rank of the estimated coefficient matrix $\hat{C}_\lambda$ (Chen et al., 2013):

$\hat{r}_\lambda = \max\{r : d_r(PY) > \lambda^{1/(\gamma+1)}\}$. (4)

Based on the above explicit form, it was shown that the adaptive nuclear norm penalization method recovers the true rank of C with high probability if the error matrix E has independent N(0, σ²) entries and the tuning parameter λ satisfies the condition $\lambda = \{(1+\theta)(\sqrt{r_x}+\sqrt{q})\,\sigma/\delta\}^{\gamma+1}$ for some θ > 0, where $r_x$ is the rank of X and δ is a constant depending on the singular values of XC. However, this theoretical result may not be used to select λ in practice because it involves the unknown parameters σ and δ. Thus, it is essential to use a practical and data-driven procedure to select the tuning parameter λ.
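For concreteness, the rank formula in (4) amounts to counting how many singular values of PY exceed the threshold $\lambda^{1/(\gamma+1)}$. Below is a minimal sketch in Python/NumPy, assuming X and Y are stored as arrays; the function name and the default γ = 2 are our own illustrative choices, not prescribed by the paper.

```python
# A sketch of the rank estimate in (4); NumPy only. The default gamma = 2 and
# the function name are illustrative assumptions, not taken from the paper.
import numpy as np

def estimated_rank(X, Y, lam, gamma=2.0):
    """r_hat(lambda) = max{ r : d_r(PY) > lambda^(1/(gamma+1)) }."""
    # Projection onto the column space of X; pinv handles rank-deficient X.
    P = X @ np.linalg.pinv(X.T @ X) @ X.T
    d = np.linalg.svd(P @ Y, compute_uv=False)   # d_1 >= d_2 >= ... (descending)
    return int(np.sum(d > lam ** (1.0 / (gamma + 1.0))))
```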

2.2. Tuning parameter selection for reduced-rank regression

In this subsection, we will review existing methods to select tuning parameters for reduced-rank regression. Roughly, there are two types of tuning parameter selection methods, one based on information criteria and the other based on subsampling techniques. Given a tuning parameter λ, denote the estimator of (3) as $\hat{C}_\lambda$ and the corresponding rank as $\hat{r}_\lambda$. For an information criterion based method, one chooses an optimal λ by minimizing a function of the sum of squared errors and the degrees of freedom. Table 1 shows some widely used information criteria.

Table 1.

Information criterion based methods for tuning parameter selection in (3).

Information Criterion Mathematical Formula
AIC $nq\log\left(\|Y - X\hat{C}_\lambda\|_F^2/(nq)\right) + 2\,\hat{r}_\lambda(r_x + q - \hat{r}_\lambda)$
BIC $nq\log\left(\|Y - X\hat{C}_\lambda\|_F^2/(nq)\right) + \log(nq)\,\hat{r}_\lambda(r_x + q - \hat{r}_\lambda)$
GIC $nq\log\left(\|Y - X\hat{C}_\lambda\|_F^2/(nq)\right) + \log\log(nq)\,\log(pq)\,\hat{r}_\lambda(r_x + q - \hat{r}_\lambda)$
BICP $nq\log\left(\|Y - X\hat{C}_\lambda\|_F^2/(nq)\right) + 2\log(pq)\,\hat{r}_\lambda(r_x + q - \hat{r}_\lambda)$
GCV $nq\,\|Y - X\hat{C}_\lambda\|_F^2 / \{nq - \hat{r}_\lambda(r_x + q - \hat{r}_\lambda)\}^2$
PIC $\|Y - X\hat{C}_\lambda\|_F^2 / \{nq - 2\,\hat{r}_\lambda(r_x + q - \hat{r}_\lambda)\}$

Information criteria play an important role in model selection and model assessment, and are generally composed of an error term and a degrees-of-freedom term to balance the goodness of fit and the model complexity. They are a popular choice for tuning parameter selection in reduced-rank regression, yet they may not be able to accurately estimate the rank when the true rank is low (Velu and Reinsel, 2013). Information criteria like AIC and BIC tend to overestimate the rank and their performance deteriorates with small sample sizes. In addition to AIC and BIC, other information criteria such as GCV, GIC, and BICP have also been used to estimate the rank in reduced-rank regression (Yuan et al., 2007; Chen, 2016), since they were initially proposed for linear models. Proposed for selective reduced-rank regression models with both low-rank and sparse structures, PIC can also be simplified for reduced-rank regression models to the form in Table 1. For PIC, She (2017) established a non-asymptotic oracle inequality for its estimation error rate without assuming an infinite sample size.
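As an illustration of how the Table 1 criteria are used, the sketch below evaluates the AIC and BIC forms (as reconstructed above) for a fitted coefficient matrix; the optimal λ is then the grid value minimizing the chosen criterion. The function name and argument layout are our own.

```python
import numpy as np

def rrr_aic_bic(Y, X, C_hat, r_hat, r_x):
    """AIC and BIC of a reduced-rank fit, following the Table 1 forms, with
    degrees of freedom df = r_hat * (r_x + q - r_hat)."""
    n, q = Y.shape
    rss = np.linalg.norm(Y - X @ C_hat, "fro") ** 2
    df = r_hat * (r_x + q - r_hat)
    aic = n * q * np.log(rss / (n * q)) + 2 * df
    bic = n * q * np.log(rss / (n * q)) + np.log(n * q) * df
    return aic, bic
```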

The subsampling methods include cross validation, which is also one of the most popular methods applied in reduced-rank regression (Ulfarsson and Solo, 2013; Kobak et al., 2019). K-fold cross validation randomly divides the data set into K subsets of the same size. Each time one subset is reserved as the test set and the other K – 1 subsets serve as the training set. Then, the model with a given tuning parameter is trained on the training set and tested on the test set. Last, the average test error of the K models is taken as the cross validation score. The tuning parameter with the smallest cross validation score is taken as the selected parameter. She and Tran (2019) investigated cross validation for sparse reduced-rank regression and found that conventional cross validation may be associated with inconsistent models on different training sets. Instead of cross validating the tuning parameter, She and Tran (2019) proposed to cross validate the sparsity structure to maintain the same model in different trainings and validations.
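The following is a minimal sketch of this K-fold procedure for selecting λ, assuming a user-supplied `fit(X_train, Y_train, lam)` that returns a coefficient matrix (for example, a wrapper around the adaptive nuclear norm estimator); one would evaluate `cv_score` over a grid of λ values and keep the minimizer.

```python
import numpy as np

def cv_score(X, Y, lam, fit, K=5, seed=0):
    """Average held-out error of a tuning parameter over K folds."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        C_hat = fit(X[train_idx], Y[train_idx], lam)   # train on K-1 folds
        resid = Y[test_idx] - X[test_idx] @ C_hat      # evaluate on the held-out fold
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))
```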

Although information criteria and subsampling methods have been widely used in reduced-rank regression, their theoretical properties are largely unknown and still need to be studied. A tuning parameter selection method for reduced-rank regression that has a solid theoretical property is imperative for practical use.

2.3. StARS-RRR

We hereby propose a novel tuning parameter selection method called StARS-RRR for reduced-rank regression based on the stability approach. The stability approach was originally introduced for variable selection and graphical models (Meinshausen and Bühlmann, 2010; Liu et al., 2010). In particular, we borrow the idea from StARS (Liu et al., 2010). The key ingredient of StARS is to define a measure called instability of an estimator from randomly drawn subsamples of the original data. For instance, Liu et al. (2010) proposed to use the variance of the Bernoulli indicator of an edge in a graph, averaged over all edges and all estimated graphs from randomly drawn subsamples.

Before presenting the definition of instability for StARS-RRR, we introduce some notation. Let b = b(n) be such that 1 < b(n) < n. We draw N random subsamples S₁, …, S_N from $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ without replacement, each of size b. Theoretically, there are $N = \binom{n}{b}$ such subsamples. However, Politis et al. (1999) argue that it suffices in practice to choose a large number N of subsamples at random. Based on the i-th subsample $S_i$, we derive the adaptive nuclear norm penalization estimator, $\hat{C}_\lambda(S_i)$, from (3), with rank $\hat{r}_\lambda(S_i)$, for a pre-specified tuning parameter λ > 0.

As our objective is to select a tuning parameter λ so that the estimated rank of C is close to the true rank, a natural definition of instability arises from the variation of the estimated ranks from the subsamples. Therefore, we define the instability corresponding to λ as the sample variance of $\{\hat{r}_\lambda(S_i) : i = 1, \ldots, N\}$:

$\hat{D}(\lambda) = S^2(\hat{r}_\lambda) = \frac{1}{N-1}\left[\sum_{i=1}^{N} \hat{r}_\lambda^2(S_i) - N\left(\frac{1}{N}\sum_{i=1}^{N} \hat{r}_\lambda(S_i)\right)^2\right]$. (5)
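A minimal sketch of the instability (5) follows, assuming a rank estimator such as the `estimated_rank` sketch above is available as `fit_rank(X_sub, Y_sub, lam)`; the defaults N = 100 and b = 0.7n mirror the settings reported later, and the function name is ours.

```python
import numpy as np

def instability(X, Y, lam, fit_rank, N=100, b=None, seed=0):
    """Sample variance of the estimated rank over N subsamples, i.e., D_hat(lambda) in (5)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    b = int(0.7 * n) if b is None else b
    ranks = []
    for _ in range(N):
        idx = rng.choice(n, size=b, replace=False)   # subsample without replacement
        ranks.append(fit_rank(X[idx], Y[idx], lam))
    return float(np.var(ranks, ddof=1))              # unbiased sample variance
```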

To see how the instability varies for different values of λ, we demonstrate via a simulated dataset. The dataset was simulated under the setting that is detailed in Section 4.1, where we note that the true rank of C equals 10. We draw 100 subsamples to calculate the instability defined in (5). Figure 1 shows the instability D^(λ) over a range of values of log(λ) as well as the estimated rank r^λ based on the full data, from which we observe the following patterns.

Fig. 1. The dotted line represents the instability relative to log(λ) and the solid line represents the estimated rank from the full data relative to log(λ). In this simulated dataset, SNR is 2.047 and the true rank is 10.

  1. When λ is small such that the estimated rank is larger than the true rank, the instability fluctuates but there is always a substantial gap from 0.

  2. In a sub-interval of the values of λ where the estimated rank is equal to the true rank, the instability stays at zero.

  3. When λ is large such that the estimated rank is less than the true rank, the instability fluctuates again and it can decrease to zero at certain values of λ.

The above observations motivate us to search from small to large for the first tuning parameter with which the instability is small enough. To achieve this objective, we consider a sequence of tuning parameters Λ = {λ₁, …, λ_K} where λ₁ < ⋯ < λ_K. Based on this sequence of tuning parameters, we define the cumulative minimum instability for any λ ∈ Λ:

$\bar{D}(\lambda) = \min\left\{\hat{D}(\lambda') : \lambda' \in \Lambda, \lambda_1 \leq \lambda' \leq \lambda\right\}$. (6)

Then, we select the optimal tuning parameter as

$\hat{\lambda} = \min\{\lambda \in \Lambda : \bar{D}(\lambda) \leq \eta\}$,

where η is a small pre-specified threshold. In Section 3, we will show that η is interpretable as it is an explicit function of a key quantity in the theoretical property of StARS-RRR. Therefore, we are not simply replacing the problem of selecting λ with the problem of selecting η.

It is noteworthy that the selection rule in StARS-RRR is different from that in StARS (Liu et al., 2010). When StARS is applied to graphical models, it starts with a large tuning parameter (or equivalently, an empty graph) and selects the first tuning parameter that achieves an instability lower than a threshold. In contrast, StARS-RRR starts with a small tuning parameter (or equivalently, a high estimated rank) and selects the first tuning parameter that achieves an instability lower than a threshold. The reason why StARS-RRR searches the tuning parameter in the opposite direction is that, when the tuning parameter is too large and the rank is underestimated, the estimated rank can still be stable across subsamples due to its discreteness. Therefore, if we started with a large tuning parameter and searched for a stable estimated rank, we might select a tuning parameter that is too large and leads to an underestimated rank.

To demonstrate the performance of StARS-RRR, we indicate the selected tuning parameter in the previous simulated dataset as the vertical line in Figure 1. It is seen that StARS-RRR selects λ correctly such that the estimated rank is the same as the true rank. To summarize, we present the StARS-RRR method as in Algorithm 1.

Algorithm 1: StARS-RRR
Input: Observations $\{(X_i, Y_i)\}_{i=1}^n$; number of subsamples N; size of subsamples b; threshold η; a list of candidate tuning parameters in increasing order, Λ = {λ₁, …, λ_K}.
Output: Optimal tuning parameter $\hat{\lambda}$.
1. Draw N subsamples $S_j$, j = 1, …, N, of size b from $\{(X_i, Y_i)\}_{i=1}^n$ without replacement;
2. for each k = 1, …, K do
3.   for each j = 1, …, N do
4.     Apply the adaptive nuclear norm penalization method on data $S_j$ with tuning parameter $\lambda_k$ to obtain $\hat{C}_{\lambda_k}(S_j)$ with rank $\hat{r}_{\lambda_k}(S_j)$;
5.   end
6.   Compute the instability $\hat{D}(\lambda_k)$ and then the cumulative minimum instability $\bar{D}(\lambda_k)$ by (5) and (6), respectively;
7.   if $\bar{D}(\lambda_k) \leq \eta$ then
8.     $\hat{\lambda} = \lambda_k$;
9.     break
10.   end
11. end
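The sketch below mirrors Algorithm 1 in Python/NumPy, again treating the rank estimator as a user-supplied `fit_rank(X_sub, Y_sub, lam)`; reusing the same N subsamples across the λ grid and the fallback return value when the threshold is never reached are our own implementation choices.

```python
import numpy as np

def stars_rrr(X, Y, lambdas, fit_rank, N=100, b=None, eta=0.001, seed=0):
    """Select the first lambda in an increasing grid whose cumulative minimum
    instability (6) drops to at most the threshold eta (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    b = int(0.7 * n) if b is None else b
    # Draw the N subsamples once and reuse them for every candidate lambda.
    subsamples = [rng.choice(n, size=b, replace=False) for _ in range(N)]
    cum_min = np.inf
    for lam in lambdas:                       # lambdas assumed sorted increasingly
        ranks = [fit_rank(X[idx], Y[idx], lam) for idx in subsamples]
        d_hat = np.var(ranks, ddof=1)         # instability (5)
        cum_min = min(cum_min, d_hat)         # cumulative minimum instability (6)
        if cum_min <= eta:
            return lam                        # first lambda with D_bar(lambda) <= eta
    return lambdas[-1]                        # fallback if the threshold is never reached
```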

Although our previous discussion is based on the adaptive nuclear norm penalization method for reduced-rank regression, StARS-RRR can be applied to any reduced-rank regression estimation method. Furthermore, the proposed framework is not even limited to the reduced-rank regression model; it can be used for general matrix estimation problems as long as the objective is to determine the rank of a matrix.

3. Rank Estimation Consistency

In this section, we will show that the estimated rank from StARS-RRR is consistent to the true rank: the estimated rank is equal to the true rank with probability tending to one. In addition, we will show that there is an explicit relationship between the threshold η employed in the StARS-RRR algorithm (Algorithm 1) and the probability that the rank is estimated correctly. Therefore, the threshold η is an interpretable quantity and we are not simply replacing the problem of choosing λ with the problem of choosing η. Different from the consistency result in the literature that requires a theoretical tuning parameter depending on unknown model parameters (Bunea et al., 2011; Chen et al., 2013), our result is practically more useful as it can be used to determine the tuning parameter in real applications.

Before presenting the consistency of rank estimation for StARS-RRR, we introduce some notation and impose necessary assumptions as follows. For any two real numbers u and v, let $u \wedge v = \min(u, v)$ and $u \vee v = \max(u, v)$. Recall that X and Y are the design matrix and responses in the multivariate linear model in (2). Further, let $X_b = (X_1, \ldots, X_b)^T$ and $Y_b = (Y_1, \ldots, Y_b)^T$ be the corresponding design matrix and responses from the subsample $\{(X_i, Y_i)\}_{i=1}^b$. Denote by $r_x$ and $r_x^b$ the ranks of X and $X_b$, respectively, and let $r^*$ be the true rank of the coefficient matrix C in (2). Finally, let $\hat{r}_\lambda^b$ be the estimated rank based on the subsample $\{(X_i, Y_i)\}_{i=1}^b$ using the adaptive nuclear norm penalization method (Chen et al., 2013). In addition, we impose the following assumptions.

Assumption 1. The error matrix E in (2) has independent N (0, σ2) entries.

Assumption 2. For a fixed θ > 0, $d_{r^*}(X_b C) \geq 2(1+\theta)\,\sigma\,(\sqrt{r_x^b} + \sqrt{q})$.

Assumption 3. $r^* \leq r_x^b \wedge q$, where q is the number of responses in Y.

Assumptions 1 and 2 are almost identical to Assumptions 1 and 2 in Chen et al. (2013), except that Assumption 2 is on the singular value of $X_b C$ for the subsample $X_b$ instead of the whole sample X. Assumption 2 is a common type of assumption for subsampling based methods such as stability approaches. For example, Assumption (A2) in the original StARS paper (Liu et al., 2010) is also imposed on the subsamples of size b < n. It requires that from a subsample of size b < n "all estimated graphs using regularization parameters Λ ≥ Λ0 contain the true graph with high probability." Assumption 3 is a moderate assumption as long as b is not too small because it is always true that $r^* \leq r_x \wedge q$. In practice, it is commonly assumed that $r^*$ is low for a reduced-rank regression problem.

In what follows, we will establish the consistency of rank estimation for StARS-RRR in a few steps. First, we will show that the true variance of $\hat{r}_\lambda^b$ follows the patterns (i) and (ii) observed in Figure 1, i.e., (i) the true variance of $\hat{r}_\lambda^b$ stays away from 0 when λ is small such that the rank is overestimated, and (ii) the true variance of $\hat{r}_\lambda^b$ is very close to zero when the estimated rank is equal to the true rank. This result is summarized in Theorem 3.1.

Theorem 3.1. Suppose Assumptions 1–3 hold. For any $\delta \in \left(\exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\},\ 1/2\right)$ with a large enough $\sqrt{r_x^b}+\sqrt{q}$, there exist $\lambda_l, \lambda_m, \lambda_h$ with $0 < \lambda_l \leq \lambda_m \leq \lambda_h \leq \{(1+\theta/2)\,\sigma\,(\sqrt{r_x^b}+\sqrt{q})\}^{\gamma+1}$ such that

  1. when $\lambda_h \leq \lambda \leq \{(1+3\theta/2)\,\sigma\,(\sqrt{r_x^b}+\sqrt{q})\}^{\gamma+1}$, $P(\hat{r}_\lambda^b = r^*) \geq 1 - 2\exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\}$ and $\operatorname{var}(\hat{r}_\lambda^b) \leq 4(r_x^b \wedge q)^2 \exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\}$, and further, $P(\hat{r}_\lambda^b = r^*) \to 1$ and $\operatorname{var}(\hat{r}_\lambda^b) \to 0$ as $\sqrt{r_x^b}+\sqrt{q} \to \infty$;

  2. when $\lambda_m \leq \lambda \leq \lambda_h$, $P(\hat{r}_\lambda^b = r^*) \geq 1 - \delta - 2\exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\}$;

  3. when $0 < \lambda \leq \lambda_m$, $P(\hat{r}_\lambda^b \geq r^* + 1) \geq \delta$, and when $\lambda_l \leq \lambda \leq \lambda_m$, $\operatorname{var}(\hat{r}_\lambda^b) \geq \delta(1-\delta)$.

On the one hand, part (a) of Theorem 3.1 provides the consistency of the estimated rank from a subsample of the data, which shows the adaptive nuclear norm penalization method is able to identify the correct rank with probability tending to one for an appropriate range of λ values. This result is similar to Theorem 3 in Chen et al. (2013) with a slight distinction that we identify a range of λ values for rank consistency instead of a single value of λ as in Chen et al. (2013). It is also obvious that the variance of the estimated rank tends to zero when the rank is correctly identified. On the other hand, the results in parts (b) and (c) provide additional information about the rank estimation when λ is smaller. In part (b), when $\lambda_m \leq \lambda \leq \lambda_h$, the adaptive nuclear norm penalization method achieves a slightly weaker result than that in part (a): the probability that the rank is correctly estimated is lowered by δ compared to part (a), although the estimated rank is still consistent as long as δ → 0. Part (c) discusses the case when λ is even smaller, i.e., $\lambda_l \leq \lambda \leq \lambda_m$. There are two implications: with probability at least δ the rank is overestimated, and the variance of the estimated rank has a lower bound that keeps it away from zero. In summary, Theorem 3.1 theoretically establishes the patterns (i) and (ii) observed in Figure 1 for the true variance of the estimated rank from a subsample of the data.

Second, we will show that the sample variance of the estimated ranks from all the subsamples, i.e., the instability as defined in (5), is very close to the true variance of $\hat{r}_\lambda^b$. We present this result in Theorem 3.2.

Theorem 3.2. For any λ > 0 such that $E(\hat{r}_\lambda^b) \geq 1/2$ and $t \in \left[6(r_x^b \wedge q)^2/(N-1),\ 9(r_x \wedge q)\right]$,

$P\left(\left|S^2(\hat{r}_\lambda^b) - \operatorname{var}(\hat{r}_\lambda^b)\right| > t\right) \leq 6\exp\left\{-\frac{nt^2}{162\,(r_x \wedge q)^4\, b}\right\}$. (7)

This result is similar to Theorem 1 in Liu et al. (2010) although our focus is the difference between the sample variance and the true variance while Liu et al. (2010) concerns the difference between the sample mean and the true mean. From (7), it is seen that there is a trade-off between the difference t and the probability on its right-hand side. For example, if t is a fixed quantity, then the probability tends to zero as long as $(r_x \wedge q)^4 b / n \to 0$. However, if one wishes to choose t such that t → 0 in order to achieve an asymptotically negligible difference, then the condition for the probability tending to zero becomes $(r_x \wedge q)^4 b / (nt^2) \to 0$, depending on the convergence rate of t.

Combining Theorems 3.1 and 3.2, we are able to validate the patterns (i) and (ii) as observed in Figure 1 for the instability, which is presented in the following corollary.

Corollary 3.1. Suppose Assumptions 1–3 hold and Λ = {λ₁, …, λ_K} is a grid of K increasing positive values of λ. When $\sqrt{r_x^b}+\sqrt{q}$ is large enough, with probability at least $1 - 6K\exp\{-n\delta^2/(648\,C^2\,(r_x \wedge q)^4\, b)\}$,

$\min\{\hat{D}(\lambda) : \lambda \in \Lambda \cap [\lambda_l, \lambda_m]\} \geq [(C-1)/C]\,\delta(1-\delta)$ (8)
$\max\{\hat{D}(\lambda) : \lambda \in \Lambda \cap [\lambda_h, \{(1+3\theta/2)\,\sigma\,(\sqrt{r_x^b}+\sqrt{q})\}^{\gamma+1}]\} \leq (2/C)\,\delta(1-\delta)$, (9)

for any fixed C > 3 and any $\delta \in \left(8C(r_x^b \wedge q)^2\exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\} \vee 12C(r_x^b \wedge q)^2/(N-1),\ 1/2\right)$.

In Corollary 3.1, to ensure that the interval for the possible values of δ is not empty, as $8C(r_x^b \wedge q)^2\exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\}$ is small with a large enough $\sqrt{r_x^b}+\sqrt{q}$, one only needs to assume that $(r_x^b \wedge q)^2/N$ is small. Moreover, for a large enough C, the lower bound in (8) is close to δ(1 – δ) and the upper bound in (9) is close to 0. This verifies the patterns (i) and (ii) for the instability as observed in Figure 1.

Based on Corollary 3.1, we can choose an appropriate threshold η in StARS-RRR so that the optimal $\hat{\lambda}$ lies either in the interval $[\lambda_m, \lambda_h]$ or in $[\lambda_h, \{(1+3\theta/2)\,\sigma\,(\sqrt{r_x^b}+\sqrt{q})\}^{\gamma+1}]$, as long as the candidate tuning parameters Λ = {λ₁, …, λ_K} in StARS-RRR are all larger than $\lambda_l$. Combining this result with parts (a) and (b) of Theorem 3.1 establishes the rank estimation consistency of StARS-RRR as presented in the following theorem.

Theorem 3.3. Suppose Assumptions 1–3 hold and Λ = {λ₁, …, λ_K} is a grid of K increasing positive values of λ such that $\lambda_1 \geq \lambda_l$ with $\lambda_l$ defined in Theorem 3.1. Let $\hat{\lambda}$ be the optimal tuning parameter selected by StARS-RRR as described in Algorithm 1 with a threshold η = δ(1 – δ)/2, where $\delta \in \left[32(r_x^b \wedge q)^2\exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\} \vee 48(r_x^b \wedge q)^2/(N-1),\ 1/2\right)$, and let $\hat{r}_{\hat{\lambda}}^b$ be the estimated rank at $\hat{\lambda}$ from the subsample $X_b$ and $Y_b$. Then,

$P(\hat{r}_{\hat{\lambda}}^b = r^*) \geq 1 - K\delta - 2K\exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\} - 6K\exp\{-n\delta^2/(10368\,(r_x \wedge q)^4\, b)\}$. (10)

Furthermore, assume that K is a fixed integer, that $\sqrt{r_x^b}+\sqrt{q} \to \infty$, and that there exist α > 0 and β > 0 such that $(r_x^b \wedge q)^2\exp\{-\theta^2(\sqrt{r_x^b}+\sqrt{q})^2/8\} = o(n^{-\alpha})$ and $(r_x \wedge q)^4 b / n = o(n^{-\beta})$, as n → ∞. Then, we can choose $\delta = n^{-(\alpha \wedge \beta)/2}$ in (10), which leads to

$P(\hat{r}_{\hat{\lambda}}^b = r^*) \to 1$, as $n \to \infty$.

Theorem 3.3 consists of two parts. First, it provides a finite sample lower bound for the probability with which the rank is correctly estimated when the tuning parameter is selected by StARS-RRR. It is interesting that there is an explicit relationship between this lower bound and the threshold η used in StARS-RRR because η = δ(1 – δ)/2 and the lower bound in (10) is also a known function of δ. Therefore, this result gives η an explicit interpretation by connecting it to the theoretical property of the estimated rank and makes the choice of the threshold meaningful in the StARS-RRR algorithm. Second, under further technical conditions, the estimated rank is asymptotically consistent to the true rank as the above-mentioned lower bound tends to one. While most of these conditions are common and also moderate, we note that the condition $(r_x \wedge q)^4 b / n = o(n^{-\beta})$ imposes an upper bound on $r_x \wedge q$ depending on b and n. This condition arises from Theorem 3.2 and thus is similar to the condition that ensures the upper bound of probability in (7) converges to zero.

A critical condition that guarantees rank estimation consistency is that the candidate tuning parameters used in StARS-RRR cannot be too small, implied by the assumption that $\lambda_1 \geq \lambda_l$ in Theorem 3.3. Without this assumption, the optimal $\hat{\lambda}$ could lie in the interval $(0, \lambda_l)$ and possibly lead to an overestimate of the rank. However, we do not expect this assumption to be too strong as $\lambda_l^{1/(\gamma+1)}$ is the δ-quantile of $d_{r^*+1}(P_b Y_b)$ (see Proof of Lemma 2 in the supplementary material) and $\delta = n^{-(\alpha \wedge \beta)/2}$ tends to zero as $n \to \infty$, where $P_b = X_b[(X_b)^T X_b]^{-}(X_b)^T$. In addition, this is only a sufficient condition for rank estimation consistency. In our numerical studies (see Section 4), we have not observed an obvious trend of overestimation of the true rank with practically selected candidate tuning parameters in StARS-RRR.

From the technical proofs of Theorems 3.1–3.3 (see the supplementary material), it is seen that our main theoretical result, the rank estimation consistency, is actually not limited to the adaptive nuclear norm penalization method. In fact, any rank estimation method that results in an estimated rank as in (4) will enjoy the results in Theorem 3.3. For example, the L0 penalized estimator in Bunea et al. (2011) also results in a similar form of the estimated rank:

$\hat{r}_\lambda = \max\{r : d_r(PY) > \lambda\}$,

with λ being the tuning parameter for the L0 penalty. Therefore, the rank estimation consistency would also hold for the L0 penalized estimator in Bunea et al. (2011) as long as we replace γ in (4) by 0.

Compared to the partial sparsistency property established for StARS in Liu et al. (2010), our result is stronger. The partial sparsistency result shows that, with probability tending to one, the true edges of a graph belong to the estimated edge set using the optimal tuning parameter selected by StARS on a subsample of size b, which is only a "one-direction" result. By contrast, our result is a "two-direction" result that establishes the rank estimation consistency for StARS-RRR.

In practice, after the optimal tuning parameter is selected by StARS-RRR, one often performs the reduced-rank regression on the full data with this optimal tuning parameter. Thus, we recommend choosing b sufficiently large so that the behavior of the subsample $X_b$ and $Y_b$ is similar to that of the full data. On the other hand, b cannot be too large as implied by the technical condition $(r_x \wedge q)^4 b / n = o(n^{-\beta})$. Therefore, an appropriate size b needs to be selected under a particular setting of X, q, and n. This philosophy is similar to how b is chosen in StARS although our technical condition about b is slightly more complicated than that for StARS. In our numerical studies in Section 4, we set η = 0.001, b = 0.7n, and N = 100 in Algorithm 1 for StARS-RRR.

4. Numerical Experiments

4.1. Simulation

In this subsection, we compare the finite sample performance of the rank determination via StARS-RRR and other approaches including AIC, BIC, GIC, BICP, GCV and cross validation (CV) on simulated data.

We adopt the same simulation settings from Bunea et al. (2011). Specifically, the coefficient matrix C is generated by $C = s\,C_1 C_2^T$, where s > 0, $C_1 \in \mathbb{R}^{p \times r^*}$, and $C_2 \in \mathbb{R}^{q \times r^*}$. All entries in $C_1$ and $C_2$ are drawn randomly from N(0, 1). The design matrix X is generated by $X = X_0 \Gamma^{1/2}$, where $X_0 = X_1 X_2^T$, $X_1 \in \mathbb{R}^{n \times r_x}$, $X_2 \in \mathbb{R}^{p \times r_x}$, and $\Gamma = (\Gamma_{ij}) \in \mathbb{R}^{p \times p}$ with $\Gamma_{ij} = \rho^{|i-j|}$, i, j = 1, …, p. The response matrix Y is then generated by Y = XC + E, where the elements of E are independent random variables from N(0, 1). Thus, the simulation model is characterized by the sample size n, the number of predictors p, the number of responses q, the rank of the design matrix $r_x$, the true model rank $r^*$, the correlation coefficient between the adjacent predictors ρ, and the signal s.

We will explore two different model settings, where Model I is a low-dimensional case with (n, p, q, r_x, r*) = (500, 25, 25, 15, 10) and Model II is a high-dimensional case with (n, p, q, r_x, r*) = (80, 100, 100, 30, 8). In Model I, p and q are relatively small compared with n, while in Model II, p and q are relatively large. Although these finite-sample settings may not align perfectly with the technical conditions in our large-sample theory in Section 3, they still deserve numerical investigation to see how StARS-RRR performs in practice. In addition, for each model, we set ρ to be 0.1, 0.5, or 0.9, which stands for weak, moderate, or strong dependence between the predictors. We consider six different values of s such that the signal-to-noise ratio (SNR) ranges roughly from 1 to 3. Since the $r^*$-th largest singular value of XC, i.e., $d_{r^*}(XC)$, measures the signal strength and the largest singular value of the projected noise matrix $PE = X(X^TX)^{-}X^T E$, i.e., $d_1(PE)$, measures the noise level, the SNR is defined as $d_{r^*}(XC)/d_1(PE)$ (Chen et al., 2013).
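For reference, the sketch below generates one dataset under this scheme and reports the realized SNR; the seed handling and the symmetric square root of Γ via an eigendecomposition are our own implementation choices.

```python
import numpy as np

def simulate_rrr(n, p, q, r_x, r_star, rho, s, seed=0):
    """Generate (X, Y, C) as described above and return the realized SNR."""
    rng = np.random.default_rng(seed)
    C = s * rng.standard_normal((p, r_star)) @ rng.standard_normal((q, r_star)).T
    Gamma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    w, V = np.linalg.eigh(Gamma)                              # eigendecomposition of Gamma
    Gamma_half = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T   # symmetric square root
    X = rng.standard_normal((n, r_x)) @ rng.standard_normal((p, r_x)).T @ Gamma_half
    E = rng.standard_normal((n, q))
    Y = X @ C + E
    P = X @ np.linalg.pinv(X.T @ X) @ X.T
    snr = (np.linalg.svd(X @ C, compute_uv=False)[r_star - 1]
           / np.linalg.svd(P @ E, compute_uv=False)[0])
    return X, Y, C, float(snr)
```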

Table 3.

Mean and standard error (in parenthesis) of bias of the estimated rank in the simulation study.

SNR AIC BIC GIC BICP GCV CV StARS-RRR
Model I, ρ = 0.1
1.07 0.12 (0.45) −1.11 (0.68) −1.73 (0.71) −1.55 (0.7) 0.09 (0.41) 0.11 (0.46) −0.4 (0.65)
1.6 0.14 (0.39) −0.4 (0.53) −0.76 (0.65) −0.65 (0.6) 0.12 (0.37) 0.13 (0.38) −0.06 (0.37)
1.85 0.15 (0.39) −0.24 (0.43) −0.51 (0.57) −0.43 (0.54) 0.13 (0.38) 0.13 (0.38) −0.03 (0.32)
2.14 0.15 (0.39) −0.13 (0.34) −0.32 (0.47) −0.26 (0.45) 0.13 (0.36) 0.14 (0.37) 0 (0.28)
2.49 0.15 (0.39) −0.06 (0.23) −0.17 (0.38) −0.14 (0.35) 0.14 (0.37) 0.13 (0.36) 0 (0.28)
3.03 0.16 (0.39) −0.02 (0.13) −0.07 (0.26) −0.05 (0.21) 0.13 (0.36) 0.14 (0.37) 0 (0.19)
Model I, ρ = 0.5
1.1 0.09 (0.43) −0.98 (0.67) −1.57 (0.69) −1.42 (0.68) 0.06 (0.42) 0.11 (0.44) −0.39 (0.63)
1.26 0.12 (0.4) −0.74 (0.63) −1.22 (0.69) −1.08 (0.68) 0.09 (0.4) 0.12 (0.43) −0.25 (0.58)
1.57 0.13 (0.39) −0.42 (0.53) −0.75 (0.62) −0.66 (0.62) 0.11 (0.37) 0.13 (0.42) −0.07 (0.48)
2.2 0.15 (0.41) −0.14 (0.35) −0.33 (0.48) −0.27 (0.45) 0.12 (0.36) 0.14 (0.4) 0 (0.35)
2.52 0.15 (0.38) −0.07 (0.26) −0.2 (0.4) −0.16 (0.36) 0.13 (0.37) 0.14 (0.39) 0.02 (0.32)
2.99 0.15 (0.38) −0.03 (0.18) −0.1 (0.29) −0.08 (0.27) 0.13 (0.37) 0.15 (0.4) 0.02 (0.32)
Model I, ρ = 0.9
1.08 0.13 (0.49) −1 (0.64) −1.55 (0.7) −1.39 (0.69) 0.1 (0.47) 0.11 (0.52) −0.42 (0.58)
1.24 0.15 (0.45) −0.77 (0.63) −1.23 (0.67) −1.09 (0.65) 0.12 (0.43) 0.14 (0.48) −0.29 (0.49)
1.55 0.17 (0.44) −0.46 (0.54) −0.79 (0.62) −0.69 (0.61) 0.14 (0.41) 0.16 (0.46) −0.14 (0.36)
2.09 0.17 (0.42) −0.17 (0.37) −0.37 (0.51) −0.31 (0.48) 0.13 (0.38) 0.17 (0.46) −0.02 (0.24)
2.71 0.16 (0.41) −0.06 (0.23) −0.16 (0.37) −0.12 (0.32) 0.14 (0.39) 0.15 (0.43) 0 (0.2)
3.1 0.17 (0.42) −0.03 (0.16) −0.09 (0.29) −0.07 (0.25) 0.14 (0.39) 0.16 (0.45) 0 (0.19)
Model II, ρ = 0.1
1.16 1.16 (1.33) −3.05 (0.85) −7.63 (0.52) −7.39 (0.62) 0.16 (0.41) 0.09 (0.33) −0.79 (1.86)
1.45 1.11 (1.37) −1.45 (0.83) −7.1 (0.73) −6.65 (0.82) 0.11 (0.36) 0.03 (0.21) −0.03 (0.2)
1.73 1.16 (1.47) −0.52 (0.61) −6.52 (0.87) −5.76 (1.03) 0.09 (0.34) 0.02 (0.14) 0 (0.06)
2.02 1.21 (1.55) −0.17 (0.4) −5.78 (1.15) −4.61 (1.4) 0.08 (0.33) 0.01 (0.12) 0 (0)
2.6 1.37 (1.66) −0.01 (0.09) −3.24 (1.94) −1.45 (1.27) 0.09 (0.35) 0.01 (0.11) 0 (0)
3.18 1.42 (1.67) 0 (0) −0.46 (0.75) −0.22 (0.46) 0.11 (0.41) 0.01 (0.15) 0 (0)
Model II, ρ = 0.5
1.12 1.19 (1.36) −3.19 (0.84) −7.54 (0.55) −7.31 (0.64) 0.15 (0.41) 0.11 (0.34) −0.98 (2.27)
1.4 1.1 (1.37) −1.66 (0.83) −6.97 (0.76) −6.46 (0.86) 0.13 (0.38) 0.06 (0.25) −0.04 (1.08)
1.68 1.1 (1.42) −0.66 (0.68) −6.34 (0.94) −5.67 (1.06) 0.1 (0.34) 0.02 (0.17) 0.04 (0.94)
2.09 1.15 (1.5) −0.13 (0.35) −5.2 (1.32) −3.95 (1.38) 0.08 (0.33) 0.01 (0.11) 0.04 (0.94)
2.51 1.19 (1.55) −0.01 (0.11) −3.4 (1.74) −1.71 (1.32) 0.09 (0.36) 0.01 (0.09) 0.04 (0.94)
3.07 1.3 (1.59) 0 (0) −0.64 (0.86) −0.29 (0.54) 0.11 (0.37) 0 (0.09) 0.04 (0.94)
Model II, ρ = 0.9
1.05 1.26 (1.4) −2.6 (0.73) −5.92 (0.75) −5.58 (0.8) 0.21 (0.48) 0.16 (0.42) −0.75 (1.6)
1.24 1.2 (1.4) −1.72 (0.73) −5.38 (0.84) −4.9 (0.87) 0.18 (0.45) 0.11 (0.35) −0.15 (1.1)
1.52 1.19 (1.48) −0.91 (0.66) −4.47 (0.97) −3.8 (0.89) 0.13 (0.4) 0.05 (0.23) 0.04 (0.94)
2 1.23 (1.57) −0.19 (0.39) −2.88 (0.98) −2.2 (0.91) 0.11 (0.36) 0.02 (0.14) 0.04 (0.94)
2.48 1.3 (1.64) −0.02 (0.15) −1.38 (0.88) −1 (0.75) 0.12 (0.38) 0.01 (0.09) 0.04 (0.94)
3.05 1.37 (1.67) 0 (0) −0.46 (0.62) −0.28 (0.51) 0.12 (0.38) 0.01 (0.1) 0.04 (0.94)

We apply the reduced-rank regression via the adaptive nuclear norm penalization to fit the full data with the optimal tuning parameter selected by AIC, BIC, GIC, BICP, GCV, CV (5-fold), and StARS-RRR. In StARS-RRR, we set the threshold η as 0.0001. For each method, a total of 500 simulation replications are conducted. To compare the performance of the aforementioned methods, we consider several performance measures. The first measure is the rank recovery ratio, defined as the proportion of replications with $\hat{r} = r^*$, where $\hat{r}$ is the estimated rank. The method with a higher rank recovery ratio is more effective in rank determination. The second and third measures are the rank overestimate ratio and the rank underestimate ratio, defined as the proportions of replications with $\hat{r} > r^*$ and $\hat{r} < r^*$, respectively. The fourth measure is the bias of the estimated rank, defined as the mean difference between the estimated rank and the true rank. A better performing method should have a lower bias in magnitude.
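These measures reduce to simple proportions and averages over replications, as in the small sketch below; the dictionary keys are our own labels, and the last entry reports the spread of the bias across replications, roughly what the parenthesized values in Table 3 report.

```python
import numpy as np

def rank_metrics(r_hats, r_star):
    """Recovery/underestimate/overestimate ratios and bias over replications."""
    r_hats = np.asarray(r_hats, dtype=float)
    bias = r_hats - r_star
    return {
        "recovery": float(np.mean(r_hats == r_star)),
        "underestimate": float(np.mean(r_hats < r_star)),
        "overestimate": float(np.mean(r_hats > r_star)),
        "bias_mean": float(np.mean(bias)),
        "bias_sd": float(np.std(bias, ddof=1)),
    }
```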

Table 2 summarizes the rank recovery, underestimate, and overestimate ratios. For Model I, it is clear that StARS-RRR outperforms the other methods in terms of every performance measure when the SNR is moderate to high. In particular, the rank recovery ratio of StARS-RRR is at least 10% higher than those of the information criterion based methods. When the SNR is low, i.e., SNR < 1.5, GCV has the best performance, followed by AIC, CV, and StARS-RRR. In this case, the StARS-RRR estimators from the subsamples might be dominated by the noise, which leads to underestimation of the rank. In contrast, GCV, CV, and AIC tend to overestimate the rank and result in a more complicated model, which would explain why these methods perform well when the SNR is very low.

Table 2.

Rank recovery (left), underestimate (middle), and overestimate (right) ratios (in percentage) in the simulation study.

s SNR AIC BIC GIC BICP GCV CV StARS-RRR
Model I, ρ = 0.1
30 1.07 (81,4,15) (17,83,0) (2,98,0) (4,96,0) (85,4,12) (80,5,15) (63,37,0)
45 1.6 (86,1,14) (62,38,0) (36,64,0) (42,58,0) (87,1,12) (86,1,13) (92,8,0)
52 1.85 (85,0,14) (76,24,0) (53,47,0) (59,41,0) (87,0,12) (87,0,13) (95,4,0)
60 2.14 (86,0,14) (87,13,0) (68,32,0) (74,26,0) (87,0,13) (87,0,13) (98,2,0)
70 2.49 (86,0,14) (94,6,0) (83,17,0) (86,14,0) (87,0,13) (87,0,13) (98,1,0)
85 3.03 (85,0,15) (98,2,0) (93,7,0) (95,5,0) (87,0,13) (87,0,13) (99,0,0)
Model I, ρ = 0.5
35 1.1 (83,4,13) (23,77,0) (4,96,0) (6,94,0) (84,5,11) (82,4,14) (63,37,0)
40 1.26 (85,2,13) (36,64,0) (12,88,0) (18,82,0) (86,3,11) (83,3,14) (73,26,1)
50 1.57 (86,1,13) (60,40,0) (35,65,0) (42,58,0) (87,1,12) (85,1,13) (89,10,1)
70 2.2 (86,0,14) (86,14,0) (68,32,0) (73,27,0) (88,0,12) (87,0,13) (97,3,1)
80 2.52 (86,0,14) (93,7,0) (80,20,0) (84,16,0) (88,0,12) (87,0,13) (99,1,1)
95 2.99 (86,0,14) (97,3,0) (90,10,0) (92,8,0) (88,0,12) (86,0,14) (99,1,1)
Model I, ρ = 0.9
70 1.08 (78,5,17) (20,80,0) (4,96,0) (8,92,0) (80,5,14) (78,6,16) (63,37,0)
80 1.24 (82,3,16) (33,67,0) (11,89,0) (16,84,0) (84,3,13) (82,3,15) (73,27,0)
100 1.55 (83,1,16) (56,44,0) (32,68,0) (39,61,0) (86,1,14) (84,1,15) (86,14,0)
135 2.09 (85,0,15) (83,17,0) (64,36,0) (70,30,0) (88,0,12) (86,0,14) (97,3,0)
175 2.71 (85,0,15) (94,6,0) (84,16,0) (88,12,0) (87,0,13) (87,0,13) (99,1,0)
200 3.1 (85,0,15) (97,3,0) (91,9,0) (93,7,0) (87,0,13) (86,0,14) (99,0,0)
Model II, ρ = 0.1
8 1.16 (43,0,57) (0,100,0) (0,100,0) (0,100,0) (86,0,14) (90,1,9) (77,23,0)
10 1.45 (48,0,52) (12,88,0) (0,100,0) (0,100,0) (91,0,9) (97,0,3) (98,2,0)
12 1.73 (50,0,50) (54,46,0) (0,100,0) (0,100,0) (93,0,7) (99,0,1) (100,0,0)
14 2.02 (50,0,50) (83,17,0) (0,100,0) (0,100,0) (93,0,7) (99,0,1) (100,0,0)
18 2.6 (47,0,53) (99,1,0) (5,95,0) (24,76,0) (93,0,7) (99,0,1) (100,0,0)
22 3.18 (45,0,55) (100,0,0) (65,35,0) (80,20,0) (92,0,8) (99,0,1) (100,0,0)
Model II, ρ = 0.5
8 1.12 (42,0,58) (0,100,0) (0,100,0) (0,100,0) (86,0,14) (89,0,11) (69,30,0)
10 1.4 (48,0,52) (5,95,0) (0,100,0) (0,100,0) (89,0,11) (95,0,5) (95,4,0)
12 1.68 (51,0,49) (45,55,0) (0,100,0) (0,100,0) (92,0,8) (98,0,2) (100,0,0)
15 2.09 (52,0,48) (88,12,0) (0,100,0) (0,100,0) (94,0,6) (99,0,1) (100,0,0)
18 2.51 (52,0,48) (99,1,0) (2,98,0) (17,83,0) (93,0,7) (99,0,1) (100,0,0)
22 3.07 (48,0,52) (100,0,0) (55,45,0) (75,25,0) (91,0,9) (100,0,0) (100,0,0)
Model II, ρ = 0.9
11 1.05 (41,0,59) (0,100,0) (0,100,0) (0,100,0) (81,0,19) (83,1,16) (58,42,0)
13 1.24 (44,0,56) (3,97,0) (0,100,0) (0,100,0) (85,0,15) (90,0,10) (86,14,0)
16 1.52 (47,0,53) (26,74,0) (0,100,0) (0,100,0) (89,0,11) (96,0,4) (99,1,0)
21 2 (49,0,51) (82,18,0) (0,100,0) (1,99,0) (91,0,9) (99,0,1) (100,0,0)
26 2.48 (49,0,51) (98,2,0) (15,85,0) (25,75,0) (90,0,10) (99,0,1) (100,0,0)
32 3.05 (47,0,53) (100,0,0) (61,39,0) (74,26,0) (90,0,10) (99,0,1) (100,0,0)

For Model II, different from Model I where the sample size is sufficient compared to the dimension, BICP and GIC can rarely recover the true rank. This is probably because the penalty in BICP and GIC contains the term log(pq) (Table 1), which becomes excessive when the sample size is small and the dimension is high. As a result, the estimated ranks from these two methods are much smaller than the true rank. This can also be observed from Table 3, where the mean biases of the estimated ranks from BICP and GIC are negative. It is clear that StARS-RRR has a much better performance than BICP and GIC in recovering the true rank over a wide range of SNRs. Regardless of whether the correlation between the predictors is low, medium, or high, StARS-RRR can always achieve a rank recovery ratio of at least 97% when the SNR is greater than 1.4, a case that occurs commonly in practice. StARS-RRR also slightly outperforms CV although the performance of CV in Model II is better than in Model I. In summary, when the sample size is small and the dimension is high, StARS-RRR can obtain a reliable result in determining the rank in a reduced-rank regression model.

Table 3 tabulates the mean and standard error of the bias of the estimated rank over simulation replicates. From this table, we observe that the biases of the estimated ranks from AIC, GCV, and CV are always greater than 0, for both Model I and Model II, which indicates that these three methods tend to yield a more complex model. Furthermore, their biases do not quite vanish as the SNR increases in Model I and/or Model II, which means that these methods tend to overestimate the rank of the coefficient matrix regardless of the signal-to-noise ratio. By contrast, BIC, BICP, and GIC tend to underestimate the rank for both Model I and Model II, as evidenced by the negative signs of their mean biases. However, the magnitude of the bias decreases when the SNR increases, which implies that these three methods perform better with a higher SNR. Finally, StARS-RRR achieves the smallest or the second smallest magnitude of bias in most settings except for very low SNRs. Overall, StARS-RRR performs the best among all the methods we investigated in rank determination for reduced-rank regression models.

Based on the referees' suggestions, we also conduct additional simulation studies to evaluate the prediction performance of StARS-RRR, as well as sensitivity analyses for the hyperparameters η, N, and b in StARS-RRR. For more details, please refer to Section B of the supplementary material.

4.2. Application to breast cancer data

In this subsection, we apply StARS-RRR to a real dataset to show its effectiveness in rank determination. In particular, we consider the breast cancer data (Witten et al., 2009), consisting of the gene expression measurements and comparative genomic hybridization (CGH) measurements for n = 89 patients. This dataset has been studied in previous work (Bunea et al., 2011; Chen et al., 2013) and is available in the R package PMA (Witten et al., 2009). The question of interest is to investigate the relationship between the DNA copy-number variations and gene expression profiles for the patients. We will use reduced-rank regression to model the copy number variations based on the gene expression profiles (Geng et al., 2011; Zhou et al., 2012). A reduced-rank regression model yields a low-rank coefficient matrix, with the estimated rank representing the number of linear combinations of gene expression measurements that enter into the prediction of CGH measurements. These linear combinations of gene expression measurements can be regarded as biologically functional pathways that affect the DNA copy number variations.

For the purpose of illustration, we analyzed the data on chromosome 13, where there are p = 319 gene expression measurements and q = 58 CGH measurements. The adaptive nuclear norm penalization method was used to estimate the coefficient matrix, and StARS-RRR was applied to determine the optimal rank, together with the other approaches used in the simulation. The estimated ranks are presented in the top panel of Table 4. On the one hand, AIC and GCV estimate the rank as 57 and 26, respectively, which seem to overestimate the number of functional pathways of practical interest. On the other hand, BIC, GIC, BICP and CV estimate the rank as 1 or 2 and haven’t revealed enough biological relationships for further investigation. Instead, StARS-RRR reveals three linear combinations of gene expressions that potentially affect copy number variations, which include a reasonable number of biological pathways that deserve further investigation.

Table 4.

Comparison of the model fits to the real data for various tuning parameter selection methods. The mean squared prediction errors (MSPE) and the estimated ranks (Rank) are reported, with their standard errors in the parentheses.

AIC BIC GIC BICP GCV CV StARS-RRR
Full Data
Rank 57 2 1 2 26 0 3
Random-Splitting Process
Rank 56.98 (0.14) 49.87 (18.51) 1.00 (0.00) 1.06 (0.24) 30.76 (12.42) 0.80 (0.40) 2.65 (0.52)
MSPE 4.32 (0.98) 4.20 (1.02) 3.71 (1.28) 3.71 (1.27) 4.25 (0.98) 3.84 (1.37) 3.41 (0.95)

To visualize the relationship between the DNA copy-number variations and gene expression profiles, we also plot the estimated coefficient matrix with the tuning parameter selected by StARS-RRR in the form of a heat map in Figure 2. It is visually clear that there are three sets of CGH measurements, each of which follows a similar relationship with the gene profiles. The left 24 CGH measurements have a strong relationship with the genes as the coefficients have the largest magnitude among the three sets and have both positive and negative signs. The middle 3 CGH measurements have a moderate relationship that is similar to the above measurements in terms of signs although the magnitude of the coefficients is much smaller. The right 31 CGH measurements have a weak relationship with the genes as their coefficients are small in magnitude.

Fig. 2.

Fig. 2

The heat map of coefficient matrix obtained by the StARS-RRR for each CGH spot (row) and the gene (column). Genes with all of its coefficients being 0 are not shown.

To provide further insight into the performance of different methods, we carried out the following random-splitting process 100 times. The data were randomly split into a training set of size n_train = 79 and a test set of size n_test = 10. We first estimated the rank using the aforementioned methods, and then refitted the model using a ridge generalization of the reduced-rank regression model (Izenman, 2008) to derive the final estimated coefficient matrix $\hat{C}$. Finally, we calculated the mean squared prediction error (MSPE) as $\mathrm{MSPE} = 100 \times \|Y_{\mathrm{test}} - X_{\mathrm{test}}\hat{C}\|_F^2/(q\,n_{\mathrm{test}})$, where $X_{\mathrm{test}}$ and $Y_{\mathrm{test}}$ are the predictors and responses in the test set.
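A sketch of one split of this evaluation is given below; `fit(X_train, Y_train)` stands in for the full pipeline (rank selection followed by the ridge-generalized refit), which is not implemented here, and the function name is ours.

```python
import numpy as np

def mspe_random_split(X, Y, fit, n_test=10, seed=0):
    """One random training/test split and MSPE = 100 * ||Y_te - X_te C_hat||_F^2 / (q * n_test)."""
    rng = np.random.default_rng(seed)
    n, q = Y.shape
    perm = rng.permutation(n)
    test, train = perm[:n_test], perm[n_test:]
    C_hat = fit(X[train], Y[train])          # rank selection + refit happen inside fit
    resid = Y[test] - X[test] @ C_hat
    return 100.0 * np.linalg.norm(resid, "fro") ** 2 / (q * n_test)
```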

Similar to the full-data performance, the bottom panel of Table 4 shows that the tuning parameter selection methods can be divided into three groups according to their estimated ranks. The first group consists of AIC, BIC, and GCV, whose estimated ranks are quite large. The coefficient matrix has 58 columns, so the coefficient matrices estimated by AIC and BIC are of almost full rank. Therefore, these methods result in a complex model that might overfit the data. The second group includes GIC, BICP, and CV, where the average estimated rank is close to 1. This might imply an underestimation of the rank. Underestimation of the rank leads to the lack of information to be extracted, which also explains why they underperform in terms of prediction accuracy. The last group is StARS-RRR, whose average estimated rank is between 2 and 3, a more reasonable rank than the other methods. Moreover, the MSPE of StARS-RRR is lower than all the other methods, which further convinces us that StARS-RRR yields an accurate estimation of rank for this dataset.

5. Discussion

In this article, we propose a new method based on the stability approach to select the tuning parameter for reduced-rank regression. To the best of our knowledge, it is the first time that the stability approach is used in this framework. Our main contribution is twofold. First, we set up a general framework of the stability approach for rank determination including a new definition of instability and a new tuning parameter selection rule based on the instability. This framework is generally applicable to any matrix estimation problem and is referred to as StARS-RRR when specifically applied to reduced-rank regression. Second, we show that the rank determined by StARS-RRR is consistent to the true rank when the adaptive nuclear norm penalization is used. In fact, we provide a finite sample lower bound of the probability with which the rank is estimated correctly. Interestingly, there is an explicit relationship between this lower bound and the threshold η used to select the tuning parameter in StARS-RRR, giving the method high interpretability.

While the focus of this work is to estimate the true rank for reduced-rank regression, we understand that rank estimation is not the unique objective and may not be the optimal objective especially when the data are very noisy. For example, She and Tran (2019) mentioned in their simulation studies that “when the noise level is very high, there is no reason to believe that the noise-free simulation truth yields the best predictive model from the observed data” and “because the low SNR data are heavily contaminated by noise…, a more parsimonious and predictive model may exist that diverges from the zero-noise model used in generating synthetic data.” This implies that a model with the true rank may not be the best predictive model. To partially address this concern, we conduct additional simulation studies (see the supplementary material) and show that StARS-RRR not only estimates the true rank accurately but also estimates the effective rank (Bunea et al., 2011) and predicts the responses at least as well as other tuning parameter selection methods.

Although StARS-RRR performs satisfactorily on both simulated and real data, it still has a few limitations that need attention and/or deserve further study. First, the current definition of instability emphasizes the stability of the estimated rank, but there could be alternative definitions. For example, if one is more concerned with the stability of the subspace corresponding to the estimated coefficient matrix rather than its rank, the instability could be defined as the variation of such subspaces estimated from the randomly drawn subsamples. We refer to Taeb et al. (2020) for recent developments on subspace stability in low-rank matrix estimation. Second, we established the rank estimation consistency for StARS-RRR, which makes it more theoretically sound than most information criterion methods when applied to reduced-rank regression. However, we must point out that most information criterion methods do not need any assumptions on the design matrix, while StARS-RRR assumes an independent and identically distributed sample. This limitation needs to be taken into consideration in the application of StARS-RRR to dependent or heterogeneous data. Third, this article represents only the first effort to apply the stability approach to low-rank matrix estimation and thus there are still many unsolved questions to explore. For example, if both sparsity and low rank are desired as in sparse reduced-rank regression, it will be an interesting topic to extend StARS-RRR so that variable selection consistency and rank estimation consistency can be achieved simultaneously. Generalization of StARS-RRR to other matrix estimation problems such as low-rank matrix completion also warrants future investigation.

Acknowledgement

Wen’s research was supported in part by National Science Foundation of China under Grants 12171449 and 11801540; and Natural Science Foundation of Anhui Province under Grant BJ2040170017. Jiang’s research was supported in part by the U.S. National Institutes of Health under Grant R01GM126549.

Footnotes

Supplementary Material

Supplementary material includes technical proofs of the theoretical results as well as additional numerical results.

References

  1. Akaike H (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
  2. An H, Huang D, Yao Q, and Zhang C (2008). Stepwise searching for feature variables in high-dimensional linear regression. Available online: http://eprints.lse.ac.uk/51349/.
  3. Anderson TW (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22(3):327–351.
  4. Anderson TW (2002a). Reduced rank regression in cointegrated models. Journal of Econometrics, 106(2):203–216.
  5. Anderson TW (2002b). Specification and misspecification in reduced rank regression. Sankhyā: The Indian Journal of Statistics, Series A, pages 193–205.
  6. Ashraphijuo M, Wang X, and Aggarwal V (2017). Rank determination for low-rank data completion. The Journal of Machine Learning Research, 18(1):3422–3450.
  7. Bernardini E and Cubadda G (2015). Macroeconomic forecasting and structural analysis through regularized reduced-rank regression. International Journal of Forecasting, 31(3):682–691.
  8. Bunea F, She Y, and Wegkamp MH (2011). Optimal selection of reduced rank estimators of high-dimensional matrices. Annals of Statistics, 39(2):1282–1309.
  9. Chen K (2016). Model diagnostics in reduced-rank estimation. Statistics and Its Interface, 9(4):469.
  10. Chen K, Chan K-S, and Stenseth NC (2012). Reduced rank stochastic regression with a sparse singular value decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):203–221.
  11. Chen K, Dong H, and Chan K-S (2013). Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4):901–920.
  12. Chen L and Huang JZ (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500):1533–1545.
  13. Corander J and Villani M (2004). Bayesian assessment of dimensionality in reduced rank regression. Statistica Neerlandica, 58(3):255–270.
  14. Fan Y and Tang CY (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 531–552.
  15. Geng H, Iqbal J, Chan WC, and Ali HH (2011). Virtual CGH: an integrative approach to predict genetic abnormalities from gene expression microarray data applied in lymphoma. BMC Medical Genomics, 4(1):1–14.
  16. Golub GH, Heath M, and Wahba G (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223.
  17. Izenman AJ (2008). Modern Multivariate Statistical Techniques: Regression, Classification and Manifold Learning. Springer.
  18. Jiang Y, He Y, and Zhang H (2016). Variable selection with prior information for generalized linear models via the prior lasso method. Journal of the American Statistical Association, 111(513):355–376.
  19. Kanagal B and Sindhwani V (2010). Rank selection in low-rank matrix approximations: A study of cross-validation for NMFs. In Proc Conf Adv Neural Inf Process, volume 1, pages 10–15.
  20. Kobak D, Bernaerts Y, Weis MA, Scala F, Tolias A, and Berens P (2019). Sparse reduced-rank regression for exploratory visualization of multimodal data sets. bioRxiv, page 302208.
  21. Kong X (2020). A random-perturbation-based rank estimator of the number of factors. Biometrika, 107(2):505–511.
  22. Liu H, Roeder K, and Wasserman L (2010). Stability approach to regularization selection (StARS) for high dimensional graphical models. Advances in Neural Information Processing Systems, 24(2):1432.
  23. Meinshausen N and Bühlmann P (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473.
  24. Mukherjee A, Chen K, Wang N, and Zhu J (2015). On the degrees of freedom of reduced-rank estimators in multivariate regression. Biometrika, 102(2):457–477.
  25. Mukherjee A and Zhu J (2011). Reduced rank ridge regression and its kernel extensions. Statistical Analysis and Data Mining: The ASA Data Science Journal, 4(6):612–622.
  26. Politis DN, Romano JP, and Wolf M (1999). Subsampling. Springer Science & Business Media.
  27. Schwarz G (1978). Estimating the dimension of a model. Annals of Statistics, 6(2):461–464.
  28. Shah RD and Samworth RJ (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):55–80.
  29. She Y (2017). Selective factor extraction in high dimensions. Biometrika, 104(1):97–110.
  30. She Y and Tran H (2019). On cross-validation for sparse reduced rank regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):145–161.
  31. Sun W, Wang J, and Fang Y (2013). Consistent selection of tuning parameters via variable selection stability. The Journal of Machine Learning Research, 14(1):3419–3440.
  32. Taeb A, Shah P, and Chandrasekaran V (2020). False discovery and its control in low rank estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(4):997–1027.
  33. Ulfarsson MO and Solo V (2013). Tuning parameter selection for underdetermined reduced-rank regression. IEEE Signal Processing Letters, 20(9):881–884.
  34. Velu R and Reinsel GC (2013). Multivariate Reduced-Rank Regression: Theory and Applications, volume 136. Springer Science & Business Media.
  35. Wen C, Ba H, Pan W, Huang M, and the Alzheimer's Disease Neuroimaging Initiative (2020). Co-sparse reduced-rank regression for association analysis between imaging phenotypes and genetic variants. Bioinformatics, 36(21):5214–5222.
  36. Witten DM, Tibshirani R, and Hastie T (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534.
  37. Yu B (2013). Stability. Bernoulli, 19(4):1484–1500.
  38. Yuan M (2016). Degrees of freedom in low rank matrix estimation. Science China Mathematics, 59(12):2485–2502.
  39. Yuan M, Ekici A, Lu Z, and Monteiro R (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3):329–346.
  40. Zhou Y, Zhang Q, Stephens O, Heuck CJ, Tian E, Sawyer JR, Cartron-Mizeracki M-A, Qu P, Keller J, Epstein J, et al. (2012). Prediction of cytogenetic abnormalities with gene expression profiles. Blood, 119(21):e148–e150.
