Abstract
This paper explores the homogeneity of coefficients in high-dimensional regression, which extends the concept of sparsity and is more general and suitable for many applications. Homogeneity arises when regression coefficients corresponding to neighboring geographical regions or a similar cluster of covariates are expected to be approximately the same. Sparsity corresponds to a special case of homogeneity with a large cluster of coefficients equal to the known atom zero. In this article, we propose a new method called clustering algorithm in regression via data-driven segmentation (CARDS) to explore homogeneity. New mathematical results are provided on the gain that can be achieved by exploring homogeneity. Statistical properties of two versions of CARDS are analyzed. In particular, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy for homogeneous parameters than that without homogeneity exploration. When our methods are combined with sparsity exploration, further efficiency can be achieved beyond the exploration of sparsity alone. This provides additional insights into the power of exploring low-dimensional structures in high-dimensional regression: homogeneity and sparsity. Our results also shed light on the properties of the fused Lasso. The newly developed method is further illustrated by simulation studies and applications to real data. Supplementary materials for this article are available online.
Keywords: clustering, homogeneity, sparsity
1 Introduction
Driven by applications in genomics, image processing, etc., high dimensionality has become one of the major themes in statistics. See Bühlmann and van de Geer (2011) and references therein for an overview of recent developments in this area. To overcome the difficulty of fitting high-dimensional models, one usually assumes that the true parameters lie in a low-dimensional subspace. For example, many papers focus on sparsity, i.e., only a small fraction of coefficients are nonzero (Chen, Donoho, and Saunders 1998; Tibshirani 1996). In this article, we consider a more general type of low-dimensional structure: homogeneity, i.e., the regression coefficients share the same values within their unknown clusters. A motivating example is gene network analysis, where it is assumed that genes cluster into groups which play similar roles in molecular processes (Kim and Xing 2009; Li and Li 2010). This can be modeled as a linear regression problem with groups of homogeneous coefficients. Similarly, in diagnostic lab tests, one often counts the number of positive results in a battery of medical tests, which implicitly assumes that their regression coefficients (impacts) in the joint model are approximately the same. In spatial-temporal studies, it is not unreasonable to assume that the dynamics of neighboring geographical regions are similar, namely, their regression coefficients are clustered (Fan, Lv, and Qi 2011; Huang, Hsu, Theobald, and Breidt 2010). In the same vein, financial returns of similar sectors of industry share similar loadings on risk factors.
Homogeneity is a more general assumption than sparsity, where the latter can be viewed as a special case of the former with a large group of 0-value coefficients. In addition, the atom 0 is known to data analysts. One advantage of assuming homogeneity rather than sparsity is that it enables us to possibly select more than n variables (n is the sample size). It is well known that the sparsity-based techniques, such as the lasso, can select at most n variables. Moreover, identifying the homogeneous groups naturally provides a structure in the covariates, which can be helpful in scientific discoveries.
Regression under the homogeneity setting has been previously studied in the literature. Park, Hastie, and Tibshirani (2007) propose a two-step method. Their method performs hierarchical clustering on the predictors, cuts the obtained dendrogram at an appropriate level, and treats the cluster averages as new predictors. The fused lasso (Friedman, Hastie, Höfling, and Tibshirani 2007; Tibshirani, Saunders, Rosset, Zhu, and Knight 2005) can also be regarded as an effort of exploring homogeneity, with the assistance of neighborhoods defined according to either time or location. In this sense, our newly proposed methods are different since we do not know such a neighborhood a priori. The clustering of homogeneous coefficients is completely data-driven. For example, in the fused Lasso, where a complete ordering of the covariates is given, Tibshirani et al. (2005) use the L1 penalty to penalize the pairwise differences of adjacent coordinates; in the case without a complete ordering, they suggest penalizing the pairs of ‘neighboring’ nodes in the sense of a general distance measure. Bondell and Reich (2008) propose the method OSCAR where a special octagonal shrinkage penalty is applied to each pair of coordinates to promote equal-value solutions. Shen and Huang (2010) develop an algorithm called Grouping Pursuit, where they use the truncated L1 penalty to penalize the pairwise differences for all pairs of coordinates. In an extension, Zhu, Shen, and Pan (in press) consider simultaneous grouping pursuit and feature selection by including additional truncated L1 penalties on the individual coefficients. Yang et al. (2012) explore simultaneous feature grouping and selection with the assistance of an undirected graph by penalizing the pairwise difference for each pair of coordinates that are connected by an edge in the graph. All the aforementioned methods either depend on a known ordering or graph of the covariates, which is sometimes not available, or use exhaustive pairwise penalties, which increase the computational complexity. Yang and He (2012) consider the homogeneity across coefficients of different percentile levels in quantile regression, and propose a Bayesian framework by using shrinkage priors to promote homogeneity. Although similar ideas may be applied to regression models, their settings are very different from ours, and there are no existing results on feature grouping for their method.
In this article, we propose a new method called Clustering Algorithm in Regression via Data-driven Segmentation (CARDS) to explore homogeneity. The main idea of CARDS is to take advantage of a preliminary estimate obtained without the homogeneity structure and to further shrink towards each other those coefficients whose preliminary estimates are close. In the basic version of CARDS, we first build an ordering of the covariates from the preliminary estimate and then run penalized least squares with fused penalties in the new ordering. The number of penalty terms is only (p − 1), compared to p(p − 1)/2 in the exhaustive pairwise penalties. On the other hand, an advanced version of CARDS builds an "ordered segmentation" on the covariates, which can be viewed as a generalized ordering, and imposes "hybrid pairwise penalties", which can be viewed as a generalization of fused penalties. This version of CARDS better tolerates possible misorderings in the preliminary estimate and is thus more robust. Compared with other existing methods for homogeneity exploration, CARDS can successfully deal with the case of unordered covariates. At the same time, it avoids using exhaustive pairwise penalties and can be computationally more efficient than the Grouping Pursuit and OSCAR.
We study CARDS in detail by providing theoretical analysis. Our analysis reveals that the sum of squared errors of the estimated coefficients is Op(K/n), where K is the number of true homogeneous groups. Therefore, the smaller the number of true groups, the better the precision that can be achieved. In particular, when K = p, there is no homogeneity to explore and the result reduces to the case without grouping. Moreover, in order to exactly recover the true groups with high probability, the minimum signal strength (the gap between different groups) needs to be of the order , where the |Ak|'s are the sizes of the true groups. In addition, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy than that without homogeneity exploration. Furthermore, our results can be combined with sparsity-based results to provide additional insights into the power of exploring low-dimensional structure in high-dimensional regression: homogeneity and sparsity. As a byproduct, our analysis of the basic version of CARDS also establishes a framework for analyzing fused-type penalties, which is new to our knowledge.
Throughout this paper, we consider the following linear regression setting
y = Xβ0 + ε, (1)
where X = (x1, · · ·, xp) is an n × p design matrix, y = (y1, · · ·, yn)T is an n × 1 vector of responses, β0 = (β01, · · ·, β0p)T denotes the true parameters of interest, and ε = (ε1, · · ·, εn)T contains independently and identically distributed noises with E(εi) = 0 and Var(εi) = σ². We assume further that there is a partition of {1, 2, · · ·, p}, denoted as 𝒜 = (A0, A1, · · ·, AK), such that
β0j = β0A,k for all j ∈ Ak, 0 ≤ k ≤ K, (2)
where β0A,k is the common value shared by all regression coefficients whose indices are in Ak. By default, β0A,0 = 0, so A0 is the group of 0-value coefficients. This allows us to explore homogeneity and sparsity simultaneously. Write β0A = (β0A,1, · · ·, β0A,K)T. Without loss of generality, we assume β0A,1 < β0A,2 < · · · < β0A,K.
Our theory and methods are stated for the standard least-squares problem although they can be adapted to other more sophisticated models. For example, when forecasting housing appreciation in the United States (Fan et al. 2011), one builds the spatial-temporal model
Yit = XitTβi + εit, (3)
in which i indicates a spatial location and t indicates time. It is expected that the βi's are approximately the same for neighboring zip codes i, and this type of homogeneity can be explored in a similar fashion. Similarly, when Yit represents the return of a stock and Xit = Xt stands for common market risk factors, one can assume a certain degree of homogeneity within each sector of industry; namely, the factor loading vector βi is approximately the same for stocks belonging to the same sector of industry.
Throughout this paper, ℝ denotes the set of real numbers and ℝp denotes the p-dimensional real Euclidean space. For any a, b ∈ ℝ, a ∨ b denotes the maximum between a and b. For any positive sequences {an} and {bn}, we write an ≫ bn if an/bn → ∞ as n → ∞. Given 1 ≤ q < ∞, for any vector x, ||x||q = (Σj |xj|q)1/q denotes the Lq-norm of x and ||x||∞ = maxj{|xj|}. For any matrix M, ||M||q = maxx:||x||q=1 ||Mx||q denotes the matrix Lq-norm of M. In particular, ||M||∞ is the maximum absolute row sum of M. We omit the subscript q when q = 2. ||M||max = maxi,j{|Mij|} denotes the matrix max norm. When M is symmetric, λmax(M) and λmin(M) denote the maximum and minimum eigenvalues of M, respectively.
The rest of the paper is organized as follows. Section 2 describes CARDS, including the basic, advanced and shrinkage versions. Section 3 studies theoretical properties of the basic version of CARDS, and Section 4 analyzes the advanced and shrinkage versions. Sections 5 and 6 present the results of simulation studies and real data analysis, respectively. Some concluding remarks are given in Section 7. Proofs can be found in the Appendix and supplemental materials.
2 CARDS: a data-driven pairwise shrinkage procedure
2.1 Basic version of CARDS
Without considering the homogeneity assumption (2), there are many methods available for fitting model (1). Let β̃ be such a preliminary estimator. A simple idea to generate homogeneity is as follows: first, rearrange the coefficients in β̃ in the ascending order; second, group together those adjacent indices whose coefficients in β̃ are close to each other; finally, force indices in each estimated group to share a common coefficient and refit model (1). A problem of this naive procedure is how to group the indices. Alternatively, we can run a penalized least squares to simultaneously extract the grouping structure and estimate coefficients. To shrink coefficients of adjacent indices (after reordering) towards homogeneity, we can add fused penalties, i.e., {|βi+1 − βi|, i = 1, · · ·, p − 1} are penalized. This leads to the following two-stage procedure:
- Preordering: Construct the rank statistics {τ(j) : 1 ≤ j ≤ p} such that β̃τ(j) is the j-th smallest value in {β̃i, 1 ≤ i ≤ p}, i.e.,
β̃τ(1) ≤ β̃τ(2) ≤ · · · ≤ β̃τ(p). (4)
- Estimation: Given a folded concave penalty function pλ(·) (Fan and Li 2001) with a regularization parameter λ, the final estimate is given by
β̂ = argminβ { (2n)−1||y − Xβ||² + Σ1≤j≤p−1 pλ(|βτ(j+1) − βτ(j)|) }. (5)
We call this two-stage procedure bCARDS (basic CARDS).
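As a concrete illustration, the following minimal sketch implements the two stages of bCARDS for a small problem: it ranks an OLS preliminary estimate and then numerically minimizes an SCAD-penalized least squares of the form (5). The normalization of the loss, the smoothing of the absolute value, and the use of a generic quasi-Newton optimizer (instead of the LLA or CCCP algorithms discussed below) are illustrative choices for this sketch, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def scad(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) of Fan and Li (2001), evaluated at |t|."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 (a + 1) * lam ** 2 / 2))

def bcards(X, y, lam, a=3.7, eps=1e-6):
    """Basic CARDS: rank OLS coefficients, then fuse coefficients adjacent in that rank."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # preliminary estimator
    tau = np.argsort(beta_ols)                          # rank mapping tau

    def objective(beta):
        gaps = beta[tau[1:]] - beta[tau[:-1]]           # adjacent differences in the new order
        smooth_abs = np.sqrt(gaps ** 2 + eps)           # smooth |.| so a gradient method applies
        return 0.5 * np.sum((y - X @ beta) ** 2) / n + np.sum(scad(smooth_abs, lam, a))

    res = minimize(objective, beta_ols, method="L-BFGS-B")
    return res.x, tau

# toy usage: two homogeneous groups of coefficients
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta0 = np.r_[np.full(5, 1.0), np.full(5, -1.0)]
y = X @ beta0 + rng.standard_normal(100)
beta_hat, _ = bcards(X, y, lam=0.5)
print(np.round(beta_hat, 2))
```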
At the first stage, bCARDS establishes a data-driven rank mapping τ(·) based on the preliminary estimator β̃. At the second stage, only “adjacent” coefficient pairs in the order τ are penalized, resulting in only (p −1) penalty terms in total. In addition, (5) does not require that βτ(j) ≤ βτ(j+1). This allows coordinates in β̂ to have a different order of increasing values from that in β̃.
With an appropriately large tuning parameter λ, β̂ is a piecewise constant vector in the order τ and consequently its elements have homogeneous groups. In Section 3, we shall show that, if τ is consistent with the order of β0, that is,
β0τ(1) ≤ β0τ(2) ≤ · · · ≤ β0τ(p), (6)
then under some regularity conditions, β̂ can consistently estimate the true coefficient groups of β0 with high probability.
When pλ(·) is a folded-concave penalty function (e.g., SCAD (Fan and Li 2001) or MCP (Zhang 2010)), (5) is a non-convex optimization problem. It is generally challenging to find the global minimizer. The local linear approximation (LLA) algorithm can be applied to find a local minimizer for any fixed initial solution; see Zou and Li (2008), Fan, Xue, and Zou (2012) and references therein for details. The concave-convex procedure (CCCP) can also be applied to produce a local minimizer; see Kim, Choi, and Oh (2008) and Wang, Kim, and Li (in press) for a detailed explanation of CCCP.
2.2 Advanced version of CARDS
To guarantee the success of bCARDS, (6) is an essential condition. It requires that the rank mapping τ never ranks a coordinate with a larger true coefficient ahead of one with a smaller true coefficient. This imposes fairly strong conditions on the preliminary estimator β̃. For example, (6) can be violated if ||β̃ − β0||∞ is larger than the minimum gap between groups. To relax such a restrictive requirement, we now introduce an advanced version of CARDS, where the main idea is to use less information from β̃ and to add more penalty terms in (5).
We first introduce the ordered segmentation, which can be viewed as a generalized ordering. Note that each rank mapping τ in bCARDS actually defines a partition of {1, · · ·, p} into p disjoint sets B1, · · ·, Bp with Bj = {τ(j)} being a singleton. Similarly, we may divide {1, · · ·, p} into L(≤p) disjoint sets B1, · · ·, BL, where Bl’s are not necessarily singletons. We call such Bl’s segments. The segments B1, · · ·, BL are ordered, but the ordering of coordinates within each segment is not defined. This is similar to letter grades assigned to a course. A formal definition is as follows:
Definition 2.1
For an integer 1 ≤ L ≤ p, the mapping ϒ : {1, · · ·, p} → {1, · · ·, L} is called an ordered segmentation if the sets Bl ≡ {1 ≤ j ≤ p : ϒ(j) = l}, 1 ≤ l ≤ L, form a partition of {1, · · ·, p}.
When L = p, ϒ is a one-to-one mapping and it defines a complete ordering.
Note that, in the basic version of CARDS, the preliminary estimator β̃ produces a complete rank mapping τ. Now in the advanced version of CARDS, instead of extracting a complete ordering, we only extract an ordered segmentation ϒ from β̃. The analogy is grading an exam: overall score rank (percentile rank) versus letter grade. Let δ > 0 be a predetermined parameter. First, obtain the rank mapping τ as in (4) and find all indices i2 < i3 < · · · < iL at which the sorted coefficients have gaps larger than δ, i.e., β̃τ(j) − β̃τ(j−1) > δ exactly when j ∈ {i2, · · ·, iL}. Then, construct the segments
Bl = {τ(il), τ(il + 1), · · ·, τ(il+1 − 1)}, l = 1, · · ·, L, (7)
where i1 = 1 and iL+1 = p + 1. This process is indeed similar to the letter grades that we assign to a course. The intuition behind this construction is that when β̃τ(j+1) ≤ β̃τ(j) + δ, i.e., the estimated coefficients of two "adjacent coordinates" differ by only a small amount, we do not trust the ordering between them and group them into the same segment. Compared to the complete ordering τ, the ordered segments {B1, · · ·, BL} utilize less information from β̃ and hence are less sensitive to the estimation error in β̃.
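The segmentation step (7) is straightforward to implement. A minimal sketch is given below, splitting the sorted preliminary coefficients wherever two adjacent values differ by more than δ; the function and variable names are illustrative.

```python
import numpy as np

def ordered_segmentation(beta_tilde, delta):
    """Construct the ordered segments B_1, ..., B_L of (7): a new segment starts
    whenever two adjacent sorted preliminary coefficients differ by more than delta."""
    tau = np.argsort(beta_tilde)            # rank mapping tau
    vals = beta_tilde[tau]
    segments, current = [], [int(tau[0])]
    for j in range(1, len(tau)):
        if vals[j] - vals[j - 1] > delta:   # gap larger than delta: close the segment
            segments.append(current)
            current = []
        current.append(int(tau[j]))
    segments.append(current)
    return segments

# delta = 0 recovers the complete ordering used by bCARDS; larger delta merges close coefficients
print(ordered_segmentation(np.array([0.1, 1.05, 0.15, 1.0, 2.3]), delta=0.3))
# [[0, 2], [3, 1], [4]]
```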
Given an ordered segmentation ϒ, we wish to take advantage of the order of segments B1, · · ·, BL and at the same time allow flexibility of order shuffling within each segment. Towards this goal, we introduce the hybrid pairwise penalty.
Definition 2.2
Given a penalty function pλ(·) and tuning parameters λ1 and λ2, the hybrid pairwise penalty corresponding to an ordered segmentation ϒ is
𝒫ϒ(β; λ1, λ2) = Σ1≤l≤L−1 Σi∈Bl, j∈Bl+1 pλ1(|βi − βj|) + Σ1≤l≤L Σi,j∈Bl, i<j pλ2(|βi − βj|). (8)
In (8), we call the first part between-segment penalty and the second part within-segment penalty. The within-segment penalty penalizes all pairs of indices in each segment, hence, it does not rely on any ordering within the segment. The between-segment penalty penalizes pairs of indices from two adjacent segments, and it can be viewed as a “generalized” fused penalty on segments.
When L = p, each Bl is a singleton and (8) reduces to the fused penalty in (5). On the other hand, when L = 1, namely, no prior information about β, there is only one segment B1 = {1, · · ·, p}, and (8) reduces to the exhaustive pairwise penalty
Σ1≤i<j≤p pλ(|βi − βj|). (9)
It is also called the total variation penalty (Harchaoui and Lévy-Leduc 2010), and the case with pλ(·) being a truncated L1 penalty is studied in Shen and Huang (2010). Thus, the penalty (8) is a generalization of both the fused penalty and the total variation penalty, which explains the name “hybrid”.
The main motivation for introducing the hybrid pairwise penalty is to provide a set of intermediate versions between the fused penalty and the total variation penalty. When using pairwise penalties to promote homogeneity, we need to penalize "enough" pairs to guarantee that all true groups can be exactly recovered when the signal-to-noise ratio is sufficiently large. Given a consistent ordering, the fused penalty contains "just enough" pairs; but when the ordering is inconsistent, we have to penalize more pairs to achieve the aforementioned exact group recovery (see Section 2.3 for a numerical example). However, it may not be a good choice to include all pairs, i.e., to use the total variation penalty, as the large number of redundant pairs can result in statistical and computational inefficiency. The hybrid penalty is designed to include "just enough" pairs, adapting to the partial ordering information available in an ordered segmentation.
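One way to see how many pairs the hybrid penalty involves is to enumerate them. The helper below is an illustrative sketch that lists the between-segment and within-segment pairs for a given ordered segmentation; with L = p it returns the p − 1 fused pairs, and with L = 1 it returns all p(p − 1)/2 pairs.

```python
from itertools import combinations, product

def hybrid_pairs(segments):
    """Return the index pairs penalized by the hybrid pairwise penalty (8).

    segments: list of lists B_1, ..., B_L (an ordered segmentation).
    """
    between = [(i, j) for l in range(len(segments) - 1)
               for i, j in product(segments[l], segments[l + 1])]
    within = [pair for B in segments for pair in combinations(B, 2)]
    return between, within

p = 6
fused = hybrid_pairs([[j] for j in range(p)])        # L = p: reduces to the fused penalty
exhaustive = hybrid_pairs([list(range(p))])          # L = 1: total variation penalty
print(len(fused[0]) + len(fused[1]))                 # p - 1 = 5 pairs
print(len(exhaustive[0]) + len(exhaustive[1]))       # p(p-1)/2 = 15 pairs
```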
Now, we discuss how the requirement (6) can be relaxed. If we let Bj = {τ(j)}, then (6) can be written equivalently as maxi∈Bj β0i ≤ mini∈Bj+1 β0i, for 1 ≤ j ≤ p − 1. This condition can be generalized to the case where the Bj's are not singletons.
Definition 2.3
An ordered segmentation ϒ preserves the order of β0 if maxj∈Bl β0j ≤ minj∈Bl+1 β0j, for l = 1, · · ·, L − 1.
In the construction (7), even if (6) does not hold, it is still possible that the resulting ϒ preserves the order of β0. Consider a toy example where p = 4 and β0τ(1) = β0τ(2) = β0τ(4) < β0τ(3), so that {τ(1), τ(2), τ(4)} and {τ(3)} are two true homogeneous groups in β0. Here τ ranks the coordinate τ(3) ahead of τ(4) based on the preliminary estimator β̃, but β0τ(3) > β0τ(4). So, τ fails to give a consistent ordering. However, as long as β̃τ(4) ≤ β̃τ(3) + δ, τ(3) and τ(4) are grouped into the same segment in (7), say, B1 = {τ(1), τ(2)} and B2 = {τ(3), τ(4)}. Then ϒ still preserves the order of β0 according to Definition 2.3.
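Definition 2.3 is easy to check numerically. The short function below is an illustrative sketch that tests whether a given segmentation preserves the order of β0, using the toy example above: the complete ordering fails, but the two-segment segmentation still preserves the order.

```python
def preserves_order(segments, beta0):
    """Check Definition 2.3: max of beta0 over B_l <= min of beta0 over B_{l+1} for every l."""
    return all(
        max(beta0[i] for i in segments[l]) <= min(beta0[i] for i in segments[l + 1])
        for l in range(len(segments) - 1)
    )

# toy example with p = 4: coordinates 0, 1, 3 share one true value, coordinate 2 has a larger one
beta0 = [1.0, 1.0, 2.0, 1.0]
print(preserves_order([[0], [1], [2], [3]], beta0))  # False: the complete ordering is inconsistent
print(preserves_order([[0, 1], [2, 3]], beta0))      # True: the segmentation preserves the order
```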
Now we formally introduce the advanced version of Clustering Algorithm in Regression via Data-driven Segmentation (aCARDS). It consists of three steps, where the first two steps are very similar to the way that we assign letter grades based on scores of an exam.
Preliminary Ranking: Given a preliminary estimate β̃, generate the rank statistics {τ(j) : 1 ≤ j ≤ p} such that β̃τ(1) ≤ β̃τ(2) ≤ · · · ≤ β̃τ(p).
Segmentation: For a tuning parameter δ > 0, construct an ordered segmentation ϒ as described in (7).
Estimation: For tuning parameters λ1 and λ2, compute the solution β̂ that minimizes
(2n)−1||y − Xβ||² + 𝒫ϒ(β; λ1, λ2). (10)
We call this procedure aCARDS (advanced CARDS).
In Section 4, we shall show that if ϒ preserves the order of β0, then under certain conditions, β̂ recovers the true homogeneous groups of β0 with high probability. Therefore, to guarantee the success of aCARDS, we need the existence of a δ > 0 for the preliminary estimator β̃ such that the associated ϒ preserves the order of β0. The above toy example shows that even when (6) fails, this condition can still hold. So aCARDS requires weaker conditions on β̃ than bCARDS. This is because the hybrid penalty contains penalty terms corresponding to more pairs of indices and hence is more robust to misordering in τ. In fact, bCARDS is a special case of aCARDS with δ = 0.
2.3 Comparison of two versions of CARDS
In this subsection, we first use a numerical example to compare bCARDS and aCARDS. It reveals how the ordered segmentation and hybrid pairwise penalty (8) play a role in aCARDS. We then discuss how to choose between two versions of CARDS in real data analysis.
We generate a data set with p = 40 predictors and n = 100 samples. The predictors are divided into two homogeneous groups, each of size 20. The true coefficient β0j takes one common value for all j in Group 1 and a different common value for all j in Group 2. X1, · · ·, Xn are generated independently and identically from Np(0, I), and yi = XiTβ0 + εi for 1 ≤ i ≤ n, where ε1, · · ·, εn are independent noises with a standard normal distribution. In aCARDS, we take the ordinary least squares (OLS) estimator as the preliminary estimator. Figure 1 plots the sorted OLS coefficients for a realization. The estimated rank is not exactly consistent with the order of β0 since the predictors τ(17) and τ(18), which belong to Group 2, are mistakenly ranked ahead of some predictors in Group 1. If we use only the fused penalty, the only terms that involve τ(17) and τ(18) are those pairing them with their immediate neighbors in the ranking, namely pλ(|βτ(17) − βτ(16)|), pλ(|βτ(18) − βτ(17)|) and pλ(|βτ(19) − βτ(18)|).
Figure 1.
Illustration of the hybrid pairwise penalty and the aCARDS algorithm. Top panel: OLS coefficients and the associated ordered segmentation. Red dots and blue crosses represent predictors from Group 1 and Group 2, respectively. Bottom panel: Solution paths of bCARDS (left) and aCARDS (right) under misranking. The ranking and ordered segmentation are the same as in the top panel. For bCARDS, the horizontal axis represents the parameter λ. For aCARDS, the horizontal axis represents the between-segment parameter λ1, where we fix the within-segment parameter λ2 = 0.02. The vertical axis represents the estimated 40 regression coefficients, which are shrunk towards homogeneity (as the figures do not start from the smallest λ, we do not see initial 40 regression coefficients).
There are no penalty terms to shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficients of other predictors in Group 2. Now, suppose that we extract an ordered segmentation from the OLS coefficients by taking δ = 0.3; see Figure 1. Since it allows for arbitrary order reshuffling within the segment B4, this ordered segmentation preserves the order of β0, i.e., Definition 2.3 is satisfied. The hybrid pairwise penalty associated with this segmentation includes terms such as pλ1(|βτ(17) − βτ(23)|) and pλ1(|βτ(18) − βτ(23)|) between segments B4 and B5. So aCARDS will shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficient of τ(23), a predictor in Group 2. Moreover, there are terms such as pλ1(|βτ(23) − βτ(j)|), for predictors τ(j) in B6, between segments B5 and B6. So aCARDS will also shrink the coefficient of τ(23) towards being equal to the coefficients of other predictors in Group 2. Eventually, aCARDS will shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficients of many other predictors in Group 2. This example explains how the ordered segmentation and hybrid penalty help overcome issues caused by misranking in the preliminary estimator.
To better illustrate the effects of fused penalty and hybrid penalty under misranking, we fix the estimated rank and ordered segmentation from above, and compute the solution paths of both bCARDS and aCARDS. Note that the penalty terms in both (5) and (8) are now fixed (hence we do not need the parameter δ in aCARDS). For bCARDS, we let λ vary. For aCARDS, we set the within-segment parameter λ2 = 0.02 and let the between-segment parameter λ1 vary. Figure 1 displays the solution paths. We see that although bCARDS does not include the true grouping in the solution path owing to misranking, aCARDS still achieves the true grouping, which is a benefit of the hybrid penalty.
In practical data analysis, we need not choose between the two versions of CARDS in advance; the tuning parameter selection process automatically tells us which version to use. This is because bCARDS is a special case of aCARDS with δ = 0. We only need to include 0 among the candidate values of δ and select δ in a data-driven manner (e.g., AIC, BIC and GCV). We call the resulting method CARDS, which involves a data-driven selection between bCARDS and aCARDS.
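The data-driven selection between bCARDS and aCARDS can be written as a simple loop over candidate values of δ. The sketch below is schematic: fit_acards stands for a hypothetical routine solving (10) for a given δ, and the BIC formula shown, with the number of distinct coefficient values as a rough degrees of freedom, is one common choice rather than necessarily the one used here.

```python
import numpy as np

def bic(y, X, beta_hat):
    """A generic BIC: n*log(RSS/n) + log(n) * (number of distinct coefficient values)."""
    n = len(y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    df = len(np.unique(np.round(beta_hat, 8)))   # distinct values as a rough degrees of freedom
    return n * np.log(rss / n) + np.log(n) * df

def select_cards(y, X, fit_acards, delta_grid=(0.0, 0.1, 0.2, 0.3)):
    """Pick delta (and hence bCARDS vs. aCARDS) by minimizing BIC; delta = 0 recovers bCARDS."""
    fits = {d: fit_acards(y, X, delta=d) for d in delta_grid}   # fit_acards: hypothetical solver of (10)
    best = min(fits, key=lambda d: bic(y, X, fits[d]))
    return best, fits[best]
```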
2.4 CARDS under sparsity
In applications, we may need to explore homogeneity and sparsity simultaneously. Often the preliminary estimator β̃ takes into account the sparsity, namely it is obtained with a penalized least-squares method (Fan and Li 2001; Tibshirani et al. 2005) or sure independence screening (Fan and Lv 2008). Suppose β̃ has the sure screening property, i.e., S0 ⊂ S̃ with high probability, where S̃ and S0 denote the support of β̃ and β0, respectively. We modify CARDS as follows: in the first two steps, using the non-zero elements of β̃, we can similarly construct a data-driven hybrid penalty only on coefficients of variables in S̃; in the third step, we fix β̂S̃c = 0 and obtain β̂S̃ by minimizing the following penalized least squares
(2n)−1||y − XS̃βS̃||² + 𝒫ϒ(βS̃; λ1, λ2) + Σj∈S̃ pλ(|βj|), (11)
where XS̃ is the submatrix of X restricted to columns in S̃. In (11), the second term is the hybrid penalty to encourage homogeneity among coefficients of variables already selected in β̃, and the third term is the element-wise penalty to help further filter out falsely selected variables. We call this modified version sCARDS (shrinkage CARDS).
3 Analysis of the basic CARDS
In this section, we analyze theoretical properties of bCARDS. Due to the limited space, we state the results here and only prove Theorems 3.1–3.3 in the Appendix, leaving the rest of the proofs to the online supplemental materials of this paper.
3.1 Heuristics
We first provide some heuristics on why taking advantage of the homogeneity helps reduce the estimation error ||β̂ − β0||. Consider an ideal case of orthogonal design XTX = nIp (necessarily p ≤ n). The ordinary least-squares estimator β̂ols = (XTX)−1XTy has the decomposition
β̂ols = β0 + n−1XTε.
It is clear by the square-root law that ||β̂ols − β0|| = Op(√(p/n)). Now, if there are K homogeneous groups in β0 and we know the true groups, the original model (1) can be rewritten as
y = XAβ0A + ε,
where β0A = (β0A,1, · · ·, β0A,K)T contains the distinct values in β0, and XA = (xA,1, · · ·, xA,K) is such that xA,k = Σj∈Ak xj. The corresponding ordinary least-squares estimator β̂A = (XATXA)−1XATy has the decomposition
β̂A,k = β0A,k + (n|Ak|)−1 Σj∈Ak xjTε, k = 1, · · ·, K. (12)
Here (n|Ak|)−1 Σj∈Ak xjTε is the noise averaged over group k. The oracle estimator β̂oracle is defined such that its j-th component equals β̂A,k for all j ∈ Ak. Then, by the square-root law,
|β̂A,k − β0A,k| = Op((n|Ak|)−1/2), k = 1, · · ·, K,
which immediately implies that ||β̂oracle − β0|| = Op(√(K/n)).
The surprises of the results are twofold. First, ||β̂oracle − β0|| has the rate of convergence √(K/n) instead of √(p/n). The point is that in (12) the noises are averaged within each group, thanks to exploiting homogeneity, and consequently each β0A,k is estimated more accurately. The second surprise is that the rate has nothing to do with the sizes of the true homogeneous groups. No matter whether we have K groups of equal size, or one dominating group with (K − 1) other small groups, the rate is always the same for the oracle estimator.
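The √(K/n) versus √(p/n) comparison can be checked quickly by simulation. The snippet below is a rough numerical illustration under an i.i.d. Gaussian design (rather than an exactly orthogonal one), comparing the estimation error of OLS with that of the oracle estimator that collapses columns within known groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 400, 100, 4
groups = np.repeat(np.arange(K), p // K)            # known group labels, equal sizes
beta0 = np.array([-2.0, -1.0, 1.0, 2.0])[groups]    # homogeneous true coefficients

err_ols, err_oracle = [], []
for _ in range(200):
    X = rng.standard_normal((n, p))
    y = X @ beta0 + rng.standard_normal(n)
    # OLS without using homogeneity
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    # oracle: collapse columns within each true group, then least squares on K columns
    XA = np.column_stack([X[:, groups == k].sum(axis=1) for k in range(K)])
    bA = np.linalg.lstsq(XA, y, rcond=None)[0]
    b_oracle = bA[groups]                           # duplicate common values back to length p
    err_ols.append(np.linalg.norm(b_ols - beta0))
    err_oracle.append(np.linalg.norm(b_oracle - beta0))

print(np.mean(err_ols), np.sqrt(p / n))             # OLS error is roughly of order sqrt(p/n) = 0.5
print(np.mean(err_oracle), np.sqrt(K / n))          # oracle error is roughly of order sqrt(K/n) = 0.1
```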
3.2 Notations and regularity conditions
Let ℳ𝒜 be the subspace of ℝp defined by
ℳ𝒜 = {β ∈ ℝp : βi = βj, for any i, j ∈ Ak, 1 ≤ k ≤ K}.
For each β ∈ ℳ𝒜, we can always write Xβ = XAβA, where XA = (xA,1, · · ·, xA,K) is an n × K matrix such that its k-th column xA,k = Σj∈Ak xj, and βA is a K-dimensional vector with its k-th component βA,k being the common coefficient in group Ak. Define the matrix D = diag(|A1|1/2, · · ·, |AK|1/2). We introduce the following conditions on the design matrix X:
Condition 3.1
||xj||² = n, for 1 ≤ j ≤ p. The eigenvalues of the K × K matrix n−1D−1XATXAD−1 are bounded below by c1 > 0 and bounded above by c2 > 0.
In the case of orthogonal designs, i.e., XTX = nIp, the matrix n−1D−1XATXAD−1 simplifies to IK, and c1 = c2 = 1.
Let ρ(t) = λ−1pλ(t) and ρ̄(t) = ρ′(|t|)sgn(t). We assume that the penalty function pλ(·) satisfies the following condition.
Condition 3.2
pλ(·) is a symmetric function and it is non-decreasing and concave on [0, ∞). ρ′(t) exists and is continuous except for a finite number of t with ρ′(0+) = 1. There exists a constant a > 0 such that ρ(t) is a constant for all |t| ≥ aλ.
We also assume that the noise vector ε = (ε1, · · ·, εn)T has sub-Gaussian tails.
Condition 3.3
For any vector a ∈ ℝn and x > 0, P(|aTε| > ||a||x) ≤ 2 exp(−c3x²), where c3 is a positive constant.
Given the design matrix X, let Xk be its submatrix formed by including only columns in Ak, for 1 ≤ k ≤ K. For any vector v ∈ ℝq, let be the “deviation from centrality”. Define
(13) |
where λmax(·) denotes the largest eigenvalue operator. In the case of orthogonal design, σk = 1 and νk = 0. Let
bn = (1/2) min1≤k≤K−1 (β0A,k+1 − β0A,k) be half of the minimum gap between groups in β0, and λ = λn be the tuning parameter in the penalty function.
3.3 Properties of bCARDS
When the true groups A1, · · ·, AK are known, the oracle estimator is
β̂oracle = argminβ∈ℳ𝒜 (2n)−1||y − Xβ||².
Theorem 3.1
Suppose Conditions 3.1–3.3 hold, K = o(n), and the preliminary estimator β̃ generates a rank mapping τ that is consistent with the order of β0, i.e., (6) holds, with probability at least 1 − ε0. If the half minimum gap between groups, bn, satisfies that bn > aλn, where a is the same as that in Condition 3.2, and
(14) |
then with probability at least 1 − ε0 − n−1K − (n ∨ p)−1, the bCARDS objective function (5) has a strictly local minimizer β̂ such that
β̂ = β̂oracle,
||β̂ − β0|| = Op(√(K/n)).
Theorem 3.1 shows that bCARDS includes the oracle estimator as a strictly local minimizer, with overwhelming probability. This strong oracle property is a stronger result than the oracle property in Fan and Li (2001).
The objective function (5) in bCARDS is non-convex and may have multiple local minimizers. In practice, we apply the Local Linear Approximation algorithm (LLA) (Zou and Li 2008) to solve it: start from an initial solution β̂(0) = β̂initial; at step m, update the solution by
β̂(m) = argminβ { (2n)−1||y − Xβ||² + Σ1≤j≤p−1 p′λ(|β̂τ(j+1)(m−1) − β̂τ(j)(m−1)|) · |βτ(j+1) − βτ(j)| }.
Given β̂initial, this algorithm produces a unique sequence of estimators that converges to a certain local minimizer. Next, Theorem 3.2 shows that under certain conditions, the sequence of estimators produced by the LLA algorithm converges to the oracle estimator.
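For concreteness, a minimal sketch of the LLA iteration for the bCARDS objective is given below: each step replaces the SCAD penalty by its tangent line at the current iterate, yielding a weighted fused-L1 problem, which this sketch solves with a generic numerical optimizer after smoothing the absolute value. The solver choice and smoothing are illustrative assumptions, not part of the original algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan and Li 2001) at |t|, used as LLA weights."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def lla_bcards(X, y, tau, lam, beta_init, n_iter=3, eps=1e-8):
    """Local linear approximation for the bCARDS objective (5); tau is a ranking of 0..p-1."""
    n = X.shape[0]
    beta = beta_init.copy()
    for _ in range(n_iter):
        gaps = beta[tau[1:]] - beta[tau[:-1]]
        w = scad_deriv(gaps, lam)                  # weights frozen at the current iterate

        def surrogate(b):
            g = b[tau[1:]] - b[tau[:-1]]
            return 0.5 * np.sum((y - X @ b) ** 2) / n + np.sum(w * np.sqrt(g ** 2 + eps))

        beta = minimize(surrogate, beta, method="L-BFGS-B").x
    return beta
```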
Theorem 3.2
Under the conditions of Theorem 3.1, suppose ρ′(λn) ≥ a0 for some constant a0 > 0, and that there exists an initial solution β̂initial of (5) satisfying ||β̂initial − β0||∞ ≤ λn/2. Then with probability at least 1 − ε0 − n−1K − (n ∨ p)−1, the LLA algorithm yields β̂oracle after one iteration, and it converges to β̂oracle after two iterations.
From Theorems 3.1 and 3.2, we conclude that bCARDS combined with the LLA algorithm yields the oracle estimator with overwhelming probability, provided that we have a good preliminary estimator β̃. Next, we discuss the choice of β̃.
Since we focus on dense problems in this section, the usual sparsity is not explicitly explored and the ordinary least squares estimator β̂ols = (XTX)−1XTy can be used as the preliminary estimator. The following theorem shows that it induces a rank consistent mapping with high probability.
Theorem 3.3
Suppose Condition 3.3 holds, p = O(nα) and , where 0 < α < 1 and c4 > 0 are constants. If where bn is the half of the minimum gap between groups in β0, then with probability at least 1 − O(n−α), the rank mapping τ generated from β̂ols is consistent with the order of β0.
When the rank mapping τ extracted from β̃ does not give a consistent order, i.e., (6) does not hold, the penalty in (5) is no longer a "correct" penalty for promoting the true grouping structure. There is no hope that local minimizers of (5) exactly recover the true groups. However, if there is not too much misranking in τ, it is still possible to control ||β̂ − β0||. Given a rank mapping τ, define
K*(τ) = #{1 ≤ j ≤ p − 1 : β0τ(j) ≠ β0τ(j+1)}.
It is the number of coefficient "jumps" in β0 under the order given by τ. These "jumps" partition the coordinates into consecutive subgroups (under the order τ), each being a subset of one true group. Although different subgroups may share the same true coefficient, any two consecutive subgroups have a gap in coefficient values. As a result, the above results apply to this subgrouping structure. The following theorem is a generalization and a direct application of the proof of Theorem 3.1 and its details are omitted.
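Counting the jumps K*(τ) induced by a preliminary ranking is straightforward; an illustrative one-liner is given below.

```python
def num_jumps(beta0, tau):
    """K*(tau): number of adjacent positions (in the order tau) whose true coefficients differ."""
    return sum(beta0[tau[j]] != beta0[tau[j + 1]] for j in range(len(tau) - 1))

# four true groups: a consistent ranking gives 3 jumps, while misranking can only increase the count
beta0 = [-2, -2, -1, -1, 1, 1, 2, 2]
print(num_jumps(beta0, [0, 1, 2, 3, 4, 5, 6, 7]))   # 3
print(num_jumps(beta0, [0, 4, 1, 2, 3, 5, 6, 7]))   # 5
```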
Theorem 3.4
Suppose Conditions 3.1–3.3 hold, K*(τ) = o(n), the half minimum gap bn > aλn, and λn satisfies (14) with K replaced by K*(τ). Then with probability at least 1 − ε0 − n−1K − (n ∨ p)−1, the bCARDS objective function (5) has a strictly local minimizer β̂ such that ||β̂ − β0|| = Op(√(K*(τ)/n)).
3.4 bCARDS with the L1 penalty
In the bCARDS formulation (5), ρ(t) can also be the L1 penalty function ρ(t) = |t|. It can be useful for getting the initial solution β̂initial for the LLA algorithm. However, ρ(t) = |t| does not satisfy Condition 3.2. Hence, Theorem 3.1 does not apply and the associated properties require additional study.
We first relax the requirement that τ is consistent with the order of β0. Instead, we consider the case that “τ is consistent with groups in β0”: there exists a permutation μ on {1, · · ·, K} and 1 = i1 < i2 < · · · < iK < iK+1 = p + 1 such that for k = 1, · · ·, K,
{τ(j) : ik ≤ j ≤ ik+1 − 1} = Aμ(k). (15)
When μ is the identity permutation, i.e., μ(k) = k, (15) is equivalent to (6) and τ is consistent with the order of β0. Under the condition (15), recovering the true groups is equivalent to locating the coefficient jumps in β0 under the order given by τ, and these jumps can have positive or negative magnitudes.
To guarantee the exact recovery of the jumps, we need a joint condition on X and β0, which is in the same spirit as the "irrepresentability" condition in Zhao and Yu (2006) but is specifically designed for the homogeneity setting. For notational simplicity, we change the indices of the groups so that μ(k) = k for all k. Note that the ordering β0A,1 < β0A,2 < · · · < β0A,K does not hold with these new group indices.
For k = 1, · · ·, K − 1, write . Define the K-dimensional vector d0 by and
Here d0 is the adjacent difference of the sign vector of jumps in β0. For example, suppose K = 4 and the signs of the three consecutive jumps between groups alternate. Then d0 = (1, −2, 2, −1). Also, define the p-dimensional vector
In the case of orthogonal design XTX = nIp, b0 ∈
and it has the form
for j ∈ Ak. For each j ∈ Ak, let
Namely, and contain indices in group k that are ranked above and below τ(j), respectively. Let be the proportion of indices in group k that are ranked above τ(j). Denote by the average of elements in b0 over the indices in , and the average over the indices in .
Condition 3.4
There exists a positive sequence {ωn}, which can go to 0, such that for 1 ≤ k ≤ K, j ∈ Ak and j ≠ jk+1 − 1,
(16) |
In the case of orthogonal design XTX = nIp, b0 ∈
and b̄kj − bkj = 0 holds for all k and j ∈ Ak. Condition 3.4 reduces to
This is possible only when
(17) |
Noting that 1/|Ak| ≤ θkj ≤ 1 − 1/|Ak|, the associated ωn can be chosen as mink{1/|Ak|} when (17) holds.
Theorem 3.5
Suppose Conditions 3.1, 3.3 and 3.4 hold, K = o(n), and the preliminary estimator β̃ generates an order τ that is consistent with groups in β0, i.e., (15) holds, with probability at least 1 − ε0. If the half minimum gap bn and the tuning parameter λn satisfy
(18) |
then with probability at least 1 − ε0 − n−1K − (n ∨ p)−1, the bCARDS objective function (5) with ρ(t) = |t| has a unique global minimizer β̂ such that
β̂ ∈ ℳ𝒜;
, k = 1, · · ·, K − 1;
, where .
Compared to Theorem 3.1, there is an extra bias term in the estimation error ||β̂ − β0||. Moreover, to achieve the exact recovery, it requires Condition 3.4, which is restrictive. For example, in the case of orthogonal designs, it is required that all consecutive jumps (under the order given by τ) have opposite signs.
4 Analysis of the advanced CARDS
In this section, we analyze aCARDS and its variant sCARDS. The proofs can be found in the online supplemental materials.
4.1 Properties of aCARDS
To guarantee the success of aCARDS, a key condition is that the ordered segmentation ϒ = {B1, · · ·, BL} defined in (7) preserves the order of β0 in the sense of Definition 2.3. This allows the ranking of coefficients in β̃ to deviate from that in β0, but not too much: for some δ > 0, whenever β0i < β0j, β̃i ≤ β̃j + δ must hold.
For given A1, · · ·, AK and a segmentation ϒ = {B1, · · ·, BL}, define
Here 1/|Ak|2 ≤ ϕk ≤ |Ak| for 1 ≤ k ≤ K.
Theorem 4.1
Suppose Conditions 3.1–3.3 hold, K = o(n), and the preliminary estimator β̃ and the tuning parameter δn together generate an ordered segmentation ϒ that preserves the order of β0, with probability at least 1−ε0. If the half minimum gap bn and the tuning parameters (λ1n, λ2n) in (10) satisfy that bn > a max{λ1n, λ2n}, where a is the same as that in Condition 3.2, and
(19) |
and
(20) |
then with probability at least 1 − ε0 − n−1K − 2(n ∨ p)−1, the aCARDS objective function (10) has a strictly local minimizer β̂ such that
β̂ = β̂oracle,
||β̂ − β0|| = Op(√(K/n)).
Compared to Theorem 3.1, aCARDS not only imposes less restrictive conditions on β̃, but also requires a smaller minimum gap between true coefficients.
Next, we establish the asymptotic normality of the CARDS estimator. By Theorem 4.1, with overwhelming probability, aCARDS performs just like the oracle. In the oracle situation, for example, if p = 5 and there are three true groups {β1, β4}, {β2} and {β3, β5}, the accuracy of estimating β is the same as if we knew the model:
y = (x1 + x4)βA,1 + x2βA,2 + (x3 + x5)βA,3 + ε.
Theorem 4.2
Given a positive integer q, let {Bn} be a sequence of matrices such that Bn ∈ ℝq×K and σ²Bn(n−1XATXA)−1BnT → H, where H is a fixed q × q positive definite matrix. Suppose the conditions of Theorem 4.1 hold and let β̂ be the local minimizer of the aCARDS objective function (10) given in Theorem 4.1. Then
√n Bn(β̂A − β0A) →d N(0, H),
where β̂A is the K-dimensional vector of distinct values in β̂.
Theorem 4.2 states the asymptotic normality of β̂A. Note that β̂ duplicates elements in β̂A. We introduce the following corollary to compare the asymptotic covariance of β̂ to that of β̂ols.
Corollary 4.1
Suppose conditions of Theorems 4.1 and 4.2 hold. Let β̂ols and β̂ be the ordinary least squares estimator and CARDS estimator respectively. Let Mn be the p × K matrix with Mn(j, k) = (1/|Ak|1/2)1{j ∈ Ak}. For any sequence of p-dimensional vectors {an},
with ;
with .
Moreover, v1n ≥ v2n.
4.2 Properties of sCARDS
In Section 2.4, we introduced sCARDS to explore both homogeneity and sparsity. In sCARDS, given a preliminary estimator β̃ and a parameter δ, we extract segments B1, · · ·, BL such that B1 ∪ · · · ∪ BL = S̃, where S̃ is the support of β̃. Denote B0 = {j : β̃j = 0}. In this case, we say ϒ = {B0, B1, · · ·, BL} preserves the order of β0 if
B0 ⊂ A0 and maxj∈Bl β0j ≤ minj∈Bl+1 β0j, for l = 1, · · ·, L − 1. (21)
This implies that β̃ has the sure screening property, and on those preliminarily selected variables, the data-driven segments preserve the order of true coefficients.
Suppose there is a group of zero coefficients in β0, namely, 𝒜 = (A0, A1, · · ·, AK). Let ℳ0𝒜 be the subspace of ℝp defined by
ℳ0𝒜 = {β ∈ ℝp : βj = 0 for all j ∈ A0, and βi = βj for any i, j ∈ Ak, 1 ≤ k ≤ K}.
The oracle estimator is
β̂oracle = argminβ∈ℳ0𝒜 (2n)−1||y − Xβ||².
Denote by S the support of β0, and write s = |S| and s̃ = |S̃|. Define
to be the half minimum signal strength.
Theorem 4.3
Suppose Conditions 3.1–3.3 hold, s = o(n), log(p) = o(n), and the preliminary estimator β̃ and the tuning parameter δn together generate an ordered segmentation ϒ that preserves the order of β0, i.e., (21) holds, with probability at least 1 − ε0. If , bn and the tuning parameters (λ1n, λ2n, λn) satisfy that , bn > a max{λ1n, λ2n} and
Then with probability at least 1−ε0 − n−1K − (n ∨ s)−1 − (n ∨ s̃)−1, the sCARDS objective function (11) has a strictly local minimizer β̂ such that
β̂ = β̂oracle,
||β̂ − β0|| = Op(√(K/n)).
The preliminary estimator β̃ can be chosen, for example, as the SCAD estimator
β̃ = argminβ { (2n)−1||y − Xβ||² + Σ1≤j≤p pλ′(|βj|) }, (22)
where pλ′(·) is the SCAD penalty function (Fan and Li 2001). The following theorem is a direct result of Theorem 2 in Fan and Lv (2011), and the proof is omitted.
Theorem 4.4
Under Conditions 3.1 and 3.3, if s = o(n), and , then with probability at least 1 − o(1), there exists a strictly local minimizer β̂scad and δn = O(log(n)/n) which together generate a segmentation preserving the order of β0.
5 Simulation studies
We conduct numerical experiments to study the performance of the two versions of CARDS, bCARDS and aCARDS, and their variant sCARDS. The goal is to investigate the performance of CARDS under different situations. Experiments 1–4 are based on the linear regression model yi = XiTβ0 + εi, with Experiments 1–3 exploring homogeneity only and Experiment 4 exploring homogeneity and sparsity simultaneously. Experiment 5 is based on the spatial-temporal model Yit = XtTβi + εit.
In all experiments, {Xi : 1 ≤ i ≤ n} or {Xt : 1 ≤ t ≤ T} are generated independently and identically from the multivariate standard Gaussian distributions, and {εi : 1 ≤ i ≤ n} or {εit: 1 ≤ i ≤ p, 1 ≤ t ≤ T} are IID samples of N(0, 1). All results are based on 100 repetitions.
Experiment 1
Consider the linear regression model with p = 60 and n = 100. Predictors are divided into four groups, each of size 15. The true regression coefficients shared within each group are −2r, −r, r and 2r, respectively. Different values of r > 0 lead to various signal-to-noise ratios. Here we let r take values in {1, 0.8, 0.5}, corresponding to high, moderate and low signal-to-noise ratio, respectively.
We compare the performance of six different methods: Oracle, ordinary least squares (OLS), bCARDS, aCARDS, total variation (TV), fused Lasso (fLasso). Oracle is the least squares estimator knowing the true groups. aCARDS and bCARDS are described in Section 2; here we let the penalty function pλ(·) be the SCAD penalty with a = 3.7, and take the OLS estimator as the preliminary estimator. TV uses the exhaustive pairwise penalty (9), where pλ(·) is also the SCAD penalty with a = 3.7. The fused Lasso is based on an order generated from ranking the OLS coefficients. In this sense, the fused Lasso is essentially bCARDS with the Lasso penalty pλ(t) = λ|t|. Tuning parameters of all these methods are selected via Bayesian information criteria (BIC).
Prediction performance of different methods is evaluated in terms of the average model error over an independent test set of size 10,000. The model error is the prediction error minus the variance of the noise, and it better reflects the performance of statistical methods. In addition, to measure how close the estimated grouping structure is to the true one, we introduce the normalized mutual information (NMI), which is a common measure of similarity between clusterings (Fred and Jain 2003). Suppose ℂ = {C1, C2, · · ·} and ⅅ = {D1, D2, · · ·} are two sets of disjoint clusters of {1, · · ·, p}; define
NMI(ℂ, ⅅ) = 2I(ℂ; ⅅ)/[H(ℂ) + H(ⅅ)],
where I(ℂ; ⅅ) = Σk,j(|Ck ∩ Dj|/p) log(p|Ck ∩ Dj|/|Ck||Dj|) is the mutual information between ℂ and ⅅ, and H(ℂ) = −Σk(|Ck|/p) log(|Ck|/p) is the entropy of ℂ. NMI(ℂ, ⅅ) takes values in [0, 1], and a larger NMI implies the two groupings are closer. In particular, NMI = 1 means that the two groupings are exactly the same.
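The NMI criterion can be computed directly from the cluster sizes and overlaps. The small implementation below follows the definition above with the arithmetic-mean normalization of Fred and Jain (2003); the helper name is illustrative.

```python
import numpy as np

def nmi(C, D, p):
    """Normalized mutual information between two partitions C and D of {0, ..., p-1}."""
    def entropy(parts):
        q = np.array([len(s) / p for s in parts])
        return -np.sum(q * np.log(q))

    mi = 0.0
    for Ck in C:
        for Dj in D:
            overlap = len(set(Ck) & set(Dj))
            if overlap > 0:
                mi += (overlap / p) * np.log(p * overlap / (len(Ck) * len(Dj)))
    return mi / (0.5 * (entropy(C) + entropy(D)))

# identical partitions give NMI = 1
C = [list(range(0, 5)), list(range(5, 10))]
print(nmi(C, C, p=10))   # 1.0 (up to floating point rounding)
```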
Figure 2 shows boxplots of the average model error and NMI for the six methods. We observe that, except for the case of weak signals (r = 0.5), the two versions of CARDS outperform the other methods, since they lead to smaller average model errors and larger NMI. bCARDS performs especially well in achieving low model errors, even in the case r = 0.5. aCARDS has a better performance in terms of NMI, which indicates that it is better at recovering the true grouping structure. A possible reason that aCARDS does not perform as well as bCARDS in model errors is that aCARDS has more tuning parameters, and the selection of these tuning parameters in simulations may be suboptimal.
Figure 2.
The average model error and normalized mutual information in Experiment 1, where p = 60, n = 100, and there are four equal-size coefficient groups.
Experiment 2
The setting of this experiment is the same as in Experiment 1, except that the homogeneous groups have non-equal sizes. In Experiment 2a, the predictors are divided into four groups of size 1, 15, 15 and 29. The four distinct regression coefficients are −4r, −r, r and 2r, respectively. Here, the first group is a singleton. In Experiment 2b, there is one dominating group of size 50 with a common regression coefficient −2r. The other 10 predictors have the regression coefficients , respectively. In both subexperiments, we take r = 1 and 0.7 to represent two levels of signal-to-noise ratio. Besides the six methods compared in Experiment 1, we also implement a data-driven selection between bCARDS and aCARDS, as described in Section 2.3. In detail, we select the parameter δ via BIC (for each value of δ, the other parameters are also selected via BIC). The resulting method is called CARDS.
Figure 3 displays the results for Experiment 2a. It suggests that both bCARDS and aCARDS outperform the other methods, as does CARDS. Figure 4 displays the results for Experiment 2b. We see that the total variation and fused Lasso cannot improve upon the OLS estimator in terms of the average model error. bCARDS also performs unsatisfactorily, possibly due to misranking in the preliminary estimate. But aCARDS performs much better than OLS and the other methods. After a data-driven selection between bCARDS and aCARDS, the resulting method CARDS also outperforms the other methods.
Figure 3.
The average model error and normalized mutual information in Experiment 2a, where p = 60, n = 100, and there are four coefficient groups of size 1, 15, 15 and 29.
Figure 4.
The average model error and normalized mutual information in Experiment 2b, where p = 60, n = 100, and there is one dominating group of size 50.
Experiment 3
We use this experiment to investigate how the performance of the two versions of CARDS is affected by misranking in the preliminary estimate. The setting is the same as Experiment 1 and we fix r = 1 (so n = 100, p = 60, and there are 4 equal-size groups with true regression coefficients equal to −2, −1, 1 and 2, respectively). For each data set (X, y), we generate 11 different preliminary ranks as follows: for each σ in {1, 1.2, 1.4, · · ·, 3}, we generate z ~ N(Xβ0, σ²In) independently of y conditional on X, and then use the OLS estimator associated with z to get a preliminary rank. A larger value of σ tends to yield a "worse" preliminary rank. We use K* defined in Section 3.3 to quantify the level of misranking. Recall that K* is the total number of jumps in the true regression coefficients under the preliminary rank, and K* > 3 means there exists misordering in the preliminary rank. We generated 100 data sets and 11 preliminary ranks for each data set, so the results are based on 1100 repetitions. We compare the performance of bCARDS and aCARDS with that of the total variation (TV), where TV does not use any information from the preliminary rank. Figure 5 contains boxplots of the average model error as K* changes. First, we see that the two versions of CARDS are quite robust to the increase of K* and always outperform the total variation. Second, the model error of bCARDS increases as K* increases, which provides empirical evidence for the claim in Theorem 3.4 about the effect of misranking on bCARDS. Third, when K* is large, aCARDS performs better than bCARDS, due to the fact that the hybrid pairwise penalty can tolerate a higher level of misranking than the fused penalty.
Figure 5.
The average model error in Experiment 3. The horizontal axis represents K*, and “b”, “a” and “TV” are short for bCARDS, aCARDS and total variation.
Experiment 4
This experiment explores the homogeneity and sparsity simultaneously. Consider the linear regression model with p = 100 and n = 150. Among the 100 predictors, 60 are important ones and their coefficients are the same as those in Experiment 1. Besides, there are 40 unimportant predictors whose coefficients are all equal to 0.
We implemented sCARDS and compared its performance with three oracle estimators, Oracle, Oracle0 and OracleG, as well as ordinary least squares (OLS), SCAD, shrinkage total variation (sTV) and fused Lasso (fLasso). The three oracles are defined with different levels of prior information: the Oracle knows both the important predictors and the true groups; the Oracle0 only knows which predictors are important; and the OracleG only knows the true groups (it treats all unimportant predictors as one group with unknown coefficients). sCARDS is as described in Section 2.4; while implementing it, we take the SCAD estimator as the preliminary estimator. The shrinkage total variation extends TV by adding both an elementwise SCAD penalty and the exhaustive pairwise SCAD penalty. The fused Lasso used here has both the elementwise L1 penalty and the fused pairwise L1 penalty.
Figure 6 displays the boxplots of average model errors for r = 1 and r = 0.7. First, by comparing the model errors of the three oracles, we see a significant advantage of taking into account both homogeneity and sparsity over pure sparsity. Moreover, the results of Oracle0 and OracleG show that exploring the group structure is more important than exploring sparsity. Second, sCARDS achieves a smaller model error than OLS, SCAD and the fused Lasso. The performance of sCARDS and sTV in terms of model error is comparable. However, they are different in feature selection. Figure 7 contains frequency histograms of the number of falsely selected features for the two methods. In about 17% of the repetitions, sTV fails to shrink the coefficients of the 40 unimportant predictors to 0.
Figure 6.
The average model error and normalized mutual information in Experiment 4, where p = 100, n = 150, and there are 60 important predictors divided into four equal-size groups.
Figure 7.
Feature selection of sCARDS and shrinkage total variation (sTV) in Experiment 4 (left: r = 1; right: r = 0.7), where there are 40 unimportant predictors.
Experiment 5
We consider a special case of the spatial-temporal model (3), Yit = XtTβi + εit, i = 1, · · ·, p, i.e., the predictors are the same for all spatial locations. There are p = 100 different locations and k = 5 common predictors (so each βi has dimension 5). We assume only spatial homogeneity in the regression coefficients, that is, for each predictor j (j = 1, · · ·, 5), the coefficients {βi,j, 1 ≤ i ≤ 100} are divided into four groups of equal size 25, where coefficients in the same group share the same value. In simulations, we let
where {ωj = 0.1 × (j −1), 1 ≤ j ≤ 5} are location-independent constants. In this experiment, instead of varying the signal-to-noise ratio directly, we equivalently change T, the total number of time points.
We extend aCARDS to this model by adding the hybrid pairwise penalty on coefficients of the same predictor at different locations, and still call the method aCARDS. The total variation (TV) and fused Lasso (fLasso) can be extended to this model in a similar way. The Oracle is the maximum likelihood estimator which knows the true groups for each predictor. We aim to compare the performance of Oracle, OLS, aCARDS, total variation and fused Lasso.
Figure 8 displays the results. We see that aCARDS achieves significantly lower model errors than OLS, thanks to exploring homogeneity. Moreover, it performs better than the total variation and fused Lasso, particularly when T = 50, 80. aCARDS also estimates the true grouping structure well; when T = 50, 80, the normalized mutual information is larger than 0.95 in most repetitions.
Figure 8.
The average model error and normalized mutual information in Experiment 5, where the model is a spatial-temporal regression model with p = 100 locations and k = 5 common predictors.
6 Real data analysis
6.1 S&P500 returns
In this study, we fit a homogeneous Fama-French model for stock returns: Yit = αi + XtTβi + εit, where Xt contains the three Fama-French factors at time t, Yit is the excess return of stock i at time t, and εit are idiosyncratic noises. We collected daily returns of 410 stocks that were always included in the components of the S&P500 index in the period December 1, 2010 to December 1, 2011 (T = 254). We applied CARDS as in Experiment 5, except that the intercepts αi's were also penalized for sparsity. The sparsity of the αi's is supported by the capital asset pricing model (CAPM) and its extension, the multifactor pricing model, in financial econometric theories. The tuning parameters were chosen via generalized cross validation (GCV). Table 1 shows the number of fitted coefficient groups for the three factors and the number of non-zero intercepts. We then used daily returns of the same stocks in the period December 1, 2011 to July 1, 2012 (T = 146) to evaluate the prediction error. Let ŷit and yit be the fitted and observed excess returns of stock i at time t = 1, · · ·, 146, respectively. Define the discounted cumulative sum of squared estimation errors (cRSS) at time t as , where we take ρ = 0.95. Figure 9 shows the percentage improvement in cRSSt of the CARDS estimator over the OLS estimator. We see that CARDS achieves a smaller discounted cRSS compared to OLS at most time points, especially in the "very-close" and "far-away" future. The North American Industry Classification System (NAICS) classifies these 410 companies into 18 different industry sectors. Figure 10(a) shows the OLS coefficients on the factor "book-to-market ratio". We can see that stocks belonging to Sector 3 "Utilities" (29 stocks in total) have very close OLS coefficients, and 17 stocks in this sector were clustered into one group by the CARDS estimator. Figure 10(b) shows the percentage improvement in cRSSt only for stocks in this sector, where the improvement is more significant.
Table 1.
Number of groups in fitting the S&P500 data.
Fama-French factors | No. of coef. groups |
---|---|
“market return” | 41 |
“market capitalization” | 32 |
“book-to-market ratio” | 56 |
intercept | 60 |
Figure 9.
Comparison of the cumulative sum of squared prediction errors of the S&P500 data from December 1, 2011 to July 1, 2012. The vertical axis is the percent of relative errors between the predictions made by OLS and CARDS, defined by . The right panel is a zoom-in of the results for CARDS.
Figure 10.
(a) OLS coefficients on the "book-to-market ratio" factor. The x axis represents different sectors. (b) Percentage improvement of the discounted cumulative sum of squared estimation errors for stocks in the sector "Utilities" (Sector 2).
6.2 Polyadenylation signals
CARDS can be easily extended to more general settings such as generalized linear models (McCullagh and Nelder 1989), although we have focused on the linear regression model so far. In this subsection, we apply CARDS to a logistic regression example. This study tried to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analyzing features around them. The data set was first used in Legendre and Gautheret (2003) and later analyzed by Liu et al. (2003), and it is available at http://datam.i2r.a-star.edu.sg/datasets/krbd/SequenceData/Polya.html. There is one training data set and five testing data sets. To avoid any platform bias, we use the training data set only. It has 4418 observations, each with 170 predictors and a binary response. The binary response indicates whether a terminal sequence is classified as a "strong" or "weak" polyA site, and the predictors are features from the upstream (USE) and downstream (DSE) sequence elements. We randomly select 2000 observations to perform model estimation and use the rest to evaluate performance. Our numerical analysis consists of the following steps. Step 1 is to apply the L1-penalized logistic regression to these 2000 observations with all 170 predictors and use AIC to select an appropriate regularization parameter. In Step 2, we use the logistic regression coefficients obtained in Step 1 as our preliminary estimate and apply sCARDS accordingly. The average prediction error (and standard error in parentheses) over 40 random splits is reported in Table 2. We also report the average number of non-zero coefficient groups and the average number of selected features. It shows that sCARDS leads to a smaller prediction error when compared with the shrinkage total variation (sTV). In addition, sCARDS has fewer groups of non-zero coefficients but more selected features, which implies that we can include more predictors while fixing the degrees of freedom at a small value. Note that in this example, the fused Lasso (fLasso) has a similar performance to sCARDS. In Section 5, we remarked that the fused Lasso is essentially bCARDS with the lasso penalty pλ(t) = λ|t|.
Table 2.
Results of the PASes data.
 | sCARDS | SCAD | sTV | fLasso
---|---|---|---|---
Prediction Error | 0.2449 (0.0015) | 0.2458 (0.0014) | 0.2757 (0.0026) | 0.2445 (0.0010)
No. of non-zero coef. groups | 5.5000 | 21.6250 | 5.7500 | 5.5000
No. of selected features | 73.2750 | 21.6250 | 40.3500 | 86.8750
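A sketch of the two-step analysis described above: Step 1 fits an L1-penalized logistic regression and picks the penalty level by AIC; Step 2 then passes the resulting coefficients on as the preliminary estimate. The AIC convention (2k − 2 log-likelihood with k the number of nonzero coefficients) and the grid of penalty levels are our assumptions, and the sCARDS fitting routine appears only as a hypothetical placeholder, since it is not part of any standard library.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logistic_aic(X, y, C_grid=np.logspace(-3, 1, 30)):
    """Step 1: L1-penalized logistic regression with the penalty chosen by AIC.

    X is the predictor matrix and y a 0/1 response vector.
    """
    best_aic, best_model = np.inf, None
    for C in C_grid:
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X, y)
        p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        k = np.count_nonzero(model.coef_)         # number of nonzero coefficients
        aic = 2 * k - 2 * loglik
        if aic < best_aic:
            best_aic, best_model = aic, model
    return best_model

# Step 2 (schematic): the step-1 coefficients become the preliminary estimate
# for sCARDS; `fit_scards_logistic` is a hypothetical routine.
# prelim = l1_logistic_aic(X_train, y_train).coef_.ravel()
# beta_hat = fit_scards_logistic(X_train, y_train, prelim_estimate=prelim)
```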
7 Conclusion
In this paper, we explored the homogeneity of coefficients in high-dimensional regression. We proposed a new method, called Clustering Algorithm in Regression via Data-driven Segmentation (CARDS), to estimate regression coefficients and to detect homogeneous groups. The implementation of CARDS does not need any geographical information (neighborhoods, distances, graphs, etc.) a priori, which distinguishes it from other methods in similar settings and makes it more broadly applicable. A modification of CARDS, sCARDS, can be used to explore homogeneity and sparsity simultaneously. Our theoretical results show that better estimation accuracy can be achieved by exploring homogeneity. In particular, when the number of homogeneous groups is small, the power of exploring homogeneity and sparsity simultaneously is much larger than that of exploring sparsity alone, which is also confirmed in our simulation studies.
Methodologically, CARDS has two main innovations. First, it takes advantage of a preliminary estimate, from which it extracts either an estimated ranking or an estimated ordered segmentation. Second, it introduces the so-called “hybrid pairwise penalty” to adapt to the available partial ordering information. The hybrid pairwise penalty is not only robust to misordering but also avoids the statistical and computational inefficiency caused by penalizing too many pairs. These ideas for handling homogeneity apply to much broader settings than linear regression, provided the hybrid pairwise penalty is combined with an appropriate loss function. For example, CARDS can be extended to generalized linear models (GLM) when homogeneity is present.
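As an illustration of how a pairwise penalty can be driven by a preliminary estimate, the sketch below orders the coefficients by the preliminary ranking and applies a folded concave (SCAD) penalty to adjacent differences, i.e., a basic bCARDS-type construction in the sense of the remark in Section 6.2. The function names and the choice a = 3.7 are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) of Fan and Li (2001), applied elementwise."""
    t = np.abs(t)
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    return np.where(small, lam * t,
           np.where(mid, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    lam**2 * (a + 1) / 2))

def bcards_type_penalty(beta, prelim, lam):
    """Pairwise penalty on adjacent differences in the preliminary ranking.

    `prelim` is a preliminary estimate (e.g., OLS or ridge coefficients);
    its sort order defines the ranking tau, and the penalty is
    sum_j p_lambda(beta[tau(j+1)] - beta[tau(j)]).
    """
    tau = np.argsort(prelim)      # estimated ranking from the preliminary estimate
    diffs = np.diff(beta[tau])    # adjacent differences in that ranking
    return scad_penalty(diffs, lam).sum()
```

Replacing scad_penalty with the absolute value pλ(t) = λ|t| recovers the fused-Lasso-type penalty mentioned in Section 6.2.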
To promote homogeneity, CARDS takes advantage of a preliminary estimate. This idea can be generalized. Instead of extracting a complete ranking or an ordered segmentation, we may also apply clustering methods, such as the k-means algorithm or hierarchical clustering, to the coefficients of the preliminary estimate to help construct data-driven penalties and further promote homogeneity.
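A minimal sketch of that suggestion, assuming a preliminary coefficient vector is already available: cluster its entries with k-means and read off candidate homogeneous groups that could then guide the construction of data-driven penalties. The number of clusters, the variable names, and the OLS preliminary estimate in the usage comment are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_groups(prelim, n_groups=5, seed=0):
    """Cluster preliminary coefficient values into candidate homogeneous groups."""
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed)
    labels = km.fit_predict(prelim.reshape(-1, 1))   # cluster the 1-d coefficients
    return [np.flatnonzero(labels == g) for g in range(n_groups)]

# Example with a hypothetical preliminary estimate:
# prelim = np.linalg.lstsq(X, y, rcond=None)[0]
# groups = candidate_groups(prelim, n_groups=5)
```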
This paper only considers the case where predictors in one homogeneous group have equal coefficients. In a more general situation, coefficients of predictors in the same group are close but not exactly equal. The idea of data-driven pairwise penalties still applies, but instead of using the class of folded concave penalty functions, we may need to use penalty functions which are smooth at the origin, e.g., the L2 penalty function. Another possible approach is to use posterior-type estimators combined with, say, a Gaussian prior on the coefficients. These are beyond the scope of this paper and we leave them as future work.
Supplementary Material
Acknowledgments
The authors would like to thank the Editor Professor Xuming He, the anonymous Associate Editor and two referees for their very helpful comments and suggestions.
The authors were partially supported by the National Institute of General Medical Sciences of the National Institutes of Health through Grant Numbers R01-GM072611 and R01-GM100474 and National Science Foundation grant DMS-1206464.
The author was partially supported by NSF grants DMS-0905561 and DMS-1055210, and NIH Grant NIH/NCI R01 CA-149569.
A Proofs
A.1 Proof of Theorem 3.1
Introduce the mapping T : 𝓜 → ℝK, where 𝓜 ⊂ ℝp denotes the subspace of coefficient vectors whose coordinates are constant within each group Ak, and T(β) is the K-dimensional vector whose k-th coordinate equals the common value of βj for j ∈ Ak. Note that T is a bijection and T−1 is well-defined for any μ ∈ ℝK. Also, introduce the mapping T* : ℝp → ℝK, where the k-th coordinate of T*(β) is the average of βj over j ∈ Ak. We see that T* = T on 𝓜, and T−1 ∘ T* is the orthogonal projection from ℝp onto 𝓜. Denote μ0 = T(β0) and μ̂oracle = T(β̂oracle).
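Written out explicitly (our reading of the properties just stated, using the group-average form of T* and the subspace 𝓜, written \mathcal{M} below):

```latex
T(\beta) = (\mu_1, \ldots, \mu_K)^\top \ \text{ with } \mu_k = \beta_j \text{ for any } j \in A_k
\quad (\beta \in \mathcal{M}),
\qquad
T^*(\beta)_k = \frac{1}{|A_k|} \sum_{j \in A_k} \beta_j .
```

Thus T−1 ∘ T*(β) replaces every coordinate in group Ak by the group average, which is exactly the orthogonal projection of β onto 𝓜.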
Write Qn(β) = Ln(β) + Pn(β), where Ln denotes the loss term and Pn the penalty term of the CARDS criterion. For any μ ∈ ℝK, let L̃n(μ) = Ln(T−1(μ)) and define P̃n(μ) = Pn(T−1(μ)). Note that when τ is consistent with the order of β0, there exist 1 = j1 < j2 < · · · < jK < jK+1 = p + 1 such that Ak = {τ(jk), τ(jk + 1), · · ·, τ(jk+1 − 1)} for 1 ≤ k ≤ K. Consequently, Ln(β) = L̃n(μ) and Pn(β) = P̃n(μ) for any β ∈ 𝓜 with μ = T(β) ∈ ℝK.
In the first part of the proof, we derive a bound for β̂oracle − β0. By definition and direct calculations,
(23)
From Condition 3.1 and the Markov inequality, for any δ > 0,
(24)
Combining the above, the desired bound holds with probability at least 1 − δ for any δ > 0, which proves the claim of the first part.
Furthermore, we show a result that will be frequently used in later proofs:
(25)
Write the relevant term using vk = XAD−1ek, where ek is the unit vector with 1 in the k-th coordinate and 0 elsewhere. Observing that ||vk||2 is the k-th diagonal entry of the matrix D−1XATXAD−1, it follows from Condition 3.3 and the union bound that
(26)
Then (25) follows by combining (23) and (26).
In the second part of the proof, we show that β̂oracle is a strictly local minimizer of Qn(β) with probability at least 1 − ε0 − n−1K − (n ∨ p)−1. By assumption, there is an event E1, with P(E1c) ≤ ε0, such that over the event E1, τ is consistent with the order of β0. Consider a neighborhood 𝓑 of β0. By (25), there is an event E2, with P(E2c) ≤ n−1K, such that β̂oracle ∈ 𝓑 over the event E2.
For any β ∈ 𝓑, write β* for its orthogonal projection onto 𝓜. We aim to show the following two claims.
- (a) Over the event E1 ∩ E2,
Qn(β*) ≥ Qn(β̂oracle),   (27)
and the inequality is strict whenever β* ≠ β̂oracle.
- (b) There is an event E3, with P(E3c) ≤ (n ∨ p)−1, such that over the event E1 ∩ E2 ∩ E3, there exists a neighborhood 𝓗 of β̂oracle, with 𝓗 ⊂ 𝓑, such that
Qn(β) ≥ Qn(β*) for all β ∈ 𝓗,   (28)
and the inequality is strict whenever β ≠ β*.
Combining (a) and (b), Qn(β) ≥ Qn(β̂oracle) for any β ∈ 𝓗, a neighborhood of β̂oracle, and the inequality is strict whenever β ≠ β̂oracle. This proves that β̂oracle is a strictly local minimizer of Qn over the event E1 ∩ E2 ∩ E3.
It remains to show (a) and (b). Consider (a) first. We claim that for every β ∈ 𝓑, the vector μ = T*(β) satisfies
|μk+1 − μk| > aλn for k = 1, · · ·, K − 1.   (29)
This follows because each μk is close to the corresponding group value of β0, while the group values of β0 are well separated.
Using (29), we see that P̃n(T*(β)) takes the same value for all β ∈ 𝓑. Since T−1 ∘ T* is the orthogonal projection from ℝp onto 𝓜, for any β ∈ 𝓑,
(30)
In particular, noting that β̂oracle ∈ 𝓜 and that its orthogonal projection onto 𝓜 is itself, the above further implies
(31)
By definition and the fact that the Hessian of L̃n is positive definite, μ̂oracle is the unique global minimizer of L̃n. As a result,
L̃n(T*(β)) ≥ L̃n(μ̂oracle),   (32)
and the inequality is strict whenever T*(β) ≠ μ̂oracle, i.e., β* ≠ T−1(μ̂oracle) = β̂oracle. Combining (30)–(32) gives (a).
Second, consider (b). For a positive sequence {tn} to be determined, let 𝓗 = {β ∈ 𝓑 : ||β − β̂oracle|| ≤ tn}. Since β* is the orthogonal projection of β onto 𝓜, ||β − β*|| ≤ ||β − β′|| for any β′ ∈ 𝓜. In particular, ||β − β*|| ≤ ||β − β̂oracle||. As a result, to show (28), it suffices to show
(33)
and the inequality is strict whenever β ≠ β*.
To show (33), write μ = T*(β), so that β* = T−1(μ). By a Taylor expansion, Qn(β) − Qn(β*) can be decomposed into a loss part I1 and a penalty part I2, where βm is a point on the line segment between β and β*. Consider I2 first. Direct calculations yield (d/dt)pλ(|t|) = λρ̄(t), where ρ̄(t) = ρ′(|t|)sgn(t) and ρ(t) = λ−1pλ(t). Plugging this into I2 and rearranging the sum, we obtain
(34)
When τ(j) and τ(j + 1) belong to the same group, β*τ(j+1) − β*τ(j) = 0, and hence the sign of (βτ(j+1) − βτ(j)) − (β*τ(j+1) − β*τ(j)) is the same as the sign of (βτ(j+1) − βτ(j)) if neither of them is 0. In addition, recall that Ak = {τ(jk), τ(jk + 1), · · ·, τ(jk+1 − 1)} for all 1 ≤ k ≤ K, for some indices 1 = j1 < j2 < · · · < jK < jK+1 = p + 1. Combining the above, we can rewrite I2 as a sum of within-group terms and between-group terms.
First, since β0 ∈ 𝓜 and β* is the orthogonal projection of β onto 𝓜, ||β* − β0|| ≤ ||β − β0||. Hence, β ∈ 𝓑 implies β*, βm ∈ 𝓑. By repeating the proof of (29), we can show that the differences of βm across adjacent groups exceed aλn in absolute value for 2 ≤ k ≤ K, so the second (between-group) term in I2 disappears. Second, in the first (within-group) term of I2, since the within-group differences of βm are at most 2tn in absolute value, it follows by concavity that the corresponding derivative factors are bounded below by ρ′(2tn). Together, we have
(35)
Next, we simplify I1. Let z = z(βm) = XT(y − Xβm). For any fixed k and l such that τ(l) ∈ Ak and l ≠ jk+1 − 1, noting that the coordinates of β* are constant over i ∈ Ak, we can reexpress I1 as
(36)
where, for any vector v ∈ ℝp, wτ(l)(v) denotes the corresponding within-group linear contrast of the coordinates {vj : j ∈ Ak}.
We aim to bound |wτ(l)(z)|. Let η = XTX(β* − β0), ηm = XTX(βm − β*), and write z = XTε − η − ηm. First, wτ(l)(v) is a linear function of v. Second, since βm lies between β and β*, we have ||β* − βm|| ≤ ||β* − β|| ≤ tn. It follows that ||ηm|| ≤ λmax(XTX)tn. Moreover, |wτ(l)(v)| ≤ (|Ak|/n)||v||∞ ≤ (p/n)||v|| for all v. Combining the above yields
(37)
First, we bound the term wτ(l)(XTε). Let E3 be the event that
(38)
where we recall that σk is the maximum eigenvalue of n−1XTX restricted to the (Ak, Ak)-block. Given τ(l), we can express wτ(l)(XTε) as a contrast between the two sub-blocks of Ak split at the position of τ(l); write L1 and L2 for their sizes, so that |Ak| = L1 + L2. Using the fact that (a + b)2 ≤ 2(a2 + b2) for any real values a and b, and applying Condition 3.3 and the probability union bound,
(39)
Second, we bound the term wτ(l)(η). Observing that for any vector v, wτ(l)(v) = wτ(l)(v − v̄k1), where v̄k is the mean of {vj, j ∈ Ak}, we have |wτ(l)(η)| ≤ (|Ak|/n) maxj∈Ak|ηj − η̄k|. Since η = XTX(β* − β0) and β* − β0 ∈ 𝓜, we have maxj∈Ak|ηj − η̄k| ≤ nνk||β* − β0||, where νk is defined in (13). As a result,
(40)
where the last inequality holds because we consider β ∈ 𝓑 in (28), and ||β* − β0|| ≤ ||β − β0|| (noticing that β* is the orthogonal projection of β onto 𝓜).
Combining (36)–(40), we find that over the event E1 ∩ E2 ∩ E3,
(41)
where we have used condition (14) on λn.
From (35) and (41), over the event E1 ∩ E2 ∩ E3, Qn(β) − Qn(β*) admits a lower bound whose leading factor is λn/2 − gn(tn), where gn(tn) = n−1pλmax(XTX)tn − λn[1 − ρ′(2tn)]. Since ρ′(0+) = 1, gn(0+) = 0. So we can always choose tn sufficiently small to ensure |gn(tn)| < λn/2; consequently, the right-hand side is non-negative, and strictly positive when β is not constant within every group, i.e., β ≠ β*. This proves (b).
A.2 Proof of Theorem 3.2
First, we show that the LLA algorithm yields β̂oracle after one iteration. Let E1 be the event that the ranking τ is consistent with the order of β0, E2 the event that the bound (25) holds, and E3 the event that (38) holds. We have shown that P(E1 ∩ E2 ∩ E3) ≥ 1 − ε0 − n−1K − (n ∨ p)−1. It suffices to show that over the event E1 ∩ E2 ∩ E3, the LLA algorithm gives β̂oracle after the first iteration.
Let wj = ρ′(|β̂initial,τ(j+1) − β̂initial,τ(j)|) be the weight attached to the adjacent pair (τ(j), τ(j + 1)). At the first iteration, the algorithm minimizes the weighted criterion obtained by replacing each penalty term of Qn with its local linear approximation at β̂initial.
This is a convex function; hence it suffices to show that β̂oracle is a strictly local minimizer of it. Denote this weighted criterion by Q̄n(β). Using the same notation as in the proof of Theorem 3.1, for any β ∈ ℝp, write β* = T−1 ∘ T*(β) for its orthogonal projection onto 𝓜. For a sequence {tn} to be determined, consider the neighborhood of β̂oracle defined by 𝓗 = {β ∈ ℝp : ||β − β̂oracle|| ≤ tn}. It suffices to show that, for all β ∈ 𝓗,
Q̄n(β) ≥ Q̄n(β*) ≥ Q̄n(β̂oracle),   (42)
where the first inequality is strict whenever β ≠ β*, and the second inequality is also strict whenever β ≠ β̂oracle.
We first show the second inequality in (42). For τ(j) and τ(j + 1) in different groups, the corresponding coordinates of β0 are well separated under the conditions of the theorem. In addition, ||β̂initial − β0||∞ ≤ λn/2. Hence, |β̂initial,τ(j+1) − β̂initial,τ(j)| > aλn, and it follows that wj = 0. On the other hand, for τ(j) and τ(j + 1) in the same group, βτ(j+1) − βτ(j) = 0 whenever β ∈ 𝓜. Consequently, the weighted penalty vanishes on 𝓜 and Q̄n coincides with Ln there. It is easy to see that β̂oracle is the unique global minimizer of Ln constrained on 𝓜. So the second inequality in (42) holds.
Next, consider the first inequality in (42). We apply a Taylor expansion to Q̄n(β) − Q̄n(β*) and rearrange the sums as in (34). Then, for some βm that lies on the line segment between β and β*, the difference decomposes into a penalty part J1 and a loss part J2.
We first simplify J1. Note that wj = 0 when τ(j) and τ(j + 1) are in different groups. When τ(j) and τ(j + 1) are in the same Ak, first, β*τ(j+1) − β*τ(j) = 0, and (βτ(j+1) − βτ(j)) − (β*τ(j+1) − β*τ(j)) has the same sign as βτ(j+1) − βτ(j); second, |β̂initial,τ(j+1) − β̂initial,τ(j)| ≤ λn, and hence wj ≥ ρ′(λn) ≥ a0. Combining the above yields
(43)
Next, we simplify J2. Denote z = XT(y − Xβm). Similarly to (36)–(41), we find that, over the event E3, the term wτ(l)(z) can be bounded for any jk ≤ l ≤ jk+1 − 2. From the condition on λn, the sum of the first two terms is upper bounded by a0λn/3 for large n. We choose tn = a0nλn/(3pλmax(XTX)). It follows that
(44)
Combining (43) and (44), over the event E1 ∩ E2 ∩ E3, we conclude that Q̄n(β) − Q̄n(β*) ≥ 0, with strict inequality when β ≠ β*. This proves the first inequality in (42).
Second, we show that over the event E1 ∩ E2 ∩ E3, the LLA algorithm still yields β̂oracle at the second iteration, and therefore it converges to β̂oracle. We have shown that after the first iteration the algorithm outputs β̂oracle, which it then treats as the initial solution for the second iteration. So it suffices to check that the weights recomputed from β̂oracle again yield β̂oracle as the minimizer. This is true because, over the event E1, the recomputed weights satisfy the same properties as those used in the first iteration.
A.3 Proof of Theorem 3.3
It suffices to show that, with probability at least 1 − O(n−α), the ordering of the coordinates of β̂ols is consistent with that of β0. This reduces to showing that ||β̂ols − β0||∞ ≤ bn with probability at least 1 − O(n−α).
From direct calculations, β̂ols = β0 + (XTX)−1XTε. Write β̂ols,j − β0j = ajTε, where aj = X(XTX)−1ej for j = 1, · · ·, p. Then ||aj||2 = ejT(XTX)−1ej. By Condition 3.3 and the union bound, P(||β̂ols − β0||∞ > bn) ≤ 2pn−2α.
Since p = O(nα), 2pn−2α = O(n−α). This completes the proof.
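For intuition, here is a worked version of the final union-bound step, under the assumption (which we cannot verify from this excerpt) that Condition 3.3 provides a tail bound making each term P(|ajTε| > bn) at most 2n−2α:

```latex
\|a_j\|^2 = e_j^\top (X^\top X)^{-1} X^\top X (X^\top X)^{-1} e_j
          = e_j^\top (X^\top X)^{-1} e_j,
\qquad
P\big(\|\hat\beta^{\,ols} - \beta^0\|_\infty > b_n\big)
  \le \sum_{j=1}^{p} P\big(|a_j^\top \varepsilon| > b_n\big)
  \le 2p\,n^{-2\alpha} .
```

Since p = O(nα), the right-hand side is O(n−α), as used in the last line of the proof.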
Footnotes
The supplemental materials contain technical proofs for Theorems 3.5, 4.1, 4.2, 4.3, and Corollary 4.1.
References
- Bondell HD, Reich BJ. Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR. Biometrics. 2008;64:115–123. doi:10.1111/j.1541-0420.2007.00843.x.
- Bühlmann P, van de Geer S. Statistics for High-Dimensional Data. Springer; 2011.
- Chen SS, Donoho DL, Saunders MA. Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing. 1998;20:33–61.
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Lv J. Sure Independence Screening for Ultra-high Dimensional Feature Space. Journal of the Royal Statistical Society, Ser. B. 2008;70:849–911. doi:10.1111/j.1467-9868.2008.00674.x.
- Fan J, Lv J. Nonconcave Penalized Likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi:10.1109/TIT.2011.2158486.
- Fan J, Lv J, Qi L. Sparse High-dimensional Models in Economics. Annual Review of Economics. 2011;3:291–317. doi:10.1146/annurev-economics-061109-080451.
- Fan J, Xue L, Zou H. Strong Oracle Optimality of Folded Concave Penalized Estimation. 2012. Manuscript, available at http://arxiv.org/abs/1210.5992. doi:10.1214/13-aos1198.
- Fred A, Jain AK. Robust Data Clustering. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2003;3:128–136.
- Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise Coordinate Optimization. The Annals of Applied Statistics. 2007;1:302–332.
- Harchaoui Z, Lévy-Leduc C. Multiple Change-point Estimation with a Total Variation Penalty. Journal of the American Statistical Association. 2010;105:1480–1493.
- Huang H-C, Hsu N-J, Theobald DM, Breidt FJ. Spatial Lasso with Applications to GIS Model Selection. Journal of Computational and Graphical Statistics. 2010;19:963–983.
- Kim S, Xing EP. Statistical Estimation of Correlated Genome Associations to a Quantitative Trait Network. PLoS Genetics. 2009;5:e1000587. doi:10.1371/journal.pgen.1000587.
- Kim Y, Choi H, Oh H-S. Smoothly Clipped Absolute Deviation on High Dimensions. Journal of the American Statistical Association. 2008;103:1665–1673.
- Legendre M, Gautheret D. Sequence Determinants in Human Polyadenylation Site Selection. BMC Genomics. 2003;4:7–15. doi:10.1186/1471-2164-4-7.
- Li C, Li H. Variable Selection and Regression Analysis for Graph-structured Covariates with an Application to Genomics. The Annals of Applied Statistics. 2010;4:1498–1516. doi:10.1214/10-AOAS332.
- Liu H, Han H, Li J, Wong L. An in-Silico Method for Prediction of Polyadenylation Signals in Human Sequences. Genome Informatics Series. 2003;14:84–93.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman & Hall; 1989.
- Park MY, Hastie T, Tibshirani R. Averaged Gene Expressions for Regression. Biostatistics. 2007;8:212–227. doi:10.1093/biostatistics/kxl002.
- Shen X, Huang H-C. Grouping Pursuit through a Regularization Solution Surface. Journal of the American Statistical Association. 2010;105:727–739. doi:10.1198/jasa.2010.tm09380.
- Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Ser. B. 1996;58:267–288.
- Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and Smoothness via the Fused Lasso. Journal of the Royal Statistical Society, Ser. B. 2005;67:91–108.
- Wang L, Kim Y, Li R. Calibrating Nonconvex Penalized Regression in Ultra-high Dimension. The Annals of Statistics. (in press). doi:10.1214/13-AOS1159.
- Yang S, Yuan L, Lai Y-C, Shen X, Wonka P, Ye J. Feature Grouping and Selection over an Undirected Graph. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM. 2012. pp. 922–930.
- Yang Y, He X. Bayesian Empirical Likelihood for Quantile Regression. The Annals of Statistics. 2012;40:1102–1131.
- Zhang C-H. Nearly Unbiased Variable Selection under Minimax Concave Penalty. The Annals of Statistics. 2010;38:894–942.
- Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
- Zhu Y, Shen X, Pan W. Simultaneous Grouping Pursuit and Feature Selection over an Undirected Graph. Journal of the American Statistical Association. (in press). doi:10.1080/01621459.2013.770704.
- Zou H, Li R. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models (with discussion). The Annals of Statistics. 2008;36:1509–1566. doi:10.1214/009053607000000802.