Published in final edited form as: J Am Stat Assoc. 2015 Apr 22;110(509):175–194. doi: 10.1080/01621459.2014.892882

Homogeneity Pursuit

Tracy Ke *, Jianqing Fan *, Yichao Wu
PMCID: PMC4465377  NIHMSID: NIHMS578407  PMID: 26085701

Abstract

This paper explores the homogeneity of coefficients in high-dimensional regression, which extends the sparsity concept and is more general and suitable for many applications. Homogeneity arises when regression coefficients corresponding to neighboring geographical regions or a similar cluster of covariates are expected to be approximately the same. Sparsity corresponds to a special case of homogeneity with one large cluster of coefficients at the known atom zero. In this article, we propose a new method called clustering algorithm in regression via data-driven segmentation (CARDS) to explore homogeneity. New mathematical results are provided on the gain that can be achieved by exploring homogeneity. Statistical properties of two versions of CARDS are analyzed. In particular, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy for homogeneous parameters than that without homogeneity exploration. When our methods are combined with sparsity exploration, further efficiency can be achieved beyond the exploration of sparsity alone. This provides additional insights into the power of exploring low-dimensional structures in high-dimensional regression: homogeneity and sparsity. Our results also shed light on the properties of the fused Lasso. The newly developed method is further illustrated by simulation studies and applications to real data. Supplementary materials for this article are available online.

Keywords: clustering, homogeneity, sparsity

1 Introduction

Driven by applications in genomics, image processing, etc., high dimensionality has become one of the major themes in statistics. See Bühlmann and van de Geer (2011) and references therein for an overview of recent developments in this area. To overcome the difficulty of fitting high dimensional models, one usually assumes that the true parameters lie in a low dimensional subspace. For example, many papers focus on sparsity, i.e., only a small fraction of coefficients are nonzero (Chen, Donoho, and Saunders 1998; Tibshirani 1996). In this article, we consider a more general type of low dimensional structure: homogeneity, i.e., the regression coefficients share the same values within their unknown clusters. A motivating example is gene network analysis, where it is assumed that genes cluster into groups which play similar roles in molecular processes (Kim and Xing 2009; Li and Li 2010). This can be modeled as a linear regression problem with groups of homogeneous coefficients. Similarly, in diagnostic lab tests, one often counts the number of positive results in a battery of medical tests, which implicitly assumes that their regression coefficients (impact) in the joint models are approximately the same. In spatial-temporal studies, it is not unreasonable to assume that the dynamics of neighboring geographical regions are similar, namely, their regression coefficients are clustered (Fan, Lv, and Qi 2011; Huang, Hsu, Theobald, and Breidt 2010). In the same vein, financial returns of similar sectors of industry share similar loadings on risk factors.

Homogeneity is a more general assumption than sparsity, where the latter can be viewed as a special case of the former with a large group of 0-value coefficients. In addition, the atom 0 is known to data analysts. One advantage of assuming homogeneity rather than sparsity is that it enables us to possibly select more than n variables (n is the sample size). It is well known that the sparsity-based techniques, such as the lasso, can select at most n variables. Moreover, identifying the homogeneous groups naturally provides a structure in the covariates, which can be helpful in scientific discoveries.

Regression under the homogeneity setting has been previously studied in the literature. Park, Hastie, and Tibshirani (2007) propose a two-step method. Their method performs hierarchical clustering on the predictors, cuts the obtained dendrogram at an appropriate level, and treats the cluster averages as new predictors. The fused lasso (Friedman, Hastie, Höfling, and Tibshirani 2007; Tibshirani, Saunders, Rosset, Zhu, and Knight 2005) can also be regarded as an effort of exploring homogeneity, with the assistance of neighborhoods defined according to either time or location. In this sense, our newly proposed methods are different since we do not know such a neighborhood a priori. The clustering of homogeneous coefficients is completely data-driven. For example, in the fused Lasso, where a complete ordering of the covariates is given, Tibshirani et al. (2005) use the L1 penalty to penalize the pairwise differences of adjacent coordinates; in the case without a complete ordering, they suggest penalizing the pairs of ‘neighboring’ nodes in the sense of a general distance measure. Bondell and Reich (2008) propose the method OSCAR where a special octagonal shrinkage penalty is applied to each pair of coordinates to promote equal-value solutions. Shen and Huang (2010) develop an algorithm called Grouping Pursuit, where they use the truncated L1 penalty to penalize the pairwise differences for all pairs of coordinates. In an extension, Zhu, Shen, and Pan (in press) consider simultaneous grouping pursuit and feature selection by including additional truncated L1 penalties on the individual coefficients. Yang et al. (2012) explore simultaneous feature grouping and selection with the assistance of an undirected graph by penalizing the pairwise difference for each pair of coordinates that are connected by an edge in the graph. All the aforementioned methods either depend on a known ordering or graph of the covariates, which is sometimes not available, or use exhaustive pairwise penalties, which increase the computational complexity. Yang and He (2012) consider the homogeneity across coefficients of different percentile levels in quantile regression, and propose a Bayesian framework by using shrinkage priors to promote homogeneity. Although similar ideas may be applied to regression models, their settings are very different from ours, and there are no existing results on feature grouping for their method.

In this article, we propose a new method called Clustering Algorithm in Regression via Data-driven Segmentation (CARDS) to explore homogeneity. The main idea of CARDS is to take advantage of a preliminary estimate without homogeneity structure and to shrink coefficients whose preliminary estimates are close further towards each other to achieve homogeneity. In the basic version of CARDS, we first build an ordering of covariates from the preliminary estimate and then run a penalized least squares with fused penalties in the new ordering. The number of penalty terms is only (p − 1), compared to p(p − 1)/2 in the exhaustive pairwise penalties. An advanced version of CARDS builds an “ordered segmentation” on the covariates, which can be viewed as a generalized ordering, and imposes “hybrid pairwise penalties”, which can be viewed as a generalization of fused penalties. This version of CARDS tolerates possible misorderings in the preliminary estimate better and is thus more robust. Compared with other existing methods for homogeneity exploration, CARDS can successfully deal with the case of unordered covariates. At the same time, it avoids using exhaustive pairwise penalties and can be computationally more efficient than the Grouping Pursuit and OSCAR.

We study CARDS in detail by providing theoretical analysis. It reveals that the sum of squared errors of the estimated coefficients is $O_p(K/n)$, where K is the number of true homogeneous groups. Therefore, the smaller the number of true groups is, the better precision it can achieve. In particular, when K = p, there is no homogeneity to explore and the result reduces to the case without grouping. Moreover, in order to exactly recover the true groups with high probability, the minimum signal strength (the gaps between different groups) is of the order $\max_k\{\sqrt{|A_k|\log(p)/n}\}$, where the $|A_k|$’s are the sizes of the true groups. In addition, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy than that without homogeneity exploration. Furthermore, our results can be combined with the sparsity-based results to provide additional insights into the power of exploring low-dimensional structure in high-dimensional regression: homogeneity and sparsity. As a byproduct, our analysis of the basic version of CARDS also establishes a framework for analyzing the fused type of penalties, which is new to our knowledge.

Throughout this paper, we consider the following linear regression setting

$y = X\beta^0 + \varepsilon$,  (1)

where $X = (x_1, \ldots, x_p)$ is an n × p design matrix, $y = (y_1, \ldots, y_n)^T$ is an n × 1 vector of responses, $\beta^0 = (\beta_1^0, \ldots, \beta_p^0)^T$ denotes the true parameters of interest, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ contains independently and identically distributed noises with $E(\varepsilon_i) = 0$ and $E(\varepsilon_i^2) = \sigma^2$. We assume further that there is a partition of {1, 2, ..., p}, denoted $\mathcal{A} = (A_0, A_1, \ldots, A_K)$, such that

$\beta_j^0 = \beta_{\mathcal{A},k}^0 \quad \text{for all } j \in A_k$,  (2)

where $\beta_{\mathcal{A},k}^0$ is the common value shared by all regression coefficients whose indices are in $A_k$. By default, $\beta_{\mathcal{A},0}^0 = 0$, so $A_0$ is the group of 0-value coefficients. This allows us to explore homogeneity and sparsity simultaneously. Write $\beta_{\mathcal{A}}^0 = (\beta_{\mathcal{A},1}^0, \ldots, \beta_{\mathcal{A},K}^0)^T$. Without loss of generality, we assume $\beta_{\mathcal{A},1}^0 < \beta_{\mathcal{A},2}^0 < \cdots < \beta_{\mathcal{A},K}^0$.

Our theory and methods are stated for the standard least-squares problem although they can be adapted to other more sophisticated models. For example, when forecasting housing appreciation in the United States (Fan et al. 2011), one builds the spatial-temporal model

$Y_{it} = X_{it}^T\beta_i + \varepsilon_{it}$,  (3)

in which i indicates a spatial location and t indicates time. It is expected that βis are approximately the same for neighboring zip codes i and this type of homogeneity can be explored in a similar fashion. Similarly, when Yit represents the return of a stock and Xit = Xt stands for common market risk factors, one can assume a certain degree of homogeneity within each sector of industry; namely, the factor loading vector βi is approximately the same for stocks belonging to the same sector of industry.

Throughout this paper, ℝ denotes the set of real numbers and ℝp denotes the p-dimensional real Euclidean space. For any a, b ∈ ℝ, a ∨ b denotes the maximum of a and b. For any positive sequences {an} and {bn}, we write $a_n \gg b_n$ if an/bn → ∞ as n → ∞. Given 1 ≤ q < ∞, for any vector x, $\|x\|_q = (\sum_j |x_j|^q)^{1/q}$ denotes the Lq-norm of x and $\|x\|_\infty = \max_j |x_j|$. For any matrix M, $\|M\|_q = \max_{x:\|x\|_q=1}\|Mx\|_q$ denotes the matrix Lq-norm of M. In particular, $\|M\|_\infty$ is the maximum absolute row sum of M. We omit the subscript q when q = 2. $\|M\|_{\max} = \max_{i,j}|M_{ij}|$ denotes the matrix max norm. When M is symmetric, λmax(M) and λmin(M) denote the maximum and minimum eigenvalues of M, respectively.

The rest of the paper is organized as follows. Section 2 describes CARDS, including the basic, advanced and shrinkage versions. Section 3 studies theoretical properties of the basic version of CARDS, and Section 4 analyzes the advanced and shrinkage versions. Sections 5 and 6 present the results of simulation studies and real data analysis, respectively. Some concluding remarks are given in Section 7. Proofs can be found in the Appendix and supplemental materials.

2 CARDS: a data-driven pairwise shrinkage procedure

2.1 Basic version of CARDS

Without considering the homogeneity assumption (2), there are many methods available for fitting model (1). Let β̃ be such a preliminary estimator. A simple idea to generate homogeneity is as follows: first, rearrange the coefficients in β̃ in the ascending order; second, group together those adjacent indices whose coefficients in β̃ are close to each other; finally, force indices in each estimated group to share a common coefficient and refit model (1). A problem of this naive procedure is how to group the indices. Alternatively, we can run a penalized least squares to simultaneously extract the grouping structure and estimate coefficients. To shrink coefficients of adjacent indices (after reordering) towards homogeneity, we can add fused penalties, i.e., {|βi+1βi|, i = 1, · · ·, p − 1} are penalized. This leads to the following two-stage procedure:

  • Preordering: Construct the rank statistics {τ(j) : 1 ≤ j ≤ p} such that β̃τ(j) is the j-th smallest value in {β̃i, 1 ≤ i ≤ p}, i.e.,
    $\tilde{\beta}_{\tau(1)} \le \tilde{\beta}_{\tau(2)} \le \cdots \le \tilde{\beta}_{\tau(p)}$.  (4)
  • Estimation: Given a folded concave penalty function pλ(·) (Fan and Li 2001) with a regularization parameter λ, the final estimate is given by
    $\hat{\beta} = \arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p-1} p_\lambda\big(|\beta_{\tau(j+1)} - \beta_{\tau(j)}|\big)\Big\}$.  (5)

We call this two-stage procedure bCARDS (basic CARDS).

At the first stage, bCARDS establishes a data-driven rank mapping τ(·) based on the preliminary estimator β̃. At the second stage, only “adjacent” coefficient pairs in the order τ are penalized, resulting in only (p − 1) penalty terms in total. In addition, (5) does not require that $\beta_{\tau(j)} \le \beta_{\tau(j+1)}$. This allows coordinates in β̂ to have a different order of increasing values from that in β̃.
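
To make the two stages concrete, a minimal sketch of the preordering step and of the (p − 1) fused pairs it produces is given below (Python/NumPy; all names are hypothetical and this is only an illustration, not the authors' implementation).

```python
import numpy as np

def preorder(beta_tilde):
    """Preordering step of bCARDS: tau[j] is the index of the (j+1)-th smallest
    preliminary coefficient, so beta_tilde[tau] is non-decreasing."""
    return np.argsort(beta_tilde)

def fused_pairs(tau):
    """The (p - 1) coefficient pairs penalized in (5): adjacent indices in tau."""
    return list(zip(tau[:-1], tau[1:]))

beta_tilde = np.array([0.9, -1.1, 1.0, -0.8])   # a toy preliminary estimate
tau = preorder(beta_tilde)                      # array([1, 3, 0, 2])
print(fused_pairs(tau))                         # pairs (tau(j), tau(j+1)), j = 1, ..., p-1
```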

With an appropriately large tuning parameter λ, β̂ is a piecewise constant vector in the order τ and consequently its elements have homogeneous groups. In Section 3, we shall show that, if τ is consistent with the order of β0, that is,

$\beta_{\tau(1)}^0 \le \beta_{\tau(2)}^0 \le \cdots \le \beta_{\tau(p)}^0$,  (6)

then under some regularity conditions, β̂ can consistently estimate the true coefficient groups of β0 with high probability.

When pλ(·) is a folded-concave penalty function (e.g. SCAD (Fan and Li 2001), MCP (Zhang 2010)), (5) is a non-convex optimization problem. It is generally challenging to find the global minimizer. The local linear approximation (LLA) algorithm can be applied to find a local minimizer for any fixed initial solution; see Zou and Li (2008); Fan, Xue, and Zou (2012) and references therein for details. The concave-convex procedure (CCCP) can also be applied to produce a local minimizer; see Kim, Choi, and Oh (2008); Wang, Kim, and Li (in press) for a detailed explanation of CCCP.

2.2 Advanced version of CARDS

To guarantee the success of bCARDS, (6) is an essential condition. It requires that whenever $\beta_i^0 < \beta_j^0$, τ(i) < τ(j) must hold. This imposes fairly strong conditions on the preliminary estimator β̃. For example, (6) can be violated if $\|\tilde{\beta} - \beta^0\|$ is larger than the minimum gap between groups. To relax such a restrictive requirement, we now introduce an advanced version of CARDS, where the main idea is to use less information from β̃ and to add more penalty terms in (5).

We first introduce the ordered segmentation, which can be viewed as a generalized ordering. Note that each rank mapping τ in bCARDS actually defines a partition of {1, · · ·, p} into p disjoint sets B1, · · ·, Bp with Bj = {τ(j)} being a singleton. Similarly, we may divide {1, · · ·, p} into L(≤p) disjoint sets B1, · · ·, BL, where Bl’s are not necessarily singletons. We call such Bl’s segments. The segments B1, · · ·, BL are ordered, but the ordering of coordinates within each segment is not defined. This is similar to letter grades assigned to a course. A formal definition is as follows:

Definition 2.1

For an integer 1 ≤ L ≤ p, the mapping ϒ : {1, ..., p} → {1, ..., L} is called an ordered segmentation if the sets $B_l \equiv \{1 \le j \le p : \Upsilon(j) = l\}$, 1 ≤ l ≤ L, form a partition of {1, ..., p}.

When L = p, ϒ is a one-to-one mapping and it defines a complete ordering.

Note that, in the basic version of CARDS, the preliminary estimator β̃ produces a complete rank mapping τ. Now in the advanced version of CARDS, instead of extracting a complete ordering, we only extract an ordered segmentation ϒ from β̃. The analogy is to grading an exam: an overall score rank (percentile rank) versus a letter grade. Let δ > 0 be a predetermined parameter. First, obtain the rank mapping τ as in (4) and find all indices i2 < i3 < · · · < iL such that the gaps

$\tilde{\beta}_{\tau(j)} - \tilde{\beta}_{\tau(j-1)} > \delta, \quad j = i_2, \ldots, i_L.$

Then, construct the segments

$B_l = \{\tau(i_l), \tau(i_l + 1), \ldots, \tau(i_{l+1} - 1)\}, \quad l = 1, \ldots, L,$  (7)

where i1 = 1 and iL+1 = p + 1. This process is indeed similar to assigning letter grades in a course. The intuition behind this construction is that when $\tilde{\beta}_{\tau(j+1)} \le \tilde{\beta}_{\tau(j)} + \delta$, i.e., the estimated coefficients of two “adjacent coordinates” differ by only a small amount, we do not trust the ordering between them and group them into the same segment. Compared to the complete ordering τ, the ordered segments {B1, ..., BL} utilize less information from β̃ and hence are less sensitive to the estimation error in β̃.
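
The construction (7) is straightforward to implement: sort the preliminary estimate and cut wherever a gap exceeds δ. A minimal NumPy sketch (hypothetical names, for illustration only) is given below.

```python
import numpy as np

def ordered_segmentation(beta_tilde, delta):
    """Construct the ordered segments B_1, ..., B_L of (7): sort the preliminary
    estimate and start a new segment whenever two consecutive sorted values
    differ by more than delta."""
    tau = np.argsort(beta_tilde)                     # rank mapping
    sorted_vals = beta_tilde[tau]
    cuts = np.diff(sorted_vals) > delta              # True where a new segment starts
    labels = np.concatenate(([0], np.cumsum(cuts)))  # segment label of each rank
    return [tau[labels == l] for l in range(labels[-1] + 1)]

segments = ordered_segmentation(np.array([0.9, -1.1, 1.0, -0.8]), delta=0.5)
# -> [array([1, 3]), array([0, 2])]: two segments, split at the only gap larger than delta
```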

Given an ordered segmentation ϒ, we wish to take advantage of the order of the segments B1, ..., BL and at the same time allow flexibility of order shuffling within each segment. Towards this goal, we introduce the hybrid pairwise penalty.

Definition 2.2

Given a penalty function pλ(·) and tuning parameters λ1 and λ2, the hybrid pairwise penalty corresponding to an ordered segmentation ϒ is

$P_{\Upsilon,\lambda_1,\lambda_2}(\beta) = \sum_{l=1}^{L-1}\sum_{i \in B_l,\, j \in B_{l+1}} p_{\lambda_1}\big(|\beta_i - \beta_j|\big) + \sum_{l=1}^{L}\sum_{i,j \in B_l} p_{\lambda_2}\big(|\beta_i - \beta_j|\big)$.  (8)

In (8), we call the first part between-segment penalty and the second part within-segment penalty. The within-segment penalty penalizes all pairs of indices in each segment, hence, it does not rely on any ordering within the segment. The between-segment penalty penalizes pairs of indices from two adjacent segments, and it can be viewed as a “generalized” fused penalty on segments.
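
To make the structure of (8) concrete, the following sketch (hypothetical names, NumPy/standard library only) enumerates the pairs that receive the between-segment penalty $p_{\lambda_1}$ and those that receive the within-segment penalty $p_{\lambda_2}$.

```python
from itertools import combinations, product

def hybrid_pairs(segments):
    """Pairs penalized by the hybrid penalty (8).
    between: (i, j) with i in B_l and j in B_{l+1}, penalized at level lambda_1.
    within:  (i, j) with i, j in the same B_l,      penalized at level lambda_2."""
    between = [(i, j) for l in range(len(segments) - 1)
               for i, j in product(segments[l], segments[l + 1])]
    within = [(i, j) for B in segments for i, j in combinations(B, 2)]
    return between, within

between, within = hybrid_pairs([[1, 3], [0, 2]])
# between: [(1, 0), (1, 2), (3, 0), (3, 2)]; within: [(1, 3), (0, 2)]
```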

When L = p, each Bl is a singleton and (8) reduces to the fused penalty in (5). On the other hand, when L = 1, namely, no prior information about β, there is only one segment B1 = {1, · · ·, p}, and (8) reduces to the exhaustive pairwise penalty

$P_\lambda^{TV}(\beta) = \sum_{1 \le i,j \le p} p_\lambda\big(|\beta_i - \beta_j|\big)$.  (9)

It is also called the total variation penalty (Harchaoui and Lévy-Leduc 2010), and the case with pλ(·) being a truncated L1 penalty is studied in Shen and Huang (2010). Thus, the penalty (8) is a generalization of both the fused penalty and the total variation penalty, which explains the name “hybrid”.

The main motivation for introducing the hybrid pairwise penalty is to provide a set of intermediate versions between the fused penalty and the total variation penalty. When using pairwise penalties to promote homogeneity, we need to penalize “enough” pairs to guarantee that all true groups can be exactly recovered when the signal-to-noise ratio is sufficiently large. Given a consistent ordering, the fused penalty contains “just enough” pairs; but when the ordering is inconsistent, we have to penalize more pairs to achieve the aforementioned exact group recovery (see Section 2.3 for a numerical example). However, it may not be a good choice to include all pairs, i.e., to use the total variation penalty, as the large number of redundant pairs can result in statistical and computational inefficiency. The hybrid penalty is designed to include “just enough” pairs that adapt to the available “partial” ordering information of an ordered segmentation.

Now, we discuss how the requirement (6) can be relaxed. If we let Bj = {τ(j)}, then (6) can be written equivalently as $\max_{i \in B_j}\beta_i^0 \le \min_{i \in B_{j+1}}\beta_i^0$, for 1 ≤ j ≤ p − 1. This definition can be generalized to the case where the Bj’s are not singletons.

Definition 2.3

An ordered segmentation ϒ preserves the order of β0 if $\max_{j \in B_l}\beta_j^0 \le \min_{j \in B_{l+1}}\beta_j^0$, for l = 1, ..., L − 1.

In the construction (7), even if (6) does not hold, it is still possible that the resulting ϒ preserves the order of β0. Consider a toy example where p = 4 and $\beta_{\tau(1)}^0 = \beta_{\tau(2)}^0 = \beta_{\tau(4)}^0 < \beta_{\tau(3)}^0$, so that {τ(1), τ(2), τ(4)} and {τ(3)} are two true homogeneous groups in β0. Here τ ranks the coordinate τ(3) ahead of τ(4) based on the preliminary estimator β̃, but $\beta_{\tau(3)}^0 > \beta_{\tau(4)}^0$. So, τ fails to give a consistent ordering. However, as long as $\tilde{\beta}_{\tau(4)} \le \tilde{\beta}_{\tau(3)} + \delta$, τ(3) and τ(4) are grouped into the same segment in (7), say, B1 = {τ(1), τ(2)} and B2 = {τ(3), τ(4)}. Then ϒ still preserves the order of β0 according to Definition 2.3.
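
Definition 2.3 is also easy to check numerically; a minimal sketch (hypothetical names) follows, using the toy example above.

```python
import numpy as np

def preserves_order(segments, beta0):
    """Check Definition 2.3: max of beta0 over B_l <= min of beta0 over B_{l+1}."""
    beta0 = np.asarray(beta0)
    return all(beta0[list(segments[l])].max() <= beta0[list(segments[l + 1])].min()
               for l in range(len(segments) - 1))

# toy example: coordinates listed in rank order tau(1), ..., tau(4)
beta0 = np.array([0.0, 0.0, 1.0, 0.0])            # tau(3) has the larger true coefficient
print(preserves_order([[0, 1], [2, 3]], beta0))   # True: B_2 = {tau(3), tau(4)}
```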

Now we formally introduce the advanced version of Clustering Algorithm in Regression via Data-driven Segmentation (aCARDS). It consists of three steps, where the first two steps are very similar to the way that we assign letter grades based on scores of an exam.

  • Preliminary Ranking: Given a preliminary estimate β̃, generate the rank statistics {τ(j) : 1 ≤ j ≤ p} such that $\tilde{\beta}_{\tau(1)} \le \tilde{\beta}_{\tau(2)} \le \cdots \le \tilde{\beta}_{\tau(p)}$.

  • Segmentation: For a tuning parameter δ > 0, construct an ordered segmentation ϒ as described in (7).

  • Estimation: For tuning parameters λ1 and λ2, compute the solution β̂ that minimizes

$Q_n(\beta) = \frac{1}{2n}\|y - X\beta\|^2 + P_{\Upsilon,\lambda_1,\lambda_2}(\beta)$.  (10)

We call this procedure aCARDS (advanced CARDS).

In Section 4, we shall show that if ϒ preserves the order of β0, then under certain conditions, β̂ recovers the true homogeneous groups of β0 with high probability. Therefore, to guarantee the success of aCARDS, we need the existence of a δ > 0 for the preliminary estimator β̃ such that the associated ϒ preserves the order of β0. The above toy example shows that even when (6) fails, this condition can still hold. So aCARDS requires weaker conditions on β̃ than bCARDS. This is because the hybrid penalty contains penalty terms for more pairs of indices and is hence more robust to misordering in τ. In fact, bCARDS is a special case of aCARDS with δ = 0.

2.3 Comparison of two versions of CARDS

In this subsection, we first use a numerical example to compare bCARDS and aCARDS. It reveals how the ordered segmentation and hybrid pairwise penalty (8) play a role in aCARDS. We then discuss how to choose between two versions of CARDS in real data analysis.

We generate a data set with p = 40 predictors and n = 100 samples. The predictors are divided into two homogeneous groups, each of size 20. Let $\beta_j^0 = -0.2$ for j in Group 1 and $\beta_j^0 = 0.2$ for j in Group 2. X1, ..., Xn are generated independently and identically from $N_p(0, I)$, and $Y_i = X_i^T\beta^0 + \varepsilon_i$ for 1 ≤ i ≤ n, where ε1, ..., εn are independent noises with a standard normal distribution. In aCARDS, we take the ordinary least squares (OLS) estimator as the preliminary estimator. Figure 1 plots the sorted OLS coefficients for a realization. The estimated rank is not exactly consistent with the order of β0 since the predictors τ(17) and τ(18), which belong to Group 2, are mistakenly ranked ahead of some predictors in Group 1. If we use only the fused penalty, the terms that involve τ(17) and τ(18) are

Figure 1.

Illustration of the hybrid pairwise penalty and the aCARDS algorithm. Top panel: OLS coefficients and the associated ordered segmentation. Red dots and blue crosses represent predictors from Group 1 and Group 2, respectively. Bottom panel: Solution paths of bCARDS (left) and aCARDS (right) under misranking. The ranking and ordered segmentation are the same as in the top panel. For bCARDS, the horizontal axis represents the parameter λ. For aCARDS, the horizontal axis represents the between-segment parameter λ1, where we fix the within-segment parameter λ2 = 0.02. The vertical axis represents the estimated 40 regression coefficients, which are shrunk towards homogeneity (as the figures do not start from the smallest λ, we do not see initial 40 regression coefficients).

$p_\lambda\big(|\beta_{\tau(16)} - \beta_{\tau(17)}|\big) + p_\lambda\big(|\beta_{\tau(17)} - \beta_{\tau(18)}|\big) + p_\lambda\big(|\beta_{\tau(18)} - \beta_{\tau(19)}|\big).$

There are no penalty terms to shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficients of other predictors in Group 2. Now, suppose that we extract an ordered segmentation from the OLS coefficients by taking δ = 0.3; see Figure 1. Since it allows for arbitrary order reshuffling within the segment B4, this ordered segmentation preserves the order of β0, i.e., Definition 2.3 is satisfied. The hybrid pairwise penalty associated with this segmentation includes terms

$p_{\lambda_1}\big(|\beta_{\tau(17)} - \beta_{\tau(23)}|\big) + p_{\lambda_1}\big(|\beta_{\tau(18)} - \beta_{\tau(23)}|\big)$

between segments B4 and B5. So aCARDS will shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficient of τ(23), a predictor in Group 2. Moreover, there are terms such as

$p_{\lambda_1}\big(|\beta_{\tau(23)} - \beta_{\tau(24)}|\big) + p_{\lambda_1}\big(|\beta_{\tau(23)} - \beta_{\tau(25)}|\big) + p_{\lambda_1}\big(|\beta_{\tau(23)} - \beta_{\tau(26)}|\big)$

between segments B5 and B6. So aCARDS will also shrink the coefficient of τ(23) towards being equal to the coefficients of other predictors in Group 2. Eventually, aCARDS will shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficients of many other predictors in Group 2. This example explains how the ordered segmentation and hybrid penalty help overcome issues caused by misranking in the preliminary estimator.

To better illustrate the effects of fused penalty and hybrid penalty under misranking, we fix the estimated rank and ordered segmentation from above, and compute the solution paths of both bCARDS and aCARDS. Note that the penalty terms in both (5) and (8) are now fixed (hence we do not need the parameter δ in aCARDS). For bCARDS, we let λ vary. For aCARDS, we set the within-segment parameter λ2 = 0.02 and let the between-segment parameter λ1 vary. Figure 1 displays the solution paths. We see that although bCARDS does not include the true grouping in the solution path owing to misranking, aCARDS still achieves the true grouping, which is a benefit of the hybrid penalty.

In practical data analysis, we need not choose between the two versions of CARDS in advance; the tuning parameter selection process automatically tells us which version to use. This is because bCARDS is a special case of aCARDS with δ = 0. We only need to include 0 among the candidate values of the parameter δ and select δ in a data-driven manner (e.g., AIC, BIC and GCV). We call the resulting method CARDS, which involves a data-driven selection between bCARDS and aCARDS.

2.4 CARDS under sparsity

In applications, we may need to explore homogeneity and sparsity simultaneously. Often the preliminary estimator β̃ takes the sparsity into account, namely, it is obtained with a penalized least-squares method (Fan and Li 2001; Tibshirani et al. 2005) or sure independence screening (Fan and Lv 2008). Suppose β̃ has the sure screening property, i.e., $\tilde{S} \supseteq S_0$ with high probability, where $\tilde{S}$ and $S_0$ denote the supports of β̃ and β0, respectively. We modify CARDS as follows: in the first two steps, using the non-zero elements of β̃, we can similarly construct a data-driven hybrid penalty only on the coefficients of variables in $\tilde{S}$; in the third step, we fix $\hat{\beta}_{\tilde{S}^c} = 0$ and obtain $\hat{\beta}_{\tilde{S}}$ by minimizing the following penalized least squares

$Q_n^{sparse}(\beta) = \frac{1}{2n}\|y - X_{\tilde{S}}\beta_{\tilde{S}}\|^2 + P_{\Upsilon,\lambda_1,\lambda_2}(\beta_{\tilde{S}}) + \sum_{j \in \tilde{S}} p_\lambda\big(|\beta_j|\big)$,  (11)

where $X_{\tilde{S}}$ is the submatrix of X restricted to the columns in $\tilde{S}$. In (11), the second term is the hybrid penalty that encourages homogeneity among coefficients of variables already selected in β̃, and the third term is the element-wise penalty that helps further filter out falsely selected variables. We call this modified version sCARDS (shrinkage CARDS).

3 Analysis of the basic CARDS

In this section, we analyze theoretical properties of bCARDS. Due to the limited space, we state the results here and only prove Theorems 3.1–3.3 in the Appendix, leaving the rest of the proofs to the online supplemental materials of this paper.

3.1 Heuristics

We first provide some heuristics on why taking advantage of the homogeneity helps reduce the estimation error $\|\hat{\beta} - \beta^0\|$. Consider the ideal case of an orthogonal design $X^TX = nI_p$ (necessarily p ≤ n). The ordinary least-squares estimator $\hat{\beta}^{ols} = (X^TX)^{-1}X^Ty$ has the decomposition

$\hat{\beta}_j^{ols} = \beta_j^0 + \epsilon_j, \quad \epsilon_j \overset{i.i.d.}{\sim} N(0, n^{-1}), \quad j = 1, \ldots, p.$

It is clear by the square-root law that $\|\hat{\beta}^{ols} - \beta^0\| = O_P(\sqrt{p/n})$. Now, if there are K homogeneous groups in β0 and we know the true groups, the original model (1) can be rewritten as

$y = X_{\mathcal{A}}\beta_{\mathcal{A}}^0 + \varepsilon,$

where $\beta_{\mathcal{A}}^0 = (\beta_{\mathcal{A},1}^0, \ldots, \beta_{\mathcal{A},K}^0)^T$ contains the distinct values in β0, and $X_{\mathcal{A}} = (x_{\mathcal{A},1}, \ldots, x_{\mathcal{A},K})$ is such that $x_{\mathcal{A},k} = \sum_{j \in A_k} x_j$. The corresponding ordinary least-squares estimator $\hat{\beta}_{\mathcal{A}}^{ols} = (X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1}X_{\mathcal{A}}^Ty$ has the decomposition

$\hat{\beta}_{\mathcal{A},k}^{ols} = \beta_{\mathcal{A},k}^0 + \bar{\epsilon}_k, \quad \text{where the } \bar{\epsilon}_k\text{'s are independent and } \bar{\epsilon}_k \sim N(0, n^{-1}|A_k|^{-1}).$  (12)

Here $\bar{\epsilon}_k = \frac{1}{|A_k|}\sum_{j \in A_k}\epsilon_j$ is the noise averaged over group k. The oracle estimator β̂oracle is defined such that $\hat{\beta}_j^{oracle} = \hat{\beta}_{\mathcal{A},k}^{ols}$ for all $j \in A_k$. Then, by the square-root law,

$\|\hat{\beta}^{oracle} - \beta^0\|^2 = \sum_{k=1}^K |A_k|\,\big|\hat{\beta}_{\mathcal{A},k}^{ols} - \beta_{\mathcal{A},k}^0\big|^2 = O_p\Big(\sum_{k=1}^K |A_k|\cdot n^{-1}|A_k|^{-1}\Big) = O_p(K/n),$

which immediately implies that $\|\hat{\beta}^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$.

The surprises of the results are twofold. First, $\|\hat{\beta}^{oracle} - \beta^0\|$ has the rate of convergence $\sqrt{K/n}$ instead of $\sqrt{p/n}$. The point is that in (12) the noises are averaged, thanks to exploiting homogeneity, and consequently $\beta_{\mathcal{A},k}^0$ is estimated more accurately. The second surprise is that the rate has nothing to do with the sizes of the true homogeneous groups. No matter whether we have K groups of equal size, or one dominating group with (K − 1) other small groups, the rate is always the same for the oracle estimator.
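
The two rates are easy to verify by simulation. The sketch below (hypothetical names) assumes an orthogonal design and σ = 1, as in the heuristic above, and compares the OLS error, which scales like $\sqrt{p/n}$, with the oracle error, which scales like $\sqrt{K/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 400, 100, 4
groups = np.repeat(np.arange(K), p // K)              # group label of each coefficient
beta0 = np.array([-2.0, -1.0, 1.0, 2.0])[groups]      # homogeneous true coefficients

# orthogonal design: X^T X = n I_p
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
y = X @ beta0 + rng.standard_normal(n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
XA = np.column_stack([X[:, groups == k].sum(axis=1) for k in range(K)])
beta_oracle = np.linalg.lstsq(XA, y, rcond=None)[0][groups]

print(np.linalg.norm(beta_ols - beta0),      # of order sqrt(p/n)
      np.linalg.norm(beta_oracle - beta0))   # of order sqrt(K/n)
```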

3.2 Notations and regularity conditions

Let $\mathcal{M}_{\mathcal{A}}$ be the subspace of ℝp defined by

$\mathcal{M}_{\mathcal{A}} = \{\beta \in \mathbb{R}^p : \beta_i = \beta_j, \text{ for any } i, j \in A_k, 1 \le k \le K\}.$

For each $\beta \in \mathcal{M}_{\mathcal{A}}$, we can always write $X\beta = X_{\mathcal{A}}\beta_{\mathcal{A}}$, where $X_{\mathcal{A}} = (x_{\mathcal{A},1}, \ldots, x_{\mathcal{A},K})$ is an n × K matrix such that its k-th column is $x_{\mathcal{A},k} = \sum_{j \in A_k} x_j$, and $\beta_{\mathcal{A}}$ is a K-dimensional vector with its k-th component $\beta_{\mathcal{A},k}$ being the common coefficient in group $A_k$. Define the matrix $D = \mathrm{diag}(|A_1|^{1/2}, \ldots, |A_K|^{1/2})$. We introduce the following conditions on the design matrix X:

Condition 3.1

$\|x_j\|^2 = n$, for 1 ≤ j ≤ p. The eigenvalues of the K × K matrix $\frac{1}{n}D^{-1}X_{\mathcal{A}}^TX_{\mathcal{A}}D^{-1}$ are bounded below by $c_1 > 0$ and bounded above by $c_2 > 0$.

In the case of orthogonal designs, i.e., $X^TX = nI_p$, the matrix $\frac{1}{n}D^{-1}X_{\mathcal{A}}^TX_{\mathcal{A}}D^{-1}$ simplifies to $I_K$, and $c_1 = c_2 = 1$.

Let $\rho(t) = \lambda^{-1}p_\lambda(t)$ and $\bar{\rho}(t) = \rho'(|t|)\,\mathrm{sgn}(t)$. We assume that the penalty function pλ(·) satisfies the following condition.

Condition 3.2

pλ(·) is a symmetric function and it is non-decreasing and concave on [0, ∞). ρ′(t) exists and is continuous except for a finite number of t, with ρ′(0+) = 1. There exists a constant a > 0 such that ρ(t) is constant for all |t| ≥ aλ.

We also assume that the noise vector $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ has sub-Gaussian tails.

Condition 3.3

For any vector $a \in \mathbb{R}^n$ and x > 0, $P(|a^T\varepsilon| > \|a\|x) \le 2e^{-c_3x^2}$, where $c_3$ is a positive constant.

Given the design matrix X, let $X_k$ be its submatrix formed by including only the columns in $A_k$, for 1 ≤ k ≤ K. For any vector $v \in \mathbb{R}^q$, let $DC(v) = \max_{1 \le i \le q}\big|v_i - q^{-1}\sum_{j=1}^q v_j\big|$ be the “deviation from centrality”. Define

$\sigma_k = \lambda_{\max}\Big(\frac{1}{n}X_k^TX_k\Big) \quad \text{and} \quad \nu_k = \max_{\mu \in \mathcal{M}_{\mathcal{A}}:\|\mu\|=1} DC\Big(\frac{1}{n}X_k^TX\mu\Big),$  (13)

where λmax(·) denotes the largest eigenvalue operator. In the case of orthogonal design, σk = 1 and νk = 0. Let

$b_n = \frac{1}{2}\min_{1 \le k < l \le K}\big|\beta_{\mathcal{A},k}^0 - \beta_{\mathcal{A},l}^0\big|$

be half of the minimum gap between groups in β0, and λ = λn the tuning parameter in the penalty function.

3.3 Properties of bCARDS

When the true groups A1, · · ·, AK are known, the oracle estimator is

$\hat{\beta}^{oracle} = \arg\min_{\beta \in \mathcal{M}_{\mathcal{A}}}\Big\{\frac{1}{2n}\|y - X\beta\|^2\Big\}.$

Theorem 3.1

Suppose Conditions 3.1–3.3 hold, K = o(n), and the preliminary estimator β̃ generates a rank mapping τ that is consistent with the order of β0, i.e., (6) holds, with probability at least 1 − ε0. If the half minimum gap between groups, bn, satisfies $b_n > a\lambda_n$, where a is the same as that in Condition 3.2, and

$\lambda_n \gg \max_k\Big\{\sigma_k\sqrt{|A_k|\log(np)/n} + \big(1 + \nu_k|A_k|^{1/2}\big)\sqrt{K\log(n)/n}\Big\},$  (14)

then with probability at least $1 - \epsilon_0 - n^{-1}K - (np)^{-1}$, the bCARDS objective function (5) has a strictly local minimizer β̂ such that

  • β̂ = β̂oracle,

  • $\|\hat{\beta} - \beta^0\| = O_p(\sqrt{K/n})$.

Theorem 3.1 shows that bCARDS includes the oracle estimator as a strictly local minimizer, with overwhelming probability. This strong oracle property is a stronger result than the oracle property in Fan and Li (2001).

The objective function (5) in bCARDS is non-convex and may have multiple local minimizers. In practice, we apply the Local Linear Approximation algorithm (LLA) (Zou and Li 2008) to solve it: start from an initial solution β̂(0) = β̂initial; at step m, update the solution by

$\hat{\beta}^{(m)} = \arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p-1} p_\lambda'\big(\big|\hat{\beta}_{\tau(j+1)}^{(m-1)} - \hat{\beta}_{\tau(j)}^{(m-1)}\big|\big)\cdot\big|\beta_{\tau(j+1)} - \beta_{\tau(j)}\big|\Big\}.$

Given β̂initial, this algorithm produces a unique sequence of estimators which converges to a certain local minimizer. Next, Theorem 3.2 shows that under certain conditions, the sequence of estimators produced by the LLA algorithm converges to the oracle estimator.
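
A minimal sketch of one LLA step is given below, using the SCAD derivative for $p_\lambda'$ and the convex solver cvxpy for the weighted fused-L1 subproblem. All names are hypothetical; this is only an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lambda(|t|) of the SCAD penalty (Fan and Li 2001)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def lla_step(X, y, tau, beta_prev, lam, a=3.7):
    """One LLA iteration for the bCARDS objective (5): a weighted fused-L1
    problem over adjacent pairs in the preliminary order tau (0-based)."""
    n, p = X.shape
    D = np.zeros((p - 1, p))                    # adjacent-difference operator in order tau
    D[np.arange(p - 1), tau[1:]] = 1.0
    D[np.arange(p - 1), tau[:-1]] = -1.0
    w = scad_deriv(D @ beta_prev, lam, a)       # LLA weights p'_lambda(|previous differences|)
    beta = cp.Variable(p)
    obj = cp.sum_squares(y - X @ beta) / (2 * n) + cp.sum(cp.multiply(w, cp.abs(D @ beta)))
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value
```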

Theorem 3.2

Under the conditions of Theorem 3.1, suppose $\rho'(\lambda_n) \ge a_0$ for some constant $a_0 > 0$, and that there exists an initial solution β̂initial of (5) satisfying $\|\hat{\beta}^{initial} - \beta^0\| \le \lambda_n/2$. Then with probability at least $1 - \epsilon_0 - n^{-1}K - (np)^{-1}$, the LLA algorithm yields β̂oracle after one iteration, and it converges to β̂oracle after two iterations.

From Theorems 3.1 and 3.2, we conclude that bCARDS combined with the LLA algorithm yields the oracle estimator with overwhelming probability, provided that we have a good preliminary estimator β̃. Next, we discuss the choice of β̃.

Since we focus on dense problems in this section, the usual sparsity is not explicitly explored and the ordinary least squares estimator

$\hat{\beta}^{ols} = \arg\min_{\beta \in \mathbb{R}^p}\Big\{\frac{1}{2n}\|y - X\beta\|^2\Big\}$

can be used as the preliminary estimator. The following theorem shows that it induces a rank consistent mapping with high probability.

Theorem 3.3

Suppose Condition 3.3 holds, $p = O(n^\alpha)$ and $\lambda_{\min}\big(\frac{1}{n}X^TX\big) \ge c_4$, where 0 < α < 1 and $c_4 > 0$ are constants. If $b_n > \sqrt{2\alpha(c_3c_4)^{-1}\log(n)/n}$, where bn is half of the minimum gap between groups in β0, then with probability at least $1 - O(n^{-\alpha})$, the rank mapping τ generated from β̂ols is consistent with the order of β0.

When the rank mapping τ extracted from β̃ does not give a consistent order, i.e., (6) does not hold, the penalty in (5) is no longer a “correct” penalty for promoting the true grouping structure. There is no hope that local minimizers of (5) exactly recover the true groups. However, if there is not too much misranking in τ, it is still possible to control $\|\hat{\beta} - \beta^0\|$. Given a rank mapping τ, define

$K^*(\tau) = \sum_{j=1}^{p-1} 1\big\{\beta_{\tau(j)}^0 \ne \beta_{\tau(j+1)}^0\big\}.$

It is the number of coefficient “jumps” in β0 under the order given by τ. These “jumps” define subgroups $A_1^*, A_2^*, \ldots, A_{K^*}^*$, each being a subset of one true group. Although different subgroups may share the same true coefficient, any two consecutive subgroups $A_k^*$ and $A_{k+1}^*$ have a gap in coefficient values. As a result, the above results apply to this subgrouping structure. The following theorem is a generalization and a direct application of the proof of Theorem 3.1, and its details are omitted.
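
For completeness, $K^*(\tau)$ is simple to compute from the true coefficients and a candidate rank mapping; a short sketch (hypothetical names) is:

```python
import numpy as np

def num_jumps(beta0, tau):
    """K*(tau): the number of coefficient 'jumps' in beta0 when its entries
    are listed in the order given by tau (a 0-based permutation of 0..p-1)."""
    b = np.asarray(beta0)[np.asarray(tau)]
    return int(np.sum(b[:-1] != b[1:]))
```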

Theorem 3.4

Suppose Conditions 3.1–3.3 hold, K*(τ) = o(n), the half minimum gap $b_n > a\lambda_n$, and λn satisfies (14) with K replaced by K*(τ). Then with probability at least $1 - \epsilon_0 - n^{-1}K - (np)^{-1}$, the bCARDS objective function (5) has a strictly local minimizer β̂ such that $\|\hat{\beta} - \beta^0\| = O_p(\sqrt{K^*(\tau)/n})$.

3.4 bCARDS with the L1 penalty

In the bCARDS formulation (5), ρ(t) can also be the L1 penalty function ρ(t) = |t|. This choice can be useful for obtaining the initial solution β̂initial for the LLA algorithm. However, ρ(t) = |t| does not satisfy Condition 3.2. Hence, Theorem 3.1 does not apply, and the associated properties require additional study.

We first relax the requirement that τ is consistent with the order of β0. Instead, we consider the case that “τ is consistent with groups in β0”: there exists a permutation μ on {1, · · ·, K} and 1 = i1 < i2 < · · · < iK < iK+1 = p + 1 such that for k = 1, · · ·, K,

$\beta_{\tau(i)}^0 = \beta_{\mathcal{A},\mu(k)}^0, \quad i_k \le i \le i_{k+1} - 1.$  (15)

When μ is the identity permutation, i.e., μ(k) = k, (15) is equivalent to (6) and τ is consistent with the order of β0. Under the condition (15), recovering the true groups is equivalent to locating the coefficient jumps in β0 under the order given by τ, and these jumps can be either positive or negative.

To guarantee the exact recovery of the jumps, we need a joint condition on X and β0; it is in the same spirit as the “irrepresentability” condition in Zhao and Yu (2006), but is specifically designed for the homogeneity setting. For notational simplicity, we relabel the groups so that μ(k) = k for all k. Note that $\beta_{\mathcal{A},1}^0 < \beta_{\mathcal{A},2}^0 < \cdots < \beta_{\mathcal{A},K}^0$ need not hold with these new group indices.

For k = 1, ..., K − 1, write $d\beta_{\mathcal{A},k}^0 = \beta_{\mathcal{A},k+1}^0 - \beta_{\mathcal{A},k}^0$. Define the K-dimensional vector $d^0$ by $d_1^0 = \mathrm{sgn}(d\beta_{\mathcal{A},1}^0)$, $d_K^0 = -\mathrm{sgn}(d\beta_{\mathcal{A},K-1}^0)$ and

$d_k^0 = \mathrm{sgn}(d\beta_{\mathcal{A},k}^0) - \mathrm{sgn}(d\beta_{\mathcal{A},k-1}^0), \quad 2 \le k \le K - 1.$

Here $d^0$ is the adjacent difference of the sign vector of the jumps in β0. For example, suppose K = 4 and the common coefficients of the 4 groups satisfy $\beta_{\mathcal{A},2}^0 - \beta_{\mathcal{A},1}^0 > 0$, $\beta_{\mathcal{A},3}^0 - \beta_{\mathcal{A},2}^0 < 0$ and $\beta_{\mathcal{A},4}^0 - \beta_{\mathcal{A},3}^0 > 0$. Then $d^0 = (1, -2, 2, -1)$. Also, define the p-dimensional vector

$b^0 = X^TX_{\mathcal{A}}\big(X_{\mathcal{A}}^TX_{\mathcal{A}}\big)^{-1}d^0.$

In the case of orthogonal design $X^TX = nI_p$, $b^0 \in \mathcal{M}_{\mathcal{A}}$ and it has the form $b_j^0 = d_k^0/|A_k|$ for $j \in A_k$. For each $j \in A_k$, let

$A_{kj}^1 = \{\tau(i) \in A_k : i \le j\}, \qquad A_{kj}^2 = \{\tau(i) \in A_k : i > j\}.$

Namely, $A_{kj}^1$ and $A_{kj}^2$ contain the indices in group k that are ranked above and below τ(j), respectively. Let $\theta_{kj} = |A_{kj}^1|/|A_k|$ be the proportion of indices in group k that are ranked above τ(j). Denote by $\bar{b}_{kj} = \frac{1}{|A_{kj}^1|}\sum_{\tau(i) \in A_{kj}^1} b_{\tau(i)}^0$ the average of the elements of $b^0$ over the indices in $A_{kj}^1$, and by $\underline{b}_{kj} = \frac{1}{|A_{kj}^2|}\sum_{\tau(i) \in A_{kj}^2} b_{\tau(i)}^0$ the average over the indices in $A_{kj}^2$.

Condition 3.4

There exists a positive sequence {ωn}, which can go to 0, such that for $1 \le k \le K$, $j \in A_k$ and $j \ne i_{k+1} - 1$,

$1 - \omega_n \ge \begin{cases} \theta_{1j}\,\mathrm{sgn}(d\beta_{\mathcal{A},1}^0) + \frac{|A_1|}{2}\theta_{1j}(1-\theta_{1j})\big(\bar{b}_{1j} - \underline{b}_{1j}\big), & k = 1,\\ (1-\theta_{kj})\,\mathrm{sgn}(d\beta_{\mathcal{A},k-1}^0) + \theta_{kj}\,\mathrm{sgn}(d\beta_{\mathcal{A},k}^0) + \frac{|A_k|}{2}\theta_{kj}(1-\theta_{kj})\big(\bar{b}_{kj} - \underline{b}_{kj}\big), & 2 \le k \le K-1,\\ (1-\theta_{Kj})\,\mathrm{sgn}(d\beta_{\mathcal{A},K-1}^0) + \frac{|A_K|}{2}\theta_{Kj}(1-\theta_{Kj})\big(\bar{b}_{Kj} - \underline{b}_{Kj}\big), & k = K. \end{cases}$  (16)

In the case of orthogonal design $X^TX = nI_p$, $b^0 \in \mathcal{M}_{\mathcal{A}}$ and $\bar{b}_{kj} - \underline{b}_{kj} = 0$ holds for all k and $j \in A_k$. Condition 3.4 reduces to

$1 - \omega_n \ge \begin{cases} \theta_{1j}\,\mathrm{sgn}(d\beta_{\mathcal{A},1}^0), & k = 1,\\ (1-\theta_{kj})\,\mathrm{sgn}(d\beta_{\mathcal{A},k-1}^0) + \theta_{kj}\,\mathrm{sgn}(d\beta_{\mathcal{A},k}^0), & 2 \le k \le K-1,\\ (1-\theta_{Kj})\,\mathrm{sgn}(d\beta_{\mathcal{A},K-1}^0), & k = K. \end{cases}$

This is possible only when

$\mathrm{sgn}(d\beta_{\mathcal{A},k-1}^0) \ne \mathrm{sgn}(d\beta_{\mathcal{A},k}^0), \quad 2 \le k \le K - 1.$  (17)

Noting that 1/|Ak| ≤ θkj ≤ 1 − 1/|Ak|, the associated ωn can be chosen as mink{1/|Ak|} when (17) holds.

Theorem 3.5

Suppose Conditions 3.1, 3.3 and 3.4 hold, K = o(n), and the preliminary estimator β̃ generates an order τ that is consistent with groups in β0, i.e., (15) holds, with probability at least 1 − ε0. If the half minimum gap bn and the tuning parameter λn satisfy

$b_n \gg \sqrt{K\log(n)/n} + \lambda_n\Big(\sum_{k=1}^K \frac{1}{|A_k|^2}\Big)^{1/2} \quad \text{and} \quad \lambda_n \gg \omega_n^{-1}\max_k\Big\{\sigma_k\sqrt{|A_k|\log(np)/n}\Big\},$  (18)

then with probability at least $1 - \epsilon_0 - n^{-1}K - (np)^{-1}$, the bCARDS objective function (5) with ρ(t) = |t| has a unique global minimizer β̂ such that

  • $\hat{\beta} \in \mathcal{M}_{\mathcal{A}}$;

  • $\mathrm{sgn}(\hat{\beta}_{\mathcal{A},k+1} - \hat{\beta}_{\mathcal{A},k}) = \mathrm{sgn}(\beta_{\mathcal{A},k+1}^0 - \beta_{\mathcal{A},k}^0)$, k = 1, ..., K − 1;

  • $\|\hat{\beta} - \beta^0\| = O_p(\sqrt{K/n} + \gamma_n)$, where $\gamma_n = \lambda_n\big(\sum_{k=1}^K \frac{1}{|A_k|}\big)^{1/2}$.

Compared to Theorem 3.1, there is an extra bias term in the estimation error $\|\hat{\beta} - \beta^0\|$, which is of order $\sqrt{K\log(np)/n}$. Moreover, to achieve the exact recovery, it requires Condition 3.4, which is restrictive. For example, in the case of orthogonal designs, it is required that all consecutive jumps (under the order given by τ) have opposite signs.

4 Analysis of the advanced CARDS

In this section, we analyze aCARDS and its variant sCARDS. The proofs can be found in the online supplemental materials.

4.1 Properties of aCARDS

To guarantee the success of aCARDS, a key condition is that the ordered segmentation ϒ = {B1, ..., BL} defined in (7) preserves the order of β0 in the sense of Definition 2.3. This allows the ranking of coefficients in β̃ to deviate from that in β0, but not too much: for some δ > 0, whenever $\beta_i^0 < \beta_j^0$, $\tilde{\beta}_i \le \tilde{\beta}_j + \delta$ must hold.

For given A1, · · ·, AK and a segmentation ϒ = {B1, · · ·, BL}, define

$\phi_k = |A_k|\Big/\min\Big\{|A_k|^3,\ \min_{l: B_l \subseteq A_k}\{|B_l|^2\}\Big\}.$

Here $1/|A_k|^2 \le \phi_k \le |A_k|$ for 1 ≤ k ≤ K.

Theorem 4.1

Suppose Conditions 3.1–3.3 hold, K = o(n), and the preliminary estimator β̃ and the tuning parameter δn together generate an ordered segmentation ϒ that preserves the order of β0, with probability at least 1−ε0. If the half minimum gap bn and the tuning parameters (λ1n, λ2n) in (10) satisfy that bn > a max{λ1n, λ2n}, where a is the same as that in Condition 3.2, and

$\lambda_{1n} \gg \max_k\Big\{\sigma_k\sqrt{\phi_k\log(np)/n} + \big(1 + \nu_k\phi_k^{1/2}\big)\sqrt{K\log(n)/n}\Big\},$  (19)

and

$\lambda_{2n} \gg \max_k\Big\{\sqrt{|A_k|^{-1}\log(np)/n} + \big(1 + \nu_k|A_k|^{-1}\big)\sqrt{K\log(n)/n}\Big\},$  (20)

then with probability at least $1 - \epsilon_0 - n^{-1}K - 2(np)^{-1}$, the aCARDS objective function (10) has a strictly local minimizer β̂ such that

  • β̂ = β̂oracle,

  • $\|\hat{\beta} - \beta^0\| = O_p(\sqrt{K/n})$.

Compared to Theorem 3.1, aCARDS not only imposes less restrictive conditions on β̃, but also requires a smaller minimum gap between true coefficients.

Next, we establish the asymptotic normality of the CARDS estimator. By Theorem 4.1, with overwhelming probability, aCARDS performs just like the oracle. In the oracle situation, for example, if p = 5 and there are three true groups {β1, β4}, {β2} and {β3, β5}, the accuracy of estimating β is the same as if we know the model:

Y=β1(X1+X4)+β2X2+β3(X3+X5)+ε.

Theorem 4.2

Given a positive integer q, let {Bn} be a sequence of matrices such that $B_n \in \mathbb{R}^{q \times K}$, $\max_{v \in \mathbb{R}^q: \|v\|=1}\big\|X_{\mathcal{A}}\big(X_{\mathcal{A}}^TX_{\mathcal{A}}\big)^{-1}B_n^Tv\big\|_4 = o(1)$ and $B_nB_n^T \to H$, where H is a fixed q × q positive definite matrix. Suppose the conditions of Theorem 4.1 hold and let β̂ be the local minimizer of the aCARDS objective function (10) given in Theorem 4.1. Then

$B_n\big(X_{\mathcal{A}}^TX_{\mathcal{A}}\big)^{1/2}\big(\hat{\beta}_{\mathcal{A}} - \beta_{\mathcal{A}}^0\big) \overset{d}{\longrightarrow} N(0, H),$

where β̂A is the K-dimensional vector of distinct values in β̂.

Theorem 4.2 states the asymptotic normality of β̂A. Note that β̂ duplicates elements in β̂A. We introduce the following corollary to compare the asymptotic covariance of β̂ to that of β̂ols.

Corollary 4.1

Suppose the conditions of Theorems 4.1 and 4.2 hold. Let β̂ols and β̂ be the ordinary least squares estimator and the CARDS estimator, respectively. Let $M_n$ be the p × K matrix with $M_n(j,k) = (1/|A_k|^{1/2})\,1\{j \in A_k\}$. For any sequence of p-dimensional vectors {an},

  • $v_{1n}^{-1/2}a_n^T\big(\hat{\beta}^{ols} - \beta^0\big) \overset{d}{\longrightarrow} N(0, 1)$ with $v_{1n} = a_n^T(X^TX)^{-1}a_n$;

  • $v_{2n}^{-1/2}a_n^T\big(\hat{\beta} - \beta^0\big) \overset{d}{\longrightarrow} N(0, 1)$ with $v_{2n} = a_n^TM_n\big(M_n^TX^TXM_n\big)^{-1}M_n^Ta_n$.

Moreover, $v_{1n} \ge v_{2n}$.

4.2 Properties of sCARDS

In Section 2.4, we introduced sCARDS to explore both homogeneity and sparsity. In sCARDS, given a preliminary estimator β̃ and a parameter δ, we extract segments B1, ..., BL such that $\cup_{l=1}^L B_l = \tilde{S}$, where $\tilde{S}$ is the support of β̃. Denote $B_0 = \{j : \tilde{\beta}_j = 0\}$. In this case, we say ϒ = {B0, B1, ..., BL} preserves the order of β0 if

$\max_{j \in B_0}|\beta_j^0| = 0 \quad \text{and} \quad \max_{j \in B_l}\beta_j^0 \le \min_{j \in B_{l+1}}\beta_j^0, \quad l = 1, \ldots, L - 1.$  (21)

This implies that β̃ has the sure screening property, and on those preliminarily selected variables, the data-driven segments preserve the order of true coefficients.

Suppose there is a group of zero coefficients in β0, namely, $\mathcal{A} = (A_0, A_1, \ldots, A_K)$. Let $\mathcal{M}_{\mathcal{A}}$ be the subspace of ℝp defined by

$\mathcal{M}_{\mathcal{A}} = \{\beta \in \mathbb{R}^p : \beta_i = 0 \text{ for any } i \in A_0;\ \beta_i = \beta_j \text{ for any } i, j \in A_k, 1 \le k \le K\}.$

The oracle estimator is

$\hat{\beta}^{oracle} = \arg\min_{\beta \in \mathcal{M}_{\mathcal{A}}}\Big\{\frac{1}{2n}\|y - X\beta\|^2\Big\}.$

Denote by S the support of β0, and write s = |S| and $\tilde{s} = |\tilde{S}|$. Define

$b_n^* = \frac{1}{2}\min\big\{|\beta_j^0| : \beta_j^0 \ne 0\big\}$

to be the half minimum signal strength.

Theorem 4.3

Suppose Conditions 3.1–3.3 hold, s = o(n), log(p) = o(n), and the preliminary estimator β̃ and the tuning parameter δn together generate an ordered segmentation ϒ that preserves the order of β0, i.e., (21) holds, with probability at least 1 − ε0. If $b_n^*$, $b_n$ and the tuning parameters (λ1n, λ2n, λn) satisfy $b_n^* > a\lambda_n$, $b_n > a\max\{\lambda_{1n}, \lambda_{2n}\}$ and

$\lambda_n \gg \sqrt{\log(ns)/n} + \big\|X_{S}^TX_S\big\|_{2,\infty}\sqrt{K\log(n)/n}, \quad \lambda_{1n} \gg \max_k\Big\{\sigma_k\sqrt{\phi_k\log(ns)/n} + \big(1 + \nu_k\phi_k^{1/2}\big)\sqrt{K\log(n)/n}\Big\}, \quad \lambda_{2n} \gg \max_k\Big\{\sqrt{|A_k|^{-1}\log(ns)/n} + \big(1 + \nu_k|A_k|^{-1}\big)\sqrt{K\log(n)/n}\Big\}.$

Then with probability at least $1 - \epsilon_0 - n^{-1}K - (ns)^{-1} - (n\tilde{s})^{-1}$, the sCARDS objective function (11) has a strictly local minimizer β̂ such that

  • β̂ = β̂oracle,

  • $\|\hat{\beta} - \beta^0\| = O_p(\sqrt{K/n})$.

The preliminary estimator β̃ can be chosen, for example, as the SCAD estimator

$\hat{\beta}^{scad} = \arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^p p_\lambda\big(|\beta_j|\big)\Big\},$  (22)

where pλ(·) is the SCAD penalty function (Fan and Li 2001). The following theorem is a direct result of Theorem 2 in Fan and Lv (2011), and the proof is omitted.

Theorem 4.4

Under Conditions 3.1 and 3.3, if s = o(n), $\lambda_n \gg n^{-1/2}[\log(n)]^2$ and $\min\{|\beta_j^0| : \beta_j^0 \ne 0\} \gg n^{-1/2}\max\big\{\sqrt{\log p},\ \big\|\tfrac{1}{n}X_{S^c}^TX_S\big\|_\infty\sqrt{\log n}\big\}$, then with probability at least 1 − o(1), there exist a strictly local minimizer β̂scad and a $\delta_n = O(\sqrt{\log(n)/n})$ which together generate a segmentation preserving the order of β0.

5 Simulation studies

We conduct numerical experiments to study the performance of the two versions of CARDS, bCARDS and aCARDS, and their variant sCARDS. The goal is to investigate the performance of CARDS under different situations. Experiments 1–4 are based on the linear regression model $Y_i = X_i^T\beta^0 + \epsilon_i$, with Experiments 1–3 exploring homogeneity only and Experiment 4 exploring homogeneity and sparsity simultaneously. Experiment 5 is based on the spatial-temporal model $Y_{it} = X_t^T\beta_i^0 + \epsilon_{it}$.

In all experiments, {Xi : 1 ≤ i ≤ n} or {Xt : 1 ≤ t ≤ T} are generated independently and identically from the multivariate standard Gaussian distribution, and {εi : 1 ≤ i ≤ n} or {εit : 1 ≤ i ≤ p, 1 ≤ t ≤ T} are i.i.d. samples from N(0, 1). All results are based on 100 repetitions.

Experiment 1

Consider the linear regression model with p = 60 and n = 100. Predictors are divided into four groups, each of size 15. The true regression coefficients shared within each group are −2r, −r, r and 2r, respectively. Different values of r > 0 lead to various signal-to-noise ratios. Here we let r take values in {1, 0.8, 0.5}, corresponding to high, moderate and low signal-to-noise ratio, respectively.
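
A data-generating sketch for this design (hypothetical names; NumPy only) is shown below; the other experiments follow the same pattern with different group structures.

```python
import numpy as np

rng = np.random.default_rng(2015)
n, p, r = 100, 60, 1.0
groups = np.repeat(np.arange(4), 15)                    # four groups of size 15
beta0 = r * np.array([-2.0, -1.0, 1.0, 2.0])[groups]    # common within-group coefficients

X = rng.standard_normal((n, p))                         # X_i ~ N_p(0, I)
y = X @ beta0 + rng.standard_normal(n)                  # eps_i ~ N(0, 1)
```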

We compare the performance of six different methods: Oracle, ordinary least squares (OLS), bCARDS, aCARDS, total variation (TV), fused Lasso (fLasso). Oracle is the least squares estimator knowing the true groups. aCARDS and bCARDS are described in Section 2; here we let the penalty function pλ(·) be the SCAD penalty with a = 3.7, and take the OLS estimator as the preliminary estimator. TV uses the exhaustive pairwise penalty (9), where pλ(·) is also the SCAD penalty with a = 3.7. The fused Lasso is based on an order generated from ranking the OLS coefficients. In this sense, the fused Lasso is essentially bCARDS with the Lasso penalty pλ(t) = λ|t|. Tuning parameters of all these methods are selected via Bayesian information criteria (BIC).

Prediction performance of the different methods is evaluated in terms of the average model error over an independent test set of size 10,000. The model error is the prediction error minus the noise variance, and it better reflects the performance of statistical methods. In addition, to measure how close the estimated grouping structure is to the true one, we introduce the normalized mutual information (NMI), which is a common measure of similarity between clusterings (Fred and Jain 2003). Suppose ℂ = {C1, C2, ...} and ⅅ = {D1, D2, ...} are two sets of disjoint clusters of {1, ..., p}; define

$\mathrm{NMI}(\mathbb{C}, \mathbb{D}) = \frac{I(\mathbb{C}; \mathbb{D})}{[H(\mathbb{C}) + H(\mathbb{D})]/2},$

where $I(\mathbb{C}; \mathbb{D}) = \sum_{k,j}(|C_k \cap D_j|/p)\log\big(p|C_k \cap D_j|/(|C_k||D_j|)\big)$ is the mutual information between ℂ and ⅅ, and $H(\mathbb{C}) = -\sum_k(|C_k|/p)\log(|C_k|/p)$ is the entropy of ℂ. NMI(ℂ, ⅅ) takes values in [0, 1], and a larger NMI implies the two groupings are closer. In particular, NMI = 1 means that the two groupings are exactly the same.
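
For reference, the NMI of two partitions, given as length-p label vectors, can be computed as follows (a plain sketch with hypothetical names, not the authors' code).

```python
import numpy as np

def nmi(c, d):
    """Normalized mutual information between two partitions of {1,...,p},
    each given as a length-p vector of cluster labels."""
    c, d = np.asarray(c), np.asarray(d)
    p = len(c)
    I = 0.0
    for ck in np.unique(c):
        for dj in np.unique(d):
            n_kj = np.sum((c == ck) & (d == dj))
            if n_kj > 0:
                I += (n_kj / p) * np.log(p * n_kj / (np.sum(c == ck) * np.sum(d == dj)))

    def H(lab):
        freq = np.unique(lab, return_counts=True)[1] / p
        return -np.sum(freq * np.log(freq))

    return I / ((H(c) + H(d)) / 2)
```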

Figure 2 shows boxplots of the average model error and NMI for the six different methods. We observe that except for the case of weak signals (r = 0.5), the two versions of CARDS outperform the other methods since they lead to smaller average model errors and larger NMI. bCARDS performs especially well in achieving low model errors, even in the case r = 0.5. aCARDS has a better performance in terms of NMI, which indicates that it is better at recovering the true grouping structure. A possible reason that aCARDS does not perform as well as bCARDS in model errors is that aCARDS has more tuning parameters, and the selection of these tuning parameters in simulations may be non-optimal.

Figure 2.

The average model error and normalized mutual information in Experiment 1, where p = 60, n = 100, and there are four equal-size coefficient groups.

Experiment 2

The setting of this experiment is the same as in Experiment 1, except that the homogeneous groups have unequal sizes. In Experiment 2a, the predictors are divided into four groups of size 1, 15, 15 and 29. The four distinct regression coefficients are −4r, −r, r and 2r, respectively. Here, the first group is a singleton. In Experiment 2b, there is one dominating group of size 50 with a common regression coefficient −2r. The other 10 predictors have the regression coefficients 0, 2/9, 4/9, 6/9, ..., 2, respectively. In both subexperiments, we take r = 1 and 0.7 to represent two levels of signal-to-noise ratio. Besides the six methods compared in Experiment 1, we also implement a data-driven selection between bCARDS and aCARDS, as described in Section 2.3. In detail, we select the parameter δ via BIC (for each value of δ, the other parameters are also selected via BIC). The resulting method is called CARDS.

Figure 3 displays the results for Experiment 2a. It suggests that both bCARDS and aCARDS outperform the other methods, and so does CARDS. Figure 4 displays the results for Experiment 2b. We see that the total variation and fused Lasso cannot improve on the OLS estimator in terms of the average model error. bCARDS also performs unsatisfactorily, possibly due to misranking in the preliminary estimate. But aCARDS performs much better than OLS and the other methods. After a data-driven selection between bCARDS and aCARDS, the resulting method CARDS also outperforms the other methods.

Figure 3.

The average model error and normalized mutual information in Experiment 2a, where p = 60, n = 100, and there are four coefficient groups of size 1, 15, 15 and 29.

Figure 4.

The average model error and normalized mutual information in Experiment 2b, where p = 60, n = 100, and there is one dominating group of size 50.

Experiment 3

We use this experiment to investigate how the performance of the two versions of CARDS is affected by misranking in the preliminary estimate. The setting is the same as Experiment 1 and we fix r = 1 (so n = 100, p = 60, and there are 4 equal-size groups with true regression coefficients equal to −2, −1, 1 and 2, respectively). For each data set (X, y), we generate 11 different preliminary ranks as follows: for each σ in {1, 1.2, 1.4, ..., 3}, we generate $z \sim N(X\beta^0, \sigma^2 I_n)$ independently of y conditional on X, and then use the OLS estimator associated with z to get a preliminary rank. A larger value of σ tends to yield a “worse” preliminary rank. We use K* defined in Section 3.3 to quantify the level of misranking. Recall that K* is the total number of jumps in the true regression coefficients under the preliminary rank, and K* > 3 means there exists misordering in the preliminary rank. We generated 100 data sets and 11 preliminary ranks for each data set, so the results are based on 1100 repetitions. We compare the performance of bCARDS and aCARDS with that of the total variation (TV), where TV does not use any information from the preliminary rank. Figure 5 contains boxplots of the average model error as K* changes. First, we see that the two versions of CARDS are quite robust to the increase of K* and always outperform the total variation. Second, the model error of bCARDS increases as K* increases, which provides empirical evidence for the claim in Theorem 3.4 about the effect of misranking on bCARDS. Third, when K* is large, aCARDS has a better performance than bCARDS, due to the fact that the hybrid pairwise penalty can tolerate a higher level of misranking than the fused penalty.

Figure 5.

The average model error in Experiment 3. The horizontal axis represents K*, and “b”, “a” and “TV” are short for bCARDS, aCARDS and total variation.

Experiment 4

This experiment explores the homogeneity and sparsity simultaneously. Consider the linear regression model with p = 100 and n = 150. Among the 100 predictors, 60 are important ones and their coefficients are the same as those in Experiment 1. Besides, there are 40 unimportant predictors whose coefficients are all equal to 0.

We implemented sCARDS and compared its performance with different oracle estimators, Oracle, Oracle0 and OracleG, as well as ordinary least squares (OLS), SCAD, shrinkage total variation (sTV) and fused Lasso (fLasso). The three oracles are defined with different levels of prior information: the Oracle knows both the important predictors and the true groups, the Oracle0 only knows which predictors are important, and the OracleG only knows the true groups (it treats all unimportant predictors as one group with an unknown common coefficient). sCARDS is as described in Section 2; while implementing it, we take the SCAD estimator as the preliminary estimator. The shrinkage total variation is an extension of TV obtained by adding both the elementwise SCAD penalty and the exhaustive pairwise SCAD penalty. The fused Lasso used here has both the elementwise L1 penalty and the fused pairwise L1 penalty.

Figure 6 displays the boxplots of the average model errors for r = 1 and r = 0.7. First, by comparing the model errors of the three oracles, we see a significant advantage of taking into account both homogeneity and sparsity over pure sparsity. Moreover, the results of Oracle0 and OracleG show that exploring the group structure is more important than exploring sparsity. Second, sCARDS achieves a smaller model error than OLS, SCAD and the fused Lasso. The performances of sCARDS and sTV in terms of model error are comparable. However, they differ in feature selection. Figure 7 contains frequency histograms of the number of falsely selected features for the two methods. In about 17% of the repetitions, sTV fails to shrink the coefficients of the 40 unimportant predictors to 0.

Figure 6.

The average model error and normalized mutual information in Experiment 4, where p = 100, n = 150, and there are 60 important predictors divided into four equal-size groups.

Figure 7.

Feature selection of sCARDS and shrinkage total variation (sTV) in Experiment 4 (left: r = 1; right: r = 0.7), where there are 40 unimportant predictors.

Experiment 5

We consider a special case of the spatial-temporal model (3), $Y_{it} = X_t^T\beta_i + \epsilon_{it}$, i = 1, ..., p, i.e., the predictors are the same for all spatial locations. There are p = 100 different locations and k = 5 common predictors (so each βi has dimension 5). We assume only spatial homogeneity in the regression coefficients, that is, for each predictor j (j = 1, ..., 5), the coefficients {βi,j, 1 ≤ i ≤ 100} are divided into four groups of equal size 25, where coefficients in the same group share the same value. In the simulations, we let

$\beta_{i,j} = \omega_j + (-2)\,I\{1 \le i \le 25\} + (-1)\,I\{26 \le i \le 50\} + I\{51 \le i \le 75\} + 2\,I\{76 \le i \le 100\}, \quad 1 \le j \le 5,$

where {ωj = 0.1 × (j −1), 1 ≤ j ≤ 5} are location-independent constants. In this experiment, instead of varying the signal-to-noise ratio directly, we equivalently change T, the total number of time points.

We extend aCARDS to this model by adding the hybrid pairwise penalty on coefficients of the same predictor at different locations, and still call the method aCARDS. The total variation (TV) and fused Lasso (fLasso) can be extended to this model in a similar way. The Oracle is the maximum likelihood estimator which knows the true groups for each predictor. We aim to compare the performance of Oracle, OLS, aCARDS, total variation and fused Lasso.

Figure 8 displays the results. We see that aCARDS achieves significantly lower model errors than OLS, due to exploring homogeneity. Moreover, it has a better performance than the total variation and the fused Lasso, particularly when T = 50 or 80. aCARDS also estimates the true grouping structure well; when T = 50 or 80, the normalized mutual information is larger than 0.95 in most repetitions.

Figure 8.

The average model error and normalized mutual information in Experiment 5, where the model is a spatial-temporal regression model with p = 100 locations and k = 5 common predictors.

6 Real data analysis

6.1 S&P500 returns

In this study, we fit a homogeneous Fama-French model for stock returns: $Y_{it} = \alpha_i + X_t^T\beta_i^0 + \epsilon_{it}$, where $X_t$ contains the three Fama-French factors at time t, $Y_{it}$ is the excess return of stock i, and $\epsilon_{it}$ are idiosyncratic noises. We collected daily returns of 410 stocks, which were always included in the components of the S&P500 index in the period December 1, 2010 to December 1, 2011 (T = 254). We applied CARDS as in Experiment 5, except that the intercepts αi's were also penalized for sparsity. The sparsity of the αi's is supported by the capital asset pricing model (CAPM) and its extension, the multifactor pricing model, in financial econometric theories. The tuning parameters were chosen via generalized cross validation (GCV). Table 1 shows the number of fitted coefficient groups on the three factors and the number of non-zero intercepts. We then used daily returns of the same stocks in the period December 1, 2011 to July 1, 2012 (T = 146) to evaluate the prediction error. Let ŷit and yit be the fitted and observed excess returns of stock i at time t = 1, ..., 146, respectively. Define the discounted cumulative sum of squared estimation errors (cRSS) at time t as $\mathrm{cRSS}_t = \sum_{s=1}^t \rho^{s/10}\sum_i(\hat{y}_{is} - y_{is})^2$, where we take ρ = 0.95. Figure 9 shows the percentage improvement in cRSSt of the CARDS estimator over the OLS estimator. We see that CARDS achieves a smaller discounted cRSS compared to OLS at most time points, especially in the “very-close” and “far-away” future. The North American Industry Classification System (NAICS) classifies these 410 companies into 18 different industry sectors. Figure 10(a) shows the OLS coefficients on the factor “book-to-market ratio”. We can see that stocks belonging to the sector “Utilities” (29 stocks in total) have very close OLS coefficients, and 17 stocks in this sector were clustered into one group by the CARDS estimator. Figure 10(b) shows the percentage improvement in cRSSt only for the stocks in this sector, where the improvement is more significant.
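
The discounted cRSS is a simple running weighted sum of squared residuals; a short sketch (hypothetical names) is:

```python
import numpy as np

def discounted_crss(y_hat, y, rho=0.95):
    """Discounted cumulative sum of squared errors:
    cRSS_t = sum_{s<=t} rho^(s/10) * sum_i (y_hat[s,i] - y[s,i])^2.
    y_hat, y: T x (number of stocks) arrays of fitted and observed excess returns."""
    t = np.arange(1, y.shape[0] + 1)
    return np.cumsum(rho ** (t / 10) * np.sum((y_hat - y) ** 2, axis=1))
```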

Table 1. Number of groups in fitting the S&P500 data.

Fama-French factor          No. of coef. groups
"market return"             41
"market capitalization"     32
"book-to-market ratio"      56
intercept                   60

Figure 9. Comparison of the cumulative sum of squared prediction errors for the S&P500 data from December 1, 2011 to July 1, 2012. The vertical axis is the percentage of relative error between the predictions made by OLS and CARDS, defined by $100(\mathrm{cRSS}_t^{ols} - \mathrm{cRSS}_t^{cards})/\mathrm{cRSS}_t^{ols}$. The right panel is a zoom-in of the results for CARDS.

Figure 10. (a) OLS coefficients on the "book-to-market ratio" factor; the x axis represents different sectors. (b) Percentage improvement of the discounted cumulative sum of squared estimation errors for stocks in the sector "Utilities" (Sector 2).

6.2 Polyadenylation signals

CARDS can be easily extended to more general settings such as generalized linear models (McCullagh and Nelder 1989), although we have focused on the linear regression model so far. In this subsection, we apply CARDS to a logistic regression example. This study aims to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analyzing features around them. The data set was first used in Legendre and Gautheret (2003) and later analyzed by Liu et al. (2003), and it is available at http://datam.i2r.a-star.edu.sg/datasets/krbd/SequenceData/Polya.html. There is one training data set and five testing data sets. To avoid any platform bias, we use the training data set only. It has 4418 observations, each with 170 predictors and a binary response. The binary response indicates whether a terminal sequence is classified as a "strong" or "weak" polyA site, and the predictors are features from the upstream (USE) and downstream (DSE) sequence elements. We randomly select 2000 observations to perform model estimation and use the rest to evaluate performance. Our numerical analysis consists of the following steps. In Step 1, we apply the L1-penalized logistic regression to these 2000 observations with all 170 predictors and use AIC to select an appropriate regularization parameter. In Step 2, we use the logistic regression coefficients obtained in Step 1 as our preliminary estimate and apply sCARDS accordingly. The average prediction error (and standard error in parentheses) over 40 random splits is reported in Table 2, together with the average number of non-zero coefficient groups and the average number of selected features. The table shows that sCARDS leads to a smaller prediction error than the shrinkage total variation (sTV). In addition, sCARDS has fewer groups of non-zero coefficients but more selected features, which implies that we can include more predictors while keeping the degrees of freedom small. Note that in this example, the fused Lasso (fLasso) performs similarly to sCARDS. In Section 5, we remarked that the fused Lasso is essentially bCARDS with the lasso penalty pλ(t) = λ|t|.
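As a rough illustration, Step 1 could be carried out as follows (a sketch under our own choices of the regularization grid and software, using scikit-learn; the paper does not specify an implementation, and the sCARDS step itself is not shown).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logistic_preliminary(X, y, C_grid=np.logspace(-2, 2, 20)):
    """L1-penalized logistic regression with an AIC-based choice of the tuning parameter."""
    best_aic, best_coef = np.inf, None
    for C in C_grid:
        model = LogisticRegression(penalty="l1", C=C, solver="liblinear", max_iter=1000)
        model.fit(X, y)
        prob = model.predict_proba(X)[:, 1]
        loglik = np.sum(y * np.log(prob + 1e-12) + (1 - y) * np.log(1 - prob + 1e-12))
        df = np.count_nonzero(model.coef_)          # number of nonzero coefficients
        aic = 2 * df - 2 * loglik
        if aic < best_aic:
            best_aic, best_coef = aic, model.coef_.ravel()
    return best_coef   # preliminary estimate passed to sCARDS in Step 2
```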

Table 2. Results of the PASes data (standard errors in parentheses).

                               sCARDS           SCAD             sTV              fLasso
Prediction error               0.2449 (.0015)   0.2458 (.0014)   0.2757 (.0026)   0.2445 (.0010)
No. of non-zero coef. groups   5.5000           21.6250          5.7500           5.5000
No. of selected features       73.2750          21.6250          40.3500          86.8750

7 Conclusion

In this paper, we explored homogeneity of coefficients in high-dimensional regression. We proposed a new method called Clustering Algorithm in Regression via Data-driven Segmentation (CARDS) to estimate regression coefficients and to detect homogeneous groups. The implementation of CARDS does not need any geographical information (neighborhoods, distance, graphs, etc.) a priori, which distinguishes it from other methods in similar settings and makes it more general for applications. A modification of CARDS, sCARDS, can be used to explore homogeneity and sparsity simultaneously. Our theoretical results show that better estimation accuracy can be achieved by exploring homogeneity. In particular, when the number of homogeneous groups is small, the power of exploring homogeneity and sparsity simultaneously is much larger than that of exploring sparsity only, which is also confirmed in our simulation studies.

Methodologically, CARDS has two main innovations. First, it takes advantage of a preliminary estimate, from which it extracts either an estimated ranking or an estimated ordered segmentation. Second, it introduces the so-called "hybrid pairwise penalty" to adapt to the available partial ordering information. The hybrid pairwise penalty is not only robust to misordering but also avoids the statistical and computational inefficiency caused by penalizing too many pairs. These ideas for handling homogeneity can be applied to situations much broader than linear regression, if we combine the hybrid pairwise penalty with appropriate loss functions. For example, CARDS can be extended to generalized linear models (GLM) when homogeneity appears.

To promote homogeneity, CARDS takes advantage of a preliminary estimate. This idea can be generalized. Instead of extracting a complete ranking or an ordered segmentation, we may also apply clustering methods to the coefficients of the preliminary estimate, such as the k-means algorithm or hierarchical clustering, to help construct data-driven penalties and further promote homogeneity, as illustrated in the sketch below.
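A minimal sketch of this clustering idea (our own illustration, with hypothetical function and argument names) follows.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_preliminary(beta_init, n_groups):
    """Group preliminary coefficient estimates with k-means; the resulting labels
    can be used to build data-driven pairwise penalties within (or between
    neighboring) clusters."""
    coefs = np.asarray(beta_init).reshape(-1, 1)
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(coefs)
    return labels   # candidate group membership for each coefficient
```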

This paper only considers the case where predictors in one homogeneous group have equal coefficients. In a more general situation, coefficients of predictors in the same group are close but not exactly equal. The idea of data-driven pairwise penalties still applies, but instead of using the class of folded concave penalty functions, we may need to use penalty functions which are smooth at the origin, e.g., the L2 penalty function. Another possible approach is to use posterior-type estimators combined with, say, a Gaussian prior on the coefficients. These are beyond the scope of this paper and we leave them as future work.

Supplementary Material


Acknowledgments

The authors would like to thank the Editor Professor Xuming He, the anonymous Associate Editor and two referees for their very helpful comments and suggestions.

The authors were partially supported by the National Institute of General Medical Sciences of the National Institutes of Health through Grant Numbers R01-GM072611 and R01-GM100474 and National Science Foundation grant DMS-1206464.

The author was partially supported by NSF grants DMS-0905561 and DMS-1055210, and NIH Grant NIH/NCI R01 CA-149569.

A Proofs

A.1 Proof of Theorem 3.1

Introduce the mapping $T : \mathcal{M}_{\mathcal{A}} \to \mathbb{R}^K$, where $T(\beta)$ is the $K$-dimensional vector whose $k$-th coordinate equals the common value of $\beta_j$ for $j \in A_k$. Note that $T$ is a bijection and $T^{-1}$ is well-defined for any $\mu \in \mathbb{R}^K$. Also, introduce the mapping $T^* : \mathbb{R}^p \to \mathbb{R}^K$, where $T^*(\beta)_k = \frac{1}{|A_k|}\sum_{j \in A_k}\beta_j$. We see that $T^* = T$ on $\mathcal{M}_{\mathcal{A}}$, and $T^{-1}T^*$ is the orthogonal projection from $\mathbb{R}^p$ onto $\mathcal{M}_{\mathcal{A}}$. Denote $\mu^0 = T(\beta^0)$ and $\hat{\mu}^{oracle} = T(\hat{\beta}^{oracle})$.
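As a small illustration of these mappings (our own toy example, not from the paper), take $p = 3$, $K = 2$, $A_1 = \{1, 2\}$ and $A_2 = \{3\}$; then

```latex
% Toy example with p = 3, K = 2, A_1 = {1, 2}, A_2 = {3}.
\[
  T^*(\beta) = \Bigl(\tfrac{\beta_1+\beta_2}{2},\ \beta_3\Bigr), \qquad
  T^{-1}T^*(\beta) = \Bigl(\tfrac{\beta_1+\beta_2}{2},\ \tfrac{\beta_1+\beta_2}{2},\ \beta_3\Bigr),
\]
% i.e., T^{-1}T^* averages the coordinates within each group, which is exactly the
% Euclidean projection of beta onto M_A = {beta in R^3 : beta_1 = beta_2}.
```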

Denote $L_n(\beta) = \frac{1}{2n}\|y - X\beta\|^2$ and $P_n(\beta) = \lambda_n\sum_{j=1}^{p-1}\rho(|\beta_{\tau(j+1)} - \beta_{\tau(j)}|)$, so that we can write $Q_n(\beta) = L_n(\beta) + P_n(\beta)$. For any $\mu \in \mathbb{R}^K$, let

$$L_n^{\mathcal{A}}(\mu) = \frac{1}{2n}\|y - X_{\mathcal{A}}\mu\|^2, \qquad P_n^{\mathcal{A}}(\mu) = \lambda_n\sum_{k=1}^{K-1}\rho(|\mu_{k+1} - \mu_k|),$$

and define $Q_n^{\mathcal{A}}(\mu) = L_n^{\mathcal{A}}(\mu) + P_n^{\mathcal{A}}(\mu)$. Note that when $\tau$ is consistent with the order of $\beta^0$, there exist $1 = j_1 < j_2 < \cdots < j_K < j_{K+1} = p+1$ such that $A_k = \{\tau(j_k), \tau(j_k+1), \cdots, \tau(j_{k+1}-1)\}$ for $1 \le k \le K$. Then $Q_n(\beta) = Q_n^{\mathcal{A}}(T(\beta))$ and $Q_n^{\mathcal{A}}(\mu) = Q_n(T^{-1}(\mu))$ for any $\beta \in \mathcal{M}_{\mathcal{A}}$ and $\mu \in \mathbb{R}^K$.

In the first part of the proof, we show $\|\hat{\beta}^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$. By definition and direct calculations,

$$\|\hat{\beta}^{oracle} - \beta^0\| = \|D(\hat{\mu}^{oracle} - \mu^0)\|, \qquad \hat{\mu}^{oracle} - \mu^0 = (X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1}X_{\mathcal{A}}^T\varepsilon.$$

Therefore, we can write

$$\|\hat{\beta}^{oracle} - \beta^0\| = \|(D^{-1}X_{\mathcal{A}}^TX_{\mathcal{A}}D^{-1})^{-1}D^{-1}X_{\mathcal{A}}^T\varepsilon\|. \qquad (23)$$

From Condition 3.1, $\|(D^{-1}X_{\mathcal{A}}^TX_{\mathcal{A}}D^{-1})^{-1}\| \le (c_1n)^{-1}$ and $\mathrm{tr}(D^{-1}X_{\mathcal{A}}^TX_{\mathcal{A}}D^{-1}) \le c_2nK$. By the Markov inequality, for any $\delta > 0$,

$$P\left(\|D^{-1}X_{\mathcal{A}}^T\varepsilon\| > \sqrt{c_2nK/\delta}\right) \le \frac{E\|D^{-1}X_{\mathcal{A}}^T\varepsilon\|^2}{c_2nK/\delta} = \frac{\mathrm{tr}(D^{-1}X_{\mathcal{A}}^TX_{\mathcal{A}}D^{-1})}{c_2nK/\delta} \le \delta. \qquad (24)$$

Combining the above, we have shown that with probability at least $1 - \delta$, $\|\hat{\beta}^{oracle} - \beta^0\| \le C\delta^{-1/2}\sqrt{K/n}$. This proves $\|\hat{\beta}^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$.

Furthermore, we show a result that will be frequently used in later proofs:

$$\|\hat{\beta}^{oracle} - \beta^0\| \le C\sqrt{K\log(n)/n}, \quad \text{with probability at least } 1 - n^{-1}K. \qquad (25)$$

Write $D^{-1}X_{\mathcal{A}}^T\varepsilon = (v_1^T\varepsilon, \cdots, v_K^T\varepsilon)^T$, where $v_k = X_{\mathcal{A}}D^{-1}e_k$ and $e_k$ is the unit vector with 1 in the $k$-th coordinate and 0 elsewhere. Observing that $\|v_k\|^2$ is the $k$-th diagonal element of the matrix $D^{-1}X_{\mathcal{A}}^TX_{\mathcal{A}}D^{-1}$, we have $\|v_k\| \le \sqrt{c_2n}$. It follows from Condition 3.3 and the union bound that

$$P\left(\max_{1\le k\le K}|v_k^T\varepsilon| > \sqrt{c_2c_3^{-1}n\log(2n)}\right) \le \sum_{k=1}^{K}P\left(|v_k^T\varepsilon| > \|v_k\|\sqrt{c_3^{-1}\log(2n)}\right) \le n^{-1}K.$$

Since $\|D^{-1}X_{\mathcal{A}}^T\varepsilon\| \le K^{1/2}\max_{1\le k\le K}|v_k^T\varepsilon|$,

$$\|D^{-1}X_{\mathcal{A}}^T\varepsilon\| \le \sqrt{c_2c_3^{-1}Kn\log(2n)}, \quad \text{with probability at least } 1 - n^{-1}K. \qquad (26)$$

Then (25) follows by combining (23) and (26).

In the second part of the proof, we show that $\hat{\beta}^{oracle}$ is a strictly local minimizer of $Q_n(\beta)$ with probability at least $1 - \varepsilon_0 - n^{-1}K - (np)^{-1}$. By assumption, there is an event $E_1$ such that $P(E_1^c) \le \varepsilon_0$ and over the event $E_1$, $\tau$ is consistent with the order of $\beta^0$. Consider the neighborhood of $\beta^0$:

$$\mathcal{B} = \{\beta \in \mathbb{R}^p : \|\beta - \beta^0\| < 2C\sqrt{K\log(n)/n}\}.$$

By (25), there is an event $E_2$ such that $P(E_2^c) \le n^{-1}K$ and over the event $E_2$, $\|\hat{\beta}^{oracle} - \beta^0\| \le C\sqrt{K\log(n)/n}$. Hence, $\hat{\beta}^{oracle} \in \mathcal{B}$ over the event $E_2$.

For any $\beta \in \mathcal{B}$, write $\beta^*$ for its orthogonal projection onto $\mathcal{M}_{\mathcal{A}}$. We aim to show:

(a) Over the event $E_1 \cap E_2$,

$$Q_n(\beta^*) \ge Q_n(\hat{\beta}^{oracle}), \quad \text{for any } \beta \in \mathcal{B}, \qquad (27)$$

and the inequality is strict whenever $\beta^* \ne \hat{\beta}^{oracle}$.

(b) There is an event $E_3$ such that $P(E_3^c) \le (np)^{-1}$. Over the event $E_1 \cap E_2 \cap E_3$, there exists $\mathcal{B}_n$ ($\mathcal{B}_n \subset \mathcal{B}$), a neighborhood of $\hat{\beta}^{oracle}$, such that

$$Q_n(\beta) \ge Q_n(\beta^*), \quad \text{for any } \beta \in \mathcal{B}_n, \qquad (28)$$

and the inequality is strict whenever $\beta \ne \beta^*$.

Combining (a) and (b), $Q_n(\beta) \ge Q_n(\hat{\beta}^{oracle})$ for any $\beta \in \mathcal{B}_n$, a neighborhood of $\hat{\beta}^{oracle}$, and the inequality is strict whenever $\beta \ne \hat{\beta}^{oracle}$. This proves that $\hat{\beta}^{oracle}$ is a strictly local minimizer of $Q_n$ over the event $E_1 \cap E_2 \cap E_3$.

It remains to show (a) and (b). Consider (a) first. We claim that

$$P_n^{\mathcal{A}}(T^*(\beta)) = 0 \quad \text{for any } \beta \in \mathcal{B}. \qquad (29)$$

To see this, for a given $\beta \in \mathcal{B}$, write $\mu = T^*(\beta)$. It suffices to check $|\mu_{k+1} - \mu_k| > a\lambda_n$ for $k = 1, \cdots, K-1$. Note that $\mu_{k+1} - \mu_k \ge \min_{i\in A_k,\, j\in A_{k+1}}(\beta_j - \beta_i) \ge \min_{i\in A_k,\, j\in A_{k+1}}(\beta_j^0 - \beta_i^0) - 2\|\beta - \beta^0\| \ge 2b_n - 2C\sqrt{K\log(n)/n}$. Since $b_n > a\lambda_n \gg \sqrt{K\log(n)/n}$, it is easy to see that $|\mu_{k+1} - \mu_k| > a\lambda_n$.

Using (29), we see that $Q_n^{\mathcal{A}}(T^*(\beta)) = L_n^{\mathcal{A}}(T^*(\beta))$ for all $\beta \in \mathcal{B}$. Since $Q_n^{\mathcal{A}} = Q_n \circ T^{-1}$ and $T^{-1}T^*$ is the orthogonal projection from $\mathbb{R}^p$ onto $\mathcal{M}_{\mathcal{A}}$, for any $\beta \in \mathcal{B}$,

$$Q_n(\beta^*) = Q_n(T^{-1}T^*(\beta)) = Q_n^{\mathcal{A}}(T^*(\beta)) = L_n^{\mathcal{A}}(T^*(\beta)). \qquad (30)$$

In particular, noting that $\hat{\beta}^{oracle} \in \mathcal{M}_{\mathcal{A}}$ and its orthogonal projection onto $\mathcal{M}_{\mathcal{A}}$ is itself, the above further implies

$$Q_n(\hat{\beta}^{oracle}) = L_n^{\mathcal{A}}(\hat{\mu}^{oracle}). \qquad (31)$$

By definition and the fact that $\partial^2 L_n^{\mathcal{A}}(\mu)/\partial\mu\partial\mu^T = \frac{1}{n}X_{\mathcal{A}}^TX_{\mathcal{A}}$ is positive definite, $\hat{\mu}^{oracle}$ is the unique global minimizer of $L_n^{\mathcal{A}}(\mu)$. As a result,

$$L_n^{\mathcal{A}}(T^*(\beta)) \ge L_n^{\mathcal{A}}(\hat{\mu}^{oracle}), \qquad (32)$$

and the inequality is strict whenever $T^*(\beta) \ne \hat{\mu}^{oracle}$, i.e., $\beta^* \ne T^{-1}(\hat{\mu}^{oracle}) = \hat{\beta}^{oracle}$. Combining (30)–(32) gives (a).

Second, consider (b). For a positive sequence {tn} to be determined, let

$$\mathcal{B}_n = \mathcal{B} \cap \{\beta : \|\beta - \hat{\beta}^{oracle}\| \le t_n\}.$$

Since $\beta^*$ is the orthogonal projection of $\beta$ onto $\mathcal{M}_{\mathcal{A}}$, $\|\beta - \beta^*\| \le \|\beta - \beta'\|$ for any $\beta' \in \mathcal{M}_{\mathcal{A}}$. In particular, $\|\beta - \beta^*\| \le \|\beta - \hat{\beta}^{oracle}\|$. As a result, to show (28), it suffices to show

$$Q_n(\beta) \ge Q_n(\beta^*), \quad \text{for any } \beta \text{ such that } \|\beta - \beta^*\| \le t_n, \qquad (33)$$

and the inequality is strict whenever $\beta \ne \beta^*$.

To show (33), write μ = T*(β) so that β* = T−1(μ). By Taylor expansion,

$$Q_n(\beta) - Q_n(\beta^*) = -\frac{1}{n}(y - X\beta^m)^TX(\beta - \beta^*) + \sum_{j=1}^{p}\frac{\partial P_n(\beta^m)}{\partial\beta_{\tau(j)}}(\beta_{\tau(j)} - \beta^*_{\tau(j)}) \equiv I_1 + I_2,$$

where $\beta^m$ lies on the line segment between $\beta$ and $\beta^*$. Consider $I_2$ first. Direct calculations yield

$$\frac{\partial P_n(\beta)}{\partial\beta_{\tau(j)}} = \begin{cases} -\lambda_n\bar{\rho}(\beta_{\tau(2)} - \beta_{\tau(1)}), & j = 1, \\ \lambda_n\bar{\rho}(\beta_{\tau(j)} - \beta_{\tau(j-1)}) - \lambda_n\bar{\rho}(\beta_{\tau(j+1)} - \beta_{\tau(j)}), & 2 \le j \le p-1, \\ \lambda_n\bar{\rho}(\beta_{\tau(p)} - \beta_{\tau(p-1)}), & j = p, \end{cases}$$

where $\bar{\rho}(t) = \rho'(|t|)\mathrm{sgn}(t)$ and $\rho(t) = \lambda^{-1}p_\lambda(t)$. Plugging this into $I_2$ and rearranging the sum, we obtain

$$I_2 = \lambda_n\sum_{j=1}^{p-1}\bar{\rho}(\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)})\left[(\beta_{\tau(j+1)} - \beta_{\tau(j)}) - (\beta^*_{\tau(j+1)} - \beta^*_{\tau(j)})\right]. \qquad (34)$$

When $\tau(j)$ and $\tau(j+1)$ belong to the same group, $\beta^*_{\tau(j)} = \beta^*_{\tau(j+1)}$, and hence the sign of $(\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)})$ is the same as the sign of $(\beta_{\tau(j+1)} - \beta_{\tau(j)})$ if neither of them is 0. In addition, recall that $A_k = \{\tau(j_k), \tau(j_k+1), \cdots, \tau(j_{k+1}-1)\}$ for all $1 \le k \le K$, for some indices $1 = j_1 < j_2 < \cdots < j_K < j_{K+1} = p+1$. Combining the above, we can rewrite

$$I_2 = \lambda_n\sum_{k=1}^{K}\sum_{j=j_k}^{j_{k+1}-2}\rho'(|\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}|)\,|\beta_{\tau(j+1)} - \beta_{\tau(j)}| + \lambda_n\sum_{k=2}^{K}\bar{\rho}(\beta^m_{\tau(j_k)} - \beta^m_{\tau(j_k-1)})\left[(\beta_{\tau(j_k)} - \beta_{\tau(j_k-1)}) - (\beta^*_{\tau(j_k)} - \beta^*_{\tau(j_k-1)})\right].$$

First, since $\beta^0 \in \mathcal{M}_{\mathcal{A}}$ and $\beta^*$ is the orthogonal projection of $\beta$ onto $\mathcal{M}_{\mathcal{A}}$, $\|\beta^* - \beta^0\| \le \|\beta - \beta^0\|$. Hence, $\beta \in \mathcal{B}$ implies $\beta^*, \beta^m \in \mathcal{B}$. By repeating the proof of (29), we can show $\bar{\rho}(\beta^m_{\tau(j_k)} - \beta^m_{\tau(j_k-1)}) = 0$ for $2 \le k \le K$. So the second term in $I_2$ vanishes. Second, in the first term of $I_2$, since $|\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}| \le 2\|\beta^m - \beta^*\| \le 2\|\beta - \beta^*\| \le 2t_n$, it follows by concavity that $\rho'(|\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}|) \ge \rho'(2t_n)$. Together, we have

$$I_2 \ge \lambda_n\sum_{k=1}^{K}\sum_{j=j_k}^{j_{k+1}-2}\rho'(2t_n)\,|\beta_{\tau(j+1)} - \beta_{\tau(j)}|. \qquad (35)$$

Next, we simplify $I_1$. Let $z = z(\beta^m) = X^T(y - X\beta^m)$ and write $I_1 = -\frac{1}{n}z^T(\beta - \beta^*)$. For any fixed $k$ and $l$ such that $\tau(l) \in A_k$ and $l \le j_{k+1} - 1$, let $A_{kl}^1 = \{\tau(j) \in A_k : j \le l\}$ and $A_{kl}^2 = \{\tau(j) \in A_k : j > l\}$. Noting that $\beta^*_{\tau(i)} = \frac{1}{|A_k|}\sum_{j=j_k}^{j_{k+1}-1}\beta_{\tau(j)}$ for $i \in A_k$, we can re-express $I_1$ as

$$\begin{aligned}
I_1 &= -\frac{1}{2}\sum_{k=1}^{K}\sum_{i=j_k}^{j_{k+1}-1}\frac{1}{n}z_{\tau(i)}\left[\beta_{\tau(i)} - \beta^*_{\tau(i)}\right] - \frac{1}{2}\sum_{k=1}^{K}\sum_{j=j_k}^{j_{k+1}-1}\frac{1}{n}z_{\tau(j)}\left[\beta_{\tau(j)} - \beta^*_{\tau(j)}\right]\\
&= -\sum_{k=1}^{K}\frac{1}{2n|A_k|}\sum_{i,j=j_k}^{j_{k+1}-1}z_{\tau(i)}\left[\beta_{\tau(i)} - \beta_{\tau(j)}\right] - \sum_{k=1}^{K}\frac{1}{2n|A_k|}\sum_{i,j=j_k}^{j_{k+1}-1}z_{\tau(j)}\left[\beta_{\tau(j)} - \beta_{\tau(i)}\right]\\
&= -\sum_{k=1}^{K}\frac{1}{2n|A_k|}\sum_{i,j=j_k}^{j_{k+1}-1}\left[z_{\tau(j)} - z_{\tau(i)}\right]\left[\beta_{\tau(j)} - \beta_{\tau(i)}\right]\\
&= -\sum_{k=1}^{K}\frac{1}{n|A_k|}\sum_{j_k \le i < j \le j_{k+1}-1}\left[z_{\tau(j)} - z_{\tau(i)}\right]\sum_{i \le l < j}\left[\beta_{\tau(l+1)} - \beta_{\tau(l)}\right]\\
&= -\sum_{k=1}^{K}\frac{1}{n|A_k|}\sum_{l=j_k}^{j_{k+1}-2}\left[\beta_{\tau(l+1)} - \beta_{\tau(l)}\right]\left[|A_{kl}^1|\sum_{\tau(j)\in A_{kl}^2}z_{\tau(j)} - |A_{kl}^2|\sum_{\tau(i)\in A_{kl}^1}z_{\tau(i)}\right]\\
&\equiv \sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2}w_{\tau(l)}(z)\left[\beta_{\tau(l+1)} - \beta_{\tau(l)}\right],
\end{aligned} \qquad (36)$$

where for any vector $v \in \mathbb{R}^p$,

$$w_{\tau(l)}(v) = n^{-1}\left[\frac{|A_{kl}^2|}{|A_k|}\sum_{\tau(j)\in A_{kl}^1}v_{\tau(j)} - \frac{|A_{kl}^1|}{|A_k|}\sum_{\tau(j)\in A_{kl}^2}v_{\tau(j)}\right].$$

We aim to bound $|w_{\tau(l)}(z)|$. Let $\eta = X^TX(\beta^* - \beta^0)$, $\eta^m = X^TX(\beta^m - \beta^*)$, and write $z = X^T\varepsilon - \eta - \eta^m$. First, $w_{\tau(l)}(v)$ is a linear function of $v$. Second, since $\beta^m$ lies between $\beta$ and $\beta^*$, we have $\|\beta^* - \beta^m\| \le \|\beta^* - \beta\| \le t_n$. It follows that $\|\eta^m\| \le \lambda_{\max}(X^TX)t_n$. Moreover, $|w_{\tau(l)}(v)| \le (|A_k|/n)\|v\| \le (p/n)\|v\|$ for all $v$. Combining the above yields

$$|w_{\tau(l)}(z)| \le |w_{\tau(l)}(X^T\varepsilon)| + |w_{\tau(l)}(\eta)| + \sup_{v:\,\|v\| \le \lambda_{\max}(X^TX)t_n}|w_{\tau(l)}(v)| \le |w_{\tau(l)}(X^T\varepsilon)| + |w_{\tau(l)}(\eta)| + (p/n)\lambda_{\max}(X^TX)\,t_n. \qquad (37)$$

First, we bound the term $|w_{\tau(l)}(X^T\varepsilon)|$. Let $E_3$ be the event that

$$\max_{\tau(l)\in A_k}|w_{\tau(l)}(X^T\varepsilon)| \le n^{-1/2}\sqrt{\sigma_k|A_k|\log(2(np))/c_3}, \qquad k = 1, \cdots, K, \qquad (38)$$

where we recall $\sigma_k$ is the maximum eigenvalue of $n^{-1}X^TX$ restricted to the $(A_k, A_k)$-block. Given $\tau(l)$, we can express $w_{\tau(l)}(X^T\varepsilon)$ as

$$w_{\tau(l)}(X^T\varepsilon) = a_{\tau(l)}^T\varepsilon, \quad \text{where } a_{\tau(l)} = n^{-1}\left(\frac{|A_{kl}^2|}{|A_k|}X_{A_{kl}^1}\mathbf{1}_{|A_{kl}^1|} - \frac{|A_{kl}^1|}{|A_k|}X_{A_{kl}^2}\mathbf{1}_{|A_{kl}^2|}\right).$$

Write $L_1 = |A_{kl}^1|$ and $L_2 = |A_{kl}^2|$, so that $|A_k| = L_1 + L_2$. It is observed that $\|X_{A_{kl}^1}\mathbf{1}_{|A_{kl}^1|}\|^2 \le n\sigma_k\|\mathbf{1}_{|A_{kl}^1|}\|^2 = n\sigma_kL_1$. Using the fact that $(a+b)^2 \le 2(a^2+b^2)$ for any real values $a, b$, we have $\|a_{\tau(l)}\|^2 \le 2n^{-1}\sigma_k(L_2^2L_1/|A_k|^2 + L_1^2L_2/|A_k|^2) = 2\sigma_kL_1L_2/(n|A_k|) \le \sigma_k|A_k|/(2n)$. Applying Condition 3.3 and the probability union bound,

$$P(E_3^c) \le \sum_{k=1}^{K}\sum_{\tau(l)\in A_k}P\left(|w_{\tau(l)}(X^T\varepsilon)| > n^{-1/2}\sqrt{\sigma_k|A_k|\log(2(np))/c_3}\right) \le \sum_{1\le j\le p}P\left(|a_j^T\varepsilon| > \|a_j\|\sqrt{2\log(2(np))/c_3}\right) < (np)^{-1}. \qquad (39)$$

Second, we bound the term $|w_{\tau(l)}(\eta)|$. Observing that for any vector $v$, $w_{\tau(l)}(v) = w_{\tau(l)}(v - \bar{v}_k\mathbf{1})$, where $\bar{v}_k$ is the mean of $\{v_j, j \in A_k\}$, we have

$$|w_{\tau(l)}(v)|^2 \le 2\left(\frac{|A_{kl}^2|^2|A_{kl}^1|}{n^2|A_k|^2} + \frac{|A_{kl}^1|^2|A_{kl}^2|}{n^2|A_k|^2}\right)\left(\max_{j\in A_k}|v_j - \bar{v}_k|\right)^2 \le \frac{|A_k|}{2n^2}\left(\max_{j\in A_k}|v_j - \bar{v}_k|\right)^2.$$

Since $\eta = X^TX(\beta^* - \beta^0)$ and $\beta^* - \beta^0 \in \mathcal{M}_{\mathcal{A}}$, we have $\max_{j\in A_k}|\eta_j - \bar{\eta}_k| \le n\nu_k\|\beta^* - \beta^0\|$, where $\nu_k$ is defined in (13). As a result,

$$\max_{\tau(l)\in A_k}|w_{\tau(l)}(\eta)| \le \frac{\nu_k}{\sqrt{2}}|A_k|^{1/2}\cdot\|\beta^* - \beta^0\| \le C\nu_k\sqrt{K|A_k|\log(n)/n}, \qquad (40)$$

where the last inequality holds because we consider $\beta \in \mathcal{B}_n$ in (28), and $\|\beta^* - \beta^0\| \le \|\beta - \beta^0\|$ (noticing that $\beta^*$ is the orthogonal projection of $\beta$ onto $\mathcal{M}_{\mathcal{A}}$).

Combining (36)–(40), we find that over the event $E_1 \cap E_2 \cap E_3$,

$$|I_1| \le \sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2}\left[C\left(\sqrt{\frac{\sigma_k|A_k|\log(np)}{n}} + \nu_k\sqrt{\frac{K|A_k|\log(n)}{n}}\right) + \frac{p\lambda_{\max}(X^TX)}{n}t_n\right]|\beta_{\tau(l+1)} - \beta_{\tau(l)}| \le \sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2}\left(\frac{\lambda_n}{2} + \frac{p\lambda_{\max}(X^TX)}{n}t_n\right)|\beta_{\tau(l+1)} - \beta_{\tau(l)}|, \qquad (41)$$

where we have used condition (14) on λn.

From (35) and (41), over the event $E_1 \cap E_2 \cap E_3$, for any $\beta \in \mathcal{B}$ such that $\|\beta - \beta^*\| \le t_n$,

$$Q_n(\beta) - Q_n(\beta^*) \ge \sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2}\left[\frac{\lambda_n}{2} - g_n(t_n)\right]|\beta_{\tau(l+1)} - \beta_{\tau(l)}|,$$

where $g_n(t_n) = n^{-1}p\lambda_{\max}(X^TX)t_n + \lambda_n[1 - \rho'(2t_n)]$. Since $\rho'(0+) = 1$, $g_n(0+) = 0$. So we can always choose $t_n$ sufficiently small to ensure $|g_n(t_n)| < \lambda_n/2$; consequently, the right hand side is non-negative, and strictly positive when $\sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2}|\beta_{\tau(l+1)} - \beta_{\tau(l)}| > 0$, i.e., $\beta \ne \beta^*$. This proves (b).

A.2 Proof of Theorem 3.2

First, we show that the LLA algorithm yields $\hat{\beta}^{oracle}$ after one iteration. Let $E_1$ be the event that the ranking $\tau$ is consistent with the order of $\beta^0$, $E_2$ the event that $\|\hat{\beta}^{oracle} - \beta^0\| \le C\sqrt{K\log(n)/n}$, and $E_3$ the event that (38) holds. We have shown that $P(E_1 \cap E_2 \cap E_3) \ge 1 - \varepsilon_0 - n^{-1}K - (np)^{-1}$. It suffices to show that over the event $E_1 \cap E_2 \cap E_3$, the LLA algorithm gives $\hat{\beta}^{oracle}$ after the first iteration.

Let $w_j = \rho'(|\hat{\beta}^{initial}_{\tau(j+1)} - \hat{\beta}^{initial}_{\tau(j)}|)$. At the first iteration, the algorithm minimizes

$$Q_n^{initial}(\beta) \equiv \frac{1}{2n}\|y - X\beta\|^2 + \lambda_n\sum_{j=1}^{p-1}w_j|\beta_{\tau(j+1)} - \beta_{\tau(j)}|.$$

This is a convex function, hence it suffices to show that $\hat{\beta}^{oracle}$ is a strictly local minimizer of $Q_n^{initial}$. Using the same notation as in the proof of Theorem 3.1, for any $\beta \in \mathbb{R}^p$, write $\beta^* = T^{-1}T^*(\beta)$ for its orthogonal projection onto $\mathcal{M}_{\mathcal{A}}$. Let $\mathcal{B} = \{\beta \in \mathbb{R}^p : \|\beta - \beta^0\| \le C\sqrt{K\log(n)/n}\}$, and for a sequence $\{t_n\}$ to be determined, consider the neighborhood of $\hat{\beta}^{oracle}$ defined by $\mathcal{B}_n = \{\beta \in \mathcal{B} : \|\beta - \hat{\beta}^{oracle}\| \le t_n\}$. It suffices to show

$$Q_n^{initial}(\beta) \ge Q_n^{initial}(\beta^*) \ge Q_n^{initial}(\hat{\beta}^{oracle}), \quad \text{for any } \beta \in \mathcal{B}_n, \qquad (42)$$

and the first inequality is strict whenever $\beta \ne \beta^*$, and the second inequality is strict whenever $\beta^* \ne \hat{\beta}^{oracle}$.

We first show the second inequality in (42). For $\tau(j)$ and $\tau(j+1)$ in different groups, $|\beta^0_{\tau(j+1)} - \beta^0_{\tau(j)}| > 2b_n$. In addition, $\|\hat{\beta}^{initial} - \beta^0\| \le \lambda_n/2$. Hence, $|\hat{\beta}^{initial}_{\tau(j+1)} - \hat{\beta}^{initial}_{\tau(j)}| \ge 2b_n - \lambda_n > a\lambda_n$, and it follows that $w_j = 0$. On the other hand, for $\tau(j)$ and $\tau(j+1)$ in the same group, $\beta_{\tau(j+1)} - \beta_{\tau(j)} = 0$ whenever $\beta \in \mathcal{M}_{\mathcal{A}}$. Consequently,

$$Q_n^{initial}(\beta) = \frac{1}{2n}\|y - X\beta\|^2 = L_n(\beta), \quad \text{for } \beta \in \mathcal{M}_{\mathcal{A}}.$$

It is easy to see that $\hat{\beta}^{oracle}$ is the unique global minimizer of $L_n$ constrained on $\mathcal{M}_{\mathcal{A}}$. So the second inequality in (42) holds.

Next, consider the first inequality in (42). We apply a Taylor expansion to $Q_n^{initial}(\beta) - Q_n^{initial}(\beta^*)$ and rearrange the sums as in (34). Then, for some $\beta^m$ that lies on the line segment between $\beta$ and $\beta^*$,

$$Q_n^{initial}(\beta) - Q_n^{initial}(\beta^*) = \lambda_n\sum_{j=1}^{p-1}w_j\,\mathrm{sgn}(\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)})\left[(\beta_{\tau(j+1)} - \beta_{\tau(j)}) - (\beta^*_{\tau(j+1)} - \beta^*_{\tau(j)})\right] - \frac{1}{n}(y - X\beta^m)^TX(\beta - \beta^*) \equiv J_1 + J_2.$$

We first simplify $J_1$. Note that $w_j = 0$ when $\tau(j)$ and $\tau(j+1)$ are in different groups. When $\tau(j)$ and $\tau(j+1)$ are in the same group $A_k$: first, $\beta^*_{\tau(j+1)} = \beta^*_{\tau(j)}$, and $[\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}]$ has the same sign as $[\beta_{\tau(j+1)} - \beta_{\tau(j)}]$; second, $|\hat{\beta}^{initial}_{\tau(j+1)} - \hat{\beta}^{initial}_{\tau(j)}| \le 2\|\hat{\beta}^{initial} - \beta^0\| \le \lambda_n$, and hence $w_j \ge \rho'(\lambda_n) \ge a_0$. Combining the above yields

$$J_1 = \lambda_n\sum_{k=1}^{K}\sum_{j=j_k}^{j_{k+1}-2}w_j|\beta_{\tau(j+1)} - \beta_{\tau(j)}| \ge a_0\lambda_n\sum_{k=1}^{K}\sum_{j=j_k}^{j_{k+1}-2}|\beta_{\tau(j+1)} - \beta_{\tau(j)}|. \qquad (43)$$

Next, we simplify $J_2$. Denote $z = X^T(y - X\beta^m)$. Similarly to (36)–(41), we find that

$$J_2 = -\sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2}w_{\tau(l)}(z)\left[\beta_{\tau(l+1)} - \beta_{\tau(l)}\right],$$

where over the event $E_3$, for any $j_k \le l \le j_{k+1} - 2$,

$$|w_{\tau(l)}(z)| \le C\left(\sqrt{\frac{\sigma_k|A_k|\log(np)}{n}} + \nu_k\sqrt{\frac{K|A_k|\log(n)}{n}}\right) + \frac{p\lambda_{\max}(X^TX)}{n}t_n.$$

From the condition on $\lambda_n$, the sum of the first two terms is upper bounded by $a_0\lambda_n/3$ for large $n$. We choose $t_n = a_0n\lambda_n/(3p\lambda_{\max}(X^TX))$, so that the last term is also bounded by $a_0\lambda_n/3$. It follows that

$$J_2 \ge -\sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2}\frac{2a_0\lambda_n}{3}|\beta_{\tau(l+1)} - \beta_{\tau(l)}|. \qquad (44)$$

Combining (43) and (44), over the event $E_1 \cap E_2 \cap E_3$,

$$Q_n^{initial}(\beta) - Q_n^{initial}(\beta^*) \ge \frac{a_0\lambda_n}{3}\sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2}|\beta_{\tau(l+1)} - \beta_{\tau(l)}| \ge 0.$$

This proves the first inequality in (42).

Second, we show that over the event $E_1 \cap E_2 \cap E_3$, the LLA algorithm still yields $\hat{\beta}^{oracle}$ at the second iteration, and therefore it converges to $\hat{\beta}^{oracle}$. We have shown that after the first iteration, the algorithm outputs $\hat{\beta}^{oracle}$. It then treats $\hat{\beta}^{oracle}$ as the initial solution for the second iteration. So it suffices to check

$$\|\hat{\beta}^{oracle} - \beta^0\| \le \lambda_n/2.$$

This is true because over the event $E_2$, $\|\hat{\beta}^{oracle} - \beta^0\| \le C\sqrt{K\log(n)/n} \ll \lambda_n$.

A.3 Proof of Theorem 3.3

It suffices to show that, with probability at least $1 - O(n^{-\alpha})$, $\beta_i^0 < \beta_j^0$ implies $\hat{\beta}_i^{ols} \le \hat{\beta}_j^{ols}$ for any $1 \le i, j \le p$. When $\beta_i^0 < \beta_j^0$, necessarily $\beta_j^0 - \beta_i^0 \ge 2b_n$. Moreover, $\hat{\beta}_j^{ols} - \hat{\beta}_i^{ols} \ge (\beta_j^0 - \beta_i^0) - 2\|\hat{\beta}^{ols} - \beta^0\|_\infty$. So it suffices to show that $\|\hat{\beta}^{ols} - \beta^0\|_\infty \le b_n$ with probability at least $1 - O(n^{-\alpha})$.

From direct calculations, $\hat{\beta}^{ols} = \beta^0 + (X^TX)^{-1}X^T\varepsilon$. Write $\hat{\beta}_j^{ols} - \beta_j^0 = a_j^T\varepsilon$, where $a_j = X(X^TX)^{-1}e_j$ for $j = 1, \cdots, p$. Then $\|a_j\|^2 = e_j^T(X^TX)^{-1}e_j \le c_4n^{-1}$. By Condition 3.3 and applying the union bound,

$$P\left(\|\hat{\beta}^{ols} - \beta^0\|_\infty > b_n\right) \le P\left(\|\hat{\beta}^{ols} - \beta^0\|_\infty > \sqrt{(2\alpha c_4/c_3)\log(n)/n}\right) \le \sum_{j=1}^{p}P\left(|a_j^T\varepsilon| > \|a_j\|\sqrt{2\alpha\log(n)/c_3}\right) \le 2pn^{-2\alpha}.$$

Since $p = O(n^{\alpha})$, $2pn^{-2\alpha} = O(n^{-\alpha})$. This completes the proof.

Footnotes

8 Supplemental Materials

The supplemental materials contain technical proofs for Theorems 3.5, 4.1, 4.2, 4.3, and Corollary 4.1.

References

  1. Bondell HD, Reich BJ. Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR. Biometrics. 2008;64:115–123. doi: 10.1111/j.1541-0420.2007.00843.x.
  2. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data. Springer; 2011.
  3. Chen SS, Donoho DL, Saunders MA. Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing. 1998;20:33–61.
  4. Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  5. Fan J, Lv J. Sure Independence Screening for Ultra-high Dimensional Feature Space. Journal of the Royal Statistical Society, Ser B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  6. Fan J, Lv J. Nonconcave Penalized Likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
  7. Fan J, Lv J, Qi L. Sparse High-dimensional Models in Economics. Annual Review of Economics. 2011;3:291–317. doi: 10.1146/annurev-economics-061109-080451.
  8. Fan J, Xue L, Zou H. Strong Oracle Optimality of Folded Concave Penalized Estimation. 2012. Manuscript, available at http://arxiv.org/abs/1210.5992. doi: 10.1214/13-aos1198.
  9. Fred A, Jain AK. Robust Data Clustering. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2003;3:128–136.
  10. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise Coordinate Optimization. The Annals of Applied Statistics. 2007;1:302–332.
  11. Harchaoui Z, Lévy-Leduc C. Multiple Change-point Estimation with a Total Variation Penalty. Journal of the American Statistical Association. 2010;105:1480–1493.
  12. Huang H-C, Hsu N-J, Theobald DM, Breidt FJ. Spatial Lasso with Applications to GIS Model Selection. Journal of Computational and Graphical Statistics. 2010;19:963–983.
  13. Kim S, Xing EP. Statistical Estimation of Correlated Genome Associations to a Quantitative Trait Network. PLoS Genetics. 2009;5:e1000587. doi: 10.1371/journal.pgen.1000587.
  14. Kim Y, Choi H, Oh H-S. Smoothly Clipped Absolute Deviation on High Dimensions. Journal of the American Statistical Association. 2008;103:1665–1673.
  15. Legendre M, Gautheret D. Sequence Determinants in Human Polyadenylation Site Selection. BMC Genomics. 2003;4:7–15. doi: 10.1186/1471-2164-4-7.
  16. Li C, Li H. Variable Selection and Regression Analysis for Graph-structured Covariates with an Application to Genomics. The Annals of Applied Statistics. 2010;4:1498–1516. doi: 10.1214/10-AOAS332.
  17. Liu H, Han H, Li J, Wong L. An in-Silico Method for Prediction of Polyadenylation Signals in Human Sequences. Genome Informatics Series. 2003;14:84–93.
  18. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman & Hall; 1989.
  19. Park MY, Hastie T, Tibshirani R. Averaged Gene Expressions for Regression. Biostatistics. 2007;8:212–227. doi: 10.1093/biostatistics/kxl002.
  20. Shen X, Huang H-C. Grouping Pursuit through a Regularization Solution Surface. Journal of the American Statistical Association. 2010;105:727–739. doi: 10.1198/jasa.2010.tm09380.
  21. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Ser B. 1996;58:267–288.
  22. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and Smoothness via the Fused Lasso. Journal of the Royal Statistical Society, Ser B. 2005;67:91–108.
  23. Wang L, Kim Y, Li R. Calibrating Nonconvex Penalized Regression in Ultra-high Dimension. The Annals of Statistics. (in press) doi: 10.1214/13-AOS1159.
  24. Yang S, Yuan L, Lai Y-C, Shen X, Wonka P, Ye J. Feature Grouping and Selection over an Undirected Graph. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM. 2012. pp. 922–930.
  25. Yang Y, He X. Bayesian Empirical Likelihood for Quantile Regression. The Annals of Statistics. 2012;40:1102–1131.
  26. Zhang C-H. Nearly Unbiased Variable Selection under Minimax Concave Penalty. The Annals of Statistics. 2010;38:894–942.
  27. Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
  28. Zhu Y, Shen X, Pan W. Simultaneous Grouping Pursuit and Feature Selection over an Undirected Graph. Journal of the American Statistical Association. (in press) doi: 10.1080/01621459.2013.770704.
  29. Zou H, Li R. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models (with discussion). The Annals of Statistics. 2008;36:1509–1566. doi: 10.1214/009053607000000802.
