Abstract
This paper explores the homogeneity of coefficients in high-dimensional regression, which extends the concept of sparsity and is more general and suitable for many applications. Homogeneity arises when regression coefficients corresponding to neighboring geographical regions or a similar cluster of covariates are expected to be approximately the same. Sparsity corresponds to a special case of homogeneity with a large cluster of coefficients equal to the known atom zero. In this article, we propose a new method called clustering algorithm in regression via data-driven segmentation (CARDS) to explore homogeneity. New mathematical results are provided on the gain that can be achieved by exploring homogeneity. Statistical properties of two versions of CARDS are analyzed. In particular, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy for homogeneous parameters than that without homogeneity exploration. When our methods are combined with sparsity exploration, further efficiency can be achieved beyond the exploration of sparsity alone. This provides additional insights into the power of exploring low-dimensional structures in high-dimensional regression: homogeneity and sparsity. Our results also shed light on the properties of the fused Lasso. The newly developed method is further illustrated by simulation studies and applications to real data. Supplementary materials for this article are available online.
Keywords: clustering, homogeneity, sparsity
1 Introduction
Driven by applications in genomics, image processing, etc., high dimensionality has become one of the major themes in statistics. See Bühlmann and van de Geer (2011) and references therein for an overview of recent developments in this area. To overcome the difficulty of fitting high-dimensional models, one usually assumes that the true parameters lie in a low-dimensional subspace. For example, many papers focus on sparsity, i.e., only a small fraction of coefficients are nonzero (Chen, Donoho, and Saunders 1998; Tibshirani 1996). In this article, we consider a more general type of low-dimensional structure: homogeneity, i.e., the regression coefficients share the same values within their unknown clusters. A motivating example is gene network analysis, where it is assumed that genes cluster into groups which play similar roles in molecular processes (Kim and Xing 2009; Li and Li 2010). This can be modeled as a linear regression problem with groups of homogeneous coefficients. Similarly, in diagnostic lab tests, one often counts the number of positive results in a battery of medical tests, which implicitly assumes that their regression coefficients (impacts) in the joint model are approximately the same. In spatial-temporal studies, it is not unreasonable to assume that the dynamics of neighboring geographical regions are similar, namely, their regression coefficients are clustered (Fan, Lv, and Qi 2011; Huang, Hsu, Theobald, and Breidt 2010). In the same vein, financial returns of similar sectors of industry share similar loadings on risk factors.
Homogeneity is a more general assumption than sparsity, where the latter can be viewed as a special case of the former with a large group of 0-value coefficients. In addition, the atom 0 is known to data analysts. One advantage of assuming homogeneity rather than sparsity is that it enables us to possibly select more than n variables (n is the sample size). It is well known that the sparsity-based techniques, such as the lasso, can select at most n variables. Moreover, identifying the homogeneous groups naturally provides a structure in the covariates, which can be helpful in scientific discoveries.
Regression under the homogeneity setting has been previously studied in the literature. Park, Hastie, and Tibshirani (2007) propose a two-step method. Their method performs hierarchical clustering on the predictors, cuts the obtained dendrogram at an appropriate level, and treats the cluster averages as new predictors. The fused lasso (Friedman, Hastie, Höfling, and Tibshirani 2007; Tibshirani, Saunders, Rosset, Zhu, and Knight 2005) can also be regarded as an effort of exploring homogeneity, with the assistance of neighborhoods defined according to either time or location. In this sense, our newly proposed methods are different since we do not know such a neighborhood a priori. The clustering of homogeneous coefficients is completely data-driven. For example, in the fused Lasso, where a complete ordering of the covariates is given, Tibshirani et al. (2005) use the L1 penalty to penalize the pairwise differences of adjacent coordinates; in the case without a complete ordering, they suggest penalizing the pairs of ‘neighboring’ nodes in the sense of a general distance measure. Bondell and Reich (2008) propose the method OSCAR where a special octagonal shrinkage penalty is applied to each pair of coordinates to promote equal-value solutions. Shen and Huang (2010) develop an algorithm called Grouping Pursuit, where they use the truncated L1 penalty to penalize the pairwise differences for all pairs of coordinates. In an extension, Zhu, Shen, and Pan (in press) consider simultaneous grouping pursuit and feature selection by including additional truncated L1 penalties on the individual coefficients. Yang et al. (2012) explore simultaneous feature grouping and selection with the assistance of an undirected graph by penalizing the pairwise difference for each pair of coordinates that are connected by an edge in the graph. All the aforementioned methods either depend on a known ordering or graph of the covariates, which is sometimes not available, or use exhaustive pairwise penalties, which increase the computational complexity. Yang and He (2012) consider the homogeneity across coefficients of different percentile levels in quantile regression, and propose a Bayesian framework by using shrinkage priors to promote homogeneity. Although similar ideas may be applied to regression models, their settings are very different from ours, and there are no existing results on feature grouping for their method.
In this article, we propose a new method called Clustering Algorithm in Regression via Data-driven Segmentation (CARDS) to explore homogeneity. The main idea of CARDS is to take advantage of a preliminary estimate obtained without the homogeneity structure and to further shrink towards each other those coefficients whose preliminary estimates are close. In the basic version of CARDS, we first build an ordering of the covariates from the preliminary estimate and then run penalized least squares with fused penalties in the new ordering. The number of penalty terms is only (p − 1), compared to p(p − 1)/2 in the exhaustive pairwise penalties. On the other hand, an advanced version of CARDS builds an "ordered segmentation" on the covariates, which can be viewed as a generalized ordering, and imposes "hybrid pairwise penalties", which can be viewed as a generalization of fused penalties. This version of CARDS better tolerates possible misorderings in the preliminary estimate and is thus more robust. Compared with other existing methods for homogeneity exploration, CARDS can successfully deal with the case of unordered covariates. At the same time, it avoids using exhaustive pairwise penalties and can be computationally more efficient than the Grouping Pursuit and OSCAR.
We study CARDS in detail by providing theoretical analysis. Our analysis reveals that the sum of squared errors of the estimated coefficients is Op(K/n), where K is the number of true homogeneous groups. Therefore, the smaller the number of true groups, the better the precision that can be achieved. In particular, when K = p, there is no homogeneity to explore and the result reduces to the case without grouping. Moreover, in order to exactly recover the true groups with high probability, the minimum signal strength (the gap between different groups) needs to be of the order , where the |Ak|'s are the sizes of the true groups. In addition, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy than that without homogeneity exploration. Furthermore, our results can be combined with sparsity-based results to provide additional insights into the power of exploring low-dimensional structure in high-dimensional regression: homogeneity and sparsity. As a byproduct, our analysis of the basic version of CARDS also establishes a framework for analyzing fused-type penalties, which is new to our knowledge.
Throughout this paper, we consider the following linear regression setting
y = Xβ0 + ε, (1)
where X = (x1, · · ·, xp) is an n × p design matrix, y = (y1, · · ·, yn)T is an n × 1 vector of responses, β0 = (β01, · · ·, β0p)T denotes the true parameters of interest, and ε = (ε1, · · ·, εn)T contains independently and identically distributed noises with E(εi) = 0 and Var(εi) = σ². We assume further that there is a partition of {1, 2, · · ·, p}, denoted as 𝒜 = (A0, A1, · · ·, AK), such that
β0j = β0A,k for all j ∈ Ak, 0 ≤ k ≤ K, (2)
where β0A,k is the common value shared by all regression coefficients whose indices are in Ak. By default, β0A,0 = 0, so A0 is the group of 0-value coefficients. This allows us to explore homogeneity and sparsity simultaneously. Write β0A = (β0A,1, · · ·, β0A,K)T. Without loss of generality, we assume β0A,1 < β0A,2 < · · · < β0A,K.
Our theory and methods are stated for the standard least-squares problem although they can be adapted to other more sophisticated models. For example, when forecasting housing appreciation in the United States (Fan et al. 2011), one builds the spatial-temporal model
Yit = XitTβi + εit, (3)
in which i indicates a spatial location and t indicates time. It is expected that the βi's are approximately the same for neighboring zip codes i, and this type of homogeneity can be explored in a similar fashion. Similarly, when Yit represents the return of a stock and Xit = Xt stands for common market risk factors, one can assume a certain degree of homogeneity within each sector of industry; namely, the factor loading vector βi is approximately the same for stocks belonging to the same sector of industry.
Throughout this paper, ℝ denotes the set of real numbers and ℝp denotes the p-dimensional real Euclidean space. For any a, b ∈ ℝ, a ∨ b denotes the maximum between a and b. For any positive sequences {an} and {bn}, we write an ≫ bn if an/bn → ∞ as n → ∞. Given 1 ≤ q < ∞, for any vector x, ||x||q = (Σj |xj|q)1/q denotes the Lq-norm of x and ||x||∞ = maxj{|xj|}. For any matrix M, ||M||q = maxx:||x||q=1 ||Mx||q denotes the matrix Lq-norm of M. In particular, ||M||∞ is the maximum absolute row sum of M. We omit the subscript q when q = 2. ||M||max = maxi,j{|Mij|} denotes the matrix max norm. When M is symmetric, λmax(M) and λmin(M) denote the maximum and minimum eigenvalues of M, respectively.
The rest of the paper is organized as follows. Section 2 describes CARDS, including the basic, advanced and shrinkage versions. Section 3 studies theoretical properties of the basic version of CARDS, and Section 4 analyzes the advanced and shrinkage versions. Sections 5 and 6 present the results of simulation studies and real data analysis, respectively. Some concluding remarks are given in Section 7. Proofs can be found in the Appendix and supplemental materials.
2 CARDS: a data-driven pairwise shrinkage procedure
2.1 Basic version of CARDS
Without considering the homogeneity assumption (2), there are many methods available for fitting model (1). Let β̃ be such a preliminary estimator. A simple idea to generate homogeneity is as follows: first, rearrange the coefficients in β̃ in the ascending order; second, group together those adjacent indices whose coefficients in β̃ are close to each other; finally, force indices in each estimated group to share a common coefficient and refit model (1). A problem of this naive procedure is how to group the indices. Alternatively, we can run a penalized least squares to simultaneously extract the grouping structure and estimate coefficients. To shrink coefficients of adjacent indices (after reordering) towards homogeneity, we can add fused penalties, i.e., {|βi+1 − βi|, i = 1, · · ·, p − 1} are penalized. This leads to the following two-stage procedure:
- Preordering: Construct the rank statistics {τ(j) : 1 ≤ j ≤ p} such that β̃τ(j) is the j-th smallest value in {β̃i, 1 ≤ i ≤ p}, i.e.,
β̃τ(1) ≤ β̃τ(2) ≤ · · · ≤ β̃τ(p). (4)
- Estimation: Given a folded concave penalty function pλ(·) (Fan and Li 2001) with a regularization parameter λ, the final estimate is given by
β̂ = argminβ { (2n)−1||y − Xβ||² + Σ1≤j≤p−1 pλ(|βτ(j+1) − βτ(j)|) }. (5)
We call this two-stage procedure bCARDS (basic CARDS).
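As a concrete illustration, the following minimal sketch implements the two stages of bCARDS for a small problem: it ranks an OLS preliminary estimate and then numerically minimizes an SCAD-penalized least squares of the form (5). The normalization of the loss, the smoothing of the absolute value, and the use of a generic quasi-Newton optimizer (instead of the LLA or CCCP algorithms discussed below) are illustrative choices for this sketch, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def scad(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) of Fan and Li (2001), evaluated at |t|."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 (a + 1) * lam ** 2 / 2))

def bcards(X, y, lam, a=3.7, eps=1e-6):
    """Basic CARDS: rank OLS coefficients, then fuse coefficients adjacent in that rank."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # preliminary estimator
    tau = np.argsort(beta_ols)                          # rank mapping tau

    def objective(beta):
        gaps = beta[tau[1:]] - beta[tau[:-1]]           # adjacent differences in the new order
        smooth_abs = np.sqrt(gaps ** 2 + eps)           # smooth |.| so a gradient method applies
        return 0.5 * np.sum((y - X @ beta) ** 2) / n + np.sum(scad(smooth_abs, lam, a))

    res = minimize(objective, beta_ols, method="L-BFGS-B")
    return res.x, tau

# toy usage: two homogeneous groups of coefficients
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta0 = np.r_[np.full(5, 1.0), np.full(5, -1.0)]
y = X @ beta0 + rng.standard_normal(100)
beta_hat, _ = bcards(X, y, lam=0.5)
print(np.round(beta_hat, 2))
```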
At the first stage, bCARDS establishes a data-driven rank mapping τ(·) based on the preliminary estimator β̃. At the second stage, only “adjacent” coefficient pairs in the order τ are penalized, resulting in only (p −1) penalty terms in total. In addition, (5) does not require that βτ(j) ≤ βτ(j+1). This allows coordinates in β̂ to have a different order of increasing values from that in β̃.
With an appropriately large tuning parameter λ, β̂ is a piecewise constant vector in the order τ and consequently its elements have homogeneous groups. In Section 3, we shall show that, if τ is consistent with the order of β0, that is,
β0τ(1) ≤ β0τ(2) ≤ · · · ≤ β0τ(p), (6)
then under some regularity conditions, β̂ can consistently estimate the true coefficient groups of β0 with high probability.
When pλ(·) is a folded-concave penalty function (e.g., SCAD (Fan and Li 2001) or MCP (Zhang 2010)), (5) is a non-convex optimization problem. It is generally challenging to find the global minimizer. The local linear approximation (LLA) algorithm can be applied to find a local minimizer for any fixed initial solution; see Zou and Li (2008), Fan, Xue, and Zou (2012) and references therein for details. The concave-convex procedure (CCCP) can also be applied to produce a local minimizer; see Kim, Choi, and Oh (2008) and Wang, Kim, and Li (in press) for a detailed explanation of CCCP.
2.2 Advanced version of CARDS
To guarantee the success of bCARDS, (6) is an essential condition. It requires that the rank mapping τ never ranks a coordinate with a larger true coefficient ahead of one with a smaller true coefficient. This imposes fairly strong conditions on the preliminary estimator β̃. For example, (6) can be violated if ||β̃ − β0||∞ is larger than the minimum gap between groups. To relax such a restrictive requirement, we now introduce an advanced version of CARDS, where the main idea is to use less information from β̃ and to add more penalty terms in (5).
We first introduce the ordered segmentation, which can be viewed as a generalized ordering. Note that each rank mapping τ in bCARDS actually defines a partition of {1, · · ·, p} into p disjoint sets B1, · · ·, Bp with Bj = {τ(j)} being a singleton. Similarly, we may divide {1, · · ·, p} into L(≤p) disjoint sets B1, · · ·, BL, where Bl’s are not necessarily singletons. We call such Bl’s segments. The segments B1, · · ·, BL are ordered, but the ordering of coordinates within each segment is not defined. This is similar to letter grades assigned to a course. A formal definition is as follows:
Definition 2.1
For an integer 1 ≤ L ≤ p, the mapping ϒ : {1, · · ·, p} → {1, · · ·, L} is called an ordered segmentation if the sets Bl ≡ {1 ≤ j ≤ p : ϒ(j) = l}, 1 ≤ l ≤ L, form a partition of {1, · · ·, p}.
When L = p, ϒ is a one-to-one mapping and it defines a complete ordering.
Note that, in the basic version of CARDS, the preliminary estimator β̃ produces a complete rank mapping τ. Now in the advanced version of CARDS, instead of extracting a complete ordering, we only extract an ordered segmentation ϒ from β̃. The analogy is grading an exam: overall score rank (percentile rank) versus letter grade. Let δ > 0 be a predetermined parameter. First, obtain the rank mapping τ as in (4) and find all indices i2 < i3 < · · · < iL at which the sorted coefficients have gaps larger than δ, i.e., β̃τ(j) − β̃τ(j−1) > δ exactly when j ∈ {i2, · · ·, iL}. Then, construct the segments
Bl = {τ(il), τ(il + 1), · · ·, τ(il+1 − 1)}, l = 1, · · ·, L, (7)
where i1 = 1 and iL+1 = p + 1. This process is indeed similar to the letter grades that we assign to a course. The intuition behind this construction is that when β̃τ(j+1) ≤ β̃τ(j) + δ, i.e., the estimated coefficients of two "adjacent coordinates" differ by only a small amount, we do not trust the ordering between them and group them into the same segment. Compared to the complete ordering τ, the ordered segments {B1, · · ·, BL} utilize less information from β̃ and hence are less sensitive to the estimation error in β̃.
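The segmentation step (7) is straightforward to implement. A minimal sketch is given below, splitting the sorted preliminary coefficients wherever two adjacent values differ by more than δ; the function and variable names are illustrative.

```python
import numpy as np

def ordered_segmentation(beta_tilde, delta):
    """Construct the ordered segments B_1, ..., B_L of (7): a new segment starts
    whenever two adjacent sorted preliminary coefficients differ by more than delta."""
    tau = np.argsort(beta_tilde)            # rank mapping tau
    vals = beta_tilde[tau]
    segments, current = [], [int(tau[0])]
    for j in range(1, len(tau)):
        if vals[j] - vals[j - 1] > delta:   # gap larger than delta: close the segment
            segments.append(current)
            current = []
        current.append(int(tau[j]))
    segments.append(current)
    return segments

# delta = 0 recovers the complete ordering used by bCARDS; larger delta merges close coefficients
print(ordered_segmentation(np.array([0.1, 1.05, 0.15, 1.0, 2.3]), delta=0.3))
# [[0, 2], [3, 1], [4]]
```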
Given an ordered segmentation ϒ, we wish to take advantage of the order of segments B1, · · ·, BL and at the same time allow flexibility of order shuffling within each segment. Towards this goal, we introduce the hybrid pairwise penalty.
Definition 2.2
Given a penalty function pλ(·) and tuning parameters λ1 and λ2, the hybrid pairwise penalty corresponding to an ordered segmentation ϒ is
𝒫ϒ(β; λ1, λ2) = Σ1≤l≤L−1 Σi∈Bl, j∈Bl+1 pλ1(|βi − βj|) + Σ1≤l≤L Σi,j∈Bl, i<j pλ2(|βi − βj|). (8)
In (8), we call the first part between-segment penalty and the second part within-segment penalty. The within-segment penalty penalizes all pairs of indices in each segment, hence, it does not rely on any ordering within the segment. The between-segment penalty penalizes pairs of indices from two adjacent segments, and it can be viewed as a “generalized” fused penalty on segments.
When L = p, each Bl is a singleton and (8) reduces to the fused penalty in (5). On the other hand, when L = 1, namely, no prior information about β, there is only one segment B1 = {1, · · ·, p}, and (8) reduces to the exhaustive pairwise penalty
Σ1≤i<j≤p pλ(|βi − βj|). (9)
It is also called the total variation penalty (Harchaoui and Lévy-Leduc 2010), and the case with pλ(·) being a truncated L1 penalty is studied in Shen and Huang (2010). Thus, the penalty (8) is a generalization of both the fused penalty and the total variation penalty, which explains the name “hybrid”.
The main motivation for introducing the hybrid pairwise penalty is to provide a set of intermediate versions between the fused penalty and the total variation penalty. When using pairwise penalties to promote homogeneity, we need to penalize "enough" pairs to guarantee that all true groups can be exactly recovered when the signal-to-noise ratio is sufficiently large. Given a consistent ordering, the fused penalty contains "just enough" pairs; but when the ordering is inconsistent, we have to penalize more pairs to achieve the aforementioned exact group recovery (see Section 2.3 for a numerical example). However, it may not be a good choice to include all pairs, i.e., to use the total variation penalty, as the large number of redundant pairs can result in statistical and computational inefficiency. The hybrid penalty is designed to include "just enough" pairs, adapting to the partial ordering information available in an ordered segmentation.
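One way to see how many pairs the hybrid penalty involves is to enumerate them. The helper below is an illustrative sketch that lists the between-segment and within-segment pairs for a given ordered segmentation; with L = p it returns the p − 1 fused pairs, and with L = 1 it returns all p(p − 1)/2 pairs.

```python
from itertools import combinations, product

def hybrid_pairs(segments):
    """Return the index pairs penalized by the hybrid pairwise penalty (8).

    segments: list of lists B_1, ..., B_L (an ordered segmentation).
    """
    between = [(i, j) for l in range(len(segments) - 1)
               for i, j in product(segments[l], segments[l + 1])]
    within = [pair for B in segments for pair in combinations(B, 2)]
    return between, within

p = 6
fused = hybrid_pairs([[j] for j in range(p)])        # L = p: reduces to the fused penalty
exhaustive = hybrid_pairs([list(range(p))])          # L = 1: total variation penalty
print(len(fused[0]) + len(fused[1]))                 # p - 1 = 5 pairs
print(len(exhaustive[0]) + len(exhaustive[1]))       # p(p-1)/2 = 15 pairs
```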
Now, we discuss how the requirement (6) can be relaxed. If we let Bj = {τ(j)}, then (6) can be written equivalently as maxi∈Bj β0i ≤ mini∈Bj+1 β0i, for 1 ≤ j ≤ p − 1. This condition can be generalized to the case where the Bj's are not singletons.
Definition 2.3
An ordered segmentation ϒ preserves the order of β0 if maxj∈Bl β0j ≤ minj∈Bl+1 β0j, for l = 1, · · ·, L − 1.
In the construction (7), even if (6) does not hold, it is still possible that the resulting ϒ preserves the order of β0. Consider a toy example where p = 4 and β0τ(1) = β0τ(2) = β0τ(4) < β0τ(3), so that {τ(1), τ(2), τ(4)} and {τ(3)} are two true homogeneous groups in β0. Here τ ranks the coordinate τ(3) ahead of τ(4) based on the preliminary estimator β̃, but β0τ(3) > β0τ(4). So, τ fails to give a consistent ordering. However, as long as β̃τ(4) ≤ β̃τ(3) + δ, τ(3) and τ(4) are grouped into the same segment in (7), say, B1 = {τ(1), τ(2)} and B2 = {τ(3), τ(4)}. Then ϒ still preserves the order of β0 according to Definition 2.3.
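Definition 2.3 is easy to check numerically. The short function below is an illustrative sketch that tests whether a given segmentation preserves the order of β0, using the toy example above: the complete ordering fails, but the two-segment segmentation still preserves the order.

```python
def preserves_order(segments, beta0):
    """Check Definition 2.3: max of beta0 over B_l <= min of beta0 over B_{l+1} for every l."""
    return all(
        max(beta0[i] for i in segments[l]) <= min(beta0[i] for i in segments[l + 1])
        for l in range(len(segments) - 1)
    )

# toy example with p = 4: coordinates 0, 1, 3 share one true value, coordinate 2 has a larger one
beta0 = [1.0, 1.0, 2.0, 1.0]
print(preserves_order([[0], [1], [2], [3]], beta0))  # False: the complete ordering is inconsistent
print(preserves_order([[0, 1], [2, 3]], beta0))      # True: the segmentation preserves the order
```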
Now we formally introduce the advanced version of Clustering Algorithm in Regression via Data-driven Segmentation (aCARDS). It consists of three steps, where the first two steps are very similar to the way that we assign letter grades based on scores of an exam.
Preliminary Ranking: Given a preliminary estimate β̃, generate the rank statistics {τ(j) : 1 ≤ j ≤ p} such that β̃τ(1) ≤ β̃τ(2) ≤ · · · ≤ β̃τ(p).
Segmentation: For a tuning parameter δ > 0, construct an ordered segmentation ϒ as described in (7).
Estimation: For tuning parameters λ1 and λ2, compute the solution β̂ that minimizes
(2n)−1||y − Xβ||² + 𝒫ϒ(β; λ1, λ2). (10)
We call this procedure aCARDS (advanced CARDS).
In Section 4, we shall show that if ϒ preserves the order of β0, then under certain conditions, β̂ recovers the true homogeneous groups of β0 with high probability. Therefore, to guarantee the success of aCARDS, we need the existence of a δ > 0 for the preliminary estimator β̃ such that the associated ϒ preserves the order of β0. The above toy example shows that even when (6) fails, this condition can still hold. So aCARDS requires weaker conditions on β̃ than bCARDS. This is because the hybrid penalty contains penalty terms corresponding to more pairs of indices and hence is more robust to misordering in τ. In fact, bCARDS is a special case of aCARDS with δ = 0.
2.3 Comparison of two versions of CARDS
In this subsection, we first use a numerical example to compare bCARDS and aCARDS. It reveals how the ordered segmentation and hybrid pairwise penalty (8) play a role in aCARDS. We then discuss how to choose between two versions of CARDS in real data analysis.
We generate a data set with p = 40 predictors and n = 100 samples. The predictors are divided into two homogeneous groups, each of size 20. The true coefficient β0j takes one common value for all j in Group 1 and a different common value for all j in Group 2. X1, · · ·, Xn are generated independently and identically from Np(0, I), and yi = XiTβ0 + εi for 1 ≤ i ≤ n, where ε1, · · ·, εn are independent noises with a standard normal distribution. In aCARDS, we take the ordinary least squares (OLS) estimator as the preliminary estimator. Figure 1 plots the sorted OLS coefficients for a realization. The estimated rank is not exactly consistent with the order of β0 since the predictors τ(17) and τ(18), which belong to Group 2, are mistakenly ranked ahead of some predictors in Group 1. If we use only the fused penalty, the only terms that involve τ(17) and τ(18) are those pairing them with their immediate neighbors in the ranking, namely pλ(|βτ(17) − βτ(16)|), pλ(|βτ(18) − βτ(17)|) and pλ(|βτ(19) − βτ(18)|).
Figure 1.
Illustration of the hybrid pairwise penalty and the aCARDS algorithm. Top panel: OLS coefficients and the associated ordered segmentation. Red dots and blue crosses represent predictors from Group 1 and Group 2, respectively. Bottom panel: Solution paths of bCARDS (left) and aCARDS (right) under misranking. The ranking and ordered segmentation are the same as in the top panel. For bCARDS, the horizontal axis represents the parameter λ. For aCARDS, the horizontal axis represents the between-segment parameter λ1, where we fix the within-segment parameter λ2 = 0.02. The vertical axis represents the estimated 40 regression coefficients, which are shrunk towards homogeneity (as the figures do not start from the smallest λ, we do not see initial 40 regression coefficients).
There are no penalty terms to shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficients of other predictors in Group 2. Now, suppose that we extract an ordered segmentation from the OLS coefficients by taking δ = 0.3; see Figure 1. Since it allows for arbitrary order reshuffling within the segment B4, this ordered segmentation preserves the order of β0, i.e., Definition 2.3 is satisfied. The hybrid pairwise penalty associated with this segmentation includes terms such as pλ1(|βτ(17) − βτ(23)|) and pλ1(|βτ(18) − βτ(23)|) between segments B4 and B5. So aCARDS will shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficient of τ(23), a predictor in Group 2. Moreover, there are terms such as pλ1(|βτ(23) − βτ(j)|), for predictors τ(j) in B6, between segments B5 and B6. So aCARDS will also shrink the coefficient of τ(23) towards being equal to the coefficients of other predictors in Group 2. Eventually, aCARDS will shrink the coefficients of τ(17) and τ(18) towards being equal to the coefficients of many other predictors in Group 2. This example explains how the ordered segmentation and hybrid penalty help overcome issues caused by misranking in the preliminary estimator.
To better illustrate the effects of fused penalty and hybrid penalty under misranking, we fix the estimated rank and ordered segmentation from above, and compute the solution paths of both bCARDS and aCARDS. Note that the penalty terms in both (5) and (8) are now fixed (hence we do not need the parameter δ in aCARDS). For bCARDS, we let λ vary. For aCARDS, we set the within-segment parameter λ2 = 0.02 and let the between-segment parameter λ1 vary. Figure 1 displays the solution paths. We see that although bCARDS does not include the true grouping in the solution path owing to misranking, aCARDS still achieves the true grouping, which is a benefit of the hybrid penalty.
In practical data analysis, we need not choose between the two versions of CARDS in advance; the tuning parameter selection process automatically tells us which version to use. This is because bCARDS is a special case of aCARDS with δ = 0. We only need to include 0 among the candidate values of δ and select δ in a data-driven manner (e.g., AIC, BIC and GCV). We call the resulting method CARDS, which involves a data-driven selection between bCARDS and aCARDS.
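The data-driven selection between bCARDS and aCARDS can be written as a simple loop over candidate values of δ. The sketch below is schematic: fit_acards stands for a hypothetical routine solving (10) for a given δ, and the BIC formula shown, with the number of distinct coefficient values as a rough degrees of freedom, is one common choice rather than necessarily the one used here.

```python
import numpy as np

def bic(y, X, beta_hat):
    """A generic BIC: n*log(RSS/n) + log(n) * (number of distinct coefficient values)."""
    n = len(y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    df = len(np.unique(np.round(beta_hat, 8)))   # distinct values as a rough degrees of freedom
    return n * np.log(rss / n) + np.log(n) * df

def select_cards(y, X, fit_acards, delta_grid=(0.0, 0.1, 0.2, 0.3)):
    """Pick delta (and hence bCARDS vs. aCARDS) by minimizing BIC; delta = 0 recovers bCARDS."""
    fits = {d: fit_acards(y, X, delta=d) for d in delta_grid}   # fit_acards: hypothetical solver of (10)
    best = min(fits, key=lambda d: bic(y, X, fits[d]))
    return best, fits[best]
```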
2.4 CARDS under sparsity
In applications, we may need to explore homogeneity and sparsity simultaneously. Often the preliminary estimator β̃ takes into account the sparsity, namely it is obtained with a penalized least-squares method (Fan and Li 2001; Tibshirani et al. 2005) or sure independence screening (Fan and Lv 2008). Suppose β̃ has the sure screening property, i.e., S0 ⊂ S̃ with high probability, where S̃ and S0 denote the support of β̃ and β0, respectively. We modify CARDS as follows: in the first two steps, using the non-zero elements of β̃, we can similarly construct a data-driven hybrid penalty only on coefficients of variables in S̃; in the third step, we fix β̂S̃c = 0 and obtain β̂S̃ by minimizing the following penalized least squares
(2n)−1||y − XS̃βS̃||² + 𝒫ϒ(βS̃; λ1, λ2) + Σj∈S̃ pλ(|βj|), (11)
where XS̃ is the submatrix of X restricted to columns in S̃. In (11), the second term is the hybrid penalty to encourage homogeneity among coefficients of variables already selected in β̃, and the third term is the element-wise penalty to help further filter out falsely selected variables. We call this modified version sCARDS (shrinkage CARDS).
3 Analysis of the basic CARDS
In this section, we analyze theoretical properties of bCARDS. Due to the limited space, we state the results here and only prove Theorems 3.1–3.3 in the Appendix, leaving the rest of the proofs to the online supplemental materials of this paper.
3.1 Heuristics
We first provide some heuristics on why taking advantage of the homogeneity helps reduce the estimation error ||β̂ − β0||. Consider an ideal case of orthogonal design XTX = nIp (necessarily p ≤ n). The ordinary least-squares estimator β̂ols = (XTX)−1XTy has the decomposition
β̂ols = β0 + n−1XTε.
It is clear by the square-root law that ||β̂ols − β0|| = Op(√(p/n)). Now, if there are K homogeneous groups in β0 and we know the true groups, the original model (1) can be rewritten as
y = XAβ0A + ε,
where β0A = (β0A,1, · · ·, β0A,K)T contains the distinct values in β0, and XA = (xA,1, · · ·, xA,K) is such that xA,k = Σj∈Ak xj. The corresponding ordinary least-squares estimator β̂A = (XATXA)−1XATy has the decomposition
β̂A,k = β0A,k + (n|Ak|)−1 Σj∈Ak xjTε, k = 1, · · ·, K. (12)
Here (n|Ak|)−1 Σj∈Ak xjTε is the noise averaged over group k. The oracle estimator β̂oracle is defined such that its j-th component equals β̂A,k for all j ∈ Ak. Then, by the square-root law,
|β̂A,k − β0A,k| = Op((n|Ak|)−1/2), k = 1, · · ·, K,
which immediately implies that ||β̂oracle − β0|| = Op(√(K/n)).
The surprises of the results are twofold. First, ||β̂oracle − β0|| has the rate of convergence √(K/n) instead of √(p/n). The point is that in (12) the noises are averaged within each group, thanks to exploiting homogeneity, and consequently each β0A,k is estimated more accurately. The second surprise is that the rate has nothing to do with the sizes of the true homogeneous groups. No matter whether we have K groups of equal size, or one dominating group with (K − 1) other small groups, the rate is always the same for the oracle estimator.
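The √(K/n) versus √(p/n) comparison can be checked quickly by simulation. The snippet below is a rough numerical illustration under an i.i.d. Gaussian design (rather than an exactly orthogonal one), comparing the estimation error of OLS with that of the oracle estimator that collapses columns within known groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 400, 100, 4
groups = np.repeat(np.arange(K), p // K)            # known group labels, equal sizes
beta0 = np.array([-2.0, -1.0, 1.0, 2.0])[groups]    # homogeneous true coefficients

err_ols, err_oracle = [], []
for _ in range(200):
    X = rng.standard_normal((n, p))
    y = X @ beta0 + rng.standard_normal(n)
    # OLS without using homogeneity
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    # oracle: collapse columns within each true group, then least squares on K columns
    XA = np.column_stack([X[:, groups == k].sum(axis=1) for k in range(K)])
    bA = np.linalg.lstsq(XA, y, rcond=None)[0]
    b_oracle = bA[groups]                           # duplicate common values back to length p
    err_ols.append(np.linalg.norm(b_ols - beta0))
    err_oracle.append(np.linalg.norm(b_oracle - beta0))

print(np.mean(err_ols), np.sqrt(p / n))             # OLS error is roughly of order sqrt(p/n) = 0.5
print(np.mean(err_oracle), np.sqrt(K / n))          # oracle error is roughly of order sqrt(K/n) = 0.1
```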
3.2 Notations and regularity conditions
Let ℳ𝒜 be the subspace of ℝp defined by
ℳ𝒜 = {β ∈ ℝp : βi = βj, for any i, j ∈ Ak, 1 ≤ k ≤ K}.
For each β ∈ ℳ𝒜, we can always write Xβ = XAβA, where XA = (xA,1, · · ·, xA,K) is an n × K matrix such that its k-th column xA,k = Σj∈Ak xj, and βA is a K-dimensional vector with its k-th component βA,k being the common coefficient in group Ak. Define the matrix D = diag(|A1|1/2, · · ·, |AK|1/2). We introduce the following conditions on the design matrix X:
Condition 3.1
||xj||² = n, for 1 ≤ j ≤ p. The eigenvalues of the K × K matrix n−1D−1XATXAD−1 are bounded below by c1 > 0 and bounded above by c2 > 0.
In the case of orthogonal designs, i.e., XTX = nIp, the matrix n−1D−1XATXAD−1 simplifies to IK, and c1 = c2 = 1.
Let ρ(t) = λ−1pλ(t) and ρ̄(t) = ρ′(|t|)sgn(t). We assume that the penalty function pλ(·) satisfies the following condition.
Condition 3.2
pλ(·) is a symmetric function and it is non-decreasing and concave on [0, ∞). ρ′(t) exists and is continuous except for a finite number of t with ρ′(0+) = 1. There exists a constant a > 0 such that ρ(t) is a constant for all |t| ≥ aλ.
We also assume that the noise vector ε = (ε1, · · ·, εn)T has sub-Gaussian tails.
Condition 3.3
For any vector a ∈ ℝn and x > 0, P(|aTε| > ||a||x) ≤ 2 exp(−c3x²), where c3 is a positive constant.
Given the design matrix X, let Xk be its submatrix formed by including only columns in Ak, for 1 ≤ k ≤ K. For any vector v ∈ ℝq, let be the “deviation from centrality”. Define
(13) |
where λmax(·) denotes the largest eigenvalue operator. In the case of orthogonal design, σk = 1 and νk = 0. Let
bn = (1/2) min1≤k≤K−1 (β0A,k+1 − β0A,k) be half of the minimum gap between groups in β0, and λ = λn be the tuning parameter in the penalty function.
3.3 Properties of bCARDS
When the true groups A1, · · ·, AK are known, the oracle estimator is
β̂oracle = argminβ∈ℳ𝒜 (2n)−1||y − Xβ||².
Theorem 3.1
Suppose Conditions 3.1–3.3 hold, K = o(n), and the preliminary estimator β̃ generates a rank mapping τ that is consistent with the order of β0, i.e., (6) holds, with probability at least 1 − ε0. If the half minimum gap between groups, bn, satisfies that bn > aλn, where a is the same as that in Condition 3.2, and
(14) |
then with probability at least 1 − ε0 − n−1K − (n ∨ p)−1, the bCARDS objective function (5) has a strictly local minimizer β̂ such that
β̂ = β̂oracle,
||β̂ − β0|| = Op(√(K/n)).
Theorem 3.1 shows that bCARDS includes the oracle estimator as a strictly local minimizer, with overwhelming probability. This strong oracle property is a stronger result than the oracle property in Fan and Li (2001).
The objective function (5) in bCARDS is non-convex and may have multiple local minimizers. In practice, we apply the Local Linear Approximation algorithm (LLA) (Zou and Li 2008) to solve it: start from an initial solution β̂(0) = β̂initial; at step m, update the solution by
β̂(m) = argminβ { (2n)−1||y − Xβ||² + Σ1≤j≤p−1 p′λ(|β̂τ(j+1)(m−1) − β̂τ(j)(m−1)|) · |βτ(j+1) − βτ(j)| }.
Given β̂initial, this algorithm produces a unique sequence of estimators that converges to a certain local minimizer. Next, Theorem 3.2 shows that under certain conditions, the sequence of estimators produced by the LLA algorithm converges to the oracle estimator.
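For concreteness, a minimal sketch of the LLA iteration for the bCARDS objective is given below: each step replaces the SCAD penalty by its tangent line at the current iterate, yielding a weighted fused-L1 problem, which this sketch solves with a generic numerical optimizer after smoothing the absolute value. The solver choice and smoothing are illustrative assumptions, not part of the original algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan and Li 2001) at |t|, used as LLA weights."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def lla_bcards(X, y, tau, lam, beta_init, n_iter=3, eps=1e-8):
    """Local linear approximation for the bCARDS objective (5); tau is a ranking of 0..p-1."""
    n = X.shape[0]
    beta = beta_init.copy()
    for _ in range(n_iter):
        gaps = beta[tau[1:]] - beta[tau[:-1]]
        w = scad_deriv(gaps, lam)                  # weights frozen at the current iterate

        def surrogate(b):
            g = b[tau[1:]] - b[tau[:-1]]
            return 0.5 * np.sum((y - X @ b) ** 2) / n + np.sum(w * np.sqrt(g ** 2 + eps))

        beta = minimize(surrogate, beta, method="L-BFGS-B").x
    return beta
```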
Theorem 3.2
Under the conditions of Theorem 3.1, suppose ρ′(λn) ≥ a0 for some constant a0 > 0, and that there exists an initial solution β̂initial of (5) satisfying ||β̂initial − β0||∞ ≤ λn/2. Then with probability at least 1 − ε0 − n−1K − (n ∨ p)−1, the LLA algorithm yields β̂oracle after one iteration, and it converges to β̂oracle after two iterations.
From Theorems 3.1 and 3.2, we conclude that bCARDS combined with the LLA algorithm yields the oracle estimator with overwhelming probability, provided that we have a good preliminary estimator β̃. Next, we discuss the choice of β̃.
Since we focus on dense problems in this section, the usual sparsity is not explicitly explored and the ordinary least squares estimator β̂ols = (XTX)−1XTy can be used as the preliminary estimator. The following theorem shows that it induces a rank consistent mapping with high probability.
Theorem 3.3
Suppose Condition 3.3 holds, p = O(nα) and , where 0 < α < 1 and c4 > 0 are constants. If where bn is the half of the minimum gap between groups in β0, then with probability at least 1 − O(n−α), the rank mapping τ generated from β̂ols is consistent with the order of β0.
When the rank mapping τ extracted from β̃ does not give a consistent order, i.e., (6) does not hold, the penalty in (5) is no longer a "correct" penalty for promoting the true grouping structure. There is no hope that local minimizers of (5) exactly recover the true groups. However, if there is not too much misranking in τ, it is still possible to control ||β̂ − β0||. Given a rank mapping τ, define
K*(τ) = #{1 ≤ j ≤ p − 1 : β0τ(j) ≠ β0τ(j+1)}.
It is the number of coefficient "jumps" in β0 under the order given by τ. These "jumps" partition the coordinates into consecutive subgroups (under the order τ), each being a subset of one true group. Although different subgroups may share the same true coefficient, any two consecutive subgroups have a gap in coefficient values. As a result, the above results apply to this subgrouping structure. The following theorem is a generalization and a direct application of the proof of Theorem 3.1 and its details are omitted.
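Counting the jumps K*(τ) induced by a preliminary ranking is straightforward; an illustrative one-liner is given below.

```python
def num_jumps(beta0, tau):
    """K*(tau): number of adjacent positions (in the order tau) whose true coefficients differ."""
    return sum(beta0[tau[j]] != beta0[tau[j + 1]] for j in range(len(tau) - 1))

# four true groups: a consistent ranking gives 3 jumps, while misranking can only increase the count
beta0 = [-2, -2, -1, -1, 1, 1, 2, 2]
print(num_jumps(beta0, [0, 1, 2, 3, 4, 5, 6, 7]))   # 3
print(num_jumps(beta0, [0, 4, 1, 2, 3, 5, 6, 7]))   # 5
```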
Theorem 3.4
Suppose Conditions 3.1–3.3 hold, K*(τ) = o(n), the half minimum gap bn > aλn, and λn satisfies (14) with K replaced by K*(τ). Then with probability at least 1 − ε0 − n−1K − (n ∨ p)−1, the bCARDS objective function (5) has a strictly local minimizer β̂ such that ||β̂ − β0|| = Op(√(K*(τ)/n)).
3.4 bCARDS with the L1 penalty
In the bCARDS formulation (5), ρ(t) can also be the L1 penalty function ρ(t) = |t|. It can be useful for getting the initial solution β̂initial for the LLA algorithm. However, ρ(t) = |t| does not satisfy Condition 3.2. Hence, Theorem 3.1 does not apply and the associated properties require additional study.
We first relax the requirement that τ is consistent with the order of β0. Instead, we consider the case that “τ is consistent with groups in β0”: there exists a permutation μ on {1, · · ·, K} and 1 = i1 < i2 < · · · < iK < iK+1 = p + 1 such that for k = 1, · · ·, K,
{τ(j) : ik ≤ j ≤ ik+1 − 1} = Aμ(k). (15)
When μ is the identity permutation, i.e., μ(k) = k, (15) is equivalent to (6) and τ is consistent with the order of β0. Under the condition (15), recovering the true groups is equivalent to locating the coefficient jumps in β0 under the order given by τ, and these jumps can have positive or negative magnitudes.
To guarantee the exact recovery of the jumps, we need a joint condition on X and β0, which is in the same spirit as the "irrepresentability" condition in Zhao and Yu (2006) but is specifically designed for the homogeneity setting. For notational simplicity, we change the indices of the groups so that μ(k) = k for all k. Note that the ordering β0A,1 < β0A,2 < · · · < β0A,K does not hold with these new group indices.
For k = 1, · · ·, K − 1, write . Define the K-dimensional vector d0 by and
Here d0 is the adjacent difference of the sign vector of jumps in β0. For example, suppose K = 4 and the signs of the three consecutive jumps between groups alternate. Then d0 = (1, −2, 2, −1). Also, define the p-dimensional vector
In the case of orthogonal design XTX = nIp, b0 ∈
and it has the form
for j ∈ Ak. For each j ∈ Ak, let
Namely, and contain indices in group k that are ranked above and below τ(j), respectively. Let be the proportion of indices in group k that are ranked above τ(j). Denote by the average of elements in b0 over the indices in , and the average over the indices in .
Condition 3.4
There exists a positive sequence {ωn}, which can go to 0, such that for 1 ≤ k ≤ K, j ∈ Ak and j ≠ jk+1 − 1,
(16) |
In the case of orthogonal design XTX = nIp, b0 ∈
and b̄kj − bkj = 0 holds for all k and j ∈ Ak. Condition 3.4 reduces to
This is possible only when
(17) |
Noting that 1/|Ak| ≤ θkj ≤ 1 − 1/|Ak|, the associated ωn can be chosen as mink{1/|Ak|} when (17) holds.
Theorem 3.5
Suppose Conditions 3.1, 3.3 and 3.4 hold, K = o(n), and the preliminary estimator β̃ generates an order τ that is consistent with groups in β0, i.e., (15) holds, with probability at least 1 − ε0. If the half minimum gap bn and the tuning parameter λn satisfy
(18) |
then with probability at least 1 − ε0 − n−1K − (n ∨ p)−1, the bCARDS objective function (5) with ρ(t) = |t| has a unique global minimizer β̂ such that
β̂ ∈ ℳ𝒜;
, k = 1, · · ·, K − 1;
, where .
Compared to Theorem 3.1, there is an extra bias term in the estimation error ||β̂ − β0||. Moreover, to achieve the exact recovery, it requires Condition 3.4, which is restrictive. For example, in the case of orthogonal designs, it is required that all consecutive jumps (under the order given by τ) have opposite signs.
4 Analysis of the advanced CARDS
In this section, we analyze aCARDS and its variant sCARDS. The proofs can be found in the online supplemental materials.
4.1 Properties of aCARDS
To guarantee the success of aCARDS, a key condition is that the ordered segmentation ϒ = {B1, · · ·, BL} defined in (7) preserves the order of β0 in the sense of Definition 2.3. This allows the ranking of coefficients in β̃ to deviate from that in β0, but not too much: for some δ > 0, whenever β0i < β0j, β̃i ≤ β̃j + δ must hold.
For given A1, · · ·, AK and a segmentation ϒ = {B1, · · ·, BL}, define
Here 1/|Ak|2 ≤ ϕk ≤ |Ak| for 1 ≤ k ≤ K.
Theorem 4.1
Suppose Conditions 3.1–3.3 hold, K = o(n), and the preliminary estimator β̃ and the tuning parameter δn together generate an ordered segmentation ϒ that preserves the order of β0, with probability at least 1−ε0. If the half minimum gap bn and the tuning parameters (λ1n, λ2n) in (10) satisfy that bn > a max{λ1n, λ2n}, where a is the same as that in Condition 3.2, and
(19) |
and
(20) |
then with probability at least 1 − ε0 − n−1K − 2(n ∨ p)−1, the aCARDS objective function (10) has a strictly local minimizer β̂ such that
β̂ = β̂oracle,
||β̂ − β0|| = Op(√(K/n)).
Compared to Theorem 3.1, aCARDS not only imposes less restrictive conditions on β̃, but also requires a smaller minimum gap between true coefficients.
Next, we establish the asymptotic normality of the CARDS estimator. By Theorem 4.1, with overwhelming probability, aCARDS performs just like the oracle. In the oracle situation, for example, if p = 5 and there are three true groups {β1, β4}, {β2} and {β3, β5}, the accuracy of estimating β is the same as if we knew the model:
y = (x1 + x4)βA,1 + x2βA,2 + (x3 + x5)βA,3 + ε.
Theorem 4.2
Given a positive integer q, let {Bn} be a sequence of matrices such that Bn ∈ ℝq×K and σ²Bn(n−1XATXA)−1BnT → H, where H is a fixed q × q positive definite matrix. Suppose the conditions of Theorem 4.1 hold and let β̂ be the local minimizer of the aCARDS objective function (10) given in Theorem 4.1. Then
√n Bn(β̂A − β0A) →d N(0, H),
where β̂A is the K-dimensional vector of distinct values in β̂.
Theorem 4.2 states the asymptotic normality of β̂A. Note that β̂ duplicates elements in β̂A. We introduce the following corollary to compare the asymptotic covariance of β̂ to that of β̂ols.
Corollary 4.1
Suppose conditions of Theorems 4.1 and 4.2 hold. Let β̂ols and β̂ be the ordinary least squares estimator and CARDS estimator respectively. Let Mn be the p × K matrix with Mn(j, k) = (1/|Ak|1/2)1{j ∈ Ak}. For any sequence of p-dimensional vectors {an},
with ;
with .
Moreover, v1n ≥ v2n.
4.2 Properties of sCARDS
In Section 2.4, we introduced sCARDS to explore both homogeneity and sparsity. In sCARDS, given a preliminary estimator β̃ and a parameter δ, we extract segments B1, · · ·, BL such that B1 ∪ · · · ∪ BL = S̃, where S̃ is the support of β̃. Denote B0 = {j : β̃j = 0}. In this case, we say ϒ = {B0, B1, · · ·, BL} preserves the order of β0 if
B0 ⊂ A0 and maxj∈Bl β0j ≤ minj∈Bl+1 β0j, for l = 1, · · ·, L − 1. (21)
This implies that β̃ has the sure screening property, and on those preliminarily selected variables, the data-driven segments preserve the order of true coefficients.
Suppose there is a group of zero coefficients in β0, namely, 𝒜 = (A0, A1, · · ·, AK). Let ℳ0𝒜 be the subspace of ℝp defined by
ℳ0𝒜 = {β ∈ ℝp : βj = 0 for all j ∈ A0, and βi = βj for any i, j ∈ Ak, 1 ≤ k ≤ K}.
The oracle estimator is
β̂oracle = argminβ∈ℳ0𝒜 (2n)−1||y − Xβ||².
Denote by S the support of β0, and write s = |S| and s̃ = |S̃|. Define
to be the half minimum signal strength.
Theorem 4.3
Suppose Conditions 3.1–3.3 hold, s = o(n), log(p) = o(n), and the preliminary estimator β̃ and the tuning parameter δn together generate an ordered segmentation ϒ that preserves the order of β0, i.e., (21) holds, with probability at least 1 − ε0. If , bn and the tuning parameters (λ1n, λ2n, λn) satisfy that , bn > a max{λ1n, λ2n} and
Then with probability at least 1−ε0 − n−1K − (n ∨ s)−1 − (n ∨ s̃)−1, the sCARDS objective function (11) has a strictly local minimizer β̂ such that
β̂ = β̂oracle,
||β̂ − β0|| = Op(√(K/n)).
The preliminary estimator β̃ can be chosen, for example, as the SCAD estimator
β̃ = argminβ { (2n)−1||y − Xβ||² + Σ1≤j≤p pλ′(|βj|) }, (22)
where pλ′(·) is the SCAD penalty function (Fan and Li 2001). The following theorem is a direct result of Theorem 2 in Fan and Lv (2011), and the proof is omitted.
Theorem 4.4
Under Conditions 3.1 and 3.3, if s = o(n), and , then with probability at least 1 − o(1), there exists a strictly local minimizer β̂scad and δn = O(log(n)/n) which together generate a segmentation preserving the order of β0.
5 Simulation studies
We conduct numerical experiments to study the performance of the two versions of CARDS, bCARDS and aCARDS, and their variant sCARDS. The goal is to investigate the performance of CARDS under different situations. Experiments 1–4 are based on the linear regression model yi = XiTβ0 + εi, with Experiments 1–3 exploring homogeneity only and Experiment 4 exploring homogeneity and sparsity simultaneously. Experiment 5 is based on the spatial-temporal model Yit = XtTβi + εit.
In all experiments, {Xi : 1 ≤ i ≤ n} or {Xt : 1 ≤ t ≤ T} are generated independently and identically from the multivariate standard Gaussian distributions, and {εi : 1 ≤ i ≤ n} or {εit: 1 ≤ i ≤ p, 1 ≤ t ≤ T} are IID samples of N(0, 1). All results are based on 100 repetitions.
Experiment 1
Consider the linear regression model with p = 60 and n = 100. Predictors are divided into four groups, each of size 15. The true regression coefficients shared within each group are −2r, −r, r and 2r, respectively. Different values of r > 0 lead to various signal-to-noise ratios. Here we let r take values in {1, 0.8, 0.5}, corresponding to high, moderate and low signal-to-noise ratio, respectively.
We compare the performance of six different methods: Oracle, ordinary least squares (OLS), bCARDS, aCARDS, total variation (TV), fused Lasso (fLasso). Oracle is the least squares estimator knowing the true groups. aCARDS and bCARDS are described in Section 2; here we let the penalty function pλ(·) be the SCAD penalty with a = 3.7, and take the OLS estimator as the preliminary estimator. TV uses the exhaustive pairwise penalty (9), where pλ(·) is also the SCAD penalty with a = 3.7. The fused Lasso is based on an order generated from ranking the OLS coefficients. In this sense, the fused Lasso is essentially bCARDS with the Lasso penalty pλ(t) = λ|t|. Tuning parameters of all these methods are selected via Bayesian information criteria (BIC).
Prediction performance of different methods is evaluated in terms of the average model error over an independent test set of size 10,000. The model error is the prediction error minus the variance of the noise, and it better reflects the performance of statistical methods. In addition, to measure how close the estimated grouping structure is to the true one, we introduce the normalized mutual information (NMI), which is a common measure of similarity between clusterings (Fred and Jain 2003). Suppose ℂ = {C1, C2, · · ·} and ⅅ = {D1, D2, · · ·} are two sets of disjoint clusters of {1, · · ·, p}; define
NMI(ℂ, ⅅ) = 2I(ℂ; ⅅ)/[H(ℂ) + H(ⅅ)],
where I(ℂ; ⅅ) = Σk,j(|Ck ∩ Dj|/p) log(p|Ck ∩ Dj|/|Ck||Dj|) is the mutual information between ℂ and ⅅ, and H(ℂ) = −Σk(|Ck|/p) log(|Ck|/p) is the entropy of ℂ. NMI(ℂ, ⅅ) takes values in [0, 1], and a larger NMI implies the two groupings are closer. In particular, NMI = 1 means that the two groupings are exactly the same.
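The NMI criterion can be computed directly from the cluster sizes and overlaps. The small implementation below follows the definition above with the arithmetic-mean normalization of Fred and Jain (2003); the helper name is illustrative.

```python
import numpy as np

def nmi(C, D, p):
    """Normalized mutual information between two partitions C and D of {0, ..., p-1}."""
    def entropy(parts):
        q = np.array([len(s) / p for s in parts])
        return -np.sum(q * np.log(q))

    mi = 0.0
    for Ck in C:
        for Dj in D:
            overlap = len(set(Ck) & set(Dj))
            if overlap > 0:
                mi += (overlap / p) * np.log(p * overlap / (len(Ck) * len(Dj)))
    return mi / (0.5 * (entropy(C) + entropy(D)))

# identical partitions give NMI = 1
C = [list(range(0, 5)), list(range(5, 10))]
print(nmi(C, C, p=10))   # 1.0 (up to floating point rounding)
```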
Figure 2 shows boxplots of the average model error and NMI for the six methods. We observe that, except for the case of weak signals (r = 0.5), the two versions of CARDS outperform the other methods, since they lead to smaller average model errors and larger NMI. bCARDS performs especially well in achieving low model errors, even in the case r = 0.5. aCARDS has a better performance in terms of NMI, which indicates that it is better at recovering the true grouping structure. A possible reason that aCARDS does not perform as well as bCARDS in model errors is that aCARDS has more tuning parameters, and the selection of these tuning parameters in simulations may be suboptimal.
Figure 2.
The average model error and normalized mutual information in Experiment 1, where p = 60, n = 100, and there are four equal-size coefficient groups.
Experiment 2
The setting of this experiment is the same as in Experiment 1, except that the homogeneous groups have non-equal sizes. In Experiment 2a, the predictors are divided into four groups of size 1, 15, 15 and 29. The four distinct regression coefficients are −4r, −r, r and 2r, respectively. Here, the first group is a singleton. In Experiment 2b, there is one dominating group of size 50 with a common regression coefficient −2r. The other 10 predictors have the regression coefficients , respectively. In both subexperiments, we take r = 1 and 0.7 to represent two levels of signal-to-noise ratio. Besides the six methods compared in Experiment 1, we also implement a data-driven selection between bCARDS and aCARDS, as described in Section 2.3. In detail, we select the parameter δ via BIC (for each value of δ, the other parameters are also selected via BIC). The resulting method is called CARDS.
Figure 3 displays the results for Experiment 2a. It suggests that both bCARDS and aCARDS outperform the other methods, as does CARDS. Figure 4 displays the results for Experiment 2b. We see that the total variation and fused Lasso cannot improve upon the OLS estimator in terms of the average model error. bCARDS also performs unsatisfactorily, possibly due to misranking in the preliminary estimate. But aCARDS performs much better than OLS and the other methods. After a data-driven selection between bCARDS and aCARDS, the resulting method CARDS also outperforms the other methods.
Figure 3.
The average model error and normalized mutual information in Experiment 2a, where p = 60, n = 100, and there are four coefficient groups of size 1, 15, 15 and 29.
Figure 4.
The average model error and normalized mutual information in Experiment 2b, where p = 60, n = 100, and there is one dominating group of size 50.
Experiment 3
We use this experiment to investigate how the performance of the two versions of CARDS is affected by misranking in the preliminary estimate. The setting is the same as Experiment 1 and we fix r = 1 (so n = 100, p = 60, and there are 4 equal-size groups with true regression coefficients equal to −2, −1, 1 and 2, respectively). For each data set (X, y), we generate 11 different preliminary ranks as follows: for each σ in {1, 1.2, 1.4, · · ·, 3}, we generate z ~ N(Xβ0, σ²In) independently of y conditional on X, and then use the OLS estimator associated with z to get a preliminary rank. A larger value of σ tends to yield a "worse" preliminary rank. We use K* defined in Section 3.3 to quantify the level of misranking. Recall that K* is the total number of jumps in the true regression coefficients under the preliminary rank, and K* > 3 means there exists misordering in the preliminary rank. We generated 100 data sets and 11 preliminary ranks for each data set, so the results are based on 1100 repetitions. We compare the performance of bCARDS and aCARDS with that of the total variation (TV), where TV does not use any information from the preliminary rank. Figure 5 contains boxplots of the average model error as K* changes. First, we see that the two versions of CARDS are quite robust to the increase of K* and always outperform the total variation. Second, the model error of bCARDS increases as K* increases, which provides empirical evidence for the claim in Theorem 3.4 about the effect of misranking on bCARDS. Third, when K* is large, aCARDS performs better than bCARDS, due to the fact that the hybrid pairwise penalty can tolerate a higher level of misranking than the fused penalty.
Figure 5.
The average model error in Experiment 3. The horizontal axis represents K*, and “b”, “a” and “TV” are short for bCARDS, aCARDS and total variation.
Experiment 4
This experiment explores the homogeneity and sparsity simultaneously. Consider the linear regression model with p = 100 and n = 150. Among the 100 predictors, 60 are important ones and their coefficients are the same as those in Experiment 1. Besides, there are 40 unimportant predictors whose coefficients are all equal to 0.
We implemented sCARDS and compared its performance with three oracle estimators, Oracle, Oracle0 and OracleG, as well as ordinary least squares (OLS), SCAD, shrinkage total variation (sTV) and fused Lasso (fLasso). The three oracles are defined with different levels of prior information: the Oracle knows both the important predictors and the true groups; the Oracle0 only knows which predictors are important; and the OracleG only knows the true groups (it treats all unimportant predictors as one group with unknown coefficients). sCARDS is as described in Section 2.4; while implementing it, we take the SCAD estimator as the preliminary estimator. The shrinkage total variation extends TV by adding both an elementwise SCAD penalty and the exhaustive pairwise SCAD penalty. The fused Lasso used here has both the elementwise L1 penalty and the fused pairwise L1 penalty.
Figure 6 displays the boxplots of average model errors for r = 1 and r = 0.7. First, by comparing the model errors of the three oracles, we see a significant advantage of taking into account both homogeneity and sparsity over pure sparsity. Moreover, the results of Oracle0 and OracleG show that exploring the group structure is more important than exploring sparsity. Second, sCARDS achieves a smaller model error than OLS, SCAD and the fused Lasso. The performance of sCARDS and sTV in terms of model error is comparable. However, they are different in feature selection. Figure 7 contains frequency histograms of the number of falsely selected features for the two methods. In about 17% of the repetitions, sTV fails to shrink the coefficients of the 40 unimportant predictors to 0.
Figure 6.
The average model error and normalized mutual information in Experiment 4, where p = 100, n = 150, and there are 60 important predictors divided into four equal-size groups.
Figure 7.
Feature selection of sCARDS and shrinkage total variation (sTV) in Experiment 4 (left: r = 1; right: r = 0.7), where there are 40 unimportant predictors.
Experiment 5
We consider a special case of the spatial-temporal model (3), Yit = XtTβi + εit, i = 1, · · ·, p, i.e., the predictors are the same for all spatial locations. There are p = 100 different locations and k = 5 common predictors (so each βi has dimension 5). We assume only spatial homogeneity in the regression coefficients, that is, for each predictor j (j = 1, · · ·, 5), the coefficients {βi,j, 1 ≤ i ≤ 100} are divided into four groups of equal size 25, where coefficients in the same group share the same value. In simulations, we let
where {ωj = 0.1 × (j −1), 1 ≤ j ≤ 5} are location-independent constants. In this experiment, instead of varying the signal-to-noise ratio directly, we equivalently change T, the total number of time points.
We extend aCARDS to this model by adding the hybrid pairwise penalty on coefficients of the same predictor at different locations, and still call the method aCARDS. The total variation (TV) and fused Lasso (fLasso) can be extended to this model in a similar way. The Oracle is the maximum likelihood estimator which knows the true groups for each predictor. We aim to compare the performance of Oracle, OLS, aCARDS, total variation and fused Lasso.
Figure 8 displays the results. We see that aCARDS achieves significantly lower model errors than OLS, thanks to exploring homogeneity. Moreover, it performs better than the total variation and fused Lasso, particularly when T = 50, 80. aCARDS also estimates the true grouping structure well; when T = 50, 80, the normalized mutual information is larger than 0.95 in most repetitions.
Figure 8.
The average model error and normalized mutual information in Experiment 5, where the model is a spatial-temporal regression model with p = 100 locations and k = 5 common predictors.
6 Real data analysis
6.1 S&P500 returns
In this study, we fit a homogeneous Fama-French model for stock returns: Yit = αi + XtTβi + εit, where Xt contains the three Fama-French factors at time t, Yit is the excess return of stock i at time t, and εit are idiosyncratic noises. We collected daily returns of 410 stocks that were always included in the components of the S&P500 index in the period December 1, 2010 to December 1, 2011 (T = 254). We applied CARDS as in Experiment 5, except that the intercepts αi's were also penalized for sparsity. The sparsity of the αi's is supported by the capital asset pricing model (CAPM) and its extension, the multifactor pricing model, in financial econometric theories. The tuning parameters were chosen via generalized cross validation (GCV). Table 1 shows the number of fitted coefficient groups for the three factors and the number of non-zero intercepts. We then used daily returns of the same stocks in the period December 1, 2011 to July 1, 2012 (T = 146) to evaluate the prediction error. Let ŷit and yit be the fitted and observed excess returns of stock i at time t = 1, · · ·, 146, respectively. Define the discounted cumulative sum of squared estimation errors (cRSS) at time t as , where we take ρ = 0.95. Figure 9 shows the percentage improvement in cRSSt of the CARDS estimator over the OLS estimator. We see that CARDS achieves a smaller discounted cRSS compared to OLS at most time points, especially in the "very-close" and "far-away" future. The North American Industry Classification System (NAICS) classifies these 410 companies into 18 different industry sectors. Figure 10(a) shows the OLS coefficients on the factor "book-to-market ratio". We can see that stocks belonging to Sector 3 "Utilities" (29 stocks in total) have very close OLS coefficients, and 17 stocks in this sector were clustered into one group by the CARDS estimator. Figure 10(b) shows the percentage improvement in cRSSt only for stocks in this sector, where the improvement is more significant.
Table 1.
Number of groups in fitting the S&P500 data.
Fama-French factors | No. of coef. groups |
---|---|
“market return” | 41 |
“market capitalization” | 32 |
“book-to-market ratio” | 56 |
intercept | 60 |
Figure 9.
Comparison of the cumulative sum of squared prediction errors of the S&P500 data from December 1, 2011 to July 1, 2012. The vertical axis is the percent of relative errors between the predictions made by OLS and CARDS, defined by . The right panel is a zoom-in of the results for CARDS.
Figure 10.
(a) OLS coefficients on the "book-to-market ratio" factor. The x axis represents different sectors. (b) Percentage improvement of the discounted cumulative sum of squared estimation errors for stocks in the sector "Utilities" (Sector 2).
6.2 Polyadenylation signals
CARDS can be easily extended to more general settings such as generalized linear models (McCullagh and Nelder 1989), although we have focused on the linear regression model so far. In this subsection, we apply CARDS to a logistic regression example. This study tried to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analyzing features around them. The data set was first used in Legendre and Gautheret (2003) and later analyzed by Liu et al. (2003), and it is available at http://datam.i2r.a-star.edu.sg/datasets/krbd/SequenceData/Polya.html. There is one training data set and five testing data sets. To avoid any platform bias, we use the training data set only. It has 4418 observations, each with 170 predictors and a binary response. The binary response indicates whether a terminal sequence is classified as a "strong" or "weak" polyA site, and the predictors are features from the upstream (USE) and downstream (DSE) sequence elements. We randomly select 2000 observations to perform model estimation and use the rest to evaluate performance. Our numerical analysis consists of the following steps. Step 1 is to apply the L1-penalized logistic regression to these 2000 observations with all 170 predictors and use AIC to select an appropriate regularization parameter. In Step 2, we use the logistic regression coefficients obtained in Step 1 as our preliminary estimate and apply sCARDS accordingly. The average prediction error (and standard error in parentheses) over 40 random splits is reported in Table 2. We also report the average number of non-zero coefficient groups and the average number of selected features. It shows that sCARDS leads to a smaller prediction error when compared with the shrinkage total variation (sTV). In addition, sCARDS has fewer groups of non-zero coefficients but more selected features, which implies that we can include more predictors while fixing the degrees of freedom at a small value. Note that in this example, the fused Lasso (fLasso) has a similar performance to sCARDS. In Section 5, we remarked that the fused Lasso is essentially bCARDS with the lasso penalty pλ(t) = λ|t|.
Table 2.
Results of the PASes data.
 | sCARDS | SCAD | sTV | fLasso
---|---|---|---|---
Prediction Error | 0.2449 (0.0015) | 0.2458 (0.0014) | 0.2757 (0.0026) | 0.2445 (0.0010)
No. of non-zero coef. groups | 5.5000 | 21.6250 | 5.7500 | 5.5000
No. of selected features | 73.2750 | 21.6250 | 40.3500 | 86.8750
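A sketch of the two-step analysis described above: Step 1 fits an L1-penalized logistic regression and picks the penalty level by AIC; Step 2 then passes the resulting coefficients on as the preliminary estimate. The AIC convention (2k − 2 log-likelihood with k the number of nonzero coefficients) and the grid of penalty levels are our assumptions, and the sCARDS fitting routine appears only as a hypothetical placeholder, since it is not part of any standard library.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logistic_aic(X, y, C_grid=np.logspace(-3, 1, 30)):
    """Step 1: L1-penalized logistic regression with the penalty chosen by AIC.

    X is the predictor matrix and y a 0/1 response vector.
    """
    best_aic, best_model = np.inf, None
    for C in C_grid:
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X, y)
        p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        k = np.count_nonzero(model.coef_)         # number of nonzero coefficients
        aic = 2 * k - 2 * loglik
        if aic < best_aic:
            best_aic, best_model = aic, model
    return best_model

# Step 2 (schematic): the step-1 coefficients become the preliminary estimate
# for sCARDS; `fit_scards_logistic` is a hypothetical routine.
# prelim = l1_logistic_aic(X_train, y_train).coef_.ravel()
# beta_hat = fit_scards_logistic(X_train, y_train, prelim_estimate=prelim)
```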
7 Conclusion
In this paper, we explored the homogeneity of coefficients in high-dimensional regression. We proposed a new method, called Clustering Algorithm in Regression via Data-driven Segmentation (CARDS), to estimate regression coefficients and to detect homogeneous groups. The implementation of CARDS does not need any geographical information (neighborhoods, distances, graphs, etc.) a priori, which distinguishes it from other methods in similar settings and makes it more broadly applicable. A modification of CARDS, sCARDS, can be used to explore homogeneity and sparsity simultaneously. Our theoretical results show that better estimation accuracy can be achieved by exploring homogeneity. In particular, when the number of homogeneous groups is small, the power of exploring homogeneity and sparsity simultaneously is much larger than that of exploring sparsity alone, which is also confirmed in our simulation studies.
Methodologically, CARDS has two main innovations. First, it takes advantage of a preliminary estimate, from which it extracts either an estimated ranking or an estimated ordered segmentation. Second, it introduces the so-called “hybrid pairwise penalty” to adapt to the available partial ordering information. The hybrid pairwise penalty is not only robust to misordering but also avoids the statistical and computational inefficiency caused by penalizing too many pairs. These ideas for handling homogeneity apply to much broader settings than linear regression, provided the hybrid pairwise penalty is combined with an appropriate loss function. For example, CARDS can be extended to generalized linear models (GLM) when homogeneity is present.
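As an illustration of how a pairwise penalty can be driven by a preliminary estimate, the sketch below orders the coefficients by the preliminary ranking and applies a folded concave (SCAD) penalty to adjacent differences, i.e., a basic bCARDS-type construction in the sense of the remark in Section 6.2. The function names and the choice a = 3.7 are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) of Fan and Li (2001), applied elementwise."""
    t = np.abs(t)
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    return np.where(small, lam * t,
           np.where(mid, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    lam**2 * (a + 1) / 2))

def bcards_type_penalty(beta, prelim, lam):
    """Pairwise penalty on adjacent differences in the preliminary ranking.

    `prelim` is a preliminary estimate (e.g., OLS or ridge coefficients);
    its sort order defines the ranking tau, and the penalty is
    sum_j p_lambda(beta[tau(j+1)] - beta[tau(j)]).
    """
    tau = np.argsort(prelim)      # estimated ranking from the preliminary estimate
    diffs = np.diff(beta[tau])    # adjacent differences in that ranking
    return scad_penalty(diffs, lam).sum()
```

Replacing scad_penalty with the absolute value pλ(t) = λ|t| recovers the fused-Lasso-type penalty mentioned in Section 6.2.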
To promote homogeneity, CARDS takes advantage of a preliminary estimate. This idea can be generalized. Instead of extracting a complete ranking or an ordered segmentation, we may also apply clustering methods, such as the k-means algorithm or hierarchical clustering, to the coefficients of the preliminary estimate to help construct data-driven penalties and further promote homogeneity.
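A minimal sketch of that suggestion, assuming a preliminary coefficient vector is already available: cluster its entries with k-means and read off candidate homogeneous groups that could then guide the construction of data-driven penalties. The number of clusters, the variable names, and the OLS preliminary estimate in the usage comment are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_groups(prelim, n_groups=5, seed=0):
    """Cluster preliminary coefficient values into candidate homogeneous groups."""
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed)
    labels = km.fit_predict(prelim.reshape(-1, 1))   # cluster the 1-d coefficients
    return [np.flatnonzero(labels == g) for g in range(n_groups)]

# Example with a hypothetical preliminary estimate:
# prelim = np.linalg.lstsq(X, y, rcond=None)[0]
# groups = candidate_groups(prelim, n_groups=5)
```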
This paper only considers the case where predictors in one homogeneous group have equal coefficients. In a more general situation, coefficients of predictors in the same group are close but not exactly equal. The idea of data-driven pairwise penalties still applies, but instead of using the class of folded concave penalty functions, we may need to use penalty functions which are smooth at the origin, e.g., the L2 penalty function. Another possible approach is to use posterior-type estimators combined with, say, a Gaussian prior on the coefficients. These are beyond the scope of this paper and we leave them as future work.
Supplementary Material
Acknowledgments
The authors would like to thank the Editor Professor Xuming He, the anonymous Associate Editor and two referees for their very helpful comments and suggestions.
The authors were partially supported by the National Institute of General Medical Sciences of the National Institutes of Health through Grant Numbers R01-GM072611 and R01-GM100474 and National Science Foundation grant DMS-1206464.
The author was partially supported by NSF grants DMS-0905561 and DMS-1055210, and NIH Grant NIH/NCI R01 CA-149569.
A Proofs
A.1 Proof of Theorem 3.1
Introduce the mapping T : 𝓜 → ℝK, where 𝓜 ⊂ ℝp denotes the subspace of coefficient vectors whose coordinates are constant within each group Ak, and T(β) is the K-dimensional vector whose k-th coordinate equals the common value of βj for j ∈ Ak. Note that T is a bijection and T−1 is well-defined for any μ ∈ ℝK. Also, introduce the mapping T* : ℝp → ℝK, where the k-th coordinate of T*(β) is the average of βj over j ∈ Ak. We see that T* = T on 𝓜, and T−1 ∘ T* is the orthogonal projection from ℝp onto 𝓜. Denote μ0 = T(β0) and μ̂oracle = T(β̂oracle).
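Written out explicitly (our reading of the properties just stated, using the group-average form of T* and the subspace 𝓜, written \mathcal{M} below):

```latex
T(\beta) = (\mu_1, \ldots, \mu_K)^\top \ \text{ with } \mu_k = \beta_j \text{ for any } j \in A_k
\quad (\beta \in \mathcal{M}),
\qquad
T^*(\beta)_k = \frac{1}{|A_k|} \sum_{j \in A_k} \beta_j .
```

Thus T−1 ∘ T*(β) replaces every coordinate in group Ak by the group average, which is exactly the orthogonal projection of β onto 𝓜.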
Write Qn(β) = Ln(β) + Pn(β), where Ln denotes the loss term and Pn the penalty term of the CARDS criterion. For any μ ∈ ℝK, let L̃n(μ) = Ln(T−1(μ)) and define P̃n(μ) = Pn(T−1(μ)). Note that when τ is consistent with the order of β0, there exist 1 = j1 < j2 < · · · < jK < jK+1 = p + 1 such that Ak = {τ(jk), τ(jk + 1), · · ·, τ(jk+1 − 1)} for 1 ≤ k ≤ K. Consequently, Ln(β) = L̃n(μ) and Pn(β) = P̃n(μ) for any β ∈ 𝓜 with μ = T(β) ∈ ℝK.
In the first part of the proof, we derive a bound for β̂oracle − β0. By definition and direct calculations,
(23)
From Condition 3.1 and the Markov inequality, for any δ > 0,
(24)
Combining the above, the desired bound holds with probability at least 1 − δ for any δ > 0, which proves the claim of the first part.
Furthermore, we show a result that will be frequently used in later proofs:
(25)
Write the relevant term using vk = XAD−1ek, where ek is the unit vector with 1 in the k-th coordinate and 0 elsewhere. Observing that ||vk||2 is the k-th diagonal entry of the matrix D−1XATXAD−1, it follows from Condition 3.3 and the union bound that
(26)
Then (25) follows by combining (23) and (26).
In the second part of the proof, we show that β̂oracle is a strictly local minimizer of Qn(β) with probability at least 1 − ε0 − n−1K − (n ∨ p)−1. By assumption, there is an event E1, with P(E1c) ≤ ε0, such that over the event E1, τ is consistent with the order of β0. Consider a neighborhood 𝓑 of β0. By (25), there is an event E2, with P(E2c) ≤ n−1K, such that β̂oracle ∈ 𝓑 over the event E2.
For any β ∈ 𝓑, write β* for its orthogonal projection onto 𝓜. We aim to show the following two claims.
- (a) Over the event E1 ∩ E2,
Qn(β*) ≥ Qn(β̂oracle),   (27)
and the inequality is strict whenever β* ≠ β̂oracle.
- (b) There is an event E3, with P(E3c) ≤ (n ∨ p)−1, such that over the event E1 ∩ E2 ∩ E3, there exists a neighborhood 𝓗 of β̂oracle, with 𝓗 ⊂ 𝓑, such that
Qn(β) ≥ Qn(β*) for all β ∈ 𝓗,   (28)
and the inequality is strict whenever β ≠ β*.
Combining (a) and (b), Qn(β) ≥ Qn(β̂oracle) for any β ∈ 𝓗, a neighborhood of β̂oracle, and the inequality is strict whenever β ≠ β̂oracle. This proves that β̂oracle is a strictly local minimizer of Qn over the event E1 ∩ E2 ∩ E3.
It remains to show (a) and (b). Consider (a) first. We claim that for every β ∈ 𝓑, the vector μ = T*(β) satisfies
|μk+1 − μk| > aλn for k = 1, · · ·, K − 1.   (29)
This follows because each μk is close to the corresponding group value of β0, while the group values of β0 are well separated.
Using (29), we see that P̃n(T*(β)) takes the same value for all β ∈ 𝓑. Since T−1 ∘ T* is the orthogonal projection from ℝp onto 𝓜, for any β ∈ 𝓑,
(30)
In particular, noting that β̂oracle ∈ 𝓜 and that its orthogonal projection onto 𝓜 is itself, the above further implies
(31)
By definition and the fact that the Hessian of L̃n is positive definite, μ̂oracle is the unique global minimizer of L̃n. As a result,
L̃n(T*(β)) ≥ L̃n(μ̂oracle),   (32)
and the inequality is strict whenever T*(β) ≠ μ̂oracle, i.e., β* ≠ T−1(μ̂oracle) = β̂oracle. Combining (30)–(32) gives (a).
Second, consider (b). For a positive sequence {tn} to be determined, let 𝓗 = {β ∈ 𝓑 : ||β − β̂oracle|| ≤ tn}. Since β* is the orthogonal projection of β onto 𝓜, ||β − β*|| ≤ ||β − β′|| for any β′ ∈ 𝓜. In particular, ||β − β*|| ≤ ||β − β̂oracle||. As a result, to show (28), it suffices to show
(33)
and the inequality is strict whenever β ≠ β*.
To show (33), write μ = T*(β), so that β* = T−1(μ). By a Taylor expansion, Qn(β) − Qn(β*) can be decomposed into a loss part I1 and a penalty part I2, where βm is a point on the line segment between β and β*. Consider I2 first. Direct calculations yield (d/dt)pλ(|t|) = λρ̄(t), where ρ̄(t) = ρ′(|t|)sgn(t) and ρ(t) = λ−1pλ(t). Plugging this into I2 and rearranging the sum, we obtain
(34)
When τ(j) and τ(j + 1) belong to the same group, β*τ(j+1) − β*τ(j) = 0, and hence the sign of (βτ(j+1) − βτ(j)) − (β*τ(j+1) − β*τ(j)) is the same as the sign of (βτ(j+1) − βτ(j)) if neither of them is 0. In addition, recall that Ak = {τ(jk), τ(jk + 1), · · ·, τ(jk+1 − 1)} for all 1 ≤ k ≤ K, for some indices 1 = j1 < j2 < · · · < jK < jK+1 = p + 1. Combining the above, we can rewrite I2 as a sum of within-group terms and between-group terms.
First, since β0 ∈ 𝓜 and β* is the orthogonal projection of β onto 𝓜, ||β* − β0|| ≤ ||β − β0||. Hence, β ∈ 𝓑 implies β*, βm ∈ 𝓑. By repeating the proof of (29), we can show that the differences of βm across adjacent groups exceed aλn in absolute value for 2 ≤ k ≤ K, so the second (between-group) term in I2 disappears. Second, in the first (within-group) term of I2, since the within-group differences of βm are at most 2tn in absolute value, it follows by concavity that the corresponding derivative factors are bounded below by ρ′(2tn). Together, we have
(35)
Next, we simplify I1. Let z = z(βm) = XT(y − Xβm). For any fixed k and l such that τ(l) ∈ Ak and l ≠ jk+1 − 1, noting that the coordinates of β* are constant over i ∈ Ak, we can reexpress I1 as
(36)
where, for any vector v ∈ ℝp, wτ(l)(v) denotes the corresponding within-group linear contrast of the coordinates {vj : j ∈ Ak}.
We aim to bound |wτ(l)(z)|. Let η = XTX(β* − β0), ηm = XTX(βm − β*), and write z = XTε − η − ηm. First, wτ(l)(v) is a linear function of v. Second, since βm lies between β and β*, we have ||β* − βm|| ≤ ||β* − β|| ≤ tn. It follows that ||ηm|| ≤ λmax(XTX)tn. Moreover, |wτ(l)(v)| ≤ (|Ak|/n)||v||∞ ≤ (p/n)||v|| for all v. Combining the above yields
(37)
First, we bound the term wτ(l)(XTε). Let E3 be the event that
(38)
where we recall that σk is the maximum eigenvalue of n−1XTX restricted to the (Ak, Ak)-block. Given τ(l), we can express wτ(l)(XTε) as a contrast between the two sub-blocks of Ak split at the position of τ(l); write L1 and L2 for their sizes, so that |Ak| = L1 + L2. Using the fact that (a + b)2 ≤ 2(a2 + b2) for any real values a and b, and applying Condition 3.3 and the probability union bound,
(39)
Second, we bound the term wτ(l)(η). Observing that for any vector v, wτ(l)(v) = wτ(l)(v − v̄k1), where v̄k is the mean of {vj, j ∈ Ak}, we have |wτ(l)(η)| ≤ (|Ak|/n) maxj∈Ak|ηj − η̄k|. Since η = XTX(β* − β0) and β* − β0 ∈ 𝓜, we have maxj∈Ak|ηj − η̄k| ≤ nνk||β* − β0||, where νk is defined in (13). As a result,
(40)
where the last inequality holds because we consider β ∈ 𝓑 in (28), and ||β* − β0|| ≤ ||β − β0|| (noticing that β* is the orthogonal projection of β onto 𝓜).
Combining (36)–(40), we find that over the event E1 ∩ E2 ∩ E3,
(41)
where we have used condition (14) on λn.
From (35) and (41), over the event E1 ∩ E2 ∩ E3, Qn(β) − Qn(β*) admits a lower bound whose leading factor is λn/2 − gn(tn), where gn(tn) = n−1pλmax(XTX)tn − λn[1 − ρ′(2tn)]. Since ρ′(0+) = 1, gn(0+) = 0. So we can always choose tn sufficiently small to ensure |gn(tn)| < λn/2; consequently, the right-hand side is non-negative, and strictly positive when β is not constant within every group, i.e., β ≠ β*. This proves (b).
A.2 Proof of Theorem 3.2
First, we show that the LLA algorithm yields β̂oracle after one iteration. Let E1 be the event that the ranking τ is consistent with the order of β0, E2 the event that the bound (25) holds, and E3 the event that (38) holds. We have shown that P(E1 ∩ E2 ∩ E3) ≥ 1 − ε0 − n−1K − (n ∨ p)−1. It suffices to show that over the event E1 ∩ E2 ∩ E3, the LLA algorithm gives β̂oracle after the first iteration.
Let wj = ρ′(|β̂initial,τ(j+1) − β̂initial,τ(j)|) be the weight attached to the adjacent pair (τ(j), τ(j + 1)). At the first iteration, the algorithm minimizes the weighted criterion obtained by replacing each penalty term of Qn with its local linear approximation at β̂initial.
This is a convex function; hence it suffices to show that β̂oracle is a strictly local minimizer of it. Denote this weighted criterion by Q̄n(β). Using the same notation as in the proof of Theorem 3.1, for any β ∈ ℝp, write β* = T−1 ∘ T*(β) for its orthogonal projection onto 𝓜. For a sequence {tn} to be determined, consider the neighborhood of β̂oracle defined by 𝓗 = {β ∈ ℝp : ||β − β̂oracle|| ≤ tn}. It suffices to show that, for all β ∈ 𝓗,
Q̄n(β) ≥ Q̄n(β*) ≥ Q̄n(β̂oracle),   (42)
where the first inequality is strict whenever β ≠ β*, and the second inequality is also strict whenever β ≠ β̂oracle.
We first show the second inequality in (42). For τ(j) and τ(j + 1) in different groups, the corresponding coordinates of β0 are well separated under the conditions of the theorem. In addition, ||β̂initial − β0||∞ ≤ λn/2. Hence, |β̂initial,τ(j+1) − β̂initial,τ(j)| > aλn, and it follows that wj = 0. On the other hand, for τ(j) and τ(j + 1) in the same group, βτ(j+1) − βτ(j) = 0 whenever β ∈ 𝓜. Consequently, the weighted penalty vanishes on 𝓜 and Q̄n coincides with Ln there. It is easy to see that β̂oracle is the unique global minimizer of Ln constrained on 𝓜. So the second inequality in (42) holds.
Next, consider the first inequality in (42). We apply a Taylor expansion to Q̄n(β) − Q̄n(β*) and rearrange the sums as in (34). Then, for some βm that lies on the line segment between β and β*, the difference decomposes into a penalty part J1 and a loss part J2.
We first simplify J1. Note that wj = 0 when τ(j) and τ(j + 1) are in different groups. When τ(j) and τ(j + 1) are in the same Ak, first, β*τ(j+1) − β*τ(j) = 0, and (βτ(j+1) − βτ(j)) − (β*τ(j+1) − β*τ(j)) has the same sign as βτ(j+1) − βτ(j); second, |β̂initial,τ(j+1) − β̂initial,τ(j)| ≤ λn, and hence wj ≥ ρ′(λn) ≥ a0. Combining the above yields
(43)
Next, we simplify J2. Denote z = XT(y − Xβm). Similarly to (36)–(41), we find that, over the event E3, the term wτ(l)(z) can be bounded for any jk ≤ l ≤ jk+1 − 2. From the condition on λn, the sum of the first two terms is upper bounded by a0λn/3 for large n. We choose tn = a0nλn/(3pλmax(XTX)). It follows that
(44)
Combining (43) and (44), over the event E1 ∩ E2 ∩ E3, we conclude that Q̄n(β) − Q̄n(β*) ≥ 0, with strict inequality when β ≠ β*. This proves the first inequality in (42).
Second, we show that over the event E1 ∩ E2 ∩ E3, the LLA algorithm still yields β̂oracle at the second iteration, and therefore it converges to β̂oracle. We have shown that after the first iteration the algorithm outputs β̂oracle, which it then treats as the initial solution for the second iteration. So it suffices to check that the weights recomputed from β̂oracle again yield β̂oracle as the minimizer. This is true because, over the event E1, the recomputed weights satisfy the same properties as those used in the first iteration.
A.3 Proof of Theorem 3.3
It suffices to show that, with probability at least 1 − O(n−α), the ordering of the coordinates of β̂ols is consistent with that of β0. This reduces to showing that ||β̂ols − β0||∞ ≤ bn with probability at least 1 − O(n−α).
From direct calculations, β̂ols = β0 + (XTX)−1XTε. Write β̂ols,j − β0j = ajTε, where aj = X(XTX)−1ej for j = 1, · · ·, p. Then ||aj||2 = ejT(XTX)−1ej. By Condition 3.3 and the union bound, P(||β̂ols − β0||∞ > bn) ≤ 2pn−2α.
Since p = O(nα), 2pn−2α = O(n−α). This completes the proof.
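For intuition, here is a worked version of the final union-bound step, under the assumption (which we cannot verify from this excerpt) that Condition 3.3 provides a tail bound making each term P(|ajTε| > bn) at most 2n−2α:

```latex
\|a_j\|^2 = e_j^\top (X^\top X)^{-1} X^\top X (X^\top X)^{-1} e_j
          = e_j^\top (X^\top X)^{-1} e_j,
\qquad
P\big(\|\hat\beta^{\,ols} - \beta^0\|_\infty > b_n\big)
  \le \sum_{j=1}^{p} P\big(|a_j^\top \varepsilon| > b_n\big)
  \le 2p\,n^{-2\alpha} .
```

Since p = O(nα), the right-hand side is O(n−α), as used in the last line of the proof.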
Footnotes
The supplemental materials contain technical proofs for Theorems 3.5, 4.1, 4.2, 4.3, and Corollary 4.1.
References
- Bondell HD, Reich BJ. Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR. Biometrics. 2008;64:115–123. doi:10.1111/j.1541-0420.2007.00843.x.
- Bühlmann P, van de Geer S. Statistics for High-Dimensional Data. Springer; 2011.
- Chen SS, Donoho DL, Saunders MA. Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing. 1998;20:33–61.
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Lv J. Sure Independence Screening for Ultra-high Dimensional Feature Space. Journal of the Royal Statistical Society, Ser. B. 2008;70:849–911. doi:10.1111/j.1467-9868.2008.00674.x.
- Fan J, Lv J. Nonconcave Penalized Likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi:10.1109/TIT.2011.2158486.
- Fan J, Lv J, Qi L. Sparse High-dimensional Models in Economics. Annual Review of Economics. 2011;3:291–317. doi:10.1146/annurev-economics-061109-080451.
- Fan J, Xue L, Zou H. Strong Oracle Optimality of Folded Concave Penalized Estimation. 2012. Manuscript, available at http://arxiv.org/abs/1210.5992. doi:10.1214/13-aos1198.
- Fred A, Jain AK. Robust Data Clustering. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2003;3:128–136.
- Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise Coordinate Optimization. The Annals of Applied Statistics. 2007;1:302–332.
- Harchaoui Z, Lévy-Leduc C. Multiple Change-point Estimation with a Total Variation Penalty. Journal of the American Statistical Association. 2010;105:1480–1493.
- Huang H-C, Hsu N-J, Theobald DM, Breidt FJ. Spatial Lasso with Applications to GIS Model Selection. Journal of Computational and Graphical Statistics. 2010;19:963–983.
- Kim S, Xing EP. Statistical Estimation of Correlated Genome Associations to a Quantitative Trait Network. PLoS Genetics. 2009;5:e1000587. doi:10.1371/journal.pgen.1000587.
- Kim Y, Choi H, Oh H-S. Smoothly Clipped Absolute Deviation on High Dimensions. Journal of the American Statistical Association. 2008;103:1665–1673.
- Legendre M, Gautheret D. Sequence Determinants in Human Polyadenylation Site Selection. BMC Genomics. 2003;4:7–15. doi:10.1186/1471-2164-4-7.
- Li C, Li H. Variable Selection and Regression Analysis for Graph-structured Covariates with an Application to Genomics. The Annals of Applied Statistics. 2010;4:1498–1516. doi:10.1214/10-AOAS332.
- Liu H, Han H, Li J, Wong L. An in-Silico Method for Prediction of Polyadenylation Signals in Human Sequences. Genome Informatics Series. 2003;14:84–93.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman & Hall; 1989.
- Park MY, Hastie T, Tibshirani R. Averaged Gene Expressions for Regression. Biostatistics. 2007;8:212–227. doi:10.1093/biostatistics/kxl002.
- Shen X, Huang H-C. Grouping Pursuit through a Regularization Solution Surface. Journal of the American Statistical Association. 2010;105:727–739. doi:10.1198/jasa.2010.tm09380.
- Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Ser. B. 1996;58:267–288.
- Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and Smoothness via the Fused Lasso. Journal of the Royal Statistical Society, Ser. B. 2005;67:91–108.
- Wang L, Kim Y, Li R. Calibrating Nonconvex Penalized Regression in Ultra-high Dimension. The Annals of Statistics. (in press). doi:10.1214/13-AOS1159.
- Yang S, Yuan L, Lai Y-C, Shen X, Wonka P, Ye J. Feature Grouping and Selection over an Undirected Graph. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM. 2012. pp. 922–930.
- Yang Y, He X. Bayesian Empirical Likelihood for Quantile Regression. The Annals of Statistics. 2012;40:1102–1131.
- Zhang C-H. Nearly Unbiased Variable Selection under Minimax Concave Penalty. The Annals of Statistics. 2010;38:894–942.
- Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
- Zhu Y, Shen X, Pan W. Simultaneous Grouping Pursuit and Feature Selection over an Undirected Graph. Journal of the American Statistical Association. (in press). doi:10.1080/01621459.2013.770704.
- Zou H, Li R. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models (with discussion). The Annals of Statistics. 2008;36:1509–1566. doi:10.1214/009053607000000802.